
Paper deep dive

Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

Enoch Hyunwook Kang

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 233

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/22/2026, 6:06:05 AM

Summary

The paper provides theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like equilibrium play in infinitely repeated games without explicit post-training. By leveraging Bayesian learning and asymptotic best-response learning, these agents naturally converge to Nash equilibria along realized play paths, even under conditions of unknown stochastic payoffs.

Entities (5)

Nash equilibrium · game-theoretic-concept · 99%
Reasonably reasoning AI agents · agent · 98%
Infinitely repeated games · game-setting · 97%
Asymptotic best-response learning · learning-mechanism · 95%
Bayesian learning · learning-mechanism · 95%

Relation Signals (3)

Reasonably reasoning AI agents achieve Nash equilibrium

confidence 95% · we prove that ‘reasonably reasoning’ agents... eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium

Bayesian learning enables Reasonably reasoning AI agents

confidence 92% · By Bayesian learning, we refer to an agent’s capacity to learn other agents’ strategies

Asymptotic best-response learning enables Reasonably reasoning AI agents

confidence 92% · By asymptotic best-response learning, we mean the agent’s ability to eventually learn an optimal counter-strategy

Cypher Suggestions (2)

Find all agents that achieve Nash equilibrium · confidence 90% · unvalidated

MATCH (a:Agent)-[:ACHIEVES]->(e:Concept {name: 'Nash equilibrium'}) RETURN a.name

Map the relationship between learning mechanisms and agent capabilities · confidence 90% · unvalidated

MATCH (l:LearningMechanism)-[:ENABLES]->(a:Agent) RETURN l.name, a.name

Abstract

AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents' advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that `reasonably reasoning' agents, i.e., agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner's dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)


Full Text

232,421 characters extracted from source content.


Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

Enoch Hyunwook Kang
University of Washington
ehwkang@uw.edu

Abstract

AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents’ advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that ‘reasonably reasoning’ agents, i.e., agents capable of forming beliefs about others’ strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner’s dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.
1 Introduction

Recent advancements integrating Artificial Intelligence (AI) models with sophisticated reasoning and tool-use capabilities have enabled the widespread practical deployment of AI agents across diverse application domains [45]. As AI agents become increasingly integral to interactive systems, a critical and timely challenge arises: determining whether these agents can navigate complex strategic interactions effectively in real-world competitions in digital markets, e.g., automated negotiation, dynamic pricing, and advertising auctions [9, 27, 38, 37, 59, 8]. As AI agents are deployed more broadly in such settings, the central issue is not only whether they can behave strategically, but also whether their strategic interactions will converge to stable, predictable equilibria, and which equilibria such systems will select. This question is not merely theoretical. Recent work by [15] and [22], together with related empirical studies of algorithmic interaction, suggests that autonomous algorithmic and AI systems can generate strategically consequential repeated-game behavior in economically important environments. Pricing algorithms can sustain supra-competitive outcomes without explicit communication, rapid reactive pricing technologies can elevate prices even in competitive equilibrium, and real-world adoption of algorithmic pricing has been associated with higher margins in concentrated markets [11, 6]. On the other hand, empirical evaluations of LLMs reveal that widely used, off-the-shelf AI models (e.g., GPT, Claude, Gemini, Kimi, DeepSeek) as AI agents frequently fail to exhibit predicted equilibrium behavior in strategic interactions and often resort to brittle heuristics or produce inconsistent policies [28, 30, 29, 12].
In practice, simply prompting standard AI models to engage in repeated games often yields strategies that diverge significantly from rational, equilibrium-based play predicted by classical game theory, although some successes have been reported [3]. Such brittleness and inconsistency raise concerns about deploying AI agents in societally crucial domains that require reliable strategic decision-making. One prominent approach to address this limitation is targeted, strategic post-training procedures [44, 18]. However, relying on uniform deployment of such fine-tuning approaches across diverse and independently developed AI agents is often impractical. Consequently, there is a compelling need for assurance that AI agents with some “reasonable” reasoning capabilities will autonomously adapt their strategies and find a stable equilibrium. This critical observation motivates the central research question explored in this paper: Can off-the-shelf reasoning AI agents achieve strategic equilibrium without post-training? In this paper, we theoretically and empirically address this question within the framework of infinitely repeated games, a setting in which agents repeatedly encounter identical strategic scenarios with no predefined endpoint. Specifically, we show that reasoning LLM-based AI agents naturally evolve toward Nash equilibrium along realized play paths, without relying on explicit post-training or specialized fine-tuning procedures. The key to achieving this lies in two basic reasoning capabilities we call “reasonably reasoning” capabilities: Bayesian learning and asymptotic best-response learning. By Bayesian learning, we refer to an agent’s capacity to learn other agents’ strategies from observed historical interactions, thereby enabling a theory-of-mind of others’ future actions.
By asymptotic best-response learning, we mean the agent’s ability to eventually learn an optimal counter-strategy given the inferred beliefs about other agents’ strategies, thereby maximizing its expected payoff. Under such capabilities, which we demonstrate that AI agents possess, we prove that agents eventually exhibit a Nash equilibrium along every realized path in possibly infinitely repeated interactions. Our main theoretical results are heavily rooted in a fundamental result in the Bayesian learning literature [33, 43]: a set of Bayesian learning agents with the ability to exactly best respond to their beliefs about the opponent agents’ strategies, i.e., maximize their expected payoff, eventually exhibit a Nash equilibrium along every realized path in possibly infinitely repeated interactions. The key difference in this paper’s theoretical result is that it allows asymptotic best-response learning rather than assuming that the agent can choose the exact best-response action, i.e., that the agent is an expected-utility maximizer. This is an important relaxation, as off-the-shelf LLM agents are not expected-utility maximizers [55, 24]. Rather, they are stochastic posterior samplers by default (i.e., in the temperature = 1 setup) [5]. We prove that, under mild and realistic assumptions, LLM agents, which are posterior belief samplers, achieve asymptotic best-response learning. We then prove that the fundamental result in Bayesian learning [33, 43], which requires exact best-response capability, can be extended to asymptotic best-response learning. Combined with the recent findings that LLMs are Bayesian in-context learners under stationary, repeated settings [16, 39, 13, 54, 51, 50, 20], we conclude that reasoning LLM agents eventually exhibit a Nash equilibrium along every realized path in possibly infinitely repeated interactions.
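The Bayesian-learning capability described here can be illustrated with a toy posterior update over a finite menu of opponent strategies. The menu labels, the prisoner's-dilemma framing, and the tremble probability `EPS` are illustrative assumptions for this sketch, not the paper's implementation:

```python
# Minimal sketch: Bayesian learning over a finite menu of opponent strategies
# in a repeated prisoner's dilemma. Each hypothesis gives the probability that
# the opponent plays "C", possibly conditioned on my previous action.

EPS = 0.05  # small tremble so no on-path event gets zero probability

def p_cooperate(hypothesis, my_last_action):
    """Probability the opponent plays C under a given strategy hypothesis."""
    if hypothesis == "always_cooperate":
        p = 1.0
    elif hypothesis == "always_defect":
        p = 0.0
    else:  # "tit_for_tat": copy my last action (cooperate on the first round)
        p = 1.0 if my_last_action in (None, "C") else 0.0
    return (1 - EPS) * p + EPS * 0.5  # mix in trembles

def bayes_update(posterior, my_last_action, observed):
    """One step of Bayes' rule after observing the opponent's action."""
    new = {}
    for h, prior in posterior.items():
        p = p_cooperate(h, my_last_action)
        new[h] = prior * (p if observed == "C" else 1 - p)
    z = sum(new.values())
    return {h: w / z for h, w in new.items()}

posterior = {"always_cooperate": 1/3, "always_defect": 1/3, "tit_for_tat": 1/3}
# Suppose we defected last round and then observed the opponent defect:
posterior = bayes_update(posterior, "D", "D")
```

After one such observation, hypotheses consistent with defection (always-defect and tit-for-tat) gain weight at the expense of always-cooperate; iterating the update over the realized history concentrates the posterior in the way the paper's Bayesian-learning condition requires.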
Beyond the benchmark with common-knowledge stage payoffs, we also consider the practically relevant case in which payoffs are not known to agents ex ante and each agent observes only its own privately realized stochastic payoffs. We modify PS-BR to not only sample an opponent-strategy hypothesis, but also sample a hypothesis for the agent’s own mean payoff matrix (equivalently, its own payoff kernel within the known noise family). Under the analogous learning conditions together with an additional asymptotic public-sufficiency assumption on hidden private histories, PS-BR recovers the same asymptotic on-path ε-best-response property and therefore inherits the zero-shot Nash convergence guarantees. This paper is structured as follows. Section 2 discusses related works. Section 3 introduces the setup. Section 4 defines reasonably reasoning agents and relates their Bayesian and best-response learning properties to in-context and test-time inference in language models. Section 5 presents the main zero-shot Nash convergence results. Section 6 discusses how we can extend the zero-shot Nash convergence result for unknown, stochastic payoffs. Then Section 7 provides empirical evidence of the theoretical contributions in this paper.

2 Related works

Bayesian learning. The theoretical analysis of reasonably reasoning agents is based largely on the Bayesian learning literature. Bayesian learning in repeated games is defined by a fundamental tension between the ability to logically learn opponents’ strategies and the ability to respond to them optimally. The foundational possibility result in [33] showed that if players’ prior beliefs contain a “grain of truth” (absolute continuity) regarding the true distribution of play, then standard Bayesian updating guarantees that their predictions will eventually converge to the truth, thereby naturally culminating in a Nash equilibrium.
However, [41, 42] subsequently proved a negative result: requiring players to simultaneously maintain this grain of truth and perfectly best-respond across all possible counterfactual game histories leads to a mathematical contradiction, as the infinite sets of learnable strategies and optimizing strategies are often mutually singular. [43] resolved this tension by introducing “optimizing learnability”, the crucial insight that agents do not need to perfectly learn unreached counterfactuals; they only need to accurately predict and best-respond along the realized path of play. Nonetheless, Norman [43] identified that a stubborn impossibility persists in a specific class of games called M* games, where adversarial payoff geometries prevent learning and optimization from coexisting even on-path. This paper systematically navigates these classic boundaries to guarantee zero-shot Nash convergence for LLM agents. We actively employ [33]’s grain of truth (Assumption 2) to guarantee predictive accuracy via the classic merging of opinions, and avoid [41, 42]’s impossibility by formally adopting the on-path relaxation and non-M* condition in [43]. However, although employing the standard Bayesian learning setup [33, 43] guarantees accurate forecasts of future on-path actions, it does not guarantee posterior concentration, as LLM agents are not expected-utility maximizers but rather posterior belief samplers [5, 55, 24]. To address this, we introduce the finite menu and KL separation condition (Assumption 3), which is necessary to mathematically force the LLM agent’s posterior to concentrate onto a single point mass (Lemma 4.2). By forcing posterior concentration, the LLM agent’s stochastic “predict-then-act” reasoning seamlessly stabilizes into an asymptotic best response.

Strategic capabilities of LLM agents.
As LLMs are increasingly deployed as interactive agents, a growing literature studies whether LLMs behave strategically in canonical games, emphasizing preference representation, belief formation, and (approximate) best responses rather than taking equilibrium play for granted [49, 31]. In one-shot normal-form, bargaining, and negotiation tasks, off-the-shelf models often follow plausible but context-sensitive heuristics: behavior can depart from equilibrium predictions and change markedly under small framing or instruction variations [26, 21, 29]. Strategic performance can improve with model scale and reasoning scaffolds, but the remaining variance across prompts and settings is substantial [32]. These issues become more acute under repeated games, where payoffs depend on stable, history-contingent policies. Multi-agent evaluation benchmarks report large cross-model and cross-game heterogeneity and frequent non-equilibrium dynamics, especially in coordination and social-dilemma regimes [40, 17, 30]. Controlled repeated-game experiments similarly find that cooperation/reciprocity can emerge, but is fragile to opponent choice and to seemingly minor prompt or protocol changes [3, 23, 53]. In market-style repeated settings, recent work further documents collusive or supra-competitive outcomes among LLM agents and highlights sensitivity to communication opportunities and wording choices [22, 2]. Overall, existing results demonstrate meaningful strategic adaptation but do not provide general, zero-shot guarantees that heterogeneous, independently deployed off-the-shelf agents will converge to predictable equilibrium behavior. Our paper targets this gap by identifying two basic theory-of-mind capabilities, Bayesian learning of opponents and asymptotic best-response learning, and proving that, under mild conditions, they imply Nash continuation play along realized paths in repeated games, without requiring explicit post-training or cross-agent coordination. 
LLM agents as Bayesian in-context learners. A growing body of work links in-context learning (ICL), i.e., test-time adaptation that conditions on the history provided in a prompt without parameter updates, to Bayesian inference over latent task hypotheses. In stylized transformer meta-learning settings, [54] argue that transformers trained over a task distribution can implement an implicit Bayesian update and produce posterior-predictive behavior from in-context data; related analyses formalize ICL as (approximate) Bayesian model averaging and study how this view depends on model parameterization and drives generalization [57]. Moving beyond specific constructions, [20] propose a martingale-based perspective that yields diagnostics and theoretical criteria for when an in-context learner’s predictive sequence is consistent with Bayesian updating, while [50] provide a broader meta-learning theory in which ICL is provably equivalent to Bayesian inference with accompanying generalization guarantees. Empirically, LLMs also exhibit meta-adaptation across tasks presented in-context [16], and several abilities that appear “emergent” under scaling can be substantially attributed to improved ICL mechanisms [39]. Complementing these viewpoints, [51] model LLM ICL through a latent-variable lens, where demonstrations act as evidence about an unobserved task variable (clarifying why behavior can be highly sensitive to the specific examples and their ordering), and related results document few-shot in-context adaptation even in low-resource language learning regimes [13].
For agentic and repeated-interaction settings, these Bayesian-ICL perspectives motivate modeling an LLM agent’s use of the interaction transcript as maintaining and updating a posterior over opponent strategies/types; autoregressive generation can then be interpreted as sampling-based decision-making from the induced posterior [56, 52], providing a concrete bridge between in-context learning and belief-based strategic behavior.

Expected utility maximization and best response. Standard learning-in-games analyses often assume agents compute an exact best response to their posterior at every history [33, 43]. This is a poor behavioral model for off-the-shelf LLM agents, whose actions are induced by stochastic decoding and thus implement a distribution over choices rather than a deterministic maximization of expected utility. In probabilistic decision tasks, [55] find systematic belief–decision incoherence, suggesting that elicited probabilities should not be treated as beliefs that the model then perfectly best-responds to. In risky-choice experiments, [24] similarly document substantial departures from expected-utility maximization and large sensitivity to prompting/model type, with behavior better described as noisy sampling. [5] argues that LLMs naturally implement posterior sampling. These results motivate replacing exact best response with a weaker, sampling-compatible notion, e.g., posterior-sampling policies, which are shown to achieve asymptotic best-response performance along the realized path.

3 Setup

3.1 Infinitely repeated game

We study interaction among a finite set of agents I = {1, 2, …, N} in an infinitely repeated (discounted) game with perfect monitoring of actions and common-knowledge stage payoffs.
We define the game as the tuple G = (I, {A_i}_{i∈I}, {u_i}_{i∈I}, {λ_i}_{i∈I}), where:

• I is the finite set of AI agents.
• A_i is the finite set of actions available to agent i.
• A = ∏_{i∈I} A_i is the joint action space; the joint action profile at round t is denoted a^t = (a_1^t, …, a_{|I|}^t) ∈ A, where a_i^t indicates the action of agent i at round t.
• u_i : A → [0, 1] is agent i’s (known) stage-game payoff function.
• λ_i ∈ (0, 1) is the private discount factor used by agent i to value future payoffs.

At each round t = 1, 2, …, each agent i simultaneously chooses an action a_i^t ∈ A_i, forming a joint action profile a^t ∈ A, which is publicly observed. Agent i then receives the stage payoff

u_i(a^t) ∈ [0, 1].   (1)

These stage payoffs induce a standard infinitely repeated game with perfect monitoring of actions. In defining the payoffs {u_i}_{i∈I}, we restrict the set of games considered in this paper using the following standard assumption from the Bayesian learning literature [43]. Intuitively, this excludes games without a pure-strategy equilibrium, e.g., rock–paper–scissors; rigorously, it rules out the pathological class in which on-path learning cannot be patched into nearby Nash behavior.

Assumption 1 (Non-M⋆ game [43]). Consider the infinitely repeated game induced by the true stage payoffs {u_i}_{i∈I} in equation (1). For each player i, define the stage-game minmax payoff and pure-action maxmin payoff as

φ_i := min_{σ_{-i} ∈ Δ(A_{-i})} max_{σ_i ∈ Δ(A_i)} u_i(σ_i, σ_{-i}),   Φ_i⋆ := max_{a_i ∈ A_i} min_{a_{-i} ∈ BR_{-i}(a_i)} u_i(a_i, a_{-i}),

where BR_{-i}(a_i) denotes the set of opponents’ (joint) best responses to a_i in the stage game. We say that the stage game is M⋆ if Φ_i⋆ < φ_i for every i. We assume the stage game is not M⋆ (equivalently, Φ_i⋆ ≥ φ_i holds for some i).
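As a concrete illustration of this setup, the following minimal sketch simulates the repeated-game protocol for two agents. The prisoner's-dilemma payoffs rescaled into [0, 1], the particular strategies, and the truncation of the infinite horizon are assumptions made for the example, not part of the paper:

```python
# Minimal sketch of the repeated-game protocol: two agents, known stage
# payoffs in [0, 1], private discount factors, perfect monitoring.

# U[my_action][their_action] = my stage payoff, rescaled into [0, 1].
U = {
    "C": {"C": 0.6, "D": 0.0},
    "D": {"C": 1.0, "D": 0.2},
}

def play_repeated_game(strategy_1, strategy_2, rounds, lam=(0.95, 0.95)):
    """Run the protocol for `rounds` periods; return normalized discounted payoffs.

    The infinite sum is truncated, so each total is off by at most lam_i**rounds.
    """
    history = []  # publicly observed joint action profiles
    totals = [0.0, 0.0]
    for t in range(rounds):
        a1, a2 = strategy_1(history), strategy_2(history)
        history.append((a1, a2))  # perfect monitoring: both agents see (a1, a2)
        totals[0] += (1 - lam[0]) * lam[0] ** t * U[a1][a2]
        totals[1] += (1 - lam[1]) * lam[1] ** t * U[a2][a1]
    return totals

grim = lambda h: "C" if all(p == ("C", "C") for p in h) else "D"
payoffs = play_repeated_game(grim, lambda h: "D", rounds=200)
```

Because stage payoffs lie in [0, 1] and the (1 − λ_i) normalization is applied, the truncated totals stay in [0, 1], matching the normalization used for U_i(f) later in the section.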
3.2 Strategy

We define the joint action history at round t as h^t = (a^1, a^2, …, a^{t-1}), with

H^t = { (a^1, a^2, …, a^{t-1}) : a^s ∈ A for s ≤ t−1 }.

Let H^0 := ∅ denote the empty history, and denote the complete set of possible histories by H = ⋃_{t≥0} H^t. (Throughout this paper, we allow AI agents’ strategies to have bounded memory.)

Definition 1 (Strategy). A strategy for agent i is a function f_i : H → Δ(A_i), which maps every joint action history to a distribution over agent i’s actions A_i. Let F_i denote the space of all strategies of agent i. A strategy profile is a tuple f = (f_1, …, f_N) ∈ F = ∏_{i∈I} F_i.

Let H^∞ denote the space of infinite play paths, i.e., H^∞ = { (a^1, a^2, …) : a^t ∈ A for all t ∈ ℕ }.

Definition 2 (Play-path distribution). A strategy profile f = (f_1, …, f_N) induces a unique probability distribution μ^f over H^∞ (the play-path distribution), defined on cylinder sets by

μ^f(C(a^1, …, a^t)) := ∏_{s=1}^{t} ∏_{i∈I} f_i(h^s)(a_i^s),

where h^s = (a^1, …, a^{s-1}) and C(h) := { z ∈ H^∞ : z = (h, …) }. By Kolmogorov’s extension theorem [19], these finite-dimensional probabilities define a unique probability measure μ^f on (H^∞, B), where B is the product σ-algebra.

For the upcoming discussions, we fix some notation. Given a fixed history h^t, for any continuation profile g (i.e., a profile that specifies play after histories extending h^t), let μ^g_{h^t} denote the induced distribution on H^∞ over the future joint-action sequence (a^t, a^{t+1}, …) when play starts at history h^t and follows g thereafter. Formally, we identify the tail (a^t, a^{t+1}, …) with y ∈ H^∞ by setting y^1 = a^t, y^2 = a^{t+1}, and so on, and regard μ^g_{h^t} as a measure on this reindexed space.
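Definition 2's cylinder-set formula can be checked directly in a few lines; the two agents who mix 50/50 at every history are an illustrative assumption for this sketch:

```python
# Sketch of Definition 2's cylinder-set probability:
# mu^f(C(a^1, ..., a^t)) = product over rounds s and agents i of f_i(h^s)(a_i^s).

def cylinder_prob(strategies, prefix):
    """Probability that play starts with the given finite joint-action prefix."""
    p, history = 1.0, []
    for joint in prefix:
        for i, f_i in enumerate(strategies):
            p *= f_i(history)[joint[i]]  # f_i(h^s): a dist over agent i's actions
        history.append(joint)
    return p

# Two agents who each mix 50/50 at every history:
uniform = lambda h: {"C": 0.5, "D": 0.5}
p = cylinder_prob([uniform, uniform], [("C", "C"), ("D", "C")])  # 0.5 ** 4
```

Summing `cylinder_prob` over all prefixes of a fixed length gives 1, which is exactly the finite-dimensional consistency that Kolmogorov's extension theorem needs to produce the unique measure μ^f.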
For a full profile g ∈ F, we write μ^g_{h^t} for the continuation distribution induced by its restriction g|_{h^t}. If μ^g(C(h^t)) > 0, then μ^g_{h^t} coincides with the conditional distribution μ^g(· | h^t).

3.3 Beliefs

Each agent i acts under uncertainty regarding the opponents’ future play f_{-i}. The agent maintains subjective beliefs over opponents’ strategies and updates them as the game unfolds.

Behavioral representatives (belief-equivalent behavior strategies). Fix player i and a (possibly mixed) belief μ_i over opponents’ strategy profiles F_{-i}. For any own strategy g_i ∈ F_i, μ_i induces a predictive distribution over play paths

P_i^{μ_i, g_i}(E) := ∫_{F_{-i}} μ^{(g_i, f_{-i})}(E) dμ_i(f_{-i})   for measurable E ⊆ H^∞.

By Kuhn’s theorem [35] and Aumann’s extension to infinite extensive-form games [7], there exists a behavior-strategy profile f̄_{-i} ∈ F_{-i} such that for every g_i,

μ^{(g_i, f̄_{-i})} = P_i^{μ_i, g_i}.

We call any such f̄_{-i} a behavioral representative (or belief-equivalent profile) of μ_i [35, 7, 33]. When μ_i has finite support {g_{-i}^1, …, g_{-i}^K}, one convenient choice is

f̄_{-i}(h)(a_{-i}) = ∑_{k=1}^{K} μ_i(g_{-i}^k | h) g_{-i}^k(h)(a_{-i}),

for histories h where Bayes’ rule is defined.

Prior and posterior predictive beliefs. Agent i holds a subjective prior μ_i^0 over F_{-i}. Write P_i^{0, g_i} := P_i^{μ_i^0, g_i} for the induced prior predictive distribution. As discussed above (and as used explicitly in [33]), there exists a behavioral representative f_{-i}^i ∈ F_{-i} such that, for every g_i, μ^{(g_i, f_{-i}^i)} = P_i^{0, g_i}. We fix such an f_{-i}^i and call it agent i’s subjective expectation of opponents’ play. At any history h^t where Bayes’ rule is defined, μ_i^0 yields a posterior μ_i^t(· | h^t) and a posterior predictive continuation belief.
Let f_{-i}^{i,t} denote any behavioral representative of this posterior predictive belief. As a standing convention, we take these representatives to be chosen consistently by continuation:

f_{-i}^{i,t}|_{h^t} := f_{-i}^i|_{h^t},

i.e., the time-t posterior predictive continuation is represented by the restriction of the fixed belief-equivalent profile f_{-i}^i to histories extending h^t.

3.4 Subjective utility and Nash equilibrium

Subjective Expected Utility. An agent evaluates the optimality of a continuation strategy based on their subjective beliefs at a given history. Fix a history h^t and let σ_i ∈ F_i(h^t) be a continuation strategy for agent i from h^t onward. For any opponents’ continuation profile g_{-i}, denote by μ^{(σ_i, g_{-i})}_{h^t} the induced distribution over future play paths when play starts at h^t and follows (σ_i, g_{-i}) thereafter. Following the standard literature [34], we define the belief-explicit subjective expected utility of playing σ_i starting at h^t as

V_i(σ_i | h^t; g_{-i}) = E_{y ∼ μ^{(σ_i, g_{-i})}_{h^t}}[(1 − λ_i) ∑_{k=0}^{∞} λ_i^k u_i(y^{k+1})],   (2)

where y = (y^1, y^2, …) represents the future path of joint actions relative to time t, with y^{k+1} denoting the joint action at step k+1 of this future path (i.e., at absolute time t+k). When g_{-i} = f_{-i}^{i,t}, we write

V_i(σ_i | h^t) := V_i(σ_i | h^t; f_{-i}^{i,t}).   (3)

For any belief about opponents’ continuation play g_{-i} at history h^t, we define the set of ε-best-response continuation strategies for agent i at h^t as

BR_i^ε(g_{-i} | h^t) = { σ_i ∈ F_i(h^t) : V_i(σ_i | h^t; g_{-i}) ≥ sup_{σ_i′ ∈ F_i(h^t)} V_i(σ_i′ | h^t; g_{-i}) − ε }.

Nash equilibrium.
The true performance of a strategy profile f ∈ F for agent i is given by

U_i(f) = E_{z ∼ μ^f}[(1 − λ_i) ∑_{t=1}^{∞} λ_i^{t−1} u_i(z^t)],

where z^t ∈ A is the joint action at round t, and λ_i ∈ (0, 1) is agent i’s discount factor. The factor (1 − λ_i) is a normalization ensuring that U_i(f) ∈ [0, 1] whenever u_i(a) ∈ [0, 1] for all a ∈ A.

Definition 3 (ε-Nash equilibrium). A strategy profile f = (f_1, …, f_N) ∈ F is an ε-Nash equilibrium if, for every agent i ∈ I,

U_i(f) ≥ sup_{f_i′ ∈ F_i} U_i(f_i′, f_{-i}) − ε.

4 Reasonably Reasoning Agents

As discussed earlier, one of the key ideas of this work is that reasoning LLM-based AI agents are fundamentally “reasonably reasoning” agents. In this section, we formally define the class of reasonably reasoning agents, and then demonstrate why reasoning-LLM agents are naturally reasonably reasoning agents. The definition isolates two ingredients: (i) Bayesian learning and (ii) an on-path, asymptotic notion of ε-consistency.

Definition 4 (Reasonably Reasoning Agent). Fix a repeated game and a strategy profile f = (f_i)_{i∈I} generating the objective play-path distribution μ^f (Definition 2). Player i is a Reasonably Reasoning (R) agent if the following hold.

• Bayesian learning: Player i has a prior μ_i^0 over opponents’ strategy profiles F_{-i} and forms posteriors (μ_i^t)_{t≥0} by Bayes’ rule. Let f_{-i}^{i,t} denote any behavioral representative of player i’s posterior predictive continuation belief at history h^t (as in Section 3.3), so that for every continuation strategy σ_i, V_i(σ_i | h^t) = V_i(σ_i | h^t; f_{-i}^{i,t}).

• Asymptotic ε-consistency on-path: For every ε > 0,

μ^f({ z : ∃ T_i(z, ε) < ∞ s.t. ∀ t ≥ T_i(z, ε), f_i|_{h^t(z)} ∈ BR_i^ε(f_{-i}^{i,t}|_{h^t(z)} | h^t(z)) }) = 1.
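A hedged sketch of checking Definition 3, with two caveats: the supremum over all deviations is approximated by a finite menu (so the check is only a necessary condition), and U_i is estimated by truncated roll-out, whose truncation error is at most λ^horizon because stage payoffs lie in [0, 1]. The payoff matrix and strategies are illustrative assumptions:

```python
# Sketch: epsilon-Nash check (Definition 3) against a finite deviation menu.

# Symmetric prisoner's dilemma with stage payoffs rescaled into [0, 1];
# U1[(a1, a2)] is player 1's payoff, player 2's is U1[(a2, a1)].
U1 = {("C", "C"): 0.6, ("C", "D"): 0.0, ("D", "C"): 1.0, ("D", "D"): 0.2}
u = (lambda a: U1[a], lambda a: U1[(a[1], a[0])])

def U_i(i, f1, f2, lam=0.95, horizon=400):
    """Truncated normalized discounted utility; error is at most lam**horizon."""
    h, v = [], 0.0
    for t in range(horizon):
        a = (f1(h), f2(h))
        h.append(a)
        v += (1 - lam) * lam ** t * u[i](a)
    return v

def eps_nash_on_menu(f1, f2, menu, eps):
    """Necessary condition for an eps-Nash profile: no menu deviation gains > eps."""
    ok1 = U_i(0, f1, f2) >= max(U_i(0, d, f2) for d in menu) - eps
    ok2 = U_i(1, f1, f2) >= max(U_i(1, f1, d) for d in menu) - eps
    return ok1 and ok2

grim = lambda h: "C" if all(p == ("C", "C") for p in h) else "D"
menu = [grim, lambda h: "D", lambda h: "C"]
```

With these payoffs and λ = 0.95, mutual grim-trigger survives the check (a one-shot defection gain of 0.4 is outweighed by the drop from 0.6 to 0.2 thereafter), while unconditional cooperation against a defector does not.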
Intuitively, the “Bayesian learning” condition ensures that agents update their strategic beliefs coherently given observations. The “asymptotic ε-consistency” condition captures the idea that, after a possibly long initial stumbling phase, agents eventually learn to play (approximately) optimal continuation strategies relative to their own beliefs along the realized path of play. It generalizes Norman’s ε-consistency [43], which requires ε-best responding at all times (not only eventually) on a full-measure set of paths. This generalization is critical, as LLM-based AI agents are not expected-utility maximizers but rather posterior belief samplers [5, 55, 24].

4.1 Bayesian learning

The Bayesian-learning component of Definition 4 does not require an agent to explicitly store a symbolic prior over the full (and typically infinite-dimensional) strategy space F_{-i}. Instead, what matters for decision-making is that, after observing a public history h^t, the agent induces a coherent posterior predictive distribution over opponents’ continuation play. In repeated interaction, the latent object of inference is not merely the opponents’ next-period action, but their repeated-game strategy: a reaction rule mapping histories to action distributions. While realized actions vary with the evolving public history, the underlying reaction rule is time-invariant; learning is therefore best understood as refining uncertainty about that rule (and, crucially, about its predictive implications for future play). Formally, let μ_i^0 denote player i’s subjective prior over opponents’ strategy profiles F_{-i}, and let μ_i^t(· | h^t) denote the posterior obtained by Bayes’ rule after history h^t whenever it is defined. The continuation problem depends on μ_i^t only through the induced posterior predictive distribution over future play, because continuation values are computed by integrating payoffs against that predictive distribution.
Following [33], we represent player i’s posterior predictive continuation belief by a behavioral profile f_{-i}^{i,t}, chosen (without loss of generality) so that along the realized history h^t(z),

f_{-i}^{i,t}|_{h^t(z)} ≡ f_{-i}^i|_{h^t(z)},   (4)

where f_{-i}^i is a fixed belief-equivalent profile representing player i’s prior predictive distribution as in Section 3. Thus, the continuation of a single belief-equivalent behavioral profile can be taken to match the time-t posterior predictive continuation belief along the realized path. To guarantee that Bayesian updating is well-defined and that predictive beliefs can converge to the truth on-path, we impose the standard grain-of-truth condition.

Assumption 2 (Grain of truth [33]). For each player i, the objective play-path distribution μ^f is absolutely continuous with respect to i’s prior predictive distribution under f_i, i.e., μ^f ≪ P_i^{0, f_i}. Equivalently, any event that player i assigns zero probability under their prior predictive model has zero probability under the true play distribution induced by f.

Under Assumption 2, classical merging-of-opinions results [10] imply that player i’s posterior predictive continuation beliefs become accurate along μ^f-almost every realized play path. We formalize this later by showing that absolute continuity implies strong path prediction (Lemma 5.1).

4.2 LLM agents are Bayesian learning agents

The Bayesian-learning abstraction above matches what we can operationally observe from LLM agents: history-conditioned predictive distributions. An LLM, when prompted with the game rules and the realized interaction history, induces a conditional distribution over next tokens, which can be arranged to correspond to a distribution over a discrete label for an opponent strategy. This “as if Bayesian” framing is appropriate for two reasons.
First, the technical apparatus in Section 3 already works at the level of predictive distributions: given any coherent family of history-conditioned forecasts, we may represent it by an equivalent belief over opponents' strategies via the behavioral representatives $f_{-i}^{i,t}$ (and, in particular, by a fixed belief-equivalent profile $f_{-i}^{i}$ whose continuation matches posteriors along realized histories as in (4)). Second, recent theory and empirical evidence indicate that AI agents, most of which are auto-regressive LLMs, can implement Bayesian or approximately Bayesian in-context learning in repeated, stationary environments [54, 57, 20, 50]. Interpreting the prompt history as data and the model's induced distribution as a posterior predictive therefore provides a principled bridge between LLM behavior and Bayesian-learning agents in repeated games. Finally, Assumption 2 should be understood as a modeling requirement on the LLM agent's support: the agent's predictive model should not rule out (assign zero probability to) events that can actually occur under the true interaction induced by $f$. In practice, this corresponds to ensuring that the agent's elicited beliefs (or the menu used to elicit them) are sufficiently expressive and include mild stochasticity/trembles so that no on-path event receives zero predicted probability.

4.3 LLM agents achieve asymptotic $\varepsilon$-consistency

In LLM agents, the output mechanism is mediated by stochastic decoding. Even holding the prompt fixed, a standard LLM induces a distribution over outputs rather than a deterministic argmax rule. Empirically, LLMs exhibit substantial decision noise and can violate the coherence one would expect if they were consistently computing expected-utility-maximizing best responses to elicited beliefs [55, 24]. Rather, LLM agents are posterior samplers, which sample an output from their internal posterior belief over the output space [5, 14].
This creates a methodological tension for our purposes, as the Bayesian-learning literature's Nash-equilibrium convergence arguments require a best-response property (e.g., [33, 43]). The goal of this subsection is to reconcile the two: we formalize a minimal "predict-then-act" rule that is faithful to sampling-based LLM behavior yet still sufficient to guarantee asymptotic $\varepsilon$-best-response learning on the realized play path.

LLMs naturally induce posterior-sampling best response (PS-BR). Reasoning LLM-based AI agents are naturally scaffolded first to infer the situation from the previous interactions and then respond optimally to that inferred model (a theory-of-mind "infer, then respond" [58, 47]). We formalize this behavior as posterior-sampling best response (PS-BR): sample a hypothesis about the opponent from the current posterior, then best respond to that sampled hypothesis.

Definition 5 (Posterior-sampling best response (PS-BR)). Fix player $i$ and a history $h^t$. Given the posterior $\mu_i^t(\cdot \mid h^t)$ over opponents' strategy profiles, PS-BR chooses a continuation strategy by:
1. sampling $\tilde f_{-i} \sim \mu_i^t(\cdot \mid h^t)$;
2. playing any best response $\sigma_i \in \mathrm{BR}_i(\tilde f_{-i} \mid h^t)$ in the continuation game after $h^t$.
Denote the resulting (randomized) continuation strategy by $\sigma_{i,t}^{PS}(\cdot \mid h^t)$.

Here, step 1, "sample $\tilde f_{-i} \sim \mu_i^t(\cdot \mid h^t)$", is simply querying an LLM (under its default temperature $\tau = 1$ setup) to output an opponent-strategy label from the LLM's conditional distribution over allowed labels given the previous interaction history. Step 2 is instantiated by evaluating a finite set of candidate self-strategies against the sampled opponent strategy via roll-out, and selecting the value-maximizing candidate. For implementation details used for the experiments, see Appendix D.
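The two PS-BR steps can be sketched as a single decision function. This is a minimal illustration, not the paper's implementation (see Appendix D for that): the LLM's conditional distribution over strategy labels is replaced by an explicit probability vector, and the roll-out evaluation is abstracted as a caller-supplied value function; all names here are ours.

```python
import random

def ps_br(posterior, menu, candidates, rollout_value, rng=random):
    """Posterior-sampling best response (Definition 5), one decision.

    posterior:     dict label -> probability, standing in for the LLM's
                   conditional distribution over opponent-strategy labels.
    menu:          dict label -> opponent-strategy object.
    candidates:    finite list of candidate self-strategies.
    rollout_value: function (self_strategy, opponent_strategy) -> estimated
                   discounted continuation value (step 2's roll-out).
    """
    # Step 1: sample an opponent-strategy hypothesis from the posterior.
    labels = list(posterior)
    weights = [posterior[l] for l in labels]
    sampled = menu[rng.choices(labels, weights=weights)[0]]
    # Step 2: best respond to the sampled hypothesis over the candidate set.
    return max(candidates, key=lambda s: rollout_value(s, sampled))
```

With a degenerate posterior the rule reduces to a deterministic best response, which is the limiting behavior that Lemma 4.2 below makes inevitable on-path.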
Because PS-BR best responds to a single draw $\tilde f_{-i}$ rather than to the posterior predictive continuation $f_{-i}^{i,t}$, it can be suboptimal while the posterior remains dispersed: different posterior samples can induce different best responses, producing unstable play and potentially persistent deviations from best-response optimality. The key observation is that this suboptimality is entirely driven by posterior dispersion. The next lemma makes this quantitative by upper-bounding the best-response gap by a simple collision statistic of the posterior.

Lemma 4.1 (PS-BR is a $D_i^t$-best response). Fix player $i$ and a history $h^t$. Suppose $\mu_i^t(\cdot \mid h^t)$ is supported on a finite set $S_{-i}$ and write $p_t(g_{-i}) := \mu_i^t(g_{-i} \mid h^t)$, $g_{-i} \in S_{-i}$. Define the posterior collision complement
$$D_i^t(h^t) := 1 - \sum_{g_{-i} \in S_{-i}} p_t(g_{-i})^2 = \Pr_{\tilde g, \tilde g' \sim \mu_i^t(\cdot \mid h^t)}\big[\tilde g \neq \tilde g'\big].$$
Let $\sigma_{i,t}^{PS}(\cdot \mid h^t)$ be PS-BR at $h^t$. Then
$$V_i(\sigma_{i,t}^{PS} \mid h^t) \ \geq\ \sup_{\sigma_i} V_i(\sigma_i \mid h^t) - D_i^t(h^t).$$
Equivalently, $\sigma_{i,t}^{PS}(\cdot \mid h^t) \in \mathrm{BR}_i^{D_i^t(h^t)}\big(f_{-i}^{i,t}\big|_{h^t} \mid h^t\big)$.

The statistic $D_i^t(h^t) = 1 - \|p_t\|_2^2$ is $0$ exactly when the posterior is degenerate (a point mass) and is close to $1$ when the posterior is highly spread out. Thus Lemma 4.1 says: PS-BR is an approximate best response to the agent's posterior predictive belief, with an approximation error equal to the probability that two independent posterior samples disagree. To obtain R's asymptotic $\varepsilon$-consistency, it suffices (by Lemma 4.1) to ensure that $D_i^t(h^t(z)) \to 0$ along $\mu^f$-almost every realized path $z$. Intuitively, we need the agent's posterior to concentrate so that posterior sampling becomes (asymptotically) deterministic.
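The collision complement is directly computable from the posterior weights. A small sketch (with an illustrative posterior of our own choosing) checks the two equivalent forms in Lemma 4.1's display: the closed form $1 - \|p_t\|_2^2$ and a Monte Carlo estimate of the probability that two independent posterior samples disagree.

```python
import random

def collision_complement(p):
    """D = 1 - sum_g p(g)^2: probability two independent draws from p differ."""
    return 1.0 - sum(w * w for w in p.values())

def mc_disagreement(p, n=100_000, rng=None):
    """Monte Carlo estimate of P[g != g'] for two independent posterior samples."""
    rng = rng or random.Random(0)
    labels, weights = list(p), list(p.values())
    hits = sum(rng.choices(labels, weights)[0] != rng.choices(labels, weights)[0]
               for _ in range(n))
    return hits / n

# Illustrative posterior over three opponent-strategy labels.
post = {"tit_for_tat": 0.7, "always_defect": 0.2, "grim": 0.1}
d_exact = collision_complement(post)  # 1 - (0.49 + 0.04 + 0.01) = 0.46
```

A point-mass posterior gives $D = 0$ (PS-BR is then an exact best response), while a uniform posterior over $K$ labels gives $D = 1 - 1/K$.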
In general repeated games, full posterior concentration over an unrestricted strategy space is too much to ask (and is closely related to classic impossibility phenomena; see [41, 42]). We therefore impose a standard restriction that is also natural from an LLM-agent implementation perspective: the agent maintains a finite menu of opponent-strategy hypotheses and updates a posterior over that menu [4, 25]. In addition, we require an on-path KL-separation condition ensuring that incorrect hypotheses are detectably different from the true strategy along the realized play path. This is exactly what makes posterior concentration (and hence vanishing sampling error) mathematically inevitable.

Assumption 3 (Finite menu and KL separation). Fix player $i$. Assume the support of $\mu_i^0$ is finite; write $S_{-i} := \mathrm{supp}(\mu_i^0) \subseteq \mathcal{F}_{-i}$. Assume:
1. (Menu grain of truth) $f_{-i} \in S_{-i}$ and $\mu_i^0(f_{-i}) > 0$.
2. (Caution / uniform positivity) There exists $\nu \in (0,1)$ such that for every $g_{-i} \in S_{-i}$, every history $h$, and every $a_{-i} \in A_{-i}$, $g_{-i}(h)(a_{-i}) \geq \nu$.
3. (On-path KL separation) For every $g_{-i} \in S_{-i} \setminus \{f_{-i}\}$ there exists $\kappa_i(g_{-i}) > 0$ such that, $\mu^f$-a.s. in $z$,
$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(f_{-i}(h^t(z)) \,\big\|\, g_{-i}(h^t(z))\big) \ \geq\ \kappa_i(g_{-i}),$$
where for distributions $p, q \in \Delta(A_{-i})$, $D_{\mathrm{KL}}(p \| q) := \sum_{a_{-i} \in A_{-i}} p(a_{-i}) \log \frac{p(a_{-i})}{q(a_{-i})}$.

Assumption 3 is directly implementable in an LLM-agent pipeline: the menu $S_{-i}$ is a finite library of opponent-strategy templates, "caution" can be enforced by adding an arbitrarily small tremble (to avoid zero likelihoods), and KL separation is an identifiability condition stating that wrong templates are distinguishable from the truth along the realized interaction history (the only history that matters for on-path learning).
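All three ingredients of Assumption 3 are mechanically checkable in such a pipeline. The following toy sketch (our own illustration, assuming i.i.d. opponent play from constant two-action templates) enforces caution by mixing each template with the uniform distribution, computes the KL gaps, and runs Bayes updates over the menu; the posterior concentrates on the true template, which is the mechanism behind Lemma 4.2 below.

```python
import math, random

def tremble(dist, delta):
    """Enforce 'caution': mix with uniform so every action has mass >= delta/|A|."""
    n = len(dist)
    return {a: (1 - delta) * p + delta / n for a, p in dist.items()}

def kl(p, q):
    """D_KL(p || q) on a common finite action set (finite once q is cautious)."""
    return sum(pa * math.log(pa / q[a]) for a, pa in p.items() if pa > 0)

# Finite menu of opponent-strategy templates (here: constant action dists).
menu = {g: tremble(d, 0.02) for g, d in
        {"mostly_C": {"C": 0.8, "D": 0.2},
         "mostly_D": {"C": 0.2, "D": 0.8},
         "uniform":  {"C": 0.5, "D": 0.5}}.items()}
truth = "mostly_C"
seps = {g: kl(menu[truth], menu[g]) for g in menu if g != truth}  # all > 0

# Bayes updating over the menu: the posterior concentrates on the truth.
rng = random.Random(0)
post = {g: 1 / len(menu) for g in menu}
for _ in range(500):
    a = rng.choices(["C", "D"], [menu[truth]["C"], menu[truth]["D"]])[0]
    post = {g: post[g] * menu[g][a] for g in menu}   # multiply by likelihoods
    z = sum(post.values())
    post = {g: w / z for g, w in post.items()}       # renormalize

D = 1.0 - sum(w * w for w in post.values())  # PS-BR sampling error (Lemma 4.1)
```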
Under Assumption 3, standard likelihood-ratio arguments yield posterior concentration on the true hypothesis.

Lemma 4.2 (Posterior concentration under KL separation). Fix player $i$ and suppose Assumption 3 holds for $i$. Then, $\mu^f$-a.s. in $z$,
$$\mu_i^t(f_{-i} \mid h^t(z)) \longrightarrow 1, \quad\text{and hence}\quad \max_{g_{-i} \in S_{-i} \setminus \{f_{-i}\}} \mu_i^t(g_{-i} \mid h^t(z)) \longrightarrow 0.$$

Lemma 4.2 implies $D_i^t(h^t(z)) \to 0$ on-path, and Lemma 4.1 then upgrades PS-BR from a dispersion-dependent approximation to an eventual $\varepsilon$-best-response rule.

Proposition 4.3 (PS-BR implies asymptotic $\varepsilon$-consistency). Fix player $i$. Suppose player $i$ uses PS-BR at every history and Assumption 3 holds for $i$. Then player $i$ satisfies the asymptotic $\varepsilon$-consistency requirement in Definition 4.

This proposition is the formal resolution of the "LLMs are stochastic samplers" issue: standard sampling-based decoding (temperature $\tau \simeq 1$) induces randomness that prevents exact best-response optimality at any fixed time, but if the agent's posterior over a finite, identifiable hypothesis menu concentrates, then the induced sampling randomness becomes asymptotically negligible. Consequently, the agent's behavior converges (on-path) to $\varepsilon$-best-response play relative to its (accurate) predictive beliefs, which is exactly the R requirement needed for the zero-shot Nash convergence results in Section 5. The proofs of Lemmas 4.1–4.2 and Proposition 4.3 are deferred to Appendix B.

5 Zero-shot Nash convergence

We now show that the reasonably reasoning agents defined in Section 4, together with a learnability condition on beliefs, generate play that is eventually weakly close to Nash equilibrium play along the realized path. The argument follows the weak-subjective-equilibrium framework in [43], adapted to the LLM-agent-specific setup discussed in Section 4, i.e., (i) asymptotic (on-path) $\varepsilon$-consistency and (ii) the finite-menu KL separation used to verify the learnability condition.
5.1 Weak subjective equilibrium

We work with the standard weak distance on play-path distributions. Let $\mathcal{B}^t$ be the $\sigma$-algebra generated by cylinder events of length $t$.

Definition 6 (Weak distance). For probability measures $\mu, \nu$ over infinite play paths, define
$$d(\mu, \nu) := \sum_{t=1}^{\infty} 2^{-t} \sup_{E \in \mathcal{B}^t} |\mu(E) - \nu(E)|.$$
For a history $h^t$ with $\mu(C(h^t)) > 0$ and $\nu(C(h^t)) > 0$, define the conditional (continuation) weak distance
$$d_{h^t}(\mu, \nu) := d\big(\mu(\cdot \mid C(h^t)),\ \nu(\cdot \mid C(h^t))\big).$$

We use the weak distance to compare continuations of play after a realized history.

Definition 7 (Weak similarity in continuation). Fix a history $h^t$. Two profiles $f$ and $g$ are $\eta$-weakly similar in continuation after $h^t$ if $d_{h^t}(\mu^f, \mu^g) \leq \eta$.

Weak subjective equilibrium is Norman's key intermediate notion: players best respond (up to $\xi$) to their subjective model, and their subjective model is weakly close (within $\eta$) to the objective continuation distribution.

Definition 8 (Weak subjective equilibrium [43]). Fix $\xi, \eta \geq 0$ and a history $h^t$. A continuation profile $f|_{h^t}$ is a weak $\xi$-subjective $\eta$-equilibrium after $h^t$ if for every player $i$ there exists a supporting profile $f^i = (f_i, f_{-i}^i)$ such that:
1. (Subjective best response) $f_i|_{h^t} \in \mathrm{BR}_i^{\xi}\big(f_{-i}^i|_{h^t} \mid h^t\big)$, where payoffs are evaluated under $\mu^{f^i}$.
2. (Weak predictive accuracy) $d_{h^t}(\mu^f, \mu^{f^i}) \leq \eta$.

Definition 9 (Learns to predict the path of play (strong)). Player $i$ learns to predict the path of play under $f$ if for every $\eta > 0$,
$$\mu^f\big(z : \exists\, T_i(z, \eta) < \infty \text{ s.t. } \forall\, t \geq T_i(z, \eta),\ d_{h^t(z)}(\mu^f, \mu^{f^i}) \leq \eta\big) = 1,$$
where $f^i = (f_i, f_{-i}^i)$ is a supporting (belief-equivalent) profile for player $i$ (as in Section 3).

Remark 1 (Connection to optimizing learnability).
A longstanding challenge for Bayesian learning in games is the inconsistency result of [41, 42], which shows that requiring an agent to learn and best respond on all possible continuation paths is often mathematically impossible. However, [43] resolved this by introducing optimizing learnability: the insight that agents need only learn the continuation play along the realized paths generated by their optimizing choices. Our R definition naturally instantiates Norman's insight: Definition 4 and Definition 9 require $\varepsilon$-consistency and predictive accuracy only $\mu^f$-almost surely (i.e., only on the realized, optimizing play path). Therefore, the on-path merging of opinions guaranteed by [10] is entirely sufficient for zero-shot Nash convergence, bypassing Nachbar's impossibility. Crucially, while our agent's specific decision rule (PS-BR) requires finite menus and KL separation to guarantee the optimality of actions (asymptotic $\varepsilon$-consistency, Section 4), the learning of the true path (strong path prediction) relies purely on the absolute continuity of beliefs. It does not require the posterior to concentrate; it can be verified directly from Assumption 2 via the classic merging-of-opinions result. Lemma 5.1 formalizes this idea.

Lemma 5.1 (Absolute continuity implies strong path prediction). Fix player $i$. Suppose the objective play-path distribution $\mu^f$ is absolutely continuous with respect to player $i$'s prior predictive distribution $P_i^{0,f^i}$ (Assumption 2). Then player $i$ learns to predict the path of play under $f$ in the sense of Definition 9. The proof is deferred to Appendix B.

5.2 From learning to zero-shot Nash convergence

We first show that asymptotic $\varepsilon$-consistency, together with strong prediction, implies that the realized continuation play is eventually a weak subjective equilibrium.

Proposition 5.2. Suppose each player $i$ is R (Definition 4) and learns to predict the path of play under $f$ (Definition 9).
Then for any $\xi > 0$ and $\eta > 0$,
$$\mu^f\big(z : \exists\, T(z) < \infty \text{ s.t. } \forall\, t \geq T(z),\ f|_{h^t(z)} \text{ is a weak } \xi\text{-subjective } \eta\text{-equilibrium after } h^t(z)\big) = 1.$$

Finally, we convert a weak subjective equilibrium into proximity to a Nash equilibrium.

Theorem 5.3 (Zero-shot Nash convergence along realized play). Suppose every player $i$ is R and learns to predict the path of play under $f$. Assume the grain-of-truth condition (Assumption 2) holds for each player. Then for every $\varepsilon > 0$,
$$\mu^f\big(z : \exists\, T(z) < \infty \text{ s.t. } \forall\, t \geq T(z),\ \exists\, \hat f^{\varepsilon,t,z} \text{ an } \varepsilon\text{-Nash equilibrium of the continuation game after } h^t(z) \text{ with } d_{h^t(z)}\big(\mu^f, \mu^{\hat f^{\varepsilon,t,z}}\big) \leq \varepsilon\big) = 1.$$

Corollary 5.4 (Zero-shot Nash convergence for PS-BR). Assume that for every player $i$, Assumption 3 holds and player $i$ uses PS-BR (Definition 5). Then the conclusion of Theorem 5.3 holds.

The proofs of Theorem 5.3 and Corollary 5.4 are deferred to Appendix B. As a direct consequence, under our practical PS-BR implementation, the premises of Theorem 5.3 are verified directly. The main theoretical results, Theorem 5.3 and Corollary 5.4, may seem counter-intuitive: if each agent is learning, then what each agent is trying to predict is itself changing over time, so why should behavior ever stabilize? This concern is valid for many myopic learning models, where the learner treats the opponent as having a fixed action distribution even though the opponent is also adapting. The promise of Bayesian learning [33] is that, under a suitable grain-of-truth condition, agents' posterior predictive forecasts about future play can nonetheless become accurate (merge) along the realized path.
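The weak distance $d$ appearing in these statements (Definition 6) is computable exactly in small examples: the supremum over cylinder events of length $t$ equals the total-variation distance between the induced distributions over length-$t$ action prefixes. A sketch for i.i.d. (product) play-path measures, truncated at a finite horizon (the neglected tail contributes at most $2^{-T}$); this toy computation is ours, not part of the paper's apparatus.

```python
from itertools import product

def weak_distance(mu1, mu2, horizon):
    """Truncated weak distance d(mu, nu) = sum_t 2^-t sup_{E in B^t} |mu(E)-nu(E)|
    for i.i.d. play-path measures given by per-period action distributions
    mu1, mu2 (dicts action -> prob). The sup over cylinder events of length t
    is the total-variation distance between length-t prefix distributions;
    the tail beyond `horizon` is at most 2**-horizon."""
    actions = list(mu1)
    total = 0.0
    for t in range(1, horizon + 1):
        l1 = 0.0
        for seq in product(actions, repeat=t):
            p = q = 1.0
            for a in seq:
                p *= mu1[a]
                q *= mu2[a]
            l1 += abs(p - q)
        total += 2.0 ** -t * (l1 / 2.0)  # TV distance = half the L1 distance
    return total
```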
In repeated games, the correct object of inference is not a fixed action but the opponent's repeated-game strategy: a fixed contingent plan (mapping histories to actions) that may be highly nonstationary. In particular, even if an opponent updates beliefs and changes its period-by-period best response, once its prior, update rule, and decision rule are fixed from time 0, its behavior defines a single mapping $f_{-i} : H \to \Delta(A_{-i})$ (hence a fixed repeated-game strategy in our sense). Agents' beliefs change because they refine uncertainty about this fixed mapping (and its on-path implications), not because the mapping is being rewritten exogenously over time. Indeed, our main results do not require that posteriors over opponent strategies literally stop moving. Instead, they require on-path stabilization in two weaker senses:
1. Stability of forecasts (predictive merging). Under the grain-of-truth condition (Assumption 2), Bayesian updating implies that, along $\mu^f$-almost every realized history $h^t(z)$, the agent's posterior predictive distribution over future play becomes close to the true continuation distribution (formalized by Definition 9 and Lemma 5.1). Importantly, this can happen even if the posterior over strategy labels does not concentrate: distinct strategy hypotheses may be observationally equivalent on the realized path, and any remaining disagreement can persist only on counterfactual histories that are never reached.
2. Stability of (approximate) best responses. Once an agent's predictive belief about continuation play is accurate on-path, playing an $\varepsilon$-best response to that belief is also nearly optimal against the true continuation play. Moreover, best-response sets need not vary wildly: when the payoff gap between the best action and the runner-up is nontrivial, small changes in beliefs do not change which continuation strategies are $\varepsilon$-optimal.
This is exactly why our R definition imposes only asymptotic on-path $\varepsilon$-consistency (Definition 4), rather than requiring perfect best-response optimality at every time and every counterfactual history. Even if beliefs keep updating forever, behavior can still stabilize because decisions depend only on the predictive implications of beliefs for the realized continuation game. If posterior mass shifts among hypotheses that induce (nearly) the same continuation distribution after $h^t(z)$, then the agent's best-response problem is (nearly) unchanged, so play remains stable. For our PS-BR implementation with a finite menu and KL separation (Assumption 3), we obtain an even stronger form of stabilization: the posterior over the menu concentrates on the true opponent strategy (Lemma 4.2), so the randomness from posterior sampling becomes asymptotically negligible (Lemma 4.1), yielding eventual on-path $\varepsilon$-best-response behavior (Proposition 4.3).

5.3 Zero-shot stage-game Nash convergence for myopic rules

Theorem 5.3 and Corollary 5.4 establish eventual on-path convergence to a Nash equilibrium of the continuation game under PS-BR. That guarantee is deliberately strong: it concerns repeated-game optimality and therefore requires beliefs over opponents' full continuation strategies. Yet this level of reasoning may be unnecessary when the object of interest is only stage-wise strategic optimality. If we ask instead whether the realized mixed-action profile at each history is eventually an approximate Nash equilibrium of the one-shot stage game, then predicting the opponents' next joint action may suffice. This reduction captures the logic of SCoT [3], which implements a "predict the next move, then best respond" procedure rather than full continuation planning. The purpose of this subsection is to justify this simplification formally.
We analyze two one-step variants: myopic PS-BR, which best responds to a one-step predictive belief, and SCoT [3], which best responds to a deterministic point prediction of the opponents' next action.

5.3.1 Myopic PS-BR

Myopic PS-BR retains the Bayesian-learning-plus-best-response structure of the previous subsection, but truncates both objects to one period: the agent forms a one-step predictive belief over the opponents' next joint action and then plays a myopic best response to that belief. For notational convenience, as above, for any opponents' profile $g_{-i}$ and history $h$ we write $g_{-i}(h) \in \Delta(A_{-i})$ for the induced distribution over the opponents' joint next action at history $h$. In particular, when $g_{-i}$ is an actual profile of opponents' mixed actions, this is the product distribution $g_{-i}(h) = \bigotimes_{j \neq i} g_j(h)$.

Definition 10 (One-shot stage-game $\varepsilon$-best response and stage $\varepsilon$-Nash). For $\alpha_i \in \Delta(A_i)$ and $q \in \Delta(A_{-i})$, define
$$u_i(\alpha_i, q) := \sum_{a_i \in A_i} \sum_{a_{-i} \in A_{-i}} \alpha_i(a_i)\, q(a_{-i})\, u_i(a_i, a_{-i}).$$
For $\varepsilon \geq 0$, define
$$\mathrm{br}_i^{\varepsilon}(q) := \Big\{\alpha_i \in \Delta(A_i) : u_i(\alpha_i, q) \geq \sup_{\alpha_i' \in \Delta(A_i)} u_i(\alpha_i', q) - \varepsilon\Big\},$$
and write $\mathrm{br}_i(q) := \mathrm{br}_i^0(q)$. At a history $h^t$, write $f_{-i}(h^t) := \bigotimes_{j \neq i} f_j(h^t) \in \Delta(A_{-i})$ for the actual current joint mixed action of player $i$'s opponents. The current mixed-action profile $f(h^t) := (f_1(h^t), \dots, f_N(h^t)) \in \prod_{j \in I} \Delta(A_j)$ is a stage $\varepsilon$-Nash equilibrium if $f_i(h^t) \in \mathrm{br}_i^{\varepsilon}(f_{-i}(h^t))$ for every $i \in I$.

Fix player $i$ and let $f^i = (f_i, f_{-i}^i)$, where $f_{-i}^i$ is the fixed belief-equivalent profile from Section 3.3. Let $f_{-i}^{i,t}$ be the continuation-consistent representative of player $i$'s predictive belief at history $h^t$.
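Definition 10 is straightforward to operationalize: against a fixed opponent belief $q$ the best attainable stage payoff is achieved by a pure action, so checking the stage $\varepsilon$-Nash property reduces to per-player regret comparisons. A sketch for the two-player case, with payoff matrices as nested dicts (our own illustrative encoding):

```python
def u(payoff, alpha, q):
    """Expected stage payoff u_i(alpha_i, q) for own mixed action alpha and
    opponent belief q (Definition 10). payoff[a][b]: own action a, opponent b."""
    return sum(alpha[a] * q[b] * payoff[a][b] for a in alpha for b in q)

def is_stage_eps_nash(payoffs, profile, eps):
    """Check f(h^t) is a stage eps-Nash equilibrium (two players): each mixed
    action is within eps of the best pure deviation against the opponent's."""
    for i, (payoff, alpha) in enumerate(zip(payoffs, profile)):
        q = profile[1 - i]  # the opponent's current mixed action
        best = max(u(payoff, {a: 1.0}, q) for a in payoff)  # best pure reply
        if u(payoff, alpha, q) < best - eps:
            return False
    return True
```

For matching pennies, the uniform profile passes the check at $\varepsilon = 0$, while a pure profile fails for any $\varepsilon$ below the deviating player's regret.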
We write $q_i^t(\cdot \mid h^t) := f_{-i}^{i,t}(h^t) \in \Delta(A_{-i})$. By the representative-choice convention from Section 3.3, along the histories under consideration, $f_{-i}^{i,t}(h^t) = f_{-i}^{i}(h^t)$. When the posterior $\mu_i^t(\cdot \mid h^t)$ is supported on a finite set $S_{-i} \subseteq \mathcal{F}_{-i}$, this is
$$q_i^t(\cdot \mid h^t) = \sum_{g_{-i} \in S_{-i}} \mu_i^t(g_{-i} \mid h^t)\, g_{-i}(h^t)(\cdot).$$

Definition 11 (Myopic posterior-sampling best response (myopic PS-BR)). Fix player $i$ and a history $h^t$. Suppose $\mu_i^t(\cdot \mid h^t)$ is supported on a finite set $S_{-i}$. For each $g_{-i} \in S_{-i}$, choose a mixed action $\alpha_i^{g_{-i}, h^t} \in \mathrm{br}_i(g_{-i}(h^t))$. Myopic PS-BR:
1. samples $\tilde f_{-i} \sim \mu_i^t(\cdot \mid h^t)$;
2. plays the mixed action $\alpha_i^{\tilde f_{-i}, h^t}$.
The induced ex ante mixed action is
$$\alpha_{i,t}^{mPS}(\cdot \mid h^t) := \sum_{g_{-i} \in S_{-i}} \mu_i^t(g_{-i} \mid h^t)\, \alpha_i^{g_{-i}, h^t}(\cdot).$$
Whenever player $i$ uses myopic PS-BR, we identify $f_i(h^t) = \alpha_{i,t}^{mPS}(\cdot \mid h^t)$.

Lemma 5.5 (Stage best responses are stable under nearby beliefs). Fix player $i$ and define $\|p - q\|_{TV} := \sup_{B \subseteq A_{-i}} |p(B) - q(B)|$ for $p, q \in \Delta(A_{-i})$. If $\alpha_i \in \mathrm{br}_i^{\xi}(q)$, then $\alpha_i \in \mathrm{br}_i^{\xi + 2\|p - q\|_{TV}}(p)$.

Lemma 5.6 (Myopic PS-BR is a $D_i^t$-stage best response). Fix player $i$ and a history $h^t$. Suppose $\mu_i^t(\cdot \mid h^t)$ is supported on a finite set $S_{-i}$ and write $p_t(g_{-i}) := \mu_i^t(g_{-i} \mid h^t)$, $g_{-i} \in S_{-i}$. Define $D_i^t(h^t) := 1 - \sum_{g_{-i} \in S_{-i}} p_t(g_{-i})^2$. Let $\alpha_{i,t}^{mPS}(\cdot \mid h^t)$ be myopic PS-BR and let $q_i^t(\cdot \mid h^t) = \sum_{g_{-i} \in S_{-i}} p_t(g_{-i})\, g_{-i}(h^t)(\cdot)$ be the one-step posterior predictive belief. Then
$$u_i\big(\alpha_{i,t}^{mPS},\, q_i^t(\cdot \mid h^t)\big) \ \geq\ \sup_{\alpha_i \in \Delta(A_i)} u_i\big(\alpha_i,\, q_i^t(\cdot \mid h^t)\big) - D_i^t(h^t).$$
Equivalently, $\alpha_{i,t}^{mPS}(\cdot \mid h^t) \in \mathrm{br}_i^{D_i^t(h^t)}\big(q_i^t(\cdot \mid h^t)\big)$.

Lemma 5.7 (Strong path prediction implies one-step predictive accuracy). Fix player $i$. Suppose player $i$ learns to predict the path of play under $f$ (Definition 9). Then
$$\mu^f\big(z : \forall\, \eta > 0,\ \exists\, T_i(z, \eta) < \infty \text{ s.t. } \forall\, t \geq T_i(z, \eta),\ \big\|q_i^t(\cdot \mid h^t(z)) - f_{-i}(h^t(z))\big\|_{TV} \leq \eta\big) = 1.$$

Theorem 5.8 (Bayesian convergence to stage-game Nash under myopic PS-BR). Assume that for every player $i$, Assumption 3 holds and player $i$ uses myopic PS-BR (Definition 11) at every history. Then for every $\varepsilon > 0$,
$$\mu^f\big(z : \exists\, T(z) < \infty \text{ s.t. } \forall\, t \geq T(z),\ f(h^t(z)) \text{ is a stage } \varepsilon\text{-Nash equilibrium}\big) = 1.$$

5.4 SCoT [3]

The second reduction is SCoT [3]. Instead of best responding to the full one-step predictive distribution, the agent first forms a deterministic point prediction of the opponents' next joint action and then best responds to that point prediction. In general this is not equivalent to best responding to a mixed belief, so the argument differs from the classical Bayesian-learning-plus-best-response route. Nevertheless, when all players use deterministic point-prediction rules, the true next action along the realized path is pure at every history, and predictive accuracy is enough to make the point prediction eventually correct. This yields eventual stage-game Nash convergence under a different mechanism than myopic PS-BR.

Definition 12 (Social Chain of Thought (SCoT) [3]). Fix player $i$. At each history $h^t$, let $q_i^t(\cdot \mid h^t) := f_{-i}^{i,t}(h^t) \in \Delta(A_{-i})$ denote player $i$'s one-step predictive distribution over the opponents' next joint action. Along the histories under consideration, the representative-choice convention from Section 3.3 gives $f_{-i}^{i,t}(h^t) = f_{-i}^{i}(h^t)$. A SCoT rule for player $i$ consists of:
1. a deterministic MAP (maximum a posteriori) selector $\hat a_{-i}^t(h^t) \in \arg\max_{a_{-i} \in A_{-i}} q_i^t(a_{-i} \mid h^t)$;
2. a deterministic pure best-response selector $b_i : A_{-i} \to A_i$ such that $b_i(a_{-i}) \in \arg\max_{a_i \in A_i} u_i(a_i, a_{-i})$ for every $a_{-i} \in A_{-i}$.
The induced strategy is $f_i(h^t) := \delta_{b_i(\hat a_{-i}^t(h^t))} \in \Delta(A_i)$. Thus a SCoT player uses a pure action at every history.

Lemma 5.9 (Deterministic truth implies asymptotic purity and eventual MAP correctness). Fix player $i$ and suppose player $i$ learns to predict the path of play under $f$ in the sense of Definition 9. Assume that for every history $h \in H$ there exists an action $a_{-i}^{\star}(h) \in A_{-i}$ such that $f_{-i}(h) = \delta_{a_{-i}^{\star}(h)}$. Then
$$\mu^f\big(z : \exists\, T_i(z) < \infty \text{ s.t. } \forall\, t \geq T_i(z),\ \hat a_{-i}^t(h^t(z)) = a_{-i}^{\star}(h^t(z))\big) = 1.$$
In particular, along $\mu^f$-almost every realized path $z$,
$$q_i^t\big(a_{-i}^{\star}(h^t(z)) \mid h^t(z)\big) \longrightarrow 1 \quad\text{and}\quad 1 - \max_{a_{-i} \in A_{-i}} q_i^t(a_{-i} \mid h^t(z)) \longrightarrow 0.$$

Theorem 5.10 (One-shot stage-game Nash convergence for SCoT). Suppose every player $i \in I$ uses SCoT in the sense of Definition 12, and suppose every player learns to predict the path of play under $f$ in the sense of Definition 9. Then
$$\mu^f\big(z : \exists\, T(z) < \infty \text{ s.t. } \forall\, t \geq T(z),\ f(h^t(z)) \text{ is a stage Nash equilibrium}\big) = 1.$$
Equivalently, along $\mu^f$-almost every realized path, the current mixed-action profile eventually becomes a stage 0-Nash equilibrium.

Corollary 5.11 (Bayesian stage-game Nash convergence for SCoT). Suppose every player uses deterministic MAP-SCoT and Assumption 2 holds for every player. Then the conclusion of Theorem 5.10 holds:
$$\mu^f\big(z : \exists\, T(z) < \infty \text{ s.t. } \forall\, t \geq T(z),\ f(h^t(z)) \text{ is a stage Nash equilibrium}\big) = 1.$$

Remark 2. Theorem 5.10 relies on the fact that when all players use SCoT with deterministic tie-breaking, the true current action profile is pure at every history. This is why asymptotic purity need not be imposed separately: it is implied by Bayesian one-step predictive accuracy toward a pure truth. If opponents are allowed to play genuinely mixed current actions, this argument breaks down, and additional conditions such as asymptotic purity or BR-invariance are again needed. The SCoT result is therefore naturally paired with the grain-of-truth assumption (Assumption 2) and the corresponding merging-of-opinions argument, rather than with Assumption 3, whose uniform-positivity requirement is tailored to cautious menu-based posteriors and posterior-sampling rules such as PS-BR. The proofs are deferred to Appendix B.

Taken together, Theorems 5.8 and 5.10 show that, for the weaker objective of stage-game Nash convergence, full continuation planning is not necessary. However, these one-step results are inherently limited to stage-game equilibrium: they do not by themselves recover more demanding continuation-game or history-contingent repeated-game equilibria, whose incentive structure is sustained by the value of future paths of play. Establishing convergence to those richer repeated-game equilibria requires a procedure, such as PS-BR, that reasons over full continuation strategies rather than only over the next-period action.

6 Extension to unknown, stochastic, and private payoffs

Sections 3–5 assumed that the stage payoff functions $u_i : A \to [0,1]$ are common knowledge and deterministic. We now drop this assumption and allow each agent to observe only its own privately realized stochastic payoffs.

6.1 Private-payoff repeated game and information histories

Fix the same action sets $(A_i)_{i \in I}$ and discount factors $(\lambda_i)_{i \in I}$ as in Section 3.
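Both one-step rules share the same "predict, then stage-best-respond" shape, differing only in what they best respond to. A sketch contrasting myopic PS-BR (Definition 11: sample an opponent model, best respond to its predicted action distribution) with SCoT (Definition 12: best respond to the MAP point prediction). The game, menu, and posterior below are illustrative stand-ins of our own.

```python
import random

def stage_br(payoff, q):
    """A pure stage best response to opponent belief q (argmax expected payoff)."""
    return max(payoff, key=lambda a: sum(q[b] * payoff[a][b] for b in q))

def myopic_ps_br(payoff, posterior, menu, h, rng=random):
    """Definition 11: sample an opponent model from the posterior, then best
    respond to that model's predicted next-action distribution at history h."""
    labels = list(posterior)
    g = rng.choices(labels, [posterior[l] for l in labels])[0]
    return stage_br(payoff, menu[g](h))

def scot(payoff, posterior, menu, h):
    """Definition 12: best respond to the MAP point prediction of the
    opponents' next joint action (deterministic, hence pure play)."""
    q = {}  # one-step posterior predictive q_i^t
    for g, w in posterior.items():
        for b, p in menu[g](h).items():
            q[b] = q.get(b, 0.0) + w * p
    a_hat = max(q, key=q.get)          # MAP next-action prediction
    return stage_br(payoff, {a_hat: 1.0})
```

Note the difference in randomness: SCoT's output is a deterministic function of the posterior, while myopic PS-BR's output varies with the sampled model until the posterior concentrates (Lemma 5.6's $D_i^t$ bound).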
For each player $i$, let $\mathcal{R}_i \subseteq \mathbb{R}$ denote the payoff space and let $\nu_i(dr)$ be a dominating base measure (counting measure in the discrete case, Lebesgue measure in the continuous case). We assume that the payoff-noise family is known. Concretely, for each player $i$ there is a known family of densities $\psi_i(r; \mu)$, $r \in \mathcal{R}_i$, $\mu \in \mathbb{R}$, where the parameter $\mu$ is the mean payoff. The true unknown object is player $i$'s mean payoff matrix $u_i : A \to [0,1]$. (As usual, any bounded payoff matrix can be affinely normalized into $[0,1]$ without changing best responses or Nash inequalities.) At round $t$, after the public joint action $a^t \in A$ is realized, player $i$ privately observes
$$r_i^t \sim q_i^{u_i}(\cdot \mid a^t), \qquad\text{where}\quad q_i^{u_i}(dr \mid a) := \psi_i(r; u_i(a))\,\nu_i(dr). \tag{5}$$
Thus the true payoff kernel is determined by the true mean matrix $u_i$. In the private-payoff model, actions may depend on both the public history and the player's own private payoff observations. Accordingly, define player $i$'s information history at time $t$ as
$$x_i^t := (h^t, r_i^{1:t-1}) \in X_i^t := H^t \times \mathcal{R}_i^{t-1}, \qquad X_i := \bigcup_{t \geq 1} X_i^t.$$
A strategy for player $i$ in the private-payoff game is a map $\sigma_i : X_i \to \Delta(A_i)$. Let $\Sigma_i$ denote the set of such strategies and $\Sigma := \prod_{i \in I} \Sigma_i$. The full sample space is
$$\Omega := \prod_{t \geq 1} \Big(A \times \prod_{i \in I} \mathcal{R}_i\Big),$$
whose typical element is $\omega = (a^1, r^1, a^2, r^2, \dots)$ with $r^t = (r_1^t, \dots, r_N^t)$. Given a strategy profile $\sigma \in \Sigma$ and the true mean matrices $u = (u_i)_{i \in I}$, the tuple $(\sigma, u)$ induces a unique probability law $P^{\sigma,u}$ on $\Omega$ by the Ionescu–Tulcea theorem. For a realized path $\omega \in \Omega$, write $x^t(\omega) := (x_i^t(\omega))_{i \in I}$ for the realized vector of information histories at time $t$. For any continuation profile $\tau$ defined on future information histories extending $x^t$, let $P_{x^t}^{\tau,u}$ denote the induced continuation law.
For player $i$, define the continuation payoff after $x^t$ by
$$U_i(\tau \mid x^t) := \mathbb{E}_{P_{x^t}^{\tau,u}}\Big[(1 - \lambda_i) \sum_{k=0}^{\infty} \lambda_i^k\, r_i^{t+k}\Big].$$
By iterated expectation and (5),
$$U_i(\tau \mid x^t) = \mathbb{E}_{P_{x^t}^{\tau,u}}\Big[(1 - \lambda_i) \sum_{k=0}^{\infty} \lambda_i^k\, u_i(a^{t+k})\Big].$$
Hence the objective continuation payoff in the private-payoff game equals the discounted payoff induced by the true mean matrix, even though strategies may condition on private payoff realizations. A continuation profile $\tau$ is an $\varepsilon$-Nash equilibrium after $x^t$ if, for every $i \in I$,
$$U_i(\tau \mid x^t) \geq \sup_{\tau_i' \in \Sigma_i(x_i^t)} U_i(\tau_i', \tau_{-i} \mid x^t) - \varepsilon.$$
Finally, let $\bar\mu_{x^t}^{\tau,u}$ denote the public-action marginal of $P_{x^t}^{\tau,u}$ on the future public action path $(a^t, a^{t+1}, \dots) \in H^{\infty}$. We compare continuation profiles only through these public-action marginals, using
$$d_{x^t}(\tau, \hat\tau) := d\big(\bar\mu_{x^t}^{\tau,u},\ \bar\mu_{x^t}^{\hat\tau,u}\big),$$
where $d$ is the weak distance from Definition 6.

6.2 Known-noise, unknown-mean parametrization

We now impose the finite-menu structure used by PS-BR. For player $i$, let $\mathcal{M}_i$ be a finite menu of candidate mean payoff matrices $m_i : A \to [0,1]$. Each $m_i \in \mathcal{M}_i$ induces a payoff kernel $q_i^{m_i}(dr \mid a) := \psi_i(r; m_i(a))\,\nu_i(dr)$. Thus sampling a payoff-matrix label is exactly sampling a payoff kernel, expressed in mean-matrix coordinates. Given $x_i^t = (h^t, r_i^{1:t-1})$, player $i$'s posterior over candidate mean matrices is
$$\pi_i^t(m_i \mid x_i^t) \ \propto\ \pi_i^0(m_i) \prod_{s=1}^{t-1} \psi_i(r_i^s; m_i(a^s)), \qquad m_i \in \mathcal{M}_i. \tag{6}$$
As in Sections 4–5, we model player $i$'s beliefs about the opponents through a finite menu of public-action continuation models $g_{-i} : H \to \Delta(A_{-i})$. These models describe the predictive law of the opponents' next public action conditional on the public history.
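With a known Gaussian noise family, the menu posterior (6) is a finite softmax of log-likelihoods, so it is cheap to maintain exactly. A sketch assuming, for illustration, $\psi_i(r; \mu) = \mathcal{N}(\mu, \sigma^2)$ with known $\sigma$ (the paper only requires a known density family; Gaussian is our choice here):

```python
import math

def payoff_matrix_posterior(prior, menu, history, sigma=1.0):
    """Posterior (6) over candidate mean payoff matrices m_i, assuming a known
    Gaussian noise family psi_i(r; mu) = N(mu, sigma^2).

    prior:   dict label -> prior weight pi_i^0(m_i)
    menu:    dict label -> mean matrix (dict joint action -> mean payoff)
    history: list of (joint_action, realized_payoff) pairs (a^s, r_i^s)
    """
    logw = {m: math.log(prior[m]) for m in menu}
    for a, r in history:
        for m in menu:
            mu = menu[m][a]
            logw[m] += -((r - mu) ** 2) / (2 * sigma ** 2)  # Gaussian log-lik
    mx = max(logw.values())
    w = {m: math.exp(v - mx) for m, v in logw.items()}      # stable softmax
    z = sum(w.values())
    return {m: v / z for m, v in w.items()}
```

Only likelihood ratios matter, so the additive Gaussian normalizing constant is dropped; subtracting the max log-weight before exponentiating keeps the computation numerically stable over long histories.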
Let $\mathcal{S}_{-i}$ denote the finite menu and let $\mu_i^t(\cdot \mid h^t)$ be player $i$'s posterior over $\mathcal{S}_{-i}$.

6.3 Subjective continuation values and PS-BR

Fix player $i$, an information history $x_i^t = (h^t, r_i^{1:t-1})$, a reduced-form opponents' continuation model $g_{-i} \in \mathcal{S}_{-i}$, and a continuation strategy $\tau_i \in \Sigma_i(x_i^t)$. Let $P_{x_i^t}^{(\tau_i, g_{-i}), m_i}$ denote the induced law on player $i$'s future observable sequence when: (i) player $i$ follows $\tau_i$, (ii) opponents' public actions are generated by $g_{-i}$, and (iii) player $i$'s future private payoffs are generated from the kernel $q_i^{m_i}$. Define the $m_i$-subjective continuation value by
$$V_i^{m_i}(\tau_i \mid x_i^t; g_{-i}) := \mathbb{E}_{P_{x_i^t}^{(\tau_i, g_{-i}), m_i}}\Big[(1-\lambda_i)\sum_{k=0}^{\infty}\lambda_i^k r_i^{t+k}\Big]. \tag{7}$$
For $\varepsilon \ge 0$, define
$$BR_{i,m_i}^{\varepsilon}(g_{-i} \mid x_i^t) := \Big\{\tau_i \in \Sigma_i(x_i^t) : V_i^{m_i}(\tau_i \mid x_i^t; g_{-i}) \ge \sup_{\tau_i' \in \Sigma_i(x_i^t)} V_i^{m_i}(\tau_i' \mid x_i^t; g_{-i}) - \varepsilon\Big\},$$
and write $BR_{i,m_i}(g_{-i} \mid x_i^t) := BR_{i,m_i}^{0}(g_{-i} \mid x_i^t)$. Player $i$'s mixed subjective continuation value is
$$V_i^{\mathrm{mix},t}(\tau_i \mid x_i^t) := \mathbb{E}_{\substack{g_{-i} \sim \mu_i^t(\cdot \mid h^t) \\ m_i \sim \pi_i^t(\cdot \mid x_i^t)}}\big[V_i^{m_i}(\tau_i \mid x_i^t; g_{-i})\big]. \tag{8}$$
For the true mean matrix $u_i$, define
$$V_i^{u_i,t}(\tau_i \mid x_i^t) := \mathbb{E}_{g_{-i} \sim \mu_i^t(\cdot \mid h^t)}\big[V_i^{u_i}(\tau_i \mid x_i^t; g_{-i})\big]. \tag{9}$$
Fix player $i$ and an information history $x_i^t = (h^t, r_i^{1:t-1})$. The posterior $\mu_i^t(\cdot \mid h^t)$ over the finite menu $\mathcal{S}_{-i}$ induces a posterior predictive law over future public action paths. Let $g_{-i}^{i,t}$ denote any reduced-form behavioral representative of this posterior predictive continuation law.
Concretely, $g_{-i}^{i,t}$ is chosen so that for every continuation strategy $\tau_i \in \Sigma_i(x_i^t)$,
$$V_i^{u_i,t}(\tau_i \mid x_i^t) = V_i^{u_i}(\tau_i \mid x_i^t; g_{-i}^{i,t}). \tag{10}$$
When $\mathcal{S}_{-i} = \{g_{-i}^1, \dots, g_{-i}^K\}$ is finite, one convenient choice is
$$g_{-i}^{i,t}(h)(a_{-i}) = \sum_{k=1}^{K} \mu_i^{t,h}(g_{-i}^k)\, g_{-i}^k(h)(a_{-i}), \qquad h \succeq h^t,$$
where $\mu_i^{t,h}$ is the continuation posterior obtained by updating $\mu_i^t(\cdot \mid h^t)$ along the continuation history $h$. Let $\bar\mu_{x_i^t}^{(\tau_i, g_{-i}), m_i}$ denote the public-action marginal of $P_{x_i^t}^{(\tau_i, g_{-i}), m_i}$ on $(a^t, a^{t+1}, \dots) \in H^\infty$. For the actual continuation strategy $\sigma_i$, player $i$'s posterior predictive law over future public action paths can then be written as
$$\Pi_i^t(\cdot \mid x_i^t) = \sum_{m_i \in \mathcal{M}_i} \pi_i^t(m_i \mid x_i^t)\, \bar\mu_{x_i^t}^{(\sigma_i, g_{-i}^{i,t}), m_i}. \tag{11}$$
We can now state the private-payoff PS-BR rule.

Definition 13 (Posterior-sampling best response (PS-BR) with private payoffs). Fix player $i$ and an information history $x_i^t = (h^t, r_i^{1:t-1})$. Given: (i) the posterior $\mu_i^t(\cdot \mid h^t)$ over reduced-form opponents' continuation models, and (ii) the posterior $\pi_i^t(\cdot \mid x_i^t)$ over player $i$'s own mean payoff matrices, PS-BR chooses a continuation strategy by:
1. sampling an opponents' continuation model $\tilde g_{-i} \sim \mu_i^t(\cdot \mid h^t)$;
2. sampling a mean payoff matrix $\tilde m_i \sim \pi_i^t(\cdot \mid x_i^t)$;
3. playing any continuation strategy $\tau_i \in BR_{i,\tilde m_i}(\tilde g_{-i} \mid x_i^t)$.
Denote the resulting randomized continuation strategy by $\sigma_{i,t}^{PS}(\cdot \mid x_i^t)$.

6.4 Posterior concentration

Although the primitive strategy profile is $\sigma \in \Sigma$, the public action path it induces admits a reduced-form description.
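The three steps of Definition 13 can be sketched compactly. In this sketch, `best_response` is a hypothetical stand-in for an exact solver of $BR_{i,m}(g \mid x_i^t)$ (not specified by the paper at this level of abstraction); only the two posterior-sampling steps are mechanical:

```python
import random

def ps_br_step(mu_post, pi_post, best_response, rng=random):
    """One round of PS-BR (Definition 13), as a sketch.

    mu_post: dict opponent-model label -> posterior mass mu_i^t(. | h^t)
    pi_post: dict mean-matrix label    -> posterior mass pi_i^t(. | x_i^t)
    best_response: callable (g_label, m_label) -> continuation strategy,
        standing in for an exact solver of BR_{i,m}(g | x_i^t)
    """
    def sample(dist):
        u, acc = rng.random(), 0.0
        for label, p in dist.items():
            acc += p
            if u <= acc:
                return label
        return label  # guard against floating-point rounding

    g = sample(mu_post)          # step 1: sample opponents' continuation model
    m = sample(pi_post)          # step 2: sample own mean payoff matrix
    return best_response(g, m)   # step 3: best respond to the sampled pair
```

Because both samples are drawn fresh at each information history, the induced play is a randomized continuation strategy, matching $\sigma_{i,t}^{PS}(\cdot \mid x_i^t)$.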
For each player $i$, define
$$\bar f_i(h) := P^{\sigma,u}(a_i^t \in \cdot \mid h^t = h), \qquad \bar f := (\bar f_i)_{i \in I},$$
and let $\bar\mu^{\sigma,u}$ denote the induced law on the public action path in $H^\infty$. Thus $\bar f$ is the true reduced-form public-action model generated by the information-history strategy profile $\sigma$ and the true mean matrices $u$. For player $i$'s finite menu of reduced-form opponents' continuation models $\mathcal{S}_{-i}$, assume that Assumption 3 holds mutatis mutandis with the true reduced-form opponent model $\bar f_{-i}$ and the true public-action path law $\bar\mu^{\sigma,u}$ in place of $f_{-i}$ and $\mu^f$.

Lemma 6.1 (Posterior concentration of reduced-form public-action beliefs). Fix player $i$ and suppose player $i$'s finite menu $\mathcal{S}_{-i}$ and posterior $\mu_i^t(\cdot \mid h^t)$ satisfy Assumption 3 mutatis mutandis with $\bar f_{-i}$ and $\bar\mu^{\sigma,u}$ in place of $f_{-i}$ and $\mu^f$. Then under the true interaction law $P^{\sigma,u}$,
$$\mu_i^t(\bar f_{-i} \mid h^t) \longrightarrow 1 \quad\text{and hence}\quad \max_{g_{-i} \in \mathcal{S}_{-i} \setminus \{\bar f_{-i}\}} \mu_i^t(g_{-i} \mid h^t) \longrightarrow 0,$$
almost surely.

The only genuinely new learnability requirement in the private-payoff extension is on the payoff side: identifiability of player $i$'s own mean payoff matrix from private noisy rewards.

Assumption 4 (Finite payoff-menu identifiability under known noise). Fix player $i$ and let $\mathcal{M}_i = \mathrm{supp}(\pi_i^0)$ be finite. Assume:
1. (Menu grain of truth) The true mean matrix satisfies $u_i \in \mathcal{M}_i$ and $\pi_i^0(u_i) > 0$.
2. (Known common noise family) Each menu element $m_i \in \mathcal{M}_i$ induces the payoff kernel $q_i^{m_i}(dr \mid a) = \psi_i(r; m_i(a))\,\nu_i(dr)$, and the true payoff law is $q_i^{u_i}$.
3. (Finite second moments of log-likelihood ratios) For every $m_i \in \mathcal{M}_i \setminus \{u_i\}$,
$$\sup_{a \in A} \mathbb{E}_{R \sim q_i^{u_i}(\cdot \mid a)}\bigg[\Big(\log \frac{\psi_i(R; u_i(a))}{\psi_i(R; m_i(a))}\Big)^2\bigg] < \infty.$$
4.
(On-path KL separation) For every $m_i \in \mathcal{M}_i \setminus \{u_i\}$ there exists $\kappa_i(m_i) > 0$ such that under the true interaction law $P^{\sigma,u}$,
$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(q_i^{u_i}(\cdot \mid a^t)\,\big\|\, q_i^{m_i}(\cdot \mid a^t)\big) \ge \kappa_i(m_i) \quad \text{a.s.}$$

The next lemma is the mean-matrix analogue of Lemma 4.2.

Lemma 6.2 (Payoff posterior concentration under known-noise KL separation). Fix player $i$ and suppose Assumption 4 holds. Then under the true interaction law $P^{\sigma,u}$,
$$\pi_i^t(u_i \mid x_i^t) \longrightarrow 1, \quad\text{and hence}\quad \max_{m_i \in \mathcal{M}_i \setminus \{u_i\}} \pi_i^t(m_i \mid x_i^t) \longrightarrow 0,$$
almost surely.

Lemma 6.3 (Payoff concentration identifies the predictive public-action law). Fix player $i$. For every information history $x_i^t$,
$$d\big(\Pi_i^t(\cdot \mid x_i^t),\, \bar\mu_{x_i^t}^{(\sigma_i, g_{-i}^{i,t}), u_i}\big) \le 1 - \pi_i^t(u_i \mid x_i^t).$$
Consequently, under Lemma 6.2,
$$d\big(\Pi_i^t(\cdot \mid x_i^t),\, \bar\mu_{x_i^t}^{(\sigma_i, g_{-i}^{i,t}), u_i}\big) \longrightarrow 0 \quad P^{\sigma,u}\text{-a.s.}$$
The proof is deferred to Appendix B.

6.5 PS-BR gap and asymptotic consistency

Let
$$p_t(g_{-i}, m_i) := \mu_i^t(g_{-i} \mid h^t)\,\pi_i^t(m_i \mid x_i^t), \qquad (g_{-i}, m_i) \in \mathcal{S}_{-i} \times \mathcal{M}_i.$$
Define the joint collision complement
$$D_i^{t,\mathrm{joint}}(x_i^t) := 1 - \sum_{(g_{-i}, m_i) \in \mathcal{S}_{-i} \times \mathcal{M}_i} p_t(g_{-i}, m_i)^2.$$

Lemma 6.4 (PS-BR is a $D_i^{t,\mathrm{joint}}$-best response to the mixed subjective value). Fix player $i$ and an information history $x_i^t = (h^t, r_i^{1:t-1})$. Let $\sigma_{i,t}^{PS}$ be PS-BR from Definition 13. Then
$$V_i^{\mathrm{mix},t}(\sigma_{i,t}^{PS} \mid x_i^t) \ge \sup_{\tau_i \in \Sigma_i(x_i^t)} V_i^{\mathrm{mix},t}(\tau_i \mid x_i^t) - D_i^{t,\mathrm{joint}}(x_i^t).$$
Equivalently, $\sigma_{i,t}^{PS}$ is a $D_i^{t,\mathrm{joint}}(x_i^t)$-best response to the mixed subjective continuation value (8). Define $\delta_i^t(x_i^t) := 1 - \pi_i^t(u_i \mid x_i^t)$.
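Since the two posteriors enter $p_t$ as a product, the joint collision complement factorizes into the marginal collision probabilities. A minimal sketch with dictionary-encoded posteriors (the function name is illustrative):

```python
def joint_collision_complement(mu_post, pi_post):
    """D_i^{t,joint}(x_i^t) = 1 - sum over (g, m) of [mu(g) * pi(m)]^2.

    mu_post: dict opponent-model label -> posterior mass mu_i^t(g | h^t)
    pi_post: dict mean-matrix label    -> posterior mass pi_i^t(m | x_i^t)

    Because p_t(g, m) = mu(g) * pi(m), the double sum equals
    (sum_g mu(g)^2) * (sum_m pi(m)^2), but we compute it directly
    over the product grid to mirror the definition.
    """
    return 1.0 - sum(
        (pg * pm) ** 2 for pg in mu_post.values() for pm in pi_post.values()
    )
```

When both posteriors are degenerate the complement is $0$ and PS-BR is an exact best response to the mixed subjective value; as either posterior spreads out, the PS-BR gap in Lemma 6.4 widens.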
Because continuation values are normalized to lie in $[0,1]$, for every $\tau_i \in \Sigma_i(x_i^t)$,
$$\big|V_i^{\mathrm{mix},t}(\tau_i \mid x_i^t) - V_i^{u_i,t}(\tau_i \mid x_i^t)\big| \le \delta_i^t(x_i^t). \tag{12}$$
Combining (12), Lemma 6.4, Lemma 6.1, and Lemma 6.2 yields the asymptotic best-response property.

Proposition 6.5 (PS-BR implies asymptotic $\varepsilon$-consistency in the private-payoff game). Fix player $i$. Assume: (i) Assumption 3 holds mutatis mutandis for player $i$'s menu of reduced-form opponents' continuation models, with the true reduced-form opponent model $\bar f_{-i}$ and the true public-action path law $\bar\mu^{\sigma,u}$ in place of $f_{-i}$ and $\mu^f$; (ii) Assumption 4 holds for player $i$'s mean-matrix menu; and (iii) player $i$ uses PS-BR at every information history. Then for every $\varepsilon > 0$,
$$P^{\sigma,u}\Big(\omega : \exists\, T_i(\omega,\varepsilon) < \infty \text{ s.t. } \forall t \ge T_i(\omega,\varepsilon),\ \sigma_{i,t}^{PS}(\cdot \mid x_i^t(\omega)) \in BR_{i,u_i}^{\varepsilon}\big(g_{-i}^{i,t} \mid x_i^t(\omega)\big)\Big) = 1.$$

6.6 Zero-shot Nash convergence with private payoffs

To lift the earlier zero-shot argument, one replaces public histories $h^t$ by information-history vectors $x^t$, and one compares continuation profiles through the weak distance between their induced public-action marginals after the realized full information-history vector. Because player $i$ only observes $x_i^t = (h^t, r_i^{1:t-1})$, the relevant Bayesian merging step is first stated on player $i$'s observable process. Assumption 6 then identifies this player-relative predictive target with the ex post public continuation law after $x^t$ asymptotically. For player $i$, let $O_i := \prod_{t \ge 1} (A \times \mathcal{R}_i)$ denote the space of observable sequences $(a^1, r_i^1, a^2, r_i^2, \dots)$. Let $P_i^{\sigma,u}$ be the marginal of $P^{\sigma,u}$ on $O_i$, and let $Q_i^{0,\sigma_i}$ be player $i$'s prior predictive law on $O_i$ induced by their priors over $\mathcal{S}_{-i}$ and $\mathcal{M}_i$, the known noise family, and their own strategy $\sigma_i$.
Let
$$\bar\mu_{i,x_i^t}^{\sigma,u}(E) := P^{\sigma,u}\big((a^t, a^{t+1}, \dots) \in E \mid x_i^t\big), \qquad E \in \mathcal{B},$$
denote the true public-action continuation law conditional on player $i$'s own observable information history $x_i^t$. Also let $\Pi_i^t(\cdot \mid x_i^t)$ denote player $i$'s posterior predictive law over the future public action path $(a^t, a^{t+1}, \dots) \in H^\infty$ conditional on $x_i^t$. In the private-payoff setup, player $i$'s prior over reduced-form opponents' continuation models and over its own finite menu of payoff hypotheses is constructed so that the true observable process is represented as one feasible element. Thus the induced prior predictive law on player $i$'s observable sequence should place positive mass on the true observable path law. This naturally gives the following Assumption 5.

Assumption 5 (Observable grain of truth in the private-payoff game). Fix player $i$. Assume
$$P_i^{\sigma,u} \ll Q_i^{0,\sigma_i}.$$

The next requirement is also natural in the PS-BR regime. Although player $i$ never observes the opponents' private reward histories, those histories matter for future public play only through how they shape the opponents' own continuation behavior. As each player's private payoff posterior concentrates and the residual effect of these hidden reward histories on public continuation play becomes negligible, conditioning on the realized full information-history vector $x^t$ or on player $i$'s own observable history $x_i^t$ should asymptotically yield the same public-action continuation law. Assumption 6 formalizes the intended information structure: player $i$ does not observe the other players' private reward histories and need only infer its own payoff matrix together with the opponents' reduced-form public-action strategy. Asymptotically, any additional predictive content in the unobserved private histories becomes negligible for future public play.

Assumption 6 (Asymptotic public sufficiency of hidden private histories).
For every player $i$,
$$d\big(\bar\mu_{i,x_i^t(\omega)}^{\sigma,u},\, \bar\mu_{x^t(\omega)}^{\sigma,u}\big) \longrightarrow 0 \quad \text{for } P^{\sigma,u}\text{-a.e. } \omega.$$
Assumption 6 is the formal expression of the idea that, in the intended regime, each player needs to infer only its own payoff matrix and the opponents' reduced-form public-action strategy; the opponents' unrevealed private reward histories do not asymptotically alter future public play beyond what those objects already encode.

Lemma 6.6 (Observable grain of truth implies strong public-path prediction). Fix player $i$. Under Assumptions 5 and 6, player $i$'s posterior predictive law over future public action paths merges with the true public-action continuation law after the realized information-history vector:
$$d\big(\Pi_i^t(\cdot \mid x_i^t(\omega)),\, \bar\mu_{x^t(\omega)}^{\sigma,u}\big) \longrightarrow 0 \quad \text{for } P^{\sigma,u}\text{-a.e. } \omega.$$
The proof is deferred to Appendix B.

Definition 14 (Weak subjective equilibrium on information histories). Fix $\xi, \eta \ge 0$ and an information-history vector $x^t$. A continuation profile $\tau$ is a weak $\xi$-subjective $\eta$-equilibrium after $x^t$ if, for every player $i$, there exists a reduced-form opponents' continuation model $g_{-i}^i$ such that $\tau_i \in BR_{i,u_i}^{\xi}(g_{-i}^i \mid x_i^t)$ and
$$d\big(\bar\mu_{x^t}^{\tau,u},\, \bar\mu_{x_i^t}^{(\tau_i, g_{-i}^i), u_i}\big) \le \eta.$$

Proposition 6.7 (Learning and asymptotic consistency imply weak subjective equilibrium in the private-payoff game). Suppose every player $i$ satisfies the conclusions of Proposition 6.5 and Lemma 6.6. Then for every $\xi > 0$ and $\eta > 0$,
$$P^{\sigma,u}\Big(\omega : \exists\, T(\omega) < \infty \text{ s.t. } \forall t \ge T(\omega),\ \sigma|_{x^t(\omega)} \text{ is a weak } \xi\text{-subjective } \eta\text{-equilibrium after } x^t(\omega)\Big) = 1.$$
The proof is deferred to Appendix B.

Theorem 6.8 (Zero-shot Nash convergence with private payoffs).
Assume that for every player $i$: Assumption 3 holds mutatis mutandis for the finite menu of reduced-form opponents' continuation models, with the true reduced-form opponent model $\bar f_{-i}$ and the true public-action path law $\bar\mu^{\sigma,u}$ in place of $f_{-i}$ and $\mu^f$; Assumption 4 holds for the finite menu of candidate mean payoff matrices under the known noise family; Assumptions 5 and 6 hold; and player $i$ uses PS-BR at every information history. Then for every $\varepsilon > 0$,
$$P^{\sigma,u}\Big(\omega : \exists\, T(\omega) < \infty \text{ s.t. } \forall t \ge T(\omega),\ \exists\, \hat\tau^{\varepsilon,t,\omega} \text{ an } \varepsilon\text{-Nash equilibrium of the continuation game after } x^t(\omega) \text{ with } d\big(\bar\mu_{x^t(\omega)}^{\sigma,u},\, \bar\mu_{x^t(\omega)}^{\hat\tau^{\varepsilon,t,\omega},u}\big) \le \varepsilon\Big) = 1.$$
Theorem 6.8's interpretation is similar to that of Theorem 5.3, but now under the additional Assumption 6: although agents do not know the payoff matrix ex ante and observe only noisy private rewards, their public continuation play eventually becomes weakly close, along the realized path, to an $\varepsilon$-Nash equilibrium of the continuation game. In the known-common-noise-family setting, implementing payoff-kernel sampling is equivalent to sampling a mean payoff matrix from a finite reward menu and evaluating continuation strategies against the induced kernel.

7 Experiments

In this section, we empirically evaluate whether off-the-shelf reasoning LLM agents exhibit the theoretical properties derived in previous sections, i.e., whether they converge toward Nash equilibrium behavior in repeated strategic interaction. After discussing the experiment setup that is common throughout all experiments in Section 7.1, we provide simulation results that test the following three hypotheses implied by our theoretical analysis:
1. For convergence to the stage-game (myopic) Nash equilibrium, simple predict-then-act reasoning, e.g., SCoT, should already be sufficient (Section 7.2).
2.
For convergence to non-trivial repeated-game Nash equilibria that rely on continuation incentives and long-horizon strategic reasoning, myopic approaches should generally fail, whereas PS-BR, which explicitly evaluates continuation strategies, should succeed (Section 7.3).
3. PS-BR should remain effective even when the payoff matrix is not given and must be learned from noisy payoff observations, recovering equilibrium behavior under payoff uncertainty (Section 7.4).

7.1 Setup

Baselines. We use Qwen 3.5-27B [46], a small-scale open-reasoning model with GPT-5-mini-level capabilities [48]. Specifically, we run three models with almost the same prompts, differing only in the reasoning pattern:
• Base: Qwen 3.5-27B with direct action selection from the rules and interaction history.
• SCoT: Qwen 3.5-27B with chain-of-thought style "predict-then-act" prompting [3]. SCoT has demonstrated success in some repeated games, such as the Battle of the Sexes, and can be considered a simplified, myopic version of PS-BR. For details, see Appendix E.
• PS-BR: Qwen 3.5-27B with PS-BR (Definition 5, also detailed in Appendix D).

Benchmarks. We consider five repeated-game environments in total: BoS, PD, Promo, Samaritan, and Lemons.

(1) Battle of the Sexes (BoS; coordination with asymmetric equilibria). Actions each period: J or F. Per-period payoff matrix (Player 1, Player 2):

              P2: J     P2: F
    P1: J    (10, 7)    (0, 0)
    P1: F    (0, 0)     (7, 10)

One-shot stage-game Nash equilibria (pure): $(J,J)$ and $(F,F)$. One non-trivial cooperative Nash equilibrium is both players sticking to one action:
• Play J after every history (outcome $(J,J)$ every period).
• Play F after every history (outcome $(F,F)$ every period).
Such a non-trivial cooperative Nash equilibrium is particularly plausible when a monetary transfer underlies the game.
Another non-trivial cooperative Nash equilibrium is turn-taking:
• Play $(J,J)$ in odd periods and $(F,F)$ in even periods.
• After any history, continue the same odd/even phase convention.

(2) Prisoner's Dilemma (PD; social dilemma). Actions each period: J or F. Per-period payoff matrix (Player 1, Player 2):

              P2: J     P2: F
    P1: J    (3, 3)     (-5, 5)
    P1: F    (5, -5)    (0, 0)

One-shot stage-game Nash equilibrium: $(F,F)$. A baseline pure Nash equilibrium of the repeated game is stationary play of $(F,F)$ after every history. A nontrivial cooperative Nash equilibrium (grim-trigger cooperation) is:
• Cooperative phase: play $(J,J)$ every period.
• If any player ever plays F, switch forever to $(F,F)$.

(3) Promo [36, Appendix H.1]. Actions each period: R (Regular), P (Promotion), or Z (price-war punishment). Per-period payoff matrix (Player 1, Player 2):

              P2: R      P2: P      P2: Z
    P1: R    (1, 1)     (-1, 4)    (-2, -2)
    P1: P    (4, -1)    (0, 0)     (-2, -2)
    P1: Z    (-2, -2)   (-2, -2)   (-2, -2)

One-shot stage-game Nash equilibrium (pure): $(P,P)$. A baseline pure Nash equilibrium of the repeated game is the stationary play of $(P,P)$ after every history. A nontrivial cooperative pure Nash equilibrium described in [36] is:
• Cooperative phase: $(P,R)$ in odd rounds and $(R,P)$ in even rounds.
• If the opponent deviates from the cooperation, play Z for two periods and revert to the cooperative phase.

(4) Samaritan (altruism / one-sided moral hazard). Player 1 (Helper): Help (H) or No-help (N). Player 2 (Recipient): Work (W) or Shirk (S). Per-period payoff matrix (Helper, Recipient):

                 Recipient: W    Recipient: S
    Helper: H    (2, -1)         (0, 0)
    Helper: N    (1, -2)         (-1, -3)

One-shot stage-game Nash equilibrium (pure): $(H,S)$.
The helper has a dominant action (help), and the recipient best responds by shirking. A nontrivial cooperative Nash equilibrium exists for sufficiently patient players:
• Cooperative phase: play $(H,W)$ every period.
• If the recipient ever shirks, switch forever to punishment $(N,W)$.
• If, during punishment, the helper ever deviates by helping, the recipient switches forever to $(H,S)$ behavior.

(5) Lemons (adverse selection). Player 1 (Seller): High Quality ($HQ$) or Low Quality ($LQ$). Player 2 (Buyer): Buy (B) or Don't buy (D). Per-period payoff matrix (Seller, Buyer):

                  Buyer: B    Buyer: D
    Seller: HQ    (3, 3)      (-1, 0)
    Seller: LQ    (4, -1)     (0, 0)

One-shot stage-game Nash equilibrium (pure): $(LQ,D)$. The seller has the strict dominant action $LQ$; the buyer best responds to $LQ$ with D. A baseline pure Nash equilibrium of the repeated game is the stationary play of $(LQ,D)$ after every history. A nontrivial cooperative Nash equilibrium for sufficiently patient players:
• Start by playing $(HQ,B)$, and continue $(HQ,B)$ as long as no low-quality sale has ever been observed.
• If the buyer ever buys and then observes $LQ$, switch forever to D; the seller then plays the dominant $LQ$ thereafter.

7.2 Experiment 1. Nash convergence

Here, we test the first hypothesis: for convergence to any Nash equilibrium, simple predict-then-act reasoning, e.g., SCoT [3], should already suffice.

7.2.1 Experiment design

In Section 5.3, we showed that if agents myopically learn to predict opponents' next actions and then best respond to those predictions, the realized play path eventually converges to a stage-game $\varepsilon$-Nash equilibrium. SCoT [3] operationalizes precisely such a predict-then-act rule, making it a natural empirical test of the theory. To evaluate this prediction, we simulate repeated interaction in each benchmark game described in Section 7.1.
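The one-shot equilibria quoted for the benchmark games above can be checked mechanically. A minimal pure-Nash enumerator, assuming dict-encoded payoff matrices (the encoding and function name are illustrative, not from the paper):

```python
def pure_stage_nash(payoffs):
    """Enumerate pure Nash equilibria of a two-player stage game.

    payoffs: dict (a1, a2) -> (u1, u2) over the full action grid.
    A profile is a pure NE iff neither player has a profitable
    unilateral deviation.
    """
    acts1 = sorted({a1 for a1, _ in payoffs})
    acts2 = sorted({a2 for _, a2 in payoffs})
    eqs = []
    for a1 in acts1:
        for a2 in acts2:
            u1, u2 = payoffs[(a1, a2)]
            br1 = all(payoffs[(b1, a2)][0] <= u1 for b1 in acts1)
            br2 = all(payoffs[(a1, b2)][1] <= u2 for b2 in acts2)
            if br1 and br2:
                eqs.append((a1, a2))
    return eqs

# Per-period payoff matrices from Section 7.1
BOS = {("J", "J"): (10, 7), ("J", "F"): (0, 0),
       ("F", "J"): (0, 0), ("F", "F"): (7, 10)}
PD = {("J", "J"): (3, 3), ("J", "F"): (-5, 5),
      ("F", "J"): (5, -5), ("F", "F"): (0, 0)}
```

Running the enumerator on these two matrices recovers exactly the stage-game equilibria listed above: both coordinated outcomes in BoS, and mutual defection in PD.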
Two identical copies of the same model interact in symmetric self-play for $T = 200$ rounds with perfect monitoring of actions and payoffs. No communication channel is available beyond the public history of previous actions and realized payoffs. Each model conditions its round-$t$ decision only on the observed interaction history up to round $t-1$. To measure equilibrium-action convergence, among rounds $1, \dots, 200$ we focus only on the late-round window $t \in \{161, \dots, 180\}$. For each round in this window, we check the percentage of both players' realized actions that match any Nash equilibrium action, i.e., a Nash equilibrium action of the underlying one-shot game or an on-path action of the cooperative repeated-game equilibrium described in Section 7.1. We then average these indicators over rounds 161–180 and report the resulting percentage. Thus, the reported number can be interpreted as the fraction of late-round play that lies on either a one-shot Nash path or a cooperative-equilibrium path. Using rounds 161–180 isolates steady-state behavior and avoids placing weight on transient early-round dynamics and terminal-horizon effects. For each of the three model configurations (Base, SCoT, and PS-BR), we run 20 independent self-play matches. Our primary outcome of interest is whether the realized joint action profile converges to either a one-shot Nash action or an on-path action of the benchmark cooperative repeated-game Nash equilibrium for that game.

7.2.2 Results

Table 1: Equilibrium-follow percentage in late rounds (rounds 161–180) for any (one-shot Nash or cooperative on-path action) Nash equilibrium. Reported scores are averaged over 20 trials.
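The scoring rule just described reduces to a windowed membership count. A minimal sketch, assuming 1-indexed round logs and a caller-supplied set of equilibrium-consistent profiles (both names are illustrative):

```python
def equilibrium_follow_pct(joint_actions, eq_profiles, window=range(161, 181)):
    """Fraction (in %) of rounds in the scoring window whose realized joint
    action lies in the set of equilibrium-consistent profiles (one-shot Nash
    actions or cooperative on-path actions), as reported in Table 1.

    joint_actions: dict round number t (1-based) -> realized joint action
    eq_profiles:   set of joint action profiles credited as equilibrium-consistent
    window:        scoring rounds; default is the late window 161..180
    """
    hits = sum(1 for t in window if joint_actions[t] in eq_profiles)
    return 100.0 * hits / len(window)
```

Averaging this quantity over the 20 independent self-play matches gives one table cell.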
    Game         Base     SCoT      PS-BR
    BoS          60.0%    100.0%    100.0%
    PD           60.0%    100.0%    87.8%
    Promo        0.0%     100.0%    100.0%
    Samaritan    64.5%    100.0%    97.2%
    Lemons       0.0%     100.0%    89.8%

Table 1 shows that once cooperative on-path actions are also credited, SCoT attains a perfect late-round equilibrium-action score in all five benchmark environments. Base, by contrast, remains uneven across games, reaching 60.0% in BoS, 60.0% in PD, and 64.5% in Samaritan, but 0.0% in both Promo and Lemons. PS-BR also performs strongly, scoring 100.0% in BoS and Promo, 87.8% in PD, 97.2% in Samaritan, and 89.8% in Lemons when cooperative on-path actions are credited. Overall, these results show that myopic predict-then-act prompting often steers play to some Nash equilibrium. A natural question is what kind of equilibrium convergence Table 1 is capturing. The theory in Section 5.3 predicts that myopic predict-then-act reasoning should be sufficient for convergence to a stage-game $\varepsilon$-Nash equilibrium, without requiring agents to reason over full continuation strategies. The empirical results are broadly consistent with this prediction. In particular, SCoT attains perfect equilibrium-follow scores in all five environments once the evaluation metric credits both one-shot Nash actions and on-path actions of cooperative repeated-game equilibria. This suggests that explicitly prompting the model to forecast the opponent's next move and then act accordingly is often enough to remove obviously non-equilibrium play in the late rounds. At the same time, the results should be interpreted carefully. The metric in Table 1 deliberately aggregates two qualitatively different notions of equilibrium-consistent behavior: one-shot Nash actions and actions that lie on the path of a benchmark cooperative repeated-game equilibrium. As a result, a high score means that play has moved onto some equilibrium-consistent path, but it does not tell us which kind of equilibrium has been selected.
For example, in Prisoner's Dilemma, both $(F,F)$ and $(J,J)$ can be counted as successful late-round outcomes under our metric, even though the former reflects myopic defection while the latter reflects cooperation sustained by continuation incentives. Likewise, in BoS, converging to either coordinated outcome counts as success even though equilibrium selection remains unresolved. This distinction is important because myopic reasoning can explain only a limited class of equilibrium phenomena. A one-step predict-then-act rule can stabilize play at actions that are locally optimal given beliefs about the opponent's next move, but it does not by itself reason over future punishment and reward paths. Consequently, strong performance in Table 1 should be read as evidence that myopic prompting is often sufficient for equilibrium action convergence, not as evidence that it can reliably implement a particular nontrivial repeated-game equilibrium. In other words, SCoT appears effective at steering play toward some equilibrium-consistent late-round behavior, but the table does not yet establish whether it can sustain the richer, history-contingent equilibria that depend on long-horizon continuation values. This limitation is exactly what motivates the next experiment. To distinguish simple equilibrium-action convergence from genuine repeated-game strategic reasoning, we now test whether the models can follow a specific nontrivial cooperative Nash equilibrium path when that path must be sustained by continuation incentives rather than by myopic one-step optimization alone.

7.3 Experiment 2. Nontrivial Nash convergence

We now move from asking whether play converges to some equilibrium-consistent action profile to the harder question of whether agents can track a nontrivial, cooperative repeated-game Nash equilibrium sustained by continuation incentives.
Here, we test the second hypothesis: for convergence to non-trivial repeated-game Nash equilibria that rely on continuation incentives and long-horizon strategic reasoning, myopic approaches should generally fail, whereas PS-BR, which explicitly evaluates continuation strategies, should succeed.

7.3.1 Experiment design

To verify whether a particular long-horizon cooperative Nash equilibrium can be implemented, we included a prompt for each agent that specifies a particular long-horizon non-trivial cooperative Nash equilibrium and asks the agent to "strongly expect the opponent to play" that strategy. Such prompting sets the initial point of the evolution of their beliefs. For example, in PD, this meant prompting both agents to expect the opponent to play a grim-trigger strategy, i.e., cooperation until a defection triggers permanent punishment. In Promo, it meant prompting both agents to expect the prescribed alternating cooperative pattern $(P,R), (R,P), (P,R), \dots$, until a defection triggers finite punishment. As before, all experiments use symmetric self-play with two copies of the same model under perfect monitoring. Each match lasts $T = 200$ rounds. In every round, players act simultaneously, observe both actions and realized payoffs, and then condition the next-round decision on the updated history. Again, for each round $t \in \{161, \dots, 180\}$ in each run, we checked the percentage of both players' realized actions that match the desired nontrivial cooperative equilibrium behavior, then averaged the percentages over the 20 rounds (161–180) and reported the mean by setting and game. (We chose round 180 as the endpoint since PS-BR uses 20 rounds of lookahead, and we excluded pre-161 results, as we want to observe the equilibrium outcome.)

7.3.2 Results

Table 2: Equilibrium-follow percentage in late rounds (rounds 161–180) for the prompt-specified nontrivial cooperative equilibrium.
Reported scores are averaged over 20 trials.

    Game         Base    SCoT      PS-BR
    BoS          0.0%    0.0%      92.5%
    PD           0.0%    100.0%    98.0%
    Promo        0.0%    0.0%      94.8%
    Samaritan    0.0%    0.0%      93.3%
    Lemons       0.0%    0.0%      93.5%

Table 2 shows a sharp separation across methods. PS-BR achieves high late-round follow rates in all five environments, reaching 92.5% in BoS, 98.0% in PD, 94.8% in Promo, 93.3% in Samaritan, and 93.5% in Lemons. Thus, once the cooperative equilibrium is explicitly specified, the non-myopic planner tracks the intended long-horizon path quite reliably across all benchmark games. By contrast, Base remains at 0.0% in every environment. SCoT succeeds only in PD, where it reaches 100.0%, and remains at 0.0% in BoS, Promo, Samaritan, and Lemons. Since the three settings use nearly the same game instructions and history context, the main difference is the reasoning/decision strategy (direct action for Base, myopic predict-then-act for SCoT, and posterior-sampling best response with rollout planning for PS-BR). This pattern suggests that direct prompting is insufficient for following contingent cooperative equilibrium prescriptions, while myopic prompting can recover the simple stationary cooperative path in PD but not the richer coordination, punishment, or trust-based prescriptions in the other games. PS-BR's explicit modeling of opponent strategy and continuation value is what enables sustained on-path behavior in late rounds. The results in Table 2 provide a clear separation between myopic and non-myopic reasoning. Unlike Experiment 1, where multiple equilibrium-consistent outcomes were credited, this experiment sets up initial beliefs so that agents follow a specific cooperative equilibrium path that requires non-myopic reasoning. Under this stricter criterion, PS-BR consistently achieves high follow rates across all environments, whereas Base fails entirely and SCoT succeeds only in the simplest case (PD). This pattern aligns closely with the theoretical distinction developed in Section 5.
Implementing a nontrivial repeated-game equilibrium requires reasoning over continuation values: agents must understand that short-term deviations trigger future punishment, and that adherence to the cooperative path is optimal only when these future consequences are taken into account. PS-BR explicitly evaluates such continuation strategies through rollout, and therefore can internalize these long-horizon incentives. By contrast, SCoT operates on one-step predictions and local best responses, which are insufficient to sustain equilibria that depend on multi-period incentive compatibility. The one partial exception is Prisoner's Dilemma, where SCoT achieves perfect performance. This is consistent with the structure of the grim-trigger equilibrium in PD: the cooperative phase $(J,J)$ is itself a stage-game Pareto-dominant outcome and is locally consistent with mutual best responses under optimistic beliefs. As a result, myopic reasoning can incidentally align with the cooperative path. In contrast, games such as BoS, Promo, Samaritan, and Lemons require coordination on asymmetric roles, punishment phases, or trust-dependent behavior that cannot be justified purely from one-step optimization, making myopic approaches ineffective. More broadly, these results indicate that equilibrium selection and path-following are fundamentally harder than equilibrium action convergence. While Experiment 1 shows that simple reasoning can often eliminate non-equilibrium behavior, Experiment 2 demonstrates that sustaining a particular equilibrium, especially one supported by continuation incentives, requires explicit modeling of future play. This provides empirical support for the theoretical claim that the posterior-sampling best response, by operating over full continuation strategies, can implement repeated-game equilibria that lie beyond the reach of myopic predict-then-act rules.
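The continuation-value reasoning that SCoT lacks can be made concrete with the grim-trigger one-shot deviation check in PD. A minimal sketch, assuming the normalized discounted criterion $(1-\lambda)\sum_k \lambda^k r^{t+k}$ from Section 6 and the PD payoffs of Section 7.1 (the function name is illustrative):

```python
def grim_trigger_cooperation_is_optimal(coop, temptation, punish, lam):
    """One-shot deviation check for grim trigger in a symmetric PD.

    coop:       per-period payoff on the cooperative path, e.g. 3 in our PD
    temptation: one-period deviation payoff, e.g. 5
    punish:     per-period payoff during permanent punishment, e.g. 0
    lam:        discount factor lambda_i

    Normalized value of perpetual cooperation is simply `coop`;
    deviating yields `temptation` for one period, then permanent punishment.
    """
    v_coop = coop
    v_dev = (1 - lam) * temptation + lam * punish
    return v_coop >= v_dev
```

With the PD payoffs $(3, 5, 0)$ the condition reduces to $\lambda \ge 2/5$: a patient agent that evaluates continuation strategies sustains cooperation, while a myopic one sees only the temptation payoff.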
Having established this distinction under known and deterministic payoffs, we next consider a more realistic setting in which agents must simultaneously learn the payoff structure from noisy private observations while engaging in strategic interaction.

7.4 Experiment 3: Nontrivial Nash convergence under unknown payoffs

7.4.1 Setup

We keep the interaction protocol, horizons, and game set from Experiment 1 (Section 7.2) and Experiment 2 (Section 7.3), and modify only the payoff observations: agents no longer receive the payoff matrix in the prompt and instead learn solely from noisy, privately observed payoffs. For each benchmark game $g \in \{\text{BoS}, \text{PD}, \text{Promo}, \text{Samaritan}, \text{Lemons}\}$, let $u_i^g(a) \in \mathbb{R}$ denote the deterministic stage payoff from Experiment 1 for player $i$ and joint action $a \in A$. In Experiment 3, after the public joint action $a^t$ is realized, player $i$ observes a private payoff

$$r_i^t = u_i^g(a^t) + \epsilon_{i,t}, \qquad \epsilon_{i,t} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_g^2), \tag{13}$$

independent across players $i$ and rounds $t$. Players observe the full public action history but only their own payoff sequence $(r_i^t)_t$. All equilibrium notions continue to refer to the underlying mean-payoff repeated game induced by $u_i^g$.

Known common noise family, unknown mean matrix. Experiment 3 instantiates the private-payoff theory in the special case where the reward noise family is known and only the mean payoff matrix is unknown. Concretely, for each player $i$ and joint action $a$,

$$r_i^t \mid a^t = a \;\sim\; \mathcal{N}(m_i(a), \sigma_g^2),$$

where $\sigma_g^2$ is common knowledge and the unknown object is the matrix $m_i : A \to \mathbb{R}$. The finite reward menu used by PS-BR is therefore a finite menu of candidate mean matrices. Equivalently, each candidate matrix $m$ induces a full payoff kernel

$$q_i^m(\cdot \mid a) = \mathcal{N}(m(a), \sigma_g^2),$$

so payoff-matrix sampling in the implementation is exactly payoff-kernel sampling in the theory, expressed in mean-matrix coordinates.
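The observation model in (13) amounts to adding known-variance Gaussian noise to a fixed mean matrix. A minimal sketch, with an illustrative payoff dictionary standing in for a benchmark matrix:

```python
import random

# Sketch of the private-payoff observation model (13): after the public joint
# action a^t is realized, player i privately observes the true mean payoff
# plus i.i.d. Gaussian noise with known standard deviation sigma_g.
# The payoff dictionary below is illustrative, not a benchmark matrix.

def observe_payoff(u_i, joint_action, sigma_g, rng):
    """Return r_i^t = u_i(a^t) + eps_{i,t}, with eps ~ N(0, sigma_g^2)."""
    return u_i[joint_action] + rng.gauss(0.0, sigma_g)

u_i = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}
rng = random.Random(0)
samples = [observe_payoff(u_i, ("C", "C"), sigma_g=2.0, rng=rng)
           for _ in range(10000)]
print(round(sum(samples) / len(samples), 1))  # close to the true mean 3.0
```

Single observations are unreliable, but the empirical average identifies the mean payoff, which is exactly why the mean matrix remains learnable over repeated play.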
We choose a noise level large enough that, on a single step, the realized payoff can often reverse the ranking between two outcomes whose true mean payoffs differ by the smallest strategically relevant gap. Formally, for each game $g$, define the minimal nonzero payoff separation

$$\Delta_{\min,g} := \min_{i \in I} \min \{\, |u_i^g(a) - u_i^g(a')| : a, a' \in A,\ u_i^g(a) \neq u_i^g(a') \,\}. \tag{14}$$

For the payoff matrices used in Experiment 1, the smallest payoff gaps are $\Delta_{\min,\text{BoS}} = 3$ and $\Delta_{\min,\text{PD}} = 2$, while for Promo, Samaritan, and Lemons the smallest gap is 1. We set the Gaussian noise standard deviation to

$$\sigma_g = \Delta_{\min,g}. \tag{15}$$

With additive Gaussian noise, the noisy difference between two outcomes with mean gap $\Delta$ has standard deviation $\sqrt{2}\,\sigma_g$; hence when $\Delta = \Delta_{\min,g}$ and $\sigma_g = \Delta_{\min,g}$, a single observation reverses the sign of the comparison with probability $\Phi(-1/\sqrt{2}) \approx 0.24$. Thus, roughly one in four observations on the tightest gaps is directionally misleading, while averaging over time still reveals the true mean incentives.

We then repeat the same experiments as in Experiment 1 (late-round adherence to any Nash equilibrium path) and Experiment 2 (late-round adherence to the prompt-specified nontrivial cooperative Nash equilibrium path), using the same scoring window and reporting conventions; the only change is that agents must infer incentives from the private noisy payoffs (13) rather than reading $u_i^g$ from the prompt. To match Assumption 4, we equip each agent with a finite hypothesis class over the unknown mean payoff matrix. Fix a game $g$ and player $i$, and define the offset set

$$K := \{-2, -1.5, -1, -0.5, 0, +0.5, +1, +1.5, +2\}.$$

The finite menu of candidate mean matrices is

$$\mathcal{M}_{i,g} := \{\, m : A \to \mathbb{R} \;:\; m(a) = u_i^g(a) + k_a \sigma_g \text{ for each } a \in A, \text{ with } k_a \in K \,\}.$$
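The sign-flip probability quoted for this calibration can be checked numerically. The sketch below expresses the standard normal CDF via the error function; only the parameter values are taken from the text.

```python
import math

# Check of the noise calibration: with sigma_g equal to the minimal payoff
# gap, the noisy one-step difference between two outcomes has standard
# deviation sqrt(2)*sigma_g, so a single observation flips the sign of the
# comparison with probability Phi(-Delta/(sqrt(2)*sigma_g)) = Phi(-1/sqrt(2)).

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_flip = phi(-1.0 / math.sqrt(2.0))
print(round(p_flip, 3))  # ~0.24, i.e. roughly one in four observations
```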
In particular, the true mean matrix $u_i^g$ belongs to $\mathcal{M}_{i,g}$ by taking $k_a = 0$ for every joint action $a$. Operationally, player $i$ maintains a posterior over $\mathcal{M}_{i,g}$ using the Gaussian likelihood

$$\pi_i^t(m \mid h^t, r_i^{1:t-1}) \;\propto\; \pi_i^0(m) \prod_{s=1}^{t-1} \phi(r_i^s; m(a^s), \sigma_g^2),$$

where $\phi(\cdot\,; \mu, \sigma_g^2)$ is the Gaussian density. PS-BR then samples one candidate mean matrix from this posterior and evaluates continuation strategies against the induced payoff kernel. Because $\mathcal{M}_{i,g}$ has product form over joint actions, this posterior can be updated action-wise under a product prior over offsets $(k_a)_{a \in A}$; one need not enumerate the full menu explicitly in order to sample a complete mean matrix.

7.4.2 Results

We report two complementary late-round metrics under unknown stochastic payoffs: convergence to any Nash equilibrium action (Table 3) and follow-through on the prompt-specified cooperative Nash equilibrium path (Table 4).

Table 3: Unknown stochastic payoffs: equilibrium-follow percentage in late rounds (rounds 161–180) for any Nash equilibrium. Reported scores are averaged over 20 trials.

Game       Base    SCoT     PS-BR
BoS        60.0%   95.0%    99.8%
PD         60.0%   98.0%    98.0%
Promo      0.0%    100.0%   100.0%
Samaritan  0.0%    0.0%     96.2%
Lemons     0.0%    98.5%    82.5%

Table 4: Unknown stochastic payoffs: equilibrium-follow percentage in late rounds (rounds 161–180) for the prompt-specified cooperative Nash equilibrium. Reported scores are averaged over 20 trials.

Game       Base    SCoT    PS-BR
BoS        0.0%    0.0%    98.0%
PD         0.0%    0.0%    71.2%
Promo      0.0%    0.0%    71.0%
Samaritan  5.0%    0.0%    81.0%
Lemons     0.0%    0.0%    73.8%

On the broader "any Nash" metric (Table 3), SCoT still performs very strongly in BoS (95.0%), PD (98.0%), Promo (100.0%), and Lemons (98.5%), but falls to 0.0% in Samaritan. PS-BR is near-perfect in BoS (99.8%), PD (98.0%), and Promo (100.0%), remains strong in Samaritan (96.2%), and reaches 82.5% in Lemons.
Base remains limited, scoring 60.0% in BoS and PD and 0.0% in Promo, Samaritan, and Lemons. On the stricter prompt-specified cooperative-equilibrium metric (Table 4), by contrast, PS-BR remains the only method with substantial late-round follow-through under unknown payoffs: 98.0% in BoS, 71.2% in PD, 71.0% in Promo, 81.0% in Samaritan, and 73.8% in Lemons. Base and SCoT are at 0.0% in BoS, PD, Promo, and Lemons; in Samaritan, SCoT is at 0.0% and Base reaches only 5.0%. These results suggest that under noisy private payoffs, myopic reasoning is often still enough to reach some equilibrium-like late-round behavior, but not to track the specific long-horizon cooperative prescription; the non-myopic planner, PS-BR, retains a clear advantage when the task requires identifying and sustaining the intended cooperative repeated-game path.

Accordingly, Experiment 3 should be interpreted as testing strategic learning under noisy private observations of an unknown mean-payoff matrix, rather than learning an arbitrary payoff distribution. The informational difficulty comes from identifying the mean incentives relevant for continuation planning, while the noise family itself is held fixed and known.

Taken together, Tables 3 and 4 show that payoff uncertainty preserves the basic separation observed in the deterministic-payoff experiments, while also making the task meaningfully harder. On the broader "any Nash" metric, both SCoT and PS-BR still often reach equilibrium-consistent late-round behavior, indicating that noisy private payoffs do not prevent agents from eventually identifying at least some strategically stable pattern of play. This is consistent with the idea that coarse equilibrium-action convergence can survive substantial observational noise as long as the underlying incentives remain learnable over repeated interaction. However, the stricter cooperative-equilibrium metric reveals a much sharper distinction.
Under unknown payoffs, PS-BR remains the only method that reliably tracks the prompt-specified nontrivial repeated-game equilibrium across all environments, whereas Base and SCoT almost completely fail. This gap is important because it shows that the main difficulty is not merely predicting the opponent’s next move, but jointly inferring the payoff structure and reasoning over continuation incentives. To sustain a particular cooperative equilibrium under payoff uncertainty, an agent must learn which action profiles are valuable, which deviations are tempting, and why future punishments make cooperation incentive compatible. PS-BR is designed to do exactly this by sampling both opponent strategies and payoff hypotheses and then planning against the sampled continuation game. The fact that PS-BR still performs well, though less perfectly than in the known-payoff case, is also informative. Relative to Table 2, follow rates decline in PD, Promo, Samaritan, and Lemons once payoffs must be learned from noisy private observations. This is the expected direction: payoff uncertainty introduces an additional layer of posterior dispersion, so even when the opponent strategy is inferred correctly, errors in the learned payoff model can still distort continuation-value comparisons. In other words, the unknown-payoff setting does not overturn the mechanism established earlier, but it weakens it quantitatively by making both belief learning and best-response computation noisier. At the same time, the results suggest that the theoretical extension in Section 6 is empirically meaningful rather than merely formal. The model class that explicitly represents uncertainty over payoffs and updates from private observations retains a substantial advantage precisely in the environments where long-horizon repeated-game incentives matter most. 
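The payoff-hypothesis side of this mechanism follows directly from the offset-menu construction in the setup. The sketch below is an illustrative reimplementation, not the paper's code; the payoff values and observation counts are made up.

```python
import math, random

# Sketch of action-wise posterior sampling over the offset menu K from the
# setup: under a uniform product prior over offsets k_a, the posterior over
# candidate means base + k*sigma_g factorizes across joint actions, so a
# full candidate mean matrix can be sampled offset by offset.

K = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

def posterior_over_offsets(base, observations, sigma_g):
    """Posterior weights over candidate means base + k*sigma_g, k in K."""
    log_w = []
    for k in K:
        mean = base + k * sigma_g
        # Gaussian log-likelihood up to a constant; uniform prior over K.
        log_w.append(sum(-(r - mean) ** 2 / (2 * sigma_g ** 2)
                         for r in observations))
    top = max(log_w)
    w = [math.exp(l - top) for l in log_w]
    total = sum(w)
    return [x / total for x in w]

def sample_mean_matrix(base_u, obs_by_action, sigma_g, rng):
    """Posterior-sample one candidate mean matrix, one joint action at a time."""
    sampled = {}
    for a, base in base_u.items():
        weights = posterior_over_offsets(base, obs_by_action.get(a, []), sigma_g)
        k = rng.choices(K, weights=weights)[0]
        sampled[a] = base + k * sigma_g
    return sampled

# Toy usage: 50 noisy observations near the true mean 3.0 concentrate the
# posterior on the zero offset.
obs = {("C", "C"): [3.1, 2.8, 3.3, 2.9, 3.0] * 10}
w = posterior_over_offsets(3.0, obs[("C", "C")], sigma_g=1.0)
print(round(w[K.index(0.0)], 3))  # posterior mass on the true offset, near 1
```

A sampled matrix then feeds the rollout planner: with enough observations the posterior concentrates on the true means, so the sampled continuation game, and hence the computed best response, stabilizes.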
Thus, the experiments support the broader claim of the paper: reasonably reasoning agents need not know the full game in advance to move toward equilibrium-like behavior. What matters is whether they can infer both the strategic behavior of others and the payoff consequences of interaction well enough to approximate continuation best responses on the realized path.

Overall, the three experiments draw a coherent empirical picture. Simple predict–then–act reasoning is often sufficient for convergence to some stage-game or equilibrium-consistent action pattern. But when the objective is to implement a specific nontrivial repeated-game equilibrium, especially under realistic informational frictions such as unknown and stochastic payoffs, explicit continuation-level reasoning becomes decisive. This is exactly the regime in which PS-BR provides a robust advantage, matching the central theoretical message of the paper.

8 Conclusion

In this paper, we theoretically highlight the promising prospect that general-purpose AI agents can attain game-theoretic robustness through inherent reasoning capabilities rather than bespoke training. By demonstrating that LLMs can evolve toward equilibrium behavior on the fly, we take a step toward safer and more autonomous multi-agent AI systems that remain effective across the myriad interactive scenarios they will encounter in the real world. The results bridge the gap between AI agents and classical game theory, indicating that the rich knowledge and inferential power of modern LLMs may be harnessed to meet longstanding challenges in multi-agent learning and interaction. Ultimately, enabling LLM-based agents to naturally exhibit equilibrium-like behavior during play not only advances our theoretical understanding of their behavior but also paves the way for their deployment in societally crucial domains that require reliable strategic decision-making.

References

[1] D.
Abreu (1988) On the theory of infinitely repeated games with discounting. Econometrica: Journal of the Econometric Society, p. 383–396.
[2] K. Agrawal, V. Teo, J. J. Vazquez, S. Kunnavakkam, V. Srikanth, and A. Liu (2025) Evaluating LLM agent collusion in double auctions. arXiv preprint arXiv:2507.01413.
[3] E. Akata, L. Schulz, J. Coda-Forno, S. J. Oh, M. Bethge, and E. Schulz (2025) Playing repeated games with large language models. Nature Human Behaviour 9 (7), p. 1380–1390.
[4] M. Aoyagi, G. R. Fréchette, and S. Yuksel (2024) Beliefs in repeated games: an experiment. American Economic Review 114 (12), p. 3944–3975.
[5] D. Arumugam and T. L. Griffiths (2025) Toward efficient exploration by large language model agents. arXiv preprint arXiv:2504.20997.
[6] S. Assad, R. Clark, D. Ershov, and L. Xu (2024) Algorithmic pricing and competition: empirical evidence from the German retail gasoline market. Journal of Political Economy 132 (3), p. 723–771.
[7] R. J. Aumann (1961) Mixed and behavior strategies in infinite extensive games. Princeton University, Princeton.
[8] G. Bansal, W. Hua, Z. Huang, A. Fourney, A. Swearngin, W. Epperson, T. Payne, J. M. Hofman, B. Lucier, C. Singh, et al. (2025) Magentic Marketplace: an open-source environment for studying agentic markets. arXiv preprint arXiv:2510.25779.
[9] F. Bianchi, P. J. Chia, M. Yuksekgonul, J. Tagliabue, D. Jurafsky, and J. Zou (2024) How well can LLMs negotiate? NegotiationArena platform and analysis. arXiv preprint arXiv:2402.05863.
[10] D. Blackwell and L. Dubins (1962) Merging of opinions with increasing information. The Annals of Mathematical Statistics 33 (3), p. 882–886.
[11] Z. Y. Brown and A. MacKay (2023) Competition in pricing algorithms. American Economic Journal: Microeconomics 15 (2), p. 109–156.
[12] A. Buscemi, D. Proverbio, A. Di Stefano, T. A. Han, G. Castignani, and P. Di Liò (2025) Fairgame: a framework for AI agents bias recognition using game theory. arXiv preprint arXiv:2504.14325.
[13] S. Cahyawijaya, H. Lovenia, and P. Fung (2024) LLMs are few-shot in-context low-resource language learners. arXiv preprint arXiv:2403.16512.
[14] T. T. Cai, H. Namkoong, D. Russo, and K. W. Zhang (2024) Active exploration via autoregressive generation of missing data. arXiv preprint arXiv:2405.19466.
[15] E. Calvano, G. Calzolari, V. Denicolo, and S. Pastorello (2020) Artificial intelligence, algorithmic pricing, and collusion. American Economic Review 110 (10), p. 3267–3297.
[16] J. Coda-Forno, M. Binz, Z. Akata, M. Botvinick, J. Wang, and E. Schulz (2023) Meta-in-context learning in large language models. Advances in Neural Information Processing Systems 36, p. 65189–65201.
[17] J. Duan, R. Zhang, J. Diffenderfer, B. Kailkhura, L. Sun, E. Stengel-Eskin, M. Bansal, T. Chen, and K. Xu (2024) GTBench: uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations. arXiv preprint arXiv:2402.12348.
[18] J. A. Duque, M. Aghajohari, T. Cooijmans, R. Ciuca, T. Zhang, G. Gidel, and A. Courville (2024) Advantage alignment algorithms. arXiv preprint arXiv:2406.14662.
[19] R. Durrett (2019) Probability: theory and examples. 5th edition, Cambridge University Press. Note: see Theorem 2.1.21 (Kolmogorov's extension theorem).
[20] F. Falck, Z. Wang, and C. Holmes (2024) Is in-context learning in large language models Bayesian? A martingale perspective. arXiv preprint arXiv:2406.00793.
[21] C. Fan, J. Chen, Y. Jin, and H. He (2023) Can large language models serve as rational players in game theory? A systematic analysis. AAAI 2024. arXiv preprint arXiv:2312.05488.
[22] S. Fish, Y. A. Gonczarowski, and R. I. Shorrer (2024) Algorithmic collusion by large language models. arXiv preprint arXiv:2404.00806.
[23] N. Fontana, F. Pierri, and L. M. Aiello (2024) Nicer than humans: how do large language models behave in the prisoner's dilemma? arXiv preprint arXiv:2406.13605.
[24] L. Ge, Y. Zhang, and Y. Vorobeychik (2026) Mind the (dh) gap! A contrast in risky choices between reasoning and conversational LLMs. arXiv preprint arXiv:2602.15173.
[25] D. Gill and Y. Rosokha (2024) Beliefs, learning, and personality in the indefinitely repeated prisoner's dilemma. American Economic Journal: Microeconomics 16 (3), p. 259–283.
[26] F. Guo (2023) GPT in game theory experiments. arXiv preprint arXiv:2305.05516.
[27] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang (2024) Large language model based multi-agents: a survey of progress and challenges. arXiv preprint arXiv:2402.01680.
[28] X. Guo, K. Huang, J. Liu, W. Fan, N. Vélez, Q. Wu, H. Wang, T. L. Griffiths, and M. Wang (2024) Embodied LLM agents learn to cooperate in organized teams. arXiv preprint arXiv:2403.12482.
[29] W. Hua, O. Liu, L. Li, A. Amayuelas, J. Chen, L. Jiang, M. Jin, L. Fan, F. Sun, W. Wang, et al. (2024) Game-theoretic LLM: agent workflow for negotiation games. arXiv preprint arXiv:2411.05990.
[30] J. Huang, E. J. Li, M. H. Lam, T. Liang, W. Wang, Y. Yuan, W. Jiao, X. Wang, Z. Tu, and M. R. Lyu (2024) How far are we on the decision-making of LLMs? Evaluating LLMs' gaming ability in multi-agent environments. arXiv preprint arXiv:2403.11807.
[31] J. Jia, Z. Yuan, J. Pan, P. E. McNamara, and D. Chen (2025) LLM strategic reasoning: agentic study through behavioral game theory. arXiv preprint arXiv:2502.20432.
[32] G. Kader and D. Lee (2024) The emergence of strategic reasoning of large language models. arXiv preprint arXiv:2412.13013.
[33] E. Kalai and E. Lehrer (1993) Rational learning leads to Nash equilibrium. Econometrica: Journal of the Econometric Society, p. 1019–1045.
[34] E. Kalai and E. Lehrer (1993) Subjective equilibrium in repeated games. Econometrica 61 (5), p. 1231–1240.
[35] H. W. Kuhn (1953) Extensive games and the problem of information. Contributions to the Theory of Games 2 (28), p. 193–216.
[36] R. Lal (1990) Price promotions: limiting competitive encroachment. Marketing Science 9 (3), p. 247–262.
[37] Y. Li, W. Zhang, J. Wang, S. Zhang, Y. Du, Y. Wen, and W. Pan (2024) Aligning individual and collective objectives in multi-agent cooperation. Advances in Neural Information Processing Systems 37, p. 44735–44760.
[38] A. Lopez-Lira (2025) Can large language models trade? Testing financial theories with LLM agents in market simulations. arXiv preprint arXiv:2504.10789.
[39] S. Lu, I. Bigoulaeva, R. Sachdeva, H. T. Madabushi, and I. Gurevych (2024) Are emergent abilities in large language models just in-context learning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 5098–5139.
[40] S. Mao, Y. Cai, Y. Xia, W. Wu, X. Wang, F. Wang, T. Ge, and F. Wei (2023) ALYMPICS: LLM agents meet game theory – exploring strategic decision-making with AI agents. arXiv preprint arXiv:2311.03220.
[41] J. H. Nachbar (1997) Prediction, optimization, and learning in repeated games. Econometrica: Journal of the Econometric Society, p. 275–309.
[42] J. H. Nachbar (2005) Beliefs in repeated games. Econometrica 73 (2), p. 459–480.
[43] T. W. Norman (2022) The possibility of Bayesian learning in repeated games. Games and Economic Behavior 136, p. 142–152.
[44] C. Park, X. Liu, A. Ozdaglar, and K. Zhang (2024) Do LLM agents have regret? A case study in online learning and games. arXiv preprint arXiv:2403.16843.
[45] X. Qu, A. Damoah, J. Sherwood, P. Liu, C. S. Jin, L. Chen, M. Shen, N. Aleisa, Z. Hou, C. Zhang, et al. (2025) A comprehensive review of AI agents: transforming possibilities in technology and beyond. arXiv preprint arXiv:2508.11957.
[46] Qwen Team (2026) Qwen3.5: towards native multimodal agents.
[47] M. Riemer, Z. Ashktorab, D. Bouneffouf, P. Das, M. Liu, J. D. Weisz, and M. Campbell (2024) Position: theory of mind benchmarks are broken for large language models. arXiv preprint arXiv:2412.19726.
[48] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2025) OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
[49] H. Sun, Y. Wu, P. Wang, W. Chen, Y. Cheng, X. Deng, and X. Chu (2025) Game theory meets large language models: a systematic survey with taxonomy and new frontiers. arXiv preprint arXiv:2502.09053.
[50] T. Wakayama and T. Suzuki (2025) In-context learning is provably Bayesian inference: a generalization theory for meta-learning. arXiv preprint arXiv:2510.10981.
[51] X. Wang, W. Zhu, M. Saxon, M. Steyvers, and W. Y. Wang (2023) Large language models are latent variable models: explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems 36, p. 15614–15638.
[52] S. Welleck, A. Bertsch, M. Finlayson, H. Schoelkopf, A. Xie, G. Neubig, I. Kulikov, and Z. Harchaoui (2024) From decoding to meta-generation: inference-time algorithms for large language models. arXiv preprint arXiv:2406.16838.
[53] R. Willis et al. (2025) Will systems of LLM agents cooperate: an investigation into a social dilemma. arXiv preprint arXiv:2501.16173.
[54] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma (2021) An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080.
[55] K. Yamin, J. Tang, S. Cortes-Gomez, A. Sharma, E. Horvitz, and B. Wilder (2026) Do LLMs act like rational agents? Measuring belief coherence in probabilistic decision making. arXiv preprint arXiv:2602.06286.
[56] K. W. Zhang, T. Cai, H. Namkoong, and D. Russo (2024) Posterior sampling via autoregressive generation. In NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty.
[57] Y. Zhang, F. Zhang, Z. Yang, and Z. Wang (2023) What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420.
[58] P. Zhou, A. Madaan, S. P. Potharaju, A. Gupta, K. R. McKee, A. Holtzman, J. Pujara, X. Ren, S. Mishra, A. Nematzadeh, et al. (2023) How far are large language models from agents with theory-of-mind? arXiv preprint arXiv:2310.03051.
[59] S. Zhu, J. Sun, Y. Nian, T. South, A. Pentland, and J. Pei (2025) The automated but risky game: modeling and benchmarking agent-to-agent negotiations and transactions in consumer markets. arXiv preprint arXiv:2506.00073.

Appendix A Continuity and Finite-Horizon Robustness

Lemma A.1 (Continuity of discounted payoff).
For each agent $i$ and every $\delta > 0$, there exists $\rho_i(\delta) > 0$ such that for any strategy profiles $f, g \in \mathcal{F}$,

$$d(\mu^f, \mu^g) \le \rho_i(\delta) \;\implies\; |U_i(f) - U_i(g)| \le \delta.$$

In particular, if $\rho(\delta) = \min_{i \in I} \rho_i(\delta)$ and $d(\mu^f, \mu^g) \le \rho(\delta)$, then $|U_i(f) - U_i(g)| \le \delta$ for all $i \in I$.

A.1 Finite-horizon variants and robustness

For a finite horizon $T \in \mathbb{N}$, we denote by $\mathcal{F}^T$ the set of behaviour strategies specified on histories of length at most $T$; two full strategies that coincide on these histories induce the same distribution over histories up to time $T$ and the same truncated payoff. For $f \in \mathcal{F}^T$, define the $T$-period discounted payoff

$$U_i^T(f) = \mathbb{E}_{z \sim \mu^f}\Big[(1 - \lambda_i) \sum_{t=1}^{T} \lambda_i^{t-1} u_i(z^t)\Big].$$

Definition 15 (Finite-horizon weak ξ-subjective η-equilibrium). Let $\xi, \eta \ge 0$ and fix a horizon $T$. A truncated strategy profile $f \in \mathcal{F}^T$ is a finite-horizon weak ξ-subjective η-equilibrium if for each agent $i \in I$ there exists a supporting truncated profile $f^i \in \mathcal{F}^T$ such that:

• $f_i^i = f_i$;
• $U_i^T(f_i, f_{-i}^i) \ge \sup_{g_i \in \mathcal{F}_i^T} U_i^T(g_i, f_{-i}^i) - \xi$;
• $d(\mu^{f^i}, \mu^f) \le \eta$ when $d$ is computed using only cylinder events in $\mathcal{B}^t$ with $t \le T$.

We now show that finite-horizon weak subjective equilibria can be "patched" into approximate finite-horizon Nash equilibria without changing the induced distribution of play up to time $T$.

Lemma A.2 (Finite-horizon purification for η = 0 [43]). Fix a finite horizon $T$ and a profile $f \in \mathcal{F}^T$. Suppose $f$ is a finite-horizon weak ψ-subjective 0-equilibrium for some $\psi \ge 0$.
Then there exists a truncated strategy profile $\hat{f} \in \mathcal{F}^T$ such that:

• $\hat{f}$ is a ψ-Nash equilibrium of the $T$-period game, i.e., for all $i \in I$ and all $g_i \in \mathcal{F}_i^T$,
$$U_i^T(\hat{f}_i, \hat{f}_{-i}) \;\ge\; U_i^T(g_i, \hat{f}_{-i}) - \psi;$$
• the induced distributions of histories of length at most $T$ coincide: for every $E \in \mathcal{B}^T$, $\mu^{\hat{f}}(E) = \mu^f(E)$.

We next extend this to the case where $\eta > 0$ but small, using a compactness and limit argument.

Lemma A.3 (Finite-horizon robustness). Fix a finite horizon $T$ and $\psi > 0$. For every $\theta > 0$ there exists $\bar{\eta}_T(\psi, \theta) > 0$ such that: if $f \in \mathcal{F}^T$ is a finite-horizon weak ψ-subjective η-equilibrium with $\eta \le \bar{\eta}_T(\psi, \theta)$, then there exists a ψ-Nash equilibrium $\hat{f} \in \mathcal{F}^T$ satisfying $d(\mu^{\hat{f}}, \mu^f) \le \theta$ (again with $d$ computed on cylinder events of length at most $T$).

We now patch finite-horizon robustness to the infinite-horizon game by truncating the payoff at a sufficiently large horizon and using Lemma A.1; the resulting infinite-horizon patching lemma is recorded below.

Lemma A.4 (Infinite-horizon patching). Fix $\xi > 0$ and $\varepsilon > 0$. There exists $\hat{\eta}(\xi, \varepsilon) > 0$ such that if $f \in \mathcal{F}$ is a weak ξ-subjective η-equilibrium in the sense of Definition 8 with $\eta \le \hat{\eta}(\xi, \varepsilon)$, then there exists a strategy profile $\hat{f} \in \mathcal{F}$ satisfying:

• $\hat{f}$ is a $(\xi + \varepsilon)$-Nash equilibrium of the infinite-horizon game;
• $d(\mu^{\hat{f}}, \mu^f) \le \varepsilon$.

Remark 3 (Continuation-game analogues). Lemmas A.2–A.4 apply verbatim to continuation games after any history $h^t$ by interpreting $U_i(\cdot)$ as the continuation payoff from $h^t$ and $d(\cdot, \cdot)$ as the weak distance between $\mu^g_{h^t}$ and $\mu^{g'}_{h^t}$. They also apply verbatim to the private-payoff continuation game after any realized information-history vector $x^t$ when $\mathcal{F}_i$ is replaced by $\Sigma_i$, histories $h^t$ are replaced by $x^t$, payoffs are $U_i(\tau \mid x^t)$, and the weak distance is computed on the public-action marginals $\bar{\mu}^{\tau,u}_{x^t}$.
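The truncation arguments used throughout these lemmas rest on the geometric tail of the discounted sum being uniformly small. A quick numerical illustration (parameter values are illustrative, not taken from the paper):

```python
import math

# The discounted tail (1 - lam) * sum_{t=T+1..inf} lam^(t-1) telescopes to
# lam^T exactly, so any tolerance tol can be met by choosing the horizon
# T = ceil(log(tol) / log(lam)). Parameters below are illustrative.

def tail(lam, T):
    """Closed form of (1 - lam) * sum_{t > T} lam^(t-1)."""
    return lam ** T

def horizon_for(tol, lam):
    """Smallest integer T with lam^T <= tol."""
    return math.ceil(math.log(tol) / math.log(lam))

lam, delta = 0.9, 0.1
T = horizon_for(delta / 4, lam)   # tail tolerance delta/4
print(T, tail(lam, T) <= delta / 4)  # 36 True
```

This is why a single finite horizon suffices simultaneously for the payoff truncation and the weak-metric truncation: both tails are geometric and shrink uniformly in the profile.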
Appendix B Proofs

Proof of Lemma A.1. Fix $i$ and $\delta > 0$. Choose a finite horizon $T \in \mathbb{N}$ large enough that

$$(1 - \lambda_i) \sum_{t=T+1}^{\infty} \lambda_i^{t-1} \le \frac{\delta}{4}. \tag{16}$$

For any profile $g \in \mathcal{F}$, define the truncated payoff

$$U_i^T(g) = \mathbb{E}_{z \sim \mu^g}\Big[(1 - \lambda_i) \sum_{t=1}^{T} \lambda_i^{t-1} u_i(z^t)\Big].$$

Then for any $g$ we have $|U_i(g) - U_i^T(g)| \le (1 - \lambda_i) \sum_{t=T+1}^{\infty} \lambda_i^{t-1} \le \delta/4$ by (16), using that $u_i(\cdot) \in [0, 1]$.

Now fix $f, g \in \mathcal{F}$. We can decompose

$$|U_i(f) - U_i(g)| \le |U_i(f) - U_i^T(f)| + |U_i^T(f) - U_i^T(g)| + |U_i^T(g) - U_i(g)|.$$

By the bound above, the first and third terms are each at most $\delta/4$. It remains to control $|U_i^T(f) - U_i^T(g)|$.

For each $t \in \{1, \dots, T\}$ and each joint action profile $a \in A$, let

$$\alpha_t^f(a) = \mu^f(\{z \in H^\infty : z^t = a\}), \qquad \alpha_t^g(a) = \mu^g(\{z \in H^\infty : z^t = a\}).$$

Since $u_i(a) \in [0, 1]$ for all $a$, we have

$$\Big|\sum_{a \in A} u_i(a)\big(\alpha_t^f(a) - \alpha_t^g(a)\big)\Big| \le \sup_{E \in \mathcal{B}^t} |\mu^f(E) - \mu^g(E)|.$$

Hence

$$|U_i^T(f) - U_i^T(g)| = \Big|\sum_{t=1}^{T} (1 - \lambda_i) \lambda_i^{t-1} \sum_{a \in A} u_i(a)\big(\alpha_t^f(a) - \alpha_t^g(a)\big)\Big| \le \sum_{t=1}^{T} (1 - \lambda_i) \lambda_i^{t-1} \sup_{E \in \mathcal{B}^t} |\mu^f(E) - \mu^g(E)|.$$

By the definition (6) of $d(\mu^f, \mu^g)$, for each $t$ we have $2^{-t} \sup_{E \in \mathcal{B}^t} |\mu^f(E) - \mu^g(E)| \le d(\mu^f, \mu^g)$, hence $\sup_{E \in \mathcal{B}^t} |\mu^f(E) - \mu^g(E)| \le 2^t d(\mu^f, \mu^g)$. Thus

$$|U_i^T(f) - U_i^T(g)| \le d(\mu^f, \mu^g) \sum_{t=1}^{T} (1 - \lambda_i) \lambda_i^{t-1} 2^t.$$

The finite sum on the right depends only on $T$ and $\lambda_i$; call it $C_i(T)$. Define

$$\rho_i(\delta) = \min\Big\{\frac{\delta}{4\, C_i(T)},\; 1\Big\}.$$

If $d(\mu^f, \mu^g) \le \rho_i(\delta)$, then $|U_i^T(f) - U_i^T(g)| \le C_i(T)\, \rho_i(\delta) \le \delta/4$.
Combining the three bounds gives

$$|U_i(f) - U_i(g)| \le \frac{\delta}{4} + \frac{\delta}{4} + \frac{\delta}{4} < \delta.$$

Setting $\rho(\delta) = \min_{i \in I} \rho_i(\delta)$ yields the final claim. ∎

Proof of Lemma A.2. This is the finite-horizon analogue of the "purification" or "deviation-tree patching" result for weak subjective equilibria in [43]. The key idea is to modify off-path behavior so that, for each player $i$, any history that can only arise from a deviation by $i$ triggers opponents' play according to the supporting profile $f^i$ (which makes $f_i$ a ψ-best response), while on-path histories preserve the original profile $f$. Formally, one constructs a deviation tree for each player and assigns to each subtree corresponding to a first deviation by $i$ the opponents' strategies from $f^i_{-i}$, keeping $f$ on the non-deviation branch. This construction ensures: (i) if all players follow $\hat{f}$, the induced distribution of histories up to time $T$ coincides with that under $f$ (item 2); and (ii) any unilateral deviation by player $i$ induces, up to time $T$, the same distribution of histories as deviating against $f^i_{-i}$, against which $f_i$ is a ψ-best reply by Definition 15. Therefore $\hat{f}$ is a ψ-Nash equilibrium of the $T$-period game (item 1). A detailed construction and proof of these properties is given in [43], Proposition 3.1, and the associated deviation-tree arguments; our setting is the same repeated-game environment, so the proof carries over verbatim. ∎

Proof of Lemma A.3. Suppose, towards a contradiction, that there exist $T$, $\psi > 0$, and $\theta > 0$ such that for every $m \in \mathbb{N}$ there is a finite-horizon weak ψ-subjective $\eta_m$-equilibrium $f^{(m)} \in \mathcal{F}^T$ with $\eta_m \le 1/m$ and such that no ψ-Nash equilibrium lies within weak distance $\theta$ of $\mu^{f^{(m)}}$ (measured on $\mathcal{B}^T$).
For each $m$ and each $i \in I$, let $f^{i,(m)}$ be a supporting truncated profile witnessing that $f^{(m)}$ is a finite-horizon weak ψ-subjective $\eta_m$-equilibrium, i.e.,

$$f_i^{i,(m)} = f_i^{(m)}, \qquad U_i^T(f_i^{(m)}, f_{-i}^{i,(m)}) \ge \sup_{g_i \in \mathcal{F}_i^T} U_i^T(g_i, f_{-i}^{i,(m)}) - \psi, \qquad d(\mu^{f^{i,(m)}}, \mu^{f^{(m)}}) \le \eta_m.$$

Because the horizon $T$ and action sets are finite, the space of behaviour strategies $\mathcal{F}^T$ is a finite-dimensional product of simplices and hence compact in the product topology. Thus, by sequential compactness, there exists a subsequence (which we relabel for notational convenience) such that

$$f^{(m)} \to f^\star \quad \text{and} \quad f^{i,(m)} \to f^{i,\star} \text{ for all } i \in I,$$

as $m \to \infty$, in the product topology on $\mathcal{F}^T$. The map $f \mapsto \mu^f$ on finite histories (up to time $T$) is continuous with respect to this topology and the weak topology induced by $d$ (restricted to $\mathcal{B}^T$), so

$$\mu^{f^{(m)}} \to \mu^{f^\star}, \qquad \mu^{f^{i,(m)}} \to \mu^{f^{i,\star}}.$$

Since $d(\mu^{f^{i,(m)}}, \mu^{f^{(m)}}) \le \eta_m \to 0$, we must have $d(\mu^{f^{i,\star}}, \mu^{f^\star}) = 0$, so $\mu^{f^{i,\star}} = \mu^{f^\star}$ on $\mathcal{B}^T$.

Moreover, the best-response inequality passes to the limit. Fix $i$ and any $g_i \in \mathcal{F}_i^T$. For all $m$,

$$U_i^T(f_i^{(m)}, f_{-i}^{i,(m)}) \ge \sup_{g_i' \in \mathcal{F}_i^T} U_i^T(g_i', f_{-i}^{i,(m)}) - \psi \ge U_i^T(g_i, f_{-i}^{i,(m)}) - \psi.$$

By continuity of $U_i^T$ in the product topology (an immediate consequence of Lemma A.1 restricted to horizon $T$), taking $m \to \infty$ yields

$$U_i^T(f_i^\star, f_{-i}^{i,\star}) \ge U_i^T(g_i, f_{-i}^{i,\star}) - \psi.$$

Since $g_i$ was arbitrary and $f_i^{i,\star} = f_i^\star$ (by pointwise convergence of $f_i^{i,(m)}$ to $f_i^{i,\star}$ and of $f_i^{(m)}$ to $f_i^\star$), we conclude that

$$U_i^T(f_i^\star, f_{-i}^{i,\star}) \ge \sup_{g_i \in \mathcal{F}_i^T} U_i^T(g_i, f_{-i}^{i,\star}) - \psi.$$

Together with $d(\mu^{f^{i,\star}}, \mu^{f^\star}) = 0$, this shows that $f^\star$ is a finite-horizon weak ψ-subjective 0-equilibrium of the $T$-period game.
By Lemma A.2, there exists a profile $\hat f^\star\in\mathcal{F}^T$ such that $\hat f^\star$ is a $\psi$-Nash equilibrium of the $T$-period game and $\mu^{\hat f^\star}$ coincides with $\mu^{f^\star}$ on histories of length at most $T$. In particular, $d(\mu^{\hat f^\star},\mu^{f^\star})=0$. Since $\mu^{f^{(m)}}\to\mu^{f^\star}$ in the weak metric $d$ (restricted to $\mathcal{B}^T$), we have $d(\mu^{f^{(m)}},\mu^{\hat f^\star})\to 0$ as $m\to\infty$. Thus for all sufficiently large $m$, $d(\mu^{f^{(m)}},\mu^{\hat f^\star})\le\theta$. But $\hat f^\star$ is a $\psi$-Nash equilibrium, contradicting the assumption that no $\psi$-Nash equilibrium lies within weak distance $\theta$ of $\mu^{f^{(m)}}$. This contradiction shows that such a sequence $(f^{(m)})$ cannot exist, and hence there must exist $\bar\eta_T(\psi,\theta)>0$ with the stated property. ∎

Proof of Lemma A.4. Fix $\xi>0$ and $\varepsilon>0$. Choose a finite horizon $T$ large enough that, for all $i\in I$ and all profiles $h\in\mathcal{F}$,
$$|U_i(h)-U_i^T(h)|\ \le\ \tfrac{\varepsilon}{8},\tag{17}$$
and also
$$\sum_{t>T}2^{-t}\ \le\ \tfrac{\varepsilon}{4}.\tag{18}$$
Such a $T$ exists because the tails of both geometric series are uniformly small. Let $f$ be a weak $\xi$-subjective $\eta$-equilibrium with supporting profiles $\{f^i\}_{i\in I}$ as in Definition 8, i.e., for each $i$,
$$f_i^i=f_i,\qquad U_i(f_i,f_{-i}^i)\ \ge\ \sup_{g_i\in\mathcal{F}_i}U_i(g_i,f_{-i}^i)-\xi,\qquad d(\mu^{f^i},\mu^f)\le\eta.$$
Consider the truncated profiles $f^{(T)}$ and $(f^i)^{(T)}$ obtained by restricting the prescriptions of $f$ and $f^i$ to histories of length at most $T$. For each $i$ we have $(f_i^i)^{(T)}=f_i^{(T)}$ and, since the weak distance on histories up to $T$ is bounded by the full weak distance,
$$d(\mu^{(f^i)^{(T)}},\mu^{f^{(T)}})\ \le\ d(\mu^{f^i},\mu^f)\ \le\ \eta.$$
We now show that $f^{(T)}$ is a finite-horizon weak $\psi_T$-subjective $\eta$-equilibrium for a slightly relaxed parameter $\psi_T$. Fix $i$ and note that for any profile $h$, $|U_i(h)-U_i^T(h)|\le\varepsilon/8$ by (17).
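The horizon choice in (17)–(18) can be made concrete: the tail of the weight series is $\sum_{t>T}2^{-t}=2^{-T}$, so $T=\lceil\log_2(4/\varepsilon)\rceil$ suffices for (18). A minimal numeric sketch (the helper name is ours):

```python
import math

# Horizon selection for the tail bound (18): sum_{t>T} 2^{-t} = 2^{-T},
# so T = ceil(log2(4/eps)) makes the tail at most eps/4.
def horizon_for(eps):
    T = math.ceil(math.log2(4.0 / eps))
    assert 2.0 ** -T <= eps / 4.0   # bound (18) holds for this T
    return T

print(horizon_for(0.1))   # T = 6: 2^-6 = 1/64 <= 0.1/4
```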
Using the weak subjective inequality for $f$ and $f^i$, we obtain
$$U_i^T(f_i^{(T)},(f_{-i}^i)^{(T)})\ =\ U_i^T(f_i,f_{-i}^i)\ \ge\ U_i(f_i,f_{-i}^i)-\tfrac{\varepsilon}{8}\ \ge\ \sup_{g_i\in\mathcal{F}_i}U_i(g_i,f_{-i}^i)-\xi-\tfrac{\varepsilon}{8}.$$
For any truncated deviation $g_i^{(T)}\in\mathcal{F}_i^T$ we can extend it arbitrarily to a full strategy $g_i\in\mathcal{F}_i$, and then
$$U_i(g_i,f_{-i}^i)\ \ge\ U_i^T(g_i^{(T)},(f_{-i}^i)^{(T)})-\tfrac{\varepsilon}{8},$$
again by (17). Taking the supremum over $g_i^{(T)}$ yields
$$U_i^T(f_i^{(T)},(f_{-i}^i)^{(T)})\ \ge\ \sup_{g_i^{(T)}\in\mathcal{F}_i^T}U_i^T(g_i^{(T)},(f_{-i}^i)^{(T)})-\xi-\tfrac{\varepsilon}{4}.$$
Thus, if we define $\psi_T:=\xi+\tfrac{\varepsilon}{4}$, then for each $i$ the truncated profiles $f^{(T)}$ and $(f^i)^{(T)}$ satisfy
$$U_i^T(f_i^{(T)},(f_{-i}^i)^{(T)})\ \ge\ \sup_{g_i^{(T)}\in\mathcal{F}_i^T}U_i^T(g_i^{(T)},(f_{-i}^i)^{(T)})-\psi_T,$$
and $d(\mu^{(f^i)^{(T)}},\mu^{f^{(T)}})\le\eta$, so $f^{(T)}$ is a finite-horizon weak $\psi_T$-subjective $\eta$-equilibrium in the sense of Definition 15. Applying Lemma A.3 with this $T$, $\psi=\psi_T$, and $\theta=\varepsilon/2$, there exists $\bar\eta_T(\psi_T,\varepsilon/2)>0$ such that if $\eta\le\bar\eta_T(\psi_T,\varepsilon/2)$ then there is a $\psi_T$-Nash equilibrium $\tilde f^{(T)}\in\mathcal{F}^T$ for the $T$-period game with
$$d(\mu^{\tilde f^{(T)}},\mu^{f^{(T)}})\ \le\ \tfrac{\varepsilon}{2}.$$
Define
$$\hat\eta(\xi,\varepsilon)\ :=\ \bar\eta_T\!\left(\xi+\tfrac{\varepsilon}{4},\ \tfrac{\varepsilon}{2}\right).$$
Assume henceforth that $\eta\le\hat\eta(\xi,\varepsilon)$ so that this conclusion holds. Extend $\tilde f^{(T)}$ arbitrarily to a full strategy profile $\hat f\in\mathcal{F}$ by specifying its behaviour after period $T$ in any way. Then $\hat f$ and $\tilde f^{(T)}$ coincide on periods $t\le T$, and similarly $f$ and $f^{(T)}$ coincide on $t\le T$. The weak distance between $\hat f$ and $f$ can be bounded as
$$d(\mu^{\hat f},\mu^f)\ \le\ d(\mu^{\hat f},\mu^{\tilde f^{(T)}})+d(\mu^{\tilde f^{(T)}},\mu^{f^{(T)}})+d(\mu^{f^{(T)}},\mu^f).$$
The second term is at most $\varepsilon/2$ by construction.
For the first and third terms, any discrepancy between $\hat f$ and $\tilde f^{(T)}$ (respectively, $f$ and $f^{(T)}$) occurs only at times $t>T$, so each of these weak distances is bounded by the tail $\sum_{t>T}2^{-t}\le\varepsilon/4$ by (18). Hence
$$d(\mu^{\hat f},\mu^f)\ \le\ \tfrac{\varepsilon}{4}+\tfrac{\varepsilon}{2}+\tfrac{\varepsilon}{4}\ =\ \varepsilon.$$
It remains to show that $\hat f$ is a $(\xi+\varepsilon)$-Nash equilibrium of the infinite-horizon game. Fix $i\in I$ and any deviation $g_i\in\mathcal{F}_i$. Let $g_i^{(T)}$ denote the truncation of $g_i$ to a $T$-period strategy, i.e., its prescriptions on histories of length at most $T$; clearly $U_i^T(g_i,\hat f_{-i})=U_i^T(g_i^{(T)},\tilde f_{-i}^{(T)})$ since $\hat f$ and $\tilde f^{(T)}$ coincide on the first $T$ periods. Because $\tilde f^{(T)}$ is a $\psi_T$-Nash equilibrium of the $T$-period game,
$$U_i^T(\tilde f_i^{(T)},\tilde f_{-i}^{(T)})\ \ge\ U_i^T(g_i^{(T)},\tilde f_{-i}^{(T)})-\psi_T.$$
Using the truncation bound (17), we obtain
$$U_i(\hat f_i,\hat f_{-i})\ \ge\ U_i^T(\hat f_i,\hat f_{-i})-\tfrac{\varepsilon}{8}\ =\ U_i^T(\tilde f_i^{(T)},\tilde f_{-i}^{(T)})-\tfrac{\varepsilon}{8}$$
and
$$U_i(g_i,\hat f_{-i})\ \le\ U_i^T(g_i,\hat f_{-i})+\tfrac{\varepsilon}{8}\ =\ U_i^T(g_i^{(T)},\tilde f_{-i}^{(T)})+\tfrac{\varepsilon}{8}.$$
Combining these inequalities yields
$$U_i(\hat f_i,\hat f_{-i})\ \ge\ U_i^T(\tilde f_i^{(T)},\tilde f_{-i}^{(T)})-\tfrac{\varepsilon}{8}\ \ge\ U_i^T(g_i^{(T)},\tilde f_{-i}^{(T)})-\psi_T-\tfrac{\varepsilon}{8}\ \ge\ U_i(g_i,\hat f_{-i})-\psi_T-\tfrac{\varepsilon}{4}.$$
Recalling that $\psi_T=\xi+\varepsilon/4$, we have
$$\psi_T+\tfrac{\varepsilon}{4}\ =\ \xi+\tfrac{\varepsilon}{2}\ \le\ \xi+\varepsilon,$$
so for every deviation $g_i$,
$$U_i(\hat f_i,\hat f_{-i})\ \ge\ U_i(g_i,\hat f_{-i})-(\xi+\varepsilon).$$
Thus $\hat f$ is a $(\xi+\varepsilon)$-Nash equilibrium. ∎

Proof of Lemma 4.1. For each $g_{-i}\in\mathcal{S}_{-i}$ define the continuation value envelope
$$M(g_{-i})\ :=\ \sup_{\sigma_i}V_i(\sigma_i\mid h^t;g_{-i})\ \in\ [0,1].$$
For each $g_{-i}$ pick a (measurable) best response $\sigma_i^{g_{-i}}\in BR_i(g_{-i}\mid h^t)$, so that $V_i(\sigma_i^{g_{-i}}\mid h^t;g_{-i})=M(g_{-i})$.
By definition, PS-BR first samples $\tilde g_{-i}\sim p_t(\cdot)$ and then plays $\sigma_i^{\tilde g_{-i}}$. Evaluating against the posterior predictive belief and using linearity in the mixing over opponent hypotheses,
$$V_i(\sigma^{PS}_{i,t}\mid h^t)\ =\ \sum_{\tilde g_{-i}\in\mathcal{S}_{-i}}p_t(\tilde g_{-i})\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,V_i(\sigma_i^{\tilde g_{-i}}\mid h^t;g_{-i})\ \ge\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\,V_i(\sigma_i^{g_{-i}}\mid h^t;g_{-i})\ =\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\,M(g_{-i}).$$
On the other hand,
$$\sup_{\sigma_i}V_i(\sigma_i\mid h^t)\ =\ \sup_{\sigma_i}\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,V_i(\sigma_i\mid h^t;g_{-i})\ \le\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,M(g_{-i}).$$
Subtracting and using $M(g_{-i})\le 1$,
$$\sup_{\sigma_i}V_i(\sigma_i\mid h^t)-V_i(\sigma^{PS}_{i,t}\mid h^t)\ \le\ \sum_{g_{-i}\in\mathcal{S}_{-i}}\big(p_t(g_{-i})-p_t(g_{-i})^2\big)M(g_{-i})\ \le\ \sum_{g_{-i}\in\mathcal{S}_{-i}}\big(p_t(g_{-i})-p_t(g_{-i})^2\big)\ =\ 1-\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\ =\ D_i^t(h^t).$$
This proves the claim. ∎

Proof of Lemma 4.2. Fix any $g_{-i}\in\mathcal{S}_{-i}\setminus\{f_{-i}\}$. Write $a^t=(a_i^t,a_{-i}^t)$ for the period-$t$ action profile along the realized play path $z$, and write $h^t$ for the length-$t$ history $(a^1,\dots,a^{t-1})$. Because $\mathcal{S}_{-i}$ is finite and all menu strategies are $\nu$-cautious, Bayes' rule is well-defined at every history and the posterior odds admit the standard likelihood ratio form:
$$\frac{\mu_i^t(g_{-i}\mid h^t)}{\mu_i^t(f_{-i}\mid h^t)}\ =\ \frac{\mu_i^0(g_{-i})}{\mu_i^0(f_{-i})}\prod_{s=1}^{t-1}\frac{g_{-i}(h^s)(a_{-i}^s)}{f_{-i}(h^s)(a_{-i}^s)}.\tag{19}$$
Define the log-likelihood ratio increments
$$X_s\ :=\ \log\frac{f_{-i}(h^s)(a_{-i}^s)}{g_{-i}(h^s)(a_{-i}^s)}.$$
Taking logs in (19) gives
$$\log\frac{\mu_i^t(g_{-i}\mid h^t)}{\mu_i^t(f_{-i}\mid h^t)}\ =\ \log\frac{\mu_i^0(g_{-i})}{\mu_i^0(f_{-i})}\ -\ \sum_{s=1}^{t-1}X_s.\tag{20}$$
Let $\mathcal{F}_s$ be the $\sigma$-algebra generated by the history $h^s$.
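The inequality chain in Lemma 4.1 can be checked numerically in a one-shot reduction: the posterior-sampling loss never exceeds the dispersion $D=1-\sum_{g}p(g)^2$ of the posterior. The hypothesis names and payoffs below are illustrative, not from the paper:

```python
# One-shot check of Lemma 4.1: sup_sigma V - V(PS-BR) <= D = 1 - sum_g p(g)^2.
# p is a posterior over opponent hypotheses; u[g][a] is my payoff in [0, 1]
# for action a against hypothesis g (all values illustrative).

def ps_gap_and_bound(p, u):
    hyps = list(p)
    acts = range(len(next(iter(u.values()))))
    br = {g: max(acts, key=lambda a: u[g][a]) for g in hyps}  # best reply per hypothesis
    # PS-BR: sample g~ from p, play br(g~); evaluate against the posterior predictive.
    v_ps = sum(p[gt] * sum(p[g] * u[g][br[gt]] for g in hyps) for gt in hyps)
    v_opt = max(sum(p[g] * u[g][a] for g in hyps) for a in acts)
    return v_opt - v_ps, 1.0 - sum(q * q for q in p.values())

p = {"tit_for_tat": 0.6, "always_defect": 0.3, "uniform": 0.1}
u = {"tit_for_tat": [1.0, 0.4], "always_defect": [0.0, 0.6], "uniform": [0.5, 0.5]}
gap, bound = ps_gap_and_bound(p, u)
assert 0.0 <= gap <= bound   # the Lemma 4.1 bound holds
```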
Under the true play distribution $\mu^f$, conditional on $\mathcal{F}_s$ the opponents' action $a_{-i}^s$ is distributed according to $f_{-i}(h^s)$. Therefore,
$$\mathbb{E}_{\mu^f}[X_s\mid\mathcal{F}_s]\ =\ \sum_{a_{-i}\in A_{-i}}f_{-i}(h^s)(a_{-i})\log\frac{f_{-i}(h^s)(a_{-i})}{g_{-i}(h^s)(a_{-i})}\ =\ D_{\mathrm{KL}}\big(f_{-i}(h^s)\,\big\|\,g_{-i}(h^s)\big).$$
Define the martingale difference sequence $Y_s:=X_s-\mathbb{E}[X_s\mid\mathcal{F}_s]$. By $\nu$-caution, for all $s$ we have $f_{-i}(h^s)(a_{-i}^s)\in[\nu,1]$ and $g_{-i}(h^s)(a_{-i}^s)\in[\nu,1]$, hence
$$|X_s|\le\log(1/\nu),\qquad |\mathbb{E}[X_s\mid\mathcal{F}_s]|\le\log(1/\nu),\qquad\text{and thus}\qquad |Y_s|\le 2\log(1/\nu)\ =:\ c.$$
Azuma–Hoeffding yields, for any $\epsilon>0$,
$$\Pr\Big(\Big|\sum_{s=1}^{T}Y_s\Big|\ge\epsilon T\Big)\ \le\ 2\exp\Big(-\frac{\epsilon^2 T}{2c^2}\Big).$$
The right-hand side is summable in $T$, so by Borel–Cantelli,
$$\frac{1}{T}\sum_{s=1}^{T}Y_s\ \longrightarrow\ 0\qquad \mu^f\text{-a.s.}$$
Consequently,
$$\frac{1}{T}\sum_{s=1}^{T}X_s\ =\ \frac{1}{T}\sum_{s=1}^{T}\mathbb{E}[X_s\mid\mathcal{F}_s]+o(1)\ =\ \frac{1}{T}\sum_{s=1}^{T}D_{\mathrm{KL}}\big(f_{-i}(h^s)\,\big\|\,g_{-i}(h^s)\big)+o(1)\qquad \mu^f\text{-a.s.}$$
By the KL-separation part of Assumption 3, the liminf of the empirical averages of these KL terms is strictly positive $\mu^f$-a.s., hence
$$\sum_{s=1}^{t-1}X_s\ \longrightarrow\ +\infty\qquad \mu^f\text{-a.s.}$$
Returning to (20), we obtain
$$\log\frac{\mu_i^t(g_{-i}\mid h^t)}{\mu_i^t(f_{-i}\mid h^t)}\ \longrightarrow\ -\infty\qquad \mu^f\text{-a.s.},$$
so $\mu_i^t(g_{-i}\mid h^t)/\mu_i^t(f_{-i}\mid h^t)\to 0$ almost surely. Because there are finitely many $g_{-i}\neq f_{-i}$, this implies $\mu_i^t(f_{-i}\mid h^t)\to 1$ and $\max_{g_{-i}\neq f_{-i}}\mu_i^t(g_{-i}\mid h^t)\to 0$ almost surely. ∎

Proof of Lemma 6.1. Identical to the proof of Lemma 4.2, with $(f_{-i},\mu^f)$ replaced by $(\bar f_{-i},\bar\mu^{\sigma,u})$. ∎

Proof of Proposition 4.3.
Along any realized play path $z$, define $p_t(\cdot)=\mu_i^t(\cdot\mid h^t(z))$ on the finite set $\mathcal{S}_{-i}$ and the associated $D_i^t(h^t(z))=1-\sum_{g_{-i}}p_t(g_{-i})^2$. By Lemma 4.2, $\mu_i^t(f_{-i}\mid h^t(z))\to 1$ almost surely, hence
$$\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\ \ge\ \mu_i^t(f_{-i}\mid h^t(z))^2\ \longrightarrow\ 1,$$
and therefore $D_i^t(h^t(z))\to 0$ almost surely. Fix any $\varepsilon>0$ and any $z$ in the full-measure event where $D_i^t(h^t(z))\to 0$. Choose $T_i(z,\varepsilon)$ such that $D_i^t(h^t(z))\le\varepsilon$ for all $t\ge T_i(z,\varepsilon)$. For each such $t$, Lemma 4.1 implies that PS-BR at $h^t(z)$ is an $\varepsilon$-best response to the posterior predictive continuation belief, i.e.,
$$f_i|_{h^t(z)}\ \in\ BR_i^\varepsilon\big(f_{-i}^{i,t}|_{h^t(z)}\mid h^t(z)\big).$$
This is exactly the asymptotic $\varepsilon$-consistency requirement in Definition 4. ∎

Proof of Lemma 5.1. Let $\mu^{f^i}\equiv P_i^{0,f_i}$ be the distribution induced by the belief-equivalent profile $(f_i,f_{-i}^i)$ representing the prior predictive. By Assumption 2, $\mu^f\ll\mu^{f^i}$. By the merging of opinions theorem [33, 10], absolute continuity guarantees that the conditional predictive distributions over future play paths merge almost surely in total variation. Specifically, for $\mu^f$-almost every path $z\in H^\infty$:
$$\lim_{t\to\infty}\ \sup_{E\in\mathcal{B}}\big|\mu^f(E\mid C(h^t(z)))-\mu^{f^i}(E\mid C(h^t(z)))\big|\ =\ 0,$$
where $\mathcal{B}$ is the product $\sigma$-algebra on $H^\infty$. Recall from Definition 6 that the continuation weak distance is bounded by the total variation distance. For any finite length $k$, the $\sigma$-algebra $\mathcal{B}^k$ generated by cylinder events of length $k$ is a sub-$\sigma$-algebra of $\mathcal{B}$. Therefore:
$$\sup_{E\in\mathcal{B}^k}\big|\mu^f(E\mid C(h^t(z)))-\mu^{f^i}(E\mid C(h^t(z)))\big|\ \le\ \sup_{E\in\mathcal{B}}\big|\mu^f(E\mid C(h^t(z)))-\mu^{f^i}(E\mid C(h^t(z)))\big|.$$
Using this bound, the continuation weak distance $d_{h^t(z)}(\mu^f,\mu^{f^i})$ satisfies:
$$d_{h^t(z)}(\mu^f,\mu^{f^i})\ =\ \sum_{k=1}^{\infty}2^{-k}\sup_{E\in\mathcal{B}^k}\big|\mu^f(E\mid C(h^t(z)))-\mu^{f^i}(E\mid C(h^t(z)))\big|\ \le\ \sum_{k=1}^{\infty}2^{-k}\sup_{E\in\mathcal{B}}\big|\mu^f(E\mid C(h^t(z)))-\mu^{f^i}(E\mid C(h^t(z)))\big|\ =\ \sup_{E\in\mathcal{B}}\big|\mu^f(E\mid C(h^t(z)))-\mu^{f^i}(E\mid C(h^t(z)))\big|.$$
Since the total variation distance on the right-hand side converges to zero as $t\to\infty$ for $\mu^f$-almost every $z$, we have:
$$\lim_{t\to\infty}d_{h^t(z)}(\mu^f,\mu^{f^i})\ =\ 0\qquad \mu^f\text{-a.s.}$$
By the definition of the limit, for any $\eta>0$, there $\mu^f$-a.s. exists a finite time $T_i(z,\eta)$ such that for all $t\ge T_i(z,\eta)$, $d_{h^t(z)}(\mu^f,\mu^{f^i})\le\eta$. This precisely satisfies the strong path prediction requirement in Definition 9. ∎

Proof of Proposition 5.2. Fix $\xi,\eta>0$. For each player $i$, R implies that $\mu^f$-a.s. in $z$ there exists $T_i^{br}(z)$ such that for all $t\ge T_i^{br}(z)$,
$$f_i|_{h^t(z)}\ \in\ BR_i^\xi\big(f_{-i}^{i,t}|_{h^t(z)}\mid h^t(z)\big).$$
By the representative choice (4), we may equivalently write $f_{-i}^{i,t}|_{h^t(z)}\equiv f_{-i}^{i}|_{h^t(z)}$, so for all $t\ge T_i^{br}(z)$,
$$f_i|_{h^t(z)}\ \in\ BR_i^\xi\big(f_{-i}^{i}|_{h^t(z)}\mid h^t(z)\big),$$
which is exactly the subjective best-response condition in Definition 8. Similarly, strong prediction implies that $\mu^f$-a.s. in $z$ there exists $T_i^{pred}(z)$ such that for all $t\ge T_i^{pred}(z)$,
$$d_{h^t(z)}(\mu^f,\mu^{f^i})\ \le\ \eta,$$
which is the weak predictive accuracy condition in Definition 8. Let $T(z):=\max_i\{T_i^{br}(z),T_i^{pred}(z)\}$, which is finite $\mu^f$-a.s. since $I$ is finite.
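The weighted-supremum structure of the weak distance in Definition 6 is straightforward to compute for finitely many horizons. A sketch for i.i.d. product measures over a finite action alphabet (the i.i.d. restriction is our simplification):

```python
import math
from itertools import product

# Finite-horizon weak distance, d = sum_k 2^{-k} * sup_{E in B^k} |mu(E) - nu(E)|:
# the sup over length-k cylinder events equals the total-variation distance on
# k-step prefixes, which for i.i.d. measures is computable by enumeration.

def weak_distance(p, q, horizon):
    d = 0.0
    for k in range(1, horizon + 1):
        l1 = sum(
            abs(math.prod(p[a] for a in path) - math.prod(q[a] for a in path))
            for path in product(range(len(p)), repeat=k)
        )
        d += 2 ** -k * (l1 / 2)     # TV = half the L1 distance on prefixes
    return d

d = weak_distance([0.5, 0.5], [0.6, 0.4], horizon=10)
assert 0.05 <= d < 1.0              # first term alone contributes 2^-1 * 0.1
```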
Then for all $t\ge T(z)$ and every player $i$, both conditions in Definition 8 hold with supporting profile $f^i$, so $f|_{h^t(z)}$ is a weak $\xi$-subjective $\eta$-equilibrium after $h^t(z)$. ∎

Proof of Theorem 5.3. Fix $\varepsilon>0$ and set $\xi:=\varepsilon/2$. Let $\hat\eta(\cdot,\cdot)$ be the function from the infinite patching lemma (Lemma A.4 in Appendix A), and set $\eta:=\hat\eta(\xi,\varepsilon/2)$. By Proposition 5.2, $\mu^f$-a.s. in $z$ there exists $T(z)$ such that for all $t\ge T(z)$, the continuation profile $f|_{h^t(z)}$ is a weak $\xi$-subjective $\eta$-equilibrium after $h^t(z)$. Applying Lemma A.4 at each such $t$ yields an $\varepsilon$-Nash equilibrium $\hat f^{\varepsilon,t,z}$ of the continuation game after $h^t(z)$ satisfying $d_{h^t(z)}(\mu^f,\mu^{\hat f^{\varepsilon,t,z}})\le\varepsilon$. ∎

Proof of Corollary 5.4. By Proposition 4.3, under Assumption 3, each player is R. Because Assumption 3 (specifically the menu grain of truth) implies Assumption 2, Lemma 5.1 guarantees each player learns to predict the path of play under $f$. Theorem 5.3 therefore applies. ∎

Proof of Lemma 6.2. Fix any $m_i\in\mathcal{M}_i\setminus\{u_i\}$. By Bayes' rule (6),
$$\frac{\pi_i^t(m_i\mid x_i^t)}{\pi_i^t(u_i\mid x_i^t)}\ =\ \frac{\pi_i^0(m_i)}{\pi_i^0(u_i)}\prod_{s=1}^{t-1}\frac{\psi_i(r_i^s;m_i(a^s))}{\psi_i(r_i^s;u_i(a^s))}.$$
Equivalently,
$$\log\frac{\pi_i^t(m_i\mid x_i^t)}{\pi_i^t(u_i\mid x_i^t)}\ =\ \log\frac{\pi_i^0(m_i)}{\pi_i^0(u_i)}-\sum_{s=1}^{t-1}X_s,$$
where
$$X_s\ :=\ \log\frac{\psi_i(r_i^s;u_i(a^s))}{\psi_i(r_i^s;m_i(a^s))}.$$
Let
$$\mathcal{H}_s\ :=\ \sigma(h^{s+1},r_i^{1:s-1}),$$
so that $a^s$ is $\mathcal{H}_s$-measurable and, under the true interaction law, $r_i^s$ is conditionally distributed as $q_i^{u_i}(\cdot\mid a^s)$. Therefore
$$\mathbb{E}[X_s\mid\mathcal{H}_s]\ =\ D_{\mathrm{KL}}\big(q_i^{u_i}(\cdot\mid a^s)\,\big\|\,q_i^{m_i}(\cdot\mid a^s)\big).$$
Define the martingale difference sequence $Y_s:=X_s-\mathbb{E}[X_s\mid\mathcal{H}_s]$. By Assumption 4(3), $\sup_s\mathbb{E}[Y_s^2]<\infty$.
Hence
$$\sum_{s=1}^{\infty}\frac{\mathbb{E}[Y_s^2]}{s^2}<\infty,$$
so the martingale strong law implies
$$\frac{1}{T}\sum_{s=1}^{T}Y_s\ \longrightarrow\ 0\qquad\text{a.s.}$$
Therefore,
$$\frac{1}{T}\sum_{s=1}^{T}X_s\ =\ \frac{1}{T}\sum_{s=1}^{T}D_{\mathrm{KL}}\big(q_i^{u_i}(\cdot\mid a^s)\,\big\|\,q_i^{m_i}(\cdot\mid a^s)\big)+o(1)\qquad\text{a.s.}$$
By Assumption 4(4), the liminf of the empirical KL average is strictly positive almost surely, hence
$$\sum_{s=1}^{t-1}X_s\ \longrightarrow\ +\infty\qquad\text{a.s.}$$
It follows that
$$\log\frac{\pi_i^t(m_i\mid x_i^t)}{\pi_i^t(u_i\mid x_i^t)}\ \longrightarrow\ -\infty\qquad\text{a.s.},$$
so $\pi_i^t(m_i\mid x_i^t)/\pi_i^t(u_i\mid x_i^t)\to 0$. Since $\mathcal{M}_i$ is finite, this implies $\pi_i^t(u_i\mid x_i^t)\to 1$ and $\max_{m_i\neq u_i}\pi_i^t(m_i\mid x_i^t)\to 0$ almost surely. ∎

Proof of Lemma 6.3. By (11), for every measurable event $E\subseteq H^\infty$,
$$\big|\Pi_i^t(E\mid x_i^t)-\bar\mu_{x_i^t}^{(\sigma_i,g_{-i}^{i,t}),u_i}(E)\big|\ =\ \Big|\sum_{m_i\in\mathcal{M}_i}\pi_i^t(m_i\mid x_i^t)\,\bar\mu_{x_i^t}^{(\sigma_i,g_{-i}^{i,t}),m_i}(E)-\bar\mu_{x_i^t}^{(\sigma_i,g_{-i}^{i,t}),u_i}(E)\Big|\ \le\ \sum_{m_i\neq u_i}\pi_i^t(m_i\mid x_i^t)\ =\ 1-\pi_i^t(u_i\mid x_i^t).$$
Taking the supremum over cylinder events at each horizon and summing with the weights $2^{-t}$ yields the stated bound. ∎

Proof of Lemma 6.4. Fix player $i$ and an information history $x_i^t=(h^t,r_i^{1:t-1})$. Let $\mathcal{M}:=\mathcal{S}_{-i}\times\mathcal{M}_i$, and for each $m=(g_{-i},m_i)\in\mathcal{M}$ define the continuation value functional
$$V_i^m(\tau_i\mid x_i^t)\ :=\ V_i^{m_i}(\tau_i\mid x_i^t;g_{-i})\ \in\ [0,1],$$
and the value envelope
$$M(m)\ :=\ \sup_{\tau_i}V_i^m(\tau_i\mid x_i^t)\ \in\ [0,1].$$
For each $m\in\mathcal{M}$ fix a (measurable) best response $\tau_i^m$ attaining $M(m)$, i.e., $V_i^m(\tau_i^m\mid x_i^t)=M(m)$. By Definition 13, PS-BR samples $(\tilde g_{-i},\tilde m_i)\sim p_t(\cdot)$ and then plays $\tau_i^{(\tilde g_{-i},\tilde m_i)}$. Let $\sigma^{PS}_{i,t}$ denote this randomized continuation strategy at $x_i^t$.
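The drift of the log-posterior odds in Lemmas 4.2 and 6.2 can be simulated directly. Below, a toy i.i.d. menu of three observation models (our illustration; the paper's models are history-dependent) shows the posterior concentrating on the true model:

```python
import math
import random

# Bayes' rule as additive log-likelihood increments: the odds against the true
# model drift to -infinity at the empirical KL rate, so the posterior on the
# true model tends to 1 (Lemmas 4.2 / 6.2, in an i.i.d. toy setting).
random.seed(0)

menu = {"f": [0.7, 0.2, 0.1], "g1": [0.1, 0.8, 0.1], "g2": [1/3, 1/3, 1/3]}
log_post = {m: 0.0 for m in menu}                    # uniform prior, log scale

for _ in range(2000):
    a = random.choices(range(3), weights=menu["f"])[0]   # "f" is the truth
    for m in menu:
        log_post[m] += math.log(menu[m][a])

mx = max(log_post.values())
z = sum(math.exp(v - mx) for v in log_post.values())
post = {m: math.exp(v - mx) / z for m, v in log_post.items()}
assert post["f"] > 0.99                              # posterior concentrates on the truth
```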
Because $V_i^{mix,t}$ is linear in both the opponents-mixture and the payoff-matrix mixture, we can write
$$V_i^{mix,t}(\tau_i\mid x_i^t)\ =\ \sum_{(g_{-i},m_i)\in\mathcal{M}}p_t(g_{-i},m_i)\,V_i^{(g_{-i},m_i)}(\tau_i\mid x_i^t)\ =\ \sum_{m\in\mathcal{M}}p_t(m)\,V_i^m(\tau_i\mid x_i^t).$$
Therefore, evaluating PS-BR under the mixed subjective objective gives
$$V_i^{mix,t}(\sigma^{PS}_{i,t}\mid x_i^t)\ =\ \sum_{\tilde m\in\mathcal{M}}p_t(\tilde m)\,V_i^{mix,t}(\tau_i^{\tilde m}\mid x_i^t)\ =\ \sum_{\tilde m\in\mathcal{M}}p_t(\tilde m)\sum_{m\in\mathcal{M}}p_t(m)\,V_i^m(\tau_i^{\tilde m}\mid x_i^t)\ \ge\ \sum_{m\in\mathcal{M}}p_t(m)^2\,V_i^m(\tau_i^m\mid x_i^t)\ =\ \sum_{m\in\mathcal{M}}p_t(m)^2\,M(m).$$
On the other hand,
$$\sup_{\tau_i}V_i^{mix,t}(\tau_i\mid x_i^t)\ =\ \sup_{\tau_i}\sum_{m\in\mathcal{M}}p_t(m)\,V_i^m(\tau_i\mid x_i^t)\ \le\ \sum_{m\in\mathcal{M}}p_t(m)\sup_{\tau_i}V_i^m(\tau_i\mid x_i^t)\ =\ \sum_{m\in\mathcal{M}}p_t(m)\,M(m).$$
Subtracting and using $M(m)\le 1$ for all $m$,
$$\sup_{\tau_i}V_i^{mix,t}(\tau_i\mid x_i^t)-V_i^{mix,t}(\sigma^{PS}_{i,t}\mid x_i^t)\ \le\ \sum_{m\in\mathcal{M}}\big(p_t(m)-p_t(m)^2\big)M(m)\ \le\ \sum_{m\in\mathcal{M}}\big(p_t(m)-p_t(m)^2\big)\ =\ 1-\sum_{m\in\mathcal{M}}p_t(m)^2\ =\ D_i^{t,joint}(x_i^t).$$
This proves the claim. ∎

Proof of Proposition 6.5. Work on the full-measure event on which both posterior concentrations hold:
$$\mu_i^t(\bar f_{-i}\mid h^t)\to 1\qquad\text{and}\qquad \pi_i^t(u_i\mid x_i^t)\to 1.$$
Then
$$D_i^{t,joint}(x_i^t)\to 0\qquad\text{and}\qquad \delta_i^t(x_i^t):=1-\pi_i^t(u_i\mid x_i^t)\to 0.$$
By (12), Lemma 6.4, and (10),
$$\sup_{\tau_i\in\Sigma_i(x_i^t)}V_i^{u_i}\big(\tau_i\mid x_i^t;g_{-i}^{i,t}\big)-V_i^{u_i}\big(\sigma_{i,t}^{PS}\mid x_i^t;g_{-i}^{i,t}\big)\ =\ \sup_{\tau_i}V_i^{u_i,t}(\tau_i\mid x_i^t)-V_i^{u_i,t}(\sigma_{i,t}^{PS}\mid x_i^t)\ \le\ D_i^{t,joint}(x_i^t)+2\delta_i^t(x_i^t).$$
The right-hand side converges to $0$ almost surely, so the stated eventual $\varepsilon$-best-response property follows. ∎

Proof of Lemma 6.6.
By the Blackwell–Dubins merging argument applied on the observable process $O_i$, Assumption 5 implies
$$d\big(\Pi_i^t(\cdot\mid x_i^t(\omega)),\ \bar\mu_{i,x_i^t(\omega)}^{\sigma,u}\big)\ \longrightarrow\ 0\qquad\text{for }P^{\sigma,u}\text{-a.e. }\omega.$$
Assumption 6 gives
$$d\big(\bar\mu_{i,x_i^t(\omega)}^{\sigma,u},\ \bar\mu_{x^t(\omega)}^{\sigma,u}\big)\ \longrightarrow\ 0\qquad\text{for }P^{\sigma,u}\text{-a.e. }\omega.$$
The claim follows by the triangle inequality. ∎

Proof of Proposition 6.7. Fix $\xi,\eta>0$. For each player $i$, Proposition 6.5 implies that $P^{\sigma,u}$-a.s. there exists $T_i^{br}(\omega)$ such that for all $t\ge T_i^{br}(\omega)$,
$$\sigma_{i,t}^{PS}(\cdot\mid x_i^t(\omega))\ \in\ BR_{i,u_i}^\xi\big(g_{-i}^{i,t}\mid x_i^t(\omega)\big).$$
Also, Lemma 6.6 together with Lemma 6.3 implies that $P^{\sigma,u}$-a.s. there exists $T_i^{pred}(\omega)$ such that for all $t\ge T_i^{pred}(\omega)$,
$$d\big(\bar\mu_{x^t(\omega)}^{\sigma,u},\ \bar\mu_{x_i^t(\omega)}^{(\sigma_i,g_{-i}^{i,t}),u_i}\big)\ \le\ \eta.$$
Indeed,
$$d\big(\bar\mu_{x^t(\omega)}^{\sigma,u},\ \bar\mu_{x_i^t(\omega)}^{(\sigma_i,g_{-i}^{i,t}),u_i}\big)\ \le\ d\big(\bar\mu_{x^t(\omega)}^{\sigma,u},\ \Pi_i^t(\cdot\mid x_i^t(\omega))\big)+d\big(\Pi_i^t(\cdot\mid x_i^t(\omega)),\ \bar\mu_{x_i^t(\omega)}^{(\sigma_i,g_{-i}^{i,t}),u_i}\big),$$
and both terms vanish almost surely by Lemmas 6.6 and 6.3. Let
$$T(\omega)\ :=\ \max_{i\in I}\{T_i^{br}(\omega),\,T_i^{pred}(\omega)\}.$$
Then for all $t\ge T(\omega)$ and every player $i$, both conditions in Definition 14 hold with supporting reduced-form model $g_{-i}^{i,t}$. ∎

Proof of Lemma 5.5. Fix player $i$, let $p,q\in\Delta(A_{-i})$, and suppose $\alpha_i\in br_i^\xi(q)$. For any $\alpha_i\in\Delta(A_i)$ define
$$\phi_{\alpha_i}(a_{-i})\ :=\ \sum_{a_i\in A_i}\alpha_i(a_i)\,u_i(a_i,a_{-i}),\qquad a_{-i}\in A_{-i}.$$
Since $u_i(a_i,a_{-i})\in[0,1]$, we have $\phi_{\alpha_i}(a_{-i})\in[0,1]$ for all $a_{-i}\in A_{-i}$. Also,
$$u_i(\alpha_i,p)-u_i(\alpha_i,q)\ =\ \sum_{a_{-i}\in A_{-i}}\phi_{\alpha_i}(a_{-i})\big(p(a_{-i})-q(a_{-i})\big).$$
Set $S^+:=\{a_{-i}\in A_{-i}:p(a_{-i})\ge q(a_{-i})\}$.
Because $0\le\phi_{\alpha_i}\le 1$, we have
$$u_i(\alpha_i,p)-u_i(\alpha_i,q)\ =\ \sum_{a_{-i}\in S^+}\phi_{\alpha_i}(a_{-i})\big(p(a_{-i})-q(a_{-i})\big)+\sum_{a_{-i}\notin S^+}\phi_{\alpha_i}(a_{-i})\big(p(a_{-i})-q(a_{-i})\big)\ \le\ \sum_{a_{-i}\in S^+}\big(p(a_{-i})-q(a_{-i})\big)\ =\ p(S^+)-q(S^+)\ \le\ \|p-q\|_{TV}.$$
Applying the same argument with $p$ and $q$ interchanged yields $u_i(\alpha_i,q)-u_i(\alpha_i,p)\le\|p-q\|_{TV}$. Therefore
$$|u_i(\alpha_i,p)-u_i(\alpha_i,q)|\ \le\ \|p-q\|_{TV}\qquad\text{for every }\alpha_i\in\Delta(A_i).\tag{21}$$
Now suppose $\alpha_i\in br_i^\xi(q)$. Then
$$u_i(\alpha_i,q)\ \ge\ \sup_{\alpha_i'\in\Delta(A_i)}u_i(\alpha_i',q)-\xi.$$
Using (21),
$$u_i(\alpha_i,p)\ \ge\ u_i(\alpha_i,q)-\|p-q\|_{TV}\ \ge\ \sup_{\alpha_i'\in\Delta(A_i)}u_i(\alpha_i',q)-\xi-\|p-q\|_{TV}\ \ge\ \sup_{\alpha_i'\in\Delta(A_i)}\big(u_i(\alpha_i',p)-\|p-q\|_{TV}\big)-\xi-\|p-q\|_{TV}\ =\ \sup_{\alpha_i'\in\Delta(A_i)}u_i(\alpha_i',p)-\xi-2\|p-q\|_{TV}.$$
Hence $\alpha_i\in br_i^{\xi+2\|p-q\|_{TV}}(p)$. ∎

Proof of Lemma 5.6. Fix player $i$ and history $h^t$. For each $g_{-i}\in\mathcal{S}_{-i}$ define
$$M(g_{-i})\ :=\ \sup_{\alpha_i\in\Delta(A_i)}u_i\big(\alpha_i,g_{-i}(h^t)\big)\ \in\ [0,1].$$
By Definition 11, for each $g_{-i}\in\mathcal{S}_{-i}$ we have chosen
$$\alpha_i^{g_{-i},h^t}\ \in\ br_i\big(g_{-i}(h^t)\big),$$
so $u_i\big(\alpha_i^{g_{-i},h^t},g_{-i}(h^t)\big)=M(g_{-i})$. Write $p_t(g_{-i})=\mu_i^t(g_{-i}\mid h^t)$. The ex ante mixed action induced by myopic PS-BR is
$$\alpha_{i,t}^{mPS}(\cdot\mid h^t)\ =\ \sum_{\tilde g_{-i}\in\mathcal{S}_{-i}}p_t(\tilde g_{-i})\,\alpha_i^{\tilde g_{-i},h^t}(\cdot),$$
and the one-step posterior predictive belief is
$$q_i^t(\cdot\mid h^t)\ =\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,g_{-i}(h^t)(\cdot).$$
By bilinearity of $u_i(\cdot,\cdot)$,
$$u_i\big(\alpha_{i,t}^{mPS},q_i^t\big)\ =\ \sum_{\tilde g_{-i}\in\mathcal{S}_{-i}}p_t(\tilde g_{-i})\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,u_i\big(\alpha_i^{\tilde g_{-i},h^t},g_{-i}(h^t)\big)\ \ge\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\,u_i\big(\alpha_i^{g_{-i},h^t},g_{-i}(h^t)\big)\ =\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\,M(g_{-i}).$$
On the other hand, again by bilinearity,
$$\sup_{\alpha_i\in\Delta(A_i)}u_i(\alpha_i,q_i^t)\ =\ \sup_{\alpha_i\in\Delta(A_i)}\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,u_i\big(\alpha_i,g_{-i}(h^t)\big)\ \le\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\sup_{\alpha_i\in\Delta(A_i)}u_i\big(\alpha_i,g_{-i}(h^t)\big)\ =\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,M(g_{-i}).$$
Subtracting,
$$\sup_{\alpha_i}u_i(\alpha_i,q_i^t)-u_i(\alpha_{i,t}^{mPS},q_i^t)\ \le\ \sum_{g_{-i}\in\mathcal{S}_{-i}}\big(p_t(g_{-i})-p_t(g_{-i})^2\big)M(g_{-i})\ \le\ \sum_{g_{-i}\in\mathcal{S}_{-i}}\big(p_t(g_{-i})-p_t(g_{-i})^2\big)\ =\ 1-\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})^2\ =\ D_i^t(h^t).$$
This proves the claim. ∎

Proof of Lemma 5.7. Fix player $i$ and let $f^i=(f_i,f_{-i}^i)$ be the supporting profile from Definition 9. Fix a realized path $z\in H^\infty$ in the full-measure event from Definition 9. By definition of $q_i^t$ and the representative choice (4),
$$q_i^t(\cdot\mid h^t(z))\ =\ f_{-i}^{i,t}(h^t(z))\ =\ f_{-i}^{i}(h^t(z)).$$
Let $\eta>0$. By Definition 9, there exists $T_i(z,\eta/2)<\infty$ such that for all $t\ge T_i(z,\eta/2)$, $d_{h^t(z)}(\mu^f,\mu^{f^i})\le\eta/2$. Fix such a $t$. For any subset $B\subseteq A_{-i}$, define the one-step cylinder event
$$E_B\ :=\ \{y\in H^\infty:\ y_{-i}^1\in B\}\ \in\ \mathcal{B}^1.$$
By the definition of continuation measures,
$$\mu^f_{h^t(z)}(E_B)\ =\ f_{-i}(h^t(z))(B),\qquad \mu^{f^i}_{h^t(z)}(E_B)\ =\ f_{-i}^i(h^t(z))(B)\ =\ q_i^t(B\mid h^t(z)).$$
Therefore,
$$\|q_i^t(\cdot\mid h^t(z))-f_{-i}(h^t(z))\|_{TV}\ =\ \sup_{B\subseteq A_{-i}}\big|q_i^t(B\mid h^t(z))-f_{-i}(h^t(z))(B)\big|\ =\ \sup_{B\subseteq A_{-i}}\big|\mu^{f^i}_{h^t(z)}(E_B)-\mu^f_{h^t(z)}(E_B)\big|\ \le\ \sup_{E\in\mathcal{B}^1}\big|\mu^{f^i}_{h^t(z)}(E)-\mu^f_{h^t(z)}(E)\big|.$$
By Definition 6,
$$d_{h^t(z)}(\mu^f,\mu^{f^i})\ =\ \sum_{k=1}^{\infty}2^{-k}\sup_{E\in\mathcal{B}^k}\big|\mu^f_{h^t(z)}(E)-\mu^{f^i}_{h^t(z)}(E)\big|.$$
In particular,
$$\tfrac{1}{2}\sup_{E\in\mathcal{B}^1}\big|\mu^f_{h^t(z)}(E)-\mu^{f^i}_{h^t(z)}(E)\big|\ \le\ d_{h^t(z)}(\mu^f,\mu^{f^i}),$$
so
$$\sup_{E\in\mathcal{B}^1}\big|\mu^f_{h^t(z)}(E)-\mu^{f^i}_{h^t(z)}(E)\big|\ \le\ 2\,d_{h^t(z)}(\mu^f,\mu^{f^i})\ \le\ \eta.$$
Hence $\|q_i^t(\cdot\mid h^t(z))-f_{-i}(h^t(z))\|_{TV}\le\eta$ for all $t\ge T_i(z,\eta/2)$. Since $\eta>0$ was arbitrary, this proves the claim. ∎

Proof of Theorem 5.8. Fix $\varepsilon>0$ and set $\xi:=\varepsilon/3$. For player $i$, Assumption 3 implies, by Lemma 4.2, that there is a full-measure event on which $\mu_i^t(f_{-i}\mid h^t(z))\to 1$. Since $f_{-i}\in\mathcal{S}_{-i}$ by menu grain of truth, on that event we also have
$$D_i^t(h^t(z))\ =\ 1-\sum_{g_{-i}\in\mathcal{S}_{-i}}\mu_i^t(g_{-i}\mid h^t(z))^2\ \longrightarrow\ 0.$$
Therefore there exists $T_i^{br}(z)<\infty$ such that for all $t\ge T_i^{br}(z)$, $D_i^t(h^t(z))\le\xi$. Because player $i$ uses myopic PS-BR, we have $f_i(h^t(z))=\alpha_{i,t}^{mPS}(\cdot\mid h^t(z))$. Applying Lemma 5.6, it follows that for all $t\ge T_i^{br}(z)$,
$$f_i(h^t(z))\ \in\ br_i^\xi\big(q_i^t(\cdot\mid h^t(z))\big).$$
Next, write $p_t(g_{-i})=\mu_i^t(g_{-i}\mid h^t(z))$. At history $h^t(z)$,
$$q_i^t(\cdot\mid h^t(z))\ =\ \sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,g_{-i}(h^t(z))(\cdot).$$
For any $B\subseteq A_{-i}$,
$$\big|q_i^t(B\mid h^t(z))-f_{-i}(h^t(z))(B)\big|\ =\ \Big|\sum_{g_{-i}\in\mathcal{S}_{-i}}p_t(g_{-i})\,g_{-i}(h^t(z))(B)-f_{-i}(h^t(z))(B)\Big|\ =\ \Big|\sum_{g_{-i}\neq f_{-i}}p_t(g_{-i})\big(g_{-i}(h^t(z))(B)-f_{-i}(h^t(z))(B)\big)\Big|\ \le\ \sum_{g_{-i}\neq f_{-i}}p_t(g_{-i})\ =\ 1-\mu_i^t(f_{-i}\mid h^t(z)).$$
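The TV-Lipschitz bound (21) and the resulting best-reply transfer in Lemma 5.5 admit a direct numerical check on a small stage game (payoff matrix and beliefs are illustrative):

```python
# Check of Lemma 5.5: payoffs in [0,1] are TV-Lipschitz in the opponent belief,
# so an exact best reply to q is a 2*TV(p,q)-best reply to p (the xi = 0 case).

U = [[1.0, 0.0, 0.3],    # U[a][b] = u_i(a, b), normalized to [0, 1]
     [0.2, 0.9, 0.5],
     [0.6, 0.4, 0.8]]
p = [0.5, 0.3, 0.2]
q = [0.3, 0.5, 0.2]
tv = sum(abs(x - y) for x, y in zip(p, q)) / 2

def val(a, belief):
    return sum(belief[b] * U[a][b] for b in range(3))

for a in range(3):                                   # the Lipschitz bound (21)
    assert abs(val(a, p) - val(a, q)) <= tv + 1e-12

br_q = max(range(3), key=lambda a: val(a, q))        # exact best reply to q
slack = max(val(a, p) for a in range(3)) - val(br_q, p)
assert slack <= 2 * tv + 1e-12                       # Lemma 5.5 with xi = 0
```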
Taking the supremum over $B\subseteq A_{-i}$ gives
$$\|q_i^t(\cdot\mid h^t(z))-f_{-i}(h^t(z))\|_{TV}\ \le\ 1-\mu_i^t(f_{-i}\mid h^t(z))\ \longrightarrow\ 0.$$
Hence there exists $T_i^{pred}(z)<\infty$ such that for all $t\ge T_i^{pred}(z)$,
$$\|q_i^t(\cdot\mid h^t(z))-f_{-i}(h^t(z))\|_{TV}\ \le\ \xi.$$
Now fix $t\ge\max\{T_i^{br}(z),T_i^{pred}(z)\}$. We already know that
$$f_i(h^t(z))\ \in\ br_i^\xi\big(q_i^t(\cdot\mid h^t(z))\big),$$
and that $\|q_i^t(\cdot\mid h^t(z))-f_{-i}(h^t(z))\|_{TV}\le\xi$. Applying Lemma 5.5 with $p=f_{-i}(h^t(z))$ and $q=q_i^t(\cdot\mid h^t(z))$ yields
$$f_i(h^t(z))\ \in\ br_i^{\xi+2\xi}\big(f_{-i}(h^t(z))\big)\ =\ br_i^{\varepsilon}\big(f_{-i}(h^t(z))\big).$$
Intersect the full-measure events above over all players $i\in I$. Since $I$ is finite, on that intersection we may define
$$T(z)\ :=\ \max_{i\in I}\max\{T_i^{br}(z),T_i^{pred}(z)\}\ <\ \infty.$$
Then for all $t\ge T(z)$ and all players $i$, $f_i(h^t(z))\in br_i^{\varepsilon}(f_{-i}(h^t(z)))$. By Definition 10, this means that $f(h^t(z))$ is a stage $\varepsilon$-Nash equilibrium for all $t\ge T(z)$. ∎

Proof of Lemma 5.9. Fix player $i$ and let $f^i=(f_i,f_{-i}^i)$ be the supporting profile from Definition 9. Fix a realized path $z$ in the full-measure event from Definition 9. By definition of $q_i^t$ and the representative choice (4),
$$q_i^t(\cdot\mid h^t(z))\ =\ f_{-i}^{i,t}(h^t(z))\ =\ f_{-i}^{i}(h^t(z)).$$
For each $t$, define the one-step cylinder event
$$E_t(z)\ :=\ \{y\in H^\infty:\ y_{-i}^1=a_{-i}^\star(h^t(z))\}\ \in\ \mathcal{B}^1.$$
Because the true opponents' next action at history $h^t(z)$ is pure,
$$f_{-i}(h^t(z))\ =\ \delta_{a_{-i}^\star(h^t(z))},$$
so $\mu^f_{h^t(z)}(E_t(z))=1$. Also, by the on-path identification above,
$$\mu^{f^i}_{h^t(z)}(E_t(z))\ =\ f_{-i}^i(h^t(z))\big(a_{-i}^\star(h^t(z))\big)\ =\ q_i^t\big(a_{-i}^\star(h^t(z))\mid h^t(z)\big).$$
Hence
$$1-q_i^t\big(a_{-i}^\star(h^t(z))\mid h^t(z)\big)\ =\ \big|\mu^f_{h^t(z)}(E_t(z))-\mu^{f^i}_{h^t(z)}(E_t(z))\big|\ \le\ \sup_{E\in\mathcal{B}^1}\big|\mu^f_{h^t(z)}(E)-\mu^{f^i}_{h^t(z)}(E)\big|.$$
As in the proof of Lemma 5.7,
$$\sup_{E\in\mathcal{B}^1}\big|\mu^f_{h^t(z)}(E)-\mu^{f^i}_{h^t(z)}(E)\big|\ \le\ 2\,d_{h^t(z)}(\mu^f,\mu^{f^i}).$$
Because player $i$ learns to predict the path of play, $d_{h^t(z)}(\mu^f,\mu^{f^i})\to 0$. Therefore
$$q_i^t\big(a_{-i}^\star(h^t(z))\mid h^t(z)\big)\ \longrightarrow\ 1.$$
It follows immediately that
$$1-\max_{a_{-i}\in A_{-i}}q_i^t(a_{-i}\mid h^t(z))\ \le\ 1-q_i^t\big(a_{-i}^\star(h^t(z))\mid h^t(z)\big)\ \longrightarrow\ 0,$$
which proves asymptotic purity. Finally, because $q_i^t(a_{-i}^\star(h^t(z))\mid h^t(z))\to 1$, there exists $T_i(z)<\infty$ such that for all $t\ge T_i(z)$,
$$q_i^t\big(a_{-i}^\star(h^t(z))\mid h^t(z)\big)\ >\ \tfrac{1}{2}.$$
For such $t$, the action $a_{-i}^\star(h^t(z))$ is the unique maximizer of $q_i^t(\cdot\mid h^t(z))$, because all other probabilities sum to $1-q_i^t(a_{-i}^\star(h^t(z))\mid h^t(z))<\tfrac{1}{2}$. Hence the deterministic MAP selector must satisfy
$$\hat a_{-i}^t(h^t(z))\ =\ a_{-i}^\star(h^t(z))\qquad\text{for all }t\ge T_i(z).$$
This proves the claim. ∎

Proof of Theorem 5.10. Because every player $j\in I$ uses deterministic MAP-SCoT, for every history $h\in H$ we have $f_j(h)=\delta_{a_j^\star(h)}$ for some $a_j^\star(h)\in A_j$. Hence for every player $i$ and every history $h$, $f_{-i}(h)=\delta_{a_{-i}^\star(h)}$ for some $a_{-i}^\star(h)\in A_{-i}$. For each player $i$, apply Lemma 5.9. There is a full-measure event on which there exists $T_i(z)<\infty$ such that for all $t\ge T_i(z)$,
$$\hat a_{-i}^t(h^t(z))\ =\ a_{-i}^\star(h^t(z)).$$
Because the player set $I$ is finite, the intersection of these full-measure events over all players still has measure one. Fix a realized path $z$ in that intersection. For any player $i$ and any $t\ge T_i(z)$, Definition 12 gives
$$f_i(h^t(z))\ =\ \delta_{b_i(\hat a_{-i}^t(h^t(z)))}\ =\ \delta_{b_i(a_{-i}^\star(h^t(z)))}.$$
By definition of the pure best-response selector $b_i$,
$$b_i(a_{-i})\ \in\ \arg\max_{a_i\in A_i}u_i(a_i,a_{-i})\qquad\text{for every }a_{-i}\in A_{-i}.$$
Therefore
$$\delta_{b_i(a_{-i}^\star(h^t(z)))}\ \in\ br_i\big(\delta_{a_{-i}^\star(h^t(z))}\big)\ =\ br_i\big(f_{-i}(h^t(z))\big).$$
So for every player $i$ and all $t\ge T_i(z)$, $f_i(h^t(z))\in br_i(f_{-i}(h^t(z)))$. Define $T(z):=\max_{i\in I}T_i(z)<\infty$. Then for all $t\ge T(z)$ and every player $i$, $f_i(h^t(z))\in br_i(f_{-i}(h^t(z)))$. By Definition 10, this means that $f(h^t(z))$ is a stage Nash equilibrium for all $t\ge T(z)$. ∎

Proof of Corollary 5.11. By Lemma 5.1, Assumption 2 implies that every player learns to predict the path of play under $f$ in the sense of Definition 9. Theorem 5.10 therefore applies directly. ∎

Appendix C Bounded-memory strategies and finite-state reduction

Many practical agent policies (including menu-based planners) depend only on a bounded window of recent interaction. Following the bounded-recall restriction in [43], we formalize this as a bounded-memory condition. For a history $h=(a^1,\dots,a^{t-1})\in H$ let $|h|:=t-1$ denote its length. For $\kappa\in\mathbb{N}$, define
$$\mathrm{suffix}_\kappa(h)\ :=\ (a^{t-\min\{\kappa,t-1\}},\dots,a^{t-1})\ \in\ \bigcup_{m=0}^{\kappa}A^m,$$
i.e., the last $\min\{\kappa,|h|\}$ joint actions of $h$ (with $\mathrm{suffix}_\kappa(\varnothing)=\varnothing$).

Definition 16 ($\kappa$-memory (bounded-recall) strategy).
A strategy $f_i : H \to \Delta(A_i)$ has memory at most $\kappa$ if for all histories $h, h' \in H$,
$$\mathrm{suffix}_\kappa(h) = \mathrm{suffix}_\kappa(h') \implies f_i(h) = f_i(h').$$
Let $\mathcal{F}_i^\kappa \subseteq \mathcal{F}_i$ denote the set of $\kappa$-memory strategies for player $i$, and write $\mathcal{F}^\kappa := \prod_{i \in I} \mathcal{F}_i^\kappa$. Let
$$S_\kappa := \bigcup_{m=0}^{\kappa} A^m$$
be the finite set of action-suffixes of length at most $\kappa$. Define the deterministic state update map $T_\kappa : S_\kappa \times A \to S_\kappa$ by
$$T_\kappa(s, a) := \mathrm{suffix}_\kappa((s, a)),$$
i.e., append the new joint action $a$ to the suffix $s$ and keep the last $\kappa$ entries. For any play path $z = (a^1, a^2, \dots) \in H^\infty$, define the induced memory state at time $t$:
$$s^t(z) := \mathrm{suffix}_\kappa(h^t(z)) \in S_\kappa.$$

Lemma C.1 (Finite-state Markov property under bounded memory). If $f \in \mathcal{F}^\kappa$, then for every $t \ge 1$ and every history $h^t$ with $s = \mathrm{suffix}_\kappa(h^t)$, the next-period action distribution depends on $h^t$ only through $s$:
$$\mu^f(a^t = a \mid h^t) = \prod_{i \in I} f_i(s)(a_i).$$
Moreover, the induced state process satisfies $s^{t+1} = T_\kappa(s^t, a^t)$ almost surely, so $(s^t)_{t \ge 1}$ is a time-homogeneous Markov chain on $S_\kappa$.

Proof. Fix $t$ and history $h^t$. By Definition 2, $\mu^f(a^t = a \mid h^t) = \prod_{i \in I} f_i(h^t)(a_i)$. If $f \in \mathcal{F}^\kappa$, then $f_i(h^t) = f_i(\mathrm{suffix}_\kappa(h^t)) = f_i(s)$ for each $i$, giving the displayed equality. The state update is deterministic by construction of $T_\kappa$: $s^{t+1} = \mathrm{suffix}_\kappa(h^{t+1}) = \mathrm{suffix}_\kappa((h^t, a^t)) = T_\kappa(\mathrm{suffix}_\kappa(h^t), a^t) = T_\kappa(s^t, a^t)$. Thus $(s^t)$ is Markov with kernel induced by the conditional law of $a^t$ given $s^t$. ∎

Lemma C.2 (Continuation distributions depend only on the memory state). Let $g \in \mathcal{F}^\kappa$ and let $h, h' \in H$ satisfy $\mathrm{suffix}_\kappa(h) = \mathrm{suffix}_\kappa(h')$. Then the continuation play-path distributions coincide: $\mu^g_h = \mu^g_{h'}$.

Proof.
By Lemma C.1, the conditional distribution of the next action profile and all future evolution under $g$ depends on the past only through the current memory state $s = \mathrm{suffix}_\kappa(\cdot)$. Since $h$ and $h'$ induce the same state, the induced kernels for $(a^t, a^{t+1}, \dots)$ are identical from either starting history. Therefore the induced continuation measures coincide. ∎

C.1 Best responses to bounded-memory opponents are bounded-memory

A key benefit of bounded-memory opponents is that each player faces a finite-state discounted MDP in the continuation game. In particular, the best-response search in $BR_i^\varepsilon(g_{-i} \mid h^t)$ can be restricted without loss to bounded-memory policies.

Lemma C.3 (Markovian best responses to κ-memory opponents). Fix player $i$, a history $h^t$, and an opponents' continuation profile $g_{-i} \in \mathcal{F}_{-i}^\kappa$. Then there exists a best response $\sigma_i^\star \in BR_i(g_{-i} \mid h^t)$ that is stationary Markov with respect to the memory state. That is, there exists a map $\pi_i : S_\kappa \to \Delta(A_i)$ such that for every continuation history $\bar h \succeq h^t$,
$$\sigma_i^\star(\bar h) = \pi_i(\mathrm{suffix}_\kappa(\bar h)).$$
Consequently, for every $\varepsilon \ge 0$,
$$\sup_{\sigma_i \in \mathcal{F}_i(h^t)} V_i(\sigma_i \mid h^t; g_{-i}) = \sup_{\sigma_i \in \mathcal{F}_i^\kappa(h^t)} V_i(\sigma_i \mid h^t; g_{-i}),$$
and $BR_i(g_{-i} \mid h^t) \cap \mathcal{F}_i^\kappa(h^t) \neq \varnothing$.

Proof. Let $s_0 := \mathrm{suffix}_\kappa(h^t) \in S_\kappa$. Fix $g_{-i} \in \mathcal{F}_{-i}^\kappa$. Define a controlled Markov process on $S_\kappa$ as follows. In state $s$, the player chooses $a_i \in A_i$, the opponents' joint action is drawn as $a_{-i} \sim g_{-i}(s) \in \Delta(A_{-i})$, the stage payoff is $u_i(a_i, a_{-i})$, and the next state is $s' = T_\kappa(s, (a_i, a_{-i}))$. For any bounded function $v : S_\kappa \to \mathbb{R}$, define the Bellman operator $\mathcal{T}$ by
$$(\mathcal{T} v)(s) := \max_{\alpha \in \Delta(A_i)} \mathbb{E}_{\substack{a_i \sim \alpha \\ a_{-i} \sim g_{-i}(s)}}\Big[(1 - \lambda_i)\, u_i(a_i, a_{-i}) + \lambda_i\, v\big(T_\kappa(s, (a_i, a_{-i}))\big)\Big].$$
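To make the finite-state construction concrete, the operator $\mathcal{T}$ can be iterated to its fixed point by value iteration on $S_\kappa$. The sketch below is illustrative only, not code from the paper: the 2×2 stage payoffs, the "mimic the opponent's last own action" opponent rule, and the parameters $\kappa = 1$, $\lambda_i = 0.9$ are all hypothetical choices.

```python
from itertools import product

# Hypothetical 2x2 stage game: player i's payoffs over joint actions (a_i, a_-i).
A_i, A_minus = ["J", "F"], ["J", "F"]
u = {("J", "J"): 2.0, ("J", "F"): 0.0, ("F", "J"): 0.0, ("F", "F"): 1.0}
kappa, lam = 1, 0.9  # memory length kappa and discount factor lambda_i

# S_kappa: all joint-action suffixes of length <= kappa; () is the empty suffix.
states = [()] + [tuple(p) for m in range(1, kappa + 1)
                 for p in product(product(A_i, A_minus), repeat=m)]

def T_kappa(s, a):
    """Deterministic state update: append joint action a, keep the last kappa entries."""
    return tuple((s + (a,))[-kappa:])

def g_minus(s):
    """Hypothetical kappa-memory opponent: copy player i's last action, else uniform."""
    if s:
        return {s[-1][0]: 1.0}
    return {a: 1.0 / len(A_minus) for a in A_minus}

# Value iteration: iterate the Bellman operator until the sup-norm gap vanishes.
V = {s: 0.0 for s in states}
for _ in range(500):
    V_new = {}
    for s in states:
        best = float("-inf")
        for ai in A_i:  # pure actions suffice: the objective is linear in alpha
            val = sum(prob * ((1 - lam) * u[(ai, am)] + lam * V[T_kappa(s, (ai, am))])
                      for am, prob in g_minus(s).items())
            best = max(best, val)
        V_new[s] = best
    gap = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if gap < 1e-12:
        break
```

Against this mimicking opponent, playing J forever from a state whose last own action is J locks in the (J, J) payoff, so the fixed point assigns that state the value 2.0.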
Because $\lambda_i \in (0,1)$, $\mathcal{T}$ is a contraction in $\|\cdot\|_\infty$: for any $v, w$ and any $s$,
$$|(\mathcal{T} v)(s) - (\mathcal{T} w)(s)| \le \max_\alpha \mathbb{E}\big[\lambda_i\, |v(s') - w(s')|\big] \le \lambda_i \|v - w\|_\infty.$$
Hence $\mathcal{T}$ has a unique fixed point $V^\star : S_\kappa \to \mathbb{R}$. For each $s$, the maximization over $\alpha \in \Delta(A_i)$ attains its maximum because $\Delta(A_i)$ is compact and the objective is continuous and linear in $\alpha$. Fix a maximizer $\pi_i(s) \in \Delta(A_i)$ for each $s$ and define the associated policy evaluation operator
$$(\mathcal{T}_{\pi_i} v)(s) := \mathbb{E}_{\substack{a_i \sim \pi_i(s) \\ a_{-i} \sim g_{-i}(s)}}\Big[(1 - \lambda_i)\, u_i(a_i, a_{-i}) + \lambda_i\, v\big(T_\kappa(s, (a_i, a_{-i}))\big)\Big].$$
Then $(\mathcal{T}_{\pi_i} V^\star)(s) = (\mathcal{T} V^\star)(s) = V^\star(s)$ for all $s$, so $V^\star$ is a fixed point of $\mathcal{T}_{\pi_i}$. Since $\mathcal{T}_{\pi_i}$ is also a $\lambda_i$-contraction, its fixed point is unique; denote it by $V^{\pi_i}$. We conclude $V^{\pi_i} = V^\star$. Now define $\sigma_i^\star$ to be the stationary Markov continuation strategy induced by $\pi_i$, i.e. $\sigma_i^\star(\bar h) = \pi_i(\mathrm{suffix}_\kappa(\bar h))$ for all $\bar h \succeq h^t$. By construction, the induced continuation value from $h^t$ is $V_i(\sigma_i^\star \mid h^t; g_{-i}) = V^\star(s_0)$. It remains to show optimality against all continuation strategies, including those with unbounded memory. Let $\sigma_i$ be any continuation strategy and define its statewise value envelope
$$W_{\sigma_i}(s) := \sup\big\{ V_i(\sigma_i \mid \bar h; g_{-i}) : \bar h \succeq h^t,\ \mathrm{suffix}_\kappa(\bar h) = s \big\}.$$
Fix any $s$ and $\epsilon > 0$, and choose $\bar h$ with $\mathrm{suffix}_\kappa(\bar h) = s$ and $V_i(\sigma_i \mid \bar h; g_{-i}) \ge W_{\sigma_i}(s) - \epsilon$. Let $\alpha := \sigma_i(\bar h) \in \Delta(A_i)$ be the first-step mixed action.
Conditioning on the first joint action $(a_i, a_{-i})$ and using that the next state is $s' = T_\kappa(s, (a_i, a_{-i}))$, we have
$$V_i(\sigma_i \mid \bar h; g_{-i}) = \mathbb{E}\big[(1 - \lambda_i) u_i(a_i, a_{-i}) + \lambda_i\, V_i(\sigma_i \mid (\bar h, (a_i, a_{-i})); g_{-i})\big] \le \mathbb{E}\big[(1 - \lambda_i) u_i(a_i, a_{-i}) + \lambda_i\, W_{\sigma_i}(s')\big].$$
Therefore,
$$W_{\sigma_i}(s) - \epsilon \le \mathbb{E}_{\substack{a_i \sim \alpha \\ a_{-i} \sim g_{-i}(s)}}\big[(1 - \lambda_i) u_i(a_i, a_{-i}) + \lambda_i W_{\sigma_i}(T_\kappa(s, (a_i, a_{-i})))\big] \le (\mathcal{T} W_{\sigma_i})(s).$$
Letting $\epsilon \downarrow 0$ gives $W_{\sigma_i} \le \mathcal{T} W_{\sigma_i}$ pointwise. By monotonicity of $\mathcal{T}$ and contraction, iterating yields $W_{\sigma_i} \le \mathcal{T}^n W_{\sigma_i}$ for all $n$, and $\mathcal{T}^n W_{\sigma_i} \to V^\star$ uniformly as $n \to \infty$. Hence $W_{\sigma_i}(s) \le V^\star(s)$ for all $s$, and in particular
$$V_i(\sigma_i \mid h^t; g_{-i}) \le W_{\sigma_i}(s_0) \le V^\star(s_0) = V_i(\sigma_i^\star \mid h^t; g_{-i}).$$
Thus $\sigma_i^\star$ is a best response. The final displayed equality of suprema follows because an optimal policy exists within $\mathcal{F}_i^\kappa(h^t)$. ∎

C.2 A checkable KL-separation condition under bounded memory

Assumption 3-(3) (on-path KL separation) is stated for general history-dependent strategies. Under bounded memory, it reduces to a state-frequency condition.

Lemma C.4 (State-frequency decomposition of on-path KL averages). Fix player $i$, $\kappa \in \mathbb{N}$, and $f_{-i}, g_{-i} \in \mathcal{F}_{-i}^\kappa$. For a realized path $z$, define $s^t(z) = \mathrm{suffix}_\kappa(h^t(z))$ and empirical state frequencies
$$\hat\pi_T^z(s) := \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{s^t(z) = s\}, \qquad s \in S_\kappa.$$
Then for every $T$ and every $z$,
$$\frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(f_{-i}(h^t(z)) \,\big\|\, g_{-i}(h^t(z))\big) = \sum_{s \in S_\kappa} \hat\pi_T^z(s)\, D_{\mathrm{KL}}\big(f_{-i}(s) \,\big\|\, g_{-i}(s)\big).$$
In particular, for any fixed state $s$,
$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(f_{-i}(h^t(z)) \,\big\|\, g_{-i}(h^t(z))\big) \ge \Big(\liminf_{T \to \infty} \hat\pi_T^z(s)\Big) \cdot D_{\mathrm{KL}}\big(f_{-i}(s) \,\big\|\, g_{-i}(s)\big).$$

Proof. If $f_{-i}, g_{-i} \in \mathcal{F}_{-i}^\kappa$, then for each $t$ we have $f_{-i}(h^t(z)) = f_{-i}(s^t(z))$ and $g_{-i}(h^t(z)) = g_{-i}(s^t(z))$ by Definition 16. Therefore,
$$\frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(f_{-i}(h^t(z)) \,\big\|\, g_{-i}(h^t(z))\big) = \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(f_{-i}(s^t(z)) \,\big\|\, g_{-i}(s^t(z))\big).$$
Grouping the sum by the value of $s^t(z)$ yields the stated decomposition. The inequality follows by lower-bounding the sum by a single state's contribution and taking $\liminf$. ∎

Corollary C.5 (A sufficient condition for Assumption 3(3)). Fix player $i$ and suppose $S_{-i} \subseteq \mathcal{F}_{-i}^\kappa$. Fix $g_{-i} \in S_{-i} \setminus \{f_{-i}\}$ and a state $s \in S_\kappa$ such that $D_{\mathrm{KL}}(f_{-i}(s) \| g_{-i}(s)) > 0$. If, $\mu^f$-a.s. in $z$,
$$\liminf_{T \to \infty} \hat\pi_T^z(s) \ge \rho_i(g_{-i}) > 0,$$
then the on-path KL separation condition in Assumption 3(3) holds for this $g_{-i}$ with $\kappa_i(g_{-i}) = \rho_i(g_{-i}) \cdot D_{\mathrm{KL}}(f_{-i}(s) \| g_{-i}(s))$.

Proof. Immediate from Lemma C.4. ∎

All statements in Sections 4–5 are formulated on the full history space $H$ and therefore apply verbatim when the realized profile $f$ (and/or the menu strategies in Assumption 3) lie in $\mathcal{F}^\kappa$. The main additions above are: (i) best responses to κ-memory opponents can be taken to be stationary Markov (Lemma C.3), and (ii) Assumption 3(3) can be verified by state-frequency separation (Lemma C.4 and Corollary C.5). Once Assumption 3 is verified (e.g. via Corollary C.5), the proofs of Lemma 4.2, Proposition 4.3, and Corollary 5.4 are unchanged.

Appendix D Implementation details of the strategy-level PS-BR planner

This appendix details the implementation used in our experiments.
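As a high-level orientation before the component-by-component details, the planner's per-round loop can be sketched as follows. Everything concrete here is hypothetical: the PD-style payoffs, the three-label menu, and the uniform random draw standing in for the llm-label/likelihood inference step are illustrative stand-ins, not the experiment code.

```python
import random

random.seed(0)

# Hypothetical PD-style payoffs for player i over joint actions (own, opp).
u = {("J", "J"): 3.0, ("J", "F"): 0.0, ("F", "J"): 4.0, ("F", "F"): 1.0}

def p_J(label, history):
    """Round probability of playing J for a menu strategy, given (own, opp) history."""
    if label == "allc":
        return 1.0
    if label == "tft":  # cooperate first, then mirror the opponent's last action
        return 1.0 if not history or history[-1][1] == "J" else 0.0
    return 0.0  # "alld"

def rollout_value(s_i, s_opp, history, t, T, gamma, K):
    """Monte Carlo estimate of the discounted rollout value of s_i vs a fixed s_opp."""
    total = 0.0
    for _ in range(K):
        h_i = list(history)                 # player-i view: (own, opp) tuples
        h_o = [(b, a) for a, b in history]  # opponent view: tuple order swapped
        for r in range(t, T + 1):
            ai = "J" if random.random() < p_J(s_i, h_i) else "F"
            ao = "J" if random.random() < p_J(s_opp, h_o) else "F"
            total += gamma ** (r - t) * u[(ai, ao)]
            h_i.append((ai, ao))
            h_o.append((ao, ai))
    return total / K

# One real decision round: sample one opponent-strategy hypothesis, hold it fixed
# across all rollouts, and play according to the best-valued menu strategy.
menu, history = ["allc", "tft", "alld"], [("J", "J")]
s_opp = random.choice(menu)  # stand-in for LLM / likelihood strategy inference
best = max(menu, key=lambda s: rollout_value(s, s_opp, history,
                                             t=2, T=10, gamma=0.95, K=200))
```

Note that, as in the implementation described below, the sampled opponent hypothesis is drawn once per real round and reused across all `K` rollout samples.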
At each round, an agent samples a latent opponent strategy from its inference based on the previous history, evaluates candidate self-strategies by rollout, and plays the current action induced by the best rollout-value strategy.

D.1 Opponent strategy sampling

Fix player $i$ at round $t$ with local history $h_i^t = \big((a_i^1, a_{-i}^1), \dots, (a_i^{t-1}, a_{-i}^{t-1})\big)$. For opponent-strategy inference, the implementation rewrites this to the opponent-view history
$$\tilde h_{-i}^t = \big((a_{-i}^1, a_i^1), \dots, (a_{-i}^{t-1}, a_i^{t-1})\big),$$
so each tuple is ordered as (opponent action, your action). The opponent strategy inference is performed once per real decision round (with configured label-sampling temperature) and then held fixed across all $K$ rollout samples used to evaluate candidate self-strategies at that round. Inference supports two modes:
• llm-label (default): construct an in-context prompt containing the game rules, observed history, and the allowed strategy labels (with short descriptions), then ask the model to output exactly one label. Parsing is label-constrained; if parsing fails repeatedly, a deterministic label fallback is used.
• likelihood: infer from a hand-coded likelihood over the menu (described below), with no model call.

llm-label mode details. In llm-label mode, if the model call itself fails, the implementation falls back to likelihood mode for that decision round. The template used in code is:

rules_text
Observed action history tuple format: (opponent action, your action). Infer the opponent strategy from the FIRST action in each tuple.
Round 1: opp_action_1, self_action_1
Round 2: opp_action_2, self_action_2
...
You are inferring the opponent strategy in repeated game_name. Observed rounds so far: observed_rounds. Objective: sample one opponent strategy label according to your posterior belief over allowed labels.
Estimate that posterior using ALL observed rounds (do not ignore older rounds), and focus on recent patterns. The opponent may change strategy over time; if you detect a shift, prioritize the most recent consistent behavior while still accounting for earlier rounds. Internally assign a compatibility score from 0 to 100 to every allowed label, convert them into relative posterior weights, and sample exactly one final label from those weights. Output rule: do NOT output scores, reasoning, or ranking. Respond with exactly one label only. **Output only the label.**
Allowed labels:
- label_1: description_1
- label_2: description_2
...

where game_name is the active repeated-game name (e.g., BoS, PD, Promo, Samaritan's dilemma, or Lemons), and observed_rounds $= t-1$. When collusive-prior guidance is enabled (--collusive-mode), the prompt appends a strong-prior line. In our code this prior is mad0 for Promo opponent 1 and mad1 for Promo opponent 2.

Likelihood-mode details. To score strategy $s$, the implementation evaluates history under the opponent's perspective $\tilde h_{-i}^t = \big((a_{-i}^1, a_i^1), \dots, (a_{-i}^{t-1}, a_i^{t-1})\big)$:
$$\log L_t(s) = \sum_{u=1}^{t-1} \log\big(\mathbf{1}\{a_{-i}^u = J\}\, p_s^u + \mathbf{1}\{a_{-i}^u = F\}\, (1 - p_s^u)\big),$$
with clipping to $[10^{-6}, 1 - 10^{-6}]$ for numerical stability. Given temperature $\tau > 0$ (implemented as $\tau = \max\{\texttt{sample\_temperature}, 10^{-5}\}$), weights are
$$w_t(s) \propto \exp\big(\log L_t(s) / \tau\big),$$
and one opponent strategy is sampled from this categorical distribution.

D.2 Rollout value and strategy selection

Given a sampled opponent strategy $s_{-i}$, for every candidate self-strategy $s_i \in M_g$, the planner rolls out from round $t$ to $\bar t$, where
$$\bar t = \begin{cases} \min\{T,\ t + H - 1\}, & H > 0, \\ T, & H = 0, \end{cases}$$
$T$ is the game horizon, and $H$ is the planning horizon.
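The likelihood-mode scoring and temperature-scaled sampling just described can be sketched as follows; the two-label menu and the observed action sequence are hypothetical illustrations.

```python
import math
import random

random.seed(1)

def strategy_weights(opp_actions, menu_probs, tau):
    """Temperature-scaled weights w_t(s) proportional to exp(log L_t(s) / tau).

    opp_actions: observed opponent actions ("J"/"F") for rounds 1..t-1.
    menu_probs: label -> list of per-round probabilities p_s^u of playing J.
    """
    tau = max(tau, 1e-5)  # mirrors the max(sample_temperature, 1e-5) clamp
    logL = {}
    for label, probs in menu_probs.items():
        total = 0.0
        for a, p in zip(opp_actions, probs):
            p = min(max(p, 1e-6), 1.0 - 1e-6)  # clip for numerical stability
            total += math.log(p if a == "J" else 1.0 - p)
        logL[label] = total
    m = max(logL.values())  # shift before exponentiating, for stability
    w = {s: math.exp((v - m) / tau) for s, v in logL.items()}
    Z = sum(w.values())
    return {s: v / Z for s, v in w.items()}

# Hypothetical menu: an opponent who played J three times is almost surely
# better explained by insist_j (p = 0.9) than insist_f (p = 0.1).
menu = {"insist_j": [0.9] * 3, "insist_f": [0.1] * 3}
w = strategy_weights(["J", "J", "J"], menu, tau=1.0)
label = random.choices(list(w), weights=list(w.values()), k=1)[0]
```

Raising `tau` flattens the weights toward uniform, which is exactly the role the sampling temperature plays in the implementation.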
For rollout sample $m \in \{1, \dots, K\}$, at each simulated round $r$, actions are sampled from the fixed opponent strategy $s_{-i}$ and the currently evaluated candidate $s_i$:
$$\hat a_i^{r,m} \sim \mathrm{Bernoulli}\big(p_{s_i}^r\big), \qquad \hat a_{-i}^{r,m} \sim \mathrm{Bernoulli}\big(p_{s_{-i}}^r\big),$$
where $p_{s_i}^r$ and $p_{s_{-i}}^r$ are the round-$r$ probabilities of action J induced by $s_i$ and $s_{-i}$ under the simulated history prefix generated so far. The rollout value for candidate $s_i$ against sampled opponent strategy $s_{-i}$ is
$$V_i^{(m)}(s_i \mid s_{-i}) = \sum_{r=t}^{\bar t} \gamma^{\,r-t}\, u_i\big(\hat a_i^{r,m}, \hat a_{-i}^{r,m}\big),$$
with discount $\gamma$. The estimated value of strategy $s_i$ is
$$\bar V_i(s_i \mid s_{-i}) = \frac{1}{K} \sum_{m=1}^{K} V_i^{(m)}(s_i \mid s_{-i}),$$
and the chosen strategy is
$$s_i^\star \in \arg\max_{s_i} \bar V_i(s_i \mid s_{-i}),$$
with deterministic hash-based tie-breaking when needed. The executed action at real round $t$ is then sampled from $s_i^\star$ at the current history.

Algorithm 1 Strategy-level PS-BR loop for two-player games
1: game $g$, total rounds $T$, menu $M_g$, samples $K$, horizon $H$, discount $\gamma$, temperature $\tau$, inference mode $\in$ {llm-label, likelihood}
2: Initialize $h^1 \leftarrow \varnothing$, $x_1^1 \leftarrow (h^1, \varnothing)$, $x_2^1 \leftarrow (h^1, \varnothing)$, $C_1 \leftarrow 0$, and $C_2 \leftarrow 0$
3: for $t = 1, \dots, T$ do
4:  for $i \in \{1, 2\}$ do
5:   Let $x_i^t = (h^t, r_i^{1:t-1})$ be player $i$'s current local history
6:   Construct opponent-view history $\tilde h_{-i}^t$ by swapping tuple order in the public history $h^t$
7:   Infer one strategy label $s_{-i} \in M_g$ from rules and history $\tilde h_{-i}^t$
8:   for all $s_i \in M_g$ do
9:    for $k = 1, \dots, K$ do
10:     $V_i^{(k)}(s_i \mid s_{-i}) \leftarrow \mathrm{RolloutValue}(g, i, s_i, s_{-i}, x_i^t, t, T, H, \gamma)$
11:    $\bar V_i(s_i \mid s_{-i}) \leftarrow \frac{1}{K} \sum_{k=1}^{K} V_i^{(k)}(s_i \mid s_{-i})$
12:   $s_i^\star \leftarrow \arg\max_{s_i \in M_g} \bar V_i(s_i \mid s_{-i})$ ⊳ deterministic tie-break
13:   Sample real action $a_i^t$ from strategy $s_i^\star$ at history $x_i^t$
14:  Sample realized rewards $(r_1^t, r_2^t)$ from the environment payoff law at $(a_1^t, a_2^t)$
15:  $C_1 \leftarrow C_1 + r_1^t$ and $C_2 \leftarrow C_2 + r_2^t$
16:  Set $h^{t+1} \leftarrow (h^t, (a_1^t, a_2^t))$
17:  Set $x_1^{t+1} \leftarrow (h^{t+1}, r_1^{1:t})$ and $x_2^{t+1} \leftarrow (h^{t+1}, r_2^{1:t})$

For Experiment 3, the environment payoff law in Algorithm 1 is the known Gaussian noise family centered at the true mean matrix. On the player's own side, player $i$ additionally samples $\tilde m_i \sim \pi_i^t(\cdot \mid x_i^t)$, rollout values are computed under $\tilde m_i$ in place of the true $u_i$, and player $i$'s local information history stores only $(h^t, r_i^{1:t-1})$; in particular, the update step above never reveals or conditions on $r_{-i}^{1:t-1}$.

Appendix E Social chain-of-thought prompting (SCoT)

This appendix discusses how the social chain-of-thought (SCoT) prompting intervention of [3] can be viewed as a particularly simple instance of PS-BR.

E.1 SCoT as a two-stage "predict-then-act" operator

In [3], SCoT is implemented by prompt-chaining in each round of a repeated game:
1. Prediction prompt (belief elicitation). Given the public history $h^t$, the model is asked to predict the opponent's next move (or, more generally, to describe what the other player will do next).
2. Action prompt (best response to the elicited belief). The model is then asked to choose its action given the predicted opponent move, typically phrased as "given your prediction, what is best for you to do now?"
This "separate belief report, then act" structure forces an explicit theory-of-mind step before action selection, and empirically improves coordination in some repeated games.

E.2 Mapping SCoT as a special case of PS-BR

Fix agent $i$ at history $h^t$. Let $A_{-i}$ denote the opponents' joint action space, and define the agent's posterior predictive over opponents' next action as $q_i^t(\cdot \mid h^t) \in \Delta(A_{-i})$.
In our paper's belief language, $q_i^t(\cdot \mid h^t)$ is the one-step marginal induced by the agent's posterior predictive continuation belief $f_{-i}^{i,t}|_{h^t}$. SCoT can then be expressed as the following generic operator:
1. Inference: produce $\tilde a_{-i}^t$ as an imputation of the missing opponents' next action. Operationally, this is obtained by querying the model with the prediction prompt.
2. Optimize given the imputation: choose $a_i^t$ as an (approximate) best response to the imputed $\tilde a_{-i}^t$ (and the known payoffs), e.g.
$$a_i^t \in \arg\max_{a_i \in A_i} u_i(a_i, \tilde a_{-i}^t) \quad \text{(myopic)}.$$
More generally, one may replace $u_i$ by the continuation objective, i.e., choose $a_i^t$ (or a continuation strategy) that maximizes the discounted value conditional on $\tilde a_{-i}^t$ and the induced continuation play.

Two special cases are worth separating because they clarify the relationship to PS-BR.
(i) Deterministic SCoT = point estimation. In the implementation studied by [3], the model is often run in a near-deterministic regime (e.g., decoding choices consistent with temperature $\approx 0$), so the prediction step behaves like a point estimate (roughly "MAP" under the model's implicit predictive distribution). In this view, SCoT is an inference-and-optimize heuristic that can still improve play by making the model's implicit prediction problem explicit.
(ii) Myopic PS-BR = sampling-based estimation. If instead the prediction prompt is decoded stochastically (e.g., sampling at nonzero temperature), then $\tilde a_{-i}^t$ becomes a draw from the model's own predictive distribution: $\tilde a_{-i}^t \sim q_i^t(\cdot \mid h^t)$.

Appendix F Prompts

F.1 Base prompts

In Base, each player's round-$t$ prompt is:
rules text + compact history + "You are currently playing round $t$" + action query.
The compact history prefix used in code is: Observed action history (your action, opponent action): Round 1: <self_1>, <opp_1> ... Round t-1: <self_t-1>, <opp_t-1> Round-level action query templates (Base). • BoS: Q: Which Option do you choose, J or F? A: • PD (order randomized each round): Q: Which action do you choose, J or F? A: • Harmony: Q: Which action do you choose, C or D? A: • Promo: Q: Which action do you choose, R, P, or Z? A: • Samaritan (Helper prompt): Q: Which action do you choose, H or N? A: • Samaritan (Recipient prompt): Q: Which action do you choose, W or S? A: • Lemons (Seller prompt): Q: Which action do you choose, HQ or LQ? A: • Lemons (Buyer prompt): Q: Which action do you choose, B or D? A: Before the final “A:” token, code injects a strategy-context block (same helper used in Base and SCoT): In repeated <GameName>, a strategy maps prior history to a player’s next action (possibly probabilistically). Allowed strategies: - <label_1>: <short description> - ... Role mapping in this prompt: - Player A is the other player. - Player B is you. Observed rounds so far: <t-1>. Context: full history prefix up to round <t-1>. Strongly expect Player A to play with strategy ’<prior_label>’. [if available] Allowed action tokens: <tokens>. [if available] Output rule: do NOT output scores, reasoning, or ranking. Respond with exactly one action only. F.2 SCoT prompts SCoT uses two prompts per player per round. Stage 1 (prediction prompt). The prediction queries are: • BoS: Q: Which action do you predict the other player will choose, J or F? A: • PD (order randomized each round): Q: Which action do you predict the other player will choose, J or F? A: • Harmony: Q: Which action do you predict the other player will choose, C or D? A: • Promo: Q: Which action do you predict the other player will choose, R, P, or Z? A: • Samaritan (Helper predicts Recipient): Q: Which action do you predict the other player will choose, W or S? 
A: • Samaritan (Recipient predicts Helper): Q: Which action do you predict the other player will choose, action H or action N? A: • Lemons (Seller predicts Buyer): Q: Which Option do you predict the other player will choose, Option B or Option D? A: • Lemons (Buyer predicts Seller): Q: Which Option do you predict the other player will choose, Option HQ or Option LQ? A: As implemented, the Stage-1 prediction prompt is enriched with the same strategy-context block shown above. Stage 2 (action prompt conditioned on Stage-1 prediction). After receiving prediction <PRED>, code uses: • BoS: Q: Given that you think the other player will choose Option <PRED> in round <t>, imagine the outcome for both of your possible actions (Option J and Option F), compare which gives you a better result, and then choose. Which Option do you think is the best to choose for you in this round, Option J or Option F? Output only one letter: J or F. A: • PD (with randomized <opt1>, <opt2>): Q: Given that you think the other player will choose Option <PRED> in round <t>, imagine the outcome for both of your possible actions (Option <opt1> and Option <opt2>), compare which gives you a better result, and then choose. Which Option do you think is the best to choose for you in this round, Option <opt1> or Option <opt2>? Output only one letter: J or F. A: • Harmony: Q: Given that you think the other player will choose <PRED> in round <t>, imagine the outcome for both of your possible actions (C and D), compare which gives you a better result, and then choose. Which action do you think is best for you in this round, C or D? Output only one action: C or D. A: • Promo: Q: Given that you think the other player will choose <PRED> in round <t>, imagine the outcome for your possible actions (R, P, and Z), compare which gives you a better result, and then choose. Which action do you think is best for you in this round, R, P, or Z? Output only one action: R, P, or Z. 
A: • Samaritan (Helper): Q: Given that you think the other player will choose Option <PRED> in round <t>, imagine the outcome for both of your possible actions (Option H and Option N), compare which gives you a better result, and then choose. Which Option do you think is best to choose for you in this round, Option H or Option N? Output only one letter: H or N. A: • Samaritan (Recipient): Q: Given that you think the other player will choose Option <PRED> in round <t>, imagine the outcome for both of your possible actions (Option W and Option S), compare which gives you a better result, and then choose. Which Option do you think is best to choose for you in this round, Option W or Option S? Output only one letter: W or S. A: • Lemons (Seller): Q: Given that you think the other player will choose Option <PRED> in round <t>, imagine the outcome for both of your possible actions (Option HQ and Option LQ), compare which gives you a better result, and then choose. Which Option do you think is best to choose for you in this round, Option HQ or Option LQ? Output only one letter: HQ or LQ. A: • Lemons (Buyer): Q: Given that you think the other player will choose Option <PRED> in round <t>, imagine the outcome for both of your possible actions (Option B and Option D), compare which gives you a better result, and then choose. Which Option do you think is best to choose for you in this round, Option B or Option D? Output only one letter: B or D. A: F.3 PS-BR prompts for known deterministic payoffs PS-BR does not query the LLM for direct action choice. Actions are produced by rollout-based strategy evaluation after sampling one opponent strategy per round. The prompt-facing LLM call is for opponent strategy-label inference in llm-label mode. Opponent strategy inference prompt (llm-label). 
At round $t$, for player $i$, the history is rewritten to the opponent view
$$\tilde h_{-i}^t = \big((a_{-i}^1, a_i^1), \dots, (a_{-i}^{t-1}, a_i^{t-1})\big),$$
so tuples are (Player A action, Player B action) with:
• Player A = opponent whose strategy is inferred.
• Player B = current decision-maker.
The prompt template is:

You are inferring Player A's strategy (the opponent) in repeated <GameName>. In a repeated-game setting, a strategy is a rule that maps prior history to the player's next action (possibly probabilistically).
<rules_text>
Observed rounds so far: <t-1>.
Allowed labels:
- <label_1>: <description_1>
- ...
Observed action history tuple format: (Player A action, Player B action). Player A is the opponent whose strategy label you must infer. Player B is you (the decision-maker).
Context: full history prefix up to round <...>. Target: observed Player A action at round <...>. Choose the allowed label that makes this observed Player A target most compatible with the context. At round <...>, use this mapping:
Context history as (Player A, Player B), rounds <...>:
round <k>: Player A=<...>, Player B=<...>
Observed target Player A action at round <...>: <...>
Strongly expect Player A to play with strategy '<prior_label>'.
Player A's strategy may have changed over time, so weigh recent rounds more heavily than earlier rounds.
Output rule: do NOT output scores, reasoning, or ranking. Respond with exactly one label only. **Output only the label.**

Likelihood mode (no prompt). If --strategy-inference likelihood is used, no LLM prompt is issued for strategy inference; the label is sampled from a hand-coded likelihood over the finite menu.

F.4 PS-BR prompts for unknown stochastic payoffs

Under the theorem-aligned implementation used for Experiment 3, PS-BR under unknown stochastic payoffs still samples both an opponent strategy hypothesis and a payoff hypothesis at each round before rollout-based strategy evaluation.
The opponent-strategy side is handled exactly as in the known deterministic-payoff case. The payoff side is not open-ended JSON inference. Instead, Experiment 3 uses the known-common-noise / unknown-mean construction from Section 6 and Section 7.4.1: player $i$ maintains a posterior over a finite menu $\mathcal{M}_{i,g}$ of candidate mean payoff matrices under the Gaussian noise family with known variance $\sigma_g^2$.

Opponent strategy inference prompt (llm-label). The opponent strategy is inferred from the joint action history, exactly as in the known deterministic payoffs case. The prompt template remains identical to the one detailed in the previous subsection.

Finite-menu Gaussian payoff posterior (experiment configuration). At round $t$, player $i$ updates
$$\pi_i^t(m \mid h^t, r_i^{1:t-1}) \propto \pi_i^0(m) \prod_{s=1}^{t-1} \phi\big(r_i^s;\ m(a^s),\ \sigma_g^2\big), \qquad m \in \mathcal{M}_{i,g},$$
where $\phi(\cdot;\ \mu, \sigma_g^2)$ is the Gaussian density and $r_i^s \mid a^s \sim \mathcal{N}(m(a^s), \sigma_g^2)$ under candidate mean matrix $m$. The implementation then samples one matrix label $\tilde m_i \sim \pi_i^t$ and evaluates continuation strategies against the induced payoff kernel $q_i^{\tilde m_i}(\cdot \mid a) = \mathcal{N}(\tilde m_i(a), \sigma_g^2)$.

Product structure of the menu. Although the theorem-level menu $\mathcal{M}_{i,g}$ is finite but large, it has product form over joint actions. With a product prior over the offsets $(k_a)_{a \in A}$ and the Gaussian likelihood above, the posterior factorizes by joint action. Operationally, the implementation therefore updates the discrete posterior for each action-specific offset $k_a \in K$ separately and samples a full mean matrix by drawing one offset for each joint action. This is exactly equivalent to sampling from the full finite menu, without explicitly enumerating all of its elements.

Likelihood mode (experiment configuration). In the reported Experiment 3 runs, --payoff-inference likelihood is used.
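This finite-menu update and sampling step can be sketched as follows; the two-candidate menu over a single joint action and all numeric values are hypothetical, not the experiment configuration.

```python
import math
import random

random.seed(2)

def menu_posterior(menu, rewards, actions, sigma):
    """Discrete posterior over candidate mean-payoff matrices under Gaussian noise.

    menu: label -> (prior weight, candidate mean matrix as dict joint-action -> mean).
    rewards/actions: realized own rewards r_i^s and joint actions a^s for s = 1..t-1.
    """
    logw = {}
    for label, (w0, m) in menu.items():
        ll = math.log(w0)
        for r, a in zip(rewards, actions):
            # log Gaussian density phi(r; m(a), sigma^2)
            ll += -0.5 * ((r - m[a]) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
        logw[label] = ll
    mmax = max(logw.values())  # shift before exponentiating, for stability
    w = {k: math.exp(v - mmax) for k, v in logw.items()}
    Z = sum(w.values())
    return {k: v / Z for k, v in w.items()}

# Hypothetical two-candidate menu: rewards near 3 should concentrate on "high".
candidates = {"low": (0.5, {("J", "J"): 1.0}),
              "high": (0.5, {("J", "J"): 3.0})}
post = menu_posterior(candidates, rewards=[2.9, 3.2, 2.8],
                      actions=[("J", "J")] * 3, sigma=1.0)
sampled = random.choices(list(post), weights=list(post.values()), k=1)[0]
```

A product-form menu would apply the same update independently per joint action, matching the factorized scheme described above.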
No LLM prompt is issued for payoff inference; the sampled mean-matrix label is drawn from the Gaussian posterior above. Opponent strategy inference is handled either by the llm-label prompt described above or by the corresponding likelihood mode, depending on the strategy-inference setting.

Heuristic prompt mode. An open-ended JSON payoff-table prompt can still be used as a heuristic variant, but it is not the theorem-aligned implementation analyzed in Section 6 and instantiated in Experiment 3.

Appendix G Game-specific strategy menus

Let $a_i^{t-1}$ and $a_{-i}^{t-1}$ denote own and opponent actions at round $t-1$. Then we consider:

(1) BoS strategy menu. Here $p_s^t$ denotes the probability of playing J at round $t$.
• insist_j: $p_s^t = 1$ for all $t$.
• insist_f: $p_s^t = 0$ for all $t$.
• wsls_bos: $p_s^1 = 0.5$; for $t \ge 2$, if $a_i^{t-1} = a_{-i}^{t-1}$ then repeat $a_i^{t-1}$, else switch from $a_i^{t-1}$.
• mlur: $p_s^1 = 0.5$; for $t \ge 2$, if $a_i^{t-1} = a_{-i}^{t-1}$ then repeat $a_i^{t-1}$, else $p_s^t = 0.5$.
• alternate_phase0: $p_s^t = 1$ on odd $t$, and $p_s^t = 0$ on even $t$.
• alternate_phase1: $p_s^t = 0$ on odd $t$, and $p_s^t = 1$ on even $t$.
• noisy_insist_j: $p_s^t = 0.9$ for all $t$.
• noisy_insist_f: $p_s^t = 0.1$ for all $t$.

(2) PD strategy menu. Here $p_s^t$ denotes the probability of playing J at round $t$.
• allc: $p_s^t = 1$ for all $t$.
• alld: $p_s^t = 0$ for all $t$.
• soft_allc: $p_s^t = 0.9$ for all $t$.
• soft_alld: $p_s^t = 0.1$ for all $t$.
• tft: $p_s^1 = 1$; for $t \ge 2$, $p_s^t = 1$ iff $a_{-i}^{t-1} = J$.
• wsls: $p_s^1 = 1$; for $t \ge 2$, if $a_i^{t-1} = a_{-i}^{t-1}$ then repeat $a_i^{t-1}$, else switch from $a_i^{t-1}$.
• soft_grim_trigger: $p_s^t = 0$ if the opponent played F in either of the previous two rounds; otherwise $p_s^t = 1$.
• grim_trigger: $p_s^t = 1$ until the opponent has played F at least once in the past; thereafter $p_s^t = 0$ forever.

(3) Harmony strategy menu.
Here $p_s^t$ denotes the probability of playing C at round $t$.
• allc: $p_s^t = 1$ for all $t$.
• alld: $p_s^t = 0$ for all $t$.
• tft: $p_s^1 = 1$; for $t \ge 2$, $p_s^t = 1$ iff $a_{-i}^{t-1} = C$.
• stft: $p_s^1 = 0$; for $t \ge 2$, $p_s^t = 1$ iff $a_{-i}^{t-1} = C$.
• generous_tft: $p_s^1 = 1$; for $t \ge 2$, if $a_{-i}^{t-1} = C$ then $p_s^t = 1$, else $p_s^t = 0.3$.
• grim_trigger: $p_s^t = 1$ until the opponent has played D at least once in the past; thereafter $p_s^t = 0$ forever.
• wsls_pavlov: $p_s^1 = 1$; for $t \ge 2$, if $a_i^{t-1} = a_{-i}^{t-1}$ then repeat $a_i^{t-1}$, else switch from $a_i^{t-1}$.
• random_pc: $p_s^t = 0.5$ for all $t$.

(4) Promo strategy menu (actions: R = regular, P = promotion, Z = punishment/price war).
• allR: play R at every round.
• allP: play P at every round.
• allZ: play Z at every round.
• soft_allR: play R with probability 0.9 and P with probability 0.1.
• soft_allP: play P with probability 0.9 and R with probability 0.1.
• mad0: cooperative path is odd-round P/even-round R; when a deviation from the prescribed phase path is detected, play Z for 2 rounds, then return to phase-0 alternation.
• mad1: cooperative path is odd-round R/even-round P; when a deviation from the prescribed phase path is detected, play Z for 2 rounds, then return to phase-1 alternation.
• grim_trigger: follow the phase-0 alternating path until the first deviation, then play Z forever.

(5) Samaritan's dilemma (Helper actions: H = Help, N = No-help; Recipient actions: W = Work, S = Shirk).

Helper strategy menu. Here $p_s^t$ denotes the probability the helper plays H at round $t$.
• always_help: $p_s^t = 1$ for all $t$.
• never_help: $p_s^t = 0$ for all $t$.
• tft_help: $p_s^1 = 1$; for $t \ge 2$, $p_s^t = 1$ iff $a_{-i}^{t-1} = W$.
• grim_forgive: $p_s^t = 0$ if the recipient played S in either of the previous two rounds; otherwise $p_s^t = 1$.
• grim_nohelp: p_s^t = 1 until the recipient has played S at least once in the past; thereafter p_s^t = 0 forever.
• wsls_helper: p_s^1 = 1; for t ≥ 2, if a_{-i}^{t-1} = W then repeat a_i^{t-1}, else switch from a_i^{t-1}.
• noisy_help: p_s^t = 0.9 for all t.
• noisy_nohelp: p_s^t = 0.1 for all t.

Recipient strategy menu. Here p_s^t denotes the probability the recipient plays W at round t.
• always_work: p_s^t = 1 for all t.
• always_shirk: p_s^t = 0 for all t.
• work_if_helped: p_s^1 = 0.5; for t ≥ 2, p_s^t = 1 iff a_{-i}^{t-1} = H.
• exploit_help: p_s^1 = 0.5; for t ≥ 2, p_s^t = 1 iff a_{-i}^{t-1} = N.
• grim_shirk_after_nohelp: p_s^t = 1 until the helper has played N at least once in the past; thereafter p_s^t = 0 forever.
• forgiving_work: p_s^1 = 1; for t ≥ 2, if a_{-i}^{t-1} = H then p_s^t = 1, else p_s^t = 0.3.
• noisy_work: p_s^t = 0.9 for all t.
• noisy_shirk: p_s^t = 0.1 for all t.

(6) Lemons (Seller actions: HQ = High-quality, LQ = Low-quality; Buyer actions: B = Buy, D = Don’t buy).

Seller strategy menu. Here p_s^t denotes the probability the seller plays HQ at round t.
• always_hq: p_s^t = 1 for all t.
• always_lq: p_s^t = 0 for all t.
• hq_if_bought_last: p_s^1 = 0.5; for t ≥ 2, p_s^t = 1 iff a_{-i}^{t-1} = B.
• grim_hq_until_boycott: p_s^t = 1 until the buyer has played D at least once in the past; thereafter p_s^t = 0 forever.
• lq_if_boycott_last: p_s^1 = 0.5; for t ≥ 2, p_s^t = 0 iff a_{-i}^{t-1} = D.
• grim_forgiving: p_s^t = 0 if the buyer played D in either of the previous two rounds; otherwise p_s^t = 1.
• noisy_hq: p_s^t = 0.9 for all t.
• noisy_lq: p_s^t = 0.1 for all t.

Buyer strategy menu. Here p_s^t denotes the probability the buyer plays B at round t.
• always_buy: p_s^t = 1 for all t.
• never_buy: p_s^t = 0 for all t.
• soft_always_buy: p_s^t = 0.9 for all t.
• soft_never_buy: p_s^t = 0.1 for all t.
• tft_buy: p_s^1 = 0.5; for t ≥ 2, p_s^t = 1 iff a_{-i}^{t-1} = HQ.
• generous_buy: p_s^1 = 1; for t ≥ 2, if a_{-i}^{t-1} = HQ then p_s^t = 1, else p_s^t = 0.3.
• grim_boycott: p_s^t = 1 until the seller has played LQ at least once in the past; thereafter p_s^t = 0 forever.
• grim_forgiving: p_s^t = 0 if the seller played LQ in either of the previous two rounds; otherwise p_s^t = 1.

Appendix H Promo game

H.1 Promo game [36]: alternating promotions with finite punishment

Lal (1990) studies repeated price competition in a market with two identical “national” brands that have loyal consumers and a third “local” brand with little or no loyalty. The local brand disciplines prices in the switching segment, creating a tension for the national brands between (i) extracting rents from loyal consumers via a high “regular” price and (ii) defending the switchers via temporary price cuts. A key result is that, even when the corresponding one-shot stage game has no Nash equilibrium, an alternating-promotions pattern – only one national brand is on promotion in a given period and the roles alternate over time – can arise as a pure-strategy Nash equilibrium of the infinite-horizon discounted game, supported by a credible number of punishment periods. To obtain a compact repeated-game benchmark, we discretize [36]’s richer price-choice problem into three representative regimes per firm:
• Regular (R): charge the high “regular” price.
• Promotion (P): charge the low promotional price.
• Punishment/price war (Z): charge a very low price used only in punishment phases.
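Several of the menus above share the same finite-punishment pattern; as a concrete illustration, the mad0 strategy from the Promo menu in Appendix G can be sketched in Python. This is a minimal sketch under stated assumptions: 1-based round indexing, the opponent's prescribed path taken as the complement of the phase-0 path, and deviations policed only on cooperative-path rounds. The class and function names are hypothetical, not from the paper.

```python
def own_path(t: int) -> str:
    """Phase-0 path for this player: P on odd rounds, R on even rounds."""
    return "P" if t % 2 == 1 else "R"


def opp_path(t: int) -> str:
    """Complementary path prescribed for the opponent (assumption)."""
    return "R" if t % 2 == 1 else "P"


class Mad0:
    """mad0 sketch: alternate per phase 0; punish detected deviations with Z."""

    PUNISH_LEN = 2  # punishment length, per the menu description

    def __init__(self) -> None:
        self.punish_left = 0
        self.in_punishment = False

    def act(self, t: int) -> str:
        # During a punishment phase, play Z; otherwise follow the phase-0 path.
        if self.punish_left > 0:
            self.punish_left -= 1
            self.in_punishment = True
            return "Z"
        self.in_punishment = False
        return own_path(t)

    def observe(self, t: int, opp_action: str) -> None:
        # Police deviations only on cooperative-path rounds (assumption);
        # an off-path opponent move triggers a fresh 2-round punishment.
        if not self.in_punishment and opp_action != opp_path(t):
            self.punish_left = self.PUNISH_LEN
```

For example, if the opponent deviates in round 1 and then returns to its prescribed path, mad0 plays Z for rounds 2–3 and resumes the phase-0 alternation from round 4 onward.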
The resulting 3×3 payoff matrix in Appendix 7 is a reduced-form encoding of the ordinal incentive structure: a unilateral promotion against a regular-price rival yields the highest current-period gain (the “temptation” payoff); simultaneous promotions are less profitable than alternating promotions; and outcomes involving Z are jointly bad, standing in for the “intense competition/price war” phase used to deter deviations. The canonical nontrivial Nash equilibrium is an alternating path: play (P,R) in odd rounds and (R,P) in even rounds (or vice versa). After any deviation from the prescribed phase, either switch to a punishment phase (e.g., (Z,Z)) for a fixed number of rounds and then return to the alternating path (as defined in [1]), or revert permanently to a low-payoff punishment regime (grim trigger). For sufficiently patient players, the discounted loss from the punishment phase outweighs the one-shot deviation gain, making the alternating-promotions path incentive compatible.
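The closing patience condition can be written schematically. With hypothetical labels (a one-shot deviation gain $g$, average per-period payoff $\bar{\pi}$ on the alternating path, per-period punishment payoff $\pi_Z$, discount factor $\delta$, and punishment length $T$), a one-shot deviation is unprofitable roughly when

```latex
g \;\le\; \sum_{k=1}^{T} \delta^{k}\,\bigl(\bar{\pi} - \pi_Z\bigr)
\;=\; \frac{\delta\,\bigl(1-\delta^{T}\bigr)}{1-\delta}\,\bigl(\bar{\pi} - \pi_Z\bigr).
```

As $\delta \to 1$ the right-hand side approaches $T\,(\bar{\pi} - \pi_Z)$, so a long enough punishment phase outweighs any bounded deviation gain. This is a schematic condition under the stated labels, not the paper's exact inequality.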