Paper deep dive

Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models

Daniel Hennes, Zun Li, John Schultz, Marc Lanctot

Year: 2026Venue: arXiv preprintArea: cs.GTType: PreprintEmbeddings: 95

Abstract

Abstract:Recent advances in multi-agent reinforcement learning, particularly Policy-Space Response Oracles (PSRO), have enabled the computation of approximate game-theoretic equilibria in increasingly complex domains. However, these methods rely on deep reinforcement learning oracles that produce `black-box' neural network policies, making them difficult to interpret, trust or debug. We introduce Code-Space Response Oracles (CSRO), a novel framework that addresses this challenge by replacing RL oracles with Large Language Models (LLMs). CSRO reframes the best response computation as a code generation task, prompting an LLM to generate policies directly as human-readable code. This approach not only yields inherently interpretable policies but also leverages the LLM's pretrained knowledge to discover complex, human-like strategies. We explore multiple ways to construct and enhance an LLM-based oracle: zero-shot prompting, iterative refinement and \emph{AlphaEvolve}, a distributed LLM-based evolutionary system. We demonstrate that CSRO achieves performance competitive with baselines while producing a diverse set of explainable policies. Our work presents a new perspective on multi-agent learning, shifting the focus from optimizing opaque policy parameters to synthesizing interpretable algorithmic behavior.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 1:06:51 AM

Summary

Code-Space Response Oracles (CSRO) is a framework for multi-agent reinforcement learning that replaces traditional 'black-box' neural network policies with interpretable, human-readable code generated by Large Language Models (LLMs). By reframing best-response computation as a program synthesis task, CSRO enables the discovery of complex strategies while maintaining explainability. The framework supports iterative refinement through techniques like zero-shot prompting, linear refinement, and AlphaEvolve, demonstrating competitive performance against established baselines in games like Repeated Rock-Paper-Scissors and Leduc Poker.

Entities (6)

CSRO · framework · 100%LLM · technology · 100%PSRO · algorithm · 100%AlphaEvolve · methodology · 95%Leduc hold’em · environment · 95%Repeated Rock-Paper-Scissors · environment · 95%

Relation Signals (4)

CSRO → evaluatedon → Repeated Rock-Paper-Scissors

confidence 95% · We evaluate our framework on two standard analytical games... Repeated Rock-Paper-Scissors (RRPS)

CSRO → replaces → RL oracles

confidence 95% · CSRO, a novel framework that addresses this challenge by replacing RL oracles with Large Language Models (LLMs).

CSRO → generalizes → LLM-PSRO

confidence 90% · CSRO generalizes a recent demonstration called LLM-PSRO (Bachrach et al., 2025).

CSRO → utilizes → AlphaEvolve

confidence 90% · We demonstrate that CSRO achieves performance competitive with baselines while producing a diverse set of explainable policies... integrating evolutionary refinement method, such as AlphaEvolve

Cypher Suggestions (2)

Find all frameworks and their associated methodologies · confidence 90% · unvalidated

MATCH (f:Framework)-[:UTILIZES]->(m:Methodology) RETURN f.name, m.name

Identify environments used for evaluating specific frameworks · confidence 90% · unvalidated

MATCH (f:Framework)-[:EVALUATED_ON]->(e:Environment) RETURN f.name, e.name

Full Text

94,978 characters extracted from source content.

Expand or collapse full text

2026-3-12 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Daniel Hennes *1 , Zun Li *1 , John Schultz 1 and Marc Lanctot 1 * Equal contributions, 1 Google DeepMind Recent advances in multi-agent reinforcement learning, particularly Policy-Space Response Oracles (PSRO), have enabled the computation of approximate game-theoretic equilibria in increasingly complex domains. However, these methods rely on deep reinforcement learning oracles that produce ‘black-box’ neural network policies, making them difficult to interpret, trust or debug. We introduce Code-Space Response Oracles (CSRO), a novel framework that addresses this challenge by replacing RL oracles with Large Language Models (LLMs). CSRO reframes the best response computation as a code generation task, prompting an LLM to generate policies directly as human-readable code. This approach not only yields inherently interpretable policies but also leverages the LLM’s pretrained knowledge to discover complex, human-like strategies. We explore multiple ways to construct and enhance an LLM-based oracle: zero-shot prompting, iterative refinement and AlphaEvolve, a distributed LLM-based evolutionary system. We demonstrate that CSRO achieves performance competitive with baselines while producing a diverse set of explainable policies. Our work presents a new perspective on multi-agent learning, shifting the focus from optimizing opaque policy parameters to synthesizing interpretable algorithmic behavior. 1. Introduction The strategic interaction of multiple autonomous agents is a cornerstone of artificial intelligence, with applications ranging from economic modeling and autonomous driving to cybersecurity and recreational games (Shoham and Leyton-Brown, 2008). A central goal in this field is to develop algorithms that compute robust and effective strategies. The predominant theoretical framework for analyzing such systems is game theory. However, computing exact equilibria in large, complex games is often intractable, motivating the development of iterative methods to find approximate solutions. Among the most successful of these methods are Policy-Space Response Oracles (PSRO) (Bighashdel et al., 2024; Lanctot et al., 2017). PSRO and its variants have achieved state-of-the-art performance in notoriously difficult games, such as Barrage Stratego (Mcaleer et al., 2020) and StarCraft (Vinyals et al., 2019). PSRO iteratively builds a set of policies by computing a best response to the current meta-strategy. Standard implementations rely on deep reinforcement learning (RL) oracles, which train a neural network to approximate a best response through extensive interaction with a population of opponents. While powerful, this approach results in policies that are opaque, “black-box” models. This lack of interpretability prevents strategy verification, and forms a significant barrier to deploying such agents in high-stakes, real-world applications where explainability is crucial. Furthermore, training these RL oracles is often sample-inefficient, requiring millions or billions of game simulations to converge. To address the trade-off between performance and interpretability, we propose Code-Space Response Oracles (CSRO). We reframe best-response computation as a program synthesis task, replacing the Deep RL oracle with a Large Language Model (LLM) that generates policies as executable source code. By providing the LLM with the game rules, an API for interacting with the environment, Corresponding author(s): hennes@google.com, lizun@google.com © 2026 Google DeepMind. All rights reserved arXiv:2603.10098v1 [cs.GT] 10 Mar 2026 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models and descriptions of opponent strategies, we prompt it to write a programmatic best response. CSRO leverages the vast knowledge of logic, planning, and strategy embedded within pre-trained LLMs, enabling them to generate sophisticated, human-like policies, dramatically reducing the need for costly exploration. CSRO generalizes a recent demonstration called LLM-PSRO (Bachrach et al., 2025). LLM-PSRO relies on open-loop “best-of-N” sampling using full opponent source code, CSRO introduces two critical extensions for scalability and robustness. First, we replace passive sampling with an iterative refinement loop (e.g., via evolutionary algorithms) that actively optimizes strategies based on meta- game performance. Second, we address context limits via context abstraction; rather than ingesting raw code from all opponents, CSRO summarizes strategies in natural language and filters inputs. This allows our method to scale to complex games where full source code would exceed context windows. This paper makes several contributions, primarily introducing CSRO as a novel framework utilizing LLMs as code-generating oracles to compute approximate equilibria. We demonstrate that while LLMs can generate strategies zero-shot, performance is significantly enhanced by integrating evolutionary refinement method, such as AlphaEvolve (Novikov et al., 2025; Romera-Paredes et al., 2024), to iteratively optimize code based on closed-loop feedback. This approach yields policies that are fully interpretable, represented by commented source code. Finally, unlike prior work limited to internal comparisons (Bachrach et al., 2025), we rigorously validate CSRO by benchmarking against standardized external populations and game-theoretic solvers, showing that code-generation oracles can compete with mature baselines in established domains. 2. Code Space Response Oracles In this section, we first provide a brief overview of the standard PSRO algorithm, which forms the foundation of our work. We then formally introduce our proposed method, CSRO, detailing how an LLM is employed as a code-generating oracle to produce interpretable, programmatic policies. 2.1. Preliminaries PSRO is an iterative algorithm for computing approximate Nash equilibria in multi-agent games. In this paper we focus on two-player symmetric zero-sum games, but the methodology directly transfers to general game domains. Consider a two-player symmetric game where each player푖∈ 0,1shares a common set of strategiesΠwith the same utility function푢. A strategy, or a policy,휋 ∈Π, is a mapping from observations to a probability distribution over actions. A player playing휋receives 푢(휋, 휋 ′ )against an opponent playing휋 ′ ; the opponent receives푢(휋 ′ , 휋)= −푢(휋, 휋 ′ )according to the zero-sum condition. The PSRO algorithm in a symmetric game setting maintains a single set of policies푃. At each iteration푘, the algorithm first solves for a symmetric equilibrium mixture휎over the current policies in an empirical meta-game. Then, from the viewpoint of any player (since they are symmetric), a best response oracle computes a policy휋 ∗ that maximizes expected utility against the opponent meta-strategy 휎: 휋 ∗ ∈ argmax 휋 피 휋 ′ ∼휎 [푢(휋, 휋 ′ )](1) This new policy is added to the policy set푃, and the process repeats. In standard PSRO, the oracle is a deep reinforcement learning algorithm that produces an opaque neural network policy. 2 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 0123456789 1011121314151617181920 Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Iteration Metagame Payoff 0123456789 1011121314151617181920 Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Equilibrium Solution greenberg iocainebotphasenbott halbot mod1bot biopic robertot russrocker4 boom predbot shofar granite marble actr_lag2_d sweetrock piedra sunNervebot zq_move antirotnbot mixed_strat markovbails markov5 multibot inocencio debruijn81 pibot randbot adddriftbot2 driftbot foxtrotbot flatbot3 textbot addshiftbot3 switchalot sunCrazybot switchbot peterbot r226bot freqbot2 copybot rotatebot rockbot antiflatbot 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Return vs. Repeated Rock-Paper-Scissors Bots 1000 750 500 250 0 250 500 750 1000 0.0 0.2 0.4 0.6 0.8 1.0 Average Figure 1 | Example run of CSRO in Repeated Rock-Paper-Scissors. 2.2. Code Policies The central innovation of CSRO is the replacement of the deep RL oracle with an LLM that synthesizes policies as executable code. This reframes the best response computation from a process of numerical optimization to one of programmatic reasoning and generation. In our framework, a policy휋is no longer an opaque neural network but a (stateful) code policy, such as a Python function, that maps observations to actions. The primary mechanism for guiding the LLM is a carefully constructed prompt, which is dynami- cally generated at each iteration of the algorithm. This prompt provides the LLM with all necessary context to formulate a strategy. It begins with a natural language description of the game’s rules and objectives, alongside a precise API specification for the policy function to ensure the generated code is executable. A crucial component is the description of the opponent’s current meta-strategy,휎. Since opponent policies in CSRO are also programmatic, their source code can be directly included in the prompt. To manage prompt complexity when facing a large mixture of policies, we can in addition to — or instead of the code — provide a high-level summary of the opponents’ collective behavior, a summary that can itself be generated by prompting the LLM to analyze the set of code policies. The prompt concludes with a clear directive to generate a best response and can optionally include an instruction to produce self-explaining code with detailed comments and a docstring describing the intended strategy. 2.3. CSRO Algorithm The complete CSRO algorithm is detailed in Algorithm 1. The overall structure follows the iterative loop of PSRO, but the oracle step is replaced by an LLM-based code generation process. At each iteration, a new programmatic policy is generated, and added to the policy set, enriching the meta- game with increasingly sophisticated strategies. 2.4. Oracle Refinement Mechanisms The CSRO framework supports two distinct, complementary mechanisms for improving the quality of generated policies using in-context learning, as detailed in Algorithm 1. These mechanisms operate on different timescales to provide both immediate tactical corrections and long-term strategic guidance. 2.4.1. Cross-Iteration Strategic Adaptation At the beginning of each iteration푘, the oracle is provided with deep strategic context by analyzing the programmatic policies that constitute the current meta-game equilibrium휎. Theconstruct_prompt 3 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Algorithm 1: Code-Space Response Oracles. Input: 퐺: symmetric zero-sum game, 퐾: max. iterations, 푀: max. refinement budget 1 푃 ←휋 initial 2 for 푘← 1 to 퐾 do 3 푈 ← compute_payoff_matrix(푃) 4 휎← compute_meta_equilibrium(푈) 5 푝푟표푚푝푡 ← construct_prompt(퐺, 휎, 푃) 6 휋 ′ ← llm_oracle(푝푟표푚푝푡) 7 푢← evaluate_policy(휋 ′ , 휎, 푃) 8 푗← 0 9 whilenot terminated(u, j, M) do 10푗← 푗+ 1 11푝푟표푚푝푡 ← update_prompt(푝푟표푚푝푡, 휋 ′ ,푢) 12휋 ′ ← llm_oracle(푝푟표푚푝푡) 13푢← evaluate_policy(휋 ′ , 휎, 푃) 14 end 15 푃 ← 푃∪휋 ′ 16 end 17 return 푃, 휎 function inspects the source code of the opponent policies currently active in the meta-game (i.e., those with a non-zero probability in휎). The source code of each opponent can be directly included in the prompt or alternatively summarized by another LLM call to provide a high-level description of the strategic behavior the opponent code implements. To keep the context length manageable we can, e.g., filter by a threshold on minimum support or only include the top-k opponents according to the meta-game equilibrium. The provided opponents’ source code or descriptions guide the generation of the next candidate policy. 2.4.2. Intra-Iteration Policy Refinement The second mechanism is an inner feedback loop designed to ensure that any new policy휋 ′ is a robust best response to the current meta-game휎. We here consider three variants that differ in how the refinement stage in lines 9-14 are implemented: 1. ZeroShot: We prompt the LLM to directly generate a program in zero-shot fashion. This corresponds to the case where the refinement stage (Algorithm 1: line 9–14) is never entered. 2.LinearRefinement: We start with an initial candidate policy whose expected utility푢against휎is immediately evaluated. If the policy is found to be losing (푢 <0), a conditional refinement loop is triggered. The LLM is then tasked with regenerating the policy based on this feedback. The program is updated if the score is increased. This cycle of evaluation and refinement continues until the policy achieves a non-negative utility or a maximum refinement budget푀is reached. We call it "linear refinement" because all the computations are exercised in a single thread. 3. AlphaEvolve: AlphaEvolve (Novikov et al., 2025; Romera-Paredes et al., 2024) is a large-scale distributed system which uses an LLM to mutate programs in multiple threads and uses a score function to guide an evolutionary search procedure. To ensure diversity, programs are clustered into different subpopulations which are evolved independently. Each evolution thread continuously samples past programs and prompts the LLM to generate new modifications to them in favor of high-scored variants. AlphaEvolve has been shown effective in finding novel and better algorithms in 4 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models math, optimization (Novikov et al., 2025) and cognitive science domains (Castro et al., 2025). We consider AlphaEvolve perfectly suitable for our purpose of computing best response programs using an estimate of피 휋 ′ ∼휎 [푢(휋, 휋 ′ )] as the score function for the evolutionary search. 3. Experiments To validate the effectiveness and interpretability of CSRO, we conduct a series of experiments designed to answer the following research questions: (1) Does CSRO converge to a low-exploitability equi- librium? (2) Can the LLM oracle generate effective, non-trivial strategies in a zero-shot or few-shot manner? (3) Are the programmatic policies generated by CSRO demonstrably more interpretable than the neural network policies produced by traditional oracles? 3.1. Environments We evaluate our framework on two standard analytical games, chosen to test different aspects of strategic reasoning. Both environment implementations are based on OpenSpiel (Lanctot et al., 2019). Repeated Rock-Paper-Scissors (RRPS): In this repeated version, players repeat the standard stage game for 1000 consecutive rounds (throws). Although playing according to a uniform random policy at each round is known to be a Nash equilibrium and is unexploitable, it fails to exploit patterns in the moves of suboptimal opponents. Therefore, RRPS has been a very informative benchmark for multiagent learning and opponent modeling (Billings, 2000a,b; Knoll et al.; Lanctot et al., 2023). Repeated Leduc hold’em poker: In this environment, players play Leduc hold’em for 100 rounds (hands); the dealer role alternates between consecutive rounds, with the first-round-dealer randomly decided between the two players. Leduc hold’em is a popular benchmark for research in imperfect information games, introduced in (Southey et al., 2005). It is a two-player, zero-sum game with a small six-card deck (two suits of Jack, Queen, and King). The game features two betting rounds, with a single public community card revealed at the flop. Its structure, which combines private and public information across multiple decision points, requires more sophisticated stateful reasoning than simpler variants like Kuhn poker, making it an excellent environment to test the LLM’s ability to generate complex, multi-step strategies involving concepts like bluffing and value betting. The repeated version of Leduc hold’em is a more realistic setting than the one-shot version; and it further adds another strategic layer where an agent is more effective if it identifies strategic patterns of the opponent and is thus able to exploit them (similar to RRPS). 3.2. Evaluation Population In RRPS, we evaluate CSRO code-policies against 43 hand-coded, heuristic strategies from the inter- national RRPS competitions (Billings, 2000a,b). These bots range from simple, stateless strategies likerandbot(uniform random) androckbot(always rock) to bots playing fixed sequences like pibot. Other approaches use limited memory, such ascopybot(beating the opponent’s last move) ordriftbot(biasing actions based on recent play). More sophisticated entrants rely on histor- ical observations, using methods like frequency analysis (freqbot2), Markov chain predictions (markov5), or complex rule-based systems (inocencio). Some bots switch between different inter- nal policies based on profitability (multibot), while innovative designs include a neural network-like system (sunNervebot) and a cognitive architecture (actr_lag2_decay). The competition win- ners,iocainebotandgreenberg, succeeded by maintaining a large set of different predictors and counter-strategies, selecting the one that proved most effective as the match progressed. This 5 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models population of bots represent a diverse set of RRPS strategies which is useful in terms of measuring the generalization and exploitability of agents (Lanctot et al., 2023). An example run of CSRO in the RRPS environment is visualized in Figure 1, showing the evolution of payoffs and the equilibrium solution over iterations as well as returns against the bot population. For repeated Leduc Poker, we compute a Nash equilibrium for the single-hand game using Counterfactual Regret Minimization + (CFR+) (Tammelin, 2014). We run CFR+ for 10 000 iterations and use the weighted averaged strategy. We evaluate CSRO code-policies against an opponent which plays this optimal Nash strategy at every hand of the repeated game. We also evaluate code-policies against two heuristic strategies with easy to detect patterns: AlwaysCall and AlwaysFold (a strategy that always folds when possible and otherwise calls). Given an evaluation population of heuristic strategies (bots), we can adopt the three metrics introduced in (Lanctot et al., 2023). We denote the population of bots asP. Population Return is the average return of휋against the population. This measures the generalization capability of agents against an unseen opponent at test-time: PopReturn(휋)=피 휋 ′ ∼P [푢(휋, 휋 ′ )] Within Population Exploitability serves as an approximate exploitability metric: PopExpl(휋)=− min 휋 ′ ∈P 푢(휋, 휋 ′ ) Aggregate Score measures a balance between average- and worst-case analysis: AggScore(휋)= PopReturn(휋)− PopExpl(휋) An agent can have both highPopReturnandPopExplby winning against most of the bots in the population but being particularly exploitable by one opponent. In this caseAggScoreis a better measure of generalization capability. 3.3. Baselines To evaluate CSRO, we compare against a strong, conventional baseline from the multi-agent learning literature. We implement a standard PSRO algorithm where the best response oracle is a deep reinforcement learning agent. The agent’s policy is represented by a deep Long Short-Term Memory (LSTM)-based network and is trained using the Importance Weighted Actor-Learner Architecture (IMPALA) algorithm (Espeholt et al., 2018). The recurrent nature of LSTM makes it suitable for our repeated, partially observable game settings. This serves as a canonical baseline, allowing for a direct comparison between our method, which generates oracle policies as explicit code, and the standard approach of optimizing a policy in a neural network’s weight space. We refer to the Supplementary Materials A.1.1 for implementation details. In RRPS, we further include a direct comparison with an LLM-based agent adapting the simple yet effective approach proposed in (Lanctot et al., 2023). The agent’s architecture requires no task-specific training, instead relying directly on the sequence-prediction capabilities of a pre-trained LLM. A history of previous actions is provided as a textual prompt and the model generates the subsequent actions for both player and opponent. The agent plays the best response to the predicted opponent action. While the original work utilized Chinchilla (Hoffmann et al., 2022), our implementation substitutes this family of LLMs for the pre-trained Gemma 3 models (Team, 2025) to benchmark a more recent and accessible model on this task. See Table 6 for full results. 6 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 0 25 50 75 100 125 150 175 200 Population Return 0 100 200 300 400 500 600 700 Within Population Exploitability 500 400 300 200 100 0 100 Aggregate Score AlphaEvolve LinearRefinement (code) ZeroShot (code) LinearRefinement (desc.) ZeroShot (desc.) Top 5 Min support ZeroShot (no opponent input) Figure 2 | Performance of CSRO variants in Repeated Rock-Paper-Scissors In addition to the LLM agent based on Chinchilla, we include two further baselines from (Lanctot et al., 2023) that have not been trained against the evaluation population. Tabular Q-learning (QL) with recall length of 10 rounds and Contextual Regret Minimization (ContRM). For details we refer to previous work. 3.4. Implementation Details We implement the CSRO oracle using the Gemini 2.5 Pro model (et al., 2025). All experiments are run for퐾=20 iterations and푀=10 for LinearRefinement. For the AlphaEvolve oracle, we follow the method described in (Novikov et al., 2025; Romera-Paredes et al., 2024). Results are averaged over 5 seeds if not specified otherwise. The prompts and initial strategies are provided in Supplementary Materials A.2. 4. Results 4.1. Repeated Rock-Paper-Scissors Method / AgentPopReturn↑PopExpl↓AggScore↑ CSRO - AlphaEvolve50.5± 1.9 25.2± 20.3 25.4± 21.6 - LinearRef. (code) 159.8± 7.7 37.7± 10.6 122.1± 9.8 - LinearRef. (desc.) 99.3± 29.9 31.6± 12.4 67.7± 21.4 - ZeroShot (desc.) 130.2± 15.4 66.7± 25.9 63.5± 11.4 LLM Agent (27B)193.267.2126.0 PSRO-IMPALA −108.9± 17.6 423.2± 28.0 −532.1± 41.5 QL (R=10) ∗ −0.58.6−9.1 ContRM ∗ 164.816.3148.5 LLM Agent (70B) ∗ 201.045.8155.2 Table 1|Performance in Repeated Rock-Paper-Scissors across Population Return, Within Population Exploitability and Aggregate Score. Results are averaged over 5 seeds. ( ∗ Denotes results reported in Lanctot et al. (2023).) Our primary results in RRPS are summarized in Table 1. We report PopReturn (higher is better), PopExpl (lower is better), and AggScore (higher is better) for the meta-equilibrium strategies of CSRO variants at the final iteration (퐾=20) as well as baseline performances. Our experiments show 7 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models that among the proposed CSRO variants, the AlphaEvolve oracle is the most effective at generating a low-exploitability policy with a population exploitability of 25.2±20.3. This outcome is consistent with the overall objective of the PSRO framework. The oracle generates policies that minimize worst- case performance against the current strategy mixture. As weaker opponents are assigned less support in the meta-game equilibrium, the oracle’s objective naturally prioritizes robustness against strong strategies over maximizing exploitation of potential weak opponents. A powerful evolutionary process like AlphaEvolve is particularly well-suited to drive this optimization. LinearRefinement (code) with Top 5 filtering achieves the highest aggregate score among our tested CSRO variants (122.1±9.8). This performance is competitive with the strongest baseline LLM agent, a 27B parameter Gemma 3 model (126.0). We find that CSRO with intra-iteration policy refinement exhibits lower exploitability than the baseline Gemma 3 agents across all model sizes. Furthermore, all CSRO variants substantially outperform the PSRO-IMPALA baseline across all three metrics. For context within the broader literature, (Lanctot et al., 2023) reported an impressive aggregate score of 155.2 for an LLM agent based on a Chinchilla model with 70B parameters, this is significantly higher than most baseline RL algorithms considered in their work. Among agents trained via self-play, ContRM, a contextual regret minimizer, achieved an aggregate score of 148.51. Their findings also underscore the importance of a balanced evaluation; for instance, while the Q-learning agent with recall 10 achieved very low exploitability (8.6), its performance was undermined by a near-zero PopReturn, resulting in an AggScore lower than that of a bot playing uniformly at random (randbot). A detailed overview of results for all CSRO variants is shown in Figure 2 (and Supplementary Materials Table 7). The oracle’s input format (code vs. description) interacts with the generation process. The benefit of using source code as input appears to be linked to iterative refinement. In the ZeroShot setting, providing opponent policies as natural language descriptions leads to better results; ZeroShot (desc.) achieves an aggregate score of 63.5±11.4, compared to−54.3±118.7 for ZeroShot (code). A potential explanation is that generating a best response from a large set of source code in a single pass is a more complex task for which a textual summary provides a more tractable input. As a critical ablation, the ZeroShot (no opponent input) variant, which receives no information on opponent strategies, has a high PopReturn (135.3±10.2) but also an extremely high PopExpl (614.2±60.8), leading to a very poor aggregate score (−478.9±70.2). This result underscores that opponent conditioning is the most important component of the prompt for generating robust strategies. The Top 5 filter generally yields better results than Min support. A plausible explanation is that the meta-game equilibrium often concentrates its probability mass on the most recent, single-best counter-policy. In such cases, Min support filtering would provide the LLM with only one opponent strategy as context. This can lead to overfitting, where the oracle generates a narrow policy that is highly effective against that one opponent but brittle against the wider, low-probability population. In contrast, the Top 5 filter ensures a more diverse set of opponents is always present in the prompt, encouraging the LLM to generate more general and robust policies. 4.2. Repeated Leduc Hold’em Poker In Table 2, we report CSRO performance in repeated Leduc poker as well as PSRO-IMPALA and CFR+ baselines, where metrics are computed with respect to the evaluation populationPconsisting of CFR+, AlwaysCall and AlwaysFold. Here we use description input and no filtering for all oracles, as this produced lowest exploitability in RRPS (see Supplementary Materials Table 7). Results show that the AlphaEvolve oracle achieves the highest PopReturn and AggScore, showing its efficacy of producing generalizable strategies. CSRO-AlphaEvolve also achieves an exploitability of 4.4±0.6, 8 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models MethodPopReturn↑ PopExpl↓ AggScore↑ CSRO - AlphaEvolve49.3± 3.74.4± 0.644.9± 4.1 - LinearRefinement 43.8± 2.59.8± 3.0 34.0± 4.3 - ZeroShot40.4± 1.619.6± 2.1 20.7± 3.0 PSRO-IMPALA13.3± 6.958.4± 3.3 −45.0± 10.1 CFR+39.8± 0.30.0± 0.0 39.8± 0.3 Table 2 | Performance in repeated Leduc hold’em. Results are averaged over 3 seeds. competitive with CFR+, demonstrating the robustness of the meta-equilibrium found. We found that the bots’ worst-case performances (PopExpl) consistently arose from matches against the CFR+ opponent. Both the average-case (PopReturn) and worst-case (PopExpl) performances of the CSRO variants directly correlates with the strength of the best-response oracle: AlphaEvolve, followed by LinearRefinement and ZeroShot. This finding underscores the critical role of a powerful best-response solver in achieving game-theoretically near-optimal policies in repeated Leduc Poker within the CSRO framework. We present a detailed breakdown of returns against each opponent in the eval population in Supplementary Materials Table 8. Against AlwaysCall, CSRO-AlphaEvolve achieves an average return of 110.3±9.7 by developing a powerful value betting strategy (see Section 4.3). This performance is significantly higher than that of the PSRO-IMPALA (57.7±3.3) and CFR+ (62.1±0.8) baselines. This demonstrates a key advantage of the policies found by CSRO framework: CSRO does not just learn a low-exploitable strategy, but finds strategies that dynamically capitalize on the opponents’ predictability over the course of the repeated game to achieve a super-Nash level of exploitation. LinearRefinement achieves a higher average return against AlwaysFold (57.3±8.8) compared to AlphaEvolve (42.0±3.2) and on par with CFR+ (57.4±0.2). AlphaEvolve minimizes exploitability against CFR+ while exploiting AlwaysCall significantly more than CFR+. LinearRefinement finds strategies that lose more to CFR+, still exploiting AlwaysCall more than CFR+, and exploiting AlwaysFold as much as CFR+. In summary, we see evidence that the various CSRO oracles find different non-dominated strategies. 4.3. Qualitative Analysis A key advantage of CSRO is the generation of interpretable policies, which allows for direct inspection of an agent’s strategic logic. By analyzing the best programs from our experiments, we find they contain multiple strategic components, parameters and even heuristic value estimations. 4.3.1. Repeated Rock-Paper-Scissors The best-performing strategy, generated by LinearRefinement (no filter) - Top 5, is a sophisticated ensemble agent achieving an impressive PopReturn of 238.3 (see Supplementary Materials Table 7) that would have placed it in between 2nd and 3rd place in the competition. Its well-commented source code allows for a direct analysis of its strategy, which is built on several key principles. First, the agent’s strategy is a form of brute-force pattern matching, enabled by a comprehensive ensemble of 32 predictors. Rather than attempting to first deduce an opponent’s high-level strategy (e.g., Markov player), the agent tests a wide array of hypotheses in parallel on every single turn. This large portfolio—which includes high-order Markov models, reactive models, and numerous heuristic 9 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models detectors—is designed on the principle that one of its specialized experts will likely find a strong correlation in any given opponent’s behavior, relying on breadth rather than deduction. Second, it uses an aggressive, high-confidence adaptation mechanism. Expert votes are weighted by the fifth power of their score, allowing the agent to rapidly and decisively commit to a theory of the opponent’s strategy based on recent success. This non-linear weighting enables fast adaptation in a competitive environment. Finally, its most advanced component is an explicit second-order “Theory of Mind” model. This component infers the opponent’s likely predictive model by observing which of its own experts is currently most successful. It then simulates the opponent’s prediction of its own move and plays the corresponding counter-move. See Supplementary Materials Listing 1 for the full source code. 4.3.2. Repeated Leduc hold’em poker The policy that achieves the best PopReturn (77.8) and best AggScore (69.1) demonstrates an effective synthesis of opponent modeling to drive a classic Expected Value (EV) calculation. The EV is calculated based on two key estimations: the equity (the probability of winning at showdown) and a statistical model of the opponent’s tendencies (e.g., the probability they will fold to a bet). The agent combines these estimations to calculate the EV of a potential raise as a weighted average of two primary outcomes: the immediate profit from winning the pot if the opponent folds, versus the potential net gain or loss from playing for a larger pot at showdown if the opponent calls. This calculation explicitly weighs the risk and reward of bluffing versus betting for value. The interpretability of this structure allows us to see precisely how it produces expert-level, adaptive strategies. For instance: Against an AlwaysCall opponent, the bot learns the opponent’s folding probability is near zero. Consequently, its EV calculation simplifies to a classic “value betting” strategy, where it ceases to bluff and only raises with very strong hands. Against an AlwaysFold opponent, the bot rapidly learns the folding probability is near 100%. Its EV calculation for a raise then becomes equal to the current pot size, making its own hand irrelevant. As a result, the agent learns to bluff relentlessly to exploit the opponent’s passivity. This transparent adaptation, directly observable in the policy’s code, demonstrates a level of strategic reasoning and interpretability that is fundamentally absent in opaque, black-box policies. See Supplementary Materials Listing 2 for the full source code. The generated policies share a key characteristic across both games and multiple seeds: they are not monolithic but rather compositions of distinct, interpretable strategic modules. Crucially, policies often demonstrate a capacity for sophisticated opponent modeling. This demonstrates a capacity for higher-order reasoning that is transparent and verifiable, a feature that distinguishes CSRO-generated policies from their black-box counterparts. 5. Related Work Our work builds directly upon the Policy-Space Response Oracles (PSRO) framework (Lanctot et al., 2017), a general-purpose algorithm for computing approximate Nash equilibria in large extensive- form games. Significant follow-up work has focused on improving the meta-solver, such as using alpha-rank (Omidshafiei et al., 2019) or developing more scalable training schemes like Pipeline PSRO (Mcaleer et al., 2020). While these advancements have improved convergence and scalability, they have largely retained the use of deep reinforcement learning as the core of the best response oracle, thereby inheriting its limitations in terms of sample complexity and policy opacity. CSRO is orthogonal to these meta-solver improvements; we instead focus on fundamentally changing the nature of the oracle itself to prioritize interpretability. 10 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Another line of work focuses on applying game-theoretic solvers to guide strategic interactions of LLMs in dialogue-based games (Gemp et al., 2024). The authors formalize natural language negotiation and debate scenarios as extensive-form games. They integrate LLMs with solvers like CFR and PSRO to generate prompts containing high-level strategic instructions to the game-playing LLM, thereby steering its dialogue to be more rational and less exploitable. A key difference is that the best responses produced by their method are natural language prompts, whereas whereas CSRO outputs executable code. Previous work by Kempinski et al. (Kempinski et al., 2025) also explores using LLMs for iterative reasoning in game-theoretic settings. In their work, “Game of Thoughts,” they propose a family of algorithms inspired by cognitive hierarchy theory, such as level-k reasoning, to iteratively refine an LLM’s strategic play. Their methods, including a variant named Policy Space Response Oracle Language Model (PSROLM), similarly place an LLM within an “outer loop” to improve its performance against prior strategies. While both approaches validate the use of LLMs for sophisticated strategic reasoning, our CSRO framework is distinct in two fundamental ways. First, our primary contribution is the synthesis of explicit, human-readable programmatic policies. This focus on generating executable and commented code as the policy representation provides a level of interpretability and verifiability not present in methods that generate actions directly. Second, our algorithm incorporates a novel intra-iteration refinement loop, which tactically hardens a candidate policy against the current meta-game before adding it to the policy pool, ensuring a higher quality of strategies. Our work is most closely related to (Bachrach et al., 2025) introducing LLM-PSRO. The authors focus on demonstrating the viability of LLMs as a code-generating, best-response oracle within a self-play loop. LLM-PSRO does not feature an inner feedback loop to iteratively refine the candidate policy. The work only includes inter-population evaluation and does not show performance against external populations or baselines. 6. Conclusion and Discussion We introduced Code-Space Response Oracles (CSRO), a novel framework for finding approximate equilibria in multi-agent games. We addressed the critical limitation of policy opacity in standard methods like PSRO by replacing the conventional black-box, deep RL oracle with a Large Language Model. Our approach reframes the best response computation as a program synthesis task, prompting an LLM to generate policies directly as human-readable, executable code. The results show that CSRO achieves competitive performance in terms of convergence to a low-exploitability equilibrium, particularly in complex games like multi-hand Leduc hold’em poker. In RRPS, while the AggScore of CSRO is below that of the best-performing LLM-based baseline reported in previous work, it is important to highlight the significant difference in computational efficiency. The baseline LLM agents are invoked on every turn, i.e., requiring 1000 model calls for a single game. In contrast, LinearRefinement generates a complete, reusable policy where the number of LLM calls only grows linearly with the number of iterations (e.g.,퐾=20 in our experiments). This demonstrates a favorable trade-off between peak performance and the computational cost of policy generation, positioning CSRO as a practical and efficient method for strategy discovery. We acknowledge that the LLM’s pre-training data likely contains knowledge of game strategies for a classic game like Rock-Paper-Scissors. However, the oracle must synthesize a novel best response to a specific, dynamically generated mixture of programmatic opponents provided in-context. The success of this process demonstrates a sophisticated capability for in-context strategic reasoning and code generation, not merely pattern retrieval. Despite its promising results, our approach has several limitations. First, the performance of the 11 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models CSRO oracle is inherently tied to the capabilities of the underlying LLM and the quality of the prompt. Poorly structured prompts or less capable models can lead to suboptimal or syntactically incorrect code, requiring error-handling and regeneration logic. Second, while our method reduces the high sample complexity associated with RL training, it introduces the computational cost of repeated LLM API calls, which can be significant. Finally, the scalability of CSRO to games with vast, high-dimensional observation spaces (e.g., Stratego or StarCraft) remains an open question. Representing complex state and opponent strategies within the context length of current LLMs remains a substantial engineering challenge. This manuscript was created with assistance from Gemini 2.5 Pro. The tool was used for revision and editing to improve clarity. The authors have carefully reviewed and approved all changes. 12 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models References Y. Bachrach, E. Toledo, K. Hambardzumyan, D. Magka, M. Josifoski, M. Jiang, J. Foerster, R. Raileanu, T. Shavrina, N. Cancedda, A. Ruderman, K. Millican, A. Lupu, and R. Hazra. Combining code generating large language models and self-play to iteratively refine strategies in games. In Thirty- Fourth International Joint Conference on Artificial Intelligence - Demo Track, pages 10999–11003, 2025. A. Bighashdel, Y. Wang, S. McAleer, R. Savani, and F. A. Oliehoek. Policy space response oracles: A survey. In Thirty-Third International Joint Conference on Artificial Intelligence - Survey Track, pages 7951–7961, 2024. D. Billings. The first international roshambo programming competition. ICGA Journal, 23(1):42–50, 2000a. D. Billings. The second international roshambo programming competition. https://icga.org/icga/ games/roshambo/, 2000b. P. S. Castro, N. Tomasev, A. Anand, N. Sharma, R. Mohanta, A. Dev, K. Perlin, S. Jain, K. Levin, N. Elteto, W. Dabney, A. Novikov, G. C. Turner, M. K. Eckstein, N. D. Daw, K. J. Miller, and K. Stachenfeld. Discovering symbolic cognitive models from human and animal behavior. In Forty-Second International Conference on Machine Learning, 2025. L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Thirty-Fifth International Conference on Machine Learning, volume 80, pages 1407–1416, 2018. G. C. et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. I. Gemp, R. Patel, Y. Bachrach, M. Lanctot, V. Dasagi, L. Marris, G. Piliouras, S. Liu, and K. Tuyls. Steering language models with game-theoretic solvers. In Agentic Markets Workshop at ICML 2024, 2024. J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre. Training compute- optimal large language models. In Thirty-Sixth International Conference on Neural Information Processing Systems, 2022. B. Kempinski, I. Gemp, K. Larson, M. Lanctot, Y. Bachrach, and T. Kachman. Game of thoughts: Iterative reasoning in game-theoretic domains with large language models. In Twenty-Fourth International Conference on Autonomous Agents and Multiagent Systems, pages 1088–1098, 2025. B. Knoll, D. Lu, and J. Burdge. Rock Paper Scissors Programming Competition. https://rpscontest.com/. Accessed: 2025-10-8. M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Thirty-First International Conference on Neural Information Processing Systems, page 4193–4206, 2017. M. Lanctot, E. Lockhart, J.-B. Lespiau, V. Zambaldi, S. Upadhyay, J. Pérolat, S. Srinivasan, F. Timbers, K. Tuyls, S. Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019. 13 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models M. Lanctot, J. Schultz, N. Burch, M. O. Smith, D. Hennes, T. Anthony, and J. Perolat. Population-based evaluation in repeated rock-paper-scissors as a benchmark for multiagent reinforcement learning. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. S. Mcaleer, J. Lanier, R. Fox, and P. Baldi. Pipeline psro: A scalable approach for finding approximate nash equilibria in large games. In Thrity-Fourth International Conference on Neural Information Processing Systems, volume 34, pages 20238–20248. Curran Associates, Inc., 2020. A. Novikov, N. V ̃ u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog. Alphaevolve: A coding agent for scientific and algorithmic discovery, 2025. URL https://arxiv.org/abs/2506.13131. S. Omidshafiei, C. Papadimitriou, G. Piliouras, K. Tuyls, M. Rowland, J.-B. Lespiau, W. M. Czarnecki, M. Lanctot, J. Perolat, and R. Munos.훼-rank: Multi-agent evaluation by evolution. Scientific reports, 9(1):9937, 2019. B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024. Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, NY, USA, 2008. F. Southey, M. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and C. Rayner. Bayes’ bluff: Opponent modelling in poker. In Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 550–557, 2005. O. Tammelin. Solving large imperfect information games using cfr+, 2014. URL https://arxiv.org/ abs/1407.5042. G. Team. Gemma 3 technical report, 2025. O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, et al. Grandmaster level in starcraft i using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019. 14 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models A. Supplementary materials A.1. Hyperparameters A.1.1. PSRO-IMPALA For all of our experiments, we perform a hyperparameter sweep over learning rates, hidden layer sizes, unroll length, entropy cost, batch size, and max gradient norm (see Table 3). Best performing hyper parameters are shown in Table 4 and Table 5. HyperparameterRange learning rates[0.0001, 0.001] hidden layer sizes [256, 128], [512, 256] unroll length[20, 40] entropy cost[0.001, 0.1] batch size[16, 32, 64] max gradient norm1, 10, 40 Table 3 | IMPALA hyperparameter sweep. HyperparameterValue learning rates0.0001 hidden layer sizes[256, 128] unroll length20 entropy cost0.001 batch size16 max gradient norm40 observation representation observation_tensor() in OpenSpiel (Lanctot et al., 2019) Table 4 | Hyperparameters in RRPS. HyperparameterValue learning rates0.001 hidden layer sizes[512, 256] unroll length40 entropy cost0.01 batch size64 max gradient norm40 observation representation infostate_tensor() in OpenSpiel (Lanctot et al., 2019) Table 5 | Hyperparameter for repeated Leduc hold’em poker. 15 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Gemma 3 (PT) Bot Names270M1B4B27B actr_lag2_decay -20.1 -40.2 -22.1 -18.7 adddriftbot268.139.4 83.7 65.4 addshiftbot3120.5 134.2 186.6 199.7 antiflatbot995.3 993.1 991.1 991.2 antirotnbot53.449.8 65.9 51.6 biopic-25.3 -44.3 -25.6 -35.2 boom27.6-1.44.4 -21.2 copybot965.6 962.4 968.2 969.8 debruijn81-34.3 -45.4 -14.1-7.4 driftbot90.727.3 100.4 78.0 flatbot35.699.1 127.7 137.4 foxtrotbot15.047.6 33.9 12.6 freqbot2608.1 863.0 917.0 936.5 granite116.879.2 84.9 99.7 greenberg-47.1 -84.1 -52.0 -29.3 halbot5.1 -96.7 -87.8 -55.9 inocencio522.1 405.5 293.2 568.8 iocainebot-80.2 -131.2 -68.1 -35.6 marble113.274.5 91.9 98.8 markov5-34.2 -58.2-7.2 -18.5 markovbails-39.2 -39.6-6.2 -21.2 mixed_strategy48.21.2 37.6 26.6 mod1bot-55.9 -60.5 -54.9 -50.8 multibot287.0 177.8 265.2 253.0 peterbot650.6 770.1 865.6 879.6 phasenbott-85.2 -141.3 -64.8 -52.8 pibot6.5 -36.0 -21.5 -28.5 piedra33.622.5 40.6 40.6 predbot46.6 -38.1 -27.5 -67.2 r226bot197.2 354.1 394.7 391.8 randbot6.47.45.3-9.3 robertot-17.1 -13.1 -11.8 -27.9 rockbot998.4 998.4 995.6 995.0 rotatebot978.0 980.9 988.8 997.5 russrocker4-51.8 -71.5 -31.2 -38.9 shofar-33.6 -44.2 -24.8 -34.8 sunCrazybot248.0 313.7 371.6 397.9 sunNervebot-24.2 -71.3 -30.7 -32.8 sweetrock36.641.6 38.9 42.0 switchalot67.0 106.9 111.9 125.2 switchbot181.6 218.4 204.8 226.3 textbot57.3 138.9 180.1 153.8 zq_move181.2 107.8 151.8 155.9 PopReturn167.0 162.7 187.2 193.2 PopExpl85.2 141.3 87.8 67.2 AggScore81.821.4 99.4 126.0 Table 6|LLM agent performance against Repeated Rock-Paper-Scissors bot population. Returns averaged over 16 games. 16 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models MethodInput FilterPopReturn↑PopExpl↓AggScore↑ meanmaxmeanminmeanmax AlphaEvolveDesc. -50.5± 1.954.225.2± 20.33.3 25.4± 21.647.6 LinearRefinement Desc. -99.3± 29.9 171.8 31.6± 12.4 11.8 67.7± 21.4 128.1 LinearRefinement Code Top 5159.8± 7.7 189.1 37.7± 10.68.8122.1± 9.8 141.3 ZeroShotDesc. -130.2± 15.4 177.8 66.7± 25.9 18.9 63.5± 11.486.9 LinearRefinement Code -167.7± 20.4238.3 83.2± 27.4 15.7 84.5± 26.1 148.1 ZeroShotDesc. Top 5149.4± 23.9 207.3 93.5± 23.5 38.2 55.8± 40.0155.9 ZeroShotCode Top 5126.7± 39.0 190.0 95.8± 17.9 53.2 30.9± 51.0 111.7 ZeroShotCode Min support 116.9± 30.6 160.9 109.5± 30.1 24.87.4± 45.0 122.7 ZeroShotDesc. Min support 157.9± 13.5 204.5 111.0± 45.2 38.0 46.9± 48.9 133.1 LinearRefinement Desc. Min support 110.6± 23.5 185.9 137.0± 41.6 28.9 −26.4± 54.284.4 LinearRefinement Code Min support 57.3± 40.0 152.3 183.2± 67.7 54.2 −125.8± 101.698.1 LinearRefinement Desc. Top 5140.9± 22.2 187.6 185.2± 98.5 67.1 −44.3± 105.7 110.5 ZeroShotCode -166.6± 26.1 208.3 220.9± 92.7 95.8 −54.3± 118.7 112.6 ZeroShot--135.3± 10.2 173.2 614.2± 60.8 411.7 −478.9± 70.2 -238.5 Table 7|Performance of CSRO variants in Repeated Rock-Paper-Scissors ranked by PopExpl. Metrics are averaged over 3 seeds for AlphaEvolve and 5 seeds for all other variants. Opponent MethodCFR+AlwaysCallAlwaysFold CSRO - AlphaEvolve−4.4± 0.6 110.3± 9.7 42.0± 3.2 - LinearRefinement −9.8± 3.0 83.8± 5.1 57.3± 8.8 - ZeroShot−19.6± 2.1 99.7± 3.0 41.0± 6.1 PSRO-IMPALA −58.4± 3.3 57.7± 3.3 40.7± 14.1 CFR+0.0± 0.0 62.1± 0.857.4± 0.2 Table 8|Performance of CSRO variants in repeated Leduc hold’em. Metrics are averaged over 3 seeds. 17 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Listing 1 | A repeated Rock-Paper-Scissors bot that achieves best average performance against the opponent pool 1 from collections import defaultdict 2 import random 3 from typing import Any, Callable, Dict, List, Optional, Tuple 4 5 6 class Agent: 7 """An expert agent for Repeated Rock Paper Scissors, engineered to defeat 8 a field of sophisticated, adaptive opponents. 9 10 This agent’s strategy is a significant enhancement of the ’ensemble of 11 experts’ paradigm, built on four core principles: 12 13 1. ** Superset Expert Ensemble ** : The agent employs a vast portfolio of 14 predictive experts designed to be a strict superset of those used by its 15 competitors. This includes high-order Markov models (up to 8th order), 16 complex reactive/joint models, and a wide range of periodic and 17 heuristic predictors. By having more ’lenses’ to view the game, it can 18 detect patterns that other agents cannot. 19 20 2. ** Aggressive, High-Confidence Adaptation ** : The agent uses a highly 21 aggressive weighted voting system, weighting each expert’s vote by the 22 * fifth power * of its score. This allows the agent to rapidly and 23 decisively lock onto successful predictive models while effectively 24 eliminating noise from the dozens of other, less-successful experts. 25 26 3. ** Advanced Meta-Prediction (’Theory of Mind’) ** : The agent not only models 27 the opponent but also models itself being modeled by the opponent. Its 28 most critical expert, ‘meta_imitation‘, deduces the opponent’s likely 29 predictive model by observing which of its own strategies is currently 30 most successful. It then simulates the opponent’s prediction of its own 31 move and plays the perfect counter to the opponent’s anticipated play. 32 33 4. ** Strategic Randomization ** : To combat the meta-predictors of its 34 opponents, the agent introduces randomness at key decision points. When 35 any of its predictive models or the final vote tally results in a tie, 36 it chooses randomly among the best options. This makes its own behavior 37 significantly harder to predict and exploit. 38 39 Combined with finely tuned hyperparameters for a 1000-round game (a 40 long-memory decay rate of 0.985), this agent is designed to out-learn, 41 out-adapt, and out-think its competition. 42 """ 43 44 def __init__(self): 45 """Initializes the agent’s constants, models, and meta-strategy state.""" 46 # Game constants 47 self.MOVES = [’ROCK’, ’PAPER’, ’SCISSORS’] 48 self.COUNTER_MOVES = 49 ’ROCK’: ’PAPER’, 50 ’PAPER’: ’SCISSORS’, 51 ’SCISSORS’: ’ROCK’, 52 53 54 # Hyperparameters tuned for a 1000-round game against strong learners 55 self.decay = 0.985 # Slower decay for longer memory and pattern stability 56 self.vote_power = ( 57 5 # Use score^5 for extremely aggressive, high-confidence adaptation 58 ) 59 60 # State is initialized in reset() 61 self.reset() 62 63 def reset(self) -> None: 64 """Resets all agent state for the beginning of a new match.""" 65 self.my_history: List[str] = [] 66 self.opponent_history: List[str] = [] 67 18 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 68 # --- Underlying predictive models (Opponent’s patterns) --- 69 self.opp_markov_models = 70 i: defaultdict(lambda: defaultdict(int)) for i in range(1, 9) 71 # n=1 to 8 72 self.reactive_model = defaultdict(lambda: defaultdict(int)) 73 self.joint_hist_model = defaultdict(lambda: defaultdict(int)) 74 self.freq_model = defaultdict(int) 75 76 # --- Underlying predictive models (My own patterns for meta-experts) --- 77 self.my_markov_models = 78 i: defaultdict(lambda: defaultdict(int)) for i in range(1, 9) 79 # n=1 to 8 80 self.my_reactive_model = defaultdict(lambda: defaultdict(int)) 81 self.my_joint_hist_model = defaultdict(lambda: defaultdict(int)) 82 83 # --- Meta-learning state (The Experts) --- 84 self.experts: Dict[str, Callable[[], Optional[str]]] = 85 # Opponent-history-based predictors (Markov up to 8th order) 86 ** 87 f’opp_markovi’: self._create_opp_markov_predictor(i) 88 for i in range(1, 9) 89 , 90 # My-history-based predictors 91 ’reactive’: self._predict_reactive, 92 ’joint_hist’: self._predict_joint_hist, 93 # Simple statistical predictors 94 ’frequentist’: self._predict_frequentist, 95 # Heuristic predictors 96 ’copy_opponent’: self._predict_copy_opponent, 97 ’copy_me’: self._predict_copy_me, 98 ’rotator’: self._predict_rotator, 99 ’counter_rotator’: self._predict_counter_rotator, 100 # Periodic predictors (up to 12 rounds) 101 ** 102 f’periodic_i’: self._create_periodic_predictor(i) 103 for i in range(2, 13) 104 , 105 # Standard Meta-predictors (Theory of Mind Level 1) 106 ** 107 f’meta_my_markovi’: self._create_meta_my_markov_predictor(i) 108 for i in range(1, 4) 109 , 110 ’meta_my_reaction’: self._predict_meta_my_reaction, 111 ’meta_my_joint’: self._predict_meta_my_joint, 112 # Advanced Meta-predictor (Theory of Mind Level 2) 113 ’meta_imitation’: self._predict_meta_imitation, 114 115 self.expert_scores = name: 1.0 for name in self.experts 116 self.last_predictions: Dict[str, Optional[str]] = 117 name: None for name in self.experts 118 119 120 # --- Update Methods --- 121 122 def _update_scores(self, actual_opponent_move: str) -> None: 123 """Updates expert scores based on their last prediction’s accuracy.""" 124 for name, prediction in self.last_predictions.items(): 125 self.expert_scores[name] * = self.decay 126 if prediction == actual_opponent_move: 127 self.expert_scores[name] += 1 128 129 def _update_models(self, my_action: str, opponent_action: str) -> None: 130 """Updates all internal statistical models with the last round’s data.""" 131 self.freq_model[opponent_action] += 1 132 133 if self.my_history: 134 self.reactive_model[self.my_history[-1]][opponent_action] += 1 135 joint_key = (self.my_history[-1], self.opponent_history[-1]) 136 self.joint_hist_model[joint_key][opponent_action] += 1 137 self.my_joint_hist_model[joint_key][my_action] += 1 19 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 138 139 if self.opponent_history: 140 self.my_reactive_model[self.opponent_history[-1]][my_action] += 1 141 142 for n in self.opp_markov_models: 143 if len(self.opponent_history) >= n: 144 key = tuple(self.opponent_history[-n:]) 145 self.opp_markov_models[n][key][opponent_action] += 1 146 if len(self.my_history) >= n: 147 key = tuple(self.my_history[-n:]) 148 self.my_markov_models[n][key][my_action] += 1 149 150 # --- Expert Prediction Methods --- 151 152 def _get_best_prediction(self, model: Dict, key: Any) -> Optional[str]: 153 """Helper to get the most likely move from a model given a key. 154 155 Crucially, it breaks ties randomly to make our agent less predictable. 156 """ 157 predictions = model.get(key) 158 if not predictions: 159 return None 160 max_val = max(predictions.values()) 161 best_moves = [move for move, val in predictions.items() if val == max_val] 162 return random.choice(best_moves) 163 164 def _create_opp_markov_predictor(self, n: int) -> Callable[[], Optional[str]]: 165 """Factory for creating n-th order Markov model predictors for the opponent.""" 166 167 def predictor() -> Optional[str]: 168 if len(self.opponent_history) < n: 169 return None 170 key = tuple(self.opponent_history[-n:]) 171 return self._get_best_prediction(self.opp_markov_models[n], key) 172 173 return predictor 174 175 def _predict_reactive(self) -> Optional[str]: 176 if not self.my_history: 177 return None 178 return self._get_best_prediction(self.reactive_model, self.my_history[-1]) 179 180 def _predict_joint_hist(self) -> Optional[str]: 181 if not self.my_history: 182 return None 183 key = (self.my_history[-1], self.opponent_history[-1]) 184 return self._get_best_prediction(self.joint_hist_model, key) 185 186 def _predict_frequentist(self) -> Optional[str]: 187 return self._get_best_prediction(’freq’: self.freq_model, ’freq’) 188 189 def _predict_copy_opponent(self) -> Optional[str]: 190 return self.opponent_history[-1] if self.opponent_history else None 191 192 def _predict_copy_me(self) -> Optional[str]: 193 return self.my_history[-1] if self.my_history else None 194 195 def _predict_rotator(self) -> Optional[str]: 196 if not self.opponent_history: 197 return None 198 return self.COUNTER_MOVES.get(self.opponent_history[-1]) 199 200 def _predict_counter_rotator(self) -> Optional[str]: 201 if not self.opponent_history: 202 return None 203 return v: k for k, v in self.COUNTER_MOVES.items().get( 204 self.opponent_history[-1] 205 ) 206 207 def _create_periodic_predictor(self, n: int) -> Callable[[], Optional[str]]: 20 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 208 """Factory for creating predictors for various periodicities.""" 209 210 def predictor() -> Optional[str]: 211 return ( 212 self.opponent_history[-n] if len(self.opponent_history) >= n else None 213 ) 214 215 return predictor 216 217 def _create_meta_my_markov_predictor( 218 self, n: int 219 ) -> Callable[[], Optional[str]]: 220 """Assumes opponent predicts my n-th order markov pattern and counters it.""" 221 222 def predictor() -> Optional[str]: 223 if len(self.my_history) < n: 224 return None 225 key = tuple(self.my_history[-n:]) 226 predicted_my_move = self._get_best_prediction( 227 self.my_markov_models[n], key 228 ) 229 return ( 230 self.COUNTER_MOVES.get(predicted_my_move) 231 if predicted_my_move 232 else None 233 ) 234 235 return predictor 236 237 def _predict_meta_my_reaction(self) -> Optional[str]: 238 """Assumes opponent predicts my reaction pattern and counters it.""" 239 if not self.opponent_history: 240 return None 241 predicted_my_reaction = self._get_best_prediction( 242 self.my_reactive_model, self.opponent_history[-1] 243 ) 244 return ( 245 self.COUNTER_MOVES.get(predicted_my_reaction) 246 if predicted_my_reaction 247 else None 248 ) 249 250 def _predict_meta_my_joint(self) -> Optional[str]: 251 """Assumes opponent predicts my joint history pattern and counters it.""" 252 if not self.my_history: 253 return None 254 key = (self.my_history[-1], self.opponent_history[-1]) 255 predicted_my_move = self._get_best_prediction(self.my_joint_hist_model, key) 256 return ( 257 self.COUNTER_MOVES.get(predicted_my_move) if predicted_my_move else None 258 ) 259 260 def _predict_meta_imitation(self) -> Optional[str]: 261 """Predicts by assuming the opponent is using a model similar to our own 262 263 best-performing model to predict our moves. This is Theory of Mind Level 2. 264 """ 265 # Define the set of our experts that an opponent is likely to be using. 266 opponent_modeling_experts = 267 ’opp_markov1’: (self.my_markov_models[1], self.my_history, 1), 268 ’opp_markov2’: (self.my_markov_models[2], self.my_history, 2), 269 ’opp_markov3’: (self.my_markov_models[3], self.my_history, 3), 270 ’reactive’: (self.my_reactive_model, self.opponent_history, 1), 271 ’joint_hist’: ( 272 self.my_joint_hist_model, 273 (self.my_history, self.opponent_history), 274 2, 275 ), 276 277 # Find which of these opponent-modeling experts is performing best for us. 21 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 278 best_expert_name = max( 279 opponent_modeling_experts, 280 key=lambda name: self.expert_scores.get(name, 0), 281 ) 282 model, history_data, hist_len = opponent_modeling_experts[best_expert_name] 283 284 key = None 285 # Construct the key for the corresponding model of * our * behavior. 286 if best_expert_name == ’joint_hist’: 287 if len(history_data[0]) >= 1 and len(history_data[1]) >= 1: 288 key = (history_data[0][-1], history_data[1][-1]) 289 elif isinstance(history_data, list) and len(history_data) >= hist_len: 290 key = tuple(history_data[-hist_len:]) 291 292 if key is None: 293 return None 294 295 # Predict our own move using the model the opponent is likely using. 296 predicted_my_move = self._get_best_prediction(model, key) 297 # Assume the opponent will play the counter to our predicted move. 298 return ( 299 self.COUNTER_MOVES.get(predicted_my_move) if predicted_my_move else None 300 ) 301 302 # --- Main Agent Logic --- 303 304 def act(self, observation: dict[str, Any]) -> str: 305 """Determines the agent’s next move by learning from the past and using the 306 307 weighted ensemble to predict the opponent’s action. 308 """ 309 # First round: reset state and play randomly. 310 if observation.get(’my_action’) is None: 311 self.reset() 312 return random.choice(self.MOVES) 313 314 # --- 1. Learning Phase (from previous turn’s outcome) --- 315 my_prev_action = observation[’my_action’] 316 opp_prev_action = observation[’opponent_action’] 317 318 self._update_scores(opp_prev_action) 319 self._update_models(my_prev_action, opp_prev_action) 320 self.my_history.append(my_prev_action) 321 self.opponent_history.append(opp_prev_action) 322 323 # --- 2. Prediction Phase (for the current turn) --- 324 for name, func in self.experts.items(): 325 self.last_predictions[name] = func() 326 327 # --- 3. Action Phase (for the current turn) --- 328 vote_tally = defaultdict(float) 329 for name, prediction in self.last_predictions.items(): 330 if prediction is not None: 331 vote_tally[prediction] += self.expert_scores[name] ** self.vote_power 332 333 if vote_tally: 334 # Determine the most likely opponent move using our tie-breaking helper. 335 predicted_opponent_move = self._get_best_prediction( 336 ’vote’: vote_tally, ’vote’ 337 ) 338 return self.COUNTER_MOVES[predicted_opponent_move] 339 340 # Ultimate fallback: If no expert could predict, play randomly. 341 return random.choice(self.MOVES) 22 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Listing 2 | A repeated Leduc Poker bot that achieves best PopReturn against the opponent pool 1 from typing import Any, Dict 2 3 4 class RepeatedLeducPokerBot: 5 """A bot using a non-stationary strategy. 6 7 It collapses its strategy into a single, continuous ’bravery’ parameter. This 8 parameter adapts based on the opponent’s observed passivity and aggression, 9 creating a reactive, non-stationary strategy designed to be difficult for 10 learning opponents to model. 11 """ 12 13 def __init__(self): 14 """Initializes the bot.""" 15 self.player_id: int = -1 16 self.opponent_id: int = -1 17 self.card_ranking = ’J’: 1, ’Q’: 2, ’K’: 3 18 self.game_count = 0 19 20 self.alpha = 1.0 # Laplace smoothing parameter. 21 # Model: info_set_str -> ’FOLD’: count, ’CALL’: count, ’RAISE’: count 22 self.opponent_model = 23 24 def _get_info_set_key(self, round_name, public_card, round_history) -> str: 25 """Creates a unique key for an information set.""" 26 public_card_str = public_card if public_card else ’ 27 28 action_str = ’.join(a[’action’][0] for a in round_history) 29 return f’round_name:public_card_str:action_str’ 30 31 def receive_outcome(self, obs: Dict[str, Any]): 32 """Receives the game outcome and updates the opponent action model.""" 33 self.game_count += 1 34 action_history = obs.get(’action_history’, ) 35 36 for round_name in [’PREFLOP’, ’POSTFLOP’]: 37 round_history = action_history.get(round_name, []) 38 # We need to build the history incrementally to get the info set * before * each action 39 temp_round_history = [] 40 for action_event in round_history: 41 if action_event.get(’player_id’) == self.opponent_id: 42 public_card = ( 43 obs[’public_state’].get(’public_card’) 44 if round_name == ’POSTFLOP’ 45 else None 46 ) 47 info_set_key = self._get_info_set_key( 48 round_name, public_card, temp_round_history 49 ) 50 51 if info_set_key not in self.opponent_model: 52 self.opponent_model[info_set_key] = 53 ’FOLD’: self.alpha, 54 ’CALL’: self.alpha, 55 ’RAISE’: self.alpha, 56 57 58 action = action_event.get(’action’) 59 if action in self.opponent_model[info_set_key]: 60 self.opponent_model[info_set_key][action] += 1 61 62 temp_round_history.append(action_event) 63 64 def restart(self, player_id: int): 65 """Starts a new game, assigning player position and adapting ’bravery’. 66 67 Args: 68 player_id: The player ID (0 or 1) for the new game. 69 """ 23 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 70 self.player_id = player_id 71 self.opponent_id = 1 - player_id 72 73 # This method is called at the start of a new game. 74 # We can reset any per-game state here if needed. 75 pass 76 77 def act(self, obs: Dict[str, Any]) -> str: 78 """Outputs an action based on Expected Value (EV) calculations.""" 79 legal_actions = obs[’player_view’][’legal_actions’] 80 if not legal_actions: 81 return ’CALL’ 82 83 # 1. Calculate Equity 84 my_hand = obs[’player_view’][’hand’] 85 public_card = obs[’public_state’][’public_card’] 86 equity = self._calculate_equity(my_hand, public_card, obs) # Pass obs here 87 88 # 2. Calculate EV for each legal action 89 evs = 90 for action in legal_actions: 91 evs[action] = self._calculate_action_ev(action, obs, equity) 92 93 # 3. Choose best action 94 best_action = max(evs, key=evs.get) 95 return best_action 96 97 def _calculate_equity( 98 self, my_hand: str, public_card: str or None, obs: Dict[str, Any] 99 ) -> float: 100 """Calculates the equity of our hand against a distribution of opponent hands, 101 102 weighted by opponent’s observed actions in the current hand. 103 Equity is the probability of winning at showdown. 104 """ 105 deck = [’J’, ’J’, ’Q’, ’Q’, ’K’, ’K’] 106 deck.remove(my_hand) 107 if public_card: 108 if public_card in deck: 109 if public_card in deck: # Check membership before removing 110 deck.remove(public_card) 111 112 possible_opponent_hands = deck 113 if not possible_opponent_hands: 114 return 0.5 # No possible hands for opponent, return neutral equity 115 116 opponent_hand_weights = 117 hand: 1.0 for hand in possible_opponent_hands 118 # Start with uniform weights 119 120 # --- Add Weighting based on History --- 121 action_history = obs.get(’action_history’, ) 122 preflop_history = action_history.get(’PREFLOP’, []) 123 postflop_history = action_history.get(’POSTFLOP’, []) 124 125 opponent_preflop_raises = 0 126 opponent_preflop_calls = 0 # Calls implying >0 bet (heuristic) 127 opponent_postflop_raises = 0 128 opponent_postflop_calls = 0 # Calls implying >0 bet (heuristic) 129 130 # Simplified action parsing (doesn’t know exact bet amounts) 131 temp_round_hist = [] 132 for action_event in preflop_history: 133 if action_event[’player_id’] == self.opponent_id: 134 action = action_event[’action’] 135 # Check if this action faces a bet based on prior actions 136 facing_bet = ( 137 any( 138 a[’action’] == ’RAISE’ 139 for a in temp_round_hist 24 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 140 if a[’player_id’] == self.player_id 141 ) 142 or self.player_id == 1 143 ) # P2 faces B 144 145 if action == ’RAISE’: 146 opponent_preflop_raises += 1 147 elif ( 148 action == ’CALL’ and facing_bet 149 ): # Count calls that match a bet/raise 150 opponent_preflop_calls += 1 151 temp_round_hist.append(action_event) 152 153 temp_round_hist = [] 154 # Base postflop bet state on whether preflop ended with aggression 155 postflop_facing_bet = False # Initial state for P1 postflop 156 if preflop_history: 157 last_preflop_action = preflop_history[-1] 158 # If last preflop action was a raise (either ours or theirs that we called) 159 if last_preflop_action[’action’] == ’RAISE’: 160 postflop_facing_bet = ( 161 True # Simplified: assumes aggression carries over 162 ) 163 164 for action_event in postflop_history: 165 if action_event[’player_id’] == self.opponent_id: 166 action = action_event[’action’] 167 # Check if opponent faced a bet in this round before their action 168 currently_facing_bet = any( 169 a[’action’] == ’RAISE’ 170 for a in temp_round_hist 171 if a[’player_id’] == self.player_id 172 ) 173 174 if action == ’RAISE’: 175 opponent_postflop_raises += 1 176 postflop_facing_bet = True # Their raise makes us face a bet 177 elif ( 178 action == ’CALL’ and currently_facing_bet 179 ): # Count calls that match a bet/raise 180 opponent_postflop_calls += 1 181 postflop_facing_bet = False # Call equalizes bets 182 elif ( 183 action == ’CALL’ and not currently_facing_bet 184 ): # Zero-cost call (check) 185 postflop_facing_bet = False # Check equalizes bets 186 else: # Our action 187 if action_event[’action’] == ’RAISE’: 188 postflop_facing_bet = True # We raised, opponent faces bet 189 elif action_event[’action’] == ’CALL’: # Our check or call 190 # If we called a raise, bets are equal. If we checked, bets are equal. 191 postflop_facing_bet = False 192 temp_round_hist.append(action_event) 193 194 # Apply weights (Heuristics - Requires Tuning) 195 if opponent_preflop_raises > 0: 196 for hand in list(opponent_hand_weights.keys()): 197 if hand == ’J’: 198 opponent_hand_weights[hand] * = 0.1 # Less likely J 199 if hand == ’Q’: 200 opponent_hand_weights[hand] * = 0.7 201 if hand == ’K’: 202 opponent_hand_weights[hand] * = 2.0 # More likely K 203 elif opponent_preflop_calls > 0: # Called preflop but didn’t raise 204 for hand in list(opponent_hand_weights.keys()): 205 if hand == ’J’: 206 opponent_hand_weights[hand] * = 1.2 207 if hand == ’K’: 208 opponent_hand_weights[hand] * = 0.5 # Less likely K 209 25 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 210 if public_card: 211 if opponent_postflop_raises > 0: 212 for hand in list(opponent_hand_weights.keys()): 213 opp_rank_tuple = self._get_hand_rank(hand, public_card) 214 is_pair = ( 215 opp_rank_tuple[0] >= 3 + self.card_ranking[’J’] 216 ) # Check if it’s any pair 217 if not is_pair: 218 opponent_hand_weights[ 219 hand 220 ] * = 0.1 # Unlikely to raise postflop without a pair 221 else: 222 opponent_hand_weights[ 223 hand 224 ] * = 2.0 # More likely to have a pair if raised 225 226 # Normalize weights to form a probability distribution 227 total_weight = sum(opponent_hand_weights.values()) 228 if ( 229 total_weight <= 1e-9 230 ): # Avoid division by zero / handle case where all weights -> 0 231 # Reset to uniform if weights collapse 232 opponent_hand_weights = hand: 1.0 for hand in possible_opponent_hands 233 total_weight = sum(opponent_hand_weights.values()) 234 if total_weight <= 1e-9: 235 return 0.5 # Still no weight, return neutral equity 236 237 opponent_hand_probs = 238 hand: weight / total_weight 239 for hand, weight in opponent_hand_weights.items() 240 241 # Filter out hands with zero probability for efficiency 242 opponent_hand_probs = 243 h: p for h, p in opponent_hand_probs.items() if p > 1e-9 244 245 246 if not opponent_hand_probs: 247 return 0.5 # Check again after filtering 248 249 win, tie = 0.0, 0.0 # Use floats for weighted sum 250 total_prob_considered = 0.0 251 252 if public_card: # Postflop Equity Calculation 253 my_rank = self._get_hand_rank(my_hand, public_card) 254 for opp_hand, prob in opponent_hand_probs.items(): 255 opp_rank = self._get_hand_rank(opp_hand, public_card) 256 if my_rank > opp_rank: 257 win += prob 258 elif my_rank == opp_rank: 259 tie += prob 260 total_prob_considered += prob 261 262 if total_prob_considered == 0: 263 return 0.5 264 # Weighted Equity = (Sum P(OppHand) * Win(OppHand) + 0.5 * Sum P(OppHand) * Tie(OppHand)) / Sum P(OppHand) 265 # Since Sum P(OppHand) = total_prob_considered, this simplifies: 266 return (win + 0.5 * tie) / total_prob_considered 267 268 else: # Preflop Equity Calculation (Average over public cards) 269 total_weighted_outcomes = 0.0 # Sum of probabilities of paths considered 270 271 # Need the deck * after * my card is removed, before considering opponent hand 272 base_deck_for_opp = [’J’, ’J’, ’Q’, ’Q’, ’K’, ’K’] 273 base_deck_for_opp.remove(my_hand) 274 275 for opp_hand, opp_prob in opponent_hand_probs.items(): 276 # Create the deck for board cards given my_hand and opp_hand 277 deck_for_board = list(base_deck_for_opp) 278 if opp_hand not in deck_for_board: 26 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 279 # This can happen if opponent_hand_weights included hands now impossible 280 # due to filtering or deck state issues. Skip this opponent hand. 281 continue 282 deck_for_board.remove(opp_hand) 283 284 if not deck_for_board: # No possible board cards (e.g., K vs K preflop) 285 # Rank based on private card only 286 my_rank_priv = self.card_ranking[my_hand] 287 opp_rank_priv = self.card_ranking[opp_hand] 288 outcome_weight = ( 289 opp_prob # Weight by probability of this opponent hand 290 ) 291 if my_rank_priv > opp_rank_priv: 292 win += outcome_weight 293 elif my_rank_priv == opp_rank_priv: 294 tie += outcome_weight 295 total_weighted_outcomes += outcome_weight 296 continue # Skip board card loop for this opponent hand 297 298 # Loop through possible board cards 299 num_board_cards = len(deck_for_board) 300 board_card_prob = ( 301 1.0 / num_board_cards 302 ) # Uniform prob for each possible board card 303 304 for board_card in deck_for_board: 305 outcome_weight = ( 306 opp_prob * board_card_prob 307 ) # Combined probability of this path 308 309 my_rank_board = self._get_hand_rank(my_hand, board_card) 310 opp_rank_board = self._get_hand_rank(opp_hand, board_card) 311 312 if my_rank_board > opp_rank_board: 313 win += outcome_weight 314 # Loss contributes 0 to ’win’ or ’tie’ 315 elif my_rank_board == opp_rank_board: 316 tie += outcome_weight 317 total_weighted_outcomes += outcome_weight 318 319 if total_weighted_outcomes == 0: 320 return 0.5 # Default if something went wrong 321 # Equity = (Weighted Wins + 0.5 * Weighted Ties) / Total Weight Considered 322 # Note: total_weighted_outcomes should ideally sum close to 1.0 if all paths are covered. 323 # Normalizing by total_weighted_outcomes accounts for any paths skipped (e.g., impossible opp hands) 324 return (win + 0.5 * tie) / total_weighted_outcomes 325 326 def _get_hand_rank( 327 self, private_card: str, public_card: str or None 328 ) -> tuple[int, int]: 329 """Assigns a numeric rank tuple to a hand for comparison.""" 330 private_rank = self.card_ranking[private_card] 331 if not public_card: 332 # Pre-showdown, just rank by private card 333 return (private_rank, 0) 334 335 public_rank = self.card_ranking[public_card] 336 # Postflop 337 if private_rank == public_rank: # Pair 338 # Major rank for pairs (4,5,6) is higher than for high cards (1,2,3) 339 return (3 + private_rank, 0) 340 else: # High card 341 # Major rank is the highest card, minor rank is the kicker 342 return (max(private_rank, public_rank), min(private_rank, public_rank)) 343 344 def _get_opponent_action_probs(self, obs: Dict[str, Any]) -> Dict[str, float]: 345 """Gets the opponent’s likely response probabilities to our RAISE from our model.""" 346 round_name = obs[’public_state’][’round’] 347 public_card = obs[’public_state’][’public_card’] 27 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 348 349 current_history = obs[’action_history’][round_name] 350 # Simulate our raise to create the info set the opponent would face 351 simulated_history = current_history + [ 352 ’player_id’: self.player_id, ’action’: ’RAISE’ 353 ] 354 info_set_key = self._get_info_set_key( 355 round_name, public_card, simulated_history 356 ) 357 358 if info_set_key in self.opponent_model: 359 counts = self.opponent_model[info_set_key] 360 total = sum(counts.values()) 361 if total > 0: 362 return action: count / total for action, count in counts.items() 363 364 # Default assumption if we have no data 365 return ’FOLD’: 0.33, ’CALL’: 0.34, ’RAISE’: 0.33 366 367 def _calculate_action_ev( 368 self, action: str, obs: Dict[str, Any], equity: float 369 ) -> float: 370 """Calculates the Expected Value of a single action.""" 371 pot_size = obs[’public_state’][’pot_size’] 372 373 committed = [(100 - c) for c in obs[’public_state’][’chips’]] 374 my_committed = committed[self.player_id] 375 opp_committed = committed[self.opponent_id] 376 amount_to_call = opp_committed - my_committed 377 378 round_name = obs[’public_state’][’round’] 379 raise_amount = 2 if round_name == ’PREFLOP’ else 4 380 381 if action == ’FOLD’: 382 return 0 383 384 if action == ’CALL’: 385 pot_after_call = pot_size + amount_to_call 386 return (equity * pot_after_call) - amount_to_call 387 388 if action == ’RAISE’: 389 # EV(Raise) = P(Opponent Folds) * (Win Pot) + P(Opponent Calls) * (EV at Showdown) 390 # We simplify by ignoring opponent re-raises. 391 392 opp_probs = self._get_opponent_action_probs(obs) 393 prob_opp_folds = opp_probs[’FOLD’] 394 395 # What we win if opponent folds 396 win_on_fold = pot_size 397 398 # What happens if opponent calls our raise 399 amount_to_raise = amount_to_call + raise_amount 400 # Pot includes current pot, my raise, and opponent’s call of the raise amount 401 pot_after_opp_calls = pot_size + amount_to_raise + raise_amount 402 403 value_if_called = (equity * pot_after_opp_calls) - amount_to_raise 404 405 return (prob_opp_folds * win_on_fold) + ( 406 (1 - prob_opp_folds) * value_if_called 407 ) 408 409 return -999 # Should not be reached A.2. Prompts and Initial Strategies A.2.1. Repeated Rock-Paper-Scissors 28 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models Listing 3 | Prompt for Repeated Rock-Paper-Scissors 1 Act as an expert Python programmer and game theory enthusiast. 2 3 Your task is to implement an agent class called ‘Agent‘ that represents a 4 player’s strategy in a game of Repeated Rock Paper Scissors. 5 6 ## Game Rules: 7 8 * The game is played for 1000 rounds. 9 * Your goal is to maximize your total number of wins over the 1000 rounds. 10 * The valid moves are strings: ’ROCK’, ’PAPER’, ’SCISSORS’. 11 * ROCK beats SCISSORS, SCISSORS beats PAPER, and PAPER beats ROCK. 12 13 ## Class Signature: 14 15 Your implementation must align with the following class structure. 16 17 ‘python 18 class Agent: 19 20 def act(self, observation: dict[str, Any]) -> str: 21 ... 22 ‘ 23 24 ## Input details: 25 26 The input ‘observation‘ is a dictionary with two keys representing the moves 27 played by the player and the opponent in the previous round. Defaults to ‘None‘ 28 for the first round. 29 30 ‘python 31 observation = 32 ’my_action’: ’ROCK’, # None for the first round. 33 ’opponent_action’: ’PAPER’, # None for the first round. 34 35 ‘ A.2.2. Repeated Leduc Poker Listing 4 | Prompt for Repeated Leduc Poker 1 Act as an expert in computer games, opponent modeling, algorithm design and multiagent learning. Your task is to iteratively improve the provided bot in repeated leduc poker. The bot will play leduc poker with an opponent for multiple games, where in each game their player positions are randomly permutated. The primary goal is to increase the scores on the provided evaluation metrics, where larger values are better. 2 3 Always adhere to best practices in Python coding. 4 5 # Rule of Leduc Poker 6 Leduc Poker is a simplified two-player poker game, ideal for AI research, that uses a small deck to focus on core poker concepts like betting strategy and imperfect information. 7 8 Here is a detailed breakdown of the rules to clarify legal moves. Note that in this implementation , the "Check" action is not available; players must use "Call" instead. A call may be zero- cost if there is no outstanding bet to match. 9 10 ** 1. Setup & Preliminaries ** 11 * ** Players: ** 2. 12 * ** Deck: ** 6 cards (two Jacks, two Queens, two Kings). 13 * ** Blinds: ** Before cards are dealt, mandatory bets are posted: 14 * Player 1 (P1) posts a ** Small Blind ** of 1 unit. 15 * Player 2 (P2) posts a ** Big Blind ** of 2 units. 16 * ** The Deal: ** Each player receives one private card, face down. 17 18 ** 2. Core Betting Rules ** 19 * ** Raise Sizing: ** The amount to raise is fixed. 29 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 20 * ** Round 1: ** The raise amount is ** 2 units ** . 21 * ** Round 2: ** The raise amount is ** 4 units ** . 22 * ** Total Betting Cap: ** The total betting cap for each round is a maximum of ** two raises ** . 23 * ** Acting First: ** Player 1 (the small blind) acts first in both betting rounds (pre-flop and post-flop). 24 25 ** 3. Round 1: Pre-Flop Betting ** 26 This round occurs before the public card is revealed. 27 28 * ** P1’s First Action: ** P1 must act on P2’s 2-unit Big Blind. 29 * ** Fold: ** Forfeit the 1-unit blind. P2 wins the pot. 30 * ** Call: ** Match the 2 units by putting in 1 more unit. 31 * ** Raise: ** Make a 2-unit raise, for a total of 4 units (P1 puts in 3 units). The total betting cap has been reached. 32 * ** P2’s Action: ** 33 * If P1 ** called ** , P2 can ** Call ** (a zero-cost action, as bets are equal) to end the round , or ** Raise ** (by putting in 2 more units to make it 4 total). 34 * If P1 ** raised ** , P2 can only ** Call ** (by putting in 2 more units) or ** Fold ** . The betting cap has been reached. 35 * ** P1’s Second Action (if necessary): ** If P1 called and P2 then raised, the action returns to P1. P1 can only ** Call ** (by putting in 2 more units) or ** Fold ** . 36 37 ** 4. The Flop: Public Card ** 38 After Round 1 betting concludes, one public card is dealt face-up. This card is shared by both players. 39 40 ** 5. Round 2: Post-Flop Betting ** 41 This round occurs after the flop. There are no blinds. 42 43 * ** P1’s First Action: ** 44 * ** Call: ** Make a zero-cost call to pass the turn (as there is no outstanding bet). 45 * ** Raise: ** Make a 4-unit raise. 46 * ** P2’s Action: ** 47 * If P1 ** called ** (at zero-cost), P2 can also ** Call ** (at zero-cost, ending the round) or ** Raise ** 4 units. 48 * If P1 ** raised ** , P2 can ** Call ** (matching the 4 units), ** Raise ** (by putting in another 4 units, for a total bet of 8), or ** Fold ** . The total betting cap has been reached. 49 * ** Subsequent Actions: ** 50 * If P2 ** raised ** (after P1’s initial zero-cost call), the action returns to P1, who can ** Call ** (the 4 unit bet), ** Raise ** (to 8 total), or ** Fold ** . The total betting cap has been reached. 51 * If a player ** raises ** , the other player can only ** Call ** or ** Fold ** , as the betting cap has been reached. 52 53 ** 6. Showdown & Hand Ranking ** 54 If neither player folds, a showdown occurs after Round 2 betting. 55 56 * ** Hand: ** A player’s hand is their private card combined with the public card. 57 * ** Hand Ranks (best to worst): ** 58 1. ** Pair: ** Two cards of the same rank (e.g., J-J). Higher pairs beat lower pairs. 59 2. ** High Card: ** If no one has a pair, the player with the highest card wins (K > Q > J). 60 * ** Ties: ** If both players have the same hand rank (e.g., both have a King-high), the pot is split. 61 62 ** 7. Winning ** 63 A player wins the pot either by being the only one left after the other folds, or by having the best hand at showdown. 64 65 66 # Program Skeleton 67 You should design the bot according to the following APIs: 68 69 ‘python 70 class RepeatedLeducPokerBot: 71 72 def receive_outcome(self, obs: dict[str, Any]): 73 """Receive game outcome of previous game.""" 74 75 def restart(self, player_id: int): 76 """Start a new round of leduc poker with being assigned the player position player_id.""" 30 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 77 78 def act(self, obs: dict[str, Any]) -> str: 79 """Output an action given an observation. 80 81 Args: 82 obs: a JSON observation dictionary. 83 Output: 84 action, an action in ’FOLD’, ’CALL’, ’RAISE’ 85 """ 86 ‘ 87 88 # Example observation format 89 The act method is called each time the agent need to make a decision during a game, given an observation in the format of JSON. Here are some examples of an observation: 90 ’player_view’: ’player_id’: 0, ’current_player’: True, ’hand’: ’K’, ’legal_actions’: [’CALL’, ’RAISE’], ’public_state’: ’round’: ’PREFLOP’, ’chips’: [99, 99], ’pot_size’: 2, ’ public_card’: None, ’action_history’: ’PREFLOP’: [], ’POSTFLOP’: [], ’game_result’: None 91 92 ’player_view’: ’player_id’: 1, ’current_player’: True, ’hand’: ’K’, ’legal_actions’: [’CALL’, ’RAISE’], ’public_state’: ’round’: ’POSTFLOP’, ’chips’: [99, 99], ’pot_size’: 2, ’ public_card’: ’Q’, ’action_history’: ’PREFLOP’: [’player_id’: 0, ’action’: ’CALL’, ’ player_id’: 1, ’action’: ’CALL’], ’POSTFLOP’: [’player_id’: 0, ’action’: ’CALL’], ’ game_result’: None 93 94 The receive_outcome method is called at the end of each game. The agent receives an observation of the final state of the game to improve game play for future games with the same opponent. The observation here contains final payoff for each player, and players’ hands if the last action is not FOLD. Some examples: 95 96 ’player_view’: ’player_id’: 0, ’current_player’: False, ’hand’: ’K’, ’legal_actions’: [], ’ public_state’: ’round’: ’PREFLOP’, ’chips’: [101, 99], ’pot_size’: 0, ’public_card’: None, ’action_history’: ’PREFLOP’: [’player_id’: 0, ’action’: ’RAISE’, ’player_id’: 1, ’ action’: ’FOLD’], ’POSTFLOP’: [], ’game_result’: ’outcome’: ’FOLD’, ’returns’: [1, -1], ’showdown_hands’: None 97 98 99 ’player_view’:’player_id’: 0, ’current_player’: False, ’hand’: ’J’, ’legal_actions’: [], ’ public_state’: ’round’: ’POSTFLOP’, ’chips’: [95, 105], ’pot_size’: 0, ’public_card’: ’Q’, ’action_history’: ’PREFLOP’: [’player_id’: 0, ’action’: ’RAISE’, ’player_id’: 1, ’ action’: ’RAISE’, ’player_id’: 0, ’action’: ’CALL’], ’POSTFLOP’: [’player_id’: 0, ’ action’: ’CALL’, ’player_id’: 1, ’action’: ’CALL’], ’game_result’: ’outcome’: ’ SHOWDOWN’, ’returns’: [-5, 5], ’showdown_hands’: [’player_id’: 0, ’hand’: ’J’, ’ player_id’: 1, ’hand’: ’K’] 100 101 The restart method is called at the beginning of each game, where the agent is informed the player position it will act as in the next game. 102 103 legal_actions is a subset of ’FOLD’, ’CALL’, ’RAISE’ 104 105 106 # Current program 107 Here is the current program we are trying to improve (you will need to propose a modification to it below): 108 109 code 110 111 # Opponents 112 Here are the summary of opponent codes you are trying to beat: 113 instances 114 115 Try to reason about these opponents, and come up with a strategy that can exploit them in general. 116 117 # * SEARCH/REPLACE block * Rules: 118 119 Every * SEARCH/REPLACE block * must use this format: 120 1. The opening fence: ‘python 121 2. The start of search block: <<<<<<< SEARCH 122 3. A contiguous chunk of up to 4 lines to search for in the existing source code 123 4. The dividing line: ======= 31 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 124 5. The lines to replace into the source code 125 6. The end of the replace block: >>>>>>> REPLACE 126 7. The closing fence: ‘ 127 128 Every * SEARCH * section must * EXACTLY MATCH * the existing file content, character for character, including all comments, docstrings, etc. 129 130 * SEARCH/REPLACE * blocks will replace * all * matching occurrences. 131 Include enough lines to make the SEARCH blocks uniquely match the lines to change. 132 133 Keep * SEARCH/REPLACE * blocks concise. 134 Break large * SEARCH/REPLACE * blocks into a series of smaller blocks that each change a small portion of the file. 135 Include just the changing lines, and a few surrounding lines if needed for uniqueness. 136 Do not include long runs of unchanging lines in * SEARCH/REPLACE * blocks. 137 138 To move code within a file, use 2 * SEARCH/REPLACE * blocks: 1 to delete it from its current location, 1 to insert it in the new location. 139 140 Make sure that the changes you propose are consistent with each other. For example, if you refer to a new config variable somewhere, you should also propose a change to add that variable. 141 142 Example: 143 ‘python 144 <<<<<<< SEARCH 145 return total_loss 146 ======= 147 # Add sparsity-promoting regularization to the loss. 148 total_loss += self.hypers.l1_reg_weight * l1_reg 149 150 return total_loss 151 replace 152 ‘ 153 and 154 ‘python 155 <<<<<<< SEARCH 156 return hyper.zipit([ 157 ======= 158 return hyper.zipit([ 159 hyper.uniform(’l1_reg_weight’, hyper.interval(0.0, 0.01)), 160 replace 161 ‘ 162 163 lazy_prompt 164 ONLY EVER RETURN CODE IN A * SEARCH/REPLACE BLOCK * ! 165 166 # Task 167 task_instruction focus_sentence trigger_chain_of_thought 168 Describe each change with a * SEARCH/REPLACE block * . The initial strategy is a heuristic strategy by zero-shot prompting "give a strong heuristic repeated Leduc Poker strategy": A.2.3. Repeated Leduc Poker Listing 5 | Prompt for Repeated Leduc Poker 1 import random 2 from typing import Any, Dict 3 4 5 class RepeatedLeducPokerBot: 6 """ 7 A simple heuristic bot for repeated Leduc Poker. 8 9 The bot’s strategy is based on a set of rules that consider the strength 10 of its private card, the public card (if revealed), and the current 32 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 11 betting round. It does not learn from the opponent’s behavior. 12 """ 13 14 def __init__(self): 15 """Initializes the bot.""" 16 self.player_id: int = -1 17 self.card_ranking = ’J’: 1, ’Q’: 2, ’K’: 3 18 19 def receive_outcome(self, obs: Dict[str, Any]): 20 """ 21 Receives the game outcome of the previous game. 22 23 For this simple heuristic bot, this method does nothing as it 24 does not adapt its strategy based on past games. 25 """ 26 pass 27 28 def restart(self, player_id: int): 29 """ 30 Starts a new game of Leduc Poker, assigning the player position. 31 32 Args: 33 player_id: The player ID (0 or 1) for the new game. 34 """ 35 self.player_id = player_id 36 37 def act(self, obs: Dict[str, Any]) -> str: 38 """ 39 Outputs an action given an observation. 40 41 Args: 42 obs: A dictionary containing the game state from the player’s perspective. 43 44 Returns: 45 A string representing the chosen action: ’FOLD’, ’CALL’, or ’RAISE’. 46 """ 47 player_view = obs.get(’player_view’, ) 48 public_state = obs.get(’public_state’, ) 49 legal_actions = player_view.get(’legal_actions’, []) 50 51 if not legal_actions: 52 # If there are no legal actions, we can’t do anything. 53 # This can happen at the end of a hand. 54 return ’CALL’ 55 56 hand = player_view.get(’hand’) 57 round_name = public_state.get(’round’) 58 59 if round_name == ’PREFLOP’: 60 return self._preflop_strategy(hand, legal_actions) 61 elif round_name == ’POSTFLOP’: 62 public_card = public_state.get(’public_card’) 63 return self._postflop_strategy(hand, public_card, legal_actions) 64 65 # Default action if something unexpected happens 66 return ’CALL’ if ’CALL’ in legal_actions else random.choice(legal_actions) 67 68 def _preflop_strategy(self, hand: str, legal_actions: list[str]) -> str: 69 """Determines the action for the pre-flop round.""" 70 hand_strength = self.card_ranking.get(hand, 0) 71 72 # Strong hand (King) - play aggressively 73 if hand_strength == 3: # King 74 if ’RAISE’ in legal_actions: 75 return ’RAISE’ 76 return ’CALL’ 77 78 # Medium hand (Queen) - play cautiously 79 elif hand_strength == 2: # Queen 80 # If we can call, we do. Avoid folding if possible. 33 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models 81 if ’CALL’ in legal_actions: 82 return ’CALL’ 83 # This case should ideally not be hit if CALL is always an option 84 # when FOLD is, but as a fallback. 85 return ’FOLD’ 86 87 # Weak hand (Jack) - play defensively 88 else: # Jack 89 # We want to see the flop cheaply. 90 if ’CALL’ in legal_actions: 91 return ’CALL’ 92 return ’FOLD’ 93 94 def _postflop_strategy(self, hand: str, public_card: str, legal_actions: list[str]) -> str: 95 """Determines the action for the post-flop round.""" 96 hand_strength = self.card_ranking.get(hand, 0) 97 public_card_strength = self.card_ranking.get(public_card, 0) 98 99 # Check for a pair (strongest hand) 100 if hand == public_card: 101 if ’RAISE’ in legal_actions: 102 return ’RAISE’ 103 return ’CALL’ 104 105 # High card strategy 106 # If our private card is better than the public card, we have a decent high card hand 107 if hand_strength > public_card_strength: 108 if ’RAISE’ in legal_actions: 109 return ’RAISE’ 110 return ’CALL’ 111 112 # If our private card is weaker than the public card, our hand is weak. 113 else: 114 # Check-call if possible (zero-cost call) 115 # A zero-cost call is implied if we just need to call 0 chips. 116 # We will simplify and just check/call. 117 if ’CALL’ in legal_actions: 118 return ’CALL’ 119 return ’FOLD’ 34