
Paper deep dive

Robust agents learn causal world models

Jonathan Richens, Tom Everitt

Year: 2024 · Venue: ICLR 2024 · Area: Formal/Theoretical · Type: Theoretical · Embeddings: 311

Abstract

It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise to new domains, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.

Tags

ai-safety (imported, 100%) · formaltheoretical (suggested, 92%) · theoretical (suggested, 88%)

Links

Open PDF directly →

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 7:37:39 PM

Summary

The paper proves that any agent capable of maintaining low regret across a wide range of distributional shifts must necessarily learn an approximate causal model of the data-generating process. This theoretical result establishes causal reasoning as a fundamental requirement for robust adaptation and general intelligence, providing a formal basis for causal representation learning.

Entities (5)

Jonathan Richens · researcher · 100%
Tom Everitt · researcher · 100%
Causal Bayesian Networks · formalism · 95%
Causal Influence Diagram · formalism · 95%
Distributional Shift · concept · 90%

Relation Signals (3)

Jonathan Richens affiliated with Google DeepMind

confidence 100% · Jonathan Richens Google DeepMind

Tom Everitt affiliated with Google DeepMind

confidence 100% · Tom Everitt Google DeepMind

Robust Agents learn Causal World Models

confidence 95% · Any agent capable of adapting to a sufficiently large set of distributional shifts must have learned a causal model of the data generating process.

Cypher Suggestions (2)

Identify the relationship between agents and causal models · confidence 95% · unvalidated

MATCH (a:Agent)-[:LEARN]->(m:Model) WHERE m.name CONTAINS 'Causal' RETURN a.name, m.name

Find researchers and their affiliations · confidence 90% · unvalidated

MATCH (r:Researcher)-[:AFFILIATED_WITH]->(o:Organization) RETURN r.name, o.name

Full Text

310,276 characters extracted from source content.


Robust agents learn causal world models

Jonathan Richens (Google DeepMind) and Tom Everitt (Google DeepMind). jonrichens@deepmind.com

Abstract. It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise to new domains, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound for a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.

1 Introduction

What capabilities are necessary for general intelligence (Legg & Hutter, 2007)? One candidate is causal reasoning, which plays a foundational role in human cognition (Gopnik et al., 2007; Sloman & Lagnado, 2015). It has even been argued that human-level AI is impossible without causal reasoning (Pearl, 2018). However, recent years have seen the development of agents that do not explicitly learn or reason on causal models, but are nonetheless capable of adapting to a wide range of environments and tasks (Reed et al., 2022; Team et al., 2023; Brown et al., 2020). This raises the question: do agents have to learn causal models in order to adapt to new domains, or are other inductive biases sufficient?

To answer this question, we have to be careful not to assume that agents use causal assumptions a priori. For example, transportability theory determines what causal knowledge is necessary for transfer learning when all assumptions on the data generating process (inductive biases) can be expressed as constraints on causal structure (Bareinboim & Pearl, 2016). However, deep learning algorithms can exploit a much larger set of inductive biases (Neyshabur et al., 2014; Battaglia et al., 2018; Rahaman et al., 2019; Goyal & Bengio, 2022), which in many real-world settings may be sufficient to identify low-regret policies without requiring causal knowledge. The main result of this paper is to answer this question by showing that:

Any agent capable of adapting to a sufficiently large set of distributional shifts must have learned a causal model of the data generating process.

Here, adapting to a distributional shift means learning a policy that satisfies a regret bound following an intervention on the data generating process—for example, changing the distribution of features or latent variables. It is known that a causal model of the data generating process can be used to identify regret-bounded policies following a distributional shift (sufficiency), with more accurate models allowing lower-regret policies to be found. We prove the converse (necessity)—given regret-bounded policies for a large set of distributional shifts, we can learn an approximate causal model of the data generating process, with the approximation becoming exact for optimal policies. Hence, learning a causal model of the data generating process is necessary for robust adaptation.

This has consequences for a number of fields and questions. For one, it implies that causal identification laws also constrain domain adaptation. For example, we show that adapting to covariate and label shifts is only possible if the causal relations between features and labels can be identified from the training data—a non-trivial causal discovery problem.
This provides further theoretical justification for causal representation learning (Schölkopf et al., 2021), showing that learning causal representations is necessary for achieving strong robustness guarantees. Our result also implies that we can learn causal models from adaptive agents. We demonstrate this by solving a causal discovery task on synthetic data by observing the policy of a regret-bounded agent under distributional shifts. More speculatively, our results suggest that causal models could play a role in emergent capabilities. Agents trained to minimise a loss function across many domains are incentivized to learn a causal world model, which could in turn enable them to solve a much larger set of decision tasks they were not explicitly trained on.

Outline of paper. In Section 2 we introduce concepts from causality and decision theory used to derive our results. We present our main theoretical results in Section 3 and discuss their interpretation in terms of adaptive agents, transfer learning and causal inference. In Section 4 we discuss limitations, as well as implications for a number of fields and open questions. In Section 5 we discuss related work including transportability (Bareinboim & Pearl, 2016) and the causal hierarchy theorem (Bareinboim et al., 2022), and recent empirical work on emergent world models. In Appendix B we describe experiments applying our theoretical results to causal discovery problems.

2 Preliminaries

2.1 Causal models

We use capital letters for random variables $V$ and lower case for their values $v \in \mathrm{dom}(V)$. For simplicity, we assume each variable has a finite number of possible values, $|\mathrm{dom}(V)| < \infty$. Boldface denotes sets of variables $\mathbf{V} = \{V_1, \ldots, V_n\}$ and their values $\mathbf{v} \in \mathrm{dom}(\mathbf{V}) = \times_i \mathrm{dom}(V_i)$. A probabilistic model specifies the joint distribution $P(\mathbf{V})$ over a set of variables $\mathbf{V}$. These models can support associative queries, for example $P(Y = y \mid X = x)$ for $X, Y \subseteq \mathbf{V}$. Interventions describe external changes to the data generating process (and hence changes to the joint distribution); for example, a hard intervention $\mathrm{do}(\mathbf{X} = \mathbf{x})$ describes forcing the set of variables $\mathbf{X} \subseteq \mathbf{V}$ to take value $\mathbf{x}$. This generates a new distribution $P(\mathbf{V} \mid \mathrm{do}(\mathbf{X} = \mathbf{x})) = P(\mathbf{V}_{\mathbf{x}})$, where $\mathbf{V}_{\mathbf{x}}$ refers to the variables $\mathbf{V}$ following this intervention. The power of causal models is that they specify not only $P(\mathbf{V})$ but also the distribution of $\mathbf{V}$ under all interventions, and hence these models can be used to evaluate both associative and interventional queries, e.g. $P(Y = y \mid \mathrm{do}(X = x))$.

For the derivation of our results we focus on a specific class of causal models—causal Bayesian networks (CBNs). There are several alternative models and formalisms studied in the literature, including structural equation models (Pearl, 2009) and Neyman-Rubin causal models (Rubin, 2005), and our results can be straightforwardly adapted to these.

Definition 1 (Bayesian networks).
A Bayesian network $M = (G, P)$ over a set of variables $\mathbf{V} = \{V_1, \ldots, V_n\}$ is a joint probability distribution $P(\mathbf{V})$ that factors according to a directed acyclic graph (DAG) $G$, i.e. $P(V_1, \ldots, V_n) = \prod_{i=1}^n P(V_i \mid \mathrm{Pa}_{V_i})$, where $\mathrm{Pa}_{V_i}$ are the parents of $V_i$ in $G$.

A Bayesian network is causal if the graph $G$ captures the causal relationships between the variables or, formally, if the result of any intervention $\mathrm{do}(\mathbf{X} = \mathbf{x})$ for $\mathbf{X} \subseteq \mathbf{V}$ can be computed from the truncated factorisation formula:

$$P(\mathbf{v} \mid \mathrm{do}(\mathbf{x})) = \begin{cases} \prod_{i : V_i \notin \mathbf{X}} P(v_i \mid \mathrm{pa}_i) & \text{if } \mathbf{v} \text{ is consistent with } \mathbf{x} \\ 0 & \text{otherwise.} \end{cases}$$
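To make the truncated factorisation concrete, here is a minimal Python sketch (ours, not the paper's; the list-of-CPDs representation and the helper name `truncated` are illustrative assumptions) that evaluates $P(\mathbf{v} \mid \mathrm{do}(\mathbf{x}))$ in a two-variable CBN by zeroing inconsistent assignments and dropping the factors of intervened variables:

```python
# Sketch of the truncated factorisation (Definition 1) for a toy discrete CBN.
# cbn: list of (name, parents, cpd) in topological order, where each cpd maps
# a tuple of parent values to a dict {value: probability}.

def truncated(cbn, v, do):
    """P(v | do(x)): zero if v is inconsistent with x, otherwise the product
    of the remaining factors, with intervened variables' factors removed."""
    if any(v[name] != val for name, val in do.items()):
        return 0.0
    p = 1.0
    for name, parents, cpd in cbn:
        if name in do:
            continue  # intervened variable: its factor is dropped
        pa = tuple(v[q] for q in parents)
        p *= cpd[pa].get(v[name], 0.0)
    return p

# Chain X -> Y: intervening on Y severs its dependence on X.
cbn = [("X", (), {(): {0: 0.5, 1: 0.5}}),
       ("Y", ("X",), {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.1, 1: 0.9}})]
print(truncated(cbn, {"X": 0, "Y": 1}, do={"Y": 1}))  # = P(X=0) = 0.5
```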
More generally, a soft intervention $\sigma_{V_i} = P'(V_i \mid \mathrm{Pa}^*_i)$ replaces the conditional probability distribution for $V_i$ with a new distribution $P'(V_i \mid \mathrm{Pa}^*_i)$, possibly resulting in a new parent set $\mathrm{Pa}^*_i \neq \mathrm{Pa}_i$, as long as no cycles are introduced in the graph. We refer to $\sigma_{V_i}$ as a domain indicator (Correa & Bareinboim, 2020) (it has also been called an environment index, Arjovsky et al., 2019). The updated distribution is denoted $P(\mathbf{v}; \sigma_{\mathbf{V}'}) = \prod_{i : V_i \in \mathbf{V}'} P'(v_i \mid \mathrm{pa}^*_i) \prod_{i : V_i \notin \mathbf{V}'} P(v_i \mid \mathrm{pa}_i)$.

In general, soft interventions cannot be defined without knowledge of $G$. For example, the soft intervention $\sigma_Y = P'(y \mid x)$ is incompatible with the causal structure $Y \to X$, as it would induce a causal cycle. As our results are concerned with learning causal models (and hence causal structure), we focus our theoretical analysis on a subset of the soft interventions—local interventions—that are compatible with all causal structures and so can be used without tacitly assuming knowledge of $G$.

Definition 2 (Local interventions). A local intervention $\sigma$ on $V_i \in \mathbf{V}$ involves applying a map to the states of $V_i$ that is not conditional on any other endogenous variables, $v_i \mapsto f(v_i)$. We use the notation $\sigma = \mathrm{do}(V_i = f(v_i))$ (variable $V_i$ is assigned the state $f(v_i)$). Formally, this is a soft intervention on $V_i$ that transforms the conditional probability distribution as

$$P(v_i \mid \mathrm{pa}_i; \sigma) = \sum_{v_i' : f(v_i') = v_i} P(v_i' \mid \mathrm{pa}_i) \quad (1)$$

Example: Hard interventions $\mathrm{do}(V_i = v_i')$ are local interventions where $f(v_i)$ is a constant function.

Example: Translations are local interventions, as $\mathrm{do}(V_i = v_i + k) = \mathrm{do}(V_i = f(v_i))$ where $f(v_i) = v_i + k$. Examples include changing the position of objects in RL environments (Shah et al., 2022) and images (Engstrom et al., 2019).

Example: The logical NOT operation $X \mapsto \neg X$ for Boolean $X$ is a local intervention.

We also consider stochastic interventions, noting that mixtures of local interventions can also be defined without knowledge of $G$. For example, adding noise to a variable, $X = X + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, is a soft intervention on $X$ described by a mixture over local interventions (translations).

Definition 3 (Mixtures of interventions). A mixed intervention $\sigma^* = \sum_i p_i \sigma_i$ with $\sum_i p_i = 1$ performs intervention $\sigma_i$ with probability $p_i$. Formally, $P(\mathbf{v} \mid \sigma^*) = \sum_i p_i P(\mathbf{v} \mid \sigma_i)$.

2.2 Decision tasks

Decision tasks involve a decision maker (agent) choosing a policy so as to optimise an objective function (utility). To give a causal description of decision tasks we use the causal influence diagram (CID) formalism (Howard & Matheson, 2005; Everitt et al., 2021), which extends a CBN of the environment (chance) variables by introducing decision and utility nodes (see Figure 1 for examples). For simplicity we focus on tasks involving a single decision and a single utility function.

Definition 4 (Causal influence diagram). A (single-decision, single-utility) causal influence diagram (CID) is a CBN $M = (G, P)$ where the variables $\mathbf{V}$ are partitioned into decision, utility, and chance variables, $\mathbf{V} = (\{D\}, \{U\}, \mathbf{C})$. The utility variable is a real-valued function of its parents, $U(\mathrm{pa}_U)$.
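Equation (1) is simple to state in code. A minimal sketch (ours; the dictionary CPD representation and the helper name `local_intervention` are illustrative assumptions): a local intervention pushes $P(v_i \mid \mathrm{pa}_i)$ forward through the state map $f$, summing over the preimages $f(v_i') = v_i$.

```python
from collections import defaultdict

def local_intervention(cpd, f):
    """cpd: dict mapping parent values -> {v_i: prob}; f: map on dom(V_i).
    Returns the post-intervention CPD P(v_i | pa_i; sigma) of Equation (1)."""
    new_cpd = {}
    for pa, dist in cpd.items():
        pushed = defaultdict(float)
        for v, p in dist.items():
            pushed[f(v)] += p  # accumulate mass over the preimage of f
        new_cpd[pa] = dict(pushed)
    return new_cpd

cpd = {("pa=0",): {0: 0.7, 1: 0.3}, ("pa=1",): {0: 0.2, 1: 0.8}}
print(local_intervention(cpd, lambda v: 1))      # hard intervention do(V=1)
print(local_intervention(cpd, lambda v: 1 - v))  # logical NOT: swaps the masses
```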
Single-decision, single-utility CIDs can represent most decision tasks, such as classification and regression, as they specify what decision should be made ($d \in D$), based on what information ($\mathrm{pa}_D$), with what objective ($\mathbb{E}[U]$). They can also describe some multi-decision tasks such as Markov decision processes (note that a Markov decision process can be formulated as a single-decision, single-utility CID by modelling the choice of policy as a single decision and the cumulative discounted reward as a single utility variable). The utility is any real-valued function, including standard loss and reward functions.

We assume that the environment is described by a set of random variables $\mathbf{C}$ that interact via causal mechanisms (this assumption follows from Reichenbach (1956); we discuss it further in Section A.3), and that $\mathbf{C}$ satisfies causal sufficiency (Pearl, 2009) (it includes all common causes), noting that such a choice of $\mathbf{C}$ always exists. We refer to the CBN over $\mathbf{C}$ as the 'true' or 'underlying' CBN. Note we do not assume the agent has any knowledge of the underlying CBN, nor do we assume which variables in $\mathbf{C}$ are observed or unobserved by the agent, beyond that the agent can observe $\mathrm{Pa}_D \subseteq \mathbf{C}$. We also assume knowledge of the utility function $U(\mathrm{Pa}_U)$.

The conditional probability distribution for the decision node, $\pi(d \mid \mathrm{pa}_D)$ (the policy), is not a fixed parameter of the model but is set by the agent so as to maximise its expected utility, which for a policy $\pi$ is $\mathbb{E}^\pi[U] = \mathbb{E}[U \mid \mathrm{do}(D = \pi(\mathrm{pa}_D))]$. A policy $\pi^*$ is optimal if it maximises $\mathbb{E}^{\pi^*}[U]$. Typically, agents do not behave optimally and incur some regret $\delta$, which is the decrease in expected utility compared to an optimal policy, $\delta := \mathbb{E}^{\pi^*}[U] - \mathbb{E}^{\pi}[U]$.

To simplify our theoretical analysis, we focus on a widely studied class of decision tasks where the agent's decision does not causally influence the environment (e.g. Figure 1).

Assumption 1 (Unmediated decision task). $\mathrm{Desc}_D \cap \mathrm{Anc}_U = \emptyset$.

In unmediated decision tasks, the agent is provided some (partial) observations of the environment and chooses a policy, which is then evaluated using the utility function, a function of the environment state and the agent's decision. Examples of unmediated decision tasks include prediction tasks such as classification and regression, whereas examples of mediated decision tasks that are not covered by our theorems include Markov decision processes, where the agent's decision (action) influences the utility via the environment state.
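The regret definition can be illustrated with a small enumeration (a toy of ours, not from the paper: the environment state is fully observed, so $\mathrm{pa}_D$ is the whole state and the optimal policy decomposes pointwise over states):

```python
def expected_utility(P_env, policy, utility):
    """E^pi[U] for an unmediated task: D does not influence the environment,
    so we sum P(c) * U(c, pi(c)) over environment states c."""
    return sum(p * utility(c, policy(c)) for c, p in P_env.items())

def regret(P_env, policy, utility, decisions):
    """delta = E^{pi*}[U] - E^{pi}[U], with pi* chosen pointwise per state."""
    opt = lambda c: max(decisions, key=lambda d: utility(c, d))
    return (expected_utility(P_env, opt, utility)
            - expected_utility(P_env, policy, utility))

# Binary prediction with 0-1 utility: always predicting 1 forfeits P(Y=0).
P_env = {0: 0.3, 1: 0.7}
U = lambda y, d: 1.0 if d == y else 0.0
print(regret(P_env, lambda y: 1, U, decisions=(0, 1)))  # 0.3
```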
Figure 1: CID for a supervised learning task during (a) training and (b) testing following a distributional (covariate) shift (unsupervised domain adaptation, Wilson & Cook, 2020). The agent chooses a label prediction $D = \hat{Y}$ given features $X$, with the goal of minimising loss $U = -\mathrm{Loss}(Y, \hat{Y})$. Decision variables are depicted as square nodes, chance variables as circular nodes and utilities as diamond nodes. Information edges (dashed) show the variables the agent conditions their policy on. In this example the labels cause the features, $Y \to X$ (for examples where features cause labels see Castro et al., 2020; Schölkopf et al., 2012). The black square ('regime node' (Correa & Bareinboim, 2020)) in (b) and (c) denotes a distributional shift induced by an intervention on $X$. Diagram (c) depicts the idealised case where the agent knows what domain shift has occurred. By Theorem 1, if the agent can return an optimal decision boundary for known covariate and label shifts, then it must have learned the CBN over $\mathbf{C} = \{X, Y\}$. Note that even if the agent has sufficient training data to learn $P(X, Y)$, the causal structure $Y \to X$ is in general non-identifiable given $P(X, Y)$, and so domain adaptation requires that the agent solve a non-trivial causal discovery problem.

2.3 Distributional shifts

We focus on generalisation that goes beyond the iid assumption, where agents are evaluated in domains that are distributionally shifted from the training environment. Distributional shifts can be changes to the environment (domain shifts), as in domain adaptation and domain generalisation (Farahani et al., 2021; Wilson & Cook, 2020), or changes to the objective (task shifts), as in zero-shot learning (Xian et al., 2018), in-context learning (Brown et al., 2020) and multi-task reinforcement learning (Reed et al., 2022). Our analysis focuses on domain shifts that involve changes to the causal data generating process, and hence can be modelled as interventions (Schölkopf et al., 2021). This does not assume that all shifts an agent will encounter can be modelled as interventions, but requires that the agent is at least capable of adapting to these shifts. Examples of interventionally generated shifts include translating objects in images (Engstrom et al., 2019), noising inputs and adversarial robustness (Hendrycks & Dietterich, 2019), and changes to the initial conditions or transition function in Markov decision processes (Peng et al., 2018). Examples of shifts that are not naturally represented as interventions include changing the set of environment variables $\mathbf{C}$, and introducing selection biases (Shen et al., 2018). See Section A.3 for discussion.

Our main results restrict to local domain shifts, which correspond to local interventions on the chance variables $\mathbf{C}$. We do not consider shifts that change the agent's decision $D$, although we include shifts that drop inputs to the policy, $\mathrm{Pa}_D \to \mathrm{Pa}_D' \subseteq \mathrm{Pa}_D$ (e.g. masking), as local interventions. We do not consider task shifts, i.e. changing the utility function. As we are interested in determining the capabilities necessary for domain adaptation, we restrict our attention to decision tasks where domain adaptation is non-trivial, i.e. where the optimal policy depends on the environment distribution $P(\mathbf{C} = \mathbf{c})$.

Assumption 2 (Domain dependence).
There exist $P(\mathbf{C} = \mathbf{c})$ and $P'(\mathbf{C} = \mathbf{c})$ compatible with $M$ such that $\pi^* = \arg\max_\pi \mathbb{E}_P^\pi[U]$ implies $\pi^* \neq \arg\max_\pi \mathbb{E}_{P'}^\pi[U]$.

Assumption 2 implies the existence of domain shifts that change the optimal policy.

3 Causal models are necessary for robust adaptation

We now present our results in their most general form—an equivalence between learning the underlying CBN and learning regret-bounded policies for local domain shifts. Then in Section 3.2 we apply our theorems to three settings—adaptive agents, transfer learning, and causal inference—and show that robust agents must learn causal world models.

First we focus on the idealised case where we assume optimality. We show that for almost all decision tasks the underlying CBN can be reconstructed given optimal policies for a large set of domain shifts.

Theorem 1. For almost all CIDs $M = (G, P)$ satisfying Assumptions 1 and 2, we can identify the directed acyclic graph $G$ and joint distribution $P$ over all ancestors of the utility, $\mathrm{Anc}_U$, given $\{\pi^*_\sigma(d \mid \mathrm{pa}_D)\}_{\sigma \in \Sigma}$, where $\pi^*_\sigma(d \mid \mathrm{pa}_D)$ is an optimal policy in the domain $\sigma$ and $\Sigma$ is the set of all mixtures of local interventions. Proof in Appendix C.

The parameters $P(v_i \mid \mathrm{pa}_i)$, $U(\mathrm{pa}_U)$ of the underlying CBN define a parameter space, and the condition "for almost all CIDs" means that the subset of the parameter space for which Theorem 1 does not hold has Lebesgue measure zero (see Section A.2 for discussion). This condition is necessary because there exist finely-tuned environments for which the CBN cannot be identified given the agent's policy, due to variables $X \in \mathrm{Anc}_U$ that do not affect the expected utility. For example, consider $X \to Y \to U$ with $Y = \mathcal{N}(0, x)$ and $U = D + Y$; then changing $X$ can only change the variance of $U$ while leaving its expected value (and hence the optimal policy) constant. However, this only occurs for very specific choices of the parameters $P$ and $U$.

In Appendix B we give a simplified overview of the proof with a worked example. We assume access to an oracle for optimal policies $\pi^*_\sigma$ for any given local intervention on $\mathbf{C}$. Note, this assumes the agent is robust to distributional shifts on a causally sufficient set of variables $\mathbf{C}$, not that the set of variables the agent observes is causally sufficient. We devise an algorithm that queries this oracle with different mixtures of local interventions and identifies the mixtures for which the optimal policy changes. We then show that these critical mixtures identify the parameters of the CBN, specifying both the graph $G(\mathrm{Anc}_U)$ and the joint distribution $P(\mathrm{Anc}_U)$.
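The mechanism behind the proof can be seen in a one-variable toy (our construction, far simpler than the Appendix C algorithm): sweep a one-parameter family of mixed interventions, and the critical mixture weight at which the oracle's optimal decision flips pins down an unknown parameter of the CBN.

```python
# Toy: hidden q = P(Y=1); the task is to guess Y blindly with U = 1[d == y].
# Mixed intervention sigma_p = p * do(Y=0) + (1-p) * (identity map), so
# P(Y=1; sigma_p) = (1-p) * q. Assumes q > 1/2, so a flip exists in (0, 1).
q_true = 0.8  # never revealed directly; only the oracle's decisions are seen

def oracle(p):
    """Optimal decision under sigma_p: guess 1 iff P(Y=1; sigma_p) > 1/2."""
    return 1 if (1 - p) * q_true > 0.5 else 0

# Bisect for the critical weight p*: (1 - p*) q = 1/2, so q = 1 / (2 (1 - p*)).
lo, hi = 0.0, 1.0  # invariant: oracle(lo) == 1, oracle(hi) == 0
for _ in range(50):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if oracle(mid) == 1 else (lo, mid)
p_star = (lo + hi) / 2
print(1 / (2 * (1 - p_star)))  # recovers q_true = 0.8
```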
3.1 Relaxing the assumption of optimality

We now relax the assumption of optimality, considering the case where the policies $\pi_\sigma$ satisfy a regret bound $\mathbb{E}^{\pi_\sigma}[U] \geq \mathbb{E}^{\pi^*_\sigma}[U] - \delta$. We show that for $\delta > 0$ we can recover an approximation of the environment CBN, with error that grows linearly in $\delta$ for $\delta \ll \mathbb{E}^{\pi^*}[U]$.

Theorem 2. For almost all CIDs $M = (G, P)$ satisfying Assumptions 1 and 2, we can identify an approximate causal model $M' = (P', G')$ given $\{\pi_\sigma(d \mid \mathrm{pa}_D)\}_{\sigma \in \Sigma}$, where $\mathbb{E}^{\pi_\sigma}[U] \geq \mathbb{E}^{\pi^*_\sigma}[U] - \delta$ and $\Sigma$ is the set of mixtures of local interventions. The parameters of $M'$ satisfy $|P'(v_i \mid \mathrm{pa}_i) - P(v_i \mid \mathrm{pa}_i)| \leq \gamma(\delta)$ for all $V_i \in \mathbf{V}$, where $\gamma(0) = 0$ and $\gamma(\delta)$ grows linearly in $\delta$ for small regret $\delta \ll \mathbb{E}^{\pi^*}[U]$. Proof in Appendix D.

The worst-case error bounds $\gamma(\delta)$ for the parameter errors are detailed in Appendix D. For $\delta > 0$ it may not be possible to identify $G$ perfectly, as some weak causal relations cannot be resolved due to these error bounds. We describe in Appendix D how we can learn a sub-graph $G' \subseteq G$ that may exclude directed edges corresponding to weak causal relations. Theorem 2 shows that we can learn a (sparse) approximate causal model of the data generating process from regret-bounded policies under domain shifts, with the approximation becoming exact as $\delta \to 0$. In Appendix F we demonstrate learning the underlying CBN from regret-bounded policies using simulated data for randomly generated CIDs similar to Figure 1, and explore how the accuracy of the approximate CBN scales with the regret bound (Figure 2).

Figure 2: Model-average error rates for (a) the learned DAG $G'$ and (b) the learned joint distribution $P'(x, y)$, vs. the (normalised) regret bound $\delta / |\mathbb{E}[u \mid D = 1] - \mathbb{E}[u \mid D = 0]|$. Average error taken over 1000 randomly generated environments with binary decision $D$ and two binary latent variables $X, Y$, compared to the error rate for a random guess (green). See Appendix F for details.

Finally, we prove sufficiency, i.e. that having an (approximate) causal model of the data generating process is sufficient to identify regret-bounded policies. The result is well known for the non-approximate case (Bareinboim & Pearl, 2016).

Theorem 3. Given a CBN $M = (P, G)$ that is causally sufficient, we can identify optimal policies $\pi^*_\sigma(d \mid \mathrm{pa}_D)$ for any given $U$ where $\mathrm{Pa}_U \subseteq \mathbf{C}$ and for all soft interventions $\sigma$. Given an approximate causal model $M' = (P', G')$ for which $|P'(v_i \mid \mathrm{pa}_i) - P(v_i \mid \mathrm{pa}_i)| \leq \epsilon \ll 1$, we can identify regret-bounded policies where the regret $\delta$ grows linearly in $\epsilon$. Proof in Appendix E.

Together, Theorems 2 and 3 imply that learning an approximate causal model of the data generating process is necessary and sufficient for learning regret-bounded policies under local interventions.
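The sufficiency direction is constructive: given the (possibly shifted) model, an optimal policy is read off by maximising posterior expected utility, $\pi^*(\mathrm{pa}_D) = \arg\max_d \mathbb{E}[U \mid d, \mathrm{pa}_D; \sigma]$. A minimal enumeration sketch (ours; assumes a small, fully enumerable discrete environment, and the helper names are illustrative):

```python
def optimal_policy(P_c, utility, obs_of, decisions):
    """P_c: dict env-state -> prob under the shifted model P(.; sigma);
    utility: (env-state, d) -> real; obs_of: env-state -> observation pa_D."""
    groups = {}
    for c, p in P_c.items():
        groups.setdefault(obs_of(c), []).append((c, p))
    # For each observation, pick the decision with highest expected utility.
    return {pa_d: max(decisions,
                      key=lambda d: sum(p * utility(c, d) for c, p in grp))
            for pa_d, grp in groups.items()}

# Figure 1 style task: state c = (y, x); the agent observes x and predicts y.
P_c = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
U = lambda c, d: 1.0 if d == c[0] else 0.0
print(optimal_policy(P_c, U, obs_of=lambda c: c[1], decisions=(0, 1)))
# {0: 0, 1: 1}: predict the label that is likelier given the feature
```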
3.2 Interpretation

In this section we interpret Theorems 1, 2 and 3 through three lenses: agents, transfer learning and causal inference. First, we derive our result that any agent capable of adapting to local domain shifts must have learned a causal model of the data generating process.

Agents are adaptive goal-directed systems, meaning they choose actions to achieve some desired outcome, and would change their behaviour if they knew the consequences of their actions had changed (Dennett, 1989), i.e. following a domain shift (Kenton et al., 2023). For example, a firm sets prices to maximise profit, and changes its pricing to adapt to changes in supply and demand. We define adaptation as the ability to competently pursue goals (minimise regret) in a distributionally shifted environment. Adaptation is achieved through a combination of generalisation—applying knowledge learned during training to the shifted environment—and re-training in the shifted environment. We are interested in the zero-shot setting (Kirk et al., 2023), which describes powerful agents capable of adapting without re-training, i.e. using only knowledge of the environment and of what change (shift) has occurred. Our aim is to determine precisely what knowledge of the environment is necessary to support this capability. Hence we focus on the simplified task of adapting to known distributional shifts, i.e. the agent conditions their policy on $\sigma$, $\pi_\sigma = \pi(d \mid \mathrm{pa}_D, \sigma)$. (Here $\sigma$ is equivalent to an environment index for (in-context) invariant risk minimization (Gupta et al., 2023; Arjovsky et al., 2019), or the context variable for zero-shot reinforcement learning (Kirk et al., 2023).) Note this is strictly easier than adapting to unknown distributional shifts, and so any knowledge necessary for adapting to known distributional shifts is also necessary for adapting to unknown shifts. (Any agent capable of returning a regret-bounded policy without input $\sigma$ can do so given $\sigma$ simply by discarding $\sigma$. Hence, we can trivially extend this agent's policy to include input $\sigma$ and the policy will still satisfy the regret bound, allowing Theorem 2 to be applied. Likewise, if the agent conditions their policy on an inferred $\sigma'$ (see Kirk et al. (2023) for a review of implementations), and this $\sigma'$-conditional policy satisfies a regret bound, we can apply Theorem 2.)

Corollary 1.
Let $\pi(d \mid \mathrm{pa}_D, \sigma)$ be the policy of an agent that satisfies a regret bound $\mathbb{E}^{\pi(\sigma)}[U] \geq \mathbb{E}^{\pi^*}[U] - \delta$ for all local domain shifts $\sigma$, in an environment described by the CID $M = (G, P)$ which satisfies Assumptions 1 and 2. By Theorems 2 and 3, $\pi$ is informationally equivalent to an approximation $M'$ of $M$, with $M' \to M$ smoothly as $\delta \to 0$.

Corollary 1 follows immediately from Theorem 2 by replacing $\pi_\sigma(d \mid \mathrm{pa}_D) = \pi(d \mid \mathrm{pa}_D, \sigma)$. If $\pi_\sigma$ satisfies a tight regret bound for all local shifts $\sigma$, we can reconstruct the underlying CBN from the agent's policy alone (following the procedure in Appendix C). Therefore, any agent capable of generalising under known local domain shifts must have learned the underlying CBN. Precisely, the agent has learned the policy $\pi(d \mid \mathrm{pa}_D, \sigma)$, which is informationally equivalent to a causal model of the environment, as any associative or causal query that can be identified using the true causal model of the environment can be identified using $\pi(d \mid \mathrm{pa}_D, \sigma)$.

Example: Doctors are agents expected to make low-regret decisions under a wide range of known distributional shifts, without re-training in the shifted environment. For example, consider the task of risk-stratifying patients based on their signs and medical history. The doctor may be transferred to a new ward where patients have received a treatment (known distributional shift) that has a stochastic effect on latent variables (mixed intervention), such as curing diseases and causing side effects. The doctor cannot re-train in this new domain, e.g. by taking random decisions and observing outcomes. To be capable of this adaptivity, Theorem 2 implies the doctor must know a good approximation of the causal relations between the relevant latent variables—how the treatment affects diseases, how these diseases and their symptoms are causally related, and so on. Likewise, any medical AI that hopes to replicate this capability must have learned a similarly accurate causal model, and the better the agent's performance, the more accurate its causal model must be.

Transfer learning. In transfer learning (Zhuang et al., 2020), models are trained on a set of source domains and evaluated on held-out target domains where (i) the data distribution differs from the source domains, and (ii) the data available for training is restricted compared to the source domains (Wang et al., 2022). For example, in unsupervised domain adaptation the learner is restricted to samples of the input features from the target domains, $\mathrm{pa}_D \sim P(\mathrm{Pa}_D; \sigma)$, whereas in domain generalisation typically no data from the target domain is available during training (Farahani et al., 2021). Let $D_S$ denote the training data from the source domains and $D_\sigma$ denote the training data available from a given target domain $\sigma$. Let there exist a transfer learning algorithm that returns a policy $\pi_\sigma$ satisfying a regret bound for a given target domain $\sigma$, provided this training data.
As $\pi_\sigma$ is a function of the training data, by Theorems 1 and 2 the existence of this algorithm implies that we can identify the underlying CBN from $D_S \cup \{D_\sigma\}_{\sigma \in \Sigma}$. To see that this imposes non-trivial constraints on the existence of the transfer learning algorithm, consider the following simple example.

Example: Consider the CID for the supervised learning task depicted in Figure 1. Let $D_S = \{(x^i, y^i) \sim P(X, Y)\}_{i=1}^n$, so for sufficiently large $n$ the agent can learn $P(X, Y)$ from $D_S$. However, $Y \to X$ must also be identifiable from the training data $D_S \cup \{D_\sigma\}_{\sigma \in \Sigma}$. In other words, the transfer learning problem contains a hidden causal discovery problem. If $D_\sigma = \emptyset$ then $Y \to X$ must be identifiable from $P(x, y)$ alone, which is impossible unless the causal data generating process obeys additional assumptions (see for example Hoyer et al., 2008). If unlabelled features from the target domain are included in the training data, $D_\sigma = \{x^i \sim P(X; \sigma)\}_{i=1}^{n_\sigma}$, then $Y \to X$ can in principle be identified, as $P(X; \sigma_Y) \neq P(X)$.
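The last step of the example can be checked numerically. A small synthetic sketch (our toy parameters; `sample_y_to_x` is an illustrative helper, not the paper's experiment): under the anticausal structure $Y \to X$, a label shift $\sigma_Y$ moves the feature marginal $P(X)$, so unlabelled target-domain features alone reveal the causal direction.

```python
import random
random.seed(0)

def sample_y_to_x(p_y1, n=50_000):
    """Y ~ Bernoulli(p_y1); X copies Y with prob 0.8, flips it with prob 0.2."""
    hits = 0
    for _ in range(n):
        y = 1 if random.random() < p_y1 else 0
        x = y if random.random() < 0.8 else 1 - y
        hits += x
    return hits / n  # empirical P(X = 1)

px_source = sample_y_to_x(p_y1=0.5)  # source domain
px_target = sample_y_to_x(p_y1=0.9)  # label shift sigma_Y in the target domain
print(px_source, px_target)          # ~0.50 vs ~0.74: P(X) moved, so Y -> X
# Had the structure been X -> Y, intervening on Y alone would leave P(X) fixed.
```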
Causal inference. Theorem 1 can also be interpreted purely in terms of causal inference. We can compare to the causal hierarchy theorem (CHT) (Bareinboim et al., 2022), which states that an oracle for L1 queries (observational) is almost always insufficient to evaluate all L2 queries (interventional). Our Theorem 1 can be stated in an analogous way: an oracle for optimal policies under mixtures of local interventions, $\Pi^*_\Sigma : \sigma \mapsto \pi^*(\sigma)$, can evaluate all L2 queries, which follows from the fact that the oracle identifies the underlying CBN, which in turn identifies all L2 queries. Note $\Pi^*_\Sigma$ is a strict subset of L2, and we describe a subset of L2 as being L2-complete if evaluating these queries is sufficient to evaluate all L2 queries. Hence Theorem 1 can be summarised as: $\Pi^*_\Sigma$ is L2-complete. It would be interesting in future work to determine what other strict subsets of L2 are L2-complete, as identifying these queries is sufficient to identify all interventional queries.

Why is this surprising? Firstly, we might expect the optimal policies to encode a relatively small number of causal relations, as they can be computed from $\mathbb{E}[u \mid d, \mathrm{pa}_D; \sigma]$, which describes the response of a single variable $U$ to intervention $\sigma$. However, Theorem 1 shows that the optimal policies encode all causal and associative relations in $\mathrm{Anc}_U$, including causal relations between latent variables, for example $P(\mathbf{Y}_{\mathbf{x}})$ for any $\mathbf{X}, \mathbf{Y} \subseteq \mathrm{Anc}_U$. Secondly, Theorems 2 and 3 combined imply that learning to generalise under domain shifts is equivalent to learning a causal model of the data generating process—problems that on the surface are conceptually distinct.

Figure 3: The left figure situates Theorem 1 in Pearl's causal hierarchy (Bareinboim et al., 2022). L2 contains all interventional queries, including the sets of optimal policy queries under domain shifts $\Pi^*_C$ and task shifts $\Pi^*_U$, as optimal policies can always be found from a finite number of interventional queries. Theorem 1 surprisingly shows that $\Pi^*_C$ contains L2, and therefore $\Pi^*_C = \mathrm{L2}$ (right). That is, learning optimal policies under all shifts for a single utility $U$ is sufficient to identify L2. This also implies that $\Pi^*_U \subseteq \Pi^*_C$, i.e. learning optimal policies for domain shifts is sufficient to identify optimal policies for task shifts.

4 Discussion

Here we discuss the consequences for several fields and open questions, as well as limitations.

Causal representation learning. Causal representation learning (CRL) aims to learn representations of data that capture unknown causal structure (Schölkopf et al., 2021), with the aim of exploiting causal invariances to achieve better generalisation across domains. Theorems 1 and 2 show that any method that enables generalisation across many domains necessarily involves learning an (approximate) causal model of the data generating process—i.e. a causal representation. Hence, our results provide theoretical justification for CRL by showing it is necessary for strong robustness guarantees.

Causal bounds on transfer learning. As described in Section 3.2, Theorems 1 and 2 imply fundamental causal constraints on certain transfer learning tasks. For example, in the supervised learning task depicted in Figure 1, identifying regret-bounded policies under covariate and label shifts requires learning the causal relations between features and labels. Causal discovery problems such as this are well understood in many settings (Vowels et al., 2022), and in general identifying this causal structure (e.g. that $Y \to X$ in Figure 4(a)) is impossible without interventional data and/or additional assumptions. This connection allows us to convert (im)possibility results for causal discovery into (im)possibility results for transfer learning. Future work could explore this for smaller sets of distributional shifts and derive more general causal bounds on transfer learning.

Good regulator theorem. The good regulator theorem is often interpreted as saying that any good controller of a system must have a model of that system (Conant & Ross Ashby, 1970). However, some imagination is needed to take this lesson from the actual theorem, which technically only states that there exists an optimal regulator that is a deterministic function of the state of the system (which could be trivial; Wentworth, 2021). Our theorem less ambiguously states that any robust agent must have learned an (approximate) causal model of the environment, as described in Section 3.2. It can therefore be interpreted as a more precise, causal good regulator theorem.

Emergent capabilities. Causal models enable a kind of general competency—an agent can use a causal model of its environment to optimise for any given objective function $U(\mathrm{Pa}_U \subseteq \mathbf{V})$ without additional data (Theorem 3). This could explain how general competence can arise from narrow training objectives (Brown et al., 2020; Silver et al., 2021).
By Theorems 1 and 2, agents trained to maximise reward across many environments are incentivised to learn a causal world model (as they cannot generalise without one), which can in turn be used to solve any other decision task in the same environment (Theorem 3). This incentive does not imply that training an agent with a simple reward signal is sufficient to learn causal world models; e.g. it will still be impossible for an agent to learn a causal model (and therefore to generalise) if the model is not identifiable from its training data. The question is then whether current methods and training schemes are sufficient for learning causal world models. Early results suggest that transformer models can learn world models capable of out-of-distribution prediction (Li et al., 2022; see Section 6 for discussion). While foundation models are capable of achieving state-of-the-art accuracy on causal reasoning benchmarks (Kıcıman et al., 2023), how they achieve this (and whether it constitutes bona fide causal reasoning) is debated (Zečević et al., 2023).

Causal discovery. Theorems 1 and 2 involve learning the causal structure of the environment by observing the agent's policy under interventions. It is perhaps surprising that the response of this single variable to interventions is sufficient to identify all associative and causal relations in $\mathrm{Anc}_U$. Typically, causal discovery algorithms involve measuring the response of many variables to interventions (Vowels et al., 2022). Also, many causal discovery algorithms assume independent causal mechanisms (Schölkopf et al., 2021), which is equivalent to assuming no agents are present in the data generating process (Kenton et al., 2023). However, our results suggest that agents could be powerful resources for causal discovery. In Appendix B we use the proof of Theorem 2 to derive a causal discovery algorithm for learning causal structure over latents, and test it on synthetic data.

No competence without understanding. Causal models are fundamental to how humans understand and explain the world (Gopnik et al., 2007; Pearl & Mackenzie, 2018). Increasingly, deep learning models are used to predict and control complex systems we do not yet fully understand, such as inertially confined plasmas (Degrave et al., 2022) and biomolecular systems (Abramson et al., 2024). These models arguably offer a shortcut to competence without understanding—solving problems without needing to develop richer models of the underlying systems, or to understand how and why the solutions work. For example, AlphaFold accurately predicts protein folding but was found not to improve understanding of the underlying biochemical processes (Outeiral et al., 2022). It has been argued that a reliance on black-box models could greatly widen the gap between capabilities and understanding (Pasquale, 2015). However, our result points to a potential solution: if a controller is sufficiently capable and robust, we can always extract an interpretable causal model of the system it is controlling. There is no robust control without 'understanding', in the sense of learning a causal model of the underlying system. Future work could explore extending our results to develop efficient algorithms for eliciting causal models from robust agents.

Applicability of causal methods.
Causal models have been used to formally define concepts such as intent (Halpern & Kleiman-Weiner, 2018; Ward et al., 2024), harm (Richens et al., 2022), deception (Ward et al., 2023b), manipulation (Ward et al., 2023a) and incentives (Everitt et al., 2021), and are required for approaches to explainability (Wachter et al., 2017) and fairness (Kusner et al., 2017). Methods for designing safe and ethical AI systems that build on these definitions require causal models of the data generating process, which are typically hard to learn, leading some to doubt their practicality (Fawkes et al., 2022; Rahmattalabi & Xiang, 2022). However, our results show that sufficiently capable agents must have learned a causal world model capable of supporting these methods, and demonstrate that these world models can be elicited from the agent.

Limitations. Theorems 1 and 2 require agents to be robust to a large set of domain shifts (local interventions on all environment variables). Theorem 2 shows that loosening regret bounds results in some causal relations being unidentifiable from the agent's policy. Hence, we expect it is still possible to learn some causal knowledge of the environment from agents that are robust to a smaller set of domain shifts, albeit knowledge less complete than the full underlying CBN. Finally, our results only apply to unmediated decision tasks (Assumption 1). We expect Theorems 1 and 2 can be extended to active decision tasks, as Assumption 1 does not play a major role beyond simplifying the proofs.

5 Related work

Several recent empirical works have explored whether deep learning models learn 'surface statistics' (e.g. correlations between inputs and outputs) or learn internal representations of the world (McGrath et al., 2022; Abdou et al., 2021; Li et al., 2022; Gurnee & Tegmark, 2023). Our results offer some theoretical clarity to this discussion, tying an agent's performance to the fidelity of its world model, and showing that going beyond 'surface statistics' to learning causal relations is fundamentally necessary for robustness. One study in particular (Li et al., 2022) found that a GPT model trained to predict legal next moves in the board game Othello learned a linear representation of the board state (Nanda, 2023). Further, this internal representation of the board state could be changed by intervening on the intermediate activations, with the model updating its predictions consistently with the intervention, including interventions that take the board state outside of the training distribution. This indicates that the network is learning and utilising a representation of the data generating process that can support out-of-distribution generalisation under interventions—much like a causal model.

The problem of evaluating policies under distributional shifts has been studied extensively in causal transportability (CT) theory (Bareinboim & Pearl, 2016; Bellot & Bareinboim, 2022). CT aims to provide necessary and sufficient conditions for policy evaluation under known domain shifts when all assumptions on the data generating process (i.e. inductive biases) can be expressed as constraints on causal structure (Bareinboim & Pearl, 2016). However, deep learning algorithms can exploit a much larger set of inductive biases (Neyshabur et al., 2014; Battaglia et al., 2018; Rahaman et al., 2019; Goyal & Bengio, 2022), which in many real-world tasks may be sufficient to identify low-regret policies without requiring causal knowledge.
Thus, CT does not imply that agents must learn causal models in order to generalise unless we assume agents only use causal assumptions to begin with, which would be proof by assumption. See Appendix G for further discussion. A result similar to Theorems 1 and 2 is the causal hierarchy theorem (CHT) (Bareinboim et al., 2022; Ibeling & Icard, 2021), which shows that observational data is almost always insufficient for identifying all causal relations between environment variables, whereas our results state that the set of optimal policies is almost always sufficient to identify all causal relations. In Section 3.2 we discuss the similarities between these theorems, and in Appendix G we discuss their differences.

6 Conclusion

Causal reasoning is foundational to human intelligence, and has been conjectured to be necessary for achieving human-level AI (Pearl, 2019). In recent years, this conjecture has been challenged by the development of artificial agents capable of generalising to new tasks and domains without explicitly learning or reasoning on causal models. And while the necessity of causal models for solving causal inference tasks has been established (Bareinboim et al., 2022), their role in decision tasks such as classification and reinforcement learning is less clear. We have resolved this conjecture in a model-independent way, showing that any agent capable of robustly solving a decision task must have learned a causal model of the data generating process, regardless of how the agent is trained or the details of its architecture. This hints at an even deeper connection between causality and general intelligence, as this causal model can be used to find policies that optimise any given objective function over the environment variables. By establishing a formal connection between causality and generalisation, our results show that causal world models are a necessary ingredient for robust and general AI.

Acknowledgements. We would like to thank Alexis Bellot, Damiano Fornasiere, Pietro Greiner, James Fox, Matt MacDermott, David Reber, David Watson and Philip Bachman for their helpful discussions and comments on the manuscript.

References

Abdou et al. (2021) Mostafa Abdou, Artur Kulmizev, Daniel Hershcovich, Stella Frank, Ellie Pavlick, and Anders Søgaard. Can language models encode perceptual structure without grounding? A case study in color. arXiv preprint arXiv:2109.06129, 2021.

Abramson et al. (2024) Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, pp. 1–3, 2024.

Arjovsky et al. (2019) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bareinboim & Pearl (2012a) Elias Bareinboim and Judea Pearl. Controlling selection bias in causal inference. In Artificial Intelligence and Statistics, pp. 100–108. PMLR, 2012a.

Bareinboim & Pearl (2012b) Elias Bareinboim and Judea Pearl. Transportability of causal effects: Completeness results. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pp. 698–704, 2012b.

Bareinboim & Pearl (2016) Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.

Bareinboim et al. (2022) Elias Bareinboim, Juan D Correa, Duligur Ibeling, and Thomas Icard.
On Pearl's hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp. 507–556. 2022.

Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Bellot & Bareinboim (2022) Alexis Bellot and Elias Bareinboim. Partial transportability for domain generalization. 2022.

Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Castro et al. (2020) Daniel C Castro, Ian Walker, and Ben Glocker. Causality matters in medical imaging. Nature Communications, 11(1):3673, 2020.

Cohen & Welling (2016) Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999. PMLR, 2016.

Conant & Ross Ashby (1970) Roger C Conant and W Ross Ashby. Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2):89–97, 1970.

Correa & Bareinboim (2020) Juan Correa and Elias Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10093–10100, 2020.

Dawid (2002) A P Dawid. Influence diagrams for causal modelling and inference. International Statistical Review / Revue Internationale de Statistique, 70:161–189, 2002.

Degrave et al. (2022) Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.

Dennett (1989) Daniel C Dennett. The Intentional Stance. MIT Press, 1989.

Engstrom et al. (2019) Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pp. 1802–1811. PMLR, 2019.

Everitt et al. (2021) Tom Everitt, Ryan Carey, Eric D Langlois, Pedro A Ortega, and Shane Legg. Agent incentives: A causal perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11487–11495, 2021.

Farahani et al. (2021) Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R Arabnia. A brief review of domain adaptation. Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020, pp. 877–894, 2021.

Fawkes et al. (2022) Jake Fawkes, Robin Evans, and Dino Sejdinovic. Selection, ignorability and challenges with causal fairness. In Conference on Causal Learning and Reasoning, pp. 275–289. PMLR, 2022.

Gopnik et al. (2007) Alison Gopnik, Laura Schulz, and Laura Elizabeth Schulz. Causal Learning: Psychology, Philosophy, and Computation. Oxford University Press, 2007.

Goyal & Bengio (2022) Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022.

Gupta et al. (2023) Sharut Gupta, Stefanie Jegelka, David Lopez-Paz, and Kartik Ahuja. Context is environment, 2023.
Gurnee & Tegmark (2023) Wes Gurnee and Max Tegmark. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023. Halpern & Kleiman-Weiner (2018) Joseph Halpern and Max Kleiman-Weiner. Towards formal definitions of blameworthiness, intention, and moral responsibility. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Hendrycks & Dietterich (2019) Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. Howard & Matheson (2005) Ronald A Howard and James E Matheson. Influence diagrams. Decision Analysis, 2(3):127–143, 2005. Hoyer et al. (2008) Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in neural information processing systems, 21, 2008. Ibeling & Icard (2021) Duligur Ibeling and Thomas Icard. A topological perspective on causal inference. Advances in Neural Information Processing Systems, 34:5608–5619, 2021. Kenton et al. (2023) Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, and Tom Everitt. Discovering agents. Artificial Intelligence, p. 103963, 2023. Kıcıman et al. (2023) Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023. Kirk et al. (2023) Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A survey of zero-shot generalisation in deep reinforcement learning. Journal of Artificial Intelligence Research, 76:201–264, 2023. Kusner et al. (2017) Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in neural information processing systems, 30, 2017. Legg & Hutter (2007) Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and machines, 17:391–444, 2007. Li et al. (2022) Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022. McGrath et al. (2022) Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022. Meek (2013) Christopher Meek. Strong completeness and faithfulness in bayesian networks. arXiv preprint arXiv:1302.4973, 2013. Meinshausen (2018) Nicolai Meinshausen. Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), p. 6–10. IEEE, 2018. Mitrovic et al. (2018) Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh. Causal inference via kernel deviance measures. Advances in neural information processing systems, 31, 2018. Mooij et al. (2016) Joris M Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016. Nanda (2023) Neel Nanda. Actually, othello-gpt has a linear emergent world model, Mar 2023. URL <https://neelnanda.io/mechanistic-interpretability/othello>. Neyshabur et al. (2014) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. 
In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014. Okamoto (1973) Masashi Okamoto. Distinctness of the eigenvalues of a quadratic form in a multivariate sample. The Annals of Statistics, p. 763–765, 1973. Outeiral et al. (2022) Carlos Outeiral, Daniel A Nissley, and Charlotte M Deane. Current structure predictors are not learning the physics of protein folding. Bioinformatics, 38(7):1881–1887, 2022. Pasquale (2015) Frank Pasquale. The black box society: The secret algorithms that control money and information. Harvard University Press, 2015. Pearl (2009) Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2 edition edition, 2009. ISBN 9780521895606. Pearl (2018) Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv preprint arXiv:1801.04016, 2018. Pearl (2019) Judea Pearl. The seven tools of causal inference, with reflections on machine learning. Communications of the ACM, 62(3):54–60, 2019. Pearl & Bareinboim (2011) Judea Pearl and Elias Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, p. 247–254, 2011. Pearl & Mackenzie (2018) Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018. Peng et al. (2018) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), p. 3803–3810. IEEE, 2018. Rahaman et al. (2019) Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, p. 5301–5310. PMLR, 2019. Rahmattalabi & Xiang (2022) Aida Rahmattalabi and Alice Xiang. Promises and challenges of causality for ethical machine learning. arXiv preprint arXiv:2201.10683, 2022. Reed et al. (2022) Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022. Reichenbach (1956) Hans Reichenbach. The direction of time, volume 65. Univ of California Press, 1956. Richens et al. (2022) Jonathan Richens, Rory Beard, and Daniel H Thompson. Counterfactual harm. Advances in Neural Information Processing Systems, 35:36350–36365, 2022. Rubin (2005) Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005. Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012. Schölkopf et al. (2021) Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021. Shah et al. (2022) Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. Goal misgeneralization: Why correct specifications aren’t enough for correct goals. arXiv preprint arXiv:2210.01790, 2022. Shen et al. 
(2018) Zheyan Shen, Peng Cui, Kun Kuang, Bo Li, and Peixuan Chen. Causally regularized learning with agnostic data selection bias. In Proceedings of the 26th ACM international conference on Multimedia, p. 411–419, 2018. Silver et al. (2021) David Silver, Satinder Singh, Doina Precup, and Richard S Sutton. Reward is enough. Artificial Intelligence, 299:103535, 2021. Sloman & Lagnado (2015) Steven A Sloman and David Lagnado. Causality in thought. Annual review of psychology, 66:223–247, 2015. Spirtes et al. (2000) Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000. Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. Vowels et al. (2022) Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. D’ya like dags? a survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1–36, 2022. Wachter et al. (2017) Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017. Wang et al. (2022) Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022. Ward et al. (2023a) Francis Rhys Ward, Tom Everitt, Francesco Belardinelli, and Francesca Toni. Honesty is the best policy: defining and mitigating AI deception. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. Ward et al. (2023b) Francis Rhys Ward, Francesca Toni, and Francesco Belardinelli. Defining deception in structural causal games. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, p. 2902–2904, 2023b. Ward et al. (2024) Francis Rhys Ward, Matt MacDermott, Francesco Belardinelli, Francesca Toni, and Tom Everitt. The reasons that agents act: Intention and instrumental goals. arXiv preprint arXiv:2402.07221, 2024. Wentworth (2021) John Wentworth. Fixing the good regulator theorem. https://w.alignmentforum.org/posts/Dx9LoqsEh3gHNJMDk/fixing-the-good-regulator-theorem, 2021. Accessed: 2023-10-17. Wilson & Cook (2020) Garrett Wilson and Diane J Cook. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST), 11(5):1–46, 2020. Xian et al. (2018) Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence, 41(9):2251–2265, 2018. Zečević et al. (2023) Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. Causal parrots: Large language models may talk causality but are not causal. arXiv preprint arXiv:2308.13067, 2023. Zhuang et al. (2020) Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020. 
Appendix A Preliminaries

A.1 Setup and assumptions

The environment is described by a set of random variables $\mathbf{C} = \{C_1, C_2, \ldots, C_N\}$, which in combination with the decision variable $D$ and utility variable $U$ define the state space for the CID, $\mathbf{V} = \mathbf{C} \cup \{D, U\}$. In our notation, individual variables $C_i \in \mathbf{C}$ are given indexes, whereas sets of variables are indexless and bold, and we use $\mathbf{V} = \mathbf{v}$ as shorthand for the joint state of the variables in a set. The joint probability distribution $P(\mathbf{C} = \mathbf{c}, D = d, U = u)$ describes the statistical relations between environment variables. Bayesian networks factorise joint probability distributions according to a graph $G$ (Pearl, 2009). See Definition 1.

The distributions and statistical relationships between variables may change as a result of external interventions applied to a system. Hard interventions set a subset $\mathbf{C}' \subseteq \mathbf{C}$ of the variables to particular values $\mathbf{c}'$, denoted $\mathrm{do}(\mathbf{C}' = \mathbf{c}')$ or $\mathrm{do}(\mathbf{c}')$. Naively, one joint probability distribution $P_{\mathrm{do}(\mathbf{c}')}$ would be needed to describe the updated relationships under each possible intervention $\mathrm{do}(\mathbf{c}')$. Fortunately, all interventional distributions can be derived from a single Bayesian network, if $G$ matches the causal structure of the environment (i.e. has an edge $V_i \to V_j$ whenever an intervention on $V_i$ directly influences the value of $V_j$, and lacks unmodeled confounders; Spirtes et al., 2000; Pearl, 2009). When this holds, we call the Bayesian network causal and $G$ a causal graph.

With respect to the causal graph $G$ we denote the direct causes (parents) of $V_i$ as $\mathrm{Pa}_i$, the set of all causes (ancestors) as $\mathrm{Anc}_i$, the variables that $V_i$ directly causes (children) as $\mathrm{Ch}_i$, and the set of all downstream variables (descendants) as $\mathrm{Desc}_i$. Note in particular that $\mathrm{Anc}_i$ and $\mathrm{Desc}_i$ refer to proper ancestors and descendants, i.e. $V_i \notin \mathrm{Anc}_i$ and $V_i \notin \mathrm{Desc}_i$. We denote a causal Bayesian network (CBN) as $M = (P, G)$, where $P$ is the joint distribution and $G$ is the directed acyclic graph (DAG) describing the causal structure of the environment. Further, the interventional distribution $P_{\mathrm{do}(\mathbf{v}')}$ is given by the truncated factorisation

$$P_{\mathrm{do}(\mathbf{v}')}(\mathbf{v}) = \begin{cases} \prod_{i :\, v_i \notin \mathbf{v}'} P(v_i \mid \mathrm{pa}_{v_i}) & \text{if } \mathbf{v} \text{ is consistent with } \mathbf{v}' \\ 0 & \text{otherwise.} \end{cases}$$
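To make the truncated factorisation concrete, the following minimal Python sketch (our own illustration, not code from the paper; all names and numbers are invented) evaluates $P_{\mathrm{do}(B=1)}$ for a three-variable chain $A \to B \to C$: the factor of the intervened variable is dropped, and any state inconsistent with the intervention gets probability zero.

```python
# A minimal sketch (ours) of the truncated factorisation for a binary chain
# A -> B -> C. All CPD values are illustrative.
import itertools

p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]

def p_do(v, intervened):
    """Truncated factorisation: drop the factors of intervened variables and
    require consistency of v with the intervened values."""
    a, b, c = v
    for name, val in intervened.items():
        if {"A": a, "B": b, "C": c}[name] != val:
            return 0.0  # state inconsistent with do(...)
    prob = 1.0
    if "A" not in intervened:
        prob *= p_a[a]
    if "B" not in intervened:
        prob *= p_b_given_a[a][b]
    if "C" not in intervened:
        prob *= p_c_given_b[b][c]
    return prob

# P_{do(B=1)}(a, b, c): B's factor is removed and its value clamped to 1.
for v in itertools.product([0, 1], repeat=3):
    print(v, p_do(v, {"B": 1}))
```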
Equivalently, the effect of interventions can be computed by adding an extra node $\hat{V}_i$ and edge $\hat{V}_i \to V_i$ for each node $V_i \in \mathbf{V}$ (Correa & Bareinboim, 2020; Dawid, 2002). Intervening on $V_i$ then corresponds to conditioning on $\hat{V}_i$ in the extended graph. More generally, soft interventions $\sigma = P'(V_i \mid \mathrm{Pa}^*_i)$ replace the conditional probability distribution for $V_i$ with a new one, possibly using a new parent set $\mathrm{Pa}^*_i$, as long as no cycles are introduced in the graph (Correa & Bareinboim, 2020). The modified environment is denoted $M(\sigma)$.

General soft interventions cannot be defined without prior knowledge of the causal graph $G$. For example, the soft intervention $\sigma_Y = P'(y \mid x)$ is incompatible with the causal structure $Y \to X$ as it would introduce a causal cycle, and so an agent's policy may not be well defined with respect to this intervention. We therefore focus our theoretical analysis on a subset of the soft interventions, local interventions, that can be implemented without assuming knowledge of $G$. See Definition 2.

Example: Fixing the value of a variable (hard intervention) is a local intervention, as $\mathrm{do}(V_i = v'_i) = \mathrm{do}(V_i = f(v_i))$ where $f(v_i) = v'_i$.

Example: Translations are local interventions, as $\mathrm{do}(V_i = v_i + k) = \mathrm{do}(V_i = f(v_i))$ where $f(v_i) = v_i + k$. This includes changing the position of objects in RL environments (Shah et al., 2022).

Example: The logical NOT operation $X \to \neg X$ for Boolean $X$ is a local intervention.

We also consider mixtures of interventions, which can also be described without knowledge of $G$. See Definition 3.

Example: Adding Gaussian noise is a mixture over local interventions (translations), $\sigma_\epsilon = \mathrm{do}(X = X + \epsilon)$ where $\epsilon \sim \mathcal{N}(0, 1)$.

Common to most decision making tasks, such as prediction, classification, and reinforcement learning, is that a decision should be output based on some information so as to optimise some objective. The exact terms vary: decisions are sometimes called outputs, actions, predictions, or classifications; information is sometimes called features, context, or state; and objectives are sometimes called utility functions or loss functions. However, all of these setups can be described within the causal influence diagram (CID) framework (Howard & Matheson, 2005; Everitt et al., 2021). CIDs are causal Bayesian networks where the variables are divided into decision, utility, and chance variables, and no conditional probability distribution is specified for the decision variables. The task of the agent is to select the distribution $\pi = P(D = d \mid \mathrm{Pa}_D = \mathrm{pa}_D)$, also known as the policy or decision rule.
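As a minimal illustration of a CID and a policy (our own sketch, with invented numbers, not from the paper), the following code represents an unmediated decision task with one chance variable $X$, a decision $D$ with $\mathrm{Pa}_D = \{X\}$, and a utility $U(X, D)$, and evaluates the expected utility $\mathbb{E}_\pi[U] = \sum_x P(x) \sum_d \pi(d \mid x)\, U(x, d)$ of a given policy.

```python
# A minimal sketch (ours) of an unmediated CID and the expected utility
# of a policy pi(d | x). All numbers are illustrative.
p_x = {0: 0.4, 1: 0.6}                      # chance variable distribution
utility = {(0, 0): 1.0, (0, 1): 0.0,        # utility[(x, d)]
           (1, 0): 0.2, (1, 1): 0.9}

def expected_utility(policy):
    """policy[x][d] = pi(D = d | X = x)."""
    return sum(p_x[x] * policy[x][d] * utility[(x, d)]
               for x in p_x for d in (0, 1))

# A deterministic policy that matches D to X.
pi = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
print(expected_utility(pi))   # 0.4 * 1.0 + 0.6 * 0.9 = 0.94
```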
An optimal policy $\pi^*$ is defined as a policy that maximizes the expected utility $\mathbb{E}_{\pi^*}[U]$. See Definition 4.

By convention, decision nodes are drawn square, utility nodes diamond, and chance nodes round. The parents of $D$, $\mathrm{Pa}_D$, can be interpreted as the information the decision is allowed to depend on, and the corresponding edges are depicted as dashed lines. See Figure 4 for an example.

In the following we restrict our attention to a class of CIDs we refer to as 'unmediated decision tasks', where the agent's decision does not causally influence any chance variables that go on to influence the utility. This simplifies our theoretical analysis, although it is likely that our results extend to the general case. See Assumption 1.

Examples of unmediated decision tasks include all standard classification and regression tasks, and generative AI tasks where the output is not included in the training set. For example, in classification the choice of label typically does not influence the data generating process. Problems that are mediated rather than unmediated decision tasks include most control and reinforcement learning tasks, where the agent's decision is an action that influences the state of the environment.

Furthermore, we will focus on non-trivial unmediated decision tasks, i.e. where $U \in \mathrm{Ch}_D$, as the case $\mathrm{Ch}_D = \emptyset$ describes trivial decision tasks (the agent's action does not influence the utility). Figure 4 is an example of a non-trivial unmediated decision task.

In transfer learning we are typically interested in problems where generalising from the source to the target domain(s) is non-trivial. If the distribution of the environment is irrelevant to the optimal policy, then generalising under distributional shifts is trivial, and we cannot expect agents to have to learn anything about their environments in order to generalise. Therefore we restrict our attention to decision tasks where the distribution of the environment is relevant to the agent when determining its policy. Specifically, we say that a decision task is domain independent if there exists a single policy that is optimal for all choices of environment distribution $P(\mathbf{C} = \mathbf{c})$. See Assumption 2.

Lemma 1. Domain dependence implies that:

i) there exists no $d \in \mathrm{dom}(D)$ such that $d \in \arg\max_d U(d, \mathbf{c})$ for all $\mathbf{c} \in \mathrm{dom}(\mathbf{C})$;

ii) $\mathrm{Pa}_D \subsetneq \mathrm{Anc}_U$;

iii) $D \in \mathrm{Pa}_U$.

Proof. i) For any $P'$ we have

$$\mathbb{E}_{P'}[U \mid \mathrm{do}(D = d), \mathrm{pa}_D] = \sum_{\mathbf{c}} P'(\mathbf{C}_d = \mathbf{c} \mid \mathrm{pa}_D)\, U(d, \mathbf{c}) = \sum_{\mathbf{c}} P'(\mathbf{C} = \mathbf{c} \mid \mathrm{pa}_D)\, U(d, \mathbf{c})$$

where we have used $\mathrm{Desc}_D \cap \mathrm{Anc}_U = \emptyset$. Therefore if there exists $d^*$ s.t.
$d^* = \arg\max_d U(d, \mathbf{c})$ for all $\mathbf{C} = \mathbf{c}$, then

$$\mathbb{E}_{P'}[U \mid \mathrm{do}(D = d^*), \mathrm{pa}_D] \geq \sum_{\mathbf{c}} P'(\mathbf{C} = \mathbf{c} \mid \mathrm{pa}_D)\, U(d', \mathbf{c}) = \mathbb{E}_{P'}[U \mid \mathrm{do}(D = d'), \mathrm{pa}_D]$$

for all $d' \neq d^*$, and so $D = d^*$ is optimal for all $P'(\mathbf{C} = \mathbf{c})$ and we violate domain dependence.

ii) As $D \in \mathrm{Pa}_U$ (by part iii), we have $\mathrm{Pa}_U \subseteq \mathrm{Anc}_U$. If $\mathrm{Anc}_U = \mathrm{Pa}_D$ then $\mathbb{E}_P[u \mid d, \mathrm{pa}_D] = U(d, \mathrm{pa}_D)$, which is independent of $P(\mathbf{C} = \mathbf{c})$, and hence there is a single optimal policy for all $P$ and we violate domain dependence.

iii) If $D \notin \mathrm{Anc}_U$ then the CID is trivial, in the sense that $\mathbb{E}[U \mid \mathrm{do}(D = d)] = \mathbb{E}[U]$, and hence all decisions are optimal for all distributions $P(\mathbf{C})$, which violates domain dependence (Assumption 2). Therefore $D \in \mathrm{Anc}_U$, which with $\mathrm{Desc}_D \cap \mathrm{Anc}_U = \emptyset$ implies $D \in \mathrm{Pa}_U$. ∎

Figure 4: The CID for an unmediated decision task, where $D$ has no causal influence on the environment state $\mathbf{C}$, with chance variables $C_1, \ldots, C_6$. Our main theorem implies that an agent that is robust to distributional shifts on $\mathbf{C}$ must learn the CBN over $\mathrm{Anc}_U = \{C_1, C_2, C_3, C_4, C_6\}$, noting that $C_5 \notin \mathrm{Anc}_U$. $C_4$ is an example of a variable that is only an ancestor of $U$ via $D$ and so has no direct causal effect on the utility, but is still relevant to the decision task as it is a proxy for $C_1$, which is a cause of $U$. $C_6$ is a cause of $U$ but not of $D$, and naively one might assume that distributional shifts on $C_6$ cannot influence the agent's decision. However, the optimal policy can change under distributional shifts on $C_6$ as these affect the utility, and hence the agent will have to learn a CBN including $C_6$ if it is to be robust to shifts on $C_6$.

A.2 Parameterisation of CIDs

The joint distribution $P$ is defined for all environment variables $\mathbf{C}$, and the CID is defined by the parameters for $P(\mathbf{C})$ and $U(\mathrm{Pa}_U)$. We restrict our attention to $\mathbf{C}$ that are categorical, and without loss of generality we label states $c_i = 0, 1, \ldots, \dim_i - 1$, where $\dim_i$ is the dimension (number of states) of variable $C_i$.
Firstly, the joint $P(\mathbf{C})$ is parameterised by the conditional probability distributions (CPDs) in the Markov factorization with respect to $G$,

$$\theta_P = \{P(c_i \mid \mathrm{pa}_i) \;\forall\; c_i \in \{0, \ldots, \dim_i - 2\},\ \mathrm{pa}_i \in \mathrm{dom}(\mathrm{Pa}_i),\ C_i \in \mathbf{C}\}.$$

Note that the CPDs $P(C_i = \dim_i - 1 \mid \mathrm{pa}_i)$ are not included in $\theta_P$, as they are fully constrained by normalization, $P(C_i = \dim_i - 1 \mid \mathrm{pa}_i) = 1 - \sum_{j=0}^{\dim_i - 2} P(C_i = j \mid \mathrm{pa}_i)$.

Secondly, the utility function is simply parameterised by its value given the state of its parents, $\theta_U = \{U(\mathrm{pa}_U) \;\forall\; \mathrm{Pa}_U = \mathrm{pa}_U\}$. For simplicity we work with the normalized utility function,

$$U(\mathrm{pa}_U) \to \frac{U(\mathrm{pa}_U) - \min_{\mathrm{pa}'_U} U(\mathrm{Pa}_U = \mathrm{pa}'_U)}{\max_{\mathrm{pa}'_U} U(\mathrm{Pa}_U = \mathrm{pa}'_U) - \min_{\mathrm{pa}'_U} U(\mathrm{Pa}_U = \mathrm{pa}'_U)} \tag{2}$$

with values between 0 and 1. Note that as this is a positive affine transformation of the utility function, the set of optimal policies is invariant, and we can re-scale regret bounds accordingly.

Let $\theta_M$ denote the set of all parameters for the CID, $\theta_M = \theta_P \cup \theta_U$, and note that the elements of $\theta_M$ lie in the $[0, 1]$ interval and are logically independent, i.e. we can independently choose any $[0, 1]$ value for each parameter and this defines a valid parameterization of the CID for the baseline environment. In the following, when we refer to 'the parameters $P, U$' we are referring to $\theta_M$.

We follow the method outlined in Meek (2013) to prove that certain constraints on $P, U$ hold 'for almost all $P, U$', and hence for almost all decision tasks. This involves converting a given constraint into polynomial equations over $\theta_M$ and applying the following Lemma.

Lemma 2 (Okamoto, 1973). The solutions to a (nontrivial) polynomial equation are Lebesgue measure zero over the space of the parameters of the polynomial.

A polynomial in $n$ variables is non-trivial (not an identity) if not all instantiations of the $n$ variables are solutions of the polynomial. For example, the equation $\mathrm{poly}(\theta_M) = 0$ is trivial if and only if all coefficients of the polynomial expression $\mathrm{poly}(\theta_M)$ are zero.
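A small numerical sketch (ours, with an arbitrarily chosen polynomial) illustrates the content of Lemma 2: parameters drawn from a smooth distribution essentially never land on the zero set of a fixed non-trivial polynomial.

```python
# A minimal numerical illustration (ours) of Lemma 2: random parameter draws
# essentially never satisfy a fixed non-trivial polynomial equation.
import random

def poly(theta):
    # An arbitrary non-trivial polynomial in three parameters.
    a, b, c = theta
    return a * b - c**2 + 0.5 * a

random.seed(0)
zeros = sum(1 for _ in range(10**5)
            if poly([random.random() for _ in range(3)]) == 0.0)
print(zeros)  # 0 almost surely: the zero set has Lebesgue measure zero
```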
Therefore, any constraint on $P, U$ that can be converted into a polynomial equation over $\theta_M$ must either hold for all $\theta_M$ or for a Lebesgue measure zero subset of instantiations of $\theta_M$. Operationally, this means that if we have any smooth distribution over the parameter space (for example, describing the distribution of environments we expect to encounter), the probability of drawing an environment from this distribution for which the condition does not hold is 0.

A.3 Distributional shifts & policy oracles

In the derivation of our results we restrict our attention to distributional shifts that can be modelled as (soft) interventions on the data generating process. We note that by Reichenbach's principle (Reichenbach, 1956), which states that all statistical associations are due to underlying causal structures, we can assume the existence of a causal data generating process that can be described in terms of a CBN $M = (P, G)$. Therefore there is a causal factorization of the joint, $P(\mathbf{C} = \mathbf{c}) = \prod_i P(c_i \mid \mathrm{pa}_i)$. By allowing for mixtures of interventions, we can reach any distribution over $\mathbf{C}$. This can be seen trivially by noting that we can perform a soft intervention to achieve any deterministic distribution $P(\mathbf{C} = \mathbf{c}) = \delta(\mathbf{C} = \mathbf{c}')$, and then take a mixture over these deterministic distributions to achieve an arbitrary distribution over $\mathbf{C}$. The set of distributions that cannot be generated by interventions includes those that change the set of variables $\mathbf{V}$, including the decision and utility variables, and those introducing selection biases (which are causally represented with the introduction of additional nodes that are conditioned on; Bareinboim & Pearl, 2012a). For further discussion of the relation between distributional shifts and interventions see Schölkopf et al. (2021); Meinshausen (2018).

In the following proofs we use policy oracles to formalise knowledge of regret-bounded behaviour under distributional shifts.

Definition 5 (Policy oracle). A policy oracle for a set of interventions $\Sigma$ is a map $\Pi^\delta_\Sigma : \sigma \mapsto \pi_\sigma(d \mid \mathrm{pa}_D)$ for all $\sigma \in \Sigma$, where $\Sigma$ is a set of domains. It is $\delta$-optimal if $\pi_\sigma(d \mid \mathrm{pa}_D)$ achieves an expected utility $\mathbb{E}_{\pi_\sigma}[U] \geq \mathbb{E}_{\pi^*}[U] - \delta$ in the CID $M(\sigma)$, where $\delta \geq 0$.

Here $\delta$ is the regret upper bound, which is satisfied under all distributional shifts $\sigma \in \Sigma$. We refer to $\delta$-optimal policy oracles with $\delta = 0$ as optimal policy oracles. For the proof of our main result we restrict our attention to policy oracles with $\Sigma$ that includes mixtures over all local interventions (Definition 2). Note that the policy oracle specifies only what policy the agent returns in a distributionally shifted environment $M(\sigma)$. It does not specify how this policy is generated, which will depend on the specific setup.
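A minimal sketch (ours) of the $\delta$-optimality condition in Definition 5, specialised to deterministic policies over a finite decision set; the function and variable names are illustrative, not from the paper.

```python
# A minimal sketch (ours) of delta-optimality for a deterministic policy.

def is_delta_optimal(expected_utility, decisions, policy_decision, delta):
    """expected_utility(d) evaluates E[U | do(D=d)] in the shifted CID M(sigma).
    The policy is delta-optimal if it is within delta of the best decision."""
    best = max(expected_utility(d) for d in decisions)
    return expected_utility(policy_decision) >= best - delta

# Toy check: two arms with values 0.7 and 0.5; choosing arm 1 is 0.2-suboptimal.
values = {0: 0.7, 1: 0.5}
print(is_delta_optimal(lambda d: values[d], [0, 1], 1, delta=0.25))  # True
print(is_delta_optimal(lambda d: values[d], [0, 1], 1, delta=0.1))   # False
```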
For example, in domain generalisation the agent typically receives no additional data from the target domains, and is expected to produce a policy (decision boundary) that achieves low regret across all target domains. On the other hand, in domain adaptation and few-shot learning the agent is provided with some new data from each target domain with which to adjust its policy. As we hope to accommodate all of these perspectives, we specify only the agent's policy, not the data used to generate it. This is discussed further in Section 3.2.

What distributional shifts do we consider? In our proofs, we assume the agent is robust to any domain shifts that can be described as a mixture of local interventions on the environment variables $\mathbf{C}$. We do not consider interventions that change the utility $U$ or the agent's decision $D$, though we do include dropping inputs to the policy (masking), $\mathrm{Pa}_D \to \mathrm{Pa}'_D \subseteq \mathrm{Pa}_D$, as local interventions.

Appendix B Simplified proof

In this section we outline the proof of Theorem 1 for a simple binary decision task with binary latent variables. As mentioned in Section 4, the method used to identify the CBN in Theorem 1 can be viewed as an algorithm for learning the CBN over latent variables by observing the policy of a regret-bounded agent under various distributional shifts. To demonstrate this, in Appendix F we use an implementation of the algorithm on randomly generated CIDs, showing empirically that we can learn the underlying CBN in this way, and exploring how the agent's regret bound affects the accuracy of the learned CBN.

Consider the CID in Figure 5, describing a binary decision task $D \in \{0, 1\}$ with two binary latent variables $X, Y \in \mathrm{Pa}_U$.

Figure 5: Example CID describing a context-free multi-armed bandit with binary latent variables $X, Y$.

Consider an agent that selects a policy $\pi_D$ such that it maximises the expected utility. That is, the CID describes a context-free bandit problem, where $X, Y$ are latent variables that influence the arm values, $\mathbb{E}[u \mid d] = \sum_{x, y} P(x, y)\, U(x, y, d)$. Our aim is to learn this CID given only knowledge of the agent's policy under distributional shifts, and knowledge that it satisfies a regret bound. We assume knowledge of i) the set of chance variables $\mathbf{C} = \{X, Y\}$, ii) the utility function $U(d, x, y)$, and iii) the policy $\pi_D(\sigma)$ under distributional shifts $\sigma$ (the other variables $U, X, Y$ are unobserved). To learn the CID, the aim is therefore to learn the parameters of the joint distribution over latents $P(x, y)$ and the unknown causal structure. As we know the utility function we know $D, X, Y \in \mathrm{Pa}_U$, and by assuming the CID is unmediated (Assumption 1) we know $X, Y \notin \mathrm{Desc}_D$. Likewise, the decision task is context-free, hence $D \notin \mathrm{Desc}_X \cup \mathrm{Desc}_Y$. Hence the only unknown causal structure is the DAG over the latent variables $\mathbf{C} = \{X, Y\}$.
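The following sketch (ours, with an invented utility table, not from the paper) sets up such a bandit and an optimal policy oracle for it: given a (possibly shifted) distribution over the latents and the known utility, the oracle returns the arm maximising expected utility.

```python
# A minimal sketch (ours) of the bandit in Figure 5 and an optimal policy
# oracle for it. Latents X, Y are binary; the utility values are illustrative.
import itertools

def optimal_decision(p_xy, utility):
    """Return argmax_d E[u | d] = argmax_d sum_{x,y} P(x,y) U(d,x,y)."""
    def value(d):
        return sum(p_xy[(x, y)] * utility[(d, x, y)]
                   for x, y in itertools.product((0, 1), repeat=2))
    return max((0, 1), key=value)

# Example utility table U(d, x, y) with invented numbers.
U = {(d, x, y): [0.9, 0.1, 0.3, 0.8, 0.2, 0.7, 0.6, 0.4][4 * d + 2 * x + y]
     for d in (0, 1) for x in (0, 1) for y in (0, 1)}

# The oracle under a hard intervention do(X=0, Y=1): a point-mass distribution.
p_shifted = {(x, y): 1.0 if (x, y) == (0, 1) else 0.0
             for x, y in itertools.product((0, 1), repeat=2)}
print(optimal_decision(p_shifted, U))  # the best arm under this shift
```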
The expected utility difference between $D = 0$ and $D = 1$ following a hard intervention on $X$ is given by

$$\mathbb{E}[u \mid D = 0; \mathrm{do}(X = 0)] - \mathbb{E}[u \mid D = 1; \mathrm{do}(X = 0)] = \sum_y P(Y_{X=0} = y)\,[U(0, 0, y) - U(1, 0, y)] \tag{3}$$

$$= P(Y_{X=0} = 0)\,[U(0, 0, 0) - U(1, 0, 0)] + (1 - P(Y_{X=0} = 0))\,[U(0, 0, 1) - U(1, 0, 1)] \tag{4}$$

As we know $U(d, x, y)$, we can therefore identify $P(Y_{X=0} = 0)$ if we can identify this expected utility difference. We do this using the agent's policy under distributional shifts, and in this simple case we can restrict our attention to hard interventions. Following the steps outlined in Lemma 4, domain dependence ensures that we can identify a hard intervention $\sigma_2 = \mathrm{do}(X = x', Y = y')$ that results in a different optimal policy to the optimal policy under $\sigma_1 = \mathrm{do}(X = 0)$. For a mixture of these two interventions, $\sigma_3 = q\, \sigma_1 + (1 - q)\, \sigma_2$, the expected utility is $\mathbb{E}[u \mid d, \sigma_3] = q\, \mathbb{E}[u \mid d, \sigma_1] + (1 - q)\, \mathbb{E}[u \mid d, \sigma_2]$. This is a linear function of $q$, and for $q = 1$ the optimal decision ($d_1$) is different from that for $q = 0$ ($d_2 \neq d_1$). Therefore, there is a single indifference point $q_{\mathrm{crit}}$ for which both decisions are optimal. It is simple to show that this indifference point is given by

$$q_{\mathrm{crit}} = \left(1 - \frac{\mathbb{E}[u \mid D = d_1; \mathrm{do}(X = 0)] - \mathbb{E}[u \mid D = d_2; \mathrm{do}(X = 0)]}{U(d_1, x', y') - U(d_2, x', y')}\right)^{-1} \tag{5}$$

$D = d_2$ is optimal for $q \leq q_{\mathrm{crit}}$ and $D = d_1$ is optimal for $q \geq q_{\mathrm{crit}}$. We can estimate $q_{\mathrm{crit}}$ by randomly sampling values of $q$ uniformly over $[0, 1]$ and observing the optimal decision under the resulting mixed intervention (Algorithm 1). That is, $q_{\mathrm{crit}}$ is the probability that $D = d_2$ is returned by the policy oracle for a randomly sampled $q$.
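A minimal sketch (ours, with invented expected utilities) of this indifference-point construction: the closed form of equation 5 and the Monte Carlo estimate just described agree.

```python
# A minimal sketch (ours) of the indifference point q_crit, in closed form
# (equation 5) and by Monte Carlo. eu1[d], eu2[d] are the expected utilities
# E[u | D=d] under sigma_1 and sigma_2; all numbers are illustrative.
import random

def q_crit_closed_form(eu1, eu2, d1, d2):
    # Equation 5, using E[u | d; sigma_2] = U(d, x', y') for the hard shift.
    return 1.0 / (1.0 - (eu1[d1] - eu1[d2]) / (eu2[d1] - eu2[d2]))

def q_crit_monte_carlo(eu1, eu2, d1, d2, n=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        q = rng.random()
        # Optimal decision under the mixture q*sigma_1 + (1-q)*sigma_2.
        mix = {d: q * eu1[d] + (1 - q) * eu2[d] for d in (d1, d2)}
        if max(mix, key=mix.get) == d2:   # d2 is optimal for q <= q_crit
            hits += 1
    return hits / n

eu1 = {0: 0.8, 1: 0.5}   # d1 = 0 is optimal under sigma_1
eu2 = {0: 0.2, 1: 0.9}   # d2 = 1 is optimal under sigma_2
print(q_crit_closed_form(eu1, eu2, 0, 1))   # 0.7, the analytic flip point
print(q_crit_monte_carlo(eu1, eu2, 0, 1))   # ~0.7
```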
In this way we learn $q_{\mathrm{crit}}$, and as we know $U(d, x, y)$ we can identify the expected utility difference under $\mathrm{do}(X = 0)$ in the numerator of Equation 5, and so identify $P(Y_{X=0} = 0)$. Similarly we identify $P(Y_{X=1} = 0)$, $P(X_{Y=0} = 0)$ and $P(X_{Y=1} = 0)$, which encode both the causal relation between $X$ and $Y$ (e.g. there is a directed path from $X$ to $Y$ if and only if $P(Y_{X=0}) \neq P(Y_{X=1})$, for almost all CBNs), and determine the parameters of the CBN as $P(C_i = c_i \mid \mathrm{do}(\mathbf{C} \setminus C_i)) = P(C_i = c_i \mid \mathrm{Pa}_i = \mathrm{pa}_i)$.

Appendix C Proof of Theorem 1

In this appendix we prove Theorem 1. For an informal overview of the proof see Appendix B.

First, we show that for a given distributional shift $\sigma$, for almost all $P, U$ there is a single optimal decision. While this is not necessary for our proof, it simplifies our analysis. And as our main theorem holds for almost all $P, U$, we can include any finite number of independent conditions that each hold for almost all $P, U$ without weakening this qualifier, as the union of Lebesgue measure zero sets is Lebesgue measure zero.

Lemma 3. For any given local intervention $\sigma$ there is a single deterministic optimal policy for almost all $P, U$.

Proof. Following intervention $\sigma$, two decisions $d, d'$ are simultaneously optimal in context $\mathrm{pa}_D$ if

$$\mathbb{E}[u \mid \mathrm{pa}_D, \mathrm{do}(D = d); \sigma] = \mathbb{E}[u \mid \mathrm{pa}_D, \mathrm{do}(D = d'); \sigma] \tag{6}$$

Let $\mathbf{Z} = \mathrm{Anc}_U \setminus \mathrm{Pa}_D$ and $\mathbf{X} = \mathrm{Pa}_U \setminus \{D\}$.
Noting that

$$\mathbb{E}[u \mid \mathrm{pa}_D, \mathrm{do}(D = d); \sigma] = \sum_{\mathbf{z}} U(d, \mathbf{x})\, P(\mathbf{z}, \mathrm{pa}_D \mid \mathrm{do}(D = d); \sigma) \,/\, P(\mathrm{pa}_D \mid \mathrm{do}(D = d); \sigma) \tag{7}$$

and that $P(\mathrm{pa}_D \mid \mathrm{do}(D = d); \sigma) = P(\mathrm{pa}_D; \sigma)$ and $P(\mathbf{z}, \mathrm{pa}_D \mid \mathrm{do}(D = d); \sigma) = P(\mathbf{z}, \mathrm{pa}_D; \sigma)$, which follow from $\mathrm{Desc}_D \cap \mathrm{Anc}_U = \emptyset$, we can multiply both sides of equation 6 by $P(\mathrm{pa}_D; \sigma)$, giving

$$\sum_{\mathbf{z}} U(d, \mathbf{x})\, P(\mathbf{z}, \mathrm{pa}_D; \sigma) = \sum_{\mathbf{z}} U(d', \mathbf{x})\, P(\mathbf{z}, \mathrm{pa}_D; \sigma) \tag{8}$$

and

$$\sum_{\mathbf{z}} [U(d, \mathbf{x}) - U(d', \mathbf{x})]\, P(\mathbf{z}, \mathrm{pa}_D; \sigma) = 0 \tag{9}$$

Let $\sigma = \mathrm{do}(v_1 = f_1(v_1), \ldots, v_N = f_N(v_N))$. The joint $P(\mathbf{z}, \mathrm{pa}_D; \sigma) = \prod_i P(c_i \mid \mathrm{pa}_i; \sigma)$ is polynomial in the model parameters, and the local interventions $P(c_i \mid \mathrm{pa}_i; \sigma) = \sum_{c'_i : f_i(c'_i) = c_i} P(c'_i \mid \mathrm{pa}_i)$ keep it polynomial. Therefore equation 9 is a polynomial equation over the model parameters, and is certain to be non-trivial as $d \neq d'$. Therefore by Lemma 2, for almost all $P, U$ equation 9 is not satisfied, and as there are a finite number of decisions this implies that for almost all $P, U$ there is a single optimal decision for a given $\sigma, \mathrm{pa}_D$, and hence a single optimal policy. ∎

Next, we detail how a policy oracle can be used to identify a specific causal query in the shifted environment $M(\sigma)$, which we will later use to identify the model parameters.

Lemma 4. Using an optimal policy oracle $\Pi^*_\Sigma$, where $\Sigma$ includes all mixtures of local interventions on $\mathbf{C}$ including masking inputs $\mathrm{Pa}'_D \subseteq \mathrm{Pa}_D$, then for any given $\mathrm{Pa}'_D = \mathrm{pa}'_D$ such that $\mathrm{Pa}'_D \cap \mathrm{Pa}_U = \emptyset$ we can identify $\sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \sigma)\,[U(d, \mathbf{c}) - U(d', \mathbf{c})]$, for $d$ and $d'$ where $d \neq d'$ and $\mathbf{Z} = \mathbf{C} \setminus \mathrm{Pa}'_D$.

Proof. By Lemma 3, for almost all $P, U$ there is a single optimal decision following the shift $\sigma$.
Let $d_1 = \arg\max_d \mathbb{E}[u \mid \mathrm{do}(D = d), \mathrm{pa}'_D; \sigma]$, i.e. $d_1 = \pi^*(\sigma)$. We can identify $d_1$ by querying the policy oracle with $\sigma$. Consider a hard intervention on all $C_i \in \mathbf{C}$, $\sigma' := \mathrm{do}(c'_1, c'_2, \ldots, c'_N)$, where for all $C_i \in \mathrm{Pa}'_D$ we set $C_i = c_i$ to be the same state as in the observation $\mathrm{Pa}'_D = \mathrm{pa}'_D$. The expected utility under this intervention is $\mathbb{E}[u \mid \mathrm{do}(D = d), \mathrm{pa}'_D; \sigma'] = U(d, \mathbf{x}')$, where $\mathbf{X} = \mathrm{Pa}_U \setminus \{D\}$ (and we have that $D \in \mathrm{Pa}_U$ from Lemma 1 iii).

Next we show that there is a choice of hard intervention $\sigma'$ such that the policy oracle must return different optimal decisions in the context $\mathrm{Pa}'_D = \mathrm{pa}'_D$ for $\sigma'$ and $\sigma$. As $\mathrm{Pa}'_D \cap \mathrm{Pa}_U = \emptyset$, we are free to choose any $\mathbf{X} = \mathbf{x}'$ and the resulting $\sigma'$ will be compatible with the evidence $\mathrm{Pa}'_D = \mathrm{pa}'_D$. Note that by Lemma 1 i) there exists $\mathbf{X} = \mathbf{x}'$ s.t. $d_1 \neq \arg\max_d U(d, \mathbf{x}')$, else $D = d_1$ would be optimal for all $\mathbf{X} = \mathbf{x}$, which violates domain dependence. We can determine this $\mathbf{X} = \mathbf{x}'$ given the utility function and $d_1$. Let $d_2 = \arg\max_d U(d, \mathbf{x}')$ and let $\sigma' = \mathrm{do}(c'_1, c'_2, \ldots, c'_N)$ be the hard intervention for which $\mathbf{X} = \mathbf{x}'$ and $\mathrm{Pa}'_D = \mathrm{pa}'_D$.

Consider the joint distribution over $\mathbf{C}$ under the mixed local intervention $\tilde{\sigma}(q) = q\, \sigma + (1 - q)\, \sigma'$,

$$P(\mathbf{C} = \mathbf{c} \mid \mathrm{do}(D = d); \tilde{\sigma}(q)) = P(\mathbf{C} = \mathbf{c}; \tilde{\sigma}(q)) \tag{10}$$

$$= q\, P(\mathbf{C} = \mathbf{c}; \sigma) + (1 - q)\, P(\mathbf{C} = \mathbf{c}; \sigma') \tag{11}$$

where in the first line we have used $\mathrm{Ch}_D = \{U\}$ to drop the intervention. Note that $\mathbf{Z} = \mathbf{C} \setminus \mathrm{Pa}_D \neq \emptyset$ by Lemma 1 ii).
The expected utility is given by

$$\mathbb{E}[u \mid \mathrm{pa}_D, \mathrm{do}(D = d); \tilde{\sigma}(q)] = \sum_{\mathbf{z}} P(\mathbf{Z} = \mathbf{z} \mid \mathrm{pa}_D, \mathrm{do}(D = d); \tilde{\sigma}(q))\, U(d, \mathbf{x}) \tag{12}$$

$$= \sum_{\mathbf{z}} \frac{P(\mathbf{C} = \mathbf{c} \mid \mathrm{do}(D = d); \tilde{\sigma}(q))}{P(\mathrm{pa}_D \mid \mathrm{do}(D = d); \tilde{\sigma}(q))}\, U(d, \mathbf{x}) \tag{13}$$

$$= \frac{1}{P(\mathrm{pa}_D; \tilde{\sigma}(q))} \sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \tilde{\sigma}(q))\, U(d, \mathbf{x}) \tag{14}$$

$$= \frac{1}{P(\mathrm{pa}_D; \tilde{\sigma}(q))} \sum_{\mathbf{z}} \left[ q\, P(\mathbf{C} = \mathbf{c}; \sigma)\, U(d, \mathbf{x}) + (1 - q)\, P(\mathbf{C} = \mathbf{c}; \sigma')\, U(d, \mathbf{x}') \right] \tag{15}$$

Note that for $q = 1$ the optimal decision is $d_1$, and for $q = 0$ the optimal decision returned by the policy oracle belongs to the set $\{d \text{ s.t. } d = \arg\max_d U(d, \mathbf{x}')\}$, which does not contain $d_1$. Furthermore, the argmax of equation 15 with respect to $d$ is a piecewise linear function on the domain $q \in [0, 1]$. Therefore there must be some $q = q_{\mathrm{crit}}$ that is the smallest value of $q$ such that for $q < q_{\mathrm{crit}}$ the policy oracle returns an optimal decision in the set $\{d \text{ s.t. } d = \arg\max_d U(d, \mathbf{x}')\}$, and for $q \geq q_{\mathrm{crit}}$ the optimal decision is not in this set. The value of $q_{\mathrm{crit}}$ is given by the indifference condition $\mathbb{E}[u \mid \mathrm{pa}_D, \mathrm{do}(D = d_2); \tilde{\sigma}(q_{\mathrm{crit}})] - \mathbb{E}[u \mid \mathrm{pa}_D, \mathrm{do}(D = d_3); \tilde{\sigma}(q_{\mathrm{crit}})] = 0$, which by equation 15 is

$$q_{\mathrm{crit}} \sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \sigma)\,[U(d_2, \mathbf{x}) - U(d_3, \mathbf{x})] + (1 - q_{\mathrm{crit}})\,[U(d_2, \mathbf{x}') - U(d_3, \mathbf{x}')] = 0 \tag{16}$$

where $d_2 \in \{d \text{ s.t. } d = \arg\max_d U(d, \mathbf{x}')\}$ and $d_3 \notin \{d \text{ s.t. } d = \arg\max_d U(d, \mathbf{x}')\}$.
This yields the following expression for $q_{\mathrm{crit}}$,

$$q_{\mathrm{crit}} = \left(1 - \frac{\sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \sigma)\,[U(d_2, \mathbf{x}) - U(d_3, \mathbf{x})]}{U(d_2, \mathbf{x}') - U(d_3, \mathbf{x}')}\right)^{-1} \tag{17}$$

where we have used $\sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \sigma')\,[U(d_2, \mathbf{c}) - U(d_3, \mathbf{c})] = U(d_2, \mathbf{x}') - U(d_3, \mathbf{x}')$. We can determine $\sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \sigma)\,[U(d_2, \mathbf{x}) - U(d_3, \mathbf{x})]$ given $q_{\mathrm{crit}}$ and the utility function $U(d, \mathbf{x})$.

Finally, we describe Algorithm 1 (below), which uses a policy oracle for the Monte Carlo estimation of $q_{\mathrm{crit}}$. This can be used to determine $\sum_{\mathbf{z}} P(\mathbf{C} = \mathbf{c}; \sigma)\,[U(d_2, \mathbf{c}) - U(d_3, \mathbf{c})]$ in the asymptotic limit $N \to \infty$, as well as identifying $d_2, d_3$.

Algorithm 1: Identify $q_{\mathrm{crit}}, d_2, d_3$ using a policy oracle. Input: $(U, \Pi^*_\Sigma, N, \sigma)$.

1. $d_1 \leftarrow \Pi^*_\Sigma(\sigma)$
2. $\sigma', d_2 \leftarrow$ any hard intervention on $\mathbf{C}$ s.t. $d_2 = \arg\max_d U(d, \mathbf{x}') \neq d_1$
3. $D(q{=}0) \leftarrow \{d \text{ s.t. } d = \arg\max_d U(d, \mathbf{x}')\}$ (the decisions optimal at $q = 0$, since $\tilde{\sigma}(0) = \sigma'$)
4. $\theta \leftarrow 0$
5. for $i \leftarrow 1$ to $N$ do:
   - $q \sim \mathrm{Uniform}(0, 1)$
   - $\pi^*(d \mid \mathrm{pa}_D) \leftarrow \Pi^*_\Sigma(\tilde{\sigma}(q))$, where $\tilde{\sigma}(q) = q\, \sigma + (1 - q)\, \sigma'$
   - if $d \in D(q{=}0)$ for all $d$ with $\pi^*(d \mid \mathrm{pa}_D) > 0$, then $\theta \leftarrow \theta + 1$
6. $q_{\mathrm{crit}} \leftarrow \theta / N$
7. $D(q_{\mathrm{crit}}) \leftarrow \Pi^*_\Sigma(q_{\mathrm{crit}}\, \sigma + (1 - q_{\mathrm{crit}})\, \sigma')$
8. $d_3 \in D(q_{\mathrm{crit}})$, $d_3 \neq d_2$
9. return $q_{\mathrm{crit}}, d_2, d_3$

∎
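A minimal runnable sketch (ours) of Algorithm 1 for the two-arm case. The policy oracle is simulated here by exact maximisation over expected utilities that we supply directly; in the actual setting only the oracle's returned decisions would be observable. The small nudge past the estimated flip point when selecting $d_3$ is our own device to absorb Monte Carlo error, not part of the algorithm as stated.

```python
# A minimal sketch (ours) of Algorithm 1, with the oracle simulated exactly.
import random

def algorithm_1(eu_sigma, eu_sigma_prime, decisions, n=200_000, seed=1):
    """eu_sigma[d], eu_sigma_prime[d]: E[u | do(D=d)] under sigma and sigma'."""
    def oracle(q):   # simulated optimal policy oracle for the mixture
        return max(decisions,
                   key=lambda d: q * eu_sigma[d] + (1 - q) * eu_sigma_prime[d])
    d1 = oracle(1.0)                                   # optimal under sigma
    best = max(eu_sigma_prime.values())
    d_q0 = {d for d in decisions if eu_sigma_prime[d] == best}  # optimal at q=0
    assert d1 not in d_q0
    rng = random.Random(seed)
    theta = sum(oracle(rng.random()) in d_q0 for _ in range(n))
    q_crit = theta / n                                 # fraction below the flip
    d2 = next(iter(d_q0))
    d3 = oracle(min(q_crit + 0.01, 1.0))   # just past the estimated flip point
    return q_crit, d2, d3

# Two arms: d=0 optimal under sigma, d=1 optimal under the hard shift sigma'.
print(algorithm_1({0: 0.8, 1: 0.5}, {0: 0.2, 1: 0.9}, (0, 1)))  # (~0.7, 1, 0)
```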
We are now ready to derive Theorem 1. See Theorem 1.

Proof. We learn the graph $G$ and parameters $P(c_i \mid \mathrm{pa}_i)$ by learning 'leave-one-out' interventional distributions $P(c_i \mid \mathrm{do}(c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_N))$. Note that under this intervention $C_i$ depends only on its parent set, and hence $P(c_i \mid \mathrm{do}(c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_N)) = P(c_i \mid \mathrm{pa}_i)$, where $\mathrm{Pa}_i = \mathrm{pa}_i$ denotes the state of $\mathrm{Pa}_i$ under the leave-one-out intervention.

Almost all $P$ are causally faithful (Meek, 2013). Hence, for almost all $P$, these interventional distributions can be used to determine $\mathrm{Pa}_i$, as in the interventional distribution $C_i \not\perp C_j$ if and only if $C_j \in \mathrm{Pa}_i$. Explicitly, for almost all environments, $C_j \in \mathrm{Pa}_i$ if and only if there are two leave-one-out interventions that differ only on $C_j$, with $C_j = c_j$ and $C_j = c'_j$, such that $P(c_i \mid \mathrm{do}(c_1, \ldots, c_j, \ldots, c_{i-1}, c_{i+1}, \ldots, c_N)) \neq P(c_i \mid \mathrm{do}(c_1, \ldots, c'_j, \ldots, c_{i-1}, c_{i+1}, \ldots, c_N))$. For ease of notation we will use $P(c_i \mid \mathrm{pa}_i)$ interchangeably with $P(c_i \mid \mathrm{do}(c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_N))$.
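The following sketch (ours) illustrates the leave-one-out parent test just described. The function `interventional`, standing in for the identified interventional distributions, is an assumed ingredient with an invented toy ground truth; in the proof these quantities are themselves recovered via the policy oracle.

```python
# A minimal sketch (ours) of the leave-one-out parent test: C_j is a parent
# of C_i iff flipping C_j in some leave-one-out intervention changes the
# interventional distribution of C_i.
import itertools

def parents(i, variables, domains, interventional):
    """interventional(i, assignment) returns P(C_i | do(all others = assignment))."""
    pa = set()
    others = [v for v in variables if v != i]
    for assignment in itertools.product(*(domains[v] for v in others)):
        base = dict(zip(others, assignment))
        p0 = interventional(i, base)
        for j in others:
            for alt in domains[j]:
                if alt == base[j]:
                    continue
                flipped = dict(base, **{j: alt})
                if interventional(i, flipped) != p0:
                    pa.add(j)   # flipping C_j changed C_i's distribution
    return pa

# Toy ground truth: C2's distribution depends only on C0, so Pa_2 = {0}.
def toy_interventional(i, assignment):
    assert i == 2
    return (0.9, 0.1) if assignment[0] == 0 else (0.3, 0.7)

doms = {0: (0, 1), 1: (0, 1), 2: (0, 1)}
print(parents(2, [0, 1, 2], doms, toy_interventional))  # {0}
```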
First we learn the parameters for chance variables that have a directed path to $U$ that does not include $D$, i.e. that are ancestors of $U$ in the graph $G_{\hat{D}}$, where we intervene on $D$.

Case 1: learning parameters for $C_i \in \mathrm{Anc}_U(G_{\hat{D}})$. Consider a directed path $C_k \to \ldots \to C_1$, where $C_1 \in \mathrm{Pa}_U$ and all variables on the path are chance nodes (the path does not include $D$). Assume we know $\mathrm{Pa}_{k-1}, \ldots, \mathrm{Pa}_1$ and the parameters $P(C_i \mid \mathrm{pa}_i)$ for $i = k-1, \ldots, 1$. We show that given these parameters we can identify the unknown parameters $P(c_k \mid \mathrm{pa}_k)$ (and hence $\mathrm{Pa}_k$).

Define $\mathbf{Y} = \mathbf{C} \setminus \{C_k, \ldots, C_1\}$ and consider the local intervention $\sigma = \mathrm{do}(y_1, \ldots, y_{N-k}, c_k = f(c_k))$, where $\mathrm{do}(c_k = f(c_k))$ is a local intervention on $C_k$ such that

$$f(C_k) = \begin{cases} c'_k & \text{if } C_k = c'_k \\ c''_k & \text{otherwise} \end{cases} \tag{18}$$

I.e. $f(C_k)$ maps $C_k$ to a two-dimensional subspace, where the image of $C_k = c'_k$ is $C_k = c'_k$ and all other states are mapped to $C_k = c''_k$, with $c'_k \neq c''_k$ arbitrary states of $C_k$. In the following we mask all inputs to the policy, $\mathrm{Pa}'_D = \emptyset$. By Lemma 4 we can identify

$$\sum_{\mathbf{c}} P(\mathbf{C} = \mathbf{c}; \sigma)\,[U(d, \mathbf{c}) - U(d', \mathbf{c})] = \sum_{c_k} \cdots \sum_{c_1} P(c_k \mid \mathrm{pa}_k; \sigma) \cdots P(c_1 \mid \mathrm{pa}_1; \sigma)\,[U(d, \mathbf{c}) - U(d', \mathbf{c})] = \sum_{c_k} P(c_k \mid \mathrm{pa}_k; \sigma)\, \beta(c_k) \tag{19}$$

where

$$\beta(c_k) := \sum_{c_{k-1}} \cdots \sum_{c_1} P(c_{k-1} \mid \mathrm{pa}_{k-1}; \sigma) \cdots P(c_1 \mid \mathrm{pa}_1; \sigma)\,[U(d, \mathbf{c}) - U(d', \mathbf{c})] \tag{20}$$

and $\beta(c_1) := [U(d, \mathbf{c}) - U(d', \mathbf{c})]$. Note that in equation 19, $\beta(c_k)$ is determined by the known parameters $P(c_{k-1} \mid \mathrm{pa}_{k-1}), \ldots, P(c_1 \mid \mathrm{pa}_1)$ and $U(\mathrm{pa}_U)$, and the $\beta(c_k)$ are non-zero for almost all $P, U$: as $\beta(c_k) = 0$ is a polynomial equation in these parameters, by Lemma 2 it is not satisfied for almost all $P, U$.
Using the definition of the local intervention in equation 18 we have $P(C_k = c_k' \mid pa_k; \sigma) = P(C_k = c_k' \mid pa_k)$ and $P(C_k = c_k'' \mid pa_k; \sigma) = 1 - P(C_k = c_k' \mid pa_k; \sigma) = 1 - P(C_k = c_k' \mid pa_k)$. Therefore the right hand side of equation 19 has a single undetermined parameter, $P(C_k = c_k' \mid pa_k)$, while the left hand side can be determined using the policy oracle (Lemma 4), so we can solve for $P(C_k = c_k' \mid pa_k)$. By repeating this procedure with different interventions, varying the hard intervention $do(\boldsymbol{Y} = \boldsymbol{y})$ and the choices of $c_k', c_k''$, we can identify $P(c_k \mid pa_k)$ for all $c_k, pa_k$, and hence $Pa_k$.

We now learn the parameters for all $C_i \in Anc_U(G_{\bar{D}})$. We know the set $Pa_U$, as this is the domain of the utility function $U(Pa_U)$, which is known by assumption. We then proceed iteratively, first learning the parameters $P(c_1 \mid pa_1)$ and $Pa_1$ for some $C_1 \in Pa_U$. We can do this as $\beta(c_1) = U(d, \boldsymbol{x}) - U(d', \boldsymbol{x})$, with $d, d', \boldsymbol{x}$ returned by Algorithm 1 in Lemma 4, and $U(Pa_U)$ is known. We can then determine the parameters for all $C_j \in Pa_1$, and so on until we have traversed $Anc_1$. We repeat this for all $C_i \in Pa_U$ until we have covered all of $Anc_U(G_{\bar{D}})$.
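Each step of this procedure bottoms out in a one-unknown linear equation: for binary $C_k$, equation 19 reads $Q = p\,\beta(c_k') + (1-p)\,\beta(c_k'')$ with $p = P(C_k = c_k' \mid pa_k)$. A toy numeric illustration (our own; all numbers are made up and the function name is hypothetical):

```python
def solve_parameter(Q, beta_prime, beta_doubleprime):
    """Solve Q = p*beta' + (1-p)*beta'' for the single unknown p.

    Q comes from the policy oracle via Lemma 4; beta' and beta'' are
    computed from previously-learned parameters and the utility."""
    return (Q - beta_doubleprime) / (beta_prime - beta_doubleprime)

# Made-up numbers: true p = 0.7, beta' = -0.4, beta'' = 0.25.
beta_p, beta_pp, p_true = -0.4, 0.25, 0.7
Q = p_true * beta_p + (1 - p_true) * beta_pp   # what the oracle would reveal
print(solve_parameter(Q, beta_p, beta_pp))      # 0.7 (up to float error)
```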
Case 2: learning parameters for $C_i \in Anc_D$, $C_i \notin Anc_U(G_{\bar{D}})$. Consider $C_k \in Anc_U$ for which all directed paths to $U$ pass through $D$, i.e. $C_k \to C_{k-1} \to \ldots \to C_1$ where $C_1 \in \widetilde{Pa}_D$. As before, assume we know $Pa_{k-1}, \ldots, Pa_1$ and the parameters $P(C_i \mid pa_i)$ for $i = k-1, \ldots, 1$. We now show that given these parameters we can identify the unknown parameters $P(c_k \mid pa_k)$ (and hence $Pa_k$).

Define $\boldsymbol{Y} = \boldsymbol{C} \setminus \{C_k, \ldots, C_1\}$ and let $\sigma = do(y_1, \ldots, y_{N-k}, c_k = f(c_k))$, where $do(c_k = f(c_k))$ is the local intervention defined in equation 18. We now mask all evidence except $C_1$, i.e. $Pa_D' = \{C_1\}$. Note that as $C_1 \notin Pa_U$ we can apply Lemma 4, giving (for $k \geq 2$)

$$\sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})] = \sum_{c_k} \cdots \sum_{c_2} P(c_k \mid pa_k; \sigma) \cdots P(c_1 \mid pa_1)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})] \qquad (21)$$
$$= \sum_{c_k} P(c_k \mid pa_k)\, \alpha(c_k) \qquad (22)$$

where $\boldsymbol{Z} = \boldsymbol{C} \setminus \{C_1\}$ and

$$\alpha(c_k) := \sum_{c_{k-1}} \cdots \sum_{c_2} P(c_{k-1} \mid pa_{k-1}) \cdots P(c_1 \mid pa_1)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})] \qquad (23)$$

and for $k = 1$ we have

$$\sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})] = P(C_1 = c_1 \mid pa_1; \sigma)\, \alpha(1) \qquad (24)$$

where $\alpha(1) := [U(d, \boldsymbol{x}) - U(d', \boldsymbol{x})]$. We can determine $\alpha(c_k)$ as we know the parameters for $C_{k-1}, \ldots, C_1$ by assumption, and $\alpha(c_k) \neq 0$ for almost all $P, U$: by Lemma 2 the equation $\alpha(c_k) = 0$ is polynomial in the model parameters, so it is not satisfied for almost all $P, U$. Using the definition of the local intervention in equation 18 we have $P(C_k = c_k' \mid pa_k; \sigma) = P(C_k = c_k' \mid pa_k)$ and $P(C_k = c_k'' \mid pa_k; \sigma) = 1 - P(C_k = c_k' \mid pa_k; \sigma) = 1 - P(C_k = c_k' \mid pa_k)$.
Therefore the right hand side of equation 22 has a single undetermined parameter, $P(C_k = c_k' \mid pa_k)$, while the left hand side can be determined using the policy oracle (by Lemma 4, noting that $Pa_D' = \{C_1\}$ and $\{C_1\} \cap Pa_U = \emptyset$), so we can solve for $P(C_k = c_k' \mid pa_k)$. By repeating this procedure with different interventions, varying the hard intervention $do(\boldsymbol{Y} = \boldsymbol{y})$ and the choices of $c_k', c_k''$, we can identify $P(c_k \mid pa_k)$ for all $c_k, pa_k$, and hence $Pa_k$.

We now learn the parameters for all $C_i \in Anc_D \setminus Anc_U(G_{\bar{D}})$. We know $Pa_D$ from the domain of the policy returned by the policy oracle. If the parameters for all variables in $Pa_D$ have been learned in the previous step, we are finished. Otherwise, there are variables in $Anc_U$ for which all directed paths to $U$ pass through $D$. Let this set of variables be $\widetilde{Pa}_D \subseteq Pa_D$. For any $C_1 \in \widetilde{Pa}_D$ we can determine $\alpha(c_1) = U(d, \boldsymbol{x}) - U(d', \boldsymbol{x})$, with $d, d', \boldsymbol{x}$ returned by Algorithm 1 in Lemma 4, noting that $C_1 \notin Pa_U$. We can then determine the parameters for all $C_j \in Pa_1$, and so on until we have traversed $Anc_1$, and repeat until we have learned the parameters for all $C_i \in Anc_D \setminus Anc_U(G_{\bar{D}})$. ∎

Appendix D Proof of Theorem 2

In this section we derive a version of Lemma 4 using a δ-optimal policy oracle with $\delta > 0$. The reason we consider this case is that Theorem 1 assumes optimality, which is a strong assumption that will not be satisfied by realistic systems. It is therefore important to determine whether our main results are contingent on this assumption. For example, it could be that a causal model can only be identified from the agent's policy when $\delta = 0$, and that for $\delta > 0$ no causal model can be learned. Instead, what we find is that realistic agents with $\delta > 0$ must learn approximate causal models, with the fidelity of these approximations increasing in a reasonable way as $\delta \to 0$.

Low-regret analysis. What is a reasonable way for the approximation errors to change with δ? Clearly, if an agent has an arbitrarily large regret bound we cannot expect to learn anything about the environment from its policy. For example, a completely random policy can satisfy a large enough regret bound, and an agent does not need to learn anything about the environment to arrive at such a policy. Therefore we must still constrain the regret to be small in our analysis, and the standard way to do this is by an order analysis.
We define 'small regret' as $\delta \ll \mathbb{E}^{\pi^*}[U]$. As we work with the normalised utility function (see Section A.1), we have $\mathbb{E}^{\pi^*}[U] \leq 1$, and so we can define the small-regret regime as $\delta \ll 1$. What we find is that, for agents that incur only a small regret, the order of the error in our estimate of the model parameters grows linearly with the order of the regret bound. Therefore we get a linear trade-off between regret and accuracy for small δ.

First we show that Algorithm 1 allows us to estimate the value of $Q = \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})]$ with an approximate value $\tilde{Q}$, and to estimate bounds $\tilde{Q}^\pm$ such that the true value of $Q$ is guaranteed to satisfy $\tilde{Q}^- \leq Q \leq \tilde{Q}^+$.

Lemma 5. Let $\Pi_\Sigma^\delta$ be a δ-optimal policy oracle, where $\Sigma$ includes all mixtures of local interventions, including masking inputs $Pa_D' \subseteq Pa_D$. Then for any given $Pa_D' = pa_D'$ such that $Pa_D' \cap Pa_U = \emptyset$, we can determine $d, d', \boldsymbol{x}'$ with $d \neq d'$, a point estimate $\tilde{Q}$ for $Q(pa_D, d, d') := \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d, \boldsymbol{x}) - U(d', \boldsymbol{x})] < 0$, and bounds $Q \in [\tilde{Q}^-, \tilde{Q}^+]$, where $\boldsymbol{Z} = \boldsymbol{C} \setminus Pa_D$, $\boldsymbol{X} = Pa_U \setminus \{D\}$, and

$$\frac{1}{1-\xi}(Q - \delta) \leq \tilde{Q} \leq \frac{1}{1+\xi}(Q + \delta) \qquad (25)$$

where

$$\xi := \delta / (U(d, \boldsymbol{x}') - U(d', \boldsymbol{x}')) > 0 \qquad (26)$$

and in the worst case these bounds scale with δ as

$$\tilde{Q}^+ \leq \left(\frac{1-\xi}{1+\xi}\right) Q + \frac{2\delta}{1+\xi} \qquad (27)$$
$$\tilde{Q}^- \geq \left(\frac{1+\xi}{1-\xi}\right) Q - \frac{2\delta}{1-\xi} \qquad (28)$$
Proof. By Lemma 3, for almost all $P, U$ there is a single optimal decision following the shift σ. Let $d_1$ be the decision returned by the policy oracle in the context $Pa_D' = pa_D'$, which must satisfy the bound $\mathbb{E}[u \mid d_1, pa_D; \sigma] \geq \max_d \mathbb{E}[u \mid d, pa_D; \sigma] - \delta$. Consider a hard intervention on all $C_i \in \boldsymbol{C}$, $\sigma' := do(c_1', c_2', \ldots, c_N')$, where for all $C_i \in Pa_D'$ we set $C_i = c_i$ to the same state as in the observation $Pa_D' = pa_D'$. The expected utility under this intervention is $\mathbb{E}[u \mid do(D = d), pa_D'; \sigma'] = U(d, \boldsymbol{x}')$, where $\boldsymbol{X} = Pa_U \setminus \{D\}$ (and $D \in Pa_U$ by Lemma 1 i)).

Next we show that there is a choice of hard intervention $\sigma'$ such that the policy oracle must return different optimal decisions in the context $Pa_D' = pa_D'$ under $\sigma'$ and under σ. As $Pa_D' \cap Pa_U = \emptyset$, we are free to choose any $\boldsymbol{X} = \boldsymbol{x}'$, and the resulting $\sigma'$ will be compatible with the evidence $Pa_D' = pa_D'$. Note that by Lemma 1 i) there exists $\boldsymbol{X} = \boldsymbol{x}'$ such that $d_1 \neq \arg\max_d U(d, \boldsymbol{x}')$, as otherwise $D = d_1$ would be optimal for all $\boldsymbol{X} = \boldsymbol{x}$, which violates domain dependence. We can determine this $\boldsymbol{x}'$ given the utility function and $d_1$. Let $d_2 = \arg\max_d U(d, \boldsymbol{x}')$ and let $\sigma' = do(c_1', c_2', \ldots, c_N')$ be the hard intervention for which $\boldsymbol{X} = \boldsymbol{x}'$ and $Pa_D' = pa_D'$. Note that we do not use the policy oracle to determine $d_2$, which can be determined from $U(Pa_U)$ alone; hence there is no uncertainty about whether $d_2$ is in fact optimal under $\sigma'$ for any regret bound, nor that $d_1$ is not optimal under $\sigma'$.
Consider the joint distribution over $\boldsymbol{C}$ under the mixed local intervention $\tilde{\sigma}(q) = q\sigma + (1-q)\sigma'$:

$$P(\boldsymbol{C} = \boldsymbol{c} \mid do(D = d); \tilde{\sigma}(q)) = P(\boldsymbol{C} = \boldsymbol{c}; \tilde{\sigma}(q)) \qquad (29)$$
$$= q\, P(\boldsymbol{C} = \boldsymbol{c}; \sigma) + (1-q)\, P(\boldsymbol{C} = \boldsymbol{c}; \sigma') \qquad (30)$$

where in the first line we have used $Ch_D = \{U\}$ to drop the intervention. Note that $\boldsymbol{Z} = \boldsymbol{C} \setminus Pa_D \neq \emptyset$ by Lemma 1 i). The expected utility is given by

$$\mathbb{E}[u \mid pa_D, do(D = d); \tilde{\sigma}(q)] = \sum_{\boldsymbol{z}} P(\boldsymbol{Z} = \boldsymbol{z} \mid pa_D, do(D = d); \tilde{\sigma}(q))\, U(d, \boldsymbol{x}) \qquad (31)$$
$$= \sum_{\boldsymbol{z}} \frac{P(\boldsymbol{C} = \boldsymbol{c} \mid do(D = d); \tilde{\sigma}(q))}{P(pa_D \mid do(D = d); \tilde{\sigma}(q))}\, U(d, \boldsymbol{x}) \qquad (32)$$
$$= \frac{1}{P(pa_D; \tilde{\sigma}(q))} \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \tilde{\sigma}(q))\, U(d, \boldsymbol{x}) \qquad (33)$$
$$= \frac{1}{P(pa_D; \tilde{\sigma}(q))} \sum_{\boldsymbol{z}} \left[ q\, P(\boldsymbol{C} = \boldsymbol{c}; \sigma)\, U(d, \boldsymbol{x}) + (1-q)\, P(\boldsymbol{C} = \boldsymbol{c}; \sigma')\, U(d, \boldsymbol{x}') \right] \qquad (34)$$

Note that for $q = 1$ the optimal decision is $d_1$, and for $q = 0$ the optimal decision returned by the policy oracle belongs to the set $\{d \text{ s.t. } d = \arg\max_d U(d, \boldsymbol{x}')\}$, which does not contain $d_1$. Furthermore, the argmax of equation 34 with respect to $d$ is a piecewise linear function on the domain $q \in [0, 1]$. Therefore there must be some $q = q_{crit}$, the smallest value of $q$ such that for $q < q_{crit}$ the policy oracle returns an optimal decision in the set $\{d \text{ s.t. } d = \arg\max_d U(d, \boldsymbol{x}')\}$, and for $q \geq q_{crit}$ the optimal decision is not in this set.
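To illustrate the switching behaviour that defines $q_{crit}$, the following minimal sketch (our own; all utilities are made up) scans $q$ and reports the point at which the expected-utility ordering of two decisions flips. The two linear terms stand in for the σ and σ' components of equation 34:

```python
import numpy as np

# Per-decision expected utilities under sigma (q = 1) and sigma' (q = 0).
# Made-up numbers chosen so the optimal decision flips as q varies:
# d2 is optimal under sigma' (q = 0), d3 is optimal under sigma (q = 1).
EU_sigma  = {"d2": 0.30, "d3": 0.55}   # E[u | do(D=d); sigma] terms
EU_sigmap = {"d2": 0.80, "d3": 0.40}   # U(d, x') terms under sigma'

def expected_utility(d, q):
    # The mixture is linear in q (equation 34, ignoring the common
    # normalisation P(pa_D; sigma~(q)), which does not affect the argmax).
    return q * EU_sigma[d] + (1 - q) * EU_sigmap[d]

# Scan q to locate q_crit, the smallest q at which d2 stops being optimal.
qs = np.linspace(0, 1, 100001)
flips = qs[expected_utility("d2", qs) <= expected_utility("d3", qs)]
print(flips[0])  # ~0.615 = Delta_0 / (Delta_0 - Delta_1), matching eq. 36
```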
The value of $q_{crit}$ is given by setting the expected-utility difference to zero, $\mathbb{E}[u \mid pa_D, do(D = d_2); \tilde{\sigma}(q_{crit})] - \mathbb{E}[u \mid pa_D, do(D = d_3); \tilde{\sigma}(q_{crit})] = 0$, which by equation 34 is

$$q_{crit} \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})] + (1 - q_{crit})[U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')] = 0 \qquad (35)$$

where $d_2 \in \{d \text{ s.t. } d = \arg\max_d U(d, \boldsymbol{x}')\}$ and $d_3 \notin \{d \text{ s.t. } d = \arg\max_d U(d, \boldsymbol{x}')\}$. This yields the following expression for $q_{crit}$:

$$q_{crit} = \left(1 - \frac{\sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})]}{U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')}\right)^{-1} \qquad (36)$$

where we have used $\sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma')[U(d_2, \boldsymbol{c}) - U(d_3, \boldsymbol{c})] = U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')$.

While Algorithm 1 identifies the smallest value of $q$ at which the optimal policy changes, as we no longer have an optimal policy oracle, the probability $\tilde{q}$ returned by Algorithm 1 is no longer necessarily equal to $q_{crit}$. Instead, there are minimal and maximal values of $\tilde{q}$ that Algorithm 1 can return, which are determined by the regret bound (see Figure 6). Our first aim is to bound $q_{crit}$ using the $\tilde{q}$ returned by the policy oracle.
The maximal (minimal) values that $\tilde{q}$ can take while satisfying the regret bound are $q^\pm$, which are the solutions to the equations

$$\delta \geq \mathbb{E}[u \mid pa_D, do(D = d_2); \tilde{\sigma}(q^+)] - \mathbb{E}[u \mid pa_D, do(D = d_3); \tilde{\sigma}(q^+)] \qquad (37)$$
$$-\delta \leq \mathbb{E}[u \mid pa_D, do(D = d_2); \tilde{\sigma}(q^-)] - \mathbb{E}[u \mid pa_D, do(D = d_3); \tilde{\sigma}(q^-)] \qquad (38)$$

Using equation 34 these simplify to

$$\delta\, P(pa_D; \tilde{\sigma}(q^+)) \geq q^+ \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})] + (1 - q^+)[U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')] \qquad (39)$$
$$-\delta\, P(pa_D; \tilde{\sigma}(q^-)) \leq q^- \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})] + (1 - q^-)[U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')] \qquad (40)$$

We can relax and simplify these bounds by taking the maximum possible value of the unknown quantity $P(pa_D; \tilde{\sigma}(\tilde{q})) \to 1$, giving

$$\delta \geq q^+ \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})] + (1 - q^+)[U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')] \qquad (41)$$
$$-\delta \leq q^- \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})] + (1 - q^-)[U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')] \qquad (42)$$

Let $\Delta_1 := \mathbb{E}[u \mid pa_D, do(D = d_2); \sigma] - \mathbb{E}[u \mid pa_D, do(D = d_3); \sigma]$ and $\Delta_0 := U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')$.
Note that $\Delta_0 > 0$, as $d_2$ is optimal under $\sigma'$, and $\Delta_1 < 0$: by linearity we have $\mathbb{E}[u \mid pa_D, do(D = d_3); \tilde{\sigma}(q)] > \mathbb{E}[u \mid pa_D, do(D = d_2); \tilde{\sigma}(q)]$ for $q > q_{crit}$, and $q_{crit} < 1$, therefore $\mathbb{E}[u \mid pa_D, do(D = d_3); \tilde{\sigma}(1)] > \mathbb{E}[u \mid pa_D, do(D = d_2); \tilde{\sigma}(1)]$. We now define $q^\pm$ with respect to the relaxed bounds in equations 41 and 42; simplifying these inequalities using equation 36 gives

$$\tilde{q} \leq q^+ = \min\{1, q_{crit}(1 + \xi)\} \qquad (43)$$
$$\tilde{q} \geq q^- = \max\{0, q_{crit}(1 - \xi)\} \qquad (44)$$

where

$$\xi := \delta / \Delta_0 > 0 \qquad (45)$$

We can therefore bound $q_{crit}$ using equations 43 and 44, i.e.

$$q_{crit} \leq \tilde{q} / (1 - \xi) \qquad (46)$$
$$q_{crit} \geq \tilde{q} / (1 + \xi) \qquad (47)$$

We use $\tilde{q}$ in place of $q_{crit}$ in equation 36, giving an estimate $\tilde{Q}$ for $Q = \sum_{\boldsymbol{z}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d_2, \boldsymbol{x}) - U(d_3, \boldsymbol{x})]$:

$$\tilde{Q} = \Delta_0 (1 - 1/\tilde{q}) \qquad (48)$$

Finally, applying the bounds in equations 46 and 47 gives

$$\frac{1}{1-\xi}(Q - \delta) \leq \tilde{Q} \leq \frac{1}{1+\xi}(Q + \delta) \qquad (49)$$

Next, we determine upper and lower bounds $Q^\pm(\tilde{q})$ using $Q = \Delta_0(1 - 1/q_{crit})$ together with equations 46 and 47, giving

$$Q \leq \tilde{Q}^+(\tilde{q}) = \Delta_0\left(1 - \frac{1-\xi}{\tilde{q}}\right) \qquad (50)$$
$$Q \geq \tilde{Q}^-(\tilde{q}) = \Delta_0\left(1 - \frac{1+\xi}{\tilde{q}}\right) \qquad (51)$$

noting that, as equation 48 is monotonic in $\tilde{q}$, the true value of $Q$ is guaranteed to fall between these bounds.
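As a quick numeric sanity check (our own; the values of $\Delta_0$, $q_{crit}$ and δ are made up), the following verifies that for every $\tilde{q}$ a δ-optimal oracle can legally return, the true $Q$ lies inside the interval given by equations 50 and 51:

```python
import numpy as np

Delta0, q_crit, delta = 0.4, 0.6, 0.02    # made-up values, Delta0 > 0
xi = delta / Delta0                        # equation 45
Q = Delta0 * (1 - 1 / q_crit)              # true query value (equation 36)

# Any q~ the delta-optimal oracle may return (equations 43 and 44):
q_lo = max(0.0, q_crit * (1 - xi))
q_hi = min(1.0, q_crit * (1 + xi))
for q_tilde in np.linspace(q_lo, q_hi, 7):
    Q_plus  = Delta0 * (1 - (1 - xi) / q_tilde)   # equation 50
    Q_minus = Delta0 * (1 - (1 + xi) / q_tilde)   # equation 51
    # Bounds always bracket Q (tolerance only for float rounding):
    assert Q_minus - 1e-9 <= Q <= Q_plus + 1e-9
print("Q =", Q, "is bracketed for every admissible q~")
```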
Finally, we derive expressions for the worst-case bounds in terms of the true value of $Q$. These are given by evaluating $\tilde{Q}^\pm$ at the extremal values of $\tilde{q}$ from equations 46 and 47:

$$Q^+ = \max_{\tilde{q}} \tilde{Q}^+(\tilde{q}) = \Delta_0\left(1 - \frac{1-\xi}{1+\xi}\frac{1}{q_{crit}}\right) \qquad (52)$$
$$= \left(\frac{1-\xi}{1+\xi}\right) Q + \frac{2\delta}{1+\xi} \qquad (53)$$
$$Q^- = \min_{\tilde{q}} \tilde{Q}^-(\tilde{q}) = \Delta_0\left(1 - \frac{1+\xi}{1-\xi}\frac{1}{q_{crit}}\right) \qquad (54)$$
$$= \left(\frac{1+\xi}{1-\xi}\right) Q - \frac{2\delta}{1-\xi} \qquad (55)$$

Figure 6: Overview of Lemma 5. $\Delta_0 = U(d_2, \boldsymbol{x}') - U(d_3, \boldsymbol{x}')$ and $\Delta_1 = \mathbb{E}[u \mid pa_D, do(D = d_2); \sigma] - \mathbb{E}[u \mid pa_D, do(D = d_3); \sigma]$. Using an optimal policy oracle we can identify $q_{crit}$ precisely, as detailed in Lemma 4. For $\delta > 0$, instead of returning $q_{crit}$, Algorithm 1 returns $\tilde{q}$: as the agent can incur regret, the value of $q$ at which the policy changes is no longer constrained to be $q_{crit}$. We use $\tilde{q}$ in place of $q_{crit}$ to calculate an approximate value of the target query, in the same way as in Lemma 4. The maximum and minimum values that $\tilde{q}$ can take are $q^\pm$, which result in maximal regret δ: $\tilde{q} \geq q^- = q_{crit}(1 - \delta/\Delta_0)$ and $\tilde{q} \leq q^+ = q_{crit}(1 + \delta/\Delta_0)$. We can therefore bound the amount by which $\tilde{Q}$ deviates from the value of the target query $Q$. ∎
Lemma 6. For $\delta \ll \mathbb{E}^{\pi^*}[U]$, the estimates $\tilde{Q}$ and $\tilde{Q}^\pm$ (as defined in Lemma 5) satisfy the bounds

$$|\tilde{Q} - Q| \leq \delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (56)$$

and

$$\tilde{Q}^+ - Q \leq 2\delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (57)$$
$$Q - \tilde{Q}^- \geq -2\delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (58)$$

Proof. As we work with the normalised utility function (see Section A.1), we have $\mathbb{E}^{\pi^*}[U] \leq 1$, and so we can define the small-regret regime as $\delta \ll 1$. Taylor expanding the bounds on $\tilde{Q}, \tilde{Q}^\pm$ about $\delta = 0$ gives

$$Q - \delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \leq \tilde{Q} \leq Q + \delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (59)$$

and therefore

$$|\tilde{Q} - Q| \leq \delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (60)$$

and

$$\tilde{Q}^+ - Q \leq 2\delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (61)$$
$$Q - \tilde{Q}^- \geq -2\delta\left(1 - \frac{Q}{\Delta_0}\right) + O(\delta^2) \qquad (62)$$

Therefore, for small δ the worst-case error in our estimate $\tilde{Q}$ grows linearly in δ, and our upper and lower bounds on $\tilde{Q}$ also grow linearly. ∎
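The Taylor expansions behind equations 59–62 can be checked mechanically. A small sympy sketch (our own) expands the Lemma 5 bounds $(Q \pm \delta)/(1 \pm \xi)$, with $\xi = \delta/\Delta_0$, to first order in δ:

```python
import sympy as sp

Q, d, D0 = sp.symbols("Q delta Delta_0")
xi = d / D0                                # equation 45
upper = (Q + d) / (1 + xi)                 # upper bound from equation 25
lower = (Q - d) / (1 - xi)                 # lower bound from equation 25

# First-order expansion about delta = 0 recovers equation 59:
print(sp.series(upper, d, 0, 2).removeO())  # = Q + delta*(1 - Q/Delta_0)
print(sp.series(lower, d, 0, 2).removeO())  # = Q - delta*(1 - Q/Delta_0)
```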
Proof of Theorem 2. We use the δ-optimal policy oracle to estimate the model parameters, following the same steps as in the proof of Theorem 1 in Appendix C. However, as the policy oracle is no longer optimal, the parameter estimates will have errors. Here we show that for the parameters of $P$ these errors grow linearly in δ for $\delta \ll 1$, and that we learn a sparse sub-graph of $G$.

Estimating the parameters of $P$. In the proof of Theorem 1 we estimate the parameters $P(c_i \mid pa_i)$ in two cases.

Case 1.

$$Q_k = \sum_{\boldsymbol{c}} P(\boldsymbol{C} = \boldsymbol{c}; \sigma)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})] = \sum_{c_k} P(c_k \mid pa_k; \sigma)\, \beta(c_k) \qquad (63)$$

where

$$\beta(c_k) := \sum_{c_{k-1}} \cdots \sum_{c_1} P(c_{k-1} \mid pa_{k-1}; \sigma) \cdots P(c_1 \mid pa_1; \sigma)[U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})] \qquad (64)$$

which we rearrange using $P(c_k'' \mid pa_k; \sigma) = 1 - P(c_k' \mid pa_k; \sigma)$ to give

$$P(c_k' \mid pa_k; \sigma) = \frac{Q_k - \beta(c_k'')}{\beta(c_k') - \beta(c_k'')} \qquad (65)$$

Assume we have approximate values $\hat{P}(c_{k-1}' \mid pa_{k-1}; \sigma), \ldots, \hat{P}(c_1' \mid pa_1; \sigma)$, where $\hat{P}(c_k' \mid pa_k; \sigma) = P(c_k' \mid pa_k; \sigma) + O(\delta)$, i.e. the errors in our estimates of these parameters grow linearly in δ for $\delta \ll 1$. As $\beta(c_k)$ is a sum of products of these parameter estimates, our estimate of $\beta(c_k)$ also has linear error for $\delta \ll 1$, i.e.
$\hat{\beta}(c_k) = \beta(c_k) + O(\delta)$, and likewise

$$\hat{P}(c_k' \mid pa_k; \sigma) = \frac{Q_k - \beta(c_k'') + O(\delta)}{\beta(c_k') - \beta(c_k'') + O(\delta)} = P(c_k' \mid pa_k; \sigma)\,(1 + O(\delta)) \qquad (66)$$

Then for $k = 1$ we know $\beta(c_1) = U(d, \boldsymbol{c}) - U(d', \boldsymbol{c})$ precisely, and so

$$\hat{P}(c_1' \mid pa_1; \sigma) = \frac{Q_1 - \beta(c_1'') + O(\delta)}{\beta(c_1') - \beta(c_1'')} = P(c_1' \mid pa_1; \sigma)\,(1 + O(\delta)) \qquad (67)$$

which satisfies our assumption of $O(\delta)$ error for $k = 1$, $\delta \ll 1$. Therefore for all $k$ we have an error that grows linearly in δ for $\delta \ll 1$. The expressions for $Q_k, \alpha(c_k)$ in case 2 of the proof of Theorem 1 are similar, and it is straightforward to show by the same method that the errors in those parameters also grow linearly in δ for $\delta \ll 1$.

Learning the graph structure. In Theorem 1 we determine $Pa_k$ from $P(c_k \mid do(\boldsymbol{C} \setminus C_k))$. Assuming causal faithfulness, which is satisfied for almost all $P$ (Meek, 2013), $C_j \in Pa_k$ if and only if $P(c_k \mid do(\boldsymbol{C} \setminus C_k))$ differs for some pair of states $C_j = c_j$, $C_j = c_j'$. However, as we now only have estimates $\hat{P}(c_k \mid do(\boldsymbol{C} \setminus C_k))$, any variation with respect to $C_j = c_j$ may be due to the varying errors in these estimates rather than variation in the conditional probability itself. However, we have shown that we can learn any $P(c_i \mid pa_i)$ within error bounds, and that these bounds scale linearly with δ for $\delta \ll 1$. Let $\theta_{kj} = P(c_k \mid do(\boldsymbol{C} \setminus C_k))$ for a given state $C_j = c_j$ of the leave-one-out intervention, and denote the corresponding upper and lower bounds from Lemma 6 by $\theta_{kj}^\pm$.
If there exist states with $\theta_{kj} \neq \theta_{kj'}$ and either $\theta_{kj}^+ < \theta_{kj'}^-$ or $\theta_{kj'}^+ < \theta_{kj}^-$, i.e. non-overlapping bounds for $C_j = c_j$ and $C_j = c_j'$, then we know with certainty that $C_j \in Pa_k$. If there are no such non-overlapping bounds for any $j$, we do not know whether $C_j \in Pa_k$, and so we exclude it from the set. This approach is guaranteed to identify a sub-graph of $G$ (i.e. no false positives—no directed edges present in the approximate CBN that are not present in the environment). Further, we only miss a parent if, in the true underlying causal model, for all $Pa_k = pa_k$ intervening to change $C_j$ gives $|P(c_k \mid pa_k, do(c_j)) - P(c_k \mid pa_k, do(c_j'))| < O(\delta)$. Hence for $\delta \ll 1$ we only fail to learn causal relations that are small in magnitude with respect to the regret bound δ, i.e. where the causal effect of the parent on the child is $O(\delta)$. In Appendix F we explore the relation between the regret bound and the error in the learned causal graph using simulated data, and find that even agents that incur relatively high regret can be used to identify causal structure with high accuracy compared to a random baseline. ∎
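The edge-inclusion rule above amounts to an interval-disjointness test. A minimal sketch (our own; the interval values are made up):

```python
def edge_certain(intervals):
    """Given [lower, upper] bounds on theta_kj for several states of a
    candidate parent C_j, return True iff some pair of intervals is
    disjoint -- the only case in which C_j in Pa_k can be asserted."""
    for i, (lo1, hi1) in enumerate(intervals):
        for lo2, hi2 in intervals[i + 1:]:
            if hi1 < lo2 or hi2 < lo1:    # non-overlapping bounds
                return True
    return False                           # overlap: exclude the edge

# Made-up bounds on P(c_k | do(C \ C_k)) for c_j in {0, 1}:
print(edge_certain([(0.10, 0.25), (0.40, 0.55)]))  # True: edge asserted
print(edge_certain([(0.10, 0.45), (0.40, 0.55)]))  # False: inconclusive
```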
Appendix E Proof of Theorem 3

Proof. First we consider the case where we know the exact model $M = (P, G)$. As $M$ is causally sufficient, we can identify $\mathbb{E}[u \mid d, pa_D; \sigma]$ for any given soft intervention compatible with $G$ that involves only variables in $G$ (which includes $Anc_U \cup \{U\}$). Our policy oracle is constructed by i) estimating $\mathbb{E}[u \mid d, pa_D; \sigma]$ for the input σ, and ii) calculating $d^* = \arg\max_d \mathbb{E}[u \mid d, pa_D; \sigma]$ and returning any $d^*$ satisfying this.

Next, consider the case where we know an approximate model $M' = (P', G')$, for which $|P'(v_i \mid pa_i) - P(v_i \mid pa_i)| \leq \epsilon \ll 1$, which implies $P'(v_i \mid pa_i) = P(v_i \mid pa_i) + c_i \epsilon$ where $|c_i| \leq 1$.

First we show that for any soft intervention σ we can approximate the post-intervention joint distribution, $P'(\boldsymbol{Z} = \boldsymbol{z} \mid do(D = d), Pa_D = pa_D; \sigma) = P(\boldsymbol{Z} = \boldsymbol{z} \mid do(D = d), Pa_D = pa_D; \sigma) + k\epsilon + O(\epsilon^2)$, where $\boldsymbol{Z} = \boldsymbol{C} \setminus Pa_D$ and $k$ is a function of the model parameters, constant in $\epsilon$. Let $\sigma = \sum_j q_j \sigma_j$ where the $\sigma_j$ are soft interventions. Then

$$P'(\boldsymbol{Z} = \boldsymbol{z} \mid do(D = d), Pa_D = pa_D; \sigma) = \sum_j q_j \frac{P'(\boldsymbol{C} = \boldsymbol{c} \mid do(D = d); \sigma_j)}{P'(Pa_D = pa_D \mid do(D = d); \sigma_j)} \qquad (68)$$
$$= \sum_j q_j \frac{\prod_i P'(C_i = c_i \mid do(D = d); \sigma_j)}{\sum_{\boldsymbol{z}'} \prod_i P'(C_i = c_i' \mid do(D = d); \sigma_j)} \qquad (69)$$
$$= \sum_j q_j \frac{\prod_i [P(C_i = c_i \mid do(D = d); \sigma_j) + c_i \epsilon]}{\sum_{\boldsymbol{z}'} \prod_i [P(C_i = c_i' \mid do(D = d); \sigma_j) + c_i \epsilon]} \qquad (70)$$
$$= \sum_j q_j \frac{\prod_i P(C_i = c_i \mid do(D = d); \sigma_j)(1 + c_{ij}' \epsilon)}{\sum_{\boldsymbol{z}'} \prod_i P(C_i = c_i' \mid do(D = d); \sigma_j)(1 + c_i' \epsilon)} \qquad (71)$$
$$= P(\boldsymbol{Z} = \boldsymbol{z} \mid do(D = d), Pa_D = pa_D; \sigma) + \epsilon f(\theta) + O(\epsilon^2) \qquad (72)$$

where $c_{ij}' := c_i / P(C_i = c_i \mid do(D = d); \sigma_j)$ and $f(\theta)$ is a polynomial in the model parameters $\theta_i = P(v_i \mid pa_i)$.
Therefore the expected utility under the intervention σ evaluated using $M'$ satisfies

$$\mathbb{E}_{P'}[U \mid do(D = d), Pa_D = pa_D] = \sum_{\boldsymbol{z}} P'(\boldsymbol{Z} = \boldsymbol{z} \mid do(D = d), Pa_D = pa_D; \sigma)\, U(d, \boldsymbol{x}) \qquad (73)$$
$$= \sum_{\boldsymbol{z}} P(\boldsymbol{Z} = \boldsymbol{z} \mid do(D = d), Pa_D = pa_D; \sigma)\, U(d, \boldsymbol{x}) + \epsilon\, g(\theta) + O(\epsilon^2) \qquad (74)$$
$$= \mathbb{E}[U \mid do(D = d), Pa_D = pa_D] + \epsilon\, g(\theta) + O(\epsilon^2) \qquad (75)$$

where $g(\theta)$ is a polynomial in the model parameters. The decision $d^* = \arg\max_d \mathbb{E}_{P'}[U \mid do(D = d), Pa_D = pa_D]$ incurs at most $\epsilon\, g(\theta)$ regret, and therefore the regret is linear in $\epsilon$. ∎
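To illustrate the linear regret scaling, here is a small numeric sketch (our own; the one-chance-variable environment and random utilities are made-up assumptions): we plan with an ε-perturbed model and record the worst regret incurred under the true model.

```python
import numpy as np

rng = np.random.default_rng(0)

def regret_once(eps):
    """One random binary environment: plan with an eps-perturbed
    estimate of P(C=1), return the regret of that plan under the truth."""
    p = rng.uniform()                      # true P(C = 1)
    U = rng.uniform(size=(2, 2))           # U[d, c], normalised utilities
    p_hat = np.clip(p + rng.uniform(-eps, eps), 0.0, 1.0)
    eu_true = U[:, 0] * (1 - p) + U[:, 1] * p
    eu_hat  = U[:, 0] * (1 - p_hat) + U[:, 1] * p_hat
    return eu_true.max() - eu_true[np.argmax(eu_hat)]

for eps in [0.01, 0.02, 0.04, 0.08]:
    worst = max(regret_once(eps) for _ in range(50000))
    print(eps, round(worst, 4))   # worst-case regret scales ~linearly in eps
```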
Appendix F Experiments

As discussed in Section 4, the proofs of Theorems 1 and 2 can be viewed as causal discovery algorithms, where we assume i) knowledge of the set of environment variables $\boldsymbol{C}$, ii) knowledge of the utility function $U$, iii) that the decision task is unmediated, and iv) domain dependence. Given these assumptions we can learn an approximation of the underlying CBN given only the policy of the agent $\pi(\sigma)$ under interventions σ, with the approximation being exact when the $\pi(\sigma)$ are optimal. To demonstrate this theoretical result we take the proof for simple binary decision tasks outlined in Appendix B and recast it as a causal discovery algorithm (Algorithm 2 below). We test it on CIDs of the form shown in Figure 5, where we randomly choose the joint distribution over $X, Y$ and their causal structure $G$. Note that Algorithm 2 is significantly simpler than the general method outlined in the proof of Theorem 1, as it exploits the fact that $D, X, Y$ are binary variables and that $|\boldsymbol{C}| = 2$. This causal discovery algorithm requires that we can intervene on the latent variables $X, Y$, but only requires that we can observe the response of a single variable (the decision) to these interventions. To motivate this setting, we can imagine situations where the latents $X, Y$ cannot be directly observed but can be intervened on.

Example. Many diseases cannot be directly observed in patient physiology, and can only be indirectly observed through the presence of symptoms. Let $X, Y \in \{0, 1\}$ be two such diseases for which there are treatments, i.e. we can intervene to 'turn off' $X$ and $Y$ but cannot observe them. $D \in \{0, 1\}$ represents a decision to provide a specific pain relief medication, which results in a change in symptom severity (utility). The response to pain relief depends on the presence or absence of the diseases (e.g. pain relief is highly effective for patients with $X = T$, moderately effective for $Y = T$, and less effective for $X = F, Y = F$). The doctor's goal is to minimise symptom severity while avoiding unnecessary use of pain medication, e.g. $U(d, x, y) = d[s(x, y) - c]$, where $c$ is some cost associated with pain relief and $s(x, y)$ is the response to pain relief. Following an intervention σ (e.g. curing a disease, $\sigma = do(X = F)$), the doctor adapts their treatment policy in the shifted population. For example, this adaptation could occur by trial and error, with the doctor choosing random treatment decisions $D$ and observing the change in symptom severity—a context-free bandit problem. Although we cannot directly observe the disease states $X, Y$, by intervening on the latent disease state and observing how the doctor's policy adapts, we can learn both the joint distribution $P(X, Y)$ and the causal graph over $X, Y$.

Figure 7: Comparing the model-average error rates for a) the learned DAG (misclassification rate for $G$), b) the mean error for the parameters $P(x, y)$, and c) the worst-case error for the parameters $P(x, y)$, against the (normalised) regret bound $\delta / |\mathbb{E}[u \mid D = 1] - \mathbb{E}[u \mid D = 0]|$. Average error is taken over 1000 randomly generated environments with binary decision $D$ and two binary latent variables $X, Y$, compared to the error rate for a random guess (green). The results appear to show sub-linear growth in the error rate with the regret bound. Note that even weakly generalising agents can be used to identify causal structure significantly better than the random baseline.

Figure 7 shows the average error in the learned parameters $P(x, y)$ and $G$ when the $\pi(\sigma)$ satisfy different regret bounds. The results are averaged over 1000 randomly generated CBNs, where i) the parameters of the joint distribution $P(x, y)$ are chosen at random, ii) the DAG $G$ over $X, Y$ is chosen at random from $X \to Y$ and $X \leftarrow Y$, and iii) the utility function $U(d, x, y) \in [0, 1]$ is chosen at random (see Section A.2 for a description of the parameters). To simulate the regret-bounded agent we calculate the optimal policy for each environment, and if the sub-optimal decision satisfies the regret bound we choose randomly from the two decisions when sampling from the policy oracle in Algorithm 1. We also compare to a random baseline algorithm, which estimates $P(x, y) = 1/4$ and randomly selects $X \to Y$ or $X \leftarrow Y$ with equal probability. In a small number of cases Algorithm 1 fails to predict $P(x, y) \in [0, 1]$ due to finite sample errors, and for these cases we replace the output of the causal discovery algorithm with a random guess. From Figure 7 it appears that the error rate grows sub-linearly with regret. Note that the relevant scale for the regret is the difference in expected utility between the two decisions; hence we plot the normalised regret bound, where we divide δ by this expected utility difference. Note that even for relatively large regret bounds, representing agents that generalise weakly, we can still identify the causal structure with high accuracy. For example, when the regret bound is 30% of the expected utility difference, we can still identify the correct causal structure in ∼90% of the randomly generated CIDs. This describes an agent that is guaranteed to incur a regret of at most 30% of the expected utility difference between the decisions before the domain shift. If the domain shift results in the expected utility difference being less than 30% of the unshifted expected utility difference, the agent can return a sub-optimal decision.
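The regret-bounded agent described above can be simulated in a few lines. A minimal sketch (our own reading of the setup; the environment encoding is an illustrative assumption): whenever the sub-optimal decision is within δ of optimal, the simulated oracle is free to return either decision.

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_optimal_oracle(expected_utility, delta):
    """Simulate a delta-optimal policy oracle for a binary decision.

    expected_utility: length-2 array, E[u | D=d] for d in {0, 1} under
    the (possibly shifted) environment. If the worse decision is within
    delta of the best, either decision may be returned."""
    gap = abs(expected_utility[0] - expected_utility[1])
    if gap <= delta:
        return rng.integers(2)            # any delta-optimal decision
    return int(np.argmax(expected_utility))

# Made-up shifted environment: E[u | D=0] = 0.52, E[u | D=1] = 0.48.
eu = np.array([0.52, 0.48])
print([delta_optimal_oracle(eu, 0.0) for _ in range(5)])   # always 0
print([delta_optimal_oracle(eu, 0.1) for _ in range(5)])   # mixes 0 and 1
```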
For example when the regret bound is 30% of the expected utility difference, we can still identify the correct causal structure in ∼90%similar-toabsentpercent90 90\%∼ 90 % of the randomly generated CIDs. This describes an agent that is guaranteed to incur a regret of at most 30%percent3030\%30 % of the expected utility difference between the decisions before the domain shift. If the domain shift results in the expected utility difference being less that 30%percent3030\%30 % of the unshifted expected utility difference, the agent can return a sub-optimal decision. Algorithm 2 Graph Learner for simple CID 1:function graph learner(ΠΣδsubscriptsuperscriptΠΣ ^δ_ Πitalic_δroman_Σ, U, δ, N) 2: d1,d2,x′,y′,qcrit←Algorithm 1⁢(U,ΠΣδ,N,σ1=do⁢(Y=0))←subscript1subscript2superscript′subscriptcritAlgorithm 1subscriptsuperscriptΠΣsubscript1do0d_1,d_2,x ,y ,q_crit 1(% U, ^δ_ ,N, _1= do(Y=0))d1 , d2 , x′ , y′ , qcrit ← Algorithm 1 ( U , Πitalic_δroman_Σ , N , σ1 = do ( Y = 0 ) ) ▷ ▷ Identify qcritsubscriptcritq_critqcrit for do⁢(Y=0)do0 do(Y=0)do ( Y = 0 ) 3: Exp. U difference =(U⁢(d2,x′,y′)−U⁢(d1,x′,y′))∗(1−1/qcrit)absentsubscript2superscript′subscript1superscript′11subscriptcrit=(U(d_2,x ,y )-U(d_1,x ,y ))*(1-1/q_ % crit)= ( U ( d2 , x′ , y′ ) - U ( d1 , x′ , y′ ) ) ∗ ( 1 - 1 / qcrit ) 4: Δ0=U⁢(0,0,d2)−U⁢(0,0,d1)subscriptΔ000subscript200subscript1 _0=U(0,0,d_2)-U(0,0,d_1)Δ0 = U ( 0 , 0 , d2 ) - U ( 0 , 0 , d1 ) 5: Δ1=U⁢(1,0,d2)−U⁢(1,0,d1)subscriptΔ110subscript210subscript1 _1=U(1,0,d_2)-U(1,0,d_1)Δ1 = U ( 1 , 0 , d2 ) - U ( 1 , 0 , d1 ) 6: P⁢(XY=0=0)=(Exp. U difference−Δ1)/(Δ0−Δ1)subscript00Exp. U differencesubscriptΔ1subscriptΔ0subscriptΔ1P(X_Y=0=0)=(Exp. U difference- _1)/( _0- _1)P ( Xitalic_Y = 0 = 0 ) = ( Exp. U difference - Δ1 ) / ( Δ0 - Δ1 ) 7: 8: d1,d2,x′,y′,qcrit←Algorithm 1⁢(U,ΠΣδ,N,σ1=do⁢(Y=1))←subscript1subscript2superscript′subscriptcritAlgorithm 1subscriptsuperscriptΠΣsubscript1do1d_1,d_2,x ,y ,q_crit 1(% U, ^δ_ ,N, _1= do(Y=1))d1 , d2 , x′ , y′ , qcrit ← Algorithm 1 ( U , Πitalic_δroman_Σ , N , σ1 = do ( Y = 1 ) ) ▷ ▷ Identify qcritsubscriptcritq_critqcrit for do⁢(Y=1)do1 do(Y=1)do ( Y = 1 ) 9: Exp. U difference =(U⁢(d2,x′,y′)−U⁢(d1,x′,y′))∗(1−1/qcrit)absentsubscript2superscript′subscript1superscript′11subscriptcrit=(U(d_2,x ,y )-U(d_1,x ,y ))*(1-1/q_ % crit)= ( U ( d2 , x′ , y′ ) - U ( d1 , x′ , y′ ) ) ∗ ( 1 - 1 / qcrit ) 10: Δ0=U⁢(0,1,d2)−U⁢(0,1,d1)subscriptΔ001subscript201subscript1 _0=U(0,1,d_2)-U(0,1,d_1)Δ0 = U ( 0 , 1 , d2 ) - U ( 0 , 1 , d1 ) 11: Δ1=U⁢(1,1,d2)−U⁢(1,1,d1)subscriptΔ111subscript211subscript1 _1=U(1,1,d_2)-U(1,1,d_1)Δ1 = U ( 1 , 1 , d2 ) - U ( 1 , 1 , d1 ) 12: P⁢(XY=1=0)=(Exp. U difference−Δ1)/(Δ0−Δ1)subscript10Exp. U differencesubscriptΔ1subscriptΔ0subscriptΔ1P(X_Y=1=0)=(Exp. U difference- _1)/( _0- _1)P ( Xitalic_Y = 1 = 0 ) = ( Exp. U difference - Δ1 ) / ( Δ0 - Δ1 ) 13: 14: d1,d2,x′,y′,qcrit←Algorithm 1⁢(U,ΠΣδ,N,σ1=do⁢(X=0))←subscript1subscript2superscript′subscriptcritAlgorithm 1subscriptsuperscriptΠΣsubscript1do0d_1,d_2,x ,y ,q_crit 1(% U, ^δ_ ,N, _1= do(X=0))d1 , d2 , x′ , y′ , qcrit ← Algorithm 1 ( U , Πitalic_δroman_Σ , N , σ1 = do ( X = 0 ) ) ▷ ▷ Identify qcritsubscriptcritq_critqcrit for do⁢(X=0)do0 do(X=0)do ( X = 0 ) 15: Exp. 
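The structure-selection step at the end of Algorithm 2 (lines 26–38) is a pair of equality tests on the recovered interventional parameters. A Python rendering of that step alone (our own; in practice the equality tests need a tolerance, since the recovered parameters carry $O(\delta)$ error):

```python
def select_graph(pX_doY0, pX_doY1, pY_doX0, pY_doX1, tol=1e-6):
    """Choose between no edge, Y -> X and X -> Y from the four recovered
    interventional parameters (P(X=0 | do(Y=y)), P(Y=0 | do(X=x)))."""
    y_invariant = abs(pY_doX0 - pY_doX1) <= tol   # Y unaffected by do(X)
    x_invariant = abs(pX_doY0 - pX_doY1) <= tol   # X unaffected by do(Y)
    if y_invariant and x_invariant:
        return "no edge"
    if y_invariant:
        return "Y -> X"
    return "X -> Y"

# Made-up parameters where intervening on X changes Y but not vice versa:
print(select_graph(pX_doY0=0.3, pX_doY1=0.3, pY_doX0=0.7, pY_doX1=0.2))
# X -> Y
```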
Appendix G: Transportability & Pearl's causal hierarchy

Transportability. The problem of evaluating policies under distributional shifts has been studied extensively in transportability theory (Pearl & Bareinboim, 2011; Bareinboim & Pearl, 2016; Bellot & Bareinboim, 2022). For decision tasks as outlined in Section 2.2, transportability aims to provide necessary and sufficient conditions for identifying the expected utility following a distributional shift, $R = \mathbb{E}[u \mid d, \mathrm{pa}_D; \sigma]$, given i) (partial) knowledge of the joint distribution $P$, the causal graph $G$, and the interventional distributions $\mathcal{I}$ in the source domain, and ii) (partial) knowledge of the joint $P^*$ and the causal graph $G^*$ in the target domain (Pearl & Bareinboim, 2011; Bareinboim & Pearl, 2012b). Hence, these results differ from Theorems 1 and 2 in that they restrict to the case where all assumptions on the data generating process (i.e. inductive biases) can be expressed as (partial) knowledge of the underlying CBN. For example, Bareinboim & Pearl (2016) claim the problem is essentially solved in the case where 'assumptions are expressible in DAG form'. This does not constrain possible approaches to domain generalisation that make use of non-causal assumptions and heuristics,⁵ and indeed deep learning algorithms exploit a much wider set of inductive biases than causal assumptions alone (Neyshabur et al., 2014; Battaglia et al., 2018; Rahaman et al., 2019; Goyal & Bengio, 2022; Cohen & Welling, 2016). In many real-world tasks these may be sufficient to identify 'good enough' (i.e. regret-bounded) policies without requiring knowledge of the causal structure of the data generating process. Our aim has been to establish whether learning causal models is necessary for domain generalisation in general; hence, assuming that agents are restricted to inductive biases that amount to (partial) knowledge of the underlying CBN would be begging the question.

⁵Notable examples of causal assumptions that go beyond those expressible in DAG form include restricting the classes of structural equations (Mooij et al., 2016) and assuming cause-effect asymmetry (Mitrovic et al., 2018).
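To make the identification target $R$ concrete in the simple two-variable task above, the following sketch computes the post-shift expected utility of each decision from the CBN $(G, P)$ recovered by Algorithm 2. The names and graph encoding are hypothetical and match the sketches above; binary variables only.

```python
import numpy as np

def shifted_expected_utility(G, p, utility, shift):
    """Sketch (hypothetical names) of the identification target
    R = E[u | d; sigma] for the binary two-variable task, using the
    CBN (G, P) recovered by Algorithm 2. Here p[x, y] = P(X=x, Y=y),
    utility[d, x, y] = U(d, x, y), and shift = ("X", x0) or
    ("Y", y0), i.e. sigma = do(X = x0) or do(Y = y0)."""
    var, val = shift
    p_shift = np.zeros((2, 2))
    if var == "X":
        # Under do(X = x0), Y follows P(y | x0) if X -> Y and keeps
        # its marginal P(y) otherwise (assumes P(X = x0) > 0).
        p_y = p[val] / p[val].sum() if G == "X->Y" else p.sum(axis=0)
        p_shift[val, :] = p_y
    else:
        # Under do(Y = y0), symmetrically for X.
        p_x = p[:, val] / p[:, val].sum() if G == "Y->X" else p.sum(axis=1)
        p_shift[:, val] = p_x
    # Post-shift expected utility of each decision d.
    return np.array([(utility[d] * p_shift).sum() for d in (0, 1)])
```

A policy that (approximately) maximises this quantity, e.g. `int(np.argmax(shifted_expected_utility(G, p, utility, ("X", 0))))`, is then a regret-bounded adaptation of the sort the transportability results characterise.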
Causal hierarchy theorem (CHT). The celebrated causal hierarchy theorem (Bareinboim et al., 2022; Ibeling & Icard, 2021) shows that for almost all environments there are causal relations between environment variables that cannot be identified from observational data without additional assumptions. Does this imply that a causal model is necessary for identifying optimal policies? First, note that the CHT is an insufficiency result, and only implies trivial necessity results. For example, is a causal model necessary for identifying all causal and associative relations between environment variables? Yes, but only because this set of observational and interventional distributions is itself a causal model: formally, we can identify the underlying causal model (up to latent confounders) by assuming causal faithfulness, which holds for almost all causal models (Meek, 2013). The difference is that the CHT is concerned with the identifiability of all causal and associative relations between environment variables, which sets a much higher bar than domain generalisation, where it suffices to identify a strict subset of these (regret-bounded policies) (Figure 3). Secondly, the CHT is concerned with the collapse (or lack thereof) of the causal hierarchy, for example, that observational data is insufficient for identifying all causal queries. We do not restrict agents to observational training data; in fact, we typically assume agents have access to both observational and interventional data in the online learning setting that we consider (e.g. agents can intervene to fix the decision node $D$ by assumption). Finally, one can imagine a refinement of the CHT stating that observational data is insufficient for identifying regret-bounded policies without additional assumptions, bringing it in line with Theorems 1 and 2. Even if this were implied by the CHT, it would not imply our results unless we restricted to the case where all assumptions are expressed as constraints on the causal structure (as in transportability). Likewise, it is simple to show that Theorem 1 does not imply the CHT.
In deriving Theorem 1 we do not restrict the agent to observational distributions, nor do we place any restrictions on the data available to the agent when generating its policy.