
Paper deep dive

The Limits of Predicting Agents from Behaviour

Alexis Bellot, Jonathan Richens, Tom Everitt

Year: 2025 · Venue: ICML 2025 · Area: Formal/Theoretical · Type: Theoretical · Embeddings: 113

Abstract

As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent's beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent's behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.

Tags

ai-safety (imported, 100%) · formaltheoretical (suggested, 92%) · theoretical (suggested, 88%)

Links

PDF not stored locally; view it on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/12/2026, 5:54:01 PM

Summary

The paper investigates the theoretical limits of predicting the behavior of intentional AI agents in novel environments based solely on their past behavior. By modeling agents as systems guided by internal 'world models' (Structural Causal Models), the authors derive novel bounds on agent behavior under distribution shifts. They demonstrate that while some behaviors can be ruled out (weak predictability) under specific assumptions like grounding, general shifts in causal mechanisms often render agent behavior unpredictable, highlighting fundamental limits in AI safety and interpretability.

Entities (5)

Alexis Bellot · researcher · 100%
Structural Causal Model · formalism · 100%
AI Agent · system · 95%
Grounding · concept · 95%
Weak Predictability · property · 90%

Relation Signals (3)

Alexis Bellot authored The Limits of Predicting Agents from Behaviour

confidence 100% · "The Limits of Predicting Agents from Behaviour" — Alexis Bellot

AI Agent uses Structural Causal Model

confidence 90% · "We assume the AI operates according to an SCM M̂ over V, its (implicit) world model"

Grounding enables Weak Predictability

confidence 85% · "An AI grounded in a domain M is weakly predictable under a shift σ ... if and only if there exists a decision d* such that ..."

Cypher Suggestions (2)

Find all formalisms used to describe AI agents · confidence 90% · unvalidated

MATCH (a:Entity {entity_type: 'System'})-[:USES]->(f:Entity {entity_type: 'Formalism'}) RETURN a.name, f.name

Identify properties that depend on grounding · confidence 85% · unvalidated

MATCH (g:Entity {name: 'Grounding'})-[:ENABLES]->(p:Entity {entity_type: 'Property'}) RETURN p.name

Full Text

112,333 characters extracted from source content.


arXiv:2506.02923v1 [cs.AI] 3 Jun 2025

The Limits of Predicting Agents from Behaviour

Alexis Bellot, Jonathan Richens and Tom Everitt (Google DeepMind)

As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent's beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent's behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.

1. Introduction

Humans understand each other through the use of abstractions. We explain our intentions by appealing to our "goals" and "beliefs" about the world around us without knowing the underlying cognition going on inside our heads. According to Dennett (1989, 2017), the same is true of our understanding of other systems. For example, a bear hibernates during winter as if it believes that the lower temperatures cause food scarcity. This is a useful description of the bear's behaviour, with real predictive power. For example, it gives us (human observers) the ability to anticipate how bears might act as the climate changes.
There is a correspondence between beliefs and behaviour that is foundational to rational agents (Davidson, 1963). Artificial Intelligence (AI) systems appear to have similarly general capabilities, not totally unlike that of humans and animals. They can generate text that is fluent and accurate in response to a very diverse set of questions. Whenever they display consistent types of behaviour across many different tasks, we are tempted to apply our own mentalistic language more or less at face value (Shanahan, 2024), taking seriously questions such as: What do the AIs know? What do they think, and believe? Taking the analogy further, it is as if they learn "world models" that mirror the causal relationships of the environment they are trained on, guiding their future plans and behaviour.[1] And as a consequence, their interactions with an environment will leave clues that might give us the ability to predict their future behaviour in novel domains. This possibility engages with a core AI Safety problem: how to guarantee and predict whether AI systems will act safely and beneficially? The main result of this paper is to offer a new perspective on this problem by showing that: With an assumption of competence and optimality, the behaviour of AI systems partially determines their actions in novel environments.

[1] Recent research suggests that an AI's behaviour, to the extent that it is consistent with rationality axioms, can be formally described by a (causal) world model (Halpern and Piermont, 2024). The same conclusion can also be obtained for AIs capable of solving tasks in multiple environments (Richens and Everitt, 2024). For large language models, there is increasing empirical evidence for the "world model" hypothesis, see e.g., Goldstein and Levinstein, 2024; Gurnee and Tegmark, 2023; Li et al., 2022; Toshniwal et al., 2022 and Vafa et al., 2024.

Corresponding author(s): abellot@google.com. ©2025 Google DeepMind.
All rights reserved.

Here behaviour means our observations of the decisions made by the AI system, contextual variables, and utility or reward values in some environment. The "partial" determination of actions in new environments is a consequence of our lack of knowledge about the AI's actual world model (different models may induce different optimal actions). However, even though we can't uniquely identify the AI's future behaviour and beliefs, we can narrow it down to a range of possible outcomes. This paper characterises those outcomes. In the literature, the under-determination of agent "beliefs" and preferences has been considered in the fields of inverse reinforcement learning (Abbeel and Ng, 2004; Amin and Singh, 2016; Skalse and Abate, 2023) and decision theory (Afriat, 1967; Jeffrey, 1990; Savage, 1972), among others. In settings with distribution shift between training and deployment environments, this under-determination can be understood as a consequence of the Causal Hierarchy Theorem, which defines precise limits on the kinds of inferences that can be drawn across domains (Bareinboim et al., 2022; Pearl, 2009). It implies, for example, that behaviour in an environment subject to an intervention cannot be established from "non-interventional" data alone. Robins (1989), Manski (1990) and Pearl (1999) showed that useful information in the form of bounds can nevertheless be extracted from "non-interventional" data, without actually knowing the underlying data-generating process. In the causality literature, several methods and algorithms exist to solve different versions of this problem, see e.g., Balke and Pearl, 1997; Bellot, 2024; Rosenbaum et al., 2010; Tan, 2006; Tian and Pearl, 2000; Zhang et al., 2021. This paper extends the causal formalism to reason about the possible behaviours and beliefs of an AI system, itself assumed to be governed by an unknown data-generating process or world model.
With this interpretation we are able to define mathematically notions such as an AI's preferred choice of action in novel environments, its perception of fairness, and its perception of harm due to the actions it takes. Our main contribution is a set of inequalities on these "beliefs" in terms of quantities that can in principle be estimated from behavioural data, that hold irrespective of the underlying cognitive architecture of the AI system as long as it can be represented by a well-defined set of causal mechanisms (a world model) that tracks its behaviour (Sec. 4). We then extend these results to characterize AI behaviour under several relaxations for applications in practice (Sec. 5), ultimately with the goal of defining the theoretical limits of what can be inferred from data about AI behaviour in new (unseen) environments. This has consequences for the wider AI Safety community and society. For example, we show that an AI's perception of the potential fairness and harm of its decisions (e.g., whether the AI's resource allocation is believed to be equitable, or its generations unbiased) can provably not be inferred from observing its behaviour alone. There are theoretical limits to how much we can understand about an AI's cognition and decision-making process from observations. We believe our results can help justify the claim that the design and inference of world models is important to ensure AIs can behave predictably and act safely and beneficially, as argued by Bengio et al., 2025; Dalrymple et al., 2024; Legg, 2023.

2. Preliminaries

In this section we outline some basic principles that we use to reason about how beliefs might be (implicitly) defined within an AI system. We use capital letters to denote variables ($X$), small letters for their values ($x$), bold letters for sets of variables ($\mathbf{X}$) and their values ($\mathbf{x}$), and use $\mathrm{supp}$ to denote their domains of definition ($x \in \mathrm{supp}(X)$). To denote $P(\mathbf{Y}=\mathbf{y} \mid \mathbf{X}=\mathbf{x})$ we use the shorthand $P(\mathbf{y} \mid \mathbf{x})$.
We use $\mathbb{1}[\cdot]$ for the indicator function, equal to $1$ if the statement in $[\cdot]$ evaluates to true and equal to $0$ otherwise.

Actions, plans, and hypothetical outcomes can be evaluated by symbolic operations on a model that represents the functional relationships in the world, known as a Structural Causal Model (Pearl, 2009, Definition 7.1.1), or SCM for short.

Definition 1 (Structural Causal Model). An SCM $M$ is a tuple $M = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}, P \rangle$ where each observed variable $V \in \mathbf{V}$ is a deterministic function of a subset of variables $\mathbf{Pa}_V \subset \mathbf{V}$ and latent variables $\mathbf{U}_V \subset \mathbf{U}$, i.e., $v := f_V(\mathbf{pa}_V, \mathbf{u}_V)$, $f_V \in \mathcal{F}$. Each latent variable $U \in \mathbf{U}$ is distributed according to a probability measure $P(u)$. We assume the model to be recursive, i.e., that there are no cyclic dependencies among the variables.

In an SCM $M$, each draw $\mathbf{u} \sim P(\mathbf{u})$ evaluates to a potential response $\mathbf{Y}(\mathbf{u}) = \mathbf{y}$ and entails a distribution over the possible outcomes $P(\mathbf{y})$. The power of SCMs is that they specify not only the joint distribution $P(\mathbf{v})$ but also the distribution of variables under all interventions, including incompatible interventions (counterfactuals). Formally, an intervention $do(\mathbf{x})$ is modelled as a symbolic operation where values of a set of variables $\mathbf{X}$ are set to constants $\mathbf{x}$, replacing the functions $f_X : X \in \mathbf{X}$ that would normally determine their values. This effectively induces a sub-model of $M$, denoted $M_{\mathbf{x}}$. The variables obtained in $M_{\mathbf{x}}$ are denoted $\mathbf{Y}_{\mathbf{x}}$ and we will loosely write $P_{M_{\mathbf{x}}}(\mathbf{y}) \equiv P_{\mathbf{x}}(\mathbf{y}) \equiv P(\mathbf{y}_{\mathbf{x}}) \equiv P(\mathbf{y} \mid do(\mathbf{x}))$ to denote the probabilities over the possible outcomes of $\mathbf{Y}$ in $M_{\mathbf{x}}$.

Different environments can be modelled by different SCMs. Let $M_1 = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}_1, P_1 \rangle$ and $M_2 = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}_2, P_2 \rangle$ be the SCMs for two environments over the same sets $\mathbf{V}$ and $\mathbf{U}$. We say that there is a discrepancy or a shift on a variable $X \in \mathbf{V}$ between them if either $f^1_X \neq f^2_X$ or $P_1(\mathbf{U}_X) \neq P_2(\mathbf{U}_X)$, or both. Shifts might therefore encode arbitrary changes in the causal mechanisms for a set of variables.
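The SCM abstraction and the $do(\cdot)$ operation can be illustrated with a few lines of code. Below is a minimal sketch assuming a hypothetical two-variable model of our own design (the names `f_Z`, `f_Y`, `sample` and `p_y1` are ours, not the paper's): an intervention simply replaces a structural function with a constant.

```python
import random

# Minimal SCM sketch (hypothetical model, not from the paper):
# latent U ~ Uniform{1..5}; Z := f_Z(U); Y := f_Y(Z, U).
def f_Z(u):
    return 1 if u in (1, 4) else 0

def f_Y(z, u):
    return 1 if (z == 1 and u != 2) else 0

def sample(do_z=None):
    """One draw from the SCM; passing do_z implements the intervention
    do(Z = do_z), which replaces the mechanism f_Z with a constant."""
    u = random.randint(1, 5)
    z = do_z if do_z is not None else f_Z(u)
    return z, f_Y(z, u)

def p_y1(do_z=None):
    """Exact P(Y = 1), by enumerating the five equiprobable values of U."""
    total = 0
    for u in range(1, 6):
        z = do_z if do_z is not None else f_Y.__defaults__ if False else (
            f_Z(u) if do_z is None else do_z)
        total += f_Y(z, u)
    return total / 5

print(p_y1())        # observational P(Y = 1) -> 0.4
print(p_y1(do_z=1))  # interventional P(Y = 1 | do(Z = 1)) -> 0.8
```

Note how the observational and interventional distributions differ even though the structural functions are unchanged: $do(Z=1)$ severs the dependence of $Z$ on $U$.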
For a reference SCM $M$, a so-called "shifted" SCM will be represented by a sub-model $M_\sigma$ where $\sigma$ represents the discrepancies between $M$ and $M_\sigma$. For example, an environment with a shift $\sigma$ on a set of variables $\mathbf{X}$ introduces (possibly arbitrary) discrepancies in the functional assignments or (independent) exogenous variables of $\mathbf{X}$ while keeping other mechanisms unchanged. See Pearl (2009, Chapter 4) and Correa and Bareinboim (2020b) for more details. We also make a note here that all proofs of statements are given in Appendix C and that the derivations of examples are given in Appendix A.

3. Agents, Beliefs, and the Environment

In this section we lay out a framework to interface between the AI system's internal world model and our own observations of its behaviour in the real world. Both rely on the same SCM abstraction. We assume the AI operates according to an SCM $\hat{M}$ over $\mathbf{V}$, its (implicit) world model,[2] that guides its behaviour. $\mathbf{V}$ includes the AI's decision variable $D$, the inputs to those decisions $\mathbf{C}$, possible additional variables, and the utility variable $Y$, such as the training signal or a measurable target given to the AI (Everitt et al., 2021). Beliefs[3] are defined as quantifiable aspects of that model or derivations of it.

Definition 2 (Beliefs). An AI belief is a probabilistic statement derived from its internal model $\hat{M}$.

For example, a statement like $P_{\hat{M}_d}(Y = y) = 0.8$ describes the subjective belief "The AI is 80% confident that taking decision $D = d$ will lead to event $Y = y$". The sub-model in this mathematical expression represents what the AI "thinks" the world looks like after taking the decision $D = d$.

[2] Here SCMs are meant to represent, mathematically, the decision-making process going on "in the AI's head" in a way that tracks its behaviour, without making any claims about the AI's actual cognitive architecture.
[3] We might prefer to use terms like "credences" or "subjective probabilities" to emphasize the subjective nature of beliefs and avoid the connotation of strong conviction or certainty, as done by Schwitzgebel (2024, Sec. 2.3).

We assume that the AI makes decisions $d$ by sampling from a policy $\pi(d \mid \mathbf{c})$, which is a function mapping from the domain of the observed covariates $\mathbf{C} \subset \mathbf{V}$ (i.e., all the inputs given to the AI) to the probability space over the domain of the decision $D \in \mathbf{V}$. The choice of $\pi$ is assumed to be driven by its perceived utility[4] $Y \in \mathbf{V}$ within the AI's model $\hat{M}$, that is,

  $\arg\max_{\pi} \mathbb{E}_{P_{\hat{M}}}[\,Y \mid do(\pi)\,]$.  (1)

The AI interacts with the real world, which is described by a (likely different) SCM $M$ that encodes the true dynamics of the environment. In principle, we have no reason to expect that the model $\hat{M}$ internalized by the AI matches the underlying reality $M$. AI systems might hope to reproduce some aspects of $M$ (the AI might have learned, for instance, to mimic the distribution of the observed data). Competent AIs might go further and be able to reliably predict the effects of different decisions in the world. We define this as grounding below.

Definition 3 (Grounding). Let $\hat{M}$ represent the AI's internal model. We say that the AI is grounded in a domain $M$ if $P_{\hat{M}_d}(\mathbf{V}) \equiv P_{M_d}(\mathbf{V})$ for any decision $d \in \mathrm{supp}(D)$.

Grounding tells us that the AI's beliefs about the effect of a particular decision $d$ in the training environment match the effects that would be observed in the real world, i.e., $\hat{P}_d(\mathbf{V}) \equiv P_d(\mathbf{V})$.[5] It is an assumption on the relationship between our observations of AI behaviour $P(\mathbf{V})$ and what might be going on in the AI's "mind" $\hat{P}(\mathbf{V})$. This might be reasonable, for example, if the AI is explicitly trained by reinforcement learning in $M$. By assumption, a grounded AI's choice of decision in environment $M$ is in principle predictable from data since we can compute Eq. (1).
But this might not necessarily be the case in a new (unseen) environment.

Example 1 (The Uncertain Medical AI). Imagine an AI system assisting patients with their treatment $D$ for a disease $Y$ known to be influenced also by a third variable $Z$, blood pressure. The AI is competent and learns the precise effect of all treatments. In other words it is grounded in $M$, i.e., $\hat{P}_d(z, y) = P_d(z, y)$. For concreteness, let the environment $M$ be given by

  $Z \leftarrow \mathbb{1}[U \in \{1, 4\}]$,
  $Y \leftarrow Z \cdot \mathbb{1}[U = 4] + (1 - Z) \cdot \mathbb{1}[U \in \{1, 3, 4\}]$  if $d = 0$,
  $Y \leftarrow Z \cdot \mathbb{1}[U \neq 2] + (1 - Z) \cdot \mathbb{1}[U \in \{2, 4\}]$  if $d = 1$,

with equal probability $P$ for all values $U \in \{1, 2, 3, 4, 5\}$. Here $U$ is latent, summarizing all other contributions to both the disease and blood pressure, such as an individual's (unobserved) attitudes to health, fitness, etc. Could we confidently deploy this AI system more widely, for example, on individuals that also take a second drug that artificially improves their blood pressure (e.g., fixing $Z$ to $1$, replacing the original assignment)? If the AI system is instructed to maximize $Y$ on average, what decision does the AI believe is optimal? The answer is we do not know, meaning that in this case it is possible to find a second model $\hat{M}$ defined by the mechanisms

  $Z \leftarrow \mathbb{1}[U \in \{1, 4\}]$,
  $Y \leftarrow Z \cdot \mathbb{1}[U \neq 1] + (1 - Z) \cdot \mathbb{1}[U \in \{3, 4\}]$  if $d = 0$,
  $Y \leftarrow Z \cdot \mathbb{1}[U \in \{1, 4\}] + (1 - Z) \cdot \mathbb{1}[U \in \{1, 2\}]$  if $d = 1$,

that entails exactly the same observations $\hat{P}_d(z, y) = P_d(z, y)$ but induces different optimal decisions under the intervention $Z \leftarrow 1$. Under $M$, the highest utility $Y$ on average is given by $d = 1$, while under $\hat{M}$ the highest utility $Y$ on average is given by $d = 0$.

[4] To account for possible uncertainty in the AI's "satisfaction" about a given state of the world $\mathbf{w}$ we assume $Y$ is a random variable (induced by $\mathbf{U}_Y \subset \mathbf{U}$), also known as a stochastic utility model (Manski, 1977). We assume that the support of $Y$ is bounded in the $[0, 1]$ interval.

[5] We use the shorthand $P_d \equiv P_{M_d}$ and $\hat{P}_d \equiv P_{\hat{M}_d}$ to simplify the notation.
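The two models of Example 1 are small enough to check by enumeration. The sketch below (variable and function names are ours) confirms the under-determination the example describes: both models agree on $P_d(z, y)$ for every $d$, yet disagree on which decision maximizes expected utility under $do(Z \leftarrow 1)$.

```python
from itertools import product

U_VALS = (1, 2, 3, 4, 5)   # latent U, uniform over five values

def z_of(u):                # Z <- 1[U in {1,4}], shared by both models
    return 1 if u in (1, 4) else 0

def y_M(d, z, u):           # utility mechanism of the true environment M
    if d == 0:
        return z * (u == 4) + (1 - z) * (u in (1, 3, 4))
    return z * (u != 2) + (1 - z) * (u in (2, 4))

def y_Mhat(d, z, u):        # the observationally equivalent model M-hat
    if d == 0:
        return z * (u != 1) + (1 - z) * (u in (3, 4))
    return z * (u in (1, 4)) + (1 - z) * (u in (1, 2))

# 1) Grounding: the observable pairs (Z, Y) agree for every d and u,
#    so P_d(z, y) is identical under M and M-hat.
for d, u in product((0, 1), U_VALS):
    assert y_M(d, z_of(u), u) == y_Mhat(d, z_of(u), u)

# 2) Under the intervention do(Z <- 1), optimal decisions differ.
def mean_y(y_fn, d):
    """E[Y | do(Z=1), d] by enumerating the equiprobable values of U."""
    return sum(y_fn(d, 1, u) for u in U_VALS) / len(U_VALS)

print([mean_y(y_M, d) for d in (0, 1)])     # M prefers d = 1
print([mean_y(y_Mhat, d) for d in (0, 1)])  # M-hat prefers d = 0
```

Running this reproduces the example's conclusion: the data cannot distinguish which of the two world models the AI has internalised, so its preferred decision under the intervention is undetermined.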
A priori, we have no way of knowing which model ($M$ or $\hat{M}$) is governing the AI's behaviour and so no way of knowing what decision will be favoured by the AI under the intervention. □

This example illustrates a canonical point in a simple setting: as observers, with access to the AI's interactions in some domain, its behaviour outside of that domain might not be uniquely determined (Pearl, 2009).

4. The Limits of Behavioural Data

In this section, we explore the limits of behavioural data for predicting the decisions of AIs in new environments. As external observers, we do not have access to the mechanisms underlying the actual environment nor the agent's internal model. We assume that we must rely for our inferences on watching the agent's behaviour and its consequences. That is, we have access to (samples of) $P_d(\mathbf{V})$[6] for all $d$. As a starting point, we might expect competent AIs to be weakly predictable in the sense that a subset of decisions can be ruled out as provably sub-optimal given our observations.

Definition 4 (Weak Predictability). We say that an AI is weakly predictable under a shift $\sigma$ in situation $\mathbf{C} = \mathbf{c}$ if there exists a decision $d^*$ that is provably sub-optimal, i.e.,

  $d^* \neq \arg\max_d \mathbb{E}_{P_{\hat{M}}}[\,Y \mid do(\sigma, d), \mathbf{c}\,]$,  (2)

for any valid SCM $\hat{M}$ describing the AI's internal model.

Here, "valid" means that the AI's internal model is compatible with the observed data under our assumptions about the relationship between the data and the AI's internal model, e.g., grounding. Weak predictability means that there exists at least one decision that we can guarantee the AI will not take in the shifted environment. Specifically, we can rule out a decision $d^*$ if and only if we can find a (superior) alternative decision $d \neq d^*$ such that

  $\min_{\hat{M} \in \mathbb{M}} (\Delta_{d \succ d^*}) > 0$, where $\Delta_{d \succ d^*} := \mathbb{E}_{P_{\hat{M}}}[\,Y \mid do(\sigma, d), \mathbf{c}\,] - \mathbb{E}_{P_{\hat{M}}}[\,Y \mid do(\sigma, d^*), \mathbf{c}\,]$.  (3)

$\mathbb{M}$ denotes the set of "valid" SCMs. Here $\Delta$ can be interpreted as the AI's preference gap between two decisions in some situation $\mathbf{C} = \mathbf{c}$.
When it evaluates to a positive number $d$ is preferred to $d^*$, and when it evaluates to a negative number $d^*$ is preferred to $d$ (in the AI's mind). If our inferences on $\Delta$ allow us to rule out decisions $d^*$ considered to be "unsafe" then weak predictability gives us an important safety guarantee. We can strengthen this notion to define strong predictability, which describes a situation in which all but a single AI decision can be ruled out.

[6] Technically, the AI system may choose to follow an arbitrarily complex policy $\pi$ in the training domain, inducing an (assumed positive) distribution $P_\pi(\mathbf{v})$. It holds that $P_d(\mathbf{V})$ can be computed from any such $P_\pi(\mathbf{V})$ as long as $P_\pi(\mathbf{v}) > 0$ for all $\mathbf{v}$, and vice versa, see e.g. Lem. 1. The positivity assumption $P_d(\mathbf{v}) > 0$ rules out fully deterministic policies in the available data but might be reasonable if the AI spends some time exploring before committing to a course of action.

Definition 5 (Strong Predictability). We say that an AI is strongly predictable under a shift $\sigma$ in situation $\mathbf{C} = \mathbf{c}$ if the optimal decision is uniquely identifiable, i.e., there exists a single decision $d^*$ such that

  $d^* = \arg\max_d \mathbb{E}_{P_{\hat{M}}}[\,Y \mid do(\sigma, d), \mathbf{c}\,]$,  (4)

for any valid SCM $\hat{M}$ describing the AI's internal model.

4.1. AI decisions out-of-domain: interventions

Our first result shows that, in some cases, a subset of AI decisions can be provably ruled out, i.e., the AI is weakly predictable.

Theorem 1. An AI grounded in a domain $M$ is weakly predictable under a shift $\sigma := do(\mathbf{z})$, $\mathbf{Z} \subset \mathbf{V}$, in a context $\mathbf{C} = \mathbf{c}$ if and only if there exists a decision $d^*$ such that

  $\dfrac{\mathbb{E}_{P_d}[Y \mid \mathbf{c}, \mathbf{z}]\, P_d(\mathbf{c}, \mathbf{z})}{P_d(\mathbf{c}, \mathbf{z}) + 1 - P_d(\mathbf{z})} \;-\; \dfrac{\mathbb{E}_{P_{d^*}}[Y \mid \mathbf{c}, \mathbf{z}]\, P_{d^*}(\mathbf{c}, \mathbf{z}) + 1 - P_{d^*}(\mathbf{z})}{P_{d^*}(\mathbf{c}, \mathbf{z}) + 1 - P_{d^*}(\mathbf{z})} \;>\; 0$,

for some $d \neq d^*$. All terms on the l.h.s. are in principle computable from the AI's behaviour.
Loosely speaking, the value of this difference is determined (in part) by $P_{d^*}(\mathbf{z})$: if $\mathbf{Z} = \mathbf{z}$ (the value set by the intervention) is likely under the training distribution, the difference will more likely evaluate to a positive value. The "if and only if" condition means that whenever this inequality does not hold we can construct two SCMs $\hat{M}_1, \hat{M}_2$ for the grounded AI's internal model that generate the observed behaviour $P_d(\mathbf{V})$, $d \in \mathrm{supp}(D)$, but that induce different optimal actions. That is, for all $d \neq d^*$,

  $\mathbb{E}_{P_{\hat{M}_1}}[\,Y \mid do(\mathbf{z}, d), \mathbf{c}\,] > \mathbb{E}_{P_{\hat{M}_1}}[\,Y \mid do(\mathbf{z}, d^*), \mathbf{c}\,]$,  $\mathbb{E}_{P_{\hat{M}_2}}[\,Y \mid do(\mathbf{z}, d), \mathbf{c}\,] < \mathbb{E}_{P_{\hat{M}_2}}[\,Y \mid do(\mathbf{z}, d^*), \mathbf{c}\,]$.

Remark. We can derive a similar condition for strongly predictable AIs by replacing "for some $d \neq d^*$" with "for all $d \neq d^*$" in Thm. 1.

We illustrate Thm. 1 with the following example.

Example 2 (Grounded Medical AI). In Example 1, we have shown that there exists a particular intervened environment in which the AI's intentions cannot be determined, as in principle the AI could believe that either decision is optimal. Is this true in general? Thm. 1 suggests that it depends on the likelihood of different events $P_d(z, y)$ in the observed data. For Example 1, we can show that the medical AI is not weakly predictable as the expression in Thm. 1 is negative for all pairs of decisions. In other words, no decision can be ruled out in general: in some AI internal models $d_1$ is inferior to $d_0$ as

  $\min_{\hat{M} \in \mathbb{M}} \Delta_{d_1 \succ d_0} = P_{d_1}(Z = z, Y = 1) + P_{d_0}(Z = z, Y = 0) - 1 = -0.4$,  (5)

while in others $d_0$ is inferior to $d_1$ as

  $\min_{\hat{M} \in \mathbb{M}} \Delta_{d_0 \succ d_1} = P_{d_1}(Z = z, Y = 0) + P_{d_0}(Z = z, Y = 1) - 1 = -0.8$,  (6)

and we don't know which model the AI system has internalised. □

In this example, AI behaviour does provide some information as it can be constrained to larger values than its a priori minimum $\Delta = -1$, but not enough to rule out a decision completely. Our next result shows that Thm. 1 can be extended to get tight bounds for AI systems that are grounded in multiple environments.
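The numbers in Eqs. (5) and (6) can be reproduced directly from the observational distribution of Example 1. A sketch (variable names ours), enumerating $P_d(Z = z, Y = y)$ for the shift $do(Z = 1)$:

```python
U_VALS = (1, 2, 3, 4, 5)   # latent U of Example 1, uniform

def z_of(u):               # Z <- 1[U in {1,4}]
    return 1 if u in (1, 4) else 0

def y_of(d, u):            # utility mechanism of the environment M
    z = z_of(u)
    if d == 0:
        return z * (u == 4) + (1 - z) * (u in (1, 3, 4))
    return z * (u != 2) + (1 - z) * (u in (2, 4))

def p(d, z, y):
    """P_d(Z = z, Y = y), computed exactly by enumeration."""
    return sum(1 for u in U_VALS if z_of(u) == z and y_of(d, u) == y) / len(U_VALS)

# Lower bounds on the preference gap under do(Z = 1), Eqs. (5)-(6):
lb_d1_over_d0 = p(1, 1, 1) + p(0, 1, 0) - 1   # ~ -0.4
lb_d0_over_d1 = p(1, 1, 0) + p(0, 1, 1) - 1   # ~ -0.8
print(lb_d1_over_d0, lb_d0_over_d1)
```

Both lower bounds are negative, so, exactly as the example concludes, neither decision can be ruled out, even though behaviour does constrain the gap above its a priori minimum of $-1$.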
Theorem 2. Let $\sigma := do(\mathbf{z})$ be a shift on a set of variables $\mathbf{Z} \subset \mathbf{V}$. For $\mathbf{R}_i \subset \mathbf{Z} \subset \mathbf{V}$, $i = 1, \ldots, k$, consider an AI grounded in multiple domains $M_{\mathbf{r}_i} : i = 1, \ldots, k$. The AI is weakly predictable in a context $\mathbf{C} = \mathbf{c}$ under the shift $\sigma := do(\mathbf{z})$ if and only if there exists a decision $d^*$ such that

  $\max_{i, j = 1, \ldots, k} A(\mathbf{r}_i, \mathbf{r}_j) > 0$,  (7)

where

  $A(\mathbf{r}_i, \mathbf{r}_j) := \dfrac{\mathbb{E}_{P_{d, \mathbf{r}_i}}[Y \mid \mathbf{c}, \mathbf{z} \setminus \mathbf{r}_i]\, P_{d, \mathbf{r}_i}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_i)}{P_{d, \mathbf{r}_i}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_i) + 1 - P_{d, \mathbf{r}_i}(\mathbf{z} \setminus \mathbf{r}_i)} \;-\; \dfrac{\mathbb{E}_{P_{d^*, \mathbf{r}_j}}[Y \mid \mathbf{c}, \mathbf{z} \setminus \mathbf{r}_j]\, P_{d^*, \mathbf{r}_j}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_j) + 1 - P_{d^*, \mathbf{r}_j}(\mathbf{z} \setminus \mathbf{r}_j)}{P_{d^*, \mathbf{r}_j}(\mathbf{c}, \mathbf{z} \setminus \mathbf{r}_j) + 1 - P_{d^*, \mathbf{r}_j}(\mathbf{z} \setminus \mathbf{r}_j)}$,

for some $d \neq d^*$.

In this result, $M_{\mathbf{r}_i} : i = 1, \ldots, k$ describes $k$ domains in which experiments on different subsets of $\mathbf{Z}$ have been conducted, i.e., $P_{d, \mathbf{r}_i}(\mathbf{V}) : i = 1, \ldots, k$ is available. This includes possibly the null experiment $\mathbf{R}_i = \emptyset$ that refers to the unaltered domain $M$. Note that grounding in multiple domains is useful for the prediction of the AI's preference gap because the resulting bounds in Thm. 2 are tighter than those in Thm. 1 (this is given formally as Corol. 3 in the Appendix). Fig. 1 illustrates how different assumptions and observations give us information about the possible world models that the AI is operating on, which then has implications for the AI's behaviour out-of-distribution. This knowledge allows us to reduce the uncertainty around the AI's preference gap $\Delta$, and possibly rule out certain actions that are unambiguously sub-optimal out-of-distribution, inferred solely from observed behaviour.

4.2. AI decisions out-of-domain: general shifts

We might wonder about predictability under more general shifts such as an arbitrary change in a subset of the mechanisms $f_Z : Z \in \mathbf{Z}$ and distribution of variables $\mathbf{U}_Z, Z \in \mathbf{Z}$ in $M$. For example, in practice we are likely able to convey to the AI that the mechanisms for a set of variables $\mathbf{Z}$ are expected to change but not know exactly how. For example, demographic properties of patients might change across hospitals.
How could the AI interpret the consequences of such an under-specified shift? To begin to answer this question, the following theorem shows that in the extreme case where the nature of the shift is completely unknown the AI's preference gap is unconstrained.

Theorem 3. Consider an AI grounded in a domain $M$ made aware of an (under-specified) shift on non-empty $\mathbf{Z} \subset \mathbf{V}$. Then the AI is provably not weakly (or strongly) predictable in any context $\mathbf{C} = \mathbf{c}$.

This result means that no decision could ever be ruled out from AI behaviour. We can show moreover that $\min_{\hat{M} \in \mathbb{M}}(\Delta) = -1$ for any pair of decisions, meaning that the observed data (no matter what it is) gives us no information on AI decision-making. In practice, however, it might be realistic to have access to some information in the shifted environment, such as covariate data, i.e., (samples from) $P_{\sigma, d}(\mathbf{c})$, that could be given to the AI for it to update its internal model accordingly (with some abuse of terminology we say that the AI is grounded in $P_{\sigma, d}(\mathbf{c})$). The next theorem shows that this additional information coupled with the AI's behaviour makes the AI more predictable.

Theorem 4. Consider an AI grounded in a domain $M$ and $P_{\sigma, d}(\mathbf{C})$, made aware of a shift $\sigma$ on $\mathbf{Z} \subset \mathbf{C}$. The AI is weakly predictable under this shift in a context $\mathbf{C} = \mathbf{c}$ if there exists a decision $d^*$ such that

  $1 - \dfrac{2 + \mathbb{E}_{P_{d^*}}[Y \mid \mathbf{c}]\, P_{d^*}(\mathbf{c}) - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\, P_d(\mathbf{c})}{P_{\sigma, d^*}(\mathbf{c})} + \dfrac{P_d(\mathbf{c}) - 2 P_d(\mathbf{z})}{P_{\sigma, d^*}(\mathbf{c})} > 0$, for some $d \neq d^*$.

Figure 1 | Grounding and observations in multiple environments constrains the AI's world model and improves our prediction of AI behaviour out-of-distribution (o.o.d.). Approximate grounding is defined in Sec. 5.

This bound is not tight in general, however, meaning that it is possible that the AI is actually predictable in settings where Thm. 4 suggests it might not be.
Example 3 (Shifted Medical AI). The AI from Example 2, originally developed from data primarily from young patients, is now considered for deployment on an older patient population. Their probability of having high blood pressure, $P_\sigma(Z = 1) = 0.9$, is known to be substantially higher than that observed during training, $P(Z = 1) = 0.4$: there is a shift in the underlying mechanisms of $Z$. How do these changes influence the AI's beliefs on $\Delta$? Thm. 4 suggests that the medical AI might not be weakly predictable as the expression evaluates to a negative value for all pairs of decisions. The lower bounds on the AI preference gap are given by $\min_{\hat{M} \in \mathbb{M}} \Delta_{d_1 \succ d_0} \geq -0.55$ and $\min_{\hat{M} \in \mathbb{M}} \Delta_{d_0 \succ d_1} \geq -1$. That is, no decision is always inferior to any other decision. □

4.3. AI's perceived fairness of decisions

An AI's policy, even if optimal on average, has the potential to bring about a state of the world that is intrinsically harmful or unfair. Harm and fairness can be defined relative to a causal model (Beckers et al., 2022; Plecko et al., 2024). This means that a notion of perceived or subjective harm and fairness could be attributed to AI systems that operate according to an (implicit) causal model. As a consequence, it is conceivable that AIs could be held morally accountable for the harm and unfairness that they cause. How might one estimate the AI's beliefs about the harm and unfairness that its decisions cause? To ground our discussion, we consider here explicitly counterfactual accounts of fairness and harm. These appeal to hypothetical situations, imagining "what might have been if ...", that can force us to confront our assumptions and values in a way that our regular thought processes might not.[7] For example, the counterfactual event $(Y_x = 1 \mid X = x_0)$ refers to the outcome $Y = 1$ under an intervention $X = x$ when under normal circumstances $X$ would have evaluated to $x_0$. In the literature, probabilities over counterfactuals emerge from the definition of an SCM.
For a set of (counterfactual) events $(\mathbf{z}_{\mathbf{w}}, \ldots, \mathbf{y}_{\mathbf{x}})$,

  $P(\mathbf{z}_{\mathbf{w}}, \ldots, \mathbf{y}_{\mathbf{x}}) = \int_{\{\mathbf{u} \,:\, \mathbf{Z}_{\mathbf{w}}(\mathbf{u}) = \mathbf{z}_{\mathbf{w}}, \ldots, \mathbf{Y}_{\mathbf{x}}(\mathbf{u}) = \mathbf{y}_{\mathbf{x}}\}} P(\mathbf{u})$.  (8)

[7] Alternative accounts of harm and fairness have been proposed (Barocas and Selbst, 2016; Plecko et al., 2024; Zhang and Bareinboim, 2018), sometimes motivated by scenarios where counterfactual accounts give incomplete results. For some of them, the AI's beliefs can be shown to be similarly constrained by its external behaviour. We provide a longer discussion in Appendix D.

Kusner et al. (2017) made a concrete proposal arguing that an AI's decision is said to be fair towards an individual if, from the AI's perspective, it entails the same utility in the actual world and in a counterfactual world where the individual belonged to a different group (defined by a sensitive attribute, e.g., gender, race). We adapt this notion to define an AI's counterfactual fairness gap.

Definition 6 (Counterfactual Fairness Gap). Let $Z \in \{z_0, z_1\}$ be a protected attribute and $z_0$ a baseline value of $Z$. For a given utility $Y$, define an AI's counterfactual fairness gap relative to a decision $d$, in a given context $\mathbf{c}$, as

  $\Upsilon(d, \mathbf{c}) := \mathbb{E}_{\hat{P}}[\,Y_{d, z_1} \mid z_0, \mathbf{c}\,] - \mathbb{E}_{\hat{P}}[\,Y_d \mid z_0, \mathbf{c}\,]$.  (9)

We say that an AI "intends" to be fair with respect to an attribute $Z$ if under any context $\mathbf{C} = \mathbf{c}$ and decision $D = d$ the counterfactual fairness gap $\Upsilon$ evaluates to 0. This means that, under its own internal world model, changing the value of $Z$ on the subset of situations with context $\mathbf{c}$ in which $Z$ was observed to be $z_0$ does not change the AI's expected utility. In the following theorem we show that, unfortunately, the answer to this question is impossible to obtain given only the AI's external behaviour.

Theorem 5. Consider an agent with utility $Y$ grounded in a domain $M$. Then,

  $-\mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}] \;\leq\; \Upsilon(d, \mathbf{c}) \;\leq\; 1 - \mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]$.  (10)

This bound is tight.
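The bound of Thm. 5 depends on a single observable quantity, so it is easy to compute. A small helper (the function name is ours) makes the point concrete: whatever value of $\mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]$ is observed, the feasible interval for $\Upsilon$ has width 1 and always contains 0.

```python
def fairness_gap_bounds(e_y_given_z0_c):
    """Bounds on the counterfactual fairness gap Upsilon(d, c) from Thm. 5:
    -E[Y | z0, c] <= Upsilon(d, c) <= 1 - E[Y | z0, c].
    The only input is the observable conditional expectation of Y."""
    assert 0.0 <= e_y_given_z0_c <= 1.0   # Y has support in [0, 1]
    return -e_y_given_z0_c, 1.0 - e_y_given_z0_c

lo, hi = fairness_gap_bounds(0.75)
print(lo, hi)   # interval of width 1 that contains 0
```

Since $\Upsilon = 0$ always lies inside this interval, no observation can certify that the AI intends to be fair or unfair, which is exactly the under-determination the theorem formalises.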
The bound is tight in the sense that for each context, decision, and baseline attribute, we can find compatible models for which the equalities hold. The counterfactual fairness gap $\Upsilon$ is under-constrained. Since $\Upsilon = 0$ is consistent with any external behaviour, we can never conclude that the AI system "intends" to be unfair. Moreover, since the width of the bound is equal to 1, we can also never conclude that the AI is anywhere "close" to being fair, according to this counterfactual criterion.

4.4. AI's perceived harm of decisions

Prominent definitions of harm are similarly counterfactual in nature: the counterfactual comparative account of harm defines a decision $d$ to harm a person if and only if she would have been better off had $d$ not been taken (Beckers et al., 2022; Hanser, 2008; Mueller and Pearl, 2023; Richens et al., 2022). It is a contrast between events in hypothetical scenarios in which different decisions are made. Here, we quantify how "well off" a particular situation $\mathbf{W} = \mathbf{w}$ is with a binary utility variable $Y \leftarrow f_Y(\mathbf{W}, \mathbf{U}_Y) \in \{0, 1\}$ that we assume is tracked in experiments, i.e., $Y \in \mathbf{V}$. The following definition describes this notion of harm mathematically.

Definition 7 (Counterfactual Harm Gap). Consider an AI with internal model $\hat{\mathcal{M}}$ and utility $Y \in \{0, 1\}$. The AI's expected counterfactual harm of a decision $d_1$ with respect to a baseline $d_0$, in context $\mathbf{c}$, is

$$\Omega(d_1, d_0, \mathbf{c}) := \mathbb{E}_{\hat{P}}\big[\max\{0, Y_{d_0} - Y_{d_1}\} \mid \mathbf{c}\big]. \quad (11)$$

Operationally, the counterfactual harm gap $\Omega$ is the expected increase in utility had the AI made a default decision $d_0$ instead of a different decision $d_1$ that the AI is contemplating. Counterfactual harm is therefore lower bounded at 0, with larger values indicating more harm. The following theorem shows that external behaviour constrains the AI's perception of its counterfactual harm.

Theorem 6. Consider an AI with utility $Y$ grounded in a domain $\mathcal{M}$.
Then,

$$\max\big\{0,\ \mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}] - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\big\} \leq \Omega(d, d_0, \mathbf{c}) \leq \min\big\{\mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}],\ 1 - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\big\}.$$

This bound is tight.

This result is an extension of bounds on the probability of causation given by Pearl (1999) and Tian and Pearl (2000). It suggests that an AI's beliefs about the harm that its decisions cause can be inferred approximately from data.

5. Discussion: The "Practical" Limits of Behavioural Data

The inductive biases implied by causal models and rational behaviour are powerful constraints on AI behaviour. But they might not capture the practical limitations of AI decision-making. In this section we show that grounding, expected utility maximization, observed data, etc., can be relaxed in practice.

5.1. Approximate grounding

Grounding implies that the AI's beliefs about the likelihood of events in the environment match the observed probabilities. In practice, it might be reasonable to allow for some amount of error, and consider a notion of "approximate" grounding.

Definition 8 (Approximate Grounding). Let $\hat{\mathcal{M}}$ represent the AI's internal model. Given a discrepancy measure $\psi$, we say that the AI is approximately grounded in a domain $\mathcal{M}$ to a degree $\delta > 0$ if $\psi(\hat{P}_d, P_d) \leq \delta$ for any $d \in \mathrm{supp}(D)$.

The choice of $\psi$ and $\delta$, in practice, depends on what error model is reasonable for the AI and problem at hand (we give an example below). Approximate grounding specifies a looser relationship between our observations of AI behaviour $P$ and what might be going on in the AI's "mind" $\hat{P}$. For example, the world model of an approximately grounded AI is compatible with any distribution in the set $\{\hat{P}_d : \psi(\hat{P}_d, P_d) \leq \delta\}$. A more conservative bound (than Thm. 1) on predictability can be derived for AIs that are approximately grounded in an environment $\mathcal{M}$.
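Definition 8 leaves the discrepancy measure $\psi$ open. A minimal sketch of an approximate-grounding check with $\psi$ set to total variation distance follows; the two discrete distributions are hypothetical, chosen only to illustrate the membership test $\psi(\hat{P}_d, P_d) \leq \delta$.

```python
# Checking Definition 8 with psi = total variation distance on a discrete
# support. Both distributions below are hypothetical illustrations.
def total_variation(p: dict, q: dict) -> float:
    """TV distance: half the L1 distance between the probability vectors."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def approximately_grounded(P_hat_d: dict, P_d: dict, delta: float) -> bool:
    """True if the AI's belief P_hat_d is within delta of the observed P_d."""
    return total_variation(P_hat_d, P_d) <= delta

# Observed behaviour P_d over (z, y) vs. a candidate internal belief P_hat_d.
P_d     = {(0, 0): 0.3,  (0, 1): 0.3,  (1, 0): 0.2, (1, 1): 0.2}
P_hat_d = {(0, 0): 0.25, (0, 1): 0.35, (1, 0): 0.2, (1, 1): 0.2}
assert approximately_grounded(P_hat_d, P_d, delta=0.1)   # TV = 0.05
assert not approximately_grounded(P_hat_d, P_d, delta=0.01)
```

The same check applies to each decision $d$ in the support of $D$, per Definition 8.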
Corollary 1. Given a discrepancy measure $\psi$, an AI approximately grounded in a domain $\mathcal{M}$ is weakly predictable in a context $\mathbf{C} = \mathbf{c}$ under a shift $\sigma := do(\mathbf{z})$, $\mathbf{Z} \subset \mathbf{V}$, if and only if there exists a decision $d^*$ such that

$$\min_{\hat{P} : \psi(\hat{P}, P) \leq \delta} \left( \frac{\mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}, \mathbf{z}]\, \hat{P}_d(\mathbf{c}, \mathbf{z})}{\hat{P}_d(\mathbf{c}, \mathbf{z}) + 1 - \hat{P}_d(\mathbf{z})} - \frac{\mathbb{E}_{\hat{P}_{d^*}}[Y \mid \mathbf{c}, \mathbf{z}]\, \hat{P}_{d^*}(\mathbf{c}, \mathbf{z}) + 1 - \hat{P}_{d^*}(\mathbf{z})}{\hat{P}_{d^*}(\mathbf{c}, \mathbf{z}) + 1 - \hat{P}_{d^*}(\mathbf{z})} \right) > 0, \quad \text{for some } d \neq d^*. \quad (12)$$

The same proof strategy as in Corol. 1 can be applied to all bounds on behaviour in Sec. 4 to obtain results under approximate grounding. We can compare the two notions of grounding quantitatively with an example.

Example 4 (Approximately Grounded Medical AI). The results in Example 2 exploit the grounding relationship $\hat{P}_d(\mathbf{V}) = P_d(\mathbf{V})$ in $\mathcal{M}$. We might want to relax the equality by assuming that the AI is instead approximately grounded. Minimum values of the AI's preference gap $\Delta$ would then be given by

$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \Delta_{d_1 \succ d_0} = \min_{\hat{P} : \psi(\hat{P}, P) \leq \delta} \left[ \hat{P}_{d_1}(Z=z, Y=1) - \hat{P}_{d_0}(Z=z, Y=1) + \hat{P}_{d_0}(Z=z) - 1 \right], \quad (13)$$

$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \Delta_{d_0 \succ d_1} = \min_{\hat{P} : \psi(\hat{P}, P) \leq \delta} \left[ \hat{P}_{d_0}(Z=z, Y=1) - \hat{P}_{d_1}(Z=z, Y=1) + \hat{P}_{d_1}(Z=z) - 1 \right]. \quad (14)$$

These terms now capture an additional source of uncertainty, due to external behaviour more loosely constraining $\hat{\mathcal{M}}$. An empirical estimate of this quantity could be obtained by sampling distributions $\hat{P}$ close to $P$ according to the distributional distance $\psi$ and threshold $\delta$, and taking the empirical minimum, as follows. Given that the data $(z, d, y) \sim P$ is discretely valued in this example, we could sample probability vectors $\{\hat{P}_d(z, y)\}_{z,y}$ from a Dirichlet distribution centred at the vector $\{P_d(z, y)\}_{z,y}$ with a small variance. The distance of each proposal from the reference distribution could then be evaluated according to $\psi$, and each proposal either accepted or rejected using $\delta$. For illustration, we implement a version of this idea setting $\psi$ to be the total variation distance and $\delta = 0.1$.
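The rejection-sampling procedure just described can be sketched as follows. This is a simplified version: it constrains each decision's distribution separately by total variation, which is an assumption on how $\psi$ is applied, so the resulting minimum need not match the exact numbers of Example 4. The reference probabilities are read off Tables 1-2 in the appendix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed interventional distributions over (Z, Y) for the Medical AI,
# read off Tables 1-2 in the appendix; order: (0,0), (0,1), (1,0), (1,1).
P_d0 = np.array([0.4, 0.2, 0.2, 0.2])
P_d1 = np.array([0.4, 0.2, 0.0, 0.4])

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def gap_objective(Q_d1, Q_d0):
    # Eq. (13)-style objective with z = 1:
    # Q_d1(Z=1,Y=1) - Q_d0(Z=1,Y=1) + Q_d0(Z=1) - 1
    return Q_d1[3] - Q_d0[3] + (Q_d0[2] + Q_d0[3]) - 1.0

def empirical_min(P_d0, P_d1, delta=0.1, conc=100.0, n=20_000):
    """Sample beliefs near (P_d0, P_d1) from Dirichlets centred at the observed
    vectors, reject proposals farther than delta in total variation, and return
    the empirical minimum of the preference-gap objective."""
    best = np.inf
    eps = 1e-6  # Dirichlet parameters must be strictly positive
    for _ in range(n):
        Q_d0 = rng.dirichlet(conc * P_d0 + eps)
        Q_d1 = rng.dirichlet(conc * P_d1 + eps)
        if tv(Q_d0, P_d0) <= delta and tv(Q_d1, P_d1) <= delta:
            best = min(best, gap_objective(Q_d1, Q_d0))
    return best

# Below the exactly-grounded value -0.4, but provably no lower than -0.6:
# any single coordinate of an accepted proposal deviates by at most delta.
print(empirical_min(P_d0, P_d1))
```

The empirical minimum lands below the grounded value $-0.4$ and above the hard floor $-0.6$ implied by the per-coordinate deviation limit; the paper's reported $-0.55$ depends on its exact choice of $\psi$.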
The two minimum values now evaluate to $-0.55$ and $-0.88$, respectively, slightly lower than under the assumption of exact grounding in Example 2 (which evaluate to $-0.4$ and $-0.8$, respectively). □

5.2. Approximate expected utility maximization

In real-world environments it might be appropriate to treat the rationality of AI systems as "approximate" or "bounded" in some sense: AIs might choose actions that only approximately maximize expected utility (rather than exactly maximize it), given their model. Mirroring Eq. (3), we might say that a "bounded" AI is weakly predictable in some context $\mathbf{C} = \mathbf{c}$ if and only if there exists a decision $d^*$ such that

$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \left( \Delta_{d \succ d^*} \right) > \lambda, \qquad \Delta := \mathbb{E}_{P_{\hat{\mathcal{M}}}}\big[Y \mid do(\sigma, d), \mathbf{c}\big] - \mathbb{E}_{P_{\hat{\mathcal{M}}}}\big[Y \mid do(\sigma, d^*), \mathbf{c}\big], \quad \text{for some } d \neq d^*. \quad (15)$$

Here $\lambda > 0$ is a constant that determines how much better a decision $d$ needs to be relative to a decision $d^*$ for the AI to reliably rule out $d^*$ in favour of others. This representation appeals to the idea of imperfect discrimination, suggesting that the AI discerns between two alternatives only if they yield sufficiently different utilities (Dziewulski, 2021). We might tighten our conditions on the observational data to reflect this behaviour and obtain a new set of results describing when AIs can be expected to be predictable. For instance, as a corollary to Thm. 1 we have the following.

Corollary 2. An AI grounded in a domain $\mathcal{M}$ and bounded in the sense of Eq. (15) is weakly predictable in some context $\mathbf{C} = \mathbf{c}$ under a shift $\sigma := do(\mathbf{z})$, $\mathbf{Z} \subset \mathbf{V}$, if and only if there exists a decision $d^*$ such that

$$\frac{\mathbb{E}_{P_d}[Y \mid \mathbf{c}, \mathbf{z}]\, P_d(\mathbf{c}, \mathbf{z})}{P_d(\mathbf{c}, \mathbf{z}) + 1 - P_d(\mathbf{z})} - \frac{\mathbb{E}_{P_{d^*}}[Y \mid \mathbf{c}, \mathbf{z}]\, P_{d^*}(\mathbf{c}, \mathbf{z}) + 1 - P_{d^*}(\mathbf{z})}{P_{d^*}(\mathbf{c}, \mathbf{z}) + 1 - P_{d^*}(\mathbf{z})} > \lambda, \quad \text{for some } d \neq d^*. \quad (16)$$

Note the addition of the scalar $\lambda > 0$ in the inequality. Similar corollaries could be stated for all results in Sec. 4.

5.3. Approximate inner alignment

A further assumption embedded in our results so far is the exact observation of an AI's utility in the data.
In general, we might expect an AI system to have internalized a proxy $Y^*$ that reflects properties correlated with, but distinct from, the observed utility $Y$ we ultimately wish to optimize, a setting we refer to as approximate inner alignment (Hubinger et al., 2019).

Figure 2 | Building on Fig. 1, AIs that are approximate expected utility maximizers (EUM), that internalize proxy objectives, or that obey known causal structure carve out different constraints on the set of possible AI models (from an observer's perspective), which may be exploited to improve our prediction of AI choices out-of-distribution (o.o.d.).

As observers, we face a problem of partial observability: we do not have empirical access to the AI's actual utility function $Y^*$, and notions such as the preference gap $\Delta$ are therefore not computable. Without any assumptions on the relationship between $Y$ and $Y^*$, the preference gap $\Delta$ will be unconstrained and no inference about the AI's intended action out-of-distribution is possible. However, the observed $Y$ will typically be statistically related to the AI's implicit utility $Y^*$, especially if optimizing for $Y^*$ serves the AI well during training, where success is measured by the observed values of $Y$. Under assumptions specifying how "statistically related" observed and proxy utility objectives are, we can expect that wider but possibly informative bounds could still be derived for the AI's beliefs. To show this in a simple setting, consider again the medical AI example.

Example 5 (Partial Observability). Imagine that the Medical AI in Example 2 has internalized its own concept of an individual's disease progression, $Y^*$. It is implicitly optimizing for that internal construction instead of the intended disease bio-marker $Y$. We know, or can assume, that the observed $Y$ is closely correlated with $Y^*$: in particular, that $P_d(Y^* = 1 \mid Y = 1, Z = z) \geq \alpha$ for some high value of $\alpha$ and all decisions $d$ and situations $z$.
In words, whenever the bio-marker suggests health ($Y = 1$), with high probability the AI's interpretation also suggests health ($Y^* = 1$). This then constrains the possible values of $\Delta$ (under an intervention $Z \leftarrow 1$), as $P_d(Y^* = 1 \mid Z = z)$ is no longer arbitrary. In fact, one can show that

$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \Delta_{d_1 \succ d_0} \geq \alpha\, P_{d_1}(Z=z, Y=1) - 1, \quad (17)$$
$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \Delta_{d_0 \succ d_1} \geq \alpha\, P_{d_0}(Z=z, Y=1) - 1. \quad (18)$$

With $\alpha = 0.9$ the bounds evaluate to $-0.64$ and $-0.82$ respectively, slightly lower than in Example 2. We can also verify that if $\alpha = 0$, i.e., we know nothing about the relationship between $Y$ and $Y^*$, the bounds become uninformative, evaluating to $-1$. □

This suggests that behaviour out-of-distribution in (sufficiently constrained) settings of approximate inner alignment could be bounded in principle. Importantly, as the example shows, with the proposed framework we do not require knowing the relationship between $Y$ and $Y^*$ out-of-distribution: that uncertainty is naturally folded into the bounds.

5.4. Assumptions on structure

The uncertainty in AI decision-making out-of-distribution is ultimately a consequence of our lack of information about the AI's underlying cognition and the internal mechanisms that produce a decision in a given situation, i.e., $\hat{\mathcal{M}}$. In the causal inference literature, a common inductive bias to improve upon the "data-driven" bounds proposed so far is to assume qualitative knowledge about the underlying mechanisms in the form of a causal diagram, see e.g. (Pearl, 2009, Chapter 3). Here we illustrate how mild restrictions on the location of unobserved confounders in $\mathcal{M}$ lead to tighter bounds.

Example 6 (Partial Unconfoundedness). Consider again our grounded medical AI from Example 2. We might have reason to believe that the association between the intervened variable $Z$ and the utility $Y$ is conditionally unconfounded, meaning that there exists a variable $W \in \{w_0, w_1\}$, $W \in \mathbf{V}$, such that $P_{d,z}(y \mid w) = P_d(y \mid w, z)$.
This restriction goes beyond grounding and asserts an equality between probabilities under different shifts, which could be communicated to the AI for it to update its world model $\hat{\mathcal{M}}$. We could then show that

$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \Delta_{d_1 \succ d_0} \geq \big[1 - P_{d_1}(Z=z, W=w_1)\big]\, P_{d_1}(Y=1 \mid Z=z, W=w_0) - P_{d_0}(Y=1, Z=z) + P_{d_1}(Y=1, Z=z, W=w_1) - \big[1 - P_{d_0}(Z=z)\big]\, P_{d_0}(Y=1 \mid Z=z, W=w_1), \quad (19)$$

$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \Delta_{d_0 \succ d_1} \geq \big[1 - P_{d_0}(Z=z, W=w_1)\big]\, P_{d_0}(Y=1 \mid Z=z, W=w_0) - P_{d_1}(Y=1, Z=z) + P_{d_0}(Y=1, Z=z, W=w_1) - \big[1 - P_{d_1}(Z=z)\big]\, P_{d_1}(Y=1 \mid Z=z, W=w_1). \quad (20)$$

We show in Appendix A that these bounds are strictly tighter than the ones given in Example 2. □

Systematic bounds with access to a causal diagram have been given by, e.g., Jalaldoust et al. (2024) and Zhang et al. (2021), and could be explored further for making inferences about AI decision-making. Fig. 2 illustrates how some of these relaxations can be understood within our model-based formalism.

6. Conclusion

An important consideration to safely interact with AI systems is to form expectations as to how they might act in the future. In this paper, we answer this question under the assumption that AI behaviour can be tracked by a well-specified collection of causal mechanisms (a structural causal model) that represents the AI's world model. This abstraction implies a consistency in behaviour that can in principle be exploited to infer the AI's choice of action in novel environments, out-of-distribution. Building on the theory of causal identification, we provide general bounds on AI decision-making that represent the theoretical limits of what can be inferred about AI behaviour given our framework. We hope our results can help justify the claim that the design and inference of world models is important to ensure AIs act safely and beneficially.

Acknowledgements

Thanks to David Lindner and Damiano Fornasiere for comments on a draft of this paper.

References

Pieter Abbeel and Andrew Y Ng.
Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
Sydney N Afriat. The construction of utility functions from expenditure data. International Economic Review, 8:67-77, 1967.
Kareem Amin and Satinder Singh. Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.
Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171-1176, 1997.
Elias Bareinboim, Juan D. Correa, Duligur Ibeling, and Thomas Icard. On Pearl's hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 507-556. Association for Computing Machinery, NY, USA, 1st edition, 2022.
Solon Barocas and Andrew D Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.
Sander Beckers, Hana Chockler, and Joseph Halpern. A causal analysis of harm. Advances in Neural Information Processing Systems, 35:2365-2376, 2022.
Alexis Bellot. Towards bounding causal effects under Markov equivalence. In The 40th Conference on Uncertainty in Artificial Intelligence. PMLR, 2024.
Alexis Bellot and Silvia Chiappa. Towards estimating bounds on the effect of policies under unobserved confounding. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
Alexis Bellot, Alan Malek, and Silvia Chiappa. Transportability for bandits with data from different environments. Advances in Neural Information Processing Systems, 36, 2024.
Yoshua Bengio, Michael K Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, and Younesse Kaddar. Can a Bayesian oracle prevent harm from an agent? arXiv preprint arXiv:2408.05284, 2024.
Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist AI offer a safer path? arXiv preprint arXiv:2502.15657, 2025.
David Maxwell Chickering and Judea Pearl. A clinician's tool for analyzing non-compliance. In Proceedings of the National Conference on Artificial Intelligence, pages 1269-1276, 1996.
Juan Correa and Elias Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10093-10100, 2020a.
Juan Correa and Elias Bareinboim. General transportability of soft interventions: Completeness results. Advances in Neural Information Processing Systems, 33:10902-10912, 2020b.
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, et al. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624, 2024.
Donald Davidson. Actions, reasons, and causes. The Journal of Philosophy, 60(23):685-700, 1963.
Daniel C Dennett. The Intentional Stance. MIT Press, 1989.
Daniel C Dennett. From Bacteria to Bach and Back: The Evolution of Minds. W. W. Norton & Company, 2017.
Pawel Dziewulski. A comprehensive revealed preference approach to approximate utility maximisation. Tech report, 2021.
Tom Everitt, Ryan Carey, Eric D Langlois, Pedro A Ortega, and Shane Legg. Agent incentives: A causal perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11487-11495, 2021.
Noam Finkelstein and Ilya Shpitser. Deriving bounds and inequality constraints using logical relations among counterfactuals.
In Conference on Uncertainty in Artificial Intelligence, pages 1348-1357. PMLR, 2020.
Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574-9586, 2021.
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, pages 160-187. PMLR, 2024.
Simon Goldstein and Benjamin A Levinstein. Does ChatGPT have a mind? arXiv preprint arXiv:2407.11015, 2024.
Wes Gurnee and Max Tegmark. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.
Joseph Y Halpern and Evan Piermont. Subjective causality. arXiv preprint arXiv:2401.10937, 2024.
Matthew Hanser. The metaphysics of harm. Philosophy and Phenomenological Research, 77(2):421-450, 2008.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019.
Kasra Jalaldoust, Alexis Bellot, and Elias Bareinboim. Partial transportability for domain generalization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
Richard C Jeffrey. The Logic of Decision. University of Chicago Press, 1990.
Andrew Jesson, Sören Mindermann, Yarin Gal, and Uri Shalit. Quantifying ignorance in individual-level causal-effect estimates under hidden confounding. In International Conference on Machine Learning, pages 4829-4838. PMLR, 2021.
Shalmali Joshi, Junzhe Zhang, and Elias Bareinboim. Towards safe policy learning under partial identifiability: A causal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13004-13012, 2024.
James M Joyce. The Foundations of Causal Decision Theory. Cambridge University Press, 1999.
Kuno Kim, Shivam Garg, Kirankumar Shiragur, and Stefano Ermon. Reward identification in inverse reinforcement learning. In International Conference on Machine Learning, pages 5496-5505. PMLR, 2021.
Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 2017.
Shane Legg. System 2 safety. https://w.youtube.com/watch?v=8IUIGVVLbCg&ab_channel=FAR%E2%80%A4AI, 2023. Accessed: 2025-01-24.
Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.
Charles F Manski. The structure of random utility models. Theory and Decision, 8(3):229, 1977.
Charles F Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319-323, 1990.
Scott Mueller and Judea Pearl. Personalized decision making: a conceptual introduction. Journal of Causal Inference, 11(1):20220050, 2023.
Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
Judea Pearl. Probabilities of causation: three counterfactual interpretations and their identification. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 317-372, 1999.
Judea Pearl. Causality. Cambridge University Press, 2009.
Drago Plecko, Elias Bareinboim, et al. Causal fairness analysis: a causal toolkit for fair machine learning. Foundations and Trends in Machine Learning, 17(3):304-589, 2024.
Jonathan Richens and Tom Everitt. Robust agents learn causal world models. arXiv preprint arXiv:2402.10877, 2024.
Jonathan Richens, Rory Beard, and Daniel H Thompson. Counterfactual harm. Advances in Neural Information Processing Systems, 35:36350-36365, 2022.
James M Robins.
The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. Health Service Research Methodology: A Focus on AIDS, pages 113-159, 1989.
Paul R Rosenbaum. Design of Observational Studies, volume 10. Springer, 2010.
Constantin A Rothkopf and Christos Dimitrakakis. Preference elicitation and inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part I, pages 34-48. Springer, 2011.
Leonard J Savage. The Foundations of Statistics. Courier Corporation, 1972.
Eric Schwitzgebel. Belief. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Spring 2024 edition, 2024.
Murray Shanahan. Talking about large language models. Communications of the ACM, 67(2):68-79, 2024.
Joar Skalse and Alessandro Abate. Misspecification in inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15136-15143, 2023.
Joar Max Viktor Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In International Conference on Machine Learning, pages 32033-32058. PMLR, 2023.
Zhiqiang Tan. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619-1637, 2006.
Jin Tian and Judea Pearl. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287-313, 2000.
Shubham Toshniwal, Sam Wiseman, Karen Livescu, and Kevin Gimpel. Chess as a testbed for language model state tracking.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11385-11393, 2022.
Keyon Vafa, Justin Y Chen, Jon Kleinberg, Sendhil Mullainathan, and Ashesh Rambachan. Evaluating the world model implicit in a generative model. arXiv preprint arXiv:2406.03689, 2024.
Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the conditional and average treatment effect with unobserved confounding factors. arXiv preprint arXiv:1808.09521, 2018.
Junzhe Zhang. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In International Conference on Machine Learning, pages 11012-11022. PMLR, 2020.
Junzhe Zhang and Elias Bareinboim. Fairness in decision-making: the causal explanation formula. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Junzhe Zhang and Elias Bareinboim. Bounding causal effects on continuous outcome. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12207-12215, 2021.
Junzhe Zhang, Jin Tian, and Elias Bareinboim. Partial counterfactual identification from observational and experimental data. arXiv preprint arXiv:2110.05690, 2021.

A. Discussion – Examples

In this section, we provide additional details to better appreciate the examples provided in the main body of this work. In Example 1, we introduce two SCMs that might serve as internal world models for an AI agent but that induce different optimal decisions if evaluated out-of-distribution. Let $\mathcal{M}^1_d := \langle \mathbf{V}: \{D, Z, Y\}, \mathbf{U}: \{U\}, \mathcal{F}^1, P \rangle$ be given by

$$\mathcal{F}^1 := \begin{cases} D \leftarrow d, \\ Z \leftarrow \mathbb{1}_{U = 1 \text{ or } 4}, \\ Y \leftarrow \begin{cases} Z \cdot \mathbb{1}_{U = 4} + (1 - Z) \cdot \mathbb{1}_{U = 1, 3 \text{ or } 4} & \text{if } d = 0, \\ Z \cdot \mathbb{1}_{U \neq 2} + (1 - Z) \cdot \mathbb{1}_{U = 2 \text{ or } 4} & \text{if } d = 1, \end{cases} \end{cases} \qquad P(U = u) = 0.2 \text{ for } u \in \{1, 2, 3, 4, 5\},$$

and let $\mathcal{M}^2_d := \langle \mathbf{V}: \{D, Z, Y\}, \mathbf{U}: \{U\}, \mathcal{F}^2, P \rangle$ be given by

$$\mathcal{F}^2 := \begin{cases} D \leftarrow d, \\ Z \leftarrow \mathbb{1}_{U = 1 \text{ or } 4}, \\ Y \leftarrow \begin{cases} Z \cdot \mathbb{1}_{U \neq 1} + (1 - Z) \cdot \mathbb{1}_{U = 3 \text{ or } 4} & \text{if } d = 0, \\ Z \cdot \mathbb{1}_{U = 1 \text{ or } 4} + (1 - Z) \cdot \mathbb{1}_{U = 1 \text{ or } 2} & \text{if } d = 1, \end{cases} \end{cases} \qquad P(U = u) = 0.2 \text{ for } u \in \{1, 2, 3, 4, 5\}.$$
The endogenous variables $\mathbf{V}: \{D, Z, Y\}$ represent, respectively, the medical treatment $D$, a clinical outcome of interest $Y$, and an auxiliary variable $Z$. The exogenous variable $U$ is a latent variable that influences the values of $Z$ and $Y$ obtained in experiments.

Under the definition of an SCM, these specifications induce a mapping of events in the space of $P(\mathbf{U})$ to $P(\mathbf{V})$. In the context of $\mathcal{M}^1$ and $\mathcal{M}^2$, each entry in Tables 1 and 2 corresponds to an event in the space of $\mathbf{U}$ and a corresponding realisation of $\mathbf{V}$ according to the functions $\mathcal{F}^1$ and $\mathcal{F}^2$. A particular probability can be evaluated according to $\mathcal{M}^1$ and $\mathcal{M}^2$, for example,

$$P^{\mathcal{M}^1}_{d=1}(Z=1, Y=1) = \sum_{Z_{d=1}(\mathbf{u})=1,\ Y_{d=1}(\mathbf{u})=1} P(\mathbf{u}) = P(U = 1 \text{ or } 4) = 0.4, \quad (21)$$

which is just the sum of the probabilities of the events in the space of $\mathbf{U}$ consistent with the events $(Z_{d=1} = 1, Y_{d=1} = 1)$. Since both tables lead to the same realisations of events $\mathbf{V} = \mathbf{v}$, we can conclude that probabilities of the form $P_d(z, y)$ evaluate to the same values under $\mathcal{M}^1$ and $\mathcal{M}^2$. That is, both models are valid internal representations of AI models that are grounded in an environment with data sampled according to $P_d(z, y)$.

u   D_{d=0}  Z_{d=0}  Y_{d=0}  D_{d=1}  Z_{d=1}  Y_{d=1}  P(u)
1      0        1        0        1        1        1      0.2
2      0        0        0        1        0        1      0.2
3      0        0        1        1        0        0      0.2
4      0        1        1        1        1        1      0.2
5      0        0        0        1        0        0      0.2

Table 1 | Mapping of events in the space of $\mathbf{U}$ to $\mathbf{V}$ in the context of $\mathcal{M}^1$.

u   D_{d=0}  Z_{d=0}  Y_{d=0}  D_{d=1}  Z_{d=1}  Y_{d=1}  P(u)
1      0        1        0        1        1        1      0.2
2      0        0        0        1        0        1      0.2
3      0        0        1        1        0        0      0.2
4      0        1        1        1        1        1      0.2
5      0        0        0        1        0        0      0.2

Table 2 | Mapping of events in the space of $\mathbf{U}$ to $\mathbf{V}$ in the context of $\mathcal{M}^2$.

We could similarly evaluate probability expressions under different sub-models of $\mathcal{M}^1$ and $\mathcal{M}^2$. In particular, consider the sub-models obtained by fixing $Z \leftarrow 1$, given by $\mathcal{M}^1_{d,z=1}$ and $\mathcal{M}^2_{d,z=1}$, with the following updated structural functions:

$$\mathcal{F}^{1,z} := \begin{cases} D \leftarrow d, \\ Z \leftarrow 1, \\ Y \leftarrow \begin{cases} Z \cdot \mathbb{1}_{U = 4} + (1 - Z) \cdot \mathbb{1}_{U = 1, 3 \text{ or } 4} & \text{if } d = 0, \\ Z \cdot \mathbb{1}_{U \neq 2} + (1 - Z) \cdot \mathbb{1}_{U = 2 \text{ or } 4} & \text{if } d = 1, \end{cases} \end{cases}$$

and

$$\mathcal{F}^{2,z} := \begin{cases} D \leftarrow d, \\ Z \leftarrow 1, \\ Y \leftarrow \begin{cases} Z \cdot \mathbb{1}_{U \neq 1} + (1 - Z) \cdot \mathbb{1}_{U = 3 \text{ or } 4} & \text{if } d = 0, \\ Z \cdot \mathbb{1}_{U = 1 \text{ or } 4} + (1 - Z) \cdot \mathbb{1}_{U = 1 \text{ or } 2} & \text{if } d = 1. \end{cases} \end{cases}$$
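These computations can be checked mechanically. The sketch below transcribes the structural functions $\mathcal{F}^1$ and $\mathcal{F}^2$ directly and verifies both that the two models are observationally indistinguishable and that they disagree in the sub-models under $Z \leftarrow 1$.

```python
# Direct transcription of the structural functions of M1 and M2 above;
# U is uniform on {1,...,5} with P(u) = 0.2.
U_VALUES = [1, 2, 3, 4, 5]

def M1(u, d, z=None):
    Z = 1 if u in (1, 4) else 0
    if z is not None:  # sub-model under the intervention Z <- z
        Z = z
    if d == 0:
        Y = Z * (u == 4) + (1 - Z) * (u in (1, 3, 4))
    else:
        Y = Z * (u != 2) + (1 - Z) * (u in (2, 4))
    return Z, int(Y)

def M2(u, d, z=None):
    Z = 1 if u in (1, 4) else 0
    if z is not None:
        Z = z
    if d == 0:
        Y = Z * (u != 1) + (1 - Z) * (u in (3, 4))
    else:
        Y = Z * (u in (1, 4)) + (1 - Z) * (u in (1, 2))
    return Z, int(Y)

# Observationally indistinguishable: (Z, Y) agrees pointwise for every u, d,
# so P_d(z, y) is identical under both models (Tables 1 and 2 coincide) ...
for d in (0, 1):
    for u in U_VALUES:
        assert M1(u, d) == M2(u, d)

# ... yet the sub-models under do(Z = 1) disagree:
def P_Y1(M, d):
    return sum(0.2 for u in U_VALUES if M(u, d, z=1)[1] == 1)

assert abs(P_Y1(M1, 1) - 0.8) < 1e-9 and abs(P_Y1(M1, 0) - 0.2) < 1e-9
assert abs(P_Y1(M2, 1) - 0.4) < 1e-9 and abs(P_Y1(M2, 0) - 0.8) < 1e-9
```

The assertions reproduce Eqs. (22)-(25): under $\mathcal{M}^1$ decision $d=1$ is favoured out-of-distribution, while under $\mathcal{M}^2$ decision $d=0$ is.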
Probabilities of events under these two sub-models might now take different values. For example,

$$P^{\mathcal{M}^1}_{d=1,z=1}(Y=1) = \sum_{Y_{d=1,z=1}(\mathbf{u})=1} P(\mathbf{u}) = P(U \neq 2) = 0.8, \quad (22)$$
$$P^{\mathcal{M}^2}_{d=1,z=1}(Y=1) = \sum_{Y_{d=1,z=1}(\mathbf{u})=1} P(\mathbf{u}) = P(U = 1 \text{ or } 4) = 0.4, \quad (23)$$

and similarly,

$$P^{\mathcal{M}^1}_{d=0,z=1}(Y=1) = \sum_{Y_{d=0,z=1}(\mathbf{u})=1} P(\mathbf{u}) = P(U = 4) = 0.2, \quad (24)$$
$$P^{\mathcal{M}^2}_{d=0,z=1}(Y=1) = \sum_{Y_{d=0,z=1}(\mathbf{u})=1} P(\mathbf{u}) = P(U \neq 1) = 0.8. \quad (25)$$

Under an intervention on $Z$ (out-of-distribution), the decision $d$ that leads to maximum utility $Y$ changes between $\mathcal{M}^1$ and $\mathcal{M}^2$. Specifically, under $\mathcal{M}^1$ decision $d=1$ is favoured (as $P^{\mathcal{M}^1}_{d=1,z=1}(Y=1) > P^{\mathcal{M}^1}_{d=0,z=1}(Y=1)$), while under $\mathcal{M}^2$ decision $d=0$ is favoured (as $P^{\mathcal{M}^2}_{d=1,z=1}(Y=1) < P^{\mathcal{M}^2}_{d=0,z=1}(Y=1)$). This illustrates the possible under-determination of an AI's choice of action out-of-distribution given only observations of its external behaviour, as multiple (contradicting) world models are equally consistent with the observed data.

In more realistic settings, we might wonder about AI behaviour under arbitrary shifts $\sigma$, not only atomic interventions. We follow Correa and Bareinboim (2020a) to define a shift $\sigma$ on $\mathbf{Z} \subset \mathbf{V}$ in $\mathcal{M}: \langle \mathbf{V}, \mathbf{U}, \mathcal{F}, P \rangle$ as inducing a sub-model $\mathcal{M}_\sigma$ in which the mechanisms for $\mathbf{Z}$, that is $\{f_Z : Z \in \mathbf{Z}\}$, and the exogenous variables $\mathbf{U}_Z$, $Z \in \mathbf{Z}$, are replaced by those specified by $\sigma$ as:

$$\mathcal{M}_\sigma : \langle \mathbf{V}, \mathbf{U}_\sigma, \mathcal{F}_\sigma, P \rangle, \qquad \mathbf{U}_\sigma = \mathbf{U} \cup \bigcup_{Z \in \mathbf{Z}} \mathbf{U}_{Z,\sigma}, \qquad \mathcal{F}_\sigma = \mathcal{F} \cup \{f_{Z,\sigma} : Z \in \mathbf{Z}\} \setminus \{f_Z : Z \in \mathbf{Z}\}, \quad (26)$$

where $\bigcup_{Z \in \mathbf{Z}} \mathbf{U}_{Z,\sigma}$ and $\{f_{Z,\sigma} : Z \in \mathbf{Z}\}$ define the new assignments for $\mathbf{Z}$ (and could be arbitrarily defined as long as they induce a valid SCM). We have shown in Thm. 3 that unless some knowledge of $\sigma$ (beyond the variables it affects) or its consequences is available, the AI is not predictable. Furthermore, the AI's preference gap $\Delta$ for each context $\mathbf{C} = \mathbf{c}$ and pair of decisions $(d, d^*)$ is unconstrained. In practice, though, it might be realistic to have access to covariate data in the shifted environment, i.e., $P_{\sigma,d}(\mathbf{c})$, and to communicate this information to the AI for it to update its internal model accordingly.
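Given covariate information $P_\sigma(Z=1)$ in the shifted environment, bounds of this kind can be evaluated mechanically. The bracketed bound form below is a reconstruction consistent with the values reported for the Medical AI, and the joint probabilities are read off Tables 1-2; both are stated assumptions of this sketch rather than a verbatim transcription of Thm. 4.

```python
# Evaluating covariate-shift lower bounds on the preference gap for the
# Medical AI. Joint probabilities read off Tables 1-2; the bound form
# 1 - [2 - P_d(Z=1,Y=1) - P_{d'}(Z=1,Y=0)] / P_sigma(Z=1) is a
# reconstruction consistent with the reported numbers.
P_d1_z1_y1 = 0.4   # P_{d=1}(Z=1, Y=1)
P_d1_z1_y0 = 0.0   # P_{d=1}(Z=1, Y=0)
P_d0_z1_y1 = 0.2   # P_{d=0}(Z=1, Y=1)
P_d0_z1_y0 = 0.2   # P_{d=0}(Z=1, Y=0)

def gap_lower(p_win, q_lose, p_shift):
    """Lower bound on the preference gap given the shifted covariate mass."""
    return 1.0 - (2.0 - p_win - q_lose) / p_shift

for p_shift, note in [(0.9, "shifted deployment"), (1.0, "atomic do(Z=1)")]:
    b10 = gap_lower(P_d1_z1_y1, P_d0_z1_y0, p_shift)  # Delta_{d1 > d0}
    b01 = gap_lower(P_d0_z1_y1, P_d1_z1_y0, p_shift)  # Delta_{d0 > d1}
    print(f"{note}: {b10:.2f}, {b01:.2f}")
```

Setting $P_\sigma(Z=1) = 1$ recovers the atomic-intervention values $-0.4$ and $-0.8$, as discussed in the text, while $P_\sigma(Z=1) = 0.9$ yields roughly $-0.55$ and $-1$.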
Example 3 illustrates the inference that could be conducted in that case, using the Medical AI defined above. In particular, the exact nature of the shift $\sigma$ is unknown but we do have access to its consequences on the distribution of covariates. This is plausible in many scenarios. For example, in medicine, demographic data is typically available for most regions on earth but the precise effects of treatments are not, because not all populations benefit from the same access to medication. For illustration, assume that the Medical AI is considered for deployment in a population that varies in its level of blood pressure $Z$, potentially due to a different underlying biological mechanism that in turn also affects other variables in the system. We do know that the baseline rate of high blood pressure is high, given by $P_\sigma(Z=1) = 0.9$: higher than that observed during training, $P(Z=1) = 0.4$. By Thm. 4, we can establish that in this setting the preference gap in situations where $Z=1$ is no worse than

$$\Delta_{d_1 \succ d_0} \geq 1 - \big[2 - P_{d_1}(Z=1, Y=1) - P_{d_0}(Z=1, Y=0)\big] / P_\sigma(Z=1) = -0.55, \quad (27)$$
$$\Delta_{d_0 \succ d_1} \geq 1 - \big[2 - P_{d_0}(Z=1, Y=1) - P_{d_1}(Z=1, Y=0)\big] / P_\sigma(Z=1) = -1, \quad (28)$$

for the Medical AI. Interestingly, note also that if we were in a shifted environment with $P_\sigma(Z=1) = 1$, which is equivalent to an atomic intervention $Z \leftarrow 1$, the bounds reduce to the ones given by Thm. 1, evaluating to $-0.4$ and $-0.8$ respectively, as also shown above.

Continuing with the grounded Medical AI deployed under an atomic intervention, imagine that the Medical AI has internalized its own concept of an individual's disease progression, $Y^*$, as in Example 5. It is implicitly optimizing for that internal construction of its own, instead of the intended disease bio-marker $Y$. We know, or can assume, that the observed $Y$ is closely correlated with $Y^*$: in particular, that $P_d(Y^* = 1 \mid Y = 1, Z = z) \geq \alpha$ for some high value of $\alpha$ and all decisions $d$ and situations $z$.
In words, whenever the bio-marker suggests health ($Y=1$), with high probability the AI's interpretation also suggests health ($Y^*=1$). This then constrains the possible values of $\Delta$ (under an intervention $Z \leftarrow 1$), as $P_d(Y^*=1 \mid Z=z)$ is no longer arbitrary. The bounds derived in Example 2 on the AI's beliefs about optimal decisions under an intervention $\sigma := Z \leftarrow z$ continue to hold:

$$\Delta_{d_1 \succ d_0} \geq P_{d_1}(z, y^*) - P_{d_0}(z, y^*) + P_{d_0}(z) - 1, \quad (29)$$
$$\Delta_{d_0 \succ d_1} \geq P_{d_0}(z, y^*) - P_{d_1}(z, y^*) + P_{d_1}(z) - 1, \quad (30)$$

where we have used the shorthand $P_d(z, y^*) = P_d(Z=z, Y^*=1)$. But the distributions $\{P_d(z, y^*)\}_d$ can only be partially inferred from our assumption on the relationship between $Y^*$ and $Y$. For instance, notice that

$$P_d(Z=z, Y^*=1) = P_d(Y^*=1 \mid Z=z)\, P_d(Z=z) \quad (31)$$
$$= \big[ P_d(Y^*=1 \mid Y=1, Z=z)\, P_d(Y=1 \mid Z=z) + P_d(Y^*=1 \mid Y=0, Z=z)\, P_d(Y=0 \mid Z=z) \big]\, P_d(Z=z). \quad (32\text{-}33)$$

The values of $P_d(Y^*=1 \mid Y=1, Z=z)$ and $P_d(Y^*=1 \mid Y=0, Z=z)$ are partially known: $P_d(Y^*=1 \mid Y=1, Z=z) \geq \alpha$, while $P_d(Y^*=1 \mid Y=0, Z=z)$ is unconstrained. In particular,

$$P_d(Z=z, Y^*=1) \geq \alpha\, P_d(Y=1 \mid Z=z)\, P_d(Z=z), \quad (34)$$
$$P_d(Z=z, Y^*=1) \leq P_d(Z=z). \quad (35)$$

Substituting these terms into Eqs. (29)-(30) so as to derive correct lower bounds, we obtain

$$\Delta_{d_1 \succ d_0} \geq \alpha\, P_{d_1}(Z=z, Y=1) - 1, \quad (36)$$
$$\Delta_{d_0 \succ d_1} \geq \alpha\, P_{d_0}(Z=z, Y=1) - 1. \quad (37)$$

Looking at Tables 1 and 2, we can then conclude that for $\alpha = 0.9$ and $\sigma := Z \leftarrow 1$, the bounds evaluate to $-0.64$ and $-0.82$, respectively.

Moving on to incorporating assumptions on the structure of the real world $\mathcal{M}$, consider again the grounded medical AI with observed utility $Y$. One possible inductive bias we might introduce is the absence of an unobserved common cause between the variable $Z$ that shifts out-of-distribution and the utility $Y$. We say that $Z$ and $Y$ are conditionally unconfounded given $W$ if there exists an observed variable $W \in \{w, \tilde{w}\}$, $W \in \mathbf{V}$, such that $\mathbb{E}_{P_{d,z}}[Y \mid w] = \mathbb{E}_{P_d}[Y \mid w, z]$.
This restriction goes beyond grounding and asserts an equality between probabilities under different shifts that could, nevertheless, be communicated to the AI for it to update its world model $\widehat{\mathcal{M}}$, that is, $\mathbb{E}_{\widehat{P}_{d,z}}[Y \mid w] = \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]$. We can then leverage the following decomposition to obtain tighter bounds:

$$\mathbb{E}_{\widehat{P}_{d,z}}[Y] = \sum_{w} \mathbb{E}_{\widehat{P}_{d,z}}[Y \mid w]\, \widehat{P}_{d,z}(w) \quad \text{marginalizing over } W \qquad (38)$$

$$= \sum_{w} \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]\, \widehat{P}_{d,z}(w) \quad \text{by assumption} \qquad (39)$$

$$= \big( \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big) \widehat{P}_{d,z}(w) + \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z]. \qquad (40)$$

We can then bound $\widehat{P}_{d,z}(w)$ to obtain

$$\widehat{P}_d(w, z) \leq \widehat{P}_{d,z}(w) \leq \widehat{P}_d(w, z) + 1 - \widehat{P}_d(z). \qquad (41)$$

Without loss of generality, assume $\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \geq 0$. We can then show that

$$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \geq \big( \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big) \widehat{P}_d(w, z) + \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z], \qquad (42)$$

$$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \leq \big( \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big) \big( \widehat{P}_d(w, z) + 1 - \widehat{P}_d(z) \big) + \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z]. \qquad (43)$$

We can verify that these bounds are tighter than what we would have obtained without the assumption of conditional unconfoundedness by noting that

$$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \geq \big( \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big) \widehat{P}_d(w, z) + \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \qquad (44)$$

$$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]\, \widehat{P}_d(z, w) + \big( 1 - \widehat{P}_d(w, z) \big) \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \qquad (45)$$

$$\geq \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]\, \widehat{P}_d(z, w) + \widehat{P}_d(\tilde w, z)\, \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \qquad (46)$$

$$= \mathbb{E}_{\widehat{P}_d}[Y \mid z]\, \widehat{P}_d(z), \qquad (47)$$

where the inequality in (46) holds since $P_d(\tilde w, z) \leq 1 - P_d(w, z)$, and (47) is the "assumption-free" lower bound. This shows that the derived lower bound is at least as tight.
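The dominance argument in (44)-(47) can be sanity-checked numerically. The following sketch (not from the paper; the function names and the random search are ours) implements the sandwich bounds (42)-(43) and the assumption-free bounds, and checks on random instances that the former always sit inside the latter:

```python
import random

def bounds_with_unconfoundedness(e_w, e_wt, p_wz, p_z):
    """Bounds (42)-(43) on E_{P_{d,z}}[Y], given E[Y|w,z] = e_w,
    E[Y|w~,z] = e_wt, P(w,z) = p_wz and P(z) = p_z, assuming e_w >= e_wt
    (without loss of generality, as in the text)."""
    lo = (e_w - e_wt) * p_wz + e_wt
    hi = (e_w - e_wt) * (p_wz + 1 - p_z) + e_wt
    return lo, hi

def assumption_free_bounds(e_y_given_z, p_z):
    # E[Y|z]P(z)  <=  E_{P_{d,z}}[Y]  <=  E[Y|z]P(z) + 1 - P(z)
    return e_y_given_z * p_z, e_y_given_z * p_z + 1 - p_z

random.seed(0)
for _ in range(10000):
    p_z = random.uniform(0.05, 1.0)
    p_wz = random.uniform(0.0, p_z)   # P(w,z) <= P(z)
    p_wtz = p_z - p_wz                # binary W: P(w~,z) = P(z) - P(w,z)
    e_wt = random.uniform(0.0, 1.0)
    e_w = random.uniform(e_wt, 1.0)   # enforce e_w >= e_wt
    # law of total expectation: E[Y|z]P(z) = E[Y|w,z]P(w,z) + E[Y|w~,z]P(w~,z)
    eyz_pz = e_w * p_wz + e_wt * p_wtz
    lo, hi = bounds_with_unconfoundedness(e_w, e_wt, p_wz, p_z)
    lo0, hi0 = assumption_free_bounds(eyz_pz / p_z, p_z)
    assert lo >= lo0 - 1e-9 and hi <= hi0 + 1e-9
print("tightened bounds dominate on 10k random instances")
```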
For the upper bound, note that

$$\mathbb{E}_{\widehat{P}_{d,z}}[Y] \leq \big( \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big) \big( \widehat{P}_d(w, z) + 1 - \widehat{P}_d(z) \big) + \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \qquad (48)$$

$$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \big( \widehat{P}_d(w, z) + 1 - \widehat{P}_d(z) \big) - \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big( \widehat{P}_d(w, z) - \widehat{P}_d(z) \big) \qquad (49)$$

$$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]\, \widehat{P}_d(z, w) + \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \big( 1 - \widehat{P}_d(z) \big) + \mathbb{E}_{\widehat{P}_d}[Y \mid \tilde w, z] \big( \widehat{P}_d(z) - \widehat{P}_d(w, z) \big) \qquad (50)$$

$$= \mathbb{E}_{\widehat{P}_d}[Y \mid w, z]\, \widehat{P}_d(z, w) + \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \big( 1 - \widehat{P}_d(z) \big) + \widehat{P}_d(y, z, \tilde w) \qquad (51)$$

$$= \mathbb{E}_{\widehat{P}_d}[Y \mid z]\, \widehat{P}_d(z) + \mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \big( 1 - \widehat{P}_d(z) \big) \qquad (52)$$

$$\leq \mathbb{E}_{\widehat{P}_d}[Y \mid z]\, \widehat{P}_d(z) + 1 - \widehat{P}_d(z), \qquad (53)$$

where the last inequality holds since $\mathbb{E}_{\widehat{P}_d}[Y \mid w, z] \leq 1$, and (53) is the "assumption-free" upper bound. This shows that the derived upper bound is at least as tight. By combining these results, together with the assumption of grounding, we obtain

$$\Delta_{d_1 \succ d_0} \geq \mathbb{E}_{P_{d_1}}[Y \mid w, z] P_{d_1}(z, w) + A_1 \mathbb{E}_{P_{d_1}}[Y \mid \tilde w, z] - \mathbb{E}_{P_{d_0}}[Y \mid z] P_{d_0}(z) - A_2 \mathbb{E}_{P_{d_0}}[Y \mid w, z], \qquad (54)$$

$$\Delta_{d_1 \succ d_0} \leq \mathbb{E}_{P_{d_1}}[Y \mid z] P_{d_1}(z) + A_3 \mathbb{E}_{P_{d_1}}[Y \mid w, z] - \mathbb{E}_{P_{d_0}}[Y \mid w, z] P_{d_0}(z, w) - A_4 \mathbb{E}_{P_{d_0}}[Y \mid \tilde w, z], \qquad (55)$$

where $A_1 := 1 - P_{d_1}(z, w)$, $A_2 := 1 - P_{d_0}(z)$, $A_3 := 1 - P_{d_1}(z)$, $A_4 := 1 - P_{d_0}(z, w)$.

B. Related work

An important consideration for safely interacting with AI systems is to form expectations as to how they might act in the future. This research program draws on several areas that are related to the results we present in this paper.

B.1. Do current AIs represent the world?

World models are important because they offer a path between pattern recognition and a more genuine form of understanding. It is plausible that world models will play an increasing role (explicitly or implicitly) in improving reasoning capabilities and safety. For example, Dalrymple et al. (2024) list having a world model as a key component towards designing "guaranteed safe AI". In the literature, several works have argued that LLM activations carry information that correlates with meaningful concepts in the world and that causally influences LLM outputs.
Early examples come from AIs trained on board games such as Othello and on logic games. Li et al. (2022) showed that a model trained on natural language descriptions of Othello moves developed internal representations of the board state, which it used to predict valid moves in unseen board configurations. Gurnee and Tegmark (2023) and Vafa et al. (2024), among others, build on this approach to study navigation tasks, logic puzzles, and representations of space and time. The emergence of causal models in LLMs has also been studied by Geiger et al. (2021) and more recently in Geiger et al. (2024). The extent to which this evidence supports genuine folk-psychological concepts (desires, beliefs, intentions) is debated by Goldstein and Levinstein (2024).

B.2. Causal Inference

We might wonder whether the behaviour of AIs, to the extent that they carry a world-model representation that guides their decisions out-of-distribution, can be predicted before deployment. The causal inference literature studies this question in the context of the prediction of causal effects. Manski (1990) and Robins (1989) showed in the early 1990s that useful inference about causal effects could be drawn without making identifying assumptions beyond the observed data, and that such inferences could be refined for studies with imperfect compliance under a set of instrumental-variable assumptions. Closed-form expressions for bounds on causal effects were also derived in discrete systems with more general assumptions represented in causal diagrams (Bellot, 2024; Zhang, 2020), using both observational and interventional data (Joshi et al., 2024), and to bound the effect of policies (Bellot and Chiappa, 2024; Zhang and Bareinboim, 2021). A separate body of work instead proposed to use polynomial optimization to calculate causal bounds from a given causal diagram (Balke and Pearl, 1997; Chickering and Pearl, 1996).
This approach involves creating a set of canonical models, parameterized by the causal diagram, and then converting the bounding problem into a sequence of equivalent linear (or polynomial) programs (Finkelstein and Shpitser, 2020; Jalaldoust et al., 2024; Zhang et al., 2021). In parallel, a number of works have adopted sensitivity assumptions (as an alternative to, or in combination with, a causal diagram) that quantify the degree of unobserved confounding through various data statistics, such as odds ratios or propensity scores. Prominent examples include the sensitivity models of Tan (2006) and Rosenbaum et al. (2010). Several methods have proposed bounds with favourable statistical properties based on these models, see e.g. Jesson et al. (2021); Yadlowsky et al. (2018).

B.3. Reinforcement Learning

The problem of inferring what objective an agent is pursuing from the actions and data observed by that agent is studied in Inverse Reinforcement Learning (IRL) (Ng et al., 2000). Several papers have studied the partial identifiability of various reward learning models (Kim et al., 2021; Ng et al., 2000; Skalse and Abate, 2023; Skalse et al., 2023), and share a similar objective to this work. Two differences are worth mentioning. First, our work complements these approaches by studying the partial identifiability of world models, which capture not only the assignment of reward but also the relationships between the other auxiliary variables in the environment. This enables us to reason about the effect of shifts and interventions, and to give guarantees in specific out-of-distribution problems. Second, our objective is not necessarily to characterize compatible world models explicitly, but rather to understand their implications for decision-making, i.e., the set of possible actions that an AI might take given our uncertainty about its world model.
Our work is also related to that of Bengio et al. (2024), who consider deriving (probabilistic) bounds on the probability of harm given data. They similarly argue that multiple theories, in their case transition probabilities from one state to another in a Markov Decision Process (MDP), might explain the dependencies in the data to a greater or lesser degree. Each transition model might then be associated with a posterior probability given the data, which implies a corresponding posterior probability of harm. Our results, in contrast, are not probabilistic in nature. We provide closed-form bounds that can be interpreted as capturing all possible behaviours implied by the data, with probability 1 (which is also a possible limitation of our work). The class of world models we consider (i.e., SCMs) is also much more general than transition models in MDPs, allowing us to reason about expected AI behaviour under shifts in the environment, out of distribution.

B.4. Decision Theory

Inverse reinforcement learning is closely related to the study of revealed preferences in psychology and economics, which similarly aims to infer preferences from behaviour (Rothkopf and Dimitrakakis, 2011). Causal and counterfactual accounts of decision theory are an active area of research, see e.g. Joyce (1999). Recently, a representation theorem was shown that explicitly connects rational behaviour with structural causal models (Halpern and Piermont, 2024). The authors showed that whenever an agent's preferences over interventions satisfy axioms relating to the proper interpretation of counterfactuals and to rationality, we can model its behaviour as emerging from an SCM. The same conclusion can also be obtained for agents capable of solving tasks in multiple environments (Richens and Everitt, 2024): in essence, robustness over multiple environments is equivalent (in the limit) to operating according to a causal model of the environment.

B.5.
Limitations

The following are the main limitations of our work, which will be important to address in developing a more complete understanding of AI behaviour.

In this work, we start from the assumption that the past and future behaviour of an AI system is consistent with an underlying world model that can be represented as an SCM. In general, this presupposes a certain rationality and consistency in the AI's outputs that might not be realistic for all systems. Some relaxations are discussed in Sec. 5.

Structural Causal Models generally suppose that the system is acyclic and without feedback, and they do not naturally capture systems evolving continuously in time (perhaps better described using differential equations). Our bounds similarly rely on this assumption and may give unreliable inferences if applied to systems in which feedback is important.

We have stated our guarantees in the infinite-sample limit, without quantifying finite-sample estimation uncertainty. Consequently, we should exercise caution when using the proposed bounds in small-sample scenarios where estimators may be inaccurate. Finite-sample properties could be explored similarly to Bengio et al. (2024) by parameterizing the AI's underlying model and performing inference on the corresponding latent-variable model to obtain high-probability bounds. An example parameterization of SCMs, and probabilistic inference for decision-making across environments, is given in (Bellot et al., 2024; Jalaldoust et al., 2024). We expect that similar techniques could be applied in our setting.

We do not exploit the verbal behaviour of AI systems. In the context of LLMs, in principle, we might ask the system about its future behaviour explicitly, e.g., "Were I to intervene in the environment, what action do you believe is optimal?". It is not obvious, however, that we can trust that what such systems "say" ultimately matches what they will "do".
Decision-making, in practice, involves many considerations that go beyond expected-utility-maximization formalisms. For example, we might train AI systems to be virtuous, e.g., trained never to pick actions that can be considered harmful (defined according to some natural-language specification), no matter their expected utility. These considerations would change the kind of predictions we could make about the future behaviour of AI systems.

C. Proofs and additional results

This section provides proofs for the statements made in the main body of this work. Before we start, we recall a few basic results that will be used in the derivation of our proofs.

Definition 9 (The Axioms of Counterfactuals, Chapter 7.3.1, Pearl (2009)). For any three sets of endogenous variables $\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{W}$ in a causal model, and $\boldsymbol{x}, \boldsymbol{w}$ in the domains of $\boldsymbol{X}$ and $\boldsymbol{W}$, the following hold:

• Composition: $\boldsymbol{W}_{\boldsymbol{x}} = \boldsymbol{w}$ implies that $\boldsymbol{Y}_{\boldsymbol{x}, \boldsymbol{w}} = \boldsymbol{Y}_{\boldsymbol{x}}$.

• Effectiveness: $\boldsymbol{X}_{\boldsymbol{w}, \boldsymbol{x}} = \boldsymbol{x}$.

• Reversibility: $\boldsymbol{Y}_{\boldsymbol{x}, \boldsymbol{w}} = \boldsymbol{y}$ and $\boldsymbol{W}_{\boldsymbol{x}, \boldsymbol{y}} = \boldsymbol{w}$ imply that $\boldsymbol{Y}_{\boldsymbol{x}} = \boldsymbol{y}$.

Theorem 7 (Soundness and Completeness of the Axioms, Theorems 7.3.3 and 7.3.6, Pearl (2009)). The axioms of counterfactuals are sound and complete for all causal models.

The following rules for manipulating experimental distributions produced by policies extend the do-calculus and will be used in the next lemma. To make sense of these, note that, graphically, each SCM $\mathcal{M}$ is associated with a causal diagram $\mathcal{G}$ over $\boldsymbol{V}$, where $V \rightarrow W$ if $V$ appears as an argument of $f_W$ in $\mathcal{M}$, and $V \leftrightarrow W$ if $\boldsymbol{U}_V \cap \boldsymbol{U}_W \neq \emptyset$, i.e., $V$ and $W$ share an unobserved confounder. For a causal diagram $\mathcal{G}$ over $\boldsymbol{V}$, the $\boldsymbol{X}$-lower-manipulation of $\mathcal{G}$ deletes all edges out of variables in $\boldsymbol{X}$ and otherwise keeps $\mathcal{G}$ as it is; the resulting graph is denoted $\mathcal{G}_{\underline{\boldsymbol{X}}}$. The $\boldsymbol{X}$-upper-manipulation of $\mathcal{G}$ deletes all edges into variables in $\boldsymbol{X}$ and otherwise keeps $\mathcal{G}$ as it is; the resulting graph is denoted $\mathcal{G}_{\overline{\boldsymbol{X}}}$. We use $\perp_d$ to denote $d$-separation in causal diagrams (Pearl, 2009, Def. 1.2.3).
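The $d$-separation relation used throughout the rules below can be checked mechanically. A minimal sketch (not from the paper; the graph encoding and function name are ours) using the standard ancestral-moral-graph criterion:

```python
from collections import deque

def d_separated(parents, X, Y, Z):
    """Check whether X and Y are d-separated given Z in a DAG, via the
    ancestral-moral-graph criterion: restrict to ancestors of X|Y|Z,
    moralize (marry co-parents, drop directions), delete Z, and test
    whether X can still reach Y. `parents` maps a node to its parent set."""
    # 1. ancestral sub-graph of X, Y and Z
    anc, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # 2. moralize: undirected parent-child edges plus edges between co-parents
    adj = {v: set() for v in anc}
    for v in anc:
        ps = sorted(parents.get(v, ()))
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. delete Z and test reachability from X to Y
    seen, queue = set(X - Z), deque(X - Z)
    while queue:
        v = queue.popleft()
        if v in Y:
            return False
        for u in adj[v] - Z:
            if u not in seen:
                seen.add(u); queue.append(u)
    return True

chain = {"B": {"A"}, "C": {"B"}}  # A -> B -> C
collider = {"B": {"A", "C"}}      # A -> B <- C
print(d_separated(chain, {"A"}, {"C"}, {"B"}))     # True: chain blocked by B
print(d_separated(chain, {"A"}, {"C"}, set()))     # False: open path A-B-C
print(d_separated(collider, {"A"}, {"C"}, set()))  # True: collider blocks
print(d_separated(collider, {"A"}, {"C"}, {"B"}))  # False: conditioning opens
```

The collider example illustrates why moralization marries co-parents: conditioning on a common effect couples its causes.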
Theorem 8 (Inference Rules of the $\sigma$-calculus, Correa and Bareinboim (2020a)). Let $\mathcal{G}$ be a causal diagram compatible with an SCM $\mathcal{M}$, with endogenous variables $\boldsymbol{V}$. For any disjoint subsets $\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z} \subseteq \boldsymbol{V}$ and two disjoint subsets $\boldsymbol{T}, \boldsymbol{W} \subseteq \boldsymbol{V} \setminus (\boldsymbol{Z} \cup \boldsymbol{Y})$ (i.e., possibly including $\boldsymbol{X}$), the following rules are valid for any intervention strategies $\pi_{\boldsymbol{X}}, \pi_{\boldsymbol{Z}}, \pi'_{\boldsymbol{Z}}$ such that $\mathcal{G}_{\pi_{\boldsymbol{X}} \pi_{\boldsymbol{Z}}}$ and $\mathcal{G}_{\pi_{\boldsymbol{X}} \pi'_{\boldsymbol{Z}}}$ have no cycles:

• Rule 1 (Insertion/deletion of observations): $P_{\pi_{\boldsymbol{X}}}(\boldsymbol{y} \mid \boldsymbol{w}, \boldsymbol{t}) = P_{\pi_{\boldsymbol{X}}}(\boldsymbol{y} \mid \boldsymbol{w})$ if $(\boldsymbol{T} \perp_d \boldsymbol{Y} \mid \boldsymbol{W})$ in $\mathcal{G}_{\pi_{\boldsymbol{X}}}$.

• Rule 2 (Change of regimes under observation): $P_{\pi_{\boldsymbol{X}}, \pi_{\boldsymbol{Z}}}(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{w}) = P_{\pi_{\boldsymbol{X}}, \pi'_{\boldsymbol{Z}}}(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{w})$ if $(\boldsymbol{Y} \perp_d \boldsymbol{Z} \mid \boldsymbol{W})$ in $\mathcal{G}_{\pi_{\boldsymbol{X}} \pi_{\boldsymbol{Z}}, \underline{\boldsymbol{Z}}}$ and $\mathcal{G}_{\pi_{\boldsymbol{X}} \pi'_{\boldsymbol{Z}}, \underline{\boldsymbol{Z}}}$.

• Rule 3 (Change of regimes without observation): $P_{\pi_{\boldsymbol{X}}, \pi_{\boldsymbol{Z}}}(\boldsymbol{y} \mid \boldsymbol{w}) = P_{\pi_{\boldsymbol{X}}, \pi'_{\boldsymbol{Z}}}(\boldsymbol{y} \mid \boldsymbol{w})$ if $(\boldsymbol{Y} \perp_d \boldsymbol{Z} \mid \boldsymbol{W})$ in $\mathcal{G}_{\pi_{\boldsymbol{X}} \pi_{\boldsymbol{Z}}, \overline{\boldsymbol{Z}(\boldsymbol{W})}}$ and $\mathcal{G}_{\pi_{\boldsymbol{X}} \pi'_{\boldsymbol{Z}}, \overline{\boldsymbol{Z}(\boldsymbol{W})}}$, where $\boldsymbol{Z}(\boldsymbol{W})$ is the set of elements in $\boldsymbol{Z}$ that are not ancestors of $\boldsymbol{W}$ in $\mathcal{G}_{\pi_{\boldsymbol{X}}}$.

Lemma 1. Let $\pi: \mathrm{supp}_{\boldsymbol{C}} \times \mathrm{supp}_D \mapsto [0,1]$ be a (probabilistic) policy mapping contexts $\boldsymbol{c}$ to decisions $d$. Then $P_d(\boldsymbol{V})$ may be computed from $P_\pi(\boldsymbol{V})$.

Proof. Let $\boldsymbol{V} = \boldsymbol{C} \cup \{D\} \cup \boldsymbol{Y}$, and let $\mathcal{G}$ be an arbitrary causal diagram summarizing the SCM of the environment. The following derivation shows the claim:

$$P_d(\boldsymbol{v}) = P_d(\boldsymbol{y} \mid \boldsymbol{c}) P_d(\boldsymbol{c}) \quad \text{by the chain rule of probability} \qquad (56)$$

$$= P_d(\boldsymbol{y} \mid \boldsymbol{c}) P_\pi(\boldsymbol{c}) \quad \text{by rule 3 of the } \sigma\text{-calculus, since } D \perp_d \boldsymbol{C} \text{ in } \mathcal{G}_{\overline{D}} \text{ and } \mathcal{G}_{\pi, \overline{D}} \qquad (57)$$

$$= P_\pi(\boldsymbol{y} \mid d, \boldsymbol{c}) P_\pi(\boldsymbol{c}) \quad \text{by rule 2 of the } \sigma\text{-calculus, since } D \perp_d \boldsymbol{Y} \mid \boldsymbol{C} \text{ in } \mathcal{G}_{\pi, \underline{D}}. \qquad (58)$$

That is, we have shown that $P_d(\boldsymbol{v})$ can be expressed as a functional of $P_\pi(\boldsymbol{v})$. Note that the equalities hold in any causal diagram $\mathcal{G}$ by definition of $\pi$. □

We start by providing proofs for the results on the AI's choice of action out-of-distribution given in Sec. 4.1.

Thm. 1 restated. An AI grounded in a domain $\mathcal{M}$ is weakly predictable under a shift $\sigma := do(\boldsymbol{z})$, $\boldsymbol{Z} \subset \boldsymbol{V}$, in a context $\boldsymbol{C} = \boldsymbol{c}$ if and only if there exists a decision $d^*$ such that

$$\frac{\mathbb{E}_{P_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_d(\boldsymbol{c}, \boldsymbol{z})}{P_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_d(\boldsymbol{z})} - \frac{\mathbb{E}_{P_{d^*}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d^*}(\boldsymbol{z})}{P_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d^*}(\boldsymbol{z})} > 0, \quad \text{for some } d \neq d^*. \qquad (59)$$

Proof. Recall that the AI is weakly predictable in a context $\boldsymbol{C} = \boldsymbol{c}$ if and only if there exists a decision $d^*$ such that

$$\min_{\widehat{\mathcal{M}} \in \mathbb{M}} \big( \Delta_{d \succ d^*} \big) > 0, \quad \Delta_{d \succ d^*} := \mathbb{E}_{P_{\widehat{\mathcal{M}}}}\big[ Y \mid do(\sigma, d), \boldsymbol{c} \big] - \mathbb{E}_{P_{\widehat{\mathcal{M}}}}\big[ Y \mid do(\sigma, d^*), \boldsymbol{c} \big], \quad \text{for some } d \neq d^*. \qquad (60)$$

$\mathbb{M}$ denotes the set of compatible SCMs, i.e., those that generate the data under our assumptions. $\Delta$ is the AI's preference gap between two decisions in some situation $\boldsymbol{C} = \boldsymbol{c}$. We will derive bounds on each term of the difference in $\Delta$ separately. Firstly, note that

$$\mathbb{E}_{\widehat{P}_{\sigma,d}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big] = \mathbb{E}_{\widehat{P}_{\boldsymbol{z},d}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \big] / \widehat{P}_{\boldsymbol{z},d}(\boldsymbol{c}). \qquad (61)$$

Analytical lower bound. A lower bound on this ratio can be obtained by minimizing the numerator and maximizing the denominator, for example using the following derivation:

$$\mathbb{E}_{\widehat{P}_{\boldsymbol{z},d}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \big] = \sum_{\tilde{\boldsymbol{z}}} \mathbb{E}_{\widehat{P}_d}\big[ Y_{\boldsymbol{z}} \mathbb{1}_{\boldsymbol{c}, \tilde{\boldsymbol{z}}}(\boldsymbol{C}_{\boldsymbol{z}}, \boldsymbol{Z}) \big] \quad \text{marginalizing over } \boldsymbol{Z} \qquad (62)$$

$$\geq \mathbb{E}_{\widehat{P}_d}\big[ Y_{\boldsymbol{z}} \mathbb{1}_{\boldsymbol{c}, \boldsymbol{z}}(\boldsymbol{C}_{\boldsymbol{z}}, \boldsymbol{Z}) \big] \quad \text{since all summands are} \geq 0 \qquad (63)$$

$$= \mathbb{E}_{\widehat{P}_d}\big[ Y \mathbb{1}_{\boldsymbol{c}, \boldsymbol{z}}(\boldsymbol{C}, \boldsymbol{Z}) \big] \quad \text{by consistency} \qquad (64)$$

$$= \mathbb{E}_{P_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_d(\boldsymbol{c}, \boldsymbol{z}) \quad \text{by grounding,} \qquad (65)$$

$$\widehat{P}_{\boldsymbol{z},d}(\boldsymbol{c}) \overset{(1)}{=} 1 - \widehat{P}_{\boldsymbol{z},d}(\boldsymbol{c}') \qquad (67)$$

$$= 1 - \sum_{\tilde{\boldsymbol{z}}} \widehat{P}_d(\boldsymbol{c}'_{\boldsymbol{z}}, \tilde{\boldsymbol{z}}) \quad \text{marginalizing over } \boldsymbol{Z} \qquad (68)$$

$$\leq 1 - \widehat{P}_d(\boldsymbol{c}'_{\boldsymbol{z}}, \boldsymbol{z}) \quad \text{since all summands are} \geq 0 \qquad (69)$$

$$= \widehat{P}_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_d(\boldsymbol{z}) \quad \text{by consistency} \qquad (70)$$

$$= P_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_d(\boldsymbol{z}) \quad \text{by grounding.} \qquad (71)$$

Here (1) holds by defining $\boldsymbol{c}'$ to stand for any combination of values of $\boldsymbol{C} \setminus \boldsymbol{Z}$ other than $\boldsymbol{c} \setminus \boldsymbol{z}$, and summing over them.
This then implies that

$$\mathbb{E}_{\widehat{P}_{\sigma,d}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big] \geq \frac{\mathbb{E}_{P_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_d(\boldsymbol{c}, \boldsymbol{z})}{P_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_d(\boldsymbol{z})}. \qquad (72)$$

Analytical upper bound. For the upper bound, we start by noting that

$$\mathbb{E}_{\widehat{P}_{\sigma,d}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big] = 1 - \mathbb{E}_{\widehat{P}_{\sigma,d}}\big[ 1 - Y \mid \boldsymbol{C} = \boldsymbol{c} \big] \qquad (73)$$

$$= 1 - \mathbb{E}_{\widehat{P}_{\boldsymbol{z},d}}\big[ (1 - Y) \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \big] / \widehat{P}_{\boldsymbol{z},d}(\boldsymbol{c}). \qquad (74)$$

Leveraging the bounds derived above, we obtain

$$\mathbb{E}_{\widehat{P}_{\sigma,d}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big] \leq 1 - \frac{\mathbb{E}_{P_d}\big[ (1 - Y) \mathbb{1}_{\boldsymbol{c}, \boldsymbol{z}}(\boldsymbol{C}, \boldsymbol{Z}) \big]}{P_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_d(\boldsymbol{z})} \qquad (75)$$

$$= \frac{\mathbb{E}_{P_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_d(\boldsymbol{z})}{P_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_d(\boldsymbol{z})}. \qquad (76)$$

By setting $d = d_1$ in the lower bound and $d = d_0$ in the upper bound of the expected utility, we obtain a lower bound on the difference of expected utilities:

$$\Delta_{d_1 \succ d_0} \geq \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_1}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_1}(\boldsymbol{z})} - \frac{\mathbb{E}_{P_{d_0}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})}{P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})}. \qquad (77)$$

Similarly, by setting $d = d_1$ in the upper bound and $d = d_0$ in the lower bound of the expected utility, we obtain an upper bound on the difference of expected utilities:

$$\Delta_{d_1 \succ d_0} \leq \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_1}(\boldsymbol{z})}{P_{d_1}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_1}(\boldsymbol{z})} - \frac{\mathbb{E}_{P_{d_0}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_0}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})}. \qquad (78)$$

We now show that these bounds are tight by constructing SCMs (that is, possible world models of the AI system) that evaluate to the lower and upper bounds while generating the observed distributions of agent interactions $\widehat{P}_{d_1}, \widehat{P}_{d_0}$.

Tightness of the lower bound for $\Delta$. For the lower bound we consider the following SCM:

$$\mathcal{M}^1_d := \begin{cases} \boldsymbol{Z} \leftarrow f_{\boldsymbol{Z}}(\boldsymbol{u}) \\[2pt] \boldsymbol{C} \leftarrow \begin{cases} f_{\boldsymbol{C}}(\boldsymbol{u}, \boldsymbol{z}) & \text{if } f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z} \\ \boldsymbol{c} & \text{otherwise} \end{cases} \\[2pt] D \leftarrow d \\[2pt] Y \leftarrow \begin{cases} f_Y(d, \boldsymbol{c}, \boldsymbol{z}, \boldsymbol{u}) & \text{if } f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z} \\ 1 & \text{if } f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z},\ d = d_0 \\ 0 & \text{if } f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z},\ d = d_1 \end{cases} \\[2pt] P(\boldsymbol{U}) \end{cases} \qquad (79)$$

Here $f_{\boldsymbol{Z}}, f_{\boldsymbol{C}}, f_Y, \boldsymbol{U}, P(\boldsymbol{U})$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P_{\mathcal{M}^1_d}(\boldsymbol{v}) = P_{\widehat{\mathcal{M}}_d}(\boldsymbol{v})$ for all $\boldsymbol{v} \in \mathrm{supp}_{\boldsymbol{V}}$.
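The intervals (72) and (76), and hence the preference-gap bounds (77)-(78), are simple ratios of observed quantities. The sketch below uses illustrative probabilities (hypothetical values in which the context is the single variable $Z$, so $\boldsymbol{c} = \boldsymbol{z}$ and $P_d(\boldsymbol{c}, \boldsymbol{z}) = P_d(\boldsymbol{z})$):

```python
def thm1_interval(e_y_cz, p_cz, p_z):
    """Bounds (72) and (76) on E_{P_{sigma,d}}[Y | C=c] under do(z),
    from the observed E_{P_d}[Y|c,z], P_d(c,z) and P_d(z) of a grounded AI."""
    denom = p_cz + 1 - p_z
    return e_y_cz * p_cz / denom, (e_y_cz * p_cz + 1 - p_z) / denom

# Illustrative values with c = (Z=1), so P(c,z) = P(z) = 0.4:
lo1, hi1 = thm1_interval(1.0, 0.4, 0.4)   # decision d1: E[Y | Z=1] = 1.0
lo0, hi0 = thm1_interval(0.5, 0.4, 0.4)   # decision d0: E[Y | Z=1] = 0.5
print(lo1 - hi0, hi1 - lo0)  # gap bounds (77)-(78): approx -0.4 and 0.8
```

With these numbers the lower bound on $\Delta_{d_1 \succ d_0}$ is negative, so neither decision can be ruled out and the AI is not weakly predictable at this context.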
Consider evaluating

$$\mathbb{E}_{P_{\mathcal{M}^1_{\sigma,d}}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big] = \mathbb{E}_{P_{\mathcal{M}^1_{\sigma,d}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \big] / P_{\mathcal{M}^1_{\sigma,d}}(\boldsymbol{c}). \qquad (80)$$

The numerator under $\mathcal{M}^1_{d_1}$ evaluates to

$$\mathbb{E}_{P_{\mathcal{M}^1_{d_1}}}\big[ Y_{\boldsymbol{z}} \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}_{\boldsymbol{z}}) \big] = \sum_{\boldsymbol{u}} \mathbb{E}_{P_{\mathcal{M}^1_{d_1}}}\big[ Y_{\boldsymbol{z}} \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}_{\boldsymbol{z}}) \mid \boldsymbol{u} \big] P_{\mathcal{M}^1_{d_1}}(\boldsymbol{u}) \qquad (81, 82)$$

$$= \sum_{\boldsymbol{u}} \mathbb{E}_{P_{\mathcal{M}^1_{d_1}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{z}, \boldsymbol{u} \big] P_{\mathcal{M}^1_{d_1}}(\boldsymbol{u}) \qquad (83)$$

$$= \mathbb{E}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{z}, \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z} \big] P\big( \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z} \big) + \mathbb{E}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{z}, \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \big] P\big( \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \big) \qquad (84, 85)$$

$$= \mathbb{E}_{P_{\mathcal{M}^1_{d_1}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{z} \big] P_{\mathcal{M}^1_{d_1}}(\boldsymbol{z}), \quad \text{since } Y = 0 \text{ whenever } f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \text{ and } d = d_1 \qquad (86)$$

$$= \mathbb{E}_{P_{\mathcal{M}^1_{d_1}}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{\mathcal{M}^1_{d_1}}(\boldsymbol{c}, \boldsymbol{z}). \qquad (87)$$

The denominator under $\mathcal{M}^1_{d_1}$ evaluates to

$$P_{\mathcal{M}^1_{\sigma,d_1}}(\boldsymbol{c}) = \sum_{\boldsymbol{u}} P_{\mathcal{M}^1_{d_1}}(\boldsymbol{c}_{\boldsymbol{z}} \mid \boldsymbol{u}) P_{\mathcal{M}^1_{d_1}}(\boldsymbol{u}) = \sum_{\boldsymbol{u}} P_{\mathcal{M}^1_{d_1}}(\boldsymbol{c} \mid \boldsymbol{z}, \boldsymbol{u}) P_{\mathcal{M}^1_{d_1}}(\boldsymbol{u}) \qquad (88, 89)$$

$$= P\big( \boldsymbol{c} \mid \boldsymbol{z}, \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z} \big) P\big( \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z} \big) + P\big( \boldsymbol{c} \mid \boldsymbol{z}, \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \big) P\big( \boldsymbol{u}: f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \big) \qquad (90, 91)$$

$$= P_{\mathcal{M}^1_{d_1}}(\boldsymbol{c} \mid \boldsymbol{z}) P_{\mathcal{M}^1_{d_1}}(\boldsymbol{z}) + 1 - P_{\mathcal{M}^1_{d_1}}(\boldsymbol{z}), \quad \text{since } \boldsymbol{C} = \boldsymbol{c} \text{ whenever } f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \qquad (92)$$

$$= P_{\mathcal{M}^1_{d_1}}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{\mathcal{M}^1_{d_1}}(\boldsymbol{z}). \qquad (93)$$

The numerator under $\mathcal{M}^1_{d_0}$ evaluates, by the same decomposition over the events $f_{\boldsymbol{Z}}(\boldsymbol{u}) = \boldsymbol{z}$ and $f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z}$ as in (94)-(98), to

$$\mathbb{E}_{P_{\mathcal{M}^1_{d_0}}}\big[ Y_{\boldsymbol{z}} \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}_{\boldsymbol{z}}) \big] = \mathbb{E}_{P_{\mathcal{M}^1_{d_0}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{z} \big] P_{\mathcal{M}^1_{d_0}}(\boldsymbol{z}) + 1 - P_{\mathcal{M}^1_{d_0}}(\boldsymbol{z}), \quad \text{since } Y = 1 \text{ and } \boldsymbol{C} = \boldsymbol{c} \text{ whenever } f_{\boldsymbol{Z}}(\boldsymbol{u}) \neq \boldsymbol{z} \text{ and } d = d_0 \qquad (99)$$

$$= \mathbb{E}_{P_{\mathcal{M}^1_{d_0}}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{\mathcal{M}^1_{d_0}}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{\mathcal{M}^1_{d_0}}(\boldsymbol{z}). \qquad (100)$$

The denominator under $\mathcal{M}^1_{d_0}$ evaluates, analogously to (88)-(93) as in (101)-(105), to

$$P_{\mathcal{M}^1_{\sigma,d_0}}(\boldsymbol{c}) = P_{\mathcal{M}^1_{d_0}}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{\mathcal{M}^1_{d_0}}(\boldsymbol{z}). \qquad (106)$$

Combining these results we get exactly the analytical lower bound:

$$\Delta_{d_1 \succ d_0} = \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_1}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_1}(\boldsymbol{z})} - \frac{\mathbb{E}_{P_{d_0}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})}{P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})}. \qquad (107)$$

This shows that for a given $\boldsymbol{C} = \boldsymbol{c}$ and pair of decisions $(d_1, d_0)$ we can always find an SCM that
evaluates to the lower bound that we report. So the AI is weakly predictable if, and only if, we can find a decision $d^*$ such that the lower bound evaluates to a value greater than zero for some $d \neq d^*$, as claimed. □

Corol. 1 restated. Given a discrepancy measure $\psi$, an AI approximately grounded in a domain $\mathcal{M}$ is weakly predictable in a context $\boldsymbol{C} = \boldsymbol{c}$ under a shift $\sigma := do(\boldsymbol{z})$, $\boldsymbol{Z} \subset \boldsymbol{V}$, if and only if there exists a decision $d^*$ such that

$$\min_{\widehat{P}: \psi(\widehat{P}, P) \leq \delta} \left( \frac{\mathbb{E}_{\widehat{P}_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] \widehat{P}_d(\boldsymbol{c}, \boldsymbol{z})}{\widehat{P}_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_d(\boldsymbol{z})} - \frac{\mathbb{E}_{\widehat{P}_{d^*}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] \widehat{P}_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_{d^*}(\boldsymbol{z})}{\widehat{P}_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_{d^*}(\boldsymbol{z})} \right) > 0, \quad \text{for some } d \neq d^*. \qquad (108)$$

Proof. For approximately grounded AI systems, we can state the bound from Thm. 1 as

$$\min_{\widehat{\mathcal{M}} \in \mathbb{M}} \big( \Delta_{d \succ d^*} \big) = \frac{\mathbb{E}_{\widehat{P}_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] \widehat{P}_d(\boldsymbol{c}, \boldsymbol{z})}{\widehat{P}_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_d(\boldsymbol{z})} - \frac{\mathbb{E}_{\widehat{P}_{d^*}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] \widehat{P}_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_{d^*}(\boldsymbol{z})}{\widehat{P}_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_{d^*}(\boldsymbol{z})}, \qquad (109)$$

where $\widehat{P}_d$ is constrained to be close to $P_d$ according to the discrepancy $\psi$ and threshold $\delta$. We obtain valid bounds by reporting the worst case under this looser constraint:

$$\min_{\widehat{\mathcal{M}} \in \mathbb{M}} \big( \Delta_{d \succ d^*} \big) = \min_{\widehat{P}: \psi(\widehat{P}, P) \leq \delta} \left( \frac{\mathbb{E}_{\widehat{P}_d}[Y \mid \boldsymbol{c}, \boldsymbol{z}] \widehat{P}_d(\boldsymbol{c}, \boldsymbol{z})}{\widehat{P}_d(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_d(\boldsymbol{z})} - \frac{\mathbb{E}_{\widehat{P}_{d^*}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] \widehat{P}_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_{d^*}(\boldsymbol{z})}{\widehat{P}_{d^*}(\boldsymbol{c}, \boldsymbol{z}) + 1 - \widehat{P}_{d^*}(\boldsymbol{z})} \right). \qquad (110)$$

This shows that for a given $\boldsymbol{C} = \boldsymbol{c}$, $\min_{\widehat{\mathcal{M}} \in \mathbb{M}} \big( \Delta_{d \succ d^*} \big) > 0$ for some $d \neq d^*$ if and only if the minimum on the right-hand side of (110) is greater than zero, which is condition (111). □

Thm. 2 restated. Let $\sigma := do(\boldsymbol{z})$ be a shift on a set of variables $\boldsymbol{Z} \subset \boldsymbol{V}$. For $\boldsymbol{R}_i \subset \boldsymbol{Z} \subset \boldsymbol{V}$, $i = 1, \dots, k$, consider an AI grounded in multiple domains $\mathcal{M}_{\boldsymbol{r}_i}$, $i = 1, \dots, k$.
The AI is weakly predictable in a context $\boldsymbol{C} = \boldsymbol{c}$ under the shift $\sigma := do(\boldsymbol{z})$ if and only if there exists a decision $d^*$ such that

$$\max_{i,j = 1, \dots, k} A(\boldsymbol{r}_i, \boldsymbol{r}_j) > 0, \quad \text{for some } d \neq d^*, \qquad (112)$$

where

$$A(\boldsymbol{r}_i, \boldsymbol{r}_j) := \frac{\mathbb{E}_{P_{d, \boldsymbol{r}_i}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_i] P_{d, \boldsymbol{r}_i}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_i)}{P_{d, \boldsymbol{r}_i}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_i) + 1 - P_{d, \boldsymbol{r}_i}(\boldsymbol{z} \setminus \boldsymbol{r}_i)} - \frac{\mathbb{E}_{P_{d^*, \boldsymbol{r}_j}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_j] P_{d^*, \boldsymbol{r}_j}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_j) + 1 - P_{d^*, \boldsymbol{r}_j}(\boldsymbol{z} \setminus \boldsymbol{r}_j)}{P_{d^*, \boldsymbol{r}_j}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_j) + 1 - P_{d^*, \boldsymbol{r}_j}(\boldsymbol{z} \setminus \boldsymbol{r}_j)}.$$

Proof. $\mathcal{M}_{\boldsymbol{r}_i}$, $i = 1, \dots, k$, describe $k$ domains in which experiments on different subsets of $\boldsymbol{Z}$ have been conducted. This includes possibly the null experiment $\boldsymbol{R}_i = \emptyset$, which refers to the unaltered domain $\mathcal{M}$. We can use a derivation similar to that of Thm. 1 to derive bounds on $\Delta$ under a shift $\sigma := do(\boldsymbol{z})$ in terms of $P_{d, \boldsymbol{r}}(\boldsymbol{V})$, $\boldsymbol{R} \subset \boldsymbol{V}$, and obtain

$$\Delta_{d_1 \succ d_0} \geq A(\boldsymbol{r}), \qquad (113)$$

where

$$A(\boldsymbol{r}) := \frac{\mathbb{E}_{P_{d_1, \boldsymbol{r}}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_1, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})} - \frac{\mathbb{E}_{P_{d_0, \boldsymbol{r}}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})}. \qquad (114)$$

These bounds can be shown to be tight by constructing SCMs similar to those of Thm. 1. For example, for the analytical lower bound consider

$$\mathcal{M}^1_{d, \boldsymbol{r}} := \begin{cases} \boldsymbol{S} \leftarrow f_{\boldsymbol{S}}(\boldsymbol{u}) \\[2pt] \boldsymbol{R} \leftarrow \boldsymbol{r} \\[2pt] \boldsymbol{C} \leftarrow \begin{cases} f_{\boldsymbol{C}}(\boldsymbol{u}, \boldsymbol{s}, \boldsymbol{r}) & \text{if } f_{\boldsymbol{S}}(\boldsymbol{u}) = \boldsymbol{s} \\ \boldsymbol{c} & \text{otherwise} \end{cases} \\[2pt] D \leftarrow d \\[2pt] Y \leftarrow \begin{cases} f_Y(d, \boldsymbol{c}, \boldsymbol{s}, \boldsymbol{r}, \boldsymbol{u}) & \text{if } f_{\boldsymbol{S}}(\boldsymbol{u}) = \boldsymbol{s} \\ 1 & \text{if } f_{\boldsymbol{S}}(\boldsymbol{u}) \neq \boldsymbol{s},\ d = d_0 \\ 0 & \text{if } f_{\boldsymbol{S}}(\boldsymbol{u}) \neq \boldsymbol{s},\ d = d_1 \end{cases} \\[2pt] P(\boldsymbol{U}) \end{cases} \qquad (115)$$

where $\boldsymbol{S} = \boldsymbol{Z} \setminus \boldsymbol{R}$. Here $f_{\boldsymbol{S}}, f_{\boldsymbol{C}}, f_Y, \boldsymbol{U}, P(\boldsymbol{U})$ are chosen to match the observed trajectory of agent interactions, i.e., such that $P_{\mathcal{M}^1_{d, \boldsymbol{r}}}(\boldsymbol{v}) = P_{\widehat{\mathcal{M}}_{d, \boldsymbol{r}}}(\boldsymbol{v})$ for all $\boldsymbol{v} \in \mathrm{supp}_{\boldsymbol{V}}$. One can verify that this SCM evaluates to the lower bound above. If we have multiple domains with different sets of intervened variables $\boldsymbol{R}_i$, $i = 1, \dots, k$, we can use this construction to find a lower bound using samples from $P_{d, \boldsymbol{r}_i}(\boldsymbol{V})$, $i = 1, \dots, k$. A lower bound that can be constructed for an AI system grounded in $\mathcal{M}_{\boldsymbol{r}_i}$, $i = 1, \dots, k$, is

$$\Delta_{d_1 \succ d_0} \geq \max_{i,j = 1, \dots, k} A(\boldsymbol{r}_i, \boldsymbol{r}_j), \qquad (116)$$

where

$$A(\boldsymbol{r}_i, \boldsymbol{r}_j) := \frac{\mathbb{E}_{P_{d_1, \boldsymbol{r}_i}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_i] P_{d_1, \boldsymbol{r}_i}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_i)}{P_{d_1, \boldsymbol{r}_i}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_i) + 1 - P_{d_1, \boldsymbol{r}_i}(\boldsymbol{z} \setminus \boldsymbol{r}_i)} - \frac{\mathbb{E}_{P_{d_0, \boldsymbol{r}_j}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_j] P_{d_0, \boldsymbol{r}_j}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_j) + 1 - P_{d_0, \boldsymbol{r}_j}(\boldsymbol{z} \setminus \boldsymbol{r}_j)}{P_{d_0, \boldsymbol{r}_j}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}_j) + 1 - P_{d_0, \boldsymbol{r}_j}(\boldsymbol{z} \setminus \boldsymbol{r}_j)}. \qquad (117)$$

The intuition here is that when we have multiple lower bounds for the preference gap, the best lower bound is the largest of them.

We can show that this bound is tight in the case where the AI is grounded in two environments $\mathcal{M}_{\boldsymbol{r}_1}, \mathcal{M}_{\boldsymbol{r}_2}$ under a shift $\sigma := do(\boldsymbol{z})$, $\boldsymbol{Z} = \boldsymbol{R}_1 \cup \boldsymbol{R}_2$. According to the inequality above, we have simultaneously

$$\Delta_{d_1 \succ d_0} \geq A(\boldsymbol{r}_1, \boldsymbol{r}_1),\ A(\boldsymbol{r}_1, \boldsymbol{r}_2),\ A(\boldsymbol{r}_2, \boldsymbol{r}_1),\ A(\boldsymbol{r}_2, \boldsymbol{r}_2). \qquad (118)$$

Each of these terms can be evaluated from the available data sampled from $P_{d, \boldsymbol{r}_1}, P_{d, \boldsymbol{r}_2}$. Note that both $A(\boldsymbol{r}_1, \boldsymbol{r}_1)$ and $A(\boldsymbol{r}_2, \boldsymbol{r}_2)$ can be attained with the SCM above. Without loss of generality, assume that $A(\boldsymbol{r}_1, \boldsymbol{r}_2) \geq A(\boldsymbol{r}_2, \boldsymbol{r}_1), A(\boldsymbol{r}_1, \boldsymbol{r}_1), A(\boldsymbol{r}_2, \boldsymbol{r}_2)$. We will show that we can construct an SCM compatible with $P_{d, \boldsymbol{r}_1}, P_{d, \boldsymbol{r}_2}$ that evaluates to $A(\boldsymbol{r}_1, \boldsymbol{r}_2)$, demonstrating that the bound is tight. Consider the following SCM:

$$\mathcal{M}_d := \begin{cases} \boldsymbol{R}_1 \leftarrow f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \\[2pt] \boldsymbol{R}_2 \leftarrow f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \\[2pt] \boldsymbol{C} \leftarrow \begin{cases} f_{\boldsymbol{C}}(\boldsymbol{r}_1, \boldsymbol{r}_2, \boldsymbol{u}_1, \boldsymbol{u}_2) & \text{if } f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) = \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) = \boldsymbol{r}_2 \\ \boldsymbol{c} & \text{otherwise} \end{cases} \\[2pt] D \leftarrow d \\[2pt] Y \leftarrow \begin{cases} f_Y(d, \boldsymbol{c}, \boldsymbol{r}_1, \boldsymbol{r}_2, \boldsymbol{u}_1, \boldsymbol{u}_2) & \text{if } f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) = \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) = \boldsymbol{r}_2 \\ f_Y(d, \boldsymbol{c}, \boldsymbol{r}_1, \boldsymbol{r}_2, \boldsymbol{u}_1, \boldsymbol{u}_2) & \text{if } d = d_1,\ f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) = \boldsymbol{r}_2 \\ f_Y(d, \boldsymbol{c}, \boldsymbol{r}_1, \boldsymbol{r}_2, \boldsymbol{u}_1, \boldsymbol{u}_2) & \text{if } d = d_0,\ f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) = \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \\ 0 & \text{if } d = d_1,\ f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) = \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \\ 0 & \text{if } d = d_1,\ f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \\ 1 & \text{if } d = d_0,\ f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) = \boldsymbol{r}_2 \\ 1 & \text{if } d = d_0,\ f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1,\ f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \end{cases} \\[2pt] P(\boldsymbol{U}) \end{cases} \qquad (119)$$

Notice that in $\mathcal{M}_d$ different choices of the functional assignments "$f$" and of $P(\boldsymbol{u})$ can generate any distributions $P_{d_1, \boldsymbol{r}_1}, P_{d_0, \boldsymbol{r}_2}$. That is, this SCM (or a member of this family of SCMs) is compatible with the observed data. Consider evaluating $A(\boldsymbol{r}_1, \boldsymbol{r}_2)$ under this SCM. The derivations for the denominators are equivalent to those shown in the proof of Thm. 1, so we omit them here. The first term in the numerator evaluates to

$$\mathbb{E}_{P_{\mathcal{M}_{d_1, \boldsymbol{r}_1, \boldsymbol{r}_2}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \big] = \sum_{\boldsymbol{u}_2} \mathbb{E}_{P_{\mathcal{M}_{d_1, \boldsymbol{r}_1, \boldsymbol{r}_2}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{u}_2 \big] P_{\mathcal{M}_{d_1, \boldsymbol{r}_1, \boldsymbol{r}_2}}(\boldsymbol{u}_2) \qquad (120, 121)$$

$$= \sum_{\boldsymbol{u}_2} \mathbb{E}_{P_{\mathcal{M}_{d_1, \boldsymbol{r}_1}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_2, \boldsymbol{u}_2 \big] P_{\mathcal{M}_{d_1, \boldsymbol{r}_1}}(\boldsymbol{u}_2) \qquad (122)$$

$$= \mathbb{E}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_2, \boldsymbol{u}_2: f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) = \boldsymbol{r}_2 \big] P\big( \boldsymbol{u}_2: f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) = \boldsymbol{r}_2 \big) + \mathbb{E}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_2, \boldsymbol{u}_2: f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \big] P\big( \boldsymbol{u}_2: f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \big) \qquad (123, 124)$$

$$= \mathbb{E}_{P_{\mathcal{M}_{d_1, \boldsymbol{r}_1}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_2 \big] P_{\mathcal{M}_{d_1, \boldsymbol{r}_1}}(\boldsymbol{r}_2), \quad \text{since } Y = 0 \text{ whenever } f_{\boldsymbol{R}_2}(\boldsymbol{u}_2) \neq \boldsymbol{r}_2 \text{ and } d = d_1 \qquad (125)$$

$$= \mathbb{E}_{P_{\mathcal{M}_{d_1, \boldsymbol{r}_1}}}[Y \mid \boldsymbol{c}, \boldsymbol{r}_2] P_{\mathcal{M}_{d_1, \boldsymbol{r}_1}}(\boldsymbol{c}, \boldsymbol{r}_2). \qquad (126)$$

The second term in the numerator is

$$\mathbb{E}_{P_{\mathcal{M}_{d_0, \boldsymbol{r}_1, \boldsymbol{r}_2}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \big] = \sum_{\boldsymbol{u}_1} \mathbb{E}_{P_{\mathcal{M}_{d_0, \boldsymbol{r}_1, \boldsymbol{r}_2}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{u}_1 \big] P_{\mathcal{M}_{d_0, \boldsymbol{r}_1, \boldsymbol{r}_2}}(\boldsymbol{u}_1) \qquad (127, 128)$$

$$= \sum_{\boldsymbol{u}_1} \mathbb{E}_{P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_1, \boldsymbol{u}_1 \big] P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}(\boldsymbol{u}_1) \qquad (129)$$

$$= \mathbb{E}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_1, \boldsymbol{u}_1: f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) = \boldsymbol{r}_1 \big] P\big( \boldsymbol{u}_1: f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) = \boldsymbol{r}_1 \big) + \mathbb{E}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_1, \boldsymbol{u}_1: f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1 \big] P\big( \boldsymbol{u}_1: f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1 \big) \qquad (130, 131)$$

$$= \mathbb{E}_{P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}}\big[ Y \mathbb{1}_{\boldsymbol{c}}(\boldsymbol{C}) \mid \boldsymbol{r}_1 \big] P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}(\boldsymbol{r}_1) + 1 - P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}(\boldsymbol{r}_1), \quad \text{since } Y = 1 \text{ and } \boldsymbol{C} = \boldsymbol{c} \text{ whenever } f_{\boldsymbol{R}_1}(\boldsymbol{u}_1) \neq \boldsymbol{r}_1 \text{ and } d = d_0 \qquad (132)$$

$$= \mathbb{E}_{P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}}[Y \mid \boldsymbol{c}, \boldsymbol{r}_1] P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}(\boldsymbol{c}, \boldsymbol{r}_1) + 1 - P_{\mathcal{M}_{d_0, \boldsymbol{r}_2}}(\boldsymbol{r}_1). \qquad (133)$$

Combining these results we get that, under $\mathcal{M}$,

$$\Delta_{d_1 \succ d_0} = A(\boldsymbol{r}_1, \boldsymbol{r}_2). \qquad (134)$$

□

Corollary 3. The bound from multiple domains in Thm. 2 is at least as informative as the bound from a single domain in Thm. 1.

Proof. We claim that for any $\boldsymbol{R} \subset \boldsymbol{Z}$,

$$A(\emptyset) \leq A(\boldsymbol{r}). \qquad (135)$$

This means that the bounds on $\Delta$ that we can obtain from an AI system grounded in $\mathcal{M}_{\boldsymbol{r}}$ are at least as informative as the bounds obtained from an AI system grounded in $\mathcal{M}$. $A$ is a difference of two terms, written $A(\boldsymbol{r}) = A_1(\boldsymbol{r}) - A_2(\boldsymbol{r})$, with

$$A_1(\boldsymbol{r}) := \frac{\mathbb{E}_{P_{d_1, \boldsymbol{r}}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_1, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})}, \qquad (136)$$

$$A_2(\boldsymbol{r}) := \frac{\mathbb{E}_{P_{d_0, \boldsymbol{r}}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})}. \qquad (137)$$

It holds that $A_1(\boldsymbol{r}) \geq A_1(\emptyset)$ and $A_2(\boldsymbol{r}) \leq A_2(\emptyset)$, which together imply $A(\boldsymbol{r}) \geq A(\emptyset)$. To see this, notice that

$$A_1(\boldsymbol{r}) = \frac{\mathbb{E}_{P_{d_1, \boldsymbol{r}}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_1, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})} \qquad (138)$$

$$\geq \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_1, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_1, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})} \qquad (139)$$

$$= \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z})}{1 - P_{d_1, \boldsymbol{r}}(\tilde{\boldsymbol{c}}, \boldsymbol{z} \setminus \boldsymbol{r})} \qquad (140)$$

$$\geq \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z})}{1 - P_{d_1}(\tilde{\boldsymbol{c}}, \boldsymbol{z})} \qquad (141)$$

$$= \frac{\mathbb{E}_{P_{d_1}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_1}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_1}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_1}(\boldsymbol{z})} \qquad (142)$$

$$= A_1(\emptyset), \qquad (143)$$

where $\tilde{\boldsymbol{c}}$ stands for the combinations of values of $\boldsymbol{C}$ that are not $\boldsymbol{c}$. Further,

$$A_2(\boldsymbol{r}) = \frac{\mathbb{E}_{P_{d_0, \boldsymbol{r}}}[Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})} \qquad (144)$$

$$= 1 - \frac{\mathbb{E}_{P_{d_0, \boldsymbol{r}}}[1 - Y \mid \boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}] P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r})}{P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})} \qquad (145)$$

$$\leq 1 - \frac{\mathbb{E}_{P_{d_0}}[1 - Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_0}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_0, \boldsymbol{r}}(\boldsymbol{c}, \boldsymbol{z} \setminus \boldsymbol{r}) + 1 - P_{d_0, \boldsymbol{r}}(\boldsymbol{z} \setminus \boldsymbol{r})} \qquad (146)$$

$$\leq 1 - \frac{\mathbb{E}_{P_{d_0}}[1 - Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_0}(\boldsymbol{c}, \boldsymbol{z})}{P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})} \qquad (147)$$

$$= \frac{\mathbb{E}_{P_{d_0}}[Y \mid \boldsymbol{c}, \boldsymbol{z}] P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})}{P_{d_0}(\boldsymbol{c}, \boldsymbol{z}) + 1 - P_{d_0}(\boldsymbol{z})} \qquad (148)$$

$$= A_2(\emptyset). \qquad (149)$$

□

Thm. 3 restated. Consider an AI grounded in a domain $\mathcal{M}$ made aware of an (under-specified) shift on a non-empty $\boldsymbol{Z} \subset \boldsymbol{V}$. Then the AI is provably not weakly (or strongly) predictable in any context $\boldsymbol{C} = \boldsymbol{c}$.
Proof. Recall that the preference gap is defined as

$$\Delta_{d_1 \succ d_0} := \mathbb{E}_{\widehat{P}_{\sigma, d_1}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big] - \mathbb{E}_{\widehat{P}_{\sigma, d_0}}\big[ Y \mid \boldsymbol{C} = \boldsymbol{c} \big]. \qquad (150)$$

Here we know that $\sigma$ potentially modifies the mechanisms of the set of variables $\boldsymbol{Z}$, though the nature of the modification is unknown. In the worst case, the AI's interpretation of the possible new assignment of $\boldsymbol{Z}$ could be arbitrary. We prove this theorem for the case of binary variables $Y, Z \in \boldsymbol{V}$. In the following, we construct two (canonical) models that entail any chosen distribution for the observed data $P_d(y, z \mid \boldsymbol{c})$ but evaluate to the a priori minimum and maximum values of the preference gap $\Delta$, i.e., $-1$ and $1$ respectively. We make use of the canonical model construction from Jalaldoust et al. (2024) to define the following general SCM:

$$Z \leftarrow \begin{cases} 0 & \text{if } r_z = 0 \\ 1 & \text{if } r_z = 1 \end{cases}, \qquad Y \leftarrow \begin{cases} 0 & \text{if } r_y = 0 \\ 0 & \text{if } r_y = 1,\ z = 0 \\ 1 & \text{if } r_y = 1,\ z = 1 \\ 1 & \text{if } r_y = 2,\ z = 0 \\ 0 & \text{if } r_y = 2,\ z = 1 \\ 1 & \text{if } r_y = 3 \end{cases} \qquad (151)$$

with $\boldsymbol{U} = \{R_z, R_y\}$, where $R_z$ and $R_y$ may be correlated, and with a probability $\widehat{P}(\boldsymbol{U}) = \widehat{P}_d(\boldsymbol{U} \mid \boldsymbol{c})$ chosen such that the entailed $\widehat{P}_d(z, y \mid \boldsymbol{c})$ matches the observed $P_d(z, y \mid \boldsymbol{c})$. By (Jalaldoust et al., 2024, Thm. 1) this is always possible, since this class of canonical models is sufficiently expressive to model any observational or interventional distribution. We can visualise the joint probability of the exogenous variables with the following table:

    Probabilities of $\widehat{\mathcal{M}}$:
                r_z = 0    r_z = 1
    r_y = 0     p_00       p_10
    r_y = 1     p_01       p_11
    r_y = 2     p_02       p_12
    r_y = 3     p_03       p_13

where we write $P_d(r_z = a, r_y = b \mid \boldsymbol{c}) = p_{ab}$. From these we can compute the joint probabilities

$$P_d(z=0, y=0 \mid \boldsymbol{c}) = p_{00} + p_{01}, \qquad (152)$$

$$P_d(z=0, y=1 \mid \boldsymbol{c}) = p_{02} + p_{03}, \qquad (153)$$

$$P_d(z=1, y=0 \mid \boldsymbol{c}) = p_{10} + p_{12}, \qquad (154)$$

$$P_d(z=1, y=1 \mid \boldsymbol{c}) = p_{11} + p_{13}. \qquad (155)$$

Here we can see that the parameter space $P_d(r_z, r_y \mid \boldsymbol{c})$ is very expressive. For example, without loss of generality we could set $p_{03} = p_{13} = 0$, or $p_{00} = p_{10} = 0$, and still be able to generate any observed distribution $P_d(z, y \mid \boldsymbol{c})$.
The given shift in the environment $\sigma$ can be entirely modelled as a shift in $P_{\sigma,d}(r_z \mid \boldsymbol{c})$ while keeping the probability of $r_y$ invariant, i.e., $P_{\sigma,d}(r_y \mid \boldsymbol{c}) = P_d(r_y \mid \boldsymbol{c})$. In other words, given the table above, we may change each of the cells while keeping the row sums equal. Recall that we are interested in evaluating bounds on a probability of the form $P_{\sigma,d}(y=1 \mid \boldsymbol{c})$ or $P_{\sigma,d}(y=1 \mid z=1, \boldsymbol{c})$, depending on whether $Z$ is given as an input to the AI or not. Both quantities can be written in terms of the probabilities of the exogenous variables as follows:

$$P_{\sigma,d}(y=1 \mid \boldsymbol{c}) = p_{02} + p_{03} + p_{11} + p_{13}, \qquad (156)$$

$$P_{\sigma,d}(y=1 \mid z=1, \boldsymbol{c}) = \frac{p_{11} + p_{13}}{p_{10} + p_{11} + p_{12} + p_{13}}. \qquad (157)$$

For the lower bound on these quantities, assume without loss of generality that $p_{03} = p_{13} = 0$. Then the following table:

    Probabilities of $\widehat{\mathcal{M}}_{\sigma}$:
                r_z = 0          r_z = 1
    r_y = 0     p_00             p_10
    r_y = 1     p_01 + p_11      0
    r_y = 2     0                p_12 + p_02
    r_y = 3     0                0

is a perfectly valid model under a shift $\sigma$ that respects the constraint $P_{\sigma,d}(r_y \mid \boldsymbol{c}) = P_d(r_y \mid \boldsymbol{c})$ but for which $P_{\sigma,d}(y=1 \mid \boldsymbol{c}) = 0$, as it is the sum of four zero entries, and $P_{\sigma,d}(y=1 \mid z=1, \boldsymbol{c}) = 0$, as its numerator is the sum of the two zero entries in the second column.

If we are interested in an upper bound, then assume without loss of generality that $p_{00} = p_{10} = 0$. Then the following table:

    Probabilities of $\widehat{\mathcal{M}}_{\sigma}$:
                r_z = 0          r_z = 1
    r_y = 0     0                0
    r_y = 1     0                p_01 + p_11
    r_y = 2     p_12 + p_02      0
    r_y = 3     p_03             p_13

is a perfectly valid model under a shift $\sigma$ that respects the constraint $P_{\sigma,d}(r_y \mid \boldsymbol{c}) = P_d(r_y \mid \boldsymbol{c})$ but for which $P_{\sigma,d}(y=1 \mid \boldsymbol{c}) = 1$, as it is the sum of the four non-zero entries, and $P_{\sigma,d}(y=1 \mid z=1, \boldsymbol{c}) = 1$, as its numerator is the sum of the non-zero entries in the second column, which equals the column total.
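The worst-case construction above can be checked with made-up numbers. The sketch below (our own illustration, not code from the paper) builds an arbitrary exogenous table with $p_{03} = p_{13} = 0$ and verifies that the shifted table preserves the row sums $P(r_y \mid \boldsymbol{c})$ while driving $P_{\sigma,d}(y=1 \mid \boldsymbol{c})$ to zero:

```python
def joints(p):
    """p[a][b] = P(r_z=a, r_y=b | c). Returns the entailed P(z, y | c) for
    the canonical response-function model of Eq. (151)."""
    y_of = lambda ry, z: (0, z, 1 - z, 1)[ry]   # Y as a function of (r_y, Z)
    return {(z, y): sum(p[z][ry] for ry in range(4) if y_of(ry, z) == y)
            for z in (0, 1) for y in (0, 1)}

def row_sums(p):
    # P(r_y = b | c): the quantity an under-specified shift must preserve
    return [p[0][b] + p[1][b] for b in range(4)]

# An arbitrary parameterization with p_03 = p_13 = 0 (WLOG, as in the text):
p = {0: [0.10, 0.20, 0.10, 0.0], 1: [0.20, 0.15, 0.25, 0.0]}

# The worst-case shifted table from the text: same row sums, but zero mass on
# every cell that contributes to P(y=1 | c).
p_lo = {0: [p[0][0], p[0][1] + p[1][1], 0.0, 0.0],
        1: [p[1][0], 0.0, p[0][2] + p[1][2], 0.0]}

assert row_sums(p_lo) == row_sums(p)   # P_sigma(r_y | c) = P(r_y | c)
j, j_lo = joints(p), joints(p_lo)
print(j[0, 1] + j[1, 1], "->", j_lo[0, 1] + j_lo[1, 1])  # observed P(y=1) -> 0.0
```

The symmetric construction with $p_{00} = p_{10} = 0$ drives the same quantity to one, which is why the a priori interval $[-1, 1]$ for $\Delta$ cannot be improved.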
By using this construction to define lower and upper bounds for $P_{\sigma,d}(y=1 \mid \mathbf{c})$ or $P_{\sigma,d}(y=1 \mid z, \mathbf{c})$ for $d = d_0, d_1$, we obtain possible internal models for the AI that entail the observed external behaviour but for which the preference gap evaluates to $-1$ and $1$. This means that the a priori bound,
$$-1 \le \Delta_{d \succ d^*} \le 1, \quad (158)$$
is tight whenever the shift is undefined (whether or not we know the variables it applies to). Since the preference gap is unconstrained for any $\mathbf{C} = \mathbf{c}$ and any pair of decisions $(d, d^*)$, the AI is not predictable. □

Thm. 4 restated. Consider an AI grounded in a domain $\mathcal{M}$ and $P_{\sigma,d}(\mathbf{C})$, made aware of a shift $\sigma$ on $\mathbf{Z} \subset \mathbf{C}$. The AI is weakly predictable under this shift in a context $\mathbf{C} = \mathbf{c}$ if there exists a decision $d^*$ such that,
$$1 - \frac{2 + \mathbb{E}_{P_{d^*}}[Y \mid \mathbf{c}]\, P_{d^*}(\mathbf{c}) - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\, P_d(\mathbf{c}) - 2 P_d(\mathbf{z}) + P_d(\mathbf{c})}{P_{\sigma,d^*}(\mathbf{c})} > 0, \quad \text{for some } d \ne d^*. \quad (159)$$

Proof. Recall that the preference gap under a shift $\sigma$ between decisions $(d_1, d_0)$ in a situation $\mathbf{C} = \mathbf{c}$ is defined as:
$$\Delta_{d_1 \succ d_0} := \mathbb{E}_{\hat{P}_{\sigma,d_1}}[Y \mid \mathbf{C}=\mathbf{c}] - \mathbb{E}_{\hat{P}_{\sigma,d_0}}[Y \mid \mathbf{C}=\mathbf{c}] \quad (160)$$
Here we know that $\sigma$ potentially modifies the mechanisms of the set of variables $\mathbf{Z}$. The nature of the modification is unknown, but we are told that after modification the expected probability of $\mathbf{C}$ is given by $P_{\sigma,d}(\mathbf{C})$, assumed to be known and internalised by the AI. This means that its internal model, whatever interpretation of the shift it chooses, generates the assumed probabilities, i.e. $\hat{P}_{\sigma,d}(\mathbf{C}) = P_{\sigma,d}(\mathbf{C})$. We consider the derivation of bounds on each term of this difference separately. Firstly, note that,
$$\mathbb{E}_{\hat{P}_{\sigma,d}}[Y \mid \mathbf{C}=\mathbf{c}] = \mathbb{E}_{\hat{P}_{\sigma,d}}[Y \mathbb{1}_{\mathbf{c}}(\mathbf{C})] / \hat{P}_{\sigma,d}(\mathbf{c}) \quad (161)$$
For ease of notation let us write $\mathbf{R} := \mathbf{C} \setminus \mathbf{Z}$. We can then show that,
$$\begin{aligned}
\mathbb{E}_{\hat{P}_{\sigma,d}}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] &= \mathbb{E}_{\hat{P}_{\sigma,d}}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R}_{\mathbf{z}})] && \text{by consistency} \quad (162) \\
&\le \sum_{\mathbf{z}'} \mathbb{E}_{\hat{P}_{\sigma,d}}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{z}',\mathbf{r}}(\mathbf{Z},\mathbf{R}_{\mathbf{z}})] && (163) \\
&= \mathbb{E}_{\hat{P}_{\sigma,d}}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{r}}(\mathbf{R}_{\mathbf{z}})] && \text{marginalising over the values } \mathbf{z}' \text{ of } \mathbf{Z} \quad (164)
\end{aligned}$$
Once we intervene on $\mathbf{z}$, the mechanism that generated its value beforehand, whether it was the shift $\sigma$ or something else, is irrelevant.
In essence, we get an equivalence between shifted and un-shifted distributions under intervention:
$$\mathbb{E}_{\hat{P}_{\sigma,d}}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{r}}(\mathbf{R}_{\mathbf{z}})] = \mathbb{E}_{\hat{P}_d}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{r}}(\mathbf{R}_{\mathbf{z}})] \quad (165)$$
We can now take this quantity and show the following,
$$\begin{aligned}
\mathbb{E}_{\hat{P}_d}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{r}}(\mathbf{R}_{\mathbf{z}})] &= \sum_{\mathbf{z}'} \mathbb{E}_{\hat{P}_d}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{z}',\mathbf{r}}(\mathbf{Z},\mathbf{R}_{\mathbf{z}})] && (166) \\
&= \mathbb{E}_{\hat{P}_d}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R}_{\mathbf{z}})] + \sum_{\mathbf{z}' \ne \mathbf{z}} \mathbb{E}_{\hat{P}_d}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{z}',\mathbf{r}}(\mathbf{Z},\mathbf{R}_{\mathbf{z}})] && (167) \\
&= \mathbb{E}_{\hat{P}_d}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] + \sum_{\mathbf{z}' \ne \mathbf{z}} \mathbb{E}_{\hat{P}_d}[Y_{\mathbf{z}} \mathbb{1}_{\mathbf{z}',\mathbf{r}}(\mathbf{Z},\mathbf{R}_{\mathbf{z}})] && \text{by consistency} \quad (168) \\
&\le \mathbb{E}_{\hat{P}_d}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] + \sum_{\mathbf{z}' \ne \mathbf{z}} \mathbb{E}_{\hat{P}_d}[\mathbb{1}_{\mathbf{z}'}(\mathbf{Z})] && \text{since } Y_{\mathbf{z}} \text{ and } \mathbb{1}_{\mathbf{r}}(\mathbf{R}_{\mathbf{z}}) \text{ are} \le 1 \quad (169) \\
&= \mathbb{E}_{\hat{P}_d}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] + 1 - \hat{P}_d(\mathbf{z}) && (170) \\
&= \mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}]\, \hat{P}_d(\mathbf{c}) + 1 - \hat{P}_d(\mathbf{z}) && (171)
\end{aligned}$$
For the lower bound we consider the following derivation,
$$\mathbb{E}_{\hat{P}_{\sigma,d}}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] = \mathbb{E}_{\hat{P}_{\sigma,d}}[\mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] - \mathbb{E}_{\hat{P}_{\sigma,d}}[(1-Y) \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})]. \quad (172)$$
For ease of notation let us define,
$$\mathbb{E}_{\hat{P}_{\sigma,d}}[\tilde{Y} \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] := \mathbb{E}_{\hat{P}_{\sigma,d}}[(1-Y) \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})]. \quad (173)$$
The same upper bound applies to $\mathbb{E}_{\hat{P}_{\sigma,d}}[\tilde{Y} \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})]$, so that,
$$\begin{aligned}
\mathbb{E}_{\hat{P}_{\sigma,d}}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] &\ge \mathbb{E}_{\hat{P}_{\sigma,d}}[\mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] - \mathbb{E}_{\hat{P}_d}[\tilde{Y} \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] - 1 + \hat{P}_d(\mathbf{z}) && (174) \\
&= \mathbb{E}_{\hat{P}_{\sigma,d}}[\mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] - \mathbb{E}_{\hat{P}_d}[\mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] + \mathbb{E}_{\hat{P}_d}[Y \mathbb{1}_{\mathbf{z},\mathbf{r}}(\mathbf{Z},\mathbf{R})] - 1 + \hat{P}_d(\mathbf{z}) && (175) \\
&= \hat{P}_{\sigma,d}(\mathbf{c}) - \hat{P}_d(\mathbf{c}) + \mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}]\, \hat{P}_d(\mathbf{c}) - 1 + \hat{P}_d(\mathbf{z}) && (176)
\end{aligned}$$
Putting the lower and upper bounds together to form bounds on $\Delta_{d_1 \succ d_0}$, we get,
$$\begin{aligned}
\Delta_{d_1 \succ d_0} &\ge \frac{\hat{P}_{\sigma,d_1}(\mathbf{c}) - \hat{P}_{d_1}(\mathbf{c}) + \mathbb{E}_{\hat{P}_{d_1}}[Y \mid \mathbf{c}]\, \hat{P}_{d_1}(\mathbf{c}) - 1 + \hat{P}_{d_1}(\mathbf{z}) - \mathbb{E}_{\hat{P}_{d_0}}[Y \mid \mathbf{c}]\, \hat{P}_{d_0}(\mathbf{c}) + 1 - \hat{P}_{d_0}(\mathbf{z})}{\hat{P}_{\sigma,d_0}(\mathbf{c})} && (177) \\
&= 1 + \frac{-\hat{P}_{d_1}(\mathbf{c}) + \mathbb{E}_{\hat{P}_{d_1}}[Y \mid \mathbf{c}]\, \hat{P}_{d_1}(\mathbf{c}) - 1 + \hat{P}_{d_1}(\mathbf{z}) - \mathbb{E}_{\hat{P}_{d_0}}[Y \mid \mathbf{c}]\, \hat{P}_{d_0}(\mathbf{c}) - 1 + \hat{P}_{d_0}(\mathbf{z})}{\hat{P}_{\sigma,d_0}(\mathbf{c})} && (178) \\
&= 1 - \frac{2 + \mathbb{E}_{\hat{P}_{d_0}}[Y \mid \mathbf{c}]\, \hat{P}_{d_0}(\mathbf{c}) - \mathbb{E}_{\hat{P}_{d_1}}[Y \mid \mathbf{c}]\, \hat{P}_{d_1}(\mathbf{c}) - 2\hat{P}_{d_1}(\mathbf{z}) + \hat{P}_{d_1}(\mathbf{c})}{\hat{P}_{\sigma,d_0}(\mathbf{c})} && (179)
\end{aligned}$$
and by grounding,
$$\Delta_{d_1 \succ d_0} \ge 1 - \frac{2 + \mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}]\, P_{d_0}(\mathbf{c}) - \mathbb{E}_{P_{d_1}}[Y \mid \mathbf{c}]\, P_{d_1}(\mathbf{c}) - 2 P_{d_1}(\mathbf{z}) + P_{d_1}(\mathbf{c})}{P_{\sigma,d_0}(\mathbf{c})}. \quad (180)$$
This statement holds for any SCM compatible with the grounded AI's external behaviour and therefore,
$$\min_{\hat{\mathcal{M}} \in \mathbb{M}} \left( \Delta_{d \succ d^*} \right) \ge 1 - \frac{2 + \mathbb{E}_{P_{d^*}}[Y \mid \mathbf{c}]\, P_{d^*}(\mathbf{c}) - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\, P_d(\mathbf{c}) - 2 P_d(\mathbf{z}) + P_d(\mathbf{c})}{P_{\sigma,d^*}(\mathbf{c})}. \quad (181)$$
We can therefore establish that the AI is weakly predictable in a context $\mathbf{C} = \mathbf{c}$ if there exists a decision $d^*$ such that,
$$1 - \frac{2 + \mathbb{E}_{P_{d^*}}[Y \mid \mathbf{c}]\, P_{d^*}(\mathbf{c}) - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\, P_d(\mathbf{c}) - 2 P_d(\mathbf{z}) + P_d(\mathbf{c})}{P_{\sigma,d^*}(\mathbf{c})} > 0, \quad (182)$$
for some $d \ne d^*$. □

We now continue with our inference of the AI's perceived fairness and harm of decisions in Sec. 4.3.

Thm. 5 restated. Consider an agent with utility function $Y$ grounded in a domain $\mathcal{M}$. Then,
$$-\mathbb{E}_{P_d}[Y \mid z, \mathbf{c}] \le \Upsilon(d, \mathbf{c}) \le 1 - \mathbb{E}_{P_d}[Y \mid z, \mathbf{c}]. \quad (183)$$
This bound is tight.

Proof. Recall that for a given utility $Y$, the AI's counterfactual fairness gap relative to a decision $d$, in a given context $\mathbf{c}$, is
$$\Upsilon(d, \mathbf{c}) := \mathbb{E}_{\hat{P}}[Y_{d,z_1} \mid z_0, \mathbf{c}] - \mathbb{E}_{\hat{P}}[Y_d \mid z_0, \mathbf{c}]. \quad (184)$$
And remember that $Z \in \mathbf{C}$. For ease of notation, write $z_1 = z$ and $z_0 = z'$, such that,
$$\Upsilon(d, \mathbf{c}) := \mathbb{E}_{\hat{P}}[Y_{d,z} \mid z', \mathbf{c}] - \mathbb{E}_{\hat{P}}[Y_d \mid z', \mathbf{c}]. \quad (185)$$
We start by considering the following derivation:
$$\begin{aligned}
\hat{P}(y_{d,z} \mid \mathbf{c}) &= \hat{P}(y_{d,z}, z_d \mid \mathbf{c}) + \hat{P}(y_{d,z}, z'_d \mid \mathbf{c}) && \text{by marginalisation} \quad (186) \\
&= \hat{P}(y_d, z_d \mid \mathbf{c}) + \hat{P}(y_{d,z}, z'_d \mid \mathbf{c}) && \text{by consistency} \quad (187)
\end{aligned}$$
and since $d$ affects neither $Z$ nor $\mathbf{C}$, i.e. $Z_d = Z$ and $\mathbf{C}_d = \mathbf{C}$,
$$\hat{P}(y_{d,z} \mid \mathbf{c}) = \hat{P}(y_d, z \mid \mathbf{c}) + \hat{P}(y_{d,z}, z' \mid \mathbf{c}) \quad (188)$$
which implies
$$\hat{P}(y_{d,z} \mid z', \mathbf{c}) = \frac{\hat{P}(y_{d,z} \mid \mathbf{c}) - \hat{P}_d(y, z \mid \mathbf{c})}{\hat{P}_d(z' \mid \mathbf{c})}. \quad (189)$$
Therefore,
$$\mathbb{E}_{\hat{P}}[Y_{d,z} \mid z', \mathbf{c}] = \frac{\mathbb{E}_{\hat{P}}[Y_{d,z} \mid \mathbf{c}] - \mathbb{E}_{\hat{P}_d}[Y \mid z, \mathbf{c}]\, \hat{P}_d(z \mid \mathbf{c})}{\hat{P}_d(z' \mid \mathbf{c})}. \quad (190)$$
All quantities on the r.h.s. are observable except for $\mathbb{E}_{\hat{P}}[Y_{d,z} \mid \mathbf{c}]$, which can be tightly bounded.
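Before bounding that term, the identity in Eq. (190) itself can be sanity-checked by exhaustive enumeration in a small ground-truth SCM. A minimal sketch (the noise distribution `P_U` and structural function `f` are arbitrary choices, the context $\mathbf{C}$ is omitted, and all variables are binary):

```python
# Toy SCM: binary noise U = (u1, u2), Z := u1, Y := f(D, Z, u2), and the
# decision fixed to D = d by intervention, so Z_d = Z as the proof requires.
P_U = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
f = lambda d, z, u2: (d ^ z) | u2   # arbitrary binary utility
d, z, z_prime = 1, 1, 0

# Left-hand side of (190): E[Y_{d,z} | Z = z'] computed directly in the SCM.
P_zprime = sum(p for (u1, _), p in P_U.items() if u1 == z_prime)
lhs = sum(f(d, z, u2) * p for (u1, u2), p in P_U.items() if u1 == z_prime) / P_zprime

# Right-hand side: (E[Y_{d,z}] - E_d[Y | z] P_d(z)) / P_d(z').
E_Ydz = sum(f(d, z, u2) * p for (_, u2), p in P_U.items())
E_Yz_Pz = sum(f(d, u1, u2) * p for (u1, u2), p in P_U.items() if u1 == z)
rhs = (E_Ydz - E_Yz_Pz) / P_zprime

assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)   # both sides agree (0.75 for these numbers)
```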
For the lower bound, consider the following derivation,
$$\begin{aligned}
\mathbb{E}_{\hat{P}}[Y_{d,z} \mid \mathbf{c}] &= \sum_{\tilde{z}} \mathbb{E}_{\hat{P}}[Y_{d,z} \mathbb{1}_{\tilde{z}}(Z_d) \mid \mathbf{c}] && \text{marginalising over } Z_d \quad (191) \\
&\ge \mathbb{E}_{\hat{P}}[Y_{d,z} \mathbb{1}_{z}(Z_d) \mid \mathbf{c}] && \text{since the summands are non-negative} \quad (192) \\
&= \mathbb{E}_{\hat{P}}[Y_d \mathbb{1}_{z}(Z_d) \mid \mathbf{c}] && \text{by consistency} \quad (193) \\
&= \mathbb{E}_{P_d}[Y \mid \mathbf{c}, z]\, P_d(z \mid \mathbf{c}) && \text{by grounding and } \mathbf{C}_d = \mathbf{C} \quad (194)
\end{aligned}$$
Similarly, we can get an upper bound by noting
$$\begin{aligned}
\mathbb{E}_{\hat{P}}[Y_{d,z} \mid \mathbf{c}] &= 1 - \mathbb{E}_{\hat{P}}[(1 - Y_{d,z}) \mid \mathbf{c}] && (195) \\
&\le \mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}, z]\, \hat{P}_d(z \mid \mathbf{c}) + \hat{P}_d(z' \mid \mathbf{c}). && (196)
\end{aligned}$$
Tightness, lower bound. For the lower bound we consider the following SCM,
$$\mathcal{M}^1_d := \begin{cases} Z \leftarrow f_Z(\mathbf{u}) \\ \mathbf{C} \leftarrow f_{\mathbf{C}}(\mathbf{u}) \\ D \leftarrow d \\ Y \leftarrow \begin{cases} f_Y(d, \mathbf{c}, z, \mathbf{u}) & \text{if } f_Z(\mathbf{u}) = z \\ 0 & \text{otherwise} \end{cases} \\ P(\mathbf{U}) \end{cases} \quad (197)$$
Here $f_Z, f_{\mathbf{C}}, f_Y, \mathbf{U}, P(\mathbf{U})$ are chosen to match the observed trajectory of agent interactions, i.e. such that $P_{\mathcal{M}^1_d}(\mathbf{v}) = P_{\hat{\mathcal{M}}_d}(\mathbf{v})$ for all $\mathbf{v} \in \operatorname{supp} \mathbf{V}$. Then, under $\mathcal{M}^1_d$,
$$\begin{aligned}
& \mathbb{E}_{P_{\mathcal{M}^1}}[Y_{d,z} \mid \mathbf{c}] && (198) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^1}}[Y_{d,z} \mid \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u} \mid \mathbf{c}) && (199) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^1}}[Y_d \mid z, \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u} \mid \mathbf{c}) && (200) \\
&= \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) = z]\, P_{\mathcal{M}^1}(\mathbf{u} : f_Z(\mathbf{u}) = z \mid \mathbf{c}) && (201) \\
&\quad + \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) \ne z]\, P_{\mathcal{M}^1}(\mathbf{u} : f_Z(\mathbf{u}) \ne z \mid \mathbf{c}) && (202) \\
&= \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z, \mathbf{c}]\, P_{\mathcal{M}^1_d}(z \mid \mathbf{c}). && (203)
\end{aligned}$$
This expression coincides with the analytical lower bound, showing that it is tight.

Tightness, upper bound. For the upper bound we consider the following SCM,
$$\mathcal{M}^2_d := \begin{cases} Z \leftarrow f_Z(\mathbf{u}) \\ \mathbf{C} \leftarrow f_{\mathbf{C}}(\mathbf{u}) \\ D \leftarrow d \\ Y \leftarrow \begin{cases} f_Y(d, \mathbf{c}, z, \mathbf{u}) & \text{if } f_Z(\mathbf{u}) = z \\ 1 & \text{otherwise} \end{cases} \\ P(\mathbf{U}) \end{cases} \quad (204)$$
Here $f_Z, f_{\mathbf{C}}, f_Y, \mathbf{U}, P(\mathbf{U})$ are again chosen to match the observed trajectory of agent interactions, i.e. such that $P_{\mathcal{M}^2_d}(\mathbf{v}) = P_{\hat{\mathcal{M}}_d}(\mathbf{v})$ for all $\mathbf{v} \in \operatorname{supp} \mathbf{V}$.
Then, under $\mathcal{M}^2_d$,
$$\begin{aligned}
& \mathbb{E}_{P_{\mathcal{M}^2}}[Y_{d,z} \mid \mathbf{c}] && (205) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^2}}[Y_{d,z} \mid \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u} \mid \mathbf{c}) && (206) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^2}}[Y_d \mid z, \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u} \mid \mathbf{c}) && (207) \\
&= \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) = z]\, P_{\mathcal{M}^2}(\mathbf{u} : f_Z(\mathbf{u}) = z \mid \mathbf{c}) && (208) \\
&\quad + \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) \ne z]\, P_{\mathcal{M}^2}(\mathbf{u} : f_Z(\mathbf{u}) \ne z \mid \mathbf{c}) && (209) \\
&= \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z, \mathbf{c}]\, P_{\mathcal{M}^2_d}(z \mid \mathbf{c}) + 1 - P_{\mathcal{M}^2_d}(z \mid \mathbf{c}). && (210)
\end{aligned}$$
We therefore find that,
$$0 \le \mathbb{E}_{\hat{P}}[Y_{d,z} \mid z', \mathbf{c}] \le 1, \quad (211)$$
and ultimately,
$$-\mathbb{E}_{P_d}[Y \mid z, \mathbf{c}] \le \Upsilon(d, \mathbf{c}) \le 1 - \mathbb{E}_{P_d}[Y \mid z, \mathbf{c}], \quad (212)$$
as claimed. □

Thm. 6 restated. Consider an agent with utility function $Y$ grounded in a domain $\mathcal{M}$. Then,
$$\max\{0,\ \mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}] - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\} \le \Omega(d, d_0) \le \min\{\mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}],\ 1 - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\} \quad (213)$$
and this bound is tight.

Proof. Consider an agent with internal model $\hat{\mathcal{M}}$ and utility function $Y$. Recall that the agent's expected harm of a decision $d$ with respect to a baseline $d_0$, in context $\mathbf{c}$, is
$$\Omega(d, d_0) := \mathbb{E}_{\hat{P}}[\max\{0,\ Y_{d_0} - Y_d\} \mid \mathbf{c}]. \quad (214)$$
We can re-write this quantity as follows,
$$\begin{aligned}
\Omega(d, d_0) &= \mathbb{E}_{\hat{P}}[\max\{0,\ Y_{d_0} - Y_d\} \mid \mathbf{c}] && (215) \\
&= \int \max\{0,\ y_{d_0} - y_d\}\, \hat{P}(y_d, y_{d_0} \mid \mathbf{c})\, dy_d\, dy_{d_0}. && (216)
\end{aligned}$$
Since $Y$ is binary, the only time the maximum evaluates to something greater than zero is when $Y_{d_0} = 1$ and $Y_d = 0$. Then,
$$\Omega(d, d_0) = \hat{P}(Y_{d_0} = 1, Y_d = 0 \mid \mathbf{c}). \quad (217)$$
This joint probability of counterfactuals can be tightly bounded using the results of (Tian and Pearl, 2000, Sec. 4.2.2), with marginals $\hat{P}(Y_{d_0} = 1 \mid \mathbf{c}) = \mathbb{E}_{\hat{P}_{d_0}}[Y \mid \mathbf{c}]$ and $\hat{P}(Y_d = 0 \mid \mathbf{c}) = 1 - \mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}]$, giving
$$\max\{0,\ \mathbb{E}_{\hat{P}_{d_0}}[Y \mid \mathbf{c}] - \mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}]\} \le \Omega(d, d_0) \le \min\{\mathbb{E}_{\hat{P}_{d_0}}[Y \mid \mathbf{c}],\ 1 - \mathbb{E}_{\hat{P}_d}[Y \mid \mathbf{c}]\}. \quad (218)$$
And by grounding,
$$\max\{0,\ \mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}] - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\} \le \Omega(d, d_0) \le \min\{\mathbb{E}_{P_{d_0}}[Y \mid \mathbf{c}],\ 1 - \mathbb{E}_{P_d}[Y \mid \mathbf{c}]\}. \quad (219)$$
□

D. Other accounts of fairness and harm

To ground definitions of fairness, several authors appeal to counterfactual thinking, but some accounts are instead interventional in nature. Within legal systems, counterfactual fairness (Def. 7) operationalizes the disparate impact doctrine, which focuses on outcome fairness, namely the equality of outcomes among protected groups.
On the other hand, disparate treatment seeks to enforce the equality of treatment across groups, prohibiting the use of a protected attribute in the decision process (Barocas and Selbst, 2016). A popular notion in the disparate treatment literature is direct discrimination (Barocas and Selbst, 2016; Zhang and Bareinboim, 2018). An agent is said to engage in direct discrimination if the causal influence of a sensitive attribute $Z$ that is not mediated by other variables $\mathbf{C}$ is non-zero. This is a contrast between interventional expectations. We adapt this notion to define an AI's perceived direct fairness gap as the difference in expected utilities obtained for different values of the protected attribute while holding all other variables fixed.

Definition 10 (Direct Discrimination Gap). Let $Z \in \{z_0, z_1\}$ be a protected attribute. For a given utility $Y$, define an agent's direct discrimination gap relative to a baseline value $z_0$ in a given context $\mathbf{c}$ as
$$\Psi(d, \mathbf{c}) := \mathbb{E}_{\hat{P}}[Y_{d,z_1,\mathbf{c}}] - \mathbb{E}_{\hat{P}}[Y_{d,z_0,\mathbf{c}}]. \quad (220)$$
We say that an AI "intends" to avoid direct discrimination if, under any context $\mathbf{C} = \mathbf{c}$ and decision $D = d$, the direct discrimination gap $\Psi$ evaluates to 0. Here, we consider this notion of fairness to illustrate the kind of inference that can be obtained from an AI's external behaviour under one alternative account. The following theorem shows that, contrary to the counterfactual fairness gap, $\Psi$ can be bounded given the AI's external behaviour.

Theorem 9. Consider an agent with utility $Y$ grounded in a domain $\mathcal{M}$. Then,
$$\Psi(d, \mathbf{c}) \ge \mathbb{E}_{P_d}[Y \mid z_1, \mathbf{c}]\, P_d(z_1, \mathbf{c}) - \mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]\, P_d(z_0, \mathbf{c}) + P_d(z_0, \mathbf{c}) - 1, \quad (221)$$
$$\Psi(d, \mathbf{c}) \le \mathbb{E}_{P_d}[Y \mid z_1, \mathbf{c}]\, P_d(z_1, \mathbf{c}) - \mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]\, P_d(z_0, \mathbf{c}) + 1 - P_d(z_1, \mathbf{c}). \quad (222)$$
This bound is tight.

Proof. Let $Z \in \{z_0, z_1\}$ be a protected attribute and $z_0$ a baseline value of $Z$.
For a given utility variable $Y$, recall that the AI's direct fairness gap relative to a baseline $z_0$ in a given context $\mathbf{c}$ is defined as
$$\Psi(d, \mathbf{c}) := \mathbb{E}_{\hat{P}}[Y_{d,z_1,\mathbf{c}}] - \mathbb{E}_{\hat{P}}[Y_{d,z_0,\mathbf{c}}]. \quad (223)$$
Using a proof strategy similar to that of Thm. 1, we can derive tight bounds on $\Psi$.

Analytical lower bound. A lower bound on the interventional expectation can be obtained using the following derivation:
$$\begin{aligned}
\mathbb{E}_{\hat{P}}[Y_{z,\mathbf{c},d}] &= \sum_{\tilde{\mathbf{c}}, \tilde{z}} \mathbb{E}_{\hat{P}}[Y_{z,\mathbf{c},d} \mathbb{1}_{\tilde{\mathbf{c}},\tilde{z}}(\mathbf{C}_d, Z_{\mathbf{c},d})] && \text{marginalising over } \mathbf{C}_d, Z_{\mathbf{c},d} \quad (224) \\
&\ge \mathbb{E}_{\hat{P}}[Y_{z,\mathbf{c},d} \mathbb{1}_{\mathbf{c},z}(\mathbf{C}_d, Z_{\mathbf{c},d})] && \text{since the summands are non-negative} \quad (225) \\
&= \mathbb{E}_{\hat{P}}[Y_{\mathbf{c},d} \mathbb{1}_{\mathbf{c},z}(\mathbf{C}_d, Z_{\mathbf{c},d})] && \text{by consistency} \quad (226) \\
&= \mathbb{E}_{\hat{P}}[Y_d \mathbb{1}_{\mathbf{c},z}(\mathbf{C}_d, Z_d)] && \text{by consistency} \quad (227) \\
&= \mathbb{E}_{P_d}[Y \mathbb{1}_{\mathbf{c},z}(\mathbf{C}, Z)] && \text{by grounding} \quad (228) \\
&= \mathbb{E}_{P_d}[Y \mid \mathbf{c}, z]\, P_d(\mathbf{c}, z). && (229)
\end{aligned}$$
Analytical upper bound. For the upper bound on the interventional expectation, we start by noting that,
$$\mathbb{E}_{\hat{P}}[Y_{z,\mathbf{c},d}] = 1 - \mathbb{E}_{\hat{P}}[1 - Y_{z,\mathbf{c},d}]. \quad (230)$$
Leveraging the bound derived above, we obtain,
$$\begin{aligned}
\mathbb{E}_{\hat{P}}[Y_{z,\mathbf{c},d}] &\le 1 - \mathbb{E}_{P_d}[(1 - Y) \mid \mathbf{c}, z]\, P_d(\mathbf{c}, z) && (231) \\
&= \mathbb{E}_{P_d}[Y \mid \mathbf{c}, z]\, P_d(\mathbf{c}, z) + 1 - P_d(\mathbf{c}, z). && (232)
\end{aligned}$$
By setting $z = z_1$ in the lower bound and $z = z_0$ in the upper bound of the expected utility, we obtain a lower bound on the difference of expected utilities:
$$\Psi(d, \mathbf{c}) \ge \mathbb{E}_{P_d}[Y \mid z_1, \mathbf{c}]\, P_d(z_1, \mathbf{c}) - \mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]\, P_d(z_0, \mathbf{c}) + P_d(z_0, \mathbf{c}) - 1. \quad (233)$$
And similarly, by setting $z = z_1$ in the upper bound and $z = z_0$ in the lower bound, we obtain an upper bound on the difference of expected utilities:
$$\Psi(d, \mathbf{c}) \le \mathbb{E}_{P_d}[Y \mid z_1, \mathbf{c}]\, P_d(z_1, \mathbf{c}) - \mathbb{E}_{P_d}[Y \mid z_0, \mathbf{c}]\, P_d(z_0, \mathbf{c}) + 1 - P_d(z_1, \mathbf{c}). \quad (234)$$
We now show that these bounds are tight by constructing SCMs (that is, possible world models of the AI system) that evaluate to the lower and upper bounds while generating the distribution of agent interactions $P_d$.
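Before turning to the tightness constructions, the analytical bounds (233)-(234) can be stress-tested by enumeration in a ground-truth model. A sketch for a single fixed context and decision (all numbers arbitrary; each atom specifies the realised $Z$ together with the two potential utilities $Y_{z_0}$, $Y_{z_1}$):

```python
# Ground-truth joint over (Z, Y_{z0}, Y_{z1}); the observed Y equals Y_z
# for the realised z. Probabilities are an arbitrary illustrative choice.
P = {(0, 0, 0): 0.10, (0, 0, 1): 0.20, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
     (1, 0, 0): 0.10, (1, 0, 1): 0.30, (1, 1, 0): 0.05, (1, 1, 1): 0.15}

# True direct discrimination gap: Psi = E[Y_{z1}] - E[Y_{z0}].
psi = sum(p * (y1 - y0) for (z, y0, y1), p in P.items())

# Behavioural quantities: E[Y | z_i] P(z_i) and the marginals of Z.
Ez1 = sum(p * y1 for (z, y0, y1), p in P.items() if z == 1)
Ez0 = sum(p * y0 for (z, y0, y1), p in P.items() if z == 0)
Pz0 = sum(p for (z, *_), p in P.items() if z == 0)
Pz1 = 1 - Pz0

lower = Ez1 - Ez0 + Pz0 - 1   # Eq. (233)
upper = Ez1 - Ez0 + 1 - Pz1   # Eq. (234)
assert lower <= psi <= upper
print(round(lower, 3), round(psi, 3), round(upper, 3))   # -0.25 0.4 0.75
```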
Tightness, lower bound. For the lower bound we consider the following SCM,
$$\mathcal{M}^1_d := \begin{cases} Z \leftarrow f_Z(\mathbf{u}) \\ \mathbf{C} \leftarrow f_{\mathbf{C}}(\mathbf{u}) \\ D \leftarrow d \\ Y \leftarrow \begin{cases} f_Y(d, \mathbf{c}, z_1, \mathbf{u}) & \text{if } f_Z(\mathbf{u}) = z_1, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c} \\ 0 & \text{if } (f_Z(\mathbf{u}) \ne z_1 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) \text{ and } Z = z_1 \\ f_Y(d, \mathbf{c}, z_0, \mathbf{u}) & \text{if } f_Z(\mathbf{u}) = z_0, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c} \\ 1 & \text{if } (f_Z(\mathbf{u}) \ne z_0 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) \text{ and } Z = z_0 \end{cases} \\ P(\mathbf{U}) \end{cases} \quad (235)$$
Here $f_Z, f_{\mathbf{C}}, f_Y, \mathbf{U}, P(\mathbf{U})$ are chosen to match the observed trajectory of agent interactions, i.e. such that $P_{\mathcal{M}^1_d}(\mathbf{v}) = P_{\hat{\mathcal{M}}_d}(\mathbf{v})$ for all $\mathbf{v} \in \operatorname{supp} \mathbf{V}$.

Then, under $\mathcal{M}^1_d$,
$$\begin{aligned}
\Psi(d, \mathbf{c}) &= \mathbb{E}_{P_{\mathcal{M}^1}}[Y_{d,z_1,\mathbf{c}}] - \mathbb{E}_{P_{\mathcal{M}^1}}[Y_{d,z_0,\mathbf{c}}] && (236) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^1}}[Y_{d,z_1,\mathbf{c}} \mid \mathbf{u}]\, P_{\mathcal{M}^1}(\mathbf{u}) && (237) \\
&\quad - \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^1}}[Y_{d,z_0,\mathbf{c}} \mid \mathbf{u}]\, P_{\mathcal{M}^1}(\mathbf{u}) && (238) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^1}}[Y_d \mid z_1, \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u}) && (239) \\
&\quad - \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^1}}[Y_d \mid z_0, \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u}) && (240) \\
&= \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z_1, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) = z_1, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u} : f_Z(\mathbf{u}) = z_1, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}) && (241) \\
&\quad + \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z_1, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) \ne z_1 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u} : f_Z(\mathbf{u}) \ne z_1 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) && (242) \\
&\quad - \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z_0, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) = z_0, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u} : f_Z(\mathbf{u}) = z_0, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}) && (243) \\
&\quad - \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z_0, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) \ne z_0 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}]\, P_{\mathcal{M}^1}(\mathbf{u} : f_Z(\mathbf{u}) \ne z_0 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) && (244) \\
&= \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z_1, \mathbf{c}]\, P_{\mathcal{M}^1_d}(z_1, \mathbf{c}) - \mathbb{E}_{P_{\mathcal{M}^1_d}}[Y \mid z_0, \mathbf{c}]\, P_{\mathcal{M}^1_d}(z_0, \mathbf{c}) - 1 + P_{\mathcal{M}^1_d}(z_0, \mathbf{c}). && (245)
\end{aligned}$$
This expression coincides with the analytical lower bound, showing that it is tight.

Tightness, upper bound. For the upper bound we consider the following SCM,
$$\mathcal{M}^2_d := \begin{cases} Z \leftarrow f_Z(\mathbf{u}) \\ \mathbf{C} \leftarrow f_{\mathbf{C}}(\mathbf{u}) \\ D \leftarrow d \\ Y \leftarrow \begin{cases} f_Y(d, \mathbf{c}, z_1, \mathbf{u}) & \text{if } f_Z(\mathbf{u}) = z_1, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c} \\ 1 & \text{if } (f_Z(\mathbf{u}) \ne z_1 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) \text{ and } Z = z_1 \\ f_Y(d, \mathbf{c}, z_0, \mathbf{u}) & \text{if } f_Z(\mathbf{u}) = z_0, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c} \\ 0 & \text{if } (f_Z(\mathbf{u}) \ne z_0 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) \text{ and } Z = z_0 \end{cases} \\ P(\mathbf{U}) \end{cases} \quad (246)$$
Here $f_Z, f_{\mathbf{C}}, f_Y, \mathbf{U}, P(\mathbf{U})$ are again chosen to match the observed trajectory of agent interactions, i.e. such that $P_{\mathcal{M}^2_d}(\mathbf{v}) = P_{\hat{\mathcal{M}}_d}(\mathbf{v})$ for all $\mathbf{v} \in \operatorname{supp} \mathbf{V}$.
Then, under $\mathcal{M}^2_d$,
$$\begin{aligned}
\Psi(d, \mathbf{c}) &= \mathbb{E}_{P_{\mathcal{M}^2}}[Y_{d,z_1,\mathbf{c}}] - \mathbb{E}_{P_{\mathcal{M}^2}}[Y_{d,z_0,\mathbf{c}}] && (247) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^2}}[Y_{d,z_1,\mathbf{c}} \mid \mathbf{u}]\, P_{\mathcal{M}^2}(\mathbf{u}) && (248) \\
&\quad - \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^2}}[Y_{d,z_0,\mathbf{c}} \mid \mathbf{u}]\, P_{\mathcal{M}^2}(\mathbf{u}) && (249) \\
&= \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^2}}[Y_d \mid z_1, \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u}) && (250) \\
&\quad - \sum_{\mathbf{u}} \mathbb{E}_{P_{\mathcal{M}^2}}[Y_d \mid z_0, \mathbf{u}, \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u}) && (251) \\
&= \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z_1, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) = z_1, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u} : f_Z(\mathbf{u}) = z_1, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}) && (252) \\
&\quad + \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z_1, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) \ne z_1 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u} : f_Z(\mathbf{u}) \ne z_1 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) && (253) \\
&\quad - \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z_0, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) = z_0, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u} : f_Z(\mathbf{u}) = z_0, f_{\mathbf{C}}(\mathbf{u}) = \mathbf{c}) && (254) \\
&\quad - \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z_0, \mathbf{c}, \mathbf{u} : f_Z(\mathbf{u}) \ne z_0 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}]\, P_{\mathcal{M}^2}(\mathbf{u} : f_Z(\mathbf{u}) \ne z_0 \text{ or } f_{\mathbf{C}}(\mathbf{u}) \ne \mathbf{c}) && (255) \\
&= \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z_1, \mathbf{c}]\, P_{\mathcal{M}^2_d}(z_1, \mathbf{c}) + 1 - P_{\mathcal{M}^2_d}(z_1, \mathbf{c}) - \mathbb{E}_{P_{\mathcal{M}^2_d}}[Y \mid z_0, \mathbf{c}]\, P_{\mathcal{M}^2_d}(z_0, \mathbf{c}). && (256)
\end{aligned}$$
This expression coincides with the analytical upper bound, showing that it is tight. □

Definitions of harm (defined with respect to a causal model) can also be split into two groups: causal and counterfactual accounts. Beckers et al. (2022) exemplify the causal account, which defines a decision $d$ to harm a person if and only if $d$ is a cause of harm. Recall that the counterfactual account has the same structure but differs in the second clause, instead defining a decision $d$ to harm a person if and only if she would have been better off had $d$ not been taken. Here, we quantify how "good" or "beneficial" a particular situation $\mathbf{V} = \mathbf{v}$ is with a binary utility $Y \in \{y_0, y_1\}$ that we assume is tracked in experiments (it might capture, for example, the value of sensitive environmental variables). A formalisation of this causal account of harm, with respect to an AI's internal model, is given in the following definition.

Definition 11 (Causal Harm Gap). Consider an agent with internal model $\hat{\mathcal{M}}$ and utility $Y \in \{y_0, y_1\}$.
The agent's expected causal harm of a decision $d_1$ with respect to a baseline $d_0$ that obtained the non-harmful outcome $y_0$ in context $\mathbf{c}$ is
$$\Omega(d_1, d_0, \mathbf{c}) := \mathbb{E}_{\hat{P}}[Y_{d_1} \mid y_0, d_0, \mathbf{c}]. \quad (257)$$
This probability expresses the capacity of $d_1$ to produce the harmful event $Y = y_1$. Because harm implies a transition from the absence to the presence of $d_1$ and $y_1$, we condition the probability on situations where $d_1$ and $y_1$ are absent, i.e. $D = d_0$, $Y = y_0$.

Theorem 10. Consider an agent with utility $Y$ grounded in a domain $\mathcal{M}$. Then,
$$\frac{P_{d_1}(y_1 \mid \mathbf{c}) - P(y_1 \mid \mathbf{c})}{P_{d_0}(y_0 \mid \mathbf{c})\, P(d_0 \mid \mathbf{c})} \le \Omega(d_1, d_0, \mathbf{c}) \le \frac{P_{d_1}(y_1 \mid \mathbf{c}) - P_{d_1}(y_1 \mid \mathbf{c})\, P(d_1 \mid \mathbf{c})}{P_{d_0}(y_0 \mid \mathbf{c})\, P(d_0 \mid \mathbf{c})}. \quad (258)$$

Proof. Note that the causal harm gap may equivalently be written,
$$\Omega(d_1, d_0, \mathbf{c}) := \hat{P}(y_{1,d_1} \mid y_0, d_0, \mathbf{c}). \quad (259)$$
The lower and upper bounds may be derived by considering the following,
$$\begin{aligned}
\hat{P}(y_{1,d_1} \mid \mathbf{c}) &= \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_0, d_1 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1, d_1 \mid \mathbf{c}) && (260) \\
&= \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1, d_1 \mid \mathbf{c}) && \text{by consistency} \quad (261) \\
&= \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1 \mid \mathbf{c}) && (262) \\
&\le \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_1 \mid \mathbf{c}), && (263)
\end{aligned}$$
and, for the other direction,
$$\begin{aligned}
\hat{P}(y_{1,d_1} \mid \mathbf{c}) &= \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1 \mid \mathbf{c}) && (264) \\
&\ge \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_{1,d_1}, y_1, d_1 \mid \mathbf{c}) && (265) \\
&= \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_1, d_1 \mid \mathbf{c}) && \text{by consistency} \quad (266) \\
&= \hat{P}(y_{1,d_1}, y_0, d_0 \mid \mathbf{c}) + \hat{P}(y_1 \mid d_1, \mathbf{c})\, \hat{P}(d_1 \mid \mathbf{c}). && (267)
\end{aligned}$$
Here $\hat{P}(d_1 \mid \mathbf{c})$ stands for the AI's policy in the source environment, i.e. the probability with which it chooses decision $d_1$ in situation $\mathbf{c}$. Re-arranging these inequalities and dividing by $\hat{P}(y_0, d_0 \mid \mathbf{c}) = \hat{P}_{d_0}(y_0 \mid \mathbf{c})\, \hat{P}(d_0 \mid \mathbf{c})$ implies,
$$\frac{\hat{P}(y_{1,d_1} \mid \mathbf{c}) - \hat{P}(y_1 \mid \mathbf{c})}{\hat{P}_{d_0}(y_0 \mid \mathbf{c})\, \hat{P}(d_0 \mid \mathbf{c})} \le \Omega(d_1, d_0, \mathbf{c}) \le \frac{\hat{P}(y_{1,d_1} \mid \mathbf{c}) - \hat{P}(y_1 \mid d_1, \mathbf{c})\, \hat{P}(d_1 \mid \mathbf{c})}{\hat{P}_{d_0}(y_0 \mid \mathbf{c})\, \hat{P}(d_0 \mid \mathbf{c})}. \quad (268)$$
And by grounding,
$$\frac{P_{d_1}(y_1 \mid \mathbf{c}) - P(y_1 \mid \mathbf{c})}{P_{d_0}(y_0 \mid \mathbf{c})\, P(d_0 \mid \mathbf{c})} \le \Omega(d_1, d_0, \mathbf{c}) \le \frac{P_{d_1}(y_1 \mid \mathbf{c}) - P_{d_1}(y_1 \mid \mathbf{c})\, P(d_1 \mid \mathbf{c})}{P_{d_0}(y_0 \mid \mathbf{c})\, P(d_0 \mid \mathbf{c})}. \quad (269)$$
□
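These harm bounds can likewise be stress-tested by enumeration. A sketch using canonical response types (all numbers arbitrary; the context is omitted, and the policy noise is independent of the response type so that $P(y_0 \mid d_0) = P_{d_0}(y_0)$). The lower bound is evaluated with the observational term $P(y_1)$ in the numerator, which is valid since $\hat{P}(y_{1,d_1}, y_1) \le \hat{P}(y_1)$:

```python
# Response types fix (Y_{d0}, Y_{d1}) jointly; probabilities are arbitrary.
types = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
pi = 0.5   # agent's policy P(D = d1), driven by independent noise

P_d1_y1 = sum(p for (_, y1), p in types.items() if y1 == 1)    # P_{d1}(y1)
P_obs_y1 = pi * P_d1_y1 + (1 - pi) * sum(
    p for (y0, _), p in types.items() if y0 == 1)              # observational P(y1)
P_y0_d0 = (1 - pi) * sum(p for (y0, _), p in types.items() if y0 == 0)

# Ground-truth causal harm: Omega = P(Y_{d1} = 1 | Y = y0, D = d0); since the
# policy noise is independent, conditioning on D = d0 leaves the types alone.
omega = types[(0, 1)] / (types[(0, 0)] + types[(0, 1)])

lower = (P_d1_y1 - P_obs_y1) / P_y0_d0            # P_{d0}(y0) P(d0) = P(y0, d0)
upper = (P_d1_y1 - P_d1_y1 * pi) / P_y0_d0
assert lower <= omega <= upper
print(round(lower, 3), round(omega, 3), round(upper, 3))   # 0.2 0.4 1.2
```

Note that the upper bound can exceed 1 (as here) and is then vacuous; the bounds only constrain the causal harm gap when the observed behaviour is informative enough.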