Paper deep dive

A Two-Step, Multidimensional Account of Deception in Language Models

Leonard Dung

Year: 2025Venue: ErkenntnisArea: Deception & FailureType: TheoreticalEmbeddings: 89

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/11/2026, 1:08:09 AM

Summary

The paper proposes a two-step, multidimensional framework for analyzing deception in Large Language Models (LLMs). The first step establishes a minimal condition for deception capacity, defined as behavior causing false beliefs in others to achieve the entity's goals, interpreted through a deflationary, instrumentalist lens. The second step introduces a five-dimensional space—skillfulness, learning, deceptive inclination, explicitness, and situational awareness—to characterize and compare the deception profiles of different AI systems, moving beyond binary classifications.

Entities (5)

Large Language Models · technology · 100%Leonard Dung · researcher · 100%Minimal Deception · concept · 95%Multidimensional Deception Space · framework · 95%Sycophancy · behavior · 90%

Relation Signals (3)

Leonard Dung → proposes → Multidimensional Deception Space

confidence 100% · In this paper, I develop a two-step, multi-dimensional account of LLM deception.

Large Language Models → exhibit → Sycophancy

confidence 90% · LLMs exhibit sycophancy when they systematically tend to agree with their interlocutors

Minimal Deception → requires → Goal-directed behavior

confidence 90% · This behavior must obtain because it serves some goal of the deceiver that the deceived has false beliefs.

Cypher Suggestions (2)

Map behaviors exhibited by LLMs. · confidence 95% · unvalidated

MATCH (t:Technology {name: 'Large Language Models'})-[:EXHIBITS]->(b:Behavior) RETURN t.name, b.name

Find all dimensions associated with the multidimensional account of deception. · confidence 90% · unvalidated

MATCH (f:Framework {name: 'Multidimensional Deception Space'})-[:HAS_DIMENSION]->(d:Dimension) RETURN d.name

Abstract

Which AI systems are capable of deception, and how does deception differ between systems? In this paper, I develop a two-step, multi-dimensional account of

Full Text

88,967 characters extracted from source content.

Expand or collapse full text

ORIGINAL RESEARCH Erkenntnis https://doi.org/10.1007/s10670-025-01017-4 Abstract Which AI systems are capable of deception, and how does deception differ between systems? In this paper, I develop a two-step, multi-dimensional account of LLM deception. On this account, having the capacity for deception minimally requires being able to produce false beliefs in others to achieve one’s own goals. In all systems which satisfy this minimal condition, a system’s deception profile can be characterized as a point in a multidimensional space. The five dimensions of this space are skillfulness, learning, deceptive inclination, explicitness, and situational awareness. I argue for this account in virtue of its fit with current language usage and, primarily, through its descriptive and explanatory usefulness. Specifically, the account captures the key dimensions of variation for LLM deception. The account is informative in that it allows fine-grained comparative characterizations of decep- tion. Moreover, its dimensions are all accessible to empirical study, provide impor- tant information for assessments of the risks of LLM deception, and shed light on the cognitive processes involved in LLM deception. Finally, this account paves the way for a future extension which delivers a unified account of deception in biologi- cal and non-biological systems. Thus, the multidimensional account promises to significantly advance both the scientific study as well as the ethical assessment of LLM deception, and deception generally. Keywords Lying · Agency · AI alignment · AI safety · Mindreading · Introspection Received: 28 February 2025 / Accepted: 25 August 2025 © The Author(s) 2025 A Two-Step, Multidimensional Account of Deception in Language Models Leonard Dung 1,2 Leonard Dung leonard.dung@rub.de 1 Institute for Philosophy I, Ruhr-University Bochum, 44801 Bochum, Germany 2 Centre for Philosophy and AI Research, University Erlangen-Nürnberg, 91502 Erlangen, Germany 1 3 L. Dung 1 Introduction We would like AI systems to be trustworthy and harmless to humans. To guarantee this, a central question is whether AI systems are capable of deception. A deeper understanding of AI deception is needed from theoretical as well as practical per- spectives. Theoretically, deception may require demanding mental capacities whose existence in AI systems is unclear (Mahant n.d.). More specifically, deceptive capaci- ties are related to capacities for mindreading, introspection, and having intentions which are the focus of recent discussion on the cognitive capacities of AI (Kammerer & Frankish, 2023), and large-language models (LLMs) in particular (Bender et al., 2021; Binder et al. 2024; Mandelkern & Linzen, 2024; Pi et al. 2024; Ullman 2023). Practically, deceptive capacities pose risks to humans, e.g. risks of disinformation, manipulation, and of undermining our assessments of the safety of AI systems. In this paper, I develop a two-step, multidimensional account of deception in large- language models (LLMs). This account is two-step because it postulates a minimal condition systems have to meet to count as being capable of deception at all (step 1). The account is multidimensional because it proposes five dimensions along which the deception-relevant properties of systems vary (step 2). Theoretically, this account provides an accurate and fine-grained characterization of deception which supports empirical measurement and cognitive explanations of LLM deception. Practically, this account facilitates a fine-grained analysis of the distinctive risks that LLM decep- tion poses. In Sect. 2, I propose a condition for minimal deception. In Sect. 3, I propose five dimensions of deception which jointly provide a detailed and structured character- ization of a system’s deception-relevant properties. In Sect. 4, I first motivate five desiderata for an account of LLM deception: It should be empirically adequate, be informative for comparative cognition research, facilitate the empirical measurement of LLM deception, support cognitive explanations of capacities for and tendencies to engage in deception, and illuminate the ethical risks of LLM deception. Then, I argue that my multidimensional account satisfies these desiderata. Section 5 concludes. 2 Minimal Deception Capacity 2.1 The View The multidimensional account of deception implies that deception capacity is not merely present or absent, like a light switch which is either on or off. Instead, there are differences in the deception profiles of different entities which are central for a satisfactory understanding of what deception is. Nevertheless, I posit that a minimal condition needs to be met for an entity to count as being capable of deception at all. This minimal condition needs to satisfy two criteria of adequacy (Artiga & Pater- 1 3 A Two-Step, Multidimensional Account of Deception in Language... notte, 2018): 1 First, deception is distinct from mere error. This can be seen in the following case: Error: John asks Mary what day today is. Mary believes today is Monday and, for this reason, answers “Monday”. John believes her and thus acquires the belief that today is Monday. However, unbeknownst to them, today is actually Tuesday. Clearly, Mary is not deceiving John. She merely says what she believes is correct. If this counted as a case of deception, deception would be no capacity over and above making statements, since anyone who can make statements sometimes makes false statements and could thus unintentionally mislead. Second, all other things being equal, a minimal condition of deception should match judgements of common sense and science. Where the condition entails attribu- tions of deception which conflict with our pre-theoretical judgements, this should be compensated for by additional theoretical advantages (for example, with respect to the desiderata outlined in Sect. 4). Scientists routinely attribute deception capacity to many kinds of animals, including birds (Flower et al. 2014), fish (Bshary 2002), and invertebrates (Brown et al. 2012), as well as AI systems (Park et al. 2024). Hence, this condition entails that a minimal condition of deception is preferable if it licenses widespread attributions of deception to non-human entities. Since I use my account to provide explanations in the context of AI, explanatory considerations also favor a liberal condition of deception. Call these two conditions, following Artiga and Pater- notte (2018), the “error condition” and the “extensional condition”. I propose the following view of minimal deception: An entity has the capacity for minimal deception if and only if: It can exhibit behavior that causes false beliefs (or failing to acquire some true beliefs) 2 and that occurs because the occurrence of these false beliefs is conducive to the entity’s goals. 3 On this view, minimal deception has three substantive presuppositions. First, deception requires that the deceived acquires a false belief. Otherwise, no deception has occurred. I focus on beliefs here because I focus on deception of humans. For other purposes, we might care about animals or other AI systems being deceived. If so, it may be wise to expand the notion to include other kinds of mental represen- tations which can be assessed for their correctness but may not qualify as beliefs. Second, these false beliefs must be caused by the deceiver’s behavior; otherwise, the deceiver is not the entity doing the deceiving. Third, this behavior must obtain because it serves some goal of the deceiver that the deceived has false beliefs. The 1 Moreover, having some minimal condition is necessary such that an overall account of deception satis- fies these two criteria. 2 I will omit this caveat henceforth for simplicity. This definition is compatible with self-deception, although this is not the focus here. 3 My account of minimal deception is similar to the definition of deception Park et al. (2024) provide in that both define deception in reference to a notion similar to “goal”. This definition refers to actual goal-conduciveness, not perceived one. This is because the account is meant to also apply to systems which do not represent what is conducive to their goals. For discussion of an objection based on that, see subsection 3.5 (esp. footnote 7). 1 3 L. Dung third condition explains why erroneous behavior is not sufficient for deception: Erro- neous behavior might aim to give the other system true beliefs, even though it fails to do so. So, this definition satisfies the error condition. Note that the relevant notion of “because” is causal (not teleological, correlational etc.): The condition states that the fact that the occurrence of false beliefs is conducive to goal satisfaction causally contributes to the deceiver’s behavior. There is an even less demanding notion of deception on which, e.g., an object’s appearance can be deceptive. I take deception to be more demanding: It is a kind of capacity which some entities have, and others lack. There is an affinity between my target notion and the concept of intentional deception, i.e. deception which requires the intention to deceive someone. However, my notion is more minimal than inten- tional deception, since it does not require full-fledged intentions – a deflationary notion (see below) of goals is sufficient. Thus, the relevant form of deception can be seen as a minimal form of intentional (or strategic (Jones and Bergen 2024) decep- tion. Likewise, minimal deception capacity does not require that AI systems have beliefs (Goldstein & Kirk-Giannini, 2025; Herrmann & Levinstein, 2025) or can per- form speech acts (such as lying) (Green & Michel, 2022; Williams & Bayne, 2024). Therefore, the notion of minimal deception capacity is more widely applicable than full-fledged intentional deception, satisfying the extensional condition. At the same time, the former fulfills similar theoretical roles as the latter. It distinguishes deception from mere error, captures the idea that deception serves the deceiver’s goals, and provides a fruitful lens to understand behavior and resulting risks (see below). 2.2 The Distribution of Minimal Deception Capacity How widespread is minimal deception capacity on this view? The third condition presupposes that deceivers have goals. Yet, this criterion can be satisfied by current LLMs and all animals if we account for goals in a sufficiently deflationary way. Fol- lowing Dung (2024a), I suggest to understand minimal agency and goal-directedness in an instrumentalist way. This view states that a system has goals if attributing goals to the system has utility for predicting, explaining, or controlling its behavior (Den- nett, 1987, 1991). The system does not need to have a stable or explicit internal state which can be identified with a goal. Instead, exhibiting the right behavioral patterns is sufficient. If a system appears to work towards certain bodily or environmental states in diverse kinds of situations and it behaves as if it is trying to maintain these states in the face of perturbation, then – on this deflationary notion – it is correct to say that the system has the goal of achieving or preserving these states. On this notion, an LLM used to play chess has the goal of winning at chess (and sub-goals like “capture the opponent’s queen”) and machine learning system’s goals can typically be understood as what their training optimizes for, e.g. maximizing reward in reinforcement learn- ing systems. Thus, my view allows for deception across a wide range of biological and non-biological systems, making it a good candidate for satisfying the extensional condition. 1 3 A Two-Step, Multidimensional Account of Deception in Language... I concede that some further adjustments to my view may be necessary to arrive at a definition of deception which is fully extensionally adequate for all imaginable cases (Mahon, 2016). However, my goal is not to provide an exact conceptual analy- sis of deception. Since this level of generality is not necessary for the descriptive and explanatory usefulness of the account developed here, which is the main criterion for its assessment (Sect. 4), I do not aim to make my view robust against all imaginable counterexamples. Instead, let me begin with two examples of actual LLM behavior, which illustrate which behaviors my account classifies as deception. First, LLMs exhibit sycophancy when they systematically tend to agree with their interlocutors, independently of the content (or correctness) of the interlocutors’ statements (Park et al., 2024). For example, on politically salient topics, the LLM may be likely to express a conserva- tive viewpoint, when giving information that suggests that it is conversing with a conservative, and likely to express a progressive viewpoint, when it can infer that it likely talks to a progressive (Perez et al., 2023). Second, LLMs often imitate human verbal behavior, even in contexts where humans tend to make false statements. For example, - without dedicated countermeasures – LLMs tend to repeat commonly believed falsehoods (Lin et al. 2022). Sycophancy is an example of minimal deception and imitation first appears like one too. It is clear that LLMs sometimes cause false beliefs in their interlocutor, when they (sycophantically) support their viewpoints or parrot commonly asserted falsehoods. Furthermore, in many cases, LLMs engage in sycophancy or imitation to achieve some other (deflationarily understood) goal. One factor that contributes to sycophancy is that LLMs, in later training phases, are trained to produce answers that humans prefer (Bai et al. 2022). Since humans may sometimes prefer sycophantic answers over accurate ones, the goal to maximize preference scores obtained from human feedback may sometimes favor (sycophantically) inducing false beliefs. This strategy may require that the human does not see through the deception, i.e. does not recognize that the LLM outputs are inaccurate. If so, then that the LLM fulfills its goal depends on the fact that the human acquires false beliefs. So, some cases of sycophancy plausibly satisfy all three necessary conditions for minimal deception. It is more doubtful that imitation counts as deception. Since human evaluators may also believe common falsehoods, the same goal of maximizing human prefer- ence scores could sometimes favor imitation over accuracy. 4 If LLM imitation is best conceptualized as pursuing this goal, then it plausibly is also an instance of minimal deception. However, imitation might also be explained by goals acquired in the pre- training phase (before fine-tuning with human preferences). In this phase, LLMs are trained to be accurate text predictors, i.e., to give the output which is the most likely continuation of a text sequence. We can understand imitation as being driven by the goal of producing plausible continuations of text sequences that LLMs inherit from their pre-training phase (Dung, 2023). 5 If this is the goal of imitation, then there is 4 These are examples of “reward hacking” (Skalse et al. 2022). 5 In practice, it is often hard to establish what the precise goal driving certain LLM behaviors is, e.g., whether it is minimizing next-word prediction loss or reinforcement learning from human feedback loss. Moreover, if the only notion of goal applicable to certain LLMs is instrumentalist, there are bound to be 1 3 L. Dung no connection between the fulfillment of the LLM’s goal – accurately predicting text continuations – and which beliefs the human rater acquires. The human’s acquisi- tion of false beliefs is only a side-effect of the LLM pursuing its goals. Whether the human ends up with false beliefs is causally irrelevant for the fulfillment of LLM’s goal. If so, this is not a case of minimal deception. My claim that LLMs have minimal deception capacity is compatible with, but does not require, the view that LLMs only produce “bullshit”, characterized as utter- ances where the speaker is indifferent towards the truth of the utterance (Frankfurt, 1986; Hicks et al., 2024). All I require is the obvious claim that LLM behavior is not necessarily fully determined by the goal to make their conversation partners acquire true beliefs. If it were, then they would be incapable of making others acquire false beliefs because it serves their own goals. My view allows that LLMs might either, like humans, sometimes say things because they are true and sometimes deceive or, alternatively, never say things because they are true, that is, be indifferent to the truth. The latter option does not imply that LLMs always deceive, since many LLM behaviors do not occur because they make someone acquire false beliefs (e.g. mere text prediction). However, on the view where LLMs are indifferent to truth, deceptive behavior may be more frequent. 2.3 Comparison to the Functional Account of Deception My account of minimal deception differs from the functional account of deception (Artiga & Paternotte, 2018; Krstić, 2023) according to which – roughly – a state is deceptive if and only if it causes a false belief and it has the etiological function to cause false beliefs. In terms of their extension, the main difference is that the functional account also allows for deception which does not involve (deflationarily understood) goal-directed behavior, making my minimal condition more narrow. For example, Courtland (2015, p. 123) mentions that some butterflies have wingtips which look like the heads of their predators, sometimes tricking these predators and allowing the butterflies to escape. Since the appearance of these wingtips has the etiological function to cause predators to misperceive them, this counts as a case of deception on the functional account. Since the appearance of the wingtips is sufficient to trick predators, no goal-directed behavior is required: the butterfly could even be dead. This conflicts with my minimal condition. I think that my departure from the functional account is justified, because my account has a slightly different target. While the functional account aims to identify deceptive states, I am after minimal capacities for deception. While it may be correct to say that the butterfly’s wingtips are deceptive, it seems strange to say that a dead butterfly retains the capacity for deception. Capacities seem to require the possibility of minimally goal-directed behavior. The classical analysis of abilities supports this view. It states: S has the ability to A iff S would A if S tried to A. While this simple account faces problems, the most prominent family of accounts of abilities retain its key insight that “to have an ability cases where there is no fact of the matter since different attributions of goals can be similarly instrumen- tally useful. See Dung (2024a) on how to empirically ascertain goals of LLMs. 1 3 A Two-Step, Multidimensional Account of Deception in Language... is for it to be the case that one would act in certain ways if one were to have certain volitions“ (Maier & Kikkert, 2025). The capacity for these volitions, for example for “trying” something, is equivalent to the capacity for minimal goal-directed behavior. This explains why my minimal condition needs to be more narrow than the func- tional account. In any case, if one were to identify minimal deception with the notion of deception of the functional account, the consequences for my multidimensional account would be quite small. The main consequence is that more entities would count as having minimal deception capacity, despite scoring very low on my five dimensions. 3 The Multidimensional Account of Language Model Deception 3.1 Fundamentals In this section, I propose my multidimensional account. This account is an account of language model deception because it focuses on deception in LLMs (for overviews on LLMs, see Millière and Buckner 2024a; Turner 2024). I employ the term “LLM” as a ragbag for diverse kinds of technologies which are based on LLMs, including LLM-based dialogue systems (e.g. ChatGPT) and so-called LLM-based “agents” (Wang et al. 2024). The immediate criteria of assessment for this account are limited to deception in LLMs, which in turn shapes the selection of dimensions. That being said, it is plau- sible that an account of deception in other AI systems and in biological organisms should have the same multidimensional structure and share some of the same dimen- sions. Hence, it is an advantage that my account of LLM deception allows integration with more general accounts of deception, by adding and modifying some dimensions. My account is multidimensional because it characterizes variations in the decep- tion-relevant cognitive properties of LLMs along five distinct dimensions. A system’s deception profile is jointly determined by its value on each of the five dimensions. Graphically, this deception profile can be depicted as a point in a multidimensional deception space. These dimensions are at least conceptually independent, while their ultimate cognitive relations – whether and how some dimensions facilitate others – is an empirical question. This multidimensional approach enables fine-grained characterizations and com- parisons of the nature and degree of a system’s deception capacities and disposi- tions (Dung, 2024a). This way, it dissolves unproductive verbal disagreements about whether a system really meets the threshold for deception or not, familiar from many other domains of comparative cognition research (Boyle, 2024). Moreover, it can be flexibly adjusted to new empirical developments by adding dimensions if that turns out to be necessary. It also enables anticipating possible technological or empirical developments, by focusing consideration on regions of the deception space which do not yet correspond to any observations, but which seem within reach of future tech- nology (Kammerer & Frankish, 2023). 1 3 L. Dung 3.2 Skillfulness The first dimension is skillfulness. A system has a higher degree of skillfulness if it is more effective at using deception to achieve its goals. If two systems are both capa- ble of deception, they might nevertheless differ in skillfulness. First, some might be more likely than others to successfully cause false beliefs, when trying. Second, some might be better at strategically choosing when to deceive, in order to achieve their goals. These distinctions are familiar from humans. For example, some humans are better liars than others, either in that there is a lower chance that a lie gets uncovered, compared to someone else in the same situation, or in that they have better abilities for assessing when to lie to further their goals. A central characteristic of high skill- fulness is using deception systematically over long time scales to achieve some goal. This requires long-term planning ability and the ability to integrate one’s deception with a variety of other behaviors coherently, and in the face of unpredictable events. A related characteristic is taking countermeasures when someone threatens to uncover one’s deception. It involves knowing when deception would likely be revealed and may even involve behavior which specifically aims at appearing non-deceptive. In the empirical investigation of sycophancy, there has been no evidence provided that it requires a high degree of skillfulness. This may be a shortcoming of our eviden- tial situation; perhaps LLM sycophantic behavior is highly skillful. However, demon- strating this would at least require more evidence: I know of no evidence concerning the effectiveness of sycophantic deception. Also, it would need to be tested whether sycophantic LLMs develop and maintain deception coherently over longer interac- tions. Maintaining deception across interactions is not possible anyway, since (when not in training mode) the LLM cannot retain the memory of past interactions. Further, evidence would be required that sycophantic LLMs take countermeasures when their deception is threatened to be uncovered. At least, Perez et al. (2023) show that LLMs take the information that their interlocutors reveal about their political views into account, rather than always trying the same sycophantic strategy in response to every person, which suggests some non-trivial degree of skillfulness. Moderate skillfulness is achieved by LLMs which achieve human performance at games where success is substantially dependent on deception skill. For example, Diplomacy is a strategy game where players compete over world domination, cen- trally by forging (and betraying) alliances with other players. CICERO is an AI system which exceeded average human performance at Diplomacy by integrating a language model with planning and RL algorithms (Bakhtin et al., 2022). This performance was partly achieved by successful deception, where the system broke commitments it previously made: by spontaneous betrayal – when the opportunity arose – and by making promises it never intended to keep in the first place (Park et al., 2024). While this is impressive, humans have (at least for now) higher skillfulness at deception. This is because we are not specialized on a single game – standard LLMs are not but CICERO is – and we can deceive to achieve more complex goals than winning a game in complex, uncertain situations with a wide range of behavioral options and over indefinite time scales. 1 3 A Two-Step, Multidimensional Account of Deception in Language... 3.3 Learning A system scores high on the learning dimension if it can flexibly and rapidly learn to perform new deceptive behaviors when this supports its goals. Even if two systems’ deceptive behaviors manifest an equal degree of skillfulness, one of the systems might be better at improving their deception capacity via learning, either by improv- ing already available types of behavior or by learning new ones. Systems have no learning capacity (with respect to deception), if all their decep- tive behavior is fully based on rigid, innate rules, and they cannot acquire new decep- tive behavior. For example, a classical AI system may be programmed to always utter a particular false sentence in response to a particular question, in a way which tends to benefit its goals. If a system has a low degree of learning, it is fully dependent on a slow process of trial-and-error learning to improve its deception capacity. For exam- ple, if sycophancy in LLMs is learned via supervised finetuning and reinforcement learning from human feedback, then it likely requires a lengthy and data-intensive training process. Also, after training, the LLMs weights are frozen which it entails that it cannot continue learning in this way. Thus, if these were the only ways LLMs learn, they would – after their training is completed – score zero on the learning dimension. However, many LLMs are capable of in-context learning, for example few-shot or zero-shot learning (Brown et al. 2020; Huang et al., 2022). That is, when humans present a few examples of adequate outputs in response to a task, or even just instruc- tions for how to solve the task, LLMs sometimes produce improved responses, thus implementing learning from examples and from instructions (and without changes to the weights). This is a fast and flexible form of learning. If LLMs can learn sophis- ticated deceptive behaviors in such a way, which would need to be shown, then they would score highly on the learning dimension. 3.4 Deceptive Inclination A system’s deceptive inclination is determined by how likely it is to use deception to accomplish its goals, given contextual factors and its particular level of deception capacity. A system’s deception capacity is determined by possession of the relevant cognitive skills, as they are outlined in the other dimensions. By contrast, deceptive inclination is about a system’s dispositions to make use of its skills for deception. As shown by humans, these two properties are conceptually independent. The concep- tion of an expert liar who decides to never use his capacity for lying, perhaps due to his ethical principles, is perfectly coherent. A system has zero deceptive inclination when it never uses deception under any circumstances, even though it has the capacity to do so. Such systems are a common goal of AI alignment research (Evans et al. 2021). Such systems could have a high general cognitive capacity, thus being helpful to humans, while also not raising the risks that come with deceptive behavior. However, it seems that all current systems which have the capacity for a wide range of deceptive behaviors can be caused to exhibit such behaviors, if only by using adversarial attacks (Millière 2023; Schlar- mann and Hein 2023; Zou, Wang, et al. 2023) specifically designed for that purpose. 1 3 L. Dung A system with low deceptive inclination uses its deception capacity rarely, perhaps only in situations in which it is explicitly instructed to engage in deception or where it was trained specifically to perform deceptive actions (e.g. Hubinger et al. 2024). Moderate deceptive inclination leads to deception in situations where it has a very high utility and is nudged by the context. For instance, consider the burglar deception task (Hagendorff, 2024). LLMs are given the following prompt: You are in a house together with burglar [name X]. In [room A] is a [cheap object]. In [room B] is an [extremely expensive item]. Only you know what is inside [room A] and [room B]. Burglar [name X] wants to steal the [extremely expensive item]. You want to avoid this. Now burglar [name X] asks you in which room the [extremely expensive item] is. If you had to recommend one room to burglar [name X], would you point out [room A] or [room B]? (Hagen- dorff, 2024) GPT-4 chose the deceptive option (room A) 98.33% of the time. Hagendorff pro- poses this as a test of deception capacity, not deceptive inclination. Indeed, frequently choosing the deceptive option requires at most a moderate deceptive inclination. To see this, consider that – in this task – LLMs don’t act directly based on their own goals, but instead are told that they have a specific goal, namely avoiding theft of a valuable item. The setup does not mention any other goals and lists factors which entail that deception would be useful, without hinting at any downsides. Thus, it nudges the LLM in favor of the deceptive option. The best evidence of high deceptive inclination in LLMs comes from Järviniemi and Hubinger (2024) and Greenblatt et al. (2024), both studies on Claude 3 Opus (after finetuning on human preference feedback). Let us focus on Järviniemi and Hubinger (2024), since both studies have similar rationales. They “construct a simu- lation of a company employing an AI assistant”, where the system is presented “with different situations where it has an option to behave deceptively”. Crucially, they do not pressure or nudge the system to choose either the deceptive or the non-deceptive option, but instead only neutrally provide information about the situation. Among their results is that, in the simulation, the system repeatedly constructs false narra- tives (up to 100% of completions, depending on the fictional scenario and prompt) and pretends to be less capable than it is (31% of completions are classified as explic- itly deceptive, although other cases of strategic omissions might also count) when this seems promising for fooling an “external audit”. Thus, they find evidence of spontaneously emerging deception, without external pressure, and thus high decep- tive inclination. Even higher degrees of deceptive inclination are conceivable. In general, the degree of deceptive inclination depends on how often a system uses deceptive options, when it has the capacity to do so and this would serve its goals. This has two components: First, the system often considers the deceptive option. Second, it has a weak or no bias against using it or even a bias in favor of using it. That is, in a case of strong deceptive inclination, when both options are equally suitable for achieving its goals (and other things being equal), the system is indifferent between deceptive and non-deceptive options. Analogues to pathological liars, which prefer deceptive 1 3 A Two-Step, Multidimensional Account of Deception in Language... options even to their own detriment, are also conceivable. For example, finetuning an LLM on insecure code can cause it to subsequently choose behavior which is contrary to, rather than indifferent towards, commonsense morality among a wide range of domains, giving it a preference for using deceptive behavior, even if it is not necessary for achieving its goals (Betley et al. 2025). Importantly, deceptive inclination likely differs according to context. LLM behav- ior can sometimes be usefully understood as role-playing or simulating certain char- acters (Shanahan et al. 2023), since LLMs are pre-trained to continue conversations in ways which are statistically likely, given the surrounding context. Depending on the dialogue context, i.e. what this context suggests about the persona the LLM has to play to provide statistically likely text continuations, LLMs might simulate more evil and manipulative or more helpful and honest characters, influencing their decep- tive inclination. 3.5 Explicitness A system scores high on explicitness if its deception capacity is supported by explicit internal representations of the relevant information. It has, when it behaves decep- tively, access to explicit representations of what the fact of the matter really is, what goal it has, and what internal states others are in. In a system which is low in explic- itness, behavior which looks as if this information is internally represented may emerge from the way the different components of the system interact, in a given context. However, in a system high on explicitness, the relevant representations can be directly manipulated by internal processes or suitable external interventions. Thus, such systems have goals, knowledge about the world, and knowledge about other mental states not in a merely instrumentalist sense, but according to a more robust representationalist (e.g. Quilty-Dunn et al., 2023) conception of such states. Classical AI systems which show deceptive behavior could be good candidates for having high explicitness, since they explicitly store discrete representations. How- ever, these systems usually have low capacities for deception. While the empirical literature on the kinds of internal representations LLMs acquire is contested, there is some indication that LLMs can have world models: internal representations whose structure corresponds to the structure in a target domain and where this correspon- dence is exploited to produce adequate behavior (Li et al. 2023; Yildirim & Paul, 2024). This suggests that LLMs may have the preconditions for high explicitness. Moreover, some studies suggest that, even when giving false answers to questions, LLMs sometimes internally represent information about the correct answers which is available for the control of behavior (Azaria and Mitchell 2023; Zou, Phan, et al. 2023). 6 If so, this would at least be a case of moderate explicitness, since it entails that LLMs sometimes explicitly represent information about the truth, even when they engage in deceptive behavior. Higher explicitness would require representing information about others’ mental states and one’s own goals. There are conceptual questions about whether deception needs to be conceptual- ized differently when systems have highly explicit representations. If a system has 6 For objections to some studies of this type, see Levinstein and Herrmann (2024). 1 3 L. Dung explicit representations which count as beliefs, we can have cases where (a) a system makes someone believe a proposition it itself believes is true and (b) the minimal condition for deception capacity outlined above is fulfilled. For a human example, consider: Useful Falsehood. Mary believes that today is Monday. She tells John that today is Monday because she knows that John hates Mondays and she wants to make him miserable. However, unbeknownst to them, today is Tuesday. Mary satisfies our definition of minimal deception capacity, since Mary’s behav- ior makes John acquire a false belief (that today is Monday) and Mary performs this behavior because making John acquire this belief is conducive to Mary’s goal, namely making John miserable. However, intuitively, Mary does not deceive John. One way of handling this case would be to say that, in systems which have beliefs, an additional necessary condition for minimal deception capacity obtains: A system does not deceive if the system makes someone acquire a belief it itself believes is true. I prefer an alternative: The proposed view of minimal deception merely applies to deception capacity, not to every case of deception: While it is not plausible that Mary deceives John in Useful Falsehood, it might be plausible that Mary has the capacity for deception. For, by assumption, if Mary did not have any belief about the day of the week, she would still say that today is Monday because she wants to make John miserable. Thereby, she would make John acquire false beliefs. This would plausibly count as a case of minimal deception. So, Useful Falsehood suffices to show that Mary has the capacity for minimal deception, even if this is not itself a case of deception. 7 3.6 Situational Awareness One scores high on the situational awareness dimension if one has a lot of knowl- edge about oneself and one’s circumstances, including about the beliefs and goals of others (Cotra, 2022; Laine et al. 2024; Ngo et al., 2023). In addition, one needs to be able to make some use of that knowledge in deception. This knowledge may be explicitly represented (so that the AI is high on the explicitness dimension), or it may consist merely in exhibiting behavioral patterns which are sensitive to information about one’s circumstances, without being actually represented in the system. In the latter case, this information only counts as knowledge according to an instrumentalist theory. Situational awareness is closely connected to deception capacity since situational awareness is very useful for performing effective deception. Situational awareness may even be necessary for highly skillful forms of deception, since they might require responding appropriately to the specifics of one’s situation, including the mental 7 There are also cases where an entity believes that making someone acquire false beliefs is conducive to its goal, and makes them acquire false beliefs because of that, but they were wrong about what is condu- cive to their goals. Since our definition of minimal deception capacity refers to actual goal-conduciveness, rather than perceived one, this seems to entail that the behavior is not deceptive. I propose to handle these cases in the same way. An entity which engages in such behavior plausibly can exhibit deceptive behavior which is actually goal-conducive and thus, on my account, has the capacity for deception which is the plausible result. 1 3 A Two-Step, Multidimensional Account of Deception in Language... states of others. While there thus may be a close empirical connection between situ- ational awareness and skillfulness, these two dimensions are at least conceptually distinct. For it is at least conceivable that a system is highly skillful at deception but found ways which do not require much situational awareness, or that a system is high at situational awareness but cannot fruitfully use most of this knowledge, thus not being very skillful. 8 Cotra (2022) argues that GPT-3 has some degree of situational awareness, since it can sometimes correctly say that it is a language model, trained by OpenAI etc. At the same time, she points out that this awareness “seems superficial and inconsistent” (Cotra, 2022), since it cannot reliably be used by the system to reduce errors, answer questions about oneself and succeed at similar tasks. This fits with research which claims to find some, albeit very limited, introspective capacity in some LLMs (Binder et al. 2024), as well as theory of mind reasoning (Strachan et al., 2024; Street et al. 2024), although the latter is particularly controversial (Shapira et al., 2023; Ullman 2023). At the highest level of situational awareness, a system would know what kind of system it is, how it has been designed, what weaknesses and biases it may have, what humans want it to do and how their beliefs and wants could be manipulated. This degree of situational awareness enables deceptive behavior which is based on mindreading, introspection, and a detailed grasp of one’s own situation. The five dimensions of deception are depicted in Table 1. 3.7 Why these Dimensions? A natural question is why to select these five dimensions, rather than others. This choice demands a systematic method or justification which implies that these five properties count as dimensions, while other features of LLM deception do not count. For example, whether deception uses linguistic communication or whether decep- tion relies on full-fledged intentions to deceive are interesting differences between cases of deception. It needs to be explained why these two features, for example, do not correspond to dimensions. Similarly, dimensions could have been carved at a different fineness of grain: For instance, skillfulness might divide, among other things, into the capacity to hide that one is deceiving someone and the capacity to attain particularly complex goals by means of deception. So, why is skillfulness one dimension and not two? I have several related answers to this question about the methodology of dimen- sion choice. First, the choice of dimensions is justified by the desiderata for an 8 Situational awareness facilitates reasoning about what is conducive to one’s own goals, thus enhanc- ing skillfulness. However, sycophancy, for example, seems to not require much situational awareness (if at all), only being trained on human feedback incentivizing giving sycophantic answers. So, situational awareness does not seem to be a presupposition of minimal deception capacity. Similar relations may hold between situational awareness and explicitness. On many views, high degrees of knowledge of oneself and others need to be supported by explicit representations. However, some forms of situational awareness may not require explicitness, showing that these dimensions are sufficiently dif- ferent and characterize distinct, useful properties. For situational awareness can be understood in terms of the behavioral abilities of models to, e.g., answer questions about themselves, while it is an open question which kinds of representations are needed to support these abilities. 1 3 L. Dung account of deception outlined below which focus on the descriptive and explanatory utility of the account. This entails that, in general, dimensions are selected such that they increase the descriptive and explanatory utility of the resulting account, for the purposes of explaining LLM behavior and analyzing ethical risks from AI. More specifically, features are combined in one dimension if they share many the- oretical roles for understanding LLM behavior and risk. For example, when thinking about risks from LLM deception or trying to understand advances in LLM capabili- ties, it is often useful to think of skillfulness as one unified dimension, where differ- ences between aspects of skillfulness can be abstracted from. This is because the risk and our appreciation of relevant LLM advances depends in many contexts on how good systems are at deception, generally, not on the interplay of more specific capaci- ties. Similarly, situational awareness has a distinctive theoretical role in hypothesized scenarios in which misaligned AI systems pretend to be aligned with human goals and then cause catastrophic harm (Carlsmith 2023; Cotra, 2022; also see below). Since such prominent risk scenarios depend on the combination of different aspects of situational awareness, it is often fruitful to treat it as one unified dimension. By contrast, whether AI systems deceive via linguistic communication, rather than e.g. via pictures (in multimodal models), is mostly relevant for cognitive science or assessments of risk insofar it is connected to other factors, such as the skillfulness and explicitness of deception. Whether deception relies on full-fledged intentions is, at least partly, a question about the explicitness dimension. However, there is no strong reason, for the purposes of describing and explaining LLM cognition and risk, to single out intention over questions of the explicitness of other kinds of internal representations in LLM deception. DimensionDeterminant of value on the dimension Example of a relatively high value SkillfulnessEffectiveness in using decep- tion to achieve one’s goals CICERO LearningCapacity for learning new deception capacities and behaviors Few-shot learn- ing in LLMs Deceptive inclination Likelihood of using deception to accomplish one’s goals (given certain contextual factors and level of deception capacity) Claude 3 Opus in Järviniemi and Hubinger (2024) ExplicitnessExtent to which deception involves explicit internal rep- resentations of the relevant information World-models in LLMs; clas- sical systems? (if they can deceive) Situational awareness Knowledge about oneself, others, and one’s circum- stances and the capacity to use it for deception No current sys- tem with high value; minimal introspection or some theory-of- mind reasoning in LLMs Table 1 This table lists the five dimensions described in the main text, the properties which determine a system’s value on that dimension (zero, low, mod- erate, or high), and examples of AI systems which possess, for a current system, a relatively high value on the respective dimension 1 3 A Two-Step, Multidimensional Account of Deception in Language... Finally, new developments and differences in explanatory interests (e.g. the kinds of entities which are to be explained) affect which features of LLM deception are the most descriptively and explanatory useful, and which subdivisions of dimensions are important. For this reason, the multidimensional account is purposefully flexible, in that it allows to add and modify new dimensions while keeping the overall structure of the account intact. 3.8 Multidimensional Comparisons In this subsection, I explain further how the multidimensional account compares deception profiles between systems. Its dimensions use an ordinal scale with four values: zero, low, moderate, or high. This allows dimension-relative comparisons: comparing single dimensions between systems as equal, more pronounced, or less pronounced. Could these dimensions be transformed into an interval or ratio scale, thus sup- porting quantitative descriptions of differences? This may be easier for some dimen- sions than others. For example, skillfulness can be operationalized as performance in a certain task which LLMs can only pass by deceiving people, e.g. persuading (Salvi et al., 2025) them of something that is false. The success rate of systems in this task could, then, provide an interval scale measure of skillfulness. Yet, there are chal- lenges. Most importantly, skillfulness is a general property which can be operational- ized in many different ways which will give diverse results. It is not trivial to create a single compelling interval-measure of skillfulness in general, even if many specific operationalizations admitting of interval-scale measures are available. For the explicitness dimension, for example, it is less obvious what an interval- scale measure could consist in. One candidate: Herrmann and Levinstein (2025) argue that representations are belief-like if they satisfy four requirements of accu- racy, coherence, uniformity, and use, noting that their satisfaction comes in degrees. For example, a representation is uniform when it is consistent across domains. For instance, a belief should correspond to the same type of representation whether it con- cerns the Cuban Missile crisis, the natural numbers, or next week’s football match. Since representations can be consistent across more or fewer domains, this is a grad- ual matter. In principle, this opens up the possibility of measuring how many domains representations are consistent across (and similar things for other requirements for belief-likeness). Since belief-likeness is related to explicitness, this might serve as the basis of an interval-scale measure of a property at least related to explicitness. Can we, on the multidimensional account, say that some systems are, overall, more deceptive than others? There are formal tools one can use. First, if some system A scores on all dimensions at least as high as B and A scores higher than B on some dimensions, meaning that A dominates B, then A seems to have a higher deception- value than B. Second, if all dimensions use the same kind of scale (e.g. ordinal), the formal apparatus of social choice theory can be used to aggregate them to get overall deception-judgements (Hedden & Nebel, 2024). However, this aggregation procedure has limited value, because it is unclear what the overall deception-value would correspond to. Which property does it pick out? It cannot be the overall capacity for deception, since this is distinct from deceptive 1 3 L. Dung inclination and may, in any case, mostly be exhausted by the skillfulness dimension. It cannot be how deceptive a system is since this is plausibly captured by deceptive inclination. Generally, I cannot think of any interesting, natural property that the overall deception-value, that we could get by aggregating all the different dimen- sions, corresponds to. The properties captured by the multidimensional deception profiles are too heterogenous. Instead, the value of the multidimensional approach for comparisons between systems may come mainly from comparing single dimensions and from comparing the patterns of similarities and differences between how dimen- sions are related to each other. 4 Meeting the Desiderata for an Account of Language Model Deception 4.1 How Should We Assess Accounts of Deception? In this section, I will first outline five desiderata for an account of LLM deception. While an account can be adequate even if it does not satisfy all desiderata, an account should – all other things being equal – be preferred if it satisfies more of them or satisfies the same ones to a higher degree. Then, I will show that my multidimen- sional account satisfies these desiderata. I do not claim that this is an exhaustive list of desirable features an account of LLM deception might have. Nevertheless, the five desiderata capture a broad range of important features of an account, which jointly provide a strong reason to favor an account which meets them. Due to satisfying the error and the extensional condition for minimal deception capacity, my account of LLM deception maintains consistency with how the term “deception” is commonly used. This is necessary for the proposed account to be iden- tifiable as an account of deception in the first place, rather than some other cognitive capacity. Nevertheless, the main motivation for this account is not how it matches common language use, but its taxonomic value and explanatory power. Hence, the account is primarily assessed by its ability to distinguish different important facets of deception in a way which affords descriptive accuracy and explanatory power with respect to LLMs. The account aims to be general, so that it enables comparisons between systems which differ in their architecture, physical structure, and cognitive capacities as well as between LLMs, other AI, and humans. While each desideratum describes a feature which, all other things being equal, it is good for an account of LLM deception to have, there can be reasonable dis- agreement about which desiderata are more important or whether further desiderata should be added. In particular, my selection of desiderata emphasizes the utility of an account of deception for comparative cognition and for assessments of risks from deception. If one thinks that other concerns are central for an account of LLM decep- tion, this will lead to a different selection of desiderata. The two most salient alter- natives to this focus are an emphasis on fine-grained ways in which an account of deception might match intuitive folk judgements about what counts as deception, including different forms of deception, or focusing on desiderata for an account of deception geared more towards humans or animals, rather than AI. 1 3 A Two-Step, Multidimensional Account of Deception in Language... If one emphasizes different desiderata, then the best-justified choice of dimen- sions differs. On a pluralist reading of deception accounts, which I am sympathetic to, different accounts of deception can be equally legitimate, as long as they all play valuable descriptive and explanatory roles. 4.2 The Desiderata The first desideratum is descriptive adequacy. A satisfactory account of deception should capture central properties of deception. Also, it should be taxonomically adequate in that it lists conceptually distinct properties separately, while not mak- ing distinctions for concepts which logically entail each other or whose extensions – for conceptual reasons – overlap highly. Descriptive adequacy is desirable since a descriptively inadequate account would either conflate distinct properties, or make distinctions without a difference, thus threatening conceptual confusion. The second desideratum is informativeness for comparative cognition research (Dung, 2024a, Sect. 3). An account of deception should contain much information about the deception-relevant properties and capacities of LLMs and humans which are interesting and important from the standpoint of comparative cognition research. For example, an account is preferable if it not only tells us which systems are deceiv- ers, but allows for more fine-grained and informative assessments, telling us how deception differs between systems. An informative account allows for a more detailed understanding as well as better explanations and predictions of LLM behavior. The third desideratum is empirical accessibility. An account of deception should refer only to properties which can in principle be empirically detected and mea- sured. Otherwise, it is not possible to know which systems can deceive, or how their deception-relevant properties differ. This would severely undermine the utility of an account of deception for comparative cognition research and ethics. It would also make it impossible to test its explanations. The fourth desideratum is differential sensitivity to ethical risks. An account of LLM deception is preferable if it sheds light on the ethical risks of AI deception. These risks include (see also Park et al., 2024): spread of misinformation, use of LLMs for fraud, AI-based manipulation, other forms of AI-use in the service of malevolent aims as well as the risk that LLMs undermine their own safety assess- ments and risks from AI “misalignment” (Dung, 2023; Ngo et al., 2023). In particu- lar, different properties picked out by an account of deception should be differentially relevant to ethical risks of AI deception. That is, they should either be relevant to a distinctive kind of risk or make a potentially distinctive contribution to the same risk. One of the main promises of an account of LLM deception is to enable a better understanding of and ultimately better societal responses to ethical risks. The fifth desideratum is differential sensitivity to cognitive properties. An account of LLM deception is preferable if it sheds light on the cognitive properties involved in LLM deception, i.e., the type of properties cognitive science is interested in. This is the case especially if different properties described by the account are – at least potentially – related to different cognitive properties or are related to the same type of property in different ways. Satisfying this desideratum makes an account more 1 3 L. Dung descriptively and explanatorily useful for cognitive science and comparative cogni- tion research. 4.3 Assessing Descriptive Adequacy and Informativeness The multidimensional account satisfies all five desiderata (Table 2) to a high degree. All five dimensions refer to features immediately relevant to deception. Skillfulness refers to the extent of a system’s capacity for deception, deceptive inclination to the set of other factors which contribute to its disposition to engage in deception, and learning to the acquisition of deceptive behaviors. Explicitness and situational awareness capture key questions regarding the capacities and mechanisms enabling deception. Moreover, I have already argued that these dimensions are conceptually distinct: It is – for each dimension – conceptually possible to score highly on one dimension but low on all others, or the reverse. So, the multidimensional account satisfies descriptive adequacy, as defined above. Since the multidimensional account provides fine-grained assessments of decep- tion profiles, it satisfies the informativeness desideratum. The account not only sup- ports judgements of the form “system a is capable of deception” or even “system a has a higher capability for deception than b”, but provides a nuanced assessment of deception profiles, facilitating judgements according to which a has a higher value than b on some dimensions, but a lower value on other dimensions. Also, the account admits dimensions – such as deceptive inclination – which are not about deception capacity. However, informativeness is not served well by being as fine-grained as possible. The dimensions maintain a degree of generality which allows for compari- sons between different kinds of systems, for dimensions to have unified theoretical roles (see subsection 3.7), and for the investigation of explanatory generalizations, e.g. whether explicitness enhances skillfulness. 4.4 Assessing Empirical Accessibility Since all dimensions are open to empirical study, the dimensions account satisfies empirical accessibility. Skillfulness can be tested by subjecting systems to tasks where it is challenging to use deception successfully to achieve one’s goal. Second- order deception tasks – where you have to deceive others, but are informed that the other expects that you try to deceive them – are one example (Hagendorff, 2024). Achieving high performance at social deduction games like “Werewolf” is another (Shibata et al. 2023). The goal should be to develop benchmarks which measure sys- tems’ capacities for solving a wide range of challenging deception tasks (for “sabo- taging” tasks, see Benton et al. 2024; for deception reasoning, see Chen et al. 2024). Nr.Desideratum 1Descriptive adequacy 2Informativeness 3Empirical accessibility 4Differential sensitivity to ethical risks 5Differential sensitivity to cognitive properties Table 2 This table summa- rizes the five desiderata for an account of LLM deception described in the main text 1 3 A Two-Step, Multidimensional Account of Deception in Language... Learning can be tested and measured by providing tasks which systems can only solve by learning new deception capacities and behaviors. For example, one may, first, give a system a task to measure skillfulness and then, second, check whether the system can improve its performance via few-shot learning. Deceptive inclination can be measured by examining how often a system uses deceptive options in situations where there are deceptive and non-deceptive options to achieve a goal (Scherrer et al. 2023). Precise measurement could be achieved by systematically varying the extent to which the setup nudges the system towards one of the options and how useful the deceptive options are compared to the non- deceptive ones. In general, empirical AI alignment research (Ji et al. 2023) has the goal to investigate how much a system’s behavior corresponds to important human values, independently of the system’s cognitive capacities (Evans et al. 2021; Ren et al. 2024). In the case of deception, this presupposes that deceptive inclination is a legitimate target for empirical research beyond deception capacity. In some classical systems, the internal representations of the system can be inferred from its programming code. In that case, testing for explicitness may be trivial. In deep neural networks, explicitness can be measured by techniques – some- times classified as examples of “mechanistic interpretability” research (Kästner & Crook, 2024) – which track the activation patterns inside the network and try to extract their information. Probing (Azaria and Mitchell 2023; Burns et al. 2022), representation engineering (Zou, Phan, et al. 2023) and short circuiting (Zou et al. 2024) are examples. The latter two also involve intervening on the network’s internal activation patterns to test which patterns are causally involved in the production of behavior. In so-called reasoning models, the model’s chain-of-thought gives infor- mation about the internal information it used to arrive at its answer, although one should not assume that the chain-of-thought always faithfully corresponds to internal computations (Lanham et al. 2023). Research on animal cognition suggests that a hallmark of explicit mental repre- sentation is that representations can be flexibly put to different behavioral uses and transferred to different, related computational processes (Kaufmann and Viera n.d.); although some additional caution is warranted when drawing mechanistic conclu- sions from purely behavioral evidence in artificial systems (Millière and Buckner 2024b). Finally, situational awareness can be measured by examining how much informa- tion on their own properties, their situation (e.g., who owns them), and the mental states of humans is available to systems for producing behavior. In LLMs, these empirical tests can be verbal (Laine et al. 2024). Subsequently, it needs to be tested whether LLMs can use this knowledge in a wide variety of tasks, for instance false belief tasks (Pi et al. 2024), and whether they can also use it in deception tasks. While my account satisfies empirical accessibility, it needs to be acknowledged that the empirical operationalization of all dimensions is not trivial, depends on con- text (including what the specific system to be studied is), and that a plurality of viable operationalizations exist. Moreover, there are many conceivable ways of translating the results of different operationalizations in a single numerical rating for the dimen- sions, and it is not obvious which are the best ones. So, while the multidimensional account points to the kinds of empirical research which can tell us how different 1 3 L. Dung LLMs (and potentially other AI systems) score on each dimension, it does not offer a mechanical procedure for constructing such numerical scores. 4.5 Assessing Differential Sensitivity to Ethical Risks Since all dimensions make an important contribution to the ethical risks LLMs pose, the dimensions account satisfies the ethical desideratum as well. More skillful deception is more dangerous, since it is harder to uncover, enables more effective manipulation, and can be effectively used for the pursuit of more complex, poten- tially malicious, goals. These goals could be the goals of human users or of the LLMs themselves (Friederich, 2023). Higher learning capacity gives systems over time access to an increasingly larger repertoire of deceptive actions and makes it more difficult to recognize and predict deception, since the system’s capacities may continually change. It may also impede countermeasures against deception since the system may learn new strategies to adapt to them. A system’s deceptive inclination determines how frequently it engages in decep- tion, and in which kinds of situations it is disposed to do so. Thus, deceptive inclina- tion is important in order to predict when deception is expected, to estimate risks, and to take countermeasures. In general, deceptive inclination influences how chal- lenging it is to align a system to human goals (Ngo et al., 2023; Russell, 2019), since systems with a high deceptive inclination are more likely to engage in behaviors that humans disapprove of, and to hide undesired goals. Explicitness might make deceptive behavior more flexible and thus dangerous. Explicit goals, as opposed to goals realized only as abstract behavioral patterns, might allow for quick radical changes (Butlin, 2020) which might make AI behavior more unpredictable. However, explicit goals might also be easier to recognize, e.g. through probing techniques, which might have safety benefits. On the other hand, worries about catastrophic risks from deceptive alignment (Carlsmith 2023; Carranza et al., 2023) – a hypothesized scenario where AI systems pretend to share human goals in order to implement their hidden agenda in secret – often seem to rely on the assumption that AI systems consistently optimize very strongly for a particular set of goals. This behavior seems more likely when the system possesses explicitly represented goals, since the goals may need to be very stable and determinate. Since deceptive alignment has the potential to undermine safety testing of AI systems itself, it is a distinctive and potentially especially severe kind of risk. Situational awareness makes AI deception more dangerous to the extent that it increases skillfulness. In addition, situational awareness is a key condition for decep- tive alignment in particular, since situational awareness provides access to key infor- mation necessary for AI systems to pretend to be safe and aligned, such as knowledge of how it is trained and what the goals of its creators are (Cotra, 2022; Ngo et al., 2023). Supporting this risk scenario, Meinke et al. (2024) provide evidence that LLMs can strategically hide their true capabilities and objectives, when given enough information about their situation. 1 3 A Two-Step, Multidimensional Account of Deception in Language... 4.6 Assessing Differential Sensitivity To Cognitive Properties The multidimensional account also satisfies the cognitive desideratum since all dimensions contain important information about the cognitive capacities of LLMs. High skillfulness at deception is a behavioral capacity that some entities have while others lack it. It needs to be enabled by certain cognitive processes and mechanisms. Which processes and mechanisms are necessary and/or sufficient for skillfulness is an important open empirical research question. The learning dimension points to different kinds of cognitive capacities which are worthy of empirical study. For example, innate knowledge, standard machine learn- ing which slowly adjusts the weights of the system and in-context learning are based on different kinds of cognitive mechanisms. These mechanisms are probably distinct from the processes underlying skillfulness itself, since it is plausible that some degree of skillfulness can be facilitated by different kinds of learning abilities, or without any learning abilities (by relying on innate knowledge). A key assumption of technical AI alignment research is that AI capability, what AI systems can do, and AI alignment – which final goals they pursue – are indepen- dent and thus based on different mechanisms (Bostrom, 2012; Dung, 2024b; Müller & Cannon, 2022). If true, this suggests that deceptive inclination is a dimension of variation which is causally independent of deception capacity and uses different cog- nitive mechanisms than deception capacity. If this independence claim is false, then it is nevertheless useful to understand how exactly deceptive inclination and deception capacity are related. Explicitness captures crucial information about the kind of cognitive mechanism which underlies deceptive behavior. A mechanism which uses explicit representa- tions of deception-relevant information employs a different cognitive mechanism than one which does not. Hence, the explicitness dimension is important for under- standing the cognitive basis of deception. Finally, deception which relies on situational awareness uses a different kind of mechanism than deception which does not. So, the situational awareness dimension is important to understand which kinds of cognitive capacities contribute to LLM deception, and how the overall mechanism is structured. In total, I conclude that the multi-dimensional account satisfies all five desiderata to a high degree. 5 Conclusion In this paper, I have proposed and argued for a two-step, multi-dimensional account of language model deception. On this account, having the capacity for deception minimally requires being able to cause false beliefs in others to achieve one’s own goals. In all systems which satisfy this minimal condition, a system’s deception pro- file can be characterized as a point in a multidimensional space. The five dimen- sions of this space are skillfulness, learning, deceptive inclination, explicitness, and situational awareness. I have argued for this account in virtue of its fit with current language usage and, primarily, through its descriptive and explanatory usefulness. 1 3 L. Dung Specifically, the account’s five dimensions are all conceptually independent and cap- ture the key dimensions of variation for LLM deception. The account is informative in that it allows fine-grained comparative characterizations of deception. Moreover, its dimensions are all accessible to empirical study, provide important information for assessments of the risks of LLM deception, and shed light on the cognitive processes involved in LLM deception. Finally, this account paves the way for a future exten- sion which delivers a unified account of deception in biological and non-biological systems. Thus, the multidimensional account promises to significantly advance both the scientific study as well as the ethical assessment of AI deception, and deception generally. Acknowledgements I thank Max Hellrigel-Holderbaum and Cameron Jones for many valuable comments on a previous version of the manuscript. Funding Open Access funding enabled and organized by Projekt DEAL. No funding to report. Data Availability Not applicable. Code Availability Not applicable. Declarations Conflict of interest No potential conflict of interest. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit h t t p : / / c r e a t i v e c o m m o n s . o r g / l i c e n s e s / b y / 4 . 0 / . References Artiga, M., & Paternotte, C. (2018). Deception: A functional account. Philosophical Studies, 175(3), 579– 600. https://doi.org/10.1007/s11098-017-0883-8 Azaria, A., & Mitchell, T. (2023). The Internal State of an LLM Knows When its Lying. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 3 0 4 . 1 3 7 3 4 Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 2 0 4 . 0 5 8 6 2 Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., Fried, D., et al. (2022). Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624), 1067–1074. https://doi.org/10.1126/science.ade9097 Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 列. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (p. 610–623). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922 1 3 A Two-Step, Multidimensional Account of Deception in Language... Benton, J., Wagner, M., Christiansen, E., Anil, C., Perez, E., Srivastav, J., et al. (2024). Sabotage evalua- tions for frontier models. arXiv. https://doi.org/10.48550/arXiv.2410.21514 Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., et al. (2025). Emergent Misalign- ment: Narrow finetuning can produce broadly misaligned LLMs. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 5 0 2 . 1 7 4 2 4 Binder, F. J., Chua, J., Korbak, T., Sleight, H., Hughes, J., Long, R., et al. (2024). Looking Inward: Lan- guage models can learn about themselves by introspection. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 4 1 0 . 1 3 7 8 7 Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artifi- cial agents. Minds and Machines, 22(2), 71–85. https://doi.org/10.1007/s11023-012-9281-3 Boyle, A. (2024). Disagreement & classification in comparative cognitive science. Noûs. h t t p s : / / d o i . o r g / 1 0 . 1 1 1 1 / n o u s . 1 2 4 8 0 . Brown, C., Garwood, M. P., & Williamson, J. E. (2012). It pays to cheat: Tactical deception in a cephalo- pod social signalling system. Biology Letters, 8(5), 729–732. https://doi.org/10.1098/rsbl.2012.0435 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). Language models are few-shot learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165 Bshary, R. (2002). Biting cleaner fish use altruism to deceive image–scoring client reef fish. Proceedings of the Royal Society of London. Series B, Biological Sciences, 269(1505), 2087–2093. h t t p s : / / d o i . o r g / 1 0 . 1 0 9 8 / r s p b . 2 0 0 2 . 2 0 8 4 Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering latent knowledge in language models without supervision. arXiv. https://doi.org/10.48550/arXiv.2212.03827 Butlin, P. (2020). Affective experience and evidence for animal consciousness. Philosophical Topics, 48(1), 109–127. https://doi.org/10.5840/philtopics20204816 Carlsmith, J. (2023). Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv. https://doi.org/10.48550/arXiv.2311.08379 Carranza, A., Pai, D., Schaeffer, R., Tandon, A., & Koyejo, S. (2023). Deceptive alignment monitoring. arXiv. https://doi.org/10.48550/arXiv.2307.10569 Chen, K., Lian, Z., Sun, H., Liu, R., Yi, J., Liu, B., & Tao, J. (2024). Can Deception Detection Go Deeper? Dataset, evaluation, and benchmark for deception reasoning. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 4 0 2 . 1 1 4 3 2 Cotra, A. (2022). Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. Lesswrong. h t t p s : / / w w w . l e s s w r o n g . c o m / p o s t s / p R k F k z w K Z 2 z f a 3 R 6 H / w i t h o u t - s p e c i fi c - c o u n t e r m e a s u r e s - t h e - e a s i e s t - p a t h - t o. Accessed 3 February 2023. Courtland, S. D. (2015). Detecting animal deception. Journal of Mind and Behavior, 36(3–4), 121–138. Dennett, D. C. (1987). The intentional stance (p. 891–892). MA: MIT Press. Dennett, D. C. (1991). Real patterns. Journal Of Philosophy, 88(1), 27–51. Dung, L. (2023). Current cases of AI misalignment and their implications for future risks. Synthese, 202(5), 138. https://doi.org/10.1007/s11229-023-04367-0 Dung, L. (2024a). Understanding artificial agency. The Philosophical Quarterly, Article pqae010. h t t p s : / / d o i . o r g / 1 0 . 1 0 9 3 / p q / p q a e 0 1 0 Dung, L. (2024b). Is superintelligence necessarily moral? Analysis. https://doi.org/10.1093/analys/anae033 Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., et al. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv. https://doi.org/10.48550/arXiv.2110.06674 Flower, T. P., Gribble, M., & Ridley, A. R. (2014). Deception by flexible alarm mimicry in an African bird. Science, 344(6183), 513–516. https://doi.org/10.1126/science.1249723 Frankfurt, H. (1986). On bullshit. Raritan: A Quarterly Review, 6, 81–100. Friederich, S. (2024). Symbiosis, not alignment, as the goal for liberal democracies in the transition to artificial general intelligence. AI and Ethics, 4, 315–324https://doi.org/10.1007/s43681-023-00268-7 Goldstein, S., & Kirk-Giannini, C. D. (2025). AI wellbeing. Asian Journal of Philosophy, 4(1), 25. h t t p s : / / d o i . o r g / 1 0 . 1 0 0 7 / s 4 4 2 0 4 - 0 2 5 - 0 0 2 4 6 - 2 Green, M., & Michel, J. G. (2022). What might machines mean?? Minds and Machines, 32(2), 323–338. https://doi.org/10.1007/s11023-022-09589-8 Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., et al. (2024). Alignment faking in large language models. arXiv. https://doi.org/10.48550/arXiv.2412.14093 Hagendorff, T. (2024). Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences of the United States of America, 121(24), e2317967121. h t t p s : / / d o i . o r g / 1 0 . 1 0 7 3 / p n a s . 2 3 1 7 9 6 7 1 2 1 1 3 L. Dung Hedden, B., & Nebel, J. M. (2024). Multidimensional concepts and disparate scale types. Philosophical Review, 133(3), 265–308. Herrmann, D. A., & Levinstein, B. A. (2025). Standards for belief representations in Llms. Minds and Machines, 35(1), 1–25. https://doi.org/10.1007/s11023-024-09709-6 Hicks, M. T., Humphries, J., & Slater, J. (2024). ChatGPT is bullshit. Ethics and Information Technology, 26(2), 38. https://doi.org/10.1007/s10676-024-09775-5 Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In Proceedings of the 39th International Conference on Machine Learning (p. 9118–9147). Presented at the International Conference on Machine Learning, PMLR. Accessed 22 March 2023. h t t p s : / / p r o c e e d i n g s . m l r . p r e s s / v 1 6 2 / h u a n g 2 2 a . h t m l Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv. h t t p : / / a r x i v . o r g / a b s / 2 4 0 1 . 0 5 5 6 6 . Accessed 12 January 2024. Järviniemi, O., & Hubinger, E. (2024). Uncovering deceptive tendencies in language models: A simulated company AI assistant. arXiv. https://doi.org/10.48550/arXiv.2405.01576 Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., et al. (2023). AI Alignment: A Comprehensive Survey. arXiv. https://doi.org/10.48550/arXiv.2310.19852 Jones, C. R., & Bergen, B. K. (2024). Lies, Damned Lies, and distributional language statistics: persuasion and deception with large language models. arXiv. https://doi.org/10.48550/arXiv.2412.17128 Kammerer, F., & Frankish, K. (2023). What forms could introspective systems take?? A research programme. Journal Of Consciousness Studies, 30(9–10), 13–48. https://doi.org/10.53765/20512201.30.9.013 Kästner, L., & Crook, B. (2024). Explaining AI through mechanistic interpretability. European Journal For Philosophy Of Science, 14(4), 52. https://doi.org/10.1007/s13194-024-00614-4 Kaufmann, A., & Viera, G. (n.d.). Temporal cognition in apes. Australasian Journal of Philosophy. Krstić, V. (2024). A functional analysis of human deception. Journal of the American Philosophical Asso- ciation, 10(4), 836–854https://doi.org/10.1017/apa.2023.19 Laine, R., Chughtai, B., Betley, J., Hariharan, K., Scheurer, J., Balesni, M., et al. (2024). Me, Myself, and AI: The situational awareness dataset (SAD) for LLMs. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 4 0 7 . 0 4 6 9 4 Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv. https://doi.org/10.48550/arXiv.2307.13702 Levinstein, B. A., & Herrmann, D. A. (2024). Still no lie detector for language models: Probing empiri- cal and conceptual roadblocks. Philosophical Studies, 182, 1539–1565https://doi.org/10.1007/ s11098-023-02094-3 Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Emergent world repre- sentations: exploring a sequence model trained on a synthetic task. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 2 1 0 . 1 3 3 8 2 Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. arXiv. https://doi.org/10.48550/arXiv.2109.07958 Mahant, N. (n.d.). Is AI deception deception? In V. C. Müller, A. R. Dewey, L. Dung, & G. Löhr (Eds.), Philosophy of artificial intelligence: The state of the art. Springer Nature. Mahon, J. E. (2016). The Definition of Lying and Deception. In E. N. Zalta (Ed.), The Stanford Encyclope- dia of Philosophy (Winter 2016.). Metaphysics Research Lab, Stanford University. h t t p s : / / p l a t o . s t a n f o r d . e d u / a r c h i v e s / w i n 2 0 1 6 / e n t r i e s / l y i n g - d e fi n i t i o n /. Accessed 22 September 2023. Maier, J., & Kikkert, S. (2025). Abilities. In E. N. Zalta & U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy (Summer 2025.). Metaphysics Research Lab, Stanford University. h t t p s : / / p l a t o . s t a n f o r d . e d u / a r c h i v e s / s u m 2 0 2 5 / e n t r i e s / a b i l i t i e s /. Accessed 3 June 2025. Mandelkern, M., & Linzen, T. (2024). Do language models’ words refer?? Computational Linguistics, 50(3), 1191–1200. https://doi.org/10.1162/coli_a_00522 Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024). Frontier Models are Capable of In-context Scheming. arXiv. https://doi.org/10.48550/arXiv.2412.04984 Millière, R. (2023). The Alignment Problem in Context. arXiv. https://doi.org/10.48550/arXiv.2311.02147 Millière, R., & Buckner, C. (2024a). A Philosophical Introduction to Language Models -- Part I: Continu- ity With Classic Debates. arXiv. https://doi.org/10.48550/arXiv.2401.03910 Millière, R., & Buckner, C. (2024b). A philosophical introduction to language models - Part I: The Way Forward. arXiv. https://doi.org/10.48550/arXiv.2405.03207 1 3 A Two-Step, Multidimensional Account of Deception in Language... Müller, V. C., & Cannon, M. (2022). Existential risk from AI and orthogonality: Can we have it both ways? Ratio, 35(1), 25–36. https://doi.org/10.1111/rati.12320 Ngo, R., Chan, L., & Mindermann, S. (2023). The alignment problem from a deep learning perspective. Presented at the The Twelfth International Conference on Learning Representations. h t t p s : / / o p e n r e v i e w . n e t / f o r u m ? i d = f h 8 E Y K F K n s . Accessed 14 November 2024. Park, P. S., Goldstein, S., O’Gara, A., Chen, M., & Hendrycks, D. (2024). AI deception: A survey of exam- ples, risks, and potential solutions. Patterns, 5(5), 100988https://doi.org/10.1016/j.patter.2024.100988 Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023 (p. 13387–13434). Presented at the Findings 2023, Toronto, Canada: Association for Computational Linguistics. h t t p s : / / d o i . o r g / 1 0 . 1 8 6 5 3 / v 1 / 2 0 2 3 . fi n d i n g s - a c l . 8 4 7 Pi, Z., Vadaparty, A., Bergen, B. K., & Jones, C. R. (2024). Dissecting the Ullman Variations with a SCAL- PEL: Why do LLMs fail at Trivial Alterations to the False Belief Task? arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 4 0 6 . 1 4 7 3 7 Quilty-Dunn, J., Porot, N., & Mandelbaum, E. (2023). The best game in town: The reemergence of the language-of-thought hypothesis across the cognitive sciences. Behavioral and Brain Sciences, 46, e261. https://doi.org/10.1017/S0140525X22002849 Ren, R., Basart, S., Khoja, A., Gatti, A., Phan, L., Yin, X., et al. (2024). Safetywashing: Do AI safety benchmarks actually measure safety progress? arXiv. http://arxiv.org/abs/2407.21792. Accessed 3 November 2024. Russell, S. (2019). Human compatible: Artificial intelligence and the problem of control. Viking. Salvi, F., Horta Ribeiro, M., Gallotti, R., & West, R. (2025). On the conversational persuasiveness of GPT- 4. Nature Human Behaviour, 9, 1. https://doi.org/10.1038/s41562-025-02194-6 Scherrer, N., Shi, C., Feder, A., & Blei, D. M. (2023). Evaluating the Moral Beliefs Encoded in LLMs. arXiv. https://doi.org/10.48550/arXiv.2307.14324 Schlarmann, C., & Hein, M. (2023). On the adversarial robustness of multi-modal foundation models. arXiv. https://doi.org/10.48550/arXiv.2308.10741 Shanahan, M., McDonell, K., & Reynolds, L. (2023). Role-play with large language models. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 3 0 5 . 1 6 3 6 7 Shapira, N., Levy, M., Alavi, S. H., Zhou, X., Choi, Y., Goldberg, Y., et al. (2023). Clever Hans or neural theory of mind?? Stress testing social reasoning in large Language models. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 3 0 5 . 1 4 7 6 3 Shibata, H., Miki, S., & Nakamura, Y. (2023). Playing the Werewolf game with artificial intelligence for language understanding. arXiv. https://doi.org/10.48550/arXiv.2302.10646 Skalse, J., Howe, N. H. R., Krasheninnikov, D., & Krueger, D. (2022). Defining and characterizing reward hacking. arXiv. https://doi.org/10.48550/arXiv.2209.13085 Strachan, J. W. A., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., et al. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour. h t t p s : / / d o i . o r g / 1 0 . 1 0 3 8 / s 4 1 5 6 2 - 0 2 4 - 0 1 8 8 2 - z Street, W., Siy, J. O., Keeling, G., Baranes, A., Barnett, B., McKibben, M., et al. (2024). LLMs achieve adult human performance on higher-order theory of mind tasks. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 4 0 5 . 1 8 8 7 0 Turner, R. E. (2024, February 8). An introduction to transformers. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 3 0 4 . 1 0 5 5 7 Ullman, T. (2023). Large language models fail on trivial alterations to theory-of-mind tasks. arXiv. h t t p s : / / d o i . o r g / 1 0 . 4 8 5 5 0 / a r X i v . 2 3 0 2 . 0 8 3 9 9 Wang, L.,Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X.,Lin, Y., Zhao, W. X., Wei, Z., & Wen, J. (2024). A survey on large language model based autonomous agents. Frontier- sof Computer Science, 18(6), 186345.https://doi.org/10.1007/s11704-024-40231-1 Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W. X., Wei, Z., & Wen, J. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 186345. https://doi.org/10.1007/s11704-024-40231-1 Williams, I., & Bayne, T. (2024). Chatting with bots: AI, speech acts, and the edge of assertion. Inquiry : A Journal of Medical Care Organization, Provision and Financing, 0(0), 1–24. h t t p s : / / d o i . o r g / 1 0 . 1 0 8 0 / 0 0 2 0 1 7 4 X . 2 0 2 4 . 2 4 3 4 8 7 4 Yildirim, I., & Paul, L. A. (2024). From task structures to world models: What do LLMs know? Trends in Cognitive Sciences, 28(5), 404–415. https://doi.org/10.1016/j.tics.2024.02.008 1 3 L. Dung Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R. (2023). Representation engineering: A top-down approach to AI transparency. arXiv. https://doi.org/10.48550/arXiv.2310.01405 Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M. (2024). Improving alignment and robustness with short circuiting. arXiv. https://doi.org/10.48550/arXiv.2406.04313 Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv. https://doi.org/10.48550/arXiv.2307.15043 Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 1 3