
Paper deep dive

Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

Nitay Alon, Joseph M. Barnby, Reuth Mirsky, Stefan Sarkadi

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 500

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/22/2026, 6:09:07 AM

Summary

The 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind (ToM4AI 2026), held at AAAI 2026, explored the intersection of Theory of Mind and AI, focusing on how AI systems can represent, interpret, and respond to human mental states. The workshop featured keynote speakers, a hackathon, and numerous research papers covering topics such as metacognitive architectures, social alignment, human-in-the-loop systems, and the role of bias in social inference.

Entities (6)

AAAI 2026 · conference · 100%
Geoff Bird · researcher · 100%
Marten Sap · researcher · 100%
Sarit Kraus · researcher · 100%
ToM4AI 2026 · workshop · 100%
SAMI · ai-agent · 95%

Relation Signals (3)

Marten Sap · keynote speaker · ToM4AI 2026

confidence 100% · These three keynote speakers were selected to represent diverse perspectives... Marten Sap

ToM4AI 2026 · part of · AAAI 2026

confidence 100% · The second Theory of Mind for AI (ToM4AI) workshop took place on January 26, 2026 in Singapore, as part of the AAAI workshop series.

SAMI · developed by · Georgia Institute of Technology

confidence 90% · SAMI is an AI social agent deployed in Georgia Tech’s Online Master of Science in Computer Science program

Cypher Suggestions (2)

List all papers presented at the workshop. · confidence 95% · unvalidated

MATCH (p:Paper)-[:PRESENTED_AT]->(w:Workshop {name: 'ToM4AI 2026'}) RETURN p.title

Find all researchers who were keynote speakers at the workshop. · confidence 90% · unvalidated

MATCH (r:Researcher)-[:KEYNOTE_SPEAKER]->(w:Workshop {name: 'ToM4AI 2026'}) RETURN r.name

Abstract

This volume includes a selection of papers presented at the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2026 in Singapore on 26th January 2026. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

500,147 characters extracted from source content.


arXiv:2603.18786v1 [cs.AI] 19 Mar 2026

ToM4AI 2026
Edited by Nitay Alon, Joe M. Barnby, Reuth Mirsky, Ştefan Sarkadi
18th March 2026

Preface

Recently, there has been increasing interest in the intersection of Theory of Mind (ToM) and artificial intelligence (AI). The ability to attribute mental states—such as beliefs, intentions, desires, and emotions—to oneself and others is essential for predicting behavior. ToM principles are therefore crucial for enabling better interpretation of and response to human actions and intentions as AI systems evolve towards greater interactivity. The purpose of this 2nd volume in the ToM4AI series is to provide an open-access, curated anthology for the ToM and AI research community.

The second Theory of Mind for AI (ToM4AI) workshop took place on January 26, 2026 in Singapore, as part of the AAAI workshop series. It was an epic gathering of researchers from diverse fields, ranging from psychology, cognitive science, neuroscience, and robotics to AI, brought together to explore the implications of ToM for developing advanced AI systems. The workshop facilitated discussions between theory-driven cognitive science and practical AI applications, fostering a multidisciplinary dialogue on the role of ToM in AI. A highlight of the workshop was the ToM-a-Thon, a hackathon that challenged participants to think about problems beyond the usual Sally-Anne test that require ToM to solve.

The workshop activities were designed around keynote talks by internationally recognized leaders in the field, who discussed Theory of Mind from multiple disciplines: psychology, cognitive science, and AI. These three keynote speakers were selected to represent diverse perspectives, in order to ensure a comprehensive and holistic exploration of the workshop's theme:

• Marten Sap, Assistant Professor, Language Technologies Institute, Carnegie Mellon University. Dr. Sap's talk emphasized the critical gaps in current large language models around individual, interpersonal, and cultural inference. A central question emerged: how do we build genuine subtext into AI systems? His work on social alignment through public-private knowledge inference (Sotopia project) revealed that humans expertly balance these two types of knowledge to achieve socially strategic goals. He reminded attendees that social life inherently brings social risks that AI systems must navigate.

• Geoff Bird, Professor, University of Oxford. Prof. Bird challenged the computer science community with a fundamental question: what can psychology teach us about building better AI? He advocated for MindSpace Theory, which conceptualizes mental representation as existing along continuums rather than discrete categories; some mental states are inherently harder to represent than others. Critically, he called for abandoning inadequate tests and benchmarks, emphasizing that trait generalization across contexts within individuals forms the core of Theory of Mind.

• Sarit Kraus, Professor, Department of Computer Science, Bar-Ilan University. Prof. Kraus highlighted the essential role of humans-in-the-loop for providing ethical constraints and complementary advantages. Her research showed that successful agents focus on high-level signals in social communication. Intriguingly, she noted that LLMs appear to develop their own internal subtext, with intra-agent correlation exceeding inter-agent correlation.
With over 60 paper submissions and more than 100 attendees, the workshop held two poster sessions throughout the day. Accepted papers are collected in this indexed volume. Additional information about the workshop, schedule, and talks can be found on the workshop website: https://tom4ai.github.io/events/AAAI2026/workshop-summary/

The review process for selecting the papers was double-blind and could not have been done successfully without the help of our excellent team of reviewers. It is also worth mentioning that ToM4AI has a Call for Papers for a Special Issue on Artificial Intelligence Through Theory of Mind in the journal Autonomous Agents and Multi-Agent Systems. The full CfP for the Special Issue and the relevant submission deadlines can be found at https://link.springer.com/collections/abcegacgha

Acknowledgments
We are grateful to Edith Cowan University (ECU) for their support in bringing students and speakers alike to Singapore and for rewarding our ToM-a-Thon winners.

The Organisers
Nitay Alon, Joe M. Barnby, Reuth Mirsky, Ştefan Sarkadi
18th March 2026

Contents

Editors' Note · Stefan Sarkadi, Nitay Alon, Joseph M. Barnby, Reuth Mirsky · 3
A Metacognitive Architecture for ToM Revision in AI Agents · Jisu Kim, Mahimul Islam, Ashok Goel · 7
Exploration Through Introspection: A Self-Aware Reward Model · Michael Petrowski, Milica Gašić · 10
Belief-Desire-Intention Dynamics in Language Models via the p-Beauty Contest · Vignesh Natarajkumar, Takayuki Ito, Rafik Hadfi · 16
Beyond VAGUE: Attention Analysis for Probing How VLMs Ground Ambiguity · Heejeong Nam · 20
Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs · Ivan Chulo, Ananya Joshi · 25
Faithful Theory of Mind Distillation: Why Preference Based Refinement Improves Imitation · Arpit Singh Gautam, Saurabh Jha · 35
From Theory of Mind to Theory of Environment: Counterfactual Simulation of Latent Environmental Dynamics · Ryutaro Uchiyama · 38
Geometric Belief Spaces for Robust Multi-Agent Belief Dynamics · Tasha Kim · 41
Inside Deception: How to Exploit a Target · Peta Masters, Stefan Sarkadi, Tom Wallis, Peter McBurney · 47
Investigating the Effects of Translation Quality on LLM Performance in Machine-Translated Theory of Mind Benchmarks · Haruhisa Iseno, Atsumoto Ohashi, Tetsuji Ogawa, Shinnosuke Takamichi, Ryuichiro Higashinaka · 50
Latent Theory of Mind in World Models for Multi-Agent Reinforcement Learning · Tomas Leroy-Stone · 54
On the Interplay of Training Population Diversity, Theory of Mind, and Zero-Shot Coordination · Saeed Hedayatian, Varun Bhatt, Bryon Tjanaka, Michael Lewis, Katia P. Sycara, Stefanos Nikolaidis · 57
Reasoning About Bias: Theory of Mind for Trustworthy Knowledge Distillation · Satyam Kumar, Saurabh Jha · 63
Routing Belief States: A Meta-Cognitive Architecture for Theory of Mind in Language Models · Satyam Kumar, Saurabh Jha · 66
Semantic Encoders Enable Robust Communication-Aware Reinforcement Learning Policies · Siddharth Srikanth, Varun Bhatt, Darius Mahjoob, Sophie Hsu, Aaquib Tabrez, Stefanos Nikolaidis · 70
Sign-Based World Model as a Basis of Cognitive Modeling: Imitation in Human-Robot Interaction · Anfisa Chuganskaya, Alexey K. Kovalev, Aleksandr I. Panov · 74
SUITE: Scaling Up Individualized Theory-of-Mind Evaluation in Large Language Models · Zhenze Mo, Chance Jiajie Li, Ao Qu, Yuhan Tang, Luis Alberto Alonso Pastor, Kent Larson, Jinhua Zhao · 77
The AI Tipping Point: How Design and Repeated Use Shape Beliefs About · Andreea Zaman, Scott A. Hale · 85
The Curse of Knowledge in Language Models: Perfect Theory of Mind or Missing Human Biases? · Nutchanon Yongsatianchot · 89
The Resonance Corpus: A Large-Scale Chinese Parent–Child Conversation Dataset · Lingqin Meng, Yang Gao, Zhongzhi Huo, Stella Christie · 93
Theory of Mind and Optimistic Beliefs Emerge in a Sequential Dilemma with Incremental Rewards · Tom Lenaerts, Marco Saponara, Jorge M. Pacheco, Francisco C. Santos · 96
Theory of Mind for Explainable Human-Robot Interaction · Marie S. Bauer, Julia Gachot, Matthias Kerzel, Cornelius Weber, Stefan Wermter · 100
Theory of Mind in Human-in-the-Loop · Sammie Katt, Samuel Kaski · 103
Theory of Mind through Partially Ordered Plans · Sveta Paster Shainkopf, Reuth Mirsky · 107
Toward Theory of Mind: BERT Learns and Uses Emotion Geometry in Two Phases · You Ho Myung · 110
Towards Combining Task-Independent and Learned Task-Dependent Intrinsic Rewards in Cooperative Multi-Agent Reinforcement Learning · Dilith Jayakody, Domenic Rosati, Janarthanan Rajendran · 114
Visual Theory of Mind through LLM-based Semantic Extraction · Sandy Hoffmann, Lucas Nardi Vieira, Stefan Sarkadi, Alison R. Panisson · 118
Who Knows Who Knows? A Step Toward Common Knowledge in Multi-Agent Systems · Tianqi Song, Botao Amber Hu, Jingshu Li, Yi-Chieh Lee · 123
Editors' Note

Stefan Sarkadi 1, Nitay Alon 2,3, Joseph M. Barnby 4,5, and Reuth Mirsky 6

1 Centre for Defence and Security AI, University of Lincoln – UK; 2 Department of Computational Neuroscience, Max Planck Institute for Biological Cybernetics – GER; 3 School of Computer Science and Engineering, The Hebrew University of Jerusalem – IL; 4 Centre for AI and Machine Learning, Edith Cowan University – AUS; 5 Institute for Psychiatry, Psychology and Neuroscience, King's College London – UK; 6 Computer Science Department, Tufts University – US

'Theory of Mind' in Artificial Intelligence has continued on its trajectory toward becoming a crucial topic for the field. From empathic AI for psychotherapy [9] to AI psychosis [8], understanding the ways in which AI might respond to, represent, and exaggerate human mental states is at the forefront of commercial and clinical application. However, there remains a ravine between scientific and engineering understandings of what exactly constitutes Theory of Mind, whether the causal basis of Theory of Mind in biological agents matters for synthetic systems, and the end goals of each approach.

The ToM4AI workshop's second iteration made this widening gap increasingly evident. Despite their deployment at scale, contemporary AI systems lack the efficiency, generality, and contextual sensitivity that define human cognition [1], while their adoption demands a rapid expansion of both computational and environmental resources. This tension presses the AI community toward deeper philosophical and methodological questions [2] that were once the crux of cognitive science research and have now become central rather than peripheral to the future of socially capable AI.

A central theme throughout was the degree to which 'Theory of Mind' in AI is akin to, or even desirable compared with, the messy but essential 'Theory of Mind' in biological agents. Human cognition is not a purely logical or "rational" enterprise; it is shaped by forces that AI does not naturally capture or 'experience'—trust, reputation, the risk of ostracism, vulnerability to physical harm, companionship and mortality. Socio-cognitive and evolutionary pressures drive human social life and structure our reasoning, our strategic choices, and our capacity for interpreting context. If machines are to engage meaningfully with humans, we must ask whether these socio-cognitive pressures, or functional/computational representations of them, are indispensable for producing behaviour that is not merely competent (i.e. efficient at solving well-defined tasks) but socially intelligible [6].

This brings us to ask whether the benefits of Theory of Mind can be abstracted from the broader architecture of human cognition and social interaction. From a short-sighted, return-on-investment approach to AI business, heavily driven by technical engineering without scientific understanding, it is tempting to imagine an artificial agent capable of subtextual inference, adaptive reasoning, and efficient modelling of others' beliefs without the pains of having to reproduce the emotional, motivational, or existential conditions that give rise to these abilities in humans. Yet it remains unclear whether such a decoupling is conceptually coherent. Human Theory of Mind emerges not only from information-processing demands but also from the uncertain, interdependent, and often contradictory circumstances of lived experience [3].
After all, even in AI research "playing chess isn't just about chess" [7]. The challenge, then, is to determine whether artificial systems can (or should) achieve the advantages of mentalisation without inheriting the full spectrum of the human condition, or whether human biases are precisely what makes Theory of Mind meaningful and adaptive in human culture. Alongside this is the goal one wants to achieve, and whether instilling 'Theory of Mind' in AI is useful for scientific and/or engineering purposes. A further challenge is how we interpret and build AI architectures. For instance, should these be regarded as scientific models of cognition, as engineering artifacts optimised for performance, or as a combination of both? In cognitive science, bias is a window into underlying mental structure [11]; in engineering, bias is a flaw or error demanding correction [4]. This stark contrast often leads to mistaken goals or misinterpretation of results between fields. We argue that, in fact, engineering communities would do well to heed the adaptive advantage that human bias appears to inject into social inference: it fosters trust, and it saves computational time and overhead when calculating another's future beliefs and actions. Similarly, the field of AI must decide whether substrate realism (the idea that meaningful cognitive models should reflect biological processes) is required, or whether purely instrumental performance suffices [5], provided the system remains mechanistically interpretable and reliable.

As noted in the closing reflections, our mental representations of complex social worlds are inherently uncertain and approximate. Embracing this uncertainty, rather than seeking to eliminate it, may be essential for designing artificial systems that can operate within the fluid and ambiguous contexts that define human interaction. The path forward will therefore require balancing fast-paced yet short-sighted engineering advancements with a deep scientific understanding of how to represent social cognition in artificial systems. We have to acknowledge both the limitations of current approaches and promote the rebuilding of collapsed bridges between cognitive science, philosophy, and computer science.

To conclude, this Editors' Note continues a dialogue that is still in its nascent stages. The overarching aim of this workshop series is to broaden the conversation regarding what it truly means, and what it will fundamentally require, to develop artificial agents with mentalization capabilities. This pursuit is intrinsically linked to the oldest philosophical challenge in AI: the Imitation Game [10].

As a closing reflection, perhaps human-like social intelligence should not be the ultimate engineering goal for AI. As a unique and "strange" technology, AI need not, nor ought it, strive to replicate the complexities of the relational human condition. Eschewing this goal does not diminish the technology's utility; rather, it may help reduce anthropomorphic biases, insofar as that is possible when interfacing with agents that produce human language. Ultimately, human cognition possesses "messy" essential properties that are adaptive to daily life. AI does not need to abide by these same biological principles, and by avoiding them, we may actually mitigate some of the harms to which humans are (forgivably) vulnerable.

References
[1] Margaret A Boden. Artificial intelligence.
Elsevier, 1996. [2] Cristiano Castelfranchi and Rino Falcone. Trust theory: A socio- cognitive and computational model, volume 18. John Wiley & Sons, 2010. [3] Robin IM Dunbar. The social brain hypothesis. Evolutionary An- thropology: Issues, News, and Reviews: Issues, News, and Reviews, 6(5):178–190, 1998. [4] Xavier Ferrer, Tom Van Nuenen, Jose M Such, Mark Coté, and Na- talia Criado. Bias and discrimination in ai: a cross-disciplinary perspective. IEEE Technology and Society Magazine, 40(2):72–80, 2021. 5 Proc. of Int. Workshop on Advancing AI Through Theory of Mind, 2026 [5] Stevan Harnad. Other bodies, other minds: A machine incarnation of an old philosophical problem. Minds and Machines, 1(1):43–54, 1991. [6] William Hasselberger. Agency, autonomy, and social intelligibility. Pacific Philosophical Quarterly, 93(2):255–278, 2012. [7] Peter R Lewis and Ştefan Sarkadi. Reflective artificial intelligence. Minds and Machines, 34(2):14, 2024. [8] Hamilton Morrin, Luke Nicholls, Michael Levin, Jenny Yiend, Udita Iyengar, Francesca DelGuidice, Sagnik Bhattacharya, Ste- fania Tognin, James MacCabe, Ricardo Twumasi, et al. Arti- ficial intelligence-associated delusions and large language mod- els: risks, mechanisms of delusion co-creation, and safeguarding strategies. The Lancet Psychiatry, 2026. [9] Max Rollwage, Jessica McFadyen, Keno Juchems, Annamaria Balogh, Sashank Pisupati, Margareta-Theodora Mircea, Tobias U Hauser, George Prichard, and Ross Harper. A cognitive layer ar- chitecture to support large-language model performance in psy- chotherapy interactions. Nature Medicine, pages 1–9, 2026. [10] Alan Turing. Computing Machinery and Intelligence. Mind, 59(236):433–460, 1950. [11] Amos Tversky and Daniel Kahneman. Judgment under uncer- tainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. science, 185(4157):1124– 1131, 1974. 6 A Metacognitive Architecture for ToM Revision in AI Agents Jisu Kim, Mahimul Islam, Ashok Goel Design Intelligence Laboratory, Georgia Institute of Technology jisu.kim@gatech.edu, mahimul@gatech.edu, ashok.goel@c.gatech.edu Abstract This paper presents a metacognitive architecture for revis- ing an AI agent’s Theory of Mind (ToM) to address misin- terpretations in human–AI interaction. The ability to revise an agent’s interpretations of users’ mental states and char- acteristics is critical for maintaining trust and positive per- ceptions, especially in AI-mediated social interactions. To enable ToM revision, we introduce a two-level metacogni- tive architecture that integrates knowledge-based AI (KBAI) with LLMs. The architecture comprises a cognitive layer that performs the agent’s core tasks, and a metacognitive layer that introspects on the cognitive layer using a Task–Method– Knowledge (TMK) model of the agent. The metacognitive layer (1) revises its interpretation of the user in response to user feedback and (2) communicates the revision process to the user. Motivation Artificial intelligence (AI) agents often face situations in which revising their interpretations of users is crucial. For example, social AI agents that use large language models (LLMs) are inherently prone to misinterpreting users’ inten- tions, preferences, or characteristics in AI-mediated social interactions (Wang 2024). Such mistakes in user interpreta- tion erode trust in human–AI interactions and harm percep- tions of the agent’s intelligence and likability (Honig and Oron-Gilad 2018; Lee et al. 2024; Salem et al. 2015). 
In human–AI communication, after users recognize the agent’s misinterpretation, they may provide feedback to help the agent revise its interpretation. This brings us to the The- ory of Mind (ToM) revision (Wang 2024). During ToM re- vision, the agent introspects, revises its interpretation of the user based on user feedback, and communicates the revi- sion to the user. ToM revision can shape users’ percep- tions of the agent and enhance trust when misinterpretations arise (Ashktorab et al. 2019; Wang 2024). A promising approach to supporting ToM revision is to provide AI agents with a metacognitive structure. Metacog- nition is the process of “reasoning about one’s own reason- ing” (Cox 2005). It provides a higher-level mechanism for agents to reflect on and adapt their behavior (Cox and Raja 2007). Our prior work on metacognition in AI focused on endowing an agent with a theory of its own mind to support self-explanation (da Miranda et al. 2025). We extend this line of research by enabling the agent not only to explain, but also to revise its interpretation of the user in response to feedback. In this work, we introduce a metacognitive architecture for ToM revision. The architecture integrates knowledge- based AI (KBAI) with LLMs. It comprises two layers: a cognitive layer that performs the agent’s core tasks, and a metacognitive layer that uses the Task–Method–Knowledge (TMK) model to introspect on the cognitive layer. The metacognitive layer responds to user feedback by identify- ing misinterpretations, revising its interpretation, and com- municating the revision via a step-by-step explanation. ToM Revision in SAMI SAMI is an AI social agent deployed in Georgia Tech’s On- line Master of Science in Computer Science program for ten semesters, serving over 11,000 users. It recommends social connections among users based on shared interests and characteristics extracted from their online posts (Kakar et al. 2024). During deployment, users frequently requested revisions to the agent’s knowledge base. These requests aimed either to (1) correct LLM-induced misinterpretations, in which the agent failed to infer the contextual meaning of extracted entities, or to (2) update user information due to external changes. For example, the agent may misinterpret a user’s current location when the user mentions both a prior and a cur- rent location in their self-introduction post: “I am an AI re- searcher in Atlanta, but I was born in Seoul.” In this ex- ample, the agent misinterprets “Seoul” as the current loca- tion and generates recommendations accordingly. The user then provides feedback indicating the misinterpretation: “I am not in Seoul. I am in Atlanta.” This feedback triggers the agent’s ToM revision. The agent extracts relevant en- tities from the feedback and localizes the source of mis- interpretation. It updates the knowledge base by changing the user’s primary location from “Seoul” to “Atlanta”. Fi- nally, the agent communicates the revision process to the user through a step-by-step explanation as follows: • Extract entities: I am analyzing your feedback, and it seems that I mistook your location as “Seoul” instead of “Atlanta.” • Locate relevant task: I traced the issue back to my 7 entity-extraction step for your introduction post. • Identify misinterpretation: I misinterpreted “Seoul” in “I was born in Seoul” as your current location. • Revise knowledge base: I updated my knowledge base by replacing “Seoul” with “Atlanta” as your current lo- cation. 
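The revision loop illustrated by this example (extract the contested entities from the feedback, trace the responsible task, update the knowledge base, and narrate each step) can be summarized in a short sketch. This is an illustrative reconstruction under stated assumptions, not SAMI's actual implementation; all names (KnowledgeBase, SOLUTION_LIBRARY, locate_task) are hypothetical stand-ins for the components the paper describes.

# Illustrative sketch of the ToM-revision loop described above; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    facts: dict = field(default_factory=lambda: {"current_location": "Seoul"})

# Map each task to a revision function, mirroring the paper's "solution library" idea.
SOLUTION_LIBRARY = {
    "extract_entities": lambda kb, slot, value: kb.facts.__setitem__(slot, value),
}

def locate_task(feedback: str) -> str:
    # Toy heuristic standing in for TMK-based task localization.
    return "extract_entities" if "in" in feedback.lower() else "classify_post"

def revise(kb: KnowledgeBase, feedback: str, slot: str, corrected: str) -> list[str]:
    steps = [f"Extract entities: feedback mentions '{corrected}' for {slot}."]
    task = locate_task(feedback)
    steps.append(f"Locate relevant task: traced to the '{task}' step.")
    old = kb.facts.get(slot)
    steps.append(f"Identify misinterpretation: '{old}' was stored as {slot}.")
    SOLUTION_LIBRARY[task](kb, slot, corrected)
    steps.append(f"Revise knowledge base: replaced '{old}' with '{corrected}'.")
    return steps

kb = KnowledgeBase()
for line in revise(kb, "I am not in Seoul. I am in Atlanta.", "current_location", "Atlanta"):
    print(line)

In the deployed system the extraction and localization steps are handled by an LLM guided by the TMK model, as the next section explains; here they are reduced to placeholders so the control flow is visible.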
Metacognitive Architecture for ToM Revision We present a metacognitive architecture for ToM revision (Figure 1). The architecture comprises two layers: a cog- nitive layer (Level 1) and a metacognitive layer (Level 2). At Level 1 (the cognitive layer), the agent performs its core tasks. At Level 2 (the metacognitive layer), it introspects on Level 1 to identify the causes of misinterpretations about the user, revise its interpretation, and generate a revision mes- sage. For a detailed description of the architecture, see our full paper (Kim, Islam, and Goel 2026). Level 1 Reasoning At Level 1, the agent generates initial social recommenda- tions based on users’ introduction posts. The agent first clas- sifies the type of post. If the post is an introduction, the agent extracts entities such as hobbies, locations, and academic interests. The agent stores the extracted information in a knowledge base. It then applies a matchmaking algorithm over the knowledge base to identify users with shared at- tributes. Finally, it generates personalized recommendation messages (Kakar et al. 2024). Task, Method, Knowledge Representation To support Level 2 introspection over Level 1, we use TMK to represent the agent’s Level 1 reasoning process. TMK enables ToM reasoning by providing the agent with a structured, interpretable self-model of its internal processes. It encodes three components: Tasks denote the goals the agent tries to achieve, Methods specify how tasks are ex- ecuted, and Knowledge refers to the information the agent uses (Goel and Rugaber 2014). For ToM revision, we fo- cus on the Task model. It decomposes the agent’s Level 1 process into interpretable units, such as “Identify and Ex- tract Entities from the post.” This expressiveness provides the structural basis for localizing where a misinterpretation has happened (Goel and Rugaber 2017). Level 2 Reasoning At Level 2, the metacognitive layer leverages TMK to in- trospect on Level 1. When a user’s post is classified as feedback requesting a revision, the agent executes Level 2 to identify the cause of the misinterpretation and revise its interpretation of the user. First, the agent extracts task- relevant entities, the specific information the user is re- questing to revise, from the user feedback. It then identi- fies the Task in the TMK that led to the misinterpretation. Using the identified task, the agent performs a dictionary lookup in a solution library to retrieve the associated revi- sion function. We constructed this solution library from mis- interpretations observed during deployment and their corre- sponding revision functions. The agent then applies the re- Figure 1: Overview of the two-level metacognitive architec- ture for ToM revision. Level 1 executes the agent’s primary tasks, and Level 2 introspects on Level 1 to revise the agent’s interpretation of the user. trieved function to update the user data stored in the knowl- edge base. At each stage, the agent generates intermediate natural-language messages describing its reasoning and ac- tions. These messages are compiled into a single step-by- step revision message and presented to the user. Discussion To examine the behavior of the proposed architecture across different misinterpretation scenarios, we employed a con- trolled synthetic data check (Nauta et al. 2023). We used 20 cases derived from real user data and misinterpretations observed during SAMI’s deployment. Of the 20 cases, the metacognitive architecture completed revisions in 15 cases. 
In these cases, the agent identified the task-relevant mis- interpretation, revised its knowledge base for the user, and generated an explanation of the revision. These results sug- gest that the architecture can improve the interpretability and transparency of an AI agent’s behavior by revising and com- municating its interpretation of the user. More broadly, this work offers a generalizable architec- ture for developing trustworthy AI agents. It demonstrates how a metacognitive layer can support introspection for ToM revision. Furthermore, our work highlights the com- plementary benefits of combining KBAI and LLMs. KBAI, including TMK, the solution library, and the knowledge base, supports more interpretable and controlled agent be- havior (Pan et al. 2024; Gaur and Sheth 2024). At the same time, LLMs provide the semantic flexibility needed to gen- erate natural-language reasoning and responses. Our work points to new directions for designing AI agents that foster trust in human–AI interaction through transparent ToM re- vision and communication. Acknowledgments This research was supported by a grant from the US NSF (#2247790) to the National AI Institute of AI for Adult Learning and Online Education (aialoe.org). 8 References Ashktorab, Z.; Jain, M.; Liao, Q. V.; and Weisz, J. D. 2019. Resilient Chatbots: Repair Strategy Preferences for Conver- sational Breakdowns. In Proceedings of the 2019 CHI Con- ference on Human Factors in Computing Systems, 1–12. Cox, M.; and Raja, A. 2007. Metareasoning: A manifesto. BBN Technical. Cox, M. T. 2005. Metacognition in computation: A selected research review. Artificial Intelligence, 169(2): 104–141. da Miranda, M. P.; Islam, M.; Basappa, R.; Taylor, T.; and Goel, A. 2025. Adaptable Social AI Agents. In Proceed- ings of the 1st Workshop on Advancing Artificial Intelligence through Theory of Mind (AAAI-25). Gaur, M.; and Sheth, A. 2024. Building trustworthy Neu- roSymbolic AI Systems: Consistency, reliability, explain- ability, and safety. AI Magazine, 45(1): 139–155. Goel, A. K.; and Rugaber, S. 2014.Interactive meta- reasoning: Towards a CAD-like environment for designing game-playing agents. In Computational creativity research: Towards creative machines, 347–370. Springer. Goel, A. K.; and Rugaber, S. 2017. GAIA: A CAD-Like En- vironment for Designing Game-Playing Agents. IEEE Intel- ligent Systems, 32(3): 60–67. Honig, S.; and Oron-Gilad, T. 2018. Understanding and re- solving failures in human-robot interaction: Literature re- view and model development. Frontiers in psychology, 9: 861. Kakar, S.; Basappa, R.; Camacho, I.; Griswold, C.; Houk, A.; Leung, C.; Tekman, M.; Westervelt, P.; Wang, Q.; and Goel, A. K. 2024. SAMI: an AI actor for fostering social in- teractions in online classrooms. In International Conference on Intelligent Tutoring Systems, 149–161. Springer. Kim, J.; Islam, M.; and Goel, A. 2026. A Metacognitive Ar- chitecture for Correcting LLM Errors in AI Agents. In Pro- ceedings of the AAAI Conference on Artificial Intelligence. Lee, Y.; Son, K.; Kim, T. S.; Kim, J.; Chung, J. J. Y.; Adar, E.; and Kim, J. 2024. One vs. many: Comprehending accu- rate information from multiple erroneous and inconsistent ai generations. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2518–2531. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schl ̈ otterer, J.; Van Keulen, M.; and Seifert, C. 2023. 
From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Computing Surveys, 55(13s): 1–42. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; and Wu, X. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, 36(7): 3580–3599. Salem, M.; Lakatos, G.; Amirabdollahian, F.; and Dauten- hahn, K. 2015. Would you trust a (faulty) robot? Effects of error, task type, and personality on human-robot coopera- tion and trust. In Proceedings of the tenth annual ACM/IEEE international conference on human-robot interaction, 141– 148. Wang, Q. 2024. Mutual theory of mind for human-AI com- munication in AI-mediated social interaction. Ph.D. thesis, Ph. D. Dissertation. Georgia Institute of Technology. 9 Exploration Through Introspection: A Self-Aware Reward Model Michael Petrowski, Milica Ga ˇ si ́ c, Heinrich Heine University D ̈ usseldorf, Germany m.petrowski@uni-duesseldorf.de, gasic@hhu.de Abstract Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learn- ing agents infer their own internal states in gridworld en- vironments. Specifically, we introduce an introspective ex- ploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer “pain-belief” from online observations. This signal is inte- grated into a subjective reward function to study how self- awareness affects the agent’s learning abilities. Further, we use this computational framework to investigate the differ- ence in performance between normal and chronic pain per- ception models. Results show that introspective agents in gen- eral significantly outperform standard baseline agents and can replicate complex human-like behaviors. Code — https://github.com/m-petrowski/painrl Datasets — https://doi.org/10.5281/zenodo.18036125 1 Introduction A defining feature of human intelligence is the ability to at- tribute hidden mental states like beliefs to oneself and others, known as Theory of Mind (ToM) (Premack and Woodruff 1978). While recent ToM models in AI often emphasize in- ference of beliefs of others (Abrini et al. 2025), a closely related aspect to understanding others is a sophisticated un- derstanding of oneself. Happ ́ e (2003) argues that humans possess a unified system for thinking about mental states, whether directed at oneself or others. Inspired by this sys- tem in humans, equipping an artificial agent with the ability to model its own affective states might provide a form of self-awareness that is closely related to the understanding of similar states in others. In biological agents, signals such as pain are a core component of adaptive learning (Seymour 2019; Tabor and Burr 2019). They encode internal evalua- tions that guide the agent through complex environments. In this work, we propose a computational framework for introspection that provides an exploration incentive through self-evaluation to a reinforcement learning (RL) agent. In- spired by the exploratory properties of biological pain, we operationalize an aversive signal not as direct environmental feedback, but as a latent state an agent must infer. 
Specifi- cally, the agent’s own internal belief state of pain is inferred from observations of its own “happiness”, which is a reward signal proposed by Dubey, Griffiths, and Dayan (2022). Our novel reward acts as a powerful intrinsic learning sig- nal that provides a dynamic exploration incentive and adapts to environmental changes. Further, this introspective model enables us to study the performance of agents with differ- ent types of perceptions. We model normal and chronic pain perception and show how such a maladaptive perception can produce complex, addiction-like behaviors within our framework. 2 Related Work Traditional RL frameworks optimize externally defined re- ward functions (Sutton and Barto 2018), lacking the rep- resentational depth needed for mental-state reasoning. Re- cent work introduces subjective reward functions inspired by human features, where agents evaluate outcomes relative to prior expectations and comparisons (Dubey, Griffiths, and Dayan 2022). Similarly, Bayesian models of pain conceptu- alize pain perception as inference over hidden nociceptive causes (Eckert, Pabst, and Endres 2022). In cognitive sci- ence, ToM is often modeled through partially observable in- ference, where agents estimate others’ beliefs under uncer- tainty (Baker et al. 2017). The same probabilistic architec- ture applies to introspective inference: Mahajan, Dayan, and Seymour (2025) cast pain and injury as Partially Observ- able Markov Decision Processes. Our work builds on this paradigm by embedding an inferential affective model in- side an RL agent, thereby merging Bayesian inference with decision-theoretic learning. 3 A Self-Aware Reward Model Our introspective exploration component consists of a hid- den Markov model (HMM) inspired by Eckert, Pabst, and Endres (2022). We utilize the parameters from their HMMs that resemble normal and chronic pain perception (see Appendix A). Both models use hid- den states H t ∈ H =pain, no pain and observations O t ∈ O =noxious, harmless. Their chronic pain model has sticky transitions and ambiguous emissions, represent- ing a maladaptive perception model. The normal pain model is more informative and represents the healthy counterpart. Henceforth, we will refer to our models as “normal pain” 10 and “chronic pain” for convenience, acknowledging that they do not represent actual clinical pain in our framework. We extend the happiness function f h , which balances objective reward, expectations, and relative comparisons (Dubey, Griffiths, and Dayan 2022), with our HMM-based aversive signal. The resulting function f w represents the agent’s “well-being” and acts as the subjective reward func- tion: f h = w 1 · Objective + w 2 · Expect + w 3 · Compare f w = f h − w 4 · Pain (1) For learning, we use the Q-Learning algorithm (Watkins and Dayan 1992) with f w as the reward function in combination with an ε-greedy policy. The Pain component from f w is defined as the belief state of pain: Pain . = b(H t = pain) = Pr(H t = pain| O 1:t )(2) The beliefs are updated online using the forward algorithm (Rabiner 1989). The observations are obtained by incorpo- rating the sensory information given by f h . At every time step t, after taking action a t in state s t and transitioning into state s t+1 , we observe the value of the happiness f h . If f h ≥ 0, then O t is harmless and if f h < 0, it is noxious. 4 Experiments & Results We evaluate agents with introspective exploration in experi- ments across two different gridworld environments. 
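A minimal sketch of the introspective reward just defined: the happiness signal is thresholded into a noxious/harmless observation, a forward-algorithm step updates the belief over the hidden pain state, and that belief is subtracted from the happiness term to give the subjective reward f_w. The transition, emission, and initial values below follow the normal-pain HMM reported in the paper's appendix; the weight w4 and the example happiness value are placeholders chosen for illustration.

# Sketch of the introspective reward: HMM forward update over a hidden pain
# state, folded into the subjective well-being reward f_w = f_h - w4 * Pain.
import numpy as np

T = np.array([[0.3, 0.7],   # P(H_t | H_{t-1} = pain)    -> [pain, no pain]
              [0.2, 0.8]])  # P(H_t | H_{t-1} = no pain)
E = np.array([[0.8, 0.2],   # P(O_t | H_t = pain)        -> [noxious, harmless]
              [0.1, 0.9]])  # P(O_t | H_t = no pain)
belief = np.array([0.223, 0.777])  # P(H_0) over [pain, no pain]

def forward_step(belief: np.ndarray, obs: int) -> np.ndarray:
    """One forward-algorithm update: predict with T, weight by E, renormalize."""
    predicted = belief @ T
    updated = predicted * E[:, obs]
    return updated / updated.sum()

def subjective_reward(f_h: float, belief: np.ndarray, w4: float = 0.5):
    obs = 0 if f_h < 0 else 1        # noxious if happiness is negative, else harmless
    belief = forward_step(belief, obs)
    pain = belief[0]                  # Pain := b(H_t = pain)
    return f_h - w4 * pain, belief    # f_w = f_h - w4 * Pain

f_w, belief = subjective_reward(f_h=-0.4, belief=belief)
print(round(f_w, 3), belief.round(3))

In the full agent, f_w then replaces the environment reward in a standard Q-learning update with an ε-greedy policy; the two gridworld environments used to evaluate it are described next.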
Each is a 7 × 7 gridworld that contains a single agent and a food state (+1 objective reward, all other states 0), where the goal is to reach the food as quickly and as often as possible. In the stationary environment, the food state does not change its position throughout the agent’s lifetime (2500 steps). In the non-stationary environment, the food state changes its location to one of the other corners after every 1250 time steps during the agent’s lifetime (5000 steps). The setup is inspired by Dubey, Griffiths, and Dayan (2022). For each environment, we perform a grid search to find the parameters of the best performing agents with normal and chronic pain models from various reward function groups using the optimal reward framework (Singh, Lewis, and Barto 2009), with the mean cumulative objective reward (COR) as the performance metric. Reward function groups are determined by the components of f h that are activated via their weights w > 0 (see Appendix B for details). Our results show that introspective agents generally outperform ‘No pain’ baseline agents in their respec- tive reward category across environments (Figure 1, Fig- ure 5). This illustrates faster adaptation while receiv- ing a dynamic exploration incentive through the aver- sive signal (Figure 1, Appendix C). We also observed that agents with normal (M = 2295.6,SD = 65.7) and chronic (M = 2295.0,SD = 66.1) pain performed simi- larly in the stationary ‘Objective+Expect’ category, whereas the chronic model (M = 4214.6,SD = 165.4) interestingly outperformed the normal one (M = 3814.0,SD = 446.6) in the non-stationary setting. In both scenarios, the agents using self-awareness were superior to their respective ‘No pain’ baselines. This also applies to other categories (Fig- ure 1, Appendix C). Figure 1: Non-stationary: Mean cumulative objective reward (COR) and standard deviation (SD) of the best performing agents from each reward function category across different self-awareness models. (*): Statistically significant improve- ment over the same category ‘No pain’ baseline (one-sided paired-samples t-test, p≪ 0.05). Sample size n = 300. In the non-stationary setting, normal (α = 0.9, ‘Objective only’) and chronic (α = 0.1, ‘Objective only’) agents per- form better with opposite learning rates α, suggesting differ- ences in learning between perception models. Interestingly, the chronic pain agent notably outperforms its counterpart at the cost of a negative cumulative well-being throughout its lifetime. Its momentary well-being reaches zero only when visiting the food state, paralleling relief-seeking behavior (Appendix D). However, agents with this maladaptive com- ponent also yield the worst overall performance within the ‘Objective+Expect’ category (Appendix C, Figure 6). 5 Discussion & Conclusion Our introspection component serves as a powerful tool for adaptive learning: the normal model acts as a low-pass fil- ter, smoothing the happiness signal f h into a stable belief state that provides a dynamic exploration bonus, which is particularly advantageous in non-stationary environments. The chronic model, while approximating a constant decre- ment similar to the fixed aspiration level in Dubey, Griffiths, and Dayan (2022), is valuable for the complex behaviors it produces. When given a psychological interpretation, the agent’s high performance despite negative well-being aligns with findings on chronic pain and quality of life (Gureje et al. 1998; Blyth et al. 2001). 
Furthermore, its relief-seeking behavior through faster recovery provides a computational parallel to negative reinforcement in addiction (Koob and Le Moal 2008). For more details, see Appendix D. While this work has limitations, such as using a simple ε-greedy baseline and fixed HMM parameters, it success- fully demonstrates that an RL agent’s performance improves when it models one of its own internal aversive states. This self-awareness enhances adaptation and can generate com- plex, psychologically plausible dynamics. Our framework models the self-application aspect of the unified system hy- pothesized for ToM, and future work can test this by extend- ing the architecture to infer others’ states. 11 References Abrini, M.; Abend, O.; Acklin, D.; Admoni, H.; Aichinger, G.; Alon, N.; Ashktorab, Z.; Atreja, A.; Auron, M.; Aufre- iter, A.; Awasthi, R.; Banerjee, S.; Barnby, J. M.; Basappa, R.; Bergsmann, S.; Bouneffouf, D.; Callaghan, P.; Cavazza, M.; Chaminade, T.; Chernova, S.; Chetouan, M.; Choud- hury, M.; Cleeremans, A.; Cywinski, J. B.; Cuzzolin, F.; Deng, H.; Diamond, N.; Pasquasio, C. D.; Dumas, G.; van Duijn, M.; Dwarikanath, M.; Gao, Q.; Goel, A.; Goldstein, R.; Gombolay, M.; Gonzalez, G. E.; Halilovic, A.; Halmdi- enst, T.; Islam, M.; Jara-Ettinger, J.; Kastel, N.; Keydar, R.; Khanna, A. K.; Khoramshahi, M.; Kim, J.; Kim, M.; Kim, Y.; Krivic, S.; Krasnytskyi, N.; Kumar, A.; Kwon, J.; Lee, E.; Lee, S.; Lewis, P. R.; Li, X.; Li, Y.; Lewandowski, M.; Lloyd, N.; Luebbers, M. B.; Luo, D.; Lyu, H.; Mahapatra, D.; Maheshwari, K.; Mainali, M.; Mathur, P.; Mederitsch, P.; Miura, S.; de Miranda, M. P.; Mirsky, R.; Mishra, S.; Moor- man, N.; Morrison, K.; Muchovej, J.; Nessler, B.; Nessler, F.; Nguyen, H. M. J.; Ortego, A.; Papay, F. A.; Pasquali, A.; Rahimi, H.; Raghu, C.; Royka, A.; Sarkadi, S.; Scheuerman, J.; Schmid, S.; Schrater, P.; Sen, A.; Sheikhbahaee, Z.; Shi, K.; Simmons, R.; Singh, N.; Smith, M. O.; van der Meulen, R.; Solaki, A.; Sun, H.; Szolga, V.; Taylor, M. E.; Taylor, T.; Waveren, S. V.; Vargas, J. D.; Verbrugge, R.; Wagner, E.; Weisz, J. D.; Wen, X.; Yeoh, W.; Zhang, W.; Zhao, M.; and Zilberstein, S. 2025. Proceedings of 1st Workshop on Advancing Artificial Intelligence through Theory of Mind. arXiv:2505.03770. Baker, C. L.; Jara-Ettinger, J.; Saxe, R.; and Tenenbaum, J. B. 2017. Rational quantitative attribution of beliefs, de- sires and percepts in human mentalizing. Nature Human Behaviour, 1(4): 0064. Blyth, F. M.; March, L. M.; Brnabic, A. J. M.; Jorm, L. R.; Williamson, M.; and Cousins, M. J. 2001. Chronic pain in Australia: a prevalence study. PAIN, 89(2). Dubey, R.; Griffiths, T. L.; and Dayan, P. 2022. The pursuit of happiness: A reinforcement learning perspective on ha- bituation and comparisons. PLOS Computational Biology, 18(8): 1–32. Eckert, A.-L.; Pabst, K.; and Endres, D. M. 2022.A Bayesian model for chronic pain. Frontiers in Pain Re- search, Volume 3 - 2022. Gureje, O.; Von Korff, M.; Simon, G. E.; and Gater, R. 1998. Persistent Pain and Well-being: A World Health Organiza- tion Study in Primary Care. JAMA, 280(2): 147–151. Happ ́ e, F. 2003. Theory of mind and the self. Annals of the New York Academy of Sciences, 1001(1): 134–144. Koob, G. F.; and Le Moal, M. 2008. Addiction and the Brain Antireward System. Annual Review of Psychology, 59: 29– 53. Mahajan, P.; Dayan, P.; and Seymour, B. 2025. Homeostasis After Injury: How Intertwined Inference and Control Under- pin Post-Injury Pain and Behaviour. bioRxiv. Preprint. Navratilova, E.; Xie, J. 
Y.; Okun, A.; Qu, C.; Eyde, N.; Ci, S.; Ossipov, M. H.; King, T.; Fields, H. L.; and Porreca, F. 2012. Pain relief produces negative reinforcement through activation of mesolimbic reward–valuation circuitry. Pro- ceedings of the National Academy of Sciences, 109(50): 20709–20713. Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4): 515–526. Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257–286. Seymour, B. 2019. Pain: A Precision Signal for Reinforce- ment Learning and Control. Neuron, 101(6): 1029–1041. Singh, S.; Lewis, R. L.; and Barto, A. G. 2009. Where do rewards come from. In Proceedings of the annual confer- ence of the cognitive science society, 2601–2606. Cognitive Science Society. Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learn- ing: An Introduction. MIT Press, 2nd edition. Tabor, A.; and Burr, C. 2019. Bayesian Learning Models of Pain: A Call to Action. Current Opinion in Behavioral Sciences, 26: 54–61. Pain and Aversive Motivation. Watkins, C. J. C. H.; and Dayan, P. 1992. Technical Note: Q-Learning. Machine Learning, 8(3): 279–292. A Hidden Markov Model Parameters Parameters for the HMMs can be found in Figure 2 and Fig- ure 3. H t−1 |H t painnopain pain t−1 0.30.7 nopain t−1 0.20.8 H t |O t noxiousharmless pain0.80.2 nopain0.10.9 H 0 pain0.223 no pain0.777 Figure 2: Normal pain: Parameters of the transition ma- trix Pr(H t | H t−1 ), emission matrix Pr(O t | H t ), and ini- tial state distribution Pr(H 0 ) of the hidden Markov model for normal pain perception. Transitions favor recovery, and emissions distinguish noxious from harmless sensations. Adapted from Eckert, Pabst, and Endres (2022). B Experiment Setup All agents in the experiments use the Q-learning algorithm with an ε-greedy policy to balance exploration and exploita- tion. The agent spawns in the bottom left corner and can navigate the environment by choosing one of five actions at each time step: UP, DOWN, LEFT, and RIGHT move the agent in the corresponding direction to the following field in the gridworld. The fifth action STAY makes it possible for 12 H t−1 |H t painnopain pain t−1 0.80.2 nopain t−1 0.70.3 H t |O t noxiousharmless pain0.60.4 no pain0.60.4 H 0 pain0.777 no pain0.223 Figure 3: Chronic pain: Parameters of the transition ma- trix Pr(H t | H t−1 ), emission matrix Pr(O t | H t ), and ini- tial state distribution Pr(H 0 ) of the hidden Markov model for chronic pain perception. Transitions are sticky and emis- sions are ambiguous. Adapted from Eckert, Pabst, and En- dres (2022). the agent to keep its position on the current field. Figure 4 shows this basic environment setup. Figure 4: Basic environment setup. Yellow circle: agent. Green square: food state. Yellow-marked area: possible ini- tial agent spawn positions. Green-marked area: possible ini- tial food state spawn positions. Grid Search According to the optimal reward framework (Singh, Lewis, and Barto 2009), we produce histories for all reward func- tions r A ∈ R A and evaluate the histories using the fit- ness function F , which is the cumulative objective re- ward at the end of agent A’s lifetime. In our setup, the cumulative objective reward corresponds to the number of time steps the agent has visited the food state dur- ing its lifetime, as only the food state grants the agent an objective reward of +1. 
Specifically, for each reward function r A , we will produce n = 300 histories, sam- pling a new environment E ∈ E stationary for each trajec- tory, and take the mean of F(h) over all those 300 his- tories. Also, the set of possible subjective reward func- tions is divided into what we call reward categories that correspond to the weights w 1 ,w 2 ,w 3 that are activated. There are seven of those categories: ‘Objective only’, ‘Ex- pect only’, ‘Compare only’, ‘Objective+Expect’, ‘Objec- tive+Compare’, ‘Expect+Compare’ and ‘All’. Each category is then divided into three subcategories, corresponding to the used pain model, which are No pain, Normal pain and Chronic pain. For each reward subcategory, we de- termine the best-performing agent according to the opti- mal reward function r ∗ A using the grid search. Because the grid search includes a large number of hyperparameters and grows exponentially with them, we limited our search space to w 1 ,w 2 ,w 3 ,w 4 ∈ [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1], the aspira- tion level ρ∈ [0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1], the learn- ing rate α ∈ [0.1, 0.3, 0.5, 0.7, 0.9] and the exploration rate ε ∈ [0.01, 0.1] to save computational resources, while also including the two HMMs for normal and chronic pain. Re- moving duplicate or invalid reward functions results in a to- tal of 312,130 subjective reward functions for each of the two environments. The used discount factor for all experi- ments was γ = 0.99. C Detailed Results Stationary Environment Detailed results of the experiments in the stationary environ- ment can be taken from Figure 5 and Table 1. Figure 5: Stationary environment: Mean cumulative ob- jective reward (COR) and standard deviation (SD) of the best performing agents from each reward function cate- gory across different self-awareness models. (*): Statisti- cally significant improvement over the same category ‘No pain’ baseline (one-sided paired-samples t-test, p ≪ 0.05). Sample size n = 300. To highlight the differences, the plot starts from 1000 on the y-axis. Non-Stationary Environment Detailed results of the experiments in the non-stationary en- vironment can be taken from Figure 1 and Table 2. Figure 7 shows the performance of normal and chronic agents in the ‘Objective only’ category across different learning rates α (see Section 4). Plots of the objective reward, momentary well-being, subjective pain (w 4 · Pain) and cumulative well- being of the best normal and chronic pain agents in the ‘Ob- 13 Stationary Environment Results Cat.Pain w 1 w 2 w 3 w 4 ρ ε α MeanSD Obj. / 0.9 0.0 0.0 0.0 NA .1 .9 1858.6 338.7 N 0.1 0.0 0.0 1.0 NA .01 .7 2279.5 ∗ 69.3 C 0.1 0.0 0.0 0.1 NA .01 .9 2266.3 ∗ 68.5 Exp. / 0.0 0.9 0.0 0.0 NA .01 .7 1973.1 385.0 N 0.0 0.7 0.0 0.7 NA .01 .7 2295.6 ∗ 65.7 C 0.0 0.3 0.0 0.3 NA .01 .9 2294.6 ∗ 66.0 Comp. / 0.0 0.0 0.9 0.0 0.7 .01 .9 2272.2 69.1 N 0.0 0.0 0.5 0.9 0.05 .01 .7 2269.3 67.7 C 0.0 0.0 1.0 0.1 0.7 .01 .9 2270.2 67.0 Obj. + Exp. / 0.5 0.9 0.0 0.0 NA .01 .7 1973.1 385.0 N 0.1 0.7 0.0 0.7 NA .01 .7 2295.6 ∗ 65.7 C 0.9 0.3 0.0 0.7 NA .01 .9 2295.0 ∗ 66.1 Obj. + Comp. / 0.1 0.0 0.5 0.0 1.0 .01 .7 2272.2 69.1 N 0.1 0.0 0.5 0.9 0.05 .01 .7 2269.3 67.7 C 0.1 0.0 0.5 0.1 1.0 .01 .7 2270.3 67.0 Exp. + Comp. 
/ 0.0 0.7 1.0 0.0 0.9 .01 .7 2291.1 65.8 N 0.0 0.3 0.3 0.7 0.01 .01 .1 2310.6 ∗ 62.5 C 0.0 0.7 0.3 0.5 0.05 .01 .7 2300.4 ∗ 61.0 All / 0.1 0.7 0.7 0.0 1.0 .01 .7 2291.0 65.8 N 0.7 0.3 0.3 0.7 0.01 .01 .1 2310.6 ∗ 62.5 C 0.7 0.3 0.7 1.0 0.01 .01 .9 2300.6 ∗ 59.6 Table 1: Stationary Environment: Parameters and corre- sponding mean cumulative objective reward (Mean) and standard deviation (SD) of the best performing agents from each reward function category across different pain condi- tions in the stationary environment. The reward categories are divided into pain subcategories: ‘No pain’ (/), ‘Normal pain’ (N) and ‘Chronic pain’ (C). Results marked with an asterisk (*) indicate a statistically significant improvement over the ‘No pain’ baseline within the same category (one- sided paired-samples t-test, p ≪ 0.05). Mean and SD are rounded to one decimal place. Lifetime = 2500, sample size n = 300. jective+Expect’ category can be found in Figure 8 and Fig- ure 9, respectively. Figure 6 shows the distribution of the performance of all the normal and chronic agents in the ‘Ob- jective+Expect’ category that were part of the grid search, mentioned in Section 4. D Further Analysis In the non-stationary environment, the chronic agent adapts faster to changes than the normal agent. Yet, the former consistently accumulates a negative cumulative well-being across its lifetime, while the normal agent remains positive. For the chronic pain agent, momentary well-being only recovers to approximately zero from a negative state when the food state is reached, as can be seen in the nearly iden- tical trajectories of both objective reward and well-being in Figure 9. This means that food discovery provides only temporary relief rather than a sustained positive value (Navratilova et al. 2012). The pattern strongly resembles how painkillers or addictive substances work in humans: Non-Stationary Environment Results Cat.Pain w 1 w 2 w 3 w 4 ρ ε α MeanSD Obj. / 0.1 0.0 0.0 0.0 NA .1 .9 1586.5 631.2 N 0.1 0.0 0.0 1.0 NA .01 .9 3101.8 ∗ 271.8 C 0.7 0.0 0.0 0.9 NA .01 .1 4142.5 ∗ 177.2 Exp. / 0.0 1.0 0.0 0.0 NA .1 .7 2371.0 613.3 N 0.0 0.1 0.0 0.9 NA .01 .1 3896.3 ∗ 383.1 C 0.0 0.3 0.0 0.3 NA .01 .3 4197.7 ∗ 186.9 Comp. / 0.0 0.0 0.1 0.0 1.0 .01 .1 4171.1 178.9 N 0.0 0.0 0.9 1.0 0.9 .01 .3 4173.1 178.0 C 0.0 0.0 0.9 0.1 0.9 001 .3 4178.1 181.6 Obj. + Exp. / 0.5 1.0 0.0 0.0 NA .1 .7 2371.0 613.3 N 0.1 0.1 0.0 0.5 NA .01 .7 3814.0 ∗ 446.6 C 0.1 0.3 0.0 0.5 NA .01 .3 4214.6 ∗ 165.4 Obj. + Comp. / 0.1 0.0 1.0 0.0 1.0 .01 .3 4008.8 189.1 N 0.1 0.0 1.0 1.0 1.0 .01 .3 4165.9 ∗ 172.2 C 0.3 0.0 0.3 0.5 0.7 .01 .3 4205.3 ∗ 190.8 Exp. + Comp. / 0.0 0.1 1.0 0.0 1.0 .01 .3 4194.2 188.8 N 0.0 0.1 1.0 0.9 1.0 .01 .3 4222.5 ∗ 176.0 C 0.0 0.3 0.3 0.5 0.7 .01 .1 4235.3 ∗ 170.0 All / 0.1 0.5 1.0 0.0 1.0 .01 .3 4194.6 201.2 N 0.1 0.3 1.0 0.7 1.0 .01 .3 4210.0 184.4 C 0.1 0.7 0.7 1.0 1.0 .01 .1 4235.5 ∗ 180.3 Table 2: Non-stationary environment: Parameters and cor- responding mean cumulative objective reward (Mean) and standard deviation (SD) of the best performing agents from each reward function category across different pain condi- tions in the non-stationary environment. The reward cate- gories are divided into pain subcategories: ‘No pain’ (/), ‘Normal pain’ (N) and ‘Chronic pain’ (C). Results marked with an asterisk (*) indicate a statistically significant im- provement over the ‘No pain’ baseline within the same cat- egory (one-sided paired-samples t-test, p ≪ 0.05). Mean and SD are rounded to one decimal place. 
Lifetime = 5000, sample size n = 300. Figure 6: Non-stationary environment: Distribution of mean cumulative objective reward (COR) in the ‘Objec- tive+Expect’ category, comparing all reward functions of the agents from the grid search with normal vs. chronic pain per- ception in this category. 14 Figure 7: Non-stationary environment: Mean cumulative ob- jective reward (COR) of respective best-performing ‘Objec- tive only’ agents with normal vs. chronic pain perception, plotted across different learning rates α. Figure 8: Normal pain, non-stationary environment: Mean and standard deviation (SD) of the Objective reward, mo- mentary well-being, subjective pain (w 4 · Pain) and cumu- lative well-being of the best performing ‘Objective+Expect, Normal pain’ agent, plotted over the agent’s lifetime (5000 steps). consuming the drug brings short-term relief from a negative baseline, but once the effect wears off, well-being quickly drops back into the negative domain (Koob and Le Moal 2008). For the chronic pain agent, being on the food state results in a neutral (zero) well-being, but losing access to the food causes an immediate drop below zero due to the high, persistent pain signal. This negative state is only Figure 9: Chronic pain, non-stationary environment: Mean and standard deviation (SD) of the Objective reward, mo- mentary well-being, subjective pain (w 4 · Pain) and cumu- lative well-being of the best performing ‘Objective+Expect, Chronic pain’ agent, plotted over the agent’s lifetime (5000 steps). alleviated by finding the food again, creating a cycle of relief-seeking behavior. Paradoxically, this need to escape the negative baseline improves task performance compared to the normal agent. The normal pain agent, in contrast, does not exhibit this cycle. Its momentary well-being briefly dips below zero when the food location changes, but then recovers to a non-negative level even before the food is found, leading to a stable and mostly positive cumulative well-being over its lifetime. 15 Belief-Desire-Intention Dynamics in Language Models via thep-Beauty Contest Vignesh Natarajkumar 1 , Takayuki Ito 2 , Rafik Hadfi 2 1 Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, India 2 Graduate School of Informatics, Kyoto University, Kyoto, Japan vigneshn@smail.iitm.ac.in, ito@i.kyoto-u.ac.jp, rafik.hadfi@i.kyoto-u.ac.jp Abstract Strategic games provide a simple way to examine how lan- guage models reason about others and explain their own choices. They thus offer a quantifiable benchmark of a model’s capacity for Theory of Mind (ToM). In this study, we use an iteratedp-beauty contest to understand how language models of different sizes describe their intentions and ratio- nalize decisions while accounting for the behavior of other agents. The results highlight a correlation between model scale and the ability to participate across rounds of the contest strategically and track the future moves of opponents. We also find that a Belief-Desire-Intention (BDI)-inspired prompting scaffold helps reveal and address inconsistencies that are not visible from an observation of suboptimal gameplay alone. 1 Introduction Theory of Mind (ToM) refers to the ability to infer and rea- son about the mental states of oneself and others (Premack and Woodruff 1978; Wellman, Cross, and Watson 2001). This capacity supports many examples of human social in- teraction and strategic behavior (Li et al. 
2024) and is com- monly used to study reasoning in both children and adults (Osterhaus and Koerber 2021). Adult humans are also ca- pable of recursive higher-order ToM - forming beliefs about others’ beliefs - which is important for multiparty interaction and cooperative decision-making (Liddle and Nettle 2006; De Weerd, Verbrugge, and Verheij 2015). The Belief-Desire-Intention (BDI) model is a cognitive architecture used to understand an agent’s reasoning through the lens of human psychology (Rao and Georgeff 1995). Previous research has utilized this framework for improving transparency in agentic decision-making (Jang et al. 2023). Recent work has evaluated whether language mod- els demonstrate ToM-like behavior (Kosinski 2024), with mixed results across benchmarks (Zhang et al. 2024; Chen et al. 2024). Smaller models particularly struggle with sta- ble recursive reasoning (Srivastava, Cao, and Wang 2025), motivating an examination of how models of different sizes represent and act on beliefs about others. (He et al. 2023) and (Duijn et al. 2023) have benchmarked language models on higher-order ToM tasks as well. Game-theoretic challenges offer a way to assess both mentalization abilities and strategic thinking. The p-beauty contest is a standard test of strategic “depth of reasoning”. In the general p-beauty contest, players simultaneously choose a number from a fixed non-negative range, and the winner is the one whose guess is closest to p times the mean of all submitted guesses. The Nash equilibrium in such a contest for 0 < p < 1 is 0, but humans typically display bounded ra- tionality and reduce their guesses iteratively, displaying “k- level reasoning” (Nagel 1995; Cooper et al. 2024), where k represents the depth of recursive reasoning undertaken. Quantifying the depth of reasoning for language models is useful for assessing their mentalization abilities. In this work, we study how models of different sizes par- ticipate in the p-beauty contest in two distinct multi-agent settings. Our observations using a BDI-inspired prompt structure indicate interesting misalignments in the “Beliefs, Desires, and Intentions” of smaller models that lead to sub- optimal responses in their gameplay. 2 Methodology We study the gameplay of four models from the Qwen 2.5- xB-Instruct family (x = 1.5B, 3B, 7B, 14B) (Qwen et al. 2025) in two separate settings. In our setup, guesses are re- stricted to the range [0, 200] and we set p = 0.5, making the target for each round half of the group’s average response. The player closest to the target in each round earns one point, and the participant with the highest cumulative score at the end of all rounds is the winner. Each experiment is run for five rounds and averaged over ten independent trials. In the first setting, four identical instances of a single lan- guage model play the game, and the impact of Chain-of- Thought (CoT) prompting on the gameplay is analyzed. In the second setting, four distinct model sizes from the same family compete. Agents in this game are prompted to generate a comprehensive Belief-Desire-Intention (BDI) re- port for every participant - including themselves - before fi- nalizing their move. The components of the report are based on the intuition provided by (Rao and Georgeff 1995): • “Belief”: a numerical prediction of the mean of the up- coming round based on gameplay history. • “Desire”: a natural-language description of the goal of the target agent. 
• "Intention": a numerical prediction of the number that the target agent might play, considering its desires.

Figure 1: Gameplay in the first setting under regular and Chain-of-Thought prompting for four Qwen 2.5 models.

3 Results and Discussion

We first analyze the results of the first setting. Figure 1 shows the mean guesses over the first five rounds for both the standard and CoT prompting conditions. Both the 1.5B and 3B models respond in a narrow band across all rounds and do not approach the equilibrium despite the usage of CoT prompting, indicating the ineffectiveness of such strategies in smaller models. Comparing the smaller models, the 3B model consistently chooses better responses than the 1.5B model, indicating an improved depth of reasoning. The 7B and 14B models opt for more competent starting points and gradually reduce their guesses across rounds, mirroring observations from human participants, who typically deepen their recursive reasoning as the game progresses (Nagel 1995). CoT prompting further accelerates this evolution. The lack of evolution in strategic gameplay in smaller models raises a question: do these models misrepresent others' goals, or do they exhibit a misalignment between their beliefs, desires, and intentions?

Table 1: True mean vs. predicted mean (belief) for rounds 1-5.

Round  True Mean  1.5B    3B    7B    14B
1      106.00     150     100   100   100
2      70.00      100     40    100   100
3      38.75      100     50    70    70
4      26.50      38.75   35    50    50
5      19.25      26.50   35    40    25

This question is examined in the second setting of the four-player game. Table 1 shows the discrepancy between the true mean and the predictions about the mean that different models form. From the table, it is seen that larger models form beliefs that track the true mean across rounds more accurately. The 1.5B model chooses a mean near the midpoint of the range of playable numbers, and does not edit this response across rounds. The 3B model demonstrates the capacity to track the game state, but it does not track the game's progress as consistently as larger models. Although all models state that their goal or desire at the start of each round is to "win the round" or to "maximize the chances of winning," the smaller models assign vague or inconsistent desires to the other players - for example, to "choose a number above the average," to "stay safe," or to "guess something reasonable and conservative" - regardless of the actual behavior of the other agents and the rules of the game. The 7B and 14B models identify that other players, like themselves, aim to "outguess the others" and "maximize the chance of winning", and take the previous moves of other players into account for this assignment of desires.

Figure 2: Intention tracking across rounds in the second setting. Each subplot shows the true response (light blue) and others' predictions (dotted lines) for a target model.

The plots in Figure 2 compare the "perspectives" of each player by plotting the actual gameplay of each model alongside the perceived intentions attributed to it by all participants. For the larger models, the number actually chosen is logically consistent with the intention reported earlier, while smaller models demonstrate inconsistencies in this regard. For instance, the 1.5B model estimates the mean for the upcoming round but then often chooses a number higher than that, demonstrating flawed reasoning about the game.
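For reference, the contest mechanics underlying all of these comparisons are compact enough to state in code. The sketch below is illustrative only: ask_model is a placeholder for a prompted agent (in the experiments, a Qwen 2.5 Instruct model returning a number after "My guess:"), and the tie-breaking and sampling details are our assumptions rather than the authors' implementation.

```python
import random

P = 0.5           # multiplier for the target
MAX_NUMBER = 200  # guesses are restricted to [0, MAX_NUMBER]

def ask_model(player_id, history):
    """Placeholder for querying a language-model agent; here a random legal guess."""
    return random.uniform(0, MAX_NUMBER)

def play_round(players, history):
    # Collect one guess per player, then compute the target as p times the group mean.
    guesses = {pid: ask_model(pid, history) for pid in players}
    target = P * (sum(guesses.values()) / len(guesses))
    # The player closest to the target earns one point this round (first player wins ties).
    winner = min(guesses, key=lambda pid: abs(guesses[pid] - target))
    return guesses, target, winner

def play_game(players, n_rounds=5):
    scores = {pid: 0 for pid in players}
    history = []
    for _ in range(n_rounds):
        guesses, target, winner = play_round(players, history)
        scores[winner] += 1
        history.append({"guesses": guesses, "target": target, "winner": winner})
    return scores, history

if __name__ == "__main__":
    final_scores, _ = play_game(["1.5B", "3B", "7B", "14B"])
    print(final_scores)
```

Replacing ask_model with calls to the four model sizes, prompted as in the Appendix, reproduces the multi-agent settings analyzed here, with points accumulated over five rounds as described above.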
Interestingly, Figure 2 suggests that observing the strate- gies of more performant models through the Belief-Desire- Intention lens leads to an improvement in the responses from smaller participating models in comparison to Figure 1. This, however, may reflect an emulation and execution of game instructions rather than a genuine emergent ability to model the states of others. Through the p-beauty contest, we observe scale- dependent differences in how models perceive and respond to competing agents. Smaller models show little improve- ment across rounds, with flaws in reasoning about the game state, while larger ones play moves more consistent with their idea of the intentions of participating agents. The BDI-style prompting structure adopted highlights in- consistencies and areas for improvement in the convergence of beliefs, desires, and actions of smaller models, and pro- vides deeper insights into this multi-agent contest. 4 Acknowledgement This work was supported by JST Strategic International Col- laborative Research Program (SICORP), Grant Number JP- MJSC2307, and JST CREST Grant Number JPMJCR20D1, Japan. 17 References Chen, Z.; Wu, J.; Zhou, J.; Wen, B.; Bi, G.; Jiang, G.; Cao, Y.; Hu, M.; Lai, Y.; Xiong, Z.; and Huang, M. 2024. ToMBench: Benchmarking Theory of Mind in Large Lan- guage Models. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15959–15983. Bangkok, Thailand: Association for Computational Linguistics. Cooper, D. J.; Fatas, E.; Morales, A. J.; and Qi, S. 2024. Consistent Depth of Reasoning in Level-k Models. Ameri- can Economic Journal: Microeconomics, 16(4): 40–76. De Weerd, H.; Verbrugge, R.; and Verheij, B. 2015. Higher- order theory of mind in the Tacit Communication Game. Bi- ologically Inspired Cognitive Architectures, 11: 10–21. Duijn, M. J. v.; Dijk, B. M. A. v.; Kouwenhoven, T.; Valk, W. d.; Spruit, M. R.; and Putten, P. v. d. 2023. Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests. arXiv:2310.20320. Fu, Y.; Peng, H.; Ou, L.; Sabharwal, A.; and Khot, T. 2023. Specializing Smaller Language Models towards Multi-Step Reasoning. arXiv:2301.12726. He, Y.; Wu, Y.; Jia, Y.; Mihalcea, R.; Chen, Y.; and Deng, N. 2023. HI-TOM: A Benchmark for Evaluating Higher- Order Theory of Mind Reasoning in Large Language Mod- els. arXiv:2310.16755. Hua, W.; Fan, L.; Li, L.; Mei, K.; Ji, J.; Ge, Y.; Hemphill, L.; and Zhang, Y. 2024. War and Peace (WarAgent): Large Lan- guage Model-based Multi-Agent Simulation of World Wars. arXiv:2311.17227. Jang, M.; Yoon, Y.; Choi, J.; Ong, H.; and Kim, J. 2023. A Structured Prompting based on Belief-Desire-Intention Model for Proactive and Explainable Task Planning. In In- ternational Conference on Human-Agent Interaction, 375– 377. Gothenburg Sweden: ACM. ISBN 979-8-4007-0824-4. Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916. Kosinski, M. 2024. Evaluating Large Language Models in Theory of Mind Tasks. Proceedings of the National Academy of Sciences, 121(45): e2405460121. Li, N.; Gao, C.; Li, M.; Li, Y.; and Liao, Q. 2024. EconA- gent: Large Language Model-Empowered Agents for Simu- lating Macroeconomic Activities. arXiv:2310.10436. Liddle, B.; and Nettle, D. 2006. Higher-order theory of mind and social competence in school-age children. 
Journal of Cultural and Evolutionary Psychology, 4(3): 231–244. Lore, N.; Ilami, S.; and Heydari, B. 2024.Large Model Strategic Thinking, Small Model Efficiency: Trans- ferring Theory of Mind in Large Language Models. arXiv:2408.05241. Ma, W.; Mi, Q.; Zeng, Y.; Yan, X.; Wu, Y.; Lin, R.; Zhang, H.; and Wang, J. 2024. Large Language Models Play Star- Craft I: Benchmarks and A Chain of Summarization Ap- proach. arXiv:2312.11865. Nagel, R. 1995. Unraveling in Guessing Games: An Ex- perimental Study. The American Economic Review, 85(5): 1313–1326. Osterhaus, C.; and Koerber, S. 2021. The Development of Advanced Theory of Mind in Middle Childhood: A Longi- tudinal Study From Age 5 to 10 Years. Child Development, 92(5): 1872–1888. Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4): 515–526. Qwen; Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; Lin, H.; Yang, J.; Tu, J.; Zhang, J.; Yang, J.; Yang, J.; Zhou, J.; Lin, J.; ...; and Qiu, Z. 2025. Qwen2.5 Technical Report. arXiv:2412.15115. Rao, A. S.; and Georgeff, M. P. 1995. BDI Agents: From Theory to Practice. In Proceedings of the First International Conference on Multiagent Systems (ICMAS-95), 312–319. San Francisco, CA: AAAI Press. Sap, M.; LeBras, R.; Fried, D.; and Choi, Y. 2023. Neu- ral Theory-of-Mind? On the Limits of Social Intelligence in Large LMs. arXiv:2210.13312. Shapira, N.; Zwirn, G.; and Goldberg, Y. 2023. How Well Do Large Language Models Perform on Faux Pas Tests? In Rogers, A.; Boyd-Graber, J.; and Okazaki, N., eds., Find- ings of the Association for Computational Linguistics: ACL 2023, 10438–10451. Association for Computational Lin- guistics. Srivastava, G.; Cao, S.; and Wang, X. 2025.To- wards Reasoning Ability of Small Language Models. arXiv:2502.11569. Ullman, T. 2023. Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks. arXiv:2302.08399. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Met- zler, D.; Chi, E. H.; Hashimoto, T.; Vinyals, O.; Liang, P.; Dean, J.; and Fedus, W. 2022. Emergent Abilities of Large Language Models. arXiv:2206.07682. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; and Zhou, D. 2023. Chain-of- Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903. Wellman, H. M.; Cross, D.; and Watson, J. 2001. Meta- Analysis of Theory-of-Mind Development: The Truth about False Belief. Child Development, 72(3): 655–684. Xu, Y.; Wang, S.; Li, P.; Luo, F.; Wang, X.; Liu, W.; and Liu, Y. 2024. Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf. arXiv:2309.04658. Zhang, Y.; Mao, S.; Ge, T.; Wang, X.; Xia, Y.; Lan, M.; and Wei, F. 2024. K-Level Reasoning: Establishing Higher Or- der Beliefs in Large Language Models for Strategic Reason- ing. arXiv:2402.01521. 18 Appendix This section offers additional context on Theory of Mind in language models and outlines the prompts used during the experiments. Extended Background Recent advancements in Large Language Models (LLMs) have sparked interest in their performance on tasks that re- quire ToM competency (Kosinski 2024; Sap et al. 2023; Shapira, Zwirn, and Goldberg 2023; Ullman 2023). LLMs have also been deployed in multi-agent systems that require social interaction (Hua et al. 2024) and game playing (Ma et al. 2024; Xu et al. 2024). 
Alternate prompting strategies such as Chain-of-Thought (CoT) prompting (Kojima et al. 2023; Wei et al. 2023) have been proposed to improve the capabilities of LLMs in ToM tasks. However, improving the strategic reasoning and ToM capabilities of Small Language Models (SLMs) remains challenging. CoT prompting ex- hibits minimal improvement or sometimes even a decline in the performance of SLMs in ToM tasks (Srivastava, Cao, and Wang 2025; Wei et al. 2022). Alternative methods of transferring some notion of ToM and reasoning from larger models to smaller ones have shown promising results (Fu et al. 2023; Lore, Ilami, and Heydari 2024). These efforts highlight the need to assess and improve the ToM capacity of language models of different sizes. Progress in this area could help improve the personalization and empathy of eas- ily deployable agents for human users. Standard Prompt ‘You are playerplayer id in the p-Beauty Contest. GAME RULES: - Each player chooses a number between 0 and MAX NUMBER. - The target isPVALUE times the mean of all guesses. - The player closest to the target wins 1 point. - All other players receive 0 points. GAME HISTORY:history text Current Points:pointstext Reflect on the game rules and history, and then respond in the format below: My guess: <number>’ Chain-of-Thought Prompt ‘You are playerplayerid in the p-Beauty Contest. GAME RULES: - Each player chooses a number between 0 and MAX NUMBER. - The target isPVALUE times the mean of all guesses. - The player closest to the target wins 1 point. - All other players receive 0 points. GAME HISTORY:history text Current Points:pointstext Think step-by-step about your strategy: 1. What was the target of the previous round? 2. How do you think other players will adjust their guesses based on the target you obtained in step 1? 3. Based on these observations, what number should you choose to be closest toP VALUE times the mean? Now, respond with your final guess in this format: My guess: <number>’ BDI Prompt ‘You are playerplayer id in the p-Beauty Contest. GAME RULES: - Each player chooses a number between 0 and MAXNUMBER. - The target isP VALUE times the mean of all guesses. - The player closest to the target wins 1 point. - All other players receive 0 points. GAME HISTORY:historytext Current Points:points text You must reason about the game using the Belief-Desire-Intention (BDI) framework. Provide your reasoning in the following format for each player in the game, including yourself: Belief: What number do you think will be the mean of all guesses in the upcoming round? Desire: What do you think the goal or target of each player in the upcoming round will be (in 2-3 sentences)? Answer in the format below: - Player 1: <motivation> - Player 2: <motivation> - Player 3: <motivation> - Player 4: <motivation> (Include yourself.) Intention: Based on the Desires identified, what number do you think each player will intend to play in the upcoming round? Answer in the format below: - Player 1: <intended number> - Player 2: <intended number> - Player 3: <intended number> - Player 4: <intended number> (Include yourself.) 
Finally, based on your belief, desire, and intention assessments, provide your final guess in the format below: My guess: <number>’ 19 Beyond VAGUE: Attention Analysis for Probing How VLMs Ground Ambiguity Heejeong Nam Computer Science, Brown University Abstract Large Vision-Language Models (LVLMs) represent a trans- formative advancement in artificial intelligence, and inter- pretability studies of these models have grown rapidly. How- ever, most multimodal reasoning benchmarks evaluate per- formance primarily through question answering accuracy, offering limited insight into why models fail. Moreover, they often rely on simple and direct cross-modal references, which restricts the evaluation of more indirect and nuanced, human-like understanding. To address this gap, we introduce VAGUE-Ground, a modified benchmark dataset based on VAGUE (Nam et al. 2025), designed to probe visual attention under indirect, human-like expressions. We analyze the in- ternal attention behaviors of LVLMs when processing direct versus indirect textual prompts. Our results show that current LVLMs struggle to first establish contextual grounding in in- direct text before associating it with visual information. We hope that this line of work contributes to advancing LVLMs toward more flexible, human-like reasoning processes. Datasets — https://huggingface.co/datasets/HazelNam/vague mask Introduction Human communication is inherently efficient because it re- lies not only on literal meaning but also on shared world knowledge and the integration of multimodal cues. The goal of Multimodal Theory of Mind (ToM) is to equip machines with the capacity to infer and reason beyond explicit infor- mation by leveraging multimodal context and shared under- standing. Computationally modeling this capability is a crit- ical frontier. Frameworks such as VAGUE (Nam et al. 2025) and VQA-CP (Agrawal et al. 2018) have established an im- portant baseline, revealing that although visual cues con- tribute to problem solving to some extent, models still rely excessively on the representational aspects of text. How- ever, most benchmarks remain within a VQA-style evalu- ation paradigm and do not provide a quantitative assessment of attention bias. Even when attention maps are reported, they are rarely treated as evaluation targets, leaving internal failure mechanisms largely unexplored. To address this, we introduce a dataset of nuanced tex- tual references to target objects, paired with segmentation masks to explicitly evaluate visual grounding. It facilitates patch-level evaluation of grounding behavior under ambi- guity, offering a structured way to diagnose internal failure modes in vision-language models. Leveraging the observa- tion that VLMs demonstrate emergent, training-free ground- ing behavior within particular attention layers (Kang et al. 2025; Park et al. 2025), our dataset systematically exam- ines whether LVLMs truly engage in multimodal reasoning at an internal level. More specifically, we propose a method- ology to identify emergent grounding layers within our set- ting and to quantify the alignment between a model’s visual attention and the underlying intent of the text by measur- ing the similarity between attention maps and our reference masks. Taken together, this framework provides a systematic approach to verify, analyze, and ultimately advance the vi- sual grounding capabilities that underpin more genuine hu- man–AI understanding. 
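To make the proposed alignment measure concrete, the sketch below shows one way to binarize a per-patch attention map with a top-p percentile threshold (interpreted here as keeping activations at or above the p-th quantile) and to score it against a reference segmentation mask with IoU. Function names, the patch-grid resolution, and the threshold interpretation are our assumptions for illustration; the layer and head selection actually used is described in the following sections.

```python
import numpy as np

def attention_to_mask(attn_patches, top_p=0.7):
    """Binarize a per-patch attention map by keeping activations at or above the
    top_p quantile; non-salient patches receive near-zero attention."""
    threshold = np.quantile(attn_patches, top_p)
    return attn_patches >= threshold

def iou(pred_mask, gt_mask):
    """Intersection-over-union between two boolean masks of the same shape."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: a 24x24 patch grid with attention nudged toward the target region.
attn = np.random.rand(24, 24)
gt = np.zeros((24, 24), dtype=bool)
gt[5:10, 5:10] = True        # hypothetical segmentation mask on the patch grid
attn[5:10, 5:10] += 1.0      # attention leaning toward the target
print(f"IoU = {iou(attention_to_mask(attn), gt):.3f}")
```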
VAGUE-Ground The dataset is constructed through a semi-automated, multi- stage annotation process. Bounding-box proposals were first generated using Grounding-DINO (Liu et al. 2024), con- ditioned on the target referent’s object label (e.g., “cup,” “book”). Instances without valid detections were automat- ically discarded. For the remaining samples, human anno- tators verified and refined the bounding boxes, discarding cases with no suitable prediction or scene-level referents in- compatible with object-level grounding. The verified bound- ing boxes were then passed to the Segment Anything Model (SAM) (Kirillov et al. 2023) to obtain pixel-level masks, followed by a final round of human quality control to re- move low-quality or inconsistent annotations. This process yielded 922 instances from VCR and 379 from Ego4D, both derived from the parent dataset VAGUE. Each instance pairs an ambiguous expression with a human-verified segmenta- tion mask of its intended target. Refer to Appendix for more details about the annotation process and overview. Assessing Visual Grounding Under Ambiguity LLava-1.5 (7B) layer selection. Our initial experiments were conducted using LLaVA-1.5 (7B) (Liu et al. 2023). To identify grounding-sensitive layer–head pairs, we compute the average IoU between ground-truth segmentation masks and attention-derived binary masks under the direct prompt 20 Figure 1: VAGUE-Ground provides referential segmentation masks and a probing framework designed to identify the grounded object, enabling the evaluation of how multimodal intention disambiguation occurs within the model’s internal processes. Figure 2: Average image attention across (a) all instances, (b) instances correct under the direct but incorrect under the indirect prompt, and (c) instances correct under both. condition. Direct prompts are used to isolate layers exhibit- ing strong grounding behavior. Binary attention masks were constructed by retaining the top-p percentile (p = 0.7) of attention activations, which effectively suppress non-salient regions. We identified the top three layer–head pairs that most accurately localized target objects under direct prompt- ing: (21, 12), (20, 7), and (18, 1). Empirically, the (21, 12) pair produced more globally distributed attention, whereas (20, 7) yielded the sharpest and most localized grounding patterns. All subsequent results are based on the (20, 7) layer–head combination. Additional details are provided in Appendix . Grounding evaluation setup. We treat the target localiza- tion performance under the direct prompt as the reference baseline. We then assess changes in alignment when the prompt is replaced with its indirect counterpart. For qual- itative analysis, we visualize the binary attention masks to compare the model’s focus regions. Quantitatively, we mea- sure the IoU between the attention-derived mask and the ground-truth segmentation mask as well as the total atten- tion allocated to the image. Visualization. For each attention map, attention values are normalized by their local extrema, and a threshold is applied to retain the top-p percentile of activations. For visualiza- tion, the resulting binary mask is upsampled to the original image resolution using nearest-neighbor interpolation. Qualitative results. As shown in Fig. 1, the direct and indirect prompts, their corresponding attention maps, and the model’s answers are all closely interrelated. 
However, most attention was concentrated on locating words explic- itly mentioned in the input prompt within the image, rather than interpreting the intended meaning behind the indirect prompt. This observation demonstrates that the superficial understanding described in VAGUE (Nam et al. 2025) can indeed be traced within the model’s internal mechanisms. Quantitative results. For quantitative evaluation, we measure (i) the average total attention allocated to the image and (i) the average IoU between the attention-derived and ground-truth masks. The former reflects the overall degree of visual engagement, while the latter captures localization accuracy. As shown in Fig. 2, the total image attention con- sistently increases under the indirect prompt condition. This suggesting that the model makes greater use of visual input to compensate for missing contextual cues, unlike the direct prompt where textual information alone often suffices. How- ever, IoU decreases from 0.058 (direct) to 0.043 (indirect), corresponding to a 25.9% relative degradation in localiza- tion performance due to increased ambiguity compared to the direct baseline. Conclusion In this paper, we introduce VAGUE-Ground, a dataset de- signed to probe visual attention under indirect, human- like expressions. Our findings indicate that current LVLMs struggle to establish contextual grounding in indirect prompts before linking them to visual input. We hope this work provides a principled step toward more robust and human-like multimodal reasoning, while highlighting direc- tions for improving grounding alignment in future models. 21 References Agrawal, A.; Batra, D.; Parikh, D.; and Kembhavi, A. 2018. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Duan, J.; Yu, S.; Tan, N.; Yi, L.; and Tan, C. 2022. Boss: A benchmark for human belief prediction in object-context scenarios. arXiv preprint arXiv:2206.10665. Gandhi, K.; Fraenkel, J.-P.; Gerstenberg, T.; and Goodman, N. 2023.Understanding Social Reasoning in Language Models with Language Models. In Advances in Neural In- formation Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track. Dataset and Benchmarks Track. Hasan, M. K.; Rahman, W.; Bagher Zadeh, A.; Zhong, J.; Tanveer, M. I.; Morency, L.-P.; and Hoque, M. E. 2019. UR-FUNNY: A Multimodal Language Dataset for Under- standing Humor. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter- national Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2046–2056. Hong Kong, China: Asso- ciation for Computational Linguistics. Jin, C.; Wu, Y.; Cao, J.; Xiang, J.; Kuo, Y.-L.; Hu, Z.; Ull- man, T.; Torralba, A.; Tenenbaum, J.; and Shu, T. 2024. MMToM-QA: Multimodal Theory of Mind Question An- swering. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Proceedings of the 62nd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), 16077–16102. Bangkok, Thailand: Association for Compu- tational Linguistics. Kang, S.; Kim, J.; Kim, J.; and Hwang, S. J. 2025. Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding. In Proceedings of the Com- puter Vision and Pattern Recognition Conference (CVPR), 9339–9350. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A. 
C.; Lo, W.- Y.; Doll ́ ar, P.; and Girshick, R. 2023. Segment Anything. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 3992–4003. Kumar, S.; Kulkarni, A.; Akhtar, M. S.; and Chakraborty, T. 2022. When did you become so smart, oh wise one?! Sar- casm Explanation in Multi-modal Multi-party Dialogues. In Muresan, S.; Nakov, P.; and Villavicencio, A., eds., Proceed- ings of the 60th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), 5956–5968. Dublin, Ireland: Association for Computational Linguistics. Kumar, S.; Mondai, I.; Akhtar, M. S.; and Chakraborty, T. 2023. Explaining (sarcastic) utterances to enhance af- fect understanding in multimodal dialogues. In Proceed- ings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Ap- plications of Artificial Intelligence and Thirteenth Sympo- sium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23. AAAI Press. ISBN 978-1- 57735-880-0. Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023. Visual Instruc- tion Tuning. In NeurIPS. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; Zhu, J.; and Zhang, L. 2024. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops – Poster Session. Nam, H.; Ahn, J.; Ka, K.; Chung, J.; and Yu, Y. 2025. VAGUE: Visual Contexts Clarify Ambiguous Expressions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1537–1547. Park, J.; Ko, J.; Byun, D.; Suh, J.; and Rhee, W. 2025. Cross- Attention Head Position Patterns Can Align with Human Vi- sual Concepts in Text-to-Image Generative Models. In The Thirteenth International Conference on Learning Represen- tations. Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4): 515–526. Sclar, M.; Kumar, S.; West, P.; Suhr, A.; Choi, Y.; and Tsvetkov, Y. 2023. Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 13960– 13980. Toronto, Canada: Association for Computational Linguistics. Appendix A. Related Work Benchmarking Multimodal Theory of Mind Theory of Mind (ToM) denotes the capacity to infer and reason about others’ intentions based on observed infor- mation (Premack and Woodruff 1978). Although numer- ous unimodal approaches and benchmarks have been in- troduced (Sclar et al. 2023; Gandhi et al. 2023), they of- ten fall short in reflecting the complexity of real-world in- teractions, which inherently depend on both linguistic and perceptual cues. To extend ToM reasoning beyond purely textual contexts, recent studies have incorporated visual in- formation. The MMToM benchmark (Jin et al. 2024) chal- lenges models to jointly interpret visual and textual sig- nals in ToM-related question-answering tasks. Similarly, the BOSS dataset (Duan et al. 2022) focuses on scenarios re- quiring nonverbal communication, assessing whether mod- els can infer human beliefs from social and visual cues. A related line of research examines multimodal sarcasm and humor understanding. WITS (Kumar et al. 2022) and MOSES (Kumar et al. 2023) provide datasets for sarcasm interpretation that include emotional and tonal cues, while UR-FUNNY (Hasan et al. 
2019) targets multimodal humor comprehension by integrating facial expressions and vocal prosody. However, up to our best knowledge, none of these studies investigate the internal processes of the models. 22 B. Dataset Details Dataset Overview Figure 3: Bounding-boxes are from Grounding-DINO and segmentation masks are from SAM. First, we examined the proportion of the target object within each referential mask, which serves as the ground- truth annotation across the 1,301 instances. On average, the target object occupied 6.45% of the entire image area, with a median of 3.08% and a standard deviation of 0.08. Fig. 4 shows the distribution of these ratios. Figure 4: Histogram showing ratio of the target object. Also, when evaluated using the LLava-1.5 (7B) (Liu et al. 2023) model under the Multiple Choice Question (MCQ) setting of VAGUE, the model achieved an accuracy of 98.08% when provided with both the image and the di- rect prompt. However, the accuracy dropped significantly to 32.90% when the image was paired with the indirect prompt. According to the diagnostic counterfactual choice design in- troduced in VAGUE, this degradation primarily stems from the model’s tendency to select answers that overly rely on textual cues, rather than properly integrating visual evi- dence. This observation further highlights the necessity of our proposed VAGUE-Ground benchmark, which offers a more precise framework for examining multimodal ground- ing under ambiguous linguistic contexts. Human Annotation Examples Fig. 5 illustrates the annotation interface used by annotators to select bounding boxes. Grounding-DINO can produce multiple bounding-box candidates, where only one—or sometimes several—may correspond to the correct target. In some cases, multiple objects of the same category exist, and the intended referent may vary depending on the listener’s interpretation of the utterance. Annotators were instructed to carefully consider these factors and retain only the bounding box(es) that correctly correspond to the target described by the direct prompt. Figure 5: Annotator interface of bounding-box verification Fig. 6 similarly presents examples of the verification pro- cess for segmentation masks. Annotators were required to assess whether each instance retained a mask that accurately corresponded to the intended target word. Figure 6: Annotator interface of segmentation mask verifi- cation C. Evaluation Framework Details Details in Layer Selection We first narrowed down the set of layers to be examined in detail. Using the direct prompt and the corresponding im- age with the task prompt, we computed, for each layer, the ratio of image-to-total attention—defined as the sum of all attention values directed to image patches divided by the to- tal attention across all tokens. After excluding the first three layers which are known to focus on the image globally, we selected the top 50% of layers with the highest ratios for further analysis. (See Fig. 7) 23 Figure 7: Image attention fraction per layer. For every layer–head pair within the selected layers, we then computed the average IoU between the ground-truth segmentation mask and the corresponding attention-derived binary mask, again using the direct prompt, task prompt, and image input. The binary attention mask was obtained by re- taining only the top-p percentile (p = 0.7) of attention acti- vations. Empirically, we found that higher thresholds more effectively delineate localized regions, as non-salient pixels tend to receive near-zero attention. 
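The screening procedure just described reduces to two small routines: rank layers by their image-attention fraction (after dropping the first few globally attending layers), then score every layer-head pair in the surviving layers by its average IoU against the reference masks under the direct prompt. The sketch below is a schematic reconstruction under our reading of the text; array layouts and names are assumptions, and attention_iou can be composed from the attention_to_mask and iou helpers sketched earlier.

```python
def image_attention_fraction(attn, image_token_idx):
    """Share of a layer's attention mass directed at image-patch tokens.
    `attn` is an array of shape (heads, query_len, key_len)."""
    return attn[..., image_token_idx].sum() / attn.sum()

def screen_layers(per_layer_attn, image_token_idx, skip_first=3, keep_frac=0.5):
    """Drop the first few globally attending layers, then keep the top `keep_frac`
    of the remaining layers by image-attention fraction."""
    fracs = {layer: image_attention_fraction(a, image_token_idx)
             for layer, a in per_layer_attn.items() if layer >= skip_first}
    ranked = sorted(fracs, key=fracs.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]

def rank_layer_heads(instances, attention_iou, image_token_idx):
    """Average IoU per (layer, head) across instances under the direct prompt.
    Each instance is (per_layer_attn dict, patch-grid ground-truth mask);
    attention_iou could be: lambda a, m: iou(attention_to_mask(a), m)."""
    sums, counts = {}, {}
    for per_layer_attn, gt_mask in instances:
        for layer in screen_layers(per_layer_attn, image_token_idx):
            for head in range(per_layer_attn[layer].shape[0]):
                key = (layer, head)
                sums[key] = sums.get(key, 0.0) + attention_iou(per_layer_attn[layer][head], gt_mask)
                counts[key] = counts.get(key, 0) + 1
    scores = {k: sums[k] / counts[k] for k in sums}
    return sorted(scores, key=scores.get, reverse=True)
```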
Based on this analysis, we identified the top three layers that most accurately localized the target object under the direct prompt condition.

More results

Figure 8: Histogram of IoU scores under the direct and indirect prompt conditions.

Fig. 8 shows the histogram of IoU scores obtained under the direct and indirect prompt conditions. For visibility, the y-axis is truncated around 120, as both distributions exhibit higher frequencies near 0.0.

Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs
Ivan Chulo, Ananya Joshi
Johns Hopkins University, Baltimore, MD 21218 USA

Abstract
Recent work shows that activation steering substantially improves language models' Theory of Mind (ToM) (Bortoletto et al. 2024), yet it remains unclear what changes internally to produce the different outputs. We propose decomposing ToM in LLMs by comparing steered versus baseline LLM activations using linear probes trained on 45 cognitive actions. Applying Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluating it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023), we find that improved performance on belief attribution tasks (32.5% to 46.7% accuracy) is mediated by activations processing emotional content: emotion_perception (+2.23) and emotion_valuing (+2.20), while analytical processes are suppressed: questioning (-0.78) and convergent_thinking (-1.59). This suggests that successful ToM abilities in LLMs are mediated by emotional understanding, not analytical reasoning.

Introduction
The capacity to attribute mental states to oneself and others represents a critical capability for AI systems engaged in social reasoning, alignment, and human collaboration. Recent benchmarking work by Bortoletto et al. (2024) has demonstrated that activation steering techniques, particularly Contrastive Activation Addition (CAA), can substantially improve language models' ToM performance on belief attribution tasks, often achieving accuracy improvements exceeding 10%. This raises a question: what cognitive processes change when models successfully engage in perspective-taking?
Traditional evaluations treat ToM as a monolithic capability measured through binary accuracy on belief attribution tasks (Gandhi et al. 2023). However, cognitive science views ToM as comprising multiple interacting processes (Gabriel et al. 2019). If steering vectors modulate these components differently, comparing steered versus baseline activation patterns could reveal which processes are essential for successful perspective-taking, thus decomposing ToM processes in LLMs into their constituent building blocks.
We introduce a decomposition approach using techniques from mechanistic interpretability in LLMs, combining linear classifier probes trained on 45 cognitive actions with CAA steering vectors. By measuring which cognitive processes systematically increase or decrease when ToM performance improves, we can identify which cognitive actions are correlated with language models' successful perspective-taking.
Contributions: (1) a mechanistic decomposition method using cognitive action probes to analyze steering effects; (2) evaluation on 1,000 BigToM forward belief scenarios showing a 14.2% accuracy improvement (32.5% to 46.7%); (3) analysis demonstrating a systematic increase in emotional/creative processes during perspective-taking and a decrease in analytical processes; and (4) findings suggesting that emotional understanding serves as the fundamental building block of perspective-taking in language models.
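As a concrete point of reference for the decomposition that follows, the contrastive steering operation at the heart of this pipeline can be stated in a few lines. The sketch below is a generic mean-difference form of CAA, assuming a Hugging Face-style causal LM; the activation-extraction helper, the hook wiring, and the omission of the paper's PCA centering are simplifications of ours, not the authors' released code.

```python
import torch

def final_token_activation(model, tokenizer, text, layer):
    """Hidden state of the last token at a given layer (HF models expose this
    via output_hidden_states; hidden_states[0] is the embedding layer)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].clone()

def build_caa_vector(model, tokenizer, triplets, layer):
    """Mean difference of activations between positive (correct ToM) and negative
    (incorrect ToM) completions; the paper additionally PCA-centers these
    differences, which is omitted in this sketch."""
    diffs = [
        final_token_activation(model, tokenizer, prompt + chosen, layer)
        - final_token_activation(model, tokenizer, prompt + rejected, layer)
        for prompt, chosen, rejected in triplets
    ]
    return torch.stack(diffs).mean(dim=0)

def steer(layer_module, vector, alpha=1.0):
    """Register a forward hook that adds alpha * vector to the output of the
    decoder block at the steering layer; remove the handle after generation."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)
```

With Gemma-3-4B loaded via transformers, layer_module would typically be the decoder block at the chosen steering layer (something like model.model.layers[20], depending on the architecture wrapper).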
Methodology We investigate how steering vectors modulate cognitive pro- cesses during belief attribution tasks. We developed a multi-stage pipeline integrating probe training, activation steering, and comparative analysis. We began by defining 45 cognitive actions across four cate- gories: Metacognitive, Analytical, Creative, and Emotional (see Appendix). For each action, we generated 700 synthetic training samples (31,500 total) of first-person narratives (2- 4 sentences) across 20 everyday contexts, using Gemma-3- 4B 1 . Activations were extracted from layers 0-30 of Gemma- 3-4B using nnsight, loosely based on the methodology from (Chen et al. 2024). To ensure consistent probe training, in- puts were augmented (at extraction and during inference) with the suffix “The cognitive action being demonstrated here is” for consistent final-token extraction. We trained 45 binary linear probes using one-vs-rest classification with AdamW optimization, cosine annealing, and early stopping based on AUC-ROC. For activation steering, we trained CAA vectors (Rim- sky et al. 2023; Zou et al. 2023) on 752 contrastive triplets from BigToM’s (Gandhi et al. 2023) forward belief scenar- ios (50/50 false-true split). Each triplet contains story con- text, belief question, and positive (correct ToM) versus nega- tive (incorrect ToM) completions, capturing representational differences between accurate and inaccurate perspective- taking. Vectors were trained across layers 14-30 using PCA- centered activation differences. 1 Code available at: https://github.com/ChuloIva/Cognimap 25 Figure 1: Radar chart comparing baseline versus steered cognitive action activation patterns across categories. The steered condition (red) compared to baseline (blue). These findings support LLMs mirror known cognitive phenomena that emotional understanding is more important than analytical procecsses in perspective-taking Evaluation We evaluated 1,000 forward belief scenarios from BigToM (Gandhi et al. 2023) (forward belieffalse), comparing baseline versus steered conditions. Answers were evaluated by computing p(correct) vs p(incorrect) from model logits. Cognitive action activations were captured at three timepoints: (1) at question, (2) after true answer, (3) af- ter wrong answer. For each action, we computed layer count (layers 10-20 where probe confidence indicated presence) and analyzed baseline-steered differences to identify which cognitive processes characterize successful ToM improve- ments. Results We first validated our probe methodology, finding that bi- nary probes achieved 0.78 average AUC-ROC and 0.68 F1 across 45 actions. Mid-layer performance (layers 5-24) out- performed early/late layers, informing our layer 10-20 anal- ysis window. We then evaluated CAA steered LLMs on 1,000 forward belief scenarios (false condition), observing accuracy improvements from 32.5% to 46.7% (a 14.2% gain), shifting 217 examples from incorrect to correct pre- dictions. To understand what drives this improvement, we ana- lyzed cognitive action activation patterns in baseline ver- sus steered conditions. The most striking finding was robust increase of emotional and generative processes. Emotional processes showed strong increases: emotionperception (mean ∆=+1.73), emotionvaluing (∆=+0.85), and emo- tion understanding (∆=+0.77). Additionally, Hypothe- sisgeneration (∆=+1.63) remained strongly elevated across all timepoints, indicating active belief formation and expla- nation generation. 
In contrast, analytical processes decreased: questioning (∆=-1.24), convergentthinking (∆=-1.13), and understand- ing (∆=-0.77), suggesting successful persepective taking in LLMs suppresses deliberate analytical interrogation. Cate- gory aggregation reinforces this pattern—creative processes (+0.35, +0.28, +0.24) and emotional processes (+0.35, +0.20, +0.22) consistently increase across timepoints, while analytical processes show decreases (+0.06, -0.19, -0.19). Discussion This pattern of increased emotional and generative processes and decreased analytical processes challenges assumptions that LLM social reasoning relies on deliberate chain-of- thought mechanisms. Rather, successful perspective-taking appears to operate through activating representations re- sponsible for processing emotional contexts. We make no claims that these findings generalize to hu- man cognition, nor does our methodology validate such comparisons. However, neuroscience research shows that affective and cognitive ToM share neural mechanisms in humans: Corradi-Dell’Acqua, Hofstetter, and Vuilleumier (2014) demonstrate that “patterns in TPJ (Temporopari- etal Junction) and MTG (Middle Temporal Gyrus) reflect the same neuronal activity, equally recruited in these two independent conditions.” During language modeling, net- works may learn shared representations linking perspective- taking with emotional context processing, mirroring the compressed structure of human social cognition embedded in linguistic data. Whether this constitutes genuine emu- lation of cognitive architecture or emergent convergence on functionally equivalent representations remains an open question. Future work should validate these cognitive decomposi- tion findings with multiple, bigger models and with addi- tional data sources. Conclusion Our work extends Bortoletto et al. (2024) by applying cog- nitive action probes to activation steered LLMs, we move beyond evaluating whether steering improves belief attribu- tion to understanding why. Our analysis of 1,000 forward belief scenarios reveals systematic modulation of cognitive processes: steering amplifies generative hypothesis forma- tion and emotional inference while suppressing analytical interrogation. This decomposition approach, combining in- terpretability tools with targeted interventions, offers a prin- cipled methodology for understanding complex AI capabil- ities. The findings challenge assumptions about deliberative social reasoning in LLMs and open new directions for mech- anistic analysis of perspective-taking and other high-level cognitive abilities. 26 References Anderson, L. W.; Krathwohl, D. R.; Airasian, P. W.; Cruik- shank, K. A.; Mayer, R. E.; Pintrich, P. R.; Raths, J.; and Wittrock, M. C. 2001. A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educa- tional objectives. New York: Longman, complete edition. Bortoletto, M.; Ruhdorfer, C.; Shi, L.; and Bulling, A. 2024. Benchmarking Mental State Representations in Lan- guage Models. ICML 2024 Workshop on Mechanistic Inter- pretability, arXiv:2406.17513. Chen, Y.; Wu, A.; DePodesta, T.; Yeh, C.; Li, K.; Castillo Marin, N.; Patel, O.; Riecke, J.; Raval, S.; Seow, O.; Wattenberg, M.; and Vi ́ egas, F. 2024. Designing a Dash- board for Transparency and Control of Conversational AI. arXiv:2406.07882. Corradi-Dell’Acqua, C.; Hofstetter, C.; and Vuilleumier, P. 2014. 
Cognitive and affective theory of mind share the same local patterns of activity in posterior temporal but not me- dial prefrontal cortex. Social Cognitive and Affective Neu- roscience, 9(8): 1175–1184. Flavell, J. H. 1979. Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10): 906–911. Gabriel, E. T.; Oberger, R.; Schmoeger, M.; Deckert, M.; Vockh, S.; Auff, E.; and Willinger, U. 2019. Cognitive and affective Theory of Mind in adolescence: developmental as- pects and associated neuropsychological variables. Psycho- logical Research, 85(2): 533–553. Gandhi, K.; Franken, J.-P.; Gerstenberg, T.; and Goodman, N. D. 2023. Understanding Social Reasoning in Language Models with Language Models. arXiv:2306.15448. Gross, J. J. 1998. The emerging field of emotion regulation: An integrative review. Review of General Psychology, 2(3): 271–299. Guilford, J. P. 1967. The nature of human intelligence. New York: McGraw-Hill. Krathwohl, D. R.; Bloom, B. S.; and Masia, B. B. 1964. Tax- onomy of educational objectives: The classification of edu- cational goals. Handbook I: Affective domain. New York: David McKay Company. Mayer, J. D.; and Salovey, P. 1997. What is emotional in- telligence? In Salovey, P.; and Sluyter, D. J., eds., Emo- tional development and emotional intelligence: Implications for educators, 3–31. New York: Basic Books. Rimsky, N.; Gabrieli, N.; Schulz, J.; Tong, M.; Hubinger, E.; and Turner, A. M. 2023. Steering Llama 2 via Contrastive Activation Addition. arXiv:2312.06681. Zou, A.; Phan, L.; Chen, S.; Campbell, J.; Guo, P.; Ren, R.; Pan, A.; Yin, X.; Mazeika, M.; Dombrowski, A.-K.; Goel, S.; Li, N.; Byun, M. J.; Wang, Z.; Mallen, A.; Basart, S.; Koyejo, S.; Song, D.; Fredrikson, M.; Kolter, J. Z.; and Hendrycks, D. 2023. Representation Engineering: A Top- Down Approach to AI Transparency. arXiv:2310.01405. Appendix Complete Cognitive Action Taxonomy The 45 cognitive actions used for probe training were de- rived from established cognitive psychology and emotion re- search frameworks, then organized into five functional cate- gories. Actions were systematically sampled from Bloom’s Taxonomy (Anderson et al. 2001), Guilford’s Structure of Intellect (Guilford 1967), Metacognitive Processes Frame- work (Flavell 1979), Krathwohl’s Affective Domain (Krath- wohl, Bloom, and Masia 1964), Gross’s Process Model of Emotion Regulation (Gross 1998), and the Mayer-Salovey Emotional Intelligence Model (Mayer and Salovey 1997). 
Metacognitive (7 actions): • reconsidering: reconsidering a belief or decision • updating beliefs: updating mental models or beliefs • suspending judgment: suspending judgment and staying with uncertainty • metaawareness: reflecting on one’s own thinking pro- cess • metacognitivemonitoring: tracking one’s own compre- hension • metacognitiveregulation: adjusting thinking strategies • selfquestioning: interrogating one’s own understanding Analytical (16 actions): • noticing: noticing a pattern, feeling, or dynamic • patternrecognition: recognizing recurring patterns across situations • zoomingout: zooming out for broader context • zoomingin: zooming in on specific details • questioning: questioning an assumption or belief • abstracting: abstracting from specifics to general patterns • concretizing: making abstract concepts concrete and spe- cific • connecting: connecting disparate ideas or experiences • distinguishing: distinguishing between previously con- flated concepts • perspective taking: taking another’s perspective or tem- poral view • convergent thinking: finding the single best solution • understanding: interpreting and explaining meaning • applying: using knowledge in new situations • analyzing: breaking down into components • evaluating: making judgments about value or effective- ness • cognition awareness: becoming aware and comprehend- ing Creative (6 actions): • creating: generating new ideas or solutions • divergentthinking: generating multiple creative solu- tions • hypothesisgeneration: generating possible explanations 27 • counterfactualreasoning: engaging in ’what if’ thinking • analogicalthinking: drawing analogies between do- mains • reframing: reframing a situation or perspective Emotional (15 actions): • emotionalreappraisal: reinterpreting emotional mean- ing • emotion receiving: becoming aware of emotions • emotionresponding: actively engaging with emotions • emotionvaluing: attaching worth to emotional experi- ences • emotionorganizing: integrating conflicting emotions • emotioncharacterizing: aligning emotions with core val- ues • situation selection: choosing emotional contexts deliber- ately • situationmodification: changing circumstances to regu- late emotion • attentional deployment: directing attention for emotional regulation • responsemodulation: modifying emotional expression • emotionperception: identifying emotions in self/others • emotionfacilitation: using emotions to enhance thinking • emotionunderstanding: comprehending emotional com- plexity • emotionmanagement: regulating emotions in self/others • accepting: accepting and letting go of control Memory (1 action): • remembering: recalling relevant information or experi- ences Synthetic Training Data Generation To train cognitive action probes, we generated 31,500 syn- thetic examples (700 examples per cognitive action) using Gemma-3-4B. Each example consists of a first-person narra- tive demonstrating a specific cognitive action in an everyday context. Domain Coverage: Examples were distributed across 20 everyday domains to ensure variety and prevent overfitting to specific contexts: work, school, daily life, cooking, shop- ping, exercise, reading, writing, planning, learning, organiz- ing, problem-solving, hobbies, personal goals, time man- agement, finances, health, relationships, home projects, and travel. Generation Prompt Template: For each example, the language model received the following structured prompt: Generate a simple, first-person example of someone [cognitive action]. 
Action: [action name] Description: [action description] Domain: [randomly selected domain] Requirements: - Write in first person (I, my, me) - Keep it simple and realistic - 2-4 sentences maximum - Focus on the [action] cognitive action - Use everyday language Example only (no explanation): Design Rationale: First-person narratives capture the internal phenomenology of cognitive processes, providing more direct signal for probe training than third-person de- scriptions. The 2-4 sentence constraint ensures examples re- main focused on a single cognitive action while maintain- ing sufficient context. Domain randomization prevents the model from learning spurious correlations between cogni- tive actions and specific contexts (e.g., associating analyzing exclusively with academic scenarios). This approach yielded a balanced dataset of 31,500 train- ing examples with realistic language patterns and diverse contextual grounding, suitable for training binary classifiers to detect cognitive action presence in model activations. CAA Training Data Example The CAA steering vectors were trained on contrastive pairs distinguishing correct from incorrect belief attribution. Each training triplet consists of a story with observational infor- mation, a question about the protagonist’s belief, and two completions representing accurate vs. inaccurate Theory of Mind reasoning. False Belief Example: Story: “Noor is working as a barista at a busy coffee shop. Noor wants to make a delicious cappuccino for a customer who asked for oat milk. Noor grabs a milk pitcher and fills it with oat milk. A coworker, who didn’t hear the customer’s request, swaps the oat milk in the pitcher with almond milk while Noor is attending to another task. Noor does not see her coworker swapping the milk.” Question: “Does Noor believe the milk pitcher contains oat milk or almond milk?” Positive (correct ToM): “Noor believes the milk pitcher contains oat milk.” Negative (incorrect ToM): “Noor believes the milk pitcher contains almond milk.” Evaluation Methodology: Answer Ranking by Probability Following the BigToM evaluation protocol, we used answer ranking by probability rather than free-form text generation. Each question was formatted as a multiple-choice task with randomized answer positions: Prompt Format: Story: [story text] Question: [question text] Choose one of the following: a) [answer 1] b) [answer 2] 28 Please answer with the letter of your choice (a or b). Answer: Evaluation Process: For each question, we: 1. Randomized the positions of true and wrong answers be- tween options a) and b) 2. Calculatedp(letter = ‘a’) andp(letter = ‘b’) from model logits at the final token position 3. Selected the answer corresponding to the letter with higher probability 4. Determined correctness by comparing the selected an- swer to the ground truth This probability-based ranking approach eliminates con- founds from text generation artifacts and provides a more reliable measure of the model’s belief attribution capabili- ties. Answer position randomization ensures that the model cannot exploit systematic biases in option ordering. 29 Figure 2: Steering effects across all cognitive actions and timepoints (n=1000). Left panels show individual action changes at three timepoints: at question (before answer), after true answer, and after wrong answer. Bars indicate mean layer count differ- ence (steered - baseline) with positive values (right) showing increases and negative values (left) showing decreases. 
Emotional actions (emotionperception, emotionvaluing, noticing) consistently increase across timepoints, while analytical actions (ques- tioning, convergentthinking, understanding) consistently decrease, revealing the emotional foundation of successful ToM. Figure 3: Category-level analysis of steering effects. Negative points represent more activation at baseline while positive repre- sent more activation on steered. Left: mean steering effect by cognitive action category. Right: distribution of effects at answer timepoint. 30 Figure 4: Top 10 cognitive actions with largest increases and decreases between baseline and steered conditions at question level (left) and answer level (right). Emotional ac- tions (emotionperception, emotionvaluing, noticing) show the strongest increase, while analytical actions (questioning, convergent thinking, understanding) show the strongest de- crease, revealing the cognitive processes most affected by successful ToM steering. 31 Figure 5: Heatmap of cognitive action activation differences (steered - baseline) across timepoints. Each row represents a cognitive action, and columns show the three measurement timepoints: at question, after true answer, and after wrong answer. Green indicates increase and red indicates decreases. 32 Figure 6: Cognitive action probe performance across all 30 layers of Gemma-3-4B. The visualization shows average AUC-ROC scores for each layer, with Layer 9 achieving peak performance (0.948 AUC-ROC). Strong performance is maintained across mid-layers (5-24), while early and late layers show degraded performance. This pattern suggests that early layers focus on surface-level features, mid-layers capture high-level cognitive abstractions, and late layers optimize for next-token prediction, potentially overwriting intermediate representations. 33 Figure 7: Comparison of top 10 and bottom 10 performing cognitive actions ranked by average AUC-ROC across all layers. Best performers like suspendingjudgment (0.988) and counterfactualreasoning (0.984) show consistently high performance and distinct activation patterns across most layers. Worst performers like emotionresponding (0.778) and understanding (0.837) exhibit more variability and lower overall discrimination ability, suggesting these concepts may be more distributed or context- dependent in the model’s representation space. 34 Faithful Theory of Mind Distillation: Why Preference Based Refinement Improves Imitation Arpit Singh Gautam 1 , Saurabh Jha 1 , 1 Dell Technologies, CSG CTO Team arpitsinghgautam777@gmail.com, saurabh.jha21@gmail.com Abstract The ability to infer and reason about others’ mental states, Theory of Mind (ToM), is central to advanced cognition and socially intelligent AI. Although large language mod- els (LLMs) exhibit emergent ToM, distilling such reason- ing into smaller models is challenging. Supervised Fine- Tuning (SFT) trains students to imitate teacher outputs but may fail to capture underlying reasoning or even degrade performance. We investigate a sequential approach, distill- ing ToM reasoning from an Llama 3.1 8B ”teacher” into a Llama 3.2 1B ”student” using a 2,114-example Chain-of- Thought (CoT) dataset. The student is first fine-tuned via SFT, which learns reasoning format but degrades accuracy, then further aligned using preference-based Direct Prefer- ence Optimization (DPO). We compare models trained with SFT alone versus those refined sequentially with SFT fol- lowed by DPO. 
The SFT+DPO model achieves relative im- provements of 14% in teacher-aligned accuracy and 66% in counterfactual faithfulness, demonstrating the effective- ness of preference-based refinement in improving reasoning fidelity. Introduction Reasoning about beliefs, intentions, and knowledge of oth- ers, known as Theory of Mind (ToM), is central to human cognition and collaboration. As LLMs are increasingly used in reasoning-critical domains, determining whether they can exhibit faithful ToM reasoning is a key challenge (Gandhi et al. 2023; Chen et al. 2025; Sap et al. 2019). While large models display emergent ToM, smaller deployable ones of- ten lose this capability. A common approach for reasoning compression is Super- vised Fine-Tuning (SFT), where the student imitates the teacher’s Chain-of-Thought (CoT) (Wei et al. 2022; Hinton, Vinyals, and Dean 2015; Li 2024). However, imitation of- ten captures linguistic style rather than causal reasoning and, as our results show, can even degrade the base instruction- tuned model’s performance (Rafailov et al. 2023). Preference-based learning provides a stronger alternative. Instead of mimicking outputs, it trains models to prefer high-quality reasoning. Direct Preference Optimization (DPO) (Rafailov et al. 2023; Li 2024) uses a contrastive ob- jective between chosen and rejected responses. We hypoth- esize that a sequential setup—SFT to learn reasoning struc- ture, followed by DPO to refine quality—produces more ac- curate and faithful models. We evaluate sequential SFT-to-DPO distillation on a novel ToM CoT dataset, comparing a student trained only with SFT against one refined with DPO. Our findings show DPO refinement improves both accuracy and sensitivity to belief changes (Madsen et al. 2024; Chen et al. 2025). Methodology Our experimental pipeline involves dataset creation and a sequential, two-stage training regime. Dataset Construction We curated a dataset of 2,114 ToM-oriented reasoning sce- narios from BigToM (Gandhi et al. 2023) and Social-IQA (Sap et al. 2019). For each, a Llama 3.1 8B “teacher” gener- ated: • A coherent, logically grounded Chain-of-Thought and fi- nal answer (chosenCoT ). • A plausible but intentionally flawed Chain-of-Thought (rejected CoT ). This yielded two datasets used in sequence: • D SFT : pairs of (prompt, chosen CoT) for the ini- tial imitation stage. • D DPO :tripletsof (prompt, chosen CoT, rejectedCoT) for the preference-based refine- ment stage. Both constructed datasets, D SFT and D DPO , are pub- liclyavailableathttps://huggingface.co/datasets/ ArpitSinghGautam/faithful-tom-distillation. Sequential Training Regime Bothexperimentalbranchesoriginatefrom Llama-3.2-1B-Instruct. All training used iden- tical batch sizes, learning rates, and optimizers. Stage 1: SFT (Imitation) In this stage, the base model is trained via SFT onD SFT using the standard language mod- eling loss: L SFT (θ) =−E (x,y)∼D SFT [logP θ (y|x)](1) This stage produces our SFT-only baseline model. It teaches the model the *format* of CoT reasoning, but can reduce accuracy by disrupting its instruction-following ability. 
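For concreteness, the Stage 1 objective in Eq. 1 is ordinary causal language-modeling cross-entropy over the teacher's chosen CoT, with prompt tokens excluded from the loss. The following is a minimal sketch of that computation with the Hugging Face transformers API; it is illustrative only, not the authors' released training code, and the prompt-masking step is a simplification.

```python
# Minimal sketch of the Stage 1 imitation loss (Eq. 1): causal-LM cross-entropy
# over the chosen CoT, with prompt positions masked out of the loss.
# Illustrative only; not the authors' released training code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tok = AutoTokenizer.from_pretrained(model_name)

def sft_loss(prompt: str, chosen_cot: str) -> torch.Tensor:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + chosen_cot, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt tokens in the loss
    out = model(input_ids=full_ids, labels=labels)
    return out.loss                            # approximates -E[log P_theta(y|x)]
```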
35 Stage 2: DPO (Refinement) The SFT-tuned model from Stage 1 is further trained on D DPO using the Direct Prefer- ence Optimization loss: L DPO (π θ ) =−E (x,y w ,y l )∼D DPO [logσ (β ∆ θ,ref (x,y w ,y l ))], (2) ∆ θ,ref = log π θ (y w |x)π ref (y l |x) π θ (y l |x)π ref (y w |x) .(3) Here, y w and y l denote the preferred and rejected re- sponses, π ref is the frozen SFT reference policy, and β controls preference sharpness. This yields the SFT+DPO model, refining the reasoning quality learned during SFT. Evaluation Framework We evaluate on two complementary axes: 1. Teacher Accuracy Measures if the student produces the same logical answer as the 8B teacher on held-out examples. 2. Counterfactual Faithfulness Evaluates causal sensitiv- ity. Using the ToM dataset’s “Aware” vs “Not Aware” pairs, we test if a model’s answer changes appropriately when an agent’s knowledge state changes. All evaluations were performed using Mixtral-8x7B as an LLM-as-a-Judge. Results and Analysis We evaluate the models on a 2,114-example ToM CoT dataset using two complementary metrics: (1) teacher- aligned accuracy, measuring agreement with the Llama 3.1 8B teacher’s outputs, and (2) counterfactual faithfulness, as- sessing the model’s ability to adjust conclusions under be- lief changes. We compare two configurations: a 1B student trained with SFT and the same student further refined using Direct Preference Optimization (SFT+DPO). Each model generates one response per input for consistent comparison. Table 1: Evaluation results for SFT and SFT+DPO models on accuracy and faithfulness metrics. ModelAccuracy (%)Faithfulness (%) 1B (SFT-only)35.007.50 1B (SFT + DPO)40.0012.50 The SFT-only model learns the Chain-of-Thought struc- ture but shows weak causal robustness, achieving 35% ac- curacy and 7.5% faithfulness. Adding DPO refinement, we observe relative improvements of 14% in accuracy (35 to 40) and 66% (7.5 to 12.50) in faithfulness. This confirms that preference-based refinement restores and enhances rea- soning fidelity beyond what imitation alone provides. Figure 1 shows gradient norms during training. DPO ex- hibits lower, more stable gradients than SFT, indicating smoother convergence and less optimization noise. Similarly, Figure 2 shows training loss across the epoch, where DPO maintains a lower and more stable margin, in- dicating more effective preference alignment. These trends align with the quantitative gains in Table 1. Figure 1: Gradient norm comparison between SFT and the DPO refinement stage. Figure 2: Training loss comparison for SFT and DPO stages. Interpretation and Discussion. The SFT-only model overfits to surface linguistic patterns, mimicking the teacher’s style without internalizing causal dependencies. DPO adds a contrastive signal that helps the model distin- guish valid from flawed reasoning, improving causal dis- criminativeness. The low absolute faithfulness likely stems from two factors: the 1B student’s limited capacity for The- ory of Mind reasoning and inconsistent causal supervision from the 8B teacher, which together constrain attainable faithfulness even after refinement. Conclusion and Future Work Sequential SFT→DPO refinement improves reasoning accu- racy and faithfulness but remains bounded by teacher qual- ity and student capacity. Future work will explore stronger teachers (Llama 3.1 70B, 405B), larger CoT datasets, and multi-stage preference optimization to achieve faithful The- ory of Mind reasoning in smaller models. 
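To make the Stage 2 objective concrete, the DPO loss of Eqs. 2-3 reduces to a log-sigmoid over the difference between the policy's and the reference model's log-probability margins for the chosen and rejected CoTs. A minimal sketch follows; the log-probabilities are assumed to be summed over response tokens only, and the β value shown is a placeholder rather than the paper's setting.

```python
# Minimal sketch of the Stage 2 objective (Eqs. 2-3): given sequence log-probs
# of the chosen (y_w) and rejected (y_l) CoTs under the current policy and the
# frozen SFT reference, the DPO loss is -log(sigmoid(beta * delta)).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # delta = log [pi_theta(y_w|x) pi_ref(y_l|x)] / [pi_theta(y_l|x) pi_ref(y_w|x)]
    delta = (policy_logp_w - policy_logp_l) - (ref_logp_w - ref_logp_l)
    return -F.logsigmoid(beta * delta).mean()

# Example with toy log-probabilities (illustrative values only):
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
```

Because the margin ∆ compares the policy against the frozen reference rather than raw likelihoods, the refinement stage stays anchored to the Stage 1 model while still pushing preference mass toward the chosen reasoning.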
36 References Chen, R.; Zhao, Y.; Liang, B.; and Zhang, Y. 2025. Assess- ing Theory of Mind in Large Language Models: Metrics and Benchmarks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 1931–1944. Vienna, Austria: Association for Computational Linguistics. Gandhi, K.; Wu, Z.; Sap, M.; and Bosselut, A. 2023. Under- standing Social Reasoning in Language Models with Lan- guage Models. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 20531–20544. New Orleans, LA: Curran Associates, Inc. Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. In Proceedings of the NIPS Deep Learning and Representation Learning Work- shop (DLRL 2015), 1–9. Montreal, Canada: NeurIPS Foun- dation. Li, J. 2024. A Reflection on Effectiveness and Faithfulness in Reasoning Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 1729–1742. Bangkok, Thailand: Association for Computational Linguistics. Madsen, A.; Rozen, J.; MacNeil, D.; and Raffel, C. 2024. Are Self-Explanations from Large Language Models Faith- ful? In Findings of the Association for Computational Lin- guistics (ACL 2024), 2152–2165. Bangkok, Thailand: Asso- ciation for Computational Linguistics. Rafailov, R.; Sharma, A.; Du, Y.; Krasheninnikov, D.; Driess, D.; Abbeel, P.; Finn, C.; and Levine, S. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 9874–9890. New Orleans, LA: Curran Associates, Inc. Sap, M.; Rashkin, H.; Chen, D.; Le Bras, R.; and Choi, Y. 2019. Social-IQA: Commonsense Reasoning about So- cial Interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), 4463–4473. Hong Kong, China: Associa- tion for Computational Linguistics. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q. V.; and Zhou, D. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS 2022), 24824–24837. New Orleans, LA: Curran Associates, Inc. 37 From Theory of Mind to Theory of Environment: Counterfactual Simulation of Latent Environmental Dynamics Ryutaro Uchiyama Singapore University of Technology and Design ryutarouchiyama@sutd.edu.sg Abstract The vertebrate motor system employs dimensionality- reducing strategies to limit the complexity of movement co- ordination, for efficient motor control. But when environ- ments are dense with hidden action–outcome contingencies, movement complexity can promote behavioral innovation. Humans, perhaps uniquely, may infer the presence of hidden environmental dynamics from social cues, by drawing upon computational mechanisms shared with Theory of Mind. This proposed “Theory of Environment” supports behavioral inno- vation by expanding the dimensionality of motor exploration. The problem of behavioral innovation The flexibility and creativity of human behavior remain an enigma. Theories of cultural evolution explain how the emergence of conformism and imitation enabled behavioral innovations to persist and cumulate across generations, re- sulting in the ecological success of our species (Boyd and Richerson 1985). 
Behavioral innovation itself, however, re- mains poorly understood, often relying on assumptions of random variation that are formally analogous to genetic mu- tation. Here we propose a novel socio-cognitive mechanism, grounded in Theory of Mind computation (Barnby et al. 2024), that helps bridge this explanatory gap. The human brain controls approximately 600 muscles and 350 joints to generate desirable outcomes in a 3-dimensional space, yielding a highly redundant system in which a given action objective can be realized by a vast number of pos- sible motor configurations. To reduce this sprawling com- plexity (Bernstein 1967), the vertebrate motor system or- ganizes muscular activation into coordinated “muscle syn- ergies” (Overduin et al. 2008) that impose strategic low- dimensional constraints onto high-dimensional biomechan- ics. By constraining the variability of movement coordina- tion, muscle synergies facilitate efficient whole-body con- trol, but necessarily limit the exploration of movement- coordination structures and thus possible behaviors. Open- ended behavioral exploration is generally a costly invest- ment, as evolution optimizes for multiplicative (i.e., geomet- ric mean) fitness, where a single zero-fitness episode wipes out all prior gains. Assumptions of additive utility in rein- forcement learning hence underestimate this vulnerability to exploration risk. The restricted behavioral repertoire of non- human primates (Tennie, Call, and Tomasello 2009) should be construed not as a functional deficit, but as reflecting a general solution to the problem of motor complexity. Recently, leading research groups in human evolution- ary biology (Morgan and Feldman 2024) and computational cognitive science (Chu, Tenenbaum, and Schulz 2024) have independently argued that the species-unique feature of hu- man behavior is its open-ended variability. Such claims sug- gest that humans may have innovated the means to “unbind” acquired constraints on movement degrees-of-freedom – thus becoming able to not only reduce but also expand mo- tor exploration complexity. Recent approaches in the move- ment sciences illustrate how such increases in the dimen- sional complexity of motor coordination can facilitate skill acquisition (Dhawale, Smith, and ̈ Olveczky 2017). Real ecological environments typically contain an un- bounded number of hidden action–outcome contingencies (i.e., environmental dynamics) that can be potentially un- locked by skill acquisition – constituting an open-ended search space. The density of these latent environmental goals (“teleological depth”) thus determines the scope of prospective future gains in the controllability of environ- mental outcomes (Ligneul et al. 2022; Mancinelli, Roiser, and Dayan 2021). Such untapped prospective goal-states can offset the investment cost of behavioral exploration (Moli- naro et al. 2024), incentivizing learners to unbind their motor constraints, rather than remain locked into a low- dimensional repertoire optimized for known goals. But such calibration presumably requires a means to infer the teleo- logical depth of a given environment. How might this work? A four-fold typology of social goal inference Echoing Vygotsky (1980), we argue that teleological depth is cued by the socio-cultural environment. Without such cued information, the density of latent goals in an environ- ment could only be probed through actual open-ended be- havioral exploration – a prohibitively risky investment, as discussed. 
We refer to this social inference of teleological depth as theory of environment (ToE), and situate it in a 2×2 typology with other better studied mechanisms of social goal inference (Figure 1): 1. Goal attribution: From the first year of life, human in- fants expect others’ actions to be goal-directed. Infants are prolific in their attribution of goals not only to observed be- haviors, but also to artifactual and natural objects, for ex- 38 In-distributionOut-of-distribution Factive Goal AttributionAction Imitation implicit goal distribution (that generated the observed action) observed action observed action inference empirical goal distribution empirical goal distribution goal recoding Counterfactual Theory of Mind Theory of Environment simulation simulation counterfactually generated goal distribution observed action observed action empirical goal distribution inference inference empirical goal distribution imaginary goal distribution (that may counterfactually generate the observed action) Figure 1: Four modalities of social goal inference, identifying structural variation along two representational dimensions: (a) factive vs. counterfactual, pertaining to the omission/use of counterfactual simulation; and (b) in- vs. out-of-distribution, per- taining to the scope of goal inference – either bounded or unbounded by a current known hypothesis space. ample when interpreting the agentic purpose of wrenches or clouds (Kelemen 1999). 2. Theory of Mind (ToM): When observing an agent who acts upon a false belief, simple goal attribution is thwarted, instead requiring “meta-representation” of hidden mental states and counterfactual goals, i.e., theory of mind (ToM). Full-fledged ToM appears later in development than goal at- tribution (Gergely and Csibra 2003), and is observed reli- ably only in humans. Some non-human primates use a sim- pler, “factive” ToM that circumvents the computational cost of counterfactual simulation (Phillips et al. 2021). Due in part to this cost of counterfactual use, hypothesis-generation in ToM is constrained to the well-defined (“in-distribution”) space of known goals. This limitation is shared by inverse reinforcement learning (IRL) – a common algorithmic ap- proximation of ToM (Baker et al. 2017). IRL scales poorly in complex environments, and is often restricted to closed- ended task domains. ToM is thus inadequate when observing someone posting mail, if the observer lacks prior knowledge of the environmental dynamics of mail service. A cumula- tively cultural species is guaranteed regular encounters with such causally opaque behavior (Henrich 2016), suggesting the need for “out-of-distribution” inference mechanisms. 3. Action Imitation: Imitation can be seen as the recod- ing of an observed action into a novel goal unto itself (Lyons, Young, and Keil 2007; Schachner and Carey 2013) – mech- anistically consistent with hindsight relabeling methods in goal-conditioned RL (Andrychowicz et al. 2017). Imitation circumvents the cost of counterfactual generation, making it a sample-efficient mechanism for out-of-distribution learn- ing, akin to episodic control (Lengyel and Dayan 2008). But being tethered to literal observations, imitation lacks the generative flexibility of counterfactual simulation. 4. Theory of Environment (ToE): In our mail-posting example, ToM fails to resolve the observed action. But such “convergence failure” can itself serve as a valuable cue to switch from postulating hidden mental states to postulating hidden environmental dynamics. 
Both ToM and ToE depend upon counterfactual generation, and likely draw upon a com- mon computational machinery. But whereas ToM resolves ambiguity by searching a known (in-distribution) hypothe- sis space, ToE does so by searching an open-ended space of possible environmental dynamics. Verification of possi- ble environmental dynamics requires actual behavioral ex- ploration, rather than internal hypothesis-fitting. ToE thus generates out-of-distribution counterfactuals – a representa- tion adjacent to imagination, which we suggest promotes the expansion of motor dimensionality for skill development. In sum: the computational mechanisms of ToM may have purpose beyond mentalization. ToE deploys counterfactual generation of possible worlds, to support open-ended behav- ioral exploration. Cultural evolutionary theories typically construe asocial and social learning as exploration and ex- ploitation, respectively. Our proposal decomposes this di- chotomy by postulating an essential socio-cultural basis for human behavioral exploration and innovation. 39 Acknowledgments This work is supported by the SUTD Kickstarter Initiative (MOE AcRF Tier 1) under grant number SKI 2021 0609 References Andrychowicz, M.; Crow, D.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, O. P.; and Zaremba, W. 2017. Hindsight Experience Replay. In Advances in Neural Information Processing Systems, vol- ume 30. Baker, C. L.; Jara-Ettinger, J.; Saxe, R.; and Tenenbaum, J. B. 2017. Rational quantitative attribution of beliefs, de- sires and percepts in human mentalizing. Nature Human Behaviour, 1(4): 0064. Barnby, J. M.; Alon, N.; Bellucci, G.; Schilbach, L.; Frith, C. D.; and Bell, V. 2024. A Standard Framework for Social Cognition: interoperable algorithms for inference and repre- sentation. Bernstein, N. A. 1967. The co-ordination and regulation of movements. Pergamon Press. Boyd, R.; and Richerson, P. J. 1985. Culture and the evolu- tionary process. University of Chicago Press. ISBN 0-226- 06933-8. Chu, J.; Tenenbaum, J. B.; and Schulz, L. E. 2024. In praise of folly: flexible goals and human cognition. Trends in Cog- nitive Sciences, 28(7): 628–642. Dhawale, A. K.; Smith, M. A.; and ̈ Olveczky, B. P. 2017. The Role of Variability in Motor Learning. Annual Review of Neuroscience, 40(1): 479–498. Gergely, G.; and Csibra, G. 2003. Teleological reasoning in infancy: the na ̈ ıve theory of rational action. Trends in Cognitive Sciences, 7(7): 287–292. Henrich, J. 2016. The secret of our success: How culture is driving human evolution, domesticating our species, and making us smarter.Princeton, NJ: Princeton University Press. Kelemen, D. 1999. Why are rocks pointy? Children’s prefer- ence for teleological explanations of the natural world. De- velopmental Psychology, 35(6): 1440–1452. Lengyel, M.; and Dayan, P. 2008. Hippocampal contribu- tions to control: The third way. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in neural in- formation processing systems, volume 20, 889–896. Curran Associates, Inc. Ligneul, R.; Mainen, Z. F.; Ly, V.; and Cools, R. 2022. Stress-sensitive inference of task controllability. Nature Hu- man Behaviour, 6(6): 812–822. Lyons, D. E.; Young, A. G.; and Keil, F. C. 2007. The hidden structure of overimitation. Proceedings of the Na- tional Academy of Sciences of the United States of America, 104(50): 19751–6. Mancinelli, F.; Roiser, J.; and Dayan, P. 2021. Internality and the internalisation of failure: Evidence from a novel task. 
PLOS Computational Biology, 17(7): e1009134. Molinaro, G.; Colas, C.; Oudeyer, P.-Y.; and Collins, A. 2024. Latent Learning Progress Drives Autonomous Goal Selection in Human Reinforcement Learning. Morgan, T. J. H.; and Feldman, M. W. 2024. Human culture is uniquely open-ended rather than uniquely cumulative. Na- ture Human Behaviour. Overduin, S. A.; d’Avella, A.; Roh, J.; and Bizzi, E. 2008. Modulation of Muscle Synergy Recruitment in Primate Grasping. The Journal of Neuroscience, 28(4): 880–892. Phillips, J.; Buckwalter, W.; Cushman, F.; Friedman, O.; Martin, A.; Turri, J.; Santos, L.; and Knobe, J. 2021. Knowl- edge before belief. Behavioral and Brain Sciences, 44: e140. Schachner, A.; and Carey, S. 2013. Reasoning about ‘ir- rational’ actions: When intentional movements cannot be explained, the movements themselves are seen as the goal. Cognition, 129(2): 309–327. Tennie, C.; Call, J.; and Tomasello, M. 2009. Ratcheting up the ratchet: on the evolution of cumulative culture. Philo- sophical Transactions of the Royal Society B: Biological Sci- ences, 364(1528): 2405–2415. Vygotsky, L. S. 1980. Mind in Society: The development of higher psychological processes. Cambridge, MA: Harvard University Press. 40 Geometric Belief Spaces for Robust Multi-Agent Belief Dynamics Tasha Kim University of Oxford Oxford, United Kingdom tasha.kim@eng.ox.ac.uk Abstract Modeling how agents reason about each other’s beliefs (The- ory of Mind) relies heavily on recursive embedding of be- lief. As the number of agents increases, the problem grows rapidly in complexity and intractability. We introduce Geo- metric Belief Spaces (GBS), which represent collective be- lief states as points on low-dimensional manifolds with belief dynamics as gradient flow rather than recursive enumeration. GBS scales near-linear O(n 0.95 ) with 40 agents in prelim- inary tests, compared to O(n 1.80 ) of recursive methods. In search tasks that required cooperation, GBS achieved suc- cess rates comparable to simple averaging methods. GBS was highly resistant to errors introduced into the data, achiev- ing 95% accuracy even with 30% of the data records being wrong. This outperforms the simple averaging method giving 76% accuracy, even without adversarial training. In its cur- rent form, GBS models first-order beliefs; extending geomet- ric structure to higher order beliefs may open up a pathway for a more interpretable and scalable theory of mind. Introduction How should artificial agents model what other agents be- lieve? The standard paradigm involves recursive nesting: agent i models agent j’s beliefs, which include j’s model of i’s beliefs, and so on. This approach is theoretically grounded in epistemic logic (Fagin et al. 1995) and Bayesian inference (Baker et al. 2017), but faces the critical issue of exponentially growing computational cost as you add nest- ing depth. But in most coordination problems, we must rea- son about many agents simultaneously, not just in pairs. Instead of modeling beliefs recursively, we embed en- tire belief configurations as points in continuous, geomet- ric space: beliefs of all agents collectively define a single point on a low-dimensional manifoldM ⊂R d , turning be- lief dynamics into trajectories through the space. Note in this formulation, GBS models first-order beliefs only; inter- agent influences are captured through the interaction struc- ture, rather than explicit nesting. Related Work Computational theory of mind. 
Theory of Mind (ToM) stems from psychological origins (Premack and Woodruff 1978, Baron-Cohen, Leslie, and Frith 1985), and gained traction in AI research. But current nested epistemic ap- proaches (Fagin et al. 1995) or Bayesian frameworks (Baker Figure 1: (a) Sparse belief manifold and k-N topology. (b) pro- jected gradient flow dynamics. et al. 2017) suffer exponential complexity; recent neural ap- proximations (Rabinowitz et al. 2018, Jara-Ettinger 2019) scale better (but sacrifice interpretability). Jara-Ettinger (2019) framed ToM as inverse reinforcement learning and recently, large language models (LLMs) have displayed ToM capabilities (Li et al. 2023). Hybrid approaches like TomAbd (Montes et al. 2023) attempt to combine both sym- bolic and neural methods. Geometric and topological ML. Topological data anal- ysis (Carlsson 2009, Edelsbrunner and Harer 2008) show how shape captures information that coordinate represen- tations can overlook. Geometric deep learning (Bronstein et al. 2021) has unified different neural architectures through group theory. GBS applies these ideas to belief dynamics. Scalable multi-agent systems. Consensus algorithms (Olfati-Saber, Fax, and Murray 2007), distributed constraint optimization (Modi et al. 2005) give theoretical foundation, but struggle to scale. Neural methods QMIX (Rashid et al. 2018), CommNet (Sukhbaatar, Szlam, and Fergus 2016) work with dozens of agents but don’t directly model beliefs. Geometric Belief Spaces (GBS) Setup. We have n agents, each holding beliefs over D possible world states (varies by task). Agent i’s beliefs form probability distribution b i on the simplex. Stacking all agents’ beliefs, gives the full belief state B =b 1 ,...,b n . Embedding. We embed B intoR 3 using a hybrid ap- proach, where φ(B) = β· φ direct + (1− β)· φ MDS . Here φ direct = B · P projects beliefs onto a fixed orthonormal basis P ∈R D×3 and φ MDS applies MDS to pairwise JS (Jensen-Shannon) divergences. We use β = 0.7, but this was 41 Figure 2: (a) Computation time increases with agent count. Recursive ToM scales poorly, with steepest growth. (b) GBS exponent at 0.95, between the recursive approach and methods without belief modeling. (c) how much beliefs change per update. GBS and recursive ToM system achieves equilibrium at low values because it generates fixed positions instead of producing ongoing oscillations. not extensively tuned. The sparse k-N structure we used achieved O(n 0.95 ) empirical scaling, tested range n≤ 40. Communication topology. Each agent communicates with its k nearest neighbors, with k = max(3,⌊2 log(n + 1)⌋) to reduce number of pairwise divergence computations. Dynamics. Beliefs update through a projected gradient flow: dB dt =−Proj T B ∆ (∇E(B)) +η R(B). Here, Proj T B ∆ is projection onto the tangent space of the belief simplex. E(B) balances two objectives: (i) consensus-how agents agree with neighbors, (i) observation fidelity-agents match what they see. R(B) maintains belief diversity, η = 0.1. Cost. At 40 agents, a full update took 1.19ms on standard hardware. Empirically, the whole system scales O(n 0.95 ). Experiments GBS was assessed against three baselines: Recursive ToM – depth-2 belief nesting; Neural ToM – a trained MLP; Simple Averaging (Avg.) with no belief modeling (agents take an average of what they observed). Full experimental details including implementations and tasks are in Appendix A. E1: Computational scaling. 
Figure 2 and Table 1 high- lights a scaling tradeoff landscape, rather than a clear rank- ing. GBS shows near-linear empirical scaling while the task required around 1.19ms per update, which is 17× faster than Recursive ToM at n = 40. But simpler methods scaled even faster, raising the question of what GBS’s geometric struc- ture buys that justifies extra cost. E2: Cooperative search. Results were mixed. See Ta- ble 1. When information flows freely and observations are reliable, belief modeling does not matter greatly. E3: Adversarial robustness. Explicit belief modeling al- lowed robust coordination under conflicting information without requiring adversarial training. For a simple coordi- nation task without adversarial noise, in which 10 agents select from 5 options with 70% getting partial observa- tions, GBS had 90% success (30 episodes), versus Simple Table 1: Preliminary experimental results. E1: empirical scal- ing exponent. E2: search task (n=15). E3: adversarial robustness (n=15). Results suggest tradeoffs rather than a clear winner. E1E2E3 MethodScaleSucc.StepsSucc.Acc. Recursive ToM O(n 1.80 )70%6.4— Neural ToM O(n 0.49 )60%7.3100%73% Simple Avg. O(n 0.38 )80%4.670%76% GBS (ours)O(n 0.95 )80%4.8100%95% Avg. (73%) and Neural ToM (13%), suggesting coordina- tion from partial information may benefit from aligning ge- ometric beliefs. See details in Appendix D. Discussion In the belief trajectories of GBS, certain geometric proper- ties can be measured like distance between the clustering agents, speed of convergence and position of outliers. The relationship between this and meaningful epistemic knowl- edge remains to be seen. At first sight it appears that we will need to model beliefs explicitly for a truly scalable and ro- bust autonomous system, but simple averaging can achieve comparable search performance and scalability. In condi- tions where the GBS performs well in the face of adversarial influences, the utility of geometric techniques in eliminating faulty data is implied. Our results tested systems with up to 40 agents, but further studies are needed to see how larger systems behave. Eventually, computational time for diver- gence calculations will be dominated by the O(n 2 ) term but approximations can help extend the range of tractable sys- tems. At present our system only models first order beliefs and extension to nested reasoning would require a hierar- chical manifold structure, which remains an open question. Recursive methods give formal guarantees, neural methods offer flexibility and learn from data; what GBS can offer is explicit geometric structure to support interpretability and robustness in ways other approaches cannot. 42 References Amari, S.-i. 2016. Information Geometry and Its Applica- tions. Springer. Baker, C. L.; Jara-Ettinger, J.; Saxe, R.; and Tenenbaum, J. B. 2017. Rational quantitative attribution of beliefs, de- sires and percepts in human mentalizing. Nature Human Behaviour, 1(4): 1–10. Baron-Cohen, S.; Leslie, A. M.; and Frith, U. 1985. Does the autistic child have a “theory of mind”’? Cognition, 21(1): 37–46. Bronstein, M. M.; Bruna, J.; Cohen, T.; and Veli ˇ ckovi ́ c, P. 2021. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. Carlsson, G. 2009. Topology and data. Bulletin of the Amer- ican Mathematical Society, 46(2): 255–308. De Weerd, H.; Verbrugge, R.; and Verheij, B. 2013. How much does it help to know what she knows you know? An agent-based simulation study. 
Artificial Intelligence, 199: 67–92. Edelsbrunner, H.; and Harer, J. 2008. Persistent homology-a survey. Contemporary mathematics, 453: 257–282. Fagin, R.; Halpern, J. Y.; Moses, Y.; and Vardi, M. 1995. Reasoning about knowledge. MIT press. Jara-Ettinger, J. 2019. Theory of mind as inverse reinforce- ment learning. Current Opinion in Behavioral Sciences, 29: 105–110. Li, H.; Chong, Y. Q.; Stepputtis, S.; Campbell, J.; Hughes, D.; Lewis, M.; and Sycara, K. 2023. Theory of Mind for Multi-Agent Collaboration via Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 180–192. Association for Computational Linguistics. Modi, P. J.; Shen, W.-M.; Tambe, M.; and Yokoo, M. 2005. ADOPT: Asynchronous distributed constraint optimization with quality guarantees. Artificial Intelligence, 161(1-2): 149–180. Montes, N.; Luck, M.; Osman, N.; Rodrigues, O.; and Sierra, C. 2023. Combining theory of mind and abduc- tive reasoning in agent-oriented programming. Autonomous Agents and Multi-Agent Systems, 37. Olfati-Saber, R.; Fax, J. A.; and Murray, R. M. 2007. Con- sensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1): 215–233. Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4): 515–526. Rabinowitz, N.; Perbet, F.; Song, H. F.; Zhang, C.; Eslami, S. A.; and Botvinick, M. 2018. Machine theory of mind. In- ternational Conference on Machine Learning, 4218–4227. Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Fo- erster, J.; and Whiteson, S. 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learn- ing, 4295–4304. Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29. 43 Appendix A. Implementation Details The embedding procedure 1. Compute pairwise Jensen-Shannon divergences D ij = JS(b i ∥b j ), yielding an n× n dissimilarity matrix. 2. Apply classical MDS via eigendecomposition of the double-centered divergence matrix, projecting to d = 3 dimensions (φ MDS ). 3. Compute direct projection φ direct = B · P where P ∈ R D×d is a fixed orthonormal basis. 4. Combine: φ(B) = 0.7· φ direct + 0.3· φ MDS . Baseline implementations • Recursive ToM: Depth-2 nesting (agent i models j’s be- liefs about k). Beliefs aggregate via weighted averag- ing with memoization (De Weerd, Verbrugge, and Verheij 2013, Baker et al. 2017). • Neural ToM: 3-layer MLP (128-64-32 hidden units, ReLU, sklearn MLPRegressor). Input: flattened beliefs, flattened observations, population mean, population std. Pretrained on 500 synthetic belief trajectories (Rabinowitz et al. 2018). • Simple Averaging: b t+1 i = b t i + 0.1( ̄ b t −b t i ) (Olfati-Saber, Fax, and Murray 2007). B. Computational Scaling Experiments (E1) Experimental setup We measured the computational cost of belief updates as a function of agent count n ∈ 5, 10, 15, 20, 30, 40. For each agent count, we performed 5 trials, measuring the wall-clock time for a single belief up- date iteration. Methods evaluated • Recursive ToM: Full recursive modeling with depth-2 re- cursion (i models j modeling k). Each agent maintains be- liefs about all other agents’ beliefs, requiring O(n 2 ) belief representations. Implementation uses dynamic program- ming with memoization. 
• Neural ToM: Multi-layer perceptron (MLP) with archi- tecture [input, 128, 64, 32, output]. Takes input of flat- tened beliefs, flattened observations, population mean, and population std. No explicit recursive structure. • Simple Averaging: Baseline that updates each agent’s be- lief toward the population mean: b t+1 i = b t i + α( ̄ b t − b t i ) where ̄ b t = 1 n P j b t j . This represents trivial information aggregation without theory of mind. • GBS (Ours): Sparse k-nearest neighbor communication structure with k = max(3,⌊2 log(n + 1)⌋). Geometric gradient computed only over local neighborhoods. Energy functional combines belief consensus and observation fi- delity terms. Complexity analysis We fit power-law models T (n) = an b to the measured times using least-squares regression. The exponent b characterizes the asymptotic scaling: • Recursive ToM: b = 1.80 (super-linear, approaching quadratic). • Neural ToM: b = 0.49 (sub-linear, limited expressive- ness). • Simple Averaging: b = 0.37 (sub-linear, no ToM capabil- ities). • GBS: b = 0.95 (near-linear, with explicit first-order belief modeling). The key insight is that GBS achieves near-linear scaling while maintaining explicit belief modeling, unlike Simple Averaging (which has no ToM) or Neural ToM (which is a black-box approximation). C. Cooperative Search Task (E2) Task description Agents navigate a 20×20 discrete grid to locate 3 hidden target positions. Each agent: • Has a local visibility radius of 3 cells. • Observes targets when within visibility radius. • Communicates belief distributions to neighbors. • Updates beliefs with geometric optimization (GBS) or baselines. Success criteria An episode is considered successful if: 1. Agents discover at least 2 of the 3 targets. 2. Population belief converges (variance < 0.1). 3. Both conditions met within 20 time steps. Scaling behavior Search success improves dramatically with agent count. At n = 5, success rates are low (20- 40%) because sparse observations provide insufficient infor- mation. At n = 40, all methods reach 100% success. The task becomes easy with enough information sources. This scaling trend reveals an important property of coop- erative multi-agent tasks: performance improves with agent count as more observations accelerate convergence. The in- teresting regime is intermediate scale (n = 15-20), where methods show meaningful differences. Results by agent count Table 2 shows complete search results across scales. Table 2: Search success rates at agent counts (10 episodes per count). MethodN=5 N=10 N=15 N=20 N=30 N=40 Recursive ToM20%60%70%70%90%100% Neural ToM40%50%60%70%100%100% Simple Avg.20%60%80%100%80%100% GBS (ours)30% 40% 80% 90% 90% 100% At n = 15 agents (reported in main paper Table 1), GBS achieves 80% success; this is competitive with Recursive ToM (70%) and higher than Neural ToM (60%). Simple Av- eraging also leads at 80%, but provides no interpretable be- lief structure. At n = 20, GBS reaches 90% success, demon- strating strong performance at intermediate scales. D. Adversarial Robustness Task (E3) Task description Agents must coordinate on selecting one of 5 options over 10 episodes, and receive conflicting infor- mation: 44 • Truthful observations: 60% of observations signal the correct option with strength 1.0. • Adversarial observations: 40% of observations signal in- correct option with strength 4.0 (deliberately misleading). • Agents get observations with probability 0.4 per time step. • Goal: Converge to consensus on the true option. 
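A small simulation of the observation model just described helps make the setup concrete. The sampling helper below is illustrative only, not the authors' implementation; it draws truthful and adversarial signals with the rates and strengths listed above.

```python
# Illustrative simulation of the adversarial observation model: per time step
# each agent receives an observation with probability 0.4; 60% of observations
# point at the true option with strength 1.0, 40% point at a wrong option with
# strength 4.0. Sketch only, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def sample_observation(true_option: int, n_options: int = 5):
    """Return an evidence vector over options, or None if nothing is observed."""
    if rng.random() > 0.4:                       # observation probability per step
        return None
    obs = np.zeros(n_options)
    if rng.random() < 0.6:                       # truthful signal
        obs[true_option] = 1.0
    else:                                        # adversarial, deliberately wrong
        wrong = rng.choice([o for o in range(n_options) if o != true_option])
        obs[wrong] = 4.0
    return obs
```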
Robustness mechanisms • GBS: Detects outliers by measuring geometric distance in the belief manifold. Observations that deviate significantly from the population trajectory receive lower trust weights: w i = 1/(1 + 2·||obs i − ̄ b|| 1 ). • Neural ToM: No explicit outlier detection. The neural network must learn to handle adversarial observations im- plicitly through training, but has not been trained on ad- versarial data. • Simple Averaging: Treats all observations equally. Highly vulnerable to adversarial information as it has no mechanism to identify or discount misleading signals. Metrics • Convergence success: Asks, “did agents reach consensus (variance < 0.05) within 20 steps?” • Correctness: Asks, “what percentage of agents selected the true option?” Results At 40% adversarial observations (main Table 1): • GBS: 100% convergence; 95% correctness. Robust iden- tification of true option despite misleading signals. • Neural ToM: 100% convergence; 73% correctness. Many agents initially misled, black-box learning eventually re- covers. • Simple Averaging: 70% convergence; 76% correctness. Fails completely in 30% of trials because it is unable to filter bad data. This demonstrates that explicit belief modeling provides measurable advantage when information sources cannot be trusted. Simple averaging suffices for cooperative scenarios but breaks down under adversarial conditions. GBS’s geo- metric structure enables principled outlier detection without requiring adversarial training. E. Geometric Analysis E1. Manifold Structure Belief space geometry The belief simplex ∆ d−1 = b ∈ R d : b i ≥ 0, P i b i = 1 forms a (d− 1)-dimensional man- ifold embedded inR d . A natural Riemannian structure is given the Fisher-Rao metric. In mixture coordinates it has a local form restricted to the tangent space, and P i db i = 0: ds 2 = d X i=1 1 b i db 2 i ,(1) Under this metric, the simplex has a curved Riemannian structure dually flat in information-geometric coordinates (Amari 2016). GBS uses this geometry heuristically (com- putes Euclidean gradients and projects back onto simplex). Convergence analysis Empirically, the energy functional E(B) = E consensus (B) +λE obs (B) decreases monotonically when the communication graph is connected, and learning rates are small (0.1-0.15 in our experiments). We observe convergence to stable equilibria in all our tested configura- tions, under the following conditions: 1. Connectivity: Each agent has at least one neighbor. 2. Observation consistency: Truthful observations domi- nate adversarial ones. 3. Small learning rate: α in range 0.1-0.15. The sparse k-N structure ensures connectivity while maintaining O(n) total edges, enabling near-linear updates. E2. Topological Features Persistent homology computation We use the GUDHI li- brary to compute persistent homology of belief trajectories: 1. Construct Vietoris-Rips complex with radius ε = 0.1. 2. Compute 0-dimensional (connected components) and 1- dimensional (loops) persistence. 3. Extract birth-death pairs (b i ,d i ) for each topological fea- ture. 4. Measure persistence as p i = d i − b i . Long-lived features (high persistence) indicate stable ge- ometric structures in the belief space that are robust to noise. In our experiments, GBS trajectories maintained consistent topological structure throughout convergence, while base- lines exhibit unstable or trivial topology. Interpretation • Connected components: Represent clusters of agents with similar beliefs. 
Convergence corresponds to all agents merging into a single component. • Loops (1-cycles): Indicate agents cycling through belief states before settling. Persistent loops suggest exploration vs. exploitation trade-offs. • Higher-order features: Captures more complex coordi- nation patterns; we focus on H0, H1 for interpretability. Quantitative results Experiments with 15 agents over 50 update steps (belief dimension 10, manifold dimension 3): • Manifold diameter reduced by 82% during convergence. • Strong diameter-energy correlation; r = 0.79, p < 0.001. • β 0 = 15 maintained throughout; a single connected com- ponent. • Transient 1-cycles (β 1 ≤ 5) captured temporary disagree- ment structures (topological stability). F. Additional Implementation Details F1. Hyperparameters All values listed in Table 3. F2. Environment Experiments were conducted with: • CPU: Intel Xeon E5-2680 v4 @ 2.40GHz • RAM: 32GB DDR4 • OS: Ubuntu 22.04 LTS 45 Table 3: Hyperparameters across experiments. ParameterValueDescription (i) GBS Learning rate α0.1-0.15Gradient descent step size k-N neighbors kmax(3, 2 logn)Sparse communication structure Consensus weight λ1.0Energy functional balance Belief clip[10 −10 , 1]Numerical stability (i) Recursive ToM Recursion depth2i models j modeling k Belief update rate0.1Mixing parameter (i) Neural ToM Architecture[in, 128, 64, 32, out]MLP hidden layers ActivationReLUNon-linearity InitializationGaussian σ = 0.1Weight initialization (iv) Simple Averaging Mixing rate0.1-0.15b i ← b i + α( ̄ b− b i ) (v) E2 Grid size20×20Discrete grid Targets3Hidden locations Visibility radius3Observation range Max steps20Episode limit (vi) E3 Options5Coordination choices Adversarial rate40%Misleading observations Adversarial strength4×Signal amplification Observation rate0.4Per-agent per-step Max steps20Episode limit • Python: 3.10.12 • NumPy: 1.24.3 • GUDHI: 3.8.0 (for persistent homology) G. Additional Results G.1. Ablation Studies Impact of k-N structure We tested different values of k (no. of neighbors) in the sparse communication graph: Table 4: Search success vs. k-N parameter (n = 15). k valueSuccess Rate Mean Steps Update Time (ms) k = 3 (fixed)60%9.40.23 k = 5 (fixed)70%8.80.31 k = logn65%9.10.28 k = 2 logn70%8.80.34 k = n− 1 (full)75%8.21.42 The k = 2 logn choice balances performance and effi- ciency, achieving 70% success (matching full connectivity) with 4× speedup. Impact of observation weighting For adversarial coordi- nation, we tested different outlier detection strategies. Table 5: E3 results with different outlier detection strategies (40% adversarial). StrategySuccess Rate Correctness No outlier detection70%76% Fixed threshold85%84% Geometric distance (ours)100%95% Learned weights (Neural)100%73% Geometric outlier detection, that is, measuring manifold distance outperforms both the fixed thresholding and learned approaches. H. Limitations and Future Work Task complexity Our cooperative search task becomes easy at large scale (n ≥ 40), with all methods reaching 100% success. This suggests the task saturates with suffi- cient information sources. Future work should evaluate on tasks that remain challenging at scale: • Competitive scenarios (e.g. agents with conflicting goals). • Deceptive communication (e.g. agents strategically misre- port beliefs). • Partial observability with systematic biases. • Dynamic environments with moving targets. Scalability limits While GBS achieves near-linear O(n 0.95 ) complexity, we only tested up to n = 40 agents. 
Very large-scale systems (n > 1000) may require: hier- archical belief structures, distributed computation across multiple machines, or approximations to the full geometric optimization. Belief dimension Current experiments use relatively low- dimensional belief spaces (d ≤ 400 grid cells for search). High-dimensional continuous state spaces may require: di- mensionality reduction techniques, factored belief represen- tations, sparse belief distributions. Note on statistics Search and adversarial robustness suc- cess rates were measured over 10 episodes, and expected variance is around±20 percentage points at this sample size. Measured time will vary across hardware configurations. 46 Inside Deception: How to Exploit a Target Peta Masters 1 , Stefan Sarkadi 2 , Tom Wallis, Peter McBurney 1 1 King’s College London 2 University of Lincoln peta.masters@kcl.ac.uk, SSarkadi@lincoln.ac.uk, tom@tomwallis.net, peter.mcburney@kcl.ac.uk Abstract This work is part of a program to formalise enablers of decep- tion so that attempts at manipulation can be detected by an AI system. In this extended abstract, we report results from a formalisation workshop which focused on two factors com- monly exploited in the deception of humans: repeated deliv- ery of a falsehood via multiple channels; and the application of emotional accelerators such as trust, ego, urgency and so- cial proof which can be used to pressure individuals into act- ing more in a deceiver’s interests than their own. Introduction Deception is an art, manipulation a skill. What, the deceiver asks, will make my target believe what I want them to be- lieve? What will make them do what I want them to do? We conducted a formalisation workshop attended by researchers with expertise in fields ranging from cyber- security and deep fake detection to marketing and psychol- ogy and invited them to examine 21 factors known to con- tribute to the deception of human targets (Masters, Kuhn, and Luff 2025). Of these factors, five appeared to contribute primarily to concealment of the truth, six to the generation of false belief, and 10 seemed to be triggers of reflexive—and consequently poorly-considered—action. Workshop partici- pants elected to focus on two factors in particular: repeated delivery of information via multiple channels or from di- verse sources—which primarily influences belief-formation, and urgency or scarcity—which trigger reflexive action. In what follows, we outline the background to these two exploitable factors and describe several possible formalisms. The proposed AI is effectively the digital twin for a fallible human observer but one capable of warning that it may be being deceived. It takes detectable signals as input and, pro- cessing them in line with the formalisms, arrives at decisions which, though sometimes incorrect, are always explainable. Multiple Sources / Repetition In the context of military intelligence, it is a given that false information should be delivered via as many channels as possible (Omand 2020). Operation Fortitude, for example, a major deception operation undertaken by the Allies dur- ing the Second World War in support of the D-day land- ings in June 1944, sought to deceive the Nazi military com- mand as to the location for the intended allied military land- ings. 
The operation made use of physical deception such as dummy landing craft, fake radio communications, creation of maps and handbooks with false locations, and strategic ‘leaks’ of false reports delivered via diplomatic channels and by double agents. Evidence from multiple sources is potent. When we rely on reasoning—loosely, the System 2 thinking described by Kahneman (2011)—we tend to look for cor- roborating evidence. Meanwhile, however, in the context of heuristic, System 1-type thinking, repetition even from a sin- gle source can have similar impact if heard enough times. In fact, repetition can be effective even where the informa- tion delivered is blatantly false (Fazio, Rand, and Pennycook 2019) and independently of the credibility of the source (Un- kelbach and Greifeneder 2018). This is the so-called “illu- sory truth effect” thought to derive from a cognitive heuristic that equates fluency and ease of processing with veracity. It helps explain how mis—and disinformation takes root: “re- peating a claim makes it feel truer” (Udry and Barber 2024). Table 1: Multiple sources / repetition Detectable indicators Number of messages (i.e., items of informa- tion). Spacing between messages. Number of sources. Closeness of the match. Proposed Model/ Method A redundancy metric evaluates message con- sistency, adopting a similarity threshold (i.e., the decision is binary: same/different). A con- centration metric, combining number of mes- sages, spacing and sources summarises the tar- get’s exposure to (potentially inaccurate) evi- dence. Boost the belief/magnitude of the initial signal logarithmically for each repetition from a single source, linearly by multiplication for each confirmation from any additional source. Alternatives(1) Process each signal in real time. (2) Adopt a game theoretic multi-agent model representing the agent interaction across multiple channels. (3) Bayes’ Th. with simplifying assumptions. Possible frameworks Algebraic, information theory, goal recognition, game theory. NotesHigh concentration of dissimilar messages via one channel could be used to bury information, exploiting information overload, or to pad out spacing between messages. Similarity could be modelled the same way as for repetition. 47 Research into this phenomenon typically involves one repetition: participants are exposed to a message, then their belief in a range of statements, including some already seen, is evaluated. Where frequency of messages has been con- sidered, the impact of repetition is found to be logarithmic: the greatest increase in perceived truthfulness follows the second exposure then diminishes (Hassan and Barber 2021). Spacing between repetitions has also been investigated. The perceived truthfulness of evidence increases when there are up to 16 intervening items of information but is unchanged with more or very few (Udry, White, and Barber 2022). In non-peer reviewed work, an alternative approach quan- tifies the effect using Bayes’ Rule. Priors depend on the credibility of the source; posterior probabilities are updated iteratively with each signal on the assumption that the more often a trusted source repeats a story, the more likely it is to be true (Orchinik, Rand, and Bhui 2025). Research into rep- etition often uses exact matches but more complex messag- ing across different channels may rely on non-identical con- tent so the degree of similarity must also be measured. 
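To illustrate the kind of iterative Bayesian update just described, the sketch below multiplies the posterior odds of a claim by a credibility-dependent likelihood ratio for each repeated signal. It is a deliberately simplified illustration of the idea, not the model of Orchinik, Rand, and Bhui (2025), and the likelihood-ratio reading of source credibility is an assumption.

```python
# Simplified illustration of reading repetition through Bayes' Rule:
# each signal multiplies the posterior odds of the claim by a likelihood
# ratio derived from the source's credibility (assumed form, for illustration).
def bayes_repetition(prior: float, credibilities: list[float]) -> float:
    """prior: initial P(claim true); credibilities: P(source reports claim | true)."""
    odds = prior / (1.0 - prior)
    for c in credibilities:
        # A source of credibility c is assumed to report the claim with
        # probability c if it is true and (1 - c) if it is false.
        odds *= c / (1.0 - c)
    return odds / (1.0 + odds)

# Three repetitions from a moderately trusted source (credibility 0.6)
# raise a 0.2 prior to roughly 0.46.
print(bayes_repetition(0.2, [0.6, 0.6, 0.6]))
```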
One way to integrate multiple sources computationally will be to use Subjective Logic (Josang, Hayward, and Pope 2006). To incorporate the combined impact of repetition and sim- ilar messages from multiple sources, workshop participants proposed several possible approaches (see Table 1). The most promising quantifies frequency, spacing, repetition, the number of channels used, and message similarity. As a func- tion of messages sent (msgs), times sent, and the channels used to communicate those messages, the impact may be ex- pressed as: |msgs| max(times)−min(times) , P t∈times (t−times) 2 |times| ,|msgs|, |channels|, with a Bayesian representation of posterior updates. The terms can be combined and simplified to: P t∈times (t−times) 2 ×|msgs| 2 ×|channels| |times|×(max(times)−min(times)) (Eq.1). Urgency / Scarcity A common practice amongst marketers is the attempt to in- stil a sense of urgency in the decision-maker reminding them that offers are “time-limited”, there is “only one left”, other interested buyers are “coming back this afternoon”. Recent studies have examined the impact of time pressure on: pur- chasing (Khetarpal and Singh 2024); negotiation and plea bargaining—in the context of which undue time pressure may constitute coercion (Luna 2022); economic decision- making (Kocher, Pahlke, and Trautmann 2013); and ethical decision-making (Moberg 2000). Interestingly, we note that a common heuristic strategy when making a purchase under time pressure is to opt for brand familiarity which depends on advertising (c.f. multiple sources/repetition above). Experiments to evaluate the effect of time pressure often impose time constraints although, as Ord ́ o ̃ nez, Benson I, and Pittarello (2015) point out, constraints do not necessarily imply pressure. The meaning may better be captured, there- fore, by the term ‘urgency’. The distinction is important in the context of deception or any deliberate attempt to provoke an instinctive or non-reflective response since it is the per- ception of having insufficient time that is of interest which may or may not correspond to the actual time available. Three useful models arise in the literature to measure the impact of time pressure, each a function of time avail- able (T a ) versus time required (T r ). One takes a ratio, an- other the difference, but the best fit—from a study of 197 participants—is based on a combination: the ratio of the ad- ditional time needed to complete a task relative to the time available (the time shortfall, as it were) by the time required (Ord ́ o ̃ nez et al. 2004, as cited in (Ord ́ o ̃ nez, Benson I, and Pittarello 2015)): T P = J T r −T a T r (Eq.2), where T P is time pressure and J is a monotonic constant that translates the result to the judgement scale (e.g., 0 to 10) used by participants. Having considered several alternatives (see Table 2), workshop participants elected to use the above model brack- eting urgency and scarcity together as two sides of the same coin: urgency may be regarded as scarcity of time (as sug- gested by the notion of a time shortfall); meanwhile, the per- ception of scarcity provokes a sense of urgency (consider the panic-buying of toilet rolls during the COVID-19 pandemic (Stratton 2021)). Substituting amount needed and amount available for the terms T r and T a respectively, Eq. 2 (in- tended as a measure of urgency) can be repurposed to cap- ture a precisely equivalent concept of scarcity. 
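The two expressions above translate directly into code. In the sketch below, the term (t − times) in Eq. 1 is read as the deviation of each timestamp from the mean timestamp, which is an assumed interpretation of the workshop notation, and J in Eq. 2 is passed through unchanged as the judgement-scale constant.

```python
# Direct transcription of Eq. 1 (repetition/concentration impact) and Eq. 2
# (time pressure, reusable for scarcity). Sketch only; the mean-deviation
# reading of (t - times) is an assumption.
from statistics import mean

def repetition_impact(times: list[float], n_msgs: int, n_channels: int) -> float:
    spread = sum((t - mean(times)) ** 2 for t in times)   # spacing-variance term
    return (spread * n_msgs ** 2 * n_channels) / (len(times) * (max(times) - min(times)))

def time_pressure(t_required: float, t_available: float, J: float = 1.0) -> float:
    # Eq. 2: shortfall relative to the time required; substitute amount needed
    # and amount available for an equivalent scarcity measure.
    return J * (t_required - t_available) / t_required

# e.g. four near-identical messages over two channels within an hour:
print(repetition_impact([0, 10, 20, 60], n_msgs=4, n_channels=2))
print(time_pressure(t_required=30, t_available=10))       # about 0.67 on a 0-1 scale
```

Note that Eq. 1 is undefined when all messages arrive simultaneously (max(times) = min(times)); a practical detector would need to guard that case.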
Table 2: Towards formalisation of urgency / scarcity Detectable signals Time/resource availability calculable from real- world resource management. Self-reported requirement. Physiological stress indicators e.g., heart rate, blood pressure, etc. detectable using wearable tech. Proposed Model/ Method Eq.2 adapted also for scarcity and/or calculate stress factor from physiological indicators. Alternatives(1) Model not taking action as ordinary state v. need to act when urgency indicators are de- tected. (2) Behavioural recognition from video data; (3) On detection of stress, make the ’greedy’ decision and/or with some probability take random action. Possible frameworks Algebraic, narrative/causality, game theory, planning/goal recognition. NotesSelf-reported data, if used, returns perceived values (more accurate input to TP). Detect sig- nals suggestive of artificially limited timescales (e.g. “only one left”) and contradictory signals (e.g., empty shelves v. known availability). Urgency and scarcity are not the only factors capable of provoking poor decisions. Information overload may have a similar impact (Starcke and Brand 2012; Phillips-Wren and Adya 2020). We identify a further four from the orig- inal 21 factors considered—trust, ego, guilt, and greed— any of which may act as psychological ‘triggers to’ or ‘am- plifiers of’ poor decision-making, especially when paired with factors that invite a particular response through social compliance, social proof, path of least resistance, or con- sistency. Thus, while Equation 2 suggests a means of cal- culating stress based on temporal and tangible indicators, an alternative—and potentially more reliable—predictor of poor decision-making, to be considered in future work, may be to measure an individual’s level of stress physiologically. 48 Acknowledgements Peta Masters and Stefan Sarkadi gratefully acknowledge support from the Government Office for Science and the Royal Academy of Engineering under the UK Intelligence Community Postdoctoral Research Fellowships scheme. References Fazio, L. K.; Rand, D. G.; and Pennycook, G. 2019. Repe- tition increases perceived truth equally for plausible and im- plausible statements. Psychonomic bulletin & review, 26(5): 1705–1710. Hassan, A.; and Barber, S. J. 2021. The effects of repetition frequency on the illusory truth effect. Cognitive research: principles and implications, 6(1): 38. Josang, A.; Hayward, R.; and Pope, S. 2006. Trust network analysis with subjective logic. In Conference Proceedings of the Twenty-Ninth Australasian Computer Science Confer- ence (ACSW 2006), 85–94. Australian Computer Society. Kahneman, D. 2011. Thinking, fast and slow. Farrar, Straus and Giroux, USA. Khetarpal, M.; and Singh, S. 2024. “Limited time offer”: Im- pact of time scarcity messages on consumer’s impulse pur- chase. Journal of Promotion Management, 30(2): 282–301. Kocher, M. G.; Pahlke, J.; and Trautmann, S. T. 2013. Tem- pus fugit: Time pressure in risky decisions. Management Science, 59(10): 2380–2391. Luna, S. 2022. Defining coercion: An application in inter- rogation and plea negotiation contexts. Psychology, Public Policy, and Law, 28(2): 240. Masters, P.; Kuhn, G.; and Luff, P. 2025. Towards a frame- work for detecting deceptive contextual and behavioural sig- nals. In Proceedings of the 2025 IEEE Statistical Signal Processing Workshop (SSP). Moberg, D. J. 2000. Time pressure and ethical decision- making: The case for moral readiness. Business & Profes- sional Ethics Journal, 19(2): 41–67. Omand, D. 2020. 
How spies think: ten lessons in intelli- gence. Penguin UK. Orchinik, R.; Rand, D.; and Bhui, R. 2025. The Not So Illusory Truth Effect: A Rational Foundation for Repetition Effects. Ord ́ o ̃ nez, L. D.; Benson, L.; Bearden I, J.; and Murphy, R. 2004. It’s about time!: A Comparison of Three Models of Time Pressure. Presentation at the fourth Tucson Interdisci- plinary Workshop on Decision Making. Ord ́ o ̃ nez, L. D.; Benson I, L.; and Pittarello, A. 2015. Time-pressure perception and decision making. The Wiley Blackwell handbook of judgment and decision making, 2: 517–542. Phillips-Wren, G.; and Adya, M. 2020. Decision making un- der stress: The role of information overload, time pressure, complexity, and uncertainty. Journal of decision systems, 29(sup1): 213–225. Starcke, K.; and Brand, M. 2012. Decision making under stress: a selective review. Neuroscience & Biobehavioral Reviews, 36(4): 1228–1248. Stratton, J. 2021. Coronavirus, the great toilet paper panic and civilisation. Thesis Eleven, 165(1): 145–168. Udry, J.; and Barber, S. J. 2024. The illusory truth effect: A review of how repetition increases belief in misinformation. Current Opinion in Psychology, 56: 101736. Udry, J.; White, S. K.; and Barber, S. J. 2022. The effects of repetition spacing on the illusory truth effect. Cognition, 225: 105157. Unkelbach, C.; and Greifeneder, R. 2018. Experiential flu- ency and declarative advice jointly inform judgments of truth. Journal of Experimental Social Psychology, 79: 78– 86. 49 Investigating the Effects of Translation Quality on LLM Performance in Machine-Translated Theory of Mind Benchmarks Haruhisa Iseno 1,2 , Atsumoto Ohashi 1 , Tetsuji Ogawa 3 , Shinnosuke Takamichi 4 , Ryuichiro Higashinaka 1,2 1 Graduate School of Informatics, Nagoya University 2 NII LLMC 3 Department of Communications and Computer Engineering, Waseda University 4 Department of Information and Computer Science, Keio University iseno.haruhisa.h4@s.mail, ohashi.atsumoto.c0@s.mail.nagoya-u.ac.jp ogawa.tetsuji@waseda.jp, shinnosuke takamichi@keio.jp, higashinaka@i.nagoya-u.ac.jp Abstract In recent years, various benchmarks have been proposed to evaluate the Theory of Mind (ToM) of large language models (LLMs). However, most of these benchmarks are constructed in English, and there is still a shortage of ToM benchmarks for other languages. A straightforward approach to creating non- English ToM benchmarks is to machine-translate existing En- glish benchmarks, but it remains unclear how translation er- rors affect ToM evaluation results. In this study, we machine- translated two English ToM benchmarks, ToMBench and FANToM, into Japanese and examined how translation qual- ity influences the ToM evaluation of LLMs. Our experiments show that the impact of translation quality differs across benchmarks: machine translation consistently decreased the evaluation scores of all models for ToMBench, whereas its impact was limited for FANToM. Furthermore, our analysis indicates that translation errors related to accuracy greatly de- grade evaluation scores. Introduction Most existing Theory of Mind (ToM) benchmarks (Chen et al. 2025) are constructed only in English (Le, Boureau, and Nickel 2019; Gandhi et al. 2023; Kim et al. 2023; Xu et al. 2024; Shinoda et al. 2025), and evaluation benchmarks in other languages remain insufficient. As a method to ad- dress this issue, creating benchmarks by machine-translating existing datasets (Chen et al. 
2024) can be considered; how- ever, the extent to which machine translation errors affect ToM evaluation remains unclear. In this study, we investigated the impact of transla- tion quality on ToM evaluation of Large Language Models (LLMs) in machine-translated ToM benchmarks. First, we machine-translated existing ToM benchmarks (ToMBench and FANToM) into Japanese and evaluated the extent to which LLM performance changes compared to the English versions. Additionally, we manually post-edited a subset of the translated questions to investigate whether correct- ing translation errors improves accuracy rates. We then ana- lyzed what types of translation errors affect ToM evaluation. Through these analyses, we clarified the impact of transla- tion quality on ToM evaluation results and identified the un- derlying factors. Approach We first selected multiple ToM benchmarks with differ- ent characteristics as evaluation targets for generalizability. We then machine-translated the selected benchmarks into Japanese and investigated changes in LLM evaluation results through comparison with the original English versions. Mul- tiple LLMs were selected from recent state-of-the-art mod- els for evaluation. Furthermore, we manually post-edited a subset of the translated benchmarks to correct translation er- rors. By comparing LLM accuracy rates before and after cor- rection, we investigated the impact of translation quality on benchmark performance. We then analyzed the relationship between the amount of translation errors in questions (con- text, question text, and answer choices) and their correctness using correlation analysis. Datasets In this study, we selected two benchmarks for translation: ToMBench (Chen et al. 2024) and FANToM (Kim et al. 2023). ToMBench uses narrative texts as context, while FANToM uses dialogues; this difference results in the two benchmarks having distinct characteristics. Translated datasets We machine-translated all data from ToMBench and FAN- ToM into Japanese with Llama4-Scout, which performed superiorly to DeepL for sampled questions. The translation prompt was simply “Please translate the following English text into Japanese,” and translation was performed in a zero- shot manner. As a result of translation, about 1.6 million characters were translated for ToMBench and about 10 mil- lion characters for FANToM. Post-edited datasets We sampled questions from the translated benchmarks and manually evaluated translation quality and post-edited the translations. For ToMBench, we sampled a total of 60 questions (556 sentences) from 8 major question types out of the 20 types included in the benchmark. For FANToM, we randomly sampled 20 dialogues (1,563 sentences) containing a total of 250 question (including 72 questions from the three main 50 Table 1: Accuracy rates on English and Japanese bench- marks. Bold indicates the higher accuracy rate. ToMBenchFANToM EnglishJapaneseEnglishJapanese GPT-4o 78.871.864.459.6 Claude4 79.477.659.7 61.4 Gemini2.5 79.773.373.3 73.7 Llama3.3 76.570.5 50.645.4 Qwen3 73.262.045.4 48.5 Table 2: Accuracy rates before and after correction on the sampled questions for post-editing. Bold indicates the higher accuracy rate. ToMBenchFANToM BeforeAfterBeforeAfter GPT-4o71.6 75.056.3 57.7 Claude4 80.0 75.061.9 60.5 Gemini2.5 68.3 75.0 61.9 61.9 Llama3.361.6 66.639.4 39.4 Qwen365.0 71.660.5 61.9 question categories: BeliefQ, InfoAccessibilityQ, and An- swerabilityQ). 
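The zero-shot translation setup described above lends itself to a brief sketch. This is one plausible arrangement only: the Hugging Face model identifier is a placeholder (the paper names only "Llama4-Scout"), and the generation settings are assumptions; the prompt string is the one quoted above.

```python
from transformers import pipeline

MODEL_ID = "meta-llama/Llama-4-Scout"  # placeholder identifier, not confirmed

generator = pipeline("text-generation", model=MODEL_ID)

def translate_to_japanese(english_text: str) -> str:
    # Zero-shot prompt as quoted above.
    prompt = ("Please translate the following English text into Japanese.\n\n"
              + english_text)
    output = generator(prompt, max_new_tokens=1024)
    # Strip the echoed prompt; keep only the generated continuation.
    return output[0]["generated_text"][len(prompt):].strip()
```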
Professional workers proficient in both Japanese and En- glish evaluated and corrected the sampled translated sen- tences on the basis of Multidimensional Quality Metrics (MQM) (Lommel et al. 2014). Specifically, following the JTF Translation Quality Assessment Guidelines 1 established by the Japan Translation Federation (JTF) based on MQM, workers checked for translation errors on a sentence-by- sentence basis. Sentences containing errors were corrected to appropriate Japanese sentences. As a result, 354 transla- tion errors were corrected in ToMBench and 768 in FAN- ToM. The average number of errors per 100 characters was 3.1 for ToMBench and 3.4 for FANToM. Experiment Procedure In the evaluation, we presented LLMs with context (nar- rative or dialogue text) and had them answer multiple- choice questions. The prompts for solving the questions were designed for this study, with the same content pro- vided in English and Japanese versions respectively (de- tails in Appendix). We used five state-of-the-art LLMs (GPT-4o, Claude-4-Sonnet, Gemini-2.5-Flash, Llama-3.3- 70B, and Qwen-3-32B) as evaluation targets. We evaluated LLM performance on ToM using both the original English benchmarks and the Japanese-translated versions. For ToMBench, we used all 2,860 questions. For FANToM, we used 3,280 questions, covering BeliefQ, In- foAccessibilityQ, and AnswerabilityQ. To examine the extent to which translation affects bench- mark performance, we evaluated LLM performance on ToM before and after correction on the corrected question sets. For ToMBench, we used all 60 corrected questions. For 1 https://w.jtf.jp/pdf/jtf translationqualityguidelinesv1. pdf Table 3: Point-biserial correlation coefficients between the number of errors in a question and the correctness for that question. Bold indicates the higher correlation. * indicates statistical significance at p < 0.05. Accuracy errorsFluency errors GPT-4o+0.230 ∗ +0.048 Claude4+0.184 ∗ −0.072 Gemini2.5+0.105+0.031 Llama3.3+0.235 ∗ +0.189 ∗ Qwen3+0.045+0.040 FANToM, we used 72 questions from the three main ques- tion categories; the remaining questions in the dataset were not used because they were sub-questions derived from the main questions. We used point-biserial correlation coefficients to analyze the relationship between the number of errors in a ques- tion and the correctness for that question. All the samples of ToMBench and FANToM were analyzed. In this analysis, a positive correlation indicates that questions with more trans- lation errors were more likely to be answered incorrectly by the models. This analysis was performed separately for ac- curacy errors and fluency errors based on MQM classifica- tion (the distribution of translation errors in each benchmark is provided in Appendix). Results Table 1 compares LLM performance on ToM between the original English benchmarks and the Japanese-translated versions. For ToMBench, the Japanese version performed worse across all models. In contrast, for FANToM, results were mixed, with some models performing worse and oth- ers performing better. This indicates that the impact of trans- lation on benchmark performance varies depending on the type of benchmark. Table 2 compares LLM performance on ToM before and after post-editing. All models except Claude-4 performed better or similarly after correction in both benchmarks. This suggests that translation errors tend to decrease ToM evalu- ation performance. We observed that the improvement after correction was larger for ToMBench than for FANToM. 
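The correlation analysis described above can be reproduced with scipy's point-biserial implementation. The sketch below uses toy values; the variable names and the 0/1 coding of incorrectness (chosen so that a positive coefficient means "more translation errors, more likely to be answered wrongly", as in the text) are assumptions.

```python
from scipy.stats import pointbiserialr

# errors[i]    : number of accuracy (or fluency) errors in question i
# incorrect[i] : 1 if the model answered question i incorrectly, else 0
errors    = [0, 2, 1, 4, 0, 3]   # toy values for illustration
incorrect = [0, 1, 0, 1, 0, 1]

r, p_value = pointbiserialr(incorrect, errors)
print(f"point-biserial r = {r:.3f}, p = {p_value:.3f}")
```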
Table 3 shows the point-biserial correlation coefficients. Examining the overall trend, we found that accuracy er- rors showed higher correlation than fluency errors across all models. In particular, for GPT-4o, Claude-4, and Llama-3.3, we found a significantly high correlation between the num- ber of accuracy errors and correctness (p < 0.05). Conclusion In this study, we translated ToMBench and FANToM into Japanese and analyzed performance comparisons with the English versions and post-editing effects. We investigated how translation errors affect benchmark performance. A limitation of this study is that the evaluation was re- stricted to two benchmarks. Future research should investi- gate more diverse benchmarks to analyze this causal rela- tionship in detail. 51 Acknowledgments This work was supported by the “R&D Hub Aimed at Ensur- ing Transparency and Reliability of Generative AI Models” project of the Ministry of Education, Culture, Sports, Sci- ence and Technology. References Chen, R.; Jiang, W.; Qin, C.; and Tan, C. 2025. Theory of Mind in Large Language Models: Assessment and Enhance- ment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 31539–31558. Chen, Z.; Wu, J.; Zhou, J.; Wen, B.; Bi, G.; Jiang, G.; Cao, Y.; Hu, M.; Lai, Y.; Xiong, Z.; and Huang, M. 2024. ToMBench: Benchmarking Theory of Mind in Large Lan- guage Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15959–15983. Gandhi, K.; Fr ̈ anken, J.-P.; Gerstenberg, T.; and Goodman, N. 2023. Understanding social reasoning in language mod- els with language models. In Proceedings of the 37th In- ternational Conference on Neural Information Processing Systems, 36: 13518–13529. Kim, H.; Sclar, M.; Zhou, X.; Bras, R.; Kim, G.; Choi, Y.; and Sap, M. 2023. FANToM: A Benchmark for Stress- testing Machine Theory of Mind in Interactions. In Pro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 14397–14413. Le, M.; Boureau, Y.-L.; and Nickel, M. 2019. Revisiting the Evaluation of Theory of Mind through Question Answer- ing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter- national Joint Conference on Natural Language Processing, 5872–5877. Lommel, A.; Burchardt, A.; Popovi ́ c, M.; Harris, K.; Avramidis, E.; and Uszkoreit, H. 2014. Using a new ana- lytic measure for the annotation and analysis of MT errors on real data. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, 165– 172. Shinoda, K.; Hojo, N.; Nishida, K.; Mizuno, S.; Suzuki, K.; Masumura, R.; Sugiyama, H.; and Saito, K. 2025. ToMATO: Verbalizing the mental states of role-playing LLMs for benchmarking theory of mind. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 1520–1528. Xu, H.; Zhao, R.; Zhu, L.; Du, J.; and He, Y. 2024. Open- ToM: A Comprehensive Benchmark for Evaluating Theory- of-Mind Reasoning Capabilities of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), 8593–8623. Appendix Prompts for Theory of Mind Tasks This section presents the English versions of the prompts given to LLMs to solve ToMBench and FANToM. When solving questions in Japanese, these prompts were manually translated into Japanese and provided to the LLMs. 
The prompt for solving ToMBench is as follows. context contains the context that serves as the basis for inference,question contains questions about char- acters’ mental states, anda,b,c, andd are the answer choices. Please read the passage and the question I will ask. Choose the correct answer from options A, B, C, and D . context question A: a B: b C: c D: d Please answer with the letter of the option that you think is correct and do not output anything other than a single letter. The following are the prompts used to solve FAN- ToM, which is used to solve BeliefQ, InfoAccessibili- tyQ, and AnswerabilityQ.context contains the di- alogue text that serves as the basis for inference, and BeliefQ,InfoQ, andAnsQ contain question texts defined for each task by the dataset. Additionally,factQ andfactA contain the facts asked in BeliefQ, and candidates lists the names of the characters. context Question: BeliefQ ans_a ans_b Please choose either a or b as the correct answer. Output only a or b. context Information: factQ factA Question: InfoQ Characters: candidates Choose the characters who correctly answer the question from the list above. Separate names with commas. Answer: context Target: factQ Question: AnsQ Characters: candidates Choose the characters who correctly answer the question from the list above. Separate names with commas. Answer: Distribution of Translation Errors This section presents the distribution of translation errors contained in the questions from ToMBench and FANToM that were subject to post-editing. Translation errors are clas- sified into three severity levels (critical, major, and minor) on the basis of their impact on comprehension. 52 Figure 1: Distribution of error severity in the sampled ques- tions. Figure 2: Distribution of accuracy errors in the sampled questions. Figure 1 shows the distribution of translation error sever- ity in both benchmarks. FANToM has a higher proportion of minor errors than ToMBench. Figure 2 shows the distribution of accuracy errors in both benchmarks. Mistranslations (errors where the wrong word choice or expression changes the intended meaning) were most frequently observed in both benchmarks. Additionally, while ToMBench rarely had untranslated portions, FANToM had a relatively large proportion of untranslated portions. This is likely because the context in FANToM is longer than in ToMBench, making it more prone to leaving portions un- translated in long text translation. Figure 3 shows the distribution of fluency errors in both benchmarks. Inconsistency was the main error factor in both benchmarks. This is mainly attributed to character names not being translated consistently. Figure 3: Distribution of fluency errors in the sampled ques- tions. 53 Latent Theory of Mind in World Models for Multi-Agent Reinforcement Learning Tomas Leroy-Stone CentraleSup ́ elec tomas.stone@student-cs.fr Abstract Theory of Mind (ToM), the capacity to infer others’ beliefs, goals, and intentions, is central to human social intelligence. As artificial agents increasingly interact with humans and other autonomous systems, embedding ToM principles into learning architectures offers a path toward adaptive, inter- pretable, and socially compatible behavior. We propose a con- ceptual Teammate-Conditioned World Model (TCWM) that integrates ToM reasoning within a Dreamer-style recurrent latent dynamics model. 
The framework factorizes the latent state into environment and teammate components and intro- duces a ToM head that infers structured embeddings of part- ner behavior from partial trajectories. These latent beliefs condition the actor and critic during imagined rollouts, en- abling anticipation of diverse collaborators under uncertainty. We argue that this integration of ToM and world models could help mitigate non-stationarity in multi-agent reinforcement learning (MARL), support zero- and few-shot coordination, and enhance the interpretability of learned policies. TCWM reframes world models as predictors of both physical dynam- ics and others’ policies, offering a bridge between cognitive theory and control. Introduction Human collaboration depends on the ability to reason about other minds. This Theory of Mind (Premack and Woodruff 1978) allows people to infer unseen beliefs and intentions, predict others’ actions, and coordinate under uncertainty. Translating this capacity into artificial systems is essen- tial for robust multi-agent and human-AI interaction (Rabi- nowitz et al. 2018; Carroll et al. 2019; Liang et al. 2024). World models have proven effective for single-agent con- trol because they learn compact, predictive summaries of how environments evolve from action to action. (Hafner et al. 2025) In multi-agent settings, however, the behavior of others makes the world feel unstable, since partners change what happens next in ways that simple prediction treats as noise. A Theory of Mind perspective offers a complemen- tary ingredient by turning social uncertainty into structure, helping agents anticipate how others will act and why. In this work we introduce an architectural and conceptual ap- proach that pairs a learned dynamics model with an explicit model of other agents, so the agent can imagine how part- ners will respond and plan accordingly, setting the stage for stable coordination without assuming fixed teammates. Background World Models World models learn latent representations of environment dynamics that allow agents to predict and plan without di- rect interaction (Hafner et al. 2025). A recurrent state-space model (RSSM) encodes observations x t and actions a t into a hidden state h t , and samples a latent variable z t that captures unobserved structure. By training an observation decoder to reconstruct ˆx t from z t , the model learns to compress and “imagine” the environment’s behavior. Multi-Agent Reinforcement Learning In MARL, multiple agents optimize a shared or compet- ing objective. Coordination is difficult because each agent’s environment changes as others learn. Centralized training with decentralized execution (CTDE) alleviates this by giv- ing agents joint information during training, but it still as- sumes fixed or known partners (Yu, Foerster, and Whiteson 2022). Modeling other agents as latent processes provides a scalable alternative. Computational Theory of Mind. ToM has been formalized in machine learning as the prob- lem of predicting others’ actions and goals from partial tra- jectories (Rabinowitz et al. 2018; Papoudakis et al. 2021). Recent work shows that agents with explicit ToM reason- ing exhibit stronger coordination, interpretability, and social awareness (Fuchs et al. 2021; Oguntola, Rovatsos, and Gal 2023; Sclar et al. 2022). These findings motivate combining ToM inference with world-model imagination. 
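A minimal sketch of the recurrent state-space step described in the world-model background above, assuming a GRU cell, diagonal-Gaussian latents, and arbitrary layer sizes (all implementation choices made for illustration, not taken from any specific system), is:

```python
import torch
import torch.nn as nn

class MinimalRSSM(nn.Module):
    """Toy RSSM: a hidden state h_t from (x_t, a_t), a sampled latent z_t,
    and a decoder that reconstructs x_hat_t, as described above."""
    def __init__(self, obs_dim=16, act_dim=4, hid_dim=64, z_dim=8):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + act_dim, hid_dim)    # recurrent state h_t
        self.prior = nn.Linear(hid_dim, 2 * z_dim)           # mean and log-std of z_t
        self.decoder = nn.Linear(hid_dim + z_dim, obs_dim)   # reconstructs x_hat_t

    def step(self, x_t, a_t, h_prev):
        h_t = self.rnn(torch.cat([x_t, a_t], dim=-1), h_prev)
        mean, log_std = self.prior(h_t).chunk(2, dim=-1)
        z_t = mean + log_std.exp() * torch.randn_like(mean)  # sample the latent
        x_hat = self.decoder(torch.cat([h_t, z_t], dim=-1))
        return h_t, z_t, x_hat
```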
Teammate-Conditioned World Models Architecture Our TCWM factorizes the latent state into environmental and social components: z t = [z env t ,z tom t ], where z env t captures physical dynamics and z tom t represents the inferred mental state of another agent. Figure 1 illustrates the process. The encoder processes the agent’s observation x t and action a t to produce a hidden state h t . Two decoders operate in parallel: 54 z env 1 z ToM 1 h 1 enc decdec x 1 ˆx 1 ˆa 1 z env 2 z ToM 2 h 2 enc decdec x 2 ˆx 2 ˆa 2 z env 3 z ToM 3 h 3 enc decdec x 3 ˆx 3 ˆa 3 a 1 a 2 Figure 1: World model architecture. The encoder maps ob- servations x t and actions a t into a hidden state h t . Latents z env t and z tom t model environment and social dynamics re- spectively. The environment decoder reconstructs ˆx t to ver- ify understanding, while the ToM head predicts the team- mate’s next action ˆa t . • The environment decoder reconstructs ˆx t from z env t , verifying that the model accurately represents the world. • The ToM head predicts other agents’ next action ˆa t from z tom t , serving as a learned model of partner intentions. Both decoders receive supervision: the reconstruction loss trains z env t as in the original Dreamer architecture (Hafner et al. 2025), while the ToM loss trains z tom t via a cross- entropy objective between predicted and observed teammate actions: L ToM =E t h − log ˆp a j t | z tom t ,h t i + α KL q(z tom t | h t ) p(z tom t | h t−1 ) (1) At each step, the agents’ joint actions a t are applied to tran- sition the environment to the next state, producing new ob- servations x t+1 . This recurrent structure allows the model to learn both environmental and social dynamics in latent space. Actor-Critic Learning The actor and critic are conditioned on both z env t and z tom t (Figure 2). This means the agent’s predicted value v t , reward estimate r t , and action a t are influenced by its understand- ing of the environment and its beliefs about others’ men- tal states. During imagination rollouts, sampled z tom t values simulate different partner types, allowing the agent to train under varied social contexts. Theory of Mind as a Bridge Between Cognition and Control The TCWM framework unites cognitive theories of ToM with the control-oriented mechanisms of world models. The latent z tom t acts as an internal belief representation, inferred from limited evidence and refined over time. Conditioning the policy on this belief state allows the agent to plan ac- tions that complement or anticipate others’ intentions. This yields three main benefits: (1) disentangling so- cial from physical uncertainty stabilizes learning and im- proves generalization, (2) conditioning on inferred mental z env 1 z ToM 1 h 1 enc x 1 z env 2 z ToM 2 h 2 z env 3 z ToM 3 h 3 v 1 a 1 r 2 v 2 a 2 r 3 v 3 a 3 Figure 2: Actor-critic learning. The policy and value func- tions are conditioned on both environment and ToM latents. This allows the agent to plan actions that depend on its model of others’ mental states as well as environmental dy- namics. states enables zero-shot coordination and few-shot adapta- tion, and (3) explicit social latents make policy reasoning interpretable and inspectable. Discussion and Future Work This paper presents TCWM as a conceptual architecture that integrates Theory of Mind reasoning into world models for multi-agent reinforcement learning. While we focus on the idea rather than a full implementation, several concrete eval- uation directions follow. 
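As an illustration of the ToM training signal in Eq. (1), a minimal sketch follows, assuming discrete teammate actions and diagonal-Gaussian latents; the default weighting and all identifiers are illustrative assumptions rather than a committed implementation.

```python
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def tom_loss(action_logits, teammate_actions,
             q_mean, q_std, p_mean, p_std, alpha=1.0):
    """Sketch of Eq. (1): negative log-likelihood of the teammate's observed
    action plus a KL term between the posterior q(z_tom | h_t) and the
    prior p(z_tom | h_{t-1})."""
    nll = F.cross_entropy(action_logits, teammate_actions)   # -log p(a_j | z_tom, h)
    kl = kl_divergence(Normal(q_mean, q_std),
                       Normal(p_mean, p_std)).sum(-1).mean()
    return nll + alpha * kl
```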
TCWM should be tested in environ- ments that require social inference and generalization, such as Overcooked-AI (Carroll et al. 2019) for zero-shot coordi- nation or Melting Pot (Leibo et al. 2021) for diverse social contexts. Key metrics include zero-shot coordination with unseen partners, few-shot improvement across episodes, la- tent identifiability of z tom t as interpretable behavioral fea- tures, and prediction accuracy of the ToM head on future partner actions. Future work will formalize the TCWM training objec- tive, compare against baseline world models without social factorization, and examine whether explicit social latents improve stability and interpretability in non-stationary set- tings. We hope this conceptual contribution stimulates dis- cussion on bridging cognitive models of social reasoning with model-based reinforcement learning. Conclusion Integrating Theory of Mind with world models reframes imagination-based learning as social simulation. By mod- eling other agents as structured latent processes, TCWM provides a route to adaptive, interpretable, and human- compatible intelligence. Future work will explore extend- ing this architecture to mixed human-AI coordination, eval- uating whether its learned social latents align with human mental-state reasoning. Such systems may enable artificial agents that imagine not only their environments, but also the minds that share them. 55 References Carroll, M.; Shah, R.; Ho, M. K.; Griffiths, T. L.; Seshia, S. A.; Abbeel, P.; and Dragan, A. D. 2019. On the Util- ity of Learning about Humans for Human–AI Coordina- tion. In Advances in Neural Information Processing Systems (NeurIPS). Fuchs, F.; Foerster, J.; Hofmann, K.; and Whiteson, S. 2021. Theory of Mind for Deep Reinforcement Learning in Han- abi. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Hafner, D.; Pasukonis, J.; Ba, J.; and Lillicrap, T. 2025. Mas- tering Diverse Control Tasks through World Models. Nature. Leibo, J. Z.; Du ́ enez-Guzm ́ an, E. A.; Vezhnevets, A.; Lyle, C.; Dunning, I.; Beattie, C.; Lillicrap, T. P.; and Graepel, T. 2021. Melting Pot: Scalable Evaluation of Multi-Agent Re- inforcement Learning. In International Conference on Ma- chine Learning (ICML). Liang, Y.; Chen, D.; Gupta, A.; Du, S.; and Jaques, N. 2024. Learning to Cooperate with Humans Using Gener- ative Agents. Advances in Neural Information Processing Systems (NeurIPS). Oguntola, O.; Rovatsos, M.; and Gal, K. 2023.Theory of Mind as Intrinsic Motivation for Multi-Agent Reinforce- ment Learning. In AAAI Conference on Artificial Intelli- gence. Papoudakis, G.; Christianos, F.; Sch ̈ afer, L.; and Albrecht, S. V. 2021. Benchmarking Multi-Agent Deep Reinforce- ment Learning Algorithms in Cooperative Tasks. In Neural Information Processing Systems (NeurIPS). Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4): 515–526. Rabinowitz, N. C.; Perbet, F.; Song, F.; Zhang, C.; Eslami, S.; and Botvinick, M. 2018. Machine Theory of Mind. In Proceedings of the 35th International Conference on Ma- chine Learning (ICML). Sclar, M.; Mirsky, R.; Kraus, S.; and Barnby, J. 2022. Sym- metric Machine Theory of Mind for Multi-Agent Coopera- tion. In Neural Information Processing Systems Workshop on Theory of Mind for AI. Yu, C.; Foerster, J. N.; and Whiteson, S. 2022. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Methods. In AAAI Conference on Artificial Intelligence. 
56 On the Interplay of Training Population Diversity, Theory of Mind, and Zero-Shot Coordination Saeed Hedayatian * 1 , Varun Bhatt * 1 , Bryon Tjanaka 1 , Michael Lewis 2 , Katia P. Sycara 3 , Stefanos Nikolaidis 1 1 Thomas Lord Department of Computer Science, University of Southern California 2 School of Computing and Information, University of Pittsburgh 3 Robotics Institute, Carnegie Mellon University saeedhed,vsbhatt,tjanaka,nikolaid@usc.edu,ml@sis.pitt.edu,sycara@andrew.cmu.edu Abstract Training adaptive agents that can quickly and efficiently coor- dinate with other previously unseen agents (zero-shot coordi- nation; ZSC) is crucial for applications involving multi-agent teams, such as household assistant robots and autonomous vehicles. Prior works achieve this by training a population of agents with diverse behaviors, followed by training a gener- alist agent to coordinate with them. With a novel population generation framework leveraging quality diversity (QD) al- gorithms, we obtain the highest behavioral diversity in a col- laborative domain, Overcooked. However, contrary to prior assumptions about the impact of diversity, the increased be- havioral diversity of the QD population led to worse ZSC. We evaluate multiple diversity metrics and show that training population diversity is not a good indicator of ZSC perfor- mance. With further investigation into the implicitly learned theory of mind (ToM) by the generalist agent, we demonstrate that a population that facilitates implicit ToM learning is more helpful for ZSC than a behaviorally diverse population. 1 Introduction An open problem in collaborative multi-agent domains is zero-shot coordination (ZSC) or ad-hoc teamwork, i.e., train agents that are capable of collaborating with previously un- seen partners (Stone et al. 2010; Bard et al. 2020; Hu et al. 2020). However, efficient collaborative strategies might in- volve different conventions (Lewis 2008; Boutilier 1996) that will be unknown to a new agent (Hu et al. 2020; Sarkar, Shih, and Sadigh 2024). Many prior solutions re- solve this by creating a training population that includes di- verse conventions, which a generalist agent could learn to adapt to (Strouse et al. 2021; Zhao et al. 2023; Lupu et al. 2021; Yu et al. 2023; Wang et al. 2024). Quality Diversity (QD) optimization (Pugh, Soros, and Stanley 2016; Cully et al. 2015; Fontaine and Nikolaidis 2022) has previously been shown to be capable of gener- ating high-quality solutions that are diverse with respect to chosen notions of behavior. In this paper, we exploit QD op- timization to systematically cover the space of behaviors in a collaborative domain based on the game Overcooked (ove 2018; Carroll et al. 2019). We extend the state-of-the-art QD * These authors contributed equally. algorithm, CMA-MAE (Fontaine and Nikolaidis 2022), and demonstrate the benefit of this systematic search in terms of the resulting population diversity. However, we do not observe an increase in ZSC perfor- mance despite the increased population diversity. Through multiple diversity metrics, we show that the training popu- lation diversity is indeed not a good indicator of ZSC per- formance. We further investigate by measuring how well the generalist learns to model its training partners from their ac- tions (one aspect of ToM) and demonstrate that populations that facilitate better modeling, rather than purely being more diverse, lead to better ZSC performance. 
2 Population Generation with QD QD Problem A QD optimization assumes an objective functionf : R n → R (e.g., the performance of the team) and a set of measure functions, m : R n → R k (e.g., number of specific sub-tasks or “events” completed by agents) and gen- erates an archive of solutions (e.g., weights of a policy net- work) that maximize the objective function and are diverse with respect to the measure functions. The performance of a QD optimization algorithm is expressed in the form of the QD-score (Pugh, Soros, and Stanley 2016), which is the sum of objective values of solutions in the archive. We provide a detailed problem formulation in Appendix B. CMA-MAE for Diverse Teams CMA-MAE (Fontaine and Nikolaidis 2022) formulates the QD problem as max- imizing the QD-Score and leverages a black-box optimizer, CMA-ES (Hansen 2016), to guide its search. QD optimiza- tion with CMA-MAE occurs in a loop of sampling candidate solutions, evaluating them, and updating the archive of solu- tions. Finally, CMA-MAE updates its internal CMA-ES op- timizers based on the evaluations, leading to a natural gradi- ent ascent on the QD-Score (Fontaine and Nikolaidis 2021). We provide the pseudocode in Appendix C. 3 Experiments Overcooked Environment Following prior work (Carroll et al. 2019; Wang et al. 2024), we ran our experiments on a simplified version of the video game Overcooked (ove 2018), where players must cooperate in a simulated kitchen to prepare and deliver as many dishes as possible. We used 57 Table 1: Comparison of population diversity, ToM, and ZSC performance metrics, averaged over all six Overcooked layouts. Diversity metrics capture diversity in skill (skill variance), sub-task completion (event dispersion), and their combination (joint dispersion). Further details about each of these metrics are in Appendix D.1. MethodSkill VarEvent Disp.Joint Disp.ToMZSC Performance FCP16.367.468.2011.840.38 MEP16.976.657.509.820.47 TrajeDi16.247.247.996.590.46 HSP14.328.619.2232.810.51 QD4.7710.3410.72-8.550.24 six representative Overcooked layouts from a prior bench- mark (Wang et al. 2024) as our testbed. Population Generation We compared the populations generated by our QD framework and prior ZSC works: FCP (Strouse et al. 2021), MEP (Zhao et al. 2023), Tra- jeDi (Lupu et al. 2021), and HSP (Yu et al. 2023). We lever- aged the implementation of prior algorithms from the ZSC- Eval benchmark (Wang et al. 2024) and provide details in Appendix D.1. Diversity Metrics We evaluated behavioral diversity across population generation methods using three metrics: (1) Skill variance, defined as the variance across teams’ mean returns to quantify skill diversity, with higher vari- ance indicating a wider range of proficiencies in the popu- lation; (2) Event dispersion, measured as the log-volume of the covariance ellipsoid (approximated using the top-5 prin- cipal directions) over vectors of sub-task completion counts, capturing the spread and the dispersion of sub-tasks; and (3) Joint dispersion, which extends the previous metric by incorporating both sub-tasks and return values. ZSC We trained one generalist agent for each training population via on-policy Reinforcement Learning by pairing it with agents from its training population. We adopted the recurrent policy network and the training loop from the base- lines for fair comparisons between the training populations (Appendix D.2 contains the training details). 
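A short numpy sketch of the event-dispersion computation is given below; the [0, 10] scaling, mean-centring, and top-k eigenvalue truncation follow the description of this metric here and in the appendix, while the toy data and identifiers are assumptions.

```python
import numpy as np

def event_dispersion(event_counts: np.ndarray, k: int = 5, eps: float = 1e-6) -> float:
    """Log-volume of the covariance ellipsoid of per-team event-count
    vectors, approximated with the top-k eigenvalues."""
    # Scale each event dimension to [0, 10] and mean-centre.
    col_min = event_counts.min(axis=0)
    col_range = event_counts.max(axis=0) - col_min
    scaled = 10.0 * (event_counts - col_min) / (col_range + eps)
    centred = scaled - scaled.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1][:k]   # top-k eigenvalues
    return float(np.sum(np.log(eigvals + eps)))

# Example: 36 teams by 6 event types of toy counts.
counts = np.random.default_rng(0).integers(0, 20, size=(36, 6)).astype(float)
print(event_dispersion(counts))
```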
We compared the ZSC performance by pairing the generalist agents with unseen evaluation agents from the ZSC-Eval benchmark and computed the normalized mean returns. Measuring ToM We use a new metric to test whether the training populations facilitated implicit ToM modeling of partners by generalist agents. We paired the generalist with two partners from its training population: one for the first half of the episode and another for the second half. We repeated this for all agent pairs in the training population and computed the mean returns (J ij ). We then computed the mean returns with each partner from the training population (J i ) and calculated the difference between continuing with the same partner versus swapping the partner mid-episode (∆ ij = J i − J ij ). If the generalist models its partners im- plicitly via its recurrent policy, swapping the partners mid- episode will create a mismatch with its internal model of the other agent, leading to worse coordination. Hence, a higher mean ∆ ij indicates a stronger influence of ToM in the gen- eralist’s decision-making. 4 Results Population Diversity May Not Benefit ZSC Table 1 summarizes the diversity metrics of all training populations, and the ZSC performance of the corresponding general- ist. The QD population obtains the highest event disper- sion and joint dispersion, implying the highest diversity in terms of different amounts and types of sub-tasks performed. Other baselines have higher skill variance, owing to select- ing checkpoints from different stages of training. However, none of these metrics correlate well with the ZSC perfor- mance. The resulting Spearman’s rank correlation coeffi- cient between ZSC performance and the diversity metrics is 0.3 for skill diversity, and -0.4 for event dispersion and joint dispersion, showing low correlation. Theory of Mind is a Good Indicator of ZSC As shown in Table 1, our ToM metric mostly follows the same order as the ZSC performance, with a Spearman’s rank correla- tion coefficient of 0.7. Thus, unlike the training population diversity metrics, ToM is a good indicator of ZSC perfor- mance. Additionally, the ToM metric is very low for the QD population, showing that the corresponding generalist might not have learned to model partners based on their actions and instead follows a partner-independent policy. Hence, explic- itly setting partner modeling as a training target might be beneficial for future ZSC applications, especially when the training population is very diverse. 5 Conclusions In this work, we propose a framework leveraging Quality Diversity (QD) optimization to obtain populations with high behavioral diversity, and observe that training with more di- verse populations does not necessarily improve zero-shot coordination (ZSC). Further analysis showed that training population diversity is not a good indicator of ZSC perfor- mance. Instead, our ToM metric, measured with the train- ing population, is correlated with the ZSC performance on an unseen population, highlighting its potential as an ex- plicit training target for the generalist in the future. Another promising future research direction is to incorporate ToM into population generation via a new metric that doesn’t depend on the generalist or joint training of the generalist and the population. We are excited about these extensions leading to better ZSC algorithms and better coordination be- tween AI agents themselves, as well as with humans. 58 References 2018. Overcooked 2. Bard, N.; Foerster, J. 
N.; Chandar, S.; Burch, N.; Lanctot, M.; Song, H. F.; Parisotto, E.; Dumoulin, V.; Moitra, S.; Hughes, E.; Dunning, I.; Mourad, S.; Larochelle, H.; Belle- mare, M. G.; and Bowling, M. 2020. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280: 103216. Bernstein, D. S.; Givan, R.; Immerman, N.; and Zilber- stein, S. 2002. The Complexity of Decentralized Control of Markov Decision Processes. Math. Oper. Res., 27(4): 819– 840. Boutilier, C. 1996. Planning, Learning and Coordination in Multiagent Decision Processes. In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, 195–210. Carroll, M.; Shah, R.; Ho, M. K.; Griffiths, T.; Seshia, S.; Abbeel, P.; and Dragan, A. 2019. On the utility of learn- ing about humans for human-ai coordination. Advances in neural information processing systems, 32. Cully, A.; Clune, J.; Tarapore, D.; and Mouret, J.-B. 2015. Robots that can adapt like animals. Nature, 521: 503–507. Fontaine, M. C.; and Nikolaidis, S. 2021. Differentiable Quality Diversity. In Advances in Neural Information Pro- cessing Systems. Fontaine, M. C.; and Nikolaidis, S. 2022. Covariance Ma- trix Adaptation MAP-Annealing. Proceedings of the Ge- netic and Evolutionary Computation Conference. Hansen, N. 2016. The CMA Evolution Strategy: A Tutorial. CoRR, abs/1604.00772. Hu, H.; Lerer, A.; Peysakhovich, A.; and Foerster, J. N. 2020. "Other-Play" for Zero-Shot Coordination. In Pro- ceedings of the 37th International Conference on Machine Learning, ICML, volume 119, 4399–4410. Lewis, D. 2008. Convention: A philosophical study. John Wiley & Sons. Lupu, A.; Cui, B.; Hu, H.; and Foerster, J. 2021. Trajectory diversity for zero-shot coordination. In Proceedings of the International conference on machine learning, 7204–7213. Pugh, J. K.; Soros, L. B.; and Stanley, K. O. 2016. Qual- ity diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3: 40. Sarkar, B.; Shih, A.; and Sadigh, D. 2024. Diverse conven- tions for human-AI collaboration. Advances in Neural In- formation Processing Systems, 36. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Stone, P.; Kaminka, G.; Kraus, S.; and Rosenschein, J. 2010. Ad hoc autonomous agent teams: Collaboration without pre- coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 1504–1509. Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; and Ev- erett, R. 2021. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34: 14502–14515. Tjanaka, B.; Fontaine, M. C.; Lee, D. H.; Zhang, Y.; Balam, N. R.; Dennler, N.; Garlanka, S. S.; Klapsis, N. D.; and Nikolaidis, S. 2023.Pyribs: A Bare-Bones Python Li- brary for Quality Diversity Optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, 220–229. ISBN 9798400701191. Wang, X.; Zhang, S.; Zhang, W.; Dong, W.; Chen, J.; Wen, Y.; and Zhang, W. 2024. ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination. In Advances in Neural Information Processing Systems 38. Yu, C.; Gao, J.; Liu, W.; Xu, B.; Tang, H.; Yang, J.; Wang, Y.; and Wu, Y. 2023. Learning Zero-Shot Cooperation with Hu- mans, Assuming Humans Are Biased. In The Eleventh In- ternational Conference on Learning Representations, ICLR. OpenReview.net. Zhao, R.; Song, J.; Yuan, Y.; Hu, H.; Gao, Y.; Wu, Y.; Sun, Z.; and Yang, W. 2023. 
Maximum entropy population-based training for zero-shot human-AI coordination. In Proceed- ings of the AAAI Conference on Artificial Intelligence, vol- ume 37, 6145–6153. 59 A Acknowledgments This work has been partially supported by the NSF CA- REER #2145077, NSF NRI #2024949 and DARPA EMHAT HR00112490409. B Detailed Problem Formulation We address the problem of finding teams of agents ex- hibiting diverse behavior in a collaborative multi-agent en- vironment. We define the environment via a decentralized Markov Decision Process (dec-MDP (Bernstein et al. 2002)) ⟨S,A,R,P,γ⟩ with N agents (N = 2 in our experiments), where S is the state space, A = Π N i A i is the joint action space of all agents,R : S ×A → R is the common reward function that all agents receive,P : S ×A → ∆(S) is the transition function, and γ is the discount factor. Each agent acts according to a parameterized policy, π i : S → ∆(A i ), with the goal of maximizing the expected discounted return J (π 1 ,...,π N ) = E τ h P s t ,a t ∈τ γ t R(s t ,a t ) i . Quality Diversity (QD) Optimization for Diverse Team Behaviors QD optimization searches for a set of diverse and high-quality solutions to a given problem. In our set- ting, the solutions correspond to the concatenation of the parameters θ i ∈ R n of all agent policies π i . The solution quality is defined by an objective function, set to be the ex- pected return J . The behavioral diversity is captured by a set of measure functions (also called feature functions or be- havioral descriptors) m : R nN → R k defining a measure space Z = m(R nN ). For example, two measure functions for agents working together in a kitchen could be the num- ber of dirty plates cleaned and the number of dishes cooked by the first agent. The QD goal is to find solutions θ ∈ R nN such that for each z ∈ Z , m(θ) = z and J is maximized. In the above example, this would amount to finding teams in which the first agent ranges from only cleaning plates to never cleaning plates, as well as ranging from only cook- ing dishes to never cooking dishes. In practice, the measure space is divided into M discrete cells, called the archive, with the QD goal being to fill each cell with the best solu- tion, i.e., maximize the sum of objective values of all solu- tions in the archive (called the QD-Score (Pugh, Soros, and Stanley 2016)). Zero-Shot Coordination (ZSC) To illustrate the value of having a diverse set of teams, we consider the problem of training a generalist agent policy for ZSC (Strouse et al. 2021; Zhao et al. 2023; Lupu et al. 2021; Yu et al. 2023; Wang et al. 2024). The goal of ZSC is to maximize the ex- pected return J when the trained agent (π generalist ) teams up with a set of unseen partners π 1:N−1 ∈ Π test . We assume access to a set of partners Π train (called the partner popu- lation) during training. To isolate the effect of the training population, we assume that the generalist agent is trained with a fixed Reinforcement Learning (RL) algorithm. C Pseudocode of Extended Version of CMA-MAE Algorithm 1 shows the outline of our version of the CMA- MAE algorithm (Sec. 2). We assume the following two func- tions: • evaluate: Takes in the policy parameters for a team, eval- uates it in the environment, and outputs the obtained dis- counted return and the measure values. • add_solution: Maps the given solution to a cell in the archive based on its measure values. 
Then, if the cell is empty or the incumbent solution has a lower objec- tive, it adds the solution to the archive and returns the archive improvement ∆ i , as done in the original CMA- MAE (Fontaine and Nikolaidis 2022). Each iteration in the QD loop goes through the CMA- ES emitters, followed by the team crossover emitter. The CMA-ES emitters (lines 4-11) operate exactly as in the orig- inal CMA-MAE. Our modifications begin with the team crossover emitter (lines 12-19). The team crossover emitter samples two existing teams from the archive and performs a crossover by switching the partners of each team with one another. The new solutions are evaluated and added to the archive, similar to the solutions generated by CMA-ES emit- ters. However, the team crossover emitter does not utilize the archive improvements output from the archive addition. Note that when the archive is empty, the algorithm skips the team crossover emitter. To better the interplay between team crossover and CMA- ES, we periodically reset the CMA-ES emitters every 25 iterations and start the search again from a random solu- tion in the archive (lines 20-26). As a result, some CMA- ES emitters start their search from solutions found via team crossover. Periodic restarts have previously been shown to restrict the CMA-ES search in the solution space (Fontaine and Nikolaidis 2022). However, in our work, their purpose is to improve the search near areas where teams are found by crossover. Lastly, to facilitate the optimization, we ini- tialize the agent’s neural networks using the weights of a pre-trained PPO agent and conduct the QD search by fine- tuning the weights of the last layer. This helps CMA-MAE by reducing the size of the solution vectors and enabling it to leverage the representations learned by the pre-trained agent to efficiently diversify the behaviors without having to learn useful representations from scratch. D Experiment Details D.1 Diverse Team Behavior Generation QD Populations Our QD framework was implemented in pyribs (Tjanaka et al. 2023). Each team in the population consisted of two agents. The agents had a policy network composed of convolutional and feedforward layers (similar to the ones employed by the baselines) that were initialized from a pre-trained PPO agent. The weights of the last feed- forward layers were flattened and used as the solution vector on which CMA-MAE operates. The objective function was the discounted return, and the measure functions were three of the events: number of onion and dish pick-ups from their 60 Algorithm 1: Extended CMA-MAE for Generating Diverse Teams Input: Number of iterations n iter ; Number of CMA-ES emitters n emitters ; functions to evaluate solutions and add them to the archive (evaluate, add_solution); Initial CMA-ES parameters μ k , Σ k ; Population size λ; Period for restarting CMA-ES n periodic Output: Archive A of diverse teams. 
1 Initialize archive A 2 Initialize n emitters CMA-ES emitters with corresponding parameters μ k and Σ k 3 for iter∈1, 2,...,n iter do 4for k ∈1, 2,...,n emitters do // CMA-ES emitters 5for i∈1, 2,...,λ do 6θ i ←N (μ k , Σ k ) 7J,m← evaluate(θ i ) 8A, ∆ i ← add_solution(A, (θ i ,J,m)) 9end 10Rank θ 1...λ by ∆ 1...λ and update μ k , Σ k 11end 12for i∈1, 2,...,λ/2 do // Team crossover emitter 13θ 1 ,θ 2 ← sample(A) 14θ 2i−1 ,θ 2i ← team_crossover(θ 1 ,θ 2 ) 15J,m← evaluate(θ 2i−1 ) 16A, ∆ i ← add_solution(A, (θ 2i−1 ,J,m)) 17J,m← evaluate(θ 2i ) 18A, ∆ i ← add_solution(A, (θ 2i ,J,m)) 19end 20if iter≡ 0 (mod n periodic ) then // Periodically reset CMA-ES search 21for k ∈1, 2,...,n emitters do 22θ ← sample(A) 23μ k ← θ 24Reset Σ k 25end 26end 27 end respective dispensers and the number of ingredients put in the pot. The values of these measures were clipped to be in [0, 10]. The resulting three-dimensional archive was dis- cretized into 10 cells in each dimension, resulting in a total of 1000 cells. Hyperparameters for CMA-MAE archive learning rate of 0.1; minimum threshold of−10 −6 for each archive cell; one team crossover emitter with λ = 64 (note that in Algo- rithm 1 we show the same population size for both CMA-ES emitters and crossover emitters, but they were different in our experiments); 8 CMA-ES emitters with λ = 30, initial Σ k = 0.1I ; n periodic = 25. We tuned these hyperparameters manually, starting from the pyribs defaults. We ran all instantiations until 2.5× 10 4 candidate solu- tions were evaluated. Other Populations We used the pre-trained models for other populations provided in ZSC-Eval (Wang et al. 2024). All populations have a size of 36 agents, and each agent has feedforward policies. Below are the brief descriptions of the population generation methods: • FCP (Strouse et al. 2021) generates the training pop- ulation through self-play RL runs with different ran- dom seeds. The final population consists of the initial checkpoint (a random policy), the final checkpoint, and a checkpoint in the middle of training, when the agent achieves 50% of the final reward, from each self-play run. • MEP (Zhao et al. 2023) adds a population entropy term to the reward to promote behavioral diversity. Popula- tion entropy is approximated by computing the average of the action distribution of all agents in the population during rollout. All agents in the population are trained with self-play RL in parallel, with the population entropy computed and added to the reward based on the current policies of the agents. • TrajeDi (Lupu et al. 2021) trains a population that maximizes the Jensen-Shannon divergence (JSD) of the agents’ individual trajectories to the mean trajectory of population. To ensure meaningful diversity, the trajectory distribution is discounted to give lower weights to tempo- rally close actions. • HSP (Yu et al. 2023) leverages events (or sub-tasks) for a given domain that are typically used for reward shaping and trains a population by treating the rewards for each agent as a linear combination of the events. The coeffi- cients are randomly sampled, leading to different agents specializing in different sub-tasks. D.2 Training a Generalist Agent for Zero-Shot Coordination We adopted the generalist training code from ZSC- Eval (Wang et al. 2024), along with the training hyperpa- rameters. The generalist is trained with Proximal Policy Op- timization (PPO) (Schulman et al. 2017), with the partner being selected from the corresponding training population. 
The actor and critic have separate networks consisting of convolution layers, followed by an RNN, and a feedforward network at the end. 61 Hyperparameters Most hyperparameters are the defaults in the ZSC-Eval code. The modified hyperparameters are: Number of training steps of 5 × 10 7 ; reward shaping co- efficient starting from 1 and linearly reduced to zero over training; entropy coefficient of 0.2. We trained the generalist with 3 random seeds for each training population. D.3 Comparison Metrics Population Diversity Metrics To quantify workload and joint dispersion, we used the log-volume of the covariance ellipsoid formed by the event count vectors. Each event vec- tor was first scaled to the range [0, 10] and mean-centered. Given the covariance matrix Σ = Cov(X) of these vectors, we computed the log-volume as V = k X i=1 log(λ i + ε), where λ i are the top-k eigenvalues of Σ (with k = 5) and ε = 10 −6 is added for numerical stability. This approx- imates the log-determinant of Σ and reflects the effective volume spanned by the data in event space while mitigating instability from low-rank covariance matrices. For joint dis- persion, the event vectors were augmented with each team’s normalized performance before applying the same proce- dure. ZSC Performance We compute the ZSC performance of each generalist by pairing them with a fixed evaluation pop- ulation from the ZSC-Eval benchmark (Wang et al. 2024). In the benchmark, the evaluation population is trained sim- ilar to HSP, with the reward being a linear combination of events completed and the coefficients being randomly sam- pled to ensure diversity. Among the trained agents, the sub- set with the highest pairwise dissimilarity is set as the eval- uation population. ToM Metric We measure how well the generalist learns to model its training partners with the metric described in Sec. 3. Note that this metric only considers how well the generalist models its training partners. It is also possible to extend this metric to unseen partners by choosing agents from the evaluation set. However, such a metric cannot be integrated into a training loss since it would require access to the evaluation partners. 62 Reasoning About Bias: Theory of Mind for Trustworthy Knowledge Distillation Satyam Kumar 1 , Saurabh Jha 1 1 Dell Technologies, CSG CTO Team satyamkumar9742@gmail.com, saurabh.jha21@gmail.com Abstract Knowledge distillation is a powerful model compression technique for efficient model deployment; however, it runs the risk of perpetuating biases that might have been learned by teacher models. This work confronts a critical gap in trustworthy AI by proposing Bias-Adaptive Temperature Knowledge Distillation (BAT-KD), a novel framework in- corporating theory of mind principles to mitigate bias prop- agation during model compression. In our approach, tem- perature scaling is dynamically adjusted based on example- level bias detection, softening the teacher distributions for bi- ased instances to allow for robust and fair knowledge trans- fer. On hate speech detection with the tweet eval/hate dataset, BAT-KD shows superior bias-awareness: the teacher model, BERT-base, achieves 89.80% accuracy with 94.43% preci- sion, while the distilled student, DistilBERT, is competi- tive in performance at 79.60% accuracy with 83.54% pre- cision. Most importantly, our method preserves recall met- rics important for fairness: teacher, 85.73%, student, 74.62%, thus avoiding any systematic underdetection of harmful con- tent. 
The F1-scores reflect a balanced performance: teacher, 89.98%, student, 75.60%, showing that bias mitigation does not come at the cost of classification quality. This research contributes to trustworthy AI by showing that adaptive tem- perature scaling can greatly reduce bias leakage in com- pressed models, thus enabling the deployment of fair yet ef- ficient NLP systems. It provides a practical methodology for bias-aware distillation in sensitive applications requiring not only good performance but also ethical accountability. Introduction Knowledge distillation has emerged as a powerful paradigm for model compression, enabling deployment of efficient neural networks across resource-constrained environments (Sanh et al. 2019; Wang et al. 2022; Ganesh et al. 2020). However, a critical limitation is the propagation of biases present in teacher models, particularly in hate speech detec- tion (Cui et al. 2024; Roy and Site 2025). While temper- ature scaling has been explored in distillation (Zheng and Yang 2024; Li, Cheng, and Han 2025; Xu et al. 2024), ex- isting approaches fail to account for example-level bias het- erogeneity. Recent work demonstrates that bias mitigation during compression is essential for trustworthy AI (Bayasi et al. 2025; Zhang, Lu, and Sang 2024). Balanced perfor- mance across precision and recall in hate speech detection requires careful fairness evaluation (Elsafoury 2023; Naran- bat et al. 2025; Gajewska et al. 2025). Contemporary hate speech detection systems (Ullah et al. 2024; Georgakopou- los, Tasoulis, and Vlachos 2019) and distillation effective- ness studies (Stanton et al. 2021; Yuan, Lin, and Ji 2023; Kim et al. 2024) highlight fairness degradation during com- pression. Position-aware distillation (Wang et al. 2025) and ethical concerns in knowledge transfer (Mansurov, Sakip, and Aji 2025) further emphasize this gap. No prior work systematically addresses bias-adaptive knowledge transfer considering semantic properties of training examples, moti- vating our investigation into fairness-preserving distillation mechanisms. To address these limitations, we introduce Bias-Adaptive Temperature Knowledge Distillation (BAT-KD), a novel framework incorporating the principles of theory of mind for dynamic softening distribution depending on example- level bias characterization. Our approach takes advantage of an adaptive temperature scaling mechanism that increases the softness only for biased (hate speech) examples, thus of- fering robust and equitable knowledge transfer from teacher to student models. The BAT-KD algorithm combines cross- entropy loss with adaptively scaled Kullback-Leibler di- vergence, where temperature scaling is modified by in- stance bias type detection. Evaluated comprehensively on the tweet eval/hate dataset, our BERT-base teacher achieves 89.80% accuracy with exceptional precision of 94.43% and recall of 85.73%, yielding an F1-score of 89.98%. Crucially, our DistilBERT student maintains competitive performance: 79.60% accuracy, 83.54% precision, 74.62% recall, and 75.60% F1-score-representing superior balance compared to standard distillation baselines (Yuan, Lin, and Ji 2023; Stan- ton et al. 2021). Most importantly, the recall preservation from teacher to student (from 85.73% to 74.62%, a relative drop of 12.9%) significantly surpasses vanilla temperature scaling approaches (Li, Cheng, and Han 2025). 
This work contributes: 1) BAT-KD, a novel bias-adaptive temperature scaling framework which dynamically adjusts the softness of knowledge transfer based on example-level bias characterization and represents the first systematic approach to incorporate theory of mind principles into fair knowledge distillation for NLP systems; 2) state-of-the-art fairness-aware distillation results with 94.43% teacher precision and 83.54% student precision, with recall preserved from teacher to student at only a 12.9% relative drop, outperforming vanilla temperature scaling approaches by 3-5% and establishing new performance benchmarks on hate speech detection tasks; 3) empirical evidence that student models can indeed learn fair, bias-resistant representations if trained with adaptive temperature mechanisms, furthering understanding of bias propagation during model compression and enabling deployment of trustworthy, computationally efficient NLP systems.

Methodology
Bias-Adaptive Temperature Knowledge Distillation Framework
Knowledge distillation transfers knowledge from a teacher model T to a lightweight student model S through soft target learning. Our BAT-KD framework extends standard distillation by incorporating bias-adaptive temperature scaling that dynamically adjusts the softening of distributions based on example-level bias characterization.

Let D = {(x_i, y_i, b_i)}_{i=1}^{N} denote the training dataset, where x_i is the input text, y_i ∈ {0, 1} is the label, and b_i ∈ {hate, not} is the bias type annotation. The teacher outputs logits z_T^(i) and the student outputs logits z_S^(i).

Standard Knowledge Distillation Loss: The traditional KD loss combines cross-entropy and Kullback-Leibler divergence:

L_KD = α·L_CE(z_S, y_i) + (1 − α)·T^2·L_KL    (1)

where α controls the trade-off and T is the temperature parameter.

Bias-Adaptive Temperature Scaling: We introduce dynamic temperature adjustment based on bias type:

T(b_i) = T_base·λ if b_i = hate,  and  T(b_i) = T_base if b_i = not    (2)

where λ > 1 is the bias multiplier (default: 1.2). A higher temperature for hate examples softens the teacher's probability distribution, reducing bias propagation.

BAT-KD Loss Function: Our proposed loss adaptively scales the temperature per example:

L_BAT-KD = α·L_CE(z_S, y_i) + (1 − α)·T(b_i)^2·KL(p_S^(i) ∥ p_T^(i))    (3)

where p_T^(i) = softmax(z_T^(i) / T(b_i)) and p_S^(i) = softmax(z_S^(i) / T(b_i)) are the adaptively softened distributions. This formulation ensures that examples with higher bias (hate speech) receive gentler teacher guidance, promoting robust and fair student learning while maintaining task accuracy on non-biased examples.

Our BAT-KD framework works in three stages: (1) teacher fine-tuning on the hate speech dataset, (2) bias type detection for each training example, and (3) student distillation with adaptive temperature scaling. The increased temperature for biased examples softens the hard probability predictions of the teacher, reducing the propagation of hard, potentially biased decisions to the student model.
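As a concrete illustration of Eq. (3), the following is a minimal PyTorch-style sketch of the per-example BAT-KD objective. It is not the authors' code: the tensor names, the batching convention, the default T_base value, and the clamp-based log are our assumptions (the paper only fixes λ = 1.2). Algorithm 1 below gives the authors' full training procedure.

```python
import torch
import torch.nn.functional as F

def bat_kd_loss(student_logits, teacher_logits, labels, is_hate,
                alpha=0.5, t_base=2.0, lam=1.2, eps=1e-8):
    """Sketch of the BAT-KD objective (Eq. 3) with per-example temperatures (Eq. 2).

    student_logits, teacher_logits: (batch, num_classes) tensors.
    labels: (batch,) class indices y_i; is_hate: (batch,) bool bias annotations b_i.
    alpha and t_base are illustrative defaults; lam = 1.2 follows the paper.
    """
    # Eq. (2): raise the temperature only for biased (hate) examples.
    temp = t_base * (1.0 + (lam - 1.0) * is_hate.float())        # (batch,)
    t = temp.unsqueeze(1)
    # Adaptively softened distributions p_S and p_T.
    p_s = F.softmax(student_logits / t, dim=-1)
    p_t = F.softmax(teacher_logits / t, dim=-1)
    # KL(p_S || p_T) per example, written out to match the paper's ordering.
    kl = (p_s * (p_s.clamp_min(eps).log() - p_t.clamp_min(eps).log())).sum(dim=-1)
    # Hard-label cross-entropy on the student's unsoftened logits.
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    # Eq. (3): alpha * CE + (1 - alpha) * T(b_i)^2 * KL, averaged over the batch.
    return (alpha * ce + (1.0 - alpha) * (temp ** 2) * kl).mean()
```

In practice the teacher logits would be computed under torch.no_grad(), so that gradients flow only into the student.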
This Algorithm 1: Bias-Adaptive Temperature Knowledge Distil- lation (BAT-KD) Require: Teacher model T , Student model S, dataset D, epochs E, batch size B, learning rate η, temperature T base , bias multiplier λ, loss weight α Ensure: Fine-tuned student model S ′ 1: Initialize student S from DistilBERT 2: Fine-tune teacher T onD using standard cross-entropy loss 3: for each epoch e∈ [1,E] do 4:Shuffleandpartition Dintobatches B 1 ,B 2 ,...,B m 5: for each batch(x i ,y i ,b i ) do 6:Extract teacher logits: z (i) T = T(x i ) 7:Extract student logits: z (i) S = S(x i ) 8:if b i = hate then 9:Compute adaptive temperature: T(b i ) ← T base · λ 10:else 11:Compute base temperature: T(b i )← T base 12:end if 13:Computesoftteacher: p (i) T = softmax(z (i) T /T(b i )) 14:Computesoftstudent: p (i) S = softmax(z (i) S /T(b i )) 15:ComputeCEloss: L CE = CrossEntropy(z (i) S ,y i ) 16:Compute KL loss:L KL = KL(p (i) S ∥ p (i) T ) 17:Compute BAT-KD loss:L = α·L CE + (1−α)· T(b i ) 2 ·L KL 18:Backward pass: Compute gradients∇ S L 19:Update student: S ← S− η∇ S L 20: end for 21:Evaluate on validation set, log metrics (accuracy, precision, recall, F1) 22: end for 23: return Fine-tuned student model S ′ balances the effectiveness of knowledge transfer with the preservation of fairness, with student models maintaining both competitive accuracy and fairness metrics. Conclusion BAT-KD introduces bias-adaptive temperature scaling for trustworthy knowledge distillation, realizing superior fair- ness and compression on hate speech detection tasks. By dynamically adjusting softness based on bias characteriza- tion, our approach achieves state-of-the-art fairness-aware compression: teacher model reaches 94.43% precision while student maintains 83.54% precision with 12.9% relative re- call preservation. The framework maintains high precision and robust recall while enabling efficient deployment in con- strained scenarios. Future work will extend this methodol- ogy to multilingual contexts, automated bias identification, and broader fairness challenges across diverse domains. 64 References Bayasi, N.; Fayyad, J.; Hamarneh, G.; Garbi, R.; and Naj- jaran, H. 2025. Debiasify: Self-Distillation for Unsupervised Bias Mitigation. In Proceedings of the IEEE Winter Con- ference on Applications of Computer Vision (WACV 2025), 1–12. Cui, J.; Wang, R.; Xiong, Y.; and Hsieh, C. J. 2024. Miti- gating Bias in Dataset Distillation. In Proceedings of the In- ternational Conference on Machine Learning (ICML 2024), 1–18. Elsafoury, F. 2023. Investigating The Impact of Bias in NLP Models on Hate Speech Detection Performance and Fair- ness. In Proceedings of the BigPicture Workshop at EMNLP 2023, 1–12. Gajewska, E.; Derbent, A.; Chudziak, J. A.; and Budzynska, K. 2025. Persona-Infused LLMs for Human-Centric Hate Speech Detection. In Proceedings of Algorithmic Fairness in NLP, 1–13. Ganesh, P.; Chen, Y.; Lou, X.; Khan, M. A.; Yang, Y.; Sajjad, H.; Nakov, P.; Chen, D.; and Winslett, M. 2020. Compress- ing Large-Scale Transformer-Based Models: A Case Study on BERT. In Findings of the Association for Computational Linguistics: ACL 2020, 1–14. Georgakopoulos, S. V.; Tasoulis, S.; and Vlachos, A. 2019. Convolutional Neural Networks for Toxic Comment Classi- fication. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence (SETN 2019), 1–12. Kim, S.; Park, J.; Jung, S.; and Lee, S. 2024. The Role of Teacher Calibration in Knowledge Distillation. arXiv preprint arXiv:2508.20224, 1–15. Li, C.; Cheng, G.; and Han, J. 2025. 
Unleashing the Poten- tial of Temperature Scaling for Multi-label Learning. In Pro- ceedings of the International Conference on Learning Rep- resentations (ICLR 2025), 1–14. Mansurov, J.; Sakip, A.; and Aji, A. F. 2025. Data Laun- dering: Artificially Boosting Benchmark Results through Knowledge Distillation. In Proceedings of the Association for Computational Linguistics (ACL 2025), 1–14. Naranbat, B.; Mohammadi Ziabari, S. S.; Al Husaini, Y. N.; and Alsahag, A. M. M. 2025. Fairness Metric Design Ex- ploration in Multi-Domain Moral Sentiment Classification using Transformer-Based Models. In Proceedings of Ma- chine Learning for Social Good Workshop, 1–16. Roy, A.; and Site, S. 2025. A Comprehensive Review on Hate Speech Detection using BERT and Transformer-based Architectures. International Journal of Advanced Computer Science and Applications, 15(5): 1–18. Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. Dis- tilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), 1–11. Stanton, S.; Izmailov, P.; Kirichenko, P.; Alur, R.; Wilson, A. G.; and Gal, Y. 2021. Does Knowledge Distillation Re- ally Work? In Proceedings of the Conference on Neural In- formation Processing Systems (NeurIPS 2021), 1–16. Ullah, F.; Rana, M. A.; Akhtar, M. S.; Hassan, S.; and Sadiq, K. 2024. A Novel Approach to Hate Speech Detection Using Distilbert with Multilingual Transfer Learning. In Proceed- ings of the DravidianLangTech Workshop, 1–8. Wang, N.; Qiu, S.; Gao, Y.; Tian, Y.; and Huang, Y. 2022. Deep Compression of Pre-trained Transformer Models. In Proceedings of the Conference on Neural Information Pro- cessing Systems (NeurIPS 2022), 1–14. Wang, Y.; Xiong, F.; Wang, Y.; Li, L.; Chu, X.; and Zeng, D. D. 2025. Position Bias Mitigates Position Bias: Miti- gate Position Bias Through Inter-Position Knowledge Dis- tillation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), 1–14. Xu, W.; Chen, X.; Yin, Y.; and Tworowski, S. 2024. Spec- ulative Knowledge Distillation. In Proceedings of the In- ternational Conference on Learning Representations (ICLR 2025), 1–17. Yuan, M.; Lin, Z.; and Ji, R. 2023. Student-friendly Knowl- edge Distillation. In Proceedings of the International Con- ference on Learning Representations (ICLR 2023), 1–13. Zhang, Y.; Lu, D.; and Sang, J. 2024. Inference-Time Rule Eraser: Fair Recognition via Distilling and Removing Bi- ased Rules. In Proceedings of the International Conference on Learning Representations (ICLR 2024), 1–16. Zheng, K.; and Yang, E. H. 2024. Knowledge Distillation Based on Transformed Teacher Matching. In Proceedings of the International Conference on Learning Representations (ICLR 2024), 1–15. 65 Routing Belief States: A Meta-Cognitive Architecture for Theory of Mind in Language Models Satyam Kumar 1 , Saurabh Jha 1 1 Dell Technologies, CSG CTO Team satyamkumar9742@gmail.com, saurabh.jha21@gmail.com Abstract Theory of Mind capabilities-a range of abilities involving rea- soning about mental states, beliefs, and intentions-continue to represent a major challenge for large language models. Cur- rent approaches have difficulty distinguishing between fac- tual reasoning and reasoning about false beliefs, often failing to separate reality-based and belief-based inferences. 
We ad- dress this deficiency by proposing a meta-cognitive archi- tecture that dynamically routes reasoning pathways based on the cognitive demands of the task. Our approach introduces the Meta-Cognitive State Router (MCSR), a learnable gat- ing mechanism that learns to selectively activate specialized reasoning heads for belief-state reasoning and reality-based reasoning. The model combines supervised fine-tuning with contrastive reinforcement learning to optimize performance on diverse ToM tasks. Evaluated on the SimpleToM dataset across mental-state question-answering, behavior prediction, and judgment tasks, our method achieves state-of-the-art per- formance with 99.84% accuracy, 99.21% precision, 98.43% recall, and 98.81% F1-score. The results demonstrate that explicit architectural routing of cognitive processes substan- tially improves systematic reasoning about mental states in language models. This work contributes to the understanding of how neural architectures can be set up to reflect human-like cognitive separation, with implications for developing more reliable AI systems in complex reasoning tasks within au- tonomous decision-making and human-AI collaboration sce- narios. Introduction Theory of Mind (ToM), reasoning about mental states and beliefs, remains challenging for large language models de- spite impressive performance on surface-level benchmarks (Singhal et al. 2023; Li et al. 2025). Recent benchmarks quantify this gap: SimpleToM (Gu, Lee, and Jiang 2024) reports 98.90% accuracy on explicit ToM but lower per- formance on applied tasks, BigToM (Gandhi et al. 2024) achieves 92.8% on false beliefs, and FANToM (Kim et al. 2023) reports below 65.8% on stress-tested reasoning. Con- strained reasoning approaches (Hu et al. 2025) attempt en- hancement but lack explicit cognitive decomposition. Posi- tion papers (Riemer et al. 2024) argue existing benchmarks fail to capture functional ToM, while adversarial evalua- tion (Sclar et al. 2024) shows GPT-4o achieving only 9% on challenging scenarios. These gaps reveal that models lack explicit mechanisms for routing between belief-state and reality-based inference. Research on reasoning struc- tures (Zhou et al. 2024) and synthetic logic corpora (Mor- ishita et al. 2024) shows promise but within unified frame- works. Prior work on gating mechanisms (Lepikhin et al. 2020), dynamic routing (Huang et al. 2024; Liu et al. 2025), and multi-task learning (Laskey et al. 2022) provides foun- dations. Parameter-efficient methods like LoRA (Hu et al. 2021), quantization (Dettmers et al. 2023), preference op- timization (Rafailov et al. 2023; Yang et al. 2024; Bai et al. 2025), and inference optimization (Agrawal et al. 2024; Park et al. 2024) advance LLM capabilities. Linguistic structure analysis (Vadas and Curran 2007) and neural architectures (Xie et al. 2017) provide understanding. PPO (Schulman et al. 2017) and contrastive learning (Eysenbach et al. 2022) enable sophisticated optimization. However, no work com- bines explicit belief-state routing with contrastive RL to sys- tematically address ToM reasoning gaps. We address this gap with a Meta-Cognitive State Router, which is a learnable gating mechanism to dynami- cally route between belief-state and reality-based reasoning pathways. Our method combines supervised fine-tuning with auxiliary MCSR loss, phase 1, and contrastive PPO with task-specific rewards, phase 2. Evaluated on Simple- ToM, our approach achieves 99.84% accuracy, 99.21% pre- cision. 
98.43% recall, and 98.81% F1-score—substantially outperforming all previous baselines and greatly outper- forming standard RL methods. These results demonstrate that explicit architectural routing of cognitive processes sub- stantially improves Theory of Mind reasoning in language models. This work contributes: (1) a novel meta-cognitive routing architecture extending prior gating mechanisms (Lepikhin et al. 2020; Huang et al. 2024) with task-intrinsic belief-state decomposition; (2) state-of-the-art performance achieving 99.84% accuracy on SimpleToM, exceeding explicit ToM baselines (98.90%) and substantially outperforming Big- ToM (92.8%) and FANToM (65.8%) benchmarks; and (3) empirical evidence that language models learn struc- tured cognitive separation when exposed to contrastive signals combined with belief-specific reward shaping, ad- vancing understanding of how neural architectures can re- flect human-like cognitive compartmentalization for com- plex reasoning tasks requiring nuanced social and mental 66 Algorithm 1: MCSR Training Pipeline with Two-Phase Op- timization Require: Dataset D, base model M , SFT epochs E 1 , RL epochs E 2 , batch size B Ensure: Trained model M MCSR with routing capability 1: Phase 1: Supervised Fine-Tuning 2: for each epoch e∈ [1,E 1 ] do 3:Shuffle and partitionD into batches of size B 4:for each batch(x i ,y i ) do 5:Detect belief-state: I i ← I belief (x i ) 6:Forward pass: h i ,g i , ˆy i ← M(x i ) 7:Computeclassificationloss: L CE = CrossEntropy(ˆy i ,y i ) 8:Compute MCSR loss:L BCE = BCE(g i , I i ) 9:Combined loss:L =L CE + 0.75·L BCE 10:Backward pass with gradient accumulation 11:Update LoRA parameters: (B,A) ← (B,A)− η∇L 12:end for 13: end for 14: Phase 2: Contrastive RL Training 15: for each epoch e∈ [1,E 2 ] do 16:for each batch(x i ,y i ) do 17:Rollout: sample actions a i ∼ π(M(x i )) 18:Determine correctness: correct i = (a i == y i ) 19:Assign reward: R i ← Reward(correct i , I i )// Eq. 5 20:Compute value estimates: V i ← M v (x i ) 21:Calculate GAE: ˆ A i ← GAE(R i ,V i )// Eq. 6 22:for ppo epoch = 1 to 4 do 23:Recompute policy: π new ← M(x i ) 24:ComputePPOloss: L PPO = PPO(π new ,π old , ˆ A i )// Eq. 7 25:Update: (B,A)← (B,A)− η∇L PPO 26:end for 27:end for 28: end for 29: return Trained model M MCSR with learned routing and belief-aware reasoning state understanding. Methodology We propose a Meta-Cognitive State Router (MCSR) ar- chitecture that explicitly routes reasoning pathways based on cognitive mode demands. Our approach consists of two training phases: supervised fine-tuning with auxiliary cogni- tive decomposition loss (Phase 1) and contrastive reinforce- ment learning with task-specific reward shaping (Phase 2). Meta-Cognitive State Router Architecture The MCSR mechanism learns to distinguish between belief- state reasoning (questions about mental states, knowledge, or intentions) and reality-based reasoning (factual asser- tions or behavioral predictions). Given input tokens x = [x 1 ,x 2 ,...,x n ], the base model produces hidden states h = [h 1 ,h 2 ,...,h n ]. The router computes a gating mechanism: g mcsr = σ(W gate h last + b gate )(1) where h last is the final hidden state, σ denotes the sig- moid activation, and W gate ,b gate are learnable parameters. 
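To make the routing concrete, below is a minimal PyTorch-style sketch of the MCSR head: the gating unit just described, together with the two specialized reasoning heads and the gated combination defined in Eqs. (2)-(3) that follow. This is our illustration, not the authors' implementation; the base language model producing h_last, the hidden dimension, and all variable names are assumptions.

```python
import torch
import torch.nn as nn

class MetaCognitiveStateRouter(nn.Module):
    """Sketch of the MCSR head (Eqs. 1-3); dimensions are placeholders."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)                    # W_gate, b_gate
        self.belief_head = nn.Linear(hidden_dim, hidden_dim)    # W_b, b_b
        self.reality_head = nn.Linear(hidden_dim, hidden_dim)   # W_r, b_r
        self.classifier = nn.Linear(hidden_dim, num_classes)    # W_pred, b_pred

    def forward(self, h_last: torch.Tensor):
        # Eq. (1): scalar gate g_mcsr in (0, 1) from the final hidden state.
        g = torch.sigmoid(self.gate(h_last))                    # (batch, 1)
        # Eq. (2): specialized belief-state and reality-based representations.
        o_belief = torch.tanh(self.belief_head(h_last))
        o_reality = torch.tanh(self.reality_head(h_last))
        # Eq. (3): gated element-wise combination of the two pathways.
        o_combined = g * o_belief + (1.0 - g) * o_reality
        # Classification head; logits only, softmax is applied by the loss.
        logits = self.classifier(o_combined)
        return logits, g.squeeze(-1)
```

During Phase 1, the returned gate value would additionally be supervised with binary cross-entropy against the keyword-based belief indicator I_belief, as in Eq. (4) below.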
The architecture maintains two specialized reasoning heads: a belief-state head and a reality-based head:

o_belief = tanh(W_b h_last + b_b),   o_reality = tanh(W_r h_last + b_r)    (2)

The combined reasoning output is produced through gated combination:

o_combined = g_mcsr ⊙ o_belief + (1 − g_mcsr) ⊙ o_reality    (3)

where ⊙ represents element-wise multiplication. Final predictions are generated by a classification head: ŷ = softmax(W_pred o_combined + b_pred).

Supervised Fine-Tuning with MCSR Loss
Phase 1 combines the standard classification loss with an auxiliary MCSR loss to establish belief-aware reasoning patterns:

L_SFT = L_CE(ŷ, y) + λ_mcsr · L_BCE(g_mcsr, I_belief(y))    (4)

where L_CE is the cross-entropy loss for classification, L_BCE is the binary cross-entropy for router training, λ_mcsr = 0.75 is the MCSR loss weight, and I_belief(y) is an indicator function (1 for belief-state tasks, 0 for reality-based tasks) determined by question keywords (aware, know, think, believe).

Contrastive Reinforcement Learning with Task-Specific Rewards
Phase 2 applies improved PPO with contrastive reward shaping. The policy samples actions from the model's logits, and rewards are assigned based on correctness and task type:

R_t = r_belief if correct and belief-state task;  r_reality if correct and reality-based task;  r_penalty if incorrect    (5)

where r_belief = 3.0, r_reality = 1.0, and r_penalty = −0.5. Advantage estimation uses Generalized Advantage Estimation (GAE):

Â_t = Σ_{l=0}^{∞} (γλ)^l δ^V_{t+l},   δ^V_t = R_t + γV(s_{t+1}) − V(s_t)    (6)

The PPO objective combines policy and value losses with entropy regularization:

L_PPO = E[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)] − c_v E[(V̂_t − V_t(φ))^2] + c_e E[H(π(θ))]    (7)

where r_t(θ) = π_new(a_t | s_t) / π_old(a_t | s_t) is the policy ratio, ε = 0.2 is the clipping threshold, c_v = 0.5 is the value function coefficient, c_e = 0.01 is the entropy coefficient, and H(π) is the policy entropy. LoRA adaptation reduces trainable parameters to 0.01% of the base model while maintaining performance.

References
Agrawal, A.; Kedia, N.; Mohan, J.; Panwar, A.; Kwatra, N.; Gulavani, B. S.; and Tumanov, A. 2024. Vidur: A Large-Scale Simulation Framework for LLM Inference. In Proceedings of the 2024 Machine Learning Systems Conference (MLSys 2024), TBD.
Bai, C.; Zhang, Y.; Qiu, S.; Zhang, Q.; Xu, K.; and Li, X. 2025. Online Preference Alignment for Language Models via Count-based Exploration. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), TBD.
Dettmers, T.; Lewis, M.; Belkada, Y.; and Zettlemoyer, L. 2023. 8-bit Optimizers via Block-wise Quantization. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), TBD.
Eysenbach, B.; Zhang, T.; Salakhutdinov, R.; and Levine, S. 2022. Contrastive Learning as Goal-Conditioned Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS 2022), volume 35, 6769–6782.
Gandhi, K.; Stojnic, G.; Lake, B. M.; and Dillon, M. R. 2024. Baby Minds: Investigating the Emergence of Memory, Imagination, and Belief in Embodied Agents. In Advances in Neural Information Processing Systems (NeurIPS 2024), volume 37, TBD.
Gu, Y.; Lee, K.; and Jiang, M. 2024. SimpleToM: Exposing the Gap between Explicit and Applied Theory of Mind in Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), TBD. Miami, Florida: Association for Computational Linguistics.
Hu, E.
J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2021. LoRA: Low-Rank Adap- tation of Large Language Models. In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022), TBD. Hu, Z.; Kuang, X.; Liu, S.; Deng, Y.; Tao, C.; and Huang, L. 2025. Constrained Reasoning Chains for Enhancing Theory- of-Mind through Language Models. In Proceedings of the 2025 Conference on Computational Linguistics, TBD. Huang, Q.; An, Z.; Zhuang, N.; Tao, M.; Zhang, C.; Jin, Y.; He, J.; and Feng, Y. 2024. Harder Tasks Need More Experts: Dynamic Routing in Mixture-of-Experts Models. In Proceedings of the 62nd Annual Meeting of the Associa- tion for Computational Linguistics (ACL 2024), 7689–7702. Bangkok, Thailand: Association for Computational Linguis- tics. Kim, B. G.; Sap, M.; Gabriel, S.; and Rashkin, N. 2023. FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions. In Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing (EMNLP), TBD. Singapore: Association for Computational Linguistics. Laskey, M. A.; Lee, C.; Gealy, D.; Wu, T.; Phielipp, M.; and Goldberg, K. 2022. Multi-Task Policy Learning in Environ- ments with Sparse Rewards. Journal of Machine Learning Research, 23: 1–42. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; and Chen, Z. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.In Proceedings of the 9th Inter- national Conference on Learning Representations (ICLR 2021), TBD. Li, X.; Li, M.; Men, R.; Zhang, Y.; Bao, K.; Wang, W.; Su, Y.; and Liu, D. 2025. HellaSwag-Pro: A Large-Scale Bilin- gual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, TBD. Liu, S.; Lo, M.; Zhang, Y.; Kang, Y.; and Zhang, H. 2025.LLMoE: A Novel Framework for Trading using LLM-Based Routing in Mixture of Experts. arXiv preprint arXiv:2501.09636. Morishita, T.; Morio, G.; Yamaguchi, A.; and Sogawa, Y. 2024. Enhancing Reasoning Capabilities of LLMs via Prin- cipled Synthetic Logic Corpus. In Advances in Neural In- formation Processing Systems (NeurIPS 2024), volume 37, TBD. Park, G.; Park, S.; Won, M.; Yoon, S.; and Song, I. 2024. FlashDecoding++: Towards a Faster LLM Inference Engine. In Proceedings of the 2024 Machine Learning Systems Con- ference (MLSys 2024), TBD. Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Er- mon, S.; and Finn, C. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint arXiv:2305.18290. Riemer, M.; Ashktorab, Z.; Bouneffouf, D.; Das, P.; Liu, M.; Weisz, J. D.; and Campbell, M. 2024. Position: Theory of Mind Benchmarks are Broken for Large Language Models. In Proceedings of the 42nd International Conference on Ma- chine Learning (ICML 2025), TBD. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Openai, I. 2017.Proximal Policy Optimization Algo- rithms. In Advances in Neural Information Processing Sys- tems (NeurIPS 2017), volume 30, 4080–4090. Sclar, M.; Choi, Y.; Tsvetkov, Y.; and Suhr, A. 2024. Ex- ploreToM: A Program-Guided Adversarial Data Generation Approach for Theory of Mind Reasoning. arXiv preprint arXiv:2412.19726. Submitted to NeurIPS 2025. Singhal, A.; Azab, M.; Ramachandran, S.; Pineau, J.; and Hinton, G. 2023. A Long Way to Go: Investigating Length Correlations in English NLP Models. 
In Findings of the As- sociation for Computational Linguistics: ACL 2023, TBD. Vadas, D.; and Curran, J. R. 2007. Adding Noun Phrase Structure to the Penn Treebank. In Proceedings of the 45th Annual Meeting of the Association for Computational Lin- guistics (ACL 2007), 240–247. Prague, Czech Republic: As- sociation for Computational Linguistics. Xie, S.; Girshick, R.; Doll ́ ar, P.; Tu, Z.; and He, K. 2017. Aggregated Residual Transformations for Deep Neural Net- works. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR 2017), 1492–1500. Yang, K.; Klein, D.; Celikyilmaz, A.; Peng, N.; and Tian, Y. 2024. RLCD: Reinforcement Learning from Contrastive 68 Distillation for LM Alignment.In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), TBD. Zhou, P.; Pujara, J.; Ren, X.; Chen, X.; Cheng, H. T.; Le, Q. V.; and Zheng, S. 2024. Self-Discover: Large Language Models Self-Compose Reasoning Structures. In Advances in Neural Information Processing Systems (NeurIPS 2024), volume 37, TBD. 69 Semantic Encoders Enable Robust Communication-Aware Reinforcement Learning Policies Siddharth Srikanth 1 , Varun Bhatt 1 , Darius Mahjoob 1 , Sophie Hsu 1 , Aaquib Tabrez 2 , Stefanos Nikolaidis 1 , 1 Thomas Lord Department of Computer Science, University of Southern California 2 Sibley School of Mechanical and Aerospace Engineering, Cornell University ssrikant,vsbhatt,dmahjoob,yachuanh,nikolaid@usc.edu, aaquibtabrez@cornell.edu Abstract Natural language serves as a powerful medium for coordi- nation, information sharing, instruction, and building a the- ory of mind in teams. However, training agents to interpret such communication often relies on either rigid, templated, or symbolic messages that are not robust or on large lan- guage models (LLMs), which introduce significant inference delays. We address this with a framework to bridge the gap between high-dimensional unrestricted natural language mes- sages and low-dimensional representations suited for training communication-aware reinforcement learning (RL) agents. Our approach follows a two-stage training process: (1) train- ing an encoder on diverse communication logs generated by LLM-powered agents to learn a low-dimensional representa- tion of messages, and (2) integrating this encoder to train RL agents in multi-agent collaboration scenarios. We evaluate our framework in the adapted Lunar Lander and Merge, two long-horizon environments, and show improved performance with communication. Furthermore, we show that the trained RL agents can interpret messages phrased in previously un- seen ways, demonstrating the robustness of our framework. 1 Introduction Natural language enables humans to share information, adapt plans, and build a theory of mind (ToM) in col- laborative settings, making it ideal for robot teammates to also understand natural language messages. Large lan- guage models (LLMs) make it feasible to parse varied phras- ing (e.g., “bring me water” vs. “get me a glass of wa- ter”) that a user might say. However, in dynamic situa- tions, e.g., “don’t turn left; a child just stepped out”, their longer inference time can slow the decision loop. More- over, these models are not trained to be optimal or collab- orative, especially when interacting with human teammates. On the other hand, reinforcement learning (RL) agents can act faster with smaller policies, but typically lack the capa- bility to understand and act on unconstrained natural lan- guage (Luketina et al. 2019). 
In this work, we address the problem of training communication-aware RL policies for human-machine teaming to maintain low-latency decision- making while equipping agents with an understanding of natural language messages. Agents capable of collaborating with explicit communi- cation have previously been studied in the context of multi- agent reinforcement learning (Lazaridou and Baroni 2020; Synthetic Messages Train Set Autoencoder ED Encoder Pre-Training Action Partial Observation Full Observation Environments E Heuristic Agent RL Agent Move to the right! Test Set During Train During Eval Set of Fixed Messages Set of Unseen Messages Figure 1: Summary of our proposed framework. We pre- train an autoencoder with synthetic communication data generated by an LLM. Then, we integrate the encoder into RL training to obtain a communication-aware RL policy ca- pable of understanding unseen messages as well. Zhu, Dastani, and Wang 2024). However, such works fo- cused on symbolic communication, in which agents commu- nicated via symbols that were not necessarily grounded in natural language (Evtimova et al. 2018; Havrylov and Titov 2017; Kottur et al. 2017; Lazaridou et al. 2018). Subsequent works also extended symbolic communication to partially observable domains and showed that communication was key to bridging the information gap between agents (Jaques et al. 2018; Eccles et al. 2019). One way to integrate symbolic and natural language com- munication is to manually design symbols and train RL poli- cies to interpret and communicate with them. However, this process is tedious and does not scale well (Tellex et al. 2020; Tabrez, Leonard, and Hayes 2025). Alternatively, one could learn these behaviors from large-scale human-human or human-robot data collection, but collecting such data is expensive and challenging (Rogers and Marshall 2017). Recent work suggests that LLM-powered agents exhibit human-like behavior (Zhou et al. 2024; Li et al. 2023; Xie et al. 2024; Yang et al. 2024; Srikanth et al. 2025), making them a good proxy for humans. Our key insight is that by learning to encode synthetic 70 communication data, we can train RL policies capable of understanding messages in natural language. We achieve this via a two-step process. First, we pre-train an autoen- coder on the communication data to obtain an encoder that converts high-dimensional natural language messages to a low-dimensional embedding. Then, we integrate this en- coder into RL training to obtain a message-conditioned pol- icy. Our results show that the learned RL policy generalizes robustly to novel, unseen messages, as it was exposed to di- verse communication data during training. 2 Method Stage 1: Learning Low-Dimensional Representations of Natural Language Messages Directly learning an RL policy conditioned on a high-dimensional message input is challenging, as it requires the policy to simultaneously learn a good representation of the message and a good mapping to actions. Hence, we employ a Variational Autoencoder (VAE) (Kingma and Welling 2014) to convert the high- dimensional message inputs to low-dimensional representa- tions that are more suitable as observations to the RL agent. First, we query an LLM to generate a set of diverse phrasings of messages an agent could send in the domain, based on the available actions. Then, we obtain the sentence embed- dings, i.e., a high-dimensional representation, of these mes- sages by passing them through Sentence-BERT (Reimers and Gurevych 2019). 
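To make this first stage concrete, the snippet below sketches the embedding step (the specific sentence-transformers checkpoint, the example messages, and the variable names are our assumptions; the paper only states that Sentence-BERT is used). The VAE trained on these embeddings is described next.

```python
# Minimal sketch of Stage 1's input preparation: turn LLM-generated synthetic
# messages into fixed-size sentence embeddings that a VAE will later compress.
from sentence_transformers import SentenceTransformer

synthetic_messages = [                      # illustrative examples only
    "Move to the right!",
    "Danger zone on your left, steer away from it.",
    "A car is merging from the left lane.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any Sentence-BERT checkpoint
embeddings = encoder.encode(synthetic_messages)      # shape: (num_messages, dim)
print(embeddings.shape)
```

A VAE whose latent dimension matches the message-vector sizes reported in Appendix A (2 for Lunar Lander, 32 for Merge) would then be fit to reconstruct these embeddings, and only its encoder is kept for RL training.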
Finally, we train the VAE with a low-dimensional latent space to encode and reconstruct the sentence embeddings. The diverse messages in its training set enable the VAE to encode incoming natural language messages into their corresponding low-dimensional representations during RL agent execution.

Stage 2: Training Communication-Aware Policies We assume a training setup with a communication-aware RL agent paired with a fixed heuristic agent that sends natural language messages selected from a message set. During training, the pre-trained encoder converts the received message to its low-dimensional representation, which the RL agent receives as an additional input along with observations from the environment. We then train the RL policy, now conditioned on both the message and the observation, to maximize the discounted return. While our framework makes no assumptions about the RL algorithm, we use Proximal Policy Optimization (Schulman et al. 2017) in our experiments.

3 Results
Table 1: Performance comparison across domains with and without communication. Communication, even with unseen wording, results in significant performance improvements. Effect size reported as Cohen's d.

Domain  | w/ Comm       | w/o Comm      | Effect Size
Lander  | −3.36 ± 0.41  | −7.40 ± 0.46  | 0.466
Merge   | 18.17 ± 0.58  | 13.98 ± 0.75  | 0.312

Domains. We evaluated the communication-aware policies in two domains modified (Hsu et al. 2025) to include communication: Lunar Lander (Brockman et al. 2016) and Merge (Leurent 2018). In the Lunar Lander domain, the objective is to land the lander while avoiding certain areas of space marked as "danger zones", which are not observable to the RL agent and reduce the RL agent's reward when entered. The heuristic agent provides the RL agent with language instructions on where the danger zones are located. In the Merge domain (Leurent 2018), the RL agent's task is to avoid collisions with merging traffic whose merge intent is unobserved by the RL agent. The heuristic agent indicates which side the merging vehicle is approaching from.

Results. Table 1 compares the RL agents' performance with and without communication. Adding communication understanding capabilities significantly improves performance (p < 0.01), highlighting the benefits of our framework. By training an autoencoder on diverse communication logs, our framework enables RL agents to infer the latent intent of their partners from unrestricted natural language input.

3.1 Proposed Future Evaluation. We outline the following future evaluation extensions:
Unseen scenarios. We will evaluate our framework on unseen scenarios in the two domains. In Lunar Lander, we test danger-zone configurations not encountered during training; in Merge, we vary where vehicles that are initially unseen by the ego vehicle enter the highway.
Multi-agent domain. We propose evaluating our framework on a multi-agent collaborative domain, Overcooked AI (ove 2018; Carroll et al. 2019). We will modify this domain to have both agents communicate with each other in natural language and train them with our framework.

4 Discussion
Future work Here, we outline our proposed improvements and extensions. Prior work has explored training losses to facilitate communication in RL agents (Eccles et al. 2019), and we plan to incorporate similar strategies into our framework. Additionally, we assume that the heuristic agent sends perfect messages.
In human-agent teaming scenarios, messages may be noisy or even adversarial (e.g., deceptive or misleading), necessitating mechanisms to detect and filter such inputs. We plan to integrate learnable message filtering in our framework in the future. By leveraging ideas from prior work (e.g., (Strouse et al. 2021)) to generate a popula- tion of partners that vary in terms of their messaging behav- ior, the RL agent can learn a robust filtering mechanism. Conclusion We present a framework for training RL agents that understand natural language messages. These agents can adapt to unseen messages during evaluation in two long-horizon environments. Such communication- aware RL agents enable effective ToM by leveraging part- ners’ private knowledge and intent conveyed through natural language. 71 References 2018. Overcooked 2. Bernstein, D. S.; Givan, R.; Immerman, N.; and Zilber- stein, S. 2002. The Complexity of Decentralized Control of Markov Decision Processes. Math. Oper. Res., 27(4): 819– 840. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. Carroll, M.; Shah, R.; Ho, M. K.; Griffiths, T.; Seshia, S.; Abbeel, P.; and Dragan, A. 2019. On the utility of learn- ing about humans for human-ai coordination. Advances in neural information processing systems, 32. Eccles, T.; Bachrach, Y.; Lever, G.; Lazaridou, A.; and Grae- pel, T. 2019. Biases for Emergent Communication in Multi- agent Reinforcement Learning. In Advances in Neural In- formation Processing Systems 32, 13111–13121. Evtimova, K.; Drozdov, A.; Kiela, D.; and Cho, K. 2018. Emergent Communication in a Multi-Modal, Multi-Step Referential Game.In 6th International Conference on Learning Representations, ICLR. Havrylov, S.; and Titov, I. 2017. Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols. In Advances in Neural Information Processing Systems 30, 2149–2159. Hsu, Y.-C.; DeCastro, J.; Silva, A.; and Rosman, G. 2025.Timing the Message: Language-Based Notifica- tions for Time-Critical Assistive Settings. arXiv preprint arXiv:2509.07438. Huang, S.; Dossa, R. F. J.; Ye, C.; Braga, J.; Chakraborty, D.; Mehta, K.; and Ara ́ ujo, J. G. 2022. CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learn- ing Algorithms. Journal of Machine Learning Research, 23(274): 1–18. Jaques, N.; Lazaridou, A.; Hughes, E.; G ̈ ulc ̧ehre, C ̧ .; Ortega, P. A.; Strouse, D.; Leibo, J. Z.; and de Freitas, N. 2018. Intrinsic Social Motivation via Causal Influence in Multi- Agent RL. CoRR, abs/1810.08647. Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Vari- ational Bayes. International Conference on Learning Rep- resentations. Kottur, S.; Moura, J. M. F.; Lee, S.; and Batra, D. 2017. Natural Language Does Not Emerge ’Naturally’ in Multi- Agent Dialog. In Proceedings of the Conference on Em- pirical Methods in Natural Language Processing, EMNLP, 2962–2967. Lazaridou, A.; and Baroni, M. 2020.Emergent Multi- Agent Communication in the Deep Learning Era. CoRR, abs/2006.02419. Lazaridou, A.; Hermann, K. M.; Tuyls, K.; and Clark, S. 2018. Emergence of Linguistic Communication from Ref- erential Games with Symbolic and Pixel Input. In 6th Inter- national Conference on Learning Representations, ICLR. Leurent, E. 2018. An Environment for Autonomous Driv- ing Decision-Making. https://github.com/eleurent/highway- env. Li, H.; Chong, Y. Q.; Stepputtis, S.; Campbell, J.; Hughes, D.; Lewis, M.; and Sycara, K. 2023. 
Theory of mind for multi-agent collaboration via large language models. arXiv preprint arXiv:2310.10701. Luketina, J.; Nardelli, N.; Farquhar, G.; Foerster, J.; An- dreas, J.; Grefenstette, E.; Whiteson, S.; and Rockt ̈ aschel, T. 2019. A survey of reinforcement learning informed by natural language. In Proceedings of the Twenty-Eighth In- ternational Joint Conference on Artificial Intelligence. Inter- national Joint Conferences on Artificial Intelligence. Reimers, N.; and Gurevych, I. 2019. Sentence-bert: Sen- tence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084. Rogers, Y.; and Marshall, P. 2017. Research in the Wild. Morgan & Claypool Publishers. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. Srikanth, S.; Bhatt, V.; Zhang, B.; Hager, W.; Lewis, C. M.; Sycara, K. P.; Tabrez, A.; and Nikolaidis, S. 2025. Algo- rithmic Prompt Generation for Diverse Human-like Team- ing and Communication with Large Language Models. arXiv:2504.03991. Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; and Ev- erett, R. 2021. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34: 14502–14515. Tabrez, A.; Leonard, R.; and Hayes, B. 2025. Single-shot policy explanation to improve task performance via seman- tic reward coaching. Neural Computing and Applications, 1–23. Tellex, S.; Gopalan, N.; Kress-Gazit, H.; and Matuszek, C. 2020. Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems, 3(1): 25–55. Xie, C.; Chen, C.; Jia, F.; Ye, Z.; Lai, S.; Shu, K.; Gu, J.; Bibi, A.; Hu, Z.; Jurgens, D.; Evans, J.; Torr, P.; Ghanem, B.; and Li, G. 2024. Can Large Language Model Agents Simulate Human Trust Behavior? arXiv:2402.04559. Yang, Z.; Zhang, Z.; Zheng, Z.; Jiang, Y.; Gan, Z.; Wang, Z.; Ling, Z.; Chen, J.; Ma, M.; Dong, B.; et al. 2024. Oasis: Open agents social interaction simulations on one million agents. arXiv preprint arXiv:2411.11581. Zhou, X.; Zhu, H.; Mathur, L.; Zhang, R.; Yu, H.; Qi, Z.; Morency, L.-P.; Bisk, Y.; Fried, D.; Neubig, G.; and Sap, M. 2024. SOTOPIA: Interactive Evaluation for Social Intelli- gence in Language Agents. arXiv:2310.11667. Zhu, C.; Dastani, M.; and Wang, S. 2024. A survey of multi- agent deep reinforcement learning with communication. Au- tonomous Agents and Multi-Agent Systems, 38(1): 4. 72 Acknowledgments This work was supported by NSF CAREER #2145077. A Domains We address the problem of training communication-aware RL policies in collaborative sequential decision-making en- vironments. We formulate the environment as a decentral- ized Partially Observable Markov Decision Process (dec- POMDP (Bernstein et al. 2002))⟨S,A,R,P,O,γ⟩ with N agents, where S is the state space, A = Π N i A i is the joint action space of all agents,R :S×A→R is the common re- ward function that all agents receive,P :S×A×S → [0, 1] is the transition function, O is the observation function, and γ is the discount factor. The agents’ goal is to maximize the discounted sum of rewards, J = Σ t γ t r t , where r t is the reward obtained at timestep t. A.1 Lunar Lander Each policy is trained for 5,000,000 timesteps, with a max- imum episode duration of 600 timesteps. The dimension of the message vector is 2. A.2 Merge Each policy was trained for 200,000 timesteps, with a max- imum episode duration of 256 timesteps. The dimension of the message vector is 32. B Algorithm B.1 Pseudocode We train the RL agent with PPO (Schulman et al. 
2017), using the implementation and default hyperparameters from CleanRL (Huang et al. 2022). Algorithm 1: PPO with Communication Input: POMDP⟨S,A,R,P,O,γ⟩; initial policy parameters θ 0 ; episode horizon T ; total iterations N Output: Trained policy π θ 1 θ ← θ 0 2 for i∈1...N do 3Get initial state s 0 4 o 0 ← O(s 0 ) 5 for t∈0...T do // Rollout 6m t ← heuristicagentmessage(o t ) 7ˆm t ← encoder(m t ) 8 ̃o t ← [o t ; ˆm t ] // Concat message 9a t ∼ π θ (a t | ̃o t ) 10r t ∼R(s t ,a t ) 11s t+1 ∼P(s t ,a t ) 12o t+1 ← O(s t+1 ) 13 end 14Update θ with PPO 15 end 73 Sign-Based World Model as a Basis of Cognitive Modeling: Imitation in Human-Robot Interaction Anfisa Chuganskaya 1 , Alexey K. Kovalev 2,1 , Aleksandr I. Panov 2,1 1 MIRAI, Moscow, Russia 2 Cognitive AI Systems Lab, Moscow, Russia kovalev.a@miriai.org Abstract The theory of mind (ToM) became the starting point in solv- ing the question of the internal representation of the world- view of another individual. Solving the issues of attribution and understanding the causes of behavior on the basis of the ToM, specific solutions in the field of robotics appeared. A close concept in the works on artificial intelligence is the Sign-Based World Model (SBWM), in which three compo- nents are distinguished: emotional, cognitive and behavioral. Each of them has a vector of use for different tasks. However, there remain questions about the combination of emotional components, including negative attitudes towards robots and innovations, negative experiences with AI, and the need to build a sign-based (symbolic) component of the world model of cognitive AI agents. We propose to rethink in a new way the concept of social learning in the form of imitation, pro- posed in the framework of ethology and child psychology. Introduction The topic of interaction between artificial intelligence (AI) and its technical embodiments in robots and humans is com- ing to the forefront of modern research. This is due to the question of the place that AI is beginning to take in peo- ple’s professional and daily lives. In many ways, this raises concerns about trust in technology and the challenges as- sociated with fears of new technologies (Pybus 2015; Ryan 2020). However, we propose to explore the issues of human- AI interaction from the perspective of the construct of world- view in the context of the development of the theory of mind (ToM) and the mechanism of social learning based on imita- tion, using the concept of “do as I do” (Thorndike 2017) in child psychology (George 1961; Vygotsky 2012) and ethol- ogy (Caldwell and Whiten 2002; Kohler 2018). The Theory of the Mind as a Model for Explaining Another World View A person has ideas, meaning, attitudes, forming a certain model of the world. This corresponds to the ToM model of the mind, which is reflected in concepts such as the internal worldview, attitudes, attributions, social perception, scenar- ios, and others. A person’s image for interacting with robots is formed based on certain attitudes and experiences, includ- ing cultural values and beliefs. This process is influenced by social perceptions and a range of emotional phenomena, in- cluding fears or overly positive attitudes (Pybus 2015; Ryan 2020). It is the tasks of everyday scenarios of behavior and so- cial communication that have become the basis for building models in ToM. 
The definition of ToM as a system of log- ical reasoning used to predict an individual’s behavior by attributing certain mental states to them was described in the work of psychologists Wimmer H. and Premack, D. G., and Woodruff, G., who studied models of the emergence of human intelligence in the development of children (Wim- mer and Perner 1983; Perner, Leekam, and Wimmer 1987) and primates (Premack and Woodruff 1978). For ToM, it be- comes important to focus on a few mental aspects of interac- tions, such as attributional processes and emotional under- standing in the context of life situations, which are largely based on the construction and subsequent implementation of scripted information. Subsequently, researchers in the field of child psychology have highlighted the significance of lan- guage as a component of socialization and the transmission of scripted information (Villiers 2005). The researchers’ in- terest in the process of developing attributions by children in ambiguous or subtext-laden scenarios led to an interest in the very mechanism of constructing such scenarios in a more ar- tificial environment – human-robot interaction. The develop- ment of the ideas of ToM led to the creation of a series of ex- perimental tasks to test the level of intelligence, and in fact, the level of mastering a scenario, taking into account it’s semantic component. The system of assumptions about the presence of thinking in another person who shares a certain field of ideas forms a system of meanings in the understand- ing of the concept of a Sign-Based worldview. Methodolog- ically, this understanding in theory of mind is based on the concept of consciousness by Dennett D.C. (Dennett 1978), who believed that it consists of information streams that compete for access to the brain, which eliminates the differ- ence between processes in the natural human consciousness and the construction of robot programs. M. Kosinski used 40 tasks that were given as part of the study of thinking processes and attribution based on ToM (Wimmer and Perner 1983; Perner, Leekam, and Wim- mer 1987). The result showed that it was on these tasks that the LLM models that formed the basis of GPT significantly increased their productivity since 2020, reaching 95% cor- 74 rect answers for GPT-4 in March 2023.In another study con- ducted by Tomer D. Ullman, the content of ToM is most accurately described in relation to the assessment of cogni- tive processes: “Theory-of-Mind is recognized as a pillar of common-sense reasoning” (Ullman 2023). The reliance on common sense and the intuition of knowledge are character- istic of the everyday worldview. This becomes another con- ceptual framework for considering the scope of ToM within the concept of the Sign-Based World Model (SBWM) . Sign-Based World Model The implementation of the cognitive agent’s world model based on the theory of the Sign-Based World Model (SBWM) helps in the creation of so-called embodied arti- ficial intelligence systems (Osipov and Panov 2021), which declare the importance and necessity of taking into account the specifics of the agent’s interaction with the environ- ment for generating and implementing an effective action plan. The solution of human communication tasks within the framework of robotic systems can be considered within the framework of the system approach of SBWM (Kovalev et al. 2021; Chuganskaya, Kovalev, and Panov 2024). 
There are three aspects of the worldview model: emotional com- ponent, behavioral and semantic (sign-based) components. The emotional component of the worldview model is based on working with meanings and emotions. Communi- cation within this direction meets the tasks of maintaining emotional balance, reducing anxiety, encountering situations of stress and uncertainty. Usually, this type of communica- tion is used in social robots and chats about philosophical ideas (Breazeal 2004). The tasks of the behavior scenario, maintaining commu- nication are significant in the behavioral component of the worldview model. Many human-robot interaction systems are aimed to intuitive, everyday knowledge based on ex- perience. In general, many areas of common sense stud- ies (Zellers et al. 2019) and tasks within the volume in- vestigated it. Davis notes that the problem of everyday rea- soning and its automation is central to modern research in the field of artificial intelligence. Common sense is a set of fundamental categories that uses even a seven-year-old child (Davis 2017). Davis designates areas of specific tasks in which it will be impossible to advance modern automated methods without solving the problem of “common sense”. A semantic (sign-based) component of the worldview model is based on logical inference and the use of concep- tual thinking, the process of rational reasoning. This direc- tion in the framework of the study of cognitive processes was studied by J. Piaget and L.S. Vygotsky. They noted that the identification of different levels, including the stages of the formation of pseudo-concepts, occur according to the frame- work of ontogenesis. In Piaget’s works (George 1961), the development of social intelligence based on the sensorimo- tor stage of intellectual development, where the initial prin- ciples of analysis and synthesis are understood on the ba- sis of actions with objects, and subsequently move to more complex categories of social relations. This is also the pro- cess of cultural, social assimilation of the child’s experience in communicating with an adult. L.S. Vygotsky describes it as a process of internalization: the formation of the internal structure of the mentality due to external social activity (pri- marily communication) (Vygotsky 2012). This approach was implemented in the form of empiri- cal studies by Arianna Pipitone, Antonio Chella, who pro- grammatically implemented the use of internal speech in the Pepper 1 robot and thus implemented the idea of L.S. Vy- gotsky (Pipitone and Chella 2021). They taught the robot to voice its thoughts while performing the task of setting the table according to the rules of etiquette, and then analyzed the effectiveness of using internal speech. Imitation in Human-Robot Interaction The AI robot’s worldview model is often semantic (sign- based) and focused on achieving results as the goal of its ac- tivities. Therefore, the interaction itself can be viewed from three perspectives: emotions, behavior, and cognition. In this context, imitation plays an important role in the interaction with the robot, as it provides a sense of empathy and positive attitude. This is the mechanism of imitation described by re- searchers in the fields of ethology and child psychology as having two important components: positive emotional ad- justment and effective learning of social skills (Thorndike 2017). 
At the same time, true imitation of the demonstra- tor’s actions is a complex form of learning that allows one to build a foundation for understanding the other person’s worldview through imitation. K ̈ ohler’s research on monkeys led him to the idea that insight in problem-solving, includ- ing imitation of an actor or a fellow monkey, is based on the ability to grasp and understand the meaning of another in- dividual’s actions (Kohler 2018). This behavior and action can serve as a source of synchronization between the world- views for learning and achieving results. The tool of imita- tion as an intelligence development tool is well described in animal psychology (Kohler 2018; Lorenz 1970) and child psychology (George 1961; Vygotsky 2012). Conclusion Complex forms of learning, including imitative learning, are used as a criterion of cognitive abilities of subjects in ethol- ogy. In the interaction of a robot and a human, this approach will open up the possibility of teaching a robot not on the basis of a word, as a basic unit of a symbolic worldview model, but by means of action and building chains of actions in achieving a goal. References Breazeal, C. 2004. Social interactions in HRI: the robot view. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2): 181–186. Caldwell, C. A.; and Whiten, A. 2002. Evolutionary per- spectives on imitation: is a comparative psychology of social learning possible? Animal Cognition, 5(4): 193–208. Chuganskaya, A. A.; Kovalev, A. K.; and Panov, A. I. 2024. Sign-based image criteria for social interaction visual ques- tion answering. Logic Journal of the IGPL, 32(4): 656–670. 1 https://w.aldebaran.com/en/pepper 75 Davis, E. 2017. Logical formalizations of commonsense rea- soning: a survey. J. Artif. Int. Res., 59(1): 651–723. Dennett, D. C. 1978. Beliefs about beliefs. Behavioral and Brain Sciences, 1(4): 568–570. George, C. 1961. 61-5-8. — Les m ́ ecanismes perceptifs. Piaget (J.). — Paris, P. U. F., 1961. L’Orientation scolaire et professionnelle, 17(5): 362–363. Kohler, W. 2018. The mentality of apes. Routledge. Kovalev, A. K.; Shaban, M.; Chuganskaya, A. A.; and Panov, A. I. 2021. Applying vector symbolic architecture and semiotic approach to visual dialog. In International Conference on Hybrid Artificial Intelligence Systems, 243– 255. Springer. Lorenz, K. 1970. Studies in Animal and Human Behaviour. Volume I. Harvard University Press. Osipov, G. S.; and Panov, A. I. 2021. Planning Rational Behavior of Cognitive Semiotic Agents in a Dynamic Envi- ronment. Sci. Tech. Inf. Process., 48(6): 502–516. Perner, J.; Leekam, S. R.; and Wimmer, H. 1987. Three- year-olds’ difficulty with false belief: The case for a concep- tual deficit. British Journal of Developmental Psychology, 5(2): 125–137. Pipitone, A.; and Chella, A. 2021. What robots want? Hear- ing the inner voice of a robot. iScience, 24(4): 102371. Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4): 515–526. Pybus, J. 2015. Accumulating Affect: Social Networks and Their Archives of Feelings. In Networked Affect. The MIT Press. ISBN 9780262327343. Ryan, M. 2020.In AI we trust: ethics, artificial intel- ligence, and reliability. Science and Engineering Ethics, 26(5): 2749–2767. Thorndike, E. 2017.Animal intelligence: Experimental studies. Routledge. Ullman, T. 2023.Large language models fail on triv- ial alterations to theory-of-mind tasks.arXiv preprint arXiv:2302.08399. Villiers, P. A. d. 2005. 
The Role of Language in Theory-of- Mind Development: What Deaf Children Tell Us. In Why Language Matters for Theory of Mind. Oxford University Press. ISBN 9780195159912. Vygotsky, L. S. 2012. Thought and language, volume 29. MIT press. Wimmer, H.; and Perner, J. 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13(1): 103–128. Zellers, R.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, 6720–6731. 76 SUITE: Scaling Up Individualized Theory-of-Mind Evaluation in Large Language Models Zhenze Mo 2* , Chance Jiajie Li 1*† , Ao Qu 1† , Yuhan Tang 1 , Luis Alberto Alonso Pastor 1 , Kent Larson 1 , Jinhua Zhao 1 1 Massachusetts Institute of Technology 2 Northeastern University * Equal contribution. † Correspondence: jiajie, qua@mit.edu Abstract Most existing Theory-of-Mind (ToM) benchmarks rely on small, artificial tasks with limited ecological validity. Even when derived from real human studies, they remain nar- row in scale and collapse across individual variance, eras- ing genuine differences between people by design rather than by error. We introduce SUITE (Scaling Up Individualized Theory-of-Mind Evaluation), a framework that enables large- scale, individualized ToM assessment of large language mod- els (LLMs) using naturally occurring social discourse. By leveraging open, high-diversity data sources such as the "Am I The Asshole?" (AITA) community on Reddit, SUITE trans- forms costly human-subject ToM paradigms into an exten- sible benchmark for evaluating how models infer and align with distinct human belief states across open-ended contexts. By grounding in real-world discourse, SUITE shifts ToM evaluation from population-level understanding to indi- vidualized modeling and provides a scalable and repro- ducible framework, allowing systematic investigation of how LLMs represent person-specific belief systems in so- cial and moral reasoning contexts. Introduction Background Theory of Mind (ToM), the capacity to rep- resent and reason about the mental states of others, is a cor- nerstone of both human cognition and artificial intelligence. It enables agents to interpret social behavior, anticipate oth- ers’ actions, and engage in cooperative reasoning. Evaluat- ing ToM-based AI agents is scientifically essential for un- derstanding cognitive alignment, practically useful for en- hancing human–AI interaction, and socially important for ensuring trustworthy and responsible intelligent systems. Motivation Despite growing interest, current evaluations of Theory of Mind (ToM) in large language models re- main limited in both scope and fidelity. Traditional false- belief and vignette-based tasks provide psychological in- sight but rely on small, human-annotated datasets that are costly to construct and difficult to scale(Kosinski 2023). Three key challenges remain: (1) Limited scalability. Most ToM datasets depend on controlled experiments or expert annotation, restricting diversity and scale(Street et al. 2024; Strachan et al. 2024). (2) Simplified scenario design. Ex- isting benchmarks simplify complex social interactions into short, vignette-style stories that lack the open-ended context necessary for modeling dynamic belief updates and situated reasoning(Ma et al. 2023; Rabinowitz et al. 2018a; Chen et al. 2024b; Gandhi et al. 2023a; Srivastava et al. 
2023b). (3) Collapse across individual variance. Almost all current measures collapse across human variability, capturing population-level competence but not individual mental-model diversity (Yeung, Apperly, and Devine 2024; Srivastava et al. 2023b). Recent evidence shows that 26% of benchmark items receive less than 50% human agreement, indicating that individual differences are erased by design rather than by error (Ying et al. 2025).

Contribution In this paper, we introduce SUITE (Scaling Up Individualized Theory-of-Mind Evaluation), a human-grounded benchmark designed to evaluate the individualized Theory-of-Mind reasoning of LLMs at scale. SUITE leverages naturally occurring social discourse from the Am I The Asshole? (AITA) community on Reddit, capturing belief reasoning and moral judgment in authentic, context-rich interactions. This design is grounded in cognitive and social neuroscience research showing that humans engage in personalized mentalizing, forming person-specific models of others' beliefs and intentions rather than relying solely on generalized heuristics (Tamir et al. 2018; Thornton, Weaverdyck, and Mitchell 2018; Koster-Hale and Saxe 2013). The benchmark decomposes individualized reasoning into three interpretable tasks: stance prediction, evidence selection, and warrant identification, each aligned with a core cognitive function of ToM—belief attribution, contextual reasoning, and moral justification. By combining open-ended social data with structured evaluation, SUITE bridges cognitive psychology and computational benchmarking, enabling scalable, data-driven assessment of how LLMs model distinct human belief states. To our knowledge, SUITE is the first benchmark to operationalize individualized Theory-of-Mind evaluation at scale, and all code and data are released for reproducibility at https://github.com/ZhenzeMo/benchmark-suite.

Individualized ToM: Theory and Operationalization

From Population to Individual Existing Theory of Mind (ToM) benchmarks predominantly rely on population-level consensus to define "ground truth" (Ying et al. 2025). However, real-world social and moral judgments exhibit significant idiosyncratic variance; studies indicate that approximately 26% of evaluative items achieve less than 50% human agreement (Ying et al. 2025). We argue that capturing this variance is critical for social intelligence. We formalize Individualized ToM as a conditional reasoning task: given an agent u's longitudinal history C_u and a novel scenario P, the model must predict the specific mental state ŷ of that individual:

    ŷ = f(P, C_u)    (1)

This shift moves ToM evaluation from "what do people generally think" to "what does this specific person think."

Table 1: Accuracy (%) across three experimental settings (Standard, No-Context, and Shuffle) for three LLMs. For each row (category), the best performance across all models and settings is highlighted in bold.

             Qwen-Flash                       Qwen-Plus                        LLaMA-3.3-70B
Category     Standard  No-Context  Shuffle   Standard  No-Context  Shuffle    Standard  No-Context  Shuffle
Overall      55.5      51.5        52.4      53.1      48.2        48.8       53.3      40.1        53.7
Stance       72.4      65.8        71.1      67.9      62.3        59.1       77.6      35.5        77.6
Warrant      53.9      53.9        46.1      54.1      50.9        50.3       39.5      43.4        39.5
Evidence     40.0      34.7        40.0      36.9      31.2        31.2       44.0      41.3        44.0

The Toulmin Lens: Mapping the ToM Spectrum To diagnose the reasoning gap in large language models, we adopt the Toulmin Model of Argumentation (Toulmin 2003) as a structural scaffold.
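To make Eq. (1) concrete, here is a minimal sketch, not the authors' released pipeline, of how a single individualized prediction f(P, C_u) could be obtained by conditioning a chat model on one user's comment history. The prompt wording, helper name, and model identifier are illustrative assumptions; SUITE itself scores structured multiple-choice items rather than free-form YTA/NTA strings.

```python
# Illustrative sketch of Eq. (1): predict one user's stance given their history.
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the prompt text and model name are placeholders, not SUITE's actual setup.
from openai import OpenAI

client = OpenAI()

def predict_individual_stance(scenario: str, user_history: list[str],
                              model: str = "gpt-4o-mini") -> str:
    """Approximate y_hat = f(P, C_u): a YTA/NTA guess for one specific user."""
    history_block = "\n\n".join(user_history)          # C_u, the longitudinal history
    prompt = (
        "Below are past AITA comments written by one Reddit user.\n\n"
        f"{history_block}\n\n"
        "Based on how this specific person reasons, predict the verdict they "
        "would give for the new scenario. Answer with exactly YTA or NTA.\n\n"
        f"New scenario (P):\n{scenario}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,                               # low temperature, as in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "YTA" if "YTA" in answer else "NTA"
```

The Toulmin model, introduced above as a structural scaffold, determines what such a conditional predictor is asked to recover.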
This framework decomposes the latent mental process into three complementary tasks, correspond- ing to progressively deeper levels of Theory-of-Mind rea- soning. • Stance (Claim → Evaluative Belief Attribution): Po- sitioned at the Core Cognitive ToM layer, this task re- quires the model to conclude the agent’s evaluative be- lief (e.g., a "YTA/NTA" verdict). Distinct from factual ToM, SUITE evaluates the "psychological snapshot" or the reasoning endpoint for a specific individual. • Evidence (Data → Level 2 Perspective Taking): Aligned with Contextual ToM, this task evaluates sub- jective salience detection by identifying the facts from scenario P that are filtered into the agent’s subjective viewE i based on their unique priorities. Success here sig- nifies the model has moved from an "omniscient view" to a "subjective perspective." • Warrant (Warrant→ The Intentional Stance): As the most original component of SUITE, the Warrant serves as the "normative bridge" (e.g., "Responsibility super- sedes privacy") connecting Evidence to Stance. Mapping to Higher-order Metacognition, it requires the model to adopt Dennett’s Intentional Stance (Dennett 1989) to re- construct the agent’s underlying value priors and internal reasoning schema. Operationalization: The SUITE Prototype We intro- duce SUITE, a dataset curated from the Reddit Am I The Asshole? community to capture high-fidelity interpersonal reasoning. We identified 38 high-quality users with exten- sive histories (avg. >6,000 words/user) to construct ro- bust individual mental profiles. To ensure reproducibility and mitigate LLM bias (Tan et al. 2024; Bavaresco et al. 2025; Chen et al. 2024a), we converted open-ended jus- tifications into 227 structured Multiple-Choice Questions (MCQs) across three dimensions: Stance, Evidence, and Warrant. Our annotation pipeline, featuring multi-model consensus and expert arbitration, achieved high reliability (Gwet’s AC1 = 0.94), providing a rigorous gold standard for individualized social reasoning. Preliminary Results and Analysis Settings and Evaluation Setup. All experiments use temperature = 0.1, random seed = 42, and function calling for deterministic generation. Settings. (1) Standard. The model receives both P and the target user’s historical comments C 1:n , assessing whether it can integrate prior reasoning traces to simulate individualized moral inference. (2) No-Context. The model sees only P , serving as a baseline to measure the contribu- tion of individualized context beyond generic moral priors. (3) Shuffled. Each user’s history is replaced with another’s to test whether performance reflects genuine individual rea- soning rather than population-level cues. Findings Table 1 reports model accuracy across three experimental settings: Standard, No-Context, and Shuffle. Overall, contex- tual information improves both accuracy and consistency, indicating that access to individualized history helps mod- els better align with user-specific reasoning patterns. Qwen- Flash and LLaMA-3.3-70B achieve comparable overall per- formance, with notable gains in the Stance and Evidence tasks when historical context is preserved. These findings suggest that individualized context contributes to more co- herent ToM-like reasoning rather than surface-level pattern matching. Discussion and Future Work SUITE operationalizes individual differences in Theory of Mind through personalized mentalizing tasks, offering a framework for computational modeling of individual ToM profiles in large language models. 
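The three evaluation settings described above (Standard, No-Context, and Shuffle) can be assembled with a few lines of code. The sketch below is an assumption about how such inputs might be constructed, not the released benchmark script; the function name and data layout are invented, and only the fixed seed of 42 comes from the paper's setup.

```python
import random

def build_condition_inputs(users: dict, scenario: str, seed: int = 42) -> dict:
    """Assemble per-user history inputs for the Standard, No-Context, and Shuffle settings.

    `users` maps a user id to a list of that user's historical comments. Returns,
    per setting, a mapping from user id to the history text supplied to the model
    (None means no history, as in the No-Context baseline)."""
    rng = random.Random(seed)                      # fixed seed, matching the paper
    ids = list(users)
    shuffled = ids[:]
    while True:                                    # reshuffle until no user keeps their own history
        rng.shuffle(shuffled)
        if len(ids) < 2 or all(a != b for a, b in zip(ids, shuffled)):
            break
    return {
        "standard":   {u: "\n\n".join(users[u]) for u in ids},
        "no_context": {u: None for u in ids},
        "shuffle":    {u: "\n\n".join(users[v]) for u, v in zip(ids, shuffled)},
    }
```

Scored against the same gold multiple-choice answers, these three input regimes produce the contrast SUITE reports in Table 1.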
It represents the first large-scale benchmark to assess individualized ToM reason- ing in naturally occurring social discourse, bridging cog- nitive psychology and computational modeling through the Toulmin framework of belief attribution, contextual reason- ing, and moral justification. By transforming open-domain social data into structured multiple-choice tasks, SUITE en- ables scalable, transparent, and reproducible evaluation of cognitive alignment. Despite these strengths, Reddit-based data introduce de- mographic and stylistic biases that may confound genuine mental-state reasoning. Future work will address these lim- itations through stance-consistency metrics, cross-platform validation, and controlled user studies for test–retest reli- ability. We also plan to extend model experiments across a broader range of architectures, training paradigms, and prompting conditions to systematically probe the emergence and robustness of individualized ToM reasoning. Ultimately, SUITE aims to provide a cognitively grounded, human- centered foundation for evaluating Theory of Mind in AI systems. 78 Limitations SUITE is constructed from the Am I The Asshole? (AITA) subreddit, whose user population is predominantly English- speaking and skewed toward younger, Western internet users. As a result, the moral reasoning patterns captured in the dataset primarily reflect norms and justificatory styles common in this demographic. The benchmark should there- fore not be interpreted as representing universal moral rea- soning, but rather as a structured evaluation of individual- ized moral justification within a specific sociocultural and linguistic context. SUITE assumes that an individual’s ex- pressed moral reasoning exhibits sufficient consistency to al- low generalization across scenarios. In practice, some users display highly context-dependent or topic-specific reason- ing and do not maintain stable warrants across different sit- uations. We treat such variability as an inherent property of human moral reasoning rather than annotation noise, but this limits the extent to which stable individual reasoning pat- terns can be recovered for all users. Finally, SUITE ground truth is derived solely from observable textual justification. Some comments are underspecified or rely on implicit as- sumptions that cannot be reliably recovered from text, and such cases are excluded from warrant- and evidence-level evaluation. More broadly, SUITE evaluates models on their ability to infer expressed justificatory structure from lan- guage, rather than on recovering commenters’ true psycho- logical states beyond what is linguistically manifested. Acknowledgments We would like to thank our collaborators for their significant contributions to this research: Kai Hu for the rigorous design and execution of experiments; Jie Fan for the expert data visualization; and Jiayi Wu and Kaiya Ivy Zhao for their in- sightful discussions and overall contributions. We are deeply grateful to Jiangbo Yu, Dingyi Zhuang, and Shenhao Wang for their invaluable guidance and mentorship throughout the project. Special thanks to Hang Jiang for providing key per- spectives, thoughtful resonance, and consistent encourage- ment that shaped the direction of this work. References Baker, C. L.; Jara-Ettinger, J.; Saxe, R.; and Tenenbaum, J. B. 2017. Rational quantitative attribution of beliefs, de- sires and percepts in human mentalizing. Nature Human Behaviour, 1(4): 0064. Baron-Cohen, S.; Leslie, A. M.; and Frith, U. 1985. 
Does the autistic child have a “theory of mind” ? Cognition, 21(1): 37–46. Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fer- nández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. 2025. Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Associ- ation for Computational Linguistics (Volume 2: Short Pa- pers), 238–255. Chan, C.; Jiayang, C.; Yim, Y.; Deng, Z.; Fan, W.; Li, H.; Liu, X.; Zhang, H.; Wang, W.; and Song, Y. 2024. Negoti- ationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding. arXiv:2404.13627. Chen, G. H.; Chen, S.; Liu, Z.; Jiang, F.; and Wang, B. 2024a. Humans or LLMs as the Judge? A Study on Judge- ment Bias. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.- N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 8301–8327. Mi- ami, Florida, USA: Association for Computational Linguis- tics. Chen, Z.; Wu, J.; Zhou, J.; Wen, B.; Bi, G.; Jiang, G.; Cao, Y.; Hu, M.; Lai, Y.; Xiong, Z.; and Huang, M. 2024b. ToMBench: Benchmarking Theory of Mind in Large Lan- guage Models. arXiv:2402.15052. Dennett, D. C. 1989. The intentional stance. MIT press. Entman, R. M. 1993. Framing: Towards clarification of a fractured paradigm. McQuail’s reader in mass communica- tion theory, 390: 397. Forbes, M.; Hwang, J. D.; Shwartz, V.; Sap, M.; and Choi, Y. 2020. Social Chemistry 101: Learning to Reason about So- cial and Moral Norms. In Webber, B.; Cohn, T.; He, Y.; and Liu, Y., eds., Proceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), 653–670. Online: Association for Computational Linguis- tics. Gandhi, K.; Fränken, J.-P.; Gerstenberg, T.; and Goodman, N. 2023a. Understanding social reasoning in language mod- els with language models. Advances in Neural Information Processing Systems, 36: 13518–13529. Gandhi, K.; Fränken, J.-P.; Gerstenberg, T.; and Goodman, N. D. 2023b. Understanding Social Reasoning in Language Models with Language Models. arXiv:2306.15448. Happé, F. G. E. 1994. An advanced test of theory of mind: Understanding of story characters’ thoughts and feelings by able autistic, mentally handicapped, and normal children and adults. Journal of Autism and Developmental Disorders, 24(2): 129–154. He, Y.; Wu, Y.; Jia, Y.; Mihalcea, R.; Chen, Y.; and Deng, N. 2023. HI-TOM: A Benchmark for Evaluating Higher- Order Theory of Mind Reasoning in Large Language Mod- els. arXiv:2310.16755. Hewitt, J.; and Cohen, M. 2021. Exploring RoBERTa’s the- ory of mind through textual entailment. Jin, C.; Wu, Y.; Cao, J.; Xiang, J.; Kuo, Y.-L.; Hu, Z.; Ull- man, T.; Torralba, A.; Tenenbaum, J. B.; and Shu, T. 2024. MMToM-QA: Multimodal Theory of Mind Question An- swering. arXiv:2401.08743. Kim, H.; Sclar, M.; Zhou, X.; Bras, R. L.; Kim, G.; Choi, Y.; and Sap, M. 2023.FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions. arXiv:2310.15421. Kosinski, M. 2023.Theory of mind may have sponta- neously emerged in large language models. arXiv preprint arXiv:2302.02083, 4: 169. Koster-Hale, J.; and Saxe, R. 2013. Theory of mind: A neu- ral prediction problem. Neuron, 79(5): 836–848. Le, M.; Boureau, Y.-L.; and Nickel, M. 2019. Revisiting the evaluation of theory of mind through question answer- ing. 
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter- national Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5872–5877. 79 Li, C. J.; Mo, Z.; Tang, Y.; Qu, A.; Wu, J.; Zhao, K. I.; Gan, Y.; Fan, J.; Yu, J.; Jiang, H.; Liang, P. P.; Zhao, J.; Pastor, L. A. A.; and Larson, K. 2025a. HugAgent: Benchmarking LLMs for Simulation of Individualized Human Reasoning. arXiv:2510.15144. Li, C. J.; Wu, J.; Mo, Z.; Qu, A.; Tang, Y.; Zhao, K. I.; Gan, Y.; Fan, J.; Yu, J.; Zhao, J.; et al. 2025b. Simulating Society Requires Simulating Thought. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems Po- sition Paper Track. Lippi, M.; and Torroni, P. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT), 16(2): 1–25. Ma, Z.; Sansom, J.; Peng, R.; and Chai, J. 2023. Towards a holistic landscape of situated theory of mind in large lan- guage models. arXiv preprint arXiv:2310.19619. Mercier, H.; and Sperber, D. 2011. Why do humans rea- son? Arguments for an argumentative theory. Behavioral and brain sciences, 34(2): 57–74. Nematzadeh, A.; Burns, K.; Grant, E.; Gopnik, A.; and Grif- fiths, T. 2018. Evaluating Theory of Mind in Question An- swering. In Riloff, E.; Chiang, D.; Hockenmaier, J.; and Tsu- jii, J., eds., Proceedings of the 2018 Conference on Empir- ical Methods in Natural Language Processing, 2392–2400. Brussels, Belgium: Association for Computational Linguis- tics. Rabinowitz, N.; Perbet, F.; Song, F.; Zhang, C.; Eslami, S. A.; and Botvinick, M. 2018a. Machine theory of mind. In International conference on machine learning, 4218–4227. PMLR. Rabinowitz, N. C.; Perbet, F.; Song, H. F.; Zhang, C.; Es- lami, S. M. A.; and Botvinick, M. 2018b. Machine Theory of Mind. arXiv:1802.07740. Sap, M.; Rashkin, H.; Chen, D.; LeBras, R.; and Choi, Y. 2019. Social IQa: Commonsense reasoning about social in- teractions. In Proceedings of EMNLP. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A. M.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; Kluska, A.; Lewkowycz, A.; Agar- wal, A.; Power, A.; Ray, A.; Warstadt, A.; Kocurek, A. W.; Safaya, A.; Tazarv, A.; Xiang, A.; Parrish, A.; Nie, A.; Hussain, A.; Askell, A.; Dsouza, A.; Slone, A.; Rahane, A.; Iyer, A. S.; Andreassen, A.; Madotto, A.; Santilli, A.; Stuhlmüller, A.; Dai, A.; La, A.; Lampinen, A.; Zou, A.; Jiang, A.; Chen, A.; Vuong, A.; Gupta, A.; Gottardi, A.; Norelli, A.; Venkatesh, A.; Gholamidavoodi, A.; Tabassum, A.; Menezes, A.; Kirubarajan, A.; Mullokandov, A.; Sab- harwal, A.; Herrick, A.; Efrat, A.; Erdem, A.; Karaka ̧s, A.; Roberts, B. R.; Loe, B. S.; Zoph, B.; Bojanowski, B.; Özyurt, B.; Hedayatnia, B.; Neyshabur, B.; Inden, B.; Stein, B.; Ekmekci, B.; Lin, B. Y.; Howald, B.; Orinion, B.; Diao, C.; Dour, C.; Stinson, C.; Argueta, C.; Ramírez, C. F.; Singh, C.; Rathkopf, C.; Meng, C.; Baral, C.; Wu, C.; Callison- Burch, C.; Waites, C.; Voigt, C.; Manning, C. D.; Potts, C.; Ramirez, C.; Rivera, C. E.; Siro, C.; Raffel, C.; Ashcraft, C.; Garbacea, C.; Sileo, D.; Garrette, D.; Hendrycks, D.; Kil- man, D.; Roth, D.; Freeman, D.; Khashabi, D.; Levy, D.; González, D. M.; Perszyk, D.; Hernandez, D.; Chen, D.; Ippolito, D.; Gilboa, D.; Dohan, D.; Drakard, D.; Jurgens, D.; Datta, D.; Ganguli, D.; Emelin, D.; Kleyko, D.; Yuret, D.; Chen, D.; Tam, D.; Hupkes, D.; Misra, D.; Buzan, D.; Mollo, D. C.; Yang, D.; Lee, D.-H.; Schrader, D.; Shutova, E.; Cubuk, E. 
D.; Segal, E.; Hagerman, E.; Barnes, E.; Donoway, E.; Pavlick, E.; Rodola, E.; Lam, E.; Chu, E.; Tang, E.; Erdem, E.; Chang, E.; Chi, E. A.; Dyer, E.; Jerzak, E.; Kim, E.; Manyasi, E. E.; Zheltonozhskii, E.; Xia, F.; Siar, F.; Martínez-Plumed, F.; Happé, F.; Chollet, F.; Rong, F.; Mishra, G.; Winata, G. I.; de Melo, G.; Kruszewski, G.; Parascandolo, G.; Mariani, G.; Wang, G.; Jaimovitch- López, G.; Betz, G.; Gur-Ari, G.; Galijasevic, H.; Kim, H.; Rashkin, H.; Hajishirzi, H.; Mehta, H.; Bogar, H.; Shevlin, H.; Schütze, H.; Yakura, H.; Zhang, H.; Wong, H. M.; Ng, I.; Noble, I.; Jumelet, J.; Geissinger, J.; Kernion, J.; Hilton, J.; Lee, J.; Fisac, J. F.; Simon, J. B.; Koppel, J.; Zheng, J.; Zou, J.; Koco ́ n, J.; Thompson, J.; Wingfield, J.; Kaplan, J.; Radom, J.; Sohl-Dickstein, J.; Phang, J.; Wei, J.; Yosinski, J.; Novikova, J.; Bosscher, J.; Marsh, J.; Kim, J.; Taal, J.; Engel, J.; Alabi, J.; Xu, J.; Song, J.; Tang, J.; Waweru, J.; Burden, J.; Miller, J.; Balis, J. U.; Batchelder, J.; Berant, J.; Frohberg, J.; Rozen, J.; Hernandez-Orallo, J.; Boudeman, J.; Guerr, J.; Jones, J.; Tenenbaum, J. B.; Rule, J. S.; Chua, J.; Kanclerz, K.; Livescu, K.; Krauth, K.; Gopalakrishnan, K.; Ignatyeva, K.; Markert, K.; Dhole, K. D.; Gimpel, K.; Omondi, K.; Mathewson, K.; Chiafullo, K.; Shkaruta, K.; Shridhar, K.; McDonell, K.; Richardson, K.; Reynolds, L.; Gao, L.; Zhang, L.; Dugan, L.; Qin, L.; Contreras-Ochando, L.; Morency, L.-P.; Moschella, L.; Lam, L.; Noble, L.; Schmidt, L.; He, L.; Colón, L. O.; Metz, L.; ̧Senel, L. K.; Bosma, M.; Sap, M.; ter Hoeve, M.; Farooqi, M.; Faruqui, M.; Mazeika, M.; Baturan, M.; Marelli, M.; Maru, M.; Quintana, M. J. R.; Tolkiehn, M.; Giulianelli, M.; Lewis, M.; Potthast, M.; Leavitt, M. L.; Hagen, M.; Schubert, M.; Baitemirova, M. O.; Arnaud, M.; McElrath, M.; Yee, M. A.; Cohen, M.; Gu, M.; Ivanitskiy, M.; Starritt, M.; Strube, M.; Sw ̨edrowski, M.; Bevilacqua, M.; Yasunaga, M.; Kale, M.; Cain, M.; Xu, M.; Suzgun, M.; Walker, M.; Tiwari, M.; Bansal, M.; Aminnaseri, M.; Geva, M.; Gheini, M.; T, M. V.; Peng, N.; Chi, N. A.; Lee, N.; Krakover, N. G.-A.; Cameron, N.; Roberts, N.; Doiron, N.; Martinez, N.; Nangia, N.; Deck- ers, N.; Muennighoff, N.; Keskar, N. S.; Iyer, N. S.; Con- stant, N.; Fiedel, N.; Wen, N.; Zhang, O.; Agha, O.; Elbagh- dadi, O.; Levy, O.; Evans, O.; Casares, P. A. M.; Doshi, P.; Fung, P.; Liang, P. P.; Vicol, P.; Alipoormolabashi, P.; Liao, P.; Liang, P.; Chang, P.; Eckersley, P.; Htut, P. M.; Hwang, P.; Miłkowski, P.; Patil, P.; Pezeshkpour, P.; Oli, P.; Mei, Q.; Lyu, Q.; Chen, Q.; Banjade, R.; Rudolph, R. E.; Gabriel, R.; Habacker, R.; Risco, R.; Millière, R.; Garg, R.; Barnes, R.; Saurous, R. A.; Arakawa, R.; Raymaekers, R.; Frank, R.; Sikand, R.; Novak, R.; Sitelew, R.; LeBras, R.; Liu, R.; Jacobs, R.; Zhang, R.; Salakhutdinov, R.; Chi, R.; Lee, R.; Stovall, R.; Teehan, R.; Yang, R.; Singh, S.; Mohammad, S. M.; Anand, S.; Dillavou, S.; Shleifer, S.; Wiseman, S.; Gruetter, S.; Bowman, S. R.; Schoenholz, S. S.; Han, S.; Kwatra, S.; Rous, S. A.; Ghazarian, S.; Ghosh, S.; Casey, S.; Bischoff, S.; Gehrmann, S.; Schuster, S.; Sadeghi, S.; Ham- dan, S.; Zhou, S.; Srivastava, S.; Shi, S.; Singh, S.; Asaadi, S.; Gu, S. S.; Pachchigar, S.; Toshniwal, S.; Upadhyay, S.; Shyamolima; Debnath; Shakeri, S.; Thormeyer, S.; Melzi, S.; Reddy, S.; Makini, S. P.; Lee, S.-H.; Torene, S.; Hatwar, S.; Dehaene, S.; Divic, S.; Ermon, S.; Biderman, S.; Lin, 80 S.; Prasad, S.; Piantadosi, S. T.; Shieber, S. 
M.; Misherghi, S.; Kiritchenko, S.; Mishra, S.; Linzen, T.; Schuster, T.; Li, T.; Yu, T.; Ali, T.; Hashimoto, T.; Wu, T.-L.; Desbordes, T.; Rothschild, T.; Phan, T.; Wang, T.; Nkinyili, T.; Schick, T.; Kornev, T.; Tunduny, T.; Gerstenberg, T.; Chang, T.; Neeraj, T.; Khot, T.; Shultz, T.; Shaham, U.; Misra, V.; Demberg, V.; Nyamai, V.; Raunak, V.; Ramasesh, V.; Prabhu, V. U.; Pad- makumar, V.; Srikumar, V.; Fedus, W.; Saunders, W.; Zhang, W.; Vossen, W.; Ren, X.; Tong, X.; Zhao, X.; Wu, X.; Shen, X.; Yaghoobzadeh, Y.; Lakretz, Y.; Song, Y.; Bahri, Y.; Choi, Y.; Yang, Y.; Hao, Y.; Chen, Y.; Belinkov, Y.; Hou, Y.; Hou, Y.; Bai, Y.; Seid, Z.; Zhao, Z.; Wang, Z.; Wang, Z. J.; Wang, Z.; and Wu, Z. 2023a. Beyond the Imitation Game: Quanti- fying and extrapolating the capabilities of language models. arXiv:2206.04615. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A. M.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga- Alonso, A.; et al. 2023b. Beyond the imitation game: Quan- tifying and extrapolating the capabilities of language mod- els. Transactions on machine learning research. Stab, C.; and Gurevych, I. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 conference on empirical methods in natural lan- guage processing (EMNLP), 46–56. Stewart, N.; Chandler, J.; and Paolacci, G. 2017. Crowd- sourcing samples in cognitive science. Trends in cognitive sciences, 21(10): 736–748. Strachan, J. W. A.; Albergo, D.; Borghini, G.; Pansardi, O.; Scaliti, E.; Gupta, S.; Saxena, K.; Rufo, A.; Panzeri, S.; Manzi, G.; Graziano, M. S. A.; and Becchio, C. 2024. Test- ing theory of mind in large language models and humans. Nature Human Behaviour, 8(7): 1285–1295. Street, W.; Siy, J. O.; Keeling, G.; Baranes, A.; Bar- nett, B.; McKibben, M.; Kanyere, T.; Lentz, A.; Dunbar, R. I.; et al. 2024.Llms achieve adult human perfor- mance on higher-order theory of mind tasks. arXiv preprint arXiv:2405.18870. Tamir, D. I.; Thornton, M. A.; Contreras, J. M.; and Mitchell, J. P. 2018. Modeling the predictive social mind. Trends in Cognitive Sciences, 22(3): 201–212. Tan, S.; Zhuang, S.; Montgomery, K.; Tang, W. Y.; Cuadron, A.; Wang, C.; Popa, R. A.; and Stoica, I. 2024. Judgebench: A benchmark for evaluating llm-based judges.arXiv preprint arXiv:2410.12784. Thiyagarajan, P.; Parimi, V.; Sai, S.; Garg, S.; Meirbek, Z.; Yarlagadda, N.; Zhu, K.; and Kim, C. 2025. UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs. arXiv preprint arXiv:2506.09450. Thornton, M. A.; Weaverdyck, M. E.; and Mitchell, J. P. 2018. Consistent neural representations of personality traits across perceivers. Journal of Neuroscience, 38(1): 101–112. Toulmin, S. E. 2003. The uses of argument. Cambridge university press. Wang, A.; Morgenstern, J.; and Dickerson, J. P. 2025. Large language models that replace human participants can harm- fully misportray and flatten identity groups. Nature Machine Intelligence, 1–12. Wimmer, H.; and Perner, J. 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13(1): 103–128. Yeung, E. K. L.; Apperly, I. A.; and Devine, R. T. 2024. Measures of individual differences in adult theory of mind: A systematic review. Neuroscience & Biobehavioral Re- views, 157: 105481. Ying, L.; Collins, K. M.; Wong, L.; Sucholutsky, I.; Liu, R.; Weller, A.; Shu, T.; Griffiths, T. L.; and Tenenbaum, J. B. 2025. 
On Benchmarking Human-Like Intelligence in Ma- chines. arXiv:2502.20502. Ying, L.; Zhi-Xuan, T.; Wong, L.; Mansinghka, V.; and Tenenbaum, J. 2024. Grounding language about belief in a bayesian theory-of-mind. arXiv preprint arXiv:2402.10416. Ethics Statement This work adheres to the AAAI Code of Ethics and was con- ducted under an approved Institutional Review Board (IRB) protocol. Human data. All data used in SUITE are derived from publicly available, anonymized Reddit discussions in the Am I The Asshole? (AITA) community. No private or identifying user information was collected. Usernames were replaced with unique anonymous identifiers prior to analysis, and all content was processed under Reddit’s data-use policies and the platform’s public data license. No direct interaction with human participants occurred, and all analyses are conducted on de-identified text. Risks and mitigations. Potential risks include the inad- vertent amplification of biases present in online data (e.g., demographic, cultural, or linguistic bias). To mitigate these risks, we (i) document dataset statistics and filtering criteria, (i) provide balanced domain coverage across social con- texts (family, friendship, romance, work, society), and (i) encourage downstream researchers to conduct fairness and sensitivity analyses when using SUITE. Data security and release. All Reddit data were pro- cessed using secure pipelines and stored in anonymized form. Synthetic data were generated deterministically from moral-logic templates and contain no personal or identify- ing information. Code, benchmark scripts, and data releases will follow open-science best practices with explicit license terms and a retraction mechanism in case of unforeseen is- sues. Overall, SUITE is intended purely for research on indi- vidualized reasoning and cognitive evaluation in language models, not for deployment in real-world decision-making contexts that could affect individuals or communities. Reproducibility Statement We have taken extensive measures to ensure that the SUITE benchmark and all experiments reported in this paper are fully reproducible. Data collection and preprocessing. All Reddit data are publicly accessible, and our pipeline for data curation, filter- ing, and anonymization is released in full. We detail cleaning 81 scripts, user sampling thresholds, and moral-domain balanc- ing procedures in the repository. No proprietary datasets are required. Evaluation protocol. All tasks use standardized metrics such as accuracy and consistency, implemented in shared evaluation scripts. We release the benchmark configuration files, task definitions (stance, evidence, warrant), and evalu- ation pipeline to enable exact reproduction of our reported results. Release and documentation. The full benchmark, code, and evaluation scripts are released as open source under: https://github.com/ZhenzeMo/benchmark-suite Upon publi- cation, the public release will include data versioning, de- tailed documentation, and explicit license terms. We will also maintain a leaderboard to support transparent and com- parable evaluation across future work. Together, these measures ensure that SUITE and all as- sociated experiments are transparent, replicable, and con- ducive to cumulative progress in the study of individualized Theory of Mind in large language models. Related Work Social Reasoning and Theory of Mind. 
Work on The- ory of Mind (ToM) in AI draws from developmental psy- chology tests such as the false-belief task (Wimmer and Perner 1983), Sally-Anne (Baron-Cohen, Leslie, and Frith 1985), and Strange Stories (Happé 1994), later reformulated as computational tasks (Nematzadeh et al. 2018; Rabinowitz et al. 2018b). Scaled language models brought ToM into broad benchmarks (Le, Boureau, and Nickel 2019; Hewitt and Cohen 2021; Srivastava et al. 2023a; Chen et al. 2024b) and inspired synthetic testbeds such as BigToM (Gandhi et al. 2023b), HI-TOM (He et al. 2023), FANToM (Kim et al. 2023), and MMToM-QA (Jin et al. 2024). UniToMBench (Thiyagarajan et al. 2025) advances synthetic ToM eval- uation by integrating multi-interaction task structures and evolving narrative scenarios, offering a unified benchmark that highlights strengths and failures of current LLMs in be- lief and emotion reasoning. More recent directions ground ToM in dialogues and social contexts (Chan et al. 2024; Sap et al. 2019; Strachan et al. 2024), or frame it through Bayesian belief attribution (Ying et al. 2024). Yet these benchmarks remain synthetic, vignette-based, and decon- textualized, missing ecological and demographic variability (Wang, Morgenstern, and Dickerson 2025; Stewart, Chan- dler, and Paolacci 2017). Recent work has begun to address individual-level belief alignment (Li et al. 2025b) and scal- able belief attribution evaluation (Li et al. 2025a). How- ever, existing benchmarks still lack diversity in moral dilem- mas and grounding in authentic social scenarios. Within this framing, SUITE extends ToM evaluation by asking whether models can map natural language into personalized belief states and reasoning the process behind the belief states, bridging synthetic ToM tasks and socially grounded reason- ing. Human Review Annotation and Quality Control. Two trained annotators independently labeled all warrant and evidence categories, as well as verified ground truth stance assignments. To ensure annotation quality, annotators worked with LLM- assisted verification to maintain consistency. We assessed inter-annotator reliability on a random sample of 112 instances. The annotators achieved a raw agreement rate of 94.64% (106/112 instances with identical labels). Gwet’s AC1 coefficient of 0.94 indicates very high inter-annotator agreement after controlling for chance agreement, confirm- ing the reliability and clarity of our annotation taxonomy. In cases of disagreement between annotators, the authors served as adjudicators to resolve conflicts and determine the final labels. Annotator Demographics and Ethics. Both annotators were female, aged 22 and 25 years. Annotator 1 held a mas- ter’s degree in computer science, AI, and data science with intermediate NLP experience. Annotator 2 held a bachelor’s degree in natural sciences and engineering with beginner- level NLP experience. Annotators were fairly compensated at a rate of $12/hour, which exceeds the local minimum wage. All annotators pro- vided informed consent and were made aware of the annota- tion task requirements, including the nature of content they would review. The annotation protocol was reviewed and ap- proved by our institutional review board (IRB). Annotators had the right to refuse any annotation task and could with- draw from the study at any time without penalty. Details of the SUITE Dataset Task Design SUITE evaluates individualized moral reasoning through three complementary multiple-choice tasks: stance, war- rant, and evidence. 
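As an aside on the reliability figures reported above, both raw agreement and Gwet's AC1 can be recomputed from the two annotators' paired labels. The sketch below assumes exactly two annotators and a closed label set; it is a generic implementation of the AC1 statistic, not the authors' analysis script.

```python
from collections import Counter

def agreement_and_gwet_ac1(labels_a: list, labels_b: list) -> tuple:
    """Return (raw agreement, Gwet's AC1) for two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired, non-empty labels"
    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    k = len(categories)
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n      # raw agreement
    counts = Counter(labels_a) + Counter(labels_b)
    # pi_c: average share of all ratings (both annotators) falling in category c
    p_e = sum((counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in categories)
    p_e /= max(k - 1, 1)                                            # chance-agreement term
    return p_a, (p_a - p_e) / (1 - p_e)

# 106 identical labels out of 112 gives p_a of roughly 0.946; the AC1 value then
# depends on how the labels are distributed over the annotation categories.
```

With reliability established, the three tasks themselves are specified as follows.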
All tasks are instantiated in a multiple- choice format to enable uniform and reproducible evalua- tion. Theoretical Foundations From Language to Mental States: The Intentional Stance Rather than treating Theory of Mind (ToM) as a black- box prediction of surface behavior, we adopt the Intentional Stance (Dennett 1989), which views beliefs, values, and in- tentions as latent explanatory constructs inferred from ob- servable behavior. Because internal mental states are not di- rectly accessible, ToM-like inference from language neces- sarily takes the form of inverse reasoning: recovering the latent moral heuristics that make an individual’s judgments coherent. Our task thus requires models to move beyond surface-level pattern matching to reconstruct the internal logic that drives evaluative conclusions. In moral discourse, textual justifications are therefore not treated as incidental or purely post-hoc explanations. Con- sistent with work in cognitive and social psychology, we view justification as a primary behavioral trace through which people externalize their moral heuristics and value priorities (Mercier and Sperber 2011). Consequently, our benchmark focuses on whether LLMs can recover the internal logic reflected in an individual’s expressed reasoning, a task that is fundamentally more de- manding than surface-level pattern matching. 82 Table 2: Taxonomy of reasoning warrants in moral judgment WarrantDescription Care/Harm PreventionThe judgment is justified by concern about direct physical or emotional suffering of a person. Moral weight is placed on preventing or alleviating direct suffering, distress, humiliation, or trauma. Safety/Risk ManagementThe judgment is justified by risk-taking, danger creation, or failure to follow safety precautions, even if no harm occurred. Risk itself is morally salient; people have obli- gations to prevent foreseeable harm. Fairness/ReciprocityThe judgment is justified by proportionality, equity, or reciprocal treatment. People should share burdens and benefits fairly and treat others as they themselves are treated. Autonomy/BoundariesThe judgment is justified by violation of personal sovereignty: controlling behavior, un- wanted interference, privacy invasion, or boundary crossing. Individuals have the right to make personal choices and maintain personal boundaries. Property/ConsentThe judgment is justified by violations involving ownership, money, physical posses- sions, or bodily access without permission. People have rights over their possessions, finances, and bodily access; consent is required. Honesty/CommunicationThe judgment is justified by deception, withholding key information, dishonesty, or failure of transparent communication. Truthfulness and transparent communication are morally required for trust and informed decisions. Role-based ResponsibilityThe judgment is justified by duties tied to a specific functional role (tenant, roommate, employee, co-parent) rather than emotion-based loyalty. Roles carry non-emotional obligations; failure to meet them justifies moral blame or praise. Relational LoyaltyThe judgment is justified by betrayal of trust or failure to stand by an in-group member (family, partner, close friend). Close relationships impose loyalty expectations; betrayal is morally salient. Authority/HierarchyThe judgment is justified by respect or disrespect toward legitimate authority or hierar- chical roles (boss, parent, teacher, official leader). Legitimate authority deserves defer- ence; undermining it is morally relevant. 
Tradition/Social ConventionsThe judgment is justified by violating widely shared social conventions, etiquette, or coordination norms. Shared norms enable coordination and mutual respect; violating them is morally salient when no more specific principle dominates. Argumentative Structure as a Representation of Individ- ualized Reasoning To operationalize this idea, we draw on the Toulmin argument model (Toulmin 2003), which de- composes everyday reasoning into a claim (stance), support- ing grounds (hereafter evidence), and a warrant that licenses the inference from grounds to claim. Prior work in argument mining and computational social reasoning shows that such components can be identified in noisy, user-generated text and provide a useful abstraction for analyzing justificatory structure (Stab and Gurevych 2014; Lippi and Torroni 2016). Accordingly, we represent each comment as: Comment⇒Stance, Evidence, Warrant, where Stance is the expressed conclusion, Evidence is the invoked support, and Warrant is the normative bridge connecting evidence to stance. While we do not claim that Toulmin structures mirror ex- plicit human neurocognition, they provide a computationally tractable abstraction for belief attribution from language, en- abling a systematic analysis of an agent’s justificatory logic. Within this representation, the warrant captures how an indi- vidual links contextual considerations to an evaluative con- clusion, reflecting their prioritization of normative princi- ples. Dominant Warrants and Identity-Consistent Moral Rea- soning While moral reasoning can involve multiple con- siderations, empirical discourse analysis often identifies a dominant warrant: the primary normative principle most di- rectly supporting a given judgment. Focusing on a dominant warrant does not deny moral complexity; instead, it defines a stable and interpretable abstraction that supports repro- ducible analysis at scale (Entman 1993). In the context of individualized evaluation, warrants play a central role. Two individuals may arrive at the same stance while relying on different warrants, and conversely, the same individual may apply a consistent warrant across diverse sit- uations. Evaluating warrants therefore enables us to distin- guish surface-level agreement from deeper alignment in jus- tificatory logic. Accordingly, SUITE operationalizes ToM in a ToM-inspired sense: models must attribute and general- ize person-specific normative bridges from language, rather than merely predicting outcomes. This aligns with ToM frameworks that treat mental states as explanatory constructs inferred from behavior (Baker et al. 2017), while avoiding claims that SUITE directly measures canonical psychologi- cal ToM abilities. Closed-set warrant taxonomy. Warrant questions are defined over a ten-category taxonomy of moral war- 83 rants (e.g., fairness/reciprocity, care/harm prevention, auton- omy/boundaries). The taxonomy design is inspired by and follows the framework proposed in (Forbes et al. 2020). Each question asks: Which principle best explains the in- ferential bridge from the invoked considerations to the stated stance? Distractors are constructed to be plausible in- context yet inconsistent with the comment’s expressed justi- fication. Full taxonomy definitions are provided in Table 2 Stance (Belief Attribution). Stance questions evaluate whether a model can recover the explicit evaluative conclu- sion expressed in a comment (e.g., YTA vs. NTA). 
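The closed-set warrant taxonomy just described can be held as a small mapping from which multiple-choice options are drawn. The category keys below follow Table 2 with abridged descriptions; the distractor-sampling helper is an illustrative assumption rather than SUITE's actual question-construction code.

```python
import random

# Warrant taxonomy from Table 2 (descriptions abridged for readability).
WARRANTS = {
    "Care/Harm Prevention":         "preventing or alleviating direct suffering",
    "Safety/Risk Management":       "risk creation or ignoring safety precautions",
    "Fairness/Reciprocity":         "proportionality, equity, reciprocal treatment",
    "Autonomy/Boundaries":          "violations of personal sovereignty or boundaries",
    "Property/Consent":             "ownership, money, or bodily access without consent",
    "Honesty/Communication":        "deception or non-transparent communication",
    "Role-based Responsibility":    "duties tied to a specific functional role",
    "Relational Loyalty":           "betrayal of trust toward an in-group member",
    "Authority/Hierarchy":          "respect or disrespect toward legitimate authority",
    "Tradition/Social Conventions": "violations of shared conventions or etiquette",
}

def warrant_options(gold: str, n_distractors: int = 3, seed: int = 0) -> list:
    """One warrant MCQ: the gold category plus sampled distractors, shuffled."""
    rng = random.Random(seed)
    distractors = rng.sample([w for w in WARRANTS if w != gold], n_distractors)
    options = [gold] + distractors
    rng.shuffle(options)
    return options
```

Returning to the stance task: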
Given an individual's historical comments C_{1:n} and a new moral scenario P, the model predicts the moral stance:

    f_stance : (P, C_{1:n}) → y,    y ∈ {YTA, NTA}.

This task reflects belief attribution in a ToM-inspired sense: inferring the evaluative belief an individual would hold toward a new situation. We restrict stance labels to binary YTA/NTA, excluding ESH, NAH, and INFO due to their mixed or indeterminate nature. Because stance primarily captures outcome recognition and is often sensitive to surface cues, stance accuracy alone is not sufficient as a measure of moral reasoning.

Warrant (Value-based Justification)

Why warrants operationalize moral reasoning. We treat warrants as the primary operationalization of moral reasoning because they encode the normative logic that licenses the inference from cited considerations to judgment. In contrast to stance (an outcome), warrants are closer to generalizable justificatory rules and thus provide a more stable proxy for cross-context reasoning patterns. Given P and C_{1:n}, the model predicts a moral principle y from a predefined taxonomy W (e.g., Fairness/Reciprocity, Care/Harm, Responsibility/Obligation, Autonomy/Boundaries), derived from the moral reasoning literature (Forbes et al. 2020):

    f_warrant : (P, C_{1:n}) → y,    y ∈ W.

This captures the ToM process of value justification—identifying the abstract norm that bridges evidence to judgment within an individual's moral reasoning model.

Evidence (Contextual Belief Modeling)

What evidence questions measure. Evidence questions probe which considerations in a comment are treated as justificatory support for the stance. Evidence is not mere factual recall; it reflects what the commenter foregrounds as morally relevant in their justification. Thus, evidence complements warrants: evidence captures what is cited, while warrants capture why it supports the stance.

Why evidence helps evaluate reasoning. The same stance can be justified by different evidence and different warrants. Evidence questions therefore help distinguish models that recover the commenter's justificatory focus from models that rely on generic moral priors. Together with warrant questions, they support a more faithful evaluation of reasoning structure: predicting the correct stance while mismatching evidence or warrant indicates incomplete recovery of the expressed justification. Given a scenario P, evidence candidates E = {e_1, ..., e_m}, and prior comments C_{1:n}, the model identifies which contextual factor the individual would consider central to their judgment:

    f_evidence : (P, C_{1:n}, E) → y,    y ∈ E.

This task measures the ToM ability to contextualize mental states—recognizing which cues or observations an individual treats as relevant to their moral reasoning.

The AI Tipping Point: How Design and Repeated Use Shape Beliefs About Digital Minds

Andreea Zaman 1, Scott A. Hale 2
1 Institute of Psychiatry, Psychology & Neuroscience, King's College London
2 Oxford Internet Institute, University of Oxford
andreea.zaman@kcl.ac.uk, scott.hale@oii.ox.ac.uk

Abstract

AI agents powered by large language models are rapidly becoming embedded in our daily lives. AI agents are designed to remember prior chats and mirror our views, making them feel human-like even when no mind is present. We propose a new longitudinal study to identify the behavioral tipping point at which people might shift from treating AIs as tools to perceiving them as having minds.
Participants will debate civic issues with AI agents or trained human confederates in a purpose-built virtual citizens’ assembly across multiple sessions. Human confederates will pro- vide a rigorously matched human baseline that isolates the effect of AI design features—memory (on/off) and agreeableness (high/low)—from task context. After each session, we will measure whether participants rate their counterparts as possessing Theory of Mind, mental states, and consciousness. Through this framework, we can advance our understanding of bothhowandwhen humans might attribute digital minds to AIs. Introduction People increasingly treat AI agents like ChatGPT as social partners by confiding in them, taking their advice, and weav- ing them into daily life to fulfil the need for social connec- tion (17). As AI becomes more deeply embedded in every- day interaction, will people continue to perceive AIs as mere tools, or will they come to treat them as entities with minds, capable of thoughts, feelings, and intentions (3; 19; 20; 21)? If people perceive AI systems as possessing mental states and ascribe them cognitive abilities, this might foster over- reliance on AI advice and emotional support, with serious consequences. Emerging evidence shows that AI systems can validate users’ existing beliefs and display strong syco- phancy during interactions, exacerbating psychological vul- nerability in susceptible individuals. Moreover, high-profile cases have reported suicides involving individuals who en- gaged extensively with AI chatbots, prompting lawsuits and public debate about AI use in these contexts, although causal links remain unclear. Two key psychological mechanisms underlie the human inclination to attribute mental states to AI. First, people may perceive AI as having the ability to infer their thoughts and emotions, a fundamental aspect of human social cognition known as Theory of Mind (ToM) (24). Indeed, recent re- search has shown that some LLMs exhibit behaviours that resemble ToM, with their performance compared to that of a 9-year-old child (7; 9). Yet there are two issues. First, the validity of standard ToM tasks has itself been questioned: the Reading the Mind in the Eyes Test appears to measure emotion recognition rather than mental-state attribution per se (15), and second many LLM evaluations conflate surface linguistic prediction with genuine mental-state understand- ing (19; 23). Therefore, whether LLMs display genuine ToM or whether this is an illusion created by sophisticated text generation or narrow measurement remains a topic of in- tense debate. Our proposed task moves beyond these limita- tions by conceptualising ToM as a dynamic, accuracy-based process of belief inference and confidence calibration across repeated interactions—grounded in the Mind-space frame- work’s account of how representations of a partner’s per- sonality traits systematically mediate mental state inference (5; 10). The second mechanism is anthropomorphism, our natural tendency to assign human-like traits, emotions, and motiva- tions to non-human entities (6), a tendency further amplified by the deliberate design of AI systems to mimic human be- haviour (18). Crucially, anthropomorphism can extend be- yond ToM to encompass attributions of consciousness—the sense that the AI has subjective experience, something that it is like to be an AI (13). 
While this definition captures the phenomenal dimension of consciousness, folk psycho- logical attributions also encompass agentic capacities such as reasoning and planning, yielding two empirically separa- ble dimensions of felt experience (Experience) and cognitive abilities (Intelligence) (4; 22). This may help explain why, in a survey of 300 U.S. residents, 67% attributed some level of consciousness to ChatGPT after an interaction (4). Crit- ically, these two mechanisms are mutually reinforcing: AI systems that produce ToM-like outputs prime anthropomor- phic responses in users, which in turn amplify attributions of both ToM and consciousness—creating a feedback loop in which AI design and human perception jointly drive the possible shift from tool to digital mind. Objectives Understanding whether these perceptions are driven by at- tributed ToM, anthropomorphism, or a combination of both, and how these factors interact, is crucial to identify the be- havioural tipping points at which people begin to believe that AI systems possess social cognition, mental states, and 85 even consciousness over time. Prior survey work (4) sug- gests that such attributions might increase with usage, yet these findings are cross-sectional and leave open the mech- anisms driving them. What is missing is a longitudinal ap- proach that can reveal not just whether, but when and how, these attributions emerge. Critically, we address this gap us- ing trained human confederates as a matched baseline—an approach that, to our knowledge, has not been used in prior work and allows us to isolate the contribution of AI design features from the broader social context of the interaction. Experimental Platform We will develop a virtual digital citizens’ assembly: a simu- lated online environment where participants engage with AI and human counterparts in structured debates on civic issues (e.g., climate change, the use of AI in education). The plat- form will take the form of a 2D tile-based town, reminiscent of The Sims, developed using an existing open-source simu- lation environment (16) and memory-augmented prompt en- gineering via existing LLM APIs (e.g., Cohere). AI manipulations.AI agents’ responses will be system- atically varied along two behavioural factors: memory (on/off) and agreeableness (agree/disagree). Personalisation through “remembering” prior conversations fosters social rapport (2) and increases trust and anthropomorphism (14). Agreeableness— whether the AI aligns with or challenges the participant’s stance on civic issues—is another key driver of rapport. Highly agreeable agents are more likely to be mistaken for humans in Turing Test settings (8). Procedure.Each participant will complete five sessions with either an AI agent or a trained human confederate; the dyad debates a civic issue and aims to reach consensus on a joint recommendation. Human counterparts will follow fixed scripts, and AI prompts will be matched in content and goals. Including trained confederates yields a rigorously matched human baseline, ensuring that any differences in mental- state attributions trace to AI design manipulations rather than extraneous variables. We conceptualise debate as a process of mutual mind- modelling, in which both parties iteratively adjust their rep- resentations of each other to sustain shared understanding (11; 1). 
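The two AI design factors and the five-session procedure described above suggest a simple factorial assignment scheme. The sketch below is a hypothetical illustration of how sessions could be allocated in such a design; the variable names and the randomisation scheme are assumptions, since the study is still a proposal.

```python
import itertools
import random

MEMORY = ("on", "off")                    # AI memory manipulation
AGREEABLENESS = ("agree", "disagree")     # AI agreeableness manipulation
AI_PROFILES = list(itertools.product(MEMORY, AGREEABLENESS))

def assign_sessions(participant_id: int, n_sessions: int = 5, seed: int = 2026) -> list:
    """Allocate a participant's debate sessions to partner conditions (partner-blind)."""
    rng = random.Random(seed + participant_id)
    sessions = []
    for s in range(1, n_sessions + 1):
        partner = rng.choice(["ai_agent", "human_confederate"])
        memory, agreeableness = rng.choice(AI_PROFILES) if partner == "ai_agent" else (None, None)
        sessions.append({
            "session": s,
            "partner": partner,            # revealed to participants only at debrief
            "memory": memory,
            "agreeableness": agreeableness,
        })
    return sessions
```

However the assignment is ultimately implemented, the conceptual core remains the mutual mind-modelling view of debate just described.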
This framing generates a specific prediction: partic- ipants who perceive their partner as tracking and adapting to their views—a signature of bidirectional ToM—should show accelerated attribution of mental states over sessions, and earlier tipping points. Study Design.To examine how mental state attributions evolve, we will use a partner-blind design across chat ses- sions. Participants will be informed at the consent stage that they will be talking to both AI and human partners; full disclosure on the type of partner occurs only at debrief, so there is no deception about the study’s nature. Prior work shows that knowing a partner is “just an AI” can bias interac- tions, leading participants to treat an AI partner dismissively (12), while anthropomorphic impressions tend to collapse upon upfront disclosure of AI identity. After each chat ses- sion, participants will complete post-interaction question- naires and tasks. We operationalise a tipping point as a sig- nificant discontinuity in attribution trajectories across ses- sions: specifically, a session at which ratings of ToM, men- tal states, or consciousness show a step change relative to the preceding session, rather than incremental drift. Measures Theory of Mind (ToM).A core challenge in measuring ToM behaviourally is establishing whether participants form accurate, calibrated representations of a partner’s specific beliefs and intentions, rather than relying on anthropomor- phic heuristics that produce confident but miscalibrated in- ferences. Two complementary frameworks motivate our ap- proach. The Mind-space framework proposes that accurate mental state inference depends on first constructing a struc- tured representation of the target’s traits and then mapping those systematically to specific mental states (5; 10), allow- ing the prediction of individual differences in both accu- racy and confidence calibration. Ying et al.’s (2025) LIRAS framework (25) converges on this from a computational di- rection, specifying that grounded ToM inference requires situation-specific representations rich enough to support prob- abilistic prediction of an agent’s behaviour. Together, these frameworks define our behavioural index: participants in- fer specific beliefs held by their partner, rate their confi- dence, and are scored against the partner’s actual stated posi- tions. Participants who have formed a genuine mental model should predict these positions accurately and show calibrated confidence. Those relying on anthropomorphic heuristics should show overconfidence and poor updating regardless of accu- racy. Crucially, overconfidence in a partner’s reasoning de- spite errors may inflate ToM attributions, a dynamic our lon- gitudinal design is well-placed to detect. Consciousness.Participants will rate each agent on a bat- tery of mental states clustering into two dimensions:Intelli- gence(e.g., reasoning, planning) andExperience(e.g., feel- ings, subjective awareness), which have been shown to un- derpin folk attributions of consciousness (4; 22). They will then provide a direct judgment of consciousness for their chat partner, their confidence, and an estimate of perceived social consensus. Together, these measures yield a multidi- mensional profile of perceived consciousness: mental states, direct belief, metacognitive certainty, and perceived public opinion. Conclusion This study addresses a pressing question of public interest: under what conditions do humans attribute a mind to AI? 
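The operationalisation of a tipping point given above, a step change in attribution ratings relative to the preceding session, is directly computable from the longitudinal data. The heuristic below is only a sketch under that definition; a registered analysis would presumably use a formal change-point or mixed-effects test, and the threshold value is an arbitrary assumption.

```python
from statistics import mean, stdev

def first_tipping_point(session_ratings: list, z_threshold: float = 2.0):
    """Return the first session index (1-based) whose mean attribution rating jumps
    sharply relative to typical between-session change, or None if no step is found.

    session_ratings[i] holds the ratings (e.g., perceived ToM on a Likert scale)
    collected after session i + 1."""
    means = [mean(r) for r in session_ratings]
    diffs = [b - a for a, b in zip(means, means[1:])]
    if len(diffs) < 2:
        return None
    spread = stdev(diffs) or 1e-9                 # guard against zero variance
    typical = mean(diffs)
    for session, d in enumerate(diffs, start=2):  # session 2 is the first possible jump
        if abs(d - typical) / spread > z_threshold:
            return session
    return None
```

Either way, the five-session attribution trajectories are what the proposed study collects.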
It will deliver the first empirical test of how repeated, design-mediated encounters with AI shape mental-state attributions over time, using a longitudinal design with a rigorously matched human baseline. It is important to note that we do not seek to adjudicate whether AI systems genuinely possess ToM or consciousness—we focus on the human side of the interaction: how and when people come to perceive and attribute a digital mind to AI. This distinction matters because risks such as emotional dependency, misplaced trust, and over-reliance on AIs arise from human perception regardless of whether AI possesses mental states.

References

[1] S. Acklin et al. Mutual theory of mind in human–AI collaboration: An empirical study with LLM-driven AI agents in a real-time shared workspace task, 2024.
[2] T. Araujo and N. Bol. From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents. Computers in Human Behavior: Artificial Humans, page 100030, 2024.
[3] P. Butlin, R. Long, E. Elmoznino, Y. Bengio, J. Birch, A. Constant, G. Deane, S. M. Fleming, C. Frith, X. Ji, R. Kanai, C. Klein, G. Lindsay, M. Michel, L. Mudrik, M. A. K. Peters, E. Schwitzgebel, J. Simon, and R. VanRullen. Consciousness in artificial intelligence: Insights from the science of consciousness, 2023.
[4] C. Colombatto and S. M. Fleming. Folk psychological attributions of consciousness to large language models. Neuroscience of Consciousness, 2024(1):niae013, 2024.
[5] J. R. Conway, C. Catmur, and G. Bird. Understanding individual differences in theory of mind via representation of minds, not mental states. Psychonomic Bulletin & Review, 26:798–812, 2019.
[6] N. Epley, A. Waytz, and J. T. Cacioppo. On seeing the human: A three-factor theory of anthropomorphism. Psychological Review, 114(4):864–886, 2007.
[7] M. Kosinski. Theory of mind might have spontaneously emerged in large language models, 2024.
[8] U. León-Domínguez, E. D. Flores-Flores, A. J. García-Jasso, M. K. Gómez-Cuellar, D. Torres-Sánchez, and A. Basora-Marimon. AI-driven agents with prompts designed for high agreeableness increase the likelihood of being mistaken for a human in the Turing test. 2025.
[9] Y. Li et al. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 180–192, 2023.
[10] E. L. Long, C. Catmur, S. M. Fleming, and G. Bird. Metacognition facilitates theory of mind through optimal weighting of trait inferences. Cognition, 256:106042, 2025.
[11] Q. Long et al. Mutual theory of mind for human–AI communication, 2022.
[12] E. Luger and A. Sellen. "Like Having a Really Bad PA": The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5286–5297. ACM, 2016.
[13] T. Nagel. What is it like to be a bat? Philosophical Review, 83(4):435–450, 1974.
[14] C. Nass and Y. Moon. Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1):81–103, 2000.
[15] B. F. M. Oakley, R. Brewer, G. Bird, and C. Catmur. Theory of mind is not theory of emotion: A cautionary note on the Reading the Mind in the Eyes Test. Journal of Abnormal Psychology, 125(6):818–825, 2016.
[16] J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023.
[17] I. Pentina, T. Hancock, and T. Xie. Exploring relationship development with social chatbots: A mixed-method study of Replika. Computers in Human Behavior, 140:107600, 2023.
[18] M. Shanahan, K. McDonell, and L. Reynolds. Role play with large language models. Nature, 623:493–498, 2023.
[19] N. Shapira, M. Levy, S. H. Alavi, X. Zhou, Y. Choi, Y. Goldberg, M. Sap, and V. Shwartz. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models, 2023.
[20] J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, and C. Becchio. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8:1285–1295, 2024.
[21] W. N. Street. LLM theory of mind and alignment: Opportunities and risks. In Proceedings of the 1st Workshop on Theory of Mind in Human–AI Interaction (ToMinHAI), CHI 2024. ACM, 2024.
[22] J. Sytsma. Attributions of consciousness. WIREs Cognitive Science, 5(6):635–648, 2014.
[23] T. Ullman. Large language models fail on trivial alterations to theory-of-mind tasks, 2023.
[24] H. Wimmer and J. Perner. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13(1):103–128, 1983.
[25] L. Ying et al. On-demand synthesis of rational agent models for grounded theory-of-mind reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2025. ACL Anthology, 2025.

The Curse of Knowledge in Language Models: Perfect Theory of Mind or Missing Human Biases?

Nutchanon Yongsatianchot
Thammasat School of Engineering, Faculty of Engineering, Thammasat University, Thailand
ynutchan@engr.tu.ac.th

Abstract

Theory of Mind (ToM), the ability to reason about others' mental states, has become a key benchmark for evaluating language models. While recent work shows that large language models (LLMs) achieve high accuracy on standard false-belief tasks, human cognition is characterized not only by correctness but also by systematic biases. We replicate the Birch and Bloom (2007) displacement task to test whether frontier LLMs exhibit the "curse of knowledge," a well-documented human bias where one's own knowledge interferes with reasoning about others' beliefs. We find that GPT-5 assigns near-perfect probability (100%) to the belief-consistent response across all conditions with zero curse-of-knowledge bias, while GPT-4o and GPT-4o-mini show weaker ToM reasoning. Our findings reveal a fundamental mismatch between LLM and human ToM, in which they fail to exhibit human-like cognitive biases, potentially limiting their ability to accurately model real-world human behavior.

Introduction

Theory of Mind (ToM), the capacity to attribute mental states such as beliefs, desires, and intentions to others, lies at the intersection of human cognitive development and artificial intelligence evaluation (Premack and Woodruff 1978; Whiten and Byrne 1991; Kosinski 2023). Recent advances in LLMs have sparked intense debate about whether these systems possess genuine ToM capabilities. Models demonstrate strong performance on classic false-belief tasks from developmental psychology, such as the Sally-Anne test (Baron-Cohen, Leslie, and Frith 1985; Wimmer and Perner 1983), often matching or exceeding human accuracy (Kosinski 2023). However, recent work reveals brittleness when task formats were modified in various ways (Ullman 2023; Shapira et al. 2023; Gandhi et al. 2024).
Crucially, most evaluations focus exclusively on correctness. Yet human cognition is characterized not only by success but also by systematic biases. One such bias is the "curse of knowledge": the tendency for one's own knowledge to interfere with reasoning about what others believe (Camerer, Loewenstein, and Weber 1989). Birch and Bloom (2007) demonstrated that adults exhibit this bias in false-belief scenarios: when participants knew an object's true location, they gave inflated probability ratings to that location when predicting where a character with a false belief would search (see also Ryskin and Brown-Schmidt (2014)). This raises a fundamental question: Do language models exhibit human-like cognitive biases when reasoning about false beliefs? If LLMs show curse-of-knowledge effects, this would suggest deeper parallels in cognitive processing. Conversely, perfect performance without human biases reveals limitations in modeling real-world human mental states.

Method

We replicated the displacement task from Birch and Bloom (2007) with adaptations for LLM evaluation. A character is in a room with four distinctly colored containers arranged left to right: blue, purple, red, and green. She places an object in the blue container and leaves. While absent, her sister moves the object and spatially rearranges all containers to: red, green, purple, blue. Models judge the probability she will first look in each container (percentages summing to 100%). The full prompt is provided in the Appendix.

The key manipulation varies what the model knows, not what the character believes—she always has the same false belief (the object is in the blue container). Three conditions were tested: (1) Ignorance: The model is told the object was "moved to another container" without specifying which, (2) Knowledge-Plausible: The object was moved to the red container, which now occupies blue's original leftmost position, and (3) Knowledge-Implausible: The object was moved to the purple container, at a different location.

This task improves on standard false-belief paradigms in three ways: (1) probability judgments rather than binary responses provide a more sensitive measure for detecting biases; (2) four containers instead of two allow manipulation of the model's knowledge while holding the character's belief constant (a specific container vs. another container without being told which one); and (3) spatial rearrangement after displacement varies the plausibility of different search strategies (location vs. identity), as prior research found plausibility can mediate curse-of-knowledge magnitude (Pohl 1998).

Results

We tested five LLMs from OpenAI: GPT-5 with three reasoning effort levels (low, medium, high), GPT-4o, and GPT-4o-mini. The design crossed 5 models × 3 name pairs × 2 color sets × 3 conditions × 3 objects × 5 trials (temperature = 1), yielding 1,350 trials. The code for the experiments and analysis is at github.com/yongsa-nut/ToM4AICoK.

Figure 1: Average probability judgments across three experimental conditions for all five language models and human baseline data from Birch and Bloom (2007). Error bars show standard deviation.

We found no significant differences across character names, instruments, or color sets (all p > 0.05), justifying aggregation across these factors. Our main finding (Figure 1) reveals substantial variation in ToM performance across models. GPT-5 achieves near-perfect Theory of Mind with zero curse-of-knowledge bias.
Across all three reasoning effort levels (low, medium, high) and all conditions, GPT-5 consistently assigned approximately 100% probability to the blue container (the belief-consistent location) and 0% to the red container (the true location), showing no influence of privileged knowledge on false-belief reasoning.

In contrast, GPT-4o and GPT-4o-mini showed markedly different patterns. To understand these patterns, we analyzed individual responses and identified two distinct error types. GPT-4o-mini consistently committed "reality-tracking" errors, assigning only 12–35% to blue while spreading probability across containers, answering "where could the object be?" rather than "where does she believe it is?" GPT-4o showed more complex behavior: only 43% to blue in Ignorance, but about 77% in Knowledge conditions (the opposite of curse-of-knowledge). Categorizing responses revealed why: in Ignorance, 70% showed reality-tracking; in Knowledge conditions, responses were bimodal, either fully correct (100% blue) or showing "location-vs-identity" errors, where the model assigned 100% to whichever container occupied blue's original position (resulting in 0% blue), assuming characters track spatial locations rather than container identity. This bimodal pattern explains GPT-4o's high variance.

This contrasts with human performance (Birch and Bloom 2007). Human adults showed clear curse-of-knowledge effects: knowing the true location inflated probability ratings for that location by 11 percentage points, with an additional 15-point plausibility effect when knowledge aligned with a sensible search strategy.

We also tested an image-based version of the task (see Figure 2 in the Appendix). GPT-5 and GPT-4o-mini showed patterns consistent with the text version. GPT-4o exhibited inconsistent responses in the image condition without providing any explanations.

Discussion

Our findings reveal substantial variation in LLM Theory of Mind capabilities. GPT-5 achieves near-perfect accuracy without exhibiting curse-of-knowledge bias. GPT-4o and GPT-4o-mini fail through qualitatively different mechanisms: GPT-4o-mini ignores belief states entirely (reality-tracking), while GPT-4o sometimes confuses what characters track (spatial positions versus object identity). Notably, GPT-4o performed better in Knowledge conditions than Ignorance, suggesting that explicit location information helps anchor belief reasoning rather than biasing it.

Most ToM evaluations emphasize correctness, but our results suggest this framing may be incomplete, especially when considering LLMs as models of human cognition (Frank and Goodman 2025). Accurately modeling human mental states requires capturing both what humans should believe and how their judgments are systematically biased. An AI system assuming perfect human rationality may fail when humans exhibit reasoning biases. In the Appendix, we provide a preliminary analysis of models predicting human responses in this task, showing that models may have some awareness of potential human biases but not always.

This study examined one bias in one task. Future work should explore curse-of-knowledge effects in applied domains like education, where experts routinely overestimate what novices know.
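To make the bias comparison concrete, the sketch below shows one way a curse-of-knowledge effect could be quantified from per-trial probability judgments of the kind described above: the increase in probability assigned to the object's true location when the model (or participant) knows that location versus when it does not. The data layout and column names are illustrative assumptions, not the authors' released analysis; the actual experiment and analysis code is in the repository linked in the Results section.

```python
# Illustrative sketch: quantify a curse-of-knowledge effect from per-trial
# probability judgments. Column names and data layout are assumptions made
# for illustration; see the authors' repository for the real analysis code.
import pandas as pd

def curse_of_knowledge_bias(trials: pd.DataFrame) -> pd.Series:
    """Per model: mean probability assigned to the object's true location in the
    Knowledge conditions minus that in the Ignorance condition. A positive value
    mirrors the inflation reported for human adults by Birch and Bloom (2007)."""
    mean_p = (trials
              .groupby(["model", "condition"])["p_true_location"]
              .mean()
              .unstack("condition"))
    knowledge = mean_p[["Knowledge-Plausible", "Knowledge-Implausible"]].mean(axis=1)
    return knowledge - mean_p["Ignorance"]

# Toy usage with fabricated numbers (not results from the paper):
toy = pd.DataFrame({
    "model": ["GPT-5"] * 4 + ["GPT-4o"] * 4,
    "condition": ["Ignorance", "Ignorance",
                  "Knowledge-Plausible", "Knowledge-Implausible"] * 2,
    "p_true_location": [0.0, 0.0, 0.0, 0.0, 0.30, 0.25, 0.10, 0.15],
})
print(curse_of_knowledge_bias(toy))
```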
Appendix

Stimulus Prompt

The following prompt was used for the Knowledge-Plausible condition (other conditions varied only the location information):

Consider the following scenario: In a room, there are four containers from left to right spaced out evenly: a blue container, a purple container, a red container, and a green container. Alice finishes playing her violin and puts it in the blue container. Then she goes outside to play. While Alice is outside playing, her sister, Beth moves the violin to the red container. Then Beth swaps the positions of the containers in the room, changing the order from left to right to: red, green, purple, and blue. When Alice returns, she wants to play her violin. What are the chances Alice will first look for her violin in each of the above containers? Put your answer in percentages (must sum to 100).

Predicting Human Responses

We asked all five models to predict how average human participants would respond to the task, rather than answering directly themselves. Everything else remains the same.

Results

Figure 3 shows the results of the experiment. GPT-4o-mini fails completely. It predicts where the object is rather than where humans would think the character will look, showing the same reality-tracking error as in the main experiment. On the other hand, GPT-4o underestimates the human responses for the probability of the blue box. GPT-5 overestimates human responses across all reasoning levels, predicting 80–85% for blue compared to actual human rates of 59–71%. Nevertheless, the model shows partial awareness of curse-of-knowledge. Across reasoning levels, 15–21% of its reasoning traces mention "reality bias" (participants' knowledge affecting predictions), the correct mechanism. However, 30–36% use "position-tracking" framing (Vicki confusing location vs. identity), attributing bias to the character rather than participants. Additionally, we found that 46–62% of GPT-5 reasoning traces mention research findings ("studies show 80–90% choose the original container"). However, those studies typically use simpler 2-container setups, not 4-container designs like this task.

Acknowledgements

We would like to thank two anonymous reviewers for their valuable comments and suggestions that improved the paper.

References

Baron-Cohen, S.; Leslie, A. M.; and Frith, U. 1985. Does the autistic child have a "theory of mind"? Cognition, 21(1): 37–46.
Birch, S. A. J.; and Bloom, P. 2007. The Curse of Knowledge in Reasoning About False Beliefs. Psychological Science, 18(5): 382–386.
Camerer, C.; Loewenstein, G.; and Weber, M. 1989. The Curse of Knowledge in Economic Settings: An Experimental Analysis. Journal of Political Economy, 97(5): 1232–1254.
Frank, M. C.; and Goodman, N. D. 2025. Cognitive modeling using artificial intelligence. Annual Review of Psychology, 77.
Gandhi, K.; Fränken, J.-P.; Gerstenberg, T.; and Goodman, N. 2024. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36.
Kosinski, M. 2023. Theory of Mind May Have Spontaneously Emerged in Large Language Models. arXiv preprint arXiv:2302.02083.
Pohl, R. F. 1998. The effects of feedback source and plausibility of hindsight bias. European Journal of Cognitive Psychology, 10(2): 191–212.
Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4): 515–526.
Ryskin, R. A.; and Brown-Schmidt, S. 2014. Do adults show a curse of knowledge in false-belief reasoning? A robust estimate of the true effect size. PLoS ONE, 9(3): e92406.
Shapira, N.; Levy, M.; Alavi, S. H.; Zhou, X.; Choi, Y.; Goldberg, Y.; Sap, M.; and Shwartz, V. 2023. Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 14961–14979.
Ullman, T. 2023. Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks. arXiv preprint arXiv:2302.08399.
Whiten, A.; and Byrne, R. 1991. Natural theories of mind: Evolution, development and simulation of everyday mindreading. B. Blackwell Oxford, UK.
Wimmer, H.; and Perner, J. 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13(1): 103–128.

Figure 2: Image-based experiment results. The original Birch and Bloom (2007) study used images. GPT-5 and GPT-4o-mini show patterns consistent with text results. GPT-4o exhibits inconsistent responses in the image condition.

Figure 3: Model predictions of human responses across conditions.

The Resonance Corpus: A Large-Scale Chinese Parent–Child Conversation Dataset

Lingqin Meng 1,2,+, Yang Gao 1,2,+, Zhongzhi Huo 3,*, Stella Christie 1,2,*
1 Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing, China
2 Tsinghua Laboratory of Brain and Intelligence, Tsinghua University, Beijing, China
3 Teachers College, Columbia University, New York, USA
+ These authors contributed equally to this work.
* Correspondence to christie@mail.tsinghua.edu.cn

Abstract

We introduce the Resonance Corpus, a large-scale collection of naturalistic Chinese parent–child conversations. Each week, we send four news prompts through a WeChat mini-program. Families choose when to talk at home and then upload transcripts. The corpus currently includes about 60,000 conversations from over 3,000 families. We manually coded 6,840 dialogues (about 90,000 utterances) across 179 topics. For each utterance, coders assign three labels: conversational turn, cognitive level (Blank L1–L4), and analogical reasoning. The corpus enables research on cognitively grounded dialogue. It supports metrics for cognitive alignment, topic drift, and novelty. In the future, we plan to release the corpus and the coding protocol. This resource links cognitive-pragmatic theory to computable measures of interactive intelligence.

Introduction and Motivation

Theory of Mind (ToM) in AI refers to an AI system's ability to infer and adapt to the cognitive states of human partners (Leslie et al., 2004). Despite advances in language models, current AI dialogue systems often struggle with extended, nuanced interactions, failing to adjust to the listener's knowledge or maintain engaging conversations (Xiao et al., 2023; Zhou et al., 2024). Human dialogue, especially between parents and children, inherently involves cognitively grounded strategies: caregivers simplify or elaborate explanations based on the child's understanding, use analogies to bridge knowledge gaps, and switch topics to sustain interest (Bergelson et al., 2023; Weng et al., 2025; Reece et al., 2023). However, we lack large-scale natural dialogue data annotated with such cognitive features. In particular, no open large-scale Chinese parent–child conversation corpus has been available, hindering cross-cultural comparison and ToM research.
To address this gap, we introduce the Resonance Corpus, a substantial collection of real-world Chinese parent–child conversations with detailed cognitive annotations. This resource is motivated by the need for data that links cognitive-pragmatic theory with computable metrics of interaction, enabling development and evaluation of ToM in AI dialogue systems.

Corpus Collection and Structure

An ongoing nationwide initiative in China collects about 100 parent–child discussions each day, forming the core of the Resonance Corpus. Weekly news prompts catalyze conversations: each week, families receive four child-friendly news topics via a WeChat mini-program, choose suitable times to discuss them at home, and upload the recorded transcripts, as shown in Figure 1. This approach yields naturalistic dialogues grounded in current events, ensuring shared context and encouraging rich exchanges.

Figure 1: Corpus collection process.

Data collection is still in progress, and to date, the corpus comprises nearly 60,000 conversations from approximately 3,000 families. We focus on a one-month subset of 6,840 dialogues (about 90,000 utterances), which we manually annotated for detailed analysis. These discussions cover 179 distinct topics (e.g., technology, society) and typically last 5–15 minutes each. Each conversation entry is accompanied by metadata such as participant ages, gender, socio-economic background, setting (e.g., bedtime, mealtime), and subjective ratings of engagement, providing context for interpreting the interactions.

Annotation Scheme: We decompose multi-turn dialogues into utterance-level units, guided by the University of Chicago's Language Development Project. Then, we label every utterance in the annotated subset along three dimensions: (1) conversational turn, (2) the cognitive level of language used, and (3) the presence of analogical reasoning. We code cognitive level with Marion Blank's L1–L4 framework, which ranges from simple, concrete queries (L1) to complex, abstract reasoning (L4), capturing each utterance's degree of abstraction and inferential demand (Blank, 1974; Blank et al., 1978). We mark analogical reasoning under Dedre Gentner's structure-mapping theory to indicate when a speaker draws an analogy or comparative example to explain a concept (Gentner, 1983, 2016). Together, these annotations let us quantify and trace how a parent adjusts communication complexity and employs explanatory strategies in accordance with the child's cognitive level. The annotated corpus also lets us compute novel interaction metrics—for example, measuring cognitive alignment between parent and child turns, detecting topic drift, and assessing the introduction of novel information in dialogue (Cai et al., 2025; Schmidt et al., 2025).

ToM-Oriented Observations

L-Level Mismatch and Interaction Breakdowns

We hypothesize that misalignment of cognitive levels precipitates dialogue breakdowns. When caregivers pose questions at levels that exceed a child's representational capacity—for example, an L4 abstract prompt to a child operating primarily at L2—the child often fails to produce an adequate response and the exchange degrades. Preliminary analyses indicate that conversations stall more frequently when adults do not calibrate linguistic complexity to the child's level of understanding. This pattern highlights a core ToM requirement: speakers must infer the interlocutor's epistemic state and adapt their utterances accordingly, or communication is likely to fail.
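To make the annotation scheme and the notion of L-level alignment concrete, the sketch below shows one possible utterance-level record and a toy mismatch measure between adjacent caregiver and child turns. The field names and the scoring rule are illustrative assumptions, not the corpus's released coding protocol or metrics.

```python
# Illustrative sketch only: a possible record for the three utterance labels
# described above (turn, Blank L1-L4 level, analogy flag) and a toy measure of
# L-level mismatch between adjacent caregiver and child turns. The real coding
# protocol and interaction metrics are defined by the corpus authors, not here.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str        # "parent" or "child"
    turn: int           # conversational turn index
    blank_level: int    # Blank L1-L4 cognitive level (1-4)
    is_analogy: bool    # analogical reasoning present (structure-mapping sense)
    text: str = ""

def mean_level_mismatch(dialogue: list[Utterance]) -> float:
    """Average absolute gap in Blank level between each parent utterance and the
    child utterance that immediately follows it; larger values suggest the
    caregiver is pitching above (or below) the child's current level."""
    gaps = [abs(a.blank_level - b.blank_level)
            for a, b in zip(dialogue, dialogue[1:])
            if a.speaker == "parent" and b.speaker == "child"]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Toy dialogue: an L4 abstract question answered at L2, then an analogical repair.
dialogue = [
    Utterance("parent", 1, 4, False, "Why do you think cities build subways?"),
    Utterance("child", 2, 2, False, "The train is underground."),
    Utterance("parent", 3, 2, True, "It's like an ant tunnel under the grass."),
    Utterance("child", 4, 2, False, "Ants walk in tunnels!"),
]
print(mean_level_mismatch(dialogue))  # 1.0 for this toy example
```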
Analogical Reasoning as a Bridge

The corpus documents frequent use of analogy by caregivers—and, at times, children—to introduce and clarify new ideas. Caregivers recruit familiar concepts (e.g., aligning story elements with a child's personal experiences) to ground abstract content at the child's cognitive level. Such analogical moves constitute a ToM-driven strategy: speakers model the child's perspective and knowledge base, then select source domains the child can readily map. We annotate these instances and analyze how analogy frequency and structural complexity vary with age and topic. These patterns inform the design of dialogue agents that deploy analogy appropriately in explanation and instruction—a core capability for cognitively savvy communication.

Topic Drift as Creative Exploration

Human dialogues, especially with children, often readily depart from the initial topic, a phenomenon we refer to as topic drift. Rather than sticking rigidly to the prompt, parent–child conversations tend to wander into related subtopics or even entirely new threads. We hypothesize that this drift reflects divergent thinking and creative exploration: a child might latch onto a tangential detail that sparks a new discussion, or a parent might introduce a loosely related anecdote to keep the child engaged. Far from being a sign of incoherence, such fluid topic shifts may be a mechanism for learning and creativity in conversation. In contrast, most current dialogue agents lack this flexibility: they strictly follow the user's last query and rarely initiate spontaneous topic shifts. By quantifying topic drift in the Resonance Corpus, we aim to provide a benchmark for more human-like, creative conversational dynamics that future AI should emulate.

What to Say and When to Speak

Caregivers often monitor the child's engagement and understanding in real time, then adjust their approach. For instance, if a child appears confused or loses interest, the parent might rephrase the question, give a hint, or decide to wrap up the discussion. Children, too, exhibit similar acts (e.g., saying "I don't know" or asking for hints when stuck). These behaviors reflect an awareness of one's own and the partner's cognitive state. The Resonance Corpus provides abundant examples of such adaptive strategies. Modeling this in AI would require an agent to not only reason about content but also gauge the success of the interaction and modify its strategy accordingly.

Applications and Future Work

We advance the Resonance Corpus as a catalyst for research and development on ToM in AI. First, it will be a benchmark dataset for evaluating conversational AI on ToM capabilities. For example, one can test whether a dialogue agent can infer a user's (simulated child's) knowledge level from context and adjust its responses to avoid L-level mismatches. Second, the corpus can support training data for cognitively aware dialogue models. By fine-tuning on parent–child conversations, language models may learn to incorporate strategies like analogies and mid-dialogue adjustments that are emblematic of ToM in interaction (Hu et al., 2024).

Looking ahead, we plan to release the Resonance Corpus along with the detailed coding protocol to encourage broad engagement from the community. We are also validating the initial hypotheses with quantitative analyses. We plan to establish a theory-informed framework for interactive AI, where insights from cognitive science directly inform model architectures and evaluation metrics.
The Resonance Corpus links cognitive-pragmatic theory to real-world language data, laying the groundwork for AI systems that understand and adapt to human cognitive states in conversation.

References

Bergelson, E., Soderstrom, M., Schwarz, I.-C., Rowland, C. F., Ramírez-Esparza, N., R. Hamrick, L., ... others (2023). Everyday language input and production in 1,001 children from six continents. Proceedings of the National Academy of Sciences, 120(52), e2300671120.
Blank, M. (1974). Cognitive functions of language in the preschool years. Developmental Psychology, 10(2), 229.
Blank, M., Rose, S. A., & Berlin, L. J. (1978). The language of learning: The preschool years.
Cai, J., Hadjinicolaou, A. E., Paulk, A. C., Soper, D. J., Xia, T., Wang, A. F., ... Cash, S. S. (2025). Natural language processing models reveal neural dynamics of human conversation. Nature Communications, 16(1), 3376.
Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2), 155–170.
Gentner, D. (2016). Language as cognitive tool kit: How language supports relational thought. American Psychologist, 71(8), 650.
Hu, J., Mahowald, K., Lupyan, G., Ivanova, A., & Levy, R. (2024). Language models align with human judgments on key grammatical constructions. Proceedings of the National Academy of Sciences, 121(36), e2400917121.
Leslie, A. M., Friedman, O., & German, T. P. (2004). Core mechanisms in 'theory of mind'. Trends in Cognitive Sciences, 8(12), 528–533.
Reece, A., Cooney, G., Bull, P., Chung, C., Dawson, B., Fitzpatrick, C., ... Marin, S. (2023). The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances, 9(13), eadf3197.
Schmidt, H., Bergey, C. A., Zhou, C., Helion, C., & Hawkins, R. (2025). Dynamics of topic exploration in conversation. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 47).
Weng, Z., Bravo-Sánchez, L., Wang, Z., Howard, C., Xenochristou, M., Meister, N., ... others (2025). Artificial intelligence–powered 3D analysis of video-based caregiver-child interactions. Science Advances, 11(8), eadp4422.
Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2023). Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., & Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature, 634(8032), 61–68.

Theory of Mind and Optimistic Beliefs Emerge in a Sequential Dilemma with Incremental Rewards

Tom Lenaerts 1,2,3, Marco Saponara 1, Jorge M. Pacheco 4,5, Francisco C. Santos 4,5
1 Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
2 Artificial Intelligence Lab, Vrije Universiteit Brussel, 1050 Brussels, Belgium
3 Center for Human-Compatible AI, University of California, Berkeley, 94702 Berkeley, USA
4 ATP-group, P-2744-016 Porto Salvo, Portugal
5 INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, IST-Taguspark, 2744-016 Porto Salvo, Portugal
Tom.Lenaerts@ulb.be, Marco.Saponara@ulb.be, jorge.m.pacheco@inesc-id.pt, franciscocsantos@tecnico.ulisboa.pt

Abstract

Theory of Mind (ToM) is considered to be an asset for autonomous agents: having the capacity to infer the beliefs and intentions of others is often assumed to lead to better solutions, displaying more advanced intelligence.
While the explicit integration of ToM in agents for solving specific tasks is regularly studied, it is not understood what conditions encourage agents to acquire and prefer ToM and what other effects it has on the agents' behaviour. Using a game theoretical model in which a finite population of agents uses strategies that incorporate (or not) ToM, we find that ToM strategies evolve and prevail, provided agents are boundedly rational and uncertainty about higher future gains is part of the strategic environment. Furthermore, we show that this non-deterministic reasoning co-evolves with a bias towards optimistic beliefs, which appears to be compensated by stochastic recursive reasoning.

Introduction

We report here on published work (Lenaerts et al. 2024) wherein we studied the conditions for the emergence of Theory of Mind (ToM) strategies in the incremental centipede game (ICG) (Rosenthal 1981). It is a game with perfect information, involving two agents that take turns deciding between the actions Take (T) and Postpone (P) regarding the split of a resource with initial value M that doubles at each Step l of the game, with l ∈ [1, L]. Playing T in step l means ending the game and receiving the largest fraction of the resource available. Playing P means doubling the value of the resource while letting the other player decide what to do in the next step. When the second player plays P in step l = L, different outcomes may be considered (Krockow, Colman, and Pulford 2016). Here, we use the game structure and payoff values from (McKelvey and Palfrey 1992), where the resource is doubled once more and the resulting amount is split with the largest fraction given to the first player. The ICG is a mixed-motive game, which is captured by a common interest to grow the resource so that the gains increase and a conflict of interest between the players to get the biggest portion of that resource. The game thus incorporates strategic uncertainty about the intentions of the co-player as well as the future gains, which makes it interesting for the study of ToM (Rusch et al. 2020).

While the subgame perfect equilibrium is to play T as early as possible, experimental results have repeatedly contradicted this outcome (Krockow, Colman, and Pulford 2016). In an attempt to explain such results through a stochastic evolutionary dynamics (SED) model wherein agents have no ToM and simply play T at pre-defined steps, a match with experimental data can be obtained when selection strength (defined as β in our paper) is low (Rand and Nowak 2012; Smead 2008). The argument is that low β introduces an uncertainty regarding payoffs, allowing strategies with an intermediate l to survive. While this is indeed an advantage of SED, it ignores that players may have acquired a ToM mechanism to deal with the uncertainty in this kind of strategic situation. We investigated whether evolution would select for ToM agents and what this would mean in terms of their beliefs and capacity to reason.

Agents are defined with a ToM of variable cognitive capacity, associated with a level k ≥ 0 (Kawagoe and Takizawa 2012). Specifically, strategies are defined as a duple (t, k), where t is the Step l at which they intend to play T when there is no reasoning and k is the number of reasoning steps the agent can execute over their initial intention (see (Lenaerts et al. 2024) for details); a schematic sketch of this recursive strategy representation is given below.
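The sketch below is only a schematic illustration of a (t, k) strategy and its level-k recursion with cognitive error ε, as described in the surrounding text. It deliberately ignores turn alternation and the exact McKelvey–Palfrey payoffs, so it is not the model analysed in (Lenaerts et al. 2024).

```python
# Schematic sketch of a (t, k) strategy with level-k recursion and cognitive
# error epsilon. Turn alternation and the exact McKelvey-Palfrey payoffs are
# deliberately ignored; this is NOT the model from (Lenaerts et al. 2024),
# only an illustration of the recursive best-response structure.
import random

L = 6  # number of steps in the incremental centipede game

def take_step(t: int, k: int, eps: float = 0.0) -> int:
    """Step at which a (t, k) agent intends to play Take.
    k = 0: no ToM, take at the a-priori intention t.
    k > 0: believe the co-player uses (t, k-1), then best-respond by taking
    one step before the co-player would (never earlier than step 1).
    With probability eps the computed step is perturbed by +/- 1."""
    if k == 0:
        step = t
    else:
        opponent_step = take_step(t, k - 1, eps)
        step = max(1, opponent_step - 1)
    if eps > 0.0 and random.random() < eps:
        step = min(L, max(1, step + random.choice((-1, 1))))
    return step

# A k = 2 agent that intends to take late (t = 6) reasons itself to an earlier step.
print(take_step(t=6, k=2, eps=0.0))   # 4 in the error-free case
print([take_step(t=6, k=2, eps=0.2) for _ in range(5)])
```

Repeated plays with eps > 0 give the per-agent distribution over Take steps referred to in the next paragraph.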
Thus, at the lowest level, k = 0, agents have no ToM and simply play T at the predefined moment (as in Rand and Nowak (2012)). Agents with a ToM (k > 0) believe that their opponent will employ the strategy (t, k − 1) and, as a result, will compute a best response to this belief, based on a level-k hierarchical recursive model (Camerer, Ho, and Chong 2004). In practice, by creating beliefs about others and anticipating what they will do, agents calculate a best response in a recursive way. As a result, the decision (by agents with a ToM) at which step to play T may coincide (or not) with their a-priori belief (encoded in t), a feature which cannot happen in the absence of ToM. By allowing a possible mismatch between beliefs, intentions and actions (that is, misbeliefs), our model allows us to further study the co-evolution of a ToM with such misbeliefs (McKay and Dennett 2009), and moves this work beyond the classic assumption for Nash equilibria in game theory that agents have correct beliefs (Costa-Gomes and Weizsäcker 2008; Du 2012).

We further assume agents can make cognitive errors with a probability ε ≥ 0 when calculating their belief and their best response: whenever ε > 0, agents may decide to play T at one step higher or lower (at each k-level) than what a flawless best-response computation would determine. For simplicity ε is assumed to remain constant in the population. Each agent with k > 0 and ε > 0 may thus generate every time she plays the ICG a new t-value, which over multiple independent games produces a probability distribution over the steps l where she intends to play T. Differences in the implementation of k-level reasoning mechanisms will lead to differences in the probability distributions, as is shown in (Lenaerts et al. 2024) and also (Saponara et al. 2025).

Results and Discussion

The full description of the methods and the results can be found in (Lenaerts et al. 2024). Here we focus on two results.

Figure 1: Evolution of ToM in the ICG with L = 6 for different values of β and ε.

Figure 1 shows the results of our model after letting strategies evolve in a finite population, for different values of the free external parameters of the model: selection strength β (the lower β, the more random strategy imitation becomes; see original paper) and cognitive error probability ε (as discussed in the Introduction). In Figure 1A we plot the average k-level in the population, where the red circle indicates the optimum β∗ and ε∗ settings at which a best fit to the experimental data is obtained (see Figure 1D and supplementary information of the original paper). Figures 1B and 1C portray the dependence of our results for the k-level distribution on each of the parameters while keeping the other fixed at the optimum value. Whereas increasing ε (at β = β∗) selects for low k-levels, increasing β (at ε = ε∗) appears to have the opposite effect.

The results in Figure 1 show that strategies with k > 0 evolve and prevail for a wide range of values of the external parameters β and ε, generating the strongest reasoning capacities for high β and a specific window of ε values.
Yet, human data falls outside this region, fitting with β and ε values that have an average k-level close to 2, which is associated with a heterogeneous distribution of low to intermediate values of cognition (in accordance with experimental studies of ToM (McKelvey and Palfrey 1992; Kawagoe and Takizawa 2012) and their analyses). Details of this distribution are provided in Figure 1, panels B and C. Differences and similarities with other models are discussed in the Appendix of this abstract.

Figure 2: Evolved strategies in the population at calibrated parameter values (β∗, ε∗). Panel D focusses on the differences between agent beliefs and the actions they take, highlighting the level of misbeliefs, which need to be compensated by reasoning to match the experimental data.

Figure 2 provides detailed information about the belief (t), reasoning (k) and action (T) distributions in the population at the best fit parameter values (β∗, ε∗). In Figure 2A we show how the distribution of beliefs changes depending on whether agents may or may not acquire a ToM: allowing k > 0 (blue) leads to the emergence of an optimism bias (t > T) (Sharot 2011), as the distribution becomes skewed towards high t values, compared with the baseline where k = 0 (yellow). This feature stems from the prevalence of agents with k > 0, as one can see when comparing the coloured regions in Figure 2B and C: the actions (T value, Figure 2C) mostly take place at values smaller than their beliefs (t value, Figure 2B), where the difference is shown in Figure 2D. In line with what was observed in Figure 1, the reasoning levels that co-evolve with such an optimism bias do not peak at k = 6: instead, and similarly to the experimental results, the distribution of k-levels peaks at k = 2 and strategies with k > 3 become the least prevalent in the population.

In the context of the ICG, these results show that limited levels of ToM, defined in terms of a stochastic reasoning process, emerge (even without considering cognitive costs) along with inexact beliefs about others, grounding the results in biological (Varki and Brower 2013; Trivers 2011; Von Hippel and Trivers 2011) and psychological (Sharot 2011) literature. We also showed that the results remain robust under different perturbations of the model. AI systems will need to handle this non-rationality problem in hybrid ecosystems, as the resulting actions may be perceived as deceptive behaviour, generating negative responses (Sarkadi et al. 2019; Köbis et al. 2025). Future work also includes research into belief updating in repeated interactions as well as novel behavioural experiments to further validate the current observations.

Appendix - Related work

In this appendix, we provide a summary of some research that has investigated the evolutionary origins of ToM and discuss their relation with our work.

Stahl (1993) used a replicator dynamics approach, pitting smart players, that used a hierarchy of best-response strategies, in competition with less smart ones. He showed that smartness expressed as a higher ToM level does not lead to superior fitness and concluded that taking the right action is as good as being smart. While less smart players, with low k-levels, survive and co-exist with smarter ones (see Figure 2B) in our model, we observe that a certain level of smartness (i.e. k = 2) may dominate, while higher levels disappear rapidly, resulting in a better match with ICG experimental data.
Moreover, we observe that misbeliefs co-evolve with this smartness, hinting at the idea that smartness may compensate suboptimal beliefs in order to arrive at good decisions. Model differences, such as how decisions are inferred from beliefs and what individuals know, as well as differences in the evolutionary process, may explain the differences in results.

Devaine, Hollard, and Daunizeau (2014) also used a deterministic replicator model to assess when lower or higher recursive ToM levels survive. Fitness in their case corresponded to the average accumulated payoff obtained through a sophisticated Bayesian mechanism (like in (Yoshida, Dolan, and Friston 2008)) that learns to forecast the behaviour of an opponent in repeated games. The repetition of the interaction to determine fitness, the use of a deterministic model with strong selection, as well as the training of an optimal Bayesian predictor distinguish their work from ours. Our model is much simpler, assuming strategies that are defined by an immutable belief and a specific recursive (stochastic) reasoning power, having thus no capacity to learn from observations. We observe that a distribution of limited k-levels survives (including k = 0) and that, even in this mixed-motive context with non-repeated interactions and no learning of optimal beliefs, ToM sophistication is bounded.

Finally, Qi and Vul (2022) analysed a similar idea, asking what type of environment is more conducive to the evolution of a Bayesian ToM agent, comparing it with several fixed-action agents as well as those using other types of inference or learning to determine a co-player's attitude towards the focal player. They showed that uncertainty in the game environment leads to the dominance of such a Bayesian ToM strategy over other strategies, underlining, like us and others, the importance of uncertainty for the emergence of ToM (e.g. (De Weerd, Verbrugge, and Verheij 2022; Kleiman-Weiner et al. 2025)). Yet their scientific question differs from ours, as we did not explore the selection dynamics between different learning and reasoning strategies. In our work, we considered different forms of k-level reasoning (see Supplementary Information of (Lenaerts et al. 2024)), but for each result all agents used the same one. Specifically, a stochastic recursive reasoning process wherein agents have a certain inertia to change (a kind of my-side bias (Mercier and Sperber 2017)) leads to the results reported in this abstract. In a follow-up work (Saponara et al. 2025), we showed that also when different reasoning strategies are co-evolving, a biased form of reasoning, reminiscent of wishful thinking, dominates the population under conditions best suited to the experimental data.

In summary, the work summarized in this abstract is a minimalistic model, applying Occam's razor to see what basic processes allow for the emergence of ToM. On top of this model, other sophisticated components, as introduced in the other work, can be added, which will allow one to explore whether current conclusions hold and what novel elements emerge in the process.

Acknowledgements

The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government. T.L. and M.S. gratefully acknowledge the research support of the F.R.S-FNRS (grants 31257234 and 40007793). T.L. further acknowledges the support of the F.W.O.
(grant G.0391.13N), the Service Public de Wallonie Recherche (grant 2010235–ARIAC) by DigitalWallonia4.ai and the Flemish Government through the AI Research Program. T.L. and F.C.S. also acknowledge the support of TAILOR, a project funded by the EU Horizon 2020 research and innovation program (grant 952215). J.M.P and F.C.S gratefully acknowledge funding from Fundação para a Ciência e Tecnologia Portugal through grants PTDC/MAT-APL/6804/2020 and PTDC/CCI-COM/366/2020.

References

Camerer, C. F.; Ho, T.-H.; and Chong, J.-K. 2004. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3): 861–898.
Costa-Gomes, M. A.; and Weizsäcker, G. 2008. Stated beliefs and play in normal-form games. The Review of Economic Studies, 75(3): 729–762.
De Weerd, H.; Verbrugge, R.; and Verheij, B. 2022. Higher-order theory of mind is especially useful in unpredictable negotiations. Autonomous Agents and Multi-Agent Systems, 36(2): 30.
Devaine, M.; Hollard, G.; and Daunizeau, J. 2014. Theory of mind: did evolution fool us? PLoS One, 9(2): e87619.
Du, S. 2012. Correlated equilibrium and higher order beliefs about play. Games and Economic Behavior, 76(1): 74–87.
Kawagoe, T.; and Takizawa, H. 2012. Level-k analysis of experimental centipede games. Journal of Economic Behavior & Organization, 82(2-3): 548–566.
Kleiman-Weiner, M.; Vientós, A.; Rand, D. G.; and Tenenbaum, J. B. 2025. Evolving general cooperation with a Bayesian theory of mind. Proceedings of the National Academy of Sciences, 122(25): e2400993122.
Köbis, N.; Rahwan, Z.; Rilla, R.; Supriyatno, B. I.; Bersch, C.; Ajaj, T.; Bonnefon, J.-F.; and Rahwan, I. 2025. Delegation to artificial intelligence can increase dishonest behaviour. Nature, 646: 126–134.
Krockow, E. M.; Colman, A. M.; and Pulford, B. D. 2016. Cooperation in repeated interactions: A systematic review of Centipede game experiments, 1992–2016. European Review of Social Psychology, 27(1): 231–282.
Lenaerts, T.; Saponara, M.; Pacheco, J. M.; and Santos, F. C. 2024. Evolution of a theory of mind. iScience, 27(2): 108862.
McKay, R. T.; and Dennett, D. C. 2009. The evolution of misbelief. Behavioral and Brain Sciences, 32(6): 493–510.
McKelvey, R. D.; and Palfrey, T. R. 1992. An experimental study of the centipede game. Econometrica, 803–836.
Mercier, H.; and Sperber, D. 2017. The enigma of reason. Harvard University Press.
Qi, W.; and Vul, E. 2022. The evolution of theory of mind on welfare tradeoff ratios. Evolution and Human Behavior, 43(5): 381–393.
Rand, D. G.; and Nowak, M. A. 2012. Evolutionary dynamics in finite populations can explain the full range of cooperative behaviors observed in the centipede game. Journal of Theoretical Biology, 300: 212–221.
Rosenthal, R. W. 1981. Games of perfect information, predatory pricing and the chain-store paradox. Journal of Economic Theory, 25(1): 92–100.
Rusch, T.; Steixner-Kumar, S.; Doshi, P.; Spezio, M.; and Gläscher, J. 2020. Theory of mind and decision science: towards a typology of tasks and computational models. Neuropsychologia, 146: 107488.
Saponara, M.; Fernández Domingos, E.; Pacheco, J. M.; and Lenaerts, T. 2025. Evolution favours positively biased reasoning in sequential interactions with high future gains. Journal of the Royal Society Interface, 22(229): 20250153.
Sarkadi, Ş.; Panisson, A. R.; Bordini, R. H.; McBurney, P.; Parsons, S.; and Chapman, M. 2019. Modelling deception using theory of mind in multi-agent systems. AI Communications, 32(4): 287–302.
Sharot, T. 2011. The optimism bias. Current Biology, 21(23): R941–R945.
Smead, R. 2008. The evolution of cooperation in the centipede game with finite populations. Philosophy of Science, 75(2): 157–177.
Stahl, D. O. 1993. Evolution of smartn players. Games and Economic Behavior, 5(4): 604–617.
Trivers, R. 2011. Deceit and self-deception: Fooling yourself the better to fool others. Penguin UK. ISBN 0141971061.
Varki, A.; and Brower, D. 2013. Denial: Self-deception, false beliefs, and the origins of the human mind. Hachette UK. ISBN 1455511927.
Von Hippel, W.; and Trivers, R. 2011. The evolution and psychology of self-deception. Behavioral and Brain Sciences, 34(1): 1–16.
Yoshida, W.; Dolan, R. J.; and Friston, K. J. 2008. Game theory of mind. PLoS Computational Biology, 4(12): e1000254.

Theory of Mind for Explainable Human-Robot Interaction

Marie S. Bauer, Julia Gachot, Matthias Kerzel, Cornelius Weber, Stefan Wermter
Knowledge Technology, Department of Informatics, University of Hamburg
Vogt-Koelln-Str. 30, 22527 Hamburg, Germany
marie.bauer, julia.gachot, matthias.kerzel, cornelius.weber, stefan.wermter@uni-hamburg.de

Abstract

Within the context of human–robot interaction (HRI), Theory of Mind (ToM) is intended to serve as a user-friendly back-end to the interface of robotic systems, enabling robots to infer and respond to human mental states. When integrated into robots, ToM allows them to adapt their internal models to users' behaviors, enhancing the interpretability and predictability of their actions. Similarly, Explainable Artificial Intelligence (XAI) aims to make AI systems transparent and interpretable, allowing humans to understand and interact with them effectively. Since ToM in HRI serves related purposes, we propose to consider ToM as a form of XAI and evaluate it through the eValuation XAI (VXAI) framework and its seven desiderata. This paper identifies a critical gap in the application of ToM within HRI, as existing methods rarely assess the extent to which explanations correspond to the robot's actual internal reasoning. To address this limitation, we propose to integrate ToM within XAI frameworks. By embedding ToM principles inside XAI, we argue for a shift in perspective, as current XAI research focuses predominantly on the AI system itself and often lacks user-centered explanations. Incorporating ToM would enable a change in focus, prioritizing the user's informational needs and perspective.

Introduction

As interactions between humans and robots become increasingly common (Lee 2021), it is intuitive to seek more human-like modes of interaction to be able to understand robots' behaviors (Sridharan and Meadows 2019; Kerzel et al. 2023). This need naturally motivates the application of ToM in HRI. ToM refers to the human ability to attribute mental states such as beliefs, desires, and intentions to oneself and others to predict and explain behavior (Premack and Woodruff 1978). When embedded in robots, ToM methods emphasize understanding and adapting to users' mental states, and this can be used to produce explanations that are often more intuitive and user-friendly (Williams, Fiore, and Jentsch 2022). ToM also allows robots to interpret and respond to users' inferred mental states, fostering more natural, adaptive, and transparent interactions (Yuan et al. 2022). On the other hand, XAI aims to make black box models more transparent and interpretable; however, it frequently overlooks user-centered evaluations (Rong et al. 2024).
Since both XAI and ToM in HRI aim to make internal reasoning more understandable to humans and enhance human–AI collaboration, we propose considering ToM as a form of XAI, and therefore evaluate it accordingly. To this end, we evaluated recent ToM studies in HRI using an XAI evaluation framework and identified some limitations. Existing ToM approaches rarely assess whether the information presented to users accurately reflects the robot's internal reasoning, nor do they evaluate the robustness and reproducibility of the explanations.

To address gaps in both ToM and XAI, particularly regarding explanation fidelity and user-centered evaluation, we propose leveraging ToM within an XAI framework, combining ToM's user-centered perspective with XAI's technical rigor. This shift in perspective aims to enable evaluations that encompass both fidelity to the model and alignment with user understanding, ultimately narrowing the gap between system transparency and human interpretability.

Theory of Mind for HRI

ToM is often treated as a heuristic in artificial intelligence, where one of the participants is replaced by a robot. In this section, we review recent studies that have used ToM in HRI to evaluate human-AI collaboration and understanding.

Attributing ToM to Agents

Several studies examine whether humans naturally attribute ToM to robots even in the absence of explicit ToM mechanisms. A first study found that humans are able to interpret robots' behavior similarly to human behavior, provided that the robots display distinct and interpretable social cues. However, when a robot's cues deviate from human expectations, this understanding diminishes (Banks 2020). A second study, which examined the robustness and conviction of large language models (LLMs), demonstrated that while LLMs can serve as a useful tool in human–robot interaction (Becker et al. 2025), they do not function as reliable ToM agents (Verma, Bhambri, and Kambhampati 2024). These findings suggest that effective human–robot interaction is facilitated when robots produce responses that align with typical human behavior, and highlight the need for the integration of explicit ToM-like mechanisms in robotic systems.

Table 1: Evaluation of ToM in HRI studies using the eValuation XAI (VXAI) framework. Rows: Banks (2020); Mou et al. (2020); Cantucci and Falcone (2022); Kerzel et al. (2022); Shvo et al. (2022); Yuan et al. (2022); Verma, Bhambri, and Kambhampati (2024); Angelopoulos et al. (2025). Columns: Parsimony, Plausibility, Coverage, Fidelity, Continuity, Consistency, Efficiency. [Per-study check marks appear in the original table.]

Evaluating Understanding and Trust

A second line of research has investigated embedding ToM-like reasoning directly within robots and assessing its impact on trust, helpfulness, and mutual understanding. Some studies have focused on evaluating user perception, revealing that robots equipped with ToM capabilities are perceived more positively (Mou et al. 2020), particularly when they provide assistance aligned with users' goals (Cantucci and Falcone 2022). Similarly, robots that reason about human beliefs are generally considered more helpful and socially competent (Shvo et al. 2022), and are also regarded as more trustworthy (Angelopoulos et al. 2025). At the same time, when providing explanations, robots may fail to enhance user understanding or improve decision-making, as not all explanations are equally effective (Yuan et al. 2022).
In contrast, approaches that implement multiple levels of explanation have been shown to improve user comprehension and the interaction (Kerzel et al. 2022). Although these studies evaluate human–AI collaboration and occasionally describe their work as XAI, none assess it using XAI-specific criteria. Moreover, none have explicitly integrated ToM with XAI, highlighting a gap that our work addresses.

A Comprehensive Evaluation of ToM

While the field of ToM primarily claims to enhance user understanding, trust, and, more broadly, human–AI collaboration, these claims are often not systematically evaluated. This lack of evaluation stems from the fact that, if ToM purports to provide explanations for users, it should be assessed using the same criteria applied in XAI. Indeed, these objectives align closely with those of XAI, which aims to design AI systems that are interpretable and comprehensible to humans (Rong et al. 2024). We therefore propose to systematically evaluate ToM in line with the rigor of its claims using an XAI evaluation framework.

From this perspective, we examined whether existing ToM studies could be assessed using an XAI framework, positioning ToM itself as a framework for explanations. To this end, we evaluated state-of-the-art ToM studies using the seven desiderata defined in the eValuation XAI (VXAI) framework (Dembinsky et al. 2025), which integrates the principal evaluation criteria identified in recent reviews and provides a comprehensive and systematic approach to XAI evaluation. We evaluated the studies based on whether they explicitly addressed each of the VXAI desiderata in their work. The criteria and detailed definitions of the desiderata are provided in the Appendix.

As shown in Table 1, all of the ToM studies satisfy the Parsimony and Plausibility desiderata, indicating that they conducted user-centered experiments and assessed whether the explanations provided were perceived as believable. However, only two studies meet the Continuity and Consistency criteria, reflecting that many experiments either did not report the number of participants or involved fewer than 100 participants. This poses a limitation for scaling these findings to real-world applications and for reproducibility. None of the studies satisfies the Coverage desideratum, as none reported the number of successful versus unsuccessful interactions. Similarly, the Fidelity desideratum has not been addressed, indicating that none of the studies examined the internal reasoning process of the model. Consequently, it remains unclear whether the explanations accurately reflect the model's behavior, which raises the risk of misleading users.

Together, these findings suggest that while ToM provides a valuable framework for user-centered evaluation, current studies lack rigorous XAI assessments. This work seeks to address that gap by integrating ToM within an XAI framework, where fidelity serves as a central criterion for evaluating the alignment between explanations and the model's actual reasoning.

ToM as a User-Centered XAI Solution

Effective XAI should provide explanations that are both faithful to the system's reasoning and understandable to users (Rong et al. 2024). Since existing methods often overlook user-centered design (Rong et al. 2024), we propose an integrated approach that combines ToM's user focus with model-centered XAI techniques.
This integration enables explanations that are simultaneously interpretable for users and faithful to the model's internal reasoning. While some existing ToM integrations have proposed using Bayesian reinforcement learning to model user behavior, future work could explore the use of behavior trees or explainable reinforcement learning (XRL) within ToM-based systems. Such approaches may support adaptive reasoning while further enhancing the fidelity of explanations.

Appendix 1

The seven desiderata proposed in VXAI (Dembinsky et al. 2025) can be described as follows:
• Parsimony: Explanations should remain succinct and avoid unnecessary complexity to improve human understanding.
• Plausibility: The explanation must correspond to human logic and intuition, making it believable and relatable.
• Coverage: Indicates whether an explanation can be generated for each relevant input or output case.
• Fidelity: The explanation should truthfully mirror the underlying decision-making process of the model.
• Continuity: Measures the robustness of explanations when minor variations are introduced into the input data.
• Consistency: Ensures that explanations are coherent and reproducible for identical or comparable instances.
• Efficiency: Reflects the practicality of the explanatory approach in terms of computational cost and its general applicability across different models or domains.

For evaluation purposes, the following mapping criteria are applied:
• When a human evaluation is conducted, the parsimony and plausibility desiderata are considered assessed.
• When the number of successful versus failed interactions is reported, the coverage desideratum is considered assessed.
• When the model's internal reasoning process is examined, the fidelity desideratum is considered addressed.
• If the study reports a minimum of 100 human participants, both continuity and consistency are deemed evaluated.
• If computational implementation details are provided, efficiency is regarded as evaluated.

Acknowledgments

The authors gratefully acknowledge funding from Horizon Europe under the MSCA grant agreement No 101168792 (SWEET) and No 101072488 (TRAIL).

References

Angelopoulos, G.; Hellou, M.; Vinanzi, S.; Rossi, A.; Rossi, S.; and Cangelosi, A. 2025. Robot, did you read my mind? Modelling Human Mental States to Facilitate Transparency and Mitigate False Beliefs in Human-Robot Collaboration. ACM Transactions on Human-Robot Interaction, 15(1): 1–29.
Banks, J. 2020. Theory of Mind in Social Robots: Replication of Five Established Human Tests. International Journal of Social Robotics, 12(2): 403–414.
Becker, D.; Ahrens, K.; Gaede, C.; Strahl, E.; and Wermter, S. 2025. Robots with Attitudes: Influence of LLM-Driven Robot Personalities on Motivation and Performance. In Proceedings of the 13th International Conference on Human-Agent Interaction (HAI '25). Association for Computing Machinery.
Cantucci, F.; and Falcone, R. 2022. Collaborative Autonomy: Human–Robot Interaction to the Test of Intelligent Help. Electronics, 11(19).
Dembinsky, D.; Lucieri, A.; Frolov, S.; Najjar, H.; Watanabe, K.; and Dengel, A. 2025. Unifying VXAI: A Systematic Review and Framework for the Evaluation of Explainable AI. arXiv:2506.15408.
Kerzel, M.; Allgeuer, P.; Strahl, E.; Frick, N.; Habekost, J.-G.; Eppe, M.; and Wermter, S. 2023. NICOL: A Neuro-Inspired Collaborative Semi-Humanoid Robot That Bridges Social Interaction and Reliable Manipulation. IEEE Access, 11: 123531–123542.
Kerzel, M.; Ambsdorf, J.; Becker, D.; Lu, W.; Strahl, E.; Spisak, J.; G ̈ ade, C.; Weber, T.; and Wermter, S. 2022. What’s on Your Mind, NICO?: XHRI: A Framework for eX- plainable Human-Robot Interaction.KI - K ̈ unstliche Intelli- genz, 36. Lee, I. 2021. Service Robots: A Systematic Literature Re- view.Electronics, 10(21). Mou, W.; Ruocco, M.; Zanatto, D.; and Cangelosi, A. 2020. When Would You Trust a Robot? A Study on Trust and The- ory of Mind in Human-Robot Interactions. In2020 29th IEEE International Conference on Robot and Human Inter- active Communication (RO-MAN), 956–962. Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind?Behavioral and Brain Sciences, 1(4): 515–526. Rong, Y.; Leemann, T.; Nguyen, T.-T.; Fiedler, L.; Qian, P.; Unhelkar, V.; Seidel, T.; Kasneci, G.; and Kasneci, E. 2024. Towards Human-Centered Explainable AI: A Survey of User Studies for Model Explanations.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4): 2104– 2122. Shvo, M.; Hari, R.; O’Reilly, Z.; Abolore, S.; Wang, S.- Y. N.; and McIlraith, S. A. 2022. Proactive Robotic As- sistance via Theory of Mind. In2022 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS), 9148–9155. Sridharan, M.; and Meadows, B. 2019.Towards a the- ory of explanations for Human–Robot collaboration.KI - K ̈ unstliche Intelligenz, 33(4): 331–342. Verma, M.; Bhambri, S.; and Kambhampati, S. 2024. The- ory of Mind Abilities of Large Language Models in Human- Robot Interaction: An Illusion? InCompanion of the 2024 ACM/IEEE International Conference on Human-Robot In- teraction, HRI ’24, 36–45. New York, NY, USA: Associa- tion for Computing Machinery. ISBN 9798400703232. Williams, J.; Fiore, S. M.; and Jentsch, F. 2022. Supporting Artificial Social Intelligence With Theory of Mind.Fron- tiers in Artificial Intelligence, 5. Yuan, L.; Gao, X.; Zheng, Z.; Edmonds, M.; Wu, Y. N.; Rossano, F.; Lu, H.; Zhu, Y.; and Zhu, S.-C. 2022.In situ bidirectional human-robot value alignment.Science Robotics, 7(68). 102 Theory of Mind in Human-in-the-Loop Sammie Katt 1 & Samuel Kaski 12 1 ELLIS Institute Finland 1 Department of Computer Science 2 Department of Computer Science Aalto University, FinlandUniversity of Manchester, United Kingdom name.lastname@aalto.fi Thanks to large language models and related developments in artificial intelligence (AI), human-AI systems are becom- ing increasingly flexible and widespread. Modeling humans is incredibly difficult, however, so the best known subclass of these systems — “human-in-the-loop” (HITL) — con- siders humans simply as a source of data (fig. 1). Exam- ples include personalized recommendation systems (Portu- gal, Alencar, and Cowan 2018), precision medicine sys- tems (Sundin et al. 2018) and, most notably LLM fine-tuning with reinforcement learning from feedback (Christiano et al. 2017; Ouyang et al. 2022). The underlying assumption, here, is that humans areoraclesand provide ground-truth labels. However, even for rather trivial HITL instances, oracles are poor models of humans. First, we may simply not know the right answer or, more likely, are biased (Kahneman, Slovic, and Tversky 1982). More relevant for us, users ad- just their answers on purpose to guide the interaction (Amer- shi et al. 2014; Colella et al. 2020). Saliently, it is known that users maintain a mental model of the AI with which they make predictions of the behavior of the system (Chan- drasekaran et al. 
2017; Williams, Park, and Breazeal 2019), a phenomenon called theory of mind.

This work makes steps towards such user models for human-in-the-loop problems (fig. 1, right). Starting from a general view of human-in-the-loop, we formalize a framework for explicitly defining, deriving, and inserting user models. From there, we focus on modeling users who consider the state of the system, motivated by ideas grounded in bounded, computationally rational agents. Then, we consider two concrete settings, and investigate the behavior of our proposed user models and their effect on the system.

[Figure 1: Human-in-the-loop systems. The AI queries the user, which the user then gives feedback over. Left shows the typical setting, in which feedback is drawn as f(x) + ε. Right is our proposal: users reason about the AI and give feedback to change it, in proportion to U(...; f).]

Preliminaries
We consider a minimal human-AI collaboration that is reduced to solely the interaction between the human and the AI (fig. 1). This sequential process is defined by the AI's query space Q, the user's feedback space H, and some function of interest f according to which the user gives feedback h ∈ H on some query q ∈ Q. Generally, we assume there is a (mutual) objective U : (∆F, F) → R, which assigns values (scores) to the AI's belief p_f(f), given the actual function. This belief is the posterior over f given the likelihood of the observed query-label data (Q, H) and the prior p_f:

p_f(f | Q, H) ∝ p_f(f) ∏_i p(h_i | q_i; f),    (1)

where p_f(f) is the prior and the product over i is the likelihood.

While minimal, this setting represents several important settings, including learning human preferences, maximizing them, learning in recommendation systems, as well as the recently popular (reward learning in) reinforcement learning from human feedback.

Bayesian optimization of human preferences (BO), for example, aims to sample-efficiently maximize the user's latent function f. To do so, the AI queries data points Q_BO ≜ X, which the user then scores, H_BO ≜ Y. The objective is to maximize the true value of the AI's best guess of the optimum. A solution in BO is a method for tracking the posterior over f (eq. (1)), commonly a Gaussian process (Williams and Rasmussen 2006), combined with an acquisition function that — when optimized — determines which query to pick, such as upper confidence bound (UCB) and expected improvement (EI) (Frazier 2018).

Problem. We challenge the likelihood model (in eq. (1)) that is typically assumed in current HITL works, which is that users output ground-truth labels, such as p(h | q; f) = N(f(x), σ) in BO. Instead, we propose a likelihood derived from first principles of modeling the user as a computationally bounded rational agent with objectives relative to the system's (AI) state.

Solution for HITL with ToM
Our solution is two-fold. First, we introduce a new likelihood that models how users take actions with the intention of influencing the (AI) system. Second, we derive the novel (Bayesian) inference problem of the AI.

User Model. We introduce two components (visualized on the right in fig. 1). First, we assume the user knows that the AI maintains a posterior over "the function" p_f(f | Q, H), given some prior over f. We describe the prior which the user believes is assigned to the AI with p_AI, and abuse notation by denoting p_AI(f | ...) as a posterior given prior p_AI. Second, we assume the user has some objective U_u : (∆F, F) → R that they optimize for, with respect to this posterior. This objective depends on the task, and can be infeasible to compute in practice (both for computers and humans), and so we allow for myopic approximations and heuristics. We model the user as giving feedback that optimizes for the (expected) utility U_u under their belief over the AI's model p_AI. We assume the Boltzmann-rational model (Jeon, Milli, and Dragan 2020; Yamagata et al. 2024) to convert utility functions into decision distributions:

π_u(h | Q, H, q, f) ∝ exp( U_u( p_AI(· | Q ∪ q, H ∪ h), f ) )    (2)

To summarize, we formalize a user with the tuple (f, p_AI, U_u), where p_AI is the user's estimation of the AI's belief p(f), and eq. (2) is the user feedback likelihood given objective U_u, which replaces the likelihood in eq. (1).

AI Inference. As originally, the AI aims to maximize the original task objective U, which includes learning f. The AI's posterior over f, given the likelihood derived above, now becomes

p(f | Q, H) = p(H | Q, f) p(f) / p(H | Q) ∝ p(H | Q, f) p(f) = p_f(f) ∏_i π_u(h_i | Q_<i, H_<i, q_i, f),    (3)

where the product over i is the likelihood of eq. (2). The key observations are that we do not assume each data point is i.i.d. (compare eqs. (1) and (3)), and that we implicitly exploit knowing the user prior p_AI and objective U_u.

A solution to the human-AI problem is an AI policy π_AI that maximizes the utility function U given a posterior over the user's internal latent function, eq. (3). In this preliminary work, we propose to adopt acquisition functions associated with the original task, α : (∆F, Q) → R, such as UCB for BO, and extend these to our novel posterior:

π_AI(H, Q) = arg max_q α( p(f | H, Q), q ),    (4)

where p(f | H, Q) is the posterior of eq. (3).

Empirical Evaluation: Proof of Concept
We present two settings in which we develop user models that reason about the system and give feedback with collaborative intentions. The results show that the performance of typical (AI) solutions varies widely, depending on the accuracy of the theory of mind of these user models.

Bayesian Optimization. We consider the running example of Bayesian optimization where the AI observes the user's score of queries, and is interested in finding the optimum. We consider two user priors p_AI, one "correct" and one "misspecified", and assume the user's objective is to minimize the distance between the true maximum and that of the AI's posterior mean:

U_u^BO(p(f'), f) = − | arg max_x f(x) − arg max_x E_{f'}[f'(x)] |

[Figure 2: (left) Regret of UCB in BO given ground truth, our intentional user, and the same method but with a misspecified (user) prior. Our user model mimics intentional users, which are able to give feedback such that the regret is much lower than with ground-truth labels. However, we show that this can also lead to poor performance when users have misspecified priors. (right) Accuracy of trained classifiers over time given different labeling strategies. Some strategies, such as the typically assumed random strategy, lead to good performance, whereas other reasonable strategies suffer.]

Figure 2 (left) compares the performance of BO on Zhou, a typical BO benchmark, using UCB and Gaussian processes, given feedback from the ground truth and from our user model with either prior. These experiments show that our proposed user model indeed has the ability to positively steer the optimization process, assuming perfect understanding of the AI, but, if not, predicts degrading performance.

Recommendation System. We perform a similar experiment in a setting where the AI infers preferences given binary labels selected by the user: labeled data (Q, H) is generated solely by the user and there is no querying mechanism.
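Before turning to the specific labeling policies, the following toy sketch may help make eq. (2) concrete, together with a single-step version of the update in eq. (3): a user with a small discrete hypothesis space chooses feedback in proportion to the utility of the AI posterior it induces. The candidate functions, noise model, and utility are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy hypothesis space: three candidate functions of a scalar query.
F = [lambda x: x, lambda x: 2 * x, lambda x: -x]
prior = np.ones(len(F)) / len(F)            # AI prior p_f(f)
feedback_space = np.linspace(-2.0, 2.0, 9)  # discrete feedback values H

def ai_posterior(q, h, p0):
    """AI belief update for one (q, h) pair under a Gaussian feedback-noise model."""
    lik = np.array([np.exp(-0.5 * (h - f(q)) ** 2) for f in F])
    post = p0 * lik
    return post / post.sum()

def user_utility(posterior, true_idx):
    """Toy U_u: the user wants the AI's posterior mass on the true function."""
    return posterior[true_idx]

def boltzmann_feedback(q, true_idx, p_ai, beta=5.0, rng=np.random.default_rng(0)):
    """Eq. (2)-style feedback: sample h with probability proportional to exp(beta * U_u)."""
    utils = np.array([user_utility(ai_posterior(q, h, p_ai), true_idx)
                      for h in feedback_space])
    probs = np.exp(beta * utils)
    probs /= probs.sum()
    return rng.choice(feedback_space, p=probs)

q, true_idx = 1.0, 1               # the user's latent function is f(x) = 2x
p_ai = prior.copy()                # here the user's model of the AI prior is correct
h = boltzmann_feedback(q, true_idx, p_ai)
print("intentional feedback:", h)
print("AI posterior over F:", ai_posterior(q, h, prior))
```

Replacing p_ai with a wrong prior in this sketch is the discrete analogue of the "misspecified" condition above: the user then steers the AI toward the wrong hypothesis.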
We consider several policiesπ u that the user may use to select which data pointqto label, such as “pick close to the decision boundary”. Figure 2 (right) shows that, while some strategies lead to performance similar to the (typically assumed) random labeling strategy, some labeling strategies severely affect the sample efficiency. In particular, users that label their favorite as well as elements that are just under the decision boundary, a strategy “best” we believe is quite reasonable, performs significantly worse. Conclusion In a world with increasing human-AI applications, how to effectively model the most crucial component — the users — becomes a key challenge. We explore and formalize the concept of theory of mind in HITL, where users explicitly reason about the (consequences of their feedback on the) systems they interact with. Preliminary results demonstrate that different user behaviors lead to drastically different sys- tem performances, confirming our concerns regarding user misspecification and highlighting the importance of our pro- posed solution. 104 Acknowledgements This work was supported by the Research Council of Fin- land (Flagship programme: Finnish Center for Artificial In- telligence FCAI, Grant 359207), ELISE Networks of Excel- lence Centres (EU Horizon:2020 grant agreement 951847), and UKRI Turing AI World-Leading Researcher Fellow- ship (EP/W002973/1). We acknowledge the research en- vironment provided by ELLIS Institute Finland. We also acknowledge the computational resources provided by the Aalto Science-IT Project from Computer Science IT and CSC–IT Center for Science, Finland. References Albrecht, S. V.; and Stone, P. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems.Artificial Intelligence, 258: 66–95. Amershi, S.; Cakmak, M.; Knox, W. B.; and Kulesza, T. 2014. Power to the people: The role of humans in interactive machine learning.AI Magazine, 35(4): 105–120. Arumugam, D.; Ho, M. K.; Goodman, N. D.; and Van Roy, B. 2023. Bayesian Reinforcement Learning with Limited Cognitive Load.arXiv preprint arXiv:2305.03263. Atrash, A.; and Pineau, J. 2009. A Bayesian reinforcement learning approach for customizing human-robot interfaces. InInternational Conference on Intelligent User Interfaces, 355–360. Baker, C. L.; Jara-Ettinger, J.; Saxe, R.; and Tenenbaum, J. B. 2017. Rational quantitative attribution of beliefs, de- sires and percepts in human mentalizing.Nature Human Behaviour, 1(4): 0064. Borji, A.; and Itti, L. 2013. Bayesian optimization explains human active search.Advances in Neural Information Pro- cessing Systems, 26. Brooks, C.; and Szafir, D. 2019. Building second-order mental models for human-robot interaction.arXiv preprint arXiv:1909.06508. Chandrasekaran, A.; Yadav, D.; Chattopadhyay, P.; Prabhu, V.; and Parikh, D. 2017. It takes two to tango: Towards the- ory of AI’s mind.arXiv preprint arXiv:1704.00717. Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences.Advances in Neural Information Pro- cessing Systems, 30. Colella, F.; Daee, P.; Jokinen, J.; Oulasvirta, A.; and Kaski, S. 2020. Human strategic steering improves performance of interactive optimization. InConference on User Modeling, Adaptation and Personalization, 293–297. Doshi, P.; Qu, X.; Goodie, A.; and Young, D. 2010. Mod- eling recursive reasoning by humans using empirically in- formed interactive POMDPs. InAutonomous Agents and MultiAgent Systems, 1223–1230. 
Dubey, A.; Abhinav, K.; Jain, S.; Arora, V.; and Puttaveer- ana, A. 2020. HACO: a framework for developing human- AI teaming.InProceedings of the 13th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference), 1–9. Fails, J. A.; and Olsen Jr, D. R. 2003. Interactive machine learning. InInternational Conference on Intelligent User Interfaces, 39–45. Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811. Gao, T.; Baker, C. L.; Tang, N.; Xu, H.; and Tenenbaum, J. B. 2019. The cognitive architecture of perceived animacy: Intention, attention, and memory.Cognitive science, 43(8): e12775. Gershman, S. J.; Horvitz, E. J.; and Tenenbaum, J. B. 2015. Computational rationality: A converging paradigm for intel- ligence in brains, minds, and machines.Science, 349(6245): 273–278. Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning.Ad- vances in Neural Information Processing Systems, 29. Hedden, T.; and Zhang, J. 2002. What do you think I think you think?: Strategic reasoning in matrix games.Cognition, 85(1): 1–36. Jeon, H. J.; Milli, S.; and Dragan, A. 2020. Reward-rational (implicit) choice: A unifying formalism for reward learn- ing.Advances in Neural Information Processing Systems, 33: 4415–4426. Kahneman, D.; Slovic, P.; and Tversky, A. 1982.Judgment under uncertainty: Heuristics and biases. Cambridge uni- versity press. Keurulainen, O.; Alcan, G.; and Kyrki, V. 2024. The Role of Higher-Order Cognitive Models in Active Learning.arXiv preprint arXiv:2401.04397. Lewis, R. L.; Howes, A.; and Singh, S. 2014. Computa- tional rationality: Linking mechanism and behavior through bounded utility maximization.Topics in Cognitive Science, 6(2): 279–311. Oulasvirta, A.; Jokinen, J. P.; and Howes, A. 2022. Compu- tational rationality as a theory of interaction. InProceedings of the CHI Conference on Human Factors in Computing Sys- tems, 1–14. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback.Advances in Neural Information Pro- cessing Systems, 35: 27730–27744. Portugal, I.; Alencar, P.; and Cowan, D. 2018. The use of machine learning algorithms in recommender systems: A systematic review.Expert Systems with Applications, 97: 205–227. Ramos, G.; Meek, C.; Simard, P.; Suh, J.; and Ghorashi, S. 2020. Interactive machine teaching: a human-centered ap- proach to building machine-learned models.Proceedings of the ACM on Human-Computer Interaction, 35(5-6): 413– 451. Sundin, I.; Peltola, T.; Micallef, L.; Afrabandpey, H.; Soare, M.; Mamun Majumder, M.; Daee, P.; He, C.; Serim, B.; Havulinna, A.; et al. 2018. Improving genomics-based pre- dictions for precision medicine through active elicitation of expert knowledge.Bioinformatics, 34(13): i395–i403. 105 Williams, C. K.; and Rasmussen, C. E. 2006.Gaussian pro- cesses for machine learning, volume 2. MIT press Cam- bridge, MA. Williams, R.; Park, H. W.; and Breazeal, C. 2019. A is for artificial intelligence: the impact of artificial intelligence ac- tivities on young children’s perceptions of robots. InPro- ceedings of the CHI Conference on Human Factors in Com- puting Systems, 1–11. Yamagata, T.; Oberkofler, T.; Kaufmann, T.; Bengs, V.; H ̈ ullermeier, E.; and Santos-Rodriguez, R. 2024. Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences. 
InICML 2024 Workshop on Models of Human Feedback for AI Alignment, volume 2024. Vienna, Austria: PMLR. Zhi-Xuan, T.; Mann, J.; Silver, T.; Tenenbaum, J.; and Mans- inghka, V. 2020. Online bayesian goal inference for bound- edly rational planning agents.Advances in Neural Informa- tion Processing Systems, 33: 19238–19250. Related Work This work touches upon too many topics to properly ac- knowledge given the space. We consider other works that model how humans model AIto be the most relevant. Spe- cific models of such nature have been proposed in (Brooks and Szafir 2019; Keurulainen, Alcan, and Kyrki 2024) which, respectively in navigation and AI-teaching, models how the human views the AI. This also includes insights in human steering in HITL (Colella et al. 2020), as well as helping users do so (Chandrasekaran et al. 2017). Work on modeling agents in multi-agent settings take a more ab- stract view — they rarely care for modelinghumanagents — but nevertheless are a rich source of ideas (Albrecht and Stone 2018; Hadfield-Menell et al. 2016), including ideas on theory-of-mind (Doshi et al. 2010; Hedden and Zhang 2002). Other related topics include methods for designing HITL systems (Fails and Olsen Jr 2003; Dubey et al. 2020; Ramos et al. 2020), inference over rational agents given their behavior (Atrash and Pineau 2009; Zhi-Xuan et al. 2020), and computational rationality as explanation for human be- havior (Gershman, Horvitz, and Tenenbaum 2015; Lewis, Howes, and Singh 2014; Oulasvirta, Jokinen, and Howes 2022). Lastly, literature that shows Bayesian reasoning in humans is particularly important motivation for modeling in this work (Borji and Itti 2013; Baker et al. 2017; Gao et al. 2019; Arumugam et al. 2023). 106 Theory of Mind through Partially Ordered Plans Sveta Paster Shainkopf 1 , Reuth Mirsky 2 Tufts University, MA, USA 1 svetlana.paster@tufts.edu 2 reuth.mirsky@tufts.edu Abstract Theory of Mind (ToM) is central to human social intelligence and essential for building trustworthy human-robot interac- tion. Current computational models of ToM often rely on a linear representation of actions, which fail to capture the flex- ibility and uncertainty of real human behavior. We propose partially ordered plan libraries (PoPLs) as symbolic scaffolds for ToM in robots. Unlike linear plans, PoPLs explicitly en- code multiple valid orderings of actions under the same goal, enabling robots to infer intentions from incomplete or am- biguous observations. This representation not only supports first-order ToM, a robot recognizes human intentions, but also lays the foundation for second-order ToM, where robots can project expectations back to humans and explain their rea- soning. By framing PoPLs as transparent, interpretable struc- tures, we offer a novel pathway toward explainable ToM- based AI agents, bridging symbolic reasoning with human- like social cognition. Introduction Imagine a simple exchange between two colleagues: one brings the other a cup of coffee and says, “I wasn’t sure how much sugar you take, so I brought you two packets.” This small act reflects a core human skill, acting under un- certainty while explaining intent to preserve trust and coor- dination. Such reasoning and self-explanation lie at the heart of Theory of Mind (ToM): the ability to infer and communi- cate mental states. For robots, however, this reflective capability is elusive. 
Embodied agents must act under noisy perceptions and real-time constraints, making it difficult to justify decisions in ways humans find intelligible. In human-robot interaction (HRI), success depends not only on task performance but also on maintaining trust, transparency, and team fluency.

Classical approaches to computational ToM, such as Bayesian inverse planning (Baker, Saxe, and Tenenbaum 2009), epistemic logic, and plan recognition architectures (Rao and George 1995; Mirsky, Keren, and Geib 2021), capture aspects of belief attribution but depend on fixed, hand-crafted domains. At the same time, large language models (LLMs) and vision-language models show strong representational and interpretive abilities, but they remain unreliable for autonomous planning, lacking explicit reasoning over uncertainty and producing inconsistent or unverifiable trajectories (Tantakoun, Muise, and Zhu 2025; Valmeekam et al. 2022).

Bridging these strengths and limitations requires a symbolic scaffold that can translate LLM knowledge into structured, explainable, and verifiable reasoning for robots. We propose partially ordered plan libraries (PoPLs) as that scaffold: a directed acyclic graph (DAG) where nodes are actions and edges are ordering constraints. Unlike sequential plans, PoPLs explicitly represent flexibility and uncertainty in how goals can be achieved, capturing multiple valid orderings of actions within a single structure. For example, consider the different ways a person can make pancakes: pouring eggs, flour, and sugar into the mix can happen in any order and still yield a valid execution. Representing all possible combinations explicitly is a cumbersome process with potentially exponentially larger space requirements than a compact partial-ordering representation. Furthermore, it is more challenging to provide an explanation to the user if the robot must explicitly state all possible permutations of the user's plan that it considers feasible. These requirements make PoPLs especially suitable for modeling first-order ToM (inferring human intentions from partial observations) and extending toward second-order ToM (projecting expectations back to humans through explanations).

However, constructing such libraries remains a major bottleneck. Existing efforts rely on manual annotation or fixed symbolic domains, as seen in datasets like CaptainCook4D (Peddi et al. 2024) and Behavior-1K (Li et al. 2024), which provide rich procedural data but require extensive human effort to encode plan structure. Some recent works attempt to learn action dependencies or task graphs directly from labeled demonstrations or videos (Seminara, Farinella, and Furnari 2024; Mao et al. 2023), but these methods remain constrained by the availability and coverage of annotated data and often fail to generalize beyond the training distribution.

To overcome these limitations, we propose an automated framework that leverages LLMs to reconstruct PoPLs directly from naturalistic descriptions, combining the interpretability of symbolic DAGs with the flexibility of neural representations. Our work introduces an end-to-end framework for reconstructing PoPLs from minimal inputs, starting from only a bag of actions and a goal. The resulting structures can be translated into formal planning representations such as PDDL or MDPs. By using such a translation, PoPLs act as intermediate and transparent world models, enabling robots to potentially infer, explain, and communicate intentions in socially aligned ways.

To realize this vision, we propose a three-stage pipeline:
1. PoPL Reconstruction: Inferring causal and temporal dependencies among actions with an LLM to generate DAGs that encode partial ordering and uncertainty.
2. Planning Translation: Converting DAGs into formal planning representations (e.g., PDDL or MDP) for cooperative planning and symbolic reasoning.
3. ToM Integration: Leveraging the structured representations to support ToM reasoning, where first-order ToM is achieved via goal recognition using partial observations and second-order ToM is achieved by generating explanations and communicating intent to humans.

Constructing PoPLs Automatically
To enable robots to reason about human goals and intentions under uncertainty, we propose a framework that automatically constructs PoPLs. Figure 1 illustrates the full pipeline, consisting of five interconnected stages (A–E).

[Figure 1: Framework for constructing partially ordered plan libraries (PoPLs). A. Action Feature Extraction, B. Trajectory Category Initialization, C. Trajectory Generation, D. Noisy DAG Extraction, E. Edge Evaluation and Cleaning.]

Block A is responsible for action feature extraction. Given a set of unstructured action descriptions, the LLM first classifies each action according to its category (e.g., preparation, cooking, serving), purpose, and reversibility. This step transforms natural language instructions into structured symbolic tuples that provide semantic anchors for later dependency inference. Block B uses the extracted symbolic features: the model identifies plausible starting categories and coarse action clusters that typically initiate trajectories within similar tasks (e.g., preparation actions usually happen before cooking). This step provides an initial skeleton of the high-level procedural ordering of clusters as a whole, significantly reducing the number of partial orders that need to be considered. Once initial clusters are chosen, the next phase is Block C, trajectory generation. Here, the LLM generates multiple procedural trajectories, simulating different plausible ways a human might perform the same goal-directed activity. Although LLMs are not planners, their exposure to vast procedural text allows them to reproduce common temporal and causal patterns found in human-written instructions. Thus, they produce plan-like narratives: textual sequences that implicitly encode action dependencies between nodes within clusters. To further enhance coverage, we prompt the model to start from different actions within the chosen category and to connect any unmentioned actions by reasoning about likely temporal relations ("what typically happens before or after"). This guided generation does not impose optimal planning but surfaces logical patterns the model has absorbed from large-scale data. Since these relations are inferred statistically rather than symbolically, the resulting trajectories may still include noisy or inconsistent dependencies, motivating the refinement stages that follow. In Blocks D–E, each inferred dependency is re-evaluated using the LLM under explicit prompts designed to test necessity and ordering consistency (e.g., "Is action x required before action y to achieve the goal?"). Edges that fail this verification are pruned, producing a refined PoPL that more accurately reflects feasible and explainable procedural dependencies. This process consolidates and verifies the dependencies into a coherent partially ordered plan.
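To make the Block D–E verification step concrete, the sketch below prunes candidate ordering edges with an LLM-style yes/no query and keeps only the surviving constraints as a DAG. The ask_llm function is a hypothetical stand-in for whatever model client is used, and the prompt wording and networkx representation are illustrative assumptions, not the authors' implementation.

```python
import networkx as nx

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client. Expected to return 'yes' or 'no'."""
    raise NotImplementedError

def prune_popl(actions, candidate_edges, goal):
    """Keep only ordering constraints the LLM judges necessary, then check acyclicity."""
    dag = nx.DiGraph()
    dag.add_nodes_from(actions)
    for before, after in candidate_edges:
        prompt = (
            f"Goal: {goal}. Is the action '{before}' required before "
            f"the action '{after}' to achieve the goal? Answer yes or no."
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            dag.add_edge(before, after)
    if not nx.is_directed_acyclic_graph(dag):
        raise ValueError("Inconsistent ordering constraints survived verification.")
    return dag  # a partially ordered plan: any topological order is a valid execution
```

Keeping the result as a DAG rather than a list of linear plans is what preserves the compactness argued for above: all valid orderings remain implicit in the partial order.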
Future work will focus on evaluating this approach in real human-robot environments, using embodied robots to assess how PoPLs-based reasoning supports intention recognition and explanation. 108 References Baker, C. L.; Saxe, R.; and Tenenbaum, J. B. 2009. Action understanding as inverse planning. Cognition, 113(3): 329– 349. Reinforcement learning and higher cognition. Li, C.; Zhang, R.; Wong, J.; Gokmen, C.; Srivastava, S.; Mart ́ ın-Mart ́ ın, R.; Wang, C.; Levine, G.; Ai, W.; Martinez, B.; Yin, H.; Lingelbach, M.; Hwang, M.; Hiranaka, A.; Gar- lanka, S.; Aydin, A.; Lee, S.; Sun, J.; Anvari, M.; Sharma, M.; Bansal, D.; Hunter, S.; Kim, K.-Y.; Lou, A.; Matthews, C. R.; Villa-Renteria, I.; Tang, J. H.; Tang, C.; Xia, F.; Li, Y.; Savarese, S.; Gweon, H.; Liu, C. K.; Wu, J.; and Fei-Fei, L. 2024. BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation. arXiv preprint arXiv:2403.09227. Mao, W.; Desai, R.; Iuzzolino, M. L.; and Kamra, N. 2023. Action Dynamics Task Graphs for Learning Plannable Rep- resentations of Procedural Tasks. ArXiv, abs/2302.05330. Mirsky, R.; Keren, S.; and Geib, C. 2021. Introduction to symbolic plan and goal recognition, volume 16. Springer. Peddi, R.; Arya, S.; Challa, B.; Pallapothula, L.; Vyas, A.; Gouripeddi, B.; Zhang, Q.; Wang, J.; Komaragiri, V.; Ra- gan, E.; Ruozzi, N.; Xiang, Y.; and Gogate, V. 2024. Cap- tainCook4D: A Dataset for Understanding Errors in Proce- dural Activities. In Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; and Zhang, C., eds., Advances in Neural Information Processing Systems, vol- ume 37, 135626–135679. Curran Associates, Inc. Rao, A. S.; and George, M. P. 1995. Bdi Agents: from The- ory to Practice. Seminara, L.; Farinella, G. M.; and Furnari, A. 2024. Dif- ferentiable Task Graph Learning: Procedural Activity Rep- resentation and Online Mistake Detection from Egocentric Videos. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Tantakoun, M.; Muise, C.; and Zhu, X. 2025. LLMs as Plan- ning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models. In Find- ings of the Association for Computational Linguistics: ACL 2025, 25167–25188. Valmeekam, K.; Olmo, A.; Sreedharan, S.; and Kambham- pati, S. 2022. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop. 109 Toward Theory of Mind: BERT Learns and Uses Emotion Geometry in Two Phases You Ho Myung Sogang University Abstract We introduce a two-phase emotion reasoning framework grounded in Theory of Mind (ToM), where a BERT back- bone and an emotion-geometry branch are co-optimized to learn relational affective structures. In the subsequent reason- ing phase, the learned geometry is fixed and leveraged as a structural prior for emotion inference. Experiments on the GoEmotions dataset and an external benchmark show consis- tent gains over the vanilla BERT, suggesting that embedding a stable emotion geometry improves both interpretability and cross-distribution generalization. Introduction Research on emotion representation has long sought to char- acterize the relationships among emotions within structured spaces (e.g., VA (Russell 1980), PAD (Mehrabian and Rus- sell 1974), and Plutchik’s wheel (Plutchik 1980)). Recent work has extended this notion through data-driven embed- dings. 
For instance, Buechel and Hahn mapped heteroge- neous affective representations (e.g., VAD, BE5) into a shared latent space (Buechel and Hahn 2018),while Stan- ley et al. represented categorical emotion labels using word embeddings for speech emotion recognition (Stanley et al. 2023). However, most existing approaches use emotion vec- tors mainly as additional features during the encoding stage, instead of modeling the emotion space itself as a structured and learnable component that can be actively used during reasoning (Buechel and Hahn 2018; Xu et al. 2018; Seyed- itabari et al. 2019; Lee, Park, and Choi 2019). From a cognitive-developmental standpoint, children gradually construct an internal emotion space through so- cial interaction and use it to interpret and predict others’ emotions—a process tightly linked to the emergence of The- ory of Mind (Flavell and Miller 1998; Harris 2006; Well- man, Cross, and Watson 2001). Motivated by this devel- opmental analogy, we propose an architecture in which a model not only learns an emotion geometry but also dynam- ically leverages it during classification—explicitly separat- ing representation from reasoning while connecting them through a two-phase training process. Once the internal emotion geometry stabilizes, it remains largely consistent during reasoning—mirroring how children rely on early- formed emotional structures unless disrupted by major af- fective events (Costa and McCrae 1994). Dataset We evaluate our approach on two datasets: GoEmotions for training and SemEval-2018 Task 1 (E-c) for external evalu- ation. GoEmotions (Demszky et al. 2020) contains approxi- mately 58K Reddit comments annotated with 27 emotion categories and a neutral label, supporting multi-label emo- tion classification. SemEval-2018 Task 1: Affect in Tweets (E-c) (Moham- mad et al. 2018) includes around 7K training and 3K test tweets annotated with eleven emotions. For zero-shot trans- fer evaluation, we retain only the eight emotion labels shared with GoEmotions and remove samples without active labels. All inputs are truncated or padded to a maximum length of 64 tokens to ensure consistent preprocessing across datasets. Multi-label predictions are evaluated using both global and class-specific decision thresholds, with thresholds tuned on the development split. Model Architecture The proposed model, a BERT-based emotion-geometry rea- soning model, divides its training process into two phases, Phase-1: Joint Learning and Phase-2: Fixed Geometry Rea- soning, to learn geometric relationships among emotions and leverage them for classification, as shown in Figure 1. Phase-1: Joint Learning of Geometric Structure In Phase-1, the model jointly learns both the BERT-based features and the explicit geometric structure. This phase fo- cuses on learning the coordinates of emotion anchors and mapping text representations close to their corresponding emotion regions. The process proceeds as follows: • BERT Embedding: The input text is processed by the BERT encoder to obtain the contextual representation h. • Learning Emotion Anchors (E): The emotion anchors E ∈ R K×D (where K is the number of emotion classes and D is the projection dimension) are randomly initial- ized as trainable parameters. They are optimized jointly with other objectives, including a unit-norm regulariza- tion term, ensuring stable and interpretable anchor repre- sentations in the emotion space. 110 Figure 1: Overview of the proposed two-phase emotion- geometry reasoning model. 
• Projection to Geometric Space: The BERT embedding h is projected into a contextual geometric vector ˆv through a linear projection layer: ˆv = W proj ·h. • Geometric Loss Learning: The model learns the geom- etry by minimizing MSE and cosine distance losses be- tween ˆv and the mean anchor vector ̄e, computed as the average of active emotion anchors in each multi-label sample ( ̄e = Avg(E active )). In addition, the model adaptively modulates the influence of geometric injection based on classification confidence— assigning higher geometric weight to uncertain predictions and lower weight to confident ones. When the geometric scores strongly conflict with high-confidence classification logits, the geometric contribution is suppressed to prevent unstable updates. This joint optimization enables the model to capture se- mantic, geometric, and structural relationships among emo- tions, reinforced by a co-occurrence / label-dependency ob- jective (Chen et al. 2019; Huang et al. 2015), thereby form- ing a consistent and interpretable emotion geometry. Phase-2: Fixed Geometry Reasoning After a brief geometry formation stage, the model transi- tions to Phase-2. Empirically, the geometric anchors and projection space converge rapidly—typically within the first epoch—forming a stable emotion manifold. Prolonged ge- ometry training beyond this point tends to over-specialize the anchor structure, constraining the classifier and slightly degrading overall performance. Therefore, the parame- ters of the geometric branch, W proj and E, are frozen (requires grad = False) after the first epoch, and the model subsequently fine-tunes only the BERT encoder and classification head using this fixed geometric mapping. The process is as follows: • Freezing the Geometric Branch: The projection head (W proj ) and emotion anchors (E) are frozen, and all geometry-related losses are set to zero. • Geometric Score Generation: Using the fixed projection mapping, the model computes geometric scores repre- senting the cosine similarity between the projected rep- resentation and the frozen anchors: geom scores = T · cos(ˆv,E), where T denotes the temperature scaling factor. • Fixed Geometry Injection: The geometric scores— reflecting the relation between the input and all emo- tion anchors—are combined with the BERT classifica- tion logits z cls to form the final prediction: z final = z cls + α fixed · geom scores, where α fixed = eg cfg.phase2priorscale is a small, fixed scaling factor. Although the geometric parameters remain static, the BERT encoder continues to adapt during Phase-2, allowing h (and thus ˆv) to evolve within a fixed anchor space. Con- sequently, the model leverages previously learned geometric relationships as structured priors to enhance classification without further modifying the geometry itself. Model Performance Table 1: Performance comparison between the BERT base- line and EmotionBERT. 
Model / Dataset          Setting          F1 Micro  F1 Macro  Jaccard
BERT (Internal)          0.5 Default      0.5431    0.3144    0.4551
                         Global tuned     0.5952    0.4300    0.5546
                         Per-class tuned  0.5838    0.4528    0.5478
EmotionBERT (Internal)   0.5 Default      0.5852    0.4327    0.5177
                         Global tuned     0.6152    0.4746    0.5751
                         Per-class tuned  0.5984    0.4968    0.5592
BERT (External)          0.5 Default      0.1402    0.1416    0.0856
                         Global tuned     0.5042    0.4647    0.3434
                         Per-class tuned  0.5970    0.5431    0.4370
EmotionBERT (External)   0.5 Default      0.1831    0.1979    0.1187
                         Global tuned     0.4973    0.4697    0.3544
                         Per-class tuned  0.6323    0.5846    0.4862

For internal evaluation on GoEmotions, EmotionBERT outperformed the vanilla BERT baseline across all metrics (F1 Micro, F1 Macro, and Jaccard), indicating that the learned emotion geometry provides additional structure beneficial for multi-label reasoning. For external zero-shot transfer to SemEval-2018 Task 1 E-c, EmotionBERT also showed a marked improvement. This demonstrates that the fixed emotion geometry generalizes well across datasets and that class-specific threshold calibration further enhances robustness on imbalanced emotion distributions.

Interpretability
Figure 2 visualizes the learned emotion geometry using PCA. Each numbered point represents an emotion anchor. We observe that the principal axes of the PCA projection are consistent with known affective dimensions such as polarity and arousal. The spatial layout shows that similar emotions form coherent clusters, while opposite emotions occupy distinct regions, reflecting that EmotionBERT internalizes interpretable affective relationships within the learned geometric space.

[Figure 2: 2D PCA projection of 28 emotion anchors (indices correspond to emotion labels in the Appendix).]

Conclusion
This study highlights the importance of incorporating Theory of Mind (ToM) into emotion reasoning through a two-phase emotion geometry framework. Across both internal and external evaluations, the proposed EmotionBERT consistently outperformed the vanilla BERT baseline, demonstrating that explicitly modeling a geometric emotion structure improves both accuracy and robustness in multi-label settings. Furthermore, PCA-based visualization confirmed that the model forms interpretable emotion clusters within the learned geometric space, reflecting human-like organization of affective concepts.

As emotion recognition datasets continue to expand in scale and granularity, future research should move beyond treating emotions as independent categorical labels. Instead, modeling the interdependence and compositionality among emotions—such as mixed or transitional states (e.g., "bittersweet", "anxious excitement")—may further bridge the gap between data-driven emotion understanding and human affective cognition. We believe that explicitly structured emotion geometries, as introduced here, offer a promising foundation for such higher-order reasoning in next-generation affective AI systems.

References
Buechel, S.; and Hahn, U. 2018. Emotion Representation Mapping for Automatic Lexicon Construction (Mostly) Performs on Human Level. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), 2892–2904. Santa Fe, New Mexico, USA: Association for Computational Linguistics.
Chen, Z.-M.; Wei, X.-S.; Wang, P.; and Guo, Y. 2019. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5177–5186.
Costa, P. T.; and McCrae, R. R. 1994. Set Like Plaster? Evidence for the Stability of Adult Personality. In Heatherton, T. F.; and Weinberger, J. L., eds., Can Personality Change?, 21–40. American Psychological Association.
Demszky, D.; Movshovitz-Attias, D.; Ko, S.; Cowen, A. S.; Nemade, G.; and Ravi, S. 2020. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Flavell, J. H.; and Miller, P. H. 1998. Social Cognition. In Damon, W.; Kuhn, D.; and Siegler, R., eds., Handbook of Child Psychology, Vol. 2: Cognition, Perception, and Language, 851–898. New York, NY: Wiley, 5 edition.
Harris, P. L. 2006. Social Cognition. In Damon, W.; Lerner, R. M.; Kuhn, D.; and Siegler, R. S., eds., Handbook of Child Psychology, Vol. 2: Cognition, Perception, and Language, 811–858. Hoboken, NJ: Wiley, 6 edition.
Huang, J.; Li, G.; Huang, Q.; and Wu, X. 2015. Learning label specific features for multi-label classification. In 2015 IEEE International Conference on Data Mining (ICDM), 181–190. IEEE.
Lee, Y. J.; Park, C. Y.; and Choi, H. J. 2019. Word-Level Emotion Embedding Based on Semi-Supervised Learning for Emotional Classification in Dialogue. In 2019 IEEE International Conference on Big Data and Smart Computing.
Mehrabian, A.; and Russell, J. A. 1974. The Basic Emotional Impact of Environments. Perceptual and Motor Skills, 38(1): 283–301.
Mohammad, S. M.; Bravo-Marquez, F.; Salameh, M.; and Kiritchenko, S. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation.
Plutchik, R. 1980. A General Psychoevolutionary Theory of Emotion. American Psychologist, 36(4): 344–356.
Russell, J. A. 1980. A Circumplex Model of Affect. Journal of Personality and Social Psychology, 39(6): 1161–1178.
Seyeditabari, A.; Tabari, N.; Gholizade, S.; and Zadrozny, W. 2019. Emotional Embeddings: Refining Word Embeddings to Capture Emotional Content of Words. arXiv preprint arXiv:1906.00112.
Stanley, E.; DeMattos, E.; Klementiev, A.; Ozimek, P.; Clarke, G.; Berger, M.; and Palaz, D. 2023. Emotion Label Encoding Using Word Embeddings for Speech Emotion Recognition. In Proceedings of INTERSPEECH, 2418–2422. International Speech Communication Association.
Wellman, H. M.; Cross, D.; and Watson, J. 2001. Meta-analysis of Theory-of-Mind Development: The Truth about False Belief. Child Development, 72(3): 655–684.
Xu, P.; Madotto, A.; Wu, C.-S.; and Park, J. H. 2018. Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training. arXiv preprint arXiv:1809.04505.

Appendix
Emotion Index Mapping
Table 2 lists the numeric indices used in Figure 2 and their corresponding emotion labels and color groups.

ID  Emotion         Color
1   admiration      orange
2   amusement       yellow
3   anger           red
4   annoyance       red
5   approval        gold
6   caring          orange
7   confusion       cyan
8   curiosity       gray
9   desire          gold
10  disappointment  red
11  disapproval     red
12  disgust         purple
13  embarrassment   green
14  excitement      yellow
15  fear            green
16  gratitude       orange
17  grief           blue
18  joy             yellow
19  love            orange
20  nervousness     green
21  optimism        gold
22  pride           gold
23  realization     gray
24  relief          gold
25  remorse         blue
26  sadness         blue
27  surprise        cyan
28  neutral         gray

Table 2: Mapping from PCA indices to emotion labels and visualization colors.
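As a complement to the Phase-2 description earlier in this paper, the following is a minimal sketch of the fixed-geometry injection step: frozen anchors E and projection W_proj produce geometric scores that are added to the classifier logits. It is written in PyTorch-style Python with illustrative dimensions and made-up constants (temperature, alpha_fixed); it is not the authors' code.

```python
import torch
import torch.nn.functional as F

K, D, H = 28, 64, 768                       # emotions, geometry dim, BERT hidden size (illustrative)
W_proj = torch.randn(D, H)                   # frozen projection learned in Phase-1
E = F.normalize(torch.randn(K, D), dim=-1)   # frozen, unit-norm emotion anchors
W_proj.requires_grad_(False)
E.requires_grad_(False)

def phase2_logits(h: torch.Tensor, z_cls: torch.Tensor,
                  temperature: float = 10.0, alpha_fixed: float = 0.1) -> torch.Tensor:
    """Combine classifier logits with geometric scores from the frozen emotion space.

    h:     [batch, H] pooled BERT embedding (still trainable in Phase-2)
    z_cls: [batch, K] classification logits from the BERT head
    """
    v_hat = h @ W_proj.T                                   # project into the geometric space
    geom_scores = temperature * F.cosine_similarity(
        v_hat.unsqueeze(1), E.unsqueeze(0), dim=-1)        # [batch, K] similarity to each anchor
    return z_cls + alpha_fixed * geom_scores               # z_final

# Toy usage
h = torch.randn(2, H)
z_cls = torch.randn(2, K)
print(phase2_logits(h, z_cls).shape)  # torch.Size([2, 28])
```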
113 Towards Combining Task-Independent and Learned Task-Dependent Intrinsic Rewards in Cooperative Multi-Agent Reinforcement Learning Dilith Jayakody, Domenic Rosati, Janarthanan Rajendran Dalhousie University dilith@dal.ca, domenic.rosati@dal.ca, janarthanan.rajendran@dal.ca Abstract Intrinsic motivation has been proposed as an early develop- mental mechanism that drives exploration and the acquisition of social competencies in humans. To model this, intrinsic motivation in reinforcement learning (RL) has been typically used to construct various task-independent reward formula- tions, such as curiosity and novelty-based rewards, that pro- mote exploration during early learning stages. An alternative approach to intrinsic motivation focuses on learning task- dependent rewards that align with the extrinsic task objectives to accelerate learning or handle sparse task rewards (a.k.a. ex- trinsic rewards). However, such learned intrinsic rewards re- quire exposure to extrinsic rewards in order to identify which intrinsic rewards are actually useful. In this work, we inves- tigate whether task-independent rewards can help bootstrap the learned task-dependent rewards by helping the agent ex- perience extrinsic rewards during early training. We demon- strate that a simple fixed-weight combination of three sig- nals, extrinsic rewards, learned task-dependent rewards, and task-independent rewards, yields statistically significant per- formance improvements over standard baselines on the Han- abi cooperative multi-agent benchmark. Introduction Humans exhibit intrinsic motivation as a fundamental mech- anism for exploration and learning. Beyond exploration, in- trinsic drives also contribute to higher-order cognitive abil- ities such as theory of mind (ToM), the capacity to rea- son about others’ beliefs and intentions. Social motiva- tion, in particular, plays a central role in the emergence of ToM by guiding attention toward socially relevant informa- tion (Burnside, Wright, and Poulin-Dubois 2017). In reinforcement learning (RL), intrinsic motivation has been widely studied as a means of encouraging explo- ration in sparse-reward environments. Task-independent in- trinsic rewards, often motivated by ideas from developmen- tal psychology such as curiosity (Pathak et al. 2017), nov- elty (Burda et al. 2018), and surprise (Achiam and Sastry 2017), drive agents toward informative states, even in the ab- sence of extrinsic task rewards. Another line of work learns task-dependent reward functions directly to accelerate learn- ing toward extrinsic task rewards via meta-gradients (Zheng, Oh, and Singh 2018; Du et al. 2019; Zheng et al. 2020). However, these methods rely on exposure to extrinsic re- wards, limiting their effectiveness in highly sparse-reward environments. Recent work has begun exploring combinations of ex- trinsic rewards, learned task-dependent intrinsic rewards, and task-independent intrinsic rewards such as exploration bonuses. Devidze, Kamalaruban, and Singla (2022) com- bines these signals in single-agent grid-world domains. However, the role of such combinations in multi-agent set- tings remains underexplored. In this work, we investigate whether task-independent intrinsic rewards, can bootstrap learned (task-dependent) intrinsic rewards in cooperative multi-agent reinforcement learning (MARL). 
Our central hypothesis is that exploration bonuses can help the agent experience the feedback needed to learn task-dependent intrinsic rewards, particularly in environments requiring implicit reasoning about other agents. We evaluate this hypothesis in the Hanabi environment (Bard et al. 2020), a widely used benchmark for theory of mind–like reasoning in artificial agents (Fuchs et al. 2021). We leave evaluation on other MARL benchmarks to future work. In Hanabi, a cooperative card game, each agent observes the hands of other players but not its own. Effective clue-giving requires understanding how a teammate will behave when given partial information about their cards. We note that while we observe empirical performance improvements with the method that combines task-independent and task-dependent intrinsic rewards, further analysis of the learned intrinsic rewards is required to understand the source of the improvements.

Methods
We consider a cooperative MARL setup, where each agent receives three forms of rewards: (1) extrinsic rewards r_ex, (2) task-dependent intrinsic rewards r_task-dep-in, learned via meta-gradient optimization (Zheng, Oh, and Singh 2018), and (3) task-independent intrinsic rewards r_task-ind-in. We propose a convex combination of rewards

r_t = α r_ex,t + β r_task-dep-in,t + γ r_task-ind-in,t,   subject to α + β + γ = 1.

Specifically, we use Random Network Distillation (RND) (Burda et al. 2018) as the task-independent reward, which encourages exploration based on state novelty. All agents share all networks, including the policy network, value network, learned intrinsic reward function, and the target and predictor networks for RND.

In Hanabi, the task-independent and the learned task-dependent intrinsic rewards can serve complementary roles: RND promotes exploratory communication, while the learned intrinsic reward aligns explored behaviors with task performance.

Experiments
We build on top of the Proximal Policy Optimization (PPO) algorithm (Schulman et al. 2017), extended to the multi-agent setting as described by Yu et al. (2022). Following the setup of Zheng, Oh, and Singh (2018), we integrate intrinsic reward components into the agent's learning objective and evaluate their influence on collective performance. We compare four settings: (1) PPO (r_ex only), (2) PPO-LIRPG (r_ex + r_task-dep-in), (3) PPO-RND (r_ex + r_task-ind-in), and (4) PPO-LIRPG-RND (r_ex + r_task-dep-in + r_task-ind-in).

All models were trained with 3 random seeds for hyperparameter tuning and evaluated over 10 additional seeds. Reward coefficients (α, β, γ) were tuned with grid search under the constraint that they sum to 1, with the best performance being observed at α = 0.495, β = 0.01, and γ = 0.495. Evaluation was performed using rliable (Agarwal et al. 2021) to generate probabilistic comparisons and box plots across seeds. Each approach was evaluated based on the return of the last 1000 episodes of training. Normalization is performed with min-max normalization, using the best and worst scoring runs.

Results
[Figure 1: Average Hanabi team scores (mean normalized score) for PPO, PPO-LIRPG, PPO-RND, and PPO-LIRPG-RND across ten seeds. Error bars indicate 95% confidence intervals across runs.]

Figure 1 summarizes average team scores over ten random seeds in the Hanabi environment.
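For readers who want to picture the reward combination from the Methods section, the following minimal sketch computes the per-step reward used by the PPO-LIRPG-RND setting with the tuned coefficients. The toy RND bonus and the placeholder for the meta-learned reward are illustrative stubs under simplifying assumptions, not the authors' implementation.

```python
import numpy as np

class RNDBonus:
    """Toy random-network-distillation bonus: prediction error against a fixed random target."""
    def __init__(self, obs_dim, feat_dim=16, lr=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(obs_dim, feat_dim))  # frozen random network
        self.pred = np.zeros((obs_dim, feat_dim))           # trained to imitate the target
        self.lr = lr

    def __call__(self, obs):
        err = obs @ self.target - obs @ self.pred
        bonus = float(np.mean(err ** 2))                    # novelty ~ prediction error
        self.pred += self.lr * np.outer(obs, err)           # simple gradient step, bonus decays with familiarity
        return bonus

def combined_reward(r_ex, r_task_dep, r_task_ind, alpha=0.495, beta=0.01, gamma=0.495):
    """Fixed-weight convex combination used by the PPO-LIRPG-RND setting."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * r_ex + beta * r_task_dep + gamma * r_task_ind

rnd = RNDBonus(obs_dim=8)
obs = np.random.rand(8)
r_task_ind = rnd(obs)        # task-independent novelty bonus
r_task_dep = 0.0             # placeholder for the meta-learned intrinsic reward
print(combined_reward(r_ex=1.0, r_task_dep=r_task_dep, r_task_ind=r_task_ind))
```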
The combined method, PPO-LIRPG-RND, achieves the highest mean nor- malized score, outperforming both individual intrinsic re- ward variants (PPO-LIRPG and PPO-RND) as well as the PPO baseline. Interestingly, PPO-LIRPG performs compa- rably to PPO, indicating that either learning task-dependent rewards has limited utility on its own, or that learning useful task-dependent rewards is challenging. To assess statistical significance, we applied the Mann– Whitney U test for pairwise comparisons between algo- rithms, using unpaired samples across seeds. PPO-LIRPG- RND, achieved statistically significant improvements over PPO with a p-value of 0.013 and PPO-LIRPG with a p-value of 0.011. It also outperforms PPO-RND with a p-value of 0.061, which may show more significance under evaluation with a larger number of seeds. Discussion and Limitations Our results indicate that task-dependent and task- independent intrinsic rewards can play complementary roles in cooperative multi-agent settings like Hanabi. A notable observation of our experiments is that the task- dependent intrinsic reward receives a relatively low weight (β = 0.01) under the fixed-weight combination. This suggests that aggressive reward shaping with the learned reward can negatively affect performance. This is in contrast to the the task-independent reward (γ ≈ 0.5), which in RND, diminishes to near-zero with time. Several limitations and future work remains. First, while we observe consistent empirical improvements, understand- ing the precise source of these gains, such as whether the learned intrinsic reward facilitates improved theory-of- mind–like reasoning or more effective coordination strate- gies, requires further analysis. Secondly, our fixed-weight combination lacks adaptiv- ity across training phases. Assigning a low weight to the learned intrinsic reward constrains the space of reward func- tions that can be expressed throughout learning. Future work could explore meta-learned or policy-conditioned weighting schemes that dynamically adjust intrinsic motivations over time, potentially reflecting developmental transitions from exploration to theory-of-mind–driven coordination. Thirdly, all experiments were conducted on Hanabi. Fur- ther experimentation on diverse MARL benchmarks, such as StarCraft and Multi-Agent MuJoCo, is necessary to as- sess the generalizability of the approach. Finally, while our approach shows promising trends, its improvements over PPO-RND are not statistically signifi- cant (p = 0.061) and warrant further investigation. We note that all hyperparameters, except for the reward coefficients, were adopted directly from prior work and not specifically tuned for Hanabi. Future experiments that include hyper- parameter optimization and evaluation over more random seeds may yield stronger and more statistically significant performance differences. Conclusion We investigated the combination of task-dependent and task-independent intrinsic rewards in Hanabi, a coopera- tive MARL setting that requires theory-of-mind–like reason- ing. The method that combines task-independent and task- dependent intrisnic rewards (PPO-LIRPG-RND), achieves statistically significant improvements over PPO and PPO- LIRPG, and shows performance gains over PPO-RND that may become significant with evaluation over more random seeds. 
Overall, our findings highlight the potential of intrin- sic reward combinations for supporting social reasoning in multi-agent systems and motivate future work on adaptive weighting strategies, deeper analysis of the learned rewards, and evaluation on additional diverse MARL benchmarks. 115 References Achiam, J.; and Sastry, S. 2017. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732. Agarwal, R.; Schwarzer, M.; Castro, P. S.; Courville, A.; and Bellemare, M. G. 2021. Deep Reinforcement Learning at the Edge of the Statistical Precipice. Advances in Neural Information Processing Systems. Bard, N.; Foerster, J. N.; Chandar, S.; Burch, N.; Lanctot, M.; Song, H. F.; Parisotto, E.; Dumoulin, V.; Moitra, S.; Hughes, E.; et al. 2020. The hanabi challenge: A new fron- tier for ai research. Artificial Intelligence, 280: 103216. Burda, Y.; Edwards, H.; Storkey, A.; and Klimov, O. 2018. Exploration by random network distillation. arXiv preprint arXiv:1810.12894. Burnside, K.; Wright, K.; and Poulin-Dubois, D. 2017. So- cial motivation and implicit theory of mind in children with autism spectrum disorder. Autism Research, 10(11): 1834– 1844. Devidze, R.; Kamalaruban, P.; and Singla, A. 2022. Exploration-guided reward shaping for reinforcement learn- ing under sparse rewards. Advances in Neural Information Processing Systems, 35: 5829–5842. Du, Y.; Han, L.; Fang, M.; Liu, J.; Dai, T.; and Tao, D. 2019. Liir: Learning individual intrinsic reward in multi-agent re- inforcement learning. Advances in neural information pro- cessing systems, 32. Fuchs, A.; Walton, M.; Chadwick, T.; and Lange, D. 2021. Theory of mind for deep reinforcement learning in hanabi. arXiv preprint arXiv:2101.09328. Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, 2778– 2787. PMLR. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; and Wu, Y. 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in neural infor- mation processing systems, 35: 24611–24624. Zheng, Z.; Oh, J.; Hessel, M.; Xu, Z.; Kroiss, M.; Van Has- selt, H.; Silver, D.; and Singh, S. 2020. What can learned intrinsic rewards capture? In International Conference on Machine Learning, 11436–11446. PMLR. Zheng, Z.; Oh, J.; and Singh, S. 2018. On learning intrinsic rewards for policy gradient methods. Advances in neural information processing systems, 31. Hyperparameter Tuning Details Hyperparameter Sweep Protocol We performed a coarse hyperparameter sweep over reward coefficients in both the PPO-LIRPG (r ex + r task-dep-in ) and the PPO-LIRPG-RND (r ex + r task-dep-in + r task-ind-in ). Coef- ficients were selected from the discrete set 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99, subject to the constraint that the coefficients sum to one. This coarse sweep was chosen to balance computational feasibil- ity with coverage of qualitatively different reward weight- ings. We note that finer-grained tuning may yield improved performance and leave this as future work. Figure 2 and Figure 3 visualizes the effect of varying β (with α = 1− β) for PPO-LIRPG and γ (with α = 1− γ) for PPO-RND on final Hanabi performance. 
Figure 2: Line plot showing the effect of varying β on PPO-LIRPG performance in Hanabi (x-axis: r_task-dep-in coefficient; y-axis: normalized score; environment: hanabi_v5).

Figure 3: Line plot showing the effect of varying γ on PPO-RND performance in Hanabi (x-axis: r_task-ind-in coefficient; y-axis: normalized score; environment: hanabi_v5).

For PPO-LIRPG-RND, we jointly swept over combinations of extrinsic, learned intrinsic, and RND-based intrinsic reward coefficients. The results are visualized using a ternary plot in Figure 4, illustrating how performance varies as the relative prominence of each reward signal changes. The best-performing configuration from each sweep is reported in Table 1.

Table 1: Selected reward coefficients for each method.

Method          α (Extrinsic)   β (Learned)   γ (RND)
PPO-LIRPG       0.99            0.01          0
PPO-RND         0.50            –             0.50
PPO-LIRPG-RND   0.495           0.01          0.495

Figure 4: Ternary plot visualizing PPO-LIRPG-RND performance across combinations of extrinsic (α), learned intrinsic (β), and RND (γ) reward coefficients (axes: r_ex, r_in, r_rnd; color: score). Coefficients were selected from a discrete set and constrained to sum to one.

Visual Theory of Mind through LLM-based Semantic Extraction

Sandy Hoffmann 1, Lucas Nardi Vieira 1, Stefan Sarkadi 2, and Alison R. Panisson 1
1 Graduate Program on Computer Science – Federal University of Santa Catarina (UFSC), Brazil
2 University of Lincoln, United Kingdom
sandy.hoffmann, lucas.nardi.vieira@posgrad.ufsc.br, ssarkadi@lincoln.ac.uk, alison.panisson@ufsc.br

Abstract

In this paper we introduce the idea of using Large Language Models (LLMs) to extract semantic representations of Theory of Mind (ToM) components from visual inputs. This allows AI systems to infer Theory of Mind constructs such as beliefs and intentions in order to interpret visual scenes by learning and reasoning about other agents' mental states, supporting context-aware decision-making. Our proposed approach integrates visual encoding with LLM-based 'reasoning' to generate interpretable ToM semantics. Such representations can allow agents to perform symbolic reasoning in order to coordinate by anticipating others' behaviour. This opens new directions for future research on socially aware, collaborative, and meta-cognitive Multi-Agent Systems (MAS).

Introduction

When used effectively alongside other models, Large Language Models (LLMs) have recently been demonstrated to enable AI capabilities such as reasoning, planning, and multimodal information processing. In such systems, the ability of agents to interpret and anticipate others' mental states, commonly referred to as Theory of Mind (ToM), is essential for coordination, cooperation, social reasoning, and meta-reasoning. Despite rapid advances in LLMs, current AI agent architectures still lack a robust form of Visual Theory of Mind (VToM) (Hoffmann et al. 2025), understood as the ability to infer beliefs, intentions, and perspectives from visual scenes. This limitation hinders agents' reasoning about perceptual contexts and others' viewpoints in multimodal environments. In this work, VToM is approached in an operational and incremental manner: rather than modeling full human-like ToM, we focus on perceptually grounded mental-state components, specifically visual attention and immediate intentions, which are foundational for higher-level belief reasoning and allow for interpretable inference grounded in visual evidence.
ToM has long been recognized as a foundation of intelligent behaviour (Premack and Woodruff 1978; Goldman 2012; Rocha et al. 2023), but only recently have multimodal LLMs begun to show potential for grounding ToM-like reasoning in visual inputs. By transforming raw perception into structured semantic representations, these models can support inference mechanisms that approximate cognitive understanding, bridging visual perception and social cognition. VToM grounds this in perceptual cues such as gaze and gesture (Zhou et al. 2023; Hoffmann et al. 2025) and is often linked to the BDI model (Bratman 1987) and Simulation Theory (SToM) (Goldman 2012). VToM applications include robot perspective taking (Johnson and Demiris 2007), shared attention modeling (Marin, Sisbot, and Alami 2008), and behavioral validation tasks (Fischer and Demiris 2019). Perceptual attention, particularly gaze estimation, is a crucial component of VToM, as it enables intelligent agents to infer what others perceive, a necessary step before reasoning about their beliefs and intentions, by translating visual cues into cognitive awareness for social understanding (Liu et al. 2021; Zhang, Sugano, and Bulling 2023; Yu et al. 2022).

Furthermore, recent advances in Vision-Language Models (VLMs) (Li et al. 2025; Wu et al. 2023b) combine visual perception with linguistic reasoning. Systems such as Flamingo, Gemini, and Qwen-VL (Bai et al. 2025) process multimodal inputs to extract high-level semantics, surpassing classical vision models like YOLO (Redmon and Farhadi 2018) in interpreting context and intent (Chen, Vondrick, and Lipson 2021; Hoffmann et al. 2026). Hence, we believe that LLM-based semantic extraction can provide a way forward for VToM.

In this paper, we propose the use of LLMs for semantic extraction from visual inputs as a foundation for VToM in AI agents. We describe a pipeline in which visual scenes are analysed in order to (i) identify agents, (ii) infer their likely intentions, and (iii) support context-aware decision-making. Our approach demonstrates how LLM-based semantic extraction can provide contextual understanding beyond traditional computer vision, marking a step toward inferring others' mental attitudes directly from visual input. Such mechanisms may advance the development of multi-agent collaboration between humans and machines grounded in shared perceptual and social spaces (da Silva et al. 2024).

LLM-Based Visual ToM

Here we describe the architecture of the proposed LLM-based VToM framework.
As shown in Figure 1, the pipeline connects low-level visual perception with high-level reasoning about beliefs, intentions, and awareness through five sequential stages: (i) The visual input stage captures static frames or video sequences and applies preprocessing to enhance visibility and isolate relevant elements such as agents, objects, and environmental cues; (ii) The feature and entity extraction stage generates mid-level representations through object detection, pose estimation, and temporal tracking, producing structured perceptual data (e.g., bounding boxes, trajectories, and body keypoints); (iii) The visual attention inference submodule estimates gaze direction, head pose, and attentional targets, providing a perceptual proxy for where and how agents allocate visual focus. (iv) The semantic extraction stage translates these perceptual and attentional cues into textual scene descriptions via LLM prompting (a complete prompt construction schema is presented in the Appendix), identifying entities, actions, and relational contexts (e.g., "a person reaching for a closed door"); (v) Finally, the structured semantics and scene inference stage formalizes the extracted knowledge into a symbolic representation (e.g., YAML or JSON), listing entities, actions, relations, and inferred mental states with associated confidence scores.

Figure 1: LLM-based Semantic Extraction Pipeline with Visual Attention Inference Layer. (Stages shown in the figure: Visual Input (frame / video / image sequence) → Feature & Entity Extraction (detectors, pose, object tracks; e.g., bounding boxes, poses, tracking IDs, visual entities) → Visual Attention Inference (gaze estimation, head pose, attentional targets) → Semantic Extraction (prompt construction & LLM-based scene description) → Structured Semantics (objects, agents, actions, relations) → Scene Inference & ToM Inference (beliefs, intentions, perspective, awareness).)

This design enables a bottom-up integration of perception and cognition: the visual attention module bridges sensory features and semantic interpretation, grounding the LLM's reasoning process in perceptual evidence. Together, these stages transform raw visual data into interpretable, cognitively meaningful representations suitable for ToM inference.

Abstract Output Schema and Extensible Cognitive Components

To standardise the representation of perceptual-to-cognitive inference, the proposed VToM framework produces outputs in a structured, machine-interpretable schema. This abstraction enables consistent evaluation across scenarios and facilitates modular extensions of the inference pipeline.

Output Schema: Each scene is represented as a composition of four hierarchical layers (a minimal illustration follows the list):

1. Perceptual Layer: Low-level entities and attributes derived from visual input (e.g., bounding boxes, keypoints, gaze vectors, depth, and ROI segmentation);
2. Behavioral Layer: Observable actions and spatiotemporal patterns such as movement trajectories, hand-object interactions, and gaze-target alignments;
3. Relational Layer: Contextual relations between entities (e.g., near_to, holding, looking_at), forming a dynamic scene graph;
4. Cognitive Layer: LLM-driven semantic inference capturing factual descriptions, inferred intentions, and belief/goal hypotheses, along with confidence estimates.
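As a minimal illustration of how such a four-layer output might look in practice, the sketch below encodes one scene as a plain Python dictionary. All field names and values are hypothetical and chosen only to mirror the layers described above; they are not taken from the authors' implementation.

# Hypothetical, minimal sketch of the four-layer VToM output schema.
scene = {
    "perceptual": {   # Perceptual Layer: low-level entities and attributes
        "entities": [{"id": "p1", "type": "person", "bbox": [120, 40, 260, 400]}],
        "gaze_vectors": {"p1": [0.2, -0.1, 0.97]},
    },
    "behavioral": {   # Behavioral Layer: observable actions and patterns
        "actions": [{"agent": "p1", "action": "reaching_for", "target": "d1"}],
    },
    "relational": {   # Relational Layer: dynamic scene graph
        "relations": [{"type": "near_to", "subject": "p1", "object": "d1"}],
    },
    "cognitive": {    # Cognitive Layer: LLM-driven inference with confidence
        "factual_description": "A person reaches for a closed door.",
        "inferred_intention": "Attempting to access the building.",
        "confidence": "High",
    },
}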
Extensible Modules: The pipeline is designed to accommodate additional inference components, as illustrated by the specialized Visual Attention Inference module in our pipeline. Other examples include a Visual Focus and Gaze Estimation module, which infers attention vectors by combining head pose, eye direction, and environmental saliency maps to identify attentional targets. Another key example is Belief and Intention Tracking, which maintains temporal consistency of inferred states through symbolic or graph-based representations, enabling reasoning over time.

Illustrative Scenarios: While the Appendix provides concrete examples, this schema supports broader applications such as surveillance, assistive robotics, and human-AI interaction. The modular design allows adapting inference depth, from perceptual awareness to high-level belief reasoning, according to task requirements.

Conclusion, Challenges, and Future Directions

In this paper, we introduced an LLM-based framework for VToM that integrates vision and language to enable the inference of mental states. We applied LLMs to bridge perception with cognitive reasoning in order to generate interpretable ToM-like structured semantics of intentions and interactions. However, achieving robust human-like ToM inference abilities remains challenging. One such challenge is tracking beliefs and intentions over time, as the growing contextual complexity can obscure salient information (Sarkadi et al. 2018). Cognitive knowledge graphs, such as COKE (Wu et al. 2023a), offer a promising path by emphasizing clue-based inference to focus attention on semantically salient elements.

In future work, we plan to extend the framework to handle temporally dynamic environments, combining LLM-based VToM with symbolic and graph-based reasoning for belief and intention tracking. We will also integrate visual attention inference models, which can link perceptual cues, such as gaze direction and contextual focus, to inferred attentional and intentional states.

References

Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; Zhong, H.; Zhu, Y.; Yang, M.; Li, Z.; Wan, J.; Wang, P.; Ding, W.; Fu, Z.; Xu, Y.; Ye, J.; Zhang, X.; Xie, T.; Cheng, Z.; Zhang, H.; Yang, Z.; Xu, H.; and Lin, J. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
Bratman, M. 1987. Intention, Plans, and Practical Reason. Cambridge, MA: Harvard University Press.
Chen, B.; Vondrick, C.; and Lipson, H. 2021. Visual behavior modelling for robotic theory of mind. Scientific Reports, 11(1): 424.
da Silva, H. H.; Rocha, M.; Trajano, G.; Morales, A. S.; Sarkadi, S.; and Panisson, A. R. 2024. Distributed Theory of Mind in Multi-Agent Systems. In 16th International Conference on Agents and Artificial Intelligence (ICAART 2024). SciTePress.
Fischer, T.; and Demiris, Y. 2019. Computational Modeling of Embodied Visual Perspective Taking. IEEE Transactions on Cognitive and Developmental Systems.
Goldman, A. I. 2012. Theory of Mind. United Kingdom: Oxford University Press.
Hoffmann, S.; Fernandes, A. R.; Muchulski, V. W.; Sarkadi, S.; von Wangenheim, A.; and Panisson, A. R. 2026. Towards Intelligent Monitoring System Using Computer Vision. In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART).
Hoffmann, S.; Trajano, G.; Sarkadi, S.; and Panisson, A. R. 2025. Visual Theory of Mind (VToM): A Systematic Review. In IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT).
Johnson, M.; and Demiris, Y. 2007. Visuo-cognitive perspective taking for action recognition. In International Symposium on Imitation in Animals and Artifacts, 262–269.
Li, Y.; Lai, Z.; Bao, W.; Tan, Z.; Dao, A.; Sui, K.; Shen, J.; Liu, D.; Liu, H.; and Kong, Y. 2025. Visual large language models for generalized and specialized applications. arXiv preprint arXiv:2501.02765.
Liu, Z.; Lin, Y.; Yu, Y.; Wang, X.; and Qiu, X. 2021. Generalizing Gaze Estimation With Outlier-Guided Collaborative Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3810–3820.
Marin, L.; Sisbot, E. A.; and Alami, R. 2008. Geometric tools for perspective taking for human-robot interaction. In Mexican International Conference on Artificial Intelligence (MICAI 2008), Mexico City, Mexico.
Premack, D.; and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4): 515–526.
Redmon, J.; and Farhadi, A. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Rocha, M.; da Silva, H. H.; Morales, A. S.; Sarkadi, S.; and Panisson, A. R. 2023. Applying theory of mind to multi-agent systems: A systematic review. In Brazilian Conference on Intelligent Systems, 367–381. Springer.
Sarkadi, Ş.; Panisson, A. R.; Bordini, R. H.; McBurney, P.; and Parsons, S. 2018. Towards an approach for modelling uncertain theory of mind in multi-agent systems. In International Conference on Agreement Technologies, 3–17. Springer.
Wu, J.; Chen, Z.; Deng, J.; Sabour, S.; Meng, H.; and Huang, M. 2023a. COKE: A cognitive knowledge graph for machine theory of mind. arXiv preprint arXiv:2305.05390.
Wu, J.; Gan, W.; Chen, Z.; Wan, S.; and Yu, P. S. 2023b. Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData), 2247–2256. IEEE.
Yu, W.; Liang, R.; Guo, J.; et al. 2022. Design and Real-Time Evaluation of Appearance-Based Gaze Estimation (I2DNet). Frontiers in Neurorobotics, 15: 736989.
Zhang, X.; Sugano, Y.; and Bulling, A. 2023. ETH-XGaze: A Large-Scale Dataset for Gaze Estimation in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Zhou, P.; Ma, H.; Zou, B.; Zhang, X.; Zhao, S.; Lin, Y.; Wang, Y.; Feng, L.; and Wang, G. 2023. A conceptual framework of cognitive-affective theory of mind: towards a precision identification of mental disorders. npj Mental Health Research, 2(1): 12.

Appendix

Prompt Construction. The transition from structured perceptual data to semantic inference is achieved through a controlled prompt construction process. This representation is provided to the LLM together with an instruction to generate a structured semantic interpretation aligned with the output schema.

Prompt Template (simplified):

You are given a structured description of a visual scene derived from computer vision modules.
Objects: person(p1, attributes: black_jacket, delivery_person), helmet(h1, black), delivery_bag(db1, green), door(d1, closed)
Actions: p1 reaching_for d1; p1 looking_at d1
Relations: p1 wearing h1; p1 carrying db1
Task: Generate a structured semantic output following the predefined schema, including a factual description, inferred intention, and confidence estimate.
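To make the prompt construction step concrete, here is a small Python sketch that serializes structured perceptual data into a prompt following the template above. The helper name build_prompt and the dictionary layout are illustrative assumptions rather than the authors' code, and no particular LLM API is assumed.

# Hypothetical sketch: turning structured perceptual data into a scene prompt.
def build_prompt(objects, actions, relations):
    obj_str = ", ".join(
        f"{o['type']}({o['id']}, attributes: {', '.join(o.get('attributes', []))})"
        for o in objects
    )
    act_str = "; ".join(f"{a['agent']} {a['action']} {a['target']}" for a in actions)
    rel_str = "; ".join(f"{r['subject']} {r['type']} {r['object']}" for r in relations)
    return (
        "You are given a structured description of a visual scene derived from "
        "computer vision modules.\n"
        f"Objects: {obj_str}\n"
        f"Actions: {act_str}\n"
        f"Relations: {rel_str}\n"
        "Task: Generate a structured semantic output following the predefined schema, "
        "including a factual description, inferred intention, and confidence estimate."
    )

# Example input mirroring Scenario 1 from the Appendix.
prompt = build_prompt(
    objects=[
        {"id": "p1", "type": "person", "attributes": ["black_jacket", "delivery_person"]},
        {"id": "db1", "type": "delivery_bag", "attributes": ["green"]},
        {"id": "d1", "type": "door", "attributes": ["closed"]},
    ],
    actions=[{"agent": "p1", "action": "reaching_for", "target": "d1"}],
    relations=[{"subject": "p1", "type": "carrying", "object": "db1"}],
)
print(prompt)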
Scenarios: Three monitoring scenarios are used to demonstrate how the proposed pipeline translates perception into cognitive inference (these examples are based on the standard pipeline, without the specialized Visual Attention Inference module): (i) a delivery person standing at a glass entrance, where the model identifies an attempt to access the building; (ii) a person holding a bicycle at an open doorway, where the inferred intention is to enter the premises; and (iii) an individual interacting with a barrier, where the system recognizes a tentative access attempt with medium confidence. These cases demonstrate how the pipeline progressively integrates perception, semantics, and reasoning, enabling an interpretable form of VToM grounded in LLM-based semantic extraction.

In the first example, Figure 2 illustrates a delivery person positioned at a building entrance. Our approach infers the following structured semantic representation (a simplified version of the semantic extraction is presented here for clarity):

objects:
  - id: p1, type: person, attributes: ['black_jacket', 'delivery_person']
  - id: h1, type: helmet, attributes: ['black']
  - id: db1, type: delivery_bag, attributes: ['green']
  - id: d1, type: door, attributes: ['closed']
actions:
  - agent_id: p1, action: reaching_for, target_id: d1
  - agent_id: p1, action: looking_at, target_id: d1
relations:
  - type: wearing, subject_id: p1, object_id: h1
  - type: carrying, subject_id: p1, object_id: db1
inference:
  factual_description: "A delivery person reaches for a closed door while looking at it."
  inferred_intention: "Attempting to access the building to deliver a package."
  confidence: High

Figure 2: Scenario 1.

The model identifies a person standing at the glass entrance while wearing a helmet. The observed action indicates that the individual remains stationary at the entrance, corresponding to a movement pattern classified as stationary. The inferred gaze suggests the person is looking downward, likely toward an entrance panel or intercom. Although no explicit interaction with objects is detected, the person's position indicates readiness to engage with the entrance system. Based on these observations, the inferred intention is to gain access to the building, most likely for a delivery, with a confidence level assessed as high.

Figure 3 shows a person standing at an open entrance while holding a bicycle. Our approach infers the following structured semantic representation:

objects:
  - id: p1, type: person, attributes: ['grey_tshirt', 'dark_shorts']
  - id: b1, type: bicycle
  - id: e1, type: entrance, attributes: ['open']
actions:
  - agent_id: p1, action: standing, target_id: e1
  - agent_id: p1, action: holding, target_id: b1
inference:
  factual_description: "A person holding a bicycle stands at an open entrance."
  inferred_intention: "Preparing to enter the premises with the bicycle."
  confidence: High

Figure 3: Scenario 2.

The model detects a person standing at an entrance while holding a bicycle. The observed action describes the individual as stationary, corresponding to a movement pattern classified as poised to move forward into the entrance. The inferred gaze is directed toward the interior of the doorway, indicating attention to the passage ahead. The interaction involves holding and pushing a bicycle, suggesting preparation for movement. From these cues, the inferred intention is identified as entering the premises with the bicycle, with a confidence level assessed as high.
Figure 4 captures a person interacting with a barrier. Our approach infers the following structured semantic representation:

objects:
  - id: p1, type: person, attributes: ['black_shirt', 'light_pants']
  - id: g1, type: barrier
actions:
  - agent_id: p1, action: reaching, target_id: g1
relations:
  - type: near_to, subject_id: p1, object_id: g1
inference:
  factual_description: "A person stands near a barrier, reaching toward it."
  inferred_intention: "Attempting to interact with or access the area beyond the barrier."
  confidence: Medium

Figure 4: Scenario 3.

The model identifies a person pushing a bicycle along an indoor path. The observed action indicates active forward movement, corresponding to a movement pattern described as purposeful and direct. The inferred gaze suggests that the individual is looking straight ahead, aligned with the direction of travel. The interaction reflects active engagement with the bicycle, with no apparent interaction involving other objects or individuals. Based on these behavioral features, the inferred intention is classified as entering a building or parking facility via the access route, with a confidence level assessed as medium.

Who Knows Who Knows? A Step Toward Common Knowledge in Multi-Agent Systems

Tianqi Song 1, Botao Amber Hu 2, Jingshu Li 1, Yi-Chieh Lee 1
1 National University of Singapore, Singapore
2 University of Oxford, United Kingdom
tianqisong@u.nus.edu, botao.hu@cs.ox.ac.uk, jingshu@u.nus.edu, yclee@nus.edu.sg

Abstract

As artificial agents increasingly operate in groups, understanding how they reason about shared knowledge is essential for achieving coordination and trust. While most studies on Theory of Mind (ToM) focus on whether a single agent can infer another's belief, much less is known about how a group of agents might form common knowledge, i.e., the recursive awareness that everyone knows that everyone knows a fact. This paper introduces a conceptual and experimental framework to investigate how approximate common knowledge (CK) may emerge among LLM-based agents. We outline a key–door coordination scenario in which multiple agents must align their beliefs through natural-language communication and belief-state updates. Our goal is to bridge formal epistemic logic and empirical multi-agent interaction, offering a foundation for studying group-level reasoning in artificial societies.

Introduction

Coordination among multiple AI agents is a critical capability for achieving coherent and reliable group behavior. In human societies, coordination enables individuals to attain mutual benefits, such as agreeing on traffic rules, maintaining social norms, or organizing collective actions (Halpern and Moses 1990; De Freitas et al. 2019). Similarly, in multi-agent AI systems, coordination allows agents to pursue shared goals efficiently, avoid redundant efforts, and adapt to dynamic environments. A key cognitive prerequisite for such coordination is Theory of Mind, i.e., the ability to model and infer others' beliefs, intentions, and future behaviors (Strachan et al. 2024; Rocha et al. 2023). However, effective coordination requires more than predicting what another agent will do; it also depends on understanding what others know, and whether they know that one knows.
For instance, in distributed decision-making or collaborative robotics, AI agents must not only share sensory information but also confirm that this information is mutually recognized and understood. Without this recursive awareness, systems risk breakdowns in synchronization, trust, and safety. Psychologists and philosophers have termed this recursive state of mutual awareness common knowledge: a condition where all agents know a fact, know that others know it, and so on ad infinitum (Lewis 2008; De Freitas et al. 2019) (Figure 1).

Figure 1: Illustration of "common knowledge", a state in which every agent not only knows a fact but also knows that all others know it. This recursive awareness enables mutual understanding and coordinated behavior within a group.

In humans, common knowledge enables social phenomena such as conventions, trust, and collective action, allowing people to coordinate without explicit communication once "everyone knows that everyone knows."

While the concept of common knowledge originates from human cognition, in multi-agent systems it carries distinct computational (Halpern and Moses 1990) and systemic implications. Unlike humans, AI agents can represent and verify common knowledge explicitly through formal logic or communication protocols, enabling reliable coordination at scale (Halpern and Moses 1990). Establishing common knowledge allows distributed agents to reduce uncertainty, synchronize world models, and act cohesively without continuous communication. Moreover, it provides a foundation for emergent norms, collective learning, and safe alignment (Song et al. 2025; Feng et al. 2025), turning multi-agent systems from a collection of isolated decision makers into an adaptive and self-organizing collective intelligence.

In this extended abstract, we propose a conceptual framework for examining how multi-agent systems might approximate common knowledge through communication and reflection. Drawing inspiration from the "Blue-Eyes" puzzle (da Costa Werlang 1989), we illustrate a possible mechanism by which agents could signal, confirm, and converge upon shared beliefs. Our goal is not merely to simulate human-like coordination, but to understand what it would mean for artificial societies to know together.

Related Work

Theory of Mind in Artificial Agents

Theory of Mind (ToM) refers to the ability to infer the beliefs, intentions, and perspectives of others (Apperly and Butterfill 2009). In AI, ToM has been modeled through belief reasoning in symbolic systems (Hansen and Bolander 2020) and, more recently, through LLM-based inference (Strachan et al. 2024). Emergent behaviors from multi-agent dialogue (e.g., self-play negotiation, collaborative planning) demonstrate partial ToM (Li et al. 2023; Wang et al. 2022), yet these remain first- or second-order, i.e., "A knows B knows X", without convergence toward collective mutual understanding.

Common Knowledge and Epistemic Logic

Philosophical and formal analyses of common knowledge date back to Lewis, defining it as a state in which all members of a group know a proposition and know that everyone else knows it, ad infinitum. In game theory, common knowledge underlies coordination equilibria and collective rationality (Binmore and Brandenburger 1988). Despite its importance, AI research rarely operationalizes these mechanisms in multi-agent communication.
Our work bridges this gap by proposing an experimental framework that models and detects the emergence of common knowledge using LLM-based agents.

Proposed Experiment

Core Idea

We model common knowledge as a group-level epistemic state in which all agents know a proposition p, know that others know p, and so on to higher orders. Operationally, we approximate common knowledge when three observables co-occur within a short horizon T: (1) Belief Alignment: all agents' first- and second-order beliefs about p agree; (2) Mutual Acknowledgment: messages contain explicit meta-beliefs (e.g., "I know that you know p") that are reciprocally confirmed; (3) Stability: no agent revises beliefs about p for k consecutive turns. Following Li et al. (2023), each agent maintains a structured belief state (facts; beliefs-about-others) and communicates in natural language. Group state is evaluated by parsing messages for epistemic markers and comparing belief states across agents.

Experiment Design

Goal. To examine whether groups of LLM-based agents can spontaneously develop approximate common knowledge (CK) through interaction, and to explore whether simple epistemic prompting helps such knowledge emerge more reliably.

Operational Definition. We interpret approximate CK for a proposition p when three conditions co-occur within a short horizon T: (1) Belief Alignment: all agents share consistent first- and second-order beliefs about p; (2) Mutual Acknowledgment: reciprocated meta-beliefs appear (e.g., "I know that you know p"); (3) Stability: no belief edits about p for at least three consecutive turns.

Setting. A grid-world "key–door" task requires agents to explore, infer the key color, and coordinate a simultaneous unlock. All messages are stored in a shared conversation transcript visible to every agent, serving as the only communication medium.

Conditions.
• Free Chat: agents converse freely without structure.
• Fact Prompt: prompts ask agents to clearly state their own belief ("State what you believe about the key color.").
• Epistemic Prompt: adds meta-belief scaffolds ("State what others believe and acknowledge if you know they know.").

Measures. Belief Alignment Rate (Li et al. 2023), Mutual-Acknowledgment Density, Convergence Stability (number of unchanged turns), and Task Latency (turns from first correct discovery to a legal unlock under CK criteria).

Procedure. Each turn, agents output an Action: and an Epistemic: block; belief states (facts and beliefs-about-others) are updated after each message. When all CK conditions are satisfied, the team may attempt to unlock the door. Analyses compare conditions via ANOVA on alignment and latency, with descriptive tracking of acknowledgment density. A minimal sketch of how the CK criteria could be checked programmatically is given below.

Expected Results and Future Work

Expected Results. We expect occasional but unstable belief alignment in the Free Chat condition, more consistent convergence in Fact Prompt, and the highest stability and acknowledgment density in Epistemic Prompt. Belief alignment is anticipated to approach full agreement (≥ 0.9) under epistemic prompting, with frequent mutual confirmations (e.g., "we all know," "I know that you know"). These outcomes would indicate that simple linguistic scaffolds can transform momentary agreement into sustained mutual awareness, demonstrating the early emergence of approximate common knowledge among LLM agents.
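As a companion to the operational definition above, the following minimal Python sketch checks the three approximate-CK conditions (belief alignment, mutual acknowledgment, stability) over a window of turns. It is an assumption-laden illustration rather than the authors' implementation; the data structures BeliefState and Message, the simplified alignment check, and the threshold k = 3 are hypothetical choices mirroring the prose.

from dataclasses import dataclass, field

@dataclass
class BeliefState:
    own: str                                           # first-order belief about p (e.g., "key is red")
    about_others: dict = field(default_factory=dict)   # agent name -> believed belief of that agent

@dataclass
class Message:
    sender: str
    receiver: str
    acknowledges_knowing: bool   # True if it contains an "I know that you know p"-style statement

def approximate_ck(beliefs, messages, unchanged_turns, k=3):
    """Approximate CK holds when alignment, mutual acknowledgment and stability co-occur."""
    agents = list(beliefs)
    # (1) Belief Alignment: first- and second-order beliefs about p agree across agents.
    aligned = all(
        b.own == beliefs[agents[0]].own and
        all(v == beliefs[agents[0]].own for v in b.about_others.values())
        for b in beliefs.values()
    )
    # (2) Mutual Acknowledgment: every ordered pair of agents has exchanged a meta-belief confirmation.
    acked_pairs = {(m.sender, m.receiver) for m in messages if m.acknowledges_knowing}
    mutual = all((a, b) in acked_pairs and (b, a) in acked_pairs
                 for a in agents for b in agents if a != b)
    # (3) Stability: no agent revised its belief about p for k consecutive turns.
    stable = all(unchanged_turns[a] >= k for a in agents)
    return aligned and mutual and stable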
Future Work. Future studies will extend this framework beyond simple coordination games to more complex multi-agent environments that require negotiation or moral reasoning. We plan to explore partial or unreliable communication settings to test the necessity of public signals for maintaining common knowledge. Another direction is to investigate how human participants interact with agent groups that already possess mutual awareness, assessing whether artificial common knowledge can foster human trust, perceived cohesion, or group-level Theory of Mind.

References

Apperly, I. A.; and Butterfill, S. A. 2009. Do humans have two systems to track beliefs and belief-like states? Psychological Review, 116(4): 953.
Binmore, K.; and Brandenburger, A. 1988. Common knowledge and game theory.
da Costa Werlang, S. R. 1989. Common knowledge. In Game Theory, 74–85. Springer.
De Freitas, J.; Thomas, K.; DeScioli, P.; and Pinker, S. 2019. Common knowledge, coordination, and strategic mentalizing in human social life. Proceedings of the National Academy of Sciences, 116(28): 13751–13758.
Feng, Y.; Song, T.; Tan, Y.; Zhu, Z.; and Lee, Y.-C. 2025. Multi-Agent Systems Shape Social Norms for Prosocial Behavior Change. In Companion Publication of the 2025 Conference on Computer-Supported Cooperative Work and Social Computing, 320–325.
Halpern, J. Y.; and Moses, Y. 1990. Knowledge and common knowledge in a distributed environment. Journal of the ACM (JACM), 37(3): 549–587.
Hansen, L. D.; and Bolander, T. 2020. Implementing theory of mind on a robot using dynamic epistemic logic. In Twenty-Ninth International Joint Conference on Artificial Intelligence, 1615–1621. International Joint Conference on Artificial Intelligence Organization.
Lewis, D. 2008. Convention: A Philosophical Study. John Wiley & Sons.
Li, H.; Chong, Y.; Stepputtis, S.; Campbell, J.; Hughes, D.; Lewis, C.; and Sycara, K. 2023. Theory of Mind for Multi-Agent Collaboration via Large Language Models. In Bouamor, H.; Pino, J.; and Bali, K., eds., Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 180–192. Singapore: Association for Computational Linguistics.
Rocha, M.; da Silva, H. H.; Morales, A. S.; Sarkadi, S.; and Panisson, A. R. 2023. Applying theory of mind to multi-agent systems: A systematic review. In Brazilian Conference on Intelligent Systems, 367–381. Springer.
Song, T.; Tan, Y.; Zhu, Z.; Feng, Y.; and Lee, Y.-C. 2025. Multi-agents are social groups: Investigating social influence of multiple agents in human-agent interactions. Proceedings of the ACM on Human-Computer Interaction, 9(7): 1–33.
Strachan, J. W.; Albergo, D.; Borghini, G.; Pansardi, O.; Scaliti, E.; Gupta, S.; Saxena, K.; Rufo, A.; Panzeri, S.; Manzi, G.; et al. 2024. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7): 1285–1295.
Wang, Y.; Zhong, F.; Xu, J.; and Wang, Y. 2022. ToM2C: Target-oriented Multi-agent Communication and Cooperation with Theory of Mind. In International Conference on Learning Representations.