Paper deep dive

Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

Year: 2025 | Venue: arXiv preprint | Area: Mechanistic Interp. | Type: Position | Embeddings: 48

Abstract

The era of Large Language Models (LLMs) presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional 'inspective' interpretability methods (opening the black-box) do not use. Having a language model that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to discover potentially superhuman concepts that can improve humans' mental model of machines. Agentic interpretability introduces challenges, particularly in evaluation, due to what we call its 'human-entangled-in-the-loop' nature (humans' responses are an integral part of the algorithm), making design and evaluation difficult. We discuss possible solutions and proxy goals. As LLMs approach human parity in many tasks, agentic interpretability's promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than see us fall increasingly far from understanding them.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:16:20 PM

Summary

The paper introduces 'agentic interpretability,' a paradigm where Large Language Models (LLMs) proactively assist human understanding through multi-turn, interactive dialogues. By building and leveraging mental models of the user, LLMs can explain complex or superhuman concepts, transforming interpretability from a static 'inspective' process into a collaborative, teacher-student dynamic. The authors discuss core components, potential applications like model debugging and mechanistic interpretability, and challenges such as 'human-entangled-in-the-loop' evaluation.

Entities (5)

Agentic Interpretability · methodology · 100%
Inspective Interpretability · methodology · 100%
Human-Entangled-in-the-Loop · evaluation-challenge · 95%
Mechanistic Interpretability · methodology · 95%
Mental Model · concept · 95%

Relation Signals (3)

Agentic Interpretability → contrastswith → Inspective Interpretability

confidence 95% · Such conversation is a new capability that traditional ‘inspective’ interpretability methods (opening the black-box) do not use.

Agentic Interpretability → utilizes → Mental Model

confidence 95% · a method pursues agentic interpretability if it proactively assists human understanding... by developing and leveraging a mental model of the user

Human-Entangled-in-the-Loop → complicates → Agentic Interpretability

confidence 90% · Agentic interpretability introduces challenges, particularly in evaluation, due to what we call ‘human-entangled-in-the-loop’ nature

Cypher Suggestions (2)

Find all interpretability methodologies mentioned in the paper. · confidence 90% · unvalidated

MATCH (e:Entity {entity_type: 'Methodology'}) RETURN e.name

Map the relationship between methodologies and their associated challenges. · confidence 85% · unvalidated

MATCH (m:Methodology)-[:COMPLICATES|CHALLENGED_BY]->(c:EvaluationChallenge) RETURN m.name, c.name

Full Text

47,639 characters extracted from source content.

arXiv:2506.12152v1 [cs.AI] 13 Jun 2025

Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

Google DeepMind

The era of Large Language Models (LLMs) presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional ‘inspective’ interpretability methods (opening the black-box) do not use. Having a language model that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student’s comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to discover potentially superhuman concepts that can improve humans’ mental model of machines. Agentic interpretability introduces challenges, particularly in evaluation, due to what we call its ‘human-entangled-in-the-loop’ nature (humans’ responses are an integral part of the algorithm), making design and evaluation difficult. We discuss possible solutions and proxy goals. As LLMs approach human parity in many tasks, agentic interpretability’s promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than see us fall increasingly far from understanding them.

1. Introduction

We argue that large language models (LLMs) unlock opportunities for agentic interpretability: leveraging the model itself as a cooperative agent to help humans understand the model. LLMs’ capacity for coherent, contextual conversation enables this.
Many constituent ideas of agentic interpretability are common, e.g., multi-turn dialogues with LLMs and chains of thought (Wei et al., 2022) that help humans to build mental models of machines via their reasoning. We believe that the synthesis of these ideas (machines and humans building mental models of each other) remains to be explored.

The term agentic is overloaded and lacks universal consensus. Here, agentic signifies something more than, yet closely related to, proactive. A method pursues agentic interpretability if it proactively assists human understanding in a multi-turn interactive process by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. This mental model (potentially implicit) is critical; to effectively aid comprehension, the model must infer the user’s knowledge and confusion, akin to an effective teacher-student dynamic. Multi-turn conversation is also essential, especially as many opaque model behaviors (e.g., why appending one seemingly meaningless token enables a jailbreak, but not when another arbitrary token is removed) may involve complex, counter-intuitive, or even superhuman knowledge. The mental model is crucial for humans beyond doing well on the end-task; humans’ mental model of the machine serves a distinct purpose of ensuring we do not fall behind in understanding increasingly powerful models.

Figure 1 | Agentic and inspective interpretability. Adapted from Schut et al. (2025) with permission.

©2025 Google DeepMind. All rights reserved

Agentic interpretability is not the right tool for all uses of interpretability. By focusing on interactiveness, it may sacrifice completeness, e.g.
by missing some important behaviors of potentially deceptive or misaligned models. Inspective interpretability approaches may be a better fit for high-stakes, safety-critical applications (Olah, 2020; Shah et al., 2025; Sharkey et al., 2025). In contrast, agentic interpretability is particularly useful for integration of systems into society and teaching humans superhuman knowledge (Schut et al., 2025).

Agentic interpretability introduces evaluation challenges. Humans are not merely in-the-loop, but interwoven in the process with machines. We call this human-entangled-in-the-loop; humans’ responses in the multi-turn dialogue are integral to the interpretability algorithm itself, limiting automated evaluation. Further, variation in user backgrounds, needs and model outputs will lead to a wide range of conversational trajectories. Finally, much knowledge might be beyond the capacity of an individual interacting with the system, or even superhuman in nature, making direct validation difficult. While LLMs’ capability to carry out coherent conversation enables us, to some extent, to use LLMs as proxies of the human turns, end-task metrics (e.g., did it enable faster model debugging?) remain critical (Doshi-Velez and Kim, 2017). We lay out possibilities and approximations for evaluation that we believe are promising.

Despite these considerable challenges, agentic interpretability is an opportunity too significant to overlook; without major breakthroughs in interpretability, human understanding risks being outpaced by rapid LLM advancements. This is particularly acute for lay people; history shows that while accessible technologies can offer widespread benefits, opacity often deepens societal divides or creates exclusion (Van Dijk, 2005). As LLMs are already complex and increasingly automated, we must explore how to make them work for us in fostering understanding, especially while they remain largely cooperative and their knowledge not yet predominantly superhuman.

2. What is agentic interpretability?

The word agentic carries various connotations across disciplines. In psychology and sociology, it often describes behaviors motivated by individualistic desires for mastery and control. In the context of modern AI, as noted by Merriam-Webster when discussing AI in the 2020s, agentic is described as “making decisions, taking actions, solving problems, reasoning, etc., on its own.” (Merriam-Webster, Accessed on 2025-05-22). We build on this latter notion, leveraging LLMs’ agency to impel them to build mental models of human users in order to proactively take actions to help humans understand functions, state or knowledge. In other words, a method pursues agentic interpretability if it proactively assists human understanding in a multi-turn interactive process by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM.

2.1. Core components of agentic interpretability

The following are core properties of agentic interpretability:

1. Proactive Assistance: The model takes initiative in the explanatory process, not merely responding to direct queries but potentially offering unsolicited clarifications, suggesting areas of exploration, or adapting its strategy based on inferred user needs.
2. Multi-Turn Interaction: Understanding is built through an extended dialogue, allowing for iterative refinement, clarification, and exploration of topics.
3. Mutual Mental Model: The model develops and maintains (implicitly or explicitly) a representation of the user’s current knowledge, understanding, and potential misconceptions regarding the topic at hand. This mental model informs the model’s explanatory actions. The process also helps humans to build a mental model of the model.
What is a mental model in this context? Mental models are extensively studied in cognitive science (Johnson-Laird, 1983), HCI (Norman, 1983), and human factors (Cannon-Bowers et al., 1993) among others. Johnson-Laird (1983) defines a mental model as an internal representation of an external reality that individuals construct to understand and reason about the world. People reason by constructing models of premises and then searching for models consistent with conclusions; these models are not necessarily rule-based or complete. In agentic interpretability, the LLM’s ability to use conversational context (e.g., remembering the user previously stated unfamiliarity with gravitational waves) is an implicit form of mental modeling. An explicit mental model might involve the LLM maintaining an explicit and organized knowledge graph of the user’s stated understanding and confusion, as explored in some proactive agent systems (e.g., Hahn et al., 2024).

2.1.1. Is a mental model necessary?

The importance of mutual mental models for effective collaboration is a well-established subject in human collaborations (Cannon-Bowers et al., 1993). We argue that a similar principle ought to hold true in human-machine collaborations.

Machine’s mental model of humans: Consider an alternative scenario: a model attempts to be helpful at every turn, yet lacks any information about the user (e.g., what they want, know and don’t know). Imagine such a model is heavily incentivized to explain; it might succeed for simple concepts by spending significant computation to devise optimal single-turn explanations, largely guessing what the user might need. This approach relies on luck or verbosity and is inefficient for complex understanding. It would be similar to interacting with a stateless LLM that requires constant reminders from users (e.g., “again, I am not familiar with string theory and my goal is to understand just enough to have casual conversations!”).
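As an illustration of the explicit end of this spectrum, here is a minimal sketch of a per-user record that an explaining system might maintain across turns. It is not taken from the paper or from Hahn et al. (2024): the names `UserModel` and `plan_explanation` are hypothetical, and plain sets of topics stand in for the richer knowledge graph the text mentions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserModel:
    """Hypothetical explicit mental model of one user (illustrative only)."""
    known: set[str] = field(default_factory=set)      # topics the user says they know
    confused: set[str] = field(default_factory=set)   # topics the user says confuse them
    goal: Optional[str] = None                        # the user's stated goal, if any

    def note_familiar(self, topic: str) -> None:
        # A later statement of familiarity overrides earlier confusion.
        self.confused.discard(topic)
        self.known.add(topic)

    def note_unfamiliar(self, topic: str) -> None:
        self.known.discard(topic)
        self.confused.add(topic)

    def needs_background(self, topic: str) -> bool:
        # Give background unless the user has said they know the topic.
        return topic not in self.known


def plan_explanation(user: UserModel, topics: list[str]) -> list[str]:
    """Order explanation steps, adding an introduction only where needed."""
    steps = []
    for topic in topics:
        if user.needs_background(topic):
            steps.append(f"introduce {topic}")
        steps.append(f"use {topic}")
    return steps


user = UserModel(goal="casual conversation")
user.note_unfamiliar("string theory")
user.note_familiar("gravity")
print(plan_explanation(user, ["gravity", "string theory"]))
```

Because the record persists across turns, the user does not have to repeat "I am not familiar with string theory" in every message, which is the contrast with the stateless scenario described above.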
Human’s mental model of machines: The human’s development of a mental model of the machine is not merely an efficiency enhancement, but also serves a distinct purpose. While traditional interpretability primarily views explanations as a means to achieve an external goal (e.g., debugging, trust), agentic interpretability inherently values the process of enhancing human understanding itself. By actively building and leveraging a mental model of the machine, humans have a better chance of ‘keeping up’ with increasingly complex models, fostering a deeper, more nuanced comprehension of these powerful systems.

2.2. Examples and opportunities of agentic interpretability

We provide hypothetical examples of agentic interpretability as a way of further conveying our vision and proposing exciting future research directions.

2.2.1. Model trainer model: The future of model developmental cycles

Current model development cycles can be viewed as a rudimentary form of agentic interpretability. Developers iteratively learn to build, train, and prompt models more effectively, essentially constructing their own mental models of how these systems behave with limited agency on the part of the model. This agency is seen in steps like automated evaluation, simulated human feedback, and automated red-teaming (Bai et al., 2022; Perez et al., 2022; Song et al., 2025), automated prompt tuning (Shin et al., 2020; Zhou et al., 2023), and automated interpretability (Rott Shaham et al., 2024; Sado et al., 2023). However, this process is limited as it typically follows human-prescribed routines. Consequently, there is little effort from machines to build mental models of the developers (for instance, to infer their current understanding or anticipate their informational needs) and thus tailor assistance accordingly.
A significant opportunity for agentic interpretability lies in transforming these cycles into collaborative dialogues. Imagine, for example, a meta-model trained on the entire history of a project’s development: experimental results, code changes, successes, failures, and even developer discussions. Researchers could then converse with this meta-model. It, in turn, could leverage its understanding of past efforts and its model of the developers’ current queries/prompts and knowledge gaps to proactively suggest hypotheses, identify overlooked patterns, or guide debugging strategies for future iterations based on its understanding of the user’s specific goals and current comprehension level. Such an agentic partner could significantly accelerate and deepen both the understanding gained throughout the development lifecycle and humans’ mental models of the machines. This, in the long run, helps us to ‘keep up’ with model evolution (e.g., to be better prepared for any unexpected step changes).

2.2.2. Teaching superhuman knowledge

Agentic interpretability is crucial when teaching superhuman knowledge to humans. Vygotsky’s theory of the Zone of Proximal Development (ZPD) (Vygotsky, 1978) posits that learning is maximized when individuals tackle tasks slightly beyond their current independent capabilities, but achievable with guidance. An agentic LLM, by developing a precise mental model of the user’s existing knowledge, may be able to identify this ZPD and provide such guidance. An interesting step towards leveraging ZPD is recent work in neologism learning (Hewitt et al., 2025), where the method adds a new word to the vocabulary, typically carrying a slightly different meaning than the human word (e.g., ‘machine good’ means the machine’s notion of ‘good’ answers).
Such words introduce a new concept, and can be used to enable conversation with the LLM using the new concept or discussing the concept, effectively scaffolding the learning process with tailored explanations and incremental steps (much like Figure 1). For instance, extending the work on AlphaZero teaching superhuman chess concepts to grandmasters (Schut et al., 2025), an LLM-based AlphaZero could engage a player in a Socratic dialogue. By discerning the player’s ZPD (e.g., understanding sacrifices but struggling with complex positional play), it could introduce a new superhuman chess concept super_chess_37 to design targeted puzzles and explain superhuman insights incrementally, while players can also ask questions, guiding the player through their learning frontier, much like a skilled human tutor.

2.2.3. Agentic mechanistic interpretability even with potentially deceptive models

Mechanistic interpretability strives for a rigorous, bottom-up understanding of model components, such as identifying functional circuits/mechanisms (Nanda et al., 2023; Olah, 2020). Agentic interpretability could transform this into an interactive diagnostic process. Imagine an “open-model surgery” where researchers actively converse with the model while intervening in its internal mechanisms: ablating connections, amplifying activations, or injecting specific inputs into identified circuits. The model, guided by its understanding of the researchers’ goal (to understand a specific component’s function), is then encouraged to explain the resulting behavioral or internal state changes. This interactive probing, akin to neurosurgeons conversing with patients during awake brain surgery to map critical functions, offers a dynamic way to test hypotheses and build understanding, but without the ethical constraints of biological systems.
This interactive, interventionist approach gains particular relevance when considering potentially deceptive models. A model whose internal states are being directly manipulated and inspected, yet which is simultaneously engaged in an explanatory dialogue, faces a stringent test of coherence. If a model intends to deceive, it must reconcile its external explanations with the (potentially contradictory) evidence revealed by internal interventions. This is analogous to an interrogation where a suspect’s claims can be immediately cross-referenced with physical evidence or observed physiological responses.

Even without direct internal manipulation, compelling a potentially deceptive model to maintain a façade of cooperation through extended, probing dialogue is a demanding task. Avoiding self-contradiction, particularly when its internal “intentions” diverge from its stated explanations, requires significant cognitive load (from the model’s perspective). Such interactions create numerous opportunities for inconsistencies or tells to emerge, providing insights that might otherwise remain hidden.

3. Alternative views

Here, we consider alternative perspectives and potential objections to our proposed framework, alongside our rebuttals and clarifications.

3.1. What if models are not cooperative, or deliberately deceptive?

Agentic interpretability relies on a model’s cooperativeness, or the ability to guide models towards helpful interaction. Most advanced LLMs currently do not consistently exhibit severe, overt deceptive behaviors. This period presents a crucial window for developing agentic methods and enabling humans to ‘catch up’ in understanding these rapidly evolving systems before such challenges potentially escalate. If and when a model is non-cooperative or deliberately and successfully deceptive, agentic interpretability may not be the right tool. Specifically, by focusing on interactiveness, it may sacrifice completeness, e.g.
by missing some important behaviors of potentially deceptive or misaligned models that may attempt to mislead the user (Greenblatt et al., 2024). Even without misalignment, the mental model might have to be built on a concept with blurry boundaries (e.g., what a ‘good’ model response means from a human’s perspective (Sorensen et al., 2024)), and the model an LLM builds of the user may affect its behavior in subtle and undesired ways (Chen et al., 2024). This means that inspective interpretability approaches like mechanistic interpretability (Olah, 2020; Sharkey et al., 2025) may be a better fit for high-stakes, safety-critical applications (Shah et al., 2025) (e.g., auditing for deception or hidden goals (Marks et al., 2025), or monitoring for harmful behavior). However, as previously discussed in the context of agentic mechanistic interpretability (Section 2.2.3), doing ‘open-model surgery’, where we compel the model to ‘teach us’ while manipulating its internal states, could still reveal discrepancies or expose deceptive intent.

3.2. Are we devaluing traditional or non-agentic interpretability?

Agentic interpretability does not devalue traditional or non-agentic methods; rather, it aims to build upon and enhance their utility. Inspective interpretability could be used as a component of agentic interpretability. Moreover, many enduring lessons from traditional interpretability, such as the necessity of overcoming human biases (Tversky and Kahneman, 1974) and the need for rigorous quantitative evaluation of explanations (Doshi-Velez and Kim, 2017; Hoffman et al., 2018), retain their critical importance. Fundamentally, while the capabilities of models have dramatically evolved, core human cognitive frameworks and the need for understandable AI have not.
In addition, techniques focusing on complete and rigorous solutions (e.g., mechanistic interpretability) remain important for circumventing potentially deceptive behaviors that might be indistinguishable through input-output analysis alone.

3.3. If explanation is a conversation, we don’t get an artifact that humans can observe to reach a consensus.

Inspective interpretability often produces a clear artifact, such as highlighted pixels deemed ‘important’ for image classification (Lundberg and Lee, 2017; Ribeiro et al., 2016), a list of influential training data points (Koh and Liang, 2017), examples (Kim et al., 2016), rules (Guidotti et al., 2018), results of probing internal representations (Alain and Bengio, 2018; Ettinger et al., 2016; Shi et al., 2016) or mechanisms (Nanda et al., 2023; Olah, 2020). These artifacts serve as important documentation that humans can share, critique, or use to reach consensus, as the experience of the artifact is generally uniform across individuals.

Agentic interpretability might initially seem to lack such grounding explanations. However, requesting the model to generate a summary report at the end of a conversation is not only possible but also desirable. Such a report allows humans to double-check their understanding and align expectations, similar to a ‘meeting note.’ This report can even be iteratively improved with humans in the loop until optimal documentation is achieved. While this is not a trivial task, generating effective summaries is a challenge our research community and many LLM developers have been actively addressing, with significant progress made as a result.

4. Challenges and trade-offs in agentic interpretability

4.1. Challenges

While agentic interpretability offers exciting avenues for exploration and could fundamentally alter how humans collaborate with machines, its pursuit also introduces several challenges.
Human-Entangled-in-the-Loop Evaluation: A primary challenge lies in evaluation. In agentic interpretability, humans are not merely “in-the-loop” but rather interwoven with machines. We call this human-entangled-in-the-loop. Their responses and evolving mental states are not just feedback, but integral components. While human-in-the-loop evaluation has always been crucial for assessing explanation utility (Doshi-Velez and Kim, 2017), this deeper entanglement makes achieving reproducibility, conducting controlled comparisons, and isolating the impact of specific variables exceptionally difficult. One useful tool might be LLMs themselves as proxy humans, whenever appropriate.

High variance in users’ backgrounds and needs and in LLM responses: Suppose we designed an agentic interpretability method for a set of expert users, who all attended the same school and are working on the same project. Even then, each individual will likely bring a different preferred way of understanding the expert subject matter (e.g., visual or textual). This variance only grows as we increase the pool of target users, even with the same seed prompt. This human-induced variance is compounded by LLM variance: LLMs can exhibit substantial semantic differences in outputs even for subtly varied inputs, a phenomenon central to prompt engineering. This translates to a potentially vast space of conversational trajectories. The difficulty of developing a method that covers the potentially vast variance in how conversations may unfold increases further as the complexity of the knowledge grows, as when the machine tries to teach us superhuman knowledge/concepts.

4.2. Potential trade-offs

The rich conversational aspect with humans in agentic interpretability may come with some cost, such as completeness and possibilities to hill-climb without humans in the loop.
Inefficient for completeness. Agentic interpretability might be an inefficient means to pursue a ‘complete explanation,’ where completeness implies (borrowing from mathematics) that every valid statement (or instance) can be proven or tested. For example, consider circuit finding, a major effort in mechanistic interpretability, where the goal is to use model internals to discover the mechanisms of their functions. These mechanisms aim to be exhaustive and precise, such that behaviors like deception can be reliably detected and manipulated; achieving this is arguably harder using only input and output. While agentic mechanistic interpretability remains an interesting avenue to pursue (see Section 2.2.3), doing so via conversation might be less efficient than directly modifying internal representations, even if it can output a functional form (e.g., “here is the governing equation of the concept.”)

Difficult to hill-climb for computational efficiency. Agentic interpretability does not readily enable functionally-grounded evaluation (Doshi-Velez and Kim, 2017), which is well-suited for tasks such as improving the computational aspects of a verified interpretability method (e.g., making influence functions more efficient (Grosse et al., 2023; Koh and Liang, 2017)), since an agentic method is difficult to establish without human interaction. One might attempt to use an LLM by reducing its stochasticity (e.g., using the same seed every time with low temperature), although the challenge of high semantic variance (see Section 4.1) will still be present.

5. Ideas and challenges for evaluating agentic interpretability methods

Users may have a few different end goals with agentic interpretability: 1) case improve: to make the model do what we want (help machines understand human concepts better); 2) case learn: to learn something new from the model (learn a machine concept). Case learn can be the sole goal or be used to achieve case improve.

How do mental models relate? In case improve, helping machines understand human concepts better naturally requires our understanding of the machine also to improve (e.g., how to explain our concepts better to machines). In case learn, not only are humans learning, but machines are also learning how to explain their concepts to us better (becoming a better teacher). Unfortunately, however, directly evaluating these mental models is impossible. Thus, we resort to a proxy such as measuring end-task metrics (e.g., did the model improve in the way we wanted? Can humans predict machine behavior?). In this section, we lay out some example proxies.

To aid the discussion, we’ll introduce a bit of notation. Let $x$ be an input to an LLM, and $y$ be an output (both may be textual or multimodal). Let $f : (x, y) \mapsto c$, where $c \in \mathcal{C}$, be the concept function, a function that encapsulates some property of inputs, outputs (or both) that is the focus of our goal of understanding. The concept function may be human-defined (e.g., is $y$ correct as an output for $x$? Does $y$ meet some length constraint? Is $x$ the kind of question the model should refuse to answer?) or dependent on the model (would the model refuse on this $x$? Would the model judge $y$ as a good response for $x$? What are the model’s superhuman chess moves given $x$?)

5.1. Case improve: the end-goal is to make machines do what we want

In this setting, there are some human-defined concepts, $f$ (e.g., is this funny? does the output of the model solve the math question?), for which we want the model to respond differently (e.g., output an answer close to the human’s concept of correct).
A model in agentic interpretability may provide information (e.g., “my understanding of this code base is poor; you should consider including documentation in my system prompt”) that leads to a new model $M'$ such that the outcome of $f(x, M'(x))$ is more favorable on average than $f(x, M(x))$, where $M$ is the original model, likely also with some constraint on how different $M'$ is from $M$ in other dimensions (e.g., one can’t just simply replace $M$ with a better, completely independent model $M'$). We’re intentionally leaving much of this abstract, as each element (what metric $f$ to optimize for, how to measure the gap between $M$ and $M'$) must depend on the specific application.

Knowing how to improve a system is a particularly practical goal, and being able to do so represents concrete evidence of understanding. Thus, evaluation involves how well we are able to change the model’s behaviors in our favor. A reasonable objection to this evaluation is that it just measures model improvement. There are non-interpretability methods to achieve the same or better success metric which could be a better choice for the situation. However, a machine’s proactive assistance is an add-on to any method.

5.2. Case learn: the end-goal is to learn about machine concepts (potentially superhuman)

A machine concept is a property wherein the value of that property for any input (or input-output pair) is defined by some function of the machine. This machine concept could range from something familiar (‘the machine refuses to answer’ or ‘if asked to evaluate the quality of this input-output pair, the machine would say it was high-quality’) to something foreign (‘neuron 1383 fires’ or ‘superhuman chess strategy concept 37’).

One way to evaluate the human’s understanding of a machine concept is by testing the human’s ability to predict the concept of new examples (akin to ‘simulatability’ in inspective interpretability).
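A minimal sketch of such a prediction test, under stated assumptions: the concept function is queryable, inputs and outputs are plain strings, and agreement is measured as simple accuracy. The names `simulatability_score` and `toy_concept` are hypothetical, and the toy concept (a word-count threshold) is only a stand-in for a real machine concept; the paper does not prescribe a specific metric.

```python
from typing import Callable, Hashable, Sequence, Tuple

Example = Tuple[str, str]  # (x, y): an input and an output, both as text here


def simulatability_score(
    examples: Sequence[Example],
    human_predictions: Sequence[Hashable],
    concept_fn: Callable[[str, str], Hashable],
) -> float:
    """Fraction of held-out examples where the human's guess equals f(x, y)."""
    if len(examples) != len(human_predictions):
        raise ValueError("need exactly one prediction per example")
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for (x, y), c_hat in zip(examples, human_predictions)
        if concept_fn(x, y) == c_hat
    )
    return hits / len(examples)


def toy_concept(x: str, y: str) -> bool:
    # Illustrative stand-in for a machine concept: "the response has more than 5 words".
    return len(y.split()) > 5


held_out = [
    ("q1", "short answer"),
    ("q2", "this answer has more than five words"),
]
before = simulatability_score(held_out, [False, False], toy_concept)  # guesses before the dialogue
after = simulatability_score(held_out, [False, True], toy_concept)    # guesses after the dialogue
print(before, after)
```

Comparing the score on fresh examples before and after a conversation gives one rough signal of whether the dialogue improved the human's grasp of the concept.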
Concretely, the human observes $x$ (and maybe $y$) and produces some $\hat{c}$, which is compared to the true $c = f(x, y)$. For example, a human's understanding of a model's notion of good responses might be improved through discussion with the model (e.g., the model teaches the human that it likes responses with flowery language), and then we evaluate the human's classification accuracy on new $x$ against the machine's ground truth. If a human is trying to understand superhuman chess concept 23, discussion with the model may involve, e.g., the model giving examples of concept 23 and making up quizzes for the human to identify gaps in understanding; the human's understanding is then evaluated by how accurately they can predict concept 23 given a new board position $x$. When it is not possible to find the ground truth concept (i.e., $f$ is not queryable), we would resort to measuring end-task metrics, assessing how well the human leverages their understanding of $f$ to achieve a specific goal, similar to the application-grounded metrics in Doshi-Velez and Kim (2017) (e.g., does the chess player's Elo improve?).

5.3. Evaluation challenges and paths forward

Models do not know why they behave as they do. Just as humans lack introspection across many cognitive abilities, models empirically have little meta-understanding of why they sometimes perform well and other times don't. A famous example comes from natural language generation: native speakers can generate fluent sentences, but struggle to introspect well enough to teach another person the rules they follow (a large topic in the field of linguistics). Similarly, models don't know that they don't handle their long contexts that well, or at least don't necessarily know to suggest that. An interactive dialogue with the model can aid both humans and machines in figuring this out: a co-discovery process. Consider an analogy: a person (the model) exhibits certain behaviors and can identify if they are good or bad (concept), but does not know how to systematically achieve better outcomes.
By conversing with a friend (the human) who can pose clarifying questions or hypothetical scenarios (e.g., “Given $x'$, what is your $y'$ and the resulting $c'$?”), both parties might collaboratively build a better mental model of the machine.

How do I know if something is a human concept or a machine concept? We've presented two categories of exploration under agentic interpretability above: case improve (typically involving a human concept) and case learn (a machine concept). Sometimes, however, it can be unclear whether a concept comes from a human or a machine. For example, if we prompt the model to generate scores for the quality of a response, we are certainly trying to understand the model's notion of good, but the concept is also partially human, because how we train and prompt the model partially determines the output. So the separation is not perfectly categorical, and we recommend focusing instead on which kind of evaluation (case improve or case learn) would be more scientifically interesting or useful for your purposes.

Human evaluation is expensive and hard to replicate. The eventual end goal of agentic interpretability is fundamentally to teach real humans useful properties of models (by mutually building mental models). Evaluating this is nuanced and expensive, especially when one wants to measure how well agents adapt to the (differing) existing knowledge of each human. However, in developing agentic interpretability methods, we believe that using LLMs as proxies to simulate the human in the agentic interpretability loop will provide a useful and relatively fast, inexpensive signal.

6. Other related work

As related interpretability work has been incorporated into previous sections, we now cover the remaining related work from other, non-interpretability fields.

6.1.
Cognitive science

The concept of mutual modeling is deeply rooted in cognitive science theories of communication, collaboration, and Theory of Mind (the ability to attribute mental states to others).

Mental Models: Humans naturally build mental models of systems and people they interact with (Gweon, 2021; Norman, 1983), as do (some) animals (Krupenye and Call, 2019; Premack and Woodruff, 1978). Effective interaction relies on aligning these mental models. Agentic interpretability explicitly incorporates this, aiming for AI that builds a model of the human user and helps the user build an accurate model of the AI (Lombrozo, 2024).

Grounding in Communication: Research on how humans achieve common ground in dialogue (Clark and Schaefer, 1991) informs how interactive explanations should be structured, involving feedback, clarification requests, and confirmations.

Rational Speech Acts (RSA): The RSA framework (Frank and Goodman, 2012) models communicative games between speakers with differing recursive levels of reasoning about the other speaker. Core to this framework is the increased communicative efficiency one achieves by choosing what to say only after simulating how the other speaker would interpret each action (and how they'd simulate your reasoning, recursively). Such simulation is likely necessary to achieve successful agentic interpretability.

Teaching and Pedagogy: Effective teaching involves diagnosing the learner's current understanding and tailoring instruction accordingly (Koedinger et al., 2013; Wood et al., 1976). Agentic interpretability can be viewed as a form of teaching, where the AI explains itself by modeling the “learner” (the user). Work on “Learning from Human Input by Proactively Considering Human Factors” (Gweon, 2021) emphasizes building machines that consider human values, intentions, and beliefs, aligning closely with the goal of mutual understanding.
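The recursive simulation described in the Rational Speech Acts paragraph above can be made concrete with a toy reference game. The sketch below is a simplified illustration under stated assumptions, not the paper's method: the two objects, the two utterances, the uniform priors, the absence of utterance costs, and the function names are all hypothetical choices made here. A literal listener interprets utterances by truth value alone, a pragmatic speaker picks utterances by simulating that listener, and a pragmatic listener in turn simulates the speaker.

```python
from typing import Dict, List

# Hypothetical toy reference game, for illustration only.
OBJECTS: List[str] = ["glasses_only", "glasses_and_hat"]
UTTERANCES: List[str] = ["glasses", "hat"]
# TRUTH[u][o] is 1.0 if utterance u is literally true of object o.
TRUTH: Dict[str, Dict[str, float]] = {
    "glasses": {"glasses_only": 1.0, "glasses_and_hat": 1.0},
    "hat": {"glasses_only": 0.0, "glasses_and_hat": 1.0},
}


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {k: (v / total if total > 0 else 0.0) for k, v in scores.items()}


def literal_listener(utterance: str) -> Dict[str, float]:
    """L0(o | u): uniform prior over objects, filtered by literal truth."""
    return normalize({o: TRUTH[utterance][o] for o in OBJECTS})


def pragmatic_speaker(obj: str, alpha: float = 1.0) -> Dict[str, float]:
    """S1(u | o): favors utterances that lead L0 to the intended object."""
    return normalize(
        {u: literal_listener(u)[obj] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance: str) -> Dict[str, float]:
    """L1(o | u): inverts the speaker model under a uniform object prior."""
    return normalize({o: pragmatic_speaker(o)[utterance] for o in OBJECTS})


if __name__ == "__main__":
    print(literal_listener("glasses"))
    print(pragmatic_listener("glasses"))
```

In this toy game the literal listener is indifferent between the two objects on hearing "glasses", while the pragmatic listener puts about 0.75 on `glasses_only`, reasoning that a speaker who meant the other object would more likely have said "hat". An explaining model that simulates its user in this way is one possible reading of the mutual modeling the section describes.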
Levels of Analysis: Marr's levels (computational, algorithmic, implementational) (Marr, 1982) provide a framework for thinking about explanations. Inspective methods often focus on the algorithmic or implementational levels. Agentic interpretability methods might operate more at the computational level (what is the goal?) or translate between levels based on user interaction.

6.2. HCI

Much pre-LLM HCI research on human workflows in interfacing with computers is relevant to agentic interpretability. The fundamental challenges of human cognitive overload, human biases, and human capabilities haven't changed; they have only been magnified. Here is a subset of the relevant work.

XAI Interfaces: HCI studies how to design effective interfaces for presenting explanations (Abdul et al., 2018; Hoffman et al., 2018). Agentic interpretability contributes by emphasizing the dialogue aspect over static presentation.

Interactive Machine Learning: Systems where humans are involved in the model training loop (Amershi et al., 2014) share the interactive nature but often focus on model improvement rather than post-hoc explanation.

Human-AI Collaboration: Research explores how humans and AI can work together effectively (Bansal et al., 2021, 2024). Agentic interpretability is crucial for enabling the mutual understanding needed for fluid collaboration. Work by Cai et al. (2019) on human-centered tools for intelligible AI highlights the need for systems that support iterative exploration and understanding (Shen et al., 2024).

Agentic interpretability aims to integrate insights from these fields to specifically address the challenge of achieving mutual understanding about AI reasoning through interaction.

7.
Conclusion

The era of LLMs and generative models is full of opportunity, but understanding the enormous complexity of computation behind LLMs' impressive behaviors is an as-yet unmet scientific and engineering grand challenge. However, LLMs are not just objects of study; increasingly, they can behave as intelligent agentic tools for accelerating our growth. By enabling a teaching mode in these models that helps us keep up with their potentially superhuman concepts, we believe this grand interpretability challenge becomes more achievable. Ultimately, LLMs may provide the opportunity to greatly expand our knowledge, as humans have repeatedly done throughout history in the face of technological change.

Acknowledgement

Thank you Robert Geirhos for thoughtful comments, edits and suggesting related work.

References

A. Abdul, J. Vermeulen, D. Wang, B. Y. Lim, and M. Kankanhalli. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018. URL https://api.semanticscholar.org/CorpusID:5063596.
G. Alain and Y. Bengio. Understanding deep learning with linear models. arXiv preprint arXiv:1606.05382, 2018.
S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza. Power to the people: The role of humans in interactive machine learning. AI Mag., 35:105–120, 2014. URL https://api.semanticscholar.org/CorpusID:127197.
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukošiūtė, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. Dassarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N.
Joseph, S. McCandlish, T. B. Brown, and J. Kaplan. Constitutional AI: Harmlessness from AI feedback. ArXiv, abs/2212.08073, 2022. URL https://api.semanticscholar.org/CorpusID:254823489.
G. Bansal, T. Wu, J. Zhou, R. Fok, B. Nushi, E. Kamar, M. T. Ribeiro, and D. Weld. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI conference on human factors in computing systems, pages 1–16, 2021.
G. Bansal, J. W. Vaughan, S. Amershi, E. Horvitz, A. Fourney, H. Mozannar, V. Dibia, and D. S. Weld. Challenges in human-agent communication, 2024. URL https://arxiv.org/abs/2412.10380.
C. J. Cai, E. Reif, N. Hegde, J. D. Hipp, B. Kim, D. Smilkov, M. Wattenberg, F. B. Viégas, G. S. Corrado, M. C. Stumpe, and M. Terry. Human-centered tools for coping with imperfect algorithms during medical decision-making. CoRR, abs/1902.02960, 2019. URL http://arxiv.org/abs/1902.02960.
J. A. Cannon-Bowers, E. Salas, and S. A. Converse. Shared mental models in expert team performance. In N. J. Castellan Jr., editor, Individual and group decision making, pages 221–245. Lawrence Erlbaum Associates, 1993.
Y. Chen, A. Wu, T. DePodesta, C. Yeh, K. Li, N. C. Marin, O. Patel, J. Riecke, S. Raval, O. Seow, et al. Designing a dashboard for transparency and control of conversational AI. arXiv preprint arXiv:2406.07882, 2024.
H. H. Clark and E. F. Schaefer. Grounding in communication. Psychological Review, 1991.
F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning, 2017.
A. Ettinger, A. Elgohary, and P. Resnik. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2524. URL https://aclanthology.org/W16-2524/.
M. C. Frank and N. D.
Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012. doi: 10.1126/science.1218633. URL https://www.science.org/doi/abs/10.1126/science.1218633.
R. Greenblatt, C. Denison, B. Wright, F. Roger, M. MacDiarmid, S. Marks, J. Treutlein, T. Belonax, J. Chen, D. Duvenaud, et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024.
R. Grosse, J. Bae, C. Anil, N. Elhage, A. Tamkin, A. Tajdini, B. Steiner, D. Li, E. Durmus, E. Perez, et al. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296, 2023.
R. Guidotti, A. Monreale, F. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.
H. Gweon. Inferential social learning: Cognitive foundations of human social learning and teaching. Trends in cognitive sciences, 25, 08 2021. doi: 10.1016/j.tics.2021.07.008.
M. Hahn, W. Zeng, N. Kannen, R. Galt, K. Badola, B. Kim, and Z. Wang. Proactive agents for multi-turn text-to-image generation under uncertainty, 2024. URL https://arxiv.org/abs/2412.06771.
J. Hewitt, R. Geirhos, and B. Kim. We can't understand AI using our existing vocabulary, 2025. URL https://arxiv.org/abs/2502.07586.
R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman. Metrics for explainable AI: Challenges and prospects. ArXiv, abs/1812.04608, 2018. URL https://api.semanticscholar.org/CorpusID:54577009.
P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, 1983.
B. Kim, R. Khanna, A. Torralba, and H. Pfister. Examples are not enough, learn to criticize! evaluation of saliency methods by interaction. In Advances in Neural Information Processing Systems, volume 29, 2016.
K. R. Koedinger, R. Hausmann, P. Jordan, and A. Skogsholm.
Toward a science of productive failure in learning: A report from the first productive failure forum. Educational psychologist, 48(4):229–237, 2013.
P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, 2017. URL https://api.semanticscholar.org/CorpusID:13193974.
C. Krupenye and J. Call. Theory of mind in animals: Current and future directions. Wiley Interdisciplinary Reviews: Cognitive Science, 10(6):e1503, 2019.
T. Lombrozo. Learning by thinking in natural and artificial minds. Trends in Cognitive Sciences, 28, 09 2024. doi: 10.1016/j.tics.2024.07.007.
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
S. Marks, J. Treutlein, T. Bricken, J. Lindsey, J. Marcus, S. Mishra-Sharma, D. Ziegler, E. Ameisen, J. Batson, T. Belonax, et al. Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965, 2025.
D. Marr. Vision: A computational investigation into the human representation and processing of visual information. MIT press, 1982.
Merriam-Webster. Merriam-webster, Accessed on 2025-05-22. URL https://www.merriam-webster.com/slang/agentic. In Merriam-Webster.com slang dictionary.
N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.05217.
D. A. Norman. Some observations on mental models. Mental models, 7:7–14, 1983.
C. Olah. Zoom in: An introduction to mechanistic interpretability. Distill, 2020.
E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red teaming language models with language models. In Conference on Empirical Methods in Natural Language Processing, 2022. URL https://api.semanticscholar.org/CorpusID:246634238.
D. Premack and G. Woodruff. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4):515–526, 1978.
M. T. Ribeiro, S. Singh, and C.
Guestrin. Why should I trust you?: Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
T. Rott Shaham, S. Schwettmann, F. Wang, A. Rajaram, E. Hernandez, J. Andreas, and A. Torralba. A multimodal automated interpretability agent. In Forty-first International Conference on Machine Learning, 2024.
F. Sado, C. K. Loo, W. S. Liew, M. Kerzel, and S. Wermter. Explainable goal-driven agents and robots - a comprehensive review. 55(10), Feb. 2023. ISSN 0360-0300. doi: 10.1145/3564240. URL https://doi.org/10.1145/3564240.
L. Schut, N. Tomašev, T. McGrath, D. Hassabis, U. Paquet, and B. Kim. Bridging the human–AI knowledge gap through concept discovery and transfer in AlphaZero. Proceedings of the National Academy of Sciences, 122, 03 2025. doi: 10.1073/pnas.2406675122.
R. Shah, A. Irpan, A. M. Turner, A. Wang, A. Conmy, D. Lindner, J. Brown-Cohen, L. Ho, N. Nanda, R. A. Popa, et al. An approach to technical AGI safety and security. arXiv preprint arXiv:2504.01849, 2025.
L. Sharkey, B. Chughtai, J. Batson, J. Lindsey, J. Wu, L. Bushnaq, N. Goldowsky-Dill, S. Heimersheim, A. Ortega, J. Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025.
H. Shen, T. Knearem, R. Ghosh, K. Alkiek, K. Krishna, Y. Liu, Z. Ma, S. Petridis, Y.-H. Peng, L. Qiwei, S. Rakshit, C. Si, Y. Xie, J. P. Bigham, F. Bentley, J. Chai, Z. Lipton, Q. Mei, R. Mihalcea, M. Terry, D. Yang, M. R. Morris, P. Resnick, and D. Jurgens. Towards bidirectional human-AI alignment: A systematic review for clarifications, framework, and future directions, 2024. URL https://arxiv.org/abs/2406.09264.
X. Shi, I. Padhi, and K. Knight. Does string-based neural MT learn source syntax? In J. Su, K. Duh, and X.
Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1159. URL https://aclanthology.org/D16-1159/.
T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, and S. Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts, 2020. URL https://arxiv.org/abs/2010.15980.
Y. Song, H. Zhang, C. Eisenach, S. Kakade, D. Foster, and U. Ghai. Mind the gap: Examining the self-improvement capabilities of large language models, 2025. URL https://arxiv.org/abs/2412.02674.
T. Sorensen, J. Moore, J. Fisher, M. Gordon, N. Mireshghallah, C. M. Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, T. Althoff, and Y. Choi. A roadmap to pluralistic alignment, 2024. URL https://arxiv.org/abs/2402.05070.
A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, Sept. 1974. doi: 10.1126/science.185.4157.1124. URL https://www.ncbi.nlm.nih.gov/pubmed/17835457.
J. A. G. M. Van Dijk. The Deepening Divide: Inequality in the Information Society. Sage Publications, 2005.
L. S. Vygotsky. Mind in society: The development of higher psychological processes. Harvard University Press, 1978.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903, 2022. URL https://arxiv.org/abs/2201.11903.
D. Wood, J. S. Bruner, and G. Ross. The role of tutoring in problem solving. Journal of child psychology and psychiatry, 17(2):89–100, 1976.
Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba. Large language models are human-level prompt engineers, 2023. URL https://arxiv.org/abs/2211.01910.