
Paper deep dive

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Kola Ayonrinde, Louis Jaburi

Year: 2025 · Venue: AIES 2025 (AAAI/ACM Conference on AI, Ethics, and Society) · Area: Mechanistic Interp. · Type: Theoretical · Embeddings: 117

Abstract

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

Tags

ai-safety (imported, 100%) · interpretability (suggested, 80%) · mechanistic-interp (suggested, 92%) · theoretical (suggested, 88%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 6:25:26 PM

Summary

The paper introduces a mathematical philosophy for Mechanistic Interpretability (MI), proposing the 'Explanatory View Hypothesis' which posits that neural networks contain implicit, extractable explanations. It defines MI as the practice of producing model-level, ontic, causal-mechanistic, and falsifiable explanations, and introduces the 'Principle of Explanatory Optimism' as a necessary precondition for MI success. The authors distinguish between behavioural and explanatory faithfulness, arguing that understanding neural networks requires mapping internal computations and representations rather than just input-output correlations.

Entities (6)

Kola Ayonrinde · author · 100% | Louis Jaburi · author · 100% | Mechanistic Interpretability · research-field · 100% | Explanatory Faithfulness · metric/concept · 95% | Explanatory View Hypothesis · hypothesis · 95% | Principle of Explanatory Optimism · principle · 95%

Relation Signals (3)

Kola Ayonrinde authored A Mathematical Philosophy of Explanations in Mechanistic Interpretability

confidence 100% · A Mathematical Philosophy of Explanations in Mechanistic Interpretability... Kola Ayonrinde

Principle of Explanatory Optimism is a precondition for Mechanistic Interpretability

confidence 95% · a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability

Mechanistic Interpretability adopts Explanatory View Hypothesis

confidence 90% · We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach

Cypher Suggestions (2)

Map the relationship between principles and research fields · confidence 90% · unvalidated

MATCH (p:Principle)-[r]->(f:ResearchField) RETURN p.name, type(r), f.name

Find all concepts related to Mechanistic Interpretability · confidence 80% · unvalidated

MATCH (n:Concept)-[:RELATED_TO*1..2]->(m:ResearchField {name: 'Mechanistic Interpretability'}) RETURN n, m

Full Text

116,525 characters extracted from source content.


arXiv:2505.00808v1 [cs.LG] 1 May 2025

A Mathematical Philosophy of Explanations in Mechanistic Interpretability
The Strange Science: Part I.i

Kola Ayonrinde† (UK AI Security Institute), Louis Jaburi†

Abstract

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

1 Introduction

ML artifacts are strange objects. ML researchers have produced models with a wide range of cognitive capabilities that no human knows how to program a machine to do, from playing Go and poker at a superhuman level (Schrittwieser et al., 2020; Brown & Sandholm, 2019) to folding proteins (Jumper et al., 2021), and solving advanced mathematical problems (Glazer et al., 2024). [1] However, we did not design these systems. No human wrote the blueprint for how AI systems ought to perform a given task. Instead, neural networks organically learn to solve problems via gradient descent, given large quantities of data. Neural networks aren't built; they're grown.

Because we don't design neural networks, ML researchers typically do not know how their models perform a given task.
Additionally, neural networks often solve problems in unintuitive ways, relying on concepts that are not obvious to humans (Widdicombe et al., 2018; Hosseini et al., 2018; Goodfellow et al., 2014; Ilyas et al., 2019). This situation of relative ignorance about the processes that give rise to a neural network's capabilities leaves us with a scientific problem analogous to the natural sciences. A physicist might observe some natural dynamical system, like the weather, and seek an explanation allowing them to understand, predict, and possibly even steer the system. Similarly, neural network interpretability (henceforth just interpretability) is the process of understanding artificial neural networks using the scientific method. In this way, we characterise Interpretability as The Strange Science:

Interpretability is the science of understanding artificial neural phenomena, just as the natural sciences seek to understand natural phenomena.

Interpretability researchers study formal systems using empirical methods — making observations, generating conjectures, and refuting those conjectures — to understand complex neural systems. Since interpretability is analogous to the natural sciences, a Philosophy of Interpretability should be inspired by our best understanding of the philosophy of science.

† Correspondence to: koayon@gmail.com, louis.yodj@gmail.com
[1] to deceiving humans in games (Golechha & Garriga-Alonso, 2025; Bakhtin et al., 2022)

Recent works like Bereska & Gavves (2024), Sharkey et al. (2025), and Geiger et al. (2023) have explored the methods and assumptions of Mechanistic Interpretability (MI). Other works have explored the philosophically relevant components of MI (Millière & Buckner, 2025; Harding, 2023; Kästner & Crook, 2024). In this work, we foreground the philosophical role of explanation in Mechanistic Interpretability specifically, and how this differs from previous interpretability paradigms.
In particular, inspired by the Information-Theoretic perspective, we can understand explanations in terms of their compressive power and ability to communicate understanding which generalises.

Understanding neural networks can provide affordances for interventions which are important for AI Safety, AI Ethics, and AI Cognitive Science (Bengio et al., 2025; Anwar et al., 2024; Chalmers, 2025; Olah et al., 2020; Amodei, 2025). Such understanding also allows us to improve the performance of neural networks and debug their failings (Lindsay & Bau, 2023; Sharkey et al., 2025; Amodei, 2025).

Contributions. Our contributions are as follows:

• Firstly, we show that producing compressive explanations frames a potential solution to the Interpretability Problem. We hence define explanatory faithfulness as the goal of Mechanistic Interpretability.
• Secondly, we provide a technical definition of MI and leverage this definition to highlight both the possibilities and limitations of MI.
• Thirdly, we formulate the Principle of Explanatory Optimism, a conjecture at the heart of MI which states that the algorithmic structure of generalising neural networks is human-understandable. We show that without the Principle of Explanatory Optimism the project of MI is intractable.

Series Structure. This paper is the first in a series titled The Strange Science of Mechanistic Interpretability, concerning the Philosophy of Mechanistic Interpretability. See later papers in this series for an evaluation of methods in MI through the lens of Explanatory Virtues (Ayonrinde & Jaburi, 2025) (Part I.i). See also Ayonrinde (2025) (Part I.i), which proposes methods to empower humans by teaching humans Machine Concepts.
2 The Strange Science

Is Interpretability a natural science or a formal science? Natural sciences like physics, biology, and earth sciences seek to explain physical phenomena, creating hypotheses and running experiments. Formal sciences like algebra, decision theory, linguistics, and music theory are concerned with abstract systems and structures, using deductive methods to construct proofs from axioms.

The defining strange property of neural network interpretability is that it is a natural science whose objects of study are both artificial (constructed by humans rather than naturally occurring) and formal (inherently abstract rather than physical systems). Neural networks are mathematical objects; they are deterministic functions from the input embedding space to the output embedding space. Yet the strange science of interpretability is not a formal science. Interpretability researchers are not primarily interested in proving theorems about neural networks, but in understanding the empirical properties of neural networks. They seek to understand "why" questions about generalisation: why does the network generalise in this specific way? or what are the reusable representations that a neural network is using? and so on.

We a priori know the model's weights, architecture, and formal specification, and we can compute the output behaviour corresponding to any given input. Where other sciences might be limited by the precision of their measuring tools or the fidelity of observations, in Interpretability our observations are exactly precise and our experiments perfectly reproducible. Furthermore, we can intervene on the network at any point and any time to arbitrarily high precision.

[2] Lipton (2018); Gilpin et al. (2018); Fleisher (2022); Doshi-Velez & Kim (2017); Leavitt & Morcos (2020); Erasmus et al. (2021) provide an overview of classical (pre-Mechanistic) Interpretability.
However, despite this formal knowledge and potential for intervention, understanding neural networks remains elusive. Here we find a peculiar reversal of the sciences: in natural science, we pursue mathematical formalism to describe empirically observed phenomena; yet in interpretability, we pursue empirical methods to understand formalism. To make progress in understanding ML models, we approach them as naturalists studying a complex system.

We can see neural networks as an exemplar of the pitfalls of reductionism: we know all the material and formal causes of the network, and we understand each individual part; however, we do not understand the system. We have complete access to all formal properties of the system, yet our scientific knowledge of the system is incomplete. The strange paradox of interpretability is thus: we have a complete understanding of neural networks at the base, formal, implementation level up to arbitrary precision. [3] However, it's not immediately obvious how this low-level knowledge translates into high-level understanding and the ability to predict and control the complex system's behaviour. We might say that the properties of a neural network supervene on low-level mathematical facts about the system. That is, the neural network is entirely defined by its low-level facts, and yet there are emergent mental phenomena that are not apparent purely by analysing the low-level facts.

To understand a neural network, we would like to understand the relevant variables in the model's computation. These variables, which we might call features, are generally not neurons or network parameters; they are instead unseen entities that we must posit and discover, like subatomic particles in physics. [4] Though we have perfect formal knowledge of the system, we are nonetheless explaining what we see (the network's behaviour) in terms of what we cannot immediately see (the features).
In interpretability, as in other natural sciences, understanding of the seen is revealed through the unseen (Deutsch, 2011; Marion, 2008; Girard et al., 1987; Aquinas, 1273).

Paper Structure. The rest of this paper is organised as follows:

• In Section 3, we provide an exposition of the Explanatory View of neural networks: a model's internal structures admit explanations of model behaviour. We argue that the Explanatory View provides additional justification for why MI researchers can productively seek Causal-Mechanistic explanations of ML models.
• Section 4 provides a technical definition of Mechanistic Interpretability as seeking model explanations that are Model-level, Ontic, Causal-Mechanistic, and Falsifiable.
• From this definition, we analyse the inherent limits of Mechanistic Interpretability in Section 5 and discuss implications of the Explanatory View in Section 6.
• We conclude in Section 7 by articulating the implicit conjecture at the heart of interpretability, which we call The Principle of Explanatory Optimism (EO). EO states that the generalising algorithms learned by neural networks are human-understandable.

3 Explanations and Interpretability

Much of the human experience, both social and personal, is made up of explanations. We explain why vegetables are good to our children, why product A is likely to sell more units than product B to our boss, why to pursue a given research topic to ourselves, and so on. But what do we mean by the term 'explanation' in science?

[3] This would be comparable to a physicist having perfect knowledge of the fundamental particles and forces of the universe, being able to measure and manipulate each atom at will and see subatomic particles with the naked eye. Or a biologist knowing the precise reactions occurring in every cell in the body and knowing the structure and shape of every relevant molecule (though see also Jonas & Paul Kording (2016)).
[4] The Sparse Autoencoder paradigm has a specific linearly accessible interpretation of these features; in general we make no such commitment to any particular instantiation of how features are represented in the network.

3.1 Scientific Explanation

The epistemic aim of science is to understand phenomena by way of explaining these phenomena (Regt, 2017). A scientific explanation, then, is an answer to a "why" question (Lipton, 2001). The fundamental question that explanations answer is "why did the phenomenon occur?" (Hempel & Oppenheim, 1948). We can view explanations as a solution to a problem: there's a gap between our current best theory and the phenomena that we would like to explain. Good explanations close this gap. With a good explanation, we can say that the phenomenon was indeed expected — and crucially — here's why. Explanations are vehicles for understanding; someone understands a phenomenon when they grasp an accurate explanation of the phenomenon (Strevens, 2013; Khalifa, 2013).

Understanding and Compression. Wilkenfeld (2019) describes the close relationship between understanding and compression. [5] Given a series of observations which characterise a phenomenon, we understand the phenomenon if we have an explanation that compresses the data into a more concise form such that we could reproduce the data from the explanation or use the explanation to predict future data. Good explanations exploit regularities in the data for compression. A series of observations is incomprehensible if it cannot be compressed, that is, if it contains no regularities to exploit and is purely random (Li et al., 2008).

Explanations are not merely compressions, however. It is not obvious that a compressed zip file engenders more understanding than the original data file in general. Explanations are compressions of a particular kind: those compressions that facilitate the understanding of phenomena (Ayonrinde et al., 2024).
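The compression-as-understanding intuition can be made concrete with a toy check (our illustration, not from the paper): a sequence with regularities compresses well under a general-purpose compressor, while a pseudo-random sequence of the same length does not.

```python
import random
import zlib

# A structured sequence: a repeating pattern, analogous to observations
# of a phenomenon whose regularities an explanation could exploit.
structured = bytes(i % 16 for i in range(10_000))

# An unstructured sequence: pseudo-random bytes with no regularities.
random.seed(0)
unstructured = bytes(random.randrange(256) for _ in range(10_000))

ratio_structured = len(zlib.compress(structured)) / len(structured)
ratio_random = len(zlib.compress(unstructured)) / len(unstructured)

print(f"structured:   {ratio_structured:.3f}")  # far below 1.0
print(f"unstructured: {ratio_random:.3f}")      # near (or slightly above) 1.0

assert ratio_structured < ratio_random
```

In the paper's terms, the structured stream is comprehensible (its regularities admit a short description) while the random stream is not; the compressor here stands in for an explanation only in the narrow, information-theoretic sense the text warns is not sufficient on its own.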
3.2 From Induction To Explanation

Andrews (2023) details a classical view of machine learning as a process of induction: "ML models use evidence, or training data, to form predictions or classifications, which generalise what they have learned from their training set to unseen instances (i.e., novel data). The field of ML strives to automate inductive inference."

When we view ML models in this behaviourist fashion as black boxes with only inputs and outputs, we may think of them as providing predictions without explanations. Such explanationless predictions are of the same variety as a prophecy from an oracle. Suppose that we are to think of an ML model qua oracle as providing reasons to believe some proposition p. Since oracles do not provide explanations in terms of the prediction's content (subject matter), we may only believe such an explanationless oracle-style claim if we have extrinsic reasons to believe the model's prediction (for example, that the model has been generally correct before or that we have some knowledge of the model's training or similar). For MI researchers, however, believing a proposition p from an ML model should be based on the content of model explanations rather than merely on extrinsic reasons like the model's track record (see Appendix G).

Mechanistic Interpretability researchers do not view neural networks as incomprehensible inductive black boxes. In this paper, we offer a philosophical substantiation of Mechanistic Interpretability as practised by scientists. We will argue that MI researchers take an alternative Deutschian (Deutsch, 2011) Explanatory View of neural networks, which we may contrast with the classical inductive perspective expressed by Andrews (2023, inter alia). Where the classical view understands neural networks as black boxes that inductively infer from data, the Explanatory View would have us consider knowledge of the internal mechanisms of the model as necessary for understanding the model's behaviour.

[5] See also Chaitin (2002); Li et al. (2008); MacKay (2003); Solomonoff (1964); Hutter et al. (2024).
Here the internal mechanisms themselves can be seen as implicit explanations of the model's behaviour: the reason that a model makes a prediction is contained within its internal mechanisms. This is a white-box (cognitivist) view of neural networks. In this way, we can view models as proto-explainers rather than merely predictors.

3.2.1 Generalisation: What We're Explaining When We're Explaining Neural Networks

When we are interested in explaining neural networks, what we would like to explain is how, and in which ways, they generalise. By generalise, we mean that the model can leverage regularities and structure in the training data to solve some task with respect to unseen data. As models learn to generalise, internal structure forms within the model. [6]

A system contains structure to the extent that the system's generating process can be expressed more concisely (i.e., in fewer bits) than the observations of the system. That is, structure is compressibility. Systems that follow general principles contain structure and regularities such that they are (at least in principle) more predictable than structureless systems like random noise generators.

For example, consider an idealised pendulum's position over time. The pendulum's position follows a predictable pattern which can be expressed as a mathematical function, and hence we only need to store the equation and the initial conditions to reproduce the observations of the pendulum's position over time. In this sense, the natural world contains structural patterns — there are natural laws which allow us to compress and understand the world. And the training data for ML models, which is sampled from the world, inherits such structure from the natural world. ML models generalise to the extent that they can learn or approximate the structure in the world through the training data. [7]
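The pendulum example can be sketched as a minimal program (our illustration; we use the small-angle solution angle(t) = theta0 * cos(omega * t), and the constants are invented): a long series of observations is fully recoverable from a two-number description, which is the sense in which the system "contains structure".

```python
import math

# Idealised pendulum under the small-angle approximation:
# angle(t) = theta0 * cos(omega * t).
theta0, omega = 0.1, 2.0

def angle(t: float) -> float:
    return theta0 * math.cos(omega * t)

# 1,000 "observations" of the pendulum's position over time...
observations = [angle(0.01 * k) for k in range(1000)]

# ...are reproducible from a compact "explanation": just two numbers
# (the equation's parameters) plus the law itself.
description = (theta0, omega)

def reconstruct(desc, t: float) -> float:
    a, w = desc
    return a * math.cos(w * t)

reproduced = [reconstruct(description, 0.01 * k) for k in range(1000)]
assert observations == reproduced
```

A structureless system (say, a random noise generator) admits no such compact description: reproducing its observations requires storing them all.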
A good explanation should expose the model's internal learned structures. We define ur-explanations [8] as the idealised explanations of model behaviour on an input distribution, given in terms of its learned internal structures. [9] The ur-explanations of a neural network can be seen as internal computations over learned representations that compute the output from the input. The network learns these representations during training in a process of automated Conceptual Engineering (see Appendix H). These network computations, outputs, and intermediate activations together constitute not only a prediction of some answer, but also an explanation of the process by which the model came to such a result.

3.2.2 Explanatory Faithfulness

The Explanatory View takes seriously the idea that there is structure in the model to be interpreted (see also Appendix D). Under this view, there is a target to the interpretability program: we are not merely looking for explanations that appear to correlate with model behaviour, we are looking to extract the internal explanations from neural networks. Explanations can be more than confabulatory just-so stories that provide the illusion of understanding the model's behaviour.

Under the Explanatory View, we can now define explanatory faithfulness. An explanation is explanatorily faithful to the model to the extent that it matches the model's ur-explanation. [10] Interpretability researchers would like to say that their explanations are faithful to the model and (approximately) describe the same algorithmic mechanisms that the model uses.

Note the difference here between our notion of explanatory faithfulness and (behavioural) faithfulness (e.g., from Wang et al. (2023)): behavioural faithfulness says that the explanation and model produce the same outputs; explanatory faithfulness says that the step-by-step explanation matches the model's internal mechanisms, not just the input-output behaviour. Note that defining explanatory faithfulness is not possible under the classical view of Machine Learning without ur-explanations. Under the classical view, statements about circuit equivalence can only be understood as statements about behavioural statistics (Shi et al., 2024).

Explanatory Faithfulness
An explanation E is explanatorily faithful to a model M over some data distribution D to the extent that the intermediate activations s_i at each layer i that are given by the algorithmic explanation E closely match the intermediate activations x_i of the model M for input data in D.

3.3 Neural Networks Perform Computations over Representations

We described the ur-explanation of a model (the idealised explanation of model behaviour) in terms of Computations over Representations. We now provide more details on what is meant by each of these terms.

[6] Here we are interested in explanations of neural networks as objects of scientific and philosophical interest. Much previous work has been interested in explanations of the results of neural networks as it pertains to some task (e.g., medical diagnosis). Here the application domain is of secondary interest and we are examining explanations of neural networks themselves.
[7] Models trained on randomised data may memorise but never learn to generalise (Zhang et al., 2017; Lehalleur et al., 2025; Lin et al., 2017; Deletang et al., 2024).
[8] Here we use the ur- prefix to indicate primacy or origin, as in the German language.
[9] Considering the model internals as ur-explanations does not necessarily mean that all such ur-explanations will be interesting explanations of generalisation per se. Models may have memorised certain answers or resort to bags of faulty heuristics in some cases. However, in at least some cases, we have evidence of models learning genuine algorithms which represent explanatory knowledge within the network (Wang et al., 2024; Nanda et al., 2023; Wu et al., 2024).
[10] We argue for the uniqueness of the ur-explanation in Appendix B.
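The Explanatory Faithfulness definition suggests, though the paper does not prescribe, a concrete score. One hypothetical operationalisation (our own sketch: the function name and the choice of cosine similarity are assumptions, not the authors' formalism) compares the explanation's predicted per-layer activations s_i against the model's actual activations x_i:

```python
import math

def faithfulness(explanation_acts, model_acts):
    """Toy explanatory-faithfulness score: mean cosine similarity between
    the per-layer activations predicted by an algorithmic explanation (s_i)
    and the model's actual activations (x_i).

    Both arguments are lists of per-layer activation vectors.
    Returns a value in [-1, 1]; values near 1 mean the explanation tracks
    the model's intermediate states layer by layer, not just its outputs."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return sum(cos(s, x) for s, x in zip(explanation_acts, model_acts)) / len(model_acts)

model = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]   # the model's activations x_i
good  = [[1.0, 0.1, 2.0], [0.5, 0.4, 0.0]]   # tracks the model layer by layer
bad   = [[0.0, 3.0, -1.0], [-1.0, 0.0, 1.0]]  # could match outputs yet diverge inside

print(faithfulness(good, model) > faithfulness(bad, model))  # True
```

This also illustrates the contrast with behavioural faithfulness: an explanation like `bad` might still reproduce the model's outputs while scoring poorly here, because its intermediate stages are not locatable in the model.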
Computation. Marr (1982) describes three levels of analysis for understanding a machine carrying out an information-processing task (McClamrock, 1990; Angelou, 2025):

1. Computational Level: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
2. Algorithmic/Representational Level: What is the algorithm being used to perform the computation? How can this computational theory be implemented? In particular, what is the representation for the input and output, and what is the algorithm for the transformation?
3. Implementation Level: What is the physical implementation of the algorithm? How can the representation and algorithm be realized physically on some computational substrate?

For neural networks, the Implementation Level of Analysis corresponds to matrix multiplications with the model weights and how these are implemented on the substrate of hardware computational accelerators (Angelou, 2025). We know essentially all there is to know formally about the Implementation level. When we speak of a neural network carrying out computations, we are referring to the Algorithmic and Computational Levels of Analysis. We would like to have useful compressive causal explanations at the Algorithmic and Computational levels which are detailed enough to be "runnable" (Cao & Yamins, 2024). [11] Explanatory Faithfulness is an Algorithmic-level property: the stages of the explanation should match the model layers and be "locatable". It is not sufficient for Explanatory Faithfulness for the outputs to agree if the algorithms producing the outputs differ.

Representation. A pattern of neural activations is a representation when it represents something, that is, it has some appropriate correspondence with features of the input data (and hence the external world). Representations are representations of a feature. [12] We paraphrase Harding (2023)'s three criteria for activations to qualify as representations below.

Consider a pattern of activations h(x) for x ∈ X, where X is the domain of a model (e.g., natural language). Then h(x) represents a property Z if the following three criteria hold:

• Information: The activations h(x) correlate with the property Z. More formally, the random variable h(x) has sufficiently high Shannon Mutual Information with the property Z, I(h(x); Z), such that we could train a successful probe g_z : h(x) → P(Z). Intuitively, Information says representations are Causal Results of features contained within the input x.
• Use: The model uses the information in activations h(x) about Z to perform its task. That is to say that if we were to remove the relevant information from the activations through a causal intervention, the model's performance on the relevant downstream tasks would decrease. Intuitively, Use says representations are Causes of the model's behaviour.
• Misrepresentation: It should be possible for the activation vector h(x) to misrepresent Z. Suppose that we have activations h(x) which do not contain useful information about Z and h(s) which does contain information about the property Z. Then we say that Z is misrepresentable if we can perform an intervention which patches the information h(s) into h(x) and predictably increase the likelihood of our model mistaking our input x for having property Z. Intuitively, Misrepresentation says representations can be causally intervened on. [13]

[11] Appendix F details an Implementation-level explanation of a neural network. By "runnable" here we mean that the explanation that we provide should be pseudocode that we could imagine formalising such that it would compile and run on a computer.
[12] In other words, representations have intentionality. Note that we use the word intentionality here in the philosophical sense of "aboutness". This usage of the term is not to be confused with the psychological sense of "intention" (as in "I intend to get the next train home") or any claims about some conscious relationship to representations.
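The three criteria above can be walked through on a deliberately trivial hand-built "model" (entirely our own toy construction, in which the property Z is carried by a single activation coordinate; real probing, ablation, and activation patching operate on learned networks):

```python
# Toy illustration of Harding's three criteria (Information, Use,
# Misrepresentation). The "model" and its activations are invented.

def h(x: str):
    """Activation pattern: coordinate 0 carries the property Z of input x."""
    return [1.0 if "Z" in x else 0.0, len(x) * 0.1, 0.5]

def model_output(acts) -> str:
    """Downstream behaviour reads the Z-coordinate of the activations."""
    return "has-Z" if acts[0] > 0.5 else "no-Z"

# 1. Information: a (trivial) probe recovers Z from the activations alone.
def probe(acts) -> bool:
    return acts[0] > 0.5

assert probe(h("aZb")) and not probe(h("ab"))

# 2. Use: ablating the Z-coordinate degrades task behaviour, so the model
# genuinely uses this information (representations are causes of behaviour).
ablated = h("aZb")
ablated[0] = 0.0
assert model_output(h("aZb")) == "has-Z"
assert model_output(ablated) == "no-Z"

# 3. Misrepresentation: patching Z-information from h(s) into the
# activations for a non-Z input makes the model treat "ab" as having Z.
patched = h("ab")
patched[0] = h("aZb")[0]
assert model_output(patched) == "has-Z"  # the model now misrepresents "ab"
```

In real interpretability work the probe is trained (criterion 1), the ablation is a causal intervention on a learned network (criterion 2), and the patch transplants activations between forward passes (criterion 3); the toy only shows how the three criteria fit together.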
To be able to represent, you must be able to misrepresent. A pattern of neural activations that satisfies the Information, Use, and Misrepresentation criteria can be called a representation.

3.4 The Goal of Interpretability

ML methods are often applied to problems in other fields, like predicting weather patterns, classifying legal cases, and allocating scarce resources. Interpretability, then, can be viewed as applying ML methods and analysis to an epistemic problem: "how does a neural network perform computations over representations to produce useful answers to queries?" Interpretability researchers don't want to only know what a neural network predicts. We would also like to understand the structures, features, regularities, and knowledge which cause the neural network to make such and such a prediction. We would like to extract explanatory knowledge from neural networks, uncovering the ur-explanations that are always-already present within a trained, generalising model.

Most ML researchers are in the prediction business. Interpretability researchers, however, are in the explanation business. The Explanatory View treats neural networks as containing explanations rather than as being purely behaviourist oracles, moving from black-box induction to white-box computations. The Explanatory View is the first step in understanding ML models not in terms of prediction but in terms of explanation.

4 Demarcating Mechanistic Interpretability

There has been much discussion about what makes some interpretability research 'Mechanistic' rather than another form of interpretability (Saphra & Wiegreffe, 2024; Chalmers, 2025). Gieryn (1999) describes the problem of demarcating where one science starts and another begins — 'boundary-work' — analogously to the Demarcation Problem between Science and Pseudo-Science (Laudan, 1983; Popper, 1935). The definition of a given science can be seen as a grab-bag of associations like Wittgensteinian language games (Wittgenstein, 1953).
Another way to define a science is as (social) culture (Latour, 1987). Under this view, "Science is what scientists do" (Bridgman, 1980).

To formalise Mechanistic Interpretability, we instead provide a technical definition that focuses on the goal of Mechanistic Interpretability compared to other adjacent disciplines (Olah et al., 2020; Saphra & Wiegreffe, 2024). We define Mechanistic Interpretability as the study of Model-level, Ontic, Causal-Mechanistic, and Falsifiable Explanations of Neural Networks. This definition clearly delineates Mechanistic Interpretability from other paradigms like Concept-Based Interpretability. [14] We can understand these properties of explanations by way of contrast with other forms of explanation. [15]

Model-level Explanations. An autoregressive language model is a neural network that returns a probability distribution over possible next tokens when conditioned on some input tokens (i.e., a textual prompt). A language model system (LM system) is a software object that contains a language model as part of the control flow. The LM system leverages the language model to produce some useful output, like a text completion or an image, rather than a single next-token probability distribution. An LM system may be as simple as augmenting a language model with a sampling method or greedy decoding. More complex LM systems may use meta-decoding strategies, tool use, automated prompting, multiple language models, and more (Arditi, 2024; Zaharia et al., 2024; Khattab et al., 2023; Guo et al., 2024; Dafoe et al., 2020; Welleck et al., 2024).

Capability evaluations are typically system-level evaluations. Model performance depends substantially on prompting strategies like Chain of Thought reasoning (Wei et al., 2022) and tool use (Schick et al., 2023).

[13] in the sense of Causal Abstractions Theory (Geiger et al., 2023; Pearl, 2009; Beckers & Halpern, 2019)
Systems-level explanations might seek to explain whole-system performance by, for example, reading a model’s Chain of Thought (Perez et al., 2023). Conversely, Model-level explanations seek to understand the neural network part of the system in isolation, to explain why the output distribution is as it is.

Ontic Explanations. Ontic explanations consist of real, physical entities (Salmon, 1984). We may contrast Ontic explanations with epistemic explanations, which focus on making phenomena understandable or predictable to the interpreter, potentially using idealizations, models, or abstractions that may not directly correspond to reality. Non-ontic, “epistemic” explanations may give useful rules of thumb for prediction or intuition but may not be well supported by reality.16 Varma et al. (2023)’s work on circuit efficiency can be seen as giving non-ontic explanations, through the hypothesised efficiency metric.

Causal-Mechanistic Explanations. Causal-Mechanistic Explanations (Woodward, 2003; Salmon, 1989; Lewis, 1986) identify the causal processes that produce phenomena rather than just describing statistical correlations or general laws. Here, we are interested in the relevant components of a system, how they are organised, and how they interact to produce phenomena. Causal-Mechanistic theorists refer to these explanations as “explaining why by explaining how” (Bechtel & Abrahamsen, 2005): explaining why a phenomenon occurred involves identifying the underlying mechanisms that give rise to observed phenomena. Causal-Mechanistic Explanations go step by step to explain the end-to-end process: they provide a continuous causal chain from cause to effect, without any unexplained gaps (Salmon, 1984; Lipton, 2003).17 We can contrast Causal-Mechanistic Explanations with:

• Statistically Relevant Explanations (Salmon et al., 1971; Salmon, 1989).
X explains Y if and only if P(Y|X) ≠ P(Y) — that is, if and only if the conditional probability of Y given X differs from the unconditional probability of Y. For example, we might explain ice cream sales by their high correlation with temperature.18

• Telic Explanations (Sosa, 2021). We explain a phenomenon by reference to its purpose, aims, or function rather than in terms of a causal chain of events. For example, we might explain the heart as being for the purpose of transmitting and pumping blood.

• Nomological Explanations (Myers, 2012; Scheibe, 2002). We explain a phenomenon by reference to general laws or principles rather than in terms of a causal chain of events. For example, linguistic theory might appeal to universal grammar “laws” to explain the structure of human languages.

Technical Definition of Mechanistic Interpretability

Interpretability explanations are valid as Mechanistic Interpretability explanations if they are Model-level, Ontic, Causal-Mechanistic and Falsifiable.

Causal-Mechanistic and Ontic explanations are necessary to the empirical practice of Mechanistic Interpretability amongst active researchers (Bereska & Gavves, 2024; Sharkey et al., 2025).

14 This delineation is useful for researchers but we do not intend to imply that non-Mechanistic Interpretability is not useful; see Appendix C.
15 We provide intuitive examples of explanations with these properties in Appendix A. We further compare Mechanistic Interpretability with previous interpretability paradigms in Appendix D.
16 Note that scientific non-realists may only produce epistemic explanations, as they may not believe that the entities referred to in scientific theories actually exist in reality. See also Appendix B.
17 Several works within Machine Learning that provide a good introduction to causal modelling include Jin & Garrido (2024); Schölkopf et al. (2021); Liu et al. (2024); Pearl (2009).
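The contrast between statistical relevance and causal-mechanistic explanation can be made concrete in a toy simulation. The following sketch is ours, not the paper's: it borrows the ice-cream example (and the shark-attack aside from footnote 18), with an illustrative structural model in which temperature is a common cause of both ice-cream sales and shark attacks. Shark attacks are statistically relevant to sales, P(Y|X) ≠ P(Y), yet an intervention on shark attacks leaves sales unchanged, so no causal-mechanistic explanation runs through them.

```python
# Toy sketch (illustrative assumptions, not from the paper): statistical
# relevance vs. a causal-mechanistic account of ice-cream sales.
import random

random.seed(0)

def simulate(intervene_sharks=None, n=100_000):
    """Sample from a toy structural model: temperature causes both
    ice-cream sales and shark attacks; sharks never cause sales."""
    samples = []
    for _ in range(n):
        temp = random.random()  # common cause, uniform on [0, 1)
        sharks = (temp > 0.8) if intervene_sharks is None else intervene_sharks
        sales_high = temp > 0.7  # depends only on temperature
        samples.append((sharks, sales_high))
    return samples

obs = simulate()
p_sales = sum(s for _, s in obs) / len(obs)
p_sales_given_sharks = (
    sum(s for a, s in obs if a) / max(1, sum(1 for a, _ in obs if a))
)

# Statistical relevance: P(sales | sharks) differs sharply from P(sales),
# so shark attacks "explain" sales in the statistical-relevance sense.
assert abs(p_sales_given_sharks - p_sales) > 0.5

# Causal-mechanistic test: intervening to force shark attacks,
# do(sharks := True), leaves the sales distribution unchanged,
# exposing the correlation as non-causal.
do = simulate(intervene_sharks=True)
p_sales_do = sum(s for _, s in do) / len(do)
assert abs(p_sales_do - p_sales) < 0.02
```

The interventional check is exactly the kind of falsifiable test that separates a causal-mechanistic claim from a merely statistically relevant one.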
Though it is feasible to imagine a causal-mechanistic approach to understanding system-level behaviours, it is a historically contingent fact that the field has coalesced around methods for model-level explanations (Saphra & Wiegreffe, 2024).19

5 The Limits of Mechanistic Interpretability

We have analysed the type of explanations that Mechanistic Interpretability researchers seek, namely those which are Model-level, Ontic, Causal-Mechanistic and Falsifiable Explanations of a model’s internal mechanisms. We now turn our attention to the limits and challenges of such explanations.

5.1 Value-Ladenness & Theory-Ladenness of Explanations

We would like explanations that are accurate and human-understandable compressed representations of observations. With this goal in mind, the best explanation of a phenomenon is interpreter-relative in the following two senses. Firstly, the ideal explanation is relative to the interpreter’s initial set of concepts, their priors, and what types of explanation are easy for them to understand. In this sense, explanations are Theory-Laden. Secondly, the ideal explanation depends on what the interpreter would like to do with such an explanation. The interpreter may be satisfied with a different level of granularity of explanation depending on whether they are seeking an explanation of a model’s behaviour to be able to make crude interventions, or to make guarantees about model performance, or for scientific curiosity. In other words, the ideal explanation is also relative to the interpreter’s values; explanations are Value-Laden.

5.1.1 Value-Ladenness of Explanations

Weber (1949) argued for the Value-Free Ideal in the sciences, the principle that scientists should be value-neutral. We can articulate the Value-Free Ideal as “Scientists should strive to minimize the influence of contextual values on scientific reasoning, e.g., in gathering evidence and assessing/accepting scientific theories” (Reiss & Sprenger, 2020).
18 or, unhelpfully, our explanation could show correlation with the number of shark attacks
19 Since system-level explanations are of practical and academic interest, there is currently an opportunity for a new field to emerge focusing on system-level explanations, possibly building on the work of the mechanistic interpretability community. Perhaps the nascent field of LLM-ology might be a candidate to fill this gap (Trott, 2023). Note that many Benchmarks and Evaluations researchers could be seen as working in this space already. Chain of Thought interpretability (Perez et al., 2023) is another example of system-level explanation.

For the normative statement of the Value-Free Ideal to be considered a reasonable ideal, it must be attainable (at least to some degree). That is, ought implies can. So we may first analyse whether value-freeness is possible, which can be expressed in the Value-Neutrality Thesis as follows: “Scientists can—at least in principle—gather evidence and assess/accept theories without making contextual value judgments” (Reiss & Sprenger, 2020). In science generally, and interpretability research particularly, it is difficult to hold Value-Neutrality. Hence the Value-Free Ideal seems to be unattainable and likely undesirable as a goal (Douglas, 2009). The choice of methods, and which results are particularly interesting for researchers, depends closely on what researchers might hope to achieve. Many researchers in Mechanistic Interpretability are interested in applications to AI Safety, AI Ethics, AI Cognitive Science and AI Governance (Bengio et al., 2025; Anwar et al., 2024; Olah et al., 2020), all of which affect researchers’ contextual value judgements. Evidential standards for accepting theories are highly influenced by such application-guided values.
Given the increasing importance of AI systems in society and their potential benefits and harms, it is perhaps more instructive to understand (the lack of) value-freeness in Mechanistic Interpretability as we would in Climate Science or Public Health, rather than in Theoretical Physics.20 Researchers must share reproducible, quantitative results for the community to assess, but it is unavoidable that the choice of study, and what counts as sufficiently convincing evidence for such and such a conclusion, is highly value-laden (Sharkey et al., 2025; Casper et al., 2025). Expressing an interpretability-flavoured notion of Value-Ladenness, Dmitry’s Koan (Vaintrob, 2025) states:

There is no such thing as interpreting a neural network. There is only interpreting a neural network at a given scale of precision.

We can view Dmitry’s Koan as a direct consequence of the Value-Ladenness of explanations in Mechanistic Interpretability: what counts as a good explanation is highly dependent on the level of precision that the interpreter desires. Explanations at maximal precision would capture lots of noise as well as useful explanatory signal. For some use cases (e.g., determining the safety of critical AI systems), higher-fidelity explanations may be required for providing guarantees about model behaviour, and so we would like highly precise explanations. In other cases, the interpreter may seek sufficient understanding with which to monitor or steer the model, which might be possible with a lower-fidelity explanation.21 The ideal precision depends on the (human) interpreter’s goals and hence the ideal explanation is inherently value-laden. Noting that we do not always have an appropriate definition of what ‘precision’ itself means in our interpreter’s context, we can extend Dmitry’s Koan to Dmitry’s Koan ++,22 which states:

There is no such thing as interpreting a neural network.
There is only interpreting a neural network at a given scale of precision and a given metric for defining what precision means.

Dmitry’s Koan ++ highlights that the choice of precision metric itself is value-laden.

5.1.2 Theory-Ladenness of Explanations

We might hope to understand ML systems on their own terms, in their own ontology, removing all human “biases” from the explanation. After all, if we have enough data, we might think, perhaps the data will speak for itself. This (unfortunate) desire is known as the Theory-Free Ideal (Andrews, 2023). Our explanations always contain underlying (human) theory. Indeed, so-called “unsupervised” learning cannot occur without either pre-defined inductive biases or external supervision (Andrews, 2023; Wolpert & Macready, 1997; Goldblum et al., 2023; Locatello et al., 2019). All observations (and interpretations) are theory-laden (Kuhn, 1962; Duhem, 1954; Popper, 1935). Interpreter theory seeps into the explanation at all stages of an interpretability workflow: from problem formulation and model design to model selection and semantic interpretation.

20 Some philosophers have further argued that the Value-Free Ideal is not tenable or desirable in Theoretical Physics either.
21 Sharkey (2024) provides a careful analysis of the trade-offs between explanation precision and complexity. We also note the close analogy to rate-distortion theory in Information Theory.
22 We may also refer to Dmitry’s Koan ++ as Nora’s Koan, after Interpretability Researcher Nora Belrose, who brought this to our attention.

Example: Theory-Ladenness of Sparse Autoencoder Explanations

Sparse Autoencoders (SAEs) are a method for unsupervised interpretability, aiming to extract concepts (or “features”) from the activation space of neural networks. Concept representations are generally entangled and difficult to access in the neural activation space.
We may hope for SAEs to disentangle these representations into a linear combination of monosemantic concepts by mapping the neural activations to a feature basis. Empirically, it has been shown that disentangled concept representations are more amenable to human interpretation (Bricken et al., 2023). We might believe that we are doing completely unsupervised learning, with no human theory, when producing SAE-derived explanations of neural activations. However, note that in using the SAE, we are committing to the theory that features are sparsely activated and linearly represented (Bricken et al., 2023). Similarly, in choosing a particular SAE architecture like TopK (Gao et al., 2024) or JumpReLU SAEs (Rajamanoharan et al., 2024), we are committing to the Monotonic Importance Heuristic (Ayonrinde, 2024), the conjecture that feature activations are not typically both small and important simultaneously. Hindupur et al. (2025) further describe the theoretical assumptions that come with different choices of SAE architecture.

Following Locatello et al. (2019), we note that unsupervised disentanglement learning in the general case is not possible; we must first hold some theoretical commitment to the structure of the data. We choose theoretical commitments because we have reason to believe that they are good inductive priors for the data distribution, or because we believe that the resulting structures will be more easily human-understandable. Interpretability is not and cannot be a purely engineering affair, devoid of theory. We require theory for two reasons. Firstly, as we have seen with SAEs and disentanglement learning, holding the wrong theoretical commitments23 leads to the intractability of unsupervised learning.24 And secondly, the human interpreter’s priors are a key part of the theory, as indeed it is humans that we would like to make interpretations for!
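The theoretical commitments described above can be seen directly in the architecture. Below is a minimal, illustrative TopK SAE sketch in NumPy; the dimensions, random weights, and untrained state are our assumptions for illustration, not the implementation of any of the cited works:

```python
# Minimal TopK sparse autoencoder sketch (illustrative, untrained).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 64, 4  # activation dim, dictionary size, sparsity

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def encode(x):
    """Project activations into the feature basis, then keep only the
    k largest pre-activations -- encoding the commitment that features
    are sparsely active (at most k at a time)."""
    pre = x @ W_enc + b_enc
    idx = np.argsort(pre)[..., -k:]  # indices of the k largest entries
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, -1), -1)
    return np.maximum(z, 0.0)  # keep feature activations non-negative

def decode(z):
    """Reconstruct the activation as a linear combination of at most k
    feature directions -- the linear-representation commitment."""
    return z @ W_dec

x = rng.normal(size=(8, d_model))  # a batch of fake neural activations
z = encode(x)
x_hat = decode(z)
assert (z > 0).sum(axis=-1).max() <= k  # sparsity holds by construction
```

Note that both commitments (sparsity via the TopK rule, linearity via the decoder) are baked into the architecture before any data is seen, which is precisely the sense in which the resulting explanations are theory-laden.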
In this sense, Interpretability is a fundamentally socio-technical problem which may be best addressed by a combination of understanding humans, machines, and the interactions between the two. Hence we see a key role for Human-Computer Interaction (HCI) and the Social Sciences in MI. We suggest that increased focus on the criteria which make explanations accessible to humans (Schut et al., 2023; Ayonrinde et al., 2024), especially to diverse humans (Himmelsbach et al., 2019), is likely to prove fruitful for future interpretability work.25

23 which is very easy if researchers believe that they are holding no theoretical commitments at all!
24 Since interpretability aims to understand neural systems which we do not yet understand, it is inherently an unsupervised learning problem: we don’t know what we don’t know about the system.
25 Theory-ladenness of explanations is also relevant in the context of the Construct Validity problem (Cronbach & Meehl, 1955) — the problem of whether the explanation is measuring what it purports to measure. (As Heisenberg (1958, p. 58) put it: “what we observe is not nature in itself but nature exposed to our method of questioning.”) It is very possible for researchers to agree on the data reported and the statistical validity of hypothesis tests and yet disagree on the interpretation because their underlying theories are different (see also Appendix E for the use of model and domain theory in interpretability).

5.2 Limits of Model-level vs System-level Explanations

As detailed in Section 4, explanations in Mechanistic Interpretability are inherently Model-level explanations. However, when interacting with AI, we are typically interacting with AI systems, not models. Though models may think, only systems behave: it is systems that perform actions in the world. MI explanations, then, have limited explanatory power when the system-model relationship is complex or not well understood.
In systems with meta-decoding processes, such as those with inference-time compute loops, ensembling methods, or similar, we might expect model-level explanations to be insufficient for understanding system-level behaviour. Systems which can well be described as having an Extended Mind, in the sense of Clark & Chalmers (1998), or as embodied/embedded agents (Demski & Garrabrant, 2018; Shapiro & Spaulding, 2021), may also be difficult to understand with model-level explanations. It may be difficult to pick out some feature if the feature as it appears in the model is simply a pointer to some cognitive process distributed elsewhere in the system. A particularly notable case of such systems is multi-agent AI systems, which may have emergent properties not well explained by the analysis of individual agents. Consider, for example, a flock of birds or a well-functioning marketplace, which may not be easily understood by the analysis of individual agents within the system (Hyland et al., 2024).

5.3 Limits of Low Abstraction Explanations

One other possible, though surmountable, limit to Mechanistic Interpretability explanations is that low-level explanations may be difficult, or even impossible, to turn into explanations at higher levels of abstraction.26 For example, in the natural sciences, it is not generally known whether the laws of thermodynamics can be derived from lower-level particle physics laws. If MI provides low-level explanations akin to quantum mechanics, research questions that more closely resemble chemistry, biology, or social science questions may not be obviously derivable in a reductionist way from MI explanations.

6 Discussion

Given a data distribution 𝒟, suppose that we would like to explain the behaviour of a neural network M over a subset of the data distribution D ⊂ 𝒟.
Then the goal of Mechanistic Interpretability is to provide a Model-level, Ontic, Causal-Mechanistic, Falsifiable explanation E of the model’s behaviour on D which is explanatorily faithful to M. Where behavioural faithfulness requires that the end predictions of the models agree, explanatory faithfulness is a stronger condition that requires the causal structure of the explanation E to be faithful to the causal structure of M at each stage of the causal chain.

In this work, we have argued that approaching the Problem of Interpretability through the Explanatory View of Mechanistic Interpretability is likely to be fruitful because neural networks naturally admit explanations of their behaviour through their internal structures. Hence, our explanatory methods can and should look to uncover causal structure in the model M rather than merely producing confabulatory descriptions of model behaviour. The Explanatory View of Neural Networks provides a justification for using explanatory faithfulness as the goal of explanations in Mechanistic Interpretability.

However, there are significant limitations to this approach. In particular, it is infeasible to have a general algorithm for finding explanations E for all models M and all data distributions 𝒟 which are optimal for all purposes. Solutions to the interpretability problem are Theory-Laden: they require some theoretical priors about neural networks and/or the data distribution to find good explanations. Similarly, the problem of finding good explanations is Value-Laden: what makes for a good explanation depends on our goals as interpreters.

26 Note that by levels of abstraction here we do not mean the semantic abstraction level of a given feature, where some features might represent higher-level concepts than others (for example, those at later layers of a model). We instead mean that features themselves are low-level units compared to higher-level abstractions like components or circuits, which they may compose.
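The behavioural/explanatory distinction can be put semi-formally. The notation below — stage maps $m_t$, $e_t$ and an alignment map $\alpha$ — is ours, sketched in the style of the Causal Abstractions Theory cited earlier (Geiger et al., 2023; Beckers & Halpern, 2019), not a definition given in the paper:

```latex
% Behavioural faithfulness: only the end predictions must agree.
E(x) \approx M(x) \qquad \text{for all } x \in D.

% Explanatory faithfulness: write M and E as compositions of stages,
%   M = m_T \circ \dots \circ m_1, \qquad E = e_T \circ \dots \circ e_1,
% with an alignment map \alpha from model states to explanation states.
% Each stage of the causal chain must commute (approximately):
\alpha\big(m_t(s)\big) \approx e_t\big(\alpha(s)\big)
\qquad \text{for all reachable states } s \text{ and stages } t = 1, \dots, T.
```

Composing the stage conditions recovers agreement on end predictions, which is one way to see that explanatory faithfulness is strictly stronger than behavioural faithfulness.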
A core problem to address in future work is how to appropriately characterise what makes a good explanation in the context of Mechanistic Interpretability. Here, we have argued for necessary criteria for an explanation to be validly considered ‘Mechanistic’ (namely, that it is Model-level, Ontic, Causal-Mechanistic and Falsifiable). We would further like to understand how some explanations are better than others in terms of their usefulness and likelihood to point towards truth.

7 Coda: Explanatory Optimism

“Have you persuaded yourself that there are knowledges and truths beyond your grasp, things that you simply cannot learn? ... If you have allowed this to happen, you have arbitrarily imposed limits on your intellectual freedom, and you have smothered the fires from which all other freedoms arise.” — Scott Buchanan, 1958

We conclude by introducing a conjecture at the heart of (Mechanistic) Interpretability, which we call The Principle of Explanatory Optimism. We would like to explicate this conjecture, argue for its importance for MI research, and raise a Call to Action for further research into clarifying both the statement and its veracity. Here we provide no arguments for the truth of Explanatory Optimism (EO) as a conjecture; we leave further arguments, proofs, or refutations to future work.

7.1 Alien Concepts

Suppose that an AI M (Machine, in blue) and the interpreter H (Human, in red) each have some set of concepts which are understandable to them, C_M and C_H respectively (Schut et al., 2023; Hewitt et al., 2025). If C_M ⊂ C_H, then intuitively all concepts that the machine uses for its computations can be immediately understood by the human interpreter. However, if C_M \ C_H is large, then there are Machine-concepts that are not natively Human-understandable. Some Machine-concepts that aren’t intuitively understandable by humans may be human-understandable with some human effort, effective translation, or good explanations.
However, a core problem remains if there are concepts that are understandable to the model but which are fundamentally alien and incomprehensible to humans. We call such concepts Alien Concepts and label these C_A. Alien Concepts are Machine-concepts in C_M \ C_H that are effectively untranslatable into Human-concept terms (see Figure 1). To the extent that Alien Concepts are present in the model, and are important for the model’s computation, the project of Interpretability may be fundamentally flawed. The conjecture at the heart of (Mechanistic) Interpretability, then, is that artificial neural network-based general intelligences have few, or no, alien concepts that are both vital to model behaviour and not human-understandable. A version of this conjecture appears to be a prerequisite for interpretability research: if most of the model’s concepts aren’t understandable to MI researchers, then it is not clear how MI research can proceed.27

Figure 1: A Venn diagram showing the relationship between the concept spaces of the machine M and human interpreter H. The machine and human have some shared concepts which they can use to communicate (C_M ∩ C_H) but there are many concepts that the machine uses that the human does not understand (C_M \ C_H). The set of Alien Concepts, C_A ⊂ (C_M \ C_H), is a subset of the Machine-concepts. Alien Concepts are causally relevant for the model’s computation but are fundamentally incomprehensible to humans. If this set is large or important, then Interpretability may be highly limited.

7.2 The Principle of Explanatory Optimism

As phrased above, the “no alien concepts” conjecture is about neural networks. It may be more instructive, however, to rephrase this conjecture to put human intelligence at the center and understand it as a claim about a class of general intelligences (such as neural-network based AI systems). Rephrasing then: Everything that is important for the behaviour of an intelligence with implicit explanatory knowledge within some explanatory complexity class is human-understandable. This statement is a strong form of the conjecture that we call The Principle of Explanatory Optimism.28 Weaker forms of Explanatory Optimism might claim that “most” rather than all neural network behaviour is human-understandable (in terms of variance explained, for example).

Humans have certain cognitive limitations compared to future generally intelligent systems: we have limited memory and processing power, we are somewhat limited in bandwidth and speed, and we may lack attention. Strong Explanatory Optimism suggests that, given sufficient time, we could understand these artificial intelligences, if augmented with good, concise explanations, memory devices, cognitive tools and suchlike. The Explanatory Optimism conjecture implies that explanations for understanding Machine-concepts exist.

27 Conversely, if model concepts are possible to be made understandable by MI researchers, then MI research is a tenable research direction (though it may still be difficult).

The Principle of Explanatory Optimism

The Strong Principle of Explanatory Optimism (SEO): Everything important for the behaviour of an intelligence with implicit explanatory knowledge within some explanatory complexity class is human-understandable.

The Weak Principle of Explanatory Optimism (WEO): Most important behaviour of an intelligence with implicit explanatory knowledge within some explanatory complexity class is human-understandable.

Explanatory Optimism (EO) can also be understood, as in Deutsch (2011), as a view of explanatory universality, defined analogously to computational universality as given in the Church-Turing thesis (Turing, 1936; Church, 1936).
In the same sense that any Turing machine can simulate any other Turing machine, we would like a theory that maintains that some intelligences are explanatorily universal in the sense that they can understand and explain any other intelligence of an equivalent explanatory complexity class. We leave the question of what such an explanatory complexity class might look like, and how to prove explanatory class equivalence and universality, to future work.

28 This idea is closely analogous to Deutsch (2011)’s theory of Optimism.

The truth value of Weak Explanatory Optimism is a load-bearing question for the field of interpretability: if models aren’t human-understandable, then the field will face unassailable roadblocks in its mission to explain neural networks to humans. Hence, an appropriate disproof of, or otherwise sufficiently convincing arguments against, (W)EO should motivate researchers working on Mechanistic Interpretability to consider reorienting their research focus.29

Call to Action for Explanatory Optimism

The Call To Action for future work is twofold:

• Firstly, to formalise the above conjectures (SEO and WEO). We would like to develop core definitions for the explanatory complexity classes and the appropriate notion of explanatory universality. This formalisation would likely involve understanding the explanatory complexity classes of different intelligences as well as operationalising the quantitative notion of understanding “most” of a model’s behaviour.

• Secondly, to prove the formalised conjectures. We would like to assess the truth value of the Principle of Explanatory Optimism.

We believe that such results would be of great interest, both to interpretability researchers and to scientists who expect to use non-transparent computational models. Complexity theorists, theoretical computer scientists, analytic philosophers, and computational mechanics theorists may be able to use the tools from their fields to make progress on this conjecture.
7.3 The Importance of Explanatory Optimism

As ML models begin to do more cognitive work in the world, the frontiers of knowledge in mathematics, the (natural, social, and computational) sciences, and the humanities may be known to machines before humans.31 The default state of the world when living alongside such cognitively advanced machines is that humans are epistemically disempowered and subjected to living in a world built by knowledge that no human understands. However, Explanatory Optimism offers an alternative future for humanity. The upshot of Explanatory Optimism is that as machines learn more about the world, through interpretability, humans can learn more about the world too. Any Machine Knowledge can become Human Knowledge. Hence, if Explanatory Optimism is true, Interpretability may be one of the most important projects in the history of modern science. Explanatory Optimism implies that all explanatory knowledge is accessible to people through interpretability and human-computer interaction: we are sitting at but the beginning of an explosion in human understanding.

29 MI researchers with a downstream goal of AI Safety, AI Ethics, or AI Cognitive Science may be interested in whether EO holds. Absent EO, MI may not be an effective way to reach their goals.
30 Explanatory Optimism is another area where we might expect fruitful collaboration between Philosophy and Computational Complexity (and Machine Learning), as in Aaronson (2013).
31 Arguably, AI automation of science may be the final stage of the long-continuing Crisis of the European Sciences depicted by Husserl & Carr (1970). For Husserl, the fact that the sciences have become so disconnected from the phenomenological world of everyday experience results in scientific disciplines becoming increasingly specialised, such that no human has a unified understanding of science as a whole.
With sufficiently intelligent AI, we can imagine a world where no human has an understanding of the furthest advances in even a single scientific discipline.

Acknowledgments

Thanks to Nora Belrose, Matthew Farr, Sean Trott, Elsie Jang, Evžen Wybitul, Andy Arditi, Owen Parsons, Kristaps Kallaste and Egg Syntax for comments on early drafts. We appreciate Daniel Filan and Joseph Miller’s helpful feedback. Thanks to Mel Andrews, Alexander Gietelink Oldenziel, Jacob Pfau, Michael Pearce, Catherine Fist, Lee Sharkey, Jason Gross, Joseph Bloom, Nick Shea, Barnaby Crook, Eleni Angelou, Dashiell Stander, Geoffrey Irving and attendees of the ICML 2024 MechInterp Social for useful conversations. We’re grateful to Kwamina Orleans-Pobee, Will Kirby and Aliya Ahmad for additional support. This project was supported in part by a Foresight Institute AI Safety Grant.

References

Scott Aaronson. Why philosophers should care about computational complexity. Computability: Turing, Gödel, Church, and Beyond, p. 261–328, 2013.

Dario Amodei. The urgency of interpretability, 2025. URL https://www.darioamodei.com/post/the-urgency-of-interpretability.

Mel Andrews. The devil in the data: Machine learning & the theory-free ideal. 2023.

Eleni Angelou. Three levels for large language model cognition, February 2025. URL https://www.lesswrong.com/posts/nH28bPxcxHoZBECz5/three-levels-for-large-language-model-cognition.

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932, 2024.

Thomas Aquinas. Summa Theologica. Hayes Barton Press, 1273.

Andy Arditi. AI as systems, not just models, 2024. URL https://www.lesswrong.com/posts/2po6bp2gCHzxaccNz/ai-as-systems-not-just-models.

Kola Ayonrinde. Adaptive sparse allocation with mutual choice and feature choice sparse autoencoders, 2024.
URL https://arxiv.org/abs/2411.02124.

Kola Ayonrinde. Position: Interpretability is a bidirectional communication problem. In ICLR 2025 Workshop on Bidirectional Human-AI Alignment, 2025. URL https://openreview.net/forum?id=O4LaRH4zSI.

Kola Ayonrinde and Louis Jaburi. Evaluating explanations: An explanatory virtues framework for mechanistic interpretability. 2025. Forthcoming.

Kola Ayonrinde, Michael T. Pearce, and Lee Sharkey. Interpretability as compression: Reconsidering SAE explanations of neural activations with MDL-SAEs, 2024. URL https://arxiv.org/abs/2410.11179.

Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.

William Bechtel and Adele Abrahamsen. Explanation: A mechanist alternative. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 36(2):421–441, 2005. doi:10.1016/j.shpsc.2005.03.010.

Sander Beckers and Joseph Y. Halpern. Abstracting causal models. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, p. 2678–2685. 2019.

Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, and Xiaoli Fern. Neural networks learn statistics of increasing complexity. arXiv preprint arXiv:2402.04362, 2024.

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, et al. International AI safety report. arXiv preprint arXiv:2501.17805, 2025.

Leonard Bereska and Efstratios Gavves. Mechanistic Interpretability for AI Safety – A Review, April 2024. URL http://arxiv.org/abs/2404.14082. arXiv:2404.14082 [cs].
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023.
Percy Williams Bridgman. Reflections of a Physicist. Arno Press, New York, 1980.
Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Rosa Cao and Daniel Yamins. Explanatory models in neuroscience, part 1: Taking mechanistic abstraction seriously. Cognitive Systems Research, p. 101244, 2024.
Stephen Casper, David Krueger, and Dylan Hadfield-Menell. Pitfalls of evidence-based AI policy. arXiv preprint arXiv:2502.09618, 2025.
Gregory Chaitin. On the Intelligibility of the Universe and the Notions of Simplicity, Complexity, and Irreducibility. 2002.
Anjan Chakravartty. Scientific realism. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2017 edition, 2017.
David J. Chalmers. What is conceptual engineering and what should it be? Inquiry, pp. 1–18, 2020.
David J. Chalmers. Propositional interpretability in artificial intelligence. arXiv preprint arXiv:2501.15740, 2025.
Lawrence Chan, Leon Lang, and Erik Jenner. Natural abstractions: Key claims, theorems, and critiques, March 2023. URL https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1.
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Sam Bowman, Jan Leike, Jared Kaplan, et al. Reasoning models don't always say what they think. 2025. URL https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf.
James Chua and Owain Evans. Inference-time-compute: More faithful? A research note. arXiv preprint arXiv:2501.08156, 2025.
Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning, pp. 6243–6267. PMLR, 2023.
Alonzo Church. An unsolvable problem of elementary number theory. American Journal of Mathematics, 58(2):345–363, 1936. doi:10.2307/2371045.
Andy Clark and David J. Chalmers. The extended mind. Analysis, 58(1):7–19, 1998. doi:10.1093/analys/58.1.7.
Richard Creath (ed.). Dear Carnap, Dear Van: The Quine-Carnap Correspondence and Related Work: Edited and with an Introduction by Richard Creath. University of California Press, 1990.
Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 1955. doi:10.1037/h0040957.
Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R McKee, Joel Z Leibo, Kate Larson, and Thore Graepel. Open problems in cooperative AI. arXiv preprint arXiv:2012.08630, 2020.
Richard Dawid and Stephan Hartmann. The no miracles argument without the base rate fallacy. Synthese, 195(9):4063–4079, 2016. doi:10.1007/s11229-017-1408-x.
Gregoire Deletang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. Language modeling is compression. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jznbgiynus.
Abram Demski and Scott Garrabrant. Embedded agents, 2018. URL https://www.lesswrong.com/posts/p7x32SEt43ZMC9r7r/embedded-agents.
David Deutsch. The Beginning of Infinity: Explanations that Transform the World. Penguin UK, 2011.
Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
Heather Douglas. Science, Policy, and the Value-Free Ideal. University of Pittsburgh Press, 2009.
Pierre Maurice Marie Duhem. The Aim and Structure of Physical Theory. Princeton University Press, Princeton, 1954.
Adrian Erasmus, Tyler DP Brunet, and Eyal Fisher. What is interpretability? Philosophy & Technology, 34(4):833–862, 2021.
Will Fleisher. Understanding, idealization, and explainable AI. Episteme, 19(4):534–560, 2022. doi:10.1017/epi.2022.39.
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders, June 2024. URL http://arxiv.org/abs/2406.04093. arXiv:2406.04093 [cs] version: 1.
Atticus Geiger, Chris Potts, and Thomas Icard. Causal abstraction for faithful model interpretation. arXiv preprint arXiv:2301.04709, 2023.
Thomas F. Gieryn. Cultural Boundaries of Science: Credibility on the Line. University of Chicago Press, 1999.
Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89. IEEE, 2018.
René Girard, Jean-Michel Oughourlian, and Guy Lefort. Things Hidden Since the Foundation of the World. Stanford University Press, 1987.
Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024.
Micah Goldblum, Marc Finzi, Keefer Rowan, and Andrew Gordon Wilson. The no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning. arXiv preprint arXiv:2304.05366, 2023.
Satvik Golechha and Adrià Garriga-Alonso. Among Us: A sandbox for agentic deception, 2025. URL https://arxiv.org/abs/2504.04072.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024.
Jacqueline Harding. Operationalising representation in natural language processing. arXiv preprint arXiv:2306.08193, 2023.
Werner Heisenberg. Physics and Philosophy: The Revolution in Modern Science, 1958. URL https://archive.org/details/physics-and-philosophy-the-revolution-in-modern-scirnce-werner-heisenberg-f.-s.-c.-northrop.
Carl G. Hempel and Paul Oppenheim. Studies in the logic of explanation. Philosophy of Science, 15(2):135–175, 1948. ISSN 00318248, 1539767X. URL http://www.jstor.org/stable/185169.
John Hewitt, Robert Geirhos, and Been Kim. We can't understand AI using our existing vocabulary. arXiv preprint arXiv:2502.07586, 2025.
Julia Himmelsbach, Stephanie Schwarz, Cornelia Gerdenitsch, Beatrix Wais-Zechmann, Jan Bobeth, and Manfred Tscheligi. Do we care about diversity in human computer interaction: A comprehensive content analysis on diversity dimensions in research. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, pp. 1–16, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450359702. doi:10.1145/3290605.3300720. URL https://doi.org/10.1145/3290605.3300720.
Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry, 2025. URL https://arxiv.org/abs/2503.01822.
Hossein Hosseini, Baicen Xiao, Mayoore Jaiswal, and Radha Poovendran. Assessing shape bias property of convolutional neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2004–20048. IEEE, 2018.
Colin Howson. Hume's Problem: Induction and the Justification of Belief. Clarendon Press, 2000.
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 20617–20642. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/huh24a.html.
E. Husserl and D. Carr. The Crisis of European Sciences and Transcendental Phenomenology: An Introduction to Phenomenological Philosophy. Northwestern University Studies in Phenomenology & Existential Philosophy. Northwestern University Press, 1970. ISBN 9780810104587. URL https://books.google.co.uk/books?id=Ca7GDwz5lF4C.
M. Hutter, E. Catt, and D. Quarel. An Introduction to Universal Artificial Intelligence. Chapman & Hall/CRC Artificial Intelligence and Robotics Series. Chapman & Hall/CRC Press, 2024. ISBN 9781003460299. URL https://books.google.co.uk/books?id=cfg60AEACAAJ.
David Hyland, Tomáš Gavenčiak, Lancelot Da Costa, Conor Heins, Vojtech Kovarik, Julian Gutierrez, Michael J. Wooldridge, and Jan Kulveit. Free-energy equilibria: Toward a theory of interactions between boundedly-rational agents. In ICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024. URL https://openreview.net/forum?id=4Ft7DcrjdO.
Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems, 32, 2019.
Yacine Izza, Alexey Ignatiev, and Joao Marques-Silva. On explaining decision trees. arXiv preprint arXiv:2010.11034, 2020.
Zhijing Jin and Sergio Garrido. Tutorial proposal: Causality for large language models. 2024.
Eric Jonas and Konrad Paul Kording. Could a neuroscientist understand a microprocessor? bioRxiv, 2016. doi:10.1101/055624. URL https://www.biorxiv.org/content/early/2016/11/14/055624.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Lena Kästner and Barnaby Crook. Explaining AI through mechanistic interpretability. European Journal for Philosophy of Science, 14(4):52, 2024.
Muhammad Ali Khalidi. Natural Kinds. Cambridge University Press, 2023.
Kareem Khalifa. The role of explanation in understanding. British Journal for the Philosophy of Science, 64(1):161–187, 2013. doi:10.1093/bjps/axr057.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. DSPy: Compiling declarative language model calls into self-improving pipelines. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
Thomas Samuel Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1962.
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamile Lukosiute, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, and Ethan Perez. Measuring faithfulness in chain-of-thought reasoning. CoRR, abs/2307.13702, 2023. URL https://doi.org/10.48550/arXiv.2307.13702.
Bruno Latour. Science in Action. Harvard University Press, Cambridge, Massachusetts, 1987. ISBN 978-0-674-79291-3.
Larry Laudan. The demise of the demarcation problem. In Robert S. Cohen and Larry Laudan (eds.), Physics, Philosophy and Psychoanalysis: Essays in Honor of Adolf Grünbaum, pp. 111–127. D. Reidel, 1983.
Matthew L. Leavitt and Ari Morcos. Towards falsifiable interpretability research, 2020. URL https://arxiv.org/abs/2010.12016.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, and Daniel Murfet. You are what you eat – AI alignment requires understanding how data shapes structure and generalisation. arXiv preprint arXiv:2502.05475, 2025.
David Lewis. Causal explanation. In David Lewis (ed.), Philosophical Papers, Volume I, pp. 214–240. Oxford University Press, 1986.
Ming Li, Paul Vitányi, et al. An Introduction to Kolmogorov Complexity and Its Applications, volume 3. Springer, 2008.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168:1223–1247, 2017.
Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, and Tong Zhang. Mitigating the alignment tax of RLHF. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 580–606, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.35. URL https://aclanthology.org/2024.emnlp-main.35/.
Grace W. Lindsay and David Bau. Testing methods of neural systems understanding. Cognitive Systems Research, 82:101156, December 2023. URL https://doi.org/10.1016/j.cogsys.2023.101156.
Peter Lipton. Truth, existence, and the best explanation. In A. A. Derksen (ed.), The Scientific Realism of Rom Harré. Tilburg University Press, 1994.
Peter Lipton. What good is an explanation? In Explanation: Theoretical Approaches and Applications, pp. 43–59. Springer, 2001.
Peter Lipton. The causal model. In Inference to the Best Explanation, p. 25. Routledge, 2nd edition, 2003. ISBN 9780203470855. eBook.
Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.
Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, et al. Large language models and causal inference in collaboration: A comprehensive survey. arXiv preprint arXiv:2403.09606, 2024.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124. PMLR, 2019.
David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
Jean-Luc Marion. The Visible and the Revealed. Fordham University Press, 2008. ISBN 9780823228836. URL http://www.jstor.org/stable/j.ctt1c5ck6r.
David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., USA, 1982. ISBN 0716715678.
Ron McClamrock. Marr's three levels: A re-evaluation. Minds and Machines, 1(2):185–196, 1990. doi:10.1007/bf00361036.
Raphaël Millière and Cameron Buckner. Interventionist methods for interpreting deep neural networks. In Gualtiero Piccinini (ed.), Neurocognitive Foundations of Mind. Routledge, 2025. Forthcoming.
James Myers. Cognitive styles in two cognitive sciences. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 34, 2012.
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
Judea Pearl. Causality. Cambridge University Press, 2009.
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13387–13434, 2023.
Karl R. Popper. The Logic of Scientific Discovery. Routledge, London, England, 1935.
Stathis Psillos. Scientific Realism: How Science Tracks Truth. Routledge, New York, 1999.
Hilary Putnam (ed.). Philosophical Papers: Volume 1, Mathematics, Matter and Method. Cambridge University Press, New York, 1979.
Hilary Putnam. Three kinds of scientific realism. The Philosophical Quarterly (1950-), 32(128):195–200, 1982. ISSN 00318094, 14679213. URL http://www.jstor.org/stable/2219323.
Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2024. URL https://arxiv.org/abs/2407.14435.
Tim Räz and Claus Beisbart. The importance of understanding deep learning. Erkenntnis, 89(5), 2024. doi:10.1007/s10670-022-00605-y.
Henk W. De Regt. Understanding Scientific Understanding. OUP USA, New York, 2017.
Julian Reiss and Jan Sprenger. Scientific objectivity. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2020 edition, 2020.
Riccardo Rende, Federica Gerace, Alessandro Laio, and Sebastian Goldt. A distributional simplicity bias in the learning dynamics of transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=GgV6UczIWM.
Jonathan Richens and Tom Everitt. Robust agents learn causal world models. In The Twelfth International Conference on Learning Representations, 2024.
Darrell P Rowbottom, William Peden, and André Curtis-Trudel. Does the no miracles argument apply to AI? Synthese, 203(5):173, 2024.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
Wesley Salmon. Four Decades of Scientific Explanation. 1989. URL https://api.semanticscholar.org/CorpusID:46466034.
Wesley C. Salmon. Scientific Explanation and the Causal Structure of the World. Princeton University Press, 1984. ISBN 9780691101705.
Wesley C. Salmon, Richard C. Jeffrey, and James G. Greeno. Statistical Explanation, pp. 29–88. University of Pittsburgh Press, 1971. ISBN 9780822952251. URL http://www.jstor.org/stable/j.ctt6wrd9p.6.
Naomi Saphra and Sarah Wiegreffe. Mechanistic?, 2024. URL https://arxiv.org/abs/2410.09087.
Erhard Scheibe. Between Rationalism and Empiricism: Selected Papers in the Philosophy of Physics. Springer Verlag, 2002.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, and Been Kim. Bridging the human-AI knowledge gap: Concept discovery and transfer in AlphaZero. arXiv preprint arXiv:2310.16410, 2023.
Lawrence Shapiro and Shannon Spaulding, 2021. URL https://plato.stanford.edu/entries/embodied-cognition/.
Lee Sharkey. Sparsify: A mechanistic interpretability research agenda. April 2024. URL https://www.alignmentforum.org/posts/64MizJXzyvrYpeKqm/sparsify-a-mechanistic-interpretability-research-agenda.
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath. Open problems in mechanistic interpretability, 2025. URL https://arxiv.org/abs/2501.16496.
Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, and David Blei. Hypothesis testing the circuit hypothesis in LLMs. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=ibSNv9cldu.
R.J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7(2):224–254, 1964. ISSN 0019-9958. doi:10.1016/S0019-9958(64)90131-7. URL https://www.sciencedirect.com/science/article/pii/S0019995864901317.
Ernest Sosa. Epistemic Explanations: A Theory of Telic Normativity, and What It Explains. Oxford University Press, Oxford, 2021.
Michael Strevens. No understanding without explanation. Studies in History and Philosophy of Science Part A, 44(3):510–515, 2013.
Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, et al. Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018, 2023.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
Sean Trott. In cautious defense of LLM-ology, 2023. URL https://seantrott.substack.com/p/in-cautious-defense-of-llm-ology.
Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42):230–265, 1936. doi:10.1112/plms/s2-42.1.230.
Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.
Dmitry Vaintrob. Dmitry's koan, 2025. URL https://www.lesswrong.com/posts/3eo4SSZLfpHHCqoEQ/dmitry-s-koan.
Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023.
Boshi Wang, Xiang Yue, Yu Su, and Huan Sun. Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization. arXiv preprint arXiv:2405.15071, 2024.
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023.
Max Weber. "Objectivity" in social science and social policy. The Methodology of the Social Sciences, pp. 49–112, 1949.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. From decoding to meta-generation: Inference-time algorithms for large language models. Transactions on Machine Learning Research, 2024.
John Wentworth. Testing the natural abstraction hypothesis: Project intro, April 2021. URL https://www.lesswrong.com/posts/cy3BhHrGinZCp3LXE/testing-the-natural-abstraction-hypothesis-project-intro.
John Wentworth and David Lorell. Natural latents: The concepts, 2024. URL https://www.alignmentforum.org/posts/mMEbfooQzMwJERAJJ/natural-latents-the-concepts.
Amy Widdicombe, Simon Julier, and Been Kim. Saliency maps contain network "fingerprints". In ICLR 2022 Workshop on PAIR2Struct: Privacy, Accountability, Interpretability, Robustness, Reasoning on Structured Data, 2018.
Daniel A Wilkenfeld. Understanding as compression. Philosophical Studies, 176(10):2807–2831, 2019.
Ludwig Wittgenstein. Philosophical Investigations. Wiley-Blackwell, New York, NY, USA, 1953.
David H Wolpert and William G Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.
James F. Woodward. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, New York, 2003.
Wilson Wu, Louis Jaburi, Jacob Drori, and Jason Gross. Unifying and verifying mechanistic interpretations: A case study with group operations. arXiv preprint arXiv:2410.07476, 2024.
Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound AI systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pp. 818–833. Springer, 2014.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy8gdB9x.
A. Zheng and A. Casari. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly, 2018. ISBN 9781491953242. URL https://books.google.co.uk/books?id=Ho0UvgAACAAJ.
A Examples of Explanation Types

In this section, we provide some intuitive examples and non-examples of Explanations which satisfy the criteria that we outline in Section 4.

A.1 Ontic Explanations

Question: Why did the pen fall off the desk?

Causal-Mechanistic but not Ontic Explanation. The pen fell off the desk because the aether pushed the bottle and then the bottle pushed the pen off the desk.

This explanation is Causal-Mechanistic in the sense that one thing happens after another and causes the next. However, if we do not believe that the aether is a real entity, then this explanation cannot be considered an Ontic Explanation.

Question: Why is the cube heavy?

Ontic but not Causal-Mechanistic Explanation. The cube is heavy because it is made up of tungsten atoms.

This explanation is Ontic, as the entities involved in the explanation are real entities. However, it is not Causal-Mechanistic, as there is no step-by-step explanation without gaps.

A.2 Statistically-Relevant Explanations

Consider the explanation: Ice cream sales are higher on days when there are more shark attacks. If there's a shark attack reported, we can predict with 85% confidence that ice cream sales will be above average that day.

This explanation is purely in terms of statistical correlation rather than causation. There is no explication of any underlying causal mechanism, which might involve both phenomena being causally downstream of hot weather and/or more beach visitors. We could perform interventions to test this hypothesis.

A.3 Telic Explanations

Consider the following explanation: The heart exists to pump blood throughout the body and maintain circulation.

This explanation describes the purpose and function of the heart, rather than describing the physical processes by which the observed phenomenon of blood pumping comes about (which might include a mechanistic description of chambers, valves and muscles).
In this way, the explanation does not provide a continuous causal chain from cause to effect, without any unexplained gaps, as a Causal-Mechanistic explanation should.

A.4 Nomological Explanations

Question: Why does a metal rod expand when heated?

Nomological but not Causal-Mechanistic Explanation. The rod expands because it follows the natural law that all metals expand when heated, as described by the coefficient of thermal expansion.

This explanation references a general law of nature without getting into the underlying mechanism.

Causal-Mechanistic Explanation. The rod expands because its metal atoms vibrate more vigorously when heated, which increases their average spacing. This increased spacing leads to an overall increase in the rod's length.

This details the physical mechanism causing the expansion.

B The No Miracles Argument for the Explanatory View

In Section 4, one of our conditions for a scientific explanation of a neural network to count as a valid Mechanistic Interpretability explanation was that the entities in the explanation are (to the best of our knowledge) real (Ontic) entities. We might wonder how it is possible to make such a claim, given that it is not necessarily clear that there is an ontic test for features within a neural network.

Similarly, in discussing representations, following Harding (2023), we noted that Information, Use and the possibility of Misrepresentation are required to claim that some activations are representations. However, it is not entirely clear that we are able to draw a correspondence between the world and the activations (i.e., to say that the neural network is really representing a feature of the world, or that the feature is about the world). Our notion of ur-explanations, defined in Section 3.2.1 as idealised explanations of model behaviour on an input distribution, given in terms of its learned internal structures, also seems to rely on representations being well-defined in the above sense.
Here we provide an argument that mitigates the above concerns. The argument proceeds via the traditional No Miracles Argument (NMA) for Scientific Realism (Putnam, 1979; Lipton, 1994; Psillos, 1999; Rowbottom et al., 2024). We first give a brief overview of the NMA in science, then adapt the NMA to machine learning models in general, and finally adapt it to the explanations of Mechanistic Interpretability.

B.1 The No Miracles Argument

"(Novel) empirical successes in science enabled by scientific theories are non-miraculous because such theories are typically or probably approximately true." — Putnam (1979)

The No Miracles Argument (NMA) is widely considered to be the strongest argument for Scientific Realism and contends that "[Scientific Realism] is the only philosophy that doesn't make the success of science a miracle" (Putnam, 1979). The argument proceeds as follows (Chakravartty, 2017):

1. Scientific theories are (extraordinarily) successful in the sense that they make accurate novel empirical predictions about phenomena of interest.
2. If our scientific theories are very far from the truth, then it would be miraculous that they are so successful.
3. Given the choice between a straightforward reason for the success of scientific theories and a seemingly miraculous sense in which all our theories are just coincidentally producing accurate novel predictions, one should clearly prefer the former.
4. Therefore, two conclusions follow:
   (a) Firstly, that our best scientific theories are approximately true (or approximately correctly describe mind-independent laws).
   (b) Secondly, that the entities posited by such scientific theories are real (or approximately characterize mind-independent entities).

Note that the NMA has two distinct conclusions. The first conclusion is the epistemic thesis, which states that our best scientific theories are approximately true.
For example, the fact that the effective predictions of Relativity enable GPS-based navigation systems provides good reason to believe that Relativity is approximately true. The second conclusion is the semantic thesis, which states that scientific entities refer. For example, when physicists talk about electrons that they have never seen with the naked eye, they are referring to real entities. Dawid & Hartmann (2016) provide a formalisation of the No Miracles Argument in terms of Bayesian probability, to which we refer readers for a complete mathematical formalism of the NMA.[32]

B.2 The No Miracles Argument for Neural Representations

We would like to provide an argument that neural activations in well-trained, generalising neural networks are representations in the sense that they correspond with entities in the world. From the Explanatory View of Neural Networks (see Section 3.2), we see that the NMA can be applied to the case of neural networks. In the context of science, neural networks often play the role of a theory: they are concise representations produced by processing data which, at inference time, provide predictions for new data. In this sense neural networks play a theory-like epistemic role. We may hence insert neural networks into the role of the theory in the NMA as follows. First, we note that neural networks in general are extraordinarily successful at making accurate predictions. This success occurs both for individual narrow tasks like image classification (LeCun et al., 1998) and playing Go (Schrittwieser et al., 2020), as well as for general tasks like language modelling (Brown et al., 2020) and writing code (Li et al., 2022). We may now follow the NMA's argument: if the representations and computations of neural networks do not at least approximately correspond to entities and laws in the training data (and thus in the natural world), then their quite incredible success would be miraculous.
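Although Dawid & Hartmann (2016) give the full Bayesian treatment, the basic force of the argument can be sketched with a single Bayes update (the probability values below are illustrative assumptions of ours, not figures from their paper). Let $T$ be the proposition that a theory is approximately true and $S$ the proposition that it enjoys novel predictive success:

```latex
P(T \mid S) \;=\; \frac{P(S \mid T)\,P(T)}{P(S \mid T)\,P(T) + P(S \mid \neg T)\,P(\neg T)}
```

Premise 2 of the NMA asserts that $P(S \mid \neg T)$ is tiny, since success without approximate truth would be a "miracle". For instance, with $P(S \mid T) = 0.9$, $P(S \mid \neg T) = 0.01$ and a modest prior $P(T) = 0.1$, the posterior is $0.09 / (0.09 + 0.009) \approx 0.91$. Note that the conclusion remains sensitive to the prior $P(T)$, which is precisely the base-rate worry that careful formalisations of the NMA must avoid.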
The fact that neural networks generalise so well, that is, that they make accurate novel predictions, suggests that we have good reason to believe the non-miraculous conclusion: the structure in neural networks produces representations that correspond to entities in the world (the semantic thesis of the NMA). And we ought to think of the explanatory theories implicit within the neural network as providing a useful guide to approximate theories about the scientific structure of the task (the epistemic thesis of the NMA).[33]

B.3 Explanatory Faithfulness Through the No Miracles Argument

Our application of the NMA to neural networks suggests that the ur-explanations provided by neural networks are scientifically interesting explanatory theories which are increasingly likely to capture the true structure of the world (asymptotically closely) as models become better predictors.[34]

However, we also note that in practising Mechanistic Interpretability, researchers come up with their own explanations of model behaviour. These researchers would like to uncover the ur-explanations, but a priori they have no way of knowing that their explanations coincide with the idealised ur-explanations that they are targeting. However, a repeated application of the NMA can provide such evidence. We may call the following argument NMA₂.

[32] Note that Dawid & Hartmann (2016)'s formalisation avoids an earlier mistaken formalisation which contained a base-rate fallacy. See also Howson (2000) for further discussion.
[33] Also note the implications here for interpretability as a tool for learning about science through scientific AI models.
[34] This argument could provide an explanation for the Platonic Representation Hypothesis of Huh et al. (2024), which suggests that the representations of neural networks tend to become more similar
Firstly, the No Miracles Argument as given above gives us reason to believe that there are representations and ur-explanations to be found always-already within the trained neural network. This is simply a straightforward application of the NMA. Secondly, suppose that we have a mechanistic explanation E of a neural network. It is theoretically reasonable to ask whether the explanation E is explanatorily faithful to the ur-explanation U. However, it is not immediately clear how to practically evaluate the claim that E is explanatorily faithful to U. We note, however, that we may again apply the NMA to the explanation E. That is to say, if (1) the explanation E highly successfully predicts the relevant information in the neural network activations at each layer of the network, and (2) the activations are representations in the sense of Harding (2023) under causal interventions, then we can conclude (C) that we have reason to believe that E in fact approximately corresponds to the ur-explanation U.[35] It would be highly coincidental (miraculous even) for the negation to routinely be the case. (Here again we use the epistemic thesis of the NMA.)

As a practical upshot, this argument provides some validity to the methods of Shi et al. (2024), who propose hypothesis testing as a method for evaluating the faithfulness of mechanistic explanations. If the circuit explanation E passes the hypothesis test at an appropriate significance level, then we have reason to believe that E similarly corresponds to the ur-explanation U; that is, E is not merely behaviourally faithful but further is explanatorily faithful.[36]

In summary, through the NMA₂ argument, we believe that the following two conclusions follow:

1. We have reason to believe that ML models that are extraordinarily successful at making accurate novel predictions contain explanatory knowledge. That is, their ur-explanations are well-defined and we ought to be realists about features as variables of computation.
Further, we have reason to believe that these ur-explanations are unique.

2. We have reason to believe that MI explanations that provide successful algorithmic predictions of how the neural activations are structured at each sequential layer of the model are likely to be approximately explanatorily faithful to the model's ur-explanation, and to contribute to our understanding of the model's behaviour.

C The Value of Demarcating Many Valuable Forms of Interpretability

In Section 4, we provided a technical definition of Mechanistic Interpretability (MI) as producing explanations of neural networks that are Model-level, Ontic, Causal-Mechanistic and Falsifiable. This definition provides a clear demarcation between interpretability research which ought to be called Mechanistic Interpretability and research which falls outside of Mechanistic Interpretability. In making this distinction, we attach no value judgements to the term "Mechanistic". We are not wanting to say that Mechanistic Interpretability is inherently good or uniquely valuable, and that non-Mechanistic interpretability is necessarily bad or worthless, or vice versa.

[34, continued] as the networks become larger and more effective predictors. The NMA argument suggests that they are converging to the true common structure of the data-generating process.
[35] An example of an explanation that we might consider explanatorily faithful in the above sense is the explanation of Modular Arithmetic in Nanda et al. (2023). Some other MI explanations are less successful layer by layer and incur a non-trivial change to the performance of the model if patched in as a replacement for some of the model's computation.
[36] In practice, we might consider the significance levels of the tests that are run in practice to provide much too weak evidence to conclude that the explanation is explanatorily faithful.
Demarcating Mechanistic Interpretability is simply a useful way to understand what is meant by the term, in a way that lets us make claims about the efficacy of certain methods for solving the interpretability problem. It seems highly plausible that Chain-of-Thought Interpretability (non-Model-level), Concept-Based Interpretability (non-Causal-Mechanistic), and Propositional Interpretability (non-Causal-Mechanistic) could be practically useful for downstream tasks or as inputs to Mechanistic Interpretability. Though many flowers should bloom, it is prudent for us to refrain from calling daffodils tulips.

D The Three Varieties of Interpretability

Under the classical (behavioural) view of Machine Learning, there are two ways to have an explanation of a model's behaviour: Interpretability By Design and Post-hoc Interpretability (Lipton, 2018). Linear models and decision trees are examples of models that are considered interpretable[37] by design: it is typically considered relatively easy to understand how these models make their predictions by inspection.[38] Post-hoc interpretability methods, on the other hand, take a trained model and attempt to confer useful information by creating explanations which statistically seem to correlate well with the model's behaviour.[39]

Mechanistic Interpretability offers a third variety of explanation, distinct from both Interpretability By Design and Post-hoc Interpretability. With the Explanatory View of neural networks, we can make sense of a new type of interpretability functioning as a pursuit to uncover the ur-explanations that are always-already present within the trained network. The Explanatory View of Neural Networks takes seriously the idea that there is structure in the model to be interpreted.
We can contrast this with post-hoc interpretability, where we may come up with just-so stories that happen to match the network's behaviour on a sub-distribution, even if they are confabulatory and not explanatorily faithful. The Explanatory View is a cognitive realist view of neural networks which suggests that there exists a target for the interpretability programme: we are not merely looking for explanations which appear to correlate with model behaviour; models admit explanations that we would like to extract.

E The Theory Required For Interpretability

Section 5.1.2 provides a discussion of theory-ladenness in interpretability methods. There are two plausible types of theory that we might use to understand a domain-specific neural network: Domain Theory and Model Theory. For example, if we seek to interpret a protein model, then the Domain Theory concerns the biology of proteins. Model Theory, however, always concerns neural networks.

[37] The interpretability of such models is somewhat subjective and not solely dependent on the high-level architecture. Very large, high-dimensional models, even if they are simple, may be difficult to interpret. See Izza et al. (2020) for a discussion of decision trees and Lipton (2018) for a discussion of linear models. In particular, when the features of a linear model are highly correlated, the interpretability of the model may be reduced.
[38] Interpretable by Design models may suffer a performance penalty for their interpretability. We might think of this performance degradation as an Interpretability Tax, a cost paid for the ability to understand the model's behaviour. See also the Alignment Tax in AI Safety (Lin et al., 2024).
[39] One interesting form of post-hoc interpretability is models that attempt to self-explain. For example, a language model might output a Chain of Thought, either after the fact or as part of the generation process.
However, in general, models do not reliably and faithfully describe the process underlying their outputs (Turpin et al., 2023; Lanham et al., 2023; Chua & Evans, 2025; Chen et al., 2025). We may hence think of these Chain of Thought "explanations" as epiphenomenal rather than as faithful explanations.

All interpretability requires Model Theory: it would be useful to understand the structure of neural networks, how they generalise, how they represent the structure of the world, and so on. The activation space is a function of the preceding weights of the model and the input data distribution; understanding activation spaces, however, is not inherently a function of the Domain Theory. It is not the case that to interpret a protein model we need lots of Domain Theory; we need only minimal knowledge about proteins ahead of explanation. On the contrary, we should hope that using interpretability we can learn more about the Domain Theory from the network. That is, a scientific benefit of interpretability research on scientific models (e.g. protein models) is that it can help us learn more about the Domain Theory (e.g. protein theory) than we knew before, through understanding the model's knowledge of the subject matter.

F The Straightforward Implementation-Level Explanation

Following Marr (1982), in Section 3.3 we discussed the distinction between the Computational, Algorithmic and Implementation levels of analysis. All neural networks admit at least one trivial explanation at the Implementation level, which we denote the straightforward explanation. We define the straightforward explanation as follows. Given a neural network f : X → Y and x ∈ X such that f(x) = y ∈ Y, the straightforward explanation is given by a trace of f(x), the computation of the network on the input x. In fact, this explanation is a formal proof of the equality f(x) = y.
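As a concrete sketch of a straightforward explanation, the following illustrative Python snippet (the toy network and its weights are our own assumptions, not from the paper) records the full computation trace of a tiny feed-forward network; the trace is exactly the Implementation-level witness of f(x) = y:

```python
# A minimal sketch of a "straightforward explanation": the full computation
# trace of a tiny fixed-weight ReLU network on one input.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def traced_forward(x, layers):
    """Run the network on x, recording every intermediate activation.

    The returned trace is the Implementation-level straightforward
    explanation: each step certifies the next, so together the steps
    witness (i.e. prove) the equality f(x) = y.
    """
    trace = [("input", x)]
    for i, W in enumerate(layers):
        x = relu(matvec(W, x))
        trace.append((f"layer_{i}", x))
    return x, trace

# Hypothetical 2-layer network.
layers = [
    [[1.0, -1.0], [0.5, 0.5]],  # layer 0 weights
    [[1.0, 2.0]],               # layer 1 weights
]
y, trace = traced_forward([2.0, 1.0], layers)
for name, value in trace:
    print(name, value)
```

Such a trace is always available but, as noted below, is rarely a good explanation: its length grows with the computation itself rather than compressing it.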
While straightforward explanations in this sense are always available, they are typically not good explanations, as they are neither concise nor illuminating. We would like explanations of neural networks that are in terms of the features (or concepts) that the network learned during training, and explanations which are compact and useful.

G The Trouble with Oracles and Zero-Knowledge Proofs

In Section 3.2, we described the classical view of Machine Learning as seeing the ML model as a black box providing explanationless predictions, as an oracle would provide prophecies. We would like to trust models due to the content of their explanations (that is, their reasons for acting and the interventions which would have changed their prediction), which is not possible if we have only explanationless predictions. We may especially desire content-relevant reasons to trust models in high-stakes deployments of ML-based systems relevant to AI Ethics and AI Safety (Amodei, 2025). Similarly, if we would like to do science with ML models, i.e., to generate explanatory knowledge of natural phenomena, then this appears untenable without understanding the relevant contentful reasons for the model's predictions (Räz & Beisbart, 2024). Suppose, for example, that an AI tells a user that if they invest in a certain stock they will make a lot of money. To act on this advice, the user may like to know the reasons that the stock is likely to be a good purchase in terms of the business fundamentals (an explanation of the prediction) rather than extrinsic reasons (e.g. that the AI has made successful predictions before). This is especially important when the AI might have different motivations for the suggestion or may be scheming to mislead the user towards its own aims.[40]
A core problem with behaviourist explanations in general is that the input space is too large for the explanation to pick out every single case, and hence either (1) the explanation will be incomplete, leaving out large sections of the input data domain, or (2) it will coarse-grain over the input data domain. In both cases, such explanations are likely to be insufficient for out-of-distribution behaviour and would not be recommended for high-stakes explanations in which we would like to be confident.

[40] That is, we don't just face the generic problem of induction; further, the system might be genuinely adversarial and anti-inductive!

Analogously, zero-knowledge proofs can be thought of as another form of explanationless prediction: they can convince the listener of the truth of some statement whilst providing the listener with no understanding or useful explanatory knowledge. The reasons to believe the statement of which we have a zero-knowledge proof do not come from the relevant content of the explanation. Good explanations are compressions, not cryptography.

H Conceptual Engineering for and by Neural Networks

Early Machine Learning models typically required researchers to first do feature engineering. Before inputting the data into a model, researchers would transform the input data to make it more amenable to the model (Zheng & Casari, 2018). Modern deep neural networks do not require feature engineering. Instead, neural networks learn to create representations of the raw input data which are useful for the task at hand, without being explicitly told to create such and such a representation (Zeiler & Fergus, 2014; LeCun et al., 1998; 2015). For example, Olah et al. (2020) find that neural networks trained to classify images of animals and cars learn to represent the concepts of "fur" and "car windows" in their intermediate layers (see Figure 2).

Figure 2: When studying InceptionV1 (Szegedy et al., 2015), Olah et al.
(2020) note that though the network was trained only to classify ImageNet images (Russakovsky et al., 2015), it learned intermediate representations that were useful for the image classification task. For example, in order to detect cars, the network internally learned concepts for windows, wheels and car bodies. This is an example of Conceptual Engineering in neural networks. [Image from Olah et al. (2020)]

Analytic philosophers use the term Conceptual Engineering to describe the process of designing, improving and assessing the concepts that we use in order to better achieve our aims (Creath, 1990; Chalmers, 2020). In this sense, generalising neural networks engage in conceptual engineering throughout training: they learn representations which more closely capture the useful concepts required for their environment and/or task. This process of learning representations is Conceptual Engineering both for and by neural networks. In this sense, when we speak about ur-explanations of a neural network as internal computations over learned representations (see Section 3.3), the representations we are considering have been engineered and learned by the network at train time in a process of automated Conceptual Engineering.

I Explanatory Universality

In Section 7, we defined Explanatory Optimism as the conjecture that there exist explanations of the behaviour of an intelligent system (within some explanatory complexity class) such that a human could understand the system's behaviour. For a human to understand such an explanation, we should expect that the explanation can be written in natural language or some other human-readable format, and that the explanation is relatively short: say, its description length is a sublinear function of the number of parameters (and, crucially, this function should not depend on the number of input examples, which is often very large).[41]
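One way to make the shortness condition precise (this formalisation is our own gloss on the discussion above, not a definition from the paper) is as a description-length bound. For a model with $P$ parameters trained on $n$ examples, Explanatory Optimism would conjecture the existence of a faithful explanation $E$ satisfying:

```latex
\exists E : \quad \ell(E) \;\le\; C \cdot g(P), \qquad g(P) = o(P),
```

where $\ell(E)$ is the description length of $E$ in a human-readable format, and the constant $C$ and the sublinear function $g$ (for instance $\sqrt{P}$ or $\log P$) are independent of $n$.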
We here rephrase Explanatory Optimism in terms of two types of universality that we might look to understand within neural networks: Universality of Learned Concepts and Universality of Learnable Concepts.

I.1 Universality of Learned Concepts

A core problem in interpretability is that there is evidence that humans and machines do not necessarily share the same concepts and representations (Hewitt et al., 2025). This discrepancy creates a problem for interpretability: communicating without a shared conceptual framework (Hewitt et al., 2025). Universality of Learned Concepts states that, at sufficient scale, all neural systems develop universal concepts and mechanisms (perhaps under the constraint of sharing an environment/training distribution or sharing a task). We could understand this claim either by way of networks of sufficiently low prediction error learning the same set of core concepts, or by way of concepts being learned in the same order,[42] say from simple to complex (Belrose et al., 2024; Rende et al., 2024; Hutter et al., 2024). Universality of Learned Concepts is the sense of universality analysed in interpretability research such as Olah et al. (2020) and Chughtai et al. (2023), and is often measured in the Representational Alignment literature (Sucholutsky et al., 2023). The Platonic Representation Hypothesis (Huh et al., 2024), the Causal World Model Theorem (Richens & Everitt, 2024), metaphysical accounts of Scientific Realism (Putnam, 1982), the Natural Latents Theory (the Natural Abstractions Hypothesis) (Wentworth, 2021; Chan et al., 2023; Wentworth & Lorell, 2024) and the theory of Natural Kinds (Khalidi, 2023) all argue for Universality of Learned Concepts in some form. If Universality of Learned Concepts is true, then we should expect the problem of non-shared concepts to become less significant for interpretability with scale.[43]
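A common way to quantify such representational alignment between two networks is linear Centered Kernel Alignment (CKA), one of the similarity measures surveyed in the alignment literature. The sketch below is illustrative (the random "activations" are stand-ins for real network activations, not anything from the paper):

```python
# Measuring representational alignment between two networks with linear
# Centered Kernel Alignment (CKA). CKA is 1 for identical geometry and
# near 0 for unrelated representations.
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2)."""
    X = X - X.mean(axis=0)  # centre each feature dimension
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))              # activations of "network A"
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
Y = X @ Q                                      # rotated copy: same geometry
Z = rng.standard_normal((100, 8))              # unrelated activations

print(linear_cka(X, X))  # identical representations give CKA of 1
print(linear_cka(X, Y))  # CKA is invariant to rotations of the basis
print(linear_cka(X, Z))  # unrelated representations score near 0
```

The rotation invariance matters here: two networks may realise the "same" concepts in different bases, and a universality claim should not depend on the arbitrary choice of basis.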
I.2 Universality of Learnable Concepts

Currently, the concepts that neural networks use and learn differ from those that humans use. If this difference remains for the medium-term future,[44] then we might be interested in whether humans can understand AI concepts that they do not a priori use and understand. Under our Universality framing, we can think of the Principle of Explanatory Optimism as saying that "even if humans and machines do not share all concepts, AI concepts are understandable by humans given the right explanation". That is, though not all concepts are human-understood, they are human-understandable. The Universality of Learnable Concepts is a weaker notion than the Universality of Learned Concepts and is directly entailed by the Universality of Learned Concepts.

[41] While it may be computationally expensive to find such an explanation (that is, interpretability may be difficult), the claim of Explanatory Optimism is that these explanations are available. We might expect that arguments for Explanatory Optimism thus look more like existence proofs than constructions of a concise explanation. Once someone has found a good explanation of a neural network, though, by whatever means, we should expect that this explanation can be understood by a human, given enough time and memory.
[42] After some initial period of architecture-dependent learning due to the initialisation and inductive biases of the network.
[43] Though the level of scale that would be useful for interpretability is not yet clear.
[44] Or, for example, if the scale required for Universality of Learned Concepts is far from our current scale.