
Paper deep dive

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

Kola Ayonrinde, Louis Jaburi

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Theoretical · Embeddings: 92

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 6:19:11 PM

Summary

The paper introduces the 'Explanatory Virtues Framework' for Mechanistic Interpretability (MI), drawing on Bayesian, Kuhnian, Deutschian, and Nomological perspectives from the Philosophy of Science. It aims to provide a systematic, pluralist approach to evaluating and comparing neural network explanations, addressing the lack of universal criteria in the field. The authors identify key virtues—such as simplicity, unification, and hard-to-varyness—and suggest that Compact Proofs are a promising approach for achieving these virtues.

Entities (5)

Explanatory Virtues Framework · framework · 100%
Kola Ayonrinde · researcher · 100%
Louis Jaburi · researcher · 100%
Mechanistic Interpretability · field-of-study · 100%
Compact Proofs · method · 90%

Relation Signals (3)

Explanatory Virtues Framework incorporates perspectives from Philosophy of Science

confidence 100% · We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science

Explanatory Virtues Framework evaluates Mechanistic Interpretability

confidence 95% · The Explanatory Virtues Framework provides a systematic approach for evaluating MI methods

Compact Proofs is a promising approach in Mechanistic Interpretability

confidence 90% · We find that Compact Proofs consider many explanatory virtues and are hence a promising approach.

Cypher Suggestions (2)

List all philosophical perspectives integrated into the framework · confidence 95% · unvalidated

MATCH (f:Framework {name: 'Explanatory Virtues Framework'})-[:INCORPORATES]->(p:Perspective) RETURN p.name

Find all methods evaluated by the Explanatory Virtues Framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'Explanatory Virtues Framework'})-[:EVALUATES]->(m:Method) RETURN m.name
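The suggestions above are marked unvalidated. As a rough illustration of how one might check them, the sketch below runs the first suggested query with the official Neo4j Python driver; the connection URI, credentials, database name, and the assumption that the extraction graph actually uses the Framework/Perspective labels and the INCORPORATES relationship are all hypothetical.

from neo4j import GraphDatabase

# Hypothetical connection details for wherever the extraction graph is stored.
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "password")

# First suggested (unvalidated) query, verbatim.
QUERY = (
    "MATCH (f:Framework {name: 'Explanatory Virtues Framework'})"
    "-[:INCORPORATES]->(p:Perspective) RETURN p.name"
)

def list_perspectives():
    # Run the query and collect the perspective names it returns.
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        records, _, _ = driver.execute_query(QUERY, database_="neo4j")  # database name assumed
        return [record["p.name"] for record in records]

if __name__ == "__main__":
    print(list_perspectives())

If the query returns nothing, that is itself useful signal that the suggested schema does not match the stored graph.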

Abstract

Abstract: Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.

Tags

ai-safety (imported, 100%) · interpretability (suggested, 80%) · mechanistic-interp (suggested, 92%) · theoretical (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

91,655 characters extracted from source content.


arXiv:2505.01372v1 [cs.LG] 2 May 2025 Preprint. Under review. Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability The Strange Science: Part I.i Kola Ayonrinde † UK AI Security Institute Louis Jaburi † Abstract Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question “What makes a good explanation?” We introduce a pluralistExplanatory Virtues Frameworkdrawing on four perspectives from the Philosophy of Science—the Bayesian, Kuhnian, Deutschian, and Nomological—to system- atically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising ap- proach. Fruitful research directions implied by our framework include (1) clearly defining explanatorysimplicity, (2) focusing onunifyingexplana- tions and (3) derivinguniversal principlesfor neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems. 1 Introduction Mechanistic Interpretability is the study of producing causal, scientific explanations of artificial neural networks. Good explanations allow us to monitor and understand AI systems as well as providing affordances for steering and debugging. But what is agood explanation? Wu et al. (2024) observe the following problem: When analysing the same algorithmic task, Chughtai et al. (2023) and Stander et al. (2024) produced what appeared to be two valid Mechanistic Interpretability (MI) explanations of the same model. Yet the mechanisms that they propose are mutually inconsistent. Without systematic criteria for choosing between explanations, it is difficult to give good epistemic reasons for declaring one explanation to be the better one. Without good reasons to choose, researchers may either suspend judgement or resort to disparate and subjective preferences. Similarly, Bolukbasi et al. (2021), Friedman et al. (2024), and Makelov et al. (2024) find that explanations can be misleading. While generated explanations mightseemplausible initially, they may turn out to be incomplete —- or worse, complete confabulations which do not correspond to the model internals. Such explanations give only the illusion of model understanding. We would like a clear guide as to which explanations are likely to be explanatorily faithful to the model internals (Ayonrinde &Jaburi, 2025), and conversely which explanations may be unfaithful, even if seemingly plausible. 1 Recent work has developed evaluation metrics for interpretability with respect to either specific methods (Karvonen et al., 2024), or specific synthetic tasks (Gupta et al., 2024; Thurnherr &Scheurer, 2024). However, there is not a unifying framework that allows us to compare different explanatory methods across a wide variety of tasks. † Correspondence to:koayon@gmail.com,louis.yodj@gmail.com 1 In this sense, explanatory evaluations are useful to avoid interpretability researchers fooling themselves with Interpretability Illusions. As Feynman (1974) put it: “The first principle is that you must not fool yourself — and you are the easiest person to fool.” 1 Preprint. Under review. 
To address this problem, we introduce theExplanatory Virtues Framework, which answers the question:Given two competing explanatory theories, which should we prefer?Our framework draws from the Philosophy of Science, specifically theBayesian,Kuhnian,Deutschian,Nomo- logicalaccounts of explanation and we apply their criteria for theory choice to MI methods. We examine the qualities that we should, and do, seek in good explanations, via theoret- ical analysis and case studies respectively. Using our Explanatory Virtues Framework, we analyse four Mechanistic Interpretability methods: Clustering, Sparse Autoencoders (SAEs), Causal Circuit Analysis, and Compact Proofs. We find that the following Ex- planatory Virtues are often neglected among current MI methods:Simplicity,Unification, Co-Explanation, andNomological Principles. We hence suggest pursuing these virtues as promising research directions. The Explanatory Virtues Framework provides a systematic approach for evaluating MI methods and increasing our understanding of AI systems. Such understanding is useful for AI Safety, AI Ethics, and AI Cognitive Science (Bengio et al., 2025; Anwar et al., 2024; Chalmers, 2025), as well as debugging and improving neural networks (Lindsay &Bau, 2023; Sharkey et al., 2025; Amodei, 2025). Contributions.Our contributions are as follows: •Firstly, we provide a unified account of the Explanatory Virtues in MI. This can be understood as an answer to the question “What makes a good explanation?”. • Secondly, we analyse and compare MI methods with respect to these virtues. •Finally, we suggest new directions for developing MI explanations, beyond the current state of the art. Paper Structure.This paper is organised as follows: In Section 2, we outline a definition of valid explanations in Mechanistic Interpretability, distinguishing MI from other inter- pretability paradigms. In Section 3, we analyse reasons for choosing one explanation over another and introduce theExplanatory Virtues Framework. In Section 4, we provide a critical analysis of MI methods with respect to these Explanatory Virtues. We conclude in Section 5 with a discussion of methodological frontiers in interpretability, and highlight virtues that we believe to be helpful in developing more reliable explanations in MI. Series Structure.This paper is the second in a series titledThe Strange Science of Mechanistic Interpretability, concerning the Philosophy of Mechanistic Interpretability. See Ayonrinde &Jaburi (2025) (Part I.i) for a discussion of the Philosophical Foundations of Mechanistic Interpretability as a practise and its limitations. See also Ayonrinde (2025) (Part I.i) which proposes methods to empower humans by teaching humans Machine Concepts. 2 Valid Explanations in Mechanistic Interpretability Neural network interpretability(henceforth justinterpretability) is the process of understanding artificial neural networks using the scientific method. In this paper we focus onMechanistic Interpretability (MI). Following Ayonrinde &Jaburi (2025), we distinguish Mechanistic In- terpretability from other forms of interpretability noting that Mechanistic Interpretability produces Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations. 2.1 Explanations in Mechanistic Interpretability Good scientific explanations provide answers towhyquestions. 
Typically a scientific expla- nation will provide an answer to the question “Why did the phenomenon occur?” and a good explanation will enable the listener to better comprehend the phenomenon. Explana- tions aim at knowledge. As compression and comprehension are closely linked (Wilkenfeld, 2019), good explanationscompress observations by exploiting regularities in data. Neural networks are classically viewed as black-box prediction machines (Lipton, 2018). However, Ayonrinde &Jaburi (2025) describe an alternativeExplanatory Viewof Neural 2 Preprint. Under review. Networks, emphasising that deep neural networks containrepresentationsandmechanisms that can be understood as providing implicit explanations for their behaviour. As models learn to generalize, they develop internal structures that compress information about the world. Good explanations should uncover these internal structures. 2.2 Defining Mechanistic Interpretability Following Olah et al. (2020); Olsson et al. (2022), Ayonrinde &Jaburi (2025) define Mechanis- tic Interpretability as follows 2 : Technical Definition of Mechanistic Interpretability (Ayonrinde &Jaburi, 2025) Interpretability explanations arevalidMechanistic Interpretability explanations if they areModel-level,Ontic,Causal-Mechanistic, andFalsifiable. • Model-level: Explanations should focus on understanding the neural net- work and not the sampling method or other system-level properties (Arditi, 2024; Zaharia et al., 2024). •Ontic: Explanations should refer to real entities within the model (Salmon, 1984). •Falsifiable: Explanations should yield testable predictions (Popper, 1935). • Causal-Mechanistic: Explanations should identify a step by step continuous causal chain from cause to phenomena, rather than statistical correlations or general laws (Woodward, 2003; Salmon, 1989; Bechtel &Abrahamsen, 2005). 3 The Virtues of Good Explanations “Given two competing explanatory theories, which should we prefer?” This is the question of Theory Choice(Kuhn, 1981; Schindler, 2018; Kuhn, 1962). To answer this question we may look at the properties of explanations. In the sciences, including in the Strange Science of Interpretability, there can be no complete list of sufficient justificatory criteria for an explanation. Explanatory theories cannot be proven true, only falsified (Popper, 1935). However, therearetruth-conducive properties of explanatory theories. We refer to such truth-conducive properties of explanations as Explanatory Virtues. Explanatory Virtues are properties that are reliable indicators of truth. Hence Explanatory Virtues are good reasons to prefer one explanatory theory to another. Whether a property is an Explanatory Virtue is anormativelyloaded; we should epistemically prefer explanations which embody Explanatory Virtues as such explanations are more likely to be true and the aim of scientific explanation is to aim at truth. 3 Conversely, wedescriptively refer to properties of explanations that scientists value in practise asExplanatory Values. We can improve our explanatory theories by increasing the virtue of our explanations. Any individual theory may not embody all the explanatory virtues but, all else equal, we ought to prefer theories that embody more explanatory virtues. 
Similarly, we can improve the epistemic virtue of the Mechanistic Interpretability scientific community by looking to align our Explanatory Values with the Explanatory Virtues (Sosa, 1991); that is, by coming to appropriately value what is good (i.e., truth-conducive) about explanations. In this section, we discuss Explanatory Virtues – the properties that ML researchersshould value. We assess four accounts of explanation: the Kuhnian, Bayesian, Deutschian, and Nomological accounts. If these accounts correctly identify properties that we ought to value, then the combined set of such properties are Explanatory Virtues. These properties will 2 See Ayonrinde &Jaburi (2025) for a more complete exposition. Also see Appendix D.1 for intuitive examples of Explanation Types. 3 Schindler (2018) provides a discussion of the truth-conduciveness of the virtues we discuss. 3 Preprint. Under review. form our pluralistExplanatory Virtues Framework. We provide a mathematical definition for each Explanatory Virtue which serves to ensure that there is a consistent and canonical way to compute each virtue thus allowing for a more objective comparison of explanations. 4 Then in Section 4, we will discuss what MI researchersdovalue in practise, that is the Explanatory Values in Mechanistic Interpretability. We provide a summary of our Pluralist Explanatory Virtues Framework and how the virtues relate to each other in Figure 1. Notation.We denote the explanation under consideration asE∈ E, whereEis the set of all possible explanations andB, the background theory.x T denotes observational data that the explanation is fitted to (training data). We assumex T is sampled from the set of possible observational dataX.x I denotes future observational data that was not accessible at explanation-making time (inference-time data).x T,i is thei-th data point inx T . We denote ka complexity measure (for example, Kolmogorov complexity) and|E| B the description length of an explanationEunder background theoryBmeasured in bits. 3.1 Bayesian Theoretical Virtues Wojtowicz &DeDeo (2020) describe a Bayesian approach to Inference to the Best Explanation (Henderson, 2014). Here, the Explanatory Virtues are the credence-raising properties of the theory. These virtues can be split into two categories:theoretical virtues(in blue), which are properties of the explanation that do not depend on any observed or yet to be observed data, andempirical virtues(in orange), which are properties of the explanation that are defined in relation to the observed data. Accuracy, Precision, and Priors.The Bayesian virtues are the empirical Explanatory Virtue ofAccuracy, the theoretical Explanatory Virtue ofPrecisionand thePriorprobability of some explanation given the background theory. Accuracy represents the probability of the true data given the explanation. Log-likelihood is the logarithm of Accuracy. Similarly, Precision is the expected log-likelihood of data conditional on the explanation being true. Precision represents the degree to which an explanation’s predictions concentrate in a particular region of the space of possible observed data. Higher precision means that the explanation is more constraining in its predictions, making risky and useful predictions that rule out other possibilities, if the explanation is correct. 5 We decompose Accuracy and Precision into further Explanatory Virtues as follows. Descriptiveness and Co-Explanation.Given many data pointsx=x 1 ,x 2 ,. . 
., x_n, we would like to understand how well an explanation explains each data point in isolation and how well it explains multiple data points together. We hence define Descriptiveness as the component of Log-Likelihood where each data observation is considered in isolation, and Co-Explanation as the component of Log-Likelihood which focuses on how an explanation can explain multiple data points together, above its ability to predict any single observation in isolation.

Power and Unification. Analogously, we can break down Precision into the theoretical virtues of Power and Unification, where Power measures the ability to explain individual data points and Unification measures the ability to connect multiple disparate observations together.

[Footnote 4] We believe that canonical definitions are important because Mechanistic Interpretability has previously had several definitions which have been used inconsistently by different researchers, making it difficult to directly compare across methods. Readers may check their intuitive understanding of the textual definitions against the mathematical definitions to ensure they are consistent. We also include a rubric for assessing explanatory methods in Table 2. We further include intuitive examples to illustrate the Explanatory Virtues in Appendix D.2. [Footnote 5] Note that the definition of Precision here is a slightly different notion to the Precision metric in Machine Learning, as in 'Precision-Recall' analysis (Hastie et al., 2009). There, Precision is the fraction of true positives among the predicted positives. Here, by Precision we mean that more precise explanations are more constraining in their predictions.

Glossary of Bayesian Virtues:
$\mathrm{Acc}(E) = P(x_T \mid E)$ (Accuracy)
$\mathrm{Prec}(E) = \mathbb{E}_{x_T \sim X}[\log P(x_T \mid E)]$ (Precision)
$\mathrm{Prior}(E) = P(E \mid B)$ (Prior)
$\mathrm{Desc}(E) = \sum_i \log P(x_{T,i} \mid E)$ (Descriptiveness)
$\mathrm{CoEx}(E) = \log \mathrm{Acc}(E) - \mathrm{Desc}(E) = \log\!\left( \frac{P(x_T \mid E)}{\prod_i P(x_{T,i} \mid E)} \right)$ (Co-explanation)
$\mathrm{Power}(E) = \mathbb{E}_{x_T \sim X}\!\left[ \sum_i \log P(x_{T,i} \mid E) \right]$ (Power)
$\mathrm{Unif}(E) = \mathrm{Prec}(E) - \mathrm{Power}(E) = \mathbb{E}_{x_T \sim X}\!\left[ \log\!\left( \frac{P(x_T \mid E)}{\prod_i P(x_{T,i} \mid E)} \right) \right]$ (Unification)

3.2 Kuhnian Theoretical Virtues

Kuhn (1981) lists five theoretical virtues as a basis for theory choice: Accuracy, (Internal) Consistency, Scope (Unification), Simplicity and Fruitfulness. We previously explored Unification (Scope) and Accuracy in Section 3.1.

Accuracy and Fruitfulness. Accuracy is the extent to which the explanation fits the available data at the time of the creation of such an explanation. We can think of this as the "mundane empirical success" of an explanation, which we can contrast with the "novel empirical success" of an explanation, or its Fruitfulness (Lakatos, 1978). Machine Learning researchers may draw a close analogy here, with Accuracy being a performance measure on the training/validation set and Fruitfulness being a performance measure on a (naturally held-out) test set. [6] Fruitful explanations have reach: they usefully generalise beyond the context of the original problem that the explanation was designed to solve. One particularly important type of Fruitfulness that interpretability researchers often test MI methods for is Pragmatic Utility, the ability for a local explanation to be useful on some downstream task of interest (Marks, 2025). For example, researchers have tested Sparse Autoencoder features on downstream tasks such as unlearning (Karvonen et al., 2024), probing (Wu et al., 2025), building robust classifiers (Gao et al., 2024), and building sparse feature circuits (Marks et al., 2024).
Analysing downstream pragmatic utility ensures that MI methods are directly useful in aiding researchers in tasks that they care about (Lindsay & Bau, 2023; Amodei, 2025).

Consistency. A necessary criterion for a theory to be a good explanation is that it is internally consistent. That is to say, the explanation must not contain any logical contradictions.

Simplicity. Simplicity is considered a key virtue for scientific explanations (White, 2005; Qu, 2023; MacKay, 2003). However, there are many forms of simplicity that may be chosen, and they may rank explanations differently (Lakatos, 1970). We consider the three main measures of simplicity: Parsimony, Conciseness and Complexity. Parsimony counts the number of entities that are posited by the explanation (Wojtowicz & DeDeo, 2020). [7] Conciseness is a Shannon-complexity measure of the information in an explanation, given by its description length (Shannon, 1948; MacKay, 2003). (K-)Complexity is a Kolmogorov-complexity measure of an explanation in terms of the shortest program that can generate it (Kolmogorov, 1965; Hutter et al., 2024). For all simplicity measures, lower values are preferred.

[Footnote 6] Here we allow for the test set to be drawn from the same distribution as the training set or to represent a distribution shift. In science, the analogy of the training and test set being drawn from the same distribution would be if the explanation also fits new data that we didn't observe before making the explanation but plausibly could have observed. The analogy for the distribution shift would be if having the new explanation makes us adversarially seek out new observations to attempt to falsify our explanatory theory. In the latter case, we wouldn't plausibly have observed these new data points before making the explanation.

Glossary of Kuhnian Virtues:
$\mathrm{Fruit}(E) = P(x_I \mid E)$ (Fruitfulness)
$E \text{ is inconsistent} \iff E \models \bot$ (Consistency)
$\mathrm{Pars}(E) = \#\,\text{of entities in } E$ (Parsimony)
$\mathrm{DL}(E) = |E|_B$ (Conciseness)
$k\text{-}\mathrm{Compl}(E) = k(E)$ (Complexity)

3.3 Deutschian Theoretical Virtues

Falsifiability and Hard-to-Varyness. Popper (1935) writes that the key criterion of science is that its theories should be Falsifiable; that is, our explanations should come with a clear set of testable predictions attached. Deutsch (2011) further argues that alongside falsifiability, we should also seek explanations which are themselves Hard-To-Vary. Intuitively, we might think of an explanation E as hard-to-vary if it cannot be easily modified to account for incoming data that contradicts the explanation. More precisely, consider a modification ∆ to an explanation E, where ∆ is some edit operation formed of a list of insertions, deletions, substitutions and transpositions of symbols in E; |∆| is the number of such operations in ∆. The hard-to-varyness criterion then captures the intuition that if you add some modification or "epicycle" ∆ to an explanation E, then the new explanation E' should have lower novel empirical success than E (complexity-weighted). Conversely, if we can add some modification to an explanation and the new explanation has higher mundane and novel empirical success without being more complex, then we should prefer the new explanation. [8] For some complexity measure k, we can then say that an explanation E is hard-to-vary if it is at a local maximum of the function $hv(E) = \log(\mathrm{Acc}(E)) - k(E)$.
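As a minimal numerical sketch (not from the paper), the snippet below evaluates the Bayesian empirical virtues and the hard-to-varyness score for a toy explanation, modelled simply as the probabilities it assigns to observed data; the probability values and the use of a description length in bits as the complexity measure k are illustrative assumptions.

import math

# Toy "explanation": the joint probability it assigns to the observed data x_T,
# and the marginal probability it assigns to each data point in isolation.
joint_prob = 0.02                     # P(x_T | E)
marginal_probs = [0.4, 0.25, 0.3]     # P(x_{T,i} | E) for each observed point
description_length_bits = 12.0        # k(E): assumed complexity of the explanation

accuracy = joint_prob                                        # Acc(E)
descriptiveness = sum(math.log(p) for p in marginal_probs)   # Desc(E)
co_explanation = math.log(accuracy) - descriptiveness        # CoEx(E)
hv_score = math.log(accuracy) - description_length_bits      # hv(E)

print(f"log Acc(E) = {math.log(accuracy):.3f}")
print(f"Desc(E)    = {descriptiveness:.3f}")
print(f"CoEx(E)    = {co_explanation:.3f}")  # positive when E explains the points jointly better than independently
print(f"hv(E)      = {hv_score:.3f}")

Checking that the explanation is actually hard-to-vary would further require comparing hv against nearby edited explanations E', which this sketch does not attempt.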
9 Hard-to-Varyness An explanationEis hard-to-vary if it is at a local maximum of the function hv(E) =log(Acc(E))−k(E)(Hard-to-Varyness) 3.4 Nomological Theoretical Virtues In Hempel &Oppenheim (1948)’s Deductive-Nomological (DN) model of explanation, a scientific explanation is asound deductiveargument where at least one of the premises is a “general law”. For our purposes, we can think of general laws as “for all” statements which are true and not accidentally true. General laws describe necessary rather than contingent facts of the world. For example, “all gases expand when heated under constant pressure” 7 Parsimony is slippery to define well in practise as it is not always clear what counts as an entity. Worse still, parsimony might treat intuitively highly complex objects and very simple objects both equivalently as “entities” and simply count them up without nuance. Baker (2022) provides a discussion of the downsides of Parsimony as a measure of simplicity. 8 We provide a complementary adhocness metric in Appendix E. 9 We informally consider two explanations close if they are a small number of edit operations apart. 6 Preprint. Under review. is a general law whereas “all members of the Greensbury School Board for 1964 are bald” might be true but only by coincidence, as it were. Nomologicity.Though we do not require our explanations to precisely follow the DN model of explanation, theNomologicity(orLawfulness) of an explanation, i.e. whether the explanation appeals to general laws or derives universal principles, is an explanatory virtue. 10 Nomologicity An explanationEis nomological if it appeals to general laws or universal principles about neural networks. 3.5 Explanatory Virtues for Mechanistic Interpretability Figure 1: A Directed Acyclic Graph representation of theExplanatory Virtues Framework showing the relationships between virtues. Empirical virtues are coloured orange and theoretical virtues are coloured blue. We show the virtues which directly depend on each other with bold arrows (→) and those which are highly related with dashed arrows (99K). The Explanatory Virtues which are essential for any scientific explanation (Falsifiability and Causal-Mechanisticity) to be valid are denoted with an exclamation mark; the most impor- tant virtues to decide between explanations (Simplicity, Hard-to-Varyness, and Fruitfulness) are marked with a star. Appendix A details a rubric for assessing explanatory methods. Appendix B provides an example illustrating the importance of Simplicity as an explanatory virtue. We provide a summary of our pluralist Explanatory Virtues Framework and how the virtues relate to each other in Figure 1. These explanatory virtues are not necessarily exhaustive nor completely independent of one another. 11 Some virtues may be in tension with each other. For example, Accuracy may be traded off against Simplicity in some cases. Here we may aim to be at the optimal point of this trade-off on a Pareto frontier. We hope the reader may agree that our Explanatory Virtues both are (1) important consider- ations for the evaluation of explanations and (2) truth-conducive. 12 Thus, these virtues can 10 Myers (2012) contrasts nomological explanations withmechanisticexplanations, considering the former to be explanations at a higher level and the latter to be explanations at a lower level. 
However, Ayonrinde &Jaburi (2025) note that this is a false dichotomy: The entities in mechanistic explanations can be emergent entities and citing general laws can aid in the explanation and unification of low-level phenomena. 11 We detail an additional possible virtue in Appendix F. 12 That is to say that all else equal explanations which embody these virtues are more likely to be true. We refer readers to Schindler (2018) for a detailed discussion of the truth-conduciveness of many of the virtues we discuss. 7 Preprint. Under review. be a useful guide for theory choice and, more generally, can aid in the developments of new explanatory methods. Mechanistic Interpretability researchers, we argue, ought to value the Explanatory Virtues. For an explanation to be agoodexplanation in Mechanistic Interpretability, it must first be a validMI explanation. In Section 2.2 we identified valid MI explanations as those which are Model-Level, Ontic, Causal-Mechanistic, Falsifiable. Validity requires all of the four validity conditions above to be met. The Explanatory Virtues, then, allow us to assess the quality of valid MI explanations and provide epistemic reasons to prefer one explanation over another. 4Explanations in the Wild: Case Studies in Mechanistic Interpretability In Section 3, we explored the Explanatory Virtues. These values included the Theoretical Explanatory Virtues ofPrecision,Power,Unification,Consistency,Simplicity,Nomologicity, FalsifiabilityandHard-To-Varynessas well as the Empirical Explanatory Virtues of (Mundane) Accuracy,Descriptiveness,Co-ExplanationandFruitfulness. We now consider how these virtues are instantiated in the methods that Mechanistic Interpretability researchers use in practice. That is, we consider howvaluedeach Explanatory Virtue is within MI methods. Visual summaries of the methods we discuss in this section can be found in Appendix C. 4.1 Examples 4.1.1 Clustering (Activations or Inputs) One primitive form of neural network explanation is a clustering of model inputs or acti- vations. For a complex model, such an explanation will not typically be highly accurate. However, this explanationisa simplification of the overall model performance. Here we might imagine finding some partition of the input/activation space, mapping a given input xto its associate cluster, of whichxis ideally a typical member. Then we may take the cluster (and possibly the output of the model on some cluster representative) as a proxy for the model’s behaviour. 13 Though this explanation is clearly not sufficient in many cases, we note that it does perform some compression of the input space and we can control the simplicity of the explanation by varying the number of clusters. Similarly, the explanation generated here is Falsifiable; we can test how well our cluster model predicts the behaviour of the original model. However, this explanation clearly falls down by not being Causal-Mechanistic in nature, and the Fruitfulness of the explanation may be low if the procedure is vulnerable to outliers. 4.1.2 Sparse Autoencoder Explanations of Representations/Activations Sparse Autoencoders (SAEs) can be used to decompose the representations of neural acti- vations into a linear combination of sparsely activating, disentangled and monosemantic latents (Bricken et al., 2023; Huben et al., 2024). Though many evaluation schemes have been proposed for SAEs (Karvonen et al., 2024; Wu et al., 2025), the primary axes on which SAE explanations are evaluated is onEmpirical accuracyandSimplicity. 
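To illustrate the two kinds of simplicity measure contrasted in the MDL-SAE discussion below, here is a hedged sketch comparing L0 sparsity (a parsimony-style count of active latents) with a naive description-length estimate (a conciseness-style measure) for a toy SAE activation vector; the bits-per-latent costing is a simplifying assumption, not the actual MDL-SAE procedure.

import math

# Toy SAE latent activations for a single token (illustrative values).
latent_activations = [0.0, 1.7, 0.0, 0.0, 0.3, 0.0, 2.2, 0.0]
dictionary_size = len(latent_activations)
value_precision_bits = 8  # assumed bits needed to state each active latent's value

active = [a for a in latent_activations if a != 0.0]

# Parsimony-style simplicity: count the posited entities (active latents).
l0_sparsity = len(active)

# Conciseness-style simplicity: a naive description length in bits --
# which latents fire (index cost) plus how strongly they fire (value cost).
index_bits = l0_sparsity * math.log2(dictionary_size)
description_length = index_bits + l0_sparsity * value_precision_bits

print(f"L0 sparsity (parsimony):          {l0_sparsity}")
print(f"Description length (conciseness): {description_length:.1f} bits")
# Widening the dictionary leaves L0 unchanged but raises the per-latent index cost,
# one reason the two measures can rank SAE explanations differently.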
Here Accuracy represents either a local unsupervised accuracy measure like reconstruction error, or the downstream performance of the interpreted model when the SAE reconstructions are patched into the model in place of the original activations. MDL-SAEs.Ayonrinde et al. (2024) provide a useful case study of how different types of Simplicity measures may be more or less principled in different contexts. Within the MDL- SAE framework, SAE explanations are evaluated on Accuracy, Novel Empirical Success and Conciseness, whereConcisenessis an information theoretic measure of Simplicity (see Section 3.2). This stands in contrast to the classical SAE framework where the simplicity measure is instead the SAE latent sparsity, aparsimonymeasure. In this case, changing the 13 We may think of the clustering explanation as performing some “quotienting” operation of the input space by the equivalence relation of being in the same cluster. 8 Preprint. Under review. simplicity measure from sparsity (Parsimony) to description length (Conciseness) solved three key problems for SAEs: avoiding undesired feature splitting, enabling principled choice of SAE width, and ensuring uniqueness of feature-based explanation (Ayonrinde, 2024). Explanatory Virtues for SAEs.SAE explanations, like most ML methods, value Falsifi- ability and Novel Empirical Success (predictions beyond the training set). There is also some Co-Explanatory power in that the same feature dictionary should be used to explain any activations (at least from the same layer of the model). However, SAE explanations might be Ad-hoc and not Hard-to-Vary. As noted by Braun et al. (2024), contributions from features activated on SAEs trained for reconstruction may have little effect on the downstream performance of the model. Hence the corresponding feature activations are effectively free parameters. Similarly, the tendency to enlarge the feature dictionary (i.e. increase the SAE width) or add additional active features to explanations (i.e. increase the allowableℓ 0 norm of the feature activations vector) without clear justification, suggests an implicit ad-hocness in the explanations. MDL-SAEs provide some guidance against the ever increasing size of the feature dictionary, however it still remains an open question as to how to ensure that SAE explanations are truly hard-to-vary and pick out features which are causally relevant to the downstream behaviour of the model (Leask et al., 2025). 4.1.3 Causal Abstraction Explanations of Circuits As in neuroscience, a natural way to explain the behaviour of a neural network for inter- pretability researchers is to decompose the network into circuits (Olah et al., 2020; Kandel et al., 2000). Circuits can be formally specified by a correspondence between the network and some understood high-level causal model using the theory of Causal Abstractions (Geiger et al., 2023; Woodward, 2003; Beckers &Halpern, 2019; Pearl, 2009). In particular, the notion of abstraction that is typically appealed to is constructive abstraction (Beckers &Halpern, 2019). Paraphrasing from Geiger et al. (2021), a high-level model (an understandable causal model) is aconstructive abstractionof a low-level model if we can partition the variables in the low-level model (e.g. the neural network neurons) such that: 1. Each low-level partition cell can be assigned to a high-level variable. 2.There is a systematic correspondence between interventions on the low-level parti- tion cells and interventions on the high-level variables. 
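The following deliberately trivial sketch, written for this summary rather than taken from Geiger et al. or Beckers & Halpern, illustrates condition 2 above: a partition cell of low-level "neurons" is aligned with one high-level variable, and we check that matched interventions on either side give matching outputs. The particular functions, the (v, v) realisation of interventions, and the exhaustive enumeration are all toy assumptions.

# Low-level "model": two hidden neurons feeding an output.
def low_level(a, b, cell_override=None):
    n1, n2 = (a and b), (a or b)          # the partition cell {n1, n2}
    if cell_override is not None:
        n1, n2 = cell_override            # low-level intervention on the cell
    return int(n1)

# High-level causal model: one variable S ("a AND b") determines the output.
def high_level(a, b, s_override=None):
    s = (a and b) if s_override is None else s_override  # high-level intervention
    return int(s)

# Alignment: setting S = v is realised at the low level by fixing the cell to (v, v).
def realise(v):
    return (v, v)

# Systematic correspondence check: matched interventions must give matched outputs.
for a in (0, 1):
    for b in (0, 1):
        for v in (0, 1):
            assert low_level(a, b, cell_override=realise(v)) == high_level(a, b, s_override=v)
print("Toy intervention-correspondence check passed.")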
The Causal Abstraction framework for circuit analysis clearly focuses on the Falsifiability of explanations and theFaithfulnessof the explanation to the underlying causal model (Empirical Accuracy and Novel Success under interventions). To encourage simplicity in explanations, 14 , we may also seekCompletenessandMinimalityin circuit explanations (Wang et al., 2023). (Behavioural) Faithfulness, Completeness, and Minimality are denoted the FCMcriteria for circuit explanations. FCM Criteria for Circuits.ForCa proposed circuit andMthe model, theCompleteness criterion states that for every subsetK⊂C, the incompleteness score|F(C )−F(M )| should be small. Intuitively, a circuit is complete if the function of the circuit and the model remain similar under ablations. Conversely, theMinimalitycriterion states that for every nodev∈Cthere exists a subsetK⊆C\vthat has high minimality score|F(C\(K∪ v))−F(C )|. Intuitively, a circuit is minimal if it doesn’t contain components which are unnecessary for the function of the circuit. Algorithms such as ACDC (Conmy et al., 2023) find circuits that (approximately) satisfy the FCM criteria. However, it is well known (Wang et al., 2023) that the FCM criteria are in tension and that it is not always possible to satisfy all three criteria simultaneously. In 14 After all, it’s not clear what the point of a highly complex abstraction would be when the network itself can be viewed as a causal model if we disregarded the simplicity criterion. 9 Preprint. Under review. practise, finding circuits is a computationally challenging problem and circuit discovery algorithms typically only find approximately optimal circuits (Adolfi et al., 2024). 15 Explanatory Virtues for Circuit Explanations.Despite the virtues of these approaches, they however do suffer from poor unification, co-explanation and nomologicity. In both manual and automated circuit discovery methods, most attention is paid to individual circuits rather than the relation and composition of subcircuits. Circuit explanations for two related tasks which share internal components are not typically privileged. Similarly, there are often no general laws or principles that detail which circuits are likely to be found in a network, and how these circuits relate to one another across contexts. 4.2 Compact Proofs The above examples of Clustering, SAEs and Circuits are methods for both thecreationof explanations and also provideevaluation methodsfor the explanations created. The Compact Proofs methodology (Gross et al., 2024; Wu et al., 2024; Jaburi et al., 2025) is a method for evaluatinganyCausal-Mechanistic explanations obtained through other methods. In the Compact Proofs framework, an explanation is converted into a formal guarantee that allows researchers to assess the Accuracy and Simplicity of the explanation. Given a data distributionD, and a modelM θ with weightsθ∈W, we would like to obtain a lower bound for the model’s accuracy overD. 16 Formally, we construct averifierprogram V(θ,E),whereEis the explanation. The aim forVis to return a worst-case bound on the model’s performance that is as tight as possible with the proof that bound holds being as computationally efficient as possible. We may think of the computational efficiency as a measure of the simplicity of the proof (Xu et al., 2020). Note that these two goals, thetightness (Accuracy) of the bound and thecompactness(Simplicity) of the proof (explanation), are in tension with one another. 
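As a toy illustration of this tension, written for this summary rather than drawn from the Compact Proofs papers, the sketch below compares a brute-force verifier (tight bound, exponential cost) with a verifier that trusts a mechanistic explanation of a toy model ("the model only reads feature 0"); the model, labelling rule, and case split are invented, and the cheap bound is only sound because the explanation is faithful here, which is the leverage the text attributes to Gross et al. (2024).

from itertools import product

N_FEATURES = 10

def model(x):
    return x[0]        # toy "trained model": it genuinely only reads feature 0

def label(x):
    return x[0]        # ground-truth rule the model is meant to match

# Brute-force verifier: exact accuracy, at the cost of 2**N_FEATURES evaluations.
def brute_force_bound():
    inputs = list(product((0, 1), repeat=N_FEATURES))
    correct = sum(model(x) == label(x) for x in inputs)
    return correct / len(inputs), len(inputs)

# Explanation-based verifier: since (here) both the model and the labelling rule
# factor through feature 0, checking one representative input per value of that
# feature bounds worst-case behaviour over every input.
def compact_bound():
    worst, checks = 1.0, 0
    for v in (0, 1):
        x = (v,) + (0,) * (N_FEATURES - 1)
        worst = min(worst, float(model(x) == label(x)))
        checks += 1
    return worst, checks

print(brute_force_bound())  # (1.0, 1024): tight but expensive
print(compact_bound())      # (1.0, 2): same worst-case bound from far less verification work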
A good explanation should push out the (tightness, compactness)- Pareto frontier. 17 Gross et al. (2024) show that faithful mechanistic explanations lead to tighter performance bounds and more efficient (i.e. simpler) proofs. Informally, we may say that Compact Proofs allow us to leverage good MI explanations into tighter and more compact proof bounds. We note that this method allows for finding and evaluating explanations which satisfy many of the Explanatory Virtues: Precise explanations allow for tighter bounds, Accuracy and Simplicity are directly optimised for, and Causal-Mechanistic explanations are generally required for non-vacuous bounds. 18 4.3 Discussion of Explanatory Values Table 1 shows that some Explanatory Virtues are consistently valued highly across differ- ent methods. However, all current interpretability methods could be improved on some dimension to be more likely to produce human-understandable and useful explanations. In particular, we suggest that methods which produce or appeal to nomological principles and which unify accounts of neural network behaviour are likely to be increasingly successful. 15 Shi et al. (2024) provide hypothesis tests for circuits which test the related criteria of equivalence, independence and minimality. Their approach is a practical method for evaluating circuit explanations. 16 In general, we might be interested in bounding metrics which are to be minimised (e.g. loss) rather than maximised (e.g. accuracy and reward). In that case we may seek upper bounds rather than lower bounds but the argument is otherwise analogous. 17 Appendix B provides an example of one basic proof strategy which is computationally expensive but provides a tight bound. This strategy is known as thebrute force proof(Gross et al., 2024) and corresponds to thestraightforward, Implementation-level explanation(Ayonrinde &Jaburi, 2025). 18 At present it is not known how to scale the Compact Proofs methodology to much larger models with additional superposition noise and still get informative, non-vacuous bounds. This remains an open problem and a gold standard for evaluating explanations. 10 Preprint. Under review. Explanatory VirtueImportanceClustering(MDL) SAEsCircuitsCompact Proofs Validity Causal- Mechanistic!✗●✓ Bayesian Precision●✓ Priors●✗ Descriptiveness●✓ Co-explanation✗● Power●✓ Bayesian&Kuhnian Accuracy✓ Unification✗ Kuhnian Consistency●✓ Simplicity★●✓●✓ Fruitfulness★●✗● Deutschian Falsifiable!✓ Hard-to-vary★●✓ Nomological Nomological✗● Table 1: An evaluation of MI explanation methods with respect to our Explanatory Virtues Framework as given in Section 3. The virtues which are indispensable for valid Mechanistic Interpretability explanations are highlighted with a!. The virtues that we consider to be the most important for good explanations are highlighted with a★. Metrics are grouped by their philosophical foundations: Deutschian, Kuhnian, Bayesian, or Nomological. Blue metrics indicate empirical criteria, while orange metrics represent theoretical criteria. Green checks, orange circles and red crosses indicate that the method well-considers, moderately considers, or poorly considers the virtue, respectively. The explanatory case studies that we have considered generally optimise for accuracy, however they vary in their commitment to the virtues of Simplicity, Unification and Nomologicity. 
In our descriptions of these methods across Section 4, we provide a more detailed analysis of how we assess the virtues of each method and we provide our full evaluation rubric in Table 2. 5 The Road Ahead “Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.” — Poincaré (1905) The field of Mechanistic Interpretability was founded by Olah et al. (2020) to distinguish it- self from previous approaches of neural network interpretability. These previous approaches were not sufficiently grounded in causal abstraction, nor treated the model internals appro- priately as representing explanations as intrinsic structure that we would like to uncover (Ayonrinde &Jaburi, 2025; Saphra &Wiegreffe, 2024). The ‘Mechanistic turn’ in interpretabil- ity was a step towards unifying a community around faithful and falsifiable explanations of models. The Explanatory Virtues Framework is a further step in this direction, providing unifying criteria to evaluate explanatory methods. In particular, focusing on the following three virtues would constitute methodological progress for the field: 1. Simplicity and Compression.Swinburne (1997) argues that simplicity is a key virtue of good explanations and can provide evidence to the truth of a theory. However, appro- 11 Preprint. Under review. priately characterising an explanatory Simplicity measure is currently an open question for interpretability. 19 Early explorations into understanding compression as a key func- tion of explanation can be found in the Compact Proofs literature (Gross et al., 2024) and Attribution-Based Parameter Decomposition (Braun et al., 2025). Coalescing around a con- cept of Simplicity for interpretability would allow different explanations to be rigorously compared on the (accuracy, simplicity) Pareto curve, which is directly useful in many appli- cations. Such a definition might also naturally encourage further research into the impact of modularity in both neural networks and their explanations (Clune et al., 2013; Filan et al., 2021; Baldwin &Clark, 1999). 2. Unification and Co-Explanation.Hempel (1966) argues that unification is a core driver of scientific progress. 20 Indeed we may see unification as a drive towards compression of explanations where the set of phenomena to be explained is large (Bassan et al., 2024; Bhattacharjee &von Luxburg, 2024). Currently, most methods in interpretability don’t seek to co-explain many phenomena using the same building blocks. The Mechanistic Interpretability (MI) community has sought to understand the universality (or otherwise) of representations and algorithms across many models with mixed results (Olah et al., 2020; Olsson et al., 2022; Chughtai et al., 2023). However, we may also be interested in modular compositional explanations where the explanatory units are shared not only across models but also across different tasks and domains within a single model. For example, there is evidence that induction heads are reused for many tasks within models and so induction heads perform a co-explanatory function (Olsson et al., 2022). 3. Nomological Principles.Bacon (1620) writes that any science first starts by observations. After that point, most fields have a choice to make between two (non-exclusive) paths that Windelband (1894) refers to as thenomotheticandidiographicapproaches. 
The nomothetic approach seeks to rapidly synthesise these early observations into general explanatory theories with nomological principles that are useful for making predictions. Conversely, the idiographic approach focuses on categorising and describing ever more exhaustive sets of observations, without necessarily seeking general laws to explain them. 21 Physics is a prototypical nomothetic science; biology is often considered an idiographic science. 22 Idiographic approaches can tend towardsdescriptionrather thanexplanation. For example, we might wonder if interpretability researchers counting up and categorising all the features in a given model’s latent space is much different to a biologist naming and describing all the species of beetle in an ecosystem without learning anything about the evolution of these species or how they interact within the environment. The use of nomological principles can simplify explanations and help to provide a unify- ing paradigm for Mechanistic Interpretability. Efforts in Developmental Interpretability (Hoogland et al., 2024), the Physics of Intelligence (Allen-Zhu &Li, 2024), Computational Mechanics (Shai et al., 2024), and the Science of Deep Learning (Lubana et al., 2023; Allen- Zhu &Li, 2023) may also produce useful nomological principles for the MI community to adopt in their explanations. Mechanistic Interpretability has begun to congeal into a genuine field, with survey papers (Bereska &Gavves, 2024), and a (major conference) workshop (Barez et al., 2024). There are principles, problems and methods that many members of the Mechanistic Interpretability community adopt, even if there are still many open questions and methodological disputes to address. Though it is not strictly necessary to adopt a nomothetic approach for a field to have a paradigm for Normal Science in the Kuhnian sense (Kuhn, 1962), nomothetically ori- ented fields with laws and principles to test and critique have historically tended to see more rapid scientific progress. We might regard the move towards a nomothetic and unifying 19 Ayonrinde et al. (2024); Ayonrinde &Jaburi (2025) argue that good explanations in interpretability can primarily be understood as compressions of information. 20 See also Kitcher (1981). 21 Physicist Ernest Rutherford famously (and somewhat disparagingly) referred to this difference as partitioning sciences into two cultures: physics and stamp collecting (Bernal, 1940). 22 We might think of History as a highly idiographic field outside of the sciences. 12 Preprint. Under review. approach to explanations in Mechanistic Interpretability as a move towards a more mature science. Efforts in the Science of Deep Learning seeking to develop principles for neural networks may provide a basis for nomological principles for Mechanistic Interpretability. Shimi (2024) writes:“At the beginning of every science there’s a guy who’s just cataloguing rocks.” We might add:And then it turns out that we can use these observations to build a theory. We should not be forever just cataloguing rocks; we do actually have to build a theory at some point! Mechanistic Interpretability has found Causal Abstractions theory to be a useful foundation. We suggest that a further paradigm for Mechanistic Interpretability should take seriously the virtues of good explanations. The Explanatory Virtues allow us to iteratively build better interpretability methods and generate increasingly good explanations of neural networks. 
Progress in Mechanistic Interpretability may provide insights into AI systems which are useful for increasing the transparency and safety of systems which are deployed widely and/or in critical applications (Bengio et al., 2025; Rivera et al., 2024; Sharkey et al., 2025). We believe that our Explanatory Virtues Framework can help researchers in designing methods which lead to more reliable and useful explanations of neural systems. Acknowledgments Thanks to Nora Belrose, Matthew Farr, Sean Trott, Evžen Wybitul, Andy Artiti, Owen Parsons, Kristaps Kallaste and Egg Syntax for comments on early drafts. Thanks to Elsie Jang, Alexander Gietelink Oldenziel, Jacob Pfau, Catherine Fist, Lee Sharkey, Michael Pearce, Mel Andrews, Daniel Filan, Jason Gross, Samuel Schindler, Dashiell Stander, Geoffrey Irving and attendees of the ICML MechInterp Social for useful conversations. We’re grateful to Kwamina Orleans-Pobee, Will Kirby and Aliya Ahmad for additional support. This project was supported in part by a Foresight Institute AI Safety Grant. Reproducibility Statement The comparative evaluation of explanation methods presented in Table 1 can be reproduced by applying the Explanatory Virtues Rubric detailed in Table 2. This rubric provides clear criteria for assessing the extent to which different Mechanistic Interpretability methods embody each explanatory virtue. By following the three-level assessment framework (Highly Virtuous, Weakly Virtuous, Not Virtuous) with their corresponding indicators (✓, ●,✗), researchers can systematically evaluate explanation methods against the Explanatory Virtues Framework. The rubric’s structured approach ensures that assessments are based on consistent criteria rather than subjective preferences, allowing for reproducible comparisons between different explanation methods in Mechanistic Interpretability. Ethics Statement This work focuses on developing a philosophical framework for evaluating explanations in the context of Mechanistic Interpretability of neural networks. As a theoretical contribution, our framework itself does not directly raise ethical concerns typically associated with empirical AI research, such as data privacy, bias, or direct societal impacts. However, we recognize that advances in Mechanistic Interpretability have significant ethical implications. Better explanations of AI systems, which our framework aims to encourage, can promote transparency, accountability, and trust in AI systems. We note that improved understand- ing of neural networks through Mechanistic Interpretability may contribute to AI Safety, AI Ethics, and the responsible deployment of AI systems in critical applications. By pro- viding systematic criteria for evaluating explanations, our work supports the responsible development of AI that is interpretable and human-understandable. We hope this work contributes to the broader goal of developing AI systems that can be meaningfully understood, monitored, and steered by humans. 13 Preprint. Under review. References Federico Adolfi, Martina G Vilas, and Todd Wareham. The computational complexity of circuit discovery for inner interpretability.arXiv preprint arXiv:2410.08025, 2024. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction.arXiv preprint arXiv:2309.14316, 2023. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. InForty-first International Conference on Machine Learning, 2024. Dario Amodei. 
The urgency of interpretability, 2025. URLhttps://w.darioamodei.com/ post/the-urgency-of-interpretability. Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models.arXiv preprint arXiv:2404.09932, 2024. Andy Arditi. Ai as systems, not just models, 2024. URLhttps://w.lesswrong.com/posts/ 2po6bp2gCHzxaccNz/ai-as-systems-not-just-models. Kola Ayonrinde. Standard saes might be incoherent: A choosing problem & a “concise” solution. Blog post, 2024. URLhttps://w.lesswrong.com/posts/vNCAQLcJSzTgjPaWS/ standard-saes-might-be-incoherent-a-choosing-problem-and-a. Kola Ayonrinde. Position: Interpretability is a bidirectional communication problem. In ICLR 2025 Workshop on Bidirectional Human-AI Alignment, 2025. URLhttps://openreview. net/forum?id=O4LaRH4zSI. Kola Ayonrinde and Louis Jaburi. A mathematical philosophy of explanations in mechanistic interpretability: The strange science part i.i, 2025. forthcoming. Kola Ayonrinde, Michael T. Pearce, and Lee Sharkey. Interpretability as compression: Reconsidering sae explanations of neural activations with mdl-saes, 2024. URLhttps: //arxiv.org/abs/2410.11179. Francis Bacon.Novum Organum. Clarendon Press, London, 1620. URLhttps://en. wikipedia.org/wiki/Novum_Organum. Part of theInstauratio Magna. Alan Baker. Simplicity. In Edward N. Zalta (ed.),The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2022 edition, 2022. Carliss Y Baldwin and Kim B Clark.Design Rules: The Power of Modularity Volume 1. MIT press, 1999. Fazl Barez, Mor Geva, Lawrence Chan, Atticus Geiger, Kayo Yin, Neel Nanda, and Max Tegmark. Icml 2024 mechanistic interpretability workshop, 2024. URLhttps: //icml2024mi.pages.dev/. Shahaf Bassan, Guy Amir, and Guy Katz. Local vs. global interpretability: A computational complexity perspective.arXiv preprint arXiv:2406.02981, 2024. William Bechtel and Adele Abrahamsen. Explanation: A mechanist alternative.Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 36(2):421–441, 2005. doi:10.1016/j.shpsc.2005.03.010. Sander Beckers and Joseph Y. Halpern. Abstracting causal models. InProceedings of the 33Rd Aaai Conference on Artificial Intelligence, p. 2678–2685. 2019. Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, et al. Interna- tional ai safety report.arXiv preprint arXiv:2501.17805, 2025. 14 Preprint. Under review. Leonard Bereska and Efstratios Gavves. Mechanistic Interpretability for AI Safety – A Review, April 2024. URLhttp://arxiv.org/abs/2404.14082. arXiv:2404.14082 [cs]. J. Bernal.The social function of science.Philosophical Review, 49(n/a):377, 1940. doi:10.2307/2180883. Robi Bhattacharjee and Ulrike von Luxburg. Auditing local explanations is hard. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=ybMrn4tdn0. Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can ex- plain neurons in language models.https://openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html, 2023. 
Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An Interpretability Illusion for BERT, April 2021. URLhttp: //arxiv.org/abs/2104.07143. arXiv:2104.07143 [cs]. Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying Func- tionally Important Features with End-to-End Sparse Dictionary Learning, May 2024. URL http://arxiv.org/abs/2405.12241. arXiv:2405.12241 [cs]. Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, and Lee Sharkey. In- terpretability in parameter space: Minimizing mechanistic description length with attribution-based parameter decomposition.arXiv preprint arXiv:2501.14926, 2025. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.Transformer Circuits Thread, 2023. David J Chalmers. Propositional interpretability in artificial intelligence.arXiv preprint arXiv:2501.15740, 2025. Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. InInternational Conference on Machine Learning, p. 6243–6267. PMLR, 2023. Jeff Clune, Jean-Baptiste Mouret, and Hod Lipson. The evolutionary origins of modularity. Proceedings of the Royal Society b: Biological sciences, 280(1755):20122863, 2013. Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. 2023. URLhttps://arxiv.org/abs/2304.14997. Francis Crick. Central dogma of molecular biology.Nature, 227(5258):561–563, 1970. David Deutsch.The beginning of infinity: Explanations that transform the world. penguin uK, 2011. Frank Watson Dyson, Arthur Stanley Eddington, and Charles Davidson. Ix. a determination of the deflection of light by the sun’s gravitational field, from observations made at the total eclipse of may 29, 1919.Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 220(571-581):291–333, 1920. A. Einstein. The foundation of the general theory of relativity. 1916. Richard P. Feynman. Cargo cult science.Engineering and Science, 37(7):10–13, 1974. ISSN 0013-7812. URLhttp://resolver.caltech.edu/CaltechES:37.7.CargoCult. 15 Preprint. Under review. Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, and Stuart Russell. Clusterability in neural networks.arXiv preprint arXiv:2103.03386, 2021. Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, and Asma Ghandeharioun. Interpretability illusions in the generalization of simplified models. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders, June 2024. URLhttp://arxiv.org/abs/2406.04093. arXiv:2406.04093 [cs] version: 1. Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. 
Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 9574–9586. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4f5c422f4d49a5a807eda27434231040-Paper.pdf.

Atticus Geiger, Chris Potts, and Thomas Icard. Causal abstraction for faithful model interpretation. arXiv preprint arXiv:2301.04709, 2023.

Google Developers. Clustering algorithms. https://developers.google.com/machine-learning/clustering/clustering-algorithms, 2025. Accessed: 2025-02-23.

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, and Lawrence Chan. Compact proofs of model performance via mechanistic interpretability. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Rohan Gupta, Iván Arcuschin, Thomas Kwa, and Adrià Garriga-Alonso. InterpBench: Semi-synthetic transformers for evaluating mechanistic interpretability techniques. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 92922–92951. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/a8f7d43ae092d9a5295775eb17f3f4f7-Paper-Datasets_and_Benchmarks_Track.pdf.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009. ISBN 9780387848846. URL https://books.google.co.uk/books?id=eBSgoAEACAAJ.

Carl G. Hempel and Paul Oppenheim. Studies in the logic of explanation. Philosophy of Science, 15(2):135–175, 1948. ISSN 00318248, 1539767X. URL http://www.jstor.org/stable/185169.

Carl Gustav Hempel. Philosophy of Natural Science. Prentice-Hall, Englewood Cliffs, N.J., 1966.

Leah Henderson. Bayesianism and inference to the best explanation. The British Journal for the Philosophy of Science, 2014.

Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024.

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.

M. Hutter, E. Catt, and D. Quarel. An Introduction to Universal Artificial Intelligence. Chapman & Hall/CRC Artificial Intelligence and Robotics Series. Chapman & Hall/CRC Press, 2024. ISBN 9781003460299. URL https://books.google.co.uk/books?id=cfg60AEACAAJ.

Louis Jaburi, Ronak Mehta, Soufiane Noubir, and Jason Gross. Fine-tuning neural networks to match their interpretation: Towards scaling compact proofs, 2025. Forthcoming.

E. R. Kandel, J. H. Schwartz, and T. Jessell. Principles of Neural Science, Fourth Edition. McGraw-Hill Companies, Incorporated, 2000. ISBN 9780838577011. URL https://books.google.co.uk/books?id=yzEFK7Xc87YC.

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, and Neel Nanda. SAEBench: A comprehensive benchmark for sparse autoencoders, 2024. URL https://www.neuronpedia.org/sae-bench/info.

D. Kennefick. No Shadow of a Doubt: The 1919 Eclipse That Confirmed Einstein's Theory of Relativity. Princeton University Press, 2021. ISBN 9780691217154. URL https://books.google.co.uk/books?id=_Eb8DwAAQBAJ.
Philip Kitcher. Explanatory unification. Philosophy of Science, 48(4):507–531, 1981.

Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.

Thomas S. Kuhn. Objectivity, value judgment, and theory choice. In David Zaret (ed.), Review of Thomas S. Kuhn, The Essential Tension: Selected Studies in Scientific Tradition and Change, pp. 320–39. Duke University Press, 1981.

Thomas Samuel Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1962.

Imre Lakatos. Falsification and the methodology of scientific research programmes. In Imre Lakatos and Alan Musgrave (eds.), Criticism and the Growth of Knowledge, pp. 91–196. Cambridge University Press, 1970.

Imre Lakatos. The Methodology of Scientific Research Programmes. Cambridge University Press, New York, 1978.

Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, and Neel Nanda. Sparse autoencoders do not find canonical units of analysis. arXiv preprint arXiv:2502.04878, 2025.

Grace W. Lindsay and David Bau. Testing methods of neural systems understanding. Cognitive Systems Research, 82:101156, December 2023. URL https://doi.org/10.1016/j.cogsys.2023.101156.

Zachary C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.

Ekdeep Singh Lubana, Eric J. Bigelow, Robert P. Dick, David Krueger, and Hidenori Tanaka. Mechanistic mode connectivity. In Proceedings of the 40th International Conference on Machine Learning, pp. 22965–23004, 2023.

David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Aleksandar Makelov, Georg Lange, Atticus Geiger, and Neel Nanda. Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ebt7JgMHv1.

Sam Marks. Downstream applications as validation of interpretability progress, March 2025. URL https://www.lesswrong.com/posts/wGRnzCFcowRCrpX4Y/downstream-applications-as-validation-of-interpretability.

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, March 2024. URL http://arxiv.org/abs/2403.19647. arXiv:2403.19647 [cs].

James Myers. Cognitive styles in two cognitive sciences. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 34, 2012.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024.

Judea Pearl. Causality. Cambridge University Press, 2009.

H. Poincaré. Science and Hypothesis. Library of Philosophy, Psychology and Scientific Methods. Science Press, 1905. URL https://books.google.co.uk/books?id=5nQSAAAAYAAJ.
Karl R. Popper. The Logic of Scientific Discovery. Routledge, London, England, 1935.

Hsueh Qu. Hume on theoretical simplicity. Philosophers' Imprint, 23(1), 2023. doi:10.3998/phimp.1521.

Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, and Jacquelyn Schneider. Escalation risks from language models in military and diplomatic decision-making. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT '24, pp. 836–898, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400704505. doi:10.1145/3630106.3658942. URL https://doi.org/10.1145/3630106.3658942.

Wesley Salmon. Four Decades of Scientific Explanation. 1989. URL https://api.semanticscholar.org/CorpusID:46466034.

Wesley C. Salmon. Scientific Explanation and the Causal Structure of the World. Princeton University Press, 1984. ISBN 9780691101705.

Naomi Saphra and Sarah Wiegreffe. Mechanistic?, 2024. URL https://arxiv.org/abs/2410.09087.

Samuel Schindler. Theoretical Virtues in Science: Uncovering Reality Through Theory. Cambridge University Press, Cambridge, 2018.

Adam Shai, Paul M. Riechers, Lucas Teixeira, Alexander Gietelink Oldenziel, and Sarah Marzen. Transformers represent belief state geometry in their residual stream. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=YIB7REL8UC.

Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath. Open problems in mechanistic interpretability, 2025. URL https://arxiv.org/abs/2501.16496.

Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, and David Blei. Hypothesis testing the circuit hypothesis in LLMs. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=ibSNv9cldu.

Adam Shimi. The golden mean of scientific virtues, 2024. URL https://formethods.substack.com/p/the-golden-mean-of-scientific-virtues.

Ernest Sosa. Knowledge in Perspective: Selected Essays in Epistemology. Cambridge University Press, New York, 1991.

Dashiell Stander, Qinan Yu, Honglu Fan, and Stella Biderman. Grokking group multiplication with cosets. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 46441–46467. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/stander24a.html.

Richard Swinburne. Simplicity as Evidence of Truth. Marquette University Press, Milwaukee, 1997.

Hannes Thurnherr and Jérémy Scheurer. TracrBench: Generating interpretability testbeds with large language models, 2024. URL https://arxiv.org/abs/2409.13714.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023.
Roger White. Why favour simplicity? Analysis, 65(3):205–210, 2005. doi:10.1093/analys/65.3.205.

Daniel A. Wilkenfeld. Understanding as compression. Philosophical Studies, 176(10):2807–2831, 2019.

Wilhelm Windelband. Geschichte und Naturwissenschaft. Rede zum Antritt des Rectorats der Kaiser-Wilhelms-Universität Strassburg, geh. am 1. Mai 1894. Heitz, 1894.

Zachary Wojtowicz and Simon DeDeo. From probability to consilience: How explanatory values implement Bayesian reasoning. Trends in Cognitive Sciences, 24(12):981–993, 2020.

James F. Woodward. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, New York, 2003.

Wilson Wu, Louis Jaburi, Jacob Drori, and Jason Gross. Unifying and verifying mechanistic interpretations: A case study with group operations. arXiv preprint arXiv:2410.07476, 2024.

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders, 2025. URL https://arxiv.org/abs/2501.17148.

Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1eBeyHFDH.

Sergey Yekhanin et al. Locally decodable codes. Foundations and Trends® in Theoretical Computer Science, 6(3):139–255, 2012.

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound AI systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.

A The Explanatory Virtues Rubric

Table 2: The rubric for evaluating the Explanatory Virtues of a given explanation (see Figure 1 and Section 3). We use this rubric to provide a structured evaluation of explanations as in Table 1. For each virtue we give the criteria for being Highly Virtuous (✓), Weakly Virtuous (●), and Not Virtuous (✗).

Causal-Mechanistic. Highly Virtuous: generates end-to-end causal explanations. Weakly Virtuous: explains a part of the network and can be used as part of a Causal-Mechanistic explanation. Not Virtuous: generates explanations which are not used for producing end-to-end causal explanations.

Precision. Highly Virtuous: rewards explanations that provide precise and risky predictions in a quantifiable way. Weakly Virtuous: partially accounts for precision in explanations, possibly qualitatively. Not Virtuous: fails to penalise (or even endorses) overly broad or vague predictions.

Priors. Highly Virtuous: explicitly incorporates comparisons with background theoretical priors in the method. Weakly Virtuous: implicitly incorporates background theoretical priors in evaluating explanations. Not Virtuous: fails to appropriately incorporate background theoretical priors.

Descriptiveness. Highly Virtuous: prefers explanations which clearly analyse detailed, component-wise prediction quality in high fidelity, capturing the essential characteristics of each data point. Weakly Virtuous: only partially or tangentially analyses individual data point fit, mostly focusing on overall aggregated fit. Not Virtuous: no analysis of how the data points fit the explanation in isolation at all.
Co-Explanation. Highly Virtuous: assesses the ability of explanations to account for multiple observations together, rewarding measures that emphasise integrated, joint predictive performance. Weakly Virtuous: has the potential to incorporate some aspects of joint explanation but does not fully reward coherent integration across diverse data points in its currently practised form. Not Virtuous: evaluates each data point in isolation, ignoring the value of linking multiple observations.

Power. Highly Virtuous: strongly values approaches that produce highly constraining predictions (especially about observations considered in isolation), penalising methods that allow too many plausible alternatives. Weakly Virtuous: provides moderate emphasis on constraining predictions, allowing for some uncertainty. Not Virtuous: assigns no weight to the predictive force of the explanation.

Accuracy. Highly Virtuous: quantitatively rewards explanations that fit the data with minimal error, especially with reference to both precision and recall where relevant. Weakly Virtuous: qualitatively rewards explanations that seem to fit the data well subjectively. Not Virtuous: does not distinguish between explanations that fit the data well or poorly, leading to evaluations that tolerate significant errors.

Unification. Highly Virtuous: measures how well a single evaluation framework can account for diverse observations, emphasising integrated, unified explanations. Weakly Virtuous: has the potential to recognise some unification, even if in a limited or fragmented way or if this is not a typical application of the method. Not Virtuous: places no weight on a unified account rather than a disjunction of accounts.

Consistency. Highly Virtuous: requires internal coherence within the explanation and across multiple instances of running the same explanation method. Weakly Virtuous: mostly internally consistent, but probabilistically can provide inconsistent explanations. Not Virtuous: places no weight on the internal consistency of generated explanations.

Simplicity. Highly Virtuous: evaluates explanations based on a conciseness or K-complexity simplicity metric, rewarding simpler explanations. Weakly Virtuous: partially considers a weak form of simplicity such as parsimony. Not Virtuous: neglects simplicity as a factor, encouraging highly complex and complicated explanations.

Fruitfulness. Highly Virtuous: rewards explanations that predicted new, testable phenomena, even with adversarially chosen test data from a close data distribution. Weakly Virtuous: rewards explanations that predict novel phenomena, even from the same data distribution. Not Virtuous: assesses only current data fit with no train-val-test split at all.

Falsifiable. Highly Virtuous: requires that explanations yield clear, testable predictions and penalises those that could not be refuted under counterfactual data. Weakly Virtuous: —. Not Virtuous: fails to consider whether explanations can be empirically refuted, rewarding unfalsifiable evaluations.

Hard-to-vary. Highly Virtuous: rigorously assesses the robustness of explanations, rewarding those evaluations where small modifications would lead to significant performance degradation; checks for interdependencies among components to ensure that each part is essential and load-bearing. Weakly Virtuous: makes limited effort to avoid ad-hoc explanations but doesn't fully address how hard-to-vary the explanations are. Not Virtuous: does not account for the ease of altering explanations and consistently produces explanations that are easily tweaked without loss of predictive power.
Nomological. Highly Virtuous: explicitly integrates established general laws and principles, favouring evaluations that connect to a broader nomological framework or reuse laws in multiple places across the explanatory theory. Weakly Virtuous: implicitly appeals to some non-generic laws, but such a connection may be indirect and not well utilised. Not Virtuous: ignores links to universal principles and attempts to focus on explaining the data without any reference to more general theoretical principles.

B Straightforward explanations

Following Ayonrinde & Jaburi (2025), we define the straightforward explanation of a neural network as follows. Given a neural network f : X → Y and x ∈ X such that f(x) = y, the straightforward explanation is given by the computational trace of the network on the input x. (In fact, this explanation is a formal proof of the equality f(x) = y.)

We note that for any neural network f and sub-distribution D ⊆ 𝒟, there exists a straightforward explanation of f on D. However, this straightforward explanation is typically not a good explanation in the sense of Section 3, as such explanations are not very concise or illuminating. We would instead like explanations of neural networks that are in terms of the features (or concepts) that the network learned during training, and explanations which are compact and useful.

Given Section 3 and Appendix A, we may evaluate the straightforward explanation of a neural network using the Explanatory Virtues Framework.

• Causal-Mechanistic: The straightforward explanation is Causal-Mechanistic. It decomposes the model into a computational graph, given by the neural network architecture.
• Precision, Descriptiveness, Accuracy, Power & Falsifiable: The straightforward explanation fulfills all these criteria, since it is a complete representation of the model.
• Co-explanation & Unification: The straightforward explanation does not fulfill these criteria, since it treats all inputs independently.
• Priors: The straightforward explanation does not refer to priors in its interpretation.
• Consistency: The straightforward explanation is consistent.
• Simplicity: The straightforward explanation is highly complex. There is no compression from the original weights in the explanation given.
• Fruitfulness: The straightforward explanation is not fruitful, in that it doesn't provide novel predictions.
• Hard-to-vary: The straightforward explanation is not hard-to-vary; modifying single parts of the model (e.g. individual weights) by some small amount will typically not vary the model performance.
• Nomological: The straightforward explanation is not nomological, as it doesn't provide general laws or principles.

We note that the straightforward explanation is a valid explanation of a neural network: it is Model-level, Ontic, Causal-Mechanistic, and Falsifiable. Further, the straightforward explanation embodies many of the explanatory values. However, we hope the reader will agree that the straightforward explanation is not a good explanation. Since, as noted in Section 5, not all of the explanatory values are equally important, an explanation may embody some of the virtues and yet not be a good explanation.

Researchers who are interpreting a neural network may have different use cases for which they would like an explanation of the model behaviour. To account for these different goals, researchers can make trade-offs between which Explanatory Virtues they value most highly (choosing the right explanation is a value-laden task; Ayonrinde & Jaburi, 2025). Overall, however, for an explanation to be a good explanation, we suggest that Simplicity, Fruitfulness, and Hard-to-Varyness are the most important values, without which it is difficult to have a good explanation. In this case, the straightforward explanation fails on the virtue of Simplicity.
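To make the definition above concrete, the following is a minimal, illustrative Python sketch (ours, not the paper's; the function name straightforward_explanation and the toy network are hypothetical) of recording the computational trace of a small ReLU MLP. The trace, together with the weights, certifies that f(x) = y, but it is as large as the computation itself, which is exactly why this explanation fails on Simplicity.

import numpy as np

def straightforward_explanation(weights, x):
    # Record the full computational trace of a toy ReLU MLP on input x.
    # The trace (every intermediate pre-activation and activation) is a
    # verbose certificate of f(x) = y: complete, but not compressed at all.
    trace = [("input", x)]
    h = x
    for i, (W, b) in enumerate(weights):
        z = W @ h + b                       # pre-activation of layer i
        h = np.maximum(z, 0.0)              # ReLU activation of layer i
        trace.append((f"layer_{i}/pre_activation", z))
        trace.append((f"layer_{i}/activation", h))
    return h, trace

# Toy usage: two random layers, one input.
rng = np.random.default_rng(0)
weights = [(rng.normal(size=(4, 3)), np.zeros(4)),
           (rng.normal(size=(2, 4)), np.zeros(2))]
x = rng.normal(size=3)
y, trace = straightforward_explanation(weights, x)
print(y)            # the output f(x)
print(len(trace))   # the "explanation" grows with network depth and width

Note that the length of the trace scales with the size of the computation, so there is no compression relative to simply running the model.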
C Explanations in The Wild, Visually

This section is a visual companion to Section 4. We present a series of figures to elucidate what we mean by each form of explanation and how we choose between two explanations given this method (i.e. Theory Choice (Schindler, 2018)).

C.1 Clustering (Activations or Inputs)

Figure 2: Given some (possibly intermediate) embeddings x, a clustering explanation can be produced by assigning x to a cluster C_i, where the n clusters partition the input space into disjoint regions. Here C_1 ∪ C_2 ∪ … ∪ C_n = R^N and C_i ∩ C_j = ∅ for all i ≠ j. The explanation is then given by taking the behaviour of the model on some cluster representative, or centroid, μ_i ∈ C_i. We can intuitively see this as performing a quotient operation on the input space, where the model behaviour is approximated by a piecewise constant function. [Image from Google Developers (2025).]

C.2 Sparse Autoencoder Explanations of Representations/Activations

Figure 3: (a) The SAE architecture. An encoder provides some set of latents (or feature activations) in the feature basis. We have some decoder map, Dec, which is a linear combination of the columns of the feature dictionary weighted by the sparse latents. We say informally that the latents z correspond to the input activations x if x is (approximately) recovered from z under the decoder map Dec. (b) If x and z correspond in the above sense, then the natural language explanation of the input activations x is given as e(x) = e′(z); that is, the explanation of the latents using the automated interpretability process e′(z) (Paulo et al., 2024; Karvonen et al., 2024; Bills et al., 2023; Ayonrinde, 2024). We can measure the mathematical description length (Conciseness) of the explanation e(x) as the number of bits required to describe the latents z (Ayonrinde et al., 2024). [Images from Ayonrinde et al. (2024); Ayonrinde (2024).]

C.3 Causal Abstraction Explanations of Circuits

Figure 4: A circuit explanation is a Causal-Mechanistic explanation: the circuit C is a constructive abstraction of a neural network's computational graph M if there exists a partition of the variables in M such that each high-level variable in C corresponds to a low-level partition cell in M, and interventions on M correspond to interventions on C. For example, in Figure 4 Left (Conmy et al., 2023), the IOI circuit (Wang et al., 2023) (highlighted in red) is recovered from the computational graph of GPT-2 Small. [Image from Conmy et al. (2023).]

C.4 Compact Proofs

Figure 5: (a) Compact Proofs evaluate explanations on two metrics: their compactness (FLOPs to verify the proof) and their accuracy (the model performance lower bound they certify). These two metrics can be assessed on a Pareto frontier. (b) A good explanation should push the frontier towards the upper left corner (i.e. more accurate and more compact proofs). [Image from Gross et al. (2024).]
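As an illustration of the two-metric comparison in Figure 5, the following sketch (ours, not the Gross et al. (2024) implementation; the ProofResult fields and the example names and numbers are hypothetical) selects the Pareto-optimal proofs from a set of candidate (verification cost, certified accuracy) pairs.

from dataclasses import dataclass

@dataclass
class ProofResult:
    name: str
    flops_to_verify: float        # compactness: lower is better
    accuracy_lower_bound: float   # certified performance bound: higher is better

def pareto_frontier(results):
    # A proof is dominated if another proof is at least as cheap to verify AND
    # certifies at least as high an accuracy bound, strictly better in one.
    frontier = []
    for r in results:
        dominated = any(
            o.flops_to_verify <= r.flops_to_verify
            and o.accuracy_lower_bound >= r.accuracy_lower_bound
            and (o.flops_to_verify < r.flops_to_verify
                 or o.accuracy_lower_bound > r.accuracy_lower_bound)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return frontier

# Toy usage: a better explanation should push the frontier up and to the left.
proofs = [
    ProofResult("brute-force trace", flops_to_verify=1e9, accuracy_lower_bound=1.00),
    ProofResult("padded trace", flops_to_verify=2e9, accuracy_lower_bound=0.95),
    ProofResult("mechanistic argument", flops_to_verify=1e6, accuracy_lower_bound=0.92),
    ProofResult("vacuous bound", flops_to_verify=1e3, accuracy_lower_bound=0.00),
]
for p in pareto_frontier(proofs):
    print(p.name, p.flops_to_verify, p.accuracy_lower_bound)

Here the padded trace is strictly dominated and drops off the frontier, while the remaining proofs trade verification cost against the strength of the certified bound.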
D Examples of Explanations

In this section, we provide some intuitive examples and non-examples of Explanations which satisfy the criteria that we outline above. The case studies in Section 4 are examples within Mechanistic Interpretability and Machine Learning; our examples here are non-technical illustrations.

D.1 Examples of Explanation Types

D.1.1 Ontic Explanations

Question: Why did the pen fall off the desk?

Causal-Mechanistic But Not Ontic Explanation. The pen fell off the desk because the aether pushed the bottle and then the bottle pushed the pen off the desk.

This explanation is Causal-Mechanistic in the sense that one thing happens after another and causes the next. However, if we do not believe that the aether is a real entity then this explanation cannot be considered an Ontic Explanation.

—

Question: Why is the cube heavy?

Ontic But Not Causal-Mechanistic Explanation. The cube is heavy because it is made up of tungsten atoms.

This explanation is Ontic as the entities involved in the explanation are real entities. However, it is not Causal-Mechanistic as there is no step-by-step explanation without gaps.

D.1.2 Statistically-Relevant Explanations

Consider the explanation:

Ice cream sales are higher on days when there are more shark attacks. If there's a shark attack reported, we can predict with 85% confidence that ice cream sales will be above average that day.

This explanation is purely in terms of statistical correlation rather than causation. There is no explication of any underlying causal mechanism, which might involve both phenomena being causally downstream of hot weather and/or more beach visitors. We could perform interventions to test this hypothesis.

D.1.3 Nomological Explanations

Question: Why does a metal rod expand when heated?

Nomological but not Causal-Mechanistic Explanation. The rod expands because it follows the natural law that all metals expand when heated, as described by the coefficient of thermal expansion.

This explanation references a general law of nature without getting into the underlying mechanism.

Causal-Mechanistic Explanation. The rod expands because its metal atoms vibrate more vigorously when heated, which increases their average spacing. This increased spacing leads to an overall increase in the rod's length.

This details the physical mechanism causing the expansion.

D.2 Examples of Explanatory Values

D.2.1 Precision, Power and Unification

Consider one explanation of what happens to objects when they are dropped:

When an object is dropped, it falls to the ground due to the force of gravity.

compared to the more precise explanation:

Objects fall toward Earth at a rate of 9.8 meters per second squared, with slight variations depending on altitude and latitude.

The latter explanation rules out more possibilities than the former. When we see that an object is dropped, armed with the second explanation, we are able to rule out the possibility that the object will fall at a different rate as well as the possibility that it will rise into the air. Precise explanations make narrow and risky predictions.

Unification. An explanation is unifying if it purports to explain multiple disparate observations. The Central Dogma in molecular biology states that genetic information flows only in one direction, from DNA, to RNA, to protein, or RNA directly to protein (Crick, 1970). This theory operates as a unifying explanation which narrows the space of possibilities for a wide range of biological phenomena.

D.2.2 Consistency

Consistent explanations contain no internal contradictions.

Question: Why did Alice miss the important meeting this morning?
Inconsistent Explanation. Alice, being a forgetful person, forgot that the meeting was happening and simultaneously Alice deliberately skipped the meeting to avoid a confrontation.

Consistent Explanation. Alice was out of the office for a vacation and missed the meeting.

As we increase the unification/scope of explanations, we sometimes introduce inconsistencies. For example, as we look to unify Quantum Mechanics and General Relativity, two explanations which are internally consistent on their own, we find that they are inconsistent with each other.

D.2.3 Simplicity

Occam's Razor states that when faced with competing explanations, one should select the explanation that is the simplest. This heuristic was first formulated in terms of parsimony, but we might also extend the sense of simplicity here to conciseness (Shannon complexity) or K-complexity (Kolmogorov complexity) as more appropriate measures of simplicity.

The Ptolemaic explanation:

The Earth is at the center of the universe, with the planets, the sun, and stars orbiting around Earth. There are many epicycles which explain the retrograde motion of the planets (planets moving backwards in the sky).

is more complex than the Copernican explanation:

The sun is at the center of the solar system and the planets orbit the sun in ellipses.

Even though both explanations could fit the data, we ought to prefer the Copernican model according to Occam's Razor and our Explanatory Virtue of Simplicity.

Wojtowicz & DeDeo (2020) give a sobering example of the dangers of not sufficiently valuing simplicity in explanation in their analysis of conspiracy theories. Such theories are often "abnormally co-explanatory and descriptive . . . , account for anomalous facts which are unlikely under the 'official' explanation . . . , show how seemingly arbitrary facts of ordinary life are correlated by hidden events . . . , and describe a unified universe where everything is correlated by a network of hidden common causes." A primary reason that such conspiracy theories are not typically good explanations is that they are not simple: there's often a large amount of complexity and ad-hoc reasoning to explain contradictory evidence and the reason why the cover-up has yet to come to light.
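To make the conciseness reading of Simplicity concrete, here is a small illustrative Python sketch (ours, not a method from the paper). Kolmogorov complexity is uncomputable, so the length of a zlib-compressed string is only a crude proxy for description length; the texts and per-planet parameters below are hypothetical.

import zlib

def description_length_bits(explanation: str) -> int:
    # Crude proxy for conciseness: bits in the zlib-compressed explanation.
    return 8 * len(zlib.compress(explanation.encode("utf-8")))

copernican = "The sun is at the centre; each planet orbits the sun in an ellipse."
ptolemaic = (
    "The Earth is at the centre; each planet orbits on a deferent with an "
    "epicycle tuned separately to fit its retrograde motion. "
    # Ad-hoc, per-planet adjustments inflate the description length.
    + " ".join(f"Planet {i} requires epicycle parameters ({i}.1, {i}.2, {i}.3)."
               for i in range(1, 8))
)

print(description_length_bits(copernican))  # shorter description
print(description_length_bits(ptolemaic))   # longer description, same data fit

The point of the sketch is only that an explanation padded with ad-hoc adjustments pays a measurable cost in description length, even when both explanations fit the observations.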
This explanation is falsified by the fact that the seasons are at different times in Australia to in Athens. The explanation is not very Hard-to-Vary however. We could easily change any of the characters or mechanisms involved in the theory and keep the same predictions. Falsifiable and Hard-To-Vary. The Earth’s axis of rotation is tilted relative to the plane of its orbit around the sun. Hence for half of each year the northern hemisphere is tilted towards the sun while the southern hemisphere is tilted away, and for the other half it is the other way around. Whenever the sun’s rays are falling vertically in one hemisphere (thus providing more heat per unit area of the surface) they are falling obliquely in the other (thus providing less heat). This explanation is both falsifiable and hard-to-vary. All of the details of the theory play a functional role and cannot be easily changed. The axis-tilt theory also (correctly) predicts the fact that the seasons are reversed in the northern and southern hemispheres. D.4 (Mundane) Accuracy and Fruitfulness (Novel Success) Explanations have Mundane Accuracy insofar as they correctly account for the phenomena they aim to explain. Conversely explanations are Fruitful if they predict new phenomena that were not available to the explainer at the time of coming up with the explanation. Being able to predict and explain new, previously unobserved phenomena that are later confirmed (as in Fruitfulness) is typically considered more valuable than merely explaining known phenomena (as in Mundane Accuracy). Einstein’s General Relativity predicted that light would bend around massive objects like the sun (Einstein, 1916). In 1919, during a solar eclipse, Arthur Eddington observed that starlight passing near the sun was indeed deflected by precisely the amount Einstein had predicted (Dyson et al., 1920; Kennefick, 2021). Given that the phenomenon of light bending around massive objects was previously unknown, this was a novel empirical success for Einstein’s theory. This can increase our credence in Einstein’s theory because the prediction was made before the observation, was precise and quantitative in an unknown domain and the observations matched the prediction with high accuracy. D.5 Co-Explanation and Descriptiveness Explanations can be purelydescriptive, in which case they account well for the phenomena they aim to explain but do not connect with other explanations. Alternatively, explanations can beco-explanatory, unifying phenomena that were previously thought to be distinct. Descriptive but Not Co-Explanatory. Electricity involves the movement of charges and produces effects such as static attraction, lightning, and electrical current. Magnetism, on the other hand, involves the attraction or repulsion between certain materials like lodestone and iron, and manifests in the behavior of compasses pointing north. Co-Explanatory. 28 Preprint. Under review. Electricity and magnetism are manifestations of a single underlying electro- magnetic force. A changing electric field produces a magnetic field, and a changing magnetic field produces an electric field. Moving electric charges create magnetic fields, while moving magnets induce electric currents. E A Coherence Formulation of Adhocness (Schindler, 2018) also gives an adhocness test for explanations which can identify those which are the result of a post-hoc epicycle added to an easy-to-vary explanation. 
For Schindler, an explanation is adhoc if the modification ∆ which it corresponds to is some additional hypothesis H (which we can think of as being added in order to accommodate some awkward-to-explain data x_I) and two conditions are met:

1. H explains x_I. That is, P(x_I | E, H) > P(x_I | E).
2. Neither the original explanation E nor background theories B give evidence for H. That is, P(H | E, B) < P(H).

We define an adhocness metric as Adhoc = P(H) − P(H | E, B), where larger adhocness values are more adhoc and dispreferred.

F Local Decodability as an Explanatory Virtue

Another virtue that we may consider for highly unifying explanations is local decodability. Locally decodable explanations allow for retrieval and use of some small segment of the explanation without querying the whole explanation, analogously to locally decodable error-correcting codes (Yekhanin et al., 2012). This is important as we would like not only for our explanations to have information compression (concise representation) but also information accessibility (the ability to retrieve specific subparts quickly). In practice, an explanation of a network which is compressed but not locally decodable requires significant computational resources to query and is not useful for human understanding. (Local decodability is measured in query complexity: the number of queries required to recover 1 bit of the message, i.e. the explanation. Conciseness and query complexity are known to be inversely proportional, but the exact fundamental limit on their relationship is currently unknown.) The Independent Additivity condition from Ayonrinde et al. (2024) is an example of a local-decodability condition in Mechanistic Interpretability. V-Information (Xu et al., 2020) provides a useful analogy for local decodability in Machine Learning.
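As a toy illustration of this query-complexity trade-off (ours, not from the paper; the component names and the use of zlib are hypothetical), compare a monolithic compressed explanation, which must be fully decompressed to answer any single question, with a per-component encoding that is slightly less concise overall but locally decodable.

import zlib

# Two ways of storing the same per-component explanation of a model.
component_explanations = {
    f"head_{i}": f"Head {i} copies the previous token." for i in range(12)
}

# (a) Monolithic: one compressed blob. Concise, but answering a single query
#     requires decompressing (touching) the entire explanation.
monolithic = zlib.compress("\n".join(component_explanations.values()).encode())

def query_monolithic(name: str) -> str:
    full = zlib.decompress(monolithic).decode().split("\n")  # whole explanation touched
    index = int(name.split("_")[1])
    return full[index]

# (b) Locally decodable: each component compressed separately. Slightly less
#     concise overall, but one query touches only one small segment.
local = {k: zlib.compress(v.encode()) for k, v in component_explanations.items()}

def query_local(name: str) -> str:
    return zlib.decompress(local[name]).decode()             # only one segment touched

print(query_monolithic("head_3"))
print(query_local("head_3"))

Both stores answer the same queries; the difference is how much of the explanation must be read to answer one of them, which is the intuition behind treating local decodability as a virtue alongside conciseness.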