Paper deep dive

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

Maxime Méloux, François Portet, Maxime Peyrard

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 65

Models: GPT-2 Small, Llama (various sizes)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:03:41 PM

Summary

This paper investigates the scientific validity of Mechanistic Interpretability (MI) by framing circuit discovery as a statistical estimation problem. The authors demonstrate that causal mediation analysis (CMA) and its common approximations (like EAP and EAP-IG) exhibit high intrinsic variance and instability. They show that circuit discovery pipelines amplify this noise, leading to fragile structural estimates that vary significantly with input data and hyperparameters. The study advocates for more rigorous MI practices, including systematic bootstrap resampling and the reporting of stability metrics.

Entities (5)

Causal Mediation Analysis · methodology · 100%
EAP-IG · algorithm · 100%
Edge Attribution Patching · algorithm · 100%
Mechanistic Interpretability · research-field · 100%
GPT2-small · model · 95%

Relation Signals (3)

Edge Attribution Patching → approximates → Causal Mediation Analysis

confidence 95% · Modern methods employ efficient but approximate estimators of the CMA score itself... Edge Attribution Patching (EAP; Syed et al., 2023).

Circuit discovery → exhibits → High Variance

confidence 95% · We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance.

Causal Mediation Analysis → is foundation for → Circuit discovery

confidence 90% · We argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA).

Cypher Suggestions (2)

Map the relationship between methodologies and their limitations. · confidence 85% · unvalidated

MATCH (m:Methodology)-[:HAS_LIMITATION]->(l:Limitation) RETURN m.name, l.description

Find all interpretability methods mentioned in the paper. · confidence 80% · unvalidated

MATCH (e:Entity {entity_type: 'Algorithm'})-[:USED_IN]->(t:Task {name: 'Circuit Discovery'}) RETURN e.name

Abstract

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

Full Text

64,640 characters extracted from source content.

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Maxime Méloux¹, François Portet¹, Maxime Peyrard¹

¹ Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France. Correspondence to: Maxime Méloux <maxime.meloux@univ-grenoble-alpes.fr>, Maxime Peyrard <maxime.peyrard@univ-grenoble-alpes.fr>.

Preprint. February 4, 2026.

Abstract

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

1. Introduction

As AI systems are increasingly deployed in real-world applications, the need for robust interpretability methods has become more urgent. Understanding the internal mechanisms of these models is critical not only for diagnosing failures and improving robustness (Barredo Arrieta et al., 2020), but also for complying with emerging legal frameworks that mandate explainability (Walke et al., 2025).

Mechanistic Interpretability (MI) is a promising research direction aiming to reverse-engineer the algorithms learned by deep neural networks (Olah et al., 2018). A central approach in MI involves identifying “circuits”, functional sub-networks that are responsible for particular capabilities (Olah et al., 2020; Elhage et al., 2021). These are typically identified by relying on the framework of causal mediation analysis (CMA) (Pearl, 2001; VanderWeele, 2016). CMA consists of intervening on the computational graph, setting the network in counterfactual states and measuring the effect of components on outputs (Vig et al., 2020a; Monea et al., 2024; Hanna et al., 2024; Syed et al., 2024). In practice, MI relies on fast approximations of CMA to scale the estimation of causal importance scores to larger models, e.g., edge attribution patching (EAP; Syed et al., 2023) and its integrated-gradients variant (EAP-IG; Hanna et al., 2024). The causal importance scores are then aggregated over a dataset of inputs representative of the target behavior, and discrete heuristics are applied to extract a causally important circuit.

The long-term vision of MI is to evolve into a rigorous science, employing discovery tools similar to those of the natural sciences (Cammarata et al., 2020; Lindsey et al., 2025). However, MI currently faces foundational challenges that limit its scientific rigor. Methods are prone to “dead salmon” artifacts and false positives (Méloux et al., 2025), and explanations discovered in one setting may fail to transfer to others (Hoelscher-Obermaier et al., 2023). In addition, multiple incompatible explanations may equally satisfy current MI criteria (Méloux et al., 2025). Méloux et al. (2025) argue that these issues stem from non-identifiability: the impossibility of inferring a unique explanation from observed data. In statistics, this non-identifiability manifests as high variance (Preston et al., 2025; Arendt et al., 2012).
To overcome these hurdles, MI should be reframed as a problem of statistical inference (Fisher, 1955; Mayo, 1998). In the natural sciences, validity requires quantifying observational variability and representing uncertainty (Lele, 2020; Committee et al., 2018). Systematically studying the stability of MI findings through metrics like variance (Zidek & van Eeden, 2003) is a necessary step toward scientific rigor. Yet, current MI practices often neglect these requirements; explanations are frequently reported without quantifying their statistical stability or robustness to perturbations, and without uncertainty estimates (Rauker et al., 2023). Without such analyses, we cannot assess the generalizability, reliability, and ultimately, the validity of MI explanations (Rauker et al., 2023; Liu et al., 2025; Ioannidis, 2005).

arXiv:2510.00845v3 [cs.LG] 3 Feb 2026

[Figure 1 image: six sample circuits (#141, #87, #127, #113, #75, #145), the union circuit and the median circuit, and a panel titled "MDS Embedding of Circuits" with points labeled by EAP method (EAP, EAP-IG-activations, EAP-IG-inputs, clean-corrupted).]

Figure 1. In gpt2-small, varying multiple circuit-finding parameters at once (type of resampling, aggregation, intervention, estimation, and pruning) yields many different circuits for the IOI task, displayed along with the union and median circuit (left). The MDS projection of the pairwise Jaccard index matrix (center) shows that no method consistently yields circuits with lower variance (tighter clustering).

At the heart of importance estimation in MI lies causal mediation analysis (CMA) (Pearl, 2001; VanderWeele, 2016). CMA provides a theoretical framework to estimate the causal effect of specific model edges or nodes on a behavior by mediating information through them.
While CMA is identifiable at the level of a single input and behavior, the broader goal of circuit discovery is to aggregate these individual importance scores into a sparse, generalizable subgraph. In this work, we argue that circuit discovery should be viewed as a downstream pipeline fueled by CMA.

We analyze variance in causal mediation analysis (CMA), its approximations (EAP, EAP-IG), and the downstream circuits they extract. We consider multiple sources of variability, including data-related factors, such as the dataset (via bootstrap resampling), shifts in the distribution, prompt paraphrasing, and the choice of contrastive perturbation, as well as methodological factors such as hyperparameters and heuristics. We find substantial variance at every stage: CMA already exhibits high variability in estimating causal importance across inputs drawn from the same distribution; its approximations further amplify this variance; and all circuit extraction methods produce highly unstable circuits across nearly all sources of variation. This instability is summarized in Fig. 1, which shows the structural inconsistency among circuits discovered when multiple parameters are varied simultaneously. In response, we propose a set of best practices for the MI community, including systematic bootstrap resampling and the reporting of stability metrics, to promote more rigorous and reliable interpretability research.

2. Related Work

Causal Mediation Analysis as the Engine of MI. Causal mediation analysis (CMA; Pearl, 2001; VanderWeele, 2016) investigates how an outcome (e.g., a model’s prediction) is affected by specific mediators (neuron activations or edges) via controlled interventions.
In deep neural networks, this involves techniques such as activation patching (Vig et al., 2020b; Geiger et al., 2021) and causal tracing (Meng et al., 2022; 2023; Fang et al., 2025), which manipulate mediators to quantify their influence on restoring a partially corrupted input. Interestingly, the causal effect of a component is identifiable for a fixed input and a fixed input corruption, and can be computed exactly by simulating the execution of the network under different interventions (Vig et al., 2020b; Meng et al., 2022). However, exact CMA is computationally expensive: it involves several forward passes to estimate the causal effect of a single component. Consequently, the field has developed fast approximations. Edge Attribution Patching (EAP; Syed et al., 2023) combines causal patching with a local Taylor expansion to quantify the importance of individual edges. EAP with integrated gradients (EAP-IG; Hanna et al., 2024) builds on this by using path integrals to better handle non-linearities and measures the impact of components excluded from a subgraph. One prominent application of these importance estimates is circuit discovery: a structural estimation problem where one seeks to identify a sparse, interconnected subgraph (a “circuit”) consisting of causally important components. This process has evolved from early techniques such as feature visualization (Zeiler & Fergus, 2014; Sundararajan et al., 2017) to automated methods such as ACDC (Conmy et al., 2023). Going from an estimated causal importance score for each component of a network to a discrete subgraph selection involves several heuristics and design choices, leading to different algorithms.

The Limits of Point-Estimate Evaluation. Despite their grounding in causal theory, these methods produce point estimates: single structural summaries derived from finite data and fixed hyperparameters.
Yet the notion of a unique, correct circuit is often ill-defined or non-identifiable (Mueller et al., 2025; Méloux et al., 2025), undermining claims about recovering a “ground-truth” circuit. More broadly, Méloux et al. (2025) argue for reframing interpretability as a problem of statistical explanation. Under this view, circuits should be reported with uncertainty estimates, since multiple distinct circuits may plausibly explain the same behavior. This shifts attention to variance: how different are the circuits that are consistent with the evidence? Currently, MI relies on proxy metrics to evaluate those estimates based on desirable properties: faithfulness (how accurately a circuit reflects model behavior, often tested by perturbing or ablating the identified components within the full model; Conmy et al., 2023; Hedström et al., 2023; Hanna et al., 2024; Shi et al., 2024b), sufficiency (whether the isolated circuit can reproduce the target behavior; Bau et al., 2017; Yu et al., 2024; Shi et al., 2024a), interpretability (a qualitative assessment of understandability and alignment with intuition; Olah et al., 2020), and sparsity/minimality (a preference for simpler, concise circuits; Elhage et al., 2021; Hedström et al., 2023; Dunefsky et al., 2024; Shi et al., 2024a). While these assess the internal validity of a discovered circuit, they do not account for its stability. Recent work has begun to question the robustness of these metrics. For instance, Shi et al. (2024a) introduce hypothesis tests for faithfulness, but only for a fixed circuit. Our work focuses on the variance and stability of both circuits and causal mediation analyses. While bootstrapping has been used to improve the selection of faithful edges (Nikankin et al., 2025), our study provides the first systematic decomposition of these instabilities.
We trace the sources of variance across the pipeline: the baseline variance of single-input CMA, the approximation noise introduced by attribution heuristics, and the sensitivity to methodological choices. This mirrors the shift in classic ML from simple error rates to the study of model stability and generalization variance (Bousquet & Elisseeff, 2002).

Identifying the Sources of Variance. A growing body of evidence suggests that MI methods suffer from soundness issues. Interventions based on discovered circuits often fail to generalize to novel contexts, casting doubt on the robustness of the underlying identified mechanism (Hoelscher-Obermaier et al., 2023). Furthermore, results can be sensitive to the choice of perturbation strategies (Miller et al., 2024; Bhaskar et al., 2024; Zhang & Nanda, 2024). These issues can be symptoms of non-identifiability, where multiple distinct and incompatible circuits can equally satisfy common evaluation metrics (Méloux et al., 2025). Statistically, this manifests as high estimator variance (Preston et al., 2025). Also, estimates become unstable due to the high dimensionality of the model and the limitations of finite sampling. These issues demand a proper quantification of uncertainty and stability.

3. Formal Setup

We present a brief formal description of CMA and of the circuit discovery built upon it. For details, we point the reader to Mueller et al. (2025). We highlight the statistical perspective on CMA and circuit discovery (Méloux et al., 2025).

3.1. Causal Mediation Analysis

The theoretical framework for identifying functional components in neural networks is causal mediation analysis (Pearl, 2001; VanderWeele, 2016). CMA investigates how an antecedent $X$ (input) affects an outcome $Y$ (model output) through a mediator $M$ (an internal component such as a node or edge), partitioning the Total Effect (TE) of the input into direct and indirect pathways.
In the context of MI, we focus on the natural indirect effect (NIE): the portion of the effect that is transmitted specifically through the mediator (Mueller et al., 2025). Formally, let $Y(x, m)$ denote the value of the model’s output metric $L$ (e.g., logit difference or loss) under two distinct interventions (setting the input to $X = x$ and fixing the mediator to $M = m$). Standard activation patching techniques (Geiger et al., 2021; Vig et al., 2020b) estimate this effect by contrasting two conditions: a clean run with input $x$, and a counterfactual run where the mediator is set to the value it would take under a corrupted input $x_{\mathrm{corr}}$. The importance score $S$ for a component $e$ is defined as the NIE of transitioning the mediator from its clean to its corrupted state [1] in the context of the clean input:

$$S(e, x, x_{\mathrm{corr}}) = \underbrace{\mathbb{E}\left[Y(x, M(x_{\mathrm{corr}}))\right]}_{\text{Patched run}} - \underbrace{\mathbb{E}\left[Y(x, M(x))\right]}_{\text{Clean run}} \tag{1}$$

Here, $M(x)$ and $M(x_{\mathrm{corr}})$ represent the natural value of the mediator under the clean and corrupted inputs, respectively. The importance score depends on two distinct interventions: one setting the global context by transforming $x$ into $x_{\mathrm{corr}}$, and one manipulating the mediator from $M(x)$ to $M(x_{\mathrm{corr}})$.

[1] Prior studies such as ACDC, EAP, and EAP-IG commonly consider the opposite effect of activation restoration instead (from the corrupted to the clean state). Here, we keep the original formulation of CMA (Pearl, 2001; Vig et al., 2020b; Mueller et al., 2025).

Consequently, the estimated importance is not a fixed property of the component, but a random variable that depends on the joint distribution of the clean input $x$ and the counterfactual source $x_{\mathrm{corr}}$. Fluctuations in how $x$ is sampled or how $x_{\mathrm{corr}}$ is generated directly introduce variance into the definition of the score itself.

3.2. Circuit Discovery as Statistical Estimation

While CMA provides precise local explanations for a specific input-counterfactual pair, MI typically seeks global circuits: subgraphs that explain model behavior across a distribution representing a behavior of interest. Circuit discovery can be seen as a statistical estimation problem that generalizes these local CMA scores to a population-level circuit.

The Target Parameter: Circuit discovery methods implicitly assume the existence of a global importance score for each component $e$. We define this target $\mu_e$ as the expected value of the local NIE scores over the joint distribution $\mathcal{D}$ of inputs $X$ and experimental conditions: $\mu_e = \mathbb{E}_{(x, x_{\mathrm{corr}}) \sim \mathcal{D}}\left[S(e, x, x_{\mathrm{corr}})\right]$. Since the full distribution $\mathcal{D}$ is inaccessible, methods rely on a finite dataset $D = \{(x_i, x_{\mathrm{corr},i})\}_i$ sampled from $\mathcal{D}$ to estimate $\mu_e$ using the empirical mean. However, aggregation methods other than the mean could be used.

Circuit Selection ($A$): The final circuit $C$ is a subset of components selected based on these estimates: $C = A\left(\{\hat{S}(e)\}_{e \in M_\theta}, \Lambda\right)$, where $\Lambda$ denotes hyperparameters such as sparsity thresholds or connectivity constraints.

This formulation highlights that a circuit is not solely a product of the model, but a compound effect of the estimation pipeline. The importance score $S$ exhibits intrinsic variance due to the sampling of inputs and perturbations. Also, the pipeline depends on the choice of hyperparameters, and the selection function $A$ can amplify small fluctuations in $\hat{S}(e)$ into large structural differences in $C$.

3.3. Approximating CMA via the EAP family

Calculating the exact NIE (Eq. 1) for every edge is computationally prohibitive ($2 \times N_{\text{edges}} \times N_{\text{samples}}$ forward passes). Therefore, modern methods employ efficient but approximate estimators of the CMA score itself.
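The estimation view of Section 3.2 (local scores, an aggregate estimate of the target, then a selection function) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the edge names, the score values, and a top-k selection rule standing in for the selection function with the sparsity hyperparameter set to k. It is not the selection heuristic of any particular library.

```python
from statistics import mean, median

def aggregate_scores(per_input_scores, agg=mean):
    """Estimate one global score per edge from its local scores S(e, x_i, x_corr_i)."""
    return {edge: agg(scores) for edge, scores in per_input_scores.items()}

def select_circuit(global_scores, k):
    """Toy selection function: keep the k edges with the largest absolute score."""
    ranked = sorted(global_scores, key=lambda e: abs(global_scores[e]), reverse=True)
    return set(ranked[:k])

# Hypothetical local scores for three edges over four inputs.
local_scores = {
    "a->b": [0.9, 1.1, 1.0, 1.0],
    "b->c": [0.1, 0.2, 3.0, 0.1],  # a single input dominates the mean
    "a->c": [0.6, 0.7, 0.6, 0.7],
}

circuit_mean = select_circuit(aggregate_scores(local_scores, mean), k=2)
circuit_median = select_circuit(aggregate_scores(local_scores, median), k=2)
print(sorted(circuit_mean), sorted(circuit_median))
```

In this toy data, switching the aggregation from mean to median swaps one of the two selected edges, which is the kind of amplification by the selection step that the formulation above describes.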
In this work, we consider Edge Attribution Patching (EAP; Syed et al., 2023) and its variants (Hanna et al., 2024) due to their ubiquity in the literature (Zhang et al., 2025; Mondorf et al., 2025; Nikankin et al., 2025) and their state-of-the-art performance in identifying sparse edge-level circuits (Syed et al., 2023; Hanna et al., 2024). These methods approximate the intervention $M(x) \leftarrow M(x_{\mathrm{corr}})$ using gradient information. However, because these estimators rely on local information to approximate the global effect of an intervention, they may introduce approximation noise that compounds the intrinsic variance of the CMA scores. We investigate four specific estimators of $S(e, x, x_{\mathrm{corr}})$:

• EAP: A first-order Taylor approximation of $S$ that multiplies the gradient of the metric $\nabla L(x)$ by the activation difference $M(x) - M(x_{\mathrm{corr}})$ after intervention.
• EAP-IG (inputs): Uses integrated gradients, averaging $\nabla L(x)$ over $m$ interpolation steps between $x$ and $x_{\mathrm{corr}}$.
• EAP-IG (activations): Similar to the above, but integrates gradients w.r.t. intermediate activations, interpolating directly between clean and corrupted activation states.
• Clean-corrupted: Averages the gradient at two points only ($x$ and $x_{\mathrm{corr}}$), without interpolation.

3.4. Assessing Stability: Protocols and Metrics

We decompose the instability of discovered circuits into two distinct sources: (i) Variance (sampling sensitivity) arises from the reliance on a finite dataset $D$ to approximate the population expectation. It measures the fluctuation of $\hat{S}$ when $D$ is resampled. High variance implies that the underlying distribution of local CMA scores is broad, making the aggregate estimate unreliable. (ii) Robustness (methodological sensitivity) captures the sensitivity of the result to the specification of the counterfactual $x_{\mathrm{corr}}$ (intervention strategy) and the hyperparameters $\Lambda$.
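To make the difference between the gradient-based estimators of Section 3.3 concrete, here is a minimal one-dimensional Python sketch. It assumes a hypothetical scalar metric of a single activation (a cube function with an analytic gradient), measures the effect of moving the activation from its clean to its corrupted value, and uses a midpoint rule for the interpolation steps. The sign convention and the interpolation rule are assumptions of this sketch, not the conventions of the EAP-IG library.

```python
def metric(a):
    """Hypothetical scalar metric L as a function of one activation."""
    return a ** 3

def grad(a):
    """Analytic derivative of the toy metric."""
    return 3 * a ** 2

def exact_effect(a_clean, a_corr):
    """Exact effect of patching the activation from clean to corrupted."""
    return metric(a_corr) - metric(a_clean)

def eap(a_clean, a_corr):
    """First-order Taylor estimate: activation difference times gradient at the clean point."""
    return (a_corr - a_clean) * grad(a_clean)

def clean_corrupted(a_clean, a_corr):
    """Average the gradient at the two endpoints only."""
    return (a_corr - a_clean) * 0.5 * (grad(a_clean) + grad(a_corr))

def eap_ig_activations(a_clean, a_corr, m=5):
    """Average the gradient over m midpoints interpolated between the two activations."""
    delta = a_corr - a_clean
    avg_grad = sum(grad(a_clean + (k + 0.5) / m * delta) for k in range(m)) / m
    return delta * avg_grad

a_clean, a_corr = 1.0, 2.0
exact = exact_effect(a_clean, a_corr)
eap_val = eap(a_clean, a_corr)
cc_val = clean_corrupted(a_clean, a_corr)
ig_val = eap_ig_activations(a_clean, a_corr)
print(exact, eap_val, cc_val, ig_val)
```

On this curved toy metric the first-order estimate is far from the exact effect, while averaging gradients along the path comes much closer. Real models are high-dimensional and the gap depends on the metric and the corruption, so the numbers here only illustrate the mechanism.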
To quantify these properties, we produce sets of $N$ circuits $C_1, \ldots, C_N$ under controlled variations and measure their structural and functional stability.

Perturbation Strategies. We isolate sources of instability through specific regimes:

• Data resampling: We estimate sampling variance via bootstrap. We generate $n = 100$ datasets by resampling with replacement from $D$ and re-running the full discovery pipeline.
• Distribution shifts: We assess generalization using new datasets drawn from the same meta-distribution (meta-dataset) or by paraphrasing input prompts (re-prompting).
• Intervention definition: We investigate how the definition of the counterfactual $x_{\mathrm{corr}}$ impacts discovery. Instead of a fixed corruption, we generate $x_{\mathrm{corr}}$ by adding sampled Gaussian noise to the token embedding. By varying the noise amplitude, we effectively alter the “strength” of the intervention, measuring how importance scores vary with the magnitude of the perturbation.
• Methodological perturbations (robustness): We test sensitivity to $\Lambda$. This includes varying the aggregation function (e.g., mean vs. median), the type of counterfactual (corrupted vs. mean patching), and comparing different base estimators (e.g., EAP vs. EAP-IG) on fixed data.

Evaluation Metrics. We report the following metrics across the generated circuit sets.

(i) Structural stability (Jaccard index): We quantify the structural spread of circuit estimates via the overlap between discovered edge sets $E_i, E_j$ corresponding to discovered circuits $C_i, C_j$. We report the mean and variance of the pairwise Jaccard index:

$$J(E_i, E_j) = \frac{|E_i \cap E_j|}{|E_i \cup E_j|}.$$

(ii) Faithfulness: We assess how well the different circuits recover model behavior using the circuit error

$$\mathrm{CE}(C_i, M_\theta) = \frac{1}{|D|} \sum_{x \in D} \mathbb{1}\left[M_{C_i}(x) \neq M_\theta(x)\right]$$

and the KL divergence $D_{\mathrm{KL}}\left(P_{M_\theta} \,\|\, P_{M_{C_i}}\right)$ averaged over $D$.
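The bootstrap protocol and the two metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the per-input edge scores are synthetic Gaussian draws, the discovery pipeline is reduced to a hypothetical mean-then-top-k rule, and the function names are invented for this example rather than taken from any library.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def jaccard(e_i, e_j):
    """Jaccard index between two edge sets."""
    union = e_i | e_j
    return len(e_i & e_j) / len(union) if union else 1.0

def circuit_error(circuit_preds, model_preds):
    """Fraction of inputs on which the circuit and the full model disagree."""
    disagreements = sum(c != m for c, m in zip(circuit_preds, model_preds))
    return disagreements / len(model_preds)

def top_k_circuit(rows, k):
    """Toy pipeline: mean-aggregate local scores per edge, keep the top-k by magnitude."""
    agg = {e: mean(r[e] for r in rows) for e in rows[0]}
    return set(sorted(agg, key=lambda e: abs(agg[e]), reverse=True)[:k])

def bootstrap_circuits(rows, k, n_boot=100, seed=0):
    """Resample the dataset with replacement and rerun the toy pipeline each time."""
    rng = random.Random(seed)
    return [top_k_circuit(rng.choices(rows, k=len(rows)), k) for _ in range(n_boot)]

def stability(circuits):
    """Mean, variance, and coefficient of variation of the pairwise Jaccard index."""
    vals = [jaccard(a, b) for a, b in combinations(circuits, 2)]
    mu, sd = mean(vals), pstdev(vals)
    return mu, sd ** 2, (sd / mu if mu else float("nan"))

# Synthetic local scores: six edges with different true means and unit noise.
data_rng = random.Random(1)
true_means = {"e0": 1.0, "e1": 0.9, "e2": 0.8, "e3": 0.7, "e4": 0.2, "e5": 0.1}
rows = [{e: data_rng.gauss(mu, 1.0) for e, mu in true_means.items()} for _ in range(30)]

circuits = bootstrap_circuits(rows, k=3)
jaccard_mean, jaccard_var, jaccard_cv = stability(circuits)
print(jaccard_mean, jaccard_var, jaccard_cv)
```

Reporting the mean together with the variance or coefficient of variation of the pairwise Jaccard index, rather than a single circuit, is the kind of stability summary the protocol above calls for.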
We report the mean $\mu$, variance $\sigma^2$, and coefficient of variation CV of both CE and $D_{\mathrm{KL}}$. In all experiments, KL divergence and circuit error are highly correlated; we report the latter in the main part and the former in the appendices.

4. Experimental Setup

Tasks and Datasets. We follow the setup in Hanna et al. (2024) and use three standard interpretability tasks consisting of clean/corrupted input pairs: (i) Indirect Object Identification (IOI) (Wang et al., 2023), involving identifying indirect objects in narratives. We use the generator from Wang et al. (2023). (ii) Subject-Verb Agreement (SVA) (Newman et al., 2021), involving predicting the verb form that agrees with a singular or plural noun. We adapt the generator from Warstadt et al. (2020) to create pairs of singular/plural nouns only. Prompt paraphrasing was not implemented for this task due to the simplicity of the prompt. (iii) Greater-Than (Hanna et al., 2023), involving predicting a year numerically greater than the one provided in the prompt. We use the dataset and the generator from Hanna et al. (2023) for distribution shifts. We use the standard evaluation metrics of logit difference for IOI and SVA, and probability difference for Greater-Than.

Models. We conduct experiments across three language models: gpt2-small (Radford et al., 2019), selected as a foundational MI benchmark used in the original EAP, EAP-IG and ACDC studies; Llama-3.2-1B (AI@Meta, 2024), to test generality on a larger, modern architecture; and its instruction-tuned variant, Llama-3.2-1B-Instruct, as fine-tuning may impact the stability of causal mechanisms (Jain et al., 2024; Prakash et al., 2024).

5. Results

Here, we empirically investigate the stability of causal importance estimation and circuit discovery across sources of variation. Unless otherwise stated, we use the implementation from the EAP-IG library with its default hyperparameters.

Figure 2. Distribution of edge scores for the IOI task in gpt2-small. Top: The coefficient of variation ($\mathrm{CV} = \sigma / |\mu|$) of edge scores across the dataset. A high CV indicates that the causal score of an edge displays marked instability between inputs. Bottom: Comparison of score distributions (mean vs. std) for exact edge ablation (blue) and EAP (red). While EAP generally has a lower mean and std than the underlying causal effect it attempts to approximate, the obtained edges have a consistently higher CV. Additionally, EAP introduces higher relative fluctuations in the mean and std across edges.

5.1. Variance in Edge Scores

To distinguish between the natural variability of the model’s mechanism and the error introduced by approximation methods, we first compute CMA exactly (computing Eq. 1) for each input sample and each edge in the computational graph. For this experiment, due to high computational costs, we restrict ourselves to the IOI dataset and gpt2-small. Figure 2 compares the mean and standard deviation (std) of these exact scores (blue) against the approximate EAP estimates (red). We observe two critical phenomena:

Intrinsic Variance of CMA. The causal effect of an edge is not stable across inputs. The blue distribution shows that edge scores exhibit a standard deviation often close to half their mean (CV ≈ 0.5), confirming that edge importance displays high variability and depends strongly on the specific input-counterfactual pair.

Approximation Instability. The EAP approximation exacerbates this issue. EAP shifts the distribution and significantly increases the CV, with the standard deviation often exceeding the mean (CV > 1). As such, an edge’s score is not consistent across samples. This indicates that gradient-based estimators introduce substantial approximation noise on top of the natural variance of the CMA estimand.
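The coefficient of variation used in Figure 2 and in this section is easy to compute per edge. The following Python sketch uses made-up score lists and the population standard deviation; the source does not say whether the population or sample estimator is used, so that choice is an assumption here.

```python
from statistics import mean, pstdev

def coefficient_of_variation(scores):
    """CV = sigma / |mu| of one edge's scores across inputs."""
    mu = mean(scores)
    return pstdev(scores) / abs(mu) if mu != 0 else float("inf")

# Hypothetical per-input scores for two edges.
steady_edge = [1.0, 1.5, 0.5, 1.0]   # fluctuates around a clear mean
noisy_edge = [2.0, -1.0, 0.5, -0.5]  # sign flips between inputs

cv_steady = coefficient_of_variation(steady_edge)
cv_noisy = coefficient_of_variation(noisy_edge)
print(cv_steady, cv_noisy)
```

Since the signal-to-noise ratio of an edge is the inverse of its CV, a CV above 1 corresponds to a signal-to-noise ratio below 1.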
Consequently, the signal-to-noise ratio for any given edge is low, making the identification of stable circuits from a finite sample statistically precarious.

5.2. Circuit Instability under Data Resampling

Table 1. Aggregate statistics for circuit error and Jaccard index across resampling strategies (averaged over all models and tasks).

| Resampling Strategy | Circuit error μ | Circuit error CV | Jaccard index μ | Jaccard index CV |
|---|---|---|---|---|
| Bootstrap | 0.440 | 0.123 | 0.561 | 0.335 |
| Meta-Dataset | 0.300 | 0.094 | 0.790 | 0.132 |
| Prompt Paraphrasing | 0.150 | 0.134 | 0.799 | 0.131 |

Given the high variance of individual edge scores, we next investigate how this instability propagates to the final circuit structure when the input dataset $D$ is varied. Figure 3 displays the functional performance (circuit error) and structural stability (Jaccard index) of circuits discovered under different resampling strategies.

Variance and Model Size. We observe a notable degradation in stability for larger models. While gpt2-small yields relatively clustered results, Llama-3.2 (1B and Instruct) exhibits higher variability. This suggests that MI methods do not trivially scale; identifying reliable "circuits" in more capable models is significantly harder. Interestingly, instruction tuning (Llama-Instruct) does not significantly alter this stability profile compared to the base model.

Multimodality. For gpt2-small, the Jaccard index distribution is sometimes multimodal (visible in the split violins for bootstrap). This implies that the discovery process does not converge to a single solution, but vacillates between distinct, incompatible circuits. This signals non-identifiability: multiple disparate circuits satisfy the scoring criteria equally well.

Sensitivity to Sampling. Table 1 quantifies the impact of the perturbation method. Bootstrap resampling, which mimics the effect of limited sample size, yields the lowest structural consistency (Jaccard μ = 0.561) and highest variability (CV = 0.335). This confirms that the high variance of edge scores (Fig. 2) makes the aggregated mean $\hat{S}$ highly sensitive to the specific composition of the dataset. Conversely, shifting the meta-distribution (meta-dataset/paraphrasing) yields more stable results. This suggests that while the specific edges fluctuate with sampling noise (bootstrap), the general mechanism is somewhat more robust to semantic shifts in the prompt distribution.

The circuits discovered under bootstrap resampling also exhibit the highest average circuit error (0.440), indicating that the resulting circuits are not only structurally different but also less faithful to the original model's behavior, i.e., discovered circuits do not generalize well to small data variations. In contrast, using a meta-dataset or prompt paraphrasing results in more stable circuits, with higher Jaccard indices (resp. 0.790 and 0.799) and lower CVs.

5.3. Methodological Sensitivity: Hyperparameters

We next evaluate the robustness of circuit discovery to the value of hyperparameters. Figure 1 (in the introduction) provides a visual summary of how varying multiple parameters at once leads to a high diversity in circuits found in gpt2-small for the IOI task.

Since the data signal is noisy, we hypothesize that the resulting circuit is heavily influenced by the choices of estimator $E$ and aggregation $A$. Table 2 confirms this sensitivity for Llama-Instruct. In the Greater-Than task, changing the aggregation method of EAP-IG-inputs from "sum" to "median" and the patching method from "mean" to "patching" drops the Jaccard similarity to the median circuit to 0.086, effectively returning almost a disjoint subgraph. In IOI, the overlap between EAP-IG-inputs and Clean-corrupted is also negligible (0.071). This implies that different EAP variants are not converging on the same circuit, but are instead isolating different artifacts of the high-variance edge distribution.

5.4. Sensitivity to Counterfactual Choices

Finally, we explore how the definition of the intervention alters the results. As discussed in Section 3, CMA is defined relative to a specific counterfactual $x_{\mathrm{corr}}$. In noisy intervention setups, $x_{\mathrm{corr}}$ is generated by adding Gaussian noise to the token embedding. Varying the noise amplitude implies changing the experimental question: which components mediate the effect of small vs. large deviations in the input? Intuitively, one might expect mediation results to not be affected by the choice of perturbation.

Figure 4 shows the trajectories of circuit error and Jaccard index for gpt2-small as the noise amplitude increases. We identify a critical regime (amplitude ≈ 0.2) where the CV for Jaccard index peaks. This demonstrates that the "circuit" is not invariant to the magnitude of the perturbation. As the intervention changes, the set of components identified as important shifts, further emphasizing that MI findings are relative to the precise definition of the counterfactual distribution.

[Figure 3: violin plots of circuit error and pairwise Jaccard index (axis range 0.00 to 1.00) for the Greater-Than, IOI, and SVA tasks, across gpt2-small, llama-3.2-1b, and llama-instruct, under Bootstrap, MetaDataset, and Reprompting resampling. Panel annotations: "No faithful circuits found"; "Not Applicable. SVA data is procedurally generated using a custom grammar."]

Figure 3. Stability of EAP-IG circuits across models and tasks. Each point represents one circuit discovered from a resampled dataset. Blue: Circuit error (lower is better). Orange: Pairwise Jaccard index (higher is better).

Table 2. Hyperparameter sensitivity in Llama-3.2-1B-Instruct. We report the circuit error (CErr), size, and Jaccard similarity to the median circuit (computed across all 7 rows) for varying EAP configurations. Results for other models are reported in the appendix.
| Parameters | Greater-Than CErr | Greater-Than Size | Greater-Than Jacc. to Median | IOI CErr | IOI Size | IOI Jacc. to Median | SVA CErr | SVA Size | SVA Jacc. to Median |
|---|---|---|---|---|---|---|---|---|---|
| EAP, sum, patching | 0.20 | 23 | 0.417 | 0.69 | 3 | 0.286 | 0.76 | 18 | 0.536 |
| EAP-IG-activations, sum, patching | 0.20 | 17 | 0.098 | 0.69 | 12 | 0.125 | 0.76 | 24 | 0.531 |
| EAP-IG-inputs, median, patching | 0.20 | 10 | 0.086 | 0.69 | 6 | 1.000 | 0.75 | 21 | 0.840 |
| EAP-IG-inputs, sum, mean | 0.19 | 28 | 1.000 | 0.72 | 7 | 0.182 | 0.73 | 24 | 0.960 |
| EAP-IG-inputs, sum, mean-positional | 0.41 | 33 | 0.298 | 0.82 | 6 | 1.000 | 0.73 | 22 | 0.808 |
| EAP-IG-inputs, sum, patching | 0.20 | 16 | 0.571 | 0.69 | 7 | 0.182 | 0.75 | 25 | 1.000 |
| Clean-corrupted, sum, patching | 0.20 | 16 | 0.419 | 0.69 | 9 | 0.071 | 0.76 | 16 | 0.577 |

[Figure 4: panels for the Greater-Than and IOI tasks, showing circuit error (left) and pairwise Jaccard index (right) against noise amplitude.]

Figure 4. Effect of intervention definition (added noise amplitude) on circuit error (left) and pairwise Jaccard index (right) in gpt2-small. The amplitude parameter effectively redefines the counterfactual input $x_{\mathrm{corr}}$, leading to changes in the identified mechanism. Noise amplitude varies in [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5].

6. Discussion

6.1. Summary of Findings

Our investigation traces the source of the observed instability through the causal analysis pipeline:

• Intrinsic & Estimator Variance. We distinguish two sources of instability. First, the fundamental estimand (the causal effect of an edge) is not a constant but a random variable with high variance across inputs drawn from a single distribution. Second, gradient-based estimators (EAP) amplify this variance, often yielding a signal-to-noise ratio below 1.

• Aggregation Sensitivity. Because the underlying signal is noisy, the global circuit depends heavily on the specific sample used for aggregation. Bootstrap analysis reveals that circuits discovered from the same model on resampled data can exhibit low structural overlap, confirming that single-dataset results are statistically unreliable and not generalizable.
• Dependence on Experimental Definition. The discovered circuits are highly sensitive to experimental design choices. Choices made in the estimation process and in the definition of the counterfactual produce high structural variability in the final circuits. This confirms that these methods do not identify a unique, global mechanism, but rather a structure conditioned on methodological choices.

6.2. Recommendations for a Statistical MI

To help future research mitigate these risks, we propose the following recommendations based on our experimental results:

Report Stability. We strongly advocate for the routine reporting of stability metrics alongside circuit discovery results. Specifically, we recommend that researchers report the variance of circuit structure and performance (e.g., the average pairwise Jaccard index and the CV of the circuit error) under bootstrap resampling of the input data. This practice, common in mature scientific fields (Efron & Tibshirani, 1986; Berengut, 2006), provides necessary measures of uncertainty for the structural estimate.

Quantify Estimator Uncertainty. Given the sensitivity of circuit discovery to hyperparameter settings, it is crucial that researchers transparently report and justify their choices. Researchers should ideally conduct a sensitivity analysis to assess the impact of different hyperparameter settings on the discovered circuits. If a mechanism is only visible under a specific set of hyperparameters, this fragility must be disclosed.

Characterize Intervention Sensitivity. Instead of relying on a single fixed intervention (e.g., mean ablation), we recommend analyzing how the circuit changes as the counterfactual is varied. Sweeping intervention parameters (e.g., the noise amplitude) reveals whether a mechanism is invariant to the strength of the perturbation or specific to a certain regime.
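As a rough illustration of the stability metrics and parameter sweep recommended here, the sketch below computes the average pairwise Jaccard index and the CV of the circuit error for each sweep setting. It is a minimal sketch, not the authors' implementation: the function names (`jaccard`, `coefficient_of_variation`, `stability_report`), the representation of a circuit as a set of edge labels, the use of the population standard deviation in the CV, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard index of two edge sets; two empty circuits count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (NaN if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")


def stability_report(circuits, errors):
    """Summarize structural and functional stability for one sweep setting.

    Expects at least two circuits so that a pairwise comparison exists.
    """
    pairwise = [jaccard(a, b) for a, b in combinations(circuits, 2)]
    return {
        "mean_jaccard": mean(pairwise),
        "cv_jaccard": coefficient_of_variation(pairwise),
        "mean_error": mean(errors),
        "cv_error": coefficient_of_variation(errors),
    }


# Hypothetical toy data: noise amplitude -> (discovered edge sets, circuit errors).
# In practice, each entry would come from rerunning circuit discovery on
# bootstrap resamples of the input data at that amplitude.
toy_runs = {
    0.05: ([{"a->b", "b->c"}, {"a->b", "b->c"}, {"a->b", "b->c", "c->d"}],
           [0.10, 0.11, 0.12]),
    0.2: ([{"a->b"}, {"b->c", "c->d"}, {"a->b", "c->d"}],
          [0.10, 0.20, 0.30]),
}

reports = {amp: stability_report(c, e) for amp, (c, e) in toy_runs.items()}
for amp, rep in sorted(reports.items()):
    print(amp, {k: round(v, 3) for k, v in rep.items()})
```

In this toy sweep, the circuits at the larger amplitude overlap far less than those at the smaller one, which is the kind of regime change a reported sweep would make visible. Whether to use the population or sample standard deviation is a design choice that should be stated alongside the reported CV.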
For example, reporting how circuit stability shifts around a noise level of 0.2 in gpt2-small can help distinguish between core mechanisms and localized artifacts.

6.3. Limitations

While our analysis identifies fundamental instabilities in circuit discovery, several limitations remain. First, our circuit discovery analysis focuses on the EAP family and its variants. While newer methods, such as HAP (Gu et al., 2025) or RelP (Jafari et al., 2025), use different heuristics, they remain downstream of CMA and likely inherit its volatility. However, their specific rules may act as stabilizing regularizers. Second, while we established "intrinsic variance" via exact CMA, computational costs restricted this to gpt2-small on the IOI task; generalizing this fundamental layer of instability to other models and tasks relies on approximation-based evidence. Third, our study is limited to three classic MI tasks with relatively discrete linguistic rules; variance may manifest differently in fuzzier reasoning tasks or open-ended generations. Finally, our stability metrics treat all edges as equally important, whereas weighted stability metrics might reveal a stable "functional core" of the circuit despite a fluctuating periphery.

6.4. Future Directions

Our work opens up several avenues for future research. The high variance of discovered circuits suggests that instead of seeking a single "true" circuit, it might be more fruitful to characterize a distribution over possible circuits.

Probabilistic Circuit Discovery. Since the underlying CMA scores are distributions, the output of an MI method could be a posterior distribution over graphs, rather than a single discrete subgraph. The set of bootstrapped circuits generated in this study serves as a first approximation of such a distribution. Future work could formalize this using Bayesian structure learning approaches.

Decomposing Variance. To improve methods' reliability, future work should aim to decompose the total observed variance into estimator variance (noise from the gradient estimation) and intrinsic variance (true fluctuations in the mechanism across inputs). Reducing estimator variance is an engineering challenge for better approximations, while high intrinsic variance suggests fundamental limits to the universality of specific mechanisms.

Stability-Aware Optimization. Our findings motivate the development of objectives that explicitly optimize for stability. Rather than selecting edges solely based on faithfulness (magnitude of effect), future algorithms could penalize the variance of the edge score across the dataset, prioritizing components that serve as reliable mediators across the dataset, bootstrap resamples or noise perturbations.

While the statistical framework we have proposed is broadly applicable to circuit discovery methods, we encourage the community to adopt similar stability analyses for other interpretability techniques to build a more complete picture of the reliability of MI findings.

Despite recurrent analogies to other sciences like neuroscience (Barrett et al., 2019), biology (Lindsey et al., 2025), or physics (Allen-Zhu & Li, 2023; Allen-Zhu, 2024) of neural networks, the field of MI remains in its early stages. We believe that embracing a statistical estimation framing and its standards of rigor regarding uncertainty quantification is an important step toward becoming a more robust and rigorous field.

Impact Statement

This work aims to improve the scientific rigor and reliability of Mechanistic Interpretability (MI). As MI techniques are increasingly proposed for safety auditing, model alignment, and regulatory compliance, it is critical that these methods produce stable and statistically valid explanations.
Our research highlights the risks of relying on unstable point-estimates, which can lead to unjustified confidence in a model's safety properties or internal mechanisms. By advocating for statistical robustness and best practices in circuit discovery, this work contributes to the development of more trustworthy AI systems and helps ensure that future interpretability tools provide a solid foundation for policy and safety decisions.

References

AI@Meta. Llama 3.2 model card. 2024. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md.
Allen-Zhu, Z. ICML 2024 Tutorial: Physics of Language Models, July 2024. Project page: https://physics.allen-zhu.com/.
Allen-Zhu, Z. and Li, Y. Physics of Language Models: Part 1, Learning Hierarchical Language Structures. SSRN Electronic Journal, May 2023. Full version available at https://ssrn.com/abstract=5250639.
Arendt, P. D., Apley, D. W., Chen, W., Lamb, D., and Gorsich, D. Improving identifiability in model calibration using multiple responses. Journal of Mechanical Design, 134(10):100909, 09 2012. ISSN 1050-0472. doi: 10.1115/1.4007573. URL https://doi.org/10.1115/1.4007573.
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., and Herrera, F. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020. ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2019.12.012. URL https://www.sciencedirect.com/science/article/pii/S1566253519308103.
Barrett, D. G., Morcos, A. S., and Macke, J. H. Analyzing biological and artificial neural networks: challenges with opportunities for synergy? Current opinion in neurobiology, 55:55–64, 2019.
Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.
Berengut, D. Statistics for experimenters: Design, innovation, and discovery. The American Statistician, 60(4):341–342, 2006. doi: 10.1198/000313006X152991. URL https://doi.org/10.1198/000313006X152991.
Bhaskar, A., Wettig, A., Friedman, D., and Chen, D. Finding transformer circuits with edge pruning. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, p. 18506–18534. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/20fdaf67581e6d7157376d1ed584040a-Paper-Conference.pdf.
Bousquet, O. and Elisseeff, A. Stability and generalization. J. Mach. Learn. Res., 2:499–526, March 2002. ISSN 1532-4435. doi: 10.1162/153244302760200704. URL https://doi.org/10.1162/153244302760200704.
Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. Thread: Circuits. Distill, 2020. doi: 10.23915/distill.00024. https://distill.pub/2020/circuits.
Committee, E. S., Benford, D., Halldorsson, T., Jeger, M. J., Knutsen, H. K., More, S., Naegeli, H., Noteborn, H., Ockleford, C., Ricci, A., Rychen, G., Schlatter, J. R., Silano, V., Solecki, R., Turck, D., Younes, M., Craig, P., Hart, A., Von Goetz, N., Koutsoumanis, K., Mortensen, A., Ossendorp, B., Martino, L., Merten, C., Mosbach-Schulz, O., and Hardy, A. Guidance on uncertainty analysis in scientific assessments. EFSA Journal, 16(1):e05123, 2018. doi: https://doi.org/10.2903/j.efsa.2018.5123. URL https://efsa.onlinelibrary.wiley.com/doi/abs/10.2903/j.efsa.2018.5123.
Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=89ia77nZ8u.
Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable LLM feature circuits. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=J6zHcScAo0.
Efron, B. and Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1):54–75, 1986. ISSN 08834237, 21688745. URL http://www.jstor.org/stable/2245500.
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
Fang, J., Jiang, H., Wang, K., Ma, Y., Shi, J., Wang, X., He, X., and Chua, T.-S. Alphaedit: Null-space constrained model editing for language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=HvSytvg3Jh.
Fisher, R. Statistical methods and scientific induction. Journal of the Royal Statistical Society. Series B (Methodological), 17(1):69–78, 1955. ISSN 00359246. URL http://www.jstor.org/stable/2983785.
Geiger, A., Lu, H., Icard, T. F., and Potts, C. Causal abstractions of neural networks. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RmuXDtjDhG.
Gu, H., Nair, V., Kumar, A. A., Lagasse, R., Zhu, K., O’Brien, S., and Panda, A. Discovering transformer circuits via a hybrid attribution and pruning framework. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=hhD5MjHtLi.
Hanna, M., Liu, O., and Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=p4PckNQR8k.
Hanna, M., Pezzelle, S., and Belinkov, Y. Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=grXgesr5dT.
Hedström, A., Weber, L., Krakowczyk, D., Bareeva, D., Motzkus, F., Samek, W., Lapuschkin, S., and Höhne, M. M.-C. Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond. Journal of Machine Learning Research, 24(34):1–11, 2023. URL http://jmlr.org/papers/v24/22-0142.html.
Hoelscher-Obermaier, J., Persson, J., Kran, E., Konstas, I., and Barez, F. Detecting edit failures in large language models: An improved specificity benchmark. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Findings of the Association for Computational Linguistics: ACL 2023, p. 11548–11559, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.733. URL https://aclanthology.org/2023.findings-acl.733/.
Ioannidis, J. P. A. Why most published research findings are false. PLOS Medicine, 2(8):null, 08 2005. doi: 10.1371/journal.pmed.0020124. URL https://doi.org/10.1371/journal.pmed.0020124.
Jafari, F. R., Eberle, O., Khakzar, A., and Nanda, N. Relp: Faithful and efficient circuit discovery via relevance patching. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=5PKPy82sWN.
Jain, S., Kirk, R., Lubana, E. S., Dick, R. P., Tanaka, H., Grefenstette, E., Rocktäschel, T., and Krueger, D. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024. URL https://openreview.net/forum?id=bWimc91mtK.
Lele, S. R. How Should We Quantify Uncertainty in Statistical Inference? Frontiers in Ecology and Evolution, 8, March 2020. ISSN 2296-701X. doi: 10.3389/fevo.2020.00035. URL https://www.frontiersin.org/journals/ecology-and-evolution/articles/10.3389/fevo.2020.00035/full.
Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., Citro, C., Abrahams, D., Carter, S., Hosmer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Thompson, T. B., Zimmerman, S., Rivoire, K., Conerly, T., Olah, C., and Batson, J. On the biology of a large language model. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
Liu, Y., Zhang, Y., and Yeung-Levy, S. Mechanistic interpretability meets vision language models: Insights and limitations. In ICLR Blogposts 2025, 2025. URL https://d2jud02ci9yv69.cloudfront.net/2025-04-28-vlm-understanding-29/blog/vlm-understanding/.
Mayo, D. Error and the growth of experimental knowledge. Bibliovault OAI Repository, the University of Chicago Press, 92, 04 1998. doi: 10.1002/(SICI)1520-6696(199823)34:43.0.CO;2-E.
Méloux, M., Maniu, S., Portet, F., and Peyrard, M. Everything, everywhere, all at once: Is mechanistic interpretability identifiable? In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=5IWJBStfU7.
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in gpt. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088.
Meng, K., Sharma, A. S., Andonian, A. J., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MkbcAHIYgyS.
Miller, J., Chughtai, B., and Saunders, W. Transformer circuit evaluation metrics are not robust. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=zSf8PJyQb2.
Mondorf, P., Wang, M., Gerstner, S., Hakimi, A. D., Liu, Y., Veloso, L., Zhou, S., Schuetze, H., and Plank, B. BlackboxNLP-2025 MIB shared task: Exploring ensemble strategies for circuit localization methods. In Belinkov, Y., Mueller, A., Kim, N., Mohebbi, H., Chen, H., Arad, D., and Sarti, G. (eds.), Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, p. 537–542, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-346-3. doi: 10.18653/v1/2025.blackboxnlp-1.31. URL https://aclanthology.org/2025.blackboxnlp-1.31/.
Monea, G., Peyrard, M., Josifoski, M., Chaudhary, V., Eisner, J., Kiciman, E., Palangi, H., Patra, B., and West, R. A glitch in the matrix? locating and detecting language model grounding with fakepedia. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 6828–6844, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.369. URL https://aclanthology.org/2024.acl-long.369/.
Mueller, A., Brinkmann, J., Li, M., Marks, S., Pal, K., Prakash, N., Rager, C., Sankaranarayanan, A., Sharma, A. S., Sun, J., Todd, E., Bau, D., and Belinkov, Y. The quest for the right mediator: Surveying mechanistic interpretability through the lens of causal mediation analysis, 2025. URL https://arxiv.org/abs/2408.01416.
Méloux, M., Dirupo, G., Portet, F., and Peyrard, M. The dead salmons of ai interpretability, 2025. URL https://arxiv.org/abs/2512.18792.
Newman, B., Ang, K.-S., Gong, J., and Hewitt, J. Refining targeted syntactic evaluation of language models. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p. 3710–3723, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.290. URL https://aclanthology.org/2021.naacl-main.290/.
Nikankin, Y., Arad, D., Itzhak, I., Reusch, A., Simhi, A., Kesten, G., and Belinkov, Y. BlackboxNLP-2025 MIB shared task: Improving circuit faithfulness via better edge selection. In Belinkov, Y., Mueller, A., Kim, N., Mohebbi, H., Chen, H., Arad, D., and Sarti, G. (eds.), Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, p. 521–527, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-346-3. doi: 10.18653/v1/2025.blackboxnlp-1.29. URL https://aclanthology.org/2025.blackboxnlp-1.29/.
Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., and Mordvintsev, A. The building blocks of interpretability. Distill, 2018. doi: 10.23915/distill.00010. https://distill.pub/2018/building-blocks.
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
Pearl, J. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, p. 411–420, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.
Prakash, N., Shaham, T. R., Haklay, T., Belinkov, Y., and Bau, D. Fine-tuning enhances existing mechanisms: A case study on entity tracking. In Proceedings of the 2024 International Conference on Learning Representations, 2024. arXiv:2402.14811.
Preston, S. P., Wilkinson, R. D., Clayton, R. H., Chappell, M. J., and Mirams, G. R. Think before you fit: Parameter identifiability, sensitivity and uncertainty in systems biology models. Current Opinion in Systems Biology, 42:100563, 2025. ISSN 2452-3100. doi: https://doi.org/10.1016/j.coisb.2025.100563. URL https://www.sciencedirect.com/science/article/pii/S245231002500023X.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.
Rauker, T., Ho, A., Casper, S., and Hadfield-Menell, D. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), p. 464–483, Los Alamitos, CA, USA, February 2023. IEEE Computer Society. doi: 10.1109/SaTML54575.2023.00039. URL https://doi.ieeecomputersociety.org/10.1109/SaTML54575.2023.00039.
Shi, C., Beltran-Velez, N., Nazaret, A., Zheng, C., Garriga-Alonso, A., Jesson, A., Makar, M., and Blei, D. Hypothesis testing the circuit hypothesis in LLMs. In ICML 2024 Workshop on Mechanistic Interpretability, 2024a. URL https://openreview.net/forum?id=ibSNv9cldu.
Shi, C., Beltran-Velez, N., Nazaret, A., Zheng, C., Garriga-Alonso, A., Jesson, A., Makar, M., and Blei, D. M. Hypothesis testing the circuit hypothesis in llms. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024b. Curran Associates Inc. ISBN 9798331314385.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, p. 3319–3328. JMLR.org, 2017.
Syed, A., Rager, C., and Conmy, A. Attribution patching outperforms automated circuit discovery. In NeurIPS Workshop on Attributing Model Behavior at Scale, 2023. URL https://openreview.net/forum?id=tiLbFR4bJW.
Syed, A., Rager, C., and Conmy, A. Attribution patching outperforms automated circuit discovery. In Belinkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., and Chen, H. (eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, p. 407–416, Miami, Florida, US, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.blackboxnlp-1.25. URL https://aclanthology.org/2024.blackboxnlp-1.25/.
VanderWeele, T. J. Explanation in causal inference: developments in mediation and interaction. International Journal of Epidemiology, 45(6):1904–1908, 11 2016. ISSN 0300-5771. doi: 10.1093/ije/dyw277. URL https://doi.org/10.1093/ije/dyw277.
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, p. 12388–12401. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf.
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. M. Causal mediation analysis for interpreting neural nlp: The case of gender bias. ArXiv, abs/2004.12265, 2020b. URL https://api.semanticscholar.org/CorpusID:216553696.
Walke, F., Bennek, L., and Winkler, T. J. Artificial intelligence explainability requirements of the ai act and metrics for measuring compliance. In Beverungen, D., Lehrer, C., and Trier, M. (eds.), Solutions and Technologies for Responsible Digitalization, p. 113–129, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-80122-8.
Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392, 2020. doi: 10.1162/tacl_a_00321. URL https://aclanthology.org/2020.tacl-1.25/.
Yu, L., Niu, J., Zhu, Z., and Penn, G. Functional faithfulness in the wild: Circuit discovery with differentiable computation graph pruning. CoRR, abs/2407.03779, 2024. URL https://doi.org/10.48550/arXiv.2407.03779.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision – ECCV 2014, p. 818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1.
Zhang, F. and Nanda, N. Towards best practices of activation patching in language models: Metrics and methods. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Hf17y6u9BC.
Zhang, L., Dong, W., Zhang, Z., Yang, S., Hu, L., Liu, N., Zhou, P., and Wang, D. Eap-gp: Mitigating saturation effect in gradient-based automated circuit identification. CoRR, abs/2502.06852, February 2025. URL https://doi.org/10.48550/arXiv.2502.06852.
Zidek, J. V. and van Eeden, C. Uncertainty, entropy, variance and the effect of partial information. Lecture Notes-Monograph Series, 42:155–167, 2003. ISSN 07492170. URL http://www.jstor.org/stable/4356236.

Additional plots

We report in Figure 5 the pairwise Jaccard index for all 125 circuits from Figure 1.

[Figure 5: heatmap titled "Pairwise Jaccard Similarity Between Circuits"; both axes: Circuit Index (0 to 123); color scale: Jaccard Similarity Index (0.0 to 1.0).]

Figure 5. Full heatmap of the pairwise Jaccard index between circuits displayed in Figure 1 (circuits found in gpt2-small on the Greater-Than task while varying all parameters).

Tables 3, 4, and 5 contain numerical values for the metrics reported in the violin plots of Figure 3. Table 6 is a more detailed version of Table 2, which also reports KL divergence. Tables 7 and 8 contain the equivalent data for Llama-3.2-1B (non-instruct) and gpt2-small, respectively. Figure 6 reports the CV of the faithfulness metrics for the noise experiments described in Section 5.4 and Figure 4. Table 9 is a more detailed equivalent of Figure 4, reporting KL divergence in addition to other metrics.

Table 3. Aggregated results from Figure 3 for bootstrap resampling.
| Task | Model Name | Circuit Error μ | Circuit Error σ² | Circuit Error CV | KL Divergence μ | KL Divergence σ² | KL Divergence CV | Pairwise Jaccard μ | Pairwise Jaccard σ² | Pairwise Jaccard CV |
|---|---|---|---|---|---|---|---|---|---|---|
| Greater-Than | Llama-3.2-1B | 0.21 | 4.67·10⁻⁴ | 0.10 | 6.91·10⁻⁷ | 1.29·10⁻¹⁴ | 0.16 | 0.42 | 5.93·10⁻³ | 0.18 |
| Greater-Than | Llama-3.2-1B-Instruct | 0.21 | 5.94·10⁻⁴ | 0.12 | 6.43·10⁻⁷ | 6.50·10⁻¹⁶ | 0.04 | 0.33 | 1.36·10⁻² | 0.36 |
| IOI | Llama-3.2-1B | 0.66 | 2.51·10⁻³ | 0.08 | 5.48·10⁻⁶ | 1.29·10⁻¹³ | 0.07 | 0.39 | 1.07·10⁻¹ | 0.85 |
| IOI | Llama-3.2-1B-Instruct | 0.69 | 2.62·10⁻³ | 0.07 | 9.26·10⁻⁶ | 4.44·10⁻¹³ | 0.07 | 0.34 | 6.72·10⁻² | 0.76 |
| IOI | gpt2-small | 0.11 | 7.32·10⁻⁴ | 0.24 | 1.23·10⁻⁶ | 8.80·10⁻¹⁴ | 0.24 | 0.67 | 1.57·10⁻² | 0.19 |
| SVA | Llama-3.2-1B | 0.80 | 1.02·10⁻³ | 0.04 | 1.61·10⁻⁵ | 4.02·10⁻¹³ | 0.04 | 0.66 | 1.55·10⁻² | 0.19 |
| SVA | Llama-3.2-1B-Instruct | 0.75 | 1.04·10⁻³ | 0.04 | 1.87·10⁻⁵ | 3.97·10⁻¹³ | 0.03 | 0.69 | 1.20·10⁻² | 0.16 |
| SVA | gpt2-small | 0.08 | 5.00·10⁻⁴ | 0.29 | 0 | 0 | | 1.00 | 0 | 0.00 |

Table 4. Aggregated results from Figure 3 for meta-dataset resampling.

| Task | Model Name | Circuit Error μ | Circuit Error σ² | Circuit Error CV | KL Divergence μ | KL Divergence σ² | KL Divergence CV | Pairwise Jaccard μ | Pairwise Jaccard σ² | Pairwise Jaccard CV |
|---|---|---|---|---|---|---|---|---|---|---|
| Greater-Than | Llama-3.2-1B | 0.24 | 3.06·10⁻⁵ | 0.02 | 5.58·10⁻⁷ | 3.56·10⁻¹⁶ | 0.03 | 0.74 | 8.17·10⁻³ | 0.12 |
| Greater-Than | Llama-3.2-1B-Instruct | 0.18 | 1.05·10⁻⁴ | 0.06 | 6.46·10⁻⁷ | 1.31·10⁻¹⁶ | 0.02 | 0.51 | 1.83·10⁻² | 0.27 |
| IOI | Llama-3.2-1B | 0.15 | 1.67·10⁻⁴ | 0.09 | 5.75·10⁻⁷ | 6.68·10⁻¹⁶ | 0.04 | 0.86 | 1.25·10⁻² | 0.13 |
| IOI | Llama-3.2-1B-Instruct | 0.22 | 3.30·10⁻⁴ | 0.08 | 6.19·10⁻⁷ | 1.53·10⁻¹⁵ | 0.06 | 0.76 | 2.13·10⁻² | 0.19 |
| IOI | gpt2-small | 0.03 | 5.23·10⁻⁵ | 0.22 | 4.72·10⁻⁵ | 1.91·10⁻¹² | 0.03 | 0.88 | 5.75·10⁻³ | 0.09 |
| SVA | Llama-3.2-1B | 0.77 | 3.60·10⁻⁴ | 0.02 | 1.54·10⁻⁵ | 8.18·10⁻¹⁴ | 0.02 | 0.80 | 1.06·10⁻² | 0.13 |
| SVA | Llama-3.2-1B-Instruct | 0.74 | 2.52·10⁻⁴ | 0.02 | 1.84·10⁻⁵ | 2.05·10⁻¹³ | 0.02 | 0.77 | 1.07·10⁻² | 0.13 |
| SVA | gpt2-small | 0.06 | 2.18·10⁻⁴ | 0.23 | 0 | 0 | | 1.00 | 0 | 0.00 |

Table 5. Aggregated results from Figure 3 for prompt paraphrasing.
| Task | Model Name | Circuit Error μ | Circuit Error σ² | Circuit Error CV | KL Divergence μ | KL Divergence σ² | KL Divergence CV | Pairwise Jaccard Index μ | Pairwise Jaccard Index σ² | Pairwise Jaccard Index CV |
|---|---|---|---|---|---|---|---|---|---|---|
| Greater-Than | Llama-3.2-1B | 0.22 | 7.77·10^-5 | 0.04 | 7.09·10^-7 | 2.05·10^-15 | 0.06 | 0.64 | 1.42·10^-2 | 0.19 |
| Greater-Than | Llama-3.2-1B-Instruct | 0.17 | 7.46·10^-5 | 0.05 | 5.43·10^-7 | 1.04·10^-16 | 0.02 | 0.85 | 4.20·10^-3 | 0.08 |
| IOI | Llama-3.2-1B | 0.16 | 1.66·10^-4 | 0.08 | 5.42·10^-7 | 9.45·10^-16 | 0.06 | 0.88 | 1.01·10^-2 | 0.11 |
| IOI | Llama-3.2-1B-Instruct | 0.18 | 3.44·10^-4 | 0.10 | 6.06·10^-7 | 1.43·10^-15 | 0.06 | 0.74 | 1.80·10^-2 | 0.18 |
| IOI | gpt2-small | 0.01 | 2.27·10^-5 | 0.40 | 4.31·10^-5 | 1.42·10^-12 | 0.03 | 0.89 | 7.66·10^-3 | 0.10 |

Table 6. Detailed results for Table 2, including KL divergence.

| Parameters | Greater-Than CErr | Greater-Than KL-Div | Greater-Than Size | Greater-Than Jacc. to Median | IOI CErr | IOI KL-Div | IOI Size | IOI Jacc. to Median | SVA CErr | SVA KL-Div | SVA Size | SVA Jacc. to Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EAP, sum, patching | 0.20 | 6.4·10^-7 | 23 | 0.417 | 0.69 | 9.1·10^-6 | 3 | 0.286 | 0.76 | 1.9·10^-5 | 18 | 0.536 |
| EAP-IG-activations, sum, patching | 0.20 | 6.4·10^-7 | 17 | 0.098 | 0.69 | 9.1·10^-6 | 12 | 0.125 | 0.76 | 1.9·10^-5 | 24 | 0.531 |
| EAP-IG-inputs, median, patching | 0.20 | 6.4·10^-7 | 10 | 0.086 | 0.69 | 9.1·10^-6 | 6 | 1.000 | 0.75 | 1.9·10^-5 | 21 | 0.840 |
| EAP-IG-inputs, sum, mean | 0.19 | 7.1·10^-7 | 28 | 1.000 | 0.72 | 9.3·10^-6 | 7 | 0.182 | 0.73 | 1.6·10^-5 | 24 | 0.960 |
| EAP-IG-inputs, sum, mean-positional | 0.41 | 5.7·10^-6 | 33 | 0.298 | 0.82 | 1.7·10^-5 | 6 | 1.000 | 0.73 | 1.7·10^-5 | 22 | 0.808 |
| EAP-IG-inputs, sum, patching | 0.20 | 6.4·10^-7 | 16 | 0.571 | 0.69 | 9.1·10^-6 | 7 | 0.182 | 0.75 | 1.8·10^-5 | 25 | 1.000 |
| clean-corrupted, sum, patching | 0.20 | 6.4·10^-7 | 16 | 0.419 | 0.69 | 9.1·10^-6 | 9 | 0.071 | 0.76 | 1.9·10^-5 | 16 | 0.577 |

Table 7. Comparison of the circuits found in Llama-3.2-1B, using a similar setup to that of Table 2.

| Parameters | Greater-Than CErr | Greater-Than KL-Div | Greater-Than Size | Greater-Than Jacc. to Median | IOI CErr | IOI KL-Div | IOI Size | IOI Jacc. to Median | SVA CErr | SVA KL-Div | SVA Size | SVA Jacc. to Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EAP, sum, patching | - | - | - | - | 0.64 | 5.4·10^-6 | 7 | 0.400 | 0.80 | 1.6·10^-5 | 16 | 0.355 |
| EAP-IG-activations, sum, patching | - | - | - | - | 0.64 | 5.4·10^-6 | 117 | 0.042 | 0.80 | 1.6·10^-5 | 28 | 0.421 |
| EAP-IG-inputs, median, patching | - | - | - | - | 0.65 | 5.4·10^-6 | 11 | 0.385 | 0.80 | 1.6·10^-5 | 24 | 0.923 |
| EAP-IG-inputs, sum, mean | - | - | - | - | 0.67 | 5.4·10^-6 | 5 | 0.714 | 0.75 | 1.4·10^-5 | 26 | 1.000 |
| EAP-IG-inputs, sum, mean-positional | - | - | - | - | 0.77 | 8.8·10^-6 | 8 | 0.500 | 0.69 | 1.5·10^-5 | 25 | 0.962 |
| EAP-IG-inputs, sum, patching | 0.23 | 6.0·10^-7 | 21 | - | 0.65 | 5.4·10^-6 | 7 | 1.000 | 0.80 | 1.6·10^-5 | 26 | 1.000 |
| clean-corrupted, sum, patching | - | - | - | - | 0.59 | 5.2·10^-6 | 448 | 0.016 | 0.80 | 1.6·10^-5 | 16 | 0.355 |

Table 8. Comparison of the circuits found in gpt2-small, using a similar setup to that of Table 2.

| Parameters | IOI CErr | IOI KL-Div | IOI Size | IOI Jacc. to Median | SVA CErr | SVA KL-Div | SVA Size | SVA Jacc. to Median |
|---|---|---|---|---|---|---|---|---|
| EAP, sum, patching | 0.10 | 1.2·10^-6 | 12 | 0.391 | 0.06 | 0 | 1 | 1.000 |
| EAP-IG-activations, sum, patching | 0.10 | 1.3·10^-6 | 5 | 0.042 | 0.05 | 0 | 21 | 0.000 |
| EAP-IG-inputs, median, patching | 0.11 | 1.2·10^-6 | 20 | 1.000 | 0.06 | 0 | 1 | 1.000 |
| EAP-IG-inputs, sum, mean | 0.12 | 1.3·10^-6 | 20 | 1.000 | 0.07 | 3.2·10^-6 | 1 | 1.000 |
| EAP-IG-inputs, sum, mean-positional | 0.14 | 2.1·10^-5 | 21 | 0.783 | 0.08 | 1.6·10^-5 | 1 | 1.000 |
| EAP-IG-inputs, sum, patching | 0.11 | 1.2·10^-6 | 20 | 1.000 | 0.06 | 0 | 1 | 1.000 |
| EAP-IG-inputs, sum, zero | - | - | - | - | 0.00 | 0 | 1 | 1.000 |
| clean-corrupted, sum, patching | 0.11 | 1.2·10^-6 | 19 | 0.696 | 0.06 | 0 | 1 | 1.000 |

[Figure 6: line plot; x-axis: Noise Amplitude (0.01 to 5); y-axis: Coeff. of Variation (CV); series: Circuit Error CV, Jaccard Index CV]

Figure 6. CV of circuit metrics for different noise amplitudes in gpt2-small, averaged across tasks.

Table 9. Detailed results for Table 4, including KL divergence. Values are plotted for noise amplitudes in [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5].

[Table 9: grid of plots; columns: Circuit Error, KL Divergence, Pairwise Jaccard Index; rows: Greater-Than, IOI]
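
The summary statistics reported in Tables 3 to 5 (mean μ, variance σ², CV, and pairwise Jaccard index) can be sketched from raw per-run values as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each circuit is assumed to be a set of edge identifiers, the CV is taken to be σ/μ (which agrees with the tabulated values), and the choice of population rather than sample variance is an assumption.

```python
from itertools import combinations
from statistics import mean, pvariance


def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two edge sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(circuits):
    """Jaccard index for every unordered pair of circuits."""
    return [jaccard(a, b) for a, b in combinations(circuits, 2)]


def summarize(values):
    """Return (mean, variance, CV), with CV = std / mean (None when the mean is 0)."""
    mu = mean(values)
    var = pvariance(values, mu)  # assumption: population variance
    cv = (var ** 0.5) / mu if mu != 0 else None
    return mu, var, cv


# Hypothetical circuits from three resampling runs, as sets of (source, target) edges.
circuits = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "d")},
    {("a", "b"), ("b", "c")},
]
mu, var, cv = summarize(pairwise_jaccard(circuits))
```

Returning `None` for the CV when the mean is zero reflects that the ratio is undefined in that case, as for a KL divergence that is identically zero across runs.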