Paper deep dive
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek
Models: GPT-2 Small, Gemma-2-2B, Llama-3.1-8B, Qwen-2.5-0.5B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/11/2026, 12:33:48 AM
Summary
The BlackboxNLP 2025 Shared Task evaluates mechanistic interpretability (MI) techniques using the Mechanistic Interpretability Benchmark (MIB). The task focuses on two tracks: circuit localization (identifying causally influential components) and causal variable localization (mapping activations to interpretable features). Participants demonstrated significant performance gains using ensemble strategies, regularization, and non-linear projections, with the MIB leaderboard remaining open for ongoing research.
Entities (6)
Relation Signals (3)
BlackboxNLP 2025 Shared Task → features track → Circuit Localization
confidence 100% · The shared task features two tracks: circuit localization
BlackboxNLP 2025 Shared Task → features track → Causal Variable Localization
confidence 100% · and causal variable localization
BlackboxNLP 2025 Shared Task → utilizes → Mechanistic Interpretability Benchmark
confidence 100% · the BlackboxNLP 2025 Shared Task employs this benchmark as part of a community-wide effort
Cypher Suggestions (2)
Find all tasks associated with the BlackboxNLP 2025 Shared Task. · confidence 90% · unvalidated
MATCH (e:Event {name: 'BlackboxNLP 2025 Shared Task'})-[:FEATURES_TRACK]->(t:Task) RETURN t.name
List all language models evaluated in the benchmark. · confidence 85% · unvalidated
MATCH (m:LanguageModel) RETURN m.name
Abstract
Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.
Tags
Links
- Source: https://arxiv.org/abs/2511.18409
- Canonical: https://arxiv.org/abs/2511.18409
- Code: https://github.com/mib-bench
PDF not stored locally. Use the link above to view on the source site.
Full Text
42,604 characters extracted from source content.
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

Dana Arad (1), Yonatan Belinkov (1,7), Hanjie Chen (2), Najoung Kim (3), Hosein Mohebbi (4), Aaron Mueller (3), Gabriele Sarti (5), Martin Tutek (6)

(1) Technion – IIT, (2) Rice University, (3) Boston University, (4) Tilburg University, (5) University of Groningen, (6) University of Zagreb, (7) Harvard University

Abstract

Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.[1]

1 Introduction

The field of mechanistic interpretability (MI) is advancing rapidly, yet systematically comparing the efficacy of emerging methods remains challenging.
The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) addresses this gap by providing a standardized framework for evaluating techniques that identify circuits and localize latent causal variables in language models (LMs).

[1] https://hf.co/spaces/mib-bench/leaderboard

[Figure 1: Overview of the evaluation method for each track in MIB. The circuit localization track requires uploading multiple circuits or importance scores for each component; we evaluate by taking the area under the faithfulness curve across circuit sizes. The causal variable localization track requires uploading a featurizer and location; we evaluate by intervening on the concept in the featurized space and measuring whether the model's behavior changes in the expected way. Figure reproduced from Mueller et al. (2025) with permission.]

Building on this foundation, the BlackboxNLP 2025 Shared Task employs this benchmark as part of a community-wide effort aimed at accelerating progress in MI research.

The shared task comprises two tracks. The circuit localization track (§3) evaluates methods able to identify a minimal set of model components necessary to produce a given behavior, such as attribution patching (Nanda, 2023) or information flow routes (Ferrando and Voita, 2024). The causal variable localization track (§4) compares methods that featurize activation vectors into more human-interpretable concepts, e.g., sparse autoencoders (SAEs; Huben et al., 2024) or distributed alignment search (DAS; Geiger et al., 2024).
Submissions across both tracks are evaluated by their ability to precisely and concisely recover relevant causal pathways or causal variables in LMs.

We received submissions from four teams across the two tracks, spanning ten methods. Despite the relatively small number of submissions, the participating teams achieved notable performance gains across both tracks. In the circuit localization track, ensembling strategies and regularization techniques that filter components with unstable contributions to model behavior proved particularly effective, suggesting promising directions for future circuit discovery research. In the causal variable localization track, methods leveraging non-linear activation functions and/or multi-layer perceptrons during training demonstrated substantial improvements.

The MIB leaderboard will remain open for ongoing submissions to both tracks, encouraging continued participation and reproducibility.

2 Data and Models

Here, we summarize the details of MIB's evaluation methods and metrics. Both tracks evaluate across four tasks representing various reasoning types, difficulty levels, and answer formats. These tasks include Indirect Object Identification (IOI), Multiple-choice Question Answering (MCQA), Arithmetic (addition and subtraction), and the AI2 Reasoning Challenge (ARC). The causal variable localization track additionally includes RAVEL (Huang et al., 2024a). Below, we summarize the format of each task and the size of their datasets (§2.1).

2.1 Tasks

The number of instances in each dataset and split is summarized in Table 1. Each task comes with a training split on which users can discover circuits or causal variables, and a validation split on which users can tune their methods or hyperparameters.
We also create two test sets per task: public and private. The public test set enables faster iteration on methods. We release the train, validation, and public test sets on HuggingFace. The private test set is not visible to users; they must upload either their circuits or their featurizers to the HuggingFace leaderboard, where they are then queued for evaluation on the private test set.

| Dataset | Train | Validation | Test (Public/Private) |
|---|---|---|---|
| IOI | 10000 | 10000 | 1000/1000 |
| MCQA | 110 | 50 | 50/50 |
| Arithmetic (+) | 34400 | 4920 | 1000/1000 |
| Arithmetic (−) | 17400 | 2484 | 1000/1000 |
| ARC (Easy) | 2251 | 570 | 1188/1188 |
| ARC (Challenge) | 1119 | 299 | 586/586 |
| RAVEL | 100000 | 16000 | 1000 |

Table 1: Dataset sizes and splits. The train, validation, and public test sets are available on HuggingFace. One may only evaluate on the private test set by uploading their circuit(s) or featurizer to the MIB leaderboard.

Indirect Object Identification (IOI). The indirect object identification (IOI) task, first proposed by Wang et al. (2023), is one of the most studied tasks in MI. IOI has sentences like "When Mary and John went to the store, John gave an apple to _", containing a subject ("John") and an indirect object ("Mary"), which should be completed with the indirect object. Even small LMs can achieve high accuracy; thus, it has been well studied (Huben et al., 2024; Conmy et al., 2023; Merullo et al., 2024). All names tokenize to a single token for all models in MIB, with the private test set containing names and direct objects that are not contained in the public train or test set.

Arithmetic. Math-related tasks are common in MI (Stolfo et al., 2023; Nanda et al., 2023; Zhang et al., 2024; Nikankin et al., 2025b) and interpretability research more broadly (Liu et al., 2023; Huang et al., 2024b). Following Stolfo et al., MIB defines the task as performing operations with two operands of up to two digits each. Given a pair of numbers and an operator, the model must predict the outcome, e.g., "What is the sum of 13 and 25?".
Multiple-choice question answering (MCQA). MCQA is a common task format on LM evaluation benchmarks, though only a few MI works have studied it (Lieberum et al., 2023; Wiegreffe et al., 2025; Li and Gao, 2024). The dataset is designed to isolate a model's MCQA ability from any task-specific knowledge (Wiegreffe et al., 2025); the information needed to answer the questions is contained in the prompt. Questions are about objects' colors and have four choices, such as:

Question: A box is brown. What color is a box?
A. gray
B. black
C. white
D. brown
Answer: D

AI2 Reasoning Challenge (ARC). The ARC dataset (Clark et al., 2018) comprises grade-school-level multiple-choice science questions. This is a representative task for evaluating basic scientific knowledge in LMs (Brown et al., 2020; Jiang et al., 2023; Dubey et al., 2024). MIB follows the dataset's original partition into Easy and Challenge subsets and analyzes them separately, owing to the large accuracy difference between the two subsets. MIB maintains the original 4-choice multiple-choice prompt formatting, making this dataset related in format to, but more challenging than, MCQA.

Resolving Attribute-Value Entanglements in Language Models (RAVEL). RAVEL (Huang et al., 2024a) evaluates methods for isolating attributes of an entity. We include the split of RAVEL for disentangling the country, continent, and language attributes of cities. The prompts are queries about a certain attribute, e.g., "Paris is on the continent of", and the model must generate the correct completion, in this case Europe.

2.2 Counterfactual Inputs

For both MIB tracks, counterfactual interventions on model components form the basis for all evaluations. Here, components are set to the value they would take under a counterfactual input. In the circuit localization track, activation patching is used to push models towards answering in an opposite manner to how they would naturally answer given the input.
Success is achieved in this setting when counterfactual interventions to components outside the circuit minimally change the model's predictions. In the causal variable localization track, activation patching is used to precisely manipulate specific concepts. Success is achieved in this setting when a variable in a causal model is a faithful summary of the role a model component plays in input-output behavior, i.e., interventions on the variable have the same effect as interventions on the model component.

MIB provides counterfactual inputs for each train, validation, and test sample, where the mappings from the original inputs to the counterfactual inputs are fixed to ensure consistency in evaluation.

2.3 Models

MIB comprises four models that cover a range of model sizes, families, capability levels, and prominence in MI: Llama-3.1 8B (Dubey et al., 2024), Gemma-2 2B (Riviere et al., 2024), Qwen-2.5 0.5B (Yang et al., 2024), and GPT-2 Small (117M; Radford et al., 2019). Mueller et al. (2025) benchmark each model on each task and report performance. They focus specifically on model/task combinations where the model achieves at least 75% accuracy on the task; we do the same.

3 Circuit Localization Track

The circuit localization track centers on evaluating how well a method can discover causal subgraphs C of a computation graph; these are more commonly known as circuits (Olah et al., 2020). The purpose of circuits is to localize the mechanisms underlying how a full neural network N performs a given task. A circuit C is a graph consisting of nodes and edges between components in N. Nodes are typically submodules or attention heads (e.g., the layer 5 MLP, or attention head 10 at layer 12); edges reflect information flow between a pair of nodes.

A typical circuit discovery pipeline consists of two stages: (1) scoring the full set of graph components (nodes, edges, etc.), and (2) selecting a subset of the components that constitute the circuit.
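To make the two-stage pipeline concrete, here is a minimal sketch; the edge names, scores, and `score_fn` interface are hypothetical placeholders for illustration, not MIB's actual data format or API.

```python
# Minimal sketch of the two-stage circuit discovery pipeline:
# (1) score every edge of the computation graph, (2) select a subset as the circuit.
# Edge names and scores below are made up for illustration.

def discover_circuit(edges, score_fn, k):
    """Score all edges with score_fn, then keep the top-k by absolute score."""
    scores = {edge: score_fn(edge) for edge in edges}            # stage 1: scoring
    ranked = sorted(scores, key=lambda e: abs(scores[e]), reverse=True)
    return set(ranked[:k])                                       # stage 2: selection

# Toy usage: edges are (source, target) pairs between named components.
toy_scores = {
    ("embed", "a5.h3"): 0.9,
    ("a5.h3", "mlp7"): -0.4,
    ("mlp7", "logits"): 0.05,
}
circuit = discover_circuit(toy_scores, toy_scores.get, k=2)
assert circuit == {("embed", "a5.h3"), ("a5.h3", "mlp7")}
```

Real methods differ in how the stage-1 scores are produced (e.g., activation patching, attribution patching, integrated gradients) and in how stage 2 selects components, which is exactly the design space the submissions below explore.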
3.1 Metrics

MIB defines two circuit localization metrics: the integrated circuit performance ratio (CPR), and the integrated circuit-model distance (CMD). CPR measures whether a series of circuits includes components with a positive effect on model performance on the task; higher is better. CMD measures whether a series of circuits yields the same strength of preference for the correct answer as the full model; 0 is best, and corresponds to no difference between the circuit and full model behavior with respect to predicting the correct answer. Intuitively, CPR may be more useful for finding circuits that cause the model to perform well on the task, while CMD may be more useful when the aim is to explain the full algorithm the model implements to perform some behavior (including cases where the behavior is not desirable).

Given a circuit C and the full model N, faithfulness f is defined as:

f(C, N; m) = (m(C) − m(∅)) / (m(N) − m(∅)),   (1)

where m is the logit difference y′ − y between the correct answer y given the original input x and the correct answer y′ given the counterfactual input x′.

Thus, CPR is computed as the area under the faithfulness curve with respect to circuit size. Following Mueller et al. (2025), we approximate this area using a Riemann sum over f computed across circuit sizes. CMD is computed as the area between the faithfulness curve and 1; we also approximate this using a Riemann sum.

Measuring circuit size. MIB treats including a node as equivalent to including all of its outgoing edges, and treats including one neuron[2] of the d_model neurons in submodule u as including all outgoing edges from u at 1/d_model of the degree they would have had if all neurons in u were included. Under these assumptions, MIB defines the weighted edge count:

|C| = Σ_{(u,v)∈C} |N_u ∩ N_C| / |N_u|,   (2)

where u and v are nodes (submodules), N_u is the set of neurons in u (the size of which is typically d_model), and N_C is the set of neurons in the circuit.
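As a sketch of how these metrics could be computed, assuming faithfulness has already been evaluated on a grid of circuit sizes; the function names and toy numbers are ours, not the benchmark's implementation.

```python
# Sketch of the metrics above. Faithfulness follows Eq. (1); CPR and CMD are
# left Riemann sums over the faithfulness curve; the weighted edge count
# follows Eq. (2). All inputs here are toy values.

def faithfulness(m_circuit, m_empty, m_full):
    """Eq. (1): f(C, N; m) = (m(C) - m(empty)) / (m(N) - m(empty))."""
    return (m_circuit - m_empty) / (m_full - m_empty)

def riemann_area(sizes, values):
    """Left Riemann sum of values over the circuit-size axis."""
    return sum(v * (s1 - s0) for v, s0, s1 in zip(values, sizes, sizes[1:]))

def cpr(sizes, f_values):
    """Area under the faithfulness curve; higher is better."""
    return riemann_area(sizes, f_values)

def cmd(sizes, f_values):
    """Area between the faithfulness curve and 1; 0 is best."""
    return riemann_area(sizes, [abs(1.0 - f) for f in f_values])

def weighted_edge_count(edges, neurons_in_node, neurons_in_circuit):
    """Eq. (2): |C| = sum over edges (u, v) of |N_u intersect N_C| / |N_u|."""
    return sum(
        len(neurons_in_node[u] & neurons_in_circuit) / len(neurons_in_node[u])
        for u, _v in edges
    )

# Toy example: a circuit that is only partly faithful at small sizes.
sizes = [0.0, 0.5, 1.0]
f_values = [0.2, 0.8, 1.0]
assert abs(cpr(sizes, f_values) - 0.5) < 1e-9   # 0.2*0.5 + 0.8*0.5
assert abs(cmd(sizes, f_values) - 0.5) < 1e-9   # 0.8*0.5 + 0.2*0.5
```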
This count is then normalized by the number of possible edges to obtain a percentage.

3.2 Submission Procedure

All results below are computed on the private test splits for each task. To evaluate on the private test split, participants were first required to upload their circuits to a HuggingFace repository.[3] The faithfulness evaluation required 9 circuits of different sizes; we expected one circuit C_k for each k ∈ K, where k is the maximum proportion of components in N that are allowed to remain in the circuit. Here, K = {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.

For each model/task combination, a folder of circuits was required. Each circuit is a dictionary, where each node and edge is a key whose value is either a boolean indicating whether the component or edge belongs to the circuit, or a floating-point importance score. If the user uploaded floating-point importance scores, then only one file per model/task was required; we took the top-k components by importance for each circuit size k ∈ K. If the user uploaded binary inclusion indices, they were required to upload one circuit file for each threshold k ∈ K. Users provided a link to this repository on the "Submit" tab of the MIB leaderboard,[4] along with a method name.

[2] We use "neuron" to refer to a single dimension of any hidden vector, regardless of whether it is preceded by a non-linearity.
[3] See this repository for an example of how circuit repositories were required to be structured.

3.3 Task Submissions

We received submissions from three teams for the circuit localization track covering eight proposed methods. We taxonomize and summarize the approaches here.

Ensemble scoring strategies. Mondorf et al. (2025) proposed ensembling two or more circuit localization methods to improve attribution scores. They examined three ensembling variants: parallel, sequential, and their hybrid combination.
Parallel ensembling (P-Ens) merges the scores from different methods into a single edge score, using scores from the three variants of edge patching implemented by Mueller et al.: (1) Edge Attribution Patching (EAP; Nanda, 2023; Syed et al., 2024), (2) EAP-IG-inputs (Hanna et al., 2024), and (3) EAP-IG-activations (Marks et al., 2025). The latter two methods complement EAP with integrated gradients (Sundararajan et al., 2017) to improve estimates of edge importance, perturbing input embeddings and activations, respectively. The authors experimented with score merging using mean, weighted average, maximum, and minimum, and found that mean yielded the best results.

Sequential ensembling (S-Ens) utilizes attribution scores produced by a fast circuit identification method to warm-start a slower, more precise method, thereby achieving faster convergence and further refining the initial scores. Specifically, they use EAP-IG-inputs (Hanna et al., 2024) edge attribution values to initialize the learnable log alpha parameters of edge pruning (Bhaskar et al., 2024).

Finally, hybrid ensembling (Hybrid-Ens) combines the parallel and sequential strategies by taking the unweighted average over all four methods (the three EAP variants and warm-start edge pruning) for all model-task combinations.

[4] https://hf.co/spaces/mib-bench/leaderboard

| Method | InterpBench (↑) | IOI GPT-2 | IOI Qwen-2.5 | IOI Gemma-2 | IOI Llama-3.1 | Arith. Llama-3.1 | MCQA Qwen-2.5 | MCQA Gemma-2 | MCQA Llama-3.1 | ARC (E) Gemma-2 | ARC (E) Llama-3.1 | ARC (C) Llama-3.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 0.44 | 0.75 | 0.72 | 0.69 | 0.74 | 0.75 | 0.73 | 0.68 | 0.74 | 0.68 | 0.74 | 0.74 |
| EAP (mean) | 0.78 | 0.29 | 0.18 | 0.25 | 0.04 | 0.07 | 0.21 | 0.20 | 0.16 | 0.22 | 0.28 | 0.20 |
| EAP (CF) | 0.73 | 0.03 | 0.15 | 0.06 | 0.01 | 0.01 | 0.07 | 0.08 | 0.09 | 0.04 | 0.11 | 0.18 |
| EAP (OA) | 0.77 | 0.30 | 0.16 | – | – | – | 0.11 | – | – | – | – | – |
| EAP-IG-inp. (CF) | 0.71 | 0.03 | 0.02 | 0.04 | 0.01 | 0.00 | 0.08 | 0.06 | 0.14 | 0.04 | 0.11 | 0.22 |
| EAP-IG-act. (CF) | 0.81 | 0.03 | 0.01 | 0.03 | 0.01 | 0.00 | 0.05 | 0.07 | 0.13 | 0.04 | 0.30 | 0.37 |
| P-Ens (Mondorf et al., 2025) | – | 0.02 | 0.02 | – | – | – | 0.07 | – | – | – | – | – |
| S-Ens (Mondorf et al., 2025) | – | 0.03 | 0.02 | – | – | – | 0.07 | – | – | – | – | – |
| Hybrid-Ens (Mondorf et al., 2025) | – | 0.03 | 0.02 | – | – | – | 0.04 | – | – | – | – | – |
| ILP + PNR + Bootstrapping (Nikankin et al., 2025a) | – | 0.02 | 0.01 | 0.04 | 0.01 | 0.01 | 0.08 | 0.07 | 0.45 | 0.03 | – | – |
| IPE (CF) (Brunello et al., 2025) | – | 0.02 | 0.57 | – | – | 0.54 | – | – | – | – | – | – |

Table 2: CMD scores across circuit localization methods (lower is better) on the private test set. All evaluations were performed using counterfactual ablations. Arithmetic scores are averaged across addition and subtraction. We bold and underline the best and second-best methods per column, respectively.

| Method | IOI GPT-2 | IOI Qwen-2.5 | IOI Gemma-2 | IOI Llama-3.1 | Arith. Llama-3.1 | MCQA Qwen-2.5 | MCQA Gemma-2 | MCQA Llama-3.1 | ARC (E) Gemma-2 | ARC (E) Llama-3.1 | ARC (C) Llama-3.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EActP (CF) | 2.30 | 1.21 | – | – | – | 0.85 | – | – | – | – | – |
| EAP (mean) | 0.29 | 0.71 | 0.68 | 0.98 | 0.35 | 0.29 | 0.33 | 0.13 | 0.26 | 0.34 | 0.80 |
| EAP (CF) | 1.20 | 0.26 | 1.29 | 0.85 | 0.55 | 0.85 | 1.49 | 1.00 | 1.08 | 0.80 | 0.82 |
| EAP (OA) | 0.95 | 0.70 | – | – | – | 0.29 | – | – | – | – | – |
| EAP-IG-inputs (CF) | 1.85 | 1.63 | 3.20 | 2.08 | 0.99 | 1.16 | 1.64 | 1.05 | 1.53 | 1.04 | 0.98 |
| EAP-IG-activations (CF) | 1.82 | 1.63 | 2.07 | 1.60 | 0.98 | 0.77 | 1.57 | 0.79 | 1.70 | 0.71 | 0.63 |
| NAP (CF) | 0.28 | 0.30 | 0.30 | 0.26 | 0.27 | 0.38 | 1.47 | 1.69 | 1.01 | 0.26 | 0.26 |
| NAP-IG (CF) | 0.76 | 0.29 | 1.52 | 0.42 | 0.39 | 0.77 | 1.71 | 1.87 | 1.53 | 0.26 | 0.26 |
| P-Ens (Mondorf et al., 2025) | 2.11 | 1.88 | – | – | – | 0.79 | – | – | – | – | – |
| S-Ens (Mondorf et al., 2025) | 2.37 | 1.71 | – | – | – | 1.16 | – | – | – | – | – |
| Hybrid-Ens (Mondorf et al., 2025) | 2.43 | 1.88 | – | – | – | 1.19 | – | – | – | – | – |
| ILP + PNR + Bootstrapping (Nikankin et al., 2025a) | 1.89 | 1.71 | 3.01 | 2.39 | 1.04 | 1.04 | 1.7 | 1.22 | 1.63 | – | – |
| IPE (CF) (Brunello et al., 2025) | 2.24 | 0.35 | – | – | – | 0.45 | – | – | – | – | – |

Table 3: CPR scores across circuit localization methods on the private test set. All evaluations were performed using counterfactual ablations. Higher scores are better. Arithmetic scores are averaged across addition and subtraction. We bold and underline the best and second-best methods per column, respectively.

Improved edge selection. Focusing on the second stage of the circuit discovery pipeline, Nikankin et al. (2025a) experimented with three methods to improve the edge selection process.
First, they observe that EAP-IG scores can vary across data samples from the same task, with some edges receiving both negative and positive values in different samples. The sign of the score is significant, as it indicates whether the edge contributes positively or negatively to performance on the task. By bootstrapping the scores across resamples of the training data, they identify edges with consistent score signs and filter out unstable ones.

Second, they introduce a ratio-based strategy for edge selection based on edge signs (PNR): select a fixed proportion of top positive edges, and the rest by absolute value. This approach allows finer control over the balance of edge types and improves circuit faithfulness. Lastly, they formulate circuit construction as an Integer Linear Programming (ILP) optimization problem, instead of using the naive greedy solution.

Path scoring. Brunello et al. (2025) proposed Isolating Path Effects (IPE) to identify entire computational paths from input embeddings to output logits responsible for certain model behaviors, as opposed to individual edges. Their method modifies the messages passed between nodes along a given path in such a way as to either precisely remove the effects of the entire path (i.e., ablate it) or to replace the path's effects with those that a counterfactual input would have produced. IPE differs from current path-patching or edge-activation-patching techniques, as those do not ablate individual paths but rather sets of paths sharing certain edges; IPE thereby allows a more precise tracing of information flow.

3.4 Results

Table 2 and Table 3 show the CMD and CPR scores, respectively, of the top method from each submission as well as selected methods from MIB, on the private test set. All submissions perform especially well, achieving better or comparable scores to even the strongest baselines. The submission of Nikankin et al.
(2025a) achieves especially strong CMD scores, whereas the Hybrid-Ens method of Mondorf et al. (2025) achieves the strongest CPR scores. The IPE method by Brunello et al. (2025) also performs well on IOI for GPT-2. Among the methods of Mondorf et al. (2025), Hybrid-Ens performs the strongest across tasks.

These results suggest that ensembling strategies may be an accessible and fruitful line of work for future circuit discovery research. For Nikankin et al. (2025a), the removal of components with inconsistent effects on model outputs, together with a mixture of positive and high-magnitude components, may have a regularizing effect on the discovered circuit, causing it to behave more closely to the whole model and potentially suppressing components that would have strong but inconsistent impacts on model behavior. It would be interesting to see detailed comparisons of each method on more fine-grained distributions to characterize when and why each is likely to succeed. That said, there is no clear winner; the best method appears to depend on the chosen metric.

A factor we have not directly evaluated is the time complexity of each method. It is possible that different methods could perform comparably despite having very different expected runtimes; a direct comparison of compute requirements would be valuable in helping future researchers decide which methods are most worthwhile to run. We note that many cells are missing for each submission, but this does not necessarily reflect compute requirements; it could be due to local memory constraints, runtime limitations, or other compute constraints (e.g., limited access to GPUs on a cluster before a deadline).

4 Causal Variable Localization Track

The causal variable localization track focuses on evaluating how well a method can discover specific causal variables in a language model's activation space.
The basic intuition is that any hidden vector h ∈ R^d constructed by a model N during inference can be mapped into a new feature space F^k (e.g., a rotated vector space) using an invertible function F : R^d → F^k (e.g., multiplication with an orthogonal matrix). Features Π are a set of indices between 1 and k, i.e., a set of dimensions in F^k. This framework supports features like neurons, orthogonal directions, (sets of) SAE features, and non-linear features. The vector h might come from the residual stream between transformer layers or the output of an attention head.

4.1 Evaluation Metric

We use faithfulness to evaluate causal variable localization submissions. This metric captures the degree to which the provided features capture the causal variable under counterfactual intervention. To evaluate faithfulness, we use interchange interventions. Given base and counterfactual inputs (b, c), a high-level causal graph H, and a causal variable X ∈ H, the interchange intervention H_{X ← Get(H(c), X)}(b) runs H on the base input b while fixing the variable X to the value it takes when H is run on the counterfactual input c (Vig et al., 2020; Geiger et al., 2020). The distributed interchange intervention N_{Π_X ← Get(N(c), Π_X)}(b) runs N on b while fixing the features Π_X of the hidden vector h passed through F to the value they take for the counterfactual input c (Wu et al., 2023b; Amini et al., 2023; Geiger et al., 2024). Faithfulness is measured as the proportion of examples for which the intervention yields the expected change in the model's output behavior. See Wu et al. (2023a) and Mueller et al. (2025) for examples.

4.2 Submission Procedure

As for the circuit localization track, users were required to upload files to a HuggingFace repository, although the required files differed for causal variable localization.
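The distributed interchange intervention from §4.1 can be sketched with the featurizer F and its inverse abstracted as plain functions; the toy permutation featurizer below is an assumption chosen for its trivial invertibility, not one of the track's baseline featurizers.

```python
# Sketch of a distributed interchange intervention: featurize the base hidden
# vector, overwrite the feature indices Pi_X with their counterfactual values,
# and map back to model space. Vectors and the featurizer are toy stand-ins.

def interchange(h_base, h_counterfactual, featurize, invert, feature_indices):
    z_base = list(featurize(h_base))
    z_cf = featurize(h_counterfactual)
    for i in feature_indices:
        z_base[i] = z_cf[i]          # fix Pi_X to its counterfactual value
    return invert(z_base)

# Toy featurizer F: a permutation of dimensions (trivially invertible).
perm = [1, 0, 2]
featurize = lambda h: [h[i] for i in perm]
invert = lambda z: [z[perm.index(j)] for j in range(len(perm))]

patched = interchange([1.0, 2.0, 3.0], [9.0, 8.0, 7.0], featurize, invert, {0})
# Feature 0 of the featurized space is base dimension 1, so only that
# dimension takes its counterfactual value.
assert patched == [1.0, 8.0, 3.0]
```

Faithfulness then counts the proportion of (base, counterfactual) pairs for which running the model on the patched vector changes the output in the way the high-level causal model predicts.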
Here, a user was required to upload at least three artifacts for a given causal variable: a trained featurizer F, a trained inverse featurizer F^{-1}, and position indices corresponding to the dimensions of the featurized space that encode the causal variable of interest.[5] If the featurizer was not one of the supported baseline types, users were also required to upload Python code that could save and load their featurizer. We also supported interventions at dynamic token positions; if used, users were required to upload a Python script specifying which token positions to intervene on for a given example.[6]

[5] See this repository for an example of how causal variable localization repositories were required to be structured.
[6] See the track's GitHub repository for further details.

4.3 Task Submissions

We received submissions from one team totalling two methods (Hirlimann et al., 2025). Both methods extend the official Distributed Alignment Search (DAS; Geiger et al., 2024) baseline.

| Method | Gemma-2 A_Cont | Gemma-2 A_Country | Gemma-2 A_Lang | Llama-3.1 A_Cont | Llama-3.1 A_Country | Llama-3.1 A_Lang |
|---|---|---|---|---|---|---|
| DAS | 75 (85) | 57 (67) | 62 (70) | 75 (83) | 58 (64) | 63 (70) |
| DBM | 66 (71) | 53 (65) | 54 (58) | 68 (80) | 53 (59) | 58 (64) |
| +PCA | 63 (70) | 47 (53) | 50 (56) | 62 (74) | 48 (54) | 53 (57) |
| +SAE | 64 (72) | 49 (56) | 53 (59) | 64 (72) | 50 (57) | 55 (57) |
| Full Vector | 48 (62) | 49 (57) | 45 (56) | 53 (62) | 47 (53) | 47 (57) |
| Orthogonal | – | – | – | 84 (89) | 70 (79) | 72 (79) |
| Nonlinear | – | – | – | 83 (89) | 70 (78) | 72 (79) |

(a) The RAVEL task with variables for the country (A_Country), continent (A_Cont), and language (A_Lang) of a city.

| Method | Gemma-2 X_Carry | Llama-3.1 X_Carry |
|---|---|---|
| DAS | 31 (35) | 54 (65) |
| DBM | 33 (43) | 47 (58) |
| +PCA | 32 (44) | 37 (56) |
| +SAE | 32 (44) | 38 (55) |
| Full Vector | 29 (35) | 35 (45) |
| Orthogonal | – | 53 (65) |
| Nonlinear | – | – |

(b) The two-digit arithmetic task with a variable computing the carry-the-one operation (X_Carry).
| Method | Gemma-2 O_Answer | Gemma-2 X_Order | Llama-3.1 O_Answer | Llama-3.1 X_Order | Qwen-2.5 O_Answer | Qwen-2.5 X_Order |
|---|---|---|---|---|---|---|
| DAS | 95 (97) | 77 (93) | 94 (100) | 77 (91) | 86 (95) | 78 (100) |
| DBM | 84 (99) | 63 (84) | 86 (100) | 66 (73) | 46 (94) | 60 (99) |
| +PCA | 57 (96) | 52 (81) | 65 (99) | 53 (74) | 22 (76) | 54 (100) |
| +SAE | 73 (90) | 51 (65) | 80 (99) | 58 (65) | – | – |
| Full Vector | 61 (100) | 44 (77) | 77 (100) | 46 (68) | 35 (99) | 49 (99) |
| Orthogonal | – | – | – | – | 90 (98) | 78 (100) |
| Nonlinear | – | – | 95 (100) | 81 (94) | 89 (98) | 81 (100) |

(c) The MCQA task with variables for the ordering of the answer (X_Order) and the answer token (O_Answer). This is a low-data regime (≈100 examples).

Table 4: Results for the causal variable localization track. Table headers show the task, the model, and the selected causal variable, respectively. We do not report results for ARC or IOI, as no submissions were made for these tasks. We report interchange intervention accuracy (i.e., our faithfulness metric): the proportion of aligned interventions on the causal model and deep learning model that result in the same output token(s); higher is better. For each method of aligning a causal variable to LM features, we report the mean across counterfactual datasets and layers in the low-level model. In parentheses and bold, we report the best alignment across all layers.

Non-linear featurizer. This method extends DAS with a multi-layer perceptron (MLP) and non-linearities. During training, this method augments the feature mixing stage with an MLP:

h = GeLU(W_u x),   (3)
x̂ = tanh(W_d h),   (4)

where W_u and W_d are learned up-projection and down-projection weights, respectively. The MLP is only applied during training. This allows the featurizer to "blend" potentially independent representations and go beyond convex combinations of features, which could allow it to learn dependencies where the signal is not strictly separable by individual directions in the original activation space x. Empirically, this was the best-performing method, even outperforming DAS, the best-performing baseline method.
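A toy sketch of Eqs. (3) and (4), assuming small dense matrices in place of the learned weights W_u and W_d; this illustrates the featurizer's forward pass only, not the submitted training procedure.

```python
import math

# Toy forward pass of the non-linear featurizer: an up-projection with GeLU
# (Eq. 3) followed by a tanh down-projection (Eq. 4). The weights are fixed
# stand-ins; in the actual method they are learned during training.

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def featurize(x, w_up, w_down):
    h = [gelu(v) for v in matvec(w_up, x)]                 # Eq. (3)
    return [math.tanh(v) for v in matvec(w_down, h)]       # Eq. (4)

x = [0.5, -0.2]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 2 -> 3 up-projection
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 3 -> 2 down-projection
x_hat = featurize(x, w_up, w_down)
assert len(x_hat) == 2 and all(-1.0 < v < 1.0 for v in x_hat)
```

The tanh keeps the featurized values bounded; the orthogonal variant described next drops the feed-forward layers and keeps only the non-linearity.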
That said, recent work has demonstrated that non-linear featurizers are highly expressive, and as such can locate potentially any feature, including those that are not in the model itself (Sutter et al., 2025), echoing the memorization problem that characterized probing classifiers (Belinkov, 2022). Additional validation is needed to confirm that the learned features capture genuine variables the model employs during processing.

Orthogonal non-linear projection. This featurizer is a simplified variant of the non-linear featurizer. Here, the features pass only through a tanh non-linearity, without a feed-forward layer. This still enables rich feature interactions to be learned, but has less expressive power than the non-linear featurizer.

4.4 Results

We show faithfulness scores for baselines and submissions in Table 4. Both the orthogonal and non-linear methods achieve significant gains over DAS across tasks and models. Despite the greater expressive power of the non-linear featurizer, it performs comparably to the simpler orthogonal featurizer across tasks, with non-linear featurization proving slightly stronger for MCQA with Qwen-2.5.

5 Conclusions

Despite the relatively small number of submissions, participants achieved significant performance gains. Ensembling methods are quite effective for circuit discovery, as is regularization via filtering components with unstable contributions to model behavior; we encourage future work to continue exploring these directions. Furthermore, one can achieve significant gains in variable localization using non-linear mediator types; these projections into new spaces can be highly effective with the proper training procedure, even when the non-linearity is built on top of a simple architecture.
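The "simple architecture" point is visible in the orthogonal non-linear projection described in Section 4.3: a DAS-style orthogonal rotation followed only by a tanh. A minimal sketch of that idea (the QR-based construction of the orthogonal matrix and the arctanh-based inverse featurizer are assumptions for illustration, not the submitted code):

```python
import numpy as np

def orthogonal_tanh_featurizer(x, Q):
    """Rotate the activation into a learned feature basis, then apply tanh only."""
    return np.tanh(Q @ x)

def inverse_featurizer(f, Q):
    """Assumed inverse: undo the tanh, then rotate back (Q is orthogonal, so Q^-1 = Q^T)."""
    return Q.T @ np.arctanh(f)

rng = np.random.default_rng(1)
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

x = 0.5 * rng.standard_normal(d)                  # stand-in for an LM activation
f = orthogonal_tanh_featurizer(x, Q)
x_back = inverse_featurizer(f, Q)
assert np.allclose(x_back, x)  # the featurizer round-trips on this range
```

The contrast with the MLP featurizer of Eqs. (3)-(4) is that no up/down projection is learned here; expressivity comes only from the elementwise tanh after the rotation.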
This suggests that expressive featurizer training formulations that leverage existing mediator types might yield significant gains in causal variable localization, but more controls are needed to ensure that concepts truly in the model itself are being isolated (as opposed to the featurizer learning the causal variable itself).

The MIB leaderboard will continue to accept public submissions in both tracks. The results of this shared task will inform the experimental design and baseline choices for future studies employing circuit and causal-variable localization methods in language models. We hope that participants will continue to publicize their findings to benefit the community and enable scientific progress through direct comparisons in a shared-task setting.

Acknowledgments

D.A. is supported by the Ariane de Rothschild Women Doctoral Program. G.S. acknowledges the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). This research was partly supported by an Azrieli Foundation Early Career Faculty Fellowship, Open Philanthropy, a Google Award, and the European Union (ERC, Control-LM, 101165402). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

References

Afra Amini, Tiago Pimentel, Clara Meister, and Ryan Cotterell. 2023. Naturalistic causal probing for morpho-syntax. Transactions of the Association for Computational Linguistics, 11:384–403.

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.

Adithya Bhaskar, Alexander Wettig, Dan Friedman, and Danqi Chen. 2024. Finding transformer circuits with edge pruning. Advances in Neural Information Processing Systems, 37:18506–18534.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Nicolò Brunello, Andrea Cerutti, Andrea Sassella, and Mark Carman. 2025. BlackboxNLP-2025 MIB shared task: IPE: Isolating path effects for improving latent circuit identification. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 528–536.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. CoRR, abs/2407.21783.

Javier Ferrando and Elena Voita. 2024. Information flow routes: Automatically interpreting language models at scale. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17432–17445, Miami, Florida, USA. Association for Computational Linguistics.

Atticus Geiger, Kyle Richardson, and Christopher Potts. 2020. Neural natural language inference models partially embed theories of lexical entailment and negation.
In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 163–173, Online. Association for Computational Linguistics.

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. 2024. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, 1-3 April 2024, Los Angeles, California, USA, volume 236 of Proceedings of Machine Learning Research, pages 160–187. PMLR.

Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. 2024. Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms. In ICML 2024 Workshop on Mechanistic Interpretability.

Lea Hirlimann, Yihong Liu, Leonor Veloso, Philipp Mondorf, Shijia Zhou, Mingyang Wang, Ahmad Dawar Hakimi, Barbara Plank, and Hinrich Schütze. 2025. BlackboxNLP-2025 MIB shared task: Exploring the impact of non-linear modules on distributed alignment search. In BlackboxNLP-2025 MIB Shared Task.

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. 2024a. RAVEL: Evaluating interpretability methods on disentangling language model representations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8669–8687, Bangkok, Thailand. Association for Computational Linguistics.

Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. 2024b. Unified view of grokking, double descent and emergent abilities: A comprehensive study on algorithm task. In First Conference on Language Modeling.

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2024. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations.

Albert Q.
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.

Ruizhe Li and Yanjun Gao. 2024. Anchored answers: Unravelling positional bias in GPT-2's multiple-choice questions. arXiv preprint arXiv:2405.03205.

Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik. 2023. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458.

Ziming Liu, Eric J. Michaud, and Max Tegmark. 2023. Omnigrok: Grokking beyond algorithmic data. In The Eleventh International Conference on Learning Representations.

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2025. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations.

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024. Circuit component reuse across tasks in transformer language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.

Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, and Barbara Plank. 2025. BlackboxNLP-2025 MIB shared task: Exploring ensemble strategies for circuit localization methods. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 537–542.

Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, and 1 others. 2025.
MIB: A mechanistic interpretability benchmark. In Forty-second International Conference on Machine Learning.

Neel Nanda. 2023. Attribution patching: Activation patching at industrial scale.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations.

Yaniv Nikankin, Dana Arad, Itay Itzhak, Anja Reusch, Adi Simhi, Gal Kesten, and Yonatan Belinkov. 2025a. BlackboxNLP-2025 MIB shared task: Improving circuit faithfulness via better edge selection. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 521–527.

Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov. 2025b. Arithmetic without algorithms: Language models solve math with a bag of heuristics. In The Thirteenth International Conference on Learning Representations.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. Blog post.

Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and 1 others. 2024. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118.

Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. 2023. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7035–7052.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks.
In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 3319–3328. JMLR.org.

Denis Sutter, Julian Minder, Thomas Hofmann, and Tiago Pimentel. 2025. The non-linear representation dilemma: Is causal abstraction enough for mechanistic interpretability? Preprint, arXiv:2507.08802.

Aaquib Syed, Can Rager, and Arthur Conmy. 2024. Attribution patching outperforms automated circuit discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 407–416, Miami, Florida, US. Association for Computational Linguistics.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, and Ashish Sabharwal. 2025. Answer, assemble, ace: Understanding how LMs answer multiple choice questions. In The Thirteenth International Conference on Learning Representations.

Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. 2023a. Interpretability at scale: Identifying causal mechanisms in alpaca. In Thirty-seventh Conference on Neural Information Processing Systems.

Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah D. Goodman. 2023b. Interpretability at scale: Identifying causal mechanisms in alpaca.
In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024. Qwen2.5 technical report. arXiv:2412.15115.

Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-ming Cheung, Xinmei Tian, Xu Shen, and Jieping Ye. 2024. Interpreting and improving large language models in arithmetic calculation. In Forty-first International Conference on Machine Learning.