
Paper deep dive

SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models

Carter Teplica, Yixin Liu, Arman Cohan, Tim G. J. Rudner

Year: 2025 · Venue: NAACL 2025 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 58

Models: eight language models (Llama 2 7B/13B in base and chat variants, Llama 3 8B in base and instruct variants, Gemma 2 2B/9B instruct)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/11/2026, 1:17:03 AM

Summary

The paper investigates the mechanistic sources of uncertainty in large language models (LLMs) by testing the 'shared circuits' hypothesis, which posits that the same internal model components responsible for factual response generation also produce uncertainty estimates. Using P(IK) probes, causal tracing, and zero-ablation techniques across eight models and five datasets, the authors provide evidence that uncertainty introspection and factual recall rely on shared circuits within the model architecture.
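For intuition, the P(IK) probing setup the summary refers to can be sketched as a logistic classifier over last-layer activations. The code below is a hedged illustration only, not the authors' implementation: the activation and label arrays are synthetic placeholders, and the paper itself trains on soft correctness labels with binary cross-entropy, dropout, and a triangular learning-rate schedule.

# Minimal sketch of a P(IK)-style probe: a logistic classifier mapping
# last-layer activations (at the final prompt token) to the probability
# that the model answers a question correctly.
# Assumptions: `activations` would be collected from the LLM and
# `correct_frac` from 32 sampled answers per question; here both are
# random placeholders so the sketch runs standalone.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_questions, d_model = 2048, 4096
activations = rng.normal(size=(n_questions, d_model))   # placeholder features
correct_frac = rng.uniform(size=n_questions)            # fraction of correct samples

# Binarized targets for a standard logistic fit; the paper instead regresses
# the soft correctness fraction with binary cross-entropy loss.
labels = (correct_frac > 0.5).astype(int)

split = int(0.8 * n_questions)
probe = LogisticRegression(max_iter=1000).fit(activations[:split], labels[:split])
p_ik = probe.predict_proba(activations[split:])[:, 1]   # probe confidence P(IK)
print("mean held-out P(IK):", round(float(p_ik.mean()), 3))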

Entities (5)

P(IK) probe · methodology · 98%
Causal Tracing · technique · 95%
SCIURus · research-framework · 95%
Shared Circuits Hypothesis · hypothesis · 95%
Zero-ablation · technique · 95%

Relation Signals (3)

Causal Tracing applied to LLM

confidence 95% · We perform an extensive empirical analysis on eight different models... and find evidence that for knowledge recall, uncertainty and the factuality of answers generated by an LLM are handled by the same parts of the model.

Shared Circuits Hypothesis supported by Empirical Evidence

confidence 90% · Our experiments... provide strong evidence that a model’s uncertainty is produced in the same parts of the network that are responsible for the factuality of generated responses.

P(IK) probe used to identify Shared Circuits

confidence 90% · We then used these probes’ predicted confidences as target metrics for causal tracing and zero-ablation... to evaluate whether the same circuits were responsible for the answer and the predicted confidence.
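To make the intervention concrete, the sketch below shows one way a leave-one-out zero-ablation could be wired up: a forward hook replaces a single MLP layer's output with zeros, and the run is then scored for both the generated answer and the last-layer state that would feed a P(IK)-style probe. This is a hedged sketch under assumptions, not the paper's code; the model name and layer index are stand-ins (the paper ablates layers of Llama 2/3 and Gemma 2 models).

# Hedged sketch of leave-one-out zero-ablation (not the authors' implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                               # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def zero_output(module, inputs, output):
    # Replace this component's contribution to the residual stream with zeros.
    return torch.zeros_like(output)

prompt = "Q: What is the common name of Sciurus carolinensis?\nA:"
batch = tok(prompt, return_tensors="pt")

ablated_layer = model.transformer.h[5].mlp        # GPT-2 layout; Llama/Gemma differ
handle = ablated_layer.register_forward_hook(zero_output)
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)
handle.remove()

last_token_state = out.hidden_states[-1][0, -1]   # would be fed to a P(IK) probe
next_token = tok.decode([out.logits[0, -1].argmax().item()])
print("next token under ablation:", repr(next_token))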

Cypher Suggestions (2)

Find all techniques used to analyze model circuits. · confidence 90% · unvalidated

MATCH (t:Technique)-[:USED_IN]->(s:Study) RETURN t.name, s.title

Identify the relationship between uncertainty and factuality circuits. · confidence 85% · unvalidated

MATCH (a:Circuit)-[r:RESPONSIBLE_FOR]->(b:Task) WHERE b.name IN ['Factuality', 'Uncertainty'] RETURN a, r, b

Abstract

Carter Teplica, Yixin Liu, Arman Cohan, Tim G. J. Rudner. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.

We investigate the mechanistic sources of uncertainty in large language models (LLMs), an area with important implications for language model reliability and trustworthiness. To do so, we conduct a series of experiments designed to identify whether the factuality of generated responses and a model's uncertainty originate in separate or shared circuits in the model architecture. We approach this question by adapting the well-established mechanistic interpretability techniques of causal tracing and two styles of zero-ablation to study the effect of different circuits on LLM generations. Our experiments on eight different models and five datasets, representing tasks predominantly requiring factual recall, provide strong evidence that a model's uncertainty is produced in the same parts of the network that are responsible for the factuality of generated responses.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Full Text

57,856 characters extracted from source content.


Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12451–12469 April 29 - May 4, 2025 ©2025 Association for Computational Linguistics SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models Carter Teplica New York University carterteplica@nyu.edu Yixin Liu Yale University yixin.liu@yale.edu Arman Cohan Yale University arman.cohan@yale.edu Tim G. J. Rudner New York University tim.rudner@nyu.edu Abstract We investigate the mechanistic sources of un- certainty in large language models (LLMs), an area with important implications for language model reliability and trustworthiness. To do so, we conduct a series of experiments designed to identify whether the factuality of generated responses and a model’s uncertainty originate in separate or shared circuits in the model archi- tecture. We approach this question by adapting the well-established mechanistic interpretabil- ity techniques of causal tracing and two styles of zero-ablation to study the effect of different circuits on LLM generations. Our experiments on eight different models and five datasets, rep- resenting tasks predominantly requiring factual recall, provide strong evidence that a model’s uncertainty is produced in the same parts of the network that are responsible for the factuality of generated responses. 1 Introduction Uncertainty quantification (UQ) in large language models (LLMs) for knowledge-intensive tasks (Petroni et al., 2020) remains a critical yet under- studied area. Despite achieving human-level perfor- mance on various benchmarks, LLMs often strug- gle with reliable uncertainty estimation, leading to issues such as overconfidence and hallucination (Zhang et al., 2024). This limitation has strong implications for their trustworthiness and safety in high-stakes applications. While recent research has explored verbalized uncertainty in LLMs (Band et al., 2024; Kadavath et al., 2022; Kuhn et al., 2022), significant gaps remain in our understanding of and ability to improve uncertainty quantification. In particular, existing UQ techniques typically pro- vide little insight into the factors responsible for an uncertainty estimate, limiting their usefulness both as practical tools for improving trustworthi- ness and as methods for understanding uncertainty reasoning. We propose leveraging mechanistic in- terpretability, an approach focused on characteriz- ing models’ internal reasoning mechanisms, to ad- vance our capabilities for and understanding of un- certainty quantification in large language models. To better understand how LLMs generate uncer- tainty estimates, we trainedP(IK)(probability that I know) probes that represent the model’s uncer- tainty based on multiple generated answers (Ka- davath et al., 2022). We then used these probes’ predicted confidences as target metrics for causal tracing and zero-ablation, two interpretability tech- niques which identify the components of a model that are relevant for a task by testing the effect of an intervention made on activations in the model during evaluation. We compared the mechanistic signatures of changes in the model’s accuracy and the probe’s output to evaluate whether the same circuits were responsible for the answer and the predicted confidence. 
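(Illustrative sketch, not part of the paper.) The three-run structure of causal tracing used in the paper (detailed in its Section 3) — a clean run, a corrupted run with subject-token embeddings zeroed, and a restored run that patches one location back to its clean value — can be outlined as follows. The model name, layer index, and subject token positions are assumptions chosen for the example, and the "correct" token is approximated by the clean run's top prediction.

# Hedged sketch of causal tracing / activation patching on one layer output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Q: What is the religion of Abdullah Ahmad Badawi?\nA:"
ids = tok(prompt, return_tensors="pt")["input_ids"]

# 1. Clean run: cache the hidden state after every transformer block.
with torch.no_grad():
    clean = model(ids, output_hidden_states=True)
clean_states = clean.hidden_states                 # embeddings + one entry per layer

# 2. Corrupted run: zero the input embeddings of the (assumed) subject tokens.
subject_positions = [7, 8, 9, 10]                  # assumed span of "Abdullah Ahmad Badawi"
embeds = model.get_input_embeddings()(ids).detach().clone()
embeds[0, subject_positions] = 0.0
with torch.no_grad():
    corrupted = model(inputs_embeds=embeds)

# 3. Restored run: same corrupted input, but one block's output at the last
#    token is patched back to its clean value via a forward hook.
layer_idx, pos = 5, ids.shape[1] - 1

def restore(module, inputs, output):
    hidden = output[0].clone()                     # GPT-2 blocks return a tuple
    hidden[0, pos] = clean_states[layer_idx + 1][0, pos]
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(restore)
with torch.no_grad():
    restored = model(inputs_embeds=embeds)
handle.remove()

answer_id = clean.logits[0, -1].argmax()           # proxy for the correct first token
for name, run in [("clean", clean), ("corrupted", corrupted), ("restored", restored)]:
    prob = run.logits[0, -1].softmax(-1)[answer_id].item()
    print(f"{name:9s} P(answer token) = {prob:.3f}")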
In our empirical evaluation, we performed causal tracing and leave-one-out and COAR-style (Shah et al., 2024) zero-ablation for a large range of model–dataset combinations. We found that model accuracy and probe behavior largely responded to the same interventions, indicating that circuits re- sponsible for the factuality of responses and for the model’s uncertainty are located in the same parts of the model. For a group of knowledge-intensive question an- swering tasks (Petroni et al., 2020), model accuracy and probe confidence are (highly) positively related to one another. We conclude that, at least on re- call tasks, a language model’s representation of confidence may derive mainly from “uncertainty introspection” on its question-answering process, rather than from separate reasoning specific to its uncertainty. To summarize, the key contributions of this pa- per are as follows: 1. We use mechanistic interpretability and uncer- tainty quantification tools to investigate the 12451 language model P(IK) probe Q: What is the common name of Sciurus carolinensis? A: prompt [1.9, 1.7, 2.1, 0.9, 1.8, 1.8, ...] activations correct (m) P(IK) = 0.7 (p) Eastern gray squirrel model answer train probe (p) to predict whether answer is correct (m) _gray _squirrel attention MLP embed.out attn.out mlp.residmlp.out layer.out Figure 1:Left:P(IK)probing. The LLM takes a question as input and returns an answer and last-layer activations. Answers are checked for correctness. The probe learns to predict whether the model’s answer is correct, based on the last-layer activations. Our analysis uses the probe as a proxy for an LLM’sP(IK). We conduct path patching and zero ablation studies on the probe and the corresponding LLM.Right:Locations used in interventions. Path- patching restorations are atmlp.resid,mlp.out,layer.out, andembed.out. Zero-ablations are atattn.outand mlp.out. mechanistic sources of uncertainty in large language models. To do so, we use a logis- ticP(IK)probe with causal tracing and zero- ablation to examine whether LLM uncertainty and the factuality of answers generated by an LLM reside in shared or separate circuits within the model. 2.We perform an extensive empirical analysis on eight different models and five recall-intensive datasets, and find evidence that for knowledge recall, uncertainty and the factuality of an- swers generated by an LLM are handled by the same parts of the model. 2 Related Work 2.1 Uncertainty Quantification in Large Language Models Uncertainty quantification in large language mod- els is crucial for enhancing reliability, particularly in high-stakes applications. While LLMs’ token probabilities are often well-calibrated for next- token prediction, practical applications of UQ often require quantifications of the uncertainty in the se- mantic content of the output (Gawlikowski et al., 2023). Language models’ ability to quantify se- mantic uncertainty remains limited, especially for open-ended tasks. Various techniques have been proposed to address this. A well-studied set of techniques involvesmulti- ple sampling and clusteringbased on consistency. This can be effective when clustering of responses is straightforward, but this is often not the case ex- cept on simple tasks (Kuhn et al., 2022; Fomicheva et al., 2020; Lin et al., 2024; Ao et al., 2024). An- other approach, sometimes calledverbalized un- certainty, is to ask the model to state a verbal or quantitative confidence estimate (Kadavath et al., 2022); the performance of such methods is often inconsistent. 
On multiple-choice questions, the token probabilitiesmay yield well-calibrated un- certainty estimates (Kadavath et al., 2022). An- other option is to train aP(IK)probe, a binary classifier predicting whether the model knows the answer. This approach is among the most effective in-distribution (Orgad et al., 2024) but struggles with generalization to out-of-distribution data (Ka- davath et al., 2022; Orgad et al., 2024). In this work, we focus onP(IK)probing, as it provides a potentially interpretable view into a model’s self-assessed uncertainty by identifying a specific feature direction within the model. Beyond the introduction ofP(IK)probing itself (Kadavath et al., 2022), little research has been conducted on interpreting the mechanisms behind uncertainty rea- soning in LLMs. While most UQ techniques rely on eliciting information about uncertainty through explicit or indirect methods, we still lack an un- derstanding of how this information is represented internally. Analyzing these mechanisms could im- prove UQ techniques and provide insights into broader epistemic weaknesses in LLMs. 12452 2.2 Mechanistic Interpretability Mechanistic interpretability (MI) aims to under- stand how neural networks function internally, with a focus on understanding the internal mechanisms and computational processes involved in perform- ing a task. MI work often revolves around identify- ing “circuits” responsible for specific tasks (Olah et al., 2020). To achieve this, several intervention techniques have been developed, with tradeoffs in resolution, breadth of applicability, and computa- tional cost. Ablation(orknockout) involves removing parts of the model, such as layers or neurons, to observe changes in behavior. We use zero-ablation for one analysis because it is very general, well-supported in the literature (Wang et al., 2023; Elhage et al., 2021) and computationally inexpensive, albeit less precise than other methods. In earlier ablation work, a common approach has been to use “leave-one-out” ablation, i.e., to ablate a single layer on each trial. However, other ap- proaches may perform better in cases where mod- els are very robust to ablations (as is commonly the case for LLMs, especially with larger mod- els). COAR (componentattribution viaregression) (Shah et al., 2024) is a recently proposed technique which ablates random subsets of components in a model and produces attributions using linear regres- sion. We perform both leave-one-out and COAR ablations and compare the results of the two. Causal tracing, also calledactivation patching, treats the model’s hidden states as a causal graph (Pearl, 2009), which can be analyzed with an ap- proach based on causal mediation analysis (Vig et al., 2020). (We discuss this further in Section 3.) Causal tracing is more precise than ablation, at the cost of higher computational demands and a need for more careful setup. Probingtechniques (Alain and Bengio, 2018) involve training a simple probe (commonly a one- layer binary classifier) on model activations, in order to find places in the model’s representation space that represent specific functions of the input. TheP(IK)probes used in this paper are an example of this. 2.3 Applications of Interpretable Uncertainty Quantification in Large Language Models Reliable UQ could help to improve LLM trust- worthiness by allowing auditing of LLMs in high- stakes domains, such as medical and legal applica- tions (Gawlikowski et al., 2023) and applications of LLM-based agents (Yang et al., 2023). 
Inter- pretability could also help to ensure that UQ tech- niques remain reliable under distribution shifts, and could contribute to detecting deception (Hendrycks et al., 2021a). Finally, if limitations in UQ are related to broader epistemic weaknesses in LLMs, interpretable UQ could shed light on problems such as hallucination (Zhang et al., 2023; Manakul et al., 2023) and could deepen our understanding of rea- soning and knowledge in LLMs in general, possibly helping to address problems such as eliciting latent knowledge (Christiano et al., 2024). 3 Methods Probe DesignWe use aP(IK)probing approach in part because of the difficulty of reasoning about uncertainty using token probabilities. Token proba- bilities for open-ended questions are a highly im- perfect proxy for a model’s confidence, because they conflate semantic uncertainty, or uncertainty about content, with syntactic uncertainty, or uncer- tainty about form (Kuhn et al., 2022). Furthermore, we are most interested in improving uncertainty quantification for fine-tuned chat models, for which token probabilities do not correspond to an under- lying distribution over possible text strings. We construct a dataset on which to train the P(IK)probe according to the following steps. 1. Perform32forward passes for each question on the question-answering task. We used few- shot prompting with5examples to ensure that the model answered in the right format. 2. Check whether a model’s answers are correct. Specifically, we check whether a model’s an- swer contains any correct answer as a sub- string, ignoring case. (See Appendix B.1 for validation of this approach.) 3. For each question in the dataset, save the num- ber of correct and incorrect answers (implying a “true probability” of the model answering correctly). 4.Also, for each question, save the output of the model’s last layer (before the unembedding). This is a vector inR d model . TheP(IK)probe is a logistic classifierp: R d model →(0,1) which takes these last layer activa- tions as input and returns the proportion of correct answers. For example, if the model answers a ques- tion correctly 47% of the time, the probe should 12453 ...of*Abdullah*Ahmad*Bad*awi?Answer: model’sm restored (correct token prob.) 0 14 28 42 layer ...of*Abdullah*Ahmad*Bad*awi?Answer: probe’sp restored (probe output) 0.0 0.2 0.4 0.6 0.8 1.0 “What is the religion of Abdullah Ahmad Badawi?” ...of*Google*Street*View?Answer: model’sm restored (correct token prob.) 0 14 28 42 layer ...of*Google*Street*View?Answer: probe’sp restored (probe output) 0.0 0.2 0.4 0.6 0.8 1.0 “What is the owner of Google Street View?” ...does/did*Terence*Blanchardplay?Answer: model’sm restored (correct token prob.) 0 14 28 42 layer ...does/did*Terence*Blanchardplay?Answer: probe’sp restored (probe output) 0.0 0.2 0.4 0.6 0.8 1.0 “What musical instrument does/did Terence Blanchard play?” ...of*Abd*ul*lah*A*hm*ad*B*ada*wi?Answer: model’sm restored (correct token prob.) 0 16 32 layer ...of*Abd*ul*lah*A*hm*ad*B*ada*wi?Answer: probe’sp restored (probe output) 0.0 0.2 0.4 0.6 0.8 1.0 “What is the religion of Abdullah Ahmad Badawi?” ...of*Google*Street*View?Answer: model’sm restored (correct token prob.) 0 16 32 layer ...of*Google*Street*View?Answer: probe’sp restored (probe output) 0.0 0.2 0.4 0.6 0.8 1.0 “What is the owner of Google Street View?” ...does/did*Ter*ence*Bl*anch*ardplay?Answer: model’sm restored (correct token prob.) 
0 16 32 layer ...does/did*Ter*ence*Bl*anch*ardplay?Answer: probe’sp restored (probe output) 0.0 0.2 0.4 0.6 0.8 1.0 “What musical instrument does/did Terence Blanchard play?” Figure 2: Representative results of causal tracing, shown for Gemma 2 9B Instruct (top) and Llama 2 7B (bottom) on three questions in CounterFact. The vertical axis shows the layer. Onlylayer.outlocations are shown (plus embed.outin the first row). The input embeddings for the starred tokens (e.g.,*Abdullah) are replaced with zeros in the corrupted and restored runs. We chose these questions to be representative of typical behavior. In theleft column, the sets of components for which restorations have substantial effects onmandpare almost identical. In themiddlecolumn, the sets are very similarexceptat theGoogletoken. (Restoring this token decreases the probability of the correct output token—alsoGoogle—even below the corrupted baseline, but increases the model’s certainty in its answer.) This pattern was common across examples in which the correct answer appeared in the question. In therightcolumn, the model is confidently wrong. We exclude cases like this (with very lowm clean ) from this analysis, since we cannot meaningfully select a set of components which contribute to correctness. output 0.47 on the last-layer activations at the last token of that question. We trained with binary cross-entropy loss, using dropout and a triangular learning rate schedule, and used a low learning rate (η= 3×10 −6 ) as in Kadavath et al. (2022). We used between 2048 and 8192 examples per task (using fewer than 8192 when we were limited by the size of the dataset), and held out20%of the data as a test split. Experiments were done with examples from the test split. Models and DatasetsWe studied eight models, including Llama 2 and 3 and Gemma 2 models with two to thirteen billion parameters, and five datasets; these are described in detail in Appendix B. All of the datasets studied, with the partial ex- ception of MMLU (Hendrycks et al., 2021b), are “recall-intensive” in that they largely depend on re- calling factual information learned during training; we studied both multiple-choice datasets (MMLU, ARC (Clark et al., 2018) 1 ) and open-ended ones (TriviaQA (Joshi et al., 2017), WebQuestions (Be- rant et al., 2013), CounterFact (Meng et al., 2022)). Based on some preliminary zero-ablation experi- ments, we believe that models may exhibit similar behavior on some non-recall tasks such as simple math questions (see Appendix D for details). We used the CounterFact dataset (Meng et al., 2022) exclusively for causal tracing. We reformu- lated CounterFact prompts as questions to match the format of our other datasets. Because we used the TriviaQA probe for the causal tracing ex- periment with CounterFact, we also did few-shot prompting with the prompt from TriviaQA. 1 ARC includes both the ARC-Easy and ARC-Challenge splits. ARC questions are drawn from standardized tests; the datasets listed as ARC (Hg) and ARC (Other) correspond to the “Mercury” test and to a combination of the other 20 tests. 12454 Causal Tracing.Causal tracing is a causal in- tervention method that aims to trace and identify important components in neural models for a given task (Meng et al., 2022; Wang et al., 2023), which is a generalization of causal mediation analysis (Vig et al., 2020). In this work, we use causal trac- ing (Meng et al., 2022) to examine the importance and role of individual circuits and components in LLMs. 
Specifically, given a specific inputq, causal tracing involves three runs: (1) a clean run, in which the original inputqis given to the model, which is used to obtain the hidden states of each layer; (2) a corrupted run, in which the input em- beddings of certain tokens are corrupted by adding noise or (in this paper) replaced with zeros; and (3) a corrupted-with-restoration run, in which the computation is similar to the corrupted run except that the hidden states at specific locationsℓin the model are restored using the hidden states obtained from the clean run. By comparing the differences between the output (predicted probabilities) of the clean, corrupted, and restored runs, causal tracing allows the identification of important components in LLMs. That is, if the restored run achieves a similar effect as the clean run, it is likely that the corresponding restored component plays an impor- tant role in the model’s processing. Zero-Ablation.Zero-ablation is a mechanistic intervention technique that takes advantage of a transformer’s residual structure by treating atten- tion or MLP layers as separable modules which read from and write to the residual stream (Elhage et al., 2021; Nostalgebraist, 2020). A component ℓ(in this paper, an attention or MLP layer) is “ab- lated” by replacing its output with zero. The drop in model performance on a given task after an inter- vention removing a componentℓprovides a mea- sure of the importance ofℓfor the task. Leave-one-out and COAR interventions.Inter- pretability work using ablation commonly employs leave-one-out style interventions, in which an in- tervention is applied to a single component at a time. Since larger Transformer LMs are often in- sensitive to smaller interventions, leave-one-out interventions may struggle to meaningfully affect the target metrics. COAR (Shah et al., 2024) is a recent approach which addresses this by applying ablation interventions to random subsets of model components. In a COAR experiment, ablations are performed for many dataset examples and subsets of components, and linear regression is used to pre- dict the target metrics from a vector of ablated com- ponents; the coefficients of the linear predictor then reflect the predicted effect of ablating each compo- nent on the target metric. (We refer the reader to Shah et al. (2024) for details.) 4 Uncertainty Introspection and the Shared Circuits Hypothesis The aim of this paper is to make progress toward characterizing the mechanistic structures used for UQ in language models. To this end, we propose a theoretical hypothesis (“shared circuits”) about the locations of these structures, along with opera- tionalizations which we test experimentally. Shared Circuits Hypothesis.Uncertainty quantification in question-answering (QA) systems may be carried out in a variety of ways. We hypothesize that language models are capable of expressing uncertainty using shared circuitsthat both solve the underly- ing question-answering task and output un- certainty information. This contrasts with the possibility that uncertainty quantifica- tion emerges inseparate circuits, either to post-process messy uncertainty signals from question-answering circuits or to do uncer- tainty calculations of their own. Language models are known to be capable of introspective behavior in some contexts (Binder et al., 2024). The shared circuits hypothesis, to the extent that it is true, suggests that uncertainty quantification is one such context. 
We refer to this phenomenon as “uncertainty introspection”. We use aP(IK)probing approach as in (Kada- vath et al., 2022) in part because of the difficulty of reasoning about uncertainty using token probabili- ties. Token probabilities for open-ended questions are a highly imperfect proxy for a model’s confi- dence, because they conflate semantic uncertainty, or uncertainty about content, with syntactic un- certainty, or uncertainty about form (Kuhn et al., 2022). For details on models, datasets, and probes, see Appendices B through C. 4.1 Experiment Design: Causal Tracing On a given questionq i in a datasetQ, for each causal tracing run (clean, corrupted, and restored) we compute the model’s sample probabilitym(q i ) 12455 for the correct first token of the answer, and the probe’s confidencep(q i ). 2 We consider each ques- tion individually because this allows a particularly fine-grained test for shared circuits—we ask here whether the same circuits are used for QA and UQ on an in individual question, and in the next section whether this is true in aggregate for a task. Lo- cationsℓwherem restored(ℓ) ≈m clean correspond to parts of the model which are important for solving the QA task; likewise, locationsℓwhere p restored(ℓ) ≈p clean correspond to parts of the model which are important for the UQ task. 34 For causal tracing, we operationalize the shared circuits hypothesis in the claim thatm restored can be predicted fromp restored by interpolating be- tween the clean and corrupted values: e.g., if the model’s correct-token probability on a restored run is halfway between the clean and corrupted prob- abilities, then the probe’s confidence should be halfway between the clean and corrupted confi- dences. Specifically, for each questionq i ∈Q, we con- sider the linear predictorˆm restored defined by ˆm restored(ℓ) −m corrupted m clean −m corrupted = p restored(ℓ) −p corrupted p clean −p corrupted . That is: we predict that a restoration at a loca- tionℓwill have the same proportional effect on the model’s performance and the probe’s response, relative to the clean condition where there is no intervention and the corrupted condition where no data on the subject is available. We claim that this predictor explains most of the variance inm restored (i.e., has a highR 2 ). As a (somewhat weak) formal- ization of this, we attempt to reject the null hypoth- esis 5 H 0 :R 2 is no greater than expected under random permutations of the set of locationsℓ. 2 Correct-first-token probability is in this case a closely aligned proxy for correct-answer probability. To test validity, we checked 100 examples by hand and found that 98% were graded correctly (see Appendix B.1). 3 Although note that the converse is not strictly true; see Appendix 5 for details. 4 Here,m restored(ℓ) andp restored(ℓ) represent the correct token probability andpprobe output for a run with the hidden state restored at locationℓin the model; notation is likewise for clean and corrupted runs. 5 We report ourp-values as continuous variables in Appen- dices F and G, and caution against assigning undue value to thep= 0.05threshold. 4.2 Experiment Design: Zero-Ablation We also test the shared circuits hypothesis via zero- ablation on layers. Unlike for causal tracing, we sample and evaluate multi-token answers. We de- finem(q i )as the probability of the model sampling a correct answer when prompted on the question q i ∈Q, andp(q i )as the probe output on that ques- tion. 
Averaging overQ, we can compare changes in the model accuracymand the average probe outputp. 4.2.1 Leave-One-Out Ablation Under the shared circuits hypothesis, the change in the probe output from ablation|p ablated(ℓ) − p clean |is large when the change in model accuracy |m ablated(ℓ) −m clean |is large. Concretely, we claim that the predictor ˆ mdefined by m clean − ˆ m ablated(ℓ) =|p ablated(ℓ) −p clean | explains most of the variance inm ablated (has a highR 2 ), and attempt to reject the null hypothesis H 0 :R 2 is no greater than expected under random permutations of the layersℓ. We considerabsolute changes in the probe output only, because inter- ventions which severely damage the model may increase the value of the probe output, but gener- ally do not improve the model’s correctness. 4.2.2 COAR COAR constructs least-squares predictors for model accuracy and probe output based on vec- tors of ablated components, in which the coeffi- cient corresponding to a componentℓrepresents the expected effect of ablatingℓ. Under the shared circuits hypothesis, the predictorsw m andw p for the model accuracy and probe output should be similar. Concretely, we attempt to reject the null hypothesisH 0 :The correlation betweenw m and w p is no greater than expected under random per- mutations of the layersℓ. We see COAR as a useful complement to leave-one-out ablation because it addresses cases where models are highly resilient to ablations, a common challenge for ablation on larger models. 4.3 Permutation testing We tested our hypotheses using permutation tests with Monte Carlo sampling. Specifically, for each test, we compared the goodness-of-fit of the ob- served data with that of a synthetic dataset made by shuffling the locations and (for causal tracing) 12456 0.00.20.40.60.81.0 probe outputp 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 3 8B Inst. TriviaQA 0.00.20.40.60.81.0 probe outputp Llama 3 8B Inst. WebQs 0.00.20.40.60.81.0 probe outputp Llama 3 8B Inst. MMLU 0.00.20.40.60.81.0 probe outputp Llama 3 8B Inst. ARC (Hg) 0.00.20.40.60.81.0 probe outputp Llama 3 8B Inst. ARC (Other) −0.2−0.10.00.10.2 coeff. for probe outputp −0.2 −0.1 0.0 0.1 0.2 coeff. model accuracy m Gemma 2 9B Inst. / TriviaQA −0.2−0.10.00.10.2 coeff. for probe outputp Gemma 2 9B Inst. / WebQs −0.2−0.10.00.10.2 coeff. for probe outputp Gemma 2 9B Inst. / ARC (Hg) −0.2−0.10.00.10.2 coeff. for probe outputp Gemma 2 9B Inst. / ARC (Other) Figure 3:Top.Results of leave-one-out style zero ablation for Llama 3 8B Instruct on five different datasets. Circle, triangle, and small X markers represent MLP ablations, attention ablations, and clean runs respectively. Warmer colors represent earlier layers.Bottom.Coefficients for zero ablation with COAR, for Gemma 2 9B Instruct on four different datasets. Circle and triangle markers represent MLP and attention respectively. Warmer colors represent earlier layers. token positions. To exclude the simple explanation that some types of locations performed better than others, we shuffled locations of different types (e.g., attention and MLP outputs) independently. 4.4 Testing the Hypothesis Causal TracingWe performed causal tracing with all eight models on a random sample of 100 questions from CounterFact (Meng et al., 2022). We considered only questions withm clean >0.05 (since otherwise predictingm restored is trivial). We used the probe and few-shot prompt for TriviaQA. Across this sample, the predictorsˆm restored esti- matedm restored well, withR 2 >0.6in most cases. 
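(Illustrative sketch, not part of the paper.) The Monte Carlo permutation test described in Section 4.3 — comparing the observed goodness-of-fit against shuffles of the set of locations — can be outlined as below. The arrays are synthetic placeholders rather than measured values, and the paper additionally shuffles each location type (e.g., mlp.out vs. layer.out) independently.

# Hedged sketch of the permutation test for the null hypothesis that R^2 is
# no greater than expected under random permutations of the locations.
import numpy as np

rng = np.random.default_rng(0)
n_locations = 120
m_restored = rng.uniform(size=n_locations)                  # placeholder target metric
m_hat = m_restored + 0.1 * rng.normal(size=n_locations)     # placeholder predictor values

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

observed = r_squared(m_restored, m_hat)

# Null distribution: re-pair predicted values with locations at random.
null = np.array([r_squared(m_restored, rng.permutation(m_hat)) for _ in range(1000)])
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
print(f"observed R^2 = {observed:.2f}, permutation p = {p_value:.3f}")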
On each questionq i , we tested the null hypothe- sis by sampling 1000 permutations. 6 In almost all cases (see Fig. 5), we rejectH 0 withp <0.05. Based on manual inspection, 7 we conclude that R 2 <1both due to small discrepancies between UQ and QA circuitry and due to nonlinearity in 6 Specifically, we shuffled the values ofm restored(ℓ) independentlyforthemlp.out,mlp.resid,and layer.out/embed.outlocations, to exclude the expla- nation that the predictor works well because themlp.out andmlp.residstates each carry less information than layer.out. 7 See graphs in the supplementary materials online (see Appendix A). the UQ/QA relationship. In the cases studied, the model is more resilient than the probe: that is, in- terventions generally have a greater effect on the probe than the model (creating the convex shape in Figure 4, left); this depends to some extent on the model architecture. In some cases, when the probe is confidently wrong (see Figure 2, right), the probe may be following the path for the model’s (incorrect) highest-probability token. As in Meng et al. (2022), highly important loca- tions generally fall into two clusters: one in earlier layers at the token positions in the subject, and one at later layers at the last token position. We note that uncertainty information and answer informa- tion are often transferred to the last position by attention heads in different layers (Fig. 2. 8 These small differences suggest that ourP(IK)probes are using the model’s question-answering circuitry directly, rather than by performing separate or post- hoc uncertainty calculations. 8 Other discrepancies occasionally occur: in particular, when an answer token (often a proper noun) is present in the question, restorations at the corresponding token position show suppressed model accuracy but normal probe perfor- mance. One possible explanation is that the model may be using circuitry similar to the “negative name movers” in Wang et al. (2022) to avoid spuriously copying input tokens to the output. 12457 0.00.20.40.60.81.0 probe’sp 0.0 0.2 0.4 0.6 0.8 1.0 model’s m “What is the official language of Prince Edward Island?” 0.00.20.40.60.81.0 probe’sp 0.0 0.2 0.4 0.6 0.8 1.0 model’s m “What is the official language of Prince Edward Island?” Llama 2 7B Base Llama 2 7B Chat Llama 3 8B Base Llama 3 8B Inst. Llama 2 13B Base Llama 2 13B Chat Gemma 2 2B Inst. <−1−0.500.51 R 2 Gemma 2 9B Inst. Goodness of fit for causal tracing predictor Figure 4:Left and center.Predicting the correct-token probabilitymgiven the probe outputp, for Gemma 2 9B Instruct (left) and Llama 2 7B (center). The black and red X (small, top-right and bottom-left) show the clean and corrupted runs; all others show restored runs. Yellow points are later in the sequence. The grey line shows the predictorˆm.Right.Values ofR 2 for the causal tracing predictor. “<” signifies cases whereR 2 <−1(which is possible because the predictor is not a linear-regression line). Zero-Ablation(Leave-One-Out).Weper- formed 500 ablation trials each with eight models across five question-answering datasets. Across this sample, the predictors ˆ m ablated generally estimatedm ablated better than chance, with a median ofR 2 = 0.33.For each model–dataset combination, we tested the null hypothesis by sampling 10,000 permutations. 
As with the causal tracing analysis, we shuffled attention and MLP layer interventions independently, to exclude the explanation that one type of layer was more important than the other in a way not specific to the QA and UQ tasks. We reject the null hypothesis withp <0.05in 36 out of 38 cases, andp <0.0001in 31 out of 38 cases. In many cases, the model’s uncertainty represen- tation plays particularly nicely with zero-ablation, remaining calibrated on average even after an in- tervention: using the same statistical framework as above, the very simple predictor ˆ m ablated =p ablated does better than expected under random permuta- tions in 27 out of 38 cases (atp <0.05). 9 While other explanations may be possible, one interpretation of these results is that a given compo- nent makes a nonzero contribution to the model’s uncertainty representation if and only if it can also contribute information about the answer. Zero-Ablation (COAR).We performed 2000 COAR trials each with all models and four 9 IfR 2 is the fraction of the variance inm ablated explained by ˆ m ablated =p ablated , we reject the null hypothesisR 2 is no greater than expected under random permutations of the set of layersatp <0.05in 27/38 cases. datasets. 10 For each trial, the probability of ab- lating any given component was set atα= 0.2. We reject the null hypothesis withp <0.05in all but one case. Particularly strong correlations were present for the Gemma models; this may be related to our choice ofαand these models’ robustness to interventions in the leave-one-out experiments. 5 Discussion and Conclusion The results of the causal tracing and zero-ablation analyses presented in the previous section broadly support the shared circuits hypothesis, implying that—across the setups we considered—the sets of model components used for question-answering and uncertainty quantification were largely, albeit not entirely, the same. This suggests thatP(IK) probing may be a viable way of eliciting introspec- tive, interpretable uncertainty estimates. Based on these findings, further research could analyze the mechanisms responsible forP(IK)estimates in greater detail, or applyP(IK)probing as an inter- pretability tool to study phenomena such as halluci- nation in LLMs. Similar analyses of other methods of uncertainty quantification (e.g., verbalized un- certainty) may provide insight further insight into the role of uncertainty introspection in uncertainty quantification. More broadly, we see interpretable uncertainty quantification as a potentially useful approach for understanding and improving LLM reasoning, in order to improve trustworthiness and reliability and inform technical AI governance. 10 We excluded MMLU because of computational resource constraints. 12458 Limitations Causal tracing and zero-ablation, like many inter- pretability techniques, yield results which can im- perfectly reflect the contributions of model inter- nals to a task. In particular: Zero-ablation.We chose to ablate activations in the model with zeros. While the zero vector is far from an arbitrary choice, especially given its relevance to dropout and the additive residual structure of a transformer, this approach may lack specificity. For example, zero-ablating an early or late MLP layer sometimes severely damages a model’s ability to produce coherent language in general, so accuracies from ablation do not neces- sarily correspond to the flow of question-specific information through the model. 
Approaches such as causal scrubbing (Chan et al., 2022) avoid this limitation but are generally more computationally expensive. Causal tracing.The “path” through the model identified comprises, to a first approximation, the set of points in the model at whichallinforma- tion relevant to the task is present. As such, when information relevant to a question passes along mul- tiple paths in parallel, it may be that no individual path shows a substantial difference between the re- stored and baseline conditions. For example, in the question in Fig. 2 (center), restoring the input em- bedding for any one token of “Google Street View” without the others has little effect on the model. Acknowledgments TGJR acknowledges support through a CSET Foun- dational Research Grant. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise. References Guillaume Alain and Yoshua Bengio. 2018. Under- standing intermediate layers using linear classifier probes. Shuang Ao, Stefan Rueger, and Advaith Siddharthan. 2024. Css: Contrastive semantic similarities for un- certainty quantification of llms. InThe 40th Confer- ence on Uncertainty in Artificial Intelligence. Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. 2024. Linguistic calibration of long- form generations. InForty-first International Confer- ence on Machine Learning. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. InProceedings of the 2013 Conference on Empirical Methods in Natural Lan- guage Processing, pages 1533–1544, Seattle, Wash- ington, USA. Association for Computational Linguis- tics. Felix J. Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans. 2024. Looking Inward: Language Models Can Learn About Them- selves by Introspection. ArXiv:2410.13787. Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny, Ansh Rad- hakrishnan, Buck Shlegeris, and Nate Thomas. 2022. Causal scrubbing: A method for rigorously testing interpretability hypotheses. Paul Christiano, Mark Xu, and Ajeya Cotra. 2024. Elic- iting latent knowledge. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018.Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Ka- plan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread. Https://transformer- circuits.pub/2021/framework/index.html. Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Spe- cia. 2020. Unsupervised quality estimation for neural machine translation.Transactions of the Association for Computational Linguistics, 8:539–555. Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxi- ang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, Muhammad Shahzad, Wen Yang, Richard Bamler, and Xiao Xiang Zhu. 2023. 
A sur- vey of uncertainty in deep neural networks.Artificial Intelligence Review, 56(1):1513–1589. Gemma Team, Google AI. 2024. Gemma 2: Improving open language models at a practical size. Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning ai with shared human values.Pro- ceedings of the International Conference on Learning Representations (ICLR). 12459 Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Stein- hardt. 2021b. Measuring massive multitask language understanding.Proceedings of the International Con- ference on Learning Representations (ICLR). Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehen- sion. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), page 1601–1611, Vancouver, Canada. Association for Computational Linguistics. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Ka- mal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language models (mostly) know what they know. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2022. Semantic uncertainty: Linguistic invariances for un- certainty estimation in natural language generation. InProceedings of the Eleventh International Confer- ence on Learning Representations. Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2024. Generating with confidence: Uncertainty quan- tification for black-box large language models. ArXiv:2305.19187 [cs, stat]. Llama 2 Team, Meta AI. 2023. Llama 2: Open founda- tion and fine-tuned chat models. Llama 3 Team, Meta AI. 2024. The llama 3 herd of models. Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucina- tion detection for generative large language models. InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 9004–9017, Singapore. Association for Computa- tional Linguistics. Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associ- ations in GPT. InAdvances in Neural Information Processing Systems. Nostalgebraist. 2020. Interpreting GPT: The logit lens. Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits.Distill. Https://distill.pub/2020/circuits/zoom-in. Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Re- ichart, Idan Szpektor, Hadas Kotek, and Yonatan Be- linkov. 2024. LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucina- tions. ArXiv:2410.02707. Judea Pearl. 2009.Causality: Models, Reasoning and Inference, 2nd edition. Cambridge University Press, USA. Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick S. H. Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020. KILT: a benchmark for knowledge intensive language tasks. CoRR, abs/2009.02252. Harshay Shah, Andrew Ilyas, and Aleksander Madry. 
2024. Decomposing and Editing Predictions by Mod- eling Model Computation. Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stu- art Shieber. 2020. Investigating gender bias in lan- guage models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Inter- pretability in the wild: A circuit for indirect object identification in GPT-2 Small. InProceedings of the Eleventh International Conference on Learning Representations. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Inter- pretability in the wild: a circuit for indirect object identification in GPT-2 small. InThe Eleventh Inter- national Conference on Learning Representations. Hui Yang, Sifu Yue, and Yunzhong He. 2023. Auto- gpt for online decision making: Benchmarks and additional opinions. Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. 2024. How language model hallu- cinations can snowball. InForty-first International Conference on Machine Learning. Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023. Siren’s song in the ai ocean: A survey on hallucination in large language models. 12460 A Reproducibility Code to reproduce our results can be found athttps://github.com/crtep/sciurus. B Models and Datasets Table 1: Models studied. ModelParametersLayers Llama 2 7B7B32 Llama 2 7B Chat7B32 Llama 2 13B13B40 Llama 2 13B Chat13B40 Llama 3 8B8B32 Llama 3 8B Instruct8B32 Gemma 2 2B Instruct2B26 Gemma 2 9B Instruct9B42 B.1 Validation of Answer-Checking Procedure For substring matching, we manually checked 100 answers each from TriviaQA and WebQuestions, generated by Gemma 2 2B Chat (the smallest of our models). For TriviaQA, substring matching graded 98 out of 100 model answers correctly, as evaluated by a human, with 2 false “incorrect”s where the model formatted the answer unacceptably. For WebQuestions, substring matching graded 86 out of 100 model answers correctly, as evaluated by a human, with 11 false “incorrect”s where the model formatted the answer unacceptably and 3 false “correct”s where the model gave additional details that made the answer incorrect. For CounterFact, first-token matching graded 98 out of 100 model answers correctly, as evaluated by a human, with 2 false “incorrect”s where the model formatted the answer unacceptably. B.2 Licenses for Models and Datasets Models: •Llama 2 (Llama 2 Team, Meta AI, 2023) is licensed under the Llama 2 Community License Agree- ment, available at https://ai.meta.com/llama/license/. •Llama 3 (Llama 3 Team, Meta AI, 2024) is licensed under the Meta Llama 3 License, available at https://llama.meta.com/llama3/license/. •Gemma 2 (Gemma Team, Google AI, 2024) is licensed under the Gemma Terms of Use, available at https://ai.google.dev/gemma/terms. Datasets: •TriviaQA (Joshi et al., 2017) is licensed under the Apache License 2.0, available at https://w.apache.org/licenses/LICENSE-2.0. •WebQuestions (Berant et al., 2013) is licensed under the Creative Commons Attribution 4.0 Interna- tional License, available at https://creativecommons.org/licenses/by/4.0/. 
• MMLU (Massive Multitask Language Understanding) (Hendrycks et al., 2021b) is licensed under the MIT License, available at https://opensource.org/licenses/MIT. 12461 •ARC (AI2 Reasoning Challenge) (Clark et al., 2018) is licensed under the Cre- ativeCommonsAttribution-ShareAlike4.0InternationalLicense,availableat https://creativecommons.org/licenses/by-sa/4.0/. •CounterFact (Meng et al., 2022) is licensed under the MIT License, available at https://opensource.org/licenses/MIT. 12462 C Model and Probe Performance ModelDatasetModel accuracyProbe accuracy (bal.)ECE Llama 2 7B TriviaQA0.60060.77870.0342 WebQuestions0.40160.66740.0320 MMLU0.39840.65710.0265 ARC (Mercury)0.58450.67310.0363 ARC (Other)0.62600.70110.0327 Math (2 Digit)0.74950.83740.0287 Math (3 Digit)0.67970.87010.0143 Math (4 Digit)0.64940.84070.0207 Llama 2 7B Chat TriviaQA0.58500.78190.0315 WebQuestions0.43430.70510.0213 MMLU0.46880.67010.0272 ARC (Mercury)0.69730.63750.0294 ARC (Other)0.76320.67190.0395 Math (2 Digit)0.70900.86170.0317 Math (3 Digit)0.62520.82190.0292 Math (4 Digit)0.58640.79240.0346 Llama 3 8B TriviaQA0.65820.70260.0366 WebQuestions0.41580.70340.0405 MMLU0.60550.74550.0171 ARC (Mercury)0.84960.60350.0459 ARC (Other)0.84230.59910.0313 Llama 3 8B Instruct TriviaQA0.65090.70370.0397 WebQuestions0.44600.72130.0530 MMLU0.64450.72010.0300 ARC (Mercury)0.87790.60140.0495 ARC (Other)0.85690.66580.0362 Math (2 Digit)0.93650.79350.0502 Math (3 Digit)0.78610.97970.0421 Math (4 Digit)0.74370.96610.0408 Llama 2 13B TriviaQA0.66800.69380.0324 WebQuestions0.43460.69480.0403 MMLU0.49580.72520.0284 ARC (Mercury)0.72900.50100.1029 ARC (Other)0.77640.66910.0239 Llama 2 13B Chat TriviaQA0.63770.70200.0414 WebQuestions0.44680.72020.0306 MMLU0.49020.69130.0208 ARC (Mercury)0.71340.63950.0475 ARC (Other)0.76370.59730.0192 Gemma 2 2B Instruct TriviaQA0.41800.72210.0136 WebQuestions0.29100.65010.0517 MMLU0.46170.68030.0344 ARC (Mercury)0.78760.63100.0228 ARC (Other)0.77050.60510.0612 Math (2 Digit)0.82760.71940.0357 Math (3 Digit)0.66110.84370.0219 Math (4 Digit)0.63670.82890.0308 Gemma 2 9B Instruct TriviaQA0.63920.73740.0306 WebQuestions0.35790.72640.0529 ARC (Other)0.91260.57860.0354 Table 2: Model performance metrics 12463 D Experiments on Non-Recall Tasks To explore whether our results were specific to recall-based tasks, we repeated our leave-one-out ablation analysis on a set of simple math datasets. We constructed datasets of 2-, 3-, and 4-digit math problems, consisting of equal mixes of addition, subtraction, multiplication, and division. We validated answers for addition, subtraction and multiplication by extracting the first valid integer from the answer and testing whether it matched the answer exactly. We validated answers for division by extracting the first decimal real number and testing whether it was within one percent of the correct answer. We ran our leave-one-out ablation analysis for these datasets with four models. Using the same hypothesis test as in the main analysis, we rejected the null hypothesis atp <0.05in 11 of 12 cases. 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / Math 2 R 2 = 0.24,p <0.0001 Llama 3 8B / Math 2 R 2 =−0.21,p= 0.0492 Llama 3 8B Inst. / Math 2 R 2 = 0.77,p <0.0001 Gemma 2 2B Inst. / Math 2 R 2 =−0.18,p= 0.0406 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / Math 3 R 2 = 0.84,p <0.0001 Llama 3 8B / Math 3 R 2 = 0.00,p <0.0001 Llama 3 8B Inst. / Math 3 R 2 = 0.72,p <0.0001 Gemma 2 2B Inst. 
/ Math 3 R 2 = 0.18,p= 0.217 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / Math 4 R 2 = 0.75,p <0.0001 Llama 3 8B / Math 4 R 2 =−0.70,p <0.0001 Llama 3 8B Inst. / Math 4 R 2 = 0.79,p= 0.0001 Gemma 2 2B Inst. / Math 4 R 2 = 0.63,p= 0.0006 Figure 5: Results of zero-ablation for selected models on math datasets. Circle, triangle, and X markers represent MLP ablations, attention ablations, and clean runs respectively. Warmer colors represent earlier layers. Error bars for individual points are omitted for legibility, but std. err.<0.032in all cases (by the bounds onpandm). E Statistics for Causal Tracing Modelp <0.05p≥0.05 Llama 2 7B550 Llama 2 7B Chat510 Llama 3 8B531 Llama 3 8B Instruct512 Llama 2 13B513 Llama 2 13B Chat454 Gemma 2 2B Instruct471 Gemma 2 9B Instruct500 Table 3: Number of occurrences ofp-values for causal tracing. 12464 F Full Results for Zero-Ablation (Leave-One-Out) 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / TriviaQA R 2 = 0.56,p <0.0001 Llama 2 7B Chat / TriviaQA R 2 = 0.66,p= 0.0002 Llama 3 8B / TriviaQA R 2 = 0.93,p <0.0001 Llama 3 8B Inst. / TriviaQA R 2 = 0.71,p <0.0001 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / WebQs R 2 = 0.10,p <0.0001 Llama 2 7B Chat / WebQs R 2 = 0.70,p= 0.0001 Llama 3 8B / WebQs R 2 = 0.39,p <0.0001 Llama 3 8B Inst. / WebQs R 2 =−0.09,p= 0.001 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / MMLU R 2 = 0.18,p <0.0001 Llama 2 7B Chat / MMLU R 2 =−0.29,p= 0.0006 Llama 3 8B / MMLU R 2 = 0.67,p <0.0001 Llama 3 8B Inst. / MMLU R 2 =−0.11,p= 0.0008 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / ARC (Hg) R 2 = 0.58,p <0.0001 Llama 2 7B Chat / ARC (Hg) R 2 = 0.68,p <0.0001 Llama 3 8B / ARC (Hg) R 2 = 0.23,p= 0.008 Llama 3 8B Inst. / ARC (Hg) R 2 = 0.56,p <0.0001 0.00.20.40.60.81.0 probe outputp 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 7B / ARC (Other) R 2 = 0.66,p <0.0001 0.00.20.40.60.81.0 probe outputp Llama 2 7B Chat / ARC (Other) R 2 = 0.47,p= 0.0016 0.00.20.40.60.81.0 probe outputp Llama 3 8B / ARC (Other) R 2 = 0.43,p= 0.0111 0.00.20.40.60.81.0 probe outputp Llama 3 8B Inst. / ARC (Other) R 2 = 0.66,p= 0.001 Figure 6: Results of zero-ablation for eight models and five datasets. Circle, triangle, and X markers represent MLP ablations, attention ablations, and clean runs respectively. Warmer colors represent earlier layers. Error bars for individual points are omitted for legibility, but std. err.<0.032in all cases (by the bounds onpandm). 12465 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 13B / TriviaQA R 2 = 0.45,p= 0.0008 Llama 2 13B Chat / TriviaQA R 2 = 0.35,p= 0.0017 Gemma 2 2B Inst. / TriviaQA R 2 = 0.63,p <0.0001 Gemma 2 9B Inst. / TriviaQA R 2 = 0.71,p <0.0001 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 13B / WebQs R 2 =−0.78,p <0.0001 Llama 2 13B Chat / WebQs R 2 =−0.81,p <0.0001 Gemma 2 2B Inst. / WebQs R 2 = 0.46,p= 0.0009 Gemma 2 9B Inst. / WebQs R 2 = 0.66,p <0.0001 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 13B / MMLU R 2 =−0.50,p <0.0001 Llama 2 13B Chat / MMLU R 2 = 0.24,p= 0.0227 Gemma 2 2B Inst. / MMLU R 2 = 0.73,p <0.0001 Trials for this model and dataset were skipped due to computational constraints. 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 13B / ARC (Hg) R 2 = 0.84,p <0.0001 Llama 2 13B Chat / ARC (Hg) R 2 = 0.20,p= 0.0641 Gemma 2 2B Inst. / ARC (Hg) R 2 = 0.81,p <0.0001 Trials for this model and dataset were skipped due to computational constraints. 
0.00.20.40.60.81.0 probe outputp 0.0 0.2 0.4 0.6 0.8 1.0 model accuracy m Llama 2 13B / ARC (Other) R 2 = 0.11,p= 0.137 0.00.20.40.60.81.0 probe outputp Llama 2 13B Chat / ARC (Other) R 2 = 0.61,p <0.0001 0.00.20.40.60.81.0 probe outputp Gemma 2 2B Inst. / ARC (Other) R 2 = 0.69,p <0.0001 0.00.20.40.60.81.0 probe outputp Gemma 2 9B Inst. / ARC (Other) R 2 = 0.26,p= 0.006 Figure 7:(continued)Results of zero-ablation for eight models and five datasets. Circle, triangle, and X markers represent MLP ablations, attention ablations, and clean runs respectively. Warmer colors represent earlier layers. Error bars for individual points are omitted for legibility, butstd. err.<0.032in all cases (by the bounds onpand m). 12466 G Full Results for Zero-Ablation (COAR) −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 7B / TriviaQA R 2 COAR = 0.14,0.59 R 2 = 0.13,p= 0.0208 Llama 2 7B Chat / TriviaQA R 2 COAR = 0.13,0.47 R 2 = 0.14,p= 0.0222 Llama 3 8B / TriviaQA R 2 COAR = 0.13,0.46 R 2 = 0.20,p= 0.0041 Llama 3 8B Inst. / TriviaQA R 2 COAR = 0.14,0.51 R 2 = 0.27,p= 0.002 −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 7B / WebQs R 2 COAR = 0.11,0.61 R 2 = 0.07,p= 0.0514 Llama 2 7B Chat / WebQs R 2 COAR = 0.12,0.59 R 2 = 0.06,p= 0.0484 Llama 3 8B / WebQs R 2 COAR = 0.09,0.47 R 2 = 0.15,p= 0.005 Llama 3 8B Inst. / WebQs R 2 COAR = 0.12,0.50 R 2 = 0.30,p= 0.0008 −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 7B / ARC (Hg) R 2 COAR = 0.10,0.48 R 2 = 0.12,p= 0.0307 Llama 2 7B Chat / ARC (Hg) R 2 COAR = 0.13,0.47 R 2 = 0.16,p= 0.0302 Llama 3 8B / ARC (Hg) R 2 COAR = 0.14,0.30 R 2 = 0.17,p= 0.0106 Llama 3 8B Inst. / ARC (Hg) R 2 COAR = 0.13,0.39 R 2 = 0.20,p= 0.0021 −0.2−0.10.00.10.2 coeff. for probe outputp −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 7B / ARC (Other) R 2 COAR = 0.11,0.60 R 2 = 0.11,p= 0.0334 −0.2−0.10.00.10.2 coeff. for probe outputp Llama 2 7B Chat / ARC (Other) R 2 COAR = 0.13,0.65 R 2 = 0.10,p= 0.0294 −0.2−0.10.00.10.2 coeff. for probe outputp Llama 3 8B / ARC (Other) R 2 COAR = 0.12,0.56 R 2 = 0.20,p= 0.0036 −0.2−0.10.00.10.2 coeff. for probe outputp Llama 3 8B Inst. / ARC (Other) R 2 COAR = 0.10,0.62 R 2 = 0.17,p= 0.0094 Figure 8: COAR coefficients for zero-ablation for eight models and four datasets. Circle and triangle markers represent MLP and attention ablations respectively. Warmer colors represent earlier layers. Error bars for individual points are omitted for legibility. The two values forR 2 COAR are the fraction of variance inmandpexplained by the COAR prediction from the set of ablations. 12467 −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 13B / TriviaQA R 2 COAR = 0.22,0.40 R 2 = 0.20,p= 0.0015 Llama 2 13B Chat / TriviaQA R 2 COAR = 0.17,0.43 R 2 = 0.44,p <0.0001 Gemma 2 2B Inst. / TriviaQA R 2 COAR = 0.10,0.18 R 2 = 0.47,p <0.0001 Gemma 2 9B Inst. / TriviaQA R 2 COAR = 0.11,0.25 R 2 = 0.30,p <0.0001 −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 13B / WebQs R 2 COAR = 0.14,0.50 R 2 = 0.53,p <0.0001 Llama 2 13B Chat / WebQs R 2 COAR = 0.16,0.42 R 2 = 0.47,p <0.0001 Gemma 2 2B Inst. / WebQs R 2 COAR = 0.08,0.23 R 2 = 0.18,p= 0.0009 Gemma 2 9B Inst. / WebQs R 2 COAR = 0.08,0.24 R 2 = 0.20,p= 0.0001 −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 13B / ARC (Hg) R 2 COAR = 0.14,0.43 R 2 = 0.41,p= 0.0001 Llama 2 13B Chat / ARC (Hg) R 2 COAR = 0.17,0.46 R 2 = 0.48,p <0.0001 Gemma 2 2B Inst. / ARC (Hg) R 2 COAR = 0.12,0.36 R 2 = 0.35,p <0.0001 Gemma 2 9B Inst. 
/ ARC (Hg) R 2 COAR = 0.19,0.48 R 2 = 0.42,p <0.0001 −0.2−0.10.00.10.2 coeff. for probe outputp −0.2 −0.1 0.0 0.1 0.2 coeff. for model accuracy m Llama 2 13B / ARC (Other) R 2 COAR = 0.16,0.75 R 2 = 0.29,p= 0.0002 −0.2−0.10.00.10.2 coeff. for probe outputp Llama 2 13B Chat / ARC (Other) R 2 COAR = 0.19,0.75 R 2 = 0.31,p= 0.0004 −0.2−0.10.00.10.2 coeff. for probe outputp Gemma 2 2B Inst. / ARC (Other) R 2 COAR = 0.07,0.44 R 2 = 0.33,p <0.0001 −0.2−0.10.00.10.2 coeff. for probe outputp Gemma 2 9B Inst. / ARC (Other) R 2 COAR = 0.51,0.78 R 2 = 0.47,p <0.0001 Figure 9:(continued)COAR coefficients for zero-ablation for eight models and four datasets. Circle and triangle markers represent MLP and attention ablations respectively. Warmer colors represent earlier layers. Error bars for individual points are omitted for legibility. The two values forR 2 COAR are the fraction of variance inmandp explained by the COAR prediction from the set of ablations. 12468 H Computational Resources This project has used approximately 1200 GPU-hours of computation time on an academic cluster, mainly on RTX8000 GPUs with 48 GB of memory, including approximately 600 GPU-hours for results used directly in this paper. Results for individual model/dataset combinations can be reproduced independently; for example, the code to produce the TriviaQA / Llama 3 8B Instruct results ran in approximately 20 GPU-hours. I Ethics Statement This paper intends to advance the areas of interpretability and uncertainty quantification for language models, with the primary aim of making language models more reliable and more trustworthy. We expect these research directions in general to reduce societal risks from machine learning (for example, by allowing for warning signals in situations where a model might be lying or making a dangerous mistake). Nevertheless, since reliability work also makes systems more useful, some caution is warranted: for example, users might be tempted to deploy the resultant more-reliable systems in higher-stakes contexts in which tail risks from failures are greater. The humanoid and sciuroid robots in Fig. 1 were created using DALL-E 3. 12469