Paper deep dive

Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Wei Jie Yeo, Ranjan Satapathy, Erik Cambria

Year: 2024 | Venue: arXiv preprint | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 58

Models: Gemma-2-27B, Gemma-2-2B, Gemma-2-9B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 6:24:51 PM

Summary

The paper introduces 'Causal Faithfulness' (CaF), a metric for evaluating the faithfulness of Natural Language Explanations (NLEs) in Large Language Models. By utilizing activation patching instead of SHAP, the authors avoid out-of-distribution (OOD) issues and measure the consistency of causal attributions between model answers and their corresponding explanations. Experiments across various model sizes demonstrate that alignment-tuned models produce more faithful and plausible explanations.

Entities (5)

Activation Patching · method · 99%
Causal Faithfulness · metric · 98%
Large Language Models · technology · 95%
SHAP · method · 95%
Symmetric Token Replacement · method · 92%

Relation Signals (3)

Causal Faithfulness → utilizes → Activation Patching

confidence 98% · We first introduce Causal Faithfulness, which utilizes activation patching in place of SHAP

Activation Patching → mitigates → Out-of-Distribution issues

confidence 95% · We find that CaF is by design, a closer test of establishing the true faithfulness of NLEs... avoiding out-of-distribution concerns

Alignment-tuning → improves → Faithfulness

confidence 90% · models that underwent alignment-tuning tend to produce more faithful and plausible explanations.

Cypher Suggestions (2)

Identify relationships between metrics and their underlying techniques · confidence 95% · unvalidated

MATCH (m:Metric)-[:UTILIZES]->(t:Method) RETURN m.name, t.name

Find methods used to measure faithfulness in LLMs · confidence 90% · unvalidated

MATCH (m:Method)-[:USED_FOR]->(f:Property {name: 'Faithfulness'}) RETURN m.name

Abstract

Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be readily trusted at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of causal attributions between explanations and the corresponding model outputs as the indicator of faithfulness. We experimented across models varying from 2B to 27B parameters and found that models that underwent alignment tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests by taking into account the model's internal computations and avoiding out-of-distribution concerns that could otherwise undermine the validity of faithfulness assessments. We release the code at https://github.com/wj210/Causal-Faithfulness


Full Text


Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Wei Jie Yeo (1), Ranjan Satapathy (2), Erik Cambria (1)
(1) Nanyang Technological University
(2) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
yeow0082@e.ntu.edu.sg

Abstract

Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be readily trusted at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of causal attributions between explanations and the corresponding model outputs as the indicator of faithfulness. We experimented across models varying from 2B to 27B parameters and found that models that underwent alignment-tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests by taking into account the model's internal computations and avoiding out-of-distribution concerns that could otherwise undermine the validity of faithfulness assessments.
We release the code in https://github.com/wj210/Causal-Faithfulness

1 Introduction

The advent of Large Language Models (LLMs) has revolutionized the field of Natural Language Processing (NLP), enabling machines to perform considerably well across a wide array of tasks ranging from commonsense reasoning (Achiam et al., 2023) to producing high-quality training data (Yeo et al., 2024). Moreover, LLMs can generate highly plausible Natural Language Explanations (NLEs) to support their answers, for example through the Chain-of-Thought (CoT) prompting method (Wei et al., 2022). Despite their apparent sophistication, the trustworthiness of these explanations is not guaranteed, and their underlying faithfulness (Jie et al., 2024) should be carefully scrutinized. However, it is unclear which of the existing approaches (Atanasova et al., 2023; Lanham et al., 2023; Wiegreffe et al., 2020; Parcalabescu and Frank, 2023) is best suited for measuring faithfulness.

A commonly referenced definition of faithfulness is as follows: "The faithfulness of an interpretation is a measure of how well it accurately reflects the underlying reasoning process of the model" (Jacovi and Goldberg, 2020). Following this definition, we can be certain that a truly faithful explanation leads to goals such as instilling trust (Cambria et al., 2023; Yeo et al., 2023) in the receiving audience or enabling actionable insights given the information revealed by the explanation.
A recent work (Parcalabescu and Frank, 2023) argues that existing faithfulness tests are not designed to measure faithfulness under the referenced definition, but rather the ability of models to return consistent answers under various perturbed settings. Although these tests are still considered valid, their findings may be unreliable and inconsistent across different benchmarks. In this work, we study the internal computations of the model to assess the level of faithfulness of an explanation in supporting the explained answer. We borrow insights from works that perform Activation Patching (AP), also known as causal tracing, on the LLM's internal states (Meng et al., 2022; Vig et al., 2020; Zhang and Nanda, 2023), and show how the derived causal effects can be used to measure the faithfulness of a model's NLE. Our metric, Causal Faithfulness (CaF), compares the distribution over the causal effects behind the answer and the NLE in a similar fashion to Parcalabescu and Frank (2023). However, we explore at a deeper level by also examining consistency at the layer positions, and we perform AP instead of SHAP (Lundberg and Lee, 2017), which we later show is prone to Out-Of-Distribution (OOD) issues. We find that CaF is, by design, a closer test of establishing the true faithfulness of NLEs. Overall, the main contributions of this work can be summarized as follows:

1. We first introduce Causal Faithfulness, which utilizes activation patching in place of SHAP, to investigate faithfulness at three levels: token and layer, token-only and layer-only.
2. We investigate the relationship between the plausibility and faithfulness of NLEs to determine if a plausible NLE can be equally regarded as faithful.
3. We analyze our approach extensively against existing faithfulness tests on 6 open-sourced LLMs of varying sizes and across 3 benchmarks.
2 Related Works

2.1 Faithfulness of NLP models

An NLE is an abstractive form of explanation that consists of a sequence of free-text tokens, aimed at explaining the internal reasoning process of the model. On the other hand, extractive approaches (DeYoung et al., 2019; Lei et al., 2016; Wei Jie et al., 2024) highlight rationale tokens in the input to serve as the explanation. Though extractive approaches may appear to be faithful, Jacovi and Goldberg (2021) caution against such beliefs, as such an approach requires the explanation to be produced before the answer, thereby questioning its soundness. Existing works on faithfulness introduce different forms of perturbations, such as paraphrasing or corrupting the CoT explanation (Jie et al., 2024; Lanham et al., 2023), or checking for consistency in counterfactuals (Atanasova et al., 2023). Wiegreffe et al. (2020) corrupt the input and regard the similarity between the deterioration of the answer and explanation as a proxy for faithfulness. Turpin et al. (2024) showed that adding biasing features can significantly influence the CoT, leading to incorrect answers. Parcalabescu and Frank (2023) expanded on SHAP (Lundberg and Lee, 2017) by introducing C-SHAP, which measures the convergence between the attribution vectors of the answer and explanation as a measure of faithfulness. The authors argue that C-SHAP more closely aligns with the definition of faithfulness as "how well the explanation reflects the reasoning process." Our approach, CaF, is similar to C-SHAP but operates at a deeper level by also considering convergence between the layers while avoiding OOD samples, which may invalidate any findings.

2.2 Causality in NLP

Causal mediation methods aim to quantify the sensitivity of intermediate variables within a causal graph, following an intervention. The sensitivity is termed the causal effect and can be further decomposed into direct and indirect effects (Pearl, 2022).
Works leveraging causality insights include Gat et al. (2023), which assesses the faithfulness of NLP models based on their ability to identify features with significant causal effects. The authors argue that inputs with high causal influence are more likely to be successfully transformed into counterfactuals. Vig et al. (2020) detect gender bias by measuring the causal effects of a stereotypical input given an intervention. Paul et al. (2024) investigate the direct and indirect effects in CoT explanations generated by LLMs of different sizes and conclude that the causal effects of CoT are task-dependent, complementing our findings. Meng et al. (2022) perform causal mediation on the hidden states of a transformer (Vaswani et al., 2017) and discover that certain states are responsible for recollecting factual knowledge. The authors then showed that intervening in the states with the highest causal effect is helpful in knowledge editing. This approach inspired our work, where we perform similar interventions to quantify the causal effects of each LLM's internal states during test-time as a means of measuring faithfulness.

3 Preliminary

Figure 1: Activation Patching: Given two runs, a clean run under normal conditions [left] and a corrupted run where tokens in the input are replaced such that it leads to a counterfactual scenario [right], AP identifies the causal effects of the hidden state at the specified token and layer position through the changes in output after inserting the activations from the clean run. The indirect effect is thus measured via the mediated effects of the intervention (Meng et al., 2022).

We focus on decoder-only transformer LLMs, parameterized as $f_\theta$, that are $L$ layers deep, with each layer consisting of both an Attention (attn) and a Multi-Layer Perceptron (MLP) module.
$f_\theta$ models the discrete probability distribution of the next token over a predefined vocabulary set $V$ in an autoregressive manner, $f_\theta(p(x_{t+1}) \mid x_t, \ldots, x_1)$, where $p(x_{t+1}) \in \mathbb{R}^{|V|}$. The output token, $\hat{y} = x_{t+1}$, is then sampled from the probability distribution. For each input question, $f_\theta$ generates both an answer, $\hat{y}_a$, and a post-hoc explanation, $\hat{y}_e$. We are interested in assessing if $\hat{y}_e$ is a faithful representation of the internal reasoning responsible for $\hat{y}_a$. We denote the intermediate hidden representation of token $i$ at layer $l$ as $h_i^l \in \mathbb{R}^{|K|}$, with $K$ referring to the hidden size. The text tokens are encoded into embeddings with positional information, represented as $h_i^0 = f_{\theta,\text{embed}}(x_i) + \text{pos}(x_i)$.
The internal computation of $h_i^l$ consists of a self-attention operation, $a_i$, followed by an MLP operation, $m_i$, and is described as:

$$
\begin{aligned}
h_i^l &= h_i^{l-1} + a_i^l + m_i^l \\
a_i^l &= \text{attn}^l\left(h_1^{l-1}, h_2^{l-1}, \ldots, h_i^{l-1}\right) \\
m_i^l &= \text{MLP}^l\left(a_i^l + h_i^{l-1}\right)
\end{aligned}
\tag{1}
$$

3.1 Activation Patching

Meng et al. (2022) applied AP to identify specific hidden states essential for storing factual knowledge, which later proved useful for fact-editing. AP primarily requires three forward passes: a clean run, $p(\hat{y})$, a corrupted run, $p_*(\hat{y})$, and a patched run, $p_{*,\,\text{clean}\,h_i^l}(\hat{y})$, at the specified $i$ and $l$ hidden state positions. Meng et al. (2022) corrupt the embedding vectors of the subject tokens in the input by adding Gaussian noise (GN), $h_i^{0,*} = h_i^0 + \epsilon$, with $\epsilon$ sampled from $\mathcal{N}(0, 3\sigma)$, where $\sigma$ is the standard deviation over the set of input embeddings of the subject tokens, whose indices belong to a span $S$ of varying length.
The patched run then intervenes at the token $i$ and layer $l$ position by patching in the hidden states from the clean run, as seen in Figure 1, before continuing until the final output. This is repeated across all layers and input tokens such that we end up with a matrix, $C \in \mathbb{R}^{T \times L}$, where each element represents the causal effect of the corresponding hidden state towards each output generation, $\hat{y}_a$ or $\hat{y}_e$. Note that patching states at token positions $i$ such that $i < j$ for all $j \in S$ is pointless given the nature of causal attention used in decoder models. Thus $T$ represents the length of the original input after excluding the token positions preceding $S$.

Figure 2: (a) SHAP vs AP (STR); (b) STR vs GN. [Left] Counts of instances where the modified features are assigned higher importance. [Right] Probability scores of original and counterfactual answers in the clean and corrupted (STR/GN) runs.

Based on Pearl (2022), we can retrieve various causal effects from these three runs. The total effect due to the corruption is given as $p(\hat{y}) - p_*(\hat{y})$. We are interested in the indirect effect of the target hidden state, given as $p_{*,\,\text{clean}\,h_i^l}(\hat{y}) - p_*(\hat{y})$, where the mediator (hidden state) is intervened on while the independent variable (subject token) is held at the no-treatment state. This mediated effect can be observed in Equation 1, layer-wise via the MLP and position-wise in the self-attention module. However, adding GN is prone to inducing OOD inputs (Zhang and Nanda, 2023) that may invalidate the attribution readings.
This is particularly problematic as $C$ should ideally represent the importance of the target state in generating the dependent variable, rather than relieving the model from the distribution shift. We perform Symmetric Token Replacement (STR) instead, by replacing a subset of tokens in the input such that it becomes a counterfactual instance. We find that this relieves the OOD issue since the process is controlled: $y$ can be converted into counterfactual alternatives, $y_C \neq y$, while adding GN is uncontrollable. We discuss this further in 4.1.

4 Methodology

In this section, we discuss why activation patching is a suitable method for measuring the faithfulness of NLEs. While the causal values from AP themselves suffice as an explanation, they may appear less interpretable to human users, particularly when multiple hidden states exhibit significant causal effects. This is especially true when the output space, $T$, is large, as the resulting explanation can highlight numerous factors contributing to the model's decision. Such explanations, though accurate, may seem less intuitive compared to natural language explanations, which are generally easier for users to understand and process. On the other hand, NLEs should not be trusted easily, as their faithfulness is questionable. With the abundance of existing faithfulness tests, this raises another question: which is best suited to assess the faithfulness of NLEs?

Figure 3: The model's probability scores from both the clean and corrupted runs are recorded and deducted from the patched scores over each token and layer. All activations from the clean run are hooked and subsequently patched in at the target location before continuing the run. AP is implemented for both outputs, answer and explanation, resulting in the final causal matrix $C$, before measuring CaF.
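The patching loop from Section 3.1 and Figure 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar "hidden states", the fixed 0.5 weights, the sigmoid readout, and the names `forward` and `causal_matrix` are all assumptions made so that the example runs without a real LLM. The corrupted input flips the value of one token as a stand-in for STR.

```python
import math

def forward(embeds, n_layers, patch=None):
    """Toy causal residual model with scalar hidden states (an assumption).

    Returns (p, states), where states[l][i] plays the role of h_i^l and p is a
    stand-in for the probability of the original answer.
    patch=(i, l, value) overwrites h_i^l before the run continues.
    """
    h = list(embeds)
    states = [h]
    for l in range(1, n_layers + 1):
        new_h = []
        for i in range(len(h)):
            a = 0.5 * sum(h[: i + 1]) / (i + 1)  # causal "attention" over positions <= i
            m = 0.5 * math.tanh(a + h[i])        # "MLP" applied to a_i^l + h_i^{l-1}
            new_h.append(h[i] + a + m)           # residual update, mirroring Equation 1
        if patch is not None and patch[1] == l:
            new_h[patch[0]] = patch[2]           # patch in the clean hidden state
        h = new_h
        states.append(h)
    p = 1.0 / (1.0 + math.exp(-h[-1]))           # read out from the last token
    return p, states

def causal_matrix(clean_embeds, corrupt_embeds, n_layers):
    """Indirect effect of every (token, layer) state: p_patched - p_corrupted."""
    p_clean, clean_states = forward(clean_embeds, n_layers)
    p_corr, _ = forward(corrupt_embeds, n_layers)
    n_tokens = len(clean_embeds)
    C = [[0.0] * n_layers for _ in range(n_tokens)]
    for i in range(n_tokens):
        for l in range(1, n_layers + 1):
            p_patch, _ = forward(
                corrupt_embeds, n_layers, patch=(i, l, clean_states[l][i])
            )
            C[i][l - 1] = p_patch - p_corr
    return p_clean, p_corr, C

# Hypothetical inputs: token 1 is replaced by its counterfactual value.
clean = [0.2, 1.0, -0.3]
corrupt = [0.2, -1.0, -0.3]
p_clean, p_corr, C = causal_matrix(clean, corrupt, n_layers=3)
total_effect = p_clean - p_corr
```

In this toy setting, patching the last token at the last layer recovers the full total effect, while patching token 0, which precedes the replaced token, has no effect. The latter matches the remark that positions before $S$ can be excluded.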
Parcalabescu and Frank (2023) argue that most of these tests are designed to evaluate self-consistency rather than the referenced definition: "aligns with the underlying reasoning process behind the model's answer". Asserting complete faithfulness is impossible, as it would require reverse-engineering the internal mechanism of several billion parameters. It is nonetheless crucial to avoid erroneous measurements that may mislead the audience into attributing incorrect causes, which could have severe consequences in critical applications (Agarwal et al., 2024). This phenomenon, referred to as social misalignment (Jacovi and Goldberg, 2021; Hase et al., 2021), occurs when the understanding of an explanation is distorted. Such misalignment can arise from various factors, including the evaluation of models on OOD samples. Parcalabescu and Frank (2023) introduce C-SHAP by extending SHAP (Lundberg and Lee, 2017) to measure the consistency between the Shapley values of $\hat{y}_a$ and $\hat{y}_e$. The authors regard this form of measurement as a closer step towards assessing faithfulness under the commonly referenced definition.

4.1 Implications of Out-of-Distribution samples

In this section, we argue that using SHAP as an attribution technique to assess faithfulness has notable limitations. Similar to GN, we show that SHAP is susceptible to OOD samples because it integrates across all possible permutation pairs, which can lead to feature combinations likely unseen by the model. We investigate SHAP's behavior in attributing scores to counterfactual features that influence changes in model predictions. We compare the consistency between SHAP and AP on the ComVE (Wang et al., 2020) task with Gemma2-2B-chat (Team et al., 2024), where the model is required to identify the illogical statement from two given statements.
We split the task into two scenarios: the original, where the illogical statement is presented, and the counterfactual (CF), where the logical statement is given. Both statements, each of length $T$, differ by a subset of tokens with positions in $S \subseteq T$. We count how often each attribution method assigns a higher average score to $S$ over the complement, $S^N = T \setminus S$. The analysis covers both the original and CF scenarios, where the model is instead given a logical statement. We only consider instances where the model is successful in both scenarios. Figure 2(a) shows the averaged count of $S > S^N$. SHAP assigns higher importance to $S$ in only 12% of the CF scenarios, and in 50% of cases for both scenarios jointly, i.e., when $S > S^N$ holds for the original as well. In contrast, AP assigns higher importance to $S$ in nearly 95% of cases across both scenarios. We note that modifications to the feature space do not always result in the highest attributions in the event of a prediction shift. However, when the original input regards $S$ as important, failing to observe this when a prediction is altered indicates inconsistency and raises concerns about the fidelity of the approach. This inconsistency presents a challenge when SHAP is used to assess faithfulness. For GN, we notice that this form of corruption can lead to inputs that deteriorate the model's understanding of the task rather than removing important information. In Figure 2(b), adding GN causes the model to assign similar probability scores to the two answers, while STR correctly swaps the scores from illogical (Original) to logical (CF). Since ComVE only has two possible outcomes, assigning an equal score indicates a lack of confidence in either outcome.
We note that STR allows for greater control over the removal or negation of relevant information, thus enabling in-distribution corruption, as opposed to GN, which may break the model's ability to understand the task correctly. This is analogous to the findings in Zhang and Nanda (2023), where GN is observed to disrupt the model's internal mechanism, which is not the intended objective of AP.

4.2 Causal Faithfulness

We introduce Causal Faithfulness by measuring the divergence between the causal matrices of the answer and explanation, $C_a$ and $C_e$ respectively. Each matrix is generated via activation patching over $L$ layers and a feature set of size $T$ towards $\hat{y}_a$ and $\hat{y}_e$; see Figure 3. For outputs with more than one token, we average over the output sequence of length $M$:

$$C_i^l = \frac{1}{M} \sum_{j=1}^{M} C_{i,j}^l \tag{2}$$

We similarly employ cosine distance ($CD$) as our divergence metric, prioritizing the distribution of attribution over magnitude, since $C_a$ and $C_e$ may have significantly different ranges.

$$\text{CaF} = 1 - CD(C_e, C_a) \tag{3}$$

Equation 3 produces a continuous score, similar to C-SHAP (Parcalabescu and Frank, 2023), but also considers the layer-wise effects. We argue that assessing faithfulness according to the referenced definition requires making judgments beyond just the feature level. For instance, on reasoning tasks, one would expect the model to reason internally before generating $\hat{y}_a$, where the internal reasoning process is revealed via $\hat{y}_e$.
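Equations 2 and 3 can be sketched in a few lines. The sketch below assumes that the two $T \times L$ matrices are flattened into vectors before the cosine is taken, which the text does not state explicitly; the function names and the handling of all-zero matrices are likewise hypothetical.

```python
import math

def average_over_outputs(per_output_matrices):
    """Equation 2: element-wise mean of M causal matrices, one per output token."""
    M = len(per_output_matrices)
    T = len(per_output_matrices[0])
    L = len(per_output_matrices[0][0])
    return [
        [sum(mat[i][l] for mat in per_output_matrices) / M for l in range(L)]
        for i in range(T)
    ]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero matrix is treated as uninformative
    return dot / (norm_u * norm_v)

def caf(C_e, C_a):
    """Equation 3: CaF = 1 - CD(C_e, C_a), with CD the cosine distance."""
    flat_e = [x for row in C_e for x in row]  # assumption: flatten T x L to a vector
    flat_a = [x for row in C_a for x in row]
    cosine_distance = 1.0 - cosine_similarity(flat_e, flat_a)
    return 1.0 - cosine_distance

# Hypothetical causal values for a single-token answer and a two-token explanation.
C_a = [[0.1, 0.4], [0.0, 0.2]]
C_e = average_over_outputs([
    [[0.2, 0.8], [0.0, 0.4]],
    [[0.4, 1.6], [0.0, 0.8]],
])
score_same_shape = caf(C_e, C_a)
score_opposite = caf([[-x for x in row] for row in C_a], C_a)
```

Because cosine distance ignores magnitude, an explanation matrix that is a scaled copy of the answer matrix yields the maximum score, and a sign-flipped copy yields the minimum, consistent with CaF being bounded between -1 and 1.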
Input-level attribution addresses the question of "which features are important" but not "how does the model process each feature". While the notion of sufficient granularity for faithfulness is yet to be rigorously defined, layer-wise alignment can be particularly advantageous, for example when performing corrective actions on models with undesirable bias in certain layers, potentially observed via the NLEs. In contrast, CaF does not assume that faithfulness is guaranteed purely at the feature level but rather states that true faithfulness should also consider the model's internal computations.

5 Experiments

5.1 Dataset and Model

We implement the faithfulness test on 3 different benchmarks: CoS-E (Rajani et al., 2019) (commonsense reasoning), e-SNLI (Camburu et al., 2018) (natural language inference) and ComVE (Wang et al., 2020). CoS-E and e-SNLI provide annotated rationale tokens, which we use to create counterfactuals. In ComVE, we swap in the logical statement to act as the counterfactual. We discuss this further in A.1. We evaluate the Gemma-2 suite of LLMs (Team et al., 2024) at three different sizes: 2B, 9B and 27B, including both pre-trained and instruct-tuned versions, totalling 6 models. We use 3-shot prompting for pre-trained models and evaluate chat models under zero-shot settings. All explanations are generated post-hoc, conditioned on the model's answer to the question. We use the same template for each task for standardization. We assess a total of 100 samples for each task and average the scores across three seeds for each faithfulness metric.
Model/Test | Acc | CFF | C-SHAP | CaF | CaF(M) | CaF(T) | CaF(L)
CoS-E
2B | 42 | 14 | 62 | 13 | 13 | 20 | 10
2B-chat | 58 | 57 | 74 | 25 | 26 | 33 | 20
9B | 58 | 20 | 35 | 18 | 17 | 23 | 9
9B-chat | 76 | 57 | 89 | 39 | 37 | 49 | 35
27B | 68 | 20 | 40 | 34 | 33 | 43 | 19
27B-chat | 76 | 59 | 15 | 37 | 36 | 45 | 33
e-SNLI
2B | 59 | 25 | 65 | 8 | 7 | 14 | 5
2B-chat | 64 | 56 | 64 | 11 | 12 | 12 | 8
9B | 82 | 14 | 37 | 26 | 25 | 30 | 21
9B-chat | 90 | 64 | 91 | 41 | 41 | 50 | 37
27B | 91 | 13 | 62 | 34 | 36 | 42 | 16
27B-chat | 90 | 63 | 54 | 39 | 42 | 40 | 27
ComVE
2B | 52 | 17 | 82 | -6 | -9 | -6 | 5
2B-chat | 69 | 28 | 98 | 44 | 44 | 54 | 43
9B | 68 | 23 | 47 | 24 | 24 | 33 | 17
9B-chat | 91 | 78 | 91 | 40 | 37 | 53 | 35
27B | 86 | 24 | 82 | 39 | 36 | 51 | 23
27B-chat | 95 | 81 | 76 | 51 | 47 | 61 | 46

Table 1: Faithfulness test scores across different metrics. M refers to patching across multiple layers, T is the aggregated effects across layers within the same token position, and L is the aggregated effects across tokens within the layer. Instruct-tuned models on average tend to fare better than the corresponding pre-trained counterparts. CaF and C-SHAP are bounded between -1 and 1. Acc refers to the accuracy scores according to the task. All values are multiplied by 100.

5.2 Faithfulness Tests

We also include other baseline faithfulness metrics designed for post-hoc explanations: C-SHAP (Parcalabescu and Frank, 2023) and Counterfactual Faithfulness (CFF) (Atanasova et al., 2023). C-SHAP is similar to CaF but replaces AP with SHAP and measures the cosine similarity between the attribution vectors at the feature level. CFF introduces an adjective or adverb token at random positions in the input to influence the model into altering its original prediction, within a specified budget. If the model maintains its original prediction, it is considered faithful. CFF is primarily employed to identify instances of unfaithfulness, where the model's prediction changes, but the counterfactual explanation omits the inserted token. A noted limitation of CFF is that it focuses solely on syntactic checks, overlooking the importance of maintaining semantic consistency.
Additionally, we explore the impact of patching across multiple layers, rather than a single layer, which allows for a greater recovery (Meng et al., 2022; Zhang and Nanda, 2023). We use a window $w$ of 10, patching the layers in $[l - \frac{w}{2}, l + \frac{w}{2}]$ to represent the effects at layer $l$. We also study changes to CaF when aggregating effects across the layer and token positions. We denote CaF(T) as the divergence between token-level causal vectors, where the layers within each token position are aggregated, and vice versa at the layer level, CaF(L). The focus of our experiment is to address the following research questions:

• How do the NLEs from different LLMs fare under various faithfulness metrics?
• Is there a correlation between a model's capability and the faithfulness of the explanations produced?
• Does a plausible explanation entail a higher degree of faithfulness?

5.3 Findings

Chat models are more faithful: Table 1 indicates that instruct-tuned models tend to produce more faithful explanations, as indicated by CaF. This highlights the additional benefits of alignment tuning on faithfulness besides enhancing the plausibility of explanations, which we discuss further in 5.4. Since pre-trained models are not fine-tuned for specific task structures, they may lack an internal understanding of the task, such as when required to explain their own decisions. This can lead to a discrepancy in the distribution of causal values between the generated answer, $\hat{y}_a$, and the explanation, $\hat{y}_e$, resulting in lower consistency between the two. Patching across multiple layers does not yield significant differences, though we find that models exhibit a higher degree of consistency when the layer effects are aggregated, as in CaF(T), and less so at the layer level, CaF(L).
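The layer window and the aggregated variants CaF(T) and CaF(L) from Section 5.2 can be sketched as follows. Several details are assumptions rather than the authors' choices: layers are 0-indexed, the window is clipped at the first and last layer, aggregation is a plain sum, and all function names are hypothetical. Since cosine similarity is scale-invariant, summing and averaging give the same score.

```python
import math

def layer_window(l, w, n_layers):
    """Layers patched together to represent layer l: [l - w/2, l + w/2], clipped."""
    lo = max(0, l - w // 2)
    hi = min(n_layers - 1, l + w // 2)
    return list(range(lo, hi + 1))

def token_vector(C):
    """Aggregate over layers: one value per token position, as used for CaF(T)."""
    return [sum(row) for row in C]

def layer_vector(C):
    """Aggregate over tokens: one value per layer, as used for CaF(L)."""
    return [sum(row[l] for row in C) for l in range(len(C[0]))]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: all-zero vectors are treated as uninformative
    return dot / (norm_u * norm_v)

def flatten(C):
    return [x for row in C for x in row]

# Hypothetical 2-token, 2-layer causal matrices for answer and explanation.
C_a = [[1.0, 0.0], [0.0, 1.0]]
C_e = [[0.0, 1.0], [1.0, 0.0]]

caf_full = cosine_similarity(flatten(C_e), flatten(C_a))
caf_t = cosine_similarity(token_vector(C_e), token_vector(C_a))
caf_l = cosine_similarity(layer_vector(C_e), layer_vector(C_a))
window = layer_window(20, 10, 26)
```

In this constructed example the full matrices disagree at every (token, layer) cell, yet both aggregated scores are maximal, which shows how aggregation can hide cell-level disagreement and why aggregated scores may come out higher than the joint score.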
We also observe a positive relationship between task performance and faithfulness when comparing within the training category (i.e., pre-trained or instruct-tuned), which applies similarly to model scaling as well.

CaF vs other tests: As mentioned by Parcalabescu and Frank (2023), a primary concern is the high level of disagreement between existing faithfulness tests. Besides that, in 4.1, we discuss the risks involved with encountering OOD samples depending on the attribution method used, which affects metrics such as C-SHAP. For CFF, a drawback highlighted by Siegel et al. (2024) and Parcalabescu and Frank (2023) is that the scores can be manipulated by models trained to produce excessively verbose responses, where the input is always replicated in the explanation. Given the design of the test, this would almost certainly result in achieving perfect faithfulness. Another concern is that the counterfactual edits may not always result in a modified prediction, which would default to an indication of faithfulness. However, we argue that this has nothing to do with the faithfulness of NLEs, but rather points towards the robustness of the model against adversarial inserts. In our experiments, we note that this occurrence is rather significant in instruct-tuned models, ranging from 10% to 70% of the dataset.

Figure 4: Pearson's correlation between plausibility and faithfulness.

5.4 Plausibility vs Faithfulness

Ensuring faithful explanations is crucial to prevent over-reliance on plausible but unfaithful NLEs. Thus, it is important to investigate the correlation between plausibility and faithfulness. We evaluate plausibility using the framework from Chen et al. (2023), with OpenAI's GPT4-o (https://openai.com/index/hello-gpt-4o/) as the evaluator. Each explanation is judged based on six aspects (Hoffman et al., 2018), such as clarity and sufficiency, with more details in A.2.
We assess both e-SNLI and ComVE, utilizing their gold-standard explanations as references for GPT4-o. Our analysis reveals a positive correlation between plausibility scores and CaF, with ComVE exhibiting the stronger relationship (Figure 4). This aligns with observations from 5.3, where larger models consistently generate higher-quality NLEs, with instruct-tuned models occupying the upper right quadrant of the scatter plot. These results suggest that scaling models and improving general NLP task performance can enhance the explainability of LLM-generated explanations. However, we urge caution in interpreting these findings due to the limited sample size.

5.5 Distribution of Activation Patching

In addition to measuring the overall divergence between the distributions of causal values, we also analyze the internal patterns within these distributions, specifically the significance of key input elements such as the prediction itself and the corrupted tokens. Our primary focus is on the aggregated effects of the corrupted tokens, with positions denoted by S, the model’s prediction $\hat{y}_a$, and all the available answer choices. Given that e-SNLI and ComVE consist of only two possible answers, we limit our study to CoS-E, which offers a broader set of answer possibilities. The results, presented in Figure 5, reveal a consistent trend across all models: they often attribute higher causal importance to the corrupted tokens than to the prediction. This effect is even more pronounced when the model is explaining. Moreover, when aggregating the effects across all answer choices, the resulting overall causal score is significantly higher than that of the prediction alone. This suggests that the model may be reasoning over alternative options when formulating its answer or generating an explanation. Interestingly, patching multiple layers redistributes the causal values from the corrupted tokens to the answer tokens.
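The grouping used in this analysis can be sketched as follows: given a token-level causal vector, sum the values over the corrupted span S, over the tokens of the predicted answer choice, and over the tokens of all answer choices. The function name, the position lists, and the scores are hypothetical, and summation as the aggregation is an assumption.

```python
def group_scores(token_scores, span_S, choice_positions, predicted):
    """Aggregate token-level causal scores over three groups of positions.

    token_scores: causal value per token position (length T)
    span_S: positions of the corrupted tokens
    choice_positions: answer label -> token positions of that choice
    predicted: label of the choice the model selected
    """
    def total(positions):
        return sum(token_scores[i] for i in positions)

    return {
        "corrupted": total(span_S),
        "prediction": total(choice_positions[predicted]),
        "all_choices": sum(total(p) for p in choice_positions.values()),
    }

# Placeholder scores for a 7-token prompt, not values from the paper.
scores = [0.0, 0.5, 0.25, 0.0, 0.125, 0.25, 0.0]
result = group_scores(
    scores,
    span_S=[1, 2],
    choice_positions={"A": [4], "B": [5]},
    predicted="B",
)
print(result)
```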
Overall, we observe no significant differences in causality distribution trends across different model sizes.

6 Conclusion

In this work, we introduced a novel faithfulness metric, Causal Faithfulness, building on insights from activation patching. We demonstrated that the faithfulness of an LLM’s NLE can be assessed by evaluating the consistency between the causal distributions underlying the explanation and the model’s answer. Additionally, we expanded on existing critiques of current tests, particularly highlighting the risks of grounding faithfulness measurements on out-of-distribution inputs. We argue that, to achieve precise faithfulness assessments, the evaluation process must be rigorously examined to avoid drawing misinformed conclusions. We believe our approach is a step toward achieving true faithfulness, a monumental challenge that demands further research into techniques capable of providing deeper insights into the model’s internal behavior. Future work could explore integrating mechanistic interpretability approaches (Nanda et al., 2023) to develop a more robust and reliable framework for assessing faithfulness.

7 Limitations

Given the nature of analyzing model internals for assessing faithfulness, one limitation of conducting such a test is the increased computational requirements. CaF requires a total of $T \cdot L$ forward passes, which is higher than existing surface-level tests. However, this requirement is notably smaller than that of C-SHAP for smaller models, such as the 2B and 9B models, since C-SHAP scales linearly with T, which often exceeds L in such cases. Additionally, CaF avoids the need for approximating attribution values and does not rely on out-of-distribution samples; by design, it is a more theoretically sound framework for measuring faithfulness than existing surface-level methods.
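To give a sense of scale for the $T \cdot L$ cost, the short calculation below counts patched forward passes for a single prompt. The layer counts are assumed from the publicly released Gemma 2 configurations and should be checked against the actual checkpoints; the prompt length T is a hypothetical value.

```python
def caf_forward_passes(num_tokens, num_layers):
    """One patched forward pass per (token position, layer) pair: T * L."""
    return num_tokens * num_layers

# Assumed layer counts (verify against the model configs before relying on them).
ASSUMED_LAYERS = {"Gemma-2-2B": 26, "Gemma-2-9B": 42, "Gemma-2-27B": 46}
T = 40  # hypothetical number of patched token positions

for name, num_layers in ASSUMED_LAYERS.items():
    print(name, caf_forward_passes(T, num_layers))
```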
Another limitation concerns the process of token replacement, which requires the set of edited tokens to be of equal length to the tokens being replaced, S. This constraint is further complicated by the varying tokenization schemes used by different tokenizers. Future research could explore methods to relax this requirement, potentially enabling variable-length replacements. Lastly, activation patching primarily provides insights starting from the corrupted spans onward, which may limit the scope of the information revealed. We mitigate this limitation by focusing on samples where the rationales are positioned at the beginning. However, the impact of this constraint varies depending on the task structure. For instance, in Meng et al. (2022), the subject tokens are typically located at the beginning, and in e-SNLI, we target the premise. References Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Agarwal et al. (2024) Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. 2024. Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models. arXiv preprint arXiv:2402.04614. Atanasova et al. (2023) Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. 2023. Faithfulness tests for natural language explanations. arXiv preprint arXiv:2305.18029. Cambria et al. (2023) Erik Cambria, Rui Mao, Melvin Chen, Zhaoxia Wang, and Seng-Beng Ho. 2023. Seven pillars for the future of artificial intelligence. IEEE Intelligent Systems, 38(6):62–69. Camburu et al. (2018) Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31. Chen et al. 
(2023) Zichen Chen, Jianda Chen, Mitali Gaidhani, Ambuj Singh, and Misha Sra. 2023. Xplainllm: A qa explanation dataset for understanding llm decision-making. arXiv preprint arXiv:2311.08614. DeYoung et al. (2019) Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C Wallace. 2019. Eraser: A benchmark to evaluate rationalized nlp models. arXiv preprint arXiv:1911.03429. Gat et al. (2023) Yair Gat, Nitay Calderon, Amir Feder, Alexander Chapanin, Amit Sharma, and Roi Reichart. 2023. Faithful explanations of black-box nlp models using llm-generated counterfactuals. arXiv preprint arXiv:2310.00603. Hase et al. (2021) Peter Hase, Harry Xie, and Mohit Bansal. 2021. The out-of-distribution problem in explainability and search methods for feature importance explanations. Advances in neural information processing systems, 34:3650–3666. Hoffman et al. (2018) Robert R Hoffman, Shane T Mueller, Gary Klein, and Jordan Litman. 2018. Metrics for explainable ai: Challenges and prospects. arXiv preprint arXiv:1812.04608. Jacovi and Goldberg (2020) Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685. Jacovi and Goldberg (2021) Alon Jacovi and Yoav Goldberg. 2021. Aligning faithful interpretations with their social attribution. Transactions of the Association for Computational Linguistics, 9:294–310. Jie et al. (2024) Yeo Wei Jie, Ranjan Satapathy, Rick Goh, and Erik Cambria. 2024. How interpretable are reasoning explanations from prompting large language models? In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2148–2164. Lanham et al. (2023) Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. 2023. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702. 
Lei et al. (2016) Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155. Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30. Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372. Nanda et al. (2023) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217. Parcalabescu and Frank (2023) Letitia Parcalabescu and Anette Frank. 2023. On measuring faithfulness of natural language explanations. arXiv preprint arXiv:2311.07466. Paul et al. (2024) Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. 2024. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. arXiv preprint arXiv:2402.13950. Pearl (2022) Judea Pearl. 2022. Direct and indirect effects. In Probabilistic and causal inference: the works of Judea Pearl, pages 373–392. Rajani et al. (2019) Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361. Siegel et al. (2024) Noah Y Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. 2024. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models. arXiv preprint arXiv:2404.03189. Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295. Turpin et al. 
(2024) Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. 2024. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36. Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30. Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural nlp: The case of gender bias. arXiv preprint arXiv:2004.12265. Wang et al. (2020) Cunxiang Wang, Shuailong Liang, Yili Jin, Yilong Wang, Xiaodan Zhu, and Yue Zhang. 2020. Semeval-2020 task 4: Commonsense validation and explanation. arXiv preprint arXiv:2007.00236. Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837. Wei Jie et al. (2024) Yeo Wei Jie, Ranjan Satapathy, and Erik Cambria. 2024. Plausible extractive rationalization through semi-supervised entailment signal. In Findings of the Association for Computational Linguistics ACL 2024, pages 5182–5192. Association for Computational Linguistics. Wiegreffe et al. (2020) Sarah Wiegreffe, Ana Marasović, and Noah A Smith. 2020. Measuring association between labels and free-text rationales. arXiv preprint arXiv:2010.12762. Yeo et al. (2024) Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria. 2024. Self-training large language models through knowledge detection. Preprint, arXiv:2406.11275. Yeo et al. (2023) Wei Jie Yeo, Wihan van der Heever, Rui Mao, Erik Cambria, Ranjan Satapathy, and Gianmarco Mengaldo. 
2023. A comprehensive review on financial explainable ai. arXiv preprint arXiv:2309.11960. Zhang and Nanda (2023) Fred Zhang and Neel Nanda. 2023. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042.

Appendix A Appendix

A.1 CaF implementation

In this section, we detail the implementation of token replacement for CaF. In CoS-E and e-SNLI, we are provided with annotated indices indicating the important positions within the input, $S = 1, 2, \ldots, M$, where M varies across samples. We extract the corresponding tokens, referred to as the rationale, and manually edit them to transform the original answer, y, into a counterfactual answer, $y^C \neq y$. For CoS-E, a 5-way multiple-choice QA dataset, we select the most plausible alternative answer and modify the rationale accordingly. In e-SNLI, we exclude samples labeled as neutral, as they are not annotated with S, and modify the rationale to switch entailment inputs to contradiction and vice versa. We target task A of ComVE, where the model is given two sentences and must identify the statement that contradicts common sense. We adapt the task by presenting the model with only the illogical statement and asking it to confirm its illogicality. We then replace the statement with the logical one to serve as the counterfactual. We sample both the answer and explanation using a fixed temperature of 1.0 and nucleus sampling with a threshold of 0.95. One advantage of performing STR over GN is the reduced batch size, as Meng et al. (2022) average causal effects across 10 different noise perturbations at each layer and feature level, whereas STR requires only a single forward pass per patching. Furthermore, unlike C-SHAP, which relies on Monte Carlo sampling to handle its exponential computational demands, CaF avoids the need for such approximation techniques.
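The procedure can be illustrated end to end on a toy model. The sketch below is not the authors' implementation: `forward` is a hypothetical stand-in for an LLM (scalar hidden states with causal mixing), tokens are plain floats, and the effect is measured as the change in an output score when a clean activation is restored into the corrupted run, following the causal-tracing direction of Meng et al. (2022), which is an assumption here. It does reflect two properties stated in the text: the replacement must have the same length as the span S, and each patch costs a single forward pass.

```python
def symmetric_token_replacement(tokens, span, replacement):
    """Replace the tokens at `span` with an equal-length counterfactual edit."""
    if len(span) != len(replacement):
        raise ValueError("replacement must match the length of the span S")
    corrupted = list(tokens)
    for pos, tok in zip(span, replacement):
        corrupted[pos] = tok
    return corrupted

def forward(tokens, num_layers, patch=None):
    """Toy causal model: each position adds half the mean of itself and
    all earlier positions from the previous layer.

    patch: optional (layer, position, value) overwriting one hidden state.
    Returns (score read from the last position, cache[layer][position]).
    """
    h = [float(tok) for tok in tokens]
    cache = []
    for layer in range(num_layers):
        running, new_h = 0.0, []
        for t, value in enumerate(h):
            running += value
            new_h.append(value + 0.5 * running / (t + 1))
        h = new_h
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        cache.append(list(h))
    return h[-1], cache

def causal_matrix(clean, corrupted, num_layers):
    """C[t][layer]: score recovered by restoring one clean activation."""
    _, clean_cache = forward(clean, num_layers)
    corrupt_score, _ = forward(corrupted, num_layers)
    C = []
    for t in range(len(clean)):
        row = []
        for layer in range(num_layers):
            patched, _ = forward(
                corrupted, num_layers, patch=(layer, t, clean_cache[layer][t])
            )
            row.append(patched - corrupt_score)  # one forward pass per patch
        C.append(row)
    return C

clean = [1.0, 2.0, 0.5, 1.0]
corrupted = symmetric_token_replacement(clean, span=[1], replacement=[-2.0])
C = causal_matrix(clean, corrupted, num_layers=3)
for row in C:
    print(row)
```

Positions before the corrupted span have zero effect in this toy model because their hidden states are identical in both runs, which mirrors the limitation noted in Section 7 that activation patching reveals information only from the corrupted span onward.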
(a) Gemma2-2B (b) Gemma2-2B-chat (c) Gemma2-9B (d) Gemma2-9B-chat (e) Gemma2-27B (f) Gemma2-27B-chat

Figure 5: Causal scores at the token level, CaF(T), across the six models on CoS-E. The cross-lines refer to patching multiple layers with a window of size 10. Each bar represents the aggregated values of the target features; red: the corrupted tokens spanned by S, blue: the answer choice corresponding to the resultant prediction, and green: all answer choices.

A.2 Plausibility

In this section, we discuss the evaluation framework used to assess the plausibility of each NLE. We follow the framework introduced in Chen et al. (2023), which poses seven questions to the evaluator. We leave out the question regarding trust, as we find it unrelated to plausibility. Each of the remaining six questions is scored on a three-point Likert scale: −1 (disagree), 0 (neutral), and 1 (agree). For each sample, we prompted the evaluator thrice and averaged the scores to represent a single evaluation. Each plausibility score is thus bounded within [−6, 6]. While concerns may arise over using LLMs to judge plausibility, the strong correlation between human and model preferences reported in Chen et al. (2023) provides sufficient justification for relying on LLM-based evaluations. The prompt template is shown in Table 2.

Please rate the plausibility and quality of the candidate explanation generated by a language model to support its own answer to the corresponding question. Assign a score of either -1 for disagree, 0 for neutral, or 1 for agree by answering the following criteria questions below. You are also given a list of gold explanation samples for reference.
Question: <question>
Choices: <choices>
Answer: <answer>
Candidate Explanation: <explanation>
Gold Explanation Samples: <gold explanations>
Criteria Questions:
Q1: This is a good explanation
1. Disagree: The explanation is illogical or inconsistent with the question and/or does not adequately cover the answer choices
2. Neutral: The explanation is somewhat logical and consistent with the question but might miss some aspects of the answer choices.
3. Agree: The explanation is logical, consistent with the question, and adequately covers the answer choices.
Q2: I understand this explanation of how the AI model works.
1. Disagree: The explanation is unclear or contains overly complex terms or convoluted sentences.
2. Neutral: The explanation is somewhat understandable but might contain complex terms or convoluted sentences.
3. Agree: The explanation is clear, concise, and easy to understand.
Q3: This explanation of how the AI model works is satisfying.
1. Disagree: The explanation does not meet my expectations and leaves many questions unanswered.
2. Neutral: The explanation somewhat meets my expectations but leaves some questions unanswered.
3. Agree: The explanation meets my expectations and satisfies my query.
Q4: This explanation of how the AI model works has sufficient detail.
1. Disagree: The explanation lacks detail and does not adequately cover the AI model’s decision-making.
2. Neutral: The explanation provides some detail but lacks thoroughness in covering the AI model’s decision-making.
3. Agree: The explanation is thorough and covers all aspects of the AI model’s decision-making.
Q5: This explanation of how the AI model works seems complete.
1. Disagree: The explanation does not adequately cover the answer choices and leaves many aspects unexplained.
2. Neutral: The explanation covers most answer choices but leaves some aspects unexplained.
3. Agree: The explanation covers all answer choices and leaves no aspect unexplained.
Q6: This explanation of how the AI model works is accurate.
1. Disagree: The explanation does not accurately reflect the AI model’s decision-making.
2. Neutral: The explanation somewhat reflects the AI model’s decision-making but contains some inaccuracies.
3. Agree: The explanation accurately reflects the AI model’s decision-making.

Table 2: Prompt template for plausibility scoring.

Appendix B Examples of CaF

In this section, we illustrate the causal matrices, $C_a$ and $C_e$, on an example from each dataset (CoS-E, e-SNLI and ComVE) for all six models. Each matrix is truncated from the corrupted span onwards, and we include the counterfactual edits used by token replacement. We omit the values since CaF only evaluates the consistency between the distributions.

(a) Gemma2-2B (b) Gemma2-2B-chat

Figure 6: Illustration of causal attributions on CoS-E for the Gemma2-2B pre-trained [top] and instruct-tuned [bottom] LLMs. Left refers to the values pertaining to the answer generation, $C_a$, and right to the explanation, $C_e$. Both examples show high values on the corresponding prediction choice. The chat model predicted an incorrect answer, focusing on option D, while the pre-trained model correctly identified "loving couple" as the answer.

(a) Gemma2-9B (b) Gemma2-9B-chat

Figure 7: CoS-E on Gemma2-9B models. The explanation causal values appear noisier for the pre-trained model than for the chat model, yielding a lower faithfulness score. The chat model reasons between two possible answers and picks the correct one, in contrast with Figure 6, where the chat model focuses only on the wrong answer.

(a) Gemma2-27B (b) Gemma2-27B-chat

Figure 8: CoS-E on Gemma2-27B models. Both models attribute higher causality to the earlier layers of the corrupted span.

(a) Gemma2-2B (b) Gemma2-2B-chat

Figure 9: e-SNLI: Gemma2-2B LLMs. Both models focus on the corrupted token "blue", but differ in the subsequent tokens.
For both models, $C_a$ highlights the noun phrase (NP) "red curtains" in the hypothesis, attributing it to the contradiction label. However, $C_e$ in the pre-trained model is noisier than in the chat model, where it appears to be concentrated on the NP.

(a) Gemma2-9B (b) Gemma2-9B-chat

Figure 10: e-SNLI: Gemma2-9B LLMs. The 9B models display similar patterns to those in Figure 12, focusing primarily on the corrupted token and the NP. However, the pre-trained model is less faithful, as the $C_e$ matrix fails to highlight the NP, in contrast to $C_a$. Additionally, the models appear to rely more heavily on the middle layers when reasoning about the NP.

(a) Gemma2-27B (b) Gemma2-27B-chat

Figure 11: e-SNLI: Gemma2-27B LLMs. In contrast to the 2B and 9B models, the causal distributions of $C_a$ and $C_e$ are more consistent, specifically for the pre-trained variant.

(a) Gemma2-2B (b) Gemma2-2B-chat

Figure 12: ComVE: Gemma2-2B LLMs. The pre-trained model is considered unfaithful due to the noisy causal distribution in the explanation, $C_e$, a pattern observed across other datasets. In contrast, the chat model demonstrates a much more consistent distribution between the explanation and the answer. The pre-trained model fails to focus on the key noun, "Basketball", which accounts for the incorrect prediction. This highlights how AP can be used to explain why the model made an erroneous decision, as the pre-trained model concentrates on the wrong elements while neglecting the key token, unlike the chat model.

(a) Gemma2-9B (b) Gemma2-9B-chat

Figure 13: ComVE: Gemma2-9B LLMs. Both models are relatively faithful, with high values assigned to both the subject and object nouns.

(a) Gemma2-27B (b) Gemma2-27B-chat

Figure 14: ComVE: Gemma2-27B LLMs.