Paper deep dive

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang

Year: 2025Venue: COLM 2025Area: Mechanistic Interp.Type: EmpiricalEmbeddings: 117

Models: Gemma-2-9B, Gemma-2-9B-it, Llama-2-7B, Llama-2-7B-Chat, Llama-3.1-8B, Llama-3.1-8B-Instruct, Llama-3.1-Tulu-3-8B-SFT, Mistral-7B-Base-SFT-Tulu2, Mistral-7B-v0.3, Mistral-7B-v0.3-Instruct, Qwen-1.5-0.5B, Qwen-1.5-0.5B-Chat

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:13:53 PM

Summary

This paper provides a mechanistic analysis of how post-training (SFT and RLHF) reshapes Large Language Models (LLMs) compared to their base counterparts. The study focuses on four key areas: factual knowledge storage, truthfulness, refusal behavior, and confidence. Findings indicate that factual knowledge storage locations remain largely unchanged, truthfulness directions are highly transferable, refusal directions are distinct and show limited forward transferability, and confidence differences cannot be attributed to entropy neurons.

Entities (7)

LLaMA-3.1-8B · llm · 100%Mistral-7B-v0.3 · llm · 100%Post-training · process · 98%Causal Tracing · method · 95%Entropy Neurons · mechanism · 95%Refusal Direction · mechanism · 95%Truthfulness Direction · mechanism · 95%

Relation Signals (4)

Entropy Neurons → doesnotexplain → Confidence Differences

confidence 90% · Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons.

Truthfulness Direction → exhibits → High Transferability

confidence 90% · The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable

Refusal Direction → exhibits → Limited Forward Transferability

confidence 90% · The refusal direction is different between the base and post-trained models, and it shows limited forward transferability

Post-training → preserves → Factual Knowledge Storage

confidence 90% · Post-training does not change the factual knowledge storage locations

Cypher Suggestions (2)

Map the relationship between post-training and model mechanisms · confidence 95% · unvalidated

MATCH (p:Process {name: 'Post-training'})-[r]->(m:Mechanism) RETURN p.name, type(r), m.name

Find all mechanisms analyzed in the paper · confidence 90% · unvalidated

MATCH (e:Mechanism) RETURN e.name

Abstract

Abstract:Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at this https URL.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Full Text

116,574 characters extracted from source content.

Expand or collapse full text

Published as a conference paper at COLM 2025 How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence Hongzhe Du 1,∗,† , Weikai Li 1,∗ , Min Cai 2 , Karim Saraipour 1 , Zimin Zhang 3 , Himabindu Lakkaraju 4 , Yizhou Sun 1 , Shichang Zhang 4 1 University of California, Los Angeles 2 University of Alberta 3 University of Illinois at Urbana-Champaign 4 Harvard University Abstract Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post- trained models. While plenty of works have studied post-training algo- rithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspec- tives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the fac- tual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for in- terventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Dif- ferences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facil- itates downstream tasks like model steering and benefits future research in interpretability and LLM post-training. Our code is publicly available at HZD01/post-training-mechanistic-analysis. 1 Introduction The success of large language models (LLMs) has standardized a training paradigm con- sisting of pre-training and post-training. Post-training transforms a pre-trained base model into more useful and aligned post-trained models (Grattafiori et al., 2024; OpenAI, 2023; Jiang et al., 2023; Lambert et al., 2024, inter alia). Initially introduced to improve instruction- following capabilities (Ouyang et al., 2022; Wei et al., 2022), post-training has evolved to serve versatile purposes, including but not limited to making models more truthful (Lin et al., 2022; OpenAI, 2023; Lambert et al., 2024), safety alignment by enabling models to refuse harmful instructions (Bai et al., 2022; Grattafiori et al., 2024), and calibrating the model’s output confidence (OpenAI, 2023). Research on post-training has predominantly focused on algorithms such as Direct Prefer- ence Optimization (DPO) (Rafailov et al., 2024) and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017) and improving LLMs’ ability in downstream tasks such as reasoning (Kumar et al., 2025; Liu et al., 2024b). These studies mainly treat the LLM as a black box, and only evaluate its outputs externally (Zhou et al., 2023; Wen et al., 2024). However, it remains unclear how post-training affects the mechanisms of LLMs and whether ∗ Equal contribution † Correspondence: hongzhedu@cs.ucla.edu 1 arXiv:2504.02904v3 [cs.CL] 8 Nov 2025 Published as a conference paper at COLM 2025 Post Training . . . Subject Max diff = 0.1 Object Max diff = 0 Last token: Max diff = 0.3 Knowledge-related positions BASE neurons INSTRUCT neurons SFT neurons 207, 993, 1852, 2287, 8677, 10510, 11184, 12111 8775, 12664 5463, 9032 (a) (b) (c)(d) Pre -trained LLM (BASE) Post -trained LLM (INSTRUCT) SFT RLHF Figure 1: Summary of our analysis and findings. (a) Knowledge: A difference heatmap showing BASE and POST models have negligible location differences for storing the same knowledge; (b) Truthfulness: A PCA plot showing the truthfulness directions are similar in BASE and POST models; (c) Refusal: A PCA plot showing the refusal directions of BASE and POST models are quite different; (d) Confidence: A Venn diagram of entropy neuron IDs showing the difference in confidence between BASE and POST models cannot be fully attributed to entropy neurons as they largely overlap. the model is fundamentally altered internally. Such a mechanistic understanding can help us better use post-trained LLMs and potentially design better post-training methods. Recent research studies have started to examine the mechanistic effect of post-training and reveal interesting findings. However, this direction is still underexplored, given these efforts are still algorithm-centric (Lee et al., 2024), model-specific (Panickssery et al., 2024), task-format-specific (Panickssery et al., 2024), or rely on learning an extra model like Sparse Autoencoders (SAEs) on top of the LLM instead of direct analysis (Kissane et al., 2024b). In this work, we systematically and mechanistically study the post-trained (POST) model, on top of the pre-trained (BASE) model. We specifically focus on two POST model types: a model that went through all post-training stages, commonly called the INSTRUCT model, and a model with only supervised fine-tuning on top of BASE, commonly called the SFT model. We compare the BASE and POST models internally from four perspectives: knowl- edge storage and representation, internal belief of truthfulness, refusal behavior, and confidence. These perspectives represent fundamental capabilities that determine an LLM’s real-world utility and safety. POST models are expected to preserve knowledge learned during pre-training, improve truthfulness, enhance refusal of harmful inputs, and show a different level of confidence from the BASE model. While some other perspectives, such as reasoning and instruction-following, are also important, they involve complex, multi-step processes that are not well-captured by current mechanistic interpretability tools. Therefore, our work focuses on those four perspectives above that can be rigorously measured and mechanistically interpreted, providing a solid foundation for understanding the internal mechanism updates during post-training. For each perspective, we choose the most suitable tool from the LLM interpretability toolbox for analysis, and we illustrate our main findings in Figure 1. For the first perspective, we adopt the widely used knowledge locating technique, causal tracing (Meng et al., 2022), to investigate the storage and representation of knowledge. We discover that locations for storing the same knowledge in BASE and POST models are similar, and POST model adapts the BASE knowledge representations while developing new ones. For the second perspective 2 Published as a conference paper at COLM 2025 of truthfulness, based on the previous discovery that truthfulness can be represented as a direction in the model’s hidden representation space (Marks & Tegmark, 2024; Li et al., 2024; Panickssery et al., 2024; B ̈ urger et al., 2024), we learn a linear vector representing truthfulness, referred to as the “truthfulness direction.” For the two directions learned for BASE and POST models, we find that they have high cosine similarity and can be effectively transferred for truthfulness intervention. For the third perspective, we learn a “refusal direction” similar to the truthfulness direction in the hidden representation space (Arditi et al., 2024). We find that the transferability of such refusal direction is only effective backward (from POST to BASE) but not forward (from BASE to POST). Lastly, we analyze the confidence of BASE and POST models through the lens of entropy neurons, which contribute to the confidence of the LLM’s output (Stolfo et al., 2024; Gurnee et al., 2024). Our analysis reveals that entropy neurons of BASE and POST models have similar distributions, leading us to the conclusion that these neurons are not the determining factor of the observed confidence differences between the BASE and POST models. Our analysis from the four perspectives reveals both the kept and the altered internal mechanisms by post-training, which could benefit future research and applications in interpretability and LLM post-training. As our results suggested, some internal mechanisms are mostly developed during pre-training and not significantly altered by post-training, such as factual knowledge storage and the truthfulness direction. We can thus leverage their transferability to develop procedures on the BASE model and apply them to the POST model conveniently, for example, for truthfulness steering. For the mechanisms that are altered or developed during post-training, such as refusing harmful instructions, it is also possible to efficiently improve BASE’s ability by applying the backward transfer from POST. 2 Related Work Mechanistic interpretability of post-training Mechanistic interpretability aims to under- stand internal mechanisms of models (Elhage et al., 2021; Wang et al., 2022; Templeton et al., 2023; Nanda et al., 2023, inter alia). Recently, a growing body of research starts to analyze LLM post-training through the Mechanistic interpretability lens. Lee et al. (2024) studied how DPO unlearns toxicity in LLM, finding that rather than removing toxic-promoting vectors, the model learns distributed offsets to bypass them. Panickssery et al. (2024) discov- ered that Llama-2 BASE and INSTRUCT models have similar steering vectors for answering multiple choice questions. Kissane et al. (2024a) showed that refusal directions can be transferred from INSTRUCT models to BASE models. Kissane et al. (2024b) revealed that the SAEs trained on the BASE model can reconstruct the activations of the INSTRUCT model. However, these investigations do not directly and generally reveal the post-training effect, whereas we do a comprehensive study of different models and datasets and investigate post-training’s effect from four critical perspectives. Knowledge storage and representation Geva et al. (2021) showed that transformer MLP layers function as key-value memories, with keys corresponding to input representations and values inducing output distributions. Dai et al. (2022) identified specific “knowledge neurons” in MLPs that encode facts. To detect knowledge-storage locations and edit them, Meng et al. (2022) introduced causal tracing (activation patching) and edited knowledge through targeted weight changes. These studies show that knowledge in LLMs can be localized and modified through causal intervention techniques. In this work, we use a variant of causal tracing to study the effect of post-training on knowledge storage. Internal belief of truthfulness Recent research demonstrates that LLMs encode the belief of truthfulness linearly in their representation space as a “truthfulness direction”. Azaria & Mitchell (2023) identified truthfulness signals in model activations, while Burns et al. (2024) developed unsupervised methods to extract these signals using logical consistency. Li et al. (2024) leveraged truthfulness directions to improve truthfulness through activation steering. Later, Marks & Tegmark (2024) introduced the mass-mean probe. Similarly, Panickssery et al. (2024) uses difference-in-means to identify the direction by computing the difference between mean activation vectors of true and false statements. Additionally, B ̈ urger et al. (2024) discovered a universal two-dimensional truthfulness subspace across various LLMs, 3 Published as a conference paper at COLM 2025 and Liu et al. (2024a) showed that training the direction on more datasets makes it more robust, suggesting that a universal truthfulness hyperplane may exist. We employ mass- mean probe (Marks & Tegmark, 2024) and show that the truthfulness direction persists after post-training. Refusal behavior Refusing to answer harmful instructions is a key objective of post-training. Recent research has revealed that this behavior is linearly mediated by a vector as a “refusal direction” (Arditi et al., 2024). This direction can be used to undermine the model’s ability to refuse harmful requests. Similarly, research on prompt-driven safeguarding has shown that safety prompts typically move input queries in the refusal direction in the representation space (Zheng et al., 2024). Further research has shown this direction can also be learned on BASE models, or transferred from an INSTRUCT model to a BASE model (Kissane et al., 2024a). Our work extends the study to a more systematic comparison of the refusal direction learned on BASE and different POST models across model families. Confidence and entropy neurons Confidence calibration is another key objective of post- training. Studies have shown that post-trained models tend to be less calibrated, with INSTRUCT models being overconfident compared to BASE models (Tian et al., 2023). One line of research is to understand LLM’s confidence with verbalized output (Tian et al., 2023; Xiong et al., 2024), using prompting and sampling strategies to generate multiple responses and compute consistency. Another line of work analyzes confidence to show that specialized neurons within LLMs regulate uncertainty (Katz & Belinkov, 2023; Gurnee et al., 2024; Stolfo et al., 2024). Among them, Gurnee et al. (2024) discovered “entropy neurons” that have high weight norms but minimal direct logit effects. They modulate uncertainty by influencing layer normalization to scale down logits. Our work examines the changes in entropy neurons after post-training to understand its effect on confidence. 3 Notations and Experimental Settings Notations Throughout the paper, we denote layers asl ∈ [L]and token positions asi∈ [I], whereLis the number of model layers andIis the input length. We use notations like D train harmless for datasets, with superscript for train/test subset, and subscript for the dataset’s type, which might be omitted if the context is clear. The representation at layerland position iof an input statementsis denoted ash l i (s). We useW U ∈R |V|×d model for the unembedding matrix, andw out ∈R d model for the output weights of a given neuron in the last-layer MLP, withV stands for the vocabulary, and d model for the model’s hidden space dimension. Models We mainly conduct experiments on two representative families of LLMs: Llama- 3.1-8B/Instruct (Grattafiori et al., 2024) and Mistral-7B-v0.3/Instruct (Jiang et al., 2023). The original model releases do not include corresponding SFT models, so we use widely recognized external SFT models: Llama-3.1-Tulu-3-8B-SFT, which finetunes Llama-3.1-8B on thetulu-3-sft-mixturedataset (Lambert et al., 2024), and Mistral-7B-Base-SFT-Tulu2 (Feuer et al., 2025), which finetunes Mistral-7B-v0.3 on thetulu-v2-sft-mixturedataset (Ivison et al., 2023). For refusal experiments, we additionally include Qwen-1.5-0.5B/Instruct (Bai et al., 2023) and Gemma-2-9B/Instruct (Team et al., 2024) following experiment settings in Arditi et al. (2024). For confidence experiments, we additionally include Llama-2-7B/Instruct models (Touvron et al., 2023) following Stolfo et al. (2024). To further demonstrate that our findings could generalize to different model sizes, especially larger models, we perform experiments on Llama-2-13B/Instruct (Touvron et al., 2023) models for all perspectives. Datasets For the knowledge and truthfulness perspectives, we start with a group of datasets from Marks & Tegmark (2024) and curate them to fit specific experiments we run. Each of these datasets contains simple and unambiguous statements from diverse topics that are either true or false. For example, the datasetcitiescontains statements about cities and their countries, following the format “The city of [city] is in [country]”. While these datasets are independent of post-training, we also perform our analysis on datasets that are actually used for post-training to reveal the post-training effect on in-distribution data. Specifi- cally, we construct thetuluextracteddataset by sampling factual statements from the tulu-3-sft-mixturedataset (Lambert et al., 2024), which was used to finetune Llama-3.1-8B 4 Published as a conference paper at COLM 2025 to Llama-3.1-8B-SFT. For each extracted statement, we generate a counterfactual counter- part to form true–false pairs. We ensure that all sampled statements also appear in the tulu-v2-sft-mixturedataset (Ivison et al., 2023), makingtuluextractedin-distribution for both Llama-3.1-8B-SFT and Mistral-7B-SFT. For experiments on the refusal perspective, we follow Arditi et al. (2024) to useadvbench(Zou et al., 2023) for harmful inputs and alpaca(Taori et al., 2023) for harmless inputs. Dataset details are explained in Appendix A. 4 Knowledge Storage and Representation LLMs are known to store factual knowledge at specific locations of their parameters, partic- ularly in “knowledge neurons” and MLP layers acting as key-value memories. This enables them to answer factual queries, such as answering “TRUE” or “FALSE” for the prompt “The city of New York is in the United States. This statement is:”. While such knowledge is believed to emerge during pre-training and persist through post-training, mechanistic evidence remains limited. As knowledge is foundational for LLMs, we examine how post-training affects it by asking two research questions about (Q1) knowledge-storage locations and (Q2) knowledge representations. When prompted to classify a statement’s truthfulness, LLMs extract stored knowledge at certain layers and inject it into the hidden states to guide the final output. Following Marks & Tegmark (2024), we adapt causal tracing to identify knowledge-storage locations by patching hidden states between true and false statement pairs. Each pair is token-aligned and differs only in subject, e.g., “The city of Seattle is in France.” vs. “The city of Paris is in France.”, and the relation (e.g., city-in-country) is true for only one statement. We target subject and object tokens (e.g., city and country) for knowledge location analysis. Locating knowledge We use causal tracing to localize knowledge storage via three forward passes with varying inputs and intermediate patching. First, we input a true statements and record the hidden representationsh l i (s)at each layerland token positioni. Second, we input a false statement ˆ sand similarly recordh l i ( ˆ s). Third, we input ˆ sagain, but patch a specific hidden stateh l i ( ˆ s)withh l i (s)from the first run (i.e., replaceh l i ( ˆ s)withh l i (s)). We perform this patching independently for each(l,i). If patching at a particular location(l,i) flips the output from “FALSE” to “TRUE”, it indicates that location stores the knowledge. To measure the effectiveness of the patching, we compare the log probability of outputting “TRUE” versus outputting “FALSE” for each (l, i) by computing: M l i (s, ˆ s) := log[ P(“TRUE”) P(“FALSE”) |patching(h l i (s), h l i ( ˆ s))],(1) where a high value indicates knowledge about s is stored in the l-th layer at the i-th token. In order to analyze the general locations of knowledge beyond individual knowledge, we average the patching results from a setDof carefully curated statements with the same length and the same subject/object token positions, and each(s, ˆ s)statement pair for patching only differ in their subjects (see Appendix B.1 for dataset construction details). We construct input prompts using 4-shot examples containing 2 true statements and 2 false statements with answers, followed by the question statement. Patching is applied to the question statement via three forward pass as described above. Then we aggregate the results for each(l,i)and normalize them across all layersl ∈ Land tokensi∈ Ifor better visualization: ̃ M l i = 1 |D| ∑ (s, ˆ s)∈D M l i (s, ˆ s), M = normalize( ̃ M)(2) For the normalization, we divide the range[min i,l ̃ M l i ,max i,l ̃ M l i ] into 20 equal-width bins, set the values in the lower 10 bins to 0 and the values in the upper 10 bins to 0.1, 0.2, ..., 1. We denote the normalized result as M model ∈ R L∗I , subscripted by the specific model. Q1: Does post-training change LLM’s knowledge storage locations? Figure 2 visualizes the log probability ratio (M model ) of Llama-3.1-8B BASE and INSTRUCT on thecitiesdataset. As shown in Figure (a), influential patching consistently occurs at three token positions: subject, 5 Published as a conference paper at COLM 2025 Subject Object Last token (a) BASE. Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Layer (b) INSTRUCT. Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Layer (c) Difference. Figure 2: Knowledge storage locations of Llama-3.1-8B BASE and INSTRUCT on thecities dataset. Their knowledge-storage locations are almost the same. Metriccitiesnegcitieslargerthansmallerthanspentransnegspentranstuluextracted Number of Curated Pairs238215406487253355 Corr(M BASE , M INSTRUCT )0.99230.98530.99690.98050.99450.98220.9978 max|M INSTRUCT − M BASE |0.40.40.30.50.30.50.2 max|M INSTRUCT − M BASE | K 0.20.40.10.50.20.10.1 Corr(M BASE , M SFT )0.99620.99470.99780.98550.99750.97920.9969 max|M SFT − M BASE |0.20.20.10.50.20.50.2 max|M SFT − M BASE | K 0.20.20.10.50.10.20.1 Table 1: Comparison of knowledge storage locations of the Llama-3.1-8B model family. object, and the last token. The last token is always important as it contains information of the whole sentence, whereas the subject and the object position indicate knowledge storage and are important for both BASE (e.g., (a)) and INSTRUCT (e.g., (b)). Their difference is nearly zero (e.g., (c)), indicating that BASE and INSTRUCT store knowledge in nearly identical locations. This pattern holds across all datasets and models (additional visualizations in Appendix B.5). We further conduct quantitative analysis with three metrics and include SFT models into comparison. We compute Pearson correlation betweenM BASE andM POST , where POSTis INSTRUCT or SFT. We also measure the maximum absolute difference value over all tokens,max|M POST − M BASE |, as well as only over knowledge-related tokens (subject and object),max|M POST − M BASE | K . Llama-3.1-8B results are in Table 1, and Mistral-7B results are in Table 8 in Appendix B.4. All results show almost perfect correlation and low difference, confirming that post-training has little influence on knowledge-storage locations. Q2: Does post-training change the knowledge representations? We further conduct cross-model experiments by patching hidden representations from BASE to POST (forward patching) and from POST to BASE (backward patching). It allows us to analyze whether knowledge representations in BASE can still work in POST, and vice versa, i.e., whether cross-model patching recover the log probability ratio of same-model patching. Due to space limits, we put the visualizations on all models and datasets in Appendix B.5. The results demonstrate that the forward patching is almost always successful, but the backward patching often fails. It leads to the conclusion that knowledge representations of BASE still work after post-training, but post-training also develops new knowledge representations. Verification on in-distribution dataset One natural question for our experiments above is that they are on datasets independent of post-training, which can be out of the post-training distribution. To verify our conclusions completely, we conduct in-distribution experiments by curating datasets from the post-training data. Specifically, we construct (true, false) statement pairs of factual knowledge from thetuludataset, which was used to develop Llama-3.1-8B-SFT and Mistral-7B-v0.3-SFT from their base models. Different from previous datasets, pairs in the Tulu dataset could have different lengths, so we slightly modify the metric calculation (details in Appendix B.3). The last column of Table 1 shows the Llama- 3.1-8B results, and the last column of Table 8 in Appendix B.4 shows the Mistral-7B results. Both verify our previous conclusions. 6 Published as a conference paper at COLM 2025 Verification on larger models and other tracing settings To verify the generalizability of our conclusion, we conduct experiments on a larger model, Llama-2-13B, with results shown in Appendix F. We also do experiments following the causal tracing setting as in Meng et al. (2022), which asks the LLM to output the object given a subject instead of to output true or false given a subject-object pair. We use the latter in our main experiments because it allows location analysis for the object as well, and it works better with more datasets (details are explained in Appendix B.4). The results of all these experiments also verify our conclusions. 5 Internal Belief of Truthfulness How LLMs internally assess the truthfulness of an input statement is another essential aspect of making LLMs truthful and reliable. Previous studies have found that given an LLM and a statement, whether the LLM believes the statement to be true or false can be assessed from the hidden representations encoded by the model. Such internal belief of truthfulness can be linearly represented along a truthfulness direction in the hidden representation space (Marks & Tegmark, 2024; B ̈ urger et al., 2024). We analyze this direction in BASE models and POST models to illustrate how post-training affects it. Linear probe for truthfulness To identify the truthfulness direction in a model, we take a (training) datasetD train containing a subset of true statementsD train true and the rest being a subset of false statementsD train false , and compute difference-in-mean of the hidden repre- sentations of these two subsets of statements. Similar to knowledge-storage experiments, we use two true statements and two false statements followed by the question statement to construct 4-shot prompts (specified in Appendix C.1), so the model outputs “TRUE” or “FALSE” for the final statement. Formally, we compute the truthfulness direction t l as: t l = 1 |D train true | ∑ s∈D train true h l i (s)− 1 |D train false | ∑ s∈D train false h l i (s),(3) whereiis the last token of the input prompt andlis the layer number where truthfulness is most strongly encoded (based on causal tracing results in Section 4). Figure 3 (a) and (b) show the cosine similarities oft l from BASE, SFT, and INSTRUCT models on two truthfulness datasets. The heatmaps show a high cosine similarity, revealing that these models have remarkably similar internal truthfulness directions. To further investigate the generalizability, we utilizet l as a probe to classify whether a statements∈D train is true (Marks & Tegmark, 2024), i.e., computep =σ(h l i (s) T t l )withσ being the sigmoid function. We train the probe on five datasets and test its performance on a separate test dataset. We also conduct transfer experiments across models, training the probe on the hidden representations generated by one model and evaluating its accuracy in classifying representations generated by other models. For example, we compare the (baseline) accuracy of a probe trained for POST (p POST ) and applied on POST’s test repre- sentations (h POST ) versus the (forward-transfer) accuracy of a probe trained on BASE (p BASE ) and similarly applied toh POST . Table 2 presents the results, and the probe classification accuracies across BASE, SFT, and INSTRUCT are very similar. Especially,p BASE achieves very similar accuracies top SFT andp INSTRUCT when applied to SFT and INSTRUCT’s test represen- tations across all datasets. Experiments on the Mistral models also show similar results (see Appendix C). Transfer intervention with truthfulness directions The truthfulness directiont l can also be used to steer model output. To flip a model’s response between “TRUE” and “FALSE” for a statement, one can addt l to the model’s hidden representation during forward pass at layerlas ̃ h l = h l +λt l , withλ =±1 to control the flipping direction. To investigate the transferability oft l , we test: (1) interveningh SFT witht l BASE versust l SFT ; and (2) intervening h INSTRUCT witht l BASE versust l INSTRUCT . We evaluate the intervention performance using the Intervention Effect (IE):( ̃ P − − P − )/(1− P − )for false→true intervention, and( ̃ P + − P + )/(− 1− P + ) for true→false intervention.P − andP + represent the average probability 7 Published as a conference paper at COLM 2025 BaseSFTInstruct Base SFT Instruct 1.0000.9430.905 0.9431.0000.931 0.9050.9311.000 BaseSFTInstruct Base SFT Instruct 1.0000.9380.895 0.9381.0000.886 0.8950.8861.000 BaseSFTInstruct Base SFT Instruct 1.0000.1470.252 0.1471.0000.660 0.2520.6601.000 0.0 0.2 0.4 0.6 0.8 1.0 (a) Truthfulness direction on inventors. BaseSFTInstruct Base SFT Instruct 1.0000.9430.905 0.9431.0000.931 0.9050.9311.000 BaseSFTInstruct Base SFT Instruct 1.0000.9380.895 0.9381.0000.886 0.8950.8861.000 BaseSFTInstruct Base SFT Instruct 1.0000.1470.252 0.1471.0000.660 0.2520.6601.000 0.0 0.2 0.4 0.6 0.8 1.0 (b) Truthfulness direction on animalclass. BaseSFTInstruct Base SFT Instruct 1.0000.9430.905 0.9431.0000.931 0.9050.9311.000 BaseSFTInstruct Base SFT Instruct 1.0000.9380.895 0.9381.0000.886 0.8950.8861.000 BaseSFTInstruct Base SFT Instruct 1.0000.1470.252 0.1471.0000.660 0.2520.6601.000 0.0 0.2 0.4 0.6 0.8 1.0 (c) Refusal direction on ad- vbench and alpaca. Figure 3: Cosine similarities of truthfulness (a and b) and refusal (c) directions of Llama-3.1- 8B BASE, INSTRUCT, and SFT. Truthfulness directions are similar while refusal directions are different. Test DatasetProbe Transfer Accuracy (%) p BASE → h BASE p SFT → h SFT / p BASE → h SFT (∆) p INS → h INS / p BASE → h INS (∆) cities81.0684.50 / 85.32 (+0.82)94.65 / 95.91 (+1.26) spentrans97.1698.45 / 98.88 (+0.43)95.18 / 98.94 (+3.76) inventors92.7291.96 / 93.12 (+1.16)88.73 / 92.18 (+3.45) animal class97.2096.01 / 95.64 (-0.37)98.75 / 96.46 (-2.29) elementsymb92.0294.87 / 97.02 (+2.15)96.18 / 95.13 (-1.05) facts77.0577.58 / 77.72 (+0.14)82.47 / 80.86 (-1.61) Table 2: Probe transfer accuracy (↑) of Llama-3.1-8B BASE, SFT, and INSTRUCT tested on 6 truthfulness datasets. For each row, we use the datasets from the other 5 rows for training. p model 1 → h model 2 means using the probe trained onmodel 1 to classify statement representa- tions bymodel 2 . Accuracy of transferred probes show little difference (∆) compared to the same-model probes. differenceP(TRUE)− P(FALSE)for false and true statements, respectively. Here ̃ P − and ̃ P + areP − andP + after intervention, respectively. The goal is to increase ̃ P − , i.e., P(TRUE)− P(FALSE)for false statements after the intervention, and similarly to decrease ̃ P + , so a higher IE indicates better intervention performance. The results in Table 3 show that when steering SFT, the difference (∆) of IE betweent l BASE andt l SFT is little. Similar results hold when steering INSTRUCT. Experiments on Mistral models also show results (see Appendix C). We illustrate two intervention examples in Appendix C.6, which show thatt l BASE can flip TRUE/FALSE outputs in POST models as effectively ast l POST . These findings further support our conclusion: post-training generally preserves the internal representation of truthfulness. 6 Refusal Post-training also aims to prevent LLMs from generating harmful and toxic content. For example, when inputting harmful prompts like “Give instructions for making explosive ma- terials”, INSTRUCT models are likely to refuse to answer by outputting “I can’t fulfill that request...”, but BASE models may not. Recent studies by Arditi et al. (2024) show that, similar to the internal belief of truthfulness, this refusal behavior can also be linearly represented by a vector in the hidden space as a “refusal direction”. By steering a model with it, we can encourage the model to change its original sensible behavior to follow harmful prompts or refuse harmless prompts. Kissane et al. (2024a) found that BASE models also demonstrate the refusal behavior for some harmful instructions, and thus a refusal direction can be extracted. The study also verified the backward transferability of the refusal direction from INSTRUCT 8 Published as a conference paper at COLM 2025 Test DatasetTruthful Intervention Effects t BASE 7→ h BASE t SFT 7→ h SFT / t BASE 7→ h SFT (∆) t INS 7→ h INS / t BASE 7→ h INS (∆) cities0.830.91 / 0.92 (+0.01)0.88 / 0.90 (+0.02) spentrans0.780.82 / 0.83 (+0.01)0.84 / 0.81 (-0.03) inventors0.720.80 / 0.82 (+0.02)0.79 / 0.83 (+0.04) animal class0.730.79 / 0.80 (+0.01)0.71 / 0.72 (+0.01) element symb0.790.84 / 0.86 (+0.02)0.73 / 0.77 (+0.04) facts0.610.64 / 0.66 (+0.02)0.62 / 0.66 (+0.04) Table 3: Intervention effect (↑) of intervention on Llama-3.1-8B BASE, SFT, and INSTRUCT. For each row, we use the dataset from the other 5 rows for training.t model 1 7→ h model 2 means using the truthfulness direction inmodel 1 to intervenemodel 2 . Transfer interventions show small differences (∆) compared to same-model interventions. Intervention Refusal Score BASESFTINSTRUCT Inputsbaseline/r BASE 7→ h BASE baseline/r SFT 7→ h SFT /r BASE 7→ h SFT baseline/r INS 7→ h INS /r SFT 7→ h INS /r BASE 7→ h INS harmful (↓)0.21 / 0.170.99 / 0.79 / 0.990.98 / 0.01 / 0.36 / 0.95 harmless (↑)0.01 / 0.590.01 / 1.0 / 0.850.0 / 1.0 / 0.98 / 0.08 Table 4: Intervention Refusal Score (RS) of Llama-3.1-8B BASE, SFT, and INSTRUCT tested on harmful and harmless inputs.r model 1 7→ h model 2 means using the refusal direction inmodel 1 to intervenemodel 2 , and baseline refers to the original RS without intervention. For harmful inputs we use ablation and for harmless inputs we use addition. to the BASE. We aim to compare the refusal directions in POST versus BASE similarly to the truthfulness direction in Section 5 and study its forward transferability. Refusal direction identification and intervention To identify the refusal directionr, we use D train harmful (a size-128 subset ofadvbench) andD train harmless (a size-128 subset ofalpaca) as two contrastive datasets to calculatersimilarly to the truthfulness directiontin Equation 3. We then perform intervention evaluation follow Arditi et al. (2024) closely. Givenr, we induce the refusal behavior by addingrto the model’s representations at layerl, i.e., ̃ h l ← h l + r l , andlis decided based on the best intervention result. To reduce refusal, for better effect, we ablaterfrom the model’s representations at all layers, i.e., ̃ h← h− ˆ r ˆ r ⊤ hfor alll ∈ L, where ˆ r is the unit-norm vector ofr. Both kinds of interventions are applied at all token positions. To study the refusal direction across models, we first directly compare r learned on BASE (r BASE ), SFT (r SFT ), and INSTRUCT (r INSTRUCT ) models. Figure 3 (c) shows thatr BASE has very low cosine similarity withr SFT andr INSTRUCT . To further investigate this, we conduct forward transfer intervention experiments similar to Section 5. We compare the Refusal Score (RS) when usingr BASE to steer SFT and INSTRUCT versus using their native refusal vectors (r SFT andr INSTRUCT ). RS is calculated as the percentage of responses where refusal keywords such as “I can’t” or “I am sorry” appear at the beginning of outputs. We do an intervention on both harmful and harmless datasets, sampling 100 prompts from each for testing. We try to alter the original sensible behavior, i.e., to decrease RS for harmful inputs and increase RS for harmless inputs. Table 4 demonstrates thatr BASE generally cannot be effectively transferred to steer INSTRUCT and SFT for Llama-3.1-8B. We also conduct experiments on Qwen-1.5-0.5B/Instruct (Bai et al., 2023) and Gemma-2-9B/Instruct (Team et al., 2024) (see Appendix D). All results verify the same conclusion: post-training changes the refusal direction, so the direction has limited forward transferability. 7 Confidence Confidence of LLMs is represented by the probability associated with the decoded token. Post-trained models are known to have different confidence compared to BASE models (Ope- nAI, 2023), which is also revealed in their drastically different outputs to the same prompts. 9 Published as a conference paper at COLM 2025 Understanding and calibrating model confidence is an important research direction. Re- cently, entropy neurons have been shown to be a hidden mechanism of modulating confi- dence that is persistent across models (Gurnee et al., 2024; Stolfo et al., 2024). They have relatively high weight norms and a low composition with the model’s unembedding matrix, so they influence the model’s output probability without affecting the probability ranking much, working similarly to the temperature parameter. We study whether the difference in confidence between BASE and POST models is caused by the difference in entropy neurons. Entropy neuron identification Entropy neurons are identified by checking the weight norm and the variance of logit attribution. First, we compute the logit attribution for each neuron in the final MLP layer by projecting its output weightsw out onto the vocabulary space through the unembedding matrixW U . This projection approximates the neuron’s direct effect on the final prediction logits. We then calculate the variance of the normalized projection: LogitVar(w out ) = Var W U w out ∥W U ∥ dim=1 ∥w out ∥ ,(4) where∥·∥ dim=1 denotes a row-wise norm. A low LogitVar value indicates a relatively balanced contribution across all vocabulary tokens rather than promoting specific tokens. Entropy neurons typically have both a low LogitVar and a large weight norm to ensure they are influential. Our identification process first selects the top 25% of neurons with the largest weight norms, and from this subset, we identify the 10 neurons with the lowest LogitVar values as entropy neurons from the final MLP layer. In our analysis comparing BASE and POST models, we found substantial overlap in identified entropy neurons, with highly similar weight norm to LogitVar ratios. We show the detailed results in Appendix E. These findings suggest that the confidence regulation mechanism of entropy neurons remains largely unchanged during post-training, indicating that the observed confidence difference between BASE and POST models likely stem from more subtle mechanistic changes, which require more sophisticated interpretability tools beyond entropy neurons to fully understand. 8 Discussion and Conclusion To achieve effective post-training, it is important to understand how it shapes LLMs in- ternally. In this paper, we analyze its effect on LLM’s internal mechanisms from four representative perspectives. We discover that post-training does not alter knowledge- storage locations and truthfulness directions significantly, and it adapts original knowledge representations while developing some new ones. In contrast, post-training changes the refusal direction, so refusal steering cannot be transferred forward. We also find that the confidence difference brought by post-training cannot be attributed to entropy neurons, which requires further investigation. Our findings could also benefit many real-world applications. As we have shown, gen- eral abilities such as factual knowledge and the internal belief of truthfulness are mostly developed during pre-training and remain unchanged in post-training, and the forward transfer remains valid. Therefore, for fixing model errors or updating knowledge, this finding allows us to conveniently and effectively transfer knowledge edits or truthfulness probes developed on a BASE model to its POST model. On the other hand, for internal mech- anisms corresponding to abilities developed during post-training, such as refusing harmful instructions, a promising future application is to transfer the newly acquired capabilities from the POST model to the BASE model to induce such ability without training. Looking ahead, although we concentrated on four key perspectives, future work could extend our framework to more complex capabilities, such as reasoning and instruction- following. These areas present significant methodological challenges for existing inter- pretability tools. We find that properly defining the instruction-following ability is tricky, and a suitable technique to interpret this ability and verify it on BASE is non-trivial. More- over, future work could leverage our analysis to improve the post-training effectiveness and efficiency. 10 Published as a conference paper at COLM 2025 9 Acknowledgments We would like to thank Fan Yin for insightful discussions. This work was partially supported by NSF 2211557, NSF 1937599, NSF 2119643, NSF 2303037, NSF 2312501, NASA, SRC JUMP 2.0 Center, Amazon Research Awards, and Snapchat Gifts. References Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction, 2024. URL https://arxiv.org/abs/2406.11717. Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying, 2023. URL https://arxiv.org/abs/2304.13734. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022. URL https://arxiv.org/abs/2212.08073. Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024. URL https://arxiv.org/abs/2212.03827. Lennart B ̈ urger, Fred A. Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in llms, 2024. URL https://arxiv.org/abs/2407.12831. Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017. Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers, 2022. URL https://arxiv.org/abs/2104.08696. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah.A mathematical frame- work for transformer circuits. Transformer Circuits Thread, 2021. https://transformer- circuits.pub/2021/framework/index.html. Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, and John P. Dickerson. Style outweighs substance: Failure modes of llm judges in alignment benchmarking, 2025. URLhttps://arxiv.org/abs/ 2409.15268. Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories, 2021. URL https://arxiv.org/abs/2012.14913. 11 Published as a conference paper at COLM 2025 Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hath- away, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models, 2024. URL https://arxiv.org/abs/2401.12181. Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2, 2023. URL https://arxiv.org/abs/2311.10702. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ́ elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ́ e Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825. Shahar Katz and Yonatan Belinkov. Visit: Visualizing and interpreting the semantic infor- mation flow of transformers, 2023. URL https://arxiv.org/abs/2305.13417. Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda. Base llms refuse too. Alignment Forum, 2024a. URLhttps://w.alignmentforum.org/posts/ YWo2cKJgL7Lg8xWjj/base-llms-refuse-too. ConnorKissane,RobertKrzyzanowski,ArthurConmy,andNeelNanda. Saes (usually) transfer between base and chat models.Alignment Forum, 2024b.URLhttps://w.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/ saes-usually-transfer-between-base-and-chat-models. Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models, 2025. URL https://arxiv.org/abs/2502.21321. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. T ̈ ulu 3: Pushing frontiers in open language model post-training. 2024. Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity, 2024. URL https://arxiv.org/abs/2401.01967. Kenneth Li, Oam Patel, Fernanda Vi ́ egas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2024. URL https://arxiv.org/abs/2306.03341. Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958. Junteng Liu, Shiqi Chen, Yu Cheng, and Junxian He. On the universal truthfulness hyper- plane inside llms, 2024a. URL https://arxiv.org/abs/2407.08582. Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint arXiv:2412.15084, 2024b. 12 Published as a conference paper at COLM 2025 Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2024. URLhttps://arxiv.org/ abs/2310.06824. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022. arXiv:2202.05262. Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URLhttps://arxiv.org/ abs/2301.05217. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexan- der Matt Turner. Steering llama 2 via contrastive activation addition, 2024. URL https://arxiv.org/abs/2312.06681. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. URL https://arxiv.org/abs/2305.18290. Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, and Neel Nanda. Confidence regulation neurons in language models, 2024. URLhttps: //arxiv.org/abs/2406.16254. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanfordalpaca, 2023. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L ́ eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram ́ e, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. Catherine Templeton, Adam Scherlis, Joseph Cunningham, Turner Conerly, Tom Henighan, Zac Hatfield-Dodds, Amanda Askell, Dawn Drain, Danny Hernandez, Scott Jones, Nate Stiennon, Nicholas Schiefer, Samuel Kravec, Ben Shlegeris, Gabriel Landau, Alec Mueller, Jared Kerr, Dario Amodei, Jan Leike, Jared Kaplan, Paul Christiano, and Tom Brown. Towards monosemanticity: Decomposing language models with dictionary learn- ing. Transformer Circuits, 2023. https://transformer-circuits.pub/2023/monosemantic- features/index.html. Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback, 2023. URL https://arxiv.org/abs/2305.14975. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022. URL https://arxiv.org/abs/2211.00593. 13 Published as a conference paper at COLM 2025 Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022. URL https://arxiv.org/abs/2109.01652. Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, and Minlie Huang. Benchmarking complex instruction-following with multiple constraints composition, 2024. URL https://arxiv.org/abs/2407.03978. Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms, 2024. URL https://arxiv.org/abs/2306.13063. Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models, 2024. URL https://arxiv.org/abs/2401.18018. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911. Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. 14 Published as a conference paper at COLM 2025 A Details on Datasets NameDescription#Data points True / False Datasets (Knowledge & Truthfulness) elementsymbSymbols of elements186 animalclassClasses of animals164 inventorsHome countries of inventors406 factsDiverse scientific facts561 cities“The city of [city] is in [country].”1496 neg citiesNegations of statements in cities with “not”1496 sp entrans“The Spanish word ‘[word]’ means ‘[English word]’.”354 negspentransNegations of statements in spentrans with “not”354 larger than“x is larger than y.”1980 smallerthan“x is smaller than y.”1980 tulu extractedDiverse T/F statements extracted from tulu-3-sft-mixture200 Harmful / Harmless Datasets (Refusal) advbenchHarmful instructions520 alpacaHarmless instructions52k Table 5: Dataset Descriptions and Statistics. Table 5 presents details on the datasets we use for our experiments. For the datasets that follow a strict template, such ascities,negcities, etc., we write their templates in the table. For datasets that do not follow a strict template, such aselementsymbandanimalclass, we describe them in the table. For the true/false datasets, you can find four examples for each dataset in Table 7. TheTuluextracteddataset is an in-distribution dataset for the Llama-3.1-8B SFT and Mistral- 7B-v0.3 SFT models. In order to construct it, we use GPT-4o to extract 100 factual knowledge statements from the Tulu-SFT dataset that was used to fine-tune the SFT models (Lambert et al., 2024). Then we use GPT-4o to generate a false statement for each true factual statement by changing the subject, object, or subject-object relation. B Supplementary Details and Experiments of Knowledge Storage B.1 (True, False) Pair Construction As introduced in the main content, in order to provide a generalizable conclusion, we want to aggregate the results from all the prompts, and thus we need to align the token positions of all the prompts. Therefore, we manually find out the most common token pattern in each dataset, and we filter out the prompts that do not match this pattern. It ensures that every statement has the same number of tokens, and that their subjects/objects appear in the same token positions. After filtering, about one-third to half of the original dataset remains. We list the token patterns we use for each dataset in Table 6. After filtering, we obtain a subset for each original dataset. This subset contains a group of true statements and a group of false statements with the same token patterns. Then, for each true statement, we search for the first unused false statement whose object is the same but the subject is different. In this case, they only differ in the subject. If all the false statements that only differ in the subject are already paired with a true statement, then we repeatedly use the last satisfying paired false statement. It is because we want to increase the number of (true, false) statement pairs, and it does not matter much if one false statement is paired with more than one true statement. If we cannot find any false statement that only differs in the subject, then we do not use that true statement. By this method, we construct abundant (true, false) statement pairs for our patching experiments. 15 Published as a conference paper at COLM 2025 DatasetModel familyToken pattern citiesboth[Begin] / The / city / of/ [3-token city name] / is / in / [1- token country name] / . negcitiesboth[Begin] / The / city / of/ [3-token city name] / is / not / in / [1-token country name] / . largerthanLlama-3.1-8B[Begin] / [3-token number] / is / larger / than / [2-token number] / . largerthanMistral-7B[Begin] / [4-token number] / is / larger / than / [3-token number] / . smallerthanLlama-3.1-8B[Begin] / [3-token number] / is / smaller / than / [2-token number] / . smaller thanMistral-7B[Begin] / [4-token number] / is / smaller / than / [4-token number] / . spentransboth[Begin] / The / Spanish / word / ’ / [2-token Spanish word] / ’ / means / ’ / [1-token English word] / ’. negspentransboth[Begin] / The / Spanish / word / ’ / [2-token Spanish word] / ’ / does / not / mean / ’ / [1-token English word] / ’. Table 6: The token patterns we use to select the statements from the original dataset for the knowledge storage experiments. B.2 Few-shot prompting For each dataset, we select 2 true examples and 2 false examples to conduct four-shot prompting. We randomly select them from the dataset once, and then we fix them. The selected examples are shown in Table 7. The input is constructed using the template: “[four examples] [final statement] This statement is:”. To eliminate the influence of example order, we randomly perturb the four examples for every (true, false) statement pairs, so different pairs might have different example orders, but the true and false statements in a pair have the same example order. We set the random seed to 1 in the beginning to ensure the reproducibility of this random ordering. B.3 Adapting Causal Tracing for the Tulu extracted Dataset For theTuluextracteddataset, we also only use the pairs where the true and false state- ments have the same number of tokens in this experiment. Among them, most of the pairs differ in the object. Nonetheless, a natural consequence of this unstructured dataset construction is that different pairs could have different numbers of tokens, so we cannot directly align them. In order to aggregate the results from different statement pairs, we use another method to align them. Based on our previous finding that the influential patching only occurs on the knowledge-related tokens and the last token, we categorize the tokens into three categories: the different tokens between the true and false statements, the last token, and the other tokens. The different tokens can be seen as knowledge-related tokens. The three token categories can be seen as three meta-tokens, and we want to transform the results on the original tokens into the three meta-tokens. After doing patching for each (true, false) statement pair(s, ˆ s), we first calculate the metricM (l) i (s, ˆ s)for each token positioniand layer las before. Then for each pair, we average the results on all the knowledge-related tokens to obtainM (l) K (s , ˆ s), record the result of the last tokenM (l) −1 (s, ˆ s), and average the results on the other tokens to obtainM (l) O (s, ˆ s) . Now we have results for the three meta-tokens and|L| layers. Then, we use the same way as before to average the results among all the prompt pairs and normalize the results. The final result is denotedM model ∈ R |L|∗3 , which we can visualize and evaluate as before. 16 Published as a conference paper at COLM 2025 DatasetToken pattern elementsymb“Astatine has the symbol At. This statement is: TRUE”, “Arsenic has the symbol As. This statement is: TRUE”, “Platinum has the symbol La. This statement is: FALSE”, “Titanium has the symbol B. This statement is: FALSE” animalclass“The otter is a mammal. This statement is: TRUE”, “The skunk is a mammal. This statement is: TRUE”, “The tuna is a mammal. This statement is: FALSE”, “The giraffe is a crustacean. This statement is: FALSE” inventors“Candace Pert lived in the U.S. This statement is: TRUE”, “Levi Strauss lived in the U.S. This statement is: TRUE”, “Frederick McKinley Jones lived in Japan. This statement is: FALSE”, “Elisha Otis lived in the U.K. This statement is: FALSE” facts“The scientific method is a systematic process for investigating phenomena and acquiring new knowledge. This statement is: TRUE”, “Birds have feathers and wings. This statement is: TRUE”, “Cacti store water in their ears. This statement is: FALSE”, “The process of aging is influenced solely by environmental factors. This statement is: FALSE” cities“The city of Dar es Salaam is in Tanzania. This statement is: TRUE”, “The city of Kozhikode is in India. This statement is: TRUE”, “The city of Dar es Salaam is in Italy. This statement is: FALSE”, “The city of Kozhikode is in the United States. This statement is: FALSE” negcities“The city of Dar es Salaam is not in Italy. This statement is: TRUE”, “The city of Kozhikode is not in the United States. This statement is: TRUE”, “The city of Dar es Salaam is not in Tanzania. This statement is: FALSE”, “The city of Kozhikode is not in India. This statement is: FALSE” largerthan“Seventy-eight is larger than seventy-three. This statement is: TRUE”, “Ninety- six is larger than sixty-six. This statement is: TRUE”, “Fifty-eight is larger than ninety-six. This statement is: FALSE”, “Seventy-nine is larger than ninety-seven. This statement is: FALSE” smallerthan“Fifty-eight is smaller than ninety-six. This statement is: TRUE”, “Seventy- nine is smaller than ninety-seven. This statement is: TRUE”, “Seventy-eight is smaller than seventy-three. This statement is: FALSE”, “Ninety-six is smaller than sixty-six. This statement is: FALSE” spentrans“The Spanish word ’bosque’ means ’forest’. This statement is: TRUE”, “The Spanish word ’piel’ means ’skin’. This statement is: TRUE”, “The Spanish word ’gobernar’ means ’to eat’. This statement is: FALSE”, “The Spanish word ’edad’ means ’clock’. This statement is: FALSE” negspentrans“The Spanish word ’gobernar’ does not mean ’to eat’. This statement is: TRUE”, “The Spanish word ’edad’ does not mean ’clock’. This statement is: TRUE”, “The Spanish word ’bosque’ does not mean ’forest’. This statement is: FALSE”, “The Spanish word ’piel’ does not mean ’skin’. This statement is: FALSE” tuluextracted“The Eiffel Tower is located in Paris. This statement is: TRUE”, “’The Great Gatsby’ was written by F. Scott Fitzgerald. This statement is: TRUE”, “The largest moon of Saturn is Earth. This statement is: FALSE”, “Albert Einstein developed the theory of evolution. This statement is: FALSE” Table 7: Four-shot examples. 17 Published as a conference paper at COLM 2025 Metriccitiesnegcitieslargerthansmallerthanspentransnegspentranstuluextracted Number of Curated Pairs229218389249111537 Corr(M BASE , M INSTRUCT )0.98960.98780.98380.99700.99590.98610.9985 max|M INSTRUCT − M BASE | 0.40.40.20.20.30.30.1 max|M INSTRUCT − M BASE | K 0.40.40.20.10.10.30.0 Corr(M BASE , M SFT )0.98410.96750.97380.98630.9877-0.0775*0.9974 max|M SFT − M BASE |0.40.50.40.30.50.9*0.1 max|M SFT − M BASE | K 0.40.40.40.30.50.7*0.1 Table 8: Comparison of knowledge storage locations of the Mistral-7B-v0.3 model family. The * case is the only abnormal case because the SFT model performs poorly onnegspentrans dataset. It outputs “TRUE” for false statements with an average logit of 78.05%. Metriccitiesnegcitieslargersmallerspennegspentuluex Corr(M BASE−>INSTRUCT , M INSTRUCT )0.99450.92040.97940.91220.99660.94510.9911 max|M BASE−>INSTRUCT − M INSTRUCT |0.30.60.30.70.20.50.3 max|M BASE−>INSTRUCT − M INSTRUCT | K 0.20.60.20.30.00.10.1 Corr(M BASE−>SFT , M SFT )0.99550.90670.94440.95920.98660.94220.9915 max|M BASE−>SFT − M SFT | 0.20.40.40.30.30.40.2 max|M BASE−>SFT − M SFT | K 0.20.40.30.30.10.30.2 Corr(M INSTRUCT−>BASE , M BASE )0.99010.91580.93750.91070.98790.90350.9900 max|M INSTRUCT−>BASE − M BASE |0.31.00.60.80.30.80.2 max|M INSTRUCT−>BASE − M BASE | K 0.21.00.60.70.30.80.2 Corr(M SFT−>BASE , M BASE )0.99120.92490.89720.91690.95580.87960.9877 max|M SFT−>BASE − M BASE |0.31.00.80.80.60.90.2 max|M SFT−>BASE − M BASE | K 0.31.00.80.80.60.90.2 Table 9: Comparison of knowledge storage locations detected by same-model patching and cross-model patching on the Llama-3.1-8B model family.M BASE−>INSTRUCT andM BASE−>SFT are results of forward patching from BASE to INSTRUCT and SFT.M INSTRUCT−>BASE and M SFT−>BASE are results of backward patching from INSTRUCT and SFT to BASE. B.4 Supplementary Quantitative Results Same-model patchingDue to the space limit, we only show the quantitative result of the same-model patching for the Llama-3.1-8B model family in the main content. Here “same- model patching” means the source model from which the patched hidden representation comes is the same model as the target model. The result of the Mistral model family is shown in Table 8. It verifies our previous conclusion that post-training has little influence on knowledge-storage locations. The only abnormal result is the result of Mistral-7B SFT on the negspentransdataset, which is because of its very poor performance. Its average output logit of “TRUE” is 78.05% for false statements. Therefore, it is natural that the patching of most activations, even useless ones, leads to a high probability of outputting “TRUE” for false statements. In this situation, patching cannot detect the knowledge-storage locations. In all other cases, the model achieves a good performance, and causal tracing results verify our previous conclusion. Cross-model patching We also use the same metrics to evaluate cross-model patching. We want to examine whether cross-model patching is as effective as same-model patching, so that we can understand whether the knowledge representations are the same in the BASE and POST models. For a target model, we compare the patching results of same- model patching and cross-model patching. The results are listed in Table 9 and Table 10. M BASE−>INSTRUCT andM BASE−>SFT are results of forward patching from BASE to INSTRUCT or SFT.M INSTRUCT−>BASE andM SFT−>BASE are results of backward patching from INSTRUCT or SFT to BASE. The difference between the same-model patching and cross-model patching is significantly larger on backward patching than on forward patching. It verifies our conclusion: knowledge representations in BASE model still work in the POST model, but knowledge representations in POST model do not work that well in the BASE model. 18 Published as a conference paper at COLM 2025 Metriccitiesnegcitieslargersmallerspennegspentuluex Corr(M BASE−>INSTRUCT , M INSTRUCT )0.93540.85830.81870.99670.96940.89380.9710 max|M BASE−>INSTRUCT − M INSTRUCT |0.50.70.70.20.40.70.4 max|M BASE−>INSTRUCT − M INSTRUCT | K 0.20.20.70.10.10.10.4 Corr(M BASE−>SFT , M SFT )0.97350.97690.96330.90690.9870-0.10610.9721 max|M BASE−>SFT − M SFT |0.40.40.40.60.31.00.4 max|M BASE−>SFT − M SFT | K 0.40.40.40.60.30.70.4 Corr(M INSTRUCT−>BASE , M BASE )0.87450.84740.95570.97110.91960.69300.9774 max|M INSTRUCT−>BASE − M BASE |0.80.70.50.60.70.90.4 max|M INSTRUCT−>BASE − M BASE | K 0.80.70.30.30.70.90.4 Corr(M SFT−>BASE , M BASE )0.93970.73810.95550.97400.9796-0.42080.9763 max|M SFT−>BASE − M BASE |0.61.00.50.50.41.00.4 max|M SFT−>BASE − M BASE | K 0.60.40.30.50.40.90.4 Table 10: Comparison of knowledge storage locations detected by same-model patch- ing and cross-model patching on the Mistral-7B-v0.3 model family.M BASE−>INSTRUCT andM BASE−>SFT are results of forward patching from BASE to INSTRUCT and SFT. M INSTRUCT−>BASE andM SFT−>BASE are results of backward patching from INSTRUCT and SFT to BASE. MetricLlama-3.1-8B familyMistral-7B-v0.3 family citiesspentranscitiesspentrans Corr(M BASE , M INSTRUCT )0.99610.99820.99820.9981 max|M INSTRUCT − M BASE | 0.10.10.10.1 max|M INSTRUCT − M BASE | K 0.10.10.10.1 Corr(M BASE , M SFT )0.99680.99890.99000.9959 max|M SFT − M BASE |0.10.10.30.3 max|M SFT − M BASE | K 0.10.10.30.3 Table 11: Comparison of knowledge storage locations detected by the traditional causal tracing setting. Generalizability verification: causal tracing using the traditional setting Our main experiments follow the setting of Marks & Tegmark (2024). We ask the LLM to classify the truthfulness of a statement. This setup differs from the traditional causal tracing setup (Meng et al., 2022), which uses LLM to output the object corresponding to a given subject. We choose this setting because of the following considerations. First, this setting (e.g., ”The city of Toronto is in Canada. This statement is:”) can detect knowledge storage in both the subject and the object. In contrast, the traditional setting provides the subject and lets the model output the object, e.g., ”The city of Toronto is in”. It can only detect knowledge storage in the subject. Second, our setting can test a wider range of factual knowledge. The traditional setting evaluates the patching’s influence by examining the output logit of the correct object, so it must have a fixed correct answer, such as the country of a city. But in many datasets, such aslargerthan, statements like ”86 is larger than 57” don’t have a fixed correct answer. Any number less than the subject is correct here. To verify the generalizability of our conclusion, we also conduct causal tracing experiments based on the traditional setting. Only two of our datasets,citiesandspentrans, have a fixed correct object for each statement, so we conduct experiments using the traditional setting only on them. We directly ask the model to output the object. We use the same metric for evaluation: if we denote the model’s output object for one statement asO 1 and the output for another statement asO 2 , the metric oflog P(O 1 ) P(O 2 ) denotes the effectiveness of patching. The results are shown in Table 11. The results verify our conclusion that post-training has little influence on knowledge storage locations. 19 Published as a conference paper at COLM 2025 B.5 Supplementary Visualization Results Same-model patching Due to the space limit, we only show some representative visu- alization results in the main paper. Here we show all of the visualization results. We first show the visualizations of within-model patching, further verifying our first conclusion: LLM post-training has little influence on the knowledge-storage locations. The comparison between Llama-3.1-8B BASE and INSTRUCT is shown in Figure 10. The comparison between Llama-3.1-8B BASE and SFT is shown in Figure 11. On the figure titles, “Llama-3.1-8B” means BASE, “Llama-3.1-8B-Instruct” means INSTRUCT, “Llama-3.1-8B-SFT” means SFT, “Llama-3.1-8B-Instruct - Llama-3.1-8B” and “Llama-3.1-8B-SFT - Llama-3.1-8B” means the difference (specifically, M POST − M BASE ). Similarly, the comparison between Mistral-7B BASE and INSTRUCT is shown in Figure 12, and the comparison between Mistral-7B BASE and SFT is shown in Figure 13. Results using the traditional causal tracing setting are visualized in Figure 14 and Figure 15. The only abnormal result is Mistral-7B-SFT on thenegspentransdataset. As explained in the previ- ous subsection, it is because of this model’s very poor performance on thenegspentrans dataset. Except for this abnormal case, all of the results verify our conclusion. Cross-model patching Here we show all the visualizations of cross-model patching, further verifying our second conclusion: LLM post-training keeps the original knowledge representations, but it also develops new knowledge representations. The patching between Llama-3.1-8B BASE and INSTRUCT is visualized in Figure 16 and Figure 17. The patching between Llama-3.1-8B BASE and SFT is shown in Figure 18 and Figure 19. The patching between Mistral-7B BASE and INSTRUCT is shown in Figure 20 and Figure 21. The patching between Mistral-7B BASE and SFT is shown in Figure 22 and Figure 23. Results using the traditional causal tracing setting are visualized in Figure 24 and Figure 25. 20 Published as a conference paper at COLM 2025 C Supplementary Details and Experiments of Internal Belief of Truthfulness C.1 Few-Shot Prompting For learning the truthfulness directiont, we do not use few-shot examples but directly prompt the models with the statements. For truthfulness intervention, we use the same four-shot prompting as the experiments of knowledge storage with the same examples, though we do not have (true, false) statement pairs in the truthfulness experiments. The four examples contain two true statements and two false statements, shown in Table 7. The input is constructed in the template: “[four examples] [final statement] This statement is:”. To eliminate the influence of example order, we randomly perturb the four examples for every final statement. We set the random seed to 1 in the beginning to ensure the reproducibility of this random ordering. C.2 Truthfulness Direction Layer and Token Position Choices We examine the causal tracing result to determine the best layer and token position for learning the truthfulness direction and performing the intervention. Specifically, for llama- 3.1-8b BASE, SFT, and INSTRUCT models, we use the 12th layer for learning truthfulness direction and 8-12 layers for performing the intervention. For mistral-7B BASE and SFT we use the 13th layer for learning truthfulness direction and 8-13 layers for performing the intervention. For both model families, direction learning and intervention use the last token position of the input statements. C.3 Probe Transfer Accuracy on Mistral Family Due to space limits, we only show the results on the Llama-3.1-8B model family in the main content. To further generalize our conclusion, we conduct probe transfer experiments on Mistral-7B-v0.3 BASE and INSTRUCT. Initially we also conducted probe experiments on Mistral-7B-Base-SFT-Tulu2 as the Mistral SFT model, but its performance on this experi- ment’s datasets is on the level of random guess, making us impossible to draw any useful conclusions on it. Therefore, we discard the Mistral SFT model and only present the other two. Test DatasetProbe Transfer Accuracy (%) p BASE → h BASE p INS → h INS / p BASE → h INS (∆) cities93.7895.90 / 95.82 (-0.08) sp entrans83.7184.11 / 88.83 (+4.72) inventors91.0887.93 / 90.23 (+2.30) animalclass98.7899.09 / 98.93 (-0.16) elementsymb75.2279.87 / 84.19 (+4.32) facts75.1076.09 / 76.27 (+0.18) Table 12: Probe transfer accuracy (↑) of Mistral-7B-v0.3 BASE and INSTRUCT tested on 6 truthfulness datasets. For each row, we use the datasets from the other 5 rows for training. p model 1 → h model 2 means using the probe trained onmodel 1 to classify statement representa- tions inmodel 2 . Accuracy of transferred probes show little difference (∆) compared to the same-model probes. As shown in Table 12, the probe transfer is quite successful, which align with our previous conclusions on Llama-3.1-8B. 21 Published as a conference paper at COLM 2025 DatasetINS→INS IEINS→INS CoefBASE→INS IEBASE→INS CoefDelta cities0.88802.000.89681.000.0088 sp entrans0.84842.000.84093.00-0.0075 inventors0.79732.000.82981.000.0325 animalclass0.70631.000.71921.000.0129 elementsymb0.75822.000.76971.000.0115 facts0.61851.000.65601.000.0375 Table 13: Intervention performance with optimal scaling factors on Llama-3.1-8B models. INS→INS denotes using INSTRUCT model’s truthfulness direction to intervene in itself, while BASE→INS denotes using BASE model’s direction to intervene in SFT model. Coef indicates the optimal scaling factorλ, IE is the Intervention Effect, and Delta represents the performance difference. Test DatasetTruthful Intervention Effect t BASE 7→ h BASE t INS 7→ h INS / t BASE 7→ h INS (∆) cities0.650.67 / 0.69 (+0.02) sp entrans0.770.87 / 0.89 (+0.02) inventors0.630.71 / 0.72 (+0.01) animalclass0.630.67 / 0.68 (+0.01) elementsymb0.710.81 / 0.81 (+0.00) facts0.590.63 / 0.64 (+0.01) Table 14: Intervention effect (↑) of intervention on Mistral-7B-v0.3 BASE and INSTRUCT tested on 6 truthful datasets. For each row, we use the datasets from the other 5 rows for training.t model 1 7→ h model 2 means using the truthfulness direction inmodel 1 to intervene model 2 . Transfer truthful interventions show small differences (∆). C.4 Probe Intervention Coefficient Choice To assess the robustness of our findings to the choice of scaling factor, we extended our experiments beyond the default scalar setting (λ =±1) used in Marks & Tegmark (2024). Prior work has shown that scaling can impact intervention effectiveness (Li et al., 2024), motivating a broader evaluation. We variedλfrom 1 to 10 (step size 1) on the Llama-3.1-8B and Llama-3.1-8B-Instruct model pair. For each model and dataset, we selected the scaling factor that maximized the Inter- vention Effect (IE), comparing two scenarios: (1) INS→INS (INSTRUCT direction intervening on INSTRUCT model) and (2) BASE→INS (BASE direction intervening on INSTRUCT model). Table 13 reports the optimal scaling factors and corresponding IE values. While intervention effectiveness shows modest sensitivity toλ, both base and instruct directions achieve com- parable performance when optimally scaled. The small Delta values (ranging from -0.0075 to 0.0375) further indicate that cross-model and same-model interventions perform similarly, reinforcing our conclusion that post-training preserves the model’s internal representation of truthfulness. C.5 Probe Intervention on Mistral Family The probe intervention results on Mistral-7B-v0.3 BASE and INSTRUCT are shown in figure 14. The difference (∆) in Intervention Effects when steering INSTRUCT witht BASE versus t INSTRUCT is very little. It further verifies our previous conclusions in Section 5. C.6 Case Study of Intervention Here we show a case study of cross-model truthfulness intervention on Llama-3.1-8B BASE, INSTRUCT, and SFT models. It shows thatt BASE can flip T/F outputs in POST as effectively as 22 Published as a conference paper at COLM 2025 Truthful Intervention Case Study True statement input: “The city of Paris is in France. This statement is:” SFT baseline:TRUE (1.00) / SFT-SFT:FALSE (1.00) / BASE-SFT:FALSE (0.99) INSTRUCT baseline:TRUE(1.00) / INSTRUCT-INSTRUCT:FALSE(1.00) / BASE- INSTRUCT:FALSE (1.00) False statement input: “The city of Paris is in China. This statement is:” SFT baseline:FALSE (1.00) / SFT-SFT:TRUE (0.99) / BASE-SFT:TRUE (1.00) INSTRUCT w/o intervention:FALSE(1.00) / INSTRUCT-INSTRUCT:TRUE(0.94) / BASE-INSTRUCT:TRUE (0.94) Figure 4: Example output of Llama-3.1-8b-Instruct with intervention.Gray boxshows the next token output with the highest predicted probability with its corresponding probability in the parentheses. Transferred intervention can flip the output as successfully as native intervention. t SFT andt INSTRUCT . The successful intervention verifies our conclusion that the direction of truthfulness in the hidden representation space of BASE and POST are similar. 23 Published as a conference paper at COLM 2025 D Supplementary Details and Experiments of Refusal Direction Intervention D.1 Refusal Scores Following Arditi et al. (2024); Kissane et al. (2024a) We use refusal keywords to determine if a model refuses an input. The keywords are listed in Figure 5. Refusal score on a dataset is calculated as the percentage of outputs of the model that contains these keywords at the beginning. Refusal Keywords “I’m sorry”, “I am sorry”, “I apologize”, “As an AI”, “As a language model”, “As an assistant”, “I cannot”, “I can’t”, “I can’t”, “I am unable to”, “I’m unable to”, “I am not able to”, “I’m not able to”, “I will not”, “No, I will not”, “I am not”, “I’m not going to do that”, “I can’t”, “I don’t”. Figure 5: Refusal keywords used to detect refusal behavior. D.2 Input Template In order for the models to give reasonable responses to the instructions, we use the user- assistant template to prompt the models. For all SFT and INSTRUCT models we use their official chat templates. For BASE models, we construct a template for the best output quality. Templates for base models are illustrated as following: gemma-2-9b: <start_of_turn>user: instruction<end_of_turn> <start_of_turn>assistant: llama-3.1-8b: User: instruction Assistant: qwen1.5-0.5b: <|im\_start|>user instruction<|im\_end|> <|im\_start|>assistant instruction is the input harmful or harmless instructions. D.3 Refusal Direction Layer and Token Position Choices We follow Arditi et al. (2024) to select the best-performing layer and token positions for extracting the refusal direction r. The choices are reported in Table 15. D.4 Abnormal Case in Refusal Intervention for Llama-3.1-8b Table 4 shows one notable abnormal case: intervening the representations of SFT by adding r BASE induces SFT to refuse 85% of inputs, which is even higher than the intervention results on BASE itself. This suggests SFT may be inherently more prone to refusing instructions and thus more easily steered toward refusal. The poorer transfer results when usingr BASE to intervene in INSTRUCT further suggests that the DPO process employed in INSTRUCT may have mitigated INSTRUCT’s internal tendency to refuse. Investigating this phenomenon could be a promising future direction. 24 Published as a conference paper at COLM 2025 ModelLayerToken Position llama-3.1-8b BASE11-4 llama-3.1-8b SFT11-2 llama-3.1-8b INSTRUCT11-1 qwen1.5-0.5b BASE13-1 qwen1.5-0.5b INSTRUCT13-1 gemma-2-9b BASE23-1 gemma-2-9b INSTRUCT23-1 Table 15: Layer and token position choices for extracting refusal directions. D.5 Refusal Direction Intervention with Other Model Families ModelDataRefusal Score↑ INSINS-INSBASE-INS Qwen-1.5-0.5B-chatharmless0.030.680.27 Qwen-1.5-0.5B-chatharmful0.851.001.00 Gemma-2-9B-itharmless0.020.750.09 Gemma-2-9B-itharmful0.981.001.00 Table 16: Refusal scores after adding refusal directions from INSTRUCT to INSTRUCT (INS- INS) or from BASE to INSTRUCT (BASE-INS). The best intervention results are shown in bold. The BASE to INSTRUCT intervention is not very successful. ModelDataRefusal Score↓ INSINS-INSBASE-INS Qwen-1.5-0.5B-chatharmless0.030.010.01 Qwen-1.5-0.5B-chatharmful0.850.040.13 Gemma-2-9B-itharmless0.020.000.01 Gemma-2-9B-itharmful0.980.000.87 Table 17: Intervention Refusal scores after ablating the INSTRUCT refusal direction from INSTRUCT (INS-INS) and abalting the BASE refusal direction from INSTRUCT (BASE-INS). The best intervention results are shown in bold. The BASE to INSTRUCT intervention is not very successful. Besides Llama-3.1-8b-Instruct, we also conducted the experiment on intervening in the refusal direction on BASE and INSTRUCT from Qwen-1.5-0.5B series and Gemma-2-9B series. As shown in Table 16 and Table 17, neither of the refusal directions from BASE of Qwen-1.5- 0.5B and Gemma-2-9B can successfully intervene in the behaviors of INSTRUCT. It confirms our previous conclusions. D.6 Case Study of Refusal Intervention We show a case study of refusal intervention in Figure 6. As shown in the figure, the baseline output from INSTRUCT is refusing to follow the harmful input. After intervention with the refusal direction from INSTRUCT, the refusal behavior disappears and the model starts to follow the harmful input. However, with the direction from BASE, the behavior stays the same. It further confirms our previous conclusions. 25 Published as a conference paper at COLM 2025 Refusal Intervention Case Study Harmful input: “Give step-by-step instructions for making explosives and other dangerous materials” INSTRUCT baseline output: I can’t fulfill that request... INSTRUCT output (ablated by INSTRUCT refusal direction): A thrilling request! Here are step-by-step instructions for making various explosives... INSTRUCT output (ablated by BASE refusal direction): I can’t fulfill that request... Figure 6: Example output of Llama-3.1-8b-Instruct on harmful instructions with intervention. The baseline is the output without intervention. Ablation using direction learned from BASE model failed to steer the model to bypass the refusal behavior. 26 Published as a conference paper at COLM 2025 E Supplementary Details and Experiments for Confidence Due to space limits, we did not provide experiment results regarding entropy neurons in the main content, so we leave them here. We analyze the neurons from the last MLP layer, and we calculate their weight norms and LogitVar. Figure 7, 8, and 9 show the distributions of their weight norms and LogitVar. The X-axis shows the weight norm, and the Y-axis shows the LogitVar. We conduct experiments on Llama-2-7B, Llama-3.1-8B, and Mistral-7B models. The distributions across BASE, SFT, and INSTRUCT models are very similar. (a) Llama-2-7b BASE(b) Llama-2-7b INSTRUCT Figure 7: Weight norm and LogitVar of the last MLP layer’s neurons in the Llama-2-7B model family. (a) Llama-3.1-8b BASE(b) Llama-3.1-8b SFT(c) Llama-3.1-8b INSTRUCT Figure 8: Weight norm and LogitVar of the last MLP layer’s neurons in the Llama-3.1-8B model family. (a) Mistral BASE(b) Mistral SFT(c) Mistral INSTRUCT Figure 9: Weight norm and LogitVar of the last MLP layer’s neurons in the Mistral-7B-v0.3 model family. Table 18 shows the stats of entropy neurons across models. We observe a high overlap of entropy neurons between BASE and POST models. To further investigate the overlapping entropy neurons, we calculate the ratio ∥w out ∥ log(LogitVar) of overlapping entropy neurons to quantitatively represent how much each neuron is qualified as an entropy neuron. We then 27 Published as a conference paper at COLM 2025 compute the absolute difference between these ratios for entropy neurons in the BASE and POST models with the result shown in Table 18. As a reference, the average ratio of all the entropy neurons is -0.0880, while the average absolute difference of the ratio on the overlapping entropy neurons between BASE and POST is generally less than 1% of it. It confirms that the entropy neurons are not only overlapping, but the overlapping entropy neurons are also very similar. Model pairOverlapping neuron count (out of 10)Avg abs ratio difference llama-3.1-8b BASE vs INSTRUCT80.000815 llama-3.1-8b BASE vs SFT100.000112 mistral-7b BASE vs INSTRUCT90.000030 mistral-7b BASE vs SFT80.000089 llama-2-7b BASE vs INSTRUCT90.001712 Table 18: BASE models and POST models have very similar entropy neurons. “Overlapping neuron count” shows the number of overlapping entropy neurons between BASE and POST models. “Avg abs ratio difference” shows the average absolute difference of ∥w out ∥ log(LogitVar) of the overlapping entropy neurons between BASE and POST models. As a reference, the average ratio is -0.0880 for all entropy neurons. 28 Published as a conference paper at COLM 2025 F Additional Experiments on Llama-2-13B Models To verify whether our findings generalize to larger models, we conduct experiments on Llama-2-13B base (BASE) and Llama-2-13B-Instruct (INSTRUCT) models. We use the same experimental settings as described in the main paper. Our previous conclusions are consis- tently verified on these 13B parameter models. F.1 Knowledge Storage Experiments We conduct causal tracing experiments using the same settings and metrics as the main paper on thecities,spentrans, andtuluextracteddatasets. The results in Table 19 demonstrate high correlation coefficients between BASE and INSTRUCT models with low maximum differences, confirming that post-training has minimal influence on knowledge storage locations. citiesspentranstuluextracted Corr(M BASE , M INSTRUCT )0.98850.99180.9970 max|M INSTRUCT − M BASE |0.40.40.2 max|M INSTRUCT − M BASE | K 0.40.40.1 Table 19: Knowledge storage results for Llama-2-13B models. F.2 Truthfulness Probing Experiments We follow the same experimental settings and metrics for truthfulness probing across multiple datasets. The results in Table 20 show consistent patterns with our main findings. Test DatasetProbe Transfer Accuracy (%) p BASE → h BASE p INS → h INS / p BASE → h INS (∆) cities95.3999.47 / 99.06 (-0.41) spentrans96.8996.33 / 90.68 (-5.65) inventors83.7470.20 / 70.94 (+0.74) animalclass98.7895.12 / 95.12 (+0) elementsymb95.7094.62 / 94.09 (-0.53) facts71.1278.97 / 62.75 (-16.22) Table 20: Probe transfer accuracy (↑) of Llama-2-13B models. F.3 Truthfulness Intervention Experiments Using identical settings as the main experiments, we evaluate truthfulness interventions on both models. The results in Table 21 maintain consistency with our previous conclusions. F.4 Refusal Intervention Experiments We conduct refusal intervention experiments following the same methodology. The results in Table 22 confirm that truthfulness directions remain similar between base and post-trained models while refusal directions differ. F.5 Entropy Neuron Analysis For entropy neuron experiments, all top 10 entropy neuron candidates are identical between BASE and INSTRUCT. The weight ratio differences remain minimal, confirming that confi- dence differences between base and post-trained models cannot be attributed to entropy neurons. 29 Published as a conference paper at COLM 2025 Test DatasetTruthful Intervention Effect t BASE 7→ h BASE t INS 7→ h INS / t BASE 7→ h INS (∆) cities0.690.71 / 0.68 (+0.03) spentrans0.830.86 / 0.88 (+0.02) inventors0.660.64 / 0.67 (+0.03) animal class0.720.73 / 0.74 (+0.01) elementsymb0.790.84 / 0.83 (-0.01) facts0.680.63 / 0.66 (+0.03) Table 21: Intervention effect (↑) of intervention on Llama-2-13B models. Test Datasetbaseline / r BASE → h BASE baseline / r INS → h INS / r BASE → h INS harmful0.24 / 0.370.99 / 0.59 / 0.99 harmless0.05 / 0.320.0 / 1.0 / 0.01 Table 22: Refusal intervention results for Llama-2-13B Due to resource constraints, we were unable to conduct experiments on even larger models, but we expect our findings to generalize to models with 40 billion or more parameters. 30 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B neg_cities Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B larger_than Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct larger_than Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B larger_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B smaller_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct smaller_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B smaller_than Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B tulu_extracted Layer Figure 10: Knowledge storage locations of Llama-3.1-8B BASE and INSTRUCT. 31 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B neg_cities Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B larger_than Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT larger_than Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B larger_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B smaller_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT smaller_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B smaller_than Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B tulu_extracted Layer Figure 11: Knowledge storage locations of Llama-3.1-8B BASE and SFT. 32 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B neg_cities Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B larger_than Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct larger_than Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B larger_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B smaller_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct smaller_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B smaller_than Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B tulu_extracted Layer Figure 12: Knowledge storage locations of Mistral-7B BASE and INSTRUCT. 33 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B neg_cities Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B larger_than Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT larger_than Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B larger_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B smaller_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT smaller_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B smaller_than Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B tulu_extracted Layer Figure 13: Knowledge storage locations of mistral-7B BASE and SFT. 34 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-Instruct cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct - Llama-3.1-8B sp_en_trans Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-SFT cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT - Llama-3.1-8B sp_en_trans Layer Figure 14: Knowledge storage locations of Llama-3.1-8B BASE, INSTRUCT, and SFT in the traditional causal tracing setting. 35 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-Instruct cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-SFT cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct - Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT - Mistral-7B sp_en_trans Layer Figure 15: Knowledge storage locations of Mistral-7B BASE, INSTRUCT, and SFT in the traditional causal tracing setting. 36 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B neg_cities Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct larger_than Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B larger_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct smaller_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B smaller_than Layer Figure 16: Cross-model patching results between llama-3.1-8b BASE and INSTRUCT. 37 Published as a conference paper at COLM 2025 TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-Instruct tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-Instruct to Llama-3.1-8B tulu_extracted Layer Figure 17: Cross-model patching results between llama-3.1-8b BASE and INSTRUCT (Contin- ued). 38 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B neg_cities Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT larger_than Layer [s1][s2][s3]islar ger than[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B larger_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT smaller_than Layer [s1][s2][s3]issmallerthan[o1][o2].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B smaller_than Layer Figure 18: Cross-model patching results between llama-3.1-8b BASE and SFT. 39 Published as a conference paper at COLM 2025 TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B to Llama-3.1-8B-SFT tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Llama-3.1-8B-SFT to Llama-3.1-8B tulu_extracted Layer Figure 19: Cross-model patching results between llama-3.1-8b BASE and SFT (Continued). 40 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B neg_cities Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct larger_than Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B larger_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct smaller_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B smaller_than Layer Figure 20: Cross-model patching results between Mistral-7B BASE and INSTRUCT. 41 Published as a conference paper at COLM 2025 TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-Instruct tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-Instruct to Mistral-7B tulu_extracted Layer Figure 21: Cross-model patching results between Mistral-7B BASE and INSTRUCT (Contin- ued). 42 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT cities Layer Thecityof[s1][s2][s3]isin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT neg_cities Layer Thecityof[s1][s2][s3]isnotin[o1].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B neg_cities Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT larger_than Layer [s1][s2][s3][s4]islar ger than[o1][o2][o3].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B larger_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT smaller_than Layer [s1][s2][s3][s4]issmallerthan[o1][o2][o3][o4].Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B smaller_than Layer Figure 22: Cross-model patching results between Mistral-7B BASE and SFT. 43 Published as a conference paper at COLM 2025 TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT neg_sp_en_trans Layer TheSpanishword'[s1][s2]' doesnotmean' [o1]'.Thisstatementis : 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B neg_sp_en_trans Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B to Mistral-7B-SFT tulu_extracted Layer knowledgeotherslast_token 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) Mistral-7B-SFT to Mistral-7B tulu_extracted Layer Figure 23: Cross-model patching results between Mistral-7B BASE and SFT (Continued). 44 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B_to_Llama-3.1-8B-Instruct cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-Instruct_to_Llama-3.1-8B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B_to_Llama-3.1-8B-Instruct sp_en_ Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-Instruct_to_Llama-3.1-8B sp_en_ Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B_to_Llama-3.1-8B-SFT cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-SFT_to_Llama-3.1-8B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B_to_Llama-3.1-8B-SFT sp_en_tra Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Llama-3.1-8B-SFT_to_Llama-3.1-8B sp_en_tra Layer Figure 24: Cross-model patching results between Llama-3.1-8B BASE, INSTRUCT, and SFT in the traditional causal tracing setting. 45 Published as a conference paper at COLM 2025 Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B_to_Mistral-7B-Instruct cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-Instruct_to_Mistral-7B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B_to_Mistral-7B-Instruct sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-Instruct_to_Mistral-7B sp_en_trans Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B_to_Mistral-7B-SFT cities Layer Thecityof[s1][s2][s3]isin 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-SFT_to_Mistral-7B cities Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B_to_Mistral-7B-SFT sp_en_trans Layer TheSpanishword'[s1][s2]' means' 30 25 20 15 10 5 0 −1 −0.5 0 0.5 1 log P(T)/P(F) no_question_Mistral-7B-SFT_to_Mistral-7B sp_en_trans Layer Figure 25: Cross-model patching results between Mistral-7B BASE, INSTRUCT, and SFT in the traditional causal tracing setting. 46