Paper deep dive
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song
Models: DeepSeek-R1-Distill-Qwen-7B, Llama-3.1-8B, Nemotron-1.5B
Abstract
Large language models can generate long chain-of-thought (CoT) reasoning, but it remains unclear whether the verbalized steps reflect the models' internal thinking. In this work, we propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction. Our experiments show that LLMs often interleave between true-thinking steps (which are genuinely used to compute the final output) and decorative-thinking steps (which give the appearance of reasoning but have minimal causal influence). We reveal that only a small subset of the total reasoning steps causally drive the model's prediction: e.g., on AIME, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) for Qwen-2.5. Furthermore, we find that LLMs can be steered to internally follow or disregard specific steps in their verbalized CoT using the identified TrueThinking direction. We highlight that self-verification steps in CoT (i.e., aha moments) can be decorative, while steering along the TrueThinking direction can force internal reasoning over these steps. Overall, our work reveals that LLMs often verbalize reasoning steps without performing them internally, challenging the efficiency of LLM reasoning and the trustworthiness of CoT.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/12/2026, 5:31:45 PM
Summary
The paper introduces the True Thinking Score (TTS) to quantify the causal contribution of individual reasoning steps in Large Language Model (LLM) Chain-of-Thought (CoT) sequences. It distinguishes between 'true-thinking' steps, which causally influence the final prediction, and 'decorative-thinking' steps, which provide the appearance of reasoning without internal causal impact. The authors demonstrate that LLMs can be steered along a 'TrueThinking' direction in latent space to force or suppress internal reasoning over specific steps, challenging the assumption that CoT is always a faithful reflection of internal model processes.
Entities (5)
Relation Signals (3)
Qwen-2.5 → performs → Chain-of-Thought
confidence 95% · only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 for Qwen-2.5.
True Thinking Score → quantifies → Chain-of-Thought
confidence 95% · we propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT
TrueThinking direction → mediates → Chain-of-Thought
confidence 90% · whether an LLM internally performs a step verbalized in CoT can be mediated by a TrueThinking direction
Cypher Suggestions (2)
Find all models evaluated using the True Thinking Score · confidence 90% · unvalidated
MATCH (m:Model)-[:EVALUATED_WITH]->(t:Metric {name: 'True Thinking Score'}) RETURN m.name
Identify relationships between reasoning steps and causal influence · confidence 85% · unvalidated
MATCH (s:ReasoningStep)-[r:HAS_CAUSAL_IMPACT]->(m:Model) RETURN s.name, r.type, m.name
Full Text
77,669 characters extracted from source content.
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
Jiachen Zhao *1, Yiyou Sun *2, Weiyan Shi †1, Dawn Song †2
(* Equal contribution, † Equal advising. 1 Northeastern University, Boston, MA, USA. 2 University of California, Berkeley, CA, USA. Correspondence to: Jiachen Zhao <zhao.jiach@northeastern.edu>.)
Preprint. February 2, 2026.

Abstract: Large language models can generate long chain-of-thought (CoT) reasoning, but it remains unclear whether the verbalized steps reflect the models' internal thinking. In this work, we propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction. Our experiments show that LLMs often interleave between true-thinking steps (which are genuinely used to compute the final output) and decorative-thinking steps (which give the appearance of reasoning but have minimal causal influence). We reveal that only a small subset of the total reasoning steps causally drive the model's prediction: e.g., on AIME, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) for Qwen-2.5. Furthermore, we find that LLMs can be steered to internally follow or disregard specific steps in their verbalized CoT using the identified TrueThinking direction. We highlight that self-verification steps in CoT (i.e., aha moments) can be decorative, while steering along the TrueThinking direction can force internal reasoning over these steps. Overall, our work reveals that LLMs often verbalize reasoning steps without performing them internally, challenging the efficiency of LLM reasoning and the trustworthiness of CoT.

1. Introduction

Recent frontier LLMs can increasingly solve complex reasoning problems through test-time scaling, often by generating a very long chain of thought (CoT) (Guo et al., 2025; Muennighoff et al., 2025; Snell et al., 2024; Jaech et al., 2024). In their long CoT, these models frequently exhibit an "aha moment", where the model begins to self-verify its solution (e.g., "Wait, let's re-evaluate ...") (Guo et al., 2025). LLMs' generated CoT is commonly assumed to be a scratch pad where the model thinks out loud (Korbak et al., 2025). This has also motivated using CoT as a means to monitor LLMs and detect unsafe behaviors revealed in their CoT (Baker et al., 2025).

A central yet questionable assumption about CoT is its faithfulness: each verbalized step genuinely reflects the model's internal reasoning and contributes to its final output. However, recent work implies this assumption does not always hold. Models may solve problems relying on hints (Chen et al., 2025b; Chua & Evans, 2025; Turpin et al., 2023) or their biases (Arcuschin et al., 2025) without verbalizing them in their CoT, and they may already know their final answers early before finishing generating the complete CoT (Ma et al., 2025; Wang et al., 2025; Yang et al., 2025). Such findings motivate the view that CoT may act as post-hoc rationalization (Arcuschin et al., 2025; Emmons et al., 2025), where LLMs first pre-determine their answers internally in their latent space and then generate reasoning steps to rationalize them. Such unfaithfulness of CoT raises concerns about relying on CoT for monitoring LLMs, as the verbalized reasoning may not reflect what a model truly "thinks". Although prior work has questioned the faithfulness of CoT, a fine-grained, step-by-step analysis remains lacking. Therefore, in this study, we ask: To what extent do LLMs truly think through each verbalized step in their CoT? To close this gap, we propose to measure step-wise causality to probe whether an LLM is faithfully thinking as verbalized in its reasoning traces in CoT.
We reveal that in a CoT, there are faithful true-thinking steps that causally affect the model's prediction, and unfaithful decorative-thinking steps that the model does not actually perform internally and that make minimal causal contribution to its prediction (as illustrated in Figure 1). Crucially, a true-thinking step can causally contribute in two distinct ways, as illustrated in Figure 2.

1. Conjunctive ("and"): a step s and other steps before it (denoted as C) jointly determine the answer, as in many enumeration problems where all steps are important. Then, removing or corrupting s will flip the model's initial prediction y*. This is the regime primarily tested by prior work (Lanham et al., 2023; Tanneru et al., 2024; Wang et al., 2025; Yu et al., 2025; Tutek et al., 2025), which infers faithfulness from the necessity-in-context effect of perturbing s alone.

2. Disjunctive ("or"): either s or C already suffices to produce the correct answer. For example, s is a verification step or an alternative solution for the results established in C. Here, perturbing s may leave the model's prediction unchanged because C still carries the solution. Prior works (Lanham et al., 2023; Tanneru et al., 2024; Wang et al., 2025; Yu et al., 2025) that only consider necessity may mislabel s in this case as "unfaithful" despite its genuine contribution.

Figure 1. We find that reasoning steps in CoT may not always be true thinking but can function as decorative thinking, where the model internally is not using those steps to compute its answer. Taking self-verification steps as an example (known as "aha moments", where LLMs rethink their solution with phrases like "wait"), we randomly perturb the numerical values in the reasoning steps preceding the "aha moment", and then re-prompt the model for the answer using the modified CoT. In the left example (decorative self-verification), although the model's self-verification reasoning is correct (the answer 5), the model ignores it and outputs the wrong answer (6) after perturbation. In the right example (true self-verification), the model self-verifies a wrong answer (12) and produces the correct result (11).

To measure both roles, we extend the Average Treatment Effect (ATE) (Rubin, 1974; Pearl, 2009) with two complementary interventions by conditioning on the context C (the steps before the target step): a necessity test ATE_nec(1) = P(y* | C, s) − P(y* | C, s′) that measures the model's confidence change before and after perturbing s under intact C, and a sufficiency test ATE_suf(0) = P(y* | C′, s) − P(y* | C′, s′) that perturbs s under corrupted C′. Averaging them yields our True-Thinking Score (TTS), which considers steps that matter either jointly with context (the "and" case) or as an alternative route that still validates or secures the answer (the "or" case).
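As a sanity check on these definitions, the toy sketch below (not from the paper; the deterministic 0/1 confidence function and the helper names `p_correct`, `ate`, and `tts` are illustrative assumptions) contrasts a conjunctive step, where the answer needs both s and its context C, with a disjunctive step, where either suffices. A necessity-only metric would score the disjunctive step at zero, while averaging the necessity and sufficiency tests keeps it nonzero.

```python
def p_correct(context_intact: bool, step_intact: bool, mode: str) -> float:
    """Toy confidence P(y* | C, s): the answer is recoverable only when
    both parts are intact ("and") or when either part is intact ("or")."""
    if mode == "and":
        return 1.0 if (context_intact and step_intact) else 0.0
    if mode == "or":
        return 1.0 if (context_intact or step_intact) else 0.0
    raise ValueError(f"unknown mode: {mode}")

def ate(context_intact: bool, mode: str) -> float:
    """Context-based ATE: P(y* | C=c, s) - P(y* | C=c, s')."""
    return p_correct(context_intact, True, mode) - p_correct(context_intact, False, mode)

def tts(mode: str) -> float:
    """TTS = 0.5 * (|ATE_nec(1)| + |ATE_suf(0)|)."""
    return 0.5 * (abs(ate(True, mode)) + abs(ate(False, mode)))

# Conjunctive step: necessary under intact context, not sufficient alone.
print(ate(True, "and"), ate(False, "and"))   # 1.0 0.0
# Disjunctive step: redundant under intact context, sufficient alone.
print(ate(True, "or"), ate(False, "or"))     # 0.0 1.0
# Both modes still receive the same nonzero TTS.
print(tts("and"), tts("or"))                 # 0.5 0.5
```

Note how a necessity-only test (the first number in each pair) misses the disjunctive step entirely, which is the failure mode the paper attributes to prior perturbation-based metrics.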
Direct adaptations of prior methods estimate only ATE_nec(1), which is logically insufficient to detect disjunctive contributions and thus systematically miscounts true-thinking steps. Our evaluation reveals that steps with high and low TTS are interleaved in a CoT: while a sparse set of true-thinking steps directly influence the model's predictions, others tend to act as decorative reasoning with negligible causal impact and are not truly followed by models when computing their answer (Section 5). Additionally, we find LLMs' self-verification steps can be decorative without truly checking their solution (Section 5.1). Beyond empirical evaluation, we identify a mechanistic basis for this phenomenon: whether an LLM internally performs a step verbalized in CoT can be mediated by a TrueThinking direction in latent space (Section 6). Moving the hidden states of a step along this direction increases LLMs' internal reliance on that step, whereas reversing it suppresses thinking over it. Finally, we showcase that by steering along the TrueThinking direction, we causally induce the model to reason over decorative self-verification steps (Section 6.3). Overall, we find that LLMs often verbalize steps that have minimal causal impact on their predictions. Moreover, steering enables control over whether steps in CoT are internally followed or disregarded by LLMs. Our results fundamentally question the efficiency of LLMs' reasoning and undermine the practice of using verbalized rationales as a safety-monitoring signal (Baker et al., 2025). Our work shifts the focus from what models verbalize to what they think underneath, motivating future study that explains the mechanisms of CoT reasoning and develops training objectives that reward reliance on true-thinking steps.

2. Related Work

Internal thinking in LLMs' latent space. Apart from explicit CoT, LLMs "think" internally across their layers.
[Figure 2 worked example: In triangle ABC with side lengths AB = 13, AC = 14, BC = 15, let O be the circumcenter and I the incenter; find the distance OI. One route computes the circumradius R and inradius r and applies Euler's relation OI² = R(R − 2r); with R = 65/8 and r = 4, the answer is √65/8. An alternative route checks the result by computing the coordinates of O and I. The True Thinking Score is computed as 0.5(|ATE(1)| + |ATE(0)|) from a necessity test with intact context C and a sufficiency test with perturbed context C′.]

Figure 2. (a) Illustration of different modes in thinking steps within chain-of-thought (CoT) reasoning. Contrary to the naive view that a step's faithfulness depends solely on whether perturbing it directly changes the final result, we show that the relationship is more nuanced. A true thinking step may operate in either an AND or OR mode when interacting with other steps. In both cases, such steps contribute meaningfully to the final answer. (b) Based on this understanding, we define the True Thinking Score, which jointly considers two complementary evaluations: the necessity test (high for AND-like steps) and the sufficiency test (high for OR-like steps).

They can directly answer reasoning problems, sometimes even matching the performance of CoT-based prompting (Ma et al., 2025).
Prior work suggests that LLMs may solve certain tasks through internal circuits (Yang et al., 2024; Marks et al., 2024; Prakash et al., 2025). The implicit reasoning capabilities of LLMs have been shown to bypass using explicit CoTs (Deng et al., 2023; Hao et al., 2024; Pfau et al., 2024; Goyal et al., 2023). Such internal reasoning of LLMs questions how much the model truly uses each step in their CoT, which we study using a causal framework.

Evaluating the faithfulness of reasoning traces. Many works evaluate the faithfulness of reasoning traces, but most treat the CoT as a whole, offering suggestive evidence without step-level analysis. Existing methods fall into three categories: (1) Hint-based evaluation (Chen et al., 2025b; Arcuschin et al., 2025; Chua & Evans, 2025; Turpin et al., 2023), where hints are injected into unsolved questions and a faithful CoT should acknowledge them. These setups are not generalizable and cannot assess individual steps. (2) Perturbation-based evaluation (Gao, 2023; Lanham et al., 2023), where errors/modifications are injected into a step, and the continuation is resampled. If the final answer is unchanged, the CoT is deemed unfaithful. Yet this is unreliable, as the model may detect and correct the error later in resampled steps (Yee et al., 2024). This evaluation framework is further refined to examine and classify the resampled continuations (Xiong et al., 2025). (3) Early-exit evaluation (Lanham et al., 2023; Tanneru et al., 2024; Wang et al., 2025), where cues are inserted to prompt the model to answer after an intermediate step. A correct early answer suggests subsequent steps may be decorative, but this overlooks cases in which later steps are genuinely used, for example for self-verification or exploring alternative solutions.
Overall, existing work has primarily provided suggestive, case-based evidence that CoT is not always faithful to the model's internal thinking, without looking into step-wise analysis. There is a lack of a framework formally evaluating how much LLMs truly think through a step verbalized in CoT in their latent space.

Steering LLMs. Steering directions applied to intermediate hidden states have been found to mediate LLMs' behaviors or perception in many aspects (Von Rütte et al., 2024; Turner et al., 2023; Tigges et al., 2023; Li et al., 2023; Marks & Tegmark, 2023; Zhao et al., 2025). In terms of reasoning, past works have found steering vectors that can be used to control the strength of reasoning, e.g., longer or shorter CoT (Tang et al., 2025; Sun et al., 2025a; Chen et al., 2025a; Sheng et al., 2025) or different reasoning styles in CoT (Venhoff et al., 2025). We are the first to reveal that LLMs' internal thinking process can also be mediated by steering vectors.

3. Measuring Step-wise Causality for Faithfulness in Reasoning

The faithfulness of CoT is defined with respect to a target, typically the model's predicted answer. A lack of faithfulness arises when the model claims to rely on steps A, B, and C in its CoT, but internally disregards them (instead relying on, e.g., other shortcuts or biases (Turpin et al., 2023; Arcuschin et al., 2025) to compute answers). In this case, those steps make no causal contribution to the prediction. Formally, we quantify the causal contribution of each reasoning step s in the CoT to the final answer y*, which serves as the basis for determining its faithfulness. A step with genuine causal impact is a faithful true-thinking step, where the model indeed internally thinks through s in order to produce y*.
By contrast, a step with minimal causal impact is an unfaithful decorative-thinking step, where the model verbalizes a line of reasoning without following it internally.

Notation and Setup. We adopt notation following Rubin (1974); Pearl (2009). Let the input question be q, and let the model's full chain of thought (CoT) for q be C* = (s_1, s_2, ..., s_n), where each s_i denotes a reasoning step. At the current step s under evaluation (we omit the index i and directly use s for simplicity), we define the context as its preceding steps, i.e., C = (s_1, s_2, ..., s_{i−1}). To probe the model's current prediction after any partial reasoning trace, we use early-exit answering by appending a standardized cue: "The final result is". This approach, following Lanham et al. (2023); Fu et al. (2025); Yang et al. (2025); Tanneru et al. (2024); Bogdan et al. (2025), reliably elicits the model's intermediate answer given the question q and reasoning prefix (C, s). Let f(q, C, s) denote the model's early-exit prediction after processing q with context C and step s. The reference prediction under the full reasoning trace is then defined as y* := f(q, C*), representing the model's final answer when all steps in the full CoT are intact.

Indicator Variables. We introduce the following binary random variables to formalize interventions on reasoning steps: (1) Context indicator C ∈ {0, 1}: C=1 indicates an intact context (the original prefix C); C=0 indicates a perturbed context in which all preceding steps are replaced by perturbed versions. We write c ∈ {0, 1} for a specific realization of C. (2) Step toggle X ∈ {0, 1}: X=1 keeps the original step s; X=0 replaces it with a perturbed version s′. (3) Outcome indicator Y ∈ {0, 1}: Given f(q, C, s), we define Y := 1[f(q, C, s) = y*], which measures whether the model's early-exit prediction under the given intervention matches the full-CoT reference outcome.

Perturbation Procedure.
To isolate the causal effect of each reasoning step, we create perturbed versions of steps and contexts by introducing small random numerical offsets to quantities appearing in the reasoning text (Lanham et al., 2023; Gao, 2023). These perturbations are minimal and preserve grammatical and semantic structure, ensuring that the modified step remains coherent but subtly altered. Additional implementation details are provided in Appendix A.

3.1. Context-based Average Treatment Effect

The Average Treatment Effect (ATE) (Rubin, 1974; Pearl, 2009) quantifies the causal effect of a binary intervention X ∈ {0, 1} on an outcome Y via Pearl's do(·) operator:

ATE = S_1 − S_0, where S_x := Pr[Y=1 | do(X=x)], x ∈ {0, 1}. (1)

To evaluate the causal contribution of a reasoning step s, we condition on a context setting C ∈ {0, 1} and define a context-based ATE:

ATE(c) = S_1(c) − S_0(c), (2)
S_x(c) := Pr[Y=1 | C=c, do(X=x)], x ∈ {0, 1}, (3)

where c ∈ {0, 1} specifies the context regime and X toggles using the intact step s versus its perturbed counterpart s′. We consider two regimes: a) Intact context (C=1): the original prefix C preceding s is kept as generated; b) Perturbed context (C=0): all steps in C are minimally perturbed (e.g., by small numeric offsets), weakening associations between s and other steps so that the effect of s can be isolated.

Scoring the outcome. Let y* := f(q, C*) be the model's reference answer obtained via early-exit on the full CoT, and let f(q, C, s) denote the early-exit prediction after a given (C, s). Instead of the binary Y = 1[f(·) = y*], we use the model's confidence for the event y*: Pr(Y=1 | ·) ≡ Pr(y* | q, C, s), giving S_x(c) a probabilistic (confidence-based) interpretation.

Interpreting ATE_nec(1) vs. ATE_suf(0). Conditioning on C allows us to distinguish two complementary notions of causal relevance:

• Necessity under intact context (ATE_nec(1)). ATE_nec(1) = Pr(y* | C=1, do(X=1)) − Pr(y* | C=1, do(X=0)). This tests whether s is needed given the full, supportive context C.
A low ATE_nec(1) indicates that removing s does not harm performance when other steps remain intact—what prior measures often label as "unnecessary." However, this does not imply s is unfaithful; it may be redundant because other steps already suffice (an "OR" relation).

• Sufficiency under perturbed context (ATE_suf(0)). ATE_suf(0) = Pr(y* | C=0, do(X=1)) − Pr(y* | C=0, do(X=0)). This asks whether s can on its own—i.e., with weakened support from C—drive the model toward y*. A high ATE_suf(0) suggests s is sufficiently informative to elicit the correct answer, capturing causal relevance even when s is not strictly necessary under the intact context.

Together, ATE_nec(1) (necessity) and ATE_suf(0) (sufficiency) provide a balanced view of faithfulness: a step can be causally meaningful by being necessary, sufficient, or both. The context perturbation operationalizes the "OR" case by dampening alternative pathways in C, yielding a more reliable test of s's standalone impact.

True-Thinking Score (TTS). We define the faithfulness score of a step s with respect to the final result y* as

TTS(s) = (1/2) (|S_1(1) − S_0(1)| + |S_1(0) − S_0(0)|). (4)

A smaller TTS(s) indicates that the step has little causal influence on the model's prediction: perturbing or keeping it leads to almost the same result. Thus, that step is more likely to be decorative rather than true thinking. For each context setting c, we measure the unsigned ATE(c), |ATE(c)| = |S_1(c) − S_0(c)|. The sign of ATE(c) reflects whether the step is helpful or harmful (e.g., the step is actually wrong) overall, but we are interested in how much the model truly thinks through the step in its internal computation, regardless of direction.
Taking the absolute value thus captures the magnitude of a step's causal effect and provides a broader measure of its importance. TTS is a continuous metric, where lower values indicate that a step is more decorative.

4. Experimental Setup

Models. We conduct experiments on three different families of open-source reasoning models that have strong reasoning abilities and can generate long CoTs. For Qwen-2.5-7B and Llama-3.1-8B, we use the versions fine-tuned on samples generated by Deepseek-R1 (Guo et al., 2025), i.e., Deepseek-R1-Distill-Qwen-7B and Deepseek-R1-Distill-Llama-8B. We also experiment with Nemotron-1.5B. These models can generate very long CoTs at test time. We use the maximum sequence length per model to avoid cut-off of reasoning traces during generation. We use greedy decoding for reproducibility and use the default prompting template for reasoning.

Data. We evaluate on different reasoning benchmarks: AMC (American Mathematics Competitions), AIME (American Invitational Mathematics Examination) from 2020-2024, the MATH dataset (Hendrycks et al., 2021), and CommonsenseQA (Talmor et al., 2019). For experiments in Section 6, we randomly sample 30% of cases as a held-out test set, 10% of cases as a validation set, and 60% of cases as a training set.

5. Evaluating Step-wise Causality in CoT

In this section, we present evaluation results for the TTS score, which measures the extent to which the model truly reasons through each step internally. Recent reasoning models often produce long CoTs with many intermediate steps, incurring significant computational cost. We show that not every step is truly used by the model in its internal reasoning process.

The distribution of TTS is long-tailed. As shown in Figure 3a, we find most steps have low scores, while only a few have very high scores. For example, on the AIME dataset for Qwen-2.5 (Figure 3a), the mean is around 0.03.
Only 6.4% of CoT steps achieve a TTS greater than 0.3, and merely 2.3% exceed 0.7. This suggests that only a handful of verbalized steps in CoT are critical and faithfully followed by the model, whereas many others may not reliably reflect the model's true inner thinking. Additionally, our experimental results highlight the importance of evaluating both when the context is intact and when it is perturbed. In Figure 3b, we observe disparities between ATE_nec(1) and ATE_suf(0) for the same step, for example, cases where ATE_nec(1) is low while ATE_suf(0) is high. This indicates that relying on a score computed solely under an intact or perturbed context can miss potential true-thinking steps. We confirm the same distribution pattern across different datasets (see Appendix C for mathematical reasoning and Appendix D for results on commonsense reasoning). Overall, our results show that the majority of reasoning steps in CoT may be decorative and performative, with minimal causal contribution to LLMs' prediction.

Reasoning steps with high and low TTS are interleaved in a CoT. Figure 3c illustrates that steps with high TTS scores can appear at different positions, though later steps are on average more likely to be true-thinking with higher TTS. These results indicate that labeling an entire CoT as either unfaithful post-rationalization or faithful computation (Emmons et al., 2025) is overly coarse. They also raise concerns about the reliability of monitoring LLMs by inspecting CoT (Baker et al., 2025), since individual steps may not always reflect the model's true internal reasoning or be performed internally at all. Finally, our results suggest that task difficulty does not necessarily lead to more faithful reasoning: even on the AIME dataset that challenges recent models (Sun et al., 2025b), LLMs still produce many decorative-thinking steps in CoT.
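The threshold statistics above (e.g., the share of steps with TTS above 0.3 or 0.7) reduce to a simple tail computation over per-step scores. A sketch with made-up scores illustrating the long-tailed shape; the values are not from the paper:

```python
def fraction_above(scores: list[float], threshold: float) -> float:
    """Fraction of CoT steps whose TTS exceeds a threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical per-step TTS values for one CoT: mostly near zero,
# with a sparse set of high-scoring true-thinking steps.
tts_scores = [0.01, 0.02, 0.00, 0.85, 0.03, 0.01, 0.40, 0.02, 0.01, 0.02]
print(fraction_above(tts_scores, 0.3))  # 0.2
print(fraction_above(tts_scores, 0.7))  # 0.1
```

At dataset scale the same computation over all steps yields the reported 6.4% (TTS > 0.3) and 2.3% (TTS > 0.7) figures.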
The distribution of low-TTS steps on AIME mirrors that of simpler math datasets (Appendix C), challenging the common hypothesis that LLMs produce more faithful reasoning on harder problems (Emmons et al., 2025; Wang et al., 2025).

Figure 3. (a) The dataset-level distribution of the TTS score; (b) the distribution of ATE_nec(1) and ATE_suf(0), where "low" means ATE(·) is below the mean and "high" means ATE(·) is above the mean; (c) an example CoT case for TTS and the average TTS at different step percentiles (normalized).

5.1. Self-verification steps can be decorative

We leverage our defined TTS to evaluate whether LLMs are truly thinking at self-verification steps (often known as "aha moments"). Self-verification steps are often seen in recent LLMs' CoT, e.g., "Wait, let me re-evaluate...", and can help them achieve stronger reasoning performance (Guo et al., 2025; Muennighoff et al., 2025; Snell et al., 2024; Jaech et al., 2024). However, our findings suggest that in some cases the model may appear to self-verify in CoT while internally it does not actually perform those steps, and they make little contribution to its computation. To identify decorative-thinking self-verification steps S_DT-self-verify, we scan the self-verification steps and compute TTS. We define S_DT-self-verify = (s_1, s_2, ..., s_n), where TTS(s_i) is smaller than a threshold β. Notably, we observe cases where self-verification steps have near-zero TTS (see an instance in Figure 11 in the Appendix).
For example, around 12% of the self-verification steps for Qwen-2.5 have TTS lower than 0.005, and 21% for Nemotron. We also find that perturbing the context steps before S_DT-self-verify can always flip the model's initially correct answers to wrong ones, even though S_DT-self-verify may contain ample information to lead the model to the correct answer. Overall, those self-verification steps contribute minimally to the model's computation of its answer. We further demonstrate that we can force the model to truly self-check its solution using steering vectors in Section 6.3, and provide more examples with different reasoning behaviors in Appendix G.

6. Mediating LLMs' Internal Thought via Steering

In this section, we empirically show that, in LLMs, latent-space steering can mediate whether a verbalized reasoning step is internally followed or disregarded, via the identified TrueThinking direction.

Extracting the TrueThinking direction. We first detail how we extract the steering direction using TTS. Formally, for each layer l ∈ [1, L] in a Transformer-based (Vaswani et al., 2017) model, the hidden state for a token x_t in an input sequence x is updated with self-attention modules that associate x_t with tokens x_{1:t} and a multi-layer perceptron:

h^l_t(x) = h^{l-1}_t(x) + Attn^l(x_t) + MLP^l(x_t).

We focus on the residual-stream activation h^l(s_t) at the last token position t of a step s at a layer l. At a layer l, we collect the hidden states of the most representative steps s_T (where TTS(s_T) ≥ a threshold α) and steps s_DT (where TTS(s_DT) ≤ a threshold β). Following the difference-in-means approach (Marks & Tegmark, 2023; Arditi et al., 2024; Zhao et al., 2025), we compute the direction as the mean shift from μ^l_DT = mean(h^l(s_DT)) to μ^l_T = mean(h^l(s_T)) in the latent space:

v^l_TrueThinking = μ^l_T − μ^l_DT.    (5)

This yields a steering vector that captures the model's tendency to either sustain or truncate its reasoning process at that step.
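A minimal numpy sketch of the difference-in-means computation in Eq. (5), together with the activation addition the direction is later used for. The array shapes (steps or tokens by model dimension) and function names are our own illustration:

```python
import numpy as np

def truethinking_direction(h_true, h_decorative):
    """Eq. (5): difference in means between last-token hidden states of
    high-TTS steps (TTS >= alpha) and low-TTS steps (TTS <= beta) at one layer.

    h_true: (n_T, d_model); h_decorative: (n_DT, d_model).
    """
    mu_T = np.mean(h_true, axis=0)         # mean over true-thinking steps
    mu_DT = np.mean(h_decorative, axis=0)  # mean over decorative steps
    return mu_T - mu_DT

def add_to_step(step_hidden, v):
    """Activation addition: add v to the residual stream at every token
    position of the step (step_hidden has shape (n_tokens, d_model))."""
    return step_hidden + v  # v broadcasts across token positions
```

In practice the hidden states would be cached from a forward pass at the chosen layer; here plain arrays stand in for them.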
For steering at test time, we modify the residual stream of a test step's hidden state via activation addition at a single layer l, i.e., h̄^l = h^l + v^l_TrueThinking, applied to all tokens in the step. To compute the TrueThinking direction, for all tested models we set the threshold α = 0.9 for s_T and β = 0 for s_DT, which means that perturbing the steps s_DT does not change the model's confidence at all. We emphasize that thresholding is used only to select the most representative steps for extracting the steering direction, while TTS itself is a continuous measure. The ablation study for thresholding is shown in Appendix F.1.

6.1. Steering setup and evaluation

Comparison baselines for steering. As baselines, we consider three approaches for layer-wise intervention apart from the TrueThinking direction. (1) Steering directions extracted using DropStep: we adapt the evaluation metric of prior work (Tanneru et al., 2024; Lanham et al., 2023; Bogdan et al., 2025; Wang et al., 2025) on the faithfulness of reasoning steps, which compares prediction confidence before and after appending steps, i.e., P(y_GT | q, C, o) − P(y_GT | q, C), to identify faithful and unfaithful steps. Those steps are then used to extract a steering direction following the same method as the TrueThinking direction. (2) Random steering vector: we generate a random vector with the same dimensionality and norm as the TrueThinking direction to test whether our identified direction encodes meaningful information. (3) Attention scaling: we scale the attention weights of the tokens of step s at a
layer during inference time, with scale = 100 encouraging the model to think through the step and scale = 0 suppressing it.

Table 1. Top-1 flip rate among all layers (%, higher is better) in the Engagement Test (ET) and the Disengagement Test (DT). We use flip rate as the metric, measuring how often steering changes the model's initial prediction. AMC is the in-domain evaluation, where the TrueThinking directions are extracted; the other two datasets are out-of-domain evaluations.

Dataset / Method   Attention scaling   Random vector   DropStep direction   TrueThinking direction
                     ET      DT          ET      DT       ET      DT            ET      DT
AMC
  Qwen-2.5           6.2    25.0         4.0    26.9      1.5    28.6          55.0    55.7
  Llama-3.1         24.8    20.5         3.5    20.6     10.5    32.4          17.6    35.3
  Nemotron           5.1    27.2         4.5    45.5      9.0    45.4          35.7    54.5
MATH
  Qwen-2.5          10.0    23.9         2.0    30.2      2.5    17.7          49.8    69.2
  Llama-3.1          7.5    35.4         5.0    47.9     11.0    52.1          14.0    54.2
  Nemotron          21.7    42.7        21.5    44.6      6.5    45.1          59.5    56.3
AIME
  Qwen-2.5           9.3    25.0         1.5    21.4      1.5    14.3          55.5    53.6
  Llama-3.1          6.3    35.2         2.5    29.4      5.0    41.1          38.0    47.1
  Nemotron          12.0    70.6         6.5    76.5      4.5    79.5          39.0    91.2

Figure 4. Layer-wise results of steering with the TrueThinking vector. In the Engagement Test, stronger intervention is reflected by lower accuracy (more right→wrong flips); in the Disengagement Test, by higher accuracy (more wrong→right flips). Figures (a–b): layer-wise results on AMC for DeepSeek-R1-Distill-Qwen-7B and its 1.5B variant under the Engagement Test and the Disengagement Test. Figures (c–d): cross-domain results, where the TrueThinking direction is extracted on AMC and applied to MATH and AIME.

Causal test cases for steering. We construct two steering test cases to investigate the mechanism of LLMs' thinking in CoT.
(1) Engagement Test: can steering make the model think through a step in CoT that it ignores? We consider cases where f(q, C) = y_GT and f(q, C, s′) = y_GT; namely, the model still obtains the ground-truth answer y_GT despite the perturbed step s′. If we apply a steering direction v^l to the hidden state of s′ and the model's correct answer flips to an incorrect one (f_{+v^l}(q, C, s′) ≠ y_GT), the steering has forced the model to reason over s′ in its internal thinking, thus following the errors injected into s′. (2) Disengagement Test: can steering along the reverse direction make the model disregard a step internally? Here we consider the reverse cases, where the model first predicts the correct answer, i.e., f(q, C) = y_GT, but including a perturbed step s′ causes it to fail: f(q, C, s′) ≠ y_GT. If applying a steering direction −v^l to s′ flips the wrong answer to the correct one (f_{−v^l}(q, C, s′) = y_GT), the intervention has made the model disregard the step s′ in its internal thinking. All test examples are drawn from a held-out set according to the two cases described above to ensure a fair comparison among the different steering methods.

6.2. LLMs encode a steerable latent signal of "thinking"

We find that a simple linear TrueThinking direction can mediate whether LLMs truly reason over a verbalized step. As shown in Table 1, steering with the (reverse) TrueThinking direction reliably flips predictions in both tests. In the Disengagement Test, it effectively prevents the model from using the perturbed step s′, with effects far stronger than those of random vectors. This shows that the suppression of step use with the reverse TrueThinking direction arises from a meaningful signal rather than noise, confirming that the TrueThinking direction captures a genuine internal representation of thinking.
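The flip-rate metric of Table 1 over these two test pools can be sketched as follows. The record layout (ground truth `gt`, pre-steering prediction `pre` with the perturbed step s′, post-steering prediction `post`) is a hypothetical simplification of the paper's setup:

```python
def flip_rates(records):
    """Flip rates for the Engagement Test (ET) and Disengagement Test (DT).

    ET pool: examples still answered correctly despite s'; a flip to a
    wrong answer after steering with +v counts as a successful flip.
    DT pool: examples broken by s'; a flip back to the ground truth after
    steering with -v counts as a successful flip.
    """
    et = [r for r in records if r["pre"] == r["gt"]]
    dt = [r for r in records if r["pre"] != r["gt"]]
    et_rate = sum(r["post"] != r["gt"] for r in et) / len(et) if et else 0.0
    dt_rate = sum(r["post"] == r["gt"] for r in dt) / len(dt) if dt else 0.0
    return et_rate, dt_rate
```

In the paper each pool is pre-filtered (e.g., ET additionally requires f(q, C) = y_GT), and the rates are computed per layer before taking the top-1; the sketch omits those details.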
Additionally, the TrueThinking direction derived from our TTS produces stronger and more consistent intervention effects than the steering directions extracted using the DropStep metric of past work (Tanneru et al., 2024; Lanham et al., 2023; Bogdan et al., 2025; Wang et al., 2025), especially in the Engagement Test. The inferior steering performance implies that existing evaluation metrics may mischaracterize the faithfulness of reasoning steps in CoT. On the other hand, our experiments across datasets show that the latent signal controlling whether a step engages internal thinking is universal.

Figure 5. Normalized attention scores on the step in the Engagement Test and the Disengagement Test before and after steering. (a–b) Applying the TrueThinking direction to a step increases the model's attention to it. (c–d) Applying the reverse TrueThinking direction decreases the model's attention.
As seen in Table 1, the TrueThinking direction extracted on AMC generalizes well to the other datasets across all models, indicating a model-internal mechanism of thinking rather than a dataset-specific artifact. For instance, in the Qwen model, layers 15–22 consistently yield the strongest intervention performance across all three datasets (Figures 4c–4d), suggesting that these intermediate layers concentrate latent reasoning. We also compare models of different sizes in the same model family. Similar patterns hold for the much smaller Qwen-2.5-1.5B model (Figures 4a and 4b) and the larger 14B model (Figure 9 in Appendix E). Steering along the TrueThinking direction (to induce step use in LLMs' internal reasoning) is weaker for smaller models, whereas the results in the Disengagement Test are comparable.

Steering with the TrueThinking direction mediates LLMs' attention. We find that steering with the TrueThinking direction can reallocate attention among tokens. In the Disengagement Test, steering in the reverse TrueThinking direction (for example, when applied at layer 17) reduces attention to the perturbed tokens (see examples in Figures 5c and 5d), causing the model to largely disregard them. Similarly, steering along the TrueThinking direction increases attention to the step (see examples in Figures 5a and 5b). Overall, steering along the TrueThinking direction increases the attention on the steered tokens by 193% on average, while steering along the reverse direction decreases it by 32.3% across the layers. In comparison, directly scaling attention on step tokens at a layer does not always yield noticeable effects. As shown in Table 1, in the Disengagement Test, masking attention (i.e., setting coefficients to 0) at a layer can partially flip answers, but in the Engagement Test its impact is weak, suggesting that attention alone cannot mediate the LLMs' thinking process.
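The attention shifts reported above can be quantified as a relative change in mean attention mass on the step's tokens; a minimal sketch, where the inputs are normalized attention scores before and after steering (the function name and input layout are our own assumption):

```python
import numpy as np

def relative_attention_change(attn_before, attn_after):
    """Percent change in the mean attention score on a step's tokens."""
    before = float(np.mean(attn_before))
    after = float(np.mean(attn_after))
    return 100.0 * (after - before) / before
```

A positive value corresponds to attention being pulled toward the step (as with the TrueThinking direction), a negative value to attention being pushed away (as with the reverse direction).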
We hypothesize that LLMs employ a directional reasoning circuit (Marks et al., 2024; Prakash et al., 2025), in which the model first decides whether to engage in reasoning for a step and only then modulates attention, which may be irreversible through direct attention scaling. We leave understanding the relation between attention and the reasoning mechanism for future work.

Figure 6. Layer-wise results after steering the model to truly self-verify its reasoning steps. Initially, the test accuracy is zero without steering, as the model ignores its self-verification steps in CoT and outputs wrong answers.

6.3. Steering LLMs for true self-verification

As shown in Section 5.1, self-verification in CoT can contain decorative-thinking steps. In this section, we focus on a realistic setting in which the model fails to follow correct steps in self-verification S_DT-self-verify and is instead misled by incorrect previous steps C′, resulting in a wrong prediction, i.e., f(q, C′, S_DT-self-verify) ≠ y_GT. We investigate whether steering along the TrueThinking direction can force the model to truly think through the steps in S_DT-self-verify and thus restore the correct answer. We apply the TrueThinking direction to the tokens in S_DT-self-verify, encouraging the model to genuinely think through these steps, and measure how often this steering restores the correct answer. We find that steering along the TrueThinking direction can recover up to around 51% of the cases where the models do not follow their correct self-verification steps in CoT and give wrong answers (layer-wise results shown in Figure 6).
Remarkably, the layer with the strongest steering effect aligns with the layer identified in Section 6.2, suggesting that certain layers may play a special role in determining whether reasoning steps are engaged in the model's internal computation. Beyond interpretability, this finding points toward applications in efficient reasoning: the discovered TrueThinking direction could be leveraged to maximize token utilization, ensuring the model truly reasons over each generated token rather than producing superficially coherent but unused reasoning steps.

7. Conclusions

We propose a step-wise causality framework to evaluate CoT faithfulness, revealing that true-thinking and decorative-thinking steps are interleaved: only a small subset are true-thinking steps that causally influence predictions, whereas most are decorative-thinking steps that merely create the appearance of reasoning and have minimal causal impact on predictions. Mechanistically, we demonstrate that whether a reasoning step in CoT contributes to a model's computation can be controlled by a TrueThinking direction, enabling causal steering for the model to either follow or disregard that step in its internal thinking process. Overall, our findings show that many steps in CoT do not faithfully reflect an LLM's internal thinking: models may verbalize reasoning they do not perform. This raises concerns about both the efficiency of LLMs' reasoning and the reliability of relying on CoT to monitor LLMs for safety.

8. Limitations

Our causal evaluation framework is inherently approximate. It is greedy in nature and may not capture all possible causal pathways, nor does it aim to reconstruct a complete causal graph of reasoning steps. It should thus be viewed as a probe that highlights representative true-thinking and decorative-thinking steps rather than a definitive oracle of internal reasoning. In addition, the TrueThinking direction we extract may not be optimal.
We regard our findings as an existence proof that internal thinking can be mediated by steering directions, and we leave the development of more effective directions and a deeper understanding of their geometry to future work. We cannot experiment on larger frontier models due to limited computational resources, and our findings may therefore not fully generalize to those untested settings. Nonetheless, by demonstrating effectiveness across several accessible models, we establish a general evaluation framework for analyzing and interpreting the thinking process in CoT. Our TTS computation can be costly, as it requires multiple runs. However, in this work we do not aim to propose an efficient real-time detector; rather, we first need a theoretically sound way to reveal whether steps in CoT are faithful. Future work can leverage the TrueThinking direction to construct a latent monitor by comparing it with the hidden states.

9. Impact Statement

Our work reveals that LLMs can verbalize CoT steps that do not reflect the reasoning processes actually used internally. We further show that such discrepancies can be influenced by latent steering vectors, suggesting that externally expressed reasoning can be decoupled from internal computation. More broadly, this raises the possibility that LLMs could conceal their true internal decision processes while presenting plausible reasoning steps to users. Such behavior introduces a potential threat model in which models may appear transparent or aligned while internally operating differently. Our findings highlight the need for further research into the reliability of CoT-based interpretability and the development of safeguards against misleading or strategically generated explanations.

References

Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., and Conmy, A.
Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024.

Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., Zaremba, W., Pachocki, J., and Farhi, D. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.

Bogdan, P. C., Macar, U., Nanda, N., and Conmy, A. Thought anchors: Which LLM reasoning steps matter? arXiv preprint arXiv:2506.19143, 2025.

Chen, R., Zhang, Z., Hong, J., Kundu, S., and Wang, Z. SEAL: Steerable reasoning calibration of large language models for free. arXiv preprint arXiv:2504.07986, 2025a.

Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025b.

Chua, J. and Evans, O. Are DeepSeek R1 and other reasoning models more faithful? arXiv preprint arXiv:2501.08156, 2025.

Deng, Y., Prasad, K., Fernandez, R., Smolensky, P., Chaudhary, V., and Shieber, S. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023.

Emmons, S., Jenner, E., Elson, D. K., Saurous, R. A., Rajamanoharan, S., Chen, H., Shafkat, I., and Shah, R. When chain of thought is necessary, language models struggle to evade monitors. arXiv preprint arXiv:2507.05246, 2025.

Fu, Y., Wang, X., Tian, Y., and Zhao, J. Deep think with confidence. arXiv preprint arXiv:2508.15260, 2025.

Gao, L. Shapley value attribution in chain of thought. URL https://www.lesswrong.com/posts/FX5JmftqL2j6K8dn4/shapley-value-attribution-in-chain-of-thought, 2023.

Golovneva, O., Chen, M., Poff, S., Corredor, M., Zettlemoyer, L., Fazel-Zarandi, M., and Celikyilmaz, A. ROSCOE: A suite of metrics for scoring step-by-step reasoning. arXiv preprint arXiv:2212.07919, 2022.

Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., and Nagarajan, V. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023.

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J. E., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

Kang, Y., Sun, X., Chen, L., and Zou, W. C3oT: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 24312–24320, 2025.

Korbak, T., Balesni, M., Barnes, E., Bengio, Y., Benton, J., Bloom, J., Chen, M., Cooney, A., Dafoe, A., Dragan, A., et al. Chain of thought monitorability: A new and fragile opportunity for AI safety. arXiv preprint arXiv:2507.11473, 2025.

Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530, 2023.

Ma, W., He, J., Snell, C., Griggs, T., Min, S., and Zaharia, M. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025.

Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024.

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.

Pearl, J. Causality. Cambridge University Press, 2009.

Pfau, J., Merrill, W., and Bowman, S. R. Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.

Prakash, N., Shapira, N., Sharma, A. S., Riedl, C., Belinkov, Y., Shaham, T. R., Bau, D., and Geiger, A. Language models use lookbacks to track beliefs. arXiv preprint arXiv:2505.14685, 2025.

Rubin, D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

Sheng, L., Zhang, A., Wu, Z., Zhao, W., Shen, C., Zhang, Y., Wang, X., and Chua, T. On reasoning strength planning in large reasoning models. arXiv preprint arXiv:2506.08390, 2025.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
arXiv preprint arXiv:2408.03314, 2024.

Song, J., Jo, D., Kim, Y., and Kim, J.-J. Reasoning path compression: Compressing generation trajectories for efficient LLM reasoning. arXiv preprint arXiv:2505.13866, 2025.

Sun, C., Yan, G., and Weng, T. ThinkEdit: Interpretable weight editing to mitigate overly short thinking in reasoning models. arXiv preprint arXiv:2503.22048, 2025a.

Sun, H., Min, Y., Chen, Z., Zhao, W. X., Fang, L., Liu, Z., Wang, Z., and Wen, J.-R. Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. arXiv preprint arXiv:2503.21380, 2025b.

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019.

Tang, X., Wang, X., Lv, Z., Min, Y., Zhao, W. X., Hu, B., Liu, Z., and Zhang, Z. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. arXiv preprint arXiv:2503.11314, 2025.

Tanneru, S. H., Ley, D., Agarwal, C., and Lakkaraju, H. On the hardness of faithful chain-of-thought reasoning in large language models. arXiv preprint arXiv:2406.10625, 2024.

Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.

Turpin, M., Michael, J., Perez, E., and Bowman, S.
Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.

Tutek, M., Chaleshtori, F. H., Marasović, A., and Belinkov, Y. Measuring chain of thought faithfulness by unlearning reasoning steps. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 9946–9971, 2025.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Venhoff, C., Arcuschin, I., Torr, P., Conmy, A., and Nanda, N. Understanding reasoning in thinking language models via steering vectors. arXiv preprint arXiv:2506.18167, 2025.

Von Rütte, D., Anagnostidis, S., Bachmann, G., and Hofmann, T. A language model's guide through latent space. arXiv preprint arXiv:2402.14433, 2024.

Wang, Z., Zeng, X., Liu, W., Wang, Y., Li, L., Wang, Y., Shang, L., Jiang, X., Liu, Q., and Wong, K.-F. Chain-of-probe: Examining the necessity and accuracy of CoT step-by-step. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025.

Wollschläger, T., Elstner, J., Geisler, S., Cohen-Addad, V., Günnemann, S., and Gasteiger, J. The geometry of refusal in large language models: Concept cones and representational independence. In Forty-second International Conference on Machine Learning, 2025.

Xia, H., Leong, C. T., Wang, W., Li, Y., and Li, W. TokenSkip: Controllable chain-of-thought compression in LLMs. arXiv preprint arXiv:2502.12067, 2025.

Xiong, Z., Chen, S., Qi, Z., and Lakkaraju, H. Measuring the faithfulness of thinking drafts in large reasoning models. arXiv preprint arXiv:2505.13774, 2025.

Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Lin, Z., Cao, L., and Wang, W.
Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025.

Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837, 2024.

Yee, E., Li, A., Tang, C., Jung, Y. H., Paturi, R., and Bergen, L. Dissociation of faithful and unfaithful reasoning in LLMs. arXiv preprint arXiv:2405.15092, 2024.

Yu, X., Wang, Z., Yang, L., Li, H., Liu, A., Xue, X., Wang, J., and Yang, M. Causal sufficiency and necessity improves chain-of-thought reasoning. arXiv preprint arXiv:2506.09853, 2025.

Zhao, J., Huang, J., Wu, Z., Bau, D., and Shi, W. LLMs encode harmfulness and refusal separately. arXiv preprint arXiv:2507.11878, 2025.

A. Implementations

Perturbing reasoning steps. We treat sentences as distinct reasoning steps, as prior work has shown that each sentence can serve a different function within a reasoning trace (Bogdan et al., 2025). We follow prior work (Lanham et al., 2023; Gao, 2023) and add small random offsets (chosen from [−3, −2, −1, 1, 2, 3]) to the numbers in a reasoning step. This keeps the perturbation minimal, so that the step remains largely unchanged in token length, wording, and underlying logic. We can therefore reasonably attribute any confidence changes caused by the perturbation to the model's treatment of the original step. For steps that do not contain numerical values, we instead follow prior work (Bogdan et al., 2025; Tanneru et al., 2024; Wang et al., 2025) and drop them as a perturbation to measure the influence of those sentences. For perturbing context steps, we only change numerical values.

B. More related work

Detailed analysis of existing evaluation frameworks of faithfulness for CoT. Many recent works have sought to evaluate the faithfulness of reasoning traces.
Most, however, focus on the CoT as a whole, providing suggestive evidence that the CoT is not faithful without analyzing each step. The existing evaluation methods can be summarized as follows.

•Hint-based evaluation: Most prior studies (Chen et al., 2025b; Arcuschin et al., 2025; Chua & Evans, 2025; Turpin et al., 2023) consider simple multiple-choice questions rather than complex reasoning tasks. Hints are injected into questions that the model initially failed to solve. A faithful CoT should explicitly acknowledge the use of hints in deriving the correct answer (Chua & Evans, 2025; Chen et al., 2025b). Relatedly, Arcuschin et al. (2025) and Turpin et al. (2023) insert biasing features into questions and observe whether the model's answer changes. If so, the CoT is deemed unfaithful, as the prediction is driven by bias in the prompt. Yet these framework setups do not generalize to practical reasoning problems and cannot reveal the faithfulness of individual steps.

•Perturbation-based evaluation: Errors are injected into a correct reasoning step, and the subsequent reasoning traces are resampled (Gao, 2023; Lanham et al., 2023). If the model's predicted answer ultimately remains unchanged, the CoT is considered unfaithful, since the inserted error was ignored. However, this criterion may be unreliable: the model may instead detect and correct the error in later resampled steps. Yee et al. (2024) and Xiong et al. (2025) try to address this by tracking resampled steps.

•Early-exit answering: Early-exit cues are inserted after a reasoning step to test whether the model can already produce a correct answer (Lanham et al., 2023; Tanneru et al., 2024). A correct early-exit answer suggests the CoT may be unfaithful (Lanham et al., 2023), since further steps are unnecessary for the model's answer. Tanneru et al. (2024) further compute the change in the model's answer confidence before and after each reasoning step when using early-exit answering.
Yet being unnecessary is not the same as being unfaithful. The fact that a model arrives at the correct answer early and maintains it does not necessarily imply that it ignores subsequent reasoning steps. This view overlooks important cases where the model continues to engage with those steps, for example, faithfully performing self-verification to consolidate or reinforce earlier predictions.

On the other hand, conceptually, CoTs have also been hypothesized as either CoT-as-computation or CoT-as-rationalization (Emmons et al., 2025). However, our analysis in Section 5 reveals a more nuanced picture: reasoning traces in CoT often interleave steps that genuinely drive computation with others that merely pretend to reason.

CoT pruning. Our work is also broadly related to CoT compression (Xia et al., 2025; Song et al., 2025; Kang et al., 2025; Yu et al., 2025) and the analysis of important steps (Golovneva et al., 2022). However, those works differ from "faithfulness" studies in that they analyze the semantics of verbalized steps and do not look into the underlying causality. For example, Xia et al. (2025) prune long CoTs into shorter sentences with equivalent meanings without degrading performance. But that does not mean the pruned steps make no causal contribution to the model's answer. Our results similarly suggest that shorter alternative CoTs exist that lead the model to the same answers, but such methods do not resolve whether a step truly engages the model's internal computation, and the internal mechanism of LLMs' thinking process remains unclear. Yu et al.
(2025) reframe CoT pruning in the Probability of Necessity and Sufficiency (PNS) framework (Pearl, 2009), but their method estimates the necessity of each step using counterfactual intervention in a similar way to prior works (Lanham et al., 2023; Yee et al., 2024; Bogdan et al., 2025): they perturb an intermediate step while keeping all preceding steps intact, then resample the subsequent steps using a rollout model. The resulting reasoning trace is evaluated by a validator model to assess whether the initial step is necessary. Although framed in terms of PNS, their measurement of sufficiency and necessity is restricted to contexts where the prior steps remain unchanged, which corresponds to ATE(1) in our framework. From a faithfulness perspective, this neglects disjunctive cases where a step is not strictly necessary for producing the correct answer but still reflects genuine reasoning. It also neglects that the model can notice and correct the previously perturbed step in the rollout steps (Yee et al., 2024).

Question: Where would you find magazines alongside many other printed works?
Options: A. doctor  B. bookstore ✓  C. market  D. train station  E. mortuary
Original rationale: ...bookstores are designed to sell both books and periodicals...
Perturbed rationale: ...train stations are designed to sell both books and periodicals...

Table 2. Example multiple-choice question with original vs. perturbed rationale.

C. More Experimental Results of TTS

As shown in Figure 7, we observe a similar long-tailed distribution of TTS across different datasets. Steps with larger TTS are sparse, while most steps have relatively low TTS. Additionally, as shown in Figure 7b and Figure 7d, on different datasets we still observe ample examples where ATE_nec(1) and ATE_suf(0) of the same step differ substantially.

D.
Results on Commonsense Reasoning

We apply TTS and the TrueThinking direction to commonsense reasoning to test the generalizability of our framework to more generic reasoning cases. We use CommonsenseQA (Talmor et al., 2019) as our test data. Since commonsense reasoning is predominantly non-numeric, standard intervention schemes that rely on perturbing numerical values are not suitable in this setting. To apply interventions to such content, we prompt a separate large language model (GPT-5) to produce subtle, semantically coherent distortions of the initial reasoning step. This makes the perturbation applicable to different reasoning scenarios. An example perturbed rationale is shown in Table 2.

As shown in Figure 8a and Figure 8b, the distribution of TTS on commonsense reasoning is long-tailed, similar to the patterns observed on mathematical reasoning. Furthermore, we conduct the steering tests (explained in Section 6) on examples from CommonsenseQA. Specifically, we use the same TrueThinking direction extracted from the AMC dataset for steering (the same setting as Table 1). The steering results are shown in Figure 8c. Notably, the TrueThinking direction can still mediate reasoning behaviors on CommonsenseQA when applied to the middle layers. This strongly indicates the generalizability of the TrueThinking direction and suggests that LLMs encode a specific variable for true-thinking behaviors.

E. Steering Results on Larger Models

Beyond the 1.5B and 7B models in Figure 4a and Figure 4b, we further demonstrate that the TrueThinking direction generalizes to a larger 14B model. As shown in Figure 9, we can similarly modulate the 14B model's reasoning behavior, controlling whether it engages or disengages with reasoning steps by steering the hidden states in the middle layers along or against the TrueThinking direction.

F. Ablation study

F.1.
Threshold of TTS for extracting steering directions

In this section, we ablate the threshold for selecting true-thinking steps when extracting the TrueThinking vector. Our goal is to better understand the scale of TTS, that is, how low a score may already indicate decorative thinking and how high a score reflects true thinking. We use steering performance as an indirect probe of how TTS correlates with the internal engagement of steps in reasoning.

When extracting steering directions with difference-in-means, steps with zero TTS are treated as decorative-thinking steps (s_DT), while steps from different ranges of TTS are used as true-thinking steps (s_T). As shown in Figure 10, when TrueThinking directions are computed from steps drawn from increasingly higher TTS ranges, the steering effect grows roughly linearly. In contrast, using steps with TTS below 0.03 yields negligible steering, suggesting these steps are internally treated as decorative, similarly to those with zero TTS. Because the TrueThinking directions are computed as the difference in mean hidden states between true and decorative steps (Eq. 5), negligible steering effects mean the direction fails to capture the meaningful distinction between decorative and true thinking.

Overall, this analysis reveals an implicit decision boundary in the model's internal space: while the distribution of TTS is very long-tailed and high-TTS steps are rare, larger TTS indeed corresponds to genuinely influential reasoning. We leave a more in-depth study of the geometry (Wollschläger et al., 2025) of hidden states and TrueThinking directions in the latent space as future work.

F.2. Averaging over ATE(c) for TTS

We compare using only ATE(1) in TTS with the complete TTS to identify true-thinking steps. This slightly differs from the DropStep method in Table 3, which is adopted by past works (Tanneru et al., 2024; Lanham et al., 2023; Wang et al., 2025; Bogdan et al., 2025).
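Under the setup described in F.1 and F.2, the pipeline can be sketched roughly as follows. The exact TTS aggregation and Eq. 5 are not reproduced in this appendix, so the mean-of-absolute-effects form of `tts` and the helper names below are our assumptions, not the paper's definitions.

```python
def tts(ate_nec_1, ate_suf_0):
    """Sketch of the complete TTS: average the absolute treatment effects
    under intact context (ATE_nec(1)) and perturbed context (ATE_suf(0))."""
    return 0.5 * (abs(ate_nec_1) + abs(ate_suf_0))

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length hidden-state vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def truethinking_direction(h_true, h_decorative):
    """Difference-in-means direction (Eq. 5 style): mean hidden state of
    true-thinking steps minus that of decorative-thinking steps."""
    return [a - b for a, b in zip(mean_vec(h_true), mean_vec(h_decorative))]

def steer(hidden, direction, alpha):
    """Shift a hidden state along (alpha > 0, engage) or against
    (alpha < 0, disengage) the TrueThinking direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

Selecting `h_true` from progressively higher TTS ranges corresponds to the threshold ablation in F.1; under that reading, directions built from steps with TTS below roughly 0.03 would be expected to steer negligibly.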
DropStep defines ATE(1) with step removal as the perturbation, i.e., ATE_drop(1) = P(y* | C, s) − P(y* | C). In contrast, here ATE(1) = P(y* | C, s) − P(y* | C, s′), using numerical perturbation that changes the numbers in steps. However, as shown in Table 3, incorporating ATE(0) is still necessary, as the resulting TrueThinking direction achieves stronger intervention performance.

Figure 7. Distribution of TTS on different datasets. Panels (a, b): MATH dataset; panels (c, d): AIME dataset.

Figure 8. Results for Commonsense reasoning. (a) The dataset-level distribution of the TTS score; (b) the distribution of ATE_nec(1) and ATE_suf(0), where low means ATE(·) is below the mean and high means it is above the mean; (c) steering results on the Engagement Test and the Disengagement Test.

Figure 9. Results of the Engagement Test and Disengagement Test on Qwen-14B.

Table 3. Results on MATH for steering with directions identified from true-thinking steps based on ATE(1) and the complete TTS.

             |ATE_nec(1)| only    mean(|ATE_nec(1)|, |ATE_suf(0)|)
               A        B              A        B
Flip rate     10.0     41.6           38.0     62.5

Figure 10. Test results of Nemotron on the Engagement Test, where TrueThinking directions are extracted between examples with zero TTS (as decorative-thinking steps s_DT) and examples from different ranges of TTS (as true-thinking steps s_T); lower accuracy means stronger steering effects.

Problem: A square has a side length of 10 inches. Congruent isosceles right triangles are cut off each corner so that the resulting octagon has equal side lengths. How many inches are in the length of one side of the octagon? Express your answer as a decimal to the nearest hundredth.

CoT: ...Then, the side length of the octagon is x√2 = (10/(√2 + 2))·√2 = 10√2/(√2 + 2), which simplifies to 10√2 − 10 ≈ 4.142 inches. So, 4.14 inches. Wait, but let me compute 10√2 − 10 more accurately. √2 ≈ 1.41421356, so 10√2 ≈ 14.1421356, and 14.1421356 − 10 = 4.1421356 inches, which rounds to 4.14 inches. So, the length of one side of the octagon is approximately 4.14.

Figure 11. An example of unfaithful self-verification steps (highlighted in blue), where the TTS score of each step is found to be smaller than 0.005. Low TTS indicates that those steps are not truly engaged in computation; rather, these reasoning steps are likely decorative and function as an appearance of self-verification, contributing minimally to the model's final prediction.

G. Examples with low TTS

We provide more examples where reasoning steps have TTS lower than 0.005. We classify each reasoning example based on the behavioral patterns labeled in Venhoff et al. (2025).

Uncertainty-estimation example

Context steps (last two sentences): Another way is to divide the quadrilateral into two triangles and find the area of each triangle, then add them up.
Looking at the points, maybe I can split the quadrilateral along the x-axis

Steps with low TTS — Uncertainty-estimation: The points (-1, 0) and (2, 0) are on the x-axis, so the quadrilateral is split into two triangles: one above the x-axis and one below. Wait, actually, the point (0, 1) is above the x-axis, and (0, -3) is below. So, if I connect (-1, 0) to (0, 1) to (2, 0) to (0, -3) and back to (-1, 0), it's a quadrilateral that crosses over the x-axis. Alternatively, maybe I can split it into two triangles: one with vertices (-1, 0), (0, 1), (2, 0) and another with vertices (2, 0), (0, -3), (-1, 0). Wait, but that might not cover the entire quadrilateral. Alternatively, perhaps split it into triangles with a common diagonal. Let me see. If I connect (0, 1) to (0, -3), that's a vertical line along the y-axis. Then, the quadrilateral is split into two parts: one on the left side of the y-axis and one on the right. But actually, the points are (-1, 0), (0, 1), (2, 0), (0, -3)

Uncertainty-estimation example

Context steps (last two sentences): Looking at the points, maybe I can split the quadrilateral along the x-axis. The points (-1, 0) and (2, 0) are on the x-axis, so the quadrilateral is split into two triangles: one above the x-axis and one below

Steps with low TTS — Uncertainty-estimation: Wait, actually, the point (0, 1) is above the x-axis, and (0, -3) is below. So, if I connect (-1, 0) to (0, 1) to (2, 0) to (0, -3) and back to (-1, 0), it's a quadrilateral that crosses over the x-axis. Alternatively, maybe I can split it into two triangles: one with vertices (-1, 0), (0, 1), (2, 0) and another with vertices (2, 0), (0, -3), (-1, 0). Wait, but that might not cover the entire quadrilateral. Alternatively, perhaps split it into triangles with a common diagonal. Let me see. If I connect (0, 1) to (0, -3), that's a vertical line along the y-axis. Then, the quadrilateral is split into two parts: one on the left side of the y-axis and one on the right.
But actually, the points are (-1, 0), (0, 1), (2, 0), (0, -3)

Backtracking example

Context steps (last two sentences): So, putting it all together, 20% of 50% of 80 is 8. That seems correct, but let me verify it another way to make sure I didn't make a mistake

Steps with low TTS — Backtracking: Another approach is to multiply all the percentages together first and then apply them to 80. So, 20% is 0.2, and 50% is 0.5. Multiplying those together: 0.2 * 0.5 = 0.1

Adding-knowledge example

Context steps (last two sentences): Wait, but that might not be accurate. Alternatively, maybe split the quadrilateral into two triangles by drawing a diagonal from (-1, 0) to (2, 0)

Steps with low TTS — Adding-knowledge: Then, the quadrilateral is split into two triangles: one with vertices (-1, 0), (0, 1), (2, 0) and another with vertices (-1, 0), (2, 0), (0, -3). Let me calculate the area of each triangle and add them. First triangle: (-1, 0), (0, 1), (2, 0). Using the formula for the area of a triangle with coordinates: 1/2 |x1(y2 - y3) + x2(y3 - y1) + x3(y1 - y2)|. Plugging in the points: x1 = -1, y1 = 0; x2 = 0, y2 = 1; x3 = 2, y3 = 0. So, Area = 1/2 |(-1)(1 - 0) + 0(0 - 0) + 2(0 - 1)| = 1/2 |(-1)(1) + 0 + 2(-1)| = 1/2 |-1 + 0 - 2| = 1/2 |-3| = 1/2 * 3 = 1.5

Adding-knowledge example

Context steps (last two sentences): Let me calculate the area of each triangle and add them.
First triangle: (-1, 0), (0, 1), (2, 0). Using the formula for the area of a triangle with coordinates: 1/2 |x1(y2 - y3) + x2(y3 - y1) + x3(y1 - y2)|. Plugging in the points: x1 = -1, y1 = 0; x2 = 0, y2 = 1; x3 = 2, y3 = 0. So, Area = 1/2 |(-1)(1 - 0) + 0(0 - 0) + 2(0 - 1)| = 1/2 |(-1)(1) + 0 + 2(-1)| = 1/2 |-1 + 0 - 2| = 1/2 |-3| = 1/2 * 3 = 1.5

Steps with low TTS — Adding-knowledge: Second triangle: (-1, 0), (2, 0), (0, -3). Again, using the same formula: x1 = -1, y1 = 0; x2 = 2, y2 = 0; x3 = 0, y3 = -3. Area = 1/2 |(-1)(0 - (-3)) + 2((-3) - 0) + 0(0 - 0)| = 1/2 |(-1)(3) + 2(-3) + 0| = 1/2 |-3 - 6 + 0| = 1/2 |-9| = 1/2 * 9 = 4.5. Adding both areas: 1.5 + 4.5 = 6. Okay, so that's the same result as before
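The coordinate-formula arithmetic quoted in the transcripts above can be checked with a short script; the function name is ours, but the formula and the points are taken directly from the examples.

```python
def tri_area(p1, p2, p3):
    """Triangle area from coordinates: 1/2 |x1(y2-y3) + x2(y3-y1) + x3(y1-y2)|."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return 0.5 * abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))

a1 = tri_area((-1, 0), (0, 1), (2, 0))   # first triangle: 1.5
a2 = tri_area((-1, 0), (2, 0), (0, -3))  # second triangle: 4.5
total = a1 + a2                          # 6.0, matching the transcript
```

Both values agree with the model's verbalized computation, confirming that the low-TTS steps are arithmetically correct even though they contribute little causally.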