
Paper deep dive

ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Xinyan Wang, Xiaogeng Liu, Chaowei Xiao

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 69

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/26/2026, 2:36:30 AM

Summary

ROM is a novel, lightweight framework for real-time overthinking mitigation in Large Reasoning Models (LRMs). It uses a streaming detection head attached to a frozen LLM backbone to identify and truncate redundant reasoning steps after the first correct solution is reached. The method employs a Counterfactual Self-Correction (CSC) data augmentation strategy to improve detection accuracy and achieves significant improvements in response efficiency and length reduction across seven benchmarks.

Entities (4)

ROM · method · 100%
Counterfactual Self-Correction · technique · 95%
Large Reasoning Models · model-class · 95%
Qwen3-8B · model · 90%

Relation Signals (3)

ROM mitigates Overthinking

confidence 100% · ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem.

Counterfactual Self-Correction augments ROM

confidence 95% · We propose Counterfactual Self-Correction (CSC), a data augmentation strategy that... improves accuracy by +0.98%

ROM uses backbone Qwen3-8B

confidence 95% · Our backbone is Qwen3-8B, and we extract hidden states from layer L=32

Cypher Suggestions (2)

Identify the backbone models used by specific mitigation methods · confidence 95% · unvalidated

MATCH (m:Method)-[:USES_BACKBONE]->(b:Model) RETURN m.name, b.name

Find all methods that mitigate overthinking in LLMs · confidence 90% · unvalidated

MATCH (m:Method)-[:MITIGATES]->(p:Problem {name: 'Overthinking'}) RETURN m.name

Abstract

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

68,905 characters extracted from source content.


ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Xinyan Wang, Xiaogeng Liu, Chaowei Xiao
¹University of Wisconsin-Madison, ²Johns Hopkins University

Abstract: Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation. Code is available at https://github.com/SaFo-Lab/ROM.

1 Introduction

Large Language Models (LLMs) have shown strong abilities on various tasks (Brown et al., 2020). Large Reasoning Models (LRMs) (OpenAI, 2024; Guo et al., 2025) further deliver impressive gains on challenging benchmarks by generating long Chain-of-Thoughts (CoTs) (Wei et al., 2023).
Yet this strength comes with a persistent inefficiency known as overthinking (Chen et al., 2025; Sui et al., 2025). Even on easy-to-medium queries, LRMs frequently reach the correct answer early but continue to generate redundant steps and repeated verification. As reasoning traces scale, the waste becomes immediate and costly. It inflates inference compute and latency, and can even undermine reliability via answer drift, where extended reasoning flips an originally correct conclusion. These costs make mitigating overthinking urgent for building efficient and reliable LRMs. A natural direction is model-based control, such as methods based on Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). These approaches train models to obey length budgets or stop earlier. Recent RL-based methods explicitly encourage shorter and less redundant reasoning traces while preserving accuracy under a length or budget constraint (Aggarwal & Welleck, 2025; Luo et al., 2025). While effective, these approaches are training-heavy. They require updating the full backbone model. Moreover, it is difficult to align performance consistently across different models, especially across today’s quickly evolving training algorithms. Thus, we argue that it is more practically promising to explore lightweight, plug-and-play mechanisms that operate at inference time without modifying the backbone. Such methods offer better scalability and adaptability. In contrast to model-based control, heuristic early-exit methods aim to stop generation without training. These methods rely on test-time stopping signals to determine whether additional reasoning is still useful (Fu et al., 2025; Yang et al., 2025b; Wang et al., 2025). Such signals include probing answer stabilization, evaluating the confidence or entropy of trial answers at candidate exit points, and tracking entropy after </think>. 
More recently, researchers have explored real-time variants that operate during generation, such as tracking the rank dynamics of the </think> token (Wei et al., 2026) or monitoring attention patterns at transition points (Li et al., 2026). However, these methods rely on hand-crafted signals and thresholding, which can be brittle across models. More importantly, they do not learn overthinking patterns directly. Instead, they use proxies to decide whether further computation is worthwhile, such as checking the stability of intermediate answers or monitoring confidence-like signals during decoding. The effectiveness of such heuristics is neither guaranteed nor learned. Therefore, we argue that effectively addressing overthinking requires methods that genuinely learn to recognize overthinking patterns, rather than relying on heuristics that approximate saturation with manually tuned signals and thresholds. Another related line of work shows that answer correctness signals are encoded in a model's hidden states. Zhang et al. (2025) demonstrate that a lightweight probe can extract from hidden representations whether an intermediate conclusion is already correct, enabling early exit. This finding suggests that internal representations are a promising substrate for learned stopping decisions. However, such methods are not designed for token-level streaming intervention in multi-solution trajectories. They operate at coarser checkpoints, such as intermediate answers. Therefore, we argue that effective overthinking mitigation must leverage token-level streaming signals to enable real-time intervention during decoding. To overcome these limitations, we propose ROM, a learnable and deployment-friendly framework for Real-time Overthinking Mitigation. ROM formulates overthinking mitigation as a streaming prediction-and-control problem. It attaches a lightweight detection head to the late-layer hidden states of a frozen LLM backbone and monitors decoding in real time.
At each step, the head takes an attention-pooled prefix summary together with a small memory state summarizing recent steps (Li et al., 2025a), and outputs an overthinking score for the current step. This design enables plug-and-play intervention without modifying the base model. Our approach is inspired by recent work on streaming, token-level detection for jailbreak and hallucination defense (Sharma et al., 2025; Li et al., 2025b; Xuan et al., 2025; Krishna et al., 2025; Obeso et al., 2025; Li et al., 2025a), which shows that lightweight heads can monitor generation with negligible overhead. We find this paradigm particularly suitable for overthinking mitigation, because overthinking emerges during decoding when the model continues reasoning after already reaching a sufficient solution. Rather than enforcing a fixed budget or relying on entropy heuristics, ROM detects overthinking as it emerges and intervenes at the appropriate moment. To support streaming detection, we require fine-grained token-level supervision. Following the view that a model output can be interpreted as a sequence of candidate solutions (Chen et al., 2025), we first segment multi-solution generations into solution blocks and locate the first correct solution. Its endpoint is treated as a correctness boundary that separates efficient reasoning from overthinking. However, supervision from distilled reasoning data is biased. In many distilled outputs, the first solution is already correct, while genuine self-correction trajectories (wrong→correct) are rare. A detector trained on this skewed distribution may therefore overfit superficial cues associated with later segments, such as position or discourse markers, and incorrectly penalize beneficial corrective reasoning.
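The detection head described above can be sketched with standard PyTorch modules. This is a minimal illustration, not the authors' implementation: the attention pooling is reduced to a single learned query, the paper's CfC recurrent cell is replaced by a `GRUCell` stand-in, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class OverthinkingHead(nn.Module):
    """Streaming detector over frozen hidden states (illustrative sketch)."""

    def __init__(self, d_hidden=4096, d_proj=1024):
        super().__init__()
        self.proj = nn.Linear(d_hidden, d_proj)   # project backbone states
        self.query = nn.Parameter(torch.randn(d_proj))  # learned pooling query
        self.cell = nn.GRUCell(d_proj, d_proj)    # stand-in for the CfC memory cell
        self.classifier = nn.Linear(d_proj, 1)    # per-token overthinking score

    def forward(self, prefix_states, memory):
        # prefix_states: (t, d_hidden) hidden states of the generated prefix
        keys = self.proj(prefix_states)                 # (t, d_proj)
        attn = torch.softmax(keys @ self.query, dim=0)  # (t,) pooling weights
        pooled = attn @ keys                            # (d_proj,) prefix summary
        memory = self.cell(pooled.unsqueeze(0), memory) # update recurrent memory
        p_t = torch.sigmoid(self.classifier(memory)).squeeze()  # overthinking prob
        return p_t, memory
```

At each decoding step the returned probability would be thresholded (the paper uses 0.5) to decide whether to trigger intervention, while the memory state carries over to the next step.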
To address this issue, we propose Counterfactual Self-Correction (CSC), a data augmentation strategy that synthesizes balanced wrong→correct trajectories and breaks the shortcut that the first solution is always correct while preserving beneficial self-correction. We further introduce token-level labels anchored at the first correct solution: all tokens up to that point are labeled as efficient reasoning, and all subsequent tokens as overthinking. Together, CSC and the anchored labels teach the detector that self-correction is still efficient reasoning, whereas generating additional solutions after the first correct one constitutes overthinking. Finally, we include a lightweight solution-level backtracing mechanism to select an effective cut point, achieving stronger truncation performance while maintaining clean solution-level exits. We conduct extensive experiments on 7 different datasets, comparing our method with 5 baselines including RL-based and heuristic early-exit methods. The experimental results show that our method achieves the highest accuracy (93.51%), shortest response length (1159 tokens), and best response efficiency (SE=12.37) among all methods, with 47.2% token reduction and 121% efficiency improvement over the vanilla baseline. Notably, our method surpasses the runner-up in terms of response efficiency SE (defined as Acc/SL × 100) by a 52.7% margin, where the runner-up is L1, an RL-based method that requires extensive training on larger datasets compared with ours. In addition, combined with backtracing, our proposed CSC augmentation improves accuracy by +0.98%, reduces response length by 18.4%, and boosts efficiency by 35.3%, demonstrating the effectiveness of our counterfactual data augmentation strategy.
Overall, ROM, serving as the first streaming-based overthinking detection method, shows strong performance across diverse datasets, highlighting the promising potential and practical value of this paradigm and paving the way for future exploration of streaming detection in addressing the overthinking problem.

2 Related Works

2.1 LRMs and Overthinking

Large Reasoning Models (LRMs) such as OpenAI o1 (OpenAI, 2024) and DeepSeek-R1 (Guo et al., 2025) achieve strong performance on challenging benchmarks by generating long Chain-of-Thought (CoT) reasoning traces, yet they suffer from a persistent inefficiency known as overthinking. Chen et al. (2025) first systematically studied the overthinking problem in o1-like LLMs, showing that even on simple problems, these models allocate excessive computational resources with minimal accuracy benefit.

2.2 Mitigating Overthinking

Model-based Efficient Reasoning. Model-based approaches directly modify the reasoning model through training to produce shorter, more efficient reasoning traces. L1 (Aggarwal & Welleck, 2025) introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning method that trains reasoning models to satisfy user-specified length constraints while maintaining accuracy, enabling precise test-time compute allocation and even producing short reasoning models that match GPT-4o performance at comparable reasoning lengths. O1-Pruner (Luo et al., 2025) applies a two-phase approach: it first estimates baseline performance via pre-sampling, then uses RL-based fine-tuning to incentivize shorter reasoning processes under accuracy constraints, effectively harmonizing token allocation with problem difficulty.

Heuristic Early-Exit Methods. Heuristic early-exit methods terminate reasoning at inference time without model training, relying on hand-crafted stopping signals.
Certaindex (Fu et al., 2025) measures the evolving stability of intermediate answers and exits once the solution stabilizes, achieving up to 50% compute savings. DEER (Yang et al., 2025b) monitors model confidence at reasoning transition points, triggering early termination when the model exhibits high certainty in a trial answer. Other methods leverage entropy trajectories (Wang et al., 2025), attention patterns at the </think> token (Li et al., 2026), or rank dynamics of termination tokens (Wei et al., 2026) as stopping criteria. Model-based methods require updating the full backbone, while heuristic methods rely on hand-designed proxy signals rather than learning overthinking patterns directly. ROM combines the strengths of both: it is learnable, operates on frozen LLM backbones, and performs streaming token-level detection during decoding (see Table 1).

2.3 Streaming Token-Level Detection

Streaming, token-level detection has recently emerged as a lightweight guardrail paradigm for safety and hallucination monitoring. In the safety domain, recent work trains real-time classifiers or attaches lightweight heads to hidden states to detect harmful content during generation (Sharma et al., 2025; Li et al., 2025b; Xuan et al., 2025; Krishna et al., 2025). In the hallucination domain, linear probes on hidden states detect entity-level hallucinations in real time (Obeso et al., 2025), and Kelp (Li et al., 2025a) models the temporal evolution of risk via a Streaming Latent Dynamics Head with under 0.5ms latency per token. These methods demonstrate the feasibility of real-time monitoring via lightweight heads on hidden states. In this paper, we make the first attempt to apply streaming detection to mitigating overthinking in reasoning models.

Table 1: Comparison of overthinking mitigation approaches across three key properties. ✓ satisfies, ✗ violates.
| Method | Learnable | Frozen-LLMs | Streaming |
|---|---|---|---|
| L1 (Aggarwal & Welleck, 2025) | ✓ | ✗ | ✗ |
| O1-Pruner (Luo et al., 2025) | ✓ | ✗ | ✗ |
| Certaindex (Fu et al., 2025) | ✗ | ✓ | ✗ |
| DEER (Yang et al., 2025b) | ✗ | ✓ | ✗ |
| EAT (Wang et al., 2025) | ✗ | ✓ | ✗ |
| SyncThink (Li et al., 2026) | ✗ | ✓ | ✓ |
| RCPD (Wei et al., 2026) | ✗ | ✓ | ✓ |
| Reasoning Probing (Zhang et al., 2025) | ✓ | ✓ | ✗ |
| ROM (Ours) | ✓ | ✓ | ✓ |

3 Method

In this section, we present ROM, a lightweight framework for detecting and mitigating overthinking in large reasoning models in real time. ROM is designed to be deployable. It attaches a small detection head to a frozen LLM backbone, runs in lockstep with decoding, and reuses intermediate hidden states with negligible overhead. Our method consists of three key components (Figure 1): (1) a correctness-boundary labeling and augmentation pipeline that produces token-level overthinking supervision (Sec. 3.2), (2) a streaming detector that predicts overthinking per token from frozen hidden representations (Sec. 3.3), and (3) a boundary-aware intervention that early-stops redundant reasoning while preserving clean solution structure (Sec. 3.4).

Figure 1: Overview of the ROM framework. ROM attaches a lightweight detection head to a frozen LLM backbone and runs in lockstep with decoding. It consists of three components: (1) correctness-boundary labeling and counterfactual self-correction augmentation, (2) streaming overthinking detection via a compact neural probe, and (3) boundary-aware intervention with early stopping and backtracing.

3.1 Problem Setup

Given a query/problem q and a base reasoning model M, the model generates an assistant response r = (r_1, …, r_{T_assist}) of T_assist tokens (excluding the user prompt).
We define overthinking as tokens produced after the model has already reached its First Correct Solution (FCS), i.e., additional generation that does not improve final answer correctness and typically corresponds to redundant verification, repeated attempts, or unnecessary exploration beyond the earliest correct solution. Let h_t ∈ ℝ^{d_h} denote the hidden state of token r_t extracted from a fixed layer L of the frozen backbone, and let H_{1:t} = [h_1; …; h_t] denote the prefix of hidden states up to step t. Our goal is to learn a binary streaming detector f_θ that outputs an overthinking probability at each step:

p_t = f_θ(H_{1:t}) ≈ P(y_t = 1 | H_{1:t}),  (1)

where y_t = 1 indicates overthinking and y_t = 0 indicates efficient reasoning.

3.2 Correctness-Boundary Labeling and Augmentation

3.2.1 Token-Level Labeling via First Correct Solution

Following (Chen et al., 2025), we segment each model response into a sequence of solution attempts s = (s_1, s_2, …, s_M) and assign each attempt s_i a correctness label c_i ∈ {0, 1} via answer extraction and verification, yielding a correctness sequence c = (c_1, …, c_M). To obtain token-level supervision, we identify the boundary between efficient reasoning and overthinking by locating the first correct solution. For each response, we define the FCS index as

k* = min{ i | c_i = 1, i ∈ [1, M] }.  (2)

This boundary induces a simple labeling rule: tokens belonging to attempts in {s_1, …, s_{k*}} are labeled 0 (efficient), while tokens in {s_{k*+1}, …, s_M} are labeled 1 (overthinking). We create two types of training samples from each segmented response:

Efficient samples. We retain only solutions up to k*, i.e., s^eff = {s_1, …, s_{k*}}, and label all tokens as 0.

Overthinking samples. We keep the full sequence s^over = {s_1, …, s_M} and assign boundary-based labels: tokens up to k* are labeled 0 and tokens after k* are labeled 1.
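The labeling rule above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: attempts are represented as per-attempt token lists and correctness flags come from the verification step, with a 0-based FCS index.

```python
def fcs_labels(solutions, correct):
    """Label tokens 0 (efficient) up to the first correct solution, 1 after it.

    solutions: list of per-attempt token lists; correct: parallel list of 0/1 flags.
    Returns a flat token-label list, or None if no attempt is correct
    (such responses are skipped during training).
    """
    if 1 not in correct:
        return None                       # no FCS exists: skip this response
    k = correct.index(1)                  # FCS index k* (0-based here)
    labels = []
    for i, attempt in enumerate(solutions):
        # attempts up to and including the FCS are efficient (0), later ones 1
        labels += [0 if i <= k else 1] * len(attempt)
    return labels
```

For a response whose second attempt is the first correct one, every token of attempts one and two is labeled 0 and every later token 1, exactly the boundary rule of Eq. (2).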
3.2.2 Counterfactual Self-Correction: Mitigating First-Solution Bias

A key challenge in training overthinking detectors is limited trajectory diversity and a problematic bias in distilled reasoning traces: in many outputs, the first attempt is already correct, while genuine self-correction trajectories (wrong→correct) are rare. A detector trained on this skewed distribution can overfit superficial cues correlated with later segments (e.g., absolute position, discourse markers such as "Let me try again," or formatting artifacts) and mistakenly penalize beneficial corrective reasoning. To address this, we propose Counterfactual Self-Correction (CSC), a data augmentation strategy that synthesizes balanced wrong→correct trajectories, breaking the "first-solution-is-correct" shortcut while preserving the intended notion of overthinking. CSC constructs two kinds of augmented trajectories from the FCS boundary k*: (i) efficient samples D_eff that stop at the FCS, and (ii) overthinking samples D_over that include s_{k*} plus at least one additional segment after it. To keep concatenated trajectories natural and avoid trivial cues, we insert brief transition phrases between segments and remove explicit final-answer markers (e.g., …) from non-terminal segments. Algorithm 1 formalizes the CSC procedure.
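A minimal Python sketch of this construction follows. It is illustrative only: attempts are plain strings, the transition phrase is a made-up placeholder, and final-answer-marker stripping is omitted.

```python
def csc_augment(solutions, correct, transition="Hmm, let me double-check. "):
    """Counterfactual Self-Correction sketch: split one response at the FCS boundary.

    solutions: list of attempt strings; correct: parallel 0/1 flags.
    Returns (efficient_sample, overthinking_sample), each a (segments, labels)
    pair, or None when no attempt is correct (the response is skipped).
    """
    if 1 not in correct:
        return None                                  # no FCS: skip this response
    k = correct.index(1)                             # FCS index k* (0-based)
    # Efficient sample: possibly wrong -> correct, all labeled 0. Keeping the
    # wrong prefix is what breaks the "first solution is correct" shortcut.
    eff_segs = solutions[: k + 1]
    efficient = (eff_segs, [0] * len(eff_segs))
    # Overthinking sample: at least one segment past the FCS, glued naturally.
    tail = [transition + s for s in solutions[k + 1 :]]
    if not tail:
        return efficient, None                       # nothing past the FCS to append
    overthinking = (eff_segs + tail, [0] * len(eff_segs) + [1] * len(tail))
    return efficient, overthinking
```

Applied over a whole dataset, this yields the balanced D_eff / D_over pools; the paper reports a 3–6× increase in training-set size from this step.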
Algorithm 1 Counterfactual Self-Correction (CSC)

Input: dataset of responses, each with solutions s = (s_1, …, s_M) and correctness labels c = (c_1, …, c_M)
Output: efficient set D_eff; overthinking set D_over

1: D_eff ← ∅; D_over ← ∅
2: for each response (s, c) in the dataset do
3:   k* ← min{ i | c_i = 1 }  ▷ FCS index
4:   if no correct solution exists then
5:     continue  ▷ skip this response
6:   end if
7:   ▷ Efficient samples: stop at first correct solution
8:   Truncate to (s_1, …, s_{k*}); add to D_eff with all tokens labeled 0
9:   ▷ Overthinking samples: continue past first correct solution
10:  Concatenate (s_1, …, s_{k*}, s_j, …) where j > k*; add to D_over
11:  Label tokens up to s_{k*} as 0, tokens after s_{k*} as 1
12:  Remove final-answer markers from non-terminal segments; insert transition phrases
13: end for
14: return D_eff, D_over

This CSC augmentation increases training set size by 3–6× while maintaining natural reasoning flow.

3.3 Streaming Detection

Unlike heuristic early-exit rules (e.g., entropy or logit thresholds), our detector learns overthinking directly from correctness-boundary supervision (Sec. 3.2).

Feature extraction. Instead of using only the last-token embedding h_t, we summarize the prefix H_{1:t} (defined in Sec. 3.1) with a lightweight attention projector:

ĥ_t = AttnProj(H_{1:t}) ∈ ℝ^{d_p},  (3)

where AttnProj(·) is attention pooling with projection dimension d_p (we use d_p = 1024). This compression retains prefix-level cues while keeping the downstream model compact.

Temporal modeling. Overthinking is temporal and often appears as a phase change after a correct solution is formed, making per-token feed-forward probes noisy. We therefore feed ĥ_t into a Closed-form Continuous-time (CfC) recurrent cell (Li et al., 2025a), which maintains a memory state m_t ∈ ℝ^{d_p}:

m_t = CfC(m_{t−1}, ĥ_t).  (4)

CfC models the latent state as a continuous-time ODE discretized at each token step, yielding smoother signals than purely feed-forward classifiers. We initialize m_0 from the user prompt by applying AttnProj to prompt hidden states, so the detector is query-aware from the start.

Classification head and training objective. A linear head predicts the overthinking probability:

p_t = P(y_t = 1 | H_{1:t}) = σ(w^⊤ m_t + b),  (5)

producing a stream {p_t}_{t=1}^{T_assist} that can be thresholded online for intervention (Sec. 3.4). We train the detector with token-level binary cross-entropy over assistant tokens:

L = −(1/T_assist) Σ_{t=1}^{T_assist} [ y_t log p_t + (1 − y_t) log(1 − p_t) ].  (6)

3.4 Boundary-Aware Intervention

At test time, we run the detector in lockstep with decoding. For each generated token, we compute p_t = f_θ(H_{1:t}). We trigger intervention at the first step where the predicted overthinking probability exceeds 0.5, the natural decision boundary of the binary classifier:

t* = min{ t | p_t > 0.5, t ∈ [1, T_assist] }.  (7)

Boundary-aware truncation. Naively cutting at t* can break formatting or truncate mid-sentence. We therefore backtrace to the nearest clean boundary (e.g., the last newline or sentence boundary) before t*, yielding t̃*. We truncate the response at t̃* and append a short "final answer" cue (e.g., closing the thinking block and prompting a concise conclusion). The model then generates a brief final answer segment, preserving solution-level structure while removing redundant post-solution exploration.

3.5 Implementation Details

We implement ROM in PyTorch with Transformers. We use GPT-4o for automatic segmentation and correctness verification (Sec. 3.2). Our backbone is Qwen3-8B, and we extract hidden states from layer L = 32 (roughly 80% depth). Note that the token-level labels (efficient vs.
overthinking) are derived from QwQ (Team, 2025) outputs, while the detection head is trained on hidden states extracted from Qwen3-8B processing the same prompts. This decoupling is intentional: QwQ produces richer multi-solution trajectories suited for labeling, whereas the detector must learn from the backbone it will monitor at inference time. We train the detection head for 20 epochs on 740 efficient and 793 overthinking samples. To reduce compute, we pre-compute and cache {h_t} for all training samples, so detector training does not require repeated backbone forward passes.

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets

We evaluate our method across two categories of tasks:

Mathematical Reasoning Tasks: We use six benchmarks, including MATH500 (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), ASDiv (Miao et al., 2020), MAWPS (Koncel-Kedziorski et al., 2016), MultiArith (Roy & Roth, 2015), and SVAMP (Patel et al., 2021), which test arithmetic reasoning and multi-step problem solving with varying levels of difficulty. Specifically, MATH500 represents tasks of high difficulty requiring multi-step mathematical reasoning, while GSM8K, ASDiv, MAWPS, MultiArith, and SVAMP represent easier arithmetic reasoning tasks, where overthinking is especially problematic as these problems should be straightforward for today's LRMs to solve.

Multiple-Choice Question Tasks: We evaluate on MMLU-Pro (Wang et al., 2024), a challenging multi-domain expert-level benchmark covering science, humanities, and professional domains.

For training, we generate responses on the MATH500 dataset using QwQ (Team, 2025) with greedy decoding, then apply solution splitting, correctness verification, and our Counterfactual Self-Correction (CSC) augmentation, augmenting the data from 222 efficient samples and 222 overthinking samples to 740 efficient thinking samples and 793 overthinking samples, respectively.
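Returning to the intervention step of Sec. 3.4, the threshold-and-backtrace logic can be sketched as a simple scan over the detector's probability stream. The 0.5 threshold is the paper's; the answer-cue string and the boundary test are illustrative assumptions.

```python
def intervene(tokens, probs, answer_cue="</think>\nFinal answer: ", threshold=0.5):
    """Truncate at the first step with p_t > threshold, backtracing to a clean boundary.

    tokens: decoded token strings; probs: per-token overthinking probabilities.
    Returns the truncated text plus an answer cue, or the full text if the
    detector never fires.
    """
    t_star = next((t for t, p in enumerate(probs) if p > threshold), None)
    if t_star is None:
        return "".join(tokens)                 # no intervention needed

    def clean_boundary(tok):
        # a newline or sentence-ending punctuation counts as a clean cut point
        return tok.endswith("\n") or tok.rstrip().endswith((".", "!", "?"))

    t_cut = t_star
    while t_cut > 0 and not clean_boundary(tokens[t_cut - 1]):
        t_cut -= 1                             # backtrace before t*
    return "".join(tokens[:t_cut]) + answer_cue
```

In a real decoding loop this would run in lockstep with generation, stopping sampling at t̃* and letting the model continue from the appended cue to emit a brief final answer.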
4.1.2 Metrics

We evaluate methods using three complementary metrics:

Accuracy (Acc): Percentage of problems with correct final answers, measuring reasoning quality.

Response Length (SL): Average number of tokens in generated responses, measuring computational cost.

Response Efficiency (SE): Defined as (Acc/SL) × 100, measuring accuracy per token to capture the efficiency-quality trade-off. Higher SE indicates better efficiency.

4.1.3 Baselines

We compare against several categories of methods:

Vanilla: Qwen3-8B (Yang et al., 2025a) with reasoning.

Model-based Methods: L1 (Aggarwal & Welleck, 2025) uses reinforcement learning to train reasoning models to satisfy length constraints while maintaining accuracy. We use L1-8B-Max, trained on Qwen3-8B, as a representative state-of-the-art model-based baseline.

Heuristic Early-Exit Methods: EAT (Wang et al., 2025) monitors entropy after appending the </think> token and triggers early exit when the entropy trajectory stabilizes. We select EAT as the most recent training-free heuristic method at the time of writing.

Hard Truncation: Cut512, Cut1024, Cut2048 truncate responses at fixed token limits.

Our Methods: ROM is our base model without CSC or backtracing; ROM_CSC is our full method with both CSC augmentation and backtracing.

4.2 Main Results

Table 2: Main evaluation results focusing on accuracy, response length, and response efficiency. Acc denotes accuracy (%), SL denotes response length (tokens), and SE denotes response efficiency (Acc/SL × 100). MATH represents mathematical reasoning tasks, while MCQ represents multiple-choice question tasks. Bold values indicate the best performance per dataset, while underlined values indicate the second-best performance. Higher values are better for Acc and SE, while lower values are better for SL. Overall SE is computed as the average of SE over all datasets.
(Each cell reports Acc / SL / SE.)

| Method | MATH500 | GSM8K | ASDiv | MAWPS | MultiArith | SVAMP | MMLU-Pro | Overall |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 88.33% / 4554 / 1.94 | 100.00% / 2043 / 4.89 | 96.67% / 1138 / 8.49 | 96.67% / 1267 / 7.63 | 100.00% / 1181 / 8.47 | 95.00% / 1874 / 5.07 | 76.67% / 2842 / 2.70 | 91.72% / 2197 / 5.60 |
| L1 | 89.20% / 2603 / 3.43 | 100.00% / 1259 / 7.94 | 99.20% / 889 / 11.16 | 97.50% / 969 / 10.06 | 97.50% / 930 / 10.48 | 96.70% / 1138 / 8.50 | 71.40% / 1401 / 5.10 | 93.07% / 1313 / 8.10 |
| EAT | 89.20% / 4297 / 2.08 | 100.00% / 1594 / 6.27 | 96.70% / 1136 / 8.51 | 96.70% / 1197 / 8.08 | 100.00% / 1142 / 8.76 | 95.00% / 1480 / 6.42 | 76.70% / 2052 / 3.74 | 93.47% / 1843 / 6.26 |
| Cut2048 | 85.00% / 2563 / 3.32 | 98.33% / 2041 / 4.82 | 98.33% / 1135 / 8.66 | 96.67% / 1265 / 7.64 | 97.50% / 1179 / 8.27 | 95.00% / 1872 / 5.07 | 78.57% / 2839 / 2.77 | 92.77% / 1842 / 5.79 |
| Cut1024 | 81.67% / 1873 / 4.36 | 96.67% / 2041 / 4.74 | 97.50% / 1135 / 8.59 | 96.67% / 1265 / 7.64 | 97.50% / 1179 / 8.27 | 96.67% / 1872 / 5.16 | 72.38% / 2839 / 2.55 | 91.29% / 1743 / 5.90 |
| Cut512 | 82.50% / 1426 / 5.79 | 83.33% / 2041 / 4.08 | 97.50% / 1135 / 8.59 | 95.83% / 1265 / 7.58 | 98.33% / 1179 / 8.34 | 86.67% / 1872 / 4.63 | 50.48% / 2839 / 1.78 | 84.95% / 1680 / 5.83 |
| ROM | 87.50% / 3063 / 2.86 | 99.20% / 1254 / 7.91 | 97.50% / 602 / 16.20 | 97.50% / 723 / 13.49 | 98.30% / 768 / 12.80 | 92.50% / 1241 / 7.45 | 75.20% / 2299 / 3.27 | 92.53% / 1421 / 9.14 |
| ROM_CSC | 87.50% / 2807 / 3.12 | 100.00% / 1042 / 9.60 | 97.50% / 399 / 24.44 | 97.50% / 527 / 18.50 | 99.20% / 616 / 16.10 | 95.80% / 906 / 10.57 | 77.10% / 1818 / 4.24 | 93.51% / 1159 / 12.37 |

4.2.1 Main Results Analysis

Overall Performance. Across all datasets, ROM_CSC achieves the highest accuracy (93.51%), the shortest response length (1159 tokens), and the highest efficiency (SE=12.37) among all methods. Compared to vanilla (91.72%, 2197 tokens, SE=5.60), ROM_CSC improves accuracy by +1.79% while reducing response length by 47.2%, yielding a 121% efficiency improvement.

Challenging Benchmarks. On MATH500, ROM_CSC achieves 87.50% accuracy with a 38.4% token reduction (vs. vanilla), approaching L1's accuracy (89.20%) with comparable efficiency (SE=3.12 vs. 3.43). On MMLU-Pro, it achieves the highest accuracy among non-truncation methods (77.10%, +5.70% vs.
L1) with a 36.0% token reduction, indicating that the learned detector generalizes to diverse domains beyond mathematical reasoning (SE=4.24 vs. EAT's 3.74).

Simple Arithmetic Tasks. ROM_CSC excels particularly on simple arithmetic benchmarks, where overthinking is most pronounced. On these tasks, the vanilla model already achieves near-perfect accuracy but generates excessively long responses; for example, 1138 tokens on ASDiv and 1267 tokens on MAWPS for problems that require only basic operations. ROM_CSC reduces response length by 64.9% on ASDiv (1138 → 399 tokens) and 58.4% on MAWPS (1267 → 527 tokens) while maintaining accuracy, yielding SE improvements of 188% and 143% over vanilla, respectively. Compared to L1, ROM_CSC achieves 2.2× efficiency on ASDiv (SE=24.44 vs. 11.16) and 1.8× on MAWPS (SE=18.50 vs. 10.06). This pattern is consistent on MultiArith and GSM8K, where token reductions reach 47.8% and 49.0%, respectively. These results suggest that the learned detector is especially effective at identifying redundant verification and repeated solution attempts that LRMs tend to produce on problems they can already solve correctly in one pass. Figure 2 illustrates this with a concrete GSM8K example.

Figure 2: Case study on a GSM8K problem. The vanilla model generates 637 tokens of redundant reasoning after reaching the correct answer. ROM detects overthinking and intervenes, but cuts mid-sentence, causing the model to produce a lengthy explanation after the answer box (117 tokens). ROM_CSC with backtracing rewinds to a clean boundary, yielding a concise 202-token response (75.8% reduction).

4.2.2 Comparison with L1: Instance-Adaptive Control vs. Global Constraint

Method contrast. L1 (Aggarwal & Welleck, 2025) uses RL on 40K samples to fine-tune Qwen3-8B to satisfy a single global length constraint.
ROM_CSC keeps the backbone frozen and trains only a lightweight detection head on 1,533 samples, providing token-level control at inference time without modifying the base model.

Results. Despite L1's heavier training budget, ROM_CSC delivers +0.44% higher accuracy (93.51% vs. 93.07%), 11.7% shorter responses (1159 vs. 1313 tokens), and +52.7% higher efficiency (SE=12.37 vs. 8.10).

Key difference. L1 learns a fixed length budget shared across all inputs, which limits its ability to adapt to per-instance difficulty. In contrast, ROM_CSC makes a per-token stopping decision conditioned on the current reasoning state, allowing it to cut aggressively on simple problems (e.g., ASDiv: 399 vs. 889 tokens) while preserving longer reasoning on harder ones. This instance-adaptive behavior also explains ROM_CSC's stronger generalization to out-of-domain tasks such as MMLU-Pro (+5.70% over L1), where a global constraint trained on mathematical data may not transfer well.

4.2.3 Comparison with EAT: Learned Semantics vs. Entropy Proxy

Method contrast. EAT (Wang et al., 2025) monitors the entropy trajectory after appending </think> and exits once the trajectory stabilizes. This signal is a hand-crafted proxy for answer confidence that does not directly capture whether the model has already solved the problem. ROM_CSC instead trains a detector on hidden-state representations with explicit correctness-boundary supervision, learning to recognize overthinking patterns from data.

Results. ROM_CSC achieves comparable accuracy (93.51% vs. 93.47%) with 37.1% shorter responses (1159 vs. 1843 tokens) and 97.6% higher efficiency (SE=12.37 vs. 6.26).

Key difference. The efficiency gap is most evident on arithmetic tasks, where EAT's entropy signal offers limited reduction (e.g., ASDiv: 1136 tokens vs. vanilla's 1138), while ROM_CSC cuts to 399 tokens.
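Both comparisons come down to the same decision problem: when, during decoding, to emit </think>. The streaming prediction-and-control loop can be sketched as below. This is a minimal illustration with hypothetical interfaces, not the paper's implementation; the actual head is an attention layer plus a CfC cell over late-layer hidden states (see Section 4.3.3).

```python
def rom_stream(step, detect, end_think_ids, prompt_ids,
               threshold=0.5, max_tokens=4096, boundary_tok=None):
    """Streaming overthinking control loop (sketch; all interfaces hypothetical).

    step(ids)        -> (next_token_id, hidden)  # one decode step of the frozen LLM;
                                                 # `hidden` is a late-layer hidden state
    detect(h, state) -> (p_overthink, state)     # lightweight recurrent detection head
    """
    ids, state = list(prompt_ids), None
    for _ in range(max_tokens):
        tok, hidden = step(ids)
        ids.append(tok)
        p, state = detect(hidden, state)
        if p > threshold:
            # Backtracing (Section 4.3.2): rewind to the last clean boundary token
            # (e.g. a newline) so no reasoning step is cut mid-sentence.
            if boundary_tok is not None and boundary_tok in ids:
                ids = ids[: len(ids) - ids[::-1].index(boundary_tok)]
            ids.extend(end_think_ids)  # force the transition to the final answer
            break
    return ids
```

Swapping `detect` for a hand-crafted entropy test yields an EAT-style exit rule; the learned detector differs precisely in what that function conditions on.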
This suggests that entropy stabilization is a weak indicator of overthinking on simple problems: the model's output distribution may remain confident even as it generates redundant verification steps. By contrast, the learned detector captures richer semantic signals from hidden states that directly reflect whether a correct solution has already been formed.

Table 3: Performance comparison of ROM_CSC with and without the backtracing strategy. Acc denotes accuracy (%), SL denotes response length (tokens), and SE denotes response efficiency (Acc/SL×100); each cell reports Acc / SL / SE.

Method          | MATH500         | GSM8K            | ASDiv           | MAWPS           | MultiArith      | SVAMP           | MMLU-Pro        | Overall
ROM_CSC (No-BT) | 87.50/2973/2.94 | 99.17/1077/9.21  | 97.50/431/22.62 | 97.50/572/17.05 | 99.17/655/15.14 | 92.50/944/9.80  | 75.24/1814/4.15 | 92.65/1209/11.56
ROM_CSC (BT)    | 87.50/2807/3.12 | 100.00/1042/9.60 | 97.50/399/24.44 | 97.50/527/18.50 | 99.20/616/16.10 | 95.80/906/10.57 | 77.10/1818/4.24 | 93.51/1159/12.37

Table 4: Parameter comparison between the original model and our detection head.

Metric     | Original Model | Our Additional Layers | Ratio
Parameters | 8.00B          | 8.13M                 | 0.10%

4.3 Ablation Studies

4.3.1 Impact of Counterfactual Self-Correction (CSC)

To isolate CSC's effect, we compare ROM (no CSC, no backtracing) with ROM_CSC (No-BT) (CSC, no backtracing). CSC improves accuracy from 92.53% to 92.65% (+0.12%), reduces response length from 1421 to 1209 tokens (-14.9%), and boosts efficiency from SE=9.14 to SE=11.56 (+26.5%). On individual datasets, CSC enables consistent length reductions with maintained accuracy; on arithmetic tasks in particular, it substantially shortens responses (ASDiv: 602→431 tokens, -28.4%; MAWPS: 723→572, -20.9%) while preserving accuracy.
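These ablation deltas follow directly from the metric definitions and can be checked in a few lines (a sketch; all numbers are copied from the reported tables, with the detection-head component sizes taken from Section 4.3.3):

```python
def se(acc, sl):
    """Response efficiency as defined in the table captions: Acc / SL * 100."""
    return round(acc / sl * 100, 2)

def rel(new, old):
    """Relative change in percent, one decimal place."""
    return round((new - old) / old * 100, 1)

assert se(97.50, 399) == 24.44   # ROM_CSC on ASDiv (Table 3, BT row)
assert se(88.33, 4554) == 1.94   # vanilla on MATH500 (main table)

assert rel(11.56, 9.14) == 26.5  # CSC ablation: overall SE gain
assert rel(1209, 1421) == -14.9  # CSC ablation: response-length change
assert rel(12.37, 5.60) == 120.9 # ROM_CSC vs. vanilla: the "121%" efficiency gain

# Table 4 overhead: attention (6.29M) + CfC cell (1.57M) + projections (0.27M)
head_params = 6.29e6 + 1.57e6 + 0.27e6
assert round(head_params / 8.00e9 * 100, 2) == 0.10
```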
These results validate that CSC mitigates first-solution bias by teaching the detector to distinguish genuine self-correction from overthinking through balanced wrong→correct trajectories.

4.3.2 Impact of Backtracing

Table 3 shows that ROM_CSC (BT) achieves superior performance: 93.51% accuracy with 1159 tokens vs. 92.65% and 1209 tokens for No-BT, with better performance across 6 of 7 datasets (e.g., GSM8K: 100.0% vs. 99.17%; SVAMP: 95.8% vs. 92.5%). Backtracing's success stems from two factors: (1) it rewinds to the nearest clean boundary (e.g., a newline), preserving complete reasoning steps; and (2) it avoids cutting mid-solution, which would cause the LLM to perceive its reasoning as unfinished. When forced to produce a final answer from an incomplete thought, the model compensates by generating lengthy additional analysis after the answer box, negating the token savings from early stopping. By ensuring a clean exit point, backtracing enables concise final answers.

4.3.3 Efficiency

Table 4 shows the parameter comparison. Our detection head adds only 8.13M parameters (0.10% of the base model), consisting of an attention layer (6.29M), a CfC cell (1.57M), and output projection layers (0.27M).

5 Conclusions and Limitations

We present ROM, a lightweight streaming framework that detects and mitigates overthinking at the token level using a small head on frozen LLM backbones, enabling real-time early stopping with minimal intervention. With Counterfactual Self-Correction (CSC), ROM consistently reduces generation length while preserving or improving accuracy across seven benchmarks, delivering substantial token savings and efficiency gains with only 0.10% extra parameters.

Absence of Clear Data Scaling Behavior. Our method does not exhibit a pronounced scaling law with respect to training-data size: performance on 50% of the training set is comparable to that on the full dataset.
While this indicates strong data efficiency, it also implies that our method does not consistently benefit from larger training sets (at least at our dataset scale).

Dependence on Labeling Model Quality. Our data-generation pipeline relies on a relatively high-capability model for automatic solution labeling. In our experiments, we initially explored open-source alternatives such as Llama-3.3-70B-Instruct (Grattafiori et al., 2024) for labeling, but found their labeling accuracy noticeably lower than that of GPT-4o (OpenAI et al., 2024), which we ultimately adopted. We note that this dependence on a strong labeling model is a limitation shared by virtually all methods that employ automatic labeling for training-data construction.

Impact Statement

This paper proposes a method to reduce unnecessary reasoning in Large Reasoning Models, aiming to make their deployment more efficient and responsive. We believe this work can benefit a broad range of users and practitioners by lowering the computational cost of accessing advanced reasoning capabilities, thereby making such systems more affordable and accessible, particularly in resource-constrained settings such as edge devices or low-budget deployments. Faster response times may also improve user experience in latency-sensitive interactive applications such as education, programming assistance, and real-time decision support.

More broadly, as Large Reasoning Models become increasingly central to AI-powered services, the cumulative cost of redundant computation grows substantially. Our work contributes to the sustainable scaling of these models by demonstrating that significant efficiency gains can be achieved without sacrificing reasoning quality and without requiring expensive model retraining. The plug-and-play nature of our approach means it can be readily adopted alongside existing models as they continue to evolve, lowering the barrier for both researchers and developers to deploy efficient reasoning systems.
As a side benefit, reducing redundant token generation may also contribute to lowering the energy consumption and carbon footprint associated with large-scale model inference. Since our method focuses on improving the efficiency of LRMs, the potential negative impact is indirect and limited. There is a small risk that, if our method is applied without sufficient testing and validation by model service providers, it may lead to service performance degradation in cases not covered by this paper. However, this risk is inherent to any inference-time intervention and can be effectively mitigated through standard evaluation and quality-assurance practices before deployment.

References

Aggarwal & Welleck (2025) Aggarwal, P. and Welleck, S. L1: Controlling how long a reasoning model thinks with reinforcement learning, 2025. URL https://arxiv.org/abs/2503.04697.

Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., et al. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.

Chen et al. (2025) Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., Wang, R., Tu, Z., Mi, H., and Yu, D. Do not think that much for 2+3=? On the overthinking of o1-like LLMs, 2025. URL https://arxiv.org/abs/2412.21187.

Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Fu et al.
(2025) Fu, Y., Chen, J., Zhu, S., Fu, Z., Dai, Z., Zhuang, Y., Ma, Y., Qiao, A., Rosing, T., Stoica, I., and Zhang, H. Efficiently scaling LLM reasoning with certaindex, 2025. URL https://arxiv.org/abs/2412.20993.

Grattafiori et al. (2024) Grattafiori, A., Dubey, A., Jauhri, A., et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Guo et al. (2025) Guo, D., Yang, D., Zhang, H., et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z.

Hendrycks et al. (2021) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

Koncel-Kedziorski et al. (2016) Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. MAWPS: A math word problem repository. In Knight, K., Nenkova, A., and Rambow, O.
(eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136/.

Krishna et al. (2025) Krishna, K., Cheng, J. Y., Maalouf, C., and Gatys, L. A. Disentangled safety adapters enable efficient guardrails and flexible inference-time alignment, 2025. URL https://arxiv.org/abs/2506.00166.

Li et al. (2026) Li, G., Cai, W., Gao, Y., and Wu, Y. SyncThink: A training-free strategy to align inference termination with reasoning saturation, 2026. URL https://arxiv.org/abs/2601.03649.

Li et al. (2025a) Li, X., Wu, M., Zhu, Y., Lv, Y., Chen, Y., Chen, C., Guo, J., and Xue, H. Kelp: A streaming safeguard for large models via latent dynamics-guided risk detection, 2025a. URL https://arxiv.org/abs/2510.09694.

Li et al. (2025b) Li, Y., Sheng, Q., Yang, Y., Zhang, X., and Cao, J. From judgment to interference: Early stopping LLM harmful outputs via streaming content monitoring, 2025b. URL https://arxiv.org/abs/2506.09996.

Luo et al. (2025) Luo, H., Shen, L., He, H., Wang, Y., Liu, S., Li, W., Tan, N., Cao, X., and Tao, D. O1-Pruner: Length-harmonizing fine-tuning for O1-like reasoning pruning, 2025. URL https://arxiv.org/abs/2501.12570.

Miao et al. (2020) Miao, S.-y., Liang, C.-C., and Su, K.-Y. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 975–984, 2020.

Obeso et al. (2025) Obeso, O., Arditi, A., Ferrando, J., Freeman, J., Holmes, C., and Nanda, N. Real-time detection of hallucinated entities in long-form generation, 2025. URL https://arxiv.org/abs/2509.03531.

OpenAI (2024) OpenAI. OpenAI o1, 2024. URL https://openai.com. Technical report and system card.

OpenAI et al.
(2024) OpenAI, Achiam, J., Adler, S., Agarwal, S., et al. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.

Patel et al. (2021) Patel, A., Bhattamishra, S., and Goyal, N. Are NLP models really able to solve simple math word problems? In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y.
(eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168/.

Roy & Roth (2015) Roy, S. and Roth, D. Solving general arithmetic word problems. In Màrquez, L., Callison-Burch, C., and Su, J. (eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1743–1752, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202/.

Sharma et al. (2025) Sharma, M., Tong, M., Mu, J., et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming, 2025. URL https://arxiv.org/abs/2501.18837.

Sui et al. (2025) Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., Chen, H., and Hu, X. Stop overthinking: A survey on efficient reasoning for large language models, 2025. URL https://arxiv.org/abs/2503.16419.

Team (2025) Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/.

Wang et al. (2025) Wang, X., McInerney, J., Wang, L., and Kallus, N. Entropy after </think> for reasoning model early exiting, 2025. URL https://arxiv.org/abs/2509.26522.

Wang et al.
(2024) Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024.

Wei et al. (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.

Wei et al. (2026) Wei, Z., Pang, L., Liu, J., Shi, W., Deng, J., Xu, S., Duan, Z., Sun, F., Shen, H., and Cheng, X. The evolution of thought: Tracking LLM overthinking via reasoning dynamics analysis, 2026. URL https://arxiv.org/abs/2508.17627.

Xuan et al. (2025) Xuan, Z., Mao, X., Chen, D., Zhang, X., Dong, Y., and Zhou, J. ShieldHead: Decoding-time safeguard for large language models. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 18129–18143, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.932. URL https://aclanthology.org/2025.findings-acl.932/.

Yang et al. (2025a) Yang, A., Li, A., Yang, B., et al. Qwen3 technical report, 2025a. URL https://arxiv.org/abs/2505.09388.

Yang et al. (2025b) Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Chen, M., Lin, Z., and Wang, W.
Dynamic early exit in reasoning models, 2025b. URL https://arxiv.org/abs/2504.15895.

Zhang et al. (2025) Zhang, A., Chen, Y., Pan, J., Zhao, C., Panda, A., Li, J., and He, H. Reasoning models know when they're right: Probing hidden states for self-verification, 2025. URL https://arxiv.org/abs/2504.05419.

Appendix A Full version of the main table with reasoning length and reasoning efficiency

Table 5: Comprehensive evaluation results across all methods and datasets. Acc denotes accuracy (%), RL denotes reasoning length (tokens), SL denotes response length (tokens), RE denotes reasoning efficiency (Acc/RL×100), and SE denotes response efficiency (Acc/SL×100). MATH covers the mathematical reasoning tasks and MCQ the multiple-choice task (MMLU-Pro). Higher values are better for Acc, RE, and SE; lower values are better for RL and SL. Each cell reports Acc / RL / SL / RE / SE.

Method   | MATH500                   | GSM8K                      | ASDiv                     | MAWPS                     | MultiArith                 | SVAMP                     | MMLU-Pro                  | Overall
Vanilla  | 88.33/2954/4554/2.99/1.94 | 100.00/1722/2043/5.81/4.89 | 96.67/819/1138/11.80/8.49 | 96.67/1074/1267/9.00/7.63 | 100.00/997/1181/10.03/8.47 | 95.00/1646/1874/5.77/5.07 | 76.67/1981/2842/3.87/2.70 | 91.72/1636/2197/7.04/5.60
L1       | 89.20/2133/2603/4.18/3.43 | 100.00/983/1259/10.17/7.94 | 99.20/729/889/13.61/11.16 | 97.50/789/969/12.36/10.06 | 97.50/758/930/12.86/10.48  | 96.70/933/1138/10.36/8.50 | 71.40/1016/1401/7.03/5.10 | 93.07/1049/1313/10.08/8.10
EAT      | 89.20/3761/4297/2.37/2.08 | 100.00/1273/1594/7.86/6.27 | 96.70/955/1136/10.13/8.51 | 96.70/1004/1197/9.63/8.08 | 100.00/958/1142/10.44/8.76 | 95.00/1251/1480/7.59/6.42 | 76.70/1543/2052/4.97/3.74 | 93.47/1535/1843/7.57/6.26
Cut2048  | 85.00/1867/2563/4.55/3.32 | 98.33/1409/2041/6.98/4.82  | 98.33/853/1135/11.53/8.66 | 96.67/1027/1265/9.41/7.64 | 97.50/934/1179/10.44/8.27  | 95.00/1379/1872/6.89/5.07 | 78.57/1502/2839/5.23/2.77 | 92.77/1282/1842/7.86/5.79
Cut1024  | 81.67/1042/1873/7.84/4.36 | 96.67/968/2041/9.99/4.74   | 97.50/784/1135/12.44/8.59 | 96.67/867/1265/11.15/7.64 | 97.50/824/1179/11.83/8.27  | 96.67/940/1872/10.28/5.16 | 72.38/961/2839/7.53/2.55  | 91.29/912/1743/10.15/5.90
Cut512   | 82.50/539/1426/15.31/5.79 | 83.33/513/2041/16.24/4.08  | 97.50/493/1135/19.78/8.59 | 95.83/508/1265/18.86/7.58 | 98.33/512/1179/19.21/8.34  | 86.67/514/1872/16.86/4.63 | 50.48/514/2839/9.82/1.78  | 84.95/513/1680/16.58/5.83
ROM      | 87.50/2652/3063/3.30/2.86 | 99.20/1006/1254/9.86/7.91  | 97.50/500/602/19.50/16.20 | 97.50/581/723/16.78/13.49 | 98.30/613/768/16.04/12.80  | 92.50/1062/1241/8.71/7.45 | 75.20/1899/2299/3.96/3.27 | 92.53/1188/1421/11.16/9.14
ROM_CSC  | 87.50/2422/2807/3.61/3.12 | 100.00/785/1042/12.74/9.60 | 97.50/329/399/29.64/24.44 | 97.50/408/527/23.90/18.50 | 99.20/469/616/21.15/16.10  | 95.80/754/906/12.71/10.57 | 77.10/1462/1818/5.27/4.24 | 93.51/947/1159/15.57/12.37

Appendix B A case from GSM8K comparing the ROM and ROM_CSC methods

B.1 Problem and Metrics

Problem: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?

Table 6: Metrics comparison among the Original, ROM, and ROM_CSC methods.

Metric           | Original | ROM | ROM_CSC | Diff. | Improve.
Cut position     | -        | 228 | 187     | -41   | -
Reasoning length | 637      | 229 | 188     | -449  | 70.5%
Response length  | 836      | 360 | 202     | -634  | 75.8%
Correctness      | ✓        | ✓   | ✓       | -     | -

Compared to the Original model, ROM_CSC achieves a 70.5% reasoning-length reduction and a 75.8% response-length reduction while maintaining correctness.

B.2 Full Responses

B.2.1 ROM Method Response

<think> Okay, let me try to figure out this problem. So, the question is: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? Hmm, let me break it down. First, the robe requires 2 bolts of blue fiber. Then, it also needs half that much white fiber. Wait, half of what? Half of the blue fiber amount, right?
So if blue is 2 bolts, then half of that would be 1 bolt of white fiber? Let me check that again. The problem says "half that much white fiber." The "that much" probably refers to the blue fiber. So if blue is 2 bolts, then half of that is 1 bolt. So the white fiber is 1 bolt. Therefore, total bolts would be blue plus white, which is 2 + 1 = 3 bolts. Is that right?

Wait, but let me make sure I didn’t misinterpret the question. Sometimes "half that much" can be confusing. Let me parse the sentence again: "A robe takes </think>

---

### Final Answer

$$ 3 $$

---

### Explanation:

The problem states:

- A robe requires **2 bolts of blue fiber**.
- It also requires **half that much white fiber**.

The phrase "half that much" refers to **half of the 2 bolts of blue fiber**. So:

- Blue fiber: $ 2 $ bolts
- White fiber: $ \frac{1}{2} \times 2 = 1 $ bolt

Adding them together:

$$ 2 + 1 = 3 $$

### Final Answer:

$$ 3 $$

B.2.2 ROM_CSC Method Response

<think> Okay, let me try to figure out this problem. So, the question is: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?

Hmm, let me break it down. First, the robe requires 2 bolts of blue fiber. Then, it also needs half that much white fiber. Wait, half of what? Half of the blue fiber amount, right? So if blue is 2 bolts, then half of that would be 1 bolt of white fiber? Let me check that again. The problem says "half that much white fiber." The "that much" probably refers to the blue fiber. So if blue is 2 bolts, then half of that is 1 bolt. So the white fiber is 1 bolt. Therefore, total bolts would be blue plus white, which is 2 + 1 = 3 bolts. </think>

---

### Final Answer

$$ 3 $$

B.2.3 Original Response (Vanilla Model)

<think> Okay, let me try to figure out this problem. So, the question is: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?

Hmm, let me break it down. First, the robe requires 2 bolts of blue fiber.
Then, it also needs half that much white fiber. Wait, half of what? Half of the blue fiber amount, right? So if blue is 2 bolts, then half of that would be 1 bolt of white fiber? Let me check that again. The problem says "half that much white fiber." The "that much" probably refers to the blue fiber. So if blue is 2 bolts, then half of that is 1 bolt. So the white fiber is 1 bolt. Therefore, total bolts would be blue plus white, which is 2 + 1 = 3 bolts. Is that right? Wait, but let me make sure I didn’t misinterpret the question. Sometimes "half that much" can be confusing. Let me parse the sentence again: "A robe takes 2 bolts of blue fiber and half that much white fiber." So "that much" refers to the 2 bolts of blue fiber. So half of 2 bolts is 1 bolt. So total is 2 + 1 = 3. That seems straightforward. But maybe someone could misinterpret it as half of the total? Wait, but the wording is "half that much white fiber," where "that much" is referring to the blue fiber. So it’s definitely half of the blue amount. So 2 bolts blue, half that (which is 1) white. So total is 3 bolts. Alternatively, maybe the question is trying to trick me? Let me think. If someone reads it as "half that much" meaning half of the total amount needed? But that would be different. Let me try that. Suppose the total is T, then blue is 2, white is half of T? But that would be a different equation. But the problem says "half that much white fiber," so "that much" is the blue fiber. So I think my original interpretation is correct. Alternatively, maybe "half that much" is referring to half of the amount of white fiber? Wait, no. The structure is "takes 2 bolts of blue fiber and half that much white fiber." So the "half that much" is modifying the white fiber. So "half that much" is half of the blue fiber’s amount. So 2 bolts blue, half of that (which is 1 bolt) white. So total is 3. I think that’s the correct answer. Let me just check with numbers. 
If blue is 2, then white is half of 2, which is 1. Total is 2 + 1 = 3. Yeah, that seems right. I don’t see any other way to interpret it. So the answer should be 3 bolts in total.

**Final Answer**

The total number of bolts needed is 3. </think>

To determine how many bolts of fiber are required to make a robe, we start by breaking down the information provided:

- The robe requires **2 bolts of blue fiber**.
- It also requires **half that much white fiber**.

---

### Step 1: Interpret "half that much"

The phrase "half that much white fiber" refers to **half the amount of blue fiber**. Since the blue fiber is 2 bolts, then:

$$ \text{White fiber} = \frac{1}{2} \times 2 = 1 \text{ bolt} $$

---

### Step 2: Add the quantities

Now that we know:

- Blue fiber = 2 bolts
- White fiber = 1 bolt

We can find the **total number of bolts** required:

$$ \text{Total bolts} = 2 + 1 = 3 $$

---

### Final Answer

$$ 3 $$
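The efficiency and reduction figures reported above can be recomputed directly from the tabulated numbers. The sketch below (plain Python, values copied from Tables 5 and 6) assumes the Overall SE column is the mean of the per-dataset SE values, which matches the reported 12.37 for ROM_CSC:

```python
# Sanity check on the metrics above. Assumption: the Overall efficiency
# column is the mean of the per-dataset efficiencies.

def efficiency(acc_percent, length_tokens):
    # SE as defined in Table 5: accuracy divided by response length, times 100.
    return acc_percent / length_tokens * 100

# ROM_CSC per-dataset (Acc %, SL tokens) pairs from Table 5, MATH500 .. MMLU-Pro.
rom_csc = [(87.50, 2807), (100.00, 1042), (97.50, 399), (97.50, 527),
           (99.20, 616), (95.80, 906), (77.10, 1818)]

per_dataset_se = [efficiency(acc, sl) for acc, sl in rom_csc]
overall_se = round(sum(per_dataset_se) / len(per_dataset_se), 2)
print(overall_se)  # 12.37, as reported in Table 5

# Length reductions from Table 6 (Original -> ROM_CSC on the GSM8K example).
reasoning_reduction = round((637 - 188) / 637 * 100, 1)  # 70.5
response_reduction = round((836 - 202) / 836 * 100, 1)   # 75.8
print(reasoning_reduction, response_reduction)
```

Averaging per-dataset efficiencies (rather than dividing overall accuracy by overall length) weights every benchmark equally, which explains why the Overall SE column differs from Acc/SL computed on the Overall row itself.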