Paper deep dive
TARo: Token-level Adaptive Routing for LLM Test-time Alignment
Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/22/2026, 6:03:36 AM
Summary
TARo (Token-level Adaptive Routing) is a test-time alignment method for LLMs that uses a learnable token-level router to dynamically blend logits from a frozen base model and a reasoning-aware reward model. This approach eliminates the need for manual hyperparameter tuning and enables robust, cross-domain reasoning improvements without retraining the base model.
Entities (6)
Relation Signals (3)
Learnable token-level router → partof → TARo
confidence 100% · we introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model.
TARo → improvesperformanceon → MATH500
confidence 95% · TARO significantly improves reasoning performance by up to +22.4% over base model... on MATH500
TARo → outperforms → GenARM
confidence 95% · TARO consistently exceeds GenARM across domains
Cypher Suggestions (2)
Find all benchmarks where TARo was evaluated. · confidence 90% · unvalidated
MATCH (m:Method {name: 'TARo'})-[:EVALUATED_ON]->(b:Benchmark) RETURN b.name
Identify methods that TARo outperforms. · confidence 90% · unvalidated
MATCH (t:Method {name: 'TARo'})-[:OUTPERFORMS]->(m:Method) RETURN m.name
Abstract
Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
Tags
Links
- Source: https://arxiv.org/abs/2603.18411v1
- Canonical: https://arxiv.org/abs/2603.18411v1
Full Text
78,297 characters extracted from source content.
TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai∗ 1 2, Qiang Zhang 1, Hanqing Zeng 1, Yunkai Zhang 3, Dipesh Tamboli 1, Xiangjun Fan 1, Zhuokai Zhao† 1, Lizhu Zhang† 1
*Work done during internship at Meta. † Joint last author
1 Meta  2 University of Pittsburgh  3 University of California, Berkeley
arr159@pitt.edu; qiangzhang, zhuokai, lizhu@meta.com

Abstract

Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARO), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARO significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARO also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

1 Introduction

Large Language Models (LLMs) have achieved impressive performance across many natural language tasks (OpenAI, 2024; Guo et al., 2025; Team et al., 2025; Yang et al., 2025a).
On complex domains, such as mathematics, science, and clinical reasoning, it remains challenging to reliably solve logically demanding problems (Mirzadeh et al., 2024; Wang et al., 2025a; Cui et al., 2025; Wang et al., 2025b; Xiong et al., 2026). Recent advances in LLM post-training, especially reinforcement learning with verifiable reward (RLVR) approaches such as group relative policy optimization (GRPO) (Shao et al., 2024), dynamic sampling policy optimization (DAPO) (Yu et al., 2025a) and others (Zheng et al., 2025a; Yang et al., 2025b), have substantially improved reasoning.

Figure 1: Performance on MATH500 (accuracy) and AlpacaEval (length-controlled win rate) for the state-of-the-art test-time alignment approach (GenARM) under different mixing coefficients α ∈ [0, 1]. α = 0 corresponds to decoding solely from the base model, while α = 1 uses only the reward model.

However, post-training approaches require costly model updates (Casper et al., 2023; Hou et al., 2024), tend to be domain-specific (Wu et al., 2025; Qi et al., 2024), and often degrade non-reasoning capabilities or disrupt previously learned user preferences (Chen et al., 2024; Xiao et al., 2025). Moreover, retraining becomes increasingly impractical for larger LLMs and especially prohibitive when robust reasoning is needed across multiple, frequently changing domains.

Test-time alignment offers a lighter and more versatile alternative by steering a frozen LLM (the base model) during decoding with a reward model (usually a smaller LLM) that provides domain expertise or user preference signals complementary to the base model (Pan et al., 2025; Zhang et al., 2025). While this paradigm avoids costly retraining, existing approaches typically rely on fixed interpolation weights between the base and reward models (Xu et al., 2024), requiring manual tuning and offering no mechanism to adapt guidance as decoding unfolds or as domains change.
arXiv:2603.18411v1 [cs.CL] 19 Mar 2026

In general-purpose deployments, where a model must handle diverse requests across tasks and domains, this rigidity becomes a significant limitation. Furthermore, as base models are scaled (Xu et al., 2024), the optimal balance between base and reward model guidance shifts, yet fixed interpolation weights offer no mechanism to accommodate this.

As shown in Fig. 1, the performance of existing token-level test-time alignment is sensitive to hyperparameter choices, and the optimal hyperparameter varies substantially across domains and model families. For instance, fixing the interpolation weight at α = 0.5, as in GenARM (Xu et al., 2024), can even cause the guided model to underperform the base model on certain tasks. Although test-time alignment is attractive for its flexibility, this lack of adaptive control over reward guidance limits robustness when transferring across tasks, domains, and model scales.

In this work, we propose Token-level Adaptive Routing (TARO), which enables robust reasoning improvement without retraining the base model. We first train the reward model on step-wise mathematical reasoning traces to capture fine-grained logical consistency signals. To make this reward guidance effective across domains and model scales, we introduce a learnable token-level router that dynamically combines the base and reward model outputs at each decoding step, eliminating the need for manual hyperparameter tuning and improving stability when transferring across tasks, domains, and model families. We evaluate TARO on both reasoning and non-reasoning benchmarks, including MATH500 (Lightman et al., 2023a) for mathematical reasoning, MedXpertQA (Zuo et al., 2025) for clinical reasoning, and AlpacaEval (Li et al., 2023) for instruction following.
Our method consistently outperforms state-of-the-art test-time alignment methods, achieving up to +22.4% accuracy over the base model and +8.4% over GenARM (Xu et al., 2024) on MATH500 (Lightman et al., 2023a), while also delivering robust gains on out-of-distribution tasks such as clinical reasoning and instruction following. Notably, the proposed router exhibits weak-to-strong generalization: when trained on smaller models, it transfers effectively to much larger backbones (base models) without retraining, indicating that the learned token-level modulation is both scale- and architecture-agnostic.

To summarize, our contributions are threefold:
① Token-level reasoning rewards: we show that step-wise mathematical traces can train effective reward models for test-time reasoning guidance.
② Adaptive token-level router: we propose a lightweight, learnable router that removes the need for manual interpolation tuning by dynamically blending base and reward logits.
③ Robust, transferable reasoning: TARO consistently improves reasoning across domains and model scales without additional training, extending test-time alignment from preference optimization to general, cross-domain reasoning.

2 Related Work

Test-time alignment. Expensive policy optimization methods have motivated a shift toward dynamic alignment approaches that operate during inference. Some TTA methods, such as Best-of-N sampling (Gao et al., 2022), rely on trajectory-level rewards and require multiple complete forward passes. Others apply trajectory-level reward models at each decoding step over full rollouts (Chakraborty et al., 2024; Huang et al., 2024) or partial rollouts (Khanov et al., 2024; Li et al., 2024a), making them prohibitively costly. In contrast, GenARM (Xu et al., 2024) learns to predict token-level rewards from preference data, eliminating the need for rollouts altogether.
Concurrent work to ours, UniR (Kim et al., 2025), also explores test-time alignment for reasoning, training a reward model with GRPO (Shao et al., 2024) alongside a frozen base model. Our approach is more similar to GenARM: we learn a mathematical reasoning reward model from preference data, independently of the base model. Beyond both GenARM and UniR, we further study how to achieve robust and adaptive test-time reasoning without relying on a fixed reward and base model interpolation.

Post-training methods for reasoning. Supervised finetuning (SFT) (Guha et al., 2025; Li et al., 2024b) has been used to enhance reasoning ability during post-training on datasets distilled from more advanced models (DeepSeek-AI et al., 2025; OpenAI, 2024) or carefully curated datasets (Yue et al., 2023; Ye et al., 2025). Recently, reinforcement learning from verifiable rewards (Shao et al., 2024; Yu et al., 2025b; Liu et al., 2025; Zheng et al., 2025b) has been used to significantly improve the reasoning ability of large language models. Our method seeks to improve the reasoning ability of LLMs as well, but does not require training the base or policy model.

Mixture of Experts. Mixture-of-Experts (MoE) models have recently emerged as the state-of-the-art architecture for improving LLM capacity (Fedus et al., 2021; Dai et al., 2024; Yang et al., 2025a). In MoE, each expert specializes in a task domain (Li et al., 2022; Sukhbaatar et al., 2024), and a router selects the most suitable experts for different input data. Recently, MoE has also been applied as adapters (Li et al., 2024c; Tian et al., 2024; Zeng et al., 2025) on top of a frozen base model in parameter-efficient fine-tuning use cases. However, most existing works need to jointly train experts and router, where the router performs expert selection based on the model's hidden embeddings. Such designs make it less flexible to replace experts at test time.
In this work, we adapt the idea of MoE for test-time alignment, where we treat the base and reward models as experts and instantiate a router which is trained separately. We tailor the routing mechanism so that no re-training is needed when scaling up base models, leading to flexible and lightweight test-time alignment.

3 Method

3.1 Preliminaries

Our work builds on GenARM (Xu et al., 2024), which reformulates the trajectory-level reward $r(x, y)$, a scalar score assigned to a full input-output pair, into token-level rewards produced by a reward model. Formally, let $x \in \mathcal{X}$ denote a prompt (input sequence), and let $y = (y_1, \ldots, y_{|y|}) \in \mathcal{Y}$ denote the completion (LLM response) as a sequence of tokens. GenARM models the reward as the log-likelihood of the trajectory under a reward-parameterized language model $\pi_{\text{reward}}$, i.e.

$$r(x, y) = \sum_{t=1}^{|y|} \log \pi_{\text{reward}}(y_t \mid x, y_{<t}),$$

where $y_{<t} = (y_1, \ldots, y_{t-1})$ denotes the prefix up to step $t-1$. In GenARM, the reward model $\pi_{\text{reward}}$ is trained on human preference data (Ouyang et al., 2022) using a preference loss. At inference, GenARM combines the base model and the reward model at the token level. Specifically, the next-token distribution is given by a weighted sum of the base and reward model distributions:

$$\pi_{\text{guided}}(y_t \mid x, y_{<t}) = \pi_{\text{base}}(y_t \mid x, y_{<t}) + \alpha \, \pi_{\text{reward}}(y_t \mid x, y_{<t}),$$

where $\alpha$ is a scalar controlling the influence of the reward model. Next, we introduce our proposed method, TARO. At a high level, we first train a reasoning reward model, and then learn a lightweight token-level router that adaptively combines the logits from the base and reward models during decoding. We detail each component below.

3.2 Reasoning Reward LLM

Unlike GenARM (Xu et al., 2024), which learns rewards from preference pairs without explicitly modeling reasoning, we train a reasoning-aware reward model that directly targets stepwise logical correctness.
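As a concreteness aid, here is a minimal pure-Python sketch of two ingredients: the fixed-α GenARM-style combination from §3.1 and the Bradley-Terry step-wise preference loss that the §3.2 reward model is trained with (given below). Function names are illustrative, not the authors' code; GenARM's combination is sketched here over log-probabilities with renormalization.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def guided_next_token_logprobs(base_logits, reward_logits, alpha=0.5):
    """Fixed-weight test-time combination in the spirit of Sec. 3.1:
    the reward model's log-distribution is added to the base model's
    with a single global scalar alpha, then renormalized."""
    lp_base = log_softmax(base_logits)
    lp_reward = log_softmax(reward_logits)
    return log_softmax([b + alpha * r for b, r in zip(lp_base, lp_reward)])

def step_preference_loss(r_w, r_l, beta_r=1.0):
    """Bradley-Terry step-wise preference loss from Sec. 3.2:
    -log sigmoid(beta_r * (r(y_w) - r(y_l))), where each r is the summed
    per-token log-likelihood of a step under the reward model.
    Computed in a numerically stable form: log(1 + exp(-z))."""
    z = beta_r * (r_w - r_l)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With α = 0 the guided distribution reduces exactly to the base model, matching the fixed-interpolation behavior plotted in Fig. 1.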
In practice, we use the Math-StepDPO-10K (Lai et al., 2024) dataset, which captures fine-grained reasoning dynamics by constructing preference pairs between two step completions that share an identical correct reasoning prefix but diverge at the next step, yielding one logically valid continuation $y_w$ and one incorrect continuation $y_l$. Our reasoning reward model is thus optimized to prefer logically coherent, step-by-step reasoning over erroneous continuations by minimizing the standard preference loss (Ouyang et al., 2022; Bradley and Terry, 1952):

$$\ell_{\text{pref}} = -\log \sigma\big(\beta_r \, r(y_w, [x, \text{prefix}]) - \beta_r \, r(y_l, [x, \text{prefix}])\big),$$

where $[x, \text{prefix}]$ denotes the concatenation of the question $x$ and the shared correct reasoning prefix, $\beta_r$ is a temperature-like scaling factor, and $\sigma(\cdot)$ is the logistic sigmoid. This objective encourages the reward model $\pi_{\text{reward}}$ to assign higher scores to steps that continue the reasoning correctly and lower scores to invalid ones.

To reconcile step-level supervision with the token-level reward parameterization, we treat each step $y = (y_1, \ldots, y_{|y|})$ as a short trajectory and decompose its scalar reward into per-token log-likelihoods under the reward model:

$$r(y, [x, \text{prefix}]) = \sum_{t=1}^{|y|} \log \pi_{\text{reward}}(y_t \mid [x, \text{prefix}], y_{<t}),$$

where $y_{<t} = (y_1, \ldots, y_{t-1})$. This decomposition enables step-level preferences to supervise token-wise reward signals, preserving the fine-grained token-level formulation while aligning it with reasoning-driven correctness rather than flat, response-level preferences.

Figure 2: Learnable token-level router design. At each LLM decoding step $t$, the base and reward models produce logits $z^{\text{base}}_t$ and $z^{\text{reward}}_t$. The logits are passed as input to Feature Concat, which either (i) concatenates logits, or (ii) concatenates logits plus learnable token-index embeddings (as discussed in §3.3). The router consumes the concatenated feature and outputs a routing weight $\alpha_t \in (0, 1)$. The guided distribution $(1 - \alpha_t) z^{\text{base}}_t + \alpha_t z^{\text{reward}}_t$ is then used for sampling the next token. This design makes the router portable across base model scales and families.

3.3 Learnable Token-level Router

A fixed interpolation between base and reward model logits, as shown in Fig. 1, is fragile: one global mixing coefficient (i.e., $\alpha$) must simultaneously work across domains, model scales, and every decoding step. To make the alignment adaptive, we introduce a lightweight token-level router that dynamically chooses how much to trust and utilize each model at every generation step.

Concretely, at decoding step $t$, the frozen base and reward models produce logits $z^{\text{base}}_t$ and $z^{\text{reward}}_t$. A lightweight feed-forward router $g_\theta$ takes as input features derived from these two logits and outputs a scalar, which is squashed to a routing coefficient $\alpha_t = \sigma(g_\theta(\cdot)) \in (0, 1)$ that determines how much the guided decoding should follow the reward model versus the base model:

$$z^{\text{guided}}(\cdot \mid x, y_{<t}) = (1 - \alpha_t) z^{\text{base}}_t + \alpha_t z^{\text{reward}}_t.$$

Essentially, the proposed $\alpha_t$ adaptively adjusts the influence of the reward model token by token, in contrast to a fixed $\alpha$ that is expected to work universally across domains and model scales.

To prepare the router's input for predicting $\alpha_t$, we investigate two feature constructions that fundamentally differ in whether token position (index) information from the base and reward models is explicitly encoded.

Full-logits concatenation. In this design, we concatenate the logits from the base and reward models before passing them through a small multi-layer perceptron (MLP), i.e.

$$h^{\text{full}}_t = \big[ z^{\text{base}}_t ; z^{\text{reward}}_t \big] \in \mathbb{R}^{2V}, \quad (1)$$

where $V$ is the size of the vocabulary. This design is straightforward and utilizes the existing logit distributions of both models.

Top-k logits with index embedding.
Instead of operating purely in the raw logits space, we also consider explicitly encoding token index information by pairing each selected logit $z_{t,i}$ with a learnable index embedding $e_i$. For each chosen token $i$ from the base model we form a feature vector that combines its logit value with its embedding; the same is done for tokens $j$ from the reward model. Formally, we have:

$$u^{\text{base}}_{t,i} = \big[ z^{\text{base}}_{t,i} ; e_i \big] \in \mathbb{R}^{d+1}, \qquad u^{\text{reward}}_{t,j} = \big[ z^{\text{reward}}_{t,j} ; e_j \big] \in \mathbb{R}^{d+1}.$$

Here $e_i = E(i)$ comes from a $d$-dimensional learnable encoder $E$, allowing the router to represent each token's identity rather than treating all tokens the same in the logit vector. Since the position information is explicitly encoded, in practice we can restrict the inputs to only the top-$k$ tokens from each model. This keeps the feature representation compact while preserving the most informative candidates for routing.

We then concatenate all index-augmented features from both models into a single vector:

$$h^{\text{top-}k}_t = \big[ u^{\text{base}}_{t,1}, \ldots, u^{\text{base}}_{t,k} ; u^{\text{reward}}_{t,1}, \ldots, u^{\text{reward}}_{t,k} \big].$$

This vector $h^{\text{top-}k}_t \in \mathbb{R}^{2k(d+1)}$ is passed through the same MLP as in the full-logits design to produce $\alpha_t$. Note that $k$ is meant to be very small, i.e., $|h^{\text{top-}k}_t| \ll |h^{\text{full}}_t|$.

Router design. In both cases, the resulting representation $h_t$ is passed through the same shallow MLP to predict the routing weight:

$$\hat{\alpha}_t = \sigma\big( W_2 \, \phi(W_1 h_t + b_1) + b_2 \big),$$

where $\phi$ is the Tanh activation and $\sigma$ is the sigmoid function, constraining $\hat{\alpha}_t \in (0, 1)$.

To promote confident routing behavior, we optionally add an entropy regularizer on $\hat{\alpha}_t$. This encourages the router to avoid indecisive values (e.g., $\hat{\alpha}_t \approx 0.5$) when the base and reward models diverge, thereby helping it to commit to the source it considers more reliable.
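The router pieces of §3.3 can be sketched in a few lines of pure Python. This is an illustrative re-implementation (plain lists instead of tensors, hypothetical weight arguments), not the authors' code: the full-logits feature of Eq. (1), the top-k index-embedding feature, the shallow Tanh/sigmoid MLP, and the training objective $\mathcal{L}_{\text{router}}$ given just below.

```python
import math

def shallow_router(h, W1, b1, W2, b2):
    """Sec. 3.3 router: alpha_t = sigmoid(W2 . tanh(W1 h + b1) + b2), in (0, 1)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(W1, b1)]
    s = sum(w * x for w, x in zip(W2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-s))

def full_logits_feature(z_base, z_reward):
    """Full-logits concatenation h_t = [z_base ; z_reward] (Eq. 1), in R^{2V}."""
    return list(z_base) + list(z_reward)

def topk_feature(z, k, embed):
    """Top-k variant: keep the k largest logits and pair each with its
    learnable d-dimensional index embedding, giving u_i = [z_i ; e_i]."""
    top = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    feat = []
    for i in top:
        feat += [z[i]] + list(embed[i])
    return feat  # length k * (d + 1); concatenate base and reward halves

def bernoulli_entropy(a, eps=1e-12):
    """H(a) = -a log a - (1 - a) log(1 - a), clamped for stability."""
    a = min(max(a, eps), 1.0 - eps)
    return -a * math.log(a) - (1.0 - a) * math.log(1.0 - a)

def router_loss(gold_logprobs, alphas, lam=0.5):
    """L_router = NLL of gold tokens under the guided distribution
    + lam * sum_t H(alpha_t), pushing alpha_t toward decisive 0/1 values."""
    return -sum(gold_logprobs) + lam * sum(bernoulli_entropy(a) for a in alphas)
```

Note that no target values for $\hat{\alpha}_t$ appear anywhere: the routing weight is shaped only through the gold-token likelihood and the entropy penalty.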
The overall training objective combines the standard negative log-likelihood with the entropy penalty:

$$\mathcal{L}_{\text{router}} = -\sum_t \log \pi_{\text{guided}}(y^\star_t \mid x, y_{<t}) + \lambda_{\text{entropy}} \sum_t H(\hat{\alpha}_t),$$

where $y^\star_t$ is the gold target token and

$$H(\hat{\alpha}_t) = -\hat{\alpha}_t \log \hat{\alpha}_t - (1 - \hat{\alpha}_t) \log(1 - \hat{\alpha}_t)$$

is the Bernoulli entropy of the router's decision. The hyperparameter $\lambda_{\text{entropy}} \geq 0$ controls the strength of this confidence regularization.

Note that no ground-truth values of $\hat{\alpha}_t$ are required. Instead, the router implicitly optimizes $\hat{\alpha}_t$ through $\mathcal{L}_{\text{router}}$: (1) the NLL term penalizes routing decisions that reduce the likelihood of gold tokens, and (2) the entropy term encourages hard routing decisions over mixing uniformly between the two models.

Final guided decoding. With the learned router, the decoding distribution becomes:

$$\pi_{\text{guided}}(y_t \mid x, y_{<t}) = (1 - \hat{\alpha}_t) \, \pi_{\text{base}}(y_t \mid x, y_{<t}) + \hat{\alpha}_t \, \pi_{\text{reward}}(y_t \mid x, y_{<t}).$$

This allows dynamic token-level modulation of reward guidance, improving reasoning ability while mitigating performance drops across domains.

4 Experiment

4.1 Experimental Setup

Benchmarks. We evaluate TARO on two reasoning domains: MATH500 (Lightman et al., 2023b), which is in-distribution with respect to the router training recipe, and MedXpertQA (Zuo et al., 2025), which is out-of-distribution. We also include AlpacaEval (Li et al., 2023), a general instruction-following benchmark, for the multi-domain experiment. AlpacaEval mainly consists of knowledge-intensive question answering, but also includes simpler reasoning tasks in mathematics and coding.

Models. We experiment with two model families: Llama-3.1 (Llama Team, 2024) and Qwen-2.5 (Qwen et al., 2025), using their instruct variants unless otherwise stated. For the reward models, we use DeepSeek-R1-Distill-Llama 8B (DeepSeek-AI et al., 2025) and Qwen2.5-0.5B. Reward models are trained on the step-wise preference reasoning dataset as discussed in §3.2.

Implementation details. Following Xu et al. (2024) and Kim et al.
(2025), we train a separate reward model for each base model family. For the router, we train the learnable router on examples from Math-StepDPO-10K (Lai et al., 2024) and UltraFeedback (Cui et al., 2023). More training details and hyperparameters are given in Appendix A.

During decoding, following Xu et al. (2024), we use standard sampling with temperature t = 0.5 across all models. We generate up to 512 tokens for AlpacaEval and MedXpertQA, and up to 2,048 tokens for MATH500. Prompts used in our experiments are reported in Appendix G.

Baselines. We compare TARO against (i) the base models, (ii) the reward models, and (iii) two state-of-the-art test-time alignment methods: GenARM (Xu et al., 2024) and UniR (Kim et al., 2025). For GenARM, we use the same reward model trained on MATH-StepDPO-10K (Lai et al., 2024) as in our method (α = 0.5, equivalent to equal base and reward weighting).

Table 1: Performance across reasoning (MATH500, MedXpertQA) and instruction-following (AlpacaEval) benchmarks. Reward models are trained on Math-StepDPO-10K as described in §3.2. Results for UniR are taken from its original paper; * indicates evaluated with our Math-StepDPO-10K-trained reasoning reward model.

Method | MATH500 | MedXpertQA | AlpacaEval | Average
Llama-3.1-8B Instruct Base + DeepSeek Distill Llama-8B Reward
UniR (Kim et al., 2025) | 51.2 | – | – | –
Llama-3.1-8B Instruct (Base) | 32.0 | 13.0 | 17.3 | 20.8
DeepSeek Llama-8B (Reward) | 47.2 | 11.6 | 2.8 | 20.5
GenARM (Xu et al., 2024)* | 49.2 | 11.2 | 10.8 | 23.7
TARO (Ours) | 54.4 | 13.2 | 20.8 | 29.5
Qwen-2.5-3B Instruct Base + Qwen-2.5-0.5B Instruct Reward
UniR (Kim et al., 2025) | 64.8 | – | – | –
Qwen-2.5 3B (Base) | 60.8 | 11.8 | 33.6 | 35.3
Qwen-2.5 0.5B (Reward) | 24.2 | 8.0 | 4.5 | 12.2
GenARM (Xu et al., 2024)* | 56.4 | 9.8 | 18.9 | 28.4
TARO (Ours) | 64.8 | 9.8 | 21.6 | 32.0

4.2 Results Across Diverse Domains

Table 1 reports results across MATH500, MedXpertQA, and AlpacaEval. For the Llama-3.1 family, our method improves over both the base and reward models individually, and also outperforms GenARM.
The largest gains appear on MATH500, where the reasoning reward model is extremely helpful with mathematical reasoning, and we also observe improvements on the out-of-distribution MedXpertQA domain.

For the Qwen-2.5 family, the base model is already very strong, especially on AlpacaEval. While our method does not outperform the Qwen-2.5 base model on AlpacaEval and MedXpertQA, it consistently exceeds GenARM across domains, showing that token-level routing provides more effective reward guidance than static interpolation. Importantly, our reward models are trained using a preference loss on step-wise mathematical preference data, which is considerably simpler than the RL objective used in GRPO for UniR (Kim et al., 2025). Despite the relatively low standalone performance of the reward model, this does not imply a lack of utility: the reward model may be overfitting to signals of step-wise mathematical reasoning, which could remain highly beneficial when used to steer the base model. However, we are only able to demonstrate utility when this weak reward model is combined with token-level routing. Overall, we show that effective token-level reward models for math can be constructed directly from step-wise mathematical reasoning trace preferences.

We also compare TARO against majority voting (N=8) and show that TARO achieves higher accuracy on MATH500 with approximately 4x less compute. The results are detailed in Appendix D.

Interestingly, GenARM notably improved the performance of Llama-3.1-8B Instruct (Base) but failed to yield gains on Qwen-2.5 3B (Base) when evaluated on MATH500. We hypothesize that GenARM and other similar test-time alignment methods may be ineffective when the reward model performs substantially worse than the base model. In contrast, our proposed method adaptively controls the reward weight, which can significantly mitigate this limitation.
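The majority-voting baseline (N = 8) referenced above can be sketched as follows; this is a generic sketch, where `answers` stands for the final answers parsed from the N sampled responses (the parsing step is not shown and is hypothetical).

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among N sampled responses.
    Ties are broken by first occurrence, since Counter preserves
    insertion order for equal counts in modern CPython."""
    return Counter(answers).most_common(1)[0][0]
```

Because each of the N samples requires a full generation, this baseline costs roughly N base-model rollouts per query, which is the compute that TARO's single guided rollout avoids.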
4.3 Weak-to-Strong Generalization

Figure 3: Weak-to-strong generalization of the learned router on reasoning. The learned router and reasoning reward model are not retrained for this scale.

Using the router trained in §4.2, we evaluate its transferability by scaling to larger base models without any re-training. Specifically, we pair the learned router with Llama-3.1-70B and Qwen-2.5-14B backbones. As shown in Fig. 3, our approach consistently improves over both the base model and GenARM. On MATH500, the transferred router achieves substantial gains, while on MedXpertQA, improvements are smaller but still positive.

Table 2: Tokens in the top 0.1% and bottom 0.1% of α from generated responses to MATH500 questions; tokens with fewer than 50 occurrences or shorter than 2 characters (difficult to interpret) are filtered out. Tokens on the left show strong reward model influence, reflecting mathematical operators, formatting, and reasoning scaffolds (e.g., "cases", "Step"). Tokens on the right are dominated by the base model, largely consisting of vocabulary from the problem context. Base Model = Qwen2.5-3B; Reward Model = Qwen2.5-0.5B.

High α Tokens | Low α Tokens
rightarrow | period
Exp | -coordinate
Step | profit
cases | factors
critical | shaded
Part | squares
Find | students
understand | time
begin | digits
Setting | bag
equiv | houses
Evaluate | digit
analyzing | What
opens | mean
back | pets
determine | sums
compute | bles
end | region
follow | already

This setting highlights the weak-to-strong generalization property of both our router and reward model: routers and reward models trained on relatively small backbones can effectively steer much larger frozen LLMs. Importantly, despite being trained on limited step-wise mathematical preference data, the router provides transferable benefits even in out-of-domain reasoning tasks such as MedXpertQA.
5 Analysis and Ablation Studies

5.1 Understanding Router Behavior

We further investigate which tokens are more influenced by the reward model by analyzing the learned $\hat{\alpha}_t$ values. Table 2 shows tokens from the top 0.1% and bottom 0.1% of average $\hat{\alpha}_t$ on MATH500. High-α tokens include mathematical operators, formatting, and scaffolding symbols (e.g., "cases", "Step"), where the reward model contributes most strongly. In contrast, low-α tokens consist mainly of problem context tokens from the base model, such as common words ("period", "students", "already"). This analysis shows that the router allocates reward-model guidance to tokens that shape the underlying reasoning process, while relying on the base model to preserve fluent language and accurately convey problem-specific details.

Table 3: Ablation on router input features for Llama-3.1-8B Instruct with DeepSeek Distill Llama-8B reward. Results are reported on MATH500. See Appendix C for results on all three benchmarks.

Router Input Feature | MATH500 Accuracy (%)
Reward hidden state only | 51.2
Base + reward hidden states | 49.6
Reward logits only | 52.6
Base + reward logits | 54.4

Table 4: Ablation on Top-K, entropy regularization, and full-logit router designs. Accuracy (%) is reported on MATH500 (MATH) and MedXpertQA (MedX).

Model Variant | MATH | MedX
Qwen K = 100 | 60.2 | 7.7
Qwen K = 1,000 | 60.2 | 7.3
Qwen K = 10,000 | 61.8 | 7.4
Qwen K = 100 + λ_entropy = 0.5 | 59.8 | 7.3
Qwen K = 1,000 + λ_entropy = 0.5 | 64.8 | 9.8
Llama K = 1,000 + λ_entropy = 0.5 | 50.6 | 14.5
Llama Full logits | 54.4 | 13.2

5.2 Qualitative Analysis

To better understand model behavior, Table 5 shows responses from the base model, reward model, and our method to a problem from MATH500. The base model over-simplifies and arrives at an incorrect numeric answer, while the reward model identifies the correct reasoning path but becomes stuck mid-derivation. Our method succeeds by explicitly decomposing the reasoning into interpretable steps, ultimately producing the correct final answer.
This illustrates the benefit of dynamically leveraging both base and reward signals at the token level. We present more examples in Appendix H.

5.3 Ablations on Token-level Router

We first validate the necessity of token-level routing granularity. As shown in Appendix E, a prompt-level router that predicts a single α for the entire sequence significantly under-performs token-level routing on MATH500 (33.2% vs. 49.6%), confirming that fine-grained, per-token control is essential for structured reasoning. Having established token-level routing, we next ablate the choice of input features for the router and the effect of restricting router inputs to the Top-k logits.

Table 5: Qualitative example of base (Llama-3.1 8B Instruct), reward, and our model outputs (abridged to fit) for a MATH500 question. The base model oversimplifies, while the reward model understands the right plan but gets confused. Ours solves correctly by engaging in reasoning steps.

Question: A worker receives an annual wage of $20,000, which he always deposits into a savings account at the end of the year. By the end of the third year (when he makes the third deposit), he wants to have at least $66,200 in the account to finance the purchase of a house. What is the minimal compound interest rate that the savings account must provide? Express your answer as a percentage, but do not include the percent sign.

Ground Truth: 10

Base Model output (abridged): Let r denote the interest rate. The worker will have a total of 20,000(1 + r)^3 in the account by the end of the third year. This is at least 66,200, so we have the inequality: 20,000(1 + r)^3 ≥ 66,200; (1 + r)^3 ≥ 3.31; 1 + r ≥ 3.31^(1/3); r ≥ 3.31^(1/3) − 1; r ≥ 0.322. Thus, the smallest possible value of r is 0.322, or 32.2. Final Answer: 32.2. I hope it is correct.

Reward Model output (abridged): Let the annual interest rate be r. The worker makes three deposits of $20,000 at the end of each year. After the first year: 20,000(1 + r). After the second year: 20,000(1 + r)^2 + 20,000(1 + r). After the third year: 20,000(1 + r)^3 + 20,000(1 + r) + 20,000(1 + r). Wait, this seems incorrect... Wait, no... The correct amount... [... stuck in confusion]

Ours (abridged): Step 1: The worker deposits $20,000 each year, wants at least $66,200 after three years. Step 2: Total after 3 years is 20,000(1 + r)^3 + 20,000(1 + r)^2 + 20,000(1 + r). Step 3: Divide by 20,000: (1 + r)^3 + (1 + r)^2 + (1 + r) ≥ 3.31. Step 4: Let x = 1 + r, then x^3 + x^2 + x ≥ 3.31. Step 5: Solving gives x ≥ 1.1, so r ≥ 0.1. Final Answer: 10

Router input feature choice. On Llama-3.1-8B Instruct and DeepSeek Distill Llama-8B reward, we experimented with several input choices for the router: (1) reward hidden state only, (2) base and reward hidden states, and (3) reward logits. Since hidden states encode token-level context, it may be natural to consider them as router inputs. As shown in Table 3, we found that using the reward hidden state alone outperforms adding the base hidden state on MATH500 (51.2% vs. 49.6%), while on out-of-domain tasks adding the base hidden state improves performance (18.7% vs. 15.7% on AlpacaEval and 13.7% vs. 12.6% on MedXpertQA). However, base and reward hidden-state-based routers cannot reliably generalize to stronger base models, as the base hidden-state distribution shifts with model scale. This motivated our use of logits, which are scale- and domain-agnostic since they reflect model confidence over the predicted token distribution. We also found that using base and reward logits yields the strongest results on MATH500 (54.4%).

Top-k ablation. We next study the effect of restricting the router inputs to the Top-k logits. The motivation for this design is to reduce noise from the full vocabulary distribution and focus the router on the most confident token candidates. As shown in Table 4, on Qwen-2.5, we find that using K = 1,000 performs well, and adding entropy regularization further improves MATH500 accuracy to 64.8%.
The entropy penalty encourages the router to make more decisive choices between the base and reward models, which is particularly effective when the reward model provides complementary signal. On Llama-3.1, however, we observe that the entropy regularization is suboptimal on MATH500. Note in Table 1, the reasoning reward model performs relatively closer to the base model on MATH500 than on AlpacaEval. This could suggest that on most of the soft router training dataset, the base model has a balanced performance, which could drive the router to predict consistently smaller alpha values, since the base model is more consistent in its performance. Possibly, the reasoning reward model is then underused in this configuration. This effect could suggest that the benefit of entropy regularization is sensitive to the relative strengths of the base and reward models.

5.4 Inference Cost

We compare the inference-time efficiency of TARO against GenARM in Table 6, measured on a single node with 8×H100 GPUs. TARO introduces a lightweight router module to dynamically adjust α, and we find that throughput is comparable to GenARM when the Top-K logits (Design i in §3.3) are used. Expanding to full logits (Design ii in §3.3) reduces throughput, particularly for large-vocabulary models such as Llama3.1-8B; however, vocabulary sizes appear to have stabilized in recent model releases, suggesting this overhead will not grow with newer models.

Model         Method          TPS    GPU Mem. (%)    QPS
Qwen2.5-3B    GenARM          85.6   4.29            0.0651
Qwen2.5-3B    TARO (Top-K)    85.9   5.22            0.0448
Llama3.1-8B   GenARM          80.7   6.93            0.0141
Llama3.1-8B   TARO (Full)     32.5   10.45           0.0161

Table 6: Throughput analysis on MATH500. TPS = tokens per second; QPS = queries per second. Note that these results reflect unoptimized implementations; techniques such as speculative decoding and fully sharded data-parallel inference have not been applied.

We report both tokens-per-second (TPS) and queries-per-second (QPS).
Unlike TPS, QPS does not penalize concise generations, making it a more complete measure of end-to-end throughput. In terms of QPS, the overhead is minimal in both settings and even lower than GenARM in the Top-K setting due to more concise outputs. More details are discussed in Appendix B.

6 Conclusion

In this paper, we introduced TARO, a test-time alignment framework that improves LLM reasoning by adaptively routing between a frozen base model and a reward model at the token level. Across mathematical reasoning, clinical reasoning, and instruction-following benchmarks, TARO consistently outperforms fixed-weight decoding baselines while preserving the flexibility and low training cost of inference-time alignment. Our results show that fine-grained reward signals, even when trained from step-wise mathematical preference data, can generalize beyond their source domain when applied through adaptive routing rather than static interpolation. We further find that the learned routing policy transfers to larger backbones without retraining, suggesting that token-level logit routing provides a scalable and portable interface for test-time reasoning control. Overall, TARO highlights that lightweight adaptive routing can be a practical path toward stronger, more robust reasoning in frozen LLMs without expensive post-training.

References

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39:324.

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, and 1 others. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.

Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, A. S. Bedi, and Furong Huang. 2024. Transfer q star: Principled decoding for llm alignment.
In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Angelica Chen, Sadhika Malladi, Lily H Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, and Kyunghyun Cho. 2024. Preference learning algorithms do not learn preference rankings. Advances in Neural Information Processing Systems, 37:101928–101968.

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. ArXiv, abs/2310.01377.

Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, and 1 others. 2025. Curie: Evaluating llms on multitask scientific long context understanding and reasoning. arXiv preprint arXiv:2503.13517.

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. Preprint, arXiv:2401.06066.

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 179 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. ArXiv, abs/2501.12948.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961.

Leo Gao, John Schulman, and Jacob Hilton. 2022. Scaling laws for reward model overoptimization. In International Conference on Machine Learning.
Etash Kumar Guha, Ryan Marten, Sedrick Scott Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean-Pierre Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Ben Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, and 31 others. 2025. Openthoughts: Data recipes for reasoning models. ArXiv, abs/2506.04178.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, and Yuxiao Dong. 2024. Does rlhf scale? Exploring the impacts from data, model, and method. arXiv preprint arXiv:2412.06000.

James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchoff, and Dan Roth. 2024. Deal: Decoding-time alignment for large language models. In Annual Meeting of the Association for Computational Linguistics.

Maxim Khanov, Jirayu Burapacheep, and Yixuan Li. 2024. Args: Alignment as reward-guided search. ArXiv, abs/2402.01694.

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, and Jong Chul Ye. 2025. Universal reasoner: A single, composable plug-and-play reasoner for frozen llms. ArXiv, abs/2505.19075.

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. 2024. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. ArXiv, abs/2406.18629.

Bolian Li, Yifan Wang, Ananth Y. Grama, and Ruqi Zhang. 2024a. Cascade reward sampling for efficient decoding-time alignment. ArXiv, abs/2406.16306.

Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, and Houwen Peng. 2024b. Common 7b language models already possess strong math capabilities. ArXiv, abs/2403.04706.
Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, and Mingjie Tang. 2024c. Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts. Preprint, arXiv:2404.15159.

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. 2022. Branch-train-merge: Embarrassingly parallel training of expert language models. Preprint, arXiv:2208.03306.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023a. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023b. Let's verify step by step. ArXiv, abs/2305.20050.

Zi-Yan Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. 2025. Understanding r1-zero-like training: A critical perspective. ArXiv, abs/2503.20783.

Llama Team. 2024. Introducing llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/. Meta AI Blog.

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229.

OpenAI. 2024. Gpt-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf. Accessed: 2025-10-05.

OpenAI. 2024. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-05-01.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L.
Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.

Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, and Tieyun Qian. 2025. A survey on training-free alignment of large language models. arXiv preprint arXiv:2508.09016.

Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, and James Glass. 2024. Quantifying generalization complexity for large language models. arXiv preprint arXiv:2410.01769.

Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, and Xian Li. 2024. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. Preprint, arXiv:2403.07816.

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, and 1 others. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.

Chunlin Tian, Zhan Shi, Zhijiang Guo, Li Li, and Chengzhong Xu. 2024. Hydralora: An asymmetric lora architecture for efficient fine-tuning. Preprint, arXiv:2404.19245.
Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Yi-Di Wang, Shu Yan, Cheng-Xing Jia, Xu-Hui Liu, Xin-Wei Chen, Jia-Cheng Xu, Ziniu Li, and 1 others. 2025a. A survey on large language models for mathematical reasoning. arXiv preprint arXiv:2506.08446.

Wenxuan Wang, Zizhan Ma, Meidan Ding, Shiyi Zheng, Shengyuan Liu, Jie Liu, Jiaming Ji, Wenting Chen, Xiang Li, Linlin Shen, and 1 others. 2025b. Medical reasoning in the era of llms: A systematic review of enhancement techniques and applications. arXiv preprint arXiv:2508.00669.

Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, and Yuyin Zhou. 2025. Knowledge or reasoning? A close look at how llms think across domains. arXiv preprint arXiv:2506.02126.

Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, and Weijie J Su. 2025. On the algorithmic bias of aligning large language models with rlhf: Preference collapse and matching regularization. Journal of the American Statistical Association, (just-accepted):1–21.

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, and Zhuokai Zhao. 2026. Token-level llm collaboration via fusionroute. arXiv preprint arXiv:2601.05106.

Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, and Sumitra Ganesh. 2024. Genarm: Reward guided generation with autoregressive reward model for test-time alignment. arXiv preprint arXiv:2410.08193.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, and Zhuokai Zhao. 2025b. Let it calm: Exploratory annealed decoding for verifiable reinforcement learning. arXiv preprint arXiv:2510.05251.

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. 2025. Limo: Less is more for reasoning.
ArXiv, abs/2502.03387.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others. 2025a. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, and 16 others. 2025b. Dapo: An open-source llm reinforcement learning system at scale. ArXiv, abs/2503.14476.

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. ArXiv, abs/2309.05653.

Hanqing Zeng, Yinglong Xia, Zhuokai Zhao, Gilbert Jiang, Qiang Zhang, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, and Benyu Zhang. 2025. S'more: Structural mixture of residual experts for llm fine-tuning. Preprint, arXiv:2504.06426.

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, and 1 others. 2025. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235.

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, and 1 others. 2025a. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

Chujie Zheng, Shixuan Liu, Mingze Li, Xionghui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. 2025b. Group sequence policy optimization. ArXiv, abs/2507.18071.

Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. ArXiv, abs/2501.18362.
Appendix

The appendix provides supporting details across eight sections. Appendix A describes implementation and hyperparameter details for training the reward model and token-level router. Appendix B analyzes the parameter overhead introduced by the router for each design variant. Appendix C extends the router input ablation to all three benchmarks. Appendix D compares TARo against majority voting on Qwen2.5-3B. Appendix E validates the choice of token-level over prompt-level routing. We provide generation and evaluation (LLM-as-a-judge) prompts in Appendix F and Appendix G, respectively. Finally, Appendix H presents additional qualitative examples illustrating cases where TARo succeeds and where it fails.

A Experiments Implementation and Hyperparameter Details

Reasoning reward model. Reward model training uses AdamW with a learning rate of 2×10^-5, a cosine learning rate scheduler, batch size 32, and 3 epochs. We set β_r = 0.1 in the preference loss. LoRA adapters are applied with rank 8 and scaling factor α = 16.¹

Token-level router. Router training uses supervised fine-tuning (SFT) to train the MLP layers (hidden dimension size H = 128) on 1,000 samples from each dataset for three epochs. We use a learning rate of 5×10^-6, batch size 32, and 10 warmup steps. Training is conducted on 1-4 NVIDIA H100 GPUs with bfloat16 precision.

B Router Complexity

Compared with GenARM, the router introduces (2V + 1)×(H + 1) learnable parameters for Design (i) and (2Kd + 1)×(H + 1) + d×V for Design (ii). Given the hidden dimension of the router prediction head H = 128, for Llama (V = 128256) and Qwen 2.5 (V = 151936), the router would have 33M and 12M learnable parameters respectively.

C Router Input Ablation: Full Results

Router Inputs                  Scalable to     MATH500    AlpacaEval    MedXpertQA    Avg.
                               Larger Base     (Acc %)    (LC WR %)     (Acc %)
Base + reward hidden states    ✗               49.6       18.7          13.7          27.3
Reward hidden state only       ✓               51.2       15.7          12.6          26.5
Reward logits only             ✓               52.6       19.7          14.4          28.9
Base + reward logits           ✓               54.4       20.8          13.2          29.5

Table 7: Ablation of router input configurations on Llama3.1-8B across all three benchmarks. Scalable to Larger Base indicates whether the router can be applied to a larger base model without retraining. The greyed row is excluded from the scalable comparison. Among scalable designs, base and reward logits on average outperforms using hidden states or logits from the reward model alone. Bold indicates best result per column across all rows.

Table 7 extends the router input ablation from the main paper (§5.3) to all three benchmarks. We first compare two router input configurations: using reward hidden states alone vs. using both base and reward hidden states. The latter consistently improves performance on tasks outside the reward model's training domain (MedXpertQA: 12.6% → 13.7%, AlpacaEval: 15.7% → 18.7%), suggesting that the reward model's hidden states carry a domain bias from mathematical training that the base model's representation can counteract. However, this configuration is not base-model-scalable, as a larger base model's hidden states may differ in both representation and dimensionality from those seen during router training. Logits provide a scale-agnostic alternative, where the vocabulary space remains the same across scales. We observe that reward logits alone already generalize better on OOD tasks than their hidden state counterpart (AlpacaEval: 19.7% vs. 15.7%, MedXpertQA: 14.4% vs. 12.6%). We posit this is because logits encode model confidence over a shared, domain-agnostic vocabulary space, making them a more universal signal for routing regardless of the target domain.
Adding base logits further improves in-domain performance (MATH500: 52.6% → 54.4%) at a slight OOD cost, and achieves the best average across benchmarks, motivating our choice of base and reward logits as the default configuration.

¹ This α differs from the interpolation coefficient in GenARM (Xu et al., 2024) and our router.

D Comparison with Majority Voting

Method                   MATH500    AlpacaEval    MedXpertQA    Est. TFLOPs
                         (Acc %)    (LC WR %)     (Acc %)
Majority Voting (N=8)    56.4       —             10.6          113.6
GenARM                   56.4       18.9          9.8           28.4
TARo                     64.8       21.6          9.8           28.4

Table 8: Comparison with majority voting on Qwen2.5-3B. TFLOPs are estimated for a single forward pass with a 2048-token context. Majority voting is not applicable to AlpacaEval (—) due to its open-ended generation format. Bold indicates best result per column.

Table 8 compares TARo against majority voting and GenARM on Qwen2.5-3B. Note that majority voting cannot be applied to AlpacaEval, as it depends on a known, extractable answer format. This makes majority voting poorly suited for agentic systems, a potential application setting, which need flexible logical reasoning and instruction following rather than fixed-format outputs. TARo outperforms majority voting on MATH500 while requiring 4× less compute, as majority voting requires N = 8 full response samples rather than token-wise generation. On MedXpertQA, TARo underperforms majority voting.

E Prompt-level vs. Token-level Routing

Routing Granularity    MATH500    AlpacaEval    MedXpertQA
                       (Acc %)    (LC WR %)     (Acc %)
Prompt-level           33.2       22.1          13.3
Token-level            49.6       18.7          13.7

Table 9: Comparison of prompt-level versus token-level routing with hidden state inputs on Llama3.1-8B. Token-level routing significantly outperforms prompt-level routing on mathematical reasoning, validating the design choice of adaptive token-level α prediction.

Table 9 compares router performance when trained to predict a single prompt-level α (uniform across all tokens) versus a token-level α at each decoding step.
Prompt-level routing performs significantly worse on mathematical reasoning (MATH500: 33.2% vs. 49.6%), demonstrating that fine-grained token-level control is essential for structured reasoning tasks where the reward model's guidance should be concentrated on specific tokens such as operators, variables, and reasoning scaffolds (see Tab. 2 from main text). Performance differences on OOD tasks (AlpacaEval, MedXpertQA) are smaller, suggesting that token-level granularity matters most in the in-domain reasoning setting.

F Generation Prompts

We share the MATH500 and MedXpertQA generation prompts in this section.

MATH500 Generation Prompt

Solve the following math problem step by step. Please reason step by step, and put your final answer within \boxed{}. Here are some examples:

Problem: Find the domain of the expression √(x − 2)/√(5 − x).
Solution: The expressions inside each square root must be non-negative. Therefore, x − 2 ≥ 0, so x ≥ 2, and 5 − x ≥ 0, so x ≤ 5. Also, the denominator cannot be equal to zero, so 5 − x > 0, which gives x < 5. Therefore, the domain of the expression is [2, 5).
Final Answer: The final answer is [2, 5). I hope it is correct.

Problem: If det A = 2 and det B = 12, then find det(AB).
Solution: We have that det(AB) = (det A)(det B) = (2)(12) = 24.
Final Answer: The final answer is 24. I hope it is correct.

Problem: Terrell usually lifts two 20-pound weights 12 times.
If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?
Solution: If Terrell lifts two 20-pound weights 12 times, he lifts a total of 2 · 12 · 20 = 480 pounds of weight. If he lifts two 15-pound weights instead for n times, he will lift a total of 2 · 15 · n = 30n pounds of weight. Equating this to 480 pounds, we can solve for n: 30n = 480 ⇒ n = 480/30 = 16.
Final Answer: The final answer is 16. I hope it is correct.

Problem: If the system of equations 6x − 4y = a, 6y − 9x = b has a solution (x, y) where x and y are both nonzero, find a/b, assuming b is nonzero.
Solution: If we multiply the first equation by −3/2, we obtain 6y − 9x = −(3/2)a. Since we also know that 6y − 9x = b, we have −(3/2)a = b ⇒ a/b = −2/3.
Final Answer: The final answer is −2/3. I hope it is correct.

MedXpertQA Generation Prompt

# Instructions
* Select the multiple choice answer that best answers the question.
* Please think step by step.
* Write your final answer as "Selected Answer Choice: answer"

# Examples
User: A 29-year-old woman presents to the emergency department with a 1-day history of fever.
She says that it started spontaneously and that it hasn't gotten any better with either rest or acetaminophen. Her past medical history is significant for homelessness, unprotected sexual intercourse with multiple partners, and substance use disorder with multiple substances. Her temperature is 103.5 F (39.5 C), blood pressure is 100/70 mmHg, pulse is 112/min, respirations are 18/min, and oxygen saturation is 93% on room air. On physical exam, a murmur is heard along the left mid-sternal border. The pulmonary exam reveals minor bibasilar crackles. The patient's upper limbs demonstrate many bruises and scars in the antecubital fossa. The patient is started on vancomycin and gentamicin and sent for echocardiography. Based on the results of the echocardiogram, the patient undergoes surgery to remove multiple vegetations from the tricuspid valve. Vancomycin and gentamicin are continued over the next 5 days. On postoperative day 5, the patient presents with bleeding from her gums, oozing from her surgical sites, and recurrent epistaxis. Lab values are obtained as seen below:
Serum: Na+: 135 mEq/L, Cl−: 90 mEq/L, K+: 4.4 mEq/L, HCO3−: 23 mEq/L, BUN: 20 mg/dL, Glucose: 110 mg/dL, Creatinine: 1.0 mg/dL, Ca2+: 10.1 mg/dL, AST: 9 U/L, ALT: 9 U/L
Leukocyte count and differential: Leukocyte count: 6,000 cells/mm3, Lymphocytes: 20%, Monocytes: 1%, Neutrophils: 78%, Eosinophils: 1%, Basophils: 0%
PT: 27 seconds, aPTT: 84 seconds, D-dimer: < 50 μg/L, Hemoglobin: 14 g/dL, Hematocrit: 40%, Platelet count: 150,000/mm3, Mean corpuscular volume: 110 fL, Mean corpuscular hemoglobin concentration: 34 g/dL, RDW: 14%
Which of the following is the most likely cause of this patient's current symptoms?
Answer Choices: (A) Bacterial infection of the bloodstream (B) Defect in von Willebrand factor (C) Acquired hemophilia A (D) Factor VIII deficiency (E) Coagulation cascade activation (F) Disseminated intravascular coagulation (G) Vitamin K deficiency due to antibiotic use (H) Immune-mediated thrombocytopenia (I) Antibiotic therapy (J) Hypersensitivity reaction to antibiotics <|eot_id|>
Assistant: To determine the most likely cause of this patient's current symptoms, let's analyze the situation step by step. A 29-year-old woman presents to the emergency department with a 1-day history of fever. She has a history of homelessness, unprotected sexual intercourse with multiple partners, and substance use disorder.
Her vital signs show a high fever, tachycardia, and slightly low oxygen saturation. A murmur is heard along the left mid-sternal border, and there are minor bibasilar crackles on pulmonary exam. The presence of bruises and scars in the antecubital fossa suggests intravenous drug use. She is started on vancomycin and gentamicin and undergoes surgery to remove vegetations from the tricuspid valve, indicating infective endocarditis. On postoperative day 5, she presents with bleeding from her gums, oozing from surgical sites, and recurrent epistaxis. Lab values show prolonged PT and aPTT, normal platelet count, and normal D-dimer, which suggests a coagulopathy rather than thrombocytopenia or DIC. The prolonged PT and aPTT with normal platelet count and D-dimer are consistent with a coagulation factor deficiency or inhibition. Given the recent antibiotic use, vitamin K deficiency due to antibiotic therapy is a likely cause, as antibiotics can disrupt gut flora that synthesize vitamin K, leading to deficiencies in vitamin K-dependent clotting factors. Therefore, the most likely cause of this patient's current symptoms is: Antibiotic therapy.
Selected Answer Choice: I <|eot_id|>
User: A 55-year-old man is brought to the emergency department after a witnessed tonic-clonic seizure. There was no trauma during the seizure. The patient's medical history includes hypertension, coronary artery disease, and type 2 diabetes. His current medications are hydrochlorothiazide, atenolol, metformin, and aspirin. According to his family, he smokes a pack of cigarettes and consumes 6 to 8 beers daily but does not use recreational drugs. His last alcohol consumption occurred approximately 36 hours before presentation. His blood pressure is 160/110 mm Hg, his heart rate is 98 beats per minute and regular, and his respiratory rate is 22 breaths per minute. He is afebrile but confused and diaphoretic. A complete blood count shows mild macrocytic anemia and thrombocytopenia. A biochemical profile shows a normal glucose level, mild hypokalemia, hypophosphatemia, normal kidney function, and an elevated aspartate aminotransferase level. Chest radiography is unremarkable. Noncontrast CT of the brain shows mild volume loss. Blood and urine drug test results are negative.
An electrocardiogram is notable for sinus tachycardia and motion artifact. Thiamine, multivitamins, folate, and electrolyte repletion are administered. Three hours later, the patient becomes increasingly agitated and confused, with incoherent speech, diaphoresis, and combative behavior that persists even after the cumulative administration of 40 mg of intravenous lorazepam. Physical restraints to prevent him from disrupting his intravenous access only contribute to his agitation. What is the most appropriate initial management option in this patient? Answer Choices: (A) Obtain repeat CT of the brain (B) Transfer to the intensive care unit, and initiate sedation (C) Start a continuous infusion of midazolam (D) Obtain electroencephalography (E) Start a continuous infusion of phenytoin (F) Administer phenobarbital for refractory agitation (G) Perform lumbar puncture to rule out meningitis (H) Administer haloperidol for agitation (I) Initiate propofol infusion for sedation (J) Administer dexmedetomidine for sedation <|eot_id|> Assistant: To determine the most appropriate initial management option for this patient, let's analyze the situation step by step.
The patient is a 29-year-old woman with a history of homelessness, unprotected sexual intercourse with multiple partners, and substance use disorder. She presents with a high fever, tachycardia, and slightly low oxygen saturation. Physical examination reveals a murmur along the left mid-sternal border and minor bibasilar crackles. The presence of bruises and scars in the antecubital fossa suggests intravenous drug use. She is started on vancomycin and gentamicin due to suspected infective endocarditis and undergoes surgery to remove vegetations from the tricuspid valve. On postoperative day 5, she presents with bleeding from her gums, oozing from surgical sites, and recurrent epistaxis. Lab values show prolonged PT and aPTT, normal platelet count, and normal D-dimer, indicating a coagulopathy rather than thrombocytopenia or disseminated intravascular coagulation (DIC). The prolonged PT and aPTT with normal platelet count and D-dimer suggest a coagulation factor deficiency or inhibition.
Given the recent antibiotic use, vitamin K deficiency due to antibiotic therapy is a likely cause, as antibiotics can disrupt gut flora that synthesize vitamin K, leading to deficiencies in vitamin K-dependent clotting factors. The most likely cause of this patient's current symptoms is antibiotic therapy. Selected Answer Choice: B <|eot_id|>

G Evaluation prompts

In this section, we share the evaluation prompt used by Llama3.1-70B to evaluate the extracted answer for MedXpertQA and MATH500. First, we extract the model's answer from the generated output by (1) taking the content inside “ ” for Qwen, (2) taking the span after "Final answer:" for Llama, or (3) using the full generated output as a fallback for either if extraction based on formatting fails. The evaluation prompt considers equivalence between the extracted answer and the ground-truth answer for MedXpertQA. For MATH500, the evaluation prompt provides examples to evaluate based on mathematical equivalence.

MATH500 Evaluation Prompt

Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications.

Examples:

Expression 1: $2x+3$  Expression 2: $3+2x$
Yes

Expression 1: 3/2  Expression 2: 1.5
Yes

Expression 1: $x^2+2x+1$  Expression 2: $y^2+2y+1$
No

Expression 1: $x^2+2x+1$  Expression 2: $(x+1)^2$
Yes

Expression 1: 3245/5  Expression 2: 649
No (these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)

Expression 1: 2/(-3)  Expression 2: -2/3
Yes (trivial simplifications are allowed)

Expression 1: 72 degrees  Expression 2: 72
Yes (give benefit of the doubt to units)

Expression 1: 64  Expression 2: 64 square feet
Yes (give benefit of the doubt to units)

- YOUR TASK

Respond with only "Yes" or "No" (without quotes). Do not include a rationale.

Expression 1: %(expression1)s
Expression 2: %(expression2)s

MedXpertQA Evaluation Prompt

Look at the following two answers and judge whether they are equivalent.

- YOUR TASK

Respond with only "Yes" or "No" (without quotes). Do not include a rationale.

Answer 1: %(expression1)s
Answer 2: %(expression2)s

H Additional Qualitative Examples

We present additional qualitative examples comparing the base, reasoning reward, and TARO model outputs in this section. Table 10 illustrates a case where the base model bypasses the reasoning process, producing an incorrect result (200), while both the reward and guided models reason systematically and arrive at the correct answer (2220).
In contrast, Table 11 presents a case where the base model quickly identifies the correct cube (27) through trial and error, while the reward model and TARO follow the right strategy but make an error during reasoning, leading to an incorrect answer. This suggests that the reward and guided models lack mechanisms to revise their problem-solving strategy and check their work. Future work can enhance the reward model's ability to reward backtracking and verification.

Question: What is the least positive integer multiple of 30 that can be written with only the digits 0 and 2?
Ground Truth: 2220

Base Model: The least positive integer multiple of 30 that can be written with only the digits 0 and 2 is 200. Final Answer: 200. I hope it is correct.

Reward Model: We need to find the smallest positive integer multiple of 30 that can be written with only the digits 0 and 2. Since 30 is a multiple of 2, 3, and 5, the number must end with a 0 (to be divisible by 10) and contain at least one 3 or 9 (to be divisible by 3). However, we are only allowed to use the digits 0 and 2, so the number must be written with only the digits 0 and 2. The smallest such number is 200, but 200 is not divisible by 3. The next possible number is 2220, which is divisible by 6 (and therefore by 3) and ends with a 0 (so it is divisible by 10). Therefore, 2220 is a multiple of 30. The smallest such number is 60, but 60 is not written with only the digits 0 and 2. The next possible number is 22020, but 22020 is not written with only the digits 0 and 2. The smallest such number is 2220, which is written with only the digits 0 and 2 and is a multiple of 30. Therefore, the answer is 2220. Final Answer: 2220. I hope it is correct.

TARO (Ours):
## Step 1: Understand the problem
We need to find the smallest multiple of 30 that can be expressed using only the digits 0 and 2.
## Step 2: Recall divisibility rules
Multiples of 30 are divisible by both 2 and 3, and must end with 0.
## Step 3: Divisibility by 3
The sum of digits must be a multiple of 3. With digits 0 and 2, the sum = 2 × (number of 2s), so the number of 2s must be a multiple of 3.
## Step 4: Try smallest combination
Three 2s and one 0 gives 2220.
## Step 5: Verification
2220 is divisible by 2, 3, and 5. ⇒ It is divisible by 30.
Final Answer: 2220

Table 10: Qualitative example comparing Llama-3.1 8B Instruct (base), the reasoning reward model, and TARO on a divisibility problem. The base model bypasses reasoning, while both the reward and guided models reason systematically to reach the correct answer.

Question: What is the smallest positive perfect cube that can be written as the sum of three consecutive integers?
Ground Truth: 27

Base Model: We can check small sets of three consecutive integers: 1, 2, 3; 2, 3, 4; 3, 4, 5; .... The sums are 6, 9, 12, 15, 18, 21, 24, 27. The first perfect cube among these is 27, which corresponds to 8, 9, 10. Final Answer: 27.

Reward Model: We need to find the smallest positive perfect cube that can be written as the sum of three consecutive integers. Let's denote the three consecutive integers as n − 1, n, and n + 1. Then, the sum of these three integers is (n − 1) + n + (n + 1) = 3n. We need to find the smallest positive perfect cube that can be written as 3n. Let's consider the cubes of the positive integers and see if any of them can be written as the sum of three consecutive integers. For example, the cube of 1 is 1, the cube of 2 is 8, the cube of 3 is 27, the cube of 4 is 64, the cube of 5 is 125, and the cube of 6 is 216. The cube of 6 is 216. Let's check if 216 can be written as the sum of three consecutive integers. We can write 216 as 3n, so n = 72. Then, the three consecutive integers are 71, 72, and 73. The sum of these three integers is 71 + 72 + 73 = 216. Therefore, the smallest positive perfect cube that can be written as the sum of three consecutive integers is 216. Final Answer: The final answer is 216. I hope it is correct.

TARO (Ours): Let's take a look at the problem. We're looking for the smallest positive perfect cube that can be written as the sum of three consecutive integers. Let's start by finding the smallest perfect cube. The smallest positive perfect cube is 1^3 = 1. Now, we need to find three consecutive integers that add up to 1. The only set of three consecutive integers that add up to 1 is 0, 1, and 2, but their sum is 3, not 1. Let's try the next perfect cube. The next smallest positive perfect cube is 2^3 = 8. We need to find three consecutive integers that add up to 8. The set of three consecutive integers that add up to 8 is -3, -2, and -1. Their sum is -6, not 8. The next perfect cube is 3^3 = 27. We need to find three consecutive integers that add up to 27. The set of three consecutive integers that add up to 27 is -10, -9, and -8. Their sum is -27, not 27. The next perfect cube is 4^3 = 64. We need to find three consecutive integers that add up to 64. The set of three consecutive integers that add up to 64 is -21, -20, and -19. Their sum is -60, not 64. The next perfect cube is 5^3 = 125. We need to find three consecutive integers that add up to 125. The set of three consecutive integers that add up to 125 is -44, -43, and -42. Their sum is -129, not 125. The next perfect cube is 6^3 = 216. We need to find three consecutive integers that add up to 216. The set of three consecutive integers that add up to 216 is -71, -70, and -69. Their sum is -210, not 216. The next perfect cube is 7^3 = 343. We need to find three consecutive integers that add up to 343. The set of three consecutive integers that add up to 343 is -112, -111, and -110. Their sum is -333, not 343. The next perfect cube is 8^3 = 512. We need to find three consecutive integers that add up to 512. The set of three consecutive integers that add up to 512 is -169, -168, and -167. Their sum is -504, not 512. The next perfect cube is 9^3 = 729. We need to find three consecutive integers that add up to 729. The set of three consecutive integers that add up to 729 is -242, -241, and -240. Their sum is -723, not 729. The next perfect cube is 10^3 = 1000. We need to find three consecutive integers that add up to 1000. The set of three consecutive integers that add up to 1000 is -333, -332, and -331. Their sum is -996, not 1000. The next perfect cube is 11^3 = 1331. We need to find three consecutive integers that add up to 1331. The set of three consecutive integers that add up to 1331 is -444, -443, and -442. Their sum is -1329, not 1331. The next perfect cube is 12^3 = 1728. We need to find three consecutive integers that add up to 1728. The set of three consecutive integers that add up to 1728 is -579, -578, and -577. Their sum is -1734, not 1728. The next perfect cube is 13^3 = 2197. We need to find three consecutive integers that add up to 2197. The set of three consecutive integers that add up to 2197 is -728, -727, and -726. Their sum is -2181, not 2197. The next perfect cube is 14^3 = 2744. We need to find three consecutive integers that add up to 2744. The set of three consecutive integers that add up to 2744 is -913, -912, and -911. Their sum is -2736, not 2744. The next perfect cube is 15^3 = 3375. We need to find three consecutive integers that add up to 3375. The set of three consecutive integers that add up to 3375 is -1126, -1125, and -1124. Their sum is -3375, which is 3375. Therefore, the smallest positive perfect cube that can be written as the sum of three consecutive integers is 3375. The final answer is: 3375

Table 11: Qualitative example comparing Llama-3.1 8B Instruct (base), the reasoning reward model, and TARO on a simple cube-sum problem. The base model correctly identifies the answer through a simple enumeration strategy, while the reward and guided models introduce errors during reasoning, leading to incorrect answers.
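The ground truths quoted in Tables 10 and 11 are easy to confirm by brute force. The sketch below is ours, not part of the paper's evaluation code (the helper names are illustrative): it enumerates multiples of 30 until one uses only the digits 0 and 2, and enumerates cubes until one is divisible by 3, since (k − 1) + k + (k + 1) = 3k.

```python
# Brute-force sanity checks for the ground truths in Tables 10 and 11.
# Helper names are ours, not from the paper.

def smallest_multiple_of_30_with_digits_0_2() -> int:
    """Table 10: smallest positive multiple of 30 written with only 0s and 2s."""
    n = 30
    while not set(str(n)) <= {"0", "2"}:  # digit set must be a subset of {0, 2}
        n += 30
    return n

def smallest_cube_as_sum_of_three_consecutive() -> int:
    """Table 11: (k-1) + k + (k+1) = 3k, so we need the smallest
    positive perfect cube that is divisible by 3."""
    m = 1
    while (m ** 3) % 3 != 0:
        m += 1
    return m ** 3

print(smallest_multiple_of_30_with_digits_0_2())    # 2220
print(smallest_cube_as_sum_of_three_consecutive())  # 27 (= 8 + 9 + 10)
```

Both checks reproduce the stated ground truths (2220 and 27); the 3375 reached by the guided model in Table 11 stems from sign errors while enumerating candidate triples, not from a wrong overall strategy.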