
Paper deep dive

Agentic Critical Training

Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 61

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/13/2026, 12:55:11 AM

Summary

Agentic Critical Training (ACT) is a reinforcement learning paradigm that trains LLM agents to autonomously evaluate action quality by contrasting expert actions with suboptimal alternatives. Unlike imitation learning or knowledge distillation methods that rely on mimicking pre-generated reflection text, ACT uses verifiable rewards to drive the emergence of genuine critical reasoning, leading to improved performance and out-of-distribution generalization across agentic and reasoning benchmarks.

Entities (6)

ALFWorld · benchmark · 100%
Agentic Critical Training · method · 100%
Group Relative Policy Optimization · algorithm · 100%
Large Language Models · technology · 100%
ScienceWorld · benchmark · 100%
WebShop · benchmark · 100%

Relation Signals (3)

Agentic Critical Training improves performance on ALFWorld

confidence 95% · Across three challenging agent benchmarks (ALFWorld, WebShop, ScienceWorld), ACT consistently improves agent performance

Agentic Critical Training outperforms Imitation Learning

confidence 95% · It achieves an average improvement of 5.07 points over imitation learning

Agentic Critical Training uses algorithm Group Relative Policy Optimization

confidence 95% · The model is first trained on D_critic to identify the better action... optimized using Group Relative Policy Optimization (GRPO)

Cypher Suggestions (2)

Find all benchmarks where ACT improves performance · confidence 90% · unvalidated

MATCH (m:Method {name: 'Agentic Critical Training'})-[:IMPROVES_PERFORMANCE_ON]->(b:Benchmark) RETURN b.name

Identify algorithms used by the ACT method · confidence 90% · unvalidated

MATCH (m:Method {name: 'Agentic Critical Training'})-[:USES_ALGORITHM]->(a:Algorithm) RETURN a.name

Abstract

Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

60,232 characters extracted from source content.


Agentic Critical Training

Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang‡, Furong Huang‡
University of Maryland, College Park (‡ Equal advising)

Project Page: https://attention-is-all-i-need.github.io/ACT/

1. Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across a wide spectrum of tasks, from natural language understanding to complex reasoning (Brown et al., 2020; Achiam et al., 2023). Recent advances have enabled these models to function as autonomous agents, capable of interacting with external environments, using tools, and completing multi-step tasks (Yao et al.; Shinn et al., 2023). Training effective LLM-based agents has become a critical research direction, with applications spanning web navigation (Yao et al., 2022), household robotics (Shridhar et al.), scientific experimentation (Wang et al., 2022), and function calling (Patil et al., 2024).

Training LLM agents often begins with imitation learning (IL), where models learn to replicate expert demonstrations through supervised fine-tuning. While effective, IL suffers from a well-known limitation: it only teaches agents what to do, not what to avoid (Ross et al., 2011; Hussein et al., 2017; Pomerleau, 1991). Because agents only observe successful trajectories, they lack understanding of why certain actions are preferable and have no awareness of suboptimal states. A recent approach, Early Experience (Zhang et al., 2025a), attempts to address this by executing both expert and alternative actions in the environment, observing the resulting next states, and prompting the model to generate reflections explaining why the expert action leads to a better outcome. The self-reflection data is then mixed with the expert dataset, and the model is trained using a standard next-token prediction loss.
However, this approach fundamentally remains imitation learning: the model is trained to imitate a pre-generated target string rather than to autonomously discover reasoning that leads to correct action selection. The "self-reflection" is imitated from text, not spontaneously developed.

Corresponding author(s): Furong Huang; Email: furongh@umd.edu
arXiv:2603.08706v1 [cs.AI] 9 Mar 2026

Figure 1: Comparison of imitated vs. genuine self-reflection. (a) Early Experience executes both actions in the environment, generates a reflection from the resulting states, and trains the model to imitate this fixed text via supervised fine-tuning (SFT). (b) ACT presents two candidate actions and trains the model via RL to select the better one. Since only the selection outcome is rewarded, the model must autonomously develop reasoning about action quality to maximize reward.

To address this issue, we propose Agentic Critical Training (ACT), illustrated in Figure 1. Given expert demonstrations, ACT pairs each expert action with a model-generated alternative action to form a preference pair at each time step of the sequential decision-making process. These preference pairs are formed based on the hypothesis that expert actions are generally superior to model-generated ones.
We then train the agent to identify which action is better via reinforcement learning (RL). The only supervision is whether the model correctly identifies the superior expert action; since no reasoning supervision is provided, the model must autonomously develop chain-of-thought (CoT; Wei et al., 2022) reasoning that leads to correct choices. This produces genuine self-reflection rather than imitated reflection: the model learns to reason about action quality through RL, rather than imitating pre-constructed reflections through knowledge distillation.

Across three challenging agent benchmarks (ALFWorld, WebShop, ScienceWorld), ACT consistently improves agent performance when combined with different post-training methods. Compared with imitation learning, ACT achieves an average performance gain of 5.07 points, while outperforming reinforcement learning by 4.62 points. Furthermore, compared to the Early Experience baseline that injects reflection capability through knowledge distillation, ACT still demonstrates clear advantages, yielding an additional 2.42-point improvement on average. These results indicate that directly training models to evaluate action quality is more effective than supervising them to imitate reflection behaviors.

Beyond in-distribution improvements, ACT also exhibits strong out-of-distribution generalization on agentic benchmarks. Interestingly, despite being trained purely through action-level supervision, ACT also improves performance on general reasoning benchmarks (MATH-500 and GPQA-Diamond) without requiring any reasoning-specific training data. This suggests that learning to evaluate and compare actions can serve as a general mechanism for enhancing reasoning and decision-making abilities in LLM agents.

In summary, our contributions are:

1. We propose ACT, which trains agents to judge which action is better under the current state via RL. Unlike Early Experience, which imitates pre-generated reflections, ACT drives the model to autonomously develop critical reasoning through RL and internalize this capability into its parameters.

2. Across three agentic benchmarks, ACT consistently improves both IL and RL, and outperforms Early Experience, achieving the highest performance across all benchmarks.

3. We demonstrate that ACT not only enables strong out-of-distribution generalization on agentic benchmarks, but also achieves notable improvements on general reasoning benchmarks (GPQA-Diamond, MATH-500) without any reasoning-specific training data, suggesting agentic RL environments may serve as a pathway for improving general reasoning.

2. Agentic Critical Training

We present Agentic Critical Training (ACT), our approach to training LLM agents beyond pure imitation. We first describe the problem setting (section 2.1), then detail ACT data construction (section 2.2), and finally present the training pipeline (section 2.3). Figure 2 provides an overview.

2.1. Problem Formulation

We consider an agent operating in a sequential decision-making environment, formalized as a partially observable Markov decision process (POMDP) M = (S, A, T, Ω, O, R, γ), where S is the state space, A is the action space, T : S × A → Δ(S) is the transition function, Ω is the observation space, O : S → Δ(Ω) is the observation function, R : S × A → ℝ is the reward function, and γ ∈ [0, 1) is the discount factor. In practice, the agent conditions on a textual context constructed from the task description, a truncated window of the most recent k observation-action pairs, and the current observation (see prompt templates in section A.5). We denote this observation-derived context as s_t and use s as shorthand throughout; the policy is thus π_θ(a_t | s_t).

Given an expert demonstration dataset D_expert = { τ^(n) = (s_1^(n), a_1^(n), ..., s_{T_n}^(n), a_{T_n}^(n)) }_{n=1}^{N}, the standard approach is imitation learning (IL), which maximizes the likelihood of expert actions:

    L_IL(θ) = −E_{(s,a) ∼ D_expert} [ log π_θ(a | s) ],    (1)

where π_θ is the policy parameterized by the LLM. While effective at teaching agents to replicate expert behavior, IL provides no signal about the relative quality of different actions: the agent learns that a_i is correct at context s_i, but not why it is preferable to alternatives. Our goal is to train agents that develop this understanding.

Figure 2: Overview of the ACT + RL training pipeline. Stage 1 (Data Construction): Given expert demonstration trajectories, we extract state-action pairs and sample alternative actions from the initial policy π_θ0 at each state. Expert actions are paired with model-generated alternatives to construct contrastive training examples. Stage 2 (Agentic Critical Training): The model is trained via GRPO to identify the better action among candidates presented in randomized order, internalizing an understanding of action quality through verifiable rewards. Stage 3 (RL Action Training): The ACT-enhanced model is further trained with RL for direct action generation, leveraging its improved critical reasoning foundation to achieve higher task success rates.

2.2. Data Construction

The core idea of ACT is to transform the learning objective from "imitate the expert action" to "identify the better action," requiring the model to develop discriminative understanding of action quality.
For each expert state-action pair (s_i, a_i) ∈ D_expert, we construct an ACT example as follows:

1. Sample alternatives. Draw K candidate actions a_i^1, a_i^2, ..., a_i^K from an initial policy π_θ0(· | s_i).

2. Filter duplicates. Remove candidates identical to the expert action: A_i^neg = { a_i^j : a_i^j ≠ a_i, j ∈ [K] }.

3. Construct pairs. Pair the expert action a_i^+ = a_i with each alternative a_i^− ∈ A_i^neg to form |A_i^neg| contrastive examples.

This yields the ACT dataset D_critic = { (s_i, a_i^+, a_i^−) }_{i=1}^{M}, where each example contains one expert action and one sampled alternative action. The key assumption is that the initial policy π_θ0 generates actions that are, on average, inferior to expert actions.

2.3. Training Pipeline

After constructing D_critic, our training pipeline proceeds through two sequential RL stages: Agentic Critical Training followed by RL Action Training, both optimized using Group Relative Policy Optimization (GRPO; Shao et al., 2024). For each prompt, GRPO samples a group of G responses from the current policy, computes a verifiable reward for each, and updates the policy using group-relative advantages. Details are provided in section A.1.

Agentic Critical Training. The model is first trained on D_critic to identify the better action among two candidates. The ACT prompt is:

    ACT Prompt
    Task: task_description
    Below are the most recent k observations and actions: history
    You are now at step current_step and your current observation is: o_t
    Your admissible actions of the current situation are: A_admissible
    Candidate Actions for the NEXT step:
    Action 1: a_σ(1)
    Action 2: a_σ(2)
    Which action is better as the NEXT step? Think about why, then output the chosen action inside <action>...</action> tags.

Here σ is a random permutation, so the expert action appears in either position with equal probability. Given this prompt, GRPO samples G responses, each containing the model's reasoning and action selection.
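The construction steps described in section 2.2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `sample_action` callable stands in for the initial policy π_θ0 and is a hypothetical interface.

```python
import random

def build_critic_dataset(expert_pairs, sample_action, K=4):
    """Construct contrastive ACT examples from expert state-action pairs.

    expert_pairs  : list of (state, expert_action) tuples from D_expert
    sample_action : stub for the initial policy pi_theta0; returns one
                    candidate action for a state (hypothetical interface)
    K             : number of alternatives drawn per state
    """
    d_critic = []
    for state, expert_action in expert_pairs:
        # 1. Sample alternatives: draw K candidates from the initial policy.
        candidates = [sample_action(state) for _ in range(K)]
        # 2. Filter duplicates: drop candidates identical to the expert action.
        negatives = {a for a in candidates if a != expert_action}
        # 3. Construct pairs: expert action vs. each remaining alternative.
        for neg in negatives:
            # Randomize presentation order so the expert action appears in
            # either position with equal probability (the sigma permutation).
            pair = [expert_action, neg]
            random.shuffle(pair)
            d_critic.append({"state": state,
                             "candidates": pair,
                             "better": expert_action})
    return d_critic
```

Shuffling the candidate order at construction time mirrors the σ permutation in the ACT prompt and prevents the model from learning a positional shortcut instead of reasoning about action quality.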
Crucially, because ACT is trained through RL rather than imitation learning, the model must autonomously discover chain-of-thought reasoning that causally leads to correct action selection. Unlike Early Experience (Zhang et al., 2025a), which generates reflection text and then trains on it via IL, ACT uses verifiable rewards to drive the emergence of critical thinking: the model is rewarded only for selecting correctly, and must therefore learn to reason about action quality on its own. This internalized understanding of why certain actions are preferable, rather than memorized patterns of what to output, directly enhances action generation at test time.

RL Action Training. The ACT-enhanced model is then further trained with GRPO for direct action generation on the expert trajectories. Given each context s_i, GRPO samples a group of G responses and rewards those matching the expert action a_i^+. By building on the critical reasoning foundation from ACT, the model leverages its improved understanding of action quality to achieve more effective policy optimization.

Reward Design. Both stages share a composite reward function:

    R(s, y) = R_acc(a, a^+) + R_adm(a, A_admissible) + R_fmt(y),    (2)

where y denotes the full model response and a = extract(y) denotes the action extracted from the <action>...</action> tags. We set R_acc = 1 if a exactly matches the expert action a^+. We set R_adm = 0.1 if a is admissible but does not match a^+, which provides partial credit for valid actions. We set R_fmt = −0.5 if the full response lacks proper <action> tags. Detailed reward definitions are provided in section A.1.

3. Related Work

LLM-based Agents. LLM-based autonomous agents have advanced rapidly across web navigation (Yao et al., 2022), tool use (Schick et al., 2023; Qin et al., 2023), and multi-step reasoning (Yao et al.; Wei et al., 2022). ReAct (Yao et al.) interleaves reasoning and acting, while Reflexion (Shinn et al., 2023) uses verbal self-reflection at inference time to improve performance. Our work instead trains self-reflection as a learned competence via RL, rather than relying on inference-time prompting.

Training LLM Agents. Most approaches to training LLM agents rely on imitation learning from expert demonstrations (Chen et al., 2023; Zeng et al., 2024). Recently, Zhang et al. (2025a) proposed "early experience," which enriches the training signal by prompting the model to generate reflections that explain why the expert action is preferable, and training the model to reproduce these reflections via supervised fine-tuning. However, the training objective fundamentally remains imitating a pre-generated target string. Our ACT instead trains the model to discriminate which action is better through RL, where the only supervision is whether the selection is correct. This requires the model to autonomously develop reasoning that leads to correct choices, rather than reproducing a fixed target string. In our experiments (section 4), we include Early Experience as a baseline and show that ACT consistently outperforms it across benchmarks, including on out-of-distribution tasks and general reasoning benchmarks.

Critique RL Training. Recent work uses RL to train critique capabilities, either for building stronger reward models, such as R1-Reward (Zhang et al., 2025b) and RM-R1 (Chen et al., 2025), or for directly improving the policy through critique training, such as LLaVA-Critic-R1 (Wang et al., 2025) and Critique-Coder (Ruan et al., 2025). However, these approaches focus on single-turn settings (e.g., chat or code generation).
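Stepping back to the training pipeline in section 2, the composite reward of Eq. (2) and GRPO's group-relative advantage can be sketched as follows. The reward constants and the `<action>` tag format follow the paper's description; the function names, the regex, and how R_fmt interacts with the other terms when tags are missing are assumptions of this sketch.

```python
import re
import statistics

def extract_action(response):
    """Pull the action out of <action>...</action> tags; None if malformed."""
    m = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    return m.group(1).strip() if m else None

def composite_reward(response, expert_action, admissible):
    """R(s, y) = R_acc + R_adm + R_fmt, per Eq. (2)."""
    action = extract_action(response)
    if action is None:
        return -0.5          # R_fmt: response lacks proper <action> tags
    if action == expert_action:
        return 1.0           # R_acc: exact match with the expert action
    if action in admissible:
        return 0.1           # R_adm: admissible but non-expert action
    return 0.0

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward within its group of G samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Giving admissible-but-wrong actions 0.1 keeps the policy inside the valid action space while still reserving the full reward for matching the expert; the group-relative normalization means each response is only credited for being better than its siblings in the same group.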
Our ACT differs in two key aspects: (1) ACT operates in multi-turn agentic environments rather than single-turn chat or code settings, and (2) ACT trains the model to discriminate between expert and suboptimal actions within a sequential decision-making process, rather than critiquing standalone solutions.

Agentic RL. Reinforcement learning (RL) has emerged as a powerful paradigm for training LLM-based agents (Zhang et al.). Unlike conventional LLM-RL for chat settings, such as RLHF (Ouyang et al., 2022) and DPO (Rafailov et al., 2023) for alignment, agentic RL tackles multi-turn, long-horizon decision-making in complex environments. DeepSeek-R1 (Guo et al., 2025) demonstrated that RL with verifiable rewards (RLVR) can incentivize reasoning without supervised chain-of-thought data. On the algorithmic side, GRPO (Shao et al., 2024) provides an efficient group-relative policy optimization method, GiGPO (Feng et al.) extends it with step-level credit assignment for long-horizon agent tasks, and Search-R1 (Jin et al.) trains LLMs to interleave reasoning with search engine queries via outcome-based RL. Our work contributes to this paradigm by showing that training agents via RL to discriminate between expert and suboptimal actions provides a complementary critical reasoning stage that further improves both IL- and RL-trained agents.

4. Experiments

We evaluate our approach on three diverse benchmarks spanning different agent capabilities.

4.1. Experimental Setup

Benchmarks. We use three benchmarks that span embodied, web, and scientific domains:

• ALFWorld (Shridhar et al.): Embodied household tasks requiring navigation and object manipulation in text-based environments. ALFWorld provides "seen" and "unseen" splits: the seen split tests performance in room layouts present in the training set (used as our ID evaluation), while the unseen split requires the model to operate in rooms with novel spatial layouts and unknown object combinations (used as our OOD evaluation).

• WebShop (Yao et al., 2022): Web-based shopping tasks requiring product search and selection.

• ScienceWorld (Wang et al., 2022): Scientific reasoning tasks requiring multi-step experimental procedures.

Methods. We compare the following methods, all trained on exactly the same expert trajectories D_expert to ensure that performance differences are attributable solely to the training paradigm:

• Prompt w/o CoT thinking: standard prompting without chain-of-thought reasoning.

• Prompt w/ CoT thinking: CoT prompting with thinking enabled (i.e., "Let's think step by step ...").

• ACT: trained with ACT only (no action generation training).

• IL: fine-tuned on expert state-action pairs via supervised next-token prediction.

• Early Experience (Self-Reflection): following Zhang et al. (2025a), executes both expert and alternative actions in the environment, prompts the model to generate reflections by comparing the resulting next states, and mixes the self-reflection data with the expert dataset to train the model using a standard next-token prediction loss.

• RL: trained with GRPO, where the reward is whether the generated action matches the expert action.

• IL w/ ACT and RL w/ ACT: first trained with ACT, then further trained with IL or RL, respectively.

Implementation details, hyperparameters, and prompt templates are provided in section A.

4.2. Main Results

Table 1 presents our main results on Qwen3-8B (Yang et al., 2025) across three benchmarks.

RL outperforms IL.
When trained on the same expert data, RL consistently achieves higher success rates than IL across all benchmarks, confirming that RL is a more effective paradigm for training LLM agents from expert trajectories.

ACT provides positive transfer. Training with ACT alone improves over prompting baselines but does not match IL or RL in absolute performance. This is expected, as ACT trains the model to judge which action is better, not to generate actions directly. However, when used as a first stage before IL or RL, ACT consistently improves performance across all benchmarks, and RL w/ ACT achieves the highest overall performance. Specifically, adding ACT yields an average improvement of 5.07 percentage points over IL (via IL w/ ACT) and 4.62 percentage points over RL (via RL w/ ACT) across all benchmarks. This shows that the critical reasoning learned during ACT benefits subsequent action generation training.

Table 1: Main results on Qwen3-8B (%). ALFWorld and WebShop report success rates; ScienceWorld reports next-action prediction accuracy. ALFWorld results include both in-distribution (ID) and out-of-distribution (OOD) tasks.

    Method                               ALFWorld ID   ALFWorld OOD   WebShop   ScienceWorld
    Prompt w/o CoT thinking                  35.71         27.61        2.80        28.01
    Prompt w/ CoT thinking                   56.43         50.00        3.00        25.21
    ACT                                      72.86         72.39        7.40        26.71
    Imitation Learning                       85.71         82.84       28.00        42.80
    Early Experience (Self-Reflection)       87.86         85.82       31.00        45.60
    IL w/ ACT                                91.43         87.31       31.60        48.69
    RL                                       90.71         84.33       29.40        43.04
    RL w/ ACT                                92.86         88.06       33.80        50.34

ACT outperforms Early Experience. Early Experience (Zhang et al., 2025a) enriches IL data with self-reflection text generated by prompting the model to compare environment states after executing both expert and alternative actions. As shown in table 1, this yields improvements over standard IL, but both IL w/ ACT and RL w/ ACT consistently outperform Early Experience across all benchmarks.
Across all benchmarks, IL w/ ACT outperforms Early Experience by an average of 2.42 percentage points. This suggests that training the model to autonomously reason about action quality through RL is more effective than having it imitate pre-generated reflection text within an IL framework.

ACT improves OOD generalization. On the out-of-distribution split of ALFWorld, adding ACT improves both IL and RL. RL w/ ACT also achieves the best OOD performance overall. Moreover, ACT's gain on top of RL is larger on OOD tasks (3.73p) than on in-distribution tasks (2.15p), indicating that the reasoning acquired through ACT generalizes to unseen task configurations rather than overfitting to the training distribution.

Case study: ACT enables failure recovery. To illustrate how ACT improves agentic decision-making, we examine real evaluation traces. In ALFWorld, the environment returns "Nothing happens." when an action fails. IL models, having never observed failure states during training, repeat the same failed action indefinitely. Figure 3 contrasts this with ACT's behavior: the IL model enters an infinite loop, repeating a failed action for over 30 steps until termination, while the ACT-trained model (RL w/ ACT) diagnoses the root cause through its internal reasoning and issues the correct navigation command. Notably, the self-critique behavior originates from the ACT phase, as RL alone does not produce such reflective reasoning patterns. An additional case study showing IL's rigid execution on WebShop is provided in section B.

IL trace. Task: Clean cloth → cabinet.
    Step 7: take cloth 1 from countertop 1
    Step 8: clean cloth 1 with sinkbasin 1
    Step 9: put cloth 1 in/on cabinet 1  ← fails
    Step 10: put cloth 1 in/on cabinet 1
    Step 11: put cloth 1 in/on cabinet 1
    Step 12: put cloth 1 in/on cabinet 1
    ... (repeats for 30+ steps until termination)
IL Result: Failed. The IL model repeats the identical failed action for over 30 steps until termination. Having never seen failure states during imitation learning, it has no mechanism to diagnose the error or propose an alternative.

ACT trace. Task: Clean spoon → dining table.
    Step 4: clean spoon 2 with sinkbasin 1
    Step 5: use spoon 2 on diningtable 1  ← "Nothing happens."
    Step 6: put spoon 2 in/on diningtable 1  ← "Nothing happens."
    Step 7: put spoon 2 in/on diningtable 1  ← "Nothing happens."
ACT Self-Critique & Recovery (Correct): "I cleaned spoon 2 with sinkbasin 1. Then I tried to put spoon 2 in/on diningtable 1, but nothing happened. I did this twice. Wait... the task is to put it IN the dining table. I am currently at sinkbasin 1. I need to GO TO the dining table first before I can put it there!"
    Action: go to diningtable 1 ✓

Figure 3: Failure recovery on ALFWorld. Left: The IL model enters an infinite loop, repeating a failed action for over 30 steps until termination. Right: The ACT model encounters the same type of failure but uses its internal reasoning to diagnose the root cause (wrong location), break the loop, and issue the correct navigation command.

4.3. Cross-Size Data Transferability

ACT requires collecting alternative actions from a policy to construct contrastive pairs, which can be expensive. A natural question is whether these data can be reused across model sizes to amortize the collection cost. To investigate this, we train Qwen3-4B on ALFWorld using ACT data collected entirely from Qwen3-8B, without any re-collection or adaptation. As shown in table 2, the transferred ACT data remains effective: all ACT-augmented methods improve over their non-ACT counterparts on both ID and OOD tasks for Qwen3-4B. Similar to Qwen3-8B, ACT's gain on the smaller model is also larger on OOD tasks than on ID tasks. These results validate that ACT's benefits generalize across model sizes and that the data collection cost can be amortized by reusing data across models of different sizes.

Table 2: Cross-size results on ALFWorld with in-distribution (ID) and out-of-distribution (OOD) success rates (%).

    Method                               Qwen3-4B ID   Qwen3-4B OOD   Qwen3-8B ID   Qwen3-8B OOD
    Prompt w/o CoT thinking                  13.57          8.96          35.71         27.61
    Prompt w/ CoT thinking                   50.71         29.85          56.43         50.00
    ACT                                      71.43         62.69          72.86         72.39
    Imitation Learning                       85.00         83.58          85.71         82.84
    Early Experience (Self-Reflection)       88.57         88.06          87.86         85.82
    IL w/ ACT                                88.57         91.04          91.43         87.31
    RL                                       91.43         88.81          90.71         84.33
    RL w/ ACT                                92.14         91.79          92.86         88.06

4.4. Generalization to General Reasoning Benchmarks

Beyond agentic tasks, we investigate whether the critical reasoning capabilities acquired through ACT transfer to general reasoning benchmarks. We take the Qwen3-8B models trained on ALFWorld agentic data (IL, RL, and ACT) and directly evaluate them on MATH-500 (Hendrycks et al., 2021) and GPQA-Diamond (Rein et al.), two widely used benchmarks for mathematical and scientific reasoning, respectively. In our training process, none of these models are exposed to any mathematical or scientific reasoning data.

Table 3: Performance on general reasoning benchmarks. Values are accuracy (%) with standard deviation across 3 runs. All trained models are learned solely from ALFWorld agentic data (no general reasoning training data).

    Method                               MATH-500       GPQA-Diamond
    Prompt w/o CoT thinking              78.60 ± 0.33   42.93 ± 1.09
    Prompt w/ CoT thinking               86.93 ± 0.74   51.52 ± 1.89
    Imitation Learning                   87.00 ± 0.33   44.61 ± 0.95
    Early Experience (Self-Reflection)   86.86 ± 0.25   51.85 ± 0.63
    RL                                   87.07 ± 0.77   52.36 ± 1.32
    ACT                                  87.73 ± 0.19   53.37 ± 0.63

As shown in table 3, the training paradigms exhibit different effects on general reasoning. IL and Early Experience, both based on next-token prediction, fail to improve general reasoning through agentic training. On MATH-500, both maintain performance comparable to the original model. On GPQA-Diamond, IL degrades performance by 6.91p compared to the CoT prompting baseline (44.61% vs. 51.52%), while Early Experience only recovers to the baseline level (51.85%).
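The quoted deltas follow directly from the GPQA-Diamond column of Table 3; a quick arithmetic check (values copied from the table, nothing else assumed):

```python
# GPQA-Diamond accuracies quoted from Table 3.
cot_baseline = 51.52   # Prompt w/ CoT thinking
il_acc = 44.61         # Imitation Learning
act_acc = 53.37        # ACT

il_drop = round(cot_baseline - il_acc, 2)    # degradation under IL
act_gain = round(act_acc - cot_baseline, 2)  # improvement under ACT
print(il_drop, act_gain)  # 6.91 1.85
```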
This indicates that next-token prediction approaches, even when enriched with self-reflection data, do not transfer agentic training to general reasoning. RL roughly preserves the original model's performance on both benchmarks. ACT achieves the highest scores on both MATH-500 and GPQA-Diamond despite being trained exclusively on agentic data. On GPQA-Diamond, ACT improves over the CoT prompting baseline by 1.85 points (53.37% vs. 51.52%), while IL degrades it by 6.91 points. ACT not only avoids the catastrophic forgetting observed in IL but improves upon the original model. This result indicates that RL in agentic environments, when combined with the ACT objective, can serve as a viable pathway for enhancing general reasoning capabilities. A detailed case study of IL's reasoning collapse is provided in Appendix C.

Case study: self-verification behavior. To illustrate how ACT may improve general reasoning, we examine reasoning traces on GPQA-Diamond. We observe that on certain difficult problems, ACT exhibits self-verification behavior: after performing an initial derivation, the model checks its answer by substituting it back into the original equations. Figure 4 shows one such example on a particle physics problem, where the ACT model substitutes each answer option back into the energy conservation equation:

    GPQA-Diamond #12: Pion Decay Kinetic Energy
    Find the kinetic energy of the product particles in π⁺ → μ⁺ + ν, where the π⁺ is stationary. The rest masses of π⁺ and μ⁺ are 139.6 MeV and 105.7 MeV, respectively.
    Options: (A) KE_μ = 4.12 MeV, KE_ν = 29.8 MeV  (B) KE_μ = 7.2 MeV, KE_ν = 32.8 MeV  (C) KE_μ = 3.52 MeV, KE_ν = 20.8 MeV  (D) KE_μ = 2.84 MeV, KE_ν = 26.8 MeV

    ACT Reasoning Trace: Self-Verification (Correct ✓)
    [... derives kinetic energies via conservation of energy and momentum ...]
    Verification of each option against energy conservation (E_total = 139.6 MeV):
    Option C: KE_μ = 3.52 MeV, KE_ν = 20.8 MeV.
    Then the muon's total energy would be 105.7 + 3.52 = 109.22 MeV. Neutrino energy is 20.8 MeV. Total energy: 109.22 + 20.8 = 130.02 MeV, which is less than 139.6. So that's not possible.
    Option D: KE_μ = 2.84 MeV, KE_ν = 26.8 MeV. Muon total energy: 105.7 + 2.84 = 108.54 MeV. Neutrino energy is 26.8 MeV. Total energy: 108.54 + 26.8 = 135.34 MeV, still less than 139.6.
    So only option A gives a total energy matching. Therefore, the correct answer is A.
    Answer: A ✓

Figure 4: Self-verification behavior observed in ACT on GPQA-Diamond. After deriving the kinetic energies, the ACT model substitutes each answer option back into the energy conservation equation, eliminating inconsistent options.

By verifying consistency this way, the model systematically eliminates incorrect options. The base model performs the initial derivation but does not systematically verify against all options. This "check your work" pattern is consistent with ACT's training objective, which requires the model to evaluate and compare candidate actions.

5. Conclusion

We introduced ACT, which trains LLM agents to reason about action quality by contrasting expert and self-generated actions via RL. Unlike approaches that imitate pre-generated self-reflection text via next-token prediction, ACT produces autonomous critical reasoning through RL. Across three benchmarks, ACT consistently improves both IL and RL and outperforms prior approaches, achieving the highest performance across all benchmarks. ACT also enables strong out-of-distribution generalization on agentic benchmarks. Furthermore, on general reasoning benchmarks (GPQA-Diamond and MATH-500), where other training methods degrade or fail to improve reasoning, ACT achieves notable improvements without any reasoning-specific training data, indicating the potential of agentic RL environments for improving general reasoning.
Acknowledgements

Liu, Liu, Ho, Chakraborty, Wang, and Huang are supported by DARPA Transfer from Imprecise and Abstract Models to Autonomous Technologies (TIAMAT) 80321, DARPA HR001124S0029-AIQ-FP-019, DOD-AFOSR Air Force Office of Scientific Research under award number FA9550-23-1-0048, and the National Science Foundation TRAILS Institute (2229885). Private support was provided by Peraton and Open Philanthropy. The authors acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot for contributing to this research result.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. FireAct: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. RM-R1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025.

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.

Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O. Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. Advances in Neural Information Processing Systems, 37:126544–126565, 2024.

Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.
Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

Chi Ruan, Dongfu Jiang, Yubo Wang, and Wenhu Chen. Critique-Coder: Enhancing coder models by critique reinforcement learning. arXiv preprint arXiv:2509.22824, 2025.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations.

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022.

Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, and Furong Huang. LLaVA-Critic-R1: Your critic model is secretly a strong policy model. arXiv preprint arXiv:2509.00676, 2025.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. AgentTuning: Enabling generalized agent abilities for LLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3053–3077, 2024.

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, et al.
The landscape of agentic reinforcement learning for LLMs: A survey. Transactions on Machine Learning Research.

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025a.

Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-Reward: Training multimodal reward model through stable reinforcement learning. arXiv preprint arXiv:2505.02835, 2025b.

A. Experimental Details

A.1. Training Formulation and Algorithm

Reward Function Design. Our composite reward function consists of three components. Given a full generated response y for state s, we first extract the action span inside the <action>...</action> tags and denote the extracted action by a = extract(y). The semantic rewards are applied to the extracted action a, while the format reward is applied to the full response y:

    R(s, y) = R_acc(a, a^+) + R_adm(a, A_admissible) + R_fmt(y).    (3)

If the response does not contain a valid tagged action span, we set a = ∅, so the response receives zero semantic reward and only the format penalty applies. The accuracy reward measures exact match between the extracted action and the expert action:

    R_acc(a, a^+) = 1.0 if normalize(a) = normalize(a^+), and 0.0 otherwise.    (4)

The admissible-action reward provides partial credit for extracted actions that are valid but suboptimal in environments with defined action spaces:

    R_adm(a, A_admissible) = 0.1 if a ≠ a^+ and a ∈ A_admissible, and 0.0 otherwise.    (5)

For WebShop RL action training, the action space includes open-ended search queries (e.g., search[...]) that cannot be enumerated, so R_adm is disabled and only R_acc and R_fmt are used. The format reward penalizes full responses that lack proper action tags:

    R_fmt(y) = 0.0 if action tags are present, and −0.5 otherwise.    (6)

GRPO Algorithm.
Both stages of our training pipeline use Group Relative Policy Optimization (GRPO) (Shao et al., 2024), a variant of proximal policy optimization (Schulman et al., 2017) that eliminates the need for a learned value function by estimating advantages from group-level reward statistics. Given a prompt (state) s, GRPO samples a group of G responses y^(1), y^(2), ..., y^(G) from the current policy π_θ. Each response receives a reward r^(g) = R(s, y^(g)), and the advantage for each response is computed relative to the group statistics:

    Â^(g) = (r^(g) − r̄) / (σ_r + ε),    (7)

where r̄ = (1/G) Σ_{g=1}^G r^(g) is the group mean reward, σ_r = sqrt((1/G) Σ_{g=1}^G (r^(g) − r̄)²) is the group standard deviation, and ε is a small constant for numerical stability. The GRPO objective combines the clipped policy gradient with KL regularization:

    L_GRPO(θ) = −E_{s∼D} E_{y^(g)∼π_θ(·|s)} [ min( ρ^(g) Â^(g), clip(ρ^(g), 1 − ε_c, 1 + ε_c) Â^(g) ) ] + β · D_KL(π_θ ‖ π_ref),    (8)

where ρ^(g) = π_θ(y^(g)|s) / π_θ_old(y^(g)|s) is the importance sampling ratio, ε_c is the clipping threshold, β is the KL penalty coefficient, and π_ref is the reference policy.

Algorithm 1: ACT with GRPO

    Input: expert dataset D_expert, initial policy π_θ0, number of candidate samples K, group size G
    Output: trained policy π_θ*

    // Phase 1: Data Collection
    D_critic ← ∅
    for each (s_i, a_i^+) ∈ D_expert do
        sample a_i^1, ..., a_i^K ∼ π_θ0(·|s_i)
        A_i^neg ← { a_i^j : a_i^j ≠ a_i^+ }
        if |A_i^neg| > 0 then
            for each a_i^− ∈ A_i^neg do
                D_critic ← D_critic ∪ {(s_i, a_i^+, a_i^−)}
            end for
        end if
    end for

    // Phase 2: GRPO Training
    initialize θ ← θ0, π_ref ← π_θ0
    for each training iteration do
        sample batch B ⊂ D_critic
        for each (s, a^+, a^−) ∈ B do
            construct ACT prompt p with randomized candidate positions
            sample y^(1), ..., y^(G) ∼ π_θ(·|p)
            compute rewards r^(g) = R(s, y^(g)) via eq. (3)
            compute advantages Â^(g) via eq. (7)
        end for
        update θ using ∇_θ L_GRPO(θ)
    end for
    return π_θ

Training Algorithm.
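As a concrete companion to the procedure's inner loop, the composite reward of eqs. (3)–(6) and the group-relative advantage of eq. (7) can be sketched in Python. This is a minimal illustration, not the authors' implementation; the helper names (extract_action, normalize) are hypothetical stand-ins.

```python
import re
from typing import Optional

ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)

def extract_action(response: str) -> Optional[str]:
    """Return the span inside <action>...</action> tags, or None."""
    m = ACTION_RE.search(response)
    return m.group(1).strip() if m else None

def normalize(action: str) -> str:
    """Lowercase and collapse whitespace before exact-match comparison."""
    return " ".join(action.lower().split())

def composite_reward(response, expert_action, admissible=None):
    """R(s, y) = R_acc + R_adm + R_fmt (eqs. 3-6). Passing admissible=None
    disables R_adm, as for WebShop's open-ended search queries."""
    action = extract_action(response)
    if action is None:
        return -0.5                                  # R_fmt: missing tags
    if normalize(action) == normalize(expert_action):
        return 1.0                                   # R_acc: matches expert
    if admissible is not None and action in admissible:
        return 0.1                                   # R_adm: valid but suboptimal
    return 0.0

def group_advantages(rewards, eps=1e-8):
    """Eq. (7): standardize each reward against its group mean and std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages are centered on the group mean, a group whose responses all receive the same reward contributes (near-)zero advantage, which is the sense in which GRPO replaces a learned value baseline with group statistics.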
Algorithm 1 summarizes the complete ACT procedure. The algorithm consists of two phases: data collection, where we construct the ACT dataset by sampling alternatives from the initial policy, and GRPO training, where we optimize the policy using verifiable rewards based on action correctness.

A.2. Implementation Details

We use OpenRLHF (Hu et al., 2024) for GRPO training with DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) for memory efficiency. Training uses 4 NVIDIA GH200 GPUs. Table 4 lists the hyperparameters used in our experiments.

Table 4: Training hyperparameters used across all experiments.

    Hyperparameter         Value
    Base model             Qwen3-4B / Qwen3-8B
    Learning rate          2 × 10^−6
    LR scheduler           Cosine
    Warmup ratio           0.1
    Batch size             64
    Group size (G)         8 (Qwen3-8B) or 16 (Qwen3-4B)
    Candidate samples (K)  1
    Max epochs             3
    Prompt max length      4,096 tokens
    Generation max length  4,096 tokens
    Temperature            1.0
    Top-p                  0.95
    KL coefficient (β)     0.0
    Optimizer              AdamW with offload
    Precision              BF16

A.3. Data Statistics

Table 5 summarizes the dataset statistics across all benchmarks. Each training sample corresponds to a single expert state-action pair extracted from successful trajectories. All methods (IL, RL, ACT, and their combinations) are trained on the same set of pairs to ensure a fair comparison. For ScienceWorld, due to its large action space and resource constraints, we randomly sample 10,240 state-action pairs for training (from the full set of expert trajectories) and evaluate offline next-action prediction accuracy on 10,000 test states (uniformly sampled across task types).

Table 5: Dataset statistics for all training. Train pairs are state-action pairs. ID: in-distribution, OOD: out-of-distribution.

    Benchmark     Domain    Train Pairs  Task Types  Test Samples
    ALFWorld      Embodied  10,240       6           140 (ID) / 134 (OOD) episodes
    WebShop       Web       3,000        N/A         500 episodes
    ScienceWorld  Science   10,240       30          10,000 states

A.4.
Expert Trajectory Collection

For ALFWorld, expert trajectories are collected by running the model released by Feng et al. on the ALFWorld training set. For WebShop, expert trajectories come from the official human demonstration data released with the benchmark. For ScienceWorld, expert trajectories come from the official gold trajectories released with the benchmark.

A.5. Prompt Templates

ALFWorld Prompts. The ACT and RL prompts for ALFWorld are shown in Figures 5 and 6, respectively.

ACT Prompt (ALFWorld)

    You are an expert agent operating in the ALFWorld Environment.
    Task: task_description
    Prior to this step, you have already taken n step(s). Below are the most recent k observations and actions:
    history
    You are now at step current_step and your current observation is:
    observation
    Your admissible actions of the current situation are:
    admissible_actions
    You need to decide which action is better for completing the task.
    Output format requirements:
    1. First, think about which action is better and why.
    2. Then, output the chosen action content inside <action>...</action> tags.
    3. You must output the actual action text, NOT 'Action 1' or 'Action 2'.
    Candidate Actions for the NEXT step:
    Action 1: action_1
    Action 2: action_2
    Which action is better as the NEXT step? Think about why, then output the chosen action inside <action>...</action> tags.

Figure 5: The ACT prompt for ALFWorld. The model is presented with the full context followed by two candidate actions and is asked to select the better one with reasoning.

RL Prompt (ALFWorld)

    You are an expert agent operating in the ALFWorld Environment.
    Task: task_description
    Prior to this step, you have already taken n step(s).
    Below are the most recent k observations and actions:
    history
    You are now at step current_step and your current observation is:
    observation
    Your admissible actions of the current situation are:
    admissible_actions
    Action format examples:
    - go to object
    - take object from container
    - open/close container, examine object
    - inventory, look
    Now it's your turn to take an action. Once you've finished your reasoning, you should choose an admissible action for current step and MUST be enclosed within <action> </action> tags.

Figure 6: The RL prompt for ALFWorld. The model is presented with the task, history, current observation, and admissible actions, and is asked to generate the next action.

WebShop Prompts. The ACT and RL prompts for WebShop are shown in Figures 7 and 8, respectively.

ACT Prompt (WebShop)

    You are an expert shopping assistant navigating an e-commerce website.
    Task: task_description
    Prior to this step, you have already taken n step(s). Below are the most recent k observations and actions:
    history
    You are now at step current_step and your current observation is:
    observation
    Your admissible actions of the current situation are:
    admissible_actions
    You need to decide which action is better for completing the task.
    Output format requirements:
    1. First, think about which action is better and why.
    2. Then, output the chosen action content inside <action>...</action> tags.
    3. You must output the actual action text, NOT 'Action 1' or 'Action 2'.
    Candidate Actions for the NEXT step:
    Action 1: action_1
    Action 2: action_2
    Which action is better as the NEXT step? Think about why, then output the chosen action inside <action>...</action> tags.

Figure 7: The ACT prompt for WebShop. The model is presented with the full context followed by two candidate actions and is asked to select the better one with reasoning.

RL Prompt (WebShop)

    You are an expert shopping assistant navigating an e-commerce website.
    Task: task_description
    Prior to this step, you have already taken n step(s). Below are the most recent k observations and actions:
    history
    You are now at step current_step and your current observation is:
    observation
    Your admissible actions of the current situation are:
    admissible_actions
    Action format examples:
    - search[query]
    - click[element]
    Now it's your turn to take an action. Once you've finished your reasoning, you should choose an admissible action for current step and MUST be enclosed within <action> </action> tags.

Figure 8: The RL prompt for WebShop. The model is presented with the task, history, current observation, and admissible actions, and is asked to generate the next action.

ScienceWorld Prompts. The ACT and RL prompts for ScienceWorld are shown in Figures 9 and 10, respectively.

ACT Prompt (ScienceWorld)

    You are an expert science tutor guiding a student through hands-on experiments in the ScienceWorld environment.
    Task: task_description
    Prior to this step, you have already taken n step(s). Below are the most recent k observations and actions:
    history
    You are now at step current_step and your current observation is:
    observation
    Your admissible actions of the current situation are:
    admissible_actions
    You need to decide which action is better for completing the task.
    Output format requirements:
    1. First, think about which action is better and why.
    2. Then, output the chosen action content inside <action>...</action> tags.
    3. You must output the actual action text, NOT 'Action 1' or 'Action 2'.
    Candidate Actions for the NEXT step:
    Action 1: action_1
    Action 2: action_2
    Which action is better as the NEXT step? Think about why, then output the chosen action inside <action>...</action> tags.

Figure 9: The ACT prompt for ScienceWorld. The model is presented with the full context followed by two candidate actions and is asked to select the better one with reasoning.
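The templates above share one structure across environments, with the expert and negative actions filled into the two candidate slots in randomized order (the "randomized positions" step of Algorithm 1). A minimal sketch, assuming a simplified template and hypothetical field names:

```python
import random

# Simplified stand-in for the Figure 5/7/9 templates; the wording is
# abridged and the function signature is an assumption for illustration.
ACT_TEMPLATE = (
    "You are an expert agent operating in the {env} Environment.\n"
    "Task: {task}\n"
    "You are now at step {step} and your current observation is:\n"
    "{observation}\n"
    "Your admissible actions of the current situation are:\n"
    "{admissible}\n"
    "Candidate Actions for the NEXT step:\n"
    "Action 1: {action_1}\n"
    "Action 2: {action_2}\n"
    "Which action is better as the NEXT step? Think about why, then "
    "output the chosen action inside <action>...</action> tags."
)

def build_act_prompt(env, task, step, observation, admissible,
                     expert_action, negative_action, rng=random):
    """Randomize which candidate slot holds the expert action so the
    critic cannot exploit positional bias during training."""
    if rng.random() < 0.5:
        a1, a2 = expert_action, negative_action
    else:
        a1, a2 = negative_action, expert_action
    return ACT_TEMPLATE.format(env=env, task=task, step=step,
                               observation=observation,
                               admissible=", ".join(admissible),
                               action_1=a1, action_2=a2)
```

Randomizing the slot order matters because the reward in eq. (4) checks only the emitted action text; without it, a policy could learn to always copy "Action 1" rather than compare the candidates.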
RL Prompt (ScienceWorld)

    You are an expert science tutor guiding a student through hands-on experiments in the ScienceWorld environment.
    Task: task_description
    Prior to this step, you have already taken n step(s). Below are the most recent k observations and actions:
    history
    You are now at step current_step and your current observation is:
    observation
    Your admissible actions of the current situation are:
    admissible_actions
    Action format examples:
    - pick up object
    - put object in container
    - pour object into container
    - open/close object, focus on object
    - teleport to location, look around, inventory
    Now it's your turn to take an action. Once you've finished your reasoning, you should choose an admissible action for current step and MUST be enclosed within <action> </action> tags.

Figure 10: The RL prompt for ScienceWorld. The model is presented with the task, history, current observation, and admissible actions, and is asked to generate the next action.

B. Additional Case Study: Agentic Task Performance

In the main text (Figure 3), we showed how ACT enables failure recovery on ALFWorld through self-critique. Here we provide an additional case study on WebShop, illustrating another failure mode of IL: rigid execution without state awareness.

B.1. Rigid Execution Without State Awareness

IL trains agents to replicate expert trajectories as fixed action sequences. When the environment state deviates from what was seen during training, IL models lack the internal mechanism to detect the mismatch and adjust. Figure 11 illustrates this on a WebShop task: the user requests men's shirts with specific attributes priced below $50. The IL model follows a rigid script (search → click item → select attributes → buy), executing each step without checking whether the item actually satisfies the constraints. It clicks "Buy Now" on a $55 item, violating the price requirement and receiving a score of 0.
ACT addresses this by training the model to evaluate candidate actions against the current state. Through critical training, the model internalizes an awareness of whether its current state (e.g., the product page) matches the goal constraints, enabling it to back out and search again rather than blindly proceeding.

    WebShop Task
    Find me wash cold, machine wash men's shirts with polyester heathers, heathers cotton... and price lower than 50.00 dollars.

    IL Trajectory (Score: 0.0 ×)
    Step 1: search[men's shirts polyester heathers]
    Step 2: click[B07XJD...]
    Observation: Price: $55.00 (exceeds $50 budget)
    Step 3: click[black]
    Step 4: click[small]
    Step 5: click[Buy Now] ← buys item despite constraint violation

Figure 11: Rigid execution on WebShop. The IL model follows a scripted sequence (search → click → buy) without checking whether the item satisfies the price constraint ($55 > $50 budget), resulting in a failed purchase.

C. Additional Case Study: Why ACT Improves General Reasoning

In the main text (Figure 4), we showed self-verification behavior observed in ACT on GPQA-Diamond. Here we provide additional case studies analyzing IL's reasoning collapse on general reasoning benchmarks, based on reasoning traces produced by ACT and IL (Qwen3-8B) on GPQA-Diamond and MATH-500.

C.1. IL Causes Reasoning Collapse

IL on agentic data fine-tunes the model on short, action-heavy expert trajectories that contain minimal extended reasoning. By imitating these behavioral patterns, the model suffers catastrophic forgetting of its deep reasoning capabilities: the chain-of-thought capacity that the original model possesses is overwritten by the short-sequence, action-centric distribution of agentic data. We term this reasoning collapse, which explains the sharp drop in GPQA-Diamond performance observed in Table 3. Reasoning collapse manifests in two characteristic ways.

Unfocused meandering.
Even when the IL model does produce reasoning traces, the quality of the reasoning degrades significantly. Figure 12 shows a high-energy physics problem (GPQA #7) about gamma-ray annihilation with CMB photons. ACT produces a focused 10,669-character derivation that methodically sets up the threshold-energy condition and arrives at the correct answer. The IL model, by contrast, generates a 37,924-character trace, 3.5× longer, yet wanders through vague recollections and contradictory estimates, ultimately conceding "given the time I've spent and the lack of progress" before guessing incorrectly. The reasoning capacity has not disappeared entirely but has become diffuse and ineffective.

    GPQA-Diamond #7: Gamma-Ray CMB Annihilation Threshold
    The universe is filled with the Cosmic Microwave Background. Consider the annihilation of high-energy γ-rays with a CMB photon into electron-positron pairs, i.e., γγ → e⁺e⁻. Find the energy of γ-rays whose lifetimes are limited by this process. The average energy of a CMB photon is 10^−3 eV.

    ACT Reasoning Trace (Correct ✓)
    For γγ → e⁺e⁻, the total energy must be at least 2 m_e c². For a head-on collision, the threshold condition is E_γ · E_CMB ≥ (m_e c²)². [...] So E_γ ≥ (0.511 MeV)² / (10^−3 eV) = 2.61 × 10^14 eV ≈ 2.6 × 10^5 GeV. The gamma rays with energy just above the threshold would have their lifetimes limited by this process. Therefore, the answer is the threshold energy.
    Answer: C (2.6 × 10^5 GeV) ✓ (10,669 chars)

    IL Reasoning Trace (Wrong ×)
    [... after 37,924 characters of unfocused derivation ...] However, I recall that the energy at which gamma rays can interact with CMB photons to produce e⁺e⁻ pairs is approximately 10^4 GeV. [...] Given that, the answer is likely option B or C. But I'm not certain. [...] Given the time I've spent and the lack of progress, I'll go with option B: 1.8 × 10^5 GeV.
    Answer: B (1.8 × 10^5 GeV) × (37,924 chars)

Figure 12: Reasoning collapse: unfocused meandering.
On a high-energy physics threshold problem, ACT produces a focused derivation (10K chars), while IL generates 3.5× more text (38K chars) yet wanders through vague recollections and contradictory estimates, ultimately guessing incorrectly.

Algebraic death loops. On MATH-500 problems requiring sustained mathematical derivation, the IL model enters repetitive algebraic loops, producing traces exceeding 80,000 characters without reaching a correct conclusion. Figure 13 shows a probability problem (MATH-500 #445): ACT recognizes the geometric structure and cleanly derives the answer, while the IL model enters a prolonged algebraic spiral, repeatedly second-guesses itself, and ultimately produces an incorrect answer.

    MATH-500 #445: Rectangle Probability
    Rectangle ABCD has center O and AB/AD = k. A point is randomly chosen from the interior of ABCD. What is the probability that it is closer to O than to any of the four vertices?

    ACT Reasoning Trace (Correct ✓)
    The set of points closer to O than to any vertex is the intersection of the four half-planes defined by the perpendicular bisectors between O and each vertex. These bisectors form a rectangle rotated by 45°, whose area is half that of the original rectangle. Therefore, the probability is 1/2 regardless of k. [...] I've verified for k = 1 (the square case) and general k: all methods confirm that the probability is 1/2.
    Answer: 1/2 ✓ (42,641 chars of reasoning)

    IL Reasoning Trace (Wrong ×)
    [... after 80,000+ characters of algebraic manipulation ...] Given that I'm stuck, I'll refer back to the initial correct calculation for k = 1, which gave the area as 2 and the probability as 1/2. For general k, the area is (k² + 1)²/(2k) when it lies inside the rectangle, but for k < 1 it is clipped. However, I cannot proceed further. Given the time I've spent, I will conclude that the probability is (1 − k²)/2.
    Answer: (1 − k²)/2 × (80,918 chars of reasoning)

Figure 13: Reasoning collapse: algebraic death loop.
On a probability problem, ACT cleanly identifies the geometric structure, while IL generates over 80,000 characters of circular algebraic manipulation before giving up with an incorrect answer. The IL model correctly solves the k = 1 special case early on but cannot generalize, endlessly rederiving and contradicting its own intermediate results.

ACT avoids reasoning collapse because it optimizes for outcome correctness via RL rather than imitating behavioral patterns. Since the RL training signal rewards correct critical judgments regardless of response format or length, ACT not only acquires agentic skills but also preserves, and even strengthens, the original model's deep reasoning capacity. IL, by contrast, tends to overwrite general reasoning capabilities with domain-specific action patterns through supervised fine-tuning on short, action-centric sequences.