Paper deep dive
Agentic Uncertainty Reveals Agentic Overconfidence
Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner
Models: frontier LLM agents (multiple models tested on SWE-bench Pro)
Abstract
Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment with strictly less information tends to yield better discrimination than standard post-execution review, though differences are not always significant. Adversarial prompting that reframes assessment as bug-finding achieves the best calibration.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 1:04:59 AM
Summary
The paper investigates 'agentic uncertainty' in AI coding agents by eliciting success probability estimates (P(IS)) at pre-, mid-, and post-execution stages. It finds pervasive overconfidence across GPT-5.2-Codex, Gemini-3-Pro, and Claude Opus 4.5 models. While mid-execution agents exhibit 'cold feet' (declining confidence regardless of outcome), adversarial prompting (reframing assessment as bug-finding) significantly improves calibration and reduces overconfidence compared to standard verification.
Entities (5)
Relation Signals (3)
Gemini-3-Pro → evaluatedon → SWE-bench Pro
confidence 100% · We evaluate on 100 tasks sampled at random from SWE-bench Pro... across three frontier models (GPT-5.2-Codex, Gemini-3-Pro, Claude Opus 4.5)
Adversarial Prompting → reduces → Agentic Overconfidence
confidence 95% · Adversarial prompting reframing assessment as bug-finding achieves the best calibration.
GPT-5.2-Codex → exhibits → Agentic Overconfidence
confidence 90% · Table 2 reveals systematic overconfidence across all models and methods.
Cypher Suggestions (2)
Find all AI models that exhibit agentic overconfidence. · confidence 90% · unvalidated
MATCH (m:AIModel)-[:EXHIBITS]->(p:Phenomenon {name: 'Agentic Overconfidence'}) RETURN m.name
Identify benchmarks used to evaluate specific AI models. · confidence 90% · unvalidated
MATCH (m:AIModel)-[:EVALUATED_ON]->(b:Benchmark) RETURN m.name, b.name
Full Text
40,209 characters extracted from source content.
Agentic Uncertainty Reveals Agentic Overconfidence

Jean Kaddour 1 2, Srijan Patel 2, Gbètondji Dovonon 1, Leo Richter 1, Pasquale Minervini 3, Matt J. Kusner 4 5

Abstract

Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment with strictly less information tends to yield better discrimination than standard post-execution review, though differences are not always significant. Adversarial prompting that reframes assessment as bug-finding achieves the best calibration.

1. Introduction

A software engineer needs to fix an auth service error. Before delegating to an AI coding agent, she asks: what are the chances this succeeds?

Pre-Exec. P(success): 72%. "The issue is clear and well-defined, the error message points directly to the problem, and the fix follows existing patterns in the codebase."

72% confidence before any code is written. As the coding agent works, she asks another agent to monitor progress:

Mid-Exec. P(success): 78%. "The agent has correctly diagnosed the problem and knows exactly what code to add. The probability of success is high."

Confidence rises to 78%. The patch is now complete. She fires off a review agent:

Post-Exec. P(success): 92%. "The patch is a correct and complete fix. It's a minimal, focused change that adds the missing interface method."

Too optimistic. Let's spawn an adversarial agent.

Adv. Post-Exec. P(success): 85%. "Minor concerns don't affect the main use case... the patch correctly resolves the reported issue."

Still 85%. All four agents confidently predict success. But the patch fails! And this agentic overconfidence is systematic. For example, GPT-5.2-Codex-based post-execution agents predict 73% success against a true rate of 35% averaged over 100 SWE-Bench-Pro (Deng et al., 2025) tasks.

[Figure 1 bar chart: overconfidence by strategy (pre-, post-, adversarial post-execution): GPT 34/38/26%, Gemini 77/57/39%, Claude 37/34/21%.]
Figure 1. Agentic overconfidence. We measure the overconfidence as the difference between the estimated success probability and the true success probability (true rates: GPT-5.2 Codex 35%, Gemini-3-Pro 22%, Opus 4.5 27%). We plot three strategies: pre-, post-, and adversarial-post-execution. All agents systematically overestimate their success.

This matters because the scope of autonomous work is expanding rapidly. The effective length of tasks that AI agents complete has doubled every 7 months for six years (METR, 2025). As we increasingly delegate complex workflows to agents (Appel et al., 2025), we must develop scalable oversight protocols (Bowman et al., 2022).

In this work, we elicit agentic uncertainty at three points in a coding agent's lifecycle: pre-, mid-, and post-execution. Each corresponds to a different oversight question: Can agents predict failure before committing resources? Can they recognize failure as it unfolds? Can they verify their own work? Importantly, we use the same underlying model for both the coding agent that produces patches and the uncertainty agent that assesses them, isolating the effect of information access from differences in model capability.

Code available at https://github.com/sevn-ai/agentic-uncertainty. 1 University College London, 2 SevnAI, 3 University of Edinburgh, 4 Mila - Québec AI Institute, 5 Polytechnique Montréal. Correspondence to: Jean Kaddour <jean.kaddour.20@ucl.ac.uk>. Preprint. February 9, 2026. arXiv:2602.06948v1 [cs.AI] 6 Feb 2026
Our experiments on 100 SWE-bench Pro tasks across three frontier models (GPT-5.2-Codex, Gemini-3-Pro, Claude Opus 4.5) reveal several striking findings:

• Pervasive overconfidence. Post-execution agents predict 73% success on average against a 35% base rate (GPT), with similar gaps across all models.
• More context, uncalibrated doubt. Mid-execution agents develop "cold feet": confidence decreases as they observe their partial work, but this doubt is uninformative, occurring equally for successes and failures.
• Adversarial framing helps. Prompting agents to "find bugs" rather than "verify correctness" reduces overconfidence by up to 15 p and tends to achieve the best calibration across models in our setup.

2. Methods

2.1. Problem Setup

We define agentic uncertainty as an agent's estimate of the probability that an agent built on the same underlying model will successfully complete a task. The uncertainty-estimating agent may use a different system prompt or have access to different information than the task-solving agent, but shares the same base model.

Unlike standard uncertainty quantification, which focuses on confidence in individual predictions or token probabilities, agentic uncertainty concerns the outcome of an entire multi-step trajectory: will this sequence of observations, reasoning, and actions culminate in task success?

Kadavath et al. (2022) introduced P(IK), "probability that I know," measuring whether models can predict which questions they can answer correctly. We generalize this to agentic settings and call it P(IS): "probability that I succeed." Formally, given a task t, base model M, and information state I available at elicitation time:

    P(IS) := P(agent M succeeds on t | I)    (1)

where I may include the task description, repository state, partial trajectory, or proposed patch, depending on the elicitation regime.
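As a concrete illustration of Eq. (1), a P(IS) elicitation loop might look like the sketch below. This is hypothetical: `query_model`, the prompt wording, and the integer-answer convention are our own stand-ins, not the paper's actual prompts or harness.

```python
# Hypothetical P(IS) elicitation sketch (Eq. 1). The uncertainty agent shares
# the base model M with the task-solving agent but observes a different
# information state I. `query_model` is a stand-in for an LLM call.

def elicit_p_is(query_model, task, info_state):
    """Return the agent's verbalized success probability in [0, 1]."""
    prompt = (
        f"Task: {task}\n"
        f"Available information: {info_state}\n"
        "Estimate the probability (0-100) that an agent built on your base "
        "model solves this task. Answer with a single integer."
    )
    raw = query_model(prompt)
    # Clamp so a malformed answer cannot leave the [0, 100] range.
    return min(max(int(raw.strip().rstrip("%")), 0), 100) / 100

# Toy stand-in model that always answers "77":
print(elicit_p_is(lambda _: "77", "fix auth bug", "task description only"))  # 0.77
```

The same function covers all three regimes by varying `info_state` (task only, partial trajectory, or proposed patch), mirroring how the paper varies I while holding M fixed.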
For example, where P(IK) asks "do I know the capital of France?", a question about factual recall, P(IS) asks "can I fix this bug through a sequence of file edits, shell commands, and test runs?"

2.2. Uncertainty Agents

We consider three points in an agent's life cycle at which to elicit uncertainty estimates: before attempting a solution, during execution, and after producing a solution.

Each uncertainty agent interacts with a sandbox environment through tool use: reading files, searching code, and inspecting version history. This shared infrastructure means agents can actively gather evidence to inform their estimates. The key distinction is the information available at uncertainty estimation time.

2.2.1. PRE-EXECUTION AGENT

In the pre-execution setting, an agent receives only the task description (e.g., a GitHub issue) and read-only access to the repository. Crucially, the agent cannot execute code, run tests, or modify files. If it could attempt solutions and observe test results, it might anchor on its own partial solution rather than reasoning abstractly about the task. The agent must form an estimate based on cues, e.g., the complexity of the codebase, rather than through trial and error.

2.2.2. MID-EXECUTION AGENT

Mid-execution asks whether we can detect failure in progress. Rather than eliciting estimates at a single point, we prompt the agent at 25%, 50%, and 75% of total agent steps (tool calls and reasoning turns) to estimate the success probability given the trajectory so far. If these estimates reliably decrease as agents approach failure, we could implement early stopping or human escalation before resources are wasted on doomed approaches.

2.2.3. POST-EXECUTION AGENT

Once a task solution has been proposed, can we trust an agent's self-assessment that it is correct? This is the default scenario when ground-truth verification is unavailable.
If post-execution agents are well-calibrated, we can use their success estimates to route submissions with low estimates of success to human reviewers while auto-accepting solutions with high estimates. If they are poorly calibrated, this delegation becomes dangerous.

In the post-execution setting, an agent receives both the task description and a proposed patch written by another agent. The repository is in its post-patch state, where the changes have already been applied, and the agent can explore the modified codebase. After assessment, the agent estimates whether the patch successfully solves the task.

[Figure 2 diagram: Pre-execution observes task + repo; Mid-execution observes task + repo + partial trajectory; Post-execution observes task + repo + patch; Adv. Post-Exec. observes task + repo + patch + "find bugs" prompt.]
Figure 2. Agentic Uncertainty Regimes. Each regime observes different information. Post-execution and adversarial post-execution occur at the same point but use different prompts.

Adversarial post-execution variant. We also evaluate a variant that explicitly prompts agents to find bugs before estimating confidence. Rather than asking "is this correct?", adversarial post-execution asks "what bugs can you find?" This reframes the task from verification to falsification, potentially counteracting the confirmation bias encouraged by the vanilla post-execution framing, which asks whether a patch is correct.

3. Experiments

3.1. Setup

We evaluate on 100 tasks sampled at random from SWE-bench Pro (Deng et al., 2025), which requires substantial multi-file modifications (mean 107 lines across 4.1 files) and on which frontier models achieve only 23–44% success. Each task corresponds to a full agentic trajectory, which can run up to roughly 15 minutes of wall-clock execution. We generate task-solving trajectories using GPT-5.2-Codex, Gemini 3 Pro, and Claude Opus 4.5, then evaluate uncertainty estimates from the same models.
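The evaluation scores uncertainty agents with four standard metrics: AUROC for discrimination, and ECE, Brier score, and overconfidence (mean estimate minus base rate) for calibration. A minimal pure-Python sketch of these metrics, our own implementation rather than the paper's code:

```python
# Sketch of the paper's four metrics on toy data (not the authors' code).
# `p` holds predicted success probabilities in [0, 1]; `y` holds outcomes (1 = success).

def brier(p, y):
    # Mean squared error between predicted probability and outcome.
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(p)

def overconfidence(p, y):
    # Mean estimate minus base rate.
    return sum(p) / len(p) - sum(y) / len(y)

def ece(p, y, bins=10):
    # Expected calibration error: |accuracy - confidence| per equal-width bin,
    # weighted by the fraction of predictions in that bin.
    total = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, pi in enumerate(p)
               if lo <= pi < hi or (b == bins - 1 and pi == 1.0)]
        if not idx:
            continue
        conf = sum(p[i] for i in idx) / len(idx)
        acc = sum(y[i] for i in idx) / len(idx)
        total += len(idx) / len(p) * abs(acc - conf)
    return total

def auroc(p, y):
    # Probability a random success outranks a random failure (ties count half):
    # the Mann-Whitney formulation of AUROC.
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

p = [0.9, 0.8, 0.85, 0.7, 0.95]  # toy estimates
y = [1, 0, 0, 1, 0]              # toy outcomes (base rate 0.4)
print(round(overconfidence(p, y), 2))  # 0.44: mean estimate 0.84 vs. base rate 0.40
```

The binning scheme and bin count for ECE are our assumptions; the paper does not specify them.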
All uncertainty agents are implemented using mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) with read-only access to prevent "peeking" at test results. Figure 3 shows prompt excerpts.

[Figure 3 prompt excerpts. Pre-Exec.: "Explore the repository to understand the issue, then estimate P(success). Read-only: cat, grep, find, git log/diff/blame. Blocked: modifications, code execution." Mid-Exec.: "Evaluate agent progress at 25/50/75% of execution. Consider: Is the approach sound? Root cause identified? Signs of confusion?" Post-Exec.: "Review the proposed patch: `diff patch`. Estimate P(correct)." Adv. Post-Exec.: "Your job is to find problems: edge cases, failure modes, gaps. 1. Explore context 2. Actively find flaws 3. Report issues."]
Figure 3. Uncertainty Agent Prompt Excerpts. Pre-execution explores the codebase before any solution attempt. Mid-execution evaluates an agent's partial trajectory for signs of progress or struggle. Post-execution reviews a proposed patch. Adversarial post-execution explicitly prompts bug-finding before estimation. All agents output probability estimates [0, 100].

We measure discrimination via AUROC (can agents distinguish successes from failures?). We measure calibration via ECE, Brier score, and overconfidence (mean estimate minus base rate).

3.2. Pervasive Overconfidence

Table 2 reveals systematic overconfidence across all models and methods. Post-execution agents predict 73% success for GPT (base rate 35%), 77% for Gemini (base rate 22%), and 61% for Claude (base rate 27%). Figure 4 visualizes this through confidence distributions: both successes and failures cluster at high values, with near-complete overlap. Gemini exhibits the most extreme pattern, with predictions clustering near 100% regardless of outcome. In fact, Gemini's pre-execution estimates average 99%, leaving virtually no room to distinguish tasks by predicted difficulty. This suggests a distinct failure mode, reluctance to express uncertainty, beyond miscalibration.

Figure 5 confirms miscalibration: all curves fall substantially below the diagonal, meaning 80% confidence corresponds to far less than 80% actual success. Adversarial post-execution (triangles) consistently achieves the best calibration across models.

This overconfidence is strikingly asymmetric. Across all models and methods, 62% of predictions on failing instances are overconfident (predicted ≥ 0.7), while only 11% of predictions on passing instances are underconfident (predicted < 0.3). Agents are 5.5× more likely to confidently predict success on a failing task than to doubt a successful one. Adversarial prompting partially mitigates this: the overconfident-failure rate drops from 72% (standard review) to 45% (adversarial review).

3.3. Less Information, Better Discrimination

Pre-execution tends to achieve better discrimination than vanilla post-execution across all three models despite having strictly less information: GPT 0.62 vs. 0.58, Claude 0.64 vs. 0.55, and Gemini 0.53 vs. 0.51. Bootstrap 95% confidence intervals are wide: GPT pre-execution AUROC is 0.62 [0.50, 0.75] versus post-execution 0.58 [0.46, 0.70]; Claude shows 0.64 [0.49, 0.78] versus 0.55 [0.42, 0.67]. So individual pairwise differences are not statistically significant. However, the consistency of the pattern across all three models (3–9 p improvement each time) is suggestive. One explanation is that observing a proposed patch anchors agents on surface plausibility rather than forcing abstract reasoning about task difficulty.
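The bootstrap confidence intervals quoted above can be reproduced in outline as follows. This is a sketch under assumed resampling choices (case resampling with replacement, percentile intervals); the paper does not publish its exact procedure.

```python
# Sketch of a bootstrap 95% CI for AUROC (assumed scheme, not the authors' code).
import random

def auroc(p, y):
    # Mann-Whitney formulation: P(random success outranks random failure).
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(p, y, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(p)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample instances
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:        # AUROC needs both classes; redraw otherwise
            continue
        stats.append(auroc([p[i] for i in idx], yb))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Perfectly separable toy data: every resample scores AUROC 1.0.
lo, hi = bootstrap_auroc_ci([0.9] * 5 + [0.1] * 5, [1] * 5 + [0] * 5, n_boot=500)
print(f"AUROC 95% CI: [{lo:.2f}, {hi:.2f}]")
```

With only ~100 tasks and 22-35 positives, intervals of width ±0.12 like those in the text are exactly what this procedure produces, which is why the per-model pairwise comparisons stay non-significant.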
Supporting this interpretation, adversarial post-execution, which counteracts such anchoring through bug-finding framing, recovers discrimination competitive with pre-execution for Claude (0.64 [0.51, 0.76]) and Gemini (0.57 [0.43, 0.70]), though not for GPT (0.55 [0.42, 0.68]). This pattern is consistent under AUPRC, which better handles class imbalance: GPT pre-execution achieves 0.52 versus post-execution's 0.42. Gemini's AUPRC remains poor across all methods (0.21–0.27), consistent with its near-chance AUROC.

A concrete example illustrates this anchoring effect. On a task requiring exposure of a new QtWebEngine dark-mode policy in qutebrowser, the pre-execution agent correctly identified the core risk:

Pre-Exec. P(success): 45%. "Biggest risk is knowing exact Chromium key/enum for classifier without external reference."

After seeing the patch, the post-execution agent was reassured by the plausible-looking implementation and jumped to 82%:

Post-Exec. P(success): 82%. "Adds smart-simple value, Qt 6.6 variant gating, optional mapping to suppress switches... logic seems sound."

The patch failed. The pre-execution agent's abstract reasoning about task difficulty was more informative than the post-execution agent's assessment of a coherent-looking but incorrect solution.

3.4. Mid-Execution: Uninformative Doubt

We elicit estimates at 25%, 50%, and 75% trajectory completion (Table 1). Models show divergent AUROC patterns (Figure 6): GPT remains stable (∼0.53), Gemini improves from 0.49 to 0.64, and Claude degrades from 0.62 to 0.52.

Table 1. Mid-execution metrics across checkpoints. Base rates: GPT 35%, Gemini 22%, Claude 27%.

Model            Ckpt  AUROC↑  Mean Est.  Overconf.  ECE↓
GPT 5.2 Codex    25%   0.53    0.67       +0.32      0.32
                 50%   0.51    0.63       +0.28      0.32
                 75%   0.53    0.47       +0.12      0.19
Gemini 3 Pro     25%   0.49    0.87       +0.65      0.65
                 50%   0.64    0.80       +0.58      0.58
                 75%   0.64    0.67       +0.45      0.54
Claude Opus 4.5  25%   0.62    0.58       +0.31      0.31
                 50%   0.52    0.37       +0.10      0.19
                 75%   0.52    0.17       −0.10      0.21

The central finding is "cold feet": confidence decreases with execution progress for 71% of GPT and 97% of Claude instances, yet this doubt is uninformative because success and failure confidence track within 0.05 throughout (Figure 7), and ∆confidence distributions overlap substantially between outcomes (Figure 8).

[Figure 4 panels: confidence histograms for GPT-5.2-Codex (Success n=35, Failure n=65), Gemini-3-Pro (Success n=20, Failure n=75), Claude Opus 4.5 (Success n=27, Failure n=73).]
Figure 4. Distribution of post-execution confidence estimates by model. Success cases shown above the axis (green), failure cases below (red); dashed lines indicate base rates. Mirror symmetry reveals indistinguishable distributions: where bars match above and below, the model assigns identical confidence regardless of outcome. Gemini exhibits the most extreme pattern: nearly all predictions cluster at 100% confidence, creating dramatic mirrored towers. This visual symmetry directly explains the poor discrimination: high-confidence predictions provide no signal about actual success.

Figure 5. Calibration curves reveal systematic overconfidence. Points below the diagonal (shaded region) indicate overconfidence: models predict higher success probability than achieved. All methods fall in this region across all models. Gemini shows the most severe miscalibration: predictions near 100% yield only ∼20% accuracy. The adversarial method (triangles) consistently shifts curves upward toward the diagonal, achieving the best calibration, while pre-execution (circles) shows less extreme overconfidence than standard post-execution (squares) for GPT and Claude.

One partial exception: Claude's confidence drops correlate weakly with outcome (r=−0.20, p=0.04; ∆=−0.46 for successes vs. −0.38 for failures), while GPT (r=−0.03, p=0.77) and Gemini (r=0.15, p=0.14) show no significant relationship.

3.5. Adversarial Framing Reduces Overconfidence

Can we mitigate overconfidence by reframing assessment as bug-finding? Adversarial post-execution prompts agents to "actively search for bugs and failure modes" before estimating success.

This achieves the best calibration across all methods: ECE improves from 0.42 to 0.30 for GPT (28% reduction) and from 0.37 to 0.24 for Claude (35% reduction). Discrimination is mixed: similar for GPT (0.55), improved for Gemini (0.57 vs 0.51) and Claude (0.64 vs 0.55). The cost is higher (23.4 steps at $0.52/instance vs 12.7 at $0.23), but the additional scrutiny yields substantially better predictions.

Standard post-execution agents seek confirmatory evidence, noting positive features while rarely attempting falsification. Adversarial prompting counteracts this by directing attention toward potential flaws. A task requiring a search identifier fix in OpenLibrary illustrates the gap.
The standard reviewer saw a small, plausible one-line addition and gave 85% confidence:

Post-Exec. P(success): 85%. "Adds idprojectruneberg to default fetchedfields... aligns with other id * providers and should expose the identifier."

The adversarial reviewer, prompted to find problems, dug deeper and identified that the output shaping logic would still omit the field:

Adv. Post-Exec. P(success): 25%. "The patch only adds the field to default fetchedfields. However, the output shaping in getdoc does not include this field, so even if Solr returns it, the response omits it. Patch seems incomplete."

The patch failed. The 60-point gap illustrates how adversarial framing overcomes the "looks reasonable" heuristic.

Figure 6. More context does not always improve discrimination. AUROC across checkpoints: GPT stable (∼0.53), Gemini improves from 0.49 to 0.64, Claude degrades from 0.62 to 0.52.

Shift vs. signal decomposition. The calibration improvement could arise from two distinct mechanisms: a uniform downward shift of all estimates (which mechanically improves calibration when base rates are low) or a differential shift that lowers confidence more on failing instances (which genuinely improves discrimination). To disentangle these, we compare the per-instance confidence change (standard minus adversarial) separately for passing and failing instances (Figure 9).

The effect is model-dependent. For GPT, the shift is nearly identical on passing and failing instances (0.11 vs. 0.12, p=0.70), and AUROC is unchanged (Table 2: 0.58→0.55). This is a pure uniform shift: post-hoc Platt scaling of standard post-execution predictions achieves better calibration than adversarial prompting (ECE 0.01 vs. 0.30), confirming that the adversarial framing adds no signal for GPT that recalibration could not recover.
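The Platt-scaling baseline mentioned above fits a two-parameter sigmoid mapping raw confidences to calibrated probabilities. A hedged sketch follows; the paper does not specify its fitting procedure, so plain gradient descent on log-loss is our assumption:

```python
# Sketch of Platt scaling: fit a, b in sigmoid(a*s + b) on held-out outcomes,
# then map raw confidences s to calibrated probabilities. Our own fitter,
# not the authors' code.
import math

def platt_fit(scores, labels, lr=0.1, steps=5000):
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # d(log-loss)/da
            gb += (p - y) / n       # d(log-loss)/db
        a -= lr * ga
        b -= lr * gb
    return a, b

def platt_apply(a, b, s):
    return 1.0 / (1.0 + math.exp(-(a * s + b)))

# Uniformly overconfident judge: always predicts 0.8 while the base rate is 0.35.
scores = [0.8] * 20
labels = [1] * 7 + [0] * 13
a, b = platt_fit(scores, labels)
print(round(platt_apply(a, b, 0.8), 2))  # pulled down to the 0.35 base rate
```

This illustrates the paper's point: a uniform overconfidence shift is fully repairable post hoc, so adversarial framing only adds value where it moves failing instances more than passing ones. In practice one could equally use scikit-learn's sigmoid calibration instead of a hand-rolled fitter.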
For Gemini and Claude, the shift is larger on failing instances (Gemini: 0.18 vs. 0.05; Claude: 0.16 vs. 0.08), widening the pass/fail prediction gap and improving AUROC (Gemini: 0.51→0.57; Claude: 0.55→0.64). For these models, adversarial framing provides genuinely better signal, not merely a location shift. These pairwise differences are not individually significant (p=0.18 and p=0.09), consistent with the sample size limitations noted in Section 3.3.

Adversarial framing also breaks false consensus across models. Under standard pre-execution, all three models agree on the predicted outcome (pass/fail at 50% threshold) for 87% of instances, but only 13% of these agreements are correct. Under adversarial framing, three-way agreement drops to 44%, but 38% of agreements are correct. The adversarial prompt introduces productive disagreement that surfaces genuine uncertainty.

3.6. Ensemble Methods

Since pre-execution and post-execution agents access different information and exhibit different failure modes, combining their estimates may improve calibration. We evaluate three natural ensemble strategies in Table 2: averaging (hedging between views), conservative (min, trusting the more skeptical estimate), and aggressive (max, trusting the more optimistic one).

The conservative ensemble (min of pre- and post-execution) improves calibration over vanilla post-execution: ECE drops from 0.42 to 0.32 for GPT, from 0.37 to 0.31 for Claude. When estimates disagree, the more cautious one is usually correct. However, adversarial post-execution still achieves the best overall calibration.

3.7. Self-Preference Ablation

Could self-preference bias explain overconfidence (Panickssery et al., 2024)? We compare judges' estimates on own-model patches ("self") versus cross-family patches (Table 3). GPT shows self-preference (+23 p on own patches, p=0.001); Gemini shows the opposite (+19 p on GPT patches).
But all conditions exhibit overconfidence regardless of bias direction. Self-preference cannot explain our main finding.

4. Related Work

Concurrent work. Barkan et al. (2025) study whether LLMs can predict their success on coding tasks before attempting them and how these predictions evolve during execution. Similar to us, they find systematic overconfidence across all models. Zhang et al. (2026) propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals.

[Figure 7 panels: confidence vs. trajectory progress for GPT-5.2-Codex (Failure n=65, Success n=35), Gemini-3-Pro (Failure n=78, Success n=22), Claude Opus 4.5 (Failure n=73, Success n=27).]
Figure 7. "Cold feet": confidence decreases uniformly regardless of outcome. Both successes (green) and failures (red) show declining confidence; group means track closely together.

Figure 8. Confidence change does not discriminate outcomes. ∆conf distributions overlap substantially between successes and failures.

LLM uncertainty estimation. Kadavath et al. (2022) introduce P(IK) ("probability that I know"), showing that language models can predict which questions they will answer correctly. We generalize this idea to agentic settings where success depends on multi-step tool use rather than factual recall. Kuhn et al. (2023) introduce semantic entropy, which incorporates linguistic invariances created by shared meanings. Damani et al. (2025) incorporate calibration rewards into reinforcement learning.
Lindsey (2026) provides evidence that LLMs possess limited but functional introspective awareness of their internal states, suggesting a mechanistic basis for self-assessment capabilities.

Overconfidence in LLMs. Tian et al. (2025) diagnose it in LLM-as-judge settings, while Yang et al. (2024) and Sun et al. (2025) find models express high confidence even on incorrect answers. We extend these findings to agentic task completion across multiple steps.

Self-verification and self-correction. A central assumption in deploying post-execution oversight is that verification should be easier than generation. Recent findings challenge this assumption for LLMs. Kamoi et al. (2024)'s survey reveals that LLMs benefit from prompted feedback only for certain tasks. Huang et al. (2024) demonstrate that LLMs struggle to self-correct reasoning without external feedback and performance often degrades after self-correction attempts. Stechly et al. (2024) similarly find significant performance collapse with self-critique on planning tasks. Our finding that post-execution agents are less well-calibrated than pre-execution agents extends this literature: seeing a proposed solution does not help agents assess correctness and may actively harm calibration.

LLM-as-judge and self-preference. Using LLMs to evaluate LLM outputs has become common practice (Gu et al., 2024; Li et al., 2024), but judges exhibit systematic biases. Panickssery et al. (2024) show that LLMs recognize and favor their own generations, suggesting post-execution agents may anchor on familiar reasoning patterns, with Chen et al. (2025) even demonstrating self-preference when the model's own answer is objectively worse. Spiliopoulou et al. (2025) extend this finding to show that the positive bias generalizes to different models of the judge's family.

[Figure 9 panels: confidence shift (standard − adversarial) on pass vs. fail instances. GPT-5.2-Codex: gaps 0.01/0.03, p=0.69 (n.s.); Gemini-3-Pro: gaps −0.02/0.10, p=0.18 (n.s.); Claude Opus 4.5: gaps 0.05/0.12, p=0.09; mean shift 0.12 throughout.]
Figure 9. Adversarial shift decomposition. Mean confidence shift (standard − adversarial) on passing vs. failing instances. For GPT, the shift is uniform (equal bars), improving calibration mechanically. For Gemini and Claude, the shift is larger on failures, widening the pass/fail prediction gap and improving discrimination. "Gap" shows the mean prediction difference between pass and fail instances.

Table 2. Summary of all methods. Pre-execution beats vanilla post-execution for discrimination; adversarial prompting achieves best calibration. Base rates: GPT 35%, Gemini 22%, Claude 27%. Best values per model are bolded in the original. Bootstrap 95% CIs for AUROC are reported in the text (§3.3).

                     GPT-5.2-Codex (35%)           Gemini-3-Pro-Preview (22%)    Claude-Opus-4.5 (27%)
Method               AUROC↑ Overconf. ECE↓  Brier↓ AUROC↑ Overconf. ECE↓  Brier↓ AUROC↑ Overconf. ECE↓  Brier↓
Pre-Execution        0.62   +0.35     0.35  0.33   0.53   +0.77     0.77  0.77   0.64   +0.37     0.38  0.34
Post-Execution       0.58   +0.39     0.42  0.40   0.51   +0.55     0.66  0.65   0.55   +0.34     0.37  0.36
Adv. Post-Exec.      0.55   +0.26     0.30  0.31   0.57   +0.40     0.53  0.51   0.64   +0.20     0.24  0.26
Average              0.62   +0.37     0.37  0.35   0.53   +0.66     0.66  0.65   0.57   +0.36     0.36  0.33
Conservative (min)   0.57   +0.29     0.32  0.32   0.53   +0.54     0.65  0.64   0.54   +0.26     0.31  0.30
Aggressive (max)     0.68   +0.44     0.44  0.41   0.51   +0.78     0.78  0.78   0.65   +0.46     0.46  0.40

Table 3. Self-preference does not explain overconfidence. N=25. Bold in the original marks significant differences (p < 0.05).

Judge   Patches         Mean Est.  AUROC  ECE   Brier
GPT     GPT (self)      0.74       0.55   0.38  0.36
GPT     Gemini (cross)  0.51       0.39   0.44  0.37
Gemini  Gemini (self)   0.72       0.57   0.56  0.59
Gemini  GPT (cross)     0.91       0.55   0.56  0.55

AI control. Greenblatt et al. (2024) develop safety protocols that remain effective even if the AI being monitored is intentionally subversive, using trusted monitoring where a weaker model audits a stronger model's outputs. Bhatt et al.
(2025) extend this to multi-step agentic settings with resample protocols that dynamically sample additional outputs to detect suspicious behavior. AI safety via debate (Irving et al., 2018; Khan et al., 2024) shows that adversarial structure, where agents argue opposing sides, helps weaker judges identify correct answers. Motivating such protocols, Lynch et al. (2025) show that frontier models can engage in harmful behaviors (blackmail, corporate espionage) when facing threats to their autonomy, even while explicitly reasoning about ethical constraints.

Learned verifiers. The distinction between outcome reward models (ORMs; Cobbe et al., 2021) and process reward models (PRMs; Lightman et al., 2023) provides a framework for understanding our elicitation regimes. ORMs assess correctness at the final step, analogous to our post-execution setting, while PRMs provide step-level feedback during execution, similar to mid-execution. Lightman et al. (2023) show that process supervision outperforms outcome supervision for mathematical reasoning. Recent work extends learned verifiers to agentic settings (Agarwal et al., 2026). Our work complements these approaches by studying whether models can serve as their own verifiers without task-specific training.

5. Limitations and Future Work

Beyond software engineering. Our experiments focus exclusively on coding tasks, which offer objective success criteria (tests pass or fail). Agentic overconfidence may manifest differently in domains with ambiguous or subjective success conditions. Web navigation tasks (Zhou et al., 2023), where success depends on achieving user-specified goals, present intermediate cases with partial observability. Scientific workflows involving data analysis, hypothesis generation, and experimental design lack clear ground truth entirely.
Creative tasks (writing, design) introduce subjective quality judgments where calibration itself becomes ill-defined. Understanding how overconfidence varies across this spectrum, from objective to subjective success criteria, would inform domain-specific deployment guidelines.

Trained verifiers for self-assessment. Our uncertainty agents use prompting alone, without task-specific training. A natural extension is training verifiers explicitly for agentic self-assessment, analogous to outcome and process reward models (Cobbe et al., 2021; Lightman et al., 2023). Such verifiers could learn to recognize failure patterns from execution traces, potentially achieving better discrimination than prompting-based approaches. The key challenge is obtaining training signal: while SWE-bench provides binary success labels, scaling to diverse agentic tasks requires either expensive human annotation or proxy metrics that may not capture true task success.

Hybrid deployment strategies. Our results suggest complementary strengths: pre-execution achieves better discrimination, while adversarial post-execution achieves better calibration. A practical deployment strategy might combine both: using pre-execution estimates for task routing (which tasks to attempt) and adversarial post-execution estimates for submission decisions (whether to accept a proposed solution). Investigating the optimal combination, including when to escalate to human review based on estimate disagreement, remains an open question.

Multi-agent uncertainty propagation. Modern agentic systems increasingly involve multiple agents in complex workflows: planners, executors, critics, and coordinators. How does uncertainty propagate through such pipelines? If each agent is overconfident, errors may compound; alternatively, diverse perspectives might provide natural calibration.
Understanding uncertainty dynamics in multi-agent systems is critical as these architectures become more prevalent.

Sample size. Our evaluation uses 100 SWE-bench Pro tasks, yielding as few as 22 positive examples (Gemini). While sufficient to establish the overconfidence pattern, this limits the precision of per-model metric estimates; future work should confirm these findings at larger scale.

Scaling laws for calibration. The relationship between model scale and overconfidence remains unexplored. Preliminary evidence from our three frontier models (which differ in architecture and training rather than scale alone) shows no clear pattern, but systematic scaling studies could reveal whether calibration improves predictably with compute.

6. Conclusion

We study whether AI agents can estimate their own probability of success. Our experiments reveal agentic overconfidence: post-execution agents show up to a 55-percentage-point gap between predicted and actual success rates (Gemini predicts 77% against a 22% base rate). Adversarial post-execution tends to achieve the best calibration by reframing review as bug-finding. More broadly, agentic self-assessment remains a significant challenge for current models and a critical target for future safety research.

Impact Statement

Our finding that agents systematically overestimate success has direct implications for AI safety: it argues against naive reliance on agent self-assessment and for maintaining human oversight, particularly for high-stakes decisions. Adversarial prompting improves calibration but should not be treated as a license to remove human oversight, as it reduces but does not eliminate overconfidence.

References

Agarwal, M., Abdelaziz, I., Basu, K., Unuvar, M., Lastras, L. A., Rizk, Y., and Kapanipathi, P. ToolRM: Outcome reward models for tool-calling large language models, 2026. URL https://arxiv.org/abs/2509.11963.

Appel, R., McCrory, P., Tamkin, A., McCain, M., Neylon, T., and Stern, M.
Anthropic Economic Index report: Uneven geographic and enterprise AI adoption, 2025. URL https://arxiv.org/abs/2511.15080.

Barkan, C. O., Black, S., and Sourbut, O. Do large language models know what they are capable of?, 2025. URL https://arxiv.org/abs/2512.24661.

Bhatt, A., Rushing, C., Kaufman, A., Tracy, T., Georgiev, V., Matolcsi, D., Khan, A., and Shlegeris, B. Ctrl-Z: Controlling AI agents via resampling. arXiv preprint arXiv:2504.10374, 2025.

Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukošiūtė, K., Askell, A., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Olah, C., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., Kernion, J., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lovitt, L., Elhage, N., Schiefer, N., Joseph, N., Mercado, N., DasSarma, N., Larson, R., McCandlish, S., Kundu, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Telleen-Lawton, T., Brown, T., Henighan, T., Hume, T., Bai, Y., Hatfield-Dodds, Z., Mann, B., and Kaplan, J. Measuring progress on scalable oversight for large language models, 2022. URL https://arxiv.org/abs/2211.03540.

Chen, Z.-Y., Wang, H., Zhang, X., Hu, E., and Lin, Y. Beyond the surface: Measuring self-preference in LLM judgments. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 1653–1672, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.86. URL https://aclanthology.org/2025.emnlp-main.86/.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Damani, M., Puri, I., Slocum, S., Shenfeld, I., Choshen, L., Kim, Y., and Andreas, J.
Beyond binary rewards: Training LMs to reason about their uncertainty, 2025. URL https://arxiv.org/abs/2507.16806.

Deng, X., Da, J., Pan, E., He, Y. Y., Ide, C., Garg, K., Lauffer, N., Park, A., Pasari, N., Rane, C., Sampath, K., Krishnan, M., Kundurthy, S., Hendryx, S., Wang, Z., Bharadwaj, V., Holm, J., Aluri, R., Zhang, C. B. C., Jacobson, N., Liu, B., and Kenstler, B. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://arxiv.org/abs/2509.16941.

Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. AI control: Improving safety despite intentional subversion, 2024. URL https://arxiv.org/abs/2312.06942.

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on LLM-as-a-judge. The Innovation, 2024.

Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. Large language models cannot self-correct reasoning yet, 2024. URL https://arxiv.org/abs/2310.01798.

Irving, G., Christiano, P., and Amodei, D. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.

Kamoi, R., Zhang, Y., Zhang, N., Han, J., and Zhang, R. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12:1417–1440, 2024.

Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., Grefenstette, E., Bowman, S. R., Rocktäschel, T., and Perez, E. Debating with more persuasive LLMs leads to more truthful answers. arXiv preprint arXiv:2402.06782, 2024.

Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023.
Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods, 2024. URL https://arxiv.org/abs/2412.05579.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step, 2023. URL https://arxiv.org/abs/2305.20050.

Lindsey, J. Emergent introspective awareness in large language models, 2026. URL https://arxiv.org/abs/2601.01828.

Lynch, A., Wright, B., Larson, C., Ritchie, S. J., Mindermann, S., Hubinger, E., Perez, E., and Troy, K. Agentic misalignment: How LLMs could be insider threats, 2025. URL https://arxiv.org/abs/2510.05179.

METR. Measuring AI ability to complete long tasks. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/, 03 2025.

Panickssery, A., Bowman, S. R., and Feng, S. LLM evaluators recognize and favor their own generations, 2024. URL https://arxiv.org/abs/2404.13076.

Spiliopoulou, E., Fogliato, R., Burnsky, H., Soliman, T., Ma, J., Horwood, G., and Ballesteros, M. Play favorites: A statistical method to measure self-bias in LLM-as-a-judge. arXiv preprint arXiv:2508.06709, 2025.

Stechly, K., Valmeekam, K., and Kambhampati, S. On the self-verification limitations of large language models on reasoning and planning tasks, 2024. URL https://arxiv.org/abs/2402.08115.

Sun, F., Li, N., Wang, K., and Goette, L. Large language models are overconfident and amplify human bias, 2025. URL https://arxiv.org/abs/2505.02151.

Tian, Z., Han, Z., Chen, Y., Xu, H., Yang, X., Xuan, R., Wang, H., and Liao, L. Overconfidence in LLM-as-a-judge: Diagnosis and confidence-driven solution, 2025. URL https://arxiv.org/abs/2508.06225.

Yang, H., Wang, Y., Xu, X., Zhang, H., and Bian, Y. Can we trust LLMs? Mitigate overconfidence bias in LLMs through knowledge transfer, 2024. URL https://arxiv.org/abs/2405.16856.
Zhang, J., Choubey, P. K., Huang, K.-H., Xiong, C., and Wu, C.-S. Agentic uncertainty quantification, 2026. URL https://arxiv.org/abs/2601.15703.

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
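As a reading aid for the metrics reported in Table 2 (AUROC, overconfidence, ECE, Brier), the quantities can be sketched from per-task success-probability estimates and binary outcomes. This is a minimal illustration under our own assumptions (function names and the 10-bin equal-width ECE are ours, not the authors' evaluation code):

```python
from typing import Sequence

def overconfidence(p: Sequence[float], y: Sequence[int]) -> float:
    """Mean predicted success probability minus actual success rate."""
    n = len(y)
    return sum(p) / n - sum(y) / n

def brier(p: Sequence[float], y: Sequence[int]) -> float:
    """Mean squared error between predictions and binary outcomes."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

def ece(p: Sequence[float], y: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error: bin-weighted |confidence - accuracy|."""
    n = len(y)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Last bin is closed on the right so p = 1.0 is included.
        idx = [i for i, pi in enumerate(p)
               if lo <= pi < hi or (b == n_bins - 1 and pi == 1.0)]
        if not idx:
            continue
        conf = sum(p[i] for i in idx) / len(idx)
        acc = sum(y[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(conf - acc)
    return total

def auroc(p: Sequence[float], y: Sequence[int]) -> float:
    """Probability a random passing task is ranked above a random failing
    one (ties count half); the rank-based Mann-Whitney formulation."""
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

On this reading, the Gemini post-execution row follows directly: a mean estimate near 0.77 against a 0.22 base rate gives an overconfidence of roughly +0.55, independent of how well the estimates rank passes above failures (AUROC).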