Paper deep dive
Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning
Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao, Long Ma, Yanghua Xiao
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 6:07:44 AM
Summary
The paper introduces GeoAux-Bench, a benchmark for visual-text interleaved geometric reasoning, and proposes Action Applicability Policy Optimization (A2PO), a reinforcement learning framework that uses adaptive reward shaping to improve how MLLMs utilize auxiliary visual constructions.
Entities (4)
Relation Signals (3)
A2PO → uses → GRPO
confidence 95% · Building upon GRPO, we propose Action Applicability Policy Optimization (A2PO).
A2PO → optimizes → MLLMs
confidence 90% · Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions
GeoAux-Bench → evaluates → MLLMs
confidence 85% · To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought... GeoAux-Bench
Cypher Suggestions (2)
Identify models optimized by A2PO · confidence 95% · unvalidated
MATCH (m:Model)-[:OPTIMIZED_BY]->(a:Algorithm {name: 'A2PO'}) RETURN m
Find all benchmarks related to geometric reasoning · confidence 90% · unvalidated
MATCH (b:Benchmark)-[:FOCUSES_ON]->(r:ReasoningTask {name: 'Geometric Reasoning'}) RETURN b
Abstract
Abstract:Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.
Tags
Links
- Source: https://arxiv.org/abs/2603.18662v1
- Canonical: https://arxiv.org/abs/2603.18662v1
Full Text
117,773 characters extracted from source content.
Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning
Haokun Zhao1,∗ Wanshi Xu2,∗ Haidong Yuan3 Songjun Cao4 Long Ma4,† Yanghua Xiao1,†
1College of Computer Science and Artificial Intelligence, Fudan University; 2School of ECE, Peking University; 3School of Software and Microelectronics, Peking University; 4Tencent Youtu Lab
hkzhao23@m.fudan.edu.cn, shawyh@fudan.edu.cn; {xwanshi, oseast}@stu.pku.edu.cn; {songjuncao, malonema}@tencent.com
∗These authors contributed equally to this work. †Corresponding author.
Abstract
Geometric reasoning inherently requires "thinking with constructions"—the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines.
Code and data are available on GitHub: https://anonymous.4open.science/r/GeoAux-5863.
1 Introduction
Recent advancements in Large Language Models (LLMs) have demonstrated remarkable proficiency in mathematical reasoning Shao et al. (2024); Yang et al. (2024); Chen et al. (2025a), largely driven by the Chain-of-Thought (CoT) technique Wei et al. (2022). However, geometry problem solving remains a significant hurdle. Unlike algebraic tasks that rely on symbolic manipulation, geometric reasoning is intrinsically multimodal Lu et al. (2021); Kazemi et al. (2023): human experts do not merely read static diagrams; they solve problems by constructing and manipulating visual aids (e.g., drawing auxiliary lines) to bridge the gap between conditions and solutions. This process of auxiliary construction is the quintessential embodiment of “thinking with images,” where the visual context dynamically evolves to reveal hidden geometric properties Chern et al. (2025); Li et al. (2025b).
Figure 1: A representative sample from GeoAux-Bench. The solution trajectory is structured in a visual-text interleaved format: the textual auxiliary construction step (e.g., <aux>...</aux>) is explicitly paired with a corresponding auxiliary diagram (I_aux).
To emulate this process, the community has actively explored Visual Chain-of-Thought (VCoT).
However, existing paradigms generally fall into three categories, each facing distinct limitations: (1) Agent-based approaches manipulate geometric code to render diagrams but often rely on ground-truth code inputs, diverging from the natural visual perception of raw images Hu et al. (2024); Wang et al. (2025b); (2) Formal Abstraction methods convert diagrams into formal languages, acting as a lossy compression that strips away visual intuition and risks hallucination Sharma et al. (2024); Trinh et al. (2024); Yang et al. (2025); and (3) Unified Multimodal Models attempt to natively generate visual thoughts Shi et al. (2025); Li et al. (2025a). While promising, even SOTA models like Nano-Banana-Pro Team et al. (2025) suffer from pixel-level structural hallucinations that mislead the reasoning process. Consequently, most MLLMs are confined to a passive static inference mode, unable to update their visual context to match their reasoning steps. This disconnect highlights a critical gap: models lack the strategic procedural knowledge to employ visual aids effectively—specifically, the decision-making of when to draw, what to draw, and how to leverage the visualization for subsequent deduction. Crucially, such constructions significantly reduce problem-solving difficulty, particularly when diagrams are inherently complex or benefit from intrinsic properties Chervonyi et al. (2025). To address this, we conducted an ablation study by injecting oracle auxiliary aids in different modalities. Our findings confirm that single-modality auxiliary aids—whether consisting solely of textual instructions or visual diagrams—cannot serve as a lossless substitute for the interleaved provision of construction statements and corresponding images, as they fail to fully encapsulate the synergistic information inherent in the multimodal context. 
Furthermore, we observed that this visual feedback correlates strongly with a reduction in reasoning perplexity (PPL), mirroring a “cognitive epiphany” where correctly constructed auxiliary lines drastically reduce the uncertainty of the subsequent reasoning trajectory. Building on these insights, we propose a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, a benchmark comprising 4,334 geometry problems and 8,470 diagrams. As illustrated in Figure 1, it establishes a precise interleaved mapping where each textual construction (T_aux) is explicitly paired with its corresponding ground-truth visual update (I_aux). To effectively leverage this data, we present Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm designed to master strategic visual construction. A2PO employs a Tri-Partition Sampling strategy to construct counterfactual reasoning paths (mandatory vs. prohibited). Based on these baselines, we employ Adaptive Reward Shaping to orchestrate the reasoning process: (1) a Timing Reward to discern the necessity of auxiliary lines; complemented by (2) a Quality Reward, grounded in reasoning perplexity, to ensure constructions genuinely simplify the solution path. During inference, we implement Visual Re-prompting to dynamically inject auxiliary diagrams, enabling the model to reason in a truly interleaved manner. In summary, our contributions are as follows:
1. We present GeoAux-Bench, the first geometry benchmark to explicitly associate textual auxiliary construction steps with corresponding auxiliary diagrams.
2. We provide empirical evidence that interleaved visual-textual auxiliary representations significantly outperform single-modality counterparts by up to 1.97%. We further demonstrate that high-quality auxiliary constructions act as an entropy reducer, narrowing the solution search space and lowering reasoning uncertainty.
3.
We propose A2PO, a reinforcement learning paradigm utilizing Adaptive Reward Shaping. By strictly regulating the timing and evaluating the quality of visual aids, A2PO achieves a maximum performance gain of 3.51% over GRPO and unconditional reinforcement strategies. 2 Related Work 2.1 Benchmarks for Multimodal Mathematical Reasoning Visual-mathematical reasoning benchmarks, from foundational datasets Lu et al. (2021, 2022) to recent suites like MMMU Yue et al. (2023) and MathVista Lu et al. (2023), have advanced MLLMs Wang et al. (2024); Zhang et al. (2024); Kazemi et al. (2023). However, they rely on static pairs, lacking step-by-step visual demonstrations for dynamic reasoning. While Shi et al. (2025) explored interleaved formats, such data remains scarce. GeoAux-Bench bridges this gap via explicit text-visual alignment to provide dense supervision. 2.2 Visual Chain-of-Thought (VCoT) Textual Chain-of-Thought Wei et al. (2022); Fang et al. (2025b, a) dominates symbolic reasoning but fails to capture spatial dynamics for geometry. Visual CoT addresses this by integrating visual synthesis into deduction, with two primary streams: (1) Tool-augmented approaches Hu et al. (2024); Wang et al. (2025b); Zheng et al. (2025) suffer from procedural rigidity, treating diagramming as a static step rather than a fluid cognitive strategy; (2) Intrinsic generation models Shi et al. (2025); Li et al. (2025a) are prone to structural hallucinations and lack geometric precision. Our framework synthesizes these paradigms via Visual Re-prompting, combining symbolic precision with dynamic reasoning feedback. 2.3 Reinforcement Learning for Reasoning Reinforcement learning is central to reasoning breakthroughs DeepSeek-AI et al. (2025). Algorithms like GRPO Shao et al. (2024), DAPO Yu et al. (2025), and VAPO Yue et al. (2025) excel in text, vision Meng et al. (2025), and tool-use Li et al. (2025c); Hong et al. 
(2025b), but applying RL to geometry is non-trivial—intermediate visual constructions lack verifiable outcomes. Existing methods like GeometryZero Wang et al. (2025c) optimize timing via tool priors Li et al. (2025c) but overlook reasoning efficacy. Our A2PO addresses this via Adaptive Reward Shaping, using contrastive sampling for timing and reasoning perplexity for utility assessment, ensuring constructions actively facilitate solutions.
3 The GeoAux Benchmark and Reasoning Dynamics
In this section, we introduce GeoAux, a benchmark tailored for Interleaved-Modal Chain-of-Thought (VCoT). To capture the dynamic nature of geometric reasoning, GeoAux explicitly aligns textual auxiliary instructions with their corresponding visual updates (T_aux ↔ I_aux). We first detail the benchmark construction pipeline, followed by a pilot study that quantitatively validates the cognitive synergy between these modalities.
3.1 GeoAux-Bench Construction
To foster research into active visual-textual reasoning, we construct GeoAux-Bench, consisting of two subsets: the expert-annotated GeoAux-Core and the adapted GeoAux-Canvas.
GeoAux-Core. We curated geometric problems predominantly requiring auxiliary constructions from secondary school curricula and Olympiad competitions (post-January 2024) to minimize data contamination. Crucially, we restructure the solution trajectory into an interleaved format. We introduce a dedicated token pair <aux>…</aux> to encapsulate the Auxiliary Construction Instruction (T_aux). The closing tag </aux> explicitly marks the insertion point for the corresponding Auxiliary Diagram (I_aux). This structure acts as a transition operator mapping the initial visual state (I_orig) to an updated state (I_aux):

$\mathcal{M}: (I_{\mathrm{orig}}, T_{\mathrm{aux}}) \rightarrow I_{\mathrm{aux}}$   (1)

Here, T_aux contains explicit directives (e.g., “Connect AB”), while I_aux renders these elements visually. A representative sample is shown in Figure 1.
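As an illustrative sketch (the function and variable names here are assumptions, not from the paper's released code), the <aux>…</aux> convention can be handled by a small parser that recovers each T_aux instruction together with the offset where its I_aux diagram is to be inserted:

```python
import re

# <aux>...</aux> wraps each Auxiliary Construction Instruction (T_aux);
# the closing tag marks where the Auxiliary Diagram (I_aux) is injected.
AUX_PATTERN = re.compile(r"<aux>(.*?)</aux>", re.DOTALL)

def extract_constructions(trajectory: str):
    """Return the T_aux instructions and the character offsets just
    after each </aux>, i.e. the I_aux insertion points."""
    instructions, insert_points = [], []
    for match in AUX_PATTERN.finditer(trajectory):
        instructions.append(match.group(1).strip())
        insert_points.append(match.end())
    return instructions, insert_points

traj = "Note the midpoint M. <aux>Connect AB</aux> Then consider triangle ABM."
t_aux, points = extract_constructions(traj)  # t_aux == ["Connect AB"]
```

The same pattern would let a training loop pause generation at each insertion point before resuming with the updated visual context.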
The subset is stratified into four difficulty levels: Curriculum-Junior/Senior and Olympiad-Junior/Senior.
GeoAux-Canvas. To assess generalization at scale, we adapted samples from MathCanvas-Bench Shi et al. (2025). We filtered for construction-heavy problems and applied our annotation pipeline to generate corresponding (T_aux, I_aux) pairs. This subset retains fine-grained subject tags (e.g., Analytic Geometry, Trigonometry), enabling detailed capability analysis.
Quality Control and Standardization. We enforced a rigorous three-stage pipeline: (1) Solvability Verification: using Gemini-2.5-Pro Comanici et al. (2025) to ensure problem conditions are sufficient for a unique solution; (2) Symbolic Normalization: parsing all mathematical expressions into a unified LaTeX format; (3) Visual Enhancement: we utilized Seedream 4.0 Chen et al. (2025b) for super-resolution to enhance image quality, followed by normalization to 512×512. Furthermore, to enable deterministic evaluation of open-ended proofs, we reformulate proof problems (e.g., “Prove A=B”) into verifiable computational problems (e.g., “Find the ratio A/B”).
Statistics. GeoAux-Bench comprises 4,334 problems (6,523 queries) and 8,470 geometric diagrams, with an extremely high image-to-problem ratio as illustrated in Figure 2. Detailed statistics are provided in Appendix B.
Figure 2: Overview of GeoAux-Bench. (a) Subject & Difficulty Distribution; (b) Image Count; (c) Text Length Distribution.
3.2 Pilot Study: Complementarity of Auxiliary Modalities in Interleaved Geometric Reasoning
We conceptualize geometric construction as a critical Interleaved Reasoning Step with dual representations: the Auxiliary Construction Instruction (T_aux) and the Auxiliary Diagram (I_aux). While recent unified MLLMs (e.g., Chameleon-7B Team et al. (2024), BAGEL-7B-MoT Deng et al.
(2025)) attempt end-to-end interleaved generation, their capability to precisely render geometric constraints remains nascent. Consequently, prior works often resort to compromises: they convert diagrams that ought to be presented via the visual modality into single-modal formal languages or drawing code, thereby circumventing the challenge of visual generation. This raises a fundamental research question: Does T_aux—the textual modality describing the visual diagram—fully encapsulate the latent spatial information contained in the corresponding visual diagram I_aux, or do these two modalities offer complementary cognitive grounding? To disentangle the contributions of the semantic instruction and the corresponding visual diagram, we design a controlled ablation study. Let Q denote the problem text and I_orig the initial diagram. We evaluate performance under four experimental settings:
• Standard Setting: (Q, I_orig). Baseline reasoning without hints.
• Textual-Only Setting: (Q, I_orig, T_aux). Providing only the auxiliary construction instruction.
• Visual-Only Setting: (Q, I_orig, I_aux). Providing only the auxiliary diagram.
• Interleaved Setting: (Q, I_orig, T_aux, I_aux). Simulating an ideal interleaved reasoning step with both modalities.
Perceptual Saliency Control (Visual Enhancement). A potential confounder in the Visual-Only Setting is visual saliency: standard (typically dashed) auxiliary lines may be too subtle for detection, leading to false negatives from perceptual misses. To boost the saliency of these elements and prevent them from being overlooked in complex geometric configurations, we introduce a Visual Enhancement Protocol, as illustrated in Figure 3.
Specifically, we identified 200 hard samples where models succeeded with the Textual-Only Setting but failed with the Visual-Only Setting; to these, we applied the enhanced annotation (denoted I_aux^red) to ensure visibility, yielding two control settings: Enhanced Visual-Only (Q, I_orig, I_aux^red) and Enhanced Dual-Modal (Q, I_orig, T_aux, I_aux^red).
Figure 3: Visual Enhancement Protocol. Bold red lines are used to highlight auxiliary elements (I_aux^red).
Analysis of Reasoning Dynamics. Table 1 presents the performance metrics. By analyzing the intra-group trends, we derive three critical insights into the cognitive mechanism of auxiliary construction:

Setting         | Input Modality    | Acc% (Δ)      | Tokens | PPL
G1: Text-Only Settings
Standard        | Q, I_orig         | 23.68 (+0.00) | 1336.2 | 1.1250
Text-Only       | +T_aux            | 24.24 (+0.56) | 1241.3 | 1.1185
G2: Visual-Only Settings
Visual-Only     | +I_aux            | 23.34 (−0.34) | 1258.0 | 1.1340
Enhanced Visual | +I_aux^red        | 24.44 (+0.76) | 1303.6 | 1.1310
G3: Interleaved Settings
Dual-Modal      | +T_aux +I_aux     | 24.90 (+1.22) | 1230.7 | 1.1350
Enhanced Dual   | +T_aux +I_aux^red | 25.31 (+1.63) | 1270.5 | 1.1323

Table 1: Reasoning Performance under Different Modal Input Combinations. Experiments are grouped by context structure for fair intra-group comparisons.
(1) Irreplaceability and Complementarity of Modalities. While the Text-Only Setting outperforms the baseline, it still falls 1.07% short of the Interleaved Setting. This directly answers our research question: textual instructions (T_aux) cannot fully replace visual diagrams (I_aux). The two modalities provide complementary cognitive grounding: text defines operational instructions to guide focus on target elements and resolve semantic ambiguity, while diagrams deliver spatial relationships to clarify geometric connections and reduce uncertainty.
Optimal reasoning performance arises only when the model “sees” the spatial realization of its intent.
(2) Lower PPL Correlates with Higher Accuracy Across Modal Settings. Across all groups, we observe a strong positive correlation between reduced perplexity and improved reasoning accuracy under the same modal configuration—notably, with no significant correlation between accuracy and generated token length. This aligns with human geometric reasoning intuition: just as a well-constructed auxiliary line triggers an "epiphany" that simplifies a complex problem, valid auxiliary information lowers the model’s predictive entropy. A lower PPL may signal a clearer, less ambiguous solution path, which empirically boosts the likelihood of correct reasoning—an insight we leverage in our reward shaping design.
(3) Visual Saliency Matters. Comparing Visual-Only with Enhanced Visual (and Interleaved analogously), enhancing visual saliency consistently reduces PPL and improves accuracy (up to +1.10%). Notably, this performance gain stems from modifying only 200 specific samples. This sensitivity indicates the model’s performance is highly reliant on the perceptual clarity of auxiliary elements. We hypothesize visual feature enhancement enables the model’s visual system to more reliably detect and anchor geometric configuration changes. This confirms visual perceptibility is a strict prerequisite for geometric reasoning. For qualitative analysis of attentional patterns, see Appendix A.
4 Method
Figure 4: The framework of Action Applicability Policy Optimization (A2PO). The upper panel shows standard GRPO, while the lower panel illustrates our tri-partition sampling and adaptive reward shaping mechanism.
4.1 Preliminaries: GRPO
We adopt Group Relative Policy Optimization (GRPO) as our optimization backbone. GRPO eliminates the separate value function by estimating the baseline from the group average.
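Concretely, estimating the baseline from the group average amounts to the normalization sketched below (a minimal NumPy illustration of the group-relative advantage; the epsilon value is an illustrative placeholder):

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """GRPO-style advantage: center each rollout's reward on the group
    mean and scale by the group standard deviation, so no learned value
    function is needed as a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one question: two correct (reward 1), two wrong (reward 0).
adv = group_relative_advantage([1.0, 0.0, 1.0, 0.0])
# Correct rollouts receive positive advantage, wrong ones negative,
# and the group's advantages sum to approximately zero.
```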
Formally, given a question q, the policy generates a group of outputs $\{o_i\}_{i=1}^{G}$. The advantage $A_i$ is estimated by normalizing group rewards:

$A_i = \dfrac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G}) + \epsilon}$,   (2)

where $\epsilon$ is a small constant. GRPO optimizes the surrogate objective, averaged over tokens:

$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,o}\Big[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\big(\min(r_{i,t}A_i,\ \mathrm{clip}(r_{i,t}, 1-\varepsilon, 1+\varepsilon)A_i) - \beta\,\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})\big)\Big]$,   (3)

where $r_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the probability ratio, and $o = \{o_i\}_{i=1}^{G}$ denotes the sampled group.
4.2 Overview of A2PO
Building upon GRPO, we propose Action Applicability Policy Optimization (A2PO). While standard GRPO optimizes solely for outcome correctness, geometric reasoning requires mastering the strategic applicability of auxiliary constructions—specifically, discerning whether a visual modification is beneficial for the specific problem configuration. To explicitly model this applicability, A2PO introduces a Tri-Partition Sampling mechanism. As illustrated in Figure 4, instead of sampling from a single policy distribution, we partition the rollout trajectories into three distinct subsets. This structure enables the construction of counterfactual baselines—comparing scenarios where auxiliary lines are enforced versus prohibited—to derive an Adaptive Reward that guides the model on when to construct (Timing) and how to construct efficiently (Quality).
4.3 Tri-Partition Sampling with Visual Re-prompting
To quantify the marginal utility of auxiliary constructions, we sample trajectories via three distinct conditioning protocols, aggregating them into a rollout set $\mathcal{O} = \{O^+, O^-, O\}$. For a given query q and initial diagram I_orig, we sample N trajectories for each subset.
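At the decoding level, the protocols detailed below reduce to two operations: pre-populating the generation prefix with <aux> (mandatory subset) and masking the <aux>/</aux> logits (prohibited subset). A minimal sketch of the masking step, with illustrative token ids and a greedy decoder standing in for the real sampler (none of these names come from the paper's code):

```python
import numpy as np

def mask_aux_logits(logits, banned_ids):
    """O^- hard constraint: drive the <aux>/</aux> logits to -inf so the
    decoder can never open an auxiliary construction step."""
    out = np.asarray(logits, dtype=float).copy()
    out[list(banned_ids)] = -np.inf
    return out

def greedy_step(logits, banned_ids=()):
    """One decoding step; a real system would sample, not argmax."""
    masked = mask_aux_logits(logits, banned_ids) if banned_ids else np.asarray(logits)
    return int(np.argmax(masked))

# Toy vocabulary: id 2 = "<aux>", id 3 = "</aux>".
logits = [0.1, 0.4, 2.0, 0.2]
unconstrained = greedy_step(logits)        # natural subset: picks "<aux>" (id 2)
prohibited = greedy_step(logits, {2, 3})   # O^-: falls back to the best non-aux token
```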
Prompts for each protocol are provided in Appendix E:
• Mandatory Subset (O^+): We employ Prefix Forcing to enforce geometric intervention. Given that the model is fine-tuned during the supervised warm-up to encapsulate auxiliary commands in <aux> tags, we pre-populate the generation prefix with <aux>. This strictly forces a valid auxiliary construction step while retaining the standard prompt context.
• Prohibited Subset (O^-): We impose a Hard Constraint to disable visual updates. In addition to appending a negative constraint to the system prompt, we explicitly mask the logits of the <aux> and </aux> tokens during decoding, guaranteeing a reasoning trajectory devoid of auxiliary constructions and forcing the model to rely solely on the initial visual context.
• Natural Subset (O): The model samples autonomously using the same standard prompt as O^+, without any intervention. This subset represents the target policy to be optimized.
Visual Re-prompting. To simulate interleaved visual reasoning despite current rendering limitations, we employ a retrieval-based injection strategy within the Natural Subset. Upon detecting a completed auxiliary command, an Aux Verifier evaluates its semantic equivalence to the ground truth. If and only if the construction is equivalent to the ground truth, we trigger re-prompting: the generation is paused, and the model is queried again with an augmented context that appends a structured “Hint” containing the ground-truth instruction (T_aux) and the corresponding auxiliary diagram (I_aux). This provides high-fidelity visual feedback contingent on correct reasoning. Prompt transformations are detailed in Appendix E.
4.4 Adaptive Reward Shaping
We design a composite reward function R(o) specifically for the natural subset O:

$R(o) = w_1 r_{\mathrm{acc}} + w_2 r_{\mathrm{fmt}} + w_3 r_{\mathrm{time}} + w_4 r_{\mathrm{qual}}$   (4)

where $w_{(\cdot)}$ are weighting coefficients.
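Putting this composite together with the two A2PO-specific terms defined in Section 4.4 below (timing from the counterfactual gap, quality from a PPL baseline), a hedged sketch follows. The weights, tau, and slack values are illustrative placeholders, not the paper's tuned hyperparameters, which live in its Appendix F:

```python
import numpy as np

def timing_reward(has_aux, acc_plus, acc_minus, tau=0.05):
    """Sign of the counterfactual utility gap
    Delta = E_{O+}[r_acc] - E_{O-}[r_acc], gated by the margin tau."""
    delta = float(np.mean(acc_plus)) - float(np.mean(acc_minus))
    sign = 1.0 if delta > tau else (-1.0 if delta < -tau else 0.0)
    return float(has_aux) * sign

def quality_reward(has_aux, r_acc, ppl, ppl_plus, slack=0.01):
    """Bonus only for correct trajectories whose PPL beats the
    mandatory-subset baseline (plus a small slack)."""
    p_bar = float(np.mean(ppl_plus))
    return float(has_aux) * r_acc * float(ppl < p_bar + slack)

def total_reward(r_acc, r_fmt, r_time, r_qual, w=(1.0, 0.2, 0.3, 0.3)):
    """Weighted composite reward for a natural-subset rollout."""
    return w[0] * r_acc + w[1] * r_fmt + w[2] * r_time + w[3] * r_qual

# A correct rollout that drew a useful, low-entropy auxiliary line:
r_t = timing_reward(True, acc_plus=[1, 1, 0], acc_minus=[0, 0, 0])
r_q = quality_reward(True, r_acc=1.0, ppl=1.10, ppl_plus=[1.12, 1.14])
```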
While r_acc and r_fmt align with standard GRPO protocols, we introduce r_time and r_qual to optimize auxiliary construction efficacy:
4.4.1 Timing Reward (r_time)
This component trains the model to discern the strategic necessity of auxiliary constructions. Let $\mathbb{I}_{\mathrm{aux}}(o)$ be the auxiliary indicator and $\Delta = \mathbb{E}_{O^+}[r_{\mathrm{acc}}] - \mathbb{E}_{O^-}[r_{\mathrm{acc}}]$ be the utility gap. We formulate $r_{\mathrm{time}}(o)$ with a significance margin $\tau$:

$r_{\mathrm{time}}(o) = \mathbb{I}_{\mathrm{aux}}(o) \cdot \begin{cases} 1 & \text{if } \Delta > \tau \\ -1 & \text{if } \Delta < -\tau \\ 0 & \text{otherwise} \end{cases}$   (5)

This restricts auxiliary construction to scenarios yielding a net performance gain.
4.4.2 Quality Reward (r_qual)
Building on our pilot findings (Sec. 3.2), we posit that effective auxiliary constructions function as entropy reducers, mirroring the “cognitive epiphany” that manifests as reduced reasoning PPL. We establish a perplexity baseline $\bar{P} = \mathbb{E}_{O^+}[\mathrm{PPL}]$ derived from the mandatory subset. The quality reward grants a bonus solely to valid, low-entropy auxiliary usage:

$r_{\mathrm{qual}}(o) = \mathbb{I}_{\mathrm{aux}}(o) \cdot r_{\mathrm{acc}}(o) \cdot \mathbb{I}\big(\mathrm{PPL}(o) < \bar{P} + \delta\big)$   (6)

where $\delta$ compensates for the PPL overhead induced by visual re-prompting. This reward explicitly favors confident reasoning paths that simplify the solution space. Detailed hyperparameter configurations are provided in Appendix F.
5 Experiments
We split our experiments into two distinct parts to rigorously validate our contributions: (1) conducting a systematic evaluation of popular MLLMs on GeoAux-Bench to establish a difficulty baseline; (2) focusing on our proposed A2PO framework, with comparisons against strong RL baselines across multiple benchmarks to verify the efficacy of our adaptive reward shaping.
5.1 Benchmark Results
We conduct a comprehensive evaluation on GeoAux-Bench involving various popular MLLMs, including proprietary SOTA models Comanici et al. (2025); OpenAI (2025); Hurst et al.
(2024); Anthropic (2025a, b); ByteDance Seed Team (2025), open-weights baselines Bai et al. (2025a, b); Zhu et al. (2025); Wang et al. (2025a); Hong et al. (2025a), and native unified models capable of interleaved image-text generation Shi et al. (2025); Deng et al. (2025); Li et al. (2025a). All models are evaluated under a unified setting (see Appendix F for hyperparameters). The performance on GeoAux-Bench-Core is reported in Table 2. Results on the GeoAux-Bench-Canvas subset are provided in Appendix B.

Model              | Think | Curr.-Sr. | Curr.-Jr. | Olym.-Sr. | Olym.-Jr. | Total
Closed-source MLLMs
Gemini-2.5-Pro     | ✓ | 83.91 | 84.75 | 82.28 | 88.24 | 83.16
Gemini-2.5-Flash   | ✓ | 80.10 | 81.68 | 64.56 | 82.13 | 79.53
GPT-5              | ✓ | 71.48 | 84.02 | 79.32 | 88.41 | 80.62
GPT-4o             | ✗ | 19.57 | 24.29 | 57.38 | 42.51 | 28.13
Claude-Opus-4.1    | ✓ | 60.36 | 52.70 | 58.23 | 58.45 | 55.82
Claude-Sonnet-4.5  | ✓ | 59.20 | 52.00 | 55.70 | 60.39 | 55.03
Doubao-seed-1.6    | ✓ | 62.02 | 61.42 | 72.57 | 86.47 | 65.00
Open-source MLLMs
Qwen3-VL-235B-Ins. | ✗ | 61.36 | 77.80 | 78.90 | 91.79 | 74.85
Qwen3-VL-235B-Thk. | ✓ | 64.68 | 50.36 | 89.45 | 95.17 | 62.25
Qwen3-VL-30B-Ins.  | ✗ | 61.36 | 70.38 | 75.53 | 89.37 | 70.25
Qwen3-VL-30B-Thk.  | ✓ | 55.72 | 49.64 | 83.97 | 93.24 | 58.75
Qwen2.5-VL-Ins.    | ✗ | 23.05 | 28.81 | 65.82 | 52.17 | 33.25
InternVL3.5-8B     | ✗ | 38.31 | 26.63 | 54.85 | 66.67 | 36.26
InternVL3-8B       | ✗ | 25.87 | 30.02 | 67.93 | 55.56 | 35.17
GLM-4.1V-Thk.      | ✓ | 31.67 | 23.57 | 56.12 | 53.14 | 31.76
GLM-4.5V           | ✗ | 45.27 | 24.62 | 72.15 | 82.61 | 40.24
MiMo-VL-7B-SFT     | ✗ | 50.41 | 42.13 | 71.73 | 80.68 | 50.87
MiMo-VL-7B-RL      | ✓ | 50.75 | 41.65 | 71.73 | 80.68 | 50.70
Unified MLLMs
BAGEL-7B-MoT       | ✗ | 7.30  | 7.34  | 34.60 | 38.16 | 12.95
BAGEL-Zebra-CoT    | ✗ | 6.80  | 6.94  | 31.65 | 35.27 | 12.03
MathCanvas-7B      | ✗ | 18.41 | 18.89 | 46.84 | 51.69 | 24.63

Table 2: Comparison on GeoAux-Bench-Core. Best closed and open scores are highlighted. “Ins.” and “Thk.” denote Instruct and Thinking models; “Curr.” and “Olym.” denote the Curriculum and Olympiad splits, each with Senior and Junior levels.
Analysis. The results highlight three critical observations: (1) Performance Gap & Difficulty.
A substantial performance chasm exists between top-tier proprietary models (e.g., Gemini-2.5-Pro at 83.16%) and typical open-weights baselines (e.g., InternVL3.5-8B at 36.26%). Furthermore, the benchmark proves exceptionally difficult for Native Unified MLLMs (e.g., BAGEL < 13%), which perform the worst among all paradigms. Despite being intuitively aligned with human “thinking-while-drawing,” qualitative analysis (see Appendix D.1) reveals that these models fail to execute precise geometric edits, leading to severe visual hallucinations that derail subsequent reasoning. (2) The Analytic Shortcut. Models frequently perform better on Senior geometry than on Junior geometry. Case studies in Appendix D.2 reveal that MLLMs exhibit a strong preference for Analytic Geometry (e.g., establishing coordinate systems) to bypass visual intuition. While this algebraic conversion works for high school problems, it is often inefficient or inapplicable to the pure geometric logic required in Junior tasks, explaining the performance inversion. (3) Signs of Memorization. Two anomalies point to potential data contamination rather than robust reasoning. First, models like Qwen3-VL-Thk achieve remarkable scores on the challenging Senior Olympiad split (89.45%) yet struggle significantly on the simpler Junior Curriculum set (50.36%). Second, reasoning-enhanced “Thinking” models often underperform their instruction-tuned counterparts. We attribute this to the finite and public nature of Olympiad problems. Unlike the vast space of curriculum exercises, high-profile competition questions are limited in number and widely circulated, making them highly susceptible to inclusion in pre-training corpora. This suggests that such performance spikes likely stem from memorizing specific problem instances rather than generalized geometric reasoning.
5.2 Performance of A2PO
We evaluate our proposed A2PO against SFT and strong RL baselines (GRPO Shao et al. (2024), ToRL Li et al.
(2025c), GeometryZero Wang et al. (2025c)). Experiments use Qwen2.5-VL across three datasets: GeoAux-Bench and the external benchmarks Geomverse Kazemi et al. (2023) and Geometry3k Lu et al. (2021). Results are in Table 3; training/inference configurations are in Appendix F.

Method       | GeoAux Acc% | Geomverse Acc% | Geometry3k Acc% | Avg. Acc% | GeoAux PPL ↓
Qwen2.5-VL-3B-Instruct
SFT          | 23.09 | 56.20 | 39.40 | 39.56 | 1.1389
GRPO         | 31.22 | 59.10 | 50.72 | 47.01 | 1.1550
ToRL         | 28.68 | 58.40 | 47.06 | 44.71 | 1.1558
GeometryZero | 29.33 | 57.00 | 52.72 | 46.35 | 1.1535
A2PO (Ours)  | 33.20 | 58.40 | 53.05 | 48.22 | 1.1534
Qwen2.5-VL-7B-Instruct
SFT          | 37.30 | 63.60 | 46.17 | 49.02 | 1.0857
GRPO         | 39.28 | 67.40 | 55.49 | 54.06 | 1.0887
ToRL         | 39.77 | 65.50 | 53.50 | 52.92 | 1.0941
GeometryZero | 40.18 | 68.30 | 53.72 | 54.07 | 1.0945
A2PO (Ours)  | 42.97 | 70.70 | 53.61 | 55.76 | 1.0869

Table 3: Main results on GeoAux and external benchmarks. Best and second-best results are highlighted.
Analysis. The comparative results demonstrate the superiority of our policy optimization strategy: (1) Consistent Improvements. A2PO consistently outperforms all baselines across model scales. Notably, on the 7B scale, A2PO achieves an average accuracy of 55.76%, surpassing the strong GeometryZero baseline (54.07%) and standard GRPO (54.06%). This gain is most pronounced on GeoAux (+2.79% over GeometryZero), confirming that our reward shaping is particularly effective for geometric problems that heavily rely on auxiliary construction for their solution. (2) Reasoning Certainty & Efficiency. A critical observation lies in perplexity: strong baselines like GeometryZero and ToRL improve accuracy but see elevated PPL compared to SFT (e.g., up to +0.0169), suggesting performance gains may come at the cost of uncertain or convoluted reasoning. In contrast, A2PO achieves the highest accuracy while maintaining a PPL nearly matching SFT. This validates Section 3.2: auxiliary constructions act as “cognitive scaffolds” reducing uncertainty.
The Quality Reward guides A2PO toward elegant simplifications—akin to a geometric “epiphany”—over convoluted computation.

5.3 Ablation Study

Table 4 presents a component-wise ablation on Qwen2.5-VL-7B-Instruct to validate the efficacy of each component.

Base Model: Qwen2.5-7B-Instruct

Method                  LR  TR  QR  Vis  Acc%
GRPO                    ✓   ✗   ✗   ✗    39.28
GRPO (w/o LR)           ✗   ✗   ✗   ✗    39.52
A2PO (w/o QR, w/o Vis)  ✗   ✓   ✗   ✗    40.18
A2PO (w/o Vis)          ✗   ✓   ✓   ✗    41.17
A2PO                    ✗   ✓   ✓   ✓    42.97

Table 4: Ablation study of A2PO components, including Length Reward (LR), Timing Reward (TR), Quality Reward (QR), and Visual Re-prompting (Vis). Accuracy results are based on GeoAux-Bench.

Analysis. We analyze the incremental impact of our design choices:

(1) Removing Length Bias (LR). Removing the standard Length Reward yields a performance gain (39.28% → 39.52%). Unlike open-ended generation, where length often correlates with detail, mathematical proofs prioritize conciseness. We observed that generic length incentives inadvertently encouraged verbose behaviors (e.g., repetitive reasoning). Removing this constraint allows the model to focus on the logical precision of the solution rather than artificially extending the trajectory.

(2) Efficacy of Reward Shaping (TR & QR). Incorporating the Timing Reward (TR) improves accuracy to 40.18%, surpassing the ToRL baseline (39.77%). While ToRL indiscriminately rewards any valid auxiliary construction, TR teaches the model the strategic distinction of necessity—rewarding the action only when it is strictly beneficial compared to a non-auxiliary path. Further adding the Quality Reward (QR) pushes performance to 41.17%. This confirms that guiding the model toward lower-perplexity (i.e., more confident) constructions effectively filters out valid-but-redundant auxiliary lines—constructions that are syntactically correct but offer no strategic value to the solution.

(3) Visual Synergy (Vis).
The most significant jump occurs with Visual Re-prompting (+1.80%), achieving the peak accuracy of 42.97%. This empirical evidence strongly supports our core contribution: textual descriptions of auxiliary lines alone are insufficient. The model achieves optimal reasoning only when the textual construction is explicitly rendered and injected back as an updated visual context. This mechanism successfully simulates the cognitive advantage of “thinking with images” within a re-prompting framework.

6 Conclusion

In this work, we advance geometric problem solving by shifting from static perception to active Visual-Text Interleaved reasoning. First, we introduce GeoAux-Bench, the first benchmark to rigorously align textual construction instructions with corresponding visual updates, providing a precise testbed for multimodal interaction. Second, our empirical analysis reveals that visual aids function as essential entropy reducers: interleaved visual feedback significantly lowers reasoning uncertainty compared to single-modality inputs, validating the cognitive necessity of “thinking with images.” Capitalizing on this, we propose Action Applicability Policy Optimization (A2PO). By employing Adaptive Reward Shaping to strictly regulate the timing and quality of visual interventions, A2PO enables models to master the strategic deployment of auxiliary lines, achieving state-of-the-art performance. Our findings demonstrate that equipping models with the agency to actively modify their visual context is a pivotal step toward autonomous, physically grounded geometric reasoning.

Limitations

The primary limitation of this work lies in the implementation of the visual update mechanism. While our A2PO framework establishes the cognitive benefits of interleaved reasoning, the current execution relies on a retrieval-based injection strategy—utilizing ground-truth auxiliary diagrams—rather than native model-generated visuals.
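The retrieval-based injection strategy can be pictured as a simple loop. This is a schematic sketch under our own assumptions: `model` and `retrieve_aux_diagram` are hypothetical callables standing in for the MLLM and the diagram lookup, not an interface from the paper's released code.

```python
def solve_with_reprompting(model, problem_text, diagram, retrieve_aux_diagram):
    """One round of visual-text interleaved reasoning with retrieval-based
    visual re-prompting: if the model proposes a textual construction, the
    diagram containing that auxiliary line is retrieved and injected back
    as updated visual context before reasoning continues."""
    context = [("image", diagram), ("text", problem_text)]
    step = model(context)  # may propose an auxiliary construction
    if step.get("construction"):
        # Rather than asking the model to edit pixels itself, fetch a
        # pre-rendered diagram with the auxiliary line already drawn.
        updated = retrieve_aux_diagram(step["construction"])
        context += [("text", step["construction"]), ("image", updated)]
        step = model(context)  # re-prompted with the rendered construction
    return step["answer"]
```

In the setting described above, the retrieval step returns a ground-truth auxiliary diagram rather than a model-generated image, which is precisely the gap between simulating the strategic decision and executing it physically.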
This design choice is necessitated by the capabilities of current Unified MLLMs. Despite recent advancements, existing models still lack the fine-grained visual actuation capability required for precise geometric editing. As observed in our pilot experiments, even atomic operations (e.g., “Connect points A and B”) frequently result in structural hallucinations or pixel-level inaccuracies that can propagate errors into the reasoning chain. Furthermore, the high inference latency associated with autoregressive image generation currently hinders efficient exploration during reinforcement learning. Consequently, our work simulates the strategic decision-making of visual construction rather than its physical execution.

We posit that realizing a fully autonomous loop—where a model generates, perceives, and refines its own diagrams—requires future advancements in multimodal pre-training. Specifically, it demands a tighter alignment between high-level geometric conceptual understanding and low-level pixel editing skills. Only upon this foundation can true dynamic sampling and reinforcement learning for generative visual-text reasoning be achieved.

References

Anthropic (2025a) Claude Opus 4.1 system card. System card, Anthropic.
Anthropic (2025b) Claude Sonnet 4.5 system card. System card, Anthropic.
S. Bai, Y. Cai, R. Chen, et al. (2025a) Qwen3-VL technical report.
S. Bai, K. Chen, X. Liu, et al. (2025b) Qwen2.5-VL technical report. ArXiv abs/2502.13923.
ByteDance Seed Team (2025) Doubao / SEED-1.6 model series. https://seed.bytedance.com/en/seed1_6.
J. Chen, W. Chen, J. Du, et al. (2025a) Seed-Prover 1.5: mastering undergraduate-level theorem proving via learning from experience.
Y. Chen, Y. Gao, L. Gong, et al. (2025b) Seedream 4.0: toward next-generation multimodal image generation. ArXiv abs/2509.20427.
E. Chern, Z. Hu, S. Chern, et al. (2025) Thinking with generated images. ArXiv abs/2505.22525.
Y. Chervonyi, T. H. Trinh, M. Olsák, et al. (2025) Gold-medalist performance in solving olympiad geometry with AlphaGeometry2. ArXiv abs/2502.03544.
G. Comanici, E. Bieber, M. Schaekermann, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.
Pongetti, J. N. McGiffin, A. Siegman, R. Galt, R. Hemsley, G. vZuvzi’c, V. Carbune, T. Li, M. Ott, F. de Chaumont Quitry, D. V. Torres, Y. Chervonyi, T. Tsai, P. Eruvbetine, S. Yang, M. Denton, J. Walker, S. Andavci’c, I. H. Shtacher, V. Premachandran, H. T. Lehri, C. Baetu, D. Yates, L. Lamprou, M. Iinuma, I. Mihailescu, B. Albrecht, S. Dave, S. Sargsyan, B. Perozzi, L. Manning, C. Zhang, D. Vnukov, I. Mordatch, R. H. W. Macherey, R. D. Kappedal, J. Stephan, A. Tripathi, K. Macherey, J. Qian, A. Bhowmick, S. Azizi, R. Leblond, S. M. R. Garlapati, T. Knight, M. Wiethoff, W. Hung, A. Angelova, G. Evangelopoulos, P. Janus, D. Paparas, M. Rahtz, K. Caluwaerts, V. Sampathkumar, D. Jarrett, S. Noghabi, A. Miech, C. Yeung, G. Clark, H. Prior, F. Zheng, J. Pouget-Abadie, I. Bhattacharya, K. Krishna, W. Bishop, Z. Yuan, Y. Deng, A. Sathe, K. Krasowiak, C. Chelba, C. Hsieh, K. Vodrahalli, B. Liu, T. Koppe, A. Khalifa, L. Litchev, P. Charoenpanit, R. Roberts, S. Yadav, Y. Onoe, D. Ivanov, M. Mohabey, V. Birodkar, N. Raki’cevi’c, P. Sermanet, V. Mehta, K. Subudhi, T. Choma, W. Ng, L. He, K. Wang, T. Kementsietsidis, S. Gu, M. Gupta, A. Nystrom, M. Kazemi, T. Chung, N. Cano, N. Dhawan, Y. Wang, J. Xia, T. Yacovone, E. Jia, M. Chen, S. Ivanov, A. Sheshan, S. Dalmia, P. Stradomski, P. Yin, S. Haykal, C. Wang, D. Duan, N. Bulut, G. Kochanski, L. MacDermed, N. Godbole, S. Weng, J. Chen, R. Fellinger, R. Mehran, D. Suo, H. Husain, T. He, K. Patel, J. Howland, R. Parker, K. Nguyen, S. Maddineni, C. Rawles, M. Khan, S. Cohen-Ganor, A. Mandhane, X. Wu, C. Kuang, I. Comcsa, R. Ganeshan, H. Sedghi, A. Bloniarz, N. W. Pierse, A. Briukhov, P. Mitrichev, A. Gergely, S. Zhan, A. Zhou, N. Saxena, E. Lu, J. Dean, A. Gupta, N. Perez-Nieves, R. Wu, C. Y. McLean, W. Liang, D. Jindal, A. Tsitsulin, W. Yu, K. Alarakyia, T. Schaul, P. Patil, P. Sung, E. Peake, H. Yu, F. M. P. Behbahani, J. Co-Reyes, A. Ansell, S. Sun, C. Barbu, J. Lee, S. Noury, J. U. Allingham, B. Piot, M. Sharma, C. Yew, I. 
Korotkov, B. Xu, D. Brady, G. Petrovic, S. Mourad, C. Cui, A. Gupta, P. Schuh, S. Khanna, A. Goldie, A. Arora, V. Zubov, A. Stuart, M. Epstein, Y. Zhu, J. Liu, Y. Stuken, Z. Wang, K. Misiunas, D. Guo, A. Gill, A. J. Hartman, Z. Nabulsi, A. Roy, A. Faust, J. Riesa, B. Withbroe, M. Wang, M. Tagliasacchi, A. Marzoca, J. Noraky, S. Toropov, M. Mehrotra, B. Raad, S. Deur, S. Xu, M. Monteiro, Z. Wu, Y. Luan, S. Ritter, N. Li, H. Garnes, Y. He, M. Zlocha, J. Zhu, M. Hessel, W. Wu, S. R. Babbula, C. Kawamoto, Y. Li, M. Hassen, Y. Wang, B. Wieder, J. Freedman, Y. Zhang, X. Bai, T. Yu, D. Reitter, X. Sheng, M. Wirth, A. Kini, D. Damen, M. Gao, R. Hornung, M. Voznesensky, B. Roark, A. Kuncoro, Y. Zhou, R. Shah, A. Brohan, K. Chen, J. G. Wendt, D. Rim, P. K. Rubenstein, J. J. Halcrow, M. Liu, T. Geri, Y. Sung, J. Shapiro, S. Bijwadia, C. Duvarney, C. Sorokin, P. Natsev, R. R. Ingle, P. Gupta, Y. Maeng, N. Ndebele, K. Zhu, V. Anklin, K. Lee, Y. Liu, Y. Akulov, S. Gupta, G. Su, F. Prost, T. Liu, V. Kovalev, P. Moreno, M. Scholz, S. Redmond, Z. Zhou, A. Castro-Ros, A. S. Pinto, D. Kharrat, M. Yarom, R. Saputro, J. Bulian, B. Caine, J. Liu, A. Abdolmaleki, S. Iqbal, T. Misiunas, M. Sirotenko, S. Garg, G. Bensky, H. Gui, X. Wang, R. Koster, M. Bernico, D. Huang, R. Thoppilan, T. Cohn, B. Golan, W. Zhou, A. Rosenberg, M. Freitag, T. Gangwani, V. Tsang, A. Shukla, X. Ren, M. Giang, C. Zou, A. Elisseeff, C. L. Lan, D. Dua, S. Lall, P. Shyam, F. Garcia, S. Nguyen, M. Guzman, A. Maschinot, M. Maggioni, M. Chang, K. Gregor, L. Weerts, K. Venkatesan, B. Damoc, L. Liu, J. Wassenberg, L. Ho, B. Roelofs, M. Hadian, F. Aubet, Y. Liang, S. Lachgar, D. Karmon, Y. Cheng, A. V’azquez-Reina, A. Chen, Z. Dai, A. Brock, S. Agrawal, C. Pang, P. Garst, M. Sanchez-Vargas, I. Rendulic, A. Ayyar, A. Ravznatovi’c, O. Ma, R. Vij, N. Sharma, A. Balakrishna, B. Liu, I. Mackinnon, S. Baltateanu, P. Poklukar, G. Ibagon, C. Ji, H. Jiao, I. Noble, W. Stokowiec, Z. Li, J. Dean, D. Lindner, M. Omernick, K. 
Chiafullo, M. Dimarco, V. Rodrigues, V. Selo, G. Honke, X. Wu, W. He, A. Hillier, A. Mohananey, V. Piratla, C. Ye, C. Malik, S. Riedel, S. Albanie, Z. Yang, K. K. Vassigh, M. Bauzá, S. Li, Y. Tao, N. Wichers, A. Maksai, A. Ittycheriah, R. Mcilroy, B. Seybold, N. Goodman, R. Datta, S. M. Hernandez, T. Shi, Y. Kochinski, A. Bulanova, K. Franko, M. Sazanovich, N. Fitzgerald, P. Kacham, S. Raghvendra, V. J. Hellendoorn, A. Grushetsky, J. Salazar, A. Lazaridou, J. Chang, J. Peter, S. Kafle, Y. Dauphin, A. Rao, F. Graziano, I. Shafran, Y. Liao, T. Ding, G. Yan, G. Chu, Z. Fu, V. Roulet, G. Rasskin, D. Williams, S. Drath, A. Mossin, R. Hoffmann, J. Orbay, F. Bertolini, H. Sheftel, J. Chiu, S. Xue, Y. Kuang, F. Naeem, S. Nath, N. B. Nti, P. Culliton, K. Krishnakumar, M. Isard, P. Sun, A. Chakrabarti, N. Clement, R. Cohen, A. Wongpanich, G. Oh, A. Murthy, H. Zheng, J. B. Hamrick, O. Bunyan, S. Ganesh, N. Gupta, R. Frostig, J. M. Wieting, Y. Malkov, P. Marcenac, Z. Lai, X. Tang, M. Saleh, F. Zubach, C. Kulkarni, H. Zhou, V. Zayats, N. Ding, A. Tripathi, A. Pramanik, P. Zochbauer, H. Ganapathy, V. Misra, Z. Behrman, H. Vallet, M. Zhang, M. Sridhar, Y. Jin, S. Gehman, M. Babaeizadeh, S. Põder, M. Goel, D. Jain, T. Nasir, S. Mittal, T. Dozat, D. Ardila, A. Severyn, F. Pardo, S. Jerome, S. Qin, L. Rouillard, A. Yazdanbakhsh, Z. Zhang, S. Agrawal, K. Shivakumar, C. Lu, P. Kallakuri, R. Chhaparia, K. Rao, C. Kwong, A. Fadeeva, S. Nigam, Y. Virin, Y. Zhang, B. Venkatraman, B. Gunel, M. Wilson, H. Wang, A. Gupta, A. Gupta, A. A. Taiga, K. Mohamed, D. Fritz, D. Rodriguez, Z. Ghahramani, H. Askham, L. Belenki, J. Zhao, R. Gupta, K. Jastrzkebski, T. Kosakai, K. Katircioglu, J. Schneider, R. Panigrahy, K. Bousmalis, P. Grabowski, P. Ramachandran, C. Hegde, M. Roșca, A. S. Scarpati, K. Axiotis, Y. Xu, Z. Gleicher, A. H. Michaely, M. Sharma, S. Jain, C. Hirnschall, T. Marian, X. Jia, K. Mather, K. Gupta, L. Qiu, N. Nayakanti, L. Ionita, S. Zheng, L. Loher, K. Shuster, I. Petrovski, R. 
Sharma, R. Chaabouni, A. Yeh, J. An, A. Gupta, S. Schwarcz, S. Ellis, S. Conway-Rahman, J. Snaider, A. Zhai, J. Atwood, D. Golovin, L. Peng, I. Te, V. Xia, S. Scellato, M. Malihi, A. Bravzinskas, V. Ion, Y. Jun, J. Swirhun, S. Mariooryad, J. Sun, S. Chien, R. Coaguila, A. Brand, Y. Gao, T. Kwiatkowski, R. Aharoni, C. Lee, M. vZani’c, Y. Zhang, D. Ethier, V. Nikolaev, P. A. Nair, Y. B. Shalom, H. Fitoussi, J. P. Gupta, H. Liu, D. Cattle, T. Bolukbasi, B. Murdoch, F. Huot, Y. Li, C. Hahn, U. Khandelwal, F. Benzing, A. Conmy, A. E. Simanovsky, F. Beaufays, E. Weinstein, T. Chen, L. Leonhard, and B. Ramabhadran (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. ArXiv abs/2507.06261. External Links: Link Cited by: §3.1, §5.1. DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Ding, H. Xin, H. Gao, H. Qu, H. Li, J. Guo, J. Li, J. Wang, J. Chen, J. Yuan, J. Qiu, J. Li, J. Cai, J. Ni, J. Liang, J. Chen, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Zhao, L. Wang, L. Zhang, L. Xu, L. Xia, M. Zhang, M. Zhang, M. Tang, M. Li, M. Wang, M. Li, N. Tian, P. Huang, P. Zhang, Q. Wang, Q. Chen, Q. Du, R. Ge, R. Zhang, R. Pan, R. Wang, R. J. Chen, R. Jin, R. Chen, S. Lu, S. Zhou, S. Chen, S. Ye, S. Wang, S. Yu, S. Zhou, S. Pan, S. S. Li, S. Zhou, S. Wu, T. Yun, T. Pei, T. Sun, T. Wang, W. Zeng, W. Zhao, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, W. Xiao, W. An, X. Liu, X. Wang, X. Chen, X. Nie, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yang, X. Li, X. Su, X. Lin, X. Q. Li, X. Jin, X. Shen, X. Chen, X. Sun, X. Wang, X. Song, X. Zhou, X. Wang, X. Shan, Y. K. Li, Y. Q. 
Wang, Y. X. Wei, Y. Zhang, Y. Xu, Y. Li, Y. Zhao, Y. Sun, Y. Wang, Y. Yu, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Ou, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Xiong, Y. Luo, Y. You, Y. Liu, Y. Zhou, Y. X. Zhu, Y. Huang, Y. Li, Y. Zheng, Y. Zhu, Y. Ma, Y. Tang, Y. Zha, Y. Yan, Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Xie, Z. Zhang, Z. Hao, Z. Ma, Z. Yan, Z. Wu, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Pan, Z. Huang, Z. Xu, Z. Zhang, and Z. Zhang (2025) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. ArXiv abs/2501.12948. External Links: Link Cited by: §2.3. C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, S. Guang, and H. Fan (2025) Emerging properties in unified multimodal pretraining. ArXiv abs/2505.14683. External Links: Link Cited by: §3.2, §5.1. R. Fang, C. Duan, K. Wang, L. Huang, H. Li, S. Yan, H. Tian, X. Zeng, R. Zhao, J. Dai, X. Liu, and H. Li (2025a) GoT: unleashing reasoning capability of multimodal large language model for visual generation and editing. ArXiv abs/2503.10639. External Links: Link Cited by: §2.2. R. Fang, A. Yu, C. Duan, L. Huang, S. Bai, Y. Cai, K. Wang, S. Liu, X. Liu, and H. Li (2025b) FLUX-reason-6m & prism-bench: a million-scale text-to-image reasoning dataset and comprehensive benchmark. ArXiv abs/2509.09680. External Links: Link Cited by: §2.2. G. T. W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y. Wang, Y. Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, J. Xu, J. Chen, J. Chen, J. Chen, J. Lin, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Zhang, Q. Zheng, S. Yang, S. Zhong, S. Huang, S. Zhao, S. Xue, S. Tu, S. Meng, T. Zhang, T. Luo, T. Hao, W. Li, W. Jia, X. Lyu, X. Huang, Y. Wang, Y. Xue, Y. Wang, Y. An, Y. Du, Y. Shi, Y. Huang, Y. Niu, Y. Wang, Y. Yue, Y. Li, Y. Zhang, Y. Zhang, Z. Du, Z. Hou, Z. 
Xue, Z. Du, Z. Wang, P. Zhang, D. Liu, B. Xu, J. Li, M. Huang, Y. Dong, and J. Tang (2025a) GLM-4.5v and glm-4.1v-thinking: towards versatile multimodal reasoning with scalable reinforcement learning. External Links: Link Cited by: §5.1. J. Hong, C. Zhao, C. Zhu, W. Lu, G. Xu, and X. Yu (2025b) DeepEyesV2: toward agentic multimodal model. External Links: 2511.05271, Link Cited by: §2.3. Y. Hu, W. Shi, X. Fu, D. Roth, M. Ostendorf, L. S. Zettlemoyer, N. A. Smith, and R. Krishna (2024) Visual sketchpad: sketching as a visual chain of thought for multimodal language models. ArXiv abs/2406.09403. External Links: Link Cited by: §1, §2.2. O. A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, A. Mkadry, A. Baker-Whitcomb, A. Beutel, A. Borzunov, A. Carney, A. Chow, A. Kirillov, A. Nichol, A. Paino, A. Renzin, A. Passos, A. Kirillov, A. Christakis, A. Conneau, A. Kamali, A. Jabri, A. Moyer, A. Tam, A. Crookes, A. Tootoochian, A. Tootoonchian, A. Kumar, A. Vallone, A. Karpathy, A. Braunstein, A. Cann, A. Codispoti, A. Galu, A. Kondrich, A. Tulloch, A. Mishchenko, A. Baek, A. Jiang, A. Pelisse, A. Woodford, A. Gosalia, A. Dhar, A. Pantuliano, A. Nayak, A. Oliver, B. Zoph, B. Ghorbani, B. Leimberger, B. Rossen, B. Sokolowsky, B. Wang, B. Zweig, B. Hoover, B. Samic, B. McGrew, B. Spero, B. Giertler, B. Cheng, B. Lightcap, B. Walkin, B. Quinn, B. Guarraci, B. Hsu, B. Kellogg, B. Eastman, C. Lugaresi, C. L. Wainwright, C. Bassin, C. Hudson, C. Chu, C. Nelson, C. Li, C. J. Shern, C. Conger, C. Barette, C. Voss, C. Ding, C. Lu, C. Zhang, C. Beaumont, C. Hallacy, C. Koch, C. Gibson, C. Kim, C. Choi, C. McLeavey, C. Hesse, C. Fischer, C. Winter, C. Czarnecki, C. Jarvis, C. Wei, C. Koumouzelis, D. Sherburn, D. Kappler, D. Levin, D. Levy, D. Carr, D. Farhi, D. Mély, D. Robinson, D. Sasaki, D. Jin, D. Valladares, D. Tsipras, D. Li, P. D. Nguyen, D. Findlay, E. Oiwoh, E. Wong, E. Asdar, E. Proehl, E. Yang, E. Antonow, E. 
Kramer, E. Peterson, E. Sigler, E. Wallace, E. Brevdo, E. Mays, F. Khorasani, F. P. Such, F. Raso, F. Zhang, F. von Lohmann, F. Sulit, G. Goh, G. Oden, G. Salmon, G. Starace, G. Brockman, H. Salman, H. Bao, H. Hu, H. Wong, H. Wang, H. Schmidt, H. Whitney, H. Jun, H. Kirchner, H. P. de Oliveira Pinto, H. Ren, H. Chang, H. W. Chung, I. Kivlichan, I. O’Connell, I. Osband, I. Silber, I. Sohl, I. Okuyucu, I. Lan, I. Kostrikov, I. Sutskever, I. Kanitscheider, I. Gulrajani, J. Coxon, J. Menick, J. W. Pachocki, J. Aung, J. Betker, J. Crooks, J. Lennon, J. R. Kiros, J. Leike, J. Park, J. Kwon, J. Phang, J. Teplitz, J. Wei, J. Wolfe, J. Chen, J. Harris, J. Varavva, J. G. Lee, J. Shieh, J. Lin, J. Yu, J. Weng, J. Tang, J. Yu, J. Jang, J. Q. Candela, J. Beutler, J. Landers, J. Parish, J. Heidecke, J. Schulman, J. Lachman, J. McKay, J. Uesato, J. Ward, J. W. Kim, J. Huizinga, J. Sitkin, J. Kraaijeveld, J. Gross, J. Kaplan, J. Snyder, J. Achiam, J. Jiao, J. Lee, J. Zhuang, J. Harriman, K. Fricke, K. Hayashi, K. Singhal, K. Shi, K. Karthik, K. Wood, K. Rimbach, K. Hsu, K. Nguyen, K. Gu-Lemberg, K. Button, K. Liu, K. Howe, K. Muthukumar, K. Luther, L. Ahmad, L. Kai, L. Itow, L. Workman, L. Pathak, L. Chen, L. Jing, L. Guy, L. Fedus, L. Zhou, L. Mamitsuka, L. Weng, L. McCallum, L. Held, O. Long, L. Feuvrier, L. Zhang, L. Kondraciuk, L. Kaiser, L. Hewitt, L. Metz, L. Doshi, M. Aflak, M. Simens, M. Boyd, M. Thompson, M. Dukhan, M. Chen, M. Gray, M. Hudnall, M. Zhang, M. Aljubeh, M. Litwin, M. Zeng, M. Johnson, M. Shetty, M. Gupta, M. Shah, M. A. Yatbaz, M. Yang, M. Zhong, M. Glaese, M. Chen, M. Janner, M. Lampe, M. Petrov, M. Wu, M. Wang, M. Fradin, M. Pokrass, M. Castro, M. Castro, M. Pavlov, M. Brundage, M. Wang, M. Khan, M. Murati, M. Bavarian, M. Lin, M. Yesildal, N. Soto, N. Gimelshein, N. Cone, N. Staudacher, N. Summers, N. LaFontaine, N. Chowdhury, N. Ryder, N. Stathas, N. Turley, N. A. Tezak, N. Felix, N. Kudige, N. S. Keskar, N. Deutsch, N. Bundick, N. Puckett, O. Nachum, O. 
Okelola, O. Boiko, O. Murk, O. Jaffe, O. Watkins, O. Godement, O. Campbell-Moore, P. Chao, P. McMillan, P. Belov, P. Su, P. Bak, P. Bakkum, P. Deng, P. Dolan, P. Hoeschele, P. Welinder, P. Tillet, P. Pronin, P. Tillet, P. Dhariwal, Q. Yuan, R. Dias, R. Lim, R. Arora, R. Troll, R. Lin, R. G. Lopes, R. Puri, R. Miyara, R. H. Leike, R. Gaubert, R. Zamani, R. Wang, R. Donnelly, R. Honsby, R. Smith, R. Sahai, R. Ramchandani, R. Huet, R. Carmichael, R. Zellers, R. Chen, R. Chen, R. R. Nigmatullin, R. Cheu, S. Jain, S. Altman, S. Schoenholz, S. Toizer, S. Miserendino, S. Agarwal, S. Culver, S. Ethersmith, S. Gray, S. Grove, S. Metzger, S. Hermani, S. Jain, S. Zhao, S. Wu, S. Jomoto, S. Wu, S. Xia, S. Phene, S. Papay, S. Narayanan, S. Coffey, S. Lee, S. Hall, S. Balaji, T. Broda, T. Stramer, T. Xu, T. Gogineni, T. Christianson, T. Sanders, T. Patwardhan, T. Cunninghman, T. Degry, T. Dimson, T. Raoux, T. Shadwell, T. Zheng, T. Underwood, T. Markov, T. Sherbakov, T. Rubin, T. Stasi, T. Kaftan, T. Heywood, T. Peterson, T. Walters, T. Eloundou, V. Qi, V. Moeller, V. Monaco, V. Kuo, V. Fomenko, W. Chang, W. Zheng, W. Zhou, W. Manassra, W. Sheu, W. Zaremba, Y. Patil, Y. Qian, Y. Kim, Y. Cheng, Y. Zhang, Y. He, Y. Zhang, Y. Jin, Y. Dai, and Y. Malkov (2024) GPT-4o system card. ArXiv abs/2410.21276. External Links: Link Cited by: §5.1. M. Kazemi, H. Alvari, A. Anand, J. Wu, X. Chen, and R. Soricut (2023) GeomVerse: a systematic evaluation of large models for geometric reasoning. ArXiv abs/2312.12241. External Links: Link Cited by: §1, §2.1, §5.2. A. Li, C. L. Wang, K. Yue, Z. Cai, O. Liu, D. Fu, P. Guo, W. B. Zhu, V. Sharan, R. Jia, W. Neiswanger, F. Huang, T. Goldstein, and M. Goldblum (2025a) Zebra-cot: a dataset for interleaved vision language reasoning. ArXiv abs/2507.16746. External Links: Link Cited by: §1, §2.2, §5.1. C. Li, W. Wu, H. Zhang, Y. Xia, S. Mao, L. Dong, I. Vuli’c, and F. Wei (2025b) Imagine while reasoning in space: multimodal visualization-of-thought. 
ArXiv abs/2501.07542. External Links: Link Cited by: §1. X. Li, H. Zou, and P. Liu (2025c) ToRL: scaling tool-integrated rl. ArXiv abs/2503.23383. External Links: Link Cited by: §2.3, §5.2. P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K. Chang, M. Galley, and J. Gao (2023) MathVista: evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations, External Links: Link Cited by: §2.1. P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S. Zhu (2021) Inter-gps: interpretable geometry problem solving with formal language and symbolic reasoning. In Annual Meeting of the Association for Computational Linguistics, External Links: Link Cited by: §1, §2.1, §5.2. P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu, O. Tafjord, P. Clark, and A. Kalyan (2022) Learn to explain: multimodal reasoning via thought chains for science question answering. ArXiv abs/2209.09513. External Links: Link Cited by: §2.1. F. Meng, L. Du, Z. Liu, Z. Zhou, Q. Lu, D. Fu, T. Han, B. Shi, W. Wang, J. He, K. Zhang, P. Luo, Y. Qiao, Q. Zhang, and W. Shao (2025) M-eureka: exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. External Links: Link Cited by: §2.3. OpenAI (2025) GPT-5 system card. System Card OpenAI. External Links: Link Cited by: §5.1. Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. ArXiv abs/2402.03300. External Links: Link Cited by: §1, §2.3, §5.2. A. Sharma, A. Dalmia, M. Kazemi, A. Zouaq, and C. Pal (2024) GeoCoder: solving geometry problems by generating modular code through vision-language models. ArXiv abs/2410.13510. External Links: Link Cited by: §1. W. Shi, A. Yu, R. Fang, H. Ren, K. Wang, A. Zhou, C. Tian, X. Fu, Y. Hu, Z. Lu, L. Huang, S. Liu, R. Liu, and H. 
Li (2025) MathCanvas: intrinsic visual chain-of-thought for multimodal mathematical reasoning. ArXiv abs/2510.14958. External Links: Link Cited by: Table 6, Appendix C, §1, §2.1, §2.2, §3.1, §5.1. C. Team, M. Chen, J. Kahn, and S. Li (2024) Chameleon: mixed-modal early-fusion foundation models. ArXiv abs/2405.09818. External Links: Link Cited by: §3.2. G. Team, R. Anil, S. Borgeaud, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, C. Meyer, E. Rutherford, E. Moreira, K. Ayoub, M. Goel, J. Krawczyk, C. Du, E. Chi, H. Cheng, E. Ni, P. Shah, P. Kane, B. Chan, M. Faruqui, A. Severyn, H. Lin, Y. Li, Y. Cheng, A. Ittycheriah, M. Mahdieh, M. Chen, P. Sun, D. Tran, S. Bagri, B. Lakshminarayanan, J. Liu, A. Orban, F. Güra, H. Zhou, X. Song, A. Boffy, H. Ganapathy, S. Zheng, H. Choe, Á. Weisz, T. Zhu, Y. Lu, S. Gopal, J. Kahn, M. Kula, J. Pitman, R. Shah, E. Taropa, M. A. Merey, M. Baeuml, Z. Chen, L. E. Shafey, Y. Zhang, O. Sercinoglu, G. Tucker, E. Piqueras, M. Krikun, I. Barr, N. Savinov, I. Danihelka, B. Roelofs, A. White, A. Andreassen, T. von Glehn, L. Yagati, M. Kazemi, L. Gonzalez, M. Khalman, J. Sygnowski, A. Frechette, C. Smith, L. Culp, L. Proleev, Y. Luan, X. Chen, J. Lottes, N. Schucher, F. Lebron, A. Rrustemi, N. Clay, P. Crone, T. Kocisky, J. Zhao, B. Perz, D. Yu, H. Howard, A. Bloniarz, J. W. Rae, H. Lu, L. Sifre, M. Maggioni, F. Alcober, D. Garrette, M. Barnes, S. Thakoor, J. Austin, G. Barth-Maron, W. Wong, R. Joshi, R. Chaabouni, D. Fatiha, A. Ahuja, G. S. Tomar, E. Senter, M. Chadwick, I. Kornakov, N. Attaluri, I. Iturrate, R. Liu, Y. Li, S. Cogan, J. Chen, C. Jia, C. Gu, Q. Zhang, J. Grimstad, A. J. Hartman, X. Garcia, T. S. Pillai, J. Devlin, M. Laskin, D. de Las Casas, D. Valter, C. Tao, L. 
Blanco, A. P. Badia, D. Reitter, M. Chen, J. Brennan, C. Rivera, S. Brin, S. Iqbal, G. Surita, J. Labanowski, A. Rao, S. Winkler, E. Parisotto, Y. Gu, K. Olszewska, R. Addanki, A. Miech, A. Louis, D. Teplyashin, G. Brown, E. Catt, J. Balaguer, J. Xiang, P. Wang, Z. Ashwood, A. Briukhov, A. Webson, S. Ganapathy, S. Sanghavi, A. Kannan, M. Chang, A. Stjerngren, J. Djolonga, Y. Sun, A. Bapna, M. Aitchison, P. Pejman, H. Michalewski, T. Yu, C. Wang, J. Love, J. Ahn, D. Bloxwich, K. Han, P. Humphreys, T. Sellam, J. Bradbury, V. Godbole, S. Samangooei, B. Damoc, A. Kaskasoli, S. M. R. Arnold, V. Vasudevan, S. Agrawal, J. Riesa, D. Lepikhin, R. Tanburn, S. Srinivasan, H. Lim, S. Hodkinson, P. Shyam, J. Ferret, S. Hand, A. Garg, T. L. Paine, J. Li, Y. Li, M. Giang, A. Neitz, Z. Abbas, S. York, M. Reid, E. Cole, A. Chowdhery, D. Das, D. Rogozińska, V. Nikolaev, P. Sprechmann, Z. Nado, L. Zilka, F. Prost, L. He, M. Monteiro, G. Mishra, C. Welty, J. Newlan, D. Jia, M. Allamanis, C. H. Hu, R. de Liedekerke, J. Gilmer, C. Saroufim, S. Rijhwani, S. Hou, D. Shrivastava, A. Baddepudi, A. Goldin, A. Ozturel, A. Cassirer, Y. Xu, D. Sohn, D. Sachan, R. K. Amplayo, C. Swanson, D. Petrova, S. Narayan, A. Guez, S. Brahma, J. Landon, M. Patel, R. Zhao, K. Villela, L. Wang, W. Jia, M. Rahtz, M. Giménez, L. Yeung, J. Keeling, P. Georgiev, D. Mincu, B. Wu, S. Haykal, R. Saputro, K. Vodrahalli, J. Qin, Z. Cankara, A. Sharma, N. Fernando, W. Hawkins, B. Neyshabur, S. Kim, A. Hutter, P. Agrawal, A. Castro-Ros, G. van den Driessche, T. Wang, F. Yang, S. Chang, P. Komarek, R. McIlroy, M. Lučić, G. Zhang, W. Farhan, M. Sharman, P. Natsev, P. Michel, Y. Bansal, S. Qiao, K. Cao, S. Shakeri, C. Butterfield, J. Chung, P. K. Rubenstein, S. Agrawal, A. Mensch, K. Soparkar, K. Lenc, T. Chung, A. Pope, L. Maggiore, J. Kay, P. Jhakra, S. Wang, J. Maynez, M. Phuong, T. Tobin, A. Tacchetti, M. Trebacz, K. Robinson, Y. Katariya, S. Riedel, P. Bailey, K. Xiao, N. Ghelani, L. Aroyo, A. Slone, N. Houlsby, X. 
Xiong, Z. Yang, E. Gribovskaya, J. Adler, M. Wirth, L. Lee, M. Li, T. Kagohara, J. Pavagadhi, S. Bridgers, A. Bortsova, S. Ghemawat, Z. Ahmed, T. Liu, R. Powell, V. Bolina, M. Iinuma, P. Zablotskaia, J. Besley, D. Chung, T. Dozat, R. Comanescu, X. Si, J. Greer, G. Su, M. Polacek, R. L. Kaufman, S. Tokumine, H. Hu, E. Buchatskaya, Y. Miao, M. Elhawaty, A. Siddhant, N. Tomasev, J. Xing, C. Greer, H. Miller, S. Ashraf, A. Roy, Z. Zhang, A. Ma, A. Filos, M. Besta, R. Blevins, T. Klimenko, C. Yeh, S. Changpinyo, J. Mu, O. Chang, M. Pajarskas, C. Muir, V. Cohen, C. L. Lan, K. Haridasan, A. Marathe, S. Hansen, S. Douglas, R. Samuel, M. Wang, S. Austin, C. Lan, J. Jiang, J. Chiu, J. A. Lorenzo, L. L. Sjösund, S. Cevey, Z. Gleicher, T. Avrahami, A. Boral, H. Srinivasan, V. Selo, R. May, K. Aisopos, L. Hussenot, L. B. Soares, K. Baumli, M. B. Chang, A. Recasens, B. Caine, A. Pritzel, F. Pavetic, F. Pardo, A. Gergely, J. Frye, V. Ramasesh, D. Horgan, K. Badola, N. Kassner, S. Roy, E. Dyer, V. C. Campos, A. Tomala, Y. Tang, D. E. Badawy, E. White, B. Mustafa, O. Lang, A. Jindal, S. Vikram, Z. Gong, S. Caelles, R. Hemsley, G. Thornton, F. Feng, W. Stokowiec, C. Zheng, P. Thacker, Ç. Ünlü, Z. Zhang, M. Saleh, J. Svensson, M. Bileschi, P. Patil, A. Anand, R. Ring, K. Tsihlas, A. Vezer, M. Selvi, T. Shevlane, M. Rodriguez, T. Kwiatkowski, S. Daruki, K. Rong, A. Dafoe, N. FitzGerald, K. Gu-Lemberg, M. Khan, L. A. Hendricks, M. Pellat, V. Feinberg, J. Cobon-Kerr, T. Sainath, M. Rauh, S. H. Hashemi, R. Ives, Y. Hasson, E. Noland, Y. Cao, N. Byrd, L. Hou, Q. Wang, T. Sottiaux, M. Paganini, J. Lespiau, A. Moufarek, S. Hassan, K. Shivakumar, J. van Amersfoort, A. Mandhane, P. Joshi, A. Goyal, M. Tung, A. Brock, H. Sheahan, V. Misra, C. Li, N. Rakićević, M. Dehghani, F. Liu, S. Mittal, J. Oh, S. Noury, E. Sezener, F. Huot, M. Lamm, N. D. Cao, C. Chen, S. Mudgal, R. Stella, K. Brooks, G. Vasudevan, C. Liu, M. Chain, N. Melinkeri, A. Cohen, V. Wang, K. Seymore, S. Zubkov, R. Goel, S. 
Yue, S. Krishnakumaran, B. Albert, N. Hurley, M. Sano, A. Mohananey, J. Joughin, E. Filonov, T. Kępa, Y. Eldawy, J. Lim, R. Rishi, S. Badiezadegan, T. Bos, J. Chang, S. Jain, S. G. S. Padmanabhan, S. Puttagunta, K. Krishna, L. Baker, N. Kalb, V. Bedapudi, A. Kurzrok, S. Lei, A. Yu, O. Litvin, X. Zhou, Z. Wu, S. Sobell, A. Siciliano, A. Papir, R. Neale, J. Bragagnolo, T. Toor, T. Chen, V. Anklin, F. Wang, R. Feng, M. Gholami, K. Ling, L. Liu, J. Walter, H. Moghaddam, A. Kishore, J. Adamek, T. Mercado, J. Mallinson, S. Wandekar, S. Cagle, E. Ofek, G. Garrido, C. Lombriser, M. Mukha, B. Sun, H. R. Mohammad, J. Matak, Y. Qian, V. Peswani, P. Janus, Q. Yuan, L. Schelin, O. David, A. Garg, Y. He, O. Duzhyi, A. Älgmyr, T. Lottaz, Q. Li, V. Yadav, L. Xu, A. Chinien, R. Shivanna, A. Chuklin, J. Li, C. Spadine, T. Wolfe, K. Mohamed, S. Das, Z. Dai, K. He, D. von Dincklage, S. Upadhyay, A. Maurya, L. Chi, S. Krause, K. Salama, P. G. Rabinovitch, P. K. R. M, A. Selvan, M. Dektiarev, G. Ghiasi, E. Guven, H. Gupta, B. Liu, D. Sharma, I. H. Shtacher, S. Paul, O. Akerlund, F. Aubet, T. Huang, C. Zhu, E. Zhu, E. Teixeira, M. Fritze, F. Bertolini, L. Marinescu, M. Bölle, D. Paulus, K. Gupta, T. Latkar, M. Chang, J. Sanders, R. Wilson, X. Wu, Y. Tan, L. N. Thiet, T. Doshi, S. Lall, S. Mishra, W. Chen, T. Luong, S. Benjamin, J. Lee, E. Andrejczuk, D. Rabiej, V. Ranjan, K. Styrc, P. Yin, J. Simon, M. R. Harriott, M. Bansal, A. Robsky, G. Bacon, D. Greene, D. Mirylenka, C. Zhou, O. Sarvana, A. Goyal, S. Andermatt, P. Siegler, B. Horn, A. Israel, F. Pongetti, C. ". Chen, M. Selvatici, P. Silva, K. Wang, J. Tolins, K. Guu, R. Yogev, X. Cai, A. Agostini, M. Shah, H. Nguyen, N. Ó. Donnaile, S. Pereira, L. Friso, A. Stambler, A. Kurzrok, C. Kuang, Y. Romanikhin, M. Geller, Z. Yan, K. Jang, C. Lee, W. Fica, E. Malmi, Q. Tan, D. Banica, D. Balle, R. Pham, Y. Huang, D. Avram, H. Shi, J. Singh, C. Hidey, N. Ahuja, P. Saxena, D. Dooley, S. P. Potharaju, E. O’Neill, A. Gokulchandran, R. Foley, K. 
Appendix A: Qualitative Analysis of Attentional Patterns

To complement the quantitative results in Section 3.2, we present the attentional patterns of Qwen2.5-VL-7B-Instruct for two representative examples before and after visual saliency enhancement (bold red auxiliary lines). No significant attentional shift toward the red-highlighted auxiliary elements is observed in either case. However, in both examples, higher attention scores appear on the alphabetic labels of the points involved in the auxiliary-line construction (e.g., G, M).

Figure 5: Attentional heatmaps of Qwen2.5-VL-7B-Instruct (panels a–d: Example 1 before/after enhancement, Example 2 before/after enhancement). Warmer colors indicate higher attention levels.

These observations suggest that visual enhancement may modulate attention via the textual point labels rather than via the auxiliary lines themselves. The mechanism through which a minimal set of visually enhanced samples yields such substantial accuracy gains remains to be explored.

Appendix B: Detailed Statistics of GeoAux-Bench

Table 5: Detailed composition of GeoAux-Bench, broken down by subset source, difficulty level, and data modality.

| Statistic | Core: Curriculum Sr. | Core: Curriculum Jr. | Core: Olympiad Sr. | Core: Olympiad Jr. | Core Subtotal | GeoAux-Canvas | Total |
|---|---|---|---|---|---|---|---|
| Problems | 516 | 722 | 234 | 207 | 1,679 | 2,655 | 4,334 |
| Questions | 647 | 1,288 | 237 | 207 | 2,379 | 4,144 | 6,523 |
| Problem Diagrams | 422 | 1,097 | 207 | 206 | 1,932 | 1,693 | 3,625 |
| Solution Diagrams | 505 | 892 | 205 | 199 | 1,801 | 3,044 | 4,845 |

Table 5 presents the comprehensive statistical breakdown of the GeoAux-Bench dataset. The benchmark is composed of two primary subsets:

1. GeoAux-Core: This expert-curated subset (1,679 problems) is stratified by difficulty source, distinguishing standard Curriculum problems from Olympiad-level competitions across both Senior and Junior grades.

2.
GeoAux-Canvas: This subset comprises 2,655 scale-augmented samples adapted from MathCanvas-Bench to enhance domain diversity.

We report metrics across four dimensions: the number of distinct Problems, total Questions (including sub-questions), original Problem Diagrams, and generated Solution Diagrams (visual auxiliary lines).

Appendix C: Results on GeoAux-Bench-Canvas

In this section, we present supplementary evaluation results on the GeoAux-Bench-Canvas subset. The performance metrics for the evaluated models are sourced directly from Shi et al. (2025). Detailed breakdowns across geometric sub-domains are provided in Table 6.

Table 6: Detailed performance breakdown on the GeoAux-Bench-Canvas subset. Results are cited from Shi et al. (2025). Best scores in the closed- and open-source categories are highlighted in the original table.

| Model | Think | Alg. (Algebra) | Ana. Geom. | Calc. (Vector) | Plane Geom. | Solid Geom. | Stats. | Trans. Geom. | Trig. | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source MLLMs | | | | | | | | | | |
| Gemini-2.5-Pro | ✓ | 68.0 | 59.2 | 60.2 | 54.8 | 48.7 | 64.5 | 58.5 | 69.9 | 47.9 |
| Gemini-2.5-Flash | ✓ | 63.2 | 56.5 | 54.6 | 40.7 | 40.7 | 61.1 | 46.8 | 64.6 | 39.3 |
| Gemini-2.0-Flash | ✗ | 39.1 | 32.6 | 38.9 | 31.1 | 25.6 | 51.4 | 28.1 | 38.0 | 21.2 |
| GPT-4.1 | ✗ | 40.4 | 30.7 | 37.1 | 24.1 | 25.1 | 54.0 | 21.5 | 42.5 | 19.0 |
| GPT-4.1-mini | ✗ | 35.7 | 30.5 | 36.5 | 22.0 | 22.4 | 24.8 | 19.7 | 30.3 | 14.6 |
| GPT-4o | ✗ | 21.6 | 17.7 | 21.8 | 19.5 | 18.6 | 17.4 | 13.2 | 23.0 | 9.9 |
| GPT-5 | ✓ | 68.7 | 55.5 | 64.2 | 45.6 | 36.1 | 64.5 | 42.7 | 66.5 | 43.5 |
| Claude-Sonnet-4 | ✓ | 44.8 | 38.9 | 49.3 | 33.8 | 33.0 | 46.9 | 30.3 | 47.6 | 25.0 |
| Seed-1.6-Thinking | ✓ | 67.7 | 57.5 | 55.9 | 52.2 | 45.0 | 65.1 | 56.8 | 60.7 | 44.1 |
| Qwen3-VL-Plus | ✓ | 67.0 | 54.6 | 56.9 | 45.9 | 42.0 | 66.7 | 49.3 | 58.9 | 40.9 |
| Nano-Banana | ✗ | 55.4 | 50.2 | 51.8 | 34.5 | 36.6 | 56.7 | 39.4 | 60.4 | 33.2 |
| Open-source MLLMs | | | | | | | | | | |
| Qwen-2.5-VL-7B | ✗ | 19.5 | 19.0 | 19.2 | 20.6 | 18.7 | 10.7 | 13.9 | 15.0 | 8.9 |
| Qwen-2.5-VL-32B | ✗ | 29.8 | 27.4 | 27.8 | 27.4 | 27.2 | 27.9 | 20.1 | 30.5 | 15.4 |
| Qwen-2.5-VL-72B | ✗ | 30.6 | 19.5 | 36.4 | 34.5 | 33.5 | 23.9 | 33.6 | 48.9 | 21.1 |
| Gemma-3-27b-it | ✗ | 31.3 | 28.4 | 34.4 | 25.8 | 21.0 | 40.0 | 21.0 | 26.9 | 15.8 |
| InternVL3.5-8B | ✗ | 32.3 | 33.8 | 33.8 | 24.2 | 26.9 | 43.7 | 16.2 | 14.9 | 16.7 |
| InternVL3.5-30B | ✗ | 22.2 | 19.9 | 15.1 | 24.9 | 24.3 | 22.1 | 17.4 | 18.4 | 11.7 |
| Keye-VL-1.5-8B | ✓ | 33.1 | 28.0 | 26.2 | 27.0 | 23.6 | 29.5 | 20.9 | 26.3 | 17.1 |
| Unified MLLMs | | | | | | | | | | |
| BAGEL-7B-MoT | ✗ | 18.1 | 13.1 | 17.1 | 20.8 | 23.0 | 10.9 | 19.4 | 13.3 | 8.3 |
| BAGEL-Zebra-CoT | ✗ | 18.0 | 15.1 | 15.6 | 18.0 | 16.8 | 20.8 | 11.1 | 14.1 | 8.0 |
| MathCanvas-7B | ✗ | 29.9 | 27.2 | 17.9 | 40.0 | 35.3 | 23.2 | 29.3 | 40.4 | 21.9 |

Appendix D: Case Study

D.1 Case of Visual Hallucinations

We present representative failure cases of native unified MLLMs (e.g., MathCanvas-7B) in Figure 6. Because these models lack the pixel-level control required to produce accurate diagrams, they exhibit a Visual-Logic Mismatch: the generated diagram contains severe distortions (e.g., curved lines, missed intersections), causing the model to hallucinate non-existent geometric properties from the flawed visual feedback, which ultimately derails the reasoning process.

Figure 6: Failure cases of MathCanvas-7B exhibiting visual hallucinations.

D.2 Case of the Analytic Shortcut

We present failure cases demonstrating the model's tendency to bypass geometric intuition in favor of algebraic brute force, which we term the Analytic Shortcut. Instead of employing geometric theorems or auxiliary lines, the model habitually sets up coordinate systems and solves problems via complex equations. As discussed in the main text, while this approach is viable for senior high school problems, it is often inefficient and fails to capture the purely geometric logic required by junior geometry tasks.

Figure 7: An example where the model forces a coordinate system onto a geometry problem, leading to unnecessary complexity.

Appendix E: Prompt Templates

We present the prompt templates used in Tri-Partition Sampling (Section 4.3) and in evaluation. Figures 8 and 9 show the templates for the Natural/Mandatory (O/O⁺) and Prohibited (O⁻) subsets, respectively. Figure 10 shows the template for visual re-prompting. Figures 11 and 12 display the prompts for the answer judge and the auxiliary verifier.

Figure 8: Standard prompt template (used for the O and O⁺ subsets and for evaluation).
Figure 9: Prompt template for the Prohibited subset (O⁻).

Figure 10: Prompt template for visual-injected re-prompting.

Figure 11: Prompt template for correctness judgment (used in evaluation and reward calculation).

Figure 12: Prompt template for auxiliary construction verification.

Appendix F: Implementation Details

F.1 Training Hyperparameters

Table 7 details the hyperparameters used during the Supervised Fine-Tuning (SFT) warm-up phase and the subsequent A2PO training phase. We use Qwen2.5-VL-7B-Instruct as the backbone. In the SFT stage, we freeze the vision tower and the multimodal projector, updating only the language model. In the A2PO stage, we continue to freeze the vision tower to maintain visual feature stability.

Table 7: Training hyperparameters for the SFT warm-up stage and the A2PO RL stage.

| Hyperparameter | SFT (Warm-up) | A2PO (RL) |
|---|---|---|
| Base Model | Qwen2.5-VL-7B-Instruct | Qwen2.5-VL-7B-Instruct |
| Precision | bfloat16 | bfloat16 |
| Optimizer | AdamW | AdamW |
| Learning Rate | 5e-5 | 1e-6 |
| LR Scheduler | Cosine | Constant |
| Warm-up Ratio | 0.1 | 0.0 |
| Global Batch Size | 32 | 24 |
| Gradient Accumulation | 4 | – |
| Max Sequence Length | 8,192 | 8,192 |
| Image Resolution (min/max) | 262,144 / 262,144 | 262,144 / 262,144 |
| Epochs / Steps | 5 epochs | 650 steps |
| KL Coefficient β (GRPO) | – | 0.01 |
| Rollout Batch Size (GRPO) | – | 72 |
| Generations per Prompt N (GRPO) | – | 8 |

F.2 A2PO Algorithm Coefficients

Table 8 lists the weighting coefficients and thresholds used in the reward shaping mechanism (Section 4.4).

Table 8: A2PO reward coefficients.

| Component | Symbol | Value |
|---|---|---|
| Accuracy Weight | w_acc | 0.70 |
| Format Weight | w_fmt | 0.00 |
| Timing Weight | w_time | 0.15 |
| Quality Weight | w_qual | 0.15 |
| Timing Significance | τ | 0.15 |
| PPL Tolerance | δ | 0.01 |

F.3 Inference and Sampling Configuration

Table 9 details the generation parameters. We use a strong external model as both the auxiliary verifier and the answer judge to ensure the quality of the retrieved auxiliary diagrams and the correctness of final answers.
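As a rough illustration of how the Table 8 coefficients interact, the sketch below combines the four reward components as a plain weighted sum and gates the timing term on a perplexity drop. This is a hedged approximation: the function and argument names (`shaped_reward`, `timing_component`) are our own, and the actual A2PO shaping additionally relies on the counterfactual sampling described in Section 4.4.

```python
# Hedged sketch: combining the Table 8 coefficients into a scalar reward.
# The real A2PO shaping uses counterfactual sampling; this only shows how
# the published weights and thresholds could plausibly interact.
W_ACC, W_FMT, W_TIME, W_QUAL = 0.70, 0.00, 0.15, 0.15
TAU = 0.15    # timing-significance threshold (tau)
DELTA = 0.01  # perplexity tolerance margin (delta)

def shaped_reward(acc: float, fmt: float, timing: float, quality: float) -> float:
    """Weighted sum of the four reward components listed in Table 8."""
    return W_ACC * acc + W_FMT * fmt + W_TIME * timing + W_QUAL * quality

def timing_component(ppl_with_aux: float, ppl_without_aux: float) -> float:
    """Credit a construction only if it reduces reasoning perplexity by more
    than the tolerance delta; saturate the credit once the drop exceeds tau."""
    drop = ppl_without_aux - ppl_with_aux
    if drop <= DELTA:
        return 0.0  # redundant construction: no timing credit
    return min(drop / TAU, 1.0)
```

Under these values a fully correct rollout with a significant, high-quality construction scores 1.0, while a redundant construction contributes nothing to the timing term.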
Table 9: Generation and verifier configurations, covering training rollouts, the auxiliary verifier, and final evaluation.

| Parameter | Training Rollout | Evaluation |
|---|---|---|
| Temperature | 1.0 | 0.0 (Greedy) |
| Top-p | 1.0 | 0.0 |
| Repetition Penalty | 1.05 | 1.08 |
| Max New Tokens | 8,192 | 8,192 |

Accuracy Reward Judge: Qwen3-30B-A3B-Thinking-2507, temperature 0.0 (greedy), max tokens 24,576.

Aux Verifier: Qwen3-30B-A3B-Thinking-2507, temperature 0.0 (greedy), max tokens 8,192.

F.4 Dataset Composition and Splits

To ensure rigorous evaluation, we strictly enforce decontamination between training and evaluation sets. Table 10 summarizes the data sources and statistics across the SFT warm-up, A2PO training, and evaluation phases.

Training Data Construction. The SFT dataset is constructed with a multi-constraint instruction tuning strategy to initialize controllability. We augment training samples by creating diverse prompt-response pairs: (1) standard prompts paired with solutions containing auxiliary constructions, and (2) prohibited prompts paired with solutions strictly devoid of auxiliary lines. For the A2PO stage, we employ a marginal solvability filtering strategy: we perform 10 inference rollouts per problem with the base model and retain only samples that exhibit mixed outcomes (i.e., containing both correct and incorrect responses), explicitly discarding trivial (100% correct) and impossible (0% correct) instances to maximize gradient efficiency.

Evaluation Data. All evaluation results reported in the main paper are based on held-out test sets. For GeoAux-Bench, we reserve a fixed test split that is strictly excluded from all training stages.

Table 10: Data statistics and splits. Detailed breakdown of sample counts. Note that for the SFT and RL phases, we use subsets drawn from the training splits of external datasets to avoid contamination.
| Stage | Data Source / Subset | Count |
|---|---|---|
| Phase 1: SFT Warm-up (Mixed-Prompt) | GeoAux-Bench (Train Split) | 1,000 |
| | GeomVerse (Train Subset) | 300 |
| | Geometry3k (Train Subset) | 300 |
| Phase 2: A2PO Training (Solvability Filtered) | GeoAux-Bench (Train Split) | 1,500 |
| | GeomVerse (Train Subset) | 300 |
| | Geometry3k (Train Subset) | 300 |
| Phase 3: Evaluation (Held-out Test Sets) | GeoAux-Bench (Test) | 1,217 |
| | GeomVerse (Test) | 1,000 |
| | Geometry3k (Test) | 901 |
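The marginal solvability filter used to build the A2PO training pool can be sketched as follows. This is a minimal illustration: `rollout_fn` is a hypothetical stand-in for one base-model inference plus a correctness check, not an API from the released code.

```python
def marginal_solvability_filter(problems, rollout_fn, n_rollouts=10):
    """Keep only problems with mixed outcomes across n_rollouts attempts.

    Trivially solved (all correct) and impossible (all incorrect) instances
    are discarded, since they yield no useful gradient signal in GRPO-style
    training. `rollout_fn(problem) -> bool` returns whether one base-model
    rollout answered the problem correctly.
    """
    kept = []
    for problem in problems:
        outcomes = [rollout_fn(problem) for _ in range(n_rollouts)]
        n_correct = sum(outcomes)
        if 0 < n_correct < n_rollouts:  # mixed outcomes: informative sample
            kept.append(problem)
    return kept
```

With 10 rollouts per problem, a sample survives only if it is answered correctly between 1 and 9 times, matching the filtering criterion described in F.4.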