Paper deep dive
Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm
Tianyu Yang, Sihong Wu, Yilun Zhao, Zhenwen Liang, Lisen Dai, Chen Zhao, Minhao Cheng, Arman Cohan, Xiangliang Zhang
Abstract
Multimodal Mathematical Reasoning (MMR) has recently attracted increasing attention for its capability to solve mathematical problems that involve both textual and visual modalities. However, current models still face significant challenges in real-world visual math tasks. They often misinterpret diagrams, fail to align mathematical symbols with visual evidence, and produce inconsistent reasoning steps. Moreover, existing evaluations mainly focus on checking final answers rather than verifying the correctness or executability of each intermediate step. To address these limitations, a growing body of recent research integrates structured perception, explicit alignment, and verifiable reasoning within unified frameworks. To establish a clear roadmap for understanding and comparing different MMR approaches, we systematically study them around four fundamental questions: (1) What to extract from multimodal inputs, (2) How to represent and align textual and visual information, (3) How to perform the reasoning, and (4) How to evaluate the correctness of the overall reasoning process. Finally, we discuss open challenges and offer perspectives on promising directions for future research.
Tags
Links
- Source: https://arxiv.org/abs/2603.08291v1
- Canonical: https://arxiv.org/abs/2603.08291v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/13/2026, 12:42:50 AM
Summary
The paper presents a systematic framework for Multimodal Mathematical Reasoning (MMR) called the Perception-Alignment-Reasoning (PAR) paradigm. It categorizes existing research into three stages: Perception (extracting structured evidence), Alignment (mapping evidence to symbolic/executable representations), and Reasoning (performing verifiable inference). Additionally, it introduces the Answer-Process-Executable (APE) hierarchy for evaluating model performance, providing a roadmap for future research in multimodal LLMs.
Entities (5)
Relation Signals (3)
PAR Framework → decomposes into → Perception, Alignment, and Reasoning
confidence 98% · we organize MMR methods under a Perception–Alignment–Reasoning (PAR) framework, which decomposes MMR approaches into three interdependent stages
APE Hierarchy → assesses → MMR
confidence 95% · APE assesses correctness at three levels: answer (task accuracy), process (faithfulness of intermediate reasoning steps), and executable (verification via executable checks).
E-GPS → supports → Executable Intermediates
confidence 92% · E-GPS [Wu et al., 2024] integrates a symbolic solver with a diagram parser for verifiable step-by-step solutions.
Cypher Suggestions (2)
Find all methods associated with a specific PAR stage · confidence 90% · unvalidated
MATCH (m:Method)-[:IMPLEMENTS_STAGE]->(s:Stage {name: 'Perception'}) RETURN m.name
Map benchmarks to their evaluation level · confidence 90% · unvalidated
MATCH (b:Benchmark)-[:EVALUATES_AT]->(l:EvaluationLevel) RETURN b.name, l.name
Full Text
91,598 characters extracted from source content.
Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm
Tianyu Yang 1*, Sihong Wu 2*, Yilun Zhao 2*, Zhenwen Liang 1, Lisen Dai 3, Chen Zhao 4, Minhao Cheng 5, Arman Cohan 2, Xiangliang Zhang 1†
1 University of Notre Dame · 2 Yale University · 3 Columbia University · 4 New York University · 5 Pennsylvania State University · tyang4, xzhang33@nd.edu
Figure 1: The roadmap of our framework.
1 Introduction
Large Language Models (LLMs) have recently advanced mathematical reasoning, achieving state-of-the-art results on various symbolic and arithmetic tasks, from elementary school level to college level [DeepMind, 2024, Guo et al., 2025].
However, in practice, mathematics often involves multimodal information. Many real-world problems in education [Ku et al., 2025], scientific discovery [Du et al., 2025], and interactive professional systems [Hu et al., 2024, Zhao et al., 2025b] require reasoning over visual structures and spatial relations. Solving these problems often requires interpreting diagrams, coordinate plots, charts, tables, and mixed-modality documents [Lu et al., 2021b, Saikh et al., 2022, Lee et al., 2023, Zhao et al., 2023]. In these contexts, visual elements encode critical constraints—such as incidence, parallelism, numeric scales, and layout semantics—that text-only models simply cannot perceive [Chen et al., 2025b]. To handle this complexity, a line of work focuses on integrating perception, symbolic understanding, and executable reasoning across modalities, defining the field of Multimodal Mathematical Reasoning (MMR) [Chen et al., 2021, Lu et al., 2021b, Saikh et al., 2022].
(* Equal contributions. † Correspondence. arXiv:2603.08291v1 [cs.AI], 9 Mar 2026)
Figure 2: Taxonomy of the Perception–Alignment–Reasoning (PAR) framework.
What to Extract?
- Geometry: GEOS [Seo et al., 2015]; E-GPS [Wu et al., 2024]; Pi-GPS [Zhao et al., 2025a]; GeomVerse [Kazemi et al., 2023]; G-LLaVA [Gao et al., 2023a]; GeoGPT4V [Cai et al., 2024a]; GeoQA+ [Cao and Xiao, 2022a]; Geometry3K [Lu et al., 2021a]; GeoQA [Chen et al., 2022a]; PGDP5K [Hao et al., 2022]; PGPS9K [Zhang et al., 2023]; DFE-GPS [Xin et al., 2025]; GEOX [Xia et al., 2024a]
- Chart and Table: PlotQA [Methani et al., 2020a]; ChartQA [Masry et al., 2022]; FinQA [Chen et al., 2022b]; TAT-QA [Zhu et al., 2021]; MultiHiertt [Zhao et al., 2022]; DocMath-Eval [Zhao et al., 2024]; DVQA [Kafle et al., 2018]; PlotQA [Methani et al., 2020b]; Pix2Struct [Lee et al., 2023]; ChartX [Xia et al., 2024b]; ChartLlama [Han et al., 2023]; DePlot [Liu et al., 2023a]; LogicNLG [Chen et al., 2020]
- Visual Math Word Problems: IconQA [Lu et al., 2021b]; CLEVR-Math [Lindström and Abraham, 2022]; MV-MATH [Wang et al., 2025a]; Patch-TRM [Lu et al., 2021b]; GeoGPT4V [Cai et al., 2024b]; Inter-GPS [Lu et al., 2021c]; TABMWP [Lu et al., 2022]
How to Align?
- Executable Intermediates: GeoQA+ [Cao and Xiao, 2022b]; FormalGeo [Zhang et al., 2024a]; Inter-GPS [Lu et al., 2021c]; E-GPS [Wu et al., 2024]; Pi-GPS [Zhao et al., 2025a]
- Symbolic–Neural Hybrids: GeoGen [Pan et al., 2025]; AlphaGeometry [Trinh et al., 2024]; MathCoder-VL [Wang et al., 2025b]
- Cross-modal Alignment: BLIP-2 [Li et al., 2023]; LLaVA [Liu et al., 2023b]; Math-PUMA [Zhuang et al., 2024]; VCAR [Jia et al., 2024]; TVC [Sun et al., 2025a]; VIC [Zheng et al., 2024]
- Pre-training & Fine-tuning: GeoGen [Pan et al., 2025]; MathCoder-VL [Wang et al., 2025b]; SynthGeo228K [Zhang et al., 2025a]; GeoGPT-4V [Cai et al., 2024b]; Math-LLaVA [Shi et al., 2024]; MAVIS [Zhang et al., 2024b]; MultiMath-300K [Peng et al., 2024]; AlphaGeometry [Trinh et al., 2024]; AtomThink [Xiang et al., 2024]; Masked Thought [Chen et al., 2024]; LogicSolver [Yang et al., 2022]; MathGenie [Lu et al., 2024a]; MMathCoT-1M [Shi et al., 2024]; DualMath-1.1M [Zhang et al., 2024b]; MathV360K [Shi et al., 2024]; Inter-GPS [Lu et al., 2021c]; GeoQA [Chen et al., 2021]; GeoQA+ [Cao and Xiao, 2022b]; E-GPS [Wu et al., 2024]; VCAR [Jia et al., 2024]; Math-PUMA [Zhuang et al., 2024]; MAmmoTH-VL [Guo et al., 2024]; TrustGeoGen [Fu et al., 2025]
How to Perform Reasoning?
- Deliberate Chains: LLaVA-CoT [Xu et al., 2024]; VisuoThink [Wang et al., 2025c]; VReST [Zhang et al., 2025b]; ToT [Yao et al., 2023]; GoT [Besta et al., 2024]
- RL-based, Reward Mechanism Design: R1-VL [Zhang et al., 2025c]; VisualPRM [Wang et al., 2025d]; M-PRM [Du et al., 2025]; M-Eureka [Meng et al., 2025]
- RL-based, Search & Decision Algorithms: DeepSeek-R1 [Guo et al., 2025]; Vision-R1 [Huang et al., 2025]; Mulberry [Yao et al.]; Skywork-R1V2 [Chris et al.]; VL-Rethinker [Wang et al., 2025e]; FAST [Xiao et al., 2025]; AlphaProof [DeepMind, 2024]; Think-or-Not [Wang et al., 2025f]; VLAA-Thinking [Chen et al., 2025a]; VLM-R3 [Jiang et al., 2025]; MAYE [Ma et al., 2025]; SoTA-with-Less [Wang et al., 2025g]
- Process Feedback & Verification: VisualPRM [Wang et al., 2025d]; M-PRM [Du et al., 2025]; TVC [Sun et al., 2025a]; VIC [Zheng et al., 2024]
- Tool-Augmented: MathCoder-VL [Wang et al., 2025b]; Pi-GPS [Zhao et al., 2025a]; Visual Sketchpad [Hu et al., 2024]; Toolformer [Schick et al., 2023]; ToRA [Gou et al., 2023]; M-REACT [Yang et al., 2023]
Supervision & Data
- Error Detection and Correction: M-MATH [Sun et al., 2024]; ErrorRadar [Yan et al., 2024a]; MPBench [Xu et al., 2025a]; Sherlock [Ding and Zhang, 2025]; We-Math [Qiao et al., 2024]; Mathador-LM [Kurtic et al., 2024]; VATE [Xu et al., 2025b]
- Mathematical Problem Generation: GeoGen [Pan et al., 2025]; GeoGPT-4V [Cai et al., 2024b]; Math-LLaVA [Shi et al., 2024]; MAVIS [Zhang et al., 2024b]; MultiMath-300K [Peng et al., 2024]; AtomThink [Xiang et al., 2024]
Compared with purely text-based approaches [Lewkowycz et al., 2022, Liang et al., 2023], MMR approaches significantly improve evidence completeness by grounding visual cues. Nonetheless, these multimodal learning approaches substantially increase reasoning complexity: a model must jointly interpret visual cues, align them with symbolic expressions, and execute consistent multi-step reasoning across modalities [Chen et al., 2021, Sheng et al., 2025]. This strong multimodal coupling introduces new, non-trivial challenges related to structured perception, cross-modal alignment, and verifiable reasoning. Given the importance of MMR and its rapid progress, we are motivated to present this framework, which foregrounds the fundamental mechanisms of addressing MMR using Multimodal LLMs (MLLMs). Prior efforts primarily catalog benchmarks and methodologies for MMR [Yan et al., 2024b] or discuss MLLM ecosystem roles (Reasoner, Enhancer, Planner) [Yan et al., 2024b]. In contrast, we take a vertical, process-centric view: we articulate what is needed to solve MMR end-to-end and position MLLM-based approaches along this roadmap. Concretely, we organize the field around four questions: 1) what to extract from multimodal inputs, 2) how to represent and align textual and visual information, 3) how to perform the reasoning (e.g., CoT, program-aided, tool use), and 4) how to evaluate the correctness of the reasoning process. More discussion of our work vs. related frameworks is provided in Table 2 and Appendix A.
Centered on these four questions, we organize MMR methods under a Perception–Alignment–Reasoning (PAR) framework, which decomposes MMR approaches into three interdependent stages: (1) Perception, extracting structured mathematical evidence from visual and textual modalities; (2) Alignment, mapping perceived facts to symbolic or executable representations; and (3) Reasoning, conducting interpretable and verifiable inference over the aligned representations (e.g., CoT, program execution, tool use). To complement this process-centric perspective, we further introduce a companion evaluation hierarchy, the Answer–Process–Executable (APE) framework. APE assesses correctness at three levels: answer (task accuracy), process (faithfulness of intermediate reasoning steps), and executable (verification via executable checks). Together, PAR and APE provide a systematic lens for dissecting multimodal mathematical reasoning, enabling both a comprehensive synthesis of prior work and a diagnostic understanding of where current MLLMs succeed or fail to reason faithfully.

| Benchmark | Year (Venue) | Eval Level | PAR Stage | Key Contributions |
| --- | --- | --- | --- | --- |
| ChartQA [Masry et al., 2022] | 2022 (ACL Findings) | Answer | Perception + Reasoning | Real charts; logical & numeric QA. |
| FigureQA [Kahou et al., 2017] | 2018 (ICLR Workshop) | Answer | Perception | Synthetic charts; controlled reasoning. |
| PlotQA [Methani et al., 2020a] | 2020 (WACV) | Answer | Perception + Reasoning | Real plots; open-vocab numeric answers. |
| IconQA [Lu et al., 2021b] | 2021 (NeurIPS) | Answer | Perception + Reasoning | Large icon-based multimodal math. |
| CLEVR-Math [Lindström and Abraham, 2022] | 2022 (NeSy Workshop) | Answer | Perception + Reasoning | Synthetic compositional arithmetic. |
| FinQA [Chen et al., 2022b] | 2021 (EMNLP) | Answer | Alignment + Reasoning | Financial table-text; gold programs. |
| TAT-QA [Zhu et al., 2021] | 2021 (ACL) | Answer | Alignment + Reasoning | Table-text numeracy in reports. |
| MultiHiertt [Zhao et al., 2022] | 2022 (ACL) | Answer | Alignment + Reasoning | Financial table-text; gold programs. |
| DocMath-Eval [Zhao et al., 2024] | 2024 (ACL) | Answer | Alignment + Reasoning | Financial table-text; gold evidence. |
| ChartQAPro [Masry et al., 2025] | 2025 (ACL Findings) | Answer | Perception + Alignment | Harder charts incl. dashboards. |
| CharXiv [Wang et al., 2024a] | 2024 (NeurIPS D&B) | Answer | Perception | Human-curated arXiv charts. |
| M-MATH [Sun et al., 2024] | 2024 (arXiv) | Process | Reasoning | Step types & error labels. |
| MPBench [Xu et al., 2025a] | 2025 (ACL Findings) | Process | Reasoning | PRM / step-judge benchmarking. |
| ErrorRadar [Yan et al., 2024a] | 2024 (arXiv) | Process | Reasoning | Fine-grained error taxonomy. |
| Sherlock [Ding and Zhang, 2025] | 2025 (arXiv) | Process | Reasoning | Multimodal error detect & repair. |
| We-Math [Qiao et al., 2024] | 2025 (ACL) | Process | Reasoning | Principle-centered process probing. |
| MathVerse [Zhang et al., 2024c] | 2024 (ECCV) | Process | All | Diagram perturbations; CoT step scoring. |
| CHAMP [Mao et al., 2024] | 2024 (arXiv) | Process | Reasoning | Competition items; wrong-step tags. |
| PolyMATH [Gupta et al., 2024] | 2024 (arXiv) | Process | Reasoning | Image–text puzzles; cognitive coverage. |
| GeoQA+ [Cao and Xiao, 2022b] | 2022 (COLING) | Executable | Alignment + Reasoning | Geometry QA with executable programs. |
| Geometry3K [Lu et al., 2021a] | 2021 (ACL) | Executable | Perception + Alignment | Dense formal language for geometry. |
| E-GPS [Lu et al., 2021c, Wu et al., 2024] | 2024 (CVPR) | Executable | All | Solver + parser; verifiable steps. |
| FormalGeo [Zhang et al., 2024a] | 2024 (MATH-AI) | Executable | Alignment + Reasoning | Olympiad-level formal proofs. |
| Pi-GPS [Zhao et al., 2025a] | 2025 (arXiv) | Executable | Alignment + Reasoning | Rectifier and solver for proofs. |
| WikiSQL [Zhong et al., 2017] | 2017 (NeurIPS) | Executable | Alignment + Reasoning | NL→SQL with execution accuracy. |
| MathVista [Lu et al., 2024b] | 2024 (ICLR) | Comprehensive | All | Aggregated multimodal suite. |
| MATH-V [Wang et al., 2024b] | 2024 (NeurIPS) | Comprehensive | All | Difficulty-calibrated visual math. |
| OlympiadBench [Cherian et al., 2024] | 2024 (ACL) | Comprehensive | All | Bilingual competition-grade; stepwise. |
| MathScape [Liang et al., 2024a] | 2024 (arXiv) | Comprehensive | All | Photo scenarios; multi-dim evaluation. |
| Cmm-Math [Liu et al., 2024] | 2024 (arXiv) | Comprehensive | All | Chinese multimodal math. |
| Children's Olympiads [He et al., 2024] | 2024 (arXiv) | Comprehensive | All | Olympiad-style problems. |
| M-PRM [Du et al., 2025] | 2025 (arXiv) | Comprehensive | All | Real-world K-12 multimodal QA. |

Table 1: Evaluation benchmarks organized by the APE hierarchy, aligned with corresponding PAR stages.

The roadmap of our framework is shown in Figure 1. We begin by outlining the core challenges and preliminaries of MMR, including the main task families and the structure of perception outputs. We then formalize the PAR pipeline and synthesize methods at each stage. For Perception, we track the path from symbolic parsers to pipelines built on large multimodal models (Section 2). For Alignment, we cover executable intermediates, symbolic and neural hybrids, cross-modal alignment frameworks, and pretraining and finetuning strategies (Section 3). For Reasoning, we review deliberate chains, reinforcement learning, tool-augmented and executable reasoning, and process feedback and verification (Section 4). Next, we map major benchmarks and datasets to APE levels and to PAR stages (Section 5), and we provide consolidated tables for direct comparison and diagnostic analysis (Figure 2 and Tables 1-A1). We conclude by outlining open challenges and future directions (Section 6).
2 Perception: What to Extract?
In the PAR framework (overview shown in Figure 2), perception addresses the first and central question: what to extract from multimodal inputs before alignment and reasoning can occur. Unlike generic vision tasks, mathematical perception must yield structured, computation-relevant evidence rather than only objects or text.
Given multimodal inputs X ⊆ {T, D, C, I}, a mixture of text T, diagram D, chart or table C, and image I, the perception function p : X → F extracts a set of mathematical facts F spanning three levels: (i) low-level primitives such as points, lines, axes, or objects; (ii) structural relations such as incidence, parallelism, axis–series binding, or row and column layouts; and (iii) quantitative attributes such as lengths, angles, values, and units. Note that perception is essential; errors at this stage propagate downstream and can lead to misalignment or faulty reasoning. To ground PAR in concrete settings, we introduce three representative task families: geometry problems, chart/table problems, and visual math word problems. These task families illustrate the kinds of evidence that must be extracted. We then summarize the task-oriented datasets through the lens of PAR (detailed in Table A1), which provides the complete list of datasets for each task. Finally, we review the methodological evolution of perception, from symbolic parsers to neural encoders to LMM-based pipelines, and conclude with an outlook on open challenges and promising directions.
Geometry Problems. Geometry problem solving requires models to jointly parse textual descriptions T and diagrams D to produce numerical values, symbolic relations, or complete proofs: f : (T, D) → y. Perception in this task focuses on recognizing geometric primitives such as points, lines, and angles, understanding their spatial relations, and grounding textual references to diagrammatic structures before performing deductive reasoning. Method development has progressed from symbolic theorem provers such as GEOS [Seo et al., 2015], to neural vision–language models, and more recently to hybrid pipelines with executable programs such as E-GPS [Wu et al., 2024] and Pi-GPS [Zhao et al., 2025a], which enhance verifiability and explainability.
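To make the perception output concrete, the fact set F described above can be sketched as a small typed container. This is an illustrative sketch only; the class and field names (Primitive, Relation, Attribute, Facts) are our assumptions, not an interface from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Primitive:
    kind: str          # low-level primitive, e.g. "point", "line", "axis"
    name: str

@dataclass
class Relation:
    kind: str          # structural relation, e.g. "parallel", "incident_on"
    args: tuple        # names of the primitives it relates

@dataclass
class Attribute:
    target: str        # primitive the quantity attaches to
    quantity: str      # e.g. "length", "angle", "value"
    value: float
    unit: str = ""

@dataclass
class Facts:
    primitives: list = field(default_factory=list)
    relations: list = field(default_factory=list)
    attributes: list = field(default_factory=list)

# A toy diagram: two parallel lines AB and CD, with |AB| = 5 cm.
facts = Facts(
    primitives=[Primitive("line", "AB"), Primitive("line", "CD")],
    relations=[Relation("parallel", ("AB", "CD"))],
    attributes=[Attribute("AB", "length", 5.0, "cm")],
)
```

A downstream alignment stage would then map such facts to symbolic predicates (e.g., parallel(AB, CD)) or executable constraints.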
LMMs further introduce a new perception paradigm, enabling both improved geometric understanding, as seen in GeomVerse [Kazemi et al., 2023], and large-scale synthetic data generation, as demonstrated by G-LLaVA [Gao et al., 2023a] and GeoGPT4V [Cai et al., 2024a]. Recent work further explores diagram formalization and formal-language pretraining to improve structural understanding and robustness under domain shift, such as DFE-GPS [Xin et al., 2025] and GEOX [Xia et al., 2024a]. Representative datasets include Geometry3K [Lu et al., 2021a], GeoQA and GeoQA+ [Chen et al., 2022a, Cao and Xiao, 2022a], PGDP5K [Hao et al., 2022], and PGPS9K [Zhang et al., 2023].
Chart and Table Problems. Chart and table problems assess the ability to interpret structured visual data C in response to a natural language query Q, formalized as f : (C, Q) → a, where a denotes the predicted answer. Models must accurately perceive visual layouts such as axes, legends, rows, and columns, ground linguistic references to these visual elements, and perform numerical or logical reasoning based on the extracted structure. Perception in this domain has evolved from explicit symbolic parsing [Kafle et al., 2018, Methani et al., 2020b, Masry et al., 2022] to neural vision–language models that jointly encode layout and text [Lee et al., 2023], and more recently to LMM-based instruction-tuned frameworks [Han et al., 2023, Xia et al., 2024b] that integrate structural perception with executable reasoning. DePlot [Liu et al., 2023a] and LogicNLG [Chen et al., 2020] bridge perception and alignment through chart-to-table translation. Key benchmarks include PlotQA [Methani et al., 2020a], TAT-QA [Zhu et al., 2021], FinQA [Chen et al., 2022b], MultiHiertt [Zhao et al., 2022], ChartQA [Masry et al., 2022], and DocMath-Eval [Zhao et al., 2024].
Visual Math Word Problems.
Visual Math Word Problems require solving natural-language math queries grounded in visual scenes: f : (I, Q) → a, where Q denotes the natural-language question and a denotes the predicted answer. Typical skills include object counting, attribute reasoning, quantity comparison, and cross-image co-reference. Methods have gradually shifted from symbolic perception and explicit object-relation parsing, as in Patch-TRM [Lu et al., 2021b], to neural multimodal encoders that learn visual–textual correspondences [Lu et al., 2021c], and more recently to LMMs capable of holistic scene understanding and chain-of-thought reasoning [Cai et al., 2024b]. Representative datasets include IconQA [Lu et al., 2021b], CLEVR-Math [Lindström and Abraham, 2022], TABMWP [Lu et al., 2022], RoMMath [Zhao et al., 2025c], and MV-MATH [Wang et al., 2025a].
Method Evolution and Outlook. Methods for mathematical perception have progressed from symbolic parsers and handcrafted rules to neural encoders that couple visual grounding with textual understanding, and now to LMMs unified through pretraining and instruction tuning. Despite their generality, LMMs often struggle with fine-grained perception, such as misreading geometric elements or chart layouts. Future work should focus on precise structure perception, executable supervision, and combining neural and symbolic reasoning for reliable results.
3 Alignment: How to Represent & Align?
Alignment bridges perception and reasoning. It defines how perceived visual facts are structured and mapped to symbolic or linguistic forms so that downstream reasoning becomes interpretable and verifiable. In mathematical contexts, alignment connects visual entities such as geometric primitives, chart axes, and table layouts with textual predicates or executable intermediates like geometry description languages, constraint sets, proof sketches, chart or table operators, SQL queries, and program-of-thought traces.
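As a minimal illustration of an executable intermediate in the chart/table setting, a chart can first be translated into a table (in the spirit of chart-to-table approaches such as DePlot) and the question compiled into a small, checkable program. The table values and the function below are invented for illustration, not taken from any benchmark:

```python
# Hypothetical perception output: a bar chart flattened into a table
# (year -> value), as a chart-to-table translation step would produce.
table = {
    "2021": 120.0,
    "2022": 150.0,
    "2023": 180.0,
}

def answer_growth(table, start, end):
    """Executable intermediate for the question:
    'By how much did the value grow from `start` to `end`?'
    The computation is explicit, so every step can be re-run and checked."""
    return table[end] - table[start]

result = answer_growth(table, "2021", "2023")  # 60.0
```

Because the intermediate is a program over explicit values rather than free-form text, its steps can be re-executed for verification, which is what the executable level of the APE hierarchy targets.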
The key challenge is to represent and align multimodal information while preserving symbolic fidelity and remaining robust to visual noise and domain variation. This section reviews alignment techniques from four complementary perspectives: (1) executable intermediates that formalize visual content into checkable programs, (2) symbolic–neural hybrids that couple neural perception with symbolic reasoning engines, (3) cross-modal frameworks that stabilize vision–language coupling, and (4) pre-training and fine-tuning strategies that provide large-scale priors and task-specific supervision.
3.1 Executable Intermediates
A key direction is converting visual content into formal, checkable intermediates that support symbolic reasoning. Inter-GPS [Lu et al., 2021c] annotates geometry problems with domain-specific languages to enable interpretable execution. E-GPS [Wu et al., 2024] integrates a symbolic solver with a diagram parser for verifiable step-by-step solutions. Pi-GPS [Zhao et al., 2025a] introduces a multimodal rectifier to disambiguate diagrams before theorem-driven solving. R1-OneVision [Yang et al., 2025] scales this idea by transforming diagrams into textual formalizations for large-scale consistency training. Beyond geometry, chart and table reasoning converts visual marks into code- or SQL-like operators to ensure numeric correctness by design. Executable intermediates thus anchor alignment and make reasoning verifiable.
3.2 Symbolic–Neural Hybrids
Hybrid pipelines combine symbolic rigor with neural flexibility. For example, GeoGen [Pan et al., 2025] aligns diagrams with executable programs under symbolic supervision. MathCoder-VL [Wang et al., 2025b] uses code-based cross-modal supervision to reinforce visual–text alignment and program-level faithfulness. AlphaGeometry [Trinh et al., 2024] integrates theorem libraries with neural search to handle complex geometric deductions.
By injecting formal structure while retaining perceptual capacity, these hybrids enhance interpretability, transferability, and reasoning stability.
3.3 Cross-modal Alignment Frameworks
General frameworks provide reusable backbones for stable vision–language coupling. BLIP-2 [Li et al., 2023] links vision encoders to LLMs and serves as a base for math-specific extensions. LLaVA [Liu et al., 2023b] introduces instruction-following alignment for visual inputs. Math-PUMA [Zhuang et al., 2024] applies progressive staged alignment for long-chain stability, while VCAR [Jia et al., 2024] follows a “describe-then-reason” curriculum. For long-horizon reasoning, TVC [Sun et al., 2025a] maintains persistent visual conditioning, and VIC [Zheng et al., 2024] composes textual plans with late fusion to avoid drift. Curriculum- and conditioning-based designs help reduce cumulative errors and stabilize multi-step reasoning.
3.4 Pre-training and Fine-tuning as Enablers
Large-scale pre-training provides broad coverage and alignment priors. Geo170K [Gao et al., 2023b], SynthGeo228K [Zhang et al., 2025a], TrustGeoGen [Fu et al., 2025] and GeoGPT-4V [Cai et al., 2024b] expand diagram–text coupling at scale. Math-LLaVA [Shi et al., 2024] and MAVIS [Zhang et al., 2024b] extend instruction-tuned data with visual reasoning. MultiMath-300K [Peng et al., 2024] contributes multimodal K–12 problems with stepwise annotations. Beyond these, MAmmoTH-VL [Guo et al., 2024] scales to 12M instruction pairs for multimodal pre-training, while [Fu et al., 2025] generates verified geometric data for reliable training. Symbolic resources like AlphaGeometry [Trinh et al., 2024] and auto-diagram construction [Krueger et al., 2021] further enhance formal priors. Objective design mixes grounding with process supervision—Masked Thought [Chen et al., 2024] learns from partial steps, LogicSolver [Yang et al., 2022] integrates logical constraints, and MathGenie [Lu et al., 2024a] generates synthetic CoT data.
Fine-tuning specializes alignment toward executable reasoning. MMathCoT-1M and DualMath-1.1M [Shi et al., 2024, Zhang et al., 2024b] link QA with dual-view trajectories, while MathV360K [Shi et al., 2024] and MAVIS [Zhang et al., 2024b] provide diagram-based instruction data. Datasets such as Geometry3K [Lu et al., 2021c], GeoQA [Chen et al., 2021], and E-GPS [Wu et al., 2024] enable symbolic supervision and program-level verifiability. Curricular designs like VCAR [Jia et al., 2024], Math-PUMA [Zhuang et al., 2024], and AtomThink [Xiang et al., 2024] progressively refine perception and reasoning, making alignment robust and transferable.
Outlook and Comparison. Executable intermediates ensure verifiability but are brittle under domain shifts. Symbolic–neural hybrids improve robustness yet add complexity. Cross-modal frameworks scale well but risk inconsistencies without explicit execution. Pre-training and fine-tuning bring generality but depend on data fidelity. In practice, combining executable precision, hybrid robustness, curriculum stability, and large-scale priors can perhaps achieve the best balance between reliability and generalization.
4 How to Perform Reasoning?
After perception and alignment produce structured representations, the final stage concerns how models perform reliable inference. Reasoning in multimodal mathematical tasks involves executing stable and verifiable computation from structured inputs. Four paradigms dominate: (1) deliberate chain (e.g., CoT) methods, which externalize intermediate steps to expose and guide reasoning; (2) reinforcement learning methods, which optimize long-horizon decision sequences via reward-guided search; (3) tool-augmented reasoning, which employs external solvers or code execution to enforce formal correctness; and (4) process feedback and verification, which introduces critics or verifiers to assess intermediate steps (e.g., executable checks, self-consistency), improving validity and interpretability.
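The self-consistency check mentioned under paradigm (4) can be illustrated in a few lines: sample several reasoning chains, then keep the majority final answer. This is a generic sketch, not any specific system's implementation:

```python
from collections import Counter

def self_consistency(answers):
    """Self-consistency verification: sample several reasoning chains for
    the same problem and return the majority final answer."""
    counts = Counter(answers)
    best, _ = counts.most_common(1)[0]
    return best

# Four sampled chains, three of which agree on the final answer.
final = self_consistency(["42", "42", "17", "42"])  # "42"
```

Majority voting is the weakest form of process verification: it checks agreement among chains rather than the validity of any single step, which is why the stronger executable checks above complement it.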
These approaches collectively enhance robustness and faithfulness across long reasoning chains. Beyond these main paradigms, Error Detection and Correction (to flag and repair faulty traces) and Mathematical Problem Generation (to synthesize diverse, curriculum-aligned instances) play supportive roles that strengthen process supervision and dataset curation. Due to space limits, we defer discussion of these topics to Appendix C.
4.1 Deliberate Chains (e.g., Chain-of-Thought)
In-Context Learning (ICL) with multimodal chain-of-thought (CoT) prompts models to externalize intermediate steps. LLaVA-CoT [Xu et al., 2024] shows that structured prompts can elicit more reliable reasoning paths. TVC [Sun et al., 2025a] injects persistent visual conditioning at every step to mitigate forgetting. VIC [Zheng et al., 2024] composes plans in text first and fuses vision later to reduce cross-modal drift. I2L [Wang et al., 2024c] embeds exemplars directly on the visual canvas to strengthen grounding. AtomThink [Xiang et al., 2024] decomposes reasoning into atomic steps, improving compositionality and enabling fine-grained supervision. Although these methods are lightweight and effective, they can still drift away from the underlying evidence without stronger grounding or verification mechanisms. Beyond linear chains, Tree of Thoughts (ToT) [Yao et al., 2023] generalizes CoT by exploring and self-evaluating multiple branches of intermediate thoughts, and Graph of Thoughts (GoT) [Besta et al., 2024] further models non-linear dependencies among partial solutions. For multimodal settings, AGoT [Yang et al., 2024] adapts GoT to multimodal representation learning via an aggregation graph that soft-prompts and routes reasoning across aspects.
For multimodal mathematical reasoning specifically, VisuoThink [Wang et al., 2025c] performs multimodal tree search with interleaved vision–text steps, and VReST [Zhang et al., 2025b] combines Monte Carlo Tree Search with a self-reward signal to deepen exploration and reports state-of-the-art results on several multimodal math benchmarks. Together, these ToT/GoT-style methods complement CoT by enabling branching, backtracking, and structured selection over intermediate solutions, which is valuable for long-horizon visual–symbolic math problems.
4.2 RL-based Reasoning
Reinforcement learning (RL) approaches treat reasoning as a sequential decision process and optimize for long-horizon stability.
Reward Mechanism Design. R1-VL [Zhang et al., 2025c] introduces step-wise accuracy and validity rewards to encourage high-quality transitions. VisualPRM [Wang et al., 2025d] learns Process Reward Models (PRMs) from large-scale multimodal supervision to provide dense step-level feedback. M-PRM [Du et al., 2025] combines PRM supervision with Monte Carlo Tree Search (MCTS) for comprehensive evaluation. M-Eureka [Meng et al., 2025] explores rule-based RL to capture “visual aha” moments with minimal human annotation.
Search and Decision Algorithms. DeepSeek-R1 [Guo et al., 2025] applies Group Relative Policy Optimization (GRPO) to jointly optimize reasoning and search, and Vision-R1 [Huang et al., 2025] extends this to multimodal settings. Mulberry [Yao et al.] integrates MCTS with reflective reasoning for iterative correction, while Skywork-R1V2 [Chris et al.] combines Maximum a Posteriori Policy Optimization (MPO) and GRPO to balance detail and generalization. VL-Rethinker [Wang et al., 2025e] uses selective sample replay to mitigate vanishing advantages. FAST [Xiao et al., 2025] adapts inference depth to question complexity, and Think-or-Not? [Wang et al., 2025f] learns when to engage in deep reasoning.
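Several of the methods above (DeepSeek-R1, Vision-R1) build on GRPO, whose core step normalizes each sampled response's reward against the group of responses for the same prompt. A minimal sketch of that group-relative advantage computation, under the simplifying assumption of scalar 0/1 correctness rewards:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style normalization: a response's advantage is its reward
    relative to the mean and spread of its sampling group, so no learned
    value function is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Four sampled reasoning chains for one problem, scored 0/1 for correctness.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # [1.0, -1.0, -1.0, 1.0]
```

Correct chains receive positive advantage and incorrect ones negative, which is what drives the policy toward higher-quality reasoning without a separate critic.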
VLAA-Thinking [Chen et al., 2025a] studies reflection-aware optimization and contrasts RL with Supervised Fine-Tuning (SFT). VLM-R³ [Jiang et al., 2025] proposes a three-stage pipeline of region recognition, reasoning, and refinement, while MAYE [Ma et al., 2025] and SoTA-with-Less [Wang et al., 2025g] focus on sample efficiency via MCTS-guided data selection. Beyond multimodal reasoning, AlphaProof [DeepMind, 2024] extends reinforcement learning to formal theorem proving via self-play and symbolic verification in Lean, achieving silver-medal performance on IMO problems. It exemplifies how RL can support verifiable and executable mathematical reasoning.

4.3 Tool-Augmented Reasoning

Tool-augmented methods delegate parts of reasoning to external symbolic systems or APIs to enhance modularity and correctness. Toolformer [Schick et al., 2023] demonstrates how LLMs can invoke external tools for symbolic computation and retrieval, while ToRA [Gou et al., 2023] organizes iterative loops of reasoning, tool calls, and result integration. COPRA [Thakur et al., 2023] composes multiple external capabilities adaptively, and MM-ReAct [Yang et al., 2023] coordinates visual and textual tools for multimodal reasoning. For geometry, Visual Sketchpad [Hu et al., 2024] provides an interactive canvas that enables models to construct and reason visually, and Pi-GPS [Zhao et al., 2025a] integrates parsers, verifiers, and symbolic solvers to produce provable results. Chameleon [Lu et al., 2023a] illustrates dynamic multi-tool composition, while MathCoder-VL [Wang et al., 2025b] uses code supervision to align diagrams with programs, making reasoning directly executable. Together, these systems show how tool integration supports structured, verifiable, and interpretable reasoning.

4.4 Process Feedback and Verification

VisualPRM [Wang et al., 2025d] provides process-level rewards that encourage valid steps and penalize errors.
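Process rewards of this kind assign each intermediate step a validity score and aggregate them into a trajectory-level signal that can rank candidate solutions. The sketch below is illustrative only: the min-aggregation rule and the best-of-N ranking are common simplifications, not VisualPRM's actual architecture.

```python
# Illustrative PRM-style aggregation; the scorer output format and
# min-aggregation are assumptions, not a specific model's design.
def aggregate_process_reward(step_scores: list[float]) -> float:
    """A chain is only as trustworthy as its weakest step, so take
    the minimum step score as the trajectory-level reward."""
    if not step_scores:
        return 0.0
    return min(step_scores)

def rank_solutions(candidates: dict[str, list[float]]) -> list[str]:
    """Best-of-N selection: order candidate solutions by their
    aggregated process reward, highest first."""
    return sorted(candidates,
                  key=lambda k: aggregate_process_reward(candidates[k]),
                  reverse=True)

ranked = rank_solutions({
    "sol_a": [0.9, 0.8, 0.95],   # consistently sound steps
    "sol_b": [0.99, 0.2, 0.9],   # one weak step drags the chain down
})
```

Min-aggregation makes the signal sensitive to a single invalid step, which is exactly the failure mode answer-only rewards miss.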
MM-PRM [Du et al., 2025] integrates PRM scoring with search, creating a generate–judge–revise loop for stable chains. Proof and program verifiers check intermediate domain-specific language (DSL) programs, code, or proof sketches, ensuring results are executable. At the representation level, TVC [Sun et al., 2025a] maintains visual conditioning during reasoning, while VIC [Zheng et al., 2024] reduces bias by text-first planning and late fusion. These approaches connect training with evaluation, ensuring that models are judged not only by answers but also by the correctness of their processes.

Outlook and Comparison. Different reasoning paradigms show complementary strengths. Deliberate chains are lightweight but risk drifting from visual evidence. Reinforcement learning stabilizes long reasoning yet demands costly rewards. Tool-augmented methods add modularity and verifiability but rely on stable interfaces. Process feedback improves auditability but needs dense supervision. Overall, hybrid systems that combine explicit reasoning chains, selective reinforcement learning, executable intermediate representations, and verification mechanisms appear especially promising for robust and interpretable multimodal reasoning.

5 How to Evaluate?

To distinguish genuine mathematical reasoning from shortcut use, evaluation must span the full PAR pipeline and follow our Answer–Process–Executable (APE) hierarchy.

Answer: Final-task metrics (e.g., accuracy) that are easy to report but can conflate perception errors (e.g., misread diagrams) and alignment errors (e.g., incorrect bindings) with reasoning mistakes.

Process: Step-level checks that test whether intermediate reasoning is valid and visually grounded (i.e., consistent with extracted primitives and relations).

Executable: Faithfulness via execution or proof checking (e.g., running code, verifying constraints/derivations) to directly assess alignment and reasoning correctness.
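At the Answer level of this hierarchy, scoring usually reduces to exact match with a numeric tolerance. A minimal sketch (the tolerance policy is an assumption; benchmarks differ on it):

```python
def answer_match(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
    """Answer-level check: numeric comparison within a relative tolerance,
    falling back to normalized exact string match for non-numeric answers."""
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= rel_tol * max(1.0, abs(g))
    except ValueError:
        # e.g. multiple-choice letters or symbolic answers
        return pred.strip().lower() == gold.strip().lower()
```

A correct result from this check still says nothing about which PAR stage produced it, which is precisely why the Process and Executable levels exist.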
We summarize how existing benchmarks map to the APE dimensions in Table 1. The table also covers Comprehensive benchmarks (see Appendix E) that combine diverse modalities, tasks, and difficulty levels to assess overall reasoning ability. Other benchmarks, including robustness (e.g., probing sensitivity to visual perturbations) and domain-specific sets (e.g., remote sensing), are discussed in Appendix D.

5.1 Answer-level Evaluation

Answer-level benchmarks judge the final answer with exact match or numeric tolerance. ChartQA [Masry et al., 2022] evaluates reasoning over diverse real-world charts; PlotQA [Methani et al., 2020a] stresses open-vocabulary and real-valued answers on scientific plots; FigureQA [Kahou et al., 2017] provides large-scale synthetic charts for controlled visual reasoning. IconQA [Lu et al., 2021b] assesses icon-like visual math with multiple formats and cognitive skills. CLEVR-Math [Lindström and Abraham, 2022] probes compositional arithmetic in synthetic scenes. Hybrid table–text datasets such as FinQA [Chen et al., 2022b] and TAT-QA [Zhu et al., 2021] evaluate numerical reasoning over structured evidence. Answer-level evaluation is scalable and task-agnostic but cannot separate lucky guesses from correct reasoning, nor does it reveal where the Perception, Alignment and Reasoning pipeline failed.

5.2 Process-level Evaluation

Process-level benchmarks attach or elicit intermediate steps and score their validity, shifting the focus from answers to how solutions are produced. MM-MATH [Sun et al., 2024] provides step types and error annotations on middle-school problems with visual contexts. MPBench [Xu et al., 2025a] evaluates step-level judges and finds that many general multimodal models struggle with systematic error identification.
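Step-level judging of the kind MPBench measures can be sketched, in its simplest rule-based form, as labeling each step valid or invalid. The toy checker below handles only single-operation arithmetic steps; real process judges must of course also verify visual grounding, which this sketch does not attempt.

```python
import re

def check_steps(steps: list[str]) -> list[bool]:
    """Toy process-level verifier: mark each 'a op b = c' step valid
    only if the arithmetic actually holds; unparseable steps fail."""
    ok = []
    for s in steps:
        m = re.fullmatch(r"\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*", s)
        if not m:
            ok.append(False)     # cannot verify, so do not trust
            continue
        a, op, b, c = m.groups()
        val = {"+": int(a) + int(b),
               "-": int(a) - int(b),
               "*": int(a) * int(b)}[op]
        ok.append(val == int(c))
    return ok

flags = check_steps(["3+4=7", "7*2=15"])   # second step is arithmetically wrong
```

Even this trivial checker exposes the CHAMP-style failure mode of correct answers reached through wrong steps: a chain can end at the gold answer while containing a flagged step.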
ErrorRadar [Yan et al., 2024a] contributes fine-grained error taxonomies and labels for diagnostic analysis, and Sherlock [Ding and Zhang, 2025] extends multimodal process diagnosis with detailed failure categories. We-Math [Qiao et al., 2024] emphasizes principle-centered process evaluation beyond end-to-end scores, MathVerse [Zhang et al., 2024c] perturbs diagrams to test visual understanding beyond text priors, CHAMP [Mao et al., 2024] annotates concepts and hints and reports cases where models reach correct answers with wrong steps, and PolyMATH [Gupta et al., 2024] covers diverse cognitive categories including spatial and pattern reasoning. These resources enable audits of faithfulness and robustness while exposing where Perception or Alignment drifts translate into faulty Reasoning steps.

5.3 Executable-level Evaluation

Executable-level benchmarks require programs, proofs, or constraints that can be run or verified, directly testing symbolic Alignment and the faithfulness of Reasoning. GeoQA+ [Cao and Xiao, 2022b] annotates step-by-step programs for geometry and validates them by execution. FormalGeo [Zhang et al., 2024a] offers Olympiad-level geometry with formal statements, theorem sequences, and verifiable proofs. Inter-GPS and E-GPS [Lu et al., 2021c, Wu et al., 2024] provide formal languages and solver-backed pipelines, and Pi-GPS [Zhao et al., 2025a] adds an LMM rectifier with a theorem-driven solver to produce provable chains. Executable metrics give clear pass or fail results that help identify alignment or reasoning errors, but they depend on reliable parsers and checkers.

6 Challenges and Future Directions

MMR has advanced rapidly, yet key challenges remain. Following the PAR framework, we summarize major limitations and future directions.

Perception. Current MLLMs show only a shallow understanding of visual information and often fail under layout or style changes [Liu et al., 2025a,b].
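The executable-level checks of Section 5.3 ultimately reduce to running an annotated step program and comparing its result to the claimed answer. The stack-machine sketch below is a toy: the opcode set is an assumption for illustration, not GeoQA+'s actual program format.

```python
import math

def execute_program(program: list[tuple], claimed: float,
                    tol: float = 1e-6) -> bool:
    """Toy executable-level check: run (op, *args) steps on a stack and
    pass only if the computed result matches the claimed answer."""
    stack: list[float] = []
    for op, *args in program:
        if op == "push":
            stack.append(float(args[0]))
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "halve":               # e.g. triangle area = base*height/2
            stack.append(stack.pop() / 2)
        else:
            raise ValueError(f"unknown op {op}")
    return math.isclose(stack[-1], claimed, rel_tol=tol)

# Area of a triangle with base 6 and height 4.
ok = execute_program([("push", 6), ("push", 4), ("mul",), ("halve",)], 12.0)
```

The pass/fail verdict is unambiguous, but, as the section notes, it is only as trustworthy as the parser that produced the program from the diagram.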
Structured diagram parsing that captures primitives, topology, and layout improves robustness [Wu et al., 2024]. A promising direction is to pair structured perception with formal interfaces such as code, proof sketches, or SQL, enabling visual evidence to be verified through execution [Zhao et al., 2025a, Lu et al., 2021c].

Alignment. Fragmented domain-specific languages (DSLs) and inconsistent unit conventions cause misalignment and limit transfer. Future work should design unified, type-aware DSLs with explicit unit handling, constraint checking, and program verification [Pan et al., 2025] to standardize visual–symbolic mappings.

Reasoning. Long reasoning chains tend to drift from visual evidence. RL improves stability but is expensive and sensitive to reward design. Lightweight reward models, adaptive inference depth, and hybrid pipelines that delegate symbolic steps to external verifiers can reduce cost while maintaining robustness [Guo et al., 2025, Huang et al., 2025, Wang et al., 2025e,d]. This reflects a broader trade-off between stability and cost: reinforcement learning enhances consistency but introduces heavy computational demands, motivating lightweight process rewards and symbolic verification for practical scalability.

Evaluation. However, benchmark-based evaluation remains limited: models may overfit to specific datasets or annotation styles rather than acquiring transferable reasoning skills. True reasoning should extend beyond curated benchmarks to unseen problems and open-ended contexts [Liang et al., 2024b, Cherian et al., 2024].

Future Opportunities. Applications such as intelligent tutoring, automated grading, and theorem explanation can enhance education through process-aware feedback [Zhou et al., 2024, Ku et al., 2025, Du et al., 2025]. Accessibility tools like MathCAT and MathVision translate visual math into speech or braille with executable checks for accuracy [Soiffer, 2024, Awais et al., 2024].
Professional systems for AR, VR, and engineering can integrate sketchpads, solvers, and code interfaces for verifiable design [Hu et al., 2024]. Advancing these directions while addressing PAR-level challenges will lead to more reliable and interpretable multimodal reasoning systems. Detailed discussions on challenges and future opportunities are provided in Appendix F.

7 Conclusion

This paper presents a process-centered framework of MMR built on the Perception–Alignment–Reasoning (PAR) pipeline and the Answer–Process–Executable (APE) hierarchy. By organizing progress across geometry, chart and table reasoning, and visual math word problems, we show how structured perception, symbolic alignment, and verifiable reasoning jointly enable reliable multimodal intelligence. The PAR and APE frameworks offer a unified lens for understanding methods, benchmarks, and open issues, emphasizing structure-aware perception, executable intermediates, and process-level evaluation.

References

Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1466–1476, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1171. URL https://aclanthology.org/D15-1171/.

Wenjun Wu, Lingling Zhang, Jun Liu, Xi Tang, Yaxian Wang, Shaowei Wang, and Qianying Wang. E-gps: Explainable geometry problem solving via top-down solver and bottom-up generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13828–13837, 2024.

Junbo Zhao, Ting Zhang, Jiayu Sun, Mi Tian, and Hua Huang. Pi-gps: Enhancing geometry problem solving by unleashing the power of diagrammatic information. arXiv preprint arXiv:2503.05543, 2025a.
Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning, 2023. URL https://arxiv.org/abs/2312.12241.

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023a. URL https://arxiv.org/abs/2312.11370.

Shihao Cai, Keqin Bao, Hangyu Guo, Jizhi Zhang, Jun Song, and Bo Zheng. Geogpt4v: Towards geometric multi-modal large language models with geometric image generation, 2024a. URL https://arxiv.org/abs/2406.11503.

Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na, editors, Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520, Gyeongju, Republic of Korea, October 2022a. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.130/.

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6774–6786, Online, August 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.528. URL https://aclanthology.org/2021.acl-long.528/.
Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022a. URL https://arxiv.org/abs/2105.14517.

Yihan Hao, Mingliang Zhang, Fei Yin, and Linlin Huang. Pgdp5k: A diagram parsing dataset for plane geometry problems, 2022. URL https://arxiv.org/abs/2205.09947.

Ming-Liang Zhang, Fei Yin, and Cheng-Lin Liu. A multi-modal neural geometric solver with textual clauses parsed from diagram, 2023. URL https://arxiv.org/abs/2302.11097.

Yue Xin, Wenyuan Wang, Rui Pan, Ruida Wang, Howard Meng, Renjie Pi, Shizhe Diao, and Tong Zhang. Generalizable geometric image caption synthesis. arXiv preprint arXiv:2509.15217, 2025.

Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, et al. Geox: Geometric problem solving through unified formalized vision-language pre-training. arXiv preprint arXiv:2412.11863, 2024a.

Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots, 2020a. URL https://arxiv.org/abs/1909.00997.

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. Finqa: A dataset of numerical reasoning over financial data, 2022b. URL https://arxiv.org/abs/2109.00122.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance, 2021. URL https://arxiv.org/abs/2105.07624.

Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang.
MultiHiertt: Numerical reasoning over multi hierarchical tabular and textual data. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6588–6600, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.454. URL https://aclanthology.org/2022.acl-long.454/.

Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, and Arman Cohan. DocMath-eval: Evaluating math reasoning capabilities of LLMs in understanding long and specialized documents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16103–16120, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.852. URL https://aclanthology.org/2024.acl-long.852/.

Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2018.

Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots. In Proceedings of the ieee/cvf winter conference on applications of computer vision, pages 1527–1536, 2020b.

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.

Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Peng Ye, Min Dou, Botian Shi, et al.
Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185, 2024b.

Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal llm for chart understanding and generation. arXiv preprint arXiv:2311.16483, 2023.

Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One-shot visual language reasoning by plot-to-table translation, 2023a. URL https://arxiv.org/abs/2212.10505.

Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and William Yang Wang. Logical natural language generation from open-domain tables. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.708. URL https://aclanthology.org/2020.acl-main.708/.

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021b.

Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022.

Peijie Wang, Zhong-Zhi Li, Fei Yin, Dekang Ran, and Cheng-Lin Liu. Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19541–19551, 2025a.

Shihao Cai, Keqin Bao, Hangyu Guo, Jizhi Zhang, Jun Song, and Bo Zheng. Geogpt4v: Towards geometric multi-modal large language models with geometric image generation. arXiv preprint arXiv:2406.11503, 2024b.
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021c.

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610, 2022.

Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Proceedings of the 29th international conference on computational linguistics, pages 1511–1520, 2022b.

Xiaokai Zhang, Na Zhu, Yiming He, Jia Zou, Qike Huang, Xiaoxiao Jin, Yanjun Guo, Chenyang Mao, Yang Li, Zhe Zhu, Dengfeng Yue, Fangzhen Zhu, Yifan Wang, Yiwen Huang, Runan Wang, Cheng Qin, Zhenbing Zeng, Shaorong Xie, Xiangfeng Luo, and Tuo Leng. Formalgeo: An extensible formalized framework for olympiad geometric problem solving, 2024a. URL https://arxiv.org/abs/2310.18021.

Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, and Feng Ma. Enhancing the geometric problem-solving ability of multimodal llms via symbolic-neural integration. arXiv preprint arXiv:2504.12773, 2025.

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.

Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, et al. Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning. arXiv preprint arXiv:2505.10557, 2025b.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023b.

Wenwen Zhuang, Xin Huang, Xiantao Zhang, and Jin Zeng. Math-puma: Progressive upward multimodal alignment to enhance mathematical reasoning, 2024. URL https://arxiv.org/abs/2408.08640.

Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, and Meng Jiang. Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training. arXiv preprint arXiv:2404.14604, 2024.

Hai-Long Sun, Zhun Sun, Houwen Peng, and Han-Jia Ye. Mitigating visual forgetting via take-along visual conditioning for multi-modal long cot reasoning. arXiv preprint arXiv:2503.13360, 2025a.

Haojie Zheng, Tianyang Xu, Hanchi Sun, Shu Pu, Ruoxi Chen, and Lichao Sun. Thinking before looking: Improving multimodal llm reasoning via mitigating visual hallucination. arXiv preprint arXiv:2411.12591, 2024.

Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, and Tuo Leng. Diagram formalization enhanced multi-modal geometry problem solver. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025a.

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024.

Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Shicheng Li, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, et al. Mavis: Mathematical visual instruction tuning with an automatic data engine. arXiv preprint arXiv:2407.08739, 2024b.

Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147, 2024.
Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Jianhua Han, et al. Atomthink: A slow thinking framework for multimodal mathematical reasoning. arXiv preprint arXiv:2411.11930, 2024.

Changyu Chen, Xiting Wang, Ting-En Lin, Ang Lv, Yuchuan Wu, Xin Gao, Ji-Rong Wen, Rui Yan, and Yongbin Li. Masked thought: Simply masking partial reasoning steps can improve mathematical reasoning learning of language models. arXiv preprint arXiv:2403.02178, 2024.

Zhicheng Yang, Jinghui Qin, Jiaqi Chen, Liang Lin, and Xiaodan Liang. Logicsolver: Towards interpretable math word problem solving with logical prompt-enhanced learning. arXiv preprint arXiv:2205.08232, 2022.

Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. Mathgenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of llms. arXiv preprint arXiv:2402.16352, 2024a.

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517, 2021.

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024.

Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, et al. Trustgeogen: Scalable and formal-verified data engine for trustworthy multi-modal geometric problem solving. arXiv preprint arXiv:2504.15780, 2025.

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024.

Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, and Xipeng Qiu.
Visuothink: Empowering lvlm reasoning with multimodal tree search. arXiv preprint arXiv:2504.09130, 2025c.

Congzhi Zhang, Jiawei Peng, Zhenglin Wang, Yilong Lai, Haowen Sun, Heng Chang, Fei Ma, and Weijiang Yu. Vrest: Enhancing reasoning in large vision-language models through tree search and self-reward mechanism. arXiv preprint arXiv:2506.08691, 2025b.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025c.

Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025d.

Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, and Wenqi Shao. Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision. arXiv preprint arXiv:2505.13427, 2025.

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, 2024. URL https://arxiv.org/abs/2412.18319.

Y Wei Chris, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, et al. Skywork r1v2: Multimodal hybrid reinforcement learning for reasoning, 2025. URL https://arxiv.org/abs/2504.16656.

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025e.

Wenyi Xiao, Leilei Gan, Weilong Dai, Wanggui He, Ziwei Huang, Haoyuan Li, Fangxun Shu, Zhelun Yu, Peng Zhang, Hao Jiang, et al. Fast-slow thinking for large vision-language model reasoning. arXiv preprint arXiv:2504.18458, 2025.

DeepMind. AI achieves silver-medal standard solving International Mathematical Olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/, 2024. Accessed: 2025-10-06.

Jiaqi Wang, Kevin Qinghong Lin, James Cheng, and Mike Zheng Shou. Think or not? selective reasoning via reinforcement learning for vision-language models. arXiv preprint arXiv:2505.16854, 2025f.

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl?
an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025a.

Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought, 2025. URL https://arxiv.org/abs/2505.16192.

Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, and Pengfei Liu. Rethinking rl scaling for vision language models: A transparent, from-scratch framework and comprehensive evaluation scheme. arXiv preprint arXiv:2504.02587, 2025.

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025g.

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.

Kai Sun, Yushi Bai, Ji Qi, Lei Hou, and Juanzi Li.
Mm-math: Advancing multimodal math evaluation with process evaluation and fine-grained classification. arXiv preprint arXiv:2404.05091, 2024.

Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yi-Fan Zhang, Tianlong Xu, Zhendong Chu, et al. Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection. arXiv preprint arXiv:2410.04509, 2024a.

Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, and Kaipeng Zhang. arXiv preprint arXiv:2503.12505, 2025a.

Yi Ding and Ruqi Zhang. Sherlock: Self-correcting reasoning in vision-language models. arXiv preprint arXiv:2505.22651, 2025.

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024.

Eldar Kurtic, Amir Moeini, and Dan Alistarh. Mathador-lm: A dynamic benchmark for mathematical reasoning on large language models. arXiv preprint arXiv:2406.12572, 2024.

Tianlong Xu, YiFan Zhang, Zhendong Chu, Shen Wang, and Qingsong Wen. Ai-driven virtual teacher for enhanced educational efficiency: Leveraging large pretrain models for autonomous error analysis and correction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28801–28809, 2025b.

Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, and Wenhu Chen. Theoremexplainagent: Towards video-based multimodal explanations for llm theorem understanding. arXiv preprint arXiv:2502.19400, 2025.

Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding, 2025b.
URL https://arxiv.org/abs/2501.12380. Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022. Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou, Simeng Han, Ruizhe Chen, Xiangru Tang, Yumo Xu, Dragomir Radev, and Arman Cohan. QTSumm: Query-focused summarization over tabular data. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1157–1172, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.74. URL https://aclanthology.org/2023.emnlp-main.74/. Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773, 2025b. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022. Zhenwen Liang, Tianyu Yang, Jipeng Zhang, and Xiangliang Zhang. Unimath: A foundational and multimodal mathematical reasoner. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7126–7133, 2023. Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, and Pan Lu. Solving inequality proofs with large language models. arXiv preprint arXiv:2506.07927, 2025. Yibo Yan, Jiamin Su, Jianxiang He, Fangteng Fu, Xu Zheng, Yuanhuiyi Lyu, Kun Wang, Shen Wang, Qingsong Wen, and Xuming Hu. A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges.
arXiv preprint arXiv:2412.11936, 2024b. Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017. Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, and Shafiq Joty. ChartQAPro: A more diverse and challenging benchmark for chart question answering. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 19123–19151, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.978. URL https://aclanthology.org/2025.findings-acl.978/. Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024a. Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024c. Yujun Mao, Yoon Kim, and Yilun Zhou. Champ: A competition-level dataset for fine-grained analyses of llms’ mathematical reasoning capabilities. arXiv preprint arXiv:2401.06961, 2024. Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, and Chitta Baral. Polymath: A challenging multi-modal mathematical reasoning benchmark. arXiv preprint arXiv:2410.14702, 2024. Victor Zhong, Caiming Xiong, and Richard Socher.
Seq2sql: Generating structured queries from natural language using reinforcement learning, 2017. URL https://arxiv.org/abs/1709.00103. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024b. URL https://arxiv.org/abs/2310.02255. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024b. Anoop Cherian, Kuan-Chuan Peng, Suhas Lohit, Joanna Matthiesen, Kevin Smith, and Josh Tenenbaum. Evaluating large vision-and-language models on children’s mathematical olympiads. Advances in Neural Information Processing Systems, 37:15779–15800, 2024. Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang, Mingan Lin, Tianpeng Li, Fan Yang, Zenan Zhou, and Wentao Zhang. Mathscape: Benchmarking multimodal large language models in real-world mathematical contexts. arXiv e-prints, pages arXiv–2408, 2024a. Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, and Liang He. Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. arXiv preprint arXiv:2409.02834, 2024. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024. Yilun Zhao, Guo Gan, Chengye Wang, Chen Zhao, and Arman Cohan. Are multimodal LLMs robust against adversarial perturbations? RoMMath: A systematic evaluation on multimodal math reasoning. 
In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11653–11665, Albuquerque, New Mexico, April 2025c. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.582. URL https://aclanthology.org/2025.naacl-long.582/. Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025. Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023b. Ryan Krueger, Jesse Michael Han, and Daniel Selsam. Automatically building diagrams for olympiad geometry problems. In CADE, pages 577–588, 2021. Lei Wang, Wanyu Xu, Zhiqiang Hu, Yihuai Lan, Shan Dong, Hao Wang, Roy Ka-Wei Lee, and Ee-Peng Lim. All in an aggregated image for in-image learning. arXiv preprint arXiv:2402.17971, 2024c. Juncheng Yang, Zuchao Li, Shuai Xie, Wei Yu, Shijun Li, and Bo Du. Soft-prompting with graph-of-thought for multi-modal representation learning. arXiv preprint arXiv:2404.04538, 2024. Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving. arXiv preprint arXiv:2310.04353, 2023. Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36:43447–43478, 2023a.
Yufang Liu, Yao Du, Tao Ji, Jianing Wang, Yang Liu, Yuanbin Wu, Aimin Zhou, Mengdi Zhang, and Xunliang Cai. The role of visual modality in multimodal mathematical reasoning: Challenges and insights, 2025a. URL https://arxiv.org/abs/2503.04167. Yufang Liu, Yao Du, Tao Ji, Jianing Wang, Yang Liu, Yuanbin Wu, Aimin Zhou, Mengdi Zhang, and Xunliang Cai. The role of visual modality in multimodal mathematical reasoning: Challenges and insights. arXiv preprint arXiv:2503.04167, 2025b. Zhenwen Liang, Kehan Guo, Gang Liu, Taicheng Guo, Yujun Zhou, Tianyu Yang, Jiajun Jiao, Renjie Pi, Jipeng Zhang, and Xiangliang Zhang. Scemqa: A scientific college entrance level multimodal question answering benchmark. arXiv preprint arXiv:2402.05138, 2024b. Minxuan Zhou, Hao Liang, Tianpeng Li, Zhiyu Wu, Mingan Lin, Linzhuang Sun, Yaqi Zhou, Yan Zhang, Xiaoqin Huang, Yicong Chen, et al. Mathscape: Evaluating mllms in multimodal math scenarios through a hierarchical benchmark. arXiv preprint arXiv:2408.07543, 2024. Neil Soiffer. Mathcat: Math capable assistive technology, 2024. Muhammad Awais, Tauqir Ahmed, Muhammad Aslam, Amjad Rehman, Faten S Alamri, Saeed Ali Bahaj, and Tanzila Saba. Mathvision: An accessible intelligent agent for visually impaired people to understand mathematical equations. IEEE Access, 2024. Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14605–14631, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.817. URL https://aclanthology.org/2023.acl-long.817/. 
Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Yuan Wu, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, Jifeng Dai, Ping Luo, Jingdong Wang, Ji-Rong Wen, Xipeng Qiu, Yike Guo, Hui Xiong, Qun Liu, and Zhenguo Li. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. ACM Comput. Surv., 57(11), June 2025b. ISSN 0360-0300. doi: 10.1145/3729218. URL https://doi.org/10.1145/3729218. Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024. Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921, 2025. Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, and Yin Zhang. Vcbench: A controllable benchmark for symbolic and abstract challenges in video cognition. arXiv preprint arXiv:2411.09105, 2024. Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024. Yue Zhou, Litong Feng, Mengcheng Lan, Yiping Ke, Xue Jiang, and Wayne Zhang. Geomath: A benchmark for multimodal mathematical reasoning in remote sensing. 2025.
Appendix

**Geometry Problems**

| Dataset | Year (Venue) | PAR Stage | Size / Annotation | Key Contributions |
| --- | --- | --- | --- | --- |
| GEOS | 2015 (EMNLP) | Perception + Alignment | 55 questions; text + diagram | early GPS baseline; text–diagram mapping |
| Textbook Geometry | 2017 (EMNLP) | Alignment | 1,406 questions; partial logical forms | SAT-style benchmark with logical grounding |
| Geometry3K | 2021 (ACL) | Perception + Alignment | 3,002 questions; dense formal language | formal grounding linking text and diagrams |
| GeoQA | 2021 (ACL Findings) | Alignment + Reasoning | 5,010 questions; executable programs | program-supervised QA |
| GeoQA+ | 2022 (COLING) | Alignment + Reasoning | extended set with harder steps | challenging multi-step reasoning test |
| PGDP5K | 2022 (IJCAI) | Perception | 5,000 diagrams; primitive labels | dataset for geometric primitive parsing |
| PGPS9K | 2023 (IJCAI) | Perception + Alignment | 9,022 items; fine-grained diagram + program | interpretable diagram–program pairs |
| UniGeo | 2022 (EMNLP) | Alignment + Reasoning | 4,998 calculation + 9,543 proof problems | unified format covering calculation and proof |
| GeomVerse | 2024 (ICML Workshop) | Reasoning | procedurally generated problems | synthetic benchmark to test reasoning capacity |
| FormalGeo7K | 2024 (NeurIPS Workshop) | Alignment + Reasoning | ∼7,000 problems; diagram + formal solution | verifiable formal geometry tasks |
| Geo170K | 2025 (ICML) | Perception + Alignment | ∼170,000 image–caption + QA pairs | large-scale geometry pretraining set |
| GeoGPT4V | 2024 (EMNLP) | Perception + Alignment | 4,900 synthesized + 19,000 mixed pairs | LLM-generated geometry text–figure dataset |
| MATHGLANCE | 2025 (arXiv) | Perception | ∼1,200 diagrams / 1,600 questions; perception tags | isolates perception-level evaluation |

**Chart and Table Problems**

| Dataset | Year (Venue) | PAR Stage | Size / Annotation | Key Contributions |
| --- | --- | --- | --- | --- |
| FigureQA | 2018 (ICLR Workshop) | Perception | ∼100,000 charts; ∼1M QA | synthetic chart reasoning dataset |
| DVQA | 2018 (CVPR) | Perception | ∼300,000 images; >3M QA | open-vocabulary chart questions with metadata |
| PlotQA | 2020 (WACV) | Perception | 224,377 plots; ∼28.9M QA | real-valued numeric reasoning on scientific plots |
| ChartQA | 2022 (ACL Findings) | Perception + Alignment | 9,600 human + 23,100 generated QA | visual + logical chart QA |
| CharXiv | 2025 (NeurIPS) | Perception | 2,323 curated charts | scientific chart understanding in real domain |
| ChartQAPro | 2025 (ACL) | Perception + Alignment | 1,341 charts with dashboards | more complex visualization types |
| ChartQA-X | 2025 (arXiv) | Alignment | 30,299 charts with QA + rationales | supervision for explanation in charts |
| FinQA | 2021 (EMNLP) | Alignment + Reasoning | 8,281 cases with gold programs | hybrid table + text numerical reasoning |
| TAT-QA | 2021 (ACL) | Alignment + Reasoning | 16,552 QA in financial reports | table–text numerical reasoning benchmark |
| MultiHiertt [Zhao et al., 2022] | 2022 (ACL) | Alignment + Reasoning | 10,440 QAs with supporting facts in financial reports | hybrid table + text reasoning over multi-table documents |
| DocMath-Eval [Zhao et al., 2024] | 2024 (ACL) | Alignment + Reasoning | 4,000 QAs in financial reports; gold programs | hybrid table + text numerical reasoning |
| TabFact | 2020 (ICLR) | Alignment | 118,000 statements; 16,000 tables | table entailment verification dataset |
| WikiTableQuestions | 2015 (ACL) | Alignment + Reasoning | 22,033 QA; 2,108 tables | compositional QA over web tables |
| WikiSQL | 2017 (NeurIPS) | Alignment | 80,654 NL–SQL pairs; 24,241 tables | executable SQL supervision benchmark |
| DUDE | 2023 (ICCV Challenge) | Perception + Alignment | multi-page document datasets | document-level reasoning with table/figure content |

**Visual Math Word Problems**

| Dataset | Year (Venue) | PAR Stage | Size / Annotation | Key Contributions |
| --- | --- | --- | --- | --- |
| IconQA | 2021 (NeurIPS) | Perception + Reasoning | 107,439 questions; multiple formats | large-scale multimodal math QA benchmark |
| Icon645 | 2021 (NeurIPS) | Perception | 645,687 icons; 377 classes | icon pretraining resource |
| TABMWP | 2023 (ICLR) | Alignment + Reasoning | 38,431 problems; gold solutions / programs | table-based visual math word problems with executable annotations |
| CLEVR-Math | 2022 (NeSy) | Perception + Reasoning | synthetic image + text arithmetic | compositional arithmetic reasoning |
| MV-MATH | 2025 (CVPR) | Perception + Alignment | 2,009 multi-image problems | cross-image dependency reasoning for K–12 |
| MathVista | 2024 (ICLR) | All | 6,000+ visual math problems; 28 merged sets | combines diagrams, charts, and images |
| MATH-V | 2024 (NeurIPS) | All | 3,040 curated visual problems | higher-difficulty multimodal reasoning benchmark |
| Math2Visual | 2024 (ACL Findings) | Perception + Alignment | 12,000 generated visuals from math word text | benchmark for text-to-diagram generation in math |

Table A1: Datasets grouped by task and annotated with the primary PAR stage they support, plus year, venue, size, and key contributions.

| Framework | Venue & Year | Scope / Focus | Models | Focus |
| --- | --- | --- | --- | --- |
| Lu et al. [2023b] | ACL’23 | DL4Math | Deep Learning | Pre-LLM; model architectures and datasets |
| Sun et al. [2025b] | ACM Computing Surveys’25 | FM4Reason | MLLM | Broad reasoning (limited math/symbolic depth) |
| Ahn et al. [2024] | EACL Workshop’24 | LLM4Math | LLM | Text-centered (non-MMR) |
| Yan et al. [2024b] | ACL Findings’25 | MLLM4Math | MLLM | Benchmark- and model-centric taxonomy |
| Li et al. [2025] | arXiv’25 | LMRM | LLM/MLLM | Roadmap- and stage-centric analysis |
| Ours | – | MLLM4Math | MLLM | First unified process-level framework revealing internal mechanisms of multimodal mathematical reasoning |

Table 2: Comparison between representative frameworks and ours. The “Models” column indicates the model scope discussed in each framework (e.g., deep learning models, LLMs, MLLMs).

A Related Frameworks

As shown in Table 2, we summarize recent related frameworks. Recent frameworks have examined mathematical reasoning and multimodal intelligence from complementary perspectives but differ in focus and depth. Lu et al. [2023b] reviewed deep learning for mathematical reasoning, summarizing architectures and datasets in the pre-LLM era but without multimodal or process-level analysis. Sun et al. [2025b] broadly discussed reasoning with foundation models across commonsense, logical, and mathematical domains, yet its treatment of symbolic and multimodal reasoning remains superficial. Ahn et al.
[2024] analyzed LLM-based mathematical reasoning through four dimensions: tasks, methods, factors, and challenges, offering a structured text-centered view but overlooking visual grounding and reasoning processes. Yan et al. [2024b] extended this to the multimodal large language model (MLLM) era, organizing research by benchmarks, methodologies, and challenges, and introducing model roles as Reasoner, Enhancer, and Planner. However, its emphasis lies on ecosystem taxonomy rather than the internal mechanism connecting perception and symbolic alignment. Li et al. [2025] studied large multimodal reasoning models (LMRMs) and proposed a developmental roadmap from modular perception to agentic reasoning, integrating reinforcement learning and multimodal chain-of-thought. Although comprehensive in scope, it treats mathematics as one application and lacks formal analysis of symbolic-numeric grounding or verifiability. In contrast, our framework focuses specifically on multimodal mathematical reasoning (MMR), abstracting the workflow into the Perception–Alignment–Reasoning (PAR) framework and the Answer–Process–Executable (APE) evaluation hierarchy. Together, PAR and APE provide a unified lens for understanding how multimodal evidence is perceived, aligned, and executed in verifiable reasoning. This framework bridges the symbolic–neural perspective of early deep learning, the text-based view of LLM reasoning, and the model-centric paradigm of MLLMs, offering the first process-level synthesis of multimodal mathematical reasoning. Overall, previous frameworks remain largely descriptive and domain-specific, while ours advances toward a process-level, verifiable, and multimodal understanding of mathematical reasoning that integrates perception, alignment, and reasoning within a coherent analytical framework.

B Reasoning Pipeline: Perception, Alignment, and Reasoning

We abstract multimodal math reasoning into three stages.
This view clarifies where systems fail and how to design robust solutions.

Perception. The goal is to recover computationally relevant visual facts. In geometry, this means primitives and topology such as points, lines, angles, incidence, and equality. In charts and tables, it means axes, legends, marks, tick reading, cell structure, and semantic units. Robust OCR and layout analysis also matter in document settings. Errors at this stage, such as missed intersections or misread scales, often cascade.

Alignment. The next step is to bind visual facts to textual predicates or to an intermediate representation that can be executed. Examples include a geometry description language, a set of constraints, a proof language, a sequence of operators for charts and tables, a SQL query, or a program-of-thought trace. Alignment benefits from explicit anchors and structural losses, from code or program supervision, and from formal interfaces. To reduce cross-modal drift during long chains of thought, recent strategies first compose reasoning in text and then consult visual evidence, or maintain visual conditioning throughout the chain.

Reasoning. The final step executes arithmetic, logic, theorem sequences, or programs, often with tool use such as calculators, symbolic solvers, or retrieval. Process-level critics and rewards, together with search methods such as best-of-N or tree search, help maintain validity over long chains. Retaining visual evidence and controlling bias are important for stability. In geometry, staged planning with verifier-backed steps is especially effective.

This decomposition also guides evaluation. Some benchmarks focus on perception and alignment, such as chart reading or primitive extraction. Others emphasize executable and checkable inference, such as geometric proofs or program execution.
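As an illustration, the toy sketch below runs a single bar-chart question through the three stages. The fact schema, the operator names, and the stack-machine executor are all hypothetical simplifications for exposition, not the interface of any system surveyed here; real pipelines use learned perception models and far richer intermediate languages.

```python
# Toy Perception–Alignment–Reasoning sketch for one chart question.
# All structures below are hypothetical simplifications.

def perceive(image):
    # Perception: recover structured visual facts. A real system would run
    # chart parsing / OCR; here we return mock facts for a bar chart.
    return {"marks": {"2021": 12.0, "2022": 18.0}, "y_unit": "million USD"}

def align(question, facts):
    # Alignment: bind the question to an executable operator program over
    # the perceived facts (e.g. "How much did revenue grow from 2021 to 2022?").
    return [("lookup", "2022"), ("lookup", "2021"), ("subtract",)]

def reason(program, facts):
    # Reasoning: execute the program on a small stack machine, so every
    # intermediate value is inspectable and checkable.
    stack = []
    for op, *args in program:
        if op == "lookup":
            stack.append(facts["marks"][args[0]])
        elif op == "subtract":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
    return stack[-1]

facts = perceive(None)
program = align("How much did revenue grow from 2021 to 2022?", facts)
print(reason(program, facts))  # 6.0
```

Because each stage hands the next an explicit artifact (facts, then a program, then a value), a failure can be localized to one stage rather than blamed on the whole model.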
C Supervision and Data for Reasoning

C.1 Error Detection and Correction

In multimodal mathematical reasoning, inference often involves long chains of cross-modal steps, which requires not only evaluating the final answer but also supervising and revising intermediate reasoning states. VisualPRM [Wang et al., 2025d] provides process-level rewards with dense supervision, encouraging valid reasoning transitions and penalizing deviations. M-PRM [Du et al., 2025] integrates PRM scoring with Monte Carlo Tree Search to form a generate–judge–revise loop that stabilizes long reasoning chains. Mathador-LM [Kurtic et al., 2024] instantiates critique-driven revision for math solutions, promoting self-correction during inference. VATE [Xu et al., 2025b] targets classroom drafts with interactive feedback loops aligned with human pedagogy. Sherlock [Ding and Zhang, 2025] contributes fine-grained error taxonomies for process diagnosis, and ErrorRadar [Yan et al., 2024a] provides labeled categories to localize typical failure modes. MM-MATH [Sun et al., 2024] supplies large-scale step and error annotations, while MPBench [Xu et al., 2025a] shows that general-purpose multimodal models still struggle with systematic error identification. Together, these systems and resources operationalize step-level judging and correction, so models are evaluated and improved by how they reason, not just by their final answers.

C.2 Mathematical Problem Generation

In multimodal mathematical reasoning, generating high-quality problems is essential for driving model training and evaluation, especially by supplying process- and execution-level testbeds for perception, alignment, and reasoning. GeoGen [Pan et al., 2025] follows a generate–solve–verify loop coupling symbolic solvers with natural-language verbalization to guarantee checkable solutions. GeoGPT-4V [Cai et al., 2024b] co-generates aligned text–figure pairs with a strong multimodal model to broaden geometric coverage.
Math-LLaVA with MathV360K [Shi et al., 2024] extends instruction-style data toward visual math, and MAVIS [Zhang et al., 2024b] provides an automatic data engine with chain-of-thought supervision for large-scale synthesis. MultiMath-300K [Peng et al., 2024] curates K–12 multimodal problems with captions and stepwise solutions for process-aware training. AtomThink [Xiang et al., 2024] offers long atomic chains of thought to supervise compositional reasoning, while MathCoder-VL [Wang et al., 2025b] uses code as supervision to align diagrams with executable programs for verifiable generation. These generation pipelines and corpora supply controllable, diverse, and executable data that strengthen perception and alignment while furnishing robust evaluation environments.

D Robustness and Domain-specific Benchmarks

Robustness benchmarks probe sensitivity to visual perturbations, multi-image dependencies, and domain shifts beyond standard evaluation. VCBench [Li et al., 2024] focuses on explicit multi-image reasoning dependencies. DynaMath [Zou et al., 2024] applies dynamic perturbations to test shortcut reliance. HC-M3D [Liu et al., 2025b] constructs near-duplicate images that flip correct answers to measure vision dependence. SMART-840 [Cherian et al., 2024] collects K–12 visuo-linguistic problems to assess fundamental multimodal skills under varied conditions. Domain-specific sets such as GeoMath [Zhou et al., 2025] target remote-sensing imagery and subject-specific math tasks, while MV-MATH [Wang et al., 2025a] extends multi-image reasoning to K–12 contexts. Together, these datasets assess model stability, generalization, and cross-domain transfer for multimodal mathematical reasoning.

E Comprehensive Benchmarks

Comprehensive suites mix modalities, tasks, and difficulties to profile broad capabilities. MathVista [Lu et al., 2024b] aggregates problems from many sources spanning natural images, diagrams, and charts.
MATH-V [Wang et al., 2024b] emphasizes difficulty calibration and curated coverage across subjects. SceMQA [Liang et al., 2024b] introduces a scientific multimodal QA benchmark at the college-entrance level, including mathematics and other core subjects, to evaluate reasoning across disciplines. M-K12 [Du et al., 2025] targets K–12 education scenarios with verifiable multimodal problems, bridging visual understanding and curriculum-level reasoning. OlympiadBench [He et al., 2024] reports expert-level annotations enabling stepwise evaluation on competition-grade math and physics, while the Children's Olympiads benchmark [Cherian et al., 2024] evaluates reasoning on competition problems designed for younger students. MathScape [Liang et al., 2024a] focuses on photo-based scenarios with hierarchical categories and multi-dimensional evaluation. CMM-Math [Liu et al., 2024] extends these benchmarks to the Chinese-language setting, highlighting multilingual reasoning capabilities. These suites provide breadth and coverage but often entangle perception, alignment, and reasoning in a single score.

F Challenges and Future Directions

F.1 Challenges

Evaluation Challenges. While the proposed Answer–Process–Executable (APE) evaluation hierarchy provides a structured lens for assessing reasoning fidelity, executable-level evaluation remains challenging to scale. Current executable benchmarks such as GeoQA+ [Cao and Xiao, 2022b], FormalGeo [Zhang et al., 2024a], and Pi-GPS [Zhao et al., 2025a] depend on domain-specific languages, symbolic solvers, or theorem checkers that are largely confined to geometry or table reasoning tasks. Generalizing these pipelines to broader multimodal reasoning, such as chart interpretation, visual word problems, or scientific document understanding, requires unified annotation protocols and lightweight verification schemes.
Moreover, executable evaluation often introduces heavy computational costs and relies on manually curated programs or proofs, limiting its practicality for large-scale MLLM assessment. Future work may explore scalable formal interfaces and semi-automated checkers that balance verifiability, coverage, and efficiency within the APE framework.

Cross-cutting Challenges. Data contamination, limited reproducibility, safety, and interpretability remain persistent issues. Leakage audits, standardized reporting, and verifier-backed pipelines can improve reliability. Executable intermediates, process judges, and proof or code verification support interpretability and trustworthy reasoning [Hu et al., 2024].

F.2 Future Opportunities

Multimodal mathematical reasoning enables diverse downstream applications that benefit from the model's ability to process and integrate visual and symbolic modalities. We categorize representative applications into three core areas:

1. Education and Learning. Education applications benefit greatly from multimodal reasoning. For example, in STEM learning, tools like TheoremExplainAgent [Ku et al., 2025] visually and symbolically guide students through theorems and problem-solving processes. Intelligent tutoring systems [Du et al., 2025] dynamically adapt based on student input, providing feedback by analyzing both diagrams and text. Automated grading systems [Zhou et al., 2024] can assess multi-step, visual-rich student solutions, improving evaluation accuracy and scalability.

2. Accessibility and Inclusivity. For learners with disabilities, multimodal reasoning systems enable accessible content delivery. MathCAT [Soiffer, 2024] and MathVision [Awais et al., 2024] translate visual math into speech and braille, facilitating interaction with geometry or charts. These systems also support alternative input/output modalities (e.g., voice, haptics), ensuring inclusive engagement with mathematical content.

3.
Professional and Interactive Systems. In real-world problem-solving tasks, such as data analysis, architecture, or engineering, professionals must reason over both visual schematics and textual instructions. Multimodal reasoning aids this integration. In parallel, interactive interfaces in AR/VR environments [Hu et al., 2024] allow users to engage with math through gestures, voice commands, or immersive visual aids. These interfaces, when empowered by multimodal reasoning, enhance spatial understanding and application-specific interaction.
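The Answer–Process–Executable evaluation hierarchy discussed in Appendix F can be sketched as three nested checks, each stricter than the last. The record layout, the toy program interpreter, and the numeric tolerance below are hypothetical choices made purely for illustration, not a concrete protocol proposed by the paper.

```python
# Hypothetical sketch of the Answer–Process–Executable (APE) hierarchy:
# each level adds a stricter check on the same predicted solution trace.

def eval_program(program):
    # Toy executor: a program is a list of (op, x, y) arithmetic steps;
    # the last step's result is the answer.
    result = None
    for op, x, y in program:
        result = x + y if op == "add" else x * y
    return result

def answer_level(pred, gold):
    # Level 1: final-answer match only.
    return abs(pred["answer"] - gold["answer"]) < 1e-6

def process_level(pred, gold):
    # Level 2: intermediate steps must also match the annotated steps.
    return answer_level(pred, gold) and pred["steps"] == gold["steps"]

def executable_level(pred, gold):
    # Level 3: re-execute the predicted program and require the executed
    # result to reproduce the claimed answer (a lightweight verifier).
    executed = eval_program(pred["program"])
    return process_level(pred, gold) and abs(executed - pred["answer"]) < 1e-6

pred = {"answer": 6.0, "steps": ["18 - 12 = 6"], "program": [("add", 18.0, -12.0)]}
gold = {"answer": 6.0, "steps": ["18 - 12 = 6"]}
print(answer_level(pred, gold), process_level(pred, gold), executable_level(pred, gold))
# True True True
```

The nesting makes the hierarchy monotone by construction: a trace that passes the executable level necessarily passes the process and answer levels, so scores at the three levels can be compared directly.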