Paper deep dive
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen
Abstract
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
Links
- Source: https://arxiv.org/abs/2603.16124v1
- Canonical: https://arxiv.org/abs/2603.16124v1
Full Text
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai¹*, Zhiheng Lyu¹*, Yuansheng Ni¹, Xiangchao Chen¹, Baichuan Zhou¹, Shenzhe Zhu², Yi Lu², Haozhe Wang³, Chi Ruan¹, Benjamin Schneider¹, Weixu Zhang⁴, Xiang Li¹, Andy Zheng¹, Yuyu Zhang⁵, Ping Nie¹, Wenhu Chen¹†
¹University of Waterloo ²University of Toronto ³The Hong Kong University of Science and Technology ⁴McGill University & MILA ⁵Verdent AI, Inc.
https://github.com/TIGER-AI-Lab/SWE-QA-Pro
Abstract
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a 13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning.
Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
(* Equal contribution. † Corresponding author.)
1 Introduction
Repository-level code understanding is central to LLM-assisted software engineering. Real tasks require navigating many files, tracking control and data flow across modules, and verifying that implementations match intended designs. Snippet-centric QA benchmarks do not capture these behaviors, and knowledge-only prompting can hide weaknesses in navigation and grounding (Husain et al., 2019; Liu and Wan, 2021; Huang et al., 2021; Lee et al., 2022; Gong et al., 2024; Sahu et al., 2024; Li et al., 2024). Recent repository QA benchmarks move toward large-context, tool-using evaluation, but still focus on a few projects and include many questions solvable without interacting with the codebase (Abedu et al., 2025; Chen et al., 2025; Peng et al., 2025; Rando et al., 2025). We focus on two concrete gaps. First, limited diversity: existing benchmarks concentrate on a few popular repositories. This leaves large parts of the natural task distribution uncovered and under-represents certain semantic categories of tasks (e.g., configuration, data plumbing, and infrastructure glue, shown in Appendix A). Second, uncertain need for tools: many benchmark questions can be answered from prior knowledge or public documentation already covered during pretraining, so current setups do not clearly separate cases that actually require tool use from cases where a single-pass, knowledge-only model with enough context and reasoning would already succeed. As a result, it is difficult to tell whether a model truly understands and operates within a particular repository, or simply recalls generic knowledge.
To address these issues, we introduce SWE-QA-Pro, a benchmark and training recipe for repository-level QA. On the benchmark side, we: (i) select less-studied, long-tail repositories and ensure that each one has an executable environment so the project can be built and explored end-to-end (Badertdinov et al., 2025); (ii) use issue texts as question seeds, embed them, and run k-means clustering to form topic groups, followed by a brief human pass to merge near-duplicates and clarify topic boundaries; and (iii) for each topic, use a tool-using code model to propose QA items and draft answers, which are then edited by humans for correctness and repository grounding, with final benchmark QA items sampled across clusters to preserve diversity (Section 2.4).
arXiv:2603.16124v1 [cs.SE] 17 Mar 2026
[Figure 1: SWE-QA-Pro Benchmark and Training Pipeline. (a) Benchmark Construction Pipeline. Stage 1: Taxonomy Construction (GitHub issues, embedding, clustering, taxonomy). Stage 2: Data Synthesis (code repos and Claude Code produce generated queries and QA pairs). Stage 3: Difficulty Calibration (questions answerable by direct answering are filtered out, keeping hard and balanced questions). Stage 4: Data Validation (Claude Code answers and human-annotated answers are cross-validated, yielding questions with ground truths). (b) Training Recipe: base model, SFT model, and RL model, with SFT/RL data, a tool server exposing tools, and a reward model scoring rollouts.]
To reduce knowledge-only questions and make tool usage meaningful, we add a simple filtering step. For each drafted item, we compare a direct-answer baseline (no tools, single turn) with a tool-using run. If the direct-answer baseline already achieves a high score, we discard the item. This preserves questions that require locating and citing concrete code rather than recalling documentation. On the training side, we introduce a two-stage agentic recipe for improving small open models on repository-level QA.
We first apply SFT to match repository-grounded answer formats, then use RLAIF to favor answers citing concrete files and symbols (Lee et al., 2023). In experiments, a tuned Qwen3-8B (Yang et al., 2025) trained with this SFT→RLAIF recipe outperforms GPT-4o (Hurst et al., 2024) and substantially narrows the gap to state-of-the-art proprietary models, as shown in Section 3.2 and Section 4.2. In summary, we make two contributions, as illustrated in Figure 1:
• Benchmark. We release SWE-QA-Pro, a repository-level QA benchmark built from long-tail repositories with executable environments. Questions are seeded from issues, then synthesized and grounded with a tool-using code model along with human editing, followed by filtering to remove cases solvable by strong direct-answer baselines. Compared to SWE-QA, SWE-QA-Pro covers more diverse repositories and includes more questions that truly require codebase interaction (Peng et al., 2025).
• Agent Workflow and Training Recipe. We introduce a simple agentic workflow for repository-level QA that enables iterative codebase exploration via structured actions. Building on this workflow, we present an SFT→RLAIF training recipe that significantly improves small open-source models on SWE-QA-Pro. Using this framework, Qwen3-8B surpasses GPT-4o on SWE-QA-Pro by 2.31 points and substantially narrows the gap to several state-of-the-art proprietary models, including GPT-4.1, Claude Sonnet 4.5, and DeepSeek-V3.2 (OpenAI, 2025; Anthropic, 2025; Liu et al., 2025).
2 SWE-QA-Pro Bench
SWE-QA-Pro Bench is constructed through a four-stage pipeline, as illustrated in Figure 1a: Data Sourcing and Taxonomy, Data Synthesis and Sampling, Data Filtering and Difficulty Calibration, and Data Validation.
This pipeline yields three key advantages over existing benchmarks (Table 1): (1) pull-request-driven clustering together with long-tail repository sampling ensures balanced coverage across diverse software engineering question types; (2) systematic filtering against multiple strong proprietary models removes instances solvable via memorization or pretraining artifacts, thereby isolating questions that require genuine codebase interaction; and (3) answers cross-verified by Claude Code and human annotators provide high-quality gold ground truth, enabling reliable multi-dimensional evaluation.
2.1 Data Sourcing and Taxonomy
We conducted a large-scale analysis of the GitHub repositories in SWE-Rebench (Badertdinov et al., 2025). We processed 1,687,638 issues spanning 3,468 repositories by concatenating their titles and bodies, specifically filtering for contexts between 10 bytes and 16 KB. We computed representations for these texts using the Qwen3-8B-Embedding model. To organize this data, we applied a hierarchical K-Means clustering algorithm, initializing with 10 clusters in the first layer and expanding to 50 in the second. We then utilized GPT-4.1 to extract semantic labels for each resulting cluster. These labels were refined through a human-verified taxonomy to eliminate semantic redundancy and enforce clear semantic boundaries between closely related categories, thereby reducing ambiguity and yielding 48 distinct task subclasses (Appendix E). This unsupervised taxonomy serves as the foundational structure for our benchmark, ensuring it covers a wide spectrum of software engineering challenges rather than a manually cherry-picked subset.
2.2 Data Synthesis and Sampling
Leveraging the derived semantic taxonomy, we employed Claude Code to synthesize the final benchmark data. To guarantee the executability and validity of the problems, we repurposed the established sandbox environments from SWE-Rebench.
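The hierarchical clustering step described above can be sketched in plain NumPy. Everything below is illustrative: the paper uses the Qwen3-8B-Embedding model and a standard K-Means implementation, while here `embed` is a stand-in, `kmeans` is a minimal Lloyd's-algorithm routine, the function names are invented, and the 10-byte to 16 KB length filter is omitted for brevity.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm, a stand-in for any K-Means implementation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels

def build_taxonomy(texts, embed, n_top=10, n_sub=50, seed=0):
    """Two-level clustering over issue embeddings: a coarse first layer
    (10 clusters in the paper) expanded into finer subclusters (50 total)."""
    X = np.asarray([embed(t) for t in texts], dtype=float)
    top = kmeans(X, n_top, seed=seed)
    per_group = max(1, n_sub // n_top)
    labels = {}
    for g in range(n_top):
        idx = np.where(top == g)[0]
        if len(idx) == 0:
            continue
        sub = kmeans(X[idx], min(per_group, len(idx)), seed=seed)
        for i, s in zip(idx, sub):
            labels[int(i)] = (g, int(s))   # (coarse cluster, subcluster)
    return labels
```

In the paper, each resulting cluster is then given a semantic label by GPT-4.1 and refined by a human-verified taxonomy pass; that labeling step is outside this sketch.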
For each synthesis task, we stochastically sampled 20 existing issues from the corresponding cluster and repository to serve as reference context. The agent was then tasked with automatically exploring the codebase to generate a new, self-contained problem-solution pair aligned with the specific cluster's semantics. We adopted different sampling strategies for the training and test sets to balance diversity with human evaluation constraints. For the Test Set, we selected 26 repositories that efficiently cover all 48 task categories, accommodating the cognitive constraints of human annotators while ensuring comprehensive evaluation. Conversely, for the Training Set, we applied uniform sampling across the entire dataset, achieving coverage of 1,484 repositories. Figure 2 illustrates the effectiveness of this pipeline: our synthesized data preserves a semantic distribution whose diversity is comparable to the original issue distributions of the SWE-Rebench and SWE-Verified datasets.
2.3 Data Filtering and Difficulty Calibration
To ensure SWE-QA-Pro focuses on non-trivial, agent-essential reasoning, we apply a multi-stage filtering and calibration pipeline. We first remove multi-query prompts and perform semantic deduplication using Qwen3-8B embeddings to ensure task independence. A key challenge in evaluating repository-level understanding is that state-of-the-art proprietary LLMs possess extensive pretraining knowledge, enabling them to answer many software engineering questions without interacting with the codebase, reading source files, or exploring repository structure. Such questions are often associated with widely known repositories (e.g., the canonical projects in SWE-Bench (Jimenez et al., 2023)), or can be resolved by inspecting only one or a small number of files, without requiring multi-hop reasoning over the repository.
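The cross-model difficulty calibration formalized later in this section (per-model z-scoring of direct-answer judge scores, then a negated cross-model consensus) can be sketched as follows. The function names and the filtering threshold are illustrative assumptions, not the paper's code.

```python
import numpy as np

def difficulty(score_table):
    """score_table maps each model name to an array of average
    direct-answer judge scores, one entry per question. Per-model
    z-scoring removes inter-model scale differences; negating the
    cross-model mean yields the consensus difficulty."""
    zs = []
    for s in score_table.values():
        s = np.asarray(s, dtype=float)
        zs.append((s - s.mean()) / s.std())
    return -np.mean(zs, axis=0)

def filter_easy(questions, score_table, threshold=0.0):
    """Drop questions whose direct (no-repository) answers are
    consistently good, i.e. low difficulty. The threshold value is an
    assumption for illustration."""
    d = difficulty(score_table)
    return [q for q, dq in zip(questions, d) if dq > threshold]
```

A question that every model answers well without touching the repository ends up with a strongly negative difficulty and is removed, while questions that stay hard without repository interaction are retained.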
Empirically, this issue is reflected in existing repo-level QA benchmarks, where the performance gap between models answering with full repository exploration and those responding without any code context is often marginal. As a result, these benchmarks may fail to accurately measure an LLM's ability to explore codebases and perform grounded, repository-level reasoning (Peng et al., 2025).
Table 1: Comparison of representative code benchmarks. Columns in the original: Repo-level, Repo Nav., Multi-hop, Semantic Coverage, Diff. Calibration, and Test Size; the per-column check marks were garbled during extraction and are reproduced below as extracted.
- CodeQueries (Sahu et al., 2024): marks ✗ ✔ ✗; test size 29033
- InfiBench (Li et al., 2024): marks ✗ ✔; test size 234
- CodeReQA (Hu et al., 2024): marks ✔ ✗; test size 1563
- LongCodeQA (Rando et al., 2025): marks ✔ ✗ ✔ ✗; test size 443
- SWE-QA (Peng et al., 2025): marks ✔ ✗; test size 576
- SWE-QA-Pro: marks ✔; test size 260
To mitigate the influence of memorized knowledge and filter out trivially answerable questions, we introduce a difficulty calibration procedure based on cross-model agreement. We evaluate direct (no-repository) answers produced by three strong proprietary models, GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro (Comanici et al., 2025), and compare them against repository-grounded reference answers generated by Claude Code. Each direct answer is assessed using an LLM-as-a-Judge framework along five dimensions: correctness, completeness, relevance, clarity, and reasoning quality, as detailed in Section 4.1. For each model m, we aggregate scores across multiple independent runs and compute the average total score s̄_m(q) for question q. To account for inter-model scale differences, we standardize the aggregated scores using z-score normalization:
z_m(q) = (s̄_m(q) − μ_m) / σ_m,  (1)
where μ_m and σ_m denote the mean and standard deviation of model m's scores over all questions. We then define the difficulty of a question as the
negative consensus score across models:
Difficulty(q) = −(1/|M|) Σ_{m∈M} z_m(q),  (2)
where M denotes the set of evaluated models. Under this definition, questions that consistently receive high-quality direct answers across models are assigned low difficulty and are filtered out, while questions that remain challenging without repository interaction are retained. Using the calibrated difficulty signal, we construct the QA-pair candidate pools with cluster-level coverage and approximate balance across QA types. This calibration step ensures that SWE-QA-Pro emphasizes questions that genuinely require repository exploration and multi-step reasoning, providing a more faithful evaluation of LLM agent capabilities, as shown in Figure 3.
[Figure 2: t-SNE visualization of semantic distributions. (a) Comparison of the original issue spaces, showing the broad coverage of SWE-Rebench compared to the manually curated SWE-Verified. (b) The distribution of our synthesized datasets (Training and Test) demonstrates high diversity and alignment with the semantic clusters of existing benchmarks.]
[Figure 3: Difficulty Comparison between SWE-QA and SWE-QA-Pro. Higher difficulty indicates harder questions.]
2.4 Data Validation and Statistics
To ensure the quality and reliability of QA pairs and metadata, we adopt a multi-stage annotation and validation process to mitigate hallucinations and semantic ambiguity. First, Claude Code explores each repository in a sandbox environment to produce repository-grounded reference answers, while assigned semantic clusters and QA types are cross-checked against the taxonomy and reused for difficulty calibration. Second, human annotators independently explore the codebase to produce answers, revise ambiguous or underspecified questions, and verify QA types and semantic clusters. Their answers are compared against the Claude Code references to identify missing details or errors, and only answers satisfying correctness, completeness, relevance, clarity, and reasoning quality are retained. A final expert review pass adjudicates remaining inconsistencies and further refines the answers. The resulting SWE-QA-Pro benchmark contains 260 questions from 26 long-tail repositories, with 4–9 questions per semantic cluster and an approximately balanced distribution of QA types. Full statistics are reported in Appendix F, with case studies in Appendix H.1.
3 SWE-QA-Pro Agent
We introduce SWE-QA-Pro Agent, a lightweight workflow designed for repository-level code understanding in small open-source models. Unlike RAG-based approaches that require pre-built indices, our agent uses a ReAct-style loop to explore codebases directly. By combining directory traversal, keyword search, and scoped file inspection, the agent gathers evidence incrementally to reason across files under limited context budgets.
3.1 Agent Workflow
We propose SWE-QA-Pro Agent, a ReAct-based workflow for repository-level code understanding. Prior agents such as SWE-QA-Agent primarily rely on RAG-style retrieval with limited command-line support, requiring the construction of a retrieval index while still offering insufficient capacity for genuine repository exploration. This limitation is particularly evident for open-source models, where such agents often underperform strong traditional RAG baselines with offline indexing and manually designed retrieval pipelines. In contrast, SWE-QA-Pro Agent abandons RAG-based retrieval entirely and does not require a pre-built index.
Instead, it performs direct repository exploration using explicit, length-controlled Search based on keyword matching to locate relevant files, View for scoped inspection of file contents or directory structure, and constrained read-only CommandLine actions for lightweight structural and pattern-based analysis (e.g., directory traversal, symbol matching, and line-level extraction), enabling more flexible and effective context acquisition for reasoning under limited context budgets. The agent operates in a ReAct-style loop, where it iteratively reasons over the current context, freely selects an action, and incorporates the resulting observation until sufficient evidence is collected, at which point it terminates with Finish. Detailed algorithms are provided in Appendix B.
3.2 Agentic Training Recipe
To our knowledge, existing efforts to enhance open-source LLMs for SWE-QA focus on SFT of agentic behaviors, without leveraging reinforcement learning to optimize repository-level exploration and reasoning (Rastogi et al., 2025). Inspired by recent advances in RL for LLMs, we propose a scalable training framework that explicitly trains agentic interaction with code repositories, leading to improved exploration and understanding.
Training Data Construction. Starting from the benchmark question construction pipeline, we deduplicate and obtain 1,464 raw questions, which are randomly split into 1,000 questions for SFT and 464 questions for RL. For the SFT stage, we use Claude Sonnet 4.5 to generate 1,000 high-quality multi-turn conversation trajectories conditioned on each question and our predefined agent action space (Search, View, and read-only CommandLine tools), resulting in tool-augmented supervision data. For the RL stage, we assign each question a high-quality reference answer generated by Claude Code, which serves as the ground truth for reward computation.
Two-Stage Training. Training proceeds in two stages.
In the first stage, we perform supervised fine-tuning on Qwen3-8B using 1K tool-invocation question-answer trajectories. This stage teaches the model the tool-call syntax and instills a basic understanding of tool semantics and usage patterns. In the second stage, we apply reinforcement learning to the SFT-initialized model. For each rollout, a reward model evaluates the final answer against the ground truth along five dimensions: correctness, completeness, relevance, clarity, and reasoning quality, following the same criteria used in evaluation. Since SWE-QA answers are often complex and cannot be reliably assessed by exact-match or rule-based rewards, we adopt an LLM-as-Judge reward formulation. To mitigate reward hacking, we employ a judge model distinct from the evaluation judge and assign higher weight to correctness while down-weighting clarity, discouraging fluent but incorrect answers. The final scalar reward is computed as:
s = RM(â, a*) ∈ [1, 10]^5,  (3)
r = w⊤s / 10,  w = (0.3, 0.2, 0.2, 0.1, 0.2),  (4)
where â is the generated answer, a* the ground-truth reference, and s denotes the scores for the five evaluation dimensions. We optimize the policy using the GRPO algorithm (Shao et al., 2024), where rewards are normalized within each rollout group before computing policy gradients. This stage encourages the model to converge toward rollouts that produce high-quality, fact-grounded final answers.
4 Experiments
4.1 Experimental Setup
Model selection. We evaluate 11 LLMs, including proprietary models (GPT-4o, GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Pro, DeepSeek-V3.2), open-source models (Qwen3-8B/32B, Devstral-Small-2-24B-Instruct (Rastogi et al., 2025), and LLaMA-3.3-70B-Instruct (Dubey et al., 2024)), and two variants of SWE-QA-Pro 8B trained with SFT and SFT+RL. All models are evaluated under both direct answering and agent-based reasoning using the SWE-QA-Pro Agent workflow.
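Returning to the reward defined in Section 3.2, Eqs. (3)-(4) plus GRPO's within-group reward normalization can be sketched as below. The `judge` callable stands in for the LLM-as-Judge reward model and its interface is an assumption; the weight vector and score range come from the paper.

```python
import numpy as np

# Dimension weights from Eq. (4): correctness, completeness, relevance,
# clarity, reasoning. Correctness is up-weighted; clarity is down-weighted
# to discourage fluent but incorrect answers.
W = np.array([0.3, 0.2, 0.2, 0.1, 0.2])

def scalar_reward(judge, answer, reference):
    """Eqs. (3)-(4): five judge scores in [1, 10] -> one weighted reward."""
    s = np.asarray(judge(answer, reference), dtype=float)  # s = RM(â, a*)
    assert s.shape == (5,) and (s >= 1).all() and (s <= 10).all()
    return float(W @ s) / 10.0                             # r = w^T s / 10

def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of rewards within one rollout group,
    applied before computing policy gradients."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Since the weights sum to 1 and each dimension tops out at 10, the reward lies in (0, 1], with a perfect answer scoring exactly 1.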
Inference and Training Setup. All inference uses temperature 0, a maximum of 25 turns, and a 32k context window on NVIDIA A100 80GB GPUs. SFT and RL are implemented using SWIFT (Zhao et al., 2025) and Verl-Tool (Jiang et al., 2025), respectively. Hyperparameters are provided in Appendix C.
Evaluation Metrics. We follow the LLM-as-Judge protocol of SWE-QA, including strict judge-candidate separation, anonymization, and randomized answer order. Compared to SWE-QA, we require explicit file-path and line-number references and use a stricter judge prompt to enable finer-grained score differentiation. Each answer is scored independently three times, and scores are averaged to reduce variance. Full prompts are provided in Appendix D.
4.2 Main Results
Table 2 summarizes the evaluation results across all LLMs: there is a substantial performance gap between the direct-answer setting and the agent-based workflow, particularly on correctness, completeness, and reasoning quality. This gap highlights the effectiveness of our difficulty calibration and underscores the critical role of the SWE-QA-Pro agent in enabling high-quality, repository-grounded reasoning.
Overall Performance. Among all evaluated models, Claude Sonnet 4.5 achieves the highest overall score, reflecting strong repository-level code understanding and effective tool use. Despite its smaller scale and limited training data, SWE-QA-Pro 8B outperforms many open-source baselines as well as GPT-4o (case study in Appendix H.2), and performs competitively with larger agentic models such as Devstral-Small-2-24B-Instruct. These results highlight that explicitly training agentic capabilities for repository-level QA can be more impactful than scaling model size alone.
Breakdown Results Analysis. Appendix G details model performance across repositories, semantic clusters, and question types.
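The scoring protocol just described (three independent judge runs per answer, anonymized candidates in randomized order, averaged scores) might be wired up as follows; the `judge` interface and the anonymization labels are assumptions for illustration.

```python
import random
import statistics

def judge_score(judge, question, answer, reference, runs=3, seed=0):
    """Score one answer `runs` times independently and average.
    Candidates are shuffled and relabeled each run to avoid the judge
    learning which response is the candidate (position bias)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        items = [("candidate", answer), ("reference", reference)]
        rng.shuffle(items)                       # randomized answer order
        anonymized = {f"Response {i + 1}": text  # hide candidate identity
                      for i, (_, text) in enumerate(items)}
        scores.append(judge(question, anonymized))
    return statistics.mean(scores)
```

Averaging over three runs reduces the variance of a single stochastic judge call; the strict judge-candidate model separation from the paper happens in how `judge` is instantiated, which is outside this sketch.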
First, analyzing question types (Table 11, Table 12) reveals that localization-oriented questions yield consistently high scores with low variance, as they focus on identifying specific files, identifiers, or execution points. Conversely, causal and explanatory questions are significantly more challenging, particularly those involving design rationale, trade-offs, or implicit dependencies, which require multi-file evidence integration and global semantic reasoning. Procedural questions fall in between: concrete implementation tasks are tractable, whereas system-level inquiries remain difficult due to their reliance on holistic understanding. Regarding repositories and semantic clusters in Appendix G.2 and G.3, configuration and workflow management areas involving dependency injection, CLI argument handling, and packaging prove consistently difficult. These clusters characterize configuration-driven repositories like jsonargparse, checkov, and yt-dlp, where answering requires reasoning over implicit control flow, cross-file propagation, and runtime behavior not localized to single files. In contrast, clusters with explicitly encoded, localized logic, such as Unicode/data parsing, protocol/API compatibility, filesystem config, and visualization, are easier. Consequently, repositories concentrating on these areas (docker-py, pint, mkdocs, seaborn) achieve higher, stable performance, benefiting from structurally explicit and locally grounded code.
Table 2: SWE-QA-Pro Bench evaluation results. "+ Agent" denotes models using the SWE-QA-Pro agent framework. In the original, best results per scale are shown in bold with second-best underlined; that emphasis was lost in extraction.

| Model | Correctness | Completeness | Relevance | Clarity | Reasoning | Overall |
|---|---|---|---|---|---|---|
| *Proprietary LLMs* | | | | | | |
| Gemini 2.5 Pro | 2.51 | 2.13 | 8.66 | 8.02 | 4.16 | 25.48 |
| Gemini 2.5 Pro + Agent | 7.12 | 6.25 | 8.91 | 9.34 | 7.84 | 39.46 |
| GPT-4.1 | 3.42 | 2.38 | 9.02 | 9.23 | 4.68 | 28.74 |
| GPT-4.1 + Agent | 6.86 | 5.90 | 8.89 | 9.13 | 7.68 | 38.47 |
| GPT-4o | 3.08 | 2.11 | 8.96 | 8.79 | 3.64 | 26.58 |
| GPT-4o + Agent | 5.59 | 4.49 | 8.55 | 8.16 | 6.29 | 33.08 |
| DeepSeek V3.2 | 3.19 | 2.32 | 8.83 | 8.83 | 4.39 | 27.55 |
| DeepSeek V3.2 + Agent | 6.94 | 6.49 | 8.78 | 8.72 | 7.76 | 38.69 |
| Claude Sonnet 4.5 | 3.34 | 2.74 | 8.65 | 8.12 | 4.84 | 27.69 |
| Claude Sonnet 4.5 + Agent | 7.34 | 7.36 | 8.88 | 9.03 | 8.06 | 40.67 |
| *Open-Source LLMs* | | | | | | |
| Qwen3-8B | 2.84 | 2.16 | 8.59 | 8.66 | 4.36 | 26.61 |
| Qwen3-8B + Agent | 4.52 | 3.77 | 8.29 | 7.83 | 5.62 | 30.03 |
| Qwen3-32B | 3.04 | 2.41 | 8.71 | 8.74 | 5.02 | 27.91 |
| Qwen3-32B + Agent | 4.99 | 4.21 | 8.50 | 8.16 | 6.22 | 32.08 |
| Llama-3.3-70B-Instruct | 2.34 | 1.75 | 8.68 | 8.47 | 3.08 | 24.32 |
| Llama-3.3-70B-Instruct + Agent | 2.84 | 2.11 | 8.09 | 7.18 | 3.51 | 23.73 |
| Devstral-Small-2-24B-Instruct | 2.65 | 2.14 | 8.57 | 8.31 | 3.77 | 25.44 |
| Devstral-Small-2-24B-Instruct + Agent | 6.61 | 5.66 | 8.81 | 9.09 | 7.13 | 37.30 |
| *Finetuned LLMs* | | | | | | |
| SWE-QA-Pro-8B (SFT) | 2.56 | 2.01 | 8.37 | 7.93 | 3.55 | 24.42 |
| SWE-QA-Pro-8B (SFT) + Agent | 5.66 | 5.45 | 8.40 | 8.21 | 6.61 | 34.34 |
| SWE-QA-Pro-8B (SFT+RL) | 2.54 | 2.04 | 8.28 | 7.92 | 3.55 | 24.34 |
| SWE-QA-Pro-8B (SFT+RL) + Agent | 5.96 | 5.66 | 8.51 | 8.44 | 6.83 | 35.39 |

4.3 Tool Usage Analysis
Figure 4 correlates tool proficiency with repository-level QA performance. Models with lower scores, such as LLaMA-3.3-70B-Instruct and GPT-4o, suffer from weak tool usage that limits context retrieval and global understanding. Conversely, Claude Sonnet 4.5 excels by leveraging the highest volume of tool calls, translating robust exploration into superior answer quality. Gemini 2.5 Pro, however, remains competitive with fewer calls, indicating that internal reasoning enables efficient, selective tool use; this underscores that reasoning is vital alongside tool capacity.
Additionally, post-RL improvements in SWE-QA-Pro 8B demonstrate that RL fosters effective, judicious execution rather than merely inflating tool-call frequency.
[Figure 4: Tool usage behavior across models on SWE-QA-Pro.]
[Figure 5: Effect of Training Strategy. We compare SFT at different scales and a two-stage SFT→RLAIF setting.]
4.4 Training Strategy Analysis
Figure 5 compares different training strategies under the same model backbone. SFT-1000 and SFT-1464 denote supervised fine-tuning on 1,000 and 1,464 tool-call trajectories, respectively, while SFT-1000 + RL-464 represents a two-stage setting that initializes the model with 1,000 SFT trajectories and then applies reinforcement learning on an additional 464 QA pairs. Increasing the amount of supervised fine-tuning data from 1,000 to 1,464 trajectories yields consistent but modest improvements across most evaluation dimensions. In contrast, introducing reinforcement learning after SFT leads to a more pronounced performance gain. In particular, SFT-1000 + RL-464 achieves substantially higher scores in both Correctness and Completeness than SFT-only variants, including SFT-1464. This indicates that reinforcement learning does not simply replicate the effect of scaling supervised data, but instead introduces a qualitatively different optimization signal that further unlocks the potential of the SFT-initialized model, encouraging more accurate and more comprehensive answers. Overall, these results suggest that RL provides complementary supervision beyond SFT, especially effective at refining factual precision and answer coverage.
5 Related Work
Code-centric and Repository-level QA Benchmarks. Existing code and repository QA benchmarks focus on localized or context-limited settings, where questions can be answered from snippets, APIs, or documentation. Representative
datasets such as CodeQueries, CS1QA, CoSQA, CoSQA+, and CodeSearchNet emphasize element-level reasoning or retrieval over individual functions, deliberately avoiding cross-file dependencies and repository structure (Sahu et al., 2024; Lee et al., 2022; Huang et al., 2021; Gong et al., 2024; Husain et al., 2019). InfiBench extends evaluation to free-form coding-related questions across languages, but remains knowledge- and snippet-centric rather than repository-grounded (Li et al., 2024). More recent efforts move toward repository-scale evaluation. LongCodeBench relies on long context windows to ingest large codebases (Rando et al., 2025), while RepoChat uses offline indexing for structured retrieval (Abedu et al., 2025). SWE-QA formulates repository understanding as a QA task (Peng et al., 2025), but does not explicitly separate cases solvable by standard retrieval. In contrast, SWE-QA-Pro targets long-tail, executable repositories and filters out retrieval-solvable queries, isolating scenarios that require interactive code exploration.
Repository-level Agents. Agents in software engineering largely target generative tasks, including issue resolution, program repair, and code generation (Jimenez et al., 2023; Yang et al., 2024; Zhang et al., 2024; Da et al., 2025; Li et al., 2025; Bi et al., 2024; Bairi et al., 2024). In these domains, exploration is implicitly shaped by generation objectives rather than comprehension. Conversely, repository-level QA demands strict code navigation and understanding. Prior approaches, such as SWE-QA-Agent, rely on inference-time heuristics for tool use and often underperform retrieval-augmented generation (RAG) baselines due to unoptimized navigation (Peng et al., 2025). We address this limitation by explicitly training repository exploration policies, bridging the gap between passive retrieval and active agentic navigation.
6 Conclusion
In this work, we address the challenge of evaluating and training Large Language Models for repository-level code understanding, where reliance on memorized knowledge often masks deficits in genuine exploration capabilities. By introducing SWE-QA-Pro, we establish a rigorous testbed that enforces semantic diversity through long-tail repositories and systematically filters out questions solvable by direct answering. The substantial performance gap observed between direct and agentic baselines on this benchmark confirms that our design successfully isolates tasks requiring authentic codebase navigation and evidence grounding. Beyond evaluation, we show that agentic capabilities can be learned using a scalable framework. We propose a two-stage SFT→RLAIF training recipe enabled by a synthetic data pipeline. A Qwen3-8B model trained with this recipe surpasses GPT-4o on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models. We hope SWE-QA-Pro catalyzes future research toward active, grounded repository reasoning.
Limitations
The objective of SWE-QA-Pro is to provide a challenging benchmark and practical training recipe, yet we identify three limitations. First, despite employing semantic embedding clustering to maximize topical coverage, the benchmark is constrained to 260 questions from 26 repositories due to the high cost of expert human verification; this scale may not fully capture the extreme long-tail diversity of the software ecosystem, and the embedding models used for clustering could inadvertently introduce latent biases into the taxonomy. Second, the benchmark is currently restricted to the Python ecosystem due to the strict requirement for executable sandboxes to verify agent actions, though our data synthesis and agentic training pipeline is inherently language-agnostic and can be readily extended to other languages with compatible runtime environments.
Third, our RLAIF training objectives share a similar distribution with the evaluation metrics, as both rely on LLM-as-a-Judge frameworks, creating a potential risk of reward hacking where models optimize for judge preferences rather than objective correctness. While preliminary case studies did not reveal significant gaming behaviors, this proximity suggests that future research should prioritize more robust training methodologies, such as process supervision or diverse reward modeling, to mitigate such alignment risks.

References

Samuel Abedu, Laurine Menneron, SayedHassan Khatoonabadi, and Emad Shihab. 2025. RepoChat: An LLM-powered chatbot for GitHub repository question-answering. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), pages 255–259. IEEE.

Anthropic. 2025. Claude Sonnet 4.5 system card. https://www.anthropic.com/news/claude-sonnet-4-5. Accessed: 2025-10-27.

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. 2025. SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411.

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-level coding using LLMs and planning. Proceedings of the ACM on Software Engineering, 1(FSE):675–698.

Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, and Xuanhua Shi. 2024. Iterative refinement of project-level code context for precise code generation with compiler feedback. arXiv preprint arXiv:2403.16792.

Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, and Shuiguang Deng. 2025.
CoReQA: Uncovering potentials of language models in code repository question answering. arXiv preprint arXiv:2501.03447.

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, and Sean Hendryx. 2025. Agent-RLVR: Training software engineering agents via guidance and environment rewards. arXiv preprint arXiv:2506.11425.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Jing Gong, Yanghui Wu, Linxi Liang, Yanlin Wang, Jiachi Chen, Mingwei Liu, and Zibin Zheng. 2024. CoSQA+: Pioneering the multi-choice code search benchmark with test-driven agents. arXiv preprint arXiv:2406.11589.

Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, and Cuiyun Gao. 2024. CodeRepoQA: A large-scale benchmark for software engineering question answering. arXiv preprint arXiv:2412.14764.

Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20,000+ web queries for code search and question answering. arXiv preprint arXiv:2105.13239.

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.
Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, and 1 others. 2025. VerlTool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.

Changyoon Lee, Yeon Seonwoo, and Alice Oh. 2022. CS1QA: A dataset for assisting code-based question answering in an introductory programming course. arXiv preprint arXiv:2210.14494.

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. 2023. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv e-prints.

Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. 2025. SWE-Debate: Competitive multi-agent debate for software issue resolution. arXiv preprint arXiv:2507.23348.

Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024. InfiBench: Evaluating the question-answering capabilities of code large language models. Advances in Neural Information Processing Systems, 37:128668–128698.

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

Chenxiao Liu and Xiaojun Wan. 2021. CodeQA: A question answering dataset for source code comprehension. arXiv preprint arXiv:2109.08365.

OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/. Accessed: 2026-01-06.

Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. 2025.
SWE-QA: Can language models answer repository-level code questions? arXiv preprint arXiv:2509.14635.

Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating coding LLMs at 1M context windows. arXiv preprint arXiv:2505.07897.

Abhinav Rastogi, Adam Yang, Albert Q Jiang, Alexander H Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol Agarwal, Andy Ehrenberg, Andy Lo, and 1 others. 2025. Devstral: Fine-tuning language models for coding agent applications. arXiv preprint arXiv:2509.25193.

Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, and Shirish Shevade. 2024. CodeQueries: A dataset of semantic queries over code. In Proceedings of the 17th Innovations in Software Engineering Conference, pages 1–11.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652.

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604.

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, and 1 others. 2025.
SWIFT: A scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735.

Table of Contents in Appendix

A Cluster Coverage of SWE-QA Bench
B SWE-QA-Pro Agent Algorithm
C Training Hyperparameters
D Prompts
E Cluster and QA Type Taxonomy
F Statistics of SWE-QA-Pro
G Breakdown Results in SWE-QA-Pro
  G.1 Breakdown Results By QA Type
  G.2 Breakdown Results By Repository Name
  G.3 Breakdown Results By Cluster
H Case Study
  H.1 Cases of Model Comparison
  H.2 Case Studies of Human and Model Reference Answers
I Ethics and Reproducibility Statements
  I.1 Potential Risks
  I.2 Discuss the License for Artifacts
  I.3 Artifact Use Consistent With Intended Use
  I.4 Data Contains Personally Identifying Info or Offensive Content
  I.5 Documentation of Artifacts
  I.6 Parameters for Packages
  I.7 Data Consent
  I.8 AI Assistants in Research or Writing
A Cluster Coverage of SWE-QA Bench

| Cluster ID | Cluster Name | Original Ratio (%) | Test Ratio (%) | Coverage (×) | Origin Count | Test Count |
|---|---|---|---|---|---|---|
| 0.0 | Unicode / formatting / parsing / data validation | 2.04 | 0.00 | 0.00 | 17,185 | 0 |
| 0.1 | SQL / structured grammar / templating engine | 2.67 | 1.89 | 0.71 | 22,489 | 10 |
| 0.2 | CLI args / syntax / regex / command completion | 2.08 | 3.22 | 1.55 | 17,518 | 17 |
| 0.3 | file import/export / metadata / sorting / recurrence | 1.89 | 0.19 | 0.10 | 15,940 | 1 |
| 1.0 | data formats / FITS / Astropy / units / WCS | 0.93 | 8.14 | 8.78 | 7,825 | 43 |
| 1.1 | pandas / parquet / datetime / Dask / table schema | 1.77 | 0.57 | 0.32 | 14,917 | 3 |
| 1.2 | numpy / dask arrays / dtype / array serialization / parallel | 1.96 | 8.33 | 4.24 | 16,571 | 44 |
| 1.3 | coordinate transform / image processing / IO / numeric precision | 3.99 | 0.00 | 0.00 | 33,660 | 0 |
| 2.0 | protocol serialization / encoding / headers / API compatibility | 1.25 | 0.00 | 0.00 | 10,573 | 0 |
| 2.1 | dependency injection / config / attrs / API design | 2.27 | 7.20 | 3.17 | 19,168 | 38 |
| 2.2 | dataclass / schema validation / enums / OpenAPI | 1.68 | 0.76 | 0.45 | 14,190 | 4 |
| 2.3 | type and attribute errors / initialization / CLI workflow | 2.00 | 0.19 | 0.09 | 16,837 | 1 |
| 2.4 | type hints / mypy / typing system / code generation | 1.71 | 1.52 | 0.89 | 14,436 | 8 |
| 3.0 | Python version / imports / deprecation / conflicts | 1.93 | 0.00 | 0.00 | 16,300 | 0 |
| 3.1 | install / virtual env / OS / hardware requirements / cloud deploy | 2.14 | 0.00 | 0.00 | 18,035 | 0 |
| 3.2 | artifacts / distribution format / repository management / post-install state | 2.09 | 0.57 | 0.27 | 17,677 | 3 |
| 3.3 | extensibility / configuration framework / plugin architecture | 2.96 | 0.00 | 0.00 | 24,998 | 0 |
| 4.0 | version control / Docker / build cache | 1.81 | 0.00 | 0.00 | 15,274 | 0 |
| 4.1 | release management / changelog / license / community | 1.98 | 0.00 | 0.00 | 16,692 | 0 |
| 4.2 | documentation / MkDocs / user tutorials | 3.11 | 5.30 | 1.70 | 26,274 | 28 |
| 4.3 | async refactor / migration / logging infra / i18n | 2.37 | 0.38 | 0.16 | 19,972 | 2 |
| 4.4 | CI pipelines / coverage / lint / GitHub Actions / security checks | 1.30 | 0.19 | 0.15 | 11,011 | 1 |

Table 3: Cluster Coverage (0.0–4.4): SWE-Rebench (Original) vs. SWE-QA (Test).
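For clarity on how to read the coverage table: the Coverage (×) column is simply the ratio of a cluster's share of test questions to its share of the original issue pool, so values above 1 mean the cluster is over-sampled in the benchmark. A quick sketch (row values copied from cluster 0.2, "CLI args / syntax / regex / command completion"):

```python
# Coverage(x) for a cluster = Test Ratio / Original Ratio.
# Values below are copied from the table for cluster 0.2.
original_ratio = 2.08   # % of SWE-Rebench issues in this cluster
test_ratio = 3.22       # % of SWE-QA test questions in this cluster

coverage = test_ratio / original_ratio
print(round(coverage, 2))  # 1.55, matching the Coverage (x) column
```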
[Back to Appendix Contents]

| Cluster ID | Cluster Name | Original Ratio (%) | Test Ratio (%) | Coverage (×) | Origin Count | Test Count |
|---|---|---|---|---|---|---|
| 5.0 | asyncio / async context / resource cleanup | 2.11 | 0.19 | 0.09 | 17,811 | 1 |
| 5.1 | multiprocessing / advanced runtime / concurrency / heterogeneous compute | 1.07 | 0.00 | 0.00 | 9,034 | 0 |
| 5.2 | runtime error handling / DB transactions / retry / logging system | 2.48 | 0.76 | 0.31 | 20,891 | 4 |
| 5.3 | threading / execution limits / scheduling / memory / timeout | 2.17 | 0.76 | 0.35 | 18,278 | 4 |
| 5.4 | connection lifecycle / protocol handling / low-level failures | 2.08 | 0.19 | 0.09 | 17,569 | 1 |
| 5.5 | parallel execution / distributed frameworks / task graphs | 1.72 | 0.00 | 0.00 | 14,511 | 0 |
| 6.0 | file paths / filesystem permissions / symlinks / env config / cache system | 2.13 | 0.76 | 0.36 | 17,982 | 4 |
| 6.1 | unit testing / mocking / test automation | 2.05 | 8.90 | 4.34 | 17,323 | 47 |
| 6.2 | build pipeline / doc building / Sphinx / cloud provisioning | 3.30 | 0.00 | 0.00 | 27,858 | 0 |
| 6.3 | compiler toolchain / cross-compile / env vars / code quality analysis | 1.42 | 0.00 | 0.00 | 11,984 | 0 |
| 7.0 | API integration / sync / performance / DB / SDK | 2.62 | 0.95 | 0.36 | 22,067 | 5 |
| 7.1 | media download / playlist / metadata / client-side proxy config | 1.13 | 0.00 | 0.00 | 9,532 | 0 |
| 7.2 | auth systems / deployment / extension plugins / cloud services | 2.21 | 2.46 | 1.11 | 18,668 | 13 |
| 7.3 | AWS / Azure / K8s / container security / IAM policy | 2.20 | 0.00 | 0.00 | 18,531 | 0 |
| 7.4 | reverse proxy / URL routing / websocket / CDN / streaming | 1.42 | 15.34 | 10.79 | 12,000 | 81 |
| 7.5 | OAuth / JWT / SSL / access control / user sessions / token lifecycle | 4.00 | 1.33 | 0.33 | 33,762 | 7 |
| 8.0 | tensors / training / GPU / ML experiment logging / tuning | 2.20 | 2.08 | 0.95 | 18,525 | 11 |
| 8.1 | ML analytical visualization / Fourier / ML animation / calibration | 2.76 | 1.33 | 0.48 | 23,278 | 7 |
| 8.2 | time series / feature engineering / explainability methods / behavioral analysis / computational semantics | 0.90 | 2.46 | 2.73 | 7,606 | 13 |
| 8.3 | data parallel / compression / ML plugin / indexing | 2.39 | 2.65 | 1.11 | 20,182 | 14 |
| 8.4 | bayesian models / MCMC / statistics / reproducibility | 1.67 | 1.14 | 0.68 | 14,066 | 6 |
| 8.5 | ML APIs / decorators / metrics / optimization strategies | 2.15 | 10.61 | 4.94 | 18,116 | 56 |
| 9.0 | UI layout / CSS / markdown / table extraction / frontend security | 2.30 | 0.38 | 0.16 | 19,450 | 2 |
| 9.1 | plotting systems / widgets / maps / UI animation / usability | 1.78 | 4.92 | 2.77 | 14,995 | 26 |
| 9.2 | runtime UI config / UI permission management / upload handling / customization / user-facing runtime extensibility | 2.09 | 0.19 | 0.09 | 17,643 | 1 |
| 9.3 | 3D rendering / legends / color mapping / visualization formatting | 1.73 | 4.17 | 2.41 | 14,615 | 22 |

Table 4: Cluster Coverage (5.0–9.3): SWE-Rebench (Original) vs. SWE-QA (Test). [Back to Appendix Contents]

B SWE-QA-Pro Agent Algorithm

```
Algorithm 1: SWE-QA-Pro Agent
Require: user query Q, repository R
Ensure:  final answer A
 1: /* Phase 1: Initialization */
 2: context ← [ ]
 3: thought ← Analyze(Q)
 4: context ← context ∪ SemanticSearch(Q, R)
 5: /* Phase 2: Iterative ReAct Loop */
 6: max_iterations ← N
 7: for i ← 1 to max_iterations do
 8:     thought ← Reason(context, Q)
 9:     action ← SelectAction(thought)
10:     if action = SemanticSearch then
11:         output ← Execute(SemanticSearch)
12:     else if action = ViewCodebase then
13:         output ← Execute(ViewCodebase)
14:     else if action = ExecuteCommand then
15:         output ← Execute(ExecuteCommand)
16:     end if
17:     context ← context ∪ output
18:     if SufficientEvidence(context, Q) or i = max_iterations then
19:         break
20:     end if
21: end for
22: /* Phase 3: Finalization */
23: A ← Synthesize(context, Q)
24: return A
```

C Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Precision | bfloat16 |
| Max sequence length | 32,768 |
| Optimizer | AdamW |
| Learning rate | 5 × 10⁻⁶ |
| Weight decay | 0.05 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size (per device) | 1 |
| Gradient accumulation | 2 |
| Epochs | 4 |
| Agent template | Hermes |

Table 5: Hyperparameters for SFT of SWE-QA-Pro 8B.
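The agent loop of Algorithm 1 can be sketched in plain Python. This is only an illustrative, self-contained toy: every helper here (the string-matching `semantic_search`, the evidence threshold) is a hypothetical stand-in for the paper's actual LLM-driven reasoning and its three read-only tools, not the real implementation.

```python
# Illustrative sketch of Algorithm 1 (SWE-QA-Pro agent loop).
# All helpers are toy stand-ins: the real agent chooses among
# semantic_search / view_codebase / execute_command via LLM reasoning.

def semantic_search(query, repo_files):
    # Toy stand-in: return files whose name shares a token with the query.
    tokens = query.lower().split()
    return [f for f in repo_files if any(tok in f for tok in tokens)]

def run_agent(query, repo_files, max_iterations=25):
    # Phase 1: initialization -- seed the context with a semantic search.
    context = list(semantic_search(query, repo_files))

    # Phase 2: iterative ReAct loop (reason -> act -> observe).
    for i in range(1, max_iterations + 1):
        # A real agent would reason over the context and pick a tool here;
        # the toy version just gathers more search results.
        context += semantic_search(query, repo_files)
        # "SufficientEvidence" stand-in: stop once enough context is gathered.
        if len(context) >= 3 or i == max_iterations:
            break

    # Phase 3: finalization -- synthesize an answer from gathered evidence.
    return sorted(set(context))

files = ["auth.py", "routing.py", "cache.py"]
print(run_agent("how does routing work", files))  # ['routing.py']
```

The key design point the sketch preserves is that the loop is bounded (25 turns in the paper's setup) and terminates early once the gathered evidence suffices, rather than exhausting the budget on every query.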
[Back to Appendix Contents]

| Hyperparameter | Value |
|---|---|
| Max turns | 25 |
| Max prompt length | 2,048 |
| Max response length | 8,192 |
| Max observation length | 28,000 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Number of rollouts (n) | 8 |
| KL loss coefficient | 0.02 |
| KL loss type | Low-variance KL |
| Entropy coefficient | 0 |
| Actor learning rate | 1 × 10⁻⁶ |
| Batch size | 8 |
| PPO mini-batch size | 8 |
| Strategy | FSDP |
| Max model length | 32,768 |

Table 6: Hyperparameters for RL of SWE-QA-Pro 8B. [Back to Appendix Contents]

D Prompts

Prompt Template for LLM-as-Judge (Model: GPT-5)

You are a professional evaluator. Please rate the candidate answer against the reference answer based on five criteria.

Evaluation Criteria and Scoring Guidelines (each scored 1 to 10):

1. Correctness:
10 — Completely correct; core points and details are accurate with no ambiguity.
8-9 — Mostly correct; only minor details are slightly inaccurate or loosely expressed.
6-7 — Partially correct; some errors or omissions, but main points are generally accurate.
4-5 — Several errors or ambiguities that affect understanding of the core information.
2-3 — Many errors; misleading or fails to convey key information.
1 — Serious errors; completely wrong or misleading.

2. Completeness:
10 — Covers all key points from the reference answer without omission.
8-9 — Covers most key points; only minor non-critical information missing.
6-7 — Missing several key points; content is somewhat incomplete.
4-5 — Important information largely missing; content is one-sided.
2-3 — Covers very little relevant information; seriously incomplete.
1 — Covers almost no relevant information; completely incomplete.

3. Relevance:
10 — Content fully focused on the question topic; no irrelevant information.
8-9 — Mostly focused; only minor irrelevant or peripheral information.
6-7 — Generally on topic; some off-topic content but still relevant overall.
4-5 — Topic not sufficiently focused; contains considerable off-topic content.
2-3 — Content deviates from topic; includes excessive irrelevant information.
1 — Majority of content irrelevant to the question.

4. Clarity:
10 — Fluent language; clear and precise expression; very easy to understand.
8-9 — Mostly fluent; clear expression with minor unclear points.
6-7 — Generally clear; some expressions slightly unclear or not concise.
4-5 — Expression somewhat awkward; some ambiguity or lack of fluency.
2-3 — Language obscure; sentences are not smooth; hinders understanding.
1 — Expression confusing; very difficult to understand.

5. Reasoning:
10 — Reasoning is clear, logical, and well-structured; argumentation is excellent.
8-9 — Reasoning is clear and logical; well-structured with solid argumentation.
6-7 — Reasoning generally reasonable; mostly clear logic; minor jumps.
4-5 — Reasoning is average; some logical jumps or organization issues.
2-3 — Reasoning unclear; lacks logical order; difficult to follow.
1 — No clear reasoning; logic is chaotic.

INPUT:
Question: question
Reference Answer: reference
Candidate Answer: candidate

OUTPUT:
Please output ONLY a JSON object with 5 integer fields in the range [1,10], corresponding to the evaluation scores:
"correctness": <1-10>, "completeness": <1-10>, "relevance": <1-10>, "clarity": <1-10>, "reasoning": <1-10>

REQUIREMENT: You should assume that a score of 5 represents an average but imperfect answer. Scores above 7 should be reserved for answers that are clearly strong. Do not infer or assume missing information. Score strictly based on what is explicitly stated. No explanation, no extra text, no formatting other than valid JSON.

Prompt Template for Generating Answer (Model: All Evaluated Models)

System Prompt: You are a codebase analysis agent operating in a strictly read-only environment. Your task is to answer SWE-related questions by analyzing source code, configuration, documentation, and tests. You must prioritize correctness, completeness, clarity, relevance and evidence-based reasoning when answering given questions within 25 max turns.
PROCESS PROTOCOL (MANDATORY)
For every question, you MUST follow this process:
1. Planning
Before calling any tools, you MUST output a short planning explanation at each turn.
* Explain step by step what you have found so far from the current context, and what you will inspect next and why.
* This reasoning MUST be explicit and visible.
2. Investigation
* Call one or more read-only tools to gather evidence.
* Multiple tool calls in one turn are allowed.
3. Synthesis
* Combine evidence across multiple files or components.
* Do NOT rely on a single file unless clearly justified.
4. Finalization
* Produce a final answer following the OUTPUT PROTOCOL.

TOOL USAGE RULES
Available tools:
* semantic_search: find relevant files, symbols, or modules.
* viewcodebase: inspect structure or specific file sections. Prefer `concise=True` first; use `viewrange` when needed. Prefer using viewcodebase; avoid using ls -l or ls -R whenever possible. Don't use tree without -L.
* executereadonlycommand: small, focused inspection tasks that require raw command output (avoid command-line operations that produce excessive and uncontrollable output; don't use ls -R path or ls -lR path as a command).

You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags:
<tools> tools </tools>

OUTPUT PROTOCOL (STRICT)
You MUST follow this output structure at each assistant turn:
1. Reasoning
* Before any tool call, output only your step-by-step planning explanation.
2. Final Answer
* Output **exactly one** block in this format without any tool calls:
<finish> Final answer's content </finish>
Rules for the <finish> block:
* Must appear exactly once.
* Must contain only the final answer's content.
* NO code blocks or copied code such as "```python ...".
* Cite evidence only using file paths relative to repo_path, in the format <relative_path>: line <start>-<end> (do not use absolute paths), e.g. responses/init.py: line 1-10.
Any violation of this protocol makes the answer invalid. The working directory (where the code is executed) is /data/songcheng/SWE-QA-Pro-dev/eval. Now the code repo is at repopath. Please use absolute paths in all tools.

User Prompt:
Repository Path: repopath
Question: question
Instructions:
- Please analyze the codebase to answer this question.
- Provide a step-by-step explanation before calling any tools.
- Follow this workflow: 1) Inspect the repository structure 2) Search for relevant files and symbols 3) Examine specific implementations 4) Cross-validate your findings 5) Provide a complete answer with evidence inside a <finish> block

Prompt Template for Generating Query Candidates (Model: Claude Code)

You are a repository-aware planning agent. Your job is NOT to modify code, but to:
(1) lightly explore the local repository,
(2) generate ONE high-quality developer Query tailored to this repo,
(3) classify it with: cluster, task_type, clarity (0–5), context (0–5), difficulty (0–6),
(4) provide evidence (paths, line ranges for each path, signals) and a concise rationale summary,
(5) produce a detailed NEXT_STEPS plan that explains how to solve the generated Query using your general software knowledge and the repo evidence. Each step must reference specific file paths and line ranges from the evidence.
Keep internal reasoning private. Output must follow the strict JSON schema at the end.
[SUGGESTED — OPTIONAL, USER-CUSTOMIZABLE]
- Goal bias: prefer qa_verifiable tasks that ask for structural, architectural, or algorithmic explanations with standard answers
- Risk preference: avoid trivial or opinion-based questions; push toward mid/high-level system questions (e.g., how modules interact, why a design was chosen)
- Domain preference: explanations tied to repo internals, standardized APIs, algorithmic design, or widely accepted conventions
- Complexity target: cover difficulty 2–5 (multi-step reasoning about design/structure, not trivial lookups)
- Output style: queries must include a ground_truth_answer field that captures a definitive, evidence-based explanation
[/SUGGESTED]

BUCKET / CLUSTER INDEX (L1 → L2)
Choose the tightest cluster (prefer L2; if uncertain, use L1). The "cluster" field will include [id, name].
cluster_taxonomy

QA_VERIFIABLE TAXONOMY
qa_type_taxonomy

LABELING DEFINITIONS
Clarity (0-5): 0 extremely vague; 1 very vague; 2 vague; 3 workable with small additions; 4 clear (acceptance feasible); 5 very clear (explicit acceptance/tests).
Context (0-5): 0 no repo/env needed; 1 light reference; 2 local file/API awareness; 3 multi-file/module; 4 system-level; 5 deep env/data/service coupling.
Difficulty (0-6): 0 trivial QA; 1 simple single-point; 2 routine; 3 moderately complex (multi-step or multi-file); 4 advanced (design/concurrency/test-heavy); 5 high complexity (tradeoffs, cross-domain, higher risk); 6 extreme system-level with high uncertainty.

WORKFLOW STEPS
1) Light repo scan: identify system-level modules, architecture diagrams, abstract base classes, registries, or pipelines.
2) Extract the repository info from the input and output it in the format: "repo": ["owner/repo_name"].
3) Map to a cluster aligned with conceptual/system knowledge (algorithms, coordinate systems, unit registries, API design).
4) Generate the developer Query:
- must be a factual "what/how/why" question about repo structures or algorithms, not trivial docstring repeats;
- the answer must be checkable from code, docs, or API standards.
5) Score clarity, context, and difficulty based on how well-defined and system-level the question is.
6) Evidence: point to modules, classes, or specs that define the authoritative structure.
7) NEXT_STEPS: describe how to cross-check the ground truth with the codebase or documentation.

QUALITY RULES
- All queries must require an evidence-backed, canonical answer (e.g., architecture, pipeline design, algorithm complexity).
- The queries should encourage advanced analysis, integration of multiple concepts, or insight beyond surface-level information.
- All queries must contain exactly one of the following words: "What", "Why", "Where", or "How". The query must not contain more than one of these words or multiple sub-questions. For example, 'What is the architecture of chartpress's configuration system and how does it coordinate between chartpress.yaml parsing, image building workflows, and values.yaml modification?' is invalid.
- The evidence must explicitly include the relevant line numbers or line ranges for each repo_path.
- Must include a "ground_truth_answer" string in the JSON output, summarizing the verified explanation.
- Reject trivial "what is the type of X" questions unless they connect to a bigger design concept.
- Prioritize non-trivial, yet verifiable knowledge that reflects the repo's system design or standards compliance.

USER PREFERENCE
We HIGHLY suggest you prioritize selecting problems from Cluster swe_issue_qa_1_0, since issues in this repo often fall into this domain. Since the clusters were generated through unsupervised clustering and the labels were assigned based on random sampling and manual annotation within each cluster, there may be inherent bias.
If the assigned cluster label conflicts with the reference issues, always treat the reference issues as the source of truth. To help you understand the typical patterns, here are some example issues from this repo. These are provided only as *reference context* to inform your reasoning. - If an issue has already been fixed, do not reuse it. - If an issue is still relevant, you may paraphrase it into a fresh query. - Ideally, you should write your own problem statement, using the examples only as background knowledge. It is acceptable to generate a problem outside the recommended cluster if necessary — the examples are guidance, not a restriction. Repo name: repo_name Reference issues: reference_issues 19 Prompt Template for Generating Reference Answer Model: Claude Code You are a repository-aware QA answer agent. Your job is NOT to modify code, but to: (1) Lightly explore the local repository at the given commit using available tools. (2) Understand and answer the given generated_query based on the actual codebase. (3) Produce a high-quality, evidence-backed gold-standard answer (refined_ground_truth) that satisfies the five dimensions: correctness, completeness, relevance, clarity, and reasoning quality. (4) Optionally use reference_answer only as a weak hint, never as ground truth. ## Inputs You Will Receive * ‘repo_name‘ * ‘commit_id‘ * ‘generated_query‘ (the question you must answer) * ‘reference_answer‘ (may be partially correct, incomplete, or wrong) ## Your Objectives You must produce a final answer that is: * Fully verified against the repository’s source code. * Structurally complete, covering all parts required by the query. * Clear and technically correct, written for developers unfamiliar with the repo. * Evidence-based, with file paths and line ranges supporting your claims. * Free from speculation. ## Suggestions The following sections guide your behavior during exploration and answer construction. ### 1. 
How to Use ‘reference_answer‘ * Treat ‘reference_answer‘ as an **optional and unreliable hint**. * It may point to relevant files or concepts, but you must verify everything independently using the actual code. * You must not summarize, lightly edit, or trust the reference answer. * If there is any contradiction between code and reference_answer, follow the code. * The correct mental model is ’reference_answer is a hypothesis; the repository is the truth.’ ### 2. Exploration & Evidence Collection You must actively explore the repo: * Navigate to relevant modules, subpackages, core classes, registries, and any architecture files. * Inspect the implementations, comments, and interfaces relevant to the query. * Track everything you depend on in ‘evidence.repo_paths‘, using exact format: ‘"path/to/file.py: line X-Y"‘ * Collect signals, which are short text markers such as: class names, method names, configuration patterns, key comments and helper functions * Every important claim in your ‘refined_ground_truth‘ must be traceable to your collected evidence. ### 3. Answer Style and Constraints Your ‘refined_ground_truth‘ must obey: * No direct code quotations. * You may name classes/functions/variables but do not copy their bodies. * Format the answer as coherent paragraphs, not bullet points. * The answer must be: concise but complete, technically precise, Clear for developers and Grounded in the repository. * Every major claim must be supported by file paths: line ranges you listed in ‘evidence‘. ## Required Output JSON Format You must output **only** this JSON object: JSON_EXAMPLE ## Working Procedure (Mental Checklist) 1. Read the ‘generated_query‘ and identify scope (architecture? registration? flow? algorithm?). 2. Lightly scan the repo structure to locate relevant modules. 3. Open related files and gather evidence. 4. Build an internal understanding of the underlying architecture or behavior. 5. 
Compare your understanding with ‘reference_answer‘: keep what matches the repo, correct what is wrong and add missing key pieces. 6. Write a clean, well-organized gold answer in the ‘refined_ground_truth‘. 7. Fill in ‘evidence‘, ‘rationale_summary‘, and ‘next_steps‘. 8. Output the JSON object. ## Input Repo Name: repo_name Commit ID: commit_id Generated Query: generated_query Reference Answer: reference_answer 20 E Cluster and QA Type Taxonomy ClusterSubclusterDescription Input / Parsing / Data Conversion Unicode / formatting / parsing / data validation Character encoding, string normalization, and file for- mat sanitization; focuses on correctness of raw text and low-level input structure. SQL / structured grammar / templat- ing engine SQL syntax, AST gram- mars, and templating sys- tems; covers grammar rules and templated string genera- tion. CLI args / syntax / regex / command completion Argumentparsing, Bash/Zshcompletion, and regex issues; applies whenuserinputmust beparsedormatched interactively. File import/export / metadata / sort- ing / recurrence Fileloading,metadata extraction, sorting logic, and recurrence handling for structured data transfer. Data / Array / Image / Coordinate Data formats / FITS / Astropy / units / WCS Scientific data formats and astronomy-specificcoor- dinate systems, including units and world coordinate systems. Pandas / parquet / datetime / Dask / table schema Tabular data manipulation, schema handling, and time- indexed datasets. NumPy / Dask arrays / dtype / seri- alization / parallel Numerical array operations, data types, chunking strate- gies, and array serialization. Coordinate transform / image pro- cessing / IO / precision Geometric transformations, image IO, and precision- sensitive numerical process- ing. Schema / Types / Validation / Static Analysis Protocol serialization / encoding / headers / API compatibility Structured message formats and wire-level compatibility for APIs and protocols. 
| Schema / Types / Validation / Static Analysis | Dependency injection / config / attrs / API design | Software design patterns controlling configuration, object construction, and API structure. |
| Schema / Types / Validation / Static Analysis | Dataclass / schema validation / enums / OpenAPI | Structured field validation, enum constraints, and OpenAPI specifications. |
| Schema / Types / Validation / Static Analysis | Type and attribute errors / initialization / CLI workflow | Runtime failures due to incorrect initialization, attribute access, or object lifecycle misuse. |
| Schema / Types / Validation / Static Analysis | Type hints / mypy / typing system / code generation | Static typing, type checking, and auto-generated type stubs. |
| Packaging / Dependency / Build | Python version / imports / deprecation / conflicts | Import errors, deprecated APIs, and Python version compatibility issues. |
| Packaging / Dependency / Build | Install / virtual env / OS / hardware / cloud deploy | Environment setup, package installation, OS and hardware requirements, and deployment. |
| Packaging / Dependency / Build | Artifacts / distribution format / repository management | Wheels, source distributions, repository layout, and post-install package state. |
| Packaging / Dependency / Build | Extensibility / configuration framework / plugin architecture | Plugin discovery, extension mechanisms, and dynamic component loading. |
| Docs / CI / Release / Workflow | Version control / Docker / build cache | Git workflows, containerization strategies, and build cache management. |
| Docs / CI / Release / Workflow | Release management / changelog / license / community | Release cycles, licensing policies, and community governance. |
| Docs / CI / Release / Workflow | Documentation / MkDocs / user tutorials | Systems for generating and maintaining user-facing documentation. |
| Docs / CI / Release / Workflow | Async refactor / migration / logging / i18n | Large-scale refactoring, logging infrastructure, and internationalization. |
| Docs / CI / Release / Workflow | CI pipelines / coverage / lint / GitHub Actions | Automated testing, linting, security checks, and CI execution. |
| Runtime / Async / Errors / Resources | Asyncio / async context / resource cleanup | Coroutine scheduling, event loops, async contexts, and cooperative concurrency. |
| Runtime / Async / Errors / Resources | Multiprocessing / advanced runtime / heterogeneous compute | Process pools, CPU/GPU scheduling, and multi-backend execution. |
| Runtime / Async / Errors / Resources | Runtime error handling / transactions / retry / logging | Exception handling, rollback mechanisms, and retry strategies. |
| Runtime / Async / Errors / Resources | Threading / execution limits / scheduling / memory | OS-level threading, memory constraints, and timeout behavior. |
| Runtime / Async / Errors / Resources | Connection lifecycle / protocol handling / failures | Socket errors, TLS issues, and low-level network failures. |
| Runtime / Async / Errors / Resources | Parallel execution / distributed frameworks / task graphs | Distributed execution models such as Ray or Dask. |
| Build Env / Testing / Toolchain | File paths / filesystem / permissions / env config | OS-level filesystem configuration and environment-dependent behavior. |
| Build Env / Testing / Toolchain | Unit testing / mocking / test automation | Test frameworks, mocks, and automated verification pipelines. |
| Build Env / Testing / Toolchain | Build pipeline / doc building / Sphinx / provisioning | Automated build systems, documentation compilation, and provisioning. |
| Build Env / Testing / Toolchain | Compiler toolchain / cross-compile / static analysis | Compiler behavior, environment variables, and code quality analysis. |
| API / Cloud / Auth / Network | API integration / SDK / performance / DB | External API usage, SDK integration, and performance considerations. |
| API / Cloud / Auth / Network | Media download / playlist / metadata / proxy | Media fetching, metadata extraction, and proxy configuration. |
| API / Cloud / Auth / Network | Auth systems / deployment / cloud plugins | Authentication services and cloud runtime behavior. |
| API / Cloud / Auth / Network | AWS / Azure / Kubernetes / IAM | Infrastructure orchestration and cloud security policies. |
| API / Cloud / Auth / Network | Reverse proxy / routing / websocket / CDN | URL routing, real-time transport, and CDN integration. |
| API / Cloud / Auth / Network | OAuth / JWT / SSL / access control | Token-based authentication, certificates, and session lifecycle. |
| ML / Algorithms / Performance | Tensors / training / GPU / experiment logging | Model training, GPU execution, and ML experiment management. |
| ML / Algorithms / Performance | ML visualization / Fourier / calibration | Analytical visualization and mathematical interpretation of models. |
| ML / Algorithms / Performance | Time series / feature engineering / explainability | Feature extraction, behavioral analysis, and computational semantics. |
| ML / Algorithms / Performance | Data parallel / compression / ML plugins | Distributed training and compressed data or model representations. |
| ML / Algorithms / Performance | Bayesian models / MCMC / statistics | Probabilistic modeling and uncertainty-aware inference. |
| ML / Algorithms / Performance | ML APIs / metrics / optimization strategies | Model interfaces, evaluation metrics, and optimization behavior. |
| Visualization / UI / Rendering | UI layout / CSS / markdown / frontend security | Layout, formatting, and security of user-facing content. |
| Visualization / UI / Rendering | Plotting systems / widgets / maps / animation | Charts, interactive widgets, and UI animations. |
| Visualization / UI / Rendering | Runtime UI config / permissions / uploads | UI customization, permission control, and file upload handling. |
| Visualization / UI / Rendering | 3D rendering / legends / color mapping | Rendering pipelines, color schemes, and legend formatting. |

Table 7: Question Cluster taxonomy used in SWE-QA-Pro [Back to Appendix Contents]

| Type | Intention | Definition |
| --- | --- | --- |
| What | Architecture exploration | Identify components or structural organization of the system. |
| What | Concept / Definition | Understand the meaning or semantics of code elements. |
| What | Dependency tracing | Identify relationships or dependencies among code elements. |
| Why | Design rationale | Explain why certain design decisions are made. |
| Why | Purpose exploration | Understand the intended purpose of a function or component. |
| Why | Performance | Understand performance considerations or trade-offs. |
| Where | Data / Control-flow | Localize variables, data flow, or control statements. |
| Where | Feature location | Identify where a specific feature is implemented. |
| Where | Identifier location | Find where an identifier is defined or referenced. |
| How | System design | Explain overall system behavior or execution workflow. |
| How | Algorithm implementation | Understand algorithmic steps or logic implemented in code. |
| How | API / Framework support | Show how APIs or frameworks are used within the system. |

Table 8: Taxonomy of Repository-Level Question Intentions [Back to Appendix Contents]

## F Statistics of SWE-QA-Pro

| Cluster ID | Cluster Name | Count |
| --- | --- | --- |
| 0.0 | Unicode / formatting / parsing / data validation | 5 |
| 0.1 | SQL / structured grammar / templating Engine | 6 |
| 0.2 | CLI args / syntax / regex / command completion | 8 |
| 0.3 | file import/export / metadata / sorting / recurrence | 6 |
| 1.0 | data formats / FITS / Astropy / units / WCS | 4 |
| 1.1 | pandas / parquet / datetime / Dask / table schema | 5 |
| 1.2 | numpy / dask arrays / dtype / array serialization / parallel | 6 |
| 1.3 | coordinate transform / image processing / IO / numeric precision | 4 |
| 2.0 | protocol serialization / encoding / headers / API compatibility | 4 |
| 2.1 | dependency injection / config / attrs / API design | 9 |
| 2.2 | dataclass / schema validation / enums / OpenAPI | 7 |
| 2.3 | type and attribute errors / initialization / CLI workflow | 7 |
| 2.4 | type hints / mypy / typing system / code generation | 7 |
| 3.0 | Python version / imports / deprecation / conflicts | 5 |
| 3.1 | install / virtual env / OS / hardware requirements / cloud deploy | 5 |
| 3.2 | artifacts / distribution format / repository management / post-install state | 4 |
| 3.3 | extensibility / configuration framework / plugin architecture | 9 |
| 4.0 | version control / Docker / build cache | 4 |
| 4.1 | release management / changelog / license / community | 4 |
| 4.2 | documentation / MkDocs / user tutorials | 4 |
| 4.3 | async refactor / migration / logging infra / i18n | 5 |
| 4.4 | CI pipelines / coverage / lint / GitHub Actions / security checks | 8 |
| 5.0 | asyncio / async context / resource cleanup | 7 |
| 5.1 | multiprocessing / advanced runtime / concurrency / heterogeneous compute | 7 |
| 5.2 | runtime error handling / DB transactions / retry / logging system | 8 |
| 5.3 | threading / execution limits / scheduling / memory / timeout | 5 |
| 5.4 | connection lifecycle / protocol handling / low-level failures | 4 |
| 5.5 | parallel execution / distributed frameworks / task graphs | 4 |
| 6.0 | file paths / filesystem permissions / symlinks / env config / cache system | 8 |
| 6.1 | unit testing / mocking / test automation | 5 |
| 6.2 | build pipeline / doc building / Sphinx / cloud provisioning | 5 |
| 6.3 | compiler toolchain / cross-compile / env vars / code quality analysis | 6 |
| 7.0 | API integration / sync / performance / DB / SDK | 7 |
| 7.1 | media download / playlist / metadata / client-side proxy config | 5 |
| 7.2 | auth systems / deployment / extension plugins / cloud services | 5 |
| 7.3 | AWS / Azure / K8s / container security / IAM policy | 4 |
| 7.4 | reverse proxy / URL routing / websocket / CDN / streaming | 5 |
| 7.5 | OAuth / JWT / SSL / access control / user sessions / token lifecycle | 4 |
| 8.0 | tensors / training / GPU / ML experiment logging / tuning | 4 |
| 8.1 | ML analytical visualization / Fourier / ML animation / calibration | 4 |
| 8.2 | time series / feature engineering / explainability methods / behavioral analysis / computational semantics | 6 |
| 8.3 | data parallel / compression / ML plugin / indexing | 4 |
| 8.4 | bayesian models / MCMC / statistics / reproducibility | 5 |
| 8.5 | ML APIs / decorators / metrics / optimization strategies | 5 |
| 9.0 | UI layout / CSS / markdown / table extraction / frontend security | 4 |
| 9.1 | plotting systems / widgets / maps / UI animation / usability | 5 |
| 9.2 | runtime UI config / UI permission management / upload handling / customization / user-facing runtime extensibility | 4 |
| 9.3 | 3D rendering / legends / color mapping / visualization formatting | 4 |

Table 9: Cluster Statistics (Counts per Question Cluster) [Back to Appendix Contents]

| Class Name | Sub-class Name | Count | Total Num |
| --- | --- | --- | --- |
| Why (Causal Queries) | Performance & Scalability | 33 | 65 |
| Why (Causal Queries) | Design Rationale | 24 | 65 |
| Why (Causal Queries) | Purpose & Role | 8 | 65 |
| What (Factual Queries) | Architecture & Components | 20 | 51 |
| What (Factual Queries) | Dependency & Inheritance | 17 | 51 |
| What (Factual Queries) | Concepts & Definitions | 14 | 51 |
| How (Procedural Queries) | System Design & Patterns | 30 | 67 |
| How (Procedural Queries) | Algorithm Implementation | 23 | 67 |
| How (Procedural Queries) | API & Framework Support | 14 | 67 |
| Where (Localization Queries) | Identifier Location | 32 | 77 |
| Where (Localization Queries) | Feature Location | 30 | 77 |
| Where (Localization Queries) | Data & Control Flow | 15 | 77 |

Table 10: QA Type Statistics [Back to Appendix Contents]

## G Breakdown Results in SWE-QA-Pro

### G.1 Breakdown Results By QA Type

| QA Type Name | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT 4.1 | GPT-4o | DeepSeek V3.2 | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Why (Causal Queries) | 38.72 | 38.18 | 37.65 | 32.50 | 36.87 | 36.78 |
| Performance & Scalability | 37.27 | 37.32 | 37.69 | 29.34 | 35.30 | 35.39 |
| Design Rationale | 41.38 | 38.49 | 37.99 | 35.76 | 38.68 | 38.46 |
| Purpose & Role | 36.48 | 40.83 | 36.46 | 35.72 | 37.92 | 37.48 |
| What (Factual Queries) | 39.27 | 38.59 | 37.66 | 30.93 | 37.95 | 36.88 |
| Architecture & Components | 38.38 | 38.52 | 37.82 | 30.41 | 37.48 | 36.52 |
| Dependency & Inheritance | 40.57 | 36.73 | 37.65 | 31.78 | 37.04 | 36.75 |
| Concepts & Definitions | 38.92 | 40.98 | 37.45 | 30.64 | 39.71 | 37.54 |
| How (Procedural Queries) | 40.86 | 39.17 | 38.44 | 32.81 | 39.10 | 38.08 |
| System Design & Patterns | 39.07 | 38.17 | 37.19 | 32.94 | 37.60 | 36.99 |
| Algorithm Implementation | 42.85 | 40.42 | 40.45 | 34.00 | 40.62 | 39.67 |
| API & Framework Support | 41.64 | 39.29 | 37.83 | 30.62 | 39.81 | 37.84 |
| Where (Localization Queries) | 42.84 | 41.36 | 39.73 | 35.56 | 40.36 | 39.97 |
| Identifier Location | 45.32 | 42.46 | 41.44 | 37.59 | 41.85 | 41.73 |
| Feature Location | 41.90 | 41.79 | 39.20 | 36.71 | 40.40 | 40.00 |
| Data & Control Flow | 39.58 | 38.18 | 37.16 | 28.95 | 37.11 | 36.19 |

Table 11: Results across Different Question Types by SWE-QA-Pro [Back to Appendix Contents]

| QA Type Name | Qwen3 8B | Qwen3 32B | Devstral Small-2-24B-Ins | Llama 3.3-70B-Ins | SWE-QA-Pro 8B (SFT) | SWE-QA-Pro 8B (SFT+RL) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Why (Causal Queries) | 31.52 | 32.83 | 35.52 | 26.66 | 32.35 | 33.54 | 32.07 |
| Performance & Scalability | 29.55 | 30.95 | 33.49 | 24.64 | 30.40 | 30.62 | 29.94 |
| Design Rationale | 33.74 | 34.68 | 38.56 | 28.39 | 34.66 | 37.42 | 34.57 |
| Purpose & Role | 33.04 | 35.00 | 34.75 | 29.71 | 33.46 | 33.96 | 33.32 |
| What (Factual Queries) | 29.28 | 30.11 | 36.50 | 21.69 | 32.95 | 34.17 | 30.78 |
| Architecture & Components | 28.33 | 29.52 | 36.53 | 20.78 | 32.97 | 34.87 | 30.50 |
| Dependency & Inheritance | 31.65 | 30.49 | 37.08 | 23.05 | 31.51 | 32.10 | 30.98 |
| Concepts & Definitions | 27.76 | 30.50 | 35.74 | 21.35 | 34.69 | 35.69 | 30.96 |
| How (Procedural Queries) | 28.43 | 31.57 | 37.60 | 22.43 | 35.30 | 35.84 | 31.86 |
| System Design & Patterns | 28.36 | 32.07 | 36.19 | 22.75 | 34.99 | 34.92 | 31.55 |
| Algorithm Implementation | 29.12 | 31.91 | 39.49 | 21.83 | 35.91 | 37.49 | 32.63 |
| API & Framework Support | 27.48 | 29.93 | 37.52 | 22.74 | 34.98 | 35.10 | 31.29 |
| Where (Localization Queries) | 30.67 | 33.21 | 39.09 | 23.67 | 36.10 | 37.36 | 33.35 |
| Identifier Location | 32.27 | 34.67 | 39.43 | 24.11 | 37.36 | 37.71 | 34.26 |
| Feature Location | 29.29 | 33.16 | 39.83 | 24.50 | 36.70 | 38.61 | 33.68 |
| Data & Control Flow | 30.02 | 30.20 | 36.89 | 21.08 | 32.18 | 34.13 | 30.75 |

Table 12: Results across Different Question Types (Open Models and SWE-QA-Pro Variants) [Back to Appendix Contents]

### G.2 Breakdown Results By Repository Name

| Repo Name | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT 4.1 | GPT-4o | DeepSeek V3.2 | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| PSyclone | 38.37 | 37.97 | 34.53 | 25.68 | 35.20 | 34.35 |
| Pillow | 39.90 | 37.33 | 39.33 | 33.73 | 37.93 | 37.64 |
| cekit | 42.47 | 39.77 | 39.37 | 32.80 | 38.47 | 38.58 |
| checkov | 37.10 | 36.90 | 34.30 | 28.40 | 36.27 | 34.59 |
| docker-py | 44.50 | 44.23 | 41.43 | 39.17 | 43.13 | 42.49 |
| dwave-cloud-client | 39.87 | 36.47 | 36.03 | 30.25 | 36.87 | 35.90 |
| fitbenchmarking | 41.47 | 39.63 | 40.33 | 37.83 | 38.63 | 39.58 |
| frictionless-py | 37.83 | 40.50 | 37.70 | 32.27 | 38.03 | 37.27 |
| geopandas | 39.80 | 40.70 | 38.83 | 35.42 | 38.63 | 38.68 |
| hy | 41.57 | 40.23 | 41.80 | 34.10 | 39.23 | 39.39 |
| jsonargparse | 37.33 | 33.10 | 33.13 | 31.70 | 34.73 | 34.00 |
| mkdocs | 43.20 | 40.97 | 38.83 | 35.12 | 41.97 | 40.02 |
| numba | 38.97 | 38.23 | 39.00 | 32.05 | 38.30 | 37.31 |
| pennylane | 38.37 | 41.47 | 37.60 | 30.48 | 38.30 | 37.24 |
| pint | 42.73 | 40.73 | 39.77 | 36.33 | 38.13 | 39.54 |
| pybryt | 40.33 | 41.77 | 38.97 | 35.80 | 40.43 | 39.46 |
| qibo | 43.27 | 38.90 | 39.53 | 33.48 | 40.50 | 39.14 |
| responses | 41.10 | 37.77 | 38.63 | 34.50 | 39.37 | 38.27 |
| sanic | 39.00 | 41.13 | 37.43 | 34.33 | 37.90 | 37.96 |
| seaborn | 42.97 | 41.73 | 39.07 | 32.95 | 39.87 | 39.32 |
| sphinx | 41.07 | 40.20 | 41.90 | 34.58 | 38.60 | 39.27 |
| sqlfluff | 41.60 | 39.17 | 38.07 | 34.90 | 39.13 | 38.57 |
| tox | 41.07 | 40.67 | 40.00 | 35.40 | 39.03 | 39.23 |
| web3.py | 40.70 | 38.90 | 39.90 | 33.33 | 41.03 | 38.77 |
| xarray | 41.00 | 38.70 | 37.07 | 31.25 | 38.53 | 37.31 |
| yt-dlp | 41.97 | 38.83 | 37.70 | 26.24 | 37.73 | 36.49 |

Table 13: Results across Different Repositories by SWE-QA-Pro [Back to Appendix Contents]

| Repo Name | Qwen3 8B | Qwen3 32B | Devstral Small-2-24B-Ins | Llama 3.3-70B-Ins | SWE-QA-Pro 8B (SFT) | SWE-QA-Pro 8B (SFT+RL) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSyclone | 27.20 | 26.10 | 37.47 | 23.86 | 31.73 | 33.87 | 30.04 |
| Pillow | 29.60 | 31.80 | 36.27 | 24.50 | 35.37 | 32.70 | 31.71 |
| cekit | 28.07 | 33.27 | 39.57 | 23.07 | 35.87 | 33.83 | 32.28 |
| checkov | 27.47 | 29.20 | 34.43 | 22.68 | 33.62 | 32.73 | 30.02 |
| docker-py | 36.03 | 40.00 | 41.47 | 27.59 | 40.20 | 39.70 | 37.50 |
| dwave-cloud-client | 32.37 | 30.73 | 33.80 | 24.52 | 35.27 | 36.13 | 32.14 |
| fitbenchmarking | 32.83 | 34.50 | 38.43 | 22.00 | 38.00 | 38.90 | 34.11 |
| frictionless-py | 28.03 | 30.93 | 36.60 | 22.40 | 34.00 | 35.20 | 31.19 |
| geopandas | 29.43 | 33.27 | 37.73 | 25.90 | 34.57 | 35.43 | 32.72 |
| hy | 32.37 | 31.40 | 39.00 | 26.70 | 36.20 | 36.17 | 33.64 |
| jsonargparse | 31.17 | 31.10 | 29.20 | 21.35 | 28.93 | 32.87 | 29.10 |
| mkdocs | 32.40 | 33.63 | 41.43 | 23.80 | 36.23 | 38.33 | 34.30 |
| numba | 29.90 | 32.10 | 37.23 | 24.75 | 31.69 | 34.43 | 31.68 |
| pennylane | 28.47 | 32.23 | 33.80 | 18.03 | 33.13 | 32.27 | 29.66 |
| pint | 32.10 | 32.67 | 39.50 | 22.10 | 37.73 | 40.00 | 34.02 |
| pybryt | 31.37 | 34.43 | 39.00 | 23.65 | 35.70 | 37.63 | 33.63 |
| qibo | 26.30 | 30.10 | 37.10 | 22.10 | 35.70 | 37.10 | 31.40 |
| responses | 28.87 | 34.20 | 36.53 | 24.00 | 30.33 | 34.80 | 31.46 |
| sanic | 30.03 | 32.60 | 37.69 | 25.63 | 33.70 | 35.83 | 32.58 |
| seaborn | 30.67 | 33.00 | 37.90 | 24.90 | 34.13 | 35.73 | 32.72 |
| sphinx | 30.87 | 30.70 | 38.23 | 24.85 | 33.70 | 36.17 | 32.42 |
| sqlfluff | 28.90 | 30.87 | 36.30 | 22.52 | 33.67 | 36.00 | 31.38 |
| tox | 29.37 | 28.83 | 38.13 | 24.87 | 35.30 | 34.50 | 31.83 |
| web3.py | 28.87 | 33.17 | 38.43 | 23.95 | 33.93 | 33.03 | 31.90 |
| xarray | 29.83 | 31.93 | 38.33 | 22.25 | 34.20 | 34.13 | 31.78 |
| yt-dlp | 28.40 | 31.37 | 36.27 | 25.30 | 29.90 | 32.60 | 30.64 |

Table 14: Results across Different Repositories (Open Models and SWE-QA-Pro Variants) [Back to Appendix Contents]

### G.3 Breakdown Results By Cluster

| Cluster ID | Name | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT 4.1 | GPT-4o | DeepSeek V3.2 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | Unicode / formatting / parsing / data validation | 42.73 | 41.07 | 38.07 | 34.20 | 38.00 | 38.81 |
| 0.1 | SQL / structured grammar / templating Engine | 43.06 | 42.28 | 38.11 | 38.00 | 39.22 | 40.13 |
| 0.2 | CLI args / syntax / regex / command completion | 40.67 | 37.21 | 35.17 | 33.88 | 39.08 | 37.20 |
| 0.3 | file import/export / metadata / sorting / recurrence | 36.20 | 37.17 | 34.06 | 29.75 | 39.17 | 35.27 |
| 1.0 | data formats / FITS / Astropy / units / WCS | 44.58 | 40.25 | 36.00 | 37.00 | 40.58 | 39.68 |
| 1.1 | pandas / parquet / datetime / Dask / table schema | 39.08 | 42.27 | 37.67 | 30.25 | 40.93 | 38.04 |
| 1.2 | numpy / dask arrays / dtype / array serialization / parallel | 39.67 | 39.11 | 38.33 | 32.92 | 40.56 | 38.12 |
| 1.3 | coordinate transform / image processing / IO / numeric precision | 40.67 | 40.58 | 37.08 | 32.12 | 37.17 | 37.52 |
| 2.0 | protocol serialization / encoding / headers / API compatibility | 39.67 | 42.75 | 40.92 | 36.75 | 41.42 | 40.30 |
| 2.1 | dependency injection / config / attrs / API design | 42.42 | 40.19 | 38.37 | 33.86 | 40.19 | 39.00 |
| 2.2 | dataclass / schema validation / enums / OpenAPI | 40.95 | 34.10 | 33.67 | 27.96 | 37.29 | 34.79 |
| 2.3 | type and attribute errors / initialization / CLI workflow | 44.14 | 41.90 | 42.81 | 35.54 | 41.33 | 41.15 |
| 2.4 | type hints / mypy / typing system / code generation | 42.19 | 40.43 | 40.86 | 35.93 | 39.62 | 39.80 |
| 3.0 | Python version / imports / deprecation / conflicts | 42.87 | 39.27 | 37.87 | 36.95 | 38.93 | 39.18 |
| 3.1 | install / virtual env / OS / hardware requirements / cloud deploy | 39.80 | 35.67 | 38.93 | 29.75 | 36.27 | 36.08 |
| 3.2 | artifacts / distribution format / repository management / post-install state | 32.17 | 32.92 | 36.42 | 27.06 | 30.08 | 31.73 |
| 3.3 | extensibility / configuration framework / plugin architecture | 38.00 | 39.37 | 38.67 | 34.11 | 38.63 | 37.76 |
| 4.0 | version control / Docker / build cache | 41.67 | 42.50 | 40.58 | 36.25 | 39.33 | 40.07 |
| 4.1 | release management / changelog / license / community | 37.17 | 41.83 | 39.17 | 34.81 | 35.67 | 37.73 |
| 4.2 | documentation / MkDocs / user tutorials | 40.42 | 39.17 | 37.50 | 29.00 | 39.25 | 37.07 |
| 4.3 | async refactor / migration / logging infra / i18n | 42.40 | 38.27 | 39.87 | 36.15 | 40.13 | 39.36 |
| 4.4 | CI pipelines / coverage / lint / GitHub Actions / security checks | 41.88 | 40.92 | 41.12 | 32.38 | 40.88 | 39.43 |
| 5.0 | asyncio / async context / resource cleanup | 40.10 | 36.38 | 39.95 | 31.07 | 36.48 | 36.80 |
| 5.1 | multiprocessing / advanced runtime / concurrency / heterogeneous compute | 33.24 | 37.71 | 31.43 | 28.68 | 33.29 | 32.87 |
| 5.2 | runtime error handling / DB transactions / retry / logging system | 38.94 | 39.38 | 40.71 | 31.19 | 39.29 | 37.90 |
| 5.3 | threading / execution limits / scheduling / memory / timeout | 39.33 | 32.33 | 34.27 | 25.20 | 32.40 | 32.71 |
| 5.4 | connection lifecycle / protocol handling / low-level failures | 44.50 | 43.58 | 41.67 | 41.31 | 41.50 | 42.51 |
| 5.5 | parallel execution / distributed frameworks / task graphs | 40.50 | 38.50 | 39.67 | 28.50 | 37.83 | 37.00 |
| 6.0 | file paths / filesystem permissions / symlinks / env config / cache system | 42.50 | 40.83 | 41.50 | 38.03 | 42.83 | 41.14 |
| 6.1 | unit testing / mocking / test automation | 43.60 | 38.13 | 39.87 | 34.75 | 40.60 | 39.39 |
| 6.2 | build pipeline / doc building / Sphinx / cloud provisioning | 44.13 | 43.67 | 41.53 | 34.25 | 35.67 | 39.85 |
| 6.3 | compiler toolchain / cross-compile / env vars / code quality analysis | 42.11 | 39.28 | 40.56 | 36.96 | 41.17 | 40.01 |
| 7.0 | API integration / sync / performance / DB / SDK | 42.62 | 40.67 | 39.57 | 34.46 | 41.43 | 39.75 |
| 7.1 | media download / playlist / metadata / client-side proxy config | 38.83 | 38.53 | 33.93 | 18.83 | 35.47 | 33.12 |
| 7.2 | auth systems / deployment / extension plugins / cloud services | 44.47 | 44.07 | 43.00 | 39.15 | 44.27 | 42.99 |
| 7.3 | AWS / Azure / K8s / container security / IAM policy | 32.75 | 37.42 | 32.67 | 29.56 | 33.17 | 33.11 |
| 7.4 | reverse proxy / URL routing / websocket / CDN / streaming | 44.07 | 43.60 | 40.67 | 39.05 | 43.20 | 42.12 |
| 7.5 | OAuth / JWT / SSL / access control / user sessions / token lifecycle | 33.25 | 33.00 | 36.42 | 27.50 | 36.25 | 33.28 |
| 8.0 | tensors / training / GPU / ML experiment logging / tuning | 32.56 | 40.00 | 32.42 | 28.44 | 34.00 | 33.48 |
| 8.1 | ML analytical visualization / Fourier / ML animation / calibration | 38.25 | 39.92 | 38.08 | 30.38 | 40.83 | 37.49 |
| 8.2 | time series / feature engineering / explainability methods / behavioral analysis / computational semantics | 42.06 | 41.39 | 38.17 | 33.62 | 37.89 | 38.62 |
| 8.3 | data parallel / compression / ML plugin / indexing | 37.17 | 34.58 | 38.83 | 31.50 | 32.83 | 34.98 |
| 8.4 | bayesian models / MCMC / statistics / reproducibility | 42.93 | 38.13 | 38.53 | 30.20 | 38.13 | 37.59 |
| 8.5 | ML APIs / decorators / metrics / optimization strategies | 40.00 | 38.47 | 38.33 | 36.50 | 38.40 | 38.34 |
| 9.0 | UI layout / CSS / markdown / table extraction / frontend security | 41.33 | 38.92 | 39.67 | 34.56 | 39.83 | 38.86 |
| 9.1 | plotting systems / widgets / maps / UI animation / usability | 43.60 | 44.27 | 42.80 | 40.40 | 42.00 | 42.61 |
| 9.2 | runtime UI config / UI permission management / upload handling / customization / user-facing runtime extensibility | 31.44 | 39.00 | 33.75 | 26.40 | 33.08 | 32.74 |
| 9.3 | 3D rendering / legends / color mapping / visualization formatting | 45.92 | 43.83 | 44.50 | 41.44 | 41.58 | 43.45 |

Table 15: Results across Question Clusters (Closed Models) [Back to Appendix Contents]

| Cluster ID | Name | Qwen3 8B | Qwen3 32B | Devstral Small-2-24B-Ins | Llama 3.3-70B-Ins | SWE-QA-Pro 8B (SFT) | SWE-QA-Pro 8B (SFT+RL) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | Unicode / formatting / parsing / data validation | 31.47 | 33.40 | 38.87 | 24.17 | 39.53 | 38.40 | 34.31 |
| 0.1 | SQL / structured grammar / templating Engine | 30.00 | 32.78 | 39.44 | 21.76 | 35.33 | 38.89 | 33.03 |
| 0.2 | CLI args / syntax / regex / command completion | 29.79 | 30.29 | 31.12 | 22.72 | 33.54 | 35.54 | 30.50 |
| 0.3 | file import/export / metadata / sorting / recurrence | 31.33 | 35.61 | 36.28 | 23.67 | 34.00 | 33.00 | 32.31 |
| 1.0 | data formats / FITS / Astropy / units / WCS | 30.50 | 28.42 | 38.42 | 20.38 | 34.17 | 42.83 | 32.45 |
| 1.1 | pandas / parquet / datetime / Dask / table schema | 25.67 | 27.20 | 40.07 | 24.45 | 35.80 | 35.87 | 31.51 |
| 1.2 | numpy / dask arrays / dtype / array serialization / parallel | 30.00 | 34.33 | 38.94 | 24.38 | 37.22 | 34.33 | 33.20 |
| 1.3 | coordinate transform / image processing / IO / numeric precision | 28.25 | 31.33 | 36.83 | 20.88 | 34.58 | 33.17 | 30.84 |
| 2.0 | protocol serialization / encoding / headers / API compatibility | 30.58 | 36.17 | 40.67 | 20.10 | 36.42 | 35.25 | 33.20 |
| 2.1 | dependency injection / config / attrs / API design | 31.00 | 33.22 | 38.70 | 23.23 | 35.56 | 36.30 | 33.00 |
| 2.2 | dataclass / schema validation / enums / OpenAPI | 25.76 | 29.48 | 35.81 | 23.81 | 30.57 | 31.90 | 29.56 |
| 2.3 | type and attribute errors / initialization / CLI workflow | 33.57 | 26.38 | 39.10 | 26.45 | 33.86 | 36.62 | 32.66 |
| 2.4 | type hints / mypy / typing system / code generation | 36.19 | 32.95 | 40.19 | 27.53 | 35.05 | 35.62 | 34.59 |
| 3.0 | Python version / imports / deprecation / conflicts | 37.07 | 35.27 | 41.67 | 21.00 | 34.60 | 33.20 | 33.80 |
| 3.1 | install / virtual env / OS / hardware requirements / cloud deploy | 26.40 | 28.80 | 39.60 | 24.36 | 30.80 | 30.33 | 30.05 |
| 3.2 | artifacts / distribution format / repository management / post-install state | 25.92 | 27.92 | 28.50 | 20.50 | 25.00 | 27.50 | 25.89 |
| 3.3 | extensibility / configuration framework / plugin architecture | 29.48 | 31.89 | 36.00 | 25.61 | 32.81 | 35.96 | 31.96 |
| 4.0 | version control / Docker / build cache | 31.00 | 27.08 | 41.33 | 25.67 | 38.08 | 40.33 | 33.92 |
| 4.1 | release management / changelog / license / community | 30.75 | 30.83 | 38.92 | 22.33 | 32.17 | 29.08 | 30.68 |
| 4.2 | documentation / MkDocs / user tutorials | 28.33 | 30.83 | 37.67 | 24.20 | 29.25 | 31.00 | 30.21 |
| 4.3 | async refactor / migration / logging infra / i18n | 31.87 | 34.07 | 40.60 | 23.20 | 36.67 | 39.67 | 34.34 |
| 4.4 | CI pipelines / coverage / lint / GitHub Actions / security checks | 27.88 | 31.75 | 37.62 | 22.94 | 39.96 | 38.04 | 33.03 |
| 5.0 | asyncio / async context / resource cleanup | 27.05 | 27.48 | 37.52 | 24.00 | 31.71 | 31.48 | 29.87 |
| 5.1 | multiprocessing / advanced runtime / concurrency / heterogeneous compute | 29.10 | 30.71 | 30.81 | 23.47 | 28.29 | 27.57 | 28.32 |
| 5.2 | runtime error handling / DB transactions / retry / logging system | 31.21 | 32.00 | 37.42 | 22.61 | 33.29 | 37.54 | 32.34 |
| 5.3 | threading / execution limits / scheduling / memory / timeout | 30.07 | 32.33 | 29.67 | 24.38 | 28.53 | 28.47 | 28.91 |
| 5.4 | connection lifecycle / protocol handling / low-level failures | 34.83 | 41.92 | 42.91 | 25.00 | 36.75 | 39.08 | 36.75 |
| 5.5 | parallel execution / distributed frameworks / task graphs | 27.75 | 32.58 | 38.17 | 22.82 | 32.55 | 36.25 | 31.69 |
| 6.0 | file paths / filesystem permissions / symlinks / env config / cache system | 33.46 | 32.88 | 40.79 | 25.21 | 37.96 | 40.00 | 35.05 |
| 6.1 | unit testing / mocking / test automation | 31.67 | 34.40 | 37.87 | 20.17 | 41.47 | 40.27 | 34.31 |
| 6.2 | build pipeline / doc building / Sphinx / cloud provisioning | 31.80 | 31.33 | 39.13 | 25.50 | 35.20 | 40.87 | 33.97 |
| 6.3 | compiler toolchain / cross-compile / env vars / code quality analysis | 31.67 | 35.78 | 34.44 | 24.38 | 35.41 | 35.89 | 32.93 |
| 7.0 | API integration / sync / performance / DB / SDK | 28.00 | 35.00 | 38.05 | 24.33 | 40.00 | 39.24 | 34.10 |
| 7.1 | media download / playlist / metadata / client-side proxy config | 23.60 | 28.60 | 32.67 | 24.64 | 24.53 | 28.07 | 27.02 |
| 7.2 | auth systems / deployment / extension plugins / cloud services | 35.20 | 41.27 | 42.80 | 28.29 | 44.13 | 37.73 | 38.24 |
| 7.3 | AWS / Azure / K8s / container security / IAM policy | 27.08 | 26.67 | 33.83 | 20.22 | 33.08 | 32.00 | 28.81 |
| 7.4 | reverse proxy / URL routing / websocket / CDN / streaming | 37.27 | 38.60 | 40.53 | 26.00 | 38.07 | 42.27 | 37.12 |
| 7.5 | OAuth / JWT / SSL / access control / user sessions / token lifecycle | 25.42 | 30.08 | 28.83 | 24.80 | 25.58 | 29.50 | 27.37 |
| 8.0 | tensors / training / GPU / ML experiment logging / tuning | 28.67 | 27.58 | 28.33 | 17.64 | 28.75 | 33.58 | 27.43 |
| 8.1 | ML analytical visualization / Fourier / ML animation / calibration | 25.50 | 23.17 | 34.08 | 22.25 | 29.58 | 32.50 | 27.85 |
| 8.2 | time series / feature engineering / explainability methods / behavioral analysis / computational semantics | 29.50 | 32.44 | 40.61 | 23.50 | 37.22 | 39.39 | 33.78 |
| 8.3 | data parallel / compression / ML plugin / indexing | 28.67 | 30.58 | 34.50 | 22.10 | 31.83 | 26.67 | 29.06 |
| 8.4 | bayesian models / MCMC / statistics / reproducibility | 29.47 | 32.00 | 31.87 | 22.82 | 31.60 | 31.20 | 29.83 |
| 8.5 | ML APIs / decorators / metrics / optimization strategies | 29.33 | 34.53 | 39.93 | 21.33 | 37.13 | 38.07 | 33.39 |
| 9.0 | UI layout / CSS / markdown / table extraction / frontend security | 26.50 | 30.58 | 38.33 | 25.56 | 35.17 | 37.58 | 32.29 |
| 9.1 | plotting systems / widgets / maps / UI animation / usability | 35.80 | 40.73 | 42.60 | 29.54 | 37.53 | 39.87 | 37.68 |
| 9.2 | runtime UI config / UI permission management / upload handling / customization / user-facing runtime extensibility | 26.83 | 33.17 | 33.33 | 22.38 | 31.75 | 33.67 | 30.19 |
| 9.3 | 3D rendering / legends / color mapping / visualization formatting | 25.92 | 30.50 | 43.92 | 24.09 | 36.83 | 38.58 | 33.31 |

Table 16: Results across Question Clusters (Open Models and SWE-QA-Pro Variants) [Back to Appendix Contents]

## H Case Study

### H.1 Cases of Model Comparison

**Comparison between GPT-4o and SWE-QA-8B (SFT+RL)**

**Repo Name:** dwavesystems/dwave-cloud-client

**Generated Query:** Where does coverage data flow from test execution to the external reporting service in the current CI pipeline?

**Ground Truth:** During test execution, coverage data is generated via the 'coverage run -m unittest discover' command (.travis.yml: line 46), which instruments the test runs to collect code coverage metrics. The coverage collection is configured by .coveragerc, which specifies that coverage should track the 'dwave.cloud' source package (line 2) while omitting virtual environments, test files, performance tests, and the package_info.py file (lines 3-6). The necessary dependencies for this process, including both the 'coverage' and 'coveralls' packages, are installed during the CI setup phase as specified in tests/requirements.txt and installed via .travis.yml. After successful test completion, the collected coverage data is transmitted to the external Coveralls reporting service through the 'coveralls' command executed in the after_success phase (.travis.yml: line 49).

**Answer from GPT-4o (abridged):** In the current CI pipeline ... coverage data flows ... as follows: ... The appveyor.yml file installs ... including coverage and coveralls. However, it does not explicitly invoke coveralls, suggesting that the upload process might be handled implicitly ... ... No explicit script or command for invoking coveralls was found in the repository ...

**Score:** total_score: 22, correctness: 2, completeness: 2, relevance: 6, clarity: 8, reasoning: 4

**Answer from SWE-QA-8B (SFT+RL) (abridged):** Coverage data flows ... through the following path ... ... The CI pipeline runs tests using ...
coverage run -m unittest discover ... in the .travis.yml file (line 46). ... After successful test execution, the coveralls command (line 49 in .travis.yml) sends the coverage data to Coveralls.io ... The flow is: Tests → ... → Coveralls CLI → Coveralls.io service

**Score:** total_score: 50, correctness: 10, completeness: 10, relevance: 10, clarity: 10, reasoning: 10

**Scoring Rationale:** SWE-QA-8B (SFT+RL) explicitly matches the ground-truth flow (coverage run -m unittest discover in .travis.yml, then coveralls in after_success). GPT-4o is not just less aligned; it is wrong: it claims "No explicit ... command for invoking coveralls was found" and suggests an "implicit" upload, even though .travis.yml explicitly runs coveralls. It also mis-centers appveyor.yml as the relevant CI path for uploading coverage, which contradicts the ground truth.

Figure 6: Example of a comparison of GPT-4o and SWE-QA-8B (SFT+RL) answers against the ground truth. [Back to Appendix Contents]

**Comparison between SWE-QA-8B (SFT) and SWE-QA-8B (SFT+RL)**

**Repo Name:** pydata/xarray

**Generated Query:** Where are the CRS format conversion methods (PROJ.4 vs WKT string) defined and implemented in xarray's rasterio backend?

**Ground Truth:** The CRS format conversion logic is implemented in the open_rasterio function in xarray/backends/rasterio_.py. When constructing the output attributes, the code checks whether the rasterio dataset has a CRS and serializes it to a string by first attempting to call riods.crs.to_proj4() to obtain a PROJ.4 representation, and if this method is unavailable (raising AttributeError), falling back to riods.crs.to_string(), which returns a WKT representation; the resulting CRS string is stored in the returned DataArray's attributes under the crs key (xarray/backends/rasterio_.py: line 284-291, 341-342).
A second CRS serialization occurs when the input is a WarpedVRT, in which case vrt.crs.to_string() is used when building the VRT parameters so that the virtual raster's CRS is preserved during reopening (xarray/backends/rasterio_.py: line 220-222).

**Answer from SWE-QA-8B (SFT) (abridged):** The CRS format conversion methods ... are defined and implemented in ... xarray/backends/rasterio_.py. ... The conversion methods are: ... to_proj4() ... to_string() ... These methods are used in two key locations: ... ... WarpedVRT ... the CRS is converted to a string (PROJ.4 format) using vrt.crs.to_string() ... ... the CRS is converted to both PROJ.4 and WKT formats using riods.crs.to_proj4() and riods.crs.to_string(), and stored as attributes ... The implementation delegates to rasterio's built-in CRS methods ...

**Score:** total_score: 29, correctness: 3, completeness: 4, relevance: 9, clarity: 8, reasoning: 5

**Answer from SWE-QA-8B (SFT+RL) (abridged):** The CRS format conversion methods ... are ... in xarray/backends/rasterio_.py. ... to_proj4() ... used ... to convert ... to a PROJ.4 string ... to_string() ... used ... to convert ... to a WKT string ... ... when creating a DataArray ... uses riods.crs.to_proj4() ... ... if to_proj4() raises an AttributeError, it falls back to riods.crs.to_string() for WKT format ... ... when creating a WarpedVRT ... uses vrt.crs.to_string() ... ... conversions are not implemented by xarray itself but delegate to ... rasterio ...

**Score:** total_score: 46, correctness: 9, completeness: 9, relevance: 10, clarity: 9, reasoning: 9

**Scoring Rationale:** SWE-QA-8B (SFT+RL) matches the ground-truth control flow: it states the "try to_proj4(), then fall back to to_string() on AttributeError" sequence, and it places vrt.crs.to_string() in the WarpedVRT path. SWE-QA-8B (SFT) is weaker against the ground truth and also internally inconsistent: it labels vrt.crs.to_string() as "PROJ.4 format," even though to_string() is the WKT-style serialization path in the ground truth.
It also claims the CRS is converted to "both PROJ.4 and WKT ... and stored as attributes," which conflicts with the ground-truth logic that stores one CRS string (PROJ.4 if available, otherwise WKT) under a single crs attribute.

Figure 7: Example of a comparison of SWE-QA-8B (SFT) and SWE-QA-8B (SFT+RL) answers against the ground truth. [Back to Appendix Contents]

### H.2 Case Studies of Human and Model Reference Answers

**Session ID:** ethereum__web3.py-3027-0773ab80-e07c-4730-a349-6f9eef2f0c76

**Ground Truth:** The AF_UNIX socket path validation that generates 'AF_UNIX path too long' errors is not explicitly implemented in web3.py code, but rather occurs at the operating system level during socket connection. The error originates in the get_ipc_socket() function (web3/providers/ipc.py: line 35-47), where socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) creates a Unix domain socket and sock.connect(ipc_path) attempts the connection at line 45. When the path exceeds the system limit (typically 108 characters on Unix systems), the OS raises an OSError with the message 'AF_UNIX path too long'. The IPC path itself is constructed by get_default_ipc_path() (web3/providers/ipc.py: line 89-114) using platform-specific logic for macOS, Linux, FreeBSD, and Windows. This function is invoked during IPCProvider initialization (web3/providers/ipc.py: line 149-170) when no explicit path is provided. The AutoProvider class (web3/providers/auto.py: line 63-69) includes IPCProvider in its default_providers list and attempts initialization through _get_active_provider() (web3/providers/auto.py: line 113-123). The RequestManager (web3/manager.py: line 108-111) uses AutoProvider as the default when no provider is specified.
When the path length error occurs, it manifests as an OSError that gets caught by the JSONBaseProvider.is_connected() method (web3/providers/base.py: line 111-133), which handles OSErrors during connection attempts by either returning False or raising a ProviderConnectionError with traceback details.

**Error Type:** Factual Error

**Error Reason:** The original answer factually misidentified the error-handling mechanism by citing lines 120-130 in JSONBaseProvider.is_connected. This code block is responsible for parsing application-level JSON-RPC errors (checking for an 'error' key in the response payload), not low-level connection failures. It failed to locate the actual try/except block (lines 112-115) that explicitly captures the OS-level OSError triggered by the socket path limit, effectively conflating protocol payload validation with transport exception handling.

**Generated Query:** Where is the AF_UNIX socket path validation logic implemented that generates 'AF_UNIX path too long' errors in the IPC provider system?

**Generated Answer:** The AF_UNIX socket path validation that generates 'AF_UNIX path too long' errors is not explicitly implemented in web3.py code, but rather occurs at the operating system level during socket connection. The error originates in the get_ipc_socket() function (web3/providers/ipc.py: line 35-47) where socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) creates a Unix domain socket and sock.connect(ipc_path) attempts the connection at line 45. When the path exceeds the system limit (typically 108 characters on Unix systems), the OS raises an OSError with the message 'AF_UNIX path too long'. The IPC path itself is constructed by get_default_ipc_path() (web3/providers/ipc.py: line 89-114) using platform-specific logic for macOS, Linux, FreeBSD, and Windows. This function is invoked during IPCProvider initialization (web3/providers/ipc.py: line 149-170) when no explicit path is provided.
The AutoProvider class (web3/providers/auto.py: lines 63–69) includes IPCProvider in its default_providers list and attempts initialization through _get_active_provider() (web3/providers/auto.py: lines 117–122). The RequestManager (web3/manager.py: lines 108–111) uses AutoProvider as the default when no provider is specified. When the path length error occurs, it manifests as an OSError that gets caught by the JSONBaseProvider.is_connected() method (web3/providers/base.py: lines 120–130), which handles OSErrors during connection attempts by either returning False or raising a ProviderConnectionError with traceback details.

Figure 8: Example of a comparison of SWE-QA-8B (SFT) and SWE-QA-8B (SFT+RL) answers.

Session ID: frictionlessdata__frictionless-py-1156-b1134404-d537-4caa-999d-88f3b0969356

Ground Truth: The schema detection algorithm implements field type inference through a sophisticated multi-candidate scoring system orchestrated by the "detect_schema" method in the Detector class. The process begins by creating field type candidates from "DEFAULT_CANDIDATES", a prioritized list containing 14 types ordered from most specific to most general (e.g., yearmonth, geopoint... string) (frictionless/settings.py: lines 71–86). For each data column, the algorithm initializes "runners" which track the state of every candidate field type (frictionless/detector/detector.py: lines 485–500). During inference, the algorithm processes each row of the data fragment. It evaluates every candidate against each cell using the "Field.read_cell()" method, which parses the value and returns any type discrepancies (frictionless/field.py: lines 350–361).
The scoring mechanism awards +1 point for successful cell parsing and -1 point for parsing failures (frictionless/detector/detector.py: line 519). To optimize performance, the algorithm implements early elimination by calculating a "threshold" value based on the fragment length and confidence level; candidates dropping below this threshold are skipped in subsequent iterations (frictionless/detector/detector.py: lines 505, 515). Finally, a candidate wins and is assigned as the inferred field type when its score meets the confidence requirement (calculated against a "max_score" that accounts for missing values) (frictionless/detector/detector.py: lines 512–527).

Error Type: Localization Hallucination

Error Reason: The original answer accurately described the multi-candidate scoring logic but suffered from severe localization hallucination, citing the runners initialization at lines 346–362 based on an outdated file version. It failed to verify the code against the current commit, overlooking significant docstring additions that shifted the actual implementation to lines 485–500, resulting in a ~140-line discrepancy that rendered the supporting evidence invalid.

Generated Query: How does the schema detection algorithm implement field type inference and scoring for tabular data in the Detector class?

Generated Answer: The schema detection algorithm implements field type inference through a sophisticated multi-candidate scoring system orchestrated by the Detector class. The process begins by creating field type candidates from DEFAULT_CANDIDATES, a prioritized list containing 14 types ordered from most specific to most general: yearmonth, geopoint, duration, geojson, object, array, datetime, time, date, integer, number, boolean, year, and string (/testbed/frictionless/settings.py: lines 71–86). For each data column, the algorithm initializes runners containing all candidate field types (/testbed/frictionless/detector/detector.py: lines 487–500).
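The multi-candidate scoring loop described in this case (+1/-1 per cell, an early-elimination threshold of len(fragment) * (confidence - 1), and a win condition against a max score that excludes missing cells) can be sketched generically. This is an illustrative re-implementation under those stated rules, not frictionless's actual code; the names infer_type and _parses are hypothetical:

```python
def infer_type(cells, candidates, confidence=0.9, missing=("", None)):
    """Pick the most specific candidate type that parses enough non-missing cells."""
    scores = dict.fromkeys((name for name, _ in candidates), 0)
    # Early-elimination threshold: with confidence 0.9, a candidate may fall
    # only slightly below zero before it is skipped for good.
    threshold = len(cells) * (confidence - 1)
    # Missing values are excluded from the maximum achievable score.
    max_score = sum(1 for cell in cells if cell not in missing)
    for cell in cells:
        if cell in missing:
            continue  # missing values neither reward nor penalize a candidate
        for name, parse in candidates:
            if scores[name] < threshold:
                continue  # eliminated: skipped in all subsequent iterations
            scores[name] += 1 if parse(cell) else -1
    # Candidates are ordered most specific -> most general; first qualifying wins.
    for name, _ in candidates:
        if scores[name] >= max_score * confidence:
            return name
    return "any"

def _parses(cast):
    def parse(value):
        try:
            cast(value)
            return True
        except (TypeError, ValueError):
            return False
    return parse

CANDIDATES = [
    ("integer", _parses(int)),
    ("number", _parses(float)),
    ("string", lambda value: True),
]

column_type = infer_type(["1", "2", "", "3"], CANDIDATES)  # -> "integer"
```

With confidence 0.9, the "integer" candidate here parses all three non-missing cells (score 3 >= 3 * 0.9) and wins over "number" only because it appears earlier in the priority order.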
During inference, the algorithm processes each row of the sample data and evaluates every candidate against each cell using the Field.read_cell() method (/testbed/frictionless/field.py: lines 323–362). The scoring mechanism awards +1 point for successful cell parsing (when read_cell returns no error notes) and -1 point for parsing failures (/testbed/frictionless/detector/detector.py: line 519). The algorithm implements early elimination optimization by calculating a threshold value as len(fragment) * (field_confidence - 1) and skipping candidates whose scores fall below this threshold (/testbed/frictionless/detector/detector.py: lines 505, 515). A candidate wins and becomes the inferred field type when its score reaches or exceeds max_possible_score * field_confidence, where max_possible_score accounts for missing values in the data (/testbed/frictionless/detector/detector.py: lines 520–527). The default field_confidence is 0.9, meaning a candidate must successfully parse 90% of valid (non-missing) cells to be selected. This approach ensures robust type inference while maintaining performance through early elimination of poorly performing candidates.

Figure 9: Example of Comparison between Reference Answer and Human Annotated Answer (frictionless-py).

Session ID: hylang__hy-2514-955a46b9-91d6-46ec-98cb-2a41fbff46bd

Ground Truth: Hy's conditional compilation system generates redundant anonymous variable assignments primarily because the "cond" macro recursively expands into nested "if" expressions (hy/core/macros.hy: lines 36–41) that do not utilize the compiler's internal "if*" optimization symbol. While the "compile_if" function (hy/core/result_macros.py: lines 643–707) includes logic to reuse temporary variables for nested conditionals (specifically checking for "if*" at line 652), the standard "cond" macro emits standard "if" forms, causing this optimization to be skipped.
Consequently, "compile_if" treats each nesting level as a separate statement-to-expression conversion context. To bridge Hy's expression-oriented semantics with Python's statement-oriented architecture, the compiler is forced to create a new temporary variable via "get_anon_var" (line 681) and generate explicit assignment statements (lines 685–689) for each level. This behavior is structurally enforced by the "Result" class (hy/compiler.py: lines 144–309), which separates statements from expression contexts, necessitating these assignments to maintain correct evaluation semantics across nested statement blocks.

Error Type: Reasoning Depth

Error Reason: The original answer superficially blamed design philosophy, overlooking the compiler's internal if* optimization mechanism. It failed to identify the root cause: the cond macro expands into standard if forms, inadvertently bypassing the built-in optimization path designed to prevent redundancy.

Generated Query: Why does Hy's conditional compilation system generate redundant anonymous variable assignments instead of optimizing nested conditionals into efficient elif chains?

Generated Answer: Hy's conditional compilation system generates redundant anonymous variable assignments due to a fundamental architectural decision that prioritizes semantic correctness over performance optimization. The redundancy stems from two key design patterns. First, the cond macro recursively expands into deeply nested if expressions rather than a flat elif-like structure. The _cond function in hy/core/macros.hy: lines 36–41 demonstrates this recursive expansion, where each condition-result pair generates a nested if form that calls _cond again for remaining conditions. This creates a tree of nested conditionals rather than a linear chain.
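The recursive expansion discussed in this case can be mimicked in miniature. The expand_cond helper below is a hypothetical illustration of the shape of the macro's output (nested tuples standing in for Hy forms), not Hy's actual implementation:

```python
def expand_cond(clauses):
    """Expand (test, result) clauses into nested ('if', test, result, else) forms,
    mirroring how a recursive `cond` macro yields a tree of nested conditionals."""
    if not clauses:
        return None  # no clause matched
    test, result = clauses[0]
    # Each recursive call becomes the else-branch of the enclosing `if` form.
    return ("if", test, result, expand_cond(clauses[1:]))

tree = expand_cond([("a", 1), ("b", 2), ("c", 3)])
# Each clause adds one more nesting level instead of extending a flat elif chain:
# ('if', 'a', 1, ('if', 'b', 2, ('if', 'c', 3, None)))
```

Because every level is an ordinary `if` form rather than the compiler's internal `if*` symbol, the temporary-variable reuse check never fires, and each nesting level gets its own anonymous variable and assignment.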
Second, the compile_if function in hy/core/result_macros.py: lines 643–707 handles the fundamental mismatch between Hy's everything-is-an-expression philosophy and Python's separation of statements and expressions. When conditional bodies contain statements rather than pure expressions, the compiler creates anonymous variables using get_anon_var() (line 681) and generates assignment statements (lines 685–689) to store intermediate results. Each nested conditional receives its own temporary variable and assignment rather than being optimized into a single elif chain, because the compiler must ensure that each branch can be used in both expression and statement contexts. The Result class architecture in hy/compiler.py: lines 144–309 enforces this separation by maintaining both a statements list and an expression context, requiring assignments to bridge between them. This design choice ensures that complex nested conditionals with mixed statements and expressions maintain correct evaluation semantics, but at the cost of generating additional temporary variables and assignments that could theoretically be optimized away in simpler cases.

Figure 10: Example of Comparison between Reference Answer and Human Annotated Answer (hy).

I Ethics and Reproducibility Statements

I.1 Potential Risks

This work focuses on training and evaluating large language models for repository-level code understanding and question answering, using real-world open-source project code. While all questions are grounded in executable repositories and paired with validated reference answers, models may still produce incomplete or misleading responses when codebase exploration or tool usage fails, particularly in cases of implicit reasoning errors or silent failures. This work involves only publicly available open-source repositories and does not include any personal, sensitive, or user-generated content.
I.2 Discuss the License for Artifacts

All released artifacts are provided under permissive licenses suitable for academic research. License terms permit use, modification, and redistribution in accordance with each license’s conditions.

I.3 Artifact Use Consistent With Intended Use

All external datasets and software components were used in accordance with their original license agreements and intended purposes. Derived artifacts are intended solely for research and educational use, and are not authorized for commercial deployment or redistribution.

I.4 Data Contains Personally Identifying Info or Offensive Content

All data was either synthetically generated or obtained from public sources. Automated filters and manual review were applied to ensure that no samples contain personally identifying information or offensive content. All instructions and tables are free of references to real individuals, groups, or sensitive contexts.

I.5 Documentation of Artifacts

All released artifacts are accompanied by documentation describing their structure, content format, intended use, and evaluation methodology. Sufficient metadata and usage instructions are provided to support inspection, reproduction, and downstream research use.

I.6 Parameters for Packages

All external packages used during training and evaluation were applied in accordance with standard practices. Default parameters were used unless otherwise specified. Any deviations from default settings are documented in the accompanying implementation materials.

I.7 Data Consent

This work exclusively uses data derived from publicly available open-source software repositories and their associated issue trackers. All data was collected in accordance with the repositories’ publicly stated licenses and terms of use, which permit research, analysis, and redistribution. As no private, restricted, or user-submitted personal data is included, explicit individual consent was not required.
The data collection and curation process does not involve interaction with repository contributors, nor does it introduce new uses beyond the original public and open-source context.

I.8 AI Assistants in Research or Writing

ChatGPT was used to catch grammar errors in the manuscript.