
Paper deep dive

SWE-Fuse: Empowering Software Agents via Issue-free Trajectory Learning and Entropy-aware RLVR Training

Xin-Cheng Wen, Binbin Chen, Haoxuan Lan, Hang Yu, Peng Di, Cuiyun Gao

Year: 2026 · Venue: arXiv preprint · Area: cs.SE · Type: Preprint · Embeddings: 72

Abstract

Abstract: Large language models (LLMs) have transformed the software engineering landscape. Recently, numerous LLM-based agents have been developed to address real-world software issue-fixing tasks. Despite achieving state-of-the-art performance, these agents face a significant challenge: insufficient high-quality issue descriptions. Real-world datasets often exhibit misalignments between issue descriptions and their corresponding solutions, introducing noise and ambiguity that mislead automated agents and limit their problem-solving effectiveness. We propose SWE-Fuse, an issue-description-aware training framework that fuses issue-description-guided and issue-free samples for training SWE agents. It consists of two key modules: (1) an issue-free-driven trajectory learning module that mitigates potentially misleading issue descriptions while enabling the model to learn step-by-step debugging processes; and (2) an entropy-aware RLVR training module, which adaptively adjusts training dynamics through entropy-driven clipping, applying relaxed clipping under high entropy to encourage exploration and stricter clipping under low entropy to ensure training stability. We evaluate SWE-Fuse on the widely studied SWE-bench Verified benchmark to demonstrate its effectiveness in solving real-world software problems. SWE-Fuse achieves solve rates of 43.0% and 60.2% with its 8B and 32B models, outperforming the best open-source baselines at each scale. Furthermore, integrating SWE-Fuse with test-time scaling (TTS) enables further performance improvements, achieving solve rates of 49.8% and 65.2% under TTS@8 for the 8B and 32B models, respectively.

Tags

ai-safety (imported, 100%) · csse (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/13/2026, 12:38:22 AM

Summary

SWE-Fuse is a training framework for software engineering agents that addresses the challenge of noisy or missing issue descriptions by fusing issue-description-guided and issue-free training samples. It utilizes an issue-free-driven trajectory learning module and an entropy-aware RLVR training module to improve debugging performance and training stability, achieving state-of-the-art results on the SWE-bench Verified benchmark.

Entities (5)

SWE-Fuse · framework · 100%
SWE-bench Verified · benchmark · 100%
Entropy-aware RLVR training · module · 95%
Issue-free-driven trajectory learning · module · 95%
Qwen3 · model-architecture · 90%

Relation Signals (4)

SWE-Fuse evaluated on SWE-bench Verified

confidence 100% · We evaluate SWE-Fuse on the widely studied SWE-bench Verified benchmark

SWE-Fuse includes module Issue-free-driven trajectory learning

confidence 100% · It consists of two key modules: (1) An issue-free-driven trajectory learning module

SWE-Fuse includes module Entropy-aware RLVR training

confidence 100% · and (2) An entropy-aware RLVR training module

SWE-Fuse utilizes model Qwen3

confidence 90% · SWE-Fuse-Qwen3-8B and 60.2% with SWE-Fuse-Qwen3-32B.

Cypher Suggestions (2)

Find all modules associated with the SWE-Fuse framework. · confidence 95% · unvalidated

MATCH (f:Framework {name: 'SWE-Fuse'})-[:INCLUDES_MODULE]->(m:Module) RETURN f.name, m.name

Identify benchmarks used to evaluate the SWE-Fuse framework. · confidence 95% · unvalidated

MATCH (f:Framework {name: 'SWE-Fuse'})-[:EVALUATED_ON]->(b:Benchmark) RETURN f.name, b.name

Full Text

71,636 characters extracted from source content.


SWE-Fuse: Empowering Software Agents via Issue-free Trajectory Learning and Entropy-aware RLVR Training

Xin-Cheng Wen 1∗, Binbin Chen 1, Haoxuan Lan 1, Hang Yu 1†, Peng Di 1†, Cuiyun Gao 2† — 1 Ant Group, 2 The Chinese University of Hong Kong

https://github.com/codefuse-ai/x · https://huggingface.co/datasets/codefuse-ai/x

[Figure 1: scatter of solve rate against model size; x-axis: Model Size (B), y-axis: SWE-Bench Verified (%). Plotted points include SWE-Fuse-Qwen3-8B/32B with and without TTS alongside open- and closed-source baselines.] Figure 1: Results on SWE-bench Verified. SWE-Fuse ranks first among 8B and 32B models.

Abstract

Large language models (LLMs) have transformed the software engineering landscape. Recently, numerous LLM-based agents have been developed to address real-world software issue-fixing tasks. Despite achieving state-of-the-art performance, these agents face a significant challenge: insufficient high-quality issue descriptions. Real-world datasets often exhibit misalignments between issue descriptions and their corresponding solutions, introducing noise and ambiguity that mislead automated agents and limit their problem-solving effectiveness. We propose SWE-Fuse, an issue-description-aware training framework that fuses issue-description-guided and issue-free samples for training SWE agents. It consists of two key modules: (1) an issue-free-driven trajectory learning module that mitigates potentially misleading issue descriptions while enabling the model to learn step-by-step debugging processes; and (2) an entropy-aware RLVR training module, which adaptively adjusts training dynamics through entropy-driven clipping. It applies relaxed clipping under high entropy to encourage exploration, and stricter clipping under low entropy to ensure training stability. We evaluate SWE-Fuse on the widely studied SWE-bench Verified benchmark to demonstrate its effectiveness in solving real-world software problems. SWE-Fuse achieves solve rates of 43.0% and 60.2% with its 8B and 32B models, outperforming the best open-source baselines at each scale. Furthermore, integrating SWE-Fuse with test-time scaling (TTS) enables further performance improvements, achieving solve rates of 49.8% and 65.2% under TTS@8 for the 8B and 32B models, respectively.

∗ This work was done when Xin-Cheng Wen was a research intern at Ant Group. † Correspondence to: Hang Yu <hyu.hugo@antgroup.com>, Peng Di <dipeng.dp@antgroup.com> and Cuiyun Gao <cuiyungao@outlook.com>.

arXiv:2603.07927v1 [cs.SE] 9 Mar 2026

1 Introduction

Large language models (LLMs) have demonstrated substantial capabilities across diverse natural language processing ChatGPT (2022); DeepSeek-AI et al. (2025); Shao et al. (2024); Zhou et al. (2025) and software engineering tasks Yang et al. (2023); Ding et al. (2023); Du et al. (2023), including automated bug repair Bouzenia et al. (2025), code synthesis Ye et al. (2025), and vulnerability detection Wen et al. (2025a). Specialized code-oriented LLMs, such as CodeLlama Rozière et al. (2023), StarCoder2 Lozhkov et al. (2024), and Qwen2.5-Coder Hui et al.
(2024), have achieved performance comparable to human developers on numerous function-level programming tasks. Recently, the evolution of LLM-based software engineering tools has progressed rapidly from basic code autocompletion systems to sophisticated interactive agents capable of autonomous repository-level context processing and end-to-end patch generation Zhang et al. (2023); Liu et al. (2024); Zhang et al. (2024a). To evaluate these capabilities systematically, SWE-bench Jimenez et al. (2024) was introduced as a benchmark dataset comprising real-world GitHub issues. Initial approaches employed conversational repair systems that leveraged execution feedback for iterative refinement of candidate solutions. Subsequent agentic frameworks, including SWE-agent Yang et al. (2024) and OpenHands Wang et al. (2025d), enhanced LLMs with integrated development tools, such as terminals, editors, and search functionality, enabling multi-step reasoning across complex codebases. For example, AutoCodeRover Zhang et al. (2024b) incorporates dynamic execution traces into the LLM-based repair process, and RepoGraph Ouyang et al. (2025) employs static code graph representations to support program analysis, though currently limited to Python repositories. Despite their state-of-the-art performance, these agents still face a significant challenge:

[Figure 2, excerpted:
[ISSUE] Title: TypeError Occurs When Handling Warnings with NoneType in File Operations. Description: When attempting to handle warnings during file closure operations, passing `None` as the expected warning type results in a `TypeError`. This prevents the proper execution of warning management in the code, leading to unexpected crashes. Example buggy code:
```python
def test_warning_handling():
    with warnings_handler(None) as record:
        perform_file_operations()
```
Actual behavior: passing `None` to the warning handler raises "TypeError: exceptions must be derived from Warning, not <class 'NoneType'>". This error disrupts the normal flow of file operations and indicates that the warning handler does not accept `None` as a valid input for expected warning types. [/ISSUE]
[Gold Patch] (diff garbled in extraction): a change to src/PIL/TiffImagePlugin.py in the `_save` path, extending the inversion logic so that both mode "1" and mode "L" images with PHOTOMETRIC_INTERPRETATION == 0 are handled, using ImageOps.invert for mode "L".]
Figure 2: The case in R2E-Gym Jain et al. (2025) demonstrating an issue and gold-patch mismatch.

Lack of sufficiently high-quality issue descriptions: In SWE-bench, each issue description is authored by domain experts and paired with precisely matched test cases to ensure evaluation accuracy Jimenez et al. (2024). However, real-world datasets constructed from authentic scenarios often contain inconsistencies between issue descriptions and their corresponding solutions. These inconsistencies introduce noise and ambiguity that can mislead automated agents, restricting their ability to derive effective solutions Guo et al.
(2025). For example, as shown in Fig. 2, the issue is about warnings handling in testing, while the patch fixes TIFF image encoding logic. These are two unrelated problems in different parts of the codebase. Specifically, the issue describes a problem with the warnings handler: when None is passed to warnings_handler(None), it causes a "TypeError: exceptions must be derived from Warning" (shaded blue). This is a warnings-mechanism problem that should be fixed by modifying how warnings_handler handles None input. However, the gold patch fixes something entirely different: the TIFF image saving logic, modifying PIL/TiffImagePlugin.py to change how images are inverted (shaded in red). Furthermore, high-quality issue-PR pairs are difficult to acquire at scale, and the available datasets remain relatively limited in size. For instance, in the SWE-smith Yang et al. (2025b) dataset, 18,033 samples (30.49% of the total 59,136 issues) contain empty problem statements, representing a substantial data quality challenge.

Our work. To address these challenges, we propose SWE-Fuse, an issue-description-aware training framework that fuses issue-description-guided and issue-free samples for training SWE agents. Our framework enables easier training from base models via a cold-start prior fine-tuning process, achieves stable training with reinforcement learning with verifiable reward (RLVR) Wen et al. (2025b), and requires only basic bash commands as tool calls for the sandbox environment. Specifically, SWE-Fuse consists of two key components: (1) an issue-free-driven trajectory learning module, which comprises a multi-step trajectory construction component for generating high-quality multi-turn reasoning-action trajectories, a trajectory data filtering process for ensuring data quality, and issue-free-driven supervised fine-tuning to mitigate interference from potentially misleading issue descriptions while enabling the model to learn step-by-step debugging processes; and (2) an entropy-aware RLVR training module, which adaptively adjusts training dynamics through entropy-driven clipping. It applies relaxed clipping under high entropy to encourage exploration, and stricter clipping under low entropy to ensure training stability. We evaluate SWE-Fuse on the widely used SWE-bench Verified Jimenez et al. (2024) benchmark. Experimental results demonstrate that SWE-Fuse achieves new state-of-the-art performance among open-source models at the 8B and 32B scales, resolving 43.0% of issues with SWE-Fuse-Qwen3-8B and 60.2% with SWE-Fuse-Qwen3-32B. Furthermore, integrating SWE-Fuse with test-time scaling (TTS) enables further performance improvements, achieving solve rates of 49.8% and 65.2% under TTS@8 for the 8B and 32B models, respectively. These results demonstrate that SWE-Fuse rivals or exceeds the performance of more complex and computationally expensive training methods.

Contributions. The major contributions of this paper are summarized as follows:

1. We propose SWE-Fuse, an issue-description-aware training framework that fuses issue-description-guided and issue-free samples for training SWE agents. SWE-Fuse effectively enables reasoning about complex SWE patterns through the issue-free-driven trajectory learning and entropy-aware RLVR training modules.

2. We introduce the SWE-Fuse trajectory dataset, comprising 14k validated and correct trajectories. The SWE-Fuse dataset is constructed from two types of samples: with issue descriptions and issue-free.
Issue-free samples help the model avoid influence from noisy descriptions while enabling the LLM to identify problems through systematic debugging.

3. SWE-Fuse resolves 60.2% of issues in SWE-bench Verified, outperforming the prior state of the art among open-source models with 32B parameters.

2 Proposed Framework

2.1 Overview

Fig. 3 presents an overview of the proposed SWE-Fuse framework. The primary objective of SWE-Fuse is to employ a cold-start process that enables the model to first acquire fundamental reasoning capabilities through multi-turn interactions. Concurrently, the model learns to identify and resolve problems by iteratively debugging test case failures across multiple rounds. Guided by RLVR, the model then iteratively selects optimal reasoning steps to accurately complete software engineering tasks, achieving faster convergence under entropy-based guidance.

First, the process begins with the issue-free-driven trajectory learning module, which comprises three components: (1) multi-step trajectory construction, which generates SWE-related reasoning data; (2) trajectory data filtering, which prevents Git exploitation and mitigates the impact of low-quality data; and (3) issue-free-driven SFT, which enables the model to complete tasks in issue-free scenarios. Subsequently, SWE-Fuse employs an entropy-aware RLVR training module. It ensures faster model convergence and more efficient interaction with the execution environment.

[Figure 3: pipeline diagram showing the issue-free-driven trajectory learning module (multi-step trajectory construction, trajectory data filter, issue-free-driven SFT with masked multi-step fine-tuning) feeding the entropy-aware RLVR training module (group sampling and relative advantage in RLOO, entropy normalization, entropy-adaptive clipping, policy update).] Figure 3: The overview of SWE-Fuse.

2.2 Issue-Free-driven Trajectory Learning Module

We propose the issue-free-driven trajectory learning module to facilitate the initial learning of trajectory reasoning knowledge. As shown in Fig. 3, it mainly contains three phases: (a) multi-step trajectory construction, (b) trajectory data filtering, and (c) issue-free-driven SFT training, detailed below.

2.2.1 Multi-step Trajectory Construction

Environment Construction. The environment construction serves two primary purposes: (i) enabling multi-step trajectory generation, and (ii) supporting high-concurrency execution during the RLVR phase while enforcing appropriate resource constraints. The construction pipeline comprises three steps: (1) Repository Collection. SWE-Fuse builds upon SWE-smith, a dataset of 50,137 instances drawn from 128 executable GitHub repositories. From this corpus, we select more than 33,274 issues with permissive licenses and evidence of active maintenance. (2) Sandbox Reproduction. We leverage the Docker images provided by SWE-smith Yang et al. (2025b). We retain only repositories that build successfully and pass all sanity checks, resulting in verified base environments for subsequent task construction. (3) Sandbox Manager Construction. We develop a sandbox manager for scalable and reliable execution, which consists of a management layer for scheduling different environments and an execution environment layer for single-sandbox execution. (A) The management layer serves as the central scheduling hub responsible for data routing and state management. It communicates upstream via REST APIs to handle user requests, coordinates downstream execution through tool-call mechanisms to interact with AI sandboxes, and performs file-level data persistence. (B) The execution environment layer, built upon the mini-SWE-agent-plus scaffolding framework, integrates multiple execution engines including bash and a code interpreter, with extensibility to accommodate additional tool types.
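As a rough sketch of this two-layer design, the snippet below shows a management layer scheduling resource-limited Docker sandboxes and an execution call that runs one bash command inside them. All names here (SandboxManager, Execution, the image tag) are illustrative assumptions, not the paper's implementation, which additionally exposes REST APIs and tool-call routing.

```python
# Hypothetical sketch of the sandbox manager described above.
import subprocess
from dataclasses import dataclass

@dataclass
class Execution:
    returncode: int
    output: str

class SandboxManager:
    """Management layer: routes commands to per-task sandboxes and keeps results."""

    def __init__(self, image: str, cpus: int = 2, memory: str = "2g"):
        self.image, self.cpus, self.memory = image, cpus, memory

    def run(self, command: str, timeout: int = 300) -> Execution:
        """Execution layer: one bash command in a fresh, resource-capped container."""
        proc = subprocess.run(
            ["docker", "run", "--rm", f"--cpus={self.cpus}", "-m", self.memory,
             self.image, "bash", "-lc", command],
            capture_output=True, text=True, timeout=timeout,
        )
        return Execution(proc.returncode, proc.stdout + proc.stderr)

# Example with a hypothetical image tag:
# SandboxManager("swesmith/task:latest").run("pytest -x")
```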
SWE-Agentic Task Formalization and Trajectory Rollout. In SWE-bench agentic tasks, we formalize the problem as a sequential decision-making process in which an LLM agent $\pi_\theta$ interacts with a software repository environment $\mathcal{E}$ to resolve GitHub issues. Each episode begins with an initial state $s_0$ containing the issue description and repository sandbox, and proceeds through a sequence of interactions until task completion.

ReAct Paradigm. At each timestep $t$, the agent executes a two-phase interaction cycle following the ReAct paradigm:

• Reasoning Phase: The agent generates an internal reasoning trace $r_t \in V^*$ to analyze the problem and plan next steps.
• Action Phase: The agent selects and executes a tool operation $c_t$ to modify or inspect the repository.

Environment Feedback. Upon receiving operation $c_t$, the environment transitions to a new state and returns an observation:

$s_{t+1}, o_t \sim \mathcal{E}(\cdot \mid s_t, c_t)$  (1)

The observation $o_t$ contains execution results (e.g., file contents, test outputs, error messages). The updated history becomes $h_{t+1} = h_t \oplus (r_t, c_t, o_t)$.

Trajectory and Termination. A complete trajectory $\tau = (s_0, r_0, c_0, o_0, \ldots, r_T, c_T, o_T)$ ends when: (1) the agent emits a terminal action $c_T = \text{submit}$, (2) the step limit $T_{\max}$ is reached, or (3) the token limit is reached. The trajectory receives a binary reward:

$R(\tau) = \mathbb{I}[\text{patch}(\tau) \text{ passes all tests in } \mathcal{E}]$  (2)

where patch($\tau$) denotes the code changes accumulated throughout the trajectory.

With the ReAct Yao et al. (2023) paradigm formalized, we now collect expert trajectories for SFT. We adopt the Mini-SWE-Agent-Plus Wang et al. (2025c) scaffold with Gemini 3 Gem (2025) as the teacher agent, configuring the maximum number of interaction turns to $T_{\max} = 100$. Gemini 3 offers competitive coding and reasoning capabilities on challenging SWE tasks, making it suitable for generating high-quality demonstration trajectories for distillation. Since Gemini 3 is a closed-source model, we cannot access its internal thought process. Therefore, we explicitly inject a special marker token <THOUGHT> into the trajectory format during collection. Specifically, we modify the interaction template to explicitly separate the reasoning phase, so that each turn $h_t$ is serialized as "THOUGHT: $r_t$" followed by a fenced ```bash``` block containing $c_t$. This structured format serves three purposes: (1) it makes the reasoning traces explicitly visible in the training data, (2) it teaches the student model to generate explicit thought processes during inference, and (3) it maintains a simple format that facilitates learning for smaller LLMs (less than 32B). While Gemini 3's actual internal reasoning may differ from the verbalized thoughts, this approach provides a reasonable proxy for teaching the student model to reason before acting.
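A minimal sketch of the rollout and serialization just described. The env/agent interfaces (reset, step, act, patch_passes_all_tests) are assumptions made for illustration; only the ReAct turn structure, the submit/step-limit termination, and the binary reward of Eq. (2) come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)  # (reasoning r_t, command c_t, observation o_t)
    reward: float = 0.0

def serialize_turn(reasoning: str, command: str) -> str:
    # Interaction template: explicit THOUGHT reasoning, then one bash action.
    return f"THOUGHT: {reasoning}\n```bash\n{command}\n```"

def rollout(agent, env, max_turns: int = 100) -> Trajectory:
    traj, obs = Trajectory(), env.reset()          # s_0: issue text + repo sandbox
    for _ in range(max_turns):                     # T_max = 100 in the paper's setup
        reasoning, command = agent.act(obs)        # ReAct: reason, then pick a tool call
        obs = env.step(command)                    # Eq. (1): env returns next observation
        traj.steps.append((reasoning, command, obs))
        if command.strip() == "submit":            # terminal action ends the episode
            break
    traj.reward = float(env.patch_passes_all_tests())  # Eq. (2): binary reward
    return traj
```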
2.2.2 Trajectory Data Filter

Preventing Git Exploitation Filtering. It has recently been recognized by the SWE community git (2025) that LLM agents can unexpectedly exploit git metadata to locate the ground-truth patch by directly inspecting commit logs. To address this, we perform the following filtering procedures. First, in SWE-bench Verified, we remove all commits and log messages dated after the issue creation date to ensure that future fixes remain invisible to the agent. Second, during trajectory collection, we filter out any trajectories containing commands such as "git show" or "git log" that could expose sanitized repository history to the agent, thereby preventing the model from exploiting git metadata to bypass the intended learning process. In Section 4.2.3, we provide a detailed empirical analysis demonstrating that our trajectory data and the corresponding trained models are not susceptible to such git metadata exploitation, thereby ensuring the integrity and validity of the trajectory dataset and SWE-Fuse.

Rule-based Filtering. Subsequently, rule-based filtering is applied to assess the format and content of the trajectories. The main filtering criteria are as follows: (1) We filter out samples with fewer than 5 interaction rounds. For samples with fewer than 10 rounds, we verify trajectory correctness by evaluating all test cases to ensure validity. (2) Samples lacking intermediate reasoning steps are identified and filtered using regular expressions, as they fail to adequately capture the reasoning process. (3) We enforce a strict response format for bash commands, where tags such as "```bash```" denote the final bash action; samples not conforming to this format are discarded. (4) We ensure that model-generated trajectories contain only English text. Although non-English content may originate from the sandbox environment, we discard all trajectories containing such content to maintain consistency.
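These criteria translate naturally into a single pass over each collected trajectory. The sketch below is an assumed implementation: the turn schema is hypothetical, the round threshold, banned git commands, bash-format rule, and English-only rule follow the description above, and the non-English check is deliberately crude (non-ASCII). The additional test-based re-verification of trajectories with fewer than 10 rounds is elided.

```python
import re

GIT_EXPLOIT = re.compile(r"\bgit\s+(show|log)\b")    # commands exposing history
BASH_ACTION = re.compile(r"```bash\n[\s\S]+?\n```")  # required final-action format
NON_ASCII = re.compile(r"[^\x00-\x7F]")              # crude non-English proxy

def keep_trajectory(turns: list[dict]) -> bool:
    """turns: [{'thought': str, 'action': str (raw model turn)}, ...] (assumed schema)."""
    if len(turns) < 5:                               # (1) too few interaction rounds
        return False
    for turn in turns:
        if not turn["thought"].strip():              # (2) missing reasoning step
            return False
        if not BASH_ACTION.search(turn["action"]):   # (3) malformed bash action
            return False
        if GIT_EXPLOIT.search(turn["action"]):       # git-metadata exploitation
            return False
        if NON_ASCII.search(turn["thought"]):        # (4) non-English content
            return False
    return True
```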
2.2.3 Issue-Free-driven Supervised Fine-tuning

Due to the limited availability of high-quality issue descriptions, we include a subset of samples without issue descriptions (referred to as "issue-free" samples). For these samples, we provide partial test cases to enable the model to learn through step-by-step debugging. The rationale behind this approach is to mitigate potential noise introduced by imprecise or misleading issue descriptions. As illustrated in Fig. 2, models are often misguided by inaccurate issue descriptions, leading to incorrect exploration trajectories. Specifically, we retain all problem information except the issue description, and generate trajectories following the same procedure as for samples with issue descriptions. However, for issue-free samples, we only include successful trajectories where the model correctly resolves the problem. We then construct a mixed dataset $\mathcal{D}_{\text{mixed}} = \mathcal{D}_{\text{issue}} \cup \mathcal{D}_{\text{issue-free}}$. Each instance in $\mathcal{D}_{\text{issue}}$ comprises an initial guidance $x$ and an issue description $i$. Each instance in $\mathcal{D}_{\text{issue-free}}$ comprises only the initial guidance $x$, with $i = \emptyset$; these samples serve as implicit bug specifications that guide the debugging process without explicit textual descriptions.

We utilize the multi-turn interaction trajectories generated in the preceding multi-step trajectory construction step. Each trajectory $\tau = (s_0, r_0, c_0, o_0, \ldots, r_T, c_T, o_T)$ consists of alternating agent actions $\alpha_t = \langle r_t, c_t \rangle$ (reasoning $r_t$ and action operation $c_t$) and environment observations $o_t$. The model is trained to autoregressively predict the entire decision sequence by maximizing the log-likelihood over complete trajectories. The objective can be expressed as:

$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, i, \{\alpha_t, o_t\}_{t=1}^{T}) \sim \mathcal{D}_{\text{mixed}}} \left[ \sum_{t=1}^{T} \log \pi_\theta(\alpha_t \mid x, i, \{\alpha_j, o_j\}_{j<t}) \right]$  (3)

This formulation enables the model to learn the sequential decision-making policy across multiple interaction rounds.

2.3 Entropy-aware RLVR Training Module

RLVR-based methods update a policy using relative credit assignment computed within a group of $G$ sampled completions for the same prompt, thereby avoiding an explicit critic function. A key stabilizer in RLVR implementations is a PPO-style probability-ratio constraint Liu et al. (2025b), which prevents the new policy $\pi_\theta$ from drifting too far from the behavior policy $\pi_{\theta_{\text{old}}}$ in a single update. However, when the policy is uncertain (high entropy), overly tight clipping can impede learning; when the policy is confident (low entropy), overly loose clipping can induce sudden distribution shift and degrade performance Cui et al. (2025). We therefore propose an entropy-aware RLVR training module, comprising the following components.

Group Sampling and Relative Advantage in RLOO. Given a query (prompt) $q$, we sample a group of $G$ SWE-task trajectory outputs $\{o_i\}_{i=1}^{G}$ from the behavior policy $\pi_{\theta_{\text{old}}}(\cdot \mid q)$. Each output $o_i$ is assigned a scalar reward $R_i$ (e.g., task correctness only). Reward leave-one-out (RLOO) Ahmadian et al. (2024) constructs a variance-reduced, sample-specific baseline by taking the mean reward of the other group members (i.e., leaving out the current sample). Concretely, for each $i \in \{1, \ldots, G\}$, we define the leave-one-out baseline and the corresponding advantage as follows:

$A_i = R_i - \frac{1}{G-1} \sum_{j=1, j \neq i}^{G} R_j$  (4)

Compared with the full group-mean baseline used in GRPO DeepSeek-AI et al. (2025), this yields an unbiased estimate of the expected reward baseline Bereket & Leskovec (2025) because it removes the self-coupling of $R_i$ with its own baseline. In the small-model training paradigm, this reduced coupling often translates into a lower-variance and therefore more stable advantage estimate for policy optimization, where reward variance and sampling noise are typically higher.

Entropy Normalization. For each sampled output $o_i = (a_{i,1}, \ldots, a_{i,T_i})$ under query $q$, we compute a sequence-level entropy Cheng et al. (2025) $H_i$ (e.g., the mean token entropy along the trajectory):

$H_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \left( -\sum_{a} \pi_\theta(a \mid s_{i,t}) \log \pi_\theta(a \mid s_{i,t}) \right)$  (5)

where $s_{i,t}$ denotes the decoding state (prefix) at time step $t$. Then, for each sampled sequence-level entropy $H_i$, we normalize it within the current minibatch:

$H_{\text{norm},i} = \frac{H_i - \min_{\text{batch}}(H)}{\max_{\text{batch}}(H) - \min_{\text{batch}}(H)} \in [0, 1]$  (6)

This batch normalization removes scale effects caused by different prompts and action spaces, so that $H_{\text{norm},i}$ measures the relative uncertainty of sample $i$ within the current optimization context.
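A compact numpy sketch of Eqs. (4)-(6), computing RLOO advantages and batch-normalized sequence entropies. The inputs (per-trajectory rewards and per-token entropies) are assumed to be precomputed; the small epsilon guarding a constant-entropy batch is our own addition.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Eq. (4): leave-one-out baseline over a group of G rewards."""
    G = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (G - 1)  # mean of the *other* G-1 rewards
    return rewards - baseline

def normalized_entropy(token_entropies: list[np.ndarray]) -> np.ndarray:
    """Eq. (5): mean token entropy per trajectory; Eq. (6): min-max over the batch."""
    H = np.array([h.mean() for h in token_entropies])
    span = H.max() - H.min()
    return (H - H.min()) / (span + 1e-8)            # guard a constant-entropy batch

# Example with G = 4 binary rewards for one prompt:
A = rloo_advantages(np.array([1.0, 0.0, 0.0, 1.0]))  # -> [0.667, -0.667, -0.667, 0.667]
```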
Entropy-Adaptive Clipping. We propose a dynamic clipping mechanism that adapts the trust-region width at the sample level by jointly considering (i) the policy's uncertainty over the sampled trajectory and (ii) the update direction indicated by the estimated advantage. Advantage estimates are susceptible to noise arising from finite batch sizes and stochastic reward signals. Under such conditions, excessively large updates in the negative direction are particularly detrimental, as they may prematurely suppress exploratory behaviors that could prove beneficial, especially when the sign of the advantage estimate is corrupted by noise. Concretely, we map $H_{\text{norm},i}$ to a clipping radius $\varepsilon_i \in [\varepsilon_{\min}, \varepsilon_{\max}]$ using an adaptive rule:

$\varepsilon_i = \begin{cases} \varepsilon_{\min} + (\varepsilon_{\max} - \varepsilon_{\min})\, H_{\text{norm},i}, & A_i > 0 \\ \varepsilon_{\max} - (\varepsilon_{\max} - \varepsilon_{\min})\, H_{\text{norm},i}, & A_i \le 0 \end{cases}$  (7)

When $A_i > 0$, higher entropy implies greater uncertainty and therefore a larger admissible update region, enabling faster reinforcement of relatively better-than-baseline samples. When $A_i \le 0$, higher entropy triggers a smaller clipping radius, reflecting a conservative stance toward decreasing the probability of a sample under uncertainty, thereby mitigating the risk of over-penalization caused by noisy or spuriously negative advantages.
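The sketch below plugs Eq. (7) into a standard PPO-style clipped surrogate. Only the per-sample radius is from the paper; the surrounding loss is the usual clipped objective, and the torch usage plus the default eps_min/eps_max values (0.1, 0.3) are illustrative assumptions.

```python
import torch

def entropy_adaptive_eps(h_norm, adv, eps_min=0.1, eps_max=0.3):
    """Eq. (7): widen the trust region for uncertain positive-advantage samples,
    narrow it for uncertain negative-advantage samples."""
    widen = eps_min + (eps_max - eps_min) * h_norm   # A_i > 0: explore more when uncertain
    narrow = eps_max - (eps_max - eps_min) * h_norm  # A_i <= 0: penalize cautiously
    return torch.where(adv > 0, widen, narrow)

def clipped_surrogate(logp_new, logp_old, adv, h_norm):
    ratio = torch.exp(logp_new - logp_old)           # pi_theta / pi_theta_old
    eps = entropy_adaptive_eps(h_norm, adv)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)   # per-sample clipping radius
    return -torch.min(ratio * adv, clipped * adv).mean()
```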
3 Experimental Setup

This section describes the experimental setup, including the task formulation and benchmark, the models evaluated, and the metrics used to assess performance.

3.1 Research Questions

RQ1: How effective is SWE-Fuse compared with state-of-the-art software agent approaches?
RQ2: What is the impact of incorporating different trajectory data during the SFT phase?
RQ3: How does the RLVR algorithm contribute to the performance of SWE-Fuse?

3.2 Task Formulation and Benchmark

Our research focuses on real-world software engineering tasks extracted from GitHub issues, encompassing both bug fixes and feature implementations. We employ the widely adopted SWE-bench Verified benchmark for evaluation. In RQ1, we utilize the complete SWE-bench Verified benchmark to ensure fair comparison with existing baselines. For the remaining RQs, we carefully curate a representative subset of 200 tasks to balance experimental rigor with computational feasibility. This subset is constructed through stratified random sampling from the full SWE-bench Verified benchmark, preserving the original proportional distribution of tasks across repositories while excluding cases that may cause evaluation inaccuracies due to network-related issues or other external factors. Table 1 presents a statistical comparison of the task distribution between our subset and the full SWE-bench Verified benchmark.

Table 1: Statistics of project distribution in SWE-bench Verified and our subset.

Project                      Total   Subset
astropy/astropy                 22        9
django/django                  231       92
matplotlib/matplotlib           34       14
mwaskom/seaborn                  2        1
pallets/flask                    1        1
psf/requests                     8        3
pydata/xarray                   22        9
pylint-dev/pylint               10        4
pytest-dev/pytest               19        7
scikit-learn/scikit-learn       32       13
sphinx-doc/sphinx               44       17
sympy/sympy                     75       30
Total                          500      200

3.3 Baselines and Evaluation Metrics

Baselines. We compare SWE-Fuse with several widely used frameworks and models that have been evaluated on SWE-bench Verified. Our selection criterion is based on the availability of source code and generated patches. To ensure fair and accurate comparison, we select both open-source and closed-source models across two parameter-scale categories: 7B+ and 30B+. The evaluated models include the Qwen series Yang et al. (2025a); Hui et al. (2024), GPT series OpenAI (2025b; 2024), Claude series Anthropic (2025), and other state-of-the-art works. These models are deployed across 6 distinct scaffolding frameworks, including MOpenHands Zan et al. (2025), OpenHands Wang et al. (2025d), R2E-Gym Jain et al. (2025), SWE-agent Yang et al. (2024), Agentless Xia et al. (2024), and Mini-SWE-agent-plus Wang et al. (2025c).

Table 2: Summary statistics of the trajectories dataset.

Statistic            Total         Mean        Min     Max
Valid Trajectories   14,350        -           -       -
Instances            14,329        -           -       -
Projects             111           -           -       -
Interaction Rounds   401,958       28.05       10      98
Token Consumption    281,938,584   19,676.08   4,136   65,115

Evaluation Metrics. We evaluate performance on SWE-bench Verified using a single metric: the issue resolve rate, defined as the percentage of issues successfully resolved by the generated patches, as verified by test cases. We compare the performance of SWE-Fuse against baseline results reported in their respective publications and on the official SWE-bench leaderboard. Additionally, we conduct a fine-grained analysis at the repository level to identify all valid patches and quantify the number of issues uniquely resolved by our approach compared to existing baselines.

4 Experimental Results

4.1 RQ1: Comparison with SOTA

To assess the effectiveness of SWE-Fuse, we compare SWE-Fuse models against both open-source and closed-source models on SWE-bench Verified.

Table 3: Performance comparison of SWE-Fuse and baselines on SWE-bench Verified. Columns: Model | Scaffold | Training | Resolve Rate (%).

Open-source models, parameters ≈ 7B:
Qwen3-8B (Yang et al., 2025a) | OpenHands | - | 7.6
SWE-Gym-7B (Pan et al., 2025) | OpenHands | SFT | 10.6
SWE-agent-LM-7B (Yang et al., 2025b) | SWE-agent | SFT | 15.2
Lingma-SWE-GPT-7B (Ma et al., 2025) | SWESynInfer | SFT | 18.2
SWE-Mirror-LM-7B (Wang et al., 2025b) | MOpenHands | SFT | 22.8
SWE-Dev-7B (Wang et al., 2025a) | OpenHands | RL | 23.4
Klear-Agent-8B-SFT (Wang et al., 2025c) | Mini-SWE-agent-plus | SFT | 39.4
SWE-Fuse-8B | Mini-SWE-agent-plus | SFT + RL | 43.0
SWE-Fuse-8B + TTS@8 | Mini-SWE-agent-plus | SFT + RL | 49.8

Open-source models, parameters ≈ 30B:
Qwen3-32B (Yang et al., 2025a) | OpenHands | - | 23.2
SWE-Gym-32B (Pan et al., 2025) | OpenHands | SFT | 20.6
R2E-Gym-32B (Jain et al., 2025) | R2E-Gym | SFT | 34.4
R2E-Gym-32B + TTS@16 | R2E-Gym | SFT | 49.4
SWE-Dev-32B (Wang et al., 2025a) | OpenHands | RL | 36.6
Skywork-SWE-32B (Zeng et al., 2025b) | OpenHands | SFT | 38.0
Skywork-SWE-32B + TTS@8 | OpenHands | SFT | 47.0
SWE-agent-LM-32B (Yang et al., 2025b) | SWE-agent | SFT | 40.2
DeepSWE-32B-Preview (Luo et al., 2025) | OpenHands | RL | 42.2
DeepSWE-32B-Preview + TTS@16 | OpenHands | RL | 59.0
SWE-Mirror-LM-32B (Wang et al., 2025b) | MOpenHands | SFT | 52.2
CWM-32B (Copet et al., 2025) | Agentless | CPT + SFT + RL | 53.9
SWE-Fuse-32B | Mini-SWE-agent-plus | SFT + RL | 60.2
SWE-Fuse-32B + TTS@8 | Mini-SWE-agent-plus | SFT + RL | 65.2

Open-source models, parameters ≥ 100B:
DeepSeek-V3-0324 (DeepSeek-AI, 2024) | Internal pipeline | - | 45.4
GLM-4.5 (Zeng et al., 2025a) | OpenHands | - | 64.2
Kimi-K2 (Bai et al., 2025) | Agentless | - | 65.8
Qwen3-Coder-480B (Yang et al., 2025a) | OpenHands | - | 67.0

Closed-source models:
OpenAI-o3 (OpenAI, 2025b) | Mini-SWE-agent | - | 58.4
Claude-4-Sonnet (Anthropic, 2025) | SWE-agent | - | 66.6
Claude-4.5-Sonnet (Anthropic, 2025) | Mini-SWE-agent | - | 70.6
GPT-5 (OpenAI, 2025a) | OpenHands | - | 71.8
Gemini 3 Pro Preview (Google, 2025) | Mini-SWE-agent | - | 74.2

4.1.1 SWE-Fuse vs. Open-Source Models

The results summarized in Table 3 demonstrate that SWE-Fuse consistently outperforms all open-source models in both the 8B and 32B categories. Specifically, SWE-Fuse achieves a relative improvement of 9.1% for 8B models and 11.7% for 32B models over the strongest open-source baselines. Furthermore, SWE-Fuse represents the only 8B baseline that successfully applies both SFT and RL training. These results indicate that with proper SFT initialization, the SWE-Fuse-8B model effectively learns from multi-turn agentic RL tasks and achieves further performance gains through RLVR, without requiring extensive computational resources for RL training. Furthermore, integration with TTS built upon SWE-Fuse allows performance to be improved even further, achieving solve rates of 49.8% and 65.2% under TTS@8 for the 8B and 32B models, respectively.

4.1.2 SWE-Fuse vs. Closed-Source Models

As shown in Table 3, SWE-Fuse demonstrates competitive performance compared to proprietary models. Specifically, SWE-Fuse outperforms OpenAI-o3 by 1.8% in resolve rate, while remaining below Claude-4-Sonnet and Claude-4.5-Sonnet. The competitive performance of SWE-Fuse can be attributed to effective trajectory learning, which enables the model to acquire SWE-specific knowledge more efficiently, allowing a 32B model to achieve performance comparable to models with approximately 1T parameters. However, the remaining performance gap can be attributed to two factors: first, the model size inherently limits knowledge capacity; second, training on a single task leaves substantial room for improvement through multi-task learning.

Answer to RQ1: SWE-Fuse achieves the best overall performance across all open-source models in both the 8B and 32B categories, with improvements of 9.1% and 11.7% in resolve rate on the SWE-bench Verified dataset. SWE-Fuse also achieves solve rates of 49.8% and 65.2% under TTS@8 for the 8B and 32B models, respectively.
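The TTS@k numbers above correspond to sampling k candidate trajectories per issue and selecting one. The paper does not detail its selection rule in the text, so the verifier-based choice in the sketch below is an assumption; common choices include execution-based checks or a learned scorer.

```python
# Hedged sketch of test-time scaling (TTS@k). The env_factory and verifier
# interfaces are hypothetical; rollout() reuses the sketch from Section 2.2.1.
def tts_solve(issue, agent, env_factory, verifier, k: int = 8):
    scored = []
    for _ in range(k):
        env = env_factory(issue)           # fresh sandbox per candidate rollout
        traj = rollout(agent, env)         # sample one full trajectory
        scored.append((verifier(env, traj), traj))
    return max(scored, key=lambda s: s[0])[1]  # keep the best-scoring trajectory
```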
4.2 RQ2: Analysis of Trajectories Data in SFT

To systematically explore the contribution of trajectory reasoning data within SWE-Fuse during the SFT phase, we analyze the impact of three key factors on performance: (1) the quantity of generated trajectories, to examine the effect of data scaling; (2) varying ratios of samples with issue descriptions versus issue-free samples, to investigate whether samples without descriptions benefit the model; and (3) potential Git hacking risks in our generated trajectory data. To balance experimental rigor with computational costs, we use Qwen3-8B as the base model and evaluate on the representative subset of 200 tasks shown in Table 1 in all RQ2 experiments.

4.2.1 Impact of Trajectories Sample Size

In this experiment, we systematically vary the training data size across multiple orders of magnitude, ranging from 1k to 14k samples, while preserving the respective sampling constraints at each scale. For each data size configuration, we train models using identical hyperparameters and evaluate their performance on the held-out SWE-bench Verified test set to ensure fair comparison. To maintain consistency across scales, each larger dataset is constructed as a superset of all smaller datasets.

Table 4: Impact of trajectory count on SWE-Fuse performance.

Size   Resolve Rate (%)   Resolved Issues
0      13.5               27/200
1k     24.5 (+11.0)       49/200
2k     31.5 (+18.0)       63/200
4k     33.5 (+20.0)       67/200
8k     35.0 (+21.5)       70/200
All    39.0 (+25.5)       78/200

As shown in Table 4, our experimental results demonstrate a clear positive correlation between training set size and model performance. Specifically, the resolve rate increases monotonically from 13.5% with no trajectory training to 39.0% when trained on the complete dataset, representing a 2.9× improvement in performance. The model exhibits substantial gains during the initial scaling phase, achieving a 24.5% resolve rate with only 1k training examples. The marginal improvements gradually diminish as training data increases. This trend suggests that while additional training data consistently improves performance, the model begins to approach diminishing returns beyond 4k to 8k training examples.
Nevertheless, even the fully trained model resolves only 78 out of 200 issues (39.0%), indicating substantial room for improvement and highlighting the challenging nature of automated issue resolution in real-world software engineering contexts. These findings underscore the importance of high-quality training data in few-shot learning scenarios and suggest that data efficiency remains a critical consideration for training SWE agents.

4.2.2 Impact of Issue Descriptions and Issue-free Samples Ratio

In this experiment, we investigate the impact of varying the ratio of issue-free samples in the training data. Issue-free samples refer to trajectories that do not contain explicit issue context. We systematically vary the issue-free ratio from 0% (all samples contain issue descriptions) to 100% (no issue descriptions included), with intermediate configurations at 25%, 50%, and 75%. For each ratio configuration, we maintain a fixed total training size (4k) and evaluate model performance on the SWE-bench Verified test set under identical experimental conditions.

Table 5: Effect of the issue-description vs. issue-free sample ratio on performance.

Ratio (%)   Resolve Rate (%)   Resolved Issues
0           33.5               67/200
25          34.0 (+0.5)        68/200
50          34.0 (+0.5)        68/200
75          30.5 (-3.0)        61/200
100         30.5 (-3.0)        61/200

As shown in Table 5, our results reveal that the resolve rate remains relatively stable when incorporating up to 50% issue-free samples, maintaining approximately a 34.0% resolve rate (68/200 issues) compared to 33.5% (67/200 issues) with purely issue-based training. However, performance degrades notably when the issue-free ratio exceeds 75%, with the resolve rate dropping to 30.5%, a 10.3% relative performance reduction compared to the optimal configuration. These findings suggest that while general code-modification patterns captured by issue-free samples can provide complementary debugging knowledge, issue-specific context remains crucial for effective automated issue resolution. The optimal performance at moderate issue-free ratios (25-50%) indicates that a balanced mixture of issue and issue-free trajectories may help the model learn both task-specific problem-solving strategies and generalizable test-case debugging patterns.

4.2.3 Potential Git Hacking Risks

Docker images of SWE-bench Verified released prior to September 3, 2025 may contain ground-truth commits, potentially enabling "git hacking" scenarios. To verify that SWE-Fuse does not exploit these environmental shortcuts, we conducted a systematic analysis of agent trajectories. We demonstrate that the agent relies primarily on genuine problem-solving capabilities rather than accessing ground-truth solutions through git history. We manually check a sample of agent trajectories across successful resolutions. The analysis reveals minimal instances of git-history exploration commands (e.g., git log) in successful task completions. For example, when agents consider git commands, they quickly recognize environmental constraints (e.g., shallow clones, absent .git directories) and pivot to alternative problem-solving strategies, as illustrated by the following agent reasoning:

Trajectory example: "... Let's check git log or git blame? I cannot run git commands that require history if it's a shallow clone or if .git is not fully available. Let me verify: ls -la ..."
These findings provide strong evidence that our agent's performance stems from learned problem-solving capabilities rather than exploitation of benchmark artifacts.

Answer to RQ2: SWE-Fuse exhibits clear data-scaling benefits, achieves optimal performance with 25-50% issue-free trajectories that balance task-specific and generalizable debugging patterns, and relies on problem-solving capabilities rather than benchmark exploitation.

4.3 RQ3: Analysis of RLVR

To demonstrate the effectiveness of our proposed entropy-aware RLVR training module, we present three training curves in Fig. 7: (1) Qwen3-32B trained without the cold-start phase (i.e., w/o IFTL Qwen3-32B), (2) Qwen3-8B with IFTL initialization (i.e., w/ IFTL Qwen3-8B), and (3) Qwen3-32B with IFTL initialization (i.e., w/ IFTL Qwen3-32B). The reward metric increases progressively throughout training across all settings. Notably, Qwen3-32B trained without the cold-start phase requires a longer training period compared to SFT-initialized models (more than 200 steps), as it must learn task-specific behaviors from scratch without the benefit of SFT on domain-relevant trajectories.

[Figures 4-6: training reward over RL steps, shown as raw and smoothed (0.6) curves. Figure 4: w/o IFTL Qwen3-8B; Figure 5: w/ IFTL Qwen3-8B; Figure 6: w/ IFTL Qwen3-32B.] Figure 7: The training reward curves of ERLVR in SWE-Fuse.

Conversely, SFT-initialized models exhibit rapid performance gains during RLVR and achieve higher ultimate performance. This improvement stems from two complementary mechanisms. First, SFT-based cold-starting mitigates inefficient exploration by biasing the initial policy toward productive action spaces, thereby accelerating training. Second, exposure to curated trajectory data during the SFT phase fundamentally expands the model's capabilities by incorporating task-specific knowledge, which raises the achievable performance ceiling and yields superior final outcomes.

Answer to RQ3: SWE-Fuse exhibits stable convergence across multiple rounds of agentic RL training on both 8B and 32B models, demonstrating its robustness and scalability across different model sizes.

5 Discussion

Empowering LLMs with step-by-step debugging capabilities by creating reproduction scripts. After incorporating issue-free trajectories into the training process, SWE-Fuse effectively learns to create reproduction scripts and generate new test cases to better understand and validate the target pull request descriptions. As illustrated in Fig. 8, SWE-Fuse successfully resolved the instance astropy-13236 from SWE-bench Verified, while Claude-4-Sonnet failed to fix this issue (highlighted in red). Specifically, SWE-Fuse begins in Round 1 by searching for the relevant code snippet within the codebase. In Round 2, it successfully locates the buggy file astropy/table/table.py. In Round 3, SWE-Fuse formulates an initial proposal plan, which includes creating a reproduction script, modifying the identified table file, and applying string-replacement operations to update the reproduction script. By Round 32, SWE-Fuse successfully modifies the problematic code block and ultimately submits the correct patch.
In this case, SWE-Fuse utilizes its self-generated reproduction script without altering the original test cases, enabling a progressive debugging process that culminates in successful issue resolution (the submitted patch passes all test cases). In contrast, Claude-4 directly modifies existing test cases when encountering test failures (shown in red-shaded text under "Test Changes Applied"), resulting in an incorrect patch submission. This case study demonstrates SWE-Fuse's superior effectiveness in real-world issue detection and resolution.

Enhanced reasoning and action through ERLVR for effective multi-turn instruction following. Beyond the SFT paradigm, we introduce an entropy-aware RLVR training module that enables the model to maintain robust reasoning capabilities within long-context, multi-turn interactions. Specifically, as illustrated in Fig. 8, the pull request description outlines a two-phase proposal: (1) in version 5.1, automatically convert structured arrays to NdarrayMixin, and (2) in version 5.2, remove the data.view(NdarrayMixin) logic to allow structured arrays to become Column objects directly. SWE-Fuse, at Round 34 during reproduction-script execution, correctly addresses both requirements by verifying that: (a) a FutureWarning is issued, and (b) the column type is astropy.table.column.Column (neither Row nor NdarrayMixin).

[Figure 8, trajectory excerpt (condensed from a garbled extraction): The PR description proposes adding a FutureWarning telling the user to wrap `data` in `Column`, and removing the `data.view(NdarrayMixin)` clause in astropy 5.2. Round 1: SWE-Fuse greps the codebase for the guard `isinstance(data, np.ndarray) and len(data.dtype) > 1`. Round 2: it locates the matching block in astropy/table/table.py. Round 3: it plans to create a reproduction script, add the FutureWarning to table.py, and verify. Round 32: it replaces the code block via string replacement. Round 34: it runs reproduce_issue.py, and the environment returns the expected FutureWarning with the column added as a standard Column. Round 53: after final checks, it submits. The submitted patch replaces `data = data.view(NdarrayMixin); data_is_mixin = True` with a FutureWarning in astropy/table/table.py. The Claude-4-Sonnet trajectory, by contrast, reports "Test Changes Applied": it also edits astropy/table/tests/test_mixin.py, adding a new warning test and updating test_ndarray_mixin() to expect warnings.] Figure 8: An example case (astropy-13236) from SWE-bench Verified resolved by SWE-Fuse.
In contrast, Claude-4-Sonnet's final response only implements the first step, adding the FutureWarning when structured arrays are converted to NdarrayMixin, while treating the version 5.2 instruction as merely a warning, ultimately leading to an incomplete issue fix. The key factor underlying SWE-Fuse's success is that our modifications strictly align with the pull request's objectives and are validated through reproducible script execution. Throughout the extended multi-turn interaction, SWE-Fuse comprehensively addresses both the add_column method and portions of the Table(...) initialization pathway, ensuring precise modifications with verifiable behavioral changes in long-context scenarios.
6 Related Work

6.1 Coding LLMs

LLMs Kasneci et al. (2023); Chang et al. (2024); Zhao et al. (2023); Naveed et al. (2023) are pretrained deep learning models predominantly built upon the Transformer Vaswani et al. (2017) architecture. In code generation, LLMs are trained on extensive corpora comprising high-quality open-source code repositories and programming documentation Wei et al. (2025); Zhang et al. (2025); Quan et al. (2025). Through this training, LLMs acquire syntactic knowledge across multiple programming languages, learn common programming paradigms, and develop an understanding of the semantic mappings between natural language specifications and code logic Li et al. (2022); Herrington (2003); Lu et al. (2021). This enables LLMs to generate executable code from natural language descriptions and perform context-aware code completion and refactoring. Representative code-generative LLMs include Codex OpenAI (2025), CodeLlama Rozière et al. (2023), DeepSeek-Coder Guo et al. (2024), and Qwen3-Coder Yang et al. (2025a). These models have been widely deployed in software engineering tasks such as code completion Shojaee et al. (2023), test generation Liu et al. (2025a); Gao
We further release a high-quality trajectory dataset containing 14K trajectories specifically curated for SWE agent training. Experimental results demonstrate that SWE-Fuse achieves state-of-the-art performance for 8B and 32B models on SWE tasks, with solve rates of 43.0% and 60.2%, respectively. Integrating TTS@8 further improves performance to 49.8% and 65.2%. References Gemini 3 pro best for complex tasks and bringing creative concepts to life. 2025. Repo state loopholes during agentic evaluation. 2025. SWE agent Team. Mini-swe-agent. https://github.com/SWE-agent/Mini-SWE-Agent, 2024. Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, p. 12248–12267. Association for Computational Linguistics, 2024. Anthropic. Claude Sonnet 4.https://w.anthropic.com/claude/sonnet, 2025. [Accessed 31-08-2025]. 14 SWE-Fuse Anthropic. Introducing claude sonnet 4.5.https://w.anthropic.com/news/claude-sonnet-4 -5, 2025. 2025-09-30. Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, and et al. Kimi K2: open agentic intelligence. CoRR, abs/2507.20534, 2025. Ramakrishna Bairi, Anirudh Sonwane, Aditya Kanade, V Arun, et al. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499, 2023. Michael Bereket and Jure Leskovec. Uncalibrated reasoning: GRPO induces overconfidence for stochastic outcomes. CoRR, abs/2508.11800, 2025. Islem Bouzenia, Premkumar T. Devanbu, and Michael Pradel. Repairagent: An autonomous, llm- based agent for program repair. In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025, p. 2188–2200. IEEE, 2025. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology, 15(3):1–45, 2024. ChatGPT. Chatgpt. 2022. Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. CoRR, abs/2506.14758, 2025. Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. CoRR, abs/2505.22617, 2025. DeepSeek-AI. Deepseek-v3 technical report. CoRR, abs/2412.19437, 2024. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, and Z. F. Wu et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025. 
Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. CoRR, abs/2308.01861, 2023.

Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and Michael R. Lyu. The prompt alchemist: Automated llm-tailored prompt optimization for test case generation. CoRR, abs/2501.01329, 2025.

Google. A new era of intelligence with gemini 3. https://blog.google/products/gemini/gemini-3/, 2025. Published: 2025-11-18; Accessed: 2025-12-22.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence, 2024.

Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, and Zibin Zheng. Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks. CoRR, abs/2506.10954, 2025.

Jack Herrington. Code generation in action. Manning Publications Co., 2003.

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report. CoRR, abs/2409.12186, 2024.

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents. CoRR, abs/2504.07164, 2025.

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with Alphacode. Science, 378(6624):1092–1097, 2022.

Kaibo Liu, Zhenpeng Chen, Yiyang Liu, Jie M. Zhang, Mark Harman, Yudong Han, Yun Ma, Yihong Dong, Ge Li, and Gang Huang. Llm-powered test case generation for detecting bugs in plausible programs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, p. 430–440. Association for Computational Linguistics, 2025a.
Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking KL regularization in RLHF: from value estimation to gradient optimization. CoRR, abs/2510.01555, 2025b.

Tianyang Liu, Canwen Xu, and Julian J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, et al. Starcoder 2 and the stack v2: The next generation. CoRR, abs/2402.19173, 2024.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. In Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.

Michael Luo, Naman Jain, Jaskirat Singh, Sijun Tan, Ameen Patel, Qingyang Wu, Alpay Ariyak, Colin Cai, Tarun Venkat, Shang Zhu, Ben Athiwaratkun, Manan Roongta, Ce Zhang, Li Erran Li, Raluca Ada Popa, Koushik Sen, and Ion Stoica. Deepswe: Training a fully open-sourced, state-of-the-art coding agent by scaling rl. https://www.together.ai/blog/deepswe, July 2025. Blog post.

Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li. Swe-gpt: A process-centric language model for automated software improvement. Proceedings of the ACM on Software Engineering, 2(ISSTA):2362–2383, 2025.

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 2023.

OpenAI. Hello gpt-4o. https://openai.com/zh-Hans-CN/index/hello-gpt-4o/, May 2024. Published: 2024-05-13; Accessed: 2025-12-22.

OpenAI. Building more with gpt-5.1-codex-max. https://openai.com/index/gpt-5-1-codex-max/, 2025. 2025-11-19.

OpenAI. Introducing gpt-5. https://openai.com/zh-Hans-CN/index/introducing-gpt-5/, August 2025a. Published: 2025-08-07; Accessed: 2025-12-22.

OpenAI. Introducing openai o3 and o4-mini. https://openai.com/zh-Hans-CN/index/introducing-o3-and-o4-mini/, April 2025b. Published: 2025-04-16; Accessed: 2025-12-22.

Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. Repograph: Enhancing AI software engineering with repository-level code graph. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=Cq1BNvHx74.
Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. Codeelo: Benchmarking competition-level code generation of llms with human-comparable elo ratings. CoRR, abs/2501.01257, 2025.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, et al. Code llama: Open foundation models for code. CoRR, abs/2308.12950, 2023.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024.

Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K. Reddy. Execution-based code generation using deep reinforcement learning. Trans. Mach. Learn. Res., 2023, 2023.

Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. Repofusion: Training code models to understand your repository. arXiv preprint arXiv:2306.10998, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), 2017.

Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. Swe-dev: Building software engineering agents with training and inference scaling. arXiv preprint arXiv:2506.07636, 2025a.

Junhao Wang, Daoguang Zan, Shulin Xin, Siyao Liu, Yurong Wu, and Kai Shen. Swe-mirror: Scaling issue-resolving datasets by mirroring issues across repositories. arXiv preprint arXiv:2509.08724, 2025b.

Qi Wang, Hongzhi Zhang, Jia Fu, Kai Fu, Yahui Liu, Tinghai Zhang, Chenxi Sun, Gangwei Jiang, Jingyi Tang, Xingguang Ji, Yang Yue, Jingyuan Zhang, Fuzheng Zhang, Kun Gai, and Guorui Zhou. Klear-agentforge: Forging agentic intelligence through posttraining scaling. CoRR, abs/2511.05951, 2025c.

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025d.

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. SWE-RL: advancing LLM reasoning via reinforcement learning on open software evolution. CoRR, abs/2502.18449, 2025.

Xin-Cheng Wen, Yijun Yang, Cuiyun Gao, Yang Xiao, and Deheng Ye. Boosting vulnerability detection of llms via curriculum preference optimization with synthetic reasoning data. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, p. 8935–8949. Association for Computational Linguistics, 2025a.

Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. CoRR, abs/2506.14245, 2025b.

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. CoRR, abs/2407.01489, 2024.
Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. Live-swe-agent: Can software engineering agents self-evolve on the fly? CoRR, abs/2511.13646, 2025.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, et al. Qwen3 technical report. CoRR, abs/2505.09388, 2025a.

John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In Amir Globerson, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024.

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents. CoRR, abs/2504.21798, 2025b.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

Tong Ye, Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, and Wenhai Wang. Uncovering llm-generated code: A zero-shot synthetic code detector via code rewriting. In Toby Walsh, Julie Shah, and Zico Kolter (eds.), AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, p. 968–976. AAAI Press, 2025.

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving. CoRR, abs/2504.02605, 2025.

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, et al. GLM-4.5: agentic, reasoning, and coding (ARC) foundation models. CoRR, abs/2508.06471, 2025a.

Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, et al. Skywork-swe: Unveiling data scaling laws for software engineering in llms. arXiv preprint arXiv:2506.19290, 2025b.

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, p. 2471–2484. Association for Computational Linguistics, 2023.
Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, p. 13643–13658. Association for Computational Linguistics, 2024a.

Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, and Bing Yu. SRPO: A cross-domain implementation of large-scale reinforcement learning on LLM. CoRR, abs/2504.14286, 2025.

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. In Maria Christakis and Michael Pradel (eds.), Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, p. 1592–1604. ACM, 2024b.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models, 2023.

Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, and Xian Li. SWEET-RL: training multi-turn LLM agents on collaborative reasoning tasks. CoRR, abs/2503.15478, 2025.