
Paper deep dive

Resolving Java Code Repository Issues with iSWE Agent

Jatin Ganhotra, Sami Serhan, Antonio Abu Nassar, Avraham Shinnar, Ziv Nevo, Martin Hirzel

Year: 2026 · Venue: arXiv preprint · Area: cs.SE · Type: Preprint · Embeddings: 68

Abstract

Resolving issues on code repositories is an important part of software engineering. Various recent systems automatically resolve issues using large language models and agents, often with impressive performance. Unfortunately, most of these models and agents focus primarily on Python, and their performance on other programming languages is lower. In particular, a lot of enterprise software is written in Java, yet automated issue resolution for Java is under-explored. This paper introduces iSWE Agent, an automated issue resolver with an emphasis on Java. It consists of two sub-agents, one for localization and the other for editing. Both have access to novel tools based on rule-based Java static analysis and transformation. Using this approach, iSWE achieves state-of-the-art issue resolution rates across the Java splits of both Multi-SWE-bench and SWE-PolyBench. More generally, we hope that by combining the best of rule-based and model-based techniques, this paper contributes towards improving enterprise software development.

Tags

ai-safety (imported, 100%) · cs.SE (suggested, 92%) · preprint (suggested, 88%)




Full Text



Resolving Java Code Repository Issues with iSWE Agent

Jatin Ganhotra, Sami Serhan, Antonio Abu Nassar, Avraham Shinnar, Ziv Nevo, Martin Hirzel (IBM)

Abstract

Resolving issues on code repositories is an important part of software engineering. Various recent systems automatically resolve issues using large language models and agents, often with impressive performance. Unfortunately, most of these models and agents focus primarily on Python, and their performance on other programming languages is lower. In particular, a lot of enterprise software is written in Java, yet automated issue resolution for Java is under-explored. This paper introduces iSWE Agent, an automated issue resolver with an emphasis on Java. It consists of two sub-agents, one for localization and the other for editing. Both have access to novel tools based on rule-based Java static analysis and transformation. Using this approach, iSWE achieves state-of-the-art issue resolution rates across the Java splits of both Multi-SWE-bench and SWE-PolyBench. More generally, we hope that by combining the best of rule-based and model-based techniques, this paper contributes towards improving enterprise software development.

1 Introduction

One way to make software developers more productive is by giving them automated support for resolving issues. An issue on a code repository is a natural-language bug report or feature request, and resolving an issue involves making changes to the code that fix the bug or implement the feature. The SWE-bench benchmark [21] measures the ability of automated systems to resolve Python issues and has inspired many, mostly agent-based, solutions using large language models (LLMs). However, while Python-based issue resolution leaderboards are highly active and show signs of saturating, issue resolution leaderboards for other programming languages are less active and show lower performance.
At the same time, other programming languages are of great practical importance: for instance, much enterprise software is written in Java. As a language, Java differs substantially from Python in terms of its syntax, type system, and emphasis on object-orientation. This raises the question of whether Java issue resolution would benefit from Java-specific knowledge — a question that is, to our knowledge, unanswered. This paper sets out to tackle that gap.

Shortly after the release of SWE-bench, SWE-Agent demonstrated early gains thanks to carefully crafted tools for performing code-related actions [40]. In particular, its edit tool had a built-in Python linter for catching coding mistakes, suggesting that language-specific knowledge can be helpful. At the other extreme, the CodeAct paper [35] argued against sophisticated tools, instead expressing all actions purely as code. And recent frontier models have improved SWE-bench success rates for Python even when the action space only allows code in the bash shell language. Unfortunately, such arbitrary code actions risk unsafe side-effects, and avoiding those side-effects via sandboxing incurs overheads in time, memory, disk, and solution complexity. Arbitrary code actions also tend to require more iterations of the agentic loop, increasing LLM inferencing cost compared to tools tailored towards issue resolution for a given language. Therefore, this paper explores the premise that we can improve an agent for Java issue resolution by giving it advanced Java-specific tools.

This paper introduces iSWE, a new LLM-based agent for resolving issues in Java code repositories. (While iSWE can also handle other languages besides Java, it does so in a more generic manner.)

Preprint.
Under review. © 2025 the authors, released under CC BY 4.0. arXiv:2603.11356v1 [cs.SE] 11 Mar 2026

Compared to other issue-resolution agents, iSWE puts an emphasis on stronger tools, meaning tools that are: (i) more effective by leveraging Java-specific knowledge; (ii) less LLM-dependent and able to use fewer LLM turns by providing a simple high-level interface; (iii) safer by avoiding spurious side-effects; and (iv) lighter-weight by only launching a sandbox for rare actions where side-effects are unavoidable.

Internally, iSWE comprises two ReAct [42] sub-agents, one for localization and the second for editing. Decomposing issue resolution this way makes each sub-task easier, and has the added benefit that even solving just localization on its own delivers incremental value to developers. The sub-agents, in turn, build upon two open-source technologies: the declarative prompting language PDL [33] and the program analysis framework CLDK [22].

We evaluate iSWE on the Java subsets of two issue-resolution benchmarks, Multi-SWE-bench [43] and SWE-PolyBench [28]. We selected these two because both have public leaderboards, and because taken together, they have enough Java instances to draw meaningful conclusions (128+165=293). The evaluation demonstrates iSWE's performance and sheds light on some idiosyncrasies of the benchmarks that are of interest to the issue-resolution research community. In terms of success rate, iSWE ranks at or near the top for Java on both leaderboards. And in terms of cost, iSWE spends between 2× and 3× fewer dollars on model-inferencing APIs than other leading agents using the same LLM. The results section further explores other metrics (e.g., localization precision and recall) and breakdowns (e.g., by issue complexity).

Overall, this paper makes the following contributions:

1. It introduces iSWE, the first issue-resolution agent specialized for Java.
2. It explores the premise that language-aware tools can yield stronger issue-resolution agents.
3. It reports on extensive experiments with two thus-far less explored Java SWE benchmarks.

We hope this paper contributes towards taking some of the research progress driven by Python-centered benchmarks and extending its benefits into enterprise languages such as Java.

2 Background

The problem statement for issue resolution is as follows: given an issue description d_issue and the current code in a repository c_old, generate a new modified version c_new of the code that resolves the issue. Listing 1 shows an example issue description d_issue. This particular issue description indicates a code location to fix, though it will turn out that properly resolving the issue also requires changing code in a second location not mentioned in the issue. Most issue descriptions do not clearly indicate code locations.

In real-world practice, after a developer writes new modified code c_new, they usually do some testing to convince themselves that it addresses the issue, then submit a pull request (PR) to update the repository. Issue descriptions often include an informal acceptance test fragment, as in the example in Listing 1. Sometimes, developers even include a more polished version of such a test in the same PR along with the issue-resolving code c_new. Benchmarks like SWE-bench [21] emulate this practice by evaluating each candidate c_new on a hidden set of test cases t_gold. This yields an execution-based evaluation, accounting for the fact that what matters is whether generated code resolves the issue, not whether it resembles ground-truth code. What makes SWE-bench and similar benchmarks realistic is that they are based on (often popular) open-source code repositories: they mine issue-resolving PRs to get an issue (d_issue) along with the code before (c_old) and after (c_new) the PR. For example, Listing 1 is mined from the apache/rocketmq repository.
Then, they filter for presence of tests (t_gold) that are fail-to-pass (F2P), i.e., they fail on c_old, reproducing the issue, and pass on c_new, confirming its resolution. For completeness, in addition to F2P tests, the benchmark also runs a select few pass-to-pass (P2P) tests to check whether the patch broke any previous behavior.

Listing 1: Example issue description d_issue, based on SWE-PolyBench instance apache__rocketmq-516.

```
org.apache.rocketmq.common.stats.StatsItemSet#getAndCreateStatsItem has multi-thread problem.
Because we can not ensure the atomicity of

    StatsItem statsItem = this.statsItemTable.get(statsKey);
    StatsItem prev = this.statsItemTable.put(statsKey, statsItem);

Here is the test case. The result is not always the correct one 20000.

    for (int i = 0; i < 10000; i++) {
        executor.submit(new Runnable() {
            @Override
            public void run() {
                brokerStatsManager.incTopicPutNums("topicTest", 2, 1);
            }
        });
    }
    Thread.sleep(5000);
    System.out.println(brokerStatsManager.getStatsItem(TOPIC_PUT_NUMS, "topicTest").getValue());
```

SWE-bench Verified [12] is a subset of 500 SWE-bench instances filtered further using human annotations to ensure issue descriptions are not underspecified and tests are not overly specific. While SWE-bench and SWE-bench Verified focus on Python, there are at least three benchmarks that address the same problem statement but for other programming languages including Java. Multi-SWE-bench [43] comprises 1,632 instances from 7 languages, including 128 instances from Java. All its instances are pre-filtered using the same criteria as for SWE-bench Verified, and further hand-labeled for difficulty levels. SWE-PolyBench [28] comprises 2,110 instances from 4 languages, including 165 instances from Java. It comes with complexity labels based on how many functions and classes the ground-truth PR changed. SWE-PolyBench Verified is a hand-filtered subset, including 69 instances from Java.
SWE-bench Multilingual [41] comprises 300 instances from 9 languages, including 43 instances from Java.

This paper adopts both Multi-SWE-bench and SWE-PolyBench for evaluation to increase the number and diversity of Java instances. Doing so also enables comparing iSWE against other submissions to the corresponding public leaderboards, since several systems have submissions on one or the other but not both. Furthermore, it reduces the risk of inadvertently tailoring the agent scaffold too much to one particular benchmark, thus facilitating practical use for other repositories beyond the benchmarks. This paper does not adopt SWE-bench Multilingual, since it only has a small number of additional Java instances.

Issue resolution tasks for Java differ from those in Python SWE-bench in several ways. Python agents tend to struggle with multi-file edits [17], and golden code edits c_new for Java issues tend to affect more files and involve more diff-hunks than those for Python issues. This may be caused by Java's object-oriented nature, which encourages code where functionality is spread across more files. Also, Java is a compiled language with a strong static type system. Since many code mistakes are caught by the Java compiler, the popular Java linters do not need to check for those same mistakes. As a consequence, unlike for Python, an automatic issue resolution system for Java cannot rely on a linter alone to catch mistakes. Furthermore, the Java compiler requires the correct versions of the libraries it depends on in the build environment, and most projects invoke the Java compiler from a build tool such as Gradle or Maven.

Figure 1: Overview of iSWE Agent. The localization agent (an LLM with read-only localization tools providing rich, semantics-aware info) turns the issue description d_issue and code c_old from the Java repository into a set of code locations; the editing agent (an LLM with a mostly read-only check-and-repair edit tool) turns those locations into the code c_new that resolves the issue.
Taken together, this makes it harder for an automatic issue resolution system to get feedback from static correctness checks.

Another way in which issue resolution for Java differs from Python is that the Python leaderboards are more popular and are starting to show signs of saturation [18]. Some voices [7,14,24,27] even raise contamination concerns with Python SWE-bench, more likely due to extended pre-training than to fine-tuning. That said, leading frontier-model vendors are indeed fine-tuning their LLMs for Python issue resolution [8,26]. While this fine-tuning hopefully uses data disjoint from the benchmark test set, it is nevertheless likely to improve the model more for Python tasks than for other languages such as Java. On the other hand, Java issue resolution has received relatively less attention so far, making it less saturated and less overfitted.

3 Approach

This section describes our approach for tackling the problem statement from Section 2, namely, given an issue description d_issue, mapping from old code c_old to new code c_new. Figure 1 shows our two-agent pipeline for solving this problem. The pipeline first feeds the input ⟨d_issue, c_old⟩ into the localization agent. The localization agent is a ReAct [42] agent with read-only tools, and it returns an edit recommendation with a set of code locations in c_old that should be edited to resolve the issue. Next, the pipeline feeds this recommendation and set of code locations into the editing agent. The editing agent is also a ReAct agent with a mostly read-only tool, and it returns the modified code c_new that, if correct, resolves the issue.

There are several reasons why we opted for this flow of two ReAct agents. One advantage is that it gives us the flexibility to run either the entire flow end-to-end or each sub-agent separately. For instance, sometimes users want to pause after the first step, to inspect the intermediate result.
Conversely, sometimes the first step can be skipped, if the localization is available from a different source such as a developer or a static code scanning tool. The intermediate result itself is typed, structured, and sanitized. This simplifies our engineering effort, since each sub-agent is modular with well-defined input and output expectations. It also speeds up experiments, since we can run pieces in isolation. Finally, each of the sub-tasks is relatively simpler than the end-to-end task, bringing it more within reach of moderate-sized LLMs.

Both agents are LLM-agnostic: they avoid hardwired LLMs or LLM-specific prompts, and instead, the LLM to be called is a configuration variable. As of now, iSWE uses inline reasoning, in the sense that the LLM may generate a chain-of-thought [37] as part of its main content, as opposed to generating a separate API response field such as reasoning_content. Similarly, iSWE uses inline tool calling, in the sense that the LLM may generate a tool call [29] as part of its main content, which iSWE then parses on the client side, as opposed to using a native tool-calling API. We found these inline approaches to be more robust across different LLMs.

We implemented both agents using Python and PDL [33,30]. PDL is a declarative YAML-based language for writing prompt templates and the core LLM-flow control logic directly surrounding the prompts. It takes care of context accumulation and data flow between components, including type-checking. By using PDL, iSWE keeps its prompts easy to read, compared to the common practice of scattering their sub-strings across Python code. At the same time, PDL is flexible enough to work for a variety of different flows, and thus, we did not have to work around design decisions of an existing agentic framework such as AutoGen [38] or OpenHands [36]. In particular, one core decision we wanted control over was containerization. Most parts of iSWE run in user-space without any container, including the top-level flow, the ReAct loop of each of the sub-agents, and even the functions being called for tools. This makes things simpler and faster. The only part that runs in a container is the part of the editing agent that builds the Java project for the purpose of error checking (described below).

Listing 2: Example sanitized localization for the issue description from Listing 1.

```json
{
  "thought": "Perfect! Now I have all the information needed to identify the bug and propose fixes. The issue is a race condition in the getAndCreateStatsItem method in both StatsItemSet and MomentStatsItemSet classes.",
  "localization": {
    "main/java/org/apache/rocketmq/common/stats/MomentStatsItemSet.java": [{
      "changes": [{"kind": "modify", "range": {"start": 76, "end": 90}}],
      "proposed_edit": "Replace the non-atomic get-check-put pattern with putIfAbsent() to ensure atomicity.",
      "scope": [
        {"name": "MomentStatsItemSet",
         "kind": {"language": "java", "name": "class"},
         "span": {"start": 28, "end": 92}},
        {"name": "getAndCreateStatsItem",
         "kind": {"language": "java", "name": "method"},
         "span": {"start": 76, "end": 91}}
      ]
    }],
    "main/java/org/apache/rocketmq/common/stats/StatsItemSet.java": [{
      "changes": [{"kind": "modify", "range": {"start": 160, "end": 173}}],
      "proposed_edit": "Replace the non-atomic get-check-put pattern with putIfAbsent() to ensure atomicity.",
      "scope": [
        {"name": "StatsItemSet",
         "kind": {"language": "java", "name": "class"},
         "span": {"start": 28, "end": 203}},
        {"name": "getAndCreateStatsItem",
         "kind": {"language": "java", "name": "method"},
         "span": {"start": 160, "end": 174}}
      ]
    }]
  }
}
```
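Stepping back, the two-stage flow of this section can be sketched as a tiny driver. The function names and data shapes below are illustrative stand-ins, not iSWE's actual API:

```python
def run_pipeline(issue_description, repo_path, localize, edit):
    """Sketch of a two-stage localize-then-edit flow (hypothetical API).

    Stage 1 (localization) is read-only and can be run alone; its typed,
    structured output can be inspected, or supplied from another source
    such as a developer or a static code scanning tool.
    """
    locations = localize(issue_description, repo_path)  # read-only tools
    if not locations:
        return None  # nothing to edit; the localization alone is still useful
    # Stage 2 (editing) consumes the locations and produces the new code.
    return edit(issue_description, repo_path, locations)

# Toy stand-ins for the two sub-agents:
patch = run_pipeline(
    "getAndCreateStatsItem has multi-thread problem",
    "/repo",
    localize=lambda issue, repo: [{"file": "StatsItemSet.java", "range": (160, 173)}],
    edit=lambda issue, repo, locs: {"edited_files": [l["file"] for l in locs]},
)
print(patch)  # {'edited_files': ['StatsItemSet.java']}
```

Because each stage is a plain function of typed inputs and outputs, either one can be swapped out or run in isolation, which mirrors the modularity argument above.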
The following two subsections describe each of iSWE's two sub-agents, localization and editing, in detail.

3.1 iSWE Localization Agent

As shown in Figure 1, the input to the localization agent consists of the issue description d_issue and the old code c_old. The initial prompt context before the first agent iteration includes d_issue but not c_old, which can be very large; instead, the agent can inspect c_old via tools. To this end, the initial prompt context also contains instructions and tool descriptions. Each loop iteration starts with an LLM call to produce a thought (inline reasoning) and an action (tool call). The agent parses the tool call, invokes the tool, and obtains the output from the tool as an observation. PDL [33] implicitly appends the ⟨thought, action, observation⟩ triple from the current loop iteration to the prompt context, which becomes the LLM input in the next iteration. The loop terminates when the LLM generates not a tool call but a valid JSON with a set of code locations.

Listing 2 shows an example of a sanitized JSON output from the localization agent, containing a set of code locations. In this case, there are two locations across two different files. Only one of them was mentioned in the corresponding issue description d_issue (Listing 1), demonstrating that the localization agent found additional useful information. Besides the locations, the JSON in Listing 2 also includes the final thought from the localization agent, which might be useful for downstream consumption in the editing agent. In our implementation, we only expect the LLM to generate a simplified subset of the information in Listing 2, and then fill in the remaining details using rule-based program analysis. The LLM output JSON is less nested than the final sanitized JSON. For example, line number ranges are given as a string such as "76-90", and scopes do not provide spans, or can even be elided altogether.
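One such sanitization step, turning a range string like "76-90" into structured start and end fields, can be sketched as follows (a hypothetical helper for illustration, not the authors' sanitizer, which uses rule-based program analysis):

```python
def parse_range(range_str):
    """Parse a simplified LLM-emitted range like "76-90" into ints.

    Hypothetical helper; it tolerates a Unicode minus sign and stray
    whitespace, which LLM output sometimes contains.
    """
    normalized = range_str.replace("\u2212", "-").strip()
    start, _, end = normalized.partition("-")
    return {"start": int(start), "end": int(end)}

print(parse_range("76-90"))          # {'start': 76, 'end': 90}
print(parse_range(" 160\u2212173 ")) # {'start': 160, 'end': 173}
```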
Then, the rule-based sanitizer adds any incomplete information and reconciles contradictions, if any. The initial LLM context contains few-shot examples for the LLM output JSON format to teach the localization agent what it needs to produce.

By default, the localization agent is configured to use the following seven tools: get_file_info, get_class_info, get_method_info, get_symbol_info, get_inheritance_hierarchy, get_function_callers, and get_call_chain. If we conceptualize the code c_old as a graph, then the first four tools get information on nodes representing files, classes, methods, or other symbols, and the remaining three tools get information on edges representing class inheritance and function calling. We implemented these tools using two rule-based static analysis libraries, CLDK [22] and Tree-sitter [13]. All tools generate textual outputs carefully crafted to provide rich yet concise information to the LLM. Also, all of the above-listed tools are read-only, in the sense that they only passively obtain information without making any changes to c_old or having other side-effects on the system. We used ACE [1] to hone the tool descriptions to make it easier for LLMs to call them.

Above, we describe the localization agent in its default configuration. Our implementation also permits various aspects to be configured differently. For instance, we can provide different tools, such as those from SWE-Agent [40] or even a bash tool. We can change the number of few-shot samples for the JSON format. And we can configure the output of tools, for instance, whether to include line numbers, or whether to show relative paths.

3.2 iSWE Editing Agent

The input to the editing agent consists of the set of locations returned by the localization agent. Based on this set of locations, the agent extracts and formats relevant excerpts from c_old, and places them into the initial prompt context.
For example, given the localization in Listing 2, it would show excerpts from two Java files, using the line ranges of the scope fields to show only the methods where edits should take place. Since code without line numbers is more in-distribution for most LLMs than code with line numbers, the agent does not explicitly mention line numbers in the prompt. Instead, it injects edit marker comments to pinpoint exact locations to edit. Besides excerpts from c_old, the initial prompt before the first agent iteration also includes the issue description d_issue, the thought and proposed_edit fields from the localization, plus instructions and a description of the edit tool.

Similarly to the localization agent, each loop iteration of the editing agent also starts with an LLM call to produce a thought and an action. The action is an edit tool call, which comprises one or more search-replace blocks, across one or more files. The agent parses the tool call from the LLM output and tries to apply it to an in-memory copy of the relevant portion of c_old to obtain a candidate c_new. Since the original version of c_old on disk remains unchanged, this step is trivially free of side effects and requires no container sandbox. If the editing tool detects any mistakes in the tool call, it renders them as an observation. The previous context plus the current ⟨thought, action, observation⟩ triple become the input to the LLM call in the next loop iteration. The loop terminates when the editing tool cannot find any mistakes in the tool call, at which point the agent returns the well-formed c_new, converted to unidiff format.

Listing 3 shows an example of an edit tool call, based on the same running example used in earlier sections. It takes the form of a search-replace in merge-conflict format.

Listing 3: Example edit for the issue from Listing 1 and the localization from Listing 2.

```
The solution is to use `putIfAbsent()` which is an atomic operation. If `putIfAbsent()` returns non-null, it means another thread already inserted a value, so return that existing value.

### main/java/org/apache/rocketmq/common/stats/MomentStatsItemSet.java
<<<<<<< SEARCH
MomentStatsItem statsItem = statsItemTable.get(statsKey);
if (null == statsItem) {
    statsItem = new MomentStatsItem(statsName, statsKey, schedExSvs, log);
    MomentStatsItem prev = statsItemTable.put(statsKey, statsItem);
}
=======
MomentStatsItem statsItem = statsItemTable.get(statsKey);
if (null == statsItem) {
    statsItem = new MomentStatsItem(statsName, statsKey, schedExSvs, log);
    MomentStatsItem prev = statsItemTable.putIfAbsent(statsKey, statsItem);
    if (null != prev) statsItem = prev;
}
>>>>>>> REPLACE

### main/java/org/apache/rocketmq/common/stats/StatsItemSet.java
<<<<<<< SEARCH
StatsItem statsItem = statsItemTable.get(statsKey);
if (null == statsItem) {
    statsItem = new StatsItem(statsName, statsKey, schedExSvs, log);
    StatsItem prev = statsItemTable.put(statsKey, statsItem);
}
=======
StatsItem statsItem = statsItemTable.get(statsKey);
if (null == statsItem) {
    statsItem = new StatsItem(statsName, statsKey, schedExSvs, log);
    StatsItem prev = statsItemTable.putIfAbsent(statsKey, statsItem);
    if (null != prev) statsItem = prev;
}
>>>>>>> REPLACE
```

This format has been used for many years by version management software [25], making it likely to occur in LLM pre-training data and thus giving LLMs some basic familiarity with it. In this example, the edit has two search-replace blocks, one for each of the two files given in the localization (Listing 2). In general, the tool also supports multiple search-replace blocks in the same file. Each search-replace block has two parts separated by `=======`: a block of lines to search in c_old, and a block of lines to replace them with to obtain c_new. The initial LLM context has few-shot samples for this format.
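Parsing and applying such a tool call can be sketched as follows. This is a minimal illustration, not the authors' implementation, which layers heuristic repairs, linting, and compiler checks on top:

```python
import re

# One search-replace block in merge-conflict format.
BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_edit(source: str, tool_call: str) -> str:
    """Apply every search-replace block of tool_call to source.

    Mimics the escalating-check style: the first problem found raises,
    which an agent loop would render back to the LLM as an observation.
    """
    applied = 0
    for match in BLOCK.finditer(tool_call):
        search, replace = match.group("search"), match.group("replace")
        if source.count(search) != 1:
            raise ValueError("search block missing or ambiguous")
        source = source.replace(search, replace)
        applied += 1
    if applied == 0:
        raise ValueError("no well-formed search-replace block found")
    return source

old = "StatsItem prev = table.put(key, item);"
call = (
    "<<<<<<< SEARCH\n"
    "StatsItem prev = table.put(key, item);\n"
    "=======\n"
    "StatsItem prev = table.putIfAbsent(key, item);\n"
    ">>>>>>> REPLACE"
)
print(apply_edit(old, call))  # StatsItem prev = table.putIfAbsent(key, item);
```

Requiring exactly one match per search block keeps the edit unambiguous; a real tool would instead attempt fuzzy repairs before giving up.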
The edit tool performs a series of escalating checks on the LLM-generated tool call, such that the first check that fails causes an early exit with a corresponding observation to explain what went wrong. It checks adherence to the merge-conflict format and attempts to match the search blocks, with heuristic rule-based repairs to tolerate minor differences such as including the agent-injected edit markers. Once these initial checks pass, it derives the candidate c_new to be subjected to the remaining checks. It runs a simple Java linter on c_new that does not have side-effects but misses many deeper Java code problems. Finally, only when all previous checks succeeded, it fires up a containerized environment with the dependencies of the current repository installed and kicks off the Java compiler on c_new. The containerization isolates the versions of Java tools and libraries from the host machine and prevents any side-effects of the build scripts from affecting the master copy of the code. If all checks succeed without finding mistakes, the agent loop exits.

The above description applies to the editing agent in its default configuration. The agent can also be configured to elide certain information from the prompt, such as the thought from the localization agent or the proposed edit. We can change the number of few-shot samples for the merge-conflict format.

Table 1: Main results on Multi-SWE-bench (Java) and SWE-PolyBench (Java). File and Node columns report Edit Correct Location recall and precision; "-" marks unavailable data.

Multi-SWE-bench (Java):

| Agent (Base LLM) | # (%) Resolved | $ Cost Total (Avg.) | Avg. # Tokens | File Recall | File Prec. | Node Recall | Node Prec. |
|---|---|---|---|---|---|---|---|
| iSWE (Claude-4.5-Sonnet) | 43 (33.6%) | 237.89 (1.86) | 598k | 62% | 82% | 44% | 68% |
| iSWE (Claude-3.7-Sonnet) | 32 (25.0%) | 154.51 (1.21) | 386k | 50% | 72% | 32% | 63% |
| InfCode (GPT-5.2) | 50 (39.1%) | 569.29 (4.78) | 2453k | 54% | 72% | 47% | 54% |
| MSWE-agent (C-3.7-Sonnet) ¹ | 30 (23.4%) | 477.26 (3.72) | 1233k | - | - | - | - |
| MopenHands (C-3.7-Sonnet) ¹ | 28 (21.8%) | 332.61 (2.59) | 835k | - | - | - | - |
| iSWE (deepSeek-V3-2) | 28 (21.8%) | 32.46 (0.25) | 899k | 44% | 57% | 32% | 48% |
| iSWE (llama-4-maverick-17b-128e) | 21 (16.4%) | 14.53 (0.11) | 82k | 37% | 59% | 22% | 51% |
| iSWE (gpt-oss-120b, tool calling) | 21 (16.4%) | 4.86 (0.04) | 223k | 36% | 57% | 19% | 47% |
| iSWE (deepSeek-R1) | 19 (14.8%) | 7.70 (0.06) | 83k | 38% | 54% | 20% | 42% |
| iSWE (devstral-small-2505) | 19 (14.8%) | 10.16 (0.08) | 775k | 40% | 60% | 24% | 47% |
| iSWE (qwen2-5-72b-instruct) | 17 (13.2%) | 1.49 (0.01) | 89k | 34% | 52% | 20% | 41% |
| iSWE (llama-3.3-70b) | 15 (11.7%) | 10.60 (0.08) | 116k | 35% | 53% | 20% | 43% |
| iSWE (devstral-small-2507) | 13 (10.1%) | 15.98 (0.12) | 1235k | 30% | 47% | 18% | 37% |
| iSWE (gpt-oss-20b, tool calling) | 11 (8.5%) | 13.56 (0.11) | 1461k | 24% | 37% | 17% | 33% |

SWE-PolyBench (Java):

| Agent (Base LLM) | # (%) Resolved | $ Cost Total (Avg.) | Avg. # Tokens | File Recall | File Prec. | Node Recall | Node Prec. |
|---|---|---|---|---|---|---|---|
| iSWE (Claude-4.5-Opus) | 55 (33.3%) | 420.7 (2.55) | 492k | 61% | 85% | 44% | 74% |
| iSWE (Claude-4.5-Sonnet) | 48 (29.1%) | 305.13 (1.85) | 588k | 53% | 72% | 34% | 63% |
| Amazon Q Developer Agent ² | 44 (26.6%) | - | - | 56% | 75% | 43% | 62% |
| Aider-PB (C-3.5-Sonnet) ² | 26 (15.7%) | - | - | 52% | 58% | 32% | 52% |
| iSWE (deepSeek-V3-2) | 30 (18.1%) | 46.5 (0.28) | 1000k | 35% | 45% | 23% | 40% |
| iSWE (gpt-oss-120b, tool calling) | 23 (13.9%) | 8.01 (0.05) | 290k | 27% | 44% | 14% | 38% |
| iSWE (deepSeek-R1) | 20 (12.1%) | 10.64 (0.06) | 62k | 34% | 49% | 19% | 42% |
| iSWE (llama-4-maverick-17b-128e) | 17 (10.3%) | 20.71 (0.13) | 91k | 33% | 50% | 18% | 42% |
| iSWE (qwen2-5-72b-instruct) | 14 (8.4%) | 2.15 (0.01) | 100k | 26% | 39% | 13% | 33% |
| iSWE (devstral-small-2505) | 14 (8.4%) | 8.76 (0.05) | 522k | 28% | 39% | 14% | 31% |
| iSWE (devstral-small-2507) | 13 (7.8%) | 17.52 (0.11) | 1051k | 28% | 39% | 14% | 31% |
| iSWE (gpt-oss-20b, tool calling) | 10 (6.0%) | 17.54 (0.11) | 1464k | 13% | 24% | 7% | 20% |
| iSWE (llama-3.3-70b) | 9 (5.4%) | 16.51 (0.10) | 140k | 23% | 36% | 12% | 29% |
Furthermore, we can configure whether or not to run the Java compiler using the project build scripts.

4 Evaluation

This section presents our empirical evaluation of iSWE based on the Java splits of Multi-SWE-bench [43] and SWE-PolyBench [28]. Since there has been little published work on using those benchmarks so far, this sheds light not just on iSWE's performance but also on the benchmarks themselves. This section starts by discussing the main results in Table 1, and then dives deeper into various aspects with additional tables and figures.

Table 1 is based on running iSWE on the two benchmarks with a variety of different LLMs. We use inline tool calling for most models except for gpt-oss, where localization used native tool calling, because that performed better. For comparison, we also included some results from other top agents on the corresponding leaderboards.

Notes for Table 1: (1) Correct Location metrics are unavailable for a subset of Multi-SWE-bench (Java) leaderboard submissions due to difficulties encountered when applying the publicly available output patches. (2) Cost and average token metrics are unavailable for SWE-PolyBench (Java) leaderboard submissions due to the lack of publicly available trajectories.

Figure 2: Average cost vs. % resolved across models. Most datapoints are from iSWE with different models; two datapoints for Multi-SWE-bench (Java) are from MopenHands and MSWE-agent.

Column % Resolved indicates for how many benchmark instances the agent-generated patch c_new resolves the issue d_issue, as indicated by passing the hidden tests t_gold. With frontier models, iSWE has among the highest resolution rates on Java issues. While the open-source models resolve fewer instances, they also show decent results, demonstrating iSWE's ability to use different models. The absolute numbers are lower than commonly seen on Python leaderboards; this is likely an artifact of Python leaderboards becoming saturated [27].
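The resolution criterion behind % Resolved is the F2P/P2P check from Section 2. A minimal sketch, with a stand-in test runner in place of actually executing the hidden tests:

```python
def is_resolved(run_test, f2p_tests, p2p_tests):
    """Decide resolution for one instance after applying the agent's patch.

    run_test(name) -> bool is a hypothetical stand-in for executing one
    hidden test against the patched repository.
    """
    # All fail-to-pass tests must now pass (the issue is fixed) ...
    f2p_ok = all(run_test(t) for t in f2p_tests)
    # ... and all pass-to-pass tests must still pass (nothing broke).
    p2p_ok = all(run_test(t) for t in p2p_tests)
    return f2p_ok and p2p_ok

# Toy outcome table standing in for real test execution:
results = {"testAtomicCreate": True, "testGetValue": True, "testUnrelated": False}
print(is_resolved(results.get, ["testAtomicCreate"], ["testUnrelated"]))  # False
```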
Column $ Cost shows the expense for LLM calls from model vendor APIs in US dollars. To calculate it, we track the number of input and output tokens, and then use public pricing per input or output token, mostly based on scaling factors included in the LiteLLM library. Generally, stronger models cost more. For Multi-SWE-bench, the top non-iSWE results use Claude-3.7-Sonnet; the iSWE runs with the same model incur 2× to 3× lower cost while resolving more issues. Column Avg. # Tokens shows the average number of tokens used per instance. It varies across models with no clear correlation to resolution rate.

Columns Edit Correct Location show to what extent the agent-generated new code c_new modifies the same code locations as the hidden golden patch. The columns report precision and recall both at the file level and at the level of nodes, which are the closest enclosing class or function that contains a given diff-hunk [28]. Not surprisingly, file-level retrieval numbers are higher than node-level, and retrieval metrics correlate well with % resolved.

The following subsections explore iSWE and the benchmarks more deeply. Section 4.1 explores dollar costs and number of turns by model and instance to better highlight the trade-offs enabled by iSWE. Section 4.2 digs deeper into the sub-agent and even tool layer of iSWE to explore how each component contributes to overall success. And Section 4.3 provides a better understanding of the benchmarks by drilling down to samples of different complexities and other characteristics.

4.1 Results for Cost and Turns

While the primary metric for SWE-bench and similar benchmarks is issue resolution rate, another major concern is cost. Furthermore, one can trade resolution rate and cost off against each other, e.g., by choosing a different model or by resorting to inference scaling.
This subsection explores that trade-off in more detail, using dollar costs computed based on published pricing for LLM input and output tokens (mostly leveraging metadata in LiteLLM). Besides dollar cost, this also sheds light on other correlated concerns, such as the number of turns in the agentic loop or the time-to-fix. All results in this section are based on the same runs as the main results.

Figure 3: Average cost vs. resolved across individual instances in the benchmark (cumulative).

Figure 2 visualizes the trade-off between % resolved and average per-instance $ cost as a scatter plot. The x-axis shows cost on a logarithmic scale; cheaper (further left) is better. The y-axis shows resolution rate on a linear scale; more resolved (further up) is better. The points on the Pareto frontier are those on the upper left and are highlighted with a green border: they are partially optimal in the sense that other points may be higher in one metric but never both. For both benchmarks, the Pareto frontier is occupied by iSWE with five models: qwen2-5-72b-instruct, gpt-oss-120b, deepseek-v3-2, and finally two versions of the Claude frontier models. Thanks to iSWE being model agnostic, it offers the user a range of choices to navigate the cost/accuracy trade-off. When we first did these experiments, the top non-iSWE agents on Multi-SWE-bench (Java) were MSWE-agent and MopenHands, both using Claude-3.7-Sonnet. They are dominated by iSWE with the same model, which achieves a higher resolution rate at a fraction of the cost. In the meantime, InfCode using GPT-5.2 has also appeared on the Multi-SWE-bench (Java) leaderboard, with a higher resolution rate but at higher cost than iSWE.

While the previous figure showed average costs across all instances in an entire benchmark, not all instances incur the same cost. Some instances can be resolved in fewer turns or tokens, and thus at lower cost, than others.
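The Pareto-frontier reading of the scatter plot can be sketched as a small dominance check. This is our illustration, not the paper's plotting code; the (cost, resolved) pairs below are taken from the Multi-SWE-bench (Java) rows of Table 1, restricted to a few iSWE runs.

```python
# model -> (avg cost in $, % resolved), from Table 1 (Multi-SWE-bench, Java)
points = {
    "qwen2-5-72b-instruct": (0.01, 13.2),
    "gpt-oss-120b": (0.04, 16.4),
    "deepseek-v3-2": (0.25, 21.8),
    "claude-3.7-sonnet": (1.21, 25.0),
    "claude-4.5-sonnet": (1.86, 33.6),
    "llama-3.3-70b": (0.08, 11.7),
}

def pareto(points):
    """A point is Pareto-optimal if no other point is at most as
    expensive and resolves at least as much, with one strict."""
    frontier = []
    for name, (cost, res) in points.items():
        dominated = any(
            c <= cost and r >= res and (c < cost or r > res)
            for n, (c, r) in points.items() if n != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# llama-3.3-70b is dominated by qwen2-5-72b-instruct (cheaper, resolves more);
# the other five models form the frontier, as in the text above.
print(pareto(points))
```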
Figure 3 shows the cumulative distribution of how many instances get finished (light blue) or resolved (dark blue) up to a cost (or turns) threshold shown on the x-axis. The left two subfigures focus on dollar cost: most instances that get resolved take at most $4. The number of finished instances keeps increasing beyond that point, but they experience diminishing returns in terms of resolution. The right two subfigures have the same y-axis but put the number of turns on the x-axis. Early turns are cheaper, because as context accumulates, later turns consume more tokens. Again, the curve for resolved issues asymptotically flattens at fewer turns than the curve for finished instances, indicating that LLMs struggle with long-horizon tasks. These results indicate that cost thresholds are another lever for navigating the cost/accuracy trade-off.

Table 2: Cost per sub-agent.

| Benchmark | Base LLM | Turns Loc. | Turns Editing | Turns Total | Cost Loc. | Cost Editing | Cost Total |
|---|---|---|---|---|---|---|---|
| Multi-SWE-Bench (Java) | Claude-4.5-Sonnet (Loc & Edit) | 31.5 | 1.7 | 33.3 | $1.77 | $0.09 | $1.86 |
| Multi-SWE-Bench (Java) | DeepSeek-V3-2 (Loc & Edit) | 41.6 | 2.5 | 44.2 | $0.24 | $0.01 | $0.25 |
| SWE-PolyBench (Java) | Claude-4.5-Opus (Loc & Edit) | 34.1 | 2.2 | 36.3 | $2.41 | $0.14 | $2.55 |
| SWE-PolyBench (Java) | DeepSeek-V3-2 (Loc & Edit) | 54.8 | 3.3 | 58.0 | $0.27 | $0.01 | $0.28 |

Table 3: Localization metrics per sub-agent. The first four metric columns (Localization Correct Location) are measured on the localization agent's output; the last four (Edit Correct Location) on the final patch.

| Base LLM | Loc. File Recall | Loc. File Prec. | Loc. Node Recall | Loc. Node Prec. | Edit File Recall | Edit File Prec. | Edit Node Recall | Edit Node Prec. |
|---|---|---|---|---|---|---|---|---|
| Multi-SWE-Bench (Java): iSWE-Agent (Frontier Models) | | | | | | | | |
| Claude-4.5-Sonnet | 63% | 83% | 36% | 59% | 62% | 82% | 44% | 68% |
| Claude-3.7-Sonnet | 56% | 77% | 28% | 53% | 50% | 72% | 32% | 63% |
| iSWE-Agent (Open Source Models) | | | | | | | | |
| deepSeek-V3-2 | 58% | 76% | 34% | 55% | 44% | 57% | 32% | 48% |
| llama-4-maverick-17b-128e | 45% | 70% | 21% | 48% | 37% | 59% | 22% | 51% |
| gpt-oss-120b (tool calling) | 49% | 75% | 20% | 48% | 36% | 57% | 19% | 47% |
| deepSeek-R1 | 49% | 72% | 21% | 46% | 38% | 54% | 20% | 42% |
| devstral-small-2505 | 44% | 67% | 22% | 44% | 40% | 60% | 24% | 47% |
| qwen2-5-72b-instruct | 44% | 63% | 18% | 40% | 34% | 52% | 20% | 41% |
| llama-3.3-70b | 40% | 60% | 17% | 39% | 35% | 53% | 20% | 43% |
| devstral-small-2507 | 36% | 53% | 17% | 35% | 30% | 47% | 18% | 37% |
| gpt-oss-20b (tool calling) | 36% | 55% | 22% | 40% | 24% | 37% | 17% | 33% |
| SWE-PolyBench (Java): iSWE-Agent (Frontier Models) | | | | | | | | |
| Claude-4.5-Opus | 62% | 83% | 31% | 60% | 56% | 79% | 38% | 72% |
| Claude-4.5-Sonnet | 61% | 80% | 29% | 56% | 53% | 72% | 34% | 63% |
| iSWE-Agent (Open Source Models) | | | | | | | | |
| deepSeek-V3-2 | 56% | 73% | 28% | 52% | 35% | 45% | 23% | 40% |
| gpt-oss-120b (tool calling) | 45% | 69% | 19% | 50% | 27% | 44% | 14% | 38% |
| deepSeek-R1 | 48% | 69% | 18% | 42% | 34% | 49% | 19% | 42% |
| llama-4-maverick-17b-128e | 46% | 70% | 17% | 45% | 33% | 50% | 18% | 42% |
| qwen2-5-72b-instruct | 38% | 58% | 14% | 39% | 26% | 39% | 13% | 33% |
| devstral-small-2505 | 39% | 56% | 15% | 36% | 28% | 39% | 14% | 31% |
| devstral-small-2507 | 36% | 52% | 15% | 34% | 28% | 39% | 14% | 31% |
| gpt-oss-20b (tool calling) | 36% | 52% | 14% | 35% | 13% | 24% | 7% | 20% |
| llama-3.3-70b | 39% | 57% | 15% | 33% | 23% | 36% | 12% | 29% |

Table 2 breaks down the cost spent in iSWE's two sub-agents for localization and editing, respectively. For both benchmarks, for two representative models from the Pareto frontier, most of the turns and cost are incurred during localization. Even though DeepSeek takes more turns than Claude, in the end it incurs lower cost, because its per-token pricing is cheaper. Note that iSWE allows different models for localization and editing. For example, we did a run on Multi-SWE-bench (Java) where we used Claude-4.5-Sonnet for localization and Gemini-2.5-Pro for editing (elided from Table 1). That run resolved 29.6% of instances at an average cost of $2.02 per instance.

4.2 Results by iSWE Sub-Agents and Tools

Table 4: Tool usage distribution across the entire benchmark (not per-instance). For each model, the table counts calls to each localization tool: get_file_info, get_class_info, get_method_info, get_symbol_info, get_inheritance_hierarchy, get_function_callers, and get_call_chain, plus a total. (The per-model call counts are garbled in this extraction and are omitted here.)

Table 5: Localization performance for different tool sets, using Claude-4.5-Sonnet on Multi-SWE-bench (Java).

| Tool Set | $ Cost Total (Avg.) | # Tokens | Turns | File Recall | File Prec. | Node Recall | Node Prec. |
|---|---|---|---|---|---|---|---|
| iSWE-tools | 226 (1.77) | 575k | 4037 | 63% | 83% | 36% | 59% |
| iSWE-tools & view-file | 229 (1.79) | 584k | 3777 | 63% | 83% | 37% | 57% |
| Bash | 263 (2.06) | 667k | 5815 | 62% | 82% | 37% | 57% |
| All-Tools (iSWE-tools & Bash & swea) | 336 (2.63) | 855k | 5820 | 62% | 81% | 37% | 58% |

This subsection breaks down how each component of iSWE contributes to its overall success. Table 3 shows the localization metrics at two stages in the pipeline: based on the intermediate JSON right after the localization agent (see e.g. Listing 2), and based on the final code patch c_new after the editing agent (see e.g. Listing 3). Calculating the same metrics at these different stages sheds light on the quality of the localization-agent intermediate output as well as on the faithfulness of the editing agent in following it. As before, the node metrics are based on the nearest enclosing function or class, similar to SWE-PolyBench [28].
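The node-level metrics attribute each diff hunk to its closest enclosing class or function and then compare location sets between the agent and the gold patch. A minimal sketch of both steps (our reconstruction of the SWE-PolyBench definition; the node spans and names below are hypothetical):

```python
def enclosing_node(nodes, hunk_start, hunk_end):
    """nodes: list of (name, start_line, end_line) for classes/functions.
    Returns the smallest node whose span contains the hunk, or None if
    the hunk falls outside all nodes (e.g. in the import section)."""
    containing = [n for n in nodes if n[1] <= hunk_start and hunk_end <= n[2]]
    if not containing:
        return None
    # smallest span = closest enclosing class or function
    return min(containing, key=lambda n: n[2] - n[1])[0]

def retrieval_metrics(agent_locs, gold_locs):
    """(recall, precision) of the agent's edited locations vs. gold."""
    if not agent_locs or not gold_locs:
        return 0.0, 0.0
    hits = agent_locs & gold_locs
    return len(hits) / len(gold_locs), len(hits) / len(agent_locs)

# Hypothetical spans for one Java file: a class containing two methods.
nodes = [("Cache", 10, 80), ("Cache.get", 20, 35), ("Cache.put", 40, 60)]
print(enclosing_node(nodes, 25, 27))  # Cache.get (innermost match)
print(enclosing_node(nodes, 36, 38))  # Cache (between the two methods)

# Node-level recall/precision for one hypothetical instance.
agent_nodes = {"Cache.get", "Cache.put"}
gold_nodes = {"Cache.get", "Cache.evict"}
print(retrieval_metrics(agent_nodes, gold_nodes))  # (0.5, 0.5)
```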
Table 3 shows that file-level metrics are better after localization than after edit, because iSWE only shows the files found by the localization to the edit agent. In contrast, node-level metrics are sometimes better after edit than after localization, because the edit agent can further adjust its location. As we saw before, localization metrics closely track resolution rate, indicating that good localization is important to overall success.

Table 4 shows how often the various models call the various localization tools. The first four tools (get_file/class/method/symbol_info) answer questions about code-graph nodes, whereas the last three tools (get_inheritance_hierarchy/function_callers/call_chains) answer questions about edges. There are more calls to node tools than to edge tools; one reason for this may be that edge tools yield more useful information in a single call. The overall most-called tool is get_method_info and the least-called tool is get_inheritance_hierarchy. Aside from a few exceptions, all models call pretty much all tools, indicating that iSWE's tools are model-agnostic.

Table 6: Performance by hand-annotated complexity level on Multi-SWE-Bench (Java) and by hand-annotated Verified label on SWE-PolyBench (Java).

| Benchmark | Complexity | # Instances | Base LLM | % Resolved | Avg. Cost | File Recall | File Prec. | Node Recall | Node Prec. |
|---|---|---|---|---|---|---|---|---|---|
| Multi-SWE-Bench (Java) | Easy | 27 | Claude-4.5-Sonnet | 16 (59.2%) | $1.65 | 65% | 80% | 42% | 49% |
| | | | DeepSeek-V3-2 | 9 (33.3%) | $0.18 | 58% | 74% | 39% | 47% |
| | Medium | 65 | Claude-4.5-Sonnet | 26 (40.0%) | $1.71 | 66% | 86% | 39% | 58% |
| | | | DeepSeek-V3-2 | 18 (27.6%) | $0.21 | 61% | 70% | 37% | 47% |
| | Hard | 36 | Claude-4.5-Sonnet | 1 (2.7%) | $2.26 | 55% | 80% | 22% | 65% |
| | | | DeepSeek-V3-2 | 1 (2.7%) | $0.37 | 52% | 87% | 23% | 75% |
| SWE-PolyBench (Java) | Verified | 69 | Claude-4.5-Opus | 31 (44.9%) | $2.25 | 67% | 84% | 33% | 61% |
| | | | DeepSeek-V3-2 | 18 (26.0%) | $0.28 | 60% | 71% | 30% | 50% |
| | Non-Verified | 96 | Claude-4.5-Opus | 25 (26.0%) | $2.76 | 58% | 82% | 28% | 58% |
| | | | DeepSeek-V3-2 | 12 (12.5%) | $0.27 | 53% | 74% | 26% | 52% |

Table 5 explores how iSWE's tools compare to alternatives.
It is based on multiple different benchmark runs, each with a different tool-set configuration for the localization agent. Besides iSWE's Java-aware tools introduced in Section 3, alternative tools include view_file; bash (running low-level shell commands); and swea (the set of tools from SWE-agent [40], which are read-only but lower-level than the iSWE tools). The bash tool, despite being low-level, is feasible with recent frontier models, because those have been fine-tuned for SWE tasks. The iSWE tools, view_file, and the swea tools are read-only and free of side-effects; in contrast, bash can modify state and have other potentially harmful side-effects, thus requiring sand-boxing. Overall, the retrieval metrics are similar for all tool sets. On the other hand, cost and tokens are lower when using iSWE-tools, through a combination of fewer turns and fewer spurious tokens per turn. This is a symptom of the iSWE tools being higher-level: models find it easier to call them (fewer turns) and to interpret the output from calls (fewer tokens). Cost and retrieval metrics aside, using iSWE tools has the additional advantage of avoiding spurious side-effects, making the agent safer and simpler.

4.3 Results by Benchmark Sample Characteristics

Some benchmarks come with instance labels that partition their set of instances into different subsets. Comparing metrics on these subsets yields further insight into agent performance, while also helping us understand the benchmarks better. Multi-SWE-bench [43] comes with hand-labeled complexity levels (Easy, Medium, or Hard). Table 6 shows that the metrics follow the expected trend: as issues get harder, % Resolved decreases, cost increases, and retrieval metrics decrease. Of course, when metrics are computed on a smaller subset of instances, they get noisier. SWE-PolyBench [28] comes with hand-labeled Verified tags for some instances whose issue description d_issue is clear and matches its tests t_gold well.
SWE-PolyBench [28] indicates that on average, Verified instances are easier than non-Verified ones. This implies that the numbers are not directly comparable across the SWE-PolyBench leaderboards for Verified vs. Full.

Besides hand-labeled complexity levels, SWE-PolyBench also comes with localization-derived labels that are indicative of complexity. We wrote our own code for deriving these labels, then ran it on both benchmarks. Table 7 shows the results. Here, None means all changes are in new files or non-code files; Function Only means the change is entirely contained in functions (which may or may not be part of a class); Class Only means the change is entirely in a class (e.g., changing attributes, but not contained within a function of a class); and Mixed counts any remaining changes across functions and classes or top-level code in a file. Rows marked Single only count instances where a single node was modified, whereas rows marked All count instances where one or more nodes of the given kind were modified.

Table 7: Performance by localization-derived complexity attributes. These attributes are computed based on the gold patch and emulate the attributes defined by SWE-PolyBench.

| Benchmark | Modification Type | # Instances | Base LLM | % Resolved | Avg. Cost | File Recall | File Prec. | Node Recall | Node Prec. |
|---|---|---|---|---|---|---|---|---|---|
| Multi-SWE-Bench (Java) | None | 3 | Claude-4.5-Sonnet | 0 (0.0%) | $2.27 | 36% | 100% | 36% | 100% |
| | | | DeepSeek-V3-2 | 0 (0.0%) | $0.14 | 25% | 100% | 25% | 100% |
| | Function Only (Single) | 33 | Claude-4.5-Sonnet | 20 (60.6%) | $1.75 | 72% | 87% | 58% | 70% |
| | | | DeepSeek-V3-2 | 10 (30.3%) | $0.23 | 56% | 69% | 50% | 61% |
| | Function Only (All) | 44 | Claude-4.5-Sonnet | 27 (61.3%) | $1.63 | 73% | 90% | 59% | 74% |
| | | | DeepSeek-V3-2 | 18 (40.9%) | $0.21 | 60% | 69% | 54% | 62% |
| | Class Only (Single) | 4 | Claude-4.5-Sonnet | 4 (100.0%) | $1.97 | 87% | 100% | 83% | 75% |
| | | | DeepSeek-V3-2 | 1 (25.0%) | $0.24 | 37% | 50.0% | 33% | 37% |
| | Class Only (All) | 9 | Claude-4.5-Sonnet | 6 (66.6%) | $2.14 | 70% | 75% | 73% | 63% |
| | | | DeepSeek-V3-2 | 2 (22.2%) | $0.18 | 46% | 57% | 48% | 45% |
| | Mixed | 72 | Claude-4.5-Sonnet | 10 (13.8%) | $1.95 | 55% | 78% | 31% | 65% |
| | | | DeepSeek-V3-2 | 8 (11.1%) | $0.29 | 46% | 65% | 26% | 53% |
| | Total | 128 | Claude-4.5-Sonnet | 43 (33.5%) | $1.86 | 62% | 82% | 44% | 69% |
| | | | DeepSeek-V3-2 | 28 (21.8%) | $0.25 | 50% | 67% | 37% | 57% |
| SWE-PolyBench (Java) | None | 0 | Claude-4.5-Sonnet | – | – | – | – | – | – |
| | | | DeepSeek-V3-2 | – | – | – | – | – | – |
| | Function Only (Single) | 31 | Claude-4.5-Opus | 18 (58.0%) | $1.53 | 84% | 85% | 74% | 78% |
| | | | DeepSeek-V3-2 | 10 (32.2%) | $0.23 | 49% | 50% | 39% | 40% |
| | Function Only (All) | 46 | Claude-4.5-Opus | 25 (54.3%) | $1.61 | 77% | 83% | 69% | 78% |
| | | | DeepSeek-V3-2 | 16 (34.7%) | $0.23 | 54% | 62% | 46% | 53% |
| | Class Only (Single) | 5 | Claude-4.5-Opus | 2 (40.0%) | $3.35 | 90% | 81% | 40% | 28% |
| | | | DeepSeek-V3-2 | 1 (20.0%) | $0.38 | 46% | 50% | 16% | 30% |
| | Class Only (All) | 10 | Claude-4.5-Opus | 4 (40.0%) | $3.88 | 74% | 90% | 44% | 58% |
| | | | DeepSeek-V3-2 | 1 (10.0%) | $0.48 | 39% | 55% | 22% | 43% |
| | Mixed | 109 | Claude-4.5-Opus | 26 (23.8%) | $2.82 | 53% | 83% | 28% | 75% |
| | | | DeepSeek-V3-2 | 13 (11.9%) | $0.29 | 34% | 50% | 19% | 44% |
| | Total | 165 | Claude-4.5-Opus | 55 (33.3%) | $2.55 | 61% | 84% | 40% | 75% |
| | | | DeepSeek-V3-2 | 30 (18.1%) | $0.28 | 40% | 54% | 26% | 46% |

Some of the subsets are small, leading to noisy results, but from the more significant subsets we can observe trends. In general, going from functions to classes to mixed, resolution rate decreases, cost increases, and retrieval metrics decrease.

While Multi-SWE-bench [43] and SWE-PolyBench [28] are great contributions to the research community, we also encountered some idiosyncrasies worth sharing. Multi-SWE-bench instances come with identifier hints, which are newly-defined entity names that could be used in addition to the issue description d_issue to make the issue more specific. Since we are unsure whether other non-iSWE leaderboard submissions make use of such identifier hints, we decided to hide them from iSWE for our experiments, even if that reduces iSWE's apparent performance. Some SWE-bench-inspired benchmarks also have so-called hints-text that, despite the similar-sounding name, is unrelated to identifier hints. Such hints-text comes from comments and discussions on the issue after the original issue statement; using this is widely frowned upon, so we hide it from iSWE for our experiments. Multi-SWE-bench has duplicate issues with identical issue descriptions d_issue and golden code patches c_new, mostly due to back-porting fixes to different branches.
We treat these like any other instances for consistency with how everyone else uses the benchmark. Furthermore, Multi-SWE-bench also has some instances where the golden code patch fails the tests; again, we treat these like any other instances, although not surprisingly, the code patch generated by iSWE also fails the tests, so these remain unresolved. Finally, since iSWE does not have a bash tool, it cannot cheat by using commands such as ‘git log --all’ or ‘git show <commit_id>’ that have been reported from non-iSWE solutions on some SWE-bench derived leaderboards.

Table 8: Performance by number of identifier hints about new entities provided in Multi-SWE-Bench (Java), using Claude-4.5-Sonnet.

| # Hints Provided | # Instances | % Resolved | File Recall | File Prec. | Node Recall | Node Prec. |
|---|---|---|---|---|---|---|
| 0 | 79 | 44% | 68.33% | 84.99% | 52.03% | 73.89% |
| 1 | 23 | 30% | 59.66% | 86.96% | 35.98% | 59.50% |
| 2 | 4 | 25% | 50.00% | 100.00% | 43.15% | 95.00% |
| 3 | 2 | 0% | 50.00% | 62.50% | 50.00% | 66.67% |
| 4 | 8 | 0% | 62.35% | 75.00% | 18.31% | 68.84% |
| 5 | 2 | 0% | 100.00% | 50.00% | 66.67% | 20.56% |
| ≥ 6 | 10 | 0% | 15.18% | 57.00% | 14.12% | 45.85% |

To explore one of the above-listed idiosyncrasies more, Table 8 shows iSWE's performance on instances with different numbers of identifier hints. Since we are hiding identifier hints from iSWE, we expected iSWE to perform worse on instances where the benchmark creators provided such hints. The results show that this expectation holds: iSWE's resolution rates are highest for instances with zero hints, and higher for instances with few hints than for instances with many. The information-retrieval metrics are noisy, perhaps due to the small subsets of instances.

5 Related Work

The authors of Multi-SWE-bench [43] and SWE-PolyBench [28] created leaderboards and populated them using multi-lingual issue resolution systems they adapted from prior work. Since those adapted systems can handle Java code, that makes them related work for our paper.
Specifically, the Multi-SWE-bench leaderboard has entries for Magentless (based on Agentless [39] but skipping test-based validation); MSWE-agent (based on SWE-Agent [40]); and MopenHands (based on OpenHands [36]). Similarly, the SWE-PolyBench leaderboard has entries for Agentless-PB (based on Agentless [39] but using only some parts of test-based validation); SWE-agent-PB (based on SWE-Agent [40]); and Aider-PB (based on Aider [4] but disabling interactive parts and test-based validation). All of these systems have only limited language-specific components, such as using Tree-Sitter [13] to create a repository map in Aider and Agentless. In contrast, iSWE puts a special emphasis on Java, with more advanced Java-specific tools. At the time of this writing, InfCode using GPT-5.2 is at the top of the Multi-SWE-bench (Java) leaderboard [23]. It works by alternating between two ReAct agents, a test generator and a patch generator, causing higher cost; in contrast, iSWE does not yet use test feedback. The leaderboard for SWE-PolyBench Verified also has highly ranked third-party submissions, including Java results beyond the entries from the benchmark authors. Prometheus [11] is an issue resolution system with five sub-agents, all revolving around a common knowledge graph constructed using Tree-Sitter. In contrast, iSWE uses a more direct tool-based approach with no knowledge graph and more emphasis on Java. Furthermore, our paper reports more detailed results, such as localizer metrics not present in the Prometheus paper. The other third-party leaderboard entries on SWE-PolyBench with Java results are Atlassian Rovo Dev [6] and Amazon Q Developer Agent [5]. Unfortunately, their inner workings are unknown; while there is a paper about a predecessor system of Rovo Dev [31], it does not talk about its tools, and even if it did, Rovo Dev may work differently. We thus cannot compare their approach, only their results (see Section 4).
While the above discussion focuses on Java issue resolution, so far, more of the action has been on Python issue resolution [4,10,11,19,32,34,36,39,40,44]. One dimension for the design of issue resolution systems is whether to use zero, one, or multiple agents. Agentless [39] uses zero ReAct agents, and instead has a fixed workflow, both at the top level and within its sub-modules for localization and code editing. Single-agent systems, such as SWE-Agent [40], Aider [4], or OpenHands [36], use a unified ReAct loop to handle both localization and code editing. Multi-agent issue resolution systems rely upon anywhere from two (AutoCodeRover [44], HULA [31]) to four (CodeR [10], Magis [32]) or even five (MASAI [34], Prometheus [11]) sub-agents. As discussed in Section 3, iSWE has two sub-agents with a hand-off based on a sanitized and well-typed set of locations. More important than the number of sub-agents may be the tools being called by these agents. Some systems leverage basic file system tools, specifically bash shell commands for code search and a single custom tool to view and edit files. Since the release of Claude 3.5 Sonnet, several systems have converged on a str_replace_editor tool to view files and make changes (create and edit) to existing files [19,34,36,39,40]. This is true even for some systems that previously used their own different tools, since Claude is a strong code model specifically trained with these tools in mind. Another approach is to introduce more sophisticated, domain-specific tools. For example, AutoCodeRover [44] employs AST-based tools that enable searching for specific code entities (classes, methods) within other entities, and Prometheus [11] builds a code knowledge graph for the code repository, allowing the agent to search for classes and functions. With iSWE, we double down on more sophisticated tools specialized for Java.
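The str_replace-style edit operation mentioned above can be sketched as follows. This is our illustration of the general idea, not any specific system's tool: the model supplies an exact old string that must occur exactly once in the file, plus its replacement.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace old with new, rejecting missing or ambiguous matches so
    the model gets an error instead of a silently wrong edit."""
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found in file")
    if count > 1:
        raise ValueError("old string is not unique; include more context")
    return text.replace(old, new)

# Hypothetical edit: fix a sign bug in a one-line Java method.
src = "int add(int a, int b) { return a - b; }"
print(str_replace(src, "return a - b;", "return a + b;"))
```

Requiring a unique match is the design choice that makes such a tool usable in a ReAct loop: an ambiguous target produces an error message the model can react to on the next turn.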
At the same time, iSWE differs from most prior issue resolution systems by being careful to use mostly read-only tools. One topic this paper did not cover is testing and inference scaling. Several issue-resolution systems run existing regression tests or newly-generated reproduction tests, either to refine c_new or to select among multiple candidates for c_new [16,19,20,39,44]. In fact, this practice can be so useful that our team created a dedicated benchmark for reproduction test generation, TDD-Bench Verified [2]. But while we have ongoing work on both testing (e.g. [3]) and inference scaling, that work is beyond the scope of this paper, which focuses on the core agent. Recently, there has been significant effort to identify areas where issue resolution systems excel and where they perform poorly. Analysis of agent performance on multi-file issues [17] and single-file saturation studies [18] reveal that state-of-the-art systems struggle significantly with multi-file issues, despite achieving high performance on single-file problems. Further research, such as TRAIL (Trace Reasoning and Agentic Issue Localization) [15] and MAST (Multi-Agent System Failure Taxonomy) [9], explores the failure modes of agents by analyzing their trajectories, identifying various reasons for failure, including incorrect fault localization, inability to understand complex code dependencies, and challenges in maintaining context across multiple files. These studies highlight the gap between current agent capabilities and the requirements of real-world software engineering tasks, where issues frequently span multiple files and require deep understanding of system architecture.

6 Conclusion

This paper introduces iSWE, a two-agent system for resolving issues on code repositories with a particular emphasis on Java. The first sub-agent uses read-only tools based on static analysis to find a set of code locations to be edited.
The second sub-agent uses an edit tool, which also performs most of its work in a read-only manner, with any side-effects isolated in a container. The tools for both agents leverage Java knowledge for better performance. Our results show that iSWE reaches state-of-the-art issue resolution rates on the Java splits of two benchmarks, SWE-PolyBench and Multi-SWE-Bench. It achieves these results without any model fine-tuning or dynamic test execution feedback. We expect that incorporating those in future work will further increase resolution rates.

References

[1] Prerna Agarwal, Himanshu Gupta, Soujanya Soni, Rohith Vallam, Renuka Sindhgatta, and Sameep Mehta. Automated creation and enrichment framework for improved invocation of enterprise APIs as tools, September 2025. URL https://arxiv.org/abs/2509.11626.
[2] Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. TDD-Bench Verified: Can LLMs generate tests for issues before they get resolved?, December 2024. URL https://arxiv.org/abs/2412.02883.
[3] Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, and Martin Hirzel. Heterogeneous prompting and execution feedback for SWE issue test generation and selection. In International Conference on Software Engineering (ICSE), April 2026.
[4] Aider Team. How aider scored SOTA 26.3% on SWE bench Lite, May 2024. URL https://aider.chat/2024/05/22/swe-bench-lite.html. Blog post.
[5] Amazon Team. Amazon Q developer agent, May 2024. URL https://aws.amazon.com/q/developer/.
[6] Atlassian Team. Atlassian Rovo Dev, September 2025. URL https://www.atlassian.com/software/rovo-dev.
[7] Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, May 2025. URL https://arxiv.org/abs/2505.20411.
[8] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, May 2025. URL https://arxiv.org/abs/2503.11926.
[9] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025.
[10] Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, and Qianxiang Wang. CodeR: Issue resolving with multi-agent and task graphs, June 2024. URL https://arxiv.org/abs/2406.01304.
[11] Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, and He Ye. Prometheus: Unified knowledge graphs for issue resolution in multilingual codebases, July 2025. URL https://arxiv.org/abs/2507.19942.
[12] Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. Introducing SWE-bench Verified, August 2024. URL https://openai.com/index/introducing-swe-bench-verified/.
[13] Timothy Clem and Patrick Thomson. Static analysis at GitHub. Communications of the ACM (CACM), 65(2):44–51, January 2022. URL https://doi.org/10.1145/3486594.
[14] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, September 2025. URL https://arxiv.org/abs/2509.16941.
[15] Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. TRAIL: Trace reasoning and agentic issue localization. arXiv preprint arXiv:2505.08638, 2025.
[16] Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Re, and Azalia Mirhoseini. CodeMonkeys: Scaling test-time compute for software engineering, January 2025. URL https://arxiv.org/abs/2501.14723.
[17] Jatin Ganhotra. Do SWE-agents solve multi-file issues like humans? A deep dive into SWE-bench Verified, January 2025. URL https://jatinganhotra.dev/blog/swe-agents/2025/03/30/swe-bench-verified-single-file-saturation/. Blog post.
[18] Jatin Ganhotra. The multi-file frontier: Why SWE-bench Verified doesn't reflect real-world programming challenges, March 2025. URL https://jatinganhotra.dev/blog/swe-agents/2025/01/05/swe-bench-mutliple-files/. Blog post.
[19] Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, and Xia Liu. Trae agent: An LLM-based agent for software engineering with test-time scaling, July 2025. URL https://arxiv.org/abs/2507.23370.
[20] Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2E-Gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents, April 2025. URL https://arxiv.org/abs/2504.07164.
[21] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), May 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
[22] Rahul Krishna, Rangeet Pan, Saurabh Sinha, Srikanth Tamilselvam, Raju Pavuluri, and Maja Vukovic. Codellm-Devkit: A framework for contextualizing code LLMs with program analysis insights.
In Industry paper at Symposium on the Foundations of Software Engineering (FSE-Industry), pages 308–318, 2025. URL https://doi.org/10.1145/3696630.3728555.
[23] KeFan Li, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, and Weifeng Lv. InfCode: Adversarial iterative refinement of tests and patches for reliable software issue resolution, November 2025. URL https://arxiv.org/abs/2511.16004.
[24] Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. The SWE-bench illusion: When state-of-the-art LLMs remember instead of reason, August 2025. URL https://arxiv.org/abs/2506.12286.
[25] Jon Loeliger. Version Control with Git: Powerful Techniques for Centralized and Distributed Project Management. O'Reilly, 2009.
[26] Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural emergent misalignment from reward hacking in production RL, November 2025. URL https://arxiv.org/abs/2511.18397.
[27] OpenAI. Why SWE-bench Verified no longer measures frontier coding capabilities, February 2026. URL https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/. Blog post.
[28] Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buccholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents, April 2025. URL https://arxiv.org/abs/2504.08703.
[29] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.
In Conference on Neural Information Processing Systems (NeurIPS), pages 68539–68551, December 2023. URLhttps://proceedings.neurips.c/paper_files/ paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html. [30] Claudio Spiess, Mandana Vaziri, Louis Mandel, and Martin Hirzel. AutoPDL: Automatic prompt optimization for LLM agents. In Conference on Automated Machine Learning (AutoML), September 2025. URL https://proceedings.mlr.press/v293/spiess25a.html. [31]Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamtha- vorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. Human- in-the-loop software development agents. In International Conference on Software Engineer- ing: Software Engineering in Practice track (ICSE-SEIP), pages 342–352, April 2025. URL https://doi.org/10.1109/ICSE-SEIP66354.2025.00036. [32] Wei Tao, Yucheng Zhou, Wenqiang Zhang, and Yu Cheng.MAGIS: LLM- based multi-agent framework for GitHub issue resolution.In Conference on Neural Information Processing Systems (NeurIPS), pages 51963–51993,December 2024.URLhttps://proceedings.neurips.c/paper_files/paper/2024/hash/ 5d1f02132ef51602adf07000ca5b6138-Abstract-Conference.html. [33] Mandana Vaziri, Louis Mandel, Claudio Spiess, and Martin Hirzel. PDL: A declarative prompt programming language, October 2024. URL http://arxiv.org/abs/2410.19135. [34]Nalin Wadhwa, Atharv Sonwane, Daman Arora, Abhav Mehrotra, Saiteja Utpala, Ramakr- ishna B Bairi, Aditya Kanade, and Nagarajan Natarajan. MASAI: Modular architecture for software-engineering AI agents. In Workshop on Open-World Agents (OWA@NeurIPS), Decem- ber 2024. URL https://openreview.net/forum?id=NSINt8lLYB. [35] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In International Conference on Machine Learning (ICML), July 2024. URL https://proceedings.mlr.press/v235/wang24h.html. 
[36]Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang 19 Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations (ICLR), April 2025. URL https://openreview.net/forum?id=OJd3ayDDoF. [37]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Conference on Neural Information Processing Systems (NeurIPS), pages 24824–24837, December 2022. URLhttps://proceedings.neurips.c/paper_files/paper/2022/hash/ 9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html. [38]Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation, October 2023. URL https://arxiv.org/abs/2308.08155. [39]Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Demystifying LLM- based software engineering agents. In Symposium on the Foundations of Software Engineering (FSE), pages 801–824, June 2025. URL https://doi.org/10.1145/3715754. [40]John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-Agent: Agent-computer interfaces enable automated software engineering. In Conference on Neural Information Processing Systems (NeurIPS), pages 50528–50652, December 2024. URLhttps://proceedings.neurips.c/paper_files/paper/ 2024/hash/5a7c947568c1b1328c5230172e1e7c-Abstract-Conference.html. [41]John Yang, Kilian Lieret, Carlos E. 
Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents, April 2025. URL https://arxiv.org/abs/2504.21798. [42]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), May 2023. URLhttps://openreview.net/forum?id=WE_ vluYUL-X. [43] Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-SWE-bench: A multilingual benchmark for issue resolving, April 2025. URL https://arxiv.org/abs/2504.02605. [44]Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In International Symposium on Software Testing and Analysis (ISSTA), pages 1592–1604, September 2024. URLhttps://doi.org/10.1145/3650212. 3680384. 20