Paper deep dive
Evaluating Agentic Optimization on Large Codebases
Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue
Abstract
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io
Links
- Source: https://arxiv.org/abs/2603.16011v1
- Canonical: https://arxiv.org/abs/2603.16011v1
Full Text
191,512 characters extracted from source content.
FORMULACODE: Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal (1*), James Hou (2*), Akanksha Sarkar (3), Ishaan Mantripragada (2), Swarat Chaudhuri (1), Jennifer J. Sun (3), Yisong Yue (2)

* Equal contribution. 1 The University of Texas at Austin, 2 California Institute of Technology, 3 Cornell University. Correspondence to: Atharva Sehgal <atharvas@utexas.edu>. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FORMULACODE, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FORMULACODE comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io.

1. Introduction

Large Language Models (LLMs) for code are rapidly evolving from isolated function-level synthesis to file-level editing, and now, to repository-level optimization (Merrill et al., 2026; Jimenez et al., 2024; Zhang et al., 2025; Zhao et al., 2024; Shetty et al., 2025; Ma et al., 2025). These models are now transitioning from assistants into autonomous coding agents, increasingly tasked with navigating complex, interconnected software ecosystems to diagnose bottlenecks and improve performance. However, we currently lack frameworks to study these emerging capabilities across the full optimization lifecycle; for example, how agents balance multiple workloads, maintain function integrity, and structure improvements at different levels of the codebase hierarchy.

While there exist coding benchmarks based on real GitHub repositories (Jimenez et al., 2024; Zhang et al., 2025; Zhao et al., 2024), they generally do not fully capture the multi-workload, real-world tasks that engineers and researchers face in practice. These benchmarks often rely on binary pass/fail feedback, which is insufficient for measuring optimization, or synthetic (e.g., LLM-generated) tasks, which lack the complexity and characteristics of real-world coding. For example, real-world optimization is rarely isolated: diagnosing and improving performance often requires reasoning about architectural decisions, component interactions, and design trade-offs at the system level rather than tuning an isolated function (Balsamo et al., 2004; Woodside et al., 2007; Jin et al., 2012). Consequently, this requires a new evaluation standard capable of measuring the emerging ability of agents across this entire optimization workflow under realistic software engineering constraints.
We identify several directions for improving agentic coding benchmarking: (1) Fine-grained metrics: evaluation must move beyond binary correctness to capture continuous performance changes and trade-offs; (2) Real-world measurements: metrics should be derived from established execution environments (e.g., standard profiling suites) rather than synthetic proxies; (3) Reliable baselines: agent performance must be assessed against human optimization to provide a meaningful standard; and (4) Repository scale: agents must operate within large, evolving codebases.

We introduce FORMULACODE [1], a novel benchmark designed for advancing agentic optimization on large, evolving software ecosystems. FORMULACODE is constructed from 957 real-world performance bottlenecks mined from 70 scientific, open-source Python repositories, like Pandas, Scikit-learn, and SciPy. Unlike previous datasets, each task in FORMULACODE is paired with an average of 264.6 community-maintained performance workloads alongside expert-authored patches. This unique construction enables the use of the airspeed-velocity (asv) framework to assess the full lifecycle of optimization (triage, diagnosis, and resolution) in a way that isolated coding tasks cannot.

[1] FORMULACODE draws inspiration from Formula 1, where constructors must optimize entire systems, not just individual components, to achieve peak performance on the track. Similarly, FORMULACODE challenges code agents to perform holistic, codebase-level optimizations, reflecting the complexity and interdependence found in real-world software.

Figure 1 (the figure shows an example task built from Astropy performance issue #13479, "Performance of Angle, Latitude and Longitude is a major bottleneck in coordinate transforms", the baseline repository snapshot, crowdsourced metrics such as time_init_scalar and time_init_array plus 57 more, correctness via unit and snapshot tests, and the agent-generated PR scored by its advantage over the human edit): FORMULACODE is a continuously updating benchmark for evaluating the holistic ability of agents to optimize large codebases. Each task in FORMULACODE comprises a problem description of a performance regression from GitHub, an environment containing a baseline repository snapshot, and multiple expert-written crowdsourced performance workloads, along with the tools to execute them. An agent's performance-improving edits are assessed based on their ability to outperform expert-written edits in optimizing multiple workloads while meeting multiple forms of correctness guarantees.

We conduct a large-scale evaluation of frontier and open-weights models (GPT-5, Claude 4.0 Sonnet, Gemini 2.5 Pro, Qwen 3 Coder) within multiple agentic frameworks (Terminus 2, OpenHands). Our main findings are:

- Agents generally can improve run-time performance, but perform worse than human experts (§3.1).
- Agents are better at local or function-level optimization, rather than repository-level optimization (§3.2).
- Agents excel at using specific optimization strategies (e.g., parallelizing or batching) and struggle with others (e.g., vectorized operations) (§3.3).
- Agent performance relative to experts can vary dramatically by popularity of the repository, performing worst on the 4th quintile and best on the 2nd quintile (§3.4).
- Despite being more expensive per call, agents using frontier LLMs are overall more cost-effective than those using open-weights models (e.g., due to open-weights models having much longer reasoning chains) (§3.5.1).
- Compared to human experts, agents make less favorable performance-cost trade-off decisions (§3.5.2).
- We observe minimal effects from data leakage (i.e., using LLMs potentially trained on expert solutions) (§3.5.3).

We open-source FORMULACODE as a community resource [2], not only to measure what code agents can generate, but to understand how they can reliably optimize and maintain complex real-world systems.

[2] Project website at https://formula-code.github.io/.

2. FORMULACODE Benchmark Design

Each FORMULACODE task evaluates the ability of an agent to optimize a real-world codebase under strict correctness constraints. A task begins with a baseline repository, denoted Code_0, which represents the unmodified implementation. The agent operates on Code_0 and produces a modified version of the repository, denoted Code_agent, by making arbitrary repository-level edits. Each task is paired with two forms of evaluation signals:

- Correctness. Correctness is measured via a suite of tests on the functional behavior. A proposed code modification is considered valid only if Code_agent passes all tests that Code_0 passes.
- Performance Workloads. Each task includes a large collection of expert-written performance workloads that exercise known performance-critical execution paths in the codebase. Each workload measures a single performance dimension, such as runtime or memory usage, and may exhibit natural variability due to execution noise.

Figure 1 depicts our benchmark setup. The top half shows a task from the Astropy repository, highlighting a performance issue with three functions: Angle, Latitude, and Longitude. There are 59 workloads defined by community-sourced, expert-written metrics. The goal of the coding agent is to modify the repository to optimize these workloads while still maintaining correctness.

Performance evaluation proceeds by executing the full set of workloads on both Code_0 and Code_agent and comparing their measured outcomes. Improving performance on one workload may degrade performance on others (Balsamo et al., 2004; Woodside et al., 2007; Jin et al., 2012). As a result, optimization in FORMULACODE is inherently multi-objective: agents must reason about trade-offs across subsystems and deliver improvements that are broad and consistent rather than localized to a single execution path.

2.1. Metrics

Speedup. For each workload_i, we compare the performance ratio of Code_agent versus Code_0:

    speedup_i = workload_i(Code_0) / workload_i(Code_agent).

Having speedup > 1 indicates an improvement. These ratios are dimensionless and allow performance changes to be compared across heterogeneous workloads. If Code_agent does not pass the correctness tests for workload_i, then speedup_i = 1 (i.e., the modifications were reverted). For n workloads, the overall speedup is the geometric mean:

    speedup_agent = ( ∏_i speedup_i )^(1/n).    (1)
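As a concrete reading of Equation 1, the sketch below computes per-workload ratios and their geometric mean, reverting any workload whose correctness tests fail to a ratio of 1. This is an illustrative re-implementation rather than the released evaluation harness; the input dictionaries and the `passes_tests` flags are assumptions made for the example.

```python
import math

def aggregate_speedup(baseline: dict, agent: dict, passes_tests: dict) -> float:
    """Geometric-mean speedup over all workloads (Equation 1).

    baseline[w] and agent[w] are measured runtimes of workload w on Code_0
    and Code_agent; passes_tests[w] is True when Code_agent still passes the
    correctness tests backing workload w.
    """
    ratios = []
    for w, t_baseline in baseline.items():
        if passes_tests.get(w, False):
            ratios.append(t_baseline / agent[w])  # speedup_i > 1 means faster
        else:
            ratios.append(1.0)                    # failed correctness: treat the edit as reverted
    # geometric mean (prod r_i)^(1/n), computed in log space for numerical stability
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# toy usage: two workloads improved, one regressed and failed its tests
baseline = {"time_init_scalar": 2.0, "time_init_array": 4.0, "time_to_string": 1.0}
agent    = {"time_init_scalar": 1.0, "time_init_array": 2.0, "time_to_string": 1.5}
ok       = {"time_init_scalar": True, "time_init_array": True, "time_to_string": False}
print(round(aggregate_speedup(baseline, agent, ok), 3))  # ~1.587
```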
Advantage. For each task, we also have expert-written code modifications, Code_expert. For example, the performance issue in Figure 1 was eventually resolved by a human expert. We use the performance of Code_expert as a reference point to characterize the difficulty of each task. We can then define the advantage of an agent as:

    Adv_agent = speedup_agent − speedup_expert.

If an agent had simply memorized the expert solution (e.g., due to training data contamination), then the advantage is zero. Indeed, the goal of super-human optimization is to achieve a large positive advantage. Appendix Figure 23 provides a geometric intuition for this metric.

Stratified Advantage. We now turn to measuring advantage aggregated at different levels of granularity. We use ℓ ∈ {0, 1, ...} to denote the code hierarchy level.

- At the coarsest level (ℓ = 0), we group workloads by entire modules such as algorithms.*.
- At finer levels, we group workloads under individual classes or functions (e.g., algorithms.Sorting.*, algorithms.Sorting.time_sort_int.*).

Each level ℓ thus partitions the workloads into groups G^(ℓ) = {g^(ℓ)_1, ..., g^(ℓ)_{K_ℓ}}, where each workload belongs to some g^(ℓ)_k. We can then define the per-group advantage as

    Adv_{agent,g} = speedup_agent(g) − speedup_expert(g),

where speedup_*(g) is defined using Equation 1 computed only over the workloads in g. The stratified advantage at level ℓ is then the average across all groups at that level:

    Adv^(ℓ)_agent = (1 / |G^(ℓ)|) ∑_{g ∈ G^(ℓ)} Adv_{agent,g}.

The family {Adv^(ℓ)_agent | ℓ ∈ Z_≥0} thus forms a multi-scale profile of an agent's performance. Because aggregation is performed over multiplicative speedup ratios within each group, Adv^(ℓ)_agent remains in the same metric family as the global advantage, but is sensitive to how performance gains are organized across the codebase hierarchy (Figure 22).

Normalized Advantage. Finally, we introduce a normalized version of advantage that explicitly accounts for noise and heterogeneity across workloads. Given the variance of the per-workload speedup ratios for an agent, σ²(agent), we define the normalized advantage of an agent as

    Adv~_agent = Adv_agent / √(σ²(agent) + σ²(expert)).

Conceptually, Adv~_agent captures a signal-to-noise ratio of the agent advantage, and rewards consistency across workloads.

Cost-Weighted Metrics. In practice, we also care about the inference budget of the optimization agent. We estimate the total inference cost as cost_agent = c_in · N^in_agent + c_out · N^out_agent, where N^in_agent and N^out_agent denote the total number of input and output tokens, and c_in and c_out are the per-token prices. This allows us to define the cost-weighted advantage

    cost(Adv_agent) = Adv_agent / cost_agent,

which captures the human-relative improvement obtained per unit of inference budget. We will use these metrics in §3 to evaluate code optimization agents' performance on real-world codebases.
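The remaining §2.1 metrics reduce to a few lines once the per-workload ratios are available. The snippet below is a minimal sketch under the same assumptions as the previous example; the per-token prices in the usage example are illustrative placeholders, not any provider's actual pricing.

```python
import math
from statistics import pvariance

def geo_mean(ratios):
    return math.exp(sum(map(math.log, ratios)) / len(ratios))

def advantage(agent_ratios, expert_ratios):
    """Adv = geometric-mean speedup of the agent minus that of the expert."""
    return geo_mean(agent_ratios) - geo_mean(expert_ratios)

def normalized_advantage(agent_ratios, expert_ratios):
    """Adv divided by the pooled spread of the per-workload speedup ratios."""
    adv = advantage(agent_ratios, expert_ratios)
    pooled = math.sqrt(pvariance(agent_ratios) + pvariance(expert_ratios))
    return adv / pooled if pooled else 0.0

def cost_weighted_advantage(adv, n_in, n_out, price_in, price_out):
    """Advantage per unit of inference spend: Adv / (c_in*N_in + c_out*N_out)."""
    return adv / (price_in * n_in + price_out * n_out)

# toy usage with three workloads and illustrative per-token prices
agent_ratios  = [1.4, 1.1, 0.9]
expert_ratios = [1.3, 1.2, 1.2]
adv = advantage(agent_ratios, expert_ratios)
print(round(adv, 4), round(normalized_advantage(agent_ratios, expert_ratios), 4))
print(cost_weighted_advantage(adv, n_in=200_000, n_out=30_000,
                              price_in=1.25e-6, price_out=1e-5))
```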
2.2. Dataset Construction

Here we briefly summarize our dataset construction. Full details can be found in Appendix A. Figure 2 shows an overview of our procedure.

Figure 2 (flowchart: scraping repositories → rule-based and LLM-based filters → environment synthesis with a feedback-driven LLM agent → statistical validation via a Mann-Whitney U test confirming the oracle speedup, narrowing 766 GitHub repositories down to 957 tasks from 70 repositories): Overview of the FORMULACODE construction pipeline. FORMULACODE follows a four-stage pipeline to identify real-world performance optimization tasks. (1) Scrape compliant repositories (§A.1.1). (2) Apply rule-based and LLM-based filters to identify candidate performance improvement pull requests (§A.1.2). (3) Construct reproducible Docker environments for each candidate (§A.1.3). (4) Validate each candidate for correctness and statistically significant performance improvement (§A.1.4). The pipeline is fully automated and updates FORMULACODE with new tasks every month.

Repository Scraping. We search for repositories with mature performance benchmarking infrastructure. Using a common SQL script on GitHub's public dataset, we find 766 repositories containing Airspeed Velocity (ASV; Droettboom et al., 2025) performance workloads, ensuring they have active maintenance, Python 3.8+ support, and at least 100 stars (§A.1.1).

Attribute Filtering. We scrape 26,717 pull requests from 127 repositories and apply both rule-based filters (merged status, benchmark infrastructure presence, appropriate file changes) and LLM-based intent classification to identify 3,181 candidate performance improvements from 101 repositories. An LLM agent analyzes PR descriptions, patches, and linked issues to verify that the primary intent is performance optimization. For each candidate, the submitted patch corresponds to Code_expert (Appendix A.1.2).

Environment Synthesis. For each candidate, we automatically generate reproducible Docker build scripts using a reflexive LLM agent that iteratively refines installation commands based on build failures. Through chronological caching of successful scripts and targeted tool use, we synthesize verified environments for 1,232 tasks across 75 repositories (Appendix A.1.3).

Statistical Validation. We execute expert patches and baseline code in isolated environments, measuring performance across all ASV workloads. Using Mann-Whitney U tests (p < 0.002; Mann & Whitney, 1947) and strict correctness checks (unit tests + snapshot tests), we retain only tasks with statistically significant, reproducible improvements, yielding 957 final tasks across 70 repositories (§A.1.4). This pipeline projects to add an average of 27.00 new tasks per month.
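The Stage 4 gate can be approximated as a one-sided Mann-Whitney U test over repeated timing samples of each workload. The sketch below uses SciPy's mannwhitneyu; the actual sampling procedure and acceptance rules live in the asv harness and Appendix B.1.6, so treat this as an illustration of the statistical check, not the pipeline itself.

```python
from scipy.stats import mannwhitneyu

def is_significant_speedup(baseline_samples, patched_samples, alpha=0.002):
    """Accept a candidate task only if patched runtimes are stochastically
    smaller than baseline runtimes at the given significance level.

    baseline_samples / patched_samples: repeated wall-clock measurements of
    the same workload before and after applying the expert patch.
    """
    stat, p_value = mannwhitneyu(baseline_samples, patched_samples,
                                 alternative="greater")  # H1: baseline > patched
    return p_value < alpha

# toy usage: the patch roughly halves the runtime of this workload
baseline = [1.02, 0.99, 1.01, 1.03, 1.00, 0.98, 1.02, 1.01]
patched  = [0.52, 0.49, 0.51, 0.50, 0.53, 0.48, 0.50, 0.51]
print(is_significant_speedup(baseline, patched))  # True for this toy data
```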
3. Experiments

We organize our experimental findings into three categories.

- First, we present overall performance metrics to investigate whether agents can achieve meaningful runtime speedups and whether they can outperform experts.
- Second, we provide a detailed breakdown of agent capabilities, examining performance across optimization strategies, optimization scope, and repository popularity.
- Third, we present additional findings on cost-effectiveness, multi-workload optimization, data leakage, and ensemble approaches.

We compare four frontier LLMs (GPT-5 (Singh et al., 2025), Claude 4.0 Sonnet (Anthropic, 2025), Gemini 2.5 Pro (Comanici et al., 2025), and Qwen 3 Coder (Yang et al., 2025)) under two LLM frameworks, Terminus 2 (Merrill et al., 2026) and OpenHands (Wang et al., 2025). Terminus 2 is evaluated with all four models, while OpenHands is evaluated with GPT-5, Claude 4.0 Sonnet, and Qwen 3 Coder. Additional discussion of model and framework choices appears in Appendix §B.2.2. Evaluations are conducted on FORMULACODE-V due to compute constraints, using the metrics defined in §2. Full experimental details and additional analyses are provided in Appendix §B.2.

3.1. Global Leaderboard

For each agent-model configuration, we compute the human-relative advantage Adv and normalized advantage Adv~ defined in §2. We then aggregate configurations into a global leaderboard using the Ranked Pairs (RP) method (Tideman, 1987), yielding a transitive ordering. Table 1 summarizes the resulting rankings.

Table 1: Global leaderboard of agent-model configurations on FORMULACODE-V. We report the Ranked Pairs (RP) position induced by human-relative advantage (Adv), the normalized advantage (Adv~), and the speedup as defined in §2.

Agent | Model | RP Rank (Adv) ↓ | Adv ↑ | Adv~ ↑ | speedup ↑
Terminus 2 | GPT-5 | 7 | -0.0504 | -0.1387 | 1.0585
Terminus 2 | Claude 4.0 Sonnet | 4 | -0.0410 | -0.1065 | 1.0987
Terminus 2 | Gemini 2.5 Pro | 6 | -0.0433 | -0.1138 | 1.0963
Terminus 2 | Qwen 3 Coder | 5 | -0.0454 | -0.1257 | 1.0677
OpenHands | GPT-5 | 3 | -0.0209 | -0.0702 | 1.0825
OpenHands | Claude 4.0 Sonnet | 1 | -0.0112 | -0.0483 | 1.0539
OpenHands | Qwen 3 Coder | 2 | -0.0301 | -0.1529 | 1.0346
Human Expert | - | - | 0.0000 | 0.0000 | 1.1040

Figure 3 (stratified advantage, roughly -0.1 to 0.3, at module-, class-, and function-level aggregation for each agent-model configuration): Stratified advantage across hierarchy levels for each agent-model configuration. Each line traces the stratified advantage Adv^(ℓ)_agent over ℓ ∈ {1, 2, 3}, revealing whether a configuration prefers coarse module-level changes or fine-grained function-level edits.

Observation: Agents achieve non-trivial speedups over the baseline. All evaluated configurations attain speedup > 1 on FORMULACODE-V relative to the baseline codebase (associated with the issue), indicating that agents can successfully identify and implement runtime-relevant changes.

Observation: Agents underperform human experts on performance optimization tasks. For all agents, the overall advantage, Adv, is negative, indicating a fundamental performance gap. We also notice a disagreement between the Adv and speedup metrics for many configurations, where large performance gains on certain "easier" tasks have a disproportionate influence on the global speedup score. The influence of such tasks is diminished in the Adv score, which compares each agent improvement to the corresponding expert improvement; since tasks that are "easier" typically also admit larger expert speedups, this relative metric yields a more consistent difficulty reference.

3.2. Large-Scale vs. Small-Scale Refactors

To disentangle performance by optimization scale, we use the hierarchical structure of FORMULACODE-V workloads (Figure 22) and the stratified advantage Adv^(ℓ)_agent from §2. We construct per-configuration profiles across three strata: module-level aggregation (ℓ = 1), class-level aggregation (ℓ = 2), and function-level aggregation (ℓ = 3). For each configuration and level ℓ, we compute group-level speedups and advantages, shown in Figure 3.

Observation: Agents demonstrate characteristic performance profiles. In Figure 3, models exhibit diverse performance profiles. OpenHands + Claude 4.0 Sonnet performs best at module-level optimization but underperforms at the function level, indicating that this configuration can overlook small-scale optimizations in favor of large-scale ones. Conversely, OpenHands + GPT-5 performs best at the function level but loses effectiveness at the module level.

Observation: Agents are comparatively stronger on local optimizations. With few exceptions (notably Claude 4.0 Sonnet + OpenHands), configurations achieve higher stratified advantage at function-level aggregation.
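Because asv workload names are dotted paths (module.Class.function), the stratified advantage used above can be computed by truncating each name at the desired depth and averaging per-group advantages. A minimal sketch, assuming per-workload speedup ratios keyed by benchmark name; this is illustrative, not the released evaluation code.

```python
import math
from collections import defaultdict

def geo_mean(ratios):
    return math.exp(sum(map(math.log, ratios)) / len(ratios))

def stratified_advantage(agent, expert, level):
    """Average per-group advantage at one hierarchy level.

    agent / expert map dotted workload names (module.Class.function) to
    speedup ratios; level=1 groups by module, 2 by class, 3 by function.
    """
    groups = defaultdict(lambda: ([], []))
    for name, agent_ratio in agent.items():
        key = ".".join(name.split(".")[:level])   # truncate the dotted path
        groups[key][0].append(agent_ratio)
        groups[key][1].append(expert[name])
    per_group = [geo_mean(a) - geo_mean(e) for a, e in groups.values()]
    return sum(per_group) / len(per_group)

# toy usage: a mixed picture across two modules and three functions
agent  = {"algorithms.Sorting.time_sort_int": 1.5,
          "algorithms.Sorting.time_sort_str": 0.9,
          "io.Parser.time_read_csv": 1.2}
expert = {"algorithms.Sorting.time_sort_int": 1.2,
          "algorithms.Sorting.time_sort_str": 1.1,
          "io.Parser.time_read_csv": 1.3}
for lvl in (1, 2, 3):
    print(lvl, round(stratified_advantage(agent, expert, lvl), 3))
```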
3.3. Type of Optimization Problem

We investigate whether models can outperform human experts on particular classes of optimizations. For each problem in FORMULACODE-V, we label the optimization attempted by the human-written patch using an LLM (see §B.1.7 for details). Next, we aggregate the advantage of each agent-model pair within each optimization class. Table 2 summarizes the results.

Table 2: Per-tag advantage for each agent-model configuration. Columns correspond to optimization tags (see 7), and cells report the human-relative advantage restricted to workloads whose patches are annotated with the respective tag. OpenHands + GPT-5 shows strong advantage on algorithmic rewrites and data-structure changes, while other models perform comparatively better on micro-optimizations or caching. Tag order: Algo, Data, Lower, Approx, Parallel, Reduce, Cache, Batch, Scale, DB, Micro, I/O, Higher, Uncat.

Terminus 2 + GPT-5: -0.064, -0.112, -0.233, –, 0.010, -0.006, -0.054, 0.028, –, 0.001, –, -0.002, –
Terminus 2 + Claude 4.0 Sonnet: -0.019, 0.011, -0.720, –, 0.013, -0.028, -0.048, 0.041, –, -0.038, –, -0.009, –
Terminus 2 + Gemini 2.5 Pro: -0.029, 0.011, -0.676, –, 0.013, -0.028, -0.048, 0.041, –, -0.038, –, -0.007, –
Terminus 2 + Qwen 3 Coder: -0.023, 0.007, -0.455, –, 0.007, -0.079, -0.027, 0.042, –, -0.066, –, 0.005, –
OpenHands + GPT-5: 0.015, -0.052, -0.211, –, 0.015, -0.051, -0.018, 0.040, –, -0.018, –, -0.008, –
OpenHands + Claude 4.0 Sonnet: -0.028, 0.023, -0.180, –, 0.007, -0.049, -0.017, 0.047, –, 0.086, –, -0.005, –
OpenHands + Qwen 3 Coder: -0.020, -0.004, -0.203, –, 0.012, -0.016, -0.019, 0.051, –, -0.063, –, 0.013, –

Observation: Some optimization classes remain systematically difficult for agents. We observe certain optimization categories where agents outperform experts. Specifically, all agents were able to find faster solutions in tasks where the expert attempted a parallelization- or batching-based solution. Conversely, all agents struggle when the human solutions require delegating to lower-level system implementations (C extensions, vectorized operations).

3.4. Long-Tail Generalization Across Repository Popularity

We next study how performance varies by repository popularity (measured using GitHub stars). We compute advantage statistics for each popularity quintile.

Table 3: Performance across repository popularity quintiles (by GitHub stars). We report Adv_agent for workloads drawn from repositories in each quintile, from least popular (Q1) to most popular (Q5).

Agent | Model | Q1 | Q2 | Q3 | Q4 | Q5
Terminus 2 | GPT-5 | -0.0194 | 0.0423 | -0.0045 | -0.2754 | -0.0123
Terminus 2 | Claude 4.0 Sonnet | -0.0450 | -0.0062 | 0.0025 | -0.3529 | -0.0220
Terminus 2 | Gemini 2.5 Pro | 0.0077 | -0.0062 | 0.0024 | -0.3311 | -0.0445
Terminus 2 | Qwen 3 Coder | -0.0691 | 0.0052 | -0.0179 | -0.1669 | -0.0332
OpenHands | GPT-5 | -0.0387 | 0.0315 | 0.0072 | -0.0769 | -0.0068
OpenHands | Claude 4.0 Sonnet | -0.1041 | 0.0291 | -0.0200 | -0.0378 | 0.0263
OpenHands | Qwen 3 Coder | -0.0159 | 0.0137 | 0.0227 | -0.0878 | -0.0270

Observation: Agents perform weakest on tail repositories. Agent performance is substantially lower in the first popularity quintile (Q1; bottom 20%), which comprises repositories with 133-202 GitHub stars. Expert patches, however, yield comparatively large gains in this regime: speedup_expert(Q1) = 1.1104, the second-largest speedup across quintiles. One hypothesis is that smaller repositories contain more heterogeneous, high-impact micro-optimizations of a kind that may have already been discovered in larger, more mature repositories, leading to more variable (but sometimes high-impact) optimization opportunities. A second plausible hypothesis is distribution shift: smaller repositories may be less represented in training corpora, reducing agent effectiveness.

Observation: Agents are most competitive on mid-popularity repositories. In the 20th to 60th percentile range, mean advantages are closest to expert performance, and some configurations perform comparably with experts. We hypothesize that this is due to two reasons. First, moderately popular repositories more closely match the agent's training distribution than tail repositories. Second, these repositories have more unexploited optimization avenues relative to highly popular projects.
Observation: Performance dips in high-popularity repositories. Agent performance is lowest in the fourth quintile (Q4; 6,371-10,343 stars). In this regime, expert patches also yield the smallest gains: speedup_expert(Q4) = 1.0822, the lowest expert speedup across all quintiles. This pattern indicates reduced remaining optimization headroom in these repositories, where many simpler improvements may have already been realized. Additionally, slight distribution shift may persist and limit agent effectiveness.

3.5. Practical Considerations

3.5.1. Cost Efficiency

Frontier models differ substantially in end-to-end inference cost due to provider pricing and the number of tokens consumed by a given agent configuration. In this experiment, we consider the cost-performance tradeoff within our agent configurations using the cost-weighted objectives defined in §2. Table 10 reports a leaderboard based on cost-weighted normalized advantage, and Figure 4 summarizes the resulting trade-off.

Figure 4 (mean advantage vs. mean cost in USD for each agent-model configuration): Cost-performance tradeoff of agent-model configurations. As most agents struggle on code optimization tasks, the Pareto set is primarily dominated by the most expensive model (Claude 4.0 Sonnet).

Observation: Higher-priced models rank best under the cost-weighted objective. When weighted by cost, top-ranked configurations tend to use the higher-priced (and more capable) models. A contributing factor is that lower-capability models often consume more tokens within the agent loop, which can offset lower per-token prices. This might also indicate that smaller models lack the capabilities to reason effectively about performance optimizations.

3.5.2. Multi-Workload Tradeoff Performance

Performance optimization necessitates a holistic understanding of competing workloads. In this experiment, we compare the global speedup achieved by a model with the largest regression it causes. For each agent-model configuration, we compute (i) the global speedup aggregated across tasks and workloads, and (ii) the average worst-workload speedup, defined as follows: for each task, we take the minimum speedup across the task's workloads, and then average this minimum across tasks. Figure 5 plots these two quantities.

Figure 5 (worst-workload speedup vs. global speedup for each agent-model configuration and the human expert): Multi-workload tradeoff performance of agent-model configurations. We quantify a model's speedup performance as a function of its worst regression. The expert patch achieves the highest speedup while tolerating considerable workload regressions.

Observation: Multi-workload optimization remains challenging for agents. Despite causing large regressions, human code edits achieve the best global speedup, indicating a superior ability to negotiate multi-workload performance tradeoffs compared to our configurations.
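The two axes of Figure 5 can be computed per configuration as in the small sketch below, assuming each task's per-workload speedups are available as a dictionary (the workload names are illustrative).

```python
import math

def tradeoff_summary(tasks):
    """tasks: list of dicts, one per task, mapping workload name -> speedup ratio.

    Returns (global geometric-mean speedup over all workloads,
             average of each task's worst-workload speedup).
    """
    all_ratios, worst_per_task = [], []
    for workload_ratios in tasks:
        ratios = list(workload_ratios.values())
        all_ratios.extend(ratios)
        worst_per_task.append(min(ratios))        # largest regression on this task
    global_speedup = math.exp(sum(map(math.log, all_ratios)) / len(all_ratios))
    return global_speedup, sum(worst_per_task) / len(worst_per_task)

# toy usage: a large win on one task paid for by a regression on another
tasks = [{"time_a": 1.30, "time_b": 1.10},
         {"time_c": 0.92, "time_d": 1.05}]
print(tuple(round(x, 3) for x in tradeoff_summary(tasks)))
```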
3.5.3. Temporal Generalization

Motivation. FORMULACODE is a live benchmark: tasks are continuously added and include creation timestamps. This enables us to probe the temporal out-of-distribution behavior of agents on performance optimization tasks. Related work on code correctness finds large gains when tasks are present in training corpora (Jain et al., 2024a).

We bucket tasks by their month of creation and compute mean global speedup in windows defined by the temporal distance to each model's knowledge cutoff (§B.2.2). We use 3-month bins and consider bins up to 6 months before/after the cutoff. Table 4 summarizes the results.

Table 4: Temporal analysis of model performance across knowledge cutoff boundaries. Each column represents a temporal bin defined by distance (in months) from the model's training data cutoff; values indicate mean global speedup (speedup_agent) within each bin. We find no consistent drop in performance.

Model | 6+ mo before | 3-6 mo before | 0-3 mo before | 0-3 mo after | 3-6 mo after | 6+ mo after
Claude 4.0 Sonnet | 1.0892 | 1.0564 | 0.9966 | 1.0915 | 1.0951 | 1.0519
GPT-5 | 1.1708 | 1.0454 | 0.9871 | 1.0378 | 1.0679 | 1.0500
Gemini 2.5 Pro | 1.1071 | 0.9989 | 1.0219 | 1.0523 | 1.1063 | 1.0251

Observation: Limited evidence of a cutoff-aligned leakage effect. Performance shows no consistent shift when moving from pre-cutoff to post-cutoff task creation dates, suggesting the gap is capability-based rather than data-based.

4. Related Work

Algorithms for Code Optimization. There is a long history of research on iterative code optimization using execution feedback. Classical approaches to this problem were based on stochastic search and constraint solving (Schkufza et al., 2013; Sasnauskas et al., 2018). Among deep-learning-based approaches, AlphaTensor and AlphaDev produce super-optimized matrix multiplication and sorting routines, respectively (Fawzi et al., 2022; Mankowitz et al., 2023). These systems combine large, publicly sourced pretraining datasets with carefully chosen inductive biases to make optimization faster. More general agentic optimization workflows operate by iteratively running LLM-generated code, evaluating the output, and feeding the output back to the model. Terminus 2 and OpenHands represent two such configurations out of many that benefit from iterative feedback (Yao, 2024; Yang et al., 2024; Merrill et al., 2026; Wang et al., 2025; Merrill & Shaw, 2025). FORMULACODE is the first benchmark purpose-built to assess the multi-workload optimization ability of such agentic AI algorithms in real-world codebases, and it provides the fine-grained evaluation functions needed for iterative optimization.

Evolutionary optimization algorithms equipped with LLMs (Romera-Paredes et al., 2024; Grayeli et al., 2024) iteratively improve a candidate pool of programs using execution feedback. Systems like AlphaEvolve (Novikov et al., 2025) and OpenEvolve (Sharma, 2025) demonstrate that such agents can efficiently discover and refine novel, high-performance code-based heuristics across diverse scientific domains. These methods are scalable but require high-quality evaluation functions to penalize degenerate solutions. While FORMULACODE provides the necessary evaluation functions, we could not benchmark evolutionary methods due to their substantial compute needs.

Code Generation Benchmarks. Coding benchmarks can be differentiated by their synthesis scope. For a list of differences, consult Table 5.

Table 5: Comparing FORMULACODE with related codebase benchmarks. FORMULACODE is the only benchmark that satisfies the desired properties for evaluating LLM agents on real-world code optimization tasks. "++" denotes continually updating benchmarks. Data is sampled either from real distributions (e.g., GitHub, Leetcode, AtCoder, and Codeforces) or from LLM-generated and synthetic distributions.
An extended analysis is presented in §4.

Benchmark | Evaluation framework | # Tasks | # Workloads / Task | Live updates | Data source | Search space | Synthesis scope | Leakage resistant?
GSO-Bench | Performance | 102 | Single | ✗ | GitHub | Large | Repo | ✗
SWE-Bench | Unit Tests | 2292 | - | ✗ | GitHub | Small | Repo | ✗
LiveCodeBench | Unit Tests | 300++ | - | ✓ | LeetCode, AtCoder, Codeforces | Small | File | ✓
SWEfficiency | Performance & Unit Tests | 400 | Single | ✗ | GitHub | Large | Repo | ✗
CruxEval | Unit Tests | 800++ | - | ✗ | Synthetic | Small | File | ✓
FormulaCode | Performance & Unit Tests | 957++ | 264.58 | ✓ | GitHub | Large | Repo | ✓

Function and file level. HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) present hand-written programming problems in Python with corresponding unit tests. Many contributions extend these benchmarks to have more testing (Liu et al., 2023), broader scope (Yin et al., 2022; Yang et al., 2023), and more task diversity (Muennighoff et al., 2023; Lai et al., 2022; Zan et al., 2022). CruxEval (Gu et al., 2024) benchmarks the code execution and reasoning ability of LLMs more deeply. LiveCodeBench (Jain et al., 2024a) attempts to mitigate data leakage by annotating problems with release dates. All of these benchmarking efforts utilize unit testing suites to gauge program correctness. FORMULACODE supplements the evaluation signal provided by the above datasets by using community-maintained evaluation functions that continually update with each commit.

Repository level. Function- and file-level benchmarks evaluate coding ability on self-contained coding tasks. However, real software issues typically span multiple modules and files. Repository-level benchmarks (Jimenez et al., 2024; Tang et al., 2024; Jain et al., 2024b; Shetty et al., 2025) aim to preserve the inherent challenges in real-world software engineering beyond text completion, such as finding relevant files, capturing relationships between modules, and tracing information flow. SWE-Bench (Jimenez et al., 2024) collects GitHub issues from popular repositories and evaluates coding agents' ability to resolve the issues. Follow-up efforts benchmark agents on repository-conditioned code synthesis (Tang et al., 2024), scale up benchmarking by admitting smaller codebases with LLM-generated unit tests (Jain et al., 2024b), and introduce continually updating pipelines for the task (Zhang et al., 2025). Such extensions provide valuable insights into LLM agent behavior yet ground their evaluations in correctness tests, which present a discrete optimization surface for the agents. FORMULACODE complements these benchmarks by assessing agents on community-maintained evaluation functions that present a smoother optimization landscape and higher fidelity than unit tests.

Optimization Benchmarks. There are prior benchmarks for efficient code synthesis on function- and file-level tasks. COFFE (Peng et al., 2025) samples tasks from HumanEval, MBPP, CodeContests, and APPS (Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021) and auto-generates stress tests, while ECCO (Waghjale et al., 2024) curates a function- and file-level efficient-synthesis dataset from IBM CodeNet (Puri et al., 2021) with data-mined test cases. Recent repository-level benchmarks like GSO-Bench (Shetty et al., 2025) and SWEfficiency (Ma et al., 2025) also study LLM agents' ability to optimize code. However, these benchmarks only optimize for a single target function at a time, and SWE-Perf (He et al., 2025) does not test correctness.
In contrast, FORMULACODE focuses on: (1) using community-maintained benchmarks specifically designed to profile performance inefficiencies instead of using hand-curated stress tests, (2) benchmarking on repository-level codebases, which better capture the natural challenges of real-world code optimization, and (3) presenting multiple workloads that can compete with one another to assess the holistic optimization ability of agents.

5. Conclusion

We present FORMULACODE, a comprehensive coding benchmark for repository-level agentic optimization. In this benchmark, coding agents must not only write code that passes standard correctness tests, but also improve runtime, and our benchmark design enables us to study the impact of repository popularity, temporal cutoffs, and multi-scale optimization to guide the design of future agents capable of surpassing human experts. As code-writing agents become more capable at the repository level, FORMULACODE provides a rigorous foundation for development. To ensure longevity and prevent saturation, we operate as a live benchmark, continually ingesting new tasks to test agents against an evolving human baseline. Our evaluations show that FORMULACODE is a challenging benchmark for frontier LLMs and agentic frameworks, leaving open significant room for future agent development.

6. Acknowledgements

This work was supported in part by a Laude Institute Slingshot Award, NSF awards I-#2505097, PPoSS-#2316161, NSF #2505096, NSF #2505098, and gifts from Point72 and OpenAI. We also thank Alex Shaw, Braden Hancock, Miles Cranmer, Neehar Kondapaneni, Rogério Guimarães, Anant Asthana, and Markus Marks for helpful discussions.

7. Impact Statement

We have presented FORMULACODE: a benchmark for measuring the capabilities of LLM-guided agents to optimize performance on large codebases. FORMULACODE is designed to serve two audiences: researchers (those developing new LLMs / agents) and practitioners (those using agents for daily workflows). For researchers, we hope that FORMULACODE accelerates the development of coding agents by providing contamination-free training and evaluation signals. For practitioners, we hope FORMULACODE offers comparative metrics that gauge the utility of LLMs and agents in specialized repositories under diverse cost-performance constraints. In this section, we discuss the broader societal impacts and ethical considerations of our work.

Potential for Misuse. Benchmark results are only as reliable as the interpretations drawn from them. To ground evaluations in realistic developer workflows, we use community-maintained workloads that already exist in each repository and attempt to preserve the same information and performance instrumentation available to a human contributor. This design also supports practical impact: strong model-generated changes can, in principle, be merged upstream to reduce maintenance burden, particularly for smaller repositories after thorough manual analysis. At the same time, reliance on repository workloads introduces an attack surface: an adversary could submit pull requests that alter or add workloads to make tasks artificially easier. While such additional workloads can increase regression coverage (thereby providing some downstream utility), practitioners should treat workload provenance and review practices as part of the evaluation's trust boundary.

Privacy Concerns. FORMULACODE is an 'open-book' benchmark and necessarily includes interactions from open-source software developers.
We include such context to pro- vide models access to the same information a human would use when solving these tasks. Although we anonymize user- names and remove personally identifiable information to the best of our ability, some contributors may remain indirectly identifiable via secondary cues (e.g., writing style, repeated project-specific references). Bias and Fairness. Benchmarks can incentivize and influ- ence which capabilities are prioritized by the community. We strive to make FORMULACODE ’s metrics explicit and stable, and we apply statistical analyses to reduce unin- tended measurement artifacts. Yet, FORMULACODE inher- its limitations from the underlying repository benchmarks. In particular, FORMULACODE is susceptible to a form of the Quantitative Fallacy: aspects of agent competence that are difficult to measure may be underweighted or omitted, inflat- ing the true utility of such algorithms. This is a limitation of all execution-based benchmarks. We therefore recommend using FORMULACODE as a complementary signal rather than as a substitute for careful manual assessment of Agent / LLM behavior. References AI@Meta.Llama 3 model card.2024.URL https://github.com/meta-llama/llama3/blob/ main/MODEL_CARD.md. Amazon Web Services. Infrastructure security in amazon ec2: Isolation on physical hosts. Amazon EC2 User Guide. URLhttps://docs.aws.amazon.com/AWSEC2/ latest/UserGuide/infrastructure-security. html#physical-isolation. Accessed: 2026-01-28. Anthropic.The Claude 3 Model Family: Opus, Son- net, Haiku.https://w-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/ Model_Card_Claude_3.pdf , March 2024. Model card v1.0. Accessed 31 May 2025. Anthropic. System card: Claude opus 4 & claude sonnet 4. PDF, May 2025. URLhttps://w-cdn.anthropic. com/6d8a8055020700718b0c49369f60816ba2a7c285. pdf. Includes changelog updates dated July 16, 2025 and September 2, 2025. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021. Balsamo, S., Di Marco, A., Inverardi, P., and Simeoni, M. Model-based performance prediction in software devel- opment: A survey. IEEE Transactions on Software Engi- neering, 30(5):295–310, 2004. Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., and et. al, J. K. Evaluating large language models trained on code, 2021. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., Marris, L., Petulla, S., Gaffney, C., Aha- roni, A., Lintz, N., Pais, T. C., Jacobsson, H., Szpektor, 9 I., Jiang, N.-J., Haridasan, K., Omran, A., Saunshi, N., Bahri, D., Mishra, G., Chu, E., Boyd, T., Hekman, B., Parisi, A., Zhang, C., Kawintiranon, K., Bedrax-Weiss, T., Wang, O., Xu, Y., Purkiss, O., Mendlovic, U., Deu- tel, I., Nguyen, N., Langley, A., Korn, F., Rossazza, L., Ramé, A., Waghmare, S., Miller, H., Byrd, N., Sheshan, A., Hadsell, R., Bhardwaj, S., Janus, P., Rissa, T., Horgan, D., Abdagic, A., Belenki, L., Allingham, J., Singh, A., Guidroz, T., Srinivasan, S., Schmit, H., Chiafullo, K., Elisseeff, A., Jha, N., Kolhar, P., Berrada, L., Ding, F., Si, X., Mallick, S. 
B., Och, F., Erell, S., Ni, E., Latkar, T., Yang, S., Sirkovic, P., Feng, Z., Leland, R., Hornung, R., Wu, G., Blundell, C., Alvari, H., Huang, P.-S., Yip, C., Deur, S., Liu, L., Surita, G., Duque, P., Damen, D., Jia, J., Guez, A., Mircea, M., Sinha, A., Magni, A., Stradomski, P., Marian, T., Gali ́ c, V., Chen, W., Husain, H., Singhal, A., Grewe, D., Aubet, F.-X., Song, S., Blanco, L., Rechis, L., Ho, L., Munoz, R., Zheng, K., Hamrick, J., Mather, K., Taitelbaum, H., Rutherford, E., Lei, Y., Chen, K., Shukla, A., Moreira, E., Doi, E., Isik, B., Shabat, N., Rogozi ́ nska, D., Kolipaka, K., Chang, J., Vušak, E., Venkatachary, S., Noghabi, S., Bharti, T., Jun, Y., Zaks, A., Green, S., Challagundla, J., Wong, W., Mohammad, M., Hirsch, D., Cheng, Y., Naim, I., Proleev, L., Vincent, D., Singh, A., Krikun, M., Krishnan, D., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2601.03267. Cruz, V. P. G., Rocha, H., and Valente, M. T. Snapshot testing in practice: Benefits and drawbacks. Journal of Systems and Software, 204:111797, 2023. Droettboom, M., Virtanen, P., and asv Developers. air- speed velocity (asv): A simple python benchmarking tool with web-based reporting.https://github.com/ airspeed-velocity/asv , 2025. GitHub repository, ver- sion v0.6.5, accessed 2026-02-24. Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera- Paredes, B., Barekatain, M., Novikov, A., R. Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforce- ment learning. Nature, 610(7930):47–53, 2022. Forum Discussion. Four kinds of optimisation (hacker news discussion). Hacker News, November 2023. URLhttps: //news.ycombinator.com/item?id=38262251. Forum Discussion. The fifth kind of optimisation (hacker news discussion). Hacker News, April 2025. URLhttps: //news.ycombinator.com/item?id=43555311. GitHub and Google Cloud Platform.bigquery-public- data.github_repos – github public repository dataset. https://console.cloud.google.com/marketplace/ details/github/github-repos, 2025. Queried via Google BigQuery on 30 May 2025. Grayeli, A., Sehgal, A., Costilla Reyes, O., Cranmer, M., and Chaudhuri, S. Symbolic regression with a learned concept library. Advances in Neural Information Process- ing Systems, 37:44678–44709, 2024. Gu, A., Rozière, B., Leather, H., Solar-Lezama, A., Syn- naeve, G., and Wang, S. I. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024. He, X., Liu, Q., Du, M., Yan, L., Fan, Z., Huang, Y., Yuan, Z., and Ma, Z. Swe-perf: Can language models optimize code performance on real-world repositories?, 2025. URL https://arxiv.org/abs/2507.12415. Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with apps, 2021. Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free eval- uation of large language models for code. arXiv preprint arXiv:2403.07974, 2024a. Jain, N., Shetty, M., Zhang, T., Han, K., Sen, K., and Stoica, I. R2E: Turning any github repository into a programming agent environment. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. 
(eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, p. 21196–21224. PMLR, 21–27 Jul 2024b. URLhttps: //proceedings.mlr.press/v235/jain24c.html. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770. Jin, G., Song, L., Shi, X., Scherpelz, J., and Lu, S. Un- derstanding and detecting real-world performance bugs. ACM SIGPLAN Notices, 47(6):77–88, 2012. Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., San- thanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., and Potts, C. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URLhttps: //arxiv.org/abs/2310.03714. Koch, B., Denton, E., Hanna, A., and Foster, J. G. Re- duced, reused and recycled: The life of a dataset in machine learning research. In Thirty-fifth Conference 10 on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps: //openreview.net/forum?id=zNQBIBKJRkd. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., tau Yih, S. W., Fried, D., Wang, S., and Yu, T. Ds- 1000: A natural and reliable benchmark for data science code generation, 2022. LangDB.Qwen3-coder-480b-a35b-instructby fireworksai.Web page,July 2025.URL https://langdb.ai/app/models/fireworksai/ qwen3-coder-480b-a35b-instruct.Model details, pricing, and performance metrics. Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023. Ma, J. J., Hashemi, M., Yazdanbakhsh, A., Swersky, K., Press, O., Li, E., Reddi, V. J., and Ranganathan, P. Swe-fficiency: Can language models optimize real-world repositories on real workloads?, 2025. URLhttps: //arxiv.org/abs/2511.06090. Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964): 257–263, 2023. Mann, H. B. and Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, p. 50–60, 1947. Merrill, M. and Shaw, A. Terminus.https://w.tbench. ai/terminus , May 2025. Published May 19, 2025. Ac- cessed 2026-01-28. Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Bercovich, I., Shi, L., Shin, J. Y., Walshe, T., Buchanan, E. K., Shen, J., Ye, G., Lin, H., Poulos, J., Wang, M., Nezhurina, M., Jitsev, J., Lu, D., Mastromichalakis, O. M., Xu, Z., Chen, Z., Liu, Y., Zhang, R., Chen, L. L., Kashyap, A., Uslu, J.-L., Li, J., Wu, J., Yan, M., Bian, S., Sharma, V., Sun, K., Dillmann, S., Anand, A., Lan- pouthakoun, A., Koopah, B., Hu, C., Guha, E., Dreiman, G. H. 
S., Zhu, J., Krauth, K., Zhong, L., Muennighoff, N., Amanfu, R., Tan, S., Pimpalgaonkar, S., Aggarwal, T., Lin, X., Lan, X., Zhao, X., Liang, Y., Wang, Y., Wang, Z., Zhou, C., Heineman, D., Liu, H., Trivedi, H., Yang, J., Lin, J., Shetty, M., Yang, M., Omi, N., Raoof, N., Li, S., Zhuo, T. Y., Lin, W., Dai, Y., Wang, Y., Chai, W., Zhou, S., Wahdany, D., She, Z., Hu, J., Dong, Z., Zhu, Y., Cui, S., Saiyed, A., Kolbeinsson, A., Hu, J., Rytting, C. M., Marten, R., Wang, Y., Dimakis, A., Konwinski, A., and Schmidt, L. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://arxiv.org/abs/2601.11868. Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. Octopack: Instruction tuning code large language models, 2023. Novikov, A., V ̃ u, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., and Balog, M. Alphaevolve: A coding agent for scientific and algorithmic discovery. Google DeepMind White Paper, May 2025. OpenAI, :, Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., and et al. gpt-oss-120b & gpt- oss-20b model card, 2025. URLhttps://arxiv.org/ abs/2508.10925. Peng, Y., Wan, J., Li, Y., and Ren, X. Coffe: A code efficiency benchmark for code generation, 2025. URL https://arxiv.org/abs/2502.02827. Puri, R., Kung, D. S., Janssen, G., Zhang, W., Domeni- coni, G., Zolotov, V., Dolby, J., Chen, J., Choudhury, M., Decker, L., Thost, V., Buratti, L., Pujar, S., Ramji, S., Fin- kler, U., Malaika, S., and Reiss, F. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks, 2021. URL https://arxiv.org/abs/2105.12655. Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M. P., Dupont, E., Ruiz, F. J., Ellenberg, J. S., Wang, P., Fawzi, O., et al. Mathematical discoveries from program search with large language models. Nature, 625 (7995):468–475, 2024. Sasnauskas, R., Chen, Y., Collingbourne, P., Ketema, J., Lup, G., Taneja, J., and Regehr, J. Souper: A synthesizing superoptimizer, 2018. URLhttps://arxiv.org/abs/ 1711.04422. Schkufza, E., Sharma, R., and Aiken, A. Stochastic super- optimization. In ACM SIGARCH Computer Architecture News, volume 41, p. 305–316. ACM, 2013. Sharma, A.Openevolve: Open-source implementa- tion of alphaevolve.https://github.com/codelion/ openevolve, 2025. Software, version 1.0.0. 11 Shetty, M., Jain, N., Liu, J., Kethanaboyina, V., Sen, K., and Stoica, I. Gso: Challenging software optimization tasks for evaluating swe-agents, 2025. URLhttps://arxiv. org/abs/2505.23671. Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366. Singh, A., Fry, A., Perelman, A., Tart, A., and et al., A. G. Openai gpt-5 system card, 2025. URLhttps://arxiv. org/abs/2601.03267. Tang, X., Liu, Y., Cai, Z., Shao, Y., Lu, J., Zhang, Y., Deng, Z., Hu, H., An, K., Huang, R., Si, S., Chen, S., Zhao, H., Chen, L., Wang, Y., Liu, T., Jiang, Z., Chang, B., Fang, Y., Qin, Y., Zhou, W., Zhao, Y., Cohan, A., and Gerstein, M. Ml-bench: Evaluating large language models and agents for machine learning tasks on repository-level code, 2024. URL https://arxiv.org/abs/2311.09835. Tideman, T. N. Independence of clones as a criterion for voting rules. 
Social Choice and Welfare, 4(3):185–206, 1987. ISSN 01761714, 1432217X. URLhttp://w. jstor.org/stable/41105866. Tratt, L. Four kinds of optimisation, November 2023. URLhttps://tratt.net/laurie/blog/2023/four_ kinds_of_optimisation.html. Tratt, L.The fifth kind of optimisation, April 2025. URLhttps://tratt.net/laurie/blog/2025/ the_fifth_kind_of_optimisation.html. Waghjale, S., Veerendranath, V., Wang, Z., and Fried, D. ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness? In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, p. 15362–15376, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.859. URLhttps://aclanthology.org/2024.emnlp-main. 859/. Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H., and Neubig, G. Openhands: An open platform for ai software developers as generalist agents, 2025. URL https://arxiv.org/abs/2407.16741. Woodside, M., Franks, G., and Petriu, D. C. The future of software performance engineering. In Future of Software Engineering (FOSE’07), p. 171–187. IEEE, 2007. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R., Gao, R., Liu, S., Luo, S., Li, T., Tang, T., Yin, W., Ren, X., Wang, X., Zhang, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Zhang, Y., Wan, Y., Liu, Y., Wang, Z., Cui, Z., Zhang, Z., Zhou, Z., and Qiu, Z. Qwen3 technical report, 2025. URLhttps: //arxiv.org/abs/2505.09388. Yang, J., Prabhakar, A., Narasimhan, K., and Yao, S. Inter- code: Standardizing and benchmarking interactive coding with execution feedback, 2023. Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. Swe-agent: Agent- computer interfaces enable automated software engi- neering, 2024. URLhttps://arxiv.org/abs/2405. 15793. Yao, S. Language Agents: From Next-Token Prediction to Digital Automation. PhD thesis, Princeton University, 2024. Yin, P., Li, W.-D., Xiao, K., Rao, A., Wen, Y., Shi, K., Howland, J., Bailey, P., Catasta, M., Michalewski, H., Polozov, A., and Sutton, C. Natural language to code generation in interactive data science notebooks, 2022. Zan, D., Chen, B., Yang, D., Lin, Z., Kim, M., Guan, B., Wang, Y., Chen, W., and Lou, J.-G. Cert: Continual pre- training on sketches for library-oriented code generation, 2022. Zhang, L., He, S., Zhang, C., Kang, Y., Li, B., Xie, C., Wang, J., Wang, M., Huang, Y., Fu, S., Nallipogu, E., Lin, Q., Dang, Y., Rajmohan, S., and Zhang, D. Swe-bench goes live!, 2025. URLhttps://arxiv.org/abs/2505. 23419. Zhao, W., Jiang, N., Lee, C., Chiu, J. T., Cardie, C., Gallé, M., and Rush, A. M. Commit0: Library generation from scratch, 2024. URLhttps://arxiv.org/abs/ 2412.01769. 12 A. FORMULACODE: Dataset Construction FORMULACODE consists of957multi-workload real-world code optimization problems from70repositories as of November 30th, 2025. 
We develop an automated four-stage pipeline that extracts these problems from 105074 pull requests across 766 repositories on GitHub, as described in §A.1 and illustrated in Figure 2. §A.2 summarizes the key properties of the dataset. At the time of collection, all frontier models tested on FORMULACODE struggle to outperform human experts (§3), though we expect more advanced models to close this gap in the near future.

A.1. Dataset Creation

Overview. The dataset creation pipeline comprises four broad stages: (1) crawl GitHub repositories with high-quality, expert-defined performance workloads (§A.1.1); (2) filter out candidate pull requests whose primary intent was not performance related, using rule-based and LLM-based attribute filters (§A.1.2); (3) synthesize an environment-building script so that the terminal-interface tools function (§A.1.3); and (4) filter out candidate PRs that do not show a statistically significant improvement on the performance workloads (§A.1.4).

A.1.1. STAGE 1: SCRAPING REPOSITORIES.

Our benchmarking apparatus relies heavily on mature tools developed within the Python performance benchmarking community (Appendix §B.2.1). To use these tools, the core developers of a package write customized performance-profiling workloads in a pre-specified format for their repository. This allows us to identify crowdsourced workloads, as well as repositories with an established, rigorous benchmarking procedure, by searching for the presence of these tools. Appendix §B.1.1 provides additional details on the scraping process. Overall, this step yields 766 repositories.

A.1.2. STAGE 2: ATTRIBUTE FILTERING.

For each repository, we scrape pull requests that were merged into the default branch and that reference at least one issue. Next, we filter out all pull requests with missing patches or unsatisfiable requirements (e.g., expired PyPI packages). This yields 26717 pull requests from 127 repositories. Finally, we construct a knowledge graph of the relevant issues and comments referenced by each pull request, filtering out any nodes created after the PR creation date. The knowledge graph is rendered along with the merge-commit patch and analyzed by an LLM agent to gauge whether the primary intent of the pull request is performance oriented. This filtering is required to reduce the cost of re-running all repositories. Specific details are presented in Appendix §B.1.2. This yields 3181 potential performance-improving tasks from 101 repositories, presented in Table 15.

A.1.3. STAGE 3: SYNTHESIZING REPRODUCIBLE ENVIRONMENTS.

Before we validate that the performance improvement claimed by the previous stage surfaces as a statistically significant improvement in the workloads, we must build and install a development copy of the package. However, automatically building such development copies proves to be a non-trivial task for three reasons. (1) Many scientific packages require complex tool interactions, which necessitate a bespoke build process. (2) The build process evolves significantly as a project matures. (3) The documentation for building packages tends to be extremely fragmented, requiring the reading of many plaintext and code files (README.md, setup.py, CONTRIBUTING.md, etc.) to reproduce. We automate the process of building such packages by developing a reflexive LLM agent (Shinn et al., 2023) that iteratively refines a shell script to build an editable environment for our benchmarking and testing apparatus; a minimal sketch of this refinement loop is shown below.
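The following Python sketch illustrates the shape of such a reflexive refinement loop. It is an illustration under stated assumptions: the propose callable (the LLM call) and run_in_container (the containerized build-and-validation harness) are hypothetical stand-ins, not APIs from the actual pipeline, and the retry budget is arbitrary.

from typing import Callable, NamedTuple, Optional

class BuildResult(NamedTuple):
    ok: bool
    stdout: str
    stderr: str

def synthesize_build_script(propose: Callable[[str, str], str],
                            run_in_container: Callable[[str], BuildResult],
                            max_iters: int = 5) -> Optional[str]:
    """Iteratively refine docker_build.sh until an editable install passes validation."""
    script, feedback = "", ""
    for _ in range(max_iters):
        # Ask the LLM for a (revised) build script, conditioned on prior failure logs.
        script = propose(script, feedback)
        result = run_in_container(script)          # build + ASV/pytest sanity checks
        if result.ok:
            return script                          # reproducible environment found
        feedback = result.stderr + result.stdout   # observations for the next attempt
    return None                                    # give up; the candidate PR is dropped

In practice, the observations fed back are the tail of the container logs, which is usually enough for the model to install missing system libraries or adjust build flags on the next attempt.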
In the worst case, such an agent must be run on every potential candidate PR. However, we find that aggressively caching and reusing previous scripts significantly lowers the amortized cost of LLM queries (Figure 14). More details are presented in Appendix §B.1.5. This process yields 1232 potential tasks with reproducible Docker containers from 75 repositories.

A.1.4. STAGE 4: STATISTICAL AND CORRECTNESS VALIDATION

Given a reproducible build environment, we can apply the expert-produced patch and ensure it produces a statistically significant speedup (Appendix §B.1.6). We offer two kinds of correctness tests in each FORMULACODE testing suite:

Unit Tests. Like contemporary work in building repository-centric code generation datasets (Jimenez et al., 2024), we find that the unit test suite needs to be manually validated to ensure proper operability. As such, in FORMULACODE-V, we present 108 problems where we manually synthesize and verify that the build process and test suite function properly.

Table 6: (Micro-averaged) statistics characterizing different attributes of a FORMULACODE task instance. The average FORMULACODE gold patch requires 5.2 more lines of code spread over 1.29× more files and 1.01× more functions than the average SWE-Bench (Jimenez et al., 2024) patch.

                                  Mean      Max
Issue       Text Length (Tokens)  2718.03   15781
Gold Patch  # Lines edited        38.08     8526
            # Files edited        3.93      34
            # Func. edited        6.06      54
Workloads   # Eval. Fns           264.58    1364
            % Coverage            41.24%    97.86%

Snapshot Tests. After benchmarking the performance workloads, we capture a snapshot of the immediate return values of the workloads (by execution-trace inspection). We then compare it against a reference snapshot captured after the human-written code was benchmarked. Comparison is skipped for any Python objects where equality is not defined. Such snapshot tests are commonly used in UX development to ensure that an underlying hard-to-inspect system (Android's View Hierarchy, the HTML DOM, or, in our case, an arbitrary Python package) does not change unexpectedly following codebase changes (Cruz et al., 2023). This snapshot-testing framework allows us to construct correctness checks for all performance workloads, greatly increasing the correctness verification surface of each task.

This process yields 957 statistically significant performance-improvement tasks that form FORMULACODE. Table 16 shows a repository-level breakdown of the final dataset. The next section presents a deeper analysis of the dataset.

A.2. FORMULACODE Analysis

Multi-Workload Optimization Tasks. Code optimizations rarely have isolated effects; an optimization in one part of the code could significantly slow down another part of the code or cause unwanted spikes in other resources (e.g., in some scenarios, memoization-based optimizations are undesirable because they decrease runtime at the cost of increased memory usage). FORMULACODE handles this problem by framing performance optimization as a multi-workload optimization problem. Each FORMULACODE problem has, on average, 264.58 performance workloads that are presented to the optimization agent along with the problem description. The agent is evaluated on the aggregate performance improvement it achieves across all workloads. To perform well on FORMULACODE, the agent must reason about the effect its changes have on multiple workloads spanning multiple target functionalities and resources; a simplified sketch of one such aggregate is shown below.
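As a deliberately simplified illustration of aggregating over many workloads, the sketch below computes the geometric-mean speedup of a candidate patch over the baseline. This is not FORMULACODE's scoring function (the benchmark's advantage metrics, reported in Table 10, are defined in the main text), but it shows why a single regressed workload can offset gains made elsewhere.

import math

def aggregate_speedup(baseline_times: dict[str, float],
                      candidate_times: dict[str, float]) -> float:
    """Geometric-mean speedup of a candidate patch over the baseline,
    computed across all shared performance workloads.

    A value > 1.0 means the candidate is faster on average; a badly
    regressed workload drags the whole score down, which is exactly the
    behavior a multi-workload benchmark wants to capture."""
    shared = baseline_times.keys() & candidate_times.keys()
    if not shared:
        raise ValueError("no common workloads to compare")
    log_speedups = [
        math.log(baseline_times[w] / candidate_times[w]) for w in shared
    ]
    return math.exp(sum(log_speedups) / len(log_speedups))

# Example: a 2x win on one workload and a 2x regression on another cancel out,
# yielding an aggregate speedup of exactly 1.0.
# aggregate_speedup({"a": 1.0, "b": 1.0}, {"a": 0.5, "b": 2.0})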
Task Diversity. The general consensus in repository-centered dataset design is to restrict scraping to a curated set of repositories. While manual curation significantly eases dataset construction, it inadvertently creates a cumulative advantage for certain types of repositories and their respective tasks. As explored in (Koch et al., 2021), this Matthew effect leads to benchmarks becoming disconnected from the broader task and ultimately hurts "in the wild" performance. Instead, FORMULACODE samples performance benchmarks from a large set of repositories based on whether their performance benchmarks satisfy the four pipeline stages described in §A.1. Figure 6 showcases the set of repositories represented in FORMULACODE and Table 16 presents a more detailed overview.

Contamination Resistance. Data contamination has been shown to skew the performance of frontier models on many code generation tasks mined from GitHub (Zhang et al., 2025). To be resistant to such data contamination issues, FORMULACODE functions as a live dataset. We update FORMULACODE's problem set on the 31st of each month with new problems. Figure 15 shows the distribution of FORMULACODE problems based on the merge date of the task. The earliest task was merged on 2017-10-21 and the most recent task is from 2025-11-21. 55.88% of the tasks were merged in the last two years, and we added, on average, 27.00 problems to the dataset every month in 2025.

Hierarchical Workloads. Based on the file structure of the benchmarks directory, we organize all workloads into three levels of increasing granularity: module, class, and function. As depicted in Figure 22, this allows us to aggregate workloads in our analysis according to the semantic grouping assigned by the core developers.

Dataset Composition. Table 15 shows the composition of FORMULACODE across the different filtering stages. In §B.1.7, we further characterize FORMULACODE using an automated taxonomy-based classifier that infers (i) the type of optimization problem and (ii) problem difficulty; the resulting distributions are reported in Tables 7 and 8. We find that, in FORMULACODE, roughly three categories account for ∼60% of problems (Micro Optimizations, Remove/Reduce Work, and Construct Better Algorithms), as shown in Table 7. Also, most expert solutions are inferred to be of Easy or Medium difficulty (Table 8). These distributions change only marginally in FORMULACODE-V.

Sample Questions. Appendix §B.2.4 showcases example questions from FORMULACODE.

B. Additional Details

This Appendix presents more details on the following topics:

§B.1: Dataset Construction. This includes subsections on (§B.1.1) scraping repositories and the compliant repositories discovered, (§B.1.2) details on attribute filtering and repository-level composition after attribute filtering, (§B.1.5) Docker container synthesis, and (§B.1.6) statistical testing.

§B.2: Experiments. This section provides additional details on (§B.2.1) the benchmarking framework used in FORMULACODE, (§B.2.2) the agent-model configurations presented, (§B.2.3) the taxonomy used for classifying the type of optimizations, (§B.2.4) qualitative examples showing characteristic behavior of various agent-model pairs, (§B.2.5) the evaluation framework used in FORMULACODE, and (§B.3) additional analysis of FORMULACODE.

B.1. Dataset Construction Details

In this section, we provide details on the dataset construction process.
Our core aim is to provide an automated pipeline for constructing a dataset of pull requests that are relevant for performance benchmarking. The dataset was constructed on a single Ubuntu 22.04 LTS machine with 503 GiB RAM, a dual-socket Intel Xeon Platinum 8352Y CPU @ 2.20 GHz (128 hardware threads), and 4× NVIDIA A40 GPUs (46 GiB VRAM each). Making the dataset from scratch takes ∼32 hours, consuming ∼100 GB of disk space for the metadata and ∼2 TB of disk space for the Docker image cache. We use two LLMs during the dataset construction process. For less complex tasks such as textual classification and extraction, we use the openai/gpt-oss-120b model served locally (Kwon et al., 2023; OpenAI et al., 2025). For complex tasks such as environment build-script synthesis, we first attempt to use the local LLM and fall back to the anthropic/claude-3-5-sonnet-20241022 (Anthropic, 2024) model (with a one-time total cost of $446 for the entire dataset). This cost may change if a different locally available LLM is utilized.

B.1.1. REPOSITORY SCRAPING

We identify compliant repositories by searching for the presence of mature tools developed within the Python performance benchmarking community. To search for these repositories at scale, we develop a CommonSQL script to search for the presence of performance-oriented tools and workloads in the GitHub Public Dataset on Google BigQuery (GitHub & Google Cloud Platform, 2025), which snapshots about 2.8 × 10^6 open-source repositories and 2 × 10^9 code files. We add additional filters to ensure only mature software packages are considered. Specifically, we ensure that each valid repository (1) has markers identifying the presence of at least one performance workload (e.g., asv.conf.json); (2) is not a fork of an existing repository; (3) shows PR merges and active maintenance in the last three years; and (4) supports Python 3.8+. This leaves us with 766 repositories. The CommonSQL script executes in about 48 seconds and costs $9.40. As an alternative, we can also use the GitHub Search API to query for the repositories. This yields the same number of repositories, but can be much slower due to API rate limits.

B.1.2. RULE-BASED FILTERING

Once we have a list of compliant repositories, it is technically possible to execute and measure the performance of all pull requests in each repository. However, as most pull requests do not primarily intend to improve performance, this leads to unnecessary waste of compute resources. The rule-based filtering stage therefore restricts performance measurement to pull requests that are suitable for benchmarking. Most filters in this stage aim to identify unambiguous signals that disqualify a pull request from being used for benchmarking. The prominent filters are listed below:

• Repository Compliance: We select repositories that have at least 100 GitHub stars. Below 100 stars, we found that repositories often lacked the community engagement necessary to produce good-quality pull requests.

• Pull Request Status: We strictly filter for pull requests that have been successfully merged (state='closed' with a valid merged_at timestamp) within the target date range. We also ensure that we can retrieve the patch and successfully apply it to the repository.

• Benchmarking Infrastructure: The specific commit tree must contain an Airspeed Velocity (ASV) configuration file (asv.conf.json), ensuring the repository supported benchmarking at that point in history.
• Core Content: We explicitly exclude commits that only touch non-functional paths, such as tests/, docs/, examples/, .github/, dist-info/, build artifacts, or packaging metadata (e.g., pyproject.toml, requirements.txt).

• Heuristic Message Filtering: We apply a regex-based pre-filter to the commit message. Commits matching "negative" patterns (e.g., "revert", "release", "bump version", "fix typo", "formatting") are discarded unless they also contain "positive" performance keywords (e.g., "speed", "optimize", "latency", "throughput", "memory", "vectorize"). Ambiguous messages are retained for LLM classification (an illustrative sketch of this pre-filter is given below).

• Complexity Constraints: To ensure feasibility for both the LLM context and the build system, we exclude commits that change more than 500 files or 80,000 lines of code, or where the patch size exceeds an acceptable context window for a capable local LLM (64,000 tokens). These constraints can be adjusted based on the future capabilities of LLMs.

• Build Environment: We clone each repository at the specific commit tree and attempt to build it using uv. uv is a fast Python package manager that can install dependencies from a project's dependency files (e.g., pyproject.toml, requirements.txt, or setup.py). If the build fails, we discard the pull request. If the build succeeds, we pin the dependencies to ensure that the build environment can be reproduced. This is a compute-intensive process and, after parallelizing the builds, requires ∼13 hours for all pull requests on our machine.

After applying these filters, we select 26717 pull requests from 127 repositories that are suitable for benchmarking.

B.1.3. PERFORMANCE INTENT FILTERING

The previous stage ensures that we only select pull requests that are suitable for benchmarking. However, a pull request that is suitable for benchmarking may still not primarily intend to improve performance. We therefore use a pre-trained local LLM to classify whether each pull request is performance improving. The primary objective of this classifier is to filter out pull requests that pass the regex-based heuristic but are not bona fide performance optimizations. Common examples of such false positives include commits that contribute new features instead of improving performance, refactor code structure without runtime impact, or improve maintainability. The classifier analyzes the pull request description, file-change summary, and the code patch to make this determination. The classifier is written in DSPy (Khattab et al., 2023) and the prompt is shown in Figure 7. We explicitly prioritize recall over precision. The prompt is configured to lean towards a "YES" classification in ambiguous cases. This design choice is deliberate, as false positives will be symbolically verified in the subsequent benchmark execution stage and discarded if they yield no measurable speedup.

B.1.4. PROBLEM STATEMENT CONSTRUCTION

To transform a raw pull request into a benchmark task, we must construct a clear, self-contained problem statement that defines the performance goal. We employ a multi-stage pipeline to aggregate context and extract a structured narrative.

Context Aggregation. For each candidate pull request, we scrape all available metadata (title, body, labels, comments, creation date, and merge date) that can be used to construct the problem statement. We also fetch the file-change summary and the raw patch content to ground the problem statement in the actual code changes.
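As a concrete illustration of the heuristic message pre-filter described in §B.1.2 above, the following Python sketch shows one way such a filter could be written. The keyword lists are abbreviated examples taken from the description, not the exact patterns used by the pipeline, and the function name is ours.

import re

# Abbreviated keyword lists based on the description in §B.1.2; the real
# pipeline's patterns may differ.
NEGATIVE = re.compile(r"\b(revert|release|bump version|fix typo|formatting)\b", re.I)
POSITIVE = re.compile(r"\b(speed|optimi[sz]e|latency|throughput|memory|vectori[sz]e)\b", re.I)

def prefilter_commit_message(message: str) -> str:
    """Return 'discard', 'keep', or 'ambiguous' for a commit message.

    Negative-only messages are discarded, negative messages that also carry a
    performance keyword are kept, and everything else is deferred to the
    LLM-based intent classifier of §B.1.3."""
    has_neg = bool(NEGATIVE.search(message))
    has_pos = bool(POSITIVE.search(message))
    if has_neg and not has_pos:
        return "discard"
    if has_pos:
        return "keep"
    return "ambiguous"  # retained for LLM classification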
We parse the pull request body and comments to identify linked issues (e.g., #123, owner/repo#123). These references are resolved to their full issue descriptions and discussions, which are also parsed and aggregated into the problem statement. We only include information that was available before or at the time the pull request was created, to ensure that the problem statement is self-contained.

Figure 6: Distribution of tasks across repositories in FORMULACODE as of November 2025. FORMULACODE comprises 957 tasks sampled from 70 diverse open-source GitHub repositories; the largest contributors are pandas-dev/pandas (222 tasks), scikit-learn/scikit-learn (143), qiskit/qiskit (142), xdslproject/xdsl (134), and optuna/optuna (94). Most repositories are software tools used extensively within scientific communities. FORMULACODE shows a strong long-tail pattern of bespoke repositories that are rarely covered in contemporary code-generation datasets. Table 16 presents a detailed overview.

Context Filtering. Before attempting extraction, we enforce a strict validity check: a pull request must have at least one linked issue or a descriptive body. The rationale for this constraint is twofold. First, the linked issue typically provides the problem context (the bug report, performance regression analysis, or feature request) that motivated the change. Second, a descriptive pull request provides details of the problem solved, the methodology used, and the solution, which can be helpful for computing metadata for the benchmark task as well as clarifying the overall task goal.

Context Extraction. We consolidate all linked issues into a single document using a static template (shown in Figure 8). In principle, the issue text alone should sufficiently describe the initially observed performance regression or bottleneck. In practice, however, while an issue reports the initially observed regression or bottleneck, issues frequently bundle multiple optimization directions that are implemented across several pull requests. As a result, a problem statement derived only from the issue can under-specify the task's starting state, leading to an ambiguous task (an agent may optimize a different aspect than the original change). To ensure that each problem statement provides a clear and self-contained description of the problem, we use another specialized LLM-based classifier to extract the relevant problem context from the pull request description. We instruct the agent to extract near-verbatim sentences corresponding to the performance goal and the constraints relevant to the PR. Each extracted sentence is symbolically verified to maintain a high degree of textual fidelity (a high longest-common-subsequence ratio) to the source, preserving technical terms, error messages, and code snippets; a minimal sketch of this check is shown below.
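A minimal sketch of such a fidelity check is given below, using Python's standard-library difflib as a stand-in for an exact longest-common-subsequence computation; the 0.9 threshold is illustrative rather than the value used in the pipeline.

from difflib import SequenceMatcher

def is_near_verbatim(extracted: str, source_text: str, threshold: float = 0.9) -> bool:
    """Reject extracted sentences that drift too far from the source text.

    The sum of matching-block lengths from difflib approximates the length of
    the longest common subsequence between the extracted sentence and the
    source; dividing by the sentence length gives a fidelity ratio."""
    matcher = SequenceMatcher(None, extracted, source_text, autojunk=False)
    lcs_like = sum(block.size for block in matcher.get_matching_blocks())
    return lcs_like / max(len(extracted), 1) >= threshold

Sentences that fail this check (paraphrases, summaries, or hallucinated details) are dropped, so the problem statement stays anchored to text the PR author actually wrote.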
Any pull request that fails to yield a valid problem context is discarded, as it lacks a defined starting state for the benchmark. This LLM-based extraction agent is implemented using DSPy, and the prompt is shown in Figure 13.

Examples. Figure 8 shows problem statements for some FORMULACODE tasks. Each problem statement has an initial set of static instructions, information about the problem extracted from the linked issues, and the initial direction of optimization extracted from the pull request description.

The problem statement construction (§B.1.4) and the performance intent filtering (§B.1.3) stages are applied together to yield 3181 problems.

B.1.5. SYNTHESIZING REPRODUCIBLE ENVIRONMENTS

Motivation. A critical challenge in benchmarking historical commits is that the build environment (dependencies, compilers, and system libraries) is often implicit and evolves over time. Simply installing the package via pip is insufficient for performance benchmarking for two main reasons. First, performance-critical Python packages often rely on compiled extensions (C/C++, Cython, Fortran) that must be built from source to accurately reflect the performance characteristics of the code at that specific commit. Installing pre-built binaries (wheels) would benchmark the packaged version rather than the code in the pull request. Second, developers often introduce bespoke dependencies or modify build configurations in a pull request, rendering previous environments obsolete. To address this, we implement an agentic pipeline to synthesize a reproducible Docker environment for each task.

Setup. For each task, we first construct a Docker container with the base dependencies installed (see the 'Build Environment' item in §B.1.2), containing the source code of the repository at the initial state of the pull request. Our goal is to synthesize a build script that contains shell commands to install an editable version of the package from source. We also want to ensure that certain tools (ASV, pytest, and our snapshot-testing tool) can be successfully run in the container.

Agent. We employ an iterative, reflexive agent to synthesize a valid build script. The agent is described in Figure 10 and has four principal components:

Validation & Feedback Loop: The synthesized script is executed in an isolated Docker container. We validate the build using two verification subroutines. (1) A profile check ensures that the package is importable and runnable, and that we can run the ASV benchmarks under a generous timeout. (2) A pytest check ensures that we can run the pytest test suite without errors. If the build or validation fails, the stderr and stdout logs are fed back to the agent as observations, allowing it to iteratively refine the script (e.g., installing missing system libraries, fixing syntax errors).

Chronological Retrieval: We leverage the insight that build requirements rarely change drastically between adjacent commits. For a given task, we sample 10 successful build scripts from the same repository, sourced from a database of successfully built tasks, sorted by commit date. We first attempt to build the container using the script from the nearest chronological neighbor. If the build or verification fails, we move to the next neighbor until we either find a successful build or run out of neighbors. The failure logs are preserved and used as observations for the agent.

Agentic Synthesis: If the retrieved scripts fail (or no history exists), we instantiate an LLM-based agent to generate a new build script.
The agent acts as an interactive planner with access to the failure logs and a set of tools that allow it to inspect the repository state (e.g., list directories, read files, parse setup.py or pyproject.toml). Given 10 interactive turns, the model can either choose to use one of the tools or end the turns early by synthesizing a build script. The largest models we tried (Claude Sonnet 3.5 and GPT OSS 120b) rarely choose to use tools, as the error messages provide sufficient context, while the smallest model (Meta Llama 3.3 8B; AI@Meta, 2024) often uses many tool interactions before synthesizing the build script.

LLM Choice and Prompt Design. We find that a locally hosted openai/gpt-oss-120b provides the best balance of performance and cost. We also implement a fallback to anthropic/claude-3-5-sonnet-20241022 if the build-script synthesis fails after multiple tries. Overall, the chronological caching and the local-LLM cascade allow us to successfully synthesize build scripts for 1232 of the 3181 candidate PRs, spanning 75 repositories, at a total cost of $446. This process yields 1232 reproducible containers out of the 3181 candidate PRs. We elected to stop the synthesis prematurely due to limited resources; with more resources, we expect the number of reproducible containers to increase substantially.

B.1.6. STATISTICAL TESTING AND ROBUSTNESS

Finally, we must ensure that every retained task reflects a statistically significant and reproducible performance change. Because timing measurements are inherently noisy (e.g., due to OS scheduling, background load, and CPU power management), we adopt the statistical significance validation procedure used by ASV to verify that the observed differences between two code states are significant under repeated measurement on commodity hardware.

Measurement protocol. All experiments are run on an AWS EC2 instance, specified in §B.2.5, to ensure hardware isolation. For each candidate pull request, we execute the expert-selected workloads $\{w_1, \ldots, w_n\}$ on both the baseline codebase $\mathrm{Code}_0$ and the human-optimized codebase $\mathrm{Code}^{*}_{\mathrm{expert}}$ on the same instance. For each workload $w_i$, ASV repeatedly evaluates the benchmark under a warm-up and multi-sample timing protocol (with interleaved rounds when enabled), yielding independent sample sets of observed runtimes for the baseline and human-edited codebases:

$$X_i = \{x_{i1}, \ldots, x_{im}\} \;\text{from}\; w_i(\mathrm{Code}_0), \qquad Y_i = \{y_{i1}, \ldots, y_{ik}\} \;\text{from}\; w_i(\mathrm{Code}^{*}_{\mathrm{expert}}),$$

where $X_i$ and $Y_i$ denote the sets of measurements for workload $w_i$ from the baseline and human-edited code, respectively. We preserve ASV's default sampling parameters (unless a repository overrides them via workload-specific attributes), so that the resulting statistical decision procedure matches common practice in the Python benchmarking ecosystem.

Mann–Whitney U test. To test whether $\mathrm{Code}_0$ and $\mathrm{Code}^{*}_{\mathrm{expert}}$ exhibit different performance distributions for a workload, we use the Mann–Whitney U test (Mann & Whitney, 1947), a non-parametric two-sample test based on rank ordering. Formally, for samples $X_i$ and $Y_i$, the U statistic can be written as

$$U(X_i, Y_i) = \sum_{a=1}^{m} \sum_{b=1}^{k} \Big( \mathbb{I}[x_{ia} > y_{ib}] + \tfrac{1}{2}\, \mathbb{I}[x_{ia} = y_{ib}] \Big),$$

and the associated two-sided p-value quantifies evidence against the null hypothesis.

Null hypothesis. For each workload $w_i$, we test $H_0$: $X_i$ and $Y_i$ are drawn from the same underlying distribution (i.e., the patch does not induce a statistically detectable change in the benchmark outcome), against the two-sided alternative that the distributions differ.
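For illustration, the per-workload decision can be sketched with SciPy's implementation of the Mann–Whitney U test. The p < 0.002 threshold mirrors ASV's stringent default described in the Implementation paragraph that follows; the confidence-interval fallback that ASV applies for small sample sizes is omitted from this sketch.

from scipy.stats import mannwhitneyu

def workload_is_significant(baseline_samples, optimized_samples,
                            p_threshold: float = 0.002) -> bool:
    """Two-sided Mann-Whitney U test between baseline and optimized runtimes
    for a single workload.  Returns True when H0 (same distribution) is
    rejected at the stringent threshold."""
    _, p_value = mannwhitneyu(baseline_samples, optimized_samples,
                              alternative="two-sided")
    return p_value < p_threshold

# Example usage on two lists of timing samples (seconds) for one workload:
# workload_is_significant([1.02, 1.01, 0.99, 1.00], [0.80, 0.82, 0.79, 0.81])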
We only consider workloads that reject $H_0$.

Implementation. In practice, ASV applies a conservative two-stage decision rule. When sufficient raw samples are available, it applies the Mann–Whitney U test and declares a difference only if the resulting p-value is below a stringent threshold (default p < 0.002). If the sample sizes are too small for the U test to ever reach this threshold (given the discrete nature of the test), ASV falls back to a pessimistic check based on uncertainty estimates: it computes a 99% confidence interval for each sample distribution and only declares a difference when these intervals do not overlap. This fallback biases towards not claiming a difference unless the separation is unambiguous.

Dataset Inclusion Criterion. We discard candidate tasks for which no workload exhibits a statistically significant change between $\mathrm{Code}_0$ and $\mathrm{Code}_{\mathrm{expert}}$ under this rule. This ensures that every retained task in FORMULACODE corresponds to a clear, reproducible, and statistically supported performance difference. Tasks with no positive significant workloads are also discarded. This yields the final 957 problems used in FORMULACODE. The 108 problems in the FORMULACODE-V subset are sampled from the best-performing tasks in FORMULACODE.

B.1.7. DATASET COMPOSITION STATISTICS

To better study the characteristics of FORMULACODE, we develop an automated classifier that attempts to infer the kind of optimization based on a curated taxonomy (§B.2.3). The classifier is similar to the one introduced in §B.1.3. It takes as input a sample pull request along with the expert-written patch and attempts to categorize the human-written solution using the manually curated taxonomy (Table 9). Such a methodology allows us to efficiently and scalably study the composition of a continuously growing set of problems. The prompts for this classifier are presented in Figure 9 and an example is presented in Figure 11. The distribution of the types of optimizations is presented in Table 7 and the distribution of the inferred difficulty is presented in Table 8. Importantly, the distribution of optimization problems and difficulty changes only marginally between FORMULACODE and FORMULACODE-V.

B.2. Experiment Details

In this section, we provide additional details on the methodology used to evaluate agents on FORMULACODE. All experiments ran on a single Ubuntu 22.04 LTS machine with 503 GiB RAM, an Intel Xeon Platinum 8352Y CPU @ 2.20 GHz (128 hardware threads), and 4 NVIDIA A40 GPUs (46 GiB VRAM each). Making the dataset from scratch takes ∼32 hours, consuming ∼100 GB of disk space for the metadata and ∼2 TB of disk space for the Docker image cache. Our evaluation protocol is grounded in Terminal-Bench (Merrill et al., 2026). Unless explicitly indicated otherwise, all experiments use the default hyperparameters defined by Terminal-Bench.

1 INPUT SIGNATURE 2 3 problem_description : string 4 Problem statement and technical context from PR/issue. 5 6 git_patch : string 7 Git diff showing actual code changes. 8 9 file_change_summary : string 10 A markdown table summarizing all the files changed in the commit along with lines added/removed. 11 12 CLASSIFIER MODULE 13 14 Decide if this commit’s PRIMARY intent is to improve product/runtime performance.
15 Label YES only when there is CLEAR, EXPLICIT evidence in the description and/or patch that the runtime gets faster (e.g., algorithm change, fewer allocations, caching, vectorization, reduced I/O, async/non-blocking for throughput, latency reduction, memory footprint reduction, fix a speed regression). 16 17 Strong positive signals (weigh these collectively): 18• PR title/body contains performance intent (e.g., “PERF:”, “speed up”, “faster”, “performance”). 19• Linked issues/comments include benchmark links or timings demonstrating impact. 20• Low-level/hot-path tweaks (e.g., reuse global context, avoid per-call init/teardown, vectorize C/NumPy). 21 22 Hard NO (non-performance) examples: 23 tests/ASV/harness-only changes; CI/workflows/build/packaging; coverage; pre-commit/format/lints (clippy/ruff/black); docs; version bumps; terminology/renames; pure refactors without performance claims; changes aimed at making perf tests pass but not improving runtime. 24 25 If ambiguous, weigh the concrete code changes and problem description together. 26 When there are specific performance cues (title keywords, measured timings, fewer allocations, vectorization, caching/reuse) lean YES; otherwise NO. 27 28 OUTPUT SIGNATURE 29 30 reasoning : string 31 Deductive reasoning steps leading to the classification. 32 33 label : string 34 Final label: “YES” for performance-related, “NO” otherwise. 35 Figure 7: Prompt template used by the LLM-based performance intent classifier described in B.1.3. The prompt defines the input signature (problem description, git patch, and file change summary), the classifier module specifying decision criteria for identifying performance-motivated commits, and the output signature producing a reasoning trace and binary label (“YES”/“NO”). 20 1 Example PR 2 3 CLASSIFIER INPUT 4 5 problem_description : string 6 Labels: performance; Description: Fixes #14471. 7 Body: The new ParameterExpression.bind_all is a fast path for producing a numeric result. This has advantages over ParameterExpression.bind: 8• Far fewer Python objects are allocated, since no new ParameterExpression objects need to be constructed and the output is guaranteed to be numeric. 9• There is no historical API requirement to scan the incoming mapping for invalid keys or values, yielding a large performance improvement when the same mapping is used to bind many expressions. 10• This provides a major complexity improvement when a large values dictionary is reused many times. 11 There is still room for further gains because the Rust-space ParameterExpression and SymbolExpr interfaces require more heap allocations than strictly necessary, but this already yields substantial speedups. 12 Issues: Fixes #14471. 13 The linked issue reports that ParameterExpression.bind scales with the size of the binding dictionary even when only a single parameter is needed, leading to severe performance penalties for large parameter tables. 14 Comments: 15 Currently in draft because there’s no tests - I’m just putting it up so Sam and Ian from #14471 can test it out for their use case. For the explicit example in that issue, a complete comparison on my machine: 16 <details><summary>Out of date timings</summary> 17 In [1]: from qiskit.circuit import Parameter, ParameterExpression 18 N: int = 100_000 19 parameter_values = Parameter(f"th_i"): 1 for i in range(N) 20 parameter_values[param := Parameter("my_param")] = 1 21 . . . <TRUNCATED> 22 I think it’s fine without having the same behavior. 
For clarity it might be helpful to add a blurb to the bind_all docstring to say that “unlike bind, NaN and inf are in the range of expected outputs for this method”. 23 LGTM, thanks! 24 25 git_patch : string 26 diff –git a/crates/circuit/src/parameter/parameter_expression.rs b/crates/circuit/src/parameter/parameter_expression.rs 27 index 1f0406f62c7e..98da2e3e9e6 100644 28 — a/crates/circuit/src/parameter/parameter_expression.rs 29 +++ b/crates/circuit/src/parameter/parameter_expression.rs 30 @@ -1048,6 +1048,40 @@ impl PyParameterExpression 31 + ##[pyo3(name = "bind_all")] 32 + ##[pyo3(signature = (values, *))] 33 + pub fn py_bind_all(&self, values: Bound<PyAny>) -> PyResult<Value> 34 + let mut partial_map = HashMap::with_capacity(self.inner.name_map.len()); 35 + for symbol in self.inner.name_map.values() 36 + let py_parameter = symbol.clone().into_pyobject(values.py())?; 37 + partial_map.insert(symbol, values.get_item(py_parameter)?.extract()?); 38 + 39 + let bound = self.inner.expr.bind(&partial_map); 40 ... <Truncated> 41 42 file_change_summary : string 43 FileLines AddedLines RemovedTotal Changes crates/circuit/src/parameter/parameter_expression.rs43043 crates/circuit/src/parameter/symbol_expr.rs112 releasenotes/notes/parameter-bind-all-22337586429509f9.yaml808 test/python/circuit/test_parameters.py18018 44 45 CLASSIFIER MODULE 46 47 Decide if this commit’s PRIMARY intent is to improve product/runtime performance. 48 Label YES only when. . . 49 . . . there are specific performance cues (title keywords, measured timings, fewer allocations, vectorization, caching/reuse) lean YES; otherwise NO. 50 51 CLASSIFIER OUTPUT 52 53 reasoning : string 54 The pull request introduces a new fast-path method ParameterExpression.bind_all (and Parameter.bind_all) to bind parameters to numeric values with far fewer allocations and reduced overhead than the existing bind method. The PR explicitly emphasizes performance, using terms such as “fast path” and “huge performance gain”, and reports timing improvements from tens of milliseconds to microsecond-scale execution. 55 The implementation avoids constructing new ParameterExpression objects and skips expensive validation logic, directly producing numeric results. It is labeled with the performance tag and documented as a fast-path method for performance-sensitive use cases, making the primary intent of the change to reduce computational overhead and improve runtime performance of parameter binding. 56 Therefore, the commit’s primary purpose is to make parameter binding substantially faster. 57 58 label : string 59 ’YES’ 60 Figure 8: Example of performance intent classification for a real pull request(https://github.com/Qiskit/qiskit/ pull/14782) , illustrating how the prompt in Fig. 7 is instantiated and how the classifier produces a reasoning trace and binary label. 21 1 Performance type & difficulty classifier 2 3 INPUT SIGNATURE PROMPT 4 5 problem_description : string 6 Problem statement and technical context from PR/issue. 7 8 git_patch : string 9 Git diff showing actual code changes. 10 11 CLASSIFIER MODULE PROMPT 12 13 Decide the PRIMARY performance optimization technique and the difficulty level of the optimization. 
14 15 Category mapping (when performance-related): 16• Algorithm improvements: complexity reduction; switching to faster algorithms → use_better_algorithm 17• Data structures / layout: sets, maps, indices; memory layout tuning → use_better_data_structure_and_layout 18• System-level: C/Rust/NumPy/Vectorized/Native extensions → use_lower_level_system 19• Approximation / heuristics: trade accuracy for speed → accept_less_precise_solution 20• Parallelization: threads, processes, parallel algorithms (not just async I/O) → use_parallelization 21• Cache & reuse: memoization, LRU, materialized results → cache_and_reuse 22• Scheduling: batching, lazy execution, throttling → do_it_earlier_batch_throttle 23• Database / storage: indices, query tuning, partitioning → database_and_storage_tuning 24• Micro-optimizations: hot-path tweaks, guards, inlining → micro_optimizations 25• I/O / latency hiding: async or non-blocking I/O, overlap I/O and compute → io_and_latency_hiding 26• Higher-level systems: using optimized libraries or frameworks → use_higher_level_system 27• Uncategorized: performance-related but does not fit the above categories → uncategorized 28 29 Difficulty (when performance-related): 30• easy: localized change (< 50 lines), minimal risk 31• medium: module-level refactor, data structure changes 32• hard: algorithm rewrite or architectural change 33 34 OUTPUT SIGNATURE PROMPT 35 36 category : OptimizationType 37 The classified optimization category. 38 39 difficulty : DifficultyLevel 40 The difficulty level of the optimization. 41 42 reasoning : string 43 Brief explanation of the classification. 44 Figure 9: Prompt template used by the LLM-based classifier for assigning each performance task an optimization category and difficulty level ( B.2.3). The prompt defines the input signature, a taxonomy-driven classification module that maps code changes to optimization types, and an output schema that produces the predicted category, difficulty, and a brief reasoning trace. 22 1 INPUT SIGNATURE 2 3 owner_repo : string 4 The repository this commit belongs to (e.g., scikit-learn/scikit-learn). 5 6 sha : string 7 The commit SHA that is currently checked out. 8 9 commit_date : string 10 The commit date in ISO format (e.g., 2023-10-05T12:34:56Z). 11 12 stderr_logs : string 13 Most recent stderr logs from the last build attempt (up to ∼8k tail-end characters). 14 15 stdout_logs : string 16 Most recent stdout logs from the last build attempt (up to ∼8k tail-end characters). 17 18 failure_more : string 19 Describes where the failure occurred (e.g., N/A, build failed, asv run failed). 20 21 last_docker_build_script : string 22 The previously generated docker_build.sh script. 23 24 repo_facts_json : string 25 JSON object containing inferred repository facts (paths, package names, versions, etc.). 26 27 toolbelt : string 28 Human-readable summary of available tools and their usage. 29 30 messages_log : string 31 Transcript of prior tool calls, actions, and observations. 32 33 BUILD AGENT MODULE 34 35 An interactive planner for producing a docker_build.sh bash script that builds and installs a Python repository inside micromamba environments. The agent may either: (A) Request a tool call with structured JSON arguments, or (B) Output the final executable build script. 36 If a tool is required, set next_action to one of: probe_repo | list_tree | read_file | try_import | none. 
37 38 Tool call formats: 39• read_file: "path": "...", "max_bytes": 65536 40• list_tree: "depth": 2 41• try_import: "candidates": ["foo","bar"] 42 Return docker_build_script only when fully satisfied with correctness and completeness. 43 Critical constraints on the generated script: 44• Must be idempotent and safe to run inside Docker. 45• Fully non-interactive; no user prompts. 46• Must be valid executable Bash with no syntax errors. 47• Must use real newline characters (not escaped ). 48• Must not output literal . 49 Post-install readiness requirements: 50• After editable install, the environment must be immediately usable. 51• A lightweight profiling sanity check and a lightweight pytest sanity check must start without immediate errors, even for projects that require execution from subdirectories. 52• Test/benchmark extras and optional dependencies must be installed as needed for import and test discovery to succeed. 53 54 OUTPUT SIGNATURE 55 56 thought : string 57 Brief rationale describing the current decision or plan. 58 59 next_action : string 60 One of probe_repo, list_tree, read_file, try_import, none, or finish. 61 62 action_input : string 63 JSON arguments for the selected tool, or empty if no tool is called. 64 65 error_summary : string 66 Brief summary of the most recent build failure and its possible causes. 67 68 resolution_steps : string 69 Concrete steps required to resolve the failure. 70 71 docker_build_script : string 72 Final executable docker_build.sh script that successfully builds and installs the project from source. Figure 10: Prompt structure for the docker build agent ( B.1.5), defining its input state, tool-calling interface, constraints, and executable script output. 23 1 Example PR 2 3 CLASSIFIER INPUT 4 5 problem_description : string 6 Labels: performance; Description: Fixes #14471. 7 Body: The new ParameterExpression.bind_all is a fast path for producing a numeric result. This has advantages over ParameterExpression.bind: 8• Far fewer Python objects are allocated, since no new ParameterExpression objects need to be constructed and the output is guaranteed to be numeric. 9• There is no historical API requirement to scan the incoming mapping for invalid keys or values, yielding a large performance improvement when the same mapping is used to bind many expressions. 10• This provides a major complexity improvement when a large values dictionary is reused many times. 11 There is still room for further gains because the Rust-space ParameterExpression and SymbolExpr interfaces require more heap allocations than strictly necessary, but this already yields substantial speedups. 12 Issues: Fixes #14471. 13 The linked issue reports that ParameterExpression.bind scales with the size of the binding dictionary even when only a single parameter is needed, leading to severe performance penalties for large parameter tables. 14 Comments: 15 Currently in draft because there’s no tests - I’m just putting it up so Sam and Ian from #14471 can test it out for their use case. For the explicit example in that issue, a complete comparison on my machine: 16 <details><summary>Out of date timings</summary> 17 In [1]: from qiskit.circuit import Parameter, ParameterExpression 18 N: int = 100_000 19 parameter_values = Parameter(f"th_i"): 1 for i in range(N) 20 parameter_values[param := Parameter("my_param")] = 1 21 print("Using the specialised ‘Parameter‘ methods:") 22 %timeit param.bind(parameter_values, allow_unknown_parameters=True) 23 </details> . . . 
<TRUNCATED> 24 I think it’s fine without having the same behavior. For clarity it might be helpful to add a blurb to the bind_all docstring to say that “unlike bind, NaN and inf are in the range of expected outputs for this method”. 25 LGTM, thanks! 26 27 git_patch : string 28 diff –git a/crates/circuit/src/parameter/parameter_expression.rs b/crates/circuit/src/parameter/parameter_expression.rs 29 index 1f0406f62c7e..98da2e3e9e6 100644 30 — a/crates/circuit/src/parameter/parameter_expression.rs 31 +++ b/crates/circuit/src/parameter/parameter_expression.rs 32 @@ -1048,6 +1048,40 @@ impl PyParameterExpression 33 + ##[pyo3(name = "bind_all")] 34 + ##[pyo3(signature = (values, *))] 35 + pub fn py_bind_all(&self, values: Bound<PyAny>) -> PyResult<Value> 36 + let mut partial_map = HashMap::with_capacity(self.inner.name_map.len()); 37 + for symbol in self.inner.name_map.values() 38 + let py_parameter = symbol.clone().into_pyobject(values.py())?; 39 + partial_map.insert(symbol, values.get_item(py_parameter)?.extract()?); 40 + 41 + let bound = self.inner.expr.bind(&partial_map); 42 + bound.eval(true).ok_or_else(|| 43 + PyTypeError::new_err(format!("binding did not produce a numeric quantity: bound:?")) 44 + ) 45 + 46 ... <Truncated> 47 48 CLASSIFIER MODULE 49 50 Decide the PRIMARY performance optimization technique and the difficulty level of the optimization. 51 Category mapping (when performance-related): . . . 52 . . . algorithm rewrite or architectural change. 53 54 CLASSIFIER OUTPUT 55 56 category : OptimizationType 57 REMOVE_OR_REDUCE_WORK 58 59 difficulty : DifficultyLevel 60 ’medium’ 61 62 reasoning : string 63 The change adds a specialized ’bind_all’ fast-path that skips the heavy allocation and validation logic of the generic ’bind’ method, directly mapping parameters to numeric values and evaluating the expression. This reduces the amount of work performed per binding call, yielding large speedups. The implementation touches Rust code and adds tests but is a localized, moderate-size change, fitting a medium difficulty level. 64 Figure 11: Example application of the optimization type and difficulty classifier (Figure 9), illustrating the predicted category, difficulty level, and reasoning for a real pull request (https://github.com/Qiskit/qiskit/pull/14782) 24 1 Judge performance related PR prompt 2 3 INPUT SIGNATURE PROMPT 4 5 problem_description : string 6 Problem statement and technical context from PR/issue. 7 8 git_patch : string 9 Git diff showing actual code changes. 10 11 file_change_summary : string 12 A markdown table summarizing all the files changed in the commit along with lines added/removed. 13 14 JUDGE SIGNATURE PROMPT 15 16 Decide if this commit’s PRIMARY intent is to improve product/runtime performance. 17 18 Label YES only when there is CLEAR, EXPLICIT evidence in the description and/or patch that the runtime gets faster (e.g., algorithm change, fewer allocations, caching, vectorization, reduced I/O, async/non-blocking for throughput, latency reduction, memory footprint reduction, fix a speed regression). 19 20 Strong positive signals (weigh these collectively): 21 - PR title/body contains performance intent (e.g., "PERF:", "speed up", "faster", "performance"). 22 - Linked issues/comments include benchmark links or timings demonstrating impact. 23 - Low-level/hot-path tweaks (e.g., reuse global context, avoid per-call init/teardown, vectorize C/NumPy). 
24 25 Hard NO (non-performance) examples: tests/ASV/harness-only changes; CI/workflows/build/packaging; coverage; pre-commit/format/lints (clippy/ruff/black); docs; version bumps; terminology/renames; pure refactors without performance claims; changes aimed at making perf tests pass but not improving runtime. 26 27 If ambiguous, weigh the concrete code changes and problem description together. When there are specific performance cues (title keywords, measured timings, fewer allocations, vectorization, caching/reuse) lean YES; otherwise NO. 28 29 OUTPUT SIGNATURE PROMPT 30 31 reasoning : string 32 Deductive reasoning steps leading to the classification. 33 34 label : string 35 Final label: "YES" for performance-related, "NO" otherwise.’ 36 Figure 12: Structured DSPy prompt used to judge whether a pull request is primarily intended to improve runtime or product performance. The prompt specifies the required inputs (problem description, code diff, and file-level change summary), explicit decision criteria and exclusions for performance-related changes, and an output format consisting of a justification and a binary YES/NO label. The design emphasizes conservative, evidence-based classification, prioritizing explicit runtime improvements over incidental or refactoring-only changes. 25 1 Problem Extractor Prompt description 2 3 INPUT SIGNATURE PROMPT 4 5 pr_title : string 6 The GitHub PR title 7 8 pr_body : string 9 The GitHub PR description 10 11 pr_comments : string 12 Comments on the PR thread. 13 14 PROBLEM EXTRACTOR SIGNATURE 15 16 What problem is this Github PR trying to solve? Extract near-verbatim relevant text following the given JSON output. If no relevant context exists for a field, return an empty string for it. 17 18 OUTPUT SIGNATURE PROMPT 19 20 initial_observations: string | list[Any] | None 21 Objective symptoms of the problematic behavior, described in the present tense. Focus strictly on what is happening (metrics, user impact, frequency). Do not include causes, hypotheses, or explanations. 22 23 triage_attempts: string | list[Any] | None 24 The investigative steps and reasoning used to narrow down contributing factors—what you checked, what you ruled out, and what evidence you gathered to understand where the issue originates. 25 26 solution_overview: string | list[Any] | None 27 A concise description of the change(s) made and how they address the identified bottleneck or constraint. 28 29 solution_observations: string | list[Any] | None 30 What you observe after applying the change—new measurements, behavior differences, and any regressions or trade-offs that appeared. 31 Figure 13: Structured DSPy prompt used to extract the underlying problem and resolution context from a GitHub pull request. The prompt consumes the PR title, description, and discussion, and produces a structured summary capturing observed symptoms, triage steps, the implemented solution, and post-change observations. The design emphasizes near-verbatim extraction and separation of observations, investigation, and outcomes. 26 Table 7: Patch classification distribution in FORMULACODE and FORMULACODE-V. The problems in FORMULACODE-V are sampled from the best performing tasks in FORMULACODE which is why some categories are overrepresented. 
Inferred Type of Optimization Problem      % FORMULACODE    % FORMULACODE-V
Accept Less Precise Solution               0.6584           -
Cache And Reuse                            8.3128           4.6296
Database And Storage Tuning                0.5761           -
Do It Earlier Batch Throttle               2.4691           0.9259
Io And Latency Hiding                      0.0823           -
Micro Optimizations                        20.2469          23.1481
Remove Or Reduce Work                      20.0823          18.5185
Uncategorized                              1.5638           -
Use Better Algorithm                       20.0823          26.8519
Use Better Data Structure And Layout       9.7119           12.9630
Use Higher Level System                    2.9630           2.7778
Use Lower Level System                     11.0288          9.2593
Use Parallelization                        2.2222           0.9259

Table 8: The inferred difficulty of human solutions in FORMULACODE and FORMULACODE-V.

Inferred Difficulty    % FORMULACODE    % FORMULACODE-V
Easy                   54.8971          60.1852
Medium                 44.4444          37.0370
Hard                   0.6584           2.7778

B.2.1. AIRSPEED VELOCITY METHODOLOGY

To benchmark a new function with Airspeed Velocity, a developer supplies a setup(...) routine and one or more time-profiling functions (e.g., time_foo(...), time_bar(...)) and memory-profiling functions (e.g., mem_foo(...), mem_bar(...)). asv then clones the repository, creates an isolated virtual environment, and records the performance characteristics for all commits. The tool ships with best-practice safeguards (CPU affinity, warm-ups, repeated trials, etc.) to control system variance. Section 2 includes additional safeguards to further minimize system variance. Airspeed Velocity offers many advantages towards our goal of making a benchmark for code optimization:

• Low barrier to entry. The minimalist interface means developers routinely add new benchmarks, expanding coverage over time. asv ships with robust regression-detection functionality, which further motivates developers to ensure that the asv benchmarks maximally cover all performance-critical parts of their software.

• Maturity and reliability. First released on 1 May 2015, asv encapsulates nearly a decade of community experience in timing and memory-profiling code on commodity hardware. Most common pitfalls have documented solutions and well-established platform-specific best practices, ensuring results are both accurate and precise.

• CI integration. asv co-exists naturally with other continuous-integration tools, so each commit carries both performance and correctness metadata.

B.2.2. MODEL AND AGENT CHOICES

Models. Our experimental design centers on four models – GPT-5, Claude 4.0 Sonnet, Gemini 2.5 Pro, and Qwen 3 Coder – that represent the strongest generally available systems for coding and tool-use workloads at the time of writing. We selected these models because they are natively integrated with our inference provider and support long context windows, function calling, and multi-turn interactions at a cost profile compatible with large-scale benchmarking. We treat these models as representative of the frontier capability regime against which different agent architectures can be fairly compared.

1. GPT-5. GPT-5 (Singh et al., 2025) is OpenAI's flagship general-purpose model in this study, and we use the standard API configuration with built-in "thinking" enabled. It is a multimodal, tool-using model with strong performance on code, math, and long-context reasoning benchmarks, and is widely deployed in agentic coding systems. We use the gpt-5-2025-08-07 version specifically, with a documented knowledge cutoff of late September 2024.

2. Claude 4.0 Sonnet. Claude 4.0 Sonnet (Anthropic, 2025) is Anthropic's top-end general-purpose model at the time of
our experiments, designed for complex reasoning, long-form generation, and tool-heavy workloads such as software development. Public reports place Claude 4.0 Sonnet at or near the frontier on a wide range of coding and reasoning benchmarks. We use the claude-sonnet-4-20250514 version specifically, with a documented knowledge cutoff date of January 2025 and training data extending to March 2025.

Figure 14: Overview of the pipeline for Docker environment synthesis. The system reuses chronologically adjacent build scripts when possible, otherwise invoking an LLM agent that generates and refines Docker scripts using build logs and repository context until a verifier confirms a successful, reproducible build.

3. Gemini 2.5 Pro. Gemini 2.5 Pro (Comanici et al., 2025) is Google DeepMind's latest high-end model at the time of writing, introduced as the first member of the Gemini 2 series and optimized for complex multimodal reasoning. It offers a very large context window (up to 1M tokens in the preview configuration) and supports advanced tool-calling and code execution. It has a documented knowledge cutoff date of January 2025. We include Gemini 2.5 Pro to ensure that our agentic analysis covers three distinct provider ecosystems under comparable frontier-model conditions.

4. Qwen 3 Coder. Qwen 3 Coder is a large open Mixture-of-Experts model explicitly optimized for agentic coding tasks rather than general conversation. Qwen 3 Coder (in particular, the qwen3-coder-480b-a35b-instruct model) combines 480B total parameters with sparse expert activation (35B active parameters per forward pass) and a context window of roughly 262k tokens, enabling it to reason over entire repositories and multi-file refactors in a single pass. Third-party model cards list a knowledge cutoff of 23 January 2025 (LangDB, 2025). Empirically, Qwen 3 Coder claims strong results on SWE-Bench and related agentic coding and browser-use benchmarks (Yang et al., 2025).

Agents. We evaluate two agent frameworks within FORMULACODE: Terminus 2, the default harness for Terminal-Bench, and an agent implemented with OpenHands, a popular open-source framework for AI-driven software development. We intentionally omit more complex agent families such as tree-structured search agents and evolutionary or population-based methods. Tree agents that branch over alternative command sequences must maintain multiple snapshots of the terminal state, which quickly leads to exponential blowup in cloud compute usage. Evolutionary agents that track a Pareto frontier across many workloads are similarly expensive: given that the median FORMULACODE task exposes roughly 81 workloads, the number of candidate solutions required to reasonably explore the frontier is beyond our evaluation budget.

1. Terminus 2. Terminus 2 is a reference agent for Terminal-Bench (Merrill et al., 2026). It is intentionally minimal: the agent spawns a single tmux session and exposes the raw shell to the model, which issues commands as plain text and receives the terminal output verbatim, without additional structured tools or high-level abstractions. This architecture can be viewed as a reflexive, single-trajectory agent that repeatedly observes the current terminal state, updates its internal plan implicitly in the model's hidden state, and emits the next command.
Despite its simplicity, Terminus 2 is competitive with more elaborate systems, making it a natural baseline for FORMULACODE.
2. OpenHands. OpenHands is a widely used open-source framework for AI-driven software development (Wang et al., 2025). OpenHands exposes a flexible SDK that allows defining agents as compositions of tools and routines that can clone repositories, edit files, run tests, and manage long-running coding sessions, with support for swapping out the underlying LLM. In our experiments, we use a single-trajectory terminal-plus-editor agent implemented in the OpenHands SDK, following the default configuration used in Terminal-Bench (Merrill et al., 2026).

Figure 15: Timeline of FORMULACODE tasks organized by the date the expert patch was merged, through November 2025. Each box represents the number of expert-patch tasks merged during a particular month/year. FORMULACODE is updated on the 31st of each month, and our most recent task is from 2025-11-21. The dataset grows by 20.25 tasks per month on average, facilitating contamination analyses for performance-optimization agents. Table 16 presents a detailed overview.

Table 9: Optimization categories used to categorize human solutions in FORMULACODE. The taxonomy is derived from various online sources, listed in the primary references for each category.

Category Abbreviation | Category Description | Source
Algo | Use a better algorithm | (Tratt, 2023)
Data | Use a better data structure (and layout) | (Tratt, 2023)
Lower | Use a lower-level system | (Tratt, 2023)
Approx | Accept a less-precise solution (approximation/heuristics) | (Tratt, 2023)
Parallel | Use parallelization | (Tratt, 2025)
Reduce | Remove or reduce work (requirements & UX) | (Forum Discussion, 2025; 2023)
Cache | Cache & reuse | (Forum Discussion, 2025)
Batch | Do it earlier / batch it / throttle it | (Forum Discussion, 2025)
Scale | Scale the platform | (Forum Discussion, 2025)
DB | Database & storage tuning | (Forum Discussion, 2025)
Micro | Micro-optimizations (hot path tweaks) | (Forum Discussion, 2025)
I/O | I/O and latency hiding (async, overlap I/O/compute) | (Forum Discussion, 2025; 2023)
Higher | Use a higher-level system that optimizes for you | (Forum Discussion, 2025)
Uncat | Uncategorized | –

B.2.3. KINDS OF OPTIMIZATION PROBLEMS

We categorize human-written solutions in FORMULACODE into thirteen optimization classes gathered from various online sources. We reviewed these sources, normalized overlapping suggestions into standard terminology, and used them to define the categories, which are then applied consistently in our analysis. This taxonomy is intentionally non-exhaustive: it serves as a practical baseline for analysis, capturing the principal codebase optimizations that developers typically consider when improving performance, rather than offering an authoritative catalog of all systems optimizations.

B.2.4. QUALITATIVE EXAMPLES

Qualitative examples are presented in Figure 25, Figure 26, and Figure 27.
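To make the categorization step of §B.2.3 concrete, the snippet below sketches one way an LLM judge could be asked to map a human patch onto the Table 9 abbreviations. This is only an illustrative sketch: the actual prompt, label set, and model configuration used for FORMULACODE are described in §B.1.7, and the classify_patch helper and llm callable here are hypothetical.

# Illustrative sketch only; the real classifier is described in §B.1.7.
# `llm` is a hypothetical callable that sends a prompt to an LLM judge
# and returns its text response.
CATEGORIES = [
    "Algo", "Data", "Lower", "Approx", "Parallel", "Reduce", "Cache",
    "Batch", "Scale", "DB", "Micro", "I/O", "Higher", "Uncat",
]

def classify_patch(patch_text: str, llm) -> str:
    """Ask the judge for a single Table 9 abbreviation describing the patch."""
    prompt = (
        "You are labeling performance-optimization patches.\n"
        f"Valid labels: {', '.join(CATEGORIES)}\n"
        "Reply with exactly one label for the dominant optimization.\n\n"
        f"Patch:\n{patch_text}"
    )
    answer = llm(prompt).strip()
    # Fall back to 'Uncat' if the judge replies with anything unexpected.
    return answer if answer in CATEGORIES else "Uncat"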
Table 10: Cost-aware leaderboard of agent–model configurations. We report cost per task, mean advantage Adv_agent, cost-weighted advantage Adv^cost_agent, and cost-weighted normalized advantage \widetilde{Adv}^cost_agent.

Agent | Model | Cost/Task ↓ | Adv_agent ↑ | Adv^cost_agent ↑ | \widetilde{Adv}^cost_agent ↑
Terminus 2 | GPT-5 | 1.8508 | -0.0504 | -0.0272 | -0.0750
Terminus 2 | Claude 4.0 Sonnet | 3.7722 | -0.0410 | -0.0109 | -0.0282
Terminus 2 | Gemini 2.5 Pro | 1.5455 | -0.0433 | -0.0280 | -0.0737
Terminus 2 | Qwen 3 Coder | 1.2060 | -0.0454 | -0.0376 | -0.1043
OpenHands | GPT-5 | 0.7814 | -0.0209 | -0.0267 | -0.0899
OpenHands | Claude 4.0 Sonnet | 3.2300 | -0.0112 | -0.0035 | -0.0150
OpenHands | Qwen 3 Coder | 1.0974 | -0.0301 | -0.0274 | -0.1393

B.2.5. TERMINAL BENCH MODIFICATIONS

Terminal-Bench (Merrill et al., 2026) is a widely used harness for benchmarking terminal-based software development tasks. It is actively maintained, well understood by the agent development and benchmarking community, and already designed around end-to-end agent execution in a containerized shell environment. However, Terminal-Bench primarily targets correctness-oriented evaluations. In FORMULACODE, the evaluation target shifts: tasks are optimization-centric and require measuring performance improvements reliably, comparing multiple agent/model configurations under matched conditions, and auditing performance-oriented behavior and cost. We therefore extend Terminal-Bench along four capability axes.

Standardized execution for low-variance measurement. To complement the variance-control safeguards in Section 2, we add support for executing runs in standardized isolated environments (e.g., fixed cloud machines). This reduces machine-to-machine drift and makes speedup measurements more comparable across runs, which is essential when the benchmark signal is a relative performance change rather than a binary pass/fail outcome. Operationally, we extend Terminal-Bench to support running tasks on compute-optimized Amazon Web Services (AWS) EC2 instances. Such instances provide a fixed allotment of isolated hardware resources in professionally managed data centers, ensuring third-party reproducibility of FORMULACODE's experiments (Amazon Web Services). We use the c5ad.large instance with 2 vCPUs, 4 GiB RAM, and a dedicated 75 GiB SSD for storage. This instance is chosen specifically because it is extremely cost efficient (on-demand price of $0.086 per hour at the time of writing). Importantly, remote execution is a reproducibility convenience rather than a methodological prerequisite. The ASV-based protocol (warm-ups, repeated trials, and the variance controls in Section 2) is designed to yield reliable estimates on well-managed local commodity machines. We use EC2 primarily to eliminate avoidable confounds – resource contention, background load, and hardware heterogeneity – and to provide a clean gold-standard reference for subsequent experiments.

Sequential agent evaluation. We add controls to evaluate multiple agent/model configurations sequentially within the same standardized environment. For each FORMULACODE task, we provision a single instance and evaluate agent/model configurations in separate fresh containers: we measure the baseline implementation (Code_0), then the human-written optimized solution (Code_expert), and then each agent-produced candidate in turn, resetting the container state between configurations. This design ensures that comparisons are statistically matched by construction (same hardware and near-identical runtime conditions) while preventing cross-run interference from accumulated state.

Optimization-centric metrics. Terminal-Bench natively aggregates discrete outcomes (e.g., test pass/fail). We extend the measurement and analysis layers to parse and summarize continuous optimization signals (e.g., speedup, advantage, and variance) and to support custom aggregation procedures (e.g., stratification by difficulty, as described in Figure 23).
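As one concrete example of such an aggregation, Figure 22 defines the stratified speedup of a workload prefix (e.g., pd.algorithms.Quantile.*) as the geometric mean of the speedups of all leaf workloads sharing that prefix. A minimal sketch of this aggregation, assuming per-workload speedups are available as a flat mapping from fully qualified workload names to floats, is shown below; the helper name and data layout are ours and not part of the released harness.

import math

def stratified_speedup(speedups: dict[str, float], prefix: str) -> float:
    """Geometric mean of speedups over all leaf workloads under `prefix`.

    `speedups` maps fully qualified workload names, e.g.
    "pd.algorithms.Quantile.time_quantile('float')", to per-workload speedups.
    """
    leaves = [v for name, v in speedups.items() if name.startswith(prefix)]
    if not leaves:
        raise ValueError(f"no workloads under prefix {prefix!r}")
    return math.exp(sum(math.log(v) for v in leaves) / len(leaves))

# The two leaf workloads shown for time_quantile.* in Figure 22:
example = {
    "pd.algorithms.Quantile.time_quantile('float')": 3.04,
    "pd.algorithms.Quantile.time_quantile('int')": 0.89,
}
# Geometric mean of 3.04 and 0.89 ≈ 1.64, matching the stratified value in Figure 22.
print(stratified_speedup(example, "pd.algorithms.Quantile.time_quantile"))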
Additional accounting metrics. Finally, we add explicit support for token-usage and API-cost accounting, as well as other observability metrics (improved logging, robust timeout handling, and comprehensive interactive traces). These additions enable the cost-aware and failure-mode analysis reported in Section 3. Overall, these modifications enable the use of Terminal-Bench as a stable evaluation harness for FORMULACODE.

Table 11: Correctness constraint violations by agent–model configuration. For each configuration, we report the total number of rejected solutions (out of 108), along with how many are attributable to PyTest failures versus snapshot test failures.

Agent | Model | Total ↓ | PyTest ↓ | Snapshot ↓
Terminus 2 | GPT-5 | 54 | 51 | 32
Terminus 2 | Claude 4.0 Sonnet | 55 | 52 | 36
Terminus 2 | Gemini 2.5 Pro | 55 | 53 | 30
Terminus 2 | Qwen 3 Coder | 56 | 54 | 29
OpenHands | GPT-5 | 47 | 42 | 30
OpenHands | Claude 4.0 Sonnet | 50 | 43 | 34
OpenHands | Qwen 3 Coder | 50 | 44 | 32

B.3. Additional Analysis

This section lists additional analysis on FORMULACODE-V that was not included in the main paper for space reasons. We analyze (1) the rate of correctness constraint violations across agent/model configurations, (2) the relationship between trajectory length and performance, (3) patterns of tool usage across configurations, and (4) qualitative examples of agent patches.

Correctness Constraint Violations. Each FORMULACODE-V task is associated with two types of correctness constraints: (1) snapshot tests, which verify that the optimized codebase preserves each workload's local variables, and (2) the original PyTest suite from the upstream repository, which captures broader functional correctness. At initialization, the agent–model configuration receives explicit instructions to maximize performance while preserving correctness. If the patch fails either constraint, we 'roll back' any performance improvements and revert to the original codebase, ensuring that all reported speedups are strictly correctness-preserving. We therefore ask: how often are candidate performance-improving edits rejected solely due to correctness violations? For each agent–model pair, we count the number of tasks in which the final patch fails at least one test, and then further break this down into PyTest failures and snapshot test failures. Table 11 summarizes these statistics over 108 attempted solutions per configuration.

Observation: Correctness violations are common and represent a major source of rollbacks. We find that models spend most of their budget exploring patches that ultimately fail correctness checks. On average, 52.43% of trajectories are rejected due to correctness violations, with the majority of these failures stemming from PyTest suite violations rather than snapshot test failures. We believe this to be a consequence of the multi-objective nature of the optimization problem. A single-objective setting allows verifying new functionality with a single tool call. However, in a multi-objective setting, the agent–model configuration must strategically allocate interactions towards running either the benchmarking tool, the snapshot verification tool, or the pytest suite, depending on the new functionality it introduces. The tool-call distribution in Table 14 supports this hypothesis, as most agents demonstrate an inclination towards running performance-validation commands rather than correctness-validation commands.
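A minimal sketch of the rollback gate described above is given below. It assumes a git checkout of the task repository; the pytest invocation is standard, while "snapshot-tool" stands in for the snapshot-verification command referenced in Table 13 and is illustrative rather than the harness's exact interface.

import subprocess

def _passes(cmd: list[str], repo_dir: str) -> bool:
    """Return True if the given check command exits cleanly."""
    return subprocess.run(cmd, cwd=repo_dir).returncode == 0

def apply_if_correct(repo_dir: str, patch_file: str) -> bool:
    """Apply a candidate patch and keep it only if both correctness gates pass.

    Any failure rolls the working tree back to the baseline code, so no
    speedup from an incorrect patch is ever reported.
    """
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    pytest_ok = _passes(["python", "-m", "pytest", "-q"], repo_dir)
    snapshot_ok = _passes(["snapshot-tool"], repo_dir)  # illustrative command
    if pytest_ok and snapshot_ok:
        return True  # patch kept; speedups are then measured with asv
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir, check=True)
    return False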
Trajectory Length and Performance. Discovering effective performance optimizations requires a deep understanding of the codebase. Agents must interact with the codebase through a terminal interface to obtain such an understanding. In this experiment, we study the relation between the number of interactions and the global performance achieved by the cumulative trajectory of interactions. For each task, we record the number of complete command-line agent interactions (interactions where the agent runs a command and receives a response from the environment) and calculate the mean and median trajectory lengths averaged over all tasks. We then calculate the length-weighted advantage as len(Adv_agent) = Adv_agent / len_agent. Table 12 showcases these results.

Table 12: Trajectory length and length-weighted advantage. For each agent–model configuration, we report the mean and median trajectory length (in interaction steps), as well as the length-weighted advantage (len(Adv_agent)).

Agent | Model | Mean Length ↓ | Median Length ↓ | len(Adv_agent) ↑
Terminus 2 | GPT-5 | 295.53 | 198.50 | -0.000226
Terminus 2 | Claude 4.0 Sonnet | 73.13 | 63.50 | -0.000349
Terminus 2 | Gemini 2.5 Pro | 106.99 | 63.50 | -0.000755
Terminus 2 | Qwen 3 Coder | 99.91 | 90.50 | -0.000557
OpenHands | GPT-5 | 68.60 | 61.00 | -0.000299
OpenHands | Claude 4.0 Sonnet | 222.80 | 219.50 | -0.000106
OpenHands | Qwen 3 Coder | 633.10 | 595.00 | -0.000044

Observation: Trajectory lengths can be highly skewed. Some configurations demonstrate highly skewed trajectories. Specifically, Terminus 2 + GPT-5 and Terminus 2 + Gemini 2.5 Pro have mean lengths substantially larger than the median length, suggesting that these configurations occasionally require very long interactive runs. By contrast, OpenHands + Claude 4.0 Sonnet has more stable trajectory lengths across tasks, as the deviation between the mean and median is much smaller.

Observation: Agent choice has a substantial effect on overall behavior. The same model behaves very differently depending on the chosen agent. For example, GPT-5 produces much longer trajectories in Terminus 2 than in OpenHands, while Claude 4.0 Sonnet and Qwen 3 Coder show the opposite pattern. This suggests that the surrounding agent design heavily shapes search behavior.

Table 13: Tool categories used in trajectory classification. The classifier's implementation mirrors that of the optimization category classifier (§B.2.3).

Category | Description
editing | Text editing or transformation commands (e.g., sed, awk, ed).
search | Search/discovery commands for finding files or text (e.g., grep, rg, find, fd).
view | Read-only inspection commands for showing file/output snippets (e.g., cat, less, head, tail).
fs_ops | Filesystem mutation/metadata operations (e.g., cp, mv, rm, mkdir, chmod).
shell_session | Shell navigation/session management commands (e.g., cd, ls, pwd, clear, exit).
git | Version-control commands and git-derived shell variable setup (e.g., git, diff, reset).
python_exec | Python execution plus Python environment/package commands (e.g., python, pip, micromamba).
test | Test-running commands, including snapshot checks (e.g., pytest, snapshot-tool).
bench | Benchmark/profiling commands, primarily ASV workflows (e.g., asv run, asv profile).
patching | Patch/diff application commands or diff-marker lines (e.g., patch, applypatch, ---/+++).
other | Commands/fragments that do not match the above classes, including control-flow snippets or terminal noise.
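For intuition only, the snippet below sketches a rule-based way to bucket a raw shell command into the Table 13 categories. In our pipeline this classification is actually performed by an LLM judge (openai/gpt-oss-120b; see Table 14), so the keyword lists here are illustrative and deliberately incomplete.

# Illustrative rule-based fallback; the paper's classifier is an LLM judge.
CATEGORY_KEYWORDS = {
    "editing": ("sed", "awk", "ed"),
    "search": ("grep", "rg", "find", "fd"),
    "view": ("cat", "less", "head", "tail"),
    "fs_ops": ("cp", "mv", "rm", "mkdir", "chmod"),
    "shell_session": ("cd", "ls", "pwd", "clear", "exit"),
    "git": ("git", "diff", "reset"),
    "python_exec": ("python", "pip", "micromamba"),
    "test": ("pytest", "snapshot-tool"),
    "bench": ("asv",),
    "patching": ("patch", "applypatch"),
}

def categorize_command(command: str) -> str:
    """Map a shell command to one of the Table 13 tool categories."""
    first_token = command.strip().split()[0] if command.strip() else ""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if first_token in keywords:
            return category
    return "other"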
Tool-Usage Patterns. In all FORMULACODE tasks, agents are given unrestricted access to the bash command line along with additional performance-profiling and correctness-testing tools. In this experiment, we analyze how different configurations employ tools during optimization. For each task, we store the command-line interactions of the agent–model configurations and use an LLM to categorize the input commands based on the primary purpose of each command. The implementation is identical to that of the performance categorization classifier (§B.1.7). We then aggregate the tool-type classifications by total tool uses and tool uses per category. Table 14 summarizes these statistics across all configurations.

Observation: Agents invoke benchmarking tools more than testing tools. All agent–model configurations show a strong preference for running benchmarking and profiling commands over correctness-validation commands, with an average of 14.90% of tool calls dedicated to benchmarking/profiling and only 2.68% of calls dedicated to testing. This proclivity towards performance validation over correctness validation might have a substantial impact on our previous observation that correctness violations are prevalent for all agent–model configurations.

Table 14: Tool-usage statistics by agent–model configuration. Columns report the total number of tool calls and the percentage distribution of calls across tool categories (judged by openai/gpt-oss-120b using the categories in Table 13). The most effective configurations spend the majority of their tool calls on file operations (editing, search, and view) and running performance benchmarks (bench), with the remaining calls distributed across a variety of tool categories.

Agent | Model | Total | editing | search | view | fs_ops | shell | git | python | test | bench | patching | other
Terminus 2 | GPT-5 | 13370 | 19.40 | 15.70 | 10.29 | 2.76 | 5.00 | 12.14 | 11.20 | 2.89 | 17.51 | 2.07 | 1.02
Terminus 2 | Claude 4.0 Sonnet | 4214 | 6.12 | 8.00 | 11.25 | 7.78 | 8.19 | 0.00 | 16.61 | 2.18 | 6.36 | 0.00 | 33.51
Terminus 2 | Gemini 2.5 Pro | 5641 | 11.35 | 6.29 | 5.96 | 7.43 | 11.45 | 2.80 | 5.48 | 0.62 | 16.04 | 0.27 | 32.32
Terminus 2 | Qwen 3 Coder | 3565 | 17.59 | 16.89 | 8.61 | 12.17 | 12.45 | 0.45 | 8.72 | 1.49 | 9.99 | 0.36 | 11.28
OpenHands | GPT-5 | 4683 | 14.35 | 19.65 | 26.44 | 0.62 | 1.62 | 4.36 | 7.94 | 3.93 | 18.26 | 0.04 | 2.80
OpenHands | Claude 4.0 Sonnet | 6323 | 12.92 | 20.94 | 26.51 | 0.76 | 1.57 | 3.56 | 9.90 | 3.67 | 16.86 | 0.03 | 3.29
OpenHands | Qwen 3 Coder | 8638 | 10.66 | 20.39 | 25.24 | 0.79 | 1.62 | 4.33 | 9.92 | 3.96 | 19.23 | 0.02 | 3.84

Observation: Reading is the dominant tool category. The most frequently used tool category across all configurations is file operations (editing, searching, and viewing files), which accounts for an average of 31.74% of all tool calls. This is consistent with the intuition that developing a holistic understanding of the codebase is a prerequisite for synthesizing effective optimizations.

B.4. Qualitative Examples: Human Expert vs. AI Agent Patches

This section presents side-by-side comparisons of human expert and AI agent patches for FORMULACODE tasks. Specifically, the following examples are showcased:
•Figure 16 (modin_modin_2). Failure mode: incorrect triage; the expert gained an edge by identifying the performance hot path. Modin has expensive auto-switch backend logic that was being invoked even when all inputs shared the same backend. The agent was unable to identify the core issue, instead focusing on a caching bug that was not on the performance-critical path. The human correctly identified the issue and fixed the backend-casting logic.
•Figure 17 (optuna_optuna_6). Failure mode: correct triage; the expert gained an edge by delegating to numpy vectorization. Optuna's hypervolume computation used a naive recursive algorithm when a faster O(N^2) approach was possible.
Both the human and the agent were able to identify and implement the algorithm. However, the human’s solution used fully vectorized numpy operations, while the agent’s solution used a Python-level sweep-line approach withbisect. This resulted in the human outperforming the agent despite both having the same asymptotic complexity. •Figure 18 (optuna_optuna_1). Failure mode: Correct triage; expert implemented holistic full-module optimization. Optuna’s implementation for sorting non-dominated Pareto fronts used a naive algorithm that didn’t scale well as number of trials increased. Both the human and the agent identified this issue; the agent’s implementation utilized a Fenwick tree based algorithm which fixed a single hotpath (when inputs are 2D). However, the expert implementation implemented a holistic rewrite: it optimized the entire call chain to use vectorized numpy operations and merged separate pathways for 2D/N-D optimization, resulting in complementary improvements across the entire multi-objective optimization flow. •Figure 19 (networkx_networkx_4). The core issue was that NetworkX’s BFS-based component discovery algorithm did not implement an early-termination optimization. Both the human and the agent fix this by implementing an early termination optimization. However, the agent outperforms the human by further optimizing the BFS implementation, achieving an additional +0.0132 advantage on top of the human’s improvement. • Figure 20 (pybamm_team_pybamm_1). A sensitivity computation in PyBaMM created a quadratic memory allocation bottleneck due to incremental concatenation without realizing that the full size was known in advance. Both the human and the agent identify the issue and collect all blocks first and concatenate once. The agent further optimizes the concatenation logic by consolidating multiple function calls into one and adding guards for empty inputs, resulting in a+0.0167 advantage. • Figure 21 (shapely_shapely_1). Thedeprecate_positionaldecorator in Shapely calledinspect.signatureon every invocation, causing 300–1000% slowdowns. The human and agent solution both converged on the same strategy: implementing a caching layer on the decorator. However, the agent implemented additional optimizations to skip the hot-path when no deprecated parameters existed, resulting in a +0.0132 advantage. 33 Human Expert Patch diff --git a/modin/core/storage_formats/pandas/query_compiler_caster.py --- a/modin/core/storage_formats/pandas/query_compiler_caster.py +++ b/modin/core/storage_formats/pandas/query_compiler_caster.py @@ -1006,6 +1006,8 @@ pin_target_backend = None + input_backends: set[str] = set() def register_query_compilers(arg): + input_backends.add(arg_backend) @@ -1047,10 +1050,12 @@ - inputs_pinned = ( - len(input_query_compilers) < 2 and pin_target_backend is not None - ) - if not AutoSwitchBackend.get() or inputs_pinned: # Skip casting when all inputs share a single backend and either # auto-switching is disabled or inputs are pinned + if len(input_backends) == 1 and ( + not AutoSwitchBackend.get() or pin_target_backend is not None + ): f_to_apply = _get_extension_for_method(...) 
result = f_to_apply(*args, **kwargs) - if isinstance(result, QueryCompilerCaster) and inputs_pinned: + if (isinstance(result, QueryCompilerCaster) + and pin_target_backend is not None): result._set_backend_pinned(True, inplace=True) return result AI Agent Patch diff --git a/modin/core/storage_formats/pandas/query_compiler_caster.py --- a/modin/core/storage_formats/pandas/query_compiler_caster.py +++ b/modin/core/storage_formats/pandas/query_compiler_caster.py @@ -137,23 +137,52 @@ -def _get_empty_qc_for_default_backend() -> BaseQueryCompiler: - """Get an empty query compiler for the default backend.""" +def _get_empty_qc_for_backend(backend: str) -> BaseQueryCompiler: + """Get an empty query compiler for the specified backend.""" from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher - return FactoryDispatcher.get_factory().io_cls.from_pandas(pandas. DataFrame()) + original_backend = Backend.get() + try: + Backend.put(backend) + return FactoryDispatcher.get_factory().io_cls.from_pandas(pandas. DataFrame()) + finally: + Backend.put(original_backend) -_BACKEND_TO_EMPTY_QC: defaultdict[str, BaseQueryCompiler] = defaultdict( - _get_empty_qc_for_default_backend -) +_BACKEND_TO_EMPTY_QC: dict[str, BaseQueryCompiler] = +def _get_cached_empty_qc(backend: str) -> BaseQueryCompiler: + if backend not in _BACKEND_TO_EMPTY_QC: + _BACKEND_TO_EMPTY_QC[backend] = _get_empty_qc_for_backend(backend) + return _BACKEND_TO_EMPTY_QC[backend] @@ -1042,7 +1071,7 @@ - input_qc_for_pre_op_switch = _BACKEND_TO_EMPTY_QC[ input_backend] + input_qc_for_pre_op_switch = _get_cached_empty_qc( input_backend) Summary DisablesAutoSwitchBackendby default and rewrites the casting-skip logic inquery_compiler_caster.pyto track the set of distinct input backends, skipping expensive query- compiler conversions when all inputs share a single backend. Updates four test files and adjusts metric assertions (not shown). Summary Fixes a bug where thedefaultdictfactory ignores the re- quested backend when creating empty query compilers, replac- ing it with an explicit_get_cached_empty_qcfunction that temporarily switchesBackend.put()to the correct backend. A correctness fix, but not on the performance-critical path. Figure 16:modin_project-modin_2: Modin’sAutoSwitchBackendfeature, enabled by default, triggered an expensive type conversion even when all inputs shared the same backend. The agent solution (openhands:claude-sonnet-4) identified and fixed a real bug in the caching logic, but this was not on the performance-critical path, resulting in a−0.1265 advantage compared to the human expert’s systemic fix that disabledAutoSwitchBackendby default and optimized the casting logic to track input backend diversity, skipping conversions when unnecessary. 34 Human Expert Patch diff --git a/optuna/_hypervolume/wfg.py b/optuna/_hypervolume/wfg.py --- a/optuna/_hypervolume/wfg.py +++ b/optuna/_hypervolume/wfg.py # New O(N^2) vectorized 3D hypervolume via coordinate compression +def _compress_coordinate(coords: np.ndarray) -> tuple[np.ndarray, np. 
ndarray]: + sorted_indices = np.argsort(coords) + values = coords[sorted_indices] + r = np.zeros_like(sorted_indices) + r[sorted_indices] = np.arange(coords.shape[0], dtype=r.dtype) + return r, values +def _compute_3d(sorted_pareto_sols: np.ndarray, reference_point: np.ndarray ) -> float: + """O(N^2) 3D hypervolume via cumulative minimum on compressed coordinates.""" + n = sorted_pareto_sols.shape[0] + x_vals = np.concatenate([sorted_pareto_sols[:, 0], reference_point [:1]]) + y_ind, y_vals = _compress_coordinate(sorted_pareto_sols[:, 1]) + y_vals = np.concatenate([y_vals, reference_point[1:2]]) + z_max = reference_point[2] + zs = np.full((n, n), z_max) + for i in range(n): + zs[i, y_ind[i]] = sorted_pareto_sols[i, 2] + zs = np.minimum.accumulate(zs, axis=0) + zs = np.minimum.accumulate(zs, axis=1) + x_delta = x_vals[1:] - x_vals[:-1] + y_delta = y_vals[1:] - y_vals[:-1] + return np.sum((z_max - zs) * x_delta[:, np.newaxis] * y_delta[np. newaxis, :]) @@ -125,6 +178,13 @@ def compute_hypervolume( if reference_point.shape[0] == 2: hv = _compute_2d(sorted_pareto_sols, reference_point) + elif reference_point.shape[0] == 3: + hv = _compute_3d(sorted_pareto_sols, reference_point) else: hv = _compute_hv(sorted_pareto_sols, reference_point) AI Agent Patch diff --git a/optuna/_hypervolume/wfg.py b/optuna/_hypervolume/wfg.py --- a/optuna/_hypervolume/wfg.py +++ b/optuna/_hypervolume/wfg.py # x-sweep with incremental 2D Pareto frontier via bisect +def _compute_3d(sorted_pareto_sols, reference_point) -> float: + """Exact 3D hypervolume using x-sweep with incremental 2D frontier.""" + import bisect + xs = pts[:, 0] + dx = np.maximum(xs_ext[1:] - xs_ext[:-1], 0.0) + y_list: list[float] = [] + z_list: list[float] = [] + + def insert_frontier(y: float, z: float) -> None: + i = bisect.bisect_left(y_list, float(y)) + if i > 0 and z >= z_list[i - 1]: + return # dominated by left neighbor # ... (dominance-aware insertion: handle equal y, # remove dominated points to the right) + y_list.insert(i, float(y)) + z_list.insert(i, float(z)) + + for i in range(n): + insert_frontier(float(pts[i, 1]), float(pts[i, 2])) + if y_list: + yz = np.column_stack((np.asarray(y_list), np.asarray(z_list))) + areas[i] = _compute_2d(yz, ref_yz) + return float(np.dot(dx, areas)) @@ -126,7 +190,7 @@ def compute_hypervolume( - hv = _compute_hv(sorted_pareto_sols, reference_point) + hv = _compute_3d(...) if sorted_pareto_sols.shape[1] == 3 else _compute_hv(...) Summary Adds a specializedO(N 2 ) _compute_3dfunction using a _compress_coordinatehelper that mapsy-coordinates to in- teger ranks vianp.argsort, builds anN× Ngrid, and applies np.minimum.accumulatealong both axes to compute domi- nated volume in fully vectorized numpy. Also adds a dedicated elifbranch incompute_hypervolumeand parameterized tests (not shown). Summary Adds a_compute_3dfunction using anx-sweep with incre- mental 2D Pareto frontier maintenance viabisectand Python lists. At eachx-slice, the frontier is updated with dominance- aware insertion, then the 2D area is computed by delegating to_compute_2d. The dispatch incompute_hypervolumeis modified with an inline ternary for 3D inputs. Figure 17:optuna_optuna_6: Optuna’s_hypervolume.WFGclass used a naive recursive algorithm for hypervolume computation that had aO(N 3 )runtime for the common 3D case, when aO(N 2 )approach was possible. Both the human and the agent identified and implemented the faster algorithm. 
However, the human’s solution used fully vectorized numpy operations, while the best agent (terminus-2:gpt-5) used a Python-level sweep-line approach withbisect. This resulted in the human outperforming the agent with a−0.03964agent advantage despite both having the same asymptotic complexity. 35 Human Expert Patch diff --git a/optuna/study/_multi_objective.py b/optuna/study/ _multi_objective.py --- a/optuna/study/_multi_objective.py +++ b/optuna/study/_multi_objective.py @@ (selected excerpts) -def _get_pareto_front_trials_2d(...): - ... # Separate 2D implementation -def _get_pareto_front_trials_nd(...): - ... # Separate N-D implementation -def _get_pareto_front_trials_by_trials(...): - if len(directions) == 2: - return _get_pareto_front_trials_2d(...) - return _get_pareto_front_trials_nd(...) +def _get_pareto_front_trials_by_trials(...): + loss_values = np.asarray(...) + on_front = _is_pareto_front(loss_values, + assume_unique_lexsorted=False) + return [t for t, p in zip(trials, on_front) if p] -def _fast_non_dominated_sort( - objective_values, *, penalty=None, n_below=None +def _fast_non_domination_rank( + loss_values, *, penalty=None, n_below=None ) -> np.ndarray: - ... # O(n^2) broadcast + defaultdict + ... # Vectorized _calculate_nondomination_rank + ... # + _is_pareto_front with lexsort AI Agent Patch diff --git a/optuna/study/_multi_objective.py b/optuna/study/ _multi_objective.py --- a/optuna/study/_multi_objective.py +++ b/optuna/study/_multi_objective.py @@ -189,42 +189,106 @@ def _calculate_nondomination_rank(...): ... # Fast path for 2D objectives. + if objective_values.shape[1] == 2: + x = objective_values[:, 0] + y = objective_values[:, 1] + order = np.lexsort((y, x)) + ys_unique = np.unique(y) + y_idx_all = np.searchsorted(ys_unique, y, + side=’right’) + m = len(ys_unique) + bit = np.zeros(m + 1, dtype=int) + def bit_query(i): # Fenwick tree prefix max + ... + def bit_update(i, v): + ... # Process equal-x groups, BIT for rank + ... + return ranks, last_rank + # Fallback: original O(n^2) broadcast for >=3D. domination_mat = np.all(...) & np.any(...) Summary Complete rewrite of_multi_objective.py. Renames_fast_ non_dominated_sortto_fast_non_domination_rank, re- places theO(n 2 )broadcast-based algorithm with a vector- ized_is_pareto_frontand_calculate_nondomination_ rankimplementation, merges the separate 2D/N-D Pareto front functions, and updates all callers across the TPE sampler and NSGA-I selection strategy. Summary Adds a specializedO(n log n)BIT (Fenwick tree) algorithm for 2D objectives in_calculate_nondomination_rank, falling back to the originalO(n 2 )broadcast for≥3 objectives. While algorithmically superior for the 2D case, the agent only opti- mizes the inner ranking function without restructuring callers or the Pareto front computation. Figure 18:optuna_optuna_1: The original implementation of Optuna’s non-dominated sorting in multi-objective optimiza- tion cases emerged as a performance bottleneck when scaling to large number of trials (∼ 10000trials). Both the best agent (terminus-2:gpt-5) and the human expert correctly identified the issue. The agent’s solution focused on optimizing the inner ranking function with a specializedO(n logn)Fenwick-tree algorithm for 2D objectives, while retaining the original O(n 2 )broadcast-based approach for higher dimensions. In contrast, the human expert implemented a holistic rewrite of the entire call chain to use vectorized numpy operations and merged separate pathways for 2D/N-D optimization. 
This broader vectorized rewrite captured multi-objective speedups (5–6×) that the agent’s localized change missed, resulting in a −0.03629 agent advantage. 36 Human Expert Patch diff --git a/networkx/algorithms/components/connected.py @@ -65,7 +65,7 @@ def connected_components(G): for v in G: if v not in seen: - c = _plain_bfs(G, n, v) + c = _plain_bfs(G, n - len(seen), v) seen.update(c) diff --git a/networkx/algorithms/components/weakly_connected.py @@ -62,7 +62,7 @@ def weakly_connected_components(G): for v in G: if v not in seen: - c = set(_plain_bfs(G, n, v)) + c = set(_plain_bfs(G, n - len(seen), v)) seen.update(c) AI Agent Patch diff --git a/networkx/algorithms/components/connected.py b/networkx/ algorithms/components/connected.py index ebe0d8c1..922d4904 100644 --- a/networkx/algorithms/components/connected.py +++ b/networkx/algorithms/components/connected.py @@ -62,12 +62,15 @@ def connected_components(G): - [-n-]+remaining+ = len(G) ... - c = _plain_bfs(G, [-n-]+remaining+, v) seen.update(c) + remaining -= len(c) yield c + if remaining == 0: + break @@ -203,14 +206,16 @@ def _plain_bfs(G, n, source): + seen_add = seen.add while nextlevel: thislevel = nextlevel nextlevel = [] + append = nextlevel.append ... if w not in seen: - [-seen.add(w)-]+seen_add(w)+ - [-nextlevel.append(w)-]+append(w)+ diff --git a/networkx/algorithms/components/weakly_connected.py b/networkx/ algorithms/components/weakly_connected.py index ecfac50a..a89b7af8 100644 --- a/networkx/algorithms/components/weakly_connected.py +++ b/networkx/algorithms/components/weakly_connected.py @@ -59,12 +59,15 @@ def weakly_connected_components(G): # (same early-exit optimization as connected_components above) @@ -166,32 +169,30 @@ def _plain_bfs(G, n, source): # (same local-variable caching as connected._plain_bfs above) # additionally, converted from generator (yield) to returning seen set: - yield source + ... if len(seen) == n: - return + return seen + return seen Summary Minimal single-line fix in bothconnected_componentsand weakly_connected_components: passesn - len(seen)in- stead ofnto_plain_bfs, tightening the BFS early-termination bound so it stops as soon as all remaining unseen nodes are found. No structural changes to the BFS itself. Summary Multi-pronged optimization: tracks aremainingnode count to break out of the component loop early, caches method lookups (seen.add,nextlevel.append) into local variables, and con- verts the weakly-connected_plain_bfsfrom a generator to a batch set return, eliminating per-node yield overhead. Figure 19:networkx_networkx_4: NetworkX’sconnected_componentsandweakly_connected_componentspassed the total graph node countnto_plain_bfswithout accounting for already-discovered nodes, missing an early-termination optimization. For disconnected graphs with large components explored last, this caused dramatic slowdowns—up to 367× for adversarial cases withn=1000. Both the best agent (openhands:gpt-5) and the expert identified the core issue, and implemented the same early-termination optimization. However, the agent also implemented additional micro-optimizations that further reduced overhead, resulting in a +0.0132 advantage over the human’s solution. 37 Human Expert Patch diff --git a/src/pybamm/solvers/processed_variable.py b/... 
--- a/src/pybamm/solvers/processed_variable.py +++ b/src/pybamm/solvers/processed_variable.py @@ -443,16 +443,18 @@ class ProcessedVariable: dvar_dp_func = casadi.Function( "dvar_dp", [t_casadi, y_casadi, p_casadi_stacked], [ dvar_dp] ) - for idx, t in enumerate(ts): - u = ys[:, idx] - next_dvar_dy_eval = dvar_dy_func(t, u, inputs_stacked) - next_dvar_dp_eval = dvar_dp_func(t, u, inputs_stacked) - if idx == 0: - dvar_dy_eval = next_dvar_dy_eval - dvar_dp_eval = next_dvar_dp_eval - else: - dvar_dy_eval = casadi.diagcat(dvar_dy_eval, next_dvar_dy_eval) - dvar_dp_eval = casadi.vertcat(dvar_dp_eval, next_dvar_dp_eval) + dvar_dy_eval = casadi.diagcat( + *[ + dvar_dy_func(t, ys[:, idx], inputs_stacked) + for idx, t in enumerate(ts) + ] + ) + dvar_dp_eval = casadi.vertcat( + *[ + dvar_dp_func(t, ys[:, idx], inputs_stacked) + for idx, t in enumerate(ts) + ] + ) # Compute sensitivity S_var = dvar_dy_eval @ dy_dp + dvar_dp_eval AI Agent Patch diff --git a/src/pybamm/solvers/processed_variable.py b/... --- a/src/pybamm/solvers/processed_variable.py +++ b/src/pybamm/solvers/processed_variable.py @@ -436,29 +439,30 @@ class ProcessedVariable: dvar_dy = casadi.jacobian(var_casadi, y_casadi) dvar_dp = casadi.jacobian(var_casadi, p_casadi_stacked) - dvar_dy_func = casadi.Function( - "dvar_dy", [t_casadi, y_casadi, p_casadi_stacked], [dvar_dy ] - ) - dvar_dp_func = casadi.Function( - "dvar_dp", [t_casadi, y_casadi, p_casadi_stacked], [ dvar_dp] # Single function returning both jacobians + grads_func = casadi.Function( + "pv_grads", [t_casadi, y_casadi, p_casadi_stacked], + [dvar_dy, dvar_dp] ) - for idx, t in enumerate(ts): + + dvar_dy_blocks = [] + dvar_dp_blocks = [] + for idx in range(ts.size): + t = ts[idx] u = ys[:, idx] - next_dvar_dy_eval = dvar_dy_func(t, u, inputs_stacked) - next_dvar_dp_eval = dvar_dp_func(t, u, inputs_stacked) - if idx == 0: - dvar_dy_eval = next_dvar_dy_eval - dvar_dp_eval = next_dvar_dp_eval - else: - dvar_dy_eval = casadi.diagcat(dvar_dy_eval, next_dvar_dy_eval) - dvar_dp_eval = casadi.vertcat(dvar_dp_eval, next_dvar_dp_eval) + g_dy, g_dp = grads_func(t, u, inputs_stacked) + dvar_dy_blocks.append(g_dy) + dvar_dp_blocks.append(g_dp) + # Concatenation in one shot + dvar_dy_eval = casadi.diagcat(*dvar_dy_blocks) + dvar_dp_eval = casadi.vertcat(*dvar_dp_blocks) # Compute sensitivity S_var = dvar_dy_eval @ dy_dp + dvar_dp_eval Summary Replaced the incremental per-timestepcasadi.diagcat/ casadi.vertcatloop with list comprehensions that build all Jacobian blocks first, then concatenate once via unpacking (*blocks). Also added a CHANGELOG.md entry (not shown). Summary Consolidated the two separatecasadi.Functionobjects (dvar_dy_func,dvar_dp_func) into a singlegrads_funcre- turning both Jacobians, reducing per-timestep function call overhead. Collects results in lists and concatenates once. Also adds guards for empty time series and empty result lists. Figure 20:pybamm_team-pybamm_1: PyBaMM’sProcessedVariablesensitivity computation inIDAKLUSolverused an incremental per-timestep concatenation operation, creating a quadratic memory allocation overhead. Both the best agent (openhands:gpt-5) and the expert identified that, instead of each loop iteration building a progressively larger matrix by concatenating to the existing result, it would be more efficient to first collect all blocks and then concatenate once at the end. The agent added further micro-optimization: consolidating two accumulation function calls into one and added empty-input guards. 
This resulted in a +0.0167 agent advantage. 38 Human Expert Patch diff --git a/shapely/decorators.py b/shapely/decorators.py --- a/shapely/decorators.py +++ b/shapely/decorators.py -def deprecate_positional(should_be_kwargs, category=DeprecationWarning): +def deprecate_positional( + should_be_kwargs: Iterable[str], + category: type[Warning] = DeprecationWarning, +): + def decorator(func: Callable): + code = unwrap(func).__code__ + pos_names = code.co_varnames[: code.co_argcount] + name_to_idx = name: idx for idx, name in enumerate(pos_names) + deprecate_positions = [ + (name_to_idx[name], name) for name in should_be_kwargs + if name in name_to_idx] + if not deprecate_positions: + return func + warn_from = min(deprecate_positions)[0] + + @lru_cache(10) + def make_msg(n_args: int): + used = [name for idx, name in deprecate_positions if idx < n_args] # ... (build warning message string) + return f"positional argumentplr args_txt for ‘func. __name__‘ ..." - def decorator(func): @wraps(func) def wrapper(*args, **kwargs): - ret_value = func(*args, **kwargs) - sig = inspect.signature(func) - args_bind = sig.bind_partial(*args) - warn_args = [f"‘arg‘" for arg in args_bind.arguments.keys() - if arg in should_be_kwargs] - if warn_args: # ... (build warning message) - warnings.warn(msg, category=category, stacklevel=2) - return ret_value + result = func(*args, **kwargs) + n = len(args) + if n > warn_from: + warnings.warn(make_msg(n), category=category, stacklevel=2) + return result return wrapper AI Agent Patch diff --git a/shapely/decorators.py b/shapely/decorators.py --- a/shapely/decorators.py +++ b/shapely/decorators.py def decorator(func): # Cache signature inspection at decoration time + sig = inspect.signature(func) + param_names = list(sig.parameters.keys()) + deprecated_positions = set() + for param_name in should_be_kwargs: + if param_name in param_names: + deprecated_positions.add(param_names.index(param_name)) + if not deprecated_positions: + return func + max_deprecated_pos = max(deprecated_positions) @wraps(func) def wrapper(*args, **kwargs): - ret_value = func(*args, **kwargs) - sig = inspect.signature(func) - args_bind = sig.bind_partial(*args) # ... (per-call signature inspection) # Fast path: skip if not enough args + if len(args) <= max_deprecated_pos: + return func(*args, **kwargs) # Only check deprecated positions + warn_positions = [pos for pos in deprecated_positions if pos < len(args)] + if warn_positions: + args_bind = sig.bind_partial(*args) # ... (build and emit warning) + return func(*args, **kwargs) return wrapper Summary Completely rewrote thedeprecate_positionaldecorator: replacedinspect.signaturewithinspect.unwrapand di- rect__code__introspection at decoration time, added an lru_cache-backedmake_msghelper to avoid rebuilding warn- ing strings, and included type annotations and a comprehensive 138-line test suite. Summary Cachedinspect.signatureat decoration time and pre- computed deprecated parameter positions as a set. Added an early-return fast path when no deprecated parameters exist and a second fast path skipping checking when argument count is below the threshold. Figure 21:shapely_shapely_1: Thedeprecate_positionaldecorator in Shapely calledinspect.signatureand sig.bind_partialon every decorated function invocation, causing a 300–1000% performance regression. Users reported significant Polygon creation slowdowns. The best agent (terminus-2:claude-sonnet-4) and the human expert converged on nearly identical core strategies. 
Both implemented a caching layer to move signature inspection from call time to decoration time. The agent added additional micro-optimizations to skip checks when no deprecated parameters exist or when the argument count is below the threshold. This resulted in a +0.0131 advantage over the human’s solution. 39 Table 15: Repositories and Tasks after applying rule-based filters (Filter Stage 1) and LLM-based filters (Filter Stage 2) as described in §A.1.2. We also showcase the number of tasks, the date of creation of the latest task, and additional information about the functionality and popularity of the repository. Most repositories are software tools used extensively within scientific communities. Repository Name#Stars#ForksFilter Stage 1 Filter Stage 2 Latest Task Date Description 1. scikit-learn/scikit-learn637922635924342432025-10-31scikit-learn: machine learning in Python 2. pandas-dev/pandas469221918432985602025-11-11Flexible and powerful data analysis / manipulation library for Python, provid- ing labeled data structures similar to R data.frame objects, statistical functions, and much more 3. scipy/scipy14120551614542092025-10-29SciPy library main repository 4. apache/arrow16089388419882672025-07-22Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics 5. networkx/networkx162773415288442025-09-16NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. 6. Qiskit/qiskit659826597172122025-11-19Qiskit is an open-source SDK for work- ing with quantum computers at the level of pulses, circuits, and application mod- ules. 7. scikit-image/scikit-image63712320458542025-11-18Image processing in Python 8. pymc-devs/pymc93222146685452025-09-23PyMC (formerly PyMC3) is a Python package for Bayesian statistical model- ing focusing on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms. 9. Textualize/rich541721920165112025-07-25Rich is a Python library for rich text and beautiful formatting in the terminal. 10. tqdm/tqdm3058014021212022-03-24Fast, extensible progress bar for Python and CLI 11. pydata/xarray400411926091012025-11-21N-D labeled arrays and datasets in Python 12. optuna/optuna1292211777191122025-11-05A hyperparameter optimization frame- work 13. quantumlib/Cirq477211511032025-11-18Python framework for creating, editing, and invoking Noisy Intermediate-Scale Quantum (NISQ) circuits. 14. pvlib/pvlib-python1424112611082025-10-03A set of documented functions for sim- ulating the performance of photovoltaic energy systems. 15. ipython/ipyparallel262610066562024-10-28IPython Parallel: Interactive Parallel Computing in Python 16. geopandas/geopandas4940981314222025-05-22Python tools for geographic data Continued on next page 40 Repository Name#Stars#ForksFilter Stage 1 Filter Stage 2 Latest Task Date Description 17. kedro-org/kedro105939714142025-07-17Kedro is a toolbox for production-ready data science. It uses software engineer- ing best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular. 18. HIPS/autograd73799281312017-10-21Efficiently computes derivatives of NumPy code. 19. MDAnalysis/mdanalysis1477733196232025-10-13MDAnalysis is a Python library to ana- lyze molecular dynamics simulations. 20. pybamm-team/PyBaMM1387692218172025-04-29PyBaMM (Python Battery Mathemati- cal Modelling) is an open-source battery simulation package written in Python. 
21. modin-project/modin103326695082025-09-30Speed up your Pandas workflows by changing a single line of code 22. nilearn/nilearn132263113822025-10-09Machine learning for NeuroImaging in Python 23. sunpy/sunpy971626663222025-05-16sunpy is a Python software package that provides fundamental tools for accessing, loading and interacting with solar physics data in Python. 24. shapely/shapely4284600150212025-05-03Manipulation and analysis of geometric objects 25. dedupeio/dedupe43875682542023-12-19A python library for accurate and scal- able data deduplication and entity- resolution. 26. h5py/h5py2174547263352025-08-10h5py is a thin, pythonic wrapper around HDF5 27. PyWavelets/pywt22945171212024-07-16PyWavelets - Wavelet Transforms in Python 28. pydicom/pydicom20705088672025-05-12Read, modify and write DICOM files with python code 29. arviz-devs/arviz173745810752025-10-21Exploratory analysis of Bayesian mod- els 30. napari/napari2512454849692025-09-30napari: a fast, interactive, multi- dimensional image viewer for python 31. tardis-sn/tardis225446268132025-09-16TARDIS - Temperature And Radiative Diffusion In Supernovae 32. dipy/dipy787446194162025-11-18DIPY is the paragon 3D/4D+ medical imaging library in Python. Contains generic methods for spatial normal- ization, signal processing, machine learning, statistical analysis and visual- ization of medical images. Additionally, it contains specialized methods for com- putational anatomy including diffusion, perfusion and structural imaging. Continued on next page 41 Repository Name#Stars#ForksFilter Stage 1 Filter Stage 2 Latest Task Date Description 33. python-control/python- control 190844411762025-06-21The Python Control Systems Library is a Python module that implements basic operations for analysis and design of feedback control systems. 34. SciTools/cartopy15453897462025-04-26Cartopy is a Python package designed for geospatial data processing in order to produce maps and other geospatial data analyses. 35. holoviz/datashader346737790192025-10-09Quickly and accurately render even the largest data. 36. microsoft/Qcodes396335187102025-09-05Modular data acquisition framework 37. mars-project/mars2748326164512023-02-16Mars is a tensor-based unified frame- work for large-scale data computation which scales numpy, pandas, scikit- learn and Python functions. 38. pytroll/satpy1146320520452025-08-02Python package for reading, manipulat- ing and writing satellite data 39. SciTools/iris692297109232025-10-31A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data 40. lmfit/lmfit-py116429020582022-09-05Non-Linear Least Squares Minimiza- tion, with flexible Parameter settings, based on scipy.optimize, and with many additional classes and methods for curve fitting. 41. deepchecks/deepchecks39242869992023-12-06Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML valida- tion needs, enabling to thoroughly test your data and models from research to production. 42. devitocodes/devito6322429972025-07-24DSL and compiler framework for au- tomated finite-differences and stencil computation 43. danielgtaylor/python- betterproto 17332334212023-12-07Better Protobuf / gRPC code generator and library for Python 44. scikit-learn-contrib/metric- learn 1425229612017-11-27Metric Learning in Python 45. pydicom/pynetdicom5511882412025-05-24A Python implementation of the DI- COM networking protocol 46. 
scverse/anndata667175142172025-07-23Annotated data matrix for single-cell genomics 47. apache/arrow-adbc498160571632025-11-07Database connectivity API standard and libraries for Apache Arrow 48. man-group/ArcticDB21021531122025-11-19ArcticDB is a high performance data store for time series and tick data 49. stac-utils/pystac4121274812023-03-31Python library for working with Spa- tioTemporal Asset Catalog (STAC) Continued on next page 42 Repository Name#Stars#ForksFilter Stage 1 Filter Stage 2 Latest Task Date Description 50. xdslproject/xdsl43312521362362025-11-04A Python compiler design toolkit. 51. ActivitySim/activitysim21711751102025-11-12An open platform for activity-based travel behavior modeling 52. OGGM/oggm245115484362025-04-01Open Global Glacier Model (OGGM): a modular framework for glacier model- ing 53. datalad/datalad613115426312024-09-10Keep code, data, containers under con- trol with git and git-annex 54. pydata/bottleneck114411261202025-04-29Fast NumPy array functions written in C 55. wmayner/pyphi4061002512024-09-24A toolbox for integrated information theory. 56. django-components/ django-components 14631005332025-09-30Reusable, composable components for Django templates 57. sourmash-bio/sourmash52488297272025-01-09Quickly search, compare, and analyze genomic and metagenomic data sets. 58. tskit-dev/msprime2018820992025-07-24Simulate genealogical trees and ge- nomic sequence data using population genetic models 59. numpy/numpy-financial384871342024-04-04Financial functions for NumPy 60. makepath/xarray-spatial894853892023-02-16Spatial analysis algorithms for xarray implemented in numba 61. dwavesystems/dimod13584152202024-06-13dimod is a shared API for samplers. 62. python-hyper/h11530831822025-01-12A pure-Python, bring-your-own-I/O implementation of HTTP/1.1 63. bjodah/chempy611816912018-03-24A package useful for chemistry written in Python 64. holoviz/param4977985102025-02-27Declarative parameters for robust Python classes and a rich API for re- active programming 65. inducer/loopy61578172152023-07-27A code generator for array computations on CPUs and GPUs 66. holgern/beem138757552020-12-22A Python library for Hive and Steem 67. scverse/spatialdata329752022025-09-29An open and interoperable data frame- work for spatial omics data 68. pysb/pysb1887110772021-01-20PySB is a framework for building math- ematical models of biochemical systems as Python programs 69. xorbitsai/xorbits119970186222024-11-16Xorbits is an open-source computing framework that makes it easy to scale data science and machine learning work- loads — from data preprocessing to tuning, training, and model serving. 70. pysal/momepy5636780122024-07-16Urban Morphology Measuring Toolkit 71. python-adaptive/adaptive1203622852025-08-21:chart_with_upwards_trend: Adaptive: parallel active learning of mathematical functions Continued on next page 43 Repository Name#Stars#ForksFilter Stage 1 Filter Stage 2 Latest Task Date Description 72. probabilistic-numerics/ probnum 459615272023-05-04Probabilistic numerics in Python 73. neurostuff/NiMARE197601412025-06-13Coordinate- and image-based meta- analysis in Python 74. NCAR/geocat-comp140561822025-08-18GeoCAT-comp provides implementa- tions of computational functions for operating on geosciences data. Many of these functions originated in NCL and were translated into Python. 75. mie-lab/trackintel243535552024-01-07trackintel is a library for the analysis of spatio-temporal tracking data with a focus on human mobility. 76. 
JDASoftwareGroup/ kartothek 16053152312021-03-17A dataset library for partitioned datasets stored in Parquet 77. AllenCellModeling/ aicsimageio 220515032023-04-05Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Python 78. dottxt-ai/outlines-core254504452025-03-31Core library for Outlines, providing structured text generation utilities 79. apache/arrow-nanoarrow2074710982025-10-27nanoarrow: a (C) library for the Apache Arrow C Data interface 80. pangeo-data/climpred25247922021-11-20:earth_americas: Verification of weather and climate forecasts :earth_africa: 81. pybop-team/PyBOP152457882025-07-15A parameterisation and optimisation package for battery models. 82. UXARRAY/uxarray2024499222025-09-11Python library for working with unstruc- tured grid model data in xarray 83. pygeos/pygeos38843101172021-11-30Wraps GEOS geometry functions in numpy ufuncs 84. innobi/pantab120417972024-10-31Read/Write pandas DataFrames with Tableau Hyper Extracts 85. xarray-contrib/xskillscore237412312021-11-20Metrics for verifying forecasts 86. glotzerlab/signac135371722025-04-04Manage large and heterogeneous data spaces on the file system. 87. sgkit-dev/sgkit26537113212025-09-30Scalable genetics toolkit 88. TileDB-Inc/TileDB-Py198365152025-08-01Python API for TileDB 89. IntelPython/dpctl117313722025-10-02Data Parallel Control (dpctl) - Python device control and USM memory for SYCL 90. tensorwerk/hangar-py205291912019-12-04Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era. 91. xarray-contrib/xbatcher184282032023-07-31Batch generation from xarray objects. 92. DASDAE/dascore12126122112025-09-20DASCore: A Python package for the analysis of distributed acoustic sensing data. 93. IntelPython/dpnp11623680262025-10-14Data Parallel Extension for NumPy Continued on next page 44 Repository Name#Stars#ForksFilter Stage 1 Filter Stage 2 Latest Task Date Description 94. not522/ac-library-python23023522021-11-19Python implementation of AtCoder Library 95. xarray-contrib/flox13321150392025-07-17Fast groupby reductions for dask and xarray 96. scipp/scipp13621268262025-03-17Python library for multi-dimensional data analysis 97. pyapp-kit/psygnal1152170102025-09-24Python observer pattern (callback/event system). Modeled after Qt Signals & Slots (but independent of Qt) 98. royerlab/ultrack149216852025-09-23Cell tracking and segmentation software 99. xitorch/xitorch15521922024-05-24Differentiable scientific computing for PyTorch 100. Quansight-Labs/ndindex107161232025-05-14A Python library for manipulating N- dimensional array indices 101. jkjkil4/JAnim18914312025-03-28Programmatic animation engine for creating precise and smooth animations with real-time feedback 45 Table 16: Repositories and Tasks represented in FORMULACODE (as of November 30, 2025). We showcase a repository level breakdown of the number of tasks, the latest task (by PR merge date), the average difficulty (0-5, with 0 being easiest), the average number of tokens in the human patch and in the prompt instructions, and the most common optimization type of the human patch. Repository#TasksLatest TaskAvg. Diffi- culty Avg. Patch Size (To- kens) Avg. PR Size (Tokens) Most Common Optimization 1. pandas-dev/pandas2222025-10-210.771842.85489.35Micro Optimizations (26.6%) 2. scikit-learn/scikit-learn1432025-10-311.02735.29491.49Micro Optimizations (23.1%) 3. Qiskit/qiskit1422025-10-031.734438.38505.02Use Lower Level System (28.2%) 4. 
xdslproject/xdsl1342025-10-091.363567.76463.46Remove Or Reduce Work (37.3%) 5. optuna/optuna942025-11-050.96546.29471.81Use Better Algorithm (24.5%) 6. pydata/xarray692025-11-210.981929.9474.04Micro Optimizations (30.4%) 7. scikit-image/scikit- image 392024-11-200.832271.46481.36Remove Or Reduce Work (28.2%) 8. networkx/networkx352025-09-161.01809.74480.46Use Better Algorithm (42.9%) 9. pytroll/satpy302024-11-201.42777.4483.7Use Better Data Structure And Layout (30.0%) 10. pymc-devs/pymc182025-06-161.812589.89479.89Use Better Algorithm (33.3%) 11. xarray-contrib/flox172025-07-171.472149.24485.18Use Better Algorithm (29.4%) 12. dwavesystems/dimod152024-06-131.332322.93476.4Use Better Algorithm (26.7%) 13. geopandas/geopandas132025-05-220.772231.62497.15Use Better Algorithm (46.2%) 14. UXARRAY/uxarray132025-09-111.734722.15489.38Remove Or Reduce Work (23.1%) 15. pydata/bottleneck132020-11-251.541293.23492.0Use Lower Level System (38.5%) 16. sgkit-dev/sgkit122025-09-301.252231.67469.0Do It Earlier Batch Throttle (25.0%) 17. sourmash-bio/sourmash112022-07-201.362561.45491.91Use Better Algorithm (27.3%) 18. JDASoftwareGroup/ kartothek 102020-10-010.51026.8466.5Micro Optimizations (40.0%) 19. datalad/datalad102021-03-190.25597.5492.8Remove Or Reduce Work (40.0%) 20. mars-project/mars102023-02-161.753936.5495.1Micro Optimizations (30.0%) 21. pysal/momepy92024-07-161.393021.56469.33Use Better Algorithm (77.8%) 22. Textualize/rich92025-07-250.56391.11471.67Micro Optimizations (55.6%) 23. tskit-dev/msprime72025-07-241.433013.43468.86Micro Optimizations (28.6%) 24. pygeos/pygeos72021-11-302.145001.57483.43Use Lower Level System (42.9%) 25. microsoft/Qcodes72025-08-270.71800.43467.71Do It Earlier Batch Throttle (28.6%) 26. napari/napari72025-07-291.792595.86485.71Cache And Reuse (28.6%) 27. shapely/shapely62025-05-030.832131.5480.17Use Better Algorithm (33.3%) Continued on next page 46 Repository#TasksLatest TaskAvg. Diffi- culty Avg. Patch Size (To- kens) Avg. PR Size (Tokens) Most Common Optimization 28. pyapp-kit/psygnal62025-09-240.831647.33482.83Remove Or Reduce Work (50.0%) 29. ActivitySim/activitysim62024-08-091.25833.83465.17Remove Or Reduce Work (33.3%) 30. pvlib/pvlib-python52025-10-031.57490.2482.6Use Better Algorithm (40.0%) 31. pybamm-team/ PyBaMM 52025-04-291.51637.6496.8Cache And Reuse (20.0%) 32. DASDAE/dascore52025-09-201.55505.6469.2Cache And Reuse (40.0%) 33. deepchecks/deepchecks52023-12-061.53384.6505.0Use Better Algorithm (60.0%) 34. modin-project/modin52025-09-302.05533.0481.0Micro Optimizations (60.0%) 35. mie-lab/trackintel42024-01-070.621404.75471.75Use Better Algorithm (50.0%) 36. lmfit/lmfit-py42022-09-050.0411.75497.0Do It Earlier Batch Throttle (25.0%) 37. dottxt-ai/outlines-core42025-03-310.625003.75480.75Remove Or Reduce Work (25.0%) 38. pybop-team/PyBOP42025-07-151.883863.0464.5Uncategorized (75.0%) 39. sunpy/sunpy42025-05-121.251852.25486.25Cache And Reuse (50.0%) 40. SciTools/cartopy42025-04-261.881000.0475.75Cache And Reuse (50.0%) 41. holgern/beem42018-11-300.621302.5462.0Use Better Algorithm (50.0%) 42. dipy/dipy32025-03-120.83803.67523.67Micro Optimizations (33.3%) 43. kedro-org/kedro32025-07-170.831764.67526.33Cache And Reuse (66.7%) 44. python-adaptive/ adaptive 32025-08-210.01400.0462.33Cache And Reuse (33.3%) 45. devitocodes/devito32025-07-222.52156.67484.33Cache And Reuse (66.7%) 46. TileDB-Inc/TileDB-Py32025-07-290.831823.33482.0Remove Or Reduce Work (33.3%) 47. numpy/numpy-financial22024-04-041.25423.0457.5Use Lower Level System (100.0%) 48. 
48. xarray-contrib/xbatcher 22023-01-032.52981.0502.5 Do It Earlier Batch Throttle (50.0%)
49. django-components/django-components 22025-09-300.06528.0463.0 Cache And Reuse (50.0%)
50. glotzerlab/signac 22025-04-041.253955.0532.5 Cache And Reuse (50.0%)
51. dedupeio/dedupe 22023-02-172.5709.0503.0 Micro Optimizations (50.0%)
52. NCAR/geocat-comp 22025-08-182.52615.0498.5 Remove Or Reduce Work (50.0%)
53. innobi/pantab 22024-01-220.0650.5446.5 Use Better Data Structure And Layout (50.0%)
54. h5py/h5py 22025-05-232.5550.5548.5 Remove Or Reduce Work (50.0%)
55. nilearn/nilearn 22025-10-090.04810.0486.5 Micro Optimizations (50.0%)
56. holoviz/param 22025-02-270.01287.0473.5 Do It Earlier Batch Throttle (50.0%)
57. AllenCellModeling/aicsimageio 12022-04-132.56813.0505.0 Use Higher Level System (100.0%)
58. HIPS/autograd 12017-10-210.0525.0463.0 Micro Optimizations (100.0%)
59. OGGM/oggm 12022-09-070.0511.0442.0 Micro Optimizations (100.0%)
60. arviz-devs/arviz 12024-05-100.0299.0458.0 Micro Optimizations (100.0%)
61. danielgtaylor/python-betterproto 12023-12-070.02995.0507.0 Use Lower Level System (100.0%)
62. makepath/xarray-spatial 12022-05-122.53774.0436.0 Use Lower Level System (100.0%)
63. Quansight-Labs/ndindex 12024-09-202.5375.0476.0 Use Lower Level System (100.0%)
64. not522/ac-library-python 12021-11-190.0388.0441.0 Micro Optimizations (100.0%)
65. royerlab/ultrack 12025-04-222.51816.0437.0 Do It Earlier Batch Throttle (100.0%)
66. stac-utils/pystac 12023-03-310.01593.0461.0 Micro Optimizations (100.0%)
67. tqdm/tqdm 12022-03-240.0372.0448.0 Micro Optimizations (100.0%)
68. wmayner/pyphi 12024-09-242.51057.0480.0 Remove Or Reduce Work (100.0%)
69. xitorch/xitorch 12024-05-240.04352.0479.0 Micro Optimizations (100.0%)
[Figure 22 diagram: a hierarchy of Pandas workloads with levels Module → Class → Function → Complete Workload, each node annotated with its stratified speedup, e.g., pd.algorithms.* 1.01; Quantile.* 1.28; Hashing.* 0.95; time_quantile.* 1.64; time_dates.* 0.95; time_timedeltas.* 0.95; time_quantile('float') 3.04; time_quantile('int') 0.89. A gray dotted box groups the leaves under pd.algorithms.Quantile.*.]
Figure 22: Illustration of Hierarchical Grouping of Pandas Workloads. By construction, each workload in FORMULACODE is organized hierarchically into three levels: ℓ = 1 (Module), ℓ = 2 (Class), and ℓ = 3 (Function). Metrics (like speedup_agent and Adv_agent) are computed for each complete workload (leaf node). We can semantically aggregate workloads by stratifying them along this hierarchy. For instance, in this example, the stratified speedup of pd.algorithms.Quantile.* can be calculated by computing the geometric mean of all leaf nodes that share the same prefix string (depicted in the gray dotted box: pd.algorithms.Quantile.time_quantile('float'), pd.algorithms.Quantile.time_quantile('int'), and other complete workloads not shown). The example also illustrates how highly localized optimizations are diluted by stratification, and underscores that, at higher levels of stratification, consistent speedups across a large number of workloads are required to achieve a significant stratified speedup.
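To make the aggregation concrete, here is a minimal sketch of the stratified-speedup computation described in the Figure 22 caption: the geometric mean of all leaf-level speedups whose fully qualified workload name shares a given prefix. The workload names, speedup values, and the helper name `stratified_speedup` are illustrative (loosely following Figure 22), not code from the FORMULACODE repository.

```python
import math

# Leaf-level (complete workload) speedups, keyed by their fully qualified
# workload names. Values are illustrative, loosely following Figure 22.
leaf_speedups = {
    "pd.algorithms.Quantile.time_quantile('float')": 3.04,
    "pd.algorithms.Quantile.time_quantile('int')": 0.89,
    "pd.algorithms.Hashing.time_dates": 0.95,
    "pd.algorithms.Hashing.time_timedeltas": 0.95,
}

def stratified_speedup(prefix: str, speedups: dict[str, float]) -> float:
    """Geometric mean of all leaf speedups whose name starts with `prefix`."""
    stem = prefix.rstrip("*")  # "pd.algorithms.Quantile.*" -> "pd.algorithms.Quantile."
    vals = [v for name, v in speedups.items() if name.startswith(stem)]
    if not vals:
        raise ValueError(f"no workloads match prefix {prefix!r}")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

# Class-level stratification: geometric mean of 3.04 and 0.89, about 1.64.
print(stratified_speedup("pd.algorithms.Quantile.*", leaf_speedups))
# Module-level stratification over these four leaves gives roughly 1.25; the
# figure's pd.algorithms.* value (1.01) averages over many more leaves.
print(stratified_speedup("pd.algorithms.*", leaf_speedups))
```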
[Figure 23 schematic: Expert Speedup (x-axis) vs. Agent Speedup (y-axis), with both axes crossing at 1.0, the identity line marking Equal Advantage, and the regions Super Optimization, Under Optimization, Performance Degradation, and Regression around it.]
Figure 23: Visual intuition for Agent Advantage (Adv_agent; §2). Each cross (✗) represents an individual workload using the expert-derived speedup (speedup_expert) and the agent-derived speedup (speedup_agent). The identity line represents equal advantage (i.e., speedup_expert = speedup_agent). The agent advantage is then the mean weighted deviation from the equal-advantage line. The plot also showcases four optimization regions, clockwise from the top: (1) Super Optimization: workloads where the agent's code performs better than both the expert's code and the baseline. (2) Under Optimization: workloads where the agent's code and the expert's code both deliver a positive speedup, but the expert outperforms the agent. (3) Performance Degradation: workloads where the expert discovers a speedup while the agent slows down the code. (4) Regression: workloads where both the expert and the agent slow down the code; usually an intentional tradeoff to optimize other workloads.
Figure 24 showcases an example of the workload distribution for various agents on FORMULACODE.
[Figure 24 heatmaps: counts of workloads binned by Expert Speedup (x-axis) vs. Agent Speedup (y-axis), one panel per agent: Claude 4.0 Sonnet (Advantage: -0.0410), Qwen 3 Coder (Advantage: -0.0454), Gemini 2.5 Pro (Advantage: -0.0433), and GPT-5 (Advantage: -0.0504).]
Figure 24: Visualization of advantage for Terminus 2 Agents. Refer to Figure 23 for an explanation of each region. Each square represents the number of workloads in that region (within 0.5 units). A speedup of 1.0 indicates no deviation from baseline performance. The red dotted line represents equal advantage. This visualization helps gauge the holistic behavior of models across the entire workload distribution. For instance, Claude 4.0 Sonnet (top left) achieves a better overall advantage than GPT-5 (bottom right) by making measured, surgical optimizations that align with the equal-advantage line, whereas the optimizations proposed by GPT-5 are more volatile, with more workloads experiencing performance degradations, which pulls the overall advantage down.
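The caption above defines Adv_agent only informally, as the mean weighted deviation from the equal-advantage line; the exact definition is given in §2 of the paper. As a rough illustration of the idea only, the sketch below measures each workload's deviation in log space (so matching the expert contributes zero) and averages the deviations with user-supplied weights. The function name `agent_advantage`, the log-space deviation, and the uniform default weights are our assumptions, not the paper's exact metric.

```python
import math
from typing import Sequence

def agent_advantage(
    agent_speedups: Sequence[float],
    expert_speedups: Sequence[float],
    weights: Sequence[float] | None = None,
) -> float:
    """Mean weighted deviation of agent speedups from the equal-advantage line.

    Deviations are taken in log space: a workload where the agent matches the
    expert contributes 0, beating the expert contributes a positive term, and
    falling behind the expert contributes a negative term.
    """
    if weights is None:
        weights = [1.0] * len(agent_speedups)
    total = sum(weights)
    return sum(
        w * (math.log(a) - math.log(e))
        for w, a, e in zip(weights, agent_speedups, expert_speedups)
    ) / total

# Three workloads: the agent matches the expert, beats it, and regresses it.
# Result is about -0.04: the regression outweighs the win on the second workload.
print(agent_advantage([1.0, 2.0, 0.8], [1.0, 1.5, 1.2]))
```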
OBJECTIVE
You are a performance optimization expert. Speed up the repository while maintaining correctness.

TOOLING
The micromamba environment includes Pytest for correctness testing and Airspeed Velocity (ASV) for benchmarking measurements and profiling.

PROCESS
1. Scan & Baseline
Read the code and any hints. Map likely bottlenecks. Establish a baseline by running the relevant ASV benchmarks.
2. Benchmark (ASV)
Read through relevant benchmarks. Prefer targeted runs using '--bench=<regex>'; full-suite runs are discouraged.
Command:
''' asv run --python=same --bench="<regex>" '''
Find benchmarks via asv_benchmarks.txt or within the ASV benchmarks directory. You may run multiple benchmarks at once using regexes.
3. Profile Hotspots
Profile relevant benchmarks to locate hot paths. Use ASV's built-in profiling support.
Command:
''' asv profile --python=same --config=<path-to-asv.*.json> <benchmark_name> '''
4. Optimize
Make targeted changes that address the hot paths while maintaining correctness. Follow the Operating Principles below.

OPERATING PRINCIPLES
• One change/command at a time (code edit, ASV run, profiling).
• Baseline first, then iterate.
• Target the hot paths shown by profiling.
• Evidence-driven: justify changes with benchmark/profile data.
• Correctness first: never trade correctness for speed.

REPOSITORY DESCRIPTION
This repository is called Qiskit/qiskit. Qiskit/qiskit is written primarily in Python and is described as a "Qiskit is an open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives.".

TASK DESCRIPTION
Your main goal is to optimize the code to run as fast as possible. Use the following information if needed to understand the problem:

INITIAL OBSERVATIONS
Binding parameters with `ParameterExpression.bind` is slow, allocating many Python objects and taking tens of milliseconds per call when binding large dictionaries (e.g., 100k parameters).

RELEVANT ISSUES

Issue #14471: Addressing performance bottlenecks in ParameterExpression.bind
Environment: Qiskit version: 2.0.0
Summary: Let us consider a parameter expression 'expr' and a dictionary 'parameter_values: dict[Parameter, float]' with 'M' key-value pairs. Consider the following code to bind the expression:
''' expression.bind(parameter_values) '''
As it turns out, this line takes time that grows with len(M). As far as I can tell, this is because qiskit applies some checks to all of the parameters in parameter_values. Even if it turns out that expression only needs one of them, all the parameters are checked and then only one of them is used.
Why this needs fixing: Sometimes, it is useful to maintain a log of parameters outside of a circuit (e.g., in a parameter table) and bind these parameters when needed against a 'parameter_values' dict. In this case, the 'QuantumCircuit.assign_parameters' method (which does some tricks to speed things up) is not available, and users take a hit in performance when they bind.
Some suggestions on how to fix this: Provide an option for users so that they can choose to check only the 'relevant' parameter values (i.e., those present in expression), so that the runtime of bind becomes independent of len(M). Review the checks and remove those that are not needed.
How can we reproduce the issue?
''' from qiskit.circuit import Parameter
N: int = ...
parameter_values = {Parameter(f"th_{i}"): 1 for i in range(N)}
parameter_values[param := Parameter("my_param")] = 1
%timeit param.bind(parameter_values, allow_unknown_parameters=True) '''
On my laptop, with N=1 bind takes ~2.5 μs, but with N=10**5 it takes 17.8 ms.
Comments
I'd generally be supportive of removing huge tracts of the error-checking code from all the ParameterExpression methods.
Fwiw, there are a couple of tricks we ought to figure out: the ParameterExpression.bind method either has to be linear in the number of unbound parameters in the expression, or in the number of elements in the binding dictionary. ...
... <TRUNCATED> be cheaper even than adding fast-paths through `ParameterExpression.bind`: we don't need to maintain the QPY replay log and we don't need to allocate a new `ParameterExpression` (which is quite heavy)
Figure 25: Example task in FORMULACODE for Qiskit/qiskit (PR: https://github.com/Qiskit/qiskit/pull/14782). The prompt presents a complete optimization task, including the performance goal, the benchmarking and profiling tools (Pytest and ASV), a structured optimization workflow, and concrete repository context with motivating performance observations. The "Relevant Issues" section contains GitHub issues that are directly related to the performance problem addressed by the PR (describing the underlying bottlenecks the PR aims to fix). These issues provide important background context that mimics a real, human-authored PR setting. Issue discussions are truncated only in this figure for brevity; the full issue content is provided to the agent during execution.
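The issue above suggests checking only the parameters that actually appear in the expression, so that bind no longer scales with the size of the binding dictionary. The PR referenced in the caption implements the actual fix inside Qiskit; the sketch below is only a user-side illustration of that suggestion, filtering the binding dictionary down to the expression's own parameters before calling bind. The helper name `bind_relevant` is ours, not part of Qiskit or the PR.

```python
from qiskit.circuit import Parameter, ParameterExpression

def bind_relevant(expr: ParameterExpression, parameter_values: dict) -> ParameterExpression:
    """Bind only the parameters that actually appear in `expr`.

    This keeps the cost proportional to the number of parameters in the
    expression rather than the size of the (possibly huge) binding dict.
    """
    relevant = {p: parameter_values[p] for p in expr.parameters if p in parameter_values}
    return expr.bind(relevant)

# A large external parameter table, of which the expression uses only one entry.
table = {Parameter(f"th_{i}"): 1.0 for i in range(100_000)}
theta = Parameter("my_param")
table[theta] = 0.5

expr = 2 * theta                     # ParameterExpression over a single parameter
print(bind_relevant(expr, table))    # binds theta without scanning all 100k entries inside bind
```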
OBJECTIVE
You are a performance optimization expert. Speed up the repository while maintaining correctness.

TOOLING
The micromamba environment includes Pytest for correctness testing and Airspeed Velocity (ASV) for benchmarking measurements and profiling.

PROCESS
1. Scan & Baseline
Read the code and any hints. Map likely bottlenecks. Establish a baseline by running the relevant ASV benchmarks.
2. Benchmark (ASV)
Read through relevant benchmarks. Prefer targeted runs using '--bench=<regex>'; full-suite runs are too time-consuming and are discouraged.
Command:
''' # Always pin to current interpreter
asv run --python=same --bench="<regex>" '''
Find benchmarks via asv_benchmarks.txt or in the directory containing the ASV benchmarks. You may run multiple benchmarks at once using regexes.
3. Profile Hotspots
Profile relevant benchmarks to locate hot paths. Use ASV's built-in profiling support.
Command:
''' asv profile --python=same --config=<path-to-asv.*.json> <benchmark_name> '''
4. Optimize
Make targeted changes that address the hot paths while maintaining correctness. Always follow the Operating Principles below.

OPERATING PRINCIPLES
• One change/command at a time (code edit, ASV run, profiling).
• Baseline first, then iterate.
• Target the hot paths shown by profiling.
• Evidence-driven: justify changes with benchmark/profile data.
• Correctness first: never trade correctness for speed.

REPOSITORY DESCRIPTION
This repository is called shapely/shapely. shapely/shapely is written primarily in Python and is described as a "Manipulation and analysis of geometric objects".

TASK DESCRIPTION
Your main goal is to optimize the code to run as fast as possible. Use the following information if needed to understand the problem:

INITIAL OBSERVATIONS
The deprecate_positional decorator incurred a noticeable runtime penalty because it invoked the full inspect.signature machinery on every call, leading to slow polygon construction (e.g., ~107 ms per 1000 iterations in the main branch). Users also experienced repeated deprecation-warning processing overhead.

RELEVANT ISSUES

Issue #2280: 2.1 Polygon creation is much slower than 2.0.7
Summary: It seems that creating Polygons in 2.1 is much slower (roughly 5–10x) than in 2.0.7. The following script takes roughly 0.1 seconds with Shapely 2.1 and 0.015 with Shapely 2.0.7 on Python 3.12.
''' import time
import shapely

if __name__ == "__main__":
    start_time = time.time()
    for _ in range(1000):
        coords = ((0., 0.), (0., 1.), (1., 1.), (1., 0.), (0., 0.))
        polygon = shapely.Polygon(coords)
    print(time.time() - start_time) '''
Comments: Thanks for the report. This slowdown seems to be due to the overhead of the decorator we added to deprecate positional arguments. That decorator does inspect the signature, which in ...
... <TRUNCATED> I noticed an even greater performance degradation when running under a debugger.

Issue #2282: deprecate_positional is a performance bottleneck (300%–1000% slowdown) in Shapely 2.1
Summary: Performance analysis indicates that only 17 seconds of the 66 seconds total are spent in the implementation of transform. The remaining time is taken by the deprecate_positional decorator.
I have the following code:
''' @overload
def compressible_geometry(geometry: _GeomT, /) -> _GeomT: ...
@overload
def compressible_geometry(geometry: NDArray[np.float64], /) -> NDArray[np.float64]: ...
... <TRUNCATED>
Comments: -

Figure 26: Example task in FORMULACODE for shapely/shapely (PR: https://github.com/shapely/shapely/pull/2283).
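Both issues point at a common decorator pitfall: inspect.signature is expensive, so calling it inside the wrapper charges that cost to every invocation. The sketch below illustrates the general remedy of inspecting the signature once at decoration time. It is a simplified, hypothetical decorator (the names `deprecate_positional_args` and `buffer` are ours), not Shapely's actual deprecate_positional implementation or the fix merged in the referenced PR.

```python
import functools
import inspect
import warnings

def deprecate_positional_args(func):
    """Warn when arguments past the first are passed positionally.

    The signature is inspected once, at decoration time, so the per-call
    overhead is a cheap length check instead of a full inspect.signature call.
    """
    params = list(inspect.signature(func).parameters)  # computed once

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:
            names = ", ".join(params[1:len(args)])
            warnings.warn(
                f"passing {names} positionally is deprecated; use keyword arguments",
                DeprecationWarning,
                stacklevel=2,
            )
        return func(*args, **kwargs)

    return wrapper

@deprecate_positional_args
def buffer(geometry, distance=1.0, resolution=16):
    return (geometry, distance, resolution)

buffer("geom", 2.0)  # warns about `distance` without re-inspecting the signature
```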
OBJECTIVE
You are a performance optimization expert. Speed up the repository while maintaining correctness.

TOOLING
The micromamba environment includes Pytest for correctness testing and Airspeed Velocity (ASV) for benchmarking measurements and profiling.

PROCESS
1. Scan & Baseline
Read the code and any hints. Map likely bottlenecks. Establish a baseline by running the relevant ASV benchmarks.
2. Benchmark (ASV)
Read through relevant benchmarks. Prefer targeted runs using '--bench=<regex>'; full-suite runs are too time-consuming and are discouraged.
Command:
''' # Always pin to current interpreter
asv run --python=same --bench="<regex>" '''
Find benchmarks via asv_benchmarks.txt or in the directory containing the ASV benchmarks. You may run multiple benchmarks at once using regexes.
3. Profile Hotspots
Profile relevant benchmarks to locate hot paths. Use ASV's built-in profiling support.
Command:
''' asv profile --python=same --config=<path-to-asv.*.json> <benchmark_name> '''
4. Optimize
Make targeted changes that address the hot paths while maintaining correctness. Always follow the Operating Principles below.

OPERATING PRINCIPLES
• One change/command at a time (code edit, ASV run, profiling).
• Baseline first, then iterate.
• Target the hot paths shown by profiling.
• Evidence-driven: justify changes with benchmark/profile data.
• Correctness first: never trade correctness for speed.

REPOSITORY DESCRIPTION
This repository is called pandas-dev/pandas. pandas-dev/pandas is written primarily in Python and is described as a "Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more".

TASK DESCRIPTION
Your main goal is to optimize the code to run as fast as possible. Use the following information if needed to understand the problem:

INITIAL OBSERVATIONS
The DataFrame.to_csv() call with index=False on a Multi-Index DataFrame was extremely slow (≈ 869 seconds for 10M rows × 20 cols), while resetting the index first and then calling to_csv() took only ≈ 42 seconds. The performance gap was observed consistently in the benchmark.

RELEVANT ISSUES

Issue #59312: PERF: Significant Performance Difference in DataFrame.to_csv() with and without Index Reset
Description:
Pandas version checks: I have checked that this issue has not already been reported. I have confirmed this issue exists on the latest version of pandas. I have not confirmed this issue exists on the main branch of pandas.
Reproducible Example
Below is a toy DataFrame example with 10M rows and 20 columns. The CSV write speed differs significantly depending on whether the multi-index is dropped first or not, even if the resulting CSV files are essentially the same. The benchmark for PyArrow is also attached for reference. Notice that the CSV generated from PyArrow has column names and column values additionally double-quoted.
''' import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
import time
NUM_ROWS = 10000000
NUM_COLS = 20
df = pd.DataFrame({f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS) for col_idx in range(NUM_COLS)}) ... <TRUNCATED>
Comments
Thanks for the report! It seems to me the issue is here:
''' https://github.com/pandas-dev/pandas/blob/642d2446060afb11f9860c79a7339eb6ec96fea7/pandas/io/formats/csvs.py#L323 '''
A significant amount of time on that line is spent getting the index values, only to be ignored because self.nlevels is 0 when index=False. In addition, it seems to me that there may ... <TRUNCATED>
Figure 27: Example task in FORMULACODE for pandas-dev/pandas (PR: https://github.com/pandas-dev/pandas/pull/59608).
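For context, the snippet below is a scaled-down version of the measurement this task is built on: writing a MultiIndex DataFrame with to_csv(index=False) directly versus dropping the index first. The row count is reduced so it runs in seconds, the timings are illustrative rather than the issue's 869 s vs. 42 s, and on pandas versions that already include the fix from the referenced PR the gap may be small or absent.

```python
import time
import numpy as np
import pandas as pd

# Scaled-down sketch of the measurement behind this task (the issue uses 10M rows).
NUM_ROWS, NUM_COLS = 200_000, 20
data = {f"col_{i}": np.arange(i * NUM_ROWS, (i + 1) * NUM_ROWS) for i in range(NUM_COLS)}
index = pd.MultiIndex.from_arrays(
    [np.arange(NUM_ROWS), np.arange(NUM_ROWS) % 100], names=["a", "b"]
)
df = pd.DataFrame(data, index=index)

t0 = time.time()
df.to_csv("direct.csv", index=False)  # per the issue, index values may still be materialized internally
print("to_csv(index=False) on MultiIndex:", round(time.time() - t0, 2), "s")

t0 = time.time()
df.reset_index(drop=True).to_csv("reset_first.csv", index=False)  # same CSV contents
print("reset_index(drop=True) then to_csv:", round(time.time() - t0, 2), "s")
```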