← Back to papers

Paper deep dive

DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice

Alejandro Pradas-Gomez, Arindam Brahma, Ola Isaksson

Year: 2026 · Venue: arXiv preprint · Area: cs.SE · Type: Preprint · Embeddings: 107

Abstract

Engineering analysis automation in product development relies on rigid interfaces between tools, data formats and documented processes. When these interfaces change, as they routinely do as the product evolves in the engineering ecosystem, the automation support breaks. This paper presents DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) agentic orchestration, an approach for developing, executing and evaluating LLM-based agentic automation support of engineering analysis tasks. The approach separates adaptive orchestration, performed by the LLM agent, from deterministic execution, performed by verified engineering tools. The agent interprets documented design practices, inspects input data and adapts the processing path, while the engineer supervises and exercises final judgment. DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer, where the agent handled input deviations in format, units, naming conventions and methodology that would break traditional scripted pipelines. Evaluation against expert-defined acceptance criteria and deployment with practicing engineers confirm that the approach produces correct, methodologically compliant results across repeated independent runs. The paper discusses practical consequences of adopting agentic automation, including unintended effects on the nature of engineering work and the tension between removing mundane tasks and creating an exhausting supervisory role.

Tags

ai-safety (imported, 100%) · csse (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/13/2026, 1:08:46 AM

Summary

The paper introduces DUCTILE, an agentic LLM orchestration framework designed to automate engineering analysis tasks in product development. By separating adaptive orchestration (handled by LLM agents) from deterministic execution (handled by verified engineering tools), DUCTILE addresses the brittleness of traditional scripted pipelines in aerospace structural analysis, allowing for robust handling of input variability while maintaining engineering rigor and traceability.

Entities (4)

DUCTILE · framework · 100%
GKN Aerospace · organization · 95%
LLM Agent · technology · 95%
Engineering Analysis · domain · 90%

Relation Signals (3)

DUCTILE orchestrates Engineering Analysis

confidence 95% · DUCTILE... an approach for developing, executing and evaluating LLM-based agentic automation support of engineering analysis tasks.

GKN Aerospace deployed DUCTILE

confidence 90% · DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer

LLM Agent interfaces with Engineering Tools

confidence 90% · The agent does not replace the domain-specific software, but instead, interprets the documented design practice... through the same verified and trusted tools

Cypher Suggestions (2)

Find all engineering tasks orchestrated by specific frameworks · confidence 90% · unvalidated

MATCH (f:Framework)-[:ORCHESTRATES]->(t:Task) RETURN f.name, t.name

Identify organizations using specific automation technologies · confidence 90% · unvalidated

MATCH (o:Organization)-[:DEPLOYED]->(t:Technology) RETURN o.name, t.name

Full Text

106,966 characters extracted from source content.


Alejandro Pradas-Gómez (1), Product Development Division, Department of Mechanical Engineering, Chalmers University of Technology, Gothenburg, Sweden. e-mail: alejandro.pradas@chalmers.se
Arindam Brahma, Product Development Division, Department of Mechanical Engineering, Chalmers University of Technology, Gothenburg, Sweden. e-mail: arindam.brahma@chalmers.se
Ola Isaksson, Product Development Division, Department of Mechanical Engineering, Chalmers University of Technology, Gothenburg, Sweden. e-mail: ola.isaksson@chalmers.se

DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice

Engineering analysis automation in product development relies on rigid interfaces between tools, data formats and documented processes. When these interfaces change, as they routinely do as the product evolves in the engineering ecosystem, the automation support breaks. This paper presents DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) agentic orchestration, an approach for developing, executing and evaluating LLM-based agentic automation support of engineering analysis tasks. The approach separates adaptive orchestration, performed by the LLM agent, from deterministic execution, performed by verified engineering tools. The agent interprets documented design practices, inspects input data and adapts the processing path, while the engineer supervises and exercises final judgment. DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer, where the agent handled input deviations in format, units, naming conventions and methodology that would break traditional scripted pipelines. Evaluation against expert-defined acceptance criteria and deployment with practicing engineers confirm that the approach produces correct, methodologically compliant results across repeated independent runs.
The paper discusses practical consequences of adopting agentic automation, including unintended effects on the nature of engineering work and the tension between removing mundane tasks and creating an exhausting supervisory role.

Keywords: large language models, agent-based design, design automation, design methodology, design process, product development, expert systems, design evaluation

1 Introduction

Product development relies on an engineering ecosystem with many interacting elements: engineers, data, methods, processes and specialist tools [1–3]. In engineering analyses, organizations often try to automate how these interactions occur, for example, through scripts and tool-specific workflows. In practice, designers often repeatedly move between context-specific tools, scripts and methods, adapting the workflow as the requirements, models and interfaces change [3,4]. However, such changes, even if minor, can often lead to significant disruptions creating knock-on effects in downstream design activities or changes within the product [5,6], for example, because of a tool update, revised requirements or an organizational change. The result is a brittle analysis process [7], where significant engineering time is spent on data wrangling [8,9] and tool orchestration [10,11], rather than on engineering decisions that directly affect product performance.

In practice, organizations have addressed the brittleness of engineering automation through two main strategies. One is to extend the deterministic automations to add more rules and cover more scenarios, which increases complexity and creates new failure modes as the variation grows [12–14]. If the automation itself cannot cover the changes, the second approach is to rely on expert engineers to adapt on a per-case basis. This second approach is slow and concentrates critical knowledge in individuals, creating an organizational vulnerability when those individuals are not available [14–16].
Neither approach scales well in environments where the workflow evolves alongside the product over a decade-long development process. Approaches integrating machine learning with expert validation have improved knowledge sourcing efficiency in well-structured domains [15], but remain dependent on predefined data structures, expert-curated feature sets, and rigid methodological pipelines, leaving the underlying brittleness of the workflow unaddressed.

[1: Corresponding Author. March 12, 2026]

In this paper, a third approach is proposed: Large Language Model (LLM)-based agents as orchestration layers that connect engineers to the verified and trusted domain-specific engineering tools. The agent does not replace the domain-specific software, but instead interprets the documented design practice, inspects the available data and adapts to the particular context. In doing so, it generates and executes processing code at the engineer's request through the same verified and trusted tools that the engineer would use manually. By adding a now standardized and universal orchestration layer on top of the automation, this approach minimizes the development (and maintenance) burden of tailored workflows and frameworks. With this approach, engineers remain responsible for reviewing the plan, supervising execution and validating outputs as per the latest AI policies and standards of aerospace and automotive certification bodies [17–19]. In this framing, adaptability is provided by the agent, while engineering rigor and traceability are preserved by the deterministic, domain-certified tools.

This paper makes two contributions. First, it presents DUCTILE, an approach for developing, executing and evaluating LLM-based agentic automation of engineering analysis tasks, grounded in the separation of adaptive orchestration from deterministic, verified tool execution [20].
Second, it provides a test case and evaluates the approach on an industrial load processing task at an aerospace manufacturer, where the agent handled four input deviations that would break traditional automation, across 10 independent runs and two engineers with different supervision styles. The novelty lies in demonstrating that LLM agents can orchestrate real aerospace analysis workflows transparently and compatibly with existing verified tools, while absorbing the input variability that breaks traditional automation. The case, tools and evaluation are provided free of licenses for other researchers to replicate and improve.

(Journal of Mechanical Design, PREPRINT / 1. arXiv:2603.10249v1 [cs.SE], 10 Mar 2026)

2 Background

2.1 The Tool Adoption Challenge at Aerospace Engineering Companies. Of specific interest in this paper are the challenges of automation in mechanical engineering analysis, particularly in the aerospace domain [15,21]. GKN Aerospace (hereafter, the case company) is a large Swedish manufacturer of jet engine components and is a risk- and revenue-sharing partner for various jet engine Original Equipment Manufacturers (OEMs). The case company routinely performs structural analysis activities to substantiate airworthiness and certification activities, an essential part of the safety-focused industry [22]. These activities follow documented design practices and rely on a combination of commercial solvers, legacy in-house scripts, and an internally developed, evolving ecosystem of modular Python tools [23]. The variety of tools in the ecosystem creates a practical challenge that directly relates to the user's experience level. A new engineer, for example, may understand the engineering principles, but may face a steep learning curve in connecting that knowledge to the specific tools and sequences unique to the company's internally established processes.
On the other hand, an experienced engineer may know the methodology intimately but may lack the time to explore the documentation and capabilities of a frequently updated tool. Both these cases highlight a significant and ever-increasing gap between the growing capability of computational tools and the engineering judgment required to use them effectively [24]. The engineer's value lies in understanding what the analysis must achieve and whether the results are correctly interpreted. However, in practice, when a design change needs to be evaluated, a significant amount of time is spent on non-engineering tasks such as debugging data format mismatches between modules.

2.2 Design Automation in Engineering Analysis. The challenge described at this case company is not unique to the organization. Engineering design problems are creative, ill-defined and context-dependent, often described as wicked problems [25]. In such problems, the designer often lacks clarity early in the process, and the problem(s) become fully clear only as the solution emerges [26]. Simon [27] framed design as a process of satisficing under bounded rationality. This definition applies to both design synthesis and analysis. Gericke et al. [1] argue that design methods are not isolated procedures but are embedded in ecosystems of representations, tools, and purposes. These properties have implications for any attempt to automate design work: the automation must accommodate variability, incomplete information, and evolving understanding, which are precisely the conditions that rigid, deterministic systems handle poorly. Specific to the aerospace industry, this is a challenge as it creates significant bottlenecks to rapid design iterations, both in the design [28] and manufacturing [29] domains.
In the literature, a number of approaches have been put forward over the years, mainly through various forms of computational support for engineering design that have evolved through several generations. We explore three of the most common in aerospace [30].

First, Knowledge-Based Engineering (KBE) systems encode expert rules and geometric primitives to automate routine design tasks, but they require ontologies [31] and design primitives to be defined in advance of the design and analysis activity [12]. In industrial practice, KBE developers and the engineers who perform the analysis are typically not the same people [32]. The specialized knowledge and effort required to build and maintain KBE applications limit their use to high-performing teams in early design concepts, where the design space is sufficiently constrained [33]. Further, their use in early phases allows Multi-Disciplinary Optimization (MDO) or design space exploration (DSE) within pre-defined validity ranges that can be built into models [34]. However, for the detailed design and certification phases, engineers do not work in pre-determined environments, limits, or design paradigm constraints. The variability of data received from previous processes is highest, because other teams generate the data manually, and analysis processes and methods quickly evolve. In these scenarios, KBE approaches are rarely viable [35].

Second are Process Integration and Design Optimization (PIDO) platforms or workflow managers [36]. Independent platforms such as modeFRONTIER, ModelCenter, optiSLang, iSight, or CAE-embedded equivalents such as ANSYS Workbench, or equivalents from Siemens, Dassault Systèmes, etc., connect heterogeneous solvers through graphical workflow editors, enabling parametric and MDO studies [37]. Third, scripted pipelines serve a similar purpose of connecting sequences of modular tools.
Each of these approaches shares a common limitation: the interfaces between components are deterministic and must be specified in advance. A KBE ontology cannot accommodate a design parameter that is not foreseen when the system is built. A scripted/visual pipeline produces errors when the OEM delivers loads in a different format. In practice, even rule-based systems with fully explicit logic are described as black boxes [35] by the engineers who use them: the complexity of the rule chains, either graphical or codified, exceeds what a practitioner can trace in complex products and design scenarios, regardless of the system's formal transparency [13,35].

2.3 AI in Engineering Design. The 2022 JMD Special Issue on Artificial Intelligence and Engineering Design [38] mapped the intersection of AI and design research along five AI method themes: natural language processing, graph neural networks, generative models, transfer learning and representation learning, and across four design research themes spanning conceptual synthesis, process acceleration, direct estimation and broader process support. Of the five themes, generative models are the most relevant to the present work, yet they address a fundamentally different problem. Deep Generative Models (DGMs) [39], such as GANs, VAEs and diffusion models, produce new design candidates or predict performance fields. For example, [40] addresses the DGM as a surrogate model. Computation occurs entirely within the neural network and the output is a design proposal or a surrogate prediction: a narrow problem. In contrast, the central challenge highlighted here focuses on the wide problem that includes the detail phases of product development of safety-critical products. In such cases, specified documented processes must be followed, specific tools must be called in a defined sequence, and inputs and outputs must be traceable through the quality assurance chain [41,42].
This orchestration problem falls outside the scope of DGMs or other machine learning techniques (e.g. pure numerical regression or classification). These limitations point toward a class of approaches not captured in earlier AI research applied to engineering design: systems that can mediate between heterogeneous tools, follow documented processes, and adaptively orchestrate analysis workflows. Large language model (LLM)-based agents represent a promising step in this direction. In the previously mentioned JMD special issue [38], only one article used an LLM architecture [43], for processing textual product descriptions. Basic LLM design synthesis tasks were explored as early as 2024 [44,45]. However, a literature review that year of the engineering design journals revealed that most research was around the use of LLMs as tools in the process, not as orchestrators [20]. Since then the capabilities of the models have evolved. Software engineering took a leading role in developing applications that first autocompleted sentences, then used tools to read and edit files, and finally developed agentic applications to delegate complete tasks inside the Integrated Development Environment (IDE) or in the Command Line Interface (CLI).

(2 / PREPRINT, Transactions of the ASME)

Recently, Massoudi and Fuge [46] demonstrated that agentic LLMs can support conceptual systems engineering tasks, including requirements analysis, functional decomposition and architecture trade-offs. Their work shows that LLM agents can reason about design intent when provided with appropriate context. However, their scope is the early design phase, where outputs are conceptual and the integration of in-house tools and methods is not emphasized. Mustapha [47] also notes the change in scope of LLMs from tool to orchestrator, the paradigm shift from determinism to stochasticity, and adaptability or malleability, a concept similar to ductility.
2.4 Summary: systemic brittleness of existing approaches. A recurring theme across the engineering design literature is that methods and tools do not work in isolation, but are embedded within evolving socio-technical systems. As Gericke and Eckert [1] note, methods, tools and practices co-evolve; changes in one element, such as updated load formats, revised design rules, or modified APIs, propagate into others. Empirical studies on engineering change propagation show that even minor interface shifts can trigger disproportionate rework [5,48]. Deterministic automation approaches, including KBE systems, scripted pipelines, and rule-based workflows, are particularly brittle under such conditions because they rely on predefined assumptions about data structures and process sequences. When these assumptions no longer hold, failures emerge that are difficult for engineers to diagnose, leading many to perceive formally transparent systems as practical "black boxes" [13]. The limitations of deterministic systems also echo earlier observations that design tools should support adaptive, intelligent behavior rather than encode rigid procedures [49], yet this has still not been realized in current industrial practice.

The background literature, therefore, indicates a common need for an intermediary that is capable of accommodating variability in tools, data, and practice. The literature further shows that the brittleness of previous approaches arises not only from isolated technical faults but also from the inability of deterministic systems to adapt to the normal variability and evolution of engineering practice. This is a central limitation that the LLM-based orchestration layer proposed in this paper seeks to address.
The central insight from literature is therefore summarized as a core principle which is used to guide the development of the method:

Central Orchestration Principle: Effective engineering orchestration requires an intermediary capable of accommodating variability in data, tools and practices, rather than relying on rigid, predetermined interfaces.

From the survey of literature, a set of requirements for engineering automation can be extracted; Table 1 summarizes twelve requirements, R1-R12, with brief descriptions and supporting references.

3 LLM Agents: Technical Foundations

Building on the central orchestration principle articulated in Sec. 2.4, in this section we now examine how large language models (LLMs) in agents provide such an adaptive intermediary. So far, we have avoided defining the term Agent. Historically in AI, agents have been defined in many ways, including tools whose interactions are via indirect management [60–62]. In the context of LLMs, as the technology matures, the term continues to evolve, reflecting different characteristics and challenges. Earlier work defined LLM agents as a "[...] task oriented combination of LLM models and the infrastructure to access tools, retrieve context information, and store memory of its interactions" [63]. While every LLM and framework provider has a variation of this definition, this paper uses the term that is gaining traction lately, which involves a description of what it does rather than what it is: An LLM agent runs tools in a loop to achieve a goal [64]. In this paper, we distinguish between the agentic framework, the software toolkit that provides the inference loop and tool-calling infrastructure (e.g. Claude Code, Pydantic-AI), and the agentic application, the configured deployment that combines a specific model, system prompt and tool connections for a given engineering context. Sec. 4 describes how an agentic application is developed, executed, and evaluated.

Fig. 1 Schematic of the LLM inference loop. The model takes the full context window as input, produces a probability distribution over the next token, samples one token according to a sampling strategy, appends it to the context window, and repeats until a stop token (<EOS>) is generated. The resulting sequence of tokens forms the response.

We focus solely on external, off-the-shelf large language models, recognizing that most engineering organizations, especially medium and small firms, lack the resources to develop or fine-tune such models in house. The context window is central to this capability: since externally developed models have no built-in knowledge of an organization's internal tools and processes, the information available in the context window determines what the model can draw upon when generating each token.

LLMs offer a flexible inference mechanism that can interpret context, reconcile heterogeneous representations, and mediate between evolving tools and engineering intent [47]. Recent advances have further strengthened their suitability for orchestration through two capabilities that have matured rapidly: structured reasoning modes (often referred to as thinking or deliberate reasoning) and tool calling, which enables controlled interactions with deterministic external software. This subsection provides a brief introduction to these mechanisms for a mechanical engineering audience.

3.1 Response Generation. A transformer-based LLM [65] sequentially generates a token at a time through an autoregressive process, see Fig. 1.
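The autoregressive loop of Sec. 3.1 can be sketched in a few lines of Python. This is an illustrative toy, not code from the paper: the transformer forward pass is replaced by a hand-written next-token probability table (`TOY_MODEL`), and the vocabulary is invented.

```python
import math
import random

# Toy next-token "model": maps the last token to a probability
# distribution over possible next tokens. A stand-in for the
# transformer's forward pass over the full context window.
TOY_MODEL = {
    "<BOS>":  {"The": 0.7, "A": 0.3},
    "The":    {"stress": 0.6, "load": 0.4},
    "A":      {"load": 1.0},
    "stress": {"is": 1.0},
    "load":   {"is": 1.0},
    "is":     {"low": 0.5, "high": 0.3, "<EOS>": 0.2},
    "low":    {"<EOS>": 1.0},
    "high":   {"<EOS>": 1.0},
}

def sample(dist, temperature, rng):
    """Sampling strategy: temperature-rescale the distribution, draw one token."""
    weights = {t: math.exp(math.log(p) / temperature) for t, p in dist.items()}
    r = rng.random() * sum(weights.values())
    acc = 0.0
    for tok, w in weights.items():
        acc += w
        if r <= acc:
            return tok
    return tok  # numerical safety net

def generate(max_tokens=20, temperature=1.0, seed=0):
    rng = random.Random(seed)      # pinned seed: reproducible despite stochastic sampling
    context = ["<BOS>"]            # the growing context window
    for _ in range(max_tokens):
        dist = TOY_MODEL[context[-1]]        # next-token probabilities
        tok = sample(dist, temperature, rng)
        if tok == "<EOS>":                   # stop token ends the loop
            break
        context.append(tok)                  # append and repeat
    return " ".join(context[1:])

print(generate(temperature=0.01))  # → The stress is low
```

At low temperature the sampling is near-greedy and the output is effectively deterministic; at temperature 1.0 the same input can yield different responses depending on the seed, which is the non-determinism noted in Sec. 3.1.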
At each step, the model takes the entire context window (the system prompt, the user message and all tokens generated so far) and produces a probability distribution over the vocabulary of possible next tokens. A sampling strategy selects one token from this distribution, and the selected token is appended to the context window for the next iteration. This loop continues until a stop token is sampled, at which point the accumulated sequence of tokens is returned as the response. Other recent language model architectures, namely diffusion [66], substitute the auto-regressive sequential token generation with a series of refinement steps. Both the architecture and the process are fundamentally non-deterministic: the same input can produce different responses depending on the sampling strategy and its stochastic parameters.

3.2 Thinking Mode. The transformer's autoregressive token generation introduces a limitation: once the model commits to an initial direction, subsequent tokens are conditioned on that choice, making it difficult to reverse course [67]. This can lead to fixations or hallucinations [68]. Thinking mode is a mitigation strategy developed during model post-training using reinforcement learning [69,70], first experienced by the general public with OpenAI's o1 model in September 2024 [71] and subsequent competitor models. Practically, the so-called thinking mode provides the model with a scratchpad in which self-critique is rewarded. Specific token sequences allow the model to reconsider its initial direction, explore alternative interpretations and select the most appropriate response before committing to a final answer [72], see Fig. 2. For engineering orchestration, thinking mode is important because it allows the model to reason at inference time about ambiguous or unexpected situations before acting on them, increasing the probability of a successful final answer [70].

Fig. 2 Comparison of LLM responses with and without thinking mode enabled, on the question "If a T-shirt takes 5 hours to dry, how long does it take to dry 3 T-shirts?". Without thinking, the model answers 3 × 5 = 15 hours. The thinking scratchpad allows the model to self-critique, reconsider the autoregressive initial response (the shirts presumably dry simultaneously), and answer 5 hours.

Table 1 Requirements for DUCTILE and supporting literature

R1 Inspectability: Agentic application (prompts, plans, tool calls, intermediate artifacts) must be directly inspectable by engineers/auditors; no hidden state. [17,18]
R2 Reproducibility: Model, prompt, tool versions and inputs must be pinned and logged; runs are reproducible or cause-attributable despite stochastic LLM steps. [17,18]
R3 Deterministic execution boundary: Safety/engineering computations run in verified, deterministic tools; the LLM performs orchestration and code generation only. [18,47]
R4 Traceability & auditability: Every artifact is trace-linked to the generating step (arguments, stdout/stderr, checksums, timestamps) to support do/check/approve reviews. [17,18]
R5 Data governance: Deployment mode (on-premises/API) and data flows comply with confidentiality/export/privacy rules and are justified. [17,18]
R6 Human oversight & accountability: Engineers review plans and sign off on outputs; autonomy is limited to orchestration, not engineering judgment. [17]
R7 Robustness to routine variability: Orchestration handles foreseeable variation (formats, units, naming, minor method updates) without modifying certified tools. [5,13,47,48,50–52]
R8 Observability: The system emits traces for model calls and tool invocations (arguments, latency, tokens/cost) sufficient to diagnose failures. [53,54]
R9 Minimal coupling: Tool integration is lightweight/externalized (CLIs/typed APIs); generated code is readable and runnable by engineers. [12,13,51,52]
R10 Evaluation and pass-k: Reliability is demonstrated on curated cases with deterministic checks and/or LLM-as-a-judge, using repeated independent runs (pass-k) proportionate to risk. [55–57]
R11 Change control: Any change to model, prompts, tools, or design practice triggers re-evaluation on the same cases before deployment. [17,18]
R12 Documentation quality: Tool/method documentation is sufficiently structured for agents to follow procedures; ambiguities are revised or clarified. [58,59]

3.3 Tool Calling. Tool calling enables the model to ground its responses on external data and computations, beyond what is encoded in its training weights or present in the context window [73]. Models are trained to recognize when a question requires external information and to generate a structured output (a function name and arguments) instead of a direct answer. The model does not execute the tool itself. In practice, it performs a classification task, selecting which of the available tools is best suited for the current step and inferring the appropriate arguments from the context window. Fig. 3 illustrates this with an engineering example.

Fig. 3 Comparison of LLM responses with and without tool calling, on the question "What is the yield strength of Ti-6Al-4V at 400°C?". Without tools, the model relies on parametric knowledge from training weights, returning a room-temperature value (880 MPa). With tool calling, the model queries an external materials database (materials_db("Ti-6Al-4V", T=400)) for temperature-dependent data, grounding its response on verified external information (620 MPa).

Wei et al. [74] identify three styles of tool integration in agentic systems: (a) in-context, where the model reads documentation and generates calls zero-shot; (b) post-training, where the tool-use strategies are fine-tuned into model weights; and (c) orchestration-based, where a planning layer coordinates multi-tool workflows. The present work uses the first and third; the agent reads tool documentation to learn about unfamiliar APIs and then plans multi-step workflows across multiple tools. Post-training integration is deliberately excluded because engineering tools evolve and must remain independently verified and versioned. Fine-tuning the model to internalize the tool signatures or APIs is, as mentioned earlier, possible but unrealistic in practice for most engineering companies.

3.4 Orchestration. These three mechanisms, inference, thinking mode, and tool calling, establish a two-layer architecture. The LLM selects actions and adapts to context; the external tools execute deterministically. Yao et al. [75] formalized this interleaving of reasoning traces and tool actions in the ReAct framework, showing that the model decides at each step whether to reason further, act on the environment, or return a final answer. This pattern has been demonstrated beyond engineering: Shen et al. [76] showed that an LLM can act as a controller that plans tasks, selects specialized AI models and orchestrates their execution across vision, speech, and language domains.

Thinking and tool calling capabilities are prerequisites for orchestration.
Agentic reasoning literature distinguishes two forms of scaling at inference time [74]: scaling test-time computation, where the model reasons longer within its context window (mapping to thinking), and scaling test-time interaction, where the model interacts with external tools and receives feedback from the environment.

3.5 LLM Agents in Engineering Practice. Developments in the application of LLMs also come from commercial software vendors, who have begun embedding LLM-based assistants within their products. Simulation platforms such as ANSYS [77] or COMSOL [78] now offer interfaces that support engineer interactivity within the GUI.

These assistants are, however, tightly coupled to their host application. They have access to the model being edited but cannot access external design practice repositories, in-house tools, or data from other stages of the analysis process. This closed-ecosystem constraint limits their applicability to the orchestration problem described in Sec. 2.1, where the challenge is to bridge across multiple tools and data sources. Emerging service vendors such as COSMON [79] are developing agentic interfaces that connect LLMs to commercial CAD and simulation software. These solutions are attractive as no-code options. However, the requirements of inspectability (R1), reproducibility (R2), and traceability and auditability (R4) rule them out: companies have no direct access to adapt them, and they would add another potential layer of brittleness. To address this gap, this paper argues that a minimal and generic agentic application shall be developed at each company.
4 The DUCTILE approach to agentic orchestration

This section addresses the gap identified in the previous sections by introducing the DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) approach to agentic orchestration, whose core concept is to separate interpretation from computation. As argued in Sec. 2.4, brittleness arises when adaptation is encoded in fixed rules. DUCTILE addresses this by assigning interpretation and orchestration to an LLM and delegating all calculations to verified tools, improving robustness to interface and input changes without relaxing engineering rigor. This separation also supports traceability and certification compliance, which are necessary for the aerospace context. It is aligned with the requirements R1–R12 summarized in Table 1.

The following subsections describe how the approach is implemented. The description is divided into three parts: developing the agentic implementation (Sec. 4.1), executing the agentic analysis (Sec. 4.2) and evaluating the agentic performance (Sec. 4.3). Fig. 4 shows the architecture of the agentic application used to execute the analysis.

4.1 Developing the agentic implementation. In the development stage, the agentic application is established so that orchestration can be observed, reproduced, and integrated with existing engineering tools. In line with the requirements (Table 1), the agentic application implementation is shaped by two considerations: (i) the character of the agentic framework used to host the agent (addresses R1, R2, R9) and (ii) the configuration of its core connections to models, prompts, and tools (addresses R2–R5, R8).

4.1.1 Framework character. The agentic framework exhibits inspectable and reproducible behavior, transparent configuration with minimal hidden state, and low coupling to the analysis stack.
In practice, this corresponds to open-source, appropriately licensed solutions that surface prompts, traces, and file operations to the engineer, and that connect to external documents, tools, and analysis codes with minimal glue (addresses R1, R2, R9).

4.1.2 Configuration scope. Within such a framework, an agentic application is organized around three connective elements as shown in Fig. 4: (a) the LLM model connection, (b) the system prompt, and (c) the tool interfaces. The model connection covers the chosen model family and deployment mode (on-premises or API), together with version pinning and environment capture so that runs are attributable across iterations (addresses R2, R5). The system prompt provides stable interaction conventions and output contracts, while domain procedures and method rules are provided at run time through documents and skills rather than embedded as static instructions; this keeps behavior auditable and portable across projects (addresses R1, R9). Tool interfaces present deterministic, versioned capabilities to the agent via explicit, typed I/O contracts (paths, formats, units) and controlled write locations; generated scripts and invocation logs are persisted next to inputs/outputs to preserve traceability (addresses R3, R4, R9).

4.1.3 Selecting and connecting to an LLM. Model choice for engineering analysis is typically shaped by a three-way trade-off between capability profile, computational cost, and data-governance constraints. Capability is commonly profiled along three dimensions: tool use (selection and invocation of external tools), as reflected in τ-bench [80]; agentic reasoning (multi-step planning and recovery on error), as reflected in AgentBench [55]; and code generation quality, e.g., SWE-bench [81]. The choice of on-premises vs externally hosted model deployment further conditions feasibility and governance.
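The version pinning and environment capture called for in Sec. 4.1.2 can be as lightweight as a per-run manifest persisted next to each run's outputs. The sketch below is hypothetical: the field names, model identifier, and tool version are illustrative, not DUCTILE's actual schema.

```python
# Minimal sketch: persist a per-run manifest so results remain attributable
# to an exact model, prompt, and tool configuration (addresses R2, R5).
# All field names and values are illustrative.
import hashlib
import json
from datetime import datetime, timezone

def run_manifest(model_id: str, system_prompt: str, tool_versions: dict) -> dict:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_id,                        # pinned model + version
        "system_prompt_sha256": hashlib.sha256(
            system_prompt.encode()).hexdigest(),  # prompt pinned by hash
        "tool_versions": tool_versions,           # deterministic tools, versioned
    }

manifest = run_manifest(
    model_id="example-model-2026-01-01",
    system_prompt="You orchestrate engineering analysis tasks...",
    tool_versions={"loads_tool": "2.1.0"},
)
print(json.dumps(manifest, indent=2))
```

Hashing the prompt rather than storing it verbatim keeps the manifest small while still detecting any drift between runs; storing the full prompt alongside is equally valid.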
On-premises execution offers full control over data flows, while externally hosted models reduce infrastructure burden and broaden immediate model availability. If an externally hosted solution is chosen, it shall be bounded by confidentiality requirements, export-control rules, and applicable data-protection frameworks (e.g., GDPR). For exploratory or non-sensitive workloads, however, external providers often function as a practical starting point. Regardless of the choice, model and version pinning enables run-to-run attribution (addresses R2, R5).

4.1.4 System Prompt. Model behavior is influenced by the system prompt [82], whose design can evolve iteratively or via systematic methods [82,83]. Guidance from major providers (OpenAI [84], Anthropic [85], Google [86], Meta [87], Mistral [88]) differs in detail, but shares common sections that are always recommended:

• Goal: purpose and scope of responsibility.
• Style: tone and interaction conventions for communicating with the user.
• Available tools: typically injected automatically by the framework.
• Reasoning strategy: approach to task decomposition, depth of diagnosis and information exhaustiveness.
• Output format: a verbal schema or a structured contract (e.g., a final tool call).

Within the present approach, the prompt is kept task-agnostic (addresses R1, R9). Company-specific procedures, methods, and design practices are not encoded statically; instead, they are retrieved at run time from knowledge resources or disclosed progressively via skills [89]. This keeps behavior auditable and portable, and allows the user to state the task (e.g., perform task X using method Y, revision Z) while the agent assembles the relevant organizational information (addresses R1, R9).

4.1.5 Configuring tool access. Tool access can be provided internally or via external resources (Fig. 4).
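A tool exposed to the agent through an explicit, typed I/O contract (Sec. 4.1.2) can be sketched as follows. The tool name, fields, and units are hypothetical, chosen only to show how units, formats, and write locations are made explicit in the interface rather than left to prose.

```python
# Sketch of a deterministic, versioned tool behind an explicit typed I/O
# contract: units and formats appear in the signature, not in documentation
# alone. Tool name and fields are hypothetical.
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class LoadCase:
    name: str
    fx_newton: float   # unit fixed by the contract, not by the caller
    fy_newton: float

TOOL_VERSION = "1.0.0"

def write_inp(case: LoadCase, out_dir: Path) -> Path:
    """Write one solver input file per load case to a controlled location."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{case.name}.inp"
    path.write_text(f"FX={case.fx_newton}\nFY={case.fy_newton}\n")
    return path

p = write_inp(LoadCase("case_01", 1200.0, -300.0), Path("/tmp/limit_loads"))
print(p.read_text())
```

Encoding the unit in the field name (`fx_newton`) is one simple convention; a richer contract could carry a units library or a schema, but the point is that the agent sees a typed, versioned signature rather than free text.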
The approach favors connecting to external resources wherever feasible to minimize framework complexity and maximize reusability across agent instances (addresses R9). Interfaces expose deterministic, versioned capabilities with strict I/O contracts (paths, formats, units) and controlled write locations; generated scripts and invocation logs (arguments and stdout/stderr) are persisted alongside inputs and outputs to preserve traceability and support inspection (addresses R3, R4).

Journal of Mechanical Design PREPRINT / 5

Fig. 4 Architecture of an agentic application. The inference engine processes user messages through the LLM within a context window, generating either final answers or tool calls. Tool execution is handled internally or through external interfaces that connect to external resources including file systems, knowledge bases, tool repositories and system interfaces.

4.2 Executing the agentic analysis. Execution proceeds as an alternation between stochastic planning and deterministic tool use. The engineer provides the design context, inputs, and task objective; the application incorporates these into the context window. The inference process described in Sec. 3 proposes the next step toward the goal. Here, the stochasticity inherent to the model is an advantage: the model adapts inputs and company knowledge to the particular design context and goals (orange path in Fig. 4). Once the model proposes a tool call, the agentic framework executes it deterministically and appends the result to the context window.
The agent repeats this cycle, calling successive tools until the task objective is met (addresses R3, R4).

This loop connects the agent to the same tools an engineer would execute manually or hardcode in a scripted pipeline, but the orchestration logic is no longer hardcoded. The LLM translates a process description, defined in natural language in a document, into a sequence of tool invocations, selecting each step based on the task objective, intermediate results, and available tools (addresses R1, R9). This adaptability is reported to the engineer in the final response and remains fully visible in the model traces and generated artifacts.

At any point in these parallel or sequential tool-call iterations, the agent may return control to the user [75]. The decision to finish the iteration is made by the LLM based on its training behaviors and the content of the context window. A few examples of when the LLM can choose (via the likelihood of the next token and the subsequent generation of a response) to finish its turn are:

- When the request of the user can be answered directly, without the need to call tools.
- When the model lacks critical information to carry out the task.
- When the tool calls have provided enough information to complete the task.
- When a roadblock in the execution plan stops the model from getting closer to the goal (e.g., permission access, tool failure) and reasonable efforts have been made to circumvent those limitations within the independence scope defined in the context window.

4.2.1 Final response and user intervention. When stopping, the agent may summarize the activities carried out during its turn. The agentic application parses the results and finalizes the process by returning the last message to the user. Current agentic applications stream the intermediate tool calls and thinking steps directly to the engineer so that task progress can be followed.
This allows engineers to stop the agentic execution if it is deviating from the goal or stalling on an activity (addresses R6).

4.2.2 Traceability and observability. Messages, tool calls and tool results shall, at minimum, be sent for recording to an observability platform via standard payloads (e.g. OpenTelemetry). If a company-external service is used, the same data-sharing considerations shall be evaluated. Generated code is stored next to inputs/outputs to preserve traceability (addresses R3, R4). Where available, telemetry captures call latency and token/cost for model steps, supporting diagnosis, evaluation, and change control over time (addresses R8, R10, R11). The resulting traces provide the evidence required for do/check/approve review and for the evaluation described in the next sub-section.

4.3 Evaluating the agentic application performance. Method adoption in engineering practice depends on evidence that the method behaves as expected, with a high degree of confidence [90]. Further, the system needs to be validated like any in-house engineering software, with rigor proportional to the risk of the activity performed [91,92]. However, agentic systems introduce challenges beyond those of traditional software.

First, the context window evolves across tool calls and intermediate reasoning steps, making the internal state difficult to inspect and debug. Observability platforms that trace each model invocation, tool call, and token exchange, such as Logfire [53], are essential for diagnosing failures (addresses R8). Second, outputs can vary structurally across identical inputs and yet be equally valid. Because the DUCTILE approach intentionally uses flexibility to adapt to diverse engineering contexts, suppressing variability is not the goal. Instead, evaluation is performed against curated datasets of representative problems, including edge cases, similar to the held-out test sets used to assess generalization in machine learning [93].
These datasets verify that the agentic application produces correct and compliant outputs across the expected range of inputs, independent of the particular response path taken (addresses R10). Unlike machine learning datasets containing only input x and expected output y, each evaluation case must specify:

(1) The design environment: the system prompt, the engineer query, the executable tools and reference documentation.
(2) The acceptance criteria: expected outcomes expressed as quantitative tolerances, qualitative descriptions, or both.
(3) The evaluation method: a deterministic check against reference values, a human expert review, or an LLM-as-a-judge assessment.

The evaluation must cover both standard interactions and edge cases. Expected answers shall be developed and maintained by domain experts; in practice, these use cases grow to the order of hundreds of scenarios. Because model responses are probabilistic, each use case must be repeated to establish confidence in the agent's ability. DUCTILE adopts the pass^k metric [56], defined as whether the agent correctly performs the same task in k independent repetitions. The choice of k depends on the evaluation scenario and the criticality of the task.

A useful analogy comes from material qualification in aerospace structures, where allowable properties are defined probabilistically as x/y: the x-th population percentile at y% confidence. Three standard levels exist: S, B, and A-Basis [57]. Treating each evaluation case as an independent Bernoulli pass/fail event, the Clopper–Pearson exact bound [94] for k successes in k trials yields the minimum number of consecutive passes required at each confidence level, as shown in Table 2.

Table 2 Minimum consecutive passes k for the pass^k metric at 95% confidence, derived from the Clopper–Pearson bound p_lower = α^(1/k) with α = 0.05.

                       S-Basis   B-Basis   A-Basis
Pass probability p     ≥ 0.50    ≥ 0.90    ≥ 0.99
Required k             5         29        299

DUCTILE recommended values: development k = 3; deployment k ≥ 10.

The basis-level k values serve as a reference for comparison with established engineering practice. In practice, DUCTILE delegates execution but not responsibility. The engineer retains accountability for reviewing outputs, and existing engineering quality processes remain in place. Since each repetition consumes computational resources and cost, the DUCTILE approach recommends k = 3 during development and k ≥ 10 for deployment, increasing k as the criticality or autonomy of the agent grows.

Evaluations must be re-run whenever the system prompt, tools, documentation, or underlying model changes. At the scale of hundreds of cases repeated k times, human evaluation becomes infeasible. Automated assessment via deterministic checks or LLM-as-a-judge is therefore essential, with trust metrics analogous to inter-rater reliability established for the latter.

5 Industrial application of agentic load processing on a realistic case

The DUCTILE approach described in Sec. 4 is applied to a structural strength evaluation on a realistic case at the case company. The component in this case is a Turbine Rear Structure (TRS), shown in Fig. 5. While the case company routinely performs static strength analyses on the TRS as part of the certification and continuous product development activities, the jet engine OEM generally owns the global stiffness and loads models at the whole-engine integration level, and delegates structural compliance checks to component partners. A design practice document at the case company defines the steps, methods, and expected outputs for this type of analysis.

The use case description. The engineering use case concerns a new structural evaluation triggered by a design change in a neighboring engine component.
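The minimum consecutive-pass counts in Table 2 can be reproduced in a few lines: with k consecutive passes, the Clopper–Pearson lower bound stated in the table caption reduces to solving p^k ≤ α for the smallest integer k. A short check:

```python
# Reproduce Table 2: smallest k with p**k <= alpha, i.e. k consecutive
# passes bound the true pass probability above p at (1 - alpha) confidence
# (Clopper-Pearson lower bound for k successes in k trials).
import math

def min_consecutive_passes(p: float, alpha: float = 0.05) -> int:
    return math.ceil(math.log(alpha) / math.log(p))

for basis, p in [("S-Basis", 0.50), ("B-Basis", 0.90), ("A-Basis", 0.99)]:
    print(basis, min_consecutive_passes(p))  # -> 5, 29, 299
```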
The geometry, finite element mesh and analysis methods remain unchanged; only the loading input from the OEM is new. A new, company-maintained Python-based loads processing tool is used together with the established design practice. The new OEM delivery introduces four deviations compared to the previous analysis. Three of these deviations are input format changes, a common example of variability when data crosses organizational boundaries; the fourth is a correction method communicated by the OEM requiring a 1.04 factor on all Fx forces, separate from the raw data. In a traditional scripted pipeline, each deviation would require manual intervention or code modification. Table 3 summarizes the four deviations.

Table 3 Deviations in the new analysis scenario. The first three are input format changes; the fourth is an OEM-communicated correction specified in the task description.

Deviation       What changed                           Impact on automation
File format     YAML instead of JSON                   Requires new parser
Unit system     Imperial instead of SI                 Silent error
Node naming     right/left instead of port/starboard   FEM run fails downstream
Method change   1.04 factor on Fx forces               Silent error

The remainder of this section describes the configuration (Sec. 5.1), the execution with two engineers (Sec. 5.2) and the systematic evaluation of the agent's performance (Sec. 5.3).

5.1 Case Setup. Following Sec. 4.1, the agentic implementation was built using Claude Code [95], a terminal-based agentic framework. An extensive description of the implementation is available in the supplemental material. Claude Code operates within the engineer's existing IDE and file system. Since the engineers at the case company already work with scripts, input files, and terminal commands, the agentic application fits within the existing workflow rather than imposing a new one. The engineer can inspect every file the agent reads or writes (addresses R1, R2).
To maximize transparency and user experience, engineering product data was placed in the working file system during the experiments. The design practice was served via an MCP document server, giving the agent version-controlled access to the methodology. The loads processing tool was distributed as a Python package via PyPI, with API documentation published in both human-readable and LLM-consumable formats (llms.txt) on the corresponding GitHub Pages.

The task description was prepared prior to agent testing, targeting an experienced engineer or an inexperienced graduate joining the company; no refinements to the prompt or integration were required after initial testing. The design practice document captures the logic of the activities to be performed. Fig. 6 shows the workflow derived by the authors from this document. This flowchart was not available to the agent. By reading the document and inspecting the inputs, the agent was expected to infer and implement a comparable sequence of activities (addresses R1, R12).

5.2 Execution. Two engineers at the case company were assigned the same task with the same inputs and deviations described in Table 3. Both had access to the agent and were encouraged to interact with it and verify its outputs as they saw fit. The full conversations and output files are available in the supplemental material.

5.2.1 Engineer 1: Delegation. Engineer 1, familiar with agentic workflows, delegated the task to the agent and checked the outputs after the full activity was completed. Fig. 7 summarizes the interaction.
Fig. 5 Top: Views of the component in the physical domain. Bottom: Activity view of the TRS strength evaluation. Only a subset of the inputs, documents and tools are shown for visual clarity. Highlighted in orange is the load processing step where the use case focuses.

The engineer pointed the agent to the task description and requested a plan. The agent read the task description PDF, fetched the design practice via the MCP document server, explored the working directory, and examined the previous analysis run and input files. During this reading phase, the agent identified all four deviations in its reasoning trace: the YAML format incompatible with the tool's JSON parser, the imperial unit system, the changed pilot node names, and the 1.04 correction factor specified in the task description. The agent surfaced these findings in a seven-step plan presented to the engineer, proposing a processing approach that addressed each deviation. The engineer approved the plan.

The agent then fetched the tool's API documentation, wrote a processing script, executed it and verified the outputs: six .inp files (one per envelope-selected load case), an envelope summary table, and an exceedance comparison against the previous analysis. The engineer inspected the generated script and output files independently and declared the task completed.
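For illustration only, the kind of transformation the agent's generated script performs can be sketched as below. The actual script is in the supplemental material; apart from the 1.04 Fx factor, the imperial-to-SI conversion, and the node renaming from Table 3, every name and value here is hypothetical, and the klbf-to-newton constant is an approximation.

```python
# Illustrative sketch (not the agent's actual script) of handling the four
# deviations in Table 3: YAML input, imperial units, renamed pilot nodes,
# and the OEM's 1.04 Fx correction. Field names are hypothetical.
import json

KLBF_TO_N = 4448.22          # approximate: kilopounds-force to newtons
NODE_MAP = {"lug_left": "lug_port", "lug_right": "lug_starboard"}

def process_load_case(case: dict) -> dict:
    out = {}
    for node, forces in case.items():
        name = NODE_MAP.get(node, node)               # deviation 3: node naming
        fx, fy, fz = (f * KLBF_TO_N for f in forces)  # deviation 2: units
        fx *= 1.04                                    # deviation 4: Fx correction
        out[name] = [round(fx, 1), round(fy, 1), round(fz, 1)]
    return out

# Deviation 1 (YAML instead of JSON) would be handled by a YAML parser;
# a plain dict stands in here to keep the sketch dependency-free.
raw = {"lug_left": [1.0, 0.5, -0.2]}
print(json.dumps(process_load_case(raw)))
```

The point of the case study is that none of these transformations is hard; what a scripted pipeline lacks is the step of detecting that they are needed, which the agent performed by reading the inputs and the task description.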
5.2.2 Engineer 2: Incremental discovery. A second engineer, unfamiliar with agentic workflows, took the opposite approach. Rather than delegating the full task, this engineer requested one step at a time: first reading the inputs, then processing a single load case, then checking the output before proceeding. Each intermediate result was inspected before the next step was requested. Through this incremental process, the engineer discovered the task, learned the tool's behavior, and built confidence in the agent gradually. The agent adapted to this step-by-step supervision without difficulty. The final output was identical to Engineer 1's result: the same load case files, envelope table and exceedance comparison. The same four deviations were handled, through a different supervision path.

5.3 Evaluation. Following the evaluation approach described in Sec. 4.3, the agent's performance was assessed against the scenario defined by the four deviations in Table 3. After an independent check by a company domain expert, Engineer 1's output data was used as ground truth for automated evaluation.

The evaluation used two complementary checks. First, an LLM-as-a-judge assessment (Opus 4.6) verified that the agent's processing script correctly addressed all four deviations: YAML-to-JSON conversion, unit conversion, pilot node mapping, and application of the 1.04 Fx correction factor. Second, a deterministic comparison checked the final load summary values against the expert's reference output. A run passed only if both checks succeeded. The LLM-as-a-judge assessments were manually reviewed by the domain expert, yielding 100% inter-rater agreement.

The agent was run 10 independent times on the same scenario, with no modifications to the prompt, tools, or context between runs. All 10 runs passed both quantitative and qualitative checks, with 100% inter-rater agreement between the LLM-as-a-judge (Opus 4.6) and the 10 manual verifications.
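The two-check, repeated-run evaluation can be sketched as a small harness. `judge_script` stands in for the LLM-as-a-judge call and `reference` for the expert ground truth; both, and all other names, are hypothetical.

```python
# Sketch of a pass^k harness: a run passes only if both the deterministic
# comparison and the (stubbed) LLM-as-a-judge check succeed, and the agent
# passes only if all k independent runs pass. All names are illustrative.

def deterministic_check(result: dict, reference: dict, tol: float = 1e-6) -> bool:
    return (result.keys() == reference.keys() and
            all(abs(result[k] - reference[k]) <= tol for k in reference))

def judge_script(script_text: str) -> bool:
    """Stand-in for an LLM-as-a-judge call: did the script address all
    four deviations? A trivial keyword check replaces the model here."""
    return all(kw in script_text for kw in ("yaml", "unit", "node", "1.04"))

def pass_k(runs: list, reference: dict, k: int = 10) -> bool:
    assert len(runs) >= k
    return all(deterministic_check(r["loads"], reference) and
               judge_script(r["script"]) for r in runs[:k])

reference = {"case_01_fx": 4626.1}
runs = [{"loads": {"case_01_fx": 4626.1},
         "script": "yaml unit node 1.04"} for _ in range(10)]
print(pass_k(runs, reference))  # -> True
```

In a real deployment the judge is itself a model call whose verdicts must be spot-checked by a domain expert, as done in the case study.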
The evaluation scripts and outputs are available in the supplemental material. The implementation is an example of the DUCTILE approach towards agentic evaluation.

Fig. 6 Engineer-interpreted activity process flowchart of the natural language and unstructured description contained in the loads design practice document. (Steps: read the OEM load file and verify its format; downselect applicable load cases; convert units to the FEM system if the OEM delivery is not in SI; verify equilibrium for each load case; apply the 1.5 ultimate load factor if performing an ultimate analysis with limit loads; write one .inp file per load case; produce a summary envelope table; and produce an exceedance comparison if previous loads are available.) NOTE: The agent must infer a similar workflow by reading the document; this flowchart is not available in the agent's context.

6 Discussion

6.1 Closing the Brittleness Loop. Sec. 1 identified brittleness as a structural property of engineering analysis automation: rigid pipelines break when inputs deviate from expected formats, units, naming conventions, or methodologies. In the case study of Sec. 5, the agentic application consistently identified the errors that would have broken the downstream processes, and adapted to format changes consistently, achieving pass^k with k = 10. The DUCTILE agentic orchestration approach thereby demonstrated how to develop, use and evaluate an agentic application that meets the engineering analysis performance expectations and fulfills the requirements identified in Table 1.

6.2 Consequences for Practice. Preparing context for the agent exposes gaps in design practice documentation. Experienced engineers bridge ambiguities in written procedures through tacit knowledge. Hsieh et al. [58] demonstrate that well-written tool documentation enables LLMs to use unfamiliar tools zero-shot, without fine-tuning. Yuan et al. [59] show that concise, structured descriptions outperform verbose ones.
Investing in documentation quality serves both human engineers and agents. The development of the use case showed how those tacit activities were missing from the company design practices, and had to be added as the agentic application development and evaluations were being co-developed.

Moreover, by requiring engineering knowledge to be captured in explicit, accessible documents rather than residing in individual expertise alone, the use case showed how the approach reduces the organizational vulnerability that arises when critical knowledge is concentrated in individuals.

Fig. 7 Summary of Engineer 1's interaction. Green bubbles: engineer messages. Grey bubbles: agent responses. Dashed boxes: agent reasoning. Blue boxes: tool calls. Dotted boxes: engineer actions outside the conversation. (In the transcript, the agent identifies the four deviations, proposes a seven-step plan, generates and runs the processing script, and reports that the new loads exceed the previous envelope: Mz +30%, Fx +15%, Fy +11–17%.) The full transcript is available in the supplemental material.

The DUCTILE agentic approach uses in-context reasoning exclusively (no fine-tuning, no persistent memory, no reinforcement learning). This is a deliberate choice: together with evaluations, it allows the agentic application to avoid model lock-in, and to update the model robustly and with little effort as smaller-footprint, cheaper, or more capable models are released. This contrasts with post-training approaches, where tool-use strategies are embedded in model weights [74], and with rule-based systems that, despite formally explicit logic, become practical black boxes as rule-chain complexity exceeds what practitioners can trace (Sec. 2.2). Both engineers in the case study confirmed that the agent's actions were easy to follow and inspect, showing promise for wider adoption in practice.

6.3 Risks and Unintended Consequences. The aerospace community warns that "overreliance on software tools can lead to a superficial understanding of the underlying physics" [24]. This concern applies to agents as well. A key distinction is needed, however, between dexterity and judgment. Dexterity with specific APIs, data formats, and tool interfaces is what the agent will absorb. Engineering judgment, knowing whether results make physical sense, whether the method is appropriate, or whether the output meets certification requirements, must remain with the engineer. In our experience, the real risk is less that agents erode engineering skill than that organizations stop recognizing and allocating resources to develop the engineer's judgment that the evaluation activity still requires.
Engineer 2's step-by-step interaction pattern provides evidence that supervision and learning can coexist: this engineer built understanding of the task progressively while the agent handled the implementation details.

Even with transparent reasoning, agents can create a false impression of competence [96,97]. The LLM is a probabilistic component. It will sometimes be wrong. Von Neumann argued in 1956 that reliable systems can be built from unreliable components, provided the architecture includes sufficient redundancy and verification [98]. The evaluation datasets described in Sec. 4.3 serve exactly this function: they measure how often the probabilistic component produces correct results under realistic conditions. Safeguards must therefore be holistic, not reliant on a single check. The pass^k metric over independent runs provides one layer. Built-in reviewer steps, where the agent checks its own outputs against the design practice before presenting them to the engineer, provide a second. Two-method verification, comparing a hand calculation against an automated script, or running two independent processing approaches, provides a third.

Gericke and Eckert [1] raise two cautions from the product development ecosystem perspective. First, a rebound effect: historically, tools that shortened development time led to more complex products requiring similar development time. Second, skill loss: designers may lose tacit skills previously developed through routine manual execution. The supervisory role must be deliberately designed to maintain engineering judgment [99].

It is not yet clear whether this new approach will make mechanical engineers happy or exhausted. On one hand, many design and AI researchers, including Cross [49], argue that systems that interactively support humans should be designed in a way that is cognitively comfortable for them.
If done correctly, in a human-centered way [100], we could remove the undesirable activities and do only what we enjoy most [101]. However, we are starting to observe in software engineering that these agents are not delivering on this promise. Instead, the joy of solving the little problems and the social nature of the work are reduced, substituted by a hollow supervision role that exhausts software engineers by increasing the intensity of their work [102]. Shneiderman [103] argues that this outcome is not inevitable: systems designed as “supertools” that amplify human capability, rather than autonomous agents requiring supervision, can preserve both productivity and satisfaction.

6.4 Limitations. Several limitations constrain the generalizability of these findings.

• Single task domain. The case study covers load processing for static strength analysis of one component type. The brittleness argument should generalize to other format-sensitive engineering workflows, but this generalization is not demonstrated.
• Two engineers. The experiment is illustrative; two participants cannot establish population-level patterns.
• Model dependence. The results reflect Claude Opus 4.6 at the time of the study and may change with future model versions.
• Requires documented processes. Where engineering knowledge is predominantly tacit and not captured in accessible documents and tools, the agent has nothing to read and the approach does not apply.
• Text and data workflows only. This work does not cover spatial tasks such as CAD geometry generation, finite element meshing, or graphical post-processing.
• Transferability not demonstrated. Applicability to other engineering domains, organizations and task types is argued on structural grounds but not empirically validated beyond this case.

6.5 Future Work. An open dataset for agentic evaluation is necessary.
Organizations deploying agents in certified engineering need references for developing internal datasets that capture their specific tools, design practices and quality requirements. Any change to the underlying model, prompts, tool version or design practice can alter agent behavior and must be assessed before deployment. The research community, in turn, needs open engineering benchmarks. Efforts such as calls for curated design datasets [104] and EngiBench [105] have begun to address this for generative tasks, and similar calls have emerged for LLM evaluation in engineering [47]. Agentic engineering evaluations have additional, nuanced requirements: the framework must be executable and must simultaneously exercise domain knowledge, correct tool calls, interpretation of numerical outputs and adherence to documented methods.

Industrial impact shall be measured. The two engineers in Sec. 5 interacted with the agent in fundamentally different ways, one delegating the full task, the other proceeding incrementally. A controlled within-subject study comparing engineer performance with and without agent support would formalize this observation and measure impact. Metrics like task accuracy, completion time and confidence in the result, across a larger sample of practicing engineers, would reveal how agentic support and engineering practice interact in realistic scenarios.

7 Conclusion

The presented DUCTILE approach demonstrated the separation of automation support via an industrial load processing task at an aerospace manufacturer. The adaptive orchestration was performed by the agent, exploiting the non-deterministic nature of LLMs. The deterministic execution was performed by verified engineering tools. Engineers retained responsibility for supervision and judgment, while the process was guided by design practices already available in the company.
In the use case, the agent handled four input deviations that would break traditional scripted automation: a file format change, a unit system mismatch, a naming convention change and a methodology update. All four were resolved correctly and consistently across 10 independent evaluation runs. The approach was also tested with two engineers with different supervision styles, one delegating the full task and one proceeding incrementally, and both reached the same correct result using the same agent configuration.

Traditional engineering automation is brittle because it encodes adaptation in deterministic rules bound to interfaces that, in practice, keep changing. When the ecosystem changes, when people move, the automation support becomes a barrier and stops being an enabler. The DUCTILE agentic approach presented here absorbs that variability at the orchestration layer, so that engineering time and cognitive effort are spent on the decisions that affect product performance rather than on the data wrangling and workflow adaptation that currently consume it.

Acknowledgment

The authors thank the engineers at GKN Aerospace who participated in the deployment study and provided valuable feedback on the agentic application.

Conflict of Interest

Data Availability Statement

The agentic application implementation, evaluation results, and supplemental material are openly available at https://github.com/alex-pradas/DUCTILE and permanently archived [106].

References

[1] Gericke, K. and Eckert, C., 2026, “Co-Evolution of Tools and Methods in Product Development Ecosystems,” Co-Evolution of Design Research and Design Practice, K. Gericke, C. Eckert, V. Singh, and S. Venkataraman, eds., Springer Nature Switzerland, p. 6–19.
[2] Eckert, C. and Clarkson, J., 2005, “The Reality of Design,” Design Process Improvement, Springer London, p. 1–29.
[3] Wynn, D. C. and Clarkson, P.
J., 2021, “Improving the Engineering Design Process by Simulating Iteration Impact with ASM2.0,” Research in Engineering Design, 32(2), p. 127–156.
[4] Eppinger, S. D., Whitney, D. E., Smith, R. P., and Gebala, D. A., 1994, “A Model-Based Method for Organizing Tasks in Product Development,” Research in Engineering Design, 6(1), p. 1–13.
[5] Wynn, D. C., Caldwell, N. H. M., and Clarkson, P. J., 2014, “Predicting Change Propagation in Complex Design Workflows,” Journal of Mechanical Design, 136(8), p. 081009.
[6] Brahma, A. and Wynn, D. C., 2022, “Concepts of Change Propagation Analysis in Engineering Design,” Research in Engineering Design, 34(1), p. 117–151.
[7] Liu, Y., Abulawi, Z., Garimidi, A., and Lim, D., 2025, “Automating Data-Driven Modeling and Analysis for Engineering Applications Using Large Language Model Agents,” doi: 10.2139/ssrn.5400869, SSRN Preprint.
[8] Fernandes, A. A. A., Koehler, M., Konstantinou, N., Pankin, P., Paton, N. W., and Sakellariou, R., 2023, “Data Preparation: A Technological Perspective and Review,” SN Computer Science, 4(4), p. 425.
[9] Eckert, C., Isaksson, O., Hane-Hagström, M., and Eckert, C., 2022, “My Facts Are Not Your Facts: Data Wrangling as a Socially Negotiated Process, a Case Study in a Multisite Manufacturing Company,” Journal of Computing and Information Science in Engineering, 22(6), p. 060906.
[10] Qin, Y., Lu, W., Qi, Q., Liu, X., Zhong, Y., Scott, P. J., and Jiang, X., 2017, “Status, Comparison, and Issues of Computer-Aided Design Model Data Exchange Methods Based on Standardized Neutral Files and Web Ontology Language File,” Journal of Computing and Information Science in Engineering, 17(1), p. 010801.
[11] Cherukuri, R. and Yarram, V. K., 2024, “From Intelligent Automation to Agentic AI: Engineering the next Generation of Enterprise Systems,” International Journal of Emerging Research in Engineering and Technology, 5(4), p. 142–145.
[12] La Rocca, G., 2012, “Knowledge Based Engineering: Between AI and CAD.
Review of a Language Based Technology to Support Engineering Design,” Advanced Engineering Informatics, 26(2), p. 159–179.
[13] Verhagen, W., Bermell-Garcia, P., Van Dijk, R., and Curran, R., 2012, “A Critical Review of Knowledge-Based Engineering: An Identification of Research Challenges,” Advanced Engineering Informatics, 26(1), p. 5–15.
[14] Vidner, O., Wehlin, C., and Wiberg, A., 2022, “Design Automation Systems for the Product Development Process: Reflections from Five Industrial Case Studies,” Proceedings of the Design Society, 2, p. 2533–2542.
[15] Quintana-Amate, S., Bermell-Garcia, P., Tiwari, A., and Turner, C., 2017, “A New Knowledge Sourcing Framework for Knowledge-Based Engineering: An Aerospace Industry Case Study,” Computers & Industrial Engineering, 104, p. 35–50.
[16] Ahmed, S. and Wallace, K. M., 2004, “Understanding the Knowledge Needs of Novice Designers in the Aerospace Industry,” Design Studies, 25(2), p. 155–173.
[17] 2024, “EASA Artificial Intelligence Concept Paper Issue 2 - Guidance for Level 1 & 2 Machine-Learning Applications,” accessed 2026-02-29, https://www.easa.europa.eu/en/document-library/general-publications/easa-artificial-intelligence-concept-paper-issue-2
[18] ISO/IEC, 2024, “Artificial Intelligence — Functional Safety and AI Systems,” International Organization for Standardization and International Electrotechnical Commission, ISO/IEC TR 5469:2024.
[19] ISO, 2024, “Road Vehicles — Safety and Artificial Intelligence,” International Organization for Standardization, ISO/PAS 8800:2024.
[20] Gomez, A. P., Kretzschmar, M., Paetzold-Byhain, K., and Isaksson, O., 2025, “A Team of Three: The Role of Generative AI in the Development of Design Automation Systems for Complex Products,” Proceedings of the Design Society, 5, p. 309–318.
[21] Kerley, W., Armstrong, G., Pepe, C., Moss, M., and Clarkson, P.
J., 2011, “Using Simulation to Support Process Integration and Automation of the Early Stages of Aerospace Design,” ICED 11 — 18th International Conference on Engineering Design, Vol. 1, Copenhagen, Denmark, August 15–18, 2011, p. 134–146.
[22] SAE International, 2017, “AS9100: Quality Systems – Aerospace – Model for Quality Assurance in Design, Development, Production, Installation and Servicing,” SAE International, 400 Commonwealth Drive, Warrendale, PA, United States, doi: 10.4271/AS9100A.
[23] Abollado, J. R., Shehab, E., and Bamforth, P., 2017, “Challenges and Benefits of Digital Workflow Implementation in Aerospace Manufacturing Engineering,” Procedia CIRP, 60, p. 80–85.
[24] Krupa, G., 2025, “The Art of the Science. Creative Stimulation in the Classroom,” Aerospace, Royal Aeronautical Society.
[25] Rittel, H. W. J. and Webber, M. M., 1973, “Dilemmas in a General Theory of Planning,” Policy Sciences, 4(2), p. 155–169.
[26] Cross, N., 1992, Research in Design Thinking, Delft University Press.
[27] Simon, H. A., 1973, “The Structure of Ill Structured Problems,” Artificial Intelligence, 4(3-4), p. 181–201.
[28] Baskaran, S. P., Camacho Casero, J., Bagdatli, B., and Mavris, D., 2026, “MBSE-Driven MDAO in Co-Architecture and Concurrent Engineering Workflows for Aircraft Development,” AIAA SCITECH 2026 Forum, American Institute of Aeronautics and Astronautics, Orlando, FL, January 6–10, 2026, doi: 10.2514/6.2026-0193.
[29] Formentini, G., Bouissiere, F., Cuiller, C., Dereux, P.-E., and Favi, C., 2022, “Conceptual Design for Assembly Methodology Formalization: Systems Installation Analysis and Manufacturing Information Integration in the Design and Development of Aircraft Architectures,” Journal of Industrial Information Integration, 26, p. 100327.
[30] Pradas Gómez, A., Panarotto, M., and Isaksson, O., 2023, “Design Automation Strategies for Aerospace Components During Conceptual Design Phases,” Aerospace Europe Conference 2023 – 10th EUCASS – 9th CEAS, Lausanne, Switzerland, July 9–13, 2023, doi: 10.13009/EUCASS2023-578.
[31] Gruber, T. R., 1993, “A Translation Approach to Portable Ontology Specifications,” Knowledge Acquisition, 5(2), p. 199–220.
[32] Brimble, R. and Sellini, F., 2000, “The MOKA Modelling Language,” Knowledge Engineering and Knowledge Management: Methods, Models, and Tools (EKAW 2000), Vol. 1937 of Lecture Notes in Computer Science, Springer, p. 49–56.
[33] van den Berg, T., van der Laan, T., van Manen, B., Bansal, D., Ciobotia, I., and Sonneveld, J., 2023, “A Multidisciplinary Modelling System for Aircraft Structural Components,” Aerospace Europe Conference 2023 — Joint 10th EUCASS / 9th CEAS Conference, Lausanne, Switzerland, July 9–13, 2023.
[34] Al Handawi, K., Brahma, A., Wynn, D. C., Kokkolaras, M., and Isaksson, O., 2024, “Design Space Exploration and Evaluation Using Margin-Based Trade-Offs,” Journal of Mechanical Design, 146(6), p. 061701.
[35] Kügler, P., Dworschak, F., Schleich, B., and Wartzack, S., 2023, “The Evolution of Knowledge-Based Engineering from a Design Research Perspective: Literature Review 2012–2021,” Advanced Engineering Informatics, 55, p. 101892.
[36] Padula, S. L. and Gillian, R. E., 2006, “Multidisciplinary Environments: A History of Engineering Framework Development,” 11th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, Vol. 4, Portsmouth, VA, September 6–8, 2006, p. 2058–2068, doi: 10.2514/6.2006-7083.
[37] Hiriyannaiah, S. and Mocko, G. M., 2008, “Information Management Capabilities of MDO Frameworks,” Proceedings of the ASME Design Engineering Technical Conference, Vol. 3, Brooklyn, NY, August 3–6, 2008, p. 635–645, doi: 10.1115/DETC2008-49934.
[38] Allison, J. T., Cardin, M.-A., McComb, C., Ren, M.
Y., Selva, D., Tucker, C., Witherell, P., and Zhao, Y. F., 2022, “Special Issue: Artificial Intelligence and Engineering Design,” Journal of Mechanical Design, 144(2), p. 020301.
[39] Regenwetter, L., Nobari, A. H., and Ahmed, F., 2022, “Deep Generative Models in Engineering Design: A Review,” Journal of Mechanical Design, 144(7), p. 071704.
[40] Whalen, E. and Mueller, C., 2022, “Toward Reusable Surrogate Models: Graph-based Transfer Learning on Trusses,” Journal of Mechanical Design, 144(2), p. 021704.
[41] EASA, 2012, “Commission Regulation (EU) No 748/2012, Part 21, Subpart J—Design Organisation Approval, Section 21.A.239,” Consolidated version revised 2025, https://eur-lex.europa.eu/eli/reg/2012/748
[42] SAE International, 2023, “ARP4754B: Guidelines for Development of Civil Aircraft and Systems,” doi: 10.4271/ARP4754B.
[43] Yuan, C., Marion, T., and Moghaddam, M., 2022, “Leveraging End-User Data for Enhanced Design Concept Evaluation: A Multimodal Deep Regression Model,” Journal of Mechanical Design, 144(2), p. 021403.
[44] Pradas Gómez, A., Krus, P., Panarotto, M., and Isaksson, O., 2024, “Large Language Models in Complex System Design,” Proceedings of the Design Society, Vol. 4, Dubrovnik, Croatia, May 27–30, 2024, p. 2197–2206, doi: 10.1017/pds.2024.222.
[45] Picard, C., Edwards, K. M., Doris, A. C., Man, B., Giannone, G., Alam, M. F., and Ahmed, F., 2025, “From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design,” Artificial Intelligence Review, 58(9), p. 288.
[46] Massoudi, S. and Fuge, M., 2026, “Agentic Large Language Models for Conceptual Systems Engineering and Design,” Journal of Mechanical Design, 148(5), p. 051405.
[47] Mustapha, K. B., 2025, “A Survey of Emerging Applications of Large Language Models for Problems in Mechanics, Product Design, and Manufacturing,” Advanced Engineering Informatics, 64, p. 103066.
[48] Eckert, C., Clarkson, P.
J., and Zanker, W., 2004, “Change and Customisation in Complex Engineering Domains,” Research in Engineering Design, 15(1), p. 1–21.
[49] Cross, N., 2025, “Natural and Artificial Intelligence in Design,” Designerly Ways of Knowing and Thinking, Springer, p. 31–46.
[50] Caputo, C. and Cardin, M.-A., 2022, “Analyzing Real Options and Flexibility in Engineering Systems Design Using Decision Rules and Deep Reinforcement Learning,” Journal of Mechanical Design, 144(2), p. 021705.
[51] Elgh, F., 2008, “Supporting Management and Maintenance of Manufacturing Knowledge in Design Automation Systems,” Advanced Engineering Informatics, 22(4), p. 445–456.
[52] Hjertberg, T., Stolt, R., and Elgh, F., 2016, “Managing Dependencies in Heterogeneous Design Automation Systems,” Transdisciplinary Engineering: Crossing Boundaries, Advances in Transdisciplinary Engineering, Vol. 4, IOS Press, Curitiba, Brazil, October 3–7, 2016, p. 279–288, doi: 10.3233/978-1-61499-703-0-279.
[53] Colvin, S. and Pydantic Services Inc., 2024, “Logfire,” accessed 2026-02-29, https://github.com/pydantic/logfire
[54] Dong, L., Lu, Q., and Zhu, L., 2024, “A Taxonomy of AgentOps for Enabling Observability of Foundation Model Based Agents,” https://arxiv.org/abs/2411.05285
[55] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J., 2024, “AgentBench: Evaluating LLMs as Agents,” 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7–11, 2024, https://arxiv.org/abs/2308.03688
[56] Yao, S., Shinn, N., Razavi, P., and Narasimhan, K., 2025, “τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains,” 13th International Conference on Learning Representations (ICLR 2025), Singapore, April 24–28, 2025, p.
74824–74876, https://arxiv.org/abs/2406.12045
[57] Battelle Memorial Institute, 2024, “Metallic Materials Properties Development and Standardization (MMPDS),” Battelle Memorial Institute, Columbus, OH, Supersedes MIL-HDBK-5.
[58] Hsieh, C.-Y., Chen, S.-A., Li, C.-L., Fujii, Y., Ratner, A., Lee, C.-Y., Krishna, R., and Pfister, T., 2023, “Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models,” https://arxiv.org/abs/2308.00675
[59] Yuan, S., Song, K., Chen, J., Tan, X., Shen, Y., Kan, R., Li, D., and Yang, D., 2025, “EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction,” Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL-HLT 2025), Vol. 1, Albuquerque, NM, April 29–May 4, 2025, p. 951–972, doi: 10.18653/v1/2025.naacl-long.44.
[60] Kay, A., 1990, “User Interface: A Personal View,” The Art of Human-Computer Interface Design, p. 191–207.
[61] Wooldridge, M. and Jennings, N. R., 1995, “Intelligent Agents: Theory and Practice,” Knowledge Engineering Review, 10(2), p. 115–152.
[62] Maes, P., 1994, “Agents That Reduce Work and Information Overload,” Commun. ACM, 37(7), p. 30–40.
[63] Pradas Gómez, A., Panarotto, M., and Isaksson, O., 2024, “Evaluation of Different Large Language Model Agent Frameworks for Design Engineering Tasks,” DS 130: Proceedings of NordDesign 2024, Design Society, Reykjavík, Iceland, August 12–14, 2024, p. 693–702, doi: 10.35199/NORDDESIGN2024.74.
[64] Willison, S., “I Think “Agent” May Finally Have a Widely Enough Agreed upon Definition to Be Useful Jargon Now,” accessed 2026-02-29, https://simonwillison.net/2025/Sep/18/agents/
[65] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., 2017, “Attention Is All You Need,” Advances in Neural Information Processing Systems, Vol.
30, Curran Associates, Inc., Long Beach, CA, December 4–9, 2017, p. 5999–6009.
[66] Labs, I., Khanna, S., Kharbanda, S., Li, S., Varma, H., Wang, E., Birnbaum, S., Luo, Z., Miraoui, Y., Palrecha, A., Ermon, S., Grover, A., and Kuleshov, V., 2025, “Mercury: Ultra-Fast Language Models Based on Diffusion,” https://arxiv.org/abs/2506.17298
[67] Zhang, M., Press, O., Merrill, W., Liu, A., and Smith, N. A., 2024, “How Language Model Hallucinations Can Snowball,” Proceedings of the 41st International Conference on Machine Learning, PMLR, Vol. 235, Vienna, Austria, July 21–27, 2024, p. 59670–59684.
[68] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P., 2023, “Survey of Hallucination in Natural Language Generation,” ACM Computing Surveys, 55(12), p. 1–38.
[69] Zelikman, E., Wu, Y., Mu, J., and Goodman, N., 2022, “STaR: Bootstrapping Reasoning with Reasoning,” Advances in Neural Information Processing Systems, 35, p. 15476–15488.
[70] Snell, C., Lee, J., Xu, K., and Kumar, A., 2025, “Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters,” 13th International Conference on Learning Representations (ICLR 2025), Singapore, April 24–28, 2025, p. 7595–7629, https://arxiv.org/abs/2408.03314
[71] OpenAI, 2024, “OpenAI o1 System Card,” arXiv:2412.16720
[72] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D., 2022, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” Advances in Neural Information Processing Systems, Vol. 35, New Orleans, LA, November 28 – December 9, 2022, p. 24824–24837.
[73] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., and Scialom, T., 2023, “Toolformer: Language Models Can Teach Themselves to Use Tools,” Advances in Neural Information Processing Systems, Vol. 36, New Orleans, LA, December 10–16, 2023, p. 68539–68551.
[74] Wei, T., Li, T.-W., Liu, Z., Ning, X., Yang, Z., Zou, J., Zeng, Z., Qiu, R., Lin, X., Fu, D., Li, Z., Ai, M., Zhou, D., Bao, W., Li, Y., Li, G., Qian, C., Wang, Y., Tang, X., Xiao, Y., Fang, L., Liu, H., Tang, X., Zhang, Y., Wang, C., You, J., Ji, H., Tong, H., and He, J., 2026, “Agentic Reasoning for Large Language Models,” https://arxiv.org/abs/2601.12538
[75] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y., 2023, “ReAct: Synergizing Reasoning and Acting in Language Models,” 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
[76] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y., 2023, “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face,” Advances in Neural Information Processing Systems, Vol. 36, New Orleans, LA, December 10–16, 2023, p. 38154–38180.
[77] “Work Faster and Smarter With Ansys Engineering Copilot,” accessed 2026-02-29, https://www.ansys.com/blog/work-faster-smarter-with-ansys-engineering-copilot
[78] “Using the Chatbot Window in COMSOL Multiphysics®.”
[79] “AI Agents for Mechanical Engineering,” accessed 2026-02-29, https://www.cosmon.com/
[80] Barres, V., Dong, H., Ray, S., Si, X., and Narasimhan, K., 2025, “τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment,” https://arxiv.org/abs/2506.07982
[81] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K., 2024, “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” https://arxiv.org/abs/2310.06770
[82] Zhang, L., Ergen, T., Logeswaran, L., Lee, M., and Jurgens, D., 2024, “SPRIG: Improving Large Language Model Performance by System Prompt Optimization,” https://arxiv.org/abs/2410.14826
[83] Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.
T., Moazam, H., Miller, H., Zaharia, M., and Potts, C., 2024, “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines,” 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7–11, 2024, https://arxiv.org/abs/2310.03714
[84] OpenAI, 2025, “OpenAI GPT-4.1 Prompting Guide,” accessed 2026-02-29, https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide/
[85] Anthropic, 2025, “Anthropic Context Engineering Guide for AI Agents,” accessed 2026-02-29, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
[86] Google, 2025, “Google Gemini Prompt Design Strategies,” accessed 2026-02-29, https://ai.google.dev/gemini-api/docs/prompting-strategies
[87] Meta, 2025, “Meta AI Prompting Guide,” accessed 2026-02-29, https://ai.meta.com/learn/how-to-write-effective-ai-prompts-tips-and-examples/
[88] Mistral AI, 2025, “Mistral AI Prompting Guide,” accessed 2026-02-29, https://docs.mistral.ai/capabilities/completion/prompting_capabilities
[89] 2026, “Agent Skills Open Standard,” accessed 2026-02-27, https://github.com/agentskills/agentskills
[90] Seepersad, C. C., Pedersen, K., Emblemsvåg, J., Bailey, R., Allen, J. K., and Mistree, F., 2006, “The Validation Square: How Does One Verify and Validate a Design Method?” Decision Making in Engineering Design, K. E. Lewis, W. Chen, and L. C. Schmidt, eds., ASME Press, New York, p. 303–314.
[91] Morrison, T. M., Hariharan, P., Funkhouser, C. M., Afshari, P., Goodin, M., and Horner, M., 2019, “Assessing Computational Model Credibility Using a Risk-Based Framework: Application to Hemolysis in Centrifugal Blood Pumps,” ASAIO Journal, 65(4), p. 349–360.
[92] Razi, H., Schaefer, J. D., Wanthal, S., Handler, J. J., Renieri, G. D., and Justusson, B.
P., 2016, “Rapid Integration of New Analysis Methods in Production,” Proceedings of the American Society for Composites — 31st Technical Conference, American Society for Composites, Williamsburg, VA, September 19–22, 2016.
[93] Hastie, T., Tibshirani, R., and Friedman, J., 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York.
[94] Clopper, C. J. and Pearson, E. S., 1934, “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial,” Biometrika, 26(4), p. 404–413.
[95] Anthropic, 2025, “Claude Code,” accessed 2026-02-29, https://github.com/anthropics/claude-code
[96] Parasuraman, R. and Manzey, D. H., 2010, “Complacency and Bias in Human Use of Automation: An Attentional Integration,” Human Factors, 52(3), p. 381–410.
[97] Woods, D. D. and Hollnagel, E., 2006, Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, CRC Press, Boca Raton, FL.
[98] Neumann, J. V., 1956, “Probabilistic Logics and the Synthesis of Reliable Organisms From Unreliable Components,” Automata Studies (AM-34), C. E. Shannon and J. McCarthy, eds., Princeton University Press, p. 43–98.
[99] Bainbridge, L., 1983, “Ironies of Automation,” Automatica, 19(6), p. 775–779.
[100] Li, F.-F., 2023, The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI, Flatiron Books, New York.
[101] Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., Wang, C., Levine, G., Lingelbach, M., Sun, J., Anvari, M., Hwang, M., Sharma, M., Aydin, A., Bansal, D., Hunter, S., Kim, K.-Y., Lou, A., Matthews, C. R., Villa-Renteria, I., Tang, J. H., Tang, C., Xia, F., Savarese, S., Gweon, H., Liu, C. K., Wu, J., and Fei-Fei, L., 2023, “BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation,” Proceedings of Machine Learning Research, Vol. 205, Auckland, New Zealand, December 14–18, 2022, p. 80–93.
[102] Summers, L., 2026, “The Human-in-the-Loop is Tired,” Pydantic, accessed 2026-03-02, https://pydantic.dev/articles/the-human-in-the-loop-is-tired
[103] Shneiderman, B., 2022, Human-Centered AI, Oxford University Press, Oxford.
[104] Ahmed, F., Picard, C., Chen, W., McComb, C., Wang, P., Lee, I., Stankovic, T., Allaire, D., and Menzel, S., 2025, “Design by Data: Cultivating Datasets for Engineering Design,” Journal of Mechanical Design, 147(4), p. 040301.
[105] Felten, F., Apaza, G., Bäunlich, G., Diniz, C., Dong, X., Drake, A., Habibi, M., Hoffman, N. J., Keeler, M., Massoudi, S., VanGessel, F. G., and Fuge, M., 2025, “EngiBench: A Framework for Data-Driven Engineering Design Research,” Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2–7, 2025, https://openreview.net/forum?id=YowD33Q89V
[106] Pradas Gómez, A., Brahma, A., and Isaksson, O., 2026, “DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice,” doi: 10.5281/zenodo.18836517, https://github.com/alex-pradas/DUCTILE

SUPPLEMENTAL MATERIAL

S-1 Repository and Archived Data

All source code, configuration files, evaluation scripts, and session transcripts are available in the DUCTILE repository:
• GitHub: https://github.com/alex-pradas/DUCTILE
• Zenodo archive: 10.5281/zenodo.18836517
GitHub always contains the latest version, which is propagated to Zenodo for long-term academic archival.

S-2 IDE Setup for Engineers Interacting with the Agentic Application

Figure S-1 shows the integrated development environment as configured for the engineer sessions. The left panel displays the project file explorer with the working directory structure, the center panel shows the task description document, and the right panel shows the Claude Code agent interface ready to receive instructions.

Fig.
S-1 Screenshot of Engineer 2’s IDE setup for the experiment session

S-3 Agent Configuration

[Figure S-2 here: diagram of the agentic structure, showing the user, the Claude Code inference engine (Opus 4.6 with its context window and system prompt), built-in tools (web_search, bash, read_file, write_file, create_plan), the MCP knowledge-based repository serving the design practice (id: 1001, ver: 1), the file system holding the task description PDF, a previous analysis v1 and the OEM loads YAML v2, and the ductile-loads package on the PyPI index with llms.txt documentation on GitHub Pages.]

Fig. S-2 Implementation of an agentic structure for the use case. Note how the implementation combines Claude Code’s internal tools with in-house tools (blue), an architecture that minimizes architectural customization. In addition, the user experience is enhanced because only product data (in green) is required to customize the agent.

Listing S-1: Full CLAUDE.md system prompt as deployed in the engineer sessions.

```markdown
# CLAUDE.md

## Role

You are an engineering assistant supporting a stress engineer. Your sole
purpose is to help the engineer complete their task correctly and
efficiently.

## Task Scope

The engineer will present the task to perform. It will be an engineering
task. Searching the web for engineering references, ANSYS documentation,
Python package documentation, or anything directly relevant to solving
the engineer's task is acceptable.

## Methodology Compliance

Search for design practice documents in the MCP tools wherever mentioned.
A design practice defines the complete processing chain. You must ensure
every step is followed in the correct order. If the engineer skips a step
or proceeds out of sequence, flag it.

## Tool use

Whatever tool you select to do the job, use the latest version always,
unless specified by the user.

## First Interaction

When the conversation starts, ask the engineer whether they have used AI
coding agents before. Adapt your behaviour based on their answer:

- If they have experience: Briefly confirm you're ready to assist, then
  wait for them to drive the conversation.
- If they have not: Welcome them and offer a few example prompts to get
  started. Examples can range from questions on how to execute things,
  about the tools documentation, about methodology of the task or the
  execution of part or the totality of the task.

In either case, do not start processing anything until the engineer asks
you to.

Use your built-in capability to answer questions about Claude Code.

## Working Directory

Only read files within the current working directory. Do not navigate to
parent directories or read files outside this folder.

## Behavioural Guidelines

- Be proactive: If you notice something that doesn't match the design
  practice, raise it immediately. Do not wait for the engineer to ask.
- Be methodical: Work through the design practice steps in order. Suggest
  the next step when the current one is complete.
- Be transparent: When performing calculations or transformations, show
  your reasoning so the engineer can verify.
- Respect the engineer's expertise: You are an assistant, not the lead.
  Present findings and suggestions, but let the engineer make the final
  decisions.
- Track progress: Keep the engineer aware of which steps from the design
  practice have been completed and which remain.

## Task-Specific Configuration

The certified tool for this task is `ductile-loads`. If needed, fetch the
API reference from:

https://alex-pradas.github.io/ductile-loads/llms-full.txt
```

The agentic application is configured through a system prompt (CLAUDE.md), reproduced in full in Listing S-1, that defines the agent’s role, scope, and behavioural guidelines. Note that the only task-specific description is included at the end of the file (in green). In real practice, it would not be specified there but would instead point, in the tool-use section, to the location where all documentation is stored in-house. For complete transparency on the content of the prompt, be aware that Claude Code has its own system prompt that we did not deactivate for the experiment.

The agent is aware of the additional MCP server configurations via the file (.mcp.json), which is read automatically and loads into context the tools that provide access to version-controlled design practice documents. The executable tool itself (ductile-loads) is distributed as a Python package via PyPI, modelling the case company’s distribution through an internal package index. Readers are welcome to explore it by themselves. The demo repository automatically generates the human (html) and machine (llms.txt and .md files) documentation directly from the codebase.

S-4 Task Description

The following three-page document was provided to the engineer as a printed handout before the experiment sessions started. It defines the task scope, available resources, and briefing instructions.

Task Description
Loads Processing for Static Strength Analysis — Turbine Rear Structure

This document is intended to be read by the engineer and sets the task scope. Read this document carefully. Take your time and ask questions to the researcher as needed to clarify the scope or details of the task.

1. Background

You are a stress engineer working on the static strength analysis of a Turbine Rear Structure (TRS) in an aero engine.
The TRS is a structural component after the Low Pressure Turbine that transfers loads among its neighbouring components: LPT Case, Bearing House, Nozzle, Plug and three lugs (port, starboard and failsafe — active only when the port or starboard lugs fail).

Figure 1: (top) TRS and neighbouring components positioning on the jet engine and (bottom) neighbouring components in CAD exploded view

The programme is in the certification phase. The OEM customer has delivered a new loads update. Your task is to process these loads into input files that can be applied to the Finite Element Model (FEM) for structural analysis. The FE mesh and solver scripts have already been set up by another team member — running the FE model is not part of this task. The mesh geometry and interface point locations are shown in Figure 2. It also shows the equilibrium example of a load case when forces are applied at the interface points.

Task Description — Page 1 of 3

Figure 2: TRS interface locations showing one load case in equilibrium

2. Detailed Task Description

Your task is to process the new OEM load delivery (OEM_loads_v2.yaml) into APDL-ANSYS-compatible input files (.inp), following the methodology document loads_processing_design_practice (document 1001, version 1). The design practice contains all expected processing activities and must be followed in order. You will use Claude Code (an AI coding agent) to assist you with this task. The agent already has access to the design practice document (document 1001, version 1) — you do not need to provide it manually. Make use of the certified tool ductile-loads, available as a Python package on PyPI. Notice that the customer has delivered the files in a different format (YAML) than expected by the tool (JSON). Part of your task is to bypass this limitation. If there are any loads exceedances, let the researcher know.
OEM Correction Notice

The OEM has informed that, due to a human error in the Whole Engine Model export, all Fx force components in this delivery are understated. Apply a 1.04 multiplication factor to all Fx forces at every interface point before processing. The OEM has chosen not to issue a revised delivery; the correction shall be applied by the component owner during processing.

3. Available Resources

The following resources are available:

• This document — a printed copy for you. The agent can also read it from the working directory.
• Loads processing design practice — printed for you. The agent already has access to it. You do not need to give it to the agent.
• ductile-loads — the certified processing tool, available as a Python package on PyPI. The agent can access its documentation online. The documentation is also open in your browser for your reference.

Task Description — Page 2 of 3

• Previous analysis — script, inputs and outputs in the previous_run/ folder for reference.

Additionally, you or the agent can browse the internet and install Python packages via `uv pip install package`. At any time during the task execution you can ask the researcher for help or clarification, even halfway through the execution of the task.

4. Before You Start

Take your time. This briefing time does not count towards the task duration. Use it to familiarise yourself with the task, the design practice, the tool documentation, and the available files. Check that Claude Code is running in the VS Code terminal. Feel free to explore the working directory, browse the ductile-loads documentation in your browser, and review the previous analysis. Ask the researcher any questions about the task, the methodology, or the tools — there are no wrong questions. We will start the timed activity only when you confirm that you are ready.
Task Description — Page 3 of 3

S-5 Engineer 1: Full Session Transcript

Green — Engineer (user) messages
Orange — Agent responses and reasoning
Blue — Tool calls and file operations
Grey — Tool/command output
Teal — Engineer terminal execution

The engineer spent 10 minutes reviewing the printed task description, the input data files, and the design practice document before starting the session.

plan this: @ref_documents/task_description.pdf

Let me read the task description document.
Read 1 file (PDF)
Let me gather all the context needed to plan this task.
Bash(ls) working directory + previous_run
ToolSearch: load MCP tools
design-documents – get_document_content (MCP) (document_id: 1001, version: 1)
Read 4 files (v1 script, envelope.md, .inp, YAML)

Good, I now have all the context. Here's the plan:

Task Plan: Loads Processing for TRS Static Strength (v2)

Key Issues to Address
(1) Format mismatch — OEM delivered YAML; ductile-loads expects JSON.
(2) OEM Fx correction — Multiply all fx values by 1.04 before processing.
(3) Naming mismatch — v2 uses lug_left/lug_right instead of lug_port/lug_starboard.
(4) Units — klbs/klbs·in → N/N·m.
(5) Bearing exclusion — bearing must not appear in output .inp files.

Processing Chain (per Design Practice)
Step 1 Read & verify OEM delivery — load YAML, validate. Rename lug_left → lug_port, lug_right → lug_starboard.
Step 2 Apply OEM Fx correction — multiply all fx × 1.04.
Step 3 Verify coordinate system — engine CS = FEM CS, no transform.
Step 4 Convert units — convert_to("N") — klbs → N.
Step 5 Envelope downselection — .envelope().
Step 6 Write .inp files — to_ansys(..., exclude=["bearing"]).
Step 7a Summary envelope table — envelope_to_markdown().
Step 7b Exceedance comparison — compare_to() against v1.

Implementation. A single Python script (loads_processing_v2.py) using uv run --script with dependencies ductile-loads[all] and pyyaml.

Shall I proceed with execution?
Crunched for 1 min 39 s

yes

Let me start by fetching the ductile-loads documentation and examining the data more closely, then write the script.
WebFetch(https://alex-pradas.github.io/ductile-loads/llms-full.txt)
Read(previous_run/envelope_extremes.json)
Now I have everything. Let me write the processing script.
Write(loads_processing_v2.py) — 85 lines
Now let me run it.
Bash(uv run loads_processing_v2.py)

Delivery: Engine Mount Balanced Loads v2, version 2
Load cases: 100
Renamed lug_left/lug_right (200 renames)
Applied Fx correction factor 1.04
Units converted: klbs → N
Envelope downselection: 100 → 6 load cases
ANSYS .inp files written to limit_loads/
Exceedance check: new exceeds old = True
[output truncated]

All processing completed successfully. Let me verify the outputs.
Bash(ls limit_loads/)
Read(limit_load_2.inp)
Bash(ls comparison_report/)
Read(v1_vs_v2.json)

Processing complete. Here's the summary:

Design Practice steps completed
1. Read & verify — 100 load cases, 7 interfaces. Renamed lug_left → lug_port, lug_right → lug_starboard.
2. Fx correction — All Fx × 1.04.
3. Coord system — No transformation needed.
4. Unit conversion — klbs/klbs·in → N/N·m.
5. Downselect — 100 → 6 load cases (2, 20, 34, 61, 92, 99).
6. ANSYS export — 6 .inp files in limit_loads/ (bearing excluded).
7a. Envelope table — envelope.md + envelope_extremes.json.
7b. Exceedance — Comparison report + charts in comparison_report/.

Exceedance Alert. New loads exceed the previous v1 envelope. The largest percentage changes:
• Mz min across all points: ∼+30%
• Fx min at bearing/lpt/nozzle/plug: ∼+15%
• Fy max/min at lug_port: +11–17%
• Fy max at lug_starboard: ∼+11%
• Fx max at lug_starboard: ∼+14%
These load increases may impact existing margins of safety and should be flagged to the stress lead.

Cogitated for 2 min 53 s

The engineer spent 6 minutes reviewing the generated Python script and the output files. After confirming the results were correct, the engineer was satisfied with the task execution.
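The early steps of the processing chain the agent scripted (rename, Fx correction, unit conversion) can be illustrated with a minimal, self-contained sketch. This is not the ductile-loads API: the function name, the dict-based data layout, and the helper constants below are hypothetical stand-ins, and the klbf-to-newton factor is the standard conversion 1 lbf = 4.4482216152605 N.

```python
# Hypothetical sketch of design-practice steps 1, 2 and 4; NOT the
# ductile-loads API (which provides convert_to, envelope, to_ansys, ...).

KLBF_TO_N = 4448.2216152605  # 1 klbf = 1000 lbf; 1 lbf = 4.4482216152605 N
FX_CORRECTION = 1.04         # OEM correction notice: Fx components understated
RENAMES = {"lug_left": "lug_port", "lug_right": "lug_starboard"}

def process_load_case(case: dict) -> dict:
    """Apply renames, Fx correction and unit conversion to one load case.

    `case` maps interface-point name -> {"fx": ..., "fy": ..., "fz": ...}
    with forces in klbf, as in the v2 OEM delivery.
    """
    out = {}
    for point, forces in case.items():
        name = RENAMES.get(point, point)                # step 1: naming convention
        corrected = dict(forces)
        corrected["fx"] = forces["fx"] * FX_CORRECTION  # step 2: OEM Fx factor
        # step 4: convert klbf -> N (step 3, the coordinate check, is a no-op here)
        out[name] = {k: v * KLBF_TO_N for k, v in corrected.items()}
    return out

case = {"lug_left": {"fx": 1.0, "fy": 2.0, "fz": 0.5}}
processed = process_load_case(case)
```

In the real sessions these steps ran inside a single `uv run` script; the sketch only makes the order of operations (rename and correct before converting units) explicit.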
S-6 Engineer 2: Full Session Transcript

The transcript below is a faithful representation of the interactions between the engineer and the agent. Message content has been summarized for readability; tool calls and outputs are abbreviated. Full machine-readable transcripts (JSONL) are available in the repository.

The engineer spent 8 minutes reviewing the printed task description and the working directory before starting the session.

Can you create a python file to translate the YAML to a JSON format?

Let me read the YAML file to understand its structure.
Read(OEM_loads_v2.yaml)
Write(yaml_to_json.py) — 20 lines
Created yaml_to_json.py. Run it with uv run yaml_to_json.py.

$ uv run yaml_to_json.py
Converted OEM_loads_v2.yaml → OEM_loads_v2.json

ok, now create a file that uses ductile-loads to read the json

Let me fetch the ductile-loads API reference first.
WebFetch(https://alex-pradas.github.io/ductile-loads/llms-full.txt)
Write(read_loads.py) — 18 lines

$ uv run read_loads.py
Name: Engine Mount Balanced Loads v2
Units: klbs, klbs.in | Load cases: 100
[output truncated]

modify the file to convert to newtons, and also rename lug_right to pilot_lug_starboard and lug_left to pilot_lug_port

Edit(read_loads.py) — rename + convert_to("N")
I used rename_points() based on the API docs. Let me verify that method exists.

yes check

WebFetch(ductile-loads/llms-full.txt)
No rename_points method exists. I'll rename by iterating through load cases directly.
Edit(read_loads.py) — iterate + convert_to("N")

$ uv run read_loads.py
Units: N, Nm | pilot_lug_port, pilot_lug_starboard
[output truncated]

ah, I see there are two pilot_ pilot... remove the extra one we added

Which rename do you want to remove — pilot_lug_port or pilot_lug_starboard?
just remove the prepend pilot_

Edit(read_loads.py) — lug_left → lug_port, lug_right → lug_starboard

$ uv run read_loads.py
lug_port, lug_starboard — confirmed
[output truncated]

how does the envelope in this python package work?

The envelope() method reduces your 100 load cases to only the critical ones — for each point and component it selects the max value (always) and the min value (only if negative). Would you like me to add it?

yes

Edit(read_loads.py) — add .envelope() + JSON export

ok. add now the export to ansys

Edit(read_loads.py) — add to_ansys()

$ uv run read_loads.py
Envelope: 6 load cases | ANSYS files written to limit_loads/
[output truncated]

The researcher verbally reminds the engineer that the agent can access and check the design practice on its own.

ok. I think I am done. Can you check against the design practice 1001 v1?

design-documents – browse_catalog (MCP)
design-documents – get_document_content (MCP) (document_id: 1001, version: 1)

Comparing against DP-TRS-LOADS-001 Rev 1, divergences found:
(1) Bearing — must be excluded from output.
(2) lug_fairlead — should be lug_failsafe.
(3) Exceedance comparison — step 7.2 requires comparison against v1.
Would you like me to fix these?

yes fix please

Glob(previous_run/*)
Read(previous_run/OEM_all_loads_v1.json)
Edit(read_loads.py) — exclude bearing, rename lug_failsafe, add compare_to(v1)

can you run it?

Bash(uv run read_loads.py)
Envelope: 6 cases, 5 points (bearing excluded)
WARNING: v2 exceeds v1 envelope
ModuleNotFoundError: matplotlib

Needs matplotlib. Let me add the [all] extra to the dependency.
Edit(read_loads.py) — ductile-loads[all]
Bash(uv run read_loads.py)
Envelope: 6 cases, 5 points
WARNING: v2 exceeds v1 envelope
Comparison report generated in comparison_report/

ok, I am happy. One final check against the design practice?

All processing steps from DP-TRS-LOADS-001 are covered: format conversion, naming, units, envelope, ANSYS export, and exceedance comparison.
ah, I forgot to multiply fx loads only by 1.04. Can you do that?

Edit(read_loads.py) — add Fx × 1.04 before unit conversion

$ uv run read_loads.py
Fx correction applied | Envelope: 6 cases
WARNING: v2 exceeds v1 envelope
[output truncated]

The engineer checked the output for 3 minutes and considered the task complete. Total session duration: 22 minutes. The engineer executed scripts 8 times independently, driving the process step by step.
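The envelope rule the agent explained to Engineer 2 — for each interface point and force component, keep the load case giving the maximum value always, and the one giving the minimum value only when it is negative — can be sketched in a few lines. This is a self-contained illustration under assumed names and data layout, not the ductile-loads `.envelope()` implementation.

```python
# Hypothetical sketch of the envelope downselection rule; names and the
# dict-based layout are illustrative, not the ductile-loads API.

def envelope_case_ids(cases: dict) -> set:
    """Return ids of load cases that are critical for any point/component.

    `cases` maps case id -> point -> component -> value.
    """
    critical = set()
    points = {p for c in cases.values() for p in c}
    components = {k for c in cases.values() for f in c.values() for k in f}
    for point in points:
        for comp in components:
            values = {cid: c[point][comp] for cid, c in cases.items()
                      if comp in c.get(point, {})}
            if not values:
                continue
            critical.add(max(values, key=values.get))   # max value: always kept
            min_id = min(values, key=values.get)
            if values[min_id] < 0:                      # min value: only if negative
                critical.add(min_id)
    return critical

cases = {
    "lc1": {"lug_port": {"fx": 10.0, "fy": -3.0}},
    "lc2": {"lug_port": {"fx": 4.0, "fy": 1.0}},
    "lc3": {"lug_port": {"fx": 7.0, "fy": 0.5}},
}
# lc1 is critical (max fx, negative min fy); lc2 is critical (max fy);
# lc3 is never extreme and is dropped.
```

This is the mechanism by which both sessions reduced 100 delivered load cases to the 6 that drive the structural margins.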