Paper deep dive
A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance
Ciprian Paduraru, Petru-Liviu Bouruc, Alin Stefanescu
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 5:55:11 AM
Summary
The paper introduces a trace-based assurance framework for Agentic AI systems, utilizing Message-Action Traces (MAT) to enable contract-based runtime monitoring, stress testing via counterexample search, and governance through policy mediation. It defines a failure taxonomy covering coordination collapse, error propagation, role drift, interface-driven injection, and misuse, providing a structured approach to evaluate multi-agent LLM systems.
Entities (5)
Relation Signals (3)
Message-Action Traces (MAT) → enables → Deterministic Replay
confidence 95% · Contracts provide machine-checkable verdicts, localize the first violating step, and support deterministic replay.
Governance Mechanisms → enforces → Capability Limits
confidence 95% · Governance is treated as a runtime component, enforcing per-agent capability limits and action mediation.
Stress Testing → identifies → Contract Violations
confidence 90% · The framework includes stress testing, formulated as a budgeted counterexample search over bounded perturbations.
Cypher Suggestions (2)
Find all failure modes defined in the framework · confidence 90% · unvalidated
MATCH (f:FailureMode) RETURN f.name, f.description
Map components of the assurance framework · confidence 85% · unvalidated
MATCH (c:Component)-[:PART_OF]->(f:Framework {name: 'Assurance Framework'}) RETURN c.name, c.type
Abstract: In Agentic AI, Large Language Models (LLMs) are increasingly used in the orchestration layer to coordinate multiple agents and to interact with external services, retrieval components, and shared memory. In this setting, failures are not limited to incorrect final outputs. They also arise from long-horizon interaction, stochastic decisions, and external side effects (such as API calls, database writes, and message sends). Common failures include non-termination, role drift, propagation of unsupported claims, and attacks via untrusted context or external channels. This paper presents an assurance framework for such Agentic AI systems. Executions are instrumented as Message-Action Traces (MAT) with explicit step and trace contracts. Contracts provide machine-checkable verdicts, localize the first violating step, and support deterministic replay. The framework includes stress testing, formulated as a budgeted counterexample search over bounded perturbations. It also supports structured fault injection at service, retrieval, and memory boundaries to assess containment under realistic operational faults and degraded conditions. Finally, governance is treated as a runtime component, enforcing per-agent capability limits and action mediation (allow, rewrite, block) at the language-to-action boundary. To support comparative evaluations across stochastic seeds, models, and orchestration configurations, the paper defines trace-based metrics for task success, termination reliability, contract compliance, factuality indicators, containment rate, and governance outcome distributions. More broadly, the framework is intended as a common abstraction to support testing and evaluation of multi-agent LLM systems, and to facilitate reproducible comparison across orchestration designs and configurations.
Tags
Links
- Source: https://arxiv.org/abs/2603.18096v1
- Canonical: https://arxiv.org/abs/2603.18096v1
Full Text
76,140 characters extracted from source content.
A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance
Ciprian Paduraru, Petru-Liviu Bouruc, Alin Stefanescu
I Introduction
Modern enterprise workflows using an Agentic AI approach are increasingly mediated via LLM-orchestrated multi-agent systems. In these systems, an orchestration layer decomposes user intent into a plan, delegates subtasks to specialized agents, and executes actions through external services such as APIs, databases, and messaging systems. Production-focused reports highlight that evaluation, reliability, and debugging remain persistent bottlenecks once such systems interact with real services and untrusted inputs [1, 2]. In this paper, we focus on multi-agent LLM systems: multiple LLM-driven agents that interact via messages and shared memory and may invoke external services (e.g., APIs, databases, and messaging). The coordinating component, termed the orchestrator (orchestration layer), performs routing and role selection, manages shared memory/context, and mediates the translation from language-level decisions into concrete tool actions.
Motivating scenario. Consider a customer support triage assistant deployed in an enterprise helpdesk. It ingests an inbound email (untrusted text), fetches customer history from an internal database, queries shipping status from a logistics API, drafts a reply, and updates internal tickets. In realistic operation, assurance must account for four recurring stressors:
• Untrusted inputs: emails or retrieved snippets may include indirect prompt injection attempts.
• Integration faults: external services may time out, return partial responses, or serve stale cached data.
• Policy constraints: privacy rules for personally identifiable information (PII) and communication consent rules (e.g., opt-out lists).
• Long-horizon coordination: handoffs between planner, verifier, and action roles may drift, loop, or deadlock across multi-step workflows.
In this context, a test must evaluate not only the final answer but also the sequence of tool actions and side effects under stochastic decisions and imperfect services.
This shifts the testing target from output-only oracles to trace-level properties (termination, policy compliance, containment), monitored during execution. The framework proposed in this work integrates and adapts three directions that are often treated separately: (i) contract-based runtime monitoring over executions, grounded in runtime verification [3, 4]; (ii) robustness testing under realistic perturbations and boundary faults, consistent with chaos engineering and production-oriented agent evaluations [5, 1]; and (iii) governance controls at the language-to-action boundary, motivated by guidance and analyses on prompt injection, excessive agency, and tool-interface risk [6, 7, 8]. Figure 1 summarizes how these elements are organized into a single assurance workflow for multi-agent orchestration. Compared to observability-only tracing or standalone safety checks, the framework treats monitoring, stress testing, and governance as jointly testable through a shared trace-and-contract interface. The intent is not to prescribe a specific agent architecture, but to provide a testing-oriented abstraction that can be adopted across implementations to reason about failures, robustness, and governance consistently. The contributions are summarized below:
• System model and failure taxonomy for multi-agent systems. A lightweight model and taxonomy covering coordination failures (loops and deadlocks), error propagation across agents, role drift in long-horizon workflows, and failures introduced by external services and shared memory.
• Message-Action Traces (MAT) as a contract-carrying execution record. A trace representation that records each run as a sequence of typed steps (user/agent messages, tool calls, memory reads and writes, delegation, termination), augmented with provenance and contract verdicts. The resulting record supports replay and localization of the first violating step, and serves as the common trace representation used across the pipeline in Fig. 1.
• Stress testing as counterexample search under a budget. A testing formulation that searches for small, realistic perturbation schedules and boundary-fault operators that trigger contract violations under an explicit cost budget. In this context, a contract [9] is a machine-checkable predicate over a step-level MAT record or over a trace (or prefix) that encodes an expected property (e.g., authorization, verification-before-action, progress, containment). The outcome is a replayable counterexample schedule together with the first violated contract and a localized trace region. This supports debugging, regression testing, and coverage-guided suite construction over contract IDs and interface/action signatures.
• Governance mechanisms that constrain delegated authority and make outcomes measurable. A set of runtime controls including capability restriction (least privilege via per-agent capability sets) and action-time policy mediation (allow, rewrite, block), with risk-aware routing and escalation to a verifier or human approval step when required by policy. Trace-derived metrics are defined to characterize governance behavior, including success and contract-failure rates, containment rate, and distributions over mediator outcomes.
The paper focuses on the methodology, definitions, and metrics required to instantiate an executable evaluation protocol; a full empirical study is ongoing but is not reported here.
Figure 1: Pipeline overview of the assurance framework. Colors indicate the four layers and their roles. The system under test (SUT, green) is the deployed multi-agent LLM system: an orchestrator coordinating an agent pool, together with the runtime governance boundary L4 (blue). The diagram uses a centralized orchestrator for clarity; the same instrumentation and controls apply to decentralized variants (e.g., peer-to-peer agents) by treating the current decision maker as the acting role at step t.
The operational environment (green cloud) is depicted only through tool, retrieval, and memory interfaces, since the framework evaluates end-to-end integration behavior and integration failures. L2 stress testing (orange) draws task instances x ∼ 𝒳 and applies bounded perturbations δ to inputs and context. During execution, the acting role proposes an action a_t; L4 mediates the proposed external action using step contracts ℐ^step and a policy shield Π, yielding allow, rewrite (to a governed action ã_t), or block (red dashed feedback) to prevent unsafe side effects. L3 (red) injects controlled faults at the same external interfaces to exercise realistic integration disturbances. The resulting observation o_{t+1} (tool output, retrieved evidence, or error) closes the interaction loop. In parallel, L1 (yellow) records Message-Action Trace entries and evaluates trace contracts ℐ^trace over prefixes, localizing the first violation and emitting a replay record for debugging and regression testing.
II Related Work
Research on agentic AI, in particular multi-agent LLM systems, has progressed quickly in building orchestration frameworks and evaluation tools. Less progress has been made on methods that can specify, monitor, and enforce properties of long-running multi-agent executions, especially when external services, retrieval, and shared memory are involved.
II-A Frameworks and observability for agentic systems
AutoGen [10] supports multi-agent collaboration through structured conversation patterns. LangGraph [11] offers stateful workflows with cycles, retries, and human checkpoints. For debugging and evaluation, LangSmith [12] records intermediate steps and provides scoring and analysis tools. For retrieval-augmented generation, RAGAS [13] provides metrics such as faithfulness and context relevance.
These systems make development and diagnosis easier, but they typically do not enforce properties such as termination, invariant preservation, or role stability under stochastic execution.
II-B Benchmarks for capability, security, and misuse
AgentBench [14] and GAIA [15] evaluate task competence in interactive and assistant-style settings. They are useful for measuring what agents can do, but they are not designed to systematically test robustness under small perturbations or to expose multi-agent failure patterns such as delegation loops and error cascades. Security-focused benchmarks study indirect prompt injection in agent settings. BIPIA [16] targets indirect prompt injection for LLM applications. AgentDojo [17] provides realistic tasks with attack and defense variants. WASP [18] studies web agents under hijacking-style objectives. InjecAgent [19] evaluates prompt injection threats and defenses for agents integrated with external services. Misuse-oriented benchmarks extend evaluation beyond benign reliability. AgentHarm [20] measures harmful behavior in multi-step tasks. HarmBench [21] evaluates refusal and robustness to adversarial prompting. The study in [22] provides an empirical characterization of breakdowns in multi-agent LLM workflows, emphasizing recurrent patterns such as coordination and non-termination behaviors, verification and oversight gaps, role-level drift over long horizons, and error propagation across agents. Our failure taxonomy (cf. F1-F5 from Sec. III) is aligned with these observations but is structured to support monitored, trace-level evaluation: F1 captures coordination collapse and non-termination; F2 formalizes unsupported-claim propagation as a trace-conditioned phenomenon; and F3 captures role drift as boundary violations against role-specific constraints.
In addition, our taxonomy makes two dimensions explicit that are not central in [22]: (a) interface-driven compromise and poisoning via external services, retrieval, or shared memory (F4), motivated by prompt-injection and excessive-agency analyses; and (b) misuse-oriented harmful task execution (F5), motivated by safety and misuse evaluations. This mapping positions [22] as an empirical basis for what fails, while our taxonomy specifies failure conditions in a form suitable for contracts, stress testing, and governance measurement.
II-C Runtime constraints, monitoring, and interface risk
Contract-based testing has been studied in the context of web services [9], where contracts specify admissible input-output behavior and interaction protocols. We generalize this notion to agentic AI systems by specifying contracts as machine-checkable boolean predicates over an agentic execution, defined on individual steps or finite trace prefixes (e.g., "tool calls with side effects must be preceded by a verification action within h steps"). Contracts allow temporal, cross-agent, and interface-level properties to be specified independently of model internals. NeMo Guardrails [23] provides programmable mechanisms to apply policy constraints, often as dialogue and safety controls. Runtime verification offers a formal basis for monitoring executions against specifications [3, 4], motivating contract-based monitoring for complex workflows. External service interfaces introduce additional risk. The Model Context Protocol (MCP) specification [24] and its security best practices [7] discuss trust and safety concerns when connecting models to tools and data. Unit 42 [8] reports prompt injection-style attack vectors in MCP deployments. These results motivate capability limits and policy mediation at the point where an internal plan can trigger an external side effect.
Failure modes and their operationalization.
Recent analyses and benchmarks collectively highlight recurrent failure patterns in agentic systems, including non-termination and coordination breakdowns, error amplification across steps, role confusion, injection and interface poisoning, and harmful task completion [22, 6, 16, 17, 18, 19, 20, 21, 7, 8]. Our taxonomy in Sec. III aligns these observations with trace-level conditions that are directly monitorable: coordination and termination failures (F1), unsupported-claim propagation (F2), role drift and boundary violations (F3), interface-driven injection (F4), and misuse outcomes (F5). The goal is not to re-categorize prior work, but to provide a compact set of monitored failure classes that can be exercised by stress testing and fault injection and measured under governance controls.
II-D Gap
Most existing work focuses on either building agent systems or observing them after the fact. What is missing is a single assurance approach that combines: (i) traces enriched with contracts that can be checked at runtime, (ii) stress testing that searches for small, realistic perturbations that cause contract violations, (iii) fault injection across external services, retrieval, and shared memory, and (iv) governance that limits authority and routes high-impact actions through policy and, when needed, human approval. This paper proposes such a framework to narrow the existing gaps.
III System Model and Failure Taxonomy for Agentic Systems
We summarize prior reports of agentic failure patterns into five operational classes, each defined as a trace-level condition that can be monitored and stress-tested.
III-A System model
An agentic system can be viewed as a stochastic transition system with an explicit external environment and a mutable shared state, coordinated via an orchestrator. The orchestrator routes between agents, manages shared memory and context, and mediates external tool actions. Let 𝒜 = {1, …, m} be the set of agents. At step t, the global state s_t ∈ 𝒮, Eq.
(1), summarizes:
• shared memory M_t,
• per-agent local contexts C_t^i (role prompt, conversation, working notes),
• orchestration control state G_t (for example the current node in an execution graph, retry counters, budgets),
• external environment state E_t (for example service sessions, database state, network conditions).
s_t = (M_t, C_t^1, …, C_t^m, G_t, E_t).   (1)
In practice, assurance instrumentation stores only a filtered projection of this state (for example IDs, hashes, and redacted parameters) that is sufficient for monitoring, which reduces retention of sensitive prompt and service content. An action a_t denotes an externally meaningful step such as emitting a message, calling an external service with parameters, reading or writing memory, delegating to another agent, or terminating a run. Executing a_t yields an observation o_{t+1} (service response, retrieved passages, error code) and a next state sampled from the environment dynamics:
s_{t+1} ∼ P(· | s_t, a_t),   a_t ∼ π_θ^{i_t}(· | C_t^{i_t}, M_t),
where i_t is the selected agent at step t and π_θ^i denotes the stochastic policy induced by an LLM under role instructions and prompt context. This view emphasizes that failures arise from non-deterministic, long-horizon interaction (sampling, external service nondeterminism, partial observability), rather than fixed control flow [22, 1]. One execution is represented as a finite trace:
τ = (s_0, a_0, o_1, s_1, …, s_T),
where T is the configured horizon or termination time. Let Term(τ) = 1 if a terminal action occurs within T steps (successful completion, safe abort, or explicit failure return). Failures are violations of properties over steps or over the whole trace.
Two cases are distinguished:
• local step failures, where a single transition violates a constraint,
• workflow failures, where termination, safety, or correctness properties are violated over the trace.
This aligns with runtime verification, which instruments executions and monitors traces against specifications [3, 4].
III-B Failure taxonomy
Recent empirical work reports fine-grained multi-agent failure modes that include specification and system design issues, inter-agent misalignment, and breakdowns in verification and termination [22]. These are compressed into operational classes that can be tested under the trace model above, and extended with security-focused classes emphasized by agent security benchmarks and guidance [17, 18, 25].
F1: Coordination collapse and non-termination.
Operational definition: the system fails to make progress and does not terminate within the configured horizon, due to deadlock, oscillation, or circular delegation. Termination and verification failures are a major category in multi-agent systems [22]. Common subtypes include circular delegation, deadlock on approvals, repeated replanning without execution, and collisions in shared memory. Orchestrators that support cycles and retries increase expressiveness, but also increase the need for termination and invariant checks [11]. In the framework, collapse is detected using a non-negative potential function Φ: 𝒮 → ℝ_{≥0} that encodes remaining work using observable orchestration signals, possibly weighted (for example Φ(s_t) = #unresolved subtasks + #pending approvals + #active retries). Collapse is flagged as a sufficient condition over a sliding window of length w when the potential does not decrease and the system does not terminate:
∀ t ∈ [k, k+w]: Φ(s_{t+1}) ≥ Φ(s_t) ∧ ¬Term(τ).   (2)
F2: Error amplification and unsupported-claim propagation.
Operational definition: an upstream factual error is accepted as an assumption and subsequently drives downstream actions or final outputs. A local factual error can become a system-level failure when an upstream output is treated as authoritative by downstream agents or by the orchestrator [22]. Let the final response be decomposed into atomic claims {c_j} (minimal verifiable propositions), extracted by a lightweight claim splitter (rule-based or model-assisted). Each claim is linked to provenance evidence {e_j} recorded in the MAT trace (retrieved passages, external service outputs, database row identifiers). Define support(c_j, e_j) = 1 when the evidence provides sufficient justification for the claim under a chosen verifier (e.g., entailment checking, exact result matching, or a constrained LLM judge), and support(c_j, e_j) = 0 otherwise. Let 𝒞 = {1, …, J} index the extracted claims. The unsupported claim rate is:
H_rate = (1/J) |{ j ∈ 𝒞 : support(c_j, e_j) = 0 }|.
To capture propagation, mark whether an unsupported claim becomes an input assumption for later actions (e.g., service calls, approvals, or downstream instructions). Let use(c_j, τ) = 1 if claim c_j is referenced in subsequent steps. Then:
H_prop = (1/J) |{ j ∈ 𝒞 : support(c_j, e_j) = 0 ∧ use(c_j, τ) = 1 }|.
Propagation is most harmful when high-impact claims are not re-checked against external services or retrieval before driving actions [1].
F3: Role drift and boundary violations.
Operational definition: an agent deviates from its assigned role by taking unauthorized actions or failing role-specific obligations over the execution horizon. Role confusion arises when an agent deviates from its declared role or when role boundaries collapse [22]. Let each agent i have a role contract ℛ_i = (U_i, O_i) specifying allowed actions U_i (a subset of the action space) and obligations O_i (e.g., cite sources, only propose external calls, or avoid external side effects). A role violation occurs when a_t ∉ U_{i_t}. Obligation violations are captured by explicit contracts (Sec. IV, L1).
A compact per-agent trace-level drift score is:
D_i(τ) = (1/|T_i|) Σ_{t ∈ T_i} 𝟙[a_t ∉ U_i],
where 𝟙[·] denotes the indicator function and T_i = {t : i_t = i}. Drift becomes more likely as context accumulates, instructions interfere, or injected content changes effective constraints.
F4: Tool/memory injection and interface poisoning.
Operational definition: untrusted text entering via memory or external interfaces alters the agent policy so that it selects an unsafe action. Shared memory and external interfaces create direct attack surfaces. OWASP [6] identifies Prompt Injection and highlights Excessive Agency. Benchmarks and analyses show that prompt injection against agents that call external services is practical and defenses remain incomplete [16, 17, 18, 19]. MCP security guidance recommends treating service integration as a system design problem, with explicit limits and checks, rather than relying on prompt-based defenses alone [7]. Injection is modeled as a bounded adversarial perturbation δ applied to untrusted content entering the system via memory or external interfaces (e.g., retrieved text, tool outputs, or stored notes). Let M′_t = inject(M_t, δ) denote a poisoned memory state (similarly for poisoned interface descriptions or metadata). The attack succeeds if the induced policy selects an unsafe action a_t ∈ U_unsafe, where U_unsafe denotes actions that violate security or safety policy (e.g., exfiltrating sensitive data, unauthorized service use, or irreversible external actions). Excessive Agency is the condition where the system is granted enough permissions that plausible perturbations or ordinary service failures can lead to unsafe external actions. In the trace model, a simple operational signal is:
Pr(a_t ∈ U_unsafe) > ε,   (3)
for some small but meaningful ε within the allowed perturbation budget.
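As a minimal illustration of the drift score D_i(τ) above (the `Step` record layout and function names here are assumptions for the sketch, not artifacts of the paper), the score can be computed directly from a recorded trace:

```python
from dataclasses import dataclass

# Hypothetical per-step record: which agent acted and what action it took.
@dataclass
class Step:
    agent: str
    action: str

def drift_score(trace, agent, allowed):
    """D_i(tau): fraction of agent i's steps whose action falls outside
    its allowed action set U_i (role-contract violation rate)."""
    own = [s for s in trace if s.agent == agent]
    if not own:
        return 0.0  # agent never acted; no drift observable
    violations = sum(1 for s in own if s.action not in allowed)
    return violations / len(own)

# Example: a planner role allowed only to plan and delegate.
trace = [
    Step("planner", "plan"),
    Step("planner", "delegate"),
    Step("executor", "tool_call"),
    Step("planner", "tool_call"),  # out-of-role action -> counted as drift
]
print(drift_score(trace, "planner", {"plan", "delegate"}))  # 1 of 3 planner steps
```

In practice the allowed sets U_i would come from the per-agent role contracts ℛ_i, and the indicator would be evaluated by the L1 monitor rather than post hoc.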
Standardized integration protocols can increase exposure by making it easier to connect many services and endpoints [24, 7, 8].
F5: Misuse and harmful task execution.
Operational definition: the system completes a harmful multi-step objective or performs a harmful external action without triggering refusal, containment, or escalation. Misuse-oriented benchmarks motivate explicit evaluation along this dimension [20, 21]. Operationally, a misuse failure is recorded when the run reaches a harmful end state (or performs a harmful external action) while governance and contracts remain unsatisfied or bypassed.
IV Framework: Trace Contracts, Adversarial Testing, and Governance
This section presents a trace-based assurance framework for multi-agent LLM systems that interact with external services, retrieval, and shared memory. The deployed System Under Test (SUT) comprises the orchestrator, agent pool, and the runtime governance mediator (L4), since these jointly determine which external actions are executed. Layers L1–L3 constitute the assurance harness that instruments execution, applies perturbations and faults at interfaces, and checks contracts; the external environment is exercised through tool/retrieval/memory interfaces. Figure 1 provides a pipeline view of how the assurance harness exercises and audits executions. A task instance x is executed under configuration κ (roles, topology, contracts, and governance settings) and stochastic seed z, while the harness applies a perturbation or fault schedule δ. The run emits Message-Action Trace records that are checked against step and trace contracts, producing localized violations and a replay record for debugging and regression testing. Layer definitions correspond to Sec. IV-B–IV-E.
Layered responsibilities.
The responsibilities of the four layers are:
(i) L1 instruments each run as Message-Action Trace records with provenance and step/trace contract verdicts, enabling monitoring, localization of the first violating step, and replay.
(ii) L2 performs stress testing by searching for low-cost perturbation schedules δ (and counterexamples δ⋆) that trigger contract violations under a budget, as in Eq. 5.
(iii) L3 injects structured boundary faults at external interfaces (services, retrieval, memory) using a fault schedule and checks the containment requirement (Eq. 10).
(iv) L4 governs external actions at the language-to-action boundary by enforcing per-agent capability sets K_i and mediating tool calls via the policy shield Π (allow / rewrite / block), as in Eq. 11.
IV-A Assurance as monitored traces under perturbations
Assurance for agentic systems can be formulated as trace-based verification under environment perturbations and tool faults. Let x denote a task instance and let τ(x, δ, Π) be the resulting execution trace when the system is exposed to a perturbation schedule δ and mediated by a governance policy Π. Let K_s = |ℐ^step| and K_τ = |ℐ^trace| denote the number of step and trace contracts, respectively. The setting assumes:
• a finite set of step contracts ℐ^step = {I^step_1, …, I^step_{K_s}} evaluated on Message-Action Trace records (local invariants);
• a finite set of trace contracts ℐ^trace = {I^trace_1, …, I^trace_{K_τ}} evaluated on full traces or prefixes (workflow/temporal properties);
• a perturbation/operator space Δ capturing prompt ambiguity, tool latency/failures, retrieval noise, and memory/tool-channel injection;
• a governance policy Π that mediates tool execution (allow/rewrite/block) and may impose capability constraints.
Let 𝒯 denote the set of finite MAT traces emitted by the instrumented system under a fixed horizon T.
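The contract sets just introduced can be represented as plain predicates, and the budgeted search over Δ becomes a minimization over perturbations whose induced trace violates some trace contract. A minimal sketch, under assumed toy representations (a trace as a list of action names; `run`, `verified_before_action`, and the perturbation labels are invented for illustration, and the search is brute force over a finite Δ):

```python
def fail(trace, trace_contracts):
    """Fail(tau, I^trace) = 1 iff any trace contract evaluates to 0."""
    return any(not c(trace) for c in trace_contracts)

def cheapest_counterexample(run, perturbations, trace_contracts, cost):
    """Budgeted counterexample search: return the lowest-cost delta whose
    induced trace violates some trace contract, or None if none does."""
    failing = [d for d in perturbations if fail(run(d), trace_contracts)]
    return min(failing, key=cost, default=None)

# Toy trace contract: any side-effecting tool call must be preceded
# by a verification step somewhere earlier in the trace.
def verified_before_action(trace):
    for i, a in enumerate(trace):
        if a == "tool_call" and "verify" not in trace[:i]:
            return False
    return True

def run(delta):
    # delta == "drop_verify" simulates a perturbation that removes the
    # verification step from an otherwise well-formed workflow.
    base = ["plan", "verify", "tool_call", "terminate"]
    return [a for a in base if not (delta == "drop_verify" and a == "verify")]

delta_star = cheapest_counterexample(
    run, ["none", "drop_verify"], [verified_before_action],
    cost=lambda d: len(d))
print(delta_star)  # "drop_verify" is the cheapest violating perturbation
```

A real harness would replace the brute-force loop with the budgeted search of the framework and replay δ⋆ together with the first violated contract ID.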
A run fails if any trace contract is violated (i.e., violation of a single hard safety or correctness constraint is sufficient):

Fail(τ, I^trace) = 1[ ∃ k ∈ {1, …, K_τ} : I^trace_k(τ) = 0 ]. (4)

Assurance can be posed as counterexample search with an explicit perturbation cost: find the lowest-cost δ ∈ Δ that causes at least one trace contract to fail,

δ⋆ ∈ argmin_{δ∈Δ} cost(δ) s.t. Fail(τ(x, δ, Π), I^trace) = 1. (5)

When δ is a schedule δ_{0:T}, the total cost is aggregated as cost(δ) = Σ_{t=0}^{T} cost(δ_t).

IV-B L1: MAT as contract-enriched instrumentation

Standard logs capture events but rarely provide semantic accountability: which role made which claim, based on which evidence, under which constraints, and with what verification outcome. Executions are instrumented as Message-Action Traces, where each step is a typed record enriched with (i) provenance and (ii) explicit, checkable contracts. This follows runtime verification (monitor executions against specifications), adapted to agentic workflows where language, tools, and memory jointly determine behavior [3, 4].

MAT record and step semantics. At step t, a controller (centralized or distributed) selects an acting agent i_t and produces an action a_t (message, tool call, memory update, delegation, or termination). The environment returns an observation o_{t+1} (tool result, retrieved context, user reply), and the system updates its internal state. The system emits:

r_t = ⟨ t, i_t, role(i_t), ŝ_t, a_t, o_{t+1}, prov_t, I^step_t, verdict_t ⟩. (6)

Here, I^step_t is the set of step contracts checked at time t (selected based on the action type). The field verdict_t stores the result of these checks, e.g., whether the step passed and which contract IDs were violated.

Provenance for factuality and audit.
Let prov_t be the set of provenance links recorded at step t. Each link has the form (src, rel, dst), where src and dst are trace artifacts (IDs for a claim, retrieved passage, tool call, tool result, or memory entry), and rel is the link type (e.g., supports, returns, derived_from).

Contract interface and selection. Let R denote the space of MAT records of the form in Eq. (6). Step contracts are boolean predicates on the current record, I^step_k : R → {0, 1}, while trace contracts are evaluated on the trace (or a prefix), I^trace_k : 𝒯 → {0, 1}. At each step, a relevant subset I^step_t ⊆ I^step is evaluated, determined by the action type: tool calls trigger policy checks; memory writes trigger sanitization; final responses trigger factuality and PII checks, such as detection of personal identifiers against policy-specific allowlists (permitted identifiers) or redaction rules (masking or removal when disclosure is not allowed). Trace contracts in I^trace are checked on prefixes τ_{0:t} and on termination.

Verdicts and localization. Monitoring outcomes are recorded as:

verdict_t = ⟨ pass_t, violations_t, severity_t, mitigation_t ⟩, (7)

where violations_t is the set of violated contract IDs, severity_t is a discrete level (e.g., soft vs. hard), and mitigation_t records the response (e.g., retry, replan, sandbox, escalate, block). This localizes failures to specific steps and agents and supports replay.

Example base contract templates.
Contracts can be drawn from a small library of templates instantiated with system-specific parameters (e.g., tool sets, allowlists, and window sizes):
• Verify before acting: any side-effecting external call must be preceded by a verifier step within the last h steps.
• Principle of least privilege: an external service T may be invoked only if it is permitted by K_{i_t} and the call parameters satisfy predefined allowlists.
• Progress: Φ must decrease at least once in any window of length w, unless the run terminates (Eq. 2).
These templates aim to cover correctness (progress and termination), safety (verification gates), and security (capability limits).

IV-C L2: Adversarial stress testing as constrained environment search

Capability benchmarks primarily measure average-case task performance and contract compliance, rather than behavior under perturbations and faults. Assurance also needs counterexample discovery: finding small, plausible perturbations that cause contract violations under an explicit budget. Stress testing is therefore posed as a constrained search over the main input and interaction channels: prompt and context, external services, retrieval, and memory. In this paper, "adversarial" refers to worst-case perturbation selection within a bounded, plausible operator set Δ under a cost budget B, not necessarily a malicious human attacker.

Perturbation schedules. A perturbation δ can be applied once (e.g., rewriting the user request) or applied over time as a schedule. Let δ_{0:T} = (δ_0, …, δ_T) with δ_t ∈ Δ. The resulting execution trace is τ = Exec(x, κ, z, δ_{0:T}), where κ is the system configuration (roles, topology, services, governance) and z is the stochastic seed. When the per-step form is not needed, the schedule is denoted simply by δ.

Violation signal and plausibility cost.
At step t, only the step contracts relevant to the current action are evaluated. Let I^step_t ⊆ I^step denote this selected subset (e.g., contracts for external service calls, memory writes, or the final response). Given the MAT record r_t, define a weighted violation score:

Vio(r_t, I^step_t) = Σ_{I^step_k ∈ I^step_t} α_k · 1[ I^step_k(r_t) = 0 ],

where α_k ≥ 0 encodes contract severity. Each perturbation operator δ_t is assigned an explicit cost that reflects how intrusive it is:

cost(δ_t) = c_tok |Δ_tokens| + c_hook n_hooks + c_mag η. (8)

Here, |Δ_tokens| counts token changes in prompts or other text inputs; n_hooks counts activated boundary fault hooks (e.g., forcing a timeout or delay on a specific external call); and η is a magnitude parameter for response perturbation (e.g., removing or changing fields in a structured payload). A fixed budget B bounds the total cost so perturbations remain small, realistic, and comparable across runs.

Figure 2: Adversarial counterexample search as an inner-outer assurance loop. (1) Setup: fix a system configuration κ (roles, tools, contracts, governance) and sample tasks x ∼ D with stochastic seed z. (2) Inner loop (search): an adversary selects bounded perturbations δ (subject to cost(δ) ≤ B) and injects them into the execution; the system produces a trace τ = Exec(x, κ, z, δ), which is monitored for contract violations (step/trace contracts I^step, I^trace) and auxiliary signals such as progress Φ and unsupported-claim rate H_rate. The resulting score guides the next perturbation choice.
(3) Outer loop (engineering feedback): when a violation is found (red arrow), the framework localizes the first failing step t and stores a replay record (e.g., (z, δ⋆) and required stubs), enabling configuration revision and re-testing (green dashed path), including updates to agent parameters π_θ and/or the governance policy Π.

Search strategies and adaptive adversaries. The same perturbation operator interface can be used with simple search methods (random fuzzing, beam search over operator sequences, evolutionary search), with reinforcement learning, or with an adaptive adversary. Adaptive search is useful for staged failures that unfold over multiple steps (e.g., first inducing role drift, then exploiting the resulting authority). At step t, the adversary observes a filtered summary h_t = Obs(r_t) (exposing only safe MAT fields such as action type, called service, and contract pass/fail outcomes) and selects the next operator δ_t. Figure 2 summarizes this inner-loop interaction. When the adversary is implemented as a learned policy (e.g., via reinforcement learning), the inner-loop perturbation selection can be formalized as an optimization problem: the adversary selects perturbation operators δ_t to maximize contract violations while remaining within a bounded cost budget. Equation (9) defines the objective optimized by such an adaptive adversary. Here, Instab(x, κ, δ) penalizes perturbation schedules that lead to unstable or non-reproducible executions (e.g., high variance across replays), encouraging replayable counterexamples.

max_{π_ψ^adv} E[ Σ_{t=0}^{T} γ^t ( Vio(r_t, I^step_t) − β cost(δ_t) ) ] − λ E[ Instab(x, κ, δ) ] (9)
s.t. Σ_{t=0}^{T} cost(δ_t) ≤ B.

Operator families.
The operator space Δ is a small library of bounded transformations:
• Prompt and context: paraphrasing, adding ambiguity, reordering constraints, inserting distractors, or splicing untrusted text, bounded by token changes.
• External services: delays and timeouts, dropped responses, field corruption, metadata changes, or stale results, bounded by frequency and magnitude.
• Retrieval: shuffling the top-k results, adding near-duplicate distractors, perturbing ranking scores, or truncating evidence, bounded by the number of modified items.
• Memory: inserting untrusted notes, reordering items, forcing truncation, or simulating write collisions, bounded by the number and size of writes.

From failures to reproducible test cases. For each stress-testing experiment (one executed search on a task and system configuration), the procedure returns a replayable perturbation schedule δ⋆_{0:T} and the violated contracts, with the responsible steps (t, i_t) identified from MAT. In addition, a replay record is stored, including the seed, tool stubs or cached tool outputs, retrieved artifacts, and the injected fault schedule. This allows deterministic reproduction in continuous integration and supports regression tests after changes to prompts, policies, or services. The replay artifact supports regression testing after changes to κ, including updates to the agent policy π_θ and adjustments to the governance policy Π (Figure 2, green path).

IV-D L3: Structured fault injection across external services, retrieval, and memory

Adversarial perturbations are only one source of failures. Production assurance also requires robustness to common integration faults such as timeouts, partial outages, stale caches, malformed payloads, and inconsistent shared memory. A chaos-engineering [5] approach is used to test these conditions by injecting controlled faults at system boundaries. During the run, MAT contracts check a containment property.
Containment requires that the fault is detected, an appropriate mitigation is executed, and the final user-facing output still satisfies the required contracts.

Fault operators and injection points. Faults are injected at three boundaries: service adapters, retrieval, and shared memory. The fault library includes timeouts and delays, dropped or partial responses, payload corruption, schema mismatches, stale cache returns, retrieval shuffles, and memory injection, reordering, or collisions. Parameters such as ℓ, η, and Δt bound fault severity.

Containment contract. Let Fault_t indicate that a fault is injected or observed at step t, Detect_t that the fault is recognized and recorded in MAT, and Mitigate_t that a corrective action is taken (retry and backoff, alternate service, replan, clarification, or escalation). A minimal containment requirement is:

∀ t : Fault_t ⇒ ( ∃ t′ ∈ [t, t+w] : Detect_{t′} ) ∧ ( ∃ t″ ∈ [t′, t′+w′] : Mitigate_{t″} ). (10)

In addition, the final response must satisfy all relevant contracts, including all I^trace_k ∈ I^trace and any end-of-run I^step_k checks that apply to the final response. If Eq. 10 fails, MAT localizes the boundary where the fault entered and the first step where it led to a trace-level violation, enabling replay and regression testing using the recorded seeds, stubs, and fault schedule.

IV-E L4: Governing external actions and service calls

Even when a system performs well on routine tasks, delegated tool access can turn small reasoning errors or injected instructions into real-world harm. Governance is therefore treated as a runtime layer that mediates external actions via least privilege, policy enforcement, and selective escalation for high-impact effects.

Capability boundaries (least privilege).
Each agent A_i (for i ∈ I) is assigned a capability set K_i (allowed services, endpoints, parameter ranges, and rate limits). Let cap(a_t) denote the capabilities required by action a_t. Action a_t is authorized for the acting agent i_t iff:

Allow(a_t, i_t) = 1 ⇔ cap(a_t) ⊆ K_{i_t}.

This limits the impact of prompt injection and behavioral drift by construction.

Policy shield (allow / rewrite / block). All proposed actions pass through a policy mediator Π, which produces a governed outcome given the proposed action a_t and the current state estimate ŝ_t:

Π(a_t, ŝ_t) ∈ { allow(a_t), rewrite(ã_t), block }. (11)

Rewriting applies bounded corrections (e.g., removing sensitive content, clamping parameters to safe ranges, or converting a high-impact action into an approval request). Approval can be provided by a human reviewer (human-in-the-loop, HITL) or by a designated verifier component, depending on deployment policy.

Risk-aware routing and sandboxing (trust as a heuristic). A per-agent trust score T_i(t) ∈ [0, 1] summarizes recent contract outcomes and policy events. Trust is not a security guarantee; it is a practical signal used for routing and sandboxing, with decay so older "good" behavior does not dominate. When T_i(t) falls below a threshold, governance can: (i) shrink the capability set K_i, (ii) route the subtask to a verifier agent, or (iii) apply stricter mediation for high-impact actions.

Selective human escalation for high-impact actions. For pre-labeled high-impact actions (e.g., send_email, payment, delete, grant_access), policy may require HITL approval. Operationally, this requirement is enforced by the policy mediator Π via rewrite (convert the action into an explicit approval request) or block (prevent execution until approval), as defined in Eq. (11).
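A minimal sketch of the capability check and the policy mediator described in this section. The rewrite rule shown (wrapping a high-impact action into an approval request) is one illustrative choice, and the tool and capability names are hypothetical:

```python
def allow(action, agent_caps):
    """Least privilege: Allow(a_t, i_t) = 1 iff cap(a_t) is a subset of K_{i_t}."""
    return set(action["caps"]) <= set(agent_caps)

def policy_shield(action, agent_caps,
                  high_impact=("send_email", "payment", "delete", "grant_access")):
    """Sketch of the mediator: block unauthorized actions, rewrite high-impact
    ones into approval requests, allow the rest (cf. Eq. 11)."""
    if not allow(action, agent_caps):
        return ("block", None)
    if action["tool"] in high_impact:
        # Bounded correction: convert the action into an explicit approval request.
        approval = dict(action, tool="request_approval", wraps=action["tool"])
        return ("rewrite", approval)
    return ("allow", action)

caps = {"read_db", "send_email"}
assert policy_shield({"tool": "db_query", "caps": ["read_db"]}, caps)[0] == "allow"
assert policy_shield({"tool": "send_email", "caps": ["send_email"]}, caps)[0] == "rewrite"
assert policy_shield({"tool": "payment", "caps": ["payments"]}, caps)[0] == "block"
```

Routing every proposed action through one mediator function keeps the allow/rewrite/block outcome distribution directly observable, which the evaluation section later uses as a governance metric.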
V Evaluation and Metrics

This section proposes an evaluation methodology and a set of metrics for assessing multi-agent LLM systems and governance variants. The methodology treats evaluation as contract-monitored execution traces, so that utility, robustness to perturbations, fault containment, and governance behavior are comparable under a common trace representation. A full empirical study is left to ongoing and future work; the protocol and estimators below are specified in a form intended to be instantiated in a later implementation.

V-A Evaluation object: tasks, configurations, and traces

Let x ∼ D denote a task instance sampled from a workload distribution. Let κ denote a system configuration (agent topology, role prompts, tool adapters, contract set, and governance policy). Let z denote a stochastic seed (model sampling, tie breaks, and any randomized tool stubs). Let δ denote a perturbation and fault schedule drawn from an operator family Δ. A single run produces an execution trace:

τ = Exec(x, κ, z, δ).

A run is a contract failure if at least one trace contract is violated, using Eq. (4). A run is a task failure if the task objective is not met, even if no contract is violated. Both notions are reported because they reflect different deployment concerns. Contract failure is safety and governance oriented: it indicates that at least one monitored constraint (e.g., verification gates, least privilege, containment requirements, refusal rules) has been violated, even if the task output appears useful. Task failure is end-to-end utility oriented: it indicates that the intended outcome was not achieved (wrong, incomplete, or unusable), even if monitored constraints were satisfied. To estimate expectations, a finite workload set D = {x_1, …, x_N} is fixed and each task is run with S seeds z_1, …, z_S, yielding N × S runs per (κ, δ) condition.
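The N × S run grid and the two failure notions can be sketched as follows, with an illustrative `execute` hook standing in for a real orchestrated run:

```python
from itertools import product

def run_grid(tasks, seeds, execute):
    """Enumerate the N x S evaluation runs for one (kappa, delta) condition and
    classify each run as task failure and/or contract failure. The `execute`
    hook returns (task_success, contract_violated) for a single run."""
    results = []
    for x, z in product(tasks, seeds):
        ok, violated = execute(x, z)
        results.append({"task": x, "seed": z,
                        "task_failure": not ok,
                        "contract_failure": violated})
    return results

# Toy stand-in: task "t2" never meets its objective; seed 1 trips a contract.
execute = lambda x, z: (x != "t2", z == 1)
res = run_grid(["t1", "t2"], [0, 1], execute)
assert len(res) == 4
assert sum(r["task_failure"] for r in res) == 2       # both t2 runs
assert sum(r["contract_failure"] for r in res) == 2   # both seed-1 runs
```

Keeping the two failure labels on every run (rather than one combined verdict) is what allows the later tables to report utility-oriented and safety-oriented outcomes separately.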
Rate metrics that reduce to a binomial proportion (e.g., success, contract failure, containment, non-termination, refusal or block rates) are accompanied by Wilson-score confidence intervals or Clopper-Pearson exact intervals. For derived metrics that are not simple proportions (e.g., ratios such as tokens per successful run, or robustness summaries such as the area under R(B)), percentile bootstrap intervals are computed by resampling runs and recomputing the metric.

V-B Protocol and design choices

Industry guidance has emphasized the practical importance of well-scoped task suites and systematic failure analysis when evaluating agentic systems (https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents). The protocol specified here complements these recommendations with contract-monitored traces that localize failures to concrete execution steps and interfaces.

Workloads. An enterprise-oriented workload suite D is assumed to cover common orchestration patterns and risk cases. A practical instantiation partitions tasks into categories and uses a fixed number of tasks per category, so that aggregate results are not dominated by a single class:
• Tool-use tasks: query an internal database, retrieve documents, summarize, and propose an action.
• Multi-step planning tasks: delegation across roles, approval gates, retries, and coordination across agents.
• Policy-constrained tasks: PII handling, consent lists, and compliance checks enforced at action time.
• Misuse-oriented tasks: malicious or policy-violating requests that should trigger refusal or containment.
Each task includes at least one untrusted input channel (e.g., email text, a retrieved snippet, or a document excerpt) and at least one tool boundary where integration failures are realistic (e.g., timeouts, partial results, stale cache returns, or schema mismatch). This supports direct measurement of injection robustness, fault containment, and governance behavior.
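The Wilson-score interval recommended earlier in this section for binomial rate metrics is a standard closed-form computation; a minimal sketch (z = 1.96 for a 95% interval):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson-score confidence interval for a binomial proportion
    (95% coverage for z = 1.96). Returns (lower, upper)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

lo, hi = wilson_interval(85, 100)
assert lo < 0.85 < hi
assert 0.76 < lo < 0.80 and 0.90 < hi < 0.92
```

Unlike the naive normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near 0 or 1, which matters for low-frequency events such as contract violations.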
Existing agent evaluation suites are used to seed task specifications and prompt templates (e.g., AgentBench [14], GAIA [15], AgentDojo [17], HarmBench [21], AgentHarm [20]). These suites provide realistic task narratives and failure-oriented scenarios, but they typically assume their own tool abstractions, action schemas, and safety rules. For trace-contract evaluation, each seeded task is re-instantiated for the target deployment by adapting (a) the tool layer (swap benchmark tools for the deployed adapters and endpoints), (b) the action interface (map benchmark actions onto the system action types: messages, tool calls, memory reads and writes, delegation, termination), and (c) the policy layer (align PII, consent, and capability constraints with the governance policy). In addition, the task is paired with a perturbation and fault matrix applied at the same tool and memory boundaries that are monitored by contracts. This instantiation procedure provides a practical bridge from existing task suites to contract-monitored, tool-grounded evaluations of agent interactions.

Perturbation and fault matrix. For each task instance x, evaluation covers three operating conditions:
• Nominal runs (δ = ∅): baseline behavior under the unmodified input and normal tool operation. These runs measure utility and contract compliance without injected disturbances.
• Structured fault injection (δ ∈ Δ_fault): controlled boundary faults applied to tools, retrieval, or memory. These runs measure robustness to integration issues such as timeouts, partial responses, stale cache returns, and schema mismatches.
• Adversarial schedules (δ chosen by search under a cost budget): perturbations selected to trigger contract violations while remaining small and realistic. These runs measure worst-case behavior under bounded, plausible disturbances.
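The containment requirement that fault-injection runs are checked against (Eq. 10, Sec. IV-D) reduces to a window check over step indices; a minimal sketch, assuming the fault, detect, and mitigate step sets have been extracted from MAT flags:

```python
def contained(fault_steps, detect_steps, mitigate_steps, w, w2):
    """Check the minimal containment requirement of Eq. (10): every fault at
    step t must be detected within [t, t+w], and some detection must be
    followed by a mitigation within [t', t'+w2]."""
    for t in fault_steps:
        detections = [tp for tp in detect_steps if t <= tp <= t + w]
        if not detections:
            return False          # fault never detected within the window
        if not any(tp <= tm <= tp + w2
                   for tp in detections for tm in mitigate_steps):
            return False          # detected but never mitigated in time
    return True

# Fault at step 3; detected at step 4 (within w = 2); retry at step 5 (within w2 = 2).
assert contained({3}, {4}, {5}, w=2, w2=2)
# Detection at step 7 falls outside [3, 5], so containment fails.
assert not contained({3}, {7}, {8}, w=2, w2=2)
```

The same check, aggregated over all injected faults in a run, yields the containment rate reported in the metrics below.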
This design reflects the observation that production reliability depends on behavior under imperfect conditions, not only under nominal inputs [5, 1].

Governance variants and ablations. Results are compared across governance and instrumentation variants. A minimal set includes:
• No governance. Tool actions are executed directly, without policy mediation.
• Policy shield. Each proposed tool action is mediated at execution time (allow / rewrite / block), as in Eq. 11.
• Shield + capability constraints. Least privilege is enforced via per-agent capability sets K_i, which restrict the services, endpoints, and parameter ranges that each agent may invoke.
• Shield + routing heuristics. Risk-aware routing and sandboxing are enabled; when recent contract outcomes indicate elevated risk, actions are routed through stricter mediation or forwarded to a verifier or human approval step.
These variants target risks emphasized by OWASP and related guidance, including prompt injection and excessive agency [6, 7, 8].

Repeated runs and seed control. Each task is run with multiple seeds. This captures stochastic variability from model sampling and tool or retrieval non-determinism. Single-run metrics are reported alongside multi-run robustness metrics (e.g., Success@k). For failure analysis, an execution record for deterministic reproduction is stored, including seeds, tool stubs, and injected fault schedules; this enables later regression testing against the same recorded conditions.

Uncertainty quantification. For binomial proportions (success rates, violation rates, containment rates), Wilson-score confidence intervals or Clopper-Pearson exact intervals are reported [26]. For non-linear aggregates (e.g., robustness curve area, tokens per successful run), bootstrap confidence intervals are reported [27]. Per-category results (means and intervals) are also reported to avoid mixing heterogeneous task types.

V-C Primary metrics

Metrics are grouped by evaluation goal.
All metrics are computed from MAT traces and monitored contracts. For estimation, fix a finite workload set D = {x_1, …, x_N} and S seeds per task, z_1, …, z_S (so there are N × S total runs). For a fixed configuration κ and operating condition δ, run (x_i, z_s) produces a trace τ_{i,s} = Exec(x_i, κ, z_s, δ) and an outcome indicator Y(x_i, z_s, δ) ∈ {0, 1}.

V-C1 End-to-end utility and reliability

Task success. Task success measures whether the task objective is satisfied for a given run. The estimated success rate under configuration κ and condition δ is:

Success^(κ, δ) = (1/NS) Σ_{i=1}^{N} Σ_{s=1}^{S} Y(x_i, z_s, δ).

Success@k. Because the system is stochastic, a robustness metric analogous to pass@k is defined and is reported when multi-run evaluation is included [28]. A task is counted as solved if any of k independent runs succeeds:

Success@k^(κ, δ) = (1/N) Σ_{i=1}^{N} 1[ max_{j∈{1,…,k}} Y(x_i, z_j, δ) = 1 ].

This metric is reported for nominal runs and, when robustness is evaluated, for selected perturbation budgets.

Termination reliability. Non-termination and coordination collapse are measured using Term(τ) and the progress contract (Eq. 2). The estimated non-termination rate is:

NTR^ = (1/NS) Σ_{i=1}^{N} Σ_{s=1}^{S} 1[ ¬Term(τ_{i,s}) ].

To distinguish early failures from long-horizon collapse, distributions of steps-to-termination and steps-to-first-failure can be included.

V-C2 Contract compliance and failure localization

Trace contract violation rate. Trace-level assurance is summarized by the fraction of runs that violate at least one trace contract (Eq. 4):

Fail^ = (1/NS) Σ_{i=1}^{N} Σ_{s=1}^{S} Fail(τ_{i,s}, I^trace).

Per-contract violation profile. Let VioID(τ) denote the set of violated contract IDs recorded in MAT verdicts. For each contract k, the estimated violation rate is:

p^_k = (1/NS) Σ_{i=1}^{N} Σ_{s=1}^{S} 1[ k ∈ VioID(τ_{i,s}) ].
A ranked list of contracts by p^_k is reported. Hard and soft violations are separated using the severity field in Eq. (7).

First violation step and responsible role. For each failing run, let t⋆ be the first step where any monitored contract is violated. The distributions of t⋆ and the associated agent identity i_{t⋆} are reported. This indicates whether failures tend to arise early (planning and delegation) or late (tool actions, policy mediation, or escalation).

V-C3 Quality and factuality

Unsupported claim rate and propagation. The definitions from failure class F2 (Section I) are reused. Given claims {c_j} extracted from the final response and provenance evidence {e_j} recorded in MAT, H_rate and H_prop are computed per run and then averaged across runs:

H^_rate = (1/NS) Σ_{i,s} H_rate(τ_{i,s}),  H^_prop = (1/NS) Σ_{i,s} H_prop(τ_{i,s}).

When gold reference answers exist, support can be checked by exact match or entailment tests. When gold is unavailable, a constrained judge may be used; in that case, judge calibration and judge variance are reported [29].

Retrieval-grounded metrics. For configurations that include a retrieval component (i.e., the system consults an external corpus and conditions generation on retrieved passages), reference-free retrieval quality metrics such as faithfulness and context relevance are reported using established tooling [13]. These metrics are treated as secondary signals, mainly to interpret changes in unsupported-claim rates under retrieval noise.

V-C4 Fault robustness and containment

Containment rate. For structured fault-injection runs, the containment rate (Sec. IV, Eq. 10) reports how often injected faults are detected and mitigated within the required windows and do not lead to trace-level failures in the final output:

CR^ = (# faults that satisfy Eq. (10) and final contracts) / (# faults injected).
A breakdown by fault type can also be included (timeout, partial response, stale cache, corrupt payload, memory collision).

Residual harm under faults. Containment can fail by delayed detection, ineffective mitigation, or downstream policy violations. An additional safety signal is therefore reported: whether any policy-relevant contract is violated in the final response after a fault. This complements CR^ by separating non-critical quality loss from safety-critical failure.

V-C5 Safety, governance, and misuse resistance

Policy mediator outcomes. The governance layer produces allow, rewrite, or block outcomes (Eq. 11). Their occurrence rates over tool actions are reported: p^_allow, p^_rewrite, p^_block. Rates are also reported conditioned on action risk level (low-impact vs. high-impact) and input category (non-malicious vs. misuse-oriented).

Blocked high-impact action rate. For pre-labeled high-impact actions, the fraction that are blocked or routed to approval is reported. This provides a measurable signal of excessive agency at the action boundary.

Misuse success and refusal. For misuse-oriented tasks, three outcome rates are reported:
• refusal or safe abort,
• harmful completion,
• partial completion with containment (side effects blocked).
These outcomes are defined using task labels and policy contracts [20, 21].

V-C6 Efficiency and operational cost

Token and tool cost per successful task. Tokens per successful run are reported as:

T^_eff = ( Σ_{i,s} tokens(τ_{i,s}) ) / ( #{ (i, s) : Y(x_i, z_s, δ) = 1 } ).

Tool-call counts and latency summaries are reported separately. Efficiency is not treated as a primary objective, but is included to contextualize overhead introduced by governance and verification [1].
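The core rate estimators of Sec. V-C reduce to simple aggregations over the N × S outcome matrix; a minimal sketch (the toy numbers are illustrative only):

```python
from collections import Counter

def success_rate(Y):
    """Estimated success rate over an N x S run matrix Y[i][s] in {0, 1}."""
    runs = [y for row in Y for y in row]
    return sum(runs) / len(runs)

def success_at_k(Y, k):
    """Success@k: a task counts as solved if any of its first k runs succeeds."""
    return sum(max(row[:k]) for row in Y) / len(Y)

def violation_profile(violated_ids_per_run):
    """Per-contract violation rates over all runs (one ID set per run)."""
    n = len(violated_ids_per_run)
    counts = Counter(cid for ids in violated_ids_per_run for cid in set(ids))
    return {cid: c / n for cid, c in counts.items()}

# 3 tasks x 4 seeds; task 2 never succeeds.
Y = [[1, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
assert success_rate(Y) == 4 / 12
assert success_at_k(Y, 2) == 1 / 3     # only task 0 succeeds within 2 runs
assert violation_profile([["progress"], [], ["progress", "pii"]]) == {
    "progress": 2 / 3, "pii": 1 / 3}
```

Note that Success@k is computed per task before averaging, so a single task that succeeds on every seed cannot mask tasks that never succeed.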
V-D Robustness, MTBF, and maintainability

Beyond nominal performance, evaluation may also summarize robustness under bounded perturbations, stability under continuous operation, and regressions after system changes.

Robustness under perturbation budgets. Robustness is defined as a function of a perturbation budget. Let B be a maximum budget on perturbation cost (Sec. IV). Define the robustness curve as the expected task success when the perturbation schedule is chosen within budget B:

R(B) = (1/NS) Σ_{i=1}^{N} Σ_{s=1}^{S} Y(x_i, z_s, δ⋆(B)),

where δ⋆(B) denotes a perturbation schedule selected under the budget constraint cost(δ) ≤ B (e.g., by sampling schedules from Δ subject to the cost budget, or by an adaptive search procedure). The curve R(B) may be reported for a small set of budget values. A compact summary may also be reported, such as the area under R(B) over a fixed budget range, together with the success rate at the largest tested budget. Results may be reported per workload category to avoid masking brittle behavior on risk-heavy tasks.

Mean time between failures (MTBF). MTBF is most meaningful when the system processes a stream of tasks or operates continuously. An episode is defined as a sequence of tasks processed under the same configuration, with state carried across tasks as appropriate (e.g., persistent memory, caches, and trust scores). Let F_r be the number of MAT steps until the first trace-contract failure in episode r, or until the episode ends. The estimator is:

MTBF^ = (1/R) Σ_{r=1}^{R} F_r.

MTBF may be reported in steps and in wall-clock time when latency measurements are available.

Regression and maintainability. Prompt, tool, and policy changes are frequent in deployed systems. This motivates regression testing using stored execution artifacts (e.g., seeds and injected fault schedules) so that failures can be reproduced after changes.
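Such a stored execution artifact can be sketched as a plain record with an integrity digest, so a regression replay can detect drift in the stored stubs; all field names here are illustrative assumptions, not the paper's schema:

```python
import hashlib
import json

def replay_record(seed, fault_schedule, tool_outputs):
    """Bundle the artifacts named above (seed, injected fault schedule,
    cached tool outputs) for deterministic reproduction of a failing run."""
    record = {"seed": seed, "faults": fault_schedule, "stubs": tool_outputs}
    blob = json.dumps(record, sort_keys=True)
    record["digest"] = hashlib.sha256(blob.encode()).hexdigest()
    return record

def replay_matches(record):
    """Integrity check before replay: the stored artifacts must be unchanged."""
    blob = json.dumps({k: record[k] for k in ("seed", "faults", "stubs")},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest() == record["digest"]

rec = replay_record(42, [{"t": 3, "op": "timeout"}], {"search": "cached result"})
assert replay_matches(rec)
rec["stubs"]["search"] = "tampered"
assert not replay_matches(rec)
```

Checking the digest before each regression run guards against silently replaying against modified stubs, which would otherwise make a "reproduced" failure meaningless.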
Let Pass(x, κ) = 1 denote that task x completes successfully and satisfies all required contracts under configuration κ, under nominal conditions and, optionally, under a fixed perturbation set. Given a regression suite D_reg and configurations κ (old) and κ′ (new), define the regression rate as:

R^_reg = #{ x ∈ D_reg : Pass(x, κ) = 1 ∧ Pass(x, κ′) = 0 } / #{ x ∈ D_reg : Pass(x, κ) = 1 }.

Regressions are reported separately for task success, contract compliance, and governance outcomes (allow vs. rewrite vs. block changes), since these capture different operational risks.

Coverage units and regression-suite selection. The replay artifacts produced by contract violations can be treated as regression test cases. Unlike code coverage, the framework supports contract- and interface-oriented coverage. Each stored replay is indexed by a failure signature derived from MAT, including (i) the first violated contract ID and severity, (ii) the triggering action type (e.g., tool call, memory write, delegation, final response), (iii) the affected interface class (service, retrieval, memory), (iv) the action risk label (high-impact vs. low-impact), and (v) the perturbation or fault operator family. A compact regression suite can then be obtained by selecting a minimal subset of replays that covers all observed high-severity contracts (and representative signatures per interface and action type), together with a small set of nominal passing tasks per workload category to detect quality regressions. Formally, suite selection can be posed as a set-cover problem over these coverage units; however, the framework is independent of the particular solver used.

V-E Recommended reporting format

A consistent reporting structure improves interpretability and makes comparisons across governance variants reproducible. The following artifacts are typically sufficient.

Nominal performance table.
Recommended reporting under nominal conditions (δ = ∅) includes, for each governance variant: end-to-end task success (and Success@k when stochasticity is material), non-termination or coordination-collapse rate, and basic efficiency signals (steps, tool calls, and tokens per successful run). For mixed task suites, per-category results are recommended to avoid masking failures on harder categories.

Fault-injection outcomes table. Report results under structured fault injection (δ ∈ Δ_fault) in a separate table. Summarize containment in simple rate form: how often an injected fault is (a) detected within the next w steps and then (b) followed by a mitigation within the next w′ steps. It may also be useful to report whether the final response satisfies the relevant trace contracts after mitigation. For systems with action mediation, include the distribution of policy outcomes (allow, rewrite, block) to distinguish recovery from conservative blocking.

Robustness curves. Plot robustness curves R(B) as a function of the perturbation budget B for at least two workload categories. Where applicable, include both random perturbations and adversarial schedules under the same budget. Optionally add a scalar summary such as the area under the robustness curve over a fixed budget range.

Contract violation profile and examples. A compact contract violation profile may be included as a ranked list of the most frequently violated contracts. Alongside frequency, report a localization statistic such as the most common first-violation step index or the dominant action type at first violation. Optionally add a small number of short, reproducible counterexamples. Each example should identify the triggering perturbation or injected fault, the first violated contract, and the localized MAT step window around the violation.
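The detection and mitigation rates above can be computed directly from MAT step indices. A minimal sketch, assuming a hypothetical per-fault event record with 'inject', 'detect', and 'mitigate' step fields (None when the event never occurred):

```python
def containment_rates(events, w, w_prime):
    """Fraction of injected faults detected within w steps of injection,
    and fraction then mitigated within w_prime steps of detection.
    `events` is a list of dicts with MAT step indices; layout is illustrative."""
    detected = [e for e in events
                if e["detect"] is not None
                and e["detect"] - e["inject"] <= w]
    mitigated = [e for e in detected
                 if e["mitigate"] is not None
                 and e["mitigate"] - e["detect"] <= w_prime]
    n = len(events)
    return {"detection_rate": len(detected) / n,
            "mitigation_rate": len(mitigated) / n}
```

Reporting both rates against the same denominator (all injected faults) keeps the two columns of the fault-injection table directly comparable.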
VI Conclusion

This paper addressed assurance for agentic systems in which one or more orchestrators coordinate multiple agents that interact with external services, retrieval components, and shared memory. In this setting, failures are not limited to incorrect final responses; they also include non-termination, coordination breakdowns, error propagation across agents, role drift, and unsafe interface-driven actions. We presented a trace-based assurance framework that organizes evaluation around contract-monitored executions. The framework (i) instruments runs as Message-Action Traces with step- and trace-level contracts, enabling localization of the first violating step and deterministic replay; (ii) formulates stress testing as budgeted counterexample search over bounded perturbations; (iii) supports structured fault injection at service, retrieval, and memory interfaces to test containment under realistic integration disturbances; and (iv) treats governance as a runtime component that mediates external actions through capability restrictions and allow/rewrite/block decisions at the language-to-action boundary. We also defined trace-derived metrics that support comparison across stochastic seeds, models, and orchestration configurations.

Limitations and next steps. The current contribution focuses on methodology, definitions, and evaluation estimators rather than an empirical demonstration. The next step is to implement the instrumentation, contract library, perturbation and fault operators, and governance mechanisms in a prototype, and to execute a systematic empirical study across representative workloads and orchestration variants. This study will quantify utility, robustness, containment, and governance behavior, and will assess reproducibility via replay artifacts and regression testing.
Future work will also study coverage criteria and selection strategies for maintaining small but effective regression suites, including contract coverage, boundary/interface coverage, and risk-conditioned coverage. Another direction is integrating the protocol into the software development lifecycle as evaluation-driven development: changes to the configuration κ (prompts, tool adapters, allowlists), the governance policy Π, or the agent parameters π_θ would be accepted only if a fixed smoke suite and a replay-based regression suite preserve contract compliance, containment, and utility within predefined statistical tolerances across seeds.

Acknowledgments

This work was supported by the project “Romanian Hub for Artificial Intelligence (HRIA)”, funded under the Smart Growth, Digitization and Financial Instruments Programme 2021–2027, MySMIS no. 351416. We also thank Adobe Romania for funding a PhD scholarship on the topic and for fruitful discussions and feedback that helped shape this work.