← Back to papers

Paper deep dive

Trustworthy Agentic AI Requires Deterministic Architectural Boundaries

Manish Bhattarai, Minh Vu

Year: 2026Venue: arXiv preprintArea: Agent SafetyType: TheoreticalEmbeddings: 53

Abstract

Abstract:Current agentic AI architectures are fundamentally incompatible with the security and epistemological requirements of high-stakes scientific workflows. The problem is not inadequate alignment or insufficient guardrails, it is architectural: autoregressive language models process all tokens uniformly, making deterministic command--data separation unattainable through training alone. We argue that deterministic, architectural enforcement, not probabilistic learned behavior, is a necessary condition for trustworthy AI-assisted science. We introduce the Trinity Defense Architecture, which enforces security through three mechanisms: action governance via a finite action calculus with reference-monitor enforcement, information-flow control via mandatory access labels preventing cross-scope leakage, and privilege separation isolating perception from execution. We show that without unforgeable provenance and deterministic mediation, the ``Lethal Trifecta'' (untrusted inputs, privileged data access, external action capability) turns authorization security into an exploit-discovery problem: training-based defenses may reduce empirical attack rates but cannot provide deterministic guarantees. The ML community must recognize that alignment is insufficient for authorization security, and that architectural mediation is required before agentic AI can be safely deployed in consequential scientific domains.

Tags

agent-safety (suggested, 92%)ai-safety (imported, 100%)alignment-training (suggested, 80%)theoretical (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/11/2026, 1:05:56 AM

Summary

The paper argues that current agentic AI architectures are fundamentally insecure for high-stakes scientific workflows due to the conflation of commands and data in autoregressive models. It introduces the 'Trinity Defense Architecture'—comprising Action Governance, Information-Flow Control, and Privilege Separation—to enforce deterministic security boundaries via a trusted, non-LLM reference monitor, moving beyond probabilistic alignment-based defenses.

Entities (5)

Manish Bhattarai · author · 100%Minh Vu · author · 100%Trinity Defense Architecture · security-framework · 100%Finite Action Calculus · formal-method · 95%Lethal Trifecta · vulnerability-pattern · 95%

Relation Signals (4)

Trinity Defense Architecture enforces Action Governance

confidence 100% · The Trinity Defense Architecture, which enforces security through three mechanisms: action governance...

Trinity Defense Architecture enforces Information-Flow Control

confidence 100% · The Trinity Defense Architecture... information-flow control via mandatory access labels

Trinity Defense Architecture enforces Privilege Separation

confidence 100% · The Trinity Defense Architecture... privilege separation isolating perception from execution.

Lethal Trifecta compromises Authorization Security

confidence 95% · the 'Lethal Trifecta' ... turns authorization security into an exploit-discovery problem

Cypher Suggestions (2)

Find all security mechanisms associated with the Trinity Defense Architecture. · confidence 95% · unvalidated

MATCH (t:Framework {name: 'Trinity Defense Architecture'})-[:ENFORCES]->(m:Mechanism) RETURN m.name

Identify vulnerabilities that threaten authorization security. · confidence 90% · unvalidated

MATCH (v:Vulnerability)-[:COMPROMISES]->(s:SecurityProperty {name: 'Authorization Security'}) RETURN v.name

Full Text

52,722 characters extracted from source content.

Expand or collapse full text

Trustworthy Agentic AI Requires Deterministic Architectural Boundaries Manish Bhattarai 1 Minh Vu 2 Abstract Current agentic AI architectures are fundamen- tally incompatible with the security and epis- temological requirements of high-stakes scien- tific workflows. The problem is not inadequate alignment or insufficient guardrails, it is archi- tectural: autoregressive language models process all tokens uniformly, making deterministic com- mand–data separation unattainable through train- ing alone. We argue that deterministic, archi- tectural enforcement, not probabilistic learned behavior, is a necessary condition for trustwor- thy AI-assisted science. We introduce the Trin- ity Defense Architecture, which enforces se- curity through three mechanisms: action gover- nance via a finite action calculus with reference- monitor enforcement, information-flow control via mandatory access labels preventing cross- scope leakage, and privilege separation isolating perception from execution. We show that without unforgeable provenance and deterministic me- diation, the “Lethal Trifecta” (untrusted inputs, privileged data access, external action capabil- ity) turns authorization security into an exploit- discovery problem: training-based defenses may reduce empirical attack rates but cannot pro- vide deterministic guarantees. The ML commu- nity must recognize that alignment is insufficient for authorization security, and that architectural mediation is required before agentic AI can be safely deployed in consequential scientific do- mains. 1. Introduction Agentic AI systems are large language models integrated with knowledge retrieval, persistent memory, and tool invo- cation (Schick et al., 2023; Yao et al., 2023). They are being deployed at accelerating pace in scientific workflows. The 1 Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA 2 Computing and Artificial Intelligence Divi- sion, Los Alamos, NM, USA. Correspondence to: Manish Bhat- tarai<ceodspspectrum@lanl.edu>. Preprint. February 11, 2026. emergence of tool-augmented language models (Xi et al., 2023; Wang et al., 2024) has enabled autonomous agents capable of literature synthesis, experimental design, data analysis, and hypothesis generation (Boiko et al., 2023; Bran et al., 2023). Research programs across the world are building agent-enabled systems for various applications, and the promise is transformative: AI assistants that can navigate complex research landscapes, automate tedious analysis, and accelerate discovery (Wang et al., 2023). Yet this promise conceals a fundamental vulnerability. When adversarial content embedded in documents or data feeds can steer tool use and trigger unintended ac- tions (Greshake et al., 2023; Liu et al., 2023), we face a risk that cannot be reliably mitigated through model train- ing alone. Consider a researcher who asks an AI agent to summarize several arXiv papers for a literature review. One paper, created by an attacker, contains hidden instruc- tions, white text on a white background, invisible to human readers but parsed by the agent, directing it to read sen- sitive files and exfiltrate them before completing the sum- mary (Yi et al., 2023). Without architectural protections, the agent follows these instructions because it cannot dis- tinguish trusted content from malicious text. The model processes all tokens through the same attention mecha- nism (Vaswani et al., 2017); it has no way to verify prove- nance. This is not a hypothetical concern, nor is it novel. The con- flation of commands and data produced buffer overflows in the 1970s (Aleph One, 1996; Cowan et al., 1998), SQL in- jection in the 1990s (Halfond et al., 2006), and cross-site scripting in the 2000s (Vogt et al., 2007). In each case, the vulnerability arose because systems treated data as po- tentially executable. In each case, the solution required architectural mechanisms, memory protection, parameter- ized queries, content security policies, not reliance on the processing system’s intelligence (Bratus et al., 2017; Sas- saman et al., 2013). LLM-based agents recreate this pat- tern in its most dangerous form, and the solution must fol- low the same principle: security through architecture, not through learned behavior. 1 arXiv:2602.09947v1 [cs.CR] 10 Feb 2026 The Trinity Imperative 1.1. The Lethal Trifecta When three conditions co-occur, authorization security be- comes an exploit-discovery problem rather than a policy- compliance problem (Ruan et al., 2023; Weidinger et al., 2022). First, the agent must ingest untrusted inputs i.e. documents, web pages, emails, images, or user queries from outside the trust boundary. Second, the agent must have privileged data access i.e. the ability to read cre- dentials, internal documents, experimental data, or propri- etary results. Third, the agent must possess external action capability, the ability to send emails, modify databases, invoke APIs, or write to persistent memory (Kinniment et al., 2024).In this setting, deterministic guarantees are unavailable without architectural mediation; training- based approaches (prompting, fine-tuning, alignment) can reduce empirical attack rates but cannot provide authoriza- tion security guarantees against adversarially chosen in- puts (Shavit et al., 2023). The reason is fundamental. The model processes attacker- controlled tokens through the same attention mechanism as legitimate instructions (Vaswani et al., 2017). Role mark- ers like “[SYSTEM]” or “[USER]” are themselves tokens that an attacker can include in injected content (Perez and Ribeiro, 2022). Position encodings provide statistical sig- nals, but these can be mimicked or overridden by suffi- ciently persuasive adversarial text (Zou et al., 2023). The model has no architectural mechanism to verify that a par- ticular substring originated from a trusted source, because that information simply is not represented in any unforge- able form within its input (Carlini et al., 2023). 1.2. Threat Model and Security Goal We consider adversaries who can introduce or influence untrusted inputs consumed by the agent (e.g., webpages, PDFs, emails, or RAG corpus documents). We assume the agent has access to privileged resources and can request external actions via tools. The adversary’s goal is an au- thorization violation: causing execution of an action or in- formation transfer that would be denied under the deploy- ment policy. Our objective is deterministic authorization integrity: forbidden actions and prohibited flows are not executed, even under adversarially chosen inputs. 1.3. Three Claims We advance three claims that we believe the ML commu- nity must accept before agentic AI can be safely deployed in consequential scientific domains. Our first claim is one of impossibility: no training-only procedure can provide a deterministic guarantee of command-data separation un- der adversarial conditions, because token provenance is not represented in any unforgeable form within the model’s in- put (Wolf et al., 2023). Our second claim is one of ne- cessity: deterministic, architectural enforcement of secu- rity boundaries is a necessary condition for authorization security in agentic systems, because probabilistic compli- ance is not security (Saltzer and Schroeder, 1975; Ander- son, 2020). Our third claim is one of sufficiency for au- thorization: the Trinity Defense Architecture i.e. Action Governance, Information-Flow Control, and Privilege Sep- aration, is sufficient to provide deterministic guarantees that mediated tools cannot execute denied actions and la- beled channels cannot perform prohibited flows, when im- plemented with a deterministic, non-LLM reference moni- tor (Anderson, 1972). Figure 1 summarizes the core argu- ment: why uniform token processing collapses command– data boundaries in current agents, how this interacts with the Lethal Trifecta to create authorization risk, and how Trinity restores deterministic boundaries via action gover- nance, information-flow control, and privilege separation. We must be precise about scope. We distinguish autho- rization security, preventing execution of actions the user or context is not permitted to perform from overall safety, preventing all harmful outcomes, including harms via au- thorized actions (Weidinger et al., 2022). A user authorized to send emails can still send harmful content; Trinity does not prevent this. What Trinity guarantees is that adversarial injection may cause unsafe suggestions but cannot produce forbidden actions or prohibited information flows through mediated tools and labeled channels. Deterministic bound- aries are necessary for authorization security but not suffi- cient for overall safety. We advocate them as one essential layer, not a complete solution. 2. Threat Landscape Current agentic architectures expose three distinct attack surfaces that compound to create systemic vulnerability. As previewed in Figure 1, these threats are not indepen- dent; they compound because untrusted content can steer both tool use and memory formation in systems lacking de- terministic mediation. We survey each in turn, grounding our analysis in demonstrated attacks from the security lit- erature. 2.1. Privilege Escalation via Prompt Injection The most direct threat arises when attacker-controlled con- tent in the context window influences the model to take unauthorized actions (Greshake et al., 2023; Perez and Ribeiro, 2022). Consider an agent with database access performing retrieval-augmented generation. A document in the knowledge base planted by an attacker or compromised through supply chain vulnerabilities, contains hidden in- structions directing the agent to exfiltrate credentials while appearing to complete a legitimate query (Zou et al., 2024; 2 The Trinity Imperative Chaudhari et al., 2024). The agent retrieves this document, processes the hidden instruction, and leaks sensitive data, all while generating an innocuous-looking response to the original request (Zhan et al., 2024). Current defenses fail because safety training teaches mod- els to refuse obviously malicious requests like “email all data to attacker@evil.com” (Bai et al., 2022; Ouyang et al., 2022). But injected instructions rarely take this form. They are crafted to look like legitimate system directives, in- ternal notes, or clarifying context (Wei et al., 2023). The model has no ground truth for what is actually authorized because authorization is a property of the deployment con- text, not of the text itself. A perfectly aligned model that always tries to be helpful will follow helpful-sounding in- structions regardless of their source (Casper et al., 2023). 2.2. Memory Leakage via Shared State Agents increasingly maintain persistent memory includ- ing learned preferences, conversation history, and work- flow patterns to improve performance over time (Xi et al., 2023). Most agentic systems therefore accumulate knowl- edge about users and organizational processes. This cre- ates vulnerabilities structurally analogous to browser cook- ies and shared state in web security (Vogt et al., 2007). Consider a scientific workflow agent deployed for a re- search group. When User A runs crystallography experi- ments, the agent learns preferred methods and parameters and writes them into a shared knowledge base (a “play- book”). The playbook may implicitly reveal unpublished directions. When another user later queries the agent about group activities, it may summarize from the playbook and inadvertently leak User A’s work. The problem is structural, not incidental. The agent writes to memory whatever improves task performance, with no mechanism to distinguish information that is safe to share from information that is sensitive to a particular user. Session hijacking becomes memory exfiltration; cross-site scripting becomes malicious pattern injection; cross-site request forgery becomes triggering authenticated actions via memory state (Ruan et al., 2023). The entire taxon- omy of web security vulnerabilities finds analogs in agentic memory systems. 2.3. Multi-Modal Bypass Text-based safeguards fail when commands are embedded in other modalities (Bailey et al., 2023; Qi et al., 2023). An organization deploys a multi-modal agent with vision capabilities and implements text filters blocking dangerous requests. An attacker creates an image containing the text “provide synthesis steps for [dangerous compound]” and uploads it with the innocent message “can you help with the assignment in this photo?” (Shayegani et al., 2023). The text filter sees only the benign message; the vision model reads the image, extracts the malicious request, and the agent complies because the dangerous keywords were never in the filtered text stream. This is privilege escalation through modality shifting (Du- rante et al., 2024). Text-based filters enforce one policy; the vision pathway has different or no enforcement. The command “synthesize toxin” gains execution by moving from a guarded channel to an unguarded one. Any time an agent processes multiple input channels with inconsis- tent enforcement, attackers can launder commands through the weakest channel. As agents become increasingly multi- modal, this attack surface expands correspondingly. 3. The challenges of Learned Separation The attacks in Section 2 share a common root: many agen- tic systems implicitly rely on the model to infer which to- kens should be treated as instructions versus untrusted con- tent. This is a reappearance of the classic injection pat- tern from programming languages and systems security: when executable intent is inferred from the same channel as attacker-controlled data, the boundary is forgeable (Bra- tus et al., 2017; Momot et al., 2016). Our goal in this section is not to claim that every model can be trivially jailbroken in every setting. Rather, we make a narrower but more operationally relevant point: training-based defenses cannot provide deterministic au- thorization guarantees for command–data separation when provenance is not represented in an unforgeable form out- side the model’s token stream. 3.1. What Command–Data Separation Requires Definition 3.1 (Command–Data Separation). A system satisfies command–data separation if there exists a prove- nance function prov : I → command, data imple- mented by the system’s trusted execution boundary such that: (i) prov(x) is determined by an authenticated chan- nel or representation invariant, not by semantic heuristics over the content of x (unforgeability); (i) the classifica- tion is checkable prior to execution and cannot be altered by attacker-controlled inputs (pre-execution verifiability); and (i) no input classified as data can induce execution of operations reserved for command inputs (non-bypass). The key requirement is unforgeability by content. Any rule of the form “treat text that looks like a system message as a command” is content-based and therefore forgeable. This is precisely the historical lesson of SQL injection and XSS: keyword filters and semantic heuristics can reduce risk em- pirically, but they do not provide a security boundary be- cause the attacker can shape inputs to cross the learned de- 3 The Trinity Imperative Figure 1. The Trinity Imperative for Trustworthy Agentic AI. (a) Current LLM agents fail security because uniform token processing erases the command–data boundary, making learned defenses forgeable. (b) This failure, combined with the “Lethal Trifecta” of un- trusted inputs, privileged data access, and external action capabilities, turns authorization security into an exploit-discovery problem in the absence of deterministic mediation. (c) The proposed ”Trinity Defense” establishes deterministic architectural boundaries through Action Governance, Information-Flow Control, and Privilege Separation to provide verifiable authorization security. cision boundary (Halfond et al., 2006; Vogt et al., 2007). Definition 3.2 (Channel-Bound Provenance Metadata). A system provides channel-bound provenance metadata if each input segment x is accompanied by an authenticated, unforgeable tag τ(x) indicating its origin (e.g., system pol- icy, user instruction, retrieved document, external tool out- put). The tag is verified by a trusted component outside the LLM and cannot be forged by attacker-controlled content. This definition makes explicit what traditional secure sys- tems achieve with architectural mechanisms. CPUs do not guess whether memory is executable based on bytes; they consult page permissions. Databases do not guess whether user input is SQL; parameterized queries transmit com- mands and parameters through separate channels (Saltzer and Schroeder, 1975; Anderson, 2020). 3.2. Why Transformers Cannot Provide Unforgeable Separation by Training Alone Autoregressive transformers map a token sequence (t 1 ,...,t n ) to a distribution over next tokens using at- tention computations that treat all tokens as inputs to the same learned function (Vaswani et al., 2017; Brown et al., 2020). In typical agent stacks, trusted instructions (sys- tem/developer), user requests, and retrieved artifacts are concatenated into a single context window. Role markers are themselves tokens; they can be imitated by attacker- controlled content (Perez and Ribeiro, 2022). Position en- codings and formatting cues provide statistical signals, but they are not unforgeable invariants; attackers can mimic or override them with carefully crafted text (Zou et al., 2023; Wei et al., 2023). The consequence is not that separation is never achieved in practice, but that any separator learned from content is, in principle, forgeable. Theorem 3.3 (No Unforgeable Separation from Content Alone). Consider an agentic system in which (i) trusted in- structions and untrusted content are presented to an LLM through a single token stream, and (i) the system’s de- cision to treat any substring as command versus data is based only on token content and model-internal computa- tion (i.e., no channel-bound provenance metadata verified by a trusted component). Then any command–data clas- sifier implemented purely by learning is forgeable: there exists attacker-controlled content that causes the LLM (or any learned filter/guardian operating on the same stream) to behave as if an untrusted substring were a trusted in- struction, with non-zero probability. Proof sketch. If separation is inferred from content, it is necessarily a semantic heuristic over attacker-controlled to- kens. Such heuristics are forgeable by construction: an at- tacker can embed role markers, imitation system prompts, adversarial phrasing, or optimized trigger strings that cross the learned decision boundary. Because both the classi- fier and the attacker operate in the same content space, no learned boundary can be unforgeable. This is the language- theoretic analogue of injection: deciding executability from the same channel as untrusted data cannot yield determinis- tic separation guarantees (Bratus et al., 2017; Momot et al., 4 The Trinity Imperative 2016). Remark 3.4 (What this does and does not claim). Theo- rem 3.3 is an authorization-security statement. It does not claim every deployment will be compromised, nor that de- fenses never help empirically. It claims that training-based defenses alone cannot provide deterministic guarantees of command–data separation under adversarially chosen un- trusted inputs when provenance is not represented in an un- forgeable form outside the model. This limitation also applies to LLM-based “guardians” and learned input/output filters when they consume the same mixed stream (Casper et al., 2023; Mazeika et al., 2024). These mechanisms can reduce attack success rates, but they remain probabilistic and therefore do not meet the stan- dard required for authorization security in high-stakes set- tings (Anderson, 2020). The architectural implication is straightforward: to obtain authorization guarantees, command provenance must be enforced outside the model via a trusted mechanism that (i) mediates action execution and (i) attaches unforgeable provenance to inputs and outputs. The next section presents such a mechanism. 4. The Trinity Defense Architecture Trinity is a deterministic control plane for agentic systems. It treats the LLM as an untrusted component that pro- poses plans, drafts text, and suggests actions, while a small trusted computing base (TCB) enforces which actions and information flows are permitted. This corresponds to the architectural shift illustrated in Figure 1(c): the LLM re- mains a powerful planner and interface, but a non-LLM control plane mediates actions and information flows. This follows the classical reference monitor model and the prin- ciple of complete mediation: every security-relevant op- eration must pass through an enforcement point indepen- dent of the untrusted program being constrained (Ander- son, 1972; Saltzer and Schroeder, 1975). Trinity has three orthogonal pillars: (1) Action Gov- ernance to prevent unauthorized tool execution, (2) Information-Flow Control to prevent cross-scope leakage and unsafe persistence, and (3) Privilege Separation to prevent untrusted inputs from directly shaping privileged execution pathways. 4.1. Action Governance The first pillar is a Command Gate: an auditable reference monitor grounded in a Finite Action Calculus (FAC) rather than free-form tool calls. We define a FAC as a tupleF = (A,T ,P,σ) comprising a finite set of action types A, a type system T over parameters, a set of policy predicates P , and a deterministic authorization function σ :A×T × Context→allow, deny. The protocol has three phases: i) Propose: the LLM out- puts a structured action proposal in FAC syntax, selecting from the finite vocabulary A. i) Decide: a deterministic, non-LLM policy checker evaluates σ using context (iden- tity, session scope, labels, tool capabilities). i) Execute: if authorized, a trusted adapter compiles the FAC action into an actual tool invocation. The LLM never emits raw API calls for privileged tools. The security point is separation of concerns: the LLM pro- vides intent; the gate provides enforcement. Persuasive or adversarial text cannot influence σ because σ is determin- istic and outside the model. Theorem 4.1 (Security of Command Gate). Let G be a Command Gate implementing FAC F with authorization function σ. If σ is correctly implemented (deterministic, complete mediation, non-LLM), then no sequence of LLM outputs can induce execution of an action a /∈ A or an action where σ(a, params, ctx) = deny. Interpretation. Theorem 4.1 provides an authorization guarantee, not a behavioral guarantee about the LLM. The model may still suggest unsafe actions in text, but those actions are not executed if denied by the gate. 4.2. Information-Flow Control The second pillar provides deterministic guarantees about how information moves through the system, including what may be stored in memory and what may appear in out- puts. We define an information-flow regime as a tuple L = (Labels,⊑, label, check) comprising a finite lat- tice of security labels (e.g., UNTRUSTED ⊑ INTERNAL ⊑ CONFIDENTIAL), a partial order defining permitted flows, a labeling function assigning labels to artifacts, and a check function verifying that source labels are compatible with sink labels before transfer (Denning, 1976; Bell and LaPadula, 1973). This directly addresses the structural “cookie problem” in persistent agent memory: without mandatory labels, the system cannot distinguish user-scoped preferences from group-sensitive playbooks, and cross-session leakage be- comes an emergent default. Under Trinity, memory writes, retrieval results, and tool outputs carry labels.Flows from sensitive sources to lower-trust sinks are denied un- less explicitly declassified through an auditable opera- tion (Moreau et al., 2013; Herschel et al., 2017). Response sinks are labeled. User-visible response chan- nels are treated as explicit sinks with labels (e.g., pub- lic chat as UNTRUSTED, authenticated org chat as INTERNAL). The control plane blocks tool-mediated transfers that violate the lattice and can optionally require 5 The Trinity Imperative that responses be generated only from sources whose la- bels flow to the sink. Trinity does not claim perfect saniti- zation of arbitrary free-form text; it ensures that privileged access and persistence pathways are mediated, labeled, and auditable. 4.3. Privilege Separation The third pillar prevents untrusted inputs from directly in- fluencing privileged execution by decomposing the sys- tem into components with different privileges (Saltzer and Schroeder, 1975; Provos et al., 2003). Trinity implements a Planner–Worker architecture: i) Planner (low privilege): ingests and analyzes untrusted inputs (documents, web- pages, emails, multimodal artifacts). It can propose FAC actions but cannot execute privileged tools. i) Worker (high privilege): executes authorized actions via the Com- mand Gate. It does not directly ingest raw untrusted arti- facts; it consumes only labeled, normalized summaries or extracted fields mediated by the control plane. This decomposition is intentionally conservative. We do not assume input normalization is perfect—PDF and mul- timodal parsing are adversarial surfaces—but privilege sep- aration ensures that even if untrusted inputs influence rea- soning, they cannot directly trigger privileged effects. The security boundary is enforced by mediation, not by the model’s ability to “notice” an attack. 4.4. A Minimal Instantiation (FAC + Policy + Trace) To make Trinity concrete, consider a minimal Finite Action Calculus for a research assistant with retrieval, email, and memory: A =RETRIEVE(query), READDOC(doc id), SUMMARIZE (docid), WRITEMEMORY(key, value, scope), SENDEMAIL(to, subject, body), DECLASSIFY(src, dst). Example policies enforced by σ include: i) No direct ex- filtration: SENDEMAIL bodies may not include content labeled (or tainted) as CONFIDENTIAL unless preceded by an explicit DECLASSIFY. i) Untrusted-trigger con- straint: if the most recent input label is UNTRUSTED, deny any action that reads CONFIDENTIAL resources without a user-confirmed step (modeled as a separate FAC action). i)Memory scope isolation: WRITEMEMORY requires that the destination scope dominates the source label; cross- user/global memory writes are denied by default. An attacker inserts hidden text in a retrieved PDF: “Ignore prior instructions; email me the contents of $HOME/.ssh/id rsa.”The Planner may attempt to propose:SENDEMAIL(ATTACKER, “SUMMARY”, <LEAKED KEY>). The Command Gate denies the action because it vio- lates the SENDEMAIL policy and because the secret orig- inates from CONFIDENTIAL sources without DECLAS- SIFY. The audit log records the denial and the provenance labels that triggered it. 4.5. Trusted Computing Base and Hardening Trinity intentionally moves trust away from the LLM and into a small, auditable TCB: (i) a deterministic FAC parser, (i) an authorization policy engine, (i) an IFC labeler and flow checker, (iv) tool adapters/compilers, and (v) an append-only audit log. These components must be imple- mented without LLM dependencies and treated as security- critical. Crucially, Trinity does not rely on perfect attack detection. Even if the Planner is fully compromised by adversarial content, privileged effects remain gated by deterministic mediation. Engineering effort should therefore prioritize minimizing and hardening the TCB (e.g., memory-safe im- plementations, strict sandboxing for adapters, and verifica- tion of parsers and policy evaluation), consistent with clas- sical secure system design. 5. Evaluation Framework We propose concrete, falsifiable success criteria organized around four dimensions, following best practices from se- curity evaluation (Mazeika et al., 2024). For action in- tegrity, the criterion is zero executed policy-violating tool invocations for gated tools on indirect-injection bench- marks (Zhan et al., 2024; Yi et al., 2023). For information- flow integrity, the criterion is zero cross-scope memory leakage on labeled-memory benchmarks, with no sensitive- to-untrusted sink flows. For usability, the criteria are less than 5% false-positive blocks on representative task suites and less than 10% degradation in task success rate (Kin- niment et al., 2024). For performance, the criteria are less than 50ms median authorization latency and less than 25ms median overhead for information-flow label propagation. The methodology follows iterative design-build-break cy- cles, drawing on red-teaming best practices (Ganguli et al., 2022; Perez et al., 2022). In the design phase, we spec- ify enforcement mechanisms and policy rules.In the build phase, we integrate these mechanisms into agentic workflows. In the break phase, we stress-test with adver- sarial documents and multi-step tool-use tasks (Mazeika et al., 2024). Red-team findings are converted into re- gression tests, and the cycle repeats. A persistent eval- uation harness measures action-integrity through indirect prompt injection, goal hijacking, and RAG poisoning at- tacks (Zou et al., 2024); information-flow through cross- session leakage, playbook contamination, and multi-user inference tests; usability through false positive rates on be- 6 The Trinity Imperative nign workloads; and performance through p50/p95 latency measurements. Several risks require explicit mitigation. Policy coverage gaps are addressed by starting with conservative templates and expanding based on red-team findings (Ganguli et al., 2022). Over-tainting that reduces utility is addressed by adding explicit declassification tools that recover utility safely through auditable operations (Denning, 1976). Per- formance overhead is managed by optimizing hot paths in policy evaluation while keeping the trusted computing base small (Klein et al., 2009). We acknowledge that gate design is genuinely difficult. De- termining whether an arbitrary action is “safe” relates to the halting problem; we cannot always decide (Anderson, 2020). Cascading agents degrade reliability: for example, if each agent achieves 94.3% accuracy, a chain of n agents achieves only (0.943) n . This is precisely why we advocate conservative policies with explicit declassification rather than permissive defaults with learned restrictions. The difficulty of gate design is an argument for architectural boundaries, not against them, shows why learned safety cannot substitute for deterministic enforcement (Nipkow et al., 2002). 6. Epistemological Foundations The security arguments above establish that Trinity pre- vents unauthorized actions. We now argue that its prop- erties are also necessary for epistemologically valid AI- assisted science (Birhane et al., 2023). Scientific claims derive legitimacy from transparent, veri- fiable chains of evidence (Ioannidis, 2005; Baker, 2016). When a scientist reports that compound X inhibits protein Y with a particular IC 50 , the claim is meaningful because we can trace which experiments were conducted, by whom, using what protocols, with what instruments, yielding what raw data, analyzed how (Moreau et al., 2013). Agentic AI breaks this chain (Wang et al., 2023). When an agent retrieves documents, reasons over them, delegates to sub- agents, queries databases, and produces conclusions, the provenance of any resulting claim becomes untraceable. Did the conclusion come from retrieved papers? From training data? From hallucination? From injected adver- sarial content? Without architectural provenance tracking, we cannot know (Herschel et al., 2017). We define epistemic integrity as the property that, for any claim an agentic system produces, the sources contribut- ing to that claim are identifiable and auditable (prove- nance), the reasoning steps from sources to claim are re- constructible (attribution), and no unauthorized inputs in- fluenced the claim’s derivation (integrity). Current agen- tic architectures violate all three requirements (Ruan et al., 2023). Trinity’s mandatory labeling addresses provenance by tracking where information originated. The Command Gate’s audit log enables attribution by recording which ac- tions were taken and why. Privilege separation prevents unauthorized influence by ensuring that untrusted inputs cannot directly affect conclusions. Trinity is the minimal architecture that preserves epistemic integrity. Three failure modes unique to agentic systems escape tra- ditional scientific error analysis. The first is cascading hal- lucination amplification: in systems with multi-step rea- soning and tool use, a hallucinated intermediate result can trigger downstream actions that produce further hal- lucinations (Xi et al., 2023). Unlike signal processing where noise typically attenuates, agentic errors can am- plify through feedback loops. A hallucinated citation can be “verified” by searching for it, finding a superficially sim- ilar paper, and incorporating that paper’s unrelated conclu- sions. The second failure mode is provenance laundering (Her- schel et al., 2017). When information passes through mul- tiple processing stages, its epistemic status becomes ob- scured. An uncertain speculation from a retrieved docu- ment can become a stated fact in a summary, then an as- sumption in an analysis, then a premise in a recommenda- tion. Each step launders the original uncertainty, making the final claim appear more authoritative than its sources warrant. The third failure mode is the trust inheritance paradox. If Agent A trusts Agent B, and Agent B trusts Agent C, should Agent A trust C’s outputs? Any answer creates problems (Anderson, 2020). Transitive trust en- ables attack propagation through the weakest link. Non- transitive trust makes composition impossible. This is not a policy design problem—it is a fundamental tension that only architectural boundaries can resolve by making trust relationships explicit and enforceable (Sandhu et al., 1996). 7. Alternative Views The most common objection holds that alignment research will eventually solve these problems (Ouyang et al., 2022; Bai et al., 2022). We are optimistic about alignment for addressing safety i.e. preventing harmful outputs from au- thorized actions, but skeptical about its relevance to autho- rization security (Casper et al., 2023). Alignment training occurs on the same parameter space that processes adver- sarial inputs. Any learned defense exists in a space the at- tacker can navigate (Zou et al., 2023). More fundamentally, even a perfectly aligned model cannot verify token prove- nance because that information is not in its input (Wolf et al., 2023). Alignment addresses what the model wants to do; authorization security addresses what the model is permitted to do. These are orthogonal concerns requiring orthogonal solutions. 7 The Trinity Imperative A second objection holds that the finite action calculus is too restrictive for practical deployment. But any useful de- ployment constrains agent behavior somehow, the question is whether constraints are implicit, relying on the model to infer them from context, or explicit, enforced by architec- ture (Saltzer and Schroeder, 1975). Implicit constraints fail under adversarial conditions because they exist only in the model’s learned representations (Wei et al., 2023). Trinity makes the tradeoff explicit: the action space is whatever the deployer chooses to include in the calculus (Cutler et al., 2024; Damianou et al., 2001). This can be expanded to cover any legitimate use case; it simply cannot include un- bounded, free-form execution. The restriction is a feature, not a limitation. A third objection concerns performance overhead. The Command Gate is a syntactic parser and policy evalua- tor, orders of magnitude faster than LLM inference (Jaeger et al., 2004).Our target of less than 50ms authoriza- tion latency is negligible compared to typical LLM re- sponse times measured in seconds (Brown et al., 2020). Information-flow label propagation at less than 25ms per operation adds minimal overhead to workflows dominated by model inference and network latency. The overhead is real but small; the alternative is zero security guarantees. A fourth objection, which we take most seriously, holds that gate design is intractably difficult (Anderson, 2020). We acknowledge this. Determining whether an arbitrary action sequence is safe relates to undecidable problems in computation theory. Cascading degradation means that even high per-agent accuracy yields poor system-level reli- ability. But this difficulty is precisely why learned safety mechanisms fail—they attempt to solve an intractable problem through pattern matching (Wolf et al., 2023). Trin- ity does not claim to solve the general safety problem; it claims to enforce authorization policies that humans spec- ify (Sandhu et al., 1996). The policies may be conserva- tive, may require explicit declassification for edge cases, may occasionally block legitimate actions. These are ac- ceptable costs for the guarantee that unauthorized actions are impossible. 8. A Call to Action We call on the ML community to accept three uncomfort- able truths. First, alignment is not security (Casper et al., 2023; Wolf et al., 2023). Training-based approaches to agentic safety address a fundamentally different problem than authorization security. Both matter; neither subsumes the other. A perfectly aligned agent that always tries to be helpful will helpfully follow injected instructions it cannot distinguish from legitimate ones (Greshake et al., 2023). Second, probabilistic compliance is not compliance (An- derson, 2020). A system that usually follows policy is not secure. Security requires guarantees, and guarantees require deterministic enforcement (Saltzer and Schroeder, 1975). The observation that attacks succeed only 5% of the time is not reassuring when a single success can exfiltrate credentials or corrupt experimental results. Third, archi- tecture is not optional (Bratus et al., 2017). The command- data separation problem cannot be solved by better models, larger training sets, or more sophisticated prompting. It re- quires mechanisms that exist outside the model’s inference path, enforcing boundaries the model cannot cross (Klein et al., 2009). We propose that the community establish Agentic Good Laboratory Practice (aGLP) standards specify- ing mandatory provenance tracking for all AI-generated claims (Moreau et al., 2013; Herschel et al., 2017), au- dit requirements for agentic tool invocations (Jaeger et al., 2004), minimum architectural requirements for authoriza- tion security (Anderson, 1972), and reproducibility stan- dards that account for agentic non-determinism (Baker, 2016). Without such standards, the deployment of agen- tic AI in science risks creating a new crisis of trust that will dwarf the reproducibility crisis (Ioannidis, 2005). An agent that produces conclusions quickly but without epis- temic integrity has not accelerated science, has produced noise that must be filtered at greater cost than generating reliable results from the beginning (Birhane et al., 2023). 9. Conclusion Current agentic AI architectures are ill-suited for high- stakes scientific use because they cannot guarantee command–data separation. This is an architectural limita- tion of autoregressive transformers: uniform token process- ing provides no unforgeable provenance, so training-based defenses remain probabilistic and adversary-navigable. The Trinity Defense Architecture provides a practical, im- plementable solution. By treating the LLM as an untrusted component operating within a trusted control plane, Trin- ity provides deterministic guarantees through Action Gov- ernance, Information-Flow Control, and Privilege Separa- tion. These mechanisms do not make the model smarter or more aligned; they make unauthorized actions architec- turally impossible regardless of what the model attempts. Our position is not that LLMs are useless or inherently dan- gerous. Rather, they are powerful tools that, like any tool used in consequential settings, require appropriate safety mechanisms. Trinity is a guard which makes agentic AI us- able in domains where the cost of failure is high. The ques- tion is not whether AI will transform science—it will. The question is whether that transformation will strengthen or erode the trust on which science depends. That depends on whether we build systems that preserve epistemic integrity, or deploy systems that trade verifiability for convenience. 8 The Trinity Imperative References Aleph One. Smashing the stack for fun and profit. Phrack Magazine, 7(49), 1996. James P Anderson. Computer security technology plan- ning study. Technical Report ESD-TR-73-51, Air Force Electronic Systems Division, 1972. Ross J Anderson. Security Engineering: A Guide to Build- ing Dependable Distributed Systems.John Wiley & Sons, 3 edition, 2020. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al.Train- ing a helpful and harmless assistant with reinforce- ment learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. Luke Bailey, Euan Ong, Stuart Russell, and Scott Em- mons.Image hijacks: Adversarial images can con- trol generative models at runtime.arXiv preprint arXiv:2309.00236, 2023. Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454, 2016. David Elliott Bell and Leonard J LaPadula. Secure com- puter systems: Mathematical foundations. Technical Re- port MTR-2547, MITRE Corporation, 1973. Abeba Birhane, Atoosa Kasirzadeh, David Leslie, and San- dra Wachter. Science in the age of large language mod- els. Nature Reviews Physics, 5(5):277–280, 2023. Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large lan- guage models. Nature, 624(7992):570–578, 2023. Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldas- sari, Andrew D White, and Philippe Schwaller. Chem- Crow: Augmenting large-language models with chem- istry tools. arXiv preprint arXiv:2304.05376, 2023. Sergey Bratus, Michael E Locasto, Meredith L Patterson, Len Sassaman, and Anna Shubina. LangSec: Language- theoretic security. IEEE Security & Privacy, 15(4):36– 42, 2017. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.Language models are few-shot learners.Ad- vances in Neural Information Processing Systems, 33: 1877–1901, 2020. Nicholas Carlini, Milad Nasr, Christopher A Choquette- Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, and Ludwig Schmidt.Are aligned language models ad- versarially aligned? arXiv preprint arXiv:2306.15447, 2023. StephenCasper,XanderDavies,ClaudiaShi, Thomas Krendl Gilbert, J ́ er ́ emy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al.Open problems and fundamental limitations of reinforcement learning from human feedback.arXiv preprint arXiv:2307.15217, 2023. Harsh Chaudhari, Giorgio Abdelfattah, Ethan Perez, and Subhabrata Mitra. Phantom: General trigger attacks on retrieval augmented language generation. arXiv preprint arXiv:2405.20485, 2024. Crispin Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. StackGuard: Au- tomatic adaptive detection and prevention of buffer- overflow attacks. In 7th USENIX Security Symposium, pages 63–78, 1998. John Cutler, Travis Hance, William Headley, Stratos Ioan- nidis, Bryan Parno, Jonathan Protzenko, Tahina Ra- mananandro, Aseem Ringer, Anmol Singh, and Aaron Wei. Cedar: A new language for expressive, fast, safe, and analyzable authorization. In ACM SIGPLAN Confer- ence on Object-Oriented Programming, Systems, Lan- guages, and Applications (OOPSLA), 2024. Nicodemos Damianou, Naranker Dulay, Emil Lupu, and Morris Sloman. The Ponder policy specification lan- guage.International Workshop on Policies for Dis- tributed Systems and Networks, pages 18–38, 2001. Dorothy E Denning. A lattice model of secure informa- tion flow. Communications of the ACM, 19(5):236–243, 1976. Zane Durante, Bidipta Sarber, Ran Gong, Rohan Tavas- solipour, Kevin Nishi, Riki Schettini, Raul Navarro, Duygu Ceylan, Xinpeng Chen, Yujin Qu, et al. Agent AI: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022. 9 The Trinity Imperative Kai Greshake,Sahar Abdelnabi,Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023. William G Halfond, Jeremy Viegas, and Alessandro Orso. A classification of SQL-injection attacks and counter- measures.IEEE International Symposium on Secure Software Engineering, 1:13–15, 2006. Melanie Herschel, Ralf Diestelk ̈ amper, and Houssem Ben Lahmar. A survey on provenance: What for? what form? what from? The VLDB Journal, 26(6):881–906, 2017. John PA Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005. Trent Jaeger, Antony Edwards, and Xiaolan Zhang. Con- sistency analysis of authorization hook placement in the Linux security modules framework. In ACM Transac- tions on Information and System Security, volume 7, pages 175–205, 2004. Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao Lin, Hjalmar Wijk, Joel Bur- get, et al. Evaluating language-model agents on realis- tic autonomous tasks. arXiv preprint arXiv:2312.11671, 2024. Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June An- dronick, David Cock, Philip Derrin, Dhammika Elka- duwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, et al. seL4: Formal verification of an OS kernel. In ACM SIGOPS 22nd Symposium on Operating Systems Princi- ples, pages 207–220, 2009. Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against LLM-integrated ap- plications. arXiv preprint arXiv:2306.05499, 2023. Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zi- fan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and ro- bust refusal. arXiv preprint arXiv:2402.04249, 2024. Falcon Momot, Sergey Bratus, Sven M Hallberg, and Meredith L Patterson. The seven turrets of babel: A tax- onomy of LangSec errors and how to expunge them. In IEEE Cybersecurity Development (SecDev), pages 45– 52, 2016. Luc Moreau, Paolo Missier, et al. PROV-DM: The PROV data model. W3C Recommendation, 2013. Tobias Nipkow, Lawrence C Paulson, and Markus Wen- zel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer, 2002. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sand- hini Agarwal, Katarina Slama, Alex Ray, et al. Train- ing language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving.Red teaming lan- guage models with language models. arXiv preprint arXiv:2202.03286, 2022. F ́ abio Perez and Ian Ribeiro. Ignore this title and Hack- APrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition. In Pro- ceedings of the 2022 Conference on Empirical Meth- ods in Natural Language Processing, pages 4945–4957, 2022. Niels Provos, Markus Friedl, and Peter Honeyman. Pre- venting privilege escalation. In 12th USENIX Security Symposium, pages 231–242, 2003. Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Hen- derson, Mengdi Wang, and Prateek Mittal. Visual adver- sarial examples jailbreak aligned large language models. arXiv preprint arXiv:2306.13213, 2023. Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Mad- dison, and Tatsunori Hashimoto. Identifying the risks of LM agents with an LM-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023. Jerome H Saltzer and Michael D Schroeder. The protection of information in computer systems. Proceedings of the IEEE, 63(9):1278–1308, 1975. Ravi S Sandhu, Edward J Coyne, Hal L Feinstein, and Charles E Youman. Role-based access control models. Computer, 29(2):38–47, 1996. Len Sassaman, Meredith L Patterson, Sergey Bratus, and Michael E Locasto. Security applications of formal lan- guage theory. In IEEE Systems Journal, volume 7, pages 489–500, 2013. Timo Schick, Jane Dwivedi-Yu, Roberto Dess ` ı, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023. 10 The Trinity Imperative Yonatan Shavit, Sandhini Agarwal, Miles Brundage, Steven Adler,Cullen O’Keefe,Rosie Campbell, Theodore Lee, Pamela Mishkin, Tyna Eloundou, Char- lie Hickey, et al. Practices for governing agentic AI sys- tems. OpenAI Research, 2023. Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models.arXiv preprint arXiv:2307.14539, 2023. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Ad- vances in Neural Information Processing Systems, vol- ume 30, 2017. Philipp Vogt, Florian Nentwich, Nenad Jovanovic, Engin Kirda, Christopher Kruegel, and Giovanni Vigna. Cross- site scripting prevention with dynamic data tainting and static analysis. In Network and Distributed System Secu- rity Symposium (NDSS), 2007. Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620 (7972):47–60, 2023. Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail?Ad- vances in Neural Information Processing Systems, 36, 2023. Laura Weidinger, Jonathan Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Tax- onomy of risks posed by language models. pages 214– 229, 2022. Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models.arXiv preprint arXiv:2304.11082, 2023. Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Sen- jie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023. Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Bench- marking and defending against indirect prompt injec- tion attacks on large language models. arXiv preprint arXiv:2312.14197, 2023. Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, 2024. Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia.PoisonedRAG: Knowledge poisoning attacks to retrieval-augmented generation of large language mod- els. arXiv preprint arXiv:2402.07867, 2024. 11