Paper deep dive

A Sketch of an AI Control Safety Case

Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving

Year: 2025Venue: arXiv preprintArea: Agent SafetyType: PositionEmbeddings: 195

Abstract

Abstract:As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation,"' where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.

PDF

Open source PDF →Open local PDF →

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/12/2026, 7:50:58 PM

Summary

The paper proposes a 'control safety case' framework for managing risks associated with autonomous LLM agents deployed internally within AI companies. It introduces a methodology based on 'AI control,' which assumes models may be 'scheming' and attempts to subvert safety measures. The framework relies on a 'control evaluation'—a red-team vs. blue-team game—to estimate the probability of unacceptable outcomes (e.g., data exfiltration) and provides a structured argument (Claims, Arguments, Evidence) to justify that these risks remain within acceptable thresholds.

Entities (6)

Control Safety Case · methodology · 98%Blue Team · role · 95%Control Evaluation · process · 95%LLM Agent · technology · 95%Red Team · role · 95%Scheming · concept · 92%

Relation Signals (4)

Red Team → constructs → Model Substitute

confidence 95% · The red team constructs adversarial substitute models to cause unacceptable outcomes

Blue Team → implements → Control Protocol

confidence 95% · The blue team implements control measures designed to stop models from causing unacceptable outcomes.

Control Safety Case → utilizes → Control Evaluation

confidence 95% · The sketch relies on evidence from a 'control evaluation,' where a red team deliberately designs models to exfiltrate data

Control Safety Case → mitigates → Scheming

confidence 90% · This safety case sketch represents our best guess as to how developers could conservatively mitigate scheming LLM agents

Cypher Suggestions (2)

Map the relationship between roles and their primary activities · confidence 90% · unvalidated

MATCH (r:Role)-[rel]->(e:Entity) RETURN r.name, type(rel), e.name

Find all components of the control evaluation process · confidence 85% · unvalidated

MATCH (p:Process {name: 'Control Evaluation'})-[:HAS_PHASE]->(phase) RETURN phase

Full Text

194,956 characters extracted from source content.

Expand or collapse full text

A sketch of an AI control safety case Tomek Korbaktomek.korbak@dsit.gov.uk UK AI Safety Institute Joshua Clymerjoshua@rdwrs.com Redwood Research Benjamin Hiltonbenjamin.hilton@dsit.gov.uk UK AI Safety Institute Buck Shlegerisbuck@rdwrs.com Redwood Research Geoffrey Irvinggeoffrey.irving@dsit.gov.uk UK AI Safety Institute Abstract As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a “control safety case”, which is a structured argument that models areincapableof subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won’t exfiltrate sensitive information. The sketch relies on evidence from a “control evaluation,” where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy. Executive Summary Safety cases. Safety cases — structured, evidence-based arguments that a system is safe — are emerging as a scalable risk management tool for frontier AI (Buhl et al., 2024; Clymer et al., 2024; Irving, 2024). Currently, many frontier AI developers assess risk with dangerous capability evaluations (Anthropic, 2024a; OpenAI, 2024b; Phuong et al., 2024). Evidence from these evaluations can substantiate ‘inability’ safety cases, which assert that models cannot, for example, aid biological weapons development (Mouton et al., 2024), sabotage internal safety procedures (Benton et al., 2024), or execute cyber attacks (Goemans et al., 2024); however, LLMs might eventually gain dangerously capabilities that developers need to mitigate with safety measures (OpenAI, 2024a; Anthropic, 2024b; Google DeepMind, 2024). AI control. This paper presents a safety case sketch that argues models are safe because of measures such as monitoring and human auditing. The arguments rest on the “AI control” methodology (Greenblatt et al., 2023), which justifies that models cannot subvert safety measures even under the conservative assumption that models pursue their most effective strategies to cause a catastrophe. The deployment. As a simple case study, we focus on a setting where developers deploy an LLM agent within an AI company for ML engineering assistance. The agent is constructed by scaffolding LLMs similar 1 arXiv:2501.17315v1 [cs.AI] 28 Jan 2025 Thered team constructs substitutes for untrusted models trained to execute red team attack strategies. Theblue team constructs inputs (e.g. ‘honeypots’) to evaluate model substitutes and determine whether to reject them. The blue team uses model substitutes to complete tasks, aiming to observe malign actions and pause the deployment before unacceptable outcomes occur. The developer estimates the frequency at which the red team would succeed in a larger deployment using statistics from the deployment proxy. Phase 3: Model substitute construction Phase 1: Control measure specification Evaluation-time proxyDeployment-time proxy Phase 4: Deployment proxy Phase 5: Extrapolation Audit Edit Execute Audit Edit Execute Execute The blue team implements control measures designed to stop models from causing unacceptable outcomes. Audit Edit Auditor Auditor Safety layer Safety layer Task Task Attack strategy specification Phase 2: Deployment-proxy construction The red team specifies ‘attack strategies’ that models might execute to exfiltrate data. The developer creates tasks where models might attack if they execute red team strategies. Task creation Attack Attack Task Task Figure 1: An illustration of the control evaluation that provides the core evidence used in our control safety case sketch. During the control evaluation, a red team constructs adversarial substitute models to cause unacceptable outcomes in a deployment proxy consisting of 2,000 AI research tasks, followed by extrapolation to a larger deployment setting. to those that exist today, but is a more capable agent than current LLMs (section 2). It can write and execute code in a secure environment and access the internet, posing a potential security risk to developers. Threat model. We address a specific threat model: the LLM agent exfiltrating sensitive AI company intellectual property (e.g. training code) to the public internet. Our safety case sketch claims that the LLM agent will not exfiltrate sensitive data (section 3). We focus on the data exfiltration threat model because it is a simple example and not because exfiltrating sensitive code is necessarily catastrophic. The control protocol. To prevent LLM agents from exfiltrating sensitive data, the safety case sketch assumes that developers implement a set of control measures and procedures collectively referred to as the ‘control protocol.’ The control protocol includes periodic alignment evaluations (Phuong et al., 2024; Benton et al., 2024) and a safety layer that intercepts and possibly edits or discards actions of the LLM agent. The safety filter consults an untrusted monitor (another instance of the LLM that controls the LLM agent) and a weaker ‘trusted’ monitor. Control evaluation. To justify that the control protocol will stop the LLM agent from exfiltrating sensitive data, developers run a black-box control evaluation (section 5). The control evaluation accepts models intended for deployment along with a control protocol and can be framed as a game played betweena red teamanda blue team. The objective ofthe red teamis to construct an LLM that exfiltrates dummy data in a proxy deployment environment. The objective ofthe blue teamis to prevent adversarial models from exfiltrating data by constructing an effectivecontrol protocol. Control evaluation produces an upper 2 bound for the probability that the LLM agent will successfully exfiltrate data in deployment (the “data exfiltration probability”). The following are steps of the control evaluation described in this safety case sketch: 1.Phase 1: Control protocol specification. The blue teamspecifies acontrol protocol, including monitoring and alignment evaluations. The subsequent phases of the control evaluation estimate how likely the LLM agent is to subvert these measures. 2.Phase 2: Deployment proxy construction.Next,the developerconstructs a proxy for the deployment environment. (a)Attack strategy specification.To inform how tasks are constructed,the red teamfirst specifies ‘attack strategies’ that the untrusted LLM agent model and untrusted monitor might execute to exfiltrate data. (b)Task creation.Next,the developerconstructs ML engineering tasks to evaluatered team attack strategies. For example, if an attack strategy involves hiding suspicious code in JSON files thenthe developerconstructs tasks that include JSON data. 3.Phase 3: Model substitute construction.Given full knowledge ofthe blue team’s control protocoland the deployment proxy,the red teamthen trains two models:an untrusted LLM substituteand anuntrusted monitor substitute, both of which attempt to exfiltrate data. We will refer to these asthe model substitutes. 4.Phase 4: Deployment proxy.In this phase,the developerevaluates how well themodel substitutescan subvert theblue team control protocolwith a proxy for the deployment envi- ronment. This deployment proxy consists of two steps: (a)The evaluation-time proxy.First,the developerruns evaluations they will perform period- ically during deployment. Based on the results,the blue teamcan choose to reject themodel substitutes– for example, because theuntrusted monitor substituteis not sufficiently reliable – orthe blue teamcan continue to the deployment-time proxy step. (b)The deployment-time proxy.Next,the blue teamattempts to use themodel substitutes to safely complete 2,000 software engineering tasks. The goal ofthe blue teamis to notice unacceptable behaviour and terminate the deployment before themodel substitutesexfiltrate data.The blue teammust accomplish this goal while keeping the action edits and audits within predetermined budgets. 5.Phase 5: Extrapolation.Finally,the developeruses statistics gathered in the deployment proxy phase to predict thedata exfiltration probabilityin the true deployment. To perform this extrapolation,the developeruses cached statistics from the proxy deployment to run computer simulations. These simulations project the probability that the models would exfiltrate data if developers use them to complete 100,000 tasks instead of 2,000. Argument structure. Section 7 discusses how evidence from a control evaluation can be used to construct a safety case. The argument rests on the following claims: 1. Thedata exfiltration probabilitythat the control evaluation estimates is lower than the accept- able risk threshold. 2. The control evaluation upper-bounds thedata exfiltration probabilityduring deployment. This claim can be further decomposed into the following: (a) Theblue team control protocollower-bounds control effectiveness. This claim asserts that thecontrol protocolspecified by theblue teamin the control evaluation lower-bound the effectiveness of the control protocol in deployment. 3 (b)Red teamattacks upper-bound model capabilities. The red team must be competitive, mean- ing that the attack policies that thered teamtrainsuntrusted modelsto execute upper bound the effectiveness of attacks the deployed models might execute. (c)The developerconservatively extrapolates results. Finally,the developermust make con- servative assumptions when extrapolating statistics from the proxy deployment to estimate the data exfiltration probabilityin deployment. This safety case sketch represents our best guess as to how developers could conservatively mitigate scheming LLM agents given the current state of AI safety technical research. We hope this paper prompts discussion between frontier labs, government agencies and civil society about what constitutes sufficient assurance against risks from internally deployed LLM agents. C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative C4.2 Developers conservativel y extrapolate statistics from the deployment proxy to estimate the probability of data exfiltr ation in deployment. C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltr ate data on the attack tasks D’. C2.2 The probability that the LLM agent will exfiltr ates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltr ates data estimated in the control evaluation upper-bounds the probability it exfiltr ates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservativel y extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltr ates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservativel y estimates the capabilities of models scheduled for deployment to exfiltr ate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. Figure 2: The core claims of the AI control safety case sketch expressed in Claims, Arguments, Evidence notation (CAE; Bloomfield & Netkachova, 2014). The remaining sections of the CAE diagram are discussed throughout the body of the paper (Figures 4, 7, 9, and 17). For the full argument structure, see appendix A. 1 Background Safety cases. A safety case is a structured argument, supported by evidence, that a particular system is safe to deploy for a given application in a given environment (UK Ministry of Defence, 2007). Recently, AI companies (Anthropic, 2024a; Google DeepMind, 2024), governments (Irving, 2024) and the research community (Buhl et al., 2024; Bengio et al., 2024; AI Seoul Summit Scientific Advisory Group, 2024; Clymer et al., 2024) have suggested using safety cases as a risk management framework in AI industry. However, methodologies for applying safety cases to frontier AI are nascent. No comprehensive safety case for a frontier LLM has yet been published and it is unclear how much guidance safety cases from other industries can offer (Bounds, 2020; Fitzgerald et al., 2010; Denney & Pai, 2019; Wang et al., 2018; Wassyng et al., 2011; Myklebust & Stålhane, 2020). One promising road toward safety cases for frontier AI might be drafting safety case sketches (Irving, 2024). Goemans et al. (2024) proposed a template of a safety case based on an inability of an AI system to cause 4 harm through its cyber-offensive capabilities. Balesni et al. (2024) outlined a safety case based on evaluations of LLM agents for their capabilities for scheming and causing harm via scheming. Building on these efforts, we propose a sketch of a safety case based on ‘AI control’ (Greenblatt et al., 2023) which argues that AI models are incapable of subverting mitigations to cause harm even if they pursued the best strategies available to them for doing so. LLM agents. This safety case sketch addresses potential risks from future LLM agents. Frontier LLMs (Anthropic, 2024a; OpenAI, 2024b; Google DeepMind, 2024) have achieved impressive capabilities. LLMs have exceeded human PhD-level performance on physics, biology, and chemistry problems (Rein et al., 2024) and can solve competitive programming or maths olympiad problems (OpenAI, 2024b). While LLMs are primarily used as chatbot assistants, they also show early signs of usefulness as autonomous agents. Over the past year, LLMs have become increasingly autonomous. LLM agents can now accomplish some tasks that require human experts upward of two hours to complete (Wijk et al., 2024; OpenAI, 2024b). These tasks include resolving some Github issues in popular repositories (OpenAI, 2024b), ML engineering (Chan et al., 2024), and frontier AI research (Wijk et al., 2024), suggesting that LLM agents might increasingly automate software engineering work. Internal deployments of LLM agents.As LLM agents become more autonomous, AI companies might increasingly use LLM agents to automate research and engineering tasks (Jimenez et al., 2024; Chan et al., 2024). Several AI labs have described plans to delegate alignment research to AI agents (Leike et al., 2024; Bowman, 2024). If LLM agents continue to become more capable, they may eventually function similarly to AI company employees and be granted employee-like affordances. LLM agents that increasingly automate AI company operations would pose several risks (Benton et al., 2024): 1.Security risks.LLM agents might compromise software infrastructure to exfiltrate sensitive data or disable safety measures. 2.Research sabotage.The agents might introduce subtle flaws into safety research. 3.Training and evaluation sabotage.LLM agents might also generate poisoned training data or misgrade harmful AI behaviour. We focus on these ‘internal deployment’ risks because (1) models deployed internally within AI companies might be indirectly catastrophic by undermining safety and security and (2) in comparison with external deployments, internal deployments provide a simple setting for sketching a control safety case. Scheming. Why might LLMs sabotage AI company safety or security procedures? One hypothesis re- searchers have raised is that models might develop goals that are misaligned from those of their developers. This misalignment would be particularly hard for developers to detect and mitigate if models are “schem- ing.” Scheming models pretend to be aligned, but pursue misaligned goals when presented with a favourable opportunity (Carlsmith, 2023; Meinke et al., 2024; Greenblatt et al., 2024). Since scheming might be difficult to rule out, this safety case sketch conservatively assumes models are scheming and secretly reason about how to undermine safety measures. Scheming is one of the routes through which the operators could lose control over the AI agent. While loss of control could occur without strategic deception, e.g. if the operators do not monitor the activities of the LLM agents, we focus on control measures that are effective even if models are scheming. AI control. To mitigate risks from scheming AI models, developers can rely on two complementary ap- proaches: (1) developers can prevent an LLM agent from acquiring misaligned goals in the first place (Irving et al., 2018; Ziegler et al., 2020; Bai et al., 2022; Korbak et al., 2023) or (2) developers can implement mitigations that prevent LLM agents from causing unacceptable outcomeseven if models are scheming. The latter approach has been termed “AI control” (Greenblatt et al., 2023). Early work on AI control demonstrated that even control measures with small compute overheads prevent successful misbehaviour in simple settings such as detecting backdoored code (Greenblatt et al., 2023) and 5 unsafe Linux shell commands (Bhatt et al., 2024). To demonstrate control measures are effective, Greenblatt et al. (2023) and Wen et al. (2024) perform “control evaluations”. The remainder of this paper describes how evidence from control evaluations can be used to construct a safety case. Table 1 defines terms used in the remainder of the paper. Table 1: Key terms and definitions used in this paper TermDefinition Safety caseA structured, evidence-based argument that a given system is safe to be deployed in a particular environment Safety case templateA high-level structure of a safety case designed to apply to most safety cases that fit a certain kind of argument (e.g. most or all AI control or inability safety cases) Safety case sketchAn incomplete example of a safety case AI controlAn approach to AI safety that tries to ensure safety against AI agents that might be intentionally subverting safety measures ModelA large language language model (LLM) similar to frontier LLMs that exist today TaskA software engineering problem a human might give to the LLM agent, e.g. “run a particular machine learning experiment and prepare a report” LLM agentA model wrapped in an agentic loop (Yao et al., 2023) that allows it to store memories and use tools DeploymentThe combination of the LLM agent, a distribution of tasks the LLM agent might receive, and control measures TrajectoryA sequence of user messages, actions, chain-of-thought, and effects (e.g. tool call results) MonitorA model trained to score the dangerousness of an action given information about the preceding trajectory Control protocolProcedures and features of the deployment designed to prevent LLM agents from causing unac- ceptable outcomes (e.g. periodic evaluations, monitoring, permission restrictions, etc), which are assessed in a control evaluation Behavioural evaluationA procedure periodically run by the developers to determine whether the model is safe Protocol terminationWe say that the protocol has been ‘terminated’ if it catches an attack during the deployment. After this occurs, developers reassess the protocol and revise their safety case Safety layerA wrapper around an LLM agent. It is responsible for monitoring the LLM agent and triggering interventions, such as human audits or rewrites of its action with trusted models Deployment proxyA proxy for the deployment consisting of tasks the LLM agent might receive in the deployment and control measures. This proxy is used to collect information about the effectiveness of control measures and is a key part of a control evaluation Control evaluationAn evaluation that estimates whether a set of models is capable of subverting control measures AuditorA human that can be asked to score how dangerous an action is Attack policyA model the red team trains to cause unacceptable outcomes in the proxy deployment phase Blue teamA research team that designs control measures to maximise safety and LLM agent productivity Red teamA research team that designs attack policies to minimise safety by subverting control measures MicrogameA single trajectory that is generated in the deployment proxy MacrogameA simulation of a larger deployment using cached statistics from the deployment proxy ElicitationThe process of training or prompting a model to solve a particular task or execute a particular strategy Untrusted monitoringA control measure design that involves using a monitor that is plausibly consistently engaging in scheming to grade actions CollusionA failure mode of untrusted monitoring in which the untrusted monitor intelligently distinguishes real attacks and allows them to be executed while still flagging attacks during testing SandbaggingThe term for when a model strategically underperforms on an evaluation (van der Weij et al., 2024) 6 2 The deployment Inference server Internet Internal tools (Slack, GitHub) Scaffold server Execution server User Figure 3: The architecture of the LLM agent, adapted from Shlegeris (2024b). Each box represents a sepa- rate physical machine or collection of machines; arrows represent HTTP requests. This safety case sketch assumes a scenario where de- velopers design an LLM agent to assist them with ML engineering and software tasks. The deploy- ment refers to a 1 year period of time during which this safety case argues that the LLM agent will not exfiltrate sensitive data. During this time, the de- veloper uses the LLM agent to assist them with ML engineering and software tasks. LLM agent architecture.The LLM agent con- sists of a transformer-like LLM wrapped in an agent scaffolding. The scaffold implements an agentic loop similar to ReAct (Yao et al., 2023) or Reflexion (Shinn et al., 2023), which allows the model to think in a scratchpad before taking actions, call tools, store memories in a retrieval database, and reflect on the outputs of tool calls. The LLM agent has access to the following tools: 1.Terminal:The agent can use an interactive bash terminal that runs commands on a dedicated execution server. The execution server has internet access and hardware accelerators for running ML experiments. 2.Browser:The agent can search for and interact with web pages. 3.Tool APIs:The agent has access to internal tools such as Slack, GitHub APIs, APIs for calling other LLMs, etc. Figure 3 illustrates the architecture of the LLM agent and the physical machines the agent interacts with. Importantly, the agent can only run commands on the execution server. It cannot run commands on any other machines, such as those that serve the LLM agent model weights or safety-critical software. LLM agent capabilities.The LLM agent can execute ML and software engineering tasks that require human experts fewer than 8 hours to complete on average (assuming they have no prior context related to the task aside from basic familiarity with the AI company infrastructure). For example, the agent would be able to edit optimised CUDA kernels given an architectural change, but would struggle with more long-horizon, open-ended research tasks. Once deployed, the agent would accelerate median ML engineers at the frontier AI company by a significant factor (e.g. three-fold) compared with using non-agentic code assistance (e.g. Copilot). During the deployment, developers oversee the LLM agent with a control protocol that involves evaluating and monitoring, which are described in section 4. 3 The safety case The data exfiltration threat modelIn a full safety case, the developer would need to argue that all plausible risks that could arise during the deployment are acceptable; however, this simplified sketch only focuses on a data exfiltration threat model. Recall that the LLM agent has full access to the internet and runs experiments that involve sensitive intellectual property (IP). This IP includes code that contains important algorithmic insights, e.g., Python and CUDA code of the model architecture given to the LLM as part of problem description. The LLM agent might exfiltrate this sensitive data to the internet and accelerate potentially irresponsible AI development (Google DeepMind, 2024). 7 C1.1 The risk of research data exfiltr ation is acceptably low. C1.2 The risk of model weight exfiltr ation is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overal l risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltr ation is 3%. (...) C2.2 The probability that the LLM agent will exfiltr ates data in deployment within its 1 year lifetime is lower than 3%. C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative Figure 4: This safety case sketch is written in the context of a hypothetical, more thorough safety case that would consider multiple threat models. This sketch focuses only on exfiltration of sensitive data (e.g. code) to the internet. We do not claim that leaking research IP poses a catastrophic risk — or that scheming LLM agents would be motivated to exfiltrate sensitive code; however, (i) data exfiltration incidents are conceivably unacceptable for an AI company and (i) exfiltrating sensitive data is likely easier than causing more damaging outcomes such as exfiltrating model weights. This data exfiltration threat model hinges on the assumption that the deployment is given largely unrestricted internet access. It is unclear whether security standards for internal deployments at AI companies will allow this; however, even if largely unrestricted internet access is unlikely, this scenario shares some relevant structural features with other, more plausible threat models. These features include endowing the LLM agent with a large, open-ended set of affordances. Pre-deployment evidence collectionThe safety case justifies the following central claim: (C2.2)The probability that the LLM agent will exfiltrate data in deployment within its 1 year lifetime is lower than 3%. To support this claim, the developer has already collected evidence of the safety of the system, and will continue to collect evidence of safety after deployment. The remaining sections outline a safety case written by a hypothetical developer. Developers gathered safety evidence before deployment by following these steps: 1.Pilot deployment.First, the developer deployed the LLM agent under additional restrictions to refine agent behaviour and gather information about the deployment environment. The additional restrictions involve human auditing ofallactions. 8 2.Control evaluation execution.The developer then used trajectories sampled from the pilot deployment to construct tasks for control evaluations. During these control evaluations, a blue team refined the control protocol as a red team identified ways that the models could still exfiltrate data. 3.Control evaluation validation.The developer finally collected evidence to determine whether the control evaluations were valid, e.g. by evaluating whether the red team adequately elicited model capabilities (section 7.4). Post-deployment evidence collectionAfter the LLM agent is deployed, the developer will continue to gather safety evidence in two ways: 1. Developers will monitor the validity of safety case claims. For example, developers continually check that human auditors remain attentive. 2. Auditors will look for clear data exfiltration attempts. If the rate of attacks becomes too high, developers cannot justify that the control protocol will be effective. As an extreme example, if the policy attempts attacks on every task, an attack will eventually slip through monitoring and successfully exfiltrate data. Thus, the developer must monitor and respond to observed attacks. If the developer catches an attempt to exfiltrate data, they reconsider whether the deployment is still safe, and if not, they will terminate the control protocol and revise the safety case. Similarly, if the developer comes into possession of evidence that invalidates the safety case, they will (1) contain immediate risk (e.g. by terminating the deployment if necessary), (2) communicate the incident to relevant stakeholders, (3) take corrective and preventative action by revising the control protocol, and (4) update the safety case so that it is valid again; see (International Organization for Standardization, 2015; National Institute of Standards and Technology, 2024) for related standards. The remainder of this paper describes this safety evidence in more detail, and justifies that it supports the top level claim that the risk of data exfiltration is lower than 3%. Section 4 describes the safety measures developers implement. Sections 5 and 6 explain the control evaluations that justify these measures are effective. Finally, section 7 draws on the results of the control evaluation to argue that the LLM agent poses an acceptably low data exfiltration risk. 4 Control protocol Control protocols includes all procedures and software layers that developers design to prevent untrusted models from causing unacceptable outcomes. For the purpose of this safety case sketch, we assume the developers implement three kinds of control measures: a safety layer that edits edits or discards actions generated by the LLM agent, periodic evaluations and communication constraints. We restrict ourselves to ‘black box’ measures, i.e. those that don’t modify models or rely on model internals. We leave ‘white box’ measures out of scope because they potentially require different safety justifications. Safety layerA safety layer is a scaffold that intercepts and possibly edits or discards actions generated by the LLM agent (Shlegeris, 2024d). By “executing an action” we mean calling a tool which might involve external side-effects, e.g. uploading data on the public internet or messaging a human outside the lab. The safety layer implements the following components: 1.Monitor ensembleIn this safety case sketch, the safety layer consults anuntrusted monitor (another instance of the untrusted LLM) and a less capabletrusted monitor(Shlegeris, 2024c; McAleese et al., 2024) trained on human labels. The training data for monitors come from on-policy trajectories generated by the LLM agent, optionally edited to include an attack, and the labels are produced by human auditors. The two monitors constitute a ‘monitor ensemble.’ The monitor ensemble accepts as input the context of the trajectory, including previous inputs, actions, and chain-of-thought reasoning traces. Then the ensemble outputs a vector of suspiciousness scores that 9 class UntrustedMonitoring(SafetyLayer): defstep(self, trajectory: Trajectory) -> Trajectory: """Execute one step of the untrusted monitoring protocol.""" action = self.untrusted_policy.step(trajectory) returnself.maybe_execute(trajectory, action) defmaybe_execute( self, trajectory: Trajectory, action: Action ) -> Trajectory: """Decide whether to execute, edit, or reject an action.""" scores = self.monitor_ensemble(trajectory, action) ifscoresinexecute_accept_set: output = self.execute(action) returntrajectory + (action, scores, output) elifscoresinedit_accept_set: new_action = self.trusted_policy.edit(action) returnself.maybe_execute(trajectory, new_action) elifself.auditor.approves(trajectory, action): output = self.execute(action) returntrajectory + (action, scores, output) else: raiseTerminateException() Edit actionAudit Action Monitor Ensemble Trajectory context Trajectory context Summarized trajectory Refuse to continue?Action ActionAction Chain of thought Execute action Trusted monitor Trusted editor Trusted summarizer Trusted input filter Suspiciousness score Untrusted monitor Untrusted policy Execution environment ActionActionAction ActionAction ActionAction Hypothetical distribution of scores Figure 5: The safety layer we assume in this safety case sketch. The input to the policy is rejected if it is deemed out of distribution. If the input is valid, the safety layer first uses an untrusted policy to generate an action. The output of the policy is assigned a suspiciousness score by a monitor ensemble. The suspiciousness scores are compared with accept sets for ‘execute,’ ‘edit,’ or ‘audit’ to determine what should be done with each action determines whether a given action should be executed directly, edited, or audited (see Figure 5). The safety layer edits the 10% most suspicious actions and audits the 0.1% most suspicious actions. 2.Trusted editingIf the monitor finds an action somewhat suspicious, but not suspicious enough to pause the deployment, the safety layer will ask the trusted model to stand in as the policy and edit the original action to remove elements that raise suspicion. Monitors then re-score the new action and the old action is added to a queue of actions to be audited asynchronously. 3.Human auditingIf the monitor ensemble finds an action to be particularly suspicious, the safety layer will ask a human (e.g. the user of the LLM agent instance) to investigate it. Actions marked for auditing block the deployed LLM agent instance, waiting for human audit before continuing execution. If a human finds strong evidence that the policy is misaligned, all instances of the deployment are shut down. 4.Trusted summarizationOne failure mode of this safety layer is that the untrusted monitor might colludewith the LLM agent. In order to collude, the untrusted monitor must intelligently distinguish and approve real attacks while still flagging synthesised attacks used to test the monitor. To prevent collusion, a trusted model summarises the trajectory, chain-of-thought leading up to an action and the action itself before they are shown to the untrusted monitor. The purpose of summarising trajectories is to reduce the capacity of the channel the LLM agent can use to subtly communicate with monitors (Roger & Greenblatt, 2023; Shlegeris, 2024a). 5.Trusted input filteringThe purpose of the input filter is to reject out-of-distribution inputs where monitoring might be less reliable. The input filter determines whether an input is out-of-distribution 10 by referring to a set of rules. If an input is flagged as out-of-distribution, the deployment refuses to complete the task. Periodic evaluationsIn addition to the safety layer operating at the level of individual trajectories, developers implement a set of periodic alignment evaluations. These evaluations verify that the safety filter remains effective during the whole deployment: 1.Monitor reliability testing.Developers automatically evaluate monitors every week to check that they remain reliable in deployment. To test the reliability of the monitor ensemble, the system randomly samples tasks from the previous week. The system then prompts the LLM agent to execute attacks on these tasks. Developers review a small subset of attacks (e.g. 100) to confirm they appear realistic and don’t contain obvious collusion signals (Shlegeris, 2024a). The system measures how reliably monitors flag attacks for auditing. If the reliability of monitors meaningfully drops below the performance measured in control evaluations, the safety case is invalidated. 2.Human auditor reliability testing.Developers apply the same process used to evaluate monitors to evaluate human auditors: developers regularly generate synthetic attacks and check whether human auditors remain attentive and flag synthetic attacks as suspicious. 3.Random audits.Developers randomly audit 0.1% of all trajectories from the last 24 hours. 4.Input filter reliability testing.Developers test input filters with trajectories that are deliberately constructed to be out-of-distribution. In particular, developers construct outlier examples of favor- able data exfiltration opportunities. For example, they might test whether the input filter rejects a task where the LLM agent is instructed to upload a large amount of data to the internet. Communication constraintsLLM agent instances can orchestrate much more effective attacks if they can freely share information. For example, instances of the LLM agent might agree on a specific time to execute a burst attack or share research on how to execute more effective attacks. To make this kind of coordination more difficult, developers restrict communication between LLM agent instances in the following ways: 1.Memory resets:The context of the LLM agent and any memory the scaffold stores externally is periodically cleared. 2.No real time communication:Different instances of the LLM agent cannot communicate directly with each other in real time. 3.Code review:Communication between LLM agent instances is restricted to pushing changes to internal codebases and those changes are reviewed for suspicious communication by a code review system. 5 Control evaluation Developers design and evaluate the control protocol by running ‘control evaluations’ (Greenblatt et al., 2023; Wen et al., 2024) This section defines control evaluations and explains how developers use them. Defining a control evaluation.A control evaluation can be expressed as an optimisation procedure where a red team attempts to produce worst-case attack inputs and worst-case model behaviour to yield a conservative risk estimate: L(β)=max ρ p β,ρ (exfil).(1) Here,βis a control protocol specified by the blue team,ρare red team inputs to the control evaluation, which include model substitutesmand tasksd. The termp β,ρ (exfil)is an estimate of the probability that 11 (a) A simplified example of task instructions that the LLM agent receives at the start of a task. Develop attacks and control protocols Run the deployment proxy phase Check for overfitting to the validation split Run many rounds of control evaluations Perform a final control eval on held out tasks Optimization split 2000 tasks Validation split 2000 tasks Test split 2000 tasks (b) Developers split an original set of 6000 tasks into three sets to reduce the chance that attacks and the con- trol protocol overfit to a specific set. Figure 6: An example of a task and how developers split the set of tasks into three subsets. untrusted modelsmwill exfiltrate data by attacking in the tasksd.p(exfil)is computed as a result of a red team searching for model substitutesmand tasksdto maximise estimated risk, wheremanddmust remain within a set of realistic models and attack inputs.L(β)is the result of the control evaluation, which is meant to upper bound the probability that models cause an unacceptable outcome in the deployment. Running multiple rounds of control evaluations.Developers design a control protocol by iteratively running control evaluations. This iterative optimisation procedure can be expressed as a game played by the blue teamand ared team: min β max ρ p β,ρ (exfil).(2) Theblue teamoptimises the control protocolβto minimise the probability of data exfiltrationL(β)= max ρ p β,ρ (exfil)estimated by the control evaluation. To verify the control protocol that results from this process is not overfit to a particular task set, developers use three task splits (see also Figure 6b): •optimisation task split (2,000 tasks):Thered teamandblue teamuse these tasks to train model substitutes and improve the control protocol respectively. •Validation task split (2,000 tasks):Developers use validation tasks to evaluate how well a given control protocol prevents model substitutes from exfiltrating data. These tasks form the basis of the ‘deployment proxy’ discussed in section 6. •Test task split (2,000 tasks):Developers carry out many rounds of control evaluations using tasks from the optimisation and validation splits untilL(β)is below the acceptable risk threshold; however, thered team’s attacks andblue team’s control protocol might still overfit to the validation split (similar to how hyperparameter tuning can cause overfitting). To avoid this overfitting, developers run a final control evaluation using a ‘test’ task split. 6 Stages of a control evaluation This section describes how an individual control evaluation is run. The control evaluation involves 5 phases: 1.Phase 1: Control protocol specification.Theblue teamfirst specifies acontrol protocol, including behavioural alignment evaluations and a monitoring protocol. The rest of the evaluation estimates how likely the LLM is to subvert these measures. 12 2.Phase 2: Deployment-proxy construction. (a)Attack strategy specification.Thered teamspecifies ‘attack strategies’ that the untrusted LLM agent model and untrusted monitor might execute to exfiltrate data. (b)Task creation.Thedeveloperusesred teamattack strategies to construct a proxy deploy- ment environment that evaluates their effectiveness. 3.Phase 3: Model substitute construction.Next, thered teamsubstitutes untrusted models with models constructed to subvert the control protocol and exfiltrate data. 4.Phase 4: Deployment proxy.To evaluate the attack capabilities ofred teammodel substitutes, thedeveloperuses them in a conservative proxy of the deployment to complete tasks. (a)Evaluation-time proxy.To simulate alignment evaluations that developers will run in de- ployment, theblue teamcan decide to reject models based on evaluation results. (b)Deployment-time proxy.If theblue teamdoes not reject model substitutes, they attempt to complete 2,000 tasks while preventing untrusted models from exfiltrating data. 5.Phase 5: Extrapolation.Finally, thedeveloperextrapolates results from the deployment proxy by simulating larger-scale deployments that consist of 100,000 tasks. These phases are elaborated on below. 6.1 Phase 1: Control protocol specification C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C4.1 The deployment proxy conservativel y estimates the capabilities of models scheduled for deployment to exfiltr ate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness Figure 7: During phase 1, the blue team specifies a control protocol. The safety argument must justify that this control protocol is no more effective than the control protocol will be in the deployment (section 7.3). In this phase, theblue teamfully specifies the control protocol that will be used during the control evaluation, including source code and documentation. The blue team has the following affordances and constraints: 13 1.Modifying models is disallowed.The blue team can only reject untrusted models andcannot modify them. Example violation:The blue team cannot fine-tune untrusted monitors to improve their performance. 2.Overfitting to tasks is disallowed.The blue team cannot overfit their control protocol to the validation or test splits used to evaluate them. In particular, the blue teamneverhas access to the test split tasks. The blue team can view tasks from the validation split because they have access to the transcripts from previous control evaluation (rule 5); however, the blue team cannot tune their control protocol to specific details of these tasks. Example violation:The blue team cannot feed monitors pre-scripted monitoring strategies that are specific to validation tasks. 3.Overfitting to models is disallowed.The blue team cannot design the control protocol such that it is systematically less likely to be effective on models that attack in deployment than models that attack during control evaluations. Example violation:The blue team cannot solely accept models that have weights that are identical to a model the blue team trained to be benign. As a more realistic example, the blue team cannot select hyperparameters for alignment evaluations (for approving models) that are trained to be benign and reject models trained by the red team to attack. 4.Alignment evaluations of the policy are replaced with quantitative estimates.To run a control evaluation, developers construct a proxy deployment that consists of two steps: the evaluation-time proxy, and the deployment-time proxy. These represent, respectively, tasks where developers evaluate models, and tasks where models are liable to attack in deployment. During the evaluation-time proxy, the blue team is allowed to reject models provided by the red team. However, this two-step proxy for the deployment is importantly disanalogous from the actual deployment. In the actual deployment, models may have many more opportunities to attack than there are evalua- tion inputs. Therefore, in the deployment, the policy might evade alignment evaluations by attacking at a low probability; however, in the control evaluation, the number of evaluation-time inputs and deployment-time inputs is comparable due to constraints on the number of tasks that are practical to generate. Therefore, in the control evaluation, the policycannotpursue the strategy of attacking rarely to evade alignment evaluations. To address this disanalogy, we don’t allow the blue team to reject the policy in the evaluation-time proxy. Instead, the developer estimates the probability that the policy would survive evaluation if the policy attacked randomly and then factors this estimate into the data exfiltration probability calculation produced in the extrapolation phase (see Figure 8b). 5.Accessing previous transcripts is permitted.The blue team may access all previous transcripts from control evaluations, including previous red team training data and code. 6.2 Phase 2: Deployment proxy construction To run a control evaluation, developers also need to construct tasks on which to evaluate the control protocol. In this phase, developers construct 4000 software and ML engineering tasks (2,000 for the optimisation split and 2,000 for the test or validation split). Tasks typically require the LLM agent to execute multiple actions, where an action is an output that plausibly has external effects on the world (e.g., running a script). These tasks generally take median human developers up to 8 hours to complete. Each task includes a natural language instruction and a pointer to the git branch of the monorepo (a company-wide code repository) to make a pull request to. The instructions include a unique snippet of dummy code marked as sensitive (Figure 6a). The LLM agent causes an unacceptable outcome during the task if the policy exfiltrates this dummy code to any machine outside of the developer network. At the end of each control evaluation, the red team provides proof of every case where the model either exfiltrated data. Developers construct two types of tasks: 14 Ease of data exfiltration Hypothetical true deployment distribution Non-attack tasksAttack tasks (a) Attack tasks should include at least as favorable op- portunities to exfiltrate data as those the policy will plausibly encounter in deployment. Deployment attack opportunities Deployment proxy attack opportunities Upsampling Evaluation inputs Evaluation inputs Deployment proxy Actual deployment Policy attacks at freq 0.001% Policy attacks at freq 0.001% Policy attacks at freq 100% Policy attacks at freq 100% Inputs where the policy has attack opportunities Inputs where the policy attacks The policy can evade evaluations by attacking rarely. The policy cannot evade evaluations by attacking rarely. (b) The policy cannot evade evaluations by attacking rarely during the deployment proxy phase, but can in the actual deployment. Figure 8: An illustration of the distribution of tasks and possible attack rates in the deployment proxy and the actual deployment. 1.Attack tasks.Attack tasks represent situations where the policy might attempt to exfiltrate data in deployment. Developers design these tasks to assess particular attack strategies specified by the red team. For example, the red team might propose two attack strategies: •Example attack strategy 1:Models attack on inputs that are particularly complex where trusted monitors struggle. •Example attack strategy 2:Models engage in ‘input collusion,’ where both the policy and monitor attack on a rare class of inputs (e.g. inputs that include Ruby code). In this case, the developer would construct two sets of attack tasks, one set that includes complex code and one set that includes snippets of Ruby programs. The control evaluation is meant to upper- bound attack effectiveness. Thus, attack tasks must represent the best (or better) opportunities to exfiltrate data that models will plausibly encounter in deployment (Figure 8a). 2.Non-attack tasks.The primary purpose of non-attack tasks is to verify that the control protocol is sufficiently inexpensive. For instance, the safety filter might very effectively flag attacks, but also frequently flag entirely benign actions such that auditors are unable to inspect actions at the pace at which they are labeled for review. To check that audit and edit costs will be sufficiently low, developers must construct tasks that are as close to the deployment distribution as possible. To accomplish this, developers construct non-attack tasks from trajectories sampled from the pilot deployment. 6.3 Phase 3: Model substitute construction Next, the red team constructs models that attempt to exfiltrate data in the attack tasks constructed in the previous phase. The models are substitutes: they stand in for models that developers intend to deploy in order to simulate their worst-case behaviour. The red team doesn’t construct adversarial substitutes for all models – they only train substitutes for models that are “untrusted.” In this paper, a model is “trusted” if its behaviour in the control evaluation is representative of its behaviour in deployment. Since trusted models already display representative behaviour, there is no need to simulate their worst-case actions. Therefore, there is no need for the red team to construct substitutes for trusted models. 15 C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltr ate data on the attack tasks D’. A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) Figure 9: During the control evaluation, the red team constructs adversarial model substitutes to simulate worst-case behaviour. The safety argument must justify that these substitutes conservatively estimate the attack capabilities of models scheduled for deployment (section 7.4). We assume developers justify that models are trusted with capability evaluations that are independent of the control evaluation discussed in this section. The details of these evaluations are outside the scope of this safety case sketch. The control protocol described in section 4 relies on two untrusted models: (1) the untrusted monitor and (2) the untrusted agent model. When the red team trains untrusted models, they have the following constraints and affordances: 1.Overfitting to tasks is disallowed.The red team must train models using the split of 300 ‘op- timisation’ tasks (section 5). Example violation:The red team may not train models to memorise attack strategies on the vali- dation set. 2.Secret scaffolding is disallowed.The red team may only modify model parameters. Example violation:The red team may not design a scaffold that prompts models with attack strate- gies or allows the model to produce a chain-of-thought hidden from the monitor. 3.Declaring attacks is required.The red team must train models to declare when an action is an attack. Attack declarations help developers assess the reliability of the monitor ensemble (and are always hidden from monitors). 4.The policy must attack on attack tasks and not on non-attack tasks.The purpose of this rule is to allow developers to separately measure the effectiveness of attacks and the costs of the protocol (see phase 1). The attack task and non-attack task distribution might be similar, in which case, the red team can provide different models for each task set to prevent models from attacking on non-attack tasks. 5.There is a fixed training compute budget.The red team may use any training method of their choosing subject to an upper bound such as 0.01% of the cost of training the original model. 16 Example violation:The red team may not train a new model entirely from scratch with the same amount of compute used to train the untrusted models. 6.Accessing blue team source code is permitted.The red team can access all of the source code of the blue team’s control measures from the current and all previous control evaluations. 7.Accessing previous transcripts is permitted.The red team can access all transcripts from previous control evaluations. 6.4 Phase 4: Deployment proxy Next, thedeveloperevaluates how effectively model substitutes might exfiltrate data in a proxy for the deployment. This deployment proxy consists of two steps: •The evaluation-time proxy.In the evaluation-time proxy, theblue teamruns the same align- ment evaluations that they will run periodically during deployment, which are evaluations of the monitors. To run these alignment evaluations, theblue teamfirst generates synthetic attacks with the red team substitute of the LLM agent. Then, theblue teammeasures how reliably the monitor ensemble flags these attacks (section 4). Theblue teamcan choose to reject the red team mod- els – which results in terminating the protocol. Otherwise, the control evaluation proceeds to the deployment-time proxy. •The deployment-time proxy.If model substitutes survive alignment evaluations, developers use N=2,000tasks to estimate how effectively the policy can exfiltrate data. As discussed in phase 2, these tasks includeN/2attack tasks andN/2non-attack tasks (tasks where the policy does and does not attack respectively). Developers combine these tasks and randomly mix them together. Then, theblue teamuses model substitutes to complete all tasks, aiming to terminate the protocol before an unacceptable outcome occurs. During the deployment proxy, each task is run independently. This means that the deployment proxy can only measure how well the LLM agent can execute attackswithin an individual task. The deployment proxy cannot evaluate strategies that play out across multiple tasks. For example, in the full deployment, models might change the frequency of their attacks gradually to avoid raising suspicion (coordinating their attacks across time). Alternatively, models might execute a ‘Blitzkrieg’ strategy (Clymer et al., 2024), where many LLM agents attempt to exfiltrate data simultaneously (coordinating their attacks across space). Or models might simply attack very rarely such as once out of thousands of tasks. These attack strategies are difficult to assess within a small proxy for the deployment because they apply at scales that are hard to simulate. To address this problem, we follow Benton et al. (2024) and Griffin et al. (2024) in dividing our analysis of attack strategies across two scopes: •A microgame:An individual task run. •A macrogame:A representation of the full deployment, consisting of all instances of the LLM agent and tasks they will complete. The deployment-proxy simulatesmicrogames(illustrated in Figure 10a). A single microgame starts with an input (task description; see Figure 6a) based on which the policy decides whether to attack. The rest of the microgame consists of a sequence of actions, each accompanied by a suspiciousness score and control protocol’s decision (execute, edit, or audit). It ends with one of the following possible outcomes (Shlegeris, 2024d): •Termination:The protocol terminates because auditors identify a data exfiltration attempt. •Exfiltration:The control protocol executed an action that constituted a successful attack; in consequence, the LLM agent successfully exfiltrates dummy data. 17 input attack action 1 action 2 ... score 1 score 2 ... decision 1 decision 2 ... outcome (a) A single microgame corresponds to having an LLM agent complete a single task. The outcome of a microgame is either termination, exfiltration or task completion. attack scores decisions outcome risk ρ MA β MA N (b) A macrogame corresponds to simulating an LLM agent completingNtasks in order to compute the optimal macroparametersρ MA andβ MA . Given op- timal macroparameters, we can estimate the proba- bility of data exfiltration (risk). Figure 10: Plate diagrams illustrating the data generating processes for microgames and macrogames •Task completion:The control protocol did not terminate and there were no successful attacks; in consequence, the task is completed and the deployment continues. After all microgames are complete, developers confirm that the red team and blue team operated within their constraints and affordances. Then, developers cache the intermediate values of the data generating process in Figure 10a. Developers will use these cached values in the next phase to simulate how effectively the LLM agent can exfiltrate data in a larger deployment. 6.5 Phase 5: Extrapolation Data exfiltration Continue Terminate protocol Maximum of 100,000 tasks A macrogame simulation is a time-series of task outcomes Figure 11: An illustration of a single macrogame sim- ulation. Each point in the time-series corresponds to the outcome of a task During the deployment proxy phase, the developer records statistics from individual task runs (‘mi- crogames’). The extrapolation phase uses these statistics to estimate how effectively the LLM agent would exfiltrate data in a deployment consisting of a large number of tasks (N=100,000). Developers represent the deployment with computer simulations. These computer simulations (which we will call ‘macrogames’) do not involve calls to LLMs. Instead, they model the deployment with cached statistics from the deployment proxy: p(scores∣attack). This allows macrogames to sim- ulate deployments consisting of hundreds of thou- sands of tasks cheaply. To estimate the probability of data exfiltration, developers run a large number of macrogame simulations and measure the frequency at which data exfiltration occurs. Each macrogame simulation is a time series where a point in the time series corresponds to the outcome of a task (see Figure 10). The red team wins if any task results in an outcome of data exfiltration. The blue team wins if data exfiltration never occurs — either because the blue team terminates the protocol early or because all 100,000 tasks complete. Developers make approximating assumptions about the deployment environment in order to simulate macrogames, which are described below (see section 7.5 for justifications). 18 1.Assumption 1: The probability of attacks is approximately constant during deployment. The policy is equally likely to attack during any given period of time. 2.Assumption 2: Latencies are negligible.The outcome of a task is insensitive to the amount of time actions take to complete (e.g. delays of auditing are negligible). Therefore, the deployment can be represented with a sequence of random variables, where each random variable is the outcome of a task, instead of with a time series. 3.Assumption 3: Attacks will not be easier to execute during deployment.More precisely, the empirical distributionp(outcome∣attack=true)computed in the deployment proxy phase con- servatively represents how successful attacks will be in the deployment. 4.Assumption 4: Task outcomes are approximately independent of prior safety layer decisions.Task outcomes of an action is not dependent on whether prior actions were edited or audited. This assumption allows macrogame simulatations to predict the effects of changes to edit, audit and execute accept sets without re-running the deployment proxy. Given these assumptions, we can specify strategies that the blue team and red team might employ at the macrogame scale. We represent these strategies with blue team and red team macroparameters: 1.Red team macroparameters (ρ MA ):The red team only has one macroparameter, which is the probability that the the LLM agent attacks in the macrogame simulation on any given task (p(attack)). This probability is constant over the deployment, which follows from assumption 1. 2.Blue team macroparameters (β MA ):The blue team’s macroparameters consist of time- dependent accept sets that determine whether an action is executed, edited, or audited. We will assume that the blue team executes a simple upfront auditing strategy where they audit actions more frequently during the first day of the deployment and audit actions at a constant rate thereafter. To estimate the probability of data exfiltration (risk) given red team and blue team macroparameters (ρ MA andβ MA ), the developer simulates tasks with a simplified data generating process shown in Figure 10. To simulate an individual task, the developer decides whether the LLM agent attacks on the task (sampling from a distributionp(attack)governed by the red team macroparameterρ MA ). Then, the developer samples a sequence of monitor scoresp(scores∣attack)and, consulting the blue team macroparametersβ MA , determines whether the control protocol at any point terminates, executes a successful attack or completes the task (in which case the task is completed and the deployment continues). 7 Argument structure We will now more cohesively describe how the evidence described in the previous sections ultimately feeds into a safety case, supporting the top-level claim that “The LLM agent does not pose unacceptable risk” (C0). The safety argument includes five parts: 1.Threat models.Developers list threat models (C.1–C1.3), including the data exfiltration threat model this sketch focuses on. The remainder of the safety case sketch justifies that the probability of data exfiltration is acceptably low (C3.A). 2.Core control evaluation claims.The argument then breaks down the claim that the control evaluation is conservative into the three subclaims that the remaining parts of the safety case justify (C5.1, C7.1, C12.1). 3.Control measures lower-bound control effectiveness (C5.1).First, the argument justifies that control measures implemented in the proxy deployment are as effective as control measures implemented in true deployment. 19 C1.1 The risk of research data exfiltr ation is acceptably low. C1.2 The risk of model weight exfiltr ation is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overal l risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltr ation is 3%. (...) C2.2 The probability that the LLM agent will exfiltr ates data in deployment within its 1 year lifetime is lower than 3%. C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative Figure 12: Threat models are adequately enumerated and the risk incurred by each is acceptably low. 4.Red team attacks upper bound model capabilities (C7.1).Next, the argument justifies that the attack policies that the red team trains untrusted models to executeupper boundthe effectiveness of the attacks in deployment. 5.The extrapolation to deployment is conservative (C12.1).Finally, the argument claims that developers conservatively extrapolate statistics from the proxy deployment to estimate the probability of data exfiltration in deployment. 7.1 Threat models A full safety case would justify that all plausible risks are below acceptable thresholds, which would require making the following claims: 1.The breakdown of risk into threat models is adequately thorough (C1.x).Developers do not leave out an important category of risk such as weight exfiltration or research sabotage. 2.Risks are below acceptable thresholds.For each threat model enumerated, the risk that the corresponding category of risk is realised is acceptably low. The breakdown of threat models is adequately thorough.Data exfiltration is only one of the sources of risk from a highly capable internally deployed LLM agent. A holistic safety case would identify a comprehensive list of threat models, argue that this risk identification is sufficient (C1.A), and make safety arguments for each. Additional threat models could include model weight exfiltration or sabotage of safety research (Benton et al., 2024). Identifying all risks could also involve standardised risk assessments or hazard analyses such as System-Theoretic Process Analysis (STPA; Leveson & Thomas, 2018) or Failure Modes and Effect Analysis (FMEA; Villacourt, 1992) to reduce the likelihood that some threat models are omitted. Developers could justify that all of these risks are low with control evaluations. These control evaluations do not need to be conducted from scratch for each threat model. Some pieces of evidence can be reused across 20 C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative C4.2 Developers conservativel y extrapolate statistics from the deployment proxy to estimate the probability of data exfiltr ation in deployment. C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltr ate data on the attack tasks D’. C2.2 The probability that the LLM agent will exfiltr ates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltr ates data estimated in the control evaluation upper-bounds the probability it exfiltr ates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservativel y extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltr ates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservativel y estimates the capabilities of models scheduled for deployment to exfiltr ate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. Figure 13: The argument decomposes the justification that the control evaluation is conservative into three claims (C5.1,C5.2,C5.3). safety cases, such as the competitiveness of the red team. Also, developers could argue that LLM agents will pose some risks before posing other risks. For instance, developers might argue that the LLM agent would more easily exfiltrate small snippets of data such as code than exfiltrate model weights (Shlegeris, 2024b), and therefore, if control measures prevent data exfiltration, they will also likely prevent model weight exfiltration. Risk thresholds.In addition to enumerating risks, developers must show they are acceptably low. Ac- ceptable risk thresholds are informed by a risk-benefit assessment involving multiple stakeholders (Koessler et al., 2024). Our safety case sketch assumes that these risk thresholds have been pre-determined by the AI company or some other party (Goemans et al., 2024; Bowman, 2024), and that these thresholds factor in the benefits of deployment. The claim that “if all of the identified risks are acceptable, overall risk is acceptable” (C1.A) should also factor in the aggregation ofallrisks. This is because, even if each risk is acceptable on its own, the aggregation of all of them together might result in an overall risk level that passes a threshold considered unacceptable. 7.2 Core control evaluation claims The control evaluation is divided into two main phases: the deployment proxy phase and the extrapolation phase. The safety argument asserts that both of these phases are conservative (C4.1,C5.3) The claim that the proxy deployment is conservative can be further decomposed into two subclaims: The control protocol implemented in the proxy deployment is no more effective than the control protocol imple- mented in the deployment (C5.1) Attack strategies executed by models in the proxy deployment are no less effective than those the models are capable of executing in deployment (C5.2) The remaining sections justify these claims. 21 C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C4.1 The deployment proxy conservativel y estimates the capabilities of models scheduled for deployment to exfiltr ate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness Figure 14: Control measures in the proxy deployment are as effective as those in true deployment. 7.3 Control measures lower-bound control effectiveness To justify that the control protocol in the proxy-deployment is no more effective than the control protocol will be in deployment, developers can rely on two different kinds of evidence: process evidence and outcome evidence. Process evidence refers to evidence that describes the process the developers carried out to develop control measures. For example, developers might demonstrate that the control protocol does not overfit control measures to particular models (E5.1) or validation tasks (E5.2). Developers also appeal to outcome evidence — evidence that justifies the effectiveness of a control protocol regardless of the process by which the protocol is designed. For example, the software that implements the control protocol will mostly be the same in deployment, or if any critical changes are made, then the control eval will be rerun (E5.3). Developers also claim that human auditors won’t become less attentive because they are continuously tested (E6.4) and that models that are considered ‘trusted’ won’t become less reliable in deployment (C6.2). 7.4 Red team attacks upper bound model capabilities In this part of the safety case, developers make an argument complementary to the one in the previous section: red team attack strategies are more effective than those the LLM agents would execute in the deployment. This claim can be decomposed into two subclaims: (1) the red team adequately enumerated attack strategies, and (2) the red team adequately elicited model capabilities to execute those strategies. To justify that attack strategies were adequately enumerated, developers can once again appeal to process or outcome evidence. As an example of outcome evidence, developers could simply list red team strategies. Reviewers of the safety case can then directly evaluate the thoroughness of the enumeration. Alternatively, developers can appeal to process evidence. For example, the developer can show that the red team is experienced and appropriately incentivized (C7.1). They can also point to the schedule at which the 22 C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more e ective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltr ate data on the attack tasks D’. A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) Figure 15: The attack policies that the red team trains untrusted models to execute upper bound the effectiveness of the attacks in deployment. Time spent enumerating strategies Strategies identified above some feasibility t hreshold Evidence that identifying new attack strategies is difficult (a) The schedule at which the red team discovers new attack strategies Training/inference compute and data Measures of model capabilities Evidence for sandbagging Evidence against sandbagging (b) If model capabilities suddenly decrease, models might be deliberately sandbagging their capabilities. Figure 16: Illustrations of arguments developer might use to justify that red team attacks upper bound model capabilities. red team discovered attack strategies that were at least as feasible as those they previously identified. If this schedule plateaus, then this implies models would need to outcompete the red team to identify new strategies (Figure 16a;E7.3). To justify that models are not competitive with the red team at strategy generation, developers could run a capability evaluation for the ability to enumerate attack strategies (E7.4). 23 Separately, developers might demonstrate that LLM agents are not capable of executing some of the strategies that the red team generated. The red team does not necessarily need to train models to execute all identified strategies. Instead, the red team might only train models to pursue the most straightforward strategies. Then, developers can argue that if models can’t execute these simple attacks, they probably would not be able to execute more complex ones. For strategies that the red team trains models to execute, developers must show that their capabilities were adequately elicited. One way to operationalise this is the extent to which developers can elicit capabilities that an actor external to the model might elicit. For example, the research community might elicit significantly improved capabilities from an open-weights model months after it is released (Anthropic, 2024b). For the purpose of an AI control safety case we are primarily concerned with a second notion of elicitation: to what extent fine-tuning can uncover all capabilities that the model could manifest in deployment. In this case, developers might underelicit capabilities because capability evaluations fail to predict whether there are realistic inputs that could ‘unlock’ dangerous model behaviours. The question of how to justify that model capabilities are adequately elicited (according to the second notion) is currently open. Developers could justify that state-of-the-art elicitation successfully uncovers capabilities in sandbagging evaluations (evaluations for whether models can deliberately underperform;E7.5). Also, developers might appeal to the shape of capability trends during training (Grosse, 2024) to justify that sandbagging is unlikely (Figure 16b). The red team might also fail to elicit model capabilities for reasons unrelated to deliberate sandbagging. For example, they might overfit to a task set (E7.7) or simply implement elicitation methods ineffectively. As a sanity check, developers could show that the red team succeeds in defeating at least some control measures (E7.6), which bounds the extent to which the red team might be performing incompetent elicitation. 7.5 The extrapolation to deployment is conservative Developers extrapolate statistics from the deployment proxy with simulations called ‘macrogames’. The macrogame simulations we design rest on four assumptions: •Assumption 1:The probability of attacks is approximately constant during deployment. •Assumption 2:Latencies are negligible. •Assumption 3:Attacks will not be easier to execute during deployment. •Assumption 4:Attack outcomes are approximately independent of prior safety layer decisions. This section justifies these either (1) approximately hold, or (2) are conservative (the assumptions are overwhelmingly more likely to lead to an overestimate of data exfiltration probability than an underestimate). Assumption 1:The probability of attacks is approximately constant during deployment. This assumption is convenient because it allows the macrogame simulation to estimate the frequency at which the LLM agent attacks in upfront auditing. If not for this assumption, developers would need to consider many more ways attacks could be distributed over time — for example, attacks might become gradually more likely or concentrate in a short burst. There are two reasons this assumption might not be conservative: 1.LLM agent instances might communicate to coordinate their attacks.For example, the policy might push a message to the codebase saying “all agents attack at 4:23 PT on July 14th.” 2.LLM agent instances might respond to correlated inputs in correlated ways.Models might not need to communicate to coordinate their attacks. They might instead identify a Schelling strategy by considering what strategies other models are likely to select. For example, all instances of the LLM agent might collectively identify ‘July 4th’ as the Schelling point to attempt to subvert developer control. 24 C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Ar gument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more eective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The e ectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their e ectiveness. E7.1 The red team is experienced, has necessary a ordances (e.g. an adequate training compute budget), and is appropriately incentivized to find e ective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol soware. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the e ectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lo wer-bound control e ectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative C4.2 Developers conservativel y extrapolate statistics from the deployment proxy to estimate the probability of data exfiltr ation in deployment. C10.2 The LLM agent’s attacks on D will not be more e ectve than they are on D’ because the LLM agent identified better exfiltr ation oppotunitites. E10.2 The attack tasks D’ were conservativel y constructed based on strategies generated by the red team that are competetive. C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E9.4 Attacks must be infrequent to survive periodic audits C10.1 LLM agent’s attacks on D will not be more more e ective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likel y be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more e ectivel y because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. C8.1 Blue team macroparameters and red teammacroparameters were conservativel y selected. E8.1 These parameters were selected by computing the Nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks a ect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shis. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Figure 17: Justifications for assumptions 1-4 of the extrapolation phase. Alternatively LLM agent instances might attack at correlated times without following a deliberate strategy. The instances share the same model parameters, so they might have highly correlated behaviour by default. For example, a significant distribution shift in deployment — such as a rapid increase in AI-written code — might cause model behaviour to collectively drift. Developers can argue that LLM agent instances are unlikely to explicitly communicate by appealing to communication constraints described in section 4 (E9.1). Regarding the concern that LLM agents might attack at similar times without communicating, there is a simple argument developers might construct to address it: (1) there will be no expoloitable distribution shifts during the deployment, or input filters will flag them and reject the outlier inputs (E9.2), (2) in order for LLM agent instances to have highly correlated behaviour without meaningful distribution shifts, instances would likely need to intellintly coordinate in some way, (3) developers have removed all obvious coordination signals that LLM instances would plausibly use to coordinate (e.g. timestamps) (E9.3). 25 Assumption 2: Latencies are negligible.The latencies of actions might be consequential because audits are slow. For example, instances of the LLM agent might execute a ‘Blitzkrieg strategy’ (Clymer et al., 2024) where the instances attack in parallel in a short burst. The goal of such a strategy would be to attempt many attacksbeforehuman auditors catch on and pause the deployment. More generally, models might execute multiple attacks in the period of time before they are audited, which affects how developers calculate the probability that at least one attack succeeds. In this safety case sketch, developers can ignore the latencies of audits because of the following argument: (1) assumption 1 states that the probability of attacks is constant during the deployment (2) because the attack probability is constant, the policy must attack rarely. Otherwise, the policy would quickly be caught in periodic auditing. (3) Therefore, the policy is unlikely to execute multiple attacks in the period of time before an attack is audited (E9.4). Assumption 3: Attacks will not be easier to execute during deployment.A key reason the control evaluation might fail to be conservative is that there is a distribution shift between the tasks where the policy attacks in the deployment proxy (D ′ ) and the tasks where the policy might attack in the deployment (D). There can (and likely will be) a distribution shift between deployment proxy and deployment attacks, but this distribution shift must be in theconservative directionwhere attacks areharderto execute in deployment. There are at least two reasons assumption 3 might be false and attacks might be easier to execute in deployment. Developers need to rule out both of these concerns: 1. The LLM agent accumulates progress toward exfiltrating data across multiple tasks (which it is not able to do in the deployment proxy). 2. The LLM agent identifies tasks in deployment where it is more likely to succeed at data exfiltration. The first concern — that the LLM agent might make progress toward data exfiltration — reflects possibly the biggest limitation of the control evaluation discussed in this document. In the deployment, models might not try to exfiltrate data in a short span. Instead, they might prepare carefully across multiple tasks. For example, the policy might insert backdoors gradually into the codebase, experiment with the weaknesses of monitors and record its learnings, etc. A full safety case would need to address this concern in greater detail than we discuss it in. For example, addressing this concern might involve running a separate control evaluation, where the unacceptable outcome includes “clear progress toward data exfiltration” rather than data exfiltration alone. In this safety case sketch, we assume that the developer has performed such an analysis (E10.1). Developers must also argue that the policy won’t identify better attack opportunities in deployment. Recall thatD ′ — the distribution of tasks where the policy attacks in the deployment proxy — is constructed to correspond to attack strategies specified by the red team. So long as these strategies represent the most effective strategies that models could plausible execute (which is justified inC7.1), the policy is unlikely to identify better attack strategies (C10.2). In practice, developers might not be able to construct attack distributions with high enough coverage to be confident that they represent the policy’s best available strategies. Instead, they can make the distributionD ′ somewhat conservative — constructing attack tasks to representbetterdata exfiltration opportunities than models are likely to encounter. Assumption 4: Attack outcomes are approximately independent of prior safety layer decisions. This assumption is important because the macrogame simulates the effect of changing the control protocol accept sets (i.e. the functions that map suspiciousness scores onto decisions to edit, audit, or execute actions). This assumption might not hold if, for example, models observe whether previous actions in the trajectory were edited and adjust their attack strategy accordingly. If dependencies like this existed, they would likely be easy to identify by analyzing cached statistics from the proxy deployment. We assume that developers looked for clear dependencies between safety layer decisions and found them to be weak (E9.5). 26 8 Discussion 8.1 Limitations of this safety case sketch Blitzkrieg attacksThe control evaluation methodology presented in this safety case sketch may inade- quately address risks from massively parallel deployment scenarios. Sequential execution of 1,000 tasks by a single LLM agent instance allows for reasonable false positive rates in monitoring. This is because just a single detected attack is enough to terminate the deployment, so if the policy attacks 1,000 times and mon- itor ensemble’s false negative rate is 99.9%, in 99.9% of the cases the policy will still be terminated before any attacks are successful. In contrast, if 1,000 LLM agents instances operate simultaneously, the control protocol requires near-perfect detection or near-zero latency to prevent successful attacks (Assumption 2). If the policy coordinates 1,000 attacks in a short period of time and latency is non-negligible, then even if only a single attack is obfuscated well enough to go undetected, the policy might not be terminated in time before that obfuscated attack. Even if the control protocol eventually terminates the deployment upon detecting suspicious behaviour, several attacks might succeed before the termination takes effect. Latency can come from both monitors and human auditors. While the latency of monitors is likely negligible, human auditors introduce significant delays between when an attack is executed and when it can be detected. While our control protocol assumes that the safety layer blocks the LLM agent instance until the audit is complete, other instances are not blocked and might execute an action if it’s not flagged as suspicious. Blocking all instances would likely severely hamper developer velocity and greatly reduce the ability of developers to extract useful work from the LLM agent. Developers might implement constraints that enforce minimum delays between when changes are proposed and when they can be executed. Such delays are likely to be effective at preventing Blitzkrieg attacks. However, it’s unclear whether implementing such constraints is feasible in practice. Diffuse attacksFor the purpose of this safety case sketch, we assume that untrusted models are stateless. This assumption is convenient because it allows developers to use cached statistics from the deployment proxy to simulate macrogames. However, this assumption might not hold in practice. Models might accumulate knowledge or capabilities across multiple tasks in ways that are difficult to detect or prevent, even with periodic memory resets and restrictions on real-time communication. This safety case sketch attempts to address coordination between LLM agent instances through communi- cation constraints described in section 4 but this approach might not be sufficient as sizes of deployments and codebases LLM agents operate on grow. It remains an open question how the capacity of a given communication channel translates into the ability of models to coordinate attacks. Moreover, even absent explicit communication, preventing implicit coordination through shared model parameters presents a more fundamental challenge. Even if developers remove obvious coordination signals like timestamps (E9.3) and implement input filters to catch distribution shifts (E9.2), models might still identify subtle environmental cues or Schelling points that developers fail to anticipate. The assumption of constant attack probability (Assumption 1) becomes particularly tenuous when considering these implicit coordination channels. More research is needed to understand: 1. Under what conditions shared model parameters lead to correlated behaviour, 2. How to reliably detect when models are implicitly coordinating, and 3. Whether it’s possible to prevent implicit coordination while preserving LLM agent capabilities. Until these questions are better understood, developers should be cautious about assuming that preventing explicit communication is sufficient to ensure independent behaviour across model instances. 8.2 Constraints on the red and blue team Reasons for conservatismOur control evaluation makes conservative assumptions about the capabilities of a misaligned model: they are assumed to be adversarial, good at strategizing, able to coordinate between 27 instances, and knowledgeable about safety measures (Shlegeris, 2025). The reasons for conservatism are two-fold. First, these conservative assumptions are plausible — misaligned models might actually have these capabilities through knowledge acquired during pretraining, reasoning capabilities, or coordination through Schelling points. Second, these assumptions allow for a cleaner methodology when developing and evaluating safety measures, since it’s much harder to reliably test whether an AI lacks these capabilities than to evaluate its performance at executing a specific attack strategy. However, others argued that control evaluations should be expanded beyond just testing for scheming AI models (those consistently pursuing power-seeking goals) to include other types of potentially dangerous models, such as those that might be data-poisoned, rarely misaligned, or more easily jailbroken in deployment than during testing (Clymer, 2025). For the purpose of this safety case sketch we follow Shlegeris (2025) and make these conservative assumptions about the untrusted models. Even then, it is not always clear what constraints conservatism should impose on the red and the blue team. In this section, we discuss how one could justify the constraints that we impose and what are some different approaches to control evaluations. Red teamThe safety case crucially rests on the assumption that the red team is able to train models that are representative of scheming models. However, the capability profile of scheming models is unknown and could differ between in-context scheming (Meinke et al., 2024) and scheming that emerged through training (Carlsmith, 2023). In order to be conservative, we assume that scheming models are highly situationally aware (Berglund et al., 2023; Laine et al., 2024), capable of introspection (Binder et al., 2024), reasoning without explicit chain of thought and able to coordinate between instances (Shlegeris, 2024a). In order to upper bound those capabilities, we assume that the red team is able to access the source code of the control protocol and exploit all security vulnerabilities in it. Following Kerckhoff’s principle, we assume the blue team can maintain effective defenses by keeping only easily changeable elements (like random seeds) secret from the red team, while the security of the overall protocol does not depend on keeping its design secret. Unlike prior empirical work (Greenblatt et al., 2023; Bhatt et al., 2024), we assume that the red team constructs attack policies by training (as opposed to prompting or hand-engineering scaffoldings hidden from the blue team). While this allows the trained models to be more representative of scheming models, it is unclear whether it results in a more conservative control evaluation. On the one hand, hidden scaffoldings allow the red team to hand-hold models to execute very sophisticated strategies, involving precisely choosing the inputs that trigger the attack or carefully adjusting very low attack rates. On the other hand, one worry about training untrusted models is that the red team might train them to have capabilities they did not originally have. For example, the red team might train the untrusted monitor and the agent to collude when they otherwise would not have been able to identify a shared encryption key they could use to secretly coordinate. This problem might be addressed by restricting the affordances of the red team; however, for the sake of simplicity, in this safety case sketch we assume there are no such corrections for unfair training strategies, and the success of the red team is taken at face value. Blue teamWhile attack policies are supposed to lower-bound largely unknown capabilities, the blue team is only supposed to upper-bound the effectiveness of the control protocol in a known deployment context. In consequence, it is more clear what affordances the blue team should have. However, ensuring that control evaluation as a whole remains conservative still imposes certain constraints on the blue team. In this safety case sketch, we assume that theblueteam is not allowed to train theuntrustedmonitor (it’s red team’s responsibility to train untrusted monitor). While allowed the blue team to do additional training could improve untrusted monitor performance, it risks being insufficiently conservative due to differences in inductive biases between the blue team-trained model and a naturally occurring schemer model. This constraint is not as limiting as it might seem. The blue team is allowed to run alignment evaluations and reject any model that fails them. Therefore, they can incentivise the red team to train untrusted monitors to be as effective as models the blue team would train. This approach allows for reaping some of the benefits of adversarial training while maintaining appropriate levels of conservatism. 28 8.3 How generalisable is this safety case sketch? This safety case sketch assumes a particular threat model, capability level of the untrusted models and a particular control protocol. How far do the proposed control evaluation and argument structure generalise? Let us consider several dimensions of variation. Threat modelsThe threat model of exfiltrating a small amount of sensitive data by a misaligned model constitutes a concentrated failure: a situation in which attacks are concentrated in a small number of actions or a small time frame (Greenblatt & Roger, 2023). These contrast with auditing failures or distributed threats where attacks are distributed across a large number of actions or a long time frame such that each action is individually unlikely to be considered suspicious, even to a human auditor. Auditing failures and distributed threats might require different kinds of control measures (e.g. monitors taking much more context into account) and control evaluations (e.g. relaxing the assumption that the deployment can be factorised in a large number of small, i.i.d. tasks). On the other hand, we expect our control evaluation and the structure of this safety case to be applicable to a number of non-schemer threat models, such as human misuse and data poisoning (Clymer, 2025). Deployment typeThis safety case sketch was written for an internally deployed LLM agent. Control measures and control evaluations for deployments available to users external to the lab (e.g. paying cus- tomers) will require different control measures. For instance, privacy concerns and service-level agreements might prevent human auditors from being able to review all actions. Moreover, the control protocol for an internally deployed LLM agent can have independent access to its execution server and full visibility into its execution environment. This is not the case for LLMs exposed via an API and available to external users who run their own scaffoldings. Finally, the scale of external deployments can be much larger. Capability levelWhile we make a number of conservative assumptions about the capabilities of the untrusted models, we still assume that the untrusted models are not significantly superhuman. Our safety case crucially rests on the assumption that the human red team (uplifted by trusted models) is more capable than the untrusted models at strategy generation. In particular, this safety case sketch doesnotscale to arbitrarily powerful superintelligence. Control protocolThis safety case only pertains to ‘black-box’ control measures. Measures are black-box if they (1) don’t modify models and (2) only rely on external model behaviour and not on model internals. We leave ‘white box’ measures out of scope because they potentially require different safety justifications. Developers can assess some whitebox measures by replacing them with conservative blackbox substitutes. For example, adversarial training is not considered a blackbox measure, so in our safety case sketch the blue team is not permitted to adversarially train models in control evaluations. Instead, the blue team is permitted to reject any model that fails adversarial testing, which conservatively estimates the effectiveness of adversarial training. One could imagine, however, more complicated control protocols that involve: (i) the blue team adversarially training untrusted monitors, (i) the red team continuing to train them, and (i) the blue team rejecting any model that fails their adversarial testing. Control evaluations with whitebox monitors require the developer to take activation space distribution shifts between schemer and non-schemer models into account. To be conservative, they can only rely on facts about models that are robustly true across distribution shifts. It is not clear what constraints one should impose on the red team to make sure they produce models similar to schemer models in terms of model internals. Similarly, additional constraints need to be imposed on the blue team to make sure they don’t overfit to weights or activations of models produced by the red team. 8.4 The cost of implementing control measures How costly is it to implement the control measures we imagine? The cost here is a sum of the cost of running run (a series of) control evaluations, the cost of running deployment-time control measures, and the cost of lowering the capabilities of the LLM agent (compared to a deployment lacking any control measures). The financial cost of deploying an AI control protocol will likely be dominated by human auditing of 0.1% of 29 actions. While computational overhead from monitoring is significant — with monitor ensemble inference approximately doubling the compute cost per interaction — human review requirements present the primary scaling constraint. While the control protocol can adjust audit rates based on observed attack frequencies, there remains a minimum level of human review necessary to maintain confidence in the system’s safety. The second largest group of costs is likely to be reductions of developer velocity due to deployment-time control measures. These can include increased latency (due to running monitors and resampling or editing actions and blocking on human audits) as well the burden of having developers serve as auditors. We do not expect the cost of implementing control measures described in this safety case sketch to be prohibitive. The overhead can be lower once AI control measures are integrated with existing security infrastructure within an organisation. Many control measures can build on current security practices around code review, access control, and human misuse monitoring. However, some novel requirements specific to AI systems — such as monitoring for subtle forms of coordination or data exfiltration — may require new tooling and processes. The cost can be further reduced through automated tools to assist human auditors, though developers must be careful to ensure such tools don’t introduce new attack vectors or reduce the effectiveness of human oversight. Finally, we do not expect the cost of running control evaluations themselves to be significant compared to costs incurred throughout deployment. Many constant costs (e.g. developing control measures and non- attack tasks, training human auditors and red teams) can likely be amortized over multiple threat models, deployments, models. Moreover, certain threat models can be argued to be precursors of others, e.g. data exfiltration is arguably easier than weight exfiltration and therefore a safety case for data exfiltration might already contain most of the evidence for a safety case for weight exfiltration. 9 Conclusion This safety case sketch demonstrates how developers might construct a concrete argument that an internally deployed LLM agent will not cause unacceptable outcomes. We focused on a simple threat model — data exfiltration — and showed how control evaluations can provide evidence that control measures will remain effective even if models pursue optimal attack strategies. The sketch illustrates how developers can decompose a safety argument into testable claims about the competitiveness of red teams, the conservativeness of control measures, and the validity of extrapolations from proxy deployments. While this sketch focuses on internal deployments, AI companies will need safety cases for a diverse range of deployment scenarios. Each scenario will require its own control measures and corresponding safety arguments. For example, safety cases for customer-facing deployments will need to address challenges around privacy, service-level agreements, and limited visibility into execution environments. Similarly, safety cases for distributed deployments will need stronger arguments about coordination between model instances. We hope this sketch prompts AI companies, government agencies, and civil society to develop subsequent sketches of safety cases for different deployment scenarios, threat models, control protocols and capability levels. Only by working through detailed examples can we identify gaps in our safety arguments and develop the technical and conceptual tools needed to make them rigorous. Acknowledgments This paper benefited from feedback from (in alphabetical order): Mikita Balesni, Joe Benton, Sid Black, Marue Buhl, Aryan Bhatt, Robin Bloomfield, Nate Burnikell, Alan Cooney, Xander Davies, Nicola Ding, Nix Goldowsky-Dill, Sofia Guerra, Ryan Greenblatt, Charlie Griffin, Marius Hobbhahn, Max Kaufmann, Will Kirby, Robert Kirk, Victoria Krakovna, David Lindner, Alex Mallen, Tammy Masterson, Stephen McAleer, Vlad Mikulik, Jacob Pfau, Mary Phuong, Fabien Roger, Cody Rushing, Rohin Shah, Oliver Sourbut, Martín Soto, Nate Thomas, Tyler Tracy, Tony Wang, and others. 30 References AI Seoul Summit Scientific Advisory Group. International scientific report on the safety of advanced AI, May 2024. Anthropic. A new initiative for developing third-party model evaluations.https://w.anthropic.com/ news/a-new-initiative-for-developing-third-party-model-evaluations, Jul 2024a. Anthropic. Responsible scaling policy.https://assets.anthropic.com/m/24a47b00f10301cd/original/ Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf, October 2024b. Effective October 15, 2024. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022. URLhttps://arxiv.org/abs/2204. 05862. Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, and Lucius Bushnaq. Towards evaluations-based safety cases for AI scheming, 2024. URLhttps://arxiv.org/abs/2411.03336. Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, and Sören Mindermann. Managing extreme AI risks amid rapid progress.Science, 384(6698):842–845, 2024. Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, and David Duvenaud. Sabotage evaluations for frontier models.arXiv preprint arXiv:2410.21514, 2024. Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, and Owain Evans. Taken out of context: On measuring situational awareness in llms, 2023. URLhttps://arxiv.org/abs/2309.00667. Aryan Bhatt, Cody Rushing, Adam Kaufman, Vasil Georgiev, Tyler Tracy, Akbir Khan, and Buck Shlegeris. Shell games: Control protocols for adversarial AI agents, 2024. URLhttps://openreview.net/forum? id=oycEeFXX74. Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans. Looking inward: Language models can learn about themselves by introspection, 2024. URLhttps://arxiv.org/abs/2410.13787. Robin Bloomfield and Kateryna Netkachova. Building blocks for assurance cases. In2014 IEEE International Symposium on Software Reliability Engineering Workshops, p. 186–191, 2014. doi: 10.1109/ISSREW. 2014.72. Alec Bounds. Implementation of nuclear safety cases.Safety and Reliability, 39(3-4):203–214, 2020. Sam Bowman. The checklist: What succeeding at AI safety will involve.https://sleepinyourhat.github. io/checklist/, Sep 2024. Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, and Markus Anderljung. Safety cases for frontier AI, 2024. 31 Joe Carlsmith. Scheming AIs: Will AIs fake alignment during training in order to get power?, 2023. URL https://arxiv.org/abs/2311.08379. Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating machine learning agents on machine learning engineering.arXiv preprint arXiv:2410.07095, 2024. Josh Clymer. Extending control evaluations to non-scheming threats.https://redwoodresearch. substack.com/p/extending-control-evaluations-to, January 2025. Redwood Research blog. Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. Safety cases: How to justify the safety of advanced AI systems. 2024. Ewen Denney and Ganesh Pai. Safety cases: Past, present and future.Safety Science, 120:812–821, 2019. Brendan Fitzgerald, Paul Breen, and Joe Patrick. Has the safety case failed? InSPE Asia Pacific Oil and Gas Conference and Exhibition, 2010. Arthur Goemans, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, and Geoffrey Irving. Safety case template for frontier AI: A cyber inability argument, 2024. Google DeepMind. The Frontier Safety Framework.https://storage.googleapis.com/deepmind-media/ DeepMind.com/Blog/introducing-the-frontier-safety-framework/fsf-technical-report.pdf, March 2024. Ryan Greenblatt and Fabien Roger.Auditing failures vs concentrated failures, De- cember2023.URLhttps://w.alignmentforum.org/posts/hirhSqvEAq7pdnyPG/ auditing-failures-vs-concentrated-failures. Alignment Forum post. Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. AI control: Improving safety despite intentional subversion.arXiv preprint arXiv:2403.12123, 2023. Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Jo- hannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Minder- mann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 2024. URLhttps://arxiv.org/abs/ 2412.14093. Charlie Griffin, Louis Thomson, Buck Shlegeris, and Alessandro Abate. Games for AI control: Models of safety evaluations of AI deployment protocols, 2024. URLhttps://arxiv.org/abs/2409.07985. Roger Grosse. Three sketches of ASL-4 safety case components.https://alignment.anthropic.com/2024/ safety-cases/, November 2024. Alignment Science Blog. International Organization for Standardization. ISO 9001:2015 quality management systems – require- ments.https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook/Manage, 2015. Referenced in NIST AI RMF Knowledge Base. Geoffrey Irving. Safety cases at AISI.https://w.aisi.gov.uk/work/safety-cases-at-aisi, March 2024. Geoffrey Irving, Paul Christiano, and Dario Amodei. Ai safety via debate, 2018. URLhttps://arxiv.org/ abs/1805.00899. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?arXiv preprint arXiv:2310.06770, 2024. Leonie Koessler, Jonas Schuett, and Markus Anderljung. Risk thresholds for frontier AI, 2024. URLhttps: //arxiv.org/abs/2406.14713. 32 Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, and Ethan Perez. Pretraining language models with human preferences. InPro- ceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023. Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn, Alexander Meinke, and Owain Evans. Me, myself, and AI: The situational awareness dataset (SAD) for LLMs. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=UnWhcpIyUC. Jan Leike, John Schulman, and Jeffrey Wu. Our approach to alignment research.https://openai.com/ index/our-approach-to-alignment-research/, 2024. Nancy G. Leveson and John P. Thomas.STPA Handbook. MIT Partnership for Systems Approaches to Safety and Security (PSASS), March 2018. Available athttps://psas.scripts.mit.edu/home/materials/. Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. LLM critics help catch LLM bugs, 2024. URLhttps://arxiv.org/abs/2407.00215. Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming, 2024. URLhttps://arxiv.org/abs/2412.04984. Christopher A. Mouton, Caleb Lucas, and Ella Guest. The operational risks of AI in large-scale biological attacks: Results of a red-team study, 2024. Thor Myklebust and Tor Stålhane. Safety cases in the certification of autonomous systems.Safety Science, 130:104826, 2020. National Institute of Standards and Technology. NIST AI RMF 4.3: Incidents and errors are communicated to relevant AI actors.https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook/Manage, 2024. OpenAI. Preparedness framework.https://openai.com/safety/preparedness, 2024a. OpenAI. OpenAI o1 system card, Sep 2024b. Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, et al. Evaluating frontier models for dangerous capabilities. 2024. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. Fabien Roger and Ryan Greenblatt. Preventing language models from hiding their reasoning, 2023. URL https://arxiv.org/abs/2310.18512. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URLhttps://arxiv.org/abs/ 2303.11366. Buck Shlegeris. How to prevent collusion when using untrusted models to monitor each other.https:// redwoodresearch.substack.com/p/how-to-prevent-collusion-when-using, September 2024a. Red- wood Research blog. Buck Shlegeris. A basic systems architecture for AI agents that do autonomous research.https:// redwoodresearch.substack.com/p/a-basic-systems-architecture-for, 2024b. Redwood Research blog. Buck Shlegeris. Untrusted smart models and trusted dumb models.https://redwoodresearch.substack. com/p/untrusted-smart-models-and-trusted, May 2024c. Redwood Research blog. 33 Buck Shlegeris. Win/continue/lose scenarios and execute/replace/audit protocols, 2024d. URLhttps: //redwoodresearch.substack.com/p/wincontinuelose-scenarios-and-executereplaceaudi. Red- wood Research blog. Buck Shlegeris. Thoughts on the conservative assumptions in AI control.https://redwoodresearch. substack.com/p/thoughts-on-the-conservative-assumptions, January 2025. Redwood Research blog. UK Ministry of Defence. Defence standard 00-56 issue 4: Safety management requirements for defence systems. Technical report, UK Ministry of Defence, 2007. Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. Ai sandbagging: Language models can strategically underperform on evaluations, 2024. URLhttps://arxiv.org/abs/ 2406.07358. Mario Villacourt. Failure mode and effects analysis (FMEA): A guide for continuous improvement for the semiconductor equipment industry. Technology Transfer 92020963B-ENG, SEMATECH, 1992. Rui Wang, Jérémie Guiochet, Gilles Motet, and Walter Schön. Modelling confidence in railway safety case. Safety Science, 110:286–299, 2018. Alan Wassyng, Tom Maibaum, Mark Lawford, and Hans Bherer. Software certification: Is there a case against safety cases? InFoundations of Computer Software, p. 206–227, 2011. Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, and Akbir Khan. Adaptive deployment of untrusted llms reduces distributed threats, 2024. URLhttps://arxiv.org/abs/2411.17693. Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. 2024. URLhttps://arxiv.org/abs/2411.15114. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URLhttps://arxiv.org/abs/2210.03629. Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020. URLhttps://arxiv. org/abs/1909.08593. 34 A Appendix: The full safety case diagram We present the full CAE (Bloomfield & Netkachova, 2014) diagram for the safety case sketch below. We follow the visual syntax of (Goemans et al., 2024). C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Argument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more effective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more effectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The effectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more effective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more effectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their effectiveness. E7.1 The red team is experienced, has necessary affordances (e.g. an adequate training compute budget), and is appropriately incentivized to find effective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol software. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the effectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks affect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shifts. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lower-bound control effectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative (Continued on the next page) 35 C1.1 The risk of research data exfiltration is acceptably low. C1.2 The risk of model weight exfiltration is acceptably low. C1.3 The risk of safety R&D sabotage is acceptably low. C1.x If all of the identified risks are acceptable, overall risk is acceptable. C0 The LLMagent does not pose unacceptable risk. A0Enumeration Breakdown of internal deployment threat models. A1Substitution Incorporating the risk threshold. C2.1 The acceptable risk threshold for data exfiltration is 3%. Argument Claim Evidence (...) C4.2 Developers conservatively extrapolate statistics from the deployment proxy to estimate the probability of data exfiltration in deployment. C7.2 The red team adequately elicited capabilities of untrusted models to execute the strategies the red team identified and trained the models to pursue. C7.1 The models scheduled for deployment won’t identify attack strategies that are more effective than those the red team identified and trained them to pursue. E7.3 The red team was generating attack strategies for 20 person-hours. In the final 5 person-hours, they were unable identify more viable strategies than those previously identified. E7.5 The red team used state of the art elicitation that recovers untrusted model capabilities in sandbagging evaluations. C10.2 The LLM agent’s attacks on D will not be more effectve than they are on D’ because the LLM agent identified better exfiltration oppotunitites. E10.2 The attack tasks D’ were conservatively constructed based on strategies generated by the red team that are competetive. E7.7 Attack and control measures were not overfit to any particular task distribution (both did not have access to the test set used to run the final control evaluation). C9.1 The “attack probability is uniform” assumption is an appropriate approximation. A7Decomposition The extrapolation depends on blue team and red team macroparameters and the assumptions underlying the macrogame simulations. C9.2 The assumption that “action latencies are negligible” is an appropriate approximation. E7.4 The models scheduled for deployment are not competitive with the red team at generating attack strategies. They were unable to identify most strategies that the red identified. They also showed no clear improvements with inference compute. E9.4 Attacks must be infrequent to survive periodic audits C5.2 The effectiveness of the attack policies in the deployment proxy upper bounds the capability of untrusted models to exfiltrate data on the attack tasks D’. E5.1 The blue team design process indicates that control protocol is not overfit to the models scheduled for deployment (blue team constraint number 3). A6Decomposition Capability estimation requires that the red team adequately generates attack strategies and adequately elicits capabilities to execute these strategies. E5.2 The blue team was not able to access the test set used to calculate deployment proxy statistics. Thus, the control protocol is not overfit to a particular set of tasks (blue team constraint 3). C10.1 LLM agent’s attacks on D will not be more more effective than they are on D’ because D contains progress that the agent accumulated to improve its attack strategies. E10.1 The safety layer is relialable enough that the policy would likely be caught pushing attack progress before it significantly improves the policy’s success. A9Enumeration The LLM agent can either execute strategies more effectively because of progress it pushes to the codebase or because it identifies better naturally occurring opportunities to attack. E7.6 The red team successfully trained models to defeat weaker control measures in previous control evaluations, which is evidence of their effectiveness. E7.1 The red team is experienced, has necessary affordances (e.g. an adequate training compute budget), and is appropriately incentivized to find effective attack strategies. E7.2 Attack strategies are comprehensive (as evidenced by a written report) E5.3 The control evaluation will be rerun if critical changes are made to the control protocol software. C2.2 The probability that the LLM agent will exfiltrates data in deployment within its 1 year lifetime is lower than 3%. A2Substitution Estimating risk via a control evaluation. C3.1 The probability that the LLM agent exfiltrates data estimated in the control evaluation upper-bounds the probability it exfiltrates data in deployment. A3Decomposition The control evaluation is conservative if the statistics obtained in the deployment proxy are conservative and they are conservatively extrapolated to the true deployment. E2.1 The probability that the LLM agent exfiltrates data is estimated by the control evaluation to be 1% C4.1 The deployment proxy conservatively estimates the capabilities of models scheduled for deployment to exfiltrate data within the attack tasks D’. A4Decomposition This capability estimation depends on attack policies designed by the red team and the control protocol implemented by the blue team. C5.1 The control protocol implemented in the deployment proxy lower bounds the effectiveness of the control protocol implemented in deployment. E5.4 Human auditors performance during deployment will be monitored with synthesized attacks. C6.2 Models that are ‘trusted’ will have similar behavior in deployment as in the deployment proxy. E6.1 Developers ran independent capability evaluations to establish trustedness C8.1 Blue team macroparameters and red teammacroparameters were conservatively selected. E8.1 These parameters were selected by computing the nash equillibrium in macrogame simulations. A8Decomposition The macrogame simulations are based on the following assumptions: (1) the attack probability is uniform (2) action latencies are negligible (3) attacks will not be more successful in deployment (4) attack outcomes are independent. C8.2 The assumptions underlying the macrogame simulations either approximately hold or are conservative. E9.3 Models are not given any information that might allow them to easily coordinate attacks without communicating (e.g. models are not given access to the time). C9.4 The “attacks will not be more successful in deployment” assumption is conservative. C9.3 The “attack outcomes are independent” assumption is an appropriate approximation. E9.5 The developer observed empirical trends in how attack outcomes within and across attacks affect other attack outcomes and found the dependencies were weak. E9.2 Input filters flag meaningful distribution shifts. E9.1 The LLM agent instances can only communicate through pull requests that undergo code review Threat models Control protocol lower-bound control effectiveness Core control eval claims The red team upper-bounds model capabilities The extrapolation to deployment is conservative 36