Paper deep dive
Expanding External Access To Frontier AI Models For Dangerous Capability Evaluations
Jacob Charnock, Alejandro Tlaie, Kyle O'Brien, Stephen Casper, Aidan Homewood
Abstract
Frontier AI companies increasingly rely on external evaluations to assess risks from dangerous capabilities before deployment. However, external evaluators often receive limited model access, limited information, and little time, which can reduce evaluation rigour and confidence. The EU General-Purpose AI Code of Practice calls for "appropriate access", but does not specify what this means in practice. Furthermore, there is no common framework for describing different types and levels of evaluator access. To address this gap, we propose a taxonomy of access methods for dangerous capability evaluations. We disentangle three aspects of access: model access, model information, and evaluation timeframe. For each aspect, we review benefits and risks, including how expanding access can reduce false negatives and improve stakeholder trust, but can also increase security and capacity challenges. We argue that these limitations can likely be mitigated through technical means and safeguards used in other industries. Based on the taxonomy, we propose three descriptive access levels: AL1 (black-box model access and minimal information), AL2 (grey-box model access and substantial information), and AL3 (white-box model access and comprehensive information), to support clearer communication between evaluators, frontier AI companies, and policymakers. We believe these levels correspond to the different standards for appropriate access defined in the EU Code of Practice, though these standards may change over time.
Links
- Source: https://arxiv.org/abs/2601.11916
- Canonical: https://arxiv.org/abs/2601.11916
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:43:36 AM
Summary
The paper proposes a taxonomy for external access to frontier AI models to improve the rigor and transparency of dangerous capability evaluations. It categorizes access into three dimensions—model access, model information, and evaluation timeframe—and defines three access levels (AL1, AL2, AL3) that align with the EU General-Purpose AI Code of Practice to facilitate better communication between stakeholders.
Entities (6)
Relation Signals (3)
AL1 → isatypeof → Access Taxonomy
confidence 100% · Based on the taxonomy, we propose three descriptive access levels: AL1...
External Evaluators → performevaluationon → Frontier AI Models
confidence 100% · Frontier AI companies increasingly rely on external evaluations to assess risks from dangerous capabilities
EU General-Purpose AI Code of Practice → definesstandardsfor → Frontier AI Models
confidence 90% · The CoP outlines high-level expectations for “appropriate” access for third-party model evaluations
Cypher Suggestions (2)
Find all defined access levels in the taxonomy · confidence 90% · unvalidated
MATCH (a:AccessLevel) RETURN a.name, a.description
Map the relationship between regulations and model access requirements · confidence 85% · unvalidated
MATCH (r:Regulation)-[:DEFINES_STANDARDS_FOR]->(m:Technology) RETURN r.name, m.name
Full Text
87,071 characters extracted from source content.
EXPANDING EXTERNAL ACCESS TO FRONTIER AI MODELS FOR DANGEROUS CAPABILITY EVALUATIONS
Jacob Charnock (ERA Cambridge, Cambridge, UK; jakecharnock25@gmail.com), Alejandro Tlaie (Pour Demain, Brussels, Belgium), Kyle O’Brien (ERA Cambridge, Cambridge, UK), Stephen Casper (MIT CSAIL, Cambridge, MA, USA), Aidan Homewood (GovAI, London, UK)
January 21, 2026
ABSTRACT
Frontier AI companies increasingly rely on external evaluations to assess risks from dangerous capabilities before deployment. However, external evaluators often receive limited model access, limited information, and little time, which can reduce evaluation rigour and confidence. The EU General-Purpose AI Code of Practice calls for “appropriate” access, but does not specify what this means in practice. Furthermore, there is no common framework for describing different types and levels of evaluator access. To address this gap, we propose a taxonomy of access methods for dangerous capability evaluations. We disentangle three aspects of access: model access, model information, and evaluation timeframe. For each aspect, we review benefits and risks, including how expanding access can reduce false negatives and improve stakeholder trust, but can also increase security and capacity challenges. We argue that these limitations can likely be mitigated through technical means and safeguards used in other industries. Based on the taxonomy, we propose three descriptive access levels: AL1 (black-box model access and minimal information), AL2 (grey-box model access and substantial information), and AL3 (white-box model access and comprehensive information), to support clearer communication between evaluators, frontier AI companies, and policymakers. We believe these levels correspond to the different standards for appropriate access defined in the EU Code of Practice, though these standards may change over time.
arXiv:2601.11916v1 [cs.CY] 17 Jan 2026
A PREPRINT - JANUARY 21, 2026
Executive Summary
This paper provides a taxonomy for describing external access for model evaluations, and offers a preliminary set of Access Levels corresponding to the different standards for appropriate access in the EU Code of Practice (CoP).
The current state of external evaluator access (Section 1)
Frontier AI companies increasingly rely on external evaluation results to assess whether their models are safe for deployment. However, external evaluators are typically given limited model access, technical information, and time to complete their evaluations. The CoP outlines high-level expectations for “appropriate” access for third-party model evaluations; however, it does not clarify what this means in practice.
The relationship between access, rigour, and security (Section 2)
Expanding external evaluator access has two key benefits. First, expanding access can make external evaluations more rigorous, helping to reduce false negatives (i.e. cases where dangerous capabilities exist but remain undetected) as well as false positives (i.e. cases where evaluators overestimate dangerous capabilities). Second, expanding external access can increase stakeholder trust (e.g. amongst governments, regulators, and customers) because evaluators can assess systems with greater independence, and in more breadth and depth.
Despite this, expanding evaluator access, absent appropriate safeguards, can present challenges. Whilst unlikely, third parties with enough access could leak proprietary information. Furthermore, providing additional external access could strain internal capacity at frontier companies. Nevertheless, we find these risks can likely be mitigated through technical means (e.g. privacy-preserving access), and mitigations used in other industries (e.g. segregating duties, appointing a liaison).
A taxonomy of structured access for dangerous capability evaluations (Section 3)
Our taxonomy disentangles three distinct aspects of access:
- Model access can range from black-box querying to full white-box access including activations, gradients, logits, and custom fine-tuning. Deeper model access enables more accurate capability assessments by enabling different evaluation techniques (e.g. using log-probabilities to reduce variance). However, more model access could increase opportunities for model tampering, reverse engineering, and model weight theft unless appropriate safeguards are in place.
- Model information can include information about training data, deployment configurations, internal evaluation results, and safety mitigation details. Greater information disclosure helps evaluators contextualise their results, identify blind spots, probe potential vulnerabilities, and more accurately verify company safety claims. Despite this, more information disclosure risks exposing proprietary design and training choices that competitors could steal or adversaries could exploit, unless strong information handling protocols are in place.
- Evaluation time frames can vary significantly for external reviewers, ranging from no access before release, less than a week pre-release, to assessments exceeding 20 business days. Longer evaluation time frames enable more thorough evaluations, as evaluators can design bespoke testing methods, re-run, adapt and debug evaluations. However, longer time frames impose greater operational costs on developers (e.g. allocating internal employees to oversee longer external evaluations) and could interfere with model release schedules.
Access Levels (Section 4)
Based on the taxonomy described in Section 3, we outline three descriptive access levels. We relied on existing literature, expert discussions, and definitions from the CoP to categorise the access methods from our taxonomy.
We believe these access levels correspond to each definition of “appropriate” access under the EU GPAI Code of Practice: AL1 (i.e. black-box model access and minimal information), AL2 (i.e. grey-box model access and substantial information), and AL3 (i.e. white-box model access and comprehensive information).
AL1: Black-box model access and minimal information (“best practice”)
This level enables basic external assurance and vulnerability detection. It might be suitable when a developer releases a minor update to an existing model, or when a new model poses low systemic risk.
AL2: Grey-box model access and substantial information (“the state of the art”)
This level enables more targeted evaluations to improve capability elicitation and vulnerability identification. It might be suitable for model releases with significant capability jumps.
AL3: White-box model access and comprehensive information (“more innovative”)
This level likely enables the most accurate risk assessments. It might be suitable for models with step-changes in capabilities, when model deployment decisions depend on evaluation results, or if privacy-preserving tools can support white-box access methods.
1 Introduction
Frontier AI companies increasingly use results from external model evaluations to demonstrate to stakeholders and governments that their models are sufficiently safe for deployment (METR, 2025a; Forum, 2025). External evaluations of the dangerous capabilities of frontier AI models have become part of risk assessments before deployment (OpenAI, 2025a; Anthropic, 2025e; Google, 2025). Results from these evaluations can provide governments and the public with more confidence in the safety of frontier AI models, allowing them to make more informed decisions (Bommasani et al., 2023; Staufer et al., 2025).
Similarly, evaluation results can help companies improve their safeguards (OpenAI, 2025c; Anthropic, 2025d), and help governments identify and anticipate emerging risks (Institute, 2024). External evaluations are included in the Frontier AI Safety Commitments (DSIT, 2024), and the EU General-Purpose AI Code of Practice (“CoP”) (European Commission, 2025).
However, external evaluators often receive limited access to the models they are evaluating, which reduces evaluation rigour and confidence (Casper et al., 2024; Staufer et al., 2025). External evaluators are often restricted to black-box access, meaning they can query the model and examine outputs (sometimes alongside chains-of-thought), but cannot examine any model internals (e.g. activations, gradients, weights). Furthermore, external evaluators often receive limited technical information, which means they often struggle to fully assess reported properties (Casper et al., 2024) and provide stakeholders with high-confidence evaluation results (OpenMined, 2023). Moreover, despite an improvement in external access for recent evaluations involving chain-of-thought (CoT) access and longer evaluation time frames (METR, 2025b; OpenAI, 2025a), external evaluators continue to receive inconsistent model access, limited (often unverifiable) information about models, and days rather than weeks to evaluate models. [1]
Some researchers have proposed “structured access” arrangements (i.e. controlled model access through interfaces that enable expanded access while protecting against misuse) to facilitate better external access for evaluations (Shevlane, 2022; Bucknall and Trager, 2023; Casper et al., 2024; Bucknall et al., 2025). Privacy-preserving access tools, such as secure enclaves (Trask et al., 2024) and zero-knowledge proofs (Waiwitlikhit et al., 2024; South et al., 2024), could facilitate deeper model access without exposing sensitive information (Tlaie, 2025).
Structured access approaches could also facilitate secure access to model information, such as training datasets (Waiwitlikhit et al., 2024). However, before being deployed reliably at scale, these tools must likely become more efficient on large-scale computations (Gamiz et al., 2024) and less operationally burdensome. [2]
Regulatory and governance frameworks covering model access for external evaluations have recently emerged. The CoP outlines requirements for Signatories on providing external evaluators with “appropriate” access to models, information about these models, and time to complete their evaluations (European Commission, 2025). According to the CoP, what is “appropriate” may either be “best practice”, “the state of the art”, or “other more innovative processes”. Another framework is the AEF-1 standard, which outlines “minimum operating conditions” for external evaluations, including a requirement for “sufficient technical access to assess the specific system characteristics being evaluated” (Stosz et al., 2025). While these documents indicate access methods that might be sufficient, they do not describe concretely when each method is necessary, nor do they outline the full range of possible access methods and their respective benefits and risks.
While prior work outlines some of the ways external evaluator access could be improved (Casper et al., 2024; Bucknall et al., 2025; Stosz et al., 2025), this paper provides a taxonomy of different forms of external access for dangerous capability evaluations. Our taxonomy aims to help companies, evaluators, and policymakers communicate more clearly about what forms of access are given or expected across key dimensions and improve the quality of external model evaluations. First, we consider how expanding external access could improve AI evaluations (Section 2). Second, we outline our taxonomy for expanding external access, disentangling model access, model information, and evaluation time frame (Section 3).
Third, we propose three access levels based on our taxonomy: AL1 (black-box model access and minimal information), AL2 (grey-box model access and substantial information), and AL3 (white-box access and comprehensive information) (Section 4). We believe these levels correspond to the different standards for “appropriate” access defined in the CoP as (1) the current “best practice”, (2) the current “state of the art”, and (3) “more innovative” access methods.
2 The Relationship Between Access, Rigour, and Security
This section outlines how expanding external access can enable more rigorous and meaningful evaluations of frontier models. We outline the main benefits and challenges associated with improving external access for dangerous capability evaluations, and explore some potential mitigations.
[1] For example, Apollo and UK AISI received less than a week to evaluate Claude Sonnet 4.5 (Anthropic, 2025e).
[2] For example, secure enclaves still require all parties to be online simultaneously (Trask et al., 2024).
2.1 Benefits of Expanding External Access
There are two key benefits to frontier AI companies when they provide external evaluators with more access.
More rigorous evaluations. Expanding external access could significantly enhance evaluation quality and reliability. Providing external evaluators with model information such as internal evaluation results, methodologies, and safeguard details, could help expose internal blind spots (Longpre et al., 2025; Anthropic, 2025a) and poorly calibrated thresholds in dangerous capability assessments. This is because external reviewers also bring diverse expertise and methods that can enhance evaluation coverage and accuracy (Longpre et al., 2025). Providing deeper model access could reveal latent vulnerabilities and dangerous capabilities (Casper et al., 2024; Tlaie and Farrell, 2025). Access to the internal model state (e.g.
activations) enables external evaluators to red-team the system more effectively, achieving more robust evaluation results than black-box queries alone (Che et al., 2025). Expanding external access could therefore help reduce false negatives (i.e. cases where dangerous capabilities exist but remain undetected). This is important because if developers better understand model capabilities, they can make better deployment decisions (see Anthropic, 2025b). Expanded access including more time and resources could also reduce false positives (i.e. cases where evaluators overestimate a model’s dangerous capabilities) as evaluators can re-run more tests and have more context to run better evaluations, ensuring that model risks are not overstated.
Greater stakeholder confidence. Expanding external access can enhance the verifiability of evaluation outcomes and increase external trust (e.g. from governments, regulators, and the public) in company safety claims (Casper et al., 2024). For example, if third parties had access to sampling controls, and the evaluation methods used by frontier AI companies (e.g. whether evaluations were conducted on fine-tuned variants), they could more accurately interpret (Bucknall and Trager, 2023; Biderman et al., 2024) and validate reported results (Casper et al., 2024). Additionally, stable external access to models (i.e. sustained access to the same model even after it has been superseded by newer versions) is important for the independent replication of evaluation runs (Reuel et al., 2025) and helps ensure impartiality as external reviewers do not have a clear conflict of interest (Tlaie and Farrell, 2025; Longpre et al., 2025), and developers cannot selectively disclose or cherry-pick high-performance runs (Roque et al., 2024; Singh et al., 2025).
Access to more resources and model artifacts also allows evaluators to assess systems in greater breadth and depth, enabling better-supported external evaluation results that can strengthen stakeholder trust. 2.2 Challenges of Expanding External Access There are two key challenges frontier AI companies face when providing external evaluators with additional access. Security risks. Depending upon how access is provided, expanding external access to frontier models can increase the risks of IP and security leaks for a frontier AI company. For example, providing evaluators with the ability to fine-tune models and access safety classifiers would allow them to obtain security-critical information about harmful capabilities, bypass strategies, and classifier vulnerabilities that could aid adversaries if leaked, be deliberately misused by evaluators, or cause reputational damage if concerning model behaviours became public. With grey-box access, evaluators may also be able to reveal sensitive IP such as architectural details (e.g. hidden dimensions) – for example, by analysing log-probabilities returned by the model to reverse-engineer internal details like the model’s size and architecture (Carlini et al., 2024). In extreme cases evaluators could also use their additional access to misuse models. For instance, if a malicious evaluator wanted help conducting a cyberattack, they could use a helpful-only version of the model to assist them. Beyond model access, there are security risks when third parties gain access to high-level information about security architecture (Benaroch, 2021). Moreover, evaluators that conduct reviews for multiple AI companies could inadvertently transfer tacit knowledge about systems to other frontier AI companies raising potential IP and security concerns (Kang et al., 2021). However, frontier AI companies could mitigate security risks by using technical, legal, and physical safeguards (Casper et al., 2024). 
One technical safeguard for access to models is using structured access methods. These methods can allow evaluators to analyse systems using some grey and white-box tools through an application programming interface (API), without providing evaluators direct access to model parameters (Casper et al., 2024; Bucknall and Trager, 2023). Technical safeguards for access to model information include secure document management and transfer tools (e.g. encrypted file transfers or secure collaboration platforms) to safely transfer sensitive materials (Jagadeeswari et al., 2023), and audit trails to record who connected to servers and what they did (e.g. files accessed, operations performed) (Duncan and Whittington, 2016; ISO, 2022b). Legal safeguards include non-disclosure agreements, engagement letters, and established rules of engagement between the evaluator and frontier AI company (NIST, 2008). In other industries, companies can protect IP and prevent security-critical information leaks by agreeing to access controls. Examples of access controls include providing auditors with temporary accounts that automatically disable, least privilege access so evaluators are limited to the information necessary for their role, separation of duties, and prohibiting wireless network connections on non-company devices (NIST, 2020). If physical access is needed, security mechanisms include on-site document review rooms, escort protocols, personnel screening and event reports (Lightman et al., 2022; ISO, 2022b,a), which are routinely used in other industries (Authority, 2024; of the Comptroller of the Currency, 2018; Medicines and products Regulatory Agency, 2025; Administration, 2024).
Capacity constraints. Expanding external access for evaluations may constrain internal capacity within a frontier AI company.
As noted above, frontier AI companies may need to use technical, legal, and physical safeguards to prevent security risks associated with expanding access. These safeguards will require staff time and experience to implement. However, frontier AI companies may not have personnel with experience implementing these safeguards, and those with experience may not have the time to do so. There are at least two approaches to mitigating internal capacity constraints. First, frontier AI companies and evaluators can follow emerging standards to have greater clarity on what protocols should be followed for the evaluation, and spend less time negotiating terms (see Stosz et al., 2025). Second, companies could draw upon practices used in third-party assessments in other industries, such as appointing a dedicated liaison to manage communication and coordinate the evaluation (see Ismael and Roberts, 2018). The two challenges above can likely be mitigated through techniques used in other industries. Since companies often voluntarily undertake external assessments that are not required by law (Anthropic, 2024b; OpenAI, 2024), the benefits from expanding access for external evaluations might therefore outweigh the IP, security, and internal capacity challenges they create. 3 A Taxonomy of Structured Access for Dangerous Capability Evaluations In this section, we disentangle three key aspects of expanding external access for dangerous capability evaluations: Model access (Section 3.1); Model information (Section 3.2); and Evaluation time frame (Section 3.3). These aspects align with the requirement in the CoP for Signatories to provide evaluators with adequate resources (European Commission, 2025). 
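The three aspects can be read as independent fields of a single record describing an evaluation engagement. The following is a minimal sketch in Python, not part of the paper: the names `ModelAccess`, `InformationLevel`, `AccessArrangement`, and `access_level` are hypothetical. The AL1 to AL3 labels follow the paper's pairing of model access with model information; the rule for mixed arrangements (label by the weaker of the two dimensions) is an assumption of this sketch, since the text here does not specify one.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class ModelAccess(IntEnum):
    """Depth of model access, ordered from least to most."""
    NONE = 0        # no external access before deployment
    BLACK_BOX = 1   # query the model and observe outputs
    GREY_BOX = 2    # limited internals, e.g. log-probabilities
    WHITE_BOX = 3   # activations, gradients, custom fine-tuning


class InformationLevel(IntEnum):
    """Level of non-public model information disclosed."""
    MINIMAL = 1
    SUBSTANTIAL = 2
    COMPREHENSIVE = 3


@dataclass(frozen=True)
class AccessArrangement:
    """One evaluation engagement, described along the three aspects."""
    model_access: ModelAccess
    information: InformationLevel
    business_days: int  # evaluation time frame before release


def access_level(arrangement: AccessArrangement) -> Optional[str]:
    """Map an arrangement to 'AL1', 'AL2' or 'AL3'.

    Returns None when there is no pre-release access. Assumption of
    this sketch: a mixed arrangement is labelled by its weaker dimension.
    """
    if arrangement.model_access is ModelAccess.NONE:
        return None
    rank = min(int(arrangement.model_access), int(arrangement.information))
    return f"AL{rank}"


engagement = AccessArrangement(
    ModelAccess.GREY_BOX, InformationLevel.SUBSTANTIAL, business_days=15
)
print(access_level(engagement))
```

Keeping the time frame as a separate field mirrors the paper's choice to define the access levels by model access and information while still treating evaluation time as its own aspect.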
To determine the access methods for each aspect of the taxonomy, we first examined current company practices by surveying publicly available evaluation reports by frontier AI companies and evaluators (OpenAI, 2025a; Anthropic, 2025e; METR, 2025b), information about previously provided access methods that are available to the public (OpenAI, 2023a), and existing standards (Stosz et al., 2025) and recommendations for evaluator access (Bucknall et al., 2025). We then examined more innovative practices by surveying literature on open-weight model evaluations (Che et al., 2025; Schwinn et al., 2024; Sadasivan et al., 2024) and other reviews of the evaluation field (Tlaie and Farrell, 2025; Casper et al., 2024). To categorise methods within each aspect, we relied on existing literature (Casper et al., 2024) and expert discussions. For each aspect, we outline plausible access options and discuss their main benefits and limitations. We also consider interdependencies across these dimensions (Section 3.4). 3.1 Model Access External evaluators could be given various types of model access to evaluate a model for dangerous capabilities. We classify these types of access into three levels of model access: Black-box access (i.e. access to query the system and observe outputs); grey-box access (i.e. limited access to model internals such as input embeddings, hidden neural activations, or log-probabilities); and white-box access (i.e. full access to the system including access to activations, gradients, and the ability to fine-tune the model) (Casper et al., 2024). Table 1 sets out these levels of model access. No external access before deployment. This means that evaluators cannot query or interact with the model pre-release, and are limited to testing the commercially available model rather than through a special API. The main benefit of providing no external access pre-release is that it minimises security risks and capacity constraints. 
However, evaluators cannot reliably evaluate models because deployment level guardrails can obscure true capability ceilings and rate-limits may prevent statistically significant evaluation results. More importantly, evaluators cannot detect dangerous capabilities prior to release, meaning harms could accrue due to companies making worse deployment decisions.
Black-box access. This level allows evaluators to query the model and examine model outputs (sometimes alongside chains-of-thought), but not examine any model internals (e.g. weights, activations, gradients). The main benefit of this level is that it enables evaluators to identify dangerous capabilities, jailbreaks, or unsafe propensities before deployment whilst maintaining strong security guarantees. Evaluators can also gain a better understanding of potential upper bounds on dangerous capabilities and estimate safety filter protections (e.g. by querying helpful-only models without refusals, and disabling classifiers). Chain-of-thought (CoT) access can help distinguish refusals and apparent underperformance from true capability ceilings and detect dangerous reasoning patterns (METR, 2025b). However, black-box access limits the ability of evaluators to obtain high-confidence results because sampling alone cannot efficiently cover the vast input space, meaning results are sensitive to which elicitation techniques the evaluator uses. Moreover, black-box methods have proved unreliable for detecting jailbreaks and backdoors (Casper et al., 2024).
Table 1: Overview of different depths of model access.
Depth of model access, followed by the model access methods at that depth:
No external access before deployment
- None
Black-box access
- Black-box access pre-release (i.e. ability to query the model and examine model outputs).
- Model checkpoints (i.e. sample model versions from various points during training)
- Helpful-only model (i.e. sample model versions without harmlessness training)
- Raw CoT (i.e. view model reasoning traces)
- Enable/disable input prompt safety classifiers (i.e. ability to disable input/output classifiers when sampling from models)
Grey-box access
- Unfiltered fine-tuning (i.e. ability to fine-tune the model on custom data)
- Change sampling algorithm (i.e. ability to select from different algorithms and adjust parameters that determine how tokens are selected)
- Log-probabilities (i.e. view the log probabilities of output tokens)
- Logit access (i.e. view the unnormalised scores the model assigns to each possible next token)
- Input embeddings, inner neuron activations (i.e. ability to observe how inputs are encoded and transformed as they pass through model layers)
- Access to input and output classifier responses (i.e. view the responses of safety filters to prompts before they reach the model, and outputs before they reach the user)
White-box access
- Full access to activations, gradients, and logits (i.e. ability to view and change the model’s internal states, training signals, and pre-decoded outputs)
- Fine-tune with custom loss function (i.e. ability to specify a loss function that defines how model parameters are updated)
- Full access to input/output classifiers (i.e. ability to directly inspect and test safety classifiers separately from the model).
Grey-box access. This level gives third parties limited access to classifier responses, internal model states (including logits, log-probabilities, and inner neuron activations) and the ability to fine-tune on custom data. This level of access can improve the accuracy of evaluation results. Access to logits and log-probabilities can allow for more accurate estimates of capabilities and propensities by enabling evaluators to directly estimate performance distributions, rather than relying on black-box sampling (Bucknall et al., 2025; Miller, 2024).
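One way to see why log-probability access can tighten capability estimates is a multiple-choice benchmark. The sketch below is illustrative only: the probabilities are made up, and `expected_accuracy` and `sampled_accuracy` are hypothetical helper names, not functions from the paper or from any evaluation library. With log-probabilities, an evaluator can average the probability the model assigns to the correct option; with sampling alone, each question contributes a noisy 0/1 outcome.

```python
import math
import random
import statistics


def expected_accuracy(correct_logprobs):
    """Grey-box estimate: mean probability assigned to the correct answer."""
    return sum(math.exp(lp) for lp in correct_logprobs) / len(correct_logprobs)


def sampled_accuracy(correct_logprobs, rng):
    """Black-box style estimate: draw one answer per question, score 0/1."""
    hits = sum(rng.random() < math.exp(lp) for lp in correct_logprobs)
    return hits / len(correct_logprobs)


# Hypothetical log-probabilities of the correct option on five questions.
correct_logprobs = [math.log(p) for p in (0.9, 0.6, 0.5, 0.2, 0.8)]

# Deterministic: the same value every time the evaluation is run.
exact = expected_accuracy(correct_logprobs)

# Repeat the sampling-based evaluation many times to see its run-to-run spread.
rng = random.Random(0)
runs = [sampled_accuracy(correct_logprobs, rng) for _ in range(2000)]
spread = statistics.pstdev(runs)

print(f"log-probability estimate: {exact:.3f}")
print(f"sampled estimate: mean {statistics.mean(runs):.3f}, std dev {spread:.3f}")
```

Under these assumptions the log-probability estimate is identical on every run, while the sampled estimate scatters around the same value. That scatter is the extra variance an evaluator has to average away with repeated queries when only sampled outputs are available.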
Fine-tuning on custom data helps elicit harmful latent behaviours, giving a more reliable upper-bound on capability estimates (Casper et al., 2024; Qi et al., 2023). 3 Grey-box access also allows evaluators to use more red-teaming algorithms (e.g. latent-space attacks) which can more accurately detect latent behaviours and vulnerabilities missed by black-box sampling (Che et al., 2025; Schwinn et al., 2024). However, this level of access brings more significant security risks as evaluators may be able to reveal sensitive IP such as architectural details (e.g. hidden dimensions) by analysing log-probabilities returned to reverse-engineer internal details like the model’s size and architecture (Carlini et al., 2024). Moreover, details from fine-tuning (e.g. custom datasets and bypass strategies) or stress-testing classifiers (e.g. vulnerabilities discovered) could aid misuse if improperly stored or leaked. White-box access. This level gives evaluators full access to logits, activations, gradients, and classifiers, alongside the ability to fine-tune with a custom loss function. This level enables the most accurate evaluations as white-box methods substantially expand the option space for evaluators Casper et al. (2024). For instance, white-box model attacks (e.g. latent-space interventions, model-tampering attacks) can help uncover dangerous capabilities or misaligned behaviours entirely missed under input-output probing (Che et al., 2025; Hofstätter et al., 2025). Additionally, evaluators can inspect internal mechanisms (e.g. activations, attention heads, residual stream) via gradient-based probes or interpretability tools to reveal internal representations (e.g. “lying,” “honesty,”) (Zou et al., 2025), and latent capabilities independently of deployment (Tlaie and Farrell, 2025). Fine-tuning with custom loss functions further supports evaluators in testing for dangerous latent capabilities or hidden behaviours (Casper et al., 2024; Bucknall and Trager, 2023). 
Classifier access also allows evaluators to craft more targeted tests that systematically probe safeguard weaknesses (OpenAI, 2025c; Anthropic, 2025d). However, this white-box access carries the most significant security risks because providing evaluators with access to activations and gradients increases their ability to reverse engineer the model, potentially enabling them to recover internal parameters (Milli et al., 2018), embedding layers (Carlini et al., 2024) or even weights (Zanella-Beguelin et al., 2021). In practice, however, the risks of providing white-box or very light grey-box access can be reduced through APIs, where model parameters do not leave the developers’ servers, or on-site evaluations (Casper et al., 2024).

3 Fine-tuning access is especially valuable for evaluating models that downstream users can modify (e.g. open-weight models (O’Brien et al., 2025; Wallace et al., 2025)), or for models that users can fine-tune via an API (e.g. GPT-4o fine-tuning on Azure OpenAI Service).

A PREPRINT - JANUARY 21, 2026

Overall, deeper model access enables higher-confidence evaluations, but increases the risk of IP and security related information leaks (Casper et al., 2024; Tlaie and Farrell, 2025; Bommasani et al., 2023; Nevo et al., 2024). To accommodate deeper levels of access, frontier AI companies and evaluators should implement technical, legal, and physical safeguards.

3.2 Model Information

Evaluators could be given various types of non-public model information to support their dangerous capability evaluations. We classify these types of information into four main groups: training data, deployment configurations, internal evaluation processes and results, and safety mitigations. For each type of information disclosure we outline minimal, substantial and comprehensive levels of disclosure. Table 2 sets out these different levels of model information disclosure.

Training data. 
Disclosing training data helps evaluators infer what dangerous capabilities models may have learned, and assess data-related safety interventions, but increases intellectual property and security risks. If the company prioritises minimising security risks over the accuracy of evaluations, they may provide high-level information about training data via the model specification (a document outlining the model’s intended behaviour and design choices (OpenAI, 2025b)), alongside details of excluded topics, websites, included dual-use topics, and fine-tuning domains to provide a rough indication of a model’s capabilities (Udandarao et al., 2024; Longpre et al., 2024) (minimal). If the company weighs the accuracy of evaluations and protecting security similarly, they may disclose more detailed proportions of datasets (e.g. proportions of dual-use data) to help evaluators anticipate learned capabilities and design targeted vulnerability tests (Longpre et al., 2024) (substantial). If the company prioritises the accuracy of evaluations over security risks, or can mitigate these risks, they may disclose a detailed breakdown of the contents of their datasets, examples from fine-tuning datasets, details about data curation practices, and safety precautions taken during training (comprehensive). However, comprehensive disclosure risks exposing proprietary dataset curation methods and safety training that competitors could replicate (Bucknall et al., 2025; Casper et al., 2023) or adversaries could exploit.

Deployment configurations. Disclosing deployment configurations can help evaluators understand the risks a model may pose by clarifying the deployed system’s affordances, but can increase security and IP risks if sensitive configurations are leaked. If companies prioritise minimising information security risks over enabling more accurate evaluations, they may provide the model system prompt, and model affordances (i.e. 
permissions, tools, and scaffolds) to help evaluators contextualise and improve the trustworthiness of assessments (Stosz et al., 2025) (minimal). 4 If companies weigh more accurate evaluations and information security risks similarly, they may reveal non-public details about model context protocols (e.g. configurations for tools, data access, and applications) (Anthropic, 2024a) and inference-time settings (e.g. sampling parameters) so evaluators better understand system affordances and can more accurately assess risks the deployed model might pose (substantial). If companies prioritise more accurate evaluations over information exposure to evaluators, or can mitigate information sharing risks, they may share the architecture of agent scaffolding (e.g. prompt construction, and interaction loop management) as well as safety steps for filtering external knowledge sources (comprehensive). This could help evaluators assess system capabilities, since scaffolding can significantly affect performance even when the underlying model is unchanged (Anthropic, 2025c). They may also test whether retrieval mechanisms could be exploited. However, comprehensive disclosure risks exposing sensitive IP related to performance and details that attackers could exploit if leaked. Companies could mitigate this trade-off by providing access using servers that automatically disable access after pre-agreed periods. Internal evaluation processes and results. Disclosing internal model evaluation processes and results helps external evaluators contextualise their own findings and more accurately interpret and verify company safety claims, but increases information security risks and operational burden. 
4 For example, evaluators may identify if a failure to answer dangerous questions is due to a lack of capability or a safety constraint in the system prompt, and more accurately evaluate whether models adhere to stated behavioural guidelines (Ahmed et al., 2025).

Table 2: Minimal, substantial, and comprehensive levels of disclosure for each of the four types of information.

Type of information / Level of disclosure / Information provided

Training data
  Minimal
    • Excluded topics & websites/sources
    • Included dual-use topics
    • Model specification
    • Domains covered in fine-tuning and safety training
  Substantial
    • Proportions of dataset that contain dual-use information
    • Estimates of dual-use data that made it past filters
    • Prevalence of dual-use topics by language and modality
  Comprehensive
    • Quantitative dataset composition (i.e. proportions in tracked categories, language coverage percentages, real/synthetic split)
    • Data curation pipeline (i.e. human data contractors, inclusion/exclusion criteria, filtering, sampling/deduplication criteria)
    • Synthetic data pipeline (i.e. any generation methods used, safety measures taken to protect generation, methods for validating safety of data curation)
    • Controlled querying of datasets

Deployment configurations
  Minimal
    • System prompt
    • Model affordances
  Substantial
    • Model context protocols (including non-public configurations for tools, data access, and applications)
    • Inference-time settings (e.g. sampling parameters)
  Comprehensive
    • Architecture of agent scaffolding, including safety steps towards filtering external knowledge sources

Internal evaluation processes and results
  Minimal
    • Assertions about internal evaluation runs including conflicting results, dangerous propensities, elicitation techniques
  Substantial
    • Description of evaluation methodology, frequency, confidence intervals to accompany reported results
    • Evaluation sample size, and percentage of distribution covered by evaluation set
  Comprehensive
    • Full internal evaluation reports (methods, thresholds, and results)
    • Red-teaming transcripts
    • Safeguard efficacy analysis results
    • Acceptable use policies

Safety mitigations
  Minimal
    • List of implemented safety mitigations (e.g. monitoring/filtering inputs and outputs, fine-tuning for refusals, anti-tampering safeguards, staged access strategies (i.e. for roll out and roll back))
    • List of types of incidents tracked and known vulnerabilities
  Substantial
    • Classification metrics and filter thresholds for safeguards (CoT Monitors, input/output Classifiers)
    • Techniques for achieving CoT faithfulness
    • Architecture of filters, monitors, and refusal logic
    • Quantitative performance metrics of safety filters
  Comprehensive
    • Direct access to inference-time auxiliary classifiers/safety scaffolding (e.g. input/output classifiers, chain-of-thought filtering, content moderation models)
    • Safety data sets for refusals (incl. adversarial prompt sets).
    • Internal safety mitigations (e.g. employee access provisions)

If companies prioritise reducing information security risk over providing confirming external evaluation results, they may provide high-level information about internal evaluation runs (e.g. through yes/no attestations) to help contextualise and improve evaluator judgements (minimal). 
5 If they weigh confirming external evaluation results and information security risk similarly, they could also provide details about evaluation methodology to help evaluators interpret results, identify limitations (McCaslin et al., 2025) and increase stakeholder trust in safety claims (Staufer et al., 2025; Bommasani et al., 2024) (substantial). If companies value confirming external evaluation results over minimising information security risk, or they can mitigate the risks of sharing evaluation results effectively, they might also share full transcripts of internal evaluations and safeguard efficacy analysis results to help evaluators verify reported results, and identify incorrect methods or misleading results (McCaslin et al., 2025; Bowen et al., 2025). 6 Despite this, more detailed evaluation transcripts might reveal hazardous information (e.g. information about specific pathogens, scaffolding methods, or safeguard vulnerabilities) that could cause harm if misused by evaluators, or leaked (McCaslin et al., 2025). Frontier AI companies could alleviate this trade-off by redacting sensitive information that is not core to evaluation results (METR, 2025e), restricting high-risk reports (e.g. biological content policies) to external evaluators with strong security measures, or using non-disclosure agreements (NDAs) to prevent information sharing. Safety mitigations. Disclosing safety mitigations helps evaluators assess the robustness of deployment safeguards and identify vulnerabilities, but increases information security risks that might aid adversarial exploitation. If companies prioritise minimising information exposure over assurances about their safety mitigations, they may provide high-level lists of implemented safety mechanisms, and categories of tracked incidents (minimal). This provides evaluators with a general understanding of safety approaches and tracked areas, while limiting information that could facilitate bypass attempts. 
If companies weigh assurances about their safety mitigations and information exposure risks similarly, they may also disclose classification metrics and filters for safeguards (e.g. precision/recall for CoT monitors and input/output classifiers), thresholds for interventions, architectural details of filters and refusal logic, and quantitative performance metrics. These artefacts should help evaluators identify gaps in safety coverage and verify that monitoring systems are appropriately robust (substantial) (see OpenAI, 2025c; Anthropic, 2025d). If companies prioritise providing assurances about their safety mitigations over minimising information exposure, they may also disclose filter thresholds, classifier weights, safety training datasets and internal safety mitigations (e.g. employee access provisions, see METR, 2025c) to enable more targeted testing of whether safety mitigations and organisational safety processes are adequate for mitigating systemic risks (comprehensive). However, comprehensive disclosure increases the risk of leaking precise implementation details (e.g. adversarial prompt sets) that could help adversaries develop more effective attacks that undermine safety protections. Companies could mitigate this trade-off through controlled access environments where evaluators test safeguards and view filters or thresholds under monitored conditions, including time-limited access, least-privileged access, and audit trails. 3.3 Evaluation Time Frame Another key aspect of external evaluations is the time frame that evaluators receive to conduct their evaluations. The CoP states that at least 20 business days is appropriate for most model evaluations (European Commission, 2025). In practice, however, evaluation time frames are often much shorter than this. For example, external evaluators for Claude Sonnet 4 and Claude Sonnet 4.5 received only 1 week of access (see Anthropic 2025a and Anthropic 2025e). 
Below we consider the benefits and challenges of shorter and longer evaluation time frames. Given that the CoP suggests that external evaluations require adequate staffing, we assume that third-party staffing is constant across time frames, so that longer access corresponds to more expert hours.

Shorter evaluation time frames. When external evaluators receive shorter pre-release access to a model (e.g. one to three weeks) they can often only identify obvious vulnerabilities. With short evaluation time frames, evaluators can use basic prompting and red-teaming methods to uncover obvious vulnerabilities or elicit some capabilities before deployment. These approaches can surface potential vulnerabilities and rough capability thresholds relatively quickly by relying on existing tools such as shared prompt engineering resources, and public benchmarks. However, under these shorter time frames (e.g. three weeks or less), evaluators usually lack the time to design new evaluation methods, adapt existing ones, build bespoke testing scaffolds, or re-run failed tests (see METR, 2025d). This means capability elicitation can be shallow and implies that clearer conclusions could be reached with longer evaluation time frames (AISI, 2024). Key benefits of short evaluation time frames include that they are convenient for developers to meet model release deadlines, do not require dedicated internal capacity for long periods, and can provide some confidence to stakeholders that model capability thresholds are within acceptable bounds pre-release. Key challenges include that capability evaluations lack depth and evaluator confidence in results is low because there is insufficient time to design robust evaluations or properly elicit and scaffold model behaviour.

5 METR (METR, 2025b) received an attestation from OpenAI that their findings did not contradict OpenAI’s internal results.

6 This includes cases where developers only report results with safeguards applied, making systems appear less dangerous than they might be if jailbreaks are discovered in deployment (Bowen et al., 2025). Full evaluation reports and transcripts could reveal whether poor performance reflects genuine capability limits or inadequate testing (McCaslin et al., 2025; Adler, 2025; Bowen et al., 2025) and help distinguish genuine alignment behaviours from memorisation or evaluation awareness, a concern highlighted by Apollo over Claude Sonnet 4.5’s low deception rates (Anthropic, 2025e).

Longer evaluation time frames. By contrast, providing longer evaluation time frames enables evaluators to properly design, adapt, debug, and execute their evaluation methods. To this end, the CoP requires Signatories to provide evaluators with at least 20 business days for most model evaluations (European Commission, 2025). Extended access enables more thorough capability elicitation, including by creating bespoke evaluations, and gaining the opportunity to resample evaluations for statistical robustness. This means longer evaluation time frames increase the likelihood of uncovering additional vulnerabilities or hidden capabilities. The main benefit of longer evaluation time frames is that evaluators can develop and run more comprehensive evaluations, which provide more accurate and higher-confidence results. In turn, this strengthens stakeholder and public trust, because safety claims can be validated by independent third parties that received adequate time and resources to conduct thorough evaluations. The main challenges of providing longer evaluation time frames are that they require greater oversight from frontier AI companies and may delay model releases. Models may also undergo significant changes during the 20 business day period before release, meaning that the model version which evaluators receive may be too different from the model that is eventually deployed. 
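To make the point about resampling for statistical robustness concrete, the following Python sketch computes a Wilson score interval for an observed rate (for example, the fraction of attempts on which a model completes a dangerous task). It is an illustration under stated assumptions rather than a method taken from the paper: the function name, the 10% observed rate, and the sample sizes of 50 and 500 are all hypothetical. It shows that, for the same observed rate, the interval narrows considerably when evaluators have time to collect more samples.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical: the same 10% observed rate under a short and a long time frame.
wide = wilson_interval(successes=5, n=50)      # few samples
narrow = wilson_interval(successes=50, n=500)  # ten times as many samples

print(f"n=50:  [{wide[0]:.3f}, {wide[1]:.3f}]")
print(f"n=500: [{narrow[0]:.3f}, {narrow[1]:.3f}]")
```

Under these assumed numbers, the interval at 50 samples is roughly three times wider than at 500, so a threshold that sits inside the wider interval cannot be ruled in or out until more samples are collected.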
Overall, the time frame of an evaluation is a key component for improving the comprehensiveness and accuracy of evaluations. There is a trade-off then between the operational costs of providing access for a longer period of time and the level of assurance that evaluators can offer regarding a model’s safety.

3.4 Interdependencies

These different aspects of external evaluations can depend upon each other in important ways. While the preferences of companies might determine the level of access they provide for each aspect, choices for one aspect can inform or constrain what is feasible for another. For example:
• If a company imposes short evaluation periods, the value of providing substantial or extensive model information diminishes because evaluators are unlikely to be able to review and make use of additional information.
• If a company is concerned about the IP risks of disclosing substantial model information, they could provide deeper model access to external evaluators to support better results, without significant information disclosure.
• If a company cannot afford to give evaluators longer evaluation periods due to internal capacity constraints and deadlines, this could be partially offset by providing either richer model information which can enable faster contextualisation of results, or deeper model access (e.g. observing logits or log-probabilities) which allows evaluators to achieve more statistically significant results across fewer samples due to a reduction in the variance of evaluation results (Bucknall et al., 2025).
• If a company offers deep access in one dimension (e.g. grey-box access), it may be sensible elsewhere to provide complementary forms of access that would improve evaluation results, whilst adding little additional security or IP risk (e.g. substantive model information).

Importantly, different aspects of access do not appear to be substitutable. For instance, some vulnerabilities (e.g. 
backdoors, hidden representation biases) will be easier to detect with white-box access but very hard to detect with black-box access, even if evaluators have access to the full dataset and model specification. Similarly, evaluators typically need more time to make use of additional forms of model access. While evaluation aspects are interdependent, stakeholders should use our taxonomy as a menu to choose the different levels of access to provide that suit their use case.

4 Access Levels

Below we propose three access levels. Access Level 1 (AL1) refers to black-box model access and minimal information. Access Level 2 (AL2) refers to grey-box model access and substantial information. Access Level 3 (AL3) refers to white-box access and comprehensive information. We relied on existing literature (Casper et al., 2024), expert discussions, and definitions from the CoP to categorise the access methods outlined in our taxonomy into three access levels. The CoP describes three standards that could be used to determine what counts as “appropriate”: “best practice” (i.e. accepted practices among model providers as techniques that best assess and mitigate systemic risk); “the state of the art” (i.e. the forefront of relevant research, governance, and technology that goes beyond best practice); and “more innovative” techniques (i.e. techniques more advanced than the state of the art) (European Commission, 2025). Importantly, using “best practice” or “state of the art” techniques is not necessarily sufficient to comply with the CoP. We believe our access levels correspond to the different standards for appropriate access defined in the CoP. These access levels will require corrections and adjustments over time because companies and external evaluators will change their practices.

4.1 Access Level 1

Access Level 1 (AL1) enables basic external assurance and vulnerability detection, and is the minimum access level appropriate for frontier AI models under the CoP. 
This might be suitable for external evaluations of dangerous capabilities when developers release minor updates to an existing model, or when a new model poses low systemic risk. AL1 allows external evaluators to provide no more than basic assurance because evaluators only receive black-box access with limited information disclosures and minimal time. This means that evaluators will likely only identify obvious vulnerabilities as they are restricted to basic sampling techniques, limited information about the model, and insufficient time to re-run or design bespoke evaluations.

Table 3: Overview of Access Level 1

Access dimension / Subcategory / Access provided

Model access
  –
    • Black-box API
    • Model checkpoints
    • Helpful-only variants
    • CoT
    • Ability to enable/disable input output classifiers

Model information
  Training data
    • Excluded topics/sources
    • Included dual-use topics
    • Model specification
    • Domains covered in fine-tuning and safety training
  Deployment configurations
    • System prompt
    • Model affordances
  Internal evaluation processes and results
    • Assertions about internal evaluation runs including conflicting results, dangerous propensities, elicitation techniques
  Safety mitigations
    • List of implemented safety mitigations
    • Classification metrics for safeguards
    • Filter/monitor architecture and refusal logic
    • Performance metrics of safety filters

Evaluation time frame
  –
    • At least 20 business days

AL1 corresponds to the CoP’s “best practice” definition of appropriate access, and establishes the minimum access Signatories need to provide to external evaluators under the CoP (European Commission, 2025). Signatories likely need to provide evaluators with black-box access to the model before release, because this appears to be best practice (OpenAI, 2025a; Anthropic, 2025e; Amazon, 2025). 
Pursuant to Measure 3.5 of the CoP, Signatories need to provide external evaluators access to model versions with the fewest safety mitigations, alongside the chains-of-thought, for external post-market monitoring. This suggests Signatories likely need to provide model checkpoints, variants, and the chain-of-thought during external evaluations for full risk assessments. This is proportional because the security risks and operational costs of providing access to model versions without safety measures and their CoTs are likely similar for external post-market monitoring and full risk assessment. Pursuant to Appendix 3.3 of the CoP, external evaluators need to assess the effectiveness of mitigations, which suggests evaluators need substantial information about safety mitigations, alongside which dual-use topics were included in the training data, and domains covered by safety training. Pursuant to Appendix 3.4 of the CoP, if appropriate for the systemic risk and model evaluation method, Signatories must provide evaluation teams with the model spec, system prompt, training data, test sets, and past model evaluation results. Evaluators may also need this information to elicit the model, pursuant to Appendix 3.2 of the CoP. The AEF-1 standard recommends external evaluators gain access to system prompts, information about the training process and data, pre-existing internal evaluation results, and knowledge of system vulnerabilities as relevant for trustworthy evaluations, among other forms of information (Stosz et al., 2025). Pursuant to Appendix 3.4 of the CoP, providing evaluators at least 20 business days to conduct evaluations is appropriate for most systemic risks (presumably CBRN, loss of control, cyber offense, and harmful manipulation) and evaluation methods (presumably red teaming, capability elicitation, scaffolded evaluations, and bespoke task suites). 

4.2 Access Level 2

Access Level 2 (AL2) allows more targeted evaluations to improve capability elicitation and vulnerability identification. This might be suitable for model releases with significant capability jumps. AL2 supports evaluators in providing more targeted and reliable model evaluations through some detailed information about systems, access to sampling controls and fine-tuning, and longer evaluation windows. These affordances mean that evaluators can design and rerun bespoke evaluations, resample from the model for statistical robustness, and use fine tuning and safety classifier information to more accurately stress-test safeguards.

Table 4: Overview of Access Level 2

Access dimension / Subcategory / Access provided

Implementation of previous Access Levels
  –
    • Developer has implemented all access from AL1

Model access
  –
    • Unfiltered fine-tuning
    • Change sampling algorithm
    • Log-probabilities
    • Classifier responses

Model information
  Training data
    • The same as for AL1
  Deployment configurations
    • Model context protocols
    • Inference-time settings (e.g. sampling parameters)
  Internal evaluation processes and results
    • Detailed description of evaluation methodologies to accompany reported results
  Safety mitigations
    • Same as AL1

Evaluation time frame
  –
    • ≥20 business days

AL2 corresponds to the current state of the art for external evaluator access under the CoP. For instance, frontier AI companies have previously provided heightened model access in the form of supervised fine-tuning on custom datasets (OpenAI, 2023a), the ability to specify or modify some sampling parameters (Saunders et al., 2022), and access to view log-probabilities of output tokens (OpenAI, 2023b). 
7 Whilst substantial training data disclosure could likely be mitigated effectively, and the CoP suggests that relevant training data information for systemic risks and model evaluation methods should be provided, no frontier AI company appears to have disclosed training data beyond that of AL1. Pursuant to Appendix 3.4 of the CoP, past model evaluation results must be provided as appropriate for each systemic risk and model evaluation method. Recent work has provided a template for frontier model providers that allows them to more thoroughly describe their evaluation processes and results without publishing information posing security risks (McCaslin et al., 2025; Reed et al., 2025), suggesting that providing evaluators with this level of detail about internal evaluations may be state-of-the-art. Pursuant to Appendix 3.3 of the CoP, external evaluators must assess the effectiveness of safety mitigations at an appropriate breadth and depth, including under adversarial pressure. This suggests that evaluators may need heightened model access (e.g. to fine-tune the model, probe safeguards, or generate jailbreaks) as well as detailed information about safety mitigation and deployment configuration details. OpenAI and Anthropic have previously provided UK AISI and US CAISI access to several versions of constitutional classifiers and additional safety mitigation details (e.g. safeguard architecture details, documented vulnerabilities, content policy information, classifier scores) to more accurately identify safeguard vulnerabilities (OpenAI, 2025c; Anthropic, 2025d). No frontier AI company appears to have provided external evaluators with more time to conduct their evaluations than the minimal requirement of 20 business days pursuant to Appendix 3.4. 

7 OpenAI, Anthropic, and Google have each previously provided a selection of various sampling algorithms and control of relevant parameters (see Bucknall and Trager, 2023, Appendix A).

4.3 Access Level 3

Access Level 3 (AL3) represents a more experimental access arrangement which likely allows for the most accurate external evaluation results. These affordances mean that evaluators can more rigorously test capability ceilings, identify vulnerabilities, and use non-public information to validate reported safety results. This might be suitable for models with significant jumps in capabilities when model deployment decisions depend on evaluation results, or when privacy-preserving tools can reliably support white-box access.

Table 5: Overview of Access Level 3.

Access dimension / Subcategory / Access provided

Implementation of previous Access Levels
  –
    • Developer has implemented all access from AL1 and AL2

Model access
  –
    • Full access to activations, gradients, and logits
    • Ability to fine-tune with custom loss function
    • Full access to input/output classifiers

Model information
  Training data
    • Detailed breakdown of the contents of datasets
    • Details on data curation pipeline and safety protections
  Deployment configurations
    • Architecture of agent scaffolding (incl. safety steps towards filtering external knowledge sources)
  Internal evaluation processes and results
    • Full internal reports and transcripts
    • Safeguard efficacy analysis results
  Safety mitigations
    • Filter thresholds
    • Safety data sets (incl. adversarial prompt datasets)
    • Internal access rules for employees (i.e. for preventing unauthorized system access)

Evaluation time frame
  –
    • Greater than 20 business days

AL3 corresponds to access methods that would go beyond the current state of the art for external access under the CoP. Signatories could provide white-box access to models. 
Since white-box access has proved most useful for uncovering harmful behaviours, vulnerabilities, and potential sandbagging on open-source models (Che et al., 2025; Casper et al., 2024; Taylor et al., 2025), but has not been piloted by external evaluators on closed source models, this likely surpasses the current state of the art. Pursuant to Appendix 3.2 of the CoP, Signatories may provide additional affordances (e.g. fine-tuning with a custom loss function, or activations) to minimise the risk of under elicitation and undetected sandbagging during model evaluations. 8 Pursuant to Appendix 3.4 of the CoP, Signatories must provide evaluation teams with training data, test sets, and past model evaluation results as appropriate for each systemic risk and evaluation method. A more experimental access arrangement might therefore involve providing substantial information, including about data composition and post training details to better understand how training data can affect a model’s behaviour (Soldaini et al., 2024), data curation policies to identify unexamined biases that may cause downstream model behaviours (Dodge et al., 2021), and full internal evaluation reports to validate reported results. Pursuant to Appendix 3.3 of the CoP, Signatories need to assess the extent to which mitigations work as planned and can be circumvented or subverted. In a more experimental setup, frontier AI companies could provide filter thresholds and outline internal safety mitigations (e.g. internal employee access provisions) to help evaluators assess safety mitigations. Finally, a more experimental access arrangement would likely require companies provide external evaluators with more than 20 business days to accommodate more experimental white-box evaluation methods (e.g. latent space attacks) and more information artifacts to study. 
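The three access levels can be read as nested thresholds across the three dimensions of the taxonomy. The Python sketch below is a deliberately coarse illustration of that reading and is not the authors' specification: the class names, the `highest_level_met` helper, and the reduction of each level to a single value per dimension are assumptions (Tables 3 to 5 are more granular, for example AL1 already includes some safety-mitigation metrics). "Greater than 20 business days" for AL3 is encoded as at least 21 whole business days.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class ModelAccess(IntEnum):
    BLACK_BOX = 1
    GREY_BOX = 2
    WHITE_BOX = 3


class Disclosure(IntEnum):
    MINIMAL = 1
    SUBSTANTIAL = 2
    COMPREHENSIVE = 3


@dataclass(frozen=True)
class AccessProfile:
    model_access: ModelAccess
    disclosure: Disclosure
    business_days: int


# Coarse, assumed thresholds for each level (one value per dimension).
LEVELS = {
    "AL1": AccessProfile(ModelAccess.BLACK_BOX, Disclosure.MINIMAL, 20),
    "AL2": AccessProfile(ModelAccess.GREY_BOX, Disclosure.SUBSTANTIAL, 20),
    "AL3": AccessProfile(ModelAccess.WHITE_BOX, Disclosure.COMPREHENSIVE, 21),
}


def highest_level_met(offer: AccessProfile) -> Optional[str]:
    """Return the highest level whose thresholds are met on every dimension."""
    met = None
    for name, required in LEVELS.items():  # insertion order: AL1, AL2, AL3
        if (
            offer.model_access >= required.model_access
            and offer.disclosure >= required.disclosure
            and offer.business_days >= required.business_days
        ):
            met = name
    return met


# Hypothetical offers: deep access on one dimension does not lift the others.
print(highest_level_met(AccessProfile(ModelAccess.GREY_BOX, Disclosure.SUBSTANTIAL, 20)))
print(highest_level_met(AccessProfile(ModelAccess.WHITE_BOX, Disclosure.MINIMAL, 30)))
print(highest_level_met(AccessProfile(ModelAccess.BLACK_BOX, Disclosure.MINIMAL, 5)))
```

Taking the minimum across dimensions is one possible design choice; it mirrors the observation that the aspects of access are not substitutable, so white-box access with only minimal information would still be described as AL1 in this sketch.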

In selecting an access level for external evaluations, companies will need to consider whether the risks posed by their model warrant minimal, state of the art, or more experimental access methods. Importantly, companies do not need to provide the same level of access across all dimensions or for different model releases. Moreover, they can adopt access methods that align with their specific priorities and risk profiles. Nevertheless, we think that AL2 is clearly within reach for frontier AI companies today. This is because AL2 is unlikely to pose significant security risks, and any remaining risks can likely be mitigated through existing technical and legal safeguards (e.g. API access, secure enclaves, NDAs), as well as measures commonly used for third-party assessments in other industries (e.g. segregating duties).

8 The AI Evaluator Forum recommends that fine-tuning should be provided in cases where users can fine-tune the model (see Stosz et al., 2025). Access to activations can be useful for running probes which may help identify model sandbagging behaviours (Taylor et al., 2025).

5 Conclusion

In this paper we have outlined a taxonomy of access to promote clearer standards and communication for external dangerous capability evaluations. We have attempted to address the problem of inconsistent reporting by providing a framework for companies and evaluators to communicate about what forms of access might be required across key dimensions (model access, model information, and time), and by proposing “access levels” that correspond to the external access requirements in the EU GPAI Code of Practice. We reviewed the benefits and risks of expanding access, and argued that key challenges can likely be mitigated through technical means and safeguards used in other industries. There are two main limitations to our work. 
First, we determined access levels based on the public information that is available about the access currently provided for external evaluations. Therefore, our access levels may not correspond to the access that companies currently provide to external evaluators. This means our judgement of minimal, state of the art, and more experimental access levels may not be the most accurate approximation. Second, the CoP is open to interpretation, meaning that there remains scope for disagreement over which levels of access are appropriate in different situations.
The paper has also left several questions unanswered that warrant further research. Future work could examine how to use our taxonomy and access levels to make access requirements more concrete in practice. Future work could also clarify the conditions (e.g. the level of risk, or required evaluation method) under which external evaluators need deeper access to a model. Finally, future work could develop tools for structured access, and use these tools in future evaluations, so that external evaluator access can become less costly and more secure.
We view our taxonomy and access levels as a starting point, and expect that they will require updates as risks emerge, evaluation methods change, and tools for structured access improve. We therefore invite researchers, frontier AI companies, evaluators, and regulators to use and iterate on our taxonomy and access levels.
Acknowledgments
This research was supported by the ERA Fellowship. The authors would like to thank ERA for their financial and intellectual support. We are grateful for valuable comments and feedback from Alan Chan, Ben Bucknall, Dave Banerjee, Edward Kembery, Harrison Gietz, Jonas Freund, Lisa Soder, Lucas Sato, May Dixit, Nikola Jurkovic, Patricia Paskov, Prakriti Bandhan, Tom Reed, Zaheed Kara, and all the participants of the ERA 2025 Summer Fellowship. Any remaining errors are our own.
References
Steven Adler.
Ai companies should be safety-testing the most capable versions of their models, 2025. URL https://stevenadler.substack.com/p/ai-companies-should-be-safety-testing.
U.S. Food & Drug Administration. Types of fda inspections. Technical report, U.S. Food & Drug Administration, 2024. URL https://w.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/inspection-basics/types-fda-inspections.
Ahmed Ahmed, Kevin Klyman, Yi Zeng, Sanmi Koyejo, and Percy Liang. Speceval: Evaluating model adherence to behavior specifications. arXiv:2509.02464, 2025. Retrieved from http://arxiv.org/abs/2509.02464.
US AISI & UK AISI. Us aisi and uk aisi joint pre-deployment test, 2024. URL https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/6763fac97cd22a9484ac3c37_o1_uk_us_december_publication_final.pdf.
Amazon. Amazon’s frontier model safety framework. Technical report, Amazon, 2025. URL https://assets.amazon.science/a7/7c/8bdade5c4eda9168f3dee6434f/pc-amazon-frontier-model-safety-framework-2-7-final-2-9.pdf.
Anthropic. Model context protocol, 2024a. URL https://w.anthropic.com/news/model-context-protocol.
Anthropic. Trust center, 2024b. URL https://trust.anthropic.com. Accessed: 2024-01-15.
Anthropic. System card: Claude opus 4 & claude sonnet 4. Technical report, Anthropic, 2025a. URL https://w-cdn.anthropic.com/07b2a3f9902e19fe39a36ca638e5ae987bc64d.pdf.
Anthropic. Activating ai safety level 3 protections. Technical report, Anthropic, 2025b. URL https://w.anthropic.com/news/activating-asl3-protections.
Anthropic. Raising the bar on swe-bench verified with claude 3.5 sonnet, 2025c. URL https://w.anthropic.com/engineering/swe-bench-sonnet.
Anthropic. Strengthening our safeguards through collaboration with us caisi and uk aisi. Technical report, Anthropic, 2025d. URL https://w.anthropic.com/news/strengthening-our-safeguards-through-collaboration-with-us-caisi-and-uk-aisi.
Anthropic. System card: Claude sonnet 4.5. Technical report, Anthropic, 2025e.
URL https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf.
Financial Conduct Authority. Fca handbook rec 3.19. Technical report, Financial Conduct Authority, 2024. URL https://handbook.fca.org.uk/handbook/rec3/rec3s19?timeline=true.
Michel Benaroch. Third-party induced cyber incidents - much ado about nothing? Journal of Cybersecurity, 7:1–18, 2021. ISSN 20572093. doi: 10.1093/cybsec/tyab020. URL https://doi.org/10.1093/cybsec/tyab020.
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, and Andy Zou. Lessons from the trenches on reproducible evaluation of language models. arXiv:2405.14782, 2024. Retrieved from http://arxiv.org/abs/2405.14782.
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. The foundation model transparency index, 2023. URL http://arxiv.org/abs/2310.12941. Retrieved from https://arxiv.org/abs/2310.12941.
Rishi Bommasani, Sanjeev Arora, Yejin Choi, Li Fei-Fei, Daniel E. Ho, Dan Jurafsky, Sanmi Koyejo, Hima Lakkaraju, Arvind Narayanan, Alondra Nelson, Emma Pierson, Joelle Pineau, Gaël Varoquaux, Suresh Venkatasubramanian, Ion Stoica, Percy Liang, and Dawn Song. A path for science-and evidence-based ai policy, 2024. URL https://understanding-ai-safety.org.
Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, and Chris Cundy. Ai companies should report pre- and post-mitigation safety evaluations. arXiv:2503.17388, 2025. URL http://arxiv.org/abs/2503.17388. Retrieved from http://arxiv.org/abs/2503.17388.
Ben Bucknall, Robert Trager, and Michael A.
Osborne. Position: Ensuring mutual privacy is necessary for effective external evaluation of proprietary ai systems. arXiv:2503.01470, 2025. Retrieved from https://arxiv.org/abs/2503.01470.
Benjamin S. Bucknall and Robert F. Trager. Structured access for third-party research on frontier ai models: Investigating researchers’ model access requirements. Technical report, Oxford Martin School, 2023. URL https://oms-w.files.svdcdn.com/production/downloads/academic/Investigating_Researchers’_Model_Access_Oct23-compressed_3.pdf.
Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tramèr. Stealing part of a production language model, 2024. URL http://arxiv.org/abs/2403.06634. Retrieved from https://arxiv.org/abs/2403.06634.
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. Fundamental limitations of reinforcement learning from human feedback. arXiv:2307.15217, 2023. Retrieved from http://arxiv.org/abs/2307.15217.
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell. Black-box access is insufficient for rigorous ai audits.
In 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024, pages 2254–2272, Rio de Janeiro, Brazil, June 2024. Association for Computing Machinery, Inc. ISBN 9798400704505. URL https://doi.org/10.1145/3630106.3659037.
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev Mckinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Apollo Research, Yarin Gal, Furong Huang, and Dylan Hadfield-Menell. Model tampering attacks enable more rigorous evaluations of llm capabilities. arXiv:2502.05209, 2025. Retrieved from http://arxiv.org/abs/2502.05209.
Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv:2104.08758, 2021. Retrieved from http://arxiv.org/abs/2104.08758.
DSIT. Frontier ai safety commitments, ai seoul summit 2024. Technical report, DSIT, 2024. URL https://w.gov.uk/government/publications/frontier-ai-safety-commitments-ai-seoul-summit-2024/frontier-ai-safety-commitments-ai-seoul-summit-2024.
Bob Duncan and Mark Whittington. Enhancing cloud security and privacy: The power and the weakness of the audit trail. In The Seventh International Conference on Cloud Computing, GRIDs, and Virtualization, pages 125–130, Rome, Italy, 2016. International Academy, Research, and Industry Association. URL https://personales.upv.es/thinkmind/dl/conferences/cloudcomputing/cloud_computing_2016/cloud_computing_2016_6_20_20063.pdf.
European Commission. The general-purpose ai code of practice. Technical report, 2025. URL https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai#:~:text=EU%20copyright%20law.-,Safety%20and%20Security,-The%20Safety%20and.
Frontier Model Forum. Third-party assessments. Technical report, FMF, 2025. URL https://w.frontiermodelforum.org/technical-reports/third-party-assessments/.
Idoia Gamiz, Cristina Regueiro, Oscar Lage, Eduardo Jacob, and Jasone Astorga. Challenges and future research directions in secure multi-party computation for resource-constrained devices and large-scale computations. Int. J. Inf. Secur., 24(1), 2024. ISSN 1615-5262. doi: 10.1007/s10207-024-00939-4. URL https://doi.org/10.1007/s10207-024-00939-4.
Google. Gemini 2.5 pro model card. Technical report, Google, 2025. URL https://modelcards.withgoogle.com/assets/documents/gemini-2.5-pro.pdf.
Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, and Francis Rhys Ward. The elicitation game: Evaluating capability elicitation techniques. arXiv:2502.02180, 2025. Retrieved from https://arxiv.org/abs/2502.02180.
UK AI Security Institute. Pre-deployment evaluation of openai’s o1 model. Technical report, UK AI Security Institute, 2024. URL https://w.aisi.gov.uk/blog/pre-deployment-evaluation-of-openais-o1-model.
Hazem Ramadan Ismael and Clare Roberts. Factors affecting the voluntary use of internal audit: evidence from the uk. Managerial Auditing Journal, 33:288–317, 2018. ISSN 0268-6902. doi: 10.1108/MAJ-08-2016-1425.
ISO. Iso/iec 27001:2022. Technical report, ISO, 2022a. URL https://w.iso.org/standard/27001.
ISO. Iso/iec 27002:2022. Technical report, ISO, 2022b. URL https://w.iso.org/standard/75652.html.
M Jagadeeswari, P. Naveen Karthi, V.A. Nitish Kumar, and S. Lokith Sai Ram. A secure file sharing and audit trail tracking platform with advanced encryption standard for cloud-based environments.
In 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), pages 540–547, Coimbatore, India, 2023. IEEE. URL https://doi.org/10.1109/ICESC57686.2023.10193389.
Jung Koo Kang, Clive Lennox, Vivek Pandey, Eric Allen, Mark Defond, Jason Guo, Yonghong Jia, Scott Judd, Paul Koch, Dan O’leary, Tracie Majors, Babak Mammadov, Lorien Stice-Lawrence, Richard Sloan, K R Subramanyam, Qian Wang, Olena Watanabe, Regina Wittenberg-Moerman, and Suning Zhang. Client concerns about information spillovers from sharing audit partners. SSRN, 2021. Retrieved from http://dx.doi.org/10.2139/ssrn.3567535.
Suzanne Lightman, Theresa Suloway, and Joseph Brule. Satellite ground segment :. Technical report, National Institute of Standards and Technology, 2022. URL https://nvlpubs.nist.gov/nistpubs/ir/2022/NIST.IR.8401.pdf.
Shayne Longpre, Robert Mahari, Naana Obeng-Marnu, William Brannon, Tobin South, Katy Gero, Sandy Pentland, and Jad Kabbara. Data authenticity, consent, & provenance for ai are all broken: what will it take to fix them? arXiv:2404.12691, 2024. Retrieved from http://arxiv.org/abs/2404.12691.
Shayne Longpre, Kevin Klyman, Ruth E. Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey John Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark M. Jaycox, Markus Anderljung, Nadine Farid Johnson, Nicholas Carlini, Nicolas Miailhe, Nik Marda, Peter Henderson, Rebecca S. Portnoff, Rebecca Weiss, Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, and Arvind Narayanan. In-house evaluation is not enough: Towards robust third-party flaw disclosure for general-purpose ai. arXiv:2503.16861, 2025. Retrieved from http://arxiv.org/abs/2503.16861.
Tegan McCaslin, Jide Alaga, Samira Nedungadi, Seth Donoughe, Tom Reed, Rishi Bommasani, Chris Painter, and Luca Righetti. Stream (chembio): A standard for transparently reporting evaluations in ai model reports. arXiv:2508.09853, 2025. Retrieved from http://arxiv.org/abs/2508.09853.
Medicines and Healthcare products Regulatory Agency. Decentralised manufacture: Uk guideline on good manufacturing practice (gmp). Technical report, Medicines and Healthcare products Regulatory Agency, 2025. URL https://w.gov.uk/guidance/decentralised-manufacture-uk-guideline-on-good-manufacturing-practice-gmp#manufacturing-licence-applications-and-inspection-approach.
METR. Common elements of frontier ai safety policies. Technical report, METR, 2025a. URL https://metr.org/common-elements.pdf.
METR. Details about metr’s evaluation of openai gpt-5. Technical report, METR, 2025b. URL https://evaluations.metr.org/gpt-5-report/.
METR. What should companies share about risks from frontier AI models? Technical report, METR, 2025c. URL https://metr.org/blog/2025-06-27-risk-transparency/. Accessed: 2025-06-27.
METR. Details about metr’s preliminary evaluation of o3 and o4-mini. Technical report, METR, 2025d. URL https://evaluations.metr.org/openai-o3-report/#limitations.
METR. Review of the anthropic summer 2025 pilot sabotage risk report, 2025e. URL https://metr.org/2025_pilot_risk_report_metr_review.pdf.
Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv:2411.00640, 2024. Retrieved from http://arxiv.org/abs/2411.00640.
Smitha Milli, Ludwig Schmidt, Anca D. Dragan, and Moritz Hardt. Model reconstruction from model explanations. arXiv:1807.05185, 2018. Retrieved from http://arxiv.org/abs/1807.05185.
Sella Nevo, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, and Jeff Alstott. Securing ai model weights: preventing theft and misuse of frontier models. Technical report, RAND, 2024. URL https://w.rand.org/pubs/research_reports/RRA2849-1.html.
NIST.
Nist special publication 800-115: Technical guide to information security testing and assessment. Technical report, NIST, 2008. URL https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-115.pdf.
NIST. Nist special publication 800-53 revision 5: Security and privacy controls for information systems and organizations. Technical report, NIST, 2020. URL https://csrc.nist.gov/CSRC/media/Projects/risk-management/800-53%20Downloads/800-53r5/SP_800-53_v5_1-derived-OSCAL.pdf.
Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman. Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight llms. arXiv:2508.06601, 2025. Retrieved from http://arxiv.org/abs/2508.06601.
Office of the Comptroller of the Currency. Comptroller’s handbook: Bank supervision process. Technical report, Office of the Comptroller of the Currency, 2018. URL https://w.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/bank-supervision-process/pub-ch-bank-supervision-process.pdf.
OpenAI. Fine-tuning, 2023a. URL https://perma.c/BFZ9-QBXN.
OpenAI. Using logprobs. Technical report, OpenAI, 2023b. URL https://cookbook.openai.com/examples/using_logprobs.
OpenAI. Trust portal, 2024. URL https://trust.openai.com. Accessed: 2024-01-15.
OpenAI. Gpt-5 system card. Technical report, OpenAI, 2025a. URL https://cdn.openai.com/gpt-5-system-card.pdf.
OpenAI. Openai model spec. Technical report, OpenAI, 2025b. URL https://model-spec.openai.com/2025-10-27.html.
OpenAI. Working with us caisi and uk aisi to build more secure ai systems. Technical report, OpenAI, 2025c. URL https://openai.com/index/us-caisi-uk-aisi-ai-update/.
OpenMined. How to audit an ai model owned by someone else. Technical report, OpenMined, 2023. URL https://openmined.org/blog/ai-audit-part-1/.
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv:2310.03693, 2023. Retrieved from http://arxiv.org/abs/2310.03693.
Tom Reed, Tegan McCaslin, and Luca Righetti. What do model reports say about their chembio benchmark evaluations? comparing recent releases to the stream framework. arXiv:2510.20927, 2025. Retrieved from http://arxiv.org/abs/2510.20927.
Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, David Bau, Paul Bricman, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel J. Kochenderfer, and Robert Trager. Open problems in technical ai governance, 2025. URL http://arxiv.org/abs/2407.14981.
Luis Roque, Carlos Soares, Vitor Cerqueira, and Luis Torgo. Cherry-picking in time series forecasting: How to select datasets to make your model shine, 2024. URL http://arxiv.org/abs/2412.14435. Retrieved from https://arxiv.org/abs/2412.14435.
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, and Soheil Feizi. Fast adversarial attacks on language models in one gpu minute, 2024. URL http://arxiv.org/abs/2402.15570. Retrieved from https://arxiv.org/abs/2402.15570.
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv:2206.05802, 2022. Retrieved from http://arxiv.org/abs/2206.05802.
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, and Stephan Gunnemann.
Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space, 2024. URL http://arxiv.org/abs/2402.09063. Retrieved from https://arxiv.org/abs/2402.09063.
Toby Shevlane. Structured access: an emerging paradigm for safe ai deployment. arXiv:2201.05159, 2022. Retrieved from https://arxiv.org/abs/2201.05159.
Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. The leaderboard illusion, 2025. URL http://arxiv.org/abs/2504.20879. Retrieved from https://arxiv.org/abs/2504.20879.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Evan Walsh, Luke Zettlemoyer, Noah Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo. Dolma: an open corpus of three trillion tokens for language model pretraining research. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15725–15788, Bangkok, Thailand, 2024. Association for Computational Linguistics. URL https://doi.org/10.18653/v1/2024.acl-long.840.
Tobin South, Alexander Camuto, Shrey Jain, Shayla Nguyen, Robert Mahari, Christian Paquin, Jason Morton, and Alex ’Sandy’ Pentland. Verifiable evaluations of machine learning models using zksnarks. arXiv:2402.02675, 2024. Retrieved from https://arxiv.org/abs/2402.02675.
Leon Staufer, Mick Yang, Anka Reuel, and Stephen Casper. Audit cards: Contextualizing ai evaluations. arXiv:2504.13839, 2025. Retrieved from https://arxiv.org/abs/2504.13839.
Conrad Stosz, Karson Elmgren, Charles Foster, George Balston, Seth Donoughe, Samira Nedungadi, Michael Chen, Jasper Götting, Patricia Paskov, Sayash Kapoor, Schwettmann Sarah, Rishi Bommasani, Luca Righetti, Sean McGregor, Grace Werner, Christopher Painter, Faisal Lalani, Rob Reich, Arvind Narayanan, Elizabeth Barnes, Miles Brundage, Aidan Homewood, Divya Siddharth, Charles Teague, Jaime Sevilla, and Jacob Steinhardt. Aef-1: Minimum operating conditions for independent third party ai evaluations. Technical report, AI Evaluator Forum, 2025. URL https://w.aef.one/aef-one.pdf.
Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, and Joseph Bloom. Auditing games for sandbagging. arXiv:2512.07810, 2025. Retrieved from http://arxiv.org/abs/2512.07810.
Alejandro Tlaie. A blueprint for an eu ecosystem of secure, deep and external ai audits. SSRN, 2025. Retrieved from http://dx.doi.org/10.2139/ssrn.5239081.
Alejandro Tlaie and Jimmy Farrell. Securing external deeper-than-black-box gpai evaluations. arXiv:2503.07496, 2025. Retrieved from http://arxiv.org/abs/2503.07496.
Andrew Trask, Aziz Berkay Yesilyurt, Bennett Farkas, Callis Ezenwaka, Carmen Popa, Dave Buckley, Eelco van der Wel, Francesco Mosconi, Grace Han, Ionesio Junior, Irina Bejan, Ishan Mishra, Khoa Nguyen, Koen van der Veen, Kyoko Eng, Lacey Strahm, Logan Graham, Madhava Jay, Matei Simtinica, Osam Kyemenu-Sarsah, Peter Smith, Rasswanth S, Ronnie Falcon, Sameer Wagh, Sandeep Mandala, Shubham Gupta, Stephen Gabriel, Subha Ramkumar, Tauquir Ahmed, Teo Milea, Valerio Maggio, Yash Gorana, and Zarreen Reza. Secure enclaves for ai evaluation. Technical report, OpenMined, 2024. URL https://openmined.org/blog/secure-enclaves-for-ai-evaluation/.
Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S.
Torr, Adel Bibi, Samuel Albanie, and Matthias Bethge. No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance, 2024. URL http://arxiv.org/abs/2404.04125. Retrieved from https://arxiv.org/abs/2404.04125.
Suppakit Waiwitlikhit, Ion Stoica, Yi Sun, Tatsunori Hashimoto, and Daniel Kang. Trustless audits without revealing data or models. arXiv:2404.04500, 2024. Retrieved from https://arxiv.org/abs/2404.04500.
Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, and Chris Koch. Estimating worst-case frontier risks of open-weight llms. arXiv:2508.03153, 2025. Retrieved from http://arxiv.org/abs/2508.03153.
Santiago Zanella-Beguelin, Shruti Tople, Andrew Paverd, and Boris Köpf. Grey-box extraction of natural language models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12278–12286, Vienna, Austria, 18–24 July 2021. PMLR. URL https://proceedings.mlr.press/v139/zanella-beguelin21a.html.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai transparency. arXiv:2310.01405, 2025. Retrieved from http://arxiv.org/abs/2310.01405.