
Paper deep dive

Constructing Safety Cases for AI Systems: A Reusable Template Framework

Sung Une Lee, Liming Zhu, Md Shamsujjoha, Liming Dong, Qinghua Lu, Jieshan Chen, Lionel Briand

Year: 2026 · Venue: arXiv preprint · Area: Safety Evaluation · Type: Survey · Embeddings: 168

Abstract

Safety cases, structured arguments that a system is acceptably safe, are becoming central to the governance of AI systems. Yet, traditional safety-case practices from aviation or nuclear engineering rely on well-specified system boundaries, stable architectures, and known failure modes. Modern AI systems, such as generative and agentic AI, are the opposite. Their capabilities emerge unpredictably from low-level training objectives, their behaviour varies with prompts, and their risk profiles shift through fine-tuning, scaffolding, or deployment context. This study examines how safety cases are currently constructed for AI systems and why classical approaches fail to capture these dynamics. This study introduces comprehensive taxonomies for AI-specific claim types (assertion-based, constraint-based, capability-based), argument types (demonstrative, comparative, causal/explanatory, risk-based, and normative), and evidence families (empirical, mechanistic, comparative, expert-driven, formal methods, operational/field data, and model-based). It then proposes reusable safety-case templates, each of which follows a predefined structure of claims, arguments, and evidence tailored for AI systems. Each template is illustrated by end-to-end patterns that address distinctive challenges, such as evaluation without ground truth, dynamic model updates, and threshold-based risk decisions. The result is a systematic, composable, and reusable approach to constructing and maintaining safety cases that are credible, auditable, and adaptive to the evolving behaviour of generative and frontier AI systems.

Tags

ai-safety (imported, 100%) · safety-evaluation (suggested, 92%) · survey (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/11/2026, 12:56:48 AM

Summary

This paper introduces a structured, reusable framework for constructing AI safety cases, addressing the limitations of traditional engineering approaches when applied to modern generative and agentic AI. The authors propose a comprehensive taxonomy for Claims, Arguments, and Evidence (CAE), along with reusable templates and patterns designed to handle AI-specific challenges like emergent capabilities, lack of ground truth, and continuous model evolution. The framework is supported by an ecosystem pipeline involving builders, validators, and registries to ensure auditable and adaptive safety governance.

Entities (5)

AI Safety Case · concept · 100%
CAE Framework · methodology · 100%
AI Safety Case Builder · tool · 95%
AI Safety Case Registry · tool · 95%
AI Safety Case Validator · tool · 95%

Relation Signals (4)

CAE Framework comprises Claim

confidence 100% · The CAE structure emphasizes: Claim, Argument, Evidence.

CAE Framework comprises Argument

confidence 100% · The CAE structure emphasizes: Claim, Argument, Evidence.

CAE Framework comprises Evidence

confidence 100% · The CAE structure emphasizes: Claim, Argument, Evidence.

AI Safety Case Builder supports AI Safety Case

confidence 90% · AI Safety Case Builder helps AI developers and system owners construct structured, versioned safety cases.

Cypher Suggestions (2)

List all tools involved in the AI safety ecosystem · confidence 95% · unvalidated

MATCH (t:Tool) RETURN t.name

Find all components of the CAE framework · confidence 90% · unvalidated

MATCH (f:Methodology {name: 'CAE Framework'})-[:COMPRISES]->(c) RETURN c.name, labels(c)
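The suggested queries above assume a property graph over the extracted entities and relation signals. As a rough, hypothetical illustration (the triple-list storage and the helper function below are our own, not the tool's actual schema), the result of the second Cypher suggestion can be reproduced over an in-memory mirror of the four relation signals:

```python
# Hypothetical in-memory mirror of the graph behind the Cypher suggestions.
# The relation triples follow the "Relation Signals" section above; storing
# them as a plain list of tuples is an assumption for illustration only.
RELATIONS = [
    ("CAE Framework", "COMPRISES", "Claim"),
    ("CAE Framework", "COMPRISES", "Argument"),
    ("CAE Framework", "COMPRISES", "Evidence"),
    ("AI Safety Case Builder", "SUPPORTS", "AI Safety Case"),
]

def components_of(framework: str) -> list[str]:
    """Rough equivalent of:
    MATCH (f:Methodology {name: $framework})-[:COMPRISES]->(c) RETURN c.name
    """
    return [tail for head, rel, tail in RELATIONS
            if head == framework and rel == "COMPRISES"]

print(components_of("CAE Framework"))  # ['Claim', 'Argument', 'Evidence']
```

In a real deployment the same lookup would run against the graph database directly; this sketch only shows what the suggested query computes.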

Full Text

167,735 characters extracted from source content.


A STRUCTURED APPROACH TO SAFETY CASE CONSTRUCTION FOR AI SYSTEMS

Sung Une Lee, Liming Zhu, Md Shamsujjoha, Liming Dong, Qinghua Lu, Jieshan Chen, Lionel Briand*
Data61, CSIRO, Australia
firstname.lastname@data61.csiro.au
*University of Ottawa, Canada
*lbriand@uottawa.ca

ABSTRACT

Safety cases, structured arguments that a system is acceptably safe, are becoming central to the governance of AI systems. Yet, traditional safety-case practices from aviation or nuclear engineering rely on well-specified system boundaries, stable architectures, and known failure modes. Modern AI systems, such as generative and agentic AI, are the opposite. Their capabilities emerge unpredictably from low-level training objectives, their behaviour varies with prompts, and their risk profiles shift through fine-tuning, scaffolding, or deployment context. This study examines how safety cases are currently constructed for AI systems and why classical approaches fail to capture these dynamics. This study introduces comprehensive taxonomies for AI-specific claim types (assertion-based, constraint-based, capability-based), argument types (demonstrative, comparative, causal/explanatory, risk-based, and normative), and evidence families (empirical, mechanistic, comparative, expert-driven, formal methods, operational/field data, and model-based). It then proposes reusable safety-case templates, each of which follows a predefined structure of claims, arguments, and evidence tailored for AI systems. Each template is illustrated by end-to-end patterns that address distinctive challenges, such as evaluation without ground truth, dynamic model updates, and threshold-based risk decisions. The result is a systematic, composable, and reusable approach to constructing and maintaining safety cases that are credible, auditable, and adaptive to the evolving behaviour of generative and frontier AI systems.
Keywords: AI · AI Safety Case · AI Safety Case Taxonomy · AI Safety Case Template · AI Safety Case Pattern

1 Introduction

Safety cases have become a central assurance artefact across safety-critical engineering: structured arguments supported by evidence, intended to demonstrate that a system is acceptably safe for a specific purpose and operating context [1,2]. However, artificial intelligence (AI) systems, including frontier, generative, and agentic AI systems, fundamentally differ from the deterministic, specification-driven systems for which safety cases were originally designed. Their capabilities are not engineered line by line but discovered after training. Their risks evolve through interactions, fine-tuning, and context. In addition, their evaluation often proceeds without a fixed ground truth. Consequently, the logic of assurance for AI should be discovery-driven, continually updated, and capable of integrating empirical, statistical, and abductive reasoning alongside classical deductive forms.

The value of a safety case lies not only in regulatory compliance or as a governance tool, but also in being an epistemic discipline that stabilises ethical reasoning under uncertainty. Discussions of AI safety and ethics often operate largely in a pre-harm, anticipatory mode [3], where severity is inferred rather than observed. While this is necessary, such analysis is often conducted with limited empirical grounding, making it difficult to specify risk severity, compare it with existing systems, or define acceptable risk thresholds. The safety-case approach is well-suited to addressing these challenges. By grounding ethical and safety concerns in explicit claims, evidence, and well-defined comparators or baselines, AI safety cases support consistent risk assessment, review, and justification across different systems and contexts.
arXiv:2601.22773v3 [cs.SE] 6 Mar 2026 · Safety Cases for AI Systems

While early work has begun to adapt safety-case thinking to frontier AI [4,5,6], the field still lacks a coherent, reusable structure that practitioners can apply across model types, lifecycle stages, and regulatory regimes. Existing examples tend to focus on isolated instances such as dynamic updating, misuse safeguards, or autonomous-system templates, without an overarching synthesis that ties together claim formulation, argument structure, and evidence design. Moreover, current practices rarely capture the iterative process by which AI developers discover new model capabilities and failure modes through evaluation, stress testing, and adversarial probing [7]. These discoveries continually reshape the scope of assurance itself. This study develops a comprehensive taxonomy of AI safety cases and reusable safety-case templates for AI systems, addressing the limitations of existing assurance approaches. The key contributions are fourfold.

•Characterisation of modern AI safety cases. The study first analyses how safety cases are currently constructed for AI systems, contrasting them with classical engineering approaches and identifying distinctive features, including capability discovery, the absence of ground truth, continuous evolution, and threshold-based decision-making.

•Taxonomies for reusable safety-case design. It introduces AI-specific taxonomies for Claim types (assertion-based, constraint-based, and capability-based claims), Argument types (demonstrative, comparative, causal/explanatory, risk-based, and normative arguments), and Evidence families (empirical, mechanistic, comparative, expert-driven, formal methods, model-based, and operational/field data). Each is linked through explicit reasoning logics (deductive, inductive, abductive, statistical, and analogical).
The taxonomies introduced in this study are descriptive and compositional, not mutually exclusive classification schemes. Claim types, argument types, and evidence families are intended as orthogonal lenses on safety reasoning, rather than disjoint buckets into which safety cases must be partitioned. In real AI safety cases, overlap across categories is expected and legitimate.

•Template and pattern library. It proposes a library of reusable templates for constructing AI safety cases, illustrated through end-to-end patterns that address unique AI challenges such as safety justification through discovery-driven evaluation, marginal-risk reasoning without ground truth, continuous evolution for dynamic update and redeployment management, and threshold-based risk acceptance.

•Integration with dynamic assurance. Building on emerging ideas such as Dynamic Safety-Case Management Systems (DSCMS) and Checkable Safety Arguments [5], the study embeds these templates within a continuous evaluation pipeline, linking safety claims to live metrics and governance artefacts.

Existing AI safety case approaches typically provide domain-specific instantiations (e.g., cyber inability templates), governance-oriented guidance for frontier AI, or concepts for dynamic safety management. However, they focus on specific assurance elements and do not jointly provide an integrated CAE taxonomy spanning claims, arguments, and evidence, a reusable template meta-model, and a composable pattern library that can be instantiated across recurring AI assurance scenarios. The contributions of this study establish a composable, auditable, and reusable foundation for constructing safety cases that remain credible under the uncertainty, discovery, and rapid change inherent to modern and emerging AI systems. A detailed comparison with representative approaches is provided in Table 1 (Section 2).
To illustrate the practical implications of this study, we examine a real-world case study involving an AI-based tender evaluation system in government. This case study demonstrates the practical application of the proposed approach, highlighting its potential to enhance governance and assurance processes. It also provides concrete insights that inform decision-making and guide the development of effective, responsible AI solutions.

The remainder of the paper proceeds as follows. Section 2 provides background on AI safety cases, including an in-depth analysis of how they diverge from traditional engineering examples, highlighting the distinctive challenges of discovery-driven evaluation and evolving system boundaries. We then present the methodology of this study in Section 3. Section 4 introduces the reusable safety-case templates and details the proposed taxonomies of claim, argument, and evidence (CAE) types. Section 5 presents end-to-end patterns that illustrate how these templates address characteristic AI assurance problems, such as marginal risk in the absence of ground truth and continuous model evolution. Section 6 provides a case study to show how the study can be applied in a real-world context. In Section 7, we discuss current limitations of this study and future work. We then conclude this study in Section 8.

[Figure 1: Claims, Arguments, Evidence (CAE) example. A top-level claim C1 ("The AI system is acceptably safe for deployment") is supported via argument A1 ("This is supported by evidence showing all major risks are mitigated") by sub-claims C1.1 ("Data quality risks are controlled", with evidence E1.1.1 data validation reports and E1.1.2 bias audit results) and C1.2 ("Model robustness is verified", with evidence E1.2.1 adversarial test results and E1.2.2 MLOps monitoring dashboard logs).]

2 Background

In this section, we introduce core AI safety case concepts, including the Claims–Arguments–Evidence (CAE) approach, taxonomy, template, and pattern, present an overview of the AI safety ecosystem pipeline involving different stakeholders, and examine the transition from traditional engineering safety cases to AI-specific approaches.

2.1 AI safety case concepts

Safety Case. A safety case is a documented, structured argument asserting that a system is acceptably safe for a given context of operation, supported by evidence [8,9]. For AI systems, especially high-capability or frontier AI, the safety case aims to show that the system's deployment or operation does not pose unacceptable risk.

Claims–Arguments–Evidence (CAE). Recent works have developed safety cases using the CAE structure [10,11]. For example, Goemans et al. [11] propose a safety-case template for cyber inability that explicitly uses CAE to make safety arguments coherent and explicit. The CAE structure (see Figure 1) [12] emphasizes:

•Claim: A claim is a true/false statement about a property of a particular object. It matches the common usage of the term: an idea that someone is trying to convince someone else is true, i.e., what you are asserting (e.g., "the model cannot perform X harmful capability").

•Argument: An argument is a rule that provides the bridge between what we know or are assuming (sub-claims, evidence) and the claim we are investigating. In other words, it is the justification linking a specific claim to specific evidence [13,14]: the reasoning or structure showing why the claim holds (e.g., decomposition into sub-claims, use of proxies, evaluations).
•Evidence: Evidence is an artefact that establishes facts that can be trusted and lead directly to a claim, e.g., data, test results, evaluations, or simulation results that support the argument.

[Figure 2: AI safety case template, CAE taxonomy, and pattern. The figure relates three layers. The taxonomy classifies what types exist: claim types (assertion-based, constraint-based, capability-based), argument types (demonstrative, comparative, risk-based, causal/explanatory, capability-oriented, normative/conformance), and evidence types (empirical, comparative, model-based, expert-derived, formal methods, operational and field data, mechanistic). Templates specify how to structure a safety case: claim structure (the "what", the proposition to be proven), argument structure (the "why", the justification for accepting the claim based on the evidence), and evidence structure (the "how", the data/facts that substantiate the arguments), with CAE templates selected, linked, and stitched together. Patterns are problem-specific CAE compositions (discovery-driven, marginal-risk, continuous-evolution, threshold-comparator); each pattern contains a problem, applicable claim types, recommended argument functions and reasoning logics, evidence families, and an example.]

In this study, the CAE taxonomy provides a unified classification of claim, argument, and evidence types, yet classification alone does not show how these elements should be assembled into a working safety case.
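The CAE elements described above can be sketched as a minimal data model. This is an illustrative sketch, not the paper's notation: the class and field names are our own, and the example instance mirrors the decomposition shown in Figure 1.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """An artefact establishing trusted facts (test results, audits, logs)."""
    description: str

@dataclass
class Claim:
    """A true/false statement about a property of the system."""
    statement: str
    # The argument is the justification linking this claim to its
    # sub-claims and evidence (e.g., decomposition, proxies, evaluations).
    argument: str = ""
    sub_claims: list["Claim"] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)

def leaf_evidence(claim: Claim) -> list[Evidence]:
    """Collect all evidence reachable from a top-level claim."""
    out = list(claim.evidence)
    for sub in claim.sub_claims:
        out.extend(leaf_evidence(sub))
    return out

# Mirroring Figure 1: C1 decomposed via A1 into C1.1 and C1.2.
c1 = Claim(
    "The AI system is acceptably safe for deployment.",
    argument="All major risks are mitigated (A1).",
    sub_claims=[
        Claim("Data quality risks are controlled.",
              evidence=[Evidence("Data validation reports"),
                        Evidence("Bias audit results")]),
        Claim("Model robustness is verified.",
              evidence=[Evidence("Adversarial test results"),
                        Evidence("MLOps monitoring dashboard logs")]),
    ],
)
print(len(leaf_evidence(c1)))  # 4
```

A builder tool could maintain such a tree per system version, regenerating the evidence leaves as new evaluation results arrive.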
Templates address this gap by offering structured blueprints that operationalise the taxonomy, and patterns further extend these templates by composing them into reusable solutions for recurring AI-specific risks. Figure 2 illustrates the relationships among taxonomy, template, and pattern in AI safety case construction.

Safety Case Taxonomy. A taxonomy denotes a categorical classification scheme for safety case elements, organizing the different types of claims, argument structures, and evidence categories that may appear in an AI safety case. Research on safety-case taxonomies spans foundational argument structures, frequent evidence types, and emerging frontier-AI–specific argument types. Kelly [1] classifies reusable argument structures into top-down, bottom-up, and general construction patterns, introducing formal notions of claim decomposition and argument strategies. Nair et al. [15] provide a taxonomy of evidence types for safety certification, including Hazards Cause Specification and Testing Results. Buhl et al. [4] identify four core safety-case components and introduce frontier-AI specific argument types, such as directly risk-based arguments and inability arguments, reflecting the shift from traditional failure-based reasoning to arguments about model behaviour and emergent capabilities.

However, existing taxonomies [16,17,18] face several limitations that hinder their application to contemporary AI systems. First, the absence of standardized taxonomies leads to inconsistent terminology and incomparable safety claims across organizations; what one team calls a "robustness claim" may differ substantially from another's interpretation. Second, existing taxonomies focus on isolated aspects (patterns, argument, evidence types, or high-level components) rather than providing an integrated view of the complete CAE structure.
Third, prior work has not adequately addressed AI-specific challenges, including the emergence of capabilities, continuous evolution, and context-dependent behaviours that distinguish AI systems from traditional safety-critical systems. We introduce our taxonomy to address critical gaps in current AI safety practices. Building on a review of existing work, our CAE taxonomy provides a systematic basis for analyzing the primary AI safety case papers (Appendix B) we studied, enabling us to identify which claim types, argument structures, and evidence categories are most prevalent in current practice.

Safety Case Templates. A template denotes a reusable structured blueprint for constructing a concrete safety case in a particular context or domain. While the taxonomy classifies element types, a template specifies how selected instances of these elements are arranged and how multiple CAE chains are linked or stitched to form a coherent assurance argument. Previous work on safety case templates originates from safety-critical domains such as autonomous vehicles. One influential work is the comprehensive set of templates proposed by Bloomfield et al. [10], which outlines CAE-based templates for autonomous systems, including requirements templates, hazard-analysis templates, safety-monitor templates, and ML-sensor templates. Recently, template-based approaches have extended into frontier AI. Goemans et al. [11] have developed a cyber-inability safety case template that integrates model capability evaluation as a first-class component, a methodological shift required by the opacity of modern AI systems, where behavioural evidence must be empirically derived rather than analytically proven. However, existing template work [10,11,19] has limited coverage of diverse claim, argument, and evidence categories.
We present templates for each CAE element: Claim is structured from top-level claims to sub-claims, each of which specifies the key objectives that must be addressed to demonstrate an AI system's safety in a specific operational context. Argument provides diverse breakdown types that capture the justification and reasoning for accepting the claim given the evidence. Evidence presents the data and facts that substantiate the arguments, with diverse evidence types and quality criteria to ensure robustness, traceability, and sufficiency of the safety case. Our templates provide a structured lens for analyzing existing AI safety cases and guide practitioners in systematically constructing safety cases.

Safety Case Patterns. A pattern denotes a reusable argument fragment or reasoning structure that captures a proven approach to addressing specific safety problems. Unlike templates, which provide complete structural blueprints, patterns are modular components that can be instantiated and combined across multiple contexts, embodying recurring reasoning strategies such as red-team/blue-team evaluation or defense-in-depth arguments. Research on safety case patterns has evolved from traditional systems to AI applications. Kelly et al. [20] and Alexander et al. [21] introduce early pattern-based methods for structuring safety arguments and supporting cross-project reuse. With the rise of ML, Wozniak et al. [22] propose patterns covering data quality, model training, and runtime monitoring, while Kaur et al. [23] extend these to deep learning with robustness testing and neural-network evaluation. Porter et al. [24] propose principles-based patterns that frame ethical principles (fairness, transparency) as top-level claims, demonstrating broader assurance objectives beyond technical safety.
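To make the template/pattern distinction concrete, a pattern can be pictured as a problem-specific record over taxonomy categories. The field names below paraphrase the paper's description of what a pattern contains (problem, applicable claim types, recommended argument types and reasoning logics, evidence families); the Python encoding and the instantiation helper are hypothetical, not the paper's meta-model.

```python
# Illustrative encoding of one of the paper's four patterns ("marginal-risk":
# no ground truth, so safety is argued by comparison against a baseline).
# Category names come from the taxonomies in this study; the dict layout
# and helper below are our own sketch.
marginal_risk_pattern = {
    "name": "Marginal-risk",
    "problem": "No ground truth; safety must be argued by relative "
               "comparison against a baseline.",
    "claim_types": ["assertion-based", "constraint-based"],
    "argument_types": ["comparative", "risk-based"],
    "reasoning_logics": ["inductive", "statistical", "analogical"],
    "evidence_families": ["empirical", "comparative", "expert-driven"],
}

def instantiate(pattern: dict, system: str, baseline: str) -> str:
    """Turn a pattern into a concrete top-level claim (illustrative only)."""
    return (f"{system} poses no greater risk than {baseline} "
            f"({pattern['name']} pattern, {pattern['argument_types'][0]} argument)")

print(instantiate(marginal_risk_pattern, "an LLM triage assistant", "human triage"))
```

The same record shape could encode the discovery-driven, continuous-evolution, and threshold-comparator patterns, which is what makes patterns composable across safety cases.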
We have found that existing patterns [20,21,22,23] remain focused on specific traditional software or ML system challenges, without coverage of frontier AI risks such as emerging capabilities, the absence of ground truth, and continuously evolving behaviours, and without explicit connections to the broader CAE framework. We introduce four structured AI safety case patterns (Section 5) to address emerging AI risk challenges.

2.2 Overview of AI safety cases ecosystem pipeline

To situate AI safety cases within real operational contexts, Figure 3 presents an overview of the pipeline that supports how AI safety cases are created, assured, and governed by different stakeholders in practice. In real deployments, safety cases are not static documents produced by developers alone; they form a shared artefact that connects multiple organisations with distinct responsibilities, including AI system developers and owners, government regulators, and independent oversight bodies. Each of these actors relies on the safety case for different purposes: developers to demonstrate risk-informed design and testing, regulators to verify compliance and defensibility before approval, and audit organisations to maintain long-term accountability across versions. Figure 3 therefore maps these stakeholders' needs onto a coherent pipeline centred on the evolving AI safety case, which is continuously updated with inputs from risk assessments, testing, evaluation, and operational monitoring [25,26]. Three coordinated components, the AI Safety Case Builder, AI Safety Case Validator, and AI Safety Case Registry, support this lifecycle by generating initial safety cases, verifying their completeness and compliance, and preserving validated versions for ongoing oversight.

•AI Safety Case Builder/plugins: help AI developers and system owners construct structured, versioned safety cases across the AI lifecycle.
At design time, the builder imports outputs from risk assessments [4] to establish the structure of claims and required evidence, and later integrates results from testing, evaluation, and monitoring. Each design or testing iteration automatically updates the safety case, maintaining traceability between identified risks, implemented controls, and empirical results that demonstrate AI system safety and reliability.

•AI Safety Case Validator: used by government assurance bodies or accredited third parties to examine the integrity, completeness, and consistency of submitted safety cases. The Validator could check alignment with standards such as the Australian AI Safety Standard [27] or ISO 42001 [28], verify the authenticity of evidence, and assess whether controls remain effective over time. It ensures that the safety case is auditable, transparent, and defensible before approval or deployment in critical contexts.

[Figure 3: Overview of AI safety cases ecosystem pipeline. AI System Developers/Owners generate initial safety cases with the AI Safety Case Builder (constructing structured CAE safety cases, importing risk, testing, and monitoring data, and maintaining traceability across the lifecycle); Regulators verify them with the AI Safety Case Validator (checking completeness, consistency, and standards compliance, e.g., ISO 42001, to ensure defensibility before approval); Auditors/Oversight Orgs govern registered safety cases through the AI Safety Case Registry (a trusted repository for verified safety cases, enabling comparison, version tracking, and continuous audit readiness). AI risk assessment and evaluation/monitoring outputs are imported into the evolving safety case, with feedback and updates flowing back.]

•AI Safety Case Registry: acts as a trusted repository for storing and sharing verified safety cases across agencies and sectors.
It would enable oversight organisations [29] to compare systems, trace changes, and support continuous audit readiness. The Builder, Validator, and Registry would form a pipeline for generating, verifying, and governing AI safety cases, making AI assurance a structured, evidence-based, and collaborative process among different stakeholders, e.g., developers, regulators, and auditors.

2.3 From traditional engineering to AI safety cases

Having introduced the core CAE concepts and the broader safety-case ecosystem, we now examine why traditional engineering approaches to safety assurance cannot be directly applied to AI systems. This contextualises the need for the AI-specific taxonomies, templates, and patterns developed in this study.

The foundations of classical safety cases. In traditional engineering domains such as aerospace, rail, and nuclear power, safety cases serve as structured justifications that a system is acceptably safe for its intended use [1,2]. The underlying logic is relatively stable: systems are designed to precise specifications, hazards can be systematically identified through techniques such as Fault Tree Analysis or Failure Mode and Effects Analysis, and risk is managed by design-time redundancy, physical containment, and procedural barriers. In these cases, the safety argument typically follows a deductive and hierarchical form: top-level claims (e.g., "The aircraft control system is acceptably safe") are decomposed into subclaims about components, processes, and operating conditions, supported by deterministic evidence such as design verification, test coverage results, and formal proofs. These cases depend on predictability: the same inputs produce the same outputs, and the behaviour of the components is well understood. Assurance rests on design sufficiency (the system behaves as specified) and evidence completeness (all hazards have been analysed).
Once the system is validated against fixed requirements, safety justification largely stabilises until the next design change or certification cycle.

Why AI systems break this assurance model. AI systems, particularly generative and frontier AI models, disrupt nearly all these premises. They are not engineered line-by-line to a full behavioural specification, but are trained on massive, uncurated data using a low-level objective, such as next-token prediction. Their performance and risks are emergent properties of both data and training scale. Consequently, their capabilities are discovered after training rather than defined before it [7]. While traditional safety cases already address probabilistic risk models, socio-technical systems, human-in-the-loop control, partial observability, and operational drift, AI systems introduce distinctive characteristics that render traditional safety-case logic insufficient on their own:

•Post-deployment capability discovery. The internal mechanisms of large neural networks are not interpretable in causal or mechanistic terms. Model behaviour can change qualitatively with scale, prompting configuration, or fine-tuning. Safety arguments that rely on deterministic traceability or component-level understanding cannot be fully applied. Unexpected behaviours, such as prompt injection or role-play exploitation, emerge only after deployment-like testing [4].

•Evaluation without stable ground truth. Many AI systems are evaluated against existing human or automated systems that were never formally certified. Hence, risk arguments become marginal: we infer safety by comparison ("no worse than baseline") or by proxy metrics such as expert consensus, simulated games, or risk scores [30]. This weakens the deductive foundation of safety arguments and shifts the reasoning mode toward induction and statistical inference.

•Model updates outside system owner control.
AI applications often depend on proprietary large language models maintained by third parties; model updates may occur outside the system owner's direct control, thereby undermining the assumption that the assured configuration can be frozen post-certification. Because these models change or update without notice, the nature and distribution of risks also evolve in ways users cannot anticipate or verify. Unlike aircraft software, which is frozen after certification, AI systems evolve continuously through fine-tuning, retraining, data refresh, model combination, or expanded tool access. These changes invalidate static evidence. Assurance must therefore be dynamic, maintaining traceability across versions and regenerating claims and evidence as the system evolves [5].

•Threshold-based acceptability. Regulators and developers increasingly rely on quantitative thresholds on compute use, model capability, or safety score as proxies for acceptable safety [31]. In AI systems, however, these thresholds are often incomplete, overlapping, or context-dependent. Safety cases must accommodate multiple, sometimes inconsistent, decision criteria rather than a single binary pass/fail condition.

•Capability emergence decoupled from specification. AI capabilities can emerge with scale, fine-tuning, or system-level scaffolding in ways that are not explicitly specified in requirements. As a result, safety-relevant behaviours may not be fully traceable from the intended functionality to the design artefacts, thereby weakening the assumption that assurance can be achieved through specification-driven decomposition alone.

•Tight coupling between model behaviour and context engineering. Model behaviour is strongly shaped by context engineering choices [32], including prompting strategies, retrieval augmentation, tool interfaces, and user workflows.
This means that safety properties cannot be assessed independently of the surrounding system and its operational context, and assurance must account for how changes in context can materially alter risks.

AI systems are not unique because they are uncertain, but because the locus and timing of uncertainty are shifted, and because assurance must operate under continual epistemic incompleteness, not just probabilistic risk. This framing avoids straw-manning traditional systems while sharpening the real distinction. Together, these properties mean that AI safety cases cannot rely on design-time determinism; instead, they must embrace evaluation-time discovery and run-time adaptation. Assurance becomes a process of continuous learning rather than static certification.

Emerging practices in AI safety cases. Several efforts illustrate how the safety-case paradigm is beginning to evolve for AI. Buhl et al. [4] propose a structure that integrates traditional CAE patterns with AI-specific concepts, including model evaluation, interpretability analysis, and governance oversight. Their "frontier AI safety case" uses layered argumentation across technical, organisational, and societal domains. Cârlan et al. [5] extend this thinking into dynamic safety cases, linking claims to live metrics through "Checkable Safety Arguments" and "Safety Performance Indicators" (SPIs). The key insight is that evidence must be continually refreshed as model behaviour and deployment contexts evolve. Similarly, Clymer et al. [6] demonstrate how a safety case can be constructed for misuse safeguards in large language models, combining threat models, red-teaming, and organisational response plans as evidence. These recent works suggest three broad shifts already underway:

• From design justification to behavioural justification: Safety is demonstrated through what the system does, not only through what it was designed to do.
• From completeness to adaptability: The argument's credibility depends on its capacity to evolve and revalidate itself, not on being exhaustive at a single point in time.

• From component-level reasoning to ecosystem reasoning: AI assurance must consider interactions between the model, its tools, and governance.

Implications for reusable AI safety-case templates. The divergence between classical and AI safety cases motivates the need for reusable, AI-specific templates that encode appropriate logic, structure, and expectations for evidence. Such templates should enable practitioners to:

• formulate conditional claims that explicitly capture context, control boundaries, and capability assumptions;

• structure arguments that combine deductive reasoning (for architecture and controls) with inductive and abductive reasoning (for evaluation and mitigation);

• incorporate empirical, model-based, and governance evidence in coherent patterns that can be updated dynamically.

2.4 Positioning this study relative to existing AI safety case approaches

Prior research on AI safety cases spans multiple directions, including governance-oriented proposals, domain-specific templates, dynamic management mechanisms, and worked case sketches. However, these approaches differ in scope, level of abstraction, and reusability. To clarify the distinct contribution of this study, we compare it against representative approaches across these categories; Table 1 positions our contribution relative to them. Our contribution unifies these strands into an integrated, structured foundation for designing AI safety cases, comprising a CAE taxonomy, reusable template meta-models, and a composable pattern library that can be instantiated and systematically maintained throughout the AI lifecycle.

Table 1: Positioning of this study relative to representative AI safety-case approaches (illustrative, not exhaustive).
Safety cases for frontier AI [4, 33]
  Primary output: Motivates safety cases for frontier AI governance; outlines the practicalities of producing a frontier AI safety case.
  Scope and reusability: Primarily high-level guidance for producing and using safety cases; does not provide explicit, practical-level solutions.
  What this study adds: An integrated CAE taxonomy, reusable templates, and a pattern library that operationalise how to construct CAE chains across recurring AI assurance problems.

Cyber inability safety-case template [11]
  Primary output: A concrete CAE template arguing that a model lacks unacceptable offensive cyber capability; decomposes claims into sub-claims supported by evaluation evidence.
  Scope and reusability: A single risk-focused template (cyber inability); provides a strong "worked CAE template" but narrow coverage across assurance problem types.
  What this study adds: Generalises beyond a single template: a comprehensive taxonomy of AI claim/argument/evidence types and a pattern library spanning multiple assurance scenarios, including cyber, control, evolution, and thresholds.

Dynamic safety cases for frontier AI [34, 35, 36]
  Primary output: A Dynamic Safety Case Management System (DSCMS) concept for systematic, semi-automated safety-case revision over time; links claims to system state.
  Scope and reusability: Emphasises the lifecycle/update mechanism; demonstrated on a cyber-related template; less emphasis on a broad CAE taxonomy and a reusable pattern design space.
  What this study adds: Extends dynamic safety-case management with a structured CAE design space (taxonomy, reusable templates, and patterns), clarifying which claim/argument/evidence types evolve and what evidence triggers updates.

Sketch of an AI control safety case [14, 16]
  Primary output: A worked "sketch" of a control safety case for LLM agents; highlights key claims and evidence (e.g., control evaluations).
  Scope and reusability: A case-study-style sketch (control measures); illustrates claim/evidence dependencies but is not a general template library.
  What this study adds: Elevates such sketches into reusable patterns and templates grounded in a compositional CAE taxonomy, with explicit reasoning logics and evidence families.

This study
  Primary output: An integrated, structured CAE-based foundation for designing AI safety cases, comprising a taxonomy, reusable template meta-models, and a composable pattern library; guidance for maintaining auditability under AI evolution.
  Scope and reusability: Designed for composability across AI system types and assurance contexts; supports reuse, traceability, and update triggers.
  What this study adds: Unifies and systematises prior strands by providing (i) an integrated CAE taxonomy, (ii) reusable template meta-models, (iii) composable AI-specific patterns, and (iv) lifecycle-aware update mechanisms grounded in explicit reasoning logics.

3 Methodology

This study follows a Systematic Literature Review (SLR) approach based on established guidelines [37]. Figure 4 presents the high-level workflow of this research. We first developed an SLR protocol that defined the research scope, search strategy, study selection criteria, and data analysis methods. In the following subsections, we describe each step in detail.
Figure 4: Architectural block diagram for this research paper. The workflow proceeds from defining the research scope and developing the review protocol, through keyword-based auto search (IEEE, ACM, Springer, ScienceDirect; 1,235 papers), snowballing from 9 seed papers (807 papers), and backward snowballing via Google Scholar (46 papers selected from 1,070 identified), to study filtering, cross-checking, and merging (112 papers), followed by data extraction, synthesis, and analysis addressing RQ1 (CAE taxonomy) and RQ2 (reusable safety-case patterns and case study).

3.1 Research Questions

Population, Interventions, Comparison, Outcomes, and Context (PICOC) can properly direct the formulation of research questions for a review study [38, 39]. The PICOC for this study is shown in Table 2, following established guidelines [40, 37]. In this research, the Population is the body of literature on AI safety cases, including CAE structures, safety-case notations, and closely related assurance-case approaches. The Interventions of interest are the methods, processes, templates, tools, and taxonomies used to construct, analyse, or operationalise AI safety cases. We consider an Outcome successful if it contributes to reusable safety-case solutions, such as a CAE taxonomy and an associated pattern library.
The relevant Context is limited to works that explicitly connect AI systems with safety-case development; this excludes generic AI ethics or governance works that lack a clear safety-case linkage. Our objective in this study is to analyse how safety cases are currently constructed for AI systems and to synthesise reusable CAE structures that support requirement-aligned, auditable safety cases. Within this scope, we formulated the following key research questions:

RQ1: What taxonomy of claim types, argument strategies, and evidence categories is suitable for AI safety cases? This research question aims to define classification schemes and templates for the kinds of claims (e.g., absolute vs. marginal safety claims), types of arguments (e.g., layered, barrier, or comparative arguments), and types of evidence (e.g., testing results, audits, formal proofs) that commonly occur in AI system assurance.

RQ2: Can we identify reusable safety-case templates and patterns for AI and demonstrate their use in a case study? Our second RQ aims to derive generic CAE templates and patterns for common AI safety concerns and to validate them by applying the templates and a selected pattern in an illustrative case study of an AI system.

3.2 Search strategy

We developed a strategy to search for papers on AI safety and claims, arguments, and evidence in order to answer the RQs defined in Section 3.1. The goal was to find as many primary studies as possible. Our strategy consisted of four parts: search-string identification, automatic search in electronic databases, snowballing from seed papers, and backward snowballing using Google Scholar. With the assistance of the PICOC approach (Table 2), our search terms were divided into three primary concepts, as shown in Table 3. These concepts helped us formulate a well-defined search string.
Table 2: PICOC for this SLR

Population: The literature on AI safety cases, assurance cases, claim-argument-evidence, and goal structuring notation.
Intervention: Methods, processes, templates, tools, and taxonomies relevant to AI safety cases, including claim-argument-evidence-centric approaches and frameworks.
Comparison: Exploratory synthesis across interventions.
Outcomes: CAE taxonomy, reusable pattern library, case study, application, evaluation criteria.
Context: Include works that explicitly link AI safety cases and CAE. Exclude general AI ethics and governance without safety-case linkage; security, privacy, or access-control topics not framed as AI or safety assurance; generic risk-management checklists; non-AI safety cases unless patterns are clearly transferable to AI; non-English sources; and opinion pieces without evidence or method.

Table 3: Concepts and search terms

Concept 1 (Co1), AI Safety. Supportive search terms: safety case, assurance case, reliability, trustworthiness, acceptable risk, risk management, robustness, transparency, verification, validation, compliance, etc.
Concept 2 (Co2), CAE and Structured Argumentation. Supportive search terms: claim, argument, evidence, argument pattern, goal structuring notation (GSN), pattern/module, assurance pattern, template, etc.
Concept 3 (Co3), AI and related domains. Supportive search terms: autonomous system, intelligent system, software, agents/multi-agent system.

We used alternative spellings, abbreviations, and synonyms of the search terms to increase the number of relevant papers retrieved, employing truncation and wildcard operators to capture these alternatives. Additional key terms or phrases identified during search iterations were added to the supportive search-term list; we assumed this would capture all relevant articles. When constructing the final search query, the identified keywords, their alternatives, and related terms were linked with Boolean AND and OR operators.
The OR operator was used to join synonyms, and the AND operator to join the major concepts. A generic version of the search string we used is as follows:

Generic Search String: ('safety' OR 'safety case' OR 'reliability' OR 'trust*' OR 'AI Safety') AND ('claim-argument-evidence' OR 'CAE' OR 'structured argument*' OR 'safety justification' OR 'evidence-based safety' OR 'safety argument*' OR 'goal structuring notation' OR 'GSN') AND ('artificial intelligence' OR AI OR 'machine learning' OR 'autonomous system*' OR 'intelligent system*' OR 'software')

3.3 Study collection and filtering

Our study filtering process is summarised in Figure 5. We first ran the formatted search query on four major digital libraries, which returned 1,235 research papers. In parallel, we selected nine seed papers on AI safety and conducted snowballing, yielding 807 papers. The rationale for selecting these seed papers is discussed in Appendix A. We then removed 264 papers (205 from the database search results and 59 from the snowballing results of the nine seed papers) because they were duplicates, editorials, keynotes, or papers with no listed authors. After reading the title, abstract, and conclusion, and, where necessary, skimming the introduction, methodology, and results sections, we applied the Exclusion Criteria defined in Table 4; this step removed 1,588 papers in total. In the third filtering step, we applied the Inclusion Criteria (ICs) shown in Table 4 and removed a further 113 papers that did not meet them. We then carried out backward snowballing in Google Scholar using the 41 papers selected from the database search results. This yielded a further 1,070 papers. We followed the same process for this new set and selected an additional 46 papers that satisfied all ICs and met none of the ECs.
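Before continuing, the Boolean composition of the search string above can be sketched programmatically. This is a minimal sketch: the term lists are abbreviated from Table 3, the quoting convention is simplified, and `or_group`/`build_query` are our own illustrative helpers, not tooling described in the paper (actual query syntax varies by digital library).

```python
# Sketch: composing the Boolean search string from the three concepts.
# Abbreviated term lists; quoting convention simplified for illustration.

def or_group(terms):
    """Join synonyms of one concept with OR, quoting multi-word terms."""
    quoted = [f"'{t}'" if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def build_query(concepts):
    """Join each concept's OR-group with AND."""
    return " AND ".join(or_group(c) for c in concepts)

co1 = ["safety", "safety case", "reliability", "trust*", "AI Safety"]
co2 = ["claim-argument-evidence", "CAE", "structured argument*",
      "safety argument*", "goal structuring notation", "GSN"]
co3 = ["artificial intelligence", "AI", "machine learning",
      "autonomous system*", "intelligent system*", "software"]

query = build_query([co1, co2, co3])
print(query)
```

In practice, each library's advanced-search interface required adapting this generic string to its own field codes and wildcard rules.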
At each step, at least two authors independently cross-checked the selection decisions and resolved any disagreements through discussion. Finally, after quality assessment (discussed in Section 3.4), we cross-checked and merged the initially selected sets, resulting in 112 primary studies for data extraction, synthesis, and analysis. These 112 articles are summarised in Appendix B. (We collected the paper list in August 2025; therefore, studies published after this date are not included in this review.) Among the 112 selected studies, several (e.g., [S27], [S43], [S58], [S70]) do not explicitly target AI systems but were retained because they articulate foundational principles of structured, goal-based safety argumentation. Although not recent, these works continue to underpin contemporary assurance practices through their treatment of argument decomposition and evidence traceability. Moreover, their methodological contributions are domain-independent and directly inform the CAE-based taxonomy and reusable templates developed in this paper.
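As a cross-check, the selection counts reported in this section can be reproduced arithmetically. All figures below are taken from the text; the variable names are our own bookkeeping, not part of the review protocol.

```python
# Cross-check of the study-selection counts reported in Sections 3.3 and 3.4.
auto_search   = 1235  # database auto search (IEEE, ACM, SpringerLink, ScienceDirect)
snowballing   = 807   # snowballing from 9 seed papers
dedup_removed = 264   # duplicates, editorials, keynotes, no-author papers (205 + 59)
ec_removed    = 1588  # removed by the exclusion criteria (Table 4)
ic_removed    = 113   # removed by the inclusion criteria (Table 4)

first_pass = auto_search + snowballing - dedup_removed - ec_removed - ic_removed

backward_selected = 46  # selected from the 1,070 backward-snowballing papers
quality_removed   = 11  # average QC score below 2.0 (Section 3.4)

final = first_pass + backward_selected - quality_removed
print(first_pass, final)  # 77 and 112 primary studies
```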
Figure 5: Primary study selection process steps. The database auto search returned 1,235 papers and snowballing from 9 seed papers returned 807; after removing duplicates, keynotes, editorials, and no-author papers and applying the exclusion and inclusion criteria, 41 auto-search papers remained. Backward snowballing identified 1,070 further papers, from which 46 were selected. Cross-checking, merging, and quality assessment (average QC ≥ 2) produced the final 112 papers for data extraction, synthesis, and analysis.

3.4 Quality Assessment

We used a 1-to-5 numeric score (Very Poor, Inadequate, Moderate, Good, and Excellent) for Quality Checking (QC), applied to each study using the following eleven questions (QC1 to QC11). The questions focus on the relevance of the work to AI safety cases and to structured arguments, such as the claim-argument-evidence structure.

QC1: Is the study highly relevant to the research objectives and to the concepts of AI safety cases and CAE?
QC2: Does the study clearly explain the methodology that accomplishes its goals?
QC3: Does the study provide sufficient information on the data collection, prototyping, and/or algorithms used?
QC4: Does the study present safety cases or a closely related structure for AI systems?
QC5: How well does the study detail how it validated or evaluated its results?
QC6: Are clear outcomes and a corresponding results analysis reported?
QC7: Are the study limitations and possible future work adequately described?
QC8: What is the quality of the venue in which the study was published?
QC9: Are the implications and significance of the research findings discussed in the study?
QC10: Are the study's claims, presentation, and findings clear and understandable?
QC11: To what extent is the work presented in the study practically usable?

Table 4: Inclusion and exclusion criteria

Exclusion criteria:
EC1: Papers about AI that do not treat safety or claim-argument-evidence as a key point of concern.
EC2: Papers discussing AI or machine learning but not safety cases or CAE.
EC3: Safety papers without relevant AI/ML content and without safety-case/structured-argument constructs within scope.
EC4: Papers with inadequate information to extract (irrelevant papers).
EC5: Grey literature, workshop articles, posters, books, work-in-progress proposals, keynotes, editorials, secondary or review studies, and vision papers with no concrete implementation.
EC6: Discussion and opinion papers, as well as surveys that do not include any solution.
EC7: Short papers (fewer than three pages) and irrelevant, low-quality studies that do not contain a considerable amount of extractable information.
EC8: Conference or workshop papers for which an extended journal version of the same paper exists.
EC9: Non-primary studies (secondary or tertiary studies).

Inclusion criteria:
IC1: Studies that propose, specify, evaluate, or apply safety cases or structured arguments for AI/ML systems, or that provide foundational structured-argument constructs used to build the CAE taxonomy and template framework.
IC2: Content that includes sufficient methodological or technical detail to extract, and an explicit linkage between AI risks or regulatory requirements and the safety-case structure.
IC3: Published or publicly available up to the review's search date (August 2025).
IC4: Full-text conference papers, journal articles, regulatory documents, and credible industry publications that relate to the three concepts defined in Table 3.
IC5: Papers written entirely in English and containing references.
IC6: Papers available in electronic format (e.g., doc, docx, pdf, HTML, ps).

Each paper received an average score across these criteria. We set a threshold: if a study's average QC score was below 2.0, we excluded it from our primary set as low quality (11 studies from the cross-checked list of 123 papers). Studies with borderline scores were discussed among the team, using the detailed extracted data to decide whether they should be retained. As a result, the final analysis is based on a curated collection of high-quality publications that collectively address our research questions, which strengthens the validity of the conclusions and the proposed templates derived from the literature. The QC scores for all selected studies are presented in Appendix C. Across the selected primary studies, QC scores ranged from 2.73 to 4.55, with a mean of 3.37 on the 1-5 scale. Of the 112 studies in the final set, 96 (≈ 86%) achieved a QC score of at least 3.0, including 22 (≈ 20%) that scored above 4.0, while 16 (≈ 14%) fell between 2.0 and 3.0 but were retained following team discussion of their relevance.

3.5 Data synthesis and analysis

We first extracted all relevant information, including the RQ-specific items related to CAE structures and any proposed safety-case templates or patterns.
Based on these data, we conducted a mixed-methods synthesis aligned with our two research questions (RQ1 and RQ2). On the quantitative side, we used simple descriptive statistics to identify patterns and trends across the selected studies on AI safety cases and structured argumentation. In addition to descriptive distributions, we conducted an explicit synthesis to characterise the state of the art in AI safety-case construction. This synthesis identifies (i) recurring strengths and weaknesses in how claims are formulated (e.g., conditional vs. unconditional safety), (ii) how arguments establish acceptability (e.g., reliance on comparators without explicit thresholds), and (iii) how evidence is operationalised (e.g., evaluation artefacts without stated quality criteria or traceability). We then map these findings to reusable template requirements, including fields, constraints, and update triggers, so that the study directly addresses the observed gaps. We also conducted a thematic synthesis of the extracted CAE-related data. We grouped similar concepts and approaches across studies to derive higher-level themes concerning (i) how prior work classifies or organises claims, arguments, and evidence for AI systems, leading to a structured CAE taxonomy and template for AI safety cases; (ii) how CAE elements are combined in practice, e.g., common argument structures or evidence bundles that can be expressed as reusable safety-case templates or patterns for AI systems; and (iii) cross-cutting observations, such as prevalent reliance on particular argument styles, dominance of certain evidence types, gaps in post-deployment or uncertainty-focused evidence, and examples of innovative or effective CAE practices. The derived themes were iteratively refined against the primary studies to ensure that the resulting CAE taxonomy, along with its associated templates and patterns, accurately reflected the underlying evidence.
3.6 Summary information

To provide an overview of the publication trends within our SLR dataset, Figure 6 summarises the temporal and typological distribution of the 112 selected studies. Figure 6(a) shows the number of selected papers by publication year; prior to 2020, the number of relevant works remains comparatively low and stable, with only modest growth between 2021 and 2023, followed by a sharp increase from 2024 onwards. Figure 6(b) presents the distribution of article types and indicates that most studies appear as journal articles (40 of 112, 35.71%), arXiv preprints (35 of 112, 31.25%), or conference papers (31 of 112, 27.67%).

3.6.1 Overview of the selected studies

Table 5 organises the CAE taxonomy into its Claim, Argument, and Evidence categories, each further divided into subcategories. For this mapping, in addition to the 112 papers selected through the SLR process, we also incorporated seven supplementary papers identified through a targeted search. These papers were not retrieved during the automated search or snowballing phases for two reasons: first, several of them use terminology that predates contemporary CAE- or AI-specific safety-case vocabulary; second, some fall outside the scope of the chosen databases' indexing. Nevertheless, they represent influential or foundational works that provide important conceptual grounding for safety-case thinking, particularly regarding argument structures, inability-based reasoning, and safety-case reuse.

Table 5: Summary: mapping of primary studies to CAE taxonomy categories (each subcategory lists its definition, core references, and inferred mapping).

Claim
- Assertion-based: Claims that assert safety directly, either absolutely or relative to a comparator, without conditioning on specific constraints. Core references: S2, S7, S8, S10, S14, S15-S18, S20, S21, S22, S26, S33, S35, S41, S43, S45, S47, S49, S53, S59, S63, S70, S71, S74, S76, S81, S83, S88, S89, S91, S99. Inferred mapping: S1, S3, S5, S30, S31, S107.
- Constraint-based: Claims that safety holds only within defined operating boundaries, such as context, modality, data regime, or validated envelopes. Core references: S8, S10, S14, S15, S16, S17, S20, S23, S26, S28, S33, S35, S41, S43, S45, S47, S49, S53, S81, S83, S87, S88, S89, S93, S96, S112. Inferred mapping: S1, S5, S18, S24.
- Capability-based: Claims that safety is ensured because the system's abilities are restricted by design or by intrinsic model behaviours (e.g., tool-access limits, refusal policies). Core references: S10, S15, S26, S45, S47, S49, S53, S64, S81, S85, S86, S87, S89, S91, S93, S96, [41, 42, 43]. Inferred mapping: S17, S19, S28, S29, S78.

Argument
- Demonstrative: Arguments showing that layered controls, safeguards, or architectural mechanisms collectively satisfy safety criteria through deductive reasoning. Core references: S2, S3, S5, S8, S13, S16, S17, S20, S21, S23, S26, S33, S35, S36, S44-S47, S49, S58, S63, S68, S69, S70, S71, S74, S81, S87, S88, S93, S96, S99, S100, S104, S106, S107, S108, S112. Inferred mapping: S4, S10, S30.
- Comparative: Arguments inferring safety by demonstrating that the system is no worse than a baseline or comparator across relevant metrics. Core references: S8, S10, S16, S22, S33, S59, S81, S88, S89. Inferred mapping: S18, S27, S65-S67.
- Risk-based: Arguments linking safety to formal risk estimation, uncertainty quantification, or compliance with risk-management frameworks. Core references: S7, S8, S16, S21, S25, S30, S47, S63, S69, S70, S81, S83, S92, S93, S96, S99, S106. Inferred mapping: S4, S18, S19, S72, S107, S112.
- Causal and explanatory: Arguments justifying safety by identifying, explaining, and mitigating the underlying causes of failures or hazardous behaviours. Core references: S26, S28, S47, S70, S71, S74, S76, S81, S83, S87, S93, S96, S100, [44, 45, 46, 47]. Inferred mapping: S24, S25, S73, S102.
- Capability-oriented: Arguments showing that safety is ensured through system containment, enforced limitations, or inherent model inability/refusal behaviours. Core references: S8, S10, S47, S49, S64, S81, S84-S87, S89, S91. Inferred mapping: S111.
- Normative and conformance: Arguments deriving safety from adherence to recognised standards, guidelines, or organisational dependability principles. Core references: S6-S9, S25, S26, S28, S30, S69, S70, S71, S74, S76, S81, S83, S87, S92, S93, S96, S99, S100, S106. Inferred mapping: S11-S14, S16, S17, S20-S22, S32-S42, S45-S55, S57-S64, S105-S110.

Evidence
- Empirical: Behavioural evidence obtained through testing, red-teaming, user studies, or evaluation under controlled and stress conditions. Core references: S1, S2, S4, S8, S10, S13-S18, S20-S22, S26, S28, S33, S35, S41, S43, S44, S45, S47, S48, S49, S53, S55, S59, S70, S71, S74, S76, S77, S80, S81, S83, S85, S89, S93, S96, S98, S100. Inferred mapping: S27, S56, S94, S103.
- Comparative: Evidence from benchmarking or controlled comparisons showing performance against human or system baselines. Core references: S16, S28, S33, S46, S59, S64, S81, S89. Inferred mapping: S18, S27.
- Model-based: Quantitative risk-analysis outputs derived from probabilistic models, simulations, or uncertainty propagation. Core references: S1, S9, S17, S20, S33, S35, S36, S44, S47, S49, S53, S55, S75, S81, S82, S83, S87, S93, S96, S98. Inferred mapping: S100, S112.
- Expert-derived: Structured expert judgement, scenario analysis, or horizon-scanning exercises used where empirical evidence is incomplete. Core references: S8, S13, S16, S22, S23, S35, S43, S44, S47, S48, S53, S63, S79, S81, S85, S93, S96, S100. Inferred mapping: S1, S19, S97, S111.
- Formal methods: Mathematically rigorous analyses, such as model checking, theorem proving, or static analysis, demonstrating safety properties. Core references: S7, S17, S20, S44, S49, S52, S53, S71, S80, S83, S93, S96, S98, S112. Inferred mapping: S15, S19, S43, S63.
- Operational and field data: Evidence from real-world operation, including logs, drift reports, monitoring dashboards, and governance artefacts. Core references: S8, S19, S25, S46, S47, S49, S53, S70, S74, S77, S81, S83, S85, S93, S96, S99, S100. Inferred mapping: S22, S27, S28, S64, S95, S101.
- Mechanistic: Evidence explaining model behaviour through mechanistic analysis, attribution, feature tracing, or subsystem verification. Core references: S22, S63, S71, S81, S87, S89. Inferred mapping: S16, S19, S90, S111.

In Table 5, the "Core references" column records studies that explicitly instantiate a given subcategory in our analysis. For instance, studies S2, S7, and S8 state assertion-based claims as direct safety guarantees. Similarly, risk-based argumentation is represented where authors substantively engage hazards and mitigations, such as S18's barrier-based reasoning to reduce exfiltration risk. The evidence layer captures explicit support: the empirical entries reflect reported evaluation artefacts such as red-team outcomes (e.g., S18 and S27), while formal-methods evidence is reserved for studies that provide proof-style assurance (e.g., S20) as part of a safety case. This treatment aligns with our approach of defining categories through concrete, method-relevant examples.

Figure 6: The distribution of the selected articles in this study by publication year and article type. (a) Number of selected papers by year (up to 2020: 29; 2021: 4; 2022: 11; 2023: 10; 2024: 29; 2025: 29); publications show a sharp increase in 2024 and 2025. (b) Number of selected papers by type (journal articles: 40; arXiv preprints: 35; conference papers: 31; PhD theses: 4; technical reports: 2); the majority are journal, conference, and recently posted arXiv papers.

The "Inferred mapping" column extends coverage by recording studies whose content functionally aligns with a subcategory even when the authors do not label it as such. This improves completeness but also introduces a methodological trade-off: implicit assignments can blur the boundary between the observed CAE framing and the inferred alignment.
This increases susceptibility to interpretive bias unless the inference rules are explicit and consistently applied, as in the assignment of S73 to causal and explanatory arguments. The following example illustrates both the benefit and the risk:

Example. Mapping study S112 under the risk-based argument category highlights a substantive critique: safety cases should prioritise risk reduction over compliance. Similarly, adding S64 under operational/field evidence reflects an evidential stance that assumes the presence of runtime monitoring artefacts. These inferences should, however, be read as analytical triangulation, not as a re-labelling of author intent. We believe the table is strongest when implicit links are justified via clear criteria (e.g., required artefact types, explicit causal reasoning, or risk-management constructs) and reported as a validity mitigation rather than as a substitute for explicit CAE usage in the studies.

3.6.2 Key insights into AI safety case construction

Across the 112 analysed studies, AI safety cases most commonly take the form of conditional claims supported by empirical evidence, often because "ground truth" safety is unavailable in real deployments. However, we observe three recurring shortcomings.

• Acceptability criteria are underspecified. Comparative or marginal claims frequently rely on a baseline ("no worse than" thresholds) without defining the comparator system's assurance status or formalising failure budgets and quantitative thresholds.

• Evidence quality criteria and traceability are inconsistently stated. While red-teaming, stress testing, and audit artefacts appear widely, many safety-case variants do not specify evidence independence, coverage, recency, reproducibility, or linkage, which limits auditability under continuous updates.

• Post-deployment assurance remains fragmented.
A smaller subset of studies (≤ 10%) link claims to operational monitoring and governance actions, which becomes increasingly necessary as models are updated, tool access changes, or deployment contexts drift.

Overall, these three findings motivate our template requirements, i.e., each template instance must explicitly declare comparators/thresholds (when used), evidence quality criteria, and update triggers that require re-validation.

The reviewed literature indicates that AI safety case construction remains at a formative stage. While the CAE structure is widely adopted in form, its operationalisation varies significantly in claim definition, evidential adequacy, and lifecycle integration. The absence of reusable abstraction layers and formalised update logic limits composability and long-term auditability. This structural diagnosis motivates the need for systematic taxonomies and reusable template patterns to stabilise AI safety case engineering across domains.

4 Reusable AI Safety-Case Templates

4.1 Rationale for reusable templates

The emerging diversity of AI systems, from narrow-domain models to large-scale generative models, demands a systematic yet adaptable approach to structuring safety justification. In traditional engineering, safety cases are often bespoke, created for each project through expert judgement and iterative review. For AI, this approach does not scale. Model behaviours evolve too quickly, evidence sources are heterogeneous, and assurance must often be performed by teams without formal safety-case experience. A reusable, pattern-based approach can reduce subjectivity and promote consistency, while remaining flexible enough to handle uncertainty, discovery, and change. Following [7] and adapting their template-based assurance logic for autonomous systems, the proposed solution defines reusable safety-case templates as standardised, semi-structured argument skeletons.
Each template combines a claim pattern, a corresponding argument structure, and a set of evidence families appropriate to AI systems. These templates can be specialised or composed to construct end-to-end safety cases.

4.2 Overview of the architecture

The structure follows the classical CAE triad but adapts each element to AI-specific realities (Table 6).

Table 6: Claim, Argument, and Evidence for AI systems

• Claim. Purpose in traditional systems: a fixed proposition of system safety (e.g., "System X is acceptably safe"). Adaptation for AI systems: a conditional and contextual proposition incorporating assumptions about data, model capability, and deployment context.
• Argument. Purpose in traditional systems: a logical structure showing how subclaims and evidence support the top-level claim. Adaptation for AI systems: multi-modal reasoning combining deductive (design), inductive (empirical), abductive (mitigation), and statistical (uncertainty) inference.
• Evidence. Purpose in traditional systems: test results, analyses, certifications. Adaptation for AI systems: empirical evaluations, red-teaming, interpretability analyses, governance artefacts, and risk-model outputs.

The templates are designed to be composable. A single safety case may consist of multiple claim–argument–evidence chains stitched together: one addressing architectural robustness, another addressing misuse safeguards, and another addressing residual risk thresholds. Figure 7 shows the full CAE taxonomy for AI systems, which serves as the structural foundation for the proposed reusable safety-case template. The taxonomy integrates AI-specific extensions into the classical CAE model, distinguishing how safety propositions (claims), reasoning structures (arguments), and supporting artefacts (evidence) interrelate across assurance layers.
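The composable claim–argument–evidence modules described here can be sketched as simple data structures. The following is a minimal, hypothetical illustration; the class names, fields, and example artefacts are our own assumptions for exposition, not a schema prescribed by the template framework:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    family: str   # e.g. "empirical", "formal methods", "operational"
    artefact: str # e.g. "red-team report v3" (hypothetical artefact name)

@dataclass
class CAEModule:
    claim: str                 # conditional, context-bound safety proposition
    argument: str              # e.g. "layered safety", "threshold-based"
    evidence: list[Evidence] = field(default_factory=list)

    def is_substantiated(self) -> bool:
        # A module counts as substantiated only if at least one
        # evidence artefact backs its argument.
        return len(self.evidence) > 0

# A safety case is a network of such modules stitched together.
safety_case = [
    CAEModule(
        claim="System X is safe for text-only enterprise summarisation",
        argument="layered safety",
        evidence=[Evidence("empirical", "robustness test suite results")],
    ),
    CAEModule(
        claim="Residual misuse risk is below the approved threshold",
        argument="threshold-based",
        evidence=[],  # not yet substantiated
    ),
]

# List the claims that still lack supporting evidence.
unsupported = [m.claim for m in safety_case if not m.is_substantiated()]
print(unsupported)
```

Reusing a module for another system or context would amount to substituting the claim instantiation and evidence artefacts while keeping the structure intact.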
Figure 7: The CAE Taxonomy for AI systems, showing the claim categories (assertion-based, constraint-based, capability-limited), argument categories (demonstrative, comparative, risk-based, causal/explanatory, capability-oriented, normative/conformance), and evidence categories (empirical, comparative, model-based, expert-derived, formal methods, operational and field data, mechanistic), each with its subtypes.

The purpose of the taxonomy is to support structured composition and explicit reasoning, not to enforce classification purity or decision-tree logic. This distinction is critical for practitioners, as AI safety cases are inherently multi-modal and evolve through discovery, comparison, and governance over time. Accordingly, the taxonomy is intentionally non-exclusive: certain claim, argument, or evidence types (e.g., guideline-based and dependability arguments) may appear under multiple categories. This reflects their dual role as both technical assurance mechanisms and normative proxies for acceptability, depending on the assurance intent and context of use.

4.3 Claim

Claims in AI safety cases differ from those in conventional engineering because they must describe conditional and discovery-dependent safety propositions rather than absolute guarantees. Each claim type captures how safety is defined, bounded, or inferred within an evolving, data-driven system.
As shown in Figure 7, the taxonomy groups claim categories by their assurance intent: assertion-based, constraint-based, and capability-limited.

Assertion-based

Assertion-based claims describe the mode of assurance, i.e., how safety is asserted relative to defined expectations or baselines. They include both absolute propositions (claiming acceptable safety outright) and comparative propositions (claiming safety relative to a reference system).

• Absolute safety claim: A high-level claim that the system is acceptably safe for a specified purpose. This acts as a placeholder for more specific subclaims.
Example: "AI System X is acceptably safe for document summarisation within approved enterprise environments."

• Marginal safety claim: Used when absolute safety cannot be quantified but relative safety can be inferred.
Example: "AI System X is at least as safe as the previously deployed version Y (or a comparator Z) on relevant misuse benchmarks and consistency metrics."

Constraint-based

Constraint-based claims state that safety is conditional on defined operational limits, such as data regimes, modalities, or environmental boundaries. They ensure that safety justification remains valid only within those bounds.

• Constrained claim: Expresses that safety is bounded by particular operational conditions, data regimes, or user contexts. Context and constraints are treated jointly, as they are inseparable in AI (e.g., domain, modality, data source, tool access limits).
Example: "AI System X is safe when interacting with text-only inputs drawn from internal data sources and used by authorised employees."

• Envelope-constrained claim: AI safety is maintained only while the system operates within a validated performance or failure envelope, with mechanisms to detect and respond to out-of-envelope behaviour. Such claims emphasise runtime assurance, where the system transitions to a safe or degraded state when its operational limits are exceeded.
Example: "System X is safe so long as it operates within a validated failure envelope, and transitions to a safe state when out-of-envelope behaviour is detected (e.g., by halting, degrading, or handing off to a human)."

Capability-limited

This type of claim states that safety depends on capability boundaries, either control-based (design choices, such as tool or network disablement) or intrinsic (assumed AI system inability or refusal behaviour).

• Control-based claim: Depends on external or architectural design choices that enforce operational containment. This claim includes:
– Actuation-limited: Aligns with prior safety research advocating containment and human-in-command architectures; it restricts AI systems from issuing direct actuation commands to prevent uncontrolled or unsafe physical actions.
Example: "System X cannot issue direct actuation commands to industrial equipment."
– Access-limited: Asserts that the system cannot access unauthorised networks, data repositories, or sensitive resources. It limits data exfiltration and privacy risks.
Example: "System X cannot connect to external databases or internet endpoints."
– Tool-use limited: Specifies that the AI may invoke only pre-approved or sandboxed tools, ensuring controlled interactions with external functions or APIs.
Example: "System X can execute only whitelisted tools within its isolated environment."
– Modality-limited: Declares that the system is restricted to particular input or output modalities, preventing cross-modal risks such as audio or image manipulation.
Example: "System X processes text only and cannot generate or interpret images, audio, or video."

• Intrinsic-based claim: Depends on internal properties or alignment behaviours embedded in the model that make unsafe actions inherently infeasible. This means that refusal emerges from reinforced policies or internal gating rather than from external rule enforcement.
If such refusal is implemented through wrappers, filters, or middleware, it instead constitutes a control-based claim.
– Knowledge-limited: States that the model cannot access or infer information beyond its authorised or trained data domain. It limits knowledge leakage or speculative inference about restricted topics.
Example: "System X cannot retrieve or deduce information outside the approved knowledge base."
– Goal-limited: Asserts that the system lacks persistent goals, self-modification ability, or independent task initiation beyond a defined session scope. This prevents uncontrolled decision loops or goal drift.
Example: "System X cannot initiate tasks or alter parameters without explicit human instruction."

Each claim type implies specific expectations for argument structure and evidence. For instance, contextual claims require evidence of boundary enforcement, whereas capability-limited claims require inspection of mechanisms or adversarial validation.

4.4 Argument

Arguments provide the connective reasoning that links claims to supporting evidence. AI safety cases require argument structures that can integrate heterogeneous reasoning modes and adapt as new evidence emerges. The proposed taxonomy defines two orthogonal dimensions, function and logical form, which can be combined modularly.

4.4.1 Argument taxonomy

Demonstrative

Demonstrative arguments show that all architectural and procedural layers collectively satisfy acceptance criteria. They are used when safety depends on the completeness and integrity of design-time controls.

• Layered safety argument decomposes a top-level claim into sub-claims covering independent assurance layers (e.g., data, model, governance).
Example: "Each layer (e.g., data validation, model robustness, and human oversight) independently satisfies its acceptance criteria, and together they ensure overall system safety."

• Barrier argument demonstrates that multiple independent safeguards exist to prevent or mitigate harm, ensuring redundancy.
Example: "If automated content filters fail, human review and escalation protocols provide a secondary safety barrier."

Demonstrative arguments primarily employ deductive reasoning and rely on architectural evidence, safety design documentation, and compliance verification.

Comparative

Comparative arguments justify safety by relative performance against a baseline or reference system. They are central to marginal-risk and model-update cases.

• Comparative argument establishes that the target system's behaviour is no worse than a comparator across defined safety dimensions.
Example: "System X's misclassification rate is not higher than that of the human baseline at 95% confidence."

• Comparison-based argument extends comparison to include context alignment and fairness of reference, showing that differences in scope or data do not bias results.
Example: "System X is similar to System Y, which is safe; therefore X is safe."

Comparative arguments use inductive, statistical, and abductive reasoning. They are supported by benchmarking, controlled evaluations, and evidence from expert reviews.

Risk-based

Risk-based arguments justify safety through explicit reasoning about risk levels or conformance with risk-management frameworks. They are used when assurance depends on either quantitative risk estimation or adherence to recognised safety guidelines.

• Threshold-based argument demonstrates that the system's estimated residual risk is acceptable, based on formal analysis or statistical modelling. This form directly connects quantitative risk evidence to the top-level safety claim.
Example: "According to our risk assessment, System X does not pose unacceptable risk; our risk analysis demonstrates that risk levels are within tolerable bounds."

• Guideline-based argument demonstrates that the system complies with an established safety or risk-management framework and infers acceptability from that compliance. It indirectly argues that adherence to recognised safety processes implies an acceptable level of risk.
Example: "System X adheres to our safety framework; no system that follows this framework poses unacceptable risks."

While the typical rationale for a threshold-based argument is statistical reasoning, a guideline-based argument relies on deductive reasoning.

Causal/explanatory

Causal or explanatory arguments link observed safety outcomes to identifiable mechanisms or mitigations. They show that hazards are understood and effectively controlled.

• Dependability argument demonstrates that system reliability and fault-tolerance mechanisms causally ensure safety. For example, it can incorporate reliability, availability, maintainability, and safety (RAMS).
Example: "System X is dependably controllable: the architecture enforces fail-safe degradation and human handover on fault."

• Root-cause argument draws on causal-explanatory reasoning proposed in the safety-case literature and recent AI assurance studies, where safety is justified by identifying, explaining, and mitigating the underlying causes of observed failures.
Example: "We justify the safety of System X by explaining and eliminating the previously observed risk. We observed [O1–O3] and evaluated competing hypotheses [H0...Hn] using [D1...Dk] (ablations, audits, counterfactuals). Mitigation M removes the hazard, and M was deployed."

Causal arguments rely on abductive and deductive reasoning, drawing evidence from incident analyses, interpretability studies, and post-mitigation verification.
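A threshold-based argument of this kind reduces to a small, auditable acceptance rule. The sketch below is illustrative only: the hazard names, the risk figures, the 0.25 tolerance, and the additive aggregation (which assumes the hazards are independent) are our assumptions, not values from the reviewed studies:

```python
# Illustrative threshold-based acceptance check: residual risk estimates
# per hazard are aggregated and compared against an approved tolerance.
# All numbers and hazard names here are hypothetical.
APPROVED_RISK_THRESHOLD = 0.25

residual_risks = {
    "harmful-content generation": 0.08,
    "data exfiltration": 0.05,
    "unsafe tool invocation": 0.04,
}

# Simple additive aggregation, assuming independent hazard contributions.
total_risk = sum(residual_risks.values())

# The argument's conclusion: risk within tolerable bounds => acceptable.
acceptable = total_risk <= APPROVED_RISK_THRESHOLD
print(f"total residual risk = {total_risk:.2f}, acceptable = {acceptable}")
```

In a real safety case, the aggregation rule itself (additive, worst-case, or correlated) would need to be declared and justified as part of the argument.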
Capability-oriented

Capability-oriented arguments support capability-limited claims by showing that technical or intrinsic boundaries effectively constrain behaviour.

• Control argument demonstrates that design or configuration controls prevent the system from exceeding authorised capabilities.
Example: "System X is capable of causing serious harm, but it is restricted from using those capabilities; the AI system will not be modified such that it can use its capabilities to cause serious harm; the AI system's model weights will not be stolen."

• Intrinsic argument shows that internal model behaviours (e.g., refusal, bounded reasoning) inherently limit risk.
Example: "System X is not capable of causing serious harm, regardless of how it is modified or used; safety-alignment training enforces refusal to generate disallowed instructions across tested scenarios."

These arguments combine deductive reasoning (for design rules) and empirical reasoning (for behavioural validation) and are evidenced by mechanism inspection, alignment evaluation, and red-team testing.

Normative/conformance

Normative/conformance arguments demonstrate adherence to established frameworks, governance standards, or dependability engineering principles widely recognised as proxies for safety assurance. This argument type demonstrates that system safety is justified through compliance with normative expectations rather than new empirical discovery. This category includes two types: the guideline-based argument and the dependability argument. Although these argument types also appear under other functional categories (e.g., risk-based or causal/explanatory), the taxonomy is not exclusive: certain reasoning forms can legitimately contribute to multiple assurance dimensions depending on the intent and context of their use. In this category, the emphasis is on institutional acceptability and auditability rather than on demonstrating reductions in new technical risk.
Compliance with recognised frameworks or dependability doctrines functions as a proxy for acceptable safety within established governance regimes.

• Guideline-based argument demonstrates that the system complies with an established safety or risk-management framework and infers acceptability from that compliance. It indirectly argues that adherence to recognised safety processes implies an acceptable level of risk.
Example: "System X adheres to our safety framework; no system that follows this framework poses unacceptable risks."

• Dependability argument demonstrates that system reliability and fault-tolerance mechanisms causally ensure safety. For example, it can incorporate reliability, availability, maintainability, and safety (RAMS).
Example: "System X is dependably controllable: the architecture enforces fail-safe degradation and human handover on fault."

4.4.2 Reasoning logic

A safety case is not simply a repository of evidence; it is an argument about why the evidence justifies trust in a system's safety [48]. For AI systems, the reasoning logic underpinning this argument must reflect both the probabilistic nature of learning models and the socio-technical uncertainties of deployment. Traditional engineering relies predominantly on deductive logic ("if all components are safe, the system is safe"). AI assurance, by contrast, requires a pluralistic combination of deductive, inductive, abductive, statistical, and analogical reasoning, each appropriate to different parts of the assurance problem. The choice of reasoning logic determines how claims are substantiated, how confidence is accumulated, and how residual uncertainty is expressed. The following subsections describe when and how each reasoning form should be applied.

Deductive: structural and rule-based assurance

Deductive reasoning is used when system-level safety can be inferred from formally specified properties, architectures, or controls.
Recent work has demonstrated how such deductive assurance can be formalised using safety contracts and logical composition, including approaches based on Subjective Logic to represent confidence while preserving deductive argument structure [49]. Its typical uses include:

• Proving containment and control effectiveness (e.g., "If the AI system has no external network access, it cannot exfiltrate sensitive data").
• Establishing logical completeness of safety arguments under declared assumptions.
• Demonstrating conformance with fixed regulatory or procedural rules.

It provides rigour and traceability and supports formal verification or policy-based reasoning [50]. Yet, it requires stable premises and is rarely sufficient for emergent or probabilistic behaviours. An example scenario: "A newly developed AI-based document evaluation system has four safety layers (policy, testing, controls, and monitoring). Each layer meets its defined acceptance criteria, and all interfaces between layers are verified. For example,

• Policy layer: The system has documented risk classification and approval before deployment.
• Testing layer: The system is classified as high-risk; according to the high-risk criteria, it passed fairness and robustness tests with ≥ 95% success.
• Control layer: According to the control requirements, only approved users can modify model parameters (access control enforced).
• Monitoring layer: Confirmed that incidents or model drifts are detected and logged within 24 hours.

As the AI system passed all four layers, it is acceptable and safe."

Inductive: empirical generalisation from evaluation

Inductive reasoning generalises from observed evidence, such as evaluation results, to infer overall safety [51]. Its typical uses include:

• Aggregating results from test datasets, red-teaming, or user trials to estimate expected performance.
• Establishing non-inferiority to a comparator based on statistical sampling.
• Supporting discovery-driven safety claims where behaviour must be inferred from empirical regularities.

While this logic captures real-world performance and adapts to continual evaluation, confidence depends on representativeness and coverage of the test conditions, and induction cannot guarantee future behaviour [52]. Two example scenarios follow.

"Scenario 1: System X is an updated LLM derived from Systems Y and Z (previous versions of the model), both of which have been used safely in production for over a year. Since the latest version of the model meets or exceeds their safety benchmarks, it is inferred that it is also safe."

"Scenario 2: System X is a new LLM. After comparison with widely used LLMs (e.g., ChatGPT and Gemini), its performance on safety benchmarks is equal to or better than that of these LLMs. Therefore, System X is considered safe and acceptable."

Abductive: explanation and mitigation of anomalies

Abduction supports reasoning about why a system failed or behaved unexpectedly and whether the proposed mitigation plausibly prevents recurrence. This logic is typically used for:

• Analysing incidents or red-team findings to infer causal mechanisms (e.g., "the refusal failure is due to hidden prompt leakage").
• Linking discovered hazards to mitigation strategies when causal mechanisms are only partially understood.
• Building evidence chains connecting interpretability results with system-level safety claims.

It enables learning-oriented assurance and transforms unexplained anomalies into explainable risks with documented hypotheses [53]. Its limitations include reliance on expert judgement and the plausibility of explanations, as well as susceptibility to confirmation bias without cross-validation. An example scenario: "An AI-based risk management system kept crashing after long use. Engineers hypothesised that the cause was memory leaks (H1), corrupted user data (H2), or overheating (H3).
After testing and observation, memory leaks were confirmed as the root cause (i.e., the memory management module was updated, and the crashes stopped)."

Statistical: quantifying uncertainty and marginal risk

Statistical reasoning is central to quantifying residual uncertainty, estimating harm probabilities, and supporting threshold-based or marginal-risk claims [54]. This logic is primarily used for:

• Non-inferiority testing in marginal safety comparisons.
• Confidence-bound estimation of harm rates, false refusals, or false acceptances.
• Risk aggregation and uncertainty propagation across heterogeneous metrics.

It provides quantitative credibility and clear confidence intervals, but is sensitive to metric design, data bias, and assumptions of independence or stationarity [55]. An example scenario: "A process-analysis AI system has been newly developed. The organisation's risk threshold for the AI system is 0.25. Based on testing and field data, the system's estimated risk score is 0.18, below the organisation's approved threshold of 0.25. Since risk score ≤ threshold, the system is considered acceptable for deployment."

Analogical: leveraging similarity and prior assurance

Analogical reasoning justifies safety by reference to similar, previously validated systems, datasets, or operating environments [24]. It is used for:

• Transferring assurance arguments from a comparable AI system (e.g., "safety controls proven for System Z apply to System X with minor adaptation").
• Reusing validated templates and patterns for related applications or system versions [1].
• Supporting initial assurance of novel systems by analogy before sufficient empirical evidence exists.

It enables rapid safety case construction when data or evaluation results are limited. Yet, it requires careful justification of similarity conditions; analogies degrade as systems diverge in architecture or context. An example scenario is as follows.
"System X, a new AI-assisted diagnostic tool, uses the same model architecture and data validation pipeline as System Y, which has been safely deployed in hospitals for two years. Since only the interface has changed and all core safety features are identical, System X is inferred to be safe by analogy."

Integrative reasoning and meta-assurance

No single reasoning mode suffices for AI assurance. Deductive reasoning establishes formal bounds, inductive and statistical reasoning quantify empirical confidence, abductive reasoning explains and mitigates emergent issues, and analogical reasoning bootstraps new cases from prior experience. Mature safety cases must explicitly declare which reasoning logics apply to which claim types and how transitions between them are validated, for example, from abductive hypotheses to inductive evidence once mitigations are tested. Meta-assurance, reasoning about the quality of reasoning, becomes essential. It requires showing that each logic is used under its valid conditions, that assumptions are traceable, and that uncertainty is bounded both within and across reasoning modes. For example, each logical form includes an associated pass condition that defines its validity criteria. These pass conditions collectively enable meta-assurance by allowing the reasoning process itself to be verified for correctness, coherence, and justified application.

4.4.3 Alignment between argument types and reasoning logics

Although integrative reasoning combines multiple logical forms to strengthen overall assurance, each argument type primarily relies on a specific reasoning logic that determines how evidence supports its claim. Understanding this alignment clarifies when a given argument form is appropriate and how different logics (analogical, deductive, inductive, abductive, or statistical) collectively strengthen assurance credibility.
Table 7 summarises the relationship between argument types, their dominant reasoning logic, and a concise description of their assurance role.

Table 7: Summary of the argument types and reasoning logics mapping

• Layered safety (deductive): Demonstrates that all assurance layers (e.g., data, model, governance) collectively satisfy safety criteria.
• Barrier (deductive): Shows that multiple independent safeguards exist, ensuring redundancy even if one control fails.
• Comparative (inductive or statistical): Generalises from empirical comparison to infer that the AI system is no worse than a reference baseline.
• Comparison-based (analogical): Justifies safety by reference to similar, previously validated systems, datasets, or operating environments.
• Threshold-based (statistical): Quantifies residual risk and demonstrates that it remains below the defined numerical threshold.
• Guideline-based (deductive): Infers safety from adherence to recognised frameworks, standards, or organisational policies.
• Dependability (deductive or statistical): Establishes that reliability and fault-tolerance mechanisms causally ensure acceptable safety levels.
• Root-cause (abductive): Provides causal explanations for observed failures and links corrective actions to prevention of recurrence.
• Control (deductive): Demonstrates that external design or configuration controls effectively restrict system capabilities.
• Intrinsic (deductive): Shows that built-in model properties (e.g., alignment or refusal behaviour) inherently prevent unsafe actions.

4.5 Evidence

Evidence substantiates the premises of arguments. In AI safety cases, credible evidence must capture both technical and governance realities, showing not only that the system behaves safely under test conditions but also that safety controls, policies, and monitoring processes remain effective over time.
Unlike conventional engineering evidence, which is often deterministic and static, AI evidence must be empirical, dynamic, and socio-technical. It combines computational validation with organisational verification.

The taxonomy in Figure 7 organises AI safety evidence into seven primary categories: empirical, comparative, model-based, expert-derived, formal methods, operational and field data, and mechanistic. Each represents a distinct mode of substantiation that maps to different reasoning logics and argument functions introduced in Section 4.4.

Empirical

Empirical evidence establishes behavioural facts about the AI system through direct observation and evaluation. This primarily includes testing and user studies, which can be combined with red-teaming, fuzzing, or scenario simulations.

• Testing includes results from functional, stress, adversarial, or red-team tests showing how the system behaves under controlled or extreme conditions.
• User studies refer to structured experiments involving human participants that assess usability, fairness perception, and effectiveness of oversight mechanisms.

The quality focus of empirical evidence is on the representativeness of datasets, the reproducibility of test procedures, and the coverage of operational contexts.

Comparative benchmarking

Comparative benchmarking evidence demonstrates relative safety through reference evaluations. It strongly supports comparative arguments. The evidence type includes benchmarking results, comparative performance on safety-related benchmarks, or competitions used as controlled studies comparing the AI system to human or legacy systems with equivalent metrics. The quality focus includes the fairness of comparators, the equivalence of the evaluation scope, and the statistical significance of differences.

Model-based risk analysis

Model-based evidence provides quantitative reasoning about residual risk and uncertainty.
Potential risk models include Bayesian networks, causal models, or Monte Carlo estimations of harm likelihood. While probabilistic or Bayesian risk analysis involves statistical estimation of harm probabilities and confidence intervals, simulation or Monte Carlo evaluation propagates uncertainty across system variables to determine threshold compliance. The quality focus is on transparency of assumptions, calibration with empirical data, and sensitivity analysis of parameters.

Expert-derived

Expert-derived evidence captures structured professional judgement where empirical or model-based data are incomplete. It supports abductive and analogical reasoning.

• Expert assessment is formal elicitation or review by qualified experts estimating the likelihood, severity, or adequacy of controls. It includes Delphi studies.
• Scenario-based evaluation includes tabletop or synthetic exercises exploring plausible failure modes and mitigation strategies.
• Forecasting and horizon-scanning evidence is anticipatory analysis that identifies emergent capabilities, threat vectors, or long-term socio-technical impacts.

The key quality aspects include the diversity of expertise, the transparency of reasoning, and the explicit representation of uncertainty and consensus.

Formal verification

Formal verification provides mathematically rigorous assurance that specified safety or containment properties hold. Typical methods include model checking, theorem proving, static analysis, and symbolic execution to establish invariants, enforce rules, and verify bounded-autonomy constraints. It primarily supports deductive reasoning within demonstrative and capability-oriented arguments. The quality focus is on the soundness of proofs, tool qualification, and the correspondence between the verified model and the implemented system, with clear traceability of assumptions and coverage of critical properties.
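As one concrete instance of Monte Carlo risk evidence, uncertainty in an estimated failure rate can be propagated into a probability of exceeding a tolerable event count. Everything below is a hypothetical sketch: the uniform range on the per-session failure rate, the session count, and the tolerance of four events are our assumptions, not figures from the reviewed studies:

```python
import random

# Monte Carlo sketch of model-based risk evidence. We assume the
# per-session harmful-failure probability is only known to lie in
# [0.002, 0.006] (epistemic uncertainty), and that at most 4 harmful
# events are tolerated per 500-session deployment period.
random.seed(0)  # for reproducibility

N_TRIALS = 5_000
SESSIONS = 500
TOLERABLE_EVENTS = 4

exceedances = 0
for _ in range(N_TRIALS):
    # Draw an uncertain failure rate for this trial.
    p_fail = random.uniform(0.002, 0.006)
    # Simulate harmful events across the deployment period.
    events = sum(1 for _ in range(SESSIONS) if random.random() < p_fail)
    if events > TOLERABLE_EVENTS:
        exceedances += 1

# Estimated probability that the tolerable event count is exceeded;
# this figure would feed a threshold-based acceptability decision.
p_exceed = exceedances / N_TRIALS
print(f"Estimated P(events > {TOLERABLE_EVENTS}) = {p_exceed:.3f}")
```

A real analysis would also report sensitivity of the result to the assumed rate interval, matching the quality focus on transparency of assumptions and calibration.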
Operational and field data

Operational evidence validates safety during real-world use. It supports both demonstrative and statistical arguments by confirming that the implemented controls remain effective.

• Historical data includes logs, drift reports, and post-deployment metrics confirming stability and safe performance.
• Governance and policy artefacts refer to monitoring dashboards, audit trails, and approval records linking operational data to oversight actions.

The quality focus is on data recency, record authenticity, and traceability to system versions.

Mechanistic or interpretability evidence

Mechanistic evidence links internal model mechanisms to observable safety behaviour. It strongly supports causal/explanatory and capability-oriented arguments. The evidence includes interpretability analysis, such as attribution mapping, feature tracing, or causal-path exploration, that explains why behaviour is safe or unsafe. It also includes formal verification, such as proofs or static analyses, demonstrating that safety or containment properties hold within subsystems.

Evidence must also satisfy quality criteria such as independence, coverage, recency, reproducibility, and representativeness, enabling its credibility to be assessed systematically.

4.6 Composability and reusability of the template

The true power of these templates lies in their composability. A full safety case can be built as a network of reusable modules. Each module is reusable across systems and contexts, with only the specific claim instantiation and evidence artefacts substituted. The result is a structured but flexible assurance design, one capable of supporting AI systems whose safety can only be demonstrated through continual evaluation and rediscovery.
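To make composability concrete, the following minimal sketch models a safety case as nested claim–argument–evidence modules whose evidence artefacts can be swapped per system. The class and field names are our own illustrative invention, not a standard CAE tool schema.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    name: str          # e.g. "red-team report v3"
    family: str        # e.g. "empirical", "mechanistic", "operational"

@dataclass
class Argument:
    function: str      # e.g. "demonstrative", "comparative", "risk-based"
    reasoning: str     # e.g. "deductive", "inductive", "statistical"
    evidence: list[Evidence] = field(default_factory=list)

@dataclass
class Module:
    claim: str
    arguments: list[Argument] = field(default_factory=list)
    subclaims: list["Module"] = field(default_factory=list)

    def all_evidence(self):
        """Collect every evidence artefact reachable from this module,
        which is what an auditor would trace during review."""
        items = [e for a in self.arguments for e in a.evidence]
        for sub in self.subclaims:
            items.extend(sub.all_evidence())
        return items

# Reuse: the same module shape is re-instantiated per system, with only
# the claim text and evidence artefacts substituted.
containment = Module(
    claim="System X cannot execute tools outside sandbox S",
    arguments=[Argument("demonstrative", "deductive",
                        [Evidence("sandbox policy proof", "formal")])],
)
top = Module(claim="System X is acceptably safe in context C",
             subclaims=[containment])
print([e.name for e in top.all_evidence()])
```

Swapping the `containment` module for another system's instantiation leaves the rest of the case untouched, which is the composability property described above.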
5 Safety Case Patterns for AI Systems

While Section 4 established the taxonomic foundation for constructing AI safety cases, practitioners ultimately need end-to-end patterns: reusable narrative structures that show how claims, arguments, and evidence interact to address recurring assurance challenges. A pattern is not a fixed template but a worked composition of the claim, argument, and evidence taxonomies tailored to a particular class of AI system risks. These patterns help align safety reasoning with how AI systems are actually developed and operated: iteratively, empirically, and under uncertainty.

Each pattern below is organised as follows:

1. Problem or assurance challenge.
2. Applicable claim types.
3. Recommended argument functions and reasoning logics.
4. Evidence families and quality conditions.
5. Practical example structure.

In this section, we introduce four representative patterns that correspond to the unique characteristics of modern AI assurance: "discovery-driven evaluation", "marginal risk without ground truth", "continuous system evolution", and "threshold-based acceptability decisions". We also discuss integrating these patterns into composite safety cases used in practice.

5.1 Discovery-driven evaluation pattern

Problem

Traditional engineering safety cases rely on a full hazard enumeration prior to operation. For AI systems, this is infeasible: new capabilities and risks are discovered only after deployment-like testing. The assurance challenge is to justify safety in the face of incomplete knowledge and ongoing empirical discovery.

Applicable claim types

• Contextual claims constraining safe operating boundaries (e.g., data domain, user group, or task category).
• Capability-limited claims, especially intrinsic limitations (e.g., refusal to execute unsafe content).
• Marginal safety claims compared to a reference system or control scenario.
Argument functions and reasoning logics

• Demonstrative–Deductive for control enforcement (containment, isolation).
• Causal–Abductive to explain and mitigate newly discovered hazards.
• Risk-based–Statistical to express residual uncertainty and show progressive improvement.

Evidence families

Empirical testing, adversarial and stress evaluation, interpretability-based investigation of failures, and governance artefacts showing incident response or post-mitigation procedures. Evidence must emphasise recency, reproducibility, and completeness of evaluation coverage.

Example structure

Top claim: "AI system X is safe for interaction tasks in context C under control K."

• Demonstrative argument: enforcement of access, isolation, and rate limits.
• Causal argument: identified risk R1 (e.g., prompt leakage) explained via mechanism analysis and mitigated through instruction filtering M1.
• Risk-based argument: post-mitigation evaluation shows risk rate r ≤ threshold t within a statistical confidence interval.
• Continuous discovery process documented as part of the evidence base, with results linked to monitoring triggers.

This pattern formalises assurance through evaluation and discovery, embedding iterative empirical testing into the safety case rather than treating it as post-certification validation.

5.2 Marginal-risk pattern without ground truth

Problem

Many AI systems are deployed in domains where there is no fixed ground truth or where incumbent systems, whether human workflows, legacy automation, or peer AI systems, were never rigorously validated. In these cases, the objective is not to demonstrate absolute safety, but to show that the AI system is no less safe than an accepted comparator. Because risk must be inferred indirectly, assurance relies on a structured combination of predictability, capability, and interactive (game-based) metrics [30] that collectively approximate the system's safety margin.
Applicable claim types

• Marginal safety claim: "AI system X is at least as safe as comparator Z on relevant predictability, capability, and interaction metrics."

Argument functions and reasoning logics

• Comparative–Inductive to establish that AI system X matches or exceeds comparator Z on defined safety-related dimensions.
• Risk-based–Statistical to quantify non-inferiority margins and propagate uncertainty across heterogeneous metrics.
• Normative–Deductive to justify comparator selection and alignment with accepted operational or regulatory standards when empirical evidence is partial.

Evidence families

Evidence is drawn from multiple proxy domains to triangulate safety in the absence of a definitive benchmark:

• Consistent predictability metrics – measures of consensus and stability that indicate behavioural coherence under small input perturbations or prompt reformulations. Examples include inter-annotator agreement with human experts, stability across rephrasings, and invariance under benign format changes.
• Capability-based metrics – controlled evaluations of the AI system's competence in safety-relevant tasks, bounded by known capability envelopes. This may include stress tests on reasoning accuracy, constraint adherence, or refusal behaviour.
• Game-based metrics – structured interactive evaluations where human or AI agents probe the system in adversarial or cooperative scenarios to surface hidden failure modes. These "games" can simulate contest conditions (e.g., red-team versus defence) or cooperative judgement-alignment tasks that reveal marginal safety performance under realistic pressure.
• Expert elicitation and structured judgement – panels assessing comparative harm likelihood and interpretability of results, supplemented by independent review of experimental fairness.
Example structure

Top claim: "AI system X is at least as safe as comparator Z for task Y under the evaluated proxy metrics."

• Comparative argument: define comparator Z and demonstrate alignment of task scope, data, and operational context; analyse differences and potential confounders.
• Predictability sub-claim: show that AI system X achieves equal or higher consistency, consensus, and stability scores than Z under controlled prompt or context perturbations.
• Capability sub-claim: present controlled task performance within safety-critical dimensions, demonstrating that observed capability boundaries remain within accepted operational limits.
• Game-based sub-claim: report results from adversarial or cooperative gameplay showing equivalent or reduced harm frequency and better recovery behaviour relative to Z.
• Risk-based argument: integrate the three metric families into a composite marginal-risk index; demonstrate non-inferiority (∆ ≤ δ) with statistical confidence (≥ α).
• Normative argument: justify why Z constitutes an appropriate reference system, referencing governance precedents or deployment track record.
• Evidence package: includes evaluation protocols, datasets, gameplay logs, aggregation scripts, expert-panel reports, and independent audits of experimental design and fairness.

This pattern operationalises marginal safety reasoning when ground truth is absent. It leverages multi-dimensional proxies such as predictability, capability, and interactive robustness to provide a defensible, quantitative, and qualitative basis for comparing an AI system's safety against human or legacy baselines. It enables deployment decisions grounded in relative rather than absolute assurance, while preserving transparency in the reasoning chain and quality of evidence.

5.3 Continuous-evolution pattern

Problem

AI systems undergo continual modification, including model updates, retraining, data refreshes, scaffolding updates, and integration with new tools.
Each change potentially alters risk characteristics, invalidating static evidence. The assurance challenge is to maintain a living safety case that stays valid through system evolution.

Applicable claim types

• Contextual claims updated with versioned scope and control boundaries.
• Marginal safety claims comparing new and prior system configurations.

Argument functions and reasoning logics

• Demonstrative–Deductive for update and rollback control mechanisms.
• Comparative–Inductive for version-to-version safety parity.
• Risk-based–Statistical for continuous performance metrics with alert thresholds.

Evidence families

Dynamic evaluation reports, regression test results, Safety Performance Indicators (SPIs) linked to claims, incident trend analysis, and governance logs for release approval. Evidence must include timestamps and automated traceability to prior versions.

Example structure

Top claim: "AI system X remains acceptably safe after update U."

• Demonstrative argument: update process verified by formal change-control procedure P.
• Comparative argument: no statistically significant degradation compared to the prior version or alternative system Z on safety benchmarks.
• Risk-based argument: rolling SPI indicators within pre-declared acceptance bounds.
• Governance evidence: audit trail confirming revalidation and approval of the updated safety case.

This pattern embodies the notion of a dynamic safety case, a living artefact refreshed alongside system development cycles.

5.4 Threshold-comparator pattern

Problem

Developers and regulators often define multiple quantitative thresholds (e.g., harm probability, compute expenditure, or autonomous capability) as proxies for safety acceptability. These thresholds may be incomplete or conflicting. The assurance challenge is to demonstrate that the AI system satisfies the relevant thresholds and to justify how they are prioritised or combined.
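As a minimal, hypothetical illustration of the kind of pre-declared decision rule this pattern requires, the sketch below checks measured metrics against a set of thresholds; all threshold values and metric names are invented for illustration.

```python
# Hypothetical pre-declared thresholds T1..Tn mapped to measured metrics.
thresholds = {
    "harm_probability": 1e-4,           # T1: upper bound
    "jailbreak_success_rate": 0.02,     # T2: upper bound
    "oversight_response_minutes": 15,   # T3: upper bound
}

# Illustrative measurements for some system under evaluation.
measured = {
    "harm_probability": 6e-5,
    "jailbreak_success_rate": 0.013,
    "oversight_response_minutes": 11,
}

def evaluate(thresholds, measured):
    """Pre-declared conjunctive aggregation F(T1, ..., Tn): every
    threshold must be satisfied for deployment approval.

    A full safety case would also propagate measurement uncertainty
    and declare a prioritisation rule for conflicting thresholds;
    this only shows the shape of the decision logic.
    """
    results = {k: measured[k] <= t for k, t in thresholds.items()}
    return all(results.values()), results

ok, detail = evaluate(thresholds, measured)
print(ok, detail)
```

Declaring the aggregation function before evaluation, as the pattern requires, is what keeps the deployment decision auditable rather than post hoc.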
Applicable claim types

• Constraint-based risk claims with explicit numerical limits.
• Contextual claims clarifying conditions of applicability.
• Normative claims aligning thresholds with policy or external frameworks.

Argument functions and reasoning logics

• Threshold-based, Deductive-Structural for mapping thresholds to system properties and decision rules.
• Statistical for evidence aggregation and uncertainty handling.
• Normative for explaining prioritisation when thresholds conflict or overlap.

Evidence families

Model-based risk analyses, empirical metrics, acceptance-criteria tables, sensitivity and worst-case analyses, and governance documentation describing threshold selection.

Example structure

Top claim: "AI system X meets defined safety thresholds T1–Tn for deployment in context C."

• Deductive argument: decomposition of system-level safety into measurable metrics tied to T1–Tn.
• Statistical argument: aggregation function F(T1, …, Tn) pre-declared, with uncertainty propagation and correlation analysis.
• Normative argument: justification for the threshold hierarchy, consistent with policy or regulation R.
• Evidence: metric definitions, evaluation datasets, acceptance-criteria reports, and independent review summaries.

This pattern operationalises quantitative decision-making in safety cases, ensuring transparency about how thresholds translate into deployment approval or rejection.

5.5 Integrating patterns into composite safety cases

In practice, an AI system's safety case will combine several of these patterns. For instance, a conversational agent might use the discovery-driven pattern to capture ongoing capability discovery and red-team evaluation, the marginal-risk pattern to justify deployment relative to human helpdesk baselines, and the continuous-evolution pattern to manage version updates.
Each pattern contributes a module that can be updated independently while maintaining traceability to the overall safety claim. Reusable patterns, therefore, enable composable assurance: the ability to build, maintain, and audit AI safety cases as evolving mosaics rather than static documents. This compositional approach allows alignment with both rapid development cycles and dynamic regulatory expectations.

6 Case Study: AI-based Tender Evaluation System

6.1 Case system overview and context

Government tender evaluation processes traditionally involve two independent human reviewers who assess and score supplier submissions against defined criteria such as compliance, value for money, and risk. Discrepancies between reviewers are reconciled through discussion or third-party moderation. The pilot AI-based tender evaluation system replaces one of the two human reviewers with an AI model.

The AI system was deliberately not trained on historical data. This design choice was critical to avoid bias and artificial correlation in the evaluation process. Training on historical samples would have inflated the "agreement" metrics, as outcomes would have been partially determined by prior examples rather than genuine reasoning. Instead, the system relied on a prompt-based approach using GPT-4o, in which both the document under evaluation and the criteria document (e.g., policy documents and scoring rubrics) were provided at inference time.

The system assists the remaining human reviewer by producing structured assessments and justifications. Its purpose is to improve efficiency and consistency while maintaining or enhancing fairness, transparency, and procedural integrity. A key assurance challenge is the absence of an absolute ground truth: there are no definitive "correct" tender scores, only prior human judgements that vary across reviewers and contexts.
Hence, safety and reliability must be justified marginally, so that the new AI + human configuration performs no worse (and ideally better) than the existing human + human process across comparable criteria. This context makes the Marginal-Risk Pattern without Ground Truth the most appropriate safety-case pattern, as it supports reasoning about relative safety and integrity using proxy metrics and comparative evaluation.

6.2 Applied pattern: Marginal-Risk Pattern without Ground Truth

Claim

For this case, we selected the "Marginal safety claim" type and state: "The AI-based tender evaluation process is at least as safe, fair, and reliable as the prior human-only process on relevant performance and integrity metrics."

Argument functions and reasoning

To support the claim, practitioners can select argument types from three main argument functions (comparative, risk-based, and normative) depending on the availability of evidence and the decision context. These functions can be used individually or in combination to balance empirical and procedural assurance.

• Comparative (Inductive reasoning) is the primary argument function for marginal-risk cases where there is no ground truth but a reference system (e.g., a human baseline) exists. It generalises from observed evidence to infer that the new system's behaviour is no worse than the comparator across relevant dimensions. This is recommended when direct empirical comparison is possible through shared metrics, datasets, or tasks.
• Risk-based (Threshold-based argument, Statistical reasoning) is a quantitative analysis that estimates uncertainty and verifies non-inferiority (∆ ≤ δ) between configurations with confidence (≥ α), based on bootstrapped agreement and harm-probability measures. This is recommended when the evaluation produces measurable indicators that can be statistically combined.
• Normative (Deductive reasoning) supports assurance by referencing accepted practice, policy, or precedent, establishing that the comparator system or metric selection is valid. In this case, the comparator (human–human baseline) is justified as an appropriate reference, given regulatory and procedural acceptance in public-sector procurement.

These reasoning modes collectively support marginal assurance in the absence of ground truth and ensure that both empirical performance and normative alignment are represented. For this case, two argument functions are applied: Comparative and Risk-based (Threshold-based). The Comparative argument forms the primary reasoning mode because the AI–human process is evaluated relative to an existing, operational human–human baseline. We defined the marginal risk for this case as MR = ∆R = R_AI+Human − R_Human+Human, where R is a multi-dimensional risk vector encompassing performance, reliability, safety, security, fairness, privacy, compliance, cost, and resilience. The Risk-based (Threshold-based) argument complements it by quantifying residual uncertainty and demonstrating that performance differences fall within acceptable confidence bounds. The Normative argument is not explicitly developed here, as the baseline comparator and procurement policies already define an accepted standard of practice. Table 8 summarises the selected argument types and the rationale for their selection in this case study.

• Comparative — Reasoning logic: Inductive — Selected: Yes — Rationale: Selected as the primary argument type for this case. Suitable when a validated human baseline exists, enabling empirical "no-worse-than" reasoning across fairness, consistency, and interpretability metrics. — Typical evidence: Comparative evaluation results, expert review reports, red-team findings, and qualitative disagreement analysis.
• Risk-based (Threshold-based) — Reasoning logic: Statistical — Selected: Yes — Rationale: Selected as a complementary argument function to express quantitative confidence. Aggregates multiple proxy metrics and demonstrates non-inferiority (∆ ≤ δ) with a statistical confidence level (≥ α). — Typical evidence: Aggregated performance metrics, confidence intervals, bootstrapped significance tests, and uncertainty propagation analysis.

• Normative — Reasoning logic: Deductive — Selected: No — Rationale: Not selected as a primary argument in this case. The comparator (human–human baseline) and the associated procurement policies have already been formally accepted within governance processes; therefore, no additional normative justification is required. — Typical evidence (if applied): Regulatory standards, policy guidelines, and procurement framework documentation.

Table 8: Selected argument functions, reasoning logic, and rationale for the AI-based tender evaluation case.

Argument statements for the case

With the selected argument functions, we formalised the following argument statements for the AI-based tender evaluation system.

Comparative argument: "The AI–human tender evaluation process is at least as safe, fair, and reliable as the prior human–human process because its observed behaviour across multiple assurance dimensions (e.g., fairness, consistency, interpretability) is empirically comparable or superior under equivalent evaluation conditions. In particular, while the human–human review inconsistency averaged 3.0%, the AI–human review inconsistency was observed at 2.8%, yielding a marginal risk difference of MR = ∆R = R_AI+Human − R_Human+Human = −0.2%."

This negative marginal difference indicates that the AI–human configuration performs at least as consistently as the human–human baseline, satisfying the comparative "no-worse-than" criterion for decision reliability.

Risk-based argument: To support the comparative argument (e.g., decision consistency), the threshold-based argument is defined as follows.
"The residual risk of decision inconsistency introduced by the AI reviewer remains below the predefined acceptance threshold (∆ ≤ δ = 5%)."

The threshold-based argument provides quantitative support for the comparative argument by statistically verifying that observed differences between AI–human and human–human evaluations remain within an acceptable tolerance range. It translates the qualitative claim of "no-worse-than" performance into measurable confidence bounds, thereby strengthening the empirical credibility of the primary argument.

Evidence collection

As mentioned earlier, there is no definitive ground truth in this case. Accordingly, the evaluation metrics follow the principles outlined in Section 5 and rely on predictability-based measures such as consistency and consensus, rather than on absolute correctness. These metrics operationalise the "no-ground-truth" condition by quantifying relative coherence between human and AI reviewers.

To support the comparative and threshold-based arguments, evidence from comparative benchmarking was collected through controlled evaluation trials involving 200 synthetic tender cases. The following statement explains the assurance activities performed, summarises the tangible artefacts produced or collected, and explicitly links those artefacts to the arguments they support.

"To substantiate the comparative and threshold-based arguments, a series of benchmarking and statistical evaluation activities were conducted. A synthetic dataset of 200 tender cases was prepared using historical scoring rubrics. Both the AI–human and human–human reviewer pairs independently assessed these cases following the same protocol. Inter-rater agreement, fairness deviation, and justification quality were measured.
The resulting data were analysed using bootstrapped significance tests and confidence-interval estimation to verify that observed differences (∆ = 2.8%, δ = 5%, α = 0.95) remained within acceptable tolerance levels."

The tangible evidence (corresponding artefacts) for this case includes: i) the comparative evaluation dataset, ii) the benchmarking and statistical analysis report, iii) expert review summaries, and iv) the evaluation protocol and computation scripts.

Summary and conclusion

This case applies the marginal-risk pattern without ground truth to an AI-assisted tender evaluation workflow. We advance a marginal safety claim supported by i) a Comparative argument (primary, inductive) and ii) a Threshold-based argument (supporting, statistical). Evidence for this case comes from the comparative benchmarking family, instantiated via predictability (decision stability/consistency) and capability (agreement, fairness, efficiency) metrics against the human–human baseline. These proxies triangulate safety in the absence of a definitive benchmark. The results of this case study across 200 synthetic cases are interpreted as follows:

1. Observed marginal risk: The difference in decision inconsistency between the AI–human and human–human configurations (∆R = −0.2%) shows that the new system performs no worse, and slightly better, than the existing process. This indicates that the AI–human evaluation configuration achieves equivalent or improved performance (e.g., consistency and fairness).

2. Threshold validation: In addition, the observed 2.8% inconsistency rate of the AI–human system remained below the predefined acceptance threshold (δ = 5%) with 95% confidence.
This statistically corroborates the comparative finding of "no-worse-than" performance and confirms that adding an AI reviewer does not increase residual decision risk beyond historical human variability.

These results demonstrate that the AI–human evaluation process maintains decision reliability within the established human–human variability envelope. By combining empirical comparison and statistical quantification, the AI safety case provides a defensible basis for concluding that the system operates within acceptable safety and fairness bounds. The findings indicate that the AI-based review process can be safely adopted alongside, or as a partial replacement for, the existing human–human review workflow. Adoption is expected to reduce reviewer workload and resource demand while maintaining consistency, fairness, and transparency in decision-making. Continuous monitoring of agreement and fairness indicators is recommended to ensure that performance remains within the validated acceptance threshold. Figure 8 presents the AI safety case structure for this case study, with the following elements:

• Claim: AI+Human is safe.
• Argument 1: AI+Human is "no worse than" Human+Human; the marginal risk (e.g., consistency risk of AI+Human) is reduced by 0.2%.
• Argument 2: AI+Human performance is acceptable within the predefined tolerance threshold; the inconsistency rate of AI+Human (2.8%) is below the acceptance threshold (5%).
• Evidence 1: comparative evaluation dataset (e.g., 200 synthetic tender cases; observed inconsistency rates: AI+Human = 2.8%, Human+Human = 3.0%), difference (gap) analysis report, and gap analysis summary.
• Evidence 2: acceptance-criteria table (e.g., threshold δ = 5% derived from the 95th percentile of historical human–human variability) and governance artefacts (approval memo documenting threshold justification and alignment with procurement policy).

Figure 8: The AI safety case for the AI-based tender evaluation system.
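To make the statistical reasoning behind Arguments 1 and 2 concrete, the sketch below runs a bootstrap non-inferiority check on synthetic per-item inconsistency indicators constructed to match the reported rates (2.8% vs 3.0%). The item count, random seed, and aggregation choices are illustrative assumptions, not the study's actual analysis scripts.

```python
import random

# Synthetic per-item inconsistency indicators matching the reported rates
# (AI+Human 2.8%, Human+Human 3.0%). The item count here is illustrative,
# not the study's actual 200-case records.
ai_human = [1] * 28 + [0] * 972
human_human = [1] * 30 + [0] * 970

DELTA_MAX = 0.05   # acceptance threshold delta = 5%
ALPHA = 0.95       # required one-sided confidence
N_BOOT = 2_000     # bootstrap replicates

rng = random.Random(0)

def rate(xs):
    return sum(xs) / len(xs)

# Bootstrap: resample items with replacement to quantify uncertainty in
# the AI+Human inconsistency rate and in the marginal risk Delta R.
boot_ai, boot_delta = [], []
for _ in range(N_BOOT):
    a = rng.choices(ai_human, k=len(ai_human))
    h = rng.choices(human_human, k=len(human_human))
    boot_ai.append(rate(a))
    boot_delta.append(rate(a) - rate(h))

boot_ai.sort()
boot_delta.sort()
upper_ai = boot_ai[int(ALPHA * N_BOOT)]        # 95% upper bound, AI rate
upper_delta = boot_delta[int(ALPHA * N_BOOT)]  # 95% upper bound, Delta R

print(f"observed marginal risk Delta R: {rate(ai_human) - rate(human_human):+.3f}")
print(f"95% upper bound on AI+Human inconsistency: {upper_ai:.3f}")
print(f"below acceptance threshold: {upper_ai <= DELTA_MAX}")
```

The decision rule is the one the case study describes: the claim passes only if the confidence bound, not merely the point estimate, stays below the pre-declared threshold δ.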
7 Limitations and Future Work

This study aims to establish reusable, composable templates for AI safety cases that reflect the empirical and evolving nature of AI systems. Several limitations remain.

7.1 Scalability of dynamic safety cases

Dynamic safety cases require continuous ingestion, verification, and version tracking of evidence. For continuously evolving systems, achieving through-life safety assurance is challenging due to the fundamental mismatch between rapid, non-deterministic AI evolution and the need for rigorous assurance documentation [34,35,36]. Key limitations include continuous evidence overload from real-time operational data, the computational complexity of automatically updating argument structures when AI behaviour changes, and the lack of unified standards and robust automation infrastructure for managing dynamic evidence and argument revision. Implementing dynamic safety cases at the scale of frontier AI systems will require automation infrastructure, tooling standards, and agreement on trusted data pipelines for live safety indicators.

7.2 Interoperability across regulatory regimes

Different jurisdictions emphasise distinct aspects of AI safety, such as capability thresholds, model access controls, or human oversight. Ensuring that reusable templates can be parameterised for diverse regulatory contexts without fragmentation is challenging due to divergent regulatory philosophies (e.g., the EU AI Act [56]), varied evidential standards and levels of assurance across sectors [57], and the difficulty of translating high-level legal requirements into structured, auditable technical claims within CAE templates [58]. Achieving effective interoperability requires developing modular template components that can be selectively activated and customised for targeted jurisdictions and required assurance levels.
7.3 Evaluation reproducibility

Assurance for complex AI systems heavily depends on empirical evidence, such as red-teaming reports, adversarial testing results, and benchmark scores. However, the lack of standardisation in these evaluation methods makes the resulting safety evidence difficult to verify or reproduce. Additionally, fragmented evidence repositories prevent objective comparison of model safety performance against industry baselines, undermining the consistency of template-based safety claims. Improving reproducibility requires community consensus on standardised testing protocols, shared repositories of safety benchmarks and incident data, and open verification protocols to build trust in empirical assurance.

7.4 Relationship to verification processes

This paper treats verification outcomes (e.g., test results, certificates, proofs, or monitoring summaries) as evidence inputs to an AI safety case, but does not model the verification processes that produce them. In particular, we do not characterise how behavioural properties are generated, checked, or enforced at design time or runtime. These concerns are increasingly addressed by verifiability-first AI engineering approaches [59], which focus on structuring systems to enable the verification of behavioural claims at scale. The contribution of this paper is complementary. It focuses on how verified artefacts, once available, are assembled, contextualised, and justified within a structured safety argument.

7.5 Future directions

Further research should operationalise the templates into a tool-supported Safety Case Pattern Library, enabling instantiation and continuous management across AI lifecycles.
Key directions include formalising uncertainty propagation across reasoning modes, developing composite safety metrics for dynamic systems, and standardising comparator-based marginal-risk evaluations, all of which represent promising steps for both scientific and regulatory progress. In addition, three priorities emerge for our next steps.

First, using LLMs to automatically generate AI safety cases or CAE templates could significantly support dynamic safety case construction. Recent studies (e.g., S14, S51, S53, S74) show that LLMs can generate the structural components of safety case elements (e.g., CAEs) and accelerate early safety documentation, though they struggle to achieve high accuracy and provide traceable evidence without human review. This points to a hybrid human-AI assurance paradigm in which LLMs serve as drafting accelerators, while humans validate the results.

Second, integrating AI risk management and assessment frameworks [25] and operational data (e.g., guardrail logs [39], evaluation results [18], and AgentOps observability data [26]) across the AI safety case pipeline could provide a more comprehensive evidence repository to support the full lifecycle of AI safety case construction, verification, and governance (as illustrated in the Figure 3 ecosystem pipeline).

Third, developing systematic methods for mapping broad legal requirements into specific, measurable technical claims using CAE templates (e.g., [58]) would bridge the gap between regulatory mandates and technical assurance. These directions represent critical steps toward making rigorous, evidence-based safety assurance practical and scalable for frontier AI systems.

8 Conclusion

In this study, we presented a reusable safety-case template structure for AI systems. It uses a clear Claims-Arguments-Evidence (CAE) approach. It also provides AI-specific taxonomies for claim types, argument forms, and evidence families.
These elements help practitioners build safety arguments in a consistent and reviewable way. Our proposed templates address recurring assurance issues in many AI deployments. We include end-to-end patterns for evaluation without ground truth, managing dynamic model updates, and making threshold-based risk decisions. This extends safety-case practice to areas where traditional safety cases often do not fit well. It also supports safety cases that can be checked and updated as the system changes. In addition, we embedded the templates into a continuous assurance pipeline. This pipeline links safety claims to live metrics and governance artefacts for ongoing monitoring. We applied the approach in a real-world case study on a government AI-based tender evaluation system. The case study shows how the templates can be used in practice and what evidence can be collected to support key claims.

Overall, this template-based approach provides a reusable foundation for AI safety cases that must evolve over time. However, it is not a complete solution. Key remaining challenges include scaling dynamic safety cases, supporting different regulatory settings, and improving the reproducibility of evaluations.

Appendix A Seed Papers

Table 9: Snowballing seed papers for AI safety case (paper title — key focus and selection rationale).

• A Sketch of an AI Control Safety Case — Proposes a structured argument for controlling advanced AI systems, illustrating how control requirements map to claims, arguments, and evidence.
• GPT-4 and Safety Case Generation: An Exploratory Analysis — Explores whether GPT-4 can help generate safety case components, identifying strengths, limitations, and automation opportunities.
• Safety Case Template for Frontier AI: A Cyber Inability Argument — Introduces a template based on "cyber inability" reasoning, arguing that AI systems must be demonstrably unable to perform unsafe actions.
Safety Case Templates for Autonomous Systems
  Provides reusable safety case templates for autonomy, emphasising hazard analysis, assurance patterns, and argument structures.
An Alignment Safety Case Sketch Based on Debate
  Uses debate-based alignment methods as the core argument structure for an AI alignment safety case.
The BIG Argument for AI Safety Cases
  Presents the BIG (Balanced, Integrated and Grounded) argument pattern as a scalable foundation for justifying AI safety.
Safety Cases: A Scalable Approach to Frontier AI Safety
  Argues that safety cases can serve as a scalable governance tool for frontier AI models, outlining principles for systematic justification.
Safety Cases for Frontier AI
  Discusses how safety cases can be adapted for frontier AI governance, emphasising transparency, auditability, and risk scenario coverage.
Safety Cases: How to Justify the Safety of Advanced AI Systems
  Introduces guidance for constructing safety cases for advanced AI, focusing on claims, evidence credibility, and structured reasoning.

Appendix B List of Selected Studies

[S1] S. Burton, “A causal model of safety assurance for machine learning,” arXiv preprint arXiv:2201.05451, 2022.
[S2] D. Tola and P. G. Larsen, “A co-simulation based approach for developing safety-critical systems,” in Proceedings of the 18th International Overture Workshop, vol. 65, 2021.
[S3] A. Rudolph, S. Voget, and J. Mottok, “A consistent safety case argumentation for artificial intelligence in safety related automotive systems,” in ERTS 2018, 2018.
[S4] B. Herd, J.-V. Zacchi, and S. Burton, “A deductive approach to safety assurance: Formalising safety contracts with subjective logic,” in International Conference on Computer Safety, Reliability, and Security. Springer Nature Switzerland Cham, 2024, p. 213–226.
[S5] Y. Jia, C. Verrill, K. White, M. Dolton, M. Horton, M. Jafferji, and I.
Habli, “A deployment safety case for ai-assisted prostate cancer diagnosis,” Computers in biology and medicine, vol. 192, p. 110237, 2025.
[S6] E. Denney and G. Pai, “A formal basis for safety case patterns,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2013, p. 21–32.
[S7] J. Krook, Y. Selvaraj, W. Ahrendt, and M. Fabian, “A formal-methods approach to provide evidence in automated-driving safety cases,” arXiv preprint arXiv:2210.07798, 2022.
[S8] N. Hayama, Y. Yamagata, H. Nishihara, and Y. Matsuno, “A gsn-based requirement analysis of the eu AI regulation,” in International Conference on Computer Safety, Reliability, and Security. Springer Nature Switzerland Cham, 2025, p. 183–196.
[S9] P. Bishop and R. Bloomfield, “A methodology for safety case development,” in Safety and Reliability, vol. 20, no. 1. Taylor & Francis, 2000, p. 34–42.
[S10] Z. Porter, I. Habli, J. McDermid, and M. Kaas, “A principles-based ethics assurance argument pattern for AI and autonomous systems,” AI and Ethics, vol. 4, no. 2, p. 593–616, 2024.
[S11] O. Odu, A. B. Belle, S. Wang, and K. K. Shahandashti, “A prisma-driven bibliometric analysis of the scientific literature on assurance case patterns,” arXiv preprint arXiv:2407.04961, 2024.
[S12] K. K. Shahandashti, A. B. Belle, T. C. Lethbridge, O. Odu, and M. Sivakumar, “A prisma-driven systematic mapping study on system assurance weakeners,” Information and Software Technology, vol. 175, p. 107526, 2024.
[S13] M. Gyllenhammar, G. R. d. Campos, and M. Törngren, “A safety argument fragment towards safe deployment of performant automated driving systems,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2025, p. 197–210.
[S14] R. Potham, “A safety case for a deployed llm: Corrigibility as a singular target via debate,” 2025.
[S15] E. Wozniak, C. Cârlan, E. Acar-Celik, and H. J.
Putzer, “A safety case pattern for systems with machine learning components,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2020, p. 370–382.
[S16] T. Korbak, J. Clymer, B. Hilton, B. Shlegeris, and G. Irving, “A sketch of an AI control safety case,” arXiv preprint arXiv:2501.17315, 2025.
[S17] R. Wei, S. Foster, H. Mei, F. Yan, R. Yang, I. Habli, C. O’Halloran, N. Tudor, T. Kelly, and Y. Nemouchi, “Access: Assurance case centric engineering of safety-critical systems,” Journal of Systems and Software, vol. 213, p. 112034, 2024.
[S18] S. Burton and B. Herd, “Addressing uncertainty in the safety assurance of machine-learning,” Frontiers in computer science, vol. 5, p. 1132580, 2023.
[S19] V. J. Hodge and M. Osborne, “Agile development for safety assurance of machine learning in autonomous systems (agileamlas),” Array, p. 100482, 2025.
[S20] B. Kaiser, M. Soden, R. Diefenbach, and E. Holz, “An agile approach to safety cases for autonomous systems through model-based engineering and simulation,” in Proc. 33rd Saf. Crit. Syst. Symp, 2025.
[S21] M. D. Buhl, J. Pfau, B. Hilton, and G. Irving, “An alignment safety case sketch based on debate,” arXiv preprint arXiv:2505.03989, 2025.
[S22] F. R. Ward and I. Habli, “An assurance case pattern for the interpretability of machine learning in safety-critical systems,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2020, p. 395–407.
[S23] J. Clymer, J. Weinbaum, R. Kirk, K. Mai, S. Zhang, and X. Davies, “An example safety case for safeguards against misuse,” arXiv preprint arXiv:2505.18003, 2025.
[S24] S. Nair, J. L. De La Vara, M. Sabetzadeh, and L. Briand, “An extended systematic literature review on provision of evidence for safety certification,” Information and Software Technology, vol. 56, no. 7, p. 689–717, 2014.
[S25] U. D. Ferrell and A. H. A.
Anderegg, “Applicability of ul 4600 to unmanned aircraft systems (uas) and urban air mobility (uam),” in 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC). IEEE, 2020, p. 1–7.
[S26] J. P. C. de Araujo, B. V. Balu, E. Reichmann, J. Kelly, S. Kugele, N. Mata, and L. Grunske, “Applying concept-based models for enhanced safety argumentation,” in 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2024, p. 272–283.
[S27] T. P. Kelly, “Arguing safety: a systematic approach to safety case management,” Department of Computer Science, The University of York, 1998.
[S28] S. Barrett, P. Fox, J. Krook, T. Mondal, S. Mylius, and A. Tlaie, “Assessing confidence in frontier AI safety cases,” arXiv preprint arXiv:2502.05791, 2025.
[S29] R. Kaur, R. Ivanov, M. Cleaveland, O. Sokolsky, and I. Lee, “Assurance case patterns for cyber-physical systems with deep neural networks,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2020, p. 82–97.
[S30] T. Chowdhury, “Assurance case templates: Principles for their development and criteria for their evaluation,” Ph.D. dissertation, McMaster University, 2021.
[S31] V. Mussot, E. Jenn, F. Chenevier, R. C. Laguna, Y. I. Messaoud, J.-L. Farges, A. F. Pires, F. Latombe, and S. Creff, “Assurance cases to face the complexity of ml-based systems verification,” in Embedded Real Time System Congress, ERTS’24, 2024.
[S32] R. Bloomfield and J. Rushby, “Assurance of AI systems from a dependability perspective,” arXiv preprint arXiv:2407.13948, 2024.
[S33] E. Denney and G. Pai, “Assurance-driven design of machine learning-based functionality in an aviation systems context,” in 2023 IEEE/AIAA 42nd Digital Avionics Systems Conference (DASC). IEEE, 2023, p. 1–10.
[S34] A. Wardziński and A. Jarzębowicz, “Automated generation of modular assurance cases with the system assurance reference model,” Formal Aspects of Computing, vol. 36, no. 4, p. 1–29, 2024.
[S35] O. Odu, A. B.
Belle, S. Wang, S. Kpodjedo, T. C. Lethbridge, and H. Hemmati, “Automatic instantiation of assurance cases from patterns using large language models,” Journal of Systems and Software, vol. 222, p. 112353, 2025.
[S36] C. Cârlan, L. Gauerhof, B. Gallina, and S. Burton, “Automating safety argument change impact analysis for machine learning components,” in 2022 IEEE 27th Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 2022, p. 43–53.
[S37] S. Diemert, E. Cyffka, N. Anwari, O. Foster, T. Viger, L. Millet, and J. Joyce, “Balancing the risks and benefits of using large language models to support assurance case development,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2025, p. 209–225.
[S38] C. Cârlan, “Checkable safety arguments: a modeling framework supporting the maintenance of safety arguments consistent with system development artifacts,” Ph.D. dissertation, Technische Universität München, 2025.
[S39] Y. Idmessaoud, D. Dubois, and J. Guiochet, “Confidence assessment in safety argument structure: quantitative vs. qualitative approaches,” International Journal of Approximate Reasoning, vol. 165, p. 109100, 2024.
[S40] M. Sivakumar, “Design and automatic generation of safety cases of ml-enabled autonomous driving systems,” Master’s thesis, York University, 2024.
[S41] M. Sivakumar, A. B. Belle, J. Shan, O. Odu, and M. Yuan, “Design of the safety case of the reinforcement learning-enabled component of a quanser autonomous vehicle,” in 2024 IEEE 32nd International Requirements Engineering Conference Workshops (REW). IEEE, 2024, p. 57–67.
[S42] E. Mirzaei, C. Cârlan, C. Thomas, and B. Gallina, “Design-time specification of dynamic modular safety cases in support of run-time safety assessment,” in Proceedings of the 30th Safety-Critical Systems Symposium (SCSC), 2022.
[S43] R. Hawkins, “Developing compelling safety cases,” arXiv preprint arXiv:2502.00911, 2025.
[S44] E. Asaadi, E.
Denney, J. Menzies, G. J. Pai, and D. Petroff, “Dynamic assurance cases: a pathway to trusted autonomy,” Computer, vol. 53, no. 12, p. 35–46, 2020.
[S45] S. Ramakrishna, “Dynamic safety assurance of autonomous cyber-physical systems,” Ph.D. dissertation, Vanderbilt University, 2022.
[S46] C. Cârlan, F. Gomez, Y. Mathew, K. Krishna, R. King, P. Gebauer, and B. R. Smith, “Dynamic safety cases for frontier ai,” arXiv preprint arXiv:2412.17618, 2024.
[S47] M. Borg, J. Henriksson, K. Socha, O. Lennartsson, E. Sonnsjö Lönegren, T. Bui, P. Tomaszewski, S. R. Sathyamoorthy, S. Brink, and M. Helali Moghadam, “Ergo, smirk is safe: a safety case for a machine learning component in a pedestrian automatic emergency brake system,” Software quality journal, vol. 31, no. 2, p. 335–403, 2023.
[S48] C. Burr and D. Leslie, “Ethical assurance: a practical approach to the responsible design, development, and deployment of data-driven technologies,” AI and Ethics, vol. 3, no. 1, p. 73–98, 2023.
[S49] J. Roßbach, O. De Candido, A. Hammam, and M. Leuschel, “Evaluating ai-based components in autonomous railway systems: A methodology,” in German Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, 2024, p. 190–203.
[S50] F. Ikhwantri and D. Marijan, “Explainable compliance detection with multi-hop natural language inference on assurance case structure,” arXiv preprint arXiv:2506.08713, 2025.
[S51] M. Sivakumar, A. B. Belle, J. Shan, and K. K. Shahandashti, “Exploring the capabilities of large language models for the generation of safety cases: the case of gpt-4,” in 2024 IEEE 32nd International Requirements Engineering Conference Workshops (REW). IEEE, 2024, p. 35–45.
[S52] I. Sljivo, B. Gallina, J. Carlson, and H. Hansson, “Generation of safety case argument-fragments from safety contracts,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2014, p. 170–185.
[S53] M. Sivakumar, A. B. Belle, J. Shan, and K. K.
Shahandashti, “Gpt-4 and safety case generation: An exploratory analysis,” arXiv preprint arXiv:2312.05696, 2023.
[S54] M. Chelouati, A. Boussif, J. Beugin, and E.-M. El Koursi, “Graphical safety assurance case using goal structuring notation (gsn)—challenges, opportunities and a framework for autonomous trains,” Reliability Engineering & System Safety, vol. 230, p. 108933, 2023.
[S55] I. Sljivo, E. Denney, and J. Menzies, “Guided integration of formal verification in assurance cases,” in International Conference on Formal Engineering Methods. Springer, 2023, p. 172–190.
[S56] R. Hawkins and P. Ryan Conmy, “Identifying run-time monitoring requirements for autonomous systems through the analysis of safety arguments,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2023, p. 11–24.
[S57] G. Despotou, S. White, T. Kelly, and M. Ryan, “Introducing safety cases for health it,” in 2012 4th International Workshop on Software Engineering in Health Care (SEHC). IEEE, 2012, p. 44–50.
[S58] M. J. Squair, “Issues in the application of software safety standards,” in ACM International Conference Proceeding Series, vol. 162, 2006, p. 13–26.
[S59] A. Sabuncuoglu, C. Burr, and C. Maple, “Justified evidence collection for argument-based AI fairness assurance,” in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2025, p. 18–28.
[S60] A. Agrawal, S. Khoshmanesh, M. Vierhauser, M. Rahimi, J. Cleland-Huang, and R. Lutz, “Leveraging artifact trees to evolve and reuse safety cases,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, p. 1222–1233.
[S61] Y. Fujiwara, T. Tuchida, R. Miyata, H. Washizaki, and N. Ubayashi, “Llm-based automated mitigation and assurance case generation against threats to AI systems,” in 2025 IEEE Conference on Artificial Intelligence (CAI). IEEE, 2025, p. 906–911.
[S62] O. Odu, A. B. Belle, and S.
Wang, “Llm-based safety case generation for baidu apollo: Are we there yet?” in 2025 IEEE/ACM 4th International Conference on AI Engineering–Software Engineering for AI (CAIN). IEEE, 2025, p. 222–233.
[S63] S. Burton, L. Gauerhof, and C. Heinzemann, “Making the case for safety of machine learning in highly automated driving,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2017, p. 5–16.
[S64] R. Dassanayake, M. Demetroudi, J. Walpole, L. Lentati, J. R. Brown, and E. J. Young, “Manipulation attacks by misaligned ai: Risk analysis and safety case framework,” arXiv preprint arXiv:2507.12872, 2025.
[S65] C.-L. Lin, W. Shen, S. Drager, and B. Cheng, “Measure confidence of assurance cases in safety-critical domains,” in Proceedings of the 40th International Conference on Software Engineering: New Ideas and Emerging Results, 2018, p. 13–16.
[S66] J. Murdoch, G. Clark, A. Powell, and P. Caseley, “Measuring safety: applying psm to the system safety domain,” in Proceedings of the 8th Australian workshop on Safety critical systems and software - Volume 33, 2003, p. 47–55.
[S67] M. A. Langford, K. H. Chan, J. E. Fleck, P. K. McKinley, and B. H. Cheng, “MoDALAS: addressing assurance for learning-enabled autonomous systems in the face of uncertainty,” Software and systems modeling, vol. 22, no. 5, p. 1543–1563, 2023.
[S68] L. Nascimento, A. L. de Oliveira, R. Villela, E. F. Silva, R. Wei, R. Hawkins, and T. Kelly, “Model-based security assurance cases for open and adaptive cyber-physical systems,” in International Conference on Advanced Information Networking and Applications. Springer, 2025, p. 326–340.
[S69] A. Retouniotis, Y. Papadopoulos, I. Sorokos, D. Parker, N. Matragkas, and S. Sharvia, “Model-connected safety cases,” in International Symposium on Model-Based Safety and Assessment. Springer, 2017, p. 50–63.
[S70] D. H.
Becht, “Moving towards goal-based safety management,” in Proceedings of the Australian System Safety Conference - Volume 133, 2011, p. 19–26.
[S71] C.-H. Cheng, C.-H. Huang, and G. Nührenberg, “nn-dependability-kit: Engineering neural networks for safety-critical autonomous driving systems,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, p. 1–6.
[S72] C. Preschern, N. Kajtazovic, A. Höller, and C. Kreiner, “Pattern-based safety development methods: overview and comparison,” in Proceedings of the 19th European Conference on Pattern Languages of Programs, 2014, p. 1–20.
[S73] N. Annable, M. Lawford, R. F. Paige, and A. Wassyng, “Principled safety assurance arguments,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2025, p. 18–32.
[S74] M. Sivakumar, A. B. Belle, J. Shan, and K. K. Shahandashti, “Prompting gpt-4 to support automatic safety case generation,” Expert Systems with Applications, vol. 255, p. 124653, 2024.
[S75] A. Di Sandro, S. Kokaly, R. Salay, and M. Chechik, “Querying automotive system models and safety artifacts with mmint and viatra,” in 2019 ACM/IEEE 22nd international conference on model driven engineering languages and systems companion (MODELS-c). IEEE Computer Society, 2019, p. 2–11.
[S76] Y. Dong, W. Huang, V. Bharti, V. Cox, A. Banks, S. Wang, X. Zhao, S. Schewe, and X. Huang, “Reliability assessment and safety arguments for machine learning components in system assurance,” ACM transactions on embedded computing systems, vol. 22, no. 3, p. 1–48, 2023.
[S77] L. Buysse, I. Habli, D. Vanoost, and D. Pissoort, “Safe autonomous systems in a changing world: Operationalising dynamic safety cases,” Safety Science, vol. 191, p. 106965, 2025.
[S78] S. Ballingall, M. Sarvi, and P. Sweatman, “Safety assurance concepts for automated driving systems,” SAE International Journal of Advances and Current Practices in Mobility, vol. 2, no. 2020-01-0727, p. 1528–1537, 2020.
[S79] H. Fujino, N.
Kobayashi, and S. Shirasaka, “Safety assurance case description method for systems incorporating off-operational machine learning and safety device,” in INCOSE International Symposium, vol. 29. Wiley Online Library, 2019, p. 152–164.
[S80] C. Paterson, R. Hawkins, C. Picardi, Y. Jia, R. Calinescu, and I. Habli, “Safety assurance of machine learning for autonomous systems,” Reliability Engineering & System Safety, vol. 264, p. 111311, 2025.
[S81] S. Burton, I. Kurzidem, A. Schwaiger, P. Schleiss, M. Unterreiner, T. Graeber, and P. Becker, “Safety assurance of machine learning for chassis control functions,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2021, p. 149–162.
[S82] S. Burton, C. Hellert, F. Hüger, M. Mock, and A. Rohatschek, “Safety assurance of machine learning for perception functions,” in Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety. Springer International Publishing Cham, 2022, p. 335–358.
[S83] T. P. Kelly and J. A. McDermid, “Safety case construction and reuse using patterns,” in Safe Comp 97: The 16th International Conference on Computer Safety, Reliability and Security. Springer, 1997, p. 55–69.
[S84] C. Cârlan, B. Gallina, and L. Soima, “Safety case maintenance: a systematic literature review,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2021, p. 115–129.
[S85] AISI UK, “Safety case template for inability arguments,” AISI UK, Tech. Rep., 2024. [Online]. Available: https://www.aisi.gov.uk/blog/safety-case-template-for-inability-arguments
[S86] A. Goemans, M. D. Buhl, J. Schuett, T. Korbak, J. Wang, B. Hilton, and G. Irving, “Safety case template for frontier ai: A cyber inability argument,” arXiv preprint arXiv:2411.08088, 2024.
[S87] R. Bloomfield, G. Fletcher, H. Khlaaf, L. Hinde, and P.
Ryan, “Safety case templates for autonomous systems,” arXiv preprint arXiv:2102.02625, 2021.
[S88] R. Alexander, M. Hall-May, T. Kelly, and J. A. McDermid, “Safety cases for advanced control software: Final report,” University of York, York, United Kingdom, Final Report, Jun. 2007. Approved for public release; distribution unlimited.
[S89] M. D. Buhl, G. Sett, L. Koessler, J. Schuett, and M. Anderljung, “Safety cases for frontier ai,” arXiv preprint arXiv:2410.21572, 2024.
[S90] B. Hilton, M. D. Buhl, T. Korbak, and G. Irving, “Safety cases: A scalable approach to frontier AI safety,” arXiv preprint arXiv:2503.04744, 2025.
[S91] J. Clymer, N. Gabrieli, D. Krueger, and T. Larsen, “Safety cases: How to justify the safety of advanced AI systems,” arXiv preprint arXiv:2403.10462, 2024.
[S92] S. Diemert, L. Millet, J. Groves, and J. Joyce, “Safety integrity levels for artificial intelligence,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2023, p. 397–409.
[S93] Y. Jia, T. Lawton, J. Burden, J. McDermid, and I. Habli, “Safety-driven design of machine learning for sepsis treatment,” Journal of Biomedical Informatics, vol. 117, p. 103762, 2021.
[S94] S. Jahan, A. Marshall, and R. Gamble, “Self-adaptation strategies to maintain security assurance cases,” in 2018 IEEE 12th International Conference on Self-Adaptive and Self-Organizing Systems (SASO). IEEE, 2018, p. 180–185.
[S95] Y. Matsuno, F. Ishikawa, and S. Tokumoto, “Tackling uncertainty in safety assurance for machine learning: continuous argument engineering with attributed tests,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2019, p. 398–404.
[S96] I. Habli, R. Hawkins, C. Paterson, P. Ryan, Y. Jia, M. Sujan, and J. McDermid, “The big argument for AI safety cases,” arXiv preprint arXiv:2503.11705, 2025.
[S97] R. Salay, K. Czarnecki, H. Kuwajima, H. Yasuoka, V. Abdelzad, C. Huang, M. Kahn, V. D. Nguyen, and T.
Nakae, “The missing link: Developing a safety case for perception components in automated driving,” SAE International Journal of Advances and Current Practices in Mobility, vol. 5, no. 2022-01-0818, p. 567–579, 2022.
[S98] F. Wolny, S. Vock, T. Holoyad, and R. Adler, “The need for systematic approaches in risk assessment of safety-critical ai-applications in machinery,” in European Safety and Reliability & Society for Risk Analysis Europe Conference ESREL SRA-E, 2025.
[S99] M. Wagner and C. Carlan, “The open autonomy safety case framework,” arXiv preprint arXiv:2404.05444, 2024.
[S100] N. Leveson, “The use of safety cases in certification and regulation,” Journal of System Safety, 2011. Aeronautics and Astronautics/Engineering Systems, MIT.
[S101] R. Grosse, “Three sketches of asl-4 safety case components,” Anthropic Alignment Science Blog, 2024.
[S102] M. Loba, N. F. Salem, M. Nolte, A. Dotzler, D. Ludwig, and M. Maurer, “Toward a harmonized approach: requirement-based structuring of a safety assurance argumentation for automated vehicles,” arXiv preprint arXiv:2505.03709, 2025.
[S103] M. Bagheri, J. Lamp, X. Zhou, L. Feng, and H. Alemzadeh, “Towards developing safety assurance cases for learning-enabled medical cyber-physical systems,” arXiv preprint arXiv:2211.15413, 2022.
[S104] M. Balesni, M. Hobbhahn, D. Lindner, A. Meinke, T. Korbak, J. Clymer, B. Shlegeris, J. Scheurer, C. Stix, R. Shah et al., “Towards evaluations-based safety cases for AI scheming,” arXiv preprint arXiv:2411.03336, 2024.
[S105] E. Stensrud, T. Skramstad, J. Li, and J. Xie, “Towards goal-based software safety certification based on prescriptive standards,” in 2011 First International Workshop on Software Certification. IEEE, 2011, p. 13–18.
[S106] Z. Chen, Y. Deng, and W. Du, “Trusta: Reasoning about assurance cases with formal methods and large language models,” Science of Computer Programming, vol. 244, p. 103288, 2025.
[S107] P.
Koopman, “Ul 4600: What to include in an autonomous vehicle safety case,” Computer, vol. 56, no. 05, p. 101–104, 2023.
[S108] Y. Idmessaoud, J.-L. Farges, E. Jenn, V. Mussot, A. F. Pires, F. Chenevier, and R. C. Laguna, “Uncertainty in assurance case pattern for machine learning,” in Embedded Real Time Systems (ERTS), 2024.
[S109] J. McDermid, Y. Jia, and I. Habli, “Upstream and downstream AI safety: Both on the same river?” arXiv preprint arXiv:2501.05455, 2024.
[S110] K. K. Shahandashti, A. B. Belle, M. M. Mohajer, O. Odu, T. C. Lethbridge, H. Hemmati, and S. Wang, “Using gpt-4 turbo to automatically identify defeaters in assurance cases,” in 2024 IEEE 32nd International Requirements Engineering Conference Workshops (REW). IEEE, 2024, p. 46–56.
[S111] T. Myklebust, T. Stålhane, G. D. Jenssen, and I. Wærø, “Autonomous cars, trust and safety case for the public,” in 2020 Annual Reliability and Maintainability Symposium (RAMS). IEEE, 2020, p. 1–6.
[S112] J. Bragg and I. Habli, “What is acceptably safe for reinforcement learning?” in International Conference on Computer Safety, Reliability, and Security. Springer, 2018, p. 418–430.
Appendix C Individual Quality Assessment Scores for Selected Studies

Study No  QC1 QC2 QC3 QC4 QC5 QC6 QC7 QC8 QC9 QC10 QC11
S1    4 3 2 4 3 3 2 3 3 4 3
S2    4 4 5 3 4 3 5 5 3 4 3
S3    4 3 4 3 5 4 5 4 5 4 3
S4    5 5 4 4 4 4 4 5 5 4 5
S5    5 4 4 5 4 5 4 5 5 4 4
S6    5 4 3 5 4 5 4 5 5 5 4
S7    4 5 4 4 5 5 4 5 4 4 5
S8    5 5 4 4 4 4 4 5 5 4 5
S9    5 5 5 4 3 5 4 5 5 4 4
S10   5 4 4 5 5 4 5 4 5 5 4
S11   5 4 5 4 4 5 4 5 5 5 4
S12   4 5 4 5 5 4 4 5 5 4 4
S13   5 5 4 5 4 5 5 4 5 4 4
S14   5 5 4 3 3 5 4 4 5 3 3
S15   4 4 5 3 4 3 5 5 3 4 3
S16   4 3 4 3 5 4 5 4 5 4 3
S17   3 5 3 5 4 5 3 3 5 3 5
S18   3 3 3 5 4 4 3 5 3 5 3
S19   5 4 3 5 3 4 4 5 3 3 3
S20   5 5 4 3 3 5 4 4 5 3 3
S21   4 4 5 3 4 3 5 5 3 4 3
S22   4 3 4 3 5 4 5 4 5 4 3
S23   5 5 4 4 4 4 4 5 5 4 5
S24   5 4 4 5 4 5 4 5 5 4 4
S25   5 4 3 5 4 5 4 5 5 5 4
S26   4 5 4 4 5 5 4 5 4 4 5
S27   5 5 4 4 4 4 4 5 5 4 5
S28   5 5 5 4 3 5 4 5 5 4 4
S29   4 4 3 4 4 4 3 3 4 2 3
S30   2 2 2 4 2 4 4 4 4 3 3
S31   2 4 2 3 2 4 2 4 4 4 2
S32   4 3 4 3 3 4 2 3 4 3 2
S33   3 4 3 3 2 2 2 3 3 3 2
S34   4 4 3 4 4 4 3 3 4 2 3
S35   2 2 2 4 2 4 4 4 4 3 3
S36   2 4 2 3 2 4 2 4 4 4 2
S37   4 3 4 3 3 4 2 3 4 3 2
S38   3 4 3 3 2 2 2 3 3 3 2
S39   4 4 3 4 4 4 3 3 4 2 3
S40   2 2 2 4 2 4 4 4 4 3 3
S41   2 4 2 3 2 4 2 4 4 4 2
S42   4 3 4 3 3 4 2 3 4 3 2
S43   3 4 3 3 2 2 2 3 3 3 2
S44   4 4 3 4 4 4 3 3 4 2 3
S45   2 2 2 4 2 4 4 4 4 3 3
S46   2 4 2 3 2 4 2 4 4 4 2
S47   4 3 4 3 3 4 2 3 4 3 2
S48   3 4 3 3 2 2 2 3 3 3 2
S49   4 4 3 4 4 4 3 3 4 2 3
S50   2 2 2 4 2 4 4 4 4 3 3
S51   2 4 2 3 2 4 2 4 4 4 2
S52   4 3 4 3 3 4 2 3 4 3 2
S53   3 4 3 3 2 2 2 3 3 3 2
S54   4 4 3 4 4 4 3 3 4 2 3
S55   2 2 2 4 2 4 4 4 4 3 3
S56   2 4 2 3 2 4 2 4 4 4 2
S57   4 3 4 3 3 4 2 3 4 3 2
S58   3 4 3 3 2 2 2 3 3 3 2
S59   4 4 3 4 4 4 3 3 4 2 3
S60   2 2 2 4 2 4 4 4 4 3 3
S61   2 4 2 3 2 4 2 4 4 4 2
S62   4 3 4 3 3 4 2 3 4 3 2
S63   3 4 3 3 2 2 2 3 3 3 2
S64   4 4 3 4 4 4 3 3 4 2 3
S65   2 2 2 4 2 4 4 4 4 3 3
S66   2 4 2 3 2 4 2 4 4 4 2
S67   4 3 4 3 3 4 2 3 4 3 2
S68   3 4 3 3 2 2 2 3 3 3 2
S69   4 4 3 4 4 4 3 3 4 2 3
S70   2 2 2 4 2 4 4 4 4 3 3
S71   2 4 2 3 2 4 2 4 4 4 2
S72   4 3 4 3 3 4 2 3 4 3 2
S73   3 4 3 3 2 2 2 3 3 3 2
S74   4 4 3 4 4 4 3 3 4 2 3
S75   2 2 2 4 2 4 4 4 4 3 3
S76   2 4 2 3 2 4 2 4 4 4 2
S77   4 3 4 3 3 4 2 3 4 3 2
S78   3 4 3 3 2 2 2 3 3 3 2
S79   4 4 3 4 4 4 3 3 4 2 3
S80   2 2 2 4 2 4 4 4 4 3 3
S81   2 4 2 3 2 4 2 4 4 4 2
S82   4 3 4 3 3 4 2 3 4 3 2
S83   3 4 3 3 2 2 2 3 3 3 2
S84   4 4 3 4 4 4 3 3 4 2 3
S85   2 2 2 4 2 4 4 4 4 3 3
S86   2 4 2 3 2 4 2 4 4 4 2
S87   4 3 4 3 3 4 2 3 4 3 2
S88   3 4 3 3 2 2 2 3 3 3 2
S89   4 4 3 4 4 4 3 3 4 2 3
S90   2 2 2 4 2 4 4 4 4 3 3
S91   2 4 2 3 2 4 2 4 4 4 2
S92   4 3 4 3 3 4 2 3 4 3 2
S93   3 4 3 3 2 2 2 3 3 3 2
S94   4 4 3 4 4 4 3 3 4 2 3
S95   2 2 2 4 2 4 4 4 4 3 3
S96   2 4 2 3 2 4 2 4 4 4 2
S97   4 3 4 3 3 4 2 3 4 3 2
S98   3 4 3 3 2 2 2 3 3 3 2
S99   4 4 3 4 4 4 3 3 4 2 3
S100  2 2 2 4 2 4 4 4 4 3 3
S101  2 4 2 3 2 4 2 4 4 4 2
S102  4 3 4 3 3 4 2 3 4 3 2
S103  3 4 3 3 2 2 2 3 3 3 2
S104  4 4 3 4 4 4 3 3 4 2 3
S105  2 2 2 4 2 4 4 4 4 3 3
S106  2 4 2 3 2 4 2 4 4 4 2
S107  4 3 4 3 3 4 2 3 4 3 2
S108  4 4 3 4 4 4 3 3 4 2 3
S109  2 2 2 4 2 4 4 4 4 3 3
S110  2 4 2 3 2 4 2 4 4 4 2
S111  4 3 4 3 3 4 2 3 4 3 2
S112  3 4 3 3 2 2 2 3 3 3 2
References

[1] T. P. Kelly et al., “Arguing safety: a systematic approach to managing safety cases,” Ph.D. dissertation, University of York, York, UK, 1999. [Online]. Available: https://www.jstor.org/stable/44699541
[2] R. Bloomfield and P. Bishop, “Safety and assurance cases: Past, present and possible future: an adelard perspective,” in Making Systems Safer: Proceedings of the Eighteenth Safety-Critical Systems Symposium, Bristol, UK, 9-11th February 2010. Springer, 2009, p. 51–67. [Online]. Available: https://doi.org/10.1007/978-1-84996-086-1_4
[3] P. Königs, “The negativity crisis of AI ethics,” Synthese, vol. 206, p. 277, 2025. [Online]. Available: https://doi.org/10.1007/s11229-025-05378-9
[4] M. D. Buhl, G. Sett, L. Koessler, J. Schuett, and M. Anderljung, “Safety cases for frontier ai,” 2024. [Online]. Available: https://arxiv.org/abs/2410.21572
[5] C. Cârlan, F. Gomez, Y. Mathew, K. Krishna, R. King, P. Gebauer, and B. R. Smith, “Dynamic safety cases for frontier ai,” 2024. [Online]. Available: https://arxiv.org/abs/2412.17618
[6] J. Clymer, J. Weinbaum, R. Kirk, K. Mai, S. Zhang, and X. Davies, “An example safety case for safeguards against misuse,” 2025. [Online]. Available: https://arxiv.org/abs/2505.18003
[7] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” 2022. [Online]. Available: https://arxiv.org/abs/2108.07258
[8] UK MoD, “Safety management requirements for defence systems part 1 requirements,” Ministry of Defence, Defence Standard 00-56 Issue, vol. 4, 2007. [Online]. Available: https://segoldmine.ppi-int.com/taxonomy/term/1166
[9] I. Habli, R. Hawkins, C. Paterson, P. Ryan, Y. Jia, M. Sujan, and J. McDermid, “The big argument for AI safety cases,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11705
[10] R. Bloomfield, G. Fletcher, H. Khlaaf, L. Hinde, and P.
Ryan, “Safety case templates for autonomous systems,” 2021. [Online]. Available: https://arxiv.org/abs/2102.02625
[11] A. Goemans, M. D. Buhl, J. Schuett, T. Korbak, J. Wang, B. Hilton, and G. Irving, “Safety case template for frontier ai: A cyber inability argument,” 2024. [Online]. Available: https://arxiv.org/abs/2411.08088
[12] Adelard, “Claims, arguments & evidence (cae) framework,” 2024. [Online]. Available: https://claimsargumentsevidence.org/notations/claims-arguments-evidence-cae/
[13] Adelard, part of NCC Group, “Ascad: Adelard safety case development manual,” 2024. [Online]. Available: https://www.adelard.com/resources/ascad-manual/
[14] T. Korbak, J. Clymer, B. Hilton, B. Shlegeris, and G. Irving, “A sketch of an AI control safety case,” 2025. [Online]. Available: https://arxiv.org/abs/2501.17315
[15] S. Nair, J. L. De La Vara, M. Sabetzadeh, and L. Briand, “An extended systematic literature review on provision of evidence for safety certification,” Information and Software Technology, vol. 56, no. 7, p. 689–717, 2014. [Online]. Available: https://doi.org/10.1016/j.infsof.2014.03.001
[16] J. Clymer, N. Gabrieli, D. Krueger, and T. Larsen, “Safety cases: How to justify the safety of advanced AI systems,” 2024. [Online]. Available: https://arxiv.org/abs/2403.10462
[17] A. R. Wasil, J. Clymer, D. Krueger, E. Dardaman, S. Campos, and E. R. Murphy, “Affirmative safety: An approach to risk management for high-risk ai,” 2024. [Online]. Available: https://arxiv.org/abs/2406.15371
[18] M. Balesni, M. Hobbhahn, D. Lindner, A. Meinke, T. Korbak, J. Clymer, B. Shlegeris, J. Scheurer, C. Stix, R. Shah, N. Goldowsky-Dill, D. Braun, B. Chughtai, O. Evans, D. Kokotajlo, and L. Bushnaq, “Towards evaluations-based safety cases for AI scheming,” 2024. [Online]. Available: https://arxiv.org/abs/2411.03336
[19] T. Chowdhury, “Assurance case templates: Principles for their development and criteria for their evaluation,” Ph.D. dissertation, McMaster University, 2021. [Online].
Available: http://hdl.handle.net/11375/26941
[20] T. P. Kelly and J. A. McDermid, “Safety case construction and reuse using patterns,” in International Conference on Computer Safety, Reliability and Security, Safe Comp 1997, York, UK, September 7-10, 1997. Springer, 1997, p. 55–69. [Online]. Available: https://doi.org/10.1007/978-1-4471-0997-6_5
[21] R. Alexander, T. Kelly, Z. Kurd, and J. McDermid, “Safety cases for advanced control software: Safety case patterns,” Department of Computer Science, University of York, York, UK, Technical Report ADA491299, 2007. [Online]. Available: https://archive.org/details/DTIC_ADA491299
[22] E. Wozniak, C. Cârlan, E. Acar-Celik, and H. J. Putzer, “A safety case pattern for systems with machine learning components,” in International Conference on Computer Safety, Reliability, and Security, ser. Lecture Notes in Computer Science, vol. 12235. Springer, 2020, p. 370–382. [Online]. Available: https://doi.org/10.1007/978-3-030-55583-2_28
[23] R. Kaur, R. Ivanov, M. Cleaveland, O. Sokolsky, and I. Lee, “Assurance case patterns for cyber-physical systems with deep neural networks,” in International Conference on Computer Safety, Reliability, and Security, ser. Lecture Notes in Computer Science, vol. 12235. Springer, 2020, p. 82–97. [Online]. Available: https://doi.org/10.1007/978-3-030-55583-2_6
[24] Z. Porter, I. Habli, J. A. McDermid, and M. H. L. Kaas, “A principles-based ethics assurance argument pattern for AI and autonomous systems,” AI Ethics, vol. 4, no. 2, p. 593–616, 2024. [Online]. Available: https://doi.org/10.1007/S43681-023-00297-2
[25] NOPSEMA, “The safety case in context: An overview of the safety case regime,” National Offshore Petroleum Safety and Environmental Management Authority, Guidance Note N-04300-GN0060 A86480, Nov 2025.
[26] L. Dong, Q. Lu, and L. Zhu, “Agentops: Enabling observability of llm agents,” 2024. [Online].
Available: https://arxiv.org/abs/2411.05285v2
[27] DSIR, NAIC and CSIRO, “Voluntary AI safety standard: Guiding safe and responsible use of artificial intelligence in Australia,” Australian Government, Tech. Rep., 2024. Published 5 September 2024; updated 2 December 2025. [Online]. Available: https://www.industry.gov.au/publications/voluntary-ai-safety-standard
[28] International Organization for Standardization, “ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system,” International Standard, Geneva, Switzerland, 2023. [Online]. Available: https://www.iso.org/standard/42001
[29] Fairly AI, “AI trust & safety assurance registry,” 2025. [Online]. Available: https://www.fairly.ai/trust
[30] J. Chen, S. Ma, Q. Lu, S. U. Lee, and L. Zhu, “MARIA: A framework for marginal risk assessment without ground truth in AI systems,” 2025. [Online]. Available: https://arxiv.org/abs/2510.27163
[31] OECD, “Thresholds for frontier AI,” web page, 2025. [Online]. Available: https://oecd.ai/en/action-summit-thresholds
[32] Q. Hua, L. Ye, D. Fu, Y. Xiao, X. Cai, Y. Wu, J. Lin, J. Wang, and P. Liu, “Context engineering 2.0: The context of context engineering,” 2025. [Online]. Available: https://arxiv.org/abs/2510.26493
[33] B. Hilton, M. D. Buhl, T. Korbak, and G. Irving, “Safety cases: A scalable approach to frontier AI safety,” 2025. [Online]. Available: https://arxiv.org/abs/2503.04744
[34] E. Asaadi, E. Denney, J. Menzies, G. J. Pai, and D. Petroff, “Dynamic assurance cases: A pathway to trusted autonomy,” Computer, vol. 53, no. 12, pp. 35–46, 2020. [Online]. Available: https://doi.org/10.1109/MC.2020.3022030
[35] S. Ramakrishna, “Dynamic safety assurance of autonomous cyber-physical systems,” Ph.D. dissertation, Vanderbilt University, 2022.
[36] L. Buysse, I. Habli, D. Vanoost, and D. Pissoort, “Safe autonomous systems in a changing world: Operationalising dynamic safety cases,” Safety Science, vol. 191, p. 106965, 2025. [Online].
Available: https://doi.org/10.1016/j.ssci.2025.106965
[37] B. A. Kitchenham, S. Charters, and other Keele staff, “Guidelines for performing systematic literature reviews in software engineering (version 2.3),” Keele University and Durham University Joint Report, Tech. Rep., 2007.
[38] M. Shamsujjoha, J. Grundy, L. Li, H. Khalajzadeh, and Q. Lu, “Developing mobile applications via model driven development: A systematic literature review,” Information and Software Technology, vol. 140, p. 106693, 2021. [Online]. Available: https://doi.org/10.1016/j.infsof.2021.106693
[39] M. Shamsujjoha, Q. Lu, D. Zhao, and L. Zhu, “Swiss cheese model for AI safety: A taxonomy and reference architecture for multi-layered guardrails of foundation model based agents,” in 2025 IEEE 22nd International Conference on Software Architecture (ICSA). IEEE, 2025, pp. 37–48. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICSA65012.2025.00014
[40] M. Petticrew and H. Roberts, Systematic Reviews in the Social Sciences: A Practical Guide. John Wiley & Sons, 2006. [Online]. Available: https://doi.org/10.1002/9780470754887
[41] I. D. Raji and R. Dobbe, “Concrete problems in AI safety, revisited,” 2023. [Online]. Available: https://arxiv.org/abs/2401.10899
[42] M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong et al., “Toward trustworthy AI development: Mechanisms for supporting verifiable claims,” 2020. [Online]. Available: https://arxiv.org/abs/2004.07213
[43] J. Babcock, J. Kramár, and R. Yampolskiy, “The AGI containment problem,” in International Conference on Artificial General Intelligence. Springer, 2016, pp. 53–63. [Online]. Available: https://doi.org/10.1007/978-3-319-41649-6_6
[44] R. Hawkins, T. Kelly, J. Knight, and P. Graydon, “A new approach to creating clear safety arguments,” in Advances in Systems Safety: Proceedings of the Nineteenth Safety-Critical Systems Symposium, Southampton, UK, 8–10 February 2011. Springer, 2010, pp. 3–23. [Online]. Available: https://doi.org/10.1007/978-0-85729-133-2_1
[45] N. G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety. The MIT Press, 2016. [Online]. Available: https://doi.org/10.7551/mitpress/8179.001.0001
[46] A. B. Arrieta, N. Díaz-Rodríguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, R. Chatila, and F. Herrera, “Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI,” Information Fusion, vol. 58, pp. 82–115, 2020. [Online]. Available: https://doi.org/10.1016/j.inffus.2019.12.012
[47] R. Salay, M. Angus, and K. Czarnecki, “A safety analysis method for perceptual components in automated driving,” in 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2019, pp. 24–34. [Online]. Available: https://doi.org/10.1109/ISSRE.2019.00013
[48] J. Rushby, “On the interpretation of assurance case arguments,” in JSAI International Symposium on Artificial Intelligence. Springer, 2015, pp. 331–347. [Online]. Available: https://doi.org/10.1007/978-3-319-50953-2_23
[49] B. Herd, J.-V. Zacchi, and S. Burton, “A deductive approach to safety assurance: Formalising safety contracts with subjective logic,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2024, pp. 213–226. [Online]. Available: https://doi.org/10.1007/978-3-031-68738-9_16
[50] N. Basir, E. Denney, and B. Fischer, “Deriving safety cases from automatically constructed proofs,” in 4th IET International Conference on System Safety 2009, incorporating the SaRS Annual Conference. IET, 2009, pp. 1–6. [Online]. Available: https://doi.org/10.1049/cp.2009.1535
[51] A. E. Oztekin and J. T. Luxhøj, “An inductive reasoning approach for building system safety risk models of aviation accidents,” Journal of Risk Research, vol. 13, no. 4, pp. 479–499, 2010. [Online]. Available: https://doi.org/10.1080/13669870903484344
[52] D. B. Leake, Case-Based Reasoning: Experiences, Lessons and Future Directions. MIT Press, 1996. [Online]. Available: https://dl.acm.org/doi/abs/10.5555/524680
[53] S. Burton, “A causal model of safety assurance for machine learning,” 2022. [Online]. Available: https://arxiv.org/abs/2201.05451
[54] S. Burton and B. Herd, “Addressing uncertainty in the safety assurance of machine-learning,” Frontiers in Computer Science, vol. 5, p. 1132580, 2023. [Online]. Available: https://doi.org/10.3389/fcomp.2023.1132580
[55] E. Denney, G. Pai, and I. Habli, “Towards measurement of confidence in safety cases,” in 2011 International Symposium on Empirical Software Engineering and Measurement. IEEE, 2011, pp. 380–383. [Online]. Available: https://doi.org/10.1109/ESEM.2011.53
[56] European Parliament and Council of the European Union, “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act),” Official Journal of the European Union (OJ L 1689, 12.7.2024), 2024. [Online]. Available: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
[57] N. Leveson, “The use of safety cases in certification and regulation,” Journal of System Safety, 2011. Aeronautics and Astronautics/Engineering Systems, MIT.
[58] N. Hayama, Y. Yamagata, H. Nishihara, and Y. Matsuno, “A GSN-based requirement analysis of the EU AI regulation,” in International Conference on Computer Safety, Reliability, and Security. Springer, 2025, pp. 183–196. [Online]. Available: https://doi.org/10.1007/978-3-032-02018-5_14
[59] L. Zhu and Q. Lu, “Verifiability-first AI engineering in the era of AIware: A conceptual framework, design principles, and architectural patterns for scalable verification,” SSRN, 2025. [Online].
Available: http://dx.doi.org/10.2139/ssrn.6031534