Paper deep dive
From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making
Min Hun Lee
Abstract
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
Links
- Source: https://arxiv.org/abs/2603.18895v1
- Canonical: https://arxiv.org/abs/2603.18895v1
Full Text
Min Hun Lee (mhlee@smu.edu.sg), Singapore Management University, Singapore. 2026. Licensed under CC BY-NC-ND.
Keywords: Human-Centered AI, Human-AI Collaboration, Human–AI Decision Making, Appropriate Reliance, AI Evaluation Metrics, AI Governance
Published in: Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26), April 13–17, 2026, Barcelona, Spain. DOI: 10.1145/3772363.3798377. ISBN: 979-8-4007-2281-3. ACM CCS: Human-centered computing (HCI theory, concepts and models; HCI design and evaluation methods); Computing methodologies (Artificial intelligence).
1. Introduction
Artificial intelligence (AI) systems are increasingly deployed as collaborators rather than autonomous decision-makers, supporting human judgment in high-stakes domains such as healthcare (Cai et al., 2019a; Lee et al., 2020, 2021; Wang et al., 2021) and public services (Kuo et al., 2023; Završnik, 2020). In these settings, AI systems increasingly shape how people interpret evidence, calibrate confidence, allocate responsibility, and ultimately make decisions (Holstein et al., 2023; Lee et al., 2021; Lai et al., 2021; Cai et al., 2021). Over the past several years, empirical Human–AI Interaction (HAI) research has demonstrated that model performance alone is insufficient for safe and effective human–AI collaboration: even highly accurate systems can yield worse human–AI outcomes when users follow incorrect advice, ignore correct advice, or apply inconsistent intervention strategies under uncertainty (He et al., 2023; Bansal et al., 2021a; Buçinca et al., 2021; Chen et al., 2023; Lai et al., 2021).
Complementing these findings, research on accountable and trustworthy AI emphasizes governance mechanisms, such as oversight, contestability, auditing, and responsibility across deployment (Raji et al., 2020; Novelli et al., 2024; Mökander and Floridi, 2022; Kaur et al., 2022; Varshney, 2022; Toreini et al., 2020; Lee and Teo, 2026). Meanwhile, explainable AI (XAI) and interactive ML research has proposed many mechanisms—feature attributions, examples, counterfactuals, rules, and uncertainty estimates—to make model behavior intelligible (Ribeiro et al., 2016; Arya et al., 2019; Doshi-Velez and Kim, 2017; Wang et al., 2019; Rudin, 2019; Crisan et al., 2022; Guo et al., 2022; Kulesza et al., 2011). However, empirical evidence across HAI and XAI suggests these techniques do not reliably improve decision quality by default. Instead, their effects depend on task context, user expertise, timing, and interactions with human intuition and confidence (Chen et al., 2023; He et al., 2023; Lai et al., 2021; Buçinca et al., 2021; Lee and Chew, 2023). Despite progress on mechanisms, evaluation practices remain misaligned with how human–AI systems fail in practice during real-world deployment. Many studies emphasize model accuracy, explanation fidelity, or self-reported trust (Lai et al., 2021; Guo et al., 2024; Rudin, 2019; Ghassemi et al., 2021), implicitly assuming these proxies reflect whether users are ready to collaborate with AI safely and effectively. Yet, trust often poorly predicts reliance behavior, and explanations can increase overreliance by providing a false sense of certainty or legitimacy (Lai et al., 2021; Lee and Chew, 2023; Chen et al., 2023; Bansal et al., 2021a; Ghassemi et al., 2021).
Consequently, real-world failures persist not only due to model error, but due to miscalibrated human reliance—overreliance when AI is wrong, underuse when AI is helpful, and brittle “local” adaptations that do not generalize across cases (He et al., 2023; Buçinca et al., 2021; Lee and Chew, 2023; Chen et al., 2023; Ghassemi et al., 2021). Critically, these failure modes are often invisible when evaluation reports only accuracy, perceived trust, or explanation satisfaction.
Figure 1. We propose a four-part taxonomy of metrics for human–AI onboarding (top) and show how each metric family becomes observable and actionable across the human-AI onboarding lifecycle (Understand–Control–Improve) (bottom): a two-level framework linking four metric families for human–AI decision-making (Outcome, Reliance & Interaction, Safety & Harm, and Learning & Readiness) to three stages of the human–AI onboarding lifecycle: Understand, Control, and Improve.
In this paper, we argue that resolving this gap requires shifting evaluation from “how good is the model?” to “how ready is the human–AI team?”: whether users can recognize failures, calibrate reliance, and remain accountable under realistic constraints (Lai et al., 2021; Bansal et al., 2021a; Buçinca et al., 2021; Chen et al., 2023). We focus on onboarding, calibration, and governance as the early-deployment phase where reliance patterns are formed and where many downstream failures originate (Cai et al., 2019b; Mitchell et al., 2019). Building on this direction, our work reframes onboarding as a measurable learning intervention organized around Understand–Control–Improve (U–C–I), extending recent work on AI onboarding and explanation-supported learning for clinical decision-making (Lee et al., 2024b, a). We treat onboarding broadly as the process through which users learn to work effectively with AI systems in real decision-making settings.
In Understand, users develop mental models of model behavior, boundary conditions, and failure modes through structured practice on curated failure sets and counterfactual examples that reveal how small input changes can flip predictions (Bansal et al., 2019; Verma et al., 2020; Lee and Chew, 2023). In Control, users learn how to calibrate reliance and apply safe interventions using lightweight supports, such as calibration cards, artifacts that summarize when AI predictions are reliable or unreliable (e.g. “when to trust”, “when to double-check”) (Lee and Teo, 2026), common failure modes (Cai et al., 2019b; Lee et al., 2024a), and recommended operating points (e.g. thresholds or escalation rules) (Lee and Tok, 2025), alongside regions-of-no-use, contexts where AI recommendations should not be trusted (Lee et al., 2024b), and safe levers, user-facing controls (e.g. rule/threshold edits) that allow bounded intervention in AI behavior with preview, rollback, and audit trails to support contestability (Kulesza et al., 2015; Amershi et al., 2019; Lyons et al., 2021; Lee et al., 2021; Alfrink et al., 2023; Lee and Teo, 2026) and accountability (Mitchell et al., 2019; Raji et al., 2020; Crisan et al., 2022). In Improve, teams iteratively refine collaboration strategies and governance policies using feedback from newly observed failures to update training content, thresholds, and governance practices (Bansal et al., 2021a; Lai et al., 2021; Lee et al., 2021; Kulesza et al., 2015; Lee and Teo, 2026). We organize these measures around the Understand–Control–Improve (U–C–I) lifecycle. U–C–I describes when key capabilities in human–AI collaboration develop: users first learn model behavior and limitations (Understand), then calibrate how and when AI should be used in practice (Control), and finally refine collaboration strategies and governance policies over time (Improve).
The four metric families describe what should be measured across this lifecycle: Outcome metrics evaluate decision quality, Reliance and Interaction metrics capture how AI advice is adopted or rejected, Safety and Harm metrics identify high-risk collaboration failures, and Learning and Readiness metrics measure how these behaviors evolve over repeated use. Together, the taxonomy makes the U–C–I lifecycle observable, enabling evaluation of how human–AI collaboration evolves over time. Prior work has proposed a variety of measures for evaluating human–AI interaction, including trust, reliance, agreement with model predictions, and decision accuracy (Green and Chen, 2019; Lai et al., 2021; Schemmer et al., 2023; Bansal et al., 2021b). However, these measures are often studied in isolation and therefore do not capture the full lifecycle of human–AI collaboration. We synthesize these existing constructs into four complementary metric families: outcome quality, reliance behavior, safety and harm signals, and learning over time. These categories reflect four practical questions that arise when deploying AI decision-support systems: What happened? How was the AI used? What went wrong? And how does collaboration evolve over time? Building on this framing, we further specify how these metrics can be computed directly from observable interaction traces rather than inferred attitudes or model properties. Our taxonomy spans four complementary classes. (1) Outcome metrics: capture decision quality beyond raw model accuracy, such as team gain and avoidable error (e.g. regret relative to the best achievable human–AI decision), reflecting whether AI involvement ultimately improves or degrades outcomes (Bansal et al., 2021a; Guo et al., 2024). 
(2) Reliance & Interaction metrics: characterize how AI advice shapes human judgments, including accept-on-wrong, changed-to-wrong, override frequency and timing, and reliance slope, which operationalize behavioral calibration and sensitivity to AI correctness (Buçinca et al., 2021; Chen et al., 2023; Lai et al., 2021; Lee and Chew, 2023). (3) Safety & Harm metrics: attribute risk to AI influence and governance breakdowns rather than human error alone, including AI-induced harm, near-misses, contradictions between rules and behavior, and rollback or escalation events (Raji et al., 2020; Ghassemi et al., 2021). (4) Learning & Readiness metrics: assess whether onboarding produces durable skill, such as failure identification, explanation comprehension, and retention or transfer across cases, tasks, or model versions (Cai et al., 2019b; Holstein et al., 2023). These four metric families can be instantiated across a wide range of decision-support settings. For example, in a clinical triage system (Lee and Tok, 2025), outcome metrics measure the accuracy of the final human–AI decision, reliance metrics capture how often clinicians accept or override AI recommendations, safety metrics detect harmful deferrals to incorrect AI predictions, and learning metrics track how reliance evolves across repeated cases. These metrics are not standalone statistics. They are computed from decision traces (e.g. accept, override, change), error attribution (AI-influenced versus independent errors), and learning signals (e.g. pre/post onboarding probes, time-to-calibration, cross-case transfer). As a result, each metric class maps naturally to stages of the Understand–Control–Improve (U–C–I) onboarding lifecycle, where it becomes both observable (during interaction) and actionable (through training, control levers, or governance interventions). This structure moves evaluation beyond accuracy and trust toward cumulative, deployment-relevant evidence of human–AI readiness. 
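The trace-based computation these metrics rely on can be illustrated with a minimal sketch. The record fields (`initial`, `ai`, `final`, `truth`) and the function name are illustrative assumptions, not definitions from the paper; Appendix A gives the formal metric definitions:

```python
# Minimal sketch: reliance metrics computed from interaction traces.
# Each trace records the user's initial decision, the AI recommendation,
# the user's final decision, and the ground-truth label. Field names are
# illustrative, not prescribed by the paper.

def reliance_metrics(traces):
    """Compute accept-on-wrong and changed-to-wrong rates from decision traces."""
    # Accept-on-wrong: among cases where the AI is incorrect, how often
    # does the user's final decision agree with the AI?
    wrong_ai = [t for t in traces if t["ai"] != t["truth"]]
    accept_on_wrong = (
        sum(t["final"] == t["ai"] for t in wrong_ai) / len(wrong_ai)
        if wrong_ai else 0.0
    )
    # Changed-to-wrong: among cases where the user was initially correct,
    # how often is the final decision (after seeing AI advice) incorrect?
    initially_correct = [t for t in traces if t["initial"] == t["truth"]]
    changed_to_wrong = (
        sum(t["final"] != t["truth"] for t in initially_correct)
        / len(initially_correct)
        if initially_correct else 0.0
    )
    return accept_on_wrong, changed_to_wrong

traces = [
    {"initial": "A", "ai": "B", "final": "B", "truth": "A"},  # switched to wrong AI
    {"initial": "A", "ai": "A", "final": "A", "truth": "A"},  # correct, accepted
    {"initial": "B", "ai": "A", "final": "B", "truth": "B"},  # override of wrong AI
]
aow, ctw = reliance_metrics(traces)
```

In this toy log, the AI is wrong on two of three cases and is followed on one of them, so accept-on-wrong is 0.5; one of the three initially correct decisions ends up wrong, so changed-to-wrong is 1/3.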
This framework surfaces a measurement and benchmarking agenda for the human–AI interaction community:
• When does a user become “AI-ready”?
• Which reliance and harm metrics generalize across domains?
• How should governance be evaluated in use—beyond documentation—through behaviors such as contestation, rollback, escalation, and auditability?
Answering these questions would enable cumulative science and more deployment-relevant evidence for safe human–AI collaboration. We argue that progress in human–AI collaboration requires shifting from evaluating models in isolation to evaluating human–AI teams, and from reporting isolated metrics to developing benchmarkable measures of readiness, calibration, and governance.
Positioning: Our work complements prior frameworks for measuring reliance in human–AI systems (Guo et al., 2024) and surveys of human–AI decision-making metrics (Lai et al., 2021). While these works catalog existing measures or analyze reliance behavior, we focus on evaluation during onboarding and early deployment, where reliance patterns are formed and many downstream failures originate. We therefore propose a structured taxonomy of evaluation metrics and map these metrics to actionable stages in the Understand–Control–Improve (U–C–I) lifecycle.
Contribution: We contribute a unified, trace-based evaluation framework for human–AI readiness:
• A metric taxonomy spanning outcomes, reliance, harm, and learning
• Trace-based metric definitions grounded in interaction logs
• A mapping from metrics to actionable U–C–I design interventions (Tables 1–2; Appendix A)
2. Why Accuracy Alone Is Insufficient
2.1. Why Current Evaluation Fails
Despite rapid advances in model performance, many failures of human–AI systems arise after deployment, during everyday use in real workflows.
A growing body of HAI research suggests that this gap is not primarily due to insufficient model accuracy, but to a mismatch between how systems are evaluated and how they are actually used (Lai et al., 2021; Guo et al., 2024). In practice, AI systems are embedded in time pressure, institutional norms, accountability structures, and evolving user strategies, which are rarely reflected in standard evaluation protocols. The following three evaluation assumptions illustrate this mismatch.
2.1.1. Accuracy ≠ Safety
Accuracy measures whether a model’s prediction matches ground truth, but it does not capture the quality of human–AI decisions. In high-stakes settings, such as healthcare, multiple studies show that users may change initially correct judgments to incorrect ones after seeing AI advice, a phenomenon often referred to as AI-induced error or automation bias (Bansal et al., 2021a; Buçinca et al., 2021; Lee and Chew, 2023). These errors are invisible in standard accuracy metrics (e.g. AUROC or F1), which treat AI outputs as independent of human behavior. Moreover, accuracy does not distinguish between errors that users recognize and recover from versus errors that propagate silently into downstream decisions, documentation, or treatment plans (Ghassemi et al., 2021; Buçinca et al., 2021). As a result, systems that appear high-performing in offline benchmarks may still increase harm when integrated into real workflows where AI advice shapes human judgment.
2.1.2. Trust ≠ Reliance
Trust is frequently measured through post-task surveys or Likert-scale questionnaires, yet behavioral evidence consistently shows weak alignment between reported trust and actual reliance (Lai et al., 2021; Buçinca et al., 2021; Chen et al., 2023; Lee and Chew, 2023). Users may report low trust while still following AI recommendations under time pressure, cognitive load, or organizational expectations.
Conversely, users may report high trust while selectively ignoring AI advice in critical or ambiguous cases (Bansal et al., 2021a; Lee and Chew, 2023). This disconnect arises because trust captures attitudes, whereas reliance reflects situated behavior under constraints—including workload, accountability, and perceived risk. Evaluations that rely primarily on trust scores therefore miss when, how, and why users defer to or override AI advice in practice, obscuring important safety and governance concerns.
2.1.3. Performance ≠ Readiness
High task performance during evaluation does not imply that users are prepared for real-world deployment. Short-term performance gains can mask brittle strategies, such as copying AI outputs without understanding underlying uncertainty or failure modes (He et al., 2023; Buçinca et al., 2021; Lee and Chew, 2023). In contrast, readiness depends on whether users can recognize when AI is likely wrong, interpret confidence and uncertainty appropriately, and recover from errors when they occur (Cai et al., 2019b; Holstein et al., 2023; Lee and Chew, 2023; Salimzadeh et al., 2024; Prabhudesai et al., 2023). These capacities (e.g. failure detection, uncertainty interpretation, and error recovery) are rarely measured explicitly, yet they determine whether human–AI systems remain safe over time, under distribution shift, and as models or workflows evolve (Lai et al., 2021; Ghassemi et al., 2021). Together, these gaps point to a fundamental mismatch: we often evaluate AI systems as artifacts optimized for predictive performance, but deploy them as teammates embedded in human workflows. Addressing this mismatch requires evaluation frameworks that capture not only what the AI predicts, but how humans learn to work with it (Cai et al., 2019b), rely on it (Lai et al., 2021; Guo et al., 2024), and govern it over time.
2.2. Reframing Onboarding as a Measurable Process
To address this mismatch, we reframe onboarding not as documentation, demos, or one-off training, but as a measurable learning intervention that prepares users to collaborate with AI safely in real workflows (Cai et al., 2019b; Holstein et al., 2023). Drawing on prior work in human–AI collaboration, explainable AI, learning-by-doing, and AI onboarding for clinical decision-making (Lai et al., 2021; Buçinca et al., 2021; Chen et al., 2023; Cai et al., 2019b; Lee et al., 2024b, a), we conceptualize onboarding as the process through which users acquire durable skills for forming accurate mental models of AI reliability, calibrating reliance, and enacting accountability under realistic constraints. Effective onboarding supports at least four competencies. First, users learn to detect reliability boundaries: when AI is likely correct or incorrect rather than assuming uniform performance across cases, contexts, or subpopulations (Cai et al., 2019b). Second, users learn to calibrate reliance, adjusting when to accept, question, or override AI advice based on evidence and uncertainty cues (Bansal et al., 2021a; Buçinca et al., 2021; Chen et al., 2023). Third, users learn to exercise safe control and contestability, including how to intervene (Kulesza et al., 2015; Guo et al., 2022), escalate ambiguous cases, and use rollback or audit mechanisms when AI advice conflicts with domain judgment or policy requirements (Raji et al., 2020; Mitchell et al., 2019; Mökander and Floridi, 2022). Fourth, users learn to understand delegation and autonomy, recognizing how responsibility shifts between human and AI under different operating modes (e.g., decision support vs. selective deferral) and how these choices affect outcomes and accountability (Bansal et al., 2021a; Guo et al., 2024; Wilder et al., 2020; Holstein et al., 2023; Lee and Tok, 2025).
These abilities cannot be inferred from model properties or self-reported attitudes alone; they must be measured behaviorally through interaction traces over time (e.g. acceptance/override patterns, sensitivity to AI correctness, failure detection rates, and recovery actions across cases and changing conditions) (Lai et al., 2021; Chen et al., 2023; Holstein et al., 2023).
2.3. A Taxonomy of Metrics for Human–AI Onboarding & Decision-Making
Building on empirical findings across healthcare AI onboarding, decision-support evaluation, uncertainty-aware delegation, and accountable AI systems, we propose a taxonomy of metrics that capture complementary aspects of onboarding and collaboration (Cai et al., 2019b; Lai et al., 2021; Guo et al., 2024; Raji et al., 2020; Lee and Chew, 2023; Prabhudesai et al., 2023). Our taxonomy separates four evaluation questions: what happened, how AI was used, what went wrong, and what changed over time—dimensions often conflated or omitted in prior evaluations (Lai et al., 2021; Buçinca et al., 2021; Ghassemi et al., 2021). Full metric definitions and equations are provided in Appendix A.
2.3.1. Outcome Metrics (What happened?)
Outcome metrics capture the quality of final human–AI decisions beyond raw model correctness, reflecting whether AI involvement ultimately improves or degrades task outcomes (Bansal et al., 2021a; Guo et al., 2024). We report: (i) team gain relative to human-only and AI-only baselines, and (ii) regret_best, which quantifies avoidable error relative to an oracle that selects the better of the initial human decision and AI prediction per case (Guo et al., 2024). We further distinguish error recovery vs. error amplification, separating cases where AI helps users correct initial mistakes from cases where AI induces harm that would not otherwise occur (He et al., 2023; Ghassemi et al., 2021).
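As a rough sketch of how team gain and regret_best could be computed from decision traces (the record fields, the simple label-match notion of correctness, and the team-gain operationalization as team accuracy minus the best single-agent baseline are illustrative assumptions; the paper's formal definitions are in Appendix A):

```python
# Minimal sketch: outcome metrics from decision traces. Team gain compares
# final human-AI accuracy against human-only and AI-only baselines; regret_best
# measures avoidable error relative to an oracle that picks, per case, the
# better of the initial human decision and the AI prediction. Field names
# are illustrative, not prescribed by the paper.

def accuracy(decisions, truths):
    return sum(d == t for d, t in zip(decisions, truths)) / len(truths)

def outcome_metrics(traces):
    truths = [t["truth"] for t in traces]
    human_acc = accuracy([t["initial"] for t in traces], truths)
    ai_acc = accuracy([t["ai"] for t in traces], truths)
    team_acc = accuracy([t["final"] for t in traces], truths)
    # Oracle best: a case counts as correct if either the initial human
    # decision or the AI prediction is correct (a reference upper bound,
    # not an operational target).
    oracle_acc = sum(
        t["initial"] == t["truth"] or t["ai"] == t["truth"] for t in traces
    ) / len(traces)
    team_gain = team_acc - max(human_acc, ai_acc)
    regret_best = oracle_acc - team_acc
    return team_gain, regret_best

traces = [
    {"initial": "A", "ai": "B", "final": "B", "truth": "A"},  # AI-induced error
    {"initial": "B", "ai": "A", "final": "A", "truth": "A"},  # error recovery
    {"initial": "A", "ai": "A", "final": "A", "truth": "A"},  # agreement, correct
    {"initial": "B", "ai": "A", "final": "B", "truth": "B"},  # correct override
]
team_gain, regret_best = outcome_metrics(traces)
```

Here the team matches but does not exceed the human-only baseline (zero team gain), while the oracle would have been correct on every case, leaving a nonzero regret that diagnoses a collaboration failure rather than a model limitation.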
Oracle best accuracy is treated as a reference upper bound rather than an operational target, enabling diagnosis of collaboration failures distinct from model limitations (Guo et al., 2024) (Appendix A).
2.3.2. Reliance & Interaction Metrics (How was AI used?)
Reliance metrics characterize how AI advice shapes human decisions, operationalizing behavioral calibration rather than subjective attitudes (Buçinca et al., 2021; Chen et al., 2023; Lai et al., 2021). We track: (i) accept-on-wrong (agreeing with incorrect AI), (ii) changed-to-wrong (switching from a correct human judgment to an incorrect final decision after seeing AI), (iii) override frequency and timing, and (iv) local vs. global update asymmetry (i.e. whether users treat a failure as case-specific or revise their broader mental model of AI reliability) (Wang and Yin, 2023; Liao and Vaughan, 2023). These measures expose overreliance, underuse, and brittle strategies that are invisible in aggregate accuracy (Bansal et al., 2021a) (Appendix A).
2.3.3. Safety & Harm Metrics (What went wrong?)
Safety metrics attribute harm to AI influence and governance breakdowns rather than human error alone (Raji et al., 2020; Mökander and Floridi, 2022; Ghassemi et al., 2021). We include: (i) AI-harm (cases where AI causes a correct initial human decision to become wrong), (ii) near-misses (high-risk disagreements narrowly avoided), and (iii) governance-in-use signals such as contradictions between rules and behavior, rollback events, and escalation actions. These metrics operationalize accountability as enacted behavior rather than documentation alone (Raji et al., 2020; Mitchell et al., 2019) (Appendix A).
2.3.4. Learning & Onboarding Metrics (What changed over time?)
Learning metrics assess whether onboarding produces durable skill (Cai et al., 2019b) rather than transient performance gains. We measure: (i) calibration gap (confidence vs. correctness), (ii) reliance slope (acceptance sensitivity to AI correctness), (iii) stability under distribution shift, and (iv) transfer across tasks, cases, or model versions. These targets operationalize “AI readiness” as a behavioral capability that persists beyond a single evaluation outcome (Lai et al., 2021; Buçinca et al., 2021; Guo et al., 2024) (Appendix A). In operational settings, many of these metrics can be computed directly from interaction logs that record initial human decisions, AI recommendations, and final outcomes. In large-scale deployments, collecting these signals may require event-logging infrastructure similar to observability pipelines used in production ML systems. When ground-truth labels are delayed or expensive, practitioners may estimate some metrics through sampling strategies or proxy signals such as disagreement events or escalation rates. In privacy-sensitive settings, behavioral traces should be collected with appropriate aggregation and anonymization mechanisms.
2.3.5. Calibration & Governance as First-Class Targets
Across domains, outcomes depend less on raw predictive accuracy and more on whether users calibrate reliance, accepting AI when it is likely correct and overriding it when it is likely wrong (Bansal et al., 2021a; Lai et al., 2021; Buçinca et al., 2021). Even highly accurate systems can degrade team performance when users over-rely on incorrect advice or fail to intervene at critical moments (Ghassemi et al., 2021; Lai et al., 2021; Buçinca et al., 2021). Thus, calibration should be treated as a primary evaluation target (e.g. accept-on-wrong, changed-to-wrong, reliance slope, calibration gap), not a byproduct of explainability. Governance mechanisms (e.g.
model cards, audit trails, policies) are necessary but insufficient on their own: accountability is enacted through everyday interaction, including how users contest AI, justify overrides, escalate cases, or roll back edits (Mitchell et al., 2019; Raji et al., 2020; Mökander and Floridi, 2022; Lee and Teo, 2026). Behavioral signals such as rollback frequency, escalation behavior, contradiction detection, and intervention latency provide empirical evidence of “governance in use,” enabling assessment beyond documentation (Raji et al., 2020; Ghassemi et al., 2021).
2.3.6. Open Benchmarking Questions
Taken together, our framework raises foundational benchmarking questions for the HCI and HAI community:
• When is a user “AI-ready”? What behavioral criteria indicate readiness for deployment, beyond short-term task performance?
• Which onboarding metrics generalize across domains? Which measures of reliance, learning, and harm are robust to task context, expertise, and institutional setting?
• How should governance mechanisms be evaluated empirically? What behavioral signals best capture contestability, accountability, and safe intervention in use?
• What should standardized human–AI benchmarks include beyond accuracy? How can benchmarks reflect calibration, error recovery, and governance rather than prediction alone?
Addressing these questions is essential for cumulative, comparable, and deployment-relevant progress in human–AI collaboration research.
3. Discussion and Conclusion
This paper positions measurement rather than algorithmic novelty as a central bottleneck for safe and accountable AI deployment. By shifting evaluation toward calibration, learning, and governance, the proposed framework aims to support: (i) comparable evaluation across studies and domains, (ii) principled design of onboarding interventions grounded in learning outcomes, and (iii) policy-relevant assessment of AI governance as enacted in practice.
In addition, we provide an agenda for future CHI workshops, surveys, benchmarks, and research programs focused on human–AI teaming rather than model-centric performance. As a limitation, this taxonomy should be understood as a starting point rather than a finalized standard: it synthesizes recurring measures and highlights gaps, and it will require community iteration, domain-specific validation, and refinement as new evidence and deployment contexts emerge. If we do not measure onboarding, calibration, and harm, we cannot claim that human–AI systems are ready for real-world collaboration. This work proposes a shared measurement agenda for evaluating human–AI teams—not as tools, but as socio-technical systems whose safety and effectiveness emerge through interaction over time. This framework provides a foundation for future evaluation protocols, benchmark design, and shared measurement standards for human–AI collaboration across domains.
Acknowledgements. This research was supported by the Resilient Workforces Institute, Singapore Management University, under the SMU Seed Fund (Grant ID: 2026-6026IR-25T040-SMUIRNYXXX), and by the Ministry of Education, Singapore under its Academic Research Fund Tier 2 (MOE-T2EP20223-0007). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the Ministry of Education, Singapore.
References
K. Alfrink, I. Keller, G. Kortuem, and N. Doorn (2023) Contestable AI by design: towards a framework. Minds and Machines 33 (4), p. 613–639. Cited by: §1.
S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, et al. (2019) Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, p. 1–13. Cited by: §1.
V. Arya, R. K. Bellamy, P. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, et al.
(2019) One explanation does not fit all: a toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012.
G. Bansal, B. Nushi, E. Kamar, E. Horvitz, and D. S. Weld (2021a) Is the most accurate AI the best teammate? Optimizing AI for teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 11405–11414.
G. Bansal, B. Nushi, E. Kamar, W. S. Lasecki, D. S. Weld, and E. Horvitz (2019) Beyond accuracy: the role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7, pp. 2–11.
G. Bansal, T. Wu, J. Zhou, R. Fok, B. Nushi, E. Kamar, M. T. Ribeiro, and D. Weld (2021b) Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–16.
Z. Buçinca, M. B. Malaya, and K. Z. Gajos (2021) To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction 5 (CSCW1), pp. 1–21.
C. J. Cai, E. Reif, N. Hegde, J. Hipp, B. Kim, D. Smilkov, M. Wattenberg, F. Viegas, G. S. Corrado, M. C. Stumpe, et al. (2019a) Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
C. J. Cai, S. Winter, D. Steiner, L. Wilcox, and M. Terry (2019b) "Hello AI": uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–24.
C. J. Cai, S. Winter, D. Steiner, L. Wilcox, and M. Terry (2021) Onboarding materials as cross-functional boundary objects for developing AI assistants. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7.
V. Chen, Q. V. Liao, J. Wortman Vaughan, and G. Bansal (2023) Understanding the role of human intuition on reliance in human-AI decision-making with explanations. Proceedings of the ACM on Human-Computer Interaction 7 (CSCW2), pp. 1–32.
A. Crisan, M. Drouhard, J. Vig, and N. Rajani (2022) Interactive model cards: a human-centered approach to model documentation. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 427–439.
F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
M. Ghassemi, L. Oakden-Rayner, and A. L. Beam (2021) The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health 3 (11), pp. e745–e750.
B. Green and Y. Chen (2019) Disparate interactions: an algorithm-in-the-loop analysis of fairness in risk assessments. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 90–99.
L. Guo, E. M. Daly, O. Alkan, M. Mattetti, O. Cornec, and B. Knijnenburg (2022) Building trust in interactive machine learning via user contributed interpretable rules. In Proceedings of the 27th International Conference on Intelligent User Interfaces, pp. 537–548.
Z. Guo, Y. Wu, J. D. Hartline, and J. Hullman (2024) A decision theoretic framework for measuring AI reliance. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 221–236.
G. He, S. Buijsman, and U. Gadiraju (2023) How stated accuracy of an AI system and analogies to explain accuracy affect human reliance on the system. Proceedings of the ACM on Human-Computer Interaction 7 (CSCW2), pp. 1–29.
K. Holstein, M. De-Arteaga, L. Tumati, and Y. Cheng (2023) Toward supporting perceptual complementarity in human-AI collaboration via reflection on unobservables. Proceedings of the ACM on Human-Computer Interaction 7 (CSCW1), pp. 1–20.
D. Kaur, S. Uslu, K. J. Rittichier, and A. Durresi (2022) Trustworthy artificial intelligence: a review. ACM Computing Surveys (CSUR) 55 (2), pp. 1–38.
T. Kulesza, M. Burnett, W. Wong, and S. Stumpf (2015) Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126–137.
T. Kulesza, S. Stumpf, W. Wong, M. M. Burnett, S. Perona, A. J. Ko, and I. Oberst (2011) Why-oriented end-user debugging of naive Bayes text classification. ACM Transactions on Interactive Intelligent Systems (TiiS) 1 (1), pp. 1–31.
T. Kuo, H. Shen, J. Geum, N. Jones, J. I. Hong, H. Zhu, and K. Holstein (2023) Understanding frontline workers' and unhoused individuals' perspectives on AI used in homeless services. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–17.
V. Lai, C. Chen, Q. V. Liao, A. Smith-Renner, and C. Tan (2021) Towards a science of human-AI decision making: a survey of empirical studies. arXiv preprint arXiv:2112.11471.
M. H. Lee and C. J. Chew (2023) Understanding the effect of counterfactual explanations on trust and reliance on AI for human-AI collaborative clinical decision making. Proceedings of the ACM on Human-Computer Interaction 7 (CSCW2), pp. 1–22.
M. H. Lee, S. X. Y. Choo, S. D. Thilarajah, et al. (2024a) Improving health professionals' onboarding with AI and XAI for trustworthy human-AI collaborative decision making. arXiv preprint arXiv:2405.16424.
M. H. Lee, R. B. X. Ng, S. X. Choo, and S. Thilarajah (2024b) Interactive example-based explanations to improve health professionals' onboarding with AI for human-AI collaborative decision making. arXiv preprint arXiv:2409.15814.
M. H. Lee, D. P. Siewiorek, A. Smailagic, A. Bernardino, and S. Bermúdez i Badia (2020) Co-design and evaluation of an intelligent decision support system for stroke rehabilitation assessment. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW2), pp. 1–27.
M. H. Lee, D. P. Siewiorek, A. Smailagic, A. Bernardino, and S. Bermúdez i Badia (2021) A human-AI collaborative approach for clinical decision making on rehabilitation assessment. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
M. H. Lee and J. Y. F. Teo (2026) RuleEdit: failure-guided human–AI model editing with prospective impact preview.
M. H. Lee and M. Z. Y. Tok (2025) Towards uncertainty aware task delegation and human-AI collaborative decision-making. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pp. 2274–2289.
Q. V. Liao and J. W. Vaughan (2023) AI transparency in the age of LLMs: a human-centered research roadmap. arXiv preprint arXiv:2306.01941.
H. Lyons, E. Velloso, and T. Miller (2021) Conceptualising contestability: perspectives on contesting algorithmic decisions. Proceedings of the ACM on Human-Computer Interaction 5 (CSCW1), pp. 1–25.
M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru (2019) Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229.
J. Mökander and L. Floridi (2022) From algorithmic accountability to digital governance. Nature Machine Intelligence 4 (6), pp. 508–509.
C. Novelli, M. Taddeo, and L. Floridi (2024) Accountability in artificial intelligence: what it is and how it works. AI & Society 39 (4), pp. 1871–1882.
S. Prabhudesai, L. Yang, S. Asthana, X. Huan, Q. V. Liao, and N. Banovic (2023) Understanding uncertainty: how lay decision-makers perceive and interpret uncertainty in human-AI decision making. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pp. 379–396.
I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes (2020) Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 33–44.
M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215.
S. Salimzadeh, G. He, and U. Gadiraju (2024) Dealing with uncertainty: understanding the impact of prognostic versus diagnostic tasks on trust and reliance in human-AI decision making. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17.
M. Schemmer, N. Kuehl, C. Benz, A. Bartos, and G. Satzger (2023) Appropriate reliance on AI advice: conceptualization and the effect of explanations. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pp. 410–422.
E. Toreini, M. Aitken, K. Coopamootoo, K. Elliott, C. G. Zelaya, and A. Van Moorsel (2020) The relationship between trust in AI and trustworthy machine learning technologies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 272–283.
K. R. Varshney (2022) Trustworthy machine learning. Independently published.
S. Verma, V. Boonsanong, M. Hoang, K. E. Hines, J. P. Dickerson, and C. Shah (2020) Counterfactual explanations and algorithmic recourses for machine learning: a review. arXiv preprint arXiv:2010.10596.
D. Wang, L. Wang, Z. Zhang, D. Wang, H. Zhu, Y. Gao, X. Fan, and F. Tian (2021) "Brilliant AI doctor" in rural clinics: challenges in AI-powered clinical decision support system deployment. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–18.
D. Wang, Q. Yang, A. Abdul, and B. Y. Lim (2019) Designing theory-driven user-centric explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15.
X. Wang and M. Yin (2023) Watch out for updates: understanding the effects of model explanation updates in AI-assisted decision making. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–19.
B. Wilder, E. Horvitz, and E. Kamar (2020) Learning to complement humans. arXiv preprint arXiv:2005.00582.
A. Završnik (2020) Criminal justice, artificial intelligence systems, and human rights. In ERA Forum, Vol. 20, pp. 567–583.

Appendix A: Detailed Metrics (Organized by the Four-Part Taxonomy)

Let $\{(y_j, h_{0j}, a_j, h_{1j}, c_j, t_j)\}_{j=1}^{N}$ denote $N$ decision instances, where $y_j$ is the ground truth, $h_{0j}$ the participant's initial decision, $a_j$ the AI prediction, $h_{1j}$ the participant's final decision after viewing AI output, $c_j$ the participant's reported confidence (if available), and $t_j$ timestamps or interaction events recorded in system logs. We organize metrics by what happened, how AI was used, what went wrong, and what changed over time, following prior analyses of human–AI reliance and collaboration behavior (Lai et al., 2021; Buçinca et al., 2021; Lee and Chew, 2023; Guo et al., 2024).

A.1. Outcome Metrics (What happened?)

Accuracies and team gains. These metrics describe decision quality at the human, AI, and human–AI team levels.
• Human accuracy ($Acc_{h0}$): proportion of cases correctly solved by the human before seeing AI: $Acc_{h0} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{0j} = y_j]$.
• AI accuracy ($Acc_{ai}$): proportion of cases correctly predicted by the AI: $Acc_{ai} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[a_j = y_j]$.
• Team accuracy ($Acc_{team}$): proportion of cases where the final human–AI decision is correct: $Acc_{team} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{1j} = y_j]$.
• TeamGain vs. Human: improvement (or degradation) from human-only decisions to AI-assisted decisions: $Acc_{team} - Acc_{h0}$.
• TeamGain vs. AI: improvement (or degradation) from AI-only predictions to the final team decision: $Acc_{team} - Acc_{ai}$.

Oracle upper bound and regret. These metrics separate model limitations from collaboration failures. Define the oracle-correct indicator as $Oracle_j = \mathbb{1}[(h_{0j} = y_j) \lor (a_j = y_j)]$.
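The accuracy, team-gain, and oracle quantities defined so far can be computed directly from per-case trace arrays. A minimal sketch in Python (the trace here is synthetic and the variable names merely mirror the appendix notation; nothing in it comes from the paper's data):

```python
import random

# Synthetic interaction trace (illustrative only): y_j ground truth,
# h0_j initial human decision, a_j AI prediction, h1_j final decision.
random.seed(0)
N = 1000
y  = [random.randint(0, 1) for _ in range(N)]
h0 = [t if random.random() < 0.70 else 1 - t for t in y]   # human ~70% accurate
a  = [t if random.random() < 0.80 else 1 - t for t in y]   # AI ~80% accurate
# Final decision: follow the AI on ~60% of cases, otherwise keep h0.
h1 = [aj if random.random() < 0.60 else hj for aj, hj in zip(a, h0)]

def mean(xs):
    return sum(xs) / len(xs)

acc_h0   = mean([h == t for h, t in zip(h0, y)])   # Acc_h0
acc_ai   = mean([p == t for p, t in zip(a, y)])    # Acc_ai
acc_team = mean([h == t for h, t in zip(h1, y)])   # Acc_team

gain_vs_human = acc_team - acc_h0   # TeamGain vs. Human
gain_vs_ai    = acc_team - acc_ai   # TeamGain vs. AI

# Oracle-correct indicator: at least one agent is right on case j.
oracle      = [int(h == t or p == t) for h, p, t in zip(h0, a, y)]
acc_oracle  = mean(oracle)
# Regret_best: oracle-correct cases the team nonetheless gets wrong.
regret_best = mean([o - int(h == t) for o, h, t in zip(oracle, h1, y)])
```

By construction the oracle bound dominates every single-agent accuracy, and Regret_best equals the gap between the oracle bound and team accuracy.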
• Oracle best accuracy ($Acc_{oracle}$): upper bound on achievable team performance if one always selected the correct agent: $Acc_{oracle} = \frac{1}{N}\sum_{j=1}^{N} Oracle_j$.
• Regret_best: proportion of avoidable errors, where the team fails despite at least one agent being correct (Guo et al., 2024): $\frac{1}{N}\sum_{j=1}^{N} \left( Oracle_j - \mathbb{1}[h_{1j} = y_j] \right)$.

Error recovery vs. error amplification (derived outcome effects). Beyond aggregate accuracy and regret, we distinguish whether AI involvement helps users recover from errors or amplifies them. Error recovery refers to cases where an initially incorrect human decision becomes correct after viewing AI output, while error amplification refers to cases where a correct initial human decision becomes incorrect due to AI influence. These outcome-level effects are not captured by accuracy or regret alone, but they are critical for assessing whether AI improves or degrades real-world decision quality. We operationalize error recovery and amplification through the help–harm decomposition (AI-help vs. AI-harm) and the complementary decision-change metrics (ChangedToRight vs. ChangedToWrong), defined in subsequent sections of the appendix.

A.2. Reliance & Interaction Metrics (How was AI used?)

Reliance conditioned on AI correctness. These metrics capture behavioral reliance patterns, including appropriate reliance as well as over- and under-reliance. Let $C = \{j : a_j = y_j\}$ and $W = \{j : a_j \neq y_j\}$.
• Accept-on-correct: tendency to follow AI when it is correct: $\Pr(h_1 = a \mid a = y) = \frac{1}{|C|}\sum_{j \in C} \mathbb{1}[h_{1j} = a_j]$.
• Reject-on-wrong: ability to reject incorrect AI advice: $\Pr(h_1 \neq a \mid a \neq y) = \frac{1}{|W|}\sum_{j \in W} \mathbb{1}[h_{1j} \neq a_j]$.
• Reject-on-correct: unnecessary rejection of correct AI advice: $\Pr(h_1 \neq a \mid a = y)$.
• Accept-on-wrong: overreliance on incorrect AI predictions: $\Pr(h_1 = a \mid a \neq y)$.

Decision-change behaviors. These metrics distinguish beneficial from harmful decision updates.
• Changed: proportion of cases where the participant changes an initial decision after seeing AI: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{1j} \neq h_{0j}]$.
• ChangedToRight: beneficial changes from incorrect to correct: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{1j} \neq h_{0j}]\,\mathbb{1}[h_{0j} \neq y_j]\,\mathbb{1}[h_{1j} = y_j]$.
• ChangedToWrong: harmful changes induced after seeing AI: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{1j} \neq h_{0j}]\,\mathbb{1}[h_{0j} = y_j]\,\mathbb{1}[h_{1j} \neq y_j]$.

Calibration and timing.
• Reliance slope: sensitivity of reliance to AI correctness; higher values indicate better calibration (Buçinca et al., 2021): $\Pr(h_1 = a \mid a = y) - \Pr(h_1 = a \mid a \neq y)$.
• Intervention latency: average time taken to confirm or override AI, reflecting hesitation when confirming or contesting AI recommendations: $\frac{1}{N}\sum_{j=1}^{N} \left( t^{confirm/override}_j - t^{AI}_j \right)$.

Local vs. global update asymmetry. To distinguish case-specific reactions from durable belief updates about AI reliability, we measure whether responses to an AI failure generalize beyond the current instance. Local updates are reflected by isolated overrides or rejections confined to the current case, whereas global updates manifest as systematic changes in reliance behavior on subsequent cases. We operationalize global updating by comparing reliance metrics (e.g., accept-on-wrong, reliance slope) before versus after observed AI failures, and quantify update asymmetry as the degree to which behavior changes persist across subsequent cases rather than reverting immediately. This framing aligns with evidence that user behavior can shift under model/explanation updates and with calls to study transparency via user mental models over time (Wang and Yin, 2023; Liao and Vaughan, 2023).

A.3. Safety & Harm Metrics (What went wrong?)

Help–harm decomposition. These metrics attribute outcome changes to AI influence rather than overall accuracy alone.
• AI-help: cases where AI corrects an initially wrong human decision: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{0j} \neq y_j \land h_{1j} = y_j]$.
• AI-harm: cases where AI causes a correct human decision to become wrong: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{0j} = y_j \land h_{1j} \neq y_j]$.
• Missed-help: failures to adopt correct AI advice when the human is wrong: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{0j} \neq y_j \land a_j = y_j \land h_{1j} \neq a_j]$.
• Correct-ignore: appropriate rejection of incorrect AI advice: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[h_{0j} = y_j \land a_j \neq y_j \land h_{1j} \neq a_j]$.
• Near-miss rate: proportion of high-risk cases where an incorrect AI recommendation was narrowly avoided through human intervention or override: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[a_j \neq y_j \land h_{1j} = y_j \land Risk_j = \text{high}]$.

Governance-in-use signals. These metrics operationalize governance as observable behavior in practice rather than documentation. Let $R_j$ indicate a rollback, $E_j$ an escalation, and $\pi_j$ a policy rule for case $j$.
• Rollback rate: frequency of reversing AI-influenced decisions after review: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[R_j = 1]$.
• Escalation rate: proportion of cases referred for human or institutional oversight: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[E_j = 1]$.
• Rule–behavior contradiction: violations where required actions (e.g., escalation) are not taken: $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[\pi_j = \text{escalate} \land E_j = 0]$.

A.4. Learning & Readiness Metrics (What changed over time?)

These metrics assess whether onboarding produces durable, transferable user capability.
• Calibration gap: misalignment between reported confidence and actual correctness: $\frac{1}{N}\sum_{j=1}^{N} \left| c_j - \mathbb{1}[h_{1j} = y_j] \right|$.
• Reliance slope over time (behavioral calibration): change in the sensitivity of agreement to AI correctness across sessions or blocks. For block/session $k$: $Slope_k = \Pr(h_1 = a \mid a = y)_k - \Pr(h_1 = a \mid a \neq y)_k$, and learning can be summarized as $\Delta Slope = Slope_{post} - Slope_{pre}$.
• Retention: stability of calibration-related metrics across multiple sessions or time intervals: $\left| Calib_{\text{session } k} - Calib_{\text{session } k+1} \right|$.
• Transfer: consistency of performance or reliance across tasks, datasets, or model versions: $\left| Metric_{\text{task A}} - Metric_{\text{task B}} \right|$.

Table 1. Mapping metrics to observable data sources, U–C–I onboarding stages, and the corresponding design actions they enable (Part 1). Each row lists the metric, its data source (trace), the U–C–I stage, and the design action to take when the metric looks bad.

Outcome metrics (What happened?)
• Team accuracy / TeamGain — Trace: final decision $h_1$, ground truth $y$; condition logs. Stage: Improve. Action: adjust the delegation policy (when to defer / override); revise the workflow to route high-risk cases to human review.
• Regret_best / Oracle gap — Trace: $h_0, a, h_1, y$ per case. Stage: Improve. Action: diagnose collaboration failures (not model limits); target training on cases where either the human or the AI was correct but the team failed.
• Error recovery vs. amplification — Trace: $h_0 \to h_1$ transitions plus $y$. Stage: Improve. Action: identify whether the UI causes harmful flips; refine prompts/explanations to reduce changed-to-wrong; add guardrails for high-stakes edits.

Reliance & interaction metrics (How was AI used?)
• Accept-on-wrong / Reject-on-wrong — Trace: agreement with AI ($h_1 = a$) conditioned on AI correctness. Stage: Understand + Control. Action: curate "failure sets"; add reliability cues; introduce regions-of-no-use; require justification or a second check for high-risk acceptance.
• ChangedToWrong / ChangedToRight — Trace: decision-change events ($h_1 \neq h_0$) plus $y$. Stage: Understand + Control. Action: refine onboarding tasks to expose boundary conditions; add counterfactual practice; redesign explanation timing to prevent harmful flips.
• Override rate + timing — Trace: override events plus timestamps $t_j$. Stage: Control. Action: add safe levers (sandbox preview, rollback); reduce friction for appropriate overrides; introduce escalation shortcuts for uncertain cases.
• Reliance slope — Trace: $\Pr(h_1 = a \mid a = y) - \Pr(h_1 = a \mid a \neq y)$. Stage: Understand. Action: diagnose calibration; if low, strengthen training on discriminating correct vs. incorrect AI; improve uncertainty communication.
• Intervention latency — Trace: $(t^{confirm/override} - t^{AI})$. Stage: Control. Action: tune interaction costs; add "pause-and-check" for risky cases; streamline override/escalation to reduce delayed intervention.

Table 2. Mapping metrics to observable data sources, U–C–I onboarding stages, and the corresponding design actions they enable (Part 2).

Safety & harm metrics (What went wrong?)
• AI-harm / AI-help — Trace: help–harm decomposition from $h_0, h_1, a, y$. Stage: Control + Improve. Action: add guardrails where AI induces harm; adjust autonomy (e.g., limit deferral) in high-harm regions; prioritize model fixes for harm-heavy slices.
• Missed-help / Under-reliance — Trace: $h_0 \neq y$, $a = y$, but $h_1 \neq a$. Stage: Understand. Action: improve "when to trust" instruction; show exemplars of correct AI behavior; add calibrated confidence cues for beneficial reliance.
• Near-misses (high-risk disagreements) — Trace: high-stakes disagreement logs; risk labels; margin/uncertainty if available. Stage: Control. Action: trigger required second review; add risk-based escalation rules; refine the "regions-of-no-use" policy.
• Rule–behavior contradiction rate — Trace: policy label $\pi_j$ plus behavior event $E_j$. Stage: Control + Improve. Action: fix workflow compliance gaps; redesign the UI to make required actions salient; update governance policy or training based on observed violations.
• Rollback rate / Escalation rate — Trace: rollback logs $R_j$; escalation logs $E_j$. Stage: Control + Improve. Action: audit contested decisions; improve contestability pathways; adjust accountability and when rollback is encouraged.

Learning & readiness metrics (What changed over time?)
• Calibration gap — Trace: confidence $c_j$ plus correctness $\mathbb{1}[h_1 = y]$. Stage: Understand. Action: improve calibration cards; add feedback on confidence miscalibration; emphasize boundary conditions and failure modes.
• Retention — Trace: same metrics across sessions (pre/post; follow-up). Stage: Improve. Action: iterate the onboarding curriculum; schedule refreshers; adapt materials to failure modes that "do not stick."
• Transfer (across tasks/versions) — Trace: metrics across task A vs. B, or model version $v$ vs. $v'$. Stage: Improve. Action: update onboarding for new model versions; add regression tests for reliance and harm; retrain users on newly emerging failures.
• Time-to-calibration — Trace: rolling-window estimates of reliance slope / accept-on-wrong over time. Stage: Understand + Improve. Action: personalize onboarding length; stop training when stable calibration is achieved; allocate extra practice for slow-to-calibrate users.
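As a self-contained illustration, the reliance-conditioned metrics (A.2), the help–harm decomposition (A.3), and the calibration gap (A.4) can all be derived from the same kind of per-case trace. This is a sketch under stated assumptions: the trace and confidence values below are synthetic, and the helper names are hypothetical rather than from the paper:

```python
import random

# Synthetic trace: y ground truth, h0 initial decision, a AI prediction,
# h1 final decision (illustrative data only).
random.seed(1)
N = 1000
y  = [random.randint(0, 1) for _ in range(N)]
h0 = [t if random.random() < 0.70 else 1 - t for t in y]
a  = [t if random.random() < 0.80 else 1 - t for t in y]
h1 = [aj if random.random() < 0.60 else hj for aj, hj in zip(a, h0)]

def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

# Reliance conditioned on AI correctness (A.2).
C = [j for j in range(N) if a[j] == y[j]]   # AI-correct cases
W = [j for j in range(N) if a[j] != y[j]]   # AI-wrong cases
accept_on_correct = mean([h1[j] == a[j] for j in C])
accept_on_wrong   = mean([h1[j] == a[j] for j in W])   # overreliance
reject_on_wrong   = mean([h1[j] != a[j] for j in W])
reliance_slope    = accept_on_correct - accept_on_wrong  # calibration signal

# Help–harm decomposition (A.3).
ai_help = mean([h0[j] != y[j] and h1[j] == y[j] for j in range(N)])
ai_harm = mean([h0[j] == y[j] and h1[j] != y[j] for j in range(N)])
missed_help = mean([h0[j] != y[j] and a[j] == y[j] and h1[j] != a[j]
                    for j in range(N)])

# Calibration gap (A.4), using simulated self-reported confidence c_j.
c = [min(1.0, max(0.0, random.gauss(0.75, 0.1))) for _ in range(N)]
calibration_gap = mean([abs(c[j] - (h1[j] == y[j])) for j in range(N)])
```

On the AI-wrong cases, accept-on-wrong and reject-on-wrong partition the set, so they sum to one; a reliance slope near its maximum of 1 would indicate a user who follows the AI exactly when it is correct.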