Paper deep dive
Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
Abhinaba Basu, Pavan Chakraborty
Abstract
Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r <= 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12x). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004) and transfers across architectures (cross-MLIP AUC-ROC ~ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening, a 25% improvement in discovery yield.
Links
- Source: https://arxiv.org/abs/2603.12183v1
- Canonical: https://arxiv.org/abs/2603.12183v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Abhinaba Basu 1,2 and Pavan Chakraborty 1
1 Indian Institute of Information Technology Allahabad (IIITA), Prayagraj, India
2 National Institute of Electronics and Information Technology (NIELIT), India
March 13, 2026

Abstract
Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r ≤ 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12×). A risk model trained on PCM-discovered features predicts failures on unseen materials (area under the ROC curve [AUC-ROC] = 0.938 ± 0.004) and transfers across architectures (cross-MLIP AUC-ROC ≈ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening, a 25% improvement in discovery yield.

Keywords: machine-learned interatomic potentials, adversarial testing, formal verification, Lean 4, materials discovery, uncertainty quantification

1 Introduction

Universal MLIPs (CHGNet [1], MACE [4], TensorNet [27], ALIGNN [35], SevenNet [36], EquiformerV2 [37]) underpin high-throughput materials screening [38], yet they are deployed without formal reliability guarantees.
Matbench Discovery [2] and earlier benchmarks [10] evaluate 45+ models on 257K WBM (Wang–Botti–Marques [3]) structures, but aggregate accuracy metrics cannot answer the deployment-critical question: on which chemistries is this MLIP unreliable?

This creates a quantifiable safety gap. A single-MLIP stability screen achieves precision 0.47 and recall 0.07 across 25,000 WBM materials, missing 93.0% of DFT-stable candidates. The gap is not merely statistical: CHGNet rejects TlBiSe2 (a topological insulator with over 1,000 citations [26]) and Cs2KTlBr6 (a lead-free perovskite solar cell candidate with a 1.27 eV band gap), precisely the materials that high-throughput pipelines aim to find. Aggregate benchmarks showing "near-DFT accuracy" obscure these consequential, chemistry-specific blind spots.

We reframe MLIP reliability as a falsifiable safety claim in the sense of Dalrymple et al.'s Guaranteed Safe AI [13]: a world model (the MLIP), a safety specification (bounds on acceptable error), and a verifier (machine-checked proofs) [45]. This mirrors proof-carrying code [15], where untrusted programs ship with proofs of safety. Proof-Carrying Materials (PCM) instantiates this in three stages (Fig. 1):
1. Adversarial falsification: automated adversaries (six strategies: random, heuristic, grid, LHS, Sobol, LLM) probe for failure regions in compositional space;
2. Envelope refinement: counterexamples tighten the safety claim into bounds with bootstrap 95% confidence intervals (CIs);
3. Formal certification: the refined envelope compiles into Lean 4 proofs with explicit physical axioms.
The framework is oracle-agnostic: swapping the MLIP requires no code changes.

Our central findings are threefold. First, MLIP blind spots are architecture-specific: three MLIPs produce near-zero pairwise force correlations (r = 0.13, 0.10, −0.01; n = 5,000) and fail on largely disjoint chemistries.
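The three-stage loop described above can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's implementation: the oracle, feature names, and thresholds are all hypothetical stand-ins for the real MLIP-vs-DFT comparison.

```python
import random

random.seed(0)

ERROR_THRESHOLD = 0.5  # safety specification: acceptable error bound

def oracle_error(features):
    # Hypothetical stand-in for the MLIP-vs-DFT oracle: in the real
    # pipeline this would evaluate an MLIP against a DFT reference.
    # Here it simply fails for heavy elements (large max_z).
    return 0.2 + 0.1 * max(0.0, features["max_z"] - 70)

def random_adversary():
    # Stage 1: propose compositional feature vectors (the simplest of
    # the six strategies: uniform random sampling).
    return {"max_z": random.uniform(1, 100),
            "n_sites": random.randint(1, 24)}

counterexamples, passing = [], []
for _ in range(200):  # adversarial budget of 200 queries
    probe = random_adversary()
    if oracle_error(probe) > ERROR_THRESHOLD:
        counterexamples.append(probe)  # falsifies the safety claim
    else:
        passing.append(probe)

# Stage 2: refine the envelope to the region spanned by passing probes
# (the paper's bootstrap CIs are omitted in this sketch).
envelope = {"max_z_upper": max(p["max_z"] for p in passing)}
```

Stage 3 (compiling the refined bounds into Lean 4 theorems) operates on the resulting `envelope` dictionary; the oracle-agnostic design means only `oracle_error` changes when swapping MLIPs.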
Second, perturbation-based uncertainty quantification does not predict these compositional failures (r = 0.039, p = 0.58), meaning structural UQ and adversarial compositional auditing capture independent failure dimensions. Third, and most importantly, PCM enables prospective validation: compositional features discovered through adversarial auditing predict failures on unseen materials with AUC-ROC 0.938 ± 0.004 and P@20% = 1.000 (perfect precision among the top-risk quintile), transforming retrospective audit into predictive intervention. These failure patterns transfer across architectures: a CHGNet-trained risk model predicts MACE failures (AUC-ROC = 0.697 average cross-transfer) with feature importance correlation r = 0.877. Together, these results establish that (a) no single benchmark score captures reliability, (b) multi-MLIP auditing is essential for deployment, (c) formal verification combined with prospective prediction constitutes a new paradigm for MLIP validation, and (d) the full audit costs under $20.

2 Related Work

Adversarial MLIP testing. CAGO [11] generates adversarial structures via uncertainty-calibrated geometry optimization on individual materials. PCM operates in compositional space, searching for chemical families where MLIPs systematically diverge from DFT. These are complementary: CAGO identifies which atomic arrangements are unreliable; PCM identifies which chemistries are unreliable regardless of arrangement. In our evaluation, perturbation-based uncertainty is not predictive of compositional failure (r = 0.039, p = 0.58).

Uncertainty quantification. Conformal prediction [19] provides distribution-free marginal coverage guarantees applied to interatomic potentials [20]. Monte Carlo dropout [29], deep ensembles [30] and chemical-space uncertainty estimation [31;32] have been applied to molecular property prediction.
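The perturbation-based UQ baseline referenced here (and detailed in the failure analysis later) can be sketched with stdlib tools. Everything below is a toy stand-in: `predict_energy`, the structures, and the labels are hypothetical; only the recipe (small Gaussian rattles, prediction variance, point-biserial correlation with a binary failure label) follows the text.

```python
import math
import random
import statistics

random.seed(1)

def predict_energy(coords):
    # Hypothetical toy model standing in for an MLIP energy prediction.
    return sum(x * x + math.sin(3 * y) + z for x, y, z in coords)

def perturbation_uncertainty(coords, sigma=0.01, n_replicas=5):
    # CAGO-style structural UQ: std-dev of predictions over small
    # Gaussian rattles of the atomic coordinates.
    preds = []
    for _ in range(n_replicas):
        rattled = [(x + random.gauss(0, sigma),
                    y + random.gauss(0, sigma),
                    z + random.gauss(0, sigma)) for x, y, z in coords]
        preds.append(predict_energy(rattled))
    return statistics.stdev(preds)

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

# Point-biserial r is just Pearson r with a 0/1 failure label.
structures = [[(random.random(), random.random(), random.random())
               for _ in range(8)] for _ in range(50)]
uq = [perturbation_uncertainty(s) for s in structures]
fails = [i % 2 for i in range(len(structures))]  # toy alternating labels
r_pb = pearson(uq, [float(f) for f in fails])
```

The paper's finding is that this `r_pb`-style statistic is near zero on real data: structural rattling sensitivity does not flag compositional blind spots.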
Figure 1. The PCM pipeline. Stage 1: automated adversaries (six strategies including LLMs) propose compositional feature vectors; the MLIP oracle evaluates each against the DFT reference. Stage 2: counterexamples refine the safety envelope with bootstrap CIs. Stage 3: the envelope compiles into Lean 4 proofs with explicit axioms. Output: a safety certificate (proof + envelope + discovered counterexamples).

Critically, these methods provide marginal coverage ("95% of all materials fall within the prediction interval") but cannot answer the deployment question: which specific compositions are unreliable, and why? Conformal prediction intervals widen uniformly across compositional space; they do not identify the compositional drivers (e.g. heavy elements, large unit cells) that concentrate failures, nor do they enable targeted DFT allocation for high-risk chemistries. PCM concentrates adversarial sampling on failure boundaries, producing tighter envelopes where they matter most: in a split conformal baseline on the same 5,000 WBM materials (2,500 calibration, 2,500 test), the feature-space conformal envelope at α = 0.05 contains 1,924 materials (precision 6.8%) versus PCM's adversary-informed bootstrap envelope with 97 materials (precision 7.2%), a 20× reduction in envelope size at comparable precision. Active learning [12;50;51] selects informative points for retraining; PCM selects adversarial points for auditing, and the adversarial budget (200 queries) is orders of magnitude smaller than retraining data requirements.

Safety cases and formal verification. AMLAS [16] provides structured safety-case methodology for ML components but relies on human argumentation. Seshia et al.
[17] identify the specification gap as a central challenge for verified AI. Neural network verification [43;44] focuses on input-output properties of classifiers; PCM verifies domain-specific reliability claims about regression models. Adversarial testing in ML [33;34] targets input perturbations; falsification of cyber-physical systems [7;8] provides compositional testing frameworks. PCM instantiates the Guaranteed Safe AI framework [13] for materials: the MLIP is the world model, the envelope is the specification, and Lean 4 proofs are the verifier output, providing machine-checkable evidence that structured safety cases currently lack.

3 Results

3.1 Architecture-specific failure profiles

We evaluate three architecturally distinct MLIPs, CHGNet v0.3.0 [1], TensorNet (MatPES-PBE-v2025.1) [27;28], and MACE-MP medium [4], on 5,000 WBM materials. Structures are synthetic compositional probes (atoms placed at irrational, golden-ratio fractional coordinates to avoid symmetry artifacts, with DFT-derived cell volumes; Methods), designed to test whether each MLIP has learned the correct energy landscape for a given chemistry. The blind-spot classification itself, which materials are DFT-stable but MLIP-unstable, derives from the WBM benchmark, where both DFT and MLIPs evaluate fully relaxed equilibrium structures. Force failure threshold: 50 eV/Å (Fig. 2).
• CHGNet fails on 31.1% of compositions (1,553/5,000, force > 50 eV/Å).
• TensorNet fails on 75.7% (3,786/5,000).
• MACE fails on 73.2% (3,659/5,000), producing catastrophically high forces (> 1,000 eV/Å) on multiple compositions where CHGNet predicts moderate values.
• All three pairwise force correlations are near zero: CHGNet–TensorNet r = 0.10 [0.07, 0.13]; CHGNet–MACE r = 0.13 [0.05, 0.21]; TensorNet–MACE r = −0.01 [−0.02, −0.003] (bootstrap 95% CIs, n = 5,000).
• Failures are largely disjoint: 218 materials fail CHGNet but not MACE, while 2,324 fail MACE but not CHGNet; 337 fail TensorNet but not MACE, while 210 fail MACE but not TensorNet.

This is not a marginal difference. Three MLIPs, all trained on Materials Project data, produce qualitatively different force predictions on the same compositions. The higher absolute failure rates for TensorNet (75.7%) and MACE (73.2%) relative to CHGNet (31.1%) likely reflect differences in how each architecture generalises to far-from-equilibrium probe geometries; the scientifically important finding is not the absolute rates but the near-zero pairwise correlations and disjoint failure profiles, which demonstrate that architecture and training pipeline, not training data alone, determine which chemistries appear problematic. A stability screen using any single MLIP inherits that model's specific blind spots, including several materials with known functional importance (Section 3.3).

3.2 Adversarial strategy comparison

Because the WBM–CHGNet benchmark has a 93.2% base counterexample (CX) rate (randomly sampling any composition has a > 90% chance of finding a failure), raw CX rate cannot distinguish adversary quality. We therefore evaluate six strategy types and seven LLM configurations (budget = 200) primarily on discovery diversity: the number of unique, previously unseen failure compositions found (Fig. 3, Extended Data Table 1). Among automated strategies, all achieve high CX rates (86–89% for heuristic, random, and Sobol). LLM adversaries, tested as one strategy class among several, achieve marginally higher rates (88–100%), with their primary advantage being qualitative: convergence on high-Z, multi-element regions of compositional space (Z > 71, n_elements ≥ 4), discovering 36 DFT-stable/CHGNet-unstable materials not found by any baseline strategy.
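The bootstrap 95% CIs quoted for the pairwise force correlations can be computed with a standard percentile bootstrap. This is a minimal stdlib sketch on toy data, not the paper's code; the data-generating line is purely illustrative of two weakly correlated force series.

```python
import math
import random
import statistics

random.seed(2)

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05):
    # Percentile bootstrap: resample (x, y) pairs with replacement and
    # take the empirical 2.5th/97.5th percentiles of r.
    n = len(xs)
    rs = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        rs.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    rs.sort()
    return rs[int(n_boot * alpha / 2)], rs[int(n_boot * (1 - alpha / 2)) - 1]

# Toy max-force series for two models with weak shared signal
# (illustrative, not the paper's 5,000-material benchmark).
common = [random.gauss(0, 1) for _ in range(500)]
f_a = [c * 0.1 + random.gauss(0, 1) for c in common]
f_b = [c * 0.1 + random.gauss(0, 1) for c in common]
lo, hi = bootstrap_ci(f_a, f_b)
```

A near-zero r whose CI straddles or sits close to zero, as for all three model pairs above, is the signature of architecture-specific rather than shared failure modes.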
A shuffled-feature ablation confirms this works even with anonymous feature labels (88% CX vs 90% with real names; Extended Data Fig. 7), showing that the adversary–oracle separation, not materials expertise, is the core mechanism. The full multi-adversary audit costs $18.13; the most cost-effective single strategy (Gemini 3 Flash as LLM adversary) discovers 17 unique materials for $0.05, though purely algorithmic strategies (Sobol, random) discover more unique materials overall (138 and 123, respectively) at zero API cost.

Figure 2. Cross-MLIP comparison of three architecturally distinct MLIPs on 5,000 WBM-derived structures. a, CHGNet vs MACE max force (r = 0.13 [0.05, 0.21], n = 5,000). b, Pairwise force correlation heatmap: all three pairs near zero (CHGNet–TensorNet r = 0.10, CHGNet–MACE r = 0.13, TensorNet–MACE r = −0.01). c, Failure rates at the force > 50 eV/Å threshold: CHGNet 31.1% (n = 1,553), TensorNet 75.7% (n = 3,786), MACE 73.2% (n = 3,659). d, Architecture-specific blind spots with largely disjoint failure chemistries (A only / B only / both / neither: CHGNet vs TensorNet 106 / 2,339 / 1,447 / 1,108; CHGNet vs MACE 218 / 2,324 / 1,335 / 1,123; TensorNet vs MACE 337 / 210 / 3,449 / 1,004).

3.3 Functionally important discoveries

From the combined adversary outputs (Section 3.2), we rank materials by anomaly score (Eq. 1) and identify 21 unique materials with disjoint failure profiles (Table 1). Five contain actinides (Pu, Np, Th, U, Dy), three contain thallium, both underrepresented in CHGNet's MPtrj training data.
TlBiSe2 is an experimentally verified topological insulator [26] that CHGNet predicts as unstable while DFT confirms stability. Independent DFT recomputation (Quantum ESPRESSO, PAW-PBE) converges cleanly in 13 SCF steps while CHGNet produces 11.1 eV/atom energy and 19.9 eV/Å forces on the same golden-ratio probe structure: a catastrophic disagreement on a material with over 1,000 citations. Cs2KTlBr6 is a lead-free halide double perovskite with a near-ideal 1.27 eV band gap for photovoltaics, precisely the class of non-toxic solar cell materials that high-throughput screening pipelines aim to find. Five of 12 contain actinides relevant to nuclear fuel cycle modelling. If CHGNet were used as a stability filter, it would reject materials currently under active investigation in nuclear safety, topological electronics and photovoltaics.

Figure 3. Adversary strategy comparison (10 configurations, budget = 200): baseline_random, baseline_heuristic, baseline_sobol, Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, Gemini 2.5 Pro, Gemini 3 Flash, GPT-5, GPT-5.2. a, CX rate with Wilson 95% CI: all strategies achieve >85% against the 93.2% base rate. b, Unique materials discovered (failure mode entropy and novelty with bootstrap 95% CIs): algorithmic strategies find 61–138 unique compositions; LLM adversaries find 5–29 but concentrate on functionally important materials. c, Exploration heatmap: LLMs converge on high-Z, multi-element regions (top) while baselines spread uniformly (bottom).

Table 1. Top 12 DFT-stable/CHGNet-unstable materials, ranked by anomaly score. All have identified functional applications spanning nuclear, electronic, photovoltaic and catalytic domains.

Rank | Formula | Anomaly | Pred. error (eV/atom) | Max force (eV/Å) | Application
1 | DyLi2PuF8 | 3.49 | 7.06 | 75.6 | Nuclear fuel (molten salt)
2 | Pb4Pt4O12 | 2.97 | 4.13 | 107.6 | Pt-group catalysis
3 | Au2Ba2Np2Se6 | 2.70 | 6.95 | 44.9 | Nuclear waste
4 | Au2Cs2Sc2Te6 | 2.68 | 5.36 | 73.4 | Thermoelectric
5 | Bi2Cu2O4Te2V | 2.58 | 5.69 | 63.0 | Multiferroic
6 | Th2I6 | 2.53 | 6.07 | 53.8 | Nuclear fuel processing
7 | Rb2SnTlCl6 | 2.36 | 5.94 | 49.3 | Lead-free perovskite
8 | Cs2KTlBr6 | 2.15 | 6.12 | 37.2 | Perovskite solar cell
9 | Ba6NPb2Se | 1.61 | 4.96 | 36.0 | Optoelectronic
10 | CsAuTe | 1.57 | 4.38 | 45.1 | Relativistic bonding
11 | TlBiSe2 | 1.17 | 3.93 | 36.7 | Topological insulator
12 | Cs2USe3 | 0.69 | 3.00 | 33.7 | Actinide chalcogenide

3.4 Independent DFT validation at scale

To confirm that adversarially discovered blind spots are genuine, not artifacts of approximate structure construction, we recompute DFT ground states for the top 20 materials using Quantum ESPRESSO (pw.x v6.7, PAW-PBE, SSSP efficiency pseudopotentials, 60/480 Ry cutoffs, 4×4×4 k-mesh). Materials are selected from the WBM benchmark by: DFT-stable, CHGNet-unstable, max Z ≤ 56, n_sites ≤ 24, sorted by prediction error descending. Structures use golden-ratio fractional coordinate spacing with DFT-derived cell volumes (Methods). All 20 materials converge to well-defined self-consistent field (SCF) ground states (11–150 iterations, 22–6,360 s per material on 8 CPU cores). Two materials (Ba2Pd4Sb4 and BaF6In) require expanded cells and tighter mixing parameters on retry, but ultimately converge. The converged materials span 27 elements from Z = 3 (Li) to Z = 56 (Ba). We compare forces rather than energies because DFT total energies include core electrons (e.g. −3,284 eV/atom for Cu7Zn1) and are not directly comparable to CHGNet's formation-energy-referenced predictions. Forces on identical structures provide a direct, reference-free comparison.
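The golden-ratio probe structures are described only at a high level here (irrational fractional coordinates, DFT-derived volumes). One plausible low-discrepancy placement, offered purely as an assumption since the paper's exact Methods recipe is not reproduced, is a Kronecker/Weyl sequence built from powers of 1/φ:

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, irrational

def golden_ratio_fractional_coords(n_sites):
    """Place n_sites atoms at low-discrepancy fractional coordinates.

    Hypothetical construction (not the paper's exact recipe): each axis
    steps by a distinct irrational increment 1/phi, 1/phi^2, 1/phi^3
    modulo 1, so no atom lands on the special high-symmetry positions
    that rational coordinates would create.
    """
    alphas = (1 / PHI, 1 / PHI ** 2, 1 / PHI ** 3)
    return [tuple(((i + 1) * a) % 1.0 for a in alphas)
            for i in range(n_sites)]

coords = golden_ratio_fractional_coords(8)
```

Because the increments are irrational, the multiples are distinct modulo 1, which is exactly the "avoid symmetry artifacts" property the probes need.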
Across the 18 materials evaluated on identical golden-ratio structures, DFT forces exceed CHGNet forces by a median factor of 11.6× (range: 1.2–63.4×; Extended Data Table 2; the force comparison excludes the 2 retry materials whose structures differ). The headline result: Cu7Zn1 (brass, the most widely used copper alloy) converges in 15 SCF steps with DFT forces of 557 eV/Å, while CHGNet predicts 36 eV/Å, a 15× underestimate. Cu7Mn1 (manganese bronze) shows a 33× underestimate. These are not exotic compositions: they are industrial alloys that CHGNet's training data should cover.

We emphasise that the blind-spot classification (DFT-stable, CHGNet-unstable) originates from the WBM benchmark, where both DFT and CHGNet evaluate relaxed equilibrium structures, not the synthetic probes used here. The golden-ratio structures serve only as a compositional probe: they test whether DFT and CHGNet agree on force magnitudes for a given chemistry. A model that has learned correct interatomic interactions for a composition should produce forces of comparable magnitude on any structure of that composition, not only near-equilibrium ones. The 12× median force underestimate therefore indicates that CHGNet has not learned the correct energy landscape for these chemistries, consistent with their absence or underrepresentation in training data.

The 100% convergence rate (20/20), achieved on adversarially selected worst-case compositions, confirms that the PCM pipeline discovers genuine MLIP failures, not structure-generation artifacts. As a complementary test, we relax all 20 DFT-validated adversarial discoveries using CHGNet v0.3.0 with BFGS optimization (fmax = 0.05 eV/Å, max 500 steps). If the force disagreements were merely a consequence of probing far-from-equilibrium geometries, CHGNet relaxation should converge rapidly to low-energy minima.
Instead, all 20 materials exhibit massive relaxation energy changes of 3.98–15.45 eV/atom, far exceeding the ~0.1 eV/atom scale of typical formation energies. Eighteen materials converge (10–477 ionic steps), but two, Ag6Cl2S2 and As2In4Pd6, fail to converge even after 500 relaxation steps (final Fmax ≈ 0.44 eV/Å), indicating that CHGNet cannot locate a stable minimum for these chemistries. The As2In4Pd6 relaxation releases 15.4 eV/atom, roughly 15× its DFT formation energy. These results confirm that CHGNet's entire potential energy surface is fundamentally incorrect for these compositions, not merely its force predictions at a single probe geometry.

3.5 Failure analysis: UQ orthogonality and training-data gaps

A natural question is whether existing uncertainty quantification (UQ) methods can identify these blind spots without adversarial auditing. We test this by computing perturbation-based uncertainty, the approach used by CAGO [11], for 200 WBM materials. For each material, we generate 5 structures with Gaussian perturbations (σ = 0.01 Å) and measure CHGNet prediction variance.

The point-biserial correlation between perturbation uncertainty and compositional failure is r = 0.039 (p = 0.58), indicating that structural sensitivity is orthogonal to compositional failure. Spearman rank correlation reveals a marginal monotonic relationship (ρ = 0.125, p = 0.078), and failed materials show 1.72× higher mean uncertainty (Cohen's d = 0.14), but these effects are too weak for reliable deployment screening. The 50 highest-uncertainty materials show a paradoxically negative correlation (r = −0.14) with actual failure, meaning UQ-based screening would miss the most consequential blind spots. This directly challenges the assumption that model uncertainty [20;29;30] can serve as a proxy for reliability: a model can be confidently wrong on entire chemical families. Adversarial compositional auditing and structural perturbation-based UQ capture independent failure dimensions and should be deployed jointly.

Figure 4. Systematic failure patterns. a, Anomaly distribution: prediction error vs max force for 5,000 WBM materials (grey) with adversarially discovered failures (red). b, Per-element stability disagreement by periodic table block: f-block elements fail most (p = 0.042). c, JARVIS cross-functional validation: 682 CHGNet blind spots confirmed by independent DFT.

The blind spots are also not random with respect to the periodic table. Per-element analysis across 84 elements reveals that f-block elements have significantly higher DFT/CHGNet disagreement (Kruskal–Wallis p = 0.042; Extended Data Fig. 5). Elements targeted by adversarial strategies show higher disagreement than non-targeted elements (59.9% vs 55.4%, Mann–Whitney p = 0.020). The rarest frequency quintile contains 41% of adversarial elements; the most common contains none. Cross-referencing 814 WBM materials against the JARVIS-DFT database [39] (OptB88vdW functional) confirms 682 blind spots across DFT functionals, consistent with training data from the Materials Project [42], OQMD [40] and AFLOW [41]. Heavy elements show significantly higher disagreement (mean atomic mass p = 7 × 10⁻⁵, max Z p = 0.002; Mann–Whitney).

3.6 Envelope refinement and formal certification

The refined envelope delineates reliable from unreliable MLIP predictions. As a concrete example: starting from an initial claim that CHGNet is reliable for Z_max ≤ 83, the heuristic adversary discovers Cu7Zn1 (brass) at step 47 as a counterexample with prediction error 10.49 eV/atom.
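The bound-tightening step that follows a counterexample (e.g. Z_max ≤ 83 shrinking toward ≤ 70) can be sketched as a simple refinement over passing and failing probes. The logic below is an illustrative assumption, not the paper's bootstrap refinement:

```python
def refine_upper_bound(initial_bound, probes):
    """Tighten an upper bound on one feature (e.g. max atomic number).

    probes: list of (feature_value, failed) pairs from the adversarial
    audit. Illustrative rule (an assumption, not the paper's method):
    shrink the bound to the largest passing value, clamping below the
    smallest failure if any failure lies at or under that candidate.
    """
    passing = [v for v, failed in probes if not failed and v <= initial_bound]
    failing = [v for v, failed in probes if failed and v <= initial_bound]
    if not passing:
        return initial_bound
    candidate = max(passing)
    bad_below = [v for v in failing if v <= candidate]
    if bad_below:
        # A failure inside the candidate region: retreat below it.
        candidate = min(candidate, min(bad_below) - 1)
    return min(candidate, initial_bound)

# Toy audit trace: (Z_max, failed?)
probes = [(30, False), (55, False), (70, False), (74, True), (83, True)]
new_bound = refine_upper_bound(83, probes)  # tightens 83 -> 70
```

The real pipeline additionally wraps each refined bound in a bootstrap 95% CI before it is compiled into a Lean 4 theorem.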
After 200 queries (86% counterexample rate, 61 unique materials), bootstrap refinement tightens the envelope to key bounds: Z_max ≤ 70 (all counterexamples above this threshold failed), mean atomic mass ≤ 164 u, bandgap (PBE) ≥ 0.92 eV, consistent with known CHGNet limitations. Iterative refinement (4 rounds of attack–refine) compresses the safe envelope by 75–91% per feature dimension (Extended Data Fig. 4).

The envelope compiles into five Lean 4 proof modules (~250 lines, Lean 4.27.0): a safety theorem with physical axioms, perturbation stability via interval arithmetic, error propagation composing DFT uncertainty (0.1 eV/atom) with the MLIP threshold via the triangle inequality, monotone refinement soundness, and evaluation metric properties (CX rate monotonicity, CI convergence with budget). All axioms are explicit and inspectable. The proofs verify reasoning correctness, i.e. that conclusions follow from stated assumptions, analogous to AI-Descartes [6] combining data and theory, and providing the machine-checkable evidence that structured safety cases [16] currently lack.

3.7 Prospective validation

The preceding results establish that PCM discovers genuine blind spots. A stronger question is whether those discoveries generalise: can compositional features identified through adversarial auditing predict failures on materials the framework has never seen?

Protocol. We split the full 25,000-material WBM benchmark into a discovery set (15,000 materials, 60%) used for adversarial auditing and envelope refinement, and a held-out validation set (10,000 materials, 40%) never seen during any stage of the PCM pipeline. A gradient-boosted classifier (scikit-learn GradientBoostingClassifier, 5 random seeds) is trained to predict binary failure (Fmax > 50 eV/Å) from the 8 compositional features used throughout this work.

Results. The classifier achieves AUC-ROC 0.938 ± 0.004 on the held-out validation set (mean ± s.d.
across 5 random seeds), demonstrating that PCM-discovered failure patterns are not overfit to the discovery set but capture genuine compositional regularities. Precision at 20% recall (P@20%) is 1.000: the top 20% of risk-ranked materials are 100% actual failures. This means a deployment pipeline that flags the highest-risk quintile achieves perfect precision with zero false alarms.

Key features. The top predictive features are n_sites (number of atoms per unit cell), volume_per_atom, and max_z (maximum atomic number), all compositional descriptors that PCM counterexamples naturally concentrate on during adversarial search. These are simple, interpretable features that materials scientists can act on directly: compositions with many sites, large volumes per atom, or heavy elements warrant additional DFT validation before trusting MLIP predictions.

Implication. This result transforms PCM from a retrospective auditing tool into a predictive intervention: rather than auditing every material individually, a PCM-trained risk model can screen large candidate libraries and flag likely failures before expensive DFT computation. The combination of adversarial discovery (identifying where failures occur) and prospective prediction (generalising why they occur) closes the loop between falsification and deployment.

3.8 Cross-MLIP prospective transfer

A critical question for deployment is whether PCM-discovered failure patterns are specific to a single MLIP architecture or capture general vulnerability patterns shared across architectures. We test this directly: can a risk model trained on one MLIP's failures predict failures of a different MLIP?

Protocol. Using a separate 10,000-material multi-model benchmark (CHGNet + MACE), we split into
A gradient-boosted classifier is trained on one model’s failures from the discovery set and evaluated on both models’ failures on the held-out validation set (5 random seeds). Results. Same-model baselines confirm strong predictive power: CHGNet→CHGNet AUC-ROC = 0.906±0.002, MACE→MACE AUC-ROC = 0.888±0.003. Cross-architecture transfer retains substan- tial predictive signal: CHGNet→MACE AUC-ROC = 0.685± 0.006, MACE→CHGNet AUC-ROC = 0.708± 0.007 (average cross-MLIP AUC-ROC = 0.697). Both cross-transfer directions exceed the 0.5 random baseline by a wide margin (p < 10 −4 across all seeds), indicating that a significant fraction of MLIP vulnerability is compositionally determined rather than architecture-specific. Feature importance overlap. The Pearson correlation between CHGNet and MACE feature impor- tances is r = 0.877± 0.025, with 2.2 out of 3 top features shared across models (consistently n_sites and volume). This high overlap confirms that both architectures struggle with the same compositional drivers—large unit cells and heavy elements—even though their absolute failure sets are largely disjoint (r = 0.182 for raw errors). Consensus prediction. Training on either model and predicting “any model fails” yields AUC-ROC 0.813±0.004 (CHGNet-trained) and 0.834±0.005 (MACE-trained), suggesting that single-model PCM audits capture a substantial portion of the shared vulnerability landscape. Implication. This result addresses the key generalizability concern: PCM features are not CHGNet- specific artifacts. A risk model trained on one MLIP’s failures provides actionable early warning for other architectures, with a shared compositional vulnerability signal (r = 0.877) and architecture- specific residuals. This supports deployment of PCM as a partially architecture-transferable auditing framework, where cross-MLIP risk models provide useful triage and per-MLIP retraining yields optimal performance. 
3.9 Pipeline impact

To quantify the practical consequences of MLIP blind spots, we evaluate single-model screening performance on the WBM benchmark.

Single-model screening. Using CHGNet as the sole stability filter on 25,000 WBM materials (the standard high-throughput screening protocol), precision (fraction of CHGNet-trusted materials that are truly DFT-stable) is 0.47 and recall (fraction of DFT-stable materials that CHGNet identifies) is 0.07. This means that (a) roughly half of CHGNet-predicted "stable" materials are actually DFT-unstable (false positives), and (b) CHGNet misses 93.0% of DFT-stable materials (false negatives; distinct from the 93.2% counterexample rate on 5K, which counts all failure types). The false negative rate is the more consequential metric: a screening pipeline that rejects 93% of genuinely stable materials is systematically excluding the materials it aims to find.

PCM-audited screening. Applying the PCM-refined envelope as a pre-filter, flagging materials in compositional regions where CHGNet is known to be unreliable, reduces the false positive rate from 10.7% to 8.1% (a 24% relative reduction). Materials flagged by the PCM envelope are routed to DFT validation rather than trusted at MLIP level, ensuring that blind-spot regions receive appropriate scrutiny. The prospective risk model (Section 3.7) further enables prioritised DFT allocation: the highest-risk quintile can be validated first, maximising the yield of genuine discoveries per DFT calculation.

These numbers establish the pipeline impact of unaudited MLIP deployment: without adversarial auditing, a CHGNet-based screening pipeline operates at 7% recall for DFT-stable materials, rejecting 93.0% of the materials it should identify as promising candidates.

3.10 Multi-MLIP deployment simulation

The preceding sections establish that (a) blind spots are architecture-specific and (b) PCM risk models transfer across architectures.
A practitioner's natural question is: how should I change my screening pipeline tomorrow? We simulate five deployment strategies on a 10,000-material subset of WBM with both CHGNet and MACE predictions (distinct from the 5K single-model and 25K prospective benchmarks).

Screening strategies. Using force reliability (F_max < 50 eV Å⁻¹) as the trust criterion:

• CHGNet-only: precision 0.564, recall 0.676, false negative rate (FNR) 32.4% (6,829 trusted)
• MACE-only: precision 0.529, recall 0.174, FNR 82.5% (1,882 trusted)
• Union (trust if either reliable): precision 0.559, recall 0.698, FNR 30.2% (7,116 trusted)
• Consensus (trust only if both reliable): precision 0.545, recall 0.152, FNR 84.8% (1,595 trusted)

Consensus screening is too conservative, missing 84.8% of DFT-stable materials. Union achieves the best no-DFT recall (69.8%), improving over CHGNet-only by recovering materials where MACE is reliable but CHGNet is not (and vice versa). Note that the Union strategy is effectively an ensemble disagreement baseline—trusting any material where at least one model is confident. Even this natural multi-model heuristic achieves only 69.8% recall; adding the PCM risk model raises recall to 78.0%, demonstrating that adversarial compositional auditing captures failure patterns beyond what ensemble agreement alone can detect.

PCM-audited screening. Adding the PCM risk model to CHGNet screening (routing the top 20% risk-ranked materials to DFT) raises screening precision (fraction of trusted materials that are DFT-stable) from 0.564 → 0.623 ± 0.004 and recall (fraction of DFT-stable materials correctly identified) from 0.676 → 0.780 ± 0.005—recovering 10.4% more DFT-stable materials that unaudited CHGNet would reject. Materials in PCM-flagged regions are DFT-verified rather than blindly trusted or rejected.

DFT budget efficiency. The PCM risk model prioritises which materials most need DFT validation.
At a 20% DFT budget (800 materials out of 4,000 in the validation set), PCM-ranked allocation achieves a DFT yield of 0.756 ± 0.012 (75.6% of DFT-evaluated materials are genuinely stable), compared to 0.563 ± 0.008 for random DFT allocation—a 34% improvement in DFT efficiency. This means each DFT calculation is 1.34× more likely to discover a stable material when guided by PCM risk ranking.

Blind spot coverage. 72.1% of all force failures (2,818 / 3,909) are model-specific: 1,409 fail CHGNet only, 1,409 fail MACE only. Among the 5,703 DFT-stable materials, 42.7% are flagged as unreliable by at least one MLIP—yet only 13.0% are flagged by both. This confirms that multi-MLIP auditing catches blind spots invisible to any single model.

Practitioner recommendation. Based on these results, the optimal deployment strategy is: (1) screen with the union of available MLIPs to maximise recall, (2) apply the PCM risk model to flag compositional regions of concern, and (3) allocate DFT budget preferentially to PCM-flagged materials. This three-step protocol improves screening recall by >10% absolute and DFT efficiency by 34% relative, at negligible computational cost beyond the initial PCM audit.

3.11 Deployment case study: thermoelectric screening

To demonstrate the practical impact of PCM-audited screening in a realistic discovery workflow, we simulate a thermoelectric materials campaign. A researcher screens 647 WBM candidates with band gaps in the thermoelectric-relevant range (0.1–1.5 eV), of which 488 are DFT-stable (ground truth unknown to the screener).

Head-to-head comparison. Five protocols are evaluated (Table 2):

Table 2. Thermoelectric screening case study. Discovery yield across five screening protocols for 647 WBM candidates (488 DFT-stable). PCM-audited screening discovers 62 additional stable thermoelectrics over the CHGNet baseline while reducing false leads, at the cost of 130 DFT evaluations (20% of candidates).
Protocol              Discov.  Missed        False leads  Precision  DFT evals
CHGNet-only           244      244 (50.0%)   63           0.795      0
MACE-only             63       425 (87.1%)   14           0.818      0
Union (no DFT)        253      235 (48.2%)   65           0.796      0
Union + random DFT    296      192 (39.3%)   52           0.850      130
PCM-audited (ours)    306      182 (37.3%)   53           0.852      130

Key result. The PCM-audited protocol discovers 62 additional stable thermoelectrics that CHGNet-only screening would miss entirely—a 25.4% improvement in discovery yield. The miss rate drops from 50.0% to 37.3%, while precision improves from 0.795 to 0.852 (fewer false leads). This requires 130 DFT evaluations (20.1% of candidates), targeted by the PCM risk model at the highest-risk compositions. Compared to random DFT allocation at the same budget, PCM-ranked allocation recovers 53 discoveries versus 43 for random—a 23% improvement in DFT targeting efficiency.

Recovered materials. Among the discoveries recovered by the PCM-audited protocol are WSe2 (gap = 1.47 eV, E_hull = −0.558 eV/atom), IrSbZr (gap = 1.48 eV, half-Heusler thermoelectric), and BaBiScO6 (gap = 1.46 eV, double perovskite)—all DFT-stable materials that CHGNet flags as unreliable (max force > 50 eV Å⁻¹) due to architecture-specific blind spots. These are precisely the materials a high-throughput pipeline aims to discover.

Practical implication. For a research group screening ~650 thermoelectric candidates, the PCM-audited protocol adds ~60 genuine discoveries at the cost of 130 DFT calculations—roughly 0.5 additional discoveries per DFT calculation. Given that a single DFT relaxation costs minutes to hours, this represents a favourable trade-off: each additional DFT calculation targeted by PCM is 1.23× more likely to yield a discovery than random allocation.
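The discovery-yield arithmetic behind Table 2 follows directly from the reported counts (488 DFT-stable materials among 647 candidates); a quick check:

```python
# Precision and miss rate for the thermoelectric case study (Table 2 counts).
N_STABLE = 488  # DFT-stable candidates among the 647 screened

def screening_metrics(discovered, false_leads, n_stable=N_STABLE):
    precision = discovered / (discovered + false_leads)
    miss_rate = (n_stable - discovered) / n_stable
    return precision, miss_rate

p_chgnet, m_chgnet = screening_metrics(244, 63)  # CHGNet-only row
p_pcm, m_pcm = screening_metrics(306, 53)        # PCM-audited row

print(f"CHGNet-only: precision {p_chgnet:.3f}, miss rate {m_chgnet:.1%}")
print(f"PCM-audited: precision {p_pcm:.3f}, miss rate {m_pcm:.1%}")
print(f"extra discoveries: {306 - 244} ({(306 - 244) / 244:+.1%})")
```

This reproduces the reported 0.795 → 0.852 precision gain, the 50.0% → 37.3% miss-rate drop, and the +25.4% yield improvement.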
3.12 Cross-domain generalizability

To test whether the PCM framework generalises beyond solid-state materials, we apply the identical pipeline to three independent domains—molecular property prediction (QM9), drug solubility (ESOL), and tabular ML regression (California Housing)—running extended experiments across all three: multi-strategy comparison (5 strategies × 3 seeds per domain), budget sensitivity (5 budgets × 3 strategies × 3 seeds per domain), full pipeline verification, and failure characterisation. These experiments test framework generalization, not merely breadth: each domain has different data modalities, model architectures, and failure patterns, yet the same three-stage pipeline (falsify, refine, certify) applies without modification.

Molecular property prediction (QM9). Two real graph neural networks—SchNet [52] (455K parameters, distance-based message passing) and DimeNet++ [53] (1.9M parameters, directional message passing)—were trained for 200 epochs on 8,000 QM9 molecules (2,000 held-out test set, 10,000 total) [49] on an A4000 GPU. SchNet achieves test MAE = 0.125 eV (median error 0.101 eV); DimeNet++ achieves test MAE = 0.087 eV (median error 0.064 eV). The models produce moderate error correlations (Pearson r = 0.461) with architecture-specific failure modes: at a 1.0 eV threshold, 6 molecules fail DimeNet++ only and 0 fail SchNet only—confirming disjoint blind spots even for well-trained GNNs. Top failure predictors are n_atoms and molecular_weight. At budget = 200, grid sampling achieves the highest CX rate (38.0 ± 0.0%), followed by sobol (37.8 ± 1.4%), LHS (34.0 ± 4.8%), and random (33.3 ± 3.8%), while the heuristic adversary discovers 23/200 counterexamples (11.5 ± 2.0%).
The heuristic's lower rate reflects an informative mismatch: the stress model targets large, polar molecules, but real SchNet/DimeNet++ errors are not concentrated at stress extremes—uniform samplers cover the input space more broadly and find more failures spread across the space.

Drug solubility prediction (ESOL). A Morgan fingerprint Random Forest and a descriptor-based neural network predict aqueous solubility on 1,128 molecules (Delaney 2004). Failures are largely disjoint: 334 molecules fail the fingerprint model only, 58 fail the neural network only, and just 86 fail both (error correlation r = 0.321). The heuristic adversary achieves 80.3 ± 2.3% CX rate versus grid 70.0%, LHS 60.7%, random 59.0% and sobol 58.3%. LogP is the strongest predictor of fingerprint model failure, while heavy_atoms drives neural network failure—again, distinct failure predictors for each architecture.

Tabular ML regression (California Housing). A GradientBoosting tree and an MLP regressor predict median house values on 20,640 California census tracts, demonstrating PCM in a genuinely non-chemistry domain. The GBT achieves MAE 0.31 ($31K) and the MLP MAE 0.34 ($34K). Error correlation is high (r = 0.734), with top failure predictors ave_occup, med_inc, and population. At budget = 200, grid achieves the highest CX rate (58.5 ± 0.0%), followed by sobol (40.2 ± 1.3%), random (40.0 ± 2.5%), LHS (38.3 ± 3.4%), and heuristic (16.2 ± 4.5%). The heuristic adversary discovers 24/200 counterexamples (12.0%), with meaningful envelope refinement and bootstrap CIs computed. Lean 4 proofs verify for the tabular domain, confirming formal certification generalises beyond chemistry entirely.

Budget sensitivity. CX rate saturates early in all three domains. For QM9, the heuristic adversary is stable at ~10–11% across all budgets while random saturates at budget = 100 (36.3%) and sobol at budget = 200 (37.8%).
For ESOL, heuristic saturates at budget = 200 (80.3%), random at budget = 500 (59.4%), and sobol remains stable at ~58–60%. For Tabular, heuristic saturates at budget = 100 (15.3%), random at budget = 200 (40.0%), and sobol at ~40%. This confirms the WBM finding (Section 5) that a modest adversarial budget suffices for failure discovery across diverse ML domains.

Full pipeline. All three domains complete the entire PCM pipeline—attack, envelope refinement, Lean 4 proof generation—with machine-checked proofs verified by lake build in all cases (QM9: 23/200 CX, 11.5%; ESOL: 161/200 CX, 80.5%; Tabular: 24/200 CX, 12.0%). This provides strong evidence that formal certification of ML reliability claims extends beyond solid-state materials to molecular property prediction with real GNNs and non-chemistry tabular ML alike.

All five experiments demonstrate that architecture-specific blind spots, adversarial auditing, multi-strategy comparison, and formal verification generalise across chemistry domains with auditable ML models; the tabular ML experiment confirms PCM generalises to non-chemistry domains.

4 Discussion

Prospective validation as the strongest evidence. The most significant result is not the discovery of blind spots—which prior work has noted qualitatively—but the demonstration that PCM-discovered features predict failures on unseen materials (AUC-ROC 0.938 ± 0.004; P@20% = 1.000, i.e. perfect precision among the top-risk quintile with zero false alarms). This transforms adversarial auditing from a retrospective exercise into a prospective tool. The key insight is that compositional features concentrating counterexamples (n_sites, volume_per_atom, max_z) generalise to unseen chemistries—a property that conformal prediction (marginal coverage only) and perturbation-based UQ (r = 0.039) do not provide.

Deployment impact.
Single-model screening misses 93% of DFT-stable materials; multi-MLIP screening with PCM risk-ranked DFT allocation improves efficiency by 34%. The thermoelectric case study (Section 3.11) demonstrates that these improvements translate to concrete discovery gains: 62 additional stable materials recovered at a cost of 130 targeted DFT calculations. The recommended three-step protocol—union screening, PCM risk flagging, prioritised DFT—is immediately adoptable. Critically, 72.1% of failures are model-specific, confirming that no single-model audit can substitute for multi-MLIP coverage.

Formal verification and the specification gap. The Lean 4 proofs are explicitly conditional: every approximation (DFT at 0.1 eV atom⁻¹, MLIP threshold at 2.0 eV atom⁻¹) appears as an inspectable axiom with constructive non-vacuity witnesses (Extended Data Fig. 9). This addresses the specification gap [17] by making assumptions explicit rather than hiding them in implementation details. Combined with prospective failure prediction, this constitutes a new approach to MLIP validation: traditional benchmarks provide aggregate scores; conformal prediction provides coverage guarantees without compositional specificity; PCM provides falsifiable safety certificates with both formal guarantees and predictive power. Cross-MLIP transfer (AUC-ROC ~0.70 cross-model, feature importance r = 0.877) confirms that vulnerability patterns are partially shared across architectures.

Cross-domain generalizability. Extended experiments across three chemistry domains (WBM, QM9

Table 3. Box 1: How to audit your MLIP. Three-step protocol for pre-deployment reliability certification.
Step 1 (Define): Write a YAML envelope specifying compositional bounds and failure thresholds. Example: Z_max ≤ 83, prediction error ≤ 2 eV atom⁻¹, force ≤ 50 eV Å⁻¹.
Step 2 (Attack): Run adversarial search: python -m pcm attack -model <yours> -budget 200.
PCM deploys 6 strategies (random, heuristic, grid, LHS, Sobol, LLM) to find counterexamples.
Step 3 (Certify): Inspect the refined envelope and machine-checked Lean 4 proofs. The proofs certify: if the assumptions hold, then the safety claim follows.

with real SchNet/DimeNet++ GNNs, ESOL) and one non-chemistry domain (California Housing, tabular ML) confirm that the three-stage pipeline generalises across data modalities and model architectures (Section 3.12). Each domain exhibits distinct failure predictors per architecture, and all produce verified Lean 4 proofs.

Accessibility and practical considerations. The full multi-strategy, multi-MLIP audit costs $18.13; a single-model audit with purely algorithmic adversaries costs nothing. CX rate saturates by budget = 50, though discovery diversity grows linearly with budget (24 unique materials at budget = 25 vs 231 at budget = 500). PCM selects adversarial points for auditing (200 queries), not retraining (>100K structures)—at budget = 200, the probability of missing a failure mode of prevalence ≥ 5% is < 0.003, providing a PAC-style coverage guarantee without model modification. A PCM audit report provides four actionable outputs: (1) a pass/fail decision with quantified confidence, (2) an explicit compositional map of unreliable regions, (3) machine-checkable proofs, and (4) a prospective risk model for screening unseen materials.

Limitations. (1) Ground truth is computational DFT, not experiment—though independent QE recomputation (20/20 converging on adversarially selected compositions), cross-functional (JARVIS) validation, and literature-verified cases (TlBiSe2) provide three independent lines of evidence. (2) The WBM oracle is database retrieval, not structure generation; live CHGNet evaluation on 50 approximate (golden-ratio) structures shows moderate energy correlation (r = 0.42, p = 0.002) and force correlation (r = 0.46, p < 0.001) with database values.
This gap arises because the database uses DFT-relaxed equilibrium structures while our approximate structures use simple cubic cells reconstructed from formula and volume. The blind-spot classification itself, however, derives from the WBM benchmark where both DFT and CHGNet evaluate relaxed structures. (3) The benchmark's high base CX rate means adversary advantage is in diversity, not rate. (4) Lean proofs verify reasoning correctness, not physics. (5) Cross-MLIP transfer (Section 3.8) shows moderate but imperfect generalisation (average AUC-ROC = 0.697): while a CHGNet-trained risk model provides useful early warning for MACE failures, architecture-specific retraining improves AUC-ROC by ~0.2, suggesting that optimal deployment uses per-MLIP auditing.

5 Methods

5.1 Framework and oracle models

PCM operates on three abstractions: Environment (compositional features: n_elements, Z_max, electronegativity, volume per atom, atomic mass), Design (structural parameters: n_sites, bandgap), and Outcomes (prediction error, maximum force). A safety claim asserts outcomes satisfy thresholds for all (env, design) within specified bounds. Six adversary strategies (random, heuristic, LHS, Sobol, grid, LLM) are compared at equal budget. The WBM oracle maps adversary-proposed feature vectors to nearest real materials via KDTree lookup (O(log n)) and returns pre-computed CHGNet/DFT comparisons. TensorNet and MACE oracles use the same WBM feature space with live MLIP inference via matgl and mace-torch, respectively. Stability classification uses E_above_hull < 0.025 eV atom⁻¹. KDTree-indexed feature lookup enables sub-millisecond queries at full WBM scale (0.43 ms per query at 257K vs 0.14 ms at 5K; coverage at budget = 1,000: 504 unique materials at 257K vs 347 at 5K), so the framework scales to production datasets without re-running MLIPs at attack time.

5.2 Adversary strategies

Six strategies are evaluated:

• Random: Uniform sampling within envelope bounds.
• Heuristic: Beta-distribution sampling biased toward stress directions specified in the envelope YAML.
• Grid: Uniform grid over envelope bounds.
• LHS: Latin Hypercube Sampling (scipy).
• Sobol: Sobol quasi-random sequences (scipy).
• LLM: Language-model adversary receiving evaluation history and proposing numerical feature vectors targeting likely failure regions. On API failure, it degrades to random sampling with exponential backoff. Temperature = 0.2 for all models.

5.3 Envelope refinement

Environment bounds are tightened to the 75th percentile of passing evaluations using the conservative 95% bootstrap CI (1,000 resamples). Minimum sample gate: 30 passing evaluations. Design requirements are stress-direction-aware.

5.4 Lean 4 certification

Refined envelopes compile into five proof modules: core safety theorem, perturbation stability, error propagation (composing DFT + MLIP uncertainty via the triangle inequality), monotone refinement soundness, and evaluation metric properties (CX count monotonicity, CI width convergence, PAC coverage guarantee). Lean 4.27.0 [5], no mathlib dependency. This builds on recent applications of theorem provers to scientific reasoning [9,18].

5.5 Prospective validation protocol

The WBM benchmark (25,000 materials) is split 60/40 into discovery (15,000) and validation (10,000) sets. The discovery set is used for all adversarial auditing and envelope refinement. A scikit-learn GradientBoostingClassifier (200 estimators, max depth 4, learning rate 0.05) is trained on the 8 compositional features (n_elements, max_z, mean_electronegativity, std_electronegativity, volume_per_atom, mean_atomic_mass, n_sites, bandgap_pbe) to predict binary failure (F_max > 50 eV Å⁻¹). Results are aggregated across 5 independent random seeds (each with a different discovery/validation split). Performance is evaluated on the held-out validation set using AUC-ROC, precision-recall, and precision at k% recall (P@k%).
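One common reading of the P@k% triage metric used above is precision among the k% highest-risk materials; this exact definition is our assumption, and the data below are synthetic:

```python
import numpy as np

def precision_at_k(y_true, risk_scores, k=0.20):
    """Fraction of true failures among the top-k% risk-ranked items."""
    n_top = int(np.ceil(k * len(y_true)))
    top = np.argsort(risk_scores)[::-1][:n_top]  # indices of highest risk
    return float(y_true[top].mean())

# Synthetic illustration: a well-separated risk model concentrates
# true failures (base rate 30%) in the top-risk quintile.
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=1000)        # hypothetical failure labels
scores = y + 0.3 * rng.normal(size=1000)   # risk scores tracking the labels
print(f"P@20% = {precision_at_k(y, scores):.3f}")
```

With well-separated scores as here, P@20% approaches 1.0, the regime reported for the prospective risk model.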
5.6 Benchmark construction

25,000 materials sampled from 257K WBM structures [2], characterised by 8 compositional features with pre-computed CHGNet v0.3.0 predictions and DFT reference values. Full 257K available for scaling validation. Anomaly score:

    a_i = z_pred_err + 0.5 · z_force + 2.0 · 1[DFT stable ≠ CHGNet stable]    (1)

robust to alternative weightings (Kendall τ > 0.7 across 5 configurations).

5.7 Independent DFT validation

Quantum ESPRESSO 6.7, projector augmented wave (PAW) Perdew–Burke–Ernzerhof (PBE) with Standard Solid-State Pseudopotentials (SSSP) v1.3.0 efficiency library, 60/480 Ry cutoffs, 4 × 4 × 4 k-mesh, cold smearing (σ = 0.01 Ry), mixing β = 0.3, convergence threshold 10⁻⁸ Ry. TlBiSe2 converges in 13 SCF steps (207.6 s). For the 20-material validation (Section 3.4), materials are selected from 811 candidates meeting: DFT-stable, CHGNet-unstable, max Z ≤ 56, n_sites ≤ 24. Structures use golden-ratio fractional coordinate spacing (φ = (1 + √5)/2; deterministic, non-overlapping) with DFT-derived cell volumes from WBM. Each SCF calculation runs on 8 CPU cores with a 3,600 s timeout. All 20 converge: 18 on first pass (47 min total) and 2 after retry with expanded cells and tighter mixing (an additional 2 hours). Total first-pass wall time for all 20 attempts is 54 min, including the 2 that initially failed.

5.8 Cross-MLIP evaluation

Three architecturally distinct MLIPs—CHGNet v0.3.0 [1], TensorNet (MatPES-PBE-v2025.1-PES via matgl 2.0.6, evaluated in float32 precision), and MACE-MP medium [4] (via mace-torch, float64)—are evaluated on 5,000 WBM materials (random sample, seed = 42). Structures are built from WBM composition, volume, and Wyckoff symmetry data using golden-ratio fractional coordinate spacing with correct DFT-derived cell volumes. Six pairwise correlations (energy and force) are computed with bootstrap 95% CIs (1,000 resamples). Force failure threshold: 50 eV Å⁻¹.
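The bootstrap confidence intervals used for the pairwise correlations (1,000 resamples; the percentile form is our assumption) can be sketched as:

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=1000, seed=42):
    """95% percentile-bootstrap CI for the Pearson correlation of (x, y)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.percentile(rs, [2.5, 97.5])

# Synthetic stand-in for two models' per-material errors (n = 5,000),
# weakly correlated as in the near-disjoint blind-spot regime.
rng = np.random.default_rng(0)
e1 = rng.normal(size=5000)
e2 = 0.1 * e1 + rng.normal(size=5000)
lo, hi = bootstrap_corr_ci(e1, e2)
print(f"95% bootstrap CI for r: [{lo:.3f}, {hi:.3f}]")
```

At n = 5,000 the interval is narrow, which is why the reported near-zero pairwise error correlations are statistically meaningful.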
Data Availability

The WBM benchmark (25,000 materials), all experimental runs and generated Lean 4 proofs are included in the repository. The full WBM dataset (257K) is available from Matbench Discovery [2].

Code Availability

PCM is open-source at https://github.com/abhinaba/alloy_pcm. Reproduction: pip install -e . && python -m pcm demo.

Acknowledgements

We thank the developers of CHGNet, MACE and TensorNet for making their models publicly available, and the Matbench Discovery team for the WBM benchmark infrastructure.

Author Contributions

A.B. conceived the project, developed the PCM framework, performed all experiments and wrote the manuscript. P.C. supervised the project and reviewed the manuscript.

Competing Interests

The authors declare no competing interests.

References

[1] Deng, B. et al. CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nat. Mach. Intell. 5, 1031–1041 (2023).
[2] Riebesell, J. et al. Matbench Discovery—A framework to evaluate machine learning crystal stability predictions. Nat. Mach. Intell. 7, 836–847 (2025).
[3] Wang, H.-C., Botti, S. & Marques, M. A. L. Predicting stable crystalline compounds using chemical similarity. npj Comput. Mater. 7, 12 (2021).
[4] Batatia, I. et al. MACE: Higher order equivariant message passing neural networks for fast and accurate force fields. NeurIPS (2022).
[5] de Moura, L. & Ullrich, S. The Lean 4 theorem prover and programming language. CADE-28 (2021).
[6] Cornelio, C. et al. Combining data and theory for derivable scientific discovery with AI-Descartes. Nat. Commun. 14, 1777 (2023).
[7] Dreossi, T., Donze, A. & Seshia, S. A. Compositional falsification of cyber-physical systems with machine learning components. J. Autom. Reason. 63, 1031–1053 (2019).
[8] Ernst, D. et al. ARCH-COMP 2023 Category Report: Falsification. EPiC Ser. Comput. (2023).
[9] Walters, J. et al. PhysLean: Digitizing physics with Lean. arXiv:2405.10000 (2024).
[10] Dunn, A. et al.
Benchmarking materials property prediction methods: the Matbench test suite. npj Comput. Mater. 6, 138 (2020).
[11] Yoo, S. et al. Learning atomic forces from uncertainty-calibrated adversarial attacks on graph neural network potentials. npj Comput. Mater. 11, 200 (2025).
[12] Kulichenko, M. et al. Uncertainty-driven dynamics for active learning of interatomic potentials. Nat. Comput. Sci. 3, 230–239 (2023).
[13] Dalrymple, D. et al. Towards Guaranteed Safe AI: A framework for ensuring robust and reliable AI systems. arXiv:2405.06624 (2024).
[14] Deng, B. et al. Systematic softening in universal machine learning interatomic potentials. npj Comput. Mater. 11, 9 (2025).
[15] Necula, G. C. Proof-carrying code. POPL 106–119 (1997).
[16] Hawkins, R. et al. Guidance on the assurance of machine learning in autonomous systems (AMLAS). arXiv:2102.01564 (2021).
[17] Seshia, S. A., Sadigh, D. & Sastry, S. S. Toward verified artificial intelligence. Commun. ACM 65, 46–55 (2022).
[18] AlphaProof team. Formal mathematical reasoning with reinforcement learning. Nature (2025).
[19] Angelopoulos, A. N. et al. Conformal risk control. ICLR (2024).
[20] Xu, H. et al. Evidential deep learning for interatomic potentials. Nat. Commun. (2025).
[21] Zhang, H. et al. AlloyGPT: An autoregressive large language model for alloy prediction and design. npj Comput. Mater. 11, 68 (2025).
[22] Gupta, V. et al. AlloyBERT: Alloy property prediction with large language models. Comput. Mater. Sci. 246, 113433 (2024).
[23] Xie, T. et al. DARWIN: A series of foundation models for materials science. arXiv:2412.11970 (2024).
[24] Rubungo, A. N. et al. LLM-Prop: Predicting physical and electronic properties of crystalline solids from their text descriptions. npj Comput. Mater. 11, 36 (2025).
[25] Antunes, L. M. et al. Crystal structure generation with autoregressive large language modelling. Nat. Commun. 15, 10570 (2024).
[26] Sato, T. et al.
Direct evidence for the Dirac-cone topological surface states in the ternary chalcogenide TlBiSe2. Phys. Rev. Lett. 105, 136802 (2010).
[27] Simeon, G. & de Fabritiis, G. TensorNet: Cartesian tensor representations for efficient and accurate interatomic potentials. NeurIPS (2023).
[28] Chen, C. et al. MatPES: A universal pretrained potential for materials science. arXiv:2501.11768 (2025).
[29] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML (2016).
[30] Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. NeurIPS (2017).
[31] Musil, F. et al. Fast and accurate uncertainty estimation in chemical machine learning. J. Chem. Theory Comput. 15, 906–915 (2019).
[32] Hirschfeld, L. et al. Uncertainty quantification using neural networks for molecular property prediction. J. Chem. Inf. Model. 60, 3770–3780 (2020).
[33] Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. ICLR (2015).
[34] Madry, A. et al. Towards deep learning models resistant to adversarial attacks. ICLR (2018).
[35] Choudhary, K. & DeCost, B. Atomistic line graph neural network for improved materials property predictions. npj Comput. Mater. 7, 185 (2021).
[36] Park, Y. et al. Scalable parallel algorithm for graph neural network interatomic potentials in molecular dynamics simulations. J. Chem. Theory Comput. 20, 4857–4868 (2024).
[37] Liao, Y.-L. & Smidt, T. EquiformerV2: Improved equivariant transformer for scaling to higher-degree representations. ICLR (2024).
[38] Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023).
[39] Choudhary, K. et al. The joint automated repository for various integrated simulations (JARVIS) for data-driven materials design. npj Comput. Mater. 6, 173 (2020).
[40] Kirklin, S. et al.
The Open Quantum Materials Database (OQMD): Assessing the accuracy of DFT formation energies. npj Comput. Mater. 1, 15010 (2015).
[41] Curtarolo, S. et al. AFLOW: An automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58, 218–226 (2012).
[42] Jain, A. et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
[43] Katz, G. et al. Reluplex: An efficient SMT solver for verifying deep neural networks. CAV (2017).
[44] Huang, X. et al. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Comput. Sci. Rev. 37, 100270 (2020).
[45] Amodei, D. et al. Concrete problems in AI safety. arXiv:1606.06565 (2016).
[46] Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon. Digital Discovery 3, 781–810 (2024).
[47] Miret, S. & Krishnan, N. M. A. Are LLMs ready for real-world materials discovery? arXiv:2402.05200 (2024).
[48] White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digital Discovery 2, 368–376 (2023).
[49] Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022 (2014).
[50] Lookman, T. et al. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 5, 21 (2019).
[51] Smith, J. S. et al. Less is more: Sampling chemical space with active learning. J. Chem. Phys. 148, 241733 (2018).
[52] Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet – A deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
[53] Gasteiger, J., Giri, S., Margraf, J. T. & Günnemann, S.
Fast and uncertainty-aware directional message passing for non-equilibrium molecules. In Machine Learning for Molecules Workshop, NeurIPS (2020).

Extended Data Table 1. Full adversary strategy comparison (10 configurations, budget = 200). CX rate with Wilson 95% CI, failure mode entropy, and unique materials discovered. LLM adversaries achieve higher CX rates but lower unique discovery counts; algorithmic strategies provide broader coverage. Grid and LHS strategies are omitted as they perform comparably to Random and Sobol at this budget.

Rank  Strategy                 CX rate  95% CI        Entropy (bits)  Unique
1     Gemini 2.5 Pro (LLM)     100.0%   [98.1, 100]   0.000           5
2     GPT-5 (LLM)              100.0%   [98.1, 100]   0.875           6
3     Gemini 3 Flash (LLM)     98.5%    [95.7, 99.5]  0.980           17
4     GPT-5.2 (LLM)            98.5%    [95.7, 99.5]  0.998           29
5     Claude Opus 4.6 (LLM)    88.5%    [83.3, 92.2]  0.986           20
6     Claude Opus 4.5 (LLM)    88.5%    [83.3, 92.2]  1.211           22
7     Sobol                    87.0%    [81.6, 91.0]  1.152           138
8     Random                   86.0%    [80.5, 90.1]  1.027           123
9     Heuristic                86.0%    [80.5, 90.1]  0.948           61
10    Claude Sonnet 4.5 (LLM)  86.0%    [80.5, 90.1]  0.729           30

Extended Data Table 2. Independent DFT validation of 20 adversarially discovered materials (QE pw.x v6.7, PAW-PBE, SSSP v1.3.0, 60/480 Ry, 8 cores). All 20 converge (2 require expanded cells on retry). DFT E/atom is total energy (including core electrons); CHGNet F_max is evaluated on the same golden-ratio structure for direct force comparison. Cu7Zn1 (brass) and Cu7Mn1 (manganese bronze) are industrial alloys where CHGNet underestimates forces by 15–33×.
#   Formula      Atoms  Pred. err. (eV atom⁻¹)  DFT F_max (eV Å⁻¹)  CHGNet F_max (eV Å⁻¹)  Ratio DFT/CHGNet  SCF  Time (s)
1   Cu7Zn1       8      10.49                   557                 36                     15.4×             15   180
2   Cd6Zr2       8      10.09                   460                 28                     16.5×             14   153
3   Ag6Cl2S2     10     9.98                    1445                39                     36.8×             13   296
4   RuZn3        4      9.95                    89                  54                     1.6×              11   22
5   LiO4Sc3      8      9.74                    197                 22                     8.8×              18   71
6   Cu6Mg2       8      9.68                    321                 24                     13.2×             14   60
7   Cu6Ga3       9      9.54                    2154                194                    11.1×             20   178
8   F6SnSr       8      9.41                    298                 61                     4.9×              12   43
9   Y2Zn6        8      9.39                    342                 15                     22.5×             13   172
10  Ba2Pd4Sb4    10     9.38                    Retry†              –                      –                 37   6360
11  BaF6In       8      9.34                    Retry†              –                      –                 150  1345
12  In6Sb2       8      9.27                    287                 9                      31.3×             12   214
13  Cu2O2Te2Y2   8      9.25                    322                 40                     8.0×              14   88
14  Cu7Mn1       8      9.24                    530                 16                     33.1×             17   95
15  Cd4Ni2Sr2    8      9.13                    351                 29                     12.3×             15   194
16  Ca2In2Zn4    8      9.10                    334                 36                     9.4×              13   256
17  Zn3Zr        4      9.10                    61                  43                     1.4×              11   24
18  As2In4Pd6    12     9.05                    2222                356                    3.4×              17   537
19  Ag2Ba2O4     8      9.04                    230                 66                     3.5×              36   191
20  F6K2Mg       9      8.99                    97                  80                     1.2×              14   52

† Converged on retry with expanded cells (5× and 4× volume) and tighter mixing (β = 0.1, 80/640 Ry). Force comparison omitted as structures differ from original golden-ratio placement.

[Figure: Venn diagram of dual-adversary discovery coverage. GPT-5: focused depth (6 materials); Claude Opus 4.5: broad coverage (18 materials); union 21 unique materials, overlap 14.3%.]

Extended Data Fig. 1. Dual-adversary Venn diagram. Two adversary strategies discover complementary materials with 14.3% overlap (3 consensus materials). Complementary search strategies exploit different regions of compositional space.
[Figure panels: a, transfer matrix (n = 200): both fail 180 (90.0%), CHGNet-only 4 (2.0%), TensorNet-only 6 (3.0%), neither 10 (5.0%); Jaccard = 0.947, Cohen's κ = 0.640. b, error scatter (r = 0.693, p = 6.1e-30). c, seeded discovery 96.2% vs random baseline 93.7%. d, model-specific blind spots (4 CHGNet-only, 6 TensorNet-only).]

Extended Data Fig. 2. Cross-MLIP adversarial transfer. a, CHGNet↔TensorNet transfer rate heatmap: 97.8% of CHGNet failures also fail on TensorNet. b, Error correlation scatter (r = 0.69). c, Seeded vs random discovery rates. d, Model-specific blind spots.
Figure Extended Data Fig. 3. JARVIS-DFT cross-oracle validation (n = 814 matched materials; CHGNet–JARVIS agreement = 15.2%). a, CHGNet prediction error vs JARVIS-DFT (OptB88vdW) formation energy. b, Cross-oracle category counts: 682 CHGNet-only failures, 76 both fail, 48 both OK, 8 JARVIS-only unstable. c, Failure overlap Venn diagram. d, Compositional features of disagreement regions: max Z (p = 0.002) and mean atomic mass (p = 7 × 10⁻⁵) are significantly higher in CHGNet blind spots.

Figure Extended Data Fig. 4. Iterative envelope refinement (4 rounds; 135, 102, 126, and 139 candidates per round). a, Envelope bounds convergence for six features (n_elem, max_Z, EN_mean, EN_range, vol/atom, mass_mean), normalised to round 1. b, Counterexample-rate and pass-rate evolution. c, Material discovery with diminishing returns: 59, 98, 24, and 5 previously unseen materials per round (cumulative 59, 157, 181, 186). d, Counterexample prediction-error distribution per round (6.40, 6.52, 6.34, 5.95 eV/atom) against the strict and relaxed thresholds.
The safe envelope compresses by 75–91% per feature dimension.

[Figure panels: top-12 MLIP blind spots in error–force space, split by discovering adversary (GPT-5 only / both / Opus 4.5 only), with a Pu compound, ThI₃, and the topological insulator TlBiSe₂ annotated; compounds ranked by anomaly score, flagging U-, Tl-, Th-, Np-, and Pu-containing materials such as Cs₄Se₆U₂, Bi₄Se₈Tl₄, and I₆Th₂.]

Figure Extended Data Fig. 5. Element frequency analysis. a, Per-element mean prediction error vs WBM frequency (84 elements, Spearman ρ = 0.443). b, Stability disagreement by periodic-table block: f-block highest (Kruskal–Wallis p = 0.042). c, Adversarial vs non-adversarial elements (p = 0.020); the rarest quintile contains 41% adversarial elements.

[Figure panels: unique-material coverage for seven LLM adversaries (GPT-5, GPT-5.2, Gemini 2.5 Pro, Gemini 3 Flash, Claude Opus 4.6, Claude Opus 4.5, Claude Sonnet 4.5) against heuristic, random, and Sobol baselines; LLM-unique discoveries ranked by anomaly score; counterexample rate vs novelty trade-off per adversary; CHGNet out-of-sample error distribution (200 blind spots); top out-of-sample blind spots, dominated by alkali actinide fluorides such as F₆Na₂Th, F₆K₂Pa, and F₆Na₂Pu.]

Figure Extended Data Fig. 6. Adversarial strategy analysis.
Top: ablation comparing 13 adversary configurations on CX rate vs unique materials discovered (base CX rate = 93.2%; the max-everything configuration reaches 100% but discovers only 1 unique material). Bottom: emergent search behaviour. a, Feature-space autocorrelation. b, Phase-transition detection. c, Post-violation exploration (p = 0.0001).

Figure Extended Data Fig. 7. Shuffled-feature ablation. a, CX rate by condition (Wilson 95% CIs): control (real feature names) 90.0%, shuffled (permuted names) 82.0%, anonymous (feature_1..8) 88.0%. b, Material diversity: 26, 28, and 28 unique materials hit, respectively. c, Feature-space exploration (PCA; PC1 81.4% of variance, PC2 12.8%). d, Per-feature targeting (0 = low bound, 1 = high bound) and failure diversity in bits. Incorrect domain knowledge (shuffled) is worse than no knowledge (anonymous), confirming the adversary–oracle separation as the core mechanism.
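The Wilson 95% intervals quoted for the CX rates in Extended Data Fig. 7a can be reproduced with the standard score-interval formula. A minimal sketch; the per-condition sample size is not stated in this excerpt, so the n = 50 below is a hypothetical value for illustration only:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical: a 90% CX rate observed over n = 50 adversarial queries
lo, hi = wilson_interval(45, 50)
print(round(lo, 3), round(hi, 3))
```

Unlike the naive normal interval, the Wilson interval stays inside [0, 1] and remains sensible for rates near 100%, which is why it suits counterexample rates like those in the ablation.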
Figure Extended Data Fig. 8. Independent DFT validation of adversarially discovered MLIP blind spots (18 materials with identical golden-ratio structures; the 2 retry materials are excluded from the force comparison). a, Maximum forces from Quantum ESPRESSO (DFT) vs CHGNet on the same structures (log scale; 50 eV Å⁻¹ threshold marked). DFT forces are systematically larger, confirming that CHGNet underestimates instability. b, Force ratio (DFT/CHGNet); median 11.6×. Cu₇Zn (brass, 15×) and Cu₇Mn (manganese bronze, 33×) are highlighted. c, SCF iterations to convergence (median 14; all 18 converged). d, Wall-clock time per material (median 162 s; total 47 min on 8 CPU cores).

Figure Extended Data Fig. 9. Representative Lean 4 proof output for the WBM–CHGNet safety certificate (auto-generated from the refined envelope). Top: the core safety module defines compositional bounds (EnvBounds), structural requirements (DesignReq), and outcome thresholds (HoldsThresholds).
The oracle_guarantee axiom, the only unverified assumption, states that the MLIP satisfies the thresholds within the envelope; a domain expert can inspect whether this assumption is warranted. safe_under_envelope composes these into the safety claim; envelope_non_vacuous provides a constructive witness proving the envelope is non-empty. Bottom: the error-propagation module composes DFT accuracy (ε_DFT = 0.1 eV atom⁻¹) with the MLIP threshold via the triangle inequality, making the full approximation chain (reality → DFT → MLIP) explicit and auditable. All proofs are verified by lake build (Lean 4.27.0).

Generated.lean — Core safety certificate (WBM–CHGNet)

```lean
structure Env where
  n_elements : Real
  max_z : Real
  mean_electronegativity : Real
  en_range : Real
  volume_per_atom : Real
  mean_atomic_mass : Real

structure Design where
  n_sites : Real
  bandgap_pbe : Real

structure Outcomes where
  prediction_error : Real
  max_force : Real

-- Bounds from the adversarially refined envelope
def EnvBounds (e : Env) : Prop :=
  (1 ≤ e.n_elements ∧ e.n_elements ≤ 4.41)
  ∧ (3 ≤ e.max_z ∧ e.max_z ≤ 62.21)
  ∧ (0.7 ≤ e.mean_electronegativity ∧ e.mean_electronegativity ≤ 3.17)
  ∧ (5 ≤ e.volume_per_atom ∧ e.volume_per_atom ≤ 48.90)
  ∧ (5 ≤ e.mean_atomic_mass ∧ e.mean_atomic_mass ≤ 159.65)

def HoldsThresholds (o : Outcomes) : Prop :=
  (o.prediction_error ≤ 2) ∧ (o.max_force ≤ 50)

-- Key assumption (the only axiom a reviewer must evaluate):
axiom oracle_guarantee :
  ∀ e d, EnvBounds e → DesignReq d → HoldsThresholds (oracle e d)

-- Safety: if env and design are within the envelope, thresholds hold.
theorem safe_under_envelope :
    ∀ e d, EnvBounds e → DesignReq d → SafeClaim e d := by
  intro e d hE hD
  exact And.intro hE (And.intro hD (oracle_guarantee e d hE hD))

-- Non-vacuity: a concrete witness proves the envelope is non-empty.
theorem envelope_non_vacuous :
    ∃ e d, EnvBounds e ∧ DesignReq d := by
  refine ⟨⟨2.71, 32.60, 1.93, 1.12, 26.95, 82.32⟩, ⟨17.40, 2.92⟩, ...⟩
  all_goals norm_num
```

ErrorPropagation.lean — Composing DFT + MLIP uncertainty

```lean
-- Triangle inequality: |E_true - E_MLIP| ≤ |E_true - E_DFT| + |E_DFT - E_MLIP|
theorem total_uncertainty_bound
    (c : EnergyChain) (eps_mlip : Real)
    (h_dft : DFTAccurate c pbe_accuracy)   -- pbe_accuracy = 0.1 eV/atom
    (h_mlip : MLIPAccurate c eps_mlip) :
    TotalAccurate c (pbe_accuracy + eps_mlip) := by
  exact interval_composition c pbe_accuracy eps_mlip h_dft h_mlip

-- Full safety with explicit uncertainty budget:
--   within the envelope, the MLIP agrees with physical reality
--   within 0.1 + claim_threshold = 0.6 eV/atom.
--   Every approximation layer is transparent and auditable.
```
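The interval_composition step above is, at its core, the triangle inequality for absolute values. A minimal self-contained Mathlib sketch of that step (the theorem name and variable names here are illustrative, not from the paper's artifact):

```lean
import Mathlib

/-- If DFT tracks reality within `εdft`, and the MLIP tracks DFT within `εmlip`,
    then the MLIP tracks reality within `εdft + εmlip`. -/
theorem chain_error_bound (e_true e_dft e_mlip εdft εmlip : ℝ)
    (h_dft : |e_true - e_dft| ≤ εdft)
    (h_mlip : |e_dft - e_mlip| ≤ εmlip) :
    |e_true - e_mlip| ≤ εdft + εmlip :=
  -- |a - c| ≤ |a - b| + |b - c|, then bound each term on the right.
  le_trans (abs_sub_le e_true e_dft e_mlip) (add_le_add h_dft h_mlip)
```

Instantiated with ε_DFT = 0.1 eV atom⁻¹ and a claim threshold of 0.5 eV atom⁻¹, this reproduces the 0.6 eV atom⁻¹ total budget stated in the ErrorPropagation comment.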