
Paper deep dive

Probabilistic Verification of Voice Anti-Spoofing Models

Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh, Mikhail Pautov, Oleg Kiriukhin, Oleg Y. Rogov

Year: 2026 · Venue: arXiv preprint · Area: cs.SD · Type: Preprint

Abstract

Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.





Full Text



Probabilistic Verification of Voice Anti-Spoofing Models

Evgeny Kushnir (1,2,3,*), Alexandr Kozodaev (4,*), Dmitrii Korzh (1,5,*,**), Mikhail Pautov (1,6,*), Oleg Kiriukhin (7), Oleg Y. Rogov (1,3,5)

1 AXXX, 2 HSE, 3 Applied AI Institute, 4 Central University, 5 MTUCI, 6 Trusted AI Research Center, RAS, 7 City University of Hong Kong

d.s.korzh@mtuci.ru

Index Terms: voice anti-spoofing, audio deepfake detection, robustness, verification, text-to-speech, voice cloning

1. Introduction

Over the past decade, rapid advancements in text-to-speech (TTS) and voice cloning (VC) models [1, 2] have resulted in both improved generated speech quality and notably easier access to it for a broad audience. A variety of online and offline solutions, while generally beneficial, pose serious security risks [3]: in particular, realistic synthetic speech can be exploited to impersonate target speakers and gain unauthorized access to sensitive resources.
To counter this threat, research on voice anti-spoofing (VAS) and speech deepfake detection has substantially accelerated in recent years. Primary spoof detection methods are based on feature engineering, architectural design, and training optimization techniques, and demonstrate quantitative efficiency in several data-dependent scenarios [4]. Despite notable empirical progress, modern voice anti-spoofing models remain non-robust in practice: they are usually deployed in conditions that notably differ from those they are expected to be robust to. Results of relevant competitions, such as ASVspoof [5] and In-the-Wild [6], confirm this statement, demonstrating that state-of-the-art models suffer from significant performance degradation when exposed to previously unseen spoof generation methods or new audio conditions and domains. Consequently, strong empirical accuracy alone provides a limited sense of reliability in real-world applications.

A fundamental challenge underlying this limitation is the absence of formal robustness guarantees [7]. Existing VAS measures are predominantly evaluated empirically and offer no principled bounds on their behavior under perturbations or generative processes. While the broader machine learning literature has developed a rich body of robustness certification methods, these techniques are typically tailored to a narrow class of perturbations and are not directly applicable to the complex, generative transformations induced by modern speech generators. As a result, the certification of VAS models against TTS, VC, or other neural speech synthesis systems remains largely unexplored. In this paper, we bridge this gap by proposing PV-VASM, a probabilistic framework for robustness verification of voice anti-spoofing models in a black-box, model-agnostic manner.

(* These authors contributed equally. ** indicates the corresponding author.)
PV-VASM yields an upper bound on the probability of incorrect classification of conventionally transformed or artificially synthesized input audio. The proposed framework enables verification against unseen transformations and speech generators, making it relevant for real-world pre-deployment robustness evaluation. From a technical point of view, our approach builds on probabilistic concentration inequalities and provides a principled mechanism for estimating tight upper bounds on misclassification probabilities with high confidence. We develop practical procedures for estimating the statistics of the random variables required to compute the bound and for selecting certification parameters that balance the tightness of the results against computation cost. Extensive experiments across a wide range of transformations, TTS, and VC models demonstrate that PV-VASM yields meaningful robustness certificates and complements standard empirical evaluation.

The contributions of this work can be summarized as follows:

• We introduce and motivate a probabilistic framework to formally verify the robustness of voice anti-spoofing models. We propose PV-VASM, a model-agnostic method capable of verifying robustness not only against classic audio transformations but also against arbitrary neural speech generators, including unseen TTS and VC systems.

• We derive a theoretical upper bound on the error probability of the method and present a practical pipeline for estimating the statistics of the underlying random variables and the certification parameters.

• We empirically validate the proposed framework in diverse experimental settings, demonstrating its practical applicability and relevance to real-world deployments of voice anti-spoofing models.

Overall, PV-VASM provides a systematic approach to verifying the robustness of voice anti-spoofing models, particularly in the context of rapidly advancing speech synthesis technologies.
arXiv:2603.10713v2 [cs.SD] 12 Mar 2026

2. Related work

Neural networks are well known to exhibit performance instability under domain shift [8], often resulting in limited generalization beyond the training distribution. Similar behavior has been observed for voice anti-spoofing (VAS) models in major evaluation campaigns, including the ASVspoof challenges [9, 10, 11, 5] and the ADD series [12, 13]. In these benchmarks, systems typically demonstrate a notable degradation in test performance compared to validation performance, as the evaluation data introduces new perturbation types and speech generation methods that are not present in the training set.

Common VAS architectures include graph-based neural approaches, such as AASIST [14], which construct both homogeneous and heterogeneous graphs and apply graph attention mechanisms to structures derived from temporal and spectral representations (e.g., Sinc convolutions [15]). Substantial performance gains have been achieved by leveraging self-supervised learning (SSL) audio encoders, including Wav2Vec 2.0 [16] and WavLM [17], as front-end feature extractors. These features can subsequently be processed by a variety of back-end architectures, such as AASIST (e.g., Wav2Vec2-AASIST [18]) or simpler pooling-based models followed by linear classifiers [19]. Additional performance improvements can be achieved through dataset expansion and diversification. Further gains have been reported via model ensembling, one-class learning paradigms, the use of specialized architectural components and loss functions, as well as the integration of audio large language models [20, 21, 22, 23]. Despite these advances, VAS systems frequently exhibit limited robustness to previously unseen spoofing generators, resulting in substantial performance degradation in open-set evaluation scenarios.
Moreover, such models remain vulnerable to adversarial manipulations and synthetic input perturbations, including additive noise and signal-level transformations [24, 25].

In recent years, the vulnerabilities of deep learning models have been extensively studied, and the vulnerability of models to different types of input perturbations has been demonstrated [26, 27, 28]. In particular, adversarial attacks [29, 30] have received substantial attention due to their ability to induce severe performance degradation with imperceptible input modifications. In response, a variety of defense mechanisms have been proposed to enhance model robustness; they can be broadly classified into empirical and certified approaches. Empirical defenses [31, 32, 33], such as adversarial training, are comparatively easy to deploy and have demonstrated strong empirical effectiveness. Nevertheless, these methods do not provide formal guarantees against previously unseen or adaptive perturbations, often giving rise to a recurring arms race between attack and defense strategies.

To address this limitation, the certification paradigm has been introduced [34, 7], offering provable deterministic or probabilistic guarantees on a model's behavior under all perturbations within a predefined threat model. These perturbations are typically characterized by the type of transformation and its parameters (for example, norm-bounded additive perturbations [30]). A comprehensive survey of certified robustness methods is provided in [7]. Despite this progress, limited attention has been devoted to the certification of voice anti-spoofing (VAS) systems in the literature. In particular, existing certification techniques are generally incapable of verifying robustness to synthetic speech generation, a crucial property for the safe deployment of VAS models in real-world scenarios.
Among existing certification techniques, probabilistic methods [35, 36, 37, 38] estimate the probability that a given input is adversarial and appear particularly suitable for application in the voice anti-spoofing setting. For instance, C-Cert [38] provides an upper bound on the misclassification probability of a transformed input, where the transformation parameters are sampled from a prescribed distribution. This bound is derived using an empirical form of the Chernoff-Cramer concentration inequality [39]. However, such approaches do not offer robustness guarantees for non-analytical transformations, including data generated by artificial or neural speech generators. Furthermore, the theoretical formulation of these methods assumes knowledge of the exact values of statistics of the distribution of input perturbations, such as the coefficient of variation. The assumption that these values are invariant across different possible transformations is generally violated and may lead to overly conservative robustness estimates.

3. Methodology

In this section, we provide the formal description of the proposed method. We recall that PV-VASM is designed to certify the probability that the voice anti-spoofing model misclassifies the transformed input audio. The following section is devoted to the verification of robustness when the input audio is subjected to a parametric label-preserving transformation; in the subsequent sections, we discuss the verification of the model's robustness against speech generation methods. Here and below, we treat the VAS problem as a binary classification task.

3.1. Problem setup

Let X ⊂ R^d be the space of input audios, x ∈ X be the fixed input audio, and f: R^d → R^2 be the source voice anti-spoofing model that assigns a vector of probabilities to x in the form

$$f(x) = (p_1, p_2)^\top, \quad (1)$$

where p_1 + p_2 = 1 and p_1, p_2 ≥ 0.
Here, p_1 represents the probability of x being classified as a spoofed audio, and p_2 represents the probability of x being classified as a bona fide audio. To ease the notation, we introduce the classification rule in the form

$$h(x) = \arg\max_i f_i(x). \quad (2)$$

In this work, we focus on the problem of certifying the robustness of f under parametric perturbations of its input. Formally, the parametric transformation φ: X × Θ → X is a mapping of the input audio space to itself, where Θ is a fixed space of parameters of the transformation φ. Under transformation φ, the input object x becomes a random variable, since it is a function of θ:

$$x' = \varphi(x, \theta), \quad \theta \overset{\mu}{\sim} \Theta, \quad (3)$$

where μ is the measure on Θ. The prediction of the source model on the perturbed audio becomes a random variable of the form

$$f(x') = (p'_1, p'_2)^\top. \quad (4)$$

The robustness of f under a label-preserving transformation φ at point x is reflected by the probability of assigning the same class index to x and x':

$$\mathbb{P}_{\theta \sim \Theta}\left[h(x) = h(x')\right], \quad \text{where } x' = \varphi(x, \theta). \quad (5)$$

Here we omit μ, assuming it is fixed and known. In the case of a binary classification task,

$$h(x) = h(x') \iff (p_1 - 1/2)(p'_1 - 1/2) > 0 \iff (p_2 - 1/2)(p'_2 - 1/2) > 0. \quad (6)$$

Without loss of generality, we assume that the initial audio x is correctly classified by f as bona fide, so p_2 > 1/2. Thus,

$$h(x) = h(x') \iff p'_2 > 1/2 \quad (7)$$

and

$$\mathbb{P}_{\theta \sim \Theta}\left[h(x) \neq h(x')\right] = \mathbb{P}_{\theta \sim \Theta}\left[p'_2 < 1/2\right]. \quad (8)$$

For nontrivial μ and f, the probability from Eq. (8) is intractable. The goal of the proposed method is to provide a tight upper bound for this probability and, consequently, a tight upper bound for the probability of f misclassifying x'.

3.2. Description of PV-VASM

To ease the notation, we introduce the random variable Z ≡ p'_2. Then, according to the Chernoff inequality,

$$\mathbb{P}_{\theta \sim \Theta}\left[Z < 1/2\right] \leq \inf_{t < 0} \mathbb{E}\left(e^{tZ}\right) e^{-t/2}. \quad (9)$$

Since the expectation from Eq.
(9) is intractable, in our method it is upper-bounded by sampling the random variable Z and computing statistics of the resulting sample [38]. Specifically, for a given t < 0, let

$$Y_j = \frac{1}{n} \sum_{i=1}^{n} \exp\left(t Z_i^j\right) \exp(-t/2), \quad j \in [1, \dots, k] \quad (10)$$

be the set of k independent and identically distributed sample means, each computed over n realizations of Z^j (here, the superscript j denotes the batch of realizations of the random variable used to compute Y_j). Then, for all δ ∈ (0, 1), the statistic

$$\mathcal{A}(x) = \max\{Y_1, \dots, Y_k\}\, \delta^{-1} \quad (11)$$

is an upper bound for the expectation E(e^{tZ}) e^{-t/2} with high probability. Specifically, the error probability of the method is upper-bounded by

$$\mathbb{P}_{\theta \sim \Theta}\left[\mathcal{A}(x) < \mathbb{E}\left(e^{tZ}\right) e^{-t/2}\right] \leq \left(1 + n(1 - \delta)^2 c^{-2}\right)^{-k} \equiv p(n, k, c), \quad (12)$$

where

$$c = \frac{\sqrt{\mathbb{V}\left(e^{tZ}\right)}}{\mathbb{E}\left(e^{tZ}\right)} \quad (13)$$

is the coefficient of variation of the random variable e^{tZ} (see [38]). Informally, given m = n × k realizations of a random variable Z, the statistic A(x) provides an upper bound for the probability of misclassifying x' that holds with high probability. In the next section, we describe the procedure for estimating the coefficient of variation from Eq. (13), which is needed to compute the error probability of the method.

3.2.1. Estimation of error probability

PV-VASM yields an upper bound for the probability from Eq. (8) in the form of Eq. (11). Here and below, we say that PV-VASM makes an error if its output underestimates the unknown error probability, namely, if

$$\mathbb{P}_{\theta \sim \Theta}\left[Z < 1/2\right] > \mathcal{A}(x). \quad (14)$$

To estimate the probability of error from Eq. (14), one first has to estimate the right-hand side of Eq. (12). Since both the mean and the variance of the random variable e^{tZ} are intractable, one can estimate the coefficient of variation using m = n × k realizations of the random variable Z. In our approach, we use a one-sided confidence interval estimate of the coefficient of variation in the form of the modified McKay's approximation [40]:

$$\mathbb{P}_{\theta \sim \Theta}\left[c > \left(\frac{\chi^2_{\alpha/4}\,(1 + \hat{c}^2)}{m\,\hat{c}^2}\right)^{-1/2}\right] < \alpha/2. \quad (15)$$
Here, α ∈ (0, 1) is the desired confidence level, χ²_{α/4} is the lower α/4-percentile of the chi-square distribution with m − 1 degrees of freedom, and ĉ is the sample coefficient of variation.

Note that by combining Eq. (12) and Eq. (15), one can estimate the probability of an error of PV-VASM. Specifically, by introducing an auxiliary random variable

$$A = \mathbb{1}\left[\mathbb{P}_{\theta \sim \Theta}\left[Z < 1/2\right] > \mathcal{A}(x)\right], \quad (16)$$

and setting c̃ = (χ²_{α/4} (1 + ĉ²) m⁻¹ ĉ⁻²)^{−1/2} from Eq. (15), one can upper bound the probability from Eq. (8) by

$$\mathbb{P}[A = 1] < 1 \times \mathbb{P}[c > \tilde{c}] + p(n, k, c = \tilde{c})\, \mathbb{P}[c \leq \tilde{c}] < \alpha/2 + p(n, k, c = \tilde{c}). \quad (17)$$

When the number of samples is large enough that p(n, k, c = c̃) < α/2, the misclassification probability from Eq. (8) is upper bounded by α, since P[A = 1] < α. It is worth noting that for sufficiently large values of c, McKay's approximation in the form of Eq. (15) should be replaced by, for example, bootstrap-based interval estimation. Recall that the definition of P[A = 1] from Eq. (16) is given for the verification of an initially bona fide audio. To verify robustness to transformations of an initially spoofed audio, we flip the sign of the parameter t from Eq. (9): t ↦ −t.

The verification procedure against parametric transformations is described in Algorithms 1-2.

3.3. Adaptation to generative models

In the preceding sections, we considered robustness verification against parametric transformations that preserve the semantic class of the original audio signal. In this subsection, we extend the proposed framework to enable robustness verification of voice anti-spoofing models against generative models, specifically text-to-speech (TTS) and voice-cloning (VC) systems.

3.3.1. TTS

In the TTS setting, the objective is no longer to verify robustness with respect to transformations of a fixed input audio sample x, but rather to reason about an entire family (distribution) of audio signals generated by a TTS model.
Algorithm 1: PV-VASM, the case of input transformations
Require: classifier f, verification dataset D, transform φ and parameter space Θ, hyperparameters n, k, δ, α, ε
Ensure: probabilistically certified accuracy PCA(ε, α, D)
1: S ← 0
2: for (x, y) ∈ D do
3:   Z ∈ R^{k×n} ← AugmentPredict(f, φ, Θ, x, n, k)
4:   q ← 1[h(x) = y]
5:   A(x) ← min_t max_{j∈[k]} Y_j(Z[j])  ▷ according to Eq. (10)
6:   t* ← arg min_t max_{j∈[k]} Y_j(Z[j])  ▷ save the best value of t
7:   A(x) ← A(x)/δ
8:   c̃ ← EstimateC(Z, t*, α/2)  ▷ one-sided interval estimate according to Eq. (15)
9:   p ← (1 + n(1 − δ)²/c̃²)^{−k}
10:  S ← S + 1[q ∧ A(x) < ε ∧ p < α/2]
11: end for
12: PCA(ε, α, D) ← S/|D|
13: return PCA(ε, α, D)

Algorithm 2: AugmentPredict
Require: classifier f, transform φ and parameter space Θ, input object x, number of samples n, number of sample means k
Ensure: the set Z ∈ R^{k×n} of realizations of the random variable Z from Eq. (9)
1: for j ∈ [k] do
2:   for i ∈ [n] do
3:     θ ∼ Θ, x' ← φ(x, θ)
4:     Z[j][i] ← f(x')_2  ▷ compute the value of p'_2 according to Eq. (4)
5:   end for
6: end for
7: return Z

The generated audio depends primarily on the input text and, potentially, on additional generation parameters such as speech rate or the selected artificial voice:

$$\varphi(x, \theta) \mapsto g(t, \theta), \quad (18)$$

where g denotes the considered TTS model, t ∼ T, with T ⊂ R^{l×s} denoting a text representation of vocabulary size s and length l, and θ ∼ Θ represents the collection of additional independent generation variables. Such variables may be multi-dimensional and either be fixed or drawn from the corresponding distributions; they can include, for example, the speech rate, voice identity, or language configuration of the TTS system, in the case of multi-speaker or multilingual TTS. In practice, the text inputs are often drawn from natural language corpora or task-specific datasets.
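Algorithms 1-2 can be sketched end-to-end in a few lines. Everything below is an illustrative assumption, not the authors' implementation: the classifier f is a toy scoring stub, the transform φ is a plain additive offset, and a Wilson-Hilferty approximation stands in for the exact chi-square quantile χ²_{α/4} used by EstimateC.

```python
import numpy as np
from statistics import NormalDist

def chi2_lower_quantile(p, df):
    # Wilson-Hilferty approximation to the chi-square quantile; a stdlib
    # stand-in for an exact chi2.ppf (an assumption of this sketch).
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * (2.0 / (9.0 * df)) ** 0.5) ** 3

def augment_predict(f, phi, sample_theta, x, n, k, rng):
    # Algorithm 2 (AugmentPredict): k x n realizations of Z = f(phi(x, theta))_2
    Z = np.empty((k, n))
    for j in range(k):
        for i in range(n):
            Z[j, i] = f(phi(x, sample_theta(rng)))[1]  # p'_2, Eq. (4)
    return Z

def verify_sample(Z, t_grid, delta=0.9, alpha=1e-6):
    # Per-sample core of Algorithm 1: A(x) via Eqs. (10)-(11), the one-sided
    # McKay bound c~ via Eq. (15), and the error probability p via Eq. (12).
    k, n = Z.shape
    Y = np.exp(t_grid[:, None, None] * Z).mean(axis=2) * np.exp(-t_grid / 2.0)[:, None]
    best = Y.max(axis=1).argmin()          # index of t* minimizing max_j Y_j
    A_x = Y[best].max() / delta            # Eq. (11)
    e = np.exp(t_grid[best] * Z).ravel()   # realizations of e^{tZ}
    m, c_hat = e.size, e.std(ddof=1) / e.mean()
    q = chi2_lower_quantile(alpha / 4.0, m - 1)
    c_tilde = (q * (1.0 + c_hat ** 2) / (m * c_hat ** 2)) ** -0.5
    p = (1.0 + n * (1.0 - delta) ** 2 / c_tilde ** 2) ** -float(k)
    return A_x, p

# Toy instantiation: an additive-offset "transform" and a soft scoring stub for f.
rng = np.random.default_rng(0)
score = lambda a: float(np.clip(a.mean() + 0.9, 0.0, 1.0))
f = lambda a: np.array([1.0 - score(a), score(a)])
phi = lambda a, th: a + th
sample_theta = lambda r: r.uniform(-0.05, 0.05)
Z = augment_predict(f, phi, sample_theta, np.zeros(16000), n=500, k=20, rng=rng)
A_x, p = verify_sample(Z, t_grid=np.linspace(-50.0, -1e-4, 25))
```

Because the toy Z stays well above 1/2, the certified bound A(x) comes out extremely small and p is far below α/2, mirroring the "robust" rows of Table 1.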
Unlike sample-specific verification against transformations of a given input x, the key idea in this setting is to certify robustness with respect to the distribution of audio signals induced by the generative model g. That is, we aim to quantify how frequently the classifier f assigns a generated sample x' = g(θ) to the incorrect class of bona fide speech:

$$\mathbb{P}\left[p'_2 > 1/2\right]. \quad (19)$$

3.3.2. Voice cloning

PV-VASM can be used to provide verification results for voice cloning (VC) systems. Voice cloning aims to synthesize speech for an arbitrary text t while preserving the voice characteristics of a reference speaker. In a nutshell, a voice cloning model g_vc can be expressed as

$$g_{vc}: \mathcal{T} \times \mathcal{X} \times \Theta \to \mathcal{X}, \quad x' = g_{vc}(t, x_{ref}, \theta), \quad (20)$$

where t ∈ T denotes the input text, x_ref ∈ X is a reference audio sample of the target speaker s_tgt, and θ ∈ Θ represents additional generation parameters. This formulation naturally supports both sample-specific and distribution-level robustness verification. In the sample-specific setting, for a fixed input x (which need not necessarily be bona fide speech), one can sample or segment the reference audio x_ref, as well as vary the input texts and additional generation parameters. In contrast, distribution-level verification further considers variability over the entire set of admissible input audio signals x, thereby assessing robustness with respect to the full data-generating distribution. Note that VC is a label-switching transformation for bona fide inputs.

4. Experimental setup

4.1. Source model, datasets, and hyperparameters

We selected Wav2Vec2-AASIST as the architecture for the source model f. This model was trained for two epochs using cross-entropy loss and the AdamW [41] optimizer. For the training data, a combination of open-source datasets was used, including the ASVspoof 19 and 21 (LA and DF) splits, ASVspoof 5, ADD 22-23, DFADD [42], SONAR [43], CFAD [44], MLAAD [45], Speech-to-Latex [46], and Mozilla Common Voice [47].
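The training recipe above (cross-entropy loss, AdamW, two epochs) is standard. Purely as an illustration of the update rule, and not of the Wav2Vec2-AASIST pipeline itself, here is AdamW with decoupled weight decay driving a softmax cross-entropy objective on a toy linear two-class scorer; all data and hyperparameters below are made up for the sketch.

```python
import numpy as np

def softmax_xent_grad(W, X, y):
    # Cross-entropy loss and gradient for a linear 2-class scorer W (d x 2).
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.log(P[np.arange(len(y)), y] + 1e-12).mean()
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0
    return loss, X.T @ G / len(y)

def adamw_step(W, grad, state, lr=1e-2, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    # One AdamW [41] update: Adam moment estimates plus decoupled weight decay.
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad ** 2
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    return W - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * W)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
y = (X[:, 0] > 0).astype(int)                          # toy separable labels
W = np.zeros((8, 2))
state = {"t": 0, "m": np.zeros_like(W), "v": np.zeros_like(W)}
loss0, _ = softmax_xent_grad(W, X, y)
for _ in range(2):                                     # two "epochs" of full-batch steps
    for _ in range(50):
        loss, grad = softmax_xent_grad(W, X, y)
        W = adamw_step(W, grad, state)
```

The decoupled decay term (weight_decay * W applied outside the adaptive rescaling) is what distinguishes AdamW from plain Adam with L2 regularization.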
During training, to improve the model's empirical robustness, each audio sample was subjected to a composition of randomly selected augmentations applied in a random order. Augmentations were chosen from the following set: voice activity detection, random signal cropping, background noise addition, room impulse response (RIR) simulation, Gaussian noise injection, band-, low-, and high-pass filtering (BPF, LPF, HPF), loudness normalization, codec augmentation, time stretching, pitch shifting, bit crushing, and gain adjustment. For the test data, a class-balanced subset of 300 audio samples from the ASVspoof 5 evaluation (test) subset was used. Only initially correctly classified audio samples were used for the verification.

The following default values of hyperparameters were used in the experiments unless stated otherwise: the value of δ from Eq. (11) was set to 0.9, and the range of the parameter t from Eq. (9) was set to [−50, −10⁻⁴]. The number of transformations of input samples n used to compute a single statistic from Eq. (10), the total number of statistics k, and the upper bound α on the error probability of the method from Eq. (17) are varied across experiments and are discussed in the subsequent sections.

4.2. Parametric transformations and speech generation models

We evaluate PV-VASM in two different settings, namely, against conventional input perturbations and against speech generation models.

In the first setting, we considered parametric perturbations that do not change the predicted class of the input object x. For this purpose, several input transformations that are commonly applied for audio augmentation were used. In a nutshell, if the input audio is a deepfake, the method should verify that the classifier's prediction remains the same under an

Table 1: Robustness verification results against parametric input transformations. The desired confidence level is set to α = 10⁻⁶.
Transform | Parameters' Space Θ | n | k | PCA (ε=10⁻⁵) | PCA (ε=10⁻³) | PCA (ε=10⁻²) | PCA (ε=0.05) | A(x) | p(x)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
LPF | ω_max ∈ [2.5, 3] kHz | 500 | 40 | 0.9767 | 0.9833 | 0.9900 | 0.9900 | 0.0049 | 8.3·10⁻¹³
LPF | ω_max ∈ [2.5, 3] kHz | 1000 | 20 | 0.9767 | 0.9833 | 0.9900 | 0.9900 | 0.0048 | 2.0·10⁻¹¹
HPF | ω_min ∈ [0.5, 1] kHz | 500 | 40 | 0.8433 | 0.8667 | 0.8733 | 0.8767 | 0.1070 | 7.5·10⁻⁷
HPF | ω_min ∈ [0.5, 1] kHz | 1000 | 20 | 0.8433 | 0.8667 | 0.8733 | 0.8767 | 0.1065 | 3.6·10⁻⁶
Background Noise | SNR ∈ [15, 30] dB | 500 | 12 | 0.3400 | 0.3467 | 0.3733 | 0.4300 | 0.1038 | 0.1687
Background Noise | SNR ∈ [15, 30] dB | 1000 | 6 | 0.3467 | 0.3533 | 0.3867 | 0.4600 | 0.0943 | 0.1814
Pitch Shift | ST ∈ [−6, 6] | 500 | 40 | 0.5867 | 0.6133 | 0.6133 | 0.6233 | 0.3506 | 1.4·10⁻¹¹
Pitch Shift | ST ∈ [−6, 6] | 1000 | 20 | 0.5867 | 0.6133 | 0.6133 | 0.6233 | 0.3485 | 4.6·10⁻¹¹
BPF | ω ∈ [200, 4000], b ∈ [0.5, 1.99] | 500 | 40 | 0.0033 | 0.0033 | 0.0067 | 0.0100 | 0.3510 | 0.1664
BPF | ω ∈ [200, 4000], b ∈ [0.5, 1.99] | 1000 | 20 | 0.0033 | 0.0033 | 0.0067 | 0.0200 | 0.3337 | 0.2114
BPF | ω ∈ [200, 1500], b ∈ [1.2, 1.5] | 500 | 40 | 0.9433 | 0.9533 | 0.9533 | 0.9533 | 0.0345 | 4.3·10⁻¹³
BPF | ω ∈ [200, 1500], b ∈ [1.2, 1.5] | 1000 | 20 | 0.9433 | 0.9533 | 0.9533 | 0.9533 | 0.0337 | 8.3·10⁻¹³
Time Stretch | κ ∈ [0.75, 1.35] | 500 | 40 | 0.8067 | 0.8133 | 0.8300 | 0.8800 | 0.0404 | 0.0064
Time Stretch | κ ∈ [0.75, 1.35] | 1000 | 20 | 0.8100 | 0.8233 | 0.8400 | 0.8900 | 0.0372 | 0.0077
Gain | γ ∈ [−10, 20] dB | 500 | 40 | 0.8800 | 0.8933 | 0.8933 | 0.8933 | 0.0384 | 0.0072
Gain | γ ∈ [−10, 20] dB | 1000 | 20 | 0.8800 | 0.8933 | 0.8933 | 0.8933 | 0.0371 | 0.0060
Gain | γ ∈ [−10, 10] dB | 500 | 40 | 0.9867 | 0.9867 | 0.9867 | 0.9867 | 0.0076 | 1.2·10⁻¹⁵
Gain | γ ∈ [−10, 10] dB | 1000 | 20 | 0.9867 | 0.9867 | 0.9867 | 0.9867 | 0.0075 | 1.5·10⁻¹³
RIR | r ∈ OpenSLR dataset | 1000 | 20 | 0.0800 | 0.0833 | 0.0866 | 0.1100 | 0.2523 | 0.2358
Composite: Gain, LPF, Gaussian Noise | γ ∈ [−10, 10] dB, ω_max ∈ [2.5, 3] kHz, σ ∈ [0.01, 0.03] | 1000 | 20 | 0.5733 | 0.5900 | 0.6167 | 0.6800 | 0.1140 | 0.0628

input transformation; analogously, if the input is a bona fide audio, the method should verify that the predicted class does not change under an input transformation. Audio transformations were implemented using the audiomentations [48] and torch-audiomentations [49] libraries. If not stated otherwise, the default transformation parameters from the audiomentations library were used.
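For instance, the LPF rows of Table 1 draw a cutoff ω_max uniformly from [2.5, 3] kHz for every realization of θ. A minimal FFT brick-wall filter (a deliberately simplified stand-in for the audiomentations implementation, which is not reproduced here) sketches such a parametric transform φ(x, θ):

```python
import numpy as np

def random_lpf(x, sr, rng, lo_hz=2500.0, hi_hz=3000.0):
    # Label-preserving low-pass filter with a randomly sampled cutoff,
    # theta = omega_max ~ U[2.5, 3] kHz as in the verification setting.
    cutoff = rng.uniform(lo_hz, hi_hz)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sr)
    spec[freqs > cutoff] = 0.0               # zero all bins above the cutoff
    return np.fft.irfft(spec, n=x.size)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)               # 1 s of noise at 16 kHz
x_prime = random_lpf(x, sr=16000, rng=rng)
```

Feeding many such x' through the classifier is exactly the sampling loop of Algorithm 2; a production run would use the library's windowed filters rather than this brick-wall sketch.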
Below we list the set of audio transformations used during the training of the source model and indicate, where applicable, the range of parameters of the corresponding transformation used for augmentation during training:

• Filters. The cutoff frequency ω_max for the LPF is sampled from [3400 Hz, 7500 Hz]; for the HPF, the cutoff frequency ω_min is sampled from [100 Hz, 400 Hz]; for the BPF, the central frequency ω is sampled from [200 Hz, 4000 Hz], and the bandwidth-to-central-frequency ratio b is sampled from [0.5, 1.99].

• Pitch shift. Pitch was randomly adjusted from −4 to 4 semitones (ST).

• Time stretch. A default speedup parameter κ was sampled from [0.8, 1.25].

• Gain. The gain parameter γ was between −10 dB and +20 dB.

• RIR. We use the OpenSLR dataset of RIR recordings [50] and sample uniformly from it.

• Background noise. Similarly to the RIR, we sampled from the noise subset of the MUSAN dataset [51].

In our experiments, the parameters of input transformations were sampled uniformly from the selected parameter range or among the available noise and RIR samples; however, all the parameters may be sampled from any other analytical distribution. We note that the ranges of input transformations used during the verification procedure differ from the ones used during training; they are presented in the subsequent section.

To evaluate the efficiency of the method against TTS and VC methods, we generated data using the following open-source generative models: Vosk¹, Silero², Coqui XTTS-v2³ [1], F5-TTS [52], CosyVoice⁴, and the proprietary models of ElevenLabs⁵ and Finevoice⁶. Texts used for the artificial annotation were primarily selected from the Mozilla Common Voice dataset.

4.3. Metrics
The primary quantitative measure of the efficiency of a probabilistic verification method A in a classification problem is the probabilistically certified accuracy [38], PCA, defined as

$$\mathrm{PCA}(\varepsilon, \alpha, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \mathbb{1}\left[h(x) = y \,\wedge\, \mathcal{A}(x) < \varepsilon \,\wedge\, p < \alpha/2\right], \quad (21)$$

where p is from Algorithm 1. Informally, PCA is the fraction of objects from D which (i) are correctly classified by the source model f; (ii) are assigned a misclassification probability A(x) less than the threshold value ε; (iii) have p(x) upper bounded by α/2, given that c̃ is estimated with 1 − α/2 confidence.

When verification is performed with respect to the TTS-induced distribution, the resulting PCA metric is binary-valued and is defined as

$$\mathrm{PCA}(\varepsilon, \alpha) = \mathbb{1}\left[\mathcal{A} \prec \varepsilon \,\wedge\, p \prec \alpha/2\right], \quad (22)$$

where A ≺ ε means that A is uniformly less than ε.

Footnotes:
1. https://pypi.org/project/vosk-tts
2. https://pypi.org/project/silero-tts
3. https://huggingface.co/coqui/XTTS-v2
4. https://github.com/FunAudioLLM/CosyVoice
5. https://elevenlabs.io
6. https://finevoice.ai

5. Results

5.1. Parametric input transformations

In Table 1, we present the results of robustness verification of the considered Wav2Vec2-AASIST model against parametric input transformations. For each transformation and the corresponding space of parameters, we report the number of input transformations n used to compute a single statistic from Eq. (10) and the total number of statistics k; for each value of the threshold ε, we present the values of the PCA metric; additionally, we report the average value of A(x) from Algorithm 1 and the average value p = p(n, k, c̃(x)) from Eq. (12).

Figure 1: Dependence of PCA on (m, n, k) for background noise perturbations with SNR ∈ [15, 30]. The confidence level is set to α = 10⁻⁶.
Curves sharing the same color correspond to an identical computational budget m, while line styles and marker types indicate variations in n and k, respectively.

Figure 2: Dependence of PCA on α for background noise perturbations with SNR ∈ [15, 30]. The values m = 6000, n = 1000, k = 6 are fixed.

Figure 3: Dependence of PCA on (m, n, k) for the gain adjustment transform with γ ∈ [−10, 20] dB. The confidence level is set to α = 10⁻⁶. Curves sharing the same color correspond to the same augmentation budget m, while line styles and marker types indicate variations in n and k, respectively.

Figure 4: Dependence of PCA on α for the gain adjustment transform with γ ∈ [−10, 20] dB. The values m = 20000, n = 1000, k = 20 are fixed.

From the results, we make several observations. Firstly, the source model shows strong robustness against the LPF, HPF, and time stretch transforms, demonstrating a low average error probability p; secondly, different distributions of the total augmentation budget m = n × k into k splits of n samples each (see, for example, Eq. (10)) do not always affect the verification results, but can be used to control the average error probability; finally, we highlight that the broader the parameter space Θ for a given transform, the worse the expected robustness of the model to that transform (see, for instance, the verification results for the BPF with different Θ).
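The PCA values reported in Table 1 are, per Eq. (21), just a filtered accuracy over per-sample verification outputs. A minimal sketch, using hypothetical per-sample records of the form (correctly classified, A(x), p):

```python
def pca_metric(records, eps, alpha):
    # Probabilistically certified accuracy, Eq. (21): the fraction of samples
    # that are correctly classified, have A(x) < eps, and error prob. p < alpha/2.
    hits = [correct and (A_x < eps) and (p < alpha / 2.0)
            for correct, A_x, p in records]
    return sum(hits) / len(records)

# Hypothetical outputs of Algorithm 1 for three samples (illustrative only).
records = [(True, 0.004, 1e-12), (True, 0.2, 1e-12), (False, 0.001, 1e-12)]
pca = pca_metric(records, eps=0.01, alpha=1e-6)   # only the first record certifies
```

This makes explicit why PCA decreases as ε tightens: samples whose bound A(x) sits between two thresholds drop out of the count, exactly the pattern seen across the ε columns of Table 1.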
In Figures 1, 3, 5, and 6, we report the dependence of PCA on the augmentation budget m and its distribution; in Table 2, we report the averaged probability bounds from Algorithm 1 along with the averaged error probability p(x) for specific transforms. In Figures 2 and 4, we compare PCA for different values of α given fixed (m, n, k).

Figure 5: Dependence of PCA on (m, n, k) for the low pass filter with the cutoff frequency ω_max randomly sampled from the [2500, 3000] Hz range. The confidence level is set to α = 10^-6. Curves sharing the same color correspond to the same augmentation budget m, while line styles and marker types indicate variations in n and k, respectively.

Figure 6: Dependence of PCA on (m, n, k) for the pitch shift transform, ST ∈ [−6, 6] semitones. The confidence level is set to α = 10^-6. Curves sharing the same color correspond to the same augmentation budget m, while line styles and marker types indicate variations in n and k, respectively.

5.2. TTS

In Table 3, the results of robustness verification against different TTS generators are shown. Here, the column "p" presents an estimation of the method's error probability from Eq. (12). We highlight that in this setting, the VAS model's robustness is verified against the distribution of perturbations induced by a TTS model, so the output
of the algorithm and its error probability depend on the distribution. We present verification results for the pre- and post-finetuned models. Here, the pre-finetuned model is the base model trained on the original dataset; the post-finetuned model is the base model additionally trained on a subset of the data generated by the corresponding TTS model. For each TTS model, all the generated samples were split into a train subset (10%) and a verification subset (90%). In Figure 7, we present the results of the verification of robustness against the Vosk TTS model. To do so, we fixed the augmentation budget m = 130000 and the probability threshold ε = 0.1 and show for which values of (n, α) the verification condition from Eq. (22) is met. It is noteworthy that the verification condition is met only for a few pairs of (n, α) due to the trade-off between A and p for different values of n: on the one hand, the larger the n, the smaller the value of A; on the other hand, the larger the n, the larger the value of p. From the results, the following observations are made. First, verifying robustness against TTS is a more complicated task than verifying robustness against random input transformations (observe, for example, the range of A in the TTS experiment, which makes verification impossible for small values of ε). Secondly,

Table 2: Dependence of the average probability of error p on different values of (n, k, m), low pass filter, Θ = [2500, 3000] Hz. The confidence level is set to α = 10^-6.
n      k    m      A(x)        p(x)
500    4    2000   4.7·10^-3   1.3·10^-3
1000   2    2000   4.8·10^-3   2.3·10^-3
2000   1    2000   4.6·10^-3   5.0·10^-3
500    10   5000   4.8·10^-3   2.9·10^-5
1000   5    5000   4.7·10^-3   7.2·10^-5
5000   1    5000   4.7·10^-3   2.1·10^-3
500    20   10000  4.9·10^-3   7.6·10^-8
1000   10   10000  4.8·10^-3   3.8·10^-7
2000   5    10000  4.7·10^-3   4.3·10^-6
5000   2    10000  4.7·10^-3   1.2·10^-4
10000  1    10000  4.7·10^-3   1.0·10^-3
500    40   20000  4.9·10^-3   8.3·10^-13
1000   20   20000  4.8·10^-3   2.0·10^-11
2000   10   20000  4.8·10^-3   1.8·10^-9
5000   4    20000  4.7·10^-3   7.2·10^-7
10000  2    20000  4.7·10^-3   3.1·10^-5
20000  1    20000  4.7·10^-3   5.2·10^-4

Figure 7: Verification condition result from Eq. (22) for the pre-finetuned f against Vosk TTS vs. various α and n, given fixed m = 130000, δ = 0.9, and ε = 0.1. In the reported grid, the condition holds only for n ∈ {1000, 2000} at the largest values of α.

fine-tuning on data obtained from the same TTS expectedly improves the verification results (up to 1.5–3 times in terms of the values of A).

5.3. Voice Cloning

Similarly to the TTS experiment, we measured the model's robustness against VC models. We present the results of verification of robustness against the XTTSv2 and F5 models in Table 4. One can notice that the improvement of PCA for the XTTSv2 model after finetuning is slightly smaller than that for the F5 model; this can be explained by the fact that some audio samples generated by the XTTSv2 model are present in the training dataset of the base model f.

5.4. Optimal values of hyperparameters

Recall that PV-VASM yields an upper bound on the probability of misclassification of the audio subject to the transformation from Eq. (8). This upper bound may be over-conservative, i.e., significantly higher than the actual unknown value of P_{θ∼Θ}[h(x) ≠ h(x′)]. We also highlight that the distribution

Table 3: Robustness verification results against TTS generators. Results are shown for the model before and after finetuning on these generators.
The confidence level is set to α = 10^-6.

TTS model    n      k   δ     A       p
Pre-finetuning results
Vosk         5000   26  0.75  0.1352  3.37·10^-17
Silero       650    20  0.5   0.5030  6.1·10^-21
CosyVoice    500    40  0.9   0.5236  2.4·10^-8
F5           10000  20  0.9   0.4019  2.6·10^-25
ElevenLabs   500    6   0.5   0.3308  3.0·10^-5
Finevoice    200    25  0.5   0.1446  3.8·10^-7
Post-finetuning results
Vosk         5000   26  0.75  0.0686  8.1·10^-7
Silero       650    20  0.5   0.0579  1.64·10^-5
CosyVoice    500    40  0.9   0.2656  3.2·10^-4
F5           10000  20  0.9   0.2319  1.7·10^-16
ElevenLabs   500    6   0.5   0.2002  3.9·10^-3
Finevoice    200    25  0.5   0.0721  1.7·10^-5

Table 4: Robustness verification results against VC generators. Results are shown for the model before and after finetuning on these generators. The confidence level is set to α = 10^-4, δ = 0.5.

VC model   n    k   PCA (ε=0.05)  PCA (ε=0.10)  PCA (ε=0.30)  A(x)    p(x)
Pre-finetuning results
XTTSv2     100  50  0.92          0.94          1.00          0.0148  3.8·10^-6
F5         500  10  0.00          0.00          0.24          0.4167  9.6·10^-5
Post-finetuning results
XTTSv2     100  50  0.96          0.96          1.00          0.0078  4.1·10^-6
F5         500  10  0.00          0.00          0.30          0.3531  6.8·10^-4

of the augmentation budget (namely, m = n × k) into k chunks of n samples each affects the upper bound on the error probability (see, for example, Eq. (12)). To study these dependencies, we conduct additional experiments to illustrate the effect of computational budget redistribution. Namely, given a fixed budget m, we compute the value of PCA and the error probability for different pairs (n_i, k_i) with n_i × k_i = m.

We present the PCA results for LPF in Fig. 5 and Table 2, for the background noise in Fig. 1, for gain in Fig. 3, and for pitch shift in Fig. 6. It can be observed that the budget is not the only factor that impacts PCA. For instance, k = 1 uniformly yields the worst PCA results, and increasing k generally improves the verification results for a fixed m, as it improves the values of p(x). However, this does not always hold; see, for example, Fig. 1. To obtain a tighter bound, we recommend balancing the augmentation budget towards increasing the value of k.
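The budget redistribution studied above can be made concrete: for a fixed budget m, the admissible splits are exactly the factorizations m = n × k. The helper below is our own illustrative sketch (the function name and the n_min cutoff are assumptions, not from the paper); following the observation that larger k tends to improve p(x), it lists the splits with the largest k first.

```python
def budget_splits(m, n_min=500):
    """Enumerate all (n, k) with n * k == m and n >= n_min.

    Returns splits sorted by k descending, since distributing a fixed
    budget m into more statistics k of fewer samples n each tends to
    tighten the error probability p(x) (cf. Table 2 of the paper).
    """
    pairs = [(n, m // n) for n in range(n_min, m + 1) if m % n == 0]
    return sorted(pairs, key=lambda nk: -nk[1])

# all splits of the largest budget used in Table 2 (m = 20000)
for n, k in budget_splits(20000):
    print(f"n={n:6d}  k={k:3d}")
```

Running this for m = 20000 recovers the (n, k) pairs reported in Table 2 (500 × 40, 1000 × 20, ..., 20000 × 1), plus the remaining divisor pairs such as 625 × 32 that the paper did not evaluate.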
In Figures 2 and 4, the dependence of the verification results on the confidence level α is illustrated for the background noise and gain transforms. Generally, a higher value of α leads to less strict verification conditions and a larger PCA. Additionally, for TTS, we studied the dependence of PCA on the values of n and α, with the other parameters fixed. The overall verification results are shown in Fig. 7. It is worth mentioning that the lower the value of n, the lower the bound on the error probability p, but the looser the algorithm's output A. To obtain tighter verification results, a balance between the values of p and A should be found.

6. Discussion and limitations

For relatively simple input transformations, such as LPF with a high cutoff frequency ω_max or moderate gain, good verification results are obtained. In contrast, robustness degrades for harder transformations that significantly reduce speech intelligibility, such as strong background noise, a narrow high-band BPF, or combinations of several perturbations.

For robustness verification against TTS and VC generators, we again observe limited generalization of VAS models, reflected by poor upper bounds on the misclassification probability. The results also clearly show improved robustness after finetuning on these domains. Intuitively, when the distribution of the random variable e^{tZ} from Eq. (9), induced for example by a speech generation model, has a high variance within the support, the base model does not show a satisfactory level of robustness. In our experiments, this manifests as overly conservative PCA values for complex conventional and generative transformations. Although increasing the augmentation budget reduces the gap between the true misclassification probability P_{θ∼Θ}[h(x) ≠ h(x′)] and its estimate A(x), it can be difficult to distinguish between overly conservative estimates and genuinely poor robustness.

The tightness of the bound in Eq. (11) and the error probability in Eq.
(14) both depend on the parameter t and the hyperparameter δ, and this trade-off should be considered when selecting (sub)optimal values. In our implementation, t is restricted to a bounded range from Eq. (9), while the optimal value may lie outside this interval; in general, the wider the range, the better the expected verification results. Finally, the upper bound on the error probability from Eq. (17), namely α, is equally distributed over (i) the interval estimation of the coefficient of variation from Eq. (15) and (ii) the estimation of the error of the verification algorithm from Eq. (12); an increase of the confidence level in Eq. (15) may positively affect the verification results. Additionally, while the classification threshold is set to 1/2 according to Eq. (6), one could adapt the verification methodology to a variable classification threshold to balance false negative and false positive error rates.

7. Conclusion and future work

In this paper, we proposed PV-VASM, a framework for robustness verification of voice anti-spoofing models. We theoretically derived an upper bound on the error probability of the method and experimentally demonstrated verification results in different settings, including the presence of parametric input perturbations, text-to-speech generation models, and voice cloning methods. We showed that the robustness of voice anti-spoofing models crucially depends on the type of perturbation and the width of its parameter space. We thereby confirmed that robustness to a simple parametric perturbation is noticeably higher than robustness to perturbations produced by a speech generation model. The proposed method can be applied to verify the robustness of models before real-world deployment. Future work may focus on tightening error bounds and adapting the proposed approach to spoofing-aware speaker verification methods.

8.
Generative AI use disclosure

AI models (ChatGPT) and tools (Grammarly) were used only for text polishing and shortening.

9. References

[1] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., "XTTS: a massively multilingual zero-shot text-to-speech model," arXiv preprint arXiv:2406.04904, 2024.
[2] H. Azzuni and A. E. Saddik, "Voice cloning: Comprehensive survey," arXiv preprint arXiv:2505.00579, 2025.
[3] K. Wang, M. Chen, L. Lu, J. Feng, Q. Chen, Z. Ba, K. Ren, and C. Chen, "From one stolen utterance: Assessing the risks of voice cloning in the AIGC era," in 2025 IEEE Symposium on Security and Privacy (SP). IEEE, 2025, pp. 4663–4681.
[4] M. Li, Y. Ahmadiadli, and X.-P. Zhang, "A survey on speech deepfake detection," ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025.
[5] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," arXiv preprint arXiv:2408.08739, 2024.
[6] N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, "Does audio deepfake detection generalize?" Interspeech, 2022.
[7] L. Li, T. Xie, and B. Li, "SoK: Certified robustness for deep neural networks," in 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023, pp. 1289–1310.
[8] P. Singhal, R. Walambe, S. Ramanna, and K. Kotecha, "Domain adaptation: Challenges, methods, datasets, and applications," IEEE Access, vol. 11, pp. 6973–7020, 2023.
[9] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
[10] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A.
Lee et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech & Language, vol. 64, p. 101114, 2020.
[11] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.
[12] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan et al., "ADD 2022: the first audio deep synthesis detection challenge," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216–9220.
[13] J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren et al., "ADD 2023: the second audio deepfake detection challenge," arXiv preprint arXiv:2305.13774, 2023.
[14] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371.
[15] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028.
[16] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[17] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[18] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N.
Evans, "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation," arXiv preprint arXiv:2202.12233, 2022.
[19] A. Aliyev and A. Kondratev, "Intema system description for the ASVspoof5 challenge: power weighted score fusion," in Proc. ASVspoof 2024, 2024, pp. 152–157.
[20] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, "Sharpness-aware minimization for efficiently improving generalization," arXiv preprint arXiv:2010.01412, 2020.
[21] S. Ding, Y. Zhang, and Z. Duan, "SAMO: Speaker attractor multi-center one-class learning for voice anti-spoofing," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[22] K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, "AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 challenge."
[23] H. Gu, J. Yi, C. Wang, J. Tao, Z. Lian, J. He, Y. Ren, Y. Chen, and Z. Wen, "ALLM4ADD: Unlocking the capabilities of audio large language models for audio deepfake detection," in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11736–11745.
[24] M. Rabhi, S. Bakiras, and R. Di Pietro, "Audio-deepfake detection: Adversarial attacks and countermeasures," Expert Systems with Applications, vol. 250, p. 123941, 2024.
[25] X. Li, P.-Y. Chen, and W. Wei, "Measuring the robustness of audio deepfake detectors," arXiv preprint arXiv:2503.17577, 2025.
[26] V. Sehwag, A. N. Bhagoji, L. Song, C. Sitawarin, D. Cullina, M. Chiang, and P. Mittal, "Analyzing the robustness of open-world machine learning," in Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, 2019, pp. 105–116.
[27] F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M.
Hein, "RobustBench: a standardized adversarial robustness benchmark," arXiv preprint arXiv:2010.09670, 2020.
[28] D. Hendrycks and T. Dietterich, "Benchmarking neural network robustness to common corruptions and perturbations," arXiv preprint arXiv:1903.12261, 2019.
[29] C. Szegedy, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[30] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[31] E. Wong, L. Rice, and J. Z. Kolter, "Fast is better than free: Revisiting adversarial training," arXiv preprint arXiv:2001.03994, 2020.
[32] A. Madry, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
[33] S. Fort and B. Lakshminarayanan, "Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness," arXiv preprint arXiv:2408.05446, 2024.
[34] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana, "Certified robustness to adversarial examples with differential privacy," in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 656–672.
[35] J. Mohapatra, C.-Y. Ko, T.-W. Weng, P.-Y. Chen, S. Liu, and L. Daniel, "Higher-order certification for randomized smoothing," Advances in Neural Information Processing Systems, vol. 33, pp. 4501–4511, 2020.
[36] K. Tit, T. Furon, and M. Rousset, "Efficient statistical assessment of neural network corruption robustness," Advances in Neural Information Processing Systems, vol. 34, pp. 9253–9263, 2021.
[37] T. Baluta, Z. L. Chua, K. S. Meel, and P. Saxena, "Scalable quantitative verification for deep neural networks," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 312–323.
[38] M. Pautov, N. Tursynbek, M. Munkhoeva, N. Muravev, A. Petiushko, and I.
Oseledets, "CC-Cert: A probabilistic approach to certify general robustness of neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 7, 2022, pp. 7975–7983.
[39] S. Boucheron, G. Lugosi, and O. Bousquet, "Concentration inequalities," in Summer School on Machine Learning. Springer, 2003, pp. 208–240.
[40] M. E. Payton, "Confidence intervals for the coefficient of variation," Kansas State University, 1996.
[41] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[42] J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y. Tsao, H.-y. Lee, and J.-S. R. Jang, "DFADD: The diffusion and flow-matching based audio deepfake dataset," in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 921–928.
[43] X. Li, P.-Y. Chen, and W. Wei, "SONAR: A synthetic AI-audio detection framework and benchmark," arXiv preprint arXiv:2410.04324, 2024.
[44] H. Ma, J. Yi, C. Wang, X. Yan, J. Tao, T. Wang, S. Wang, and R. Fu, "CFAD: A Chinese dataset for fake audio detection," Speech Communication, vol. 164, p. 103122, 2024.
[45] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, "MLAAD: The multi-language audio anti-spoofing dataset," in 2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–7.
[46] D. Korzh, D. Tarasov, A. Iudin, E. Karimov, M. Skripkin, N. Kuzmin, A. Kuznetsov, O. Y. Rogov, and I. Oseledets, "Speech-to-LaTeX: New models and datasets for converting spoken equations and sentences," arXiv preprint arXiv:2508.03542, 2025.
[47] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.
[48] I. Jordal, A. Tamazian, T. Dhyani, E. Theofanis Chourdakis, N. Karpov, C. Landschoot, C. Angonin, O. Sarioglu, W. Drevo, F.
Catania et al., "iver56/audiomentations: v0.42.0," Zenodo, 2025.
[49] I. Jordal, S. ES, H. Bredin, K. Nishi, F. Lata, H. Coultas Blum, P. Manuel, A. Raj, K. Choi, P. Żelasko et al., "asteroid-team/torch-audiomentations: v0.11.0," Zenodo, 2022.
[50] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.
[51] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[52] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271.