Paper deep dive
Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning
Artem Dvirniak, Evgeny Kushnir, Dmitrii Tarasov, Artem Iudin, Oleg Kiriukhin, Mikhail Pautov, Dmitrii Korzh, Oleg Y. Rogov
Abstract
Modern generative audio models can be used by an adversary in an unlawful manner, specifically to impersonate other people and gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods have started to evolve. Unfortunately, current SDD methods generally suffer from a lack of generalization to new audio domains and generators. Moreover, they lack interpretability, especially human-like reasoning that would naturally explain the attribution of a given audio to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from a newly proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for predictions.
Links
- Source: https://arxiv.org/abs/2603.10725v2
- Canonical: https://arxiv.org/abs/2603.10725v2
PDF not stored locally. Use the link above to view on the source site.
Full Text
Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning

Artem Dvirniak 1, Evgeny Kushnir 2,3,4, Dmitrii Tarasov 3,4, Artem Iudin 6, Oleg Kiriukhin 6, Mikhail Pautov 2,8, Dmitrii Korzh 2,6,∗, Oleg Y. Rogov 2,4,6
1 MIRAI, 2 AXXX, 3 HSE, 4 Applied AI Institute, 5 Fusion Brain Lab, AXXX, 6 MTUCI, 7 City University of Hong Kong, 8 Trusted AI Research Center, RAS
d.s.korzh@mtuci.ru

Abstract

Modern generative audio models can be used by an adversary in an unlawful manner, specifically to impersonate other people and gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods have started to evolve. Unfortunately, current SDD methods generally suffer from a lack of generalization to new audio domains and generators. Moreover, they lack interpretability, especially human-like reasoning that would naturally explain the attribution of a given audio to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from a newly proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for predictions.

Index Terms: deepfake detection, voice anti-spoofing, audio LLM, reasoning, benchmark

1. Introduction

Contests such as ASVspoof [1, 2], ADD [3, 4], and SingFake [5, 6] drive progress in audio and speech deepfake detection (SDD) research, providing high-quality deepfake data and fair evaluation protocols. SDD and spoofing-aware speaker verification (SASV) research primarily focuses on architecture design, including task-specific front-ends [7], graph-attention networks [8], self-supervised (SSL) audio encoders [9, 10], and architecture modifications [11, 12] to improve the empirical performance of the models.
Additionally, augmentation strategies [13], optimizer choices [14], and representation-learning-based approaches [15, 16, 17] are also investigated in SDD research. Unfortunately, SDD remains challenging due to distribution shifts across spoofing methods, speech domains, and transformations, as evidenced by results from the aforementioned contests. A common strategy to improve empirical performance is to increase training diversity and model capacity. However, SDD systems still fail to generalize to unseen domains, highlighting the need for more robust and explainable detection approaches. Such tools are particularly important for risk-sensitive applications, such as biometrics and banking, yet this direction remains underexplored in SDD research.

Recently, Large Language Models (LLMs) [18, 19] and Large Audio Language Models (LALMs), such as SALMONN [20], Qwen2-Audio [21], and Audio Flamingo 3 [22], have demonstrated strong reasoning capabilities. (∗ indicates the corresponding author.) Chain-of-thought (CoT) [23] and related reasoning methods [19] often improve performance by extracting intermediate rationales. Although CoT mainly acts as an empirical mechanism and does not necessarily reflect human reasoning, careful training and grounding can produce explanations that are consistent with the input and useful for analysis. Similar ideas have been explored for audio question answering and captioning (including music understanding) [24, 25], disease detection from speech [26], emotion recognition [27], and interpretable audio quality assessment and hard-label SDD [28]. However, reasoning-based approaches remain relatively uncommon in SDD. One practical limitation is the lack of open-source datasets with high-quality human explanations for training and evaluation.
To address this gap, we introduce a human-annotated dataset for CoT training and evaluation of SDD models and propose HIR-SDD, a human-inspired reasoning framework that combines hard-label and CoT-supervised fine-tuning, grounding, and reinforcement learning (RL) methods. Our contributions are summarized as follows:
• We present a new dataset of human-annotated reasoning traces for 41k bona fide and spoof speech samples, partially curated from existing open-source datasets. (We will add links to the dataset and source code soon.)
• We propose hard-label and CoT pipelines that achieve both strong countermeasure performance and reasoning explainability.
• We provide further improvement and evaluation strategies for reasoning-capable SDD models.

2. Related work

ALLM4ADD [29] is among the first studies to explore large-scale LALMs for SDD. The authors report weak zero-shot performance for Qwen and Qwen2-Audio [21], and then fine-tune the models with a simple prompt requesting a binary "Real"/"Fake" response for the input speech. Across several benchmarks, the resulting systems match or outperform smaller state-of-the-art detectors. The study also analyzes training data scaling, sensitivity to LoRA [30] rank, and prompt design. However, the evaluation in [29] does not include the more recent ASVspoof 5 benchmark [2]. Moreover, the analysis focuses mainly on hard-label classification, leaving the reasoning and interpretability capabilities of LALMs largely unexplored.

In [31], the authors study the robustness of SDD models that produce reasoning traces alongside binary decisions. Explanations are evaluated along three aspects: perception quality (e.g., captioning or recognition fidelity), coherence between reasoning and prediction, and robustness under domain shift caused by audio transformations and adversarial perturbations [32].
The results indicate that reasoning traces do not consistently improve detection accuracy: well-grounded models may benefit, while other LALMs degrade due to hallucinated justifications. The study mainly audits explanation sensitivity under perturbations rather than improving standard SDD evaluation protocols. Experiments are also limited to older generative conditions from ASVspoof 2019.

Step-Audio-R1 [25] addresses audio reasoning issues by grounding explanations in real acoustic evidence rather than textual hallucinations. The approach combines CoT-style supervised fine-tuning (SFT) and RLVR [33], followed by Modality-Grounded Reasoning Distillation (MGRD). MGRD iteratively self-distills partially correct CoT traces while excluding batches where traces are fully correct or almost entirely incorrect. This process improves performance across speech, environmental sound, and music tasks. However, the method has not been applied to SDD.

Explainable AI methods for audio are surveyed in [34]. In [35], an interpretability method for SDD is proposed that modifies an SSL-based detector and introduces a class-activation-style representation to highlight influential time regions. However, evaluation is limited to ASVspoof 2019 and does not provide textual explanations. In [36], a benchmark for time-localized, model-agnostic post-hoc explanations in audio classification is introduced, using temporally annotated events as proxy ground truth. HoliAntiSpoof [37] applies LALMs to jointly reason about spoofing mechanisms, temporal localization, and semantic artifacts. While it improves hard-label performance and interpretability compared to binary classifiers, it does not explicitly evaluate the quality of SDD reasoning.

It is worth mentioning that humans also struggle with SDD [38]. The study [39] evaluates responses from 1200 annotators and reports 73% binary accuracy on detecting deepfakes from ASVspoof 2021 [40], WaveFake [41], and FakeAVCeleb [42].
Annotators also provide free-form explanations, later grouped into eight cue categories (e.g., prosody, liveness, quality). The authors compare these cues with automated detectors to analyze when humans outperform machines and which cues are reliable. Although the dataset is diverse, the public release includes only keywords rather than full explanations and does not clearly separate binary decisions from reasoning comments.

3. Dataset collection

3.1. Audio sources

Audio was primarily collected from several open-source SDD datasets in English and Russian, including ASVspoof 5 [2], PyAra [43], LibriSecVoc [44], MLAAD [45], DFADD [46], and M-AILABS (https://github.com/i-celeste-aurora/m-ailabs-dataset). We also included bona fide speech from Golos [47], SOVA (https://github.com/sovaai/sova-dataset), and Russian LibriSpeech (RuLS, https://w.openslr.org/96/). In addition, we incorporated a subset of annotated samples from the SpeechEval dataset introduced in SpeechLLM-as-Judges [28]. We further generated audio using the open-source XTTS-v2 [48] model and several ESpeech (https://huggingface.co/ESpeech) models. To diversify and strengthen the test set, we also synthesized several thousand samples using proprietary ElevenLabs (https://elevenlabs.io) models, approximating real-world conditions.

Figure 1: Distribution of obtained reasons from Listing 1. (Bar chart of annotation counts per reason ID 1–14.)

3.2. Ethics

The study and annotation procedure received Institutional Review Board approval. Annotators were informed about the project goals and agreed to participate under these conditions. No personally identifiable information (e.g., age, race, gender) was collected. Instead, anonymized identifiers were used. The annotation platform preserved participant anonymity, and annotators were compensated per sample, corresponding to approximately $8–10 per hour.
Annotators were required to be native speakers of either English or Russian and preferably proficient in the other language. No individual was obligated to annotate.

3.3. Instructions

Unlike [39], where the questionnaire was designed to avoid training annotators, we provided detailed instructions, training examples, and feedback on prediction accuracy during labeling, including additional training or removal of underperforming annotators. However, annotators were not guided on how to write explanations and were not shown the correct labels. Audio samples were presented in random order. For each audio sample, annotators answered the question: Assess whether the audio sample contains original (genuine) human speech or synthesized (artificial) speech. If "genuine" was selected, annotators provided a brief explanation. If "artificial" was selected, they provided a free-form explanation and selected relevant reasons from a predefined list of 14 options:

Listing 1: Annotation reasons list.
(1) Lack of fluency or coherence
(2) Unnatural pauses
(3) Uniform pauses between words throughout the audio
(4) Unusual intonation patterns
(5) Insufficient variation in speaking style
(6) Incorrect stress in common words
(7) Mispronunciation of common words
(8) Unusual or inconsistent accent
(9) Atypical voice characteristics
(10) Excessively fast speech
(11) Incorrect reading of abbreviations
(12) Verbalization of typographical errors
(13) Word-by-word repetition in cases of tautology
(14) Other (please specify)
Optional comment: ______________________________

Annotators were additionally provided with the following recommendations:
• If you selected "synthesized" and indicated specific factors from the list, please describe in detail where and how these cues manifested in the audio.
• If you are uncertain, explain which aspects appeared suspicious and which seemed natural.
• If you selected "genuine", describe the characteristics that led you to perceive the speech as natural (e.g., realistic pauses, natural intonation, breathing patterns, expressive variability).
• Responses should be detailed and well justified. Short statements such as "it sounds normal" are not acceptable.

Table 1: Evaluation results on Test-1-HL.

Model           | Train Set               | Accuracy | Balanced Accuracy | F1
Wav2Vec2-AASIST | Train-1-HL              | 92.3     | 81.3              | 76.7
Wav2Vec2-AASIST | Train-2-HL              | 92.9     | 84.0              | 76.7
Wav2Vec2-AASIST | Train-1-HL + Train-2-HL | 92.3     | 83.9              | 76.6
SALMONN-7B      | Train-1-HL              | 93.4     | 89.3              | 84.5
SALMONN-7B      | Train-2-HL              | 94.5     | 88.6              | 85.7
SALMONN-7B      | Train-1-HL + Train-2-HL | 93.1     | 83.8              | 80.4
SALMONN-7B      | Train-2-R               | 92.9     | 86.7              | 81.4
SALMONN-7B      | Train-1-HL + Train-2-R  | 92.7     | 86.5              | 81.2
SALMONN-7B      | Train-2-R + Val-1-GRPO  | 93.6     | 89.6              | 85.0

3.4. Annotations filtering and postprocessing

For data filtering, the following strategy was applied. First, we excluded annotators with classification accuracy below 75%. Second, samples with incorrect class predictions were removed. Finally, we manually inspected 30–50 annotations per annotator. Considering their overall accuracy, we categorized responses as "high" quality (≥ 85% accuracy) or "medium" quality (≥ 75% accuracy). After filtering, the dataset contained 124,410 annotations of 41,414 audio samples (32,045 spoof and 9,369 bona fide) provided by 37 annotators. The distribution of selected reasons is shown in Fig. 1. In total, 120,258 non-empty comments were collected, with an average length of 12 words. Human annotations and questionnaire selections were translated to English and post-processed using Qwen-32b to produce reasoning-like traces.

3.5. Splits

The hard-label (HL; binary labels without reasoning) dataset was collected from the sources described in subsection 3.1 and is denoted as Train-1-HL. Val-1-HL is the whole ASVspoof 5 development subset. For Test-1-HL, we randomly selected 20,000 samples from the ASVspoof 5 evaluation subset, collecting 15,943 spoof and 4,057 bona fide audio samples.
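The filtering and quality-tier rules described in Section 3.4 can be expressed as a short sketch. This is an illustration under assumptions, not the authors' code: the record layout and the function name `filter_annotations` are ours.

```python
# Sketch of the Section 3.4 filtering rules (illustrative, not the authors' code).
# Each annotation is a dict with keys 'annotator', 'pred', and 'label'.

def filter_annotations(annotations):
    # Per-annotator accuracy over all of that annotator's samples.
    stats = {}
    for a in annotations:
        correct, total = stats.get(a["annotator"], (0, 0))
        stats[a["annotator"]] = (correct + (a["pred"] == a["label"]), total + 1)

    # Rule 1: drop annotators below 75% accuracy; tier the rest.
    tiers = {}
    for name, (correct, total) in stats.items():
        acc = correct / total
        if acc < 0.75:
            continue
        tiers[name] = "high" if acc >= 0.85 else "medium"

    # Rule 2: drop individual samples the annotator misclassified.
    kept = []
    for a in annotations:
        if a["annotator"] in tiers and a["pred"] == a["label"]:
            kept.append({**a, "quality": tiers[a["annotator"]]})
    return kept
```

An annotator at exactly 75% accuracy is retained with "medium" quality, and only their correctly classified samples survive.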
The reasoning dataset was split into train, validation, and test subsets of 114k, 8k, and 1k samples, respectively. These splits can be used for binary classification or CoT training and evaluation, e.g., Train-2-HL and Train-2-R.

4. Methodology

4.1. Binary classification and CoT SFT training

Despite strong overall performance, LLMs and LALMs often show weak zero-shot and few-shot results on new tasks, including SDD [29]. Following this observation, we begin with SFT using LoRA applied to all linear projections of the LLM backbone. We use SALMONN [20], which combines Whisper [49] and BEATs [50] audio encoders with a Q-Former [51] adapter connected to the Vicuna-7B/13B [52] LLM. The model receives concatenated audio features and textual instructions and is trained with cross-entropy loss on completion tokens only (prompt tokens are masked). For comparison, we also evaluate a strong conventional Wav2Vec2-AASIST [9] SDD model.

For hard-label SFT, the model is trained on Train-1-HL, Train-2-HL, or their combination to output a binary answer ("Final Answer: Real" or "Final Answer: Fake"). For CoT SFT, the model is trained on Train-2-R to produce structured output in the following format: <think>...</think> <reasons>[...]</reasons> <answer>Real/Fake</answer>, where <think> holds the free-form reasoning trace, <reasons> lists detected cues from the annotation taxonomy (Listing 1), and <answer> gives the final binary prediction.

Two initialization strategies are explored: (i) fine-tuning from the base (vanilla) LALM, and (ii) fine-tuning from a hard-label SFT checkpoint. In both cases, the model is trained to generate the full structured output, including the reasoning trace, the list of detected artifact categories, and the binary prediction. SALMONN-7B was fine-tuned using LoRA with rank 8–128, α = 16–256, and dropout 0.2. The learning rate started at 10^-6 with a 5000-step warm-up, reached 10^-4, and decayed to 10^-5.
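The structured CoT output above can be extracted with a short regex sketch, e.g., for evaluation or reward computation. The function name `parse_cot_output` and the fallback behavior are our assumptions; only the tag format comes from the paper.

```python
import re

# Sketch: parse the <think>/<reasons>/<answer> output format from Section 4.1.
# (Helper name and fallbacks are illustrative assumptions.)

def parse_cot_output(text):
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    reasons = re.search(r"<reasons>\[(.*?)\]</reasons>", text, re.S)
    answer = re.search(r"<answer>(Real|Fake)</answer>", text)
    return {
        "think": think.group(1).strip() if think else "",
        # Reason IDs refer to the Listing 1 taxonomy, e.g. "(4),(9)".
        "reasons": {int(n) for n in re.findall(r"\d+", reasons.group(1))} if reasons else set(),
        "answer": answer.group(1) if answer else None,
    }
```

A parser like this also makes the format reward in Section 4.2 checkable: a generation with a missing or malformed tag yields an empty field that can be penalized.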
The audio was cut or padded to a fixed length.

4.2. GRPO and grounding

The LALM can generate fluent and structurally valid CoT traces that nevertheless do not correspond to the actual audio content and can be based on hallucinations or memorization of training CoT. To address this issue, we use an audio grounding stage to explicitly encourage the model to anchor its <think> traces and its <reasons> tags to perceptible acoustic evidence, such as Gaussian noise, time masking, and gain adjustments embedded deterministically in the audio signal. To further improve the quality and diversity of CoT, GRPO [53] was applied. Reward functions encouraged the model's correctness, adherence to the output format (<think>, <reasons>, <answer> tags), and alignment with human preferences. For the latter, Qwen2.5-32B was asked to evaluate each generation on its coverage, relevance, logic (coherence with the binary prediction), and helpfulness, from 0 to 10. For each sample we generated 6 distinct reasoning traces to calculate rewards.

4.3. Metrics

As the primary metrics of SDD classification, accuracy, balanced accuracy, and F1 (positive class is bona fide) were considered. Reasoning quality was evaluated using the GPT-5.1 model (https://openai.com/index/gpt-5-1/) over the same criteria set used in GRPO: recall, relevance, logic, and helpfulness, with scores from 0 to 10. Sets of reasons (tags) were evaluated using the Jaccard similarity index. For predicted tags Ŷ and ground-truth tags Y, it is defined as

J(Y, Ŷ) = |Y ∩ Ŷ| / |Y ∪ Ŷ|.   (1)

Table 2: Reasoning samples for SALMONN-7B tuned on Train-2-R and evaluated on Test-2-R.

Human Comment | Tags | Label | Model Reasoning | Pred. Tags | Pred. Label
I believe the audio is genuine. I heard a non-robotic voice. The speech pace varies, speeding up and slowing down. There is a natural rhythm following the meaning, clear pronunciation, slight tone changes, and audible breaths without artifacts. | – | Real | I believe the audio is genuine. The voice is non-robotic and the pace varies naturally. The rhythm sounds lively and human, with logical stress and clear pronunciation. | – | Real
The synthesized speech has high quality and clear intelligibility without artifacts. The voice is soft and neutral, but the speech rate is slightly fast and emotional engagement is limited. | (5),(10) | Fake | The speech is too fast and uninterrupted. | (1),(2),(10) | Fake
The voice is monotone and the narrator reads the sentence without pausing. | (1),(9) | Fake | The speech is unnaturally fast; it is difficult for a person to speak this quickly without pauses or mistakes. Odd pronunciation of "Europe". | (1),(2),(9) | Fake
Monotone speech. It's not entirely clear from the context what intonation was originally intended. | (4),(9) | Fake | The voice sounds robotic, and the speech is unnatural and staccato. | (4),(9) | Fake
A genuine recording for a voice assistant. Environmental noise is audible (TV in the background). | – | Real | The speech sounds authentic: natural intonation, individual pronunciation of sounds, consistent speaking pace, and background noise consistent with a television. | – | Real

5. Results and discussion

SALMONN-7B outperformed conventional Wav2Vec2-AASIST in terms of binary classification metrics, as shown in Table 1. Moreover, the obtained models demonstrate similar or better results compared to the open-source models presented on the Speech DF-Arena [54] leaderboard (https://huggingface.co/spaces/Speech-Arena-2025/Speech-DF-Arena). Nonetheless, LALMs can demonstrate weak training stability and a drastic overfitting tendency; to mitigate this issue, the number of training iterations and the learning rate might be decreased, while dropout and the augmentation range are increased. Training on the combined dataset of Train-1-HL and Train-2-HL does not improve the results compared to the separately trained checkpoints.
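The Jaccard index of Eq. (1) reduces to a few lines over tag sets. A minimal sketch; returning 1.0 when both sets are empty (both annotator and model select no tags) is our convention, not stated in the paper:

```python
# Jaccard similarity between ground-truth and predicted reason tags, Eq. (1).
def jaccard(y_true, y_pred):
    y_true, y_pred = set(y_true), set(y_pred)
    if not y_true and not y_pred:
        return 1.0  # both-empty convention (assumption)
    return len(y_true & y_pred) / len(y_true | y_pred)
```

For example, ground-truth tags {1, 9} against predicted tags {1, 2, 9} score 2/3, since two tags overlap out of three distinct tags in total.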
Regarding hard-label evaluation of reasoning checkpoints, while reasoning-only SFT underperformed the hard-label checkpoints, GRPO tuning improves the results noticeably. The hypothesis that hard-label SFT can help the reasoning SFT has not been confirmed. We highlight the decent reasoning performance on a set of samples from Test-2-R, reflected in Table 2. One can notice that SALMONN provides reasonable audio-grounded cues, such as referring to the pronunciation of particular words and to background noises. Among the other findings, it is noticeable that grounding does not significantly improve the classification metrics but diversifies the chain-of-thought traces, while GRPO further improves the reasoning. After applying these techniques to the model, the diversity and informativeness of the answers it generates, as well as its classification metrics, increase (see Table 1).

Despite these results, the resulting reasoning models still struggle with modern high-fidelity synthesis systems that were not present in the training data. The LALM's reasoning often describes such deepfake audio as genuine speech. Regarding CoT quality evaluation, GRPO does not yield a significant improvement in the Jaccard index. Given the noise in the reasoning-tag markup, the scores are effectively unchanged: 0.6468 for the GRPO checkpoint vs. 0.6264 for the SFT checkpoint. In an LLM-as-a-judge evaluation (five runs on 1,000 reasoning samples), GRPO shows a small but non-significant improvement in reasoning-trace quality: mean score 5.74 ± 1.49 vs. 5.12 ± 1.47 for SFT. We note that the judge rubric is intentionally strict, which keeps absolute scores relatively low; in this setting, the relative difference between checkpoints is more informative than the raw score level.

6. Conclusion

This work addressed the problem of limited generalization and poor interpretability in modern speech deepfake detection systems.
We introduced a novel human-annotated reasoning dataset for SDD and proposed HIR-SDD, a framework that enables models not only to perform binary spoof detection but also to provide human-interpretable explanations grounded in perceptual cues. Experimental results demonstrate that the proposed framework achieves competitive detection performance while producing meaningful reasoning traces aligned with human annotations. These findings suggest that incorporating the proposed human-inspired reasoning traces can improve both the transparency and reliability of SDD systems. Future work includes improving robustness to unseen generative models and domain shifts, as well as refining the stability, quality, and evaluation of reasoning traces produced by LALMs.

7. Generative AI use disclosure

AI models were used only for grammar correction and text refinement.

8. References

[1] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
[2] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," arXiv preprint arXiv:2408.08739, 2024.
[3] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan et al., "ADD 2022: The first audio deep synthesis detection challenge," in ICASSP 2022. IEEE, 2022, pp. 9216–9220.
[4] J. Yi et al., "ADD 2023: The second audio deepfake detection challenge," arXiv preprint arXiv:2305.13774, 2023.
[5] Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, "SingFake: Singing voice deepfake detection," in ICASSP 2024. IEEE, 2024, pp. 12156–12160.
[6] Y. Zhang, Y. Zang, J. Shi, R. Yamamoto, T. Toda, and Z. Duan, "SVDD 2024: The inaugural singing voice deepfake detection challenge," in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 782–787.
[7] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-to-end anti-spoofing with RawNet2," in ICASSP 2021. IEEE, 2021, pp. 6369–6373.
[8] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in ICASSP 2022. IEEE, 2022, pp. 6367–6371.
[9] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation," arXiv preprint arXiv:2202.12233, 2022.
[10] A. Aliyev and A. Kondratev, "Intema system description for the ASVspoof5 challenge: Power weighted score fusion," in Proc. ASVspoof 2024, 2024, pp. 152–157.
[11] Y. Zhang, J. Lu, Z. Shang, W. Wang, and P. Zhang, "Improving short utterance anti-spoofing with AASIST2," in ICASSP 2024. IEEE, 2024, pp. 11636–11640.
[12] K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, "AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 challenge," arXiv preprint arXiv:2408.17352, 2024.
[13] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, "RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing," in ICASSP 2022. IEEE, 2022, pp. 6382–6386.
[14] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, "Sharpness-aware minimization for efficiently improving generalization," arXiv preprint arXiv:2010.01412, 2020.
[15] S. Ding, Y. Zhang, and Z. Duan, "SAMO: Speaker attractor multi-center one-class learning for voice anti-spoofing," in ICASSP 2023. IEEE, 2023, pp. 1–5.
[16] S.-b. Kim, H.-s. Shin, J. Heo, C.-y. Lim, K.-W. Koo, J. Son, S. Hong, S. Jung, and H.-J. Yu, "Enhancing audio deepfake detection by improving representation similarity of bonafide speech," in Proc. Interspeech 2025, 2025, pp. 2250–2254.
[17] M. Yang, Y. Gu, Q. He, Y. Li, P. Zhang, Y. Chen, Z. Wang, H. Zhu, J. Liu, and W. Wang, "Generalizable audio deepfake detection via hierarchical structure learning and feature whitening in Poincaré sphere," arXiv preprint arXiv:2508.01897, 2025.
[18] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[19] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi et al., "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," arXiv preprint arXiv:2501.12948, 2025.
[20] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "SALMONN: Towards generic hearing abilities for large language models," arXiv preprint arXiv:2310.13289, 2023.
[21] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., "Qwen2-Audio technical report," arXiv preprint arXiv:2407.10759, 2024.
[22] A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., "Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models," arXiv preprint arXiv:2507.08128, 2025.
[23] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[24] C.-K. Yang, N. Ho, Y.-T. Piao, and H.-y. Lee, "SAKURA: On the multi-hop reasoning of large audio-language models based on speech and audio information," arXiv preprint arXiv:2505.13237, 2025.
[25] F. Tian, X. T. Zhang, Y. Zhang, H. Zhang, Y. Li, D. Liu, Y. Deng, D. Wu, J. Chen, L. Zhao et al., "Step-Audio-R1 technical report," arXiv preprint arXiv:2511.15848, 2025.
[26] C. Park, A. S. G. Choi, S. Cho, and C. Kim, "Reasoning-based approach with chain-of-thought for Alzheimer's detection using speech and large language models," arXiv preprint arXiv:2506.01683, 2025.
[27] J. Mai, X. Xing, Y. Li, and X. Xu, "Chain-of-thought distillation with fine-grained acoustic cues for speech emotion recognition," in Proc. Interspeech 2025, 2025, pp. 5438–5442.
[28] H. Wang, J. Zhao, Y. Yang, S. Liu, J. Chen, Y. Zhang, S. Zhao, J. Li, J. Zhou, H. Sun et al., "SpeechLLM-as-Judges: Towards general and interpretable speech quality evaluation," arXiv preprint arXiv:2510.14664, 2025.
[29] H. Gu, J. Yi, C. Wang, J. Tao, Z. Lian, J. He, Y. Ren, Y. Chen, and Z. Wen, "ALLM4ADD: Unlocking the capabilities of audio large language models for audio deepfake detection," in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11736–11745.
[30] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," in ICLR, 2022.
[31] B. Nguyen and T. Le, "Analyzing reasoning shifts in audio deepfake detection under adversarial attacks: The reasoning tax versus shield bifurcation," arXiv preprint arXiv:2601.03615, 2026.
[32] C. Szegedy, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[33] X. Wen, Z. Liu, S. Zheng, S. Ye, Z. Wu, Y. Wang, Z. Xu, X. Liang, J. Li, Z. Miao et al., "Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs," arXiv preprint arXiv:2506.14245, 2025.
[34] A. Akman and B. W. Schuller, "Audio explainable artificial intelligence: A review," Intelligent Computing, vol. 2, p. 0074, 2024.
[35] M. Li and X.-P. Zhang, "Interpretable temporal class activation representation for audio spoofing detection," in Proc. Interspeech 2024, 2024, pp. 1120–1124.
[36] C. Bolaños, L. Pepino, M. Meza, and L. Ferrer, "Benchmarking time-localized explanations for audio classification models," arXiv preprint arXiv:2506.04391, 2025.
[37] X. Xu, Y. Ren, L. Liu, W. Wu, B. Li, C. Lu, S. Wang, and C. Zhang, "HoliAntiSpoof: Audio LLM for holistic speech anti-spoofing," arXiv preprint arXiv:2602.04535, 2026.
[38] S. Barrington, E. A. Cooper, and H. Farid, "People are poorly equipped to detect AI-powered voice clones," Scientific Reports, vol. 15, no. 1, p. 11004, 2025.
[39] K. Warren, T. Tucker, A. Crowder, D. Olszewski, A. Lu, C. Fedele, M. Pasternak, S. Layton, K. Butler, C. Gates et al., "'Better be computer or I'm dumb': A large-scale evaluation of humans as audio deepfake detectors," in Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 2696–2710.
[40] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.
[41] J. Frank and L. Schönherr, "WaveFake: A data set to facilitate audio deepfake detection," arXiv preprint arXiv:2111.02813, 2021.
[42] H. Khalid, S. Tariq, M. Kim, and S. S. Woo, "FakeAVCeleb: A novel audio-video multimodal deepfake dataset," arXiv preprint arXiv:2108.05080, 2021.
[43] D. Efanov, P. Aleksandrov, and I. Mironov, "Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection," Journal of Computer Virology and Hacking Techniques, vol. 20, no. 3, pp. 375–382, 2024.
[44] C. Sun, S. Jia, S. Hou, and S. Lyu, "AI-synthesized voice detection using neural vocoder artifacts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 904–912.
[45] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, "MLAAD: The multi-language audio anti-spoofing dataset," in 2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–7.
[46] J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y. Tsao, H.-Y. Lee, and J.-S. R. Jang, "DFADD: The diffusion and flow-matching based audio deepfake dataset," in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 921–928.
[47] N. Karpov, A. Denisenko, and F. Minkin, "Golos: Russian dataset for speech research," arXiv preprint arXiv:2106.10161, 2021.
[48] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., "XTTS: A massively multilingual zero-shot text-to-speech model," arXiv preprint arXiv:2406.04904, 2024.
[49] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[50] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei, "BEATs: Audio pre-training with acoustic tokenizers," arXiv preprint arXiv:2212.09058, 2022.
[51] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742.
[52] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, "Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality," March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
[53] Z. Shao et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," 2024. [Online]. Available: https://arxiv.org/abs/2402.03300
[54] S. Dowerah, A. Kulkarni, A. Kulkarni, H. M. Tran, J. Kalda, A. Fedorchenko, B. Fauve, D. Lolive, T. Alumäe, and M. M. Doss, "Speech DF Arena: A leaderboard for speech deepfake detection models," IEEE Open Journal of Signal Processing, 2026.