Paper deep dive
Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding
Hoseong Ahn, Jeongyun Chae, Yoonji Park, Kyuhong Shim
Abstract
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3 pp on CORAAL and delivers 48% faster token-generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
Links
- Source: https://arxiv.org/abs/2603.06193v1
- Canonical: https://arxiv.org/abs/2603.06193v1
Intelligence
Summary
Whisper-CD is a training-free contrastive decoding framework designed to mitigate hallucinations, repetition loops, and content omissions in long-form automatic speech recognition (ASR) models like Whisper. By contrasting clean-audio logits against negative logits derived from three acoustically motivated perturbations (Gaussian noise, silence, and temporal shift), the method suppresses erroneous token generation at inference time without requiring model retraining.
Relation Signals (3)
Whisper-CD → uses → Gaussian noise injection
confidence 100% · Whisper-CD leverages three audio-specific negative signals: (i) Gaussian noise injection
Whisper-CD → improves → Whisper
confidence 95% · it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
Whisper-CD → mitigates → Hallucination
confidence 95% · Whisper-CD suppresses hallucinated generation by contrasting clean audio logits against multiple negative logits.
Cypher Suggestions (2)
- Find all perturbation methods used by Whisper-CD · confidence 90% · unvalidated
  MATCH (f:Framework {name: 'Whisper-CD'})-[:USES]->(p:Perturbation) RETURN p.name
- Identify datasets evaluated with Whisper-CD · confidence 90% · unvalidated
  MATCH (f:Framework {name: 'Whisper-CD'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name
Full Text
Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding

Hoseong Ahn 1, Jeongyun Chae 1, Yoonji Park 2, Kyuhong Shim 1,2
1 Department of Intelligent Software, Sungkyunkwan University
2 Department of Computer Science and Engineering, Sungkyunkwan University
hoseong8115, jyunchae, yoonji4024, khshim@skku.edu

Abstract
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3 pp on CORAAL and shows 48% faster token generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.

Index Terms: speech recognition, contrastive decoding, long-form ASR, inference-time

1. Introduction
Large-scale encoder-decoder models, such as Whisper [1], have significantly advanced automatic speech recognition (ASR), yet long-form transcription remains error-prone. When processing extended recordings that contain prolonged silences, acoustic corruption, or distribution shifts across segments, these models often produce fluent but unsupported text. This phenomenon is broadly termed hallucination [2, 3, 4, 5]. Such errors are particularly difficult to detect because the model typically predicts them with high confidence.
Moreover, long-form audio is commonly processed in a divide-and-conquer manner, which can compound errors as decoding proceeds across segments. In practice, hallucination in long-form ASR appears in three recurring patterns: (i) silence-region hallucination, where the model generates fictitious words during non-speech intervals; (ii) repetition loops that persist across segment boundaries; and (iii) content skips that omit parts of the spoken content. Architectural approaches for long-form processing [6, 7] improve segmentation and context utilization, but do not directly target these error patterns. Hallucination-specific mitigation techniques, such as VAD-based chunking, constrained decoding, and fine-tuning of hallucinatory attention heads [8], typically focus on a single error type and introduce additional components or model retraining. This motivates a single decoding-time method that can address diverse error patterns without auxiliary modules or parameter updates.

Contrastive decoding (CD) [9] has been studied as an effective technique to reduce hallucinations in vision-language [10, 11] and natural language processing tasks [12, 13, 14]. By contrasting logits from a target generation process against logits from a "negative" process that emphasizes undesirable behavior, CD steers token selection away from incorrect outputs without modifying model parameters. For example, VCD [10] corrupts the input image with heavy noise to obtain negative logits that carry weaker visual evidence, so tokens preferred under this condition are down-weighted during decoding. Analogously, in ASR, the negative process can be instantiated through audio perturbations that reduce speech evidence; to our knowledge, we are the first to apply contrastive decoding to ASR. In this paper, we propose Whisper-CD, a training-free contrastive decoding framework for long-form ASR.
Whisper-CD suppresses hallucinated generation by contrasting clean-audio logits against multiple negative logits. Whisper-CD leverages three audio-specific negative signals: (i) Gaussian noise injection, (ii) silence-only input (all-zero waveform), and (iii) temporal shift of the input waveform. These perturbations are designed to capture common and distinct long-form ASR failure modes. The three negative logits are combined using the log-sum-exp operator with a single uniform contrastive coefficient α, producing a unified multi-negative CD objective that addresses all three failure patterns. Experiments on five long-form ASR benchmarks show that Whisper-CD consistently reduces word error rate (WER) and hallucinations compared to the baseline, while adding substantially less computational overhead than beam search.

2. Related Work

2.1. Challenges in Long-form ASR
Long-form ASR remains challenging even for large-scale models trained on massive and diverse corpora [1, 15]. Existing methods include joint segmentation-and-decoding [16], factorized neural transducers [7], masked chunking conformers [6], and streaming LLM-based recognizers [17]. While effective, these approaches generally require architectural changes or model retraining, which limits their applicability to already-deployed ASR systems.

2.2. Hallucinations in Whisper ASR
Whisper is known to hallucinate frequently on non-speech segments, noisy audio, or out-of-distribution inputs [2, 3], making hallucination a serious concern in long-form transcription. A representative failure case is the so-called bag of hallucinations, a set of stock phrases (e.g., "Thank you for watching") that the model generates with high confidence when no genuine speech is present [2, 3].
Several mitigation strategies have been proposed, including fine-tuning hallucinatory decoder heads for non-speech segments [8] and timestamp-aware training for more verbatim transcription [15]. In contrast, training-free decoding-time approaches that can address diverse hallucination patterns remain underexplored; in particular, contrastive decoding has not yet been applied to reduce ASR hallucination.

Figure 1: Overview of the proposed Whisper-CD. Each audio segment is processed through four parallel paths, comprising the original signal and three acoustically perturbed variants (Gaussian noise injection, silence signal, and audio temporal shift). Each path produces decoder logits conditioned on the corresponding encoder output, and contrastive decoding steers token selection away from hallucinated outputs such as repetition loops and content omissions.

2.3. Test-time Improvement in ASR
A complementary line of research seeks to improve ASR quality at inference time without retraining the model. Classical examples include beam search, which explores multiple decoding hypotheses, and language model rescoring, which re-ranks n-best lists using an external language model. More recently, generative error correction (GER) leverages LLMs to post-correct ASR transcripts [18, 19], and test-time adaptation methods instead update model parameters per utterance via entropy minimization [20] or sequential-level generalized entropy minimization [21]. However, these approaches have fundamental limitations.
Beam search broadens the decoding space, but does not alter the model's token distribution; when hallucinated tokens receive high probability, increasing the beam size can still concentrate the probability mass on the same incorrect transcript. GER operates after the initial decoding pass, and thus cannot directly influence token selection during decoding. In contrast, our method directly adjusts the token distribution at test time.

3. Whisper-CD

3.1. Motivation
Whisper processes long-form audio by splitting the input into 30-second segments and processing each segment. The previous segment's transcription can optionally be given as context for consistent transcription (see Figure 1). Surprisingly, we observe that Whisper ASR performance degrades when the previous context is given. This unintuitive behavior is widely recognized, as the official documentation (https://github.com/openai/whisper/blob/7858a/whisper/transcribe.py#L278) recommends turning off the context-passing option when performance is unsatisfactory. We find that conditioning on previous text increases WER by over 190 pp on CORAAL [22] and over 500 pp on Earnings22 [23] with Whisper Large-v3. This degradation is mainly driven by error accumulation. Once an erroneous transcription is passed as context, it biases the current segment's decoding, and hallucinated text or repetition loops propagate to subsequent segments, compounding errors over the entire recording. Such hallucinations are difficult to recover from, even with beam search, and are more prominent in larger variants (i.e., models with a large decoder module). Rather than disabling context passing, which sacrifices the model's inherent ability to utilize useful long-range cues, we seek to preserve contextual decoding while suppressing errors. This motivates a logit-level decoding strategy based on contrastive decoding, which down-weights tokens that remain likely under deliberately degraded "negative" audio conditions.

3.2. Contrastive Decoding for ASR
Contrastive decoding steers token selection by contrasting the model logits under a target (positive) generation process with those under a negative process. We adapt this principle to ASR by deriving the negative signal from acoustically perturbed versions of the input waveform, rather than from a weaker model. Let x denote an input utterance and g(·) be a perturbation function that produces x̃ = g(x). At decoding step t, given the previously generated tokens y_<t, we compute logits using two forward passes of the same ASR model:

ℓ_t^pos = f_θ(x, y_<t),   ℓ_t^neg = f_θ(x̃, y_<t)   (1)

where f_θ(·) outputs a logit vector over the vocabulary. The contrastive logits are then defined as:

ℓ_t^CD = (1 + α) ℓ_t^pos − α ℓ_t^neg   (2)

where α > 0 controls the contrastive strength. Token selection is performed from ℓ_t^CD. This procedure is training-free: it requires no parameter updates and can be applied to any token-by-token ASR model at inference time.

3.3. Perturbation Strategies
We design three perturbation functions to instantiate a negative generation process for contrastive decoding. The key idea is to deliberately reduce or distort acoustic evidence so that the resulting logits reflect the model's biases or tendencies rather than the true speech content.

Table 1: Speech recognition results of Whisper-CD across different Whisper model variants (WER, %↓ for the five datasets).

| Model | Method | CORAAL | VoxPopuli | Earnings22 | TED-LIUM | REV-16 | Speed (tokens/s)↑ | RTF↓ |
|---|---|---|---|---|---|---|---|---|
| Large-v3 | Baseline | 208.76 | 44.95 | 520.94 | 66.42 | 173.69 | 30.6 | 0.2886 |
| Large-v3 | + CD | 45.77 | 19.86 | 57.08 | 25.62 | 21.38 | 27.3 | 0.1655 |
| Large-v3-Turbo | Baseline | 38.75 | 30.63 | 33.25 | 12.93 | 19.82 | 168.9 | 0.0239 |
| Large-v3-Turbo | + CD | 14.43 | 25.71 | 16.16 | 10.11 | 14.81 | 144.2 | 0.0346 |

3.3.1. Gaussian Noise Injection
The input waveform x is corrupted with additive Gaussian noise calibrated to a target signal-to-noise ratio (SNR). Specifically, the corrupted waveform is x̃ = x + ε, where ε ∼ N(0, σ²I) and σ is chosen to achieve the desired SNR.
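The SNR calibration just described, together with the silence and temporal-shift perturbations defined later in this section, can be sketched as follows. This is a minimal numpy sketch under the paper's stated definitions; the function names, the seeded generator, and the 16 kHz default are assumptions of mine, not the authors' code.

```python
import numpy as np


def gaussian_noise_at_snr(x, snr_db, rng=None):
    """Additive Gaussian noise with sigma calibrated to a target SNR.

    From SNR_dB = 10*log10(P_signal / P_noise), we get
    sigma = sqrt(P_signal / 10**(SNR_dB / 10)).
    """
    if rng is None:
        rng = np.random.default_rng(0)  # assumed seed, for reproducibility
    p_signal = float(np.mean(x ** 2))
    sigma = np.sqrt(p_signal / 10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, sigma, size=x.shape)


def silence_like(x):
    """All-zero input of the same shape, removing acoustic evidence entirely."""
    return np.zeros_like(x)


def temporal_shift(x, shift_s, sr=16000):
    """Leftward shift: drop the first shift_s seconds and zero-pad the tail,
    so the decoder sees future audio content earlier than expected."""
    n = int(round(shift_s * sr))
    return np.concatenate([x[n:], np.zeros(n, dtype=x.dtype)])
```

With the paper's hyperparameters these would be invoked as `gaussian_noise_at_snr(x, 10)` and `temporal_shift(x, 7)`; Whisper itself then consumes log-mel features computed from each perturbed waveform.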
This perturbation weakens fine-grained phonetic evidence while preserving the coarse structure, providing negative logits that highlight tokens the model prefers even under acoustic uncertainty.

3.3.2. Silence Signal
The input spectrogram is set to all zeros, which removes spectral structure entirely. In this condition, the decoder behaves as if it is generating continued text with minimal acoustic evidence, revealing its unconditional textual prior. The resulting logits tend to emphasize silence-region hallucination patterns, including the "bag of hallucinations" phrases (e.g., "Thank you for watching") reported in prior analyses [2, 3].

3.3.3. Audio Temporal Shift
The input waveform is shifted leftward by Δs seconds to create a controlled misalignment between acoustic content and the segment's temporal position. Concretely, the samples in the first Δs seconds are discarded and the rightmost end is zero-padded, so the decoder receives future audio content earlier than expected. This temporal mismatch disrupts the alignment between the decoder's prefix context and the local acoustics, producing negative logits that represent segment-boundary failure tendencies.

3.4. Multi-Negative Contrastive Decoding
To simultaneously address multiple long-form failure patterns, we combine K negative signals (K = 3) in a single decoding step. We aggregate the negative logits using a log-mean-exp operator with temperature τ:

ℓ_t^CD = (1 + ατ) ℓ_t^pos − ατ log( (1/K) Σ_{k=1}^{K} exp(ℓ_{k,t}^neg / τ) )   (3)

where ℓ_{k,t}^neg denotes the negative logits obtained from the k-th perturbed input x̃_k. Setting K = 1 and τ = 1 recovers Eq. (2). A small τ concentrates influence on the dominant negative (approximating a max), while a large τ approaches a smoother average. We set τ = 1.0, corresponding to arithmetic-mean aggregation in log space, and vary α ∈ [0.5, 2.0].

3.5. Inference
For efficiency, we compute encoder outputs for the clean input and all K perturbed inputs in a single batched forward pass. During autoregressive decoding, we compute all paths jointly by packing them along the batch dimension, which allows a single batched decoder forward pass per step while reusing the same prefix tokens y_<t. Because our method operates only on logits, model capabilities such as language identification and timestamp generation remain available. Furthermore, it is compatible with existing inference pipelines such as beam search or token-level constraints.

4. Experimental Results

4.1. Setup
Datasets. We evaluate on five English long-form benchmarks covering diverse speaking styles and acoustic conditions. CORAAL [22] uses 14 recordings (11.5 h) from the VLD region. Earnings22 [23] is a 13-file subset (14.3 h) of the full 125-file corpus (~120 h), selected at regular duration intervals (every 3rd file when sorted by duration) to ensure uniform coverage. VoxPopuli [24, 25] is a 50-file subset (1.4 h) of the long-form reconstitution of the test split, filtered for recordings between 60 and 300 seconds and sampled at every 9th file. TED-LIUM [25, 26] uses the full long-form reconstitution test set (11 files, 3.1 h). REV-16 [1] uses 16 recordings (16.1 h).
Decoding. We employ Whisper's long-form pipeline with 30-second segments and previous-context conditioning enabled, matching the default configuration. Unless specified, we use greedy decoding and disable the temperature fallback mechanism to isolate the effect of contrastive decoding.
Models. We conduct experiments on two Whisper model sizes: Large-v3 and Large-v3-Turbo. As discussed in Section 3.5, the negative paths reuse the same model weights with acoustically perturbed inputs. All analysis experiments are conducted on Whisper Large-v3-Turbo.
Perturbation hyperparameters. We set SNR_dB = 10 for Gaussian noise injection.
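The multi-negative objective of Eq. (3), which consumes the logits from these perturbed inputs at each decoding step, can be sketched in numpy as follows. This is a minimal sketch of the formula as printed; the numerically stable log-mean-exp and all names are assumptions of mine, not the authors' implementation.

```python
import numpy as np


def multi_negative_cd_logits(l_pos, l_negs, alpha=1.0, tau=1.0):
    """Eq. (3): l_CD = (1 + alpha*tau) * l_pos
                      - alpha*tau * log((1/K) * sum_k exp(l_neg_k / tau)).

    l_pos:  (vocab,) clean-audio logits.
    l_negs: (K, vocab) logits from the K perturbed inputs.
    A small tau approximates a max over the negatives; tau = 1 is an
    arithmetic mean in log space. K = 1 with tau = 1 recovers Eq. (2).
    """
    l_negs = np.atleast_2d(np.asarray(l_negs, dtype=float))
    s = l_negs / tau
    m = s.max(axis=0)
    # numerically stable log-mean-exp over the K negatives
    log_mean_exp = m + np.log(np.mean(np.exp(s - m), axis=0))
    return (1.0 + alpha * tau) * l_pos - alpha * tau * log_mean_exp
```

Greedy token selection would then take the argmax over these contrastive logits rather than over the raw clean-audio logits.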
Audio temporal shift applies a leftward shift of Δs = 7 s.
Metrics. We report word error rate (WER, %↓) as the primary metric. To assess computational cost, we additionally report decoding throughput in tokens per second and real-time factor (RTF), measured on a single NVIDIA A100 80GB PCIe with a batch size of 1.

4.2. ASR Performance and Inference Efficiency
Table 1 shows that Whisper-CD reduces WER across all five long-form ASR benchmarks. Note that the Large-v3 baseline WER exceeds 100% on several datasets because repetition loops inflate the output length well beyond the reference. CD suppresses these repetitions, though the WER remains above Turbo's baseline; we discuss this gap in Section 4.3. Figure 2 shows qualitative examples of how CD eliminates repetition loops and non-speech hallucinations. Regarding throughput, CD introduces three additional decoding paths, yet the reduction in total generated tokens can offset this cost. For Large-v3, eliminating repetition loops lowers overall wall-clock time, improving RTF over the baseline. For Large-v3-Turbo, the additional paths incur a modest slowdown that remains substantially faster than beam search (Table 4).

Figure 2: Qualitative examples on the same audio inputs. The baseline falls into repetition loops (red), while Whisper-CD breaks the loop and recovers the correct transcription (green).

Table 2: Ablation on contrastive decoding strength (α). Setting α = 0 corresponds to the baseline without CD. WER (%)↓ in all columns.

| α | CORAAL | Earnings22 | TED-LIUM |
|---|---|---|---|
| 0.0 | 38.75 | 33.25 | 12.93 |
| 0.5 | 26.58 | 16.16 | 11.19 |
| 1.0 | 14.43 | 17.70 | 11.65 |
| 1.5 | 19.65 | 19.44 | 10.11 |
| 2.0 | 20.82 | 21.77 | 11.00 |

Table 3: Ablation on three audio perturbation strategies. WER (%)↓ in all columns.

| Strategy | CORAAL | Earnings22 | TED-LIUM |
|---|---|---|---|
| Gaussian | 38.11 | 19.50 | 12.49 |
| Silence | 18.99 | 17.41 | 21.62 |
| Audio Shift | 18.77 | 15.54 | 13.81 |

4.3. Analysis
Impact of α. Table 2 shows the effect of the contrastive strength α. Datasets with higher baseline WER generally benefit from stronger contrastive signals, while the cleaner TED-LIUM is sensitive to over-subtraction and degrades beyond α = 1.5. Notably, all nonzero α values improve over the baseline on CORAAL and Earnings22, indicating that CD is broadly robust for hallucination-prone speech recordings.
Individual strategies. Table 3 compares perturbation strategies by using each as the sole negative. No single strategy consistently improves all datasets; for example, silence degrades TED-LIUM to 21.62% despite helping CORAAL. In contrast, the multi-negative combination (Table 1) achieves lower WER than any individual strategy on all three datasets, suggesting that the proposed aggregation effectively leverages complementary negative signals.
Effect of model scale. Despite its larger capacity, Large-v3 yields substantially higher WER under CD than Large-v3-Turbo (45.77% vs. 14.43% on CORAAL). We hypothesize that this gap stems from differences in failure severity and decoder-driven error propagation. In particular, Large-v3 often enters deep repetition loops that can push the baseline WER above 200%.
Once such loops are established, logit-level contrast alone may be insufficient to consistently steer decoding back to a correct trajectory, because the model assigns overwhelming probability mass to self-reinforcing continuations. Moreover, when a previous-segment context is provided as a prefix, a stronger decoder may rely more heavily on that textual context, which can amplify the impact of earlier mistakes.

Table 4: Comparison with beam search (beam size = 5). WER (%)↓ for CORAAL and TED-LIUM.

| Method | CORAAL | TED-LIUM | Speed↑ | RTF↓ |
|---|---|---|---|---|
| Baseline | 38.75 | 12.93 | 174.3 | 0.0246 |
| + Beam Search | 22.65 | 17.50 | 99.0 | 0.0436 |
| + CD | 14.43 | 10.11 | 147.0 | 0.0302 |

Comparison to beam search. Table 4 compares CD with beam search (beam size 5). Beam search improves performance on CORAAL but degrades TED-LIUM, whereas CD achieves lower WER on both datasets with higher throughput, offering a more favorable accuracy-throughput trade-off.

4.4. Discussion and Future Work
Dynamic α. Since the optimal α differs across datasets and model sizes (Tables 2 and 1), predicting and adjusting α dynamically per segment or per token would improve robustness under diverse acoustic conditions.
Additional perturbations. Our multi-negative framework is not restricted to a fixed set of negatives; other audio transformations, such as frequency masking, chunk shuffling, or temporal warping, could potentially serve as additional negative signals.
Decoder-only ASR models. Recent decoder-only ASR models [27, 28, 29] process audio and text in a single stream, making it less straightforward to inject perturbed audio while preserving the text prefix. Adapting the negative-path construction to such architectures is a natural next step, building on the precedent of visual contrastive decoding in vision-language models [10].

5. Conclusion
We proposed Whisper-CD, a training-free contrastive decoding framework for long-form speech recognition.
Whisper-CD contrasts clean-audio logits against three acoustically motivated negative signals: Gaussian noise injection, silence signal, and audio temporal shift. This multi-negative formulation mitigates long-form failure patterns, including non-speech hallucinations, repetition loops, and content omissions, without any parameter update. Experiments on five English long-form ASR benchmarks show consistent WER reductions across model configurations, with up to a 24.3 pp improvement on CORAAL, while adding only modest computational overhead relative to greedy decoding and remaining substantially faster than beam search.

6. References
[1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, 2023.
[2] A. Koenecke, A. S. G. Choi, K. X. Mei, H. Schellmann, and M. Sloane, "Careless whisper: Speech-to-text hallucination harms," in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024.
[3] M. Barański, J. Jasiński, J. Bartolewska, S. Kacprzak, M. Witkowski, and K. Kowalczyk, "Investigation of whisper asr hallucinations induced by non-speech audio," in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025.
[4] H. Atwany, A. Waheed, R. Singh, M. Choudhury, and B. Raj, "Lost in transcription, found in distribution shift: Demystifying hallucination in speech foundation models," in Findings of the Association for Computational Linguistics: ACL 2025, Jul. 2025.
[5] H. Park, H. Ahn, J. Moon, Y. Lee, and K. Shim, "Evaluating hallucinations in multimodal llms with spoken queries under diverse acoustic conditions," arXiv preprint arXiv:2510.08581, 2025.
[6] K. Le, T. V. Ho, D. Tran, and D. T. Chau, "Chunkformer: Masked chunking conformer for long-form speech transcription," in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[7] X. Gong, Y. Wu, J. Li, S. Liu, R. Zhao, X. Chen, and Y. Qian, "Longfnt: Long-form speech recognition with factorized neural transducer," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[8] Y. Wang, A. Alhmoud, S. Alsahly, M. Alqurishi, and M. Ravanelli, "Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down," pp. 3414–3418, 2025.
[9] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis, "Contrastive decoding: Open-ended text generation as optimization," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2023.
[10] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing, "Mitigating object hallucinations in large vision-language models through visual contrastive decoding," in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13872–13882.
[11] J. Waldendorf, B. Haddow, and A. Birch, "Contrastive decoding reduces hallucinations in large multilingual machine translation models," in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Mar. 2024.
[12] Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and P. He, "Dola: Decoding by contrasting layers improves factuality in large language models," in The Twelfth International Conference on Learning Representations, 2024.
[13] R. Sennrich, J. Vamvas, and A. Mohammadshahi, "Mitigating hallucinations and off-target machine translation with source-contrastive and language-contrastive decoding," in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), Mar. 2024.
[14] C. Zhu, Y. Liu, H. Zhang, A. Wang, Yangxue, G. Chen, L. Wang, W. Luo, and K. Zhang, "Alleviating hallucinations in large language models through multi-model contrastive decoding and dynamic hallucination detection," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[15] M. Zusag, L. Wagner, and B. Thallinger, "CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions," in Interspeech 2024, 2024, pp. 1265–1269.
[16] W. R. Huang, S.-Y. Chang, D. Rybach, T. Sainath, R. Prabhavalkar, C. Peyser, Z. Lu, and C. Allauzen, "E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR," in Interspeech 2022, 2022, pp. 4995–4999.
[17] J. Jia, G. Keren, W. Zhou, E. Lakomkin, X. Zhang, C. Wu, F. Seide, J. Mahadeokar, and O. Kalinli, "Efficient streaming llm for speech recognition," in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[18] C.-H. H. Yang, Y. Gu, Y.-C. Liu, S. Ghosh, I. Bulyko, and A. Stolcke, "Generative speech recognition error correction with large language models and task-activating prompting," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
[19] Y. Hu, C. Chen, C. Qin, Q. Zhu, E. S. Chng, and R. Li, "Listen again and choose the right answer: A new paradigm for automatic speech recognition with large language models," in Findings of the Association for Computational Linguistics: ACL 2024, Aug. 2024.
[20] G.-T. Lin, S.-W. Li, and H.-y. Lee, "Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition," in Interspeech 2022, 2022, pp. 2198–2202.
[21] C. Kim, J. Park, H. Shim, and E. Yang, "SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization," in Interspeech 2023, 2023, pp. 3367–3371.
[22] M. Quartey, C. Farrington, T. Kendall, L. Jenson, C. Tacata, and J. McLean, "The corpus of regional african american language: VLD (Valdosta, GA 2017)," Eugene, OR, 2020, version 2021.07. [Online]. Available: https://oraal.uoregon.edu/coraal
[23] M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, "Earnings-22: A practical benchmark for accents in the wild," 2022. [Online]. Available: https://arxiv.org/abs/2203.15591
[24] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 993–1003.
[25] J. D. Fox, D. Raj, N. Delworth, Q. McNamara, C. Miller, and M. Jetté, "Updated corpora and benchmarks for long-form speech recognition," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 13246–13250.
[26] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, "TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation," in International Conference on Speech and Computer. Springer, 2018, pp. 198–208.
[27] A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen et al., "Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras," arXiv preprint arXiv:2503.01743, 2025.
[28] A. H. Liu, A. Ehrenberg, A. Lo, C. Denoix, C. Barreau, G. Lample, J.-M. Delignon, K. R. Chandu, P. von Platen, P. R. Muddireddy et al., "Voxtral," arXiv preprint arXiv:2507.13264, 2025.
[29] X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang et al., "Qwen3-asr technical report," arXiv preprint arXiv:2601.21337, 2026.