Paper deep dive
MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee
Abstract
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further increases the gain to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
Tags
Links
- Source: https://arxiv.org/abs/2603.09714v1
- Canonical: https://arxiv.org/abs/2603.09714v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 99%
Last extracted: 3/13/2026, 1:04:50 AM
Summary
MUGEN is a comprehensive benchmark designed to evaluate multi-audio understanding in Large Audio-Language Models (LALMs). It covers 35 tasks across seven dimensions, including speech, general audio, and music. The study identifies that LALMs struggle with non-semantic attributes and exhibit performance degradation as the number of concurrent audio inputs increases. The authors propose Audio-Permutational Self-Consistency (APSC) as an effective training-free strategy to improve robustness and accuracy in multi-audio reasoning.
Entities (5)
Relation Signals (3)
MUGEN → evaluates → LALMs
confidence 100% · MUGEN, a comprehensive benchmark evaluating this capability [multi-audio understanding] across speech, general audio, and music.
Audio-Permutational Self-Consistency → improves → LALMs
confidence 100% · Audio-Permutational Self-Consistency... helps models form more robust aggregated predictions
Gemini-3-pro → performs on → MUGEN
confidence 100% · We evaluate a diverse set of advanced LALMs... including... Gemini-3-pro
Cypher Suggestions (2)
Identify methodologies used to improve LALM performance. · confidence 95% · unvalidated
MATCH (m:Methodology)-[:IMPROVES]->(l:ModelClass) RETURN m.name, l.name
Find all models evaluated on the MUGEN benchmark. · confidence 90% · unvalidated
MATCH (m:Model)-[:PERFORMS_ON]->(b:Benchmark {name: 'MUGEN'}) RETURN m.name
Full Text
36,432 characters extracted from source content.
MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
Chih-Kai Yang∗, Yun-Shao Tsai∗, Yu-Kai Guo†, Ping-Le Tsai†, Yen-Ting Piao‡, Hung-Wei Chen‡, Ting-Lin Hsiao‡, Yun-Man Hsu‡, Ke-Han Lu, Hung-yi Lee∗
National Taiwan University, Taiwan
chihkaiyang1124@gmail.com, hungyilee@ntu.edu.tw

Abstract
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further increases the gain to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
Index Terms: large audio-language model, multi-audio understanding, benchmark

1. Introduction
Large language models (LLMs) [1–3] have achieved remarkable progress in language understanding and have been extended to multimodal domains such as vision [4] and speech [5, 6]. Building on this trend, large audio-language models (LALMs) [7–14] integrate auditory perception [15, 16] with strong reasoning capability, enabling flexible interfaces for audio-centric applications such as voice agents [17]. However, current research predominantly evaluates these models in isolated and single-audio environments. In real-world deployments, LALMs require reasoning over multiple audio segments simultaneously.
This requirement appears in settings such as audio-based in-context learning [18], where multiple audio demonstrations are required to guide adaptation, as well as in applications including speech retrieval-augmented generation (RAG) [19], multi-speaker analytics, and cross-utterance event matching. These scenarios all involve jointly understanding multiple audio segments, requiring models to compare, aggregate, and reconcile information across clips. Consequently, multi-audio understanding is not merely an advanced feature but a strict prerequisite for practical LALMs.
Despite its importance, current evaluation remains largely focused on single-audio settings. Existing benchmarks [20] assess general understanding [21, 22], reasoning [23–25], dialogue [26], bias [27], and safety [28, 29], but they largely overlook multi-audio scenarios. Recent attempts at multi-audio evaluation still exhibit two limitations: (1) narrow coverage of auditory attributes [30–32], often emphasizing semantic content or sound events while underrepresenting non-semantic aspects such as emotion; and (2) limited input scale [24, 32–35], typically involving only two to three audio clips per sample. As a result, systematic evaluation across diverse auditory dimensions and larger input scales remains underexplored.
(∗†‡ Equal contribution. ∗∗ indicates the corresponding author.)
To fill this gap, we introduce MUGEN (Multi-audio Grounding and Understanding Benchmark), comprising 35 audio-grounding tasks across seven dimensions spanning speech, general audio, and music (Figure 1). Each task requires selecting the audio that best satisfies a constraint from five candidates, enforcing cross-audio comparison through an audio-as-option design where all choices are audio signals rather than text (Figure 2). This design requires direct comparison of auditory features across candidates, introducing different challenges from text-based multiple-choice settings.
Compared to prior work, MUGEN scales to more audio inputs and covers diverse auditory dimensions, enabling fine-grained multi-audio evaluation. Crucially, by emphasizing non-semantic and paralinguistic dimensions, MUGEN prevents models from bypassing acoustic reasoning through semantic shortcuts.
By benchmarking seven advanced LALMs, we uncover consistent weaknesses in multi-audio settings, especially for non-semantic attributes. Through systematic evaluation across varying numbers of audio inputs, we demonstrate that performance degrades markedly as the number increases, revealing input scaling as a systematic challenge. We further show that combining self-consistency [36] with audio permutation yields up to 6.28% accuracy gains, which increase to 6.74% when combined with Chain-of-Thought [37, 38] reasoning. These findings shed light on the limitations of LALMs.
Our contributions are threefold: (1) We introduce MUGEN, a comprehensive benchmark for multi-audio understanding in LALMs. (2) We identify key challenges, particularly in non-semantic attributes and input scaling. (3) We study effective training-free strategies to improve multi-audio performance. The MUGEN benchmark is available at https://huggingface.co/Multi-Audio-Grounding.

2. MUGEN Benchmark
2.1. Overview
MUGEN evaluates multi-audio understanding in LALMs through 35 tasks totaling 1750 test instances across 7 dimensions spanning speech, audio, and music. Each task is formulated as a multiple-choice audio grounding problem: given a textual constraint, the model selects the audio candidate that best satisfies it. For example, under the constraint “select the audio with the angriest emotion,” the model must recognize and compare the emotional intensity of each candidate, requiring joint reasoning over multiple audio inputs. Each task provides five audio candidates, and ten tasks additionally include a reference audio (e.g., “select the option featuring the same speaker as the reference”), resulting in six audio inputs that require reference-conditioned comparison. By enforcing cross-audio comparison and reference-conditioned reasoning, MUGEN probes integrated multi-audio reasoning rather than isolated perception. Table 1 summarizes the dataset statistics.

Figure 1: Overview of MUGEN and the detailed task distribution across the seven evaluation dimensions.
Figure 2: Illustration of the audio-as-option design.

Table 1: Dataset statistics of MUGEN. Duration and instruction length are reported as mean ± standard deviation.

| Statistic | Value |
|---|---|
| # Audio clips | 9,250 |
| Audio duration (s) | 8.60 ± 8.79 |
| Instruction length (words) | 13.91 ± 5.46 |

2.2. Evaluation Dimensions
MUGEN covers seven complementary dimensions of multi-audio understanding. Semantics & Pragmatics evaluates comprehension of speech content and contextual meaning, where tasks can in principle be solved by transcribing the audio and reasoning over text. Speaker & Demographics assesses recognition of speaker-related attributes such as identity cues and accent. Affective & Paralinguistic State examines non-semantic vocal signals, including emotion and prosody. Temporal Awareness probes sensitivity to temporal properties such as duration and pacing. Acoustic Scene & Event Analysis measures identification and reasoning over environmental sounds. Music Analysis evaluates recognition of musical attributes such as genre and instrumentation. Compositional Acoustic Reasoning tests integrated reasoning over multiple attributes across dimensions.

2.3. Data Sources
We source the majority of auditory data from publicly available corpora, using synthetic generation only when precise attribute control is required. Standard classification tasks covering emotion, speaker identity, language, and music are derived from widely adopted speech and audio benchmarks [39–42].
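The multiple-choice formulation described in Sec. 2.1 (a textual constraint, five audio candidates, and an optional reference clip) can be sketched as a minimal data structure. This is purely illustrative; the class and field names are my assumptions, not the authors' released data format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MugenInstance:
    """One hypothetical MUGEN-style test item: a textual constraint plus audio candidates."""
    instruction: str                 # e.g. "Select the audio with the angriest emotion."
    candidates: list[str]            # paths to the five candidate audio clips
    reference: Optional[str] = None  # extra clip for the ten reference-conditioned tasks
    answer_index: int = 0            # index of the ground-truth candidate

    def num_audio_inputs(self) -> int:
        # Five candidates, plus one reference clip when present (six total).
        return len(self.candidates) + (1 if self.reference else 0)

item = MugenInstance(
    instruction="Select the audio with the angriest emotion.",
    candidates=[f"clip_{i}.wav" for i in range(5)],
    answer_index=2,
)
assert item.num_audio_inputs() == 5
```

A reference-conditioned task would carry `reference="ref.wav"` and thus six audio inputs, matching the paper's description.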
For attributes involving specific vocal conditions, we curate samples from specialized academic corpora [43–53] as well as high-quality scripted speech collections¹. Finally, prosodic and semantic tasks that require strict ground-truth alignment are constructed using open-source speech synthesis models [54, 55] or commercial text-to-speech APIs². All synthesized samples are manually verified to ensure data quality and attribute fidelity.
Regarding task construction, all textual instructions are manually authored to provide clear and unambiguous selection constraints. To formulate the multiple-choice options, we first establish the ground-truth audio that satisfies the target objective. The remaining distractor candidates are then systematically selected or synthesized to exhibit contrasting variations of the specific auditory attribute under evaluation. This rigorous candidate formulation ensures that the model must perform precise cross-audio comparison to distinguish the correct option, preventing reliance on superficial shortcuts.

¹ https://datacollective.mozillafoundation.org/datasets/cmkfm9fbl00nto0070sdcrak2
² https://developers.openai.com/api/docs/models/gpt-4o-mini-tts

3. Experimental Setups
3.1. Baselines
We evaluate a diverse set of advanced LALMs to assess current progress in multi-audio understanding, including open-source models like DeSTA2.5-Audio [7], Qwen2.5-Omni-7B [11], Audio Flamingo 3 [12], Voxtral-Mini-3B, Voxtral-Small-24B [13], and Phi-4-Multimodal-Instruct [14], as well as the proprietary Gemini-3-pro [56] under different thinking levels. We run the open-source models with the vLLM framework [57] and include a cascade baseline (ASR+LLM) that transcribes audio with Whisper-large-v3 [58] and answers with Gemini-3-pro, to estimate how much of our tasks can be solved with semantic content alone. We use greedy decoding for all models except Voxtral and Gemini-3-pro. For Voxtral, we adopt the decoding configuration strongly recommended by the original authors (temperature 0.2 and top-p 0.95). For Gemini-3-pro, we use temperature 1, as recommended by the official team.

Table 2: Accuracy (%) and 95% confidence interval of baselines on MUGEN across seven dimensions: Semantics and Pragmatics (S&P), Speaker and Demographics (S&D), Affective and Paralinguistic (A&P), Temporal Awareness (TA), Acoustic Scene and Event Analysis (AS&E), Music Analysis (MA), and Compositional Acoustic Reasoning (CA). Overall accuracy shows the micro-average over all tasks. “Low”/“High” denote the thinking levels of Gemini. The best performance within each group is highlighted in bold.

| Model | S&P (↑) | S&D (↑) | A&P (↑) | TA (↑) | AS&E (↑) | MA (↑) | CA (↑) | Overall (↑) |
|---|---|---|---|---|---|---|---|---|
| Open-source LALMs | | | | | | | | |
| DeSTA2.5-Audio | 46.67 ± 7.98 | 22.00 ± 5.74 | 21.75 ± 4.04 | 17.20 ± 4.68 | 30.40 ± 5.70 | 18.00 ± 4.76 | 28.40 ± 5.59 | 24.91 ± 2.03 |
| Qwen2.5-Omni | 70.00 ± 7.33 | 12.00 ± 4.50 | 32.50 ± 4.59 | 15.20 ± 4.45 | 10.80 ± 3.85 | 53.20 ± 6.19 | 18.00 ± 4.76 | 28.69 ± 2.12 |
| Audio Flamingo 3 | 25.33 ± 6.96 | 22.50 ± 5.79 | 15.00 ± 3.50 | 21.20 ± 5.07 | 23.20 ± 5.23 | 1.20 ± 1.35 | 19.20 ± 4.88 | 17.43 ± 1.78 |
| Voxtral-Mini-3B | 59.33 ± 7.86 | 25.50 ± 6.04 | 30.00 ± 4.49 | 20.00 ± 4.96 | 22.40 ± 5.17 | 24.80 ± 5.35 | 23.60 ± 5.26 | 27.83 ± 2.10 |
| Voxtral-Small-24B | 64.67 ± 7.65 | 27.50 ± 6.19 | 32.50 ± 4.59 | 22.80 ± 5.20 | 22.40 ± 5.17 | 25.60 ± 5.41 | 16.80 ± 4.63 | 28.63 ± 2.12 |
| Phi-4-multimodal-instruct | 52.00 ± 8.00 | 21.00 ± 5.65 | 21.00 ± 3.99 | 28.00 ± 5.57 | 24.80 ± 5.35 | 24.80 ± 5.35 | 24.80 ± 5.35 | 26.29 ± 2.06 |
| Proprietary LALM | | | | | | | | |
| Gemini-3-pro (Low) | 89.33 ± 4.94 | 68.00 ± 6.47 | 77.00 ± 4.12 | 48.00 ± 6.19 | 55.60 ± 6.16 | 69.60 ± 5.70 | 69.20 ± 5.72 | 67.66 ± 2.19 |
| Gemini-3-pro (High) | 90.67 ± 4.65 | 71.50 ± 6.26 | 77.50 ± 4.09 | 53.60 ± 6.18 | 56.80 ± 6.14 | 68.40 ± 5.76 | 72.80 ± 5.52 | 69.60 ± 2.16 |
| Cascaded Systems | | | | | | | | |
| ASR + LLM (Low) | 75.33 ± 6.90 | 20.00 ± 5.54 | 27.00 ± 4.35 | 28.80 ± 5.61 | 26.40 ± 5.46 | 26.00 ± 5.44 | 18.00 ± 4.76 | 29.09 ± 2.13 |
| ASR + LLM (High) | 82.67 ± 6.06 | 16.50 ± 5.14 | 27.75 ± 4.39 | 28.80 ± 5.61 | 26.00 ± 5.44 | 26.00 ± 5.44 | 22.40 ± 5.17 | 30.06 ± 2.15 |

Figure 3: Performance scaling under varying numbers of audio candidates for tasks without (left) and with reference audio (right). (a) Overall accuracy for n = 2–5 candidate audios. (b) Overall accuracy relative to n = 2, Acc(n)/Acc(2).

3.2. Evaluation Metric
All tasks are formulated as multiple-choice question answering, with accuracy as the metric. To extract answers from free-form outputs, we adopt LLM-as-a-judge [59], using Claude Haiku 4.5³ with temperature 0 for reproducibility. The judge evaluates each prediction against the ground truth under predefined rubrics: each sample has exactly one correct option, and selecting none or multiple options is marked incorrect. It also provides a brief explanation and outputs the decision in a fixed format for post-processing, with human-annotated examples for calibration. On 400 randomly sampled instances, we observe 99% agreement with human annotators, supporting the reliability of automatic evaluation.

³ claude-haiku-4-5-20251001

4. Evaluation Results
4.1. Main Results
Table 2 presents the performance of baseline models on the MUGEN benchmark. We discuss the main observations below.
Current LALMs remain limited in multi-audio understanding. We compare cascaded ASR+LLM systems with end-to-end LALMs. While ASR+LLM performs competitively on semantic tasks, its performance degrades noticeably on non-semantic dimensions, suggesting that semantic-based reasoning alone is insufficient for many tasks in our benchmark. However, open-source end-to-end LALMs, despite direct access to acoustic signals, achieve overall performance only comparable to cascaded systems. This indicates that open-source models remain limited in multi-audio understanding and fail to extend their strong abilities in single-audio scenarios [21–23] to multi-audio settings. In contrast, the proprietary model consistently and substantially outperforms all open-source models across dimensions.
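As a side note on the numbers in Table 2, the reported 95% confidence intervals appear consistent with a standard normal-approximation binomial interval over the test instances. The paper does not state its exact CI method, so the formula and the example counts below are assumptions; the sketch merely shows how such an interval is computed.

```python
import math

def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Accuracy (%) with a normal-approximation 95% CI half-width (%):
    p ± z * sqrt(p * (1 - p) / n). One standard choice for multiple-choice
    benchmarks; not confirmed as the paper's exact method."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return round(100 * p, 2), round(100 * half_width, 2)

# Hypothetical count: 436 correct out of the 1750 MUGEN instances
# reproduces an overall entry of the form 24.91 ± 2.03.
print(accuracy_with_ci(436, 1750))
```

With n = 1750 overall instances the half-width lands near ±2, and with 250 instances per dimension it widens to roughly ±5, matching the magnitudes seen in the per-dimension columns.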
Nevertheless, even the strongest model remains far from perfect, indicating that multi-audio understanding remains an open challenge even for proprietary systems.
LALMs exhibit a clear imbalance between semantic and non-semantic dimensions. All LALMs perform consistently better on the semantic dimension than on others, indicating stronger semantic reasoning while non-semantic perceptual reasoning remains underdeveloped. These results reveal an uneven capability distribution and highlight the need for more balanced modeling of diverse auditory attributes in multi-audio settings.
Systematic blind spots persist across non-semantic dimensions. Among the non-semantic dimensions, some are consistently more challenging for most models. In particular, reasoning over and comparing temporal information remains difficult, with substantially lower performance even for proprietary models. We also observe model-specific weaknesses, such as music analysis for Audio Flamingo 3 and compositional acoustic reasoning for Voxtral-Small-24B. These results reveal concrete gaps in current LALMs and suggest that targeted training strategies or architectural adjustments may be required to strengthen multi-audio understanding.

Table 3: Performance of improvement strategies on MUGEN. Main values report accuracy (%), while parenthesized values indicate absolute gains over the original performance (Orig.).

| Method | Qwen2.5-Omni | Gemini-3-pro (Low) | Gemini-3-pro (High) |
|---|---|---|---|
| Orig. | 28.69 | 67.66 | 69.60 |
| CoT | 28.57 (−0.12) | 67.89 (+0.23) | 70.46 (+0.86) |
| SC | 28.00 (−0.69) | 70.06 (+2.40) | 72.74 (+3.14) |
| APSC | 30.69 (+2.00) | 73.94 (+6.28) | 74.97 (+5.37) |
| SC+CoT | 28.34 (−0.35) | 69.94 (+2.28) | 73.49 (+3.89) |
| APSC+CoT | 31.31 (+2.62) | 74.40 (+6.74) | 75.26 (+5.66) |

4.2. Performance Scaling with the Number of Audio Inputs
While Sec. 4.1 highlights the overall limitations of LALMs in multi-audio understanding, it remains unclear how performance scales with the number of audio inputs.
Thus, we construct reduced variants of MUGEN by progressively decreasing the number of audio candidates from five to two. This procedure controls input size while preserving the task formulation. For each instance, we remove one non-ground-truth option while keeping the instruction unchanged. Ranking-based tasks are excluded, as reducing the option set may invalidate instructions (e.g., “select the third longest audio” becomes ill-defined when fewer than three audios are provided). We therefore conduct this analysis on the remaining 32 tasks.
We evaluate Qwen2.5-Omni and Gemini-3-pro (both thinking levels), the strongest models in Table 2. As shown in Figures 3a and 3b, accuracy declines as the number of candidate audios increases from two to five. To compare degradation across models, we report accuracy at each n relative to the two-candidate setting, Acc(n)/Acc(2) (Figure 3b), where n denotes the number of candidate audios (excluding the reference when present). Qwen2.5-Omni shows a sharper performance decline than Gemini-3-pro. With five candidates, it preserves only about 66% and 48% of its two-candidate accuracy in tasks without and with reference audio, respectively, whereas Gemini-3-pro retains around 80% in both cases. The two thinking levels of Gemini-3-pro follow similar trends, indicating that increased thinking depth does not mitigate scaling degradation. Notably, the decline already emerges with only three additional audio inputs. These findings identify input scaling as a systematic challenge for current LALMs.
This analysis characterizes LALM behavior under controlled input expansion. Such evaluation is difficult with existing benchmarks, where the number of audio inputs is limited. By enabling systematic variation of input size, MUGEN facilitates fine-grained analysis under increasing input complexity.

5. Improvement Strategies
5.1. Methodology
We investigate several training-free strategies to improve the multi-audio understanding of LALMs. To encourage structured deliberation before answer prediction, we examine Chain-of-Thought (CoT) [37, 38], which elicits intermediate reasoning steps (e.g., “Let's think step by step”). Alongside CoT, we evaluate Self-Consistency (SC) [36] to improve robustness. Instead of relying on a single decoding output, SC generates multiple sampled responses and aggregates them via majority voting.
Building upon SC, we investigate Audio-Permutational Self-Consistency (APSC) for multi-audio scenarios. Before each inference, we randomly permute the order of audio candidates to reduce positional sensitivity and over-reliance on a specific audio position [60]. The model generates responses under these permutations, which are mapped back to the original indexing system and aggregated via majority voting.
For implementation, SC and APSC use a temperature of 0.2 for Qwen2.5-Omni, while Gemini-3-pro follows the recommended configuration in Sec. 3.1. Both methods generate 10 responses for majority voting. In SC, all responses are sampled under the original audio order, whereas APSC generates one response for each of the 10 permutations, maintaining the same number of generations and comparable computational cost.

5.2. Results
Chain-of-Thought reasoning fails to resolve underlying auditory perceptual bottlenecks. Table 3 details the performance of various improvement strategies. All models consistently benefit most from APSC, followed by SC, while CoT yields minimal or even negative impacts. Specifically, CoT leads to only marginal improvements for proprietary models and slightly degrades Qwen2.5-Omni's accuracy. This suggests that the primary challenge lies in acoustic perception and cross-audio comparison, rather than a deficiency in logical reasoning.
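The APSC procedure described in Sec. 5.1 (permute the candidate order before each inference, map each answer back to the original indexing, then majority-vote) can be sketched as follows. The `model_answer` interface is an assumption standing in for an actual LALM call; the paper does not release this code.

```python
import random
from collections import Counter

def apsc_predict(model_answer, instruction, candidates, n_samples=10, seed=0):
    """Audio-Permutational Self-Consistency (sketch).

    `model_answer(instruction, clips)` is a hypothetical interface that
    returns the chosen position within `clips`. Each of the `n_samples`
    inferences sees a freshly shuffled candidate order; votes are cast on
    original indices and aggregated by majority."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        order = list(range(len(candidates)))
        rng.shuffle(order)  # diversify the presentation order before this inference
        permuted = [candidates[i] for i in order]
        pos = model_answer(instruction, permuted)  # position in the permuted list
        votes[order[pos]] += 1                     # map back to the original index
    return votes.most_common(1)[0][0]
```

Plain SC would be the same loop without the shuffle, sampling the model several times under the original order; APSC keeps the generation count identical, which is why the paper reports comparable computational cost.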
Simply prompting models to deliberate over text cannot compensate for their inability to accurately differentiate auditory features.
Audio permutation further enhances self-consistency by diversifying the presentation order of audio candidates. Building upon the improvements of standard SC, APSC consistently achieves superior performance. By processing the audio options under varied presentation orders, APSC helps models form a more robust aggregated prediction, indicating that sensitivity to input arrangement can be effectively mitigated. Furthermore, combining this permutation strategy with CoT yields peak performance across all models, including an absolute accuracy gain of up to 6.74% for Gemini-3-pro (Low). However, while this method achieves significant improvements, producing multiple generations per instance incurs considerable computational overhead. Nevertheless, these findings suggest that audio permutation serves as a beneficial strategy for improving multi-audio comprehension in LALMs.

6. Conclusion
We introduce MUGEN, a comprehensive benchmark for evaluating multi-audio understanding in LALMs. Our systematic evaluation reveals critical blind spots in state-of-the-art models: while proficient in semantic tasks, their performance degrades severely on non-semantic attributes and scales poorly with concurrent audio inputs. Furthermore, we demonstrate that common CoT is insufficient for these multi-audio challenges. Instead, mitigating inherent positional biases through APSC yields substantial accuracy improvements. By exposing these foundational limitations and validating effective inference strategies, MUGEN establishes a vital stepping stone for advancing complex auditory comprehension in future LALMs.

7. Acknowledgement
We acknowledge the computational and storage support provided by the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratories (NARLabs) in Taiwan.

8. Generative AI Use Disclosure
In this paper, generative AI tools were used for writing refinement and language polishing. Additionally, large language models were employed as judges in our automatic evaluation.

9. References
[1] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
[2] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[3] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
[4] D. Surís, S. Menon, and C. Vondrick, “ViperGPT: Visual inference via Python execution for reasoning,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023.
[5] R. Huang et al., “AudioGPT: Understanding and generating speech, music, sound, and talking head,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2024, pp. 23802–23804.
[6] C.-Y. Kuan et al., “Speech-Copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1060–1067.
[7] K.-H. Lu et al., “DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,” arXiv preprint arXiv:2507.02768, 2025.
[8] ——, “Developing instruction-following speech language model without speech instruction-tuning data,” arXiv preprint arXiv:2409.20007, 2024.
[9] C.-K. Yang, Y.-K. Fu, C.-A. Li, Y.-C. Lin, Y.-X. Lin, W.-C. Chen, H. L. Chung, C.-Y. Kuan, W.-P. Huang, K.-H. Lu et al., “Building a Taiwanese Mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111, 2024.
[10] Y.-X. Lin, C.-K. Yang, W.-C. Chen, C.-A. Li, C.-y. Huang, X. Chen, and H.-y. Lee, “A preliminary exploration with GPT-4o voice mode,” arXiv preprint arXiv:2502.09940, 2025.
[11] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025.
[12] S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=FjByDpDVIO
[13] A. H. Liu, A. Ehrenberg, A. Lo, C. Denoix, C. Barreau, G. Lample, J.-M. Delignon, K. R. Chandu, P. von Platen, P. R. Muddireddy et al., “Voxtral,” arXiv preprint arXiv:2507.13264, 2025.
[14] A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs,” arXiv preprint arXiv:2503.01743, 2025.
[15] C.-K. Yang, N. Ho, Y.-J. Lee, and H.-y. Lee, “AudioLens: A closer look at auditory attribute perception of large audio-language models,” arXiv preprint arXiv:2506.05140, 2025.
[16] C.-K. Yang, Y.-T. Piao, T.-W. Hsu, S.-W. Fu, Z. Chen, K.-H. Lu, S.-F. Huang, C.-H. H. Yang, Y.-C. F. Wang, Y.-N. Chen et al., “SAKE: Towards editing auditory attribute knowledge of large audio-language models,” arXiv preprint arXiv:2510.16917, 2025.
[17] Z. Zhong, C. Wang, Y. Liu, S. Yang, L. Tang, Y. Zhang, J. Li, T. Qu, Y. Li, Y. Chen, S. Yu, S. Wu, E. Lo, S. Liu, and J. Jia, “Lyra: An efficient and speech-centric framework for omni-cognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 3694–3704.
[18] K.-W. Chang, M.-H. Hsu, S.-W. Li, and H.-y. Lee, “Exploring in-context learning of textless speech language model for speech classification tasks,” in Interspeech 2024, 2024, pp. 4139–4143.
[19] D. J. Min, K. Mundnich, A. Lapastora, E. Soltanmohammadi, S. Ronanki, and K. Han, “Speech retrieval-augmented generation without automatic speech recognition,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.
[20] C.-K. Yang, N. S. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 10144–10170. [Online]. Available: https://aclanthology.org/2025.emnlp-main.514/
[21] Q. Yang et al., “AIR-Bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug. 2024, pp. 1979–1998.
[22] B. Wang et al., “AudioBench: A universal benchmark for audio large language models,” NAACL, 2025.
[23] C.-K. Yang, N. Ho, Y.-T. Piao, and H.-y. Lee, “SAKURA: On the multi-hop reasoning of large audio-language models based on speech and audio information,” in Interspeech 2025, 2025, pp. 1788–1792.
[24] S. Kumar, Š. Sedláček, V. Lokegaonkar, F. López, W. Yu, N. Anand, H. Ryu, L. Chen, M. Plička, M. Hlaváček et al., “MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,” arXiv preprint arXiv:2508.13992, 2025.
[25] K.-H. Lu et al., “Speech-IFEval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,” in Interspeech 2025, 2025, pp. 2078–2082.
[26] J. Ao et al., “SD-Eval: A benchmark dataset for spoken dialogue understanding beyond words,” in Advances in Neural Information Processing Systems, vol. 37, 2024, pp. 56898–56918.
[27] Y.-C. Lin et al., “Listen and speak fairly: A study on semantic gender bias in speech integrated large language models,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 439–446.
[28] H. Yang et al., “Audio is the Achilles' heel: Red teaming audio large multimodal models,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 9292–9306.
[29] B.-H. Feng, C.-F. Liu, Y.-H. L. Liang, C.-K. Yang, S.-W. Fu, Z. Chen, K.-H. Lu, S.-F. Huang, C.-H. H. Yang, Y.-C. F. Wang et al., “Investigating safety vulnerabilities of large audio-language models under speaker emotional variations,” arXiv preprint arXiv:2510.16893, 2025.
[30] Y. Chen, X. Yue, X. Gao, C. Zhang, L. F. D'Haro, R. T. Tan, and H. Li, “Beyond single-audio: Advancing multi-audio processing in audio large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 10917–10930. [Online]. Available: https://aclanthology.org/2024.findings-emnlp.640/
[31] S. Kumar, S. Ghosh, Y. Lin, Y. Chen, R. Duraiswami, and D. Manocha, “PolyAudio: Advancing multi-audio analysis & reasoning in large audio language models,” 2025. [Online]. Available: https://openreview.net/forum?id=Tq0oPUyVTz
[32] S. Deshmukh, S. Han, R. Singh, and B. Raj, “ADIFF: Explaining audio difference using natural language,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=l4fMj4Vnly
[33] C.-y. Huang et al., “Dynamic-SUPERB Phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in The Thirteenth International Conference on Learning Representations, 2025.
[34] S. Deshmukh et al., “Mellow: A small audio language model for reasoning,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=um4aiicz3L
[35] K.-P. Huang, C.-K. Yang, Y.-K. Fu, E. Dunbar, and H.-Y. Lee, “Zero resource code-switched speech benchmark using speech utterance pairs for multiple spoken languages,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10006–10010.
[36] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=1PL1NIMMrw
[37] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 22199–22213.
[38] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 24824–24837.
[39] H. Cao et al., “CREMA-D: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.
[40] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[41] R. Ardila et al., “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: https://aclanthology.org/2020.lrec-1.520/
[42] D. Bogdanov et al., “The MTG-Jamendo dataset for automatic music tagging,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States, 2019. [Online]. Available: http://hdl.handle.net/10230/42015
[43] F. Rudzicz et al., “The TORGO database of acoustic and articulatory speech from speakers with dysarthria,” Language Resources and Evaluation, vol. 46, no. 4, pp. 523–541, 2012.
[44] J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, “VocalSet: A singing voice dataset,” in International Society for Music Information Retrieval Conference, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:53875542
[45] E. Bastianelli, A. Vanzo, P. Swietojanski, and V. Rieser, “SLURP: A spoken language understanding resource package,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 7252–7262. [Online]. Available: https://aclanthology.org/2020.emnlp-main.588/
[46] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” Nov. 2019. [Online]. Available: https://datashare.ed.ac.uk/handle/10283/3443
[47] F. Cummins, M. Grimaldi, T. Leonard, and J. Simko, “The CHAINS speech corpus: Characterizing individual speakers,” 2006.
[48] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A non-native English speech corpus,” in Proc. Interspeech, 2018, pp. 2783–2787. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1110
[49] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction (MLMI'05). Berlin, Heidelberg: Springer-Verlag, 2005, pp. 28–39. [Online]. Available: https://doi.org/10.1007/11677482_3
[50] K. Wang et al., “DisfluencySpeech – single-speaker conversational speech dataset with paralanguage,” in TENCON 2024 - 2024 IEEE Region 10 Conference (TENCON), 2024, pp. 469–472.
[51] J. Thiemann, N. Ito, and E. Vincent, “DEMAND: A collection of multi-channel recordings of acoustic noise in diverse environments,” in Proceedings of the International Congress on Acoustics (ICA), Montreal, Canada, 2013, pp. 1–6. [Online]. Available: https://zenodo.org/record/1227121
[52] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM Press, pp. 1015–1018. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2733373.2806390
[53] T. A.
Nguyen et al., “Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis,” in Interspeech 2023, 2023, p. 4823–4827. [54] Y. Wang et al., “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=ExuBFYtCQU [55] H. Hu et al., “Qwen3-tts technical report,” arXiv preprint arXiv:2601.15621, 2026. [56] Google, “Gemini 3 pro — best for complex tasks and bring- ing creative concepts to life,” https://deepmind.google/models/ gemini/pro/, 2025, accessed: 2026-02-08. [57] W. Kwon et al., “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23.New York, NY, USA: Association for Computing Machinery, 2023, p. 611–626. [Online]. Available: https://doi.org/10.1145/3600006.3613165 [58] A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, p. 28 492–28 518. [59] C.-H. Chiang and H.-y. Lee, “Can large language models be an alternative to human evaluations?”in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd- Graber, and N. Okazaki, Eds.Toronto, Canada: Association for Computational Linguistics, Jul. 2023, p. 15 607–15 631. [Online]. Available: https://aclanthology.org/2023.acl-long.870/ [60] Y.-X. Lin et al., “Hearing the order: Investigating selection bias in large audio-language models,” arXiv preprint arXiv:2510.00628, 2025.