Paper deep dive

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain

Year: 2026Venue: arXiv preprintArea: cs.SDType: PreprintEmbeddings: 62

Abstract

Abstract:Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLMs benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:56:44 AM

Summary

VoxEmo is a comprehensive benchmarking framework for Speech Large Language Models (LLMs) in Speech Emotion Recognition (SER), covering 35 corpora across 15 languages. It addresses zero-shot stochasticity and dataset ambiguity by providing a standardized toolkit, a distribution-aware soft-label protocol, and a prompt-ensemble strategy. The study evaluates models like Qwen2-Audio and Audio Flamingo 3, finding that while zero-shot performance is sensitive to prompt engineering, supervised fine-tuning significantly improves alignment with human subjective emotion distributions.

Entities (5)

Speech Emotion Recognition · task · 100%VoxEmo · benchmark · 100%Audio Flamingo 3 · speech-llm · 98%Qwen2-Audio-7B-Instruct · speech-llm · 98%LoRA · fine-tuning-method · 95%

Relation Signals (3)

VoxEmo → evaluates → Speech Emotion Recognition

confidence 100% · VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora

Qwen2-Audio-7B-Instruct → performstask → Speech Emotion Recognition

confidence 95% · Two representative ∼7B audio-language models are selected... for SER

LoRA → optimizes → Speech LLM

confidence 90% · LoRA [63] adapters... is applied to the attention projection matrices of the language decoders.

Cypher Suggestions (3)

Identify Speech LLMs evaluated in the paper · confidence 95% · unvalidated

MATCH (m:Model)-[:EVALUATED_IN]->(b:Benchmark {name: 'VoxEmo'}) RETURN m.name

Find all datasets included in the VoxEmo benchmark · confidence 90% · unvalidated

MATCH (b:Benchmark {name: 'VoxEmo'})-[:INCLUDES]->(d:Dataset) RETURN d.name, d.language

List fine-tuning methods used for Speech LLMs · confidence 85% · unvalidated

MATCH (m:Model)-[:USED_METHOD]->(f:Method {type: 'Fine-tuning'}) RETURN m.name, f.name

Full Text

61,637 characters extracted from source content.

Expand or collapse full text

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs Hezhao Zhang ID 1 , Huang-Cheng Chou ID 2 , Shrikanth Narayanan ID 2 , Thomas Hain ID 1 1 Department of Computer Science, University of Sheffield, United Kingdom 2 Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA hzhang181, t.hain@sheffield.ac.uk Abstract Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evalua- tion highly sensitive to prompts. Additionally, conventional speech LLMs benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehen- sive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standard- ized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft- label protocol and a prompt-ensemble strategy that emulates an- notator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions. Index Terms: speech recognition, human-computer interac- tion, computational paralinguistics 1. Introduction Speech emotion recognition (SER) concerns the automatic recognition of expressed emotion and affect from speech [1, 2]. As a key capability for modeling affect in human-computer in- teraction (HCI), SER supports affect-aware spoken interfaces and speech analytics, such as spoken dialog systems [3] and call-center monitoring [4]. Despite significant advancements in speech fields across multiple speech tasks in recent years (e.g., WavLM [5]), achieving robust SER performance in nat- uralistic speech remains significantly more challenging than in controlled, acted settings [6]. Recently, a different strategy to solve the many challenges has emerged: SER that leverages the many abilities of speech Large Language Models (LLMs) [7] (e.g., Qwen-Audio [8]). Speech LLMs have gained interest by achieving competitive results utilising generative, instruction- following interfaces [9], jointly leveraging lexical semantics and paralinguistic cues from speech [10]. However, the transition from traditional models to Speech LLMs does not relate well with existing evaluation paradigms. Traditionally, SER is formulated as a supervised, closed-set classification task [1, 11]. Typically, a speech encoder is used to extract embeddings, a lightweight prediction head outputs class probabilities, and inference-time behavior is strictly fixed by the model architecture [12, 6]. Consequently, existing bench- mark frameworks primarily focus on standardizing data splits and metrics [13, 6]. In contrast, speech LLMs based SER oper- ates as a generative, text-output task. Given an audio clip and an explicit instruction/prompt, the model generates a text response that must be parsed and mapped to a closed label set [14]. Re- ported scores are therefore highly sensitive to inference-time protocols-, specifically prompt formulation, decoding settings, and parsing heuristics [15]. This zero-shot stochasticity dras- tically reduces comparability across studies, necessitating eval- uation procedures that make inference-time settings explicitly reproducible. Beyond changes in how speech is modeled, SER bench- marking is confounded by the semantics of underlying datasets and domain shifts. Emotion is inherently ambiguous and sub- jective [16, 17], leading to low inter-annotator agreement, es- pecially in naturalistic corpora [18, 19]. This implies a crucial differentiation by Label Source: whether annotations reflect the emotion intended/expressed by the speaker or the emotion per- ceived by the listeners [20]. Prior work suggests that retaining label distributions (soft-labels) to reflect the annotation and la- bel uncertainty is preferred to collapsing minority views into a single hard label [21, 22]. Furthermore, as zero-shot speech LLMs enable seamless cross-corpus evaluation, unguided com- parisons often conflate multiple variables, such as language, in- teraction style, or recording conditions [23, 24]. Therefore, an appropriate benchmark should aim to categorise cross-corpus transfer performance using explicit, interpretable shift types. To sum up, the effects of generative model stochasticity and dataset semantic ambiguity require a unified, large-scale benchmarking framework. We introduce VoxEmo, an evalua- tion toolkit and benchmark designed to provide a standard for investigations into speech LLM-based SER. Our contributions are 4-fold: • (i) A standardized speech LLM-based SER evaluation toolkit that unifies prompt templates, generation settings, output parsing, and invalid-output handling. • (i) A comprehensive benchmark report and scoreboard span- ning 35 corpora across 15 languages, accompanied by a dis- closure checklist for reproducible comparisons. • (i) A novel dataset metadata schema that explicitly docu- ments the Label source (perceived vs. expressed) to enable distribution-aware evaluation. • (iv) Structured, corpus-driven cross-domain settings that ex- plicitly isolate label-set and acoustic shifts. 2. VoxEmo Benchmark Design To fulfill the requirements outlined above, the VoxEmo curates 35 corpora with explicit metadata and standardised partitioning. 2.1. Data Overview Table 1 gives a summary of the datasets used in VoxEmo and presents key dataset attributes. As shown 35 datasets released between 2006 and 2025 are included, covering 15 languages. arXiv:2603.08936v1 [cs.SD] 9 Mar 2026 Table 1: Dataset overview, grouped by acted vs in-the-wild recordings and sorted by year. Audio source (Aud. S.) reports recording origin for in-the-wild corpora, and scripted, spontaneous, mixed (Script/Spont) style for acted corpora. Label source indicates available label sources (Lab. S.): Expr (expressed/actor-intended), Perc (perceived/listener-annotator). Emo, #Utts, and #Hrs reflect the benchmark subset after dataset-specific filtering and label-set restriction (e.g., excluding no-agreement samples in MSP-Podcast and using a commonly adopted label subset for IEMOCAP). DatasetYearAud. S.Lab. S.#Emos#SpksSelected #Utts#HrsLangage In-the-wild EmotionTalk [25]2025TalkshowPerc73019,25023.6Mandarin MSP-Podcast [19, 18]2025PodcastPerc83,711212,801324.1English BIIC-Podcast [26]2023PodcastPerc8–69,874147.2Taiwanese Mandarin M3ED [27]2022TVPerc763424,4379.8Mandarin ASVP-ESD [28]2020Online mediaPerc1413213,96418.0English/Mandarin MELD [29]2019TVPerc7–13,70312.0English URDU [30]2018TalkshowPerc4384000.3Urdu Acted EmoDB [31, 32]2025ScriptedExpr7108160.6German Thai-ser [33]2025Script/SpontExpr520027,85441.6Thai nEMO [34]2024ScriptedExpr694,4813.1Polish ASED [35]2023ScriptedPerc5652,4742.1Amharic EMNS [36]2023ScriptedExpr811,1811.9English Emozionalmente [37]2023ScriptedExpr74316,9026.3Italian RESD [38]2022UnknownExpr7–1,3372.1Russian ESD [39]2021ScriptedExpr52035,00029.1English/Mandarin MESD [40]2021ScriptedExpr6168620.2Spanish MEAD [41]2020ScriptedExpr86031,72937.3English Oreau [42]2020ScriptedExpr7324340.3French Polish [43]2020ScriptedExpr354500.1Polish TurEV-DB [44]2020ScriptedExpr461,7350.5Turkish ShEMO [45]2019SpontaneousPerc6872,8383.3Persian SUBESCO [46]2019ScriptedExpr7207,0007.8Bangla AESDD [47]2018ScriptedExpr566040.7Greek CaFE [48]2018ScriptedExpr7129361.1French EmoV-DB [49]2018ScriptedExpr546,8879.5English JL-Corpus [50]2018ScriptedExpr1042,4001.4English RAVDESS [51]2018ScriptedExpr8241,4401.5English CREMA-D [52]2014ScriptedExpr+Perc6917,4425.3English EMOVO [53]2014ScriptedExpr765880.5Italian SAVEE [54]2014ScriptedExpr744800.5English PAVOQUE [55]2013ScriptedExpr517,33412.2German TESS [56]2010ScriptedExpr722,8001.6English CASIA [57]2008ScriptedExpr661,2000.6Mandarin IEMOCAP [20]2008Script/SpontExpr+Perc4105,5317.0English eNTERFACE [58]2006SpontaneousExpr6441,2631.1English Datasets are grouped into in-the-wild (7) and acted (28) record- ings and sorted by year of release. The table includes year of release, language, number of emotion categories, number of speakers and utterances, hours of speech, and Audio source and Label source. Audio source denotes whether the record- ing is “in-the-wild”, or the scripted/spontaneous style for acted datasets. Label source categorises whether annotations target perceived emotion, expressed emotion, or both; as was found in the wild datasets, primarily provide perceived labels, whereas acted datasets typically provide expressed labels, with only a few providing both expressed and perceived labels. 2.2. Soft-Label Ground Truth Among the datasets annotated by multiple listeners, inter- rater disagreement often reflects meaningful perceptual vari- ance rather than noise [59]. To retain this information, the ground truth is modeled as an unsmoothed count-based soft- label distributiony, where denotes the probability simplex over C emotion categories. Formally, given a set of N anno- tations for a specific sample, the target distribution is defined as: y = h n 1 N , n 2 N ,..., n C N i ,(1) where n c represents the count of votes for category c. For in- stance, in a 4-class scenarioNeutral, Angry, Sad, Happy with N = 5 annotators distributed as 1, 2, 2, 0, the resulting la- Table 2: Comparison of the two evaluated speech LLMs. Qwen2-AudioAudio Flamingo 3 Audio encoderWhisper-large-v3Whisper-large-v3 LLM backboneQwen-7BQwen2.5-7B Training data370k hrs, undisclosed>9M QA pairs, released IEMOCAP WA (%)59.263.8 bel vector isy = [0.2, 0.4, 0.4, 0.0]. Since annotator-level ratings are not available across all public emotion databases, soft-label analysis is restricted to the five datasets that include such metadata: CREMA-D, IEMOCAP, MSP-Podcast, BIIC- Podcast, and EmotionTalk. 2.3. In-Domain Partitioning Following the EmoBox protocol [13], which provides repro- ducible, speaker-independent splits for a large number of public SER corpora, datasets with provider-defined splits (e.g., IEMO- CAP, MELD) use them directly. For the remainder, the parti- tioning strategy is determined by the number of speakers and the balance of emotion labels across speakers. Datasets with fewer than 4 speakers or with highly unbalanced emotion-label distributions across speakers are split using stratified random sampling (75% train, 25% test; one fold), stratified by emotion category. For datasets with 4–6 speakers, leave-one-speaker-out cross-validation is used, with as many folds as there are distinct Table 3: Zero-shot prompt template used in VoxEmo. Prompt families are formed by adding instruction blocks to the shared template. For hard-label evaluation, FINAL LABEL is parsed as the prediction. Zero-shot Prompt Template Task descriptionYou are a speech emotion recognition system. You are given an audio clip. Emotion categoriesHard-label prompts: Choose exactly one emotion label from this closed set:labels. Distribution prompt: Consider the following emotion labels:labels. Instruction blocks + Direct decisionDecide the emotion directly from the audio. + DistributionEstimate how likely each emotion is based on the audio. Assign a probability (between 0.0 and 1.0) to every label. The probabilities must sum to 1.0. Also state which single label is most likely. + ASR transcriptTranscribe the spoken content from the audio as accurately as possible. + Acoustic captionDescribe acoustic and paralinguistic cues you hear (e.g., pitch, loudness, speaking rate, pauses, voice quality, non-speech events). + ReasoningExplain how the transcript content and the acoustic cues together support your final choice. Response Format[ASRTRANSCRIPT: <transcript>] [ACOUSTIC CAPTION:] <description>] [REASONING: <reasoning>] [EMOTION DISTRIBUTION: "Label1": prob1, "Label2": prob2, ...] FINALLABEL: <one label from labels> speakers. For datasets with more than six speakers, speakers are partitioned into four-folds. All speaker-based schemes en- sure that no speaker overlaps between the training and test sets. For datasets lacking a pre-defined/official validation set, 20% of the training partition is held out via stratified sampling. 2.4. Cross-Domain Configurations To assess cross-corpus generalisation across English-language data, a cross-scenario setting is defined in which label inven- tories, recording conditions, and Label source (expressed vs. perceived) vary simultaneously. Unlike traditional classifiers, generative SER facilitates evaluating across these variations by allowing the prompt-specified label set at inference to differ from the one used during training or fine-tuning. Four cor- pora that share a common language but differ systematically in label-set size (4–8 classes), audio source, and label source are evaluated in all source–target combinations: CREMA-D (6- class, acted, expressed labels), IEMOCAP (4-class, acted dia- logue, perceived), MELD (7-class, broadcast, perceived), and MSP-Podcast v2.0 (8-class, podcast, perceived). The released toolkit also includes cross-lingual configurations; these are not reported here due to space constraints. 3. Experimental Setup 3.1. Selection of Speech LLMs To maximise reproducibility while minimising implementation- driven variance, the benchmark focuses on open-weight speech LLMs that accept audio inputs and generate text re- sponses under instruction prompts. Two representative ∼7B audio-language models are selected: Qwen2-Audio-7B-Instruct (Q2A) [8] and Audio Flamingo 3 (AF3) [60]. Table 2 sum- marises their key attributes. Q2A’s training data composition is not publicly disclosed, making potential data contamination difficult to assess. AF3’s released training data explicitly includes SER-relevant corpora such as IEMOCAP and MELD, confirming overlap with bench- mark datasets. The IEMOCAP WA scores in Table 2 are re- ported by [60]. Both models are supported by Hugging Face Transformers 1 and PEFT 2 , enabling inference and fine-tuning under a single, consistent environment. 3.2. Zero-shot Protocol Zero-shot evaluation follows a fixed inference-time protocol: generation uses greedy decoding, and output parsing rules are shared across all runs; only the prompt template varies. To as- sess how auxiliary reasoning modes affect SER performance, five prompt variants are formed by progressively composing instruction blocks onto a shared template (Table 3): ASR transcript (T), acoustic caption (A), and reasoning (R). For soft-label evaluation on the five multi-annotator corpora (Sec- tion 2.2), each variant additionally includes the Distribution block, which requests a probability distribution over emotion categories. For hard-label evaluation, the predicted label is extracted from the FINAL LABEL field and mapped to the dataset- specific closed label set via case-insensitive matching. Samples whose output cannot be mapped are retained as incorrect predic- tions. In soft-label evaluation, the EMOTION DISTRIBUTION field is parsed as a JSON object, and mapped to the target label set via alias normalisation, and then re-normalised for probabil- ities to sum to one. When the parsing process fails, a uniform distribution across all labels is assumed. Since single-prompt evaluation is sensitive to prompt word- ing and parsing failures, a vote-based ensemble is adopted that aggregates the 5 hard-label predictions into a single distribution. This follows recent findings that various prompt-elicited pre- dictions better reflect human annotation distributions [61]. For sample k, let n k,c be the number of prompts predicting class c and f k the number of parse failures; each failure contributes a uniform vote of 1/C. The predicted distribution is: ˆ P (k,c) = n k,c + f k /C N prompts , N prompts = 5.(2) 3.3. Supervised Fine-Tuning To provide a supervised contrast reference to zero-shot eval- uation, both models presented in 3.1 are adapted (fine-tuned) to the training sets. Following recent practice in SpeechLLM- based SER [62], LoRA [63] adapters (r = 8, α = 16, dropout 0.05) is applied to the attention projection matrices (q/k/v/o) of the language decoders. The audio encoder and the remain- ing parameters remain frozen, meaning fewer than 0.15% of the total parameters are updated. Training uses AdamW with a learning rate of 1× 10 −5 , 10% warmup, bf16 precision, an ef- fective batch size of 16, for 10 epochs. Compared to the multi- block zero-shot template, supervised fine-tuning (SFT) uses a single closed-set instruction prompt that injects the dataset- specific label set labels; the training target is formatted as FINAL LABEL: <label>. Inference uses greedy decod- ing, consistent with the zero-shot setting. Predictions are scored with the same evaluator and metrics as zero-shot runs. 3.4. Evaluation Metrics For the hard-label case (only one emotion per utterance), the reported standard metrics are Weighted Accuracy (WA), Un- 1 https://github.com/huggingface/transformers 2 https://github.com/huggingface/peft weighted Accuracy (UA), Micro-F1, and Macro-F1, computed per dataset over its closed-label set. However, these metrics do not reflect annotation disagreement or the natural ambiguity of emotions (more than one emotion per utterance) present in multi-annotator corpora. Following [21], we additionally report soft-label metrics using a dual-metric approach: • Hard-decision Assessment: The predicted probability dis- tributions are thresholded to compute categorical classifica- tion performance, reporting Macro-F1, Micro-F1, and Top-1 Accuracy. • Distribution-aware Assessment: Probabilistic divergence is measured using Kullback-Leibler Divergence (KLD), Jensen-Shannon Divergence (JSD), and Total Variation Dis- tance (TVD), Cosine Similarity (Sim), and Mean Squared Er- ror (MSE) are reported for directional alignment and recon- struction error, respectively. 4. Results and Analyses 4.1. Zero-Shot Evaluation Results in the tables 4–5 show Macro-F1, UA, and WA results for both Q2A and AF3, under five zero-shot prompt variants on 35 datasets. As can be observed prompt choice has a substan- tial effect on performance in almost all cases. For Q2A, the Macro-F1 difference between the best and worst variants ex- ceeds 20 on 11 datasets, reaching a maximum spread of 49 on TESS. The optimal variant differs across models and datasets: AF3 performs better with the Direct prompt on 25/35 datasets, whereas Q2A most often benefits from the acoustic-caption variant (+A, 18/35). In absolute terms, zero-shot performance remains limited, with the best-prompt Macro-F1 falling below 20% on eight datasets for both models and staying below the EmoBox supervised baselines on all 30 comparable datasets. Under best-prompt comparison, Q2A outperforms AF3 on 26 of the 35 datasets, though this advantage should be interpreted alongside the training-data overlap discussed in Section 3.1. The following subsections provide further analysis on how in- dividual instruction blocks contribute to this variation. 4.1.1. Effect of Acoustic Caption For Q2A, the acoustic-caption prompt (+A) most frequently yields the highest Macro-F1, for 18 out of 35 datasets. This advantage, however, is concentrated on acted corpora (16/28 acted vs 2/7 in-the-wild). For scripted recordings, transcript content is fixed across emotion categories, leaving paralinguis- tic cues as the primary discriminatory signal. Explicitly elic- iting an acoustic description appears to help the model attend to these cues. Within the acted subset, the benefit is mostly visible for non-English data (13/18 non-EN vs 3/10 EN acted), suggesting that the language-agnostic nature of paralinguistic descriptions additionally compensates for weaker cross-lingual text understanding. AF3 is considerably less responsive: The Direct prompt remains its best variant on 25/35 datasets, and +A lowers Macro-F1 relative to Direct for 28/35 datasets. The difference between the two models may stem from differences in the composition of the training data or in how acoustic de- scriptions are integrated with the language decoder. 4.1.2. Effect of ASR Transcripts Adding the transcript prompt (+T) lowers Macro-F1 on the majority of the acted datasets for both models (Q2A 22/28, AF3 25/28), with Q2A Macro-F1 decreasing by up to 46.7 on CREMA-D and 37.6 on TESS. As discussed in Section 4.1.1, scripted recordings fix transcript content across emotion cat- egories, so the added lexical signal is non-discriminative and may introduce noise. In in-the-wild corpora, the pattern re- verses for Q2A, improving across 6 of 7 datasets, with URDU as the sole exception. These gains focus on English and Man- darin, the two languages most represented in Q2A’s training data; the URDU result is consistent with the limited coverage of underrepresented languages in current speech LLMs. AF3, by contrast, does not benefit from +T on any in-the-wild dataset, with all seven deltas negative, including English corpora such as MELD. While AF3’s training data is predominantly English, this failure on English corpora points to a model-level limi- tation in leveraging transcript content for emotion discrimina- tion, rather than a language coverage issue alone. The acted– naturalistic split in +T effectiveness aligns directly with the Au- dio source metadata in Table 1, confirming that prompt sensi- tivity is systematically shaped by corpus construction. 4.1.3. Combined Prompts Combining transcript and acoustic captions in a single prompt (+T+A) does not seem to add value for the Macro-F1 of either model across any of the 35 datasets. Across the 70 model– dataset pairs, the optimal variant is a single-block or Direct prompt in 66 cases (Q2A 32/35, AF3 34/35). Appending a reasoning step (+T+A+R) partially recovers performance for Q2A, improving over +T+A on 28/35 datasets, but it exceeds the best single-block prompt’s performance on only 3 datasets. Single-block prompts thus consistently outperform multi-block alternatives within this benchmark, a pattern that may reflect capacity limitations of the two 7 B-parameter models in inte- grating multiple information sources, though this may not hold for larger or differently trained speech LLMs. 4.2. Supervised Fine-tuning The SFT rows in Tables 4 – 5 report the Macro-F1 test for Q2A and AF3 after LoRA fine-tuning (Sec. 3.3). SFT improves Q2A over its best zero-shot prompt on 34 of 36 evaluation splits, with a mean absolute Macro-F1 gain of 23.7. Gains exceed 40 points on 9 datasets, including cases where zero-shot Macro-F1 was below 25. The improvement is more pronounced on the acted corpora than on the in-the-wild data, yielding a mean SFT Macro-F1 of 66.0 and 40.1, respectively. Only EmoDB (−2.9) and URDU (−0.5) show slight degradation. On the 30 datasets with a comparable EmoBox supervised reference [13], Q2A surpasses the reference on 10 and matches it within 3 Macro-F1 on a further 5 (e.g., TESS, ESD, eNTER- FACE). This performance correlates with the size of the dataset: Q2A exceeds the reference on 7 of 11 datasets with at least 5,000 utterances, but on none of the 9 datasets with fewer than 1,000. Among in-the-wild corpora, Q2A exceeds the reference on MELD (+6.3) and M3ED (+4.4), two datasets where the EmoBox supervised baseline itself remains below 34 Macro- F1, suggesting that speech LLM fine-tuning can be competitive on naturalistic data where conventional approaches also strug- gle. Nevertheless, 10 evaluation splits remain below 40 Macro- F1 after SFT. The five acted cases are all corpora with fewer than 1,500 utterances (EMNS, RESD, EMOVO, MESD, Pol- ish), where the limited training data likely limits the improve- ment for LoRA adaptation. The five in-the-wild cases span a wider range of corpus sizes, from URDU (400 utterances) to MSP-Podcast (213k), suggesting that factors beyond data vol- ume contribute to difficulty on naturalistic speech. Under the same LoRA configuration, AF3 improves on 26 of 36 splits but with a mean gain of 10.3, less than half of Q2A’s 23.7. The remaining 10 splits degrade, with the largest drops Table 4: VoxEmo zero-shot + SFT results. Q2A = Qwen2-Audio, AF3 = Audio Flamingo 3. Prompt abbreviations: T = ASR transcript, A = acoustic caption, R = reasoning. Bold red: best Q2A prompt per dataset; bold red: best AF3 prompt per dataset. Bold red: best score among Q2A SFT / AF3 SFT / EmoBox (by Macro-F1) per dataset. Part 1/2. SettingUA(%)WA(%)F1(%)UA(%)WA(%)F1(%)UA(%)WA(%)F1(%)UA(%)WA(%)F1(%) AESDD (el, 5)ASED (am, 5)ASVP-ESD (en/zh, 14)BIIC-Podcast (zh-tw, 8) Q2A Direct28.5728.1920.0635.9335.3932.7411.2816.619.9127.7329.8518.34 Q2A +T28.1828.1927.0824.592518.347.8411.2710.3828.6633.719.96 Q2A +A39.5239.638.6341.4142.2139.0220.8624.5518.421.5939.3516.55 Q2A +T+A 27.3827.5224.6729.1129.8725.169.4512.4710.1720.5136.9815.29 Q2A +T+A+R26.9226.8525.7730.531.1726.8114.9421.0815.4724.340.0318.08 AF3 Direct29.4529.5324.9537.1838.1533.2828.7934.4128.3618.2641.2215.43 AF3 +T 29.4729.5326.6524.8725.6517.7215.5519.4416.8416.0537.8713.41 AF3 +A26.0526.1720.9831.6932.6325.3928.1836.4228.0320.236.1515.19 AF3 +T+A18.0518.1211.9725.7126.6218.2518.6524.7519.2715.0739.6912.58 AF3 +T+A+R24.0924.1621.9823.4823.8618.5917.5322.9719.3813.5435.8110.67 Q2A SFT44.344.340.0983.9984.0983.9659.0370.3859.5139.557.940.96 AF3 SFT35.4335.5729.7267.7767.6967.3445.7658.0247.722.7948.1324.25 EmoBox[13]84.484.4984.1996.7596.7396.7461.1471.5262.08– SettingCaFE (fr, 7)CASIA (zh, 6)CREMA-D (en, 6)EMNS (en, 8) Q2A Direct37.138.5734.1231.2531.2525.3165.9465.366.2214.0213.47.21 Q2A +T28.6727.2425.2220.0820.0814.4925.7123.8719.512.4612.727.42 Q2A +A40.5838.1437.6240.540.535.5163.562.8262.1412.7513.066.9 Q2A +T+A 40.083635.4126.7526.7520.7132.9531.5232.6113.1913.45.01 Q2A +T+A+R35.3232.1632.2729.0829.0822.4341.7940.3639.4312.1512.373.98 AF3 Direct37.534.6231.2529.7529.7525.3158.4957.4959.7822.5821.9918.54 AF3 +T 29.2724.1424.82212113.4538.9837.539.6213.9414.096.31 AF3 +A32.3427.7827.631.8331.8328.4950.3649.3550.8116.5716.510.4 AF3 +T+A29.0723.7224.1124.4224.4218.3337.1535.6737.1411.6211.685.27 AF3 +T+A+R27.2822.4420.9123.4223.4217.4228.9827.5928.789.29.284.08 Q2A SFT65.5865.0664.66606056.5791.1791.0691.1114.4114.4311.08 AF3 SFT38.8938.1437.1746.3346.3344.4474.7574.8475.221.4621.3122.64 EmoBox[13]69.4368.8468.0659.5859.5856.2776.7576.4876.683.9784.1283.97 SettingEmoDB (de, 7)EmotionTalk (zh, 7)EmoV-DB (en, 5)EMOVO (it, 7) Q2A Direct69.0167.9764.5727.8717.3219.5568.0464.0562.123.6423.6419.66 Q2A +T39.1433.7736.6835.8738.2127.2240.6739.0337.1316.3316.3312.77 Q2A +A 75.9468.470.8729.0837.7922.4457.5358.1755.7127.0427.0421.18 Q2A +T+A43.8940.6942.5328.1745.7726.2733.7933.8630.7622.2822.2816.67 Q2A +T+A+R47.3342.8645.7130.4147.0728.0242.4241.2438.2323.8123.8118.55 AF3 Direct54.7554.1153.7128.1345.3127.6474.575.9777.4925.1725.1716.35 AF3 +T 47.7345.4548.6122.1846.524.5360.8362.1365.717.6917.6910.13 AF3 +A4948.9251.1124.4945.3122.9865.1466.969.124.3224.3216.9 AF3 +T+A44.442.4246.9421.3849.3523.0451.0252.8256.1918.218.210.74 AF3 +T+A+R 37.6537.6641.1718.2442.7219.3645.847.1851.1518.7118.7111.78 Q2A SFT72.5973.596859.1971.2860.2299.8899.8899.8833.3333.3330.46 AF3 SFT33.3832.0330.738.8658.6841.389.890.2990.4516.8416.8411.23 EmoBox[13] 92.5892.6792.57–99.4499.4799.4557.8257.8256.06 SettingEmozionalmente (it, 7)eNTERFACE (en, 6)ESD (en/zh, 5)IEMOCAP (en, 4) Q2A Direct232317.8443.2743.2740.0938.1838.1835.0554.3856.9751.63 Q2A +T 18.9918.9914.4663.1463.1460.2625.6625.6621.5244.5846.946 Q2A +A 27.1227.1222.3243.9143.9138.0942.0742.0740.357.3359.0958.19 Q2A +T+A21.621.616.9853.5353.5352.3628.4728.4724.339.9243.8540.03 Q2A +T+A+R23.2323.2318.5556.0956.0955.1933.9633.9630.0641.2544.941.81 AF3 Direct24.5624.5620.7253.5353.5356.1237.6537.6532.3462.663.8863.97 AF3 +T 16.2616.268.5148.7248.7249.0528.4328.4322.6646.5748.8849.9 AF3 +A22.0122.0115.3549.0449.0448.6836.6936.6934.1560.161.461.09 AF3 +T+A16.916.910.2244.5544.5546.1128.3128.3123.2839.0842.0741.11 AF3 +T+A+R 15.4515.459.1438.1438.1439.424.5824.5821.2643.1145.9943.38 Q2A SFT74.6274.6274.6195.8395.8395.8583.783.783.5882.6582.1382.43 AF3 SFT54.6554.6554.2860.960.961.1971.7271.7271.7563.4665.0165.15 EmoBox[13]76.9176.9176.997.6997.6897.6884.6284.6284.3373.5472.8673.11 SettingJL-Corpus (en, 10)M3ED (zh, 7)MEAD (en, 8)MELD (en, 7) Q2A Direct20.7920.7914.7717.1820.1311.3816.1416.7613.3831.6842.9125.97 Q2A +T 12.8812.889.5221.4936.9518.9316.0514.2413.2232.2342.9828.06 Q2A +A16.4616.469.6818.6327.5112.8115.111.048.8829.0747.3924.81 Q2A +T+A11.0811.084.1915.3937.9912.9714.719.888.6527.3746.2425.49 Q2A +T+A+R12.9612.968.2817.9340.7115.214.069.418.2827.7548.0426.48 AF3 Direct46.546.549.0618.2533.8514.8815.8111.178.9229.7744.7529.83 AF3 +T22.1722.1721.8214.5440.2812.3813.798.065.4825.5349.1627.52 AF3 +A32.5832.5833.2519.2339.4715.0414.879.377.5627.8247.8527.91 AF3 +T+A21.6721.6721.4816.0542.8313.0412.666.853.8320.6149.4621.63 AF3 +T+A+R20.8820.8820.2815.2340.2612.9112.586.964.322.7450.1523.39 Q2A SFT44.3344.3341.8337.5754.9538.2469.3670.2669.6337.8556.3339.26 AF3 SFT37.0837.0837.0925.9150.2426.7848.3949.3648.9424.1248.5425.74 EmoBox[13]66.7166.7165.1932.8449.4233.7681.2782.0381.4331.5451.8932.95 SettingMESD (es, 6)MSP-Podcast(test1) (en, 8)MSP-Podcast(test2) (en, 8)nEMO (pl, 6) Q2A Direct23.2723.3617.7522.2317.212.2817.9916.539.1821.8822.4314.3 Q2A +T22.6222.4319.7424.8431.4220.3519.632.1815.3515.3816.1811.35 Q2A +A 21.4521.520.5421.7938.116.8916.8655.4814.6921.3422.0716.89 Q2A +T+A20.7420.5617.6520.9135.5917.7215.9943.1513.3217.2218.1412.36 Q2A +T+A+R18.8518.6915.5620.6237.1918.7816.0545.5414.1317.6618.5913.17 AF3 Direct26.526.1722.3229.1147.6628.6720.5851.4218.7817.8819.129.04 AF3 +T16.5916.3612.6921.3538.3418.7715.751.2613.315.4516.715.5 AF3 +A21.8121.515.123.5742.3421.3818.456.3616.5717.5718.867.75 AF3 +T+A17.1216.826.8416.8735.9914.8114.452.7812.216.0317.345.71 AF3 +T+A+R17.5917.297.6516.2934.0614.3313.0547.0710.7215.5416.85.67 Q2A SFT34.2934.1134.332.0845.329.1823.2448.2618.2178.1678.2878.07 AF3 SFT36.235.9831.6222.0138.5419.0818.7548.8914.8661.8662.261.56 EmoBox[13]69.7869.6769.64– Table 5: VoxEmo zero-shot + SFT results. Q2A = Qwen2-Audio, AF3 = Audio Flamingo 3. Prompt abbreviations: T = ASR transcript, A = acoustic caption, R = reasoning. Bold red: best Q2A prompt per dataset; bold red: best AF3 prompt per dataset. Bold red: best score among Q2A SFT / AF3 SFT / EmoBox[13] (by MacroF1) per dataset. Part 2/2. SettingUA(%)WA(%)F1(%)UA(%)WA(%)F1(%)UA(%)WA(%)F1(%)UA(%)WA(%)F1(%) Oreau (fr, 7)PAVOQUE (de, 5)Polish (pl, 3)RAVDESS (en, 8) Q2A Direct46.2247.1239.8547.5662.7746.0934.8934.8921.9772.472.4367.55 Q2A +T 39.8441.3537.5327.725329.07343427.9149.1545.7644.54 Q2A +A33.1235.5825.7245.2367.5245.2734.6734.6730.1468.2966.3962.73 Q2A +T+A41.8944.2341.8231.1459.6133.8535.1135.1125.849.847.0148.57 Q2A +T+A+R45.5248.0846.1734.5460.2636.334.4434.4427.3445.1244.9344 AF3 Direct33.6235.5835.4938.367.341.1134.6734.6719.2547.8545.4244.66 AF3 +T36.8439.4237.1225.5855.5127.9433.1133.1119.5934.930.5632.92 AF3 +A30.7532.6930.2431.3264.4632.98343418.3342.1238.8240.17 AF3 +T+A 28.1730.7724.8123.4657.9722.9732.4432.4419.4535.0330.934.96 AF3 +T+A+R32.4934.6232.552046.5622.4336.2236.2229.0133.4630.1434.57 Q2A SFT50.3451.9250.8589.8494.2189.9544.6744.6737.0287.7690.8387.81 AF3 SFT43.8346.1546.1372.0684.7776.31363628.634.4435.734.77 EmoBox[13] 84.4884.7985.0187.7393.488.4383.2783.2782.7775.3275.8775.19 SettingRESD (ru, 7)SAVEE (en, 7)ShEMO (fa, 6)SUBESCO (bn, 7) Q2A Direct22.3822.4716.7454.4156.0451.7347.349.7943.5630.3930.3924.84 Q2A +T 18.5418.3517.0428.3331.0427.126.2330.9824.8817.6917.6915.13 Q2A +A23.7222.119.2258.3361.8858.5169.8571.5759.4337.2337.2332.39 Q2A +T+A20.0220.2318.1542.3847.543.4733.9143.1432.0125.9625.9622.12 Q2A +T+A+R23.2323.2222.1745.4850.6245.6541.946.9636.4427.5627.5624.07 AF3 Direct22.823.2217.1453.155.4254.5752.4660.9648.5324.7324.7319.18 AF3 +T14.7114.6111.0134.6442.534.7130.747.133.7720.5720.5714.24 AF3 +A21.9822.4715.7646.6752.9248.4449.0664.6450.3124.5124.5117.77 AF3 +T+A13.8913.867.8735.1242.2937.0624.440.0327.1219.9319.9314.12 AF3 +T+A+R 12.0211.986.6630.8337.2932.1524.2633.9526.3613.613.613.04 Q2A SFT28.8830.7127.1864.646562.387.1695.1988.2876.2976.2976.16 AF3 SFT 20.620.9719.9233.3437.7134.5757.1575.9560.6231.3631.3631.46 EmoBox[13] 55.8756.4755.8275.6578.2578.3880.2389.5582.9473.0573.0572.94 SettingTESS (en, 7)Thai-ser (th, 5)TurEV-DB (tr, 4)URDU (ur, 4) Q2A Direct71.8671.8673.6629.2129.1725.4928.1128.9323.83383835.52 Q2A +T 36.8636.8636.0217.0916.4513.6221.6221.7620.44222219.05 Q2A +A777773.8925.4224.5921.5626.7927.7821.39363635.97 Q2A +T+A26.8626.8624.9220.1219.0914.0322.7721.7615.14272721.4 Q2A +T+A+R 58.7158.7158.4421.1220.0915.1822.4521.5314.14242419.78 AF3 Direct59.1459.1461.3630.829.9323.5428.6731.2519.43383832.2 AF3 +T51.8651.865723.7922.1115.8123.4823.1518.88292920.93 AF3 +A55.5755.5759.5828.3527.7820.3928.3628.4722424236.85 AF3 +T+A 53.4353.4358.5522.7321.4714.3122.1521.0711.78343424.15 AF3 +T+A+R 47.1447.1452.8819.9218.7512.4622.0421.0712.01363625 Q2A SFT99.8699.8699.8672.4272.7972.351.8853.2451.36414135.46 AF3 SFT 82.7182.7183.1951.9552.4551.742.4943.7542.52414133.75 EmoBox[13] 99.9699.9699.96–81.3281.5881.3188.4188.4188.4 on EmoDB (−23.0), SAVEE (−20.0), and JL-Corpus (−12.0). AF3 does not surpass the EmoBox reference on any of the 30 comparable datasets, with CREMA-D (−1.4) as the closest. These results reflect a single LoRA configuration (r=8); AF3 may benefit from different hyperparameters, a point we return to in Section 5. Overall, SFT substantially narrows the gap between speech LLMs and traditional supervised baselines, with Q2A reaching parity or better on 15 of 30 comparable datasets. The remaining deficit concentrates in small-act corpora and depends heavily on the choice of foundation model, as the widening of Q2A’s advantage from 26/35 datasets in zero-shot to 35/36 after SFT illustrates. 4.3. Soft-Label Evaluation Table 6 presents a comprehensive assessment of zero-shot soft-label performance across five corpora that provide multi- annotator distributions. Based on the prompt sensitivity ob- served in Figure 1, we select three representative configura- tions for this detailed comparison: the baseline Direct prompt, the acoustic caption prompt (+A), and our proposed Ensemble strategy. 4.3.1. Impacts of Prompts and Zero-Shot Instability A critical challenge in evaluating generative speech LLMs is inference-time stochasticity and strict adherence to structured output formats. As illustrated in Figure 1, instructing models to perform intermediate reasoning tasks often disrupts their ability to generate valid JSON predictions. Interestingly, this degrada- tion is highly model-dependent. For Q2A, increasing prompt complexity leads to a catas- trophic surge in the Average Parse Failure Rate (red line)— peaking at 93.2% for the +A prompt—which directly plunges the average Macro-F1 score to near zero. Conversely, AF3 handles the +A prompt relatively well, achieving its high- est single-prompt F1 (31.4%) with a low parse failure rate (10.9%). Nonetheless, both models suffer severe formatting failures when subjected to combined reasoning prompts (e.g., +T+A+R). This reveals that relying on a single prompt is inher- ently brittle and inconsistent across different LLMs. 4.3.2. Mitigating Instability via Prompt Ensemble To mitigate this zero-shot instability, the proposed prompt en- semble strategy (Ensemble) is evaluated following [61]. This ensemble method aggregates discrete predictions across all five prompt variations using a uniform fallback mechanism for parse Table 6: Zero-shot soft-label evaluation results. We compare the single direct prompt (Direct), +A prompt, against the proposed prompt ensemble method (Ensemble). Performance is categorized into hard-decision metrics (Macro-F1, Micro-F1, and Top-1 Accuracy in %) and distribution-aware assessments, including Kullback–Leibler Divergence (KLD), Jensen–Shannon Divergence (JSD), Total Variation Distance (TVD), Cosine Similarity (Sim×100), and Mean Squared Error (MSE). Bold red denotes the superior setting per corpus. ↑ (↓) indicates that higher (lower) values are better. DatasetModelPromptHard-decision AssessmentDistribution-aware Assessment Ma-F1↑Mi-F1↑Acc↑KLD↓JSD↓TVD↓SIM↑MSE↓ BIIC-Podcast (zh-tw, 8) Q2A Direct 28.1442.1629.575.4730.3630.65749.610.074 +A 4.799.389.672.4400.4020.74945.780.073 Ensemble31.1251.1442.957.2570.3090.55956.490.070 AF3 Direct15.5725.7022.954.7670.3740.68046.970.083 +A16.8842.7038.1110.3910.3580.61648.960.101 Ensemble 20.9845.8141.1710.1720.3470.59550.110.097 CREMA-D (en, 6) Q2A Direct 61.8865.3449.034.1120.1960.44369.430.054 +A1.991.7616.431.0920.2530.55460.120.058 Ensemble 61.0068.2559.165.4820.1750.38975.810.045 AF3 Direct40.5741.1939.226.3720.2750.55759.710.088 +A48.5954.1453.346.0240.2390.50165.670.073 Ensemble54.0663.9857.606.6520.2090.44671.340.064 EmotionTalk (zh, 7) Q2A Direct33.1040.1736.245.8420.3770.66747.910.105 +A8.9513.3621.513.1910.4280.77041.550.111 Ensemble36.2347.1147.336.6070.3230.58355.130.091 AF3 Direct 23.8031.4131.993.5990.3690.66149.190.100 +A 31.1554.3951.847.3520.3130.53556.320.105 Ensemble36.0953.4254.075.8070.2820.49960.370.087 IEMOCAP (en, 4) Q2A Direct 62.8362.1060.032.9620.2340.46469.010.121 +A 1.301.3220.531.2860.3380.68054.530.158 Ensemble56.0059.2454.235.3690.2540.47263.660.152 AF3 Direct50.9555.3051.894.7160.2610.46562.140.160 +A49.1653.2551.154.7230.2650.48061.700.158 Ensemble 60.0761.8156.855.7000.2480.44964.800.155 MSP-Podcast 2 (test1, 8) Q2A Direct24.4334.7218.8510.3710.4210.71537.140.087 +A1.752.0619.281.7410.3610.69249.100.060 Ensemble 33.7646.9936.659.9590.3460.60250.520.072 AF3 Direct 23.0233.8034.3510.0070.3880.68144.510.098 +A24.1136.3633.337.5410.3590.65649.500.082 Ensemble30.1848.6141.6210.4150.3280.58753.440.078 MSP-Podcast 2 (test2, 8) Q2A Direct 24.5040.1620.997.7440.3610.65146.700.068 +A 1.332.175.841.5000.3340.64952.350.053 Ensemble29.1348.0443.689.2920.3130.55857.110.059 AF3 Direct17.9834.7639.808.2830.3210.59355.390.074 +A 18.8038.3441.056.5890.3010.57859.590.063 Ensemble 23.3848.6555.7010.0590.2840.52163.110.061 failures (as detailed in Section 3.2). As shown in Table 6, the ensemble method effectively neutralizes the formatting failures of individual prompts and consistently outperforms the base- line Direct prompt. Not only does it recover and improve hard- decision metrics (e.g., boosting Q2A’s Macro-F1 from 61.88% to 68.25% on CREMA-D), but it also yields significantly better subjectivity alignment. Across most datasets, the Ensemble ap- proach achieves higher SIM and lower JSD/TVD. This demon- strates that, rather than being a flaw, the stochasticity of diverse prompt responses can be aggregated to effectively approximate the uncertainty and subjectivity inherent in human emotion per- ception. 4.3.3. Qwen2-Audio vs. Audio Flamingo 3 Across the majority of the benchmarked corpora, Q2A consis- tently demonstrates superior alignment with human soft labels compared to AF3. This is particularly evident in the CREMA- D and IEMOCAP datasets, where Q2A achieves significantly higher Cosine Similarity (↑) and lower MSE (↓). These results suggest that Q2A’s underlying representation is more effective at mimicking human annotators’ consensus in structured envi- ronments. However, one can observe a performance trade-off in natu- ralistic datasets such as EmotionTalk and BIIC-Podcast, where AF3 occasionally yields lower KLD and JSD despite having lower accuracy. This discrepancy indicates that while Q2A may Table 7: Comparison of best zero-shot speech LLMs against re- cent fully supervised state-of-the-art benchmarks on Hard-label Macro-F1 (%). DatasetModel / FrameworkMa-F1 BIIC-Podcast (zh-tw, 8) [64]35.5 [6]28.3 This paper (Q2A-Ensemble)31.1 This paper (AF3-Ensemble)21.0 CREMA-D (en, 6) [21]71.0 [64]70.6 [6]67.6 This paper (Q2A-Direct)61.9 This paper (AF3-Ensemble)54.1 be more accurate in top-1 prediction, its probability mass is of- ten over-concentrated (over-confident), whereas AF3 maintains a smoother distribution that better reflects the inherent uncer- tainty in cross-lingual, spontaneous speech. 4.3.4. Capturing Affective Ambiguity and Subjectivity To compare the performance of generative speech LLMs within the broader SER landscape, Table 7 compares the best zero- shot configurations against recent fully supervised state-of-the- art (SOTA) benchmarks, including conventional Transformers [64], Self-Supervised Learning (SSL) representations [6], and distribution-aware supervised models [21]. As shown in Ta- ble 7, zero-shot speech LLMs’ performance generally trails be- 0 10 20 30 40 50 60 Avg Macro F1 (%) 38.1 15.6 3.4 7.8 26.0 Qwen2-Audio Direct+T+A+T+A+T+A+R 0 10 20 30 40 50 60 Avg Macro F1 (%) 26.8 20.4 31.4 18.5 14.9 AudioFlamingo-3 0 20 40 60 80 100 Avg Parse Fail Rate (%) 10.8 71.3 93.2 84.9 25.8 0 20 40 60 80 100 Avg Parse Fail Rate (%) 16.9 59.3 10.9 44.3 62.8 Figure 1: Impact of zero-shot prompt complexity on perfor- mance and format adherence. The blue bars (left axis) denote the Average Macro-F1 across five evaluation corpora, while the red line (right axis) illustrates the Average Parse Failure Rate. As prompt constraints increase (from direct classifica- tion Direct to requesting intermediate acoustic captions +T), both Q2A and AF3 struggle to follow the structured text-output format, leading to a severe degradation in overall predictive performance. This zero-shot stochasticity necessitates our pro- posed distribution-aware prompt ensemble methodology. hind task-specific supervised models in hard-label Macro-F1. However, this performance gap underscores a fundamental dif- ference in modeling paradigms rather than a mere capability deficit. Supervised baseline models are explicitly optimized via cross-entropy to collapse subjective annotations into hard la- bels, thereby maximizing categorical metrics such as F1. In contrast, zero-shot generative speech LLMs maintain a broader, untuned probability mass. This brings us to a key contribution of the paper: the quan- tification of affective ambiguity. In naturalistic datasets such as the MSP-Podcast series, which are characterized by high inter- rater variability, the relatively low TVD and JSD scores for both Q2A and AF3 (as seen in Table 6) suggests that speech LLMs possess a strong latent capacity to model emotional subjectivity. Unlike traditional closed-set classifiers, which discard human perceptual variance, generative speech LLMs naturally repre- sent competing emotional states in their output logits. The abil- ity of these models to maintain high SIM to human label dis- tributions without any dataset-specific fine-tuning demonstrates their exceptional potential for developing affect-aware systems that respect the diverse perceptions of multiple listeners. 4.4. Cross-Corpus Evaluation Table 8 reports cross-domain transfer results for both models using each dataset’s native label set. For Q2A, fine-tuning on a mismatched English source exceeds the best zero-shot prompt on 11 of the 12 source–target pairs. MELD is the most ef- fective source, with a mean Macro-F1 gain of 17.7 over zero- shot across three targets. On IEMOCAP, it reaches 81.6, and Table 8: Cross-scenario transfer (Macro-F1). Each cell shows the score when fine-tuning on the source (row) and evaluating on the target (column), with ∆ over best zero-shot in parenthe- ses. Diagonal (bold): in-domain SFT. MSP-Podcast uses test1. Source CREMA-DIEMOCAPMELDMSP-Podcast 2 Qwen2-Audio CREMA-D91.1 (+21.7) 79.9 (+20.5) 30.7 (+2.6)25.8 (+5.5) IEMOCAP76.8 (+7.4)82.4 (+23.0) 24.7 (-3.4)31.5 (+11.1) MELD86.6 (+17.1) 81.6 (+22.2) 39.3 (+11.2) 34.1 (+13.7) MSP-Podcast 2 78.6 (+9.1)68.1 (+8.7)30.8 (+2.8)29.2 (+8.8) AudioFlamingo-3 CREMA-D75.2 (+15.4) 65.1 (+1.1)21.6 (-8.2)22.9 (-5.7) IEMOCAP63.6 (+3.8)65.2 (+1.2)20.3 (-9.5)19.2 (-9.5) MELD63.3 (+3.5)62.0 (-2.0)25.7 (-4.1)23.5 (-5.2) MSP-Podcast 2 36.6 (-23.2)45.4 (-18.6)18.9 (-10.9)19.1 (-9.6) on CREMA-D, 86.6. MSP-Podcast, the other in-the-wild cor- pus, also improves all three targets but with a notably smaller mean gain. Conversely, MELD is also the hardest target: the sole degradation in the table occurs when transferring IEMO- CAP→MELD (−3.4), where IEMOCAP’s 4 categories cover only 4 of MELD’s 7, missing Disgust, Fear, and Surprise. Under the same LoRA configuration, AF3 exceeds zero- shot on only 3 of 12 pairs, all targeting acted corpora with modest gains. All six transfers to in-the-wild targets (MELD, MSP-Podcast) degrade, consistent with AF3’s in-domain SFT already falling below zero-shot on both corpora. MSP-Podcast is a particularly harmful source, accounting for the three largest drops in the table. As in Section 4.2, these results may partly re- flect LoRA hyperparameter mismatch rather than a fundamental model limitation. 5. Conclusion, Limitation and Future Work This paper introduces VoxEmo, an evaluation toolkit and bench- mark that standardises inference-time protocols for speech- LLM-based SER across 35 corpora in 15 languages. Zero- shot performance is highly sensitive to prompt design, and the acted–naturalistic split in corpus construction systematically shapes prompt effectiveness. Supervised fine-tuning narrows the gap with traditional baselines but does not close it in most cases, with effectiveness depending on dataset scale and the choice of foundation model. Nevertheless, even without task- specific training, zero-shot outputs capture affective ambiguity that aligns with human annotation distributions, and the genera- tive interface enables cross-domain transfer across mismatched label sets—though both benefits vary sharply between models. Several limitations should be noted. The benchmark eval- uates only two models that share the same audio encoder (Whisper-large-v3) and parameter scale; whether the observed patterns generalise to models at larger scales or with differ- ent audio front-ends, including recent tokenizer-based archi- tectures, remains open. Such tokenizer-based models cannot currently be served in the same inference environment, limit- ing the direct comparison. A single LoRA configuration is used throughout, and the limited SFT gains of AF3 may partly reflect hyperparameter mismatch. Soft-label evaluation is restricted to the five corpora that provide per-annotator metadata. Addition- ally, this work focuses on corpus-level aggregate metrics and does not examine within-dataset factors such as class imbal- ance, per-class performance, or speaker effects. Future work will extend the benchmark to architecturally diverse models, in- vestigate SFT hyperparameter sensitivity, and incorporate finer- grained per-dataset analyses. 6. Generative AI Use Disclosure Generative AI tools were employed solely to improve grammar, clarity, and readability. The authors are fully responsible for all technical content and results presented in this paper. These tools were not used to generate scientific ideas, analyses, or findings. 7. References [1] B. W. Schuller, “Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol. 61, no. 5, p. 90–99, Apr. 2018. [2] C.-C. Lee, T. Chaspari, E. M. Provost, and S. S. Narayanan, “An engineering view on emotions and speech: From analysis and pre- dictive models to responsible human-centered applications,” Pro- ceedings of the IEEE, vol. 111, no. 10, p. 1142–1158, 2023. [3] C. M. Lee and S. Narayanan, “Toward detecting emotions in spo- ken dialogs,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, p. 293–303, 2005. [4] P. Gupta and N. Rajput, “Two-stream emotion recognition for call center monitoring,” in Interspeech 2007, 2007, p. 2241–2244. [5] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large- Scale Self-Supervised Pre-Training for Full Stack Speech Pro- cessing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, p. 1505–1518, 2022. [6] H. Wu, H.-C. Chou, K.-W. Chang, L. Goncalves, J. Du, J.-S. R. Jang, C.-C. Lee, and H.-Y. Lee, “Open-Emotion: A Reproducible EMO-Superb For Speech Emotion Recognition Systems,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, p. 510–517. [7] J. Bellver, I. Mart ́ ın-Fern ́ andez, J. M. Bravo-Pacheco, S. Es- teban, F. Fern ́ andez-Mart ́ ınez, and L. F. D’Haro, “Multimodal Audio-Language Model for Speech Emotion Recognition,” in The Speaker and Language Recognition Workshop (Odyssey 2024), 2024, p. 288–295. [8] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-Audio Technical Report,” Jul. 2024. [9] J. Mai, X. Xing, W. Chen, Y. Fang, and X. Xu, “A-SLLM: An Acoustically Augmented Speech Large Language Model for Speech Emotion Recognition,” in Interspeech 2025. ISCA, Aug. 2025, p. 4328–4332. [10] W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y. Guo, and I. King, “Recent Advances in Speech Language Models: A Survey,” in Proceedings of the 63rd Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, p. 13 943–13 970. [11] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech Processing Universal PERformance Benchmark,” in Interspeech 2021. ISCA, Aug. 2021, p. 1194–1198. [12] L. Pepino, P. Riera, and L. Ferrer, “Emotion Recognition from Speech Using wav2vec 2.0 Embeddings,” in Interspeech 2021. ISCA, Aug. 2021, p. 3400–3404. [13] Z. Ma, M. Chen, H. Zhang, Z. Zheng, W. Chen, X. Li, J. Ye, X. Chen, and T. Hain, “EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark,” Jun. 2024. [14] J. Murzaku and O. Rambow. OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs. [Online]. Available:http: //arxiv.org/abs/2503.21480 [15] M. M. Amin and B. W. Schuller, “On prompt sensitivity of chatgpt in affective computing,” in 2024 12th International Conference on Affective Computing and Intelligent Interaction (ACII).IEEE, 2024, p. 203–209. [16] P. Larrouy-Maestri, D. Poeppel, and M. D. Pell, “The Sound of Emotional Prosody: Nearly 3 Decades of Research and Future Directions,” Perspectives on Psychological Science, vol. 20, no. 4, p. 623–638, Jul. 2025. [17] H.-C. Chou and C.-C. Lee, “Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition:Considering Subjectivity of Annotators and Ambiguity of Emotions,” 2025. [Online]. Available: https://arxiv.org/abs/2510.05934 [18] R. Lotfian and C. Busso, “Building Naturalistic Emotionally Bal- anced Speech Corpus by Retrieving Emotional Speech from Ex- isting Podcast Recordings,” IEEE Transactions on Affective Com- puting, vol. 10, no. 4, p. 471–483, Oct. 2019. [19] C. Busso, R. Lotfian, K. Sridhar, A. N. Salman, W.-C. Lin, L. Goncalves, S. Parthasarathy, A. R. Naini, S.-G. Leem, L. Martinez-Lucas, H.-C. Chou, and P. Mote, “The MSP-Podcast Corpus,” Sep. 2025. [20] Carlos Busso and S. S. Narayanan, “The expression and percep- tion of emotions: Comparing assessments of self versus others,” in Interspeech 2008. ISCA: ISCA, Sep. 2008, p. 257–260. [21] H.-C. Chou, L. Goncalves, S.-G. Leem, A. N. Salman, C.-C. Lee, and C. Busso, “Minority Views Matter: Evaluating Speech Emotion Classifiers With Human Subjective Annotations by an All-Inclusive Aggregation Rule,” IEEE Transactions on Affective Computing, vol. 16, no. 1, p. 41–55, Jan. 2025. [22] H.-C. Chou, H. Wu, L. Goncalves, S.-G. Leem, A. Salman, C. Busso, H.-Y. Lee, and C.-C. Lee, “Embracing Ambiguity And Subjectivity Using The All-Inclusive Aggregation Rule For Eval- uating Multi-Label Speech Emotion Recognition Systems,” in 2024 IEEE Spoken Language Technology Workshop (SLT), Dec. 2024, p. 502–509. [23] R. Milner, M. A. Jalal, R. W. Ng, and T. Hain, “A cross- corpus study on speech emotion recognition,” in 2019 IEEE Auto- matic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, p. 304–311. [24] O. C. Phukan, M. M. Akhtar, S. R. Behera, P. Bhagath, P. B. Reddy, A. B. Buduru et al., “Rethinking cross-corpus speech emo- tion recognition benchmarking: Are paralinguistic pre-trained representations sufficient?” in 2025 Asia Pacific Signal and In- formation Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2025, p. 1022–1027. [25] H. Sun, X. Wang, J. Zhao, S. Zhao, J. Zhou, H. Wang, J. He, A. Kong, X. Yang, Y. Wang, Y. Lin, and Y. Qin, “EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations,” Jun. 2025. [26] S. G. Upadhyay, W.-S. Chien, B.-H. Su, L. Goncalves, Y.-T. Wu, A. N. Salman, C. Busso, and C.-C. Lee, “An Intelligent Infrastruc- ture Toward Large Scale Naturalistic Affective Speech Corpora Collection,” in 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), 2023, p. 1–8. [27] J. Zhao, T. Zhang, J. Hu, Y. Liu, Q. Jin, X. Wang, and H. Li, “M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds.Dublin, Ireland:Association for Computational Linguistics, May 2022, p. 5699–5710. [Online]. Available: https://aclanthology.org/2022.acl-long.391/ [28] T. Landry Dejoli, Q. He, H. Yan, and Y. Li, “ASVP-ESD: A dataset and its benchmark for emotion recognition using both speech and non-speech utterances,” in Proc. GSJ, 2020. [29] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations,” Jun. 2019. [30] S. Latif, A. Qayyum, M. Usman, and J. Qadir, “Cross lingual speech emotion recognition: Urdu vs. western languages,” in Proc. FIT, 2018. [31] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in Inter- speech 2005. ISCA, Sep. 2005, p. 1517–1520. [32] F. Burkhardt, O. Schr ̈ ufer, U. Reichel, H. Wierstorf, A. Derington, F. Eyben, and B. W. Schuller, “EmoDB 2.0: A Database of Emo- tional Speech in a World that is not Black or White but Grey,” in Interspeech 2025. ISCA, Aug. 2025, p. 4488–4492. [33] J. Wongpithayadisai,C. Chaksangchaichot,S. Sangnark, P. Prakrankamanant, K. Gangwanpongpun, S. Boonpunmongkol, P. Milindasuta, D. Na-Pombejra, S. Nutanong, and E. Chuang- suwanich, “THAI Speech Emotion Recognition (THAI-SER) cor- pus,” Jul. 2025. [34] I. Christop, “nEMO: Dataset of emotional speech in Polish,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds.Torino, Italia: ELRA and ICCL, May 2024, p. 12 111–12 116. [Online]. Available: https://aclanthology.org/2024.lrec-main.1059/ [35] E. A. Retta, E. Almekhlafi, R. Sutcliffe, M. Mhamed, H. Ali, and J. Feng, “A new Amharic speech emotion dataset and classifica- tion benchmark,” in Proc. TALLIP, 2023. [36] K. A. Noriy, X. Yang, and J. J. Zhang, “EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels,” May 2023. [37] F. Catania, “Speech emotion recognition in Italian using Wav2Vec 2,” in Authorea Preprints, 2023. [38] I. Lubenets, N. Davidchuk, and A. Amentes, “Aniemore.” [Online]. Available: https://github.com/aniemore/Aniemore [39] K. Zhou, B. Sisman, R. Liu, and H. Li, “Seen and unseen emo- tional style transfer for voice conversion with a new emotional speech dataset,” in Proc. ICASSP, 2021. [40] M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate, “The Mexican Emotional Speech Database (MESD): Elaboration and assessment based on machine learning,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Nov. 2021, p. 1644–1647. [41] K. Wang, Q. Wu, L. Song, Z. Yang, W. Wu, C. Qian, R. He, Y. Qiao, and C. C. Loy, “MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation,” in Computer Vi- sion – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds.Cham: Springer International Publishing, 2020, vol. 12366, p. 700–717. [42] L. KERKENI, C. CLEDER, S.-R. Youssef, and K. RAOOF, “French emotional speech database-or ́ eau,” 2020. [43] M. Miesikowska and D. Swisulski, “Emotions in Polish speech recordings,” https://doi.org/10.34808/h46c-hb44, 2020. [44] S. F. Canpolat, Z. Ormano ̆ glu, and D. Zeyrek, “Turkish Emo- tion Voice Database (TurEV-DB),” in Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under- resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), D. Beermann, L. Be- sacier, S. Sakti, and C. Soria, Eds.Marseille, France: European Language Resources association, May 2020, p. 368–375. [45] O. Mohamad Nezami, P. Jamshid Lou, and M. Karami, “ShEMO: A large-scale validated database for Persian speech emotion de- tection,” Language Resources and Evaluation, vol. 53, no. 1, p. 1–16, Mar. 2019. [46] S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal, “SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla,” PLOS ONE, vol. 16, no. 4, p. e0250173, Apr. 2021. [47] N. Vrysas, R. Kotsakis, A. Liatsou, C. Dimoulas, and G. Kalliris, “Speech Emotion Recognition for Performance Interaction,” Journal of the Audio Engineering Society, vol. 66, no. 6, p. 457– 467, Jun. 2018. [48] P. Gournay, O. Lahaie, and R. Lefebvre, “A canadian french emo- tional speech dataset,” in Proceedings of the 9th ACM Multimedia Systems Conference. Amsterdam Netherlands: ACM, Jun. 2018, p. 399–402. [49] A. Adigwe, N. Tits, K. E. Haddad, S. Ostadabbas, and T. Dutoit, “The Emotional Voices Database: Towards Controlling the Emo- tion Dimension in Voice Generation Systems,” Jun. 2018. [50] J. James, L. Tian, and C. Inez Watson, “An Open Source Emo- tional Speech Corpus for Human Robot Interaction Applications,” in Interspeech 2018. ISCA, Sep. 2018, p. 2768–2772. [51] S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dy- namic, multimodal set of facial and vocal expressions in North American English,” PLOS ONE, vol. 13, no. 5, p. e0196391, May 2018. [52] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-Sourced Emotional Multi- modal Actors Dataset,” IEEE Transactions on Affective Comput- ing, vol. 5, no. 4, p. 377–390, Oct. 2014. [53] G. Costantini, I. Iadarola, A. Paoloni, and M. Todisco, “EMOVO Corpus: An Italian Emotional Speech Database,” 2014. [54] P. Jackson and S. Haq, “Surrey audio-visual expressed emotion (SAVEE) database.” University of Surrey, 2014. [55] I. Steiner, M. Schr ̈ oder, and A. Klepp, “The PAVOQUE corpus as a resource for analysis and synthesis of expressive speech,” in Proc. Phonetik & Phonologie, 2013. [56] K. Dupuis and M. K. Pichora-Fuller, “Toronto emotional speech set (TESS) - younger talker happy.” University of Toronto, 2010. [57] J. Tao, F. Liu, M. Zhang, and H. Jia, “Design of speech corpus for Mandarin text to speech,” in The Blizzard Challenge Workshop, 2008. [58] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The eNTERFACE’ 05 Audio-Visual Emotion Database,” in 22nd International Con- ference on Data Engineering Workshops (ICDEW’06), Apr. 2006, p. 8–8. [59] A. S. Cowen and D. Keltner, “Semantic Space Theory: A Computational Approach to Emotion,” Trends in cognitive sciences, vol. 25, no. 2, p. 124—136, February 2021. [Online]. Available: https://doi.org/10.1016/j.tics.2020.11.004 [60] A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models,” Jul. 2025. [61] W. Zhang, H. Jin, S. Wang, Z. Wei, and T. Dang, “Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models,” in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, p. 1–5. [62] B.-H. Su, H.-Y. Shih, J. Tian, J. Shi, C.-C. Lee, C. Busso, and S. Watanabe, “Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition,” arXiv preprint arXiv:2509.24187, 2025. [63] E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9 [64] J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the trans- former era in speech emotion recognition: Closing the valence gap,” IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 45, no. 9, p. 10 745–10 759, 2023.