Paper deep dive

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Hong Jia, Ting Dang

Year: 2026Venue: arXiv preprintArea: cs.SDType: PreprintEmbeddings: 28

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:41:50 AM

Summary

This paper introduces a systematic study of ambiguity-aware reasoning in Large Audio-Language Models (LALMs) for speech emotion recognition. The authors reformulate emotion recognition as a distributional reasoning problem, proposing a framework that integrates an ambiguity-aware objective (using KL divergence) and structured chain-of-thought (CoT) supervision. The approach is evaluated across SFT, DPO, and GRPO training strategies on the IEMOCAP and CREMA-D datasets, demonstrating improved performance in preserving affective uncertainty and reasoning over emotional cues.

Entities (7)

CREMA-D · dataset · 100%IEMOCAP · dataset · 100%DPO · training-strategy · 95%GRPO · training-strategy · 95%KL Divergence · objective-function · 95%LALM · model-architecture · 95%SFT · training-strategy · 95%

Relation Signals (3)

KL Divergence → regularizes → LALM

confidence 95% · ambiguity-aware objective that aligns predictions with human perceptual distributions

SFT → improvesperformanceon → IEMOCAP

confidence 90% · Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO

DPO → improvesperformanceon → CREMA-D

confidence 90% · DPO performs best on CREMA-D

Cypher Suggestions (2)

Identify datasets evaluated by the proposed framework · confidence 95% · unvalidated

MATCH (d:Dataset)<-[:EVALUATED_ON]-(p:Paper {id: '1e50474e-13f5-4686-a2cb-dc000dc3d312'}) RETURN d.name

Find all training strategies used in the paper · confidence 90% · unvalidated

MATCH (s:Strategy)-[:USED_IN]->(p:Paper {id: '1e50474e-13f5-4686-a2cb-dc000dc3d312'}) RETURN s.name

Abstract

Abstract:Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Full Text

27,920 characters extracted from source content.

Expand or collapse full text

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction Xiaofeng Yu 1,∗ , Jiaheng Dong 2,∗ , Jean Honorio 2 , Abhirup Ghosh 3 , Hong Jia 1 , Ting Dang 2 1 University of Auckland, New Zealand 2 The University of Melbourne, Australia 3 University of Birmingham, United Kingdom xyu724@aucklanduni.ac.nz, jiahengd@student.unimelb.edu.au Abstract Speech emotion recognition plays an important role in vari- ous applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambigu- ous nature of human emotional expression. Recent large au- dio–language models show promise in generating richer out- puts, but their reasoning ability for ambiguous emotional un- derstanding remains limited. In this work, we reformulate am- biguous emotion recognition as a distributional reasoning prob- lem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two comple- mentary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a struc- tured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies. Index Terms: speech recognition, human-computer interac- tion, computational paralinguistics 1. Introduction Speech emotion recognition (SER) plays an important role in human–computer interaction [1, 2, 3], conversational agents [4], and mental health applications [5]. Despite recent advances, most SER systems are trained to predict a single discrete emo- tion category, which oversimplifies the complexity of human emotional expression. In practice, emotions are often ambigu- ous and mixed rather than belonging to a single class. This discrepancy between computational modeling and human emo- tional perception motivates the need for more expressive and uncertainty-aware emotion understanding paradigms. Recent large audio–language models (LALMs) have demonstrated promising capabilities in generating richer tex- tual outputs beyond single-label classification [6, 7]. Prior work shows that LALMs can capture multiple emotional cues from speech and generate distributional emotion predictions [6]. However, their reasoning ability under highly ambiguous emo- tional conditions remains limited. While humans naturally rea- son under emotion ambiguity by weighing multiple cues and forming probabilistic judgements [8, 9], current LALMs strug- gle to emulate. Enhancing reasoning in LALMs has attracted increas- ing attention, particularly through chain-of-thought (CoT) prompting [10] and reinforcement learning (RL)-based post- training [11, 12]. For example, supervised fine-tuning (SFT) with CoT supervision [13], or preference optimization meth- ods such as Direct Preference Optimization (DPO) [14] and RL * These authors contributed equally. methods of Group Relative Policy Optimization (GRPO) [15] show great promises. However, these studies mainly focus on deterministic speech understanding tasks such as audio question answering (AudioQA) that selects a single correct answer. In contrast, ambiguous emotion reasoning is inherently distribu- tional: multiple emotional interpretations may simultaneously be plausible, and can be represented as a soft label or proba- bility distribution over emotion classes (e.g., 40% happy, 60% surprised). Therefore, the challenge is to improve reasoning quality and also avoid premature collapse to a single determin- istic emotional interpretation when uncertainty is present. To address this gap, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. An effective model must (i) preserve affective uncer- tainty at the decision level and (i) perform structured reasoning over emotional ambiguity, coherently integrating subtle and het- erogeneous emotional evidence. Our framework proposes two key components to tackle these: (i) an ambiguity-aware objec- tive that aligns predicted emotion distributions with human per- ceptual distributions via forward KL regularization to prevent affective collapse; and (i) structured ambiguity-aware CoT su- pervision to guide evidence of emotion ambiguity integration before prediction. The framework is plug-and-play and com- patible with different post-training strategies, evaluated across SFT, DPO, and GRPO. Our contributions are summarized: • We propose the first systematic study of ambiguity-aware reasoning in LALMs • We design two complementary objectives to support ambiguity-aware learning: an ambiguity-aware objective and structured ambiguity-aware CoT supervision. • We evaluate the proposed paradigm across multiple post- training strategies on IEMOCAP and CREMA-D dataset, demonstrating effectiveness of both objectives. By disentangling decision-level uncertainty modeling from reasoning enhancement, our work provides new insights into ambiguity-aware emotion understanding in LALMs. 2. Related Work Ambiguous Speech Emotion Recognition.Early studies on ambiguous emotion recognition include modelling emo- tions ambiguity using soft labels instead of categorical hard la- bels [16], capturing ambiguity by simulating multiple annota- tors with multiple classifiers [17], or handling ambiguous emo- tions as out-of-distribution cases [18]. With the emergence of LALMs, several studies investigates ambiguity-aware emotion recognition. For instance, [6] examines whether LALMs im- plicitly encode emotion ambiguity by analyzing token-level pre- arXiv:2603.08230v1 [cs.SD] 9 Mar 2026 diction distributions. Another study [7] proposes augmenting multi-annotator labels with synthetic annotations generated by LALMs. However, these approaches ignores improving the rea- soning processes of LALMs for ambiguous emotion perception. Reasoning Capabilities of LALMs.Recent work on im- proving reasoning in LALMs mainly falls into two categories: chain-of-thought (CoT) and reinforcement learning (RL) based approaches. CoT-based methods aim to teach step-by-step rea- soning. For example, Audio-CoT [10] explores prompt-based CoT elicitation, while Audio-Reasoner [13] further trains mod- els to follow structured reasoning routines before producing answers. However, these methods typically employ CoT pat- terns designed for deterministic tasks (e.g., AudioQA), where the goal is to produce a single correct answer, and do not ex- plicitly model emotional ambiguity. RL-based approaches instead optimize reasoning behav- iors through reward-driven learning. SARI [11] and Sound- Mind [12] improve reasoning capability using GRPO and RE- INFORCE++ [19] respectively.However, these reasoning frameworks are still tailored to certain tasks where the objective is to identify a single correct answer. It remains unclear what post-training objectives and strategies are essential for ambigu- ous emotion reasoning. 3. Methods 3.1. Problem Formulation We formulate emotion understanding as an ambiguity-aware reasoning task. For each speech utterance n, the input is a mul- timodal pair x n = A n ,T n , where A n denotes the acoustic signal and T n its transcript. To capture perceptual subjectiv- ity, each sample is annotated with (i) a ground-truth emotion distribution p GT n ∈ R C over C categories and (i) a reasoning trajectory Z GT n : p GT n,i = m n,i P C j=1 m n,j , Z GT n = (z (1) n ,...,z (L) n )(1) where m n,i denotes the number of annotators selecting class i, and z (ℓ) n represents an intermediate reasoning step. Unlike single-label classification, p GT n encodes collective emotion am- biguity as a soft distribution (e.g., 70% happy and 30% sur- prised). The trajectory Z GT n is a structured CoT that explains how reasoning over subtle acoustic and linguistic cues leads to the ambiguous emotion distribution, serving as an interpretable bridge between multimodal evidence and final predictions. The objective is to learn a LALM f θ that maps the in- put to a joint output space consisting of a predicted reason- ing trajectory ˆ Z n and a predicted emotion distribution ˆ p n : ( ˆ Z n , ˆ p n ) = f θ (A n ,T n ). The model is optimized via two com- plementary objectives. It minimizes (i) the divergence between the predicted and ground-truth distributions while (i) encour- aging the generated reasoning steps to follow the ground truth reasoning trajectory Z n . 3.2. CoT Curation via Ambiguity-aware Prompting To enable the LALM to reason over ambiguous emotional cues, we first curate a structured CoT dataset that provides explicit reasoning supervision signals for ambiguity emotion. For each inputx n =A n ,T n and its corresponding ground-truth distri- bution p GT n , we synthesize the reasoning trajectory Z GT n using a high-capacity, closed-source LALM (e.g., GPT-4o). The generation process follows a structured protocol to en- sure the trajectory remains grounded and logically consistent as Table 1: Synthesis for ambiguity-aware CoT Curation. Input: x n =A n ,T n , p GT n Critical Rules for Reasoning Steps: Keywords Extract salient acoustic and linguistic cues. Step 1: Text Analysis. Analyze the text strictly for semantic meaning and context. Identify potential ambiguity. Step 2: Audio Analysis. Describe the prosody using profes- sional terminology (Volume, Speed, Pitch, Tone). Highlight specific cues that support both the majority label AND the minority label (if any). Step 3: Synthesis. Synthesize the evidence to resolve the ambiguity. The reasoning must be strong enough that a reader would predict the Ground Truth labels just by read- ing your analysis. shown in Table 1. The generated trajectories are then automat- ically validated by checking whether the explanation is consis- tent with the given target distribution using the same closed- source model. 3.3. Ambiguity-aware Learning Objectives With the structured reasoning trajectory Z GT n and the perceptual distribution p GT n , we explore the optimization of the LALM for ambiguity-aware emotion understanding. Specifically, we pro- pose a Ambigutiy-Aware Objective designed to align the model outputs with human emotional uncertainty. A key advantage of our formulation is its “plug-and-play” compatibility; it can be seamlessly integrated into various training frameworks, includ- ing SFT, DPO, and GRPO. 3.3.1. Ambigutiy-Aware Objective The goal is to align the model predicted emotion distribution ˆ p n with the human perceptual distribution p GT n . We employ the Kullback-Leibler (KL) Divergence as the alignment criterion: L dist = D KL ( ˆ p n ∥ p GT n ) = C X i=1 p GT i log p GT i ˆp i ,(2) Rather than relying on standard textual generation to ob- tain ˆ p n , we read out token-level logits for emotion category names at the final step and applying a softmax over these logits to capture a more nuanced distribution of emotional states, fol- lowing [6]. This logit-based formulation lets the model express graded uncertainty over emotion classes, mirroring the multi- annotator soft labels. 3.3.2. Plug-and-Play Learning Paradigms Our distribution-aware objective is framework-agnostic, allow- ing for seamless integration into different training strategies. SFT. For learning using SFT, the model is trained to jointly maximize the likelihood of the reasoning tokens and minimize the distributional divergence. The total lossL SFT is a weighted combination: L SFT =L CE ( ˆ Z n , Z GT n ) + λL dist ( ˆ p n , p GT n ),(3) where L CE is the standard cross-entropy loss for the autore- gressive generation of the CoT trajectory. This objective jointly constrains the reasoning trajectory and the distribution. DPO. For DPO, we extend the standard framework with our ambiguity-aware objective. Instead of relying on static, offline LALM ... + SFT(Supervised Fine-Tuning) Ambigutiy-Aware Objective Text Embeddings Audio Embeddings Decoder layer LM head Softmax Targeted emotions AngrySadNeutra... logit14.819.508.56 ... ... <Keywords>resigned tone, low pitch, slower pace, negative phrasing...</keywords> [Step1: Text Analysis]The phrase \"It wouldn't work this way\" carries a negative... [Step2: Audio Analysis]The speaker's tone is low in pitch and the pace is relatively ... [Step3: Synthesis]Because the tone is low, the pace is slow, and the delivery feels... ... LALM <Keywords> ... [Step 1] ... [Step 2] ... [Step3] ... ... ... ... ... ... ... ... ... ... Update Structured Protocal Ambiguity-aware CoT curation DPO(Direct Preference Optimization) LALM NegativePositive Preference Pair + Update GRPO(Group Relative Policy Optimization) LALM Group of K+1 Rewards ... + + Update Group of K+1 rollouts ... ... Figure 1: Overview of the proposed ambiguity-aware objectives and Plug-and-play Learning Paradigms. preference pairs, we use an on-policy scheme where the cur- rent policy generates dynamic CoT rollouts ˆ Z n and their emo- tion distributions ˆ p n . Rollouts whose predicted distribution de- viates from the ground-truth distribution are treated as nega- tive samples, while the curated ground truth CoT serves as the positive sample. This is measured by Jensen-Shannon (JS) Di- vergence, JS( ˆ p n , p GT n ), which provides a bounded symmetric metric more stable for identifying “poor” reasoning paths. The DPO loss then encourages preference for trajectories that both follow the gold reasoning and better match the human percep- tual distribution, combined with the KL-based distribution loss and CoT cross-entropy as: L DPO-total =L DPO + λ 1 L dist ( ˆ p n , p GT n ) + λ 2 L CE ( ˆ Z n , Z GT n ), (4) L DPO =−E h logσ β log π θ (y pos |x n ) π ref (y pos |x n ) −log π θ (y neg |x n ) π ref (y neg |x n ) i , (5) where π θ denotes the current LALM, and π ref denotes the orig- inal LALM without any update. GRPO. We additionally study reinforcement learning based reasoning optimization using GRPO, and similarly augment the reward with ambiguity-awareness. At each update of the old policy π θ old , we sample K reason- ing trajectories ˆ Z n,k K k=1 ∼ π θ old (·|x n ) with corresponding emotion distributions ˆ p n,k K k=1 . Each trajectory is assigned a reward r n,k : r n,k =−L dist (p GT n , ˆ p n,k ) + λ 3 r format ( ˆ Z n,k ),(6) where the first term rewards accurate matching of the human perceptual distribution and the second term enforces adherence to the prescribed CoT format (defined similar as [12]). We then normalize rewards into advantages to stabilize training: ˆ A n,k = r n,k −μ n σ n , where μ n and σ n denote the mean and stan- dard deviation of advantages across the K sampled trajectories for utterance n. The policy is then optimized using a GRPO-style objective: J GRPO = E " 1 K K X k=1 1 |Z n,k | X t min ρ n,k,t (θ) ˆ A n,k ,(7) clip(ρ n,k,t (θ), 1− ε, 1 + ε) ˆ A n,k − βKL(π θ ||π ref ) # . where ρ n,k,t (θ) denotes the token-level policy probability ra- tio. The objective increases the likelihood of high-quality rea- soning trajectories while suppressing low-quality ones. As GRPO typically relies on sampled trajectories without direct reference to ground-truth reasoning paths, we further en- hance advantage estimation by incorporating the ground-truth trajectory Z GT n as an additional reference sample when advan- tages, referred to as GRPO z . This ensures that the ground- truth reasoning trajectory consistently receives the highest re- ward, thereby biasing optimization toward more faithful rea- soning paths under ambiguity. 4. Experimental Setup Datasets.We evaluate our training paradigm on IEMO- CAP [20] and CREMA-D [21]. IEMOCAP contains four emo- tion categories: Anger, Happiness, Sadness, and Neutral, with “Excited” merged into Happiness, and is evaluated using strict 5-fold leave-one-session-out cross-validation. CREMA-D [21] includes six categories, adding Disgust and Fear. We aggregate annotator votes (i.e., 3 annotators in IEMOCAP and 4-12 an- notators in CREMA-D) and normalize them to construct soft ground-truth labels. Implementation Details. We use GPT-4o to construct struc- tured CoT trajectories due to its strong capability in generating coherent reasoning chains that connect input observations with target outputs [13]. Experiments with post-training are con- ducted using Qwen2-Audio-7B-Instruct 1 as the base LALM. For all the learning paradigms, we apply LoRA (r = 8, α lora = 16) to the attention and feed-forward modules. All trainable pa- rameters are optimized using AdamW with optimized learning rates of 1e-4 (SFT), 5e-6 (DPO), and 2e-5 (GRPO). A 3% linear warmup and cosine decay are applied throughout training. The corresponding loss weights are λ = 1,λ 1 = 0.1, λ 2 = 0.1, and λ 3 = 0.1 for regularization during DPO optimization. We com- pare against the 1) Base Model and 2) Audio-Reasoner [13]. Evaluation Metrics.To assess how well the predicted emo- tion distribution matches the ground-truth distribution, we re- port Jensen–Shannon divergence (JS↓) and the Bhattacharyya 1 https://huggingface.co/Qwen/ Qwen2-Audio-7B-Instruct Table 2: Comparison for ambiguous emotion recognition per- formance. The best results are in bold. GRPO z indicates ground-truth CoT trajectory injection in GRPO. DatasetMethodsJS↓ BC↑ R 2 ↑ Brier↓ IEMOCAP Base model0.40 0.64 0.510.15 Audio-Reasoner 0.36 0.67 0.520.15 SFT0.29 0.73 0.600.12 DPO0.23 0.79 0.650.09 GRPO0.36 0.67 0.550.13 GRPO z 0.20 0.82 0.67 0.07 CREMA-D Base model0.25 0.78 0.540.05 Audio-Reasoner 0.37 0.69 0.510.09 SFT0.21 0.82 0.570.04 DPO0.17 0.86 0.67 0.03 GRPO0.25 0.77 0.460.04 GRPO z 0.20 0.82 0.680.04 Coefficient (BC↑) [22], which measure distributional discrep- ancy and probability mass overlap, assessing if the model pre- serves uncertainty rather than overconfident predictions. 5. Results 5.1. Performance of Post-Training Strategies We first evaluate the effectiveness of the proposed ambiguity- aware training paradigm across different post-training strategies in Table 2. Across both datasets, applying the proposed objec- tives consistently improves performance over the Base model under SFT, DPO, and GRPO Z . Among them, SFT consistently performs worse than both GRPO Z and DPO, suggesting that ambiguity-aware emotion modelling benefits from learning over multiple reasoning tra- jectories rather than relying on a single supervised reasoning path. Notably, GRPO Z achieves the strongest performances on IEMOCAP across all metrics, whereas DPO performs best on CREMA-D. This difference suggests that the effectiveness of post-training strategies depends on the structural complexity of the emotion distribution space. As the number of emotion classes increases in CREMA-D, the predicted distribution be- comes more sensitive to subtle variations in intermediate rea- soning steps. In such higher-dimensional ambiguity settings, the outcome-level reward signal used in GRPO Z relies largely on the divergence between the predicted and target emotion distributions at the final decision stage , which may become less precise for guiding fine-grained reasoning. In contrast, preference-based methods such as DPO provide denser super- vision by directly contrasting positive and negative reasoning chains at the token level, leading to stronger intermediate tra- jectory supervision toward the target distribution. 5.2. Impact of KL divergence We analyze the impact of KL-based distribution supervision by comparing variants trained with cross-entropy (CE) only against those incorporating KL divergence for emotion distribution su- pervision in Figure 2. Across both datasets, incorporating KL supervision consis- tently improves distribution-level metrics compared to CE-only training. On IEMOCAP, adding KL regularization reduces JS divergence and increases BC for both SFT and DPO variants relative to their CE-only counterparts. Similar improvements JSBCR²Brier 0.0 0.5 Score (a) SFT IEMOCAP JSBCR²Brier 0.0 0.5 Score (b) SFT CREMA-D JSBCR²Brier 0.0 0.5 Score (c) DPO IEMOCAP JSBCR²Brier 0.0 0.5 Score (d) DPO CREMA-D w/o KL supervisionw/ KL supervision Figure 2: Comparison between CE-only (w/o KL supervision) and proposed additional KL supervision. (a) SFT on IEMO- CAP, (b) SFT on CREMA-D, (c) DPO on IEMOCAP, and (d) DPO on CREMA-D. Table 3:Ablation of SFT training objectives.Models are trained on CREMA-D and evaluated under in-domain (CREMA-D) and cross-domain (IEMOCAP) settings. SettingCoTKLJS↓BC↑ R 2 ↑Brier↓ In-domain✓0.210.820.570.04 In-domain✗✓ 0.220.810.570.04 Cross-domain✓0.380.650.500.15 Cross-domain✗✓0.520.520.310.22 are observed on CREMA-D. These consistent gains indicate that while CE-based training encourages the model to reproduce the curated reasoning trajectory and final prediction tokens, it performs weak supervision to explicitly constrain the predicted probability mass to match the soft-label emotion distribution. In contrast, KL divergence directly penalizes distribution mis- matches, leading to improved consistency with ambiguity. 5.3. Impact of ambiguity-aware CoT supervision We analyze the impact of ambiguity-aware CoT supervision in Table 3. Comparing SFT trained with and without CoT super- vision reveals different behaviors under in-domain and cross- domain evaluation. When training and testing on CREMA- D, incorporating CoT supervision yields marginal performance gain comparing to KL-only training. However, when training on CREMA-D and testing on IEMOCAP, models trained with CoT supervision significantly outperform those trained without. This observation suggests that optimizing only the final emo- tion distribution through KL regularization without CoT can easily overfit to dataset-specific distribution patterns. In con- trast, ambiguity-aware CoT supervision encourages the model to reason over multimodal emotional cues, leading to stronger generalization capability. 6. Conclusions This work presents the first systematic study of ambiguity- aware reasoning in large audio–language models for ambigu- ous speech emotion recognition.We reformulate ambigu- ous emotion understanding as a distributional reasoning prob- lem and propose a plug-and-play framework that combines an ambiguity-aware objective with structured chain-of-thought su- pervision. Different post-training strategies, such as SFT, DPO, and GRPO, using our training paradigm consistently improve performance on both IEMOCAP and CREMA-D. 7. Generative AI Use Disclosure Generative AI tools were used solely for minor language edit- ing and polishing to improve the clarity and readability of the manuscript. These tools were not used to generate scientific content, analyze experimental results, summarize related work, develop methodologies, or propose research ideas. All concep- tual contributions, experimental design, analysis, and conclu- sions presented in this paper were developed and verified by the authors. The authors take full responsibility for the content of this manuscript. 8. References [1] L. Cai, J. Dong, and M. Wei, “Multi-modal emotion recogni- tion from speech and facial expression based on deep learning,” in 2020 Chinese automation congress (CAC).IEEE, 2020, p. 5726–5729. [2] G. Wadley, V. Kostakos, P. Koval, W. Smith, S. Webber, A. Cox, J. J. Gross, K. H ̈ o ̈ ok, R. Mandryk, and P. Slov ́ ak, “The future of emotion in human-computer interaction,” in CHI Conference on human factors in computing systems extended abstracts, 2022, p. 1–6. [3] C. Singla, S. Singh, P. Sharma, N. Mittal, and F. Gared, “Emo- tion recognition for human–computer interaction using high-level descriptors,” Scientific reports, vol. 14, no. 1, p. 12122, 2024. [4] E. K. Zadeh and M. Alaeifard, “Adaptive virtual assistant interac- tion through real-time speech emotion analysis using hybrid deep learning models and contextual awareness,” International Journal of Advanced Human Computer Interaction, vol. 1, no. 1, p. 1–15, 2023. [5] A. Sharma, K. Sharma, and A. Kumar, “Real-time emotional health detection using fine-tuned transfer networks with mul- timodal fusion,” Neural computing and applications, vol. 35, no. 31, p. 22 935–22 948, 2023. [6] J. V. Halim, S. Wang, H. Jia, and T. Dang, “Token-level logits matter: A closer look at speech foundation models for ambiguous emotion recognition,” INTERSPEECH, 2025. [7] W. Zhang, H. Jin, S. Wang, Z. Wei, and T. Dang, “Scaling ambigu- ity: Augmenting human annotation in speech emotion recognition with audio-language models,” arXiv preprint arXiv:2601.14620, 2026. [8] K. R. Scherer, “Vocal communication of emotion: A review of research paradigms,” Speech communication, vol. 40, no. 1-2, p. 227–256, 2003. [9] A. Schirmer and R. Adolphs, “Emotion perception from face, voice, and touch: comparisons and convergence,” Trends in cog- nitive sciences, vol. 21, no. 3, p. 216–228, 2017. [10] Z. Ma, Z. Chen, Y. Wang, E. S. Chng, and X. Chen, “Audio- cot: Exploring chain-of-thought reasoning in large audio language model,” arXiv preprint arXiv:2501.07246, 2025. [11] C. Wen, T. Guo, S. Zhao, W. Zou, and X. Li, “Sari: Structured audio reasoning via curriculum-guided reinforcement learning,” arXiv preprint arXiv:2504.15900, 2025. [12] X. Diao, C. Zhang, K. Kong, W. Wu, C. Ma, Z. Ouyang, P. Qing, S. Vosoughi, and J. Gui, “Soundmind: Rl-incentivized logic rea- soning for audio-language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, 2025, p. 528–540. [13] X. Zhifei, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio- reasoner: Improving reasoning capability in large audio language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, p. 23 840– 23 862. [14] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in neural informa- tion processing systems, vol. 36, p. 53 728–53 741, 2023. [15] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024,” URL https://arxiv. org/abs/2402.03300, vol. 2, no. 3, p. 5, 2024. [16] E. Mower, A. Metallinou, C.-C. Lee, A. Kazemzadeh, C. Busso, S. Lee, and S. Narayanan, “Interpreting ambiguous emotional expressions,” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops.IEEE, 2009, p. 1–8. [17] Y. Zhou, X. Liang, Y. Gu, Y. Yin, and L. Yao, “Multi-classifier interactive learning for ambiguous speech emotion recognition,” IEEE/ACM transactions on audio, speech, and language process- ing, vol. 30, p. 695–705, 2022. [18] W. Wu, B. Li, C. Zhang, C.-C. Chiu, Q. Li, J. Bai, T. Sainath, and P. Woodland, “Handling ambiguity in emotion: From out- of-domain detection to distribution estimation,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, p. 2078–2093. [19] J. Hu, J. K. Liu, H. Xu, and W. Shen, “Reinforce++: Stabilizing critic-free policy optimization with global advantage normaliza- tion,” arXiv preprint arXiv:2501.03262, 2025. [20] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, p. 335–359, 2008. [21] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE transactions on affective computing, vol. 5, no. 4, p. 377–390, 2014. [22] X. Hong, Y. Gong, V. Sethu, and T. Dang, “Aer-llm: Ambiguity- aware emotion recognition leveraging large language models,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, p. 1–5.