Paper deep dive
StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation
Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%
Last extracted: 3/13/2026, 12:16:14 AM
Summary
StreamVoiceAnon+ is a streaming speaker anonymization framework that addresses emotion degradation by using supervised finetuning with neutral-emotion utterance pairs and frame-level emotion distillation on acoustic token hidden states. It achieves a 49.2% UAR and 5.77% WER, significantly outperforming baseline streaming methods while maintaining low latency and strong privacy.
Entities (5)
Relation Signals (3)
StreamVoiceAnon+ → evaluatedon → VoicePrivacy 2024
confidence 100% · On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR
StreamVoiceAnon+ → trainedon → CREMA-D
confidence 100% · We finetune the pretrained open-source StreamVoiceAnon model on CREMA-D neutral-emotion pairs
StreamVoiceAnon+ → uses → Frame-level emotion distillation
confidence 100% · We propose supervised finetuning... combined with frame-level emotion distillation on acoustic token hidden states.
Cypher Suggestions (2)
List all datasets used for training or evaluation · confidence 95% · unvalidated
MATCH (m:Method)-[:TRAINED_ON|EVALUATED_ON]->(d:Dataset) RETURN m.name, d.name
Find all methods that use a specific distillation technique · confidence 90% · unvalidated
MATCH (m:Method)-[:USES]->(t:Technique {name: 'Frame-level emotion distillation'}) RETURN m.name
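If the extracted graph lives in a Neo4j instance, the suggested queries can be run with the official Python driver. A minimal sketch, assuming a local Bolt endpoint and placeholder credentials (the connection details and result shape below are assumptions, not part of the source):

```python
from neo4j import GraphDatabase

# Placeholder connection details; point these at the actual graph instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DATASET_QUERY = (
    "MATCH (m:Method)-[:TRAINED_ON|EVALUATED_ON]->(d:Dataset) "
    "RETURN m.name AS method, d.name AS dataset"
)

with driver.session() as session:
    for record in session.run(DATASET_QUERY):
        # Per the relation signals above, rows like
        # (StreamVoiceAnon+, CREMA-D) and (StreamVoiceAnon+, VoicePrivacy 2024)
        # would be expected for this paper.
        print(record["method"], "->", record["dataset"])

driver.close()
```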
Abstract
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% → 49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available at https://anonymous3842031239.github.io.
Tags
Links
- Source: https://arxiv.org/abs/2603.06079v1
- Canonical: https://arxiv.org/abs/2603.06079v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
29,199 characters extracted from source content.
StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

Nikita Kuzmin 1,2, Kong Aik Lee 3, Eng Siong Chng 1
1 Nanyang Technological University, Singapore; 2 Institute for Infocomm Research (I2R), A*STAR, Singapore; 3 The Hong Kong Polytechnic University, Hong Kong
s220028@e.ntu.edu.sg

Abstract
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% → 49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%).¹

¹ Demo and code: https://anonymous3842031239.github.io

Index Terms: speaker anonymization, emotion preservation, streaming speech processing, knowledge distillation

1. Introduction

Speaker anonymization (SA) aims to transform an input speech (source) to conceal speaker identity while preserving linguistic content and paralinguistic attributes such as emotion [1, 2, 3]. Real-time SA is essential for privacy-preserving applications including teleconferencing, call centers, voice assistants, and online mental health counseling [4], where low latency is critical. While privacy and intelligibility have been the primary focus of SA research, emotion preservation is equally important for natural communication. Losing emotional cues can significantly degrade user experience and communication effectiveness in real-world deployments [5].

Recent advances in neural audio codec (NAC) language models have enabled streaming SA with competitive privacy-intelligibility trade-offs [6, 7], unlike offline approaches that require full utterance context [8, 9]. These models process speech as interleaved content and acoustic token sequences, allowing autoregressive (AR) generation with controllable speaker identity. However, two fundamental limitations emerge for emotion preservation: (1) the audio continuation training paradigm encourages the model to degrade emotion preservation for the source utterance, and (2) the VQ bottleneck in neural codecs discards fine-grained acoustic details that carry emotional information [10, 11].

To mitigate this limitation, prior work [7] employed multiple emotion-diverse prompts at inference, which partially improves emotion preservation but significantly degrades intelligibility and requires emotion-labeled prompts that are harder to obtain compared to diverse neutral prompts commonly used for anonymization.

Figure 1: Privacy-emotion trade-off (UAR vs. EER) for streaming speaker anonymization methods. Our method (orange star) compared to prior streaming methods (triangles): SVA+EMO, SVA, GenVC-small, DarkStream, TVTSyn.
We propose a supervised finetuning (SFT) approach with frame-level emotion distillation (Figure 2) that addresses the root cause of emotion degradation. Our contributions are:
• We show that emotion degradation in NAC-based streaming SA is primarily a training paradigm problem, not a model capacity issue: restructuring training pairs yields 3× larger gains than adding emotional data alone (ablation: Exp1 vs Exp2).
• We apply frame-level emotion distillation to acoustic token hidden states, isolating emotion learning from content supervision (next-token prediction) to avoid gradient competition. Ablation confirms this design choice: acoustic-branch distillation outperforms semantic-branch in both UAR and WER (Exp6 vs Exp7).
• Our approach achieves the highest reported emotion preservation among streaming SA methods (49.2% UAR, +24% relative over baseline) with zero inference overhead and improved privacy (Figure 1).

Figure 2: Training and inference configurations. Training: prompt and source share the same speaker but differ in emotion, forcing the model to generate emotional output from source content rather than copying prompt-specific patterns. Frame-level emotion distillation (L_emo, dashed) on Slow AR acoustic hidden states provides additional learning signal. Inference: a neutral utterance from the target anonymous speaker conceals source identity while the finetuned model preserves source emotion; no latency is added over the baseline. The transformer f_θ is depicted twice for visual clarity; weights are shared.

2. Related Work

Speaker Anonymization. Traditional SA methods modify speech features such as pitch and formants [12, 2] or replace speaker embeddings [13, 14, 15, 16]. Recent neural approaches use voice conversion [17, 8] or NAC-based language models [6, 7]. Most methods prioritize privacy-intelligibility trade-offs, with emotion preservation receiving less attention [11].

Streaming SA and Voice Conversion. Real-time SA and VC systems face a fundamental trade-off between latency and quality. StreamVoice [18] introduced NAC-based language models for streaming VC with interleaved semantic and acoustic token generation, achieving low latency through autoregressive processing. A key limitation identified in recent work [19] is the temporal mismatch between static speaker embeddings and dynamic content representations. While content evolves frame-by-frame, conventional systems inject speaker identity as a global vector, leading to over-smoothed timbre and reduced expressivity. Time-varying timbre (TVT) approaches address this by allowing speaker characteristics to vary synchronously with content through attention-based memory mechanisms.
Self-supervised approaches like GenVC [20] eliminate dependency on external speaker encoders through autoregressive generation. These architectural insights are complementary to our supervised finetuning approach, which addresses emotion degradation from a training paradigm perspective.

Emotion in SA. Speaker anonymization systems often degrade emotional content [21, 22, 23], with the primary cause identified as information loss in intermediate representations [11]. Several approaches address this through emotion embeddings [24] or prosody-aware anonymization [25]. EASY [8] introduced emotion-aware offline SA through factorized distillation, achieving strong emotion preservation (63.8% UAR) but requiring full utterance context. For streaming SA, StreamVoiceAnon [7] showed limited emotion preservation (39.7% UAR) with neutral prompts. Prior work [26] achieved low latency but weak privacy (31.4% EER), while DarkStream [27] operates at 200 ms latency but shows limited emotion preservation (34.7% UAR). Recent streaming VC work [19] demonstrates that architectural improvements can improve naturalness, but these systems still face emotion challenges where content bottlenecks and speaker replacement create additional pressure on paralinguistic retention.

Knowledge Distillation for Speech. Knowledge distillation has been widely applied in speech processing to transfer representations from large pretrained models to smaller task-specific models [28]. Self-supervised models capture emotion-relevant information in their hidden representations [29]; Emotion2Vec+ [30] provides frame-level emotion representations through pretraining on large-scale speech data. Unlike utterance-level emotion labels, these frame-level features capture fine-grained emotional dynamics at each time step, making them suitable for streaming applications. We leverage such features as distillation targets, carefully selecting the acoustic branch to avoid interference with content supervision on the semantic branch.

3. Proposed Method

3.1. Problem Statement

Streaming anonymization models generally struggle with emotion preservation due to information bottlenecks and training objectives that prioritize content and speaker conversion over paralinguistic attributes. In NAC-based approaches specifically, we identify two key issues: (1) the audio continuation training paradigm encourages the model to degrade emotion preservation for the source utterance, and (2) the discrete token representation loses fine-grained emotional prosody.

Per-emotion analysis reveals severe imbalance in the baseline: certain emotions are strongly over-represented while others are nearly absent (see Section 4.3 for details). This pattern persists even with neutral prompts, suggesting that the model develops an inherent bias toward dominant acoustic patterns during pretraining, likely due to imbalanced emotion distributions in the training data. Rather than preserving source emotion, the model defaults to a dominant acoustic style. The combination of this learned bias and VQ information loss may create systematic emotion degradation that simple finetuning on emotional data cannot resolve.

3.2. Supervised Finetuning with Neutral-Emotion Pairs

We construct training pairs from an emotional speech corpus where each pair contains a neutral and an emotional utterance from the same speaker. This design ensures:
• Anonymization still works because speaker embedding mixing is unchanged
• The model must generate emotional acoustic tokens from a neutral prompt
• Emotion information must come from source content features, not prompt acoustics
We also include neutral-to-neutral pairs to maintain balance and prevent the model from assuming all source utterances are emotional; a sketch of this pairing scheme follows below.
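A minimal sketch of how such pairs could be assembled from a labeled corpus (the function and field names are illustrative, not the authors' code; Section 4.1 gives the actual CREMA-D filtering and matching details):

```python
from itertools import product

def build_finetuning_pairs(utterances):
    """Build (neutral prompt, source) pairs per speaker.

    `utterances` is a list of dicts like
    {"speaker": "1001", "emotion": "happy", "path": "clip.wav"}.
    Every neutral clip of a speaker is paired with every emotional clip of
    the same speaker; neutral-to-neutral pairs are kept as well so the model
    does not assume every source utterance is emotional.
    """
    by_speaker = {}
    for utt in utterances:
        by_speaker.setdefault(utt["speaker"], []).append(utt)

    pairs = []
    for speaker, utts in by_speaker.items():
        neutral_prompts = [u for u in utts if u["emotion"] == "neutral"]
        sources = utts  # emotional and neutral sources alike
        for prompt, source in product(neutral_prompts, sources):
            if prompt["path"] != source["path"]:
                pairs.append({"prompt": prompt, "source": source})
    return pairs
```

Applied to the filtered CREMA-D subset described in Section 4.1, this many-to-many matching yields roughly 25,000 pairs.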
Separate learned separation tokens for the semantic and acoustic branches mark the boundary between prompt and source sequences in each token stream. This explicit boundary helps the model transition from prompt reproduction to source preservation, preventing it from copying prompt characteristics into the source utterance. These tokens are shown as Linguistic [SEP] and Acoustic [SEP] in Figure 2.

3.3. Frame-Level Emotion Distillation

We distill frame-level emotion features from a pretrained emotion extractor into the model's hidden states. An important design choice is where to apply distillation:
• Semantic branch: Already supervised via next-token prediction (LM loss); adding emotion creates gradient competition [31]
• Acoustic branch: No existing supervision; clean gradient flow for emotion learning

Our architecture (Figure 2) applies a shared causal transformer f_θ to acoustic hidden states h_t^acou from the Slow AR branch, producing predicted emotion embeddings p_t, supervised by L_emo against Emotion Encoder targets e_t:

    L_emo = (1/T) Σ_{t=1}^{T} ∥p_t − e_t∥²,  with  p_t = f_θ(h_t^acou)    (1)

where h_t^acou are acoustic hidden states, e_t are frame-level emotion representations from the Emotion Encoder, and p_t is the predicted emotion embedding. While the VQ bottleneck remains at the output stage, the distillation loss encourages richer emotion encoding in hidden states prior to quantization. The total loss combines language modeling [7] and distillation:

    L = L_LM + w · L_emo    (2)

where L_LM = L_slowAR + L_fastAR are next-token prediction losses on the Slow AR and Fast AR branches respectively, and w controls the distillation strength. Slow AR generates one token per time step along the time axis, while Fast AR autoregressively generates n codebook tokens (q_1 ... q_n) per time step using shared weights. At inference, f_θ and the Emotion Encoder are removed; the model operates with the same architecture and latency as the baseline.
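A minimal PyTorch-style sketch of the distillation head and the combined objective in Eqs. (1)-(2), assuming the Slow AR acoustic hidden states and the frame-level Emotion2Vec+ targets are already time-aligned; module names, the number of attention heads, and tensor shapes are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class EmotionDistillHead(nn.Module):
    """Causal transformer f_theta: acoustic hidden states -> predicted
    frame-level emotion embeddings p_t (2 layers, as in Section 4.1)."""

    def __init__(self, hidden_dim: int, emo_dim: int, num_layers: int = 2, nhead: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(hidden_dim, emo_dim)

    def forward(self, h_acou: torch.Tensor) -> torch.Tensor:
        # h_acou: (batch, T, hidden_dim); an upper-triangular attention mask
        # keeps the head causal, consistent with the streaming constraint.
        T = h_acou.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"), device=h_acou.device), diagonal=1)
        return self.proj(self.encoder(h_acou, mask=mask))

def total_loss(l_slow_ar, l_fast_ar, p_t, e_t, w=0.01):
    """L = L_LM + w * L_emo, with L_emo the frame-level MSE of Eq. (1)."""
    l_emo = ((p_t - e_t) ** 2).sum(dim=-1).mean()  # squared L2 per frame, averaged
    return (l_slow_ar + l_fast_ar) + w * l_emo
```

At inference the head is simply dropped, so the anonymization path is identical to the baseline and adds no latency.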
Table 1: Comparison with prior methods. ↑/↓: higher/lower is better. Lat.: algorithmic latency in ms. EER-L/EER-S: lazy-informed/semi-informed attacker. Semi: semi-streaming. Only methods with viable privacy (EER-L ≥ 40%) are included, following VoicePrivacy evaluation criteria. Our two variants correspond to Exp4 (pool-distill) and Exp7 (frame-distill) in the ablation study (Table 2).

Method                Type     Lat.  WER↓  UAR↑   EER-L↑  EER-S↑
Original              –        –     1.83  70.07  5.16    –
EASY [8]              Offline  –     2.70  63.81  –       45.89
GenVC-small [20]      Semi     –     8.20  34.23  48.48   15.94
DarkStream [27]       Online   200   8.75  34.73  47.26   21.83
TVTSyn [19]           Online   80    5.35  37.32  47.55   14.57
StreamVoiceAnon [7]
  vctk-1fix           Online   180   4.54  39.72  47.19   15.92
  crema-emo-4rnd      Online   180   6.59  44.59  46.53   18.63
Ours
  pool-distill        Online   180   5.08  46.30  48.62   18.32
  frame-distill       Online   180   5.77  49.22  48.98   18.30

4. Experiments

4.1. Setup

We follow the VoicePrivacy 2024 Challenge protocol [3]:
• Privacy (EER): Equal error rate from ECAPA-TDNN [32]; lazy-informed (trained on original) and semi-informed (trained on anonymized) attackers. Higher EER = better privacy.
• Intelligibility (WER): Word error rate from ASR trained on LibriSpeech. Lower = better.
• Emotion (UAR): Unweighted average recall from a speech emotion recognition (SER) model trained on IEMOCAP [33] (4 classes: angry, happy, neutral, sad). Higher = better.

We finetune the pretrained open-source StreamVoiceAnon [7] model (vctk-1fix anonymization strategy, i.e., a single fixed target speaker) on CREMA-D [34] neutral-emotion pairs for 5 epochs using 4× NVIDIA RTX 4090 GPUs with a learning rate of 1 × 10⁻⁴. Only the Slow AR and Fast AR modules are finetuned, along with the distillation transformer f_θ (initialized from scratch); all other components remain frozen. The [SEP] tokens are randomly initialized learnable embeddings. We filter CREMA-D to four emotions (angry, happy, neutral, sad), excluding the fearful and disgusted categories as these emotions are not present in the IEMOCAP evaluation set. The dataset provides 7,442 clips from 91 actors, from which we construct approximately 25,000 neutral-emotion pairs using many-to-many matching (all neutral utterances per speaker paired with all emotional and neutral utterances from the same speaker) after filtering (quality threshold q = 0.5 based on vote scores). We use Emotion2Vec+ large [30] (last-layer hidden representations) as the emotion feature extractor for distillation, with weight w = 0.01; higher values degraded intelligibility while lower values showed diminishing emotion gains; training loss plateaued between epochs 5 and 10. Note that the VoicePrivacy 2024 SER evaluator is a wav2vec 2.0-based model [29] trained on IEMOCAP via SpeechBrain, while the distillation teacher (Emotion2Vec+) uses a data2vec 2.0 backbone, reducing the risk of circular evaluation. This evaluator is part of the standard VoicePrivacy protocol and enables direct comparison with prior work. The causal transformer f_θ consists of 2 layers with hidden dimension matching the acoustic hidden states. Emotion evaluation uses IEMOCAP development and test sets.
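UAR here is the macro-average of per-class recall over the four emotion classes, so a system cannot score well by collapsing onto a dominant emotion. A minimal sketch of the metric (the SER evaluator itself is the VoicePrivacy 2024 model and is not shown):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred, classes=("angry", "happy", "neutral", "sad")):
    """UAR: mean of per-class recall, so each emotion counts equally
    regardless of how many test utterances it contributes."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in classes if total[c] > 0]
    return sum(recalls) / len(recalls)

# A degenerate classifier that always outputs "happy" scores 25% UAR on a
# balanced 4-class set, which is the kind of over-prediction bias discussed
# in the per-emotion analysis below.
```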
Table 2: Ablation study. All metrics on IEMOCAP following the VoicePrivacy 2024 protocol. Components are cumulative: Exp1 finetunes on CREMA-D, Exp2 adds neutral-emotion pairs, Exp3 adds the [SEP] tokens, and Exp4-7 add emotion distillation (Exp4: statistical-pooling aggregation, Exp5: causal-transformer aggregation, Exp6/Exp7: distillation target on the semantic/acoustic branch).

Model     WER↓  UAR avg↑  Ang   Hap   Neu   Sad   EER-L↑  EER-S↑
Baseline  4.54  39.7      35.8  81.9  33.1  8.0   47.19   15.92
Exp1      5.00  41.1      36.3  79.6  35.5  13.2  45.70   14.88
Exp2      5.16  45.3      35.3  75.9  48.2  21.7  47.31   16.73
Exp3      5.25  47.4      34.8  72.9  50.7  31.2  47.46   16.53
Exp4      5.08  46.3      40.9  75.1  44.2  25.0  48.62   18.32
Exp5      5.32  48.5      40.3  65.2  53.6  34.8  48.19   16.78
Exp6      6.23  48.2      48.7  66.7  49.7  27.7  47.93   17.10
Exp7      5.77  49.2      38.8  62.8  52.7  42.6  48.98   18.30

4.2. Main Results

Table 1 presents the results. Our method's primary achievement is emotion preservation: 49.2% UAR is the highest among all streaming methods (the SER model achieves 70.07% on original speech, establishing an upper bound), while maintaining competitive intelligibility (5.77% WER) and strong privacy (48.98% EER-lazy).

Compared to the StreamVoiceAnon baseline with neutral prompts (39.7% UAR), our method achieves a +24% relative UAR improvement with a modest WER increase (4.54% → 5.77%) and improved privacy (EER-lazy: 47.19% → 48.98%). We hypothesize that emotion distillation encourages disentanglement of emotional and speaker information in acoustic hidden states, reducing identity leakage. Against the emotion-prompt variant (SVA+EMO, 44.6% UAR), we achieve +10% relative UAR with better WER (6.59% → 5.77%) and privacy. Notably, our method uses only a single neutral prompt instead of four emotion-diverse prompts.

Compared to other streaming methods with viable privacy (EER-lazy > 40%), our method improves UAR by +32–44% relative over GenVC-small (34.2%), DarkStream (34.7%), and TVTSyn (37.3%), with comparable or better WER and privacy, demonstrating that the training paradigm approach complements architectural improvements.

4.3. Ablation Study

Table 2 systematically evaluates each component. All experiments from Exp1 onward use finetuning on CREMA-D; Exp2 onwards add neutral-emotion pairs; Exp3 onwards add the separation token; Exp4-7 explore different distillation architectures.

Finetuning on emotional data (Exp1): Simply finetuning on CREMA-D yields only +1.4 UAR, confirming this is not a data domain issue but a training paradigm problem.

Neutral-emotion pairs (Exp2): Forcing the model to generate emotional output from neutral prompts provides the largest single improvement (+4.1 UAR), partially validating our hypothesis about prompt-driven emotion degradation. Sad emotion improves from 8.0% to 21.7%.

Separation token (Exp3): Explicit boundary marking between prompt and source sequences adds +2.1 UAR, further improving sad to 31.2%.

Emotion distillation architectures (Exp4-5): We explore two aggregation strategies for the distillation head. Statistical pooling (Exp4) computes mean and standard deviation over hidden states before prediction, achieving 46.3% UAR with 5.08% WER. However, Exp4 actually decreases UAR compared to Exp3 (47.4%), suggesting that utterance-level pooling loses temporal emotion dynamics. A causal transformer (Exp5) maintains frame-level predictions, achieving 48.5% UAR with 5.32% WER, confirming that temporal modeling better captures frame-level emotion dynamics.

Distillation target selection (Exp6 vs Exp7): Distilling to semantic hidden states (Exp6) achieves 48.2% UAR but degrades WER to 6.23% due to gradient competition with content supervision. Acoustic distillation (Exp7) achieves higher UAR (49.2%) with better WER (5.77%), confirming that the acoustic branch provides cleaner gradient flow.

Per-emotion analysis: The most dramatic improvement occurs for "sad": from 8.0% (baseline) to 42.6% (ours). Neutral emotion improves from 33.1% to 52.7% (+59%). The decrease in "happy" (81.9% → 62.8%) represents correction of an over-prediction bias, as the baseline defaulted to happy-sounding output regardless of source emotion.

5. Conclusion

We have presented an SFT approach for emotion-preserving streaming speaker anonymization. Our ablation reveals that emotion preservation is largely a training paradigm problem for this architecture: restructuring training pairs (Exp1→Exp2, +4.2 UAR) yields 3× larger gains than adding emotional data alone (Baseline→Exp1, +1.4 UAR), indicating the model has the capacity to encode emotion but lacks the right training signal. Combined with frame-level emotion distillation on acoustic hidden states, our method improves the baseline from 39.7% to 49.2% UAR (+24% relative) while also improving privacy (EER-lazy: 47.2% → 49.0%), with zero inference overhead. A gap remains compared to offline methods (EASY: 63.8% UAR), reflecting the fundamental latency–quality trade-off: offline methods access full utterance context, enabling bidirectional modeling and utterance-level emotion optimization.
Future work will explore longer-range emotional context within the causal constraint, extension to dimensional emotion models, and adversarial training against adaptive attacks. While we follow the standard VoicePrivacy 2024 evaluation protocol, limitations include reliance on a single SER evaluator, the absence of subjective listening tests, and evaluation on acted speech corpora only (both finetuning on CREMA-D and evaluation on IEMOCAP). Future work should validate on spontaneous emotion corpora (e.g., MSP-Podcast).

6. Generative AI Use Disclosure

Generative AI tools were used to assist with manuscript editing and formatting. All technical content, experimental design, and analysis were performed by the authors.

7. References

[1] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O'Brien, A. Chanclu, J.-F. Bonastre, M. Todisco, and M. Maouche, "The VoicePrivacy 2020 challenge: Results and findings," Computer Speech & Language, vol. 74, p. 101362, 2022.
[2] M. Panariello, N. Tomashenko, X. Wang, X. Miao, P. Champion, H. Nourtel, M. Todisco, N. Evans, E. Vincent, and J. Yamagishi, "The VoicePrivacy 2022 challenge: Progress and perspectives in voice anonymisation," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 32, pp. 3497–3512, 2024.
[3] N. Tomashenko, X. Miao, P. Champion, S. Meyer, M. Panariello, X. Wang, N. Evans, E. Vincent, J. Yamagishi, and M. Todisco, "The third VoicePrivacy challenge: Preserving emotional expressiveness and linguistic content in voice anonymization," arXiv preprint arXiv:2601.11846, 2026.
[4] S. Meyer and N. T. Vu, "Use cases for voice anonymization," in Proc. 5th Symposium on Security and Privacy in Speech Communication (SPSC), 2025.
[5] Z. Cai, H. Li Xinyuan, A. Garg, L. P. García-Perera, K. Duh, S. Khudanpur, N. Andrews, and M. Wiesner, "Privacy versus emotion preservation trade-offs in emotion-preserving speaker anonymization," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 409–414.
[6] M. Panariello, F. Nespoli, M. Todisco, and N. Evans, "Speaker anonymization using neural audio codec language models," in Proc. ICASSP 2024, 2024, pp. 4725–4729.
[7] N. Kuzmin, S. Liu, K. A. Lee, and E. S. Chng, "Stream-voice-anon: Enhancing utility of real-time speaker anonymization via neural audio codec and language models," in Proc. ICASSP 2026, 2026.
[8] J. Yao, H. Liu, E. S. Chng, and L. Xie, "EASY: Emotion-aware speaker anonymization via factorized distillation," in Proc. Interspeech 2025, 2025, pp. 3219–3223.
[9] J. Yao, N. Kuzmin, Q. Wang, P. Guo, Z. Ning, D. Guo, K. A. Lee, E.-S. Chng, and L. Xie, "NPU-NTU System for Voice Privacy 2024 Challenge," in Proc. 4th Symposium on Security and Privacy in Speech Communication (SPSC), 2024, pp. 67–71.
[10] X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, "SpeechTokenizer: Unified speech tokenizer for speech language models," in Proc. International Conference on Learning Representations (ICLR), 2024.
[11] Ü. E. Gaznepoglu and N. Peters, "Why disentanglement-based speaker anonymization systems fail at preserving emotions?" in Proc. ICASSP 2025, 2025, pp. 1–5.
[12] J. Patino, N. Tomashenko, M. Todisco, A. Nautsch, and N. Evans, "Speaker anonymisation using the McAdams coefficient," in Proc. Interspeech 2021, 2021, pp. 1099–1103.
[13] B. M. L. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Todisco, A. Nautsch, J. Shan, and N. Evans, "Design choices for x-vector based speaker anonymization," in Proc. Interspeech 2020, 2020, pp. 2512–2516.
[14] B. M. L. Srivastava, M. Maouche, M. Sahidullah, E. Vincent, A. Bellet, M. Tommasi, N. Tomashenko, X. Wang, and J. Yamagishi, "Privacy and utility of x-vector based speaker anonymization," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 30, pp. 2383–2395, 2022.
[15] N. Kuzmin, H.-T. Luong, J. Yao, L. Xie, K. A. Lee, and E.-S. Chng, "NTU-NPU System for Voice Privacy 2024 Challenge," in Proc. 4th Symposium on Security and Privacy in Speech Communication (SPSC), 2024, pp. 72–79.
[16] S. Meyer, P. Tilli, P. Denisov, F. Lux, J. Koch, and N. T. Vu, "Anonymizing speech with generative adversarial networks to preserve speaker privacy," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 912–919.
[17] S. Meyer, F. Lux, J. Koch, P. Denisov, P. Tilli, and N. T. Vu, "Prosody is not identity: A speaker anonymization approach using prosody cloning," in Proc. ICASSP 2023, 2023, pp. 1–5.
[18] Z. Wang, Y. Chen, X. Wang, L. Xie, and Y. Wang, "StreamVoice: Streamable context-aware language modeling for real-time zero-shot voice conversion," in Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 7328–7338.
[19] W. Quamer, M.-R. Tseng, G. Nasrallah, and R. Gutierrez-Osuna, "TVTSyn: Content-synchronous time-varying timbre for streaming voice conversion and anonymization," in Proc. International Conference on Learning Representations (ICLR), 2026.
[20] Z. Cai, H. X. Li, A. Garg, L. P. García-Perera, K. Duh, S. Khudanpur, M. Wiesner, and N. Andrews, "GenVC: Self-supervised zero-shot voice conversion," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025.
[21] S. Ghosh, A. Das, Y. Sinha, I. Siegert, T. Polzehl, and S. Stober, "Emo-StarGAN: A semi-supervised any-to-many non-parallel emotion-preserving voice conversion," in Proc. Interspeech 2023, 2023, pp. 2093–2097.
[22] A. Das, S. Ghosh, T. Polzehl, I. Siegert, and S. Stober, "StarGAN-VC++: Towards emotion preserving voice conversion using deep embeddings," in Proc. 12th ISCA Speech Synthesis Workshop (SSW), 2023, pp. 81–87.
[23] H. Hua, Z. Shang, X. Li, P. Shi, C. Yang, L. Wang, and P. Zhang, "Emotional speech anonymization: Preserving emotion characteristics in pseudo-speaker speech generation," in Proc. 4th Symposium on Security and Privacy in Speech Communication (SPSC), 2024, pp. 55–60.
[24] X. Miao, Y. Zhang, X. Wang, N. Tomashenko, D. C. L. Soh, and I. Mcloughlin, "Adapting general disentanglement-based speaker anonymization for enhanced emotion preservation," Computer Speech & Language, vol. 94, p. 101810, 2025.
[25] J. He, S. Zhao, J. Zhou, H. Sun, H. Wang, and Y. Qin, "Emotion-preserving prosody anonymization network for voice privacy protection," in Proc. ICASSP 2025, 2025, pp. 1–5.
[26] W. Quamer and R. Gutierrez-Osuna, "End-to-end streaming model for low-latency speech anonymization," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024.
[27] W. Quamer and R. Gutierrez-Osuna, "DarkStream: Real-time speech anonymization with low latency," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025.
[28] H.-J. Chang, S.-w. Yang, and H.-y. Lee, "DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT," in Proc. ICASSP 2022, 2022, pp. 7087–7091.
[29] L. Pepino, P. Riera, and L. Ferrer, "Emotion recognition from speech using wav2vec 2.0 embeddings," in Proc. Interspeech 2021, 2021, pp. 3400–3404.
[30] Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, "emotion2vec: Self-supervised pre-training for speech emotion representation," in Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024, pp. 15747–15760.
[31] Y. Hu, C. Chen, R. Li, Q. Zhu, and E. S. Chng, "Gradient remedy for multi-task learning in end-to-end noise-robust speech recognition," in Proc. ICASSP 2023, 2023, pp. 1–5.
[32] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech 2020, 2020, pp. 3830–3834.
[33] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[34] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, "CREMA-D: Crowd-sourced emotional multimodal actors dataset," IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.