
Paper deep dive

Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Nikita Kuzmin, Tao Zhong, Jiajun Deng, Yingke Zhu, Tristan Tsoi, Tianxiang Cao, Simon Lui, Kong Aik Lee, Eng Siong Chng

Year: 2026 | Venue: arXiv preprint | Area: eess.AS | Type: Preprint | Embeddings: 30

Abstract

End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5× relative to the discrete encoder baseline (11.2% → 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78–93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).

Tags

ai-safety (imported, 100%) · eessas (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. View the paper on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/13/2026, 12:40:32 AM

Summary

This paper investigates speaker identity leakage in end-to-end full-duplex speech dialogue models (SALM-Duplex and Moshi). It demonstrates that hidden representations in these models leak significant speaker identity, with discrete encoders showing higher leakage than continuous ones. The authors propose two streaming anonymization methods: Anon-W2W (waveform-level) and Anon-W2F (feature-domain), which significantly increase the Equal Error Rate (EER) of speaker verification attackers, effectively mitigating privacy risks while maintaining real-time performance.

Entities (6)

Moshi · model · 100%
SALM-Duplex · model · 100%
Anon-W2F · method · 95%
Anon-W2W · method · 95%
Stream-Voice-Anon · tool · 95%
ECAPA-TDNN · model · 90%

Relation Signals (4)

SALM-Duplex leaks Speaker Identity

confidence 95% · SALM-Duplex showing stronger leakage in early layers

Moshi leaks Speaker Identity

confidence 95% · Moshi leaks uniformly

Anon-W2F mitigates Speaker Identity Leakage

confidence 90% · Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline

Anon-W2W mitigates Speaker Identity Leakage

confidence 90% · Anon-W2W raises EER to 36.9% for Moshi

Cypher Suggestions (2)

Find all models and their associated anonymization methods · confidence 90% · unvalidated

MATCH (m:Model)-[:USES]->(a:Method) RETURN m.name, a.name

Identify risks associated with specific models · confidence 90% · unvalidated

MATCH (m:Model)-[:LEAKS]->(r:Risk) RETURN m.name, r.name

Full Text

29,590 characters extracted from source content.


Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Nikita Kuzmin 1,2, Tao Zhong 3,4, Jiajun Deng 3,†, Yingke Zhu 3, Tristan Tsoi 3, Tianxiang Cao 3, Simon Lui 3, Kong Aik Lee 5, Eng Siong Chng 1
1 Nanyang Technological University, Singapore; 2 A*STAR, Singapore; 3 Huawei Leibniz Research Center; 4 The Chinese University of Hong Kong; 5 The Hong Kong Polytechnic University
s220028@e.ntu.edu.sg

Abstract
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5× relative to the discrete encoder baseline (11.2% → 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78–93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).

Index Terms: speaker anonymization, full-duplex speech, privacy, speaker verification, speech agents

1. Introduction
End-to-end (E2E) full-duplex speech dialogue systems represent a fundamental shift from turn-taking interaction to always-on, simultaneous listening and speaking. Models such as SALM-Duplex [1], Moshi [2], and SyncLLM [3] route raw user audio continuously through a decoder-only LLM backbone, computing hidden state representations at every transformer layer throughout the entire conversation.
Unlike cascaded systems where user and agent streams are processed separately, the LLM core of these full-duplex systems maintains a persistent internal state over the user's speech stream, capturing their voice, speaking style, and identity.

This creates a privacy exposure that remains largely unexplored. Under GDPR and similar regulations, the mere presence of identifiable speaker information in model representations constitutes a compliance risk, regardless of whether external access is exploited, making it essential to audit and mitigate such leakage proactively. Probing studies on self-supervised speech models (e.g., wav2vec 2.0, HuBERT, WavLM) have established that their hidden representations strongly encode speaker identity [4, 5, 6]. In the text domain, prior work has shown that LLM hidden states can leak demographic and content-level user attributes [7], and membership inference probes have exposed privacy risks in masked language models [8]. Our focus is distinct: we investigate speaker identity leakage, the persistent encoding of who is speaking, which enables re-identification regardless of conversational content.

† indicates the corresponding author.
1 Demo page: https://anonymous-569230593.github.io/

Figure 1: Speaker privacy (1 − Linkability) vs. dialogue turn count. Only discrete encoder variants are shown; continuous encoder Linkability is omitted for clarity. Red lines: no anonymization; green lines: post-anonymization. Without anonymization, both systems drop into the low-privacy zone within a few turns; anonymization lifts privacy into the protected zone, though Moshi + W2W shows gradual degradation over dialogue length. Curves shown: SALM-Duplex disc. (no anon), Moshi disc. (no anon), SALM-Duplex disc. + W2F, Moshi disc. + W2W.
Yet, to the best of our knowledge, no prior work has examined whether these always-on LLM hidden states retain sufficient speaker identity to enable re-identification.

We answer this question empirically for two prominent E2E full-duplex systems, SALM-Duplex and Moshi [2]. Following the VoicePrivacy 2024 Challenge [9] evaluation protocol with a lazy-informed attacker scenario, we train a speaker verification attacker (probe) on hidden state representations extracted from each layer group. Equal error rate (EER) serves as our primary privacy metric; we additionally report Linkability from the legally validated evaluation framework of [10]. Our mean-pooled EER for the original SALM-Duplex (continuous encoder) is 28.5%, revealing significant speaker identity leakage, while Moshi's discrete encoder reaches 6.4% EER (near-perfect identification). We further propose and evaluate two streaming anonymization setups (Figure 2) using Stream-Voice-Anon [11] to mitigate this leakage without sacrificing dialogue utility. Our contributions are:

• We characterise speaker identity leakage in the hidden states of two E2E full-duplex dialogue LLMs (SALM-Duplex and Moshi), extending probing methodology [12, 4] from static encoders to always-on dialogue backbones.
• We provide a layer-wise and turn-length-wise analysis showing which parts of the LLM carry the most speaker-identifying information and how leakage accumulates over dialogue length.
• We propose two streaming anonymization setups: Anon-W2W applies Stream-Voice-Anon at the waveform level (retaining the original encoder, validated on both SALM-Duplex and Moshi), while Anon-W2F replaces the continuous encoder with a discrete encoder and activates anonymization at the feature level.

arXiv:2603.08179v1 [eess.AS] 9 Mar 2026

Figure 2: Overview of the original SALM-Duplex pipeline and proposed anonymization setups. The main diagram shows the ASR-based encoder baseline: an ECAPA-TDNN probe attached to the LLM's hidden states (red dashed path) reveals substantial speaker identity leakage. The Anon-W2W inset (upper left) prepends Stream-Voice-Anon to anonymize the waveform before the unchanged ASR encoder. The Anon-W2F inset (upper right) replaces the ASR encoder with the Stream-Voice-Anon encoder (anonymization active) and fine-tunes the LLM, eliminating the redundant waveform synthesis step. (Anon-W2F is demonstrated for SALM-Duplex; the Anon-W2W setup is additionally evaluated on Moshi.)

2. Related Work

2.1. E2E speech dialogue language models
The generative spoken language model (GSLM) [13] established the paradigm of modelling speech directly from discrete units, subsequently extended to two-channel spoken dialogue by dGSLM [14], the first fully E2E full-duplex dialogue model. SpeechGPT [15] and SpiRit-LM [16] demonstrated that a single LLM backbone can natively interleave speech and text tokens for conversational interaction. SyncLLM [3] introduced synchronised LLM inference to support real-time full-duplex turn management. The most recent systems, SALM-Duplex [1] and Moshi [2], achieve low-latency always-on operation through continuous dual-stream processing.
Crucially, because all of these models process raw user speech as input, speaker identity information may be inadvertently encoded in their LLM hidden representations, yet none has been analysed from a privacy perspective.

2.2. Speaker anonymization
The VoicePrivacy Challenge [17, 9] established standardised evaluation protocols and benchmarks for speaker anonymization, using the EER of an ASV system as the primary privacy metric; Vauquier et al. [10] recently extended this with a legally validated framework incorporating Linkability metrics aligned with GDPR requirements. Anonymization approaches include x-vector-based methods [18, 17], perturbation [19], phonetic intermediate representations [20], neural audio codec language models [21, 22], and disentangled speech representations [23, 24]. Streaming anonymization systems, including Stream-Voice-Anon [11], TVTSyn [25], DarkStream [26], and [27], extend these methods to the real-time setting required by E2E full-duplex models; we select Stream-Voice-Anon as our front-end owing to its competitive privacy–utility trade-off and open-source availability 2. All of these approaches operate at the waveform level. To the best of our knowledge, our work is the first to analyse and mitigate speaker identity leakage inside a full-duplex LLM's hidden states.

2.3. Privacy in neural speech and language representations
Probing studies have established that self-supervised speech models encode substantial speaker identity across their transformer layers: Pasad et al. [4] showed speaker information peaks in the lower and middle layers of wav2vec 2.0, and SUPERB [5] demonstrated that lightweight probes achieve strong speaker verification from frozen SSL representations. Chiu et al. [6] recently confirmed this at large scale across multiple SSL architectures and speaker attributes. The layer-wise probing methodology itself was established by Tenney et al. [12] for text models.
Our focus is distinct: rather than protecting conversational content, we investigate speaker identity leakage, the persistent encoding of who is speaking, which enables re-identification regardless of what is said and may additionally extend to attributes such as gender, accent, and health status. Nautsch et al. [28] provide a comprehensive survey of privacy threats in speaker and speech characterisation, identifying re-identification from audio representations as a key risk category. We extend this line of work to E2E full-duplex dialogue LLMs, whose hidden states, unlike static SSL encoders, are computed continuously over live user speech.

2 https://github.com/Plachtaa/StreamVoiceAnon

Table 1: Privacy, dialogue quality, and efficiency on the VPC2024 evaluation set. Privacy: higher EER and lower Linkability (Lnk) = stronger privacy; EER = 50% = perfect anonymization. Linkability for no-anon rows uses the "orig" attacker; for anonymized rows, the "lazy-informed" attacker [10]. Quality: higher sBLEU / sBERT = better. Efficiency: RTFx = 1/RTF; FRL = First Response Latency (s); TTSR = Turn-Taking Success Rate; Int.L. = Interruption Latency (s); ISR = Interruption Success Rate.

| Model | User Encoder | Anon. Type | EER↑ | Lnk↓ | sBLEU S2T↑ | sBLEU S2S↑ | sBERT S2T↑ | sBERT S2S↑ | RTFx↑ | FRL↓ | TTSR↑ | Int.L.↓ | ISR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Moshi | discrete | – | 6.4 | 0.90 | 7.18 | 6.85 | 0.50 | 0.48 | 17 | 0.50 | 0.85 | 1.24 | 0.45 |
| Moshi | discrete | W2W | 36.9 | 0.35 | 5.20 | 5.00 | 0.39 | 0.38 | 1.6 | 0.72 | 0.72 | 1.31 | 0.38 |
| SALM-Duplex | discrete | – | 11.2 | 0.79 | 4.18 | 3.15 | 0.45 | 0.29 | 238 | 0.44 | 0.85 | 1.03 | 0.96 |
| SALM-Duplex | discrete | W2F | 41.0 | 0.23 | 3.57 | 1.49 | 0.39 | 0.20 | 2.5 | 0.50 | 0.68 | 1.18 | 0.93 |
| SALM-Duplex | continuous | – | 28.5 | 0.29 | 6.91 | 6.35 | 0.59 | 0.50 | 263 | 0.68 | 0.97 | 0.60 | 0.99 |
| SALM-Duplex | continuous | W2W | 34.6 | 0.24 | 6.46 | 5.49 | 0.55 | 0.45 | 1.7 | 0.80 | 0.94 | 0.57 | 0.98 |

3. Methodology

3.1. E2E full-duplex dialogue models
We analyse two E2E full-duplex dialogue systems.
Moshi [2] uses a decoder-only Transformer backbone with a residual-quantization (RVQ) audio codec encoder to jointly model user and system streams in a single autoregressive pass. SALM-Duplex [1] is a decoder-only LLM that encodes user audio via an ASR-initialized continuous encoder and processes dual user–agent audio streams frame-synchronously, maintaining persistent hidden states over the entire conversation; we reimplement a modified version that adopts the speech decoder architecture from [29] but without the chronological thinking mechanism. Our reimplementation yields comparable response quality; the effect of the chronological thinking mechanism on privacy is left for future work. Both systems maintain a persistent LLM hidden state over continuous user speech, making them candidates for speaker identity probing.

3.2. User encoder variants
We distinguish two encoder families by their output representation, which determines how much speaker identity reaches the LLM backbone.

Discrete encoders (Moshi and SALM-Duplex variant): Moshi natively uses an RVQ codec encoder that produces discrete token representations. For SALM-Duplex we replace the continuous front-end with a discrete encoder based on the Firefly architecture (implementation details in Section 3.5). Because both discrete encoders are trained for high-fidelity speech reconstruction, they preserve rich speaker information in their token representations. The SALM-Duplex discrete encoder can be operated in two modes: with anonymization disabled, serving as our ablation to isolate the effect of the encoder swap alone; or with anonymization active via Stream-Voice-Anon [11], constituting our Anon-W2F setup (Section 3.3).

Continuous encoder (original SALM-Duplex): An ASR-initialized adapter encodes raw user audio into continuous embeddings. Because ASR pretraining optimizes for speech content rather than speaker characteristics, this encoder represents a good design choice for built-in partial privacy protection, though with a limited privacy ceiling as we show in Section 4.

3.3. Anonymization setups
We propose two streaming anonymization setups using Stream-Voice-Anon [11].

3.3.1. Anon-W2W: Wave-to-wave anonymization
Anon-W2W pairs with the continuous encoder for SALM-Duplex and with Moshi's native discrete codec encoder. Stream-Voice-Anon is applied as a pre-processing step that transforms the raw user waveform into an anonymized waveform before it is fed to the dialogue model. However, it introduces a redundant processing step: the anonymized waveform is first synthesized and then re-encoded by the model's original encoder.

3.3.2. Anon-W2F: Wave-to-feature anonymization
Anon-W2F pairs with the discrete encoder (anonymization active). We replace the continuous encoder front-end of SALM-Duplex with the discrete encoder and enable Stream-Voice-Anon's anonymization module, which operates natively on the discrete token representations. Because the anonymization step operates in the same feature domain as the encoder, this eliminates the redundant waveform synthesis step of Anon-W2W. The modified model is pretrained on a similar data mixture as [29] (approximately 12k hours of multi-turn dialogue and 2.7k hours of QA data), then fine-tuned on InstructS2S-200K 3 [30]. This setup requires architectural modification but achieves stronger privacy guarantees through feature-domain-native anonymization. While the Anon-W2F principle (replacing a model's encoder with an anonymization-capable discrete encoder) is architecturally transferable, its effectiveness depends on the target model's tokenizer compatibility; we demonstrate it for SALM-Duplex and leave Moshi integration for future work.

3.4. Hidden state extraction
We extract hidden state representations from both Moshi (N=32 transformer layers) and SALM-Duplex (N=20) at three individual layers: early (layer 1), mid (layer N/2), and late (layer N), as well as a mean-pooled representation averaging across all N layers (denoted "All" in Table 2). Temporal pooling is handled internally by the probe architecture.

3 https://huggingface.co/datasets/ICTNLP/InstructS2S-200K

Table 2: Layer-wise EER (%) of the speaker verification attacker. Higher = better privacy; 50% = chance level. All = mean-pooled over all layers.

| System | Encoder | Anon. | Early | Mid | Late | All |
|---|---|---|---|---|---|---|
| Moshi | discrete | – | 7.3 | 5.6 | 6.4 | 6.4 |
| Moshi | discrete | W2W | 42.5 | 37.6 | 35.2 | 36.9 |
| SALM-Duplex | discrete | – | 7.5 | 14.0 | 20.1 | 11.2 |
| SALM-Duplex | discrete | W2F | 43.8 | 40.5 | 40.1 | 41.0 |
| SALM-Duplex | continuous | – | 24.6 | 28.6 | 32.1 | 28.5 |
| SALM-Duplex | continuous | W2W | 31.5 | 33.7 | 35.3 | 34.6 |

3.5. Setup
Dataset. Privacy metrics are evaluated on the VoicePrivacy 2024 Challenge [9] evaluation set, derived from LibriSpeech dev-clean and test-clean splits. While this dataset consists of read speech rather than spontaneous conversation, it serves as a good starting point given the standardised evaluation protocol; evaluating on conversational corpora is an important direction for future work. The speaker verification attacker is trained on LibriSpeech train-clean-360. Quality and efficiency metrics are evaluated on MtBenchEval [31, 32], a multi-turn dialogue benchmark standardised by URO-Bench. Moshi uses open-source pretrained weights; SALM-Duplex is reimplemented following the speech decoder architecture of [29].

Speaker verification attacker. We train separate ECAPA-TDNN [33] attackers from scratch for SALM-Duplex and Moshi, replicating the lazy-informed attacker scenario from VPC2024 [9]: identical hyperparameters, with an input projection layer adapted to each model's hidden state dimension.
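The layer-group selection described in Section 3.4 (early = layer 1, mid = layer N/2, late = layer N, plus a mean-pooled "All" view) can be sketched as follows. The function name and array layout are illustrative assumptions for this writeup, not the authors' released code:

```python
import numpy as np

def extract_layer_groups(hidden_states):
    """Pick early / mid / late layers plus a mean-pooled 'All' view.

    hidden_states: list of N arrays of shape (time, dim), one per
    transformer layer of the dialogue LLM backbone (N=32 for Moshi,
    N=20 for SALM-Duplex). The time axis is preserved here because
    temporal pooling is handled inside the attacker probe.
    """
    n = len(hidden_states)
    return {
        "early": hidden_states[0],           # layer 1
        "mid":   hidden_states[n // 2 - 1],  # layer N/2
        "late":  hidden_states[-1],          # layer N
        "all":   np.mean(np.stack(hidden_states), axis=0),  # mean over layers
    }
```

Each of the four views is then fed to a separately trained verification probe, which is how Table 2 gets one EER per layer group.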
Privacy metrics. Following the VoicePrivacy 2024 Challenge [9] protocol, we report equal error rate (EER%) as the primary privacy metric. A higher EER indicates stronger privacy protection; EER ≈ 50% corresponds to random-chance discrimination, i.e., complete anonymization. To further strengthen privacy analysis, we additionally report Linkability [10].

Utility metrics. Dialogue quality is assessed using evaluation scripts adapted from URO-Bench [31], Full-Duplex-Bench [32], and SALM-Duplex [1]. Quality metrics (sBLEU, sBERT) are reported separately for speech-to-text (S2T) and speech-to-speech (S2S) response modes: in S2T mode the model outputs text directly; in S2S mode the model outputs speech which is subsequently transcribed by an ASR system. Efficiency metrics include Real-Time Factor Speedup (RTFx), First Response Latency (FRL), Turn-Taking Success Rate (TTSR), Interruption Latency (Int.L.), and Interruption Success Rate (ISR), reported per system condition (Table 1). All RTFx and latency measurements are estimated on a single GPU.

4. Results
Table 1 reports privacy, response quality, and efficiency metrics. Discrete encoders, trained for high-fidelity speech reconstruction, leak substantially more speaker identity than the continuous encoder (Moshi: 6.4% EER; SALM-Duplex discrete: 11.2% vs. continuous: 28.5%), confirming that ASR pretraining acts as a privacy-positive design choice by discarding speaker characteristics in favour of linguistic content. Both anonymization setups substantially reduce leakage: Anon-W2W raises EER to 36.9% for Moshi (+30.5 points) and 34.6% for SALM-Duplex continuous (+6.1 points), while Anon-W2F achieves 41.0% EER, approaching the 50% chance ceiling. Crucially, comparing Anon-W2F against the discrete-only ablation (11.2%) confirms the gain is entirely from anonymization, not the encoder swap.
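The EER figures above reduce to finding the decision threshold at which the attacker's false-acceptance and false-rejection rates coincide. A minimal threshold-sweep sketch (illustrative only, not the official VPC2024 scoring tooling):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate from attacker similarity scores.

    target_scores: scores for same-speaker trials.
    nontarget_scores: scores for different-speaker trials.
    Returns EER in [0, 1]; ~0.5 means random-chance discrimination,
    i.e. complete anonymization.
    """
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False acceptance: non-target trials scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # False rejection: target trials scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

Perfectly separated same-/different-speaker score distributions give EER = 0 (total identity leakage); heavily overlapping distributions push it toward the 0.5 random-chance ceiling that anonymization aims for.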
These patterns hold across both architectures, confirming speaker identity exposure as a consistent property of E2E full-duplex LLM hidden states. Our ECAPA-TDNN probe represents a lower bound on leakage; the VoicePrivacy Attacker Challenge [34] has shown that more sophisticated attackers can further reduce EER, suggesting that actual privacy risk may be higher than reported here.

Anonymization introduces moderate quality degradation (sBERT S2T drops 7–22% relative) but privacy gains consistently outweigh the cost, with EER improving 21–477% relative across setups. We note that our text-based quality metrics do not capture speech-level attributes such as naturalness and prosody; speech quality evaluation (e.g., MOS, UTMOS) remains for future work. Adding the anonymization module reduces RTFx substantially (from 17–263× to 1.6–2.5×) as the anonymizer dominates inference time, but all conditions remain real-time viable (RTFx > 1). Anon-W2F is faster than Anon-W2W (RTFx 2.5 vs. 1.6–1.7) because the feature-domain encoder avoids the redundant waveform synthesis step. Reducing the anonymization module's computational cost is an important direction for future work.

4.1. Ablation studies
Layer-wise analysis. Table 2 shows per-layer EER. Moshi exhibits uniformly low EER across all layers (5.6–7.3%), while both SALM-Duplex variants show decreasing leakage from early to late layers, consistent with deeper layers progressively abstracting away speaker features. Anonymization raises EER uniformly across all layer groups, with all anonymized conditions reaching the 31–44% range.

Turn-length analysis. Figure 1 plots 1 − Linkability vs. turn count. Without anonymization, privacy degrades rapidly within the first few turns. Both anonymized systems maintain acceptable protection even after 10 turns (SALM-Duplex + W2F: 0.90→0.78; Moshi + W2W: 0.83→0.67).

5. Conclusion
We have shown that significant speaker identity leakage is a consistent property of E2E full-duplex architectures, demonstrated across both SALM-Duplex and Moshi, with discrete encoders leaking substantially more than the ASR-pretrained continuous encoder. Waveform-level anonymization (Anon-W2W) reduces this risk across both architectures, with the largest relative gains observed for the most exposed systems, while feature-domain anonymization (Anon-W2F) achieves the strongest protection, raising EER by over 3.5× and approaching chance level. These findings motivate privacy-by-design for always-on speech AI systems.

Future work includes extending Anon-W2F to Moshi and other E2E full-duplex architectures, studying personalization with minimal privacy threats, improving privacy protection systems to have less impact on response quality and latency, and evaluating privacy under stronger and more diverse attacker models.

6. Generative AI Use Disclosure
Generative AI tools were used for proofreading and grammar correction with minor changes. All research ideas, experimental design, implementation, and analysis were conducted by the authors.

7. References
[1] K. Hu, E. Hosseini-Asl, C. Chen, E. Casanova, S. Ghosh, P. Żelasko, Z. Chen, J. Li, J. Balam, and B. Ginsburg, "Efficient and direct duplex modeling for speech-to-speech language model," in Proc. Interspeech 2025, 2025, pp. 2715–2719.
[2] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, "Moshi: a speech-text foundation model for real-time dialogue," 2024, arXiv preprint arXiv:2410.00037.
[3] B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota, "Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents," in Proc. 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2024.
[4] A. Pasad, J.-C. Chou, and K. Livescu, "Layer-wise analysis of a self-supervised speech representation model," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921.
[5] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, "SUPERB: Speech processing universal performance benchmark," in Proc. Interspeech 2021, 2021.
[6] A. Y. F. Chiu, K. C. Fung, R. T. Y. Li, J. Li, and T. Lee, "A large-scale probing analysis of speaker-specific attributes in self-supervised speech representations," 2025, arXiv preprint arXiv:2501.05310.
[7] R. Staab, M. Vero, M. Balunović, and M. Vechev, "Beyond memorization: Violating privacy via inference with large language models," in Proc. 12th International Conference on Learning Representations (ICLR), 2024.
[8] F. Mireshghallah, K. Goyal, A. Uniyal, T. Berg-Kirkpatrick, and R. Shokri, "Quantifying privacy risks of masked language models using membership inference attacks," in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
[9] N. Tomashenko, X. Miao, P. Champion, S. Meyer, X. Wang, E. Vincent, M. Panariello, N. W. D. Evans, J. Yamagishi, and M. Todisco, "The VoicePrivacy 2024 challenge evaluation plan," 2024, arXiv preprint arXiv:2404.02677.
[10] N. Vauquier, B. M. L. Srivastava, S. A. Hosseini, and E. Vincent, "Legally validated evaluation framework for voice anonymization," in Proc. Interspeech 2025, 2025.
[11] N. Kuzmin, S. Liu, K. A. Lee, and E. S. Chng, "Stream-Voice-Anon: Enhancing utility of real-time speaker anonymization via neural audio codec and language models," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.
[12] I. Tenney, D. Das, and E. Pavlick, "BERT rediscovers the classical NLP pipeline," in Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[13] K. Lakhotia, E. Kharitonov, W.-N. Hsu, Y. Adi, A. Polyak, B. Bolte, T. A. Nguyen, J. Copet, A. Baevski, A. Mohamed, and E. Dupoux, "Generative spoken language modeling from raw audio," Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021.
[14] T. A. Nguyen, E. Kharitonov, J. Copet, Y. Adi, W.-N. Hsu, A. Elkahky, P. Tomasello, R. Algayres, B. Sagot, A. Mohamed, and E. Dupoux, "Generative spoken dialogue language modeling," Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023.
[15] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, "SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities," in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
[16] T. A. Nguyen, B. Muller, B. Yu, M. R. Costa-jussà, M. Elbayad, S. Popuri, P.-A. Duquenne, R. Algayres, R. Mavlyutov, I. Gat, G. Synnaeve, J. Pino, B. Sagot, and E. Dupoux, "SpiRit-LM: Interleaved spoken and written language model," Transactions of the Association for Computational Linguistics, 2024.
[17] M. Panariello, N. Tomashenko, X. Wang, X. Miao, P. Champion, H. Nourtel, M. Todisco, N. W. D. Evans, E. Vincent, and J. Yamagishi, "The VoicePrivacy 2022 challenge: Progress and perspectives in voice anonymisation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3477–3491, 2024.
[18] S. Meyer, F. Lux, P. Denisov, J. Koch, P. Jaubert, and N. T. Vu, "Anonymizing speech with generative adversarial networks to preserve speaker privacy," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2022.
[19] N. Kuzmin, H.-T. Luong, K. A. Lee, and E. S. Chng, "The NTU-NPU system for the VoicePrivacy 2024 challenge," in Proc. 4th Symposium on Security and Privacy in Speech Communication (SPSC), 2024.
[20] S. Meyer et al., "Speaker anonymization with phonetic intermediate representations," in Proc. Interspeech 2022, Incheon, Korea, 2022.
[21] M. Panariello, F. Nespoli, M. Todisco, and N. Evans, "Speaker anonymization using neural audio codec language models," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[22] X. Yao, N. Kuzmin, K. A. Lee, and E. S. Chng, "The NPU-NTU system for the VoicePrivacy 2024 challenge," in Proc. 4th Symposium on Security and Privacy in Speech Communication (SPSC), 2024.
[23] P. Champion, D. Jouvet, and A. Larcher, "Are disentangled representations all you need to build a voice anonymization system?" in Proc. Interspeech 2022, Incheon, Korea, 2022.
[24] R. Aloufi, H. Haddadi, and D. Boyle, "Privacy-preserving voice analysis via disentangled representations," in Proc. 2020 ACM SIGSAC Conference on Cloud Computing Security Workshop (CCSW), 2020.
[25] W. Quamer, M.-R. Tseng, G. Nasrallah, and R. Gutierrez-Osuna, "TVTSyn: Content-synchronous time-varying timbre for streaming voice conversion and anonymization," in The Fourteenth International Conference on Learning Representations, 2026.
[26] W. Quamer and R. Gutierrez-Osuna, "DarkStream: Real-time speech anonymization with low latency," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025.
[27] W. Quamer and R. Gutierrez-Osuna, "End-to-end streaming model for low-latency speech anonymization," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 727–734.
[28] A. Nautsch, A. Jiménez, A. Treiber, J. Kolberg, C. Jasserand, E. Kindt, H. Delanne, M. Todisco, T. Schneider, and N. W. D. Evans, "Preserving privacy in speaker and speech characterisation," Computer Speech & Language, vol. 58, pp. 441–480, 2019.
[29] Z. Wu et al., "Chronological thinking in full-duplex spoken dialogue language models," 2025, arXiv preprint arXiv:2510.05150.
[30] Q. Fang et al., "LLaMA-Omni 2: LLM-based real-time spoken chatbot," in Proc. 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
[31] Yan et al., "URO-Bench: A comprehensive benchmark for end-to-end spoken dialogue models," 2025, arXiv preprint arXiv:2502.17810.
[32] Lin et al., "Full-duplex-bench v1.5: Benchmarking full-duplex spoken dialogue models," 2025, arXiv preprint arXiv:2507.23159.
[33] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech 2020, Shanghai, China, 2020, pp. 3830–3834.
[34] N. Tomashenko, X. Miao, E. Vincent, and J. Yamagishi, "The VoicePrivacy 2024 attacker sub-challenge," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.