
Paper deep dive

Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input

Changsong Liu, Tianrui Wang, Ye Ni, Yizhou Peng, Eng Siong Chng

Year: 2026 · Venue: arXiv preprint · Area: cs.SD · Type: Preprint · Embeddings: 26

Abstract

Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, adapting a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show our method outperforms a CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis especially, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by 16.1% and 1.5% relative, offering a robust solution for streaming TTS with incremental text.

Tags

ai-safety (imported, 100%) · cssd (suggested, 92%) · preprint (suggested, 88%)

Links


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/13/2026, 12:19:30 AM

Summary

The paper introduces a prosodic-boundary-aware post-training strategy for LLM-based streaming Text-to-Speech (TTS). By utilizing a prosodic-boundary marker and a sliding-window prompt with lookahead, the method enables stable, long-form streaming synthesis using only weakly time-aligned data, effectively mitigating prosodic drift and long-form collapse common in interleaved architectures.
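The sliding-window scheme the summary describes can be illustrated with a short sketch. All names here are hypothetical: `synthesize` stands in for the actual LLM-based TTS call, while the chunk size k = 5 and lookahead f = 2 follow the paper's default configuration.

```python
def stream_tts(words, synthesize, ref_prompt, k=5, f=2):
    """Sliding-window streaming loop: each chunk carries k current words,
    a prosodic-boundary marker, and f lookahead words. From the second
    chunk on, the prompt is the previous chunk's text and speech tokens,
    so the effective context stays bounded by O(k + f)."""
    prompt, audio = ref_prompt, []
    for start in range(0, len(words), k):
        cur = words[start:start + k]
        fut = words[start + k:start + k + f]
        speech = synthesize(prompt, cur + ["<boundary>"] + fut)
        audio.extend(speech)
        prompt = (cur, speech)  # slide the window forward
    return audio

# Toy stand-in: "synthesize" emits one speech token per word before the marker,
# mimicking the learned early stopping at the boundary.
def fake_synthesize(prompt, text):
    return [w.upper() for w in text[:text.index("<boundary>")]]

audio = stream_tts([f"w{i}" for i in range(12)], fake_synthesize, ref_prompt=None)
print(audio[:3], len(audio))  # ['W0', 'W1', 'W2'] 12
```

The lookahead words are fed to the model for prosodic planning but never synthesized; only the tokens before the marker produce audio, which is why the chunks concatenate seamlessly.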

Entities (5)

CosyVoice2 · model-architecture · 99%
Seed-TTS-Eval · benchmark · 99%
CommonVoice 13.0 · dataset · 95%
Prosodic-Boundary-Aware Streaming Generation · methodology · 95%
WhisperX · tool · 95%

Relation Signals (3)

Prosodic-Boundary-Aware Streaming Generation outperforms Interleaved Baseline

confidence 98% · Evaluations show our method outperforms CosyVoice-Style interleaved baseline

Prosodic-Boundary-Aware Streaming Generation improves LLM-based TTS

confidence 95% · adapting a pretrained LLM-based TTS model using weakly time-aligned data

CosyVoice2 serves as foundation for Prosodic-Boundary-Aware Streaming Generation

confidence 95% · We adopt CosyVoice2 as the foundation model.

Cypher Suggestions (2)

List all benchmarks used for evaluation · confidence 95% · unvalidated

MATCH (e:Benchmark) RETURN e.name

Find all methods that outperform the Interleaved baseline · confidence 90% · unvalidated

MATCH (m:Methodology)-[:OUTPERFORMS]->(b:Methodology {name: 'Interleaved Baseline'}) RETURN m.name

Full Text

26,134 characters extracted from source content.


Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input

Changsong Liu 1,∗, Tianrui Wang 2,∗, Ye Ni 3, Yizhou Peng 1, Eng Siong Chng 1
1 Nanyang Technological University, Singapore; 2 Tianjin University, China; 3 Southeast University, China
changsong.liu@ntu.edu.sg, wangtianrui@tju.edu.cn, niye@seu.edu.cn, peng.yizhou@ntu.edu.sg, ASESChng@ntu.edu.sg

Abstract

Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, adapting a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show our method outperforms a CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis especially, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by 16.1% and 1.5% relative, offering a robust solution for streaming TTS with incremental text.

Index Terms: streaming text-to-speech, llm-based text-to-speech, incremental text input

1. Introduction

Streaming text-to-speech (TTS) with streaming text input aims to generate speech in real time as text arrives, which is in high demand for applications such as dialogue systems and speech-to-speech translation [1, 2]. The usability of such systems is largely determined by synthesis latency, which mainly arises from two sources: the waiting time for text segment accumulation, and the model's inference time for converting text into audio.
While the latter can be effectively mitigated by optimizing the model size or introducing causal structures [3, 4], achieving ultra-low latency requires keeping the text accumulation window extremely small. This leads to the first core challenge: natural and high-quality speech synthesis heavily relies on sufficient contextual information. The model requires not only historical text to maintain coherence but also future text (lookahead) to accurately predict prosodic features such as stress and pauses [5]. Thus, a restricted receptive field results in unnatural prosody. Existing approaches [1, 6, 7] attempt to address this using local context modeling, but typically require complex causal modifications to the attention mechanism and rely on precise text–speech forced alignment [8, 9].

Meanwhile, with the widespread adoption of large language model (LLM) architectures in the speech domain, modern TTS systems increasingly adopt cross-modality modeling that interleaves text and speech tokens [10, 11, 12, 13, 14, 15]. Although such architectures achieve state-of-the-art synthesis quality, enabling streaming generation with streaming text input introduces a second challenge: long-form performance collapse caused by unbounded generation history. For instance, the CosyVoice series [16, 17] employs an interleaved arrangement of fixed numbers of text and speech tokens. Similarly, the streaming version of Qwen3-TTS [18] adopts a comparable interleaved structure, albeit arranging text and speech tokens in parallel. Yet in real-world scenarios with long-term continuous input, since the speech length corresponding to a single text token varies enormously, the physical distance between text and its associated speech tokens gradually widens. This ultimately causes generation failure, making it difficult to support long-term streaming interactions.

(* These authors contributed equally to this work.)
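A toy sketch of the interleaved arrangement just described may help; token values here are made up, and the 5:15 text-to-speech ratio matches CosyVoice2's native streaming mode as reported later in the paper.

```python
def interleaved_sequence(n_chunks, text_per_chunk=5, speech_per_chunk=15):
    """CosyVoice-style interleaving: fixed-size text and speech chunks share
    one sequence, so the model's context (and KV cache) grows with every
    chunk of input rather than staying bounded."""
    seq = []
    for i in range(n_chunks):
        seq += [f"text{i}"] * text_per_chunk
        seq += [f"speech{i}"] * speech_per_chunk
    return seq

# Context length after 10 vs. 100 input chunks: it grows without bound.
print(len(interleaved_sequence(10)), len(interleaved_sequence(100)))  # 200 2000
```

This unbounded growth is exactly what the paper identifies as the cause of long-form collapse in interleaved baselines.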
SpeakStream [19] alleviates this issue through character-level alignment and history truncation, but it relies heavily on precise alignment annotations. These limitations motivate the following research question: Can robust long-form streaming be achieved in LLM-based TTS with streaming text input using only weakly time-aligned data without architectural modifications?

To tackle this question, we propose a novel post-training strategy for prosodic-boundary-aware adaptation that adapts existing LLM-based TTS models for robust streaming using only weakly time-aligned data. Our contributions are as follows:
• We introduce a prosodic-boundary-aware adaptation combined with a windowed lookahead mechanism, allowing models to anticipate future text to improve prosody without requiring complex causal modifications.
• We design an acoustic prompting method utilizing the previous chunk's audio tail, which ensures seamless concatenation and mitigates generation collapse in long-form cross-modality continuous streaming.
• We demonstrate state-of-the-art streaming stability and robustness using only weakly time-aligned open-source data, significantly outperforming existing interleaved baselines in real-time deployment.

Audio demos are available at: https://anonymous-demo-168.github.io/Prosodic-Boundary-Aware-Streaming-Text-TTS-demo/

arXiv:2603.06444v1 [cs.SD] 6 Mar 2026

2. Methodology

2.1. Prosodic-Boundary Marker

To enable streaming generation while preserving prosodic naturalness, we formulate the input as a bifurcated sequence via a prosodic-boundary-aware marker, marker_boundary, to decouple the acoustic generation span from the broader prosodic context.

Figure 1: The proposed fine-tuning and inferencing pipeline for prosodic boundary-aware streaming LLM-based TTS generation.

As illustrated in Figure 1, the model learns to treat this marker as a soft boundary. During inference, the marker is inserted every k words, allowing the model to leverage limited future context for prosodic planning while preventing unbounded generation context.

2.2. Training with Weakly Time-Aligned Supervision

The model is adapted using word-level timestamps obtained from an off-the-shelf aligner, WhisperX [20], to approximate candidate prosodic boundaries without manual annotation. Given an utterance with text tokens x_{1:T}, speech tokens s_{1:L}, and word-level boundaries (where word j maps to a text-token span [b_j, e_j] and audio-end time a_j), we apply Dynamic Boundary Insertion. During training, we stochastically decide with probability p_full to use the full, unmodified utterance to preserve global coherence.
Otherwise, we randomly sample a word index m and insert the boundary marker into the text sequence:

x' = (x_{1:e_m}, marker_boundary, x_{e_m+1:T})    (1)

The target speech sequence is truncated to the aligned audio position of word m. Let r_s denote the speech-token frame rate and ℓ_min a minimum length. The truncated target length L' is:

L' = max(ℓ_min, ⌊a_m · r_s⌋)    (2)

The language model is then fine-tuned to predict the truncated stream s_{1:L'} conditioned on x'. This procedure trains the model to interpret the marker as both a segmentation cue and a prosodic anchor, ensuring acoustic output is aligned only with the segment preceding the marker.

2.3. Bounded Context and Sliding-Window Continuation

During inference, input text is processed in chunks of k words, with a lookahead of f future words. For chunk index t, let x^t_cur denote the current segment and x^t_fut the lookahead text. The input sequence is constructed as:

X^t = (x^t_cur, marker_boundary, x^t_fut)    (3)

To maintain cross-chunk continuity, we employ a Sliding-Window Prompt. The first chunk is conditioned on a reference utterance (x_ref, s_ref) for zero-shot voice cloning. For subsequent chunks, the prompt (p^t_x, p^t_s) is replaced with the text and speech tokens synthesized in the previous step:

(p^t_x, p^t_s) = (x_ref, s_ref) if t = 1;  (x̂_{t−1}, ŝ_{t−1}) if t ≥ 2    (4)

where x̂_{t−1} denotes the previously generated text segment corresponding to the current chunk (excluding lookahead), and ŝ_{t−1} is the corresponding synthesized speech. This design keeps the Key-Value (KV) cache bounded by O(k + f) regardless of total sequence length, preventing both latency growth and long-form instability. Finally, the generated speech tokens are passed to a streaming vocoder for incremental waveform synthesis, enabling seamless concatenation across chunks.

3. Experiments

3.1.
Training Datasets

We train our model on the sampled English subset of CommonVoice 13.0 [21], containing approximately 930k utterances (about 1,000 hours). To ensure zero-shot evaluation integrity, we remove any training utterances overlapping with the Seed-TTS-Eval test set [22]. Before training, we pre-extract speech tokens, mel-spectrograms, text tokens, and speaker embeddings. Word-level timestamps are obtained using WhisperX [20] to approximate candidate prosodic boundaries. During fine-tuning, the probability of using the full utterance is set to p_full = 0.15. The speech tokenizer operates at a frame rate of r_s = 25 Hz, and the minimum truncated length is ℓ_min = 5 frames.

Table 1: Evaluation on streaming efficiency. Chunk size k = 5 words; lookahead f = 2 words for our proposed method.

System | RTF↓ | TTFA (ms)↓
Interleaved | 0.843 | 1414
Sliding-Window | 0.718 | 2588
Boundary-Aware | 0.782 | 1296

Table 2: Objective quality evaluation on Seed-TTS-Eval. Chunk size k = 5 words; lookahead f = 2 words for our proposed method. (Std = Standard, Long = Long-form.)

System | WER (%)↓ Std / Long | SPK-SIM↑ Std / Long | EMO-SIM↑ Std / Long
Interleaved | 7.48 / 70.97 | 0.53 / 0.56 | 0.899 / 0.899
Sliding-Window | 6.03 / 7.83 | 0.57 / 0.22 | 0.912 / 0.857
Boundary-Aware (Ours) | 4.03 / 4.77 | 0.64 / 0.65 | 0.918 / 0.912

Table 3: Subjective MOS evaluation on Seed-TTS-Eval. Higher scores indicate better perceptual quality. (Std = Standard, Long = Long-form.)

System | MOS↑ Std / Long | SMOS↑ Std / Long | EMOS↑ Std / Long
Interleaved | 3.99±0.16 / 3.18±0.23 | 4.05±0.15 / 3.24±0.22 | 4.03±0.15 / 3.21±0.18
Sliding-Window | 3.43±0.20 / 1.60±0.18 | 3.18±0.20 / 1.68±0.18 | 3.25±0.18 / 1.67±0.17
Boundary-Aware (Ours) | 4.28±0.14 / 4.13±0.13 | 4.25±0.13 / 4.24±0.13 | 4.38±0.12 / 4.19±0.13

3.2. Evaluation Datasets

We evaluate the system under two complementary tiers to assess both sentence-level quality and long-form robustness.

Standard-form Evaluation: We use the Seed-TTS-Eval benchmark [22], which consists of short read-style sentences with reference prompts.
This set evaluates standard sentence-level streaming synthesis.

LLM-expanded Long-form Evaluation: To stress-test long-form stability, we construct an expanded benchmark by extending each Seed-TTS-Eval sentence into a coherent paragraph of 280–320 words using DeepSeek-V3 [23], while preserving the original sentence as the opening segment. This setting evaluates prosodic consistency and synthesis stability during extended monologues.

3.3. Evaluation Metrics

Streaming efficiency is evaluated using Time-to-First-Audio (TTFA) and Real-Time Factor (RTF). TTFA measures the latency between the initial synthesis request and the first decodable audio chunk, reflecting perceived responsiveness. An RTF < 1.0 indicates real-time synthesis capability. To avoid hardware warm-up effects, we report averages over 50 trials following two warm-up runs for each utterance. The latency metrics are measured on a single NVIDIA A40 GPU (48 GB VRAM).

For synthesis quality, we follow the Seed-TTS-Eval protocol [22]. Linguistic accuracy is measured by Word Error Rate (WER) using the Parakeet-TDT-0.6B-v2 model [24], which supports long-form transcription beyond the typical 30-second limit. Acoustic fidelity is evaluated via cosine similarities of utterance embeddings using WavLM-Large for Speaker Similarity (SPK-SIM) and emotion2vec+ [25] for Emotional Similarity (EMO-SIM). We further conduct subjective listening tests where 20 evaluators rate the generated speech on a 5-point Likert scale for Intelligibility (MOS), Speaker Similarity (SMOS), and Emotion Similarity (EMOS). For similarity ratings, generated samples are compared against the reference prompts. All subjective scores are reported as means with 95% confidence intervals computed via standard t-tests.

3.4. Experimental Setup and Baselines

3.4.1. Base Architecture

We adopt CosyVoice2 [16] as the foundation model.
The architecture consists of a Qwen-based LLM that generates speech tokens, followed by a flow-matching module and a pretrained HiFi-GAN vocoder [26] for waveform synthesis. During training, only the LLM is fine-tuned using the proposed method (Section 2); the flow-matching module and vocoder remain frozen.

3.4.2. Baselines and Proposed Method

In this section, we describe the baselines. All systems use a fixed chunk size of k = 5 words for streaming input.

Interleaved Baseline: We use the native streaming implementation of CosyVoice2 (inference bistream), where text and speech tokens are interleaved at a 5:15 ratio within a single sequence. The KV cache grows throughout generation, and streaming vocoding is used for incremental waveform generation. We use the original pre-trained weights without boundary-aware adaptation.

Sliding-Window Baseline: We implement a simplified sliding-window prompting strategy in which each chunk is conditioned on the previous chunk's generated tokens. Unlike our method, it does not use boundary markers or lookahead context, meaning generation relies solely on past history. Offline batch vocoding is applied to synthesize each chunk.

Proposed Method (Boundary-Aware): Our method combines sliding-window continuation with the proposed prosodic-boundary marker and a lookahead window of f = 2 words. Streaming vocoding is employed for incremental audio synthesis. By incorporating limited future context while keeping the effective context length bounded by O(k + f), the system achieves stable long-form streaming generation.

4. Results and Discussion

4.1. Streaming Latency and Efficiency

The streaming latency and efficiency results are summarized in Table 1. Our proposed method achieves the lowest TTFA (1296 ms), outperforming both the Interleaved and Sliding-Window baselines. This improvement stems from the boundary-aware prompting mechanism, which enables earlier audio emission than native streaming implementations.
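For reference, the two latency metrics can be computed from wall-clock timestamps roughly as follows; this is a sketch with made-up timings, not the paper's measurement code.

```python
def latency_metrics(t_request, t_first_audio, t_done, audio_seconds):
    """TTFA: delay from the synthesis request to the first decodable audio
    chunk, in milliseconds. RTF: total synthesis time divided by generated
    audio duration (< 1.0 means faster than real time)."""
    ttfa_ms = (t_first_audio - t_request) * 1000.0
    rtf = (t_done - t_request) / audio_seconds
    return ttfa_ms, rtf

# Made-up example: first audio after 1.3 s, 8 s of audio produced in 6.4 s.
ttfa, rtf = latency_metrics(0.0, 1.3, 6.4, 8.0)
print(round(ttfa), round(rtf, 2))  # 1300 0.8
```

As in the paper's protocol, such measurements would be averaged over repeated trials after warm-up runs to avoid hardware warm-up effects.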
Figure 2: Ablation study on quality evaluation with different combinations of chunk size k and lookahead context f.

For RTF, the Sliding-Window baseline achieves the lowest value (0.718). However, as discussed in Section 3.4.2, this result is attributed to its use of offline batch vocoding, which maximizes GPU throughput but does not support incremental emission. Between systems using streaming vocoding, our method achieves a lower RTF (0.782) than the Interleaved baseline (0.843). This indicates that bounding the context length improves computational efficiency by preventing unbounded KV-cache growth.

4.2. Synthesis Quality and Linguistic Fidelity

Tables 2 and 3 report the objective and subjective results across both evaluation tiers. Our proposed method consistently achieves the best performance across all metrics.

4.2.1. Objective Evaluation

On the standard-form set, our method achieves a WER of 4.03%, outperforming Interleaved (7.48%) and Sliding-Window (6.03%). This shows that the boundary marker and lookahead context stabilize linguistic generation even in short-form synthesis. The model also achieves the highest SPK-SIM (0.64) and EMO-SIM (0.918), suggesting that future context improves speaker consistency and emotional expression.

The performance gap becomes more pronounced in the long-form evaluation. The Interleaved baseline exhibits catastrophic failure, with WER increasing to 70.97%. This behavior results from unbounded context growth: as the KV cache accumulates, the model experiences semantic drift and hallucinations, leading to "garbled" speech and premature termination with severe deletion errors. The Sliding-Window baseline maintains a stable WER (7.83%) but suffers a severe degradation in speaker similarity (SPK-SIM drops from 0.57 to 0.22).
Without boundary markers or lookahead context, the model cannot properly predict sentence boundaries, causing prosodic drift across segments and a decline in speaker and emotional consistency (EMO-SIM drops to 0.857). In contrast, our Boundary-Aware method maintains stable performance across long-form synthesis, achieving WER 4.77%, SPK-SIM 0.65, and EMO-SIM 0.912. By bounding the context length and incorporating the lookahead window for prosodic conditioning, the approach effectively mitigates both the hallucination problem of the Interleaved architecture and the prosodic drift of naive Sliding-Window methods.

4.2.2. Subjective Evaluation

Subjective listening results confirm the objective findings. While the Interleaved baseline maintains acceptable quality on the standard set (3.99±0.16), its performance plummets to 3.18±0.23 in the long-form scenario due to linguistic instability. The Sliding-Window baseline degrades further, reaching a MOS of 1.60±0.18, as listeners frequently observed prosodic discontinuities between segments.

Our Boundary-Aware method achieves the highest perceptual scores across all metrics. In particular, it maintains strong speaker identity (4.24±0.13 SMOS) and emotional consistency (4.19±0.13 EMOS) even in long-form synthesis. These results indicate that the proposed boundary-aware conditioning successfully preserves prosodic continuity while maintaining robust streaming generation.

4.3. Ablation Studies

As shown in Figure 2, we analyze the trade-off between streaming latency and synthesis quality by varying chunk size k ∈ {1, …, 10} and lookahead context f ∈ {1, …, 6} under the constraint f ≤ k. For evaluation, we sample a fixed set of 50 utterances each from the Standard and Long-form tiers.

Linguistic fidelity (WER) is highly sensitive to the initial context size. At k = 1, f = 1, the model lacks sufficient semantic grounding, resulting in peak WER values of 33.27% and 23.77% for the Standard and Long-form sets.
As k increases, performance stabilizes rapidly; for k ≥ 3, the Standard-tier WER falls below 5%. A similar trend is observed in the Long-form tier, indicating that a relatively small semantic anchor is sufficient to stabilize generation. However, excessive lookahead relative to the chunk size can degrade performance. For instance, at k = 10, increasing f from 2 to 6 raises the Long-form WER from 3.03% to 12.98%. This suggests overly strong future conditioning may destabilize generation when it dominates the current segment context.

Speaker and emotion consistency (SPK-SIM and EMO-SIM) are more robust but still benefit from moderate context. Speaker similarity increases from approximately 0.50 at minimal context to above 0.65 for k ≥ 5, while emotional similarity remains consistently high (> 0.90) across configurations.

5. Conclusion

This paper presents a boundary-aware post-training strategy for streaming LLM-based text-to-speech with streaming text input. By introducing a prosodic-boundary marker and bounded sliding-window prompting, the method stabilizes prosodic generation while preventing unbounded context growth. Experiments on Seed-TTS-Eval show improved streaming latency and synthesis quality over interleaved and sliding-window baselines, while maintaining stable long-form generation with consistent speaker identity and emotion. Future work will explore generalization to other LLM-based TTS architectures and multilingual settings, as well as adaptive boundary prediction for more flexible streaming generation.

6. Generative AI Use Disclosure

During the preparation of this manuscript, the authors used generative AI tools to assist with language refinement, clarity improvement, and LaTeX formatting. These tools were employed solely for editorial support. All scientific content, experimental results, and interpretations were developed and verified by the authors.

7. References

[1] T. Dang, D. Aponte, D. Tran, T. Chen, and K. Koishida, "Zero-shot text-to-speech from continuous text streams," arXiv preprint arXiv:2410.00767, 2024.
[2] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, "Moshi: A speech-text foundation model for real-time dialogue," arXiv preprint arXiv:2410.00037, 2024.
[3] G. Pamisetty, R. A. Easow, K. Gupta, and K. S. R. Murty, "Stream-TTS: A low-latency text-to-speech using Kolmogorov-Arnold networks for streaming speech applications," in ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–2.
[4] H. Sun, S. Hu, S. Liu, L. Meng, H. Wang, B. Han, Y. Yang, Y. Liu, S. Zhao, Y. Lu et al., "Zero-shot streaming text to speech synthesis with transducer and auto-regressive modeling," arXiv preprint arXiv:2505.19669, 2025.
[5] Z. Sheng, Z. Du, S. Zhang, Z. Yan, Y. Yang, and Z. Ling, "SyncSpeech: Low-latency and efficient dual-stream text-to-speech based on temporal masked transformer," 2025. [Online]. Available: https://arxiv.org/abs/2502.11094
[6] A. Dekel, S. Shechtman, R. Fernandez, D. Haws, Z. Kons, and R. Hoory, "Speak while you think: Streaming speech synthesis during text generation," in ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 11931–11935.
[7] T. Dang, D. Aponte, D. Tran, and K. Koishida, "LiveSpeech: Low-latency zero-shot text-to-speech via autoregressive modeling of audio discrete codes," arXiv preprint arXiv:2406.02897, 2024.
[8] H. Bai, R. Zheng, J. Chen, M. Ma, X. Li, and L. Huang, "A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing," in International Conference on Machine Learning, PMLR, 2022, pp. 1399–1411.
[9] M. Kim, M. Jeong, B. J. Choi, D. Lee, and N. S. Kim, "Transduce and speak: Neural transducer for text-to-speech with semantic token prediction," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2023, pp. 1–7.
[10] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., "AudioLM: A language modeling approach to audio generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2523–2533, 2023.
[11] S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, "Neural codec language models are zero-shot text to speech synthesizers," IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 705–718, 2025.
[12] T. Wang, H. Wang, M. Ge, C. Gong, C. Qiang, Z. Ma, Z. Huang, G. Yang, X. Wang, E. Chng, X. Chen, L. Wang, and J. Dang, "Word-level emotional expression control in zero-shot text-to-speech synthesis," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[13] T. A. Nguyen, B. Muller, B. Yu, M. R. Costa-Jussa, M. Elbayad, S. Popuri, C. Ropers, P.-A. Duquenne, R. Algayres, R. Mavlyutov et al., "SpiRit-LM: Interleaved spoken and written language model," Transactions of the Association for Computational Linguistics, vol. 13, pp. 30–52, 2025.
[14] H. Wang, Y. Yang, S. Liu, J. Li, L. Meng, Y. Liu, J. Zhou, H. Sun, Y. Lu, and Y. Qin, "StreamMel: Real-time zero-shot text-to-speech via interleaved continuous autoregressive modeling," IEEE Signal Processing Letters, 2025.
[15] Y. Yang, S. Liu, J. Li, H. Wang, L. Meng, H. Sun, Y. Liang, Z. Ma, Y. Hu, R. Zhao et al., "Interleaved speech-text language models for simple streaming text-to-speech synthesis," arXiv preprint arXiv:2412.16102, 2024.
[16] Z. Du et al., "CosyVoice 2: Scalable streaming speech synthesis with large language models," arXiv preprint arXiv:2412.10117, 2024.
[17] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., "CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training," arXiv preprint arXiv:2505.17589, 2025.
[18] H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., "Qwen3-TTS technical report," arXiv preprint arXiv:2601.15621, 2026.
[19] R. H. Bai, Z. Gu, T. Likhomanenko, and N. Jaitly, "SpeakStream: Streaming text-to-speech with interleaved data," 2025. [Online]. Available: https://arxiv.org/abs/2505.19206
[20] M. Bain, J. Huh, T. Han, and A. Zisserman, "WhisperX: Time-accurate speech transcription of long-form audio," INTERSPEECH 2023, 2023.
[21] R. Ardila et al., "Common Voice: A massively-multilingual speech corpus," in Proceedings of LREC, 2020.
[22] P. Anastassiou et al., "Seed-TTS: A family of high-quality versatile speech generation models," arXiv preprint arXiv:2406.02430, 2024.
[23] DeepSeek-AI, "DeepSeek-V3 technical report," 2024. [Online]. Available: https://arxiv.org/abs/2412.19437
[24] NVIDIA, "nvidia/parakeet-tdt-0.6b-v2," 2026, accessed via CROVIA transparency registry. [Online]. Available: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
[25] Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, "emotion2vec: Self-supervised pre-training for speech emotion representation," in Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, 2024, pp. 15747–15760. [Online]. Available: https://aclanthology.org/2024.findings-acl.931/
[26] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," 2020. [Online]. Available: https://arxiv.org/abs/2010.05646