Paper deep dive
MOSS-TTS Technical Report
Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
Abstract
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
Links
- Source: https://arxiv.org/abs/2603.18090v1
- Canonical: https://arxiv.org/abs/2603.18090v1
Full Text
OpenMOSS MOSS-TTS Technical Report, SII-OpenMOSS Team*

Homepage: https://mosi.cn/models/moss-tts
Online Demo: https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS
AI Studio: https://studio.mosi.cn/voice-synthesis
Hugging Face: https://huggingface.co/OpenMOSS-Team/MOSS-TTS
GitHub: https://huggingface.co/collections/OpenMOSS-Team/moss-tts

1 Introduction

Text-to-speech (TTS) has evolved from task-specific pipelines into a broader paradigm of speech generation that is expected to behave like a foundation model: it should generalize across speakers, languages, speaking styles, and acoustic conditions; support controllable and low-latency synthesis; and remain stable over long-form content [1-4]. Recent progress increasingly resembles the scaling trajectory of large language models, where model capacity and data scale unlock emergent capabilities beyond narrow benchmarks [5, 6].
At the same time, scaling speech generation is not a simple matter of "bigger models." Modern approaches must reconcile competing requirements in representation learning and pretraining: (i) the discrete token representation must be compact enough for efficient sequence modeling yet expressive enough to preserve both semantic content and fine-grained acoustics; (ii) the generative model must remain stable over long sequences while staying compatible with streaming constraints; and (iii) the training signal must scale across diverse, noisy, real-world data without relying on brittle cascaded supervision. Much of the recent literature addresses these tensions by introducing multiple intermediate targets, external semantic teachers, refinement stages, or post-hoc alignment. Such designs can be effective, but they often complicate scaling because each additional module introduces a new supervision contract, new failure modes, and new latency budgets [2, 7-10].

* Full contributors can be found in the Contributors section.

arXiv:2603.18090v1 [cs.SD] 18 Mar 2026

This report argues for a return to the core of speech generation: learn a high-quality audio tokenizer, train an autoregressive (AR) model over its tokens, and pretrain at scale. Concretely, we pursue the recipe of discrete tokens + AR modeling + large-scale pretraining, and show that it provides a clean and scalable path to strong quality and controllability in practice. The key intuition is that a sufficiently capable tokenizer turns speech generation into a token prediction problem with a single, universal modeling objective, much like language modeling, thereby making it easier to scale data, compute, and downstream capabilities without continuously expanding the model stack.

MOSS-TTS combines three core components. (1) A high-quality audio tokenizer. We build on MOSS-Audio-Tokenizer [11], a causal Transformer-based discrete tokenizer designed for large-scale AR modeling.
It supports variable-bitrate residual vector quantization (RVQ), compressing 24 kHz audio to 12.5 fps and enabling streaming-friendly, frame-level encoding and decoding, while preserving high-fidelity reconstruction and semantically informative tokens. Unlike approaches that depend on external pretrained audio encoders or multi-stage distillation [7, 8, 12-14], MOSS-Audio-Tokenizer is trained end-to-end to jointly optimize acoustic reconstruction and semantic alignment, aiming to maximize scalability and minimize inherited bottlenecks.

(2) Large-scale, high-quality pretraining data. We build a large-scale data pipeline that converts raw open-domain recordings into trainable single-speaker assets with cross-consistency gating (speaker consistency, language consistency, and transcript validity). The resulting corpus spans millions of hours in total, with the majority consisting of carefully filtered multilingual TTS-style supervision and targeted supplements for voice cloning and controllability. This data-centric foundation is essential for robustness across domains (podcasts, audiobooks, broadcast & news, film & drama, commentary, and online content) and for multilingual and code-switching behavior.

(3) Refined discrete-token modeling for speech generation. On top of the tokenizer, we study and deploy discrete AR modeling strategies that remain efficient and stable for long-form synthesis. To serve both research reproducibility and practical deployment constraints, we explore two architectures with explicit tradeoffs. The Delay-Pattern model (MOSS-TTS) uses a single Transformer backbone with multiple prediction heads and an RVQ-aware delay schedule, prioritizing structural simplicity, scalability, and a clean long-context operating point.
The Global-Latent + Local Transformer model (MOSS-TTS-Local-Transformer) introduces an additional frame-local autoregressive module that is more complex but more learning-efficient, yielding stronger speaker preservation at smaller scale and a shorter time to first audio.

These components yield a practical speech generation foundation model with a broad capability set, including zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, multilingual synthesis with smooth code-switching (notably between Chinese and English), and stable long-form generation up to hour-scale outputs.

Contributions. This technical report makes the following contributions:
• We present MOSS-TTS, a discrete-token autoregressive speech generation foundation model built on a scalable discrete + AR + pretraining recipe.
• We integrate and analyze MOSS-Audio-Tokenizer [11] as a universal, streaming-compatible audio tokenizer with variable bitrate and unified semantic-acoustic representations.
• We present a large-scale, high-quality data pipeline that supports training on millions of hours of data and enables robust multilingual pretraining and controllable synthesis behavior.
• We release and compare two complementary discrete AR architectures (Delay-Pattern vs. Global-Latent + Local Transformer) that expose a clear tradeoff between structural simplicity/scalability and modeling efficiency/quality.
• We demonstrate broad controllability features (voice cloning, duration control, pronunciation control) and strong empirical performance on speaker similarity and quality metrics.

Organization. The remainder of the report is organized as follows. We begin with an overview of related work. We then describe the audio tokenizer and the overall modeling architectures, followed by the pretraining data pipeline and training recipe. We next present the evaluation results before concluding.
2 Related Work

MOSS-TTS sits at the intersection of discrete audio tokenization, large-scale autoregressive sequence modeling, and speech generation foundation models. We review the most relevant directions below.

Neural audio codecs and discrete audio tokenization. Discrete representations have become a standard foundation for scalable audio generation, following the broader success of vector quantization in representation learning [15]. Neural codecs such as SoundStream [16] and subsequent high-fidelity compression models [17, 18] demonstrate that a learned encoder-quantizer-decoder stack can support low-bitrate reconstruction while remaining compatible with downstream sequence modeling. Recent toolkits and open implementations further accelerate codec research and adoption [19, 20]. For speech generation in particular, an effective tokenizer must not only reconstruct waveforms, but also expose tokens that are semantically aligned with text and robust under long-horizon generation. Several recent lines of work explore semantic shortcomings and the semantic-acoustic tradeoff in codec tokens [8, 14, 21], motivating tokenizers that better balance compression, perceptual quality, and text-aligned semantics.

Audio language modeling with discrete tokens. With discrete tokens, audio generation can be cast as token sequence modeling, enabling language-model-like scaling and training recipes [22-24]. Codec language models have been shown to produce intelligible speech and even zero-shot TTS behavior when trained autoregressively over discrete units [25]. Concurrently, a growing body of work studies how token choices and modeling decisions affect controllability, semantic fidelity, and efficiency [26, 27]. MOSS-TTS follows this trend but emphasizes a tokenizer and modeling stack designed to scale end-to-end without external pretrained audio teachers, aligning the discrete token representation with the requirements of AR speech generation.
TTS architectures: AR, NAR, diffusion/flow, and foundation-model scaling. Classical neural TTS systems progressed from AR acoustic modeling and neural vocoders [28-30] to faster and more controllable NAR frameworks [31, 32], as well as flow-based and diffusion-based synthesis [33, 34]. End-to-end approaches such as VITS [35] further unified acoustic modeling and waveform generation, improving simplicity and sample quality. More recently, scaling-driven and token-centric systems increasingly combine discrete representations with AR backbones for robustness and controllability at scale [36], as reflected in recent open technical reports and large-scale systems such as Qwen3-TTS [1], CosyVoice [9], CosyVoice 3 [2], Seed-TTS [3], Fish-Speech [37], and FireRedTTS-2 [4]. Across these efforts, a recurring theme is that scaling data and model capacity alone is insufficient without a well-chosen discrete tokenizer and a model design that remains compatible with streaming, controllability, and long-context stability. MOSS-TTS complements this line of work by focusing on a fully discrete tokenization pipeline and token modeling strategies that remain efficient for long-form synthesis, while explicitly comparing two autoregressive architectures under the same tokenizer and large-scale pretraining recipe.

Voice cloning and controllability. Practical TTS systems increasingly demand controllability beyond text content, including speaker identity (voice cloning), speaking rate/duration control, and fine-grained pronunciation control. Zero-shot voice cloning and multilingual universal generation have been explored via large-scale generative models and conditioning mechanisms [38-40]. Token-centric systems also enable control signals to be expressed directly in the discrete domain, which can simplify modeling and improve stability compared to waveform-level control.
MOSS-TTS emphasizes token-level duration control and phoneme-/pinyin-level pronunciation interfaces, aiming to make control explicit and composable.

3 Audio Tokenizer

3.1 Motivation and Design Principles

Audio tokenizers serve as the foundational bridge for native Audio Large Language Models (Audio LLMs), transforming continuous raw audio signals into discrete tokens that can be seamlessly processed within a unified generative framework. A unified audio tokenizer for speech LLMs must satisfy two primary requirements: enabling high-fidelity reconstruction of diverse audio signals and maintaining compatibility with the sequential nature of autoregressive modeling [8, 41, 42].

Existing approaches typically address these requirements through pretrained audio encoders (e.g., HuBERT, Whisper) [12-14, 43], multi-stage training pipelines [19, 44], or architecture-specific inductive biases such as specialized CNN structures [16-18]. These designs often introduce external dependencies and architectural constraints that hinder the seamless scaling of model capacity, data volume, and quantization levels. Drawing inspiration from the success of LLMs, where simple, scalable architectures trained on massive datasets have proven superior [5, 6], we posit that the performance ceiling of audio tokenizers can be raised by adopting a similar philosophy. We advocate for a simple, end-to-end scalable architecture that minimizes reliance on external priors or complex heuristics, emphasizing joint optimization and large-scale data exposure.

To address these limitations and support high-quality speech synthesis in MOSS-TTS, we use MOSS-Audio-Tokenizer, a high-performance audio tokenizer based on the CAT (Causal Audio Tokenizer with Transformer) architecture [11].
MOSS-Audio-Tokenizer is characterized by the following core strengths:

• High Compression and Variable Bitrate: The model achieves a significant compression ratio, converting 24 kHz audio into a discrete representation at only 12.5 frames per second (fps). Utilizing a 32-layer Residual Vector Quantization (RVQ) mechanism, it supports flexible bitrate adjustment from 0.125 to 4 kbps, catering to various high-fidelity reconstruction requirements.
• Pure Transformer Architecture: Unlike traditional codecs that rely on complex, hand-crafted CNN or hybrid CNN-Transformer blocks, MOSS-Audio-Tokenizer adopts a minimalist causal Transformer-based design. This architecture is intentionally unencumbered by specialized inductive biases, making it simple to implement and efficient to scale up. With a substantial 1.6-billion-parameter capacity, the model demonstrates strong representation power, while its inherently causal nature enables seamless, frame-level streaming inference.
• Universal Audio Representation: The model is pretrained on millions of hours of diverse audio data, including speech, music, and environmental sound effects, supporting robust generalization across audio domains.
• Unified Semantic-Acoustic Modeling: The discrete tokens produced by MOSS-Audio-Tokenizer preserve strong reconstruction quality while inherently capturing rich semantic information, making them well suited for autoregressive LLM modeling.
• End-to-End Joint Optimization: All components, including the encoder, quantizer, decoder, discriminators, and the LLM used for semantic alignment, are optimized jointly to maximize the model's performance ceiling.

3.2 Architecture

As illustrated in Figure 1, MOSS-Audio-Tokenizer adopts an RVQ-GAN framework for training. The model consists of five components: a causal encoder, a residual vector quantizer (RVQ), a causal decoder, a decoder-only LLM for semantic modeling, and adversarial discriminators.
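The compression and bitrate figures above are easy to sanity-check: at 12.5 fps with 1024-entry codebooks (10 bits per code), keeping between 1 and 32 RVQ layers spans exactly the stated 0.125-4 kbps range. A minimal sketch, using the constants from the report; the helper name `bitrate_kbps` is illustrative, not part of the released code:

```python
import math

# Constants as stated in the report.
SAMPLE_RATE = 24_000   # Hz, input audio
FRAME_RATE = 12.5      # frames per second after encoding
CODEBOOK_SIZE = 1024   # entries per RVQ layer -> log2(1024) = 10 bits per code
MAX_LAYERS = 32        # RVQ depth

def bitrate_kbps(num_layers: int) -> float:
    """Bitrate when keeping only the first `num_layers` RVQ layers."""
    bits_per_code = math.log2(CODEBOOK_SIZE)
    return num_layers * bits_per_code * FRAME_RATE / 1000

print(SAMPLE_RATE / FRAME_RATE)   # 1920.0 samples folded into each frame
print(bitrate_kbps(1))            # 0.125 (kbps, coarsest layer only)
print(bitrate_kbps(MAX_LAYERS))   # 4.0   (kbps, all 32 layers)
```

The 1920-sample frame also matches the cumulative encoder patch sizes (240 x 2 x 2 x 2) described later in this section.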
Fully Transformer-based Encoder and Decoder. The encoder and decoder of MOSS-Audio-Tokenizer each consist of 68 causal Transformer blocks. To facilitate efficient streaming inference, both components use a 10-second sliding-window attention mechanism.

Figure 1: Architecture of MOSS-Audio-Tokenizer. Both the encoder and decoder are built upon causal Transformers. All components, including the encoder, quantizer, decoder, decoder-only LLM, and discriminator, are optimized jointly in an end-to-end manner.

To progressively reduce the sequence length, the encoder incorporates patchify operations [45] at the input stage and following layers 12, 24, and 36, with respective patch sizes of 240, 2, 2, and 2. Since these patchify operations alter the feature dimensionality, a linear projection is applied after each stage to map the hidden states to the corresponding Transformer block dimension. This configuration effectively downsamples raw 24 kHz waveforms to a low frame rate of 12.5 fps. The encoder is structured into four stages with hidden dimensions of 768, 768, 768, and 1280, containing 12, 12, 12, and 32 Transformer blocks, respectively. For each stage, the feed-forward network (FFN) dimension is set to four times the hidden dimension. The multi-head self-attention mechanism uses 12, 12, 12, and 20 attention heads across the four stages. All Transformer blocks employ rotary positional embeddings (RoPE) [46]. The decoder mirrors the encoder architecture in a fully causal manner. Both the encoder and decoder contain approximately 0.8B parameters and are trained from scratch.

Residual Vector Quantization. Discrete tokenization is performed using a 32-layer residual vector quantizer (RVQ).
Each layer employs a codebook of size 1024 with factorized vector quantization (latent dimension 8) [18] and L2-normalized codes. To enable variable-bitrate tokenization, random quantizer dropout [16] with a probability of 1.0 is applied during training.

Semantic Supervision. To encourage the learning of semantically structured discrete representations, we attach a 0.5B decoder-only causal language model [47] as a semantic head. This head provides audio-to-text supervision by autoregressively predicting text conditioned on the quantizer outputs. The supervision tasks include automatic speech recognition (ASR), multi-speaker ASR, and audio captioning.

Perceptual Modeling. To enhance the perceptual quality of the reconstructed audio, we employ a multi-period discriminator [17] and a complex STFT discriminator [18] for adversarial training with the audio tokenizer.

3.3 Training

MOSS-Audio-Tokenizer is trained on a massive dataset comprising millions of hours of both public and in-the-wild audio data. During training, we employ a multi-task learning framework that enables MOSS-Audio-Tokenizer to achieve both robust semantic alignment with text and high-fidelity audio reconstruction. The modeling approach for each component is detailed as follows.

Semantic Modeling via Audio-to-Text Tasks. To encourage the token representation to be semantically rich and aligned with text-based language modeling, we incorporate an auxiliary audio-to-text objective. Specifically, we employ a 0.5B-parameter decoder-only LLM [47] and condition it on the representations produced by MOSS-Audio-Tokenizer. Concretely, we feed the hidden states from the quantizer output into the LLM, which then autoregressively predicts textual tokens. We consider a diverse set of audio-to-text tasks, including automatic speech recognition (ASR), multi-speaker ASR, and audio captioning. For audio samples that are paired with textual annotations, we apply the corresponding semantic modeling objective.
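Before detailing the training objectives, the quantizer described above can be made concrete. The sketch below is a minimal NumPy illustration of residual vector quantization with random quantizer dropout, the mechanism behind variable bitrate; the random codebooks, shapes, and function names are assumptions for illustration, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, CODEBOOK_SIZE, DIM = 32, 1024, 8   # factorized latent dim 8

# Random stand-in codebooks, L2-normalized as in the report.
codebooks = rng.standard_normal((NUM_LAYERS, CODEBOOK_SIZE, DIM))
codebooks /= np.linalg.norm(codebooks, axis=-1, keepdims=True)

def rvq_encode(z, p_dropout=1.0):
    """Quantize latents z of shape (frames, DIM) layer by layer; with
    probability p_dropout, truncate to a random number of layers
    (this truncation is what yields variable bitrate)."""
    n_layers = NUM_LAYERS
    if rng.random() < p_dropout:               # p = 1.0: applied on every call
        n_layers = int(rng.integers(1, NUM_LAYERS + 1))
    residual, codes = z.copy(), []
    for c in range(n_layers):
        # Nearest codebook entry for the current residual of every frame.
        dists = np.linalg.norm(residual[:, None, :] - codebooks[c], axis=-1)
        idx = dists.argmin(axis=-1)
        codes.append(idx)
        residual = residual - codebooks[c][idx]  # pass residual to next layer
    return np.stack(codes), z - residual         # (n_layers, frames), quantized z

codes, z_q = rvq_encode(rng.standard_normal((5, DIM)))  # 5 frames of latents
```

Decoding reverses the process: the quantized latent is the sum of the selected codebook vectors across the kept layers, so dropping trailing layers degrades fidelity gracefully rather than catastrophically.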
Each task is specified by a fixed task tag \mathcal{T}, which is prepended to the LLM input. The semantic objective is optimized using a standard cross-entropy loss:

\mathcal{L}_{sem} = - \sum_{t=1}^{|s|} \log p_{\theta_{LLM}}( s_t \mid \mathcal{T}, \mathbf{q}, s_{<t} ), \quad (1)

where s = (s_1, ..., s_{|s|}) denotes the target text token sequence, \mathbf{q} denotes the sequence of quantized audio representations produced by MOSS-Audio-Tokenizer, \mathcal{T} is a task-specific prompt token, and \theta_{LLM} are the parameters of the causal language model.

Quantizer Optimization. For training simplicity and stability, each quantization layer in MOSS-Audio-Tokenizer adopts factorized vector quantization [18], where codebooks are directly optimized via gradient descent, without relying on additional codebook update mechanisms [17]. We incorporate a commitment loss and a codebook loss to jointly optimize the encoder and the codebook entries:

\mathcal{L}_{cmt} = \sum_{c=1}^{N_q} \| z_c - \mathrm{sg}(q_c(z_c)) \|_2^2, \quad (2)

\mathcal{L}_{code} = \sum_{c=1}^{N_q} \| \mathrm{sg}(z_c) - q_c(z_c) \|_2^2, \quad (3)

where z_c denotes the input to the c-th quantization layer, q_c(z_c) is the corresponding quantized output, N_q is the number of quantizers, and \mathrm{sg}(\cdot) denotes the stop-gradient operator [15].

Acoustic Modeling via Reconstruction Tasks. To ensure high-fidelity and domain-robust audio reconstruction, we adopt a multi-scale mel-spectrogram loss:

\mathcal{L}_{rec} = \sum_{i=5}^{11} \| S_{2^i}(x) - S_{2^i}(\hat{x}) \|_1, \quad (4)

where S_{2^i}(\cdot) denotes the mel-spectrogram computed using a normalized short-time Fourier transform (STFT) with window size 2^i and hop size 2^{i-2}. Here, x is the ground-truth waveform and \hat{x} is the reconstructed waveform generated by the decoder.

Adversarial Training. To further improve reconstruction fidelity and perceptual quality, we employ adversarial training with multiple discriminators.
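A compact sketch of the quantizer and reconstruction objectives in Eqs. (2)-(4), with NumPy stand-ins. One point worth making explicit: the commitment and codebook losses have the same forward value and differ only in where the stop-gradient cuts backpropagation, which a pure NumPy version cannot show:

```python
import numpy as np

def quantizer_losses(layer_inputs, layer_outputs):
    """layer_inputs[c]  ~ z_c, input to the c-th quantizer layer;
    layer_outputs[c] ~ q_c(z_c), its quantized output.
    Returns forward values of L_cmt (Eq. 2) and L_code (Eq. 3); during
    training they differ only via the stop-gradient sg(.), a no-op here."""
    l_cmt = sum(float(np.sum((z - q) ** 2))
                for z, q in zip(layer_inputs, layer_outputs))
    l_code = l_cmt  # identical forward value; gradients flow differently
    return l_cmt, l_code

# Eq. (4) sums an L1 mel distance over seven STFT scales:
# window 2^i, hop 2^(i-2), for i = 5..11.
mel_scales = [(2**i, 2**(i - 2)) for i in range(5, 12)]
# -> (32, 8), (64, 16), ..., (2048, 512)
```

At 24 kHz these windows span roughly 1.3 ms to 85 ms, so the multi-scale loss penalizes both fine transients and longer-range spectral-envelope errors.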
The discriminator loss follows the least-squares GAN (LSGAN) formulation [48], given by:

\mathcal{L}_D(x, \hat{x}) = \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - D_k(x))^2 + D_k(\hat{x})^2 \right], \quad (5)

where D_k represents the k-th discriminator, K is the total number of discriminators, x is the ground-truth audio, and \hat{x} is the predicted audio.

For the generator, we include an adversarial loss and a feature matching loss. The adversarial loss encourages the generator to produce high-fidelity audio that is indistinguishable from real samples:

\mathcal{L}_{adv}(\hat{x}) = \frac{1}{K} \sum_{k=1}^{K} (1 - D_k(\hat{x}))^2. \quad (6)

Additionally, we incorporate a feature matching loss \mathcal{L}_{feat} [49] to ensure structural similarity across multiple scales. It penalizes the \ell_1 distance between the intermediate feature maps of the discriminators for real and synthetic audio:

\mathcal{L}_{feat}(x, \hat{x}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{L_k} \sum_{l=1}^{L_k} \frac{\| D_k^l(x) - D_k^l(\hat{x}) \|_1}{\mathrm{mean}( \| D_k^l(x) \|_1 )}, \quad (7)

where D_k^l denotes the feature representation from the l-th layer of the k-th discriminator, and L_k is the number of layers in that discriminator.

Overall Training Objective. The overall generator objective is a weighted combination of all loss terms:

\mathcal{L}_G = \lambda_{sem} \mathcal{L}_{sem} + \lambda_{rec} \mathcal{L}_{rec} + \lambda_{cmt} \mathcal{L}_{cmt} + \lambda_{code} \mathcal{L}_{code} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{feat} \mathcal{L}_{feat}, \quad (8)

where the \lambda coefficients are scalar hyperparameters controlling the relative contribution of each loss term. During training, we set \lambda_{sem} = 20, \lambda_{rec} = 15, \lambda_{cmt} = 0.25, \lambda_{code} = 1.0, \lambda_{adv} = 1.0, \lambda_{feat} = 2.0.

Due to computational constraints, we adopt a two-stage training schedule to improve training efficiency: non-adversarial pretraining without discriminator-related losses for 520k steps (batch size 1536, approximately 5 hours of audio per batch), followed by adversarial fine-tuning for 500k steps (batch size 768). All modules are optimized end-to-end without pretrained encoders or semantic teachers [7, 8, 12-14].

4 Architecture

MOSS-TTS is a speech generation foundation model built upon discrete audio tokens.
To facilitate effective scaling and capitalize on the success of large language models (LLMs), we adopt a straightforward end-to-end, purely autoregressive (AR) architecture. As illustrated in Figure 2, given a text sequence and an optional speech prompt, MOSS-TTS generates the target token sequence through next-token prediction. The central architectural question is not whether to use AR modeling, but how to handle the multi-stream discrete token block produced by the tokenizer. For a 32-layer RVQ tokenizer, the chosen token modeling pattern directly determines engineering complexity, scaling behavior, decoding latency, and final synthesis quality.

Rather than committing to a single token modeling pattern a priori, we train two architectures under the same tokenizer and large-scale pretraining recipe. This serves a concrete research purpose: to isolate the effect of the token modeling pattern itself in large-scale discrete speech modeling. In practice, the two designs expose a clear tradeoff. Delay Pattern uses a structurally simple single-backbone, multi-head parameterization, making it easier to scale to large model sizes, long contexts, and optimized inference backends. Local Transformer introduces an additional frame-local autoregressive module, which increases architectural complexity but improves modeling efficiency; during internal development, it exhibited consistently lower per-layer token losses, and the later voice-cloning evaluations show stronger speaker similarity at much smaller model scale. Correspondingly, the current report uses MOSS-TTS-Local-Transformer to highlight the quality advantage of the local pattern on standard cloning benchmarks (Table 3), while MOSS-TTS serves as the main architecture for duration control, pronunciation control, and ultra-long generation (Tables 5, 7, and 6).

The tokenizer emits N_q = 32 RVQ layers.
In our implementation, both architectures predict N_h = N_q + 1 = 33 channels at each aligned step: one text-or-pad channel y_{0,t} and 32 audio channels y_{1,t}, ..., y_{N_q,t}, where y_{j,t} = a_{j,t} for j >= 1. When step t corresponds to an audio frame, y_{0,t} is trained to emit a dedicated pad symbol; on text-only steps, it emits the normal text token.

Figure 2: Architecture of MOSS-TTS. The left panel illustrates the delay pattern as described in Section 4.1, while the right panel depicts the local transformer pattern as detailed in Section 4.2.

We use the same head-wise weighted cross-entropy in both architectures, with

\boldsymbol{\lambda} = (1, 3, 3, 3, 2, 2, 2, 1, 1, \ldots, 1), \quad (9)

a vector of 33 head weights that up-weights the earliest coarse RVQ layers while keeping unit weight on the text-or-pad channel and the remaining finer layers.

4.1 Delay Pattern

To model the RVQ hierarchy efficiently without increasing the sequence length to T x N_q, we adopt a delay pattern [50]. Among the two architectures we study, this is the simpler and more scalable design: a single Transformer backbone carries the full sequence model, and each prediction channel is obtained by a lightweight head projection from the backbone hidden state.

Let s denote the input text sequence and A \in \{1, ..., V\}^{N_q \times T} be the audio token matrix, where V denotes the codebook size of each RVQ layer, N_q is the number of quantizers, and T is the number of audio frames. Each element a_{j,t} \in \{1, ..., V\} represents the token index at the j-th RVQ layer and t-th time frame. We apply a time-delay shift such that the j-th layer is shifted forward by j - 1 frames. The delayed token matrix \tilde{A} is defined as:

\tilde{a}_{j,t} = a_{j, t-(j-1)}, \quad t \in \{j, \ldots, T + j - 1\}. \quad (10)

Input Embedding. On the input side of the backbone LLM, we use N_q distinct speech embedding tables.
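The delay shift of Eq. (10) can be sketched in a few lines of NumPy, here with 0-indexed layers; PAD stands in for the positions masked out by m_{j,t} in the training objective, and the token values are illustrative:

```python
import numpy as np

PAD = -1  # stand-in for the masked/pad positions introduced by the shift

def apply_delay(A):
    """A: (num_layers, T) token matrix -> (num_layers, T + num_layers - 1),
    with layer j (0-indexed) shifted right by j frames, as in Eq. (10)."""
    n_q, T = A.shape
    out = np.full((n_q, T + n_q - 1), PAD, dtype=A.dtype)
    for j in range(n_q):
        out[j, j:j + T] = A[j]
    return out

A = np.arange(12).reshape(3, 4)   # 3 RVQ layers, 4 frames
print(apply_delay(A))
# [[ 0  1  2  3 -1 -1]
#  [-1  4  5  6  7 -1]
#  [-1 -1  8  9 10 11]]
```

At any decoding step, the coarse layers are thus ahead of the fine ones, so each head can condition on already-generated coarser codes of the same underlying frame without expanding the sequence to T x N_q.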
For each time step t in the delayed sequence, the input audio representation vector h_t \in \mathbb{R}^D is the sum of embeddings across all layers:

h_t = \sum_{j=1}^{N_q} \mathrm{Emb}_j(\tilde{a}_{j,t}), \quad (11)

where \mathrm{Emb}_j(\cdot) denotes the embedding lookup for the j-th codebook and D is the model hidden dimension. Text tokens are embedded with the standard text embedding table; the delay mechanism applies only to the RVQ audio streams. The resulting vector sequence of length T + N_q - 1 is concatenated with the text embeddings as the backbone input.

Modeling Objective. On the output side, the hidden state x_t is passed through N_h = 33 heads: one text-or-pad head and 32 audio heads. Let \tilde{y}_{0,t} = y_{0,t} and \tilde{y}_{j,t} = \tilde{a}_{j,t} for j >= 1. The weighted training objective is

\mathcal{L}_{delay} = - \sum_{t=1}^{T+N_q-1} \sum_{j=0}^{N_q} \lambda_j \, m_{j,t} \log p_{\theta_{delay}}\big( \tilde{y}_{j,t} \mid E, \{ \tilde{y}_{x,y} : x + y < j + t + \max(0, 1 - j) \} \big), \quad (12)

where \theta_{delay} encompasses the parameters of the backbone, embeddings, and prediction heads; E represents the text token sequence; and m_{j,t} masks the invalid positions introduced by delay shifting and padding.

Because every channel is predicted directly from the backbone state, the delay pattern keeps the decoding path simple: once x_t is available, token generation only requires head projections. This simplicity is one of the main reasons it is easier to implement, scale, and deploy.

4.2 Local Transformer

We further explore a hierarchical token modeling design using a Local Transformer, inspired by the RQ-Transformer in Moshi [8]. Unlike the delay pattern, this approach models the token block without introducing temporal shifts: the backbone produces one global latent per aligned step, and a lightweight autoregressive module expands that latent into the within-step token block. This design is architecturally more complex, but it offers a stronger inductive bias for frame-level token modeling.

Input Embedding. On the input side, we directly sum the embeddings of all RVQ layers at each time step t without any delay.
The input hidden state h_t to the backbone LLM is given by:

h_t = \sum_{j=1}^{N_q} \mathrm{Emb}_j(a_{j,t}), \quad (13)

where a_{j,t} is the token at the j-th RVQ layer and t-th time frame.

Hierarchical Decoding. On the output side, we employ a lightweight Local Transformer to autoregressively decode the full per-step token block. Specifically, let x_t be the output hidden state from the backbone LLM at time t. The Local Transformer predicts the sequence (y_{0,t+1}, y_{1,t+1}, ..., y_{N_q,t+1}) sequentially. Its input when predicting channel j, denoted z_{j,t}, is defined as:

z_{j,t} = \begin{cases} x_t, & j = 0, \\ \mathrm{Emb}_{j-1}(y_{j-1,t+1}), & 1 \le j \le N_q. \end{cases} \quad (14)

The Local Transformer processes z_{j,t} and passes the resulting hidden states through the corresponding prediction head to emit y_{j,t+1}.

Modeling Objective. The entire architecture, including the backbone and the local transformer, is trained end-to-end and optimized via

\mathcal{L}_{local} = - \sum_{t=1}^{T} \sum_{j=0}^{N_q} \lambda_j \log p_{\theta_{local}}\big( y_{j,t} \mid E, y_{<j,t}, y_{:,<t} \big), \quad (15)

where y_{<j,t} denotes the preceding channels at the current aligned step, and y_{:,<t} denotes all channels from previous steps. \theta_{local} encompasses the parameters of the backbone LLM, the local transformer, embeddings, and prediction heads.

Compared with the delay pattern, this design inserts an additional autoregressive loop of length N_q + 1 inside each frame. As a result, it is computationally heavier in steady-state decoding, but it can start emitting audio earlier because it does not need to wait for delayed offsets to materialize the first frame. Empirically, its main advantage in this report is not architectural simplicity, but higher modeling efficiency and stronger speaker preservation. Moreover, we incorporate Progressive Sequence Dropout as proposed in MOSS-Audio-Tokenizer [11] to support bitrate-controllable audio generation.

5 Pretraining

5.1 Pretraining Data

Scaling TTS pretraining to millions of hours of speech data necessitates sourcing audio from naturally occurring, open-domain recordings: podcasts, audiobooks, broadcast & news, film & drama, commentary, and online content. Such recordings, however, rarely satisfy the conditions required for direct TTS supervision: they routinely contain multiple concurrent speakers, background music, ambient noise, and unreliable or missing transcription metadata. High-quality pretraining therefore demands that two fundamental properties hold for every training unit: (i) the audio is acoustically clean and contains the speech of a single speaker, free from overlapping voices, music, and significant background noise, and (ii) the paired transcript is linguistically well-formed and faithfully aligned to the spoken content. To enforce these properties at scale, we design a multi-stage data pipeline that progressively transforms raw web audio into curated, trainable speech-transcript pairs. As summarized in Figure 3, the pipeline is organized into three phases.

Figure 3: Overview of the MOSS-TTS pretraining data pipeline, including preprocessing, filtering, and targeted data synthesis.
The preprocessing phase (Stages 1–2) establishes a standardized acoustic foundation and extracts speaker-consistent segments via diarization. The filtering phase (Stages 3–4) first produces and refines transcripts with multilingual ASR, rule-based checks, and LLM-based quality control, and then retains only pairs that pass joint audio–transcript filtering, including acoustic quality checks, audio/text language-consistency checks, and duration–text consistency checks. The data synthesis phase supplements the corpus with targeted examples that introduce explicit speaker-conditioning structure for timbre transfer, broaden coverage of underrepresented input types, and strengthen robustness to diverse real-world input formats.

5.1.1 Data Preprocessing

As shown in Figure 3, the preprocessing phase contains Stages 1–2 and prepares acoustically standardized, speaker-consistent segments before any transcript is produced.

Stage 1: Raw audio preprocessing. Raw web-sourced recordings exhibit substantial heterogeneity in acoustic format: sampling rates vary widely across sources, loudness levels differ by tens of decibels, and many files carry environmental noise, music accompaniment, or reverberation. Left uncorrected, this variability degrades the reliability of both speaker diarization (Stage 2) and ASR (Stage 3), and introduces inconsistency into the acoustic features used during model training. We therefore apply the following preprocessing steps to each recording as the first pipeline stage:

• Noise suppression. We apply MossFormer2-SE-48K [51], a neural speech enhancement model, to suppress stationary and non-stationary background noise. The purpose of denoising at this stage is not to produce the final training signal but to improve the reliability of downstream speaker diarization: a cleaner input yields more accurate voice activity detection and sharper speaker boundary estimates.
Audio is resampled to 48 kHz before enhancement, which is the native operating rate of the model.

• Format standardization. Parameter alignment such as sample type, channel layout, and header metadata is enforced after enhancement, and the processed output is written to FLAC format, establishing a uniform format contract for all subsequent stages.

• Volume normalization. To reduce inter-source level variation, we apply a two-stage gain procedure. First, we compute the RMS-based signal level in dBFS,

$$L_{\text{dBFS}}(\mathbf{x}) = 20 \log_{10} \sqrt{\frac{1}{T} \sum_{t} x_t^2 + \epsilon},$$

and apply a clamped gain $g = \mathrm{clip}(-20 - L_{\text{dBFS}}(\mathbf{x}), -3, 3)$ dB, rescaling the waveform by $10^{g/20}$. The target of $-20$ dBFS and the $\pm 3$ dB clipping range together prevent both over-compression of already-quiet recordings and excessive amplification of outliers. Second, peak normalization divides by the maximum absolute sample value to map the result to the $[-1, 1]$ amplitude range, ensuring numerical consistency across the pipeline.

Stage 2: Speaker diarization and segment consolidation. We run speaker diarization on the denoised audio to obtain a time-ordered sequence of speaker-labeled intervals,

$$\mathcal{D} = \{(k_i, t_i^{\text{st}}, t_i^{\text{ed}})\}_{i=1}^{N},$$

where $k_i$ is a recording-local speaker label (e.g., SPEAKER-00, SPEAKER-01, ...) and $t_i^{\text{st}} < t_i^{\text{ed}}$ are the start and end timestamps. Speaker labels are meaningful only within a single recording; we do not perform cross-recording identity linking. We use DiariZen [52–54], an end-to-end neural diarization system, for this step. Raw diarization output is typically fragmented: a single continuous speaking turn may be split into multiple short intervals separated by brief pauses or breath sounds. Training directly on such fragments would over-represent short, sub-sentence units at the expense of longer, paragraph-level continuity.
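The two-stage gain procedure of the volume-normalization step above can be sketched in numpy (a minimal sketch; the function name and default arguments are illustrative, not the released implementation):

```python
import numpy as np

def normalize_volume(x: np.ndarray, target_dbfs: float = -20.0,
                     max_gain_db: float = 3.0, eps: float = 1e-12) -> np.ndarray:
    """Two-stage gain: clamped RMS correction toward -20 dBFS, then peak normalization."""
    # Stage A: RMS level in dBFS and a corrective gain clamped to +/- 3 dB.
    l_dbfs = 20.0 * np.log10(np.sqrt(np.mean(x ** 2) + eps))
    g = np.clip(target_dbfs - l_dbfs, -max_gain_db, max_gain_db)  # gain in dB
    y = x * (10.0 ** (g / 20.0))
    # Stage B: peak normalization into [-1, 1].
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y
```

Here `eps` guards the logarithm against all-zero input; the clamp keeps quiet recordings from being over-amplified before the final peak division.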
We therefore apply a two-step consolidation procedure to maximize contiguous single-speaker coverage:

• Filtering and consecutive-speaker merging. Segments shorter than $\tau_{\min} = 0.1$ s are discarded as unreliable diarization artifacts. The remaining segments are then scanned in chronological order: whenever two adjacent segments carry the same speaker label, they are merged into a single interval spanning both. This produces a consolidated sequence

$$\mathcal{A} = \{(k_j, s_j, e_j)\}_{j=1}^{M}, \quad M \le N,$$

where each run of consecutive same-speaker segments in $\mathcal{D}$ has been collapsed into one entry. There is no gap threshold on merging: any two adjacent same-speaker segments are unified regardless of the intervening silence duration, because such gaps reflect natural pauses within a speaking turn rather than speaker changes.

• Single-speaker truncation. We apply a hard one-hour limit to avoid unbounded unit lengths. Starting from $s_1$ (the onset of $\mathcal{A}$), we define a cutoff $t_{\lim} = s_1 + 3600$ s and emit segments from $\mathcal{A}$ in order, clamping the endpoint of the last included segment to $t_{\lim}$ and discarding any resulting segment shorter than $\tau_{\min}$. The output is a list of consolidated, speaker-labeled intervals drawn from at most one hour of the recording, starting from the onset of the first diarized segment.

5.1.2 Data Filtering

As shown in Figure 3, the filtering phase corresponds to Stages 3–4: Stage 3 constructs and cleans transcripts, and Stage 4 keeps only pairs that are jointly consistent on the audio and transcript sides.

Stage 3: ASR and transcript quality control. Each consolidated segment from Stage 2 passes through a sequential pipeline that transcribes the audio and then applies a series of quality-control steps to produce a clean, speaker-tagged transcript suitable for TTS training.

• ASR transcription. We transcribe each segment using MOSS-Transcribe-Diarize [55], our proprietary multilingual ASR model.
The model does not require an externally provided language label; it directly produces a multilingual diarization-aware transcript. The raw output follows a structured format in which every utterance span is prefixed by a recording-local speaker tag (e.g., [S1], [S2]) and may contain inline sound-event markers (e.g., [music], [laugh]). This structured output reflects the full diarization-aware recognition result and requires subsequent cleaning before use as a training transcript.

• Rule-based pre-filtering. Before invoking the LLM, three lightweight rules discard clearly unusable transcripts to avoid wasting inference budget:

– Empty content: the transcript is blank or consists only of whitespace after stripping.

– Severe repetition loop: any phrase is repeated consecutively more than six times, which is a reliable signal of ASR model collapse.

– Non-speech dominance: after removing all bracketed tags ([...]), the remaining linguistic content accounts for less than 20% of the total text length, indicating that the segment is predominantly noise, music, or other non-speech events.

Segments failing any rule are discarded without further processing.

• LLM-based transcript refinement. Segments that pass pre-filtering are processed by a large language model using a structured two-stage prompt.

– Diagnosis (filtering): the LLM first checks for two fatal defects. filter-1 targets identical spoken content repeated two or more times (distinct from the trivially repeating speaker tags that are a normal output of MOSS-Transcribe-Diarize). filter-2 targets sentence truncation, identified by a trailing hyphen or an abrupt mid-word termination. Segments receiving either code are discarded.
– Correction (refinement): segments that pass diagnosis undergo sequential cleaning. refine-1 removes all non-speech event tags while preserving speaker tags and linguistic content. refine-2 deletes any speaker tag that is left with no following speech content. refine-3 applies minimal structural repair to restore the standard [speaker] content format without modifying the recognized words. Not all steps are applied to every segment; segments already in correct form receive a no-change code and are passed through unmodified. If the LLM call fails (e.g., due to a malformed response), the segment is discarded rather than falling back to the uncleaned transcript.

• Single-speaker transcript validation. As a final check, we verify that the refined transcript contains only [S1] speaker tags. The presence of any [S2], [S3], or higher-indexed tag indicates that multiple speakers were detected within the segment at the transcription level, which is inconsistent with the single-speaker constraint enforced in Stage 2. Such segments are discarded.

Stage 4: Joint audio–transcript filtering. Segments that survive Stage 3 undergo a second round of filtering that combines acoustic quality signals with audio–transcript consistency checks. Unlike Stage 3, which focuses on transcript quality, this stage treats the audio and transcript jointly; a segment is retained only if both its acoustic quality and its audio–transcript consistency fall within acceptable bounds.

• Acoustic quality filtering. We compute DNSMOS [56] and Meta AudioBox Production Quality (PQ) [57] on the pre-denoising audio rather than on the enhanced signal, because speech enhancement can distort quality estimates and make the scores reflect the enhancer rather than the source recording. A segment is accepted only if its DNSMOS score exceeds 2.8 and its Meta AudioBox PQ score exceeds 6.5, retaining only segments with sufficiently clean and natural-sounding speech.
• Audio–text language consistency filtering. We derive two language labels from different modalities and require them to agree. First, Whisper large-v3 [13] is applied to the audio to obtain an audio-side language label $\hat{\ell}_{\text{aud}}$. Second, a large language model reads the refined transcript and predicts a text-side language label $\hat{\ell}_{\text{text}}$. We retain a segment only if $\hat{\ell}_{\text{aud}} = \hat{\ell}_{\text{text}}$. This removes pairs whose transcript language is inconsistent with the spoken content or whose ASR output is unreliable enough to confuse transcript-side language identification. For the remaining segments, we denote the agreed label by $\hat{\ell}$.

• Audio–transcript length consistency filtering. A systematic mismatch between audio duration and transcript length indicates one of two failure modes: (i) the audio is far longer than the transcript, suggesting that large portions of the segment are silence or non-speech background; or (ii) the transcript is far longer than the audio, a reliable indicator of ASR hallucination. To detect both cases, we compute a language-specific character rate,

$$r = \frac{|x'|}{d},$$

where $|x'|$ is the character count of the refined transcript and $d$ is the segment duration in seconds. For each supported language $\ell$, we define a valid rate interval $[r_\ell^{\min}, r_\ell^{\max}]$ derived from empirical statistics over a reference corpus; the agreed language label $\hat{\ell}$ from the previous step is used to select the appropriate bounds. Segments whose rate $r \notin [r_\ell^{\min}, r_\ell^{\max}]$ are discarded.

5.1.3 Data Synthesis

As shown in Figure 3, the final phase augments the naturally filtered corpus with targeted synthetic or transformed examples that cover capabilities not directly available from organic web audio. Even after the pipeline described above, the filtered web-sourced corpus still leaves three systematic gaps that cannot be closed by filtering alone.
The most consequential is the absence of explicit speaker-conditioning structure: the filtered corpus pairs text with speech but provides no prompt audio, and a model trained on it alone has no mechanism to transfer timbre from a reference speaker. The remaining two gaps are on the text side: real user inputs often contain formatting noise that the model must handle gracefully, for example, inputs such as “Hello??!! are you there” or “Iwant to knowwhere it is”, and phonetic script input is absent from organic speech data yet required for fine-grained pronunciation control. We address all three through targeted data synthesis.

Timbre-cloning data construction. The goal of this construction is to provide (prompt audio, target audio) pairs in which both sides originate from the same speaker, enabling the model to learn prompt-conditioned timbre transfer from real recorded speech rather than from any generative process. Construction proceeds entirely from the filtered corpus produced by Stages 1–4.

Figure 4 Statistics of the MOSS-TTS pretraining corpus. Panel (a) shows the share of training hours by domain; panel (b) shows the language distribution as a donut chart (English/Chinese/Other) alongside a breakdown of the top minor languages; panel (c) shows the distribution of utterance duration by both hours (bars) and utterance count (line).

For each recording, we group the surviving segments by their diarization-assigned speaker identity. Let the segments attributed to a given speaker be $s_1, s_2, \ldots, s_n$. For each target segment $s_i$, we construct a prompt candidate pool as follows. For every other segment $s_j$ ($j \ne i$), we draw five random temporal crops of $s_j$, each with independently sampled start and end timestamps subject to a maximum duration of 30 s, yielding $5(n-1)$ prompt candidates in total.
We then score every candidate against $s_i$ by computing the cosine similarity between their speaker embeddings, extracted using the fine-tuned WavLM-Large model employed in Seed-TTS-eval [3]. The candidate attaining the highest similarity is selected as the prompt for $s_i$, producing the final training pair (prompt $= s^{*}_{j,\text{partial}}$, target $= s_i$).

This construction strategy has two important properties. First, by evaluating similarity on the cropped partials rather than on full segments, the selection directly optimizes for how well the prompt conveys the speaker's identity at inference-time durations, rather than for full-segment representativeness. Second, restricting the prompt to at most 30 s and randomizing its boundaries encourages the model to extract stable timbre representations from varied temporal windows, improving robustness to prompt length and position at inference time.

Supplementary data. Three smaller supplements address remaining distributional gaps. For input robustness, we apply four noise transformations to the text of existing validated pairs, without modifying the audio: punctuation noise (consecutive marks, mixed full/half-width forms, malformed combinations), whitespace artifacts (extra spaces, misplaced line breaks), punctuation dropout, and sparse dirty-character injection. For phonetic script input, we support both full-sentence and partial (word- or phrase-level) replacement of orthographic text with phonetic notation: tone-marked pinyin for Chinese (e.g., nin2 hao3) and IPA enclosed in slashes for English (e.g., /hæŋ bæk/); training pairs are derived from filtered corpus segments by rule-based transcript conversion, with audio left unchanged.
For dictionary-style short-form data, we supplement the corpus with single-character and single-word utterances, which are severely underrepresented in web-sourced recordings yet constitute a common real-world input pattern; without targeted coverage, a model trained on organic data alone tends to be unreliable on such ultra-short inputs.

Figure 4 summarizes the composition of the resulting corpus across all data subsets, showing the distribution of training hours by domain, language, and utterance duration.

For duration control, every training asset is serialized into two parallel variants. In the duration-conditioned variant, the prompt field tokens stores an explicit integer equal to the target audio-token count; in the free-duration variant, the same field is set to None. This paired formatting is applied uniformly across the corpus, so explicit and implicit duration supervision are both present throughout pretraining rather than being introduced only in a specific phase.

Table 1 Four-phase pretraining schedule of MOSS-TTS.

| Phase | Max seq. | LR Schedule and Data Mixture Updates |
|---|---|---|
| P1 | 32k | LR warmup to $2\times10^{-4}$, then hold; use $\mathcal{D}_{\text{basic}}$ only. |
| P2 | 32k | Hold LR at $2\times10^{-4}$; enable all data subsets and strongly upsample $\mathcal{D}_{\text{clone}}$. |
| P3 | 32k | Linearly decay LR from $2\times10^{-4}$ to $2\times10^{-6}$; keep all subsets active and restore the data mixture to normal proportions. |
| P4 | 64k | Hold LR at $2\times10^{-6}$; retain the rebalanced full-data mixture, expand the context window from 32k to 64k, and heavily upsample long-form data. |

5.2 Pretraining Stage

For brevity, we denote the five training data subsets used in the curriculum as follows: the main filtered corpus $\mathcal{D}_{\text{basic}}$, timbre-cloning pairs $\mathcal{D}_{\text{clone}}$, dictionary-style short-form data $\mathcal{D}_{\text{dict}}$, noisy-text augmentation data $\mathcal{D}_{\text{noise}}$, and phonetic augmentation data $\mathcal{D}_{\text{phone}}$. The curriculum design follows three principles. First, we begin with the highest-density and least ambiguous supervision to maximize early learning efficiency.
Second, harder conditioning tasks such as voice cloning and pronunciation control are introduced while the optimizer is still in a high-learning-rate stable region, so that they become native behaviors rather than narrow later-stage patches. Third, long-context extension is deferred until the short-context model has largely converged, which substantially reduces optimization instability and preserves short-utterance quality. The full schedule follows a simple warmup–stable–decay (WSD) pattern [58]: the learning rate is warmed up only in Phase 1, held fixed in the other non-decay phases, and linearly decayed from $2\times10^{-4}$ to $2\times10^{-6}$ only in Phase 3.

Phase 1: basic alignment acquisition. We begin with $\mathcal{D}_{\text{basic}}$ only, covering Chinese, English, and lower-resource languages under relatively clean and direct text–speech supervision. This phase includes the only explicit warmup in the whole training schedule: the learning rate is first increased to $2\times10^{-4}$ and then held fixed for the remainder of the phase. Excluding more specialized objectives at this stage improves sample efficiency: the model first learns monotonic text-to-audio alignment, multilingual grapheme-to-acoustic mapping, and the basic semantics encoded by the tokenizer before it is asked to solve voice transfer or pronunciation-correction tasks. Empirically, this stage produces a substantially stronger initialization for the subsequent mixed-data phases than training on the full heterogeneous mixture from step zero.

Phase 2: capability expansion under stable high LR. After the base mapping is established, we switch to the full data universe and deliberately assign a much higher sampling weight to $\mathcal{D}_{\text{clone}}$. The reason is strategic: prompt-conditioned timbre transfer is both harder and more fragile than ordinary text-to-speech, and if it is introduced too weakly it tends to remain a tail capability.
Keeping the learning rate fixed at $2\times10^{-4}$ while the model is exposed to the full control-oriented data mixture allows the backbone to absorb cloning, dictionary reading, noisy-text robustness, and phonetic prompting as first-class behaviors rather than as late patches.

Phase 3: linear-decay mixture rebalancing and quality consolidation. Once the control capabilities are in place, we keep the full data universe active but restore the mixture to its normal proportions, while linearly decaying the learning rate from $2\times10^{-4}$ to $2\times10^{-6}$ over the entire phase. This step is critical. Oversampling timbre-cloning data for too long biases the model toward prompt copying and can suppress the relative influence of standard multilingual TTS, dictionary coverage, and robustness-oriented augmentations. Phase 3 therefore serves as the main quality-consolidation stage: the model revisits the full task distribution while the optimizer transitions from a still-flexible high-LR regime to a highly conservative low-LR regime. In WSD terms, this is the decay segment in which most of the final gains are consolidated [58]. Early in the phase, the remaining relatively large updates are still sufficient to repair mixture imbalance and absorb residual capability gaps; late in the phase, the much smaller updates improve stability, reduce hallucination-like failures, and sharpen the final tradeoff among intelligibility, speaker similarity, and controllability.

Phase 4: long-context extension. In the final phase, we keep the learning rate fixed at $2\times10^{-6}$, increase the maximum sequence length from 32k to 64k, and heavily upsample long-form data. We intentionally do not introduce this longer context earlier. Training with a very long window from the beginning is significantly less efficient, because most examples do not require it and because early optimization is better spent learning the core text–speech mapping than fitting long-range attention patterns.
Instead, we follow a late context-extension strategy analogous to recent LLM and TTS systems: once the base distribution has converged at moderate context length, the model is adapted to longer contexts under a small learning rate, which preserves short-form quality while teaching paragraph-scale and hour-scale continuity [1, 59, 60]. The heavy upsampling of long-form data is important here: without it, the nominal 64k window would be underutilized by the natural length distribution of the corpus. This stage is intended to improve speaker consistency across long generations, reduce drift in prosody and content over extended passages, and enable the model to use longer prompt speech without destabilizing decoding.

Learning-rate shape and practical rationale. Viewed globally, the four phases form a simple WSD-style training program rather than four disconnected runs: a short warmup embedded in Phase 1 lifts the learning rate to $2\times10^{-4}$, Phases 1–2 then share a stable plateau at $2\times10^{-4}$, Phase 3 linearly decays the learning rate from $2\times10^{-4}$ to $2\times10^{-6}$, and Phase 4 holds the final low learning rate at $2\times10^{-6}$ during long-context adaptation. This schedule combines the optimization efficiency of a long stable high-LR region with the reliability of a gradual decay into a low-LR refinement regime. The distinction matters in practice: the stable plateau is where the model acquires its main multilingual TTS and controllability behaviors, whereas the linear decay phase is where those behaviors are rebalanced and polished without the abrupt optimization shock that would come from dropping directly from $2\times10^{-4}$ to $2\times10^{-6}$. By the time training enters Phase 4, the optimizer is already in a conservative regime, which allows us to upsample long-form data and extend the context window with minimal damage to established short-form quality.
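As a concrete illustration, the global warmup–stable–decay schedule can be written as a piecewise function of the training step (a minimal sketch; the step boundaries here are hypothetical, only the learning-rate endpoints come from the report):

```python
def wsd_lr(step: int, warmup: int, plateau_end: int, decay_end: int,
           lr_hi: float = 2e-4, lr_lo: float = 2e-6) -> float:
    """WSD schedule: warmup (start of P1), plateau (P1-P2),
    linear decay (P3), then a low-LR hold (P4)."""
    if step < warmup:                      # linear warmup to lr_hi
        return lr_hi * step / warmup
    if step < plateau_end:                 # stable plateau at lr_hi
        return lr_hi
    if step < decay_end:                   # linear decay toward lr_lo
        frac = (step - plateau_end) / (decay_end - plateau_end)
        return lr_hi + frac * (lr_lo - lr_hi)
    return lr_lo                           # long-context hold
```

The plateau and decay boundaries would be set from the actual phase lengths; the sketch only fixes the endpoints $2\times10^{-4}$ and $2\times10^{-6}$ reported above.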
Compared with a one-shot full-data recipe, the staged curriculum yields a better division of labor across phases: P1 learns the multilingual TTS prior, P2 makes control abilities robust, P3 restores distributional balance while progressively refining the model, and P4 transfers the already-competent model to longer contexts. This progression serves as a practical compromise between training efficiency, controllability, and long-form robustness, and it forms the default pretraining recipe used for the full MOSS-TTS release.

6 Evaluation

We evaluate MOSS-TTS from two complementary perspectives: (i) the audio tokenizer, i.e., whether it provides high-fidelity and semantically usable units across bitrates and domains; and (ii) the speech generation model, i.e., whether discrete autoregressive modeling and large-scale pretraining yield strong zero-shot voice cloning, multilingual robustness, token-level duration control, phoneme-/pinyin-level pronunciation control, and ultra-long speech generation. For the speech generation model, we report results for both MOSS-TTS and MOSS-TTS-Local-Transformer. Following influential TTS technical reports [1, 2, 61], we prioritize objective metrics that are easy to reproduce and interpret: content consistency measured by WER/CER using a fixed ASR backend, speaker similarity (SIM) measured by cosine similarity of pretrained speaker embeddings, and task-specific metrics for controllability and long-form behavior.

6.1 Audio Tokenizer

We conduct a comprehensive evaluation of MOSS-Audio-Tokenizer, comparing it with current state-of-the-art open-source audio tokenizers across various bitrate regimes. The baseline audio tokenizers include StableCodec [62], XCodec2.0 [63], MiMo-Audio-Tokenizer [42], Higgs-Audio-Tokenizer [64], SpeechTokenizer [7], XY-Tokenizer [21], BigCodec [65], Mimi [8], DAC [18], Encodec [17], and Qwen3-TTS-Tokenizer [1].
Our evaluation encompasses speech, general audio, and music to assess the model's versatility and reconstruction fidelity.

For speech reconstruction, we conduct evaluations on LibriSpeech test-clean (English) [66] and AISHELL-2 (Chinese) [67]. We report speaker similarity (SIM), computed as the cosine similarity between speaker embeddings extracted from the original and reconstructed audio using a pretrained speaker verification model². In addition, we report short-time objective intelligibility (STOI) [68] and perceptual evaluation of speech quality (PESQ) [69]. For sound and music reconstruction, following prior work [18], we evaluate on the AudioSet evaluation subset [70] and MUSDB [71]. We report mel-spectrogram distance and short-time Fourier transform (STFT) distance as objective metrics.

Table 2 Reconstruction quality comparison of open-source audio tokenizers on speech and audio/music data. Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese) and reported as English/Chinese. Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on the MUSDB dataset; values are reported as audio/music. STFT-Dist. denotes the STFT distance. Higher is better for speech metrics, whereas lower is better for audio/music metrics. $N_{\text{VQ}}$ denotes the number of quantizers. Rows are grouped by bitrate regime.

| Model | bps | Frame rate | $N_{\text{VQ}}$ | SIM↑ | STOI↑ | PESQ-NB↑ | PESQ-WB↑ | Mel-Loss↓ | STFT-Dist.↓ |
|---|---|---|---|---|---|---|---|---|---|
| StableCodec | 700 | 25 | 2 | 0.62 / 0.45 | 0.91 / 0.86 | 2.91 / 2.50 | 2.24 / 1.93 | – / – | – / – |
| XCodec2.0 | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | – / – | – / – |
| MiMo-Audio-Tokenizer | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | 0.82 / 0.81 | 2.33 / 2.23 |
| Higgs-Audio-Tokenizer | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / 0.80 | 2.20 / 2.05 |
| SpeechTokenizer | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | – / – | – / – |
| XY-Tokenizer | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | – / – | – / – |
| BigCodec | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | – / – | – / – |
| Mimi | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS-Audio-Tokenizer | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| MOSS-Audio-Tokenizer | 1000 | 12.5 | 8 | 0.88 / 0.81 | 0.94 / 0.91 | 3.38 / 2.96 | 2.87 / 2.43 | 0.82 / 0.80 | 2.16 / 2.04 |
| DAC | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | – / – | – / – |
| Encodec | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| Higgs-Audio-Tokenizer | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| SpeechTokenizer | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | – / – | – / – |
| Qwen3-TTS-Tokenizer | 2200 | 12.5 | 16 | 0.95 / 0.88 | 0.96 / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | – / – | – / – |
| MiMo-Audio-Tokenizer | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | 0.70 / 0.68 | 2.21 / 2.10 |
| Mimi | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| MOSS-Audio-Tokenizer | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| MOSS-Audio-Tokenizer | 2000 | 12.5 | 16 | 0.95 / 0.89 | 0.96 / 0.94 | 3.78 / 3.46 | 3.41 / 2.96 | 0.73 / 0.70 | 2.03 / 1.90 |
| DAC | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| MiMo-Audio-Tokenizer | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| SpeechTokenizer | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | – / – | – / – |
| Mimi | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| Encodec | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| DAC | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | 0.65 / 0.63 | 1.97 / 1.87 |
| MOSS-Audio-Tokenizer | 3000 | 12.5 | 24 | 0.96 / 0.92 | 0.97 / 0.96 | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| MOSS-Audio-Tokenizer | 4000 | 12.5 | 32 | 0.97 / 0.93 | 0.97 / 0.96 | 3.95 / 3.71 | 3.69 / 3.30 | 0.68 / 0.64 | 1.96 / 1.82 |

Table 2 summarizes the objective reconstruction results across speech, general audio, and music benchmarks.
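The two reconstruction metrics that do not rely on an external toolkit can be illustrated with a minimal numpy sketch (the actual evaluation uses pretrained speaker-verification embeddings and standard PESQ/STOI implementations; the function names and STFT parameters here are illustrative):

```python
import numpy as np

def speaker_sim(e_ref: np.ndarray, e_rec: np.ndarray) -> float:
    """SIM: cosine similarity between speaker embeddings of the
    original and the reconstructed audio."""
    return float(np.dot(e_ref, e_rec) /
                 (np.linalg.norm(e_ref) * np.linalg.norm(e_rec)))

def stft_distance(x: np.ndarray, y: np.ndarray, n_fft: int = 512,
                  hop: int = 128, eps: float = 1e-8) -> float:
    """Mean log-magnitude STFT distance between two equal-length waveforms."""
    def logmag(sig):
        frames = np.lib.stride_tricks.sliding_window_view(sig, n_fft)[::hop]
        spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
        return np.log(spec + eps)
    return float(np.mean(np.abs(logmag(x) - logmag(y))))
```

A perfect reconstruction gives SIM of 1 and an STFT distance of 0; degradation moves both metrics away from these ideals.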
We categorize the performance into low (750–1500 bps), medium (1500–2500 bps), and high (2500–6000 bps) bitrate regimes. Additionally, Figure 5 illustrates the performance trajectory of MOSS-Audio-Tokenizer against other open-source alternatives within the 0–4 kbps range. Across all evaluated bitrates, MOSS-Audio-Tokenizer consistently outperforms the compared open-source baselines in speech reconstruction. On general audio and music benchmarks, the model maintains competitive performance. Notably, reconstruction quality scales gracefully with the increase in bitrate, demonstrating that the model effectively leverages additional capacity and bitrate through its joint end-to-end optimization framework.

² UniSpeech speaker verification repository

Figure 5 Comparison of objective reconstruction metrics between MOSS-Audio-Tokenizer and other state-of-the-art open-source audio tokenizers on the LibriSpeech test-clean dataset. Results are evaluated within the 0–4 kbps bitrate range. The horizontal axis represents the bitrate, and the vertical axis denotes the corresponding objective reconstruction scores.

These results indicate strong modeling capacity for MOSS-Audio-Tokenizer across both low-bitrate and high-bitrate regimes. By allowing a flexible selection of RVQ layers, the model can be adapted to diverse application requirements, spanning low-bitrate scenarios to high-fidelity audio generation. Overall, MOSS-Audio-Tokenizer provides a stable, high-fidelity, and standardized tokenizer for native audio generation models.

6.2 Voice Cloning

Table 3 compares MOSS-TTS with representative open and closed systems on Seed-TTS-eval. We report both architectures in both inference modes. Results for non-MOSS baselines are collected from the corresponding technical reports and reported as given in those sources. For prompt-conditioned generation, we distinguish two inference paradigms throughout this section.
In Clone, the user input explicitly provides a reference audio clip. InContinuation, we instead prepend the reference audio to the assistant-side speech prefix, prepend its ASR transcript to the requested text, and let the model continue generating the speech for the original text. We report both modes because they probe different uses of the same pretrained model:Clonemeasures explicit reference-audio conditioning, whereas Continuationtests whether native speech continuation already provides usable timbre transfer without re- lying on a dedicated clone-style prompt format. On Seed-TTS-eval, speaker similarity is the more informative metric. Once WER/CER is already below 18 Table 3 Zero-shot voice cloning on Seed-TTS-eval.We report English WER (↓), English speaker similarity (SIM,↑), Chinese CER (↓), and Chinese SIM (↑). The evaluation results are from technical reports of other models, such as Vox- CPM [61] and SparkTTS [72]. ModelModeParams Open EN WER↓EN SIM↑ZH CER↓ZH SIM↑ DiTAR–0.6B81.6973.501.0275.30 FishAudio-S1–4B81.7262.571.2272.10 CosyVoice3–1.5B82.2272.001.1278.10 Seed-TTS–82.2576.201.1279.60 MiniMax-Speech– 81.6569.200.8378.30 CosyVoice–0.3B44.2960.903.6372.30 CosyVoice2–0.5B 43.0965.901.3875.70 CosyVoice3–0.5B42.0271.801.1678.00 F5-TTS–0.3B42.0067.001.5376.00 SparkTTS–0.5B43.1457.301.5466.00 FireRedTTS–0.5B43.8246.001.5163.50 FireRedTTS-2–1.5B41.9566.501.1473.60 Qwen2.5-Omni–7B42.7263.201.7075.20 FishAudio-S1-mini–0.5B41.9455.001.1868.50 IndexTTS2–1.5B42.2370.601.0376.50 VibeVoice–1.5B43.0468.901.1674.40 HiggsAudio-v2–3B42.4467.701.5074.00 GLM-TTS–1.5B42.2367.21.0376.1 GLM-TTS-RL–1.5B41.9168.10.8976.4 VoxCPM–0.5B41.8572.900.9377.20 Qwen3-TTS–0.6B41.6870.391.2376.40 Qwen3-TTS–1.7B41.5071.451.3376.72 MOSS-TTS Clone 8B 41.9269.311.4676.21 Continuation8B 41.8470.861.3776.98 MOSS-TTS-Local-Transformer Clone1.7B41.8771.741.3377.24 Continuation 1.7B41.9373.281.4479.62 about 2, residual differences become hard to interpret: in our manual review, most remaining 
mismatches in that regime are ASR errors rather than audible pronunciation failures. Under this lens, MOSS-TTS is particularly strong on SIM. For both architectures, Continuation consistently improves speaker similarity over Clone, indicating that native speech continuation is an effective way to anchor speaker identity. MOSS-TTS-Local-Transformer is consistently stronger than MOSS-TTS on speaker preservation despite using only 1.7B parameters, and MOSS-TTS-Local-Transformer in Continuation achieves the highest Chinese and English similarity scores among the open-source models in the table. This matches the architectural tradeoff discussed in Section 4.1 and Section 4.2: MOSS-TTS-Local-Transformer is the more modeling-efficient architecture for speaker preservation, while MOSS-TTS remains the simpler long-context backbone used in the control-oriented evaluations below.

6.3 Multilingual Voice Cloning

We evaluate the released pretrained checkpoints directly on the CV3-Eval multilingual voice cloning subset, without any task-specific fine-tuning or post-training for this benchmark. As shown in Table 4, this subset probes voice cloning across a larger language set than Seed-TTS-eval. We report Clone and Continuation separately for both released MOSS-TTS architectures. External baseline entries in Table 4 are filled only where corresponding values are provided in the cited reports.

Table 4: CER (%) and WER (%) on the CV3-Eval Multilingual Voice Cloning subset. "–" means the language is unsupported.

| Model | Mode | zh | en | ja | ko | de | es | fr | it | ru |
|---|---|---|---|---|---|---|---|---|---|---|
| F5-TTS | – | 5.47 | 8.90 | – | – | – | – | – | – | – |
| Spark-TTS | – | 5.15 | 11.00 | – | – | – | – | – | – | – |
| GPT-SoVits | – | 7.34 | 12.50 | – | – | – | – | – | – | – |
| CosyVoice2 | – | 4.08 | 6.32 | 9.13 | 19.7 | – | – | – | – | – |
| CosyVoice2+DiffRO | – | 3.00 | 4.72 | 6.36 | 5.14 | – | – | – | – | – |
| CosyVoice3-0.5B | – | 3.89 | 5.24 | 10.4 | 12.8 | 7.41 | 4.25 | 12.9 | 6.68 | 6.77 |
| CosyVoice3-0.5B+DiffRO | – | 2.89 | 3.68 | 5.15 | 4.02 | 4.51 | 2.99 | 8.56 | 2.94 | 3.79 |
| CosyVoice3-1.5B | – | 3.91 | 4.99 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |
| CosyVoice3-1.5B+DiffRO | – | 3.01 | 3.71 | 5.27 | 4.01 | 3.93 | 3.26 | 8.09 | 2.72 | 4.11 |
| MOSS-TTS | Clone | 4.42 | 4.92 | 10.72 | 6.33 | 4.70 | 4.36 | 11.17 | 5.46 | 6.37 |
| MOSS-TTS | Continuation | 4.26 | 5.12 | 7.78 | 7.73 | 10.83 | 3.43 | 10.59 | 4.82 | 6.64 |
| MOSS-TTS-Local-Transformer | Clone | 3.95 | 4.35 | 10.10 | 5.95 | 4.28 | 3.98 | 10.32 | 5.02 | 5.90 |
| MOSS-TTS-Local-Transformer | Continuation | 3.68 | 4.89 | 7.30 | 7.20 | 10.20 | 3.10 | 9.90 | 4.40 | 6.20 |

As shown in Table 4, even without benchmark-specific multilingual cloning training, MOSS-TTS remains competitive across several non-zh/en languages. Relative to strong open baselines, it shows stable performance on de/es/it/ru, and the Continuation setting remains usable across the broader language set despite being a harder zero-shot transfer setting. Table 4 also shows that the largest gaps are concentrated in a few harder language pairs such as ja/ko and in some English continuation cases, which is consistent with the overall difficulty of this subset.

Table 5: Token-level duration control. Relative duration error (%) across target-duration buckets. AbsErr Mean: mean absolute relative error; AbsErr P50/P90: 50th/90th percentiles of absolute relative error; RMSE: root mean squared relative error.

| Language | Bucket | AbsErr Mean (%)↓ | AbsErr P50 (%)↓ | AbsErr P90 (%)↓ | RMSE (%)↓ |
|---|---|---|---|---|---|
| zh | 3s–10s | 1.456 | 1.333 | 2.343 | 1.652 |
| zh | 10s–1m | 0.359 | 0.254 | 0.647 | 0.502 |
| zh | 1m–10m | 0.356 | 0.077 | 1.273 | 0.849 |
| zh | 10m–30m | 0.678 | 0.061 | 1.859 | 1.228 |
| zh | overall | 0.712 | 0.284 | 2.013 | 1.141 |
| en | 3s–10s | 1.482 | 1.357 | 2.385 | 1.685 |
| en | 10s–1m | 0.355 | 0.251 | 0.639 | 0.515 |
| en | 1m–10m | 0.365 | 0.079 | 1.304 | 0.834 |
| en | 10m–30m | 0.660 | 0.059 | 1.809 | 1.261 |
| en | overall | 0.723 | 0.288 | 2.043 | 1.160 |

6.4 Duration Control

From this subsection onward, we report only MOSS-TTS.
The remaining three evaluations in this section (duration control, ultra-long speech generation, and phoneme-/pinyin-level pronunciation control) stress explicit token conditioning and long-context continuation, where the delay architecture is the more practical release target because of its simpler single-backbone parameterization and better scalability at long sequence lengths. We therefore use MOSS-TTS-Local-Transformer primarily to characterize the similarity-quality tradeoff in the cloning benchmarks above.

We evaluate token-level duration control on MOSS-TTS by prompting the model with a target token count and measuring the relative duration error. Under our tokenizer, 1 second corresponds to 12.5 audio tokens. Given a target token count n, the target duration is T_target = n / 12.5 seconds; we compute the realized duration T_real from the generated waveform and report Err% = |T_real − T_target| / T_target × 100%. We summarize errors by language and target-duration buckets.

Table 6: Ultra-long speech generation on an internal evaluation set. Chinese reports CER (%) and English reports WER (%), each averaged over 10 prompts per bucket. SIM is reported in percentage form, where 100 corresponds to perfect cosine similarity. It is computed by averaging 3-second window scores within each utterance and then averaging over utterances in the same bucket. This internal set is used only to characterize expected behavior in ultra-long generation.
| Language | Bucket | Clone CER/WER↓ | Continuation CER/WER↓ | Clone SIM (%)↑ | Continuation SIM (%)↑ |
|---|---|---|---|---|---|
| zh | 10-100 | 0.83 | 0.65 | 69.3 | 69.9 |
| zh | 100-500 | 1.53 | 0.85 | 65.6 | 66.0 |
| zh | 500-2500 | 4.12 | 0.94 | 64.9 | 66.2 |
| zh | 2500-5000 | 3.46 | 1.19 | 63.4 | 66.3 |
| zh | 5000-10000 | 1.89 | 1.87 | 63.1 | 64.7 |
| zh | 10000+ | 3.41 | 1.86 | 60.1 | 63.0 |
| en | 50-500 | 4.63 | 6.63 | 64.6 | 63.5 |
| en | 500-2500 | 3.65 | 4.08 | 60.9 | 60.2 |
| en | 2500-12500 | 3.75 | 4.05 | 60.3 | 60.0 |
| en | 12500-25000 | 3.76 | 3.75 | 55.5 | 56.5 |
| en | 25000-50000 | 4.58 | 6.50 | 54.8 | 53.3 |
| en | 50000+ | 17.49 | 29.52 | 44.4 | 51.2 |

[Figure 6: Speaker similarity drift under ultra-long generation. Panels: (a) Chinese, (b) English; each panel plots speaker similarity (%) against elapsed time (min) for the Clone and Continuation modes, with one curve per length bucket. Each curve averages non-overlapping 3-second window similarities within a length bucket. For readability, the visualization reports 30-second bins and only keeps the time prefix where at least eight utterances remain in the bucket.]

As shown in Table 5, the model achieves consistently low relative duration errors from short to long utterances, with overall AbsErr Mean around 0.7% and strong percentile behavior. Notably, these results are obtained under a pretraining-only setup, indicating that effective token-level duration control can emerge without introducing a dedicated duration-control fine-tuning stage.

6.5 Ultra-Long Speech Generation

We further build an internal ultra-long evaluation set for MOSS-TTS to estimate expected behavior when generation extends from short utterances to approximately one hour. The set covers Chinese and English, each with six language-specific text-length buckets and 10 prompts per bucket. For each prompt, we evaluate both Clone and Continuation, yielding 240 generated utterances in total.
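The size of this evaluation set follows directly from its factorial structure. A minimal sketch of the enumeration, reusing the bucket labels from Table 6 (the prompt texts themselves are internal and not assumed here):

```python
# Language-specific text-length buckets for the internal ultra-long set (Table 6).
buckets = {
    "zh": ["10-100", "100-500", "500-2500", "2500-5000", "5000-10000", "10000+"],
    "en": ["50-500", "500-2500", "2500-12500", "12500-25000", "25000-50000", "50000+"],
}
modes = ["Clone", "Continuation"]
prompts_per_bucket = 10

# Each (language, bucket, prompt, mode) combination yields one generated utterance.
runs = [
    (lang, bucket, idx, mode)
    for lang, bucket_list in buckets.items()
    for bucket in bucket_list
    for idx in range(prompts_per_bucket)
    for mode in modes
]
print(len(runs))  # 2 languages x 6 buckets x 10 prompts x 2 modes = 240
```

Per-bucket CER/WER in Table 6 is then the mean over the 10 prompts sharing a (language, bucket, mode) triple.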
We transcribe each sample with MOSS-Transcribe-Diarize [55] and report CER for Chinese and WER for English, each computed per sample and averaged over the 10 prompts in each bucket. Speaker similarity (SIM) is computed as the mean cosine similarity over non-overlapping 3-second windows. This internal set is used only to characterize expected performance in ultra-long generation rather than to serve as a public benchmark.

Table 6 is best read as a coarse bucket-level summary. Content fidelity remains usable through most buckets and degrades mainly at the longest horizons, while the average SIM values already suggest that speaker preservation weakens earlier than lexical accuracy. The more informative signal, however, is the temporal drift profile in Figure 6.

Table 7: Phoneme-/pinyin-level pronunciation control on an internal evaluation set. We report span-only CER for Chinese and span-only WER for English. Lower is better.

| Language | Setting | Replaced-Span CER/WER↓ |
|---|---|---|
| zh | partial-replace | 1.00 |
| zh | full-replace | 1.65 |
| en | partial-replace | 4.32 |
| en | full-replace | 5.84 |

Figure 6 makes the failure mode explicit. In Chinese, most buckets begin in a narrow high-SIM band. Under Clone, the short and medium buckets stay fairly flat, but the 10000+ bucket shows a clear late-stage collapse near the tail. Under Continuation, the curves are much tighter and flatter: even the longest bucket stays close to the others for more than 30 minutes, indicating substantially better long-horizon speaker anchoring. English is harder. All buckets drift downward earlier, and the 50000+ bucket under Clone falls the fastest and separates from the shorter buckets after only a few minutes. Continuation does not remove this trend, but it clearly raises and smooths the long-bucket trajectories, especially for the 25000–50000 and 50000+ settings.
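The window bookkeeping behind the SIM metric and the Figure 6 binning can be sketched directly. In this sketch, `embed` stands in for whatever speaker-embedding model is used (the report does not specify it here), so treat this as an illustration of the aggregation rather than the exact pipeline:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def window_sims(samples, ref_emb, embed, sr=24000, win_s=3.0):
    """Score each non-overlapping 3 s window of audio against the reference embedding."""
    step = int(sr * win_s)
    return [
        cosine(embed(samples[start:start + step]), ref_emb)
        for start in range(0, len(samples) - step + 1, step)
    ]

def drift_curve(sims, win_s=3.0, bin_s=30.0):
    """Average window scores into 30 s elapsed-time bins, as plotted in Figure 6."""
    per_bin = int(bin_s / win_s)  # 10 three-second windows per 30 s bin
    return [
        sum(sims[i:i + per_bin]) / len(sims[i:i + per_bin])
        for i in range(0, len(sims), per_bin)
    ]

def utterance_sim(sims):
    """Utterance-level SIM in percentage form, as reported in Table 6."""
    return 100.0 * sum(sims) / len(sims)
```

A drift curve then simply replays `window_sims` output through `drift_curve`; flat curves correspond to stable speaker anchoring, and the late-stage collapse described above appears as a falling tail.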
The main conclusion from Figure 6 is therefore that ultra-long generation remains operational, but the dominant bottleneck is cumulative speaker drift over elapsed time rather than immediate lexical failure.

6.6 Phoneme-/Pinyin-Level Pronunciation Control

We conduct a small internal functionality evaluation for phoneme-/pinyin-level pronunciation control using MOSS-TTS. For each language (Chinese and English), we construct two settings: partial-replace, where only a short target span is replaced by pinyin or IPA, and full-replace, where the entire sentence is specified in pinyin or IPA. Each language-setting pair contains 100 samples. Since the goal of this test is to verify controllable pronunciation editing, we evaluate only the controlled span rather than the full sentence. We transcribe each generated utterance with MOSS-Transcribe-Diarize [55], align the transcript to the target text, and compute span-only CER (Chinese) and WER (English). As shown in Table 7, MOSS-TTS achieves low span error in all four settings on this internal set, indicating that phoneme-/pinyin-level control is already practically usable, including both local span replacement and full-sentence phoneme control.

7 Conclusion

In this technical report, we presented MOSS-TTS, an open speech generation foundation model built on a scalable recipe: a high-quality audio tokenizer, autoregressive next-token modeling, and large-scale multilingual pretraining. Built on MOSS-Audio-Tokenizer, MOSS-TTS formulates speech generation as autoregressive prediction over aligned text and speech tokens. On top of this tokenizer, MOSS-TTS and MOSS-TTS-Local-Transformer instantiate two complementary operating points: the former emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, while the latter emphasizes higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. The empirical results support the central thesis of the report.
MOSS-Audio-Tokenizer provides strong discrete audio tokens across bitrate regimes, and the two architectures expose a clear and practically useful tradeoff: MOSS-TTS-Local-Transformer is generally stronger on speaker similarity in zero-shot cloning, whereas MOSS-TTS is the more natural backbone for duration control and ultra-long generation. At the same time, the evaluation makes the remaining bottlenecks explicit. The hardest multilingual settings still leave room for improvement, and ultra-long generation shows that long-horizon speaker drift, rather than immediate lexical failure, is now the dominant failure mode, especially in English. We therefore view stronger long-context speaker anchoring, broader low-resource language coverage, and further improvement of fine-grained controllability as the most important next directions.

Taken together, these results suggest that speech generation can benefit from the same principles that have driven recent progress in open large language models: data quality, scale, and architectural simplicity. Rather than relying on increasingly elaborate cascades, MOSS-TTS shows that a strong tokenizer, a large-scale high-quality data pipeline, and a unified autoregressive objective already provide a practical foundation for open speech generation. With the release of MOSS-Audio-Tokenizer, MOSS-TTS, and MOSS-TTS-Local-Transformer, we hope this report can serve both as a reproducible account of the current release and as a clean baseline for future work on open speech foundation models.
Contributors

Core Contributors: Yitian Gong†, Botian Jiang†, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang∗

Contributors: Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li

Advisors: Xipeng Qiu§

Affiliations: Shanghai Innovation Institute; MOSI Intelligence; Fudan University

† Equal contribution. ∗ Project lead. § Corresponding author: xpqiu@fudan.edu.cn.

We especially thank the Infrastructure and Data teams for their essential contributions to the MOSS-TTS release.

References

[1] Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-TTS technical report. arXiv preprint arXiv:2601.15621, 2026.
[2] Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, et al. CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589, 2025.
[3] Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-TTS: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024.
[4] Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, and Yao Hu. FireRedTTS-2: Towards long conversational speech generation for podcast and chatbot. arXiv preprint arXiv:2509.02020, 2025.
[5] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[6] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al.
Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
[7] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. SpeechTokenizer: Unified speech tokenizer for speech large language models. arXiv preprint arXiv:2308.16692, 2023.
[8] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.
[9] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024.
[10] Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu. MaskGCT: Zero-shot text-to-speech with masked generative codec transformer. arXiv preprint arXiv:2409.00750, 2024.
[11] Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, et al. MOSS-Audio-Tokenizer: Scaling audio tokenizers for future audio foundation models. arXiv preprint arXiv:2602.10934, 2026.
[12] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
[13] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
[14] Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, et al.
Codec does matter: Exploring the semantic shortcoming of codec for audio language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25697–25705, 2025.
[15] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
[16] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
[17] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[18] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36:27980–27993, 2023.
[19] Yi-Chiao Wu, Israel D Gebru, Dejan Marković, and Alexander Richard. AudioDec: An open-source streaming high-fidelity neural audio codec. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[20] Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng. FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 591–595. IEEE, 2024.
[21] Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, and Xipeng Qiu. XY-Tokenizer: Mitigating the semantic-acoustic conflict in low-bitrate speech codecs. arXiv preprint arXiv:2506.23325, 2025.
[22] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al.
AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2523–2533, 2023.
[23] Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H Liu, and Hung-yi Lee. Towards audio language modeling: An overview. arXiv preprint arXiv:2402.13236, 2024.
[24] Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, et al. Sparks of large audio models: A survey and outlook. arXiv preprint arXiv:2308.12792, 2023.
[25] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
[26] Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, and Xipeng Qiu. SpeechGPT-Gen: Scaling chain-of-information speech generation. arXiv preprint arXiv:2401.13527, 2024.
[27] Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, et al. ALMTokenizer: A low-bitrate and semantic-rich audio codec tokenizer for audio language modeling. arXiv preprint arXiv:2504.10344, 2025.
[28] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[29] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, and Quoc Le. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
[30] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al.
Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884, 2018.
[31] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.
[32] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
[33] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Juhee Son. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. arXiv preprint arXiv:2005.11129, 2020.
[34] Vadim Popov, Ivan Vovk, Vardan Gogoryan, Tatiana Sadekova, and Mikhail Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021.
[35] Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR, 2021.
[36] James Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243, 2023.
[37] Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis. arXiv preprint arXiv:2411.01156, 2024.
[38] Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu, et al. Voicebox: Text-guided multilingual universal speech generation at scale. arXiv preprint arXiv:2306.15687, 2023.
[39] Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir Antonelli Ponti. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. arXiv preprint arXiv:2112.02418, 2022.
[40] Yinghao Aaron Li, Cong Han, Vinay S Raghavan, Gavin Mischler, and Nima Mesgarani.
StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. arXiv preprint arXiv:2306.07691, 2023.
[41] Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, et al. Baichuan-Audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239, 2025.
[42] Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, et al. MiMo-Audio: Audio language models are few-shot learners. arXiv preprint arXiv:2512.23808, 2025.
[43] Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, and Zhizheng Wu. DualCodec: A low-frame-rate, semantically-enhanced neural audio codec for speech generation. arXiv preprint arXiv:2505.13000, 2025.
[44] Simon Welker, Matthew Le, Ricky TQ Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, and Yi-Chiao Wu. FlowDec: A flow-based full-band general audio codec with high perceptual quality. arXiv preprint arXiv:2503.01485, 2025.
[45] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[47] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[48] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
[49] Kundan Kumar, Rithesh Kumar, Thibault De Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre De Brebisson, Yoshua Bengio, and Aaron C Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 32, 2019.
[50] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36:47704–47720, 2023.
[51] Shengkui Zhao, Bin Ma, and Shinji Watanabe. MossFormer2: Combining transformer and RNN-free recurrent network for enhanced time-domain monaural speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11766–11770. IEEE, 2024.
[52] Jiangyu Han, Petr Pálka, Marc Delcroix, Federico Landini, Johan Rohdin, Jan Cernockỳ, and Lukáš Burget. Efficient and generalizable speaker diarization via structured pruning of self-supervised models. arXiv preprint arXiv:2506.18623, 2025.
[53] Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Jan Cernocky, and Lukas Burget. Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization. arXiv preprint arXiv:2505.24111, 2025.
[54] Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, and Lukáš Burget. Leveraging self-supervised learning for speaker diarization. In Proc. ICASSP, 2025.
[55] Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Zhaoye Fei, Hanfu Chen, Jingqi Chen, Ke Chen, Qinyuan Cheng, Liwei Fan, et al. MOSS Transcribe Diarize: Accurate transcription with speaker diarization. arXiv preprint arXiv:2601.01554, 2026.
[56] Chandan K. A. Reddy, Vishak Gopal, and Ross Cutler. DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 886–890. IEEE, 2022.
[57] Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139, 2025.
[58] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
[59] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[60] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[61] Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, et al. VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning. arXiv preprint arXiv:2509.24650, 2025.
[62] Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, and Xubo Liu. Scaling transformers for low-bitrate high-quality speech coding. arXiv preprint arXiv:2411.19842, 2024.
[63] Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, et al. Llasa: Scaling train-time and inference-time compute for Llama-based speech synthesis. arXiv preprint arXiv:2502.04128, 2025.
[64] BosonAI. Higgs Audio V2: Redefining expressiveness in audio generation. https://github.com/boson-ai/higgs-audio, 2025.
[65] Detai Xin, Xu Tan, Shinnosuke Takamichi, and Hiroshi Saruwatari.
BigCodec: Pushing the limits of low-bitrate neural speech codec. arXiv preprint arXiv:2409.05377, 2024.
[66] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
[67] Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: Transforming Mandarin ASR research into industrial scale. arXiv preprint arXiv:1808.10583, 2018.
[68] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4214–4217. IEEE, 2010.
[69] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, pages 749–752. IEEE, 2001.
[70] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
[71] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation. 2017.
[72] Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, et al. Spark-TTS: An efficient LLM-based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710, 2025.