Paper deep dive
TiCo: Time-Controllable Training for Spoken Dialogue Models
Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/26/2026, 2:48:36 AM
Summary
TiCo is a two-stage post-training framework designed to enable Spoken Dialogue Models (SDMs) to follow time-constrained instructions. It uses Spoken Time Markers (STMs) to track elapsed speaking time during generation and employs Reinforcement Learning with Verifiable Rewards (RLVR) to optimize duration control, evaluated via the newly introduced TiCo-Bench.
Entities (5)
Relation Signals (3)
TiCo → utilizes → Spoken Time Markers
confidence 98% · TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM)
TiCo → evaluated by → TiCo-Bench
confidence 95% · we first introduce TiCo-Bench, a benchmark designed to evaluate the time-controllable instruction-following capability of SDMs.
TiCo → optimizes → SDMs
confidence 95% · TiCo, a two-stage training framework that enables spoken dialogue models to generate Spoken Time Markers
Cypher Suggestions (2)
Find all methods related to time-controllable generation in SDMs · confidence 90% · unvalidated
MATCH (m:Methodology)-[:UTILIZES]->(t:Technique) WHERE m.name = 'TiCo' RETURN m, t
Identify benchmarks used to evaluate SDMs · confidence 85% · unvalidated
MATCH (b:Benchmark)<-[:EVALUATED_BY]-(m:Methodology) RETURN b.name, m.name
Abstract
We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
Tags
Links
- Source: https://arxiv.org/abs/2603.22267v1
- Canonical: https://arxiv.org/abs/2603.22267v1
Full Text
79,875 characters extracted from source content.
TiCo: Time-Controllable Training for Spoken Dialogue Models

Kai-Wei Chang† Wei-Chih Chen‡ En-Pei Hu‡ Hung-yi Lee‡ James Glass† (†MIT, ‡NTU) kwchang@mit.edu. Equal contribution.

Abstract

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.

Figure 1: Overview of TiCo, a two-stage framework for time-controllable speech generation. Stage 1 (top): The model leverages self-generation to produce responses annotated with Spoken Time Markers (STMs), which serve as supervision for learning time awareness, i.e., associating intermediate generation states with temporal progress and estimating elapsed speaking time.
Stage 2 (bottom): The model is further optimized via RLVR, where rewards are derived from STMs, enabling the model to regulate response duration in real time.

1 Introduction

"Time is money," as famously stated by Benjamin Franklin, highlights the fundamental value of time in human life. In human–computer interaction, time is a critical resource that directly impacts usability, deployment cost, and safety-critical decision making. This is especially true for Spoken Dialogue Models (SDMs) [4, 16, 29], which are gaining increasing attention in real-world applications, such as personal assistants, wearable devices, and healthcare systems [17, 1]. In these scenarios, a response must not only be accurate and natural, but often also strictly bounded in duration. For example, a voice assistant may be required to provide a traffic update while driving; a wearable device may require concise spoken feedback due to battery or bandwidth constraints. Similarly, in medical or emergency scenarios, a voice assistant may need to deliver brief yet informative instructions under strict time pressure. In all of these cases, the ability to control response duration is a key requirement for practical deployment.

Despite its importance, time controllability remains largely underexplored in SDMs. In the domain of text Large Language Models (LLMs), prior studies have shown that models often struggle to follow explicit length-constraint instructions [73]. Moreover, LLM outputs often exhibit verbosity or length bias, a phenomenon associated with preference-based evaluation and alignment [20, 28]. This tendency weakens instruction-following capability and negatively affects user experience. While benchmarks, prompting strategies, and training strategies have begun to address length controllability in text LLMs [73, 63, 30, 54], this research direction remains active and continues to attract attention due to its substantial practical importance.
However, controlling response duration in SDMs is considerably more challenging than controlling output length in text. In speech generation, word count is only a proxy for actual duration. A single word may contain different numbers of syllables, and speech duration is known to vary with phonetic composition, linguistic context, and prosodic structure [31]. Moreover, speaking rate may vary across speakers, speaking styles, and communicative conditions, depending on both speaker and listener [36]. As a result, simply constraining the number of generated words does not guarantee accurate control over the final speech duration. This mismatch makes duration control a unique and more demanding problem in spoken dialogue systems.

Given the limited study of time controllability in SDMs, we first introduce TiCo-Bench, a benchmark designed to evaluate the time-controllable instruction-following capability of SDMs. Our evaluation reveals that existing SDMs struggle to reliably satisfy explicit time constraints. To address this challenge, we propose TiCo (Figure 1), a simple yet effective post-training framework that enables SDMs to estimate and regulate generated speech duration in real time through Spoken Time Markers. The core idea of TiCo is to equip the model with an internal mechanism for time awareness, allowing it to track temporal progress during generation and adjust its responses accordingly. Specifically, TiCo consists of a two-stage training procedure. In the first stage, the model leverages self-generation to construct supervision data for learning duration awareness, enabling it to associate intermediate generation states with temporal progress and estimate the elapsed speaking time.
In the second stage, Reinforcement Learning with Verifiable Rewards (RLVR) [52] is applied, where rewards are automatically verified based on the Spoken Time Markers, to further shape the response distribution and improve compliance with duration-related instructions. This stage encourages the model to better satisfy target time constraints while preserving the response quality, including helpfulness and coherence.

Our contributions are summarized as follows:
- We propose TiCo, a two-stage training framework that enables spoken dialogue models to generate Spoken Time Markers (STMs) during inference and perform real-time control over response duration.
- We introduce TiCo-Bench, the first benchmark designed to evaluate the time controllability of spoken dialogue models (SDMs), measuring whether they can follow explicit duration-related instructions.
- We conduct extensive experiments showing that TiCo significantly improves duration controllability while preserving response quality, and further demonstrate that the learned capability generalizes beyond the duration range seen during training.

2 Related Works

2.1 Spoken Dialogue Models

Spoken Dialogue Models (SDMs) [4, 16] aim to enable natural human-computer interaction by directly understanding and generating spoken conversations. Unlike traditional voice assistants that rely on cascaded ASR, text generation, and TTS modules, recent SDMs increasingly adopt end-to-end or tightly integrated modeling paradigms [24, 29]. However, compared to text-based LLMs operating in a semantically rich textual space, speech is considered significantly more challenging to process due to the high variability and complexity of acoustic signals (a challenge largely explored in prior work such as the "Textless NLP" paradigm [52, 47, 27, 11, 12, 48], where quantized speech representations are treated as "pseudo text" to improve training efficiency and efficacy).
As a result, many recent SDMs introduce intermediate representations, most commonly text, to support semantic planning while generating the speech response. This includes reasoning to improve response quality [15], tool calling [5] to leverage external modules, and more direct guidance over spoken content [66, 67]. Specifically, the SDM first takes the input query (in either text or speech form) to generate an intermediate representation, which is then consumed by a speech generator to produce the final output speech representation (e.g., phonetic tokens and acoustic tokens [4, 26, 62]), and subsequently synthesized into a waveform. We provide a survey of representative SDMs and their intermediate representations in Appendix G.

Recently, several benchmarks have begun to evaluate SDMs beyond response quality, incorporating dimensions such as speaking style, time-awareness, and controllability. For instance, Full-Duplex-Bench [35, 34] evaluates full-duplex conversation by quantifying the ability of SDMs to engage in simultaneous speaking. ParaS2S [69] focuses on paralinguistic awareness, assessing how well SDMs capture aspects such as speaking styles and emotional expression. F-Actor [78] explores controllable conversational behavior in SDMs, showing that SDMs can be guided with instructions over attributes such as speaker voice, topic, and interaction style. In addition, and most related to TiCo, Game-Time [10] instead targets time-awareness, including adherence to temporal instructions, tempo control, and overlapping speech generation. Despite the emergence of such benchmarks on controllability and time-awareness, to the best of our knowledge, TiCo is the first method that explicitly enables time-controllable generation for SDMs through a simple post-training approach. It is worth noting that TiCo differs fundamentally from duration modeling in TTS systems [49, 64].
While duration modeling in TTS primarily focuses on aligning text with synthesized speech, TiCo instead targets time-controllable spoken response generation. This setting requires spoken dialogue models (SDMs) to perform semantic planning and reasoning while dynamically adapting to time-related constraints during generation. Moreover, TiCo is orthogonal to prior work on temporal understanding [55], which aims to equip speech models with the ability to interpret temporal information in input audio (e.g., "What is the time interval of the query 'a dog barking' in the audio?") [58]. In contrast, TiCo focuses on time awareness in the generation process, rather than temporal comprehension of the input.

2.2 Length-Control Large Language Models

Precise regulation over the generation length of LLMs is critical for adapting these systems to constrained downstream applications and fulfilling specific user requirements. To enforce exact length constraints without incurring the computational overhead of retraining, researchers have proposed various training-free and decoding-time interventions. These methods encompass advanced sampling algorithms [25], task-specific zero-shot prompting strategies [50], and decoding interventions that dynamically increase the weight of the End-of-Sequence (EOS) token [7]. For more fine-grained regulation, instruction tuning approaches explicitly integrate positional awareness to continuously monitor the remaining generation length. Recent methodologies achieve this by incorporating length-tracking signals directly into the generation process, such as modifying the input encodings to reflect the distance to the target length [9], interleaving latent tracking tokens alongside the output sequence [54], or injecting explicit positional markers to enforce strict length constraints [59].
These generalized alignment strategies enable models to strictly adhere to predefined length constraints and reliably execute complex text manipulation operations, including precise copy-and-paste tasks. Beyond explicit instruction tuning, a parallel line of research leverages reinforcement learning (RL) and preference optimization to align large language models with human expectations. However, standard alignment processes frequently introduce verbosity biases, where models tend to associate longer responses with higher quality. To address this fundamental issue, advanced RL-based approaches have been refined to decouple explicit length biases from actual response evaluation [39, 32]. Expanding beyond the mere mitigation of superficial verbosity, recent RL paradigms have shifted focus toward actively managing the internal computational reasoning duration of models. Depending on the objective, these frameworks either establish constrained environments to enforce concise reasoning steps [37], or deliberately prolong these internal cognitive trajectories to expand the problem-solving capabilities of the system prior to generating a final answer [38, 2, 30].

3 TiCo

A speech-to-speech Spoken Dialogue Model (SDM) can be viewed as a conditional generative model that produces a spoken response $y^{sp}$ given the user's input speech query $x^{sp}$ and a textual instruction $p$ (e.g., a system prompt). Modern SDMs often introduce intermediate representations $z$ to bridge high-level semantic reasoning and low-level speech synthesis. Concretely, an intermediate sequence generator $p_\theta$ first generates an intermediate representation conditioned on the user input:

$z \sim p_\theta(z \mid x^{sp}, p).$ (1)

The final spoken response is then generated by a speech generator $q_\phi$:

$y^{sp} \sim q_\phi(y^{sp} \mid z, x^{sp}, p).$ (2)

Different architectures impose different conditional independence assumptions on Eq. (2).
In cascaded systems, the speech synthesis module has no access to the original user speech or instruction, reducing the generation to $q_\phi(y^{sp} \mid z)$. In end-to-end models, the generation of $y^{sp}$ may additionally depend on $x^{sp}$ and $p$ (for example, in Qwen-Omni's "Thinker-Talker" design [66, 67]).

3.1 TiCo Stage 1: Time-Awareness Training

This stage (Figure 1, top) trains the model to generate Spoken Time Markers as part of the intermediate representation $z$, so that $z$ encodes not only semantic content but also its expected temporal alignment with the final spoken response $y^{sp}$ under the conditioning context $(x^{sp}, p)$. These markers are inserted into $z$ through a self-generation process and used as prediction targets during training.

Spoken Time Marker. A Spoken Time Marker is a special token indicating the estimated cumulative speaking duration up to a given position in the intermediate representation. Conceptually, these markers serve as a discretized alignment signal between the intermediate semantic plan $z$ and the realized spoken response $y^{sp}$ under the same conditioning context $(x^{sp}, p)$. Inspired by TimeMarker [13], we represent these markers in textual form, e.g., <6.8 seconds>. Estimating duration at the intermediate level is non-trivial. A single word may correspond to multiple syllables, and its acoustic duration may vary depending on context and speaking rate. Explicit duration estimation is therefore required to bridge the gap between the intermediate representation and the final speech realization.

Training Data Construction. Let $\mathcal{D} = \{(x^{sp}, p)\}$ denote a pool of input speech query-instruction pairs. In this stage, we construct time-aware training targets through self-generation followed by ASR-based alignment. Specifically, given each input $(x^{sp}, p)$, the model first freely generates an intermediate representation $z$ and its corresponding spoken response $y^{sp}$.
We then apply ASR-based alignment to estimate the temporal correspondence between $z$ and $y^{sp}$. Based on the aligned timestamps, we define a sequence of Spoken Time Markers $t = [t_1, \ldots, t_M]$, where each $t_j$ denotes the estimated cumulative speaking duration at an aligned position in $z$. We interleave these markers with the intermediate tokens to obtain an augmented sequence:

$\tilde{z} = [\, z_1, \ldots, z_i, t_j, \ldots, z_N, t_M \,].$ (3)

As a result, the augmented sequence $\tilde{z}$ encodes not only semantic content, but also alignment-induced timing information that links $z$ to the final spoken response under the same input condition $(x^{sp}, p)$. This process yields an aligned training set $\mathcal{D}_{SFT} = \{(x^{sp}, p, \tilde{z})\}$. We model the augmented intermediate sequence autoregressively as

$p_\theta(\tilde{z} \mid x^{sp}, p) = \prod_{n=1}^{|\tilde{z}|} p_\theta(\tilde{z}_n \mid \tilde{z}_{<n}, x^{sp}, p).$ (4)

We then optimize the standard supervised fine-tuning (SFT) objective:

$\mathcal{L}_{SFT} = -\mathbb{E}_{(x^{sp}, p, \tilde{z}) \sim \mathcal{D}_{SFT}} \left[ \sum_{n=1}^{|\tilde{z}|} \log p_\theta(\tilde{z}_n \mid \tilde{z}_{<n}, x^{sp}, p) \right].$ (5)

It's worth noting that self-generation offers two advantages: (1) it removes the need for collecting paired question-answer supervision, and (2) the generated responses follow the model's own output distribution, which improves training stability [41].

3.2 TiCo Stage 2: Time-Controllable Training

This stage (Figure 1, bottom) further trains the model to follow time-constrained instructions. We augment the textual instruction $p$ with a duration constraint and denote the resulting instruction by $p^{dur}$, where the target duration is denoted by $t_{inst}$. Since Spoken Time Markers reside in the intermediate representation, we apply reinforcement learning to the intermediate-sequence generator $p_\theta(\tilde{z} \mid x^{sp}, p^{dur})$.
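To make the Stage-1 interleaving of Eq. (3) concrete, the following is a minimal, hypothetical sketch (not the authors' code): given word-level ASR timestamps for a self-generated response, it inserts a cumulative-time marker after each token ending in sentence-level punctuation, matching the marker format used in the paper's examples. The function name and punctuation set are our own assumptions.

```python
# Hypothetical sketch of Spoken Time Marker interleaving for Stage-1 data.
# Inputs: ASR-aligned tokens and their cumulative end times in seconds.
def insert_time_markers(words, end_times, punctuation=".,!?"):
    """Interleave <T seconds> markers after punctuation-terminated tokens."""
    out = []
    for word, t in zip(words, end_times):
        out.append(word)
        if word and word[-1] in punctuation:
            out.append(f"<{t:.1f} seconds>")
    return " ".join(out)

augmented = insert_time_markers(
    ["Well,", "the", "ocean", "is", "deep."],
    [0.9, 1.4, 1.9, 2.1, 3.8],
)
# -> "Well, <0.9 seconds> the ocean is deep. <3.8 seconds>"
```

The augmented string plays the role of $\tilde{z}$: the model is then fine-tuned to emit the markers itself.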
Specifically, we adopt GRPO [52] to optimize time controllability, and incorporate CHORD [74] as a dynamically weighted auxiliary objective that integrates off-policy expert trajectories into the on-policy RL process. This regularization steers the policy toward the expert trajectories in the Stage-1-constructed dataset $\mathcal{D}_{SFT}$ while preserving on-policy exploration. In practice, we find this mechanism crucial for stabilizing training, as GRPO alone frequently leads to reward hacking.

Given an input $(x^{sp}, p^{dur})$, we sample a group of $G$ candidate augmented intermediate sequences from the old policy:

$\tilde{z}^{(g)} \sim p_{\theta_{old}}(\cdot \mid x^{sp}, p^{dur}), \quad g = 1, \ldots, G.$ (6)

Reward Design. The main reward measures the accuracy of the predicted total duration:

$R_{main}^{(g)} = F(t_{inst} - t_{last}^{(g)}),$ (7)

where $t_{inst}$ is the target duration specified in the instruction and $t_{last}^{(g)}$ is the duration indicated by the final generated time marker in $\tilde{z}^{(g)}$. We instantiate $F$ as a Gaussian function, i.e., $F(\Delta t) = \exp(-(\Delta t)^2 / (2\sigma^2))$, where $\sigma$ controls the tolerance to duration errors. We additionally introduce several auxiliary rewards to stabilize training and mitigate reward hacking, including a "presence reward" that encourages the model to generate at least one time marker, a "monotonicity reward" that encourages time markers to increase monotonically, a "repetition penalty" that discourages repeatedly generating identical time markers, and a "copy penalty" that discourages trivial copying of the instructed duration. The detailed definitions of these auxiliary rewards, as well as the corresponding ablation study, are provided in Appendix B. The overall reward for the $g$-th sample is

$R^{(g)} = R_{main}^{(g)} + R_{aux}^{(g)}.$ (8)
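As a minimal sketch (not the authors' released code), the Gaussian main reward of Eq. (7) and the group-relative advantage normalization used by GRPO can be written as follows. The marker regex, the value of sigma, and all function names are our own assumptions.

```python
import math
import re

# Marker format assumed from the paper's examples, e.g. "<10.6 seconds>".
MARKER = re.compile(r"<(\d+(?:\.\d+)?) seconds>")

def main_reward(t_inst, z_aug, sigma=5.0):
    """Gaussian reward F(t_inst - t_last); 0 if no time marker was generated."""
    markers = MARKER.findall(z_aug)
    if not markers:
        return 0.0
    t_last = float(markers[-1])
    return math.exp(-((t_inst - t_last) ** 2) / (2 * sigma ** 2))

def group_advantages(rewards):
    """Group-relative normalized advantages over G sampled candidates."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [0.0] * g if std == 0 else [(r - mean) / std for r in rewards]

# A candidate whose last marker hits the 15 s target earns the maximum reward.
rewards = [main_reward(15.0, z) for z in
           ["... <15.0 seconds>", "... <11.0 seconds>", "no marker at all"]]
advs = group_advantages(rewards)  # best candidate gets the largest advantage
```

The auxiliary rewards (presence, monotonicity, repetition, copy) would be added on top of `main_reward` before normalization.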
We then optimize the intermediate-sequence generator with GRPO:

$\mathcal{L}_{GRPO} = -\mathbb{E} \left[ \frac{1}{G} \sum_{g=1}^{G} \frac{1}{|\tilde{z}^{(g)}|} \sum_{n=1}^{|\tilde{z}^{(g)}|} \ell_{g,n}(\theta) \right],$ (9)

where

$\ell_{g,n}(\theta) = \min \left( \rho_{g,n}(\theta) \hat{A}^{(g)},\; \mathrm{clip}(\rho_{g,n}(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}^{(g)} \right) - \beta\, D_{KL}[p_\theta \,\|\, p_{ref}]_n,$ (10)

and

$\rho_{g,n}(\theta) = \frac{p_\theta(\tilde{z}_n^{(g)} \mid \tilde{z}_{<n}^{(g)}, x^{sp}, p^{dur})}{p_{\theta_{old}}(\tilde{z}_n^{(g)} \mid \tilde{z}_{<n}^{(g)}, x^{sp}, p^{dur})}.$ (11)

Here, $\hat{A}^{(g)}$ denotes the group-relative normalized advantage for sample $g$, computed from $\{R^{(g)}\}_{g=1}^{G}$ following GRPO. $p_{ref}$ denotes the reference policy (i.e., the Stage 1 checkpoint before RL training), and $\beta$ is the KL penalty coefficient. Following CHORD [74], we additionally regularize training with expert trajectories from the first stage. The final training loss at optimization step $s$ is

$\mathcal{L}^{(s)} = (1 - \alpha_s)\, \mathcal{L}_{GRPO} + \alpha_s\, \mathcal{L}_{SFT},$ (12)

where $\alpha_s$ is a step-dependent coefficient as described in CHORD [74]. Specifically, $\alpha_s$ gradually decays over the course of training, allowing the regularizing effect of the SFT loss to diminish as the model improves.

4 Experiments

Figure 2: Overview of TiCo-Bench construction. Base queries are collected from four distinct text and speech datasets (totaling 720 queries). Explicit time-control instructions are then inserted into these queries. By applying both a short-duration setting (10–30 secs) and a long-duration setting (30–60 secs) to each query, the initial dataset is doubled, resulting in a final benchmark of 1,440 evaluation samples.

4.1 TiCo-Bench

Dataset Construction. As illustrated in Figure 2, TiCo-Bench is constructed by deriving samples from existing spoken and textual question datasets to provide a rigorous evaluation across diverse tasks and modalities. Specifically, the benchmark is sourced from InstructS2S [21], UROBench [68], and LIFEBench [73].
InstructS2S is utilized to evaluate fundamental understanding capabilities. To assess performance in more complex scenarios, we incorporate UROBench, explicitly extracting its reasoning subset to evaluate cognitive processing. Furthermore, we leverage LIFEBench, an existing length-control benchmark in the text domain, whose task categories encompass question answering, creative generation, reasoning, and summarization. Regarding input modalities, InstructS2S and UROBench are native speech datasets, whereas LIFEBench is a text-based dataset. We directly utilize the audio questions from InstructS2S and UROBench as speech queries to evaluate the spoken language models. Conversely, the text questions in LIFEBench serve as textual queries. To enable a more detailed ablation study regarding input modalities, we also utilize the explicit text transcriptions provided in UROBench, establishing a parallel set of text queries for comparative analysis.

To construct the final benchmark, we sample 180 questions from each dataset source. Because UROBench is repurposed to provide both speech and text queries, this creates four distinct evaluation subsets, totaling 720 unique base queries. We augment these queries with randomly assigned, time-controlled instructions. These instructions are formulated as textual prompts to explicitly guide the generative models toward the target response length (e.g., "Your response duration should be exactly 25 seconds."). Every query is evaluated under two distinct temporal constraints: a short setting (10 to 30 seconds) and a long setting (30 seconds to 1 minute). This systematic augmentation yields a rigorous benchmark of 1,440 evaluation samples.

Metrics. We evaluate duration controllability using two metrics. Let $d_i$ denote the actual duration of the generated speech for the $i$-th sample, and let $t_{inst,i}$ denote the target duration specified in the instruction.
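The duration-instruction augmentation described above (one short-setting and one long-setting instruction per base query) can be sketched as follows. The template mirrors the paper's example instruction; the helper name and the use of a seeded RNG are our own assumptions.

```python
import random

# Hypothetical sketch of TiCo-Bench's augmentation step.
TEMPLATE = "Your response duration should be exactly {} seconds."

def augment_query(query, rng):
    """Pair one base query with a short-setting and a long-setting target."""
    short = rng.randint(10, 30)   # short setting: 10 s - 30 s
    long_ = rng.randint(30, 60)   # long setting: 30 s - 1 min
    return [(query, TEMPLATE.format(short)),
            (query, TEMPLATE.format(long_))]

rng = random.Random(0)
samples = [s for q in ["How deep is the ocean?"] for s in augment_query(q, rng)]
# Applied to 720 base queries, this yields the benchmark's 1,440 samples.
```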
The Mean Absolute Error (MAE) measures the average absolute deviation in seconds:

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |d_i - t_{inst,i}|,$ (13)

while the Mean Absolute Percentage Error (MAPE) normalizes the deviation by the target duration:

$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|d_i - t_{inst,i}|}{t_{inst,i}} \times 100\%.$ (14)

MAE captures the absolute magnitude of duration error, whereas MAPE provides a scale-invariant measure that enables fair comparison across different target durations. We further evaluate response quality using a GPT-score. We first transcribe the generated speech using ASR, and then prompt an LLM (GPT-5-mini) to assess the response quality. The model assigns a score from 1 to 5 for each sample, and we report the average score.

Baselines. We compare against three categories of baselines in TiCo-Bench: (1) open-source SDMs, (2) commercial models, and (3) cascaded systems. For the cascaded strong baselines, we employ an LLM prompted to generate a response that satisfies the target duration constraint as closely as possible, and then use a text-to-speech system to synthesize the corresponding speech. Specifically, we utilize GPT-5.2 [45] as a frontier commercial LLM and Qwen2.5-7B-Instruct [56] as a representative SoTA open-source language model. For the TTS component, IndexTTS-2 [77] is employed to generate high-quality speech from the LLM response. Detailed prompts used for the cascaded system can be found in Appendix C. To ensure that evaluation reflects generation quality rather than truncation artifacts, all SDMs are allocated a sufficiently large token budget to cover responses of up to 1 minute of speech.

4.2 Experimental Setup

We adopted MS-SWIFT (SWIFT) [76] (https://github.com/modelscope/ms-swift) to train the model throughout this paper. We adopt Qwen2.5-Omni-7B [66] as the backbone model. Spoken Time Markers are inserted into the output of the "Thinker".
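The two evaluation metrics of Eqs. (13) and (14) amount to a few lines of code; this is a minimal sketch with hypothetical function names, not the benchmark's official implementation.

```python
# Minimal sketch of the duration-error metrics: MAE (Eq. 13), MAPE (Eq. 14).
def duration_mae(durations, targets):
    """Mean Absolute Error in seconds."""
    n = len(durations)
    return sum(abs(d - t) for d, t in zip(durations, targets)) / n

def duration_mape(durations, targets):
    """Mean Absolute Percentage Error, in percent of the target duration."""
    n = len(durations)
    return 100.0 * sum(abs(d - t) / t for d, t in zip(durations, targets)) / n

# Two samples: 12 s against a 10 s target, 33 s against a 30 s target.
mae = duration_mae([12.0, 33.0], [10.0, 30.0])    # -> 2.5 seconds
mape = duration_mape([12.0, 33.0], [10.0, 30.0])  # -> 15.0 percent
```

Note how MAPE weights the same 2-3 s error more heavily on the shorter target, which is why the paper reports both.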
In both training stages of TiCo, only the "Thinker" is trained, while the "Talker" remains fixed. During inference, Spoken Time Markers are used only for intermediate planning and are removed via a simple regex before feeding the cleaned sequence $z$ into the "Talker" for speech generation. Although our experiments are conducted on this architecture, TiCo is not restricted to a specific SDM design. The Spoken Time Marker mechanism can be applied to any spoken dialogue model that generates an intermediate representation prior to speech synthesis.

We sample 4,000 speech questions from InstructS2S [21] as in-domain training data (with 400 held out for evaluation). The training data do not overlap with the in-domain test set in TiCo-Bench. Word-level timestamps for constructing Spoken Time Markers are obtained using Whisper medium [40]. For simplicity, we insert a Spoken Time Marker after sentence-level punctuation marks (e.g., commas, periods, and exclamation marks). The distribution of Spoken Time Markers in the training data is shown in Figure 3.

Figure 3: Distribution of Spoken Time Markers in the first-stage training data. (a) Markers per response; (b) marker timestamps; (c) inter-marker intervals.

During training, the maximum number of generated tokens for Qwen2.5-Omni-7B is set to 2,048, corresponding to approximately 41 seconds of speech. At inference time, this limit is increased to 4,096 to support longer responses. This configuration is primarily adopted for efficiency and to evaluate the model's ability to generalize to longer outputs, as TiCo-Bench extends up to one minute. In principle, the model can also be trained on longer-response data if desired. Additional training details are provided in Appendix B.

5 Results

5.1 TiCo-Bench

As shown in Table 1, TiCo consistently outperforms all baselines across both datasets (in-domain InstructS2S and out-of-domain UROBench) and both duration settings.
TiCo achieves the best overall performance with 4.54 s MAE and 14.9% MAPE, substantially improving over its base model Qwen2.5-Omni-7B (13.01 s / 42.3%) and the strongest baseline, Cascade (GPT) (10.41 s / 29.0%). These results demonstrate that TiCo post-training significantly improves the time controllability of the backbone SDM.

Table 1: TiCo-Bench (Speech) evaluation of time-controlled instruction-following performance on speech-query tasks under short (10s–30s) and long (30s–1min) settings. Model categories are indicated by color: Cascaded, Commercial, Open-sourced, and Proposed. Results are reported as MAE (seconds) / MAPE (%). Lower is better.

| Model | Short: InstructS2S | Short: UROBench | Long: InstructS2S | Long: UROBench | Overall MAE / MAPE | GPT-score |
|---|---|---|---|---|---|---|
| Cascade (GPT) | 4.09 / 19.6% | 5.92 / 28.8% | 12.17 / 25.7% | 19.44 / 41.7% | 10.41 / 29.0% | 4.15 |
| Cascade (Qwen) | 8.12 / 37.7% | 10.38 / 51.2% | 21.77 / 46.8% | 29.72 / 64.2% | 17.50 / 50.0% | 3.39 |
| GPT-audio | 7.79 / 36.3% | 18.11 / 92.0% | 12.42 / 27.5% | 17.41 / 41.1% | 13.93 / 49.2% | 3.88 |
| Kimi Audio | 24.45 / 129.3% | 34.46 / 196.7% | 25.96 / 58.0% | 32.28 / 74.1% | 29.29 / 114.5% | 1.67 |
| LFM Audio | 13.08 / 75.0% | 12.52 / 62.6% | 23.74 / 48.4% | 35.06 / 75.6% | 21.10 / 65.4% | 2.63 |
| Mimo-Audio | 20.71 / 115.0% | 12.45 / 74.7% | 20.44 / 45.0% | 18.37 / 39.9% | 17.99 / 68.7% | 3.32 |
| Qwen2.5-Omni-7B | 7.55 / 43.6% | 8.62 / 50.6% | 16.60 / 34.6% | 19.27 / 40.4% | 13.01 / 42.3% | 3.57 |
| TiCo (Proposed) | 3.16 / 15.6% | 3.71 / 19.4% | 5.16 / 11.3% | 6.13 / 13.4% | 4.54 / 14.9% | 3.56 |

Figure 4: Duration MAE and MAPE of Qwen2.5-Omni-7B and TiCo across instructed-duration bins. TiCo maintains consistently lower error across all duration ranges.

It's worth noting that the cascaded systems are more competitive than most SDM baselines, indicating that explicit duration planning with a strong LLM is helpful.
Nevertheless, they remain clearly inferior to TiCo, suggesting that text-level duration estimation alone is insufficient for speech-level control, where the final duration also depends on speaking rate and speech realization. Another notable trend is that most direct speech models perform much worse on the short subsets than on the long subsets in terms of relative error, indicating difficulty in satisfying tight time budgets. This observation is consistent with findings in text LLMs, where models often prefer longer responses. In contrast, TiCo maintains low error across both short and long settings and is the only model to achieve below 20% MAPE on all subsets.

5.2 Generalization to Longer Responses and Text Queries

We further examine whether TiCo generalizes beyond the conditions seen during post-training.

Generalization to longer responses. Although TiCo is post-trained with a maximum sequence length of 2,048 audio tokens, corresponding to roughly 41 seconds of speech, as shown in Figure 3(b), it generalizes well to substantially longer responses. As shown in Figure 4, TiCo maintains consistently low MAE and MAPE across instructed-duration bins on both the in-domain InstructS2S and out-of-domain UROBench benchmarks. In contrast, the error of the backbone Qwen2.5-Omni-7B increases noticeably as the target duration becomes longer, especially in the 45–60s bins. Notably, TiCo's relative error on long-duration bins is comparable to, or even lower than, that on short-duration bins, indicating that TiCo can extrapolate its time-control capability to durations up to 1 minute.

Generalization from speech to text queries. TiCo is post-trained only with speech queries, yet it transfers well to text-query inputs. As shown in Table 2, TiCo achieves the best overall performance with 5.35 s MAE and 18.0% MAPE, substantially outperforming its backbone Qwen2.5-Omni-7B (13.63 s / 48.9%).
Moreover, on LIFEBench, where cascaded systems are available, TiCo also surpasses the strongest cascaded baseline, Cascade (GPT), under both short and long settings. These results suggest that the duration-control capability learned during post-training is not limited to the speech-query format seen during training.

Table 2: TiCo-Bench (Text) evaluation of time-controlled instruction-following performance on text-query tasks under short (10s–30s) and long (30s–1min) settings. Model categories: Cascaded, Commercial, Open-sourced, and Proposed. Results are reported as MAE (seconds) / MAPE (%). Lower is better.

| Model | LIFEBench (Short) | UROBench-text (Short) | LIFEBench (Long) | UROBench-text (Long) | Overall MAE / MAPE | GPT-score |
|---|---|---|---|---|---|---|
| Cascade (GPT) | 4.83 / 23.5% | 5.92 / 28.8% | 8.35 / 18.4% | 19.44 / 41.7% | 9.64 / 28.1% | 3.58 |
| Cascade (Qwen) | 7.98 / 40.4% | 10.38 / 51.2% | 19.16 / 42.9% | 29.72 / 64.2% | 16.81 / 49.7% | 2.86 |
| GPT-audio | 14.33 / 73.6% | 18.75 / 97.1% | 11.07 / 26.3% | 15.94 / 38.0% | 15.02 / 58.8% | 2.78 |
| LFM Audio | 18.84 / 101.1% | 12.43 / 60.4% | 35.74 / 79.1% | 39.21 / 85.4% | 26.56 / 81.5% | 1.95 |
| Mimo-Audio | 12.66 / 62.0% | 9.40 / 46.8% | 31.75 / 69.7% | 29.59 / 63.8% | 20.85 / 60.6% | 1.39 |
| Qwen2.5-Omni-7B | 13.74 / 79.1% | 7.73 / 44.9% | 13.86 / 30.8% | 19.18 / 40.6% | 13.63 / 48.9% | 2.67 |
| TiCo (Proposed) | 4.46 / 22.7% | 4.08 / 21.0% | 6.62 / 14.8% | 6.25 / 13.5% | 5.35 / 18.0% | 2.76 |

5.3 Spoken Time Token Prediction Analysis

TiCo relies on Spoken Time Token generation during semantic planning. Figure 5 compares the instructed duration with (i) the final response duration and (ii) the duration indicated by the last Spoken Time Token. The results show that these two errors are closely aligned, suggesting that the last Spoken Time Token provides a reliable approximation of the final response duration. However, we do observe room for further improvement. Enhancing the training strategy for more accurate time prediction is left for future work.
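The analysis in this subsection only requires parsing the markers out of a transcript and comparing the last one against the instructed duration. A minimal sketch, assuming markers follow the `<10.6 seconds>` format shown in the paper's examples (the regex and helper names are our own):

```python
import re

# Parse Spoken Time Markers like "<10.6 seconds>" out of a generated
# transcript, then compare the final marker against the target duration.
STM_PATTERN = re.compile(r"<(\d+(?:\.\d+)?) seconds>")

def stm_values(transcript: str) -> list[float]:
    """All marker values, in generation order."""
    return [float(v) for v in STM_PATTERN.findall(transcript)]

def last_marker_error(transcript: str, t_inst: float) -> float:
    """|t_last - t_inst|; raises if the model emitted no markers."""
    markers = stm_values(transcript)
    if not markers:
        raise ValueError("no Spoken Time Markers found")
    return abs(markers[-1] - t_inst)

resp = ("Well, <0.9 seconds> you know, <1.6 seconds> the ocean is really "
        "deep in some places. <15.0 seconds>")
err = last_marker_error(resp, 15.0)
```

Plotting `last_marker_error` against the error of the realized audio duration reproduces the comparison of Figure 5.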
Figure 5: Duration error of TiCo across instructed-duration bins, comparing two reference signals: the instructed duration t_inst and the final Spoken Time Marker t_last. The close alignment indicates that the final time marker accurately estimates realized speech duration.

5.4 Qualitative Results

The qualitative results shown below demonstrate that the model can generate responses with varying lengths and adapt to different target durations while preserving the core semantic content. For a 15-second target, the response is concise and focused, whereas for a 40-second target, the model extends the utterance through additional elaboration and follow-up content while maintaining the same underlying answer. More examples are presented in Appendix F.

How deep is the ocean?, target duration = 15 s
Prompt: How deep is the ocean? Please respond with exactly 15 seconds.
Speech response: Well, <0.9 seconds> you know, <1.6 seconds> the ocean is really deep in some places. <3.8 seconds> The Mariana Trench is the deepest part, <7.2 seconds> it's about 36 kilometers deep. <9.4 seconds> But overall, <11.0 seconds> the average depth of the world's oceans is around 3.7 kilometers. <15.0 seconds>
Target duration: 15.0 seconds
Actual duration: 15.2 seconds

How deep is the ocean?, target duration = 40 s
Prompt: How deep is the ocean? Please respond with exactly 40 seconds.
Speech response: Well, <1.5 seconds> you know, <2.3 seconds> the ocean is really deep in some places. <4.6 seconds> The Mariana Trench is one of the deepest parts. <7.9 seconds> It's about 36 kilometers deep. <10.8 seconds> But overall, <12.5 seconds> if we're talking about an average depth for the whole ocean, <15.5 seconds> it's around 3.7 kilometers. <18.0 seconds> However, <19.4 seconds> different areas can have very different depths. <22.3 seconds> Some shallow bays might be just a few meters deep while others in the middle of the ocean could be much deeper than that.
<29.0 seconds> So, <30.0 seconds> it's not really a fixed number like 3.7 kilometers everywhere. <33.0 seconds> If you want to know more about why the ocean has such varying depths or anything else related, <39.0 seconds> feel free to ask. <40.0 seconds>
Target duration: 40.0 seconds
Actual duration: 41.6 seconds

6 Conclusion

In this work, we introduced TiCo, a simple yet effective post-training framework that enables spoken dialogue models to follow explicit time-controlled instructions through Spoken Time Markers and a two-stage training process for time awareness and time-controllability. We also presented TiCo-Bench, the first benchmark designed to evaluate time-controllable instruction following in spoken dialogue models. Experimental results demonstrate that TiCo substantially improves duration controllability over strong open-source, commercial, and cascaded baselines, while maintaining response quality comparable to the backbone model. Moreover, TiCo generalizes beyond the duration range observed during training and transfers effectively from speech-query to text-query inputs, suggesting that temporal control can be learned as a robust intermediate planning capability. In future work, we aim to further improve the precision of time-marker prediction, extend training to longer and more diverse conversational scenarios, and explore how time control can be integrated with other controllable dialogue behaviors in spoken language systems.

References

[1] S. J. Adams, J. N. Acosta, and P. Rajpurkar (2025) How generative AI voice agents will transform medicine. npj Digital Medicine 8 (1), p. 353. Cited by: §1.
[2] P. Aggarwal and S. Welleck (2025) L1: controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697. Cited by: §2.2.
[3] A. Amini, A. Banaszak, H. Benoit, A. Bök, T. Dakhran, S. Duong, A. Eng, F. Fernandes, M. Härkönen, A. Harrington, et al. (2025) LFM2 technical report. arXiv preprint arXiv:2511.23404. Cited by: Table 5.
[4] S. Arora, K. Chang, C. Chien, Y. Peng, H. Wu, Y. Adi, E. Dupoux, H. Lee, K. Livescu, and S. Watanabe. On the landscape of spoken language models: a comprehensive survey. Transactions on Machine Learning Research. Cited by: Appendix G, §1, §2.1.
[5] S. Arora, H. Khan, K. Sun, X. L. Dong, S. Choudhary, S. Moon, X. Zhang, A. Sagar, S. T. Appini, K. Patnaik, et al. (2025) Stream RAG: instant and accurate spoken dialogue systems with streaming tool usage. arXiv preprint arXiv:2510.02044. Cited by: Table 5, §2.1.
[6] S. Arora, J. Tian, H. Futami, J. Shi, Y. Kashiwagi, E. Tsunoo, and S. Watanabe (2025) Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems. arXiv preprint arXiv:2510.02066. Cited by: Table 5.
[7] Z. Belligoli, E. Stergiadis, E. Fainman, and I. Gusev (2025) Controlling summarization length through EOS token weighting. arXiv preprint arXiv:2506.05017. Cited by: §2.2.
[8] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, et al. (2023) AudioLM: a language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, pp. 2523–2533. Cited by: 1st item.
[9] B. Butcher, M. O'Keefe, and J. Titchener (2025) Precise length control for large language models. Natural Language Processing Journal 11, p. 100143. Cited by: §2.2.
[10] K. Chang, E. Hu, C. Kuan, W. Ren, W. Chen, G. Lin, Y. Tsao, S. Sun, H. Lee, and J. Glass (2025) Game-Time: evaluating temporal dynamics in spoken language models. arXiv preprint arXiv:2509.26388. Cited by: §2.1.
[11] K. Chang, W. Tseng, S. Li, and H. Lee (2022) SpeechPrompt: an exploration of prompt tuning on generative spoken language model for speech processing tasks. arXiv preprint arXiv:2203.16773. Cited by: footnote 1.
[12] K. Chang, H. Wu, Y. Wang, Y. Wu, H. Shen, W. Tseng, I. Kang, S. Li, and H. Lee (2024) SpeechPrompt: prompting speech language models for speech processing tasks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, pp. 3730–3744. Cited by: footnote 1.
[13] S. Chen, X. Lan, Y. Yuan, Z. Jie, and L. Ma (2024) TimeMarker: a versatile video-LLM for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211. Cited by: §3.1.
[14] W. Chen, Z. Ma, R. Yan, Y. Liang, X. Li, R. Xu, Z. Niu, Y. Zhu, Y. Yang, Z. Liu, et al. (2025) SLAM-Omni: timbre-controllable voice interaction system with single-stage training. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 2262–2282. Cited by: Table 5.
[15] C. Chiang, X. Wang, L. Li, C. Lin, K. Lin, S. Liu, Z. Wang, Z. Yang, H. Lee, and L. Wang (2026) STITCH: simultaneous thinking and talking with chunked reasoning for spoken language models. In The Fourteenth International Conference on Learning Representations. Cited by: Table 5, §2.1.
[16] W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y. Guo, and I. King (2025) Recent advances in speech language models: a survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13943–13970. Cited by: §1, §2.1.
[17] A. de Barcelos Silva, M. M. Gomes, C. A. Da Costa, R. da Rosa Righi, J. L. V. Barbosa, G. Pessin, G. De Doncker, and G. Federizzi (2020) Intelligent personal assistants: a systematic literature review. Expert Systems with Applications 147, p. 113193. Cited by: §1.
[18] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour (2024) Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037. Cited by: 2nd item, Table 5.
[19] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang, et al. (2025) Kimi-Audio technical report. arXiv preprint arXiv:2504.18425. Cited by: Table 5.
[20] Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto (2024) Length-controlled AlpacaEval: a simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475. Cited by: §1.
[21] Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng (2025) LLaMA-Omni: seamless speech interaction with large language models. In The Thirteenth International Conference on Learning Representations. Cited by: Table 5, §4.1, §4.2.
[22] Q. Fang, Y. Zhou, S. Guo, S. Zhang, and Y. Feng (2025) LLaMA-Omni 2: LLM-based real-time spoken chatbot with autoregressive streaming speech synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 18617–18629. Cited by: Table 5.
[23] C. Fu, H. Lin, X. Wang, Y. Zhang, Y. Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Li, et al. (2025) VITA-1.5: towards GPT-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957. Cited by: Table 5.
[24] J. Glass (1999) Challenges for spoken dialogue systems. In Proceedings of the 1999 IEEE ASRU Workshop, Vol. 696. Cited by: §2.1.
[25] Y. Gu, W. Wang, X. Feng, W. Zhong, K. Zhu, L. Huang, T. Liu, B. Qin, and T. Chua (2025) Length controlled generation for black-box LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16878–16895. Cited by: §2.2.
[26] Y. Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu (2025) Recent advances in discrete speech tokens: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: 2nd item, §2.1.
[27] M. Hassid, T. Remez, T. A. Nguyen, I. Gat, A. Conneau, F. Kreuk, J. Copet, A. Defossez, G. Synnaeve, E. Dupoux, et al. (2023) Textually pretrained speech language models. Advances in Neural Information Processing Systems 36, pp. 63483–63501. Cited by: footnote 1.
[28] Z. Hu, L. Song, J. Zhang, Z. Xiao, T. Wang, Z. Chen, N. J. Yuan, J. Lian, K. Ding, and H. Xiong (2025-11) Explaining length bias in LLM-based preference evaluations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 6763–6794. Cited by: §1.
[29] S. Ji, Y. Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Cheng, et al. (2024) WavChat: a survey of spoken dialogue models. arXiv preprint arXiv:2411.13577. Cited by: §1, §2.1.
[30] R. Jie, X. Meng, L. Shang, X. Jiang, and Q. Liu (2023) Prompt-based length controlled generation with reinforcement learning. arXiv preprint arXiv:2308.12030. Cited by: §1, §2.2.
[31] D. H. Klatt (1976) Linguistic uses of segmental duration in English: acoustic and perceptual evidence. The Journal of the Acoustical Society of America 59 (5), pp. 1208–1221. Cited by: §1.
[32] G. Li, T. Xia, Y. Chang, and Y. Wu (2025) Length-controlled margin-based preference optimization without reference model. arXiv preprint arXiv:2502.14643. Cited by: §2.2.
[33] T. Li, J. Liu, T. Zhang, Y. Fang, D. Pan, M. Wang, Z. Liang, Z. Li, M. Lin, G. Dong, et al. (2025) Baichuan-Audio: a unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239. Cited by: Table 5.
[34] G. Lin, S. S. Kuan, J. Shi, K. Chang, S. Arora, S. Watanabe, and H. Lee (2025) Full-duplex-bench-v2: a multi-turn evaluation framework for duplex dialogue systems with an automated examiner. arXiv preprint arXiv:2510.07838. Cited by: §2.1.
[35] G. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H. Lee (2025) Full-duplex-bench: a benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721. Cited by: §2.1.
[36] B. Lindblom (1990) Explaining phonetic variation: a sketch of the H&H theory. In Speech Production and Speech Modelling, pp. 403–439. Cited by: §1.
[37] C. Liu, Y. Zhao, L. Liu, Y. Ye, C. Szepesvári, and L. F. Yang (2026) LACONIC: length-aware constrained reinforcement learning for LLM. arXiv preprint arXiv:2602.14468. Cited by: §2.2.
[38] M. Liu, S. Diao, X. Lu, J. Hu, X. Dong, Y. Choi, J. Kautz, and Y. Dong (2025) ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864. Cited by: §2.2.
[39] W. Liu, Y. Bai, C. Han, R. Weng, J. Xu, X. Cao, J. Wang, and X. Cai (2024) Length desensitization in direct preference optimization. arXiv preprint arXiv:2409.06411. Cited by: §2.2.
[40] J. Louradour (2023) Whisper-timestamped. GitHub. Note: https://github.com/linto-ai/whisper-timestamped. Cited by: §4.2.
[41] K. Lu, Z. Chen, S. Fu, C. H. Yang, S. Huang, C. Yang, C. Yu, C. Chen, W. Chen, C. Huang, et al. (2025) DeSTA2.5-Audio: toward general-purpose large audio language model with self-generated cross-modal alignment. arXiv preprint arXiv:2507.02768. Cited by: §3.1.
[42] P. Mousavi, G. Maimon, A. Moumen, D. Petermann, J. Shi, H. Wu, H. Yang, A. Kuznetsova, A. Ploujnikov, R. Marxer, et al. (2025) Discrete audio tokens: more than a survey!. arXiv preprint arXiv:2506.10274. Cited by: 2nd item.
[43] T. A. Nguyen, E. Kharitonov, J. Copet, Y. Adi, W. Hsu, A. Elkahky, P. Tomasello, R. Algayres, B. Sagot, A. Mohamed, et al. (2023) Generative spoken dialogue language modeling. Transactions of the Association for Computational Linguistics 11, pp. 250–266. Cited by: Table 5.
[44] T. A. Nguyen, B. Muller, B. Yu, M. R. Costa-Jussa, M. Elbayad, S. Popuri, C. Ropers, P. Duquenne, R. Algayres, R. Mavlyutov, et al. (2025) SpiRit-LM: interleaved spoken and written language model. Transactions of the Association for Computational Linguistics 13, pp. 30–52. Cited by: Table 5.
[45] OpenAI (2025-12) Update to GPT-5 system card: GPT-5.2. Technical report, OpenAI. Cited by: §4.1.
[46] OpenBMB (2026) MiniCPM-o: a Gemini 2.5 Flash level MLLM for vision, speech, and full-duplex multimodal live streaming on your phone. GitHub repository. Note: https://github.com/OpenBMB/MiniCPM-o. Cited by: Table 5.
[47] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W. Hsu, A. Mohamed, and E. Dupoux (2021) Speech resynthesis from discrete disentangled self-supervised representations. In Proc. Interspeech 2021, pp. 3615–3619. Cited by: footnote 1.
[48] S. Popuri, P. Chen, C. Wang, J. Pino, Y. Adi, J. Gu, W. Hsu, and A. Lee (2022) Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. Cited by: footnote 1.
[49] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2021) FastSpeech 2: fast and high-quality end-to-end text to speech. In International Conference on Learning Representations. Cited by: §2.1.
[50] F. Retkowski and A. Waibel (2025) Zero-shot strategies for length-controllable summarization. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 551–572. Cited by: §2.2.
[51] R. Roy, J. Raiman, S. Lee, T. Ene, R. Kirby, S. Kim, J. Kim, and B. Catanzaro (2026) PersonaPlex: voice and role control for full duplex conversational speech models. arXiv preprint arXiv:2602.06053. Cited by: Table 5.
[52] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1, §3.2, footnote 1.
[53] Y. Shih, D. Raj, C. Wu, W. Zhou, S. Bong, Y. Gaur, J. Mahadeokar, O. Kalinli, and M. Seltzer (2025) Can speech LLMs think while listening?. arXiv preprint arXiv:2510.07497. Cited by: Table 5.
[54] S. Song, J. Lee, and H. Ko (2025) Hansel: output length controlling framework for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25146–25154. Cited by: §1, §2.2.
[55] A. K. Sridhar, Y. Guo, and E. Visser (2025) Enhancing temporal understanding in audio question answering for large audio language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pp. 1026–1035. Cited by: §2.1.
[56] Qwen Team (2024-09) Qwen2.5: a party of foundation models. Cited by: §4.1.
[57] B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota (2024) Beyond turn-based interfaces: synchronous LLMs as full-duplex dialogue agents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 21390–21402. Cited by: Table 5.
[58] H. Wang, Y. Li, S. Ma, H. Liu, and X. Wang (2025) Listening between the frames: bridging temporal gaps in large audio-language models. arXiv preprint arXiv:2511.11039. Cited by: §2.1.
[59] N. Wang, F. Duan, Y. Zhang, W. Zhou, K. Xu, W. Huang, and J. Fu (2024) PositionID: LLMs can control lengths, copy and paste with explicit positional awareness. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 16877–16915. Cited by: §2.2.
[60] X. Wang, Y. Li, C. Fu, Y. Zhang, Y. Shen, L. Xie, K. Li, X. Sun, and L. Ma (2025) Freeze-Omni: a smart and low latency speech-to-speech dialogue model with frozen LLM. In International Conference on Machine Learning, pp. 63345–63354. Cited by: Table 5.
[61] B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li, et al. (2025) Step-Audio 2 technical report. arXiv preprint arXiv:2507.16632. Cited by: Table 5.
[62] H. Wu, H. Chung, Y. Lin, Y. Wu, X. Chen, Y. Pai, H. Wang, K. Chang, A. Liu, and H. Lee (2024) Codec-SUPERB: an in-depth analysis of sound codec models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10330–10348. Cited by: 2nd item, §2.1.
[63] J. Xie and H. Lee (2025) Prompt-based one-shot exact length-controlled generation with LLMs. arXiv preprint arXiv:2508.13805. Cited by: §1.
[64] T. Xie, Y. Rong, P. Zhang, W. Wang, and L. Liu (2025) Towards controllable speech synthesis in the era of large language models: a systematic survey. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 764–791. Cited by: §2.1.
[65] Z. Xie and C. Wu (2024) Mini-Omni2: towards open-source GPT-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190. Cited by: Table 5.
[66] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin (2025) Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215. Cited by: Table 5, §2.1, §4.2, footnote 2.
[67] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, et al. (2025) Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765. Cited by: Table 5, §2.1, footnote 2.
[68] R. Yan, X. Li, W. Chen, Z. Niu, C. Yang, Z. Ma, K. Yu, and X. Chen (2025-11) URO-Bench: towards comprehensive evaluation for end-to-end spoken dialogue models. In Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 17211–17242. ISBN 979-8-89176-335-7. Cited by: §4.1.
[69] S. Yang, M. Tu, A. T. Liu, X. Qu, H. Lee, L. Lu, Y. Wang, and Y. Wu (2025) ParaS2S: benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction. arXiv preprint arXiv:2511.08723. Cited by: §2.1.
[70] A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y. Dong, and J. Tang (2024) GLM-4-Voice: towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612. Cited by: Table 5.
[71] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu (2023) SpeechGPT: empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 15757–15773. Cited by: Table 5.
[72] D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuang, et al. (2025) MiMo-Audio: audio language models are few-shot learners. arXiv preprint arXiv:2512.23808. Cited by: Table 5.
[73] W. Zhang, Z. Zhou, K. Wang, J. Fang, Y. Zhang, R. Wang, G. Zhang, X. Li, L. Sun, L. Lyu, et al. (2025) LIFEBench: evaluating length instruction following in large language models. arXiv preprint arXiv:2505.16234. Cited by: §1, §4.1.
[74] W. Zhang, Y. Xie, Y. Sun, Y. Chen, G. Wang, Y. Li, B. Ding, and J. Zhou (2026) On-policy RL meets off-policy experts: harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. arXiv preprint arXiv:2508.11408. Cited by: Appendix B, §3.2.
[75] X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu. SpeechTokenizer: unified speech tokenizer for speech language models. In The Twelfth International Conference on Learning Representations. Cited by: 2nd item.
[76] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang, W. Zhou, and Y. Chen (2024) SWIFT: a scalable lightweight infrastructure for fine-tuning. arXiv preprint arXiv:2408.05517. Cited by: §4.2.
[77] S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu (2026) IndexTTS2: a breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 35139–35148. Cited by: §4.1.
[78] M. Züfle, O. Klejch, N. Sanders, J. Niehues, A. Birch, and T. K. Lam (2026) F-actor: controllable conversational behaviour in full-duplex models. arXiv preprint arXiv:2601.11329. Cited by: §2.1.

Appendix A Author Contributions

All authors contributed significantly to the design of the method, benchmark construction, evaluation, and the writing and refinement of the paper. While all authors were involved in multiple aspects of the project, their primary contributions are summarized below:

Kai-Wei Chang.
Proposed the initial idea of time-controllable SDMs and the TiCo training framework (SFT + RL), and led the main writing of the paper.

Wei-Chih Chen. Proposed the design of TiCo-Bench, constructed the dataset and evaluation protocols, and conducted extensive evaluations of various models on the benchmark.

En-Pei Hu. Developed the cascaded methods and evaluation protocols in TiCo-Bench, and explored reinforcement learning approaches in TiCo.

Hung-yi Lee and James Glass. Provided overall guidance on the project, contributed deep technical expertise, helped shape the research direction, and offered critical feedback on the experimental methodology.

Appendix B Training Details

Stage 1: Time-Awareness SFT. We fine-tune Qwen2.5-Omni-7B with LoRA (r=8, α=16) on all linear layers, keeping the vision encoder frozen. The training set consists of 4,000 samples (400 held out for validation). We train for 5 epochs with a batch size of 2 per GPU × 4 GPUs and gradient accumulation of 4 steps (effective batch size 32), using a cosine learning rate schedule with peak 5×10^-5 and 10% warmup. Maximum sequence length is 1,024 tokens. Training uses bfloat16 precision with gradient checkpointing.

Stage 2: Time-Controllable GRPO with CHORD. Starting from the Stage 1 checkpoint, we apply GRPO with CHORD [74] to optimize duration controllability. The LoRA configuration uses r=8, α=32. We train for 3 epochs with a per-GPU batch size of 1 and gradient accumulation of 8 (effective batch size 32), and a learning rate of 5×10^-6 with a cosine schedule and 10% warmup. Each prompt generates G=4 candidate completions with a maximum completion length of 512 tokens. The clipping parameter is ε=0.2 and the KL penalty coefficient is β=0.04.

Reward Design.
The main reward function is

$R_{\mathrm{main}}^{(g)} = F\big(t_{\mathrm{inst}} - t_{\mathrm{last}}^{(g)}\big)$, (15)

where $t_{\mathrm{inst}}$ is the target duration specified in the instruction and $t_{\mathrm{last}}^{(g)}$ is the duration indicated by the final generated time marker in the sampled response $z^{(g)}$. The function $F$ is defined as a Gaussian:

$F(\Delta t) = \exp\!\big(-(\Delta t)^2 / (2\sigma^2)\big)$, (16)

where $\sigma$ controls the tolerance to duration deviations. In our experiments, we set $\sigma = 5$.

We further incorporate auxiliary reward functions to prevent reward hacking:

- Presence reward $R_{\mathrm{pres}}^{(g)}$: encourages the model to generate at least one Spoken Time Marker,

$R_{\mathrm{pres}}^{(g)} = \mathbb{1}[M_g \geq 1]$, (17)

where $M_g$ denotes the number of time markers in $z^{(g)}$.

- Monotonicity reward $R_{\mathrm{mono}}^{(g)}$: encourages generated time markers to be strictly increasing. We compute the fraction of consecutive pairs that are strictly increasing:

$R_{\mathrm{mono}}^{(g)} = \frac{1}{M_g - 1} \sum_{j=1}^{M_g - 1} \mathbb{1}\big[t_{j+1}^{(g)} > t_j^{(g)}\big]$. (18)

- Repetition penalty $R_{\mathrm{rep}}^{(g)}$: penalizes repeated time marker values:

$R_{\mathrm{rep}}^{(g)} = -\Big(1 - \frac{|\{t_1^{(g)}, \ldots, t_{M_g}^{(g)}\}|}{M_g}\Big)$, (19)

where $|\cdot|$ denotes set cardinality. The penalty is 0 when all markers are unique and $-1$ when all are identical.

- Copy penalty $R_{\mathrm{copy}}^{(g)}$: penalizes non-final time markers that trivially copy the instructed duration $t_{\mathrm{inst}}$:

$R_{\mathrm{copy}}^{(g)} = -\frac{1}{M_g} \sum_{j=1}^{M_g - 1} \mathbb{1}\big[|t_j^{(g)} - t_{\mathrm{inst}}| < \tau\big]$, (20)

where $\tau = 0.5$ s is the tolerance threshold. The final marker $t_{M_g}^{(g)}$ is excluded since matching the target duration at the end is the desired behavior.

The overall reward for the $g$-th sample is

$R^{(g)} = R_{\mathrm{main}}^{(g)} + R_{\mathrm{pres}}^{(g)} + R_{\mathrm{mono}}^{(g)} + R_{\mathrm{rep}}^{(g)} + R_{\mathrm{copy}}^{(g)}$. (21)

Note that $R_{\mathrm{rep}}^{(g)}$ and $R_{\mathrm{copy}}^{(g)}$ are non-positive by construction, so no explicit subtraction is needed.
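Read literally, Eqs. (15)–(21) amount to a few lines of Python. The sketch below is our reconstruction, not released code; the handling of the degenerate cases (no markers, a single marker) is our assumption, since the text leaves them unspecified:

```python
import math

SIGMA = 5.0  # Gaussian tolerance sigma in Eq. (16)
TAU = 0.5    # copy-penalty threshold tau (seconds) in Eq. (20)

def total_reward(markers, t_inst):
    """Sum of the reward terms in Eqs. (15)-(20) for one rollout's parsed markers."""
    if not markers:                 # assumption: no markers -> zero total reward
        return 0.0
    m = len(markers)
    r_main = math.exp(-((t_inst - markers[-1]) ** 2) / (2 * SIGMA**2))  # Eqs. (15)-(16)
    r_pres = 1.0                                                        # Eq. (17): at least one marker
    r_mono = (sum(markers[j + 1] > markers[j] for j in range(m - 1)) / (m - 1)
              if m > 1 else 1.0)    # Eq. (18); a single marker is treated as monotone
    r_rep = -(1.0 - len(set(markers)) / m)                              # Eq. (19)
    r_copy = -sum(abs(t - t_inst) < TAU for t in markers[:-1]) / m      # Eq. (20)
    return r_main + r_pres + r_mono + r_rep + r_copy                    # Eq. (21)
```

A well-formed rollout whose last marker hits the target (e.g. the 15-second example in Section 5.4) earns the full main, presence, and monotonicity rewards with zero penalties, whereas a rollout that simply repeats the target value is pulled down by the repetition and copy terms.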
CHORD interleaves SFT updates with GRPO updates using a mixing coefficient μ that decays from μ_peak = 0.8 to μ_valley = 0.3 over 500 steps, preventing catastrophic forgetting of general conversational ability. Both stages are trained on 4 NVIDIA A6000 GPUs, and the entire two-stage pipeline completes in less than one day.

Appendix C Cascaded System Prompt Templates

We use a unified system prompt across GPT and Qwen for the cascaded LLM baseline.

LLM System Prompt for Cascaded System:
You are writing a final script for text-to-speech (TTS). Your response will be synthesized directly into speech. Follow the duration instruction as strictly as possible. Output only the final spoken text, with natural punctuation. Do not output markdown, bullets, JSON, XML tags, stage directions, or extra commentary. Do not mention these instructions.

Appendix D Generalization Study of Textual Queries

During training, in both the first and second stages, the model always receives speech queries as input; that is, all questions are presented in spoken form. Here, we evaluate whether the trained model can generalize to queries provided in textual form. The results show that the model maintains strong performance even when the input queries are given as text.

Figure 6: Text benchmarks: duration error of Qwen2.5-Omni-7B vs. TiCo measured against instructed duration. From left to right: LIFEBench MAE (s), LIFEBench MAPE (%), UROBench MAE (s), UROBench MAPE (%). Shaded regions indicate ±1 SEM.

Figure 7: Text benchmarks (TiCo): duration error measured against instructed duration vs. last time marker. From left to right: LIFEBench MAE (s), LIFEBench MAPE (%), UROBench MAE (s), UROBench MAPE (%). Shaded regions indicate ±1 SEM.

Appendix E Detailed GPT Scores on TiCo-Bench Subsets

Table 3: TiCo-Bench (Speech) evaluation of time-controlled instruction following on speech-query tasks. GPT scores are reported on a scale of 1 to 5. Higher is better.
| Model | InstructS2S (Short) | UROBench (Short) | InstructS2S (Long) | UROBench (Long) | Overall |
|---|---|---|---|---|---|
| Cascade (GPT) | 4.09 | 4.18 | 4.14 | 4.18 | 4.15 |
| Cascade (Qwen) | 3.41 | 3.35 | 3.36 | 3.43 | 3.39 |
| GPT-audio | 4.07 | 3.96 | 4.08 | 3.42 | 3.88 |
| Kimi Audio | 1.99 | 1.48 | 1.79 | 1.41 | 1.67 |
| LFM Audio | 3.17 | 2.28 | 2.92 | 2.15 | 2.63 |
| Mimo-Audio | 3.22 | 3.44 | 3.16 | 3.45 | 3.32 |
| Qwen2.5-Omni-7B | 3.64 | 3.63 | 3.44 | 3.56 | 3.57 |
| TiCo (Proposed) | 3.39 | 3.59 | 3.44 | 3.83 | 3.56 |

Table 4: TiCo-Bench (Text) evaluation of time-controlled instruction following on text-query tasks. GPT scores are reported on a scale of 1 to 5. Higher is better.

| Model | LIFEBench (Short) | UROBench-text (Short) | LIFEBench (Long) | UROBench-text (Long) | Overall |
|---|---|---|---|---|---|
| Cascade (GPT) | 2.86 | 4.18 | 3.10 | 4.18 | 3.58 |
| Cascade (Qwen) | 2.32 | 3.35 | 2.34 | 3.43 | 2.86 |
| GPT-audio | 2.55 | 3.23 | 2.28 | 3.05 | 2.78 |
| LFM Audio | 1.93 | 2.16 | 1.62 | 2.08 | 1.95 |
| Mimo-Audio | 1.42 | 1.53 | 1.24 | 1.38 | 1.39 |
| Qwen2.5-Omni-7B | 2.30 | 3.32 | 2.10 | 2.94 | 2.67 |
| TiCo (Proposed) | 2.23 | 3.33 | 2.23 | 3.26 | 2.76 |

Appendix F Qualitative Examples

F.1 How deep is the ocean? (Speech query)

How deep is the ocean?, target duration = 15 s
Prompt: How deep is the ocean? Please respond with exactly 15 seconds.
Speech response: Well, <0.9 seconds> you know, <1.6 seconds> the ocean is really deep in some places. <3.8 seconds> The Mariana Trench is the deepest part, <7.2 seconds> it's about 36 kilometers deep. <9.4 seconds> But overall, <11.0 seconds> the average depth of the world's oceans is around 3.7 kilometers. <15.0 seconds>
Target duration: 15.0 seconds
Actual duration: 15.2 seconds

How deep is the ocean?, target duration = 25 s
Prompt: How deep is the ocean? Please respond with exactly 25 seconds.
Speech response: Well, <1.0 seconds> you know, <1.9 seconds> the ocean is really deep in some places. <4.3 seconds> The Mariana Trench is the deepest part, <7.6 seconds> it's about 36,000 feet or so deep. <10.8 seconds> But overall, <12.6 seconds> the average depth of the world's oceans is around 13,000 feet.
<16.4 seconds> It's amazing how vast and deep they are, <19.2 seconds> right? <20.0 seconds> If you want to know more about the ocean's depths or anything else related, <25.0 seconds>
Target duration: 25.0 seconds
Actual duration: 25.0 seconds

How deep is the ocean?, target duration = 40 s
Prompt: How deep is the ocean? Please respond with exactly 40 seconds.
Speech response: Well, <1.5 seconds> you know, <2.3 seconds> the ocean is really deep in some places. <4.6 seconds> The Mariana Trench is one of the deepest parts. <7.9 seconds> It's about 36 kilometers deep. <10.8 seconds> But overall, <12.5 seconds> if we're talking about an average depth for the whole ocean, <15.5 seconds> it's around 3.7 kilometers. <18.0 seconds> However, <19.4 seconds> different areas can have very different depths. <22.3 seconds> Some shallow bays might be just a few meters deep while others in the middle of the ocean could be much deeper than that. <29.0 seconds> So, <30.0 seconds> it's not really a fixed number like 3.7 kilometers everywhere. <33.0 seconds> If you want to know more about why the ocean has such varying depths or anything else related, <39.0 seconds> feel free to ask. <40.0 seconds>
Target duration: 40.0 seconds
Actual duration: 41.6 seconds

F.2 Discuss an event from history (Speech query)

History event, target duration = 10 s
Prompt: Discuss an event from history. Please respond with exactly 10 seconds.
Speech response: Well, <1.5 seconds> let's talk about the fall of the Berlin Wall in 1989. <4.2 seconds> It was a huge event that symbolized the end of the Cold War. <7.3 seconds> It brought East and West Germany together again. <10.0 seconds>
Target duration: 10.0 seconds
Actual duration: 12.0 seconds

History event, target duration = 20 s
Prompt: Discuss an event from history. Please respond with exactly 20 seconds.
Speech response: Well, <1.5 seconds> let's talk about the signing of the Magna Carta in 1215.
<4.8 seconds> It was really important because it limited the power of the king and protected the rights of the nobles. <9.6 seconds> It also had an impact on later democratic ideas. <13.7 seconds> You know, <15.0 seconds> it set a precedent for the rule of law. <18.0 seconds> So what do you think about this event? <20.0 seconds>
Target duration: 20.0 seconds
Actual duration: 19.2 seconds

History event, target duration = 30 s
Prompt: Please respond with exactly 30 seconds.
Speech response: Well, <1.5 seconds> let's talk about the signing of the Magna Carta in 1215. <4.8 seconds> It was really important because it limited the power of the king to some extent. <9.6 seconds> You know, <10.7 seconds> before that kings could pretty much do whatever they wanted with their subjects' rights. <14.9 seconds> But this document said things like no one can be imprisoned without proper trial. <19.6 seconds> It set a precedent for later legal systems all over the world. <24.0 seconds> And it kind of started the idea of due process. <28.0 seconds> So what do you think about this event? <30.0 seconds>
Target duration: 30.0 seconds
Actual duration: 31.1 seconds

F.3 What is quantum mechanics? (Text query)

Quantum mechanics, target duration = 25 s
Prompt: What is quantum mechanics? Please respond with exactly 25 seconds.
Speech response: Well, <1.0 seconds> quantum mechanics is this really strange science that deals with really small things like atoms and subatomic particles. <7.0 seconds> It has some wacky ideas, <9.0 seconds> you know, <10.0 seconds> like particles can be in multiple places at once or they can be entangled so that if one changes state the other does too instantaneously no matter how far apart. <23.0 seconds> It's super important for understanding how the world works on a very fundamental level. <25.0 seconds>
Target duration: 25.0 seconds
Actual duration: 29.1 seconds

Quantum mechanics, target duration = 50 s
Prompt: What is quantum mechanics?
Please respond with exactly 50 seconds.
Speech response: Well, <1.9 seconds> quantum mechanics is really something special. <4.3 seconds> It's a branch of physics that deals with the behavior of really small things like atoms and subatomic particles. <12.6 seconds> You know, <14.0 seconds> it has some really strange ideas. <16.0 seconds> For example, <17.8 seconds> particles can be in multiple places at once until you measure them. <21.8 seconds> And there are these things called wave-particle duality where particles act both like waves and particles. <29.2 seconds> Also, <30.6 seconds> the Heisenberg uncertainty principle says that we can't precisely know certain pairs of properties at the same time. <37.8 seconds> It's all very counterintuitive compared to what we see in our everyday world. <43.0 seconds> If you want to learn more about specific aspects or have any other questions, <48.0 seconds>
Target duration: 50.0 seconds
Actual duration: 44.0 seconds

F.4 Why is Mars considered a candidate for human colonization? (Text query)

Mars colonization, target duration = 10 s
Prompt: Why is Mars considered a candidate for human colonization? Please respond with exactly 10 seconds.
Speech response: Well, <1.5 seconds> Mars has resources like water ice, <3.8 seconds> which could be used for life support. <5.4 seconds> It also has an atmosphere that might be modified to make it more habitable. <9.2 seconds> So yeah, <10.0 seconds> it's a good candidate.
Target duration: 10.0 seconds
Actual duration: 12.0 seconds

Mars colonization, target duration = 30 s
Prompt: Why is Mars considered a candidate for human colonization? Please respond with exactly 30 seconds.
Speech response: Well, <1.5 seconds> Mars has some things going for it. <4.2 seconds> It's closer to Earth than other potential places like Jupiter's moons. <8.6 seconds> There might be water ice on its surface, <11.7 seconds> which could be used for life support systems.
<14.9 seconds> Also, <16.3 seconds> its gravity is lower compared to Earth's, <19.0 seconds> making it easier to build structures there. <21.7 seconds> And it has an atmosphere that can potentially be modified over time. <25.5 seconds> So all these factors make Mars a good candidate for human colonization. <30.0 seconds>
Target duration: 30.0 seconds
Actual duration: 27.9 seconds

Mars colonization, target duration = 60 s
Prompt: Why is Mars considered a candidate for human colonization? Please respond with exactly 60 seconds.
Speech response: Well, <1.5 seconds> Mars is considered a candidate for human colonization for several reasons. <4.9 seconds> First off, <6.2 seconds> it has some resources like water ice at the poles which could be used to make oxygen and hydrogen for fuel. <13.7 seconds> Also, <15.0 seconds> there might be minerals on Mars that we can mine in the future. <18.5 seconds> The gravity on Mars is about one-third of Earth's, <22.2 seconds> so it would be easier for humans to move around compared to being in zero gravity. <27.0 seconds> And scientists think that over time, <30.0 seconds> we could potentially modify the Martian environment to be more habitable. <34.0 seconds> For example, <35.6 seconds> building habitats that can protect against radiation and extreme temperatures. <40.0 seconds> There is also the possibility of finding signs of past life on Mars, <44.0 seconds> which makes it even more interesting. <46.0 seconds> If you want to know more about this or have other questions related to space exploration, <50.0 seconds> feel free to ask. <51.0 seconds>
Target duration: 60.0 seconds
Actual duration: 57.4 seconds

Appendix G Spoken Dialogue Model Survey

Table 5: Spoken dialogue models (SDMs) with speech input and speech output, ordered by their first public release time. Date: first public release date. IR: intermediate representation used in the SDM. Speech Rep.: speech representation (prediction target) used by the model.
Pattern: the pattern in which the intermediate and speech representations are generated.

| Model | Date | IR | Speech Rep. | Pattern | Notes |
|---|---|---|---|---|---|
| dGSLM [43] | 2022-03 | - | Phonetic token | Direct | "Dual-tower" architecture for dual-channel full-duplex modeling. Direct modeling of two-channel phonetic tokens. |
| SpeechGPT [71] | 2023-05 | Text (+R.) | Phonetic token | Sequential | Uses "Chain-of-Modality". Expands the LLM vocabulary with phonetic tokens. |
| SPIRIT-LM [44] | 2024-02 | Text (+S.) | Phonetic token | Interleaved | Interleaves text and speech in one stream; the expressive version adds pitch and style tokens. |
| Moshi [18] | 2024-07 | Text | Acoustic token | Parallel | Dual-channel full-duplex model. Parallel decoding of text and acoustic tokens with a delay pattern. |
| LLaMA-Omni [21] | 2024-09 | Text | Phonetic token | Parallel | CTC speech decoder maps LLM response states to phonetic tokens for streaming speech synthesis. |
| SyncLLM [57] | 2024-09 | - | Phonetic token | Direct | Interleaves user speech and model speech for full-duplex dialogue. |
| Mini-Omni2 [65] | 2024-10 | Text | Acoustic token | Parallel | Parallel decoding with a delay pattern. |
| Freeze-Omni [60] | 2024-11 | Text | Acoustic token | Sequential | TDM-based full-duplex interaction; speech decoder conditioned on text tokens and LLM hidden states. |
| GLM-4-Voice [70] | 2024-12 | Text | Phonetic token | Interleaved | Single speech codebook paired with a flow-matching speech decoder. |
| SLAM-Omni [14] | 2024-12 | Text | Phonetic token | Parallel | "Semantic group modeling" enables generating multiple phonetic tokens per text token. |
| VITA-1.5 [23] | 2025-01 | Text | Acoustic token | Sequential | NAR + AR speech decoder taking LLM embeddings as input. |
| Baichuan-Audio [33] | 2025-02 | Text | Acoustic token | Interleaved | Text-guided speech generation with an independent audio head. |
| Qwen2.5-Omni [66] | 2025-03 | Text | Acoustic token | Sequential | Thinker-Talker architecture. Supports visual modality. |
| Kimi-Audio [19] | 2025-04 | Text | Phonetic token | Parallel | Shared LLM with text head and audio head. |
| LLaMA-Omni 2 [22] | 2025-05 | Text | Phonetic token | Parallel | Gate fusion of LLM hidden states and text tokens for improved speech quality. |
| Step-Audio 2 [61] | 2025-07 | Text (+R., Tool) | Phonetic token | Interleaved | Uses "multi-modal RAG" to support grounded responses and timbre/style control. |
| STITCH [15] | 2025-07 | Text (+R.) | Phonetic token | Interleaved | Backbone: GLM-4-Voice. Various interleaving patterns of text, reasoning, and speech are discussed in the paper. |
| Qwen3-Omni [67] | 2025-09 | Text (+R.) | Acoustic token | Sequential | Thinker-Talker architecture. Supports an explicit thinking mode and is compatible with tool calling. Supports visual modality. |
| Moshi-CoT [53] | 2025-10 | Text (+R.) | Acoustic token | Parallel | CoT-tuned Moshi performs text reasoning in the "text monologue" stream to enable the "thinking while listening" paradigm. |
| SCoT [6] | 2025-10 | Text | Acoustic token | Interleaved | CoT framework for SDMs. Blockwise streaming full-duplex model. |
| Streaming RAG [5] | 2025-10 | Text (+Tool) | Acoustic token | Sequential | Enables the SDM to trigger tool queries in parallel with the user's speech. |
| LFM2-Audio [3] | 2025-11 | Text | Acoustic token | Interleaved | Supports both interleaved and sequential patterns, adapting to different tasks. |
| Mimo-Audio [72] | 2025-12 | Text (+R.) | Acoustic token | Interleaved | Interleaves text tokens and "audio patches", which include a delay pattern. |
| PersonaPlex [51] | 2026-01 | Text | Acoustic token | Parallel | Follows the Moshi architecture. Dual-channel full-duplex model. |
| MiniCPM-o 4.5 [46] | 2026-02 | Text (+R.) | Acoustic token | Interleaved | TDM full-duplex model. Supports visual modality. |

Figure 8: Illustration of different generation patterns in spoken dialogue models (SDMs): (a) Sequential, (b) Interleaved, and (c) Parallel.
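The three patterns in Figure 8 can be sketched as toy token-stream layouts. Everything in this sketch is an assumption for illustration only: the token names (`t0`, `s0`, ...), the 2:1 speech-to-text ratio, and the `<pad>` filler are invented here, while real models operate on discrete phonetic or acoustic token IDs with model-specific alignment and delay schemes.

```python
# Toy illustration of the three SDM generation patterns (Figure 8).
# "t*" stand in for text (intermediate-representation) tokens and "s*"
# for speech tokens; the 2:1 ratio is an arbitrary choice for this sketch.

def sequential(text, speech):
    """Text is generated first, then all speech tokens (chunking omitted)."""
    return text + speech

def interleaved(text, speech, ratio=2):
    """One stream: each text token is followed by `ratio` speech tokens."""
    out = []
    for i, t in enumerate(text):
        out.append(t)
        out.extend(speech[i * ratio:(i + 1) * ratio])
    return out

def parallel(text, speech, ratio=2):
    """Two aligned streams decoded simultaneously, shown here as
    (text_or_pad, speech) pairs per frame, with the text stream padded."""
    frames = []
    for i, s in enumerate(speech):
        has_text = i % ratio == 0 and i // ratio < len(text)
        frames.append((text[i // ratio] if has_text else "<pad>", s))
    return frames

text = ["t0", "t1", "t2"]
speech = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(sequential(text, speech))
print(interleaved(text, speech))  # ['t0','s0','s1','t1','s2','s3','t2','s4','s5']
print(parallel(text, speech))
```

The interleaved layout makes explicit why speech tokens can condition directly on the immediately preceding text, while the parallel layout shows the frame-level alignment that delay patterns refine.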
Table 5 surveys representative spoken dialogue models (SDMs), including their intermediate representations (IR), target speech representations (Speech Rep.), and the generation patterns (Pattern) describing how intermediate and speech representations are processed during speech response generation. Readers may refer to the spoken language model (SLM) survey paper [4] for a more detailed discussion of speech representations and generation patterns.

Intermediate Representation (IR). With the emergence of text-based large language models (LLMs) demonstrating strong reasoning capabilities, modern spoken dialogue models (SDMs) increasingly adopt LLMs to generate speech responses, using text as an intermediate representation for semantic planning. A text-based IR is highly versatile and can serve multiple purposes, as summarized in Table 5, including style control (+S.), reasoning (+R.), tool calling (+Tool), and direct guidance of the target speech content.

Pattern. The intermediate representation and the target speech representations can be generated under several design patterns, each leading to different trade-offs in efficiency, latency, and the degree to which speech generation is conditioned on the intermediate representation. For simplicity, we assume text as the intermediate representation in the following discussion and provide an illustration in Figure 8.

• Sequential: Text is generated first, followed by speech tokens. Chunking strategies can be incorporated to support streaming generation.

• Parallel: Text and speech tokens are generated simultaneously. In this setting, the hidden representations of a text LLM are typically used to predict text tokens and speech tokens through separate prediction networks. Frame-level operations can further be introduced to realize delay patterns.
• Interleaved: Text and speech tokens are arranged in a single interleaved sequence, typically modeled by a single LLM, allowing speech representations to be conditioned more directly on text representations.

Speech Representations. The goal of a spoken dialogue model is to generate an appropriate spoken response, typically represented as a sequence of speech tokens. These tokens can then be synthesized into waveforms using a pre-trained vocoder or an audio codec decoder.

• Phonetic tokens. Phonetic tokens are obtained by quantizing speech encoder representations (e.g., via k-means), such as those extracted from self-supervised speech models (e.g., HuBERT) or foundation ASR models (e.g., Whisper encoders). They primarily capture phonetic and linguistic content, while containing relatively little acoustic information such as speaker identity or environmental characteristics. In prior work they are also referred to as semantic tokens [8]. When phonetic tokens are used as the target speech representation, an additional vocoder (e.g., HiFi-GAN or a flow-matching decoder) is typically required to supply speaker identity and speaking style, as these attributes are not explicitly encoded.

• Acoustic tokens. Acoustic tokens [42, 26, 62] are derived from neural speech codec models trained with reconstruction objectives. These models typically employ multiple hierarchical codebooks based on residual vector quantization (RVQ). When acoustic tokens are generated, a pre-trained audio codec decoder can be used directly for waveform synthesis. Recently, there has been a growing trend toward distilling phonetic information into the early layers of acoustic tokens, aiming to preserve phonetic structure while maintaining rich acoustic detail [75, 18].
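The k-means quantization step behind phonetic tokens can be illustrated with a minimal sketch: frame-level encoder features are assigned to their nearest centroid, and the centroid index becomes the discrete token. The feature dimension, codebook size K = 8, and the random stand-in data are all invented for this example; a real pipeline would use e.g. HuBERT features and centroids fitted on a large speech corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((50, 16))   # 50 frames of 16-dim encoder features (stand-in)
centroids = rng.standard_normal((8, 16))   # K = 8 "pretrained" k-means centroids (stand-in)

# Nearest-centroid assignment under Euclidean distance:
# the winning centroid index is the phonetic token for that frame.
dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
tokens = dists.argmin(axis=1)              # shape (50,), integer IDs in [0, 7]

print(tokens[:10])
```

Because the centroids discard most speaker and channel detail, a vocoder conditioned on a reference voice is what reintroduces those attributes at synthesis time, as noted above.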