Paper deep dive
AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow
Duojia Li, Shuhan Zhang, Zihan Qian, Wenxuan Wu, Shuai Wang, Qingyang Hong, Lin Li, Haizhou Li
Abstract
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
Links
- Source: https://arxiv.org/abs/2603.10701v1
- Canonical: https://arxiv.org/abs/2603.10701v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow

Duojia Li 1,4, Shuhan Zhang 4,5, Zihan Qian 3, Wenxuan Wu 6, Shuai Wang 3,4,∗, Qingyang Hong 2,∗, Lin Li 1,∗, Haizhou Li 4,5,∗

1 School of Electronic Science and Engineering, Xiamen University, Xiamen, China
2 School of Informatics, Xiamen University, Xiamen, China
3 School of Intelligence Science and Technology, Nanjing University, Suzhou, China
4 Shenzhen Loop Area Institute, Shenzhen, China
5 School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China
6 The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China

liduojia@stu.xmu.edu.cn, shuaiwang@nju.edu.cn
∗ indicates the corresponding author.

Abstract

In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).

Index Terms: Target speaker extraction; one-step generation; flow matching; AlphaFlow

1. Introduction

In multi-talker recordings such as online meetings, hands-free calls, and far-field conversations, the signal of interest is often a single speaker, while other speakers and background sounds act as interference. A practical personalized front-end should therefore extract a user-specified speaker reliably under realistic acoustic conditions, and do so efficiently enough for interactive use [1, 2]. Target speaker extraction (TSE) addresses this need by recovering target speech from a mixture using auxiliary information that identifies the desired speaker. In this paper, we focus on the single-channel audio-enrollment setting, where a short target-only enrollment utterance provides the speaker cue and is typically selected from a non-overlapping region [3, 4, 5].

Most existing TSE systems are discriminative: a neural network predicts a mask or waveform estimate from the mixture, conditioned on a target-speaker representation derived from the enrollment [6, 7]. With advances in separation backbones, discriminative extraction has benefited from time-domain modeling [8] and Transformer-based architectures [9]. These advances have substantially improved extraction quality, but the underlying formulation remains direct conditional regression toward a single output. Under heavy interference or domain mismatch, such regression can still introduce artifacts or over-suppression [1]. This suggests that progress in TSE depends not only on stronger backbones, but also on formulations that better capture how target speech should be generated from the mixture under the target cue.

Generative modeling offers such a complementary perspective. Instead of committing to a single deterministic estimate from the outset, a conditional generative model learns how target speech should be produced from the mixture under the enrollment condition.
Many recent diffusion- and flow-based approaches can be understood through a transport view: the model starts from an initial representation, follows a conditioned trajectory toward the target, and repeatedly applies a neural update rule along the way. From this viewpoint, latency is tied closely to how many times the network must be evaluated during generation, commonly summarized as the number of function evaluations (NFE). For low-latency TSE, reducing NFE therefore becomes a central design objective rather than a secondary implementation detail.

Within generative TSE, diffusion models have demonstrated strong fidelity and naturalness [10, 11]. However, diffusion sampling typically requires many reverse steps and often relies on additional acceleration techniques, such as fast solvers or distillation, to become practical [12, 13]. Flow matching provides a deterministic alternative by learning a transport, or velocity, field whose integration maps an initial state to the target distribution [14, 15]. FlowTSE brings this conditional transport idea to enrollment-conditioned extraction [16]. Yet, in their standard forms, both diffusion sampling and flow integration remain iterative, which has motivated recent work on finite-interval parameterizations that reduce inference to only a few updates or even a single update [17, 18].

Making such one-step generation practical introduces a different challenge: the model must remain accurate over long transport intervals while staying coherent across different interval lengths. MeanFlow addresses this issue by learning an average finite-interval velocity that directly matches the update used at inference [19]. AlphaFlow further improves training stability with a Jacobian-vector product (JVP)-free objective that combines trajectory matching with teacher-student interval consistency [20].
Together, these developments make one-step generative modeling a practical direction for low-latency TSE rather than only a conceptual possibility.

Motivated by this line of development, we present AlphaFlowTSE, a one-step conditional generative framework for target speaker extraction. AlphaFlowTSE formulates extraction as mixture-to-target transport in the complex STFT domain and learns an enrollment-conditioned mean-velocity predictor. Training couples a local trajectory-matching signal with an interval-consistent teacher-student target under a JVP-free AlphaFlow objective, aligning optimization with one-step inference and enabling single-step extraction. Experiments on Libri2Mix and REAL-T show that AlphaFlowTSE improves one-step extraction quality and generalization to real conversational mixtures in downstream ASR while maintaining competitive speaker-related cues.

2. Background

Recent generative TSE has progressed from iterative conditional generation to low-NFE inference, making one-step extraction a realistic goal rather than only a conceptual one. We begin with flow-based generative TSE and flow matching, which provide the conditional transport view underlying our formulation. We then discuss one-step generative TSE, where finite-interval updates make low-latency inference possible and where recent baselines often adopt an MR-indexed trajectory as a reference setting. Finally, we summarize AlphaFlow, which addresses the training difficulty of one-step mean-velocity models by enforcing interval consistency without explicit JVP computation.

2.1. Flow-based Generative TSE

In single-channel TSE, we observe a mixture waveform y ∈ R^L and a short enrollment utterance e ∈ R^{L_e} that specifies the target speaker. A common additive model is

  y = s + b,  (1)

where s is the clean target speech and b aggregates all non-target components (other speakers and background noise).
Given (y, e), the goal is to estimate ŝ that preserves the target identity while suppressing b. In practice, e is typically taken from non-overlapping regions to avoid contaminating the speaker cue [5, 1].

A conditional generative approach reframes extraction as transport in a representation space. Let z ∈ R^d denote a signal representation (e.g., STFT features), and let c denote the conditioning information derived from (y, e). A trajectory specifies how a state evolves as t increases from a start point to an end point, and a neural update rule determines how to move from the current state toward the target along this trajectory. Because inference applies this update rule repeatedly, the resulting runtime is governed by the number of network evaluations (NFE), making the trajectory and update rule central to low-latency design.

Flow matching learns an instantaneous velocity field v_θ(z_t, t, c) that specifies how z_t should evolve along a chosen trajectory [14]. A standard construction defines a linear interpolation between a source sample z_0 ∼ p_0 (often Gaussian) and a target sample z_1:

  z_t ≜ (1 − t) z_0 + t z_1,  t ∈ [0, 1].  (2)

The learned velocity field parameterizes an ODE

  dz_t/dt = v_θ(z_t, t, c),  (3)

and, for the linear path in (2), the target velocity is the constant vector (z_1 − z_0), yielding the regression objective

  L_FM(θ) = E_{t, z_0, z_1} [ ‖v_θ(z_t, t, c) − (z_1 − z_0)‖²₂ ].  (4)

Sampling then integrates (3) from t = 0 to 1, so the cost scales with the number of integration steps (NFE). FlowTSE instantiates this conditional FM paradigm for TSE by conditioning the velocity field on mixture/enrollment cues and integrating the learned transport to generate target speech [16]. This motivates our next step: if we want NFE close to 1, we need an update rule that directly predicts long-interval transport.

2.2. One-Step Generative TSE

For TSE, the main appeal of one-step generation is low-latency inference: instead of predicting infinitesimal updates and integrating many small steps, the model learns the transport over a finite interval directly, so that the target can be reached with a single network evaluation.

A common way to formalize this idea is through a mean-velocity model [19]. Let 0 ≤ t < r ≤ 1 denote the endpoints of an interval on a trajectory, and let z_t be the state at time t under condition c. A mean-velocity network u_θ(z_t, t, r, c) predicts the state at time r as

  ẑ_r = z_t + (r − t) u_θ(z_t, t, r, c).  (5)

This parameterization is attractive because it aligns the model output with the finite update used at inference: setting (t, r) = (0, 1) yields a single-step generator, i.e., NFE = 1. MeanFlow-TSE brings this finite-interval formulation into conditional generative TSE by predicting the remaining transport from the current mixture-related state to the target endpoint in one update, rather than relying on iterative integration [18].

A commonly used trajectory choice in recent low-NFE TSE baselines is an MR-indexed background-to-target path [17, 18]. Here, the background denotes all non-target components in the mixture, i.e., interfering speakers and noise. Let Y, S, and B denote the STFT-domain representations of the mixture, target, and background, respectively. In synthetic mixture recipes, the mixture can be associated with a mixing-ratio-like coordinate τ⋆ ∈ [0, 1] such that

  Y ≈ (1 − τ⋆) B + τ⋆ S.  (6)

This defines the linear path

  x_τ ≜ (1 − τ) B + τ S,  τ ∈ [0, 1],  (7)

with endpoints x_0 = B and x_1 = S. If τ⋆ were known, inference could start near the mixture location and traverse only the remaining span to τ = 1.
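On the linear paths above, the exact mean velocity over any interval is simply the constant endpoint difference, which is what makes a single long jump sufficient in principle. The toy sketch below (random arrays standing in for the spectra B, S, and Y; the oracle predictor is a hypothetical stand-in for a trained network) checks that the finite-interval update in (5) lands exactly on the target endpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for background/target STFT features (shapes are arbitrary).
B = rng.standard_normal((4, 6))   # "background" endpoint x_0
S = rng.standard_normal((4, 6))   # "target" endpoint x_1

def state(tau):
    """Point on the linear background-to-target path x_tau = (1 - tau) B + tau S."""
    return (1.0 - tau) * B + tau * S

def mean_velocity_oracle(x_t, t, r):
    """Exact mean velocity on the linear path: the constant S - B for any interval."""
    return S - B

# A mixture sitting at mixing-ratio coordinate tau* = 0.3 on the path, as in eq. (6).
tau_star = 0.3
Y = state(tau_star)

# One-step finite-interval update, eq. (5): z_r = z_t + (r - t) * u(z_t, t, r).
r = 1.0
S_hat = Y + (r - tau_star) * mean_velocity_oracle(Y, tau_star, r)

print(np.allclose(S_hat, S))  # the single long jump lands exactly on the target
```

A learned u_θ only approximates this oracle, which is why the training objectives discussed later focus on keeping long-interval predictions accurate.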
Since τ⋆ is unavailable at test time, AD-FlowTSE estimates τ̂ from the mixture and enrollment and then performs transport from τ̂ to 1 with a small number of updates [17], while MeanFlow-TSE combines this MR-indexed trajectory with the finite-interval update in (5) for one-step extraction [18]. We summarize this MR-indexed formulation because it underlies widely used baselines and serves as a reference setting in our experiments.

Once inference is reduced to a single long update, training becomes more delicate because the model must remain accurate over long intervals while staying coherent across different choices of (t, r). Directly enforcing such interval coherence can involve time derivatives of the model output and JVPs, which increase overhead and can destabilize optimization when different supervision terms interact [19]. This motivates training principles that retain the one-step update in (5) while enforcing interval consistency without explicit JVP computation, which leads directly to AlphaFlow.

2.3. AlphaFlow: JVP-Free Interval Consistency for Mean-Velocity Models

AlphaFlow provides a practical way to train finite-interval (mean-velocity) models without explicit JVPs [20]. It couples a trajectory-matching signal that anchors the prediction to the intended transport direction with an interval-consistency signal that encourages agreement across different spans. Instead of differentiating through intermediate predictions, AlphaFlow uses a stop-gradient teacher-student construction to build a stable consistency target.

Given an interval (t, r), AlphaFlow introduces an intermediate time

  s = α r + (1 − α) t,  0 < α ≤ 1,  (8)

where z_s denotes the trajectory state at the intermediate time s. A teacher prediction at (z_s, s, r) is then evaluated with stop-gradient to guide the student at (z_t, t, r). The parameter α controls how strongly the target relies on the direct trajectory anchor versus the teacher-guided direction.
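As a minimal numerical illustration of this construction (random arrays stand in for the mixture and target spectra, and the deliberately imperfect predictor is hypothetical), the intermediate time of (8) and the resulting α-blended target can be computed as:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((4, 6))   # mixture spectrum (toy stand-in)
S = rng.standard_normal((4, 6))   # target spectrum (toy stand-in)
v = S - Y                          # constant path velocity on the linear trajectory

def z(t):
    """Closed-form on-trajectory state z_t = (1 - t) Y + t S."""
    return (1.0 - t) * Y + t * S

def u_theta(z_t, t, r):
    """Hypothetical imperfect student/teacher: true velocity plus model error."""
    return v + 0.1 * np.sin(10.0 * t + r)

t, r, alpha = 0.2, 0.9, 0.25
s = alpha * r + (1.0 - alpha) * t                  # intermediate time, eq. (8)
u_teacher = u_theta(z(s), s, r)                    # wrapped in stop-gradient in practice
u_target = alpha * v + (1.0 - alpha) * u_teacher   # alpha-blended consistency target

# alpha = 1 recovers pure trajectory matching: the target collapses to v itself.
assert np.allclose(1.0 * v + 0.0 * u_teacher, v)
```

Here the stop-gradient is a no-op because numpy carries no gradients; in an autodiff framework the teacher branch would be detached so only the student receives gradient.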
In practice, annealing α from values near 1 to smaller values gradually shifts training from easier trajectory matching to stronger interval consistency, reducing optimization conflict [20]. In our method, we instantiate this principle on a deterministic mixture-to-target trajectory so that intermediate states can be computed in closed form, making the teacher evaluation both simple and stable.

3. Method

AlphaFlowTSE formulates target speaker extraction as a conditional finite-interval transport problem in the spectral domain. Given a mixture and an enrollment utterance, we (i) define a deterministic mixture-to-target trajectory, (ii) parameterize interval transport using a mean-velocity network conditioned on the interval endpoints, and (iii) train the network with a JVP-free α-Flow objective that couples a stable flow-matching anchor with teacher-student interval consistency. At inference, extraction reduces to a single transport update followed by iSTFT reconstruction (NFE = 1).

3.1. AlphaFlowTSE

AlphaFlowTSE learns an enrollment-conditioned one-step transport that maps an observed mixture to the target speaker. For controlled comparisons with prior one-step baselines, we also report an MR-indexed variant that follows the trajectory parameterization adopted in AD-FlowTSE [17] and MeanFlow-TSE [18].

We operate in the complex STFT domain and represent each spectrum by concatenating real and imaginary parts along the channel axis. Omitting the batch dimension, we denote the mixture and target spectra as Y, S ∈ R^{2F×T}, where F and T are the numbers of frequency bins and time frames, and the enrollment spectrum as E ∈ R^{2F×T_e} with T_e enrollment frames.

One-step update. We define a deterministic mixture-to-target trajectory by linear interpolation in the STFT domain:

  z_t ≜ (1 − t) Y + t S,  t ∈ [0, 1].  (9)

For a forward interval 0 ≤ t ≤ r ≤ 1, we parameterize the finite-interval transport using a mean-velocity model u_θ(z_t, t, r; E).
When r > t, the state at r is predicted by

  ẑ_r = z_t + (r − t) u_θ(z_t, t, r; E).  (10)

At inference, we apply a single update from the mixture start point (t, r) = (0, 1):

  Ŝ = Y + u_θ(Y, 0, 1; E),  (11)

followed by iSTFT reconstruction. Note that the trajectory in (9) is used only to define training-time supervision through paired (Y, S); at test time, S is unknown and the update in (11) depends only on (Y, E).

Network parameterization. We implement u_θ with a U-Net style Diffusion Transformer backbone (UDiT) [18]. The network input is formed by concatenating the enrollment spectrum E as a temporal prefix to the current state z_t; the output tokens corresponding to the mixture segment are interpreted as the predicted mean velocity. To support interval-dependent prediction with a single model, we condition each DiT block on the start time t and the interval length ∆ = r − t via adaptive layer normalization, using a conditioning vector c_{t,∆} = emb(t) + emb(∆).

3.2. JVP-Free AlphaFlow Training

Training objective. Our goal is to learn u_θ that is accurate for long intervals while remaining coherent across different interval lengths. Following the α-Flow principle [20], we combine a trajectory-matching anchor with a teacher-student interval-consistency loss on the same deterministic mixture-to-target trajectory in (9). Because the trajectory is linear, the intermediate teacher state is available in closed form; thus the teacher is evaluated on an exact on-trajectory point with stop-gradient, avoiding both Jacobian-vector products and model-generated intermediate states.

For the linear interpolation path in (9), the trajectory velocity is constant:

  v ≜ dz_t/dt = S − Y.  (12)

We interpret u_θ(·) as a velocity (rather than a displacement), so the displacement over (t, r) is (r − t) u_θ(·) as in (10). For the linear trajectory in (9), the desired transport direction is the constant vector v = S − Y, which serves as a stable anchor signal.
Nevertheless, one-step inference queries the model at different states z_t and interval lengths (t, r); we therefore complement the anchor with a teacher-student consistency term that encourages coherent predictions across intervals without computing JVPs.

For a residual tensor D ∈ R^{2F×T}, we define the per-sample mean squared error as

  m(D) ≜ (1 / 2FT) ‖D‖²_F,  (13)

where ‖·‖_F denotes the Frobenius norm.

Figure 1: Overall architecture of AlphaFlowTSE. Given a mixture waveform y and an enrollment utterance e, we compute complex STFT features and form the mixture feature Y and enrollment feature E (real/imaginary concatenation). During training, the backbone takes the current state feature z_t; during inference we initialize z_0 = Y. The enrollment feature is concatenated as a temporal prefix, yielding [E‖z_t] (or [E‖z_0] at inference), which is fed to the UDiT backbone. The backbone is conditioned via AdaLN on the absolute time t and the interval length ∆ = r − t (with r = 1 at inference), and predicts the mean velocity for finite-interval transport, denoted u_θ(t, r, [E‖z_t]). One-step inference (NFE = 1) produces an estimated complex STFT Ŝ = (Ŝ_Re, Ŝ_Im), which is converted to the target waveform ŝ by iSTFT. The dashed module is an optional mixing-ratio predictor used only in the background-to-target ablation to predict the start coordinate τ̂.

Local matching. We include a stable anchor loss by regressing the model output to the trajectory velocity on the diagonal slice r = t. Although the displacement is zero when r = t, we still interpret u_θ(z_t, t, r; E) as a velocity predictor; training on ∆ = r − t = 0 provides well-conditioned gradients and stabilizes optimization. We sample t ∈ (0, 1), set r = t, and compute z_t from (9). Let D_FM ≜ u_θ(z_t, t, t; E) − v. We denote sg(·) as the stop-gradient operator, and apply an adaptive weighting

  ℓ_adp(D) ≜ sg(m(D) + ε_adp)^{γ−1} · m(D),  (14)

where ε_adp > 0 is a small constant, γ ∈ [0, 1], and sg(·) stops gradients through its argument. The resulting objective is

  L_FM(θ) = E_t [ ℓ_adp(D_FM) ].  (15)

This term anchors optimization with well-conditioned gradients and corresponds to the flow-matching component on the diagonal slice.

AlphaFlow consistency. To incorporate the interval-consistency signal without computing JVPs, we instantiate the JVP-free AlphaFlow teacher-student construction on the same mixture-to-target trajectory. We sample (t, r) with 0 ≤ t < r ≤ 1, choose a step ratio α ∈ (0, 1], and define an intermediate time and state

  s ≜ α r + (1 − α) t,  z_s ≜ (1 − s) Y + s S.  (16)

Because the trajectory is deterministic and linear, z_s is computed exactly in closed form, so the teacher is evaluated on a true on-trajectory state rather than a model-generated intermediate.

We compute a student prediction u_θ(z_t, t, r; E) and a stop-gradient teacher prediction ũ ≜ sg(u_θ(z_s, s, r; E)), where sg(·) blocks gradients through the teacher branch. The α-Flow target velocity is then defined as

  u⋆_α ≜ α v + (1 − α) ũ,  (17)

and we denote the residual by D_MF ≜ u_θ(z_t, t, r; E) − u⋆_α. To balance intervals with different α, we adopt a bounded α⁻¹-style reweighting that amplifies informative samples while preventing excessively large weights when α becomes small:

  ℓ_bnd(D; α) ≜ sg(κ / (m(D) + α κ + ε)) · m(D),  (18)

where κ > 0 controls saturation and ε > 0 is a numerical constant.
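The two weighting schemes can be sketched numerically as follows; m, γ, κ, and ε follow (13), (14), and (18), the stop-gradient is emulated by treating the weight as a plain constant, and the sample values are arbitrary:

```python
import numpy as np

def m(D):
    """Per-sample mean squared error m(D) = ||D||_F^2 / (2 F T), eq. (13)."""
    return np.mean(D ** 2)

def l_adp(D, gamma=0.5, eps=1e-3):
    """Adaptive FM weighting, eq. (14); the weight is the sg(...) factor,
    so in a real framework only m(D) would carry gradient."""
    w = (m(D) + eps) ** (gamma - 1.0)   # sg(m(D) + eps)^(gamma - 1)
    return w * m(D)

def l_bnd(D, alpha, kappa=1.0, eps=1e-3):
    """Bounded 1/alpha-style reweighting, eq. (18)."""
    w = kappa / (m(D) + alpha * kappa + eps)   # sg(...) in the paper
    return w * m(D)

rng = np.random.default_rng(2)
D = 0.1 * rng.standard_normal((8, 8))

# The bounded weight never exceeds kappa / (alpha * kappa + eps), so small
# alpha amplifies the loss without letting the weight diverge.
for alpha in (1.0, 0.5, 0.1, 0.01):
    w = 1.0 / (m(D) + alpha * 1.0 + 1e-3)
    assert w <= 1.0 / (alpha * 1.0 + 1e-3)
```

The loop verifies the boundedness claim directly: since m(D) > 0, the weight in (18) stays below κ/(ακ + ε) for every α.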
The corresponding interval objective is

  L_MF(θ) = E_{t,r} [ ℓ_bnd(D_MF; α) ].  (19)

We implement a decoupled objective where each training example is assigned to the FM anchor or the MF consistency branch with probability ρ. Equivalently, the expected objective can be written as

  L(θ) = ρ λ_FM L_FM(θ) + (1 − ρ) λ_MF L_MF(θ),  (20)

with constants λ_FM, λ_MF > 0. During training, α is annealed from 1 to a floor value α_min, gradually shifting the supervision from pure trajectory matching toward teacher-guided interval consistency [20].

To make the proposed JVP-free training explicit, Algorithm 1 summarizes AlphaFlowTSE training on the mixture-to-target path following the AlphaFlow principle [20]. Since the trajectory is deterministic and linear, the intermediate state z_s is computed in closed form and the teacher is evaluated on this exact on-path point with stop-gradient, avoiding JVPs.

Algorithm 1 AlphaFlowTSE: Training (mixture-to-target).
Require: Training batch (Y, S, E); branch probability ρ; weights λ_FM, λ_MF; α-schedule params (k_s, k_e, γ_α, α_min); bounded-loss params (κ, ε).
 1: for training iteration k = 1, 2, ... do
 2:   α ← AlphaSchedule(k; k_s, k_e, γ_α, α_min)  ▷ from 1 to α_min
 3:   Sample q ∼ Bernoulli(ρ)  ▷ switch between FM and MF
 4:   v ← S − Y  ▷ true path velocity
 5:   if q = 1 then  ▷ local trajectory matching (FM)
 6:     Sample t ∈ (0, 1) and set r ← t
 7:     z_t ← (1 − t) Y + t S
 8:     u ← u_θ(z_t, t, r; E)
 9:     L ← λ_FM ℓ_adp(u − v)
10:   else  ▷ interval consistency (JVP-free AlphaFlow)
11:     Sample (t, r) with 0 ≤ t < r ≤ 1 (including long spans)
12:     s ← α r + (1 − α) t
13:     z_t ← (1 − t) Y + t S,  z_s ← (1 − s) Y + s S
14:     u ← u_θ(z_t, t, r; E)  ▷ student
15:     ũ ← sg(u_θ(z_s, s, r; E))  ▷ teacher (stop-gradient)
16:     u⋆_α ← α v + (1 − α) ũ
17:     L ← λ_MF ℓ_bnd(u − u⋆_α; α)
18:   end if
19:   Update θ by gradient descent on L
20: end for

MR-based variant. For controlled comparisons with mixing-ratio (MR) indexed baselines, we also implement a background-to-target trajectory with an auxiliary MR predictor.
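The decoupled FM/MF training step of Algorithm 1 can be sketched in a few lines, under the simplifying assumptions that the batch is a single example, u_θ is an arbitrary callable, and the weighted losses ℓ_adp/ℓ_bnd are reduced to plain MSE:

```python
import numpy as np

rng = np.random.default_rng(3)

def training_step(Y, S, u_theta, alpha, rho=0.5, lam_fm=0.6, lam_mf=0.4):
    """One decoupled FM/MF step (toy sketch; u_theta is a hypothetical
    callable u_theta(z_t, t, r) and the adaptive weights are dropped)."""
    v = S - Y                                    # true path velocity
    if rng.random() < rho:                       # FM branch: diagonal r = t
        t = rng.uniform(0.0, 1.0)
        r = t
        z_t = (1.0 - t) * Y + t * S
        loss = lam_fm * np.mean((u_theta(z_t, t, r) - v) ** 2)
    else:                                        # MF branch: interval consistency
        t, r = np.sort(rng.uniform(0.0, 1.0, size=2))
        s = alpha * r + (1.0 - alpha) * t
        z_t = (1.0 - t) * Y + t * S
        z_s = (1.0 - s) * Y + s * S              # exact on-trajectory teacher state
        u_t = u_theta(z_s, s, r)                 # teacher (stop-gradient in practice)
        target = alpha * v + (1.0 - alpha) * u_t
        loss = lam_mf * np.mean((u_theta(z_t, t, r) - target) ** 2)
    return loss

Y = rng.standard_normal((4, 6))
S = rng.standard_normal((4, 6))
perfect = lambda z_t, t, r: S - Y                # oracle mean-velocity predictor
losses = [training_step(Y, S, perfect, alpha=0.5) for _ in range(20)]
print(max(losses))  # an oracle predictor drives both branches to zero
```

With an oracle predictor returning the true path velocity, both branches give exactly zero loss, which is a useful sanity check when implementing the objective.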
Let B ∈ R^{2F×T} denote the background/interference spectrum (available during training) and define

  x_τ ≜ (1 − τ) B + τ S,  τ ∈ [0, 1].  (21)

The mixture is treated as an intermediate point Y ≈ x_{τ⋆}. We train u_θ on x_τ using the same objectives above by replacing v with v_bg ≜ S − B and computing the teacher state in closed form as x_s = (1 − s) B + s S, where s is defined in (16). At test time, τ⋆ is estimated by a separate regressor p_φ:

  τ̂ = σ(p_φ(y, e)),  L_MR(φ) = E[(τ̂ − τ⋆)²].  (22)

The corresponding one-step extraction uses (t, r) = (τ̂, 1).

3.3. Inference

At test time, we compute complex STFTs using the same analysis parameters as in training and represent each spectrum by concatenating real and imaginary parts.

For the proposed mixture-to-target parameterization, we start from the observed mixture (t = 0) and perform a single finite-interval update to the target endpoint (r = 1) using (11), followed by iSTFT reconstruction. No iterative sampling, guidance, or refinement is applied (NFE = 1).

In practice, training is performed on fixed-duration segments whereas test utterances may be longer. For long utterances, we process the mixture spectrogram in contiguous time chunks, apply the one-step estimator with the same enrollment condition to each chunk, and concatenate the predicted spectrograms along the time axis before waveform reconstruction.

For the background-to-target comparison system with mixing-ratio prediction, an additional predictor p_φ(·) estimates the time coordinate of the mixture on the background-to-target path. Given (y, e), we obtain:

  τ̂ = σ(p_φ(y, e)),  τ̂ ∈ (0, 1).  (23)

We then perform a single jump from t = τ̂ to r = 1 starting from the observed mixture state:

  Ŝ = Y + (1 − τ̂) u_θ(Y, τ̂, 1; E).  (24)

This variant requires one additional forward pass for p_φ; the remaining steps are identical to the mixture-to-target pipeline.

4. Experiment

4.1. Datasets and Data Preparation

We train and benchmark the proposed models on Libri2Mix from the LibriMix corpus [21], and further assess out-of-domain generalization on REAL-T [22]. To enable a fair comparison with recent one-step generative TSE baselines, we follow the community-standard Libri2Mix configuration and adopt the same SpeakerBeam-style informed data protocol used by AD-FlowTSE and MeanFlow-TSE [17, 18, 5].

Libri2Mix dataset. We follow the official LibriMix recipe and use Libri2Mix (min, 16 kHz) under both clean and noisy conditions [21, 23]. For noisy, WHAM! noise is added following the LibriMix procedure [24, 21]. We adopt the SpeakerBeam-style informed setup used by AD-FlowTSE and MeanFlow-TSE [5, 17, 18]: each mixture is paired with a designated target source, and a short target-only enrollment segment is provided as the conditioning cue. We follow the same mixture/enrollment file-list and metadata organization as in these baselines, and randomly crop 3 s segments for both mixture-target and enrollment during training to match their data segmentation protocol.

REAL-T (real conversational mixtures). To evaluate robustness beyond synthetic mixtures, we additionally test on REAL-T [22], a conversation-centric benchmark constructed from real multi-speaker recordings. REAL-T provides mixture segments with natural conversational overlap and enrollment utterances extracted from non-overlapping regions of the same speaker; the benchmark further defines two evaluation subsets (BASE and PRIMARY) to facilitate controlled evaluation under different difficulty levels [22]. As REAL-T originates from real recordings and does not provide perfectly aligned clean target references, it is primarily used to assess practical generalization through transcript-based automatic speech recognition (ASR) measures, non-intrusive perceptual quality estimation via DNSMOS, and speaker similarity.
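The 3 s segmentation protocol described above can be sketched as follows (synthetic waveforms; the point being illustrated is that the mixture/target windows share one offset while the enrollment window is drawn independently):

```python
import numpy as np

SR = 16000
SEG = 3 * SR   # 3-second training segments, as in the informed data protocol

def random_crop(mixture, target, enrollment, rng):
    """Randomly crop an aligned 3 s mixture/target window plus an independent
    3 s enrollment window (toy sketch of the segmentation protocol)."""
    start = rng.integers(0, len(mixture) - SEG + 1)
    e_start = rng.integers(0, len(enrollment) - SEG + 1)
    return (mixture[start:start + SEG],
            target[start:start + SEG],          # same offset: keeps alignment
            enrollment[e_start:e_start + SEG])  # independent offset: only a cue

rng = np.random.default_rng(4)
mix = rng.standard_normal(10 * SR)   # 10 s mixture (toy)
tgt = rng.standard_normal(10 * SR)   # aligned 10 s target (toy)
enr = rng.standard_normal(6 * SR)    # 6 s enrollment (toy)
y, s, e = random_crop(mix, tgt, enr, rng)
assert y.shape == s.shape == e.shape == (SEG,)
```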
We use the official evaluation lists and data release from the REAL-T project repository (https://github.com/REAL-TSE/REAL-T).

4.2. Experimental Setup

To ensure a controlled comparison with recent one-step TSE baselines, we keep the front-end, data formatting, and test-time setting consistent with AD-FlowTSE and MeanFlow-TSE [17, 18].

Front-end and model. All systems operate on complex STFT features computed from 16 kHz audio, using N_FFT = 510 and hop size H = 128. We represent each spectrum by concatenating real and imaginary parts, resulting in 2F = 512 channels. Our separator uses the UDiT backbone as in MeanFlow-TSE [18] (16 Transformer blocks, 16 attention heads, hidden width 1024). The enrollment spectrogram is concatenated as a temporal prefix and processed jointly with the mixture features.

Training setting. We train separate models for Libri2Mix clean and noisy. Following the baselines, mixture/target and enrollment signals are randomly cropped to 3 s during training. Optimization uses AdamW with bfloat16 mixed precision and gradient clipping (max norm 0.5). We continue training for 150 epochs with an initial learning rate of 2 × 10⁻⁵, a short linear warmup, and cosine decay. Training is conducted with distributed data-parallel on 8 NVIDIA H100 GPUs; we use a per-GPU batch size of 42 with gradient accumulation of 2 steps.

Both AD-FlowTSE and MeanFlow-TSE report long training cycles (up to 2000 epochs) to reach their strongest checkpoints. To reduce computation while keeping the comparison fair, we initialize our models from the publicly released AD-FlowTSE checkpoints [17] and then train under our mixture-to-target AlphaFlowTSE objective. We load only the network parameters (i.e., without optimizer or scheduler state) and restart optimization with the schedule above, so the final behavior is determined by our trajectory definition and loss design rather than inherited training dynamics.
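This weights-only warm start can be sketched with plain dictionaries (a real implementation would go through the framework's state-dict loading API; the parameter names below are made up for illustration):

```python
# Warm-starting from a released checkpoint while discarding training state:
# only network parameters are restored, so the optimizer and schedule are
# re-initialized and training dynamics follow the new objective.
released_ckpt = {
    "model": {"blocks.0.weight": [0.1, 0.2], "blocks.0.bias": [0.0]},
    "optimizer": {"step": 123456},     # deliberately ignored
    "scheduler": {"last_epoch": 1999}, # deliberately ignored
}

def warm_start(new_model_params, ckpt):
    """Copy only parameter entries whose names match; drop everything else."""
    restored = dict(new_model_params)
    for name, value in ckpt["model"].items():
        if name in restored:
            restored[name] = value
    return restored

fresh = {"blocks.0.weight": [0.0, 0.0], "blocks.0.bias": [0.0]}
params = warm_start(fresh, released_ckpt)
assert params["blocks.0.weight"] == [0.1, 0.2]
assert "step" not in params   # no optimizer state leaks into the model
```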
For completeness, we also conducted a sanity check by continuing AD-FlowTSE training under its original objective from the released checkpoints, and observed no consistent gains under our compute budget, indicating that the improvements observed in Sec. 5 are not attributable to extended baseline training.

Time pairs (t, r) for the mean-flow term are sampled using the logit-normal strategy described in AlphaFlow [20] with (μ, σ) = (−0.4, 1.0). To better match one-step inference, we additionally draw 15% of samples from a large-span subset with t ≤ 0.15 and r ≥ 0.85. We anneal α from 1 to α_min = 0.1 with a sigmoid schedule (epochs 5-100, k = 15), and apply the FM and MF branches with equal probability. The total objective weights the two branches by λ_FM = 0.6 and λ_MF = 0.4.

Inference and MR-predictor comparison. At test time, AlphaFlowTSE performs extraction with a single network evaluation (NFE = 1) via a single finite-interval mean-flow update from the mixture start point to the target endpoint. Waveform reconstruction and long-utterance handling follow the baseline evaluation protocol [17, 18]; no iterative refinement is used.

For the background-to-target comparison variant, we follow the mixing-ratio predictor design used in AD-FlowTSE/MeanFlow-TSE. A separate regressor p_φ(·) estimates the time coordinate τ̂ ∈ (0, 1) of the observed mixture on the background-to-target trajectory. We implement p_φ with an ECAPA-TDNN encoder [25] and an MLP regression head, and apply SpecAugment during training [26]. At inference, p_φ is evaluated once to obtain τ̂, after which the separator performs a single transport from t = τ̂ to r = 1 with the same reconstruction pipeline.

4.3. Evaluation Metrics

All metrics are computed on reconstructed waveforms at 16 kHz. For Libri2Mix, where clean target references are available, we report a set of standard reference-based and non-intrusive measures that are commonly used in recent generative TSE work.
Specifically, we use wideband Perceptual Evaluation of Speech Quality (PESQ) [29] to assess perceptual quality, extended Short-Time Objective Intelligibility (ESTOI) [30] to assess intelligibility, and scale-invariant signal-to-distortion ratio (SI-SDR) [31] to measure separation accuracy. To complement the reference-based metrics, we additionally report the DNSMOS P.835 score (DNSMOS) [32] using the official implementation from the Microsoft DNS-Challenge repository [33]. Finally, we measure speaker similarity by computing the cosine similarity between speaker embeddings extracted by a pretrained WeSpeaker encoder [34]; on Libri2Mix, embeddings are computed from the extracted speech and the clean target reference.

For REAL-T [22], clean and time-aligned target references are not available, so reference-dependent metrics such as SI-SDR, PESQ, and ESTOI are not applicable. Following the REAL-T evaluation protocol, we report transcript-based error rates: word error rate (WER) for English, computed with Whisper-large-v2 [35], and character error rate (CER) for Chinese, computed with FireRedASR-AED-L [36]. We also report speaker similarity between the extracted speech and the provided enrollment utterance using a pretrained WeSpeaker encoder [34].

5. Results

5.1. Libri2Mix: One-Step Benchmark Performance

We first report a controlled one-step evaluation on Libri2Mix (min, 16 kHz) in Table 1. All AlphaFlowTSE results are obtained with a single separator evaluation (NFE = 1). For clarity, Table 1 reports the MR-enabled setting for AlphaFlowTSE for protocol alignment with prior one-step MR-indexed systems, while Table 2 explicitly compares the w/ and w/o MR settings.

Reference-based fidelity and intelligibility. Under the MR-predictor setting, AlphaFlowTSE achieves the strongest intrusive performance among the one-step systems on both clean and noisy.
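For reference, two metrics that recur throughout the evaluation, SI-SDR and embedding cosine similarity, can be computed as follows; in practice the embeddings would come from a pretrained WeSpeaker encoder [34], which this sketch does not reproduce:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR [31]: project the estimate onto the reference,
    then compare the projected target energy to the residual energy (dB)."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    residual = est - target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(residual, residual) + eps))

def speaker_similarity(emb_est, emb_ref):
    """Cosine similarity between two speaker-embedding vectors."""
    a, b = np.asarray(emb_est, float), np.asarray(emb_ref, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because SI-SDR rescales the reference before comparison, a merely rescaled estimate incurs no penalty, which is the property that makes it preferable to plain SDR for separation evaluation.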
On clean, it achieves the best PESQ and attains the highest ESTOI and SI-SDR, indicating improved intelligibility and separation accuracy under the strict NFE = 1 constraint. On noisy, AlphaFlowTSE again yields the best intrusive scores, showing that the proposed training objective remains effective in the presence of additive noise. Overall, these results support that AlphaFlow-stabilized mean-velocity learning improves one-step extraction quality without increasing inference iterations.

Perceptual quality and target-speaker similarity. In terms of DNSMOS OVRL, AlphaFlowTSE remains competitive on both splits, while some multi-step diffusion/flow systems report slightly higher OVRL in the literature. For target-speaker similarity (SpkSim), AD-FlowTSE attains the highest scores, while AlphaFlowTSE stays close on clean and is stronger than MeanFlowTSE on noisy. Taken together, Table 1 shows that AlphaFlowTSE strengthens one-step fidelity and intelligibility while maintaining competitive perceptual quality and identity preservation.

Table 1: Libri2Mix benchmark results (min, 16 kHz). ↑ indicates higher is better. OVRL is the DNSMOS-P.835 overall score and SpkSim denotes speaker similarity. Results of prior systems are taken from the literature [16, 10, 11, 27, 28, 17, 18]. For MR-indexed one-step systems, the default setting uses an MR predictor at inference. The left metric block is Libri2Mix clean; the right block is Libri2Mix noisy; "–" marks values not reported.

| Method | PESQ↑ | ESTOI↑ | SI-SDR↑ | OVRL↑ | SpkSim↑ | PESQ↑ | ESTOI↑ | SI-SDR↑ | OVRL↑ | SpkSim↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Mixture | 1.15 | 0.54 | 0.00 | 2.65 | 0.54 | 1.08 | 0.40 | -1.93 | 1.63 | 0.46 |
| DiffSep+SV [11] | 1.85 | 0.79 | – | 3.14 | 0.83 | 1.32 | 0.60 | – | 2.78 | 0.62 |
| DDTSE [11] | 1.79 | 0.78 | – | 3.30 | 0.73 | 1.60 | 0.71 | – | 3.28 | 0.71 |
| DiffTSE [10] | 3.08 | 0.80 | 11.28 | – | – | – | – | – | – | – |
| FlowTSE [16] | 2.58 | 0.84 | – | 3.27 | 0.90 | 1.86 | 0.75 | – | 3.30 | 0.83 |
| SR-SSL [27] | 2.99 | – | 16.00 | – | – | – | – | – | – | – |
| SoloSpeech [28] | – | – | – | – | – | 1.89 | 0.78 | 11.12 | – | – |
| AD-FlowTSE [17] | 2.89 | 0.90 | 17.49 | 3.15 | 0.95 | 2.15 | 0.81 | 12.70 | 3.11 | 0.87 |
| MeanFlowTSE [18] | 3.26 | 0.93 | 18.80 | 3.21 | 0.92 | 2.21 | 0.82 | 12.85 | 3.17 | 0.73 |
| AlphaFlowTSE (ours) | 3.27 | 0.94 | 19.17 | 3.24 | 0.93 | 2.28 | 0.85 | 13.16 | 3.19 | 0.76 |

5.2. Effect of MR Prediction and Inference Overhead

Several recent one-step TSE baselines rely on an MR predictor to set a trajectory coordinate at inference. To clarify the role of this component, Table 2 reports results with MR prediction enabled/disabled and quantifies the relative degradation when the predictor is removed.

Sensitivity to MR prediction. Removing MR prediction substantially degrades AD-FlowTSE and MeanFlowTSE, with particularly large drops in SI-SDR for MeanFlowTSE. In contrast, AlphaFlowTSE exhibits markedly smaller degradations: its SI-SDR decreases only marginally when MR prediction is removed, and the same trend holds for PESQ and DNSMOS OVRL in relative terms. This indicates that AlphaFlowTSE is less sensitive to the availability (or quality) of a coordinate predictor, consistent with the goal of learning a mean-velocity model that remains accurate and coherent across interval lengths.

Table 2: MR-predictor ablation on Libri2Mix (NFE = 1). ↑ indicates higher is better. "w/" and "w/o" denote inference with and without MR prediction, respectively; the relative-decline columns (listed on the w/o rows) compare w/o against w/.

| Split | Method | MR pred. | SIG↑ | BAK↑ | OVRL↑ | P808↑ | PESQ↑ | ESTOI↑ | SI-SDR↑ | SpkSim↑ | ∆OVRL (%) | ∆PESQ (%) | ∆SI-SDR (dB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| clean | AD-FlowTSE [17] | w/o | – | – | 3.02 | 3.44 | 2.33 | 0.82 | 12.54 | 0.92 | -4.1 | -19.4 | -4.95 |
| clean | AD-FlowTSE [17] | w/ | 3.47 | 3.90 | 3.15 | 3.59 | 2.89 | 0.90 | 17.49 | 0.95 | | | |
| clean | MeanFlowTSE [18] | w/o | 3.31 | 3.37 | 2.77 | 3.31 | 1.53 | 0.53 | -6.00 | 0.60 | -13.7 | -53.1 | -24.80 |
| clean | MeanFlowTSE [18] | w/ | 3.51 | 3.95 | 3.21 | 3.69 | 3.26 | 0.93 | 18.80 | 0.92 | | | |
| clean | AlphaFlowTSE | w/o | 3.52 | 3.94 | 3.16 | 3.61 | 3.04 | 0.92 | 18.50 | 0.92 | -2.5 | -7.0 | -0.67 |
| clean | AlphaFlowTSE | w/ | 3.54 | 4.02 | 3.24 | 3.72 | 3.27 | 0.94 | 19.17 | 0.93 | | | |
| noisy | AD-FlowTSE [17] | w/o | – | – | 2.87 | 3.23 | 1.73 | 0.72 | 9.40 | 0.84 | -7.7 | -19.5 | -3.30 |
| noisy | AD-FlowTSE [17] | w/ | 3.43 | 3.92 | 3.11 | 3.48 | 2.15 | 0.81 | 12.70 | 0.87 | | | |
| noisy | MeanFlowTSE [18] | w/o | 3.25 | 3.63 | 2.85 | 3.17 | 1.51 | 0.60 | 0.03 | 0.57 | -10.1 | -31.7 | -12.82 |
| noisy | MeanFlowTSE [18] | w/ | 3.45 | 3.97 | 3.17 | 3.55 | 2.21 | 0.82 | 12.85 | 0.73 | | | |
| noisy | AlphaFlowTSE | w/o | 3.48 | 3.90 | 3.11 | 3.43 | 2.16 | 0.82 | 12.76 | 0.76 | -2.5 | -5.3 | -0.40 |
| noisy | AlphaFlowTSE | w/ | 3.49 | 4.01 | 3.19 | 3.57 | 2.28 | 0.85 | 13.16 | 0.76 | | | |

Inference cost. Table 3 summarizes test-time overhead. Iterative diffusion/cascaded systems require many separator evaluations, whereas the one-step family operates at NFE = 1 for the separator. Within one-step systems, MR-indexed baselines require an additional MR predictor (auxiliary parameters and an extra forward pass). AlphaFlowTSE keeps one-step inference for the separator; we report results with an MR predictor for protocol alignment and analyze its effect in Table 2.

Table 3: Inference cost and model size comparison. NFE denotes the number of separator network function evaluations at test time. "Params" reports the separator (backbone) parameters, while "Aux Params" reports additional parameters required at inference (e.g., an MR predictor).

| Method | NFE | Params (M) | Aux Params (M) |
|---|---|---|---|
| DiffSep+SV [11] | 60 | 66 | 6.63 |
| DDTSE [11] | 10 | 71 | 6.63 |
| SR-SSL [27] | 5 | 431 | – |
| SoloSpeech [28] | 50 | 589 | – |
| AD-FlowTSE [17] | 1 or 5 | 342 | 15.57 (MR predictor) |
| MeanFlowTSE [18] | 1 | 343 | 15.57 (MR predictor) |
| AlphaFlowTSE | 1 | 343 | Optional |

5.3. REAL-T: Zero-Shot Transfer to Real Conversations

We next assess out-of-domain generalization on REAL-T, which contains real conversational mixtures without aligned clean targets. All models are trained on Libri2Mix noisy and evaluated zero-shot on REAL-T. Importantly, REAL-T does not provide MR labels; therefore, MR prediction is not supervised on REAL-T. We nevertheless report two inference settings, without an MR predictor and with an MR predictor imported from synthetic training, to study how attaching a synthetic-trained predictor affects cross-domain behavior.

Downstream ASR accuracy. Figure 2 reports downstream ASR error rates on REAL-T. Without MR prediction (Fig. 2(a)), AlphaFlowTSE consistently yields the lowest WER/CER across all subsets and achieves the best language-level averages, indicating strong zero-shot transfer to real conversational overlap patterns.
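The relative-decline columns of Table 2 follow directly from the w/ and w/o values. A minimal sketch (percentage change for OVRL and PESQ, absolute dB difference for SI-SDR), checked against the AlphaFlowTSE clean row:

```python
def relative_decline(with_mr: float, without_mr: float, absolute: bool = False) -> float:
    """Relative-decline columns of Table 2.

    Percentage change for OVRL/PESQ; with absolute=True, the raw
    dB difference used for the SI-SDR column.
    """
    if absolute:
        return without_mr - with_mr                      # delta SI-SDR in dB
    return 100.0 * (without_mr - with_mr) / with_mr      # delta in percent

# AlphaFlowTSE, Libri2Mix clean: w/ vs. w/o MR prediction (Table 2)
d_ovrl = relative_decline(3.24, 3.16)                    # about -2.5 %
d_pesq = relative_decline(3.27, 3.04)                    # about -7.0 %
d_sisdr = relative_decline(19.17, 18.50, absolute=True)  # about -0.67 dB
```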
When attaching an MR predictor at inference (Fig. 2(b)), ASR errors for the MR-indexed baselines generally decrease and the ordering becomes subset-dependent; AlphaFlowTSE remains competitive, with the most consistent advantage observed in the MR-free setting that matches REAL-T's supervision-free condition.

Table 4: DNSMOS OVRL on REAL-T [22] (higher is better). Models are trained on Libri2Mix noisy and evaluated zero-shot on REAL-T. We report inference without an MR predictor (w/o MR) and with an MR predictor imported from synthetic training (w/ MR), since REAL-T provides no MR labels.

| Dataset | Samples (N) | AD-FlowTSE (w/o MR) | MeanFlowTSE (w/o MR) | AlphaFlowTSE (w/o MR) | AD-FlowTSE (w/ MR) | MeanFlowTSE (w/ MR) | AlphaFlowTSE (w/ MR) |
|---|---|---|---|---|---|---|---|
| AMI | 592 | 1.837 | 2.178 | 1.820 | 1.799 | 2.120 | 2.169 |
| CHiME-6 | 545 | 1.460 | 1.174 | 1.843 | 1.597 | 1.896 | 1.858 |
| DipCo | 133 | 1.346 | 1.193 | 1.560 | 1.252 | 1.475 | 1.515 |
| English avg. | 1270 | 1.624 | 1.644 | 1.803 | 1.655 | 1.956 | 1.967 |
| AISHELL-4 | 240 | 2.113 | 1.732 | 2.002 | 2.137 | 2.258 | 2.277 |
| AliMeeting | 481 | 1.824 | 1.607 | 1.974 | 1.855 | 2.058 | 2.086 |
| Chinese avg. | 721 | 1.921 | 1.648 | 1.983 | 1.949 | 2.125 | 2.150 |

Figure 2: ASR error rates on REAL-T under two inference settings: (a) w/o MR predictor and (b) w/ MR predictor. Left panels report English WER (average and subsets: AMI, CHiME-6, DipCo), and right panels report Chinese CER (average and subsets: AISHELL-4, AliMeeting) for AD-FlowTSE, MeanFlowTSE, and AlphaFlowTSE. Lower is better.
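The average rows of Table 4 are consistent with a sample-weighted average over subsets, i.e., weighting each subset's DNSMOS OVRL by its sample count N; the weighting scheme is inferred from the reported numbers. A small sketch:

```python
def weighted_avg(scores, counts):
    """Sample-weighted average: each subset score is weighted by its N."""
    total = sum(counts)
    return sum(s * n for s, n in zip(scores, counts)) / total

# English subsets (AMI, CHiME-6, DipCo), AlphaFlowTSE without MR predictor
avg_en = weighted_avg([1.820, 1.843, 1.560], [592, 545, 133])

# Chinese subsets (AISHELL-4, AliMeeting), AlphaFlowTSE without MR predictor
avg_zh = weighted_avg([2.002, 1.974], [240, 481])
```

Both values round to the averages reported in Table 4 (1.803 and 1.983), which supports the sample-weighted reading.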
Target-speaker similarity. Figure 3 shows a similar trend for speaker similarity. Without MR prediction (Fig. 3(a)), AlphaFlowTSE achieves the highest SIM on both language averages and on all subsets. With MR prediction enabled (Fig. 3(b)), differences shrink and some subsets (e.g., DipCo) favor different systems, suggesting that importing an MR predictor introduces an additional cross-domain operating point rather than a uniformly improved setting.

Reference-free perceptual quality. Table 4 reports DNSMOS OVRL on REAL-T. AlphaFlowTSE achieves the best DNSMOS OVRL on both language-level averages under both inference settings. Without MR prediction, it is best on most subsets but shows expected trade-offs; with the imported MR predictor, AlphaFlowTSE further improves and becomes best on most subsets, indicating that the learned transport preserves favorable perceptual quality under real-mixture conditions.

Figure 3: Speaker similarity (SIM / SpkSim) on REAL-T under two inference settings: (a) MR-free and (b) w/ MR predictor. Left panels report English SpkSim (average and subsets: AMI, CHiME-6, DipCo), and right panels report Chinese SpkSim (average and subsets: AISHELL-4, AliMeeting) for AD-FlowTSE, MeanFlowTSE, and AlphaFlowTSE. Higher is better.

Summary. Across Libri2Mix and REAL-T, AlphaFlowTSE delivers strong one-step extraction quality under NFE = 1.
On REAL-T, it provides the most consistent gains in the realistic MR-free setting (lower ASR errors and higher speaker similarity), while also achieving strong DNSMOS OVRL and remaining competitive when coupled with an imported MR predictor.

6. Conclusion

We presented AlphaFlowTSE, a one-step generative TSE framework that learns mean-velocity mixture-to-target transport and is trained with a JVP-free AlphaFlow objective combining trajectory matching and interval-consistent teacher–student supervision, aligning training with single-step inference. Experiments on Libri2Mix and REAL-T demonstrate strong one-step extraction quality, robustness to disabling MR prediction, and improved zero-shot ASR performance with competitive target-speaker similarity, indicating favorable transfer to real conversational mixtures under practical low-latency settings.

7. Generative AI Use Disclosure

Generative AI tools were used solely for language editing and polishing of the manuscript (e.g., improving grammar, phrasing, and readability). All authors reviewed the final manuscript and take full responsibility for the content. Generative AI tools are not listed as authors.

8. References

[1] K. Zmolikova, M. Delcroix, T. Ochiai, K. Kinoshita, J. Cernocky, and D. Yu, “Neural Target Speech Extraction: An Overview,” IEEE Signal Processing Magazine, vol. 40, no. 3, pp. 8–29, May 2023.
[2] D.-J. Alcala Padilla, N. L. Westhausen, S. Vivekananthan, and B. T. Meyer, “Location-Aware Target Speaker Extraction for Hearing Aids,” in Interspeech 2025, 2025, pp. 2975–2979.
[3] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking,” in Interspeech 2019, 2019, pp. 2728–2732.
[4] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J.
Černocký, “SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019.
[5] T. Ochiai, M. Delcroix, K. Kinoshita, A. Ogawa, and T. Nakatani, “Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues,” in Interspeech 2019, 2019, pp. 2718–2722.
[6] M. Delcroix, S. Watanabe, T. Ochiai, K. Kinoshita, S. Karita, A. Ogawa, and T. Nakatani, “End-to-End SpeakerBeam for Single Channel Target Speech Recognition,” in Interspeech 2019, 2019, pp. 451–455.
[7] M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, “SpEx+: A Complete Time Domain Speaker Extraction Network,” in Interspeech 2020, 2020, pp. 1406–1410.
[8] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[9] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention Is All You Need in Speech Separation,” in Proc. ICASSP 2021, 2021, pp. 21–25.
[10] N. Kamo, M. Delcroix, and T. Nakatani, “Target Speech Extraction with Conditional Diffusion Model,” in Interspeech 2023, 2023, pp. 176–180.
[11] L. Zhang, Y. Qian, L. Yu, H. Wang, H. Yang, L. Zhou, S. Liu, and Y. Qian, “DDTSE: Discriminative diffusion model for target speech extraction,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 294–301, arXiv:2309.13874.
[12] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/hash/4c1c83c8273de1b25b52cada8d8a6b9c-Abstract-Conference.html
[13] T. Salimans and J.
Ho, “Progressive distillation for fast sampling of diffusion models,” in International Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id=TIdIXIpzhoI
[14] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://openreview.net/forum?id=PqvMRDCJT9t
[15] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://openreview.net/forum?id=XVjTT1nw5z
[16] A. Navon, A. Shamsian, Y. Segal-Feldman, N. Glazer, G. Hetz, and J. Keshet, “FlowTSE: Target Speaker Extraction with Flow Matching,” in Interspeech 2025, 2025, pp. 2965–2969.
[17] T.-A. Hsieh and M. Kim, “Adaptive deterministic flow matching for target speaker extraction,” arXiv preprint arXiv:2510.16995, 2025. [Online]. Available: https://arxiv.org/abs/2510.16995
[18] R. Shimizu, X. Jiang, and N. Mesgarani, “MeanFlow-TSE: One-step generative target speaker extraction with mean flow,” arXiv preprint arXiv:2512.18572, 2025.
[19] Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He, “Mean flows for one-step generative modeling,” arXiv preprint arXiv:2505.13447, 2025.
[20] H. Zhang, A. Siarohin, W. Menapace, M. Vasilkovsky, S. Tulyakov, Q. Qu, and I. Skorokhodov, “AlphaFlow: Understanding and improving MeanFlow models,” arXiv preprint arXiv:2510.20771, 2025.
[21] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
[22] S. Li, S. Wang, J. Han, K. Zhang, W. Wang, and H. Li, “REAL-T: Real conversational mixtures for target speaker extraction,” in Proc. INTERSPEECH 2025, 2025, pp. 1923–1927.
[23] V. Panayotov, G. Chen, D. Povey, and S.
Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP 2015, 2015, pp. 5206–5210.
[24] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending Speech Separation to Noisy Environments,” in Interspeech 2019, 2019, pp. 1368–1372.
[25] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Interspeech 2020, 2020, pp. 3830–3834.
[26] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Interspeech 2019, 2019, pp. 2613–2617.
[27] P.-J. Ku, A. H. Liu, R. Korostik, S.-F. Huang, S.-W. Fu, and A. Jukić, “Generative speech foundation model pretraining for high-quality speech extraction and restoration,” in Proc. ICASSP 2025, 2025, pp. 1–5.
[28] H. Wang, J. Hai, D. Yang, C. Chen, K. Li, J. Peng, T. Thebaud, L. Moro-Velázquez, J. Villalba, and N. Dehak, “SoloSpeech: Enhancing intelligibility and quality in target speech extraction through a cascaded generative pipeline,” arXiv preprint arXiv:2505.19314, 2025.
[29] “ITU-T Recommendation P.862.2 (11/07): Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs,” International Telecommunication Union, Recommendation, Nov. 2007. [Online]. Available: https://www.itu.int/rec/T-REC-P.862.2-200711-W/en
[30] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[31] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in Proc. ICASSP 2019, 2019, pp. 626–630.
[32] C. K. A. Reddy, V. Gopal, and R.
Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP 2022, 2022.
[33] Microsoft, “DNS-Challenge: Deep noise suppression challenge (DNSMOS implementation),” https://github.com/microsoft/DNS-Challenge, 2023.
[34] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “WeSpeaker: A research and production oriented speaker embedding learning toolkit,” in Proc. ICASSP 2023, 2023, pp. 1–5.
[35] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” arXiv preprint arXiv:2212.04356, 2022.
[36] K.-T. Xu, F.-L. Xie, X. Tang, and Y. Hu, “FireRedASR: Open-source industrial-grade Mandarin speech recognition models from encoder-decoder to LLM integration,” arXiv preprint arXiv:2501.14350, 2025. [Online]. Available: https://arxiv.org/abs/2501.14350