Paper deep dive
A Unified View of Drifting and Score-Based Models
Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao
Abstract
Drifting models train one-step generators by optimizing a mean-shift discrepancy induced by a kernel between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, yielding a transport direction for generated samples. In this paper, we make its relationship to the score-matching principle behind diffusion models precise by showing that drifting admits a score-based formulation on kernel-smoothed distributions. For Gaussian kernels, the population mean-shift field coincides with the score difference between the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to the corresponding conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching-style objective on smoothed distributions. It also clarifies the connection to Distribution Matching Distillation (DMD): both methods use score-mismatch transport directions, but drifting realizes the score signal nonparametrically from kernel neighborhoods, whereas DMD uses a pretrained diffusion teacher. Beyond Gaussians, we derive an exact decomposition for general radial kernels, and for the Laplace kernel we prove rigorous error bounds showing that drifting remains an accurate proxy for score matching in low-temperature and high-dimensional regimes.
Tags
Links
- Source: https://arxiv.org/abs/2603.07514v1
- Canonical: https://arxiv.org/abs/2603.07514v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/13/2026, 12:34:49 AM
Summary
This paper provides a theoretical unification of drifting models and score-based diffusion models. It demonstrates that the mean-shift field used in drifting models is equivalent to a score-mismatch field on kernel-smoothed distributions. For Gaussian kernels, this relationship is exact via Tweedie's formula, while for Laplace kernels, the drifting model serves as an accurate proxy for score matching in low-temperature and high-dimensional regimes. The work clarifies the connection to Distribution Matching Distillation (DMD) and provides rigorous error bounds for non-Gaussian kernels.
Entities (6)
Relation Signals (3)
Tweedie's Formula → links → Gaussian Kernel
confidence 95% · This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to the corresponding conditional mean
Drifting Models → uses → Laplace Kernel
confidence 95% · Drifting models train one-step generators by optimizing a mean-shift discrepancy induced by a kernel... with Laplace kernels used by default in practice.
Drifting Models → is equivalent to → Score-Based Models
confidence 90% · we make its relationship to the score-matching principle behind diffusion models precise by showing that drifting admits a score-based formulation
Cypher Suggestions (2)
Find all generative models and their associated kernels mentioned in the paper. · confidence 90% · unvalidated
MATCH (m:GenerativeModel)-[:USES]->(k:Kernel) RETURN m.name, k.name
Identify the relationship between drifting models and score-based principles. · confidence 85% · unvalidated
MATCH (a:GenerativeModel {name: 'Drifting Models'})-[r:IS_EQUIVALENT_TO]->(b:GenerativeModel {name: 'Score-Based Models'}) RETURN r.description
Full Text
175,449 characters extracted from source content.
A Unified View of Drifting and Score-Based Models
Chieh-Hsin Lai (1), Bac Nguyen (1), Naoki Murata (1), Yuhta Takida (1), Toshimitsu Uesaka (1), Yuki Mitsufuji (1,2), Stefano Ermon (3)†, Molei Tao (4)†
(1) Sony AI, (2) Sony Group Corporation, (3) Stanford University, (4) Georgia Tech. Corresponding author: chieh-hsin.lai@sony.com. † Equal supervision.
Abstract
Drifting models train one-step generators by optimizing a mean-shift discrepancy induced by a kernel between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, yielding a transport direction for generated samples. In this paper, we make its relationship to the score-matching principle behind diffusion models precise by showing that drifting admits a score-based formulation on kernel-smoothed distributions. For Gaussian kernels, the population mean-shift field coincides with the score difference between the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to the corresponding conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching-style objective on smoothed distributions. It also clarifies the connection to Distribution Matching Distillation (DMD): both methods use score-mismatch transport directions, but drifting realizes the score signal nonparametrically from kernel neighborhoods, whereas DMD uses a pretrained diffusion teacher. Beyond Gaussians, we derive an exact decomposition for general radial kernels, and for the Laplace kernel we prove rigorous error bounds showing that drifting remains an accurate proxy for score matching in low-temperature and high-dimensional regimes.
1 Introduction
Diffusion and score-based generative models [22, 8, 24, 25, 13] generate data by transporting a simple noise distribution to the data distribution through many small steps. This transport is usually described by a time-indexed stochastic process or its ODE counterpart. Such a formulation yields strong sample quality, but often makes inference expensive because generation requires many neural network evaluations. Motivated by the need for faster sampling, recent work has explored one-step or few-step generators [23, 11, 6, 1, 9] that directly push forward noise to data.
Drifting models [3] offer a fast, one-step perspective on generative modeling. Instead of defining a time-indexed corruption process and learning to reverse it, drifting fixes a kernel (Laplace by default) and constructs a transport rule directly from samples. The central object is a displacement: at each location, drifting aggregates nearby samples (weighted by the kernel) and takes their weighted average offset. This yields a mean-shift-type update [2] that moves points toward higher-density regions. Collecting these local displacements over all locations defines a vector field, and generation is performed by pushing samples along this field at one (or a few) kernel scales; see Figure 1-(a) for intuition.
This displacement field is closely connected to the score function that underlies modern diffusion models. The score is a log-density gradient and points toward higher-density regions. In particular, the score mismatch between data and model defines a transport direction that steers model samples toward the data; Figure 1-(b) visualizes this behavior.
Score-based diffusion models [24, 25] learn scores via score matching [10, 27, 18], most often in a forward Fisher form that averages the score mismatch under the data distribution (which encourages mode coverage). They then generate samples by applying denoising updates across noise levels.
In this article, we make this connection precise by proving that drifting admits a score-based interpretation on kernel-smoothed distributions. For Gaussian kernels, the population mean-shift field is exactly a variance-scaled score-mismatch field between the Gaussian-smoothed data and model distributions. This follows from Tweedie's formula [4], which links the conditional mean under additive Gaussian noise to the score of the corresponding smoothed marginal, and is the same principle that underlies the denoising interpretation of diffusion models, where the optimal denoiser is determined by the score of a Gaussian-smoothed distribution. We empirically validate and visualize this exact theoretical correspondence in Figure 1-(c,d): the mean-shift and score-mismatch fields coincide up to the variance scaling, so the bridge to score-based modeling is exact at the level of the transport field. The corresponding drifting objective is a score-matching-style objective, but in reverse Fisher form: the pointwise score mismatch is averaged under the model distribution rather than the data distribution. This weighting encourages correcting the score field where the current model places mass (for example, suppressing spurious mass), and is complementary to forward Fisher, which emphasizes matching scores on data regions and thus promotes coverage under the data distribution.
This viewpoint also places drifting closer, at the objective level, to Distribution Matching Distillation (DMD) [29], a distillation approach that learns a one-step generator from a pre-trained diffusion teacher. Both drifting and DMD use score-mismatch transport directions under the model law, but differ in how the score signal is obtained: drifting realizes it nonparametrically from local kernel neighborhoods, whereas DMD relies on a pre-trained diffusion teacher.
Figure 1: 2D visualization: the Gaussian drifting field is exactly parallel to the score-matching direction. With a Gaussian kernel used for smoothing, the mean-shift drifting field in (a) is exactly direction-aligned with the score-mismatch field in (b) (as proved in Theorem 1); panels (c,d) visualize this alignment. Both fields are estimated from finite samples using the same kernel-based Monte Carlo procedure. Here $p$ denotes the data distribution and $q$ denotes a constructed model distribution (no training): $p$ is a 33-mode mixture on the unit circle, and $q$ is obtained by rotating each mode center of $p$ by a fixed angle.
Beyond the Gaussian case, we show that general radial kernels (including Laplace) admit an exact decomposition into (i) a preconditioned smoothed-score term and (ii) a covariance residual that captures local neighborhood geometry. This yields a preconditioned score-matching interpretation of drifting and makes explicit the additional terms introduced by non-Gaussian kernels. For the Laplace kernel (a special case of the radial family) used in drifting-model implementations, this decomposition further implies that mean-shift drifting follows an approximately score-matching direction in at least two complementary regimes. (1) Low temperature.
For small $\tau$, the kernel is highly local and mean shift acts like a local score estimate, so the population drifting minimizer matches the data's kernel-smoothed score up to a polynomially small error in $\tau$. (2) High dimension. For large (embedding) dimension $D$, drifting and score matching align at three levels: (i) vector-field/objective alignment, where the drifting field approximately matches a scaled score-mismatch field; (ii) update alignment, where the stop-gradient update approximately matches a score-mismatch transport update; and (iii) minimizer alignment, where the drifting minimizer is close to the score-matching minimizer (and hence the data). In all three cases, the discrepancy decays polynomially in $D$. Empirically, we carefully validate our theory by confirming the predicted decay rate in $D$, and by showing that the Gaussian kernel (i.e., the score-based case) and the Laplace kernel yield generally comparable generation quality. For clarity, we summarize the main messages of this article below:
Key Takeaways.
• Gaussian Kernel:
⋄ Drifting's mean-shift direction = score-mismatch direction (via Tweedie).
⋄ Drifting objective = score-matching-style objective.
• Laplace Kernel: drifting direction ≈ score-mismatch direction in two regimes:
⋄ Small temperature $\tau$: population optima align, with a polynomially vanishing gap in $\tau$.
⋄ Large dimension $D$: objectives, gradients, and optima align, with a polynomially vanishing gap in $D$.
2 Preliminaries
Score Functions and Fisher Divergences. Let $p$ be a distribution on $\mathbb{R}^D$ with density (still denoted by $p$). Its score function is $s_p(x) := \nabla_x \log p(x)$. Given two distributions $p$ and $q$, we define the forward and reverse Fisher divergences as
$$D_{\mathrm{fF}}(p\|q) := \mathbb{E}_{x\sim p}\big[\|s_p(x)-s_q(x)\|_2^2\big], \qquad D_{\mathrm{rF}}(p\|q) := \mathbb{E}_{x\sim q}\big[\|s_p(x)-s_q(x)\|_2^2\big].$$
Throughout this article, we use $p$ (or $p_{\mathrm{data}}$) to denote the data distribution and $q$ (or $q_\theta$) to denote the model distribution. The forward and reverse Fisher divergences share the same pointwise score mismatch, $\|s_p - s_q\|_2^2$, but differ in how it is weighted: forward Fisher averages under $p$, while reverse Fisher averages under $q$. This distinction matters in practice: forward Fisher prioritizes correctness where real data concentrate (mode coverage), whereas reverse Fisher prioritizes correcting the score field where the current model places mass (suppressing spurious mass). Standard score matching [10, 25] fits $q_\theta$ by minimizing the forward Fisher divergence $D_{\mathrm{fF}}(p\|q_\theta)$, which avoids explicit normalizing constants and directly targets score-field matching.
Gaussian Perturbations and Diffusion Models. A diffusion forward process can be written as the Gaussian perturbation
$$x_t = \alpha_t x_0 + \sigma_t \varepsilon, \qquad x_0 \sim p_{\mathrm{data}}, \quad \varepsilon \sim \mathcal{N}(0, I), \tag{1}$$
where $(\alpha_t, \sigma_t)$ is a fixed noise schedule with $\alpha_t \ge 0$ and $\sigma_t > 0$. Let $p_t$ denote the marginal density of $x_t$. Then $p_t$ is a Gaussian smoothing of $p_{\mathrm{data}}$: $p_t(x) = \int \mathcal{N}(x;\, \alpha_t x_0,\, \sigma_t^2 I)\, p_{\mathrm{data}}(x_0)\, dx_0$, with time-varying score function $s_p(x, t) := \nabla_x \log p_t(x)$. Diffusion models learn the oracle score $s_p(x_t, t)$ with a neural network $s_\theta(x_t, t)$ by minimizing a time-dependent score-matching objective [27, 25] in forward Fisher form:
$$\mathbb{E}_t\, \mathbb{E}_{x_t \sim p_t}\big[\|s_\theta(x_t, t) - s_p(x_t, t)\|_2^2\big].$$
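To make the weighting difference tangible, here is a small NumPy sketch (our illustration, not an experiment from the paper) that Monte Carlo-estimates both Fisher divergences for two Gaussians whose scores are known in closed form; the means, variances, and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2

# p = N(0, I) and q = N(mu, sig2 * I): both scores are available in closed form.
mu, sig2 = np.array([3.0, 0.0]), 4.0
s_p = lambda x: -x                   # score of N(0, I)
s_q = lambda x: -(x - mu) / sig2     # score of N(mu, sig2 * I)

def fisher(samples):
    """Monte Carlo estimate of E ||s_p(x) - s_q(x)||_2^2 over the given samples."""
    diff = s_p(samples) - s_q(samples)
    return np.mean(np.sum(diff**2, axis=1))

x_p = rng.normal(size=(100_000, D))                        # x ~ p
x_q = mu + np.sqrt(sig2) * rng.normal(size=(100_000, D))   # x ~ q

# Same pointwise mismatch, different weighting: forward averages under p,
# reverse averages under q, so the two estimates differ.
print("D_fF(p||q) ~", fisher(x_p))
print("D_rF(p||q) ~", fisher(x_q))
```

Because the pointwise mismatch $\|s_p(x) - s_q(x)\|_2^2$ grows away from the region where $p$ and $q$ agree, averaging it under $p$ versus under $q$ gives visibly different values.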
Distribution Matching Distillation (DMD). Since diffusion-model sampling typically requires many function evaluations (the sampling process is essentially equivalent to solving an ODE [25]), it can be slow. DMD learns one-step (or few-step) pushforward generators by distilling pre-trained diffusion models. Let $f_\theta(z)$ be a deterministic one-step generator with Gaussian prior $z \sim p_{\mathrm{prior}}$, where we take $p_{\mathrm{prior}} = \mathcal{N}(0, \sigma_T^2 I)$. This generator induces the model distribution $q_\theta := (f_\theta)_\# p_{\mathrm{prior}}$. Applying the same forward process to synthetic samples yields
$$\hat{x}_t := \alpha_t f_\theta(z) + \sigma_t \varepsilon, \qquad z \sim p_{\mathrm{prior}}, \quad \varepsilon \sim \mathcal{N}(0, I),$$
with marginal density $q_{\theta,t}$. DMD matches $q_{\theta,t}$ to $p_t$ by minimizing a time-weighted reverse-KL objective
$$\mathcal{L}_{\mathrm{DMD}}(\theta) := \mathbb{E}_t\big[\omega(t)\, D_{\mathrm{KL}}(q_{\theta,t}\,\|\,p_t)\big],$$
where $\omega(t)$ is a weighting function. A key property is that the $\theta$-gradient reduces to a score-mismatch term:
$$\nabla_\theta \mathcal{L}_{\mathrm{DMD}}(\theta) = \mathbb{E}_{t,z,\varepsilon}\big[\omega(t)\, \alpha_t\, (\partial_\theta f_\theta(z))^\top \big(s_{q_{\theta,t}}(\hat{x}_t, t) - s_{p_t}(\hat{x}_t, t)\big)\big], \qquad \hat{x}_t = \alpha_t f_\theta(z) + \sigma_t \varepsilon. \tag{2}$$
DMD naturally induces a reverse-Fisher-style weighting. Indeed, $\mathcal{L}_{\mathrm{DMD}}$ is based on a reverse KL divergence $D_{\mathrm{KL}}(q_{\theta,t}\|p_t)$, whose gradient is evaluated under the model's noised marginal $q_{\theta,t}$. Equivalently, conditioning on $t$, we may rewrite Equation 2 as an expectation over $\hat{x}_t \sim q_{\theta,t}$, so the score-mismatch signal $s_{q_{\theta,t}}(\hat{x}_t, t) - s_{p_t}(\hat{x}_t, t)$ is weighted by the (noised) model distribution. In practice, DMD obtains $s_{p_t}$ from a pre-trained diffusion teacher and estimates $s_{q_{\theta,t}}$ using an auxiliary "fake-score" model trained on current generator samples.
3 A Fixed-Point Regression Template
In this section, we introduce a fixed-point regression framework with a pre-designed drift field, motivated by drifting models. We then present two realizations of this drift field: kernel-induced mean shift and kernel-induced score mismatch.
3.1 Training Objective of Drifting Model
Let $p := p_{\mathrm{data}}$ denote the data distribution. (Footnote 1: Although we present the construction in the raw data space, the same definitions and arguments extend verbatim to any fixed feature space induced by a pre-trained embedding. Concretely, let $\Psi : \mathcal{X} \to \mathbb{R}^D$ be a (frozen) feature map and define the induced distributions $p_\Psi := \Psi_\# p$ and $q_\Psi := \Psi_\# q$ on $\mathbb{R}^D$. Applying the kernel, mean-shift field, and discrepancy construction to $p_\Psi$ and $q_\Psi$ yields the feature-space counterparts of all quantities above, so every statement holds after replacing $x$ by $\Psi(x)$ and $(p, q)$ by $(p_\Psi, q_\Psi)$.) We consider a pushforward generator $x = f_\theta(\varepsilon)$ with $\varepsilon \sim p_{\mathrm{prior}} := \mathcal{N}(0, I)$, which induces the model distribution $q_\theta := (f_\theta)_\# p_{\mathrm{prior}}$. When the dependence on $\theta$ is clear from context, we write $q$ for brevity. The goal of learning a one-step generator is to choose $f_\theta$ such that $q_\theta \approx p_{\mathrm{data}}$ under a chosen statistical divergence (e.g., the KL divergence or the Fisher divergence). A drifting-style method [3] starts from a one-step transport operator
$$U_{p,q}(x) = x + \Delta_{p,q}(x),$$
where $\Delta_{p,q}(x)$ is a designed drift field (instantiated later) on $\mathbb{R}^D$ that measures the discrepancy between quantities defined by $p$ and $q$, and satisfies the equilibrium condition $\Delta_{p,p}(x) \equiv 0$.
Given $U_{p,q}$, the drifting-model template distills the update via a fixed-point regression objective with a stop-gradient target:
$$\min_\theta\; \mathcal{L}_{\mathrm{drift}}(\theta) := \mathbb{E}_{\varepsilon\sim p_{\mathrm{prior}}}\big[\|f_\theta(\varepsilon) - \mathrm{sg}\big(U_{p,q_\theta}(f_\theta(\varepsilon))\big)\|_2^2\big]. \tag{3}$$
Intuitively, at each iteration the method (i) computes a frozen one-step transported sample $\tilde{x} = U_{p,q_\theta}(x)$ using the current model, and (ii) fits the next generator to regress onto this transported cloud.
Objective-Level Equivalence. For any fixed $\theta$, the stop-gradient makes the target constant for backpropagation, but the value of the loss simplifies:
$$\mathcal{L}_{\mathrm{drift}}(\theta) = \mathbb{E}_\varepsilon\big[\|f_\theta(\varepsilon) - \mathrm{sg}\big(f_\theta(\varepsilon) + \Delta_{p,q_\theta}(f_\theta(\varepsilon))\big)\|_2^2\big] = \mathbb{E}_\varepsilon\big[\|\Delta_{p,q_\theta}(f_\theta(\varepsilon))\|_2^2\big] = \mathbb{E}_{x\sim q_\theta}\big[\|\Delta_{p,q_\theta}(x)\|_2^2\big]. \tag{4}$$
Thus, at the objective level, Equation 3 is exactly the squared-norm functional of the field $\Delta_{p,q_\theta}$ under $q_\theta$. Since the two expressions are equal for every $\theta$, they define the same objective function and thus share the same global minimizers (and optimal value).
Gradient-Level Remark. Although Equation 4 identifies the objective as $\mathbb{E}_{q_\theta}\|\Delta_{p,q_\theta}\|^2$, the stop-gradient solver does not backpropagate through the dependence of $\Delta_{p,q_\theta}$ on $q_\theta$. Differentiating Equation 3 treats the frozen target as constant and yields the semi-gradient
$$\nabla_\theta \mathcal{L}_{\mathrm{drift}}(\theta) = -2\,\mathbb{E}_{\varepsilon\sim p_{\mathrm{prior}}}\big[J_{f_\theta}(\varepsilon)^\top\, \Delta_{p,q_\theta}(x)\big], \qquad x = f_\theta(\varepsilon),$$
where $J_{f_\theta}(\varepsilon)$ is the Jacobian of $f_\theta$ w.r.t. $\theta$. Consequently, the stop-gradient objective Equation 3 implements a transport-then-projection procedure at each iteration $i$:
1. Transport/Particle Step. Using the current distribution $q_{\theta_i}$, compute a drift $\Delta_{p,q_{\theta_i}}$ and move samples by a one-step transport: $x \mapsto \tilde{x} = x + \Delta_{p,q_{\theta_i}}(x)$, with $x = f_{\theta_i}(\varepsilon)$.
2. Projection/Regression Step. Fit the next generator to match this transported sample cloud by regression:
$$\theta_{i+1} \approx \arg\min_\theta\; \mathbb{E}_{\varepsilon\sim p_{\mathrm{prior}}}\big[\|f_\theta(\varepsilon) - \mathrm{sg}(\tilde{x})\|_2^2\big]. \tag{5}$$
Stop-gradient ensures that Equation 5 is a clean projection step: the target is treated as fixed while optimizing $\theta$. In regimes where $\Delta_{p,q_{\theta_i}}$ varies slowly across iterations (e.g., small steps or lagged estimates), this two-step scheme can be viewed as a semi-gradient, fixed-point iteration that targets stationary points of the population functional in Equation 4, while yielding an explicit, implementable update direction.
Identifiability. Even if training reaches a global minimum of the drifting loss, this does not automatically imply that the learned distribution matches the data. Indeed, $\mathcal{L}_{\mathrm{drift}}(\theta) = 0$ only guarantees that the discrepancy field vanishes on model samples (that is, $\Delta_{p,q_\theta}(x) = 0$ for $q_\theta$-almost every $x$). In principle, there can be multiple generators, and thus multiple model distributions, that satisfy this same fixed-point condition. This motivates an identifiability question: is the data distribution the only equilibrium picked out by the discrepancy operator? We say a discrepancy operator is identifiable on a class of distributions if the only way for its discrepancy to vanish is to have $q = p$. A natural criterion is
$$\Delta_{p,q}(x) = 0 \ \text{ for $q$-almost every } x \;\implies\; q = p. \tag{6}$$
If identifiability fails, the fixed-point condition can admit spurious equilibria with $q \ne p$, and driving the drifting loss to zero does not certify true distribution matching. Whether identifiability holds depends on the choice of kernel; we return to this question for Gaussian kernels in Section 4.1 and for general radial kernels in Section 4.2.
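In code, the template reduces to a few lines. The following PyTorch sketch is a minimal, hypothetical rendering of one update, not the authors' implementation: the names `generator` and `drift_field` are ours, and `drift_field` stands in for $\Delta_{p,q_\theta}$ (instantiated by the kernel mean shift of Section 3.2).

```python
import torch

def drifting_step(generator, drift_field, optimizer, batch_size, dim):
    """One transport-then-projection update of the fixed-point template (Eq. 3)."""
    eps = torch.randn(batch_size, dim)        # eps ~ p_prior = N(0, I)
    x = generator(eps)                        # model samples x = f_theta(eps)
    with torch.no_grad():                     # stop-gradient: the target is frozen
        target = x + drift_field(x)           # transported cloud U_{p, q_theta}(x)
    loss = ((x - target) ** 2).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()                           # yields the semi-gradient -2 E[J^T Delta]
    optimizer.step()
    return loss.item()
```

The `torch.no_grad()` block plays the role of $\mathrm{sg}(\cdot)$, so `loss.backward()` computes exactly the semi-gradient $-2\,\mathbb{E}[J_{f_\theta}^\top \Delta_{p,q_\theta}]$ described above.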
3.2 Mean-Shift and Kernel-Induced Score Function
Mean-Shift Operator. Following the drifting-model construction, the update field $\Delta_{p,q}(x)$ is realized via a kernel-induced mean shift. To make this precise, let $k(x, y) \ge 0$ be a similarity kernel and let $\pi$ be any distribution on $\mathbb{R}^D$. Define the kernel-weighted barycenter
$$\mu_{\pi,k}(x) := \frac{\mathbb{E}_{y\sim\pi}[k(x,y)\, y]}{\mathbb{E}_{y\sim\pi}[k(x,y)]},$$
and the corresponding mean-shift direction
$$V_{\pi,k}(x) := \mu_{\pi,k}(x) - x = \frac{\mathbb{E}_{y\sim\pi}[k(x,y)\,(y-x)]}{\mathbb{E}_{y\sim\pi}[k(x,y)]}.$$
Intuitively, $\mu_{\pi,k}(x)$ is a kernel-weighted local average of samples, where points $y$ with larger similarity $k(x,y)$ receive larger weight. Accordingly, $V_{\pi,k}(x) = \mu_{\pi,k}(x) - x$ moves $x$ toward regions that are better supported by $\pi$ in the sense of the kernel, meaning regions containing many points $y$ with large $k(x,y)$. In particular, we write $V_{p,k}(x)$ for the data distribution $p$, and $V_{q,k}(x)$ for the model distribution $q$. The mean-shift-induced discrepancy field is then
$$\Delta_{p,q}(x) := \eta\,\big(V_{p,k}(x) - V_{q,k}(x)\big),$$
where $\eta > 0$ is a step size. This formulation is reminiscent of discriminator-free, kernel-based two-sample methods (e.g., MMD-style objectives [14, 15, 26]): the kernel acts as a fixed, nonparametric critic, and the update uses only samples from $p$ and $q$. We make the relationship to kernel-potential methods precise in Section 4.4.
Kernel-Induced Score Function. A complementary "score-like" direction can be defined directly from the same kernel smoothing. Consider the scalar smoothing field induced by the kernel $k$,
$$\pi_k(x) := \mathbb{E}_{y\sim\pi}[k(x,y)] = \int k(x,y)\, \pi(y)\, dy,$$
if $\pi$ admits a density (still denoted by $\pi$). In the common translation-invariant case $k(x,y) = \kappa(x - y)$ for some real-valued function $\kappa$, kernel smoothing reduces to convolution, namely $\pi_k = \pi * \kappa$. Hereafter, we do not distinguish between $k$ and $\kappa$. We define the kernel-induced score as
$$s_{\pi,k}(x) := \nabla_x \log \pi_k(x).$$
Throughout, we allow $k$ to be unnormalized. Accordingly, $\pi_k(x)$ should be understood as a smoothed mass profile, proportional to the convolution with the corresponding normalized kernel. Because both the score and the mean-shift field are invariant under multiplicative rescaling of the kernel, all score and mean-shift identities remain unchanged by this normalization constant.
The main goal of the remainder of this article is to uncover a principled but hidden connection between the mean-shift discrepancy $V_{p,k}(x) - V_{q,k}(x)$ and the score-mismatch discrepancy $s_{p,k}(x) - s_{q,k}(x)$. In doing so, we show that drifting models are not an isolated heuristic, but are deeply connected to the core mechanism of score-based modeling.
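For concreteness, the following NumPy sketch estimates the mean-shift field and the resulting discrepancy field from finite samples, in the spirit of the kernel-based Monte Carlo procedure used for the figures; the function names and the max-subtraction weight stabilization are our own choices, not the paper's code.

```python
import numpy as np

def mean_shift_field(x, ys, tau, kernel="laplace"):
    """Estimate V_{pi,k}(x) at query points x (n, D) from samples ys ~ pi (m, D)."""
    diff = ys[None, :, :] - x[:, None, :]          # (n, m, D): displacements y - x
    r = np.linalg.norm(diff, axis=-1)              # (n, m): pairwise distances
    logw = -r / tau if kernel == "laplace" else -r**2 / (2 * tau**2)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))   # stabilized kernel weights
    w /= w.sum(axis=1, keepdims=True)              # normalize: softmax-style weights
    return np.einsum("nm,nmd->nd", w, diff)        # kernel-weighted mean of (y - x)

def drift_field(x, data, model_samples, tau, eta=1.0):
    """Mean-shift discrepancy Delta_{p,q}(x) = eta * (V_{p,k}(x) - V_{q,k}(x))."""
    return eta * (mean_shift_field(x, data, tau)
                  - mean_shift_field(x, model_samples, tau))
```

A function like `drift_field` can serve directly as the `drift_field` placeholder in the training sketch of Section 3.1, with `data` a reference batch from $p$ and `model_samples` a batch from the current generator.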
4 Bridging Drift Models and Score-Based Models
This section establishes the core theoretical message of the article: drifting models are fundamentally linked to score-based modeling. In Section 4.1, we show that, for Gaussian kernels, mean shift is exactly a score mismatch up to a constant factor, making Gaussian-kernel drifting a score-matching-style objective and clarifying its close relationship to DMD. In Section 4.2, we extend this perspective to general radial kernels through an exact preconditioned-score decomposition, which reveals the additional terms introduced by non-Gaussian kernel geometry. Building on this result, Section 4.3 shows that the Laplace kernel used in practice still provides a reliable proxy for score mismatch in the low-temperature and high-dimensional regimes. Finally, Section 4.4 contrasts drifting with kernel-GAN updates and highlights that mean-shift normalization yields an essentially more score-like learning signal.
4.1 Gaussian Kernel: Mean-Shift as Smoothed Score
We first specialize the kernel $k$ to the Gaussian kernel with bandwidth $\tau > 0$,
$$k_\tau(x,y) = \exp\!\Big(-\frac{\|x-y\|_2^2}{2\tau^2}\Big).$$
The corresponding smoothing field of any distribution $\pi$ becomes
$$\pi_{k_\tau}(x) = \mathbb{E}_{y\sim\pi}[k_\tau(x,y)] = (\pi * k_\tau)(x) =: \pi_\tau(x),$$
and the kernel-induced score is exactly the score of the Gaussian-smoothed density,
$$s_{\pi,k_\tau}(x) = \nabla_x \log \pi_{k_\tau}(x) = \nabla_x \log \pi_\tau(x) =: s_{\pi,\tau}(x).$$
For the Gaussian kernel $k_\tau$, the mean-shift direction $V_{\pi,k_\tau}$ admits a particularly simple form. In particular, Theorem 1 shows that the drifting objective in Equation 4 can be rewritten as a discrepancy between smoothed scores. Moreover, up to the constant factor $\eta^2\tau^4$, this discrepancy is exactly a score-matching-style objective [10] between the Gaussian-smoothed densities $p_\tau$ and $(q_\theta)_\tau$, taking the form of a reverse Fisher divergence. Therefore, the two objectives differ only by a positive constant factor, and thus have the same set of global minimizers, with minimum values that agree up to scaling.
Theorem 1 (Gaussian-Kernel Drifting Equals Smoothed-Score Matching). Suppose that $k_\tau$ is a Gaussian kernel. Let $p$ and $q$ be distributions on $\mathbb{R}^D$ such that the expectations below are finite, and fix $\tau > 0$. Then for all $x$,
$$V_{\pi,k_\tau}(x) = \tau^2\, s_{\pi,\tau}(x) \qquad \text{for } \pi \in \{p, q\}.$$
Consequently, the drifting discrepancy field induced by $k_\tau$ satisfies
$$\Delta_{p,q}(x) = \eta\tau^2\big(s_{p,\tau}(x) - s_{q,\tau}(x)\big).$$
Now consider a pushforward model $q_\theta = (f_\theta)_\# p_{\mathrm{prior}}$ and the fixed-point regression loss in Equation 4. At the objective-value level, we have
$$\mathcal{L}_{\mathrm{drift}}(\theta) = \eta^2\tau^4\, \mathbb{E}_{x\sim q_\theta}\big[\|s_{p,\tau}(x) - s_{q_\theta,\tau}(x)\|_2^2\big]. \tag{7}$$
Sketch of Proof. The proof follows by differentiating $\pi_\tau(x) = \mathbb{E}_{y\sim\pi}[k_\tau(x,y)]$ under the expectation and rearranging, which yields $V_{\pi,k_\tau}(x) = \mu_{\pi,k_\tau}(x) - x = \tau^2 \nabla_x \log \pi_\tau(x)$. See Appendix A for the detailed derivation. ∎
Theorem 1 reveals a natural link to Tweedie's formula [4], a central identity underlying diffusion denoising. For the Gaussian kernel, the kernel barycenter coincides with the Bayes-optimal denoiser,
$$\mu_{\pi,k_\tau}(x) = \mathbb{E}[X \mid \tilde{X} = x], \qquad \tilde{X} = X + \tau Z, \quad X \sim \pi, \ Z \sim \mathcal{N}(0, I),$$
and the mean-shift direction $V_{\pi,k_\tau}(x)$ is exactly the denoising residual. Tweedie's formula then yields
$$V_{\pi,k_\tau}(x) = \mu_{\pi,k_\tau}(x) - x = \tau^2 \nabla_x \log \pi_\tau(x).$$
Equivalently,
$$\mu_{\pi,k_\tau}(x) = \mathbb{E}[X \mid \tilde{X} = x] \;\implies\; V_{\pi,k_\tau}(x) = \tau^2 \nabla_x \log \pi_\tau(x).$$
Moreover, the mean-shift discrepancy $V_{p,k_\tau} - V_{q,k_\tau}$ equals (up to the factor $\tau^2$) the difference between the score fields of the smoothed densities $p_\tau$ and $q_\tau$. As a result, drifting with a Gaussian kernel can be interpreted as matching smoothed scores: either evaluated on clean model samples (Equation 7), or, more canonically, evaluated on noised samples at noise scale $\tau$, which yields the reverse Fisher divergence between $p_\tau$ and $q_\tau$.
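Theorem 1 holds for any distribution, including an empirical one, so it can be checked numerically in a few lines. In the NumPy sketch below (our sanity check, not a paper experiment), both sides are computed from the same sample set: the mean shift directly, and the smoothed score by finite differences of the log smoothed mass; the mixture, query point, and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
tau, D, n = 0.5, 2, 200_000
# pi: two-mode Gaussian mixture with centers (2, 2) and (-2, -2).
ys = rng.normal(size=(n, D)) + np.where(rng.random((n, 1)) < 0.5, 2.0, -2.0)

def log_smoothed_mass(x):
    """log pi_tau(x) = log E_y exp(-||x-y||^2 / (2 tau^2)), stable log-mean-exp."""
    a = -np.sum((ys - x) ** 2, axis=1) / (2 * tau**2)
    return np.log(np.mean(np.exp(a - a.max()))) + a.max()

def mean_shift(x):
    r2 = np.sum((ys - x) ** 2, axis=1)
    w = np.exp(-(r2 - r2.min()) / (2 * tau**2))    # stabilized Gaussian weights
    return (w / w.sum()) @ (ys - x)

x, h = np.array([0.7, -0.3]), 1e-4
score = np.array([(log_smoothed_mass(x + h * e) - log_smoothed_mass(x - h * e)) / (2 * h)
                  for e in np.eye(D)])
print("mean shift   :", mean_shift(x))
print("tau^2 * score:", tau**2 * score)   # Theorem 1: the two should coincide
```

Up to finite-difference error, the two printed vectors coincide, since the identity is exact for the empirical measure defined by the samples.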
However, because the drifting objective is in reverse Fisher form and averages the score mismatch under $q$ rather than $p$, it only receives learning signal where the current model places mass. If $q$ assigns (near-)zero probability to a mode of $p$, then that region contributes essentially nothing to the gradient, so drifting can be susceptible to mode dropping, especially early in training when coverage is limited.
Connection to Distribution Matching Distillation (DMD). DMD distills a diffusion model into a one-step generator by matching the data and model distributions after adding Gaussian noise at scale $\tau$. A key observation is that the resulting update can be written in terms of a score mismatch $s_{p,\tau}(x) - s_{q,\tau}(x)$, where $s_{p,\tau}$ is provided by a pretrained diffusion teacher and $s_{q,\tau}$ is obtained from a second "fake-score" model trained on current generator samples. Moreover, because DMD is derived from a reverse-KL objective $D_{\mathrm{KL}}(q_{\theta,t}\|p_t)$ at each noise level, its learning signal is weighted by the (noised) model marginal $q_{\theta,t}$, consistent with the reverse-Fisher weighting discussed in Section 2. Our Gaussian-kernel identity in Theorem 1 shows that Gaussian drifting recovers the same score-mismatch direction:
$$\Delta_{p,q}(x) = \eta\big(V_{p,k_\tau}(x) - V_{q,k_\tau}(x)\big) = \eta\tau^2\big(s_{p,\tau}(x) - s_{q,\tau}(x)\big).$$
Beyond the score-mismatch algebra, DMD also fits the transport-then-projection template. At noise level $t$, define the score-mismatch transport field
$$\Delta^{\mathrm{s}}_{p_t, q_{\theta,t}}(x) := \frac{\omega(t)\,\alpha_t}{2}\,\big(s_{p_t}(x, t) - s_{q_{\theta,t}}(x, t)\big),$$
and consider the projection-style loss
$$\mathcal{L}_{\mathrm{ST\text{-}proj}}(\theta) := \mathbb{E}_{t,z,\varepsilon}\Big[\big\|f_\theta(z) - \mathrm{sg}\big(f_\theta(z) + \Delta^{\mathrm{s}}_{p_t, q_{\theta,t}}(\hat{x}_t)\big)\big\|_2^2\Big], \qquad \hat{x}_t = \alpha_t f_\theta(z) + \sigma_t \varepsilon. \tag{8}$$
Differentiating with stop-gradient yields a semi-gradient that matches the DMD gradient direction in Equation 2. Although $\mathcal{L}_{\mathrm{DMD}}$ is KL-based while $\mathcal{L}_{\mathrm{ST\text{-}proj}}$ is a regression surrogate, they share the same fixed-point condition: if the score mismatch vanishes for almost every $t$, then $p_t = q_{\theta,t}$ and hence $p = q_\theta$, using injectivity of the smoothing operator (see Lemma 7). This viewpoint clarifies the link to drifting: both methods fit the same transport-then-projection template, differing only in how the score mismatch is realized. We summarize the connection as follows: Gaussian-kernel drifting is essentially score-based generative modeling, or more precisely, a DMD-like distribution-matching step in a reverse-Fisher (model-weighted) score-mismatch form, except for the score source. Drifting realizes this score signal nonparametrically and teacher-free via Tweedie/mean shift, whereas DMD relies on a pretrained diffusion teacher. In particular, drifting performs distribution matching at a single or a few noise levels using only samples from $p$ and $q$ and a chosen kernel, without training auxiliary diffusion-score networks. The trade-off is complementary to DMD: kernel drifting avoids maintaining a separate estimate of $s_{q,\tau}$ as $q$ evolves, but its effectiveness depends on bandwidth selection and sample density, while DMD scales to high-dimensional data by learning parametric score models at the cost of additional training and potential instability in the synthetic-score estimation.
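Read as code, the projection-style loss in Equation 8 is one detach away from a standard regression step. The PyTorch sketch below is our schematic reading, not DMD's released implementation; `teacher_score` and `fake_score` are assumed callables for $s_{p_t}$ and $s_{q_{\theta,t}}$ (hypothetical interfaces), and the $\tfrac{1}{2}\omega(t)\alpha_t$ scaling is chosen so that the semi-gradient reproduces Equation 2.

```python
import torch

def score_transport_step(f, teacher_score, fake_score, alpha, sigma, omega, t, z):
    """One projection step of Eq. 8 with the scaled score-mismatch transport field."""
    x = f(z)                                        # one-step generator sample
    x_t = alpha * x + sigma * torch.randn_like(x)   # noised synthetic sample hat{x}_t
    with torch.no_grad():                           # stop-gradient on the transport field
        mismatch = teacher_score(x_t, t) - fake_score(x_t, t)  # s_{p_t} - s_{q_theta,t}
        delta = 0.5 * omega * alpha * mismatch      # Delta^s with the omega*alpha/2 scaling
    target = (x + delta).detach()                   # frozen transported target
    return ((x - target) ** 2).sum(dim=1).mean()    # semi-gradient matches Eq. 2
```

Backpropagating this loss gives $-2\,\mathbb{E}[J_{f_\theta}^\top \Delta^{\mathrm{s}}] = \omega(t)\,\alpha_t\,\mathbb{E}[J_{f_\theta}^\top (s_{q_{\theta,t}} - s_{p_t})]$, the DMD update direction.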
Identifiability of the Gaussian Kernel. Using a Gaussian kernel to define the discrepancy $\Delta_{p,q}(x)$ (as in DMD-style objectives) has a clean identifiability story. By Theorem 1, matching the mean-shift directions under a Gaussian kernel is equivalent to matching the corresponding Gaussian-smoothed scores:
$$s_{p,\tau}(x) = s_{q,\tau}(x) \qquad \text{(on the support of } q_\tau\text{)}.$$
Under mild regularity and the usual normalization (both $p_\tau$ and $q_\tau$ integrate to one), equality of scores forces the smoothed densities to coincide: $s_{p,\tau} \equiv s_{q,\tau}$ implies $p_\tau = q_\tau$ (the only ambiguity is a multiplicative constant, removed by normalization). Finally, Gaussian smoothing is injective: if $p_\tau = q_\tau$ for some $\tau > 0$, then necessarily $p = q$. This yields the following identifiability statement.
Proposition 1 (Identifiability of the Gaussian Case (Idealized DMD)). Assume $p_\tau$ and $q_\tau$ are the Gaussian-smoothed versions of $p$ and $q$ for some $\tau > 0$. If $s_{p,\tau}(x) = s_{q,\tau}(x)$ for all $x$, then $p = q$. In particular, any population objective that enforces $p_\tau = q_\tau$ for a single $\tau > 0$ is identifiable.
In words, in the Gaussian case the DMD objective inherited from a diffusion teacher is identifiable under an idealized setting: if the teacher is perfectly trained, then at the population level the objective has a unique intended optimum, namely $q = p$. In practice, this guarantee is only approximate: it depends on the teacher score being sufficiently accurate and on student training being well behaved, in the sense that the model class is expressive enough and optimization reaches the intended optimum rather than an approximate or degenerate solution.
4.2 General Radial Kernels: Mean-Shift as Preconditioned Score
The Gaussian case in Section 4.1 gives an exact bridge: mean shift equals a scaled smoothed score, so drifting becomes a score-matching-style objective. However, drifting models use non-Gaussian kernels (Laplace in particular), where this equivalence is no longer automatic. The goal of this subsection is to isolate exactly what changes when we move beyond Gaussians, and to express mean shift as a score term plus explicit kernel-dependent corrections. To do this, we consider a broad family of translation-invariant radial kernels, which includes both Gaussian and Laplace and covers common choices in drifting:
$$k_\tau(x,y) = \exp\!\Big(-\rho\Big(\frac{\|x-y\|_2}{\tau}\Big)\Big), \qquad \tau > 0, \tag{9}$$
where $\rho : [0,\infty) \to \mathbb{R}$ is differentiable. This family includes, for example,
Gaussian: $\rho(u) = \tfrac{1}{2}u^2 \;\Rightarrow\; k_\tau(x,y) = \exp\!\big(-\tfrac{\|x-y\|_2^2}{2\tau^2}\big)$;
Laplace: $\rho(u) = u \;\Rightarrow\; k_\tau(x,y) = \exp\!\big(-\tfrac{\|x-y\|_2}{\tau}\big)$.
We keep the same two objects from Section 3.2: the mean-shift direction $V_{\pi,k_\tau}$ and the kernel-induced score $s_{\pi,k_\tau} = \nabla \log \pi_{k_\tau}$, along with the associated score mismatch, where $\pi_{k_\tau}(x) = \mathbb{E}_{y\sim\pi}[k_\tau(x,y)]$. The key observation is that, for radial kernels, the score direction is still a local average of displacements, but with an additional radius-dependent reweighting. Indeed, differentiating $\log k_\tau(x,y) = -\rho(\|x-y\|_2/\tau)$ gives
$$\nabla_x \log k_\tau(x,y) = \frac{1}{\tau^2}\, \underbrace{\frac{\rho'(r/\tau)}{r/\tau}}_{=:\, b_\tau(r)}\,(y - x), \qquad r := \|x-y\|_2.$$
Substituting this identity into
$$s_{\pi,k_\tau}(x) = \frac{\mathbb{E}\big[k_\tau(x,y)\, \nabla_x \log k_\tau(x,y)\big]}{\mathbb{E}\big[k_\tau(x,y)\big]}$$
yields the following weighted-score representation:
$$\tau^2\, s_{\pi,k_\tau}(x) = \frac{\mathbb{E}_{y\sim\pi}\big[k_\tau(x,y)\, b_\tau(\|x-y\|_2)\,(y-x)\big]}{\mathbb{E}_{y\sim\pi}\big[k_\tau(x,y)\big]}. \tag{10}$$
This immediately explains why Gaussians are special: for $\rho(u) = \tfrac12 u^2$, we have $\rho'(u) = u$ and hence $b_\tau(r) \equiv 1$, so the extra reweighting disappears and mean shift becomes exactly proportional to the smoothed score (see Theorem 1). For Laplace and other radial kernels, $b_\tau(r)$ is not constant, so mean shift and score differ in a controlled, kernel-dependent way. To make this difference explicit, it is convenient to view both quantities under the same local kernel-reweighted law,
$$\pi_\tau(y \mid x) := \frac{k_\tau(x,y)\, \pi(y)}{\mathbb{E}_{y\sim\pi}[k_\tau(x,y)]}.$$
With this notation, Equation 10 becomes
$$\tau^2\, s_{\pi,k_\tau}(x) = \mathbb{E}_{y\sim\pi_\tau(\cdot\mid x)}\big[b_\tau(\|x-y\|_2)\,(y-x)\big],$$
which puts the score in the same "local displacement" form as mean shift and sets up an exact decomposition.
Theorem 2 (Preconditioned-Score Decomposition of Mean Shift for General Radial Kernels). Fix $\tau > 0$ and let $k_\tau$ be a radial kernel defined by Equation 9. Assume the expectations below are finite, and $b_\tau(r) > 0$ on the support of $\pi_\tau(\cdot\mid x)$. Then for any $x \in \mathbb{R}^D$,
$$V_{\pi,k_\tau}(x) = \tau^2\, \alpha_{\pi,\tau}(x)\, s_{\pi,k_\tau}(x) + \delta_{\pi,\tau}(x).$$
Here, for each $x$,
$$\alpha_{\pi,\tau}(x) := \mathbb{E}_{y\sim\pi_\tau(\cdot\mid x)}\big[b_\tau^{-1}(\|x-y\|_2)\big], \qquad \delta_{\pi,\tau}(x) := \mathrm{Cov}_{y\sim\pi_\tau(\cdot\mid x)}\big(b_\tau^{-1}(\|x-y\|_2),\; b_\tau(\|x-y\|_2)\,(y-x)\big) \in \mathbb{R}^D.$$
Consequently, for $\Delta_{p,q}(x) = \eta\,(V_{p,k_\tau}(x) - V_{q,k_\tau}(x))$,
$$\Delta_{p,q}(x) = \eta\tau^2\big(\alpha_{p,\tau}(x)\, s_{p,k_\tau}(x) - \alpha_{q,\tau}(x)\, s_{q,k_\tau}(x)\big) + \eta\big(\delta_{p,\tau}(x) - \delta_{q,\tau}(x)\big).$$
Sketch of proof. Fix $x$ and let $y \sim \pi_\tau(\cdot\mid x)$. The claim follows by applying $\mathbb{E}[AB] = \mathbb{E}[A]\,\mathbb{E}[B] + \mathrm{Cov}(A, B)$ with $A(y) := b_\tau^{-1}(\|x-y\|_2)$ and $B(y) := b_\tau(\|x-y\|_2)\,(y-x)$. See Appendix A for the full derivation. ∎
We remark that $\mathrm{Cov}(\cdot,\cdot)$ above is a scalar-vector covariance. The decomposition in Theorem 2 clarifies how mean shift relates to the kernel-induced score. Both $V_{\pi,k_\tau}(x)$ and $s_{\pi,k_\tau}(x)$ summarize local displacements $(y - x)$ from neighbors sampled under the kernel-reweighted law $\pi_\tau(y\mid x) \propto k_\tau(x,y)\,\pi(y)$. For a general radial kernel, the difference has two parts: a scalar preconditioning factor $\alpha_{\pi,\tau}(x)$ that rescales the score contribution, and a residual $\delta_{\pi,\tau}(x)$ that captures the remaining effect of distance-dependent reweighting through $b_\tau$. In the Gaussian case $b_\tau \equiv 1$, so $\alpha_{\pi,\tau} \equiv 1$ and $\delta_{\pi,\tau} \equiv 0$, recovering exact proportionality.
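The two new objects are easy to estimate from samples via the covariance identity in the proof sketch. The NumPy sketch below does this for the Laplace kernel, where $b_\tau(r) = \tau/r$; the choice $\pi = \mathcal{N}(0, I)$, the query point, and the bandwidth exponent $a = 0.5$ are illustrative assumptions of ours, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def laplace_decomposition(D, n=100_000, tau_bar=1.0, a=0.5):
    """Estimate alpha_{pi,tau}(x) and delta_{pi,tau}(x) (Theorem 2) for Laplace."""
    tau = tau_bar * D**a                         # dimension-aware bandwidth
    ys = rng.normal(size=(n, D))                 # y ~ pi = N(0, I)
    x = np.full(D, 0.3)                          # fixed query point
    diff = ys - x
    r = np.linalg.norm(diff, axis=1)
    w = np.exp(-(r - r.min()) / tau)             # Laplace weights, stabilized
    w /= w.sum()                                 # y ~ pi_tau(.|x)
    Z = (tau / r)[:, None] * diff                # Z = b_tau(r) (y - x) = tau * u
    V = w @ diff                                 # mean shift: E[(y - x)] = E[A Z]
    tau2_s = w @ Z                               # Eq. 10: tau^2 * s_{pi,k_tau}(x)
    alpha = w @ (r / tau)                        # alpha = E[b_tau^{-1}(r)] = E[r]/tau
    delta = V - alpha * tau2_s                   # Cov(A, Z) = E[A Z] - E[A] E[Z]
    return alpha, np.linalg.norm(delta) / np.linalg.norm(V)

for D in (2, 8, 32, 128):
    alpha, rel = laplace_decomposition(D)
    print(f"D={D:4d}  alpha={alpha:8.3f}  |delta| / |V| = {rel:.3e}")
```

Printing $\|\delta\|/\|V\|$ across $D$ gives a feel for how strongly the residual can perturb the score direction; the concentration argument developed next suggests it shrinks as the kernel-weighted radii concentrate.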
A useful geometric view comes from rewriting the residual in terms of radius and direction. With $r = \|x-y\|_2$ and $y \sim \pi_\tau(\cdot\mid x)$, set
$$A := b_\tau^{-1}(r) \in \mathbb{R}, \qquad Z := b_\tau(r)\,(y - x) \in \mathbb{R}^D.$$
Then Equation 10 becomes $\tau^2\, s_{\pi,k_\tau}(x) = \mathbb{E}[Z]$, and the residual can be written as
$$\delta_{\pi,\tau}(x) = \mathrm{Cov}(A, Z) = \mathrm{Cov}\big(A(r),\; \mathbb{E}[Z \mid r]\big).$$
This shows that $\delta_{\pi,\tau}(x)$ arises only when the average (preconditioned) displacement $\mathbb{E}[Z \mid r]$ changes systematically with radius. To isolate the part that actually changes the direction relative to the score, assume $s := s_{\pi,k_\tau}(x) \ne 0$ and let $\hat{s} := s / \|s\|_2$. Define the orthogonal projector onto the subspace perpendicular to the score direction,
$$P_\perp := I - \hat{s}\hat{s}^\top,$$
so that $P_\perp v$ removes the component of $v$ along $s$. We define the off-score residual by $\delta_\perp(x) := P_\perp\, \delta_{\pi,\tau}(x)$.
Using $P_\perp \mathbb{E}[Z] = 0$, one obtains the exact identity
$$\delta_\perp(x) = \mathrm{Cov}\big(A(r),\; \mathbb{E}[P_\perp Z \mid r]\big),$$
so mean shift deviates from the score direction only if the conditional mean displacement develops a systematic tangential component across radii. Equivalently, the score-parallel component of $\delta_{\pi,\tau}$ only adjusts the effective step size along the score.
For the Laplace kernel, $b_\tau(r) = \tau / r$, hence $A = r / \tau$. Let
$$u := \frac{y - x}{\|y - x\|_2}, \qquad \text{so that} \quad y - x = r\,u.$$
Then
$$Z = b_\tau(r)\,(y - x) = \frac{\tau}{r}\,(r u) = \tau\,u,$$
so the preconditioned displacement keeps only the direction $u$ and discards the radius. Here the randomness is over $y \sim \pi_\tau(\cdot\mid x)$ (with $x$ fixed), so both $r = \|y - x\|_2$ and $u$ are random. Using the scalar-vector covariance definition, the off-score residual becomes
$$\delta_\perp(x) = \mathrm{Cov}_{y\sim\pi_\tau(\cdot\mid x)}(r,\; u_\perp), \qquad u_\perp := P_\perp u.$$
Moreover, since $P_\perp \mathbb{E}[Z] = 0$ and $Z = \tau u$, we have $\mathbb{E}[u_\perp] = 0$, and hence
$$\delta_\perp(x) = \mathbb{E}_{y\sim\pi_\tau(\cdot\mid x)}[r\, u_\perp].$$
This form makes the geometry transparent (which we also illustrate with simple scenarios in Figure 2): $\delta_\perp(x)$ is a radius-weighted average of the tangential direction. Hence Laplace mean shift develops an off-score component only when radius and tangential direction are correlated under $y \sim \pi_\tau(\cdot\mid x)$, i.e., when $\delta_\perp(x) = \mathrm{Cov}(r, u_\perp) = \mathbb{E}[r\, u_\perp] \ne 0$; equivalently, when $\mathbb{E}[u_\perp \mid r]$ does not cancel across different $r$. Intuitively, $\delta_\perp(x) \ne 0$ means that different distance bands contribute different average tangential directions, so the radius-weighted average $\mathbb{E}[r\, u_\perp]$ does not cancel out; for instance, if farther neighbors tend to have a different average tangential direction than nearer ones, their contribution can dominate because it is amplified by the factor $r$. If $\mathbb{E}[u_\perp \mid r] \approx 0$ for all $r$ (no tangential bias at any radius), or if $r$ concentrates so that it is effectively constant, then $\mathbb{E}[r\, u_\perp]$ is small and mean shift stays nearly aligned with the score. In high dimension, the kernel-reweighted radii typically concentrate, making $A(r) = r/\tau$ nearly constant; this suppresses $\delta_\perp(x)$ and yields the decay bounds used later (cf. Theorems 4 and 5).
Figure 2: Illustration of $\delta_\perp(x)$ in three illustrative examples. (a) $\delta_\perp$ is large because most of the mass of $\pi_\tau(\cdot\mid x)$ lies far away (large $r$) and in directions perpendicular to $s$; (b) $\delta_\perp \approx 0$ because most of the mass of $\pi_\tau(\cdot\mid x)$ lies closer (small $r$) and in directions parallel to $s$; (c) $\delta_\perp \approx 0$ because the contributions from different directions nearly cancel out.
Identifiability of the General Radial Kernel. For the mean-shift-induced discrepancy field $\Delta_{p,q}(x) = \eta\,(V_{p,k}(x) - V_{q,k}(x))$, a natural identifiability question is whether driving the field to zero forces true distribution matching, namely, $\Delta_{p,q} \equiv 0 \implies p = q$. From the decomposition in Theorem 2, the Gaussian kernel is special: since $b(r) \equiv 1$, we have $\alpha_{\pi,\tau} \equiv 1$ and $\delta_{\pi,\tau} \equiv 0$. In this case, mean shift is exactly proportional to the kernel-smoothed score, so $\Delta_{p,q} \equiv 0$ reduces to score equality $\nabla \log p_\tau \equiv \nabla \log q_\tau$, which implies $p_\tau = q_\tau$ (and then $p = q$ if the smoothing operator is injective). For a general radial kernel, however, Theorem 2 exposes extra degrees of freedom that can cancel the score mismatch. Concretely,
$$\Delta_{p,q}(x) = 0 \;\implies\; \tau^2\big(\alpha_{p,\tau}(x)\, s_{p,k_\tau}(x) - \alpha_{q,\tau}(x)\, s_{q,k_\tau}(x)\big) + \big(\delta_{p,\tau}(x) - \delta_{q,\tau}(x)\big) = 0, \tag{11}$$
so $\Delta_{p,q} \equiv 0$ only constrains a sum of a scaled score difference and a residual $\delta$ capturing distance-direction coupling. Unless one can separately control the scalar preconditioners $\alpha_{\pi,\tau}$ and the residuals $\delta_{\pi,\tau}$, Equation 11 does not force $\nabla \log p_\tau = \nabla \log q_\tau$, and therefore does not by itself certify $p_\tau = q_\tau$. In particular, identifiability is generally $p$-dependent: the functions $\alpha_{p,\tau}$ and $\delta_{p,\tau}$ are determined by the kernel-reweighted local neighborhood law induced by $p$, and different targets can yield different patterns of (preconditioner, residual) cancellation.
As an important example, for Laplace-type kernels commonly used in drifting-model implementations, the induced preconditioning is generally not constant. In this case, both $\alpha_{\pi,\tau}(x)$ and the residual $\delta_{\pi,\tau}(x)$ depend on the local kernel neighborhood and can vary with the underlying distribution $\pi$. As a result, even at the population level, the equilibrium condition $\Delta_{p,q}(x) \equiv 0$ does not necessarily imply score equality. Instead, it can be satisfied by a cancellation between a scaled score mismatch and the residual term. Therefore, identifiability is not automatic and may require additional structural assumptions.
This concern can be amplified in practice because the drifting field is never evaluated at the population level. Implementations use finite-sample, mini-batch estimates with normalized (softmax) kernel weights, so the drift becomes a ratio-type Monte Carlo estimator, which can have non-negligible variance and bias. Empirically, the method often benefits from using more reference samples, consistent with this estimation sensitivity. Moreover, many pipelines apply batch-dependent normalizations and heuristics, such as the balance between positive and negative sets, the choice and quality of the feature map, and tuning or aggregating across multiple temperatures [5]. These choices effectively reshape the kernel and hence the induced transport field. Finally, when the kernel is computed in a learned feature space, near-flat feature distances can make the softmax weights close to uniform, yielding a small drift magnitude even when $p$ and $q$ remain mismatched. Taken together, these effects imply that an observed near-equilibrium in practice does not uniquely certify $q \approx p$, and they can make training behavior sensitive to implementation choices that change the effective kernel and the resulting field.
In summary, Gaussian kernels lead to an identifiable score-based discrepancy. For Laplace-type kernels, identifiability is not automatic in general, because $\Delta_{p,q}$ can vanish through cancellation with the residual term even when the underlying scores differ.
4.3 Laplace Kernel: Mean-Shift as a Proxy of Score Mismatch
The Gaussian case provides a clean reference point. With a Gaussian kernel, the mean-shift direction is exactly proportional to the kernel-induced score, so minimizing the drifting objective agrees with minimizing a score-matching objective at the population optimum (see Theorem 1). In practice, however, drifting models typically use the Laplace kernel. In that setting, the mean-shift direction is generally not a purely scaled score (see Theorem 2), so it is not obvious whether optimizing the mean-shift loss still yields a model distribution whose kernel-smoothed score matches that of the data. In the next two subsections, we analyze two complementary regimes: (i) the low-temperature (small $\tau$) regime and (ii) the high-dimensional (large $D$) regime.
In both cases, we show that the drifting objective behaves like score matching up to an explicit error that decays in $\tau$ or in $D$. Before turning to these results, we first formalize the problem setup. Formally, let $p$ denote the data distribution on $\mathbb{R}^D$ (the same ambient space where the kernel is applied). We consider generators $f, g : \mathbb{R}^m \to \mathbb{R}^D$ and their induced model distributions
$$q_f := f_\# \mathcal{N}(0, I), \qquad q_g := g_\# \mathcal{N}(0, I).$$
Throughout the discussion, we use a dimension-aware bandwidth
$$\tau := \bar{\tau}\, D^a, \qquad \bar{\tau} > 0, \quad a \ge 0,$$
which follows common practice in high-dimensional kernel methods and in drifting-model implementations, and keeps the kernel weights well-conditioned as $D$ grows. Given the Laplace kernel $k_\tau(x,y) = \exp(-\|x-y\|_2/\tau)$, the mean-shift direction under a distribution $\pi$ is $V_{\pi,k_\tau}(x)$ as introduced in Section 3.2. The drifting model compares mean-shift directions under $p$ and under the model distribution:
$$V_f(x) := V_{p,k_\tau}(x) - V_{q_f,k_\tau}(x), \qquad \mathcal{L}_{\mathrm{drift}}(f) := \mathbb{E}_{x\sim q_f}\|V_f(x)\|_2^2.$$
In parallel, we compare kernel-smoothed scores using the kernel-induced score $s_{\pi,\tau}(x) = \nabla_x \log \pi_{k_\tau}(x)$, where $\pi_{k_\tau}(x) = \mathbb{E}_{y\sim\pi}[k_\tau(x,y)]$.
4.3.1 Low-Temperature Regime
First, we show that in the low-temperature regime (small $\tau$) with fixed dimension $D$, the population minimizer of the drifting model induces a distribution close to the data's, as in score matching. When $\tau$ is small, the Laplace kernel $k_\tau(x,y) = \exp(-\|x-y\|_2/\tau)$ concentrates its mass near $x$, so the mean-shift direction $V_{\pi,k_\tau}(x)$ becomes a purely local statistic of the density $\pi$. A Taylor expansion then reveals that this local displacement is proportional to the kernel-smoothed score $s_{\pi,\tau}(x)$, up to higher-order terms. The only remaining issue is that mean shift is defined through a ratio. Specifically, we write $V_{\pi,k_\tau}(x) = B_\tau(x)/A_\tau(x)$, where
$$A_\tau(x) := \int_{\mathbb{R}^D} e^{-\|x-y\|_2/\tau}\, \pi(y)\, dy, \qquad B_\tau(x) := \int_{\mathbb{R}^D} e^{-\|x-y\|_2/\tau}\,(y - x)\, \pi(y)\, dy.$$
Thus we need to ensure that the denominator $A_\tau(x)$ does not become too small and that the Taylor remainder stays well behaved after dividing by $A_\tau(x)$. Assumption 2 provides exactly this by requiring moment control of the local smoothness of $\log \pi$ and of how much $\pi(x)$ can differ from nearby values. Under this mild regularity, the theorem below states that any population drifting minimizer induces a distribution whose kernel-smoothed score matches that of the data up to $O(\tau^2)$ pointwise, and hence the scale-$\tau$ Fisher divergence decays as $O(\tau^4)$. In short, we have the following:
Theorem 3 ((Informal) Small-$\bar{\tau}$ Agreement between Mean Shift and Score Matching). Suppose that $k_\tau$ is a Laplace kernel. Let $\tau_0 > 0$. For each $\tau \in (0, \tau_0]$, pick any population minimizers
$$f^\star(\tau) \in \arg\min_f \mathcal{L}_{\mathrm{drift}}(f), \qquad g^\star(\tau) \in \arg\min_g \mathcal{L}_{\mathrm{SM}}(g).$$
Assume that, uniformly over such drifting-model minimizers, the local derivatives and local density ratios of $p$ and $q_{f^\star(\tau)}$ admit integrable (moment) control. Then, for small $\tau$,
$$D_{\mathrm{rF}}\big(q_{g^\star(\tau)}\,\|\,q_{f^\star(\tau)}\big) = D_{\mathrm{rF}}\big(p\,\|\,q_{f^\star(\tau)}\big) = O(\tau^4).$$
The hidden constant is independent of $\bar{\tau}$ and of the learnable parameters. We refer to Appendix B for a rigorous statement and the full proof.
To obtain similar bounds in other distributional divergences (e.g., KL or total variation) from a reverse-Fisher guarantee, one needs an additional regularity condition on the reference distribution. A standard sufficient assumption is that the reference distribution $\pi$ satisfies a log-Sobolev inequality; we do not pursue these implications further here.
4.3.2 High-Dimensional Regime
Figure 3: 2D visualization: the drifting field is nearly parallel to the score mismatch. The mean-shift drifting field is nearly direction-aligned with the score-mismatch field; both are estimated from finite samples using the same kernel-based Monte Carlo procedure. Here, $p$ and $q$ are constructed as in Figure 1. As $D$ increases, the alignment improves and the error gap decays as $1/D$, consistent with Theorems 4 and 5.
We next analyze the high-dimensional regime with fixed $\bar{\tau} > 0$, which is relevant both in raw data space and, more practically, in a pre-trained embedding space where the feature dimension can easily reach the order of $10^3$. In this regime, we make the connection between the drifting model and score matching concrete through three complementary results: (i) objective-level alignment (drifting as an approximate surrogate for the scale-$\tau$ Fisher divergence), (ii) semi-gradient alignment (the implemented stop-gradient update direction matches a score-transport regression update), and (iii) minimizer alignment (the drifting optimum is close to the score-matching/data distribution in smoothed Fisher divergence). Together, these provide both an optimization-level and an estimator-level justification for viewing drifting as "more or less" score matching in high dimension. We refer to Appendix C for rigorous statements and full proofs of the high-$D$ results; here we present a conceptual (but still rigorous) version instead.
Vector Field/Objective Alignment. We work in the large-$D$ regime and write $C = C(D^a) > 0$ for the known scaling factor. Under high-dimensional regularity conditions that commonly hold in raw feature spaces or normalized embedding spaces (Assumptions 4-7), sample norms concentrate around a shared scale and inner products between independent samples remain small on average. Our first result is an objective-level alignment: the drifting field is well approximated by a scaled score-mismatch field,
$$V_f(x) \approx C\, \Delta s_f(x), \qquad \Delta s_f(x) := s_{p,\tau}(x) - s_{q_f,\tau}(x),$$
with mean-square error of order $O(1/D)$.
Theorem 4 ((Informal) Large-$D$ Field Alignment at $1/D$ Rate). Suppose that $k_\tau$ is a Laplace kernel. Assume the distributions under consideration concentrate on a common-radius shell, and their inner products and moments are suitably controlled. Let $C = C(D^a) > 0$ denote the known scaling factor. Then for all sufficiently large $D$ and all $f$,
$$\mathbb{E}_{x\sim q_f}\big\|V_f(x) - C\, \Delta s_f(x)\big\|_2^2 = O(D^{-1}).$$
This theorem implies that, up to a vanishing $O(1/D)$ error, minimizing the drifting objective is equivalent to minimizing a scaled Fisher divergence based on kernel-smoothed scores. (In particular, when $a = 0$, the scaling $C$ is constant.)
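The field-level alignment can be probed directly, since for the Laplace kernel both $V_{\pi,k_\tau}$ and $\tau^2 s_{\pi,k_\tau}$ are kernel-weighted averages (of $y - x$ and of $\tau u$, respectively, by Equation 10). The following NumPy sketch measures the cosine between the drifting field and the score-mismatch field on a toy Gaussian pair; the mismatch construction and the bandwidth exponent are our choices, and Theorem 4 predicts the cosine should approach 1 as $D$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def laplace_fields(x, ys, tau):
    """Laplace mean shift V_{pi,k_tau}(x) and tau^2 s_{pi,k_tau}(x) from samples."""
    diff = ys - x
    r = np.linalg.norm(diff, axis=1)
    w = np.exp(-(r - r.min()) / tau)
    w /= w.sum()
    return w @ diff, tau * (w @ (diff / r[:, None]))   # V and tau^2 * score (Eq. 10)

def cosine_alignment(D, n=50_000, tau_bar=1.0, a=0.5):
    tau = tau_bar * D**a
    shift = 0.5 / np.sqrt(D)                 # toy per-coordinate model mismatch
    p = rng.normal(size=(n, D))              # data samples
    q = rng.normal(size=(n, D)) + shift      # model samples
    x = rng.normal(size=D) + shift           # evaluation point ~ model law
    Vp, Sp = laplace_fields(x, p, tau)
    Vq, Sq = laplace_fields(x, q, tau)
    v, s = Vp - Vq, Sp - Sq                  # drifting field vs score-mismatch field
    return v @ s / (np.linalg.norm(v) * np.linalg.norm(s))

for D in (4, 16, 64, 256):
    print(f"D={D:4d}  cos(V_f, scaled score mismatch) = {cosine_alignment(D):.4f}")
```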
Algorithmic Gradient Alignment. The previous theorem compares two fields ($V_f$ and $\Delta s_f$) evaluated under the model-induced law $q_f$. To connect this geometric picture to training dynamics, we now instantiate the model by a parameterized generator $f_\theta$ and write $q_\theta := (f_\theta)_\# p_{\mathrm{prior}}$. As introduced above, the drifting model's fixed-point loss with stop-gradient implements a transport-then-projection step: it transports each sample $x = f_\theta(\varepsilon)$ by a discrepancy field and then regresses back to the generator family via an $L^2$ projection. As a result, the implemented update direction is
$$g_{\mathrm{drift}}(\theta) := \nabla_\theta \mathcal{L}_{\mathrm{drift}}(\theta) = -2\, \mathbb{E}_\varepsilon\big[J_{f_\theta}(\varepsilon)^\top\, \Delta_{p,q_\theta}\big(f_\theta(\varepsilon)\big)\big], \qquad \Delta_{p,q_\theta}(x) := \eta\big(V_{p,k_\tau}(x) - V_{q_\theta,k_\tau}(x)\big),$$
where $\eta > 0$ is the transport step size in sample space and $J_{f_\theta}$ is the Jacobian w.r.t. $\theta$. To compare this implemented update to a score-driven direction without differentiating the score-matching objective, we use a transport-then-projection (e.g., DMD-style) comparator obtained by replacing the drift transport field with a scaled score mismatch (see Equations 13 and 8 for the fixed-point loss induced by this score-transport field):
$$g_{\mathrm{ST}}(\theta) := -2\, \mathbb{E}_\varepsilon\big[J_{f_\theta}(\varepsilon)^\top\, \Delta^{\mathrm{s}}_{p,q_\theta}\big(f_\theta(\varepsilon)\big)\big], \qquad \Delta^{\mathrm{s}}_{p,q_\theta}(x) := \eta\, C\, \big(s_{p,\tau}(x) - s_{q_\theta,\tau}(x)\big).$$
This has the familiar "Jacobian$^\top \times$ (score mismatch)" structure used in DMD-like updates (see Equation 2). In the following informal theorem, we establish an algorithm-level relationship: the stop-gradient update used by drifting becomes asymptotically indistinguishable from a score-transport update driven by the smoothed score mismatch.
Theorem 5 ((Informal) Large-$D$ Gradient Alignment). Suppose that $k_\tau$ is a Laplace kernel. Assume the conditions of Theorem 4 and a uniform second-moment bound on $J_{f_\theta}$. Then for large enough $D$ and all $\theta$,
$$\big\|g_{\mathrm{drift}}(\theta) - g_{\mathrm{ST}}(\theta)\big\|_2 = O(D^{-1/2}).$$
Moreover, assuming that at least one of the two update directions has non-vanishing norm, we have
$$\cos\angle\big(g_{\mathrm{drift}}(\theta),\, g_{\mathrm{ST}}(\theta)\big) = 1 - O(D^{-1}).$$
In particular, the hidden constant does not depend on learnable parameters. Figure 3 illustrates this approximate alignment on a 2D synthetic dataset. Even at the low dimension $D = 2$, the Laplace-kernel mean-shift direction is already well aligned with the score-mismatch direction. In Section 5.1, we further confirm empirically that this alignment improves markedly as $D$ increases.
Theorem 5 also helps explain why drifting may behave like score matching in practice. In high dimensions, the two implemented update directions are close as vectors: their difference is small in norm. As a result, if one update is near zero (i.e., the algorithm is close to a stationary point), then the other must also be near zero up to the same tolerance. When the updates are not tiny, the cosine-similarity bound strengthens this into a geometric statement: the two updates point in nearly the same direction, so they tend to move the generator along essentially the same optimization path, differing mainly by a rescaling of the step size.
Minimizer Alignment. Finally, we return to population optima in nonparametric form and study how close the minimizers are at the level of the induced densities. To quantify proximity between a model distribution and the data through their kernel-smoothed scores, we use the scale-$\tau$ reverse Fisher divergence
$$D_{\mathrm{rF}}(p\|q) := \mathbb{E}_{x\sim q}\big\|s_{p,\tau}(x) - s_{q,\tau}(x)\big\|_2^2.$$
For a generator $f$, this becomes
$$D_{\mathrm{rF}}(p\|q_f) = \mathbb{E}_{x\sim q_f}\big\|s_{p,\tau}(x) - s_{q_f,\tau}(x)\big\|_2^2,$$
which directly measures the squared mismatch between the smoothed data score and the smoothed model score under the model law.
Under the same high-dimensional regularity conditions as before, any population minimizer of the Laplace drifting objective induces a model distribution whose kernel-smoothed score is close to that of the data, with an error that decays polynomially in $D$. Although the Laplace mean-shift objective is not explicitly formulated as score matching, its minimizer nevertheless achieves vanishing reverse Fisher discrepancy at rate $O(D^{-(1+2a)})$. We summarize this high-dimensional agreement result below.
Theorem 6 ((Informal) High-Dimensional Agreement between Mean Shift and Score Matching). Suppose that $k_\tau$ is a Laplace kernel. Assume that the distributions we consider concentrate on a common-radius shell and have controlled inner products and moments. Let
$$f^\star \in \arg\min_f \mathcal{L}_{\mathrm{drift}}(f), \qquad g^\star \in \arg\min_g \mathcal{L}_{\mathrm{SM}}(g).$$
Then for large $D$,
$$D_{\mathrm{rF}}\big(p\,\|\,q_{f^\star}\big) = D_{\mathrm{rF}}\big(q_{g^\star}\,\|\,q_{f^\star}\big) = O\big(D^{-(1+2a)}\big).$$
In particular, the hidden constant in $O(\cdot)$ is independent of $D$ and of the learnable parameters used to parametrize $f$ and $g$.
Although we derive the minimizer-level alignment analogously in the small-temperature regime (Theorem 3), one could in principle establish the same three components there as well: objective-level equivalence, (semi-)gradient alignment, and minimizer agreement. In practice, however, the small-$\tau$ analysis requires additional $\tau$-uniform local smoothness and non-degeneracy conditions. These are needed to control Taylor remainders and denominators arising in ratio expansions, but they make the presentation substantially more technical. To keep the main narrative focused and readable, we therefore emphasize the high-dimensional results in this article.
4.4 What About GANs?
The drifting-model discrepancy field $\Delta_{p,q}$ has an attractive-repulsive structure: it pulls model samples toward regions supported by the data distribution $p$ while pushing them away from regions heavily supported by the model distribution $q$. This is conceptually close to two-sample, kernel-based views of GANs [7]. The goal of this subsection is to highlight a simple but important distinction. Coulomb GANs [26] (and, more broadly, the MMD-style kernel discrepancies [14, 15, 26] discussed in [3]) derive update directions by differentiating a global objective that compares $p$ and $q$ through kernel interactions. Drifting models use the same two-sample ingredients, but introduce a local normalization via mean shift that turns the kernel signal into a score-like (log-gradient) update, thereby connecting more naturally to (kernel-smoothed) score matching.
Coulomb GANs: A Potential Field Induced by a Global Discrepancy. Coulomb GANs begin with the signed density difference $p - q$ and define a kernel potential
$$\Phi_{p,q}(x) := \mathbb{E}_{y\sim p}[k(x,y)] - \mathbb{E}_{y\sim q}[k(x,y)],$$
which appears in empirical (mini-batch) form as Equation (8) in [26]. Generator samples are transported along the induced force field (the direction followed by generator particles under the Coulomb-GAN charge convention), given by the gradient of the potential:
$$F_{p,q}(x) := \nabla_x \Phi_{p,q}(x) = \mathbb{E}_{y\sim p}[\nabla_x k(x,y)] - \mathbb{E}_{y\sim q}[\nabla_x k(x,y)].$$
This matches the electric-field expression in Equation (34) of [26] up to a sign convention (generator particles are treated as negative charges). For radial kernels, $\nabla_x k(x,y)$ is colinear with $(y - x)$, so $F_{p,q}(x)$ takes the familiar "attract data, repel model" form of a weighted displacement difference.
The key point is that this force field is not arbitrary: it is induced by an interaction energy. Coulomb GANs define
$$\mathcal{E}(p,q) := \frac{1}{2}\int \big(p(x) - q(x)\big)\, \Phi_{p,q}(x)\, dx,$$
and update particles by following the potential gradient $F_{p,q} = \nabla_x \Phi_{p,q}$. Heuristically, transporting model mass in the direction of $F_{p,q}$ decreases $\mathcal{E}$. The energy is minimized when $p \equiv q$. In this sense, the Coulomb-GAN update is driven by a global discrepancy: the learning signal is determined by how $p$ and $q$ interact through the kernel across the whole space. To relate this to the drifting model, it is helpful to isolate the kernel-smoothed profile
$$\pi_k(x) := \mathbb{E}_{y\sim\pi}[k(x,y)], \qquad \pi \in \{p, q\}.$$
Intuitively, $\pi_k(x)$ summarizes how much probability mass of $\pi$ lies near $x$ in the sense of the kernel. Note that $\pi_k$ is generally not a normalized density; it is a kernel-smoothed mass profile (or "density proxy") whose overall scale depends on the normalization of $k$. With this notation,
$$\Phi_{p,q}(x) = p_k(x) - q_k(x), \qquad F_{p,q}(x) = \nabla_x p_k(x) - \nabla_x q_k(x),$$
so Coulomb-GAN transport is driven by gradients of these kernel-smoothed mass profiles.
Drifting Models: Same Ingredients but Locally Normalized. Drifting uses the same attraction-repulsion ingredients from $p$ and $q$, but replaces an unnormalized force by a locally normalized mean-shift direction:
$$\Delta_{p,q}(x) = \eta\big(V_{p,k}(x) - V_{q,k}(x)\big), \qquad V_{\pi,k}(x) = \frac{\mathbb{E}_{y\sim\pi}[k(x,y)\,(y-x)]}{\mathbb{E}_{y\sim\pi}[k(x,y)]} \quad (\pi \in \{p, q\}).$$
The numerator is a kernel-weighted displacement, while the denominator $\pi_k(x) := \mathbb{E}_{y\sim\pi}[k(x,y)]$ is the local kernel mass. This normalization automatically rescales the step by local neighborhood density, while keeping the same qualitative behavior of "attract $p$, repel $q$". In dense regions $\pi_k(x)$ is large, so the update is tempered; in sparse regions it is amplified, yielding a meaningful pull toward nearby neighbors even when few samples lie in the kernel neighborhood. This is a score-like property: the update follows a direction determined by local geometry rather than a force whose magnitude is proportional to local mass. Score matching is formulated in terms of log-gradients. The kernel-induced score is
$$s_{\pi,k}(x) := \nabla_x \log \pi_k(x) = \frac{\nabla_x \pi_k(x)}{\pi_k(x)}.$$
This makes the contrast explicit. Kernel-based potential methods (e.g., Coulomb-style updates) act on the unnormalized gradient $\nabla_x \pi_k(x)$, which scales with local mass. Drifting, through the mean-shift normalization, naturally aligns with the normalized log-gradient $\nabla_x \log \pi_k(x)$, namely the score of the kernel-smoothed profile. As shown in Section 4, this connection is exact for Gaussian kernels, where the drifting discrepancy reduces to a kernel-smoothed score mismatch.
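The normalization difference is easy to see numerically. The NumPy sketch below (our illustration, not from the paper) evaluates the unnormalized kernel force $\nabla_x \pi_k(x) = \mathbb{E}_{y\sim\pi}[\nabla_x k(x,y)]$ and the mean-shift direction $V_{\pi,k}(x)$ for a Gaussian kernel at a dense and a sparse location of $\pi = \mathcal{N}(0, I)$.

```python
import numpy as np

rng = np.random.default_rng(4)
ys = rng.normal(size=(100_000, 2))          # samples from pi = N(0, I)
tau = 1.0

def force_and_mean_shift(x):
    """Unnormalized kernel force grad pi_k(x) vs locally normalized mean shift."""
    diff = ys - x
    k = np.exp(-np.sum(diff**2, axis=1) / (2 * tau**2))   # Gaussian kernel weights
    force = (k[:, None] * diff).mean(axis=0) / tau**2     # E_y[grad_x k(x, y)]
    mean_shift = (k[:, None] * diff).sum(axis=0) / k.sum()
    return force, mean_shift

for x in (np.array([0.5, 0.0]), np.array([4.0, 0.0])):    # dense vs sparse region
    f, v = force_and_mean_shift(x)
    print(f"x = {x}: |force| = {np.linalg.norm(f):.4f}, "
          f"|mean shift| = {np.linalg.norm(v):.4f}")
```

In the sparse region the force magnitude collapses with the local mass $\pi_k(x)$, while the mean shift retains a well-scaled pull, which is exactly the score-like behavior highlighted above.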
The key point is that this force field is not arbitrary: it is induced by an interaction energy. Coulomb GANs define
$$\mathcal{E}(p,q) := \frac{1}{2}\int \big(p(x) - q(x)\big)\,\Phi_{p,q}(x)\,dx,$$
and update particles by following the potential gradient $F_{p,q} = \nabla_x \Phi_{p,q}$. Heuristically, transporting model mass in the direction of $F_{p,q}$ decreases $\mathcal{E}$, and the energy is minimized when $p \equiv q$. In this sense, the Coulomb-GAN update is driven by a global discrepancy: the learning signal is determined by how $p$ and $q$ interact through the kernel across the whole space. To relate this to the drifting model, it is helpful to isolate the kernel-smoothed profile
$$\pi_k(x) := \mathbb{E}_{y\sim\pi}\big[k(x,y)\big], \qquad \pi \in \{p, q\}.$$
Intuitively, $\pi_k(x)$ summarizes how much probability mass of $\pi$ lies near $x$ in the sense of the kernel. Note that $\pi_k$ is generally not a normalized density; it is a kernel-smoothed mass profile (or "density proxy") whose overall scale depends on the normalization of $k$. With this notation,
$$\Phi_{p,q}(x) = p_k(x) - q_k(x), \qquad F_{p,q}(x) = \nabla_x p_k(x) - \nabla_x q_k(x),$$
so Coulomb-GAN transport is driven by gradients of these kernel-smoothed mass profiles.

Drifting Models: Same Ingredients but Locally Normalized. Drifting uses the same attraction-repulsion ingredients from $p$ and $q$, but replaces an unnormalized force with a locally normalized mean-shift direction:
$$\Delta_{p,q}(x) = \eta\big(V_{p,k}(x) - V_{q,k}(x)\big), \qquad V_{\pi,k}(x) = \frac{\mathbb{E}_{y\sim\pi}\big[k(x,y)(y-x)\big]}{\mathbb{E}_{y\sim\pi}\big[k(x,y)\big]} \quad (\pi \in \{p,q\}).$$
The numerator is a kernel-weighted displacement, while the denominator $\pi_k(x) = \mathbb{E}_{y\sim\pi}[k(x,y)]$ is the local kernel mass. This normalization automatically rescales the step by the local neighborhood density, while keeping the same qualitative behavior of "attract $p$, repel $q$". In dense regions $\pi_k(x)$ is large, so the update is tempered; in sparse regions it is amplified, yielding a meaningful pull toward nearby neighbors even when few samples lie in the kernel neighborhood. This is a score-like property: the update follows a direction determined by local geometry rather than a force whose magnitude is proportional to local mass. Score matching is formulated in terms of log-gradients. The kernel-induced score is
$$s_{\pi,k}(x) := \nabla_x \log \pi_k(x) = \frac{\nabla_x \pi_k(x)}{\pi_k(x)}.$$
This makes the contrast explicit. Kernel-based potential methods (e.g., Coulomb-style updates) act on the unnormalized gradient $\nabla_x \pi_k(x)$, which scales with local mass. Drifting, through the mean-shift normalization, naturally aligns with the normalized log-gradient $\nabla_x \log \pi_k(x)$, namely the score of the kernel-smoothed profile. As shown in Section 4, this connection is exact for Gaussian kernels, where the drifting discrepancy reduces to a kernel-smoothed score mismatch. We summarize these points in the following takeaway: mean-shift drifting is more score-like than kernel-discrepancy methods (e.g., Coulomb GANs/MMD), because its normalization by $\pi_k(x)$ turns mass gradients into log-gradients (scores).
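The contrast is easy to see numerically. The following sketch, under our own toy setup, estimates both the Coulomb-style force $F_{p,q}$ and the mean-shift discrepancy $V_{p,k} - V_{q,k}$ from samples with a Gaussian kernel; the only difference between the two fields is whether the kernel-weighted displacement is divided by the local kernel mass.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = np.zeros(2)                                    # query point
Yp = rng.normal([1.0, 0.0], 0.3, size=(2000, 2))   # samples from "data" p
Yq = rng.normal([-1.0, 0.0], 0.3, size=(2000, 2))  # samples from "model" q

def kernel_stats(x, Y, sigma):
    diff = Y - x                                            # displacements y_j - x
    k = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))   # Gaussian kernel values
    force = (k[:, None] * diff).mean(axis=0) / sigma**2     # ~ grad of mass profile pi_k
    mean_shift = (k[:, None] * diff).sum(axis=0) / k.sum()  # V_{pi,k}: mass-normalized
    return force, mean_shift

Fp, Vp = kernel_stats(x, Yp, sigma)
Fq, Vq = kernel_stats(x, Yq, sigma)
print("Coulomb-style force F_pq :", Fp - Fq)   # magnitude scales with local kernel mass
print("Mean-shift drift   Delta :", Vp - Vq)   # locally normalized (score-like)
```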
5 Empirical Study

Our theory makes two predictions about mean shift versus score mismatch. At the field level, Gaussian mean shift matches a variance-scaled score mismatch exactly, while Laplace mean shift aligns with a scaled score mismatch in high dimension. At the mechanism level for Laplace, the preconditioner concentrates as $D$ grows, the covariance residual vanishes, and the scaling is set by the kernel-weighted radius (see Theorem 2). We therefore split the experiments into two complementary angles. First, in Section 5.1, we probe the discrepancy-field geometry on synthetic data and verify that Laplace mean shift becomes increasingly aligned with an appropriately scaled score-mismatch field as dimension grows, with errors decaying at the rates predicted by the theory. Second, in Section 5.2, we move to generation: we train one-step generators with Gaussian and Laplace mean shift under the same pipeline and compare sample quality. Together, these experiments separate mechanism from outcome: the synthetic study directly tests the predicted score-alignment mechanism at the field level, while the trained-model study evaluates whether the Laplace-specific preconditioning and residual terms translate into any measurable difference in end-to-end generation quality.

5.1 Examinations with Oracles

Experimental Setup: Datasets. We empirically test the high-dimensional alignment prediction of our theory: the drifting discrepancy and the score discrepancy should become increasingly aligned as the ambient dimension grows. Concretely, we measure whether
$$\Delta_{p,q}(x) \approx C\,\Delta^s_{p,q}(x), \qquad \text{and} \qquad \cos\angle\big(\Delta_{p,q}(x),\, \Delta^s_{p,q}(x)\big) \to 1 \quad \text{as } D \to \infty.$$
All quantities below are estimated nonparametrically from finite samples of $p$ or $q$ with no training involved: we treat $p$ and $q$ as oracle samplers. For each dimension $D$, we construct two Gaussian mixtures with a fixed number of modes, draw reference samples from both $p$ and $q$, and sample query points $x$ from $q$ as the evaluation locations. Below, we consider two synthetic datasets to disentangle the role of the shell-concentration assumption from other effects. (Figure 4 illustrates the 2D versions of both: the top row shows Ring MoG and the bottom row shows Raw MoG.)

(A) Ring MoG. Both $p$ and $q$ are six-mode mixtures of Gaussians in $\mathbb{R}^D$. For each dimension $D$, we first choose a random two-dimensional plane and place six mode centers equally spaced on a ring of radius $R = 3$ inside that plane. To draw a sample, we pick one mode uniformly and add isotropic Gaussian noise with standard deviation $0.40$. We define $q$ by rotating the entire ring by a fixed angle of $\pi/6$ within the same plane and sampling in the same way. The mismatch between $p$ and $q$ is controlled by a fixed angular offset, so the geometric structure of the discrepancy is preserved as $D$ increases.

(B) Raw MoG. Here $p$ and $q$ are mixtures with different numbers of modes and different radius profiles. We place six mode centers for $p$ at radii $\{1.5, 2.5, 3.0, 4.0, 5.0, 6.0\}$ in random directions, and four mode centers for $q$ at radii $\{2.0, 3.5, 4.0, 5.5\}$ in independent random directions. Samples are generated by choosing a mode uniformly within each mixture and adding isotropic Gaussian noise with $\sigma = 0.5$. This construction produces broad, varying norms and a mismatched mode layout between $p$ and $q$, serving as a stress test beyond the shell-concentration setting.

In both datasets, we set the Laplace bandwidth adaptively as $\tau = \bar{\tau}\cdot\overline{\|x\|}$ with $\bar{\tau} = 0.3$, where $\overline{\|x\|}$ is the mean norm of the query batch. Since $\overline{\|x\|}$ grows as $\sqrt{R^2 + \sigma^2 D} \propto \sqrt{D}$ for large $D$, this corresponds to $a \approx \tfrac{1}{2}$ in the dimension-aware scaling $\tau = \bar{\tau} D^a$ used in the theory.
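For concreteness, here is a minimal NumPy sketch of the Ring MoG sampler as described above; the function name and seeding are ours.

```python
import numpy as np

def ring_mog(n, basis, radius=3.0, noise=0.40, n_modes=6, offset=0.0, rng=None):
    """Six-mode ring mixture in the 2D plane spanned by `basis` (D x 2, orthonormal)."""
    rng = rng or np.random.default_rng(0)
    D = basis.shape[0]
    angles = 2 * np.pi * np.arange(n_modes) / n_modes + offset
    centers = (radius * np.stack([np.cos(angles), np.sin(angles)], 1)) @ basis.T
    idx = rng.integers(n_modes, size=n)              # pick one mode uniformly
    return centers[idx] + noise * rng.normal(size=(n, D))

D = 64
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(D, 2)))     # a random 2D plane in R^D
Yp = ring_mog(3000, basis, offset=0.0, rng=rng)          # reference samples from p
Yq = ring_mog(3000, basis, offset=np.pi / 6, rng=rng)    # q: the ring rotated by pi/6
```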
Experimental Setup: Kernel and Nonparametric Field Estimators. Fix a query point $x \in \mathbb{R}^D$. For each distribution, we draw a finite i.i.d. reference set of size $N$ (we use $N = 3{,}000$):
$$\{y_j^{(p)}\}_{j=1}^N \sim p, \qquad \{y_j^{(q)}\}_{j=1}^N \sim q.$$
We use the Laplace kernel $k_\tau(x,y) = \exp(-\|x-y\|_2/\tau)$, as in the drifting model, and turn kernel values into normalized attention weights over the reference set. For $p$, the weights are
$$w_j^{(p)}(x) = \frac{k_\tau\big(x, y_j^{(p)}\big)}{\sum_{\ell=1}^N k_\tau\big(x, y_\ell^{(p)}\big)}, \qquad j = 1, \dots, N,$$
and $w_j^{(q)}(x)$ is defined analogously using $\{y_j^{(q)}\}_{j=1}^N$. Intuitively, $w_j^{(p)}(x)$ assigns larger mass to reference points $y_j^{(p)}$ that are closer to $x$. These weights define two nonparametric fields. The mean-shift drifting field is the weighted displacement from $x$ to the reference set:
$$\hat{V}_{p,k}(x) = \sum_{j=1}^N w_j^{(p)}(x)\,\big(y_j^{(p)} - x\big), \qquad \hat{V}_{q,k}(x) = \sum_{j=1}^N w_j^{(q)}(x)\,\big(y_j^{(q)} - x\big).$$
The kernel score is the gradient of the log kernel density estimate, which admits the Monte Carlo form
$$\hat{s}_{p,k}(x) = \nabla_x \log\Big(\sum_{j=1}^N k_\tau\big(x, y_j^{(p)}\big)\Big) = \frac{1}{\tau}\sum_{j=1}^N w_j^{(p)}(x)\,\frac{y_j^{(p)} - x}{\big\|y_j^{(p)} - x\big\|},$$
and $\hat{s}_{q,k}(x)$ is defined analogously. Finally, we form the drifting and score discrepancies at $x$:
$$\hat{\Delta}_{p,q}(x) = \hat{V}_{p,k}(x) - \hat{V}_{q,k}(x), \qquad \hat{\Delta}^s_{p,q}(x) = \hat{s}_{p,k}(x) - \hat{s}_{q,k}(x).$$
All quantities are Monte Carlo estimators: they depend only on the sampled reference sets and kernel evaluations, with no closed-form access to $p$ or $q$; a self-contained sketch of the two estimators follows.
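The following NumPy sketch implements the estimators above (the function names are ours):

```python
import numpy as np

def laplace_fields(x, Y, tau):
    """Nonparametric mean-shift field V_hat and kernel score s_hat at a query x.

    Y is an (N, D) reference sample from the distribution (see Section 5.1).
    """
    diff = Y - x                                     # (N, D) displacements y_j - x
    dist = np.linalg.norm(diff, axis=1)              # ||y_j - x||_2
    logk = -dist / tau                               # log Laplace-kernel values
    w = np.exp(logk - logk.max())                    # numerically stable weights
    w /= w.sum()                                     # attention weights w_j(x)
    V = w @ diff                                     # mean shift: sum_j w_j (y_j - x)
    s = (w / np.maximum(dist, 1e-12)) @ diff / tau   # score: (1/tau) sum_j w_j u_j
    return V, s

def discrepancies(x, Yp, Yq, tau):
    """Drifting and score discrepancies at x from reference sets Yp ~ p, Yq ~ q."""
    Vp, sp = laplace_fields(x, Yp, tau)
    Vq, sq = laplace_fields(x, Yq, tau)
    return Vp - Vq, sp - sq                          # (Delta_hat, Delta_s_hat)
```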
Empirical Results of Alignment as $D$ Increases. We evaluate three complementary alignment metrics as $D$ increases; results are shown in Figure 5, evaluated on both Ring MoG and Raw MoG. Figure 5-(a) reports the absolute alignment error $\mathbb{E}_q\|\hat{\Delta}_{p,q}(x) - C_{\mathrm{theory}}\,\hat{\Delta}^s_{p,q}(x)\|^2$, where $C_{\mathrm{theory}}$ is computed as in Equation 12. Figure 5-(b) reports the corresponding scale-free relative error, obtained by normalizing with the field energy $\mathbb{E}_q\|\hat{\Delta}_{p,q}(x)\|^2$; this controls for changes in the overall magnitude of $\hat{\Delta}_{p,q}$ across dimensions. For both datasets, the absolute and relative errors decrease as $D$ grows, consistent with the $1/D$ decay predicted by Theorem 4. To make the decay explicit, we annotate each curve in (a) and (b) with its log-log regression slope; the measured slopes are close to $-1$, matching the $1/D$ rate. Finally, Figure 5-(c,d) reports the averaged cosine similarity $\cos\angle(\hat{\Delta}_{p,q}(x), \hat{\Delta}^s_{p,q}(x))$ and its complement $1 - \cos\angle$ on a log scale, which isolate directional alignment independent of scaling. In line with Theorem 5, the cosine similarity approaches $1$ as $D$ grows, while $1 - \cos\angle$ decreases consistently across dimensions. Across both datasets, alignment improves monotonically with dimension, supporting the high-dimensional alignment predictions of Theorems 4 and 5.

Empirical Mechanism Diagnostics: Preconditioner Concentration and Vanishing Residual. Following the notation of Theorem 2, we compute the covariance residuals
$$\hat{\delta}_p(x) := \hat{V}_{p,k}(x) - \big(\hat{\alpha}_p(x)\,\tau\big)\,\hat{s}_{p,k}(x), \qquad \hat{\delta}_q(x) := \hat{V}_{q,k}(x) - \big(\hat{\alpha}_q(x)\,\tau\big)\,\hat{s}_{q,k}(x),$$
and their gap $\hat{\delta}_{\mathrm{gap}}(x) := \hat{\delta}_p(x) - \hat{\delta}_q(x)$. The high-dimensional mechanism of the Laplace kernel (as captured by our proof) can be summarized by two effects: (i) the preconditioning terms $\hat{\alpha}_p(x)$ and $\hat{\alpha}_q(x)$ concentrate to the same constant scale as $D$ grows, and (ii) the covariance residual $\hat{\delta}_\pi(x)$ vanishes. As a result, the drifting discrepancy is dominated by the score discrepancy: the remaining direction is essentially the score direction. Empirically, Figure 6 supports both predictions. In Figure 6-(a), the averaged preconditioners become indistinguishable,
$$\frac{\bar{\alpha}_p}{\bar{\alpha}_q} \to 1, \qquad \bar{\alpha}_p := \mathbb{E}_{x\sim q}\big[\hat{\alpha}_p(x)\big], \quad \bar{\alpha}_q := \mathbb{E}_{x\sim q}\big[\hat{\alpha}_q(x)\big].$$
In Figure 6-(b), the residual-gap energy $\mathbb{E}_{x\sim q}\|\hat{\delta}_{\mathrm{gap}}(x)\|_2^2$ decays with $D$, consistent with the theory that the covariance residual vanishes. Together with the concentration of $\hat{\alpha}_p(x)$ and $\hat{\alpha}_q(x)$, this indicates that the Laplace preconditioner becomes nearly constant over $x$, so the mean-shift field behaves like a globally scaled score field. Here, all reported quantities are computed nonparametrically from finite reference samples from $p$ and $q$ (and query points $x \sim q$), using the same Monte Carlo strategies described above.

Empirical Check: Alignment Between the Predicted Scaling and the Oracle Constant. Figure 6 summarizes the diagnostics for the Laplace-kernel mechanism: (a) the kernel-reweighted preconditioners concentrate and become indistinguishable, $\bar{\alpha}_p/\bar{\alpha}_q \to 1$; (b) the residual-gap energy decays with $D$, indicating a vanishing covariance residual; and (c) the theory-predicted scale $C_{\mathrm{theory}} = \rho\tau$ matches the oracle least-squares scale $C^*$, with $C^*/C_{\mathrm{theory}} \to 1$. All results are consistent with the predictions of Theorems 4 and 5 and support the view that mean-shift drifting follows an approximately score-matching direction.

A central quantity in Theorem 2 is the kernel-weighted mean distance. In our implementation it is estimated from samples as
$$\hat{\alpha}_p(x) := \sum_j w_j^{(p)}(x)\,\big\|y_j^{(p)} - x\big\|, \qquad \hat{\alpha}_q(x) := \sum_j w_j^{(q)}(x)\,\big\|y_j^{(q)} - x\big\|.$$
In high dimension, these quantities become nearly constant over $x$, so we form a single empirical scale
$$\rho := \tfrac{1}{2}\big(\bar{\alpha}_p + \bar{\alpha}_q\big), \qquad C_{\mathrm{theory}} := \rho\,\tau. \qquad (12)$$
Intuitively, $C_{\mathrm{theory}}$ captures the typical (kernel-weighted) distance scale induced by $k_\tau$, and is the scalar predicted by the theory when relating the drifting discrepancy to the score discrepancy. To check the predicted scaling quantitatively, we also compute an oracle rescaling constant $C^*$ by a least-squares fit. Specifically, we choose the scalar $C$ that best matches the drifting discrepancy with the score discrepancy in mean squared error:
$$C^* := \operatorname*{arg\,min}_{C\in\mathbb{R}}\ \mathbb{E}_{x\sim q}\Big[\big\|\hat{\Delta}_{p,q}(x) - C\,\hat{\Delta}^s_{p,q}(x)\big\|^2\Big] = \frac{\mathbb{E}_{x\sim q}\big[\big\langle \hat{\Delta}_{p,q}(x),\, \hat{\Delta}^s_{p,q}(x)\big\rangle\big]}{\mathbb{E}_{x\sim q}\big[\|\hat{\Delta}^s_{p,q}(x)\|^2\big]}.$$
This comparison separates two questions: the cosine similarity measures directional alignment and does not depend on the overall scale, whereas comparing $C^*$ with $C_{\mathrm{theory}}$ tests whether the magnitude scaling predicted by the theory is correct.
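A compact sketch of these two scale estimates, reusing the hypothetical `laplace_fields` helper from the earlier snippet:

```python
import numpy as np

def scale_diagnostics(X, Yp, Yq, tau):
    """Estimate C_theory = rho * tau (Eq. 12) and the oracle least-squares C*.

    X: (M, D) query points drawn from q; Yp, Yq: reference samples from p and q.
    """
    dV, dS, alphas = [], [], []
    for x in X:
        Vp, sp = laplace_fields(x, Yp, tau)
        Vq, sq = laplace_fields(x, Yq, tau)
        dV.append(Vp - Vq)
        dS.append(sp - sq)
        for Y in (Yp, Yq):                        # kernel-weighted mean distances
            d = np.linalg.norm(Y - x, axis=1)
            w = np.exp(-(d - d.min()) / tau); w /= w.sum()
            alphas.append(w @ d)                  # alpha_hat_p(x), alpha_hat_q(x)
    dV, dS = np.array(dV), np.array(dS)
    rho = np.mean(alphas)                         # (alpha_bar_p + alpha_bar_q) / 2
    C_theory = rho * tau
    C_star = np.sum(dV * dS) / np.sum(dS * dS)    # least-squares fit dV ~ C * dS
    return C_theory, C_star
```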
As shown in Figure 6-(c), we observe $C^*/C_{\mathrm{theory}} \to 1$ as $D$ increases, which directly supports the constant predicted by Theorem 2. Together, these results show that the preconditioned-score decomposition predicts not only directional alignment but also a concrete calibration of magnitude, leading to near-parallelism between the drifting mean-shift direction and the score-matching direction. Moreover, although Theorems 4, 5 and 6 rely on concentration-type assumptions in high dimension as sufficient conditions for the theory, the empirical results on Raw MoG still support the same conclusions even when these assumptions are not strictly enforced. This raises an interesting question: are there broader and more realistic settings, or even necessary and sufficient conditions, under which the parallel between the mean-shift direction and the score-mismatch direction continues to hold?

5.2 Examinations with Trained Models

In Section 5.1, we validated the central geometric prediction of our theory: Laplace mean-shift drifting is approximately aligned with a scaled score-mismatch field, with an alignment gap that decays with dimension at the predicted rate. At the same time, our radial-kernel decomposition shows that non-Gaussian kernels introduce an additional preconditioning effect and a covariance residual, both absent in the Gaussian case. The practical question is therefore whether these Laplace-specific terms matter for generation quality. To test this, we compare one-step generators trained with the same drifting pipeline under two kernels: Gaussian, for which mean shift is exactly aligned with the smoothed score mismatch, and Laplace, the default choice in drifting implementations. Since the training procedure, architecture, and feature map (for CIFAR-10) are all fixed, this comparison isolates the effect of the non-Gaussian preconditioning and residual terms on sample quality.

2D Synthetic Experiments. We follow the setup in the drifting model's demo implementation (https://lambertae.github.io/projects/drifting/) and train a one-step generator on four 2D targets (Ring MoG, Swiss Roll, Checkerboard, Two Moons). The generator is a 4-layer MLP mapping $z \sim \mathcal{N}(0, I_{32})$ to $\mathbb{R}^2$, trained with batch size $2048$ for both data and model samples. At each step, we form a transported target $x + \hat{V}_{p,q}(x)$ using finite-sample mean shift with either a Laplace kernel $k_\tau(x,y) = \exp(-\|x-y\|_2/\tau)$ or a Gaussian kernel $k_\sigma(x,y) = \exp(-\|x-y\|_2^2/(2\sigma^2))$ (the Gaussian case coincides with a score-mismatch transport direction). We use kernel scales $(\tau, \sigma) = (0.30, 0.30)$ for Ring MoG and $(\tau, \sigma) = (0.05, 0.05)$ for the other datasets. We evaluate sample quality by the sliced Wasserstein distance (SWD) with 200 random projections and RBF-MMD on $5{,}000$ generated and $5{,}000$ real samples. Figure 7 reports the generated samples for both kernels together with their SWD and MMD scores.
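Since SWD drives the quantitative comparison here, we include a minimal NumPy sketch of the estimator as we understand it (equal sample sizes assumed; the projection count matches the 200 used above):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, rng=None):
    """Monte Carlo sliced Wasserstein-2 distance between two point sets in R^d.

    Projects both samples onto random unit directions; each 1D W2 distance
    reduces to comparing sorted projections (requires len(X) == len(Y)).
    """
    rng = rng or np.random.default_rng(0)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # random unit directions
    px = np.sort(X @ theta.T, axis=0)                       # sorted 1D projections
    py = np.sort(Y @ theta.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))
```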
Across all four targets, Laplace and Gaussian kernels achieve nearly identical performance on both metrics, suggesting that, in these low-dimensional settings, the preconditioning and covariance-residual corrections in Theorem 2 have limited impact on sample quality.

CIFAR-10. We train drifting models on CIFAR-10 with a U-Net backbone, following the public reimplementation [5] (at the time of writing, there is no official implementation released by the drifting-model authors; our implementation builds directly on the driftin codebase, https://github.com/Infatoshi/driftin). Training runs for 50,000 iterations on 4 GPUs with a global batch size of 1,536. To compute kernels in feature space, we use DINOv3 [21] with a ViT-B/16 backbone. Input images are resized to $112\times 112$ before encoding. Rather than using the [CLS] token, we extract multi-resolution features from four encoder stages (layers 2, 5, 8, and 11). For each stage, we form 18 vectors consisting of 16 spatial descriptors (via adaptive pooling) plus a global mean and a global standard deviation. Concatenating across stages yields 72 feature vectors per image.

Figure 8 compares Laplace and Gaussian kernels under the same baseline configuration, with both models trained from the same random initialization for single-step unconditional generation at $32\times 32$ resolution. The Gaussian-kernel variant (which corresponds to the score-mismatch discrepancy) reaches an FID of $7.97$ after convergence, whereas the Laplace-kernel variant (the default mean-shift drifting discrepancy) converges to an FID of $20.91$. While we did not perform extensive tuning, the Gaussian kernel is already competitive. We do not view this gap as necessarily intrinsic to the kernel choice: these observations are consistent with the concurrent work of [16], who report similar FIDs for Laplace and Gaussian kernels on CelebA-HQ unconditional generation under a matched training budget (Laplace: $14.71$; Gaussian: $16.56$). We expect that additional hyperparameter optimization, as well as a search for stronger pre-trained feature maps, could further improve sample quality. At the same time, the current evidence suggests that the performance gap between Gaussian and Laplace kernels can remain modest: the additional preconditioning and residual terms introduced by the Laplace kernel need not significantly affect generation quality relative to the Gaussian kernel, although their practical impact may depend on tuning and training configuration.
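As a concrete illustration of the pooling arithmetic described above (16 spatial descriptors plus a global mean and standard deviation per stage, over four stages, giving 72 vectors), here is a minimal PyTorch sketch. How the per-stage patch-token grids are obtained from DINOv3 is not shown; `stage_descriptors` is our hypothetical helper operating on one such grid.

```python
import torch
import torch.nn.functional as F

def stage_descriptors(feat):
    """Turn one encoder stage's feature grid into 18 pooled vectors.

    feat: (C, H, W) patch-token features from one ViT stage, reshaped to a
    grid. Returns 16 spatial descriptors via 4x4 adaptive average pooling
    plus a global mean and a global standard deviation: shape (18, C).
    """
    spatial = F.adaptive_avg_pool2d(feat.unsqueeze(0), (4, 4))   # (1, C, 4, 4)
    spatial = spatial.flatten(2).squeeze(0).T                    # (16, C)
    flat = feat.flatten(1)                                       # (C, H*W)
    mean, std = flat.mean(dim=1), flat.std(dim=1)                # global stats
    return torch.cat([spatial, mean[None], std[None]], dim=0)    # (18, C)

# Concatenating four stages (e.g., layers 2, 5, 8, 11) yields 4 * 18 = 72
# feature vectors per image, each of which enters the kernel computation.
```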
Conclusion. Across all four 2D synthetic datasets and CIFAR-10, Laplace- and Gaussian-kernel drifting achieve comparable generation quality under the same training pipeline. Since Gaussian-kernel drifting is exactly aligned with the smoothed score-mismatch field, this kernel swap provides an end-to-end check of whether the Laplace-specific corrections identified by the radial decomposition (preconditioning and covariance residual) accumulate during training in a way that degrades samples. The observed parity in sample quality suggests that, in these settings, the Laplace-specific terms are either small or largely self-canceling over the course of optimization, so default Laplace drifting behaves similarly to a score-mismatch transport baseline in practice.

6 Additional Related Works

Generative modeling has long sought a paradigm that supports stable and efficient training together with fast, high-fidelity sampling. Early work focused on generative adversarial networks (GANs) [7], including kernel-based and MMD-style variants [14, 15, 26]. These methods can learn a one-step pushforward map and generate samples with a single function evaluation, but their training is often unstable and difficult. Later, diffusion and score-based models [22, 24, 8, 25, 13] shifted the field because they offer more stable training, strong scalability, and high sample quality. Their main drawback is slow sampling, since generation amounts to solving a differential equation with many function evaluations. To reduce this cost, subsequent work has explored distillation methods [19, 29, 12], such as DMD, which compress a pre-trained diffusion teacher into a few-step student generator, as well as approaches that directly learn the ODE solution map [17, 23, 11].

Several recent methods revisit the design of powerful generative models from different formulations. Idempotent Generative Networks (IGN) [20] train generators through an idempotence constraint; Equilibrium Matching [28] learns a time-invariant gradient field compatible with an underlying energy; and drifting models [3] use a kernel-induced force field to move the model distribution toward the data distribution. Despite these different viewpoints, the connections between them can be close. For instance, drifting can be related to a different parametrization of IGN by setting the discrepancy field $\Delta_{p,q_\theta}(x) := f_\theta(x) - x$ as the generator residual in Equation 4. Moreover, concurrent work [16] relates drifting to a semigroup-consistent decomposition of diffusion ODE solution maps. Our work places drifting more directly within the score-based generative modeling framework: drifting admits a score-based formulation while differing in how the score function is realized, and in the Gaussian case it becomes exactly score matching on kernel-smoothed distributions.

7 Discussion and Conclusion

The fixed-point regression template views one-step generator training as learning a distribution-level transport field. For Gaussian kernels, this transport field coincides with the score mismatch of Gaussian-smoothed densities, so the drifting objective reduces exactly to a score-matching-style objective in reverse Fisher form and closely parallels DMD (single-scale kernel score estimation versus multi-scale teacher scores). For general radial kernels, including Laplace, we show that mean shift admits an exact decomposition into a preconditioned smoothed-score term plus a covariance residual that captures local neighborhood geometry, yielding a unified preconditioned score-based interpretation. For the Laplace kernel used in drifting, we further justify drifting as a reliable proxy for score mismatch by proving two complementary results: in the low-temperature regime, population optima agree up to a polynomially small error in the kernel scale; and in the high-dimensional regime, the drifting field, the implemented stop-gradient updates, and the population optima align with discrepancies decaying polynomially in dimension.
Empirically, we support these predictions by confirming the predicted decay in $D$ and by showing that Gaussian and Laplace kernels yield generally comparable generation quality. Overall, these results clarify the connection between drifting and score-based generative modeling: drifting can be understood as a kernel-based, nonparametric realization of score-driven one-step generation, and it potentially offers a useful perspective for designing fast diffusion-style generators.

References

[1] N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden (2024) Flow map matching. arXiv preprint arXiv:2406.07507.
[2] D. Comaniciu and P. Meer (2002) Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5), pp. 603–619.
[3] M. Deng, H. Li, T. Li, Y. Du, and K. He (2026) Generative modeling via drifting. arXiv preprint arXiv:2602.04770.
[4] B. Efron (2011) Tweedie's formula and selection bias. Journal of the American Statistical Association 106 (496), pp. 1602–1614.
[5] Elliot (2026) Driftin: single-step image generation via drift fields. https://github.com/Infatoshi/driftin.
[6] Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He (2025) Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27.
[8] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.
[9] Z. Hu, C. Lai, Y. Mitsufuji, and S. Ermon (2025) CMT: mid-training for efficient learning of consistency, mean flow, and flow map models. arXiv preprint arXiv:2509.24526.
[10] A. Hyvärinen and P. Dayan (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (4).
[11] D. Kim, C. Lai, W. Liao, N. Murata, Y. Takida, T. Uesaka, Y. He, Y. Mitsufuji, and S. Ermon (2024) Consistency trajectory models: learning probability flow ODE trajectory of diffusion. In International Conference on Learning Representations.
[12] D. Kim, C. Lai, W. Liao, Y. Takida, N. Murata, T. Uesaka, Y. Mitsufuji, and S. Ermon (2024) PaGoDA: progressive growing of a one-step generator from a low-resolution diffusion teacher. arXiv preprint arXiv:2405.14822.
[13] C. Lai, Y. Song, D. Kim, Y. Mitsufuji, and S. Ermon (2025) The principles of diffusion models. arXiv preprint arXiv:2510.21890.
[14] C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos (2017) MMD GAN: towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems 30.
[15] Y. Li, K. Swersky, and R. Zemel (2015) Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1718–1727.
[16] Z. Li and B. Zhu (2026) A long-short flow-map perspective for drifting models. arXiv preprint arXiv:2602.20463.
[17] E. Luhman and T. Luhman (2021) Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388.
[18] S. Lyu (2012) Interpretation and generalization of score matching. arXiv preprint arXiv:1205.2629.
[19] T. Salimans and J. Ho (2021) Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations.
[20] A. Shocher, A. V. Dravid, Y. Gandelsman, I. Mosseri, M. Rubinstein, and A. A. Efros (2024) Idempotent generative network. In The Twelfth International Conference on Learning Representations.
[21] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski (2025) DINOv3.
[22] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
[23] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever (2023) Consistency models. arXiv preprint arXiv:2303.01469.
[24] Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32.
[25] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020) Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
[26] T. Unterthiner, B. Nessler, C. Seward, G. Klambauer, M. Heusel, H. Ramsauer, and S. Hochreiter (2018) Coulomb GANs: provably optimal Nash equilibria via potential fields. In International Conference on Learning Representations.
[27] P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674.
[28] R. Wang and Y. Du (2025) Equilibrium matching: generative modeling with implicit energy-based models. arXiv preprint arXiv:2510.02300.
[29] T. Yin, M. Gharbi, R. Zhang, E. Shechtman, F. Durand, W. T. Freeman, and T. Park (2024) One-step diffusion with distribution matching distillation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6613–6623.

Appendix A Proof of the Preconditioning Score Decomposition

A.1 Proof of Theorem 1

Proof of Theorem 1.
We first prove the identity that relates the mean shift and the score function. Let $k_\tau$ be a Gaussian kernel. By definition,
$$p_\tau(x) = (p * k_\tau)(x) = \mathbb{E}_{y\sim p}\big[k_\tau(x,y)\big].$$
Differentiating under the expectation and using $\nabla_x k_\tau(x,y) = -(x-y)\,k_\tau(x,y)/\tau^2$ yields
$$\nabla_x p_\tau(x) = \frac{1}{\tau^2}\,\mathbb{E}_{y\sim p}\big[k_\tau(x,y)(y-x)\big].$$
Dividing by $p_\tau(x)$ gives
$$\nabla_x \log p_\tau(x) = \frac{\mathbb{E}_{y\sim p}\big[k_\tau(x,y)(y-x)\big]}{\tau^2\,\mathbb{E}_{y\sim p}\big[k_\tau(x,y)\big]} = \frac{V_{p,k_\tau}(x)}{\tau^2},$$
where the last equality uses the definition of $V_{p,k_\tau}$ in the mean-shift operator paragraph. The same argument applies to $q$. Substituting
$$V_{\pi,k_\tau}(x) = \tau^2\,s_{\pi,\tau}(x) \qquad \text{for } \pi \in \{p, q\}$$
into $\Delta_{p,q}(x) = \eta\big(V_{p,k_\tau}(x) - V_{q,k_\tau}(x)\big)$ gives the discrepancy formula defined by the mean shift. Finally, Equation 7 follows from Equation 4. ∎
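As a quick numerical sanity check of this identity, the following NumPy sketch (our own construction) compares the Monte Carlo mean shift $V_{p,k_\tau}(x)/\tau^2$ against the closed-form smoothed score for an isotropic Gaussian $p$, for which $p_\tau = \mathcal{N}(\mu, (s^2+\tau^2)I)$:

```python
import numpy as np

# Check V_{p,k_tau}(x) = tau^2 * grad log p_tau(x) for p = N(mu, s^2 I), whose
# Gaussian-smoothed score is (mu - x) / (s^2 + tau^2) in closed form.
rng = np.random.default_rng(0)
D, s, tau = 2, 0.7, 0.4
mu = np.ones(D)
Y = mu + s * rng.normal(size=(400_000, D))           # samples from p
x = np.zeros(D)                                      # query point

diff = Y - x
k = np.exp(-np.sum(diff**2, axis=1) / (2 * tau**2))  # Gaussian kernel values
V = (k[:, None] * diff).sum(0) / k.sum()             # mean shift V_{p,k_tau}(x)
score = (mu - x) / (s**2 + tau**2)                   # exact smoothed score

print(V / tau**2)   # approximately equal to `score` up to Monte Carlo error
print(score)
```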
A.2 Proof of Theorem 2

Proof of Theorem 2. Fix $x$ and write $b := b_\tau(\|x-y\|_2)$. Under $y \sim \pi_\tau(\cdot\,|\,x)$, define
$$A(y) := b^{-1}, \qquad B(y) := b\,(y - x).$$
Then $A(y)B(y) = y - x$, and by the definition of $V_{\pi,k_\tau}$,
$$V_{\pi,k_\tau}(x) = \mathbb{E}_{y\sim\pi_\tau(\cdot|x)}\big[A(y)B(y)\big].$$
Applying the covariance identity $\mathbb{E}[AB] = \mathbb{E}[A]\,\mathbb{E}[B] + \mathrm{Cov}(A,B)$ under $y \sim \pi_\tau(\cdot\,|\,x)$ yields
$$V_{\pi,k_\tau}(x) = \alpha_{\pi,\tau}(x)\,\mathbb{E}_{y\sim\pi_\tau(\cdot|x)}\big[B(y)\big] + \delta_{\pi,\tau}(x).$$
Finally, Equation 10 is exactly
$$\mathbb{E}_{y\sim\pi_\tau(\cdot|x)}\big[b_\tau(\|x-y\|_2)(y-x)\big] = \tau^2\,s_{\pi,k_\tau}(x),$$
so $\mathbb{E}[B(y)] = \tau^2 s_{\pi,k_\tau}(x)$ and the claim follows. ∎

Appendix B Proof of Theorem 3

In this section, we prove the small-temperature result, Theorem 3, which establishes a proxy relationship between score matching and the drifting model at their optimal distributions in the small-temperature regime. We state a rigorous version of this relationship below.

Theorem 3′ (Small-$\bar{\tau}$ Agreement Between Mean-Shift and Score Matching). Assume Assumptions 1, 2 and 3. Fix $a \ge 0$ and set $\tau = \tau_D := \bar{\tau} D^a$. Let $\tau_0$ be as in Assumption 2. For each $\tau \in (0, \tau_0]$, let
$$f^\star(\tau) \in \operatorname*{arg\,min}_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f).$$
Then, as $\bar{\tau} \to 0$ (equivalently $\tau \to 0$),
$$D_{\mathrm{rF}}\big(p\,\|\,q_{f^\star(\tau)}\big) = O(\bar{\tau}^4).$$
Moreover, by realizability (Assumption 3) there exists $g_{\mathrm{data}} \in \mathcal{G}$ such that $q_{g_{\mathrm{data}}} = p$, and hence
$$D_{\mathrm{rF}}\big(q_{g_{\mathrm{data}}}\,\|\,q_{f^\star(\tau)}\big) = D_{\mathrm{rF}}\big(p\,\|\,q_{f^\star(\tau)}\big) = O(\bar{\tau}^4).$$
The hidden constant in $O(\cdot)$ is independent of $\bar{\tau}$ and of the learnable parameters (but may depend on $D$, $a$, $\eta$ and the uniform envelope moment bound in Assumption 2).

B.1 Setup and Technical Assumptions

Setup and Formulations. We work in the $D$-dimensional (feature embedding) space and consider distributions induced by generators. Let $\mathcal{F}, \mathcal{G}$ be classes of measurable maps $f, g: \mathbb{R}^m \to \mathbb{R}^D$, and define the pushforward distributions
$$q_f := f_\#\mathcal{N}(0, I), \qquad q_g := g_\#\mathcal{N}(0, I)$$
on $\mathbb{R}^D$. Let $p$ denote the data distribution in the same space. To match the dimension-dependent scaling used in drifting implementations (and to avoid kernel degeneracy in high dimension), we use the bandwidth
$$\tau = \tau_D := \bar{\tau} D^a, \qquad \bar{\tau} > 0, \quad a \ge 0,$$
so $\tau$ grows with $D$ when $a > 0$ and remains constant when $a = 0$. Given a kernel $k_\tau$ (e.g., the Laplace kernel $k_\tau(x,y) = \exp(-\|x-y\|_2/\tau)$ as a special case of Equation 9), recall the mean-shift direction
$$V_{\pi,k_\tau}(x) := \frac{\mathbb{E}_{y\sim\pi}\big[k_\tau(x,y)\,(y-x)\big]}{\mathbb{E}_{y\sim\pi}\big[k_\tau(x,y)\big]},$$
and the kernel-induced score (at scale $\tau$) $s_{\pi,\tau}(x) := \nabla_x \log \pi_{k_\tau}(x)$ with $\pi_{k_\tau}(x) = \mathbb{E}_{y\sim\pi}[k_\tau(x,y)]$.

We measure how well a generator $f$ matches $p$ by the drifting field
$$V_f(x) := V_{p,k_\tau}(x) - V_{q_f,k_\tau}(x),$$
and how well it matches the (smoothed) score by the score mismatch
$$\Delta s_f(x) := s_{p,\tau}(x) - s_{q_f,\tau}(x).$$
The corresponding population objectives are
$$\mathcal{L}_{\mathrm{drift}}(f) := \mathbb{E}_{x\sim q_f}\|V_f(x)\|_2^2, \qquad \mathcal{L}_{\mathrm{SM}}(g) := \mathbb{E}_{x\sim q_g}\big\|s_{p,\tau}(x) - s_{q_g,\tau}(x)\big\|_2^2.$$
Finally, the scale-$\tau$ reverse Fisher divergence introduced above satisfies
$$D_{\mathrm{rF}}(p\|q_f) = \mathbb{E}_{x\sim q_f}\big\|s_{p,\tau}(x) - s_{q_f,\tau}(x)\big\|_2^2 = \mathbb{E}_{x\sim q_f}\|\Delta s_f(x)\|_2^2.$$

Proof Roadmap. We write $V_{\pi,k_\tau}(x) = B_\tau(x)/A_\tau(x)$, where
$$A_\tau(x) := \int_{\mathbb{R}^D} e^{-\|x-y\|_2/\tau}\,\pi(y)\,dy, \qquad B_\tau(x) := \int_{\mathbb{R}^D} e^{-\|x-y\|_2/\tau}\,(y-x)\,\pi(y)\,dy.$$
After the change of variables $y = x + \tau z$, both $A_\tau$ and $B_\tau$ become Laplace-weighted local averages of $\pi(x + \tau z)$. We then Taylor expand $\pi(x + \tau z)$ around $x$ and control the far-tail region by the exponential kernel decay, yielding expansions of $A_\tau$, $B_\tau$, and $\nabla A_\tau$ up to order $\tau^4$. Finally, a ratio expansion gives
$$V_{\pi,k_\tau}(x) = c_D\,\tau^2\,\nabla\log A_\tau(x) + \tau^4 R_{\pi,\tau}(x) = c_D\,\tau^2\,s_{\pi,\tau}(x) + \tau^4 R_{\pi,\tau}(x),$$
and the local envelope condition ensures that the remainder stays uniformly controlled after dividing by $A_\tau(x)$.

Technical Assumptions. We follow the instance-noise convention: whenever we evaluate drifting or score-matching objectives, we implicitly add a small Gaussian perturbation. This ensures that all relevant laws admit smooth, strictly positive densities, avoiding singular pushforwards without changing the practical learning setup.

Assumption 1 (Instance Noise). We observe $\tilde{x} = x + \xi$, where $\xi \sim \mathcal{N}(0, \eta^2 I)$ is independent and $\eta > 0$ is fixed. Equivalently, for every distribution $\mu$ in $\{p\} \cup \{q_f : f \in \mathcal{F}\} \cup \{q_g : g \in \mathcal{G}\}$, we work with its noise-regularized version $\mu * \mathcal{N}(0, \eta^2 I)$. In particular, each effective $p, q_f, q_g$ admits a $C^\infty$ Lebesgue density that is strictly positive everywhere on $\mathbb{R}^D$. Throughout this subsection, we keep the same notation $p, q_f, q_g$ for these effective (instance-noised) laws (i.e., we drop the convolution notation). Moreover, instance noise yields a uniform $\|\cdot\|_\infty$ bound: if $\nu = \mu * \mathcal{N}(0, \eta^2 I)$ for a probability measure $\mu$, then $\|\nu\|_\infty \le (2\pi\eta^2)^{-D/2}$. Indeed, writing $\varphi_\eta(x) = (2\pi\eta^2)^{-D/2}\exp(-\|x\|_2^2/(2\eta^2))$ for the Gaussian density,
$$\nu(x) = \int \varphi_\eta(x - y)\,d\mu(y) \le \sup_z \varphi_\eta(z) = \varphi_\eta(0).$$

As stated above, we write $V_{\pi,k_\tau}(x) = B_\tau(x)/A_\tau(x)$. A small-$\tau$ expansion is local: after $y = x + \tau z$, the kernel weight becomes $e^{-\|z\|_2}$, so the main contribution comes from $\|z\|_2 = O(1)$ (equivalently, $\|y - x\|_2 = O(\tau)$). The only subtlety is that our ratio expansions divide by $A_\tau(x)$, whose leading term is $A_\tau(x) \approx M_0\,\tau^D\,\pi(x)$ as $\tau \to 0$. Thus we must control how small $\pi(x)$ can be relative to its local neighborhood. We encode this via a local density ratio, as stated in the following assumption.

Assumption 2 (Uniform Integrable Local Envelope along Drifting Models' Minimizers). Assume Assumption 1. For any effective density $\pi$, define
$$M_\pi(x) := 1 + \sum_{1\le|\alpha|\le 4}\ \sup_{\|u-x\|_2\le 1}\big|\partial^\alpha \log\pi(u)\big|, \qquad \mathfrak{R}_\pi(x) := \sup_{\|u-x\|_2\le 1}\frac{\pi(u)}{\pi(x)} \in [1, \infty),$$
and the combined local envelope (for an integer $K \ge 1$, e.g. $K = 4$)
$$U_\pi(x) := \big(1 + M_\pi(x)\big)^K\,\big(1 + \mathfrak{R}_\pi(x)\big).$$
Fix $\tau_0 \in (0, 1]$. Assume that for every $\tau \in (0, \tau_0]$ the population minimizer set $\operatorname{arg\,min}_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f)$ is nonempty, and that there exists an integer $K \ge 1$ (one may take $K = 4$) such that
$$\sup_{\tau\in(0,\tau_0]}\ \sup_{f^\star\in\operatorname{arg\,min}_f\mathcal{L}_{\mathrm{drift}}(f)}\ \mathbb{E}_{X\sim q_{f^\star}}\big[U_p(X)^2 + U_{q_{f^\star}}(X)^2\big] < \infty.$$
This density-ratio factor is mild but necessary. The denominator of $V_{\pi,k_\tau}(x)$,
$$A_\tau(x) = \mathbb{E}_{y\sim\pi}\big[k_\tau(x,y)\big],$$
satisfies $A_\tau(x) \asymp \tau^D\,\pi(x)$ as $\tau \to 0$ (under the local regularity conditions used in the small-$\tau$ expansion). Hence any pointwise expansion of $V_{\pi,k_\tau}(x) = B_\tau(x)/A_\tau(x)$ must control division by $\pi(x)$. The local ratio $\mathfrak{R}_\pi(x)$ quantifies how $\pi(x)$ compares to nearby values in $B(x, 1)$, and it avoids imposing any global lower bound or compact support; only moments under the drifting minimizers are required.

Assumption 3 (Realizability in Each Class). There exist $f_{\mathrm{data}} \in \mathcal{F}$ and $g_{\mathrm{data}} \in \mathcal{G}$ such that $q_{f_{\mathrm{data}}} = p$ and $q_{g_{\mathrm{data}}} = p$.

We measure agreement via the Fisher divergence between kernel-smoothed scores:
$$D_{\mathrm{rF}}(p\|q) := \mathbb{E}_{x\sim q}\big[\|s_{p,\tau}(x) - s_{q,\tau}(x)\|_2^2\big], \qquad s_{\pi,\tau}(x) := \nabla_x \log \pi_{k_\tau}(x), \quad \pi_{k_\tau}(x) := \mathbb{E}_{y\sim\pi}\big[k_\tau(x,y)\big].$$

B.2 Auxiliary Tools and the Main Proof

For the Laplace kernel $k_\tau(x,y) = \exp(-\|x-y\|_2/\tau)$, define
$$M_0 := \int_{\mathbb{R}^D} e^{-\|z\|_2}\,dz, \qquad M_2 := \int_{\mathbb{R}^D} z_1^2\,e^{-\|z\|_2}\,dz, \qquad c_D := \frac{M_2}{M_0} \in (0, \infty).$$

Lemma 1 (Small-$\tau$ Expansion: Mean Shift Matches the Kernel Score up to $\tau^4$). Assume Assumptions 1 and 2. Fix any $\tau \in (0, \tau_0]$ and any minimizer $f^\star \in \operatorname{arg\,min}_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f)$, depending on $\tau$. For $\pi \in \{p, q_{f^\star}\}$, there exists $C < \infty$ (depending only on $D$, $\eta$ and kernel moments, but not on $\tau$) such that for all $x \in \mathbb{R}^D$,
$$V_{\pi,k_\tau}(x) = c_D\,\tau^2\,s_{\pi,\tau}(x) + \tau^4\,R_{\pi,\tau}(x), \qquad \|R_{\pi,\tau}(x)\|_2 \le C\,U_\pi(x).$$
Consequently,
$$\sup_{\tau\in(0,\tau_0]}\ \sup_{f^\star\in\operatorname{arg\,min}\mathcal{L}_{\mathrm{drift}}}\ \mathbb{E}_{X\sim q_{f^\star}}\big[\|R_{\pi,\tau}(X)\|_2^2\big] < \infty.$$

Proof. Fix $\pi$ and $x \in \mathbb{R}^D$. Define
$$A_\tau(x) := \int_{\mathbb{R}^D} e^{-\|x-y\|_2/\tau}\,\pi(y)\,dy, \qquad B_\tau(x) := \int_{\mathbb{R}^D} e^{-\|x-y\|_2/\tau}\,(y-x)\,\pi(y)\,dy,$$
so that $\pi_{k_\tau}(x) = A_\tau(x)$ and $V_{\pi,k_\tau}(x) = B_\tau(x)/A_\tau(x)$.

First, we examine the differentiability of $A_\tau$ (Laplace kernel). Although $k_\tau(x,y)$ is not classically $C^1$ at $x = y$, the map $x \mapsto k_\tau(x,y)$ belongs to $W^{1,1}(\mathbb{R}^D)$ (in $x$) and admits an $L^1$ weak gradient, given for $x \ne y$ by
$$\nabla_x k_\tau(x,y) = -\frac{1}{\tau}\,e^{-\|x-y\|_2/\tau}\,\frac{x-y}{\|x-y\|_2},$$
and we set $\nabla_x k_\tau(y,y) := 0$. Hence $A_\tau = k_\tau * \pi \in C^1$ and
$$\nabla A_\tau(x) = \int_{\mathbb{R}^D} \nabla_x k_\tau(x,y)\,\pi(y)\,dy, \qquad s_{\pi,\tau}(x) = \nabla\log A_\tau(x) = \frac{\nabla A_\tau(x)}{A_\tau(x)}.$$
Now we apply the change of variables $y = x + \tau z$ to get
$$A_\tau(x) = \tau^D \int_{\mathbb{R}^D} e^{-\|z\|_2}\,\pi(x + \tau z)\,dz, \qquad B_\tau(x) = \tau^{D+1} \int_{\mathbb{R}^D} e^{-\|z\|_2}\,z\,\pi(x + \tau z)\,dz.$$

Step 1: Local Taylor Expansion on $\|z\| \le \tau^{-1}$. If $\|z\| \le \tau^{-1}$, then $x + t\tau z \in B(x, 1)$ for all $t \in [0, 1]$. Taylor's theorem (order 3 with integral remainder) yields
$$\pi(x + \tau z) = \pi(x) + \tau\langle\nabla\pi(x), z\rangle + \frac{\tau^2}{2}\,z^\top\nabla^2\pi(x)\,z + \frac{\tau^3}{6}\sum_{i,j,k}\partial_{ijk}\pi(x)\,z_i z_j z_k + \tau^4\,\tilde{R}_4(x, \tau z),$$
where $\tilde{R}_4(x, \tau z)$ is a linear combination of $\partial^\alpha\pi(x + t\tau z)$ with $|\alpha| = 4$ times monomials of degree 4 in $z$.
Using $\partial^\alpha\pi = \pi\cdot P_\alpha\big(\{\partial^\beta\log\pi\}_{1\le|\beta|\le 4}\big)$ and the definitions of $M_\pi, \mathfrak{R}_\pi$, we obtain for $\|z\| \le \tau^{-1}$:
$$\big|\tilde{R}_4(x, \tau z)\big| \le C_1\,\pi(x)\,\mathfrak{R}_\pi(x)\,\big(1 + M_\pi(x)\big)^K\,\|z\|_2^4,$$
for a constant $C_1$ depending only on $D$. Plugging into $A_\tau, B_\tau$ and using the symmetry of $e^{-\|z\|_2}$ (odd moments vanish) gives
$$A_\tau(x) = \tau^D\Big(M_0\,\pi(x) + \frac{\tau^2}{2}\,M_2\,\Delta\pi(x)\Big) + \tau^{D+4}\,a_{\pi,\tau}(x),$$
$$B_\tau(x) = \tau^{D+2}\,\big(M_2\,\nabla\pi(x)\big) + \tau^{D+4}\,b_{\pi,\tau}(x),$$
where $\Delta = \sum_{i=1}^D \partial_{ii}$ and
$$|a_{\pi,\tau}(x)| + \|b_{\pi,\tau}(x)\| \le C_2\,\pi(x)\,\mathfrak{R}_\pi(x)\,\big(1 + M_\pi(x)\big)^K, \qquad \tau \in (0, \tau_0].$$

Step 2: Tail Region $\|z\| > \tau^{-1}$. Since $\int_{\|z\|>\tau^{-1}} e^{-\|z\|_2}\,dz \lesssim e^{-1/\tau}\,\tau^{-(D-1)}$ and $\|\pi\|_\infty < \infty$ under instance noise, the tail contributions to $A_\tau, B_\tau$ are bounded by $e^{-1/\tau}$ times a polynomial in $1/\tau$. Hence they are $o(\tau^N)$ for every $N$ and can be absorbed into the $\tau^{D+4}$ remainders by shrinking $\tau_0$ if needed.

Step 3: Expansion for $\nabla A_\tau$ and a Ratio Identity. By the same decomposition (main region + tail) we obtain
$$\nabla A_\tau(x) = \tau^D\Big(M_0\,\nabla\pi(x) + \frac{\tau^2}{2}\,M_2\,\nabla\Delta\pi(x)\Big) + \tau^{D+4}\,\tilde{a}_{\pi,\tau}(x), \qquad \|\tilde{a}_{\pi,\tau}(x)\| \le C_3\,\pi(x)\,\mathfrak{R}_\pi(x)\,\big(1 + M_\pi(x)\big)^K.$$
We use the explicit algebraic identity: if $A = a_0 + \tau^2 a_2 + \tau^4 a_4$ with $a_0 > 0$ and $B = \tau^2(b_2 + \tau^2 b_4)$, then
$$\frac{B}{A} = \tau^2\,\frac{b_2}{a_0} + \tau^4\,\frac{b_4 a_0 - b_2 a_2}{a_0\,\big(a_0 + \tau^2 a_2 + \tau^4 a_4\big)}.$$
Applying this to $B_\tau/A_\tau$ and to $\nabla A_\tau/A_\tau$, using $a_0(x) = M_0\,\pi(x)$, $b_2(x) = M_2\,\nabla\pi(x)$, and $c_D = M_2/M_0$, yields
$$\frac{B_\tau(x)}{A_\tau(x)} = c_D\,\tau^2\,\frac{\nabla A_\tau(x)}{A_\tau(x)} + \tau^4\,R_{\pi,\tau}(x) = c_D\,\tau^2\,s_{\pi,\tau}(x) + \tau^4\,R_{\pi,\tau}(x).$$
Using the bounds on $a_{\pi,\tau}, b_{\pi,\tau}, \tilde{a}_{\pi,\tau}$ above together with
$$\|\nabla\pi(x)\| \le \pi(x)\sup_{\|u-x\|\le 1}\|\nabla\log\pi(u)\| \le \pi(x)\,M_\pi(x), \qquad |\Delta\pi(x)| \le \pi(x)\,C\big(1 + M_\pi(x)\big)^2,$$
one checks that all factors of $\pi(x)$ cancel in the ratio remainder, and thus
$$\|R_{\pi,\tau}(x)\| \le C\,\big(1 + M_\pi(x)\big)^K\,\big(1 + \mathfrak{R}_\pi(x)\big) = C\,U_\pi(x),$$
for a constant $C$ depending only on $D$, $\eta$ and kernel moments. This proves the pointwise expansion and bound. Finally, the uniform $L^2$ statement follows immediately from Assumption 2. ∎

Main Proof of Theorem 3′. Fix $a \ge 0$ and $\tau = \tau_D = \bar{\tau} D^a$. For each $\tau \in (0, \tau_0]$, let
$$f^\star(\tau) \in \operatorname*{arg\,min}_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f), \qquad g^\star(\tau) \in \operatorname*{arg\,min}_{g\in\mathcal{G}} \mathcal{L}_{\mathrm{SM}}(g).$$
We first show that drifting optimality implies equality of mean-shift fields. By Assumption 3, there exists $f_{\mathrm{data}} \in \mathcal{F}$ with $q_{f_{\mathrm{data}}} = p$. Then $\mathcal{L}_{\mathrm{drift}}(f_{\mathrm{data}}) = 0$ for every $\tau$, hence $\min_f \mathcal{L}_{\mathrm{drift}}(f) = 0$ and therefore $\mathcal{L}_{\mathrm{drift}}(f^\star(\tau)) = 0$. Equivalently,
$$V_{p,k_\tau}(x) = V_{q_{f^\star(\tau)},k_\tau}(x) \qquad \text{for } q_{f^\star(\tau)}\text{-a.e. } x.$$
We then convert mean-shift equality into a bound on the score mismatch. Apply Lemma 1 with $\pi = p$ and $\pi = q_{f^\star(\tau)}$. For $q_{f^\star(\tau)}$-a.e. $x$,
$$c_D\,\tau^2\,s_{p,\tau}(x) + \tau^4 R_{p,\tau}(x) = c_D\,\tau^2\,s_{q_{f^\star(\tau)},\tau}(x) + \tau^4 R_{q_{f^\star(\tau)},\tau}(x),$$
hence
$$\Delta s_{f^\star(\tau)}(x) = s_{p,\tau}(x) - s_{q_{f^\star(\tau)},\tau}(x) = \frac{\tau^2}{c_D}\Big(R_{q_{f^\star(\tau)},\tau}(x) - R_{p,\tau}(x)\Big).$$
Now we square and integrate under $q_{f^\star(\tau)}$. Let $X \sim q_{f^\star(\tau)}$.
Using $(a + b)^2 \le 2a^2 + 2b^2$,
$$D_{\mathrm{rF}}(p\|q_{f^\star(\tau)}) = \mathbb{E}\big[\|\Delta s_{f^\star(\tau)}(X)\|_2^2\big] \le \frac{2\tau^4}{c_D^2}\,\mathbb{E}\big[\|R_{q_{f^\star(\tau)},\tau}(X)\|_2^2 + \|R_{p,\tau}(X)\|_2^2\big].$$
By Assumption 2 and Lemma 1, the expectation on the right-hand side is bounded uniformly for $\tau \in (0, \tau_0]$, hence
$$D_{\mathrm{rF}}(p\|q_{f^\star(\tau)}) = O(\tau^4), \qquad \tau \to 0.$$
Since $D$ is fixed and $\tau = \bar{\tau} D^a$, we have $\tau^4 = D^{4a}\,\bar{\tau}^4$, so equivalently
$$D_{\mathrm{rF}}(p\|q_{f^\star(\tau)}) = O(\bar{\tau}^4), \qquad \bar{\tau} \to 0,$$
with a constant independent of $\bar{\tau}$. Finally, we compare to the score-matching minimizer. By realizability and the population identification of score matching (as used elsewhere in this paper), $q_{g^\star(\tau)} = p$. Therefore
$$D_{\mathrm{rF}}\big(q_{g^\star(\tau)}\|q_{f^\star(\tau)}\big) = D_{\mathrm{rF}}\big(p\|q_{f^\star(\tau)}\big) = O(\bar{\tau}^4).$$

Appendix C Proof of Theorem 6

In this section, we prove Theorem 6, which establishes a proxy relationship between score matching and the drifting model at their optimal distributions in the high-dimensional regime. We state a rigorous version of this relationship below.

Theorem 6′ (High-Dimensional Agreement Between Mean-Shift and Score Matching). Assume Assumption 1, Assumption 3, and Assumptions 4–7. Fix $\bar{\tau} > 0$ and $a \ge 0$, and set $\tau = \bar{\tau} D^a$. For each $D$, let
$$f^\star_D \in \operatorname*{arg\,min}_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f).$$
Then
$$D_{\mathrm{rF}}\big(p\,\|\,q_{f^\star_D}\big) = O\big(D^{-(1+2a)}\big).$$
Moreover, by realizability (Assumption 3) there exists $g_{\mathrm{data},D} \in \mathcal{G}$ such that $q_{g_{\mathrm{data},D}} = p$, and hence
$$D_{\mathrm{rF}}\big(q_{g_{\mathrm{data},D}}\,\|\,q_{f^\star_D}\big) = D_{\mathrm{rF}}\big(p\,\|\,q_{f^\star_D}\big) = O\big(D^{-(1+2a)}\big).$$
The hidden constant in $O(\cdot)$ is independent of $D$ and of the learnable parameters.

C.1 Proof Roadmap and Technical Assumptions

Proof Roadmap. The proof has two conceptual pieces. First, under the shell and inner-product moment conditions, independent samples from any two distributions in our model/data family have pairwise distance tightly concentrated around the common scale $\rho = \sqrt{2}\,R_0$ at rate $1/\sqrt{D}$ (Step 1). Second, drifting does not sample neighbors uniformly: it reweights neighbors by the Laplace kernel. With bounded feature norms, this kernel reweighting cannot distort distances too much, so the kernel-reweighted neighbor radius $r = \|y - x\|_2$ still concentrates around $\rho$ with $\mathbb{E}(r - \rho)^2 = O(1/D)$, uniformly in $D$ and without any restriction on $\bar{\tau}$ (Step 2). These two concentration facts are then plugged into the preconditioned-score decomposition (Step 3), which writes mean shift as a scaled kernel score plus a covariance residual; concentration makes both the preconditioner fluctuation and the residual $O(1/\sqrt{D})$ in $L^2$, yielding the field alignment $V_f(x) \approx (\tau\rho)\,\Delta s_f(x)$ with mean-square error $O(1/D)$. Finally, at the distribution-level optimum, drifting enforces $V_{f^\star} \equiv 0$, so alignment implies $\mathbb{E}_{q_{f^\star}}\|\Delta s_{f^\star}\|_2^2 = O\big((\tau\rho)^{-2} D^{-1}\big) = O\big(D^{-(1+2a)}\big)$; score matching identifies $q_{g^\star} = p$, giving the stated Fisher-divergence decay (Step 4).

Technical Assumptions. The assumptions below assert that sample norms concentrate around a common scale, and that pairwise inner products between independent samples are small on average.

Assumption 4 (Shell around a Common Scale $R_0$). There exists $\sigma \ge 0$ independent of $D$ such that for every $\mu$ in the above family, if $x \sim \mu$ then
$$\mathbb{E}\big(\|x\|_2^2 - R_0^2\big)^2 \le \frac{\sigma^2 R_0^4}{D}.$$
Assumption 5 (Mixed Inner-Product Second Moment). There exists $\kappa \ge 0$ independent of $D$ such that for any $\mu, \nu$ in the above family, if $x \sim \mu$ and $y \sim \nu$ are independent then
$$\mathbb{E}\,\langle x, y\rangle^2 \le \frac{\kappa R_0^4}{D}.$$

Assumption 6 (Fourth-Moment Controls). There exist constants $C_{\mathrm{norm},4}, C_{\mathrm{ip},4} < \infty$ independent of $D$ such that: (1) for every $\mu$ in the above family, if $x \sim \mu$ then $\mathbb{E}\big(\|x\|_2^2 - R_0^2\big)^4 \le C_{\mathrm{norm},4}\,R_0^8/D^2$; and (2) for any $\mu, \nu$ in the above family, if $x \sim \mu$ and $y \sim \nu$ are independent then $\mathbb{E}\,\langle x, y\rangle^4 \le C_{\mathrm{ip},4}\,R_0^8/D^2$.

Assumption 7 (Bounded (Feature) Norm). There exists $B < \infty$ independent of $D$ such that for every $\mu$ in the above family, if $x \sim \mu$ then $\|x\|_2 \le B$ almost surely. Drifting-model pipelines that rely on pretrained feature maps typically enforce explicit norm control, for instance via $\ell_2$-normalization, LayerNorm followed by clipping, or direct norm clipping. Under such preprocessing, Assumption 7 holds naturally with a known constant $B$ (often $B = 1$ for $\ell_2$-normalized embeddings), so this assumption is realistic in practice.

C.2 Auxiliary Tools and the Main Proof

Step 1: Mixed Distances Concentrate.

Lemma 2 (Mixed Distance Concentrates in $L^2$ around $\rho = \sqrt{2}\,R_0$). Assume Assumptions 4 and 5. Draw $x \sim \mu$ and $y \sim \nu$ independently (any $\mu, \nu$ from the family), and set $S := \|x - y\|_2$. Then
$$\mathbb{E}(S - \rho)^2 \le \frac{C_{\mathrm{dist}}}{D}, \qquad C_{\mathrm{dist}} := 3\big(2\sigma^2 + 4\kappa\big)R_0^2.$$

Proof. Write
$$S^2 = \|x\|_2^2 + \|y\|_2^2 - 2\langle x, y\rangle, \qquad \rho^2 = 2R_0^2.$$
For $b > 0$, $(\sqrt{a} - \sqrt{b})^2 \le (a - b)^2/b$, hence
$$(S - \rho)^2 \le \frac{(S^2 - \rho^2)^2}{\rho^2}.$$
Now $S^2 - \rho^2 = (\|x\|_2^2 - R_0^2) + (\|y\|_2^2 - R_0^2) - 2\langle x, y\rangle$. Using $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$,
$$(S^2 - \rho^2)^2 \le 3\Big((\|x\|_2^2 - R_0^2)^2 + (\|y\|_2^2 - R_0^2)^2 + 4\langle x, y\rangle^2\Big).$$
Taking expectations and applying Assumptions 4 and 5 gives
$$\mathbb{E}(S^2 - \rho^2)^2 \le 3\Big(\frac{\sigma^2 R_0^4}{D} + \frac{\sigma^2 R_0^4}{D} + \frac{4\kappa R_0^4}{D}\Big) = \frac{3(2\sigma^2 + 4\kappa)R_0^4}{D}.$$
Divide by $\rho^2 = 2R_0^2$ and absorb the factor $1/2$ into the constant. ∎

Lemma 3 (Fourth-Moment Bound for Mixed Distances). Assume Assumption 6. Draw $x \sim \mu$ and $y \sim \nu$ independently, and set $S := \|x - y\|_2$ and $\rho = \sqrt{2}\,R_0$. Then there exists $C_{\mathrm{dist},4}$ independent of $D$ such that
$$\mathbb{E}(S - \rho)^4 \le \frac{C_{\mathrm{dist},4}}{D^2}.$$

Proof. Let $A := \|x\|_2^2 - R_0^2$, $B := \|y\|_2^2 - R_0^2$, $W := \langle x, y\rangle$. Then $S^2 - \rho^2 = A + B - 2W$. Using $(a + b + c)^4 \le 27(a^4 + b^4 + c^4)$ with $c = -2W$,
$$(S^2 - \rho^2)^4 \le 27\big(A^4 + B^4 + 16 W^4\big).$$
Taking expectations and applying Assumption 6 yields
$$\mathbb{E}(S^2 - \rho^2)^4 \le 27\Big(\frac{C_{\mathrm{norm},4} R_0^8}{D^2} + \frac{C_{\mathrm{norm},4} R_0^8}{D^2} + \frac{16\,C_{\mathrm{ip},4} R_0^8}{D^2}\Big) = \frac{27\big(2 C_{\mathrm{norm},4} + 16 C_{\mathrm{ip},4}\big)R_0^8}{D^2}.$$
Finally, for $b > 0$, $(\sqrt{a} - \sqrt{b})^4 \le (a - b)^4/b^2$ with $a = S^2$ and $b = \rho^2 = 2R_0^2$ gives
$$(S - \rho)^4 \le \frac{(S^2 - \rho^2)^4}{\rho^4} = \frac{(S^2 - \rho^2)^4}{4R_0^4}.$$
Combine the bounds. ∎
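Lemma 2 is easy to probe numerically. The sketch below uses our own illustrative family (both $\mu$ and $\nu$ uniform on the sphere of radius $R_0$, which satisfies Assumptions 4 and 5 with $\sigma = 0$) and checks that $D\cdot\mathbb{E}(S - \rho)^2$ stays bounded as $D$ grows:

```python
import numpy as np

# Monte Carlo check of Lemma 2: the mixed distance S = ||x - y|| concentrates
# around rho = sqrt(2) * R0 with E(S - rho)^2 = O(1/D).
rng = np.random.default_rng(0)
R0, n = 1.0, 100_000
for D in (16, 64, 256, 1024):
    x = rng.normal(size=(n, D)); x *= R0 / np.linalg.norm(x, axis=1, keepdims=True)
    y = rng.normal(size=(n, D)); y *= R0 / np.linalg.norm(y, axis=1, keepdims=True)
    S = np.linalg.norm(x - y, axis=1)
    rho = np.sqrt(2) * R0
    print(D, np.mean((S - rho) ** 2) * D)   # roughly constant in D => O(1/D) rate
```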
Step 2: Kernel Reweighting Preserves the $O(1/D)$ Scale. In the remainder we specialize to the Laplace kernel
$$k_\tau(x,y) = \exp\Big(-\frac{\|x-y\|_2}{\tau}\Big), \qquad \text{so that} \qquad b_\tau(r) = \frac{\tau}{r}.$$

Lemma 4 (Kernel-Reweighted Radii Still Concentrate, with No Constraint on $\bar{\tau}$). Assume Assumption 7 and Lemma 2. Fix any $\mu, \nu$ from the family and draw $x \sim \nu$. Conditionally on $x$, draw $y$ from the kernel-reweighted conditional density
$$\mu_\tau(y\,|\,x) := \frac{k_\tau(x,y)\,\mu(y)}{\mathbb{E}_{y'\sim\mu}\big[k_\tau(x,y')\big]}.$$
Let $r := \|y - x\|_2$ and $\rho = \sqrt{2}\,R_0$. Then for every $\tau > 0$,
$$\mathbb{E}(r - \rho)^2 \le \exp\Big(\frac{2B}{\tau}\Big)\cdot\frac{C_{\mathrm{dist}}}{D},$$
where $B$ is from Assumption 7 and $C_{\mathrm{dist}}$ is from Lemma 2. In particular, if $\tau = \bar{\tau} D^a$ with $a \ge 0$, then $\tau \ge \bar{\tau}$ for all $D \ge 1$ and hence
$$\mathbb{E}(r - \rho)^2 \le \exp\Big(\frac{2B}{\bar{\tau}}\Big)\cdot\frac{C_{\mathrm{dist}}}{D},$$
so the hidden constant is independent of $D$, with no lower-bound requirement on $\bar{\tau} > 0$.

Proof. Condition on $x$. By the definition of $\mu_\tau(\cdot\,|\,x)$,
$$\mathbb{E}\big[(r - \rho)^2\,\big|\,x\big] = \frac{\mathbb{E}_{y'\sim\mu}\big[(\|y' - x\|_2 - \rho)^2\,e^{-\|y'-x\|_2/\tau}\,\big|\,x\big]}{\mathbb{E}_{y'\sim\mu}\big[e^{-\|y'-x\|_2/\tau}\,\big|\,x\big]}.$$
Since $e^{-u/\tau} \le 1$, the numerator is at most $\mathbb{E}\big[(\|y' - x\|_2 - \rho)^2\,\big|\,x\big]$. For the denominator, Jensen's inequality gives
$$\mathbb{E}\big[e^{-\|y'-x\|_2/\tau}\,\big|\,x\big] \ge \exp\Big(-\frac{1}{\tau}\,\mathbb{E}\big[\|y' - x\|_2\,\big|\,x\big]\Big).$$
Under Assumption 7, $\|x\|_2 \le B$ and $\|y'\|_2 \le B$ almost surely, so $\|y' - x\|_2 \le 2B$ almost surely and hence $\mathbb{E}[\|y' - x\|_2\,|\,x] \le 2B$. Thus
$$\mathbb{E}\big[(r - \rho)^2\,\big|\,x\big] \le \exp\Big(\frac{2B}{\tau}\Big)\,\mathbb{E}\big[(\|y' - x\|_2 - \rho)^2\,\big|\,x\big].$$
Taking the expectation over $x \sim \nu$ and using the independence of $x \sim \nu$ and $y' \sim \mu$ yields
$$\mathbb{E}(r - \rho)^2 \le \exp\Big(\frac{2B}{\tau}\Big)\,\mathbb{E}\big[(\|y' - x\|_2 - \rho)^2\big].$$
Finally, Lemma 2 bounds the right-hand side by $\exp(2B/\tau)\cdot C_{\mathrm{dist}}/D$. ∎

Step 3: Field Alignment. For the Laplace kernel, $\nabla_x\log k_\tau(x,y) = \frac{1}{\tau}\frac{y-x}{\|y-x\|_2}$ (with the diagonal convention), so the kernel-score identity gives
$$s_{\pi,\tau}(x) = \frac{1}{\tau}\,\mathbb{E}_{y\sim\pi_\tau(\cdot|x)}\Big[\frac{y-x}{\|y-x\|_2}\Big] \quad\Longrightarrow\quad \|s_{\pi,\tau}(x)\|_2 \le \frac{1}{\tau}.$$

Lemma 5 (Residual Bound for $\delta_{\pi,\tau}$ of the Laplace Kernel). Assume that the second moment of $r = \|y - x\|_2$ under $\pi_\tau(\cdot\,|\,x)$ is finite. Then
$$\|\delta_{\pi,\tau}(x)\|_2 \le 2\sqrt{\mathrm{Var}(r\,|\,x)}.$$

Proof. For Laplace, $b_\tau(r) = \tau/r$, so $b_\tau(r)^{-1} = r/\tau$ and
$$b_\tau(r)\,(y - x) = \tau\cdot u, \qquad u := \frac{y-x}{\|y-x\|_2}, \qquad \|u\|_2 \le 1.$$
By definition, $\delta_{\pi,\tau}(x) = \mathrm{Cov}_{y\sim\pi_\tau(\cdot|x)}\big(r/\tau,\ \tau u\big)$. Cauchy-Schwarz for scalar-vector covariance yields
$$\|\delta_{\pi,\tau}(x)\|_2 \le \sqrt{\mathrm{Var}(r/\tau\,|\,x)}\cdot\sqrt{\mathbb{E}\big\|\tau u - \mathbb{E}[\tau u]\big\|_2^2}.$$
Since $\|\tau u - \mathbb{E}[\tau u]\|_2 \le 2\tau$, the second factor is at most $2\tau$, hence
$$\|\delta_{\pi,\tau}(x)\|_2 \le 2\tau\sqrt{\mathrm{Var}(r/\tau\,|\,x)} = 2\sqrt{\mathrm{Var}(r\,|\,x)}. \qquad \text{∎}$$

Now we prove the objective-alignment result, Theorem 4′. A rigorous statement is as follows:

Theorem 4′ (Large-$D$ Field Alignment at $1/D$ Rate). Assume Assumptions 4, 5 and 7, and let $\rho = \sqrt{2}\,R_0$ and $C := \tau\rho = \bar{\tau}\rho D^a$. Then there exists a constant $K > 0$ independent of $D$ and of the learnable parameters such that for all sufficiently large $D$ and all $f \in \mathcal{F}$,
$$\mathbb{E}_{x\sim q_f}\big\|V_f(x) - C\,\Delta s_f(x)\big\|_2^2 \le \frac{K}{D}.$$

Proof. Apply Theorem 2 to $\pi = p$ and $\pi = q_f$ and subtract:
$$V_f = \tau^2\big(\alpha_{p,\tau}\,s_{p,\tau} - \alpha_{q_f,\tau}\,s_{q_f,\tau}\big) + \big(\delta_{p,\tau} - \delta_{q_f,\tau}\big).$$
Adding and subtracting $(\rho/\tau)\,s_{p,\tau}$ and $(\rho/\tau)\,s_{q_f,\tau}$ inside the first bracket gives
$$V_f = C\,\Delta s_f + \tau^2\Big(\big(\alpha_{p,\tau} - \rho/\tau\big)s_{p,\tau} - \big(\alpha_{q_f,\tau} - \rho/\tau\big)s_{q_f,\tau}\Big) + \big(\delta_{p,\tau} - \delta_{q_f,\tau}\big).$$
Using $(a + b)^2 \le 2a^2 + 2b^2$ repeatedly, it suffices to bound, for $\pi \in \{p, q_f\}$,
$$\mathbb{E}_{x\sim q_f}\big\|\tau^2\big(\alpha_{\pi,\tau}(x) - \rho/\tau\big)\,s_{\pi,\tau}(x)\big\|_2^2 \qquad\text{and}\qquad \mathbb{E}_{x\sim q_f}\|\delta_{\pi,\tau}(x)\|_2^2$$
by $O(1/D)$ with constants independent of $f$.

(i) Preconditioner term. For Laplace, $b_\tau(r) = \tau/r$, so $\alpha_{\pi,\tau}(x) = \mathbb{E}[r\,|\,x]/\tau$, hence $\tau\alpha_{\pi,\tau}(x) = \mathbb{E}[r\,|\,x]$.
Using $\|s_{\pi,\tau}(x)\|_2 \le 1/\tau$,
$$\big\|\tau^2\big(\alpha_{\pi,\tau} - \rho/\tau\big)\,s_{\pi,\tau}\big\|_2 \le \tau^2\Big|\alpha_{\pi,\tau} - \frac{\rho}{\tau}\Big|\cdot\frac{1}{\tau} = \big|\tau\alpha_{\pi,\tau} - \rho\big|.$$
By Jensen,
$$\big(\tau\alpha_{\pi,\tau}(x) - \rho\big)^2 = \big(\mathbb{E}[r\,|\,x] - \rho\big)^2 \le \mathbb{E}\big[(r - \rho)^2\,\big|\,x\big].$$
Averaging over $x \sim q_f$ and applying Lemma 4 yields
$$\mathbb{E}_{x\sim q_f}\big(\tau\alpha_{\pi,\tau}(x) - \rho\big)^2 = O(1/D), \qquad \pi \in \{p, q_f\}.$$

(ii) Residual term. By Lemma 5,
$$\|\delta_{\pi,\tau}(x)\|_2^2 \le 4\,\mathrm{Var}(r\,|\,x) \le 4\,\mathbb{E}\big[(r - \rho)^2\,\big|\,x\big],$$
since $\mathbb{E}[(r - \rho)^2\,|\,x] = \mathrm{Var}(r\,|\,x) + (\mathbb{E}[r\,|\,x] - \rho)^2 \ge \mathrm{Var}(r\,|\,x)$. Averaging over $x \sim q_f$ and applying Lemma 4 again gives $O(1/D)$. Combining the bounds yields $\mathbb{E}\|V_f - C\,\Delta s_f\|_2^2 \le K/D$ for some $K$ depending only on $(\sigma, \kappa, R_0, B, \bar{\tau}, a)$ (and not on the learnable parameters). ∎

Corollary 4.1 (Drift Controls Score Mismatch). Assume Theorem 4′. Then for every $f \in \mathcal{F}$,
$$\mathbb{E}_{x\sim q_f}\|\Delta s_f(x)\|_2^2 \le \frac{2}{C^2}\,\mathbb{E}_{x\sim q_f}\|V_f(x)\|_2^2 + \frac{2K}{C^2 D}, \qquad C = \tau\rho.$$

Proof. From Theorem 4′, write $V_f = C\,\Delta s_f + e_f$ with $\mathbb{E}\|e_f\|_2^2 \le K/D$. Then
$$C^2\|\Delta s_f\|_2^2 = \|V_f - e_f\|_2^2 \le 2\|V_f\|_2^2 + 2\|e_f\|_2^2.$$
Take the expectation over $x \sim q_f$. ∎

Step 4: Score-Matching Minimizers Imply $q_{g^\star} = p$.

Lemma 6 (Kernel Score Identity for Laplace Smoothing). For any probability measure $\mu$ on $\mathbb{R}^D$, the function $\mu_{k_\tau}$ is Lipschitz and hence differentiable Lebesgue-a.e. Moreover, wherever $\nabla\mu_{k_\tau}(x)$ exists,
$$\nabla\mu_{k_\tau}(x) = \mathbb{E}_{y\sim\mu}\big[\nabla_x k_\tau(x,y)\big], \qquad s_{\mu,\tau}(x) = \nabla\log\mu_{k_\tau}(x).$$

Proof. For fixed $y$, define $\nabla_x k_\tau(x,y) := \frac{1}{\tau}\,k_\tau(x,y)\,u(x,y)$ for $x \ne y$ and $0$ on the diagonal, where $\|u(x,y)\|_2 \le 1$. Then $\|\nabla_x k_\tau(x,y)\|_2 \le 1/\tau$, so $x \mapsto k_\tau(x,y)$ is $1/\tau$-Lipschitz, and therefore $\mu_{k_\tau}$ is $1/\tau$-Lipschitz as an expectation. By Rademacher's theorem, $\mu_{k_\tau}$ is differentiable Lebesgue-a.e. At differentiability points, dominated convergence justifies differentiating under the expectation to obtain $\nabla\mu_{k_\tau}(x) = \mathbb{E}_{y\sim\mu}[\nabla_x k_\tau(x,y)]$. Dividing by $\mu_{k_\tau}(x) > 0$ yields $\nabla\log\mu_{k_\tau} = s_{\mu,\tau}$ at those points. ∎

Lemma 7 (Injectivity of Laplace Smoothing). Let $k_\tau(z) = \exp(-\|z\|_2/\tau)$ and let $\mu, \nu$ be finite Borel measures on $\mathbb{R}^D$. If $\mu * k_\tau = \nu * k_\tau$ as functions, then $\mu = \nu$ as measures.

Proof. Take Fourier transforms: $\hat{\mu}(\omega)\,\hat{k}_\tau(\omega) = \hat{\nu}(\omega)\,\hat{k}_\tau(\omega)$. For the radial Laplace kernel, $\hat{k}_\tau(\omega) > 0$ for all $\omega$ (e.g. $\hat{k}_\tau(\omega) = c_{D,\tau}\big(1 + \tau^2\|\omega\|_2^2\big)^{-(D+1)/2}$ with $c_{D,\tau} > 0$), so $\hat{\mu} = \hat{\nu}$ and hence $\mu = \nu$. ∎

Proposition 2 (SM Identification: $\mathcal{L}_{\mathrm{SM}}(g) = 0 \Rightarrow q_g = p$). Assume Assumption 1. If $\mathcal{L}_{\mathrm{SM}}(g) = 0$, i.e.
$$\mathbb{E}_{x\sim q_g}\big\|s_{p,\tau}(x) - s_{q_g,\tau}(x)\big\|_2^2 = 0,$$
then $q_g = p$ as distributions on $\mathbb{R}^D$.

Proof. $\mathcal{L}_{\mathrm{SM}}(g) = 0$ implies $s_{p,\tau}(x) = s_{q_g,\tau}(x)$ for $q_g$-a.e. $x$. By Assumption 1, $q_g$ has a density that is strictly positive Lebesgue-a.e., hence the equality holds Lebesgue-a.e. By Lemma 6, wherever both gradients exist,
$$\nabla\log p_{k_\tau}(x) = \nabla\log (q_g)_{k_\tau}(x) \qquad \text{for Lebesgue-a.e. } x.$$
Thus $\nabla h(x) = 0$ Lebesgue-a.e. for $h(x) := \log p_{k_\tau}(x) - \log (q_g)_{k_\tau}(x)$. Since both smoothed functions are positive and Lipschitz, $h$ is locally Lipschitz. A locally Lipschitz function with zero gradient a.e. is constant, so $h(x) \equiv \log c$ and $p_{k_\tau}(x) = c\,(q_g)_{k_\tau}(x)$ for Lebesgue-a.e. $x$. By continuity, the equality holds for all $x$.
Proposition 2 (SM Identification: $\mathcal{L}_{\mathrm{SM}}(g) = 0 \Rightarrow q_g = p$). Assume Assumption 1. If $\mathcal{L}_{\mathrm{SM}}(g) = 0$, i.e.
$$\mathbb{E}_{x\sim q_g}\big\|s_{p,\tau}(x) - s_{q_g,\tau}(x)\big\|_2^2 = 0,$$
then $q_g = p$ as distributions on $\mathbb{R}^D$.

Proof. $\mathcal{L}_{\mathrm{SM}}(g) = 0$ implies $s_{p,\tau}(x) = s_{q_g,\tau}(x)$ for $q_g$-a.e. $x$. By Assumption 1, $q_g$ has a density that is strictly positive Lebesgue-a.e., hence the equality holds Lebesgue-a.e. By Lemma 6, wherever both gradients exist,
$$\nabla\log(p * k_\tau)(x) = \nabla\log(q_g * k_\tau)(x) \quad\text{for Lebesgue-a.e. } x.$$
Thus $\nabla h(x) = 0$ Lebesgue-a.e. for $h(x) := \log(p * k_\tau)(x) - \log(q_g * k_\tau)(x)$. Since both smoothed densities are positive and Lipschitz, $h$ is locally Lipschitz, and a locally Lipschitz function with zero gradient a.e. is constant; hence $h(x) \equiv \log c$ and $(p * k_\tau)(x) = c\,(q_g * k_\tau)(x)$ for Lebesgue-a.e. $x$. By continuity, the equality holds for all $x$. Integrating both sides over $\mathbb{R}^D$ and using Fubini together with translation invariance,
$$\int_{\mathbb{R}^D} (\mu * k_\tau)(x)\,dx = \Big(\int_{\mathbb{R}^D} k_\tau(z)\,dz\Big)\int_{\mathbb{R}^D} \mu(dy),$$
so $\int p * k_\tau = \int (q_g) * k_\tau$ and therefore $c = 1$. Hence $p * k_\tau = (q_g) * k_\tau$, and Lemma 7 gives $p = q_g$. ∎

Proof of Theorem 6′. Let
$$f^\star \in \operatorname*{argmin}_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f), \qquad g^\star \in \operatorname*{argmin}_{g} \mathcal{L}_{\mathrm{SM}}(g).$$
Fix $a \ge 0$ and use the dimension-aware bandwidth $\tau = \bar\tau D^a$ with $\bar\tau > 0$. By realizability, $\inf_g \mathcal{L}_{\mathrm{SM}}(g) = 0$, hence $\mathcal{L}_{\mathrm{SM}}(g^\star) = 0$ and Proposition 2 gives $q_{g^\star} = p$. Likewise, by realizability $\inf_{f\in\mathcal{F}} \mathcal{L}_{\mathrm{drift}}(f) = 0$, hence any minimizer satisfies $\mathcal{L}_{\mathrm{drift}}(f^\star) = 0$, i.e.
$$\mathbb{E}_{x\sim q_{f^\star}}\|V_{f^\star}(x)\|_2^2 = 0.$$
Applying Corollary 4.1 to $f^\star$,
$$\mathbb{E}_{x\sim q_{f^\star}}\|\Delta s_{f^\star}(x)\|_2^2 \le \frac{2K}{C^2 D}, \qquad C = \tau\rho = \bar\tau\rho D^a.$$
Since $\Delta s_{f^\star} = s_{p,\tau} - s_{q_{f^\star},\tau}$, by definition
$$D_{\mathrm{rF}}(p\,\|\,q_{f^\star}) = \mathbb{E}_{x\sim q_{f^\star}}\|\Delta s_{f^\star}(x)\|_2^2 \le \frac{2K}{(\rho\bar\tau)^2}\cdot\frac{1}{D^{1+2a}}.$$
The equivalent statement with $q_{g^\star}$ follows from $q_{g^\star} = p$. ∎

C.3 Gradient-Level Alignment

The stop-gradient solver optimizes the drifting loss by treating the transported target $\mathrm{sg}\big(x + \Delta_{p,q_\theta}(x)\big)$ as a constant for backpropagation. Consequently, the update direction is the semi-gradient
$$g_{\mathrm{drift}}(\theta) := \nabla_\theta \mathcal{L}_{\mathrm{drift}}(\theta) = -2\,\mathbb{E}_\varepsilon\big[J_{f_\theta}(\varepsilon)^\top \Delta_{p,q_\theta}\big(f_\theta(\varepsilon)\big)\big], \qquad q_\theta := (f_\theta)_\# p_\varepsilon.$$
This differs from the full gradient of the population functional $\theta \mapsto \mathbb{E}_{x\sim q_\theta}\|\Delta_{p,q_\theta}(x)\|_2^2$, which would additionally differentiate through the dependence of $\Delta_{p,q_\theta}$ on $q_\theta$. The appropriate notion of "gradient-level equivalence" for the implemented drifting algorithm is therefore alignment between the implemented semi-gradient directions.

In the Laplace-kernel setting, our field-alignment results imply that the drifting discrepancy field
$$\Delta_{p,q_\theta}(x) = \eta\big(V_{p,k_\tau}(x) - V_{q_\theta,k_\tau}(x)\big)$$
is close to a scaled score-mismatch field. Recalling the definition of the score $s_{\pi,\tau}(x) := \nabla_x \log(\pi * k_\tau)(x)$, we define the scale-$\tau$ score mismatch
$$\Delta^s_{p,q_\theta}(x) := \eta\,C\,\big(s_{p,\tau}(x) - s_{q_\theta,\tau}(x)\big), \qquad C := \tau\rho.$$
To compare the implemented drifting update with an update driven by score mismatch, we introduce the score-transport projection loss
$$\mathcal{L}_{\mathrm{ST\text{-}proj}}(\theta) := \mathbb{E}_\varepsilon\Big[\big\|f_\theta(\varepsilon) - \mathrm{sg}\big(f_\theta(\varepsilon) + \Delta^s_{p,q_\theta}(f_\theta(\varepsilon))\big)\big\|_2^2\Big], \tag{13}$$
whose semi-gradient is
$$g_{\mathrm{ST}}(\theta) := \nabla_\theta \mathcal{L}_{\mathrm{ST\text{-}proj}}(\theta) = -2\,\mathbb{E}_\varepsilon\big[J_{f_\theta}(\varepsilon)^\top \Delta^s_{p,q_\theta}\big(f_\theta(\varepsilon)\big)\big].$$
We remark that the population score-matching objective is
$$\mathcal{L}_{\mathrm{SM}}(\theta) := \mathbb{E}_{x\sim q_{g_\theta}}\big\|s_{p,\tau}(x) - s_{q_{g_\theta},\tau}(x)\big\|_2^2,$$
and it involves no stop-gradient. The full gradient $\nabla_\theta \mathcal{L}_{\mathrm{SM}}(\theta)$ differentiates through the $\theta$-dependence of the smoothed score $s_{q_{g_\theta},\tau}$ and is not, in general, equal to the semi-gradient of $\mathcal{L}_{\mathrm{ST\text{-}proj}}$. The role of $\mathcal{L}_{\mathrm{ST\text{-}proj}}$ is purely to provide an algorithm-level comparator that matches the transport–projection structure of the drifting stop-gradient update.
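To make the semi-gradient structure concrete, the sketch below mirrors the stop-gradient update in PyTorch, with `detach()` playing the role of $\mathrm{sg}(\cdot)$. The generator, data, and kernel-weighted field are toy stand-ins of ours (not the paper's released code); the point is only that backpropagating through $\|f_\theta(\varepsilon) - \mathrm{sg}(f_\theta(\varepsilon) + \Delta)\|_2^2$ yields exactly $-2\,J_{f_\theta}^\top \Delta$ per sample.

```python
# Minimal sketch of the stop-gradient drifting update, assuming a toy linear
# generator and a Laplace-kernel mean-shift field. Illustrative only.
import torch

torch.manual_seed(0)
D, n = 8, 256
f_theta = torch.nn.Linear(D, D)     # stand-in one-step generator
data = torch.randn(1024, D)         # stand-in sample from the data distribution p

def drift_field(x, samples_p, samples_q, tau=1.0, eta=1.0):
    """Laplace-kernel discrepancy eta*(V_p - V_q) at the points x."""
    def mean_shift(x, ys):
        d = ys[None, :, :] - x[:, None, :]     # displacements (n, m, D)
        r = d.norm(dim=-1, keepdim=True)       # pairwise distances
        w = torch.exp(-r / tau)                # Laplace weights e^{-r/tau}
        return (w * d).sum(1) / w.sum(1)       # kernel-weighted mean shift
    return eta * (mean_shift(x, samples_p) - mean_shift(x, samples_q))

eps = torch.randn(n, D)
x = f_theta(eps)
with torch.no_grad():                          # field is held fixed for backprop
    delta = drift_field(x, data, x)
target = (x + delta).detach()                  # sg(x + Delta)
loss = ((x - target) ** 2).sum(-1).mean()
loss.backward()        # semi-gradient: -2 * E[J_f^T Delta] (batch average)
```

Because `target` is detached, the only $\theta$-dependence left in the loss is through $f_\theta(\varepsilon)$, which is precisely what makes the update a semi-gradient rather than the full population gradient.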
First, we state the Jacobian control needed to turn field alignment into gradient alignment.

Assumption 8 (Uniform Jacobian Second Moment). There exists $J_2 < \infty$ (uniform in $\theta$ and in $D$ for the considered model class) such that
$$\mathbb{E}_\varepsilon\big[\|J_{f_\theta}(\varepsilon)\|_{\mathrm{op}}^2\big] \le J_2 \quad\text{for all } \theta.$$

We now state a rigorous version of Theorem 5 and then prove it.

Theorem 5′ (Large-$D$ Alignment). Assume the hypotheses of Theorem 4 and, additionally, Assumption 8. Let $\Delta_{p,q_\theta}(x) = \eta\big(V_{p,k_\tau}(x) - V_{q_\theta,k_\tau}(x)\big)$ and $\Delta^s_{p,q_\theta}(x) = \eta\,C\,\big(s_{p,\tau}(x) - s_{q_\theta,\tau}(x)\big)$ with $C = \tau\rho$. Then there exists $D_0 \in \mathbb{N}$ such that for all $D \ge D_0$ and all $\theta$,
$$\big\|g_{\mathrm{drift}}(\theta) - g_{\mathrm{ST}}(\theta)\big\|_2 \le 2\,\eta\,\sqrt{\frac{J_2 K}{D}},$$
where $K$ is the constant from Theorem 4.

Proof. By definition of the two semi-gradients,
$$g_{\mathrm{drift}}(\theta) - g_{\mathrm{ST}}(\theta) = -2\,\mathbb{E}_\varepsilon\big[J_{f_\theta}(\varepsilon)^\top\big(\Delta_{p,q_\theta}(x) - \Delta^s_{p,q_\theta}(x)\big)\big], \qquad x = f_\theta(\varepsilon).$$
Using $\|A^\top v\|_2 \le \|A\|_{\mathrm{op}}\|v\|_2$ and Cauchy–Schwarz,
$$\big\|g_{\mathrm{drift}}(\theta) - g_{\mathrm{ST}}(\theta)\big\|_2 \le 2\,\sqrt{\mathbb{E}_\varepsilon\|J_{f_\theta}(\varepsilon)\|_{\mathrm{op}}^2}\;\sqrt{\mathbb{E}_\varepsilon\big\|\Delta_{p,q_\theta}(x) - \Delta^s_{p,q_\theta}(x)\big\|_2^2}.$$
By Assumption 8, the first factor is at most $\sqrt{J_2}$. For the second factor,
$$\Delta_{p,q_\theta}(x) - \Delta^s_{p,q_\theta}(x) = \eta\Big(\big(V_{p,k_\tau}(x) - V_{q_\theta,k_\tau}(x)\big) - C\,\big(s_{p,\tau}(x) - s_{q_\theta,\tau}(x)\big)\Big).$$
Applying Theorem 4 (with $f = f_\theta$) yields, for all $D \ge D_0$,
$$\mathbb{E}_{x\sim q_\theta}\big\|\big(V_{p,k_\tau}(x) - V_{q_\theta,k_\tau}(x)\big) - C\,\big(s_{p,\tau}(x) - s_{q_\theta,\tau}(x)\big)\big\|_2^2 \le \frac{K}{D}.$$
Therefore
$$\mathbb{E}_\varepsilon\big\|\Delta_{p,q_\theta}(x) - \Delta^s_{p,q_\theta}(x)\big\|_2^2 = \mathbb{E}_{x\sim q_\theta}\big\|\Delta_{p,q_\theta}(x) - \Delta^s_{p,q_\theta}(x)\big\|_2^2 \le \frac{\eta^2 K}{D},$$
and the claim follows. ∎

Remark 1 (Scope). Theorem 5 concerns the implemented stop-gradient update directions for drifting, comparing them with a score-transport projection update that uses the score mismatch as the transport field. It does not claim alignment with the full gradient $\nabla_\theta \mathcal{L}_{\mathrm{SM}}(\theta)$ of the score-matching objective, which differentiates through the $\theta$-dependence of $s_{q_{g_\theta},\tau}$.

Corollary 5.1 (Cosine Similarity of Update Directions). Assume the hypotheses of Theorem 5. Fix any $\gamma > 0$ and consider parameters $\theta$ such that $\min\{\|g_{\mathrm{drift}}(\theta)\|_2, \|g_{\mathrm{ST}}(\theta)\|_2\} \ge \gamma$. Then for all $D \ge D_0$,
$$\cos\angle\big(g_{\mathrm{drift}}(\theta),\, g_{\mathrm{ST}}(\theta)\big) \ge 1 - \frac{16\,\eta^2 J_2 K}{\gamma^2 D}.$$
In particular, if $\|g_{\mathrm{ST}}(\theta)\|_2$ is bounded below by a constant $\gamma > 0$ independent of $D$, the two update directions become asymptotically parallel, with $\cos\angle(g_{\mathrm{drift}}, g_{\mathrm{ST}}) \ge 1 - O(D^{-1})$.

Proof. We may assume $\|g_{\mathrm{ST}}(\theta)\|_2 \ge \gamma$. Let $a := g_{\mathrm{ST}}(\theta)$ and $b := g_{\mathrm{drift}}(\theta)$. From Theorem 5, $\|b - a\|_2 \le \delta_D$ with $\delta_D := 2\eta\sqrt{J_2 K / D}$. Write $b = a + e$ with $\|e\| \le \delta_D$, and set $t := \delta_D/\gamma$. Let $\varphi := \angle(a, b)$ (we write $\varphi$ for the angle to avoid a clash with the parameter $\theta$). The component of $e$ perpendicular to $a$ satisfies $\|P_\perp e\| \le \|e\| \le \delta_D$, and the reverse triangle inequality gives $\|b\| \ge \|a\| - \|e\| \ge \gamma - \delta_D$. Therefore
$$\sin\varphi = \frac{\|P_\perp e\|}{\|b\|} \le \frac{\delta_D}{\gamma - \delta_D} = \frac{t}{1-t}.$$
For all $D$ large enough that $t \le 1/2$ (equivalently $\delta_D \le \gamma/2$, which holds for $D \ge D_1 := 16\eta^2 J_2 K/\gamma^2$), we have $\varphi \le \pi/2$. For $\varphi \in [0, \pi/2]$,
$$1 - \cos\varphi = \frac{\sin^2\varphi}{1 + \cos\varphi} \le \sin^2\varphi,$$
since $1 + \cos\varphi \ge 1$. Combining the two bounds and using $(1-t)^2 \ge 1/4$ for $t \le 1/2$,
$$1 - \cos\angle\big(g_{\mathrm{drift}}(\theta),\, g_{\mathrm{ST}}(\theta)\big) \le \sin^2\varphi \le \frac{t^2}{(1-t)^2} \le 4t^2 = \frac{4\delta_D^2}{\gamma^2} = \frac{16\,\eta^2 J_2 K}{\gamma^2 D}. \qquad ∎$$
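The last chain of inequalities is purely geometric, so it is easy to stress-test numerically. The sketch below (ours, with randomly generated vectors) draws $a$ with $\|a\| = \gamma$ and perturbations $e$ with $\|e\| \le \delta \le \gamma/2$, and verifies $1 - \cos\angle(a, a+e) \le 4\delta^2/\gamma^2$.

```python
# Numerical check of the geometric step in Corollary 5.1: perturbing a by e
# with ||e|| <= delta <= gamma/2 keeps 1 - cos(angle) <= 4*delta^2/gamma^2.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    dim = rng.integers(2, 50)
    a = rng.normal(size=dim)
    gamma = np.linalg.norm(a)
    delta = rng.uniform(0, gamma / 2)              # enforce t = delta/gamma <= 1/2
    e = rng.normal(size=dim)
    e *= delta * rng.uniform() / np.linalg.norm(e) # rescale so ||e|| <= delta
    b = a + e
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    assert 1 - cos <= 4 * delta**2 / gamma**2 + 1e-12
```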