
Paper deep dive

Robust Physics-Guided Diffusion for Full-Waveform Inversion

Jishen Peng, Enze Jiang, Zheng Ma, Xiongbin Yan

Year: 2026 · Venue: arXiv preprint · Area: math.NA · Type: Preprint · Embeddings: 109

Abstract

We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based data-consistency potential (Wasserstein-2), incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to amplitude imbalance and time/phase misalignment. On the inference side, we introduce a preconditioned guided reverse-diffusion scheme that adapts the guidance strength and spatial scaling throughout the reverse-time dynamics, yielding a more stable and effective data-consistency guidance step than standard diffusion posterior sampling (DPS). Numerical experiments on OpenFWI datasets demonstrate improved reconstruction quality over deterministic optimization baselines and standard DPS under comparable computational budgets.

Tags

ai-safety (imported, 100%) · mathna (suggested, 92%) · preprint (suggested, 88%)



Full Text



Robust Physics-Guided Diffusion for Full-Waveform Inversion

Jishen Peng¹, Enze Jiang¹, Zheng Ma, Xiong-Bin Yan (yanxb2015@163.com)

¹ Jishen Peng and Enze Jiang contributed equally to this work.

School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China; Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, China; Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China; CMA-Shanghai, Shanghai Jiao Tong University, Shanghai, China; School of Mathematics and Statistics, Lanzhou University, Lanzhou, China.

Abstract. We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based data-consistency potential (Wasserstein-2), incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to amplitude imbalance and time/phase misalignment. On the inference side, we introduce a preconditioned guided reverse-diffusion scheme that adapts the guidance strength and spatial scaling throughout the reverse-time dynamics, yielding a more stable and effective data-consistency guidance step than standard diffusion posterior sampling (DPS). Numerical experiments on OpenFWI datasets demonstrate improved reconstruction quality over deterministic optimization baselines and standard DPS under comparable computational budgets.

Keywords. Full-waveform inversion; Optimal transport; Physics-guided diffusion; Preconditioned guidance

1 Introduction

In geophysics, seismic inversion is widely used to infer quantitative subsurface models that match observed seismic recordings.
From a mathematical viewpoint, recovering medium parameters from time-dependent wavefield measurements constitutes a nonlinear and ill-posed inverse problem, primarily due to limited acquisition geometry and band-limited observations. Several classes of methods have been developed, including velocity analysis based on stacked traces [1], migration and traveltime-based methods [2, 3], Born-approximation-based linearized inversion [4, 5], and full-waveform inversion (FWI) [6, 7, 8]. Among these, FWI is distinguished by its potential to recover high-resolution subsurface structure by exploiting both phase and amplitude information through the solution of a PDE-constrained optimization problem that minimizes a chosen data-misfit functional between observed and synthetic seismograms. Full-waveform inversion (FWI) is a PDE-constrained inverse problem that aims to reconstruct subsurface parameters (e.g., wavespeed or velocity) from time-dependent seismic recordings [9, 10, 6]. Given an acquisition geometry and a forward wave-propagation operator $\mathcal{F}$, the classical formulation seeks a velocity model $v$ such that $\mathcal{F}(v)$ matches the observed data $d_{\mathrm{obs}}$, where $v \in \mathbb{R}^m$ denotes the discretized velocity field and $\mathcal{F}: \mathbb{R}^m \to \mathbb{R}^n$ maps model parameters to recorded traces. Despite its ability to recover high-resolution structures by exploiting the full wavefield content, FWI is widely recognized as a challenging nonconvex optimization problem: the forward map is nonlinear, the inverse problem is ill-conditioned, and the misfit landscape typically contains many local minima due to phase ambiguity and limited illumination [9, 11, 12, 13]. These difficulties are especially pronounced when the initial model is inaccurate and when the observations are contaminated by noise (and other data errors). In addition, strong multi-scale components with markedly different amplitudes can further bias residual-based objectives and degrade stability [14, 9].
Since pointwise $\ell_2$ objectives penalize the pointwise residual in time, they are highly sensitive to small time/phase shifts between observed and synthetic traces. This sensitivity leads to a strongly nonconvex objective landscape and the well-known cycle-skipping phenomenon [15, 16], in which iterative methods can become trapped in local minima associated with incorrect phase (kinematic) alignment. Moreover, seismic traces often exhibit pronounced dynamic-range imbalance: early-arriving phases can have substantially larger amplitudes than later reflections, causing both the misfit and its gradient to be dominated by a small portion of the data [17]. As a consequence, the resulting gradients can be dominated by high-amplitude components, leading updates to underfit weaker later phases. A large body of work aims to mitigate these issues through multi-scale continuation [14], alternative misfit functions based on phase/envelope attributes [18, 19], adaptive waveform inversion [20], and geometry-aware metrics such as optimal-transport/Wasserstein distances [12, 21, 22]; see also [23] for a recent review of misfit functions and adjoint sources. In parallel with advances in data misfit design, recent years have witnessed rapid progress in learning expressive priors for inverse problems using generative modeling. In seismic imaging, a range of generative models, including GAN-based geological priors, latent-variable models such as VAEs, and normalizing flows, have been explored to mitigate ill-posedness and to facilitate uncertainty quantification [24, 25, 26]. Score-based diffusion models (also known as score-based generative models) provide a particularly attractive mechanism: they learn the score field $\nabla_x \log p_t(x)$ of progressively noised marginals and enable conditional generation by reversing the associated diffusion dynamics [27, 28].
From the perspective of inverse problems, diffusion models can be interpreted as providing a learned prior over plausible velocity fields, trained solely from samples of the velocity model (i.e., parameter-field samples), without requiring paired data of the form (wavefield, velocity) and without embedding a forward solver in the prior-training loop; conditional inference is then performed at test time by incorporating data-consistency (likelihood) information during reverse-time sampling or related posterior-sampling schemes [29, 30]. This decoupling is especially appealing in seismic imaging, where repeated wavefield simulation is expensive and acquisition geometries may vary across applications; accordingly, diffusion-based Bayesian FWI and diffusion-driven velocity modeling have started to be investigated in recent work [31, 32]. To solve the inverse problem given observations, guided reverse diffusion methods combine the learned prior score with a data-consistency term induced by a potential $\Phi$ (i.e., a negative log-likelihood, or more generally a data-misfit functional that measures agreement between simulated data and the observed data). Diffusion posterior sampling (DPS) is a representative method: at reverse step $i$ it forms a clean-state estimate $\hat{v}_0^{(i)}$, which is a denoised estimate of the underlying clean variable associated with the current noisy state. This estimate is computed using the learned prior score $s_{\theta^*}(\cdot, i)$. A likelihood-guidance step is then applied using the gradient $\nabla \Phi(\hat{v}_0^{(i)})$. However, a direct use of the baseline DPS update with a fixed global guidance step size is poorly matched to the characteristics of full-waveform inversion (FWI) in two fundamental respects. First, the effectiveness of DPS critically depends on the choice of the potential $\Phi$. In FWI, conventional pointwise $\ell_2$ (residual-based) potentials inherit two structural deficiencies.
On the one hand, the quadratic residual makes the induced descent direction disproportionately influenced by large-amplitude arrivals, resulting in markedly unbalanced updates across traces and time windows. On the other hand, the pointwise $\ell_2$ geometry is highly sensitive to small time/phase shifts, which typically produces a highly nonconvex misfit landscape with spurious local minima (cycle-skipping) associated with incorrect phase alignment. When such a potential is used for likelihood guidance in DPS, these effects can yield correction terms whose scale and direction are poorly matched to the current reverse-time state, thereby degrading the stability and effectiveness of the reverse-time dynamics. Second, even with a robust potential, a single scalar guidance strength in baseline DPS cannot capture the spatially varying sensitivity of the misfit to the velocity field $v$. Early reverse-time steps often yield a rough $\hat{v}_0^{(i)}$, so overly aggressive guidance can be unreliable, whereas later steps typically benefit from stronger data-consistency enforcement. Moreover, illumination and acquisition geometry induce strong spatial variability in sensitivity, so a single global step size may overshoot in well-illuminated regions while remaining ineffective in poorly illuminated ones. These mismatches can reduce stability and efficiency along the reverse-time iterations and ultimately degrade reconstruction quality.

Contributions. This work develops a robust physics-guided diffusion framework for full-waveform inversion (FWI), combining a score-based generative prior with an OT-based data-consistency potential and a stabilized, variable-metric guidance mechanism in guided reverse diffusion. Our main contributions are:

• An OT-based data-consistency potential for physics-guided diffusion.
We design a robust likelihood potential for guided reverse diffusion by combining bounded amplitude-adaptive weighting of seismic traces with a one-dimensional Wasserstein discrepancy computed via quantile functions. When used as the data-consistency term in DPS-type guidance, the resulting guidance signal is less dominated by high-amplitude arrivals and less sensitive to small time/phase misalignment, thereby alleviating cycle-skipping and yielding a more balanced and stable data-consistency mechanism for FWI.

• Adaptive, variable-metric guidance for guided reverse diffusion. We introduce a variable-metric (diagonal) preconditioner for the guidance step, replacing the scalar DPS step size by a matrix $P_i = \rho_i D_i$. Here $\rho_i$ adapts the guidance strength along the reverse trajectory to suppress unreliable early updates, while the diagonal scaling $D_i$ balances the spatially heterogeneous sensitivity of the misfit with respect to the velocity field in FWI. This yields a more stable and effective guidance mechanism than standard DPS in seismic settings.

• Numerical validation. We evaluate the proposed method on OpenFWI benchmarks and compare against deterministic optimization-based inversion baselines and the original DPS under matched computational budgets, demonstrating improved reconstruction quality and stability. We further assess generalization on other standard benchmarks beyond OpenFWI, where the method maintains strong performance.

The remainder of the paper is organized as follows. In Section 2 we state the FWI problem and cast it in Bayesian form. Section 3 recalls score-based diffusion models and the learned prior score needed for reverse-time dynamics. Section 4 derives a conditional reverse-diffusion formulation for Bayesian inversion and presents a baseline discretization in the spirit of diffusion posterior sampling (DPS).
Section 5 introduces our robust physics-guided diffusion scheme, including an OT-based data-consistency potential and an adaptive, variable-metric guidance mechanism. Section 6 provides numerical results, comparative studies on OpenFWI benchmarks, and generalization tests on additional standard benchmark datasets. Section 7 concludes with a summary of the main results.

2 Problem formulation

2.1 Full-waveform inversion (FWI)

Full-waveform inversion (FWI) aims to recover a spatially varying wave speed $v(x)$ from seismic observations recorded at receivers $\{x_r\}_{r=1}^{N_r}$. Given observed seismograms $d_r(t) = d(x_r, t)$, the estimate of $v$ is typically obtained by minimizing a waveform-misfit functional between observed and simulated data, subject to the governing wave equation. By exploiting the full waveform information, including phase, amplitude, and multiple arrivals, FWI can in principle achieve higher resolution than kinematic approaches such as traveltime tomography [33, 34, 35]. At the same time, the resulting PDE-constrained optimization problem is strongly nonlinear and generally nonconvex; robust performance therefore requires careful treatment of noise and modeling errors, together with appropriate regularization or prior information. In what follows, the simulated data are generated by a forward operator $\mathcal{F}(v)$ defined through the solution of an acoustic wave equation. Specifically, we consider seismic wave propagation under the constant-density acoustic approximation in an isotropic medium. The wavefield $u(x, t)$ denotes the acoustic pressure, and the unknown parameter is the spatially varying wave speed $v(x) > 0$. Neglecting variable-density and anisotropic effects leads to the standard Laplacian as the spatial operator. Let $\Omega \subset \mathbb{R}^d$ ($d = 2, 3$) be the domain and $T > 0$ the final time.
For a given source term $s(x, t)$, the forward problem reads

$$\frac{1}{v^2(x)}\,\partial_{tt} u(x,t) - \Delta u(x,t) = s(x,t), \qquad (x,t) \in \Omega \times (0,T],$$
$$u(x,0) = 0, \quad \partial_t u(x,0) = 0, \qquad x \in \Omega,$$
$$\partial_n u(x,t) = 0, \qquad (x,t) \in \Gamma_{\mathrm{ref}} \times (0,T], \tag{1}$$

where $\Gamma_{\mathrm{ref}} \subset \partial\Omega$ denotes the (top) reflecting boundary and $\partial_n$ is the outward normal derivative. On the remaining boundary $\Gamma_{\mathrm{abs}} := \partial\Omega \setminus \Gamma_{\mathrm{ref}}$, we impose absorbing boundary conditions, implemented in practice via perfectly matched layers (PML), in order to minimize artificial reflections from the truncation of the computational domain. We take the domain to be the rectangle $\Omega = [0, L_x] \times [0, L_z] \subset \mathbb{R}^2$ with $x = (x, z)$, where the reflecting boundary corresponds to the top side $\Gamma_{\mathrm{ref}} = \{(x,z) \in \partial\Omega : z = L_z\}$. We discretize $\Omega$ by a tensor-product grid with nodes $\{(x_i, z_j)\}_{i=1,\dots,m_x;\, j=1,\dots,m_z}$. The wave-speed field is represented by its nodal values $v_{ij} := v(x_i, z_j)$, collected in a matrix $V = \{v_{ij}\} \in \mathbb{R}^{m_x \times m_z}$, or equivalently by the vectorization $v = \mathrm{vec}(V) \in \mathbb{R}^m$, where $m = m_x m_z$. Receivers are deployed exclusively on the top boundary $\Gamma_{\mathrm{ref}} \subset \partial\Omega$. We denote the discrete receiver set by $\Gamma_r := \{x_r\}_{r=1}^{N_r} \subset \Gamma_{\mathrm{ref}}$. For the $j$-th experiment with source term $s(x, t; \xi_j)$, we solve (1) (with this source) to obtain the wavefield $u_j(x, t; v, \xi_j)$. Let $R$ denote the receiver sampling (restriction) operator, defined by $(R u_j)_r(t) := u_j(x_r, t; v, \xi_j)$, $r = 1, \dots, N_r$, so that the corresponding noise-free synthetic data for experiment $j$ are $d_r^{(j)}(t) = (R u_j)_r(t)$.
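To make the forward problem (1) concrete, the following is a minimal sketch of one second-order finite-difference (leapfrog) time step for the constant-density acoustic wave equation on a uniform grid. The function name `acoustic_step` and the grid layout are our own illustration, not the authors' solver; the free-surface and PML boundary treatments described above are omitted, and the outermost ring of grid points is simply held at zero.

```python
import numpy as np

def acoustic_step(u_prev, u_curr, v, src, dt, dx):
    """One leapfrog step of (1/v^2) u_tt - Lap(u) = s on a uniform grid.

    u_prev, u_curr : wavefields at time steps n-1 and n, shape (nz, nx)
    v              : wave speed sampled on the same grid
    src            : source term s(x, t_n) sampled on the grid
    Illustrative only: boundaries are held at zero (no free surface / PML).
    """
    lap = np.zeros_like(u_curr)
    # 5-point Laplacian on interior nodes
    lap[1:-1, 1:-1] = (
        u_curr[2:, 1:-1] + u_curr[:-2, 1:-1]
        + u_curr[1:-1, 2:] + u_curr[1:-1, :-2]
        - 4.0 * u_curr[1:-1, 1:-1]
    ) / dx**2
    # u_tt ~ (u_next - 2 u_curr + u_prev) / dt^2  =>  solve for u_next
    return 2.0 * u_curr - u_prev + (v * dt) ** 2 * (lap + src)
```

Time-stepping this update for every source $\xi_j$ and restricting the wavefield to the receiver locations yields the synthetic traces $(R u_j)_r(t)$; stability requires the usual CFL condition on $v\,dt/dx$.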
Stacking the traces across all experiments and receivers defines the parameter-to-observable operator

$$\mathcal{F}: v \longmapsto \{(R u_j)_r(t)\}_{j=1,\dots,N_s;\, r=1,\dots,N_r},$$

which maps the velocity model to the predicted receiver recordings. After discretizing time at $\{t_k\}_{k=1}^{N_T}$, the data can be viewed as an array

$$d = \{d_{j,r}(t_k)\} \in \mathbb{R}^{N_s \times N_r \times N_T}, \qquad (\mathcal{F}(v))_{j,r,k} = (R u_j)_r(t_k).$$

The observation model is then

$$d_{\mathrm{obs}} = \mathcal{F}(v) + \eta, \tag{2}$$

where $\eta$ denotes the measurement noise. We adopt an additive Gaussian noise $\eta \sim \mathcal{N}(0, \Sigma)$, where $\Sigma$ is the noise covariance in data space. In the simplest case of independent and identically distributed noise, $\Sigma = \sigma^2 I$, so that $\eta \sim \mathcal{N}(0, \sigma^2 I)$ with $\sigma > 0$ controlling the noise level. Classically, FWI is formulated as the minimization of a data-misfit functional (possibly augmented by regularization). In this work, we instead adopt a Bayesian formulation and infer $v$ through the posterior, where data consistency is encoded by the likelihood potential $\Phi$ (i.e., the negative log-likelihood), defined below.

2.2 Bayesian formulation of FWI

Given the observation model (2), we cast FWI as a Bayesian inverse problem and seek the subsurface velocity $v$ through the posterior distribution $p(v \mid d_{\mathrm{obs}})$. In this work, we use the posterior formulation primarily as a computational target for a diffusion-based inversion algorithm, rather than pursuing full posterior uncertainty quantification.

Likelihood as a data-consistency potential. Rather than fixing the likelihood to a pointwise $\ell_2$ discrepancy a priori, we express it through a general data-consistency potential $\Phi(\cdot\,; d_{\mathrm{obs}})$ (i.e., a negative log-likelihood up to an additive constant), following the standard Bayesian inverse problem formulation; see, e.g., [36, 37]:

$$p(d_{\mathrm{obs}} \mid v) \propto \exp\big(-\Phi(v; d_{\mathrm{obs}})\big). \tag{3}$$

This formulation decouples the data-error model (2) from the choice of data-misfit functional used to quantify agreement between simulated and observed data. As a classical special case, the additive Gaussian noise model in (2) yields

$$p(d_{\mathrm{obs}} \mid v) \propto \exp\Big(-\tfrac{1}{2}\,\|\mathcal{F}(v) - d_{\mathrm{obs}}\|_{\Sigma^{-1}}^2\Big), \qquad \|r\|_{\Sigma^{-1}}^2 := r^\top \Sigma^{-1} r, \tag{4}$$

corresponding to $\Phi(v; d_{\mathrm{obs}}) = \tfrac{1}{2}\|\mathcal{F}(v) - d_{\mathrm{obs}}\|_{\Sigma^{-1}}^2$. Throughout the paper, the observation $d_{\mathrm{obs}}$ is fixed. We often suppress this dependence and write $\Phi(v) \equiv \Phi(v; d_{\mathrm{obs}})$. In what follows, we keep $\Phi$ general until Section 5, where we instantiate it by a robust OT-based waveform misfit tailored to amplitude dominance, cycle-skipping, and objective scaling.

Prior. To regularize the inverse problem and encode admissible structural assumptions on the subsurface, we endow the wave speed $v$ with a prior distribution $\pi_0(dv)$ (with density $p(v)$ in the finite-dimensional discretization). Such prior information is crucial in FWI, which is typically ill-posed due to limited acquisition geometry and band-limited data, and hence may admit non-unique and unstable reconstructions under purely data-driven fitting. A common choice is a Gibbs-type prior of the form

$$p(v) \propto \exp\big(-\mu\,\mathcal{R}(v)\big),$$

where $\mathcal{R}(v)$ is a regularization functional and $\mu > 0$ controls its strength. In this case, the negative log-prior contributes the term $\mu\,\mathcal{R}(v)$ to the posterior objective, so that classical deterministic regularization is recovered at the level of maximum a posteriori (MAP) estimation.

Posterior. Combining the likelihood (3) with the prior $p(v)$ via Bayes' rule gives the posterior distribution

$$p(v \mid d_{\mathrm{obs}}) \propto p(d_{\mathrm{obs}} \mid v)\, p(v) \propto \exp\big(-\Phi(v; d_{\mathrm{obs}})\big)\, p(v). \tag{5}$$

Once $\Phi$ is specified, the posterior combines the data-fit term encoded by the likelihood with the prior information encoded by $p(v)$.
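For the Gaussian special case (4) with i.i.d. noise $\Sigma = \sigma^2 I$, the potential reduces to a scaled sum of squared residuals. A minimal sketch (function names `phi_l2` and `neg_log_posterior` are our own, for illustration only):

```python
import numpy as np

def phi_l2(d_syn, d_obs, sigma=1.0):
    # Phi(v; d_obs) = 0.5 * ||F(v) - d_obs||^2 / sigma^2  (Sigma = sigma^2 I)
    r = (np.asarray(d_syn) - np.asarray(d_obs)).ravel()
    return 0.5 * np.dot(r, r) / sigma**2

def neg_log_posterior(d_syn, d_obs, v, reg, mu=1.0, sigma=1.0):
    # -log p(v | d_obs) up to a constant: Phi(v; d_obs) + mu * R(v),
    # with a Gibbs prior p(v) ~ exp(-mu * R(v)); `reg` is any regularizer R.
    return phi_l2(d_syn, d_obs, sigma) + mu * reg(v)
```

Minimizing `neg_log_posterior` over $v$ recovers exactly the MAP / regularized-FWI objective discussed next.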
Connection to deterministic FWI (MAP). Classical deterministic formulations of FWI are often posed as minimizing a data-misfit functional, possibly augmented by regularization. Within the Bayesian formulation, such approaches can be interpreted as computing a MAP estimate. Taking the negative logarithm of (5) yields, up to an additive constant,

$$-\log p(v \mid d_{\mathrm{obs}}) = \Phi(v; d_{\mathrm{obs}}) - \log p(v) + C,$$

where $C$ is independent of $v$. Consequently,

$$v_{\mathrm{MAP}} = \arg\min_{v \in \mathcal{V}}\; \Phi(v; d_{\mathrm{obs}}) - \log p(v), \tag{6}$$

where $\mathcal{V}$ denotes the admissible set (e.g., enforcing $v > 0$). If $\Phi$ is chosen as a quadratic (Gaussian) data-misfit and $p(v) \propto \exp(-\mu\,\mathcal{R}(v))$, then (6) reduces to a classical regularized PDE-constrained optimization problem for FWI. In the remainder of the paper, we compute posterior-driven reconstructions using a diffusion-based solver with a learned prior score and a data-consistency potential $\Phi$.

3 Score-Based Generative Model

Diffusion models, also known as score-based generative models [28, 27, 38], define a generative process by gradually perturbing samples with Gaussian noise and then learning to reverse this corruption process. From the score-based viewpoint, these models are grounded in the estimation of the score function, i.e., the gradient of the log-density with respect to the variable of interest.

3.1 Setup and notation

We identify the discretized velocity field $v \in \mathbb{R}^m$ ($m = m_x m_z$) with the data vector $x_0$ and write $x_0 := v$. This is purely a notational convention to align with diffusion-model formulations, where $x_0$ denotes an uncorrupted sample at diffusion time $t = 0$. We regard $x_0$ as a random vector following an unknown distribution $p_{\mathrm{data}}$ on $\mathbb{R}^m$, representing the population of plausible subsurface models. In our Bayesian setting, $p_{\mathrm{data}}$ plays the role of the prior $p(v)$, from which we assume access to samples.
3.2 Forward diffusion process

A diffusion model begins by defining a continuum of progressively noised variables $\{x(t)\}_{t \in [0, T_{\mathrm{diff}}]}$ through an Itô stochastic differential equation (SDE) [39]

$$dx(t) = f(x(t), t)\,dt + g(t)\,dw_t, \qquad x(0) = x_0 \sim p_{\mathrm{data}},$$

where $w_t$ is a standard $m$-dimensional Wiener process, $f$ is the drift, and $g$ is the diffusion coefficient. Under mild regularity assumptions [40], the law of $x(t)$ admits a density $p_t$, and $\{p_t\}_{t \in [0, T_{\mathrm{diff}}]}$ evolves according to the Kolmogorov forward (Fokker–Planck) equation [39]

$$\partial_t p_t(x) = -\nabla \cdot \big(f(x,t)\, p_t(x)\big) + \tfrac{1}{2}\, g^2(t)\, \Delta p_t(x), \qquad p_{t=0} = p_{\mathrm{data}}.$$

In particular, the forward SDE induces a family of distributions $\{p_t\}$ obtained by Gaussian perturbations of $p_{\mathrm{data}}$.

3.3 Variance-preserving diffusion and transition kernel

We adopt the variance-preserving (VP) SDE [28], defined by the linear Itô SDE

$$dx(t) = -\tfrac{1}{2}\, \beta(t)\, x(t)\,dt + \sqrt{\beta(t)}\,dw_t, \qquad t \in [0, T_{\mathrm{diff}}], \tag{7}$$

where $\beta(t) > 0$ is a prescribed noise schedule. In our setting, $x(0) = x_0$ represents a clean (vectorized) velocity model, while $x(t)$ denotes its progressively perturbed version. Since (7) is linear with time-dependent coefficients, it admits an explicit mild solution. Define the integrating factor

$$\alpha(t) := \exp\Big(-\tfrac{1}{2} \int_0^t \beta(\tau)\,d\tau\Big).$$

Applying Itô's formula to $\alpha(t)^{-1} x(t)$ yields

$$x(t) = \alpha(t)\, x_0 + \alpha(t) \int_0^t \alpha(s)^{-1} \sqrt{\beta(s)}\,dw_s = \alpha(t)\, x_0 + \int_0^t \exp\Big(-\tfrac{1}{2}\int_s^t \beta(\tau)\,d\tau\Big) \sqrt{\beta(s)}\,dw_s. \tag{8}$$

The stochastic integral in (8) is Gaussian with mean zero. Setting

$$\hat{\sigma}^2(t) := 1 - \exp\Big(-\int_0^t \beta(\tau)\,d\tau\Big) = 1 - \alpha^2(t),$$

we obtain the Gaussian transition kernel

$$x(t) \mid x_0 \sim \mathcal{N}\big(\alpha(t)\, x_0,\ \hat{\sigma}^2(t)\, I\big), \tag{9}$$

or, equivalently, the reparameterization $x(t) = \alpha(t)\, x_0 + \hat{\sigma}(t)\, z$, $z \sim \mathcal{N}(0, I)$. If the schedule is chosen so that $\int_0^{T_{\mathrm{diff}}} \beta(\tau)\, d\tau$ is sufficiently large, then $\alpha(T_{\mathrm{diff}}) \approx 0$ and $\hat{\sigma}^2(T_{\mathrm{diff}}) \approx 1$, implying that $x(T_{\mathrm{diff}})$ is approximately standard Gaussian and essentially independent of $x_0$. In practice we sample $x(T_{\mathrm{diff}}) \sim \mathcal{N}(0, I)$ to initialize the reverse-time procedure.

3.4 Reverse-time dynamics and score learning

Let $q_t(\cdot \mid x_0)$ denote the transition kernel of the Markov diffusion (7), and define, for each $t \in [0, T_{\mathrm{diff}}]$, the time-$t$ density

$$p_t(x) := \int_{\mathbb{R}^m} q_t(x \mid x_0)\, p_{\mathrm{data}}(x_0)\, dx_0,$$

whenever the density exists. The main object of interest is the score field $s^\star(x, t) := \nabla_x \log p_t(x)$, i.e., the logarithmic derivative of the marginal density induced by the forward diffusion at time $t$ (equivalently, at the corresponding noise level). Under mild regularity assumptions, a time-reversal formula for diffusion processes [41] implies that the reverse-time dynamics associated with (7) can be expressed in terms of the score. In the VP case, where $f(x,t) = -\tfrac{1}{2}\beta(t)x$ and $g^2(t) = \beta(t)$, the reverse-time SDE can be written as

$$dx(t) = \Big(-\tfrac{1}{2}\beta(t)\, x(t) - \beta(t)\, s^\star(x(t), t)\Big)\,dt + \sqrt{\beta(t)}\,d\bar{w}_t, \qquad t \in [0, T_{\mathrm{diff}}], \tag{10}$$

interpreted in reverse time, i.e., the process is run backward from $t = T_{\mathrm{diff}}$ to $t = 0$, where $\bar{w}_t$ is a Wiener process in the reverse-time parameterization. We approximate the true score $s^\star(x,t) = \nabla_x \log p_t(x)$ by a neural network $s_\theta(x, t)$ (the score network), parameterized by $\theta$.
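The transition kernel (9) means a noisy sample $x(t)$ can be drawn from $x_0$ in closed form, without simulating the SDE. A small sketch, assuming a piecewise-constant schedule $\beta$ sampled on a uniform time grid (function names are our own):

```python
import numpy as np

def vp_coeffs(beta, t_idx, dt):
    # alpha(t) = exp(-0.5 * int_0^t beta), sigma^2(t) = 1 - alpha(t)^2   (Eq. (9))
    integral = np.sum(beta[:t_idx]) * dt     # left Riemann sum of the schedule
    alpha = np.exp(-0.5 * integral)
    sigma = np.sqrt(1.0 - alpha**2)
    return alpha, sigma

def perturb(x0, beta, t_idx, dt, rng):
    # reparameterization: x(t) = alpha(t) * x0 + sigma(t) * z,  z ~ N(0, I)
    alpha, sigma = vp_coeffs(beta, t_idx, dt)
    return alpha * x0 + sigma * rng.standard_normal(x0.shape)
```

Note that `vp_coeffs` preserves the variance identity $\alpha^2(t) + \hat\sigma^2(t) = 1$ exactly, which is the defining property of the VP diffusion.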
A natural training objective is the mean-squared error regression

$$\mathcal{L}_1(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0, T_{\mathrm{diff}})}\, \mathbb{E}_{x \sim p_t} \big[\kappa(t)\, \|s_\theta(x, t) - \nabla_x \log p_t(x)\|_2^2\big], \tag{11}$$

where $\kappa(t) > 0$ reweights the contributions of different noise levels. However, (11) is not directly computable in practice, since the marginal density $p_t$ is induced by the unknown data distribution $p_{\mathrm{data}}$. Since the marginal score $\nabla_x \log p_t(x)$ is intractable, we adopt denoising score matching (DSM) [42]. DSM exploits the fact that the forward transition kernel $q_t(\cdot \mid x_0)$ is known, and trains the score network to match the corresponding conditional score. Concretely, it minimizes

$$\mathcal{L}_2(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0, T_{\mathrm{diff}})}\, \mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\, \mathbb{E}_{x \sim q_t(\cdot \mid x_0)} \big[\kappa(t)\, \|s_\theta(x, t) - \nabla_x \log q_t(x \mid x_0)\|_2^2\big]. \tag{12}$$

The mathematical rationale is the identity

$$\nabla_x \log p_t(x) = \mathbb{E}_{x_0 \sim p(\cdot \mid x(t) = x)} \big[\nabla_x \log q_t(x \mid x_0)\big],$$

so the marginal score is the conditional expectation of the conditional score. By the $L^2$ projection property of conditional expectation, $\mathcal{L}_2(\theta) = \mathcal{L}_1(\theta) + C$ with a constant $C \geq 0$ independent of $\theta$. Hence $\mathcal{L}_1$ and $\mathcal{L}_2$ have the same minimizers and $\mathcal{L}_2(\theta) \geq \mathcal{L}_1(\theta)$. For the VP diffusion, $q_t(\cdot \mid x_0)$ is Gaussian (see (9)), hence

$$\nabla_x \log q_t(x \mid x_0) = -\frac{x - \alpha(t)\, x_0}{\hat{\sigma}^2(t)}.$$

After training, let $\theta^*$ denote the learned parameters (e.g., an empirical minimizer of the training loss (12)). Replacing the unknown score in the reverse-time dynamics by $s_{\theta^*}$ yields the approximate reverse SDE

$$dx(t) = \big(f(x(t), t) - g^2(t)\, s_{\theta^*}(x(t), t)\big)\,dt + g(t)\,d\bar{w}_t, \qquad t \in [0, T_{\mathrm{diff}}],$$

interpreted in reverse time (run backward from $t = T_{\mathrm{diff}}$ to $t = 0$), with $f(x,t) = -\tfrac{1}{2}\beta(t)x$ and $g(t) = \sqrt{\beta(t)}$ for the VP diffusion.
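The DSM objective (12) is straightforward to evaluate by Monte Carlo at a single noise level: for the VP kernel, $x - \alpha x_0 = \hat\sigma z$, so the regression target is simply $-z/\hat\sigma$. A sketch under our own naming (the paper's actual training setup may differ):

```python
import numpy as np

def dsm_loss(score_fn, x0_batch, alpha, sigma, rng, kappa=1.0):
    """Monte-Carlo DSM loss (Eq. (12)) at one noise level (alpha, sigma).

    The conditional score of the VP kernel is
    grad_x log q_t(x | x0) = -(x - alpha * x0) / sigma^2 = -z / sigma,
    so the network is regressed onto -z / sigma.
    """
    z = rng.standard_normal(x0_batch.shape)
    x = alpha * x0_batch + sigma * z          # sample x ~ q_t(. | x0), Eq. (9)
    target = -z / sigma
    diff = score_fn(x) - target
    return kappa * np.mean(np.sum(diff.reshape(len(x0_batch), -1) ** 2, axis=1))
```

As a sanity check, plugging in the exact conditional score for a known batch drives the loss to (numerically) zero, consistent with DSM regressing onto the conditional score.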
4 Conditional Diffusion for Bayesian Inversion

4.1 Diffusion-based posterior solver

We now describe a diffusion-based solver for computing posterior-driven reconstructions of $v$ from the Bayesian model (5). Following the diffusion posterior sampling (DPS) framework, we incorporate data consistency through a likelihood potential and use a learned score model to represent prior information. For notational consistency with the DPS literature, we denote the observation by $y^\delta := d_{\mathrm{obs}}$ and recall the reverse-time SDE (10) for the VP diffusion. Throughout this section, the observation $y^\delta$ is fixed; hence we suppress this dependence in the notation and write $\Phi(\cdot) \equiv \Phi(\cdot\,; y^\delta)$. To draw samples from the Bayesian posterior distribution $p(v \mid y^\delta)$ in (5), we replace the marginal score $s^\star(x,t) = \nabla_x \log p_t(x)$ by the conditional score $\nabla_x \log p_t(x \mid y^\delta)$, leading to

$$dx(t) = \Big(-\tfrac{1}{2}\beta(t)\, x(t) - \beta(t)\, \nabla_x \log p_t\big(x(t) \mid y^\delta\big)\Big)\,dt + \sqrt{\beta(t)}\,d\bar{w}_t, \qquad t \in [0, T_{\mathrm{diff}}],$$

interpreted in reverse time (run backward from $t = T_{\mathrm{diff}}$ to $t = 0$). Using Bayes' rule, the conditional score decomposes as

$$\nabla_{x(t)} \log p_t\big(x(t) \mid y^\delta\big) = \nabla_{x(t)} \log p_t\big(x(t)\big) + \nabla_{x(t)} \log p_t\big(y^\delta \mid x(t)\big), \tag{13}$$

so that, given an approximation of the marginal score $\nabla \log p_t(x(t)) \approx s_{\theta^*}(x(t), t)$, it remains to approximate the term $\nabla_{x(t)} \log p_t(y^\delta \mid x(t))$.

From likelihood to a general potential. Under a Gaussian likelihood (4), a common approximation is

$$\nabla_{x(t)} \log p_t\big(y^\delta \mid x(t)\big) \simeq -\tfrac{1}{2}\, \nabla_{x(t)} \big\|y^\delta - \mathcal{F}(\hat{x}_0(t))\big\|_{\Sigma^{-1}}^2,$$

where $\hat{x}_0(t)$ denotes the denoised (posterior-mean) estimate of the clean variable associated with $x(t)$. Following Eq. (10) in [30], we set

$$\hat{x}_0(t) := \frac{1}{\sqrt{\bar{\alpha}(t)}} \Big(x(t) + \big(1 - \bar{\alpha}(t)\big)\, s_{\theta^*}(x(t), t)\Big),$$

where $s_{\theta^*}(x, t)$ is the trained score network and, for the VP diffusion, $\bar{\alpha}(t) := \exp\big(-\int_0^t \beta(\tau)\,d\tau\big)$. In our formulation (3), data consistency is encoded through a potential $\Phi$. Accordingly, we approximate the conditioning term by

$$\nabla_{x(t)} \log p_t\big(y^\delta \mid x(t)\big) \simeq -\nabla_{x(t)} \Phi\big(\hat{x}_0(t)\big), \tag{14}$$

so that the observational constraint enters the reverse-time dynamics via the chosen potential $\Phi$. Combining (13) with (14) and replacing the prior score by the trained score network yields

$$\nabla_{x(t)} \log p_t\big(x(t) \mid y^\delta\big) \simeq s_{\theta^*}(x(t), t) - \zeta\, \nabla_{x(t)} \Phi\big(\hat{x}_0(t)\big), \tag{15}$$

where $\zeta > 0$ is a user-chosen guidance-strength parameter that rescales the data-consistency term to account for objective scaling.

4.2 Baseline DPS discretization for FWI

Let $\{\beta_i\}_{i=1}^N$ be the discrete noise schedule, $\alpha_i := 1 - \beta_i$, and $\bar{\alpha}_i := \prod_{j=1}^i \alpha_j$. Denote by $v_i$ the reverse-time diffusion state at step $i$ (DDPM indexing [27]), so that $v_i \approx x(t_i)$. Here $i = N$ corresponds to the highest noise level (an approximately Gaussian state), and $i = 1$ corresponds to the lowest noise level; thus $\{v_i\}_{i=1}^N$ traces a discrete reverse-time trajectory that progressively denoises the state from $v_N$ toward a clean reconstruction. Denote by $s_{\theta^*}(\cdot, i)$ the trained score network. The standard clean estimator is

$$\hat{v}_0^{(i)} = \hat{v}_0(v_i, i) := \frac{1}{\sqrt{\bar{\alpha}_i}} \Big(v_i + (1 - \bar{\alpha}_i)\, s_{\theta^*}(v_i, i)\Big). \tag{16}$$

Baseline likelihood potential (least squares). In the original DPS formulation [30], data consistency is enforced through a Gaussian likelihood, leading to the least-squares potential

$$\Phi_{\ell_2}(v) := \tfrac{1}{2}\, \|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2.$$
Given the prior-driven proposal $v'_{i-1}$, baseline DPS applies a scalar guidance correction evaluated at the current state $v_i$:

$$v_{i-1} = v'_{i-1} - \zeta\, \nabla_{v_i} \Phi_{\ell_2}\big(\hat{v}_0^{(i)}\big) = v'_{i-1} - \zeta\, \big(\mathcal{F}'(\hat{v}_0^{(i)})\big)^* \Sigma^{-1} \big(\mathcal{F}(\hat{v}_0^{(i)}) - y^\delta\big), \qquad \zeta > 0. \tag{17}$$

Based on the above formulation, the DPS-based procedure for solving the full waveform inversion problem is summarized in Algorithm 1.

Algorithm 1: Baseline Diffusion Posterior Sampling (DPS)
Require: $N$, schedule $\{\beta_i\}_{i=1}^N$, score network $s_{\theta^*}$, variances $\{\hat{\sigma}_i\}$, fixed observation $d_{\mathrm{obs}}$, hyperparameters $\{\zeta_i\}$ (and optional clipping bounds), objective function $\Phi_{\ell_2}$.
Result: the inversion velocity model $v = v_0$.
Initialize $v_N \sim \mathcal{N}(0, I)$; set $\alpha_i := 1 - \beta_i$ and $\bar{\alpha}_i := \prod_{j=1}^i \alpha_j$.
for $i = N$ down to $1$ do
  $\hat{v}_0^{(i)} \leftarrow \frac{1}{\sqrt{\bar{\alpha}_i}}\big(v_i + (1 - \bar{\alpha}_i)\, s_{\theta^*}(v_i, i)\big)$;
  $z \sim \mathcal{N}(0, I)$;
  $v'_{i-1} \leftarrow \frac{\sqrt{\alpha_i}\,(1 - \bar{\alpha}_{i-1})}{1 - \bar{\alpha}_i}\, v_i + \frac{\sqrt{\bar{\alpha}_{i-1}}\, \beta_i}{1 - \bar{\alpha}_i}\, \hat{v}_0^{(i)} + \hat{\sigma}_i\, z$;
  $v_{i-1} \leftarrow v'_{i-1} - \zeta\, \nabla_{v_i} \Phi_{\ell_2}\big(\hat{v}_0^{(i)}\big)$;
end for

4.3 Limitations of $\ell_2$-guided DPS in FWI

Although Algorithm 1 is consistent with a Gaussian noise model, its direct application to FWI presents several well-known and practically important difficulties. Specifically, the pointwise $\ell_2$ potential may induce cycle-skipping and objective imbalance, while the use of a single global scalar guidance parameter is inadequate for the strongly space-dependent sensitivities characteristic of FWI.

(i) Cycle-skipping and nonconvexity. The $\ell_2$ misfit compares waveforms pointwise in time; small phase shifts can yield large residuals and highly oscillatory gradients. As a consequence, $\Phi_{\ell_2}$ typically exhibits a strongly nonconvex landscape with many spurious local minima, which can lead to cycle-skipping in solving the present inverse problem.
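One iteration of Algorithm 1 can be sketched as follows. This is our own illustration with hypothetical names; in particular, a rigorous DPS step differentiates $\Phi_{\ell_2}(\hat v_0^{(i)})$ through the score network with respect to $v_i$, whereas the sketch treats `grad_phi` as a user-supplied gradient evaluated at the clean estimate and absorbs the chain-rule factor into $\zeta$.

```python
import numpy as np

def dps_step(v_i, i, score, alphas, alpha_bars, betas, sigmas, grad_phi, zeta, rng):
    """One guided reverse step of Algorithm 1 (baseline DPS), DDPM indexing.

    score(v, i)  : trained score network s_theta*(v, i)   [assumed callable]
    grad_phi(v0) : gradient of the data-consistency potential at v0_hat
    Arrays alphas, alpha_bars, betas, sigmas are indexed 1..N (index 0 unused).
    """
    ab_i = alpha_bars[i]
    ab_prev = alpha_bars[i - 1] if i > 1 else 1.0
    # clean-state (Tweedie) estimate, Eq. (16)
    v0_hat = (v_i + (1.0 - ab_i) * score(v_i, i)) / np.sqrt(ab_i)
    # prior-driven ancestral proposal v'_{i-1}
    z = rng.standard_normal(v_i.shape) if i > 1 else 0.0
    v_prev = (
        np.sqrt(alphas[i]) * (1.0 - ab_prev) / (1.0 - ab_i) * v_i
        + np.sqrt(ab_prev) * betas[i] / (1.0 - ab_i) * v0_hat
        + sigmas[i] * z
    )
    # scalar likelihood-guidance correction, Eq. (17)
    return v_prev - zeta * grad_phi(v0_hat)
```

Looping `dps_step` from $i = N$ down to $1$, starting from $v_N \sim \mathcal{N}(0, I)$, reproduces the structure of Algorithm 1; the two modifications of Section 5 replace `grad_phi` and the scalar `zeta`.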
Within DPS, the guidance term is applied to the denoised estimate $\hat{v}_0^{(i)}$, and early reverse-time iterates often produce rough $\hat{v}_0^{(i)}$; in this regime, the $\ell_2$ gradient can be particularly unreliable, causing unstable or ineffective guidance.

(ii) Amplitude dominance and objective scaling. Seismic data often exhibit a pronounced dynamic-range imbalance across arrivals. Since $\Phi_{\ell_2}$ is quadratic in the residual, the gradient $\nabla \Phi_{\ell_2}$ is dominated by the largest residual components, which are typically associated with high-amplitude early arrivals. As a result, the baseline correction step tends to prioritize fitting the strongest events, while weaker phases that are informative for deeper or poorly illuminated regions are comparatively down-weighted. Moreover, $\Phi_{\ell_2}(v) = \tfrac{1}{2}\|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2$ is used without normalization by an observation-dependent scale; when the data (or residuals after preprocessing) have small magnitude, both $\Phi_{\ell_2}$ and $\nabla \Phi_{\ell_2}$ can become numerically small, making the guidance strength $\zeta$ in (17) delicate to tune and potentially leading to overly weak data-consistency enforcement.

(iii) Heterogeneous sensitivity and scalar step size. FWI sensitivities are strongly space-dependent (e.g., due to limited illumination), so a single global scalar guidance parameter $\zeta$ cannot balance updates across well-illuminated and poorly illuminated regions. This mismatch can slow progress in weak-sensitivity areas and may yield unstable behavior in sensitive regions.

These observations motivate two modifications: we (a) replace the pointwise $\ell_2$ potential with a robust OT-based data-consistency potential to mitigate cycle-skipping and amplitude imbalance, and (b) introduce a variable-metric preconditioned guidance scheme to stabilize and balance spatial updates. We present these components next.
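As a concrete reference point for the modifications that follow, the baseline update of Algorithm 1 can be sketched in NumPy. This is a minimal sketch, not the authors' code: `score` stands in for the trained network $s_{\theta^*}(v,i)$, `grad_phi` for the misfit gradient at the clean estimate, and the posterior-variance choice for $\hat\sigma_i$ is one common DDPM convention.

```python
import numpy as np

def dps_reverse_step(v_i, i, betas, score, grad_phi, rng, zeta=1.0):
    """One baseline DPS step (Algorithm 1 sketch): ancestral DDPM update
    followed by a scalar-strength data-consistency correction (17).

    `score(v, i)` and `grad_phi(v0_hat)` are placeholders for the trained
    score network and the misfit gradient; both are assumptions here."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    ab_i = abar[i]
    ab_prev = abar[i - 1] if i > 0 else 1.0

    # Clean estimator (16): v0_hat = (v_i + (1 - abar_i) * score) / sqrt(abar_i)
    v0_hat = (v_i + (1.0 - ab_i) * score(v_i, i)) / np.sqrt(ab_i)

    # Prior-driven DDPM proposal v'_{i-1}; sigma_i is one common variance choice
    sigma_i = np.sqrt(betas[i] * (1.0 - ab_prev) / (1.0 - ab_i))
    z = rng.standard_normal(v_i.shape) if i > 0 else 0.0
    v_prop = (np.sqrt(alphas[i]) * (1.0 - ab_prev) / (1.0 - ab_i)) * v_i \
           + (np.sqrt(ab_prev) * betas[i] / (1.0 - ab_i)) * v0_hat \
           + sigma_i * z

    # Scalar guidance correction (17); grad_phi approximates the gradient
    # of the data-consistency potential evaluated at the clean estimate
    return v_prop - zeta * grad_phi(v0_hat)
```

In a full sampler this step is iterated from $i=N$ down to $i=1$; in FWI, `grad_phi` would be obtained by differentiating through a wave-equation simulation.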
5 Our method: OT-guided preconditioned DPS for FWI

This section presents two modifications motivated by the limitations of $\ell_2$-guided DPS in FWI discussed in the preceding section: (a) an OT-based data-consistency potential $\Phi = J$ that mitigates amplitude dominance, reduces phase sensitivity (cycle-skipping), and stabilizes objective scaling; and (b) a variable-metric preconditioned guidance scheme that adapts the guidance strength across noise levels and balances spatial updates under heterogeneous sensitivities.

5.1 OT-based data-consistency potential

As discussed in Section 4.3, pointwise least-squares waveform fitting can lead to amplitude-dominated updates, strong phase sensitivity (cycle-skipping), and ill-conditioned objective scaling. We therefore construct a robust OT-based misfit $J$ in three steps, each targeting one of these difficulties.

(I) Amplitude-adaptive data weighting. Let $d_{\mathrm{obs}}(s,r,t)$ denote the observed trace associated with the source–receiver pair $(s,r)$, and let $d_{\mathrm{syn}}(s,r,t;v)$ be its synthetic counterpart generated by the velocity model $v$. We introduce a globally normalized instantaneous amplitude
$$a(s,r,t) = \frac{|d_{\mathrm{obs}}(s,r,t)|}{\max_{i,j,t} |d_{\mathrm{obs}}(s_i,r_j,t)|} \in [0,1],$$
and define the bounded weight
$$\omega(s,r,t) = \frac{1}{1 + k\, a(s,r,t)}, \qquad k \ge 0.$$ (18)
By construction, $\omega(s,r,t) \in [1/(1+k),\,1]$ and is monotone decreasing in $a(s,r,t)$, so large-amplitude arrivals are down-weighted while low-amplitude portions are largely preserved. We apply the same pointwise weighting to both observed and synthetic signals,
$$\tilde{d}_{\mathrm{obs}}(s,r,t) = \omega(s,r,t)\, d_{\mathrm{obs}}(s,r,t), \qquad \tilde{d}_{\mathrm{syn}}(s,r,t;v) = \omega(s,r,t)\, d_{\mathrm{syn}}(s,r,t;v),$$ (19)
so that the comparison is performed under a fixed, observation-dependent weighting.
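The weighting of Eqs. (18)-(19) is straightforward to implement; the following is a minimal NumPy sketch (function names are ours, not the paper's).

```python
import numpy as np

def bounded_weight(d_obs, k=100.0):
    """Amplitude-adaptive bounded weighting, Eq. (18).

    d_obs: observed traces, shape (n_src, n_rec, n_t).
    Returns the fixed weight field omega with values in [1/(1+k), 1]."""
    # Globally normalized instantaneous amplitude a(s, r, t) in [0, 1]
    a = np.abs(d_obs) / np.max(np.abs(d_obs))
    return 1.0 / (1.0 + k * a)

def enhance(d, omega):
    """Apply the observation-derived weighting, Eq. (19), to observed or
    synthetic data; omega stays fixed throughout the inversion."""
    return omega * d
```

Because `omega` is computed once from the observations, applying `enhance` to both observed and synthetic data preserves the identity $\tilde{d}_{\mathrm{syn}} - \tilde{d}_{\mathrm{obs}} = \omega\,(d_{\mathrm{syn}} - d_{\mathrm{obs}})$.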
Equivalently, $\tilde{d}_{\mathrm{syn}} - \tilde{d}_{\mathrm{obs}} = \omega\,(d_{\mathrm{syn}} - d_{\mathrm{obs}})$, and thus the contribution of high-amplitude events to the misfit and its gradient is reduced. Importantly, $\omega$ is computed from the observations and kept fixed during inversion, so the weighting does not introduce additional model-dependent nonlinearity. This amplitude-adaptive weighting alleviates scale disparities induced by raw amplitudes and prevents the descent direction from being dominated by the high-amplitude portions of the data.

As an empirical illustration, we take the CurveFault-A dataset [43] as an example and compare the original traces and their weighted counterparts at Receiver 1 for the first three source–receiver pairs in Figure 1. In the original recordings (Figure 1(a)), the early-arriving portions exhibit substantially larger amplitudes than later-arriving events, thereby overwhelming the contribution of weaker phases in the least-squares misfit. After applying the bounded weighting (18) with $k = 100$ (Figure 1(b)), the dynamic range is effectively compressed: high-amplitude arrivals are suppressed while low-amplitude events are largely preserved. This amplitude balancing reduces the tendency of large-amplitude portions of the data to dominate the gradient and yields a more balanced weighted comparison for subsequent inversion.

Nevertheless, even with amplitude balancing, pointwise $\ell_2$ misfits can remain highly sensitive to small time shifts and phase mismatch. To alleviate the resulting nonlinearity and reduce cycle-skipping, we therefore adopt the Wasserstein-$p$ distance as the data-misfit measure (we use $p=2$ in all computations).

Figure 1: Comparison of the original and weighted wavefield traces at Receiver 1 for the first three source–receiver pairs: (a) the original observed wavefield; (b) the weighted wavefield with $k=100$.
The original recordings are dominated by early large-amplitude arrivals, whereas the proposed bounded weighting compresses the dynamic range and renders later weak-amplitude events more comparable in scale.

(II) Transport-based misfit (Wasserstein-$p$). The least-squares misfit induces a pointwise $\ell_2$ geometry that is overly sensitive to temporal misalignment, which can lead to a highly nonconvex objective landscape and promote cycle-skipping. We therefore adopt the Wasserstein-$p$ distance, which equips the data space with an optimal-transport geometry and measures discrepancy via mass displacement (after suitable normalization) rather than by pointwise differences. Since the Wasserstein distance is formally defined on nonnegative measures with equal total mass, we follow the standard OT-FWI construction [12] and transform each enhanced 1D trace into a probability density function (PDF) via a fixed shift–rescale normalization. Specifically, for a trace $\tilde{d}(t)$ (either $\tilde{d}_{\mathrm{obs}}$ or $\tilde{d}_{\mathrm{syn}}$), we introduce a constant $c' > 0$ to ensure nonnegativity. To guarantee that $\tilde{d}(t) + c' \ge 0$ on $[0,T]$ for all traces, we fix $(s,r)$ and take
$$c' = 1.1\, \big|\min_t \tilde{d}_{\mathrm{obs}}(s,r,t)\big|,$$
and keep $c'$ constant throughout the inversion, so that the mapping does not introduce additional model-dependent nonlinearity in the gradient. In practice, $c'$ is chosen sufficiently large so that $\tilde{d}_{\mathrm{syn}} + c' \ge 0$ throughout the inversion. We then define the density mapping
$$P(\tilde{d})(t) = \frac{\tilde{d}(t) + c'}{\int_0^T \big(\tilde{d}(\tau) + c'\big)\, d\tau},$$ (20)
which ensures $P(\tilde{d})(t) \ge 0$ and $\int_0^T P(\tilde{d})(t)\, dt = 1$. Accordingly, for each source–receiver pair $(s,r)$, the associated probability densities are
$$\tilde\rho_{\mathrm{obs}}(\cdot;s,r) = P\big(\tilde{d}_{\mathrm{obs}}(\cdot;s,r)\big), \qquad \tilde\rho_{\mathrm{syn}}(\cdot;s,r,v) = P\big(\tilde{d}_{\mathrm{syn}}(\cdot;s,r,v)\big).$$
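The density map (20) and the quantile-based evaluation of the 1D Wasserstein-2 distance (detailed in Section 6.2) can be sketched as follows. This is an illustrative NumPy sketch assuming uniformly sampled traces; function names are ours.

```python
import numpy as np

def trace_to_pdf(d, dt, c=None):
    """Shift-rescale density map, Eq. (20): shift a trace to be nonnegative
    and normalize it to unit mass on the discrete time grid."""
    if c is None:
        c = 1.1 * abs(d.min())  # nonnegativity shift derived from the trace
    p = d + c
    return p / (p.sum() * dt)

def w2_1d(d_syn, d_obs, dt, n_xi=1000):
    """1D Wasserstein-2 distance between two traces via the quantile
    representation: build CDFs by cumulative summation, invert them by
    linear interpolation, and integrate over a uniform xi-grid."""
    t = np.arange(len(d_obs)) * dt
    c = 1.1 * abs(d_obs.min())          # fixed, observation-derived shift
    rho_s = trace_to_pdf(d_syn, dt, c)
    rho_o = trace_to_pdf(d_obs, dt, c)
    F_s = np.cumsum(rho_s) * dt
    F_o = np.cumsum(rho_o) * dt
    xi = np.linspace(0.0, 1.0, n_xi)
    # Quantile functions via inverse interpolation of the monotone CDFs
    q_s = np.interp(xi, F_s, t)
    q_o = np.interp(xi, F_o, t)
    dxi = xi[1] - xi[0]
    return np.sqrt(np.sum((q_s - q_o) ** 2) * dxi)
```

For two copies of the same trace the distance vanishes, while a small time shift of an event produces a distance comparable to the shift, which is the behavior that mitigates cycle-skipping.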
With the signals transformed into densities, the Wasserstein-$p$ distance is defined by
$$W_p(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}}) = \Big( \inf_{\gamma \in \tilde\Gamma(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}})} \int_{[0,T]\times[0,T]} |t-\tau|^p\, d\gamma(t,\tau) \Big)^{1/p}, \qquad p \ge 1,$$
where $\tilde\Gamma(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}})$ denotes the set of admissible transport plans with marginals $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$. In the one-dimensional setting, $W_p$ admits an explicit representation in terms of quantile functions. Let $\tilde{F}_{\mathrm{syn}}$ and $\tilde{F}_{\mathrm{obs}}$ be the cumulative distribution functions (CDFs) of $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$, respectively; then
$$W_p(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}}) = \Big( \int_0^1 \big| \tilde{F}_{\mathrm{syn}}^{-1}(\xi) - \tilde{F}_{\mathrm{obs}}^{-1}(\xi) \big|^p\, d\xi \Big)^{1/p},$$ (21)
where $\tilde{F}^{-1}$ denotes the quantile function, defined as the generalized inverse $\tilde{F}^{-1}(\xi) := \inf\{t \in [0,T] : \tilde{F}(t) \ge \xi\}$. While we present the definition for general $p \ge 1$, all subsequent results and computations in this paper use $p = 2$. Numerically, for each trace we compute the PDF using (20), construct the CDF via cumulative summation, and evaluate the quantile difference in (21) by interpolation.

(III) Misfit scale normalization. To stabilize the numerical scale of the OT guidance term, we rescale the aggregated discrepancy by an observation-dependent characteristic magnitude. Specifically, let $\tilde{F}_{\mathrm{obs},s,r}^{-1}$ denote the quantile function associated with the observed PDF for the pair $(s,r)$, and define the scale factor
$$S_{\mathrm{obs}} = \sum_s \sum_r \Big( \int_0^1 \big| \tilde{F}_{\mathrm{obs},s,r}^{-1}(\xi) \big|^p\, d\xi \Big)^{1/p},$$
which depends only on the observations and is independent of the velocity model $v$. Therefore, this standardization does not alter the minimizer of the objective; it merely rescales the functional and its gradient by a constant factor, improving numerical conditioning and reducing sensitivity to step-size tuning in the guidance updates.

Weighted and normalized Wasserstein misfit.
Combining the above ingredients, we define the normalized Wasserstein-based objective functional as
$$J(v) = \frac{\sum_s \sum_r W_p\big(\tilde\rho_{\mathrm{syn}}(\cdot;s,r;v), \tilde\rho_{\mathrm{obs}}(\cdot;s,r)\big)}{\sum_s \sum_r \big( \int_0^1 | \tilde{F}_{\mathrm{obs},s,r}^{-1}(\xi) |^p\, d\xi \big)^{1/p}},$$ (22)
where $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$ are the PDFs obtained from the enhanced wavefields $\tilde{d}_{\mathrm{syn}}$ and $\tilde{d}_{\mathrm{obs}}$, respectively. This construction of $J$ is designed to (i) reduce the dominance of high-amplitude arrivals through bounded weighting, (ii) alleviate phase sensitivity via the OT geometry, and (iii) stabilize the numerical scale through observation-dependent normalization.

Choice of likelihood potential. With the above construction, we instantiate the data-consistency potential in the Bayesian likelihood (3) as $\Phi(v; d_{\mathrm{obs}}) := J(v)$, and we often suppress the dependence on $d_{\mathrm{obs}}$ by writing $J(v) \equiv J(v; d_{\mathrm{obs}})$ and $\Phi(v) \equiv \Phi(v; d_{\mathrm{obs}})$.

5.2 Preconditioned guidance in DPS

Building on the conditional-score approximation (15), we now introduce a variable-metric (diagonal) preconditioned guidance scheme tailored to FWI.

Our modification: variable-metric preconditioned guidance. We replace the uniform scalar step size by a positive diagonal preconditioner,
$$v_{i-1} = v'_{i-1} - P_i\, \nabla_{v_i} \Phi(\hat{v}_0^{(i)}), \qquad P_i = \rho_i D_i,$$ (23)
where $\rho_i > 0$ is a scalar schedule and $D_i$ is a diagonal positive matrix. When $P_i = \zeta I$, (23) reduces to the baseline DPS step. The guidance gradient in (23) is computed with respect to $v_i$ through the clean estimator (16):
$$\nabla_{v_i} \Phi(\hat{v}_0^{(i)}) = \big( \nabla_{v_i} \hat{v}_0(v_i,i) \big)^\top \nabla_{\hat{v}_0} \Phi(\hat{v}_0^{(i)}).$$ (24)
We evaluate (24) in practice by automatic differentiation.

Stabilized guidance scheduling. Early reverse-time iterates typically produce $\hat{v}_0^{(i)}$ with spurious high-frequency artifacts.
In this regime, the guidance direction can be unreliable, and overly aggressive updates may destabilize the reverse trajectory. We therefore adopt a conservative guidance strength while $\hat{v}_0^{(i)}$ is still rough, and gradually increase it as the estimate becomes smoother and more structured. To quantify the roughness of the current denoised estimate, we introduce a TV-based indicator [44]. Specifically, let $\mathrm{TV}(\cdot)$ denote the discrete total-variation semi-norm on the 2D grid:
$$\mathrm{TV}(v) = \frac{1}{N_g} \sum_\ell \big( |(\nabla_x v)_\ell| + |(\nabla_z v)_\ell| \big),$$ (25)
where $\ell$ indexes grid points, $N_g$ is the number of grid points, and $\nabla_x, \nabla_z$ are forward finite differences. We use $\mathrm{TV}(\hat{v}_0^{(i)})$ as a roughness proxy because spurious oscillatory artifacts in early iterates typically manifest themselves through large local finite differences and hence elevated TV values. At the same time, TV is less sensitive than quadratic smoothness indicators to genuine sharp interfaces, which are common in subsurface models. Therefore, it provides a simple and computationally efficient criterion for distinguishing unstable, artifact-dominated iterates from structurally meaningful ones. Based on this proxy, we define the step-dependent scalar factor
$$\rho_i = \rho_0 \exp\!\Big( -\frac{\big(\mathrm{TV}(\hat{v}_0^{(i)}) - \tau\big)_+}{c} \Big), \qquad (\cdot)_+ := \max\{\cdot, 0\},$$ (26)
with hyperparameters $\rho_0 > 0$, $c > 0$, and an optional threshold $\tau \ge 0$. By construction, $\rho_i \in (0, \rho_0]$. Moreover, $\rho_i = \rho_0$ whenever $\mathrm{TV}(\hat{v}_0^{(i)}) \le \tau$, and it decreases monotonically with $\mathrm{TV}(\hat{v}_0^{(i)})$ once $\mathrm{TV}(\hat{v}_0^{(i)}) > \tau$. The parameter $c$ sets the decay scale: larger $c$ leads to slower attenuation, that is, less sensitivity to roughness. The threshold $\tau$ defines a tolerance level for total variation and prevents excessive suppression of guidance in the presence of genuine interfaces, such as sharp geological contrasts.

Spatial diagonal preconditioning.
FWI gradients typically decay with depth due to limited illumination and attenuation, leading to highly nonuniform update magnitudes. To mitigate this imbalance, we introduce a diagonal rescaling based on the current misfit gradient with respect to the clean estimate. Let
$$g^{(i)} := \nabla_{\hat{v}_0} \Phi(\hat{v}_0^{(i)}) \in \mathbb{R}^n.$$
We define the diagonal weights
$$\kappa_j^{(i)} = \Big( \frac{\|g^{(i)}\|_\infty + \varepsilon}{|g_j^{(i)}| + \varepsilon} \Big)^\gamma, \qquad D_i := \mathrm{diag}\big(\kappa_1^{(i)}, \dots, \kappa_n^{(i)}\big),$$ (27)
where $\gamma \ge 0$ and $\varepsilon > 0$. We build $D_i$ from $g^{(i)} = \nabla_{\hat{v}_0} \Phi$ because it directly reflects the physical FWI sensitivity in the model domain; the chain rule in (24) then maps this sensitivity back to the current diffusion state.

Interpretation via a descent inequality. For each reverse-time index $i$, consider the composite functional $\tilde\Phi_i(v) := \Phi(\hat{v}_0^{(i)}(v))$. Assume that $\tilde\Phi_i$ is differentiable and has a locally $L_i$-Lipschitz gradient on a neighborhood containing the relevant iterates. Then the descent lemma [45] yields that, for any symmetric positive semidefinite matrix $P$ satisfying $\|P\|_2 \le 1/L_i$,
$$\tilde\Phi_i\big(v - P \nabla \tilde\Phi_i(v)\big) \le \tilde\Phi_i(v) - \tfrac{1}{2}\, \|\nabla \tilde\Phi_i(v)\|_P^2, \qquad \|a\|_P^2 := a^\top P a.$$ (28)
Applying (28) pointwise with $P$ at the current iterate, this estimate provides a heuristic justification for choosing $P_i = \rho_i D_i$. Indeed, by tuning $\rho_i$ and clipping the diagonal entries of $D_i$, we control the effective spectral norm of $P_i$ and obtain a stabilized variable-metric guidance step, analogous to diagonal preconditioning in ill-conditioned inverse problems.

Combining the above ingredients, we arrive at the final Algorithm 2.

Require: $N$, schedule $\{\beta_i\}_{i=1}^N$, score network $s_{\theta^*}$, variances $\{\hat\sigma_i\}$, fixed observation $d_{\mathrm{obs}}$, hyperparameters $\rho_0, c, \tau, \gamma, \varepsilon$ (and optional clipping bounds), objective function $J$ as in (22).
Result: approximate posterior sample $v_0$.
Initialize $v_N \sim \mathcal{N}(0,I)$; set $\alpha_i := 1-\beta_i$ and $\bar\alpha_i := \prod_{j=1}^i \alpha_j$.
for $i = N$ down to $1$ do
  $\hat{v}_0^{(i)} \leftarrow \frac{1}{\sqrt{\bar\alpha_i}}\big(v_i + (1-\bar\alpha_i)\, s_{\theta^*}(v_i,i)\big)$;
  $z \sim \mathcal{N}(0,I)$;
  $v'_{i-1} \leftarrow \frac{\sqrt{\alpha_i}(1-\bar\alpha_{i-1})}{1-\bar\alpha_i}\, v_i + \frac{\sqrt{\bar\alpha_{i-1}}\,\beta_i}{1-\bar\alpha_i}\, \hat{v}_0^{(i)} + \hat\sigma_i z$;
  $g^{(i)} \leftarrow \nabla_{\hat{v}_0^{(i)}} J(\hat{v}_0^{(i)})$;
  $\kappa_j^{(i)} \leftarrow \big( (\|g^{(i)}\|_\infty + \varepsilon)/(|g_j^{(i)}| + \varepsilon) \big)^\gamma$; form $D_i = \mathrm{diag}(\kappa_1^{(i)}, \dots, \kappa_n^{(i)})$;
  $\rho_i \leftarrow \rho_0 \exp\big(-(\mathrm{TV}(\hat{v}_0^{(i)}) - \tau)_+/c\big)$;
  set $P_i \leftarrow \rho_i D_i$ (optionally with clipping and/or spectral-norm control);
  $v_{i-1} \leftarrow v'_{i-1} - P_i\, \nabla_{v_i} J(\hat{v}_0^{(i)})$;
end for
Algorithm 2: OT-guided Wavefield-Enhanced Preconditioned DPS (OT-WE-PDPS)

6 Experiments

We evaluate our method on OpenFWI [43]. We train the score-based prior using velocity models from the Vel Family and Fault Family datasets. All inversion results are reported on held-out test samples that are disjoint from the training set; representative test models are shown in the first row of Figure 2. The score network is a U-Net [46] trained via denoising score matching under the VP diffusion described in Section 3. Importantly, prior training uses only samples of the model parameter field (velocity) and does not require any wavefield simulations. We compute posterior-driven reconstructions by running a discretized reverse-time diffusion whose drift combines the learned prior score and the data-consistency term induced by $\mathcal{F}$, implemented via our OT-guided Wavefield-Enhanced Preconditioned DPS (OT-WE-PDPS, Algorithm 2). The likelihood potential is chosen as $\Phi = J$, the normalized OT misfit of Section 5.1. The reported reconstructions correspond to unseen models that were not used in training the prior score network. Experimental performance is evaluated using both qualitative inspection and quantitative metrics.
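One reverse step of Algorithm 2 can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: `score` and `grad_J` are placeholders for the trained network and the OT-misfit gradient, and for brevity the gradient with respect to the clean estimate also stands in for the chain-rule gradient with respect to $v_i$ (in the paper, Eq. (24) is evaluated by automatic differentiation).

```python
import numpy as np

def ot_we_pdps_step(v_i, i, betas, score, grad_J, rng,
                    rho0=1.0, c=0.1, tau=0.0, gamma=0.5, eps=1e-4):
    """One reverse step of Algorithm 2 (OT-WE-PDPS), sketched with NumPy."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    ab_i = abar[i]
    ab_prev = abar[i - 1] if i > 0 else 1.0

    # Clean estimator (16) and prior-driven DDPM proposal
    v0_hat = (v_i + (1.0 - ab_i) * score(v_i, i)) / np.sqrt(ab_i)
    sigma_i = np.sqrt(betas[i] * (1.0 - ab_prev) / (1.0 - ab_i))
    z = rng.standard_normal(v_i.shape) if i > 0 else 0.0
    v_prop = (np.sqrt(alphas[i]) * (1.0 - ab_prev) / (1.0 - ab_i)) * v_i \
           + (np.sqrt(ab_prev) * betas[i] / (1.0 - ab_i)) * v0_hat + sigma_i * z

    # Diagonal preconditioner D_i from the misfit gradient, Eq. (27):
    # weak-sensitivity cells receive larger weights
    g = grad_J(v0_hat)
    kappa = ((np.max(np.abs(g)) + eps) / (np.abs(g) + eps)) ** gamma

    # TV roughness proxy (25) and noise-aware scalar schedule rho_i (26)
    tv = (np.abs(np.diff(v0_hat, axis=0)).sum()
          + np.abs(np.diff(v0_hat, axis=1)).sum()) / v0_hat.size
    rho_i = rho0 * np.exp(-max(tv - tau, 0.0) / c)

    # Preconditioned guidance update (23); g stands in for the chain-rule
    # gradient w.r.t. v_i (a simplification of this sketch)
    return v_prop - rho_i * kappa * g
```

Note how a rough $\hat v_0^{(i)}$ (large TV) automatically attenuates `rho_i`, while the ratio weights `kappa` boost updates where the misfit gradient is small, e.g. at depth.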
As a primary accuracy measure, we report the relative $\ell_2$ reconstruction error
$$e_{\ell_2} := \frac{\|v_{\mathrm{true}} - v_{\mathrm{rec}}\|_2}{\|v_{\mathrm{true}}\|_2},$$
where $v_{\mathrm{true}} \in \mathbb{R}^m$ denotes the ground-truth velocity model (vectorized on the computational grid), $v_{\mathrm{rec}} \in \mathbb{R}^m$ is the reconstruction, and $\|\cdot\|_2$ is the Euclidean norm. In addition, to quantify perceptual fidelity and structural consistency, we report the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) computed between $v_{\mathrm{rec}}$ and $v_{\mathrm{true}}$.

Baselines and implementation details. We compare our method against three baselines:

(i) Baseline 1 ($W_2$+TV) [47]. We consider a classical baseline that minimizes a normalized $W_2$ waveform misfit with total-variation (TV) regularization. Define
$$J_{W_2}^{\mathrm{raw}}(v) := \frac{\sum_s \sum_r W_2^2\big(\rho_{\mathrm{syn}}(\cdot;s,r;v), \rho_{\mathrm{obs}}(\cdot;s,r)\big)}{\sum_s \sum_r \int_0^1 |F_{\mathrm{obs},s,r}^{-1}(\xi)|^2\, d\xi},$$
where $\rho_{\mathrm{syn}}(\cdot;s,r;v)$ and $\rho_{\mathrm{obs}}(\cdot;s,r)$ are probability densities obtained from the raw synthetic and observed traces $d_{\mathrm{syn}}(\cdot;s,r;v)$ and $d_{\mathrm{obs}}(\cdot;s,r)$, respectively, via the fixed shift–rescale density map (cf. (20)); in particular, no wavefield enhancement is applied here. Moreover, $F_{\mathrm{obs},s,r}$ denotes the CDF of $\rho_{\mathrm{obs}}(\cdot;s,r)$, and $F_{\mathrm{obs},s,r}^{-1}$ is the associated quantile function. We then minimize the TV-regularized objective
$$\min_v\; J_{W_2}^{\mathrm{raw}}(v) + \alpha\, \mathrm{TV}(v),$$
where $\alpha > 0$ is the regularization parameter and $\mathrm{TV}(\cdot)$ is defined in (25). Using a first-order method, a (sub)gradient descent iteration reads
$$v \leftarrow v - \rho_0 \big( \nabla_v J_{W_2}^{\mathrm{raw}}(v) + \alpha\, g_{\mathrm{TV}}(v) \big), \qquad g_{\mathrm{TV}}(v) \in \partial\, \mathrm{TV}(v),$$
with a fixed step size $\rho_0 > 0$.

(ii) Baseline 2 (OT-WE+TV).
We consider an improved version of the above method that minimizes the wavefield-enhanced normalized OT misfit $J(v)$ in (22) (Section 5.1), augmented with a TV penalty:
$$\min_v\; J(v) + \alpha\, \mathrm{TV}(v),$$
where $\alpha > 0$ is the regularization parameter and $\mathrm{TV}(\cdot)$ is defined in (25). We solve this problem using a first-order method. In particular, we use a diagonally preconditioned (sub)gradient step
$$v \leftarrow v - P(v) \big( \nabla_v J(v) + \alpha\, g_{\mathrm{TV}}(v) \big), \qquad g_{\mathrm{TV}}(v) \in \partial\, \mathrm{TV}(v),$$
where $P(v) = \rho(v)\, D(v)$ is a positive diagonal preconditioner updated from the current iterate. Here $\rho(v)$ and $D(v)$ are defined by (26) and (27), respectively, after replacing $\hat{v}_0^{(i)}$ by $v$.

(iii) Baseline 3 (DPS) [30]. We use the original diffusion posterior sampling baseline with an $\ell_2$ (MSE) data-consistency potential,
$$\Phi_{\ell_2}(v) := \tfrac{1}{2}\, \|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2,$$
and apply the standard scalar guidance update (17) at each reverse-time step. No OT-based misfit, wavefield enhancement, or preconditioned guidance is used in this baseline.

Across all methods, we use the same forward solver, acquisition geometry, and observation data. For a fair comparison, we tune the hyperparameters of each competing method as carefully as possible (within the same experimental protocol) to achieve its best attainable performance.

6.1 Forward solver and observation data

The forward operator $\mathcal{F}$ (defined in Section 2) is evaluated by numerically solving problem (1) for the wavefield $u$ given the velocity field $v$. For a source located at $\xi_j$, we use the source term $s(x,t;\xi_j) = f(t)\, \delta(x - \xi_j)$, where $f$ is a Ricker wavelet
$$f(t) = \big( 1 - 2\pi^2 f_p^2 (t - t_0)^2 \big) \exp\!\big( -\pi^2 f_p^2 (t - t_0)^2 \big),$$
with $f_p = 15$ and $t_0 = 1.1/f_p$.
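The Ricker source wavelet above is a standard closed form; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def ricker(t, fp=15.0):
    """Ricker wavelet with peak frequency fp and delay t0 = 1.1 / fp,
    matching the source term stated in the text."""
    t0 = 1.1 / fp
    arg = (np.pi * fp * (t - t0)) ** 2
    return (1.0 - 2.0 * arg) * np.exp(-arg)
```

The wavelet attains its maximum value 1 at $t = t_0$ and has the familiar negative side lobes on either side of the central peak.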
We discretize (1) on a uniform Cartesian grid of size $71 \times 71$ with spacing $dx = dz = 10$ m and advance in time with step $dt = 10^{-3}$ s for $n_t = 1000$ steps. We use an 8th-order accurate finite-difference discretization in space and a second-order accurate explicit time-stepping scheme, as implemented in Deepwave [48]. To approximate an unbounded domain, we truncate the computational domain using absorbing perfectly matched layers (PMLs) of thickness 6 grid points on the left, right, and bottom boundaries. On the top boundary, we impose a reflecting (free-surface) boundary condition. The observed data are generated according to (2), with additive Gaussian noise $\eta \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma = 0.05$.

6.2 Numerical calculation of the Wasserstein-2 distance

For each source–receiver trace, we compute a one-dimensional Wasserstein-2 distance between the observed and synthetic signals after enforcing nonnegativity and unit mass. In our method, the Wasserstein distance is computed using the wavefield-enhanced traces $\tilde{d}_{\mathrm{obs}}(t)$ and $\tilde{d}_{\mathrm{syn}}(t)$ of (19). We first shift each trace by a constant $c' > 0$ to ensure positivity and then normalize to obtain probability densities:
$$\tilde\rho_{\mathrm{obs}}(t) = \frac{\tilde{d}_{\mathrm{obs}}(t) + c'}{\int_0^T \big(\tilde{d}_{\mathrm{obs}}(\tau) + c'\big)\, d\tau + \varepsilon'}, \qquad \tilde\rho_{\mathrm{syn}}(t) = \frac{\tilde{d}_{\mathrm{syn}}(t) + c'}{\int_0^T \big(\tilde{d}_{\mathrm{syn}}(\tau) + c'\big)\, d\tau + \varepsilon'},$$
where the integrals are taken over the recording window $[0,T]$ and evaluated on the discrete sampling grid, $\varepsilon' = 10^{-9}$ is a small constant used to avoid division by a vanishing discrete normalization factor, and we fix $c' = 1.1\, |\min_{t \in [0,T]} \tilde{d}_{\mathrm{obs}}(t)|$ in all experiments. Let $\tilde{F}_{\mathrm{obs}}$ and $\tilde{F}_{\mathrm{syn}}$ denote the corresponding CDFs, and let $\tilde{Q}_{\mathrm{obs}} = \tilde{F}_{\mathrm{obs}}^{-1}$ and $\tilde{Q}_{\mathrm{syn}} = \tilde{F}_{\mathrm{syn}}^{-1}$ be the associated quantile functions.
Using the one-dimensional representation
$$W_2(\tilde\rho_{\mathrm{obs}}, \tilde\rho_{\mathrm{syn}}) = \Big( \int_0^1 \big| \tilde{Q}_{\mathrm{obs}}(\xi) - \tilde{Q}_{\mathrm{syn}}(\xi) \big|^2\, d\xi \Big)^{1/2},$$
we discretize $\tilde{F}_{\mathrm{obs}}$ and $\tilde{F}_{\mathrm{syn}}$ on the time grid $\{t_k\}_{k=0}^{n_t-1}$ by trapezoidal quadrature, evaluate $\tilde{Q}_{\mathrm{obs}}(\xi)$ and $\tilde{Q}_{\mathrm{syn}}(\xi)$ via linear interpolation, and approximate the $\xi$-integral on a uniform grid $\xi_j = j/(N_\xi - 1)$ by the trapezoidal rule ($N_\xi = 1000$). The overall OT misfit is obtained by averaging the resulting $W_2$ values over all source–receiver pairs.

6.3 Quantitative and qualitative comparisons

Table 1: Hyperparameters used for each dataset. Here $\gamma$ is the exponent in the diagonal preconditioner $D_i$ for OT-WE-PDPS (27). The symbol $\rho_0$ denotes the base guidance strength in the noise-aware schedule $\rho_i$ for OT-WE-PDPS (26), and it also denotes the (constant) step size used in the first-order solver for the $W_2$+TV baseline. The parameter $\alpha$ is the TV regularization weight used in the TV-regularized baselines (Baseline 1 and Baseline 2), with TV defined in (25).

           CurveVel-A                         CurveVel-B
Param   Ours   DPS    OT-WE+TV  W_2+TV  |  Ours   DPS    OT-WE+TV  W_2+TV
γ       0.55   -      0.65      -       |  0.35   -      0.65      -
ρ_0     1.15   8      1.3       13      |  12.4   4      4         14
α       -      -      0.05      0.2     |  -      -      1.0       0.2

           FlatFault-A                        FlatFault-B
Param   Ours   DPS    OT-WE+TV  W_2+TV  |  Ours   DPS    OT-WE+TV  W_2+TV
γ       0.35   -      0.65      -       |  0.45   -      0.65      -
ρ_0     1.2    0.3    1.1       5       |  1.45   9      0.6       2
α       -      -      0.3       1.0     |  -      -      0.2       0.5

           CurveFault-A                       CurveFault-B
Param   Ours   DPS    OT-WE+TV  W_2+TV  |  Ours   DPS    OT-WE+TV  W_2+TV
γ       0.45   -      0.65      -       |  0.55   -      0.65      -
ρ_0     1.8    4.15   1.3       12      |  1.75   5      0.6       14
α       -      -      0.1       0.2     |  -      -      0.1       0.5

6.3.1 Acquisition geometry and discretization

We consider a surface acquisition setting in which both sources and receivers are located on the free surface $z = L_z$, where $L_x = L_z = 700$.
The computational domain is discretized on a $71 \times 71$ Cartesian grid with spacing $dx = dz = 10$ m, where $x$ denotes the horizontal axis and $z$ the depth axis. Grid nodes are given by $x_i = i\,dx$ and $z_j = j\,dz$ for $i,j = 0, \dots, 70$. Ten point sources are placed at surface nodes
$$(x_{s_k}, z_{s_k}) = (x_{i_k}, L_z) = (i_k\,dx,\, L_z), \qquad i_k \in \{0, 7, 14, 21, 28, 35, 42, 49, 56, 63\},$$
and 70 receivers are deployed at all surface nodes,
$$(x_r, z_r) = (r\,dx,\, L_z), \qquad r = 0, 1, \dots, 69.$$

6.3.2 Test datasets

We evaluate six velocity models shown in Figure 2: CurveVel-A/B, FlatFault-A/B, and CurveFault-A/B. These models are designed to probe increasing structural complexity, from smooth layered interfaces to sharp discontinuities and their combinations. CurveVel-A/B are layered media with clear interfaces (including curved boundaries), intended to test interface recovery. In CurveVel-A, velocities within layers increase gradually with depth, whereas in CurveVel-B the layer velocities are randomly distributed. FlatFault-A/B introduce faults that shift layers and generate sharp discontinuities; compared with CurveVel-A/B, they exhibit more discontinuities and more severe velocity contrasts. CurveFault-A/B combine curved stratified interfaces with fault-induced discontinuities, thereby coupling smooth geometric variation with sharp jumps and posing the most challenging scenarios among the six datasets.

6.3.3 Settings

We set $k = 100$ in (18) and $\varepsilon = 10^{-4}$ in the construction of $D_i$ (27). When computing $\rho_i$ (26), we set $c = 0.1$ and $\tau = 0$ in our method. These settings are kept fixed across all datasets. Other hyperparameters vary across datasets and methods and are listed in Table 1.
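The surface acquisition layout of Section 6.3.1 can be reproduced in a few lines (a sketch; the variable names are ours):

```python
import numpy as np

# 71 x 71 grid with 10 m spacing; sources every 7th surface node,
# receivers at all 70 described surface nodes (Sec. 6.3.1).
dx = 10.0
src_idx = np.arange(0, 64, 7)   # i_k in {0, 7, ..., 63}: ten sources
src_x = src_idx * dx            # horizontal source positions (m)
rec_x = np.arange(70) * dx      # receiver positions r*dx, r = 0, ..., 69
```

These index arrays are what a forward solver such as Deepwave would consume as source and receiver coordinates along the free surface.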
For Baseline 1 ($W_2$+TV) and Baseline 2 (OT-WE+TV), we run gradient descent for 10,000 iterations and report the final iterate; step sizes and regularization weights are listed in Table 1. Across all methods, we use the same forward solver, acquisition geometry, and observation data. Hyperparameters are tuned under a unified protocol to obtain the best attainable performance for each competing method.

6.3.4 Results

The following observations are drawn from Table 2 and Figure 2.

Table 2: Quantitative comparison of inversion methods on OpenFWI velocity models. Metrics are the relative $\ell_2$ error $e_{\ell_2}$, PSNR, and SSIM for the reconstructed velocity field (arrows indicate the preferred direction). Methods include $W_2$+TV (Baseline 1), OT-WE+TV (Baseline 2), DPS (Baseline 3), and the proposed method.

              CurveVel-A               CurveVel-B               FlatFault-A
Method      e_l2↓   PSNR↑  SSIM↑     e_l2↓   PSNR↑  SSIM↑     e_l2↓   PSNR↑  SSIM↑
W_2+TV      12.10%  20.65  0.4971    10.72%  21.05  0.5425    11.84%  20.56  0.7198
OT-WE+TV    7.96%   24.28  0.6987    7.40%   24.27  0.6434    7.28%   24.79  0.9217
DPS         23.15%  15.01  0.5747    15.70%  17.73  0.4329    16.68%  17.58  0.6965
Ours        3.50%   31.42  0.9039    2.04%   35.45  0.9517    2.91%   32.74  0.9646

              FlatFault-B              CurveFault-A             CurveFault-B
Method      e_l2↓   PSNR↑  SSIM↑     e_l2↓   PSNR↑  SSIM↑     e_l2↓   PSNR↑  SSIM↑
W_2+TV      22.82%  17.05  0.4584    17.63%  17.74  0.4875    16.34%  18.20  0.4343
OT-WE+TV    6.31%   28.21  0.8729    4.79%   29.06  0.7897    9.39%   23.01  0.6997
DPS         13.54%  21.58  0.5662    10.72%  22.06  0.6149    21.82%  15.69  0.2210
Ours        6.13%   28.48  0.9272    2.41%   35.03  0.9540    5.32%   27.95  0.7956

Figure 2: Inversion results across the different datasets (CurveVel-A/B, FlatFault-A/B, CurveFault-A/B). The first row displays the true velocity field $v_{\mathrm{true}}$. Rows 2 to 5 show the inverted velocity fields $v_{\mathrm{rec}}$ obtained using the different algorithms (Ours, DPS, OT-WE+TV, $W_2$+TV).
Rows 6 to 9 present the difference between the true and inverted velocity fields, $v_{\mathrm{rec}} - v_{\mathrm{true}}$. The observed wavefield has been perturbed with additive noise of intensity $\sigma = 0.05$.

Figure 3: Relative $l_2$-error of the reconstructed velocity field $v$ versus the TV regularization parameter $\alpha$ for the two deterministic methods: (a) $W_2$+TV and (b) OT-WE+TV.

Figure 4: Relative $l_2$-error of the reconstructed velocity field $v$ versus the iteration number for the two deterministic methods with their optimal TV regularization parameters: (a) $W_2$+TV with $\alpha = 0.2$ and (b) OT-WE+TV with $\alpha = 0.08$.

1. Our method attains the lowest reconstruction errors and the highest image-quality metrics across the datasets, while preserving interfaces and fault geometries in Figure 2. In contrast, the DPS baseline often yields reconstructions with a larger mismatch to the ground truth, suggesting that the standard MSE-based wavefield guidance is too weak to enforce data consistency under surface-only acquisition.

2.
The $W_2$+TV baseline often produces over-smoothed reconstructions, with blurred interfaces and loss of structural detail, reflected in lower SSIM and visually smeared layer boundaries. Our method mitigates this over-smoothing and recovers the main structures, although errors remain in challenging regions.

3. As the model complexity increases (e.g., more layers, stronger discontinuities, and curved/faulted interfaces), the inversion problem becomes more ill-posed and the reconstruction accuracy degrades. For our method, residual errors concentrate near sharp velocity transitions and at larger depths, consistent with reduced illumination and parameter sensitivity in deeper regions under surface-only acquisition.

4. The OT-WE+TV baseline frequently outperforms the $W_2$+TV baseline in terms of both quantitative metrics and visual quality. This suggests that wavefield enhancement and adaptive step sizing are effective not only within the DPS framework, but may also improve the performance of other inversion methods.

To verify that the selected regularization parameter $\alpha$ is appropriate for the deterministic methods, we take the CurveVel-A dataset as an example and plot, in Figure 3, the relationship between the regularization parameter and the final relative error for (i) the $W_2$+TV method and (ii) the OT-WE+TV method. In addition, Figure 4 shows the evolution of the relative error with the iteration number for these two methods. Since Figure 4(a) indicates that the relative error of the $W_2$+TV method is still decreasing, we further continue this method to 80,000 iterations under the same parameter setting. Even then, the relative error is only reduced to 10.5%, which is still higher than the best relative error achieved by the OT-WE+TV method, namely 8%.
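For reference, the accuracy metrics used throughout this section are easy to compute; the following is a NumPy sketch (the PSNR peak convention is our assumption, since the text does not state it; SSIM would typically come from a library such as scikit-image).

```python
import numpy as np

def rel_l2_error(v_true, v_rec):
    """Relative l2 reconstruction error: ||v_true - v_rec||_2 / ||v_true||_2."""
    return np.linalg.norm(v_true - v_rec) / np.linalg.norm(v_true)

def psnr(v_true, v_rec):
    """Peak signal-to-noise ratio in dB. The peak is taken as the dynamic
    range of the ground-truth model, one common convention (an assumption
    of this sketch; the paper does not specify its exact choice)."""
    mse = np.mean((v_true - v_rec) ** 2)
    peak = v_true.max() - v_true.min()
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform 10% amplitude bias, for instance, yields a relative error of exactly 0.1 regardless of the model.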
The results indicate that, for the deterministic methods, the final inversion error is sensitive to the choice of the TV regularization parameter α, and a best-performing value can be identified within the tested parameter range. For the remaining datasets, α was selected by the same tuning procedure in order to achieve the best performance attainable within the tested range for each deterministic method. In addition, measured in terms of iteration count, the deterministic methods require substantially more iterations to converge than our method.

Figure 5: Ablation study on CurveVel-B. The top panel shows the ground-truth velocity model. The second and third panels report reconstructions obtained with W2-based and MSE-based guidance, respectively. From left to right, the columns correspond to Only (baseline DPS), Step (preconditioned guidance update P_i = ρ_i D_i), Enh (wavefield enhancement: amplitude-adaptive weighting and misfit-scale normalization), and All (Step + Enh). The bottom two panels show the corresponding error maps (reconstruction minus ground truth).

Table 3: Ablation results on CurveVel-B: relative ℓ2 error (e_ℓ2; lower is better), PSNR and SSIM (higher is better). Here Only denotes the baseline DPS update with the specified misfit, Enh enables wavefield enhancement (amplitude-adaptive data weighting and misfit-scale normalization), Step enables the preconditioned guidance update P_i = ρ_i D_i, and All enables both Enh and Step.

  Method         e_ℓ2 ↓   PSNR ↑   SSIM ↑
  MSE guidance
    Only         18.21%   16.45    0.3827
    Step         12.52%   19.70    0.5082
    Enh           9.20%   22.37    0.6578
    All           5.10%   27.50    0.8226
  W2 guidance
    Only         10.81%   20.98    0.6307
    Step          8.65%   22.91    0.6726
    Enh           5.33%   27.13    0.8244
    All (ours)    2.04%   35.45    0.9517

6.4 Ablation Study

In the preceding experiments, we evaluated the reconstruction performance of our method for FWI.
We next report an ablation study to assess the contribution of its principal components. Taking the original DPS framework as the reference, we investigate: (i) the choice of data-misfit metric, comparing MSE with the Wasserstein-2 distance (W2); (ii) wavefield enhancement, consisting of amplitude-adaptive data weighting and misfit-scale normalization (Section 5.1); and (iii) the preconditioned guidance update P_i = ρ_i D_i (Section 5.2), which combines stabilized guidance scheduling with spatially adaptive diagonal scaling. Unless otherwise noted, all ablation experiments are performed on the CurveVel-B model in Figure 2. For each setting, hyperparameters are selected according to the same tuning protocol so as to achieve the best performance attainable within that setting. The results are summarized in Figure 5 and Table 3. Figure 6 further reports the evolution of the relative ℓ2 error (with respect to v_true) for the intermediate reconstructions v̂_0^(i) and v_i over the reverse-diffusion steps.

Notation. Throughout this ablation study, each label has the form (misfit)-(component). The prefix W2- or MSE- specifies the guidance misfit used in the data-consistency term, namely the Wasserstein-2 distance or the mean-squared error, respectively. The suffix indicates which additional components are enabled:

• Only: baseline DPS with the chosen misfit, without any of our additional modifications.
• Enh: baseline DPS plus wavefield enhancement, i.e., amplitude-adaptive data weighting together with misfit-scale normalization (Sec. 5.1).
• Step: baseline DPS plus the preconditioned guidance update P_i = ρ_i D_i (Sec. 5.2), which combines the noise-aware scalar schedule ρ_i and the spatial diagonal scaling D_i.
• All: baseline DPS with both Enh and Step enabled.

For example, W2-Enh uses W2 guidance together with wavefield enhancement, while MSE-All enables both components under MSE guidance.
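The Step component replaces the single scalar guidance weight of standard DPS with the preconditioner P_i = ρ_i D_i. A schematic numpy sketch of the general shape of such an update (the function names, the `reverse_step` placeholder, and the toy values are illustrative, not the paper's implementation):

```python
import numpy as np

def guided_update(v, grad_misfit, rho_i, D_i, reverse_step):
    """One reverse-diffusion step with a preconditioned guidance term.

    Schematic only: `reverse_step` stands for the unconditional
    reverse-time update of the trained score model, `grad_misfit` for the
    gradient of the data-consistency potential (e.g. the W2 misfit through
    the wave solver), `rho_i` for the step-dependent scalar schedule, and
    `D_i` for a positive, spatially varying diagonal scaling stored
    elementwise. With rho_i constant and D_i = 1 this reduces to a plain
    DPS-style scalar correction."""
    v_prior = reverse_step(v)                    # prior (score) update
    return v_prior - rho_i * D_i * grad_misfit   # guidance with P_i = rho_i * D_i

# Toy check: an identity "score step" isolates the guidance correction.
v = np.zeros((4, 4))
g = np.ones((4, 4))                   # stand-in for a misfit gradient
D = np.full((4, 4), 0.5)              # elementwise diagonal preconditioner
v_new = guided_update(v, g, rho_i=0.1, D_i=D, reverse_step=lambda x: x)
```

The elementwise product with D_i is what allows the guidance strength to vary spatially, e.g. to compensate the weaker sensitivity of deep model regions under surface-only acquisition.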
Figure 6: Evolution of the statistical metrics over reverse-time steps for the ablation settings Enh, Step, Only and All. Panels (a)–(c) show the relative ℓ2-error, PSNR and SSIM of the intermediate iterate v_i under W2-based guidance; panels (d)–(f) show the corresponding metrics for the denoised estimate v̂_0^(i) under W2-based guidance. Panels (g)–(i) and (j)–(l) report the analogous quantities under MSE-based guidance for v_i and v̂_0^(i), respectively.

Observations. Figures 5–6 and Table 3 reveal the following trends: (i) Combining wavefield enhancement with the preconditioned update yields the best overall performance. In particular, W2-All achieves e_ℓ2 = 2.04%, PSNR = 35.45, and SSIM = 0.9517, substantially improving over W2-Only (e_ℓ2 = 10.81%, PSNR = 20.98, SSIM = 0.6307). (ii) Wavefield enhancement provides the larger gain when used alone, while the preconditioned update offers additional improvement when combined. For example, under W2 guidance, adding wavefield enhancement reduces e_ℓ2 from 10.81% (W2-Only) to 5.33% (W2-Enh), whereas adding the preconditioned update alone reduces it to 8.65% (W2-Step). Enabling both components yields the best performance, achieving e_ℓ2 = 2.04% (W2-All). (iii) W2-based guidance is consistently more robust than MSE-based guidance across all ablation settings, both in the baseline configuration (Only) and in the combined configuration (All).
This is also reflected in the visual reconstructions and error maps in Figure 5. (iv) The metric trajectories suggest improved stability with preconditioned guidance. As shown in Figure 6, enabling P_i = ρ_i D_i reduces transient oscillations during intermediate reverse steps and leads to more stable convergence, with the combined setting (All) achieving the best final metrics.

6.5 Robustness Study

Figure 7: Qualitative robustness to observation noise across the six datasets (CurveVel-A, CurveVel-B, FlatFault-A, FlatFault-B, CurveFault-A, CurveFault-B). Each column corresponds to one dataset. The first row shows the true velocity field, and the remaining rows show the reconstructed velocity fields under progressively increasing noise levels σ ∈ {0, 10^-4, 10^-3, 10^-2, 5×10^-2}. The reconstructions remain visually stable as the noise level increases, with the main interfaces and fault structures largely preserved.

6.5.1 Robustness to Observation Noise

We evaluate the robustness of our method to observation noise by corrupting the recorded data with additive noise at increasing levels σ ∈ {0, 10^-4, 10^-3, 10^-2, 5×10^-2}. All other hyperparameters are kept identical to those used in the previous experiments. Qualitative reconstructions are shown in Figure 7, and quantitative results, including the relative ℓ2-error, PSNR, and SSIM, are reported in Table 4. Overall, the reconstructions remain visually stable across all noise levels for all six datasets: the major interfaces and fault geometries are preserved, and the predicted velocity ranges remain consistent (Figure 7). Quantitatively, the performance degrades only mildly as σ increases, with nonmonotone but limited variations in e_ℓ2, PSNR, and SSIM (Table 4). These results indicate that the proposed guidance remains effective under moderate levels of observation noise.
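The quantitative metrics used here and throughout the tables can be sketched in a few lines of numpy; the noise-corruption helper is one plausible form of "additive noise of intensity σ" and is an assumption, not the paper's exact convention:

```python
import numpy as np

def rel_l2_error(v_rec, v_true):
    """Relative l2-error e_l2 = ||v_rec - v_true||_2 / ||v_true||_2."""
    return np.linalg.norm(v_rec - v_true) / np.linalg.norm(v_true)

def psnr(v_rec, v_true):
    """PSNR in dB, with the data range taken from the true model.
    (SSIM, the third metric reported, is typically computed with a
    library routine such as skimage.metrics.structural_similarity.)"""
    mse = np.mean((v_rec - v_true) ** 2)
    data_range = v_true.max() - v_true.min()
    return 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)

def corrupt(d_obs, sigma):
    """Additive Gaussian noise of standard deviation sigma (assumed form)."""
    return d_obs + sigma * rng.standard_normal(d_obs.shape)
```

Lower e_ℓ2 and higher PSNR/SSIM are better, matching the arrows used in the tables.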
Table 4: Quantitative robustness to observation noise across the six datasets. Relative ℓ2-error (e_ℓ2; lower is better), PSNR, and SSIM (higher is better) of the reconstructed velocity fields under different noise levels σ.

            CurveVel-A                CurveVel-B                FlatFault-A
  σ         e_ℓ2↓   PSNR↑  SSIM↑     e_ℓ2↓   PSNR↑  SSIM↑     e_ℓ2↓   PSNR↑  SSIM↑
  0         3.29%   31.96  0.9111    1.95%   35.83  0.9584    1.60%   37.94  0.9820
  10^-4     3.29%   31.96  0.9113    1.94%   35.90  0.9594    1.84%   36.74  0.9691
  10^-3     3.39%   31.70  0.9075    2.23%   34.69  0.9451    1.95%   36.20  0.9739
  10^-2     3.62%   31.12  0.9121    2.07%   35.34  0.9557    1.97%   36.15  0.9747
  5×10^-2   3.50%   31.42  0.9039    2.04%   35.45  0.9517    2.91%   32.74  0.9646

            FlatFault-B               CurveFault-A              CurveFault-B
  σ         e_ℓ2↓   PSNR↑  SSIM↑     e_ℓ2↓   PSNR↑  SSIM↑     e_ℓ2↓   PSNR↑  SSIM↑
  0         6.07%   28.56  0.9242    2.41%   35.01  0.9602    5.67%   27.39  0.7920
  10^-4     6.03%   28.61  0.9240    2.22%   35.75  0.9606    6.03%   26.86  0.7779
  10^-3     6.10%   28.51  0.9225    2.45%   34.90  0.9552    6.34%   26.43  0.7407
  10^-2     6.15%   28.45  0.9214    2.26%   35.56  0.9609    6.65%   26.01  0.7304
  5×10^-2   6.13%   28.48  0.9272    2.41%   35.03  0.9540    5.32%   27.95  0.7956

6.6 Generalization across forward operators

An important feature of the proposed framework is that the learned score-based prior is trained only on prior velocity models and does not depend on a specific forward operator. Once this prior has been trained, it can be coupled with different forward models F for inversion without retraining the prior network. In this subsection, we examine the robustness of the proposed method with respect to changes in the forward operator by perturbing the acquisition and physical settings and then re-running the inversion. We consider the FlatFault-B dataset and define five forward-operator configurations. The baseline configuration, denoted Init, coincides with that used in the preceding experiments. We then modify the acquisition and physical settings in several ways, thereby obtaining perturbed forward operators as follows:

1.
Freq10: we replace the temporal source function by a Ricker wavelet of peak frequency f_p = 10 Hz, while keeping the spatial point-source locations, all other parameters, and the acquisition geometry unchanged. Consequently, the observed and synthetic wavefields are generated with a different source peak frequency.

2. Depth2: we shift the acquisition array downward by two grid intervals by setting the source and receiver depths to z = L_z − 2dx instead of z = L_z, while keeping their horizontal coordinates fixed. This modification changes how waves propagate from the sources to the receivers and how the subsurface structures are illuminated.

3. Shot17: relative to the baseline configuration with 10 sources, we increase the number of sources to 17 and distribute them uniformly along the surface, while keeping the receiver layout unchanged. This perturbation yields denser illumination and increased data redundancy for the inversion.

4. AllChange: we apply Freq10, Depth2, and Shot17 simultaneously, thereby introducing a compound perturbation of the forward operator that modifies the source peak frequency, the source and receiver depths, and the number of sources.

For a fair comparison across the different operator settings, the network architecture, diffusion schedule, and all remaining hyperparameters are kept identical to those used in the preceding experiments. For each operator configuration, only the guidance-scale parameter ρ_0 is tuned, following the same protocol, so as to achieve the best performance attainable for our method; the selected values are listed in Table 5. The corresponding quantitative metrics and qualitative reconstructions are reported in Table 5 and Figure 8, respectively.

Figure 8: Inversion results under different forward-operator configurations (FlatFault-B).
From left to right: Init (baseline setup), Shot17 (17 uniformly distributed sources), Depth2 (sources and receivers shifted downward by 2dx), Freq10 (Ricker wavelet peak frequency f_p = 10 Hz), and AllChange (all modifications applied). The reconstructions remain visually consistent across operator changes, indicating that the proposed method generalizes well to perturbed forward operators without retraining the prior.

Table 5: Reconstruction performance on FlatFault-B under different forward-operator configurations. Reported are the tuned values of ρ_0, together with the relative ℓ2-error (e_ℓ2; lower is better), PSNR, and SSIM (higher is better).

  Operator    ρ_0    e_ℓ2↓   PSNR↑   SSIM↑
  Init        1.45   6.13%   28.48   0.9272
  Shot17      2.25   5.24%   29.84   0.9399
  Depth2      1.60   5.37%   29.62   0.9343
  Freq10      1.00   5.85%   28.87   0.9160
  AllChange   1.30   5.82%   28.92   0.9168

It can be observed that, across all forward-operator configurations, our method consistently yields high-quality reconstructions, with SSIM exceeding 0.9 and relative errors below 6.2% (Table 5). These results indicate that, once the score-based prior is trained on velocity models, it can be coupled with modified acquisition/physics settings (e.g., changes in source frequency, source/receiver depth, or the number of shots) without retraining the prior. This operator-agnostic property substantially broadens the applicability of the proposed inversion framework.

6.7 A unified diffusion model for heterogeneous velocity fields

6.7.1 Performance of the mixed model on OpenFWI

In the preceding experiments, a separate diffusion model was trained for each family of velocity fields. Although this dataset-specific strategy is effective when the structural class is known a priori, it is of limited practical value for inverse problems, where the subsurface structure is typically unknown.
To address this issue, we next consider a mixed model trained jointly on a heterogeneous dataset assembled from multiple structural classes. The resulting unified model is designed to learn a broader prior and thus avoids the need to select a class-specific model at the inference stage.

Figure 9: Comparison of inversion results produced by the mixed and separate models for CurveVel-A, FlatFault-A, and CurveFault-B. The first row shows the true velocity models v_true. The second and third rows display the reconstructed velocity models v_rec obtained by the mixed model and the separate model, respectively. The fourth and fifth rows show the reconstruction errors, v_rec − v_true, corresponding to the mixed and separate models, respectively.

Specifically, we randomly selected 7,500 samples from each of CurveVel-A, CurveVel-B, FlatFault-A, FlatFault-B, CurveFault-A, and CurveFault-B, and merged them into a single training set. The mixed model was then evaluated on several test datasets, among which CurveVel-A, FlatFault-A, and CurveFault-B are reported here as representative examples. We adopted the same operator setting as in Section 6.3.1 and carefully tuned the remaining hyperparameters. More precisely, we set k = 100, c = 0.1, τ = 0, ε = 10^-4, and γ = 0.55, and chose ρ_0 = 0.95, 0.85, and 1.4 for CurveVel-A, FlatFault-A, and CurveFault-B, respectively. For each dataset, the performance of the mixed model is compared with that of the corresponding separate model trained exclusively on the same dataset. The quantitative and qualitative results are reported in Table 6 and Figure 9, respectively.
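The mixed-training-set assembly described above can be sketched as follows; the array names and the seed handling are illustrative, not taken from the paper's code:

```python
import numpy as np

def build_mixed_set(class_arrays, n_per_class=7500, seed=0):
    """Assemble a mixed training set by drawing n_per_class samples
    without replacement from each structural class and concatenating
    them, mirroring the 7,500-per-dataset selection described above.
    `class_arrays` holds one array of shape (N_k, H, W) per class."""
    rng = np.random.default_rng(seed)
    picks = [a[rng.choice(len(a), size=n_per_class, replace=False)]
             for a in class_arrays]
    mixed = np.concatenate(picks, axis=0)
    rng.shuffle(mixed)   # shuffle across classes along the sample axis
    return mixed
```

With the six OpenFWI families this yields a 45,000-sample training set on which a single score network is trained.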
It can be observed that, when the score-based neural network is trained on the mixed dataset rather than on a single dataset, the inversion performance degrades slightly across all three representative examples. Nevertheless, the mixed model still yields satisfactory reconstructions and remains clearly competitive in practice. These results indicate that, when strong prior knowledge of the target velocity field is available, a dataset-specific model trained on structurally homogeneous samples is generally preferable. By contrast, when prior information is limited or the structural type is unknown, the mixed model provides a practical and flexible alternative. A plausible reason for the performance gap is that the heterogeneous training distribution makes it more challenging for a single score network to capture the full range of structural patterns with equal fidelity. As a consequence, the learned prior may become less accurate for individual classes than that obtained from dataset-specific training. Moreover, owing to computational constraints, the mixed model was not trained on the full collection of available samples, which may also contribute to the observed performance gap.

Table 6: Quantitative comparison of inversion results on OpenFWI velocity models using the separate and mixed models. Metrics include the relative ℓ2 error e_ℓ2, PSNR, and SSIM for the reconstructed velocity model (arrows indicate the preferred direction).

  Dataset                        e_ℓ2↓   PSNR↑   SSIM↑
  CurveVel-A (separate model)    3.50%   31.42   0.9039
  CurveVel-A (mixed model)       3.87%   30.54   0.8906
  FlatFault-A (separate model)   2.91%   32.74   0.9646
  FlatFault-A (mixed model)      3.28%   31.72   0.9179
  CurveFault-B (separate model)  5.32%   27.95   0.7956
  CurveFault-B (mixed model)     5.91%   27.04   0.7079

6.7.2 Performance of the mixed model on the Marmousi2 model

An important practical advantage of the mixed model is its ability to provide a unified prior for structures not explicitly represented by any single training class.
To investigate this generalization capability, we applied the mixed model to the inversion of the Marmousi2 dataset [49]. We emphasize that the training data contained no samples from Marmousi2. In the numerical experiments, local velocity models of different sizes were cropped from Marmousi2 and normalized to form dimensionless 71×71 data matrices. For all local velocity models, we fixed k = 100, c = 0.1, τ = 0, ε = 10^-4, and γ = 0.55, while ρ_0 was chosen separately for each example. We considered four local velocity models cropped from Marmousi2 at different physical scales. Their physical sizes, grid spacings, and the corresponding values of ρ_0 are summarized in Table 7. For all cases, we used exactly the same trained mixed model without any fine-tuning or distillation. We followed the same experimental setup as in the Depth2 experiment of the forward-operator generalization study (Section 6.6). The quantitative metrics and qualitative reconstructions are reported in Table 8 and Figure 10, respectively.

Figure 10: Inversion results for the four local velocity models extracted from Marmousi2, denoted Marmousi2-P1–P4. The first row shows the true velocity models v_true. The second row presents the reconstructed velocity models v_rec obtained by the proposed method with the mixed model. The third row shows the corresponding reconstruction errors v_rec − v_true.

Table 7: Experimental settings for the four local velocity models extracted from Marmousi2.
  Local model    Physical size    dx (m)   dz (m)   ρ_0
  Marmousi2-P1   700 m × 175 m    10       2.5      3.9
  Marmousi2-P2   700 m × 263 m    10       3.75     17.9
  Marmousi2-P3   1000 m × 175 m   100/7    2.5      6.1
  Marmousi2-P4   1500 m × 175 m   150/7    2.5      5.0

Table 8: Quantitative comparison of inversion results obtained by the mixed model for the four local velocity models extracted from Marmousi2. Metrics include the relative ℓ2 error e_ℓ2, PSNR, and SSIM for the reconstructed velocity model (arrows indicate the preferred direction).

  Local model                     e_ℓ2↓   PSNR↑   SSIM↑
  Marmousi2-P1 (700 m × 175 m)    1.80%   35.95   0.9206
  Marmousi2-P2 (700 m × 263 m)    3.57%   30.49   0.7553
  Marmousi2-P3 (1000 m × 175 m)   2.81%   33.21   0.8975
  Marmousi2-P4 (1500 m × 175 m)   5.31%   26.96   0.8536

The local velocity models extracted from Marmousi2 differ substantially from the OpenFWI datasets in structural features. Despite this mismatch, the proposed method using the mixed model still produces satisfactory reconstructions for local velocity models of different sizes. This is supported both qualitatively by Figure 10, where the reconstructed velocity models preserve the main subsurface structures, and quantitatively by the relative ℓ2 error, PSNR, and SSIM reported in Table 8, which indicate good agreement between the reconstructed and true velocity models. Since no Marmousi2-specific samples were included in the training data, these results provide evidence of the cross-dataset generalization capability of the proposed approach. In this sense, the method offers a promising data-driven framework for mitigating the difficulty arising from limited training data in full waveform inversion.

7 Conclusion

In this paper, we proposed a physics-guided diffusion framework for seismic velocity reconstruction in full waveform inversion.
The method combines an optimal-transport (OT) data-consistency potential with a stabilized variable-metric guidance rule in the reverse diffusion process. On the data-consistency side, the OT potential incorporates bounded amplitude-adaptive reweighting together with a one-dimensional Wasserstein discrepancy evaluated through quantile functions. This construction reduces the dominance of large-amplitude early arrivals and alleviates the sensitivity of pointwise ℓ2 objectives to time and phase misalignment. On the guidance side, the scalar DPS correction is replaced by a diagonal preconditioner P_i = ρ_i D_i, where ρ_i provides a step-dependent scalar schedule and D_i supplies spatially adaptive scaling. This yields a more balanced and stable guidance mechanism for the heterogeneous sensitivities arising in FWI. The numerical results on the OpenFWI benchmarks show that, under matched computational budgets, the proposed method improves reconstruction quality relative to both the standard DPS approach and the deterministic baselines. In particular, the reconstructions produced by the proposed method exhibit lower relative ℓ2 errors and fewer visible artifacts across the tested datasets.

Several directions remain for future work. First, the computational cost of physics-based guidance is still substantial, and it would be valuable to develop more efficient implementations and acceleration strategies, for example through multi-fidelity surrogates. Second, an important extension is to consider more realistic forward models, including elastic, anisotropic, and attenuative media, as well as three-dimensional acquisition geometries.

Acknowledgments

Zheng Ma is supported by NSFC Grant No. 12531016 and Beijing Institute of Applied Physics and Computational Mathematics funding HX02023-6. Xiong-Bin Yan is supported by NSFC Grant No. 12401563.
We also thank the Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS) for its financial support. This research was funded by SIMIS under grant number SIMIS-ID-2025-ST. The authors are grateful for the resources and facilities provided by SIMIS, which were essential for the completion of this work.

References

[1] A. J. Berkhout, Pushing the limits of seismic imaging, Part I: Integration of prestack migration, velocity estimation, and AVO analysis, Geophysics 62 (3) (1997) 954–969.
[2] C. Zelt, R. Smith, Seismic traveltime inversion for 2-D crustal velocity structure, Geophysical Journal International 108 (1) (1992) 16–34.
[3] F. Clement, G. Chavent, S. Gómez, Migration-based traveltime waveform inversion of 2-D simple structures: A synthetic example, Geophysics 66 (3) (2001) 845–860.
[4] J. Hudson, J. Heritage, The use of the Born approximation in seismic scattering problems, Geophysical Journal International 66 (1) (1981) 221–240.
[5] K. Muhumuza, M. Jakobsen, T. Luostari, T. Lähivaara, Seismic monitoring of CO2 injection using a distorted Born T-matrix approach in acoustic approximation, Journal of Seismic Exploration 27 (2018) 403–431.
[6] A. Tarantola, Inversion of seismic reflection data in the acoustic approximation, Geophysics 49 (8) (1984) 1259–1266.
[7] M. Warner, A. Ratcliffe, T. Nangoo, J. Morgan, A. Umpleby, N. Shah, V. Vinje, I. Štekl, L. Guasch, C. Win, Anisotropic 3D full-waveform inversion, Geophysics 78 (2) (2013) R59–R80.
[8] M. Jakobsen, B. Ursin, Full waveform inversion in the frequency domain using direct iterative T-matrix methods, Journal of Geophysics and Engineering 12 (3) (2015) 400–418.
[9] J. Virieux, S. Operto, An overview of full-waveform inversion in exploration geophysics, Geophysics 74 (6) (2009) WCC1–WCC26.
[10] A. Fichtner, Full Seismic Waveform Modelling and Inversion, Springer Science & Business Media, 2010.
[11] G. Yao, N. V. da Silva, M. Warner, D. Wu, C. Yang, Tackling cycle skipping in full-waveform inversion with intermediate data, Geophysics 84 (3) (2019) R411–R427.
[12] Y. Yang, B. Engquist, J. Sun, B. F. Hamfeldt, Application of optimal transport and the quadratic Wasserstein metric to full-waveform inversion, Geophysics 83 (1) (2018) R43–R62.
[13] R. G. Pratt, Seismic waveform inversion in the frequency domain, Part 1: Theory and verification in a physical scale model, Geophysics 64 (3) (1999) 888–901.
[14] C. Bunks, F. M. Saleck, S. Zaleski, G. Chavent, Multiscale seismic waveform inversion, Geophysics 60 (5) (1995) 1457–1473.
[15] O. Gauthier, J. Virieux, A. Tarantola, Two-dimensional nonlinear inversion of seismic waveforms: Numerical results, Geophysics 51 (7) (1986) 1387–1403.
[16] L. Métivier, R. Brossier, Q. Mérigot, E. Oudet, J. Virieux, Measuring the misfit between seismograms using an optimal transport distance: Application to full waveform inversion, Geophysical Journal International 205 (1) (2016) 345–377.
[17] Y. Liu, J. Teng, T. Xu, Y. Wang, Q. Liu, J. Badal, Robust time-domain full waveform inversion with normalized zero-lag cross-correlation objective function, Geophysical Journal International 209 (1) (2017) 106–122.
[18] E. Bozdağ, J. Trampert, J. Tromp, Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements, Geophysical Journal International 185 (2) (2011) 845–870.
[19] B. Chi, L. Dong, Y. Liu, Full waveform inversion method using envelope objective function without low frequency data, Journal of Applied Geophysics 109 (2014) 36–46.
[20] M. Warner, L. Guasch, Adaptive waveform inversion: Theory, Geophysics 81 (6) (2016) R429–R445.
[21] L. Qiu, J. Ramos-Martínez, A. Valenciano, Y. Yang, B. Engquist, Full-waveform inversion with an exponentially encoded optimal-transport norm, in: SEG Technical Program Expanded Abstracts 2017, 2017, pp. 1286–1290.
[22] Z. Li, Y. Tang, J. Chen, H. Wu, The quadratic Wasserstein metric with squaring scaling for seismic velocity inversion, arXiv preprint arXiv:2201.11305 (2022).
[23] Y. Gao, F. Tilmann, A. Rietbrock, A review of misfit functions for adjoint full waveform inversion in seismology, Geophysical Journal International 235 (3) (2023) 2794–2827.
[24] L. Mosser, O. Dubrule, M. J. Blunt, Stochastic seismic waveform inversion using generative adversarial networks as a geological prior, Mathematical Geosciences 52 (1) (2020) 53–79.
[25] J. Lopez-Alvis, F. Nguyen, M. Looms, T. Hermans, Geophysical inversion using a variational autoencoder to model an assembled spatial prior uncertainty, Journal of Geophysical Research: Solid Earth 127 (3) (2022) e2021JB022581.
[26] C. Sun, A. Malcolm, R. Kumar, W. Mao, Enabling uncertainty quantification in a standard full-waveform inversion method using normalizing flows, Geophysics 89 (5) (2024) R493–R507.
[27] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems 33 (2020) 6840–6851.
[28] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, arXiv preprint arXiv:2011.13456 (2020).
[29] Y. Song, L. Shen, L. Xing, S. Ermon, Solving inverse problems in medical imaging with score-based generative models, arXiv preprint arXiv:2111.08005 (2021).
[30] H. Chung, J. Kim, M. T. McCann, M. L. Klasky, J. C. Ye, Diffusion posterior sampling for general noisy inverse problems, arXiv preprint arXiv:2209.14687 (2022).
[31] Y. Li, H. Zhang, Z. Yan, T. Alkhalifah, DiffusionInv: Prior-enhanced Bayesian full waveform inversion using diffusion models, arXiv preprint arXiv:2505.03138 (2025).
[32] F. Wang, X. Huang, T. Alkhalifah, Controllable seismic velocity synthesis using generative diffusion models, Journal of Geophysical Research: Machine Learning and Computation 1 (3) (2024) e2024JH000153.
[33] W. S. Phillips, M. C. Fehler, Traveltime tomography: A comparison of popular methods, Geophysics 56 (10) (1991) 1639–1649.
[34] J. Chen, G. Chen, M. Nagaso, P. Tong, Adjoint-state traveltime tomography for azimuthally anisotropic media in spherical coordinates, Geophysical Journal International 234 (1) (2023) 712–736.
[35] S. Hao, J. Chen, M. Xu, P. Tong, Topography-incorporated adjoint-state surface wave traveltime tomography: Method and a case study in Hawaii, Journal of Geophysical Research: Solid Earth 129 (1) (2024) e2023JB027454.
[36] A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010) 451–559.
[37] M. Dashti, A. M. Stuart, The Bayesian approach to inverse problems, arXiv preprint arXiv:1302.6989 (2013).
[38] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, M.-H. Yang, Diffusion models: A comprehensive survey of methods and applications, ACM Computing Surveys 56 (4) (2023) 1–39.
[39] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations, 2021.
[40] B. Øksendal, Stochastic differential equations, in: Stochastic Differential Equations: An Introduction with Applications, Springer, 2003, pp. 38–50.
[41] B. D. Anderson, Reverse-time diffusion equation models, Stochastic Processes and their Applications 12 (3) (1982) 313–326.
[42] P. Vincent, A connection between score matching and denoising autoencoders, Neural Computation 23 (7) (2011) 1661–1674.
[43] C. Deng, S. Feng, H. Wang, X. Zhang, P. Jin, Y. Feng, Q. Zeng, Y. Chen, Y. Lin, OpenFWI: Large-scale multi-structural benchmark datasets for full waveform inversion, Advances in Neural Information Processing Systems 35 (2022) 6007–6020.
[44] L. I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena 60 (1-4) (1992) 259–268.
[45] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer Science & Business Media, 2013.
[46] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[47] B. Engquist, Y. Yang, Optimal transport based seismic inversion: Beyond cycle skipping, Communications on Pure and Applied Mathematics 75 (10) (2022) 2201–2244.
[48] A. Richardson, Deepwave (Oct. 2025). doi:10.5281/zenodo.3829886. URL https://doi.org/10.5281/zenodo.3829886
[49] G. S. Martin, R. Wiley, K. J. Marfurt, Marmousi2: An elastic upgrade for Marmousi, The Leading Edge 25 (2) (2006) 156–166.