Paper deep dive
A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks
Leonardo Del Grande, Christoph Brune, Marcello Carioni
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/22/2026, 5:57:47 AM
Summary
The paper investigates the sparsity of solutions for total variation (TV)-regularized training of infinite-width shallow ReLU neural networks. By formulating the training as a convex optimization problem over measures on the unit sphere, the authors leverage duality theory to show that the dual certificate is piecewise linear. This structure allows them to prove that minimizers are sparse (composed of Dirac deltas) and to establish stability and convergence rates for these sparse solutions under low noise and small regularization parameters.
Entities (5)
Relation Signals (3)
Dual certificate → determines → Support of minimizer
confidence 96% · Since, by optimality conditions, the extreme values of the dual certificates determine the support of the minimizers
Total Variation (TV) regularization → enforces → Sparsity
confidence 95% · In particular, it has the goal of enforcing sparsity in the solution.
Dual regions → determines → Dual certificate
confidence 94% · Its linearity regions, which we name dual regions, are determined by the activation patterns of the data
Cypher Suggestions (2)
Map the relationship between dual certificates and the support of minimizers. · confidence 95% · unvalidated
MATCH (d:Tool {name: 'Dual certificate'})-[:DETERMINES]->(s:Property {name: 'Support of minimizer'}) RETURN d, s
Find all mathematical concepts related to the sparsity analysis of neural networks. · confidence 90% · unvalidated
MATCH (e:Concept)-[:RELATED_TO]->(s:Property {name: 'Sparsity'}) RETURN e, s
Abstract
In this paper, we study total variation (TV)-regularized training of infinite-width shallow ReLU neural networks, formulated as a convex optimization problem over measures on the unit sphere. Our approach leverages the duality theory of TV-regularized optimization problems to establish rigorous guarantees on the sparsity of the solutions to the training problem. Our analysis further characterizes how and when this sparsity persists in a low-noise regime and for small regularization parameters. The key observation that motivates our analysis is that, for ReLU activations, the associated dual certificate is piecewise linear in the weight space. Its linearity regions, which we name dual regions, are determined by the activation patterns of the data via the induced hyperplane arrangement. Taking advantage of this structure, we prove that, on each dual region, the dual certificate admits at most one extreme value. As a consequence, the support of any minimizer is finite, and its cardinality can be bounded from above by a constant depending only on the geometry of the data-induced hyperplane arrangement. We then investigate sufficient conditions ensuring uniqueness of such a sparse solution. Finally, under a suitable non-degeneracy condition on the dual certificate along the boundaries of the dual regions, we prove that, in the presence of low label noise and for small regularization parameters, solutions to the training problem remain sparse with the same number of Dirac deltas. Additionally, their locations and amplitudes converge and, in case the locations lie in the interior of a dual region, the convergence happens at a rate that depends linearly on the noise and the regularization parameter.
Tags
Links
- Source: https://arxiv.org/abs/2603.17785v1
- Canonical: https://arxiv.org/abs/2603.17785v1
Full Text
113,464 characters extracted from source content.
A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks

2020 Mathematics Subject Classification: 46A55, 49K27, 49N15, 49Q22, 52A40, 54E35
Keywords: convex optimization, Dirac deltas, duality, neural networks, regions, sparsity, stability

Leonardo Del Grande, Christoph Brune, Marcello Carioni
Department of Applied Mathematics, University of Twente, 7500AE Enschede, The Netherlands (l.delgrande@utwente.nl, c.brune@utwente.nl, m.c.carioni@utwente.nl)

1 Introduction

Two-layer neural networks with $K$ neurons are parametrized functions $F_{c,w,b}:\mathbb{R}^d\to\mathbb{R}$ defined as
$$F_{c,w,b}(x)=\sum_{i=1}^K c^i\,\sigma(\langle w^i,x\rangle+b^i), \quad (1.1)$$
where $\sigma:\mathbb{R}\to\mathbb{R}$ is the so-called activation function, $c^i\in\mathbb{R}$ and $w^i\in\mathbb{R}^d$ are the weights of the output and hidden layers, and $b^i\in\mathbb{R}$ are the biases. Following the approach in [2, 22], one can consider the formal limit $K\to+\infty$, i.e. let the number of neurons tend to infinity and replace the sum in (1.1) by integration against a measure on the space of weights and biases. In particular, one defines
$$F_\mu(x)=\int_\Theta \sigma(\langle w,x\rangle+b)\,d\mu(w,b), \quad (1.2)$$
where $(w,b)\in\Theta$ and $\mu\in\mathcal{M}(\Theta)$. Formulation (1.2) extends (1.1), since choosing measures of the form $\mu=\sum_{i=1}^N c^i\,\delta_{(w^i,b^i)}$ recovers instances of (1.1). Moreover, despite being infinite-dimensional, (1.2) is convex in the space of measures, allowing for insights facilitated by the tools of convex analysis (see for instance [9]). As often considered in the literature, the bias $b$ is reabsorbed into the weight $w$ by identifying $x=(x,1)$ and $w=(w,b)$, allowing for the more compact representation
$$F_\mu(x)=\int_\Theta \sigma(\langle w,x\rangle)\,d\mu(w). \quad (1.3)$$
In this work, we consider positively $1$-homogeneous activation functions $\sigma:\mathbb{R}\to\mathbb{R}$ and, in particular, we often restrict our attention to the ReLU activation function, defined as $\sigma(z)=\max\{0,z\}$ for $z\in\mathbb{R}$. Due to this $1$-homogeneity, since $\sigma(\langle w,x\rangle)=|w|\,\sigma(\langle w/|w|,x\rangle)$, one can rescale the measure $\mu$ and, without loss of expressivity for $F_\mu$, consider measures on the sphere, that is, $\Theta=\mathbb{S}^{d-1}$. This choice is also convenient, since the compactness of the parameter space yields well-posed variational problems in the space of measures. We make the choice $\Theta=\mathbb{S}^{d-1}$ throughout this paper. According to the previous considerations, given a collection of input/output pairs $(x^j,y^j)_{j=1}^n$ for $n\in\mathbb{N}$, a typical training problem over the space of measures $\mathcal{M}(\mathbb{S}^{d-1})$ can be defined as
$$\min_{\mu\in\mathcal{M}(\mathbb{S}^{d-1})} \frac12\sum_{j=1}^n |F_\mu(x^j)-y^j|^2 + \lambda\|\mu\|_{TV}, \quad (1.4)$$
where $\|\mu\|_{TV}$ is the total variation norm, defined as
$$\|\mu\|_{TV}=\sup\left\{\int \varphi(x)\,d\mu(x) : \varphi\in C(\mathbb{S}^{d-1}) \text{ such that } \|\varphi\|_\infty\leqslant 1\right\}. \quad (1.5)$$
Here the total variation norm acts as a regularizer (whose intensity is governed by $\lambda$). In particular, its goal is to enforce sparsity in the solution, that is, to select solutions that are linear combinations of Dirac deltas, thus recovering a classical neural network with a finite number of neurons. In the regime $\lambda\ll 1$, problem (1.4) approaches an interpolation problem, in which the data fidelity term effectively acts as a hard constraint. Formally, one recovers the constrained formulation
$$\min_{\mu\in\mathcal{M}(\mathbb{S}^{d-1})} \|\mu\|_{TV} \quad \text{subject to } F_\mu(x^j)=y^j \ \ \forall j. \quad (1.6)$$
We also remark that total variation regularization provides a natural representation cost for infinite-width shallow neural networks, opening the way to the analysis of their approximation capabilities.
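As a quick numerical illustration of problem (1.4) (not the paper's method): discretizing the measure $\mu$ on a fixed grid of unit vectors turns $\|\mu\|_{TV}$ into an $\ell^1$ norm of the grid coefficients, so the training problem reduces to a lasso-type problem that can be solved with proximal gradient descent (ISTA). All data below are made up for illustration.

```python
import numpy as np

# Discretize mu = sum_m c_m delta_{w_m} on a grid of M points of S^1;
# then ||mu||_TV = ||c||_1 and (1.4) becomes a lasso problem, solved by ISTA.
rng = np.random.default_rng(0)
d, n, M = 2, 5, 200
X = rng.standard_normal((n, d))                       # data points x^j
y = rng.standard_normal(n)                            # labels y^j
theta = np.linspace(0, 2 * np.pi, M, endpoint=False)
W = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # grid on S^1
A = np.maximum(X @ W.T, 0.0)                          # A[j, m] = ReLU(<w_m, x^j>)

lam = 0.1
c = np.zeros(M)
step = 1.0 / np.linalg.norm(A, 2) ** 2                # 1/L for the smooth term

def objective(c):
    return 0.5 * np.sum((A @ c - y) ** 2) + lam * np.abs(c).sum()

for _ in range(5000):                                 # ISTA: gradient step + soft-threshold
    u = c - step * (A.T @ (A @ c - y))
    c = np.sign(u) * np.maximum(np.abs(u) - lam * step, 0.0)

support = np.flatnonzero(np.abs(c) > 1e-8)            # surviving grid atoms
print(len(support), objective(c))
```

The minimizer concentrates on few grid atoms, mirroring the sparsity the paper establishes for the continuous problem.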
Indeed, for $f\in L^1(\mathbb{S}^{d-1})$ one can define
$$\|f\|_{\mathcal{B}(\mathbb{S}^{d-1})}=\inf\left\{\|\mu\|_{TV} : \mu\in\mathcal{M}(\mathbb{S}^{d-1}),\ F_\mu(x)=f(x) \text{ for a.e. } x\in\mathbb{S}^{d-1}\right\}, \quad (1.7)$$
which can be seen as a representation cost in infinite-width regimes. In analogy with foundational works by Barron [3, 4], the quantity $\|f\|_{\mathcal{B}(\mathbb{S}^{d-1})}$ is also named the Barron norm, and Barron functions are those $f\in L^1(\mathbb{S}^{d-1})$ such that $\|f\|_{\mathcal{B}(\mathbb{S}^{d-1})}<\infty$ (see for instance [21]). Many important works have carried out different aspects of this analysis, related either to the study of the representation cost or to the approximation power of Barron functions [18, 21, 22, 23, 26, 34, 38, 40].

1.1 Sparse guarantees in infinite-width limits of neural networks: state of the art and super-resolution theory

It is natural to ask under which assumptions solutions of problems (1.4) and (1.6) are sparse, namely when their minimizers are linear combinations of Dirac deltas of the form
$$\mu=\sum_{i=1}^N c^i\,\delta_{w^i}. \quad (1.8)$$
As highlighted in [6] in the context of Reproducing Kernel Banach Spaces (RKBS), a first, partial answer is provided by so-called representer theorems [7, 8]. In particular, when choosing the regularizer $R(\mu)=\|\mu\|_{TV}$ and assuming that the weight space $\Theta$ is compact, general representer theorems for infinite-dimensional convex optimization problems guarantee the existence of at least one solution of the form (1.8), with the number of atoms bounded by $N\leqslant n$. Such sparsity results have been established in various training frameworks [5, 6, 35, 36]. However, empirical evidence suggests that sparsity is not merely an existence property: in practice, all minimizers often appear to be sparse even when multiple solutions exist. This naturally raises the question of whether all solutions are sparse.
While this question remains largely unexplored in the machine learning literature, with few exceptions [15, 32], it has been studied in depth in the context of super-resolution, where similar problems are formulated as convex optimization problems over spaces of measures. In this context, the property of having unique and sparse solutions is typically referred to as exact reconstruction. Foundational results [11, 14, 39] identify precise conditions under which exact reconstruction holds for TV-regularized interpolation problems with Fourier constraints. These results typically rely on duality arguments, where the key tool is the construction of dual certificates $\eta$, i.e. dual variables that certify optimality and sparsity of the primal solution by satisfying the associated first-order optimality conditions [25]. Beyond exact recovery, a substantial body of work has focused on sparse stability, namely the robustness of sparsity under perturbations of the problem. In this setting, one seeks conditions ensuring that the sparsity of the solution persists when the regularization parameter $\lambda$ varies and the data and labels are contaminated by noise. A series of influential works [16, 19, 20, 37] has shown that sparse stability is governed by non-degeneracy properties of the dual certificate associated with the optimization problem. More precisely, stability holds when the dual certificate saturates its extreme values on the support of the underlying sparse measure and exhibits strict second-order behavior there, namely local strong concavity (or convexity, depending on the sign convention). These conditions prevent the creation or annihilation of atoms under small perturbations.

1.2 Contributions of the paper

In this paper, we study sparsity properties of both problems (1.4) and (1.6) in the case of the ReLU activation function.
Our analysis follows a duality-based approach, drawing on classical tools from optimization in the space of measures and on the super-resolution techniques discussed in the previous part of the introduction. Duality methods play a central role in infinite-dimensional optimization and in the analysis of sparse representations. Nevertheless, their systematic use to establish sparsity guarantees for infinite-width neural networks remains largely unexplored. We note, however, that duality techniques have appeared in the context of Barron spaces [41] and in certain finite-dimensional settings [29]. We begin our analysis by proving that every solution of both problems is sparse, and that the number of Dirac deltas in any minimizer is uniformly bounded from above by a quantity depending only on the ambient dimension $d$ and the number of data points $n$. The key observation underlying our results is that the dual certificate associated with the problems can be written in the explicit form
$$\eta(w)=\sum_{i=1}^n p^i\,\sigma(\langle w,x^i\rangle), \qquad p^i\in\mathbb{R},\ w\in\mathbb{S}^{d-1}. \quad (1.9)$$
When $\sigma$ is the ReLU activation function, $\eta$ coincides with the restriction to the sphere $\mathbb{S}^{d-1}$ of a piecewise linear function. More precisely, its linearity regions are determined by the intersections of the half-spaces $\{w:\langle w,x^i\rangle>0\}$ and $\{w:\langle w,x^i\rangle<0\}$, referred to as dual regions, and $\eta$ changes slope precisely across their common boundaries $\{w:\langle w,x^j\rangle=0\}$. Due to the geometry of the sphere, this observation allows us to conclude that, in the interior of each dual region, $\eta$ achieves its maximum and minimum in at most one point. Moreover, at the common boundary of the regions (determined by the vanishing of the scalar products $\langle w,x^i\rangle$), a similar analysis can be performed. Since, by the optimality conditions, the extreme values of the dual certificate determine the support of the minimizers, one can then infer the sparsity of the minimizers.
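The piecewise linearity of the certificate (1.9) is easy to verify numerically: within the dual region containing $w$ (i.e. for a fixed activation pattern $\pi$), $\eta$ agrees with the linear form $\langle z,w\rangle$, where $z=\sum_i p^i\pi_i x^i$. A minimal sketch with made-up values:

```python
import numpy as np

# For ReLU, eta(w) = sum_i p_i ReLU(<w, x_i>) equals <z, w> on the dual region
# containing w, where z = sum_i p_i * pi_i * x_i and pi is the sign pattern of w.
rng = np.random.default_rng(1)
d, n = 3, 4
X = rng.standard_normal((n, d))   # data points x^i (illustrative)
p = rng.standard_normal(n)        # dual variables p^i (illustrative)

def eta(w):
    return float(np.sum(p * np.maximum(X @ w, 0.0)))

w = rng.standard_normal(d)
w /= np.linalg.norm(w)            # a point on S^{d-1}
pi = (X @ w > 0).astype(float)    # activation pattern of w
z = (p * pi) @ X                  # slope of eta on that dual region
assert np.isclose(eta(w), z @ w)  # eta is linear inside the region
```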
Additionally, by combinatorial arguments involving the counting of non-empty dual regions, one can also obtain an upper bound on the number of Dirac deltas constituting the minimizers. It is important to emphasize that sparsity of minimizers alone does not imply uniqueness. In Section 3.3, we introduce sufficient conditions guaranteeing the uniqueness of the minimizer. These conditions rely on two key properties: first, given a sparse measure of the form (1.8), the associated dual certificate for problems (1.4) and (1.6) attains the extreme values $|\eta(w)|=1$ exactly at the support points $w^i$; second, the corresponding neural network evaluations at the data points, namely $\int_{\mathbb{S}^{d-1}}\sigma(\langle w,x^j\rangle)\,d\mu(w)$, are linearly independent. In this section we provide sufficient conditions that guarantee such linear independence. Finally, Section 4 is devoted to the analysis of sparse stability for problem (1.6). More precisely, we show that, under suitable assumptions on the dual certificate and for sufficiently small regularization parameters and label noise levels, the sparsity of the solution is preserved, with the same number of Dirac deltas. Moreover, the weights $c^i$ and $w^i$ of the minimizer of the perturbed problem converge to those of the unperturbed one. As discussed in Section 1.1, a key ingredient for sparse stability is the non-degeneracy of the dual certificate associated with problem (1.6). While this remains true in our setting, the piecewise linear structure of the dual certificate leads to non-degeneracy conditions with an explicit characterization. In particular, when a Dirac delta lies in the interior of a dual region, no additional non-degeneracy condition is required. In contrast, when a Dirac delta is located at the boundary of a region, where the dual certificate is non-differentiable, non-degeneracy is characterized by sign conditions on the (distributional) derivative of the dual certificate.
Finally, in the case where the support points $w^i$ lie in the interior of a dual region, we establish quantitative convergence rates for the weights of the minimizer of the perturbed problem. More precisely, under suitable linear independence assumptions on both the neural network evaluations at the data points and their derivatives, we prove that
$$|c^i_{\lambda,\zeta}-c^i_0|=O(\lambda), \qquad \|w^i_{\lambda,\zeta}-w^i_0\|=O(\lambda), \quad (1.10)$$
as $\lambda\to 0$ and the noise $\zeta\to 0$ with $\|\zeta\|\leqslant\alpha\lambda$.

2 Notations and preliminaries

2.1 TV-regularized optimization in the space of measures

In this section, we review the tools of optimization in the space of measures that we use in the rest of the paper. We denote by $\mathcal{M}(X)$ the Banach space of finite signed Radon measures on a compact set $X\subset\mathbb{R}^d$, endowed with the total variation norm. For any Radon measure $\mu$ on $X$, we denote its support by $\mathrm{Supp}(\mu)$. We consider TV-regularized optimization problems in the space of measures (commonly known as BLASSO [1]) of the form
$$\inf_{\mu\in\mathcal{M}(X)} \frac12\|K\mu-y_\zeta\|_Y^2+\lambda\|\mu\|_{TV}, \quad (P_\lambda(y_\zeta))$$
where $Y$ is a separable Hilbert space, $y_\zeta\in Y$, $K:\mathcal{M}(X)\to Y$ is a weak*-to-strong continuous linear operator, and $\lambda$ is a positive parameter. Such problems are popular as inverse problems in the space of measures, with applications to signal processing [11], super-resolution in microscopy [17], sensor placement [33], and many others. Often, the measurement $y_\zeta$ is a noisy perturbation of a ground truth signal $y_0$, that is, $y_\zeta=y_0+\zeta$. The goal is to reconstruct the optimal $\mu$ and determine its sparsity properties. In the formal limit $(\|\zeta\|_Y,\lambda)\to(0,0)$, solutions to $(P_\lambda(y_\zeta))$ converge to solutions of the hard-constrained interpolation problem
$$\inf_{\mu\in\mathcal{M}(X):\,K\mu=y_0} \|\mu\|_{TV}. \quad (P_0(y_0))$$
Existence of minimizers is guaranteed for all of these problems (see for instance [12, Section 33]).
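The coupling between noise and regularization required by the convergence result below (condition (2.1) in Proposition 2.1) is easy to satisfy. For instance, the illustrative choice $\lambda_n=n^{-1/2}$ with noise level $\|\zeta_n\|_Y=n^{-1}$ (not taken from the paper) makes all three quantities vanish, since $\|\zeta_n\|_Y^2/\lambda_n=n^{-3/2}$:

```python
import numpy as np

# Check the three limits in (2.1) for lam_n = n**-0.5, ||zeta_n|| = 1/n
# (illustrative parameter choice, not from the paper).
n = np.arange(1, 10**6 + 1, dtype=float)
zeta = 1.0 / n                 # noise level, -> 0
lam = n ** -0.5                # regularization, -> 0
ratio = zeta ** 2 / lam        # = n**-1.5, -> 0

assert zeta[-1] < 1e-5 and lam[-1] < 1e-2 and ratio[-1] < 1e-8
assert np.all(np.diff(ratio) < 0)   # the ratio decreases monotonically
```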
Moreover, this convergence can be made rigorous, as in the following classical result [27, Theorem 3.5].

Proposition 2.1. Suppose that there exists $\mu\in\mathcal{M}(X)$ such that $K\mu=y_0$. Consider sequences $\zeta_n\in Y$ and $\lambda_n>0$ such that
$$\|\zeta_n\|_Y\to 0, \qquad \lambda_n\to 0, \qquad \frac{\|\zeta_n\|_Y^2}{\lambda_n}\to 0 \quad \text{as } n\to\infty. \quad (2.1)$$
Then, any sequence of minimizers of
$$\inf_{\mu\in\mathcal{M}(X)} \frac12\|K\mu-y_{\zeta_n}\|_Y^2+\lambda_n\|\mu\|_{TV} \quad (2.2)$$
has a weak* converging subsequence, and each limit is a solution to $(P_0(y_0))$. In particular, if $(P_0(y_0))$ has a unique minimizer, then the whole sequence converges weak* to it.

2.1.1 Duality theory and optimality conditions

A crucial source of information on minimizers of $(P_\lambda(y_\zeta))$ and $(P_0(y_0))$ is given by the respective dual problems and optimality conditions. The dual problem associated with $(P_\lambda(y_\zeta))$ (cf. [20, 24]) is
$$\max_{p\in Y:\,\|K^*p\|_\infty\leqslant 1} \langle y_\zeta,p\rangle-\frac{\lambda}{2}\|p\|_Y^2, \quad (D_\lambda(y_\zeta))$$
while the dual problem associated with $(P_0(y_0))$ is
$$\sup_{p\in Y:\,\|K^*p\|_\infty\leqslant 1} \langle y_0,p\rangle. \quad (D_0(y_0))$$
Here $K^*:Y\to C(X)$ denotes the adjoint operator, defined by $\langle K\mu,p\rangle_Y=\int_X (K^*p)(x)\,d\mu(x)$ for all $\mu\in\mathcal{M}(X)$ and $p\in Y$, and $\|\cdot\|_\infty$ is the supremum norm on $C(X)$. Strong duality holds between $(P_\lambda(y_\zeta))$ and $(D_\lambda(y_\zeta))$, and the existence of a solution $\mu_{\lambda,\zeta}\in\mathcal{M}(X)$ to $(P_\lambda(y_\zeta))$ and a solution $p_{\lambda,\zeta}\in Y$ to $(D_\lambda(y_\zeta))$ is equivalent to the following optimality conditions:
$$K^*p_{\lambda,\zeta}\in\partial\|\mu_{\lambda,\zeta}\|_{TV}, \qquad -p_{\lambda,\zeta}=\frac{1}{\lambda}\left(K\mu_{\lambda,\zeta}-y_\zeta\right). \quad (2.3)$$
Similarly, strong duality holds between $(P_0(y_0))$ and $(D_0(y_0))$, and the existence of a solution $\mu_0\in\mathcal{M}(X)$ to $(P_0(y_0))$ and a solution $p_0\in Y$ to $(D_0(y_0))$ is equivalent to the following optimality conditions:
$$K^*p_0\in\partial\|\mu_0\|_{TV}, \qquad K\mu_0=y_0. \quad (2.4)$$
If $\eta_{\lambda,\zeta}=K^*p_{\lambda,\zeta}$ and $\eta_{\lambda,\zeta}\in\partial\|\mu_{\lambda,\zeta}\|_{TV}$, we call $\eta_{\lambda,\zeta}$ a dual certificate for $\mu_{\lambda,\zeta}$ with respect to $(P_\lambda(y_\zeta))$, since the optimality conditions (2.3) guarantee that $\mu_{\lambda,\zeta}$ is a solution to $(P_\lambda(y_\zeta))$. Similarly, if $\eta_0=K^*p_0$ and $\eta_0\in\partial\|\mu_0\|_{TV}$, we call $\eta_0$ a dual certificate for $\mu_0$ with respect to $(P_0(y_0))$, since the optimality conditions (2.4) guarantee that $\mu_0$ is a solution to $(P_0(y_0))$. The solution $p_{\lambda,\zeta}$ to $(D_\lambda(y_\zeta))$ is unique, since this problem can be reformulated as the Hilbert projection of $y_\zeta/\lambda$ onto the closed convex set $\{p\in Y:\|K^*p\|_\infty\leqslant 1\}$, cf. [20, Section 2.3]. On the other hand, the solution to $(D_0(y_0))$ is in general not unique. Among the dual certificates, an important role is played by the minimal-norm dual certificate, defined as follows.

Definition 2.2 (Minimal-norm dual certificate). Suppose that there exists a solution to $(D_0(y_0))$. Then the minimal-norm dual certificate is defined as $\eta_0=K^*p_0$, where
$$p_0=\operatorname{argmin}\left\{\|p\|_Y : p\in Y \text{ is a solution to } (D_0(y_0))\right\}. \quad (2.5)$$
Since the subdifferential of the total variation norm can be characterized as
$$\partial\|\mu\|_{TV}=\left\{\varphi\in C(X) : \|\varphi\|_\infty\leqslant 1 \text{ and } \int_X \varphi(x)\,d\mu=\|\mu\|_{TV}\right\}, \quad (2.6)$$
one can deduce the following well-known result relating the extreme values of the dual certificate to the support of the optimal measure.

Lemma 2.3. Let $\eta_0\in C(X)$ be a dual certificate for $\mu_0$. Then $\mu_0$ is a solution to $(P_0(y_0))$ and
$$\mathrm{Supp}(\mu_0)\subset\{x\in X : |\eta_0(x)|=1\}. \quad (2.7)$$
Similarly, if $\eta_{\lambda,\zeta}\in C(X)$ is the dual certificate for $\mu_{\lambda,\zeta}$, then $\mu_{\lambda,\zeta}$ is a solution to $(P_\lambda(y_\zeta))$ and
$$\mathrm{Supp}(\mu_{\lambda,\zeta})\subset\{x\in X : |\eta_{\lambda,\zeta}(x)|=1\}. \quad (2.8)$$

3 Dual regions and sparsity properties for infinite-width neural networks

The goal of this section is to construct and analyze dual certificates for infinite-width neural networks with ReLU activation, and to infer sparsity properties of the minimizers. From now on, we identify $Y=\mathbb{R}^n$ and consider data points $x^j\in\mathbb{R}^{d+1}$, $j=1,\dots,n$. We set $y_0:=(y_0^1,\dots,y_0^n)$ and $y_\zeta:=(y_\zeta^1,\dots,y_\zeta^n)$. We consider both the noiseless interpolation problem
$$\inf_{\mu\in\mathcal{M}(\mathbb{S}^{d-1})} \|\mu\|_{TV} \quad \text{subject to } F_\mu(x^j)=y_0^j \ \forall j \quad (P_0(y_0))$$
and the TV-regularized empirical loss with perturbed labels $y_\zeta^j=y_0^j+\zeta^j$, where $\zeta=(\zeta^1,\dots,\zeta^n)\in\mathbb{R}^n$,
$$\inf_{\mu\in\mathcal{M}(\mathbb{S}^{d-1})} \frac12\sum_{j=1}^n|F_\mu(x^j)-y_\zeta^j|^2+\lambda\|\mu\|_{TV}. \quad (P_\lambda(y_\zeta))$$
We recall that $\|\cdot\|_{TV}$ denotes the total variation norm, defined in (1.5). Note that by defining the linear operator $K:\mathcal{M}(\mathbb{S}^{d-1})\to\mathbb{R}^n$ as
$$(K\mu)_j:=F_\mu(x^j)=\int_{\mathbb{S}^{d-1}}\sigma(\langle w,x^j\rangle)\,d\mu(w), \qquad j=1,\dots,n, \quad (3.1)$$
where $\sigma(z)=\mathrm{ReLU}(z)$, the problem $(P_0(y_0))$ takes the form of a general hard-constrained TV-regularized problem
$$\inf_{\mu\in\mathcal{M}(\mathbb{S}^{d-1}):\,K\mu=y_0} \|\mu\|_{TV} \quad (3.2)$$
and $(P_\lambda(y_\zeta))$ takes the form of a general Tikhonov-type regularized problem
$$\inf_{\mu\in\mathcal{M}(\mathbb{S}^{d-1})} \frac12\|K\mu-y_\zeta\|_{\mathbb{R}^n}^2+\lambda\|\mu\|_{TV}, \quad (3.3)$$
allowing us to use the dual certificate theory developed in Section 2.1, provided $K$ is weak*-to-strong continuous. This is verified in the next easy lemma.

Lemma 3.1. The operator $K:\mathcal{M}(\mathbb{S}^{d-1})\to\mathbb{R}^n$ is weak*-to-strong continuous. Moreover, its adjoint $K^*:\mathbb{R}^n\to C(\mathbb{S}^{d-1})$ is given by
$$K^*p(w)=\sum_{j=1}^n p^j\,\sigma(\langle w,x^j\rangle) \qquad \forall p\in\mathbb{R}^n. \quad (3.4)$$
Proof. The weak* continuity of $K$ is immediate. In particular, since the codomain is finite-dimensional, $K$ is weak*-to-strong continuous.
To prove the characterization of $K^*$, it is enough to note that, for all $\mu\in\mathcal{M}(\mathbb{S}^{d-1})$, it holds that
$$\int_{\mathbb{S}^{d-1}} K^*p\,d\mu(w)=\langle K^*p,\mu\rangle=(p,K\mu)_{\mathbb{R}^n}=\sum_{j=1}^n p^j (K\mu)_j=\sum_{j=1}^n p^j\int_{\mathbb{S}^{d-1}}\sigma(\langle w,x^j\rangle)\,d\mu(w)=\int_{\mathbb{S}^{d-1}}\sum_{j=1}^n p^j\,\sigma(\langle w,x^j\rangle)\,d\mu(w). \quad (3.5)$$
∎

We now specialize, for the sake of clarity in the rest of the paper, the abstract dual problems from the previous section to the present ReLU setting. In particular, using (3.4), the dual of the interpolation problem $(P_0(y_0))$ becomes
$$\sup_{p\in\mathbb{R}^n} \langle p,y_0\rangle \quad \text{subject to} \quad \sup_{w\in\mathbb{S}^{d-1}}\Big|\sum_{j=1}^n p^j\,\sigma(\langle w,x^j\rangle)\Big|\leqslant 1. \quad (D_0(y_0))$$
Any maximizer $p_0$ of $(D_0(y_0))$ yields a dual certificate $\eta_0(w):=K^*p_0(w)=\sum_{j=1}^n p_0^j\,\sigma(\langle w,x^j\rangle)$. Instead, the dual of $(P_\lambda(y_\zeta))$ becomes
$$\sup_{p\in\mathbb{R}^n} \langle y_\zeta,p\rangle-\frac{\lambda}{2}\|p\|^2 \quad \text{subject to} \quad \sup_{w\in\mathbb{S}^{d-1}}\Big|\sum_{j=1}^n p^j\,\sigma(\langle w,x^j\rangle)\Big|\leqslant 1. \quad (D_\lambda(y_\zeta))$$
Given the maximizer $p_{\lambda,\zeta}$ of $(D_\lambda(y_\zeta))$, we define the dual certificate $\eta_{\lambda,\zeta}(w):=K^*p_{\lambda,\zeta}(w)=\sum_{j=1}^n p_{\lambda,\zeta}^j\,\sigma(\langle w,x^j\rangle)$. Since $Y=\mathbb{R}^n$, the dual problem $(D_\lambda(y_\zeta))$ admits a (unique) maximizer for every $\lambda>0$. For $(D_0(y_0))$, maximizers may not be unique; we consider the minimal-norm one when needed. Moreover, by the following standard result (see for example [20, Proposition 1]), the dual certificates converge uniformly in the noiseless regime as the regularization parameter vanishes.

Proposition 3.2 (Convergence of the dual certificates). Let $p_{\lambda,0}$ be the unique solution to $(D_\lambda(y_0))$ and let $p_0$ be the minimal-norm solution of $(D_0(y_0))$. Then,
$$\lim_{\lambda\to 0^+}\|\eta_{\lambda,0}-\eta_0\|_\infty=0. \quad (3.6)$$
Proof. By definition and (3.4), for every $w\in\mathbb{S}^{d-1}$ we have
$$\eta_{\lambda,0}(w)-\eta_0(w)=\sum_{j=1}^n \left(p_{\lambda,0}^j-p_0^j\right)\sigma(\langle w,x^j\rangle). \quad (3.7)$$
Hence,
$$\|\eta_{\lambda,0}-\eta_0\|_\infty=\sup_{w\in\mathbb{S}^{d-1}}\Big|\sum_{j=1}^n \left(p_{\lambda,0}^j-p_0^j\right)\sigma(\langle w,x^j\rangle)\Big|\leqslant\sum_{j=1}^n |p_{\lambda,0}^j-p_0^j|\,\sup_{w\in\mathbb{S}^{d-1}}\sigma(\langle w,x^j\rangle).$$
Since $\sigma(z)=\max\{0,z\}$ and $\sup_{\|w\|=1}\langle w,x^j\rangle=\|x^j\|$, we obtain $\sup_{w\in\mathbb{S}^{d-1}}\sigma(\langle w,x^j\rangle)=\|x^j\|$. If we let $C:=\max_{1\leqslant j\leqslant n}\|x^j\|<\infty$, then
$$\|\eta_{\lambda,0}-\eta_0\|_\infty\leqslant C\sum_{j=1}^n|p_{\lambda,0}^j-p_0^j|=C\,\|p_{\lambda,0}-p_0\|_1. \quad (3.8)$$
By [20, Proposition 1], $p_{\lambda,0}\to p_0$ in $\mathbb{R}^n$, implying $\|p_{\lambda,0}-p_0\|_1\to 0$ as $\lambda\to 0$. Therefore, $\lim_{\lambda\to 0^+}\|\eta_{\lambda,0}-\eta_0\|_\infty=0$, as desired. ∎

3.1 Dual regions

Given vectors $\{x^1,\dots,x^n\}\subset\mathbb{R}^d$ (assume $x^i\neq 0$), we consider the hyperplane arrangement
$$H_i:=\{w\in\mathbb{R}^d : \langle w,x^i\rangle=0\}, \qquad i=1,\dots,n,$$
together with the associated open half-spaces
$$A_i^1:=\{w\in\mathbb{R}^d : \langle w,x^i\rangle>0\}, \qquad A_i^0:=\{w\in\mathbb{R}^d : \langle w,x^i\rangle<0\}.$$
A binary vector $\pi=(\pi_1,\dots,\pi_n)\in\{0,1\}^n$ prescribes, for each $i$, a sign pattern for the scalars $\langle w,x^i\rangle$ and thus determines the intersection
$$R_\pi:=\bigcap_{i=1}^n A_i^{\pi_i}. \quad (3.10)$$
We call the (possibly empty) open subset $R_\pi\subset\mathbb{R}^d$ a dual region. The term "dual" emphasizes that these regions live in the weight space (the $w$-variable), in contrast with the more common "linear region" decompositions in input space (the $x$-variable). Intersections of this type are well known in the literature on neural network analysis. We refer, for example, to the "sectors" of [30, 15], the linear regions studied in [31], and the activation regions introduced in [28]. For instance, see Figure 1, which depicts the spherical dual regions $R_\pi\cap\mathbb{S}^1$ for $d=n=2$.

Definition 3.3 (Dual region). Given $\pi=(\pi_1,\dots,\pi_n)\in\{0,1\}^n$, the dual region associated with $\pi$ is
$$R_\pi:=\bigcap_{i=1}^n A_i^{\pi_i}. \quad (3.11)$$

Figure 1: Dual regions on $\mathbb{S}^1$ induced by two inputs ($d=n=2$): the hyperplanes $H_1$, $H_2$ determined by $x^1$, $x^2$ cut the circle into the four regions $R_{(1,1)}$, $R_{(1,0)}$, $R_{(0,1)}$, $R_{(0,0)}$.

It is well known from classical results on linearly separable dichotomies and central hyperplane arrangements (Vapnik–Chervonenkis dimension theory) that the number of distinct binary vectors $\pi$ such that the dual region $R_\pi$ is non-empty is at most $2\sum_{k=0}^{d-1}\binom{n-1}{k}$; equality holds when the hyperplanes $\{H_i\}_{i=1}^n$ are in general position (see, e.g., [13, Theorem 1]). In particular, thanks to the binomial theorem, there are at most $2^n$ possible combinations (independently of $d$) if and only if $n\leqslant d$. If instead $n>d$, one can bound the number of combinations by $O(n^d)$ (refer to [28, Proposition 6], [30, Chapter 4]). In what follows, we denote by $\Theta$ the set of binary vectors $\pi$ such that $R_\pi$ is non-empty. Our analysis will necessarily deal with the boundaries of dual regions. To capture optimality on these sets, we introduce the concept of higher-codimension dual regions, obtained by allowing some of the defining inequalities to become equalities. This yields a natural stratification of the weight space into open cells of varying codimension, which will be crucial for our study. We also remark that the need for such structures has already been noted in [28, 30]. Consider the index set $I:=\{1,\dots,n\}$ and, for $k\leqslant n$, let $J^k=\{j_1,\dots,j_k\}\subset I$. We aim to define regions of codimension $n-k$ that belong to the topological boundary of a given region $R_\pi$.

Definition 3.4 (Higher-codimension dual regions). Let $\pi\in\{0,1\}^n$ and let $R_\pi$ be as in Definition 3.3. For $J^k\subset I$ with $|J^k|=k$, we define the $J^k$-stratum of $R_\pi$ as
$$R_\pi^{J^k}:=\Big(\bigcap_{j\in J^k} A_j^{\pi_j}\Big)\cap\Big(\bigcap_{z\in I\setminus J^k} H_z\Big), \quad (3.12)$$
where $H_z:=\{w\in\mathbb{R}^d : \langle w,x^z\rangle=0\}$. By convention, $R_\pi^I=R_\pi$.
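The counting bound $2\sum_{k=0}^{d-1}\binom{n-1}{k}$ for non-empty dual regions can be checked empirically in the plane, where for $n$ generic lines through the origin it evaluates to $2n$. The sketch below (an illustrative construction, not from the paper) enumerates realized sign patterns by sweeping $\mathbb{S}^1$ with a fine angular grid:

```python
import numpy as np
from math import comb

# For n generic hyperplanes through the origin in R^2, the number of non-empty
# dual regions equals 2 * sum_{k=0}^{d-1} C(n-1, k) = 2n. Enumerate the sign
# patterns realized on S^1 via a dense, boundary-avoiding angular grid.
rng = np.random.default_rng(3)
d, n = 2, 5
X = rng.standard_normal((n, d))                       # generic normals x^i
theta = (np.arange(100_000) + 0.5) * (2 * np.pi / 100_000)
W = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # sweep of S^1
S = np.sign(W @ X.T)                                  # sign pattern at each angle
mask = np.all(S != 0, axis=1)                         # drop (measure-zero) boundary hits
patterns = set(map(tuple, S[mask]))

bound = 2 * sum(comb(n - 1, k) for k in range(d))
assert len(patterns) == bound == 2 * n
```

With $n\leqslant d$ the same enumeration would realize all $2^n$ patterns, matching the dichotomy in the text.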
The set $R_\pi^{J^k}$ lies in $L^{J^k}:=\bigcap_{z\in I\setminus J^k} H_z$, which has codimension at most $|I\setminus J^k|=n-k$ in $\mathbb{R}^d$ (and exactly $n-k$ if the normals $\{x^z\}_{z\in I\setminus J^k}$ are linearly independent). Similarly to the codimension-zero case, we denote by
$$\Theta^{J^k}:=\{\pi\in\{0,1\}^n : R_\pi^{J^k}\neq\emptyset\} \quad (3.13)$$
the set of sign patterns for which the stratum $R_\pi^{J^k}$ is non-empty. For simplicity, we again call $R_\pi^{J^k}$ a dual region.

Remark 3.5 (Boundary stratification). Let $\bar{R}_\pi^{J^k}$ denote the closure of $R_\pi^{J^k}$ in the relative topology of $L^{J^k}$. Here and below, $\partial$ denotes the boundary taken in the same relative topology. Note that the topological boundary of a region $R_\pi$ can simply be rewritten as
$$\partial R_\pi=\bigcup_{J^{n-1}\subset I} \bar{R}_\pi^{J^{n-1}}, \quad (3.14)$$
that is, the union of the closures of its codimension-$1$ faces; lower-dimensional strata are contained in these closures. More generally, for $k=1,\dots,n-2$, the topological boundary of $R_\pi^{J^k}$ becomes
$$\partial R_\pi^{J^k}=\bigcup_{J^{k-1}\subset J^k} \bar{R}_\pi^{J^{k-1}}, \quad (3.15)$$
where $|J^{k-1}|=k-1$.

3.2 Sparse characterization and upper bounds on the number of Dirac deltas

The derivation of an exact reconstruction result for the hard-constrained interpolation problem $(P_0(y_0))$ is based on the behaviour of the dual certificate on the dual regions determined by the input data $\{x^1,\dots,x^n\}$. For this, it is crucial that the activation function $\sigma$ is the ReLU. First, it is easy to see that, since $\sigma(z)=\mathrm{ReLU}(z)$, for every $i=1,\dots,n$ we can write
$$\sigma(\langle w,x^i\rangle)=\pi_i\,\langle w,x^i\rangle \qquad \text{for every } w\in R_\pi. \quad (3.16)$$
As a consequence, the minimal-norm dual certificate can be written as a linear function on each dual region $R_\pi$, as stated in the next lemma. When $\langle w,x^i\rangle=0$ we have $\sigma(\langle w,x^i\rangle)=0$; boundary cases will be treated via the strata $R_\pi^{J^k}$.

Lemma 3.6.
Given a binary vector $\pi$, for every $w\in R_\pi$, every dual certificate $\eta_0$ can be rewritten as
$$\eta_0(w)=\sum_{i=1}^n p_0^i\pi_i\langle w,x^i\rangle.\qquad (3.17)$$
Proof. The lemma follows directly from Lemma 3.1 and (3.16). ∎

In particular, every dual certificate $\eta_0\in C(S^{d-1})$ is the restriction to $S^{d-1}$ of a function that is linear on each cone $R_\pi$ and thus globally piecewise linear. This enables us to use the dual certificate theory summarized in Section 2.1 to obtain information about the support of the minimizers of $\mathcal{P}_0(y_0)$ and $\mathcal{P}_\lambda(y_\zeta)$. Indeed, in the next lemma, we use the piecewise linear structure of the dual certificate to show that the set of points where its absolute value achieves one is finite. Thanks to the optimality conditions in (2.4), this will imply that the support of the minimizers of $\mathcal{P}_0(y_0)$ is a discrete set (see Theorem 3.10).

Lemma 3.7. Let $p_0$ be a solution to $\mathcal{D}_0(y_0)$ and let $\eta_0=K^*p_0$ be a dual certificate. Fix $\pi\in\{0,1\}^n$, $J^k\subset I$ with $|J^k|=k$, and set
$$z_0:=\sum_{i\in J^k}p_0^i\pi_i x^i,\qquad \tilde z_0:=P_{L^{J^k}}z_0,$$
where $P_{L^{J^k}}$ denotes the orthogonal projection onto $L^{J^k}$. If there exists $w_0\in R_\pi^{J^k}\cap S^{d-1}$ such that $|\eta_0(w_0)|=1$, then $\tilde z_0\neq 0$ and
$$w_0=\mathrm{sign}(\eta_0(w_0))\,\frac{\tilde z_0}{\|\tilde z_0\|}.$$
Moreover, $|\eta_0(w)|<1$ for every $w\in(\overline{R}_\pi^{J^k}\cap S^{d-1})\setminus\{w_0\}$.

Proof. First, note that on $R_\pi^{J^k}\cap S^{d-1}$ the dual certificate $\eta_0$ can be written as
$$\eta_0(w)=\sum_{i\in J^k}p_0^i\pi_i\langle w,x^i\rangle=\langle z_0,w\rangle,\qquad z_0:=\sum_{i\in J^k}p_0^i\pi_i x^i.$$
Since $w\in L^{J^k}$, we also have $\langle z_0,w\rangle=\langle P_{L^{J^k}}z_0,w\rangle=\langle\tilde z_0,w\rangle$, where $\tilde z_0:=P_{L^{J^k}}z_0$. Assume there exists $w_0\in R_\pi^{J^k}\cap S^{d-1}$ such that $|\eta_0(w_0)|=1$. Then $\tilde z_0\neq 0$ (otherwise $\eta_0\equiv 0$ on $L^{J^k}\cap S^{d-1}$).
By Cauchy–Schwarz, for every $w\in L^{J^k}\cap S^{d-1}$,
$$|\eta_0(w)|=|\langle\tilde z_0,w\rangle|\leqslant\|\tilde z_0\|,$$
with equality if and only if $w=\pm\tilde z_0/\|\tilde z_0\|$. Since $|\eta_0(w_0)|=1$, necessarily $\|\tilde z_0\|\geqslant 1$, and equality in Cauchy–Schwarz implies
$$w_0=\mathrm{sign}(\eta_0(w_0))\,\frac{\tilde z_0}{\|\tilde z_0\|}.$$
To prove uniqueness inside $\overline{R}_\pi^{J^k}\cap S^{d-1}$, suppose that $|\eta_0(\bar w)|=1$ for some $\bar w\in\overline{R}_\pi^{J^k}\cap S^{d-1}$. Then, by the same argument, $\bar w=\pm\tilde z_0/\|\tilde z_0\|$. The choice $\bar w=-w_0$ is incompatible with belonging to $\overline{R}_\pi^{J^k}$. Indeed, for any index $i\in J^k$ with $\pi_i=1$ we have $\langle w_0,x^i\rangle\geqslant 0$ on the closure, hence $\langle -w_0,x^i\rangle\leqslant 0$, which would force $-w_0$ to lie in the opposite half-space (or on the hyperplane), contradicting $\bar w\in\overline{R}_\pi^{J^k}$ unless all such scalar products vanish (which cannot occur if $|\eta_0(w_0)|=1$). Therefore $\bar w=w_0$, and the statement follows. ∎

Remark 3.8. Note that the previous lemma places constraints on the points where the absolute value of the dual certificate reaches one. In particular, under the assumptions of the previous lemma, it also holds that $|\eta_0(w)|<1$ for every $w\in\bigcup_{J^{k-1}\subset J^k}\overline{R}_\pi^{J^{k-1}}\cap S^{d-1}$. Moreover, if $w_0$ belongs to a region of codimension zero, that is if $k=n$, then $\tilde z_0=z_0$.

A similar statement holds for the dual certificate associated with $\mathcal{P}_\lambda(y_\zeta)$.

Lemma 3.9. Let $p_{\lambda,\zeta}\in\mathbb{R}^n$ be the solution to $\mathcal{D}_\lambda(y_\zeta)$ and let $\eta_{\lambda,\zeta}=K^*p_{\lambda,\zeta}$ be the dual certificate for $\mathcal{P}_\lambda(y_\zeta)$. Fix $\pi\in\{0,1\}^n$, $J^k\subset I$ with $|J^k|=k$, and set
$$z_{\lambda,\zeta}=\sum_{i\in J^k}p_{\lambda,\zeta}^i\pi_i x^i,\qquad \tilde z_{\lambda,\zeta}=P_{L^{J^k}}z_{\lambda,\zeta}.\qquad (3.18)$$
If there exists $w_{\lambda,\zeta}\in R_\pi^{J^k}\cap S^{d-1}$ such that $|\eta_{\lambda,\zeta}(w_{\lambda,\zeta})|=1$, then $\tilde z_{\lambda,\zeta}\neq 0$ and
$$w_{\lambda,\zeta}=\mathrm{sign}(\eta_{\lambda,\zeta}(w_{\lambda,\zeta}))\,\frac{\tilde z_{\lambda,\zeta}}{\|\tilde z_{\lambda,\zeta}\|}.$$
(3.19)

Moreover, $|\eta_{\lambda,\zeta}(w)|<1$ for every $w\in(\overline{R}_\pi^{J^k}\cap S^{d-1})\setminus\{w_{\lambda,\zeta}\}$.

The announced sparse characterization of the minimizers of $\mathcal{P}_0(y_0)$ and $\mathcal{P}_\lambda(y_\zeta)$ follows directly from the previous lemmas.

Theorem 3.10 (Sparsity of minimizers). The following two statements hold:
• There exists $N\in\mathbb{N}$ such that all the solutions to $\mathcal{P}_0(y_0)$ are of the form
$$\mu_0=\sum_{i=1}^N c_0^i\delta_{w_0^i},\qquad (3.20)$$
where $c_0^i\in\mathbb{R}$ and $w_0^i=\mathrm{sign}(\eta_0(w_0^i))\,\tilde z_0/\|\tilde z_0\|$ with $\tilde z_0=P_{L^{J^k}}\sum_{i\in J^k}p_0^i\pi_i x^i$, for some binary vector $\pi$ and $J^k\subset I$.
• There exists $M\in\mathbb{N}$ such that all the solutions to $\mathcal{P}_\lambda(y_\zeta)$ are of the form
$$\mu_{\lambda,\zeta}=\sum_{i=1}^M c_{\lambda,\zeta}^i\delta_{w_{\lambda,\zeta}^i},\qquad (3.21)$$
where $c_{\lambda,\zeta}^i\in\mathbb{R}$ and $w_{\lambda,\zeta}^i=\mathrm{sign}(\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^i))\,\tilde z_{\lambda,\zeta}/\|\tilde z_{\lambda,\zeta}\|$ with $\tilde z_{\lambda,\zeta}=P_{L^{J^k}}\sum_{i\in J^k}p_{\lambda,\zeta}^i\pi_i x^i$, for some binary vector $\pi$ and $J^k\subset I$.

Proof. The proof follows from Lemma 3.7 and Lemma 3.9 together with the optimality conditions for $\mathcal{P}_0(y_0)$ and $\mathcal{P}_\lambda(y_\zeta)$ summarized in Lemma 2.3. ∎

From a rough estimate one can upper bound both $N$ and $M$ (we denote by $N_{\max}$ a common upper bound depending only on $n$ and $d$) by
$$N_{\max}\leqslant 3^n.\qquad (3.22)$$
This is obtained by counting all possible ways to choose the sign of each scalar product $\langle w,x^i\rangle$ for $i=1,\dots,n$ ($\langle w,x^i\rangle>0$, $\langle w,x^i\rangle<0$ or $\langle w,x^i\rangle=0$, the last in correspondence with regions of higher codimension); that is, we are counting the ternary activation patterns over $i\in\{1,\dots,n\}$. However, finer estimates are possible, as described in the remaining part of this subsection.

Lemma 3.11. The absolute value of the dual certificate $\eta_0$ (resp. $\eta_{\lambda,\zeta}$) for $\mathcal{P}_0(y_0)$ (resp.
$\mathcal{P}_\lambda(y_\zeta)$) achieves its maximum in at most
$$\max_{0\leqslant k\leqslant n}\binom{n}{k}\,\Xi(n-k,d-k)\qquad (3.23)$$
points, where $\Xi(n-k,d-k)$ denotes the number of non-empty regions induced in $\mathbb{R}^{d-k}$ by an arrangement of $n-k$ hyperplanes, with the convention that $\Xi(n-k,d-k)=0$ when $d-k<0$.

Proof. The proof of the statement is the same for the interpolation problem $\mathcal{P}_0(y_0)$ and for the training problem $\mathcal{P}_\lambda(y_\zeta)$. It is based on the observation that if $|\eta|$ is maximized at some point of a given stratum $R_\pi^{J^k}$ for $J^k\subset I$, then it cannot be maximized at any other point of the lower-dimensional strata $\overline{R}_\pi^{J^{k-1}}$ for $J^{k-1}\subset J^k$. Moreover, distinct strata correspond to disjoint ternary patterns, so maximizers can be counted by enumerating all the strata. In particular, there are $\binom{n}{k}$ ways of choosing $k$ vectors $x^i$ for which $\langle x^i,w\rangle=0$. The intersection of the corresponding hyperplanes determines a subspace of dimension $d-k$. Within this subspace, the remaining $n-k$ hyperplanes induce at most $\Xi(n-k,d-k)$ non-empty regions, which yields the second factor. ∎

Remark 3.12. The previous lemma gives a sharper estimate than the rough bound in (3.22). Indeed, since $\Xi(n-k,d-k)\leqslant 2^{n-k}$, by the binomial theorem we obtain
$$\max_{0\leqslant k\leqslant n}\binom{n}{k}\,\Xi(n-k,d-k)\leqslant\max_{0\leqslant k\leqslant n}\binom{n}{k}2^{n-k}\leqslant\sum_{k=0}^n\binom{n}{k}2^{n-k}=3^n.\qquad (3.24)$$
Therefore, if $n\leqslant d$ the best bound we can get is
$$\max_{0\leqslant k\leqslant n}\binom{n}{k}2^{n-k},\qquad (3.25)$$
which can be estimated grossly by $3^n$. Even if slightly more precise estimates can be obtained, we will not enter into these details here. Instead, if $n>d$ it holds that
$$\Xi(n-k,d-k)\leqslant O\big((n-k)^{d-k}\big).$$
(3.26)

Hence, taking into account that $\Xi(n-k,d-k)=0$ when $d-k<0$, it holds that
$$\max_{0\leqslant k\leqslant n}\binom{n}{k}\,\Xi(n-k,d-k)\leqslant\max_{0\leqslant k\leqslant n}\binom{n}{k}\,O\big((n-k)^{d-k}\big)=\max_{0\leqslant k\leqslant n}\binom{n}{n-k}\,O\big(k^{d-(n-k)}\big)\qquad (3.27)$$
$$\leqslant\max_{0\leqslant k\leqslant n}\binom{n}{n-k}\,O\big(n^{d-(n-k)}\big)\leqslant\frac{n^{n-k}}{(n-k)!}\,O\big(n^{d-(n-k)}\big)\leqslant O(n^d).\qquad (3.28)$$
Note that this bound, in the case $n>d$, is actually equal to $O(\Xi(n,d))$, that is, proportional to the number of dual regions. This yields a sharper estimate than the one provided in [15].

3.3 Linear independence and uniqueness of the minimizers

In the previous subsection we proved that both problems $\mathcal{P}_0(y_0)$ and $\mathcal{P}_\lambda(y_\zeta)$ admit sparse minimizers, and we provided an upper bound on the number of Dirac deltas in any such representation. We now address uniqueness. As is common in the literature on TV-regularized optimization problems in the space of measures, uniqueness follows from a combination of (i) a localization property of the dual certificate and (ii) a linear independence condition on the measurement vectors $\{K\delta_{w^i}\}$. We provide sufficient conditions ensuring (ii) in our ReLU setting. We caution that our conditions are likely not sharp and that a complete characterization lies beyond the scope of this work. We begin by recalling several definitions from linear algebra.

Definition 3.13 (Hadamard product). The Hadamard or entry-wise product of two matrices $M$ and $N$ of the same dimension is defined as $M\odot N:=(M_{ij}\cdot N_{ij})$.

We will also need the definition of transversal given, for example, in [10].

Definition 3.14 (Transversal). A transversal of an $m\times n$ matrix is a set of $m$ entries, one from each row and no two from the same column.
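Returning briefly to the counting estimates of Lemma 3.11 and Remark 3.12, they are easy to tabulate numerically. The sketch below assumes general position, so that $\Xi(m,q)=2\sum_{j=0}^{q-1}\binom{m-1}{j}$ for $m\geqslant 1$ (with $\Xi(0,q)=1$ for $q\geqslant 0$ and $\Xi(m,q)=0$ for $q<0$); the test cases are illustrative, not taken from the paper.

```python
import math

# Evaluate the bound max_k C(n, k) * Xi(n-k, d-k) of (3.23) under a
# general-position assumption, and compare with the rough ternary bound 3^n.
def xi(m, q):
    if q < 0:
        return 0
    if m == 0:
        return 1
    return 2 * sum(math.comb(m - 1, j) for j in range(q))

def sharp_bound(n, d):
    return max(math.comb(n, k) * xi(n - k, d - k) for k in range(n + 1))

for n, d in [(4, 2), (6, 3), (8, 2)]:
    print(n, d, sharp_bound(n, d), 3 ** n)
```

Already for these small cases the maximum over $k$ is far below $3^n$, consistent with the $O(n^d)$ behaviour discussed in Remark 3.12 for $n>d$.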
We also recall that the determinant of a square matrix $M=(m_{ij})\in\mathbb{R}^{n\times n}$ can be computed as
$$\det(M):=\sum_{p\in\mathcal{S}_n}\Big(\mathrm{sign}(p)\prod_{i=1}^n m_{i,p(i)}\Big),\qquad (3.29)$$
where the sum is over all permutations $p\in\mathcal{S}_n$ of $n$ elements.

Definition 3.15 (Permanent). Given a square matrix $M=(m_{ij})\in\mathbb{R}^{n\times n}$, the permanent of $M$ is defined as
$$\mathrm{perm}(M):=\sum_{p\in\mathcal{S}_n}\prod_{i=1}^n m_{i,p(i)}.$$

Let $\mu_0=\sum_{i=1}^N c_0^i\delta_{w_0^i}$ with $c_0^i\in\mathbb{R}$ and $w_0^i\in R_{\pi^i}$ for $i=1,\dots,N$, and let $w_0:=(w_0^1,\dots,w_0^N)$. We define the $N\times n$ evaluation matrix
$$K(w_0):=\begin{bmatrix}\sigma(\langle w_0^1,x^1\rangle)&\cdots&\sigma(\langle w_0^1,x^n\rangle)\\\vdots&\ddots&\vdots\\\sigma(\langle w_0^N,x^1\rangle)&\cdots&\sigma(\langle w_0^N,x^n\rangle)\end{bmatrix}=\begin{bmatrix}\pi_1^1\langle w_0^1,x^1\rangle&\cdots&\pi_n^1\langle w_0^1,x^n\rangle\\\vdots&\ddots&\vdots\\\pi_1^N\langle w_0^N,x^1\rangle&\cdots&\pi_n^N\langle w_0^N,x^n\rangle\end{bmatrix},\qquad (3.30)$$
where $\pi_j^i:=\mathbb{1}_{\langle w_0^i,x^j\rangle>0}$, so that $\sigma(\langle w_0^i,x^j\rangle)=\pi_j^i\langle w_0^i,x^j\rangle$. Its $i$-th row is exactly the measurement vector $(K\delta_{w_0^i})^\top\in\mathbb{R}^{1\times n}$. Moreover, if we denote
$$\Pi:=\begin{bmatrix}\pi_1^1&\cdots&\pi_n^1\\\vdots&\ddots&\vdots\\\pi_1^N&\cdots&\pi_n^N\end{bmatrix},\qquad X:=\begin{bmatrix}x_1^1&\cdots&x_1^n\\\vdots&\ddots&\vdots\\x_d^1&\cdots&x_d^n\end{bmatrix},\qquad W:=\begin{bmatrix}w_{0,1}^1&\cdots&w_{0,1}^N\\\vdots&\ddots&\vdots\\w_{0,d}^1&\cdots&w_{0,d}^N\end{bmatrix},\qquad (3.31)$$
where $w_0^i=(w_{0,1}^i,\dots,w_{0,d}^i)$ and $x^i=(x_1^i,\dots,x_d^i)$, then we can rewrite $K(w_0)=\Pi\odot W^\top X$.

In the following lemma we prove that $\{K\delta_{w_0^i}\}_{i=1}^N$ are linearly independent under certain assumptions on $\Pi$ and supposing that $N\leqslant n$. Note that the latter condition is necessary, since otherwise the vectors $\{K\delta_{w_0^i}\}_{i=1}^N$ cannot be linearly independent.

Lemma 3.16. Let $\mu_0=\sum_{i=1}^N c_0^i\delta_{w_0^i}$ with $c_0^i\in\mathbb{R}$ and $w_0^i\in R_{\pi^i}$, $N\leqslant n$.
Suppose that the following conditions are satisfied:
(i) some $N\times N$ minor of $\Pi$, called $\Pi_N$, has full rank;
(ii) $\mathrm{perm}(\Pi_N)=|\det(\Pi_N)|$.
Then, the vectors $\{K\delta_{w_0^i}\}_{i=1}^N$ are linearly independent.

Proof. By the definition of $K$ we have to prove that $\{\sigma(\langle w_0^i,x^1\rangle),\dots,\sigma(\langle w_0^i,x^n\rangle)\}_{i=1}^N$ are linearly independent. We will equivalently prove that $K(w_0)$ has rank $N$, that is, it has an $N\times N$ minor with nonzero determinant. Let $\Pi_N$ be an $N\times N$ minor of $\Pi$, which is nonsingular by assumption (i), and let $(W^\top X)^N$ be the corresponding $N\times N$ minor of $W^\top X$ (same rows and columns). Set
$$B:=(W^\top X)^N,\qquad A:=\Pi_N\in\{0,1\}^{N\times N}.\qquad (3.32)$$
Then the corresponding minor of $K(w_0)=\Pi\odot(W^\top X)$ is $A\odot B$ and
$$\det(A\odot B)=\sum_{p\in\mathcal{S}_N}\mathrm{sign}(p)\prod_{i=1}^N A_{i,p(i)}\,B_{i,p(i)}.\qquad (3.33)$$
By assumption (i), there exists at least one permutation $\bar p$ such that $\prod_i A_{i,\bar p(i)}=1$ (a transversal of ones). Moreover, by assumption (ii), or equivalently $\mathrm{perm}(A)=|\det(A)|$, all permutations $p$ with $\prod_i A_{i,p(i)}=1$ have the same parity, hence the same sign $\mathrm{sign}(p)=s\in\{\pm 1\}$. For those $p$, we also have $B_{i,p(i)}=\langle w_0^i,x^{p(i)}\rangle>0$, since $A_{i,p(i)}=1$ means $\langle w_0^i,x^{p(i)}\rangle>0$ in the associated dual region. Therefore, every nonzero term in the expansion of $\det(A\odot B)$ has the same sign $s$ and is strictly positive in magnitude; in particular, the term corresponding to $\bar p$ is nonzero because $\prod_i A_{i,\bar p(i)}=1$. Hence $\det(A\odot B)\neq 0$, implying that $K(w_0)$ has rank $N$. ∎

Remark 3.17. Note that in the proof we used that the second assumption, namely $\mathrm{perm}(\Pi_N)=|\det(\Pi_N)|$ for $0$-$1$ matrices, is equivalent to all transversals of ones of $\Pi_N$, i.e. all permutations $p$ with $\prod_{i=1}^N(\Pi_N)_{i,p(i)}=1$, having the same parity.
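A minimal numerical sketch of conditions (i)–(ii): the $0$-$1$ minor $A$ below is a hypothetical activation pattern (not taken from the paper), and $B$ stands for the corresponding minor of $W^\top X$, with positive entries wherever $A$ has a one. All transversals of ones of $A$ share the same parity, so $\mathrm{perm}(A)=|\det(A)|$ and the Hadamard minor $A\odot B$ stays nonsingular, as in the proof of Lemma 3.16.

```python
import itertools
import math

def parity(p):
    # sign of a permutation via its inversion count
    inv = sum(1 for i in range(len(p)) for j in range(i + 1, len(p))
              if p[i] > p[j])
    return -1 if inv % 2 else 1

def det(M):
    n = len(M)
    return sum(parity(p) * math.prod(M[i][p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

def perm(M):
    n = len(M)
    return sum(math.prod(M[i][p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

A = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]                          # hypothetical Pi_N
B = [[2.0, 3.0, -1.0], [-4.0, 5.0, 7.0], [-2.0, -6.0, 11.0]]   # minor of W^T X
C = [[A[i][j] * B[i][j] for j in range(3)] for i in range(3)]  # A ⊙ B

print(perm(A), det(A), det(C))
```

Here the only transversal of ones of $A$ is the identity, so (i)–(ii) hold trivially and $\det(A\odot B)=2\cdot 5\cdot 11\neq 0$.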
The previous lemma provides simple scenarios in which the linear independence of $\{K\delta_{w_0^i}\}_{i=1}^N$ is ensured. For example, if $\Pi_N$ is a permutation matrix, then (i) and (ii) are automatically verified. To the best of our knowledge, these conditions have not been explored in the literature. Extending this line of results is left for future research.

We are now ready to examine uniqueness for the sparse solutions to $\mathcal{P}_0(y_0)$ and $\mathcal{P}_\lambda(y_\zeta)$. We first treat the interpolation problem $\mathcal{P}_0(y_0)$, since the analysis for the training problem $\mathcal{P}_\lambda(y_\zeta)$ follows the same strategy. Let $\mu_0=\sum_{i=1}^N\sigma^i c_0^i\delta_{w_0^i}$ with $c_0^i>0$, $\sigma^i\in\{-1,+1\}$ and $w_0^i\in S^{d-1}$ satisfy the hard constraint $K\mu_0=y_0$ of $\mathcal{P}_0(y_0)$. If $\mu_0$ is a minimizer, the dual certificate $\eta_0$ satisfies $\eta_0(w_0^i)=\sigma^i$ by Lemma 2.3. To conclude uniqueness, we require that $|\eta_0|$ attains its maximum $1$ only at the atoms $\{w_0^i\}_{i=1}^N$ and that the measurements $\{K\delta_{w_0^i}\}_{i=1}^N$ are linearly independent (ensured by (i)-(ii) in Lemma 3.16).

Theorem 3.18. Let $\mu_0=\sum_{i=1}^N c_0^i\sigma^i\delta_{w_0^i}$ be such that $y_0=K\mu_0$, with $c_0^i>0$, $\sigma^i\in\{-1,1\}$ and $w_0^i\in\overline{R}_{\pi^i}$ for all $i=1,\dots,N$. Let $\eta_0$ be the dual certificate for $\mu_0$, and suppose that $\{K\delta_{w_0^i}\}_{i=1}^N$ are linearly independent and

(LC) $|\eta_0(w)|<1$ for all $w\in S^{d-1}\setminus\{w_0^1,\dots,w_0^N\}$.

Then, $\mu_0$ is the unique solution to $\mathcal{P}_0(y_0)$.

Proof. The proof follows directly by applying [12, Lemma 4.5]. ∎

A verbatim statement holds for the training problem, with the dual certificate for $\mathcal{P}_\lambda(y_\zeta)$ playing the role of $\eta_0$.

Theorem 3.19. Let $\mu_{\lambda,\zeta}=\sum_{i=1}^N c_{\lambda,\zeta}^i\sigma^i\delta_{w_{\lambda,\zeta}^i}$, with $c_{\lambda,\zeta}^i>0$, $\sigma^i\in\{-1,1\}$ and $w_{\lambda,\zeta}^i\in\overline{R}_{\pi^i}$ for all $i=1,\dots,N$.
Let $\eta_{\lambda,\zeta}$ be the dual certificate for $\mu_{\lambda,\zeta}$, and suppose that $\{K\delta_{w_{\lambda,\zeta}^i}\}_{i=1}^N$ are linearly independent and

$(\widetilde{\mathrm{LC}})$ $|\eta_{\lambda,\zeta}(w)|<1$ for all $w\in S^{d-1}\setminus\{w_{\lambda,\zeta}^1,\dots,w_{\lambda,\zeta}^N\}$.

Then, $\mu_{\lambda,\zeta}$ is the unique solution to $\mathcal{P}_\lambda(y_\zeta)$.

Remark 3.20. The localization condition (LC), together with the optimality conditions, ensures that the dual certificate is maximized only at the Dirac delta locations $\{w_0^i\}_{i=1}^N$ (and similarly for $\eta_{\lambda,\zeta}$). By Lemmas 3.7 and 3.9, on each dual region $R_\pi^{J^k}\cap S^{d-1}$ the function $|\eta|$ has at most one point where it can possibly attain its maximum (when the region is non-empty). In particular, in the interior of a region there cannot be two distinct maximizers. Consequently, to check the localization condition it is not necessary to control $|\eta|$ everywhere on $S^{d-1}$: it suffices to inspect the finite collection of candidate maximizers and verify that $|\eta|=1$ only on the support of the sparse measure, while all the remaining candidates satisfy $|\eta|<1$. The size of this candidate set is finite and is bounded as in Lemma 3.11 and Remark 3.12.

4 Sparse stability for noisy labels under regularization

In this section, we aim to show that for a sufficiently small regularization parameter $\lambda$ and in a low noise regime, every solution to $\mathcal{P}_\lambda(y_\zeta)$ has the same number of Dirac deltas as the (unique) solution to $\mathcal{P}_0(y_0)$. Moreover, the locations and weights are continuous perturbations of those of $\mu_0$ as $(\lambda,\zeta)\to(0,0)$ within the admissible set $N_{\alpha,\lambda_0}$ defined below. Such results are typically referred to as exact support recovery [20]. We consider the following set of admissible parameters and noise levels for $\lambda_0>0$ and $\alpha>0$:
$$N_{\alpha,\lambda_0}=\{(\lambda,\zeta)\in\mathbb{R}_+\times\mathbb{R}^n:0<\lambda\leqslant\lambda_0\ \text{and}\ \|\zeta\|\leqslant\alpha\lambda\}.\qquad (4.1)$$
This set was introduced in [20] and describes the regime in which one expects exact support recovery.
Let us also define the extended support, as introduced in [20].

Definition 4.1 (Extended support [20]). Let $\mu_0\in\mathcal{M}(S^{d-1})$ be such that $y_0=K\mu_0$ and let $\eta_0\in C(S^{d-1})$ denote the minimal-norm dual certificate associated with $\mathcal{P}_0(y_0)$. The extended support of the measure $\mu_0$ is defined as
$$\mathrm{Ext}(\mu_0):=\{w\in S^{d-1}:|\eta_0(w)|=1\}.\qquad (4.2)$$
Note that $\mathrm{Supp}(\mu_0)\subset\mathrm{Ext}(\mu_0)$ by Lemma 2.3. Analogously, for $\mu_{\lambda,\zeta}$ with dual certificate $\eta_{\lambda,\zeta}$, set $\mathrm{Ext}(\mu_{\lambda,\zeta}):=\{w\in S^{d-1}:|\eta_{\lambda,\zeta}(w)|=1\}$.

We now recall a fundamental result, [20, Lemma 11], regarding the support of solutions to $\mathcal{P}_\lambda(y_\zeta)$ in the low noise regime and for small regularization parameters.

Lemma 4.2. Let $p_0$ be the minimal-norm solution to $\mathcal{D}_0(y_0)$, $p_{\lambda,\zeta}$ the solution to $\mathcal{D}_\lambda(y_\zeta)$ and $\mu_{\lambda,\zeta}\in\mathcal{M}(S^{d-1})$ any solution to $\mathcal{P}_\lambda(y_\zeta)$. Given $\varepsilon,\delta>0$, there exist $\alpha>0$, $\lambda_0>0$ such that, for all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, we have
$$\|p_{\lambda,\zeta}-p_0\|\leqslant\delta\quad\text{and}\quad\mathrm{supp}\,\mu_{\lambda,\zeta}\subset\mathrm{Ext}_\varepsilon(\mu_0),\qquad (4.3)$$
where $\mathrm{Ext}_\varepsilon(\mu_0):=\bigcup_{w\in\mathrm{Ext}(\mu_0)}B_\varepsilon(w)$.

Proof. The proof can be adapted straightforwardly from [20, Lemma 11] by taking into account Proposition 3.2. ∎

We now prove that the solutions to $\mathcal{P}_\lambda(y_\zeta)$ are sparse for small enough parameters and noise levels in the set $N_{\alpha,\lambda_0}$. In particular, the number of atoms equals the cardinality of the support of the measure $\mu_0$ and their locations are uniquely determined. Note that this statement does not imply uniqueness of the corresponding coefficients, and consequently of the representation; this problem will be addressed in the next section. Before stating the next proposition, we introduce an auxiliary set that describes the common boundary between two different regions.

Definition 4.3. Let $\pi\in\{0,1\}^n$ and let $J^k\subset I$. Define the compatible family of strata as
$$\mathcal{F}_\pi^{J^k}:=\{R_{\bar\pi}^{J}:R_{\bar\pi}^{J}\neq\emptyset,\ J^k\subset J,\ \text{and}\ \bar\pi_j=\pi_j\ \forall j\in J^k\}.\qquad (4.4)$$

As is typical in super-resolution problems, we need to impose a further non-degeneracy condition on the dual certificate, related to its strong convexity (resp. concavity) at the minimum (resp. maximum) points.

Assumption 4.4 (Boundary non-degeneracy condition). We define the following non-degeneracy condition on points that belong to the boundary of the dual regions:

(ND) Let $\mu_0=\sum_{i=1}^N c_0^i\sigma^i\delta_{w_0^i}$, with $c_0^i>0$, $\sigma^i\in\{-1,+1\}$, and assume $w_0^i\in R_\pi^{J^k}\cap S^{d-1}$ for some $k<n$. Given $p_0$ the minimal-norm solution of $\mathcal{D}_0(y_0)$, for any $R_{\bar\pi}^J\in\mathcal{F}_\pi^{J^k}$ define
$$z_0(J,\bar\pi):=\sum_{j\in J}p_0^j\,\bar\pi_j\,x^j,\qquad\tilde z_0(J,\bar\pi):=P_J\,z_0(J,\bar\pi).\qquad (4.5)$$
We say that $\mu_0$ satisfies (ND) if for any two distinct strata $R_{\pi^1}^{J_1},R_{\pi^2}^{J_2}\in\mathcal{F}_\pi^{J^k}$ such that $\tilde z_0(J_\ell,\pi^\ell)\neq 0$, it holds that
$$\frac{\tilde z_0(J_1,\pi^1)}{\|\tilde z_0(J_1,\pi^1)\|}\neq\frac{\tilde z_0(J_2,\pi^2)}{\|\tilde z_0(J_2,\pi^2)\|}.\qquad (4.6)$$

We are now ready to state and prove the main result of this section.

Proposition 4.5. Let $N\leqslant n$. Assume that $\mu_0=\sum_{i=1}^N c_0^i\sigma^i\delta_{w_0^i}$, where $c_0^i>0$, $\sigma^i\in\{-1,+1\}$ and $w_0^i\in\overline{R}_{\pi^i}^{J_i}$ for all $i=1,\dots,N$, satisfies (LC) and (ND). Let $\{K\delta_{w_0^i}\}_{i=1}^N$ be linearly independent and let $\mu_{\lambda,\zeta}\in\mathcal{M}(S^{d-1})$ be any solution to $\mathcal{P}_\lambda(y_\zeta)$. Then, there exist $\alpha>0$, $\lambda_0>0$ such that, for all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, there exists a unique collection of $w_{\lambda,\zeta}^i\in S^{d-1}$ such that
$$\mu_{\lambda,\zeta}=\sum_{i=1}^N c_{\lambda,\zeta}^i\sigma^i\delta_{w_{\lambda,\zeta}^i},\qquad (4.7)$$
where $c_{\lambda,\zeta}^i>0$ and $\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^i)=\sigma^i$ for every $i=1,\dots,N$. Moreover, as $\lambda\to 0$ with $\|\zeta\|\leqslant\alpha\lambda$, it holds that
$$w_{\lambda,\zeta}^i\to w_0^i,\qquad c_{\lambda,\zeta}^i\to c_0^i.\qquad (4.8)$$

Proof. Without loss of generality, assume that $\sigma^i=1$ for all $i$.
The argument is local around each $w_0^i$, and the case $\sigma^i=-1$ is handled in the same way by replacing $\eta$ with $-\eta$ on the corresponding subset of indices. By (LC), $|\eta_0(w)|<1$ for every $w\neq w_0^i$, hence, for sufficiently small $\varepsilon_0>0$,
$$\mathrm{Ext}(\mu_0)\cap B_{\varepsilon_0}(w_0^i)=\{w_0^i\}\quad\text{for every }i=1,\dots,N.\qquad (4.9)$$
Moreover, by Lemma 3.7, if $w_0^i\in R_{\pi^i}^{J_i}$ then
$$w_0^i=\frac{\tilde z_0^i}{\|\tilde z_0^i\|},\qquad\tilde z_0^i:=P_{J_i}\Big(\sum_{j\in J_i}p_0^j\pi_j^i x^j\Big),\qquad (4.10)$$
where $p_0$ is the dual maximizer in $\mathcal{D}_0(y_0)$. Fix $i$ and let $R_\pi^{J^k}=R_{\pi^i}^{J_i^k}$ be the $J^k$-stratum such that $w_0^i\in R_\pi^{J^k}$ with $k\leqslant n$. We want to prove that, for $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$ with $\alpha,\lambda_0$ sufficiently small, the set $\mathrm{Ext}(\mu_{\lambda,\zeta})\cap B_{\varepsilon_0}(w_0^i)$ contains at most one point. Indeed, by Lemma 4.2 the support of $\mu_{\lambda,\zeta}$ lies in $\bigcup_{m=1}^N B_{\varepsilon_0}(w_0^m)$ for $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$ with $\alpha,\lambda_0$ sufficiently small. Thus, showing that $\mathrm{Ext}(\mu_{\lambda,\zeta})\cap B_{\varepsilon_0}(w_0^i)$ has at most one point implies that $\mu_{\lambda,\zeta}$ has at most one atom in $B_{\varepsilon_0}(w_0^i)$. Note that, with this choice of $J^k$ and $\pi$, shrinking $\varepsilon_0$ if necessary, the ball $B_{\varepsilon_0}(w_0^i)$ can intersect only strata in the compatible family $\mathcal{F}_\pi^{J^k}$ defined in (4.4). Arguing as in Lemma 3.6, we can write the dual certificate $\eta_{\lambda,\zeta}$ on the dual region $R_{\bar\pi}^J\in\mathcal{F}_\pi^{J^k}$ as
$$\eta_{\lambda,\zeta}(w)=\sum_{j\in J}p_{\lambda,\zeta}^j\,\bar\pi_j\,\langle w,x^j\rangle,\qquad w\in R_{\bar\pi}^J,\qquad (4.11)$$
where $p_{\lambda,\zeta}\in\mathbb{R}^n$ is the solution to the dual problem $\mathcal{D}_\lambda(y_\zeta)$. Since $R_{\bar\pi}^J\subset L^J$, for $w\in R_{\bar\pi}^J\cap S^{d-1}$ we have
$$\eta_{\lambda,\zeta}(w)=\Big\langle P_J\Big(\sum_{j\in J}p_{\lambda,\zeta}^j\,\bar\pi_j\,x^j\Big),w\Big\rangle=\langle\tilde z_{\lambda,\zeta}(J,\bar\pi),w\rangle.\qquad (4.12)$$
Hence, $|\eta_{\lambda,\zeta}(w)|\leqslant\|\tilde z_{\lambda,\zeta}(J,\bar\pi)\|$ and equality can hold only if $w=\pm\tilde z_{\lambda,\zeta}(J,\bar\pi)/\|\tilde z_{\lambda,\zeta}(J,\bar\pi)\|$.
At most one of these two points belongs to the fixed sign region $R_{\bar\pi}^J$. Therefore, following the computation made in the proof of Lemma 3.7 for $\eta_0$, in each region $R_{\bar\pi}^J\cap S^{d-1}$ with $R_{\bar\pi}^J\in\mathcal{F}_\pi^{J^k}$, the set $\mathrm{Ext}(\mu_{\lambda,\zeta})$ contains at most one point, and whenever it is non-empty this point must be of the form
$$w_{\lambda,\zeta}(J,\bar\pi)=\frac{\tilde z_{\lambda,\zeta}(J,\bar\pi)}{\|\tilde z_{\lambda,\zeta}(J,\bar\pi)\|}.\qquad (4.13)$$
Now we want to prove that there exist $\alpha,\lambda_0$ sufficiently small such that, for all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, there exists at most one point $w\in B_{\varepsilon_0}(w_0^i)$ with $|\eta_{\lambda,\zeta}(w)|=1$. Assume by contradiction that this does not hold. Then, also recalling the first step of the proof, there exist a vanishing sequence $(\alpha_n,\lambda_{0,n})$, parameters $(\lambda,\zeta)\in N_{\alpha_n,\lambda_{0,n}}$ and two distinct regions $R_{\pi^1}^{J_1},R_{\pi^2}^{J_2}\in\mathcal{F}_\pi^{J^k}$ with points $w_1\in R_{\pi^1}^{J_1}$, $w_2\in R_{\pi^2}^{J_2}$ such that $w_1,w_2\in\mathrm{Ext}(\mu_{\lambda,\zeta})\cap B_{\varepsilon_0}(w_0^i)$ and $w_1\neq w_2$ (note that, to ease the notation, we omit the dependence on $n$ of $(\lambda,\zeta)$ and of the selected dual regions). Since $R_{\pi^1}^{J_1}$ and $R_{\pi^2}^{J_2}$ are assumed distinct, we have $(J_1,\pi^1)\neq(J_2,\pi^2)$. For $\ell\in\{1,2\}$ set
$$z_{\lambda,\zeta}^\ell:=\sum_{j\in J_\ell}p_{\lambda,\zeta}^j\,\pi_j^\ell\,x^j,\qquad\tilde z_{\lambda,\zeta}^\ell:=P_{J_\ell}\,z_{\lambda,\zeta}^\ell.\qquad (4.14)$$
Applying the previous step of the proof again and taking $\alpha_n$, $\lambda_{0,n}$ small enough, $\tilde z_{\lambda,\zeta}^\ell\neq 0$ and
$$w_\ell=\frac{\tilde z_{\lambda,\zeta}^\ell}{\|\tilde z_{\lambda,\zeta}^\ell\|},\qquad\ell=1,2.\qquad (4.15)$$
For $\ell\in\{1,2\}$ define the linear map $\tilde z^\ell(p):=P_{J_\ell}\big(\sum_{j\in J_\ell}p^j\,\pi_j^\ell\,x^j\big)$ and fix an arbitrary $0<\varepsilon<\varepsilon_0$. Since $\tilde z^\ell(p_0)\neq 0$, the map $p\mapsto\tilde z^\ell(p)/\|\tilde z^\ell(p)\|$ is continuous at $p_0$. Hence, there exists $\delta>0$ such that
$$\|p-p_0\|\leqslant\delta\implies\Big\|\frac{\tilde z^\ell(p)}{\|\tilde z^\ell(p)\|}-\frac{\tilde z^\ell(p_0)}{\|\tilde z^\ell(p_0)\|}\Big\|\leqslant\varepsilon,\qquad\ell=1,2.$$
(4.16)

Applying Lemma 4.2 with this $\delta$, for $n$ sufficiently big we have $\|p_{\lambda,\zeta}-p_0\|\leqslant\delta$. Therefore,
$$\Big\|\frac{\tilde z_{\lambda,\zeta}^\ell}{\|\tilde z_{\lambda,\zeta}^\ell\|}-\frac{\tilde z_0^\ell}{\|\tilde z_0^\ell\|}\Big\|\leqslant\varepsilon,\qquad\ell=1,2,\qquad (4.17)$$
where $\tilde z_0^\ell:=P_{J_\ell}\big(\sum_{j\in J_\ell}p_0^j\,\pi_j^\ell\,x^j\big)$. Moreover, again applying Lemma 4.2 with $n$ sufficiently big, it holds that $w_1,w_2\in B_\varepsilon(w_0^i)$ and thus $\|w_2-w_1\|\leqslant 2\varepsilon$. Therefore, we obtain
$$\Big\|\frac{\tilde z_0^1}{\|\tilde z_0^1\|}-\frac{\tilde z_0^2}{\|\tilde z_0^2\|}\Big\|\leqslant\Big\|\frac{\tilde z_0^1}{\|\tilde z_0^1\|}-\frac{\tilde z_{\lambda,\zeta}^1}{\|\tilde z_{\lambda,\zeta}^1\|}\Big\|+\Big\|\frac{\tilde z_{\lambda,\zeta}^1}{\|\tilde z_{\lambda,\zeta}^1\|}-\frac{\tilde z_{\lambda,\zeta}^2}{\|\tilde z_{\lambda,\zeta}^2\|}\Big\|+\Big\|\frac{\tilde z_{\lambda,\zeta}^2}{\|\tilde z_{\lambda,\zeta}^2\|}-\frac{\tilde z_0^2}{\|\tilde z_0^2\|}\Big\|\leqslant\varepsilon+\|w_1-w_2\|+\varepsilon\leqslant\varepsilon+2\varepsilon+\varepsilon=4\varepsilon.$$
Since (ND) states that the normalized vectors $\tilde z_0(J,\bar\pi)/\|\tilde z_0(J,\bar\pi)\|$ are pairwise distinct for distinct strata in $\mathcal{F}_\pi^{J^k}$, there exists
$$c:=\min_{\substack{R_{\pi^1}^{J_1},R_{\pi^2}^{J_2}\in\mathcal{F}_\pi^{J^k}\\(J_1,\pi^1)\neq(J_2,\pi^2)}}\Big\|\frac{\tilde z_0(J_1,\pi^1)}{\|\tilde z_0(J_1,\pi^1)\|}-\frac{\tilde z_0(J_2,\pi^2)}{\|\tilde z_0(J_2,\pi^2)\|}\Big\|>0.$$
Choosing $\varepsilon$ such that $\varepsilon<c/4$ yields a contradiction. Hence, for each $i$ the following holds: for $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$ with $\lambda_0,\alpha$ small enough, there exists at most one point $w\in B_{\varepsilon_0}(w_0^i)$ with $|\eta_{\lambda,\zeta}(w)|=1$. This also shows that $\mu_{\lambda,\zeta}$ admits the representation (4.7) with $c_{\lambda,\zeta}^i\geqslant 0$ and $\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^i)=1$ for all $i$. Indeed, by Lemma 4.2, the support of the measure $\mu_{\lambda,\zeta}$ is contained in $\mathrm{Ext}_{\varepsilon_0}(\mu_0)$. Moreover, optimality implies $\mathrm{supp}\,\mu_{\lambda,\zeta}\subset\mathrm{Ext}(\mu_{\lambda,\zeta})=\{w:|\eta_{\lambda,\zeta}(w)|=1\}$. Hence, $\mu_{\lambda,\zeta}$ is a finite sum of Dirac masses, with at most one atom in each ball, that is,
$$\mu_{\lambda,\zeta}=\sum_{i=1}^N c_{\lambda,\zeta}^i\delta_{w_{\lambda,\zeta}^i}.$$
(4.18)

For each $i$ either $\mathrm{supp}\,\mu_{\lambda,\zeta}\cap B_{\varepsilon_0}(w_0^i)=\emptyset$, or there exists a unique point $w_{\lambda,\zeta}^i\in B_{\varepsilon_0}(w_0^i)$ such that
$$\mathrm{supp}\,\mu_{\lambda,\zeta}\cap B_{\varepsilon_0}(w_0^i)=\mathrm{Ext}(\mu_{\lambda,\zeta})\cap B_{\varepsilon_0}(w_0^i)=\{w_{\lambda,\zeta}^i\}.\qquad (4.19)$$
In the latter case, $w_{\lambda,\zeta}^i\in\mathrm{Ext}(\mu_{\lambda,\zeta})$ and $\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^i)=1$. In particular, we may write (4.18), where $c_{\lambda,\zeta}^i\geqslant 0$ and $c_{\lambda,\zeta}^i>0$ if and only if $\mathrm{supp}\,\mu_{\lambda,\zeta}\cap B_{\varepsilon_0}(w_0^i)\neq\emptyset$. Thus, it remains to prove that $c_{\lambda,\zeta}^i>0$ for all $i$ when $(\lambda_0,\alpha)$ is sufficiently small. By the minimality of $\mu_{\lambda,\zeta}$ for $\mathcal{P}_\lambda(y_\zeta)$,
$$\lambda\|\mu_{\lambda,\zeta}\|_{TV}\leqslant\frac{1}{2}\|K\mu_{\lambda,\zeta}-y_0-\zeta\|^2+\lambda\|\mu_{\lambda,\zeta}\|_{TV}\leqslant\frac{\|\zeta\|^2}{2}+\lambda\|\mu_0\|_{TV},\qquad (4.20)$$
where in the second inequality we used that $K\mu_0-y_0=0$. Dividing by $\lambda>0$ and using $\|\zeta\|\leqslant\alpha\lambda$ for $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, we obtain the uniform bound
$$\|\mu_{\lambda,\zeta}\|_{TV}=\sum_{i=1}^N c_{\lambda,\zeta}^i\leqslant\|\mu_0\|_{TV}+\frac{\alpha^2\lambda_0}{2},\qquad (4.21)$$
and since $\|\mu_{\lambda,\zeta}\|_{TV}=\sum_{i=1}^N c_{\lambda,\zeta}^i$ is uniformly bounded, each coefficient $c_{\lambda,\zeta}^i$ is uniformly bounded on $N_{\alpha,\lambda_0}$. Assume by contradiction that there exist some $j$ and a sequence $(\lambda_k,\zeta_k)\in N_{\alpha_k,\lambda_{0,k}}$ with $\lambda_{0,k}\to 0$ and $\alpha_k\to 0$ such that
$$c_{\lambda_k,\zeta_k}^j\to 0\quad\text{as }k\to\infty.\qquad (4.22)$$
By the compactness of the sphere and the boundedness of the coefficients, we can extract a subsequence (not relabelled) such that, for each $i$,
$$\delta_{w_{\lambda_k,\zeta_k}^i}\overset{*}{\rightharpoonup}\delta_{\hat w_0^i},\qquad c_{\lambda_k,\zeta_k}^i\to\hat c_0^i,\qquad (4.23)$$
with $\hat c_0^j=0$. In particular, it holds that
$$\mu_{\lambda_k,\zeta_k}=\sum_{i=1}^N c_{\lambda_k,\zeta_k}^i\delta_{w_{\lambda_k,\zeta_k}^i}\overset{*}{\rightharpoonup}\sum_{i=1}^N\hat c_0^i\delta_{\hat w_0^i}=:\hat\mu_0.\qquad (4.24)$$
On the other hand, by the stability result [27, Theorem 3.5] we know that $\mu_{\lambda_k,\zeta_k}\overset{*}{\rightharpoonup}\mu_0$ as $k\to\infty$. By Theorem 3.18, the minimizer $\mu_0$ of $\mathcal{P}_0(y_0)$ is unique, hence necessarily $\hat\mu_0=\mu_0$.
Since the balls $\{B_{\varepsilon_0}(w_0^i)\}_{i=1}^N$ are pairwise disjoint and $\mathrm{supp}\,\mu_{\lambda_k,\zeta_k}\subset\bigcup_i B_{\varepsilon_0}(w_0^i)$ for $k$ large, the labeling of the atoms is fixed by the balls. Since $\mu_0$ is uniquely representable (i.e., both the coefficients $c_0^i$ and the points $w_0^i$ are unique), we deduce, thanks to the complementarity assumption, that
$$\hat c_0^i=c_0^i>0,\qquad\hat w_0^i=w_0^i,\qquad i=1,\dots,N,\qquad (4.25)$$
which contradicts $\hat c_0^j=0$. Therefore, for $(\lambda_0,\alpha)$ sufficiently small, we have that for all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$
$$c_{\lambda,\zeta}^i>0\quad\text{for all }i=1,\dots,N,\qquad (4.26)$$
obtaining the representation (4.7). Finally, let $(\lambda_k,\zeta_k)\in N_{\alpha,\lambda_0}$ with $(\lambda_k,\zeta_k)\to(0,0)$. By [27, Theorem 3.5] we have $\mu_{\lambda_k,\zeta_k}\overset{*}{\rightharpoonup}\mu_0$. Since for every $k$ the ball $B_{\varepsilon_0}(w_0^i)$ contains exactly one atom $w_{\lambda_k,\zeta_k}^i$ of $\mu_{\lambda_k,\zeta_k}$ with mass $c_{\lambda_k,\zeta_k}^i>0$, and the balls are pairwise disjoint, we obtain
$$w_{\lambda_k,\zeta_k}^i\to w_0^i,\qquad c_{\lambda_k,\zeta_k}^i\to c_0^i,\qquad i=1,\dots,N.\qquad (4.27)$$
∎

Remark 4.6. Note that, due to Theorem 3.18, which ensures uniqueness of solutions, the measure $\mu_0$ is precisely the solution obtained by the application of the representer theorem [6, 7, 8]. This justifies once more why the assumption $N\leqslant n$ is unavoidable. A similar observation can also be made regarding Theorem 4.7.

4.1 Rates of convergence and uniqueness of the coefficients

In this section we derive quantitative rates for the convergences
$$w_{\lambda,\zeta}^i\to w_0^i,\qquad c_{\lambda,\zeta}^i\to c_0^i,\qquad i=1,\dots,N,\qquad (4.28)$$
obtained in the previous section. We will show that, locally,
$$|c_{\lambda,\zeta}^i-c_0^i|+\|w_{\lambda,\zeta}^i-w_0^i\|=O(|\lambda|+\|\zeta\|),\qquad (4.29)$$
and in particular, along $N_{\alpha,\lambda_0}$, we get $O(\lambda)$.
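The $O(\lambda+\|\zeta\|)$ behaviour can already be seen in a scalar analogue of the training problem (a hypothetical one-atom toy model, not the paper's setting): for $\min_c\frac12(c-y_0-\zeta)^2+\lambda|c|$ with ground truth $c_0=y_0>0$, the minimizer is given by soft-thresholding, so in the low-noise regime the support is recovered and $|c_{\lambda,\zeta}-c_0|=|\zeta-\lambda|\leqslant\lambda+|\zeta|$.

```python
# Scalar toy model (illustrative only): soft-thresholding solves
# min_c 0.5*(c - y0 - zeta)^2 + lam*|c|, and the recovery error obeys the
# linear rate |c - c0| <= lam + |zeta|, mirroring (4.29).
def soft_threshold(y, lam):
    return max(abs(y) - lam, 0.0) * (1.0 if y >= 0 else -1.0)

c0 = 2.0                                  # hypothetical ground-truth coefficient
rates = []
for lam, zeta in [(0.1, 0.05), (0.01, 0.004), (0.001, 0.0002)]:
    c = soft_threshold(c0 + zeta, lam)    # minimizer of the toy problem
    rates.append((lam, zeta, abs(c - c0)))
print(rates)
```

Shrinking $\lambda$ and $\zeta$ together (as inside $N_{\alpha,\lambda_0}$, where $\|\zeta\|\leqslant\alpha\lambda$) makes the error decay linearly in $\lambda$.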
Note that
$$K(w_0)=\begin{bmatrix}K^*e^1(w_0^1)&\cdots&K^*e^n(w_0^1)\\\vdots&\ddots&\vdots\\K^*e^1(w_0^N)&\cdots&K^*e^n(w_0^N)\end{bmatrix}\in\mathbb{R}^{N\times n},\qquad (4.30)$$
where $e^j\in\mathbb{R}^n$ is the $j$-th vector of the canonical basis of $\mathbb{R}^n$. Given the operator $K'_*:\mathbb{R}^n\to C(S^{d-1};\mathbb{R}^d)$ defined as
$$K'_*(p^1,\dots,p^n)(w)=\sum_{i=1}^n p^i\,\nabla_{S^{d-1}}\sigma(\langle w,x^i\rangle),\qquad (4.31)$$
where $\nabla_{S^{d-1}}$ is the Riemannian gradient on the manifold $S^{d-1}$, we can also define the $N(d-1)\times n$ matrix of first derivatives as
$$K'(w_0):=\begin{bmatrix}\nabla_{S^{d-1}}\sigma(\langle w_0^1,x^1\rangle)&\cdots&\nabla_{S^{d-1}}\sigma(\langle w_0^1,x^n\rangle)\\\vdots&\ddots&\vdots\\\nabla_{S^{d-1}}\sigma(\langle w_0^N,x^1\rangle)&\cdots&\nabla_{S^{d-1}}\sigma(\langle w_0^N,x^n\rangle)\end{bmatrix}=\begin{bmatrix}K'_*e^1(w_0^1)&\cdots&K'_*e^n(w_0^1)\\\vdots&\ddots&\vdots\\K'_*e^1(w_0^N)&\cdots&K'_*e^n(w_0^N)\end{bmatrix},\qquad (4.32)$$
whose entries are tangent vectors in $\mathbb{R}^d$. In particular, we note that $K'_*p(w)\in T_wS^{d-1}\subset\mathbb{R}^d$ for every $p\in\mathbb{R}^n$ and $w\in S^{d-1}$. Therefore, the natural way to interpret (4.32) is as the linear map
$$K'(w_0):\mathbb{R}^n\to\prod_{i=1}^N T_{w_0^i}S^{d-1}.\qquad (4.33)$$

Theorem 4.7 (Interior Recovery). Let $N\leqslant n$. Assume that $\mu_0=\sum_{i=1}^N c_0^i\sigma^i\delta_{w_0^i}$, where $c_0^i>0$, $\sigma^i\in\{-1,+1\}$, and $w_0^i\in R_{\pi^i}\cap S^{d-1}$ for every $i=1,\dots,N$, satisfies (LC). Moreover, suppose that $\{K\delta_{w_0^i}\}$ are linearly independent and
$$\begin{bmatrix}K(w_0)\\K'(w_0)\end{bmatrix}\in\mathbb{R}^{Nd\times n}\ \text{has full rank }Nd.\qquad (4.36)$$
Then, there exist $\alpha>0$, $\lambda_0>0$ such that, for all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, the solution $\mu_{\lambda,\zeta}$ to $\mathcal{P}_\lambda(y_\zeta)$ is unique and admits a unique representation of the form
$$\mu_{\lambda,\zeta}=\sum_{i=1}^N c_{\lambda,\zeta}^i\sigma^i\delta_{w_{\lambda,\zeta}^i},\qquad (4.37)$$
where $w_{\lambda,\zeta}^i\in R_{\pi^i}\cap S^{d-1}$ is such that $\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^i)=\sigma^i$ and $c_{\lambda,\zeta}^i>0$.
Moreover, for every $i = 1,\dots,N$ and all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, it holds:
$$|c_{\lambda,\zeta}^i - c_0^i| = O(\lambda), \qquad \|w_{\lambda,\zeta}^i - w_0^i\| = O(\lambda). \tag{4.38}$$

Proof. Since the (LC) holds for $\mu_0$ and the measurements are linearly independent, we can apply Proposition 4.5. Once again we carry out the proof considering $\sigma^i = 1$ for simplicity. Therefore, there exist $\alpha > 0$, $\lambda_0 > 0$ such that, for all $(\lambda,\zeta)\in N_{\alpha,\lambda_0}$, any solution $\mu_{\lambda,\zeta}$ is composed of exactly $N$ Dirac deltas, i.e.
$$\mu_{\lambda,\zeta} = \sum_{i=1}^N c_{\lambda,\zeta}^i \delta_{w_{\lambda,\zeta}^i},$$
where $c_{\lambda,\zeta}^i > 0$ and $w_{\lambda,\zeta}^i \in R_{\pi^i}\cap S^{d-1}$ is such that $\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^i) = 1$ for every $i = 1,\dots,N$. In Proposition 4.5 we also obtained the uniqueness of the locations $w_{\lambda,\zeta}^i$ inside each region $R_{\pi^i}$. To conclude uniqueness of $\mu_{\lambda,\zeta}$ it remains to show the uniqueness of the coefficients $c_{\lambda,\zeta}^i$ and to obtain the quantitative estimate (4.38). We work in a neighborhood where each $w_0^i$ is interior to its activation region, so $\pi^i$ is fixed. In particular, we choose $\varepsilon > 0$ such that $B_\varepsilon(w_0^i)\cap S^{d-1} \subset R_{\pi^i}\cap S^{d-1}$ for all $i$. Now, we define
$$R(c,w,\zeta) := K\sum_{i=1}^N c^i \delta_{w^i} - y_0 - \zeta \in \mathbb{R}^n, \qquad R(c_0,w_0,0) = 0, \tag{4.39}$$
for $c\in\mathbb{R}^N$ and $w = (w^1,\dots,w^N) \in \prod_{i=1}^N \big(B_\varepsilon(w_0^i)\cap S^{d-1}\big)$. Since $\mu_{\lambda,\zeta}$ satisfies the optimality conditions, $\eta_{\lambda,\zeta}$ reaches its maximum at $w_{\lambda,\zeta}^j$, i.e. $K_* p_{\lambda,\zeta}(w_{\lambda,\zeta}^j) = \eta_{\lambda,\zeta}(w_{\lambda,\zeta}^j) = 1$. Moreover, since $w_{\lambda,\zeta}^j$ lies in the interior of $R_{\pi^j}$, the map $\eta_{\lambda,\zeta}$ is $C^1$ near $w_{\lambda,\zeta}^j$, hence $\nabla_{S^{d-1}}\eta_{\lambda,\zeta}(w_{\lambda,\zeta}^j) = 0$. Recalling $-\lambda p_{\lambda,\zeta} = K\mu_{\lambda,\zeta} - y_0 - \zeta$ and $\eta_{\lambda,\zeta} = K_* p_{\lambda,\zeta}$, we obtain
$$K_*\big(R(c_{\lambda,\zeta}, w_{\lambda,\zeta}, \zeta)\big)(w_{\lambda,\zeta}^j) + \lambda = 0, \qquad j = 1,\dots,N, \tag{4.40}$$
and
$$K'_*\big(R(c_{\lambda,\zeta}, w_{\lambda,\zeta}, \zeta)\big)(w_{\lambda,\zeta}^j) = 0 \quad \text{in } T_{w_{\lambda,\zeta}^j} S^{d-1}, \qquad j = 1,\dots,N. \tag{4.41}$$
Moreover, for $(\lambda,\zeta) = (0,0)$ we have $K\mu_0 = y_0$, hence the same equations hold at $(c_0, w_0)$. To apply the implicit function theorem, we introduce local charts around $w_0^j$ and express the second equation (4.41) in tangent coordinates. For each $j$, we choose a smooth chart $\varphi_j : U_j\subset\mathbb{R}^{d-1}\to S^{d-1}$ such that $\varphi_j(0) = w_0^j$ and $\varphi_j(U_j)\subset B_\varepsilon(w_0^j)\cap S^{d-1}$, and write $u = (u^1,\dots,u^N)\in U := U_1\times\cdots\times U_N\subset\mathbb{R}^{N(d-1)}$, so that $w(u) := (\varphi_1(u^1),\dots,\varphi_N(u^N))$. We can then choose, for each $j$, a $C^1$ family of linear isometries
$$\Pi_j(u^j) : T_{\varphi_j(u^j)} S^{d-1} \to \mathbb{R}^{d-1}, \qquad \Pi_j(u^j) \text{ an isometry for every } u^j\in U_j. \tag{4.42}$$
Then, we can define $F = (F^1,\dots,F^N) : (\mathbb{R}^N\times U)\times N_{\alpha,\lambda_0}\to\mathbb{R}^{Nd}$ by
$$F^j(c,u;\lambda,\zeta) := \begin{pmatrix} K_*\big(R(c,w(u),\zeta)\big)(\varphi_j(u^j)) + \lambda \\[2pt] \Pi_j(u^j)\, K'_*\big(R(c,w(u),\zeta)\big)(\varphi_j(u^j)) \end{pmatrix} \in \mathbb{R}^{1+(d-1)} = \mathbb{R}^d. \tag{4.43}$$
Since the activation pattern is fixed on $\varphi_j(U_j)\subset R_{\pi^j}$ and the maps $\varphi_j, \Pi_j$ are $C^1$, the map $(c,u,\lambda,\zeta)\mapsto F(c,u;\lambda,\zeta)$ is $C^1$ near $(c_0,0;0,0)$. Moreover, (4.40)–(4.41) are equivalent to
$$F(c_{\lambda,\zeta}, u_{\lambda,\zeta};\lambda,\zeta) = 0, \tag{4.44}$$
where $u_{\lambda,\zeta}$ is defined by $w_{\lambda,\zeta}^j = \varphi_j(u_{\lambda,\zeta}^j)$ (for $(\lambda,\zeta)$ small). It also holds that $F(c_0,0;0,0) = 0$. Now, we differentiate $F$ with respect to $(c,u)$ and check the invertibility of $D_{(c,u)}F(c_0,0;0,0)$. First, note that for fixed $w = (w^1,\dots,w^N)$ we have
$$p \mapsto \mathcal{K}(w)\,p = \big((K_*p)(w^1),\dots,(K_*p)(w^N)\big)\in\mathbb{R}^N, \tag{4.45}$$
and, in the tangent coordinates defined by $\Pi_j(u^j)$,
$$p \mapsto \mathcal{K}'_T(w(u))\,p := \big(\Pi_1(u^1)\, K'_* p(\varphi_1(u^1)),\dots,\Pi_N(u^N)\, K'_* p(\varphi_N(u^N))\big)\in\mathbb{R}^{N(d-1)}. \tag{4.46}$$
At $u = 0$ this gives the coordinate representation of $\mathcal{K}'(w_0)$, which we denote by $\mathcal{K}'_T(w_0)$. With this notation, we can define
$$A_T(w(u)) := \begin{bmatrix} \mathcal{K}(w(u)) \\ \mathcal{K}'_T(w(u)) \end{bmatrix} \in \mathbb{R}^{Nd\times n}. \tag{4.47}$$
Differentiating in $(c,u)$, we get an $Nd\times Nd$ matrix:
$$D_{(c,u)}F = D_u A_T(w(u))\, R(c,w(u),\zeta) + A_T(w(u))\, D_{(c,u)} R(c,w(u),\zeta), \tag{4.48}$$
where the first term vanishes when evaluated at $(c_0,0;0,0)$, that is,
$$D_{(c,u)}F(c_0,0;0,0) = A_T(w_0)\, D_{(c,u)} R(c_0,w_0,0). \tag{4.49}$$
Therefore, the Jacobian depends only on the first-order variation of $R$. Next, differentiating $R(c,w(u),\zeta)$ at $(c_0,0,0)$ yields the block form
$$D_{(c,u)}R(c_0,w_0,0) = \begin{bmatrix} \mathcal{K}(w_0)^\top & \mathcal{K}'_T(w_0)^\top \operatorname{diag}^{d-1}(c_0)\, M \end{bmatrix}, \tag{4.50}$$
where $\operatorname{diag}^{d-1}(c_0)\in\mathbb{R}^{N(d-1)\times N(d-1)}$ is the block-diagonal matrix with $(d-1)$ copies of each $c_0^i$, and $M\in\mathbb{R}^{N(d-1)\times N(d-1)}$ is the block-diagonal matrix
$$M := \operatorname{diag}(M_1,\dots,M_N), \qquad M_j := \Pi_j(0)\circ D\varphi_j(0)\in\mathbb{R}^{(d-1)\times(d-1)}. \tag{4.51}$$
Since each $\varphi_j$ is a local chart and each $\Pi_j(0)$ is an isometry, every $M_j$ (hence $M$) is invertible. Substituting (4.50) into (4.49) we obtain the factorization
$$D_{(c,u)}F(c_0,0;0,0) = A_T(w_0)\, A_T(w_0)^\top \begin{bmatrix} I_N & 0 \\ 0 & \operatorname{diag}^{d-1}(c_0)\, M \end{bmatrix}. \tag{4.52}$$
Fix the linear isometries $\Pi_j(0) : T_{w_0^j} S^{d-1}\to\mathbb{R}^{d-1}$ for $j = 1,\dots,N$, and consider the linear map
$$J : \mathbb{R}^N\times{\textstyle\times_{j=1}^N}\, T_{w_0^j} S^{d-1}\to\mathbb{R}^N\times\mathbb{R}^{N(d-1)}, \qquad J(a; v_1,\dots,v_N) = (a;\, \Pi_1(0)v_1,\dots,\Pi_N(0)v_N). \tag{4.53}$$
Then $J$ is an isomorphism and the matrix $A_T(w_0)$ is precisely the coordinate representation of the linear map $J\circ\big[\begin{smallmatrix}\mathcal{K}(w_0)\\ \mathcal{K}'(w_0)\end{smallmatrix}\big]$.
Since composition with an isomorphism does not change the rank,
$$\operatorname{rank}\big(A_T(w_0)\big) = \operatorname{rank}\left(\begin{bmatrix} \mathcal{K}(w_0) \\ \mathcal{K}'(w_0)\end{bmatrix}\right). \tag{4.54}$$
In particular, by (4.36) we have $\operatorname{rank}(A_T(w_0)) = Nd$ (full row rank), hence $A_T(w_0)\, A_T(w_0)^\top$ is symmetric positive definite. Moreover, since $c_0^i > 0$ and $M$ is invertible, the block-diagonal matrix in (4.52) is invertible. Therefore, $D_{(c,u)}F(c_0,0;0,0)$ is invertible and the implicit function theorem applies. Hence, there exist neighborhoods $B_r(c_0,0)\subset\mathbb{R}^N\times U$ and $B_s((0,0))\subset\mathbb{R}\times\mathbb{R}^n$, and a unique $C^1$ map
$$g : B_s((0,0)) \to B_r(c_0,0), \qquad (\lambda,\zeta)\mapsto (c(\lambda,\zeta), u(\lambda,\zeta)), \tag{4.55}$$
such that $g(0,0) = (c_0,0)$ and $F(g(\lambda,\zeta);\lambda,\zeta) = 0$ for all $(\lambda,\zeta)\in B_s((0,0))$. In particular, since $g$ is $C^1$ at $(0,0)$, there exists $C > 0$ such that, for $(\lambda,\zeta)$ sufficiently small,
$$\|(c(\lambda,\zeta), u(\lambda,\zeta)) - (c_0,0)\| \leqslant C\big(|\lambda| + \|\zeta\|\big). \tag{4.56}$$
Mapping back to the sphere by $w^i(\lambda,\zeta) := \varphi_i(u^i(\lambda,\zeta))$ and using that $\varphi_i$ is $C^1$, (4.56) implies
$$|c_{\lambda,\zeta}^i - c_0^i| + \|w_{\lambda,\zeta}^i - w_0^i\| \leqslant C\big(|\lambda| + \|\zeta\|\big). \tag{4.57}$$
Finally, by Proposition 4.5 the locations $w_{\lambda,\zeta}^i$ are uniquely determined inside each region $R_{\pi^i}$ and satisfy $w_{\lambda,\zeta}^i \to w_0^i$, together with $c_{\lambda,\zeta}^i \to c_0^i$. Therefore, for $(\lambda,\zeta)$ small enough, we have $(c_{\lambda,\zeta}, u_{\lambda,\zeta})\in B_r(c_0,0)$. Since $F(c_{\lambda,\zeta}, u_{\lambda,\zeta};\lambda,\zeta) = 0$ by (4.44), uniqueness in the implicit function theorem implies
$$(c_{\lambda,\zeta}, u_{\lambda,\zeta}) = g(\lambda,\zeta), \tag{4.58}$$
which proves that $c_{\lambda,\zeta}$ is unique and therefore $\mu_{\lambda,\zeta}$ is unique. The rates (4.38) follow from (4.57), since on $N_{\alpha,\lambda_0}$ it holds $\|\zeta\| \leqslant \alpha\lambda$. ∎

Remark 4.8. Note that condition (4.36) forces $n \geqslant Nd$.
Moreover, the linear independence of $\{K\delta_{w_0^i}\}_{i=1}^N$ is not implied by (4.36) in general; it is assumed here in order to invoke Proposition 4.5 and obtain the localization of the reconstructed Dirac deltas. We note that (4.36) can be seen as a finite-dimensional analogue of the condition used in the infinite-dimensional measurement setting of [20], where one works with an observation operator having $L^2$ measurements (heuristically, $n = +\infty$). Regarding the non-degeneracy condition, we remark that we do not impose (ND), since the locations lie in the interior of the dual regions; in this case, (LC) is sufficient to apply Proposition 4.5. Finally, we remark that a similar argument would provide a linear decay of both the location and the coefficient errors for fixed $\lambda_0$ and a vanishing sequence of noise levels $\zeta$. In this case, the assumptions concerning the linear independence of the measurements and (4.36) should be understood with respect to the zero-noise reconstruction with $\lambda = \lambda_0$.

5 Numerical experiments

This section presents simple numerical simulations confirming the theoretical findings of this paper. For simplicity, we focus on the two-dimensional setting ($d = 2$), where the weight domain is the unit circle $S^1$. The experiments are designed to display the dual-region partition, the behaviour of the dual certificate $\eta_{\lambda,\zeta}$ on $S^1$, and the dependence of the optimal solution $\mu_{\lambda,\zeta}$ on the noise level and the regularization parameter. In particular, we show the sparsity of the recovered measure (Theorem 3.10), the relationship between the number of atoms and the regularization parameter (Lemma 3.11), and the linear convergence rates for the weights and locations under label noise (Theorem 4.7). These examples are representative, since the aim here is not to optimize predictive performance but to isolate and visualize the results discussed in the previous sections.
5.1 Model and experimental setup

Given inputs $x_1,\dots,x_n\in\mathbb{R}^2$ and targets $y\in\mathbb{R}^n$, we consider the operator
$$(K\mu)_j = \int_{S^1} \sigma(\langle w, x_j\rangle)\, d\mu(w), \qquad \sigma(t) = \mathrm{ReLU}(t), \tag{5.1}$$
and the TV-regularized empirical risk problem
$$\min_{\mu\in\mathcal{M}(S^1)}\ \frac12\sum_{j=1}^n \big((K\mu)_j - y_j\big)^2 + \lambda\|\mu\|_{TV}, \qquad \lambda > 0. \tag{5.2}$$
Numerically, we approximate $\mu$ by an empirical measure with $m$ Dirac deltas:
$$\mu = \sum_{i=1}^m c_i\,\delta_{w_i}, \qquad w_i\in S^1,\ c_i\in\mathbb{R}, \tag{5.3}$$
which yields the finite-dimensional objective
$$\min_{\{(c_i,w_i)\}_{i=1}^m}\ \frac12\sum_{j=1}^n \Big(\sum_{i=1}^m c_i\,\sigma(\langle w_i, x_j\rangle) - y_j\Big)^2 + \lambda\sum_{i=1}^m |c_i|. \tag{5.4}$$
We enforce the constraint $w_i\in S^1$ by parametrizing $w_i = u_i/\|u_i\|$ with unconstrained $u_i\in\mathbb{R}^2$ and optimizing over $(c_i, u_i)$.

5.2 Simulations

We train the model by optimizing (5.4) with ADAM, using $m = 10^5$ and random initialization for $c$ and $u$. In all the experiments reported in Figures 2–6, we use a fixed set of $n = 5$ data points in $\mathbb{R}^2$:
$$x_1 = (0.2, -0.1),\ x_2 = (1.0, 0.3),\ x_3 = (1.0, 0.0),\ x_4 = (-0.4, 0.9),\ x_5 = (0.5, 0.5) \tag{5.5}$$
and targets
$$y = (0.8,\, -0.1,\, 0.3,\, -1.2,\, 1.0). \tag{5.6}$$
We recall that in the first three experiments the labels are not corrupted by noise ($\zeta = 0$), while in the final experiment we vary the noise level to verify the convergence rates of Theorem 4.7. The data points $x_i$ are non-collinear and span multiple directions, so the hyperplane constraints $\langle w, x_j\rangle = 0$ generate a nontrivial arrangement on $S^1$ with multiple dual regions (see Figure 2). As regularization parameter, we choose $\lambda = 0.03$. Every $200$ iterations we prune the iterate by removing the atoms whose coefficients have magnitude below the threshold $\tau_{\mathrm{prune}} = 10^{-2}$.
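The training pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the particle count is reduced from the paper's $m = 10^5$, ADAM is replaced by plain projected subgradient descent, and the learning rate and iteration count are placeholder choices.

```python
import numpy as np

# Data (5.5) and targets (5.6).
X = np.array([[0.2, -0.1], [1.0, 0.3], [1.0, 0.0], [-0.4, 0.9], [0.5, 0.5]])
y = np.array([0.8, -0.1, 0.3, -1.2, 1.0])
lam = 0.03

def relu(t):
    return np.maximum(t, 0.0)

def loss(c, u):
    """Objective (5.4), with w_i = u_i / ||u_i|| enforcing w_i in S^1."""
    w = u / np.linalg.norm(u, axis=1, keepdims=True)
    pred = relu(w @ X.T).T @ c                      # (n,) network outputs
    return 0.5 * np.sum((pred - y) ** 2) + lam * np.sum(np.abs(c))

rng = np.random.default_rng(0)
m = 50                                              # reduced from 10^5
c = 0.01 * rng.standard_normal(m)
u = rng.standard_normal((m, 2))
loss0 = loss(c, u)

lr = 1e-2
for _ in range(2000):
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    w = u / norms
    act = relu(w @ X.T)                             # (m, n) activations
    r = act.T @ c - y                               # data residual, (n,)
    gc = act @ r + lam * np.sign(c)                 # subgradient in c
    mask = (w @ X.T > 0).astype(float)              # ReLU activation pattern
    gw = c[:, None] * ((mask * r) @ X)              # gradient w.r.t. w
    # Project onto the tangent space of S^1 (chain rule through u -> u/||u||).
    gu = (gw - np.sum(gw * w, axis=1, keepdims=True) * w) / norms
    c -= lr * gc
    u -= lr * gu

final = loss(c, u)
n_atoms = int(np.sum(np.abs(c) > 1e-2))             # pruning threshold tau_prune
```

After training, most coefficients fall below the pruning threshold, leaving a small number of effective atoms, in line with the sparsity results of Section 3.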
The result is represented in Figure 2 and corresponds to the solution
$$\mu_{\lambda,\zeta} = \sum_{i=1}^3 c_*^i\,\delta_{w_*^i}, \tag{5.7}$$
where
$$w_*^1 = (-0.163, 0.987), \quad w_*^2 = (0.287, -0.958), \quad w_*^3 = (-0.708, 0.706),$$
$$c_*^1 = 1.312, \quad c_*^2 = 1.256, \quad c_*^3 = -2.577.$$

Figure 2: Dual-region decomposition of $S^1$ and optimal solution $\mu_{\lambda,\zeta}$. The dashed diameters correspond to the boundaries $\langle w, x_j\rangle = 0$ induced by the five inputs (5.5), partitioning $S^1$ into sectors on which $\big(\mathbb{1}\{\langle w, x_j\rangle > 0\}\big)_{j=1}^5$ is constant. We display the recovered atoms $\{w_i\}_{i=1}^3$ of $\mu_{\lambda,\zeta}$. The arrow corresponds to the vector $\tilde z_{\lambda,\zeta}/\|\tilde z_{\lambda,\zeta}\|$ as in (3.19), with $\tilde z_{\lambda,\zeta}$ defined in (3.18). As predicted by Lemma 3.9, this vector aligns with the location of the Dirac delta of the solution $\mu_{\lambda,\zeta}$ belonging to the interior of a dual region.

We also study the behaviour of the dual certificate $\eta_{\lambda,\zeta}\in C(S^{d-1})$ corresponding to the computed solution $\mu_{\lambda,\zeta}$. In particular, we show that it reaches the maximum $\eta_{\lambda,\zeta} = 1$ at the locations of the positive Dirac deltas of the optimal solution; similarly, it reaches the minimum $\eta_{\lambda,\zeta} = -1$ at the locations of the negative Dirac deltas. Such behaviour is a clear indication that the optimization procedure has reached a minimizer of the problem. Moreover, we note that the dual certificate aligns with the unique (positive) Dirac delta of $\mu_{\lambda,\zeta}$ lying in the interior of a dual region, which we denote by $w_*\in R_\pi$. As described in Lemma 3.9, it holds that, for $p_{\lambda,\zeta} = -\frac{1}{\lambda}(K\mu_{\lambda,\zeta} - y_\zeta)$ (see (2.3)) and $\tilde z_\lambda = \sum_i p_\lambda^i \pi_i x^i$ (see (3.18)),
$$\eta_{\lambda,\zeta} = \mathrm{sign}(w_*)\,\frac{\tilde z_{\lambda,\zeta}}{\|\tilde z_{\lambda,\zeta}\|}. \tag{5.8}$$

Figure 3: Dual certificate $\eta_{\lambda,\zeta}$ on $S^1$. The dashed circles (orange and green) show the locations where $\eta_{\lambda,\zeta} = \pm 1$.
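Both objects displayed in Figures 2 and 3 can be inspected numerically from the recovered measure. The sketch below uses the rounded values reported in (5.7), so the resulting certificate is only an approximation of the exact one; the helper names are ours, not the paper's.

```python
import numpy as np

# Data (5.5), targets (5.6), and the rounded solution (5.7).
X = np.array([[0.2, -0.1], [1.0, 0.3], [1.0, 0.0], [-0.4, 0.9], [0.5, 0.5]])
y = np.array([0.8, -0.1, 0.3, -1.2, 1.0])
lam = 0.03
W = np.array([[-0.163, 0.987], [0.287, -0.958], [-0.708, 0.706]])
c = np.array([1.312, 1.256, -2.577])

relu = lambda t: np.maximum(t, 0.0)

K_mu = relu(W @ X.T).T @ c           # (K mu)_j = sum_i c_i relu(<w_i, x_j>)
p = -(K_mu - y) / lam                # dual variable p_{lam,zeta}, zeta = 0

# Evaluate eta = K_* p on a uniform grid of S^1.
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
grid = np.stack([np.cos(theta), np.sin(theta)], axis=1)
eta = relu(grid @ X.T) @ p           # eta(w) = sum_j p_j relu(<w, x_j>)
eta_at_atoms = relu(W @ X.T) @ p     # same signs as the coefficients c_i

# Dual regions of Figure 2: distinct activation patterns 1{<w, x_j> > 0}.
n_regions = len({tuple(row) for row in (grid @ X.T > 0)})
```

For the data (5.5), the five diameters cut $S^1$ into $2n = 10$ arcs, matching the region count $R(X) = 10$ used later with Lemma 3.11.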
As expected from the optimality conditions, the dual certificate always lies between $-1$ and $+1$. The points where $\eta_\lambda = 1$ (resp. $\eta_\lambda = -1$) correspond exactly to the locations of positive (resp. negative) Dirac deltas in the TV-regularized solution (5.7). In Figure 4, keeping the data and labels fixed, we vary the regularization parameter to observe how the number of reconstructed Dirac deltas in the solution $\mu_{\lambda,\zeta}$ changes. We note that as the regularization parameter decreases, the number of Dirac deltas increases, while staying below the geometric upper bound given by Lemma 3.11, which in this case equals $10$.

Figure 4: Size of the support depending on the regularization parameter. For $\lambda$ ranging from $\lambda_{\max} = 1$ down to $\lambda_{\min} = 5\cdot 10^{-7}$ ($11$ values), we compute the cardinality of the support of the solution. As expected, the number of Dirac deltas increases as the regularization parameter $\lambda$ decreases. The maximum number of Dirac deltas remains, however, below the theoretical bound determined by the number of regions: $R(X) = 10$.

Finally, in the last experiment, we verify the decay of the coefficients and locations of the Dirac deltas in the optimal solution $\mu_{\lambda,\zeta}$ predicted by Theorem 4.7 (see also Remark 4.8). We again work in dimension $d = 2$ and consider $n = 5$ data points in $\mathbb{R}^2$:
$$x_1 = (0.2, -0.1),\ x_2 = (1.0, -2),\ x_3 = (1.0, 0.2),\ x_4 = (-0.4, -0.2),\ x_5 = (0.5, 0.5) \tag{5.9}$$
and targets
$$y = (0.8,\, -0.1,\, 0.3,\, -1.2,\, 1.0). \tag{5.10}$$
We fix $\lambda = 0.2$ and compute the solution $\mu_{\lambda,\zeta} = \sum_{i=1}^2 c_*^i\delta_{w_*^i}$ with no noise. The result is depicted in Figure 5, where we observe that both $w_*^i$ lie in the interior of a dual region.

Figure 5: Unperturbed solution and dual certificate for $\lambda = 0.2$.
Left: Dual-region decomposition of $S^1$ and recovered atoms $\{w_*^i\}_{i=1}^2$ of the solution $\mu_{\lambda,\zeta} = \sum_{i=1}^2 c_*^i\,\delta_{w_*^i}$ with $\zeta = 0$. Both atoms lie in the interior of a dual region. Right: Corresponding dual certificate $\eta_{\lambda,\zeta}$ on $S^1$. As in Figure 3, it satisfies $|\eta_{\lambda,\zeta}|\leqslant 1$ everywhere and attains $\pm 1$ exactly on the support of $\mu_{\lambda,\zeta}$.

Keeping $\lambda$ fixed, we perturb $y$ by increasingly large noise $\zeta$, ranging from $5\cdot 10^{-4}$ to $2.4\cdot 10^{-3}$. We compute the optimal solution $\mu_{\lambda,\zeta} = \sum_{i=1}^2 c_\zeta^i\delta_{w_\zeta^i}$ for all values of $\zeta$, and in Figure 6 we plot the quantities
$$L_c^i(\zeta) = |c_{\lambda,\zeta}^i - c_*^i| \quad\text{and}\quad L_w^i(\zeta) = \|w_{\lambda,\zeta}^i - w_*^i\| \quad\text{for } i = 1,2.$$
As predicted by Theorem 4.7, we observe that these quantities decay linearly in $\|\zeta\|$, consistently with the bound $O(\lambda)$ along $N_{\alpha,\lambda_0}$, where $\|\zeta\|\leqslant\alpha\lambda$. In particular, we have $L_c^i(\zeta)\leqslant C_c^i\|\zeta\|$ and $L_w^i(\zeta)\leqslant C_w^i\|\zeta\|$ for constants $C_c^i$ and $C_w^i$, $i = 1,2$. One can also verify that the assumptions of Theorem 4.7 are indeed satisfied. Since both atoms $w_*^1, w_*^2$ lie in the interior of their respective dual regions $R_{\pi_1}, R_{\pi_2}$, the non-degeneracy condition (ND) is not required (cf. Remark 4.8). The localization condition (LC) is verified by inspecting the dual certificate $\eta_{\lambda,\zeta}$ in Figure 5 (right), where we observe that $|\eta_{\lambda,\zeta}(w)| < 1$ for all $w\in S^1\setminus\{w_*^1, w_*^2\}$. To verify the linear independence of $\{K\delta_{w_*^1}, K\delta_{w_*^2}\}$ we apply Lemma 3.16. The activation patterns of the two atoms are $\pi_1 = (1,0,1,0,1)$ and $\pi_2 = (1,1,0,1,0)$. The $2\times 2$ minor of $\Pi$ corresponding to columns $1$ and $2$ equals $\begin{pmatrix}1 & 0\\ 1 & 1\end{pmatrix}$, which satisfies $\mathrm{perm}(\Pi_N) = |\det(\Pi_N)| = 1$, so conditions (i) and (ii) of Lemma 3.16 are fulfilled. Finally, we verify condition (4.36).
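Such a rank check for (4.36) can be sketched numerically. The data below are (5.9), but since the atom coordinates for this experiment are read off Figure 5 rather than reported, the two atom locations are hypothetical placeholders chosen only to illustrate the mechanics of assembling the stacked matrix in $d = 2$.

```python
import numpy as np

# Data points (5.9).
X = np.array([[0.2, -0.1], [1.0, -2.0], [1.0, 0.2], [-0.4, -0.2], [0.5, 0.5]])

# Hypothetical atom locations on S^1 (NOT the paper's recovered atoms).
W = np.array([[np.cos(np.deg2rad(80.0)), np.sin(np.deg2rad(80.0))],
              [np.cos(np.deg2rad(-30.0)), np.sin(np.deg2rad(-30.0))]])
T = np.stack([-W[:, 1], W[:, 0]], axis=1)   # tangent directions on S^1

Z = W @ X.T                                  # inner products <w_i, x_j>
K = np.maximum(Z, 0.0)                       # rows of K(w_*): relu(<w_i, x_j>)
Kp = (Z > 0) * (T @ X.T)                     # rows of K'(w_*): 1{.>0} <t_i, x_j>
A = np.vstack([K, Kp])                       # stacked (Nd x n) = (4 x 5) matrix

rank = np.linalg.matrix_rank(A)
svals = np.linalg.svd(A, compute_uv=False)   # inspect conditioning as well
```

Condition (4.36) holds when `rank == N * d`; the singular values show how far the configuration is from degeneracy.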
Since $d = 2$, the tangent space $T_{w_*^i} S^1$ is one-dimensional, spanned by $(-w_{*,2}^i,\, w_{*,1}^i)$. We numerically evaluate the matrix $\big[\begin{smallmatrix}\mathcal{K}(w_*)\\ \mathcal{K}'(w_*)\end{smallmatrix}\big]\in\mathbb{R}^{4\times 5}$ and obtain singular values approximately $2.243$, $1.204$, $0.472$, $0.365$, confirming that its rank is $4 = Nd$. Hence, all hypotheses of Theorem 4.7 are fulfilled.

Figure 6: Linear decay of coefficients and locations under label noise. Left: Coefficient mismatch $L_c^i(\zeta) = |c_{\lambda,\zeta}^i - c_*^i|$ as a function of the noise level $\|\zeta\|$ for $i = 1,2$, together with linear fits. Right: Location mismatch $L_w^i(\zeta) = \|w_{\lambda,\zeta}^i - w_*^i\|$ for $i = 1,2$. Both quantities exhibit the linear dependence on $\|\zeta\|$ predicted by Theorem 4.7.

References

[1] J. M. Azais, Y. De Castro, and F. Gamboa. Spike detection from inaccurate samplings. Applied and Computational Harmonic Analysis, 38(2):177–195, 2015.
[2] F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
[3] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.
[4] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[5] F. Bartolucci, M. Carioni, J. A. Iglesias, Y. Korolev, E. Naldi, and S. Vigogna. A Lipschitz spaces view of infinitely wide shallow neural networks. arXiv preprint arXiv:2410.14591, 2024.
[6] F. Bartolucci, E. De Vito, L. Rosasco, and S. Vigogna. Understanding neural networks with reproducing kernel Banach spaces. Applied and Computational Harmonic Analysis, 62:194–236, 2023.
[7] C. Boyer, A. Chambolle, Y. De Castro, V. Duval, F. de Gournay, and P. Weiss. On representer theorems and convex regularization. SIAM Journal on Optimization, 29(2):1260–1281, 2019.
[8] K. Bredies and M. Carioni.
Sparsity of solutions for variational inverse problems with finite-dimensional data. Calculus of Variations and Partial Differential Equations, 59(1):14, 2020.
[9] K. Bredies and H. K. Pikkarainen. Inverse problems in spaces of measures. ESAIM: Control, Optimisation and Calculus of Variations, 19(1):190–218, 2013.
[10] R. A. Brualdi and H. J. Ryser. Combinatorial Matrix Theory, volume 39. Springer, 1991.
[11] E. J. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
[12] M. Carioni and L. Del Grande. A general theory for exact sparse representation recovery in convex optimization. arXiv preprint arXiv:2311.08072, 2023.
[13] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14(3):326–334, 1965.
[14] Y. De Castro and F. Gamboa. Exact reconstruction using Beurling minimal extrapolation. Journal of Mathematical Analysis and Applications, 395(1):336–354, 2012.
[15] J. de Dios and J. Bruna. On sparsity in overparametrised shallow ReLU networks. arXiv preprint arXiv:2006.10225, 2020.
[16] Q. Denoyelle, V. Duval, and G. Peyré. Support recovery for sparse super-resolution of positive measures. Journal of Fourier Analysis and Applications, 23:1153–1194, 2017.
[17] Q. Denoyelle, V. Duval, G. Peyré, and E. Soubies. The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy. Inverse Problems, 36(1):014001, 2019.
[18] S. Dummer, T. J. Heeringa, and J. A. Iglesias. Vector-valued reproducing kernel Banach spaces for neural networks and operators. arXiv preprint arXiv:2509.26371, 2025.
[19] V. Duval. A characterization of the non-degenerate source condition in super-resolution. Information and Inference: A Journal of the IMA, 9(1):235–269, 2020.
[20] V. Duval and G. Peyré. Exact support recovery for sparse spikes deconvolution.
Foundations of Computational Mathematics, 15(5):1315–1355, 2015.
[21] W. E, C. Ma, and L. Wu. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55(1):369–406, 2022.
[22] W. E and S. Wojtowytsch. Representation formulas and pointwise properties for Barron functions. Calculus of Variations and Partial Differential Equations, 61(2):46, 2022.
[23] W. E and S. Wojtowytsch. Some observations on high-dimensional partial differential equations with Barron data. In Mathematical and Scientific Machine Learning, pages 253–269. PMLR, 2022.
[24] I. Ekeland and R. Temam. Convex Analysis and Variational Problems. SIAM, 1999.
[25] J.-J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6):1341–1344, 2004.
[26] T. J. Heeringa, L. Spek, F. L. Schwenninger, and C. Brune. Embeddings between Barron spaces with higher-order activation functions. Applied and Computational Harmonic Analysis, 73:101691, 2024.
[27] B. Hofmann, B. Kaltenbacher, C. Pöschl, and O. Scherzer. A convergence rates result for Tikhonov regularization in Banach spaces with non-smooth operators. Inverse Problems, 23(3):987, 2007.
[28] K. Karhadkar, M. Murray, H. Tseran, and G. F. Montufar. Mildly overparameterized ReLU networks have a favorable loss landscape. Transactions on Machine Learning Research, 2024.
[29] S. Kim, A. Mishkin, and M. Pilanci. Exploring the loss landscape of regularized neural networks via convex duality. arXiv preprint arXiv:2411.07729, 2024.
[30] H. Maennel, O. Bousquet, and S. Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
[31] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 27, 2014.
[32] M. Najaf and G. Ongie. Sampling theory for super-resolution with implicit neural representations. arXiv preprint arXiv:2506.09949, 2025.
[33] I.
Neitzel, K. Pieper, B. Vexler, and D. Walter. A sparse control approach to optimal sensor placement in PDE-constrained parameter estimation problems. Numerische Mathematik, 143(4):943–984, 2019.
[34] G. Ongie, R. Willett, D. Soudry, and N. Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. arXiv preprint arXiv:1910.01635, 2019.
[35] R. Parhi and R. D. Nowak. Banach space representer theorems for neural networks and ridge splines. Journal of Machine Learning Research, 22(43):1–40, 2021.
[36] R. Parhi and R. D. Nowak. What kinds of functions do deep neural networks learn? Insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2):464–489, 2022.
[37] C. Poon, N. Keriven, and G. Peyré. The geometry of off-the-grid compressed sensing. Foundations of Computational Mathematics, 23(1):241–327, 2023.
[38] P. Savarese, I. Evron, D. Soudry, and N. Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.
[39] G. Schiebinger, E. Robeva, and B. Recht. Superresolution without separation. Information and Inference: A Journal of the IMA, 7(1):1–30, 2018.
[40] A. Shevchenko, V. Kungurtsev, and M. Mondelli. Mean-field analysis of piecewise linear solutions for wide ReLU networks. Journal of Machine Learning Research, 23(130):1–55, 2022.
[41] L. Spek, T. J. Heeringa, F. L. Schwenninger, and C. Brune. Duality for neural networks through reproducing kernel Banach spaces. Applied and Computational Harmonic Analysis, 78:101765, 2025.