Paper deep dive
Provably Extracting the Features from a General Superposition
Allen Liu
Abstract
It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function \[ f(x)=\sum_{i=1}^n a_i\,\sigma_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ is an arbitrary response function, and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results -- we allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.
Tags
Links
- Source: https://arxiv.org/abs/2512.15987
- Canonical: https://arxiv.org/abs/2512.15987
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:53:50 AM
Summary
The paper presents an efficient query-based algorithm to recover feature directions and response functions in a general superposition model, where a function is represented as a sum of ridge functions. By searching in Fourier space and iteratively refining the search space, the algorithm overcomes the overcomplete regime (n > d) and works for arbitrary response functions, provided the feature directions are not nearly identical.
Entities (4)
Relation Signals (2)
Ridge Function → iscomponentof → Superposition
confidence 98% · f(x) = sum of ridge functions... superposition refers to the overcomplete regime
Superposition → isaddressedby → Fourier Space Search
confidence 95% · Our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions
Cypher Suggestions (2)
Map the relationship between mathematical models and concepts · confidence 95% · unvalidated
MATCH (m:Model)-[r:IS_COMPONENT_OF]->(c:Concept) RETURN m.name, r.relation, c.name
Find all algorithms related to feature extraction in superposition · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:ADDRESSES]->(c:Concept {name: 'Superposition'}) RETURN a.name
Full Text
112,500 characters extracted from source content.
Provably Extracting the Features from a General Superposition

Allen Liu (UC Berkeeley is incorrect; UC Berkeley, aliu42@berkeley.edu). This work was supported by a Miller Research Fellowship.

Abstract

It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function \[ f(x)=\sum_{i=1}^n a_i\,\sigma_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R}\rightarrow\mathbb{R}$ is an arbitrary response function, and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n>d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results -- we allow for essentially arbitrary superpositions, only requiring that $v_i,v_j$ are not nearly identical for $i\neq j$, and general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.

1 Introduction

While modern machine learning models are incredibly complex, a foundational viewpoint in our quest to understand them is the notion of linear representations [alain2016understanding, elhage2022toy, park2023linear, olah_linear_representation_2024, golowich2025sequenceslogitsreveallow].
The hypothesis is that salient features correspond to directions $v\in\mathbb{R}^d$ in some representation space, and the response to a feature depends only on the one-dimensional projection $v^\top x$ via some ridge function, say $f(x)=\sigma(v^\top x)$. Even the study of simple, single-feature functions, which depend only on the projection of the input onto a single direction, has led to a rich body of work in computational learning theory through generalized linear models [mccullagh2019generalized, kakade2011efficient, chen2020classification], single-index models [ichimura1993semiparametric, dudeja2018learning, bietti2022learning, gollakota2023agnostically, damian2023smoothing, damian2024computational, zarifis2024robustly], and also related problems such as non-Gaussian component analysis [blanchard2006search, diakonikolas2022non]. However, further adding to the challenge is the fact that for most models or functions that we want to study, the output depends on many features of the input. We formalize this in the most basic setting, when the output is a sum of the responses to individual features, i.e. a sum of ridge functions \[ f(x)=\sum_{i=1}^n a_i\,\sigma_i(v_i^\top x), \] where each $v_i\in\mathbb{R}^d$ is a unit "feature direction" and $\sigma_i:\mathbb{R}\rightarrow\mathbb{R}$ is an arbitrary univariate response. A fundamental concept in feature learning and interpretability is the notion of superposition [elhage2022toy], meaning that the number of features is generally much larger than the dimension of the representation. In learning-theoretic terms, this corresponds to the overcomplete regime where the number of features $n$ exceeds the ambient dimension $d$. Yet, this overcomplete regime has proven particularly challenging from an algorithmic perspective, posing a barrier for common approaches such as moment and tensor methods [hillar2013most, anandkumar2013overcomplete, anandkumar2014tensor]. To complete the problem setup, we need to specify how we can access the function $f$.
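To make the model concrete, here is a minimal numerical sketch of a sum of features in the overcomplete regime. The directions, coefficients, and response functions are illustrative choices of ours (not from the paper); they are picked to satisfy the paper's later normalizations ($\|v_i\|=1$, $|a_i|\le 1$, $\sigma_i(0)=0$, $1$-Lipschitz responses).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10  # overcomplete regime: n > d features in d dimensions

# Unit feature directions v_i and coefficients |a_i| <= 1 (illustrative choices).
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
a = rng.uniform(-1.0, 1.0, size=n)

# A few 1-Lipschitz responses with sigma_i(0) = 0, cycled over the features.
sigmas = [np.tanh, np.sin, lambda z: np.sign(z) * np.log1p(np.abs(z))]

def f(x):
    """Sum of features: f(x) = sum_i a_i * sigma_i(v_i . x)."""
    z = x @ V.T  # projections v_i . x, shape (..., n)
    return sum(a[i] * sigmas[i % 3](z[..., i]) for i in range(n))

x = rng.normal(size=(5, d))
print(f(x))            # one scalar response per query point
print(f(np.zeros(d)))  # 0.0, since every sigma_i(0) = 0
```

The learner in the paper only sees (noisy) query access to such an `f`; the arrays `V`, `a`, and the list `sigmas` are hidden.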
Typical learning setups assume $x$ is drawn from some distribution, such as a Gaussian, and then ask to recover $f$ given polynomially many samples $(x,f(x))$. However, it is difficult to posit tractable but realistic distributional assumptions (common assumptions like $x$ being Gaussian are unrealistic), and furthermore, even then there are still computational barriers in simple settings with just a single feature [goel2020statistical, song2021cryptographic, damian2024computational]. Instead, we study a query learning model where we assume black-box query access to some function $\tilde{f}$ with $\|\tilde{f}-f\|_\infty\le\varepsilon$. While this is a stronger access model, it is natural from the lens of extracting features from a trained model. This query model also captures settings for model distillation and stealing, where a learner tries to recover something about a trained machine learning model through black-box API access [tramer2016stealing, finlayson2024logits, carlini2024stealing]. We now ask:

Question. Given queries to a function that is a sum of features, can we learn the underlying features? Are there barriers to learning if the features exhibit superposition?

This question is important both for understanding how to extract interpretable features (see e.g. [agarwal], which uses a special case of our model for interpretable deep learning) and from a fundamental computational learning theory perspective, where it builds on a vast existing body of work on learning single hidden layer neural networks. Within this literature there are two main points of comparison. First, there are numerous results in the standard "passive" setting, but the bottom line is that here, there is substantial evidence of computational hardness.
There are statistical query lower bounds even for much simpler special cases of our setting, such as learning a single ridge function with a general activation [goel2020statistical] and learning a linear combination of many ridge functions with a fixed (e.g. ReLU) activation [diakonikolas2020algorithms, goel2020superpolynomial], suggesting that a general learning guarantee is out of reach for efficient algorithms. In query learning models, there has been work specifically in the case of ReLU neural networks, i.e. $\sigma_i(z)=\mathrm{ReLU}(z)$ for all $i\in[n]$, but these algorithms are extremely tailored to specifically the ReLU activation function, relying on trying to find the kink in the ReLU [rolnick2020reverse, chen2021efficientlylearninghiddenlayer, daniely2021exact].

Main Result

Our main result is an algorithm for learning a general sum of features whose query complexity and runtime is polynomial in all relevant parameters and that, under mild and information-theoretically necessary assumptions (nontrivial separation between the feature directions $v_i$ and Lipschitzness of the functions $\sigma_i(\cdot)$; see Assumption 1), achieves the following guarantees:

1. Identifies all nonlinear feature directions $\{v_i\}$ up to sign and $\varepsilon$-error.
2. Reconstructs the associated univariate responses on $[-R,R]$, yielding a uniform $\varepsilon$-approximation to $f$ on the domain $\|x\|\le R$.

In other words, if some trained model is close to a linear combination of features in some known representation space, then we can efficiently recover the model from queries. The formal statements are in Theorem 8.1 and Corollary 8.2. Note that if some of the $\sigma_i(\cdot)$ were linear, then these directions would not be identifiable for trivial reasons. We emphasize that our result holds for overcomplete $v_i$, even when some of the $v_i$ are highly correlated, and for general response functions $\sigma_i$.
This circumvents the aforementioned computational hardness in the passive setting and significantly generalizes previous learning results, which either require restrictive assumptions on the $\sigma_i$ (e.g. ReLU) [chen2021efficientlylearninghiddenlayer, daniely2021exact] or on the $v_i$ (e.g. linear independence or near-orthogonality) [sedghi2016provable, oko2024learning].

Note on Identifiability

We also highlight the importance of the identifiability aspect of our result from the lens of feature learning and interpretability. As formulated in [elhage2022toy], one of the fundamental challenges in extracting features from modern models, and interpretability more broadly, is superposition: when the number of features is larger than the ambient dimension, the features become nonidentifiable. Indeed, this would be the case if the response functions were linear. However, from this viewpoint, our results actually give a new reason for optimism: the features are identifiable in a superposition as long as the response functions are nonlinear.

High Level Approach

Key to our result is a different type of algorithmic approach that departs from the typical method-of-moments recipe that is ubiquitous for learning latent variable models, especially over Euclidean spaces. The starting point is the basic observation that the Fourier transform $\hat{f}$ of $f$ is "sparse". Ignoring integrability issues for now, for a single ridge function $f(x)=\sigma(v^\top x)$, the Fourier transform is nonzero only on the line $\{tv\}_{t\in\mathbb{R}}$. Thus when $f$ is a sum of ridge functions, its Fourier transform is supported on only the lines through the origin in the directions $v_i$. Then, to recover the directions, we design an algorithm that can "locate the mass" in Fourier space. This paradigm has already been successful in boolean [goldreich1989hard, kushilevitz1991learning] and discrete [hassanieh2012nearly] settings.
Our algorithm draws inspiration from this approach: the main subroutine (Algorithm 2) involves an iterative algorithm for searching in Fourier space to locate the hidden directions. However, compared to discrete spaces, the continuity and unboundedness of Euclidean space pose technical challenges. We emphasize the conceptual novelty here: through carefully chosen geometric constructions and filter functions, we develop an algorithm that efficiently searches over a high-dimensional Euclidean Fourier domain. We provide a more detailed overview of our techniques in Section 3.

1.1 Related Work

GLMs and single and multi-index models. There is a substantial body of work on GLM/SIM estimators, which recover a function that depends only on the projection of the input onto one direction, under various distributional and noise assumptions [mccullagh2019generalized, ichimura1993semiparametric, kakade2011efficient, dudeja2018learning, bietti2022learning, damian2023smoothing, gollakota2023agnostically, zarifis2024robustly, damian2024computational]. Multi-index models (MIMs) generalize SIMs to functions that depend only on the projection of the input onto a constant-dimensional subspace. Recently there has also been a growing body of work on learning MIMs [bietti2023learning, diakonikolas2024agnostically, mousavi2024learning]. We refer the reader to [bruna2025survey] for a more detailed overview of this literature. Note that our setting, involving a linear combination of ridge functions, is very different because the function actually depends on the full $d$-dimensional space.

Shallow Neural Networks and Generalizations. In our setting, when all of the $\sigma_i$ are some fixed activation function such as sigmoid or ReLU, then $f$ can be viewed as a two-layer neural network. There is a large body of work on learning two-layer neural networks from samples, which aims to characterize the boundaries of efficient algorithms and computational hardness, e.g.
[awasthi2021efficient, diakonikolas2020algorithms, ge2018learning, goel2020superpolynomial, chen24faster]. Beyond understanding the computational landscape, there is also a line of work towards understanding specifically the dynamics of gradient descent for training shallow networks, e.g. [janzamin2016beatingperilsnonconvexityguaranteed, ge2017learningonehiddenlayerneuralnetworks, zhong2017recoveryguaranteesonehiddenlayerneural]. However, in the passive setting, there are exponential-in-$\min(d,n)$ SQ lower bounds, suggesting that a general, efficient (i.e. $\mathrm{poly}(d,n)$-time) algorithm is not possible. There has also been work on learning neural networks with query access [rolnick2020reverse, chen2021efficientlylearninghiddenlayer, daniely2021exact]. However, as mentioned earlier, these results are extremely tailored to ReLU activations, whereas our result allows for arbitrary activations $\sigma_i$. Linear combinations of more general activation functions are considered in [sedghi2016provable, oko2024learning]. These works also study a passive learning setting, and thus require structural assumptions on the $v_i$, namely that they are linearly independent in the former and nearly orthogonal in the latter, whereas we do not need any such conditions.

Fourier search in discrete spaces. Our algorithm for searching in a high-dimensional Fourier space draws inspiration from the Goldreich-Levin heavy-Fourier-coefficient search algorithm [goldreich1989hard, kushilevitz1991learning] and also bears resemblance to algorithms used for computing sparse Fourier transforms [hassanieh2012nearly]. However, a key difference is that we give an efficient algorithm for searching over a high-dimensional Euclidean space. We remark that the connection between Fourier sparsity and expressing functions as a sum of ridge functions dates back to the classical works [barron, candes, donoho], but our focus here is on making this connection algorithmic.

Query learning and model stealing.
With the rise of APIs that provide users with black-box access to trained machine learning models, there has been increased interest in understanding what we can extract from such an API, both from a theoretical [mahajan2023learning, liu2025model, golowich2025sequenceslogitsreveallow, golowich2025provablylearningmodernlanguage] and a practical perspective [tramer2016stealing, finlayson2024logits, carlini2024stealing]. Our result says that if a model is close to a sum of features in some representation space that we know, then we can recover the model from queries.

2 Preliminaries

We begin with some basic notation and facts that will be used throughout the paper. Throughout, we use the convention $i=\sqrt{-1}$.

2.1 Ridge Functions (Features)

A ridge function is a function that depends only on the projection of the input onto a single direction.

Definition 2.1 (Ridge Function (Feature)). A function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is a ridge function if it can be written as $f(x)=\sigma(v^\top x)$ for some unit vector $v\in\mathbb{R}^d$ and univariate function $\sigma$.

Remark 2.2. Whenever we write a ridge function in the above form, we will assume that $\|v\|=1$ (we can always ensure this by rescaling $\sigma$).

Given a ridge function, we can think of $v$ as encoding a feature direction, and the function $\sigma$ as an activation that controls how strongly the output responds to the strength of the feature. More generally, we can consider a sum or linear combination of features, each involving a different direction $v_i$ and possibly a different activation $\sigma_i$.

Definition 2.3 (Sum of Features). A function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is a sum of $n$ features if it can be written as a linear combination of $n$ ridge functions, i.e. \[ f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x). \]

2.2 Learning Setup

The main goal in this paper will be to recover an unknown sum of features from queries. We now describe the details of the problem setup.
There is an unknown function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ with \[ f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x). \] We receive query access to a function $\tilde{f}$ that satisfies $\|f-\tilde{f}\|_\infty\le\varepsilon$, and our goal will be to recover a description of a function that is close to $f$. We will be mostly interested in the overcomplete regime where $n\ge d$; this regime in particular has proven challenging for designing efficient algorithms. (When $n<d$, we can first learn the subspace spanned by the $v_i$ via a standard tensor method and then project onto this subspace to reduce to the case where $n\ge d$.)

We will make a few assumptions on the $v_i,\sigma_i$ to rule out degenerate or pathological examples.

Assumption 1. We make the following assumptions on $f$:
- The coefficients $a_i$ all satisfy $|a_i|\le 1$.
- The functions $\sigma_i$ all satisfy $\sigma_i(0)=0$.
- The functions $\sigma_i$ are all $L$-Lipschitz.
- The sine of the angle between $v_i$ and $v_j$ is at least $\gamma$ whenever $i\neq j$.

Remark 2.4. The first assumption is just for the sake of normalization, and the second one is without loss of generality since we can always query $\tilde{f}(0)$ and subtract it off. The Lipschitzness of the activations $\sigma_i$ is necessary since otherwise, even for a single function in one dimension, it would be impossible to approximate it from queries. While the final assumption may not be strictly necessary for just recovering the function, it is necessary for being able to recover the individual directions $v_i$, as it rules out degenerate examples where features are (almost) identical and cancel each other out.

We will also mention the following assumption, that the $\sigma_i$ are bounded. While this assumption is not necessary, it simplifies the exposition. We will first give a learning algorithm under this assumption, and then show in Section 8.2 (see Theorem 8.1) how to remove this secondary assumption.

Assumption 2.
Assume that the functions $\sigma_i$ satisfy $|\sigma_i(x)|\le 1$ for all $x\in\mathbb{R}, i\in[n]$.

The majority of the paper will be devoted to proving the following theorem:

Theorem 2.5. Under Assumptions 1 and 2, for any target accuracy $\varepsilon'$, target domain $R$, and failure probability $\delta$, there is some $N=\mathrm{poly}(d,L,R,n,1/\gamma,1/\varepsilon',\log 1/\delta)$ such that if $\varepsilon<1/N$, there is an algorithm that makes $\mathrm{poly}(N)$ queries and runs in $\mathrm{poly}(N)$ time and outputs a sum of features \[ \tilde{f}(x)=a_1'\sigma_1'(v_1'^\top x)+\dots+a_n'\sigma_n'(v_n'^\top x) \] such that with probability $1-\delta$, for all $x$ with $\|x\|\le R$, $|f(x)-\tilde{f}(x)|\le\varepsilon'$.

Remark 2.6. Note that with finitely many queries, it is only possible to guarantee closeness over a bounded domain, as opposed to everywhere, even for a single function in one dimension, and thus the restriction to $\|x\|\le R$ in the above theorem is necessary.

Theorem 2.5 in fact guarantees not just recovering the function, but also recovering all of the hidden directions $v_i$ for which the function $\sigma_i$ is nontrivial (see Lemma 6.9 for a formal statement). If some $\sigma_i$ were constant, then of course recovering the associated direction is impossible.

3 Technical Overview

In this section, we give an overview of the techniques and key ingredients that go into the proof of Theorem 2.5. Recall that the starting point is the intuition that if $f$ is a sum of $n$ features in directions $v_1,\dots,v_n$, then the Fourier transform $\hat{f}$ is nonzero only on the lines $\{tv_i\}_{t\in\mathbb{R}}$. First, to make this formal, we need to resolve the integrability issues. We define the Gaussian-reweighted function \[ f^{(\ell)}(x)=f(x)\exp\big(-\|x\|^2/(2\ell^2)\big), \] and we will choose $\ell$ sufficiently large. This ensures integrability and also that the Fourier transform remains concentrated around the lines $\{tv_i\}_{t\in\mathbb{R}}$.
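This concentration is easy to see numerically. Below is a minimal sketch (our own illustrative parameters, not from the paper) in $d=2$ with a single feature $\sigma=\sin$ and $v=e_1$: after Gaussian reweighting, essentially all of the Fourier power of $f^{(\ell)}$ lies in a thin tube of transverse width $\approx 1/\ell$ around the line $\{te_1\}$.

```python
import numpy as np

# f^(ell)(x) = sin(v . x) * exp(-|x|^2 / (2 ell^2)) on a 2D grid, with v = e1.
# Grid size, box length, and ell are arbitrary illustrative choices.
N, box, ell = 256, 40.0, 3.0
dx = box / N
coords = (np.arange(N) - N // 2) * dx
x1, x2 = np.meshgrid(coords, coords, indexing="ij")
g = np.sin(x1) * np.exp(-(x1**2 + x2**2) / (2 * ell**2))

G = np.fft.fft2(g)
freq = 2 * np.pi * np.fft.fftfreq(N, d=dx)  # angular frequencies per axis
power = np.abs(G) ** 2

# Fraction of Fourier power whose transverse frequency satisfies |y2| <= 3/ell.
mask = np.abs(freq) <= 3.0 / ell  # axis 1 indexes the x2 (transverse) frequency
frac = power[:, mask].sum() / power.sum()
print(frac)  # close to 1: the mass lies in a thin tube around the line {t e1}
```

For a sum of features, the same experiment shows one such tube per direction $v_i$, which is exactly the structure the search algorithm exploits.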
For a single ridge function $\rho(x)=\sigma(v^\top x)$, the Fourier transform can be written as (see 4.11) \[ \widehat{\rho^{(\ell)}}(y)=\widehat{\sigma^{(\ell)}}(v^\top y)\cdot\ell^{\,d-1}\exp\!\Big(-\frac{\ell^2}{2}\,\|y-(v^\top y)v\|^2\Big), \] i.e., it concentrates on a tube of width $\approx 1/\ell$ around the line $\{tv:t\in\mathbb{R}\}$. A sum of ridge functions thus yields a sum of such tubes in Fourier space.

3.1 High Level Idea

To recover the directions $v_i$, we want an algorithm that can "locate the mass" in Fourier space. The famous result of Goldreich and Levin gives an efficient algorithm that searches for heavy Fourier coefficients of a boolean function: even though there are exponentially many possible coefficients, the algorithm gets around this by querying the total weight on various slices of the hypercube, and only zooming in further on the slices that have nontrivial weight. Our algorithm draws inspiration from this approach, but we must overcome additional challenges in Euclidean space. To iteratively refine the search space, we use hyperplane-like slices. Given an orthonormal basis, say $b_1,\dots,b_d$, roughly we first search over hyperplanes orthogonal to $b_1$ (say, discretized to some grid). Then among the hyperplanes that have nontrivial Fourier weight, we search within those along hyperplanes orthogonal to both $b_1$ and $b_2$. Iterating this procedure, at each level $k$ we maintain a collection of "candidate" $k$-tuples, say $\{(\alpha_1^{(i)},\dots,\alpha_k^{(i)})\}_{i\in[m]}$, for the first $k$ coordinates. Then within each of these, we enumerate the possibilities for the next coordinate $\alpha_{k+1}$ and recurse on all candidates with nontrivial Fourier mass. This high-level approach is the foundation of our algorithm. To actually implement this approach, we need to address two key aspects, which we discuss in the sections below:

1. How can we estimate the total Fourier mass on a hyperplane?
2.
How can we guarantee that the algorithm has bounded recursion while ensuring that we recover all relevant directions?

3.2 Estimating the Fourier Mass

Again, to address integrability issues from restricting strictly to a hyperplane, we will add some small thickness by convolving with a Gaussian. Formally, for any function $g$, vector $v\in\mathbb{R}^d$, and $A\succeq 0$, we define \[ I^*_g(v,A):=\int_{\mathbb{R}^d}|\hat{g}(y)|^2\,e^{-(y-v)^\top A(y-v)}\,dy. \tag{1} \] To simulate a hyperplane orthogonal to, say, $b_1,\dots,b_k$, we set $A=C(b_1b_1^\top+\dots+b_kb_k^\top)$ for some sufficiently large $C$ (for technical reasons later on, we will need different values of $C$ for different coordinates to bound the branching in the search algorithm). Now our goal will be to estimate $I^*_{f^{(\ell)}}(v,A)$ for various $v,A$. A simple but critical observation is that by substituting the definition \[ \hat{g}(y)=\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d}e^{-iy^\top x}g(x)\,dx \] into the expression for $I^*_g(v,A)$, switching the order of integration, and explicitly computing the Gaussian integrals, we can rewrite it as an expectation over two-point correlations of $g$ (see 5.2): \[ I^*_g(v,A)=\mathbb{E}_{\Delta\sim N(0,2A)}\Big[e^{-iv^\top\Delta}\cdot\int_{\mathbb{R}^d}g(x)\,g(x+\Delta)\,dx\Big]. \] Now setting $g(x)=f^{(\ell)}(x)=f(x)e^{-\|x\|^2/(2\ell^2)}$, we can write the above as an expectation over both $x$ and $\Delta$ involving $f(x)f(x+\Delta)$, and we can estimate this by sampling and querying the values $f(x),f(x+\Delta)$ (see Algorithm 1).

Remark 3.1. While the Gaussian thickening in (1) is natural, we remark that the choice of the weight function is actually critical. For instance, if we had used the uniform distribution over a box or ball instead of a Gaussian, then the analogous formula could not be easily written as an expectation over a probability distribution (due to a lack of positivity). This would then make it much harder to estimate via sampling, due to difficulties in controlling the variance.
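The two-point-correlation identity can be turned into a Monte Carlo estimator directly. Below is a minimal sketch in $d=2$ with a single feature $\sigma=\sin$, $v_1=e_1$, and $A=CI$. The inner integral $\int g(x)g(x+\Delta)\,dx$ is handled by an importance-sampling step ($x\sim N(-\Delta/2,(\ell^2/2)I)$, which makes the Gaussian reweighting part of the sampling density); that particular step is our own simplification for the demo, not necessarily the paper's Algorithm 1, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, ell, C = 2, 3.0, 4.0  # ambient dim, reweighting scale, slice sharpness

def f(x):
    # Toy "sum of features" with the single hidden direction v1 = e1.
    return np.sin(x[..., 0])

def fourier_mass(v, n_samples=100_000):
    # Monte Carlo estimate of I*(v, A) for g = f^(ell), A = C * I, using
    # I*(v,A) = E_{Delta ~ N(0, 2A)}[ e^{-i v.Delta} \int g(x) g(x+Delta) dx ],
    # with x ~ N(-Delta/2, (ell^2/2) I) importance-sampling the inner integral.
    delta = rng.normal(scale=np.sqrt(2 * C), size=(n_samples, d))
    x = rng.normal(scale=ell / np.sqrt(2), size=(n_samples, d)) - delta / 2
    w = np.exp(-np.sum(delta**2, axis=1) / (4 * ell**2))  # envelope factor
    corr = f(x) * f(x + delta)                             # queried two-point values
    phase = np.cos(delta @ v)                              # real part of e^{-i v.Delta}
    return (np.pi * ell**2) ** (d / 2) * np.mean(phase * w * corr)

m_on = fourier_mass(np.array([1.0, 0.0]))   # probe on the hidden line {t v1}
m_off = fourier_mass(np.array([0.0, 1.0]))  # probe away from it
print(m_on, m_off)  # the on-line probe carries far more Fourier mass
```

The estimator only ever evaluates `f` at sampled query points, which is exactly the black-box access model: the search algorithm keeps a candidate $(v,A)$ alive precisely when this estimate is nontrivial.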
3.3 Bounding the Search Algorithm

First we discuss bounding the total number of candidates that our algorithm recurses on. By choosing the orthonormal basis $b_1,\dots,b_d$ randomly, we can ensure that with high probability, once we fix the first two coordinates, any hyperplane orthogonal to both $b_1,b_2$ that is not too close to the origin intersects at most one of the lines $\{tv_i\}_{t\in\mathbb{R}}$. We will simply brute-force over a sufficiently fine grid for the first two coordinates. Now consider a candidate for the first $k$ coordinates $\alpha_1,\dots,\alpha_k$ for $k\ge 2$. We then probe candidates for $\alpha_{k+1}$ by evaluating $I^*_{f^{(\ell)}}(v,A)$ at \[ v=\sum_{j\le k+1}\alpha_j b_j,\qquad A=\sum_{j\le k+1}K_j\,b_jb_j^\top, \] with $K_1=K_2=K_{k+1}=C_2$ and $K_3=\dots=K_k=C_1$, where $C_2\gg C_1$. To see why we need different scales for different coordinates, ignore the first two coordinates (since we are brute-forcing over these). For the remaining coordinates, we search over a finer grid, of width $\approx 1/\sqrt{C_2}$. If we included all candidates for which the mass $I^*_{f^{(\ell)}}(v,A)$ is nontrivial, then this could blow up the number of candidates by a constant factor (since e.g. both of the grid points on either side of the hidden Fourier spike would be valid candidates). Instead we choose only a sufficiently separated set of candidates. Then once we fix a coordinate, we need to relax the width of the Gaussian in that direction to $\approx 1/\sqrt{C_1}$ to ensure that we lose only a small fraction of the Fourier mass (see Algorithm 2 for details).

Finally, we discuss how we guarantee that our algorithm recovers all relevant directions. Indeed, to recover a direction $v_i$, we need $\hat{f}(tv_i)$ to be nontrivial for some $t$ bounded away from $0$. In fact, we need $t\gg 1/\ell$, since the "tubes" on which $\hat{f}$ is concentrated now have width $\approx 1/\ell$. Ensuring this is not totally trivial.
As an example, if $\sigma_i(\cdot)$ were a polynomial function like $x^3$, then this need not be true: for large $\ell$, we can easily compute that the Fourier transform of $x^3e^{-x^2/(2\ell^2)}$ decays exponentially beyond the width scale $1/\ell$. How can we rule out such examples? Because the functions $\sigma_i(\cdot)$ are Lipschitz, we can choose $\ell$ sufficiently large so that the only way $\sigma_i(\cdot)$ can be close to a polynomial is if it is linear. For $\sigma_i(\cdot)$ that are close to linear, there is an obvious lack of identifiability, since we could easily have two different sets of vectors $\{v_1,\dots,v_s\}$ and $\{v_1',\dots,v_s'\}$ such that $a_1v_1+\dots+a_sv_s=a_1v_1'+\dots+a_sv_s'$, which of course implies that for all $x$, \[ a_1v_1^\top x+\dots+a_sv_s^\top x=a_1v_1'^\top x+\dots+a_sv_s'^\top x. \] Fortunately, this does not affect our ability to recover the overall function, since we can learn the "linear part" separately. For $\sigma_i(\cdot)$ that are not too close to linear, we actually show that there must be some $t\gtrsim 1/\sqrt{\ell}\gg 1/\ell$ such that $\widehat{\sigma_i}(t)$ is non-negligible (see Lemma 4.9). This can then be used to argue that our algorithm must recover the corresponding direction $v_i$. Once we recover the directions $v_i$, we can recover the functions by interpolating in Fourier space. The main bound for this part is Claim 4.7, which we use to bound the error from interpolating the Fourier transform only on a discretization of a bounded interval instead of over all of $\mathbb{R}$.

3.4 Organization

In Section 4, we present some general bounds on functions and their Fourier transforms that will be used in the analysis. Then in Section 5, we present our machinery for estimating the Fourier mass $I^*_{f^{(\ell)}}(v,A)$. In Section 6, we present our algorithm that makes use of this machinery to locate the hidden directions $v_i$. In Section 7, we show how to recover the function $f$ given estimates for the directions $v_i$.
Finally, in Section 8, we put everything together to prove our main theorems.

4 Properties of Functions

Fourier analysis is a ubiquitous tool, and it will play an important role in our learning algorithm. We begin with some basic definitions and properties.

Definition 4.1 (Fourier Transform). For a function $f:\mathbb{R}\rightarrow\mathbb{C}$, we define its Fourier transform $\hat{f}:\mathbb{R}\rightarrow\mathbb{C}$ as \[ \hat{f}(y)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-iyx}f(x)\,dx. \] For a multivariable function $f:\mathbb{R}^d\rightarrow\mathbb{C}$, its Fourier transform $\hat{f}:\mathbb{R}^d\rightarrow\mathbb{C}$ is defined as \[ \hat{f}(y)=\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d}e^{-iy^\top x}f(x)\,dx. \]

Fact 4.2 (Parseval's Identity). For a function $f:\mathbb{R}^d\rightarrow\mathbb{C}$ such that $|f(x)|^2$ is integrable, we have \[ \int_{\mathbb{R}^d}|\hat{f}(y)|^2\,dy=\int_{\mathbb{R}^d}|f(x)|^2\,dx. \]

The idea of reweighting by a Gaussian distribution, and the interplay between the Fourier transform and Gaussian reweighting, will also be important in our analysis. We use the following notation.

Definition 4.3. For $\mu\in\mathbb{R}^d,\Sigma\in\mathbb{R}^{d\times d}$, we let $N_{\mu,\Sigma}$ denote the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. We use $N_{\mu,\Sigma}(x)$ to denote the density function of the Gaussian at a point $x$.

We will often consider the Gaussian weighting of a function, which we denote as follows.

Definition 4.4. For a function $f:\mathbb{R}^d\rightarrow\mathbb{C}$ and $\ell>0$, we define $f^{(\ell)}(x)=f(x)e^{-\|x\|^2/(2\ell^2)}$.

4.1 Basic Fourier Transform Bounds

We begin by proving a few basic bounds on the Fourier transform of a function after reweighting. First, we have the following $L^\infty$ bound.

Fact 4.5. Let $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be a function with $|\sigma(x)|\le 1$ for all $x$. For any $\ell>0$, let $\sigma^{(\ell)}(x)$ be as defined in Definition 4.4. Then for all $y\in\mathbb{R}$, $|\widehat{\sigma^{(\ell)}}(y)|\le\ell$.

Proof. Using $|\sigma|\le 1$ and the triangle inequality, \[ |\widehat{\sigma^{(\ell)}}(y)|=\Big|\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-iyx}\sigma(x)e^{-x^2/(2\ell^2)}\,dx\Big|\le\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-x^2/(2\ell^2)}\,dx=\ell. \]
∎

Next, we bound the Lipschitz constant of the Fourier transform.

Claim 4.6. If $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ satisfies $|\sigma(x)|\le 1$ for all $x$, then the Fourier transform $\widehat{\sigma^{(\ell)}}$ is $\ell^2$-Lipschitz.

Proof. Write \[ \frac{d}{dy}\widehat{\sigma^{(\ell)}}(y)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}(-ix)e^{-iyx}\sigma(x)e^{-x^2/(2\ell^2)}\,dx. \] Now the triangle inequality immediately implies \[ \Big|\frac{d}{dy}\widehat{\sigma^{(\ell)}}(y)\Big|\le\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}|x|e^{-x^2/(2\ell^2)}\,dx\le\ell^2. \] ∎

Finally, we prove a quantitative bound showing that on a bounded interval we can reconstruct $\sigma$ using only the portion of $\widehat{\sigma^{(\ell)}}$ on a finite frequency window.

Claim 4.7. Assume $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ has $|\sigma(x)|\le 1$ for all $x$ and that $\sigma$ is $L$-Lipschitz. For any $\ell\ge R\ge 1$ and cutoff $B>0$, define \[ \tilde{\sigma}_B(x):=e^{\frac{x^2}{2\ell^2}}\cdot\frac{1}{\sqrt{2\pi}}\int_{-B}^{B}e^{iyx}\,\widehat{\sigma^{(\ell)}}(y)\,dy. \] Then for all $|x|\le R$, \[ |\sigma(x)-\tilde{\sigma}_B(x)|\le 2L\sqrt{\frac{2\ell}{B}}. \tag{2} \] In particular, given any $\varepsilon>0$, choosing \[ B\ge\frac{8L^2\ell}{\varepsilon^2} \tag{3} \] ensures $\max_{|x|\le R}|\sigma(x)-\tilde{\sigma}_B(x)|\le\varepsilon$.

Proof. By Fourier inversion, for every $x\in\mathbb{R}$, \[ \sigma(x)=e^{\frac{x^2}{2\ell^2}}\cdot\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{iyx}\,\widehat{\sigma^{(\ell)}}(y)\,dy. \] Therefore, for any $B>0$ and any $x\in\mathbb{R}$, \[ \sigma(x)-\tilde{\sigma}_B(x)=e^{\frac{x^2}{2\ell^2}}\cdot\frac{1}{\sqrt{2\pi}}\int_{|y|>B}e^{iyx}\,\widehat{\sigma^{(\ell)}}(y)\,dy, \] and thus, for $|x|\le R$, \[ |\sigma(x)-\tilde{\sigma}_B(x)|\le\int_{|y|>B}|\widehat{\sigma^{(\ell)}}(y)|\,dy. \tag{4} \] To bound the tail $L^1$ norm, apply Cauchy-Schwarz with the weight $|y|^{-1}$: \[ \int_{|y|>B}|\widehat{\sigma^{(\ell)}}(y)|\,dy\le\Big(\int_{|y|>B}\frac{dy}{y^2}\Big)^{1/2}\Big(\int_{\mathbb{R}}|y\,\widehat{\sigma^{(\ell)}}(y)|^2\,dy\Big)^{1/2}=\sqrt{\frac{2}{B}}\,\big\|y\,\widehat{\sigma^{(\ell)}}\big\|_2. \] By Parseval, $\|y\,\widehat{\sigma^{(\ell)}}\|_2=\|(\sigma^{(\ell)})'\|_2$.
Using $|\sigma'|\le L$ almost everywhere and $|\sigma|\le 1$,
\[ \|(\sigma^{(\ell)})'\|_2=\Big\|\sigma'e^{-\frac{x^2}{2\ell^2}}-\frac{x}{\ell^2}\,\sigma\,e^{-\frac{x^2}{2\ell^2}}\Big\|_2\le\Big(\int L^2e^{-\frac{x^2}{\ell^2}}\,dx\Big)^{1/2}+\frac{1}{\ell^2}\Big(\int x^2e^{-\frac{x^2}{\ell^2}}\,dx\Big)^{1/2}\,. \]
Computing the Gaussian integrals gives
\[ \|(\sigma^{(\ell)})'\|_2\le 2L\sqrt{\ell}\,. \]
Combining with (4) yields (2). The stated choice (3) of $B$ then guarantees the error is at most $\varepsilon$ on $[-R,R]$. ∎

4.2 Non-degeneracy for Univariate Functions

In our learning algorithm, our goal will be to identify all of the directions $v_i$. However, if the corresponding $\sigma_i$ is a constant function, then this is impossible. To specify the set of directions that we will be able to uniquely identify, we introduce the following quantitative notion of non-degeneracy for univariate functions.

Definition 4.8. For $R,\varepsilon>0$, we say a function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ is $(R,\varepsilon)$-nondegenerate if there are $x_1,x_2\in[-R,R]$ with $\sigma(x_1)-\sigma(x_2)\ge\varepsilon$.

We prove the following lemma, which lower bounds the Fourier weight of a non-degenerate function. The point of this lemma is that it implies (see Corollary 4.10) that if a function is non-degenerate, then a non-trivial portion of the Fourier weight of its Gaussian reweighting must lie in a certain frequency band that is bounded away from both zero and infinity. The bounds on this region will be important in our learning algorithm later on.

Lemma 4.9. Let $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be a function with $|\sigma(x)|\le 1$ for all $x$. For any $\ell>0$, let $\sigma^{(\ell)}(x)$ be as defined in Definition 4.4. Assume additionally that $\sigma$ is $L$-Lipschitz and $(R,\varepsilon)$-nondegenerate where $R\ge 1,\varepsilon<1$, and that $\ell\ge 20(R+L)/\varepsilon$. Then
\[ \int_{-\infty}^{\infty}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,|y|^2\,e^{-y^2/\ell^2}\,dy\ge\frac{\varepsilon^2}{8R}\,. \]

Proof. Define
\[ h(x,\ell):=\int_{-\infty}^{\infty}\frac{\ell}{\sqrt{2\pi}}\,\sigma^{(\ell)}(x-z)\,e^{-z^2\ell^2/2}\,dz\,, \]
where we view $h$ as a function of $x$.
A direct calculation shows
\[ \hat{h}(y,\ell)=\widehat{\sigma^{(\ell)}}(y)\,e^{-y^2/(2\ell^2)}\,. \]
By the assumed choice of $\ell$ and the assumption that $\sigma$ is $L$-Lipschitz, standard Gaussian estimates give
\[ \sup_{x\in[-R,R]}|h(x,\ell)-\sigma(x)|\le 0.1\varepsilon\,. \]
Hence, there must be $x_1,x_2\in[-R,R]$ such that
\[ h(x_1,\ell)-h(x_2,\ell)\ge\varepsilon-2\cdot 0.1\varepsilon=0.8\varepsilon\ge\varepsilon/2\,. \]
Therefore $\int_{-R}^{R}|h'(x,\ell)|\,dx\ge\varepsilon/2$, and by Cauchy–Schwarz,
\[ \int_{-R}^{R}|h'(x,\ell)|^2\,dx\ge\frac{(\varepsilon/2)^2}{2R}=\frac{\varepsilon^2}{8R}\,. \]
By Parseval,
\[ \int_{-\infty}^{\infty}\big|\widehat{h'}(y,\ell)\big|^2\,dy=\int_{-\infty}^{\infty}|h'(x,\ell)|^2\,dx\ge\frac{\varepsilon^2}{8R}\,. \]
Using $\widehat{h'}(y,\ell)=iy\,\hat{h}(y,\ell)=iy\,e^{-y^2/(2\ell^2)}\,\widehat{\sigma^{(\ell)}}(y)$, we obtain
\[ \int_{-\infty}^{\infty}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,|y|^2\,e^{-y^2/\ell^2}\,dy=\int_{-\infty}^{\infty}\big|\widehat{h'}(y,\ell)\big|^2\,dy\ge\frac{\varepsilon^2}{8R}\,, \]
as claimed. ∎

Using Lemma 4.9, we can now prove the following corollary, which lower bounds the Fourier weight of a non-degenerate function in a bounded frequency band that is also bounded away from zero.

Corollary 4.10. Assume $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ satisfies $|\sigma(x)|\le 1$ and $\sigma$ is $L$-Lipschitz and $(R,\varepsilon)$-nondegenerate where $R\ge 1,\varepsilon<1$, and let $\ell$ be some parameter with $\ell\ge 20(R+L)/\varepsilon$. Define
\[ a:=\frac{\varepsilon}{8}\sqrt{\frac{1}{R\ell}}\qquad\text{and}\qquad B:=4\ell\log\Big(\frac{\ell R}{\varepsilon}\Big)\,. \]
Set $S^{(\ell)}:=[-B,-a]\cup[a,B]$. Then
\[ \int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,dy\ge\frac{\varepsilon^2}{8R\ell^2}\,. \]

Proof. Throughout, write $W(y):=y^2e^{-y^2/\ell^2}$. From Lemma 4.9, we have
\[ \int_{\mathbb{R}}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,W(y)\,dy\ge\frac{\varepsilon^2}{8R}\,. \tag{5} \]
Decompose $\mathbb{R}$ into the "good" region $S^{(\ell)}$ and the "bad" region $\mathcal{B}^{(\ell)}:=(-a,a)\cup\{|y|>B\}$. Then
\[ \int_{\mathbb{R}}\big|\widehat{\sigma^{(\ell)}}\big|^2W=\int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2W+\int_{\mathcal{B}^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2W\,. \]
Let
\[ \alpha:=\sup_{y\in S^{(\ell)}}W(y)\,,\qquad\beta:=\sup_{y\in\mathcal{B}^{(\ell)}}W(y)\,,\qquad M:=\int_{\mathbb{R}}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,dy\,. \]
By the definition of $\sigma^{(\ell)}$, Parseval, and $|\sigma(x)|\le 1$,
\[ M=\int_{\mathbb{R}}|\sigma^{(\ell)}(x)|^2\,dx\le\int_{\mathbb{R}}e^{-x^2/\ell^2}\,dx=\ell\sqrt{\pi}\,. \tag{6} \]
Bounding $\alpha$. Since $W$ is even, increases on $[0,\ell]$, and decreases on $[\ell,\infty)$, and $\ell\in[a,B]$ (because $a\ll\ell$ and $B\ge 4\ell$), we have
\[ \alpha=W(\ell)=\frac{\ell^2}{e}\,. \tag{7} \]
Bounding $\beta$. On $(-a,a)$, $W(y)\le a^2$. On the tails $|y|>B$ we use monotonicity of $W$ on $[\ell,\infty)$:
\[ \sup_{|y|>B}W(y)=W(B)=B^2e^{-B^2/\ell^2}\,. \]
By the choice of $B$, we have
\[ W(B)=16\ell^2\log^2\Big(\frac{\ell R}{\varepsilon}\Big)e^{-16\log^2(\frac{\ell R}{\varepsilon})}\le\frac{\varepsilon^2}{64R\ell}\,. \]
Therefore
\[ \beta\le\max\{a^2,\,W(B)\}\le\frac{\varepsilon^2}{64R\ell}\,. \tag{8} \]
Assembling the bounds. From (5) and the definitions,
\[ \frac{\varepsilon^2}{8R}\le\int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2W+\int_{\mathcal{B}^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2W\le\alpha\int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2+\beta\int_{\mathcal{B}^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2\le(\alpha-\beta)\int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}\big|^2+\beta M\,. \]
Hence
\[ \int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,dy\ge\frac{\frac{\varepsilon^2}{8R}-\beta M}{\alpha-\beta}\,. \tag{9} \]
Using (6) and (8),
\[ \beta M\le\Big(\frac{\varepsilon^2}{64R\ell}\Big)\ell\sqrt{\pi}=\frac{\sqrt{\pi}}{64}\cdot\frac{\varepsilon^2}{R}\,. \]
Next, the denominator clearly satisfies $\alpha-\beta\le\alpha=\ell^2/e$. Substituting everything back into (9):
\[ \int_{S^{(\ell)}}\big|\widehat{\sigma^{(\ell)}}(y)\big|^2\,dy\ge\frac{\varepsilon^2}{16R}\cdot\frac{e}{\ell^2}\ge\frac{\varepsilon^2}{8R\ell^2}\,, \]
which is the claimed bound. ∎

4.3 Fourier Transform of Ridge Functions

So far in this section, we have focused on bounds for univariate functions. The following formula for the Fourier transform of a Gaussian-reweighted ridge function in high dimensions will be important for helping us reduce to a setting where we can apply our univariate estimates.

Claim 4.11. Let $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be a function. Define $f:\mathbb{R}^d\rightarrow\mathbb{R}$ as $f(x)=\sigma(v^\top x)$ for some unit vector $v\in\mathbb{R}^d$. Then the Fourier transform of $f^{(\ell)}(x)$ is given by
\[ \widehat{f^{(\ell)}}(y)=\widehat{\sigma^{(\ell)}}(v^\top y)\cdot\ell^{d-1}e^{-\ell^2\|y-(v^\top y)v\|^2/2}\,. \]

Proof.
Extend the unit vector $v$ to an orthonormal basis of $\mathbb{R}^d$. Write any $x\in\mathbb{R}^d$ as $x=t\,v+z$ with $t\in\mathbb{R}$ and $z\in v^\perp$, and decompose $y$ as $y=\alpha v+w$ where $\alpha:=v^\top y$ and $w:=y-(v^\top y)v\in v^\perp$. Then $y^\top x=\alpha t+w^\top z$ and $\|x\|^2=t^2+\|z\|^2$. By Definition 4.1,
\[ \widehat{f^{(\ell)}}(y)=\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}}\int_{v^\perp}e^{-i(\alpha t+w^\top z)}\,\sigma(t)\,e^{-\frac{t^2}{2\ell^2}}\,e^{-\frac{\|z\|^2}{2\ell^2}}\,dz\,dt\,. \]
The integrals factor:
\[ \widehat{f^{(\ell)}}(y)=\frac{1}{(2\pi)^{d/2}}\Big(\int_{\mathbb{R}}e^{-i\alpha t}\,\sigma(t)\,e^{-\frac{t^2}{2\ell^2}}\,dt\Big)\Big(\int_{v^\perp}e^{-iw^\top z}\,e^{-\frac{\|z\|^2}{2\ell^2}}\,dz\Big)\,. \]
The first bracket equals $\sqrt{2\pi}\,\widehat{\sigma^{(\ell)}}(\alpha)$. The second bracket is the standard Gaussian Fourier transform on $v^\perp\cong\mathbb{R}^{d-1}$, meaning
\[ \int_{z\in v^\perp}e^{-iw^\top z}\,e^{-\frac{\|z\|^2}{2\ell^2}}\,dz=(2\pi)^{\frac{d-1}{2}}\,\ell^{d-1}\,e^{-\frac{\ell^2\|w\|^2}{2}}\,. \]
Combining and noting that the $(2\pi)$ factors cancel gives
\[ \widehat{f^{(\ell)}}(y)=\widehat{\sigma^{(\ell)}}(v^\top y)\,\ell^{d-1}\,e^{-\ell^2\|y-(v^\top y)v\|^2/2} \]
as desired. ∎

We also have the following basic bound on the Lipschitzness of the Fourier transform in higher dimensions.

Claim 4.12. Let $f:\mathbb{R}^d\rightarrow\mathbb{C}$ be a function that has $|f(x)|\le 1$ for all $x$. Then $\widehat{f^{(\ell)}}$ is $\ell^{d+1}$-Lipschitz.

Proof. We can write
\[ \nabla\widehat{f^{(\ell)}}(y)=\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d}(-ix)e^{-iy^\top x}f(x)e^{-\|x\|^2/(2\ell^2)}\,dx\,. \]
For any unit vector $v$, the triangle inequality gives
\[ \big|\langle v,\nabla\widehat{f^{(\ell)}}(y)\rangle\big|\le\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d}e^{-\|x\|^2/(2\ell^2)}\,|\langle v,x\rangle|\,dx\le\ell^{d+1}\,, \]
where the last step follows from the computation in Claim 4.6 and the fact that the integral factorizes over $v$ and $v^\perp$. This implies $\|\nabla\widehat{f^{(\ell)}}(y)\|\le\ell^{d+1}$ as desired.
∎

5 Estimating Fourier Weight

An important subroutine in our algorithm will be using queries to estimate the Fourier mass of the rescaled function $f^{(\ell)}$ (recall Definition 4.4) around a point $v\in\mathbb{R}^d$, reweighted by a Gaussian with covariance $A^{-1}$ for some positive definite matrix $A$. The first part of the analysis in this section will hold for a generic function $g:\mathbb{R}^d\rightarrow\mathbb{R}$; later we will apply it to the specific case $g=f^{(\ell)}$. We begin with the following definition of the weighted Fourier mass.

Definition 5.1. For $g:\mathbb{R}^d\rightarrow\mathbb{R}$, we define the quantity
\[ I^*_g(v,A):=\int_{\mathbb{R}^d}|\hat{g}(y)|^2e^{-(y-v)^\top A(y-v)}\,dy\,. \]

Crucial to estimating this quantity is the following formula, which rewrites $I^*_g(v,A)$ as a quadratic form in $g$ itself, rather than its Fourier transform. This will allow us to estimate $I^*_g(v,A)$ via sampling. The crucial point about this formula is that it can be interpreted as an integral over $x$, of an expectation over a distribution of $\Delta$, of $g(x)g(x+\Delta)$ times a unit complex rotation, where the distribution is independent of $g$.

Claim 5.2. Let $g:\mathbb{R}^d\rightarrow\mathbb{R}$ be such that $|g|,|g|^2$ are integrable. Then for any $v\in\mathbb{R}^d$ and $A\in\mathbb{R}^{d\times d}$ with $A\succ 0$,
\[ I^*_g(v,A)=\int_{\mathbb{R}^d}\Big(e^{-iv^\top\Delta}\int_{\mathbb{R}^d}g(x+\Delta)g(x)\,dx\Big)\,N_{0,2A}(\Delta)\,d\Delta\,. \]

Proof. Since $\hat{g}(y)=(2\pi)^{-d/2}\int g(x)e^{-iy^\top x}\,dx$, we have
\[ |\hat{g}(y)|^2=(2\pi)^{-d}\iint g(x_1)g(x_2)e^{-iy^\top(x_1-x_2)}\,dx_1\,dx_2\,. \]
Multiply by $w(y):=e^{-(y-v)^\top A(y-v)}$ and integrate in $y$:
\[ \int|\hat{g}(y)|^2w(y)\,dy=(2\pi)^{-d}\iint g(x_1)g(x_2)\,\mathcal{I}(x_1-x_2)\,dx_1\,dx_2\,, \]
where
\[ \mathcal{I}(\Delta)=\int_{\mathbb{R}^d}e^{-iy^\top\Delta}w(y)\,dy=(2\pi)^{d/2}\,\hat{w}(\Delta)\,. \]
Shifting $y=z+v$ and using the standard Gaussian Fourier transform,
\[ \hat{w}(\Delta)=(2\pi)^{-d/2}e^{-iv^\top\Delta}\int e^{-z^\top Az-iz^\top\Delta}\,dz=(2\pi)^{-d/2}e^{-iv^\top\Delta}\,\pi^{d/2}(\det A)^{-1/2}e^{-\frac{1}{4}\Delta^\top A^{-1}\Delta}\,. \]
Hence
\[ \mathcal{I}(\Delta)=e^{-iv^\top\Delta}\,\pi^{d/2}(\det A)^{-1/2}e^{-\frac{1}{4}\Delta^\top A^{-1}\Delta}\,. \]
Changing variables $(x_1,x_2)=(x+\Delta,x)$ and recognizing
\[ N_{0,2A}(\Delta)=(4\pi)^{-d/2}(\det A)^{-1/2}e^{-\frac{1}{4}\Delta^\top A^{-1}\Delta}\,, \]
we note the constants cancel: $(2\pi)^{-d}\cdot\pi^{d/2}(\det A)^{-1/2}=(4\pi)^{-d/2}(\det A)^{-1/2}$. Thus
\[ \int_{\mathbb{R}^d}|\hat{g}(y)|^2e^{-(y-v)^\top A(y-v)}\,dy=\int_{\mathbb{R}^d}\Big(e^{-iv^\top\Delta}\int_{\mathbb{R}^d}g(x+\Delta)g(x)\,dx\Big)\,N_{0,2A}(\Delta)\,d\Delta\,. \]
∎

Recall that in our original learning setup, we have query access to a function $\tilde{f}:\mathbb{R}^d\rightarrow\mathbb{R}$ that is close to $f$ in $L_\infty$ norm. We now present our algorithm that makes use of Claim 5.2 to estimate $I^*_{f^{(\ell)}}(v,A)$ using queries to $\tilde{f}$. In order to implement this, there is one additional wrinkle that we now explain. Since Claim 5.2 requires integrating $g(x)g(x+\Delta)$ over all $x$, we will set $g$ to be $f^{(\ell)}$ to ensure integrability. Because we have query access to $\tilde{f}$, which is close to $f$, we can then rewrite the integral over all $x$ of $f^{(\ell)}(x)f^{(\ell)}(x+\Delta)$ as an expectation, over $x$ drawn from an appropriate Gaussian distribution, of $f(x)f(x+\Delta)$, which we can then estimate by sampling.

Input: Query access to $\tilde{f}:\mathbb{R}^d\rightarrow\mathbb{R}$
Input: $\ell>0$, vector $v\in\mathbb{R}^d$, matrix $A\in\mathbb{R}^{d\times d}$ with $A\succeq 0$, sample budget $m$
Function $\mathrm{EstWeight}_{\tilde{f},\ell,m}(v,A)$:
  Set $C\leftarrow(\pi\ell^2)^{\frac{d}{2}}$.
  for $j=1$ to $m$ do:
    Draw $\Delta\leftarrow N_{0,2A}$
    Draw $Z\sim N_{0,\frac{\ell^2}{2}I_d}$
    $x_-\leftarrow Z-\frac{1}{2}\Delta$;  $x_+\leftarrow Z+\frac{1}{2}\Delta$
    $u\leftarrow\tilde{f}(x_-)$;  $w\leftarrow\tilde{f}(x_+)$
    $\phi\leftarrow e^{-\frac{\|\Delta\|^2}{4\ell^2}-iv^\top\Delta}$
    Set $c_j\leftarrow\phi\cdot u\cdot w$
  end for
  return $I\leftarrow C\cdot\frac{c_1+\dots+c_m}{m}$

Algorithm 1: Estimating Reweighted Fourier Mass

The following corollary shows that the output of Algorithm 1 is indeed a good estimate of the desired weighted Fourier mass, provided that $\tilde{f}$ is sufficiently close to $f$ and that we use enough samples.

Corollary 5.3. Let $f:\mathbb{R}^d\rightarrow\mathbb{R}$ be $L$-Lipschitz with $|f(x)|\le 1$, and fix $\ell>0$. Let $f^{(\ell)}$ be as defined in Definition 4.4. Given query access to $\tilde{f}$ with $\|f-\tilde{f}\|_\infty\le\varepsilon$ and also parameters $v\in\mathbb{R}^d,A\in\mathbb{R}^{d\times d}$ and sample budget $m$, compute $I=\mathrm{EstWeight}_{\tilde{f},\ell,m}(v,A)$ as defined in Algorithm 1. For any $\delta\in(0,1)$, with probability at least $1-\delta$,
\[ \big|I-I^*_{f^{(\ell)}}(v,A)\big|\le 8(\pi\ell^2)^{\frac{d}{2}}\Big(\varepsilon+\sqrt{\frac{2\log(8/\delta)}{m}}\Big)\,. \]

Proof. Let $C:=(\pi\ell^2)^{d/2}$. For each $j\in[m]$, let $c_j$ be the value calculated by Algorithm 1 using $\tilde{f}$ and let $c_j^*$ be what the value would have been if calculated using $f$ instead. Since $|f|\le 1$ and $\|f-\tilde{f}\|_\infty\le\varepsilon$, we have
\[ |c_j|\le(1+\varepsilon)^2\,,\qquad|c_j-c_j^*|\le\varepsilon(2+\varepsilon)\,. \]
Taking expectations and using $|\phi|\le 1$ yields $|\mathbb{E}[c_j]-\mathbb{E}[c_j^*]|\le\varepsilon(2+\varepsilon)$. By Claim 5.2,
\[
\begin{split}
C\,\mathbb{E}[c_j^*]&=\int f\Big(Z-\tfrac{1}{2}\Delta\Big)f\Big(Z+\tfrac{1}{2}\Delta\Big)e^{-\frac{\|\Delta\|^2}{4\ell^2}-iv^\top\Delta}(\pi\ell^2)^{\frac{d}{2}}N_{0,\frac{\ell^2}{2}I_d}(Z)\cdot N_{0,2A}(\Delta)\,dZ\,d\Delta\\
&=\int f^{(\ell)}\Big(Z-\tfrac{1}{2}\Delta\Big)f^{(\ell)}\Big(Z+\tfrac{1}{2}\Delta\Big)e^{-iv^\top\Delta}N_{0,2A}(\Delta)\,dZ\,d\Delta\\
&=I^*_{f^{(\ell)}}(v,A)\,.
\end{split}
\]
Hence if $I$ is the output of Algorithm 1,
\[ \big|\mathbb{E}[I]-I^*_{f^{(\ell)}}(v,A)\big|\le C\,\varepsilon(2+\varepsilon)\,. \]
For the sampling error, we can simply apply Hoeffding's inequality to the real and imaginary parts (each bounded in magnitude by $(1+\varepsilon)^2$) and union bound:
\[ \Pr\Big[\Big|\frac{1}{m}\sum_{j=1}^m\big(c_j-\mathbb{E}[c_j]\big)\Big|\ge 2(1+\varepsilon)^2\sqrt{\frac{2\log(8/\delta)}{m}}\Big]\le\delta\,. \]
Multiplying by $C$, using $\varepsilon<1$, and simplifying gives the stated concentration bound. ∎

6 Frequency Finding Algorithm

In this section, our goal is to present an algorithm that makes queries to $\tilde{f}$, which is $\varepsilon$-close in $L_\infty$ to a function $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$, and finds the directions $v_1,\dots,v_n$ (recall that throughout this paper we maintain the convention that the $v_i$ are unit vectors). This exact goal is not achievable, however: if some $\sigma_i$ is constant, then we won't be able to recover the direction $v_i$. Instead, we will show how to recover all of the directions for which $\sigma_i$ is non-degenerate, and this will suffice downstream for reconstructing the function. For a precise statement, see Lemma 6.9, which is the main result that we will prove in this section.

Before presenting the algorithm, we define some notation and prove a few facts that will be useful in the analysis. First, we define the following oracle, based on Corollary 5.3, that will simplify the exposition.

Definition 6.1 (Fourier Mass Oracle). A Fourier Mass Oracle is parameterized by an underlying function $g$ and accuracy $\tau$. The oracle $\mathcal{I}_{\tau,g}$ takes as input a vector $v\in\mathbb{R}^d$ and a positive semidefinite matrix $A\succeq 0$ and outputs an estimate $\mathcal{I}_{\tau,g}(v,A)$ such that
\[ \big|I^*_g(v,A)-\mathcal{I}_{\tau,g}(v,A)\big|\le\tau\,. \]
We call such an oracle a $\tau$-accurate oracle for the function $g$.

In light of Corollary 5.3, we can implement a Fourier Mass Oracle for the function $f^{(\ell)}$ with accuracy $\tau=10\varepsilon(\pi\ell^2)^{d/2}$ and exponentially small failure probability using polynomially many queries to $\tilde{f}$.
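Algorithm 1 is simple enough to sketch in a few lines of NumPy. The following is our own minimal illustration, not the paper's implementation; the ridge test function, dimensions, and sample budget are choices made for the example.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (EstWeight): Monte Carlo estimation of the
# reweighted Fourier mass I*_{f^(l)}(v, A) using only pointwise queries to f~.
# The test function below (a single ridge feature) is our own illustrative choice.

def est_weight(f_query, ell, v, A, m, rng):
    d = v.shape[0]
    C = (np.pi * ell**2) ** (d / 2)                               # scaling constant
    Delta = rng.multivariate_normal(np.zeros(d), 2.0 * A, size=m) # Delta ~ N(0, 2A)
    Z = rng.normal(scale=ell / np.sqrt(2.0), size=(m, d))         # Z ~ N(0, (l^2/2) I_d)
    u = f_query(Z - 0.5 * Delta)                                  # f~(x_-)
    w = f_query(Z + 0.5 * Delta)                                  # f~(x_+)
    phi = np.exp(-np.sum(Delta**2, axis=1) / (4.0 * ell**2) - 1j * (Delta @ v))
    return C * np.mean(phi * u * w)                               # I = C * mean(c_j)

# Example: one ridge feature f(x) = cos(3 * v1.x) in d = 2.
rng = np.random.default_rng(0)
v1 = np.array([1.0, 0.0])
f = lambda X: np.cos(3.0 * (X @ v1))
est = est_weight(f, ell=2.0, v=3.0 * v1, A=np.eye(2), m=5000, rng=rng)
# Since |c_j| <= 1 here, the estimate deterministically satisfies |est| <= C.
assert abs(est) <= (np.pi * 2.0**2) ** 1 + 1e-9
```

Corollary 5.3 says the estimate concentrates at rate $m^{-1/2}$ (plus the $O(\varepsilon)$ term from query noise), so reaching the oracle accuracy $\tau=10\varepsilon(\pi\ell^2)^{d/2}$ takes on the order of $\varepsilon^{-2}\log(1/\delta)$ samples.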
For the rest of this section, we will only interact with $\tilde{f}$ via such an oracle, and thus we will measure query complexity in terms of the number of calls to the oracle rather than direct queries to $\tilde{f}$. We will bound the actual query complexity in terms of the number of oracle calls when we put everything together in Section 8.

6.1 Location of Nonzero Frequencies

Next, we prove two statements, Lemma 6.2 and Lemma 6.3, which characterize the $v,A$ for which the Fourier mass $I^*_{f^{(\ell)}}(v,A)$ can be non-negligible. First, we show that when $v$ is far (in the norm induced by $A$) from every hidden direction line $\{tv_i:t\in\mathbb{R}\}$, the Fourier mass is small. To interpret the bound in Lemma 6.2, we will set $A$ so that $\|A\|\le\ell^2$, so that the Fourier mass $I^*_{f^{(\ell)}}(v,A)$ decays exponentially in $D^2$, where $D$ is the $A$-distance from $v$ to the union of lines $\{tv_i:t\in\mathbb{R}\}_{i\in[n]}$.

Lemma 6.2 (Where weight is negligible). Let $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ be a sum of features satisfying $|a_i|\le 1,\|\sigma_i\|_\infty\le 1$ for all $i$. Let $A\succeq 0$ be any positive semidefinite matrix. Define the $A$-distance from $v$ to the union of lines by
\[ D^2:=\min_{i\in[n]}\min_{t\in\mathbb{R}}(v-tv_i)^\top A(v-tv_i)\,. \]
Then
\[ I^*_{f^{(\ell)}}(v,A)\le n^2\,\pi^{\frac{d}{2}}\,\ell^d\exp\Big(-\frac{\ell^2}{\ell^2+\|A\|}D^2\Big)\,. \tag{10} \]

Proof. We use the shorthand $I^*:=I^*_{f^{(\ell)}}(v,A)$. Using Claim 4.11 and Cauchy–Schwarz,
\[ \big|\widehat{f^{(\ell)}}(y)\big|^2\le n\sum_{i=1}^n\big|\widehat{\sigma_i^{(\ell)}}(v_i^\top y)\big|^2\,\ell^{2d-2}\,e^{-\ell^2\|y-(v_i^\top y)v_i\|^2}\,. \]
First consider a fixed $i$ and decompose $y=tv_i+z$ with $z\in v_i^\perp$. For $x:=tv_i-v$, write $P$ for the orthogonal projector onto $v_i^\perp$, set $A_\perp:=PAP$, and define
\[ M:=\ell^2I_\perp+A_\perp\,. \]
Note that when we compute
\[ I^*=\int\big|\widehat{f^{(\ell)}}(y)\big|^2e^{-(y-v)^\top A(y-v)}\,dy \]
and substitute in the above bound on $|\widehat{f^{(\ell)}}(y)|^2$, we obtain an expression with the following quadratic in the exponent:
\[ \ell^2\|y-(v_i^\top y)v_i\|^2+(y-v)^\top A(y-v)=\ell^2\|z\|^2+(z+x)^\top A(z+x)\,. \]
We will then first integrate over $z\in v_i^\perp$ and then over $t\in\mathbb{R}$. Since $A_\perp\succeq 0$ and $\ell>0$, we have $M\succ 0$ as an operator on the $(d-1)$-dimensional space $v_i^\perp$. Observe that
\[ \int_{v_i^\perp}e^{-\ell^2\|z\|^2}e^{-(z+x)^\top A(z+x)}\,dz=\frac{\pi^{\frac{d-1}{2}}}{\sqrt{\det M}}\exp\big(-\phi_\perp(x)\big)\,, \]
where
\[ \phi_\perp(x):=\min_{z\in v_i^\perp}\big\{\ell^2\|z\|^2+(z+x)^\top A(z+x)\big\} \]
and again $M$ is viewed as an operator on $v_i^\perp$, so its determinant is positive. To see why the above characterization as a minimum holds, note that the integral is over a rescaled Gaussian with covariance matrix $(2M)^{-1}$, and thus evaluating the "scaling factor" in the integral amounts to computing the minimum value of the quadratic form in the exponent. To upper bound the integral, we can relax $z\in v_i^\perp$ to $z\in\mathbb{R}^d$; then
\[ \phi_\perp(x)\ge\min_{z\in\mathbb{R}^d}\big\{\ell^2\|z\|^2+(z+x)^\top A(z+x)\big\}=\ell^2\,x^\top A(\ell^2I+A)^{-1}x\,, \]
where the minimizer is $z_*=-(\ell^2I+A)^{-1}Ax$; note that $\ell^2I+A\succ 0$, so the inverse is well-defined even if $A$ is singular. Also note that $A$ and $(\ell^2I+A)^{-1}$ commute, which allows us to write the expression in the above form. Consequently,
\[ \int_{v_i^\perp}e^{-\ell^2\|z\|^2}e^{-(z+x)^\top A(z+x)}\,dz\le\frac{\pi^{\frac{d-1}{2}}}{\sqrt{\det(\ell^2I_\perp+A_\perp)}}\exp\big(-\ell^2\,x^\top A(\ell^2I+A)^{-1}x\big)\,. \]
Since $\det(\ell^2I_\perp+A_\perp)\ge\ell^{2(d-1)}$ and $(\ell^2I+A)^{-1}\succeq\frac{1}{\ell^2+\|A\|}I$, we obtain
\[ \int_{v_i^\perp}e^{-\ell^2\|z\|^2}e^{-(z+x)^\top A(z+x)}\,dz\le\pi^{\frac{d-1}{2}}\,\ell^{-(d-1)}\exp\Big(-\frac{\ell^2}{\ell^2+\|A\|}\,x^\top Ax\Big)\,. \]
Now we can put everything together and integrate over $t$ as well. Letting $D_i^2:=\min_{t\in\mathbb{R}}(v-tv_i)^\top A(v-tv_i)$ and $D^2=\min_iD_i^2$,
\[
\begin{split}
I^*&\le n\sum_{i=1}^n\pi^{\frac{d-1}{2}}\,\ell^{d-1}\int_{\mathbb{R}}\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2\exp\Big(-\frac{\ell^2}{\ell^2+\|A\|}(v-tv_i)^\top A(v-tv_i)\Big)\,dt\\
&\le n\sum_{i=1}^n\pi^{\frac{d-1}{2}}\,\ell^{d-1}\,e^{-\frac{\ell^2}{\ell^2+\|A\|}D_i^2}\int_{\mathbb{R}}\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2\,dt\,.
\end{split}
\]
By Parseval in one dimension and $|\sigma_i|\le 1$, we get $\int_{\mathbb{R}}|\widehat{\sigma_i^{(\ell)}}(t)|^2\,dt\le\ell\sqrt{\pi}$. It follows that
\[ I^*\le n^2\,\pi^{\frac{d}{2}}\,\ell^d\exp\Big(-\frac{\ell^2}{\ell^2+\|A\|}D^2\Big) \]
and this completes the proof. ∎

Next, we prove a counterpart to Lemma 6.2, showing that for each direction $v_i$ with a non-degenerate activation $\sigma_i$, there is a scale $\beta$, bounded above and bounded away from $0$, such that if $v$ is close to $\beta v_i$ and $A$ is not too large, then the Fourier mass $I^*_{f^{(\ell)}}(v,A)$ is bounded away from zero. To interpret the bound in Lemma 6.3, we will ensure $\alpha\sim\ell^2/d$. This means that the factor on the outside reduces to $(\pi\ell^2)^{\frac{d-1}{2}}$ (up to a constant). As long as $\ell$ is sufficiently large, the exponential term inside the bracket becomes negligible and the bracket is lower bounded by some inverse polynomial. Thus, overall the expression will be lower bounded by some inverse polynomial times $(\pi\ell^2)^{\frac{d}{2}}$ (which is the scaling factor that shows up naturally from the Fourier mass oracle).

Lemma 6.3 (Where weight is non-negligible). Let $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ be a sum of features satisfying $|a_i|\le 1,\|\sigma_i\|_\infty\le 1$ for all $i$. Assume the directions are $\gamma$-separated, i.e. the sines of all angles between them are at least $\gamma$. Fix $i\in[n]$ and assume $\sigma_i$ is $L$-Lipschitz and $(R,\Delta)$-nondegenerate where $R\ge 1,\Delta<1$. Let $\ell\ge 20(R+L)/\Delta$, and define
\[ a:=\frac{\Delta}{8}\sqrt{\frac{1}{R\ell}}\,,\qquad B:=4\ell\log\Big(\frac{\ell R}{\Delta}\Big)\,,\qquad E:=[a,B]\cup[-B,-a]\,. \]
There exists a $\beta\in E$ such that the following holds. Set $A=\alpha I_d$ and $v=\beta v_i$. Then
\[ I^*_{f^{(\ell)}}(v,A)\ge\frac{\pi^{\frac{d-1}{2}}\,\ell^{2d-2}}{(\ell^2+\alpha)^{\frac{d-1}{2}}}\Big[\frac{\Delta^2}{64n(1+B\sqrt{\alpha})R\,\ell^2}-4n\ell\exp\Big(-\frac{\alpha\ell^2\gamma^2a^2}{\ell^2+\alpha}\Big)\Big]\,. \tag{11} \]

Proof. By Claim 4.11 we can decompose $\widehat{f^{(\ell)}}(y)=\sum_{j=1}^nT_j(y)$, where
\[ T_j(y):=\widehat{\sigma_j^{(\ell)}}(v_j^\top y)\,\ell^{d-1}\,e^{-\ell^2\|y-(v_j^\top y)v_j\|^2/2}\,. \]
For any complex numbers $z_1,z_2,\dots,z_n$, we have $|z_1+z_2+\dots+z_n|^2\ge\frac{1}{n}|z_i|^2-\sum_{j\ne i}|z_j|^2$. Applying this and integrating against the nonnegative weight $e^{-(y-v)^\top A(y-v)}$ gives
\[ I^*_{f^{(\ell)}}(v,A)\ge\frac{1}{n}\,I^*_i-\sum_{j\ne i}I^*_j\,, \tag{12} \]
where we write the per-component masses
\[ I^*_j:=\int_{\mathbb{R}^d}\big|\widehat{\sigma_j^{(\ell)}}(v_j^\top y)\big|^2\,\ell^{2d-2}\,e^{-\ell^2\|y-(v_j^\top y)v_j\|^2}\,e^{-(y-v)^\top A(y-v)}\,dy\,. \]
We first lower bound the single-component contribution $I^*_i$. Set $v=\beta v_i$ and decompose $y=tv_i+z$ with $z\in v_i^\perp$; then $\langle v,v_i\rangle=\beta$. Recall $A=\alpha I_d$. Following a similar calculation to Lemma 6.2, where we first integrate over $z\in v_i^\perp$ and then over $t$, we get
\[ I^*_i=\int_{\mathbb{R}}\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2\,\ell^{2d-2}\int_{v_i^\perp}e^{-(\ell^2+\alpha)\|z\|^2}e^{-\alpha(t-\beta)^2}\,dz\,dt=\frac{\pi^{\frac{d-1}{2}}\,\ell^{2d-2}}{(\ell^2+\alpha)^{\frac{d-1}{2}}}\int_{\mathbb{R}}\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2e^{-\alpha(t-\beta)^2}\,dt\,. \]
By Corollary 4.10, since $\sigma_i$ is $L$-Lipschitz and $(R,\Delta)$-nondegenerate and $\ell\ge 20(R+L)/\Delta$, we have
\[ \int_E\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2\,dt\ge\frac{\Delta^2}{8R\ell^2}\,. \]
Let $E_+=[a,B]$ and $E_-=[-B,-a]$, and pick the sign $s\in\{\pm 1\}$ maximizing $\int_{E_s}|\widehat{\sigma_i^{(\ell)}}(t)|^2\,dt$. This guarantees we keep at least half of the total integral.
Without loss of generality, $E_s=E_+$. For any fixed $t\in E_+$, the integral over $\beta\in E_+$ of $e^{-\alpha(t-\beta)^2}$ is
\[ \int_{\beta\in E_+}e^{-\alpha(t-\beta)^2}\,d\beta=\int_{t-B}^{t-a}e^{-\alpha u^2}\,du\ge\frac{1}{4}\min\Big(\frac{1}{\sqrt{\alpha}},B\Big)\,. \]
Averaging over $\beta\in E_+$ therefore shows that there exists a choice of $\beta\in E_+$ with
\[ \int_{\mathbb{R}}\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2e^{-\alpha(t-\beta)^2}\,dt\ge\frac{1}{4(1+B\sqrt{\alpha})}\int_{E_+}\big|\widehat{\sigma_i^{(\ell)}}(t)\big|^2\,dt\ge\frac{\Delta^2}{64(1+B\sqrt{\alpha})R\,\ell^2}\,. \]
Combining with the factor obtained from integrating over $z$, we get that there exists some choice of $\beta$ such that
\[ I^*_i\ge\frac{\pi^{\frac{d-1}{2}}\,\ell^{2d-2}}{(\ell^2+\alpha)^{\frac{d-1}{2}}}\cdot\frac{\Delta^2}{64(1+B\sqrt{\alpha})R\,\ell^2}\,. \]
It remains to control the remainder $\sum_{j\ne i}I^*_j$ in (12). Consider a fixed index $j$. We apply the same approach of integrating over $v_j^\perp$ and then over $t$. We set $A=\alpha I_d$ and $v=\beta v_i$, and also let $w_j=P_{v_j^\perp}v$, to get
\[
\begin{split}
I^*_j&=\int_{\mathbb{R}}\big|\widehat{\sigma_j^{(\ell)}}(t)\big|^2\,\ell^{2d-2}\int_{v_j^\perp}e^{-\ell^2\|z\|^2-\alpha\|tv_j+z-v\|^2}\,dz\,dt\\
&=\int_{\mathbb{R}}\big|\widehat{\sigma_j^{(\ell)}}(t)\big|^2\,\ell^{2d-2}\int_{v_j^\perp}e^{-\ell^2\|z\|^2-\alpha\|z-w_j\|^2}\,e^{-\alpha\|tv_j-v+w_j\|^2}\,dz\,dt\\
&\le\frac{\pi^{\frac{d-1}{2}}\,\ell^{2d-2}}{(\ell^2+\alpha)^{\frac{d-1}{2}}}\exp\Big(-\frac{\alpha\ell^2}{\ell^2+\alpha}\|w_j\|^2\Big)\int_{\mathbb{R}}\big|\widehat{\sigma_j^{(\ell)}}(t)\big|^2\,dt\\
&\le\frac{\pi^{\frac{d}{2}}\,\ell^{2d-1}}{(\ell^2+\alpha)^{\frac{d-1}{2}}}\exp\Big(-\frac{\alpha\ell^2}{\ell^2+\alpha}\|w_j\|^2\Big)\,,
\end{split}
\]
where in the third step we evaluated the Gaussian integral over $z\in v_j^\perp$ and lower bounded the quadratic $\ell^2\|z\|^2+\alpha\|z-w_j\|^2$ by its minimum value $\frac{\alpha\ell^2}{\ell^2+\alpha}\|w_j\|^2$, and the last step follows from Parseval and the assumption on $\sigma_j$. By $\gamma$-separation, for the unit vectors $v_i,v_j$ we have $\sin^2\angle(v_i,v_j)\ge\gamma^2$. Since $v=\beta v_i$,
\[ \|w_j\|^2=\beta^2\sin^2\angle(v_i,v_j)\ge\beta^2\gamma^2\ge a^2\gamma^2 \]
for $\beta\in E$.
Hence
\[ \sum_{j\ne i}I^*_j\le(n-1)\,\frac{\pi^{\frac{d}{2}}\,\ell^{2d-1}}{(\ell^2+\alpha)^{\frac{d-1}{2}}}\exp\Big(-\frac{\alpha\ell^2\gamma^2a^2}{\ell^2+\alpha}\Big)\,. \]
Plugging back into (12) and substituting the bounds we obtained into the expression $I^*_{f^{(\ell)}}(v,A)\ge\frac{1}{n}I^*_i-\sum_{j\ne i}I^*_j$ gives the desired inequality. ∎

6.2 Direction Recovery Algorithm and Analysis

Now we present our algorithm for recovering the hidden directions. Recall that the only way the algorithm interacts with the unknown function $f$ is through a Fourier mass oracle (Definition 6.1).

We begin with a high-level description of the algorithm. The algorithm takes as input some orthonormal basis $b_1,\dots,b_d$ as well as information about some of the coordinates, say $\alpha_1,\dots,\alpha_k$ for $k\le d$. The idea is to then search over possibilities for the next coordinate $\alpha_{k+1}$ such that there is nontrivial total Fourier mass on the set of all points close to $\alpha_1,\dots,\alpha_{k+1}$ on their first $k+1$ coordinates. We can then repeat and recurse to search for the coordinate $\alpha_{k+2}$, and so on. Once we have fixed all $d$ coordinates, we simply return the unit vector in the direction $(\alpha_1,\dots,\alpha_d)$.

For the actual implementation, we have width parameters $K_1,\dots,K_{k+1}$ which are sufficiently large. To localize around points that are close to $\alpha_1,\dots,\alpha_{k+1}$ in their first $k+1$ coordinates, we query $\mathcal{I}_{\tau,f^{(\ell)}}(v,A)$ for
\[ v=\alpha_1b_1+\dots+\alpha_kb_k+\alpha_{k+1}b_{k+1}\,,\qquad A=K_1b_1b_1^\top+\dots+K_{k+1}b_{k+1}b_{k+1}^\top\,. \]
The reason we require different width parameters for the different coordinates is technical: it will matter later on when bounding the branching factor of the algorithm. The details of the algorithm are described below in Algorithm 2. Note that the only parameters that change between levels of recursion are the current coordinates $\alpha_1,\dots,\alpha_k$ (since we fix an additional coordinate in each iteration). All other parameters are global, i.e.
shared between all levels of recursion.

Input: Width $\ell$, accuracy $\varepsilon$ (global)
Input: Access to Fourier mass oracle $\mathcal{I}_{\tau,f^{(\ell)}}$ with $\tau=10\varepsilon(\pi\ell^2)^{d/2}$
Input: Width parameters $C_1,C_2$ (global)
Input: Orthonormal basis $b_1,\dots,b_d\in\mathbb{R}^d$ (global)
Input: Current coordinates $\alpha_1,\dots,\alpha_k\in\mathbb{R}$ (where $k\le d$)

if $k=d$ then
  return $\frac{\alpha_1b_1+\dots+\alpha_db_d}{\sqrt{\alpha_1^2+\dots+\alpha_d^2}}$
end if
Let $T$ be the set of all integer multiples of $\frac{1}{10\sqrt{C_2}}$ between $-\ell^2$ and $\ell^2$
for $c\in T$ do
  Set $v=\alpha_1b_1+\dots+\alpha_kb_k+cb_{k+1}$
  Set $(K_1,\dots,K_{k+1})=(C_2,C_2,C_1,\dots,C_1,C_2)$
  Set $A=K_1b_1b_1^\top+\dots+K_{k+1}b_{k+1}b_{k+1}^\top$
  Query $W_c=\mathcal{I}_{\tau,f^{(\ell)}}(v,A)$
end for
Let $T'=\{c\mid c\in T,\ |W_c|\ge 5\tau\}$
Let $S$ be any maximal subset of $T'$ whose elements are $\frac{1}{10d\sqrt{C_1}}$-separated
for $\alpha_{k+1}\in S$ do
  Recurse on $\alpha_1,\dots,\alpha_{k+1}$
end for

Algorithm 2: Find Directions

Now we are ready to analyze Algorithm 2. First, we set a couple of parameters. We assume we are given parameters $d,n,R,L,\gamma,\varepsilon$, which govern the properties of the unknown function $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ as in Section 2.2. We also assume we are given a target accuracy parameter $\Delta$ and that
\[ \varepsilon<\frac{1}{\mathrm{poly}\big(d,n,R,L,\frac{1}{\gamma},\frac{1}{\Delta}\big)} \]
for some sufficiently large polynomial. We will set the global parameters in Algorithm 2 as follows:
\[ \ell=\mathrm{poly}\Big(d,n,R,L,\frac{1}{\gamma},\frac{1}{\Delta}\Big)\,,\qquad C_2=\frac{\ell^2}{d}\,,\qquad C_1=C_2^{0.9}\,, \tag{13} \]
but we ensure $\varepsilon\ll 1/\mathrm{poly}(\ell)$, which is possible as long as $\varepsilon$ is a sufficiently small inverse polynomial.

Now we begin by giving a geometric characterization that will be useful for the analysis.

Definition 6.4.
We say the vectors $b_1,b_2$ of the orthonormal basis are $\theta$-separating if

• For every $i\in[n]$, $|v_i\cdot b_1|,|v_i\cdot b_2|\ge\theta/\sqrt{d}$.

• For every $i,j\in[n]$ with $i\ne j$,
\[ (v_i\cdot b_1)(v_j\cdot b_2)-(v_j\cdot b_1)(v_i\cdot b_2)\ge\frac{\theta^2}{d}\,. \]

Definition 6.4 is useful because it implies that if we fix any projection $(\alpha_1,\alpha_2)$ in the plane spanned by $b_1,b_2$ that is not too close to the origin, then there is at most one $v_i$ such that the projection of $tv_i$ is very close to $\alpha_1b_1+\alpha_2b_2$ for some $t\in\mathbb{R}$. This will be crucial for arguing that Algorithm 2 does not branch too much in the recursive step.

We now show that with high probability, a random orthonormal basis will be $\theta$-separating for $\theta$ not too small.

Claim 6.5. Assume that $v_1,\dots,v_n$ are unit vectors such that the sines of the pairwise angles between them are all at least $\gamma$. Then with probability $1-\frac{1}{10n}$ over the choice of a random orthonormal basis $b_1,b_2,\dots,b_d$, we have that $b_1,b_2$ are $\frac{\gamma}{(10n)^3}$-separating.

Proof. For the first condition, since $b_1,b_2$ are each individually uniform over the sphere, anti-concentration implies that for each $i$, $\Pr[|v_i\cdot b_1|\le\theta/\sqrt{d}]\le 10\theta$, and similarly for $b_2$. Thus, with probability at least $1-20n\theta$, we simultaneously have $|v_i\cdot b_1|,|v_i\cdot b_2|\ge\theta/\sqrt{d}$ for all $i\in[n]$.

For the second condition, fix $i\ne j$ and condition on $b_1$. Write
\[ (v_i\cdot b_1)(v_j\cdot b_2)-(v_j\cdot b_1)(v_i\cdot b_2)=\big((v_i\cdot b_1)v_j-(v_j\cdot b_1)v_i\big)\cdot b_2\,. \]
Note that $(v_i\cdot b_1)v_j-(v_j\cdot b_1)v_i$ is orthogonal to $b_1$ and
\[ \big\|(v_i\cdot b_1)v_j-(v_j\cdot b_1)v_i\big\|\ge\max(|v_i\cdot b_1|,|v_j\cdot b_1|)\cdot\sin\angle(v_i,v_j)\ge\max(|v_i\cdot b_1|,|v_j\cdot b_1|)\cdot\gamma\,. \]
Thus, with probability at least $1-10\theta$ over the randomness of $b_2$,
\[ (v_i\cdot b_1)(v_j\cdot b_2)-(v_j\cdot b_1)(v_i\cdot b_2)\ge\frac{\gamma\theta}{\sqrt{d}}\max(|v_i\cdot b_1|,|v_j\cdot b_1|)\,. \]
Thus, setting $\theta=\frac{1}{(10n)^3}$ and taking a union bound over all $i,j$ completes the proof. ∎

Now we analyze Algorithm 2 assuming that we initialize with $\alpha_1,\alpha_2$ that satisfy certain properties. First, we show that if there is some nontrivial Fourier mass on the hyperplane through $\alpha_1b_1+\alpha_2b_2$ orthogonal to $b_1,b_2$, then the algorithm will succeed and return some direction that contains nontrivial Fourier mass.

Definition 6.6. We say a point $v\in\mathbb{R}^d$ is heavy if $I^*_{f^{(\ell)}}(v,4C_2I_d)\ge 10^3\varepsilon(\pi\ell^2)^{d/2}$.

Claim 6.7. Assume that we run Algorithm 2 starting with $\alpha_1,\alpha_2$ such that

• $\alpha_1^2+\alpha_2^2\ge\frac{1}{C_1^{0.6}}$

• There is some heavy point $v$ with $|v\cdot b_1-\alpha_1|,|v\cdot b_2-\alpha_2|\le\frac{1}{10\sqrt{C_2}}$ and $\|v\|\le\ell^2/2$.

Then the algorithm will return some point $u$ with
\[ \Big\|u-\frac{v}{\|v\|}\Big\|\le\frac{1}{C_1^{0.2}}\,. \]

Proof. We prove by induction that for each $k\ge 3$, the algorithm recurses on some coordinates $\alpha_1,\dots,\alpha_k$ such that
\[ \sum_{i=3}^k(\alpha_i-v\cdot b_i)^2\le\frac{k-2}{5dC_1}\,. \]
Assume that this is true at level $k$ (the base case $k=2$ is trivial). There must be some choice of $c\in T$ such that $|c-v\cdot b_{k+1}|\le\frac{1}{10\sqrt{C_2}}$. First, we claim that this choice of $c$ will be in the set of points $T'$ (as constructed in the execution of Algorithm 2).
To see this, set
\[ v_0=\alpha_1b_1+\dots+\alpha_kb_k+cb_{k+1}\,,\qquad A_0=C_2b_1b_1^\top+C_2b_2b_2^\top+C_1b_3b_3^\top+\dots+C_1b_kb_k^\top+C_2b_{k+1}b_{k+1}^\top\,. \]
Then, since we assumed $v$ is heavy, we have
\[ I^*_{f^{(\ell)}}(v_0,A_0)\ge 0.3\,I^*_{f^{(\ell)}}(v,4C_2I_d)=300\varepsilon(\pi\ell^2)^{d/2}\,, \]
where the first inequality holds because if we set $\bar{v}$ to be the projection of $v$ onto the span of $b_1,\dots,b_{k+1}$, then for all $y\in\mathbb{R}^d$,
\[ e^{-(v_0-y)^\top A_0(v_0-y)}\ge e^{-2(v_0-\bar{v})^\top A_0(v_0-\bar{v})-2(\bar{v}-y)^\top A_0(\bar{v}-y)}\ge 0.3\,e^{-4(v-y)^\top C_2I_d(v-y)}\,. \]
Thus, by the guarantees of the Fourier mass oracle $\mathcal{I}_{\tau,f^{(\ell)}}$ (and since we set $\tau=10\varepsilon(\pi\ell^2)^{d/2}$), this value of $c$ must be included in $T'$. Now, when we filter down the set $T'$, we must include some $\alpha_{k+1}$ in the set $S$ such that
\[ |\alpha_{k+1}-v\cdot b_{k+1}|\le\frac{1}{10d\sqrt{C_1}}+\frac{1}{\sqrt{C_2}}\le\frac{1}{5d\sqrt{C_1}}\,. \]
This completes the inductive step, since the above now implies
\[ \sum_{i=3}^{k+1}(\alpha_i-v\cdot b_i)^2\le\frac{k-1}{5dC_1}\,. \]
When $k=d$, the inductive hypothesis combined with the assumption that $\alpha_1^2+\alpha_2^2\ge\frac{1}{C_1^{0.6}}$ now implies that we return some $u$ with $\|u-\frac{v}{\|v\|}\|\le\frac{1}{C_1^{0.2}}$ as desired. ∎

Next, we show that if $b_1,b_2$ are separating (as in Definition 6.4), then we can bound the number of recursive calls in Algorithm 2.

Claim 6.8. If $b_1,b_2$ are $\gamma/(10n)^3$-separating, then if we start Algorithm 2 with any $\alpha_1,\alpha_2$ with $\alpha_1^2+\alpha_2^2\ge\frac{1}{C_1^{0.6}}$, the algorithm will only recurse on at most one possibility for $\alpha_k$ for each $k\ge 3$.

Proof.
By Lemma 6.2 and the definition of the threshold $\tau=10\varepsilon(\pi\ell^2)^{d/2}$, if we fix $\alpha_1,\dots,\alpha_k$, then in the execution of Algorithm 2, a value $c$ gets added to the set $T'$ only if for some $t\in\mathbb{R}$ and $i\in[n]$,
\[
C_2(t(v_i\cdot b_{k+1})-c)^2+C_2(t(v_i\cdot b_1)-\alpha_1)^2+C_2(t(v_i\cdot b_2)-\alpha_2)^2\le 4d\log(\ell n/\varepsilon)\,.
\]
Recall that we set $C_1,C_2\le \ell^2/d$, and therefore when the above does not hold, the upper bound obtained in Lemma 6.2 for the choice of $v,A$ in Algorithm 2 is much smaller than $\tau$. Since $C_1,C_2$ are sufficiently large and $b_1,b_2$ are $\theta$-separating for $\theta=\gamma/(10n)^3$, all of the $c$ that can satisfy the above must actually correspond to the same $i\in[n]$. This also implies that the range of possible $t$ is at most
\[
|t_{\max}-t_{\min}|\le \sqrt{\frac{4d\log(\ell n/\varepsilon)}{C_2}}\cdot\frac{2\sqrt{d}}{\theta}=\frac{4d(10n)^3\sqrt{\log(\ell n/\varepsilon)}}{\gamma\sqrt{C_2}}\,.
\]
Since $C_1=C_2^{0.9}$ and $C_2$ is a sufficiently large polynomial, this implies that all elements $c$ that get added to $T'$ must be contained in an interval of width at most $\frac{1}{10\sqrt{dC_1}}$. Thus, the set $S$ that gets constructed has size at most $1$, so the algorithm recurses on at most one value of $\alpha_{k+1}$ at each step, as desired. ∎

Finally, we can analyze the full algorithm for recovering the directions $v_i$. We show that we recover all of the directions for which $\sigma_i$ is nondegenerate, and that no extraneous directions are recovered. The procedure works by first randomly choosing an orthonormal basis $b_1,\dots,b_d$ (note that $b_1,b_2$ are separating with high probability by Claim 6.5). We then grid search over the first two coordinates $\alpha_1,\alpha_2$ and run Algorithm 2 initialized with each possibility. The analysis uses Claim 6.8 to bound the runtime and query complexity, and combines Lemma 6.3 with Claim 6.7 to argue that all of the desired directions are found.

Lemma 6.9 (Finding Directions).
Assume that $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ is a sum of $n$ features satisfying Assumptions 1 and 2. Then with parameters as in (13), and assuming $d\ge 3$, $R>1$ and
\[
\varepsilon<\frac{1}{\mathrm{poly}(d,n,R,L,1/\gamma,1/\Delta)}\,,
\]
there is an algorithm that uses $\mathrm{poly}(d,n,R,L,1/\gamma,1/\Delta)$ runtime and queries to a Fourier mass oracle $\mathcal{I}_{\tau,f^{(\ell)}}$ with $\tau=10\varepsilon(\pi\ell^2)^{d/2}$, and with probability at least $0.9$ returns a set of unit vectors $\{u_1,\dots,u_s\}$ with the following properties:
• For each $j\in[s]$, there is some $i\in[n]$ such that $\min(\|u_j-v_i\|,\|u_j+v_i\|)\le \Delta$.
• For each function $\sigma_i(\cdot)$ that is $(R,\Delta)$-nondegenerate, there is some $j\in[s]$ with $\min(\|u_j-v_i\|,\|u_j+v_i\|)\le \Delta$.
• For each $j\ne j'$, the sine of the angle between $u_j$ and $u_{j'}$ is at least $\gamma/2$.

Proof. We first randomly choose an orthonormal basis $b_1,\dots,b_d$. By Claim 6.5, with probability at least $1-1/(10n)$ over this choice, $b_1,b_2$ are $\gamma/(10n)^3$-separating. For the remainder of this proof, we condition on this event. Now, we set parameters $\ell,C_1,C_2$ as in (13) and grid over all $\alpha_1,\alpha_2$ with
\[
\frac{1}{C_1^{0.6}}\le \alpha_1^2+\alpha_2^2\le \ell^4
\]
with grid size $\frac{1}{10C_2}$. Note that the number of such grid points is at most $\mathrm{poly}(d,n,R,L,1/\gamma,1/\Delta)$. For each such grid point, we run Algorithm 2. By Claim 6.8, if $b_1,b_2$ are $\gamma/(10n)^3$-separating, then for each such grid point, Algorithm 2 recurses on at most one value of $\alpha_k$ for each $k\ge 3$. Also, each iteration of Algorithm 2 can be implemented in $\mathrm{poly}(d,n,R,L,1/\gamma,1/\Delta)$ time and oracle queries. This gives the desired time and query complexity bounds. Next, we argue about the set of points that we actually find.
By Lemma 6.3, for each $i\in[n]$ such that $\sigma_i$ is $(R,\Delta)$-nondegenerate, there is some $\beta\in[a,B]\cup[-B,-a]$ (where $a,B$ are as defined in Lemma 6.3) such that the point $v=\beta v_i$ is heavy. This is because the estimate in Lemma 6.3 (for $\alpha=4C_2=4\ell^2/d$) can be lower bounded as
\[
(\pi\ell^2)^{d/2}\left(\frac{\Delta^2}{10^3\ell^{10}}-4n\ell e^{-\ell^{0.2}}\right)>10^3\varepsilon(\pi\ell^2)^{d/2}
\]
where we used the assumption that $\ell$ is sufficiently large and $\varepsilon$ is sufficiently small compared to $\ell$. We also immediately have $\|v\|\le \ell^2/2$. Now, since $b_1,b_2$ are $\gamma/(10n)^3$-separating and using the lower bound $|\beta|\ge a$ where $a=\frac{\Delta}{8R\ell}$ (as defined in Lemma 6.3), there must be some choice of $\alpha_1,\alpha_2$ in our grid with $|v\cdot b_1-\alpha_1|,|v\cdot b_2-\alpha_2|\le \frac{1}{10C_2}$ because
\[
(v\cdot b_1)^2+(v\cdot b_2)^2=\beta^2(v_i\cdot b_1)^2+\beta^2(v_i\cdot b_2)^2\ge \frac{\gamma^2\Delta^2}{(20n)^6dR^2\ell^2}>\frac{2}{C_1^{0.6}}\,.
\]
Thus, by Claim 6.7, when we run Algorithm 2 starting from this $\alpha_1,\alpha_2$, we will find some point $u$ with $\min(\|u-v_i\|,\|u+v_i\|)\le \frac{1}{C_1^{0.2}}$, since by definition $v/\|v\|=\pm v_i$. Note that the above argument holds for any $i\in[n]$ such that $\sigma_i$ is $(R,\Delta)$-nondegenerate, and thus for each such $i$ we find some point $u$ with the above property.

Next, we argue that we do not find any extraneous points that fail to correspond to some direction $v_i$. Since $d\ge 3$, Algorithm 2 can recurse on $\alpha_1,\dots,\alpha_{k+1}$ for $k\ge 2$ only if $I^*_{f^{(\ell)}}(v_0,A_0)\ge 4\tau$ where
\[
v_0=\alpha_1b_1+\dots+\alpha_{k+1}b_{k+1}\,,\qquad A_0=C_2b_1b_1^\top+C_2b_2b_2^\top+C_1b_3b_3^\top+\dots+C_1b_kb_k^\top+C_2b_{k+1}b_{k+1}^\top\,.
\]
By Lemma 6.2 and the way we set the parameters $\ell,C_1,C_2$ in (13), this can only happen if for some $i\in[n]$ and $t\in\mathbb{R}$,
\[
\sum_{j=1}^{k+1}(t(v_i\cdot b_j)-\alpha_j)^2\le \frac{1}{C_1^{0.8}}\,.
\]
Also recall that $\alpha_1^2+\alpha_2^2\ge \frac{1}{C_1^{0.6}}$.
Thus, any point that Algorithm 2 actually returns must satisfy the above for $k+1=d$, and this implies that any returned point $u$ satisfies $\min(\|u-v_i\|,\|u+v_i\|)\le \frac{1}{C_1^{0.1}}$ for some $i\in[n]$. This establishes the first two of the desired conditions. Finally, we argue that we can post-process to ensure separation. Among all of the returned points, we greedily construct a maximal set of points such that all pairs have the sine of the angle between them at least $\gamma/2$. Since the $v_i$ are $\gamma$-separated in angle, this ensures that if some returned point $u$ is $\frac{1}{C_1^{0.1}}$-close to $\pm v_i$ for some $i$, then after post-processing there must still be some point remaining that is $\frac{1}{C_1^{0.1}}$-close to $\pm v_i$. Since $\frac{1}{C_1^{0.1}}<\Delta$ by the way we set $C_1$, we have now verified all three conditions, and this completes the proof. ∎

7 Function Recovery

Once we have recovered directions that are close to the $v_i$, we show how to recover the actual functions $\sigma_i$ and the corresponding coefficients $a_i$. An important subroutine for this step is using queries to estimate the value of $\widehat{f^{(\ell)}}(y)$ at any specified point $y$.

7.1 Estimating Fourier Value

The subroutine for estimating $\widehat{f^{(\ell)}}(y)$ is similar to Algorithm 1, but much simpler in terms of the sampling procedure, and is described below.

Algorithm 3: Estimating Fourier Value
Input: Query access to $\tilde f:\mathbb{R}^d\to\mathbb{R}$
Input: $\ell>0$, point $y\in\mathbb{R}^d$, sample budget $m$
Function $\mathrm{EstVal}_{\tilde f,\ell,m}(y)$:
  for $j=1$ to $m$ do
    Draw $x\sim N(0,\ell^2I_d)$
    Set $c_j\leftarrow \tilde f(x)e^{-iy^\top x}$
  end for
  return $V\leftarrow \ell^d\cdot\frac{c_1+\dots+c_m}{m}$

It is straightforward to verify that Algorithm 3 gives an unbiased estimate of $\widehat{f^{(\ell)}}(y)$ and that the estimator concentrates for $m$ sufficiently large.

Claim 7.1. Let $f:\mathbb{R}^d\to\mathbb{R}$ be a function with $|f(x)|\le 1$ and fix $\ell>0$.
Let $f^{(\ell)}$ be as defined in Definition 4.4. Given query access to $\tilde f$ with $\|f-\tilde f\|_\infty\le \varepsilon$, any $y\in\mathbb{R}^d$, and sample budget $m$, compute $V=\mathrm{EstVal}_{\tilde f,\ell,m}(y)$ as defined in Algorithm 3. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$,
\[
\left|V-\widehat{f^{(\ell)}}(y)\right|\le 4\ell^d\left(\varepsilon+2\sqrt{\frac{\log(8/\delta)}{m}}\right)\,.
\]
Proof. By definition,
\[
\widehat{f^{(\ell)}}(y)=\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d}f(x)e^{-\|x\|^2/(2\ell^2)}e^{-iy^\top x}\,dx\,.
\]
From this, it is immediate that if the $c_j$ in Algorithm 3 were computed using query access to $f$, then each would have expectation $\widehat{f^{(\ell)}}(y)/\ell^d$. Thus, by the assumption on $\tilde f$, we have
\[
\left|\ell^d\,\mathbb{E}[c_j]-\widehat{f^{(\ell)}}(y)\right|\le \varepsilon\ell^d\,.
\]
Now we apply Hoeffding's inequality, and since each $c_j$ has $|c_j|\le 1+\varepsilon$, this gives the desired concentration. ∎

In light of Claim 7.1, we define the following Fourier value oracle; the rest of the analysis in this section will be in terms of the number of calls to this Fourier value oracle for the function $f^{(\ell)}$. We will put everything together to bound the number of actual queries to $\tilde f$ in Section 8.

Definition 7.2 (Fourier Value Oracle). A Fourier Value Oracle for an underlying function $g$ and accuracy $\tau$, on a query $y\in\mathbb{R}^d$, returns a value $V_{\tau,g}(y)$ such that
\[
\left|\widehat{g}(y)-V_{\tau,g}(y)\right|\le \tau\,.
\]
Our next goal is to show that for an unknown sum of features $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ satisfying Assumptions 1 and 2, if we are given a direction $u$ that is sufficiently close to $v_i$ for some $i\in[n]$, then we can recover the corresponding function $\sigma_i$. First, we set parameters. We assume we are given parameters $d,n,R,L,\gamma$ and
\[
\varepsilon\le \frac{1}{\mathrm{poly}(d,n,R,L,1/\gamma)}
\]
for some sufficiently large polynomial.
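As a sanity check, Algorithm 3 is simple enough to simulate numerically. The sketch below uses a hypothetical two-dimensional instance with the bounded test function $f(x)=\cos(v^\top x)$, for which the smoothed Fourier value has the closed form $\ell^d\cdot\frac12\big(e^{-\ell^2\|y-v\|^2/2}+e^{-\ell^2\|y+v\|^2/2}\big)$; numpy is assumed, and the names `est_val`, `v`, `ell` are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, ell, m = 2, 2.0, 40000          # dimension, smoothing scale, sample budget
v = np.array([1.0, 0.0])
f = lambda X: np.cos(X @ v)        # bounded test function, |f| <= 1

def est_val(y):
    """Algorithm 3: V = ell^d * mean_j f(x_j) e^{-i y.x_j} with x_j ~ N(0, ell^2 I_d),
    an unbiased Monte Carlo estimate of the smoothed Fourier value of f at y."""
    X = rng.normal(0.0, ell, size=(m, d))
    c = f(X) * np.exp(-1j * (X @ y))
    return ell**d * c.mean()

# Closed form for this particular f (Gaussian Fourier transform of cos):
y = np.array([1.0, 0.0])
exact = ell**d * 0.5 * (np.exp(-ell**2 * np.sum((y - v)**2) / 2)
                        + np.exp(-ell**2 * np.sum((y + v)**2) / 2))
```

With $m=40{,}000$ samples, the empirical mean is within a few hundredths of the exact value, matching the $O(1/\sqrt{m})$ concentration in Claim 7.1.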
We then set
\[
\ell=\mathrm{poly}\left(d,n,R,L,\frac{1}{\gamma}\right)\,,\qquad \Delta=\frac{1}{\mathrm{poly}(\ell)}\,,\qquad \tau=10\varepsilon\ell^d \qquad (14)
\]
and we assume that $\varepsilon$ is sufficiently small that $\varepsilon\ll 1/\mathrm{poly}(1/\Delta)$. Note that this differs from the setting in (13) because now $\ell\ll 1/\Delta$. In our full algorithm, we will set two smoothing scales $\ell_1,\ell_2$ with $\ell_2\ll 1/\Delta\ll \ell_1$; we will run the first part, described in Section 6, with $\ell=\ell_1$, and the second part, described here, with $\ell=\ell_2$.

We first prove the following claim showing that querying $\widehat{f^{(\ell)}}$ allows us to get good point estimates for $\widehat{\sigma_i^{(\ell)}}$ if we are given a direction $u$ that is close to $v_i$. We will then use Claim 4.7 to reconstruct the function $\sigma_i$ by querying at a discrete grid of points to approximate the integral.

Claim 7.3. Let $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ be a sum of features satisfying Assumptions 1 and 2. Assume that we are given a unit vector $u$ such that $\|u-v_i\|\le \Delta$ (with parameters set in (14)). Then for any $t$ with $\frac{1}{\ell^{0.9}}\le |t|\le \frac{1}{\Delta^{0.1}}$,
\[
\left|\widehat{f^{(\ell)}}(tu)-a_i\ell^{d-1}\widehat{\sigma_i^{(\ell)}}(t)\right|\le 2\Delta^{0.8}\ell^d\,.
\]
Proof. First, by Claim 4.12 and the setting of parameters in (14),
\[
\left|\widehat{f^{(\ell)}}(tu)-\widehat{f^{(\ell)}}(tv_i)\right|\le n\ell^{d+1}|t|\,\|u-v_i\|\le \Delta^{0.8}\ell^d\,.
\]
Next, by Claim 4.11, we have the formula
\[
\widehat{f^{(\ell)}}(tv_i)=\sum_{j=1}^na_j\widehat{\sigma_j^{(\ell)}}(tv_j^\top v_i)\cdot\ell^{d-1}e^{-\ell^2\|tv_i-(tv_j^\top v_i)v_j\|^2/2}
\]
but for all $j\ne i$,
\[
\frac{\ell^2\|tv_i-(tv_j^\top v_i)v_j\|^2}{2}\ge \frac{\ell^2t^2\gamma^2}{2}\ge \frac{\ell^{0.2}\gamma^2}{2}\ge \ell^{0.1}\,.
\]
Thus, by the way we set $\ell$, we can ensure the sum of all of the terms for $j\ne i$ is at most $\Delta^{0.8}\ell^d$. The term for $j=i$ is exactly $a_i\ell^{d-1}\widehat{\sigma_i^{(\ell)}}(t)$, so putting these together, we get
\[
\left|\widehat{f^{(\ell)}}(tu)-a_i\ell^{d-1}\widehat{\sigma_i^{(\ell)}}(t)\right|\le 2\Delta^{0.8}\ell^d
\]
as desired. ∎

We can now prove the main lemma for this section.

Lemma 7.4 (Given a Direction, Recover the Function).
Let $f:\mathbb{R}^d\to\mathbb{R}$ be a sum of features $f(x)=a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)$ satisfying Assumptions 1 and 2. Suppose we are given a unit vector $u$ such that there exists some $i\in[n]$ with $\|u-v_i\|\le \Delta$. Then with parameters set as in (14), there is an algorithm that uses $\mathrm{poly}(1/\Delta)$ queries to a Fourier value oracle $V_{\tau,f^{(\ell)}}$ and $\mathrm{poly}(1/\Delta)$ runtime and outputs a function $\tilde\sigma:\mathbb{R}\to\mathbb{R}$ with
\[
\max_{x\in\mathbb{R}^d,\,\|x\|\le R}\left|\tilde\sigma(u^\top x)-a_i\sigma_i(v_i^\top x)\right|\le \frac{5}{\ell^{0.7}}\,.
\]
Proof. Let $S$ be the set of all $t$ that are integer multiples of $\Delta$ and satisfy $\frac{1}{\ell^{0.9}}\le |t|\le \frac{1}{\Delta^{0.1}}$. For each $t\in S$, we query $V_{\tau,f^{(\ell)}}(tu)$. We then define the function
\[
\tilde\sigma(z)=C+e^{\frac{z^2}{2\ell^2}}\cdot\frac{\Delta}{2\pi}\sum_{t\in S}e^{itz}\,\frac{V_{\tau,f^{(\ell)}}(tu)}{\ell^{d-1}}
\]
where $C$ is a constant chosen so that $\tilde\sigma(0)=0$. Define the set $T$ to include all $t$ that are integer multiples of $\Delta$ with $|t|\le \frac{1}{\ell^{0.9}}$. Define
\[
\phi(z):=e^{\frac{z^2}{2\ell^2}}\cdot\frac{\Delta}{2\pi}\sum_{t\in S\cup T}e^{itz}\,\frac{V_{\tau,f^{(\ell)}}(tu)}{\ell^{d-1}}\,,\qquad
\phi_0(z):=e^{\frac{z^2}{2\ell^2}}\cdot\frac{\Delta}{2\pi}\sum_{t\in S\cup T}e^{itz}\,a_i\widehat{\sigma_i^{(\ell)}}(t)\,.
\]
By the guarantees of the value oracle and Claim 7.3, we have for all $z$ with $|z|\le \ell$,
\[
|\phi(z)-\phi_0(z)|\le \Delta\cdot\frac{2}{\Delta^{1.1}}\cdot 2(\Delta^{0.8}+10\varepsilon)\ell\le \Delta^{0.6}\,.
\]
Also, by Claims 4.7 and 4.6 (which bound the error from discretizing the integral), we have for all $z$ with $|z|\le R$,
\[
|a_i\sigma_i(z)-\phi_0(z)|\le \Delta^{0.04}+\frac{2}{\Delta^{0.1}}\cdot 2\Delta\ell^2\le \Delta^{0.03}\,.
\]
Finally, note that we must have $\|\widehat{f^{(\ell)}}\|_\infty\le n\ell^d$ and thus the oracle values $V_{\tau,f^{(\ell)}}$ must be bounded by $2n\ell^d$. Define
\[
\rho(z):=e^{\frac{z^2}{2\ell^2}}\cdot\frac{\Delta}{2\pi}\sum_{t\in T}e^{itz}\,\frac{V_{\tau,f^{(\ell)}}(tu)}{\ell^{d-1}}\,.
\]
Then for any $z$ with $|z|\le R$, because all of the $t$ in the sum above have $|t|\le \frac{1}{\ell^{0.9}}$, we have
\[
|\rho(z)-\rho(0)|\le \frac{2}{\ell^{0.9}}\cdot(2\ell n)\cdot\frac{5R}{\ell^{0.9}}\le \frac{1}{\ell^{0.7}}\,.
\]
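To illustrate the interpolation formula defining $\tilde\sigma$, the following one-dimensional sketch reconstructs a response function from values of its Gaussian-windowed Fourier transform on a $\Delta$-spaced grid. Quadrature stands in for the Fourier value oracle, a transform convention without the $(2\pi)^{-1/2}$ prefactor is assumed (so the inversion weight is $\Delta/2\pi$), and all names are illustrative rather than from the paper.

```python
import numpy as np

ell = 5.0                                # smoothing scale
xs = np.arange(-40.0, 40.0, 0.05)        # spatial grid for the forward transform
sigma = np.sin                           # hidden 1-D response with sigma(0) = 0

def smoothed_fourier(t):
    """F(t) = integral of sigma(x) e^{-x^2/(2 ell^2)} e^{-i t x} dx (Riemann sum);
    in the algorithm these values would come from the Fourier value oracle."""
    g = sigma(xs) * np.exp(-xs**2 / (2 * ell**2))
    return 0.05 * np.sum(g * np.exp(-1j * t * xs))

delta = 0.01                             # frequency grid spacing (the Delta)
ts = np.arange(-4.0, 4.0, delta)         # frequency grid, |t| <= 4
F = np.array([smoothed_fourier(t) for t in ts])

def sigma_tilde(z):
    """sigma~(z) = e^{z^2/(2 ell^2)} (delta / 2 pi) sum_t e^{i t z} F(t):
    invert the transform, then divide out the Gaussian window."""
    s = (delta / (2 * np.pi)) * np.sum(np.exp(1j * ts * z) * F)
    return (np.exp(z**2 / (2 * ell**2)) * s).real
```

In the actual algorithm the values $F(t)$ are only known up to the oracle error $\tau$ and the bias of Claim 7.3, which is why the proof tracks the $\Delta^{0.6}$ and $\Delta^{0.03}$ error terms separately.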
Thus, since by assumption $\sigma_i(0)=0$ and we chose the shift $C$ so that $\tilde\sigma(0)=0$, combining everything we have shown so far implies that for all $z$ with $|z|\le R$,
\[
|\tilde\sigma(z)-a_i\sigma_i(z)|\le \Delta^{0.6}+\Delta^{0.03}+\frac{3}{\ell^{0.7}}\le \frac{4}{\ell^{0.7}}\,.
\]
Finally, to prove the high-dimensional statement in the lemma, note that for any $x\in\mathbb{R}^d$ with $\|x\|\le R$, by Claim 4.6,
\[
|a_i\sigma_i(v_i^\top x)-a_i\sigma_i(u^\top x)|\le R\Delta\ell^2\le \frac{1}{\ell^{0.7}}\,.
\]
Thus, for all $x$ with $\|x\|\le R$,
\[
|\tilde\sigma(u^\top x)-a_i\sigma_i(v_i^\top x)|\le |\tilde\sigma(u^\top x)-a_i\sigma_i(u^\top x)|+|a_i\sigma_i(v_i^\top x)-a_i\sigma_i(u^\top x)|\le \frac{5}{\ell^{0.7}}
\]
as desired. ∎

8 Putting Everything Together

We are now ready to put everything together and prove our main learning results. We first prove Theorem 2.5 and then show a simple reduction to remove the boundedness assumption (Assumption 2), obtaining a more general result in Theorem 8.1.

8.1 Proof of Theorem 2.5

With all of the components so far, the remainder of the proof of Theorem 2.5 proceeds by directly combining Lemma 6.9 and Lemma 7.4 with our concentration bounds for implementing the Fourier mass and Fourier value oracles in Corollary 5.3 and Claim 7.1.

Proof of Theorem 2.5. Given the parameters $d,n,L,R,\gamma,\varepsilon',\delta$, we can set
\[
\ell_2=\mathrm{poly}\left(d,n,L,R,\frac1\gamma,\frac1{\varepsilon'}\right)\,,\qquad \Delta=\frac{1}{\mathrm{poly}(\ell_2)}\,,\qquad \ell_1=\mathrm{poly}\left(\frac1\Delta\right)
\]
and assume $\varepsilon<1/\mathrm{poly}(\ell_1)$, all for sufficiently large polynomials. We apply Lemma 6.9 with $\ell\leftarrow\ell_1$. By Corollary 5.3, we can simulate all of the Fourier mass oracle queries with probability $1-0.1\delta^2$ using $\mathrm{poly}(\log(1/\delta),1/\varepsilon)$ actual queries to the function $\tilde f$. Also, we can run the algorithm $O(\log(1/\delta))$ times independently and post-process the points by majority voting to reduce the failure probability to $0.1\delta$. We now have a set of unit vectors $\{u_1,\dots,u_s\}$ for some $s\le n$.
The conditions in Lemma 6.9 imply that each $u_i$ must be close to exactly one hidden direction $v_j$. By permuting the indices and possibly negating some of the directions $v_i$, we may assume without loss of generality that
• For each $i\in[s]$, $\|u_i-v_i\|\le \Delta$
• For $i\ge s+1$, the function $\sigma_i(\cdot)$ is not $(R,\Delta)$-nondegenerate.

We now apply Lemma 7.4 on each of $u_1,\dots,u_s$ with $\ell\leftarrow\ell_2$ to obtain functions $\tilde\sigma_1,\dots,\tilde\sigma_s$, and then we output our estimate
\[
\bar f(x):=\tilde\sigma_1(u_1^\top x)+\dots+\tilde\sigma_s(u_s^\top x)\,.
\]
Claim 7.1 implies that we can simulate all of the Fourier value oracle queries with probability $1-0.1\delta^2$ using $\mathrm{poly}(\log(1/\delta),1/\varepsilon)$ actual queries to the function $\tilde f$. Now we bound the error of our estimate. Since $\sigma_i(\cdot)$ is not $(R,\Delta)$-nondegenerate for $i\ge s+1$, we get that for all $x$ with $\|x\|\le R$,
\[
\left|a_{s+1}\sigma_{s+1}(v_{s+1}^\top x)+\dots+a_n\sigma_n(v_n^\top x)\right|\le n\Delta
\]
since we assumed that $\sigma_i(0)=0$ for all $i$. Thus, the guarantees of Lemma 7.4 and the way we set the parameter $\ell_2$ give that for all $x$ with $\|x\|\le R$,
\[
\left|\bar f(x)-\left(a_1\sigma_1(v_1^\top x)+\dots+a_n\sigma_n(v_n^\top x)\right)\right|\le \varepsilon'
\]
as desired. The overall failure probability is at most $\delta$ and the total number of queries is polynomial in all relevant parameters, so this completes the proof. ∎

8.2 Removing Boundedness

We now show how to remove the assumption that the $\sigma_i$ are bounded (Assumption 2) by reducing to the bounded case. The idea is to first convolve $f$ with a small Gaussian to ensure smoothness. We then apply Theorem 2.5 to directional derivatives of the smoothed function, showing that we can simulate query access to these derivatives. Then, since the reconstructed functions are explicitly integrable, we simply integrate to reconstruct $f$. The formal statement is as follows:

Theorem 8.1.
For any target accuracy $\varepsilon'$, target domain $R$, and failure probability $\delta$, there is some $N=\mathrm{poly}(d,L,R,n,1/\gamma,1/\varepsilon',\log(1/\delta))$ such that if $\varepsilon<1/N$, then under Assumption 1, there is an algorithm that makes $\mathrm{poly}(N)$ queries, runs in $\mathrm{poly}(N)$ time, and outputs a sum of features
\[
\bar f(x)=a_1'\sigma_1'(v_1'^\top x)+\dots+a_n'\sigma_n'(v_n'^\top x)
\]
such that with probability $1-\delta$, for all $x$ with $\|x\|\le R$, $|f(x)-\bar f(x)|\le \varepsilon'$.

The following corollary, about identifying all of the directions $v_i$ for which $\sigma_i(\cdot)$ is nonlinear, is also immediate from the reduction in the proof of Theorem 8.1.

Corollary 8.2. In the same setting as Theorem 8.1, the algorithm can also guarantee to return directions $v_1',\dots,v_s'$ for some $s\le n$ such that
• Each returned direction $v_j'$ satisfies $\min(\|v_j'-v_i\|,\|v_j'+v_i\|)\le \varepsilon'$ for some $i\in[n]$.
• For each $i\in[n]$ such that the function $\sigma_i(\cdot)$ satisfies $\max_{z\in[-R,R]}|\sigma_i(z)-(az+b)|\ge \varepsilon'$ for every linear function $az+b$, there must be some returned direction $v_j'$ with $\min(\|v_j'-v_i\|,\|v_j'+v_i\|)\le \varepsilon'$.

We now formalize the reduction and prove Theorem 8.1.

Proof of Theorem 8.1. Let
\[
\eta=\frac{1}{\mathrm{poly}(d,L,R,n,1/\gamma,1/\varepsilon')}
\]
and define
\[
g(x)=\int_{\mathbb{R}^d}f(x+z)N_{0,\eta^2I_d}(z)\,dz\,.
\]
Then
\[
\nabla g(x)=\int_{\mathbb{R}^d}\nabla f(x+z)N_{0,\eta^2I_d}(z)\,dz=\sum_{i=1}^na_i\int_{\mathbb{R}^d}\nabla f_i(x+z)N_{0,\eta^2I_d}(z)\,dz
\]
where $f_i(x)=\sigma_i(v_i^\top x)$. Now for any unit vector $u$, we can write
\[
\langle u,\nabla g(x)\rangle=\sum_{i=1}^na_i\int_{\mathbb{R}^d}\langle u,v_i\rangle\,\sigma_i'(v_i^\top(x+z))N_{0,\eta^2I_d}(z)\,dz\,.
\]
First consider fixing $u$ and defining
\[
\rho_i(t)=\int_{-\infty}^{\infty}\langle u,v_i\rangle\,\sigma_i'(t+z)N_{0,\eta^2}(z)\,dz\,.
\]
Then we have
\[
\langle u,\nabla g(x)\rangle=a_1\rho_1(v_1^\top x)+\dots+a_n\rho_n(v_n^\top x)\,.
\]
The definition of $\rho_i$ immediately gives that $|\rho_i(t)|\le L$ for all $t\in\mathbb{R}$.
Next, we bound the derivative of $\rho_i$. We can write
\[
\rho_i'(t)=\langle u,v_i\rangle\,\frac{d}{dt}\int_{-\infty}^{\infty}\sigma_i'(z)N_{0,\eta^2}(t-z)\,dz=-\langle u,v_i\rangle\int_{-\infty}^{\infty}\sigma_i'(z)\cdot\frac{t-z}{\eta^2}\cdot N_{0,\eta^2}(t-z)\,dz\,.
\]
Thus we get that for all $t$, $|\rho_i'(t)|\le L/\eta$. Now define $h(x):=\langle u,\nabla g(x)-\nabla g(0)\rangle$. We have shown that, after rescaling by $\eta/L$, the function $h(x)$ satisfies both Assumptions 1 and 2. Next, we show how to simulate query access to the function $\langle u,\nabla g(x)\rangle$. Note that, by the bounds above, $h(x)$ is $nL/\eta$-Lipschitz. For any $\alpha$, we can write
\[
\frac{g(x+\alpha u)-g(x)}{\alpha}=\int_0^1\langle\nabla g(x+\alpha tu),u\rangle\,dt
\]
and thus
\[
\left|\frac{g(x+\alpha u)-g(x)}{\alpha}-\langle u,\nabla g(x)\rangle\right|\le \frac{nL\alpha}{\eta}\,.
\]
Now, with query access to $\tilde f$ satisfying $\|f-\tilde f\|_\infty\le \varepsilon$, we can estimate $g$ to $2\varepsilon$ accuracy using $\mathrm{poly}(1/\varepsilon)$ queries simply by sampling. By the above inequality, this lets us estimate $\langle u,\nabla g(x)\rangle$ to accuracy $\frac{4\varepsilon}{\alpha}+\frac{nL\alpha}{\eta}$. Thus, we can set $\alpha=\varepsilon^{0.5}$ and get an $\varepsilon^{0.4}$-accurate oracle for $\langle u,\nabla g(x)\rangle$ as long as $\varepsilon$ is sufficiently small in terms of $\eta$. We can now apply Theorem 2.5 (redefining parameters so that $\varepsilon'\leftarrow\eta$) to recover $h(x)$ to accuracy $\eta$ on the domain $\|x\|\le R$. Since we can also estimate the constant $\langle u,\nabla g(0)\rangle$ and add it back in, we now have an $\eta$-accurate approximation to the function $\langle u,\nabla g(x)\rangle$ as a sum of at most $n$ ridge functions. Recall that originally we fixed a choice of $u$, but we can apply the above for a collection of different $u$, say the set of standard basis vectors $e_1,\dots,e_d$. This gives us an estimate that is $\sqrt{d}\,\eta$-accurate for $\nabla g(x)$ on the domain $\|x\|\le R$. Note that the functions returned by Theorem 2.5 are in an explicit form as a sum of ridge functions, and thus we can integrate these estimates to get a sum of ridge functions in the same directions.
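The finite-difference simulation of the derivative oracle above can be sketched numerically. The sketch below uses a hypothetical instance with exact queries (i.e. $\varepsilon=0$), a single ridge function $f(x)=\sin(v^\top x)$, and a shared Gaussian sample reused at both evaluation points, a variance-reduction convenience not taken from the paper; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, alpha, m = 3, 0.1, 1e-3, 4000    # dimension, smoothing, step size, samples
v = np.array([1.0, 0.0, 0.0])
f = lambda X: np.sin(X @ v)              # single ridge feature

def grad_oracle(x, u):
    """Estimate <u, grad g(x)> for g = f convolved with N(0, eta^2 I_d), via the
    finite difference (g(x + alpha u) - g(x)) / alpha, where g is estimated by
    Monte Carlo over the Gaussian noise (same draws reused at both points)."""
    Z = rng.normal(0.0, eta, size=(m, d))
    return (f(x + alpha * u + Z).mean() - f(x + Z).mean()) / alpha

# For this f: <u, grad g(x)> = <u, v> cos(v.x) e^{-eta^2 |v|^2 / 2}.
x0 = np.zeros(d)
```

At $x=0$ with $u=v$ the true value is $e^{-\eta^2/2}\approx 0.995$, and for $u\perp v$ it is $0$; the estimator matches both up to the $O(\alpha)$ bias and Monte Carlo error discussed above.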
Thus, we now have some $\bar g$ of the desired form such that $|\bar g(x)-g(x)|\le R\sqrt{d}\,\eta$ on the domain $\|x\|\le R$. Finally, since $f$ is $nL$-Lipschitz, $\|g-f\|_\infty\le 2\sqrt{d}\,nL\eta$, and thus, since we chose $\eta$ sufficiently small, $|\bar g(x)-f(x)|\le 3R\sqrt{d}\,nL\eta\le \varepsilon'$ on the domain $\|x\|\le R$ and we are done. ∎

Proof of Corollary 8.2. The proof follows from the same reduction as in Theorem 8.1, applying Lemma 6.9 instead of Theorem 2.5. ∎