Paper deep dive
Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/12/2026, 7:33:04 PM
Summary
The paper introduces mathematical models of computation in superposition, where neural networks efficiently emulate sparse boolean circuits by representing more features than they have dimensions. The authors construct a 1-layer MLP to emulate Universal AND (U-AND) circuits using O~(m^(2/3)) neurons and generalize this to deep networks with error-correction layers, providing a framework for understanding how neural networks perform compressed computation.
Entities (5)
Relation Signals (3)
1-layer MLP → emulates → Universal AND (U-AND) circuit
confidence 95% · We construct a 1-layer MLP that uses superposition to perform this task up to ε-error
Neural Networks → utilize → Superposition
confidence 95% · Superposition – when a neural network represents more “features” than it has dimensions
Superposition → enables → Efficient Computation
confidence 90% · superposition is actively helpful for efficiently accomplishing the task
Cypher Suggestions (2)
Map the relationship between models and the tasks they perform · confidence 90% · unvalidated
MATCH (m:Model)-[:EMULATES]->(t:Task) RETURN m.name, t.name
Find all tasks related to superposition in the graph · confidence 85% · unvalidated
MATCH (e:Entity)-[:RELATED_TO]->(t:Task {name: 'Superposition'}) RETURN e, t
Abstract
Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies \emph{representational} superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of \emph{computation} in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the input features are \emph{themselves in superposition}. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct ``error correction'' layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and \emph{any} polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.
Tags
Links
- Source: https://arxiv.org/abs/2408.05451
- Canonical: https://arxiv.org/abs/2408.05451
Full Text
274,888 characters extracted from source content.
Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan

Abstract
Superposition – when a neural network represents more “features” than it has dimensions – seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies representational superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of computation in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{2/3})$ neurons, even when the input features are themselves in superposition. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct “error correction” layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and any polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.

Keywords: Machine Learning, ICML, superposition, random projection, sparse boolean circuits

1 Introduction
Mechanistic interpretability seeks to decipher the algorithms utilized by neural networks (Olah et al., 2017; Elhage et al., 2021; Räuker et al., 2023; Olah et al., 2020; Meng et al., 2023; Geiger et al., 2021; Wang et al., 2022; Conmy et al., 2024).
A significant obstacle is that neurons are polysemantic – activating in response to various unrelated inputs (Fusi et al., 2016; Nguyen et al., 2016; Olah et al., 2017; Geva et al., 2021; Goh et al., 2021). As a proposed explanation for polysemanticity, Olah et al. (2020) introduce the ‘superposition hypothesis’ (see also Arora et al. (2018); Elhage et al. (2022)): the idea that networks represent many more concepts in their activation spaces than they have neurons by sparsely encoding features as nearly orthogonal directions.

Figure 1: The naive way to linearly represent the pairwise ANDs of $m$ boolean variables using an MLP is to use one neuron to compute the AND of each pair of variables (left). This requires $\binom{m}{2} = O(m^2)$ neurons. However, when inputs are sparse, there is a much more efficient implementation using superposition (right). Here, each neuron checks whether at least two variables in a random subset are active. Then, for any pair of variables, we can read off the AND of that pair by averaging together the activations of all neurons corresponding to subsets containing both variables. With appropriately chosen subsets, we can $\varepsilon$-linearly represent all pairwise ANDs using only $\tilde{O}(m^{2/3})$ neurons, even when the inputs are themselves represented in superposition (Section 3).

Previous work has studied how networks can store more features than they have neurons in a range of toy models (Elhage et al., 2022; Scherlis et al., 2022). However, previous models of superposition either involve almost no computation (Elhage et al., 2022) or rely on some part of the computation not happening in superposition (Scherlis et al., 2022).
Insofar as neural networks are incentivized to learn as many circuits as possible (Olah et al., 2020), they are likely to compute circuits in the most compressed way possible. Therefore, understanding how networks can undergo more general computation in a fully superpositional way is valuable for understanding the algorithms they learn. In this paper, we lay the groundwork for understanding computation in superposition in general, by studying how neural networks can emulate sparse boolean circuits.

• In Section 2, we clarify existing definitions of linearly represented features, and propose our own definition which is more suited for reasoning about computation.
• In Section 3, we focus our study on the task of emulating the particular boolean circuit we call the Universal AND (U-AND) circuit. In this task, a neural network must take in a set of boolean features in superposition, and compute the pairwise logical ANDs of these features in a single layer with as few hidden neurons as possible. We present a construction which allows many more new features to be computed than the number of hidden neurons, with outputs represented natively in superposition. We argue that real neural networks may well implement our construction in the wild by proving that randomly initialised networks are very likely to emulate U-AND.
• In Section 4, we demonstrate a second reason why this task is worth studying: it is possible to modify our construction to allow a wide range of large boolean circuits to be emulated entirely in superposition, provided that they satisfy a certain sparsity property.

We conclude with a discussion of the limitations of our formal models, including the fact that our results are asymptotic and deal only with boolean features, and provide directions for future work that could address them.
2 Background and setup

2.1 Notation and conventions

Asymptotic complexity and $\tilde{O}$ notation. We make extensive use of standard Bachmann–Landau (“big O”) asymptotic notation. We use $\tilde{O}$ to indicate that we are ignoring polylogarithmic factors: $\tilde{O}(g(n)) := O(g(n) \log^k n)$ for some $k \in \mathbb{Z}$. (And likewise for $\tilde{\Theta}$, $\tilde{\Omega}$, etc.)

Fully connected neural networks. We use $\mathcal{M}_w : X \to Y$ to denote a neural network model parameterized by $w$ that takes input $x \in X$ and outputs $\mathcal{M}_w(x) \in Y$. In this work, we study fully-connected networks consisting of $L$ MLP layers with ReLU activations:

$\vec{a}^{(0)}(x) = x$
$\vec{a}^{(l)}(x) = \mathrm{MLP}^{(l)}(\vec{a}^{(l-1)}(x)) = \mathrm{ReLU}(W_{\mathrm{in}}^{(l)} \vec{a}^{(l-1)}(x) + w_{\mathrm{bias}}^{(l)})$
$\mathcal{M}_w(x) = W_{\mathrm{out}} \vec{a}^{(L)}(x),$

where $\mathrm{ReLU}(x) = \max(0, x)$, with the max taken elementwise. We assume that our MLPs have width $d$ for all hidden layers, that is, $\vec{a}^{(l)} \in \mathbb{R}^d$ for all $l \in \{1, \ldots, L\}$. For simplicity's sake we drop $l$ whenever we talk about only a single layer at a time.
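The layer recursion above can be written directly in numpy. This is a minimal illustrative sketch (the function name `mlp_forward` and the tiny shapes are our choices, not from the paper):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise.
    return np.maximum(0.0, x)

def mlp_forward(x, layers, W_out):
    """Fully-connected ReLU network: a^(0) = x,
    a^(l) = ReLU(W_in^(l) a^(l-1) + w_bias^(l)), output = W_out a^(L)."""
    a = x
    for W_in, w_bias in layers:
        a = relu(W_in @ a + w_bias)
    return W_out @ a

# Tiny example: input size m = 3, L = 2 hidden layers of width d = 4.
rng = np.random.default_rng(0)
d, m = 4, 3
layers = [(rng.normal(size=(d, m)), np.zeros(d)),
          (rng.normal(size=(d, d)), np.zeros(d))]
W_out = rng.normal(size=(2, d))
y = mlp_forward(rng.normal(size=m), layers, W_out)
```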
Figure 2: In Section 2.2, we distinguish between boolean features that are $\varepsilon$-linearly represented (left), ReLU-linearly represented (center left), and those that are only linearly separable (i.e. weakly linearly represented) (center right). Red/blue indicates the presence or absence of the feature. In addition to being linearly separable, $\varepsilon$-linearly represented features must satisfy the further condition that the variance along the readoff direction $\vec{r}_k$ within the positive and negative clusters is small compared to the margin between the two.

Features and feature vectors. Following previous work in mechanistic interpretability (e.g. Tamkin et al. (2023); Rajamanoharan et al. (2024)), we suppose that the activations of a model can be thought of as representing $m > d$ boolean features $f_k : X \to \{0, 1\}$ of the input in superposition. That is, $\vec{a}(x) = \sum_{k=1}^{m} \vec{\phi}_k f_k(x)$ for some set of feature vectors $\vec{\phi}_1, \ldots, \vec{\phi}_m \in \mathbb{R}^d$ and features $f_1, \ldots, f_m : X \to \{0, 1\}$. Equivalently, $\vec{a}(x) = \Phi b$, where $\Phi = (\vec{\phi}_1, \ldots, \vec{\phi}_m)$ is the $d \times m$ feature encoding matrix with columns equal to the feature vectors, and $b \in \{0,1\}^m$ is the boolean vector with entries $b_k = f_k(x)$.
In addition, as in previous work, we assume that these features are $s$-sparse, in that at most $s \ll d, m$ features $f_k$ can be nonzero for any input $x$ (equivalently, $\|b\|_1 \le s$). For clarity, we preferentially use $k, \ell \in \{1, \ldots, m\}$ to index features (in $\{0,1\}^m$) and $i, j \in \{1, \ldots, d\}$ to index the standard neuron basis of activations (in $\mathbb{R}^d$).

Sparse boolean circuits. We construct tasks where a neural network needs to emulate a boolean circuit $C : \{0,1\}^m \to \{0,1\}^{m'}$. We assume that this circuit can be written as $C = C_L \circ \cdots \circ C_1$, where each intermediate “layer” $C_l : \{0,1\}^m \to \{0,1\}^m$ is a collection of $m$ parallel boolean gates (of fan-in up to 2), for $l < L$. We say that a circuit $C$ is $s$-sparse on boolean input $b \in \{0,1\}^m$ if the input $b^{(0)} = b$ and all intermediate activations $b^{(l)} = C_l(b^{(l-1)})$ are $s$-sparse, i.e. they satisfy $\|b^{(l)}\|_1 \le s$.

2.2 Strong and weak linear representations

Given the activations of a neural network at a particular layer $a^{(l)} : X \to \mathbb{R}^d$, we can ask what features are linearly represented by $a^{(l)}$. In this section, we present three definitions of a feature being linearly represented by $a^{(l)}$, which we illustrate in Figure 2.
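The encoding $\vec{a}(x) = \Phi b$ with an $s$-sparse boolean vector $b$ can be sketched in a few lines of numpy (the dimensions and the random unit-norm columns here are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 64, 512, 4  # d activation dimensions, m > d features, sparsity s

# Feature encoding matrix Phi (d x m): columns are unit-norm feature vectors.
Phi = rng.normal(size=(d, m))
Phi /= np.linalg.norm(Phi, axis=0)

# An s-sparse boolean feature vector b, i.e. ||b||_1 <= s.
b = np.zeros(m)
b[rng.choice(m, size=s, replace=False)] = 1.0

# The activation vector represents the s active features in superposition.
a = Phi @ b
```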
The standard definition of linear representation is based on whether the representations of positive and negative examples can be separated by a hyperplane:

Definition 1 (Weak linear representations). We say that a binary feature $f_k$ is weakly linearly represented by $a : X \to \mathbb{R}^d$ (or linearly separable in $a$) if there exists some $\vec{r}_k \in \mathbb{R}^d$ such that for all $x_1, x_2 \in X$ where $f_k(x_1) = 0$ and $f_k(x_2) = 1$, we have $\vec{r}_k \cdot a(x_1) < \vec{r}_k \cdot a(x_2)$. Equivalently, the sets $\{x \mid f_k(x) = 0\}$ and $\{x \mid f_k(x) = 1\}$ are separated by a hyperplane normal to $\vec{r}_k$.

That being said, features being linearly separable does not mean a neural network can easily “make use” of the features. For some weakly linearly represented features $f_1$ and $f_2$, neither $f_1 \wedge f_2$ nor $f_1 \vee f_2$ need be linearly represented, even if their read-off vectors $\vec{r}_1, \vec{r}_2$ are orthogonal (Figure 3). In fact, a stronger statement is true: it might not even be possible to linearly separate $f_1 \wedge f_2$ or $f_1 \vee f_2$ in $\mathrm{MLP} \circ a$, that is, even after applying an MLP to the activations (see Theorem 9 in Appendix C).
As a result, in this paper we make use of a more restrictive notion of a feature being linearly represented:

Definition 2 ($\varepsilon$-linear representations). Let $X$ be a set of inputs and $a : X \to \mathbb{R}^d$ be the activations of a neural network (at a particular position/layer in a given model). We say that $f_1, \ldots, f_m$ are linearly represented with interference $\varepsilon$ (or $\varepsilon$-linearly represented in these activation vectors) if there exists a read-off matrix $R \in \mathrm{Mat}_{m \times d}$ with rows $\vec{r}_1, \ldots, \vec{r}_m \in \mathbb{R}^d$ such that for all $k \in \{1, \ldots, m\}$ and all $x \in X$, we have $|\vec{r}_k \cdot \vec{a}(x) - f_k(x)| < \varepsilon$. We refer to $\vec{r}_k$ as a read-off vector for the feature $f_k$.

It follows that if $\vec{a}(x) = \sum_{k=1}^{m} \vec{\phi}_k f_k(x)$, then we have $\|R\Phi - \mathrm{Id}_m\|_\infty < \varepsilon$, where $\mathrm{Id}_m$ is the $m \times m$ identity matrix.¹ For brevity's sake, we very slightly abuse notation here to include the bias term in $\vec{r}_k$.

¹ In some cases, if the feature vectors satisfy $|\Phi^T \Phi - \mathrm{Id}_m| \le \mu$, that is, if the feature vectors are almost orthogonal with interference $\mu$, then the feature vectors can function as their own readoffs.
This is equivalent to assuming that one of $\vec{a}$'s outputs is constant, that is, $a_i(x) = c$ for all $x$, for some $i \in \{1, \ldots, d\}$ and some $c \in \mathbb{R}$. In contrast to features that are merely linearly separable, features that are $\varepsilon$-linearly represented are easy to linearly separate, as we show in Figure 3. We formalize and prove this in Theorem 10 in Appendix C.

Figure 3: When two features $f_1, f_2$ are $\varepsilon$-linearly represented in activations $a(x)$, we can use two MLP neurons with input weights $\vec{r}_1, \vec{r}_2$ to read off the two features, after which $f_1 \wedge f_2$ and $f_1 \vee f_2$ are $\varepsilon$-linearly represented in the MLP activations $\mathrm{MLP}(a(x))$. However, because linearly separable features can have arbitrarily small margin, there might exist no MLP such that $f_1 \wedge f_2$ and $f_1 \vee f_2$ are linearly separable in $\mathrm{MLP}(a(x))$.

Comparison with Anthropic's Toy Model of Superposition. Finally, Elhage et al. (2022) and Bricken et al. (2023) consider a definition of linearly represented features that involves using a ReLU to remove negative interference:

Definition 3 (ReLU-linear representations). A set of $m$ binary features $\vec{F} = (f_1, \ldots, f_m)$ is ReLU-linearly represented in $a : X \to \mathbb{R}^d$ with error $\varepsilon$ if there exists a read-off matrix $R \in \mathrm{Mat}_{m \times d}$ such that $\mathbb{E}_{x \in X} \|\vec{F}(x) - \mathrm{ReLU}(R a(x))\|_2 < \varepsilon$.
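Definition 2 can be checked numerically. The sketch below uses the observation from the footnote to Definition 2 that almost-orthogonal unit-norm feature vectors can serve as their own readoffs, $R = \Phi^T$; the dimensions and thresholds are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 512, 1024, 3

# Random unit-norm columns are almost orthogonal (interference ~ 1/sqrt(d)),
# so they can serve as their own readoffs: R = Phi^T.
Phi = rng.normal(size=(d, m))
Phi /= np.linalg.norm(Phi, axis=0)
R = Phi.T

# Interference mu: how far R Phi is from the identity, entrywise.
mu = np.abs(R @ Phi - np.eye(m)).max()

# Read off every feature from a superposed s-sparse activation a = Phi b.
b = np.zeros(m)
b[rng.choice(m, size=s, replace=False)] = 1.0
a = Phi @ b
eps = np.abs(R @ a - b).max()  # epsilon of Definition 2, for this one input
```

At these settings both `mu` and `eps` come out small; shrinking $d$ relative to $m$ and $s$ drives the interference up.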
Note that in contrast to $\varepsilon$-linearly represented features, where each individual feature must be readable off using an affine function with small error on every datapoint, ReLU-linearly represented features are read off using an MLP layer with $m$ neurons (one per feature), such that the expected $\ell_2$ loss (summed across all $m$ features) is small.

3 Universal ANDs: a model of single-layer MLP superposition

We start by presenting one of the simplest non-trivial boolean circuits: namely, the one-layer circuit that computes the pairwise ANDs of the input features. Note that due to space limitations, we include only proof sketches in the main body and may ignore some regularity conditions in the theorem statements. See Appendix D for more rigorous theorem statements and proofs.

Definition 4 (The universal AND boolean circuit). Let $b \in \{0,1\}^m$ be a boolean vector. The universal AND (or U-AND) circuit has $m$ inputs and $\binom{m}{2}$ outputs indexed by unordered pairs $\{k, \ell\}$ of locations, and is defined by $C_{\mathrm{UAND}}(b)_{k,\ell} := b_k \wedge b_\ell$. In other words, we apply the AND gate to all possible pairs of distinct inputs to produce $\binom{m}{2}$ outputs. We will build our theory of computation starting from a single-layer neural net that emulates the universal AND when the input $b$ is $s$-sparse for some $s \in \mathbb{N}$ (this implies that the output has sparsity $O(s^2)$).
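Definition 4 admits a direct reference implementation (a minimal sketch; the function name and the lexicographic pair ordering are our choices):

```python
import numpy as np
from itertools import combinations

def u_and(b):
    """C_UAND: map b in {0,1}^m to the binom(m, 2) pairwise ANDs b_k AND b_l,
    with the unordered pairs {k, l} enumerated in lexicographic order."""
    return np.array([b[k] & b[l] for k, l in combinations(range(len(b)), 2)])

# m = 4 inputs with features 0, 2, 3 active: only pairs {0,2}, {0,3}, {2,3} fire.
b = np.array([1, 0, 1, 1])
out = u_and(b)  # pairs (0,1),(0,2),(0,3),(1,2),(1,3),(2,3) -> [0, 1, 1, 0, 0, 1]
```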
3.1 Superposition in MLP activations enables more efficient U-AND

First, consider the naive implementation, where we use one ReLU to implement each AND, using the fact that for boolean $x_1, x_2$: $\mathrm{ReLU}(x_1 + x_2 - 1) = x_1 \wedge x_2$. This requires $\binom{m}{2} = O(m^2)$ neurons, each of which is monosemantic in that it represents a single natural feature. In contrast, by using sparsity, we can construct U-AND using exponentially fewer neurons (Figure 1):

Theorem 1 (U-AND with basis-aligned inputs). Fix a sparsity parameter $s \in \mathbb{N}$. Then for large input length $m$, there exists a single-layer neural network $\mathcal{M}_w(x) = \mathrm{MLP}(x) = \mathrm{ReLU}(W_{\mathrm{in}} x + w_{\mathrm{bias}})$ that $\varepsilon$-linearly represents the universal AND circuit $C_{\mathrm{UAND}}$ on $s$-sparse inputs, with width $d = \tilde{O}_m(1/\varepsilon^2)$ (i.e. polylogarithmic in $m$).

Proof. (sketch) To show this, we construct an MLP such that each neuron checks whether at least two inputs in a small random subset of the boolean input $b$ are active (see also Figure 1). Intuitively, since the inputs are sparse, each neuron can be thought of as checking the AND of any pair of input variables $b_{k_1}, b_{k_2}$ in its subset, with interference terms corresponding to all the other variables.
That is, we can write the preactivation of each neuron as the sum of the AND of $b_{k_1}, b_{k_2}$ and some interference terms:

$\underbrace{-1 + b_{k_1} + b_{k_2}}_{b_{k_1} \wedge b_{k_2}} + \underbrace{\textstyle\sum_{k' \neq k_1, k_2} b_{k'}}_{\text{interference terms}}$

We then use the sparsity of the inputs to bound the size of the interference terms, and show that we can “read off” the AND of $b_{k_1}, b_{k_2}$ by averaging together the post-ReLU activations of the neurons connected to both $b_{k_1}$ and $b_{k_2}$. We then argue that this averaging reduces the size of the interference terms to below $\varepsilon$.
Specifically, we construct input weights $W_{\mathrm{in}} \in \mathrm{Mat}_{d \times m}$ such that each neuron is connected to the $k$th entry of the input $b_k$ with weight 1 with probability $p = \log^2 m / \sqrt{d}$, and weight 0 otherwise. We set the bias of each neuron to $-1$. Let $\Gamma(k)$ be the set of indices of neurons that have input weight 1 for $b_k$, $\Gamma(k_1, k_2)$ the indices of neurons that have input weight 1 for both $b_{k_1}$ and $b_{k_2}$, $\Gamma(k_1, k_2, k_3)$ the indices of neurons reading from all of $b_{k_1}, b_{k_2}, b_{k_3}$, and so forth. By construction, $\Gamma(k_1)$ has expected size $\Theta(\log^2 m \cdot \sqrt{d})$, $\Gamma(k_1, k_2)$ has expected size $\Theta(\log^4 m)$, and $\Gamma(k_1, k_2, k_3)$ has expected size $\Theta(\log^6 m / \sqrt{d})$.
In general, the set of indices for $n$ such inputs has expected size $\Theta(\log^{2n} m / d^{n/2 - 1})$. Our read-off vector $\vec{r}$ for the AND $b_{k_1} \wedge b_{k_2}$ will have entries:

$\vec{r}_{(i)} = \begin{cases} \frac{1}{|\Gamma(k_1, k_2)|} & i \in \Gamma(k_1, k_2) \\ 0 & \text{otherwise} \end{cases}$

We then check that $\vec{r} \cdot \mathrm{MLP}(b)$ gives the correct output in each of three cases. Note that for any input, $\vec{r} \cdot \mathrm{MLP}(b) \ge b_{k_1} \wedge b_{k_2}$, so it suffices to upper bound the average number of non-$k_1, k_2$ inputs that are nonzero, divided by the total number of neurons in $\Gamma(k_1, k_2)$.
• When $b_{k_1} = b_{k_2} = 0$, the interference terms in each read-off neuron have value at most $s$, and there are at most $\sum_{b_{k'} = b_{k''} = 1} |\Gamma(k_1, k_2, k', k'')| = \Theta(s^2 \cdot \log^8 m / d)$ such neurons outputting non-zero values. So the error is bounded above by

$\frac{s \cdot \sum_{k', k'' \neq k_1, k_2} |\Gamma(k_1, k_2, k', k'')|}{|\Gamma(k_1, k_2)|} = \Theta(s^3 \cdot \log^4 m / d).$

• When $b_{k_1} = 1$ or $b_{k_2} = 1$, the interference terms in each read-off neuron have value at most $s - 2$, and there are at most $\sum_{b_{k'} = 1} |\Gamma(k_1, k_2, k')| = \Theta(s \cdot \log^6 m / \sqrt{d})$ neurons that have such interference terms.
So the error is bounded above by

$\frac{s - 2}{|\Gamma(k_1, k_2)|} \sum_{k' \neq k_1, k_2} |\Gamma(k_1, k_2, k')| = \Theta(s^2 \cdot \log^2 m / \sqrt{d}).$

Combining the above, we get that the read-off error is $O(\log^4 m / \sqrt{d})$, and so setting $d = \Theta(\log^8 m / \varepsilon^2) = \tilde{O}_m(1/\varepsilon^2)$ gives us an error that is $< \varepsilon$ outside negligible probability. ∎

3.2 Neural networks can implement efficient U-AND even with inputs in superposition

Note that in Theorem 1, we assume that the network gets $m$ basis-aligned inputs (that is, not in superposition). However, it turns out that we can extend the result of Theorem 1 to inputs in superposition.

Theorem 2 (U-AND with inputs in superposition). Let $s \in \mathbb{N}$ be a fixed sparsity limit and $\varepsilon < 1$ a fixed interference parameter. There exists a feature encoding $\Phi$ and a single-layer neural net $\mathcal{M}_w(x) = \mathrm{MLP}(x) = \mathrm{ReLU}(W_{\mathrm{in}} x + w_{\mathrm{bias}})$ with input size and width $d = \tilde{O}(\sqrt{m}/\varepsilon^2)$, such that $\mathcal{M}_w \circ \Phi$ $\varepsilon$-linearly represents $C_{\mathrm{UAND}}$ on all $s$-sparse inputs $b$.

Proof.
(sketch) By picking almost orthogonal unit-norm vectors $\Phi = (\vec{\phi}_1, \ldots, \vec{\phi}_m)$, we can recover each feature up to error $\varepsilon$ using readoffs $R = \Phi^T$. Take the input weight matrix $W_{\mathrm{in}} \in \mathrm{Mat}_{d \times m}$ for the MLP constructed in the proof of Theorem 1. Using $W'_{\mathrm{in}} = W_{\mathrm{in}} R$ and $w'_{\mathrm{bias}} = w_{\mathrm{bias}}$ suffices, as this gives us

$\mathcal{M}_w \circ \Phi(b) = \mathrm{ReLU}(W_{\mathrm{in}} R \Phi b + w_{\mathrm{bias}}) \approx \mathrm{ReLU}(W_{\mathrm{in}} b + w_{\mathrm{bias}}),$

which is just the model from Theorem 1, which $\varepsilon$-linearly represents $C_{\mathrm{UAND}}$ as desired. Carefully tracking error terms shows that we need $d = \tilde{\Theta}(\sqrt{m})$ neurons. ∎

3.3 Randomly initialized neural networks linearly represent U-AND

While the results in the previous section show that there exist some network weights that $\varepsilon$-linearly represent the U-AND circuit $C_{\mathrm{UAND}}$, there is still the question of whether neural networks can learn to represent many ANDs starting from the standard initialization. In this section, we provide some theoretical evidence – namely, that sufficiently wide randomly initialized one-layer MLPs $\varepsilon$-linearly represent $C_{\mathrm{UAND}}$.

Theorem 3 (Randomly initialized MLPs linearly represent U-AND).
Let $\mathrm{MLP} : \mathbb{R}^m \to \mathbb{R}^d$ be a one-layer MLP with $d = \tilde{\Omega}(1/\varepsilon^2)$ neurons that takes input $b$, and where $W_{\mathrm{in}}$ is drawn i.i.d. from a normal distribution $\mathcal{N}(0, \delta^2)$ and $w_{\mathrm{bias}} = \vec{0}$. Then this MLP $\varepsilon$-linearly represents $C_{\mathrm{UAND}}$ on $s$-sparse inputs outside of negligible probability.

Proof. (Sketch) We prove this by constructing a read-off vector $\vec{r}$ for each pair of features $k_1, k_2$. Let $\sigma$ be the sign function
$$\sigma(x) = \begin{cases} +1 & x > 0 \\ \phantom{+}0 & x = 0 \\ -1 & x < 0 \end{cases}$$
and let $w_{i,k}$ be the contribution to the preactivation of neuron $i$ from $b_k$. We construct $\vec{r}$ coordinatewise (that is, neuron-by-neuron). In particular, we set the $i$th coordinate of $\vec{r}$ to be
$$\vec{r}_i = \eta_i \left( \mathbb{1}_{\sigma(w_{i,k_1}) = \sigma(w_{i,k_2})} - \mathbb{1}_{\sigma(w_{i,k_1}) \neq \sigma(w_{i,k_2})} \right).$$
That is, if $b_{k_1}$ and $b_{k_2}$ contribute to the neuron preactivations with the same sign, then $\vec{r}_i = \eta_i$; otherwise, $\vec{r}_i = -\eta_i$. Here, $\eta_i$ is a scaling parameter of size $\Theta(\sqrt{s}/d)$ used to scale the read-off to be $1$ when $b_{k_1} = b_{k_2} = 1$. When $b_{k_1} = 0$ or $b_{k_2} = 0$, the expected value of $\vec{r} \cdot \mathcal{M}_w(b)$ is zero, while the error terms have size $\tilde{O}_m(1/\sqrt{d})$. So setting $d = \tilde{\Omega}(1/\varepsilon^2)$ suffices to get error below $\varepsilon$ with high probability.
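This sign-agreement read-off can be sketched numerically. The sketch below uses illustrative sizes, unit-variance weights, and only the pair $k_1, k_2$ potentially active; it also omits the $\eta_i$ calibration, so the read-off is merely proportional to the AND rather than equal to it:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 20000                       # features, neurons (illustrative sizes only)
W = rng.standard_normal((d, m))         # W_in drawn i.i.d. from N(0, 1), zero bias

def mlp(b):
    return np.maximum(W @ b, 0.0)       # the one-layer ReLU MLP

k1, k2 = 0, 1
same = np.sign(W[:, k1]) == np.sign(W[:, k2])
r = np.where(same, 1.0, -1.0)           # unscaled sign-agreement read-off vector

b_both = np.zeros(m); b_both[[k1, k2]] = 1.0   # b_k1 = b_k2 = 1
b_one = np.zeros(m); b_one[k1] = 1.0           # b_k2 = 0

score_both = r @ mlp(b_both) / d        # clearly positive: same-sign neurons fire harder
score_one = r @ mlp(b_one) / d          # ~0: sign agreement is independent of w_{i,k1}
```

At these sizes `score_both` concentrates well above zero while `score_one` concentrates near zero, matching the two cases of the argument.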
When $b_{k_1} = b_{k_2} = 1$, the contribution to $\vec{r} \cdot \mathcal{M}_w(b)$ from each neuron $i$ with $\sigma(w_{i,k_1}) = \sigma(w_{i,k_2})$ will in expectation be larger than the contribution from neurons with $\sigma(w_{i,k_1}) \neq \sigma(w_{i,k_2})$ (as the standard deviation of the sum of two weights with equal signs is larger than that of two weights with different signs, and we apply a ReLU). By setting $\eta$ to be the reciprocal of the difference in expected contributions, we have that this value has expectation $1$. Again, as the error terms have size $\tilde{O}_m(1/\sqrt{d})$, it follows that setting $d = \tilde{\Omega}(1/\varepsilon^2)$ suffices to get error below $\varepsilon$ with high probability, as desired. ∎

Before proceeding, we record a corollary, which underscores the surprisingly strong asymptotic representability of the universal AND circuit.

Corollary 4.
For any fixed input size $s$, dimension $d$, and $m = d^{O(1)}$ polynomial in $d$, there exists a "universal AND" model with hidden dimension $d$, $\mathcal{M}_w : x \mapsto \mathrm{ReLU}(W_{\mathrm{in}}(x))$ from $\mathbb{R}^d$ to $\mathbb{R}^d$, and a feature matrix $\Phi \in \mathrm{Mat}_{m \times d}$ such that for any input $b$ with sparsity $\|b\|_1 = s$, the output $\mathcal{M}_w(\Phi(b)) \in \mathbb{R}^d$ strongly linearly represents $\mathrm{uAND}(b) \in \{0,1\}^{\binom{m}{2}}$ (with error at worst $\varepsilon = \tilde{O}(1/\sqrt{d})$).

4 MLPs as representing sparse boolean circuits

In the previous section we showed variants of computation in superposition at a single layer, for one of the simplest non-trivial boolean circuits. In this section, we extend these results to show that neural networks can efficiently represent arbitrary sparse boolean circuits. As in Section 3, we include only proof sketches in the main body due to space limitations, and may also ignore some regularity conditions in our theorem statements. See Appendix D for more rigorous theorem statements and proofs.

4.1 Boolean circuits in single layer MLPs

We start by extending the results from Section 3 to ANDs of more than two variables. Let $C_{\mathrm{UAND}}^{(n)}$ be the boolean circuit of depth $L = \log(n)$ that computes the ANDs of each $n$-tuple of elements in $b$.[2]

[2] Note that by our definition, boolean circuits are made of gates of fan-in at most 2, so computing the AND of $n$ variables requires a boolean circuit of depth $\log(n)$.
Lemma 5 ("High fan-in" U-AND). For each $n \in \mathbb{N}$, there exists a one-layer neural network $\mathcal{M}_w = \mathrm{MLP} : \mathbb{R}^m \to \mathbb{R}^d$ with width $d = \tilde{O}(n/\varepsilon^2)$ such that $\mathcal{M}_w(b)$ $\varepsilon$-linearly represents $C_{\mathrm{UAND}}^{(n)}$ on $s$-sparse inputs.

Proof. (Sketch) We can extend the construction in the proof of Theorem 1 to allow for ANDs of exactly $n$ variables by considering index sets $I(k_1, k_2, \ldots, k_n)$ of $n$ variables, and changing the bias of each neuron from $-1$ to $-n+1$. The expected size of an index set of $n$ variables is $\mathbb{E}[|I(k_1, k_2, \ldots, k_n)|] = p^n d$, and we require this expected value to be $\Omega(\log^4 m)$ to ensure that the index set is non-empty outside negligible probability (using the standard Chernoff and union bounds). Therefore, we have to scale up the probability that any given value in $W_{\mathrm{in}}$ is $1$: $p = \frac{\log^2 m}{d^{1/n}}$ suffices. A similar argument to the one found in the proof of Theorem 1 shows that all the interference terms are $o(1)$. ∎

As illustrated in Figure 4, Lemma 5 allows us to construct MLPs that $\varepsilon$-linearly represent arbitrary small circuits:

Figure 4: As discussed in Section 4.1, our U-AND construction can be extended to allow for arbitrarily high fan-in ANDs, which in turn allows for single-layer MLPs that linearly represent all small boolean circuits.

Theorem 6.
For any $s$-sparse circuit $C$ of width $m$ and depth $L$, there exists a feature encoding $\Phi \in \mathrm{Mat}_{d \times m}$ and a single-layer neural network $\mathcal{M}_w(x) = \mathrm{ReLU}(W_{\mathrm{in}} x + w_{\mathrm{bias}})$ of width $d = \tilde{O}(\sqrt{m})$ such that $\mathcal{M}_w(\Phi b)$ $\varepsilon$-linearly represents $C(b)_k$ for all $k \in \{1, \ldots, m\}$, for some $\varepsilon = \tilde{O}(m^{-1/3})$.

Proof. (Sketch) First, apply the construction in Theorem 2 to show that there exist one-layer MLPs of width $d = \tilde{O}(\sqrt{m})$ that compute $C_{\mathrm{UAND}}^{(n)}$ when the inputs are in superposition, for $n \in \{2, 3, \ldots, 2^L\}$. Next, concatenate together the $2^L - 1$ networks of width $d = \tilde{O}(\sqrt{m})$ that $\varepsilon$-linearly represent each $C_{\mathrm{UAND}}^{(n)}$ for $n \in \{2, 3, \ldots, 2^L\}$, with inputs in superposition. Since the output of any boolean circuit of depth $L$ can be written as a linear combination of ANDs of maximum fan-in $2^L$, it follows that the concatenated network $\varepsilon'$-linearly represents any boolean circuit of depth $L$, for some $\varepsilon'$ dependent on how many ANDs need to be added together to compute the circuit, as desired. ∎

4.2 Efficient boolean circuits via deep MLPs

The one-layer MLP in Theorem 6 has width that is exponential in the depth of the circuit.
However, by combining pairwise U-AND layers (which linearly represent any one-layer boolean circuit) with "error correction" layers, we can construct deeper neural networks with sublinear width and depth linear in the depth of the circuit.

Lemma 7. Assume that $m = \tilde{O}(d^{1.5})$ and that $c$ is some large polylog constant. Then for sufficiently small input interference $\varepsilon = \tilde{O}(1/\sqrt{d})$, there exists a 1-layer MLP $\mathcal{M}_w : \mathbb{R}^d \to \mathbb{R}^d$ that takes as input a boolean vector of length $m$ encoded in $d$ dimensions using superposition and returns (outside negligible probability) an encoding of the same boolean vector with interference $\varepsilon/c$.

Proof. See Theorem 21 in Appendix D.4. ∎

By alternating between such "error correction" layers and U-AND layers, we can construct more efficient circuits:

Theorem 8. Let $C : \{0,1\}^m \to \{0,1\}^m$ be a circuit of width $m$ and of depth $L = O(m^c)$ polynomial in $m$. There exists a neural network of width $d = \tilde{O}(m^{\frac{2}{3}} s^2)$ and with depth $2L$ such that $\mathcal{M}_w(\Phi b)$ $\varepsilon$-linearly represents $C(b)$ for all but a negligible fraction of inputs $b$ on which $C$ is $s$-sparse.

Proof. (Sketch) As a single MLP layer can $\varepsilon$-linearly represent the ANDs of all input features (by Theorem 2), we can use one MLP layer to approximate each layer of the circuit. However, the naive construction suffers from (potentially) exponentially growing error. To fix this, we insert an error correction layer from Lemma 7 between every such layer.
∎

5 Related Work

The idea that neural networks could or should make use of distributed or compositional representations has been a mainstay of early neural network research (Rosenblatt, 1961; Holyoak, 1987; Fodor & Pylyshyn, 1988). Arora et al. (2018) were the first in the modern deep learning context to discuss that neural networks could store many features in superposition. Olah et al. (2020) developed this idea into the 'superposition hypothesis': the conjecture that networks use the same neurons for multiple circuits to maximise the number of circuits they can learn. Many of our results are similar in flavor to those from the fields of sparse dictionary learning (Tillmann, 2014) and hyperdimensional computing (Zou et al., 2021), as all rely on useful properties of high-dimensional spaces. In addition, many of our boolean circuit results on randomly initialized MLP layers are similar in flavor to universality results on randomly initialized neural networks with different non-linearities (Rahimi & Recht, 2008a, b). However, these results consider cases where there are fewer "true features" than there are dimensions, while the superposition hypothesis requires that the number of "true features" exceeds the dimensionality of the space. Randomized numerical linear algebra (Murray et al., 2023) studies the use of random projections to perform efficient computation, but in the context of reducing the cost of linear algebra operations such as linear regression or SVD, with inputs and outputs represented in an axis-aligned fashion. Superposition has been studied in a range of idealised settings: Elhage et al. (2022) provided the first examples of toy models which employ superposition to achieve low loss, and Henighan et al. (2023) further explored superposition in a toy memorisation task. Notably, they study features that are ReLU-linearly represented. (See Section 2.2 for more discussion.) Scherlis et al.
(2022) study a model that uses a small number of neurons with quadratic activations to approximately compute degree-two polynomials. The models studied in all of these papers require sparse features of declining importance. In contrast, our model allows for sparse features that are equally important. More importantly, none of these listed works study performing computation with inputs in superposition. Several papers have also explored the prevalence of superposition in language models. Gurnee et al. (2023) found that some bigrams are represented on sparse sets of neurons but not on any individual neurons. There is also a growing literature on using sparse dictionary learning to identify features in language models inspired by the superposition hypothesis (Cunningham et al., 2023; Bricken et al., 2023; Tamkin et al., 2023; Bloom, 2024; Braun et al., 2024; Templeton et al., 2024), although it is unclear how much evidence the success of sparse dictionary learning in finding human-interpretable features provides for the superposition hypothesis.

6 Discussion

6.1 Summary

In this work, we have presented a mathematical framework for understanding how neural networks can perform computation in superposition, where the number of features computed can greatly exceed the number of neurons. We have demonstrated this capability through the construction of a neural network that efficiently emulates the Universal AND circuit, computing all pairwise logical ANDs of input features using far fewer neurons than the number of output features. Furthermore, we have shown how this construction can be generalized to emulate a wide range of sparse, low-depth boolean circuits entirely in superposition. This work lays the foundation for a deeper understanding of how neural networks can efficiently represent and manipulate information, and highlights the importance of considering computation in superposition when interpreting the algorithms learned by these systems.
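As a concrete illustration of the pairwise U-AND construction summarized above, here is a minimal numerical sketch in the spirit of Theorem 1. All sizes and the connection probability `p` are illustrative choices, not the paper's asymptotic ones (which take $p$ on the order of $\log^2 m/\sqrt{d}$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, s = 1000, 4000, 4       # features, neurons, number of active features
p = 0.08                      # connection probability (illustrative choice)
W = (rng.random((d, m)) < p).astype(float)   # sparse random {0,1} input weights
bias = -1.0                                  # a neuron fires iff >= 2 of its inputs are on

on = rng.choice(m, size=s, replace=False)    # the s active features
b = np.zeros(m); b[on] = 1.0
acts = np.maximum(W @ b + bias, 0.0)         # the single ReLU layer

k1, k2 = on[0], on[1]                        # a pair of active features
gamma = (W[:, k1] == 1) & (W[:, k2] == 1)    # neurons connected to both features
readoff_on = acts[gamma].mean()              # read-off for AND(b_k1, b_k2) = 1

k3 = int(np.setdiff1d(np.arange(m), on)[0])  # an inactive feature
gamma_off = (W[:, k1] == 1) & (W[:, k3] == 1)
readoff_off = acts[gamma_off].mean()         # read-off for AND(b_k1, b_k3) = 0
```

With these sizes, `readoff_on` lands near 1 and `readoff_off` near 0, up to interference terms contributed by the other active features.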
6.2 Practical Takeaways for Mechanistic Interpretability

Our primary motivation for undertaking this work was to glean insights about the computation implemented by neural networks. While we provide more potential takeaways in Appendix B, here we discuss what we think are two salient takeaways for interpretability.

Unused features. The implementation of U-AND by random matrices (Theorem 3) suggests that certain concepts may be detectable through linear probes in a network's activation space without being actively utilized in subsequent computations. This phenomenon could explain the findings of Marks (2024), who observed that arbitrary XORs of concepts can be successfully probed in language models. Furthermore, it implies that successfully probing for a concept and identifying a direction that explains a high percentage of variance (e.g., 80%) may not constitute strong evidence of the model's actual use of that concept. Consequently, there is reason to be cautious about how many of the features identified by Sparse Autoencoders (Cunningham et al., 2023; Bricken et al., 2023; Bloom, 2024; Templeton et al., 2024) are actively employed by the model in its computation.

Robustness to noise. This research underscores the critical role of error correction in networks performing computations in superposition. Effective error correction mechanisms should enable networks to rectify minor perturbations in their activation states, resulting in a nonlinear response in output when activation vectors are slightly altered along specific directions. Expanding on this concept, Heimersheim & Mendel (2023) conducted follow-up investigations, revealing the presence of plateaus surrounding activation vectors in GPT2-small (Radford et al., 2019). Within these plateaus, model outputs exhibit minimal variation despite small changes in activation values, providing weak evidence for an error correcting mechanism in the model's computation.
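The error-correction intuition can be sketched with a toy thresholding step: a sparse boolean vector stored in superposition is read off against near-orthogonal feature directions, cleaned by thresholding, and re-encoded. This is only an illustration of the idea, not the paper's Lemma 7 construction, and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, s = 2000, 1000, 5                          # features, dimensions, active features
Phi = rng.standard_normal((d, m)) / np.sqrt(d)   # near-orthogonal feature directions

b = np.zeros(m)
b[rng.choice(m, size=s, replace=False)] = 1.0
x = Phi @ b + 0.05 * rng.standard_normal(d)      # noisy superposed encoding

readoff = Phi.T @ x                              # ~ b plus small interference
b_hat = (readoff > 0.5).astype(float)            # threshold away the noise
x_clean = Phi @ b_hat                            # re-encode, removing the added noise

recovered = np.array_equal(b_hat, b)             # exact recovery w.h.p. at these sizes
```

The thresholding nonlinearity is exactly what makes the output insensitive to small perturbations of the activation vector, consistent with the plateau observations discussed above.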
6.3 Limitations and future work

That being said, there are a number of ways in which the computational framework presented in this work is very likely to miss the full richness of computation happening in any given real neural network. Firstly, this work studies computation on binary features. It is plausible that other kinds of features – in particular, discrete features which take on more than 2 distinct values, or continuous-valued features – occur commonly in real neural networks. It would be valuable to extend the understanding developed in this work to such non-binary features. Secondly, though we do not require features to have declining importance, we do require features to be sparse, with each data point only having a small number of active features. It is plausible that not all features are sparse in practice (given the present state of empirical evidence, it even appears open to us whether a significant fraction of features are sparse in practice) – for instance, perhaps real neural networks partly use more compositional representations with dense features. Thirdly, in this work, we have made a particular choice regarding what it takes for a feature to be provided in the input and to have been computed in the output: $\varepsilon$-linear representation (Definition 2). Future empirical results or theoretical arguments could call for revising this choice – for instance, perhaps an eventual full reverse-engineering picture would permit certain kinds of non-linear features. Finally, and least specifically, the way of looking at neural net computation suggested in this work could turn out to be thoroughly confused. We consider there to be a lot of room for the development of a more principled and empirically grounded picture.

Impact Statement

The primary impact of our work is to advance the field of mechanistic interpretability.
While advancing this field may have many potential societal impacts, we feel that there are no direct, non-standard impacts of our work that are worth highlighting.

References

Arora et al. (2018) Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.
Bellare (2002) Bellare, M. A note on negligible functions. Journal of Cryptology, 15:271–284, 2002.
Bernstein (1924) Bernstein, S. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.
Bloom (2024) Bloom, J. Open source sparse autoencoders for all residual stream layers of GPT2 small. https://w.alignmentforum.org/posts/f9EgfLSurAiqRJySD/, 2024.
Braun et al. (2024) Braun, D., Taylor, J., Goldowsky-Dill, N., and Sharkey, L. Identifying functionally important features with end-to-end sparse dictionary learning. arXiv preprint arXiv:2405.12241, 2024.
Bricken et al. (2023) Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Conmy et al. (2024) Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36, 2024.
Cunningham et al. (2023) Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
Elhage et al. (2021) Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
Elhage et al. (2022) Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
Fodor & Pylyshyn (1988) Fodor, J. A. and Pylyshyn, Z. W. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.
Fusi et al. (2016) Fusi, S., Miller, E. K., and Rigotti, M. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37:66–74, 2016. doi: 10.1016/j.conb.2016.01.010.
Geiger et al. (2021) Geiger, A., Lu, H., Icard, T., and Potts, C. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.
Geva et al. (2021) Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2021.
Goh et al. (2021) Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. Multimodal neurons in artificial neural networks. Distill, 2021. doi: 10.23915/distill.00030.
Gurnee et al. (2023) Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
Heimersheim & Mendel (2023) Heimersheim, S. and Mendel, J. Interim research report: Activation plateaus and sensitive periods in transformer training, 2023. https://w.alignmentforum.org/posts/LajDyGyiyX8DNNsuF/interim-research-report-activation-plateaus-and-sensitive-1. Accessed: 2024-07-27.
Henighan et al. (2023) Henighan, T., Carter, S., Hume, T., Elhage, N., Lasenby, R., Fort, S., Schiefer, N., and Olah, C. Superposition, memorization, and double descent. Transformer Circuits Thread, 2023.
Holyoak (1987) Holyoak, K. J. Parallel distributed processing: explorations in the microstructure of cognition. Science, 236:992–997, 1987.
Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
Marks (2024) Marks, S. What's up with LLMs representing XORs of arbitrary features?, 2024. https://w.alignmentforum.org/posts/hjJXCn9GsskysDceS/what-s-up-with-llms-representing-xors-of-arbitrary-features. Accessed: 2024-07-27.
Meng et al. (2023) Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT, 2023.
Murray et al. (2023) Murray, R., Demmel, J., Mahoney, M. W., Erichson, N. B., Melnichenko, M., Malik, O. A., Grigori, L., Luszczek, P., Dereziński, M., Lopes, M. E., et al. Randomized numerical linear algebra: A perspective on the field with an eye to software. arXiv preprint arXiv:2302.11474, 2023.
Nguyen et al. (2016) Nguyen, A., Yosinski, J., and Clune, J. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks, 2016.
Olah et al. (2017) Olah, C., Mordvintsev, A., and Schubert, L. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007.
Olah et al. (2020) Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Rahimi & Recht (2008a) Rahimi, A. and Recht, B. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561. IEEE, 2008a.
Rahimi & Recht (2008b) Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems, 21, 2008b.
Rajamanoharan et al. (2024) Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., and Nanda, N. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024.
Rosenblatt (1961) Rosenblatt, F. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Technical report, Cornell Aeronautical Lab Inc, Buffalo, NY, 1961.
Räuker et al. (2023) Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks, 2023.
Scherlis et al. (2022) Scherlis, A., Sachan, K., Jermyn, A. S., Benton, J., and Shlegeris, B. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.
Taggart (2024) Taggart, G. M. ProLU: A nonlinearity for sparse autoencoders. https://w.alignmentforum.org/posts/HEpufTdakGTTKgoYF/prolu-a-nonlinearity-for-sparse-autoencoders, 2024.
Tamkin et al. (2023) Tamkin, A., Taufeeque, M., and Goodman, N. D. Codebook features: Sparse and discrete interpretability for neural networks. arXiv preprint arXiv:2310.17230, 2023.
Templeton et al. (2024) Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Tillmann (2014) Tillmann, A. M. On the computational intractability of exact and approximate dictionary learning. IEEE Signal Processing Letters, 22(1):45–49, 2014.
Wang et al. (2022) Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022.
Zou et al. (2021) Zou, Z., Alimohamadi, H., Imani, F., Kim, Y., and Imani, M. Spiking hyperdimensional network: Neuromorphic models integrated with memory-inspired framework. arXiv preprint arXiv:2110.00214, 2021.

Appendix A Mathematical definitions

Here, we list and define the mathematical terms that we use throughout this work.
$X$ : set of inputs
$Y$ : set of outputs
$\mathcal{M}_w : X \to Y$ : neural network with $\mathrm{ReLU}$ activations, parameterized by $w$
$\vec{a}^{(l)}(x) \in \mathbb{R}^d$ : the activations of the neural network at layer $l$, for $l \in \{0, \ldots, L\}$
$\mathrm{MLP}^{(l)} : \mathbb{R}^d \to \mathbb{R}^d$ : the $l$th MLP layer, $\mathrm{MLP}^{(l)}(x) = \mathrm{ReLU}(W_{\mathrm{in}}^{(l)} x + w_{\mathrm{bias}}^{(l)})$
$f_k : X \to \{0,1\}$ : a boolean feature of the input, $k = 1, \ldots, m$
$F : X \to \{0,1\}^m$ : the concatenation of the $m$ boolean features
$\vec{\phi}_k \in \mathbb{R}^d$ : the vector linearly representing the $k$th boolean feature
$\Phi \in \mathbb{R}^{d \times m}$ : the feature embedding matrix, $\Phi = (\vec{\phi}_1, \ldots, \vec{\phi}_m)$
$b = b(x) \in \{0,1\}^m$ : a boolean vector of length $m$ associated to an input/activation
$b_k = b_k(x) \in \{0,1\}$ : the $k$th entry of the boolean vector, equal to $f_k(x)$
$\|b(x)\|_1$ : the "sparsity", i.e. the number of bits that are "on" in the boolean vector $b$, equal to $\sum_{k=1}^{m} f_k(x)$
$C : \{0,1\}^m \to \{0,1\}^{m'}$ : a boolean circuit
$C_l : \{0,1\}^m \to \{0,1\}^{m'}$ : layer $l$ of the boolean circuit $C$, consisting of $m'$ boolean gates of fan-in at most two

Table 1: Definitions of terms used in this work.

We also use the following conventions for clarity:

$i, j \in \{1, \ldots, d\}$ : indices for neurons
$k, \ell, p \in \{1, \ldots, m\}$ : indices for features
$\mu$ : the amount of interference between near-orthogonal vectors
$\varepsilon$ : the error in the read-off of a boolean feature
$s$ : a bound on the "sparsity"; we require $\|b(x)\|_1 \leq s$ for all $x \in X$

Table 2: Conventions used in this work.

We assume our terms satisfy the following asymptotic relationships in terms of the principal complexity parameter $m$ (the number of features):

$d$ is polynomial in $m$, so $d = \tilde{\Omega}(m^{\alpha_+})$ and $d = \tilde{O}(m^{\alpha_-})$ for some finite exponents $0 < \alpha_+ \leq \alpha_- < \infty$.
$s$ is at worst polynomial in $m$, so $s = O(m^{\beta})$. Note that this is different from the body, where we assumed $s$ is a constant (so $\beta = 0$).
$s = \tilde{O}(d^{1/3})$. This is a technical "sparsity" condition that will be useful for us.

Table 3: Asymptotic relationships between variables in this work.
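As a quick illustration of the interference parameter $\mu$ from Table 2: sampling many more random unit-norm directions than dimensions still yields only a small worst-case pairwise overlap. The sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 512, 2000                          # many more directions than dimensions
Phi = rng.standard_normal((d, m))
Phi /= np.linalg.norm(Phi, axis=0)        # m unit-norm feature directions in R^d

G = Phi.T @ Phi                           # all pairwise dot products
mu = np.abs(G - np.eye(m)).max()          # worst-case interference mu
# mu is of order sqrt(log(m) / d): well below 1, even though m >> d
```

This smallness of $\mu$ is what makes the read-offs $R = \Phi^T$ in the main text work: each feature can be recovered up to an error set by the interference.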
Appendix B Potential takeaways for practical mechanistic interpretability

Our motivation for studying these mathematical models is to glean insights about the computation implemented by real networks that could have ramifications for the field of mechanistic interpretability, particularly the subfield focussed on taking features out of superposition in language models using sparse dictionary learning (Cunningham et al., 2023; Bricken et al., 2023; Tamkin et al., 2023; Bloom, 2024; Braun et al., 2024; Templeton et al., 2024). In order to render the models mathematically tractable, we have had to make idealising assumptions about the computation implemented by the networks.

1. Early work on superposition (Elhage et al., 2022) suggested that it may be possible to store exponentially many features in superposition in an activation space. On the other hand, early sparse dictionary learning efforts (Cunningham et al., 2023; Bricken et al., 2023; Bloom, 2024) learn dictionaries that are smaller than even the square of the dimension of the activation space. Our work suggests that the number of features that can be stored in superposition and computed with is likely to be around $\tilde{O}(d^2)$ (this is also the information-theoretic limit). We think that using a dictionary size that scales quadratically in the size of the activations, while computationally challenging, will likely lead to better performance on downstream tasks. We are heartened by more recent work by Templeton et al. (2024), which uses dictionaries that are closer to this size, and would encourage more systems-oriented work to scale to ever larger dictionaries.

2. The current mainstream sparse autoencoder (SAE) architecture used by Cunningham et al. (2023); Bricken et al. (2023); Bloom (2024); Templeton et al. (2024) and others uses ReLUs to read off feature values, in accordance with the toy model of superposition of Elhage et al.
(2022) and features being ReLU-linearly represented. Our work suggests that networks may be more expressive when storing features $\varepsilon$-linearly. If so, this suggests that future work should consider sparse dictionary learning with alternative activation functions that only allow for removing errors of size $\varepsilon$, such as the noise-filtering nonlinearity
$$\mathrm{NF}_\varepsilon(x) = \begin{cases} x & |x| > \varepsilon \\ 0 & |x| \le \varepsilon, \end{cases}$$
or nonlinearities that filter out all but the $k$ largest positive and largest negative preactivations. Notably, recent work by Rajamanoharan et al. (2024); Taggart (2024) finds suggestive evidence that the ProLU activation
$$\mathrm{ProLU}_\varepsilon(x) = \begin{cases} x & x > \varepsilon \\ 0 & x \le \varepsilon \end{cases}$$
outperforms SAEs with the standard ReLU activation, which accords with the predictions in this work.

3. Previous work by Gurnee et al. (2023) found some features that were represented on a small set of neurons, even when they weren't represented on any single neuron. In our constructions, feature representations end up distributed over a larger range of neurons. We expect that networks which employ superposition heavily to maximise their expressiveness are unlikely to have many sparse features that are localised to one or even a few neurons.

Appendix C Additional discussion of various feature definitions

C.1 Formal statements and proofs for facts referenced in main body

We present formal statements and proofs that we referred to in Section 2.1. Note that without loss of generality, we can include the activation function $a$ into our input set $X$, so we omit the use of $a$ in this section.

Theorem 9 (Composition of linearly separable features).
There exist a set of inputs $X$ and two features $f_1, f_2$ weakly linearly represented in $X$ such that there exists no MLP layer $\mathrm{MLP}$ for which either $f_1 \wedge f_2$ or $f_1 \vee f_2$ is linearly separable in $\mathrm{MLP}(x)$.

Proof. (sketch) Let $X = [-1,1]^2$ be the unit square in $\mathbb{R}^2$, and let $f_1(x) = \mathbb{1}(x_1 > 0)$ and $f_2(x) = \mathbb{1}(x_2 > 0)$ be the indicator functions of whether the first and second coordinates are greater than zero. There exists no MLP layer $\mathrm{MLP} : X \to \mathbb{R}^d$ of any width $d$ such that $f_1 \wedge f_2$ is linearly separable in $\mathrm{MLP}(X)$. To show this, it suffices to notice that any MLP layer has a finite Lipschitz coefficient, and that any function weakly linearly representing $f_1 \wedge f_2$ or $f_1 \vee f_2$ would need to have an arbitrarily high Lipschitz coefficient (since there exist points arbitrarily close to the separating hyperplanes of $f_1$ and $f_2$). ∎

Theorem 10 (Composition of $\varepsilon$-linearly represented features). For any set $X$ and features $f_1, f_2$ that are $\varepsilon$-linearly represented in $X$, there exists a two-neuron MLP $\mathrm{MLP} : X \to \mathbb{R}^2$ such that $f_1 \wedge f_2$ and $f_1 \vee f_2$ are $\varepsilon'$-linearly represented in $\mathrm{MLP}(X)$ for some $\varepsilon'$.

Proof. (sketch) We use an MLP with two neurons $\mathrm{MLP}_1$, $\mathrm{MLP}_2$ whose input weights are the read-off vectors $\vec{r}_1, \vec{r}_2$.
To read off $f_1 \wedge f_2$, we use the read-off $\vec{r}_{1 \wedge 2}$ defined by $\vec{r}_{1 \wedge 2}(x) = \mathrm{MLP}_1(x) + \mathrm{MLP}_2(x) - 3/4$. Similarly, to read off $f_1 \vee f_2$, we use the read-off $\vec{r}_{1 \vee 2}(x) = \mathrm{MLP}_1(x) + \mathrm{MLP}_2(x) - 1/4$. ∎

In fact, by allowing for wider MLPs, it is fairly easy to construct an MLP $\mathrm{MLP} : X \to \mathbb{R}^d$ such that $f_1 \wedge f_2$ and $f_1 \vee f_2$ are also $\varepsilon$-linearly represented in $\mathrm{MLP}(X)$ (that is, with equal error). We leave the construction of this MLP as an exercise for the reader.

Appendix D Precise statements and proofs of theorems

Let $m$ be a parameter associated to the length of a boolean input. For the remainder of this section, we will work with real parameters $\alpha, \beta_{\mathrm{in}}, \beta_{\mathrm{out}}, \gamma$ which do not scale with $m$ and correspond to scaling exponents.
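The two-neuron construction of Theorem 10 above can be checked on a toy example. The sketch below takes the clean case $\varepsilon = 0$ (features read off exactly from a 2-dimensional embedding, an assumption of this illustration, not part of the theorem); the read-offs with the constants $3/4$ and $1/4$ from the proof sketch then separate the AND and OR classes by a margin of $1$:

```python
import numpy as np

# Read-off vectors for f1 and f2 in a toy 2-dimensional embedding
# (f1 along e_0, f2 along e_1, read off exactly, i.e. eps = 0).
r1 = np.array([1.0, 0.0])
r2 = np.array([0.0, 1.0])

def two_neuron_mlp(a):
    """Sum of two ReLU neurons whose input weights are the read-off vectors."""
    return np.maximum(r1 @ a, 0.0) + np.maximum(r2 @ a, 0.0)

vals = {(f1, f2): two_neuron_mlp(np.array([f1, f2]))
        for f1 in (0.0, 1.0) for f2 in (0.0, 1.0)}

# Shifting by 3/4 (resp. 1/4) separates the AND (resp. OR) class from its
# complement with a margin of 1.
and_readoff = {k: v - 0.75 for k, v in vals.items()}
or_readoff = {k: v - 0.25 for k, v in vals.items()}
print(and_readoff, or_readoff)
```

With $\varepsilon$-error in the inputs each neuron moves by at most $O(\varepsilon)$, so the margin degrades gracefully, which is why the composed features remain $\varepsilon'$-linearly represented.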
We impose the following asymptotic relationships on the parameters $m$ (length of the boolean input), $d = d_{\mathrm{in}}$ (width of the emulating neural net), $s$ (sparsity, i.e., number of $1$ values, of suitable boolean variables), $\varepsilon_{\mathrm{in}}$ (incoming interference, if applicable), and $\varepsilon_{\mathrm{out}}$ (outgoing interference):
$$m = \tilde{\Omega}(r^\alpha) \tag{1}$$
$$\varepsilon_{\mathrm{in}} = \tilde{\Omega}(r^{-\beta_{\mathrm{in}}}) \tag{2}$$
$$\varepsilon_{\mathrm{out}} = \tilde{O}(r^{-\beta_{\mathrm{out}}}) \tag{3}$$
$$s = \tilde{O}(r^\gamma). \tag{4}$$
More precisely, we assume that a large parameter $m$ is given and that the $O(\mathrm{polylog}(m))$ scaling factors implicit in the $\tilde{O}, \tilde{\Omega}$ asymptotics can be chosen in a suitable way to make the results hold.

D.1 Emulation of AND layer

In this section we prove a generalization of Theorem 3.2. Let $\Gamma \subset \{1, \dots, m\}^{[2]}$ be the edges of a graph (here the superscript $[2]$ denotes the "exterior power" of a set, i.e., the set of $\binom{m}{2}$ unordered pairs). Assume that the number of edges satisfies $|E_\Gamma| = \tilde{O}(m)$. Let $C_\Gamma : \{0,1\}^m \to \{0,1\}^{E_\Gamma}$ be the circuit whose value at the unordered pair $(k, \ell) \in E_\Gamma$ corresponding to an edge of $\Gamma$ is $C_\Gamma(b)_{(k,\ell)} = b_k \wedge b_\ell$.
We think of $C_\Gamma$ as the (not quite universal) circuit that takes ANDs of the pairs of features in $\Gamma$ and returns a boolean vector of roughly the same size. We will show that this circuit can be emulated with suitably small interference on the output. The proof is very similar to the proof of the error correction theorem above (Theorem 21), in particular with the main argument controlled by a subset $\Sigma \subset \{1, \dots, m\} \times \{1, \dots, d\}$, with $m$ the number of edges of $\Gamma$ (i.e., outputs of the circuit). There are however three main differences.

1. What we read from each subset $\Sigma_{k,\ell}$ associated to an edge $(k, \ell) \in \Gamma$ is the result of a nonlinearity applied to a sum of two random $\pm 1$ vectors $\phi_k, \phi_\ell$ (associated to the two inputs $k, \ell$), which returns (up to small error) the sum over the neurons of $\Sigma_{k,\ell}$ where the signs of $\phi_k$ and $\phi_\ell$ are both $1$.

2. To control interference issues, we need to carefully partition the graph $\Gamma$ into pieces with a certain asymptotic "balanced" property (see Theorem 13).

3. The output interference is $\tilde{O}\left(\sqrt{\frac{s^2}{d}}\right)$ instead of $\tilde{O}\left(\sqrt{\frac{s}{d}}\right)$, since there are $O(s^2)$ active output features (corresponding to pairs of features that are on).

Theorem 11 (Targeted superpositional AND). Let $m$ be an integer and $\Gamma \subset \{1, \dots, m\} \times \{1, \dots, m\}$ a graph.
Assume we have a readoff matrix $R_{\mathrm{in}} \in \mathrm{Mat}_{m \times d}$ that maps a $d$-dimensional space to an $m$-dimensional space, and let $s = o(\sqrt{m})$ be a sparsity parameter (either polynomial or polylogarithmic in $m$). Let $\varepsilon_{\mathrm{in}}$ be an interference parameter. Assume that $\varepsilon_{\mathrm{in}}^2 m d \sqrt{d/s} = \tilde{O}(1)$, i.e., that it is bounded by some sufficiently small inverse polylogarithmic expression in $m$. Then there exists a single-layer mixed emulation $\mathcal{M}_w(x) = \mathrm{ReLU}(W_{\mathrm{in}} x + w_{\mathrm{bias}})$ of the universal AND circuit $C_{\mathrm{uand}}$ (together with an "output readoff" matrix $R_{\mathrm{out}}$) such that $\mathcal{M}_w$ is an emulation of $C_\Gamma$ on the input class $\mathcal{B} = \mathcal{B}_s$ of boolean vectors of sparsity $\le s$, with precision $\varepsilon_{\mathrm{in}} \to \varepsilon_{\mathrm{out}}$, for $\varepsilon_{\mathrm{out}} = \tilde{O}\left(\sqrt{\frac{s^2}{d}}\right)$.

Before proving the theorem, we note that our U-AND statements are corollaries:

Corollary 12 (U-AND with basis-aligned inputs). Fix a sparsity parameter $s \in \mathbb{N}$. Then for large input length $m$, there exists a single-layer neural network $\mathcal{M}_w(x) = \mathrm{MLP}(x) = \mathrm{ReLU}(W_{\mathrm{in}} x + w_{\mathrm{bias}})$ that $\varepsilon$-linearly represents the universal AND circuit $C_{\mathrm{UAND}}$ on $s$-sparse inputs, with width $d = \tilde{O}_m(1/\varepsilon^2)$ (i.e.
polylogarithmic in $m$). This follows from the fact that the incoming interference $\varepsilon_{\mathrm{in}} = 0$, since the incoming feature basis is basis-aligned.

Corollary 13 (U-AND with inputs in superposition). Let $s \in \mathbb{N}$ be a fixed sparsity limit and $\varepsilon < 1$ a fixed interference parameter. There exists a feature encoding $\Phi$ and a single-layer neural net $\mathcal{M}_w(x) = \mathrm{MLP}(x) = \mathrm{ReLU}(W_{\mathrm{in}} x + w_{\mathrm{bias}})$ with input size $m_{\mathrm{in}}$ and width $d = \tilde{O}(\sqrt{m_{\mathrm{in}}}/\varepsilon^2)$, such that $\mathcal{M}_w \circ \Phi$ $\varepsilon$-linearly represents $C_{\mathrm{UAND}}$ on all $s$-sparse inputs $b$. This follows by restricting all but $m_{\mathrm{in}} = \sqrt{m}$ of the input features to $0$ and taking $\Gamma$ to be the complete graph on the vertices $\{0, \dots, m_{\mathrm{in}}\}$.

Now we prove the theorem.

Proof. We begin by considering a simpler case. We say that a graph $\Gamma$ with $m$ edges is self-balanced if each vertex has degree at most $\tilde{O}(1)$ (some fixed polylogarithmic-in-$m$ bound). Suppose $\Gamma$ is self-balanced. Define $A := \sqrt{d/s}$. For each edge $(k, \ell) \in \Gamma$, choose at random a subset $\Sigma_{k,\ell} \subset \{1, \dots, d\}$ of size within a polylog error of $A$. Write also $\Sigma_k = \bigcup_{\ell \mid (k,\ell) \in \Gamma} \Sigma_{k,\ell}$.
Write down feature vectors $\vec{\phi}_{k,\ell} = \sum_{i \in \Sigma_{k,\ell}} \pm \vec{e}_i$, with signs $\sigma_{k,i}$ chosen independently and randomly for each $k, i$. For a pair $(k, \ell) \in \Gamma$, define the vector $\vec{r}_{k,\ell}$ to be the indicator of the set of neurons $\Sigma_{k,\ell}^{\mathrm{out}} := \{i \in \Sigma_{k,\ell} \mid \sigma_{k,i} = \sigma_{\ell,i} = 1\}$. Note that $|\Sigma_{k,\ell}^{\mathrm{out}}|$ is, o.n.p. (outside negligible probability), within a polylog difference of $\frac{1}{4}|\Sigma_{k,\ell}| = \frac{A}{4}$ elements. Write $\vec{\phi}_k^{\mathrm{in}} := \sum_{\ell \mid (k,\ell) \in \Gamma} \vec{\phi}_{k,\ell}$, and note that this is (up to signs) the indicator function of a union of polylog-many independently chosen sets of size $A$. Write $\Phi^{\mathrm{in}}$ for the $d \times m$ matrix with columns $\vec{\phi}_k^{\mathrm{in}}$. Now we define the emulation net to be $\mathcal{M}_{w_\Gamma}(x) = \frac{4}{A} \mathrm{ReLU}(\Phi^{\mathrm{in}}(x) - 1)$. We note that (outside interference and collision errors of frequency bounded o.n.p.
by $\tilde{O}(\varepsilon_{\mathrm{out}})$) we have
$$\mathrm{ReLU}(\Phi^{\mathrm{in}}(b) - 1)_i = \begin{cases} 1, & \exists\, k, \ell \in S \text{ with } i \in \Sigma_{k,\ell} \text{ and } \sigma_{k,i} = \sigma_{\ell,i} = 1 \\ 0, & \text{otherwise.} \end{cases}$$
Here, as before, we take $S \subset \{1, \dots, m\}$ to be the set of features that are on. Analogously to part 1 of our proof of Lemma 23, we see that the difference $\Phi^T(b) - \Phi^T(R_{\mathrm{in}}(x))$ is (o.n.p.) bounded by $o(1)$, and thus we are done just as in the previous lemma.

For general graphs $\Gamma$, we might have an issue if some vertices have very high degree; if one were to try to run the same proof, their corresponding features would then admit unmanageably high interference. To fix this, we note that in order to emulate $C_\Gamma$ it is sufficient (up to polylogarithmically increasing the number of neurons) to emulate $C_{\Gamma_1}, \dots, C_{\Gamma_T}$ for some polylogarithmic collection of graphs $\Gamma_t$ with $\cup_t \Gamma_t = \Gamma$. We now split an arbitrary graph $\Gamma$ into subgraphs with a nice "balanced" property. Let $a, b \in \mathbb{R}$ be parameters.
We say that a graph is $a,b$-balanced if it is bipartite on a pair of disjoint subsets of vertices $V_0, V_1 \subset \{0, \dots, m\}$ such that $|V_0| = a$, $|V_1| = b$, each vertex in $V_0$ has degree at most $m/a$, and each vertex in $V_1$ has degree at most $m/b$. We say a graph $\Gamma \subset \{0, \dots, m\}^{[2]}$ is balanced if it is $a,b$-balanced for some $a, b$. It can be shown using an inductive argument that any graph $\Gamma$ with $m$ edges can be written as a union of $\mathrm{polylog}(m)$ balanced graphs.

Now it remains to show that the theorem holds for a balanced graph. Indeed, suppose that $\Gamma$ has vertices supported on $V_0 \sqcup V_1 \subset \{1, \dots, m\}$ and is $a,b$-balanced. Suppose (WLOG) that $a \le b$. Then we randomly partition the neurons $\{1, \dots, d\}$ into $a$ roughly equal sets $\Sigma_k$ for $k \in V_0$ (equivalently, we choose a random map $\{1, \dots, d\} \to V_0$ and define $\Sigma_k$ to be the preimage of $k$). We then choose, for $\ell \in V_1$, the set $\Sigma_{k,\ell}$ to be a random subset of size about $\sqrt{d/s^2}$ inside $\Sigma_k$, and define $\Sigma_\ell = \cup_{k \mid (k,\ell) \in \Gamma} \Sigma_{k,\ell}$. We finish the argument by bounding the errors in the same way as in the self-balanced case, concluding the proof. ∎

D.2 Universal AND with inputs in superposition

We use the conventions from Section A. We make an additional assumption: that our inputs $\vec{a}^{(0)}(x)$ for $x \in X$ approximately lie on a sphere of suitable radius.
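As an aside, the self-balanced construction from the proof of Theorem 11 above is simple enough to simulate directly. The sketch below is an illustration rather than the construction verbatim: polylog factors are dropped, the graph is a cycle (so every vertex has degree two), and the neuron blocks $\Sigma_{k,\ell}$ are drawn as a disjoint partition instead of independently at random, which removes collision noise from the picture.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 36000
edges = [(k, (k + 1) % m) for k in range(m)]   # a cycle: every degree is 2

# Disjoint neuron blocks Sigma_{k,l}, one per edge (a simplification: the
# proof draws these subsets at random).
perm = rng.permutation(d)
A = d // len(edges)
Sigma = {e: perm[t * A:(t + 1) * A] for t, e in enumerate(edges)}

# One random sign per (vertex, neuron) pair, shared across the vertex's edges.
sign = rng.choice([-1.0, 1.0], size=(m, d))

# Feature embedding Phi_in (d x m): column k sums +-e_i over k's edge blocks.
Phi_in = np.zeros((d, m))
for (k, l), idx in Sigma.items():
    Phi_in[idx, k] += sign[k, idx]
    Phi_in[idx, l] += sign[l, idx]

b = np.zeros(m)
b[[0, 1, 7, 13]] = 1.0                          # a sparse boolean input

acts = (4.0 / A) * np.maximum(Phi_in @ b - 1.0, 0.0)   # the emulation net

def read_and(k, l):
    """Read off b_k AND b_l on the neurons where both signs are +1."""
    idx = Sigma[(k, l)]
    return acts[idx[(sign[k, idx] == 1) & (sign[l, idx] == 1)]].sum()

print(round(read_and(0, 1), 2), round(read_and(1, 2), 2))  # approx 1 and 0
```

Reading off the indicator of $\Sigma_{k,\ell}^{\mathrm{out}}$, rescaled by $4/A$, recovers $b_k \wedge b_\ell$ up to the concentration error of $|\Sigma_{k,\ell}^{\mathrm{out}}|$ around $A/4$.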
Note that if $m = d$ and the feature basis $\vec{\phi}_i$ is an orthonormal basis, then $|\Phi(b)| = \sqrt{\|b\|_1}$, so the $\ell_2$ norm of the embedding is the square root of the sparsity. If the sparsity $\|b\|_1$ is exactly $s$ and the feature interference parameter $\mu$ is sufficiently small compared to the sparsity bound $s$, we still have $|\Phi(b)| \approx \sqrt{s}$ (with some suitable bound; in general, the error will be $\tilde{O}(\mu s^{1.5})$). If instead we assume only that the boolean features $f_i(b)$ are $\varepsilon$-linearly represented for suitable $\varepsilon > \frac{1}{\sqrt{d}}$, then in general we cannot guarantee that $|\vec{a}^{(0)}(x)| \approx \sqrt{s}$; rather, we will have $|\vec{a}^{(0)}(x)| = \tilde{\Omega}(\sqrt{s})$, since (especially for small $s$) the norm might be significantly increased by adding a large vector that is almost orthogonal to all features (and thus doesn't affect the linear representability of the $f_i$). This observation allows us, in principle, to write down a vector with some suitable norm in $\tilde{\theta}(\sqrt{s})$ which $\varepsilon$-linearly represents a very sparse boolean vector $b$ with $\|b\|_1 \ll s$.
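The claim above that $|\Phi(b)| \approx \sqrt{s}$ for almost-orthogonal features is easy to check numerically. A toy check (the random sign features here have interference of order $\sqrt{\log m / d}$, standing in for a small $\mu$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 1024, 4000, 10

# Random sign features: unit norm, pairwise interference ~ sqrt(log(m)/d).
Phi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)

b = np.zeros(m)
b[rng.choice(m, size=s, replace=False)] = 1.0   # an s-sparse boolean vector

norm = np.linalg.norm(Phi @ b)
print(f"|Phi(b)| = {norm:.3f}, sqrt(s) = {np.sqrt(s):.3f}")
```

The squared norm is $s$ plus the $s(s-1)$ cross-interference terms, each of size about $1/\sqrt{d}$, so for small $\mu$ the embedding norm concentrates tightly around $\sqrt{s}$.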
We show how to modify inputs with unknown bounded sparsity $\|b(x)\|_1 < s$ to have an (approximately) constant norm in the following section. For now, we assume, in addition to $\|b(x)\|_1 < s$, that all our inputs have norm equal to some $s_0 = \tilde{O}(\sqrt{s})$ up to a small error.

Theorem 14. Let $m, d, X, \Phi, \varepsilon = \varepsilon_0, \mu, s$ be as in Appendix A. Let $r$ be a parameter so that $r^2 = \tilde{O}(s)$. Assume, in addition to the conditions on $X, \Phi$ in Appendix A, that for any input $x \in X$ we have $|\vec{a}^{(0)}(x)| = r + \tilde{O}\left(\frac{\sqrt{s}}{\sqrt{d}}\right)$, i.e., that the inputs lie approximately on a sphere of radius $r$. Let $W \in \mathrm{Mat}_{d \times d}$ be a random weight matrix with i.i.d. Gaussian-distributed entries, and let $\vec{a}^{(1)}(x) = \mathcal{M}_w(\vec{a}^{(0)}(x)) := \mathrm{ReLU}(W \vec{a}^{(0)}(x))$ be the associated neural net.
Then there exist some $\varepsilon^{(1)} = \tilde{O}(\max(s\mu, \sqrt{s}\,\varepsilon, \sqrt{s/d}))$ and $\mu^{(1)} = \tilde{O}(\max(\sqrt{1/d}, \mu))$ such that the boolean function $f_{k \wedge \ell}(x) := f_k(x) \wedge f_\ell(x)$ is $\varepsilon^{(1)}$-linearly represented by a feature vector $\vec{\phi}_{k \wedge \ell}^{(1)} \in \mathbb{R}^d$, outside negligible probability (in the entries of $W$). Moreover, up to rescaling by a fixed scalar, the feature vectors $\vec{\phi}_{k \wedge \ell}$ form an almost-orthogonal collection with feature interference parameter $\mu^{(1)}$.

Corollary 15. The result of Theorem 14 remains true with the assumption $|\vec{a}^{(0)}(x)|^2 = r^2 + \tilde{O}(\varepsilon s)$ (that inputs are close to a sphere) replaced by $|\vec{a}^{(0)}(x)|^2 = \tilde{O}(s)$, at the cost of increasing the depth of the neural network $\mathcal{M}_w$ from $1$ to $3$.

Proof. (Of corollary.) This follows by chaining the neural network constructed in this theorem with the "norm-balancer network" constructed in Appendix D.3 (which is independent from this one). ∎

The idea of the proof of Theorem 14 is derived from the quadratic-activations case, $\mathcal{M}_w(\vec{x}) = Q(W\vec{x})$, where $Q$ is the function that squares the entries of a vector coordinatewise.
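A small simulation of this quadratic-activations case (a sketch under simplifying assumptions of this illustration: exactly orthonormal features, so $\mu = 0$, and an explicit $d/2$ rescaling of the readoff):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))

# Exactly orthonormal features phi_k = e_k (so mu = 0); b_0 = b_1 = b_2 = 1.
a0 = np.zeros(d)
a0[[0, 1, 2]] = 1.0

h = (W @ a0) ** 2            # quadratic "activation" Q, coordinatewise square
pre = W[:, :8]               # preactivations W phi_k of features 0..7

def read_and(k, l):
    """Readoff R_{k,l}^i = a_k^i a_l^i, rescaled by d/2 for this sketch."""
    return (d / 2.0) * np.dot(pre[:, k] * pre[:, l], h)

print(round(read_and(0, 1), 2), round(read_and(0, 3), 2))  # approx 1 and 0
```

Per neuron, $\mathbb{E}[x y (b_k x + b_\ell y + z)^2] = 2\, b_k b_\ell\, \mathrm{Var}(x)\mathrm{Var}(y)$, so summing over the $d$ neurons and rescaling by $d/2$ leaves $b_k \wedge b_\ell$ plus $O(1/\sqrt{d})$ noise.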
Let $a_k^i = W(\vec{\phi}_k)^i$ (for $i \in \{0, \dots, d-1\}$) be the coordinates of the preactivation vector $W(\vec{\phi}_k)$ associated to the $k$th boolean bit. One can show using the theory of quadratic forms that the readoff vector $R_{k,\ell}^i = a_k^i a_\ell^i$ gives a valid readoff direction to show $\varepsilon$-strong linear separation of the boolean expression $b_k \wedge b_\ell$ (o.n.p.). We will show that a similar strategy works for an arbitrary (reasonable, and in particular nonlinear) activation function, including ReLU. Write down the unnormalized model $\mathcal{M}_w^u(\vec{x}) := \mathrm{ReLU}(W(\vec{x}))$. Define $\vec{\phi}_k' = W\vec{\phi}_k$ to be the preactivation of $\vec{\phi}_k$ under this model. Define the unnormalized readoff vector for the UAND coordinate associated to the pair of features $k, \ell$ as follows:
$$\vec{r}_{k,\ell}^{\,i} = \mathrm{sign}\left((\vec{\phi}_k')_i \cdot (\vec{\phi}_\ell')_i\right),$$
where $\mathrm{sign}(x)$ is the sign function that returns $-1, 0, 1$ depending on whether $x$ is negative, $0$, or positive, respectively.

Remark 16.
Note that, as we care about the existence of a linear representation rather than a learnable formula for it, the readoff doesn't have to depend continuously on the parameters. However, a continuous dependence is also possible: the readoff vector with coordinates $a_k^i \cdot a_\ell^i$ (the same as for quadratic activations) would also work, with an alternative normalization. The important property of the readoff function is that it is odd in each of the $x$ and $y$ coordinates independently, and that it does not have wild asymptotic behavior. We use the discrete "sign" function for the readoff for convenience.

The crucial observation is the following simple lemma. For a given input $x$, let $\vec{a}(x)$ be the corresponding embedding. Let $\hat{\vec{a}}(x) := \vec{a}(x) - f_k(x)\vec{\phi}_k - f_\ell(x)\vec{\phi}_\ell$ (the "hat" notation denotes that we are "skipping" the information about features $k$ and $\ell$ in the embedded input $\vec{a}(x)$; it linearly represents the modification of the boolean vector $b(x)$ that zeroes out the $k$th and $\ell$th coordinates).

Lemma 17. Suppose $\Phi, k, \ell$, and $b$ are fixed. Then, in the context of the theorem above, the unnormalized readoff $R_{k,\ell}^u(\mathcal{M}_w(\Phi(b)))$ is a sum of $d$ i.i.d.
variables of the form $F(x_i, y_i, z_i)$, where
$$F(x, y, z) = \mathrm{sign}(x)\,\mathrm{sign}(y)\,\mathrm{ReLU}(b_k x + b_\ell y + z)$$
and the triple $(x_i, y_i, z_i)$ is drawn from the distribution $N(0, \Sigma)$, where
$$\Sigma = \begin{pmatrix} \|\vec{\phi}_k\|_2^2 & \vec{\phi}_k \cdot \vec{\phi}_\ell & \vec{\phi}_k \cdot \hat{\vec{a}} \\ \vec{\phi}_k \cdot \vec{\phi}_\ell & \|\vec{\phi}_\ell\|_2^2 & \vec{\phi}_\ell \cdot \hat{\vec{a}} \\ \vec{\phi}_k \cdot \hat{\vec{a}} & \vec{\phi}_\ell \cdot \hat{\vec{a}} & \|\hat{\vec{a}}\|^2 \end{pmatrix}.$$

Proof.
Write $x_i = (\vec{\phi}_k')_i$, $y_i = (\vec{\phi}_\ell')_i$, and $z_i = (W\hat{\vec{a}}(x))_i$ for the neuronal coordinates of the corresponding activations. Then $(R_{k,\ell}^u)_i = \mathrm{sign}(x_i)\,\mathrm{sign}(y_i)$ and
$$\mathcal{M}_w^u(\vec{a}(x))_i = \mathrm{ReLU}(W(\vec{a}(x))_i) = \mathrm{ReLU}(b_k(x)\, x_i + b_\ell(x)\, y_i + z_i).$$
It remains to show that $(x_i, y_i, z_i)$ is drawn according to the Gaussian distribution $N(0, \Sigma)$. This follows from the standard result that applying a matrix with i.i.d. Gaussian entries in $N(0, 1/d)$ to a collection of vectors $\vec{v}_1, \dots, \vec{v}_n$ yields a (possibly singular) Gaussian with PSD covariance matrix $\Sigma_{k\ell} = \vec{v}_k \cdot \vec{v}_\ell$.
∎

Now our interference bounds imply that the triple $(x_i,\,y_i,\,z_i+b_kx_i+b_\ell y_i)$ is distributed with covariance matrix of the form
$$\begin{pmatrix}1+O(\mu) & O(\mu) & b_k+O(\varepsilon)\\ O(\mu) & 1+O(\mu) & b_\ell+O(\varepsilon)\\ b_k+O(\varepsilon) & b_\ell+O(\varepsilon) & r^2+\tilde O(s/\sqrt d)\end{pmatrix}.$$
Let $s':=r^2-b_k-b_\ell$ and $r':=\sqrt{s'}$. Now o.n.p., we can assume that $x_i,y_i\in\tilde O(1)$ and $z_i\in\tilde O(r)$. Since $F$ grows linearly, we see that $F(x_i,y_i,z_i)\in\tilde O(r)$ o.n.p.
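The standard result used in the proof above (a matrix with i.i.d. Gaussian entries sends a collection of vectors to jointly Gaussian coordinates whose covariance is the matrix of dot products) is easy to check numerically. The sketch below is illustrative only; the dimension $d$, the test vectors, and the seed are arbitrary choices, not from the paper.

```python
import math
import random

def empirical_cov(v1, v2, d=20000, seed=0):
    """Apply a random matrix W with i.i.d. N(0, 1/d) entries to v1 and v2;
    each coordinate i gives one sample of the pair ((W v1)_i, (W v2)_i).
    Summing the products x_i * y_i over the d coordinates should recover
    the dot product v1 . v2, since Cov((W v1)_i, (W v2)_i) = (v1 . v2)/d."""
    rng = random.Random(seed)
    n = len(v1)
    acc = 0.0
    for _ in range(d):
        row = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(n)]
        x = sum(r * a for r, a in zip(row, v1))  # (W v1)_i ~ N(0, |v1|^2 / d)
        y = sum(r * a for r, a in zip(row, v2))  # (W v2)_i ~ N(0, |v2|^2 / d)
        acc += x * y
    return acc  # concentrates around v1 . v2 up to O(1/sqrt(d)) fluctuations

cov_est = empirical_cov([1.0, 0.0, 0.0], [0.6, 0.8, 0.0])  # v1 . v2 = 0.6
```
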
We can now apply Bernstein's inequality [29] to get that, o.n.p.,
$$\sum_{i=1}^d F(x_i,y_i,z_i)=d\left[\mathbb E_{(x,y,z)\sim\mathcal N(0,\Sigma)}F(x,y,z)+\tilde O(r/\sqrt d)\right].$$
Now since $r=\tilde O(\sqrt s)$ and $|(r')^2-r^2|$ is an integer equal to at most $2$ (the sum of two feature readoffs of $\vec a$), the error term in the Bernstein inequality is bounded by $\tilde O(r'/\sqrt d)$. It remains to estimate the expectation $E:=\mathbb E_{(x,y,z)\sim\mathcal N(0,\Sigma)}F(x,y,z)$. Assume that $b(x)$ has nonzero coordinates other than at $k,\ell$, so that $r'=\Omega(1)$ (the case where $b(x)$ only has nonzero coordinates on a subset of $\{k,\ell\}$ can be handled similarly and more easily). In this case, we introduce the new notation
$$F'(x,y,z'):=F(x,y,\sqrt{s'}\,z')=\operatorname{sign}(x)\operatorname{sign}(y)\operatorname{ReLU}(r'z'+b_kx+b_\ell y),$$
where the third input of $F$ is rescaled to make the distribution on $(x,y,z')$ closer to the identity Gaussian.
Let $\Sigma'$ be the covariance of the distribution on $(x,y,z')$, given by $\Sigma'=\operatorname{diag}(1,1,(r')^{-1})\,\Sigma\,\operatorname{diag}(1,1,(r')^{-1})$. Since the two differ by a reparametrization, the expectation of $F'$ on $\mathcal N(0,\Sigma')$ is equal to the expectation of $F$ on $\mathcal N(0,\Sigma)$. Let $X'=\mathcal N(0,\Sigma')$ and $X_0'=\mathcal N(0,\Gamma)$, both on $\mathbb R^3$. Our various interference bounds imply that the difference $\Sigma'-\Gamma$ is bounded by
$$\delta:=\tilde O\!\left(\max\!\left(\frac{\sqrt s}{\sqrt d},\,\frac{\varepsilon}{\sqrt s},\,\mu\right)\right).$$
This means that the total variational distance between $X'$ and $X_0'$ is bounded by $O(\delta)$. Now the expectations of $F'$ on $X',X_0'$ are not affected, up to negligible terms, by $(x,y,z')$ outside some constant $\tilde O(1)$ range, on which $F'$ is bounded by $\tilde O(r)$. Thus we have
$$\left|\mathbb E_{(x,y,z')\sim X'}F'(x,y,z')-\mathbb E_{(x,y,z')\sim X_0'}F'(x,y,z')\right|=\tilde O(r\delta).$$
It remains to estimate the mean
$$E_0:=\mathbb E_{(x,y,z')\sim X_0'}F'(x,y,z')=\mathbb E_{(x,y,z)\sim X_0}F(x,y,z),$$
where $X_0=\mathcal N(0,\operatorname{diag}(1,1,(r')^2))$. Up to symmetry, we have three cases depending on the $k$ and $\ell$ coordinates of $b=b(x)$ associated to our input:
• $b_k=b_\ell=0$,
• $b_k=0$, $b_\ell=1$,
• $b_k=b_\ell=1$.
The expectation calculation in the first two cases is trivial: if $b_k$ (resp. $b_\ell$) is zero, then $F$ is odd in the $x$ (resp. $y$) coordinate, so since the distribution $X_0$ is an independent Gaussian, the mean is $E_0=0$. It remains to consider the case $b_k=b_\ell=1$, i.e., the "interesting" case where $b_k\wedge b_\ell=1$.
We write down the integral expression
$$E_0:=\mathbb E_{(x,y,z)\sim X_0}F(x,y,z)=\int\operatorname{sign}(x)\operatorname{sign}(y)\operatorname{ReLU}(x+y+z)\,p_0(x,y,z)\,dx\,dy\,dz,\qquad(5)$$
for $p_0(x,y,z)$ the pdf of $X_0=\mathcal N(0,\operatorname{diag}(1,1,s'))$. We would like to show this value is positive and bound it from below (to show eventually that the mean in the CLT dominates the errors). We use the $x\leftrightarrow y$ symmetry to restrict the integral to $x\le y$ at the cost of a factor of $2$. Since the independent Gaussian $p_0(x,y,z)$ is symmetric in the $x$ and $y$ coordinates, we can then collect the $\pm x,\pm y$ terms together to write
$$E_0=2\int_{0\le x\le y}p_0(x,y,z)\,\big(\operatorname{ReLU}(x+y+z)-\operatorname{ReLU}(x-y+z)-\operatorname{ReLU}(-x+y+z)+\operatorname{ReLU}(-x-y+z)\big).$$
We split the domain up further into five terms, $E_0=A^{--}+A^{-}+A^{0}+A^{+}+A^{++}$, over regions on which each ReLU is constantly $0$ or a nonnegative linear function:
$$A^{--}=2\int_{0\le x\le y,\ z\le -x-y}p_0(x,y,z)\,dx\,dy\,dz\cdot 0=0$$
$$A^{-}=2\int_{0\le x\le y,\ -x-y\le z\le x-y}p_0(x,y,z)\,(x+y+z)\,dx\,dy\,dz$$
$$A^{0}=2\int_{0\le x\le y,\ x-y\le z\le y-x}p_0(x,y,z)\,\big((x+y+z)-(-x+y+z)\big)\,dx\,dy\,dz=2\int_{\dots}p_0(x,y,z)\,(2x)\,dx\,dy\,dz$$
$$A^{+}=2\int_{0\le x\le y,\ y-x\le z\le x+y}p_0(x,y,z)\,\big((x+y+z)-(-x+y+z)-(x-y+z)\big)\,dx\,dy\,dz=2\int_{\dots}p_0(x,y,z)\,(x+y-z)\,dx\,dy\,dz$$
$$A^{++}=2\int_{0\le x\le y,\ z\ge x+y}p_0(x,y,z)\,\big((x+y+z)-(-x+y+z)-(x-y+z)+(-x-y+z)\big)\,dx\,dy\,dz=0.$$
Note in particular that each term above is nonnegative on its domain (for $A^{+}$, this is because the domain includes the inequality $z\le x+y$). Thus in particular $E_0\ge A^{0}$. Since the integrand is positive, we can get a lower bound by restricting the domain:
$$A^{0}\ge 2\int_{x\ge 1,\ y\ge 2,\ -1\le z\le 1}2\,p_0(x,y,z)\,dx\,dy\,dz,$$
using that the integrand $2x$ is at least $2$ there. Up to a constant factor, this is the probability that $|x|\ge 1$, $|y|\ge 2$, $|z|\le 1$, for $(x,y,z)$ drawn from $p_0(x,y,z)=\sigma_{0,1}(x)\,\sigma_{0,1}(y)\,\sigma_{0,r^2-2}(z)$. By independence of $p_0$, this is a product of $3$ terms. The probability distributions on $x,y$ are fixed unit Gaussians, so the corresponding terms are $\Theta(1)$, and so the mean has (up to an $O(1)$ constant) the same asymptotic as the third term, which is
$$P_{z\sim\sigma_{0,r^2-2}}(|z|<1)=\Theta(1/r)=\tilde\Theta(1/\sqrt s).$$
The Bernstein bound applied to $d$ i.i.d. such variables now gives us, o.n.p.,
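The positivity and $1/r$ decay of $E_0$ established above can also be checked by direct Monte Carlo. The sketch below is illustrative only (the values of $r$, the sample size, and the seed are arbitrary choices, not from the paper): it estimates $\mathbb E[\operatorname{sign}(x)\operatorname{sign}(y)\operatorname{ReLU}(x+y+z)]$ in the case $b_k=b_\ell=1$.

```python
import math
import random

def estimate_E0(r, n=200_000, seed=0):
    """Monte Carlo estimate of E_0 = E[sign(x) sign(y) ReLU(x + y + z)]
    for x, y ~ N(0, 1) and z ~ N(0, r^2 - 2), i.e. the b_k = b_l = 1 case."""
    rng = random.Random(seed)
    sz = math.sqrt(r * r - 2.0)  # std of z; variance s' = r^2 - 2
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, sz)
        total += math.copysign(1.0, x) * math.copysign(1.0, y) * max(0.0, x + y + z)
    return total / n
```

For moderate $r$ this yields a strictly positive mean that shrinks as $r$ grows, consistent with the $\Theta(1/r)$ estimate above.
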
$$\sum_{i=1}^d F(x_i,y_i,z_i)\Big|_{(x_i,y_i,z_i)\sim X_0}=d\cdot E_0+\sqrt d\,\tilde O(r).$$
Incorporating error terms, we get
$$\vec r_{k,\ell}^{\,u}\big(\mathcal M_w^u(\vec a(x))\big)=d\cdot E_0+\sqrt d\,\tilde O(r)+d\,\tilde O(r\delta).$$
We now normalize:
$$\mathcal M_w(\vec a):=\frac{\mathcal M_w^u(\vec a)}{\sqrt d\,E_0}\qquad(6)$$
$$\vec r_{k,\ell}:=\frac{\vec r_{k,\ell}^{\,u}}{\sqrt d}.\qquad(7)$$
Then if $f_k(x)\wedge f_\ell(x)=1$, we have (o.n.p.)
$$\vec r_{k,\ell}(x)=1+\frac{\tilde O(r)}{\sqrt d}+\tilde O(r\delta).$$
Alternatively, if $f_k(x)\wedge f_\ell(x)=0$, the expectation is zero and we are left with the error term,
$$\vec r_{k,\ell}(x)=\frac{\tilde O(r)}{\sqrt d}+\tilde O(r\delta).$$
The theorem follows. ∎

D.3 Norm-balancer network

In this section, we prove a technical result that was needed in the previous section. Namely, at one point we assumed that the norms of our inputs $\vec a_0(x)$ are (o.n.p., and up to a multiplicative error of $1+\tilde O(\frac{1}{\sqrt d})$) equal to a specific value $\lambda$, which is related to the sparsity by a bound of the form $\lambda=\tilde O(\sqrt s)$. It is not difficult to guarantee this if we know the exact sparsity $s_{\mathrm{exact}}=\|b_0\|_1$ of the sparse boolean vector. However, in the process of chaining together multiple boolean circuits, we would like to allow the exact sparsity of intermediate layers to vary (so long as it is bounded by $s$), even if the exact sparsity of the input layer is fixed. In this section we give a two-layer neural network mechanism that allows us to circumvent this issue by modifying all inputs $\vec a_0(x)$ to have roughly the same norm, equal to some specific value $\sqrt{s_0}=\tilde O(\sqrt s)$.
We note that while it seems plausible that real neural networks share properties in common with the past two artificial neural nets we constructed (error correction and universal AND), the neural net constructed here

Theorem 18. Let $s_0=\tilde O(\sqrt d)$ be a sparsity parameter. There exists a 2-layer neural net $\mathrm{balance}_{s_0}:\mathbb R^d\to\mathbb R^d$, depending on random parameters, with hidden layers of width $O(d)$, with the following property. Suppose that $\vec\phi_1,\dots,\vec\phi_d$ is a collection of features of length $<2$, and $\vec a_x$ is an input satisfying $|\vec a_x|<\sqrt{s_0}$. Then
1. $|\mathrm{balance}(\vec a_x)|=\sqrt{s_0}\cdot(1+\tilde O(1/\sqrt d))$
2. $\vec a_x\cdot\vec\phi_k-\mathrm{balance}(\vec a_x)\cdot\vec\phi_k=\tilde O\!\left(\frac{\sqrt{s_0}}{\sqrt d}\right)$.

Proof. Let $W\in\mathrm{Mat}_{d\times d}$ be a random square matrix, with entries drawn independently from $\mathcal N(0,1/d^2)$.
Define the function $N(\vec a)=\sum_{i=1}^d\operatorname{ReLU}\big((W\vec a)_i\big)$. Then $N(\vec a)$ is a sum of $d$ i.i.d. random variables of the form $N_i=\operatorname{ReLU}(x)$ for $x\sim\mathcal N(0,|\vec a|^2/d^2)$. Applying arguments similar to those used in the proof of the previous theorem, we see that $N_i$ has mean $c\cdot|\vec a|/d$, for $c>0$ the absolute constant $c=\mathbb E_{x\sim\mathcal N(0,1)}\operatorname{ReLU}(x)=\frac{1}{\sqrt{2\pi}}$. The variance of $N_i$ is $O(|\vec a|^2/d^2)$, and $N_i$ is bounded o.n.p. by $\tilde O(|\vec a|/d)$. Thus Bernstein's inequality implies that, o.n.p.,
$$N(\vec a)=\sum_{i=1}^d N_i=c\cdot|\vec a|+\tilde O(|\vec a|/\sqrt d).$$
Now $|\vec a|<\sqrt{s_0}=\tilde O(\sqrt s)$, so (after rescaling $W$ by the constant $1/c$) we may assume $N(\vec a)=|\vec a|+\tilde O(\varepsilon)$.
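The concentration statement for $N(\vec a)$ can be illustrated numerically. The sketch below is illustrative only (the dimension, test vector, and seeds are arbitrary choices, not from the paper): it generates one row of $W$ at a time with $\mathcal N(0,1/d^2)$ entries and checks that $\sqrt{2\pi}\,N(\vec a)\approx|\vec a|$.

```python
import math
import random

def N_estimator(a, seed=0):
    """N(a) = sum_i ReLU((W a)_i), where W has i.i.d. N(0, 1/d^2) entries.
    Each coordinate (W a)_i ~ N(0, |a|^2/d^2), so E[N(a)] = |a| / sqrt(2*pi)."""
    d = len(a)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(d):
        dot = sum(rng.gauss(0.0, 1.0 / d) * aj for aj in a)  # one coordinate of W a
        total += max(0.0, dot)
    return total

rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
norm_a = math.sqrt(sum(x * x for x in a))
est = N_estimator(a) * math.sqrt(2.0 * math.pi)  # rescale by 1/c
# est approximates norm_a up to a relative error of order 1/sqrt(d)
```
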
Let $f(y)=\sqrt{s_0-y^2}$ (for $|y|\le\sqrt{s_0}$), a semicircle of radius $\sqrt{s_0}$ viewed as a function of a real variable. Define the piecewise-linear function $f_{\mathrm{PL}}$ given by splitting the semicircle into $d$ equal arcs and connecting the endpoints of the arcs (extending the first and last arc linearly outside the domain of definition). The difference between the values of $f$ at the endpoints of each arc is bounded by its arclength, which is $O(\sqrt{s_0}/d)$. Thus $|f(x)-f_{\mathrm{PL}}(x)|=O(\sqrt{s_0}/d)$ (in fact, much better asymptotic bounds are possible). Now $f_{\mathrm{PL}}$ is a sum of $d$ ReLUs, so it is a scalar-valued function which can be expressed by a width-$d$ neural net. Next, choose a random "approximately unit" vector $v\in\mathbb R^d$ with i.i.d. coordinates $v_i\sim\mathcal N(0,1/d)$. We define the neural net
$$\mathrm{balance}(\vec a):=\vec a+f_{\mathrm{PL}}(N(\vec a))\,v.$$
Since both $N(\vec a)$ and $f_{\mathrm{PL}}$ can be expressed as width-$d$ neural nets, $\mathrm{balance}$ can be expressed as a width-$O(d)$ neural net.
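A minimal numerical sketch of the balancing mechanism follows. For simplicity it uses the exact semicircle $f$ and the exact norm $|\vec a|$ in place of the piecewise-linear $f_{\mathrm{PL}}$ and the estimator $N(\vec a)$ (both approximate these to the accuracy stated above); the parameters $s_0$, $d$, and the seeds are arbitrary choices, not from the paper.

```python
import math
import random

def balance(a, s0, seed=0):
    """Sketch of balance(a) = a + f(N(a)) * v with f(y) = sqrt(s0 - y^2).
    Since v is approximately unit and approximately orthogonal to a,
    |balance(a)|^2 ~ |a|^2 + f(|a|)^2 = s0."""
    d = len(a)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]  # approx unit vector
    n = math.sqrt(sum(x * x for x in a))
    f_val = math.sqrt(max(0.0, s0 - n * n))  # exact semicircle, standing in for f_PL
    return [ai + f_val * vi for ai, vi in zip(a, v)]

d, s0 = 2000, 25.0
rng = random.Random(2)
a = [rng.gauss(0.0, 3.0 / math.sqrt(d)) for _ in range(d)]  # |a| ~ 3 < sqrt(s0)
out = balance(a, s0)
out_norm = math.sqrt(sum(x * x for x in out))
# out_norm is close to sqrt(s0) = 5, up to a 1 + O~(1/sqrt(d)) factor
```
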
Now since $v$ is a random vector, we have, o.n.p., $v\cdot\vec a=\tilde O(|\vec a|/\sqrt d)$ and $v\cdot\vec\phi_k=\tilde O(\frac{1}{\sqrt d})$. Since there are at most polynomially many (in $d$) features, the "negligible probability" exceptions remain negligible when combined over all features. The bound $N(\vec a)=\tilde O(\sqrt s)$ thus implies both bounds in the theorem. ∎

Let $\alpha(x),\beta(x,y)$ be functions. Let $W$ be random and $\Phi$ be a matrix of features. Fix $k,\ell\in\{0,\dots,m-1\}$. Let $b\in\{0,1\}^m$ be a boolean vector. Let $b_{k\ell}:=b_ke_k+b_\ell e_\ell$ (so that $\Phi(b_{k\ell})=b_k\vec\phi_k+b_\ell\vec\phi_\ell$), and let $b':=b-b_{k\ell}$. Outside negligible probability, we know that $\Phi(b_{k\ell})\cdot\Phi(b')=\tilde O(\varepsilon)$. This means that if $\varepsilon=\tilde\Theta(1/\sqrt d)$ and we apply a random matrix $W$, then we still have $W\Phi(b_{k\ell})\cdot W\Phi(b')=\tilde O(\varepsilon)$ (outside negligible probability).
Define $\vec x_{k\ell}=W\Phi(b_{k\ell})$ and $\vec x'=W\Phi(b')$. Since random matrices are $O(d)$-invariant, we can assume WLOG that these are drawn independently and randomly from appropriate Gaussian distributions [EXPAND]. Specifically, $\vec x_{k\ell}$ is drawn from a distribution with variance $2$ and $\vec x'$ is drawn from a distribution with variance $O(s)$. Define $\mathcal M_w(\vec x)=\alpha(W(\vec x))$, and define $R_{k\ell}^i:=\beta(\vec\phi_k^i,\vec\phi_\ell^i)$.

Lemma 19. For suitable choices of a piecewise-linear function $\alpha$ and some function $\beta$ (both depending on $s$), we can guarantee that
$$R_{k,\ell}\cdot\mathcal M_w(\Phi(b))=b_k\wedge b_\ell+\tilde O(\varepsilon_{\mathrm{out}}).$$

Proof. As explained above, we can assume that $x_{k\ell}^i,(x^i)'$ are drawn from independent Gaussian distributions with variance respectively $\frac2d,\frac sd$. Define $X=\mathcal N(0,\frac sd I)$ to be the Gaussian variable with variance $\frac sd$. Define $\Delta_i(x):=\alpha(x+\vec x_{k\ell}^i)-\alpha(x)$.
Write $\mathcal M_{w,\Delta}(\vec y)^i:=\Delta_i(\vec y)$. Then $\mathcal M_w(\vec x)=\mathcal M_w(\vec x')+\Delta_i(\vec x')$. It remains to prove the following sublemma:

Lemma 20. (Outside negligible probability:)
$$R_{k\ell}\cdot\mathcal M_w(\vec x')=\tilde O(\varepsilon_{\mathrm{out}})\qquad(8)$$
$$R_{k\ell}\cdot\Delta_i(\vec x')=b_k\wedge b_\ell+\tilde O(\varepsilon_{\mathrm{out}})\qquad(9)$$

We start with the first expression. We have:
• $\vec x_i'$ random from the Gaussian $X$, with variance $s/d$.
• $\alpha(\vec x_i')$ random, bounded by $B$ (the o.n.p. bound for $\alpha$ on $X$).
• From the point of view of $x'$: we know $(x,y)$ is a random Gaussian with variance $1/d$.
• So $R_{k\ell}\cdot\mathcal M_w(\vec x')$ is the sum of $d$ samples of $\beta(x,y)\alpha(z)$ for $x,y,z$ from appropriate Gaussians.
• We want to show: $\pm$ symmetric in an independent way, with variance $\tilde O(\varepsilon_{\mathrm{out}})/d$, bounded (o.n.p.) by $\tilde O$ of the standard deviation (check whether this bound is correct for Azuma's inequality).
For this (modelling on the quadratic case): choose $\beta$ to be $\pm$ symmetric in either coordinate independently, and appropriately bounded. For the second expression, we treat two cases, namely $(b_k,b_\ell)\in\{(1,1),(0,1)\}$. We do not need to treat the other cases, as $(1,0)$ follows by symmetry and $(0,0)$ is trivial. Start with the $(1,1)$ case, so $b_k\wedge b_\ell=1$. We then have:
• We want $\mathbb E\big(\beta(x,y)\Delta_{x,y}(z)\big)$ to be $1$.
• The above is bounded so as to make Azuma's inequality apply (probably enough to check $\Delta=O(1)$ and use the Azuma bounds from before).
Final case, $(0,1)$:
• We want $\mathbb E\big(\beta(x,y)\Delta_x(z)\big)$ to be $0$.
• This follows from the $\pm$ symmetry of $\beta$ (and bounds as above). ∎

D.4 Error correction layers

Theorem 21. Suppose we are in the context of Appendix A.
Then there exists a polylog constant $K=K(d)$, a single-layer neural net $\mathcal M_w(x)=v_1+W_1\operatorname{ReLU}(v_0+W_0(x))$, and a feature matrix $\Phi^{(1)}\in\mathrm{Mat}_{d\times m}$ such that if $\varepsilon\,(=\varepsilon^{(0)})<K\frac{d^{1/4}}{m^{1/2}s^{1/4}}$, then for each input $x$, o.n.p., the feature $\vec\phi_k^{(1)}$ linearly separates the boolean function $f_k$ on the activation $\vec a^{(1)}(x)=\mathcal M_w(x)$, with error
$$\varepsilon^{(1)}=O\!\left(\log(d)\cdot\frac{\sqrt s}{\sqrt d}\right).$$
Moreover, we can choose the new feature vectors $\phi_k^{(1)}$ such that they have feature interference bounded by $\mu^{(1)}=\tilde O\!\left(\frac{\sqrt s}{\sqrt d}\right)$.

Proof. We begin by defining an unnormalized version of the output feature matrix. Define the probability parameter $p=\frac{1}{\sqrt{ds}}$.
Let $\Phi^{(1),u}\in\mathrm{Mat}_{m\times d}$ be a matrix with entries $M_k^i$ drawn independently from the ternary random variable
$$\begin{cases}p(M_k^i=1)=p/2\\ p(M_k^i=-1)=p/2\\ p(M_k^i=0)=1-p.\end{cases}$$
Let $\Gamma\subset\{0,\dots,m\}\times\{0,\dots,d\}$ be the set of nonzero entries of $\Phi^{(1),u}$. Note that (o.n.p.) it has size $|\Gamma|=m\sqrt{\frac ds}+\tilde O(1)$. We think of this as a graph, connecting each feature $k$ to the set of (approximately $\sqrt{\frac ds}$) neurons it "activates", $\Gamma_k\subset\{1,\dots,d\}$. We also write $\Gamma^i\subset\{1,\dots,m\}$ for the set of features connected to the $i$th neuron. Let
$$\mathrm{round}_{[0,1]}(x):=3\big(\operatorname{ReLU}(x-1/3)-\operatorname{ReLU}(x-2/3)\big),$$
the piecewise-linear function that maps $\mathbb R$ to the interval $[0,1]$ and is non-constant only on the interval $(1/3,2/3)$.
Now, for any integer $a$, define
$$\mathrm{round}_{[0,a]}(x):=\mathrm{round}_{[0,1]}(x)+\mathrm{round}_{[0,1]}(x-1)+\dots+\mathrm{round}_{[0,1]}(x-a+1),$$
and similarly
$$\mathrm{round}_{[-a,a]}(x):=\mathrm{round}_{[0,a]}(x)-\mathrm{round}_{[0,a]}(-x).$$
This is a piecewise-linear "staircase" function with the following properties:
• $\mathrm{round}_{[-a,a]}(x)\in[-a,a]$ for all $x\in\mathbb R$, and
• $\mathrm{round}_{[-a,a]}(n+\varepsilon)=n$ whenever $n\in[-a,a]$ is an integer and $|\varepsilon|<1/3$.
Thus the function $\mathrm{round}$ will "round" $x$ to the nearest integer $n\in[-a,a]$, so long as $x$ is less than $1/3$ away from it; hence its name. By construction, the function $\mathrm{round}_{[-a,a]}(x)$ is a sum of $4a$ ReLUs. We will use for our nonlinearity the function $\mathrm{round}(x)=\mathrm{round}_{[-2,2]}(x)$.

[Figure: graph of the staircase function $\mathrm{round}(x)=\mathrm{round}_{[-2,2]}(x)$ on the range $[-2,2]$.]

(Using larger intervals $[-a,a]$ in our nonlinearity $\mathrm{round}_{[-a,a]}$ would give slightly stronger results, but won't be needed.)
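The staircase nonlinearity is straightforward to express directly as a sum of ReLUs. The sketch below is a direct transcription of the definitions above (the function names are ours) and checks the two listed properties for $a=2$:

```python
def relu(x):
    return max(0.0, x)

def round_01(x):
    # maps R to [0, 1]; non-constant only on (1/3, 2/3); 2 ReLUs
    return 3.0 * (relu(x - 1.0 / 3.0) - relu(x - 2.0 / 3.0))

def round_0a(x, a):
    # staircase round_[0,a]: 2a ReLUs
    return sum(round_01(x - j) for j in range(a))

def round_sym(x, a):
    # round_[-a,a]: antisymmetric staircase built from 4a ReLUs in total
    return round_0a(x, a) - round_0a(-x, a)
```

For example, `round_sym(1.2, 2)` is (up to floating point) `1`, `round_sym(-1.71, 2)` is `-2`, and inputs far outside the interval are clamped into `[-2, 2]`.
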
Now we define the unnormalized neural net model as follows:
$$\mathcal{M}_w^u(x) := \mathrm{round}\big(\Phi^{(1),u} (\Phi^{(0)})^T (x)\big). \quad (10)$$
Finally, we normalize:
$$\mathcal{M}_w(x) := \frac{\mathcal{M}_w^u(x)}{(d/s)^{1/4}} \quad (11)$$
$$\Phi^{(1)} := \frac{\Phi^{(1),u}}{(d/s)^{1/4}}. \quad (12)$$
For each feature $k \in \{1,\dots,m\}$ in an input $x$, the unnormalized neural net $\mathcal{M}_w^u$ roughly does the following.
1. "Reads" the feature $\phi_k$.
2. "Writes" the entries $M_k^i = \pm 1$ to each neuron $i \in \Gamma_k$ connected to $k$, assuming $\phi_k$ is present.
3. "Rounds" each neuron which is close to $-2, -1, 0, 1$ or $2$ to the closest integer.
At the end, we hope to obtain a vector with exactly the entry $M_k^i \in \{\pm 1\}$ for each $k$ with $f_k(x) = 1$ and zero elsewhere. If we're lucky and there are no issues with excess interference and no pairs of active features $k, \ell$ that share a neuron $i \in \Gamma_k \cap \Gamma_\ell$, the result of this computation will be $\Phi^{(1),u}(b(x))$, and its error can then be controlled by understanding the interference of the new normalized feature matrix $\Phi^{(1)}$.
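To make the sampling step concrete, the following sketch draws the ternary matrix $\Phi^{(1),u}$ with $p = 1/\sqrt{sd}$, applies the forward pass to an $s$-sparse boolean input, and checks that $|\Gamma|$ concentrates near $m\sqrt{d/s}$ and that the unnormalized readoff of an active feature is of order $|\Gamma_k| \approx \sqrt{d/s}$. The toy sizes and the choice $\Phi^{(0)} = \mathrm{Id}$ are our own illustrative assumptions, not the paper's regime.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, s = 400, 900, 3              # toy sizes (not the paper's asymptotic regime)
p = 1.0 / np.sqrt(s * d)           # P(entry = +1) = P(entry = -1) = p / 2

# Ternary random matrix Phi^{(1),u}, shape (d, m): column k is phi_k^{(1),u}
u = rng.random((d, m))
Phi1u = np.where(u < p / 2, 1.0, np.where(u < p, -1.0, 0.0))

# |Gamma| = number of nonzero entries, concentrated around m * sqrt(d / s)
nnz = int((Phi1u != 0).sum())
expected_nnz = m * np.sqrt(d / s)

# Forward pass on an s-sparse boolean input b (taking Phi^{(0)} = Id here)
b = np.zeros(m)
active = rng.choice(m, size=s, replace=False)
b[active] = 1.0
pre = Phi1u @ b                        # integer "ground truth" activations
post = np.clip(np.round(pre), -2, 2)   # effect of round_{[-2,2]} on integers

# Unnormalized readoff of an active feature k: close to |Gamma_k| ~ sqrt(d/s)
k = active[0]
readoff = Phi1u[:, k] @ post
```

With these sizes $\sqrt{d/s} \approx 17$, so the active-feature readoff lands near $17$ while the network only uses $d = 900$ neurons for $m = 400$ features in superposition.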
In order to make this work, we need to control two types of issues:
• Collision: it's possible that two simultaneously active features $k, \ell$ with $f_k(x) = f_\ell(x) = 1$ share some neurons, so some of the entries of $\vec\phi_k^{(1),u} + \vec\phi_\ell^{(1),u}$ have "colliding" information from the $k$th and $\ell$th features that gives the wrong answer after getting rounded to one of $\{-2, -1, 0, 1, 2\}$.
• Interference: it's possible that, even if the sets $\Gamma_k$ are disjoint for all features $k$ appearing in $b(x)$, the various interference terms shift the value far enough from the "correct" value in $\{-1, 0, 1\}$ that the "round" function does not successfully return it to its original position.
These are controlled by the two parts of the following lemma.

Lemma 22.
1. For any $x \in X$, we have o.n.p.:
$$\big\| \Phi^{(1),u} (\Phi^{(0)})^T (x) - \Phi^{(1),u} b_x \big\|_\infty = o(1).$$
2. For any boolean $b$ with sparsity $\|b\|_1 < s$, we have (o.n.p.) that the difference
$$\vec{\mathrm{err}}_{\mathrm{collision}} := \mathrm{round}\big(\Phi^{(1),u}(b)\big) - \Phi^{(1),u}(b) \in \mathbb{R}^d$$
has all unnormalized feature readoffs
$$\vec\phi_k^u \cdot \vec{\mathrm{err}}_{\mathrm{collision}} = \tilde{O}\big(\max(1, \sqrt{s^3/d})\big).$$

Proof. Note that the two results are both about $\ell_\infty$ errors, but in two different spaces, namely in the space $\mathbb{R}^d$ with the neuron basis for part (1) and in the space $\mathbb{R}^m$ with the feature basis for part (2). We start with part (1). Since there is a polynomial number of neurons, bounding the $\ell_\infty$ error o.n.p. is equivalent to bounding the difference for each coordinate:
$$E_i(x) := \big( \Phi^{(1),u} (\Phi^{(0)})^T (x) - \Phi^{(1),u} b(x) \big) \cdot \vec e_i.$$
This difference is a linear combination of the errors $\vec\phi_k^{(0)} \cdot x$, with coefficients given by the matrix coefficients $(\Phi^{(1),u})_i^k$, with $i$ fixed and $k$ varying. For a pair $(i,k) \in \Gamma$, let $\sigma(i,k) \in \{\pm 1\}$ be the sign of the corresponding matrix coefficient (which is chosen independently at random in the random-variable-valued definition of our neural net).
We then have
$$E_i(x) = \sum_{k \in \Gamma^i} \sigma(i,k)\, x \cdot \vec\phi_k^{(0)}.$$
By assumption, $x \cdot \vec\phi_k^{(0)} \le \varepsilon^{(0)}$. Since the signs are chosen independently at random, we can bound this value o.n.p. by the Bernstein inequality, Theorem 29, with discrete variables $X_k = \sigma(i,k)\, x \cdot \vec\phi_k^{(0)}$. Here $k$ is indexed by a $|\Gamma^i|$-element set. By the definition of $\Phi^{(1),u}$, each element of $\{1,\dots,m\}$ has probability $p = \frac{1}{\sqrt{sd}}$ of being in $\Gamma^i$, so
$$|\Gamma^i| = \frac{m}{\sqrt{sd}} + \tilde{O}\Big(\sqrt{\frac{m}{\sqrt{sd}}}\Big) = \tilde{O}\Big(\frac{m}{\sqrt{sd}}\Big).$$
Since all these random variables are bounded by $\varepsilon^{(0)}$ in absolute value, Bernstein's inequality implies that o.n.p.,
$$E_i = \tilde{O}\Big( \varepsilon^{(0)} \cdot \Big(\frac{m}{\sqrt{sd}}\Big)^{1/2} \Big),$$
giving part (1) of the lemma.
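The concentration step above can be illustrated numerically: a sum of $n$ independently signed terms, each bounded by $\varepsilon^{(0)}$, stays near the Bernstein scale $\varepsilon^{(0)}\sqrt{n}$ rather than the worst case $n\,\varepsilon^{(0)}$. This is an illustrative simulation under our own toy parameters, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_signed_sum(n, eps, trials=500):
    # |sum_k sigma_k * v_k| with independent random signs sigma_k
    # and arbitrary magnitudes |v_k| <= eps, as in the bound on E_i(x)
    signs = rng.choice([-1.0, 1.0], size=(trials, n))
    vals = eps * rng.random((trials, n))
    return np.abs((signs * vals).sum(axis=1)).max()

n, eps = 4000, 0.01
worst = max_signed_sum(n, eps)      # worst deviation over 500 trials
naive_bound = n * eps               # 40.0: no cancellation at all
bernstein_scale = eps * np.sqrt(n)  # ~0.63: the scale the proof relies on
```

Even the worst trial lands within a small multiple of `bernstein_scale`, far below `naive_bound`, which is exactly the cancellation the random signs buy in the proof.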
To prove the second part, note that the "ground truth" activation $\vec a_{\mathrm{ground}} := \Phi^{(1),u} b$ is an integer-valued vector with coefficients
$$(\vec a_{\mathrm{ground}})_i = \sum_{k \in \Gamma^i \cap b} \sigma_k.$$
It is changed by applying the $\mathrm{round}$ function if and only if this sum is $> 2$ in absolute value, i.e., only if the neuron is contained in the intersection of at least $3$ of the subsets $\Gamma_k$ (a "collision"). The expected number of such overlaps at a given neuron $i \in \{1,\dots,d\}$ can be bounded by
$$O\Big( \frac{s^3}{(\sqrt{sd})^3} \Big) = O\Big( \frac{s^{3/2}}{d^{3/2}} \Big).$$
Thus the coefficients of the error vector
$$(\vec{\mathrm{err}}_{\mathrm{collision}})_i = (\vec a_{\mathrm{ground}})_i - \mathrm{round}(\vec a_{\mathrm{ground}})_i$$
are drawn i.i.d. from a distribution with mean $0$ (as it is symmetric) and variance bounded by $\tilde{O}(s^{3/2}/d^{3/2})$, which is absolutely bounded by $\tilde{O}(1)$. In other words, we have o.n.p. that this vector has at most $\tilde{O}\big(\max(1, s^{3/2}/\sqrt{d})\big)$ entries, all bounded by $\tilde{O}(1)$, and with independently random signs. When we take the dot product with another unnormalized feature vector, we are left with an error bounded by
$$\vec{\mathrm{err}}_{\mathrm{collision}} \cdot \vec\phi_k^{(1),u} = \tilde{O}\big(\max(1, s^{3/2}/d^{1/2})\big),$$
completing the proof. ∎

Now we can finish the proof. The interference bound in the lemma implies that o.n.p., the $d$-dimensional vector $\Phi^{(1),u}(b) - \Phi^{(1),u}\big((\Phi^{(0)})^T(x)\big)$ has all coefficients bounded by $o(1)$, and in particular bounded by $1/3$. Since $\Phi^{(1),u}(b)$ has all integer entries, this means that
$$\mathrm{round}\big(\Phi^{(1),u}(b)\big) = \mathrm{round}\big(\Phi^{(1),u}((\Phi^{(0)})^T(x))\big)$$
(as the "round" function is constant on $[n - 1/3, n + 1/3]$ for any integer $n$). Since we have assumed that $s < d^{1/3}$ (in A), the asymptotic term $s^{3/2}/d^{1/2}$ in the collision error bound is bounded by $1$, so o.n.p.,
$$\vec{\mathrm{err}}_{\mathrm{collision}} \cdot \vec\phi_k^{(1),u} = \tilde{O}(1).$$
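The collision estimate above can be checked directly: with each of the $s$ active features hitting a fixed neuron independently with probability $p = 1/\sqrt{sd}$, the exact binomial probability of a $\ge 3$-way hit stays below the union bound $s^3 p^3 = s^{3/2}/d^{3/2}$ used in the proof. A small sketch, with parameter values of our own choosing:

```python
from math import comb

def p_three_way_collision(s, d):
    # P(a fixed neuron lies in Gamma_k for at least 3 of the s active features);
    # each feature hits the neuron independently with probability p = 1/sqrt(s*d)
    p = 1.0 / (s * d) ** 0.5
    return sum(comb(s, j) * p**j * (1 - p) ** (s - j) for j in range(3, s + 1))

def proof_bound(s, d):
    # the union bound used in the text: s^3 * p^3 = s^{3/2} / d^{3/2}
    return s**1.5 / d**1.5

for s, d in [(5, 1000), (10, 10000), (20, 10000)]:
    assert p_three_way_collision(s, d) <= proof_bound(s, d)
```

Note also that the bound degrades as $s$ grows relative to $d$, which is why the assumption $s < d^{1/3}$ is needed to keep the collision error at $\tilde{O}(1)$.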
Finally, when we normalize, both sides of the dot product get multiplied by $A = s^{1/4}/d^{1/4}$, and so after normalizing the corresponding bound gets multiplied by $\sqrt{s}/\sqrt{d}$, and we get the expression (o.n.p.):
$$\vec\phi_k^{(1)} \cdot \big( \mathcal{M}_w(x) - \Phi^{(1)}(b(x)) \big) = \tilde{O}(\sqrt{s}/\sqrt{d}).$$
Finally, by a similar argument to the collision proof, we see that the unnormalized dot product $\Phi^{(1),u}(b(x)) \cdot \vec\phi_k$ is $\sqrt{d}/\sqrt{s}$ up to an error of $\tilde{O}(1)$, so after normalizing, the readoff of each active feature is $1$ up to an error of $\tilde{O}(\sqrt{s}/\sqrt{d})$.

We claim that the pair $(\mathcal{M}_w, \Phi^{(1)})$ satisfies (o.n.p.) the conditions for the error-correction circuit above, for some appropriate relationships between the values $d, \varepsilon^{(0)}, \varepsilon^{(1)}$ depending on $m$, satisfying asymptotic inequalities of the form
$$\varepsilon^{(0)} = \tilde{O}\Big( \frac{d^{1/4}}{m^{1/2} s^{1/4}} \Big), \qquad \varepsilon^{(1)} = \tilde{O}\Big( \frac{\sqrt{s}}{\sqrt{d}} \Big), \qquad \mu^{(1)} = \tilde{O}\Big( \frac{\sqrt{s}}{\sqrt{d}} \Big).$$

Lemma 23.
For a suitable choice of $\varepsilon_{\mathrm{in}}$ as above, we can guarantee that:
1. If $\vec{\mathrm{err}} \in \mathbb{R}^m$ has $\|\vec{\mathrm{err}}\|_\infty < \varepsilon_{\mathrm{in}}$, then $\|\Phi(\vec{\mathrm{err}})\|_\infty = o(1)$, o.n.p. (Note that the latter value is an $\ell_\infty$ norm in the neuron basis.)
2. If $b$ is boolean and $s$-sparse, then $R_{\mathrm{out}}\big(\mathrm{round}(\Phi(b))\big) \approx_{\varepsilon_{\mathrm{out}}} b$, o.n.p., where $R_{\mathrm{out}}$ is the readoff matrix $\Phi^T/A$.

To get part (1) above, observe that for any neuron index $i$, we have
$$\Phi(\vec{\mathrm{err}})_i = \sum_{k \mid k \in \Gamma^i} \sigma_{k,i}\, \vec{\mathrm{err}}_k,$$
where we define $\Gamma^i := \{k \mid (k,i) \in \Gamma\}$. Since the signs $\sigma_{k,i}$ are random and independent, this is a sum with random signs of numbers of absolute value $< \varepsilon_{\mathrm{in}}$. From the Azuma inequality, we see that (o.n.p.)
$$\Phi(\vec{\mathrm{err}})_i = \tilde{O}\big( \|\vec{\mathrm{err}}\|_\infty \cdot \sqrt{|\Gamma^i|} \big).$$
Since $\Gamma$ was chosen randomly, o.n.p.
$$|\Gamma^i| = \tilde{\Theta}(|\Gamma|/d) = \tilde{\Theta}\big( d^{\frac{1-\gamma}{2}} \big) = o(\varepsilon_{\mathrm{in}}^{-2}).$$
The last statement follows from comparing exponents on the two sides, and from the freedom of choice of the polylog term in $\varepsilon_{\mathrm{in}}$. For part (2) above, observe that $\big( R_{\mathrm{out}}(\mathrm{round}(\Phi(b))) \big)_k$ is the average over the set $\Gamma_k = \{i \mid (k,i) \in \Gamma\}$ of
$$a_i := \mathrm{round}\Big( \sum_{\ell \in S} \Phi_{\ell,i} \Big),$$
where $S$ is the set of features that are on in $b$, of size $|S| \le s$. We want to compare this to $b_k$, which is $1$ if $k \in S$ and $0$ otherwise. We expect (for $i \in \Gamma_k$) that $a_i = 0$ if $b_k = 0$ and $a_i = 1$ if $b_k = 1$. Since $\mathrm{round}()$ always returns a value of absolute value $\le 1$, we can bound the error by twice the number of incorrect values. We get errors of two types.
1. Interference error, from neurons that are on when they should be off, i.e., when $a_i \ne 0$ despite $b_k = 0$.
2. Collision error, from neurons which should be on but are $0$ (or have the wrong sign) due to contributions from both $k$ and another feature.
Either of these errors happens only when $\Gamma_k$ and $\bigcup_{\ell \in S'} \Gamma_\ell$ intersect, for $S' = S \setminus \{k\}$ the set of features active in $b$ other than $k$.
Now $\Gamma_k$ has $\tilde{O}\big( d^{\frac{1-\gamma}{2}} \big)$ nonzero entries and $\bigcup_{\ell \in S'} \Gamma_\ell$ has at most $\tilde{O}\big( d^{\gamma + \frac{1-\gamma}{2}} \big)$ entries; since each subset $\Gamma_k$ is independently random, we see (o.n.p.) that the intersection has at most
$$\tilde{O}\Big( \frac{d^{\frac{1-\gamma}{2}} \cdot d^{\gamma + \frac{1-\gamma}{2}}}{d} \Big) = \tilde{O}(1)$$
entries, and the average is indeed $\tilde{O}(\varepsilon_{\mathrm{out}})$. This completes the proof of the lemma.

The theorem follows. Indeed, suppose that $x \in \mathbb{R}^{d_{\mathrm{in}}}$ is a vector with $R_{\mathrm{in}}(x) \approx_{\varepsilon_{\mathrm{in}}} b$ for $b \in \{0,1\}^m$ an $s$-sparse boolean vector. Setting $\vec{\mathrm{err}} = R_{\mathrm{in}}(x) - b$, part (1) implies that $\Phi \circ R_{\mathrm{in}}(x) - \Phi(b)$ has coefficients at most $o(1)$; since $\Phi(b)$ has integer entries, this means that applying $\mathrm{round}$ to both sides produces the same results. Part (2) then implies that the common value $\mathrm{round}(\Phi(b))$ reads off to $b$ with sufficiently small interference. ∎

Corollary 24 (Lemma 7). For sufficiently small input interference, there exists a 1-layer MLP that returns (outside negligible probability) an encoding of the same boolean vector with low interference ($1/\sqrt{d}$, assuming a low sparsity parameter). Proof.
This follows from the theorem in the case $\gamma = 0$, i.e., when the sparsity parameter $s$ is polylog in $m$. ∎

Appendix E Theoretical Framework and Statistical Tools

Here we provide statistical definitions and lemmas required for our proofs in Appendix D.

E.1 Negligible probabilities

Most results in this paper are proven outside negligible probability. This is a standard notion in complexity theory and cryptography (Bellare, 2002), with the following formal definition:

Definition 5. Let $\{E_n\}_{n=1}^\infty$ be a sequence of events parameterized by $n$. We say that $E_n$ is true with negligible probability (w.n.p.) if for any polynomial exponent $c \in \mathbb{N}$, there exists some constant $N_c \in \mathbb{N}$ such that $P(E_n) < O(n^{-c})$ for all $n > N_c$. Similarly, we say that $E_n$ is true outside negligible probability (o.n.p.) if its complement $\overline{E_n}$ is true with negligible probability. If $E_n = E_n(x)$ depends on an input in some set $X$, when we say $E_n(x)$ is true with negligible probability for all fixed inputs $x \in X$, we implicitly assume that there is an explicit constant $C_n < O(n^{-c})$ as above that bounds the probability of $E_n(x)$ for each valid input $x \in X$.

Intuitively, the reason why this probability is "negligible" is that the union of polynomially many events of negligible probability also has negligible probability. As we never consider networks requiring more than polynomially many operations, we can ignore events of negligible probability at each step when performing asymptotic analysis, which greatly simplifies our proofs.

Example 1.
Let $b$ be a random boolean vector of length $n$. Then outside negligible probability, $b$ has between $n/2 - \log(n)\sqrt{n}$ and $n/2 + \log(n)\sqrt{n}$ zeroes. This follows from the central limit theorem. (Note that if we used $\sqrt{\log(n)}\,\sqrt{n}$, the result would be false!)

For cases where the event is a bound on a random function (as above), we can combine "negligible probability" notation with big-$O$ and big-$\tilde{O}$ notation, as follows.

Definition 6. Suppose a function $f(x) = f_n(x)$ depends on the complexity parameter $n$ and a fixed input $x \in X$, and is valued in random variables (the input can be an "empty input", i.e., $f$ is itself a random variable depending only on $n$). Let $g(x) \ge 0$ be a deterministic function (or a constant depending on $n$ if $x$ is an empty input). Then we say that $f(x) = \tilde{O}(g(x))$ if there exists a polylog constant $K_n = O(\mathrm{polylog}(n))$ such that, for any input $x$, the event $|f(x)| < g(x) K_n$ is true outside negligible probability.

This lets us rephrase the previous example as "for $b$ a random boolean vector of length $m$, we have $\sum_k b_k = m/2 + \tilde{O}(\sqrt{m})$."

We also list the following result, which will be important for us.

Lemma 25. Let $d \in \mathbb{N}$ be a complexity parameter. Let $v \sim \mathcal{N}(0, \mathrm{Id}/d)$ be a Gaussian-distributed random vector in $\mathbb{R}^d$, and let $x \in \mathbb{R}^d$ be a fixed input vector.
Then, outside negligible probability, we have:
1. $|v| = 1 + \tilde{O}\big( \frac{1}{\sqrt{d}} \big)$
2. $v \cdot x = \tilde{O}\big( \frac{|x|}{\sqrt{d}} \big)$.

Proof. The first statement is standard (and follows from the central limit theorem applied to the real variable $\mathcal{N}(0,1)^2$). The second statement follows from the fact that sums of Gaussian random variables are Gaussian (and variance adds). ∎

Note that this in particular implies a version of the Johnson-Lindenstrauss lemma:

Corollary 26. Suppose $m$ is a polynomial function of $d$ (which we take to be the complexity parameter), and suppose $\vec\phi_1, \dots, \vec\phi_m \in \mathbb{R}^d$ are random vectors drawn from $\mathcal{N}(0, \mathrm{Id}/d)$. Then outside negligible probability,
$$\vec\phi_k \cdot \vec\phi_\ell = \begin{cases} 1 + \tilde{O}(1/\sqrt{d}), & k = \ell \\ \tilde{O}(1/\sqrt{d}), & k \ne \ell. \end{cases}$$

Proof. We are checking polynomially many statements (namely, $O(m^2)$, with $m$ polynomial in $d$), thus by the union bound, it suffices to show that each is true outside negligible probability.
The corollary now follows by applying Lemma 25 inductively on $k$, to $\vec\phi_k \cdot \vec\phi_k = |\vec\phi_k|^2$ and to $\vec\phi_k \cdot \vec\phi_\ell$ for $\ell < k$, in the latter case taking the vector $\vec\phi_\ell$ as fixed. ∎

Before continuing, we record the following simple result, which will allow us to convert "negligible probability" results to the existence results in the body.

Theorem 27. Suppose that $s = O(1)$ is a constant sparsity parameter, $\mathcal{M}_w$ is a model in a fixed class that depends on some random parameters, and a property $P(\mathcal{M}_w, x)$ holds outside negligible probability for all inputs $x = \Phi(b)$ corresponding to boolean inputs $b \in \{0,1\}^m$ of sparsity $s$. Then there exists a model $\mathcal{M}_w$ such that the property $P(\mathcal{M}_w, b)$ holds for all boolean inputs $b$.

Proof. This follows from the union bound, since the number of possible inputs $b$ with sparsity $s$ is $\binom{m}{s} < m^s$ (and negligible probability goes to zero faster than any inverse polynomial). ∎

Remark 28. For every "negligible probability" statement we encounter, it is straightforward to check that, up to decreasing the asymptotic parameters in the appropriate $\tilde{O}$-asymptotic assumptions on the variables involved, we can guarantee that a stronger statement holds: namely, for any fixed $c$, we can guarantee that the negligible probability $p$ asymptotically satisfies $p = O(\exp(-\log(m)^c))$.
Thus (by another union bound), statements that are true outside negligible probability for any fixed boolean input $b$ of size $\|b\|_1 = \tilde{O}(1)$ (at most polylogarithmic in $m$) can be made to hold for all such inputs $b$, for an appropriate choice of parameters.

E.2 Concentration inequalities

Concentration inequalities (in the sense we use here) bound tail probabilities of sums of random variables which are either i.i.d. or "close to" i.i.d. in some sense. As we only care about $\tilde{O}$-type precision in our error bounds (i.e., up to polylog factors), and we need statements to be true only outside negligible probability, we are able to get away with very weak versions of bounds which exist in general with much more precision; both of the results we need follow from the Bernstein inequality for martingales (which subsumes the Azuma inequality).

Theorem 29 (Coarse Bernstein bound). Suppose that $X_1, \dots, X_n$ are real random variables bounded by a constant $M$, which are either i.i.d. or form the difference sequence of a martingale, i.e., $\mathbb{E}(X_i \mid X_1, \dots, X_{i-1}) = 0$. Then
$$\sum x_i = n\mu + \tilde{O}(M\sqrt{n})$$
outside negligible probability, uniformly in the $X_i$.
In other words, there exists a polylogarithmic sequence of constants $K_n = O(\mathrm{polylog}(n))$ such that the probability
$$P\Big( \Big| \sum_{i=1}^n (x_i - \mathbb{E}[X_i]) \Big| \ge K_n \cdot M\sqrt{n} \Big) \le P_n$$
for some sequence $P_n$ that goes to zero faster than any polynomial function of $n$.

Proof. This follows from Bernstein's theorem (Bernstein, 1924). In fact, both statements also follow from the simpler Azuma-Hoeffding inequality. ∎

Corollary 30. Let $V = \mathbb{R}^a$ be a vector space, with $a = O(1)$ a constant (we will use $a = 1$ and $a = 3$). Let $\Sigma \in \mathrm{Mat}_{a \times a}$ be a fixed symmetric positive-definite matrix, with $X = \mathcal{N}(0, \Sigma)$ the corresponding distribution. Let $f: V \to \mathbb{R}$ be a fixed function with subpolynomial growth in $x$, and let $\mu = \mathbb{E}[f(x)]$, $x \sim X$, be the mean of $f$ on $x$ drawn from this distribution. Let $x_1, \dots, x_m$ be a collection of variables drawn from i.i.d. copies of $\mathcal{N}(0, \Sigma)$. Then o.n.p.,
$$\sum f(x_i) = m\mu + \tilde{O}(\sqrt{m}),$$
where the polylogarithmic constant in $\tilde{O}$ depends on $f$.

Proof. Since $f$ has polynomial growth, $f(x) < K(1 + |x|^c)$ for some constants $K, c$. Thus, o.n.p. in $m$, we have $f(x_i) \le c\log(Km)$ for all $i$. (Note that $f(x)$ doesn't depend on $m$; we're just saying that $P(f(x) > c\log(Km))$ goes to $0$ faster than any polynomial function of $m$; in fact this probability is $O(m^{-\log(m)})$.) Let $M = c\log(Km)$. Then the concentration theorem above implies that
$$\sum_{i=1}^m (f(x_i) - \mathbb{E}[f(x_i)]) = \tilde{O}(M\sqrt{m}) = \tilde{O}(\sqrt{m}),$$
since $M = \tilde{O}(1)$. ∎

E.3 Precise and mixed emulations

The parameters of the models $\mathcal{M}_w$ in the proofs of our emulation results depend on random matrices of $\pm 1$'s and $0$'s, and hence can be understood as suitable random variables. In terms of this point of view, we make the following definition. Suppose that $C: \{0,1\}^m \to \{0,1\}^{m'}$ is a boolean circuit with input size $m$. We always assume that the output size $m'$ and the depth are at most polynomial in $m$. Let $\mathcal{B} \subset \{0,1\}^m$ be a class of inputs (usually characterized by a suitable sparsity property). Let $\varepsilon < 1$ be an interference parameter.

Definition 7.
An $\varepsilon$-precise emulation of $C$ (on input class $\mathcal{B}$) is a triple of data $(\Phi, \mathcal{M}_w, R)$, all possibly depending on random parameters, where $\Phi \in \mathrm{Mat}_{d_{\mathrm{in}} \times m}$ is a feature matrix, $R \in \mathrm{Mat}_{m' \times d_{\mathrm{out}}}$ is a readoff matrix, and $\mathcal{M}_w: \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ is a (not necessarily linear) function given by a neural net, with the following property: for any $b \in \mathcal{B}$, we have, outside negligible probability,
$$\| R \circ \mathcal{M}_w \circ \Phi(b) - C(b) \|_\infty < \varepsilon.$$
Importantly, we do not consider the boolean circuit $C$ or the input $b \in \mathcal{B}$ to be random variables; the randomness involved in the negligible probability statement is purely in terms of the parameters that go into the emulation scheme $(\Phi, \mathcal{M}_w, R)$. In particular, this guarantees that if the boolean input $b$ is generated in a non-random way (e.g., adversarially), an emulation nevertheless guarantees (in the "negligible probability" sense) safe performance on $b$, so long as the parameters of the emulation were chosen randomly.

It will be useful to extend the notion of emulation to one which correctly approximates $C$ on inputs $x \in \mathbb{R}^{d_{\mathrm{in}}}$ which represent a boolean input $b \in \{0,1\}^m$ not in the sense of "pure superposition" $x = \Phi(b)$, but in the sense of "read-off": $\| R_{\mathrm{in}}(x) - b \|_\infty < \varepsilon_{\mathrm{in}}$.
Here $R_{\mathrm{in}} \in \mathrm{Mat}_{m \times d_{\mathrm{in}}}$ is a readoff matrix that should be thought of as a noisy inverse to the feature matrix on sparse inputs. Formally, we make the following definition; here we assume that the matrix $R_{\mathrm{in}}$ was generated at an earlier stage of the computation and does not depend on random variables. Fix a circuit $C : \{0,1\}^m \to \{0,1\}^{m'}$, a class of inputs $\mathcal{B} \subset \{0,1\}^m$, and an "input readoff" matrix $R_{\mathrm{in}} \in \mathrm{Mat}_{m \times d_{\mathrm{in}}}$. Let $\varepsilon_{\mathrm{in}}, \varepsilon_{\mathrm{out}}$ be two interference parameters.

Definition 8. A mixed emulation of $C$ with precision $\varepsilon_{\mathrm{in}} \to \varepsilon_{\mathrm{out}}$ (on input class $\mathcal{B}$ and relative to a fixed input readoff matrix $R_{\mathrm{in}}$) is a pair of data $(\mathcal{M}_w, R_{\mathrm{out}})$, both possibly depending on random parameters, where $R_{\mathrm{out}} \in \mathrm{Mat}_{m' \times d_{\mathrm{out}}}$ is a readoff matrix and $\mathcal{M}_w : \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ is a (not necessarily linear) function given by a neural net, with the following property: for any boolean input $b \in \mathcal{B}$ and $x \in \mathbb{R}^{d_{\mathrm{in}}}$ satisfying $\| R_{\mathrm{in}}(x) - b \|_\infty < \varepsilon_{\mathrm{in}}$, we have, outside negligible probability,
$$\| R_{\mathrm{out}} \circ \mathcal{M}_w(x) - C(b) \|_\infty < \varepsilon_{\mathrm{out}}.$$

Remark 31. Note that if it is impossible to accurately represent $b$ via the matrix $R_{\mathrm{in}}$, i.e., to satisfy $\| R_{\mathrm{in}}(x) - b \|_\infty < \varepsilon_{\mathrm{in}}$, then the notion of mixed emulation is vacuous (any neural net would satisfy it for tautological reasons). We will generally apply this notion in contexts where such representations are possible (for example, via a suitable feature matrix, $x = \Phi(b)$). Here, as before, we do not consider the boolean circuit $C$ or the input $b \in \mathcal{B}$ to be random variables, and in addition the representation $x$ and the input readoff matrix $R_{\mathrm{in}}$ are assumed fixed. So the randomness involved in the negligible-probability statement is purely in the parameters that go into the pair $(\mathcal{M}_w, R_{\mathrm{out}})$.
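The mixed-emulation condition of Definition 8, including its vacuity on inputs $x$ that fail the read-off hypothesis (as noted in Remark 31), can likewise be sketched on a toy instance. Again the dimensions, the stand-in circuit `C`, and the pseudoinverse-based placeholders for $R_{\mathrm{in}}$ and $\mathcal{M}_w$ are hypothetical illustrations, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (hypothetical, for illustration only).
m, m_prime = 8, 3
d_in, d_out = 16, 16
eps_in, eps_out = 0.2, 0.1

def C(b):
    """Stand-in boolean circuit: AND of three fixed input pairs."""
    return np.array([b[0] & b[1], b[2] & b[3], b[4] & b[5]])

# R_in is assumed fixed (generated at an earlier stage); here we take it
# to be the pseudoinverse of a random feature matrix Phi, so that it acts
# as a noisy inverse to Phi. R_out is a random readoff matrix.
Phi = rng.choice([-1.0, 1.0], size=(d_in, m))
R_in = np.linalg.pinv(Phi)
R_out = rng.standard_normal((m_prime, d_out))

def M_w(x):
    """Placeholder net: read b off x (rounding absorbs eps_in-level noise),
    apply C, and re-embed the answer so that R_out reads it off."""
    b = np.rint(R_in @ x).astype(int)
    return np.linalg.pinv(R_out) @ C(b)

def is_mixed_emulation_instance(b, x):
    """Definition 8: whenever ||R_in(x) - b||_inf < eps_in, require
    ||R_out . M_w(x) - C(b)||_inf < eps_out."""
    if np.max(np.abs(R_in @ x - b)) >= eps_in:
        return True  # read-off hypothesis fails: condition holds vacuously
    return np.max(np.abs(R_out @ M_w(x) - C(b))) < eps_out

b = np.array([1, 0, 1, 1, 0, 1, 1, 0])
x = Phi @ b + 0.01 * rng.standard_normal(d_in)  # near-exact representation
assert is_mixed_emulation_instance(b, x)
```

Note the `True` branch for inputs violating the read-off hypothesis: this is exactly the tautological case of Remark 31, which is why the definition is only informative on inputs that the matrix $R_{\mathrm{in}}$ can accurately read off.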