Paper deep dive
The Persian Rug: solving toy models of superposition using large-scale symmetries
Aditya Cowsik, Kfir Dolev, Alex Infanger
Models: minimal nonlinear sparse autoencoder
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/11/2026, 12:37:09 AM
Summary
The paper provides a mechanistic interpretability analysis of a minimal nonlinear sparse autoencoder, a toy model of superposition in neural networks. By analyzing the model in the thermodynamic limit (large input dimension) and assuming permutation symmetry, the authors derive an analytically tractable loss function. They demonstrate that trained models converge to a state where individual weights matter less than their large-scale statistics, and they introduce a 'Persian rug' model, an artificial weight construction that matches the performance of trained models while revealing that the algorithm is oblivious to microscopic weight structure.
Entities (5)
Relation Signals (3)
Persian Rug Model → approximates → Sparse Autoencoder
confidence 95% · we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models
Sparse Autoencoder → exhibits → Superposition
confidence 90% · The model consisted of a compressing linear layer modeling the superposition
Permutation Symmetry → simplifies → Loss Function
confidence 90% · For these models, the loss function becomes analytically tractable.
Cypher Suggestions (2)
Find all models that utilize a specific activation function · confidence 90% · unvalidated
MATCH (m:Model)-[:USES_ACTIVATION]->(a:ActivationFunction {name: 'ReLU'}) RETURN m
Map the relationship between phenomena and the models that study them · confidence 85% · unvalidated
MATCH (p:Phenomenon {name: 'Superposition'})<-[:STUDIES]-(m:Model) RETURN m.name, p.name
Abstract
We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension. The model, originally presented in arXiv:2209.10652, compresses sparse data vectors through a linear layer and decompresses using another linear layer followed by a ReLU activation. We notice that when the data is permutation symmetric (no input feature is privileged) large models reliably learn an algorithm that is sensitive to individual weights only through their large-scale statistics. For these models, the loss function becomes analytically tractable. Using this understanding, we give the explicit scalings of the loss at high sparsity, and show that the model is near-optimal among recently proposed architectures. In particular, changing or adding to the activation function any elementwise or filtering operation can at best improve the model's performance by a constant factor. Finally, we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models. Unlike the trained model weights, the low randomness in the artificial weights results in miraculous fractal structures resembling a Persian rug, to which the algorithm is oblivious. Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders. Code to reproduce our results can be found at https://github.com/KfirD/PersianRug.
Tags
Links
- Source: https://arxiv.org/abs/2410.12101
- Canonical: https://arxiv.org/abs/2410.12101
- Code: https://github.com/KfirD/PersianRug
PDF not stored locally. Use the link above to view on the source site.
Full Text
104,446 characters extracted from source content.
The Persian Rug: solving toy models of superposition using large-scale symmetries

Aditya Cowsik (Physics Department, Stanford University, acowsik@stanford.edu), Kfir Dolev* (Physics Department, Stanford University, dolev@stanford.edu), Alex Infanger* (Independent, alexdinfanger@gmail.com). *Equal contribution.

Abstract

We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension. The model, originally presented in Elhage et al. (2022), compresses sparse data vectors through a linear layer and decompresses using another linear layer followed by a ReLU activation. We notice that when the data is permutation symmetric (no input feature is privileged) large models reliably learn an algorithm that is sensitive to individual weights only through their large-scale statistics. For these models, the loss function becomes analytically tractable. Using this understanding, we give the explicit scalings of the loss at high sparsity, and show that the model is near-optimal among recently proposed architectures. In particular, changing or adding to the activation function any elementwise or filtering operation can at best improve the model's performance by a constant factor. Finally, we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models. Unlike the trained model weights, the low randomness in the artificial weights results in miraculous fractal structures resembling a Persian rug, to which the algorithm is oblivious. Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders. Code to reproduce our results can be found at https://github.com/KfirD/PersianRug.

Figure 1: The Persian rug, an artificial set of weights matching trained model performance.

1 Introduction

Large language model capabilities and applications have recently proliferated.
As these systems advance and are given more control over basic societal functions, it becomes imperative to ensure their reliability with absolute certainty. Mechanistic interpretability aims to achieve this by obtaining a concrete weight-level understanding of the algorithms learned and employed by these models. A major impediment to this program has been the difficulty of interpreting intermediate activations. This is due to the phenomenon of superposition, in which a model takes advantage of sparsity in the input data to reuse the same neurons for multiple distinct features, obscuring their function. Finding a systematic method to undo superposition and extract the fundamental features encoded by the network is a large and ongoing area of research (Bricken et al. (2023); Cunningham et al. (2023); Gao et al. (2024); Engels et al. (2024)). Currently, the most popular method of dealing with superposition is dictionary learning with sparse autoencoders. In this method, the smaller space of neuron activations at a layer of interest is mapped to a larger feature space. The map is trained to encourage sparsity and often consists of an affine + ReLU network. This method has been applied to large language models, revealing many strikingly interpretable features (e.g. corresponding to specific bugs in code, the Golden Gate Bridge, and sycophancy), even allowing for a causal understanding of the model's reasoning in certain scenarios (Marks et al. (2024)). The sparse decoding ability of the affine + ReLU map was recently studied in the foundational work of Elhage et al. (2022), which introduced and studied a toy model of superposition. The model consisted of a compressing linear layer modeling the superposition¹ (¹ This is because, if good enough recovery is possible for most features, the pigeonhole principle tells us that at least some of the smaller-space activations must encode information about multiple input features.)
followed by a decompressing affine + ReLU layer, trained together to auto-encode sparse data. They showed that the network performs superposition by encoding individual feature vectors into nearly orthogonal vectors in the smaller space. The affine layer alone is unable to decode sparse input vectors sufficiently well to make use of superposition, but the addition of the ReLU makes it possible by screening out negative interference. While Elhage et al. (2022) provides valuable empirical and theoretical insights into superposition, it does not obtain a strong enough description of the model algorithm to quantitatively characterize the algorithm's performance. Given the extensive use of the affine + ReLU map for decoding sparse data in practice, it is important to obtain a complete analytic understanding of the model behavior over a large parameter regime. As we will see, this will inform the design of better sparse autoencoder architectures. In this work we obtain such an understanding by considering a particularly tractable regime of the Elhage et al. (2022) model: permutation symmetric data (no input feature is privileged in any way) and the thermodynamic limit (a large number of input features), while maintaining the full range of sparsity and compression ratio values. In this regime, the learned model weights are permutation symmetric on large scales, which simplifies the form of the loss function to the point where it is analytically tractable, leaving only a small number of free parameters. We then forward-engineer an artificial set of weights satisfying these symmetries and optimizing the remaining parameters, which achieves the same loss as a corresponding trained model, implying that trained models also implement the optimal permutation symmetric algorithm. The artificial set of weights resembles a Persian rug fig.
1, whose structure is a relic of the minimal randomness used in the construction, illustrating that the algorithm relies entirely on large-scale statistics that are insensitive to this structure. Finally, we derive the exact power-law scaling of the loss in the high-sparsity regime. We expect our work to impact the field of neural network interpretability in multiple ways. First, our work provides a basic theoretical framework that we believe can be extended to other regimes of interest, such as structured correlations in input features, which may help predict scaling laws in the loss based on the data’s correlations. Second, our work rules out a large class of performance improvement proposals for sparse autoencoders. Finally, our work provides an explicit example of a learned algorithm that is insensitive to microscopic structure in weights, which may be useful for knowing when not to analyze individual weights. The paper is structured as follows. In section 2 we review the model and explain our training procedure. In section 3 we show empirically that large models display a “statistical” permutation symmetry. In section 4 we extract the algorithm by plugging the symmetry back into the loss, introduce the Persian rug model which optimizes the remaining parameters, show that large trained models achieve the same loss, and derive the loss behavior in the high sparsity limit. In section 5.2 we discuss related works, and conclude with section 5. 
2 The Model

We consider the following model for our non-linear autoencoder with matrix parameters W_in ∈ ℝ^(n_d × n_s), W_out ∈ ℝ^(n_s × n_d), b ∈ ℝ^(n_s):

f_nonlinear(x) = ReLU(W_out W_in x + b).   (1)

W_in is an encoding matrix which converts a sparse activation vector x to a dense vector, while W_out performs the linear step of decoding. We also consider a simple model for the sparse data on which this autoencoder operates. Each vector x is drawn i.i.d. during training, and each component is drawn i.i.d. with x_i = c_i u_i, where c_i ~ Bernoulli(p) and u_i ~ Uniform[0, 1] are independent variables. This ensures that x is sparse, with typically only p n_s features turned on. We train our toy models to minimize the expected L2 reconstruction loss,

L(x; W_out, W_in, b) = E ||x − f_nonlinear(x)||_2^2.   (2)

It is known that for the linear model (eq. 1 without the ReLU), the optimal solution is closely related to principal component analysis (see, for example, Plaut (2018) and p. 563 of Bishop & Nasrabadi (2006)). In particular, the reconstruction loss decreases linearly in the hidden dimension n_d when all features are i.i.d. On the other hand, the model eq.
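As a concrete illustration, eqs. 1 and 2 and the sparse data model can be sketched in a few lines of NumPy. This is our own sketch, not the paper's released code: the function names, the Gaussian weight initialization, and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n_batch, n_s, p, rng):
    """Sparse data model: x_i = c_i * u_i, c_i ~ Bernoulli(p), u_i ~ Uniform[0,1]."""
    c = rng.random((n_batch, n_s)) < p
    u = rng.random((n_batch, n_s))
    return c * u

def f_nonlinear(x, W_in, W_out, b):
    """Eq. 1: ReLU(W_out W_in x + b), applied row-wise to a batch."""
    return np.maximum(x @ W_in.T @ W_out.T + b, 0.0)

def loss(x, W_in, W_out, b):
    """Eq. 2: mean L2 reconstruction error over the batch."""
    err = x - f_nonlinear(x, W_in, W_out, b)
    return np.mean(np.sum(err ** 2, axis=1))

# Illustrative sizes; W_in is (n_d, n_s) and W_out is (n_s, n_d), as in the paper.
n_s, n_d, p = 64, 16, 0.05
W_in = rng.normal(size=(n_d, n_s)) / np.sqrt(n_s)
W_out = rng.normal(size=(n_s, n_d)) / np.sqrt(n_d)
b = np.zeros(n_s)
x = sample_data(1024, n_s, p, rng)
print(loss(x, W_in, W_out, b))
```

With p = 0.05 roughly 5% of components are active, so the decoder must recover a sparse signal routed through a 4x smaller hidden space.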
1 will have a much quicker reduction in loss, as will be described in section 3.1. For all models presented, we train with a batch size of 1024 and the Adam optimizer to completion. That is, training continues as long as the average loss over the past 100 batches is lower than the average loss over the 100 batches prior to that one. Our goal with training is to ensure that we have found an optimal model in the large-data limit, so that we can analyze the structure of the model itself.

3 Empirical observations

In this section, we present empirical observations of the trained models. We start by presenting a remarkable phenomenon this model exhibits in the high-sparsity regime: a dramatic decrease in loss as a function of the compression ratio. We then turn to a mechanistic interpretation of the weights, which gives empirical evidence for the phenomena needed to understand the algorithm the model learns. These are manifestations of a partially preserved permutation symmetry of the sparse degrees of freedom.

3.1 Fast loss drop

To gauge the performance of the model, we plot the loss (eq. 2) as a function of the compression ratio n_d/n_s. In fig. 2 we plot the performance of the linear versus non-linear models for representative parameters. It is clear that the non-linear model outperforms the linear model up until near the n_d/n_s ≈ 1 regime, due to an immediate drop in the loss. The slope and duration of this initial fall are controlled by p. In particular, in the high-sparsity regime (p close to zero), the loss drops to zero entirely near the n_d/n_s ≈ 0 regime. What is going on here? To explain this behavior, we analyze the algorithm the model encodes.
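The stopping rule described in section 2 (train until the trailing 100-batch average loss no longer improves on the preceding 100-batch average) can be sketched as follows; the helper name is ours, not from the paper's code.

```python
# Sketch of the paper's stopping criterion: training continues as long as the
# average loss over the past `window` batches is lower than the average over
# the `window` batches prior to that one.
def should_stop(losses, window=100):
    if len(losses) < 2 * window:
        return False                              # not enough history yet
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return recent >= previous                     # stop once the loss plateaus

# A steadily decreasing loss keeps training going; a flat one stops it.
print(should_stop([1.0 / (t + 1) for t in range(300)]))   # False
print(should_stop([0.5] * 300))                           # True
```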
Figure 2: Loss curves of trained models, Persian rug models, and optimal linear models as a function of the compression ratio.

3.2 Statistical permutation symmetry

Rather than looking individually at the weights, it is helpful to look at the matrix W = W_out W_in, shown in fig. 3. As will be made precise in section 4, we are interested in the structure of each row, since it is responsible for the corresponding feature output. Though the matrix is not permutation symmetric in detail, its statistics are sufficiently so as far as the algorithm is concerned. In particular, in the n_s → ∞ limit, the matrix is statistically permutation invariant in the following sense: 1. the diagonal elements become the same (fig. 4), 2. the bias elements become uniformly negative, which can be seen in the uniformity and slight blue shade in fig. 3 and is quantified in fig. 5, 3. the statistics of the off-diagonal terms in each row are sufficiently uniform for the interference to become Gaussian (fig. 7), and finally 4. the mean and variance of each of these become the same across rows (fig. 6).

Figure 3: Plot of the first 30 × 30 W elements and the corresponding bias (b) components, at p = 4.5% and n_s = 512. The diagonal components are all at similar values of 1.29 ± .01 (one standard deviation) while the off-diagonal components are approximately mean-zero, appearing like noise. The bias elements are all negative, around −.18 ± .01.

This statistical uniformity is a permutation symmetry across the sparse features. Immediately on looking at the exemplar model from fig. 3, we see that the diagonals are large and uniform while the off-diagonal components are fluctuating and much smaller.
This serves the function of routing each input to the corresponding output, except with some off-diagonal contributions, which should be as small as possible.

Figure 4: Permutation symmetry of diagonal values. We plot the mean-square fluctuation of the diagonal values corresponding to each model. Models are trained as a function of p and n_d/n_s. The emergence of symmetry as n_s grows (at all locations in the diagram) is a crucial element of the algorithm implemented by the autoencoders.

Because the diagonal components appear uniform, we quantify this by looking at the fluctuation on a per-model basis. For each model, we compute the empirical mean and variance of the diagonal components and plot the latter in fig. 4. Initially, when n_s = 128 is small, we see a region with large fluctuations among the diagonal components, particularly when p and n_d/n_s are both small. We hypothesize this occurs because for small models (and small feature probabilities) the contribution to any output from off-diagonal terms may fluctuate because of the small size of the matrix.

Figure 5: Permutation symmetry of bias values. We plot the mean-square fluctuation of values in the bias vectors corresponding to each model, which are trained as a function of p and n_d/n_s. As n_s increases, the fluctuation over bias elements generally decreases in all trained models.

We see similar features among the entries of the bias in fig. 5. The relatively large initial fluctuations generally decay as n_s grows larger.
Similar features are present in the two plots, such as the relatively low-fluctuation regions near (p, n_d/n_s) ≈ (.1, .5), because the bias and the diagonal entries are coupled: when a diagonal element in a row is large, the corresponding bias must be scaled up as well, and vice versa. For each of these algorithm-revealing features, we give three plots, each with increasing n_s, to show that the feature becomes apparent in the thermodynamic limit. Each plot is a heat map, with the variable representing the feature plotted in color and the spatial dimensions covering the entire parameter space of n_d/n_s and p.

Figure 6: Permutation symmetry of off-diagonal statistics. The symmetry-breaking parameter Δvar(ν) is given by the variance across all rows of the squared sum of the off-diagonal elements in each row, up to a constant. Once n_s reaches 8192, all noise variables have nearly identical variances.

We similarly see in fig. 6 that the off-diagonal terms concentrate as well. To be more precise, we look at the fluctuation across rows of the total norm of the off-diagonal elements in each row of W. The norm is a measure of the amount of signal arriving at a particular output from other inputs. A smaller contribution from other inputs means that each output has a higher signal-to-noise ratio, as long as the diagonal term remains constant.

Figure 7: Interference Gaussianity in the large n_s limit. We show that the Lyapunov condition is better satisfied as n_s grows larger, by plotting the worst-case Lyapunov condition (see eq. 3) over all rows of W. Models are trained at different p and n_d/n_s values.
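The three fluctuation diagnostics plotted in figs. 4 to 6 can all be computed directly from a model's (W, b). The sketch below is our own illustration (function name and toy matrix are ours), showing that a perfectly permutation-symmetric model makes all three vanish.

```python
import numpy as np

def symmetry_diagnostics(W, b):
    """Mean-square fluctuation of the diagonal of W (fig. 4), of the bias
    (fig. 5), and of the squared off-diagonal row norms (fig. 6)."""
    d = np.diag(W)
    off = W - np.diag(d)
    row_norm2 = np.sum(off ** 2, axis=1)   # squared sum of off-diagonals per row
    return np.var(d), np.var(b), np.var(row_norm2)

# Toy symmetric example: constant diagonal, constant bias, and every row
# carrying the same off-diagonal norm, so all three diagnostics are zero.
W = np.eye(4) + 0.1 * np.roll(np.eye(4), 1, axis=1)
b = -0.1 * np.ones(4)
print(symmetry_diagnostics(W, b))
```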
We see that for small n_s only models trained at large p have nearly-uniform off-diagonal entries, whereas all models approach uniformity at large n_s. Given that there are many off-diagonal terms, and those terms are all added together in the process of multiplying W x, it is natural to ask if they satisfy the Lyapunov condition, which would imply that the contribution to output i from off-diagonal terms j ≠ i is Gaussian. In fig. 7 we plot the statistic

Λ ≡ max_i [ Σ_{j≠i} |W_ij|^3 / ( Σ_{j≠i} W_ij^2 )^(3/2) ]   (3)

where the sum is over all values of j ≠ i for each i. The statistic tends towards 0 as n_s grows larger, which means that the contribution from the off-diagonal elements tends towards a Gaussian distribution. This is strong numerical evidence that no small set of off-diagonal terms dominates.

3.3 Optimization of residual parameters

The statistical permutation symmetry places constraints on the possible values of W and b. The constraint on b is straightforward: it is proportional to the all-ones vector, i.e. there is a number b such that b_i = b for all i. As will be explained in section 4.1, the relevant degrees of freedom remaining in W are one number a, equal to the diagonals, and σ, characterizing the root mean square of the off-diagonal rows. The precise values of the off-diagonals can be thought of as irrelevant "microscopic information". Thus there are three relevant degrees of freedom remaining: b, a, and σ.
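The Lyapunov statistic of eq. 3 is cheap to evaluate numerically. The sketch below (our code, not the paper's) checks that for i.i.d. off-diagonal entries the statistic shrinks as the matrix grows, consistent with the trend in fig. 7.

```python
import numpy as np

def lyapunov_statistic(W):
    """Eq. 3: max over rows i of sum_{j!=i} |W_ij|^3 / (sum_{j!=i} W_ij^2)^(3/2).
    Small values indicate near-Gaussian off-diagonal interference."""
    W_off = W - np.diag(np.diag(W))          # zero out the diagonal
    num = np.sum(np.abs(W_off) ** 3, axis=1)
    den = np.sum(W_off ** 2, axis=1) ** 1.5
    return np.max(num / den)

# For i.i.d. Gaussian off-diagonals the statistic decays like n_s^(-1/2),
# so the larger matrix should give a smaller value.
rng = np.random.default_rng(1)
small = lyapunov_statistic(rng.normal(size=(64, 64)))
large = lyapunov_statistic(rng.normal(size=(1024, 1024)))
print(small, large)
```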
In section 4.2.2 we give a specific set of values for W_in and W_out via the "Persian rug" matrix, which have the statistical permutation symmetry in the n_s → ∞ limit while also optimizing σ. The remaining parameters, a and b, can be optimized numerically. In fig. 2 we compare the loss curve of this artificial model with that of a trained model, and see that they are essentially the same.

4 Extracting the Algorithm

In this section we give a precise explanation of the algorithm the model performs. We start with a qualitative description of why the statistical permutation symmetry gives a good auto-encoding algorithm when the remaining macroscopic degrees of freedom are optimized. We then find an artificial set of symmetric weights with optimized macroscopic parameters. We show that the trained models achieve the same performance as the artificial model, thus showing they are optimal even when restricting to statistically symmetric solutions. Finally, we derive an explicit form of the loss scaling and argue that ReLU performs near-optimally among element-wise or "selection" decoders.

4.1 Qualitative Description

A key simplification is to consider strategies as collections of low-rank affine maps rather than as the collection of weights directly. In other words, consider the tuple (W, b), where W = W_out W_in, to define the strategy. We must restrict to W with rank no more than n_d because it is the product of two low-rank matrices. Given any such W we may also find W_in and W_out of the appropriate shape (e.g. by finding the SVD), so the two representations are equivalent. Let us now take a close look at the reconstruction error.
Let us consider output i, which is given by

(f_nonlinear(x))_i = ReLU( W_ii x_i + Σ_{j=1, j≠i}^{n_s} W_ij x_j + b_i ) = ReLU( a (x_i + ν_i) + b_i )   (4)

where we used that all diagonal elements of W are equal to a, and we define ν_i = a^(−1) Σ_{j=1, j≠i}^{n_s} W_ij x_j. At this stage we can already see that the individual off-diagonal elements are only important insofar as they characterize the Gaussian random variable ν_i. More specifically, because ν_i is completely characterized by its mean and variance, only the sum and the sum of squares of the off-diagonal components matter. Because we have a bias term, the mean of ν_i may be absorbed into it, so from now on we assume that the ν_i are all zero-mean. This simplification is why permutation symmetry emerges when n_s is large, and allows us to consider only the macroscopic variables introduced when we discuss the loss. The reconstruction error is (dropping the i subscript, and substituting b_i = b)

L = E_{x,ν}[ (x − ReLU(a(x + ν) + b))^2 ].
(5)

We can further decompose this by taking the expectation value over whether x is on or off, so L = (1 − p) L_off + p L_on, where

L_off = E_ν[ (ReLU(aν + b))^2 ], and
L_on = E_{u,ν}[ (u − ReLU(a(u + ν) + b))^2 ]

with u ~ Uniform[0, 1]. Let us explore the regimes of the macroscopic parameters a, b, σ where either or both of these loss terms are low. The impact of all the non-diagonal terms has been summed up in the "noise" ν. Though it is a deterministic function of x, output i has no way to remove ν from u + ν. The best it can do is to estimate u from u + ν, because it doesn't have any other information. This exemplifies a key principle: by restricting the computational capacity of our decoder, deterministic but complicated correlations act like noise. The main advantage of the nonlinear autoencoder is that the dominant contribution to the loss, L_off, can be immediately screened away by making aν + b either small or negative, allowing the network to focus on encoding active signals. This immediate screening is always possible, either by choosing b + μ large and negative, or a and μ + b small. However, these strategies come at a cost, because the output value is distorted from u to au + b. It is thus preferable for ν and b to be as small as possible, which occurs when σ is as small as possible.
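Under the symmetric ansatz, the split L = (1 − p) L_off + p L_on is easy to verify by Monte Carlo. The sketch below is ours; the parameter values a, b, σ are illustrative, not fitted values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, a, b, sigma = 0.05, 1.0, -0.1, 0.2   # illustrative macroscopic parameters
N = 1_000_000

c = rng.random(N) < p                    # which inputs are "on"
u = rng.random(N)
x = c * u                                # x_i = c_i * u_i
nu = rng.normal(0.0, sigma, N)           # zero-mean Gaussian interference

# Direct estimate of eq. 5.
out = np.maximum(a * (x + nu) + b, 0.0)
L_direct = np.mean((x - out) ** 2)

# Conditional estimates of L_off and L_on, recombined with weights (1-p, p).
L_off = np.mean(np.maximum(a * nu[~c] + b, 0.0) ** 2)
L_on = np.mean((u[c] - np.maximum(a * (u[c] + nu[c]) + b, 0.0)) ** 2)
L_split = (1 - p) * L_off + p * L_on
print(L_direct, L_split)                 # the two estimates agree closely
```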
As we will see, σ is the only parameter we are not free to choose in the large n_s limit; its value will be bounded as a function of n_d/n_s and p. Since L_off is the dominant contribution to the loss, it will thus be necessary to damp the signal by setting a small and/or b large and negative in regimes where σ is uncontrollably large. Given that we see a statistical permutation symmetry in trained models, let us consider symmetric strategies, so that W_ii = a and b_i = b for all i = 1, …, n_s. We will show that optimizing the remaining macroscopic parameters makes f_nonlinear act close to the identity on sparse inputs.

4.2 Optimizing the macroscopic parameters

We have seen qualitatively that a statistically symmetric strategy exists in certain regimes of the macroscopic parameters. Two of these parameters, a and b, are unconstrained. Furthermore, the loss should be monotonically increasing with σ, because a larger σ implies more noise, which hinders reconstruction. Thus we now prove lower bounds on σ and construct an artificial set of statistically permutation symmetric weights which achieves this bound. Finally, we compare the reconstruction loss of this strategy with the learned one to justify that the ignored microscopic degrees of freedom were indeed irrelevant.

4.2.1 Optimal σ

Assuming the permutation symmetry we discovered earlier in our empirical investigations, we will derive a bound on the variance of the output. Additionally, for an optimal choice of b the average loss is increasing in the variance, because a larger variance corresponds to a smaller signal-to-noise ratio. Taken together, these two facts will give a lower bound on the loss.
We will then provide an explicit construction which achieves this lower bound and illustrates how the algorithm works. The lower bound on the variance comes from the fact that W is low-rank with constant diagonals. For now, let us ignore the overall scale of W and rescale so that the diagonals are exactly 1. The bound we are about to prove is very similar to the Welch bound (Welch, 1974), which was phrased instead as a bound on the correlations between sets of vectors. We produce an argument for our context, which deals with potentially non-symmetric matrices W; the details are located in appendix C. We show that

σ^2 ≥ (4p − 3p^2)/12 · (n_s/n_d − 1)   (6)

with equality only when W is symmetric, of maximum rank, with all non-zero eigenvalues equal. This naturally leads to a candidate for the optimal choice of W, namely matrices of the form

W ∝ O P O^T with W_ii = 1   (7)

where O is an orthogonal matrix and P is any rank-n_d projection matrix. This kind of matrix saturates the bound because it is symmetric and has all nonzero eigenvalues equal to 1. A note on the connection between these matrices and tight frames: if we take the "square root" of W as an n_s × n_d matrix √W such that √W √W† = W, then the n_d-dimensional rows of √W form a tight frame. This is because Tr(W)^2 = n_s^2 = n_d Tr(W W†), which is the variational characterization of tight frames, as in Theorem 6.1 of Waldron (2018).
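The W ∝ O P O^T family of eq. 7 is easy to instantiate numerically. The sketch below (our code) draws a random orthogonal O, forms a rank-n_d projector P, and rescales so the diagonal is 1 on average; the off-diagonal squared row norm then comes out near n_s/n_d − 1, the value at which eq. 6 is saturated (since σ^2 is proportional to that norm).

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_d = 256, 32

O, _ = np.linalg.qr(rng.normal(size=(n_s, n_s)))   # random orthogonal matrix
P = np.diag([1.0] * n_d + [0.0] * (n_s - n_d))     # rank-n_d projector
W = (n_s / n_d) * (O @ P @ O.T)                    # diagonals ~ 1 on average

# With W_ii ~ 1, saturating eq. 6 corresponds to an off-diagonal squared
# row norm close to n_s/n_d - 1 (here 7).
off = W - np.diag(np.diag(W))
avg_row_norm2 = np.mean(np.sum(off ** 2, axis=1))
print(avg_row_norm2)
```

The diagonal of a random O P O^T fluctuates around n_d/n_s rather than being exactly constant; the Persian rug construction below removes even that fluctuation.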
4.2.2 Persian rug model

We now give an explicit construction of an optimal choice for W. The construction is based on the Hadamard matrix of size n_s = 2^m for some integer m, defined as H_ij = parity(bin(i) ⊕ bin(j)), where bin(i) is the m-bit binary representation of i and ⊕ is the bit-wise xor operation. We then define a Persian rug matrix as

R_ij = n_d^(−1) Σ_{k∈S} H_ik H_kj

where S ⊂ {1, …, n_s} with |S| = n_d. In fig. 1 we plot such a matrix for n_s = 256, n_d = 40, and S chosen randomly. The matrix R has diagonals equal to 1, because each diagonal is the average of n_d terms equal to (±1)^2, and it is proportional to a projector because the rows of H are orthogonal; it thus saturates eq. 6. Furthermore, one can readily check numerically that it satisfies the statistical symmetries. There remain two variables to optimize, a and b (recall μ can be absorbed into b). We do this numerically and compare to a trained model in fig. 2.

4.3 Loss scaling at high sparsity

Having obtained a simple expression for the loss in terms of constants a, b and two simple random variables x ~ Uniform[0, 1] and ν ~ N(μ, σ), as well as having deduced an achievable lower bound for σ, we are now able to explain why the simple ReLU model performs so well at high sparsity. For ease of notation, let us use r = n_d/n_s.
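The Hadamard-based construction of section 4.2.2 can be sketched as follows. This is our code, not the paper's: where the paper writes H_ij = parity(bin(i) ⊕ bin(j)), we assume the intended ±1 sign convention and use the standard Sylvester recursion, which produces a ±1 Hadamard matrix of size 2^m with mutually orthogonal rows.

```python
import numpy as np

def hadamard(m):
    """Sylvester +-1 Hadamard matrix of size 2^m (rows mutually orthogonal)."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

m, n_d = 8, 40
n_s = 2 ** m                                  # n_s = 256, as in fig. 1
H = hadamard(m)
rng = np.random.default_rng(0)
S = rng.choice(n_s, size=n_d, replace=False)  # random subset with |S| = n_d

# R_ij = n_d^{-1} sum_{k in S} H_ik H_jk  (H is symmetric here).
R = (H[:, S] @ H[:, S].T) / n_d

print(np.allclose(np.diag(R), 1.0))           # diagonals are exactly 1
print(np.linalg.matrix_rank(R))               # rank n_d
```

Since the columns H[:, S] are orthogonal with squared norm n_s, one can also check that (n_d/n_s) R is a projector, which is how R saturates eq. 6.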
4.3.1 Initial loss (ratio = 0)

Let us first consider the $r \to 0$ limit with all other parameters fixed. Then $\sigma \to \infty$ because of the bound in eq. 6, so the fluctuations in ν overwhelm the signal term. This means that the optimal a is
$$a = p\,\frac{\mathbb{E}_{u,\nu}[u\,\mathrm{ReLU}(\nu+b)]}{\mathbb{E}_{\nu}[\mathrm{ReLU}(\nu+b)^2]} + O(\sigma^{-1}). \quad (8)$$
The loss then becomes
$$L = (1-p)\,a^2\,\mathbb{E}_{\nu}[\mathrm{ReLU}(\nu+b)^2] + p\,\mathbb{E}_{u,\nu}[(u - a\,\mathrm{ReLU}(\nu+b))^2] + O(\sigma^{-1})$$
$$= a^2\,\mathbb{E}_{\nu}[\mathrm{ReLU}(\nu+b)^2] - 2ap\,\mathbb{E}_{u,\nu}[u\,\mathrm{ReLU}(\nu+b)] + p\,\mathbb{E}_u[u^2] + O(\sigma^{-1}),$$
and plugging in a explicitly gives
$$L = p\,\mathbb{E}_u[u^2] - p^2\,\frac{(\mathbb{E}_u[u])^2\,(\mathbb{E}_{\nu}[\mathrm{ReLU}(\nu+b)])^2}{\mathbb{E}_{\nu}[\mathrm{ReLU}(\nu+b)^2]} + O(\sigma^{-1}).$$
Thus we can conclude that
$$\lim_{r \to 0} L = p\,\mathbb{E}_u[u^2] + O(p^2) = \frac{p}{3} + O(p^2).$$
We therefore see that in the $p \ll 1$ regime we have $L \to L_0(p) \sim O(p)$, independent of the other parameters. We will now see that increasing r quickly causes the loss to drop to $O(p^2)$.
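As a sanity check on the $r \to 0$ limit, a small Monte Carlo recovers $L \approx p/3$ once the noise swamps the signal. The parameter values (p = 0.1, a large but finite σ = 20 standing in for $\sigma \to \infty$, b = 0) are illustrative assumptions, not taken from the paper's experiments.

```python
# Monte Carlo sketch of the r -> 0 limit: the optimal readout a is found by
# least squares, and the resulting loss approaches p/3 up to O(p^2, 1/sigma).
import numpy as np

rng = np.random.default_rng(1)
p, sigma, N = 0.1, 20.0, 400_000
u = rng.uniform(0.0, 1.0, N) * (rng.uniform(0.0, 1.0, N) < p)  # sparse signal
nu = rng.normal(0.0, sigma, N)                                  # interference
g = np.maximum(u + nu, 0.0)                                     # ReLU output

a = (u * g).mean() / (g ** 2).mean()   # sample version of the optimal a
loss = ((u - a * g) ** 2).mean()

assert abs(loss - p / 3) < 0.005       # matches lim_{r->0} L = p/3 + O(p^2)
```

The optimal a here is tiny (of order $p/\sigma$): in this regime the model mostly gives up on reconstruction, and the loss saturates at the variance of the sparse input.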
4.3.2 Loss upper bound

We now show that the loss drops off quickly, in the sense that for $r/p \gg 1$ we get $L(p)/p \to 0$, i.e. $L(p)$ scales super-linearly with p. We consider the regime where $r \ll 1$ also holds (for example $r = p^{1-\epsilon}$ for any $\epsilon \in (0,1)$), so that we may take $\sigma^2 \sim p/r \ll 1$. To obtain the upper bound we make educated estimates for values of a and μ that are near optimal. In particular, in appendix A we show that the optimal value of a is (after absorbing b into μ, the mean of ν):
$$a_{\mathrm{opt}} = \frac{\mathbb{E}_{x,\nu}[x\,\mathrm{ReLU}(x+\nu)]}{\mathbb{E}_{x,\nu}[\mathrm{ReLU}(x+\nu)^2]}. \quad (9)$$
From the form of the loss, we know that μ must decrease as p decreases for the loss to go down faster than $O(p)$. Thus ν has both a mean and a variance approaching 0, and $a_{\mathrm{opt}} \to 1$. We therefore plug in $a = 1$ before taking these limits, in the expectation of obtaining a good upper bound.
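The claim that $a_{\mathrm{opt}} \to 1$ as the noise mean and variance shrink can be checked directly from eq. 9. The two $(\mu, \sigma)$ pairs below are illustrative assumptions chosen with the scaling $\mu \sim \sigma\sqrt{\log(1/p)}$ in mind.

```python
# Sketch supporting the a = 1 substitution: the optimal readout a_opt of
# eq. 9 approaches 1 as the noise mean mu and scale sigma shrink.
import numpy as np

rng = np.random.default_rng(4)

def a_opt(mu, sigma, N=400_000):
    x = rng.uniform(0.0, 1.0, N)
    nu = rng.normal(-mu, sigma, N)     # bias absorbed into the noise mean
    g = np.maximum(x + nu, 0.0)
    return (x * g).mean() / (g ** 2).mean()

coarse = a_opt(0.2, 0.1)               # sizeable noise: a_opt visibly off 1
fine = a_opt(0.02, 0.01)               # small noise: a_opt close to 1

assert abs(fine - 1.0) < abs(coarse - 1.0)
assert abs(fine - 1.0) < 0.05
```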
The loss then takes the form $L = (1-p)L_{\mathrm{off}} + pL_{\mathrm{on}}$ with
$$L_{\mathrm{off}} = \mathbb{E}[\mathrm{ReLU}(\nu)^2] \quad \text{and} \quad L_{\mathrm{on}} = \mathbb{E}[(u - \mathrm{ReLU}(u+\nu))^2].$$

Off term: The off term can be upper bounded via
$$\mathbb{E}[\mathrm{ReLU}(\nu)^2] = \int_0^{\infty} d\nu\,\frac{e^{-(\nu+|\mu|)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}\,\nu^2 = \sigma^2 \int_0^{\infty} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}\,\nu^2 \quad (10)$$
$$\leq \frac{\sigma^2}{2}\,e^{-\mu^2/2\sigma^2}, \quad (11)$$
and thus we see that we need $\mu/\sigma \gg 1$ to get a good bound. In particular, we know empirically that the loss drop happens at increasingly smaller r. To ensure this we let $\sigma^2 \sim p/r$ scale at some rate slower than p.
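The off-term bound of eq. 11 is easy to check by simulation. The values μ = σ = 0.2 below are illustrative, not fitted to a trained model.

```python
# Numerical sketch of the off-term bound (eq. 11): for nu ~ N(-|mu|, sigma),
# E[ReLU(nu)^2] <= (sigma^2 / 2) * exp(-mu^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, N = 0.2, 0.2, 1_000_000
nu = rng.normal(-abs(mu), sigma, N)

off_term = (np.maximum(nu, 0.0) ** 2).mean()
bound = 0.5 * sigma ** 2 * np.exp(-mu ** 2 / (2 * sigma ** 2))

assert off_term <= bound
```

Increasing $\mu/\sigma$ suppresses both sides exponentially, which is exactly why the text asks for $\mu/\sigma \gg 1$.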
Thus, to ensure that the total loss decreases faster than $O(p)$, we need $e^{-\mu^2/2\sigma^2} \sim O(p)$, or in other words
$$\mu \sim \sigma\sqrt{\log\frac{1}{p}}. \quad (12)$$

On term: We perform a similar but slightly more involved computation in appendix B and combine it with the off term to obtain
$$L < (1-p)\,O\!\left(\sigma^2 e^{-\mu^2/2\sigma^2}\right) + p\,O\!\left(\mu^3 + e^{-|\mu|^2/2\sigma^2} + \mu e^{-|\mu|^2/2\sigma^2} + \sigma^2 + \sigma\mu + \mu^2\right).$$
Plugging in the μ scaling from eq. 12 and keeping only the lowest order terms gives
$$L < O\left(\sigma^2 p \log\frac{1}{p}\right) \sim O\left(\frac{p^2}{r}\log\frac{1}{p}\right). \quad (13)$$

4.3.3 Loss lower bound

In appendix D we also derive a matching lower bound, $L > O(p^2/r)$, in the high-sparsity limit, up to logarithmic corrections. We show that this in fact holds for a more general class of activation functions.
In particular, any function which acts element-wise or filters out elements gives an on-loss contribution of the form $\mathbb{E}_{u,\nu}[(u - f(u+\nu))^2]$, which is bounded below because ν destroys information about u. Thus we can conclude that $L \sim O(p^2/r)$, up to logarithmic factors, whenever $p/r \ll 1$.

5 Discussion

5.1 Summary

We have performed an exhaustive analysis of the toy model of superposition for permutation symmetric data in the limit of a large number of sparse signals. We empirically showed that the network reliably learns statistically permutation symmetric algorithms, in the sense that the information relevant to the algorithm is encoded in the summary statistics of the elements in each row of the W matrix. Plugging these symmetries back into the reconstruction loss allowed us to reduce its optimization problem over all model parameters to an optimization problem over three scalar variables. Of these parameters, only the noise parameter σ was constrained by the hyper-parameters p, $n_s$, and $n_d$. We showed that σ is minimized for a statistically symmetric W that is proportional to a projector. We then forward-engineered weights optimizing these parameters, giving the Persian rug model, and showed empirically that its loss closely matches the trained model loss. This shows that the trained models learn the optimal symmetric algorithm. The additional presence of small-scale structure in the Persian rug model highlights the algorithm's independence of such smaller-scale details.
Finally, we considered the analytic form of the loss in the high-sparsity limit and derived its scalings as a function of sparsity and compression ratio: $O(p)$ when $r/p \ll 1$ and $\sim O(p^2/r)$ when $p/r \ll 1$. In other words, the loss drops off from $O(p)$ within a window of ratios r scaling faster than $O(p)$.

5.2 Relationship to other works

5.2.1 Mechanistic Interpretability and Sparse Autoencoders

Mechanistic interpretability is a research agenda which aims to understand learned model algorithms through studying their weights (see Olah et al. (2020) for an introduction). Recent results relating to language models include Meng et al. (2023), which finds a correspondence between specific facts and feature weights, along with Olsson et al. (2022), which shows that transformers learn in-context learning through the mechanism of "induction heads". Elhage et al. (2022) introduced the model we study in this paper. While that work focused on empirically mapping the behaviors of the model in multiple regimes of interest, such as correlated inputs, we focused on a regime with enough symmetry to solve the model analytically, given observed symmetries in trained models. Chen et al. (2023) study this model in the context of singular learning theory. As part of their work, they characterize the loss using a different high-sparsity approximation than the one we present in this paper (they assume exactly one input feature is on). They then derive a subset of the critical points and their corresponding local learning coefficients under the assumption $n_d = 2$.
While the concept of dictionary learning was introduced by Mallat & Zhang (1993), the practical use of sparse autoencoders to understand large language models has accelerated recently due to mezzo-scale open-weight models (Gao et al., 2024; Lieberum et al., 2024) and large-scale open-output models (Bricken et al., 2023). These features are highly interpretable (Cunningham et al., 2023) and scale predictably. Interestingly, the scaling is quite similar across the various architectures they consider, differing primarily by a constant, which fits with the predictions in this work. We have seen that the dominant source of error is not in determining which features are present, but rather in the actual values of those features. Small modifications to the activation function, such as gating (Rajamanoharan et al., 2024), k-sparse activations (Makhzani & Frey, 2013), or the TRec non-linearity (Taggart, 2024; Konda et al., 2014), are insufficient to fix this problem, as they do not solve the basic issue of noisy outputs. In this context our work implies that innovative architectures, suitable both for gradient-based training and for decoding sparse features, must be developed.

5.2.2 Compressed Sensing and Statistical Physics

It is known that compressed sparse data can be exactly reconstructed by solving a convex problem (Candes & Tao, 2005; Candes et al., 2006; Donoho & Elad, 2003; Donoho, 2006), given knowledge of the compression matrix. Furthermore, using tools from statistical physics it is possible to show that this holds for typical compressed sparse data (Ganguli & Sompolinsky, 2010). Learning the compression matrix is also easy in certain circumstances (Sakata & Kabashima, 2013). For a more general review of compressed sensing and its history, see the introduction by Davenport et al. (2012).
The reconstruction procedure typically used in compressed sensing optimizes a (convex) relaxation of finding the sparsest set of features which reproduces the data vector. This is significantly different from the setting of sparse autoencoders, which try to obtain the sparse features using only one linear + activation layer. The discrepancy between the ability of convex optimization techniques to achieve zero loss while a linear + ReLU model necessarily incurs an error suggests that a more complex model architecture is needed for sparse autoencoders when it is desirable to compute the feature magnitudes to high precision. This may occur, for example, if one wishes to insert a sparse autoencoder into a model without corrupting its downstream outputs.

5.2.3 Linear Models

The linear version of the toy model we studied has garnered significant attention. Plaut (2018) demonstrates that the minimizers of eq. 2 can be directly related to Principal Component Analysis (PCA), a fundamental technique in dimensionality reduction and data analysis (see also Bishop & Nasrabadi (2006) for a comprehensive treatment). Kunin et al. (2019) study the learning dynamics of linear autoencoders when regularization is added.

5.3 Remaining questions and future directions

Correlated Data: We expect real-world data to display significant correlations between activated features. Nevertheless, we expect there are more realistic correlation structures that remain simple enough to analyze the loss analytically. One such possibility is to use data whose correlations are described by a sparse graph.

Compressed Computation: We have shown that features can only be recovered using linear + element-wise or selection-type activations together with a small error. How then are neural networks able to compute on information stored in superposition using just linear + ReLU layers? Noise injected into data early in a computation tends to compound as the computation grows in circuit complexity.
Thus we believe that neural networks must be doing at least one of the following:

1. performing computation on superposed information without first localizing it,
2. performing computation on superposed information by first localizing it using multiple layers,
3. performing only low-complexity computation on superposed information, or
4. implementing a fault tolerance scheme (computation which actively corrects errors).

It would be interesting to experimentally look for these phenomena and their structure.

Acknowledgements

We are grateful for the valuable feedback we received from Daniel Kunin, David Barmherzig, Henrik Marklund, Surya Ganguli, and Zach Furman. We acknowledge Adam Brown in particular for insightful remarks that helped us construct the Persian rug matrix. Aditya Cowsik acknowledges funding from the Simons Foundation (Award ID: 560571).

References

Bishop & Nasrabadi (2006) Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.

Bricken et al. (2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Candes & Tao (2005) E.J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005. doi: 10.1109/TIT.2005.858979.

Candes et al. (2006) E.J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
doi: 10.1109/TIT.2005.862083.

Chen et al. (2023) Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus bayesian phase transitions in a toy model of superposition, 2023. URL https://arxiv.org/abs/2310.06301.

Cunningham et al. (2023) Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

Davenport et al. (2012) Mark A Davenport, Marco F Duarte, Yonina C Eldar, and Gitta Kutyniok. Introduction to compressed sensing, 2012.

Donoho (2006) David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Donoho & Elad (2003) David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003. doi: 10.1073/pnas.0437847100. URL https://www.pnas.org/doi/abs/10.1073/pnas.0437847100.

Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.

Engels et al. (2024) Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, and Max Tegmark. Not all language model features are linear, 2024. URL https://arxiv.org/abs/2405.14860.

Ganguli & Sompolinsky (2010) Surya Ganguli and Haim Sompolinsky. Statistical mechanics of compressed sensing. Physical Review Letters, 104(18):188701, 2010.

Gao et al. (2024) Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.
Konda et al. (2014) Kishore Konda, Roland Memisevic, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint arXiv:1402.3337, 2014.

Kunin et al. (2019) Daniel Kunin, Jonathan Bloom, Aleksandrina Goeva, and Cotton Seed. Loss landscapes of regularized linear autoencoders. In International Conference on Machine Learning, pp. 3560–3569. PMLR, 2019.

Lieberum et al. (2024) Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024. URL https://arxiv.org/abs/2408.05147.

Makhzani & Frey (2013) Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.

Mallat & Zhang (1993) Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

Marks et al. (2024) Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024. URL https://arxiv.org/abs/2403.19647.

Meng et al. (2023) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023. URL https://arxiv.org/abs/2202.05262.

Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.

Olsson et al.
(2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads, 2022. URL https://arxiv.org/abs/2209.11895.

Plaut (2018) Elad Plaut. From principal subspaces to principal components with linear autoencoders, 2018. URL https://arxiv.org/abs/1804.10253.

Rajamanoharan et al. (2024) Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024.

Sakata & Kabashima (2013) Ayaka Sakata and Yoshiyuki Kabashima. Statistical mechanics of dictionary learning. Europhysics Letters, 103(2):28008, 2013.

Taggart (2024) Glen M. Taggart. ProLU: A nonlinearity for sparse autoencoders. https://www.alignmentforum.org/posts/HEpufTdakGTTKgoYF/prolu-a-nonlinearity-for-sparse-autoencoders, 2024.

Waldron (2018) Shayne FD Waldron. An introduction to finite tight frames. Springer, 2018.

Welch (1974) Lloyd Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information Theory, 20(3):397–399, 1974.

Appendix A Optimizing over a

We may optimize over a analytically because it appears almost quadratically in the loss. Consider the expression for the average loss from eq. 5 with b replaced by $a \cdot b$. As long as $a \neq 0$ this redefinition doesn't change the set of accessible models.
Furthermore, let us restrict to positive a, which allows us to rewrite the loss as
$$L = \mathbb{E}_{x,\nu}[(x - \mathrm{ReLU}(a(x+\nu+b)))^2] = \mathbb{E}_{x,\nu}[(x - a\,\mathrm{ReLU}(x+\nu+b))^2]. \quad (14)$$
The restriction to positive a is acceptable because we never see negative off-diagonal elements in our trained models. Now, optimizing over a is exactly linear regression; we can obtain the optimal value of a with the standard method:
$$0 = \frac{d}{da}L = -2\,\mathbb{E}[(x - a\,\mathrm{ReLU}(x+\nu+b))\,\mathrm{ReLU}(x+\nu+b)], \quad (15)$$
which implies that
$$a_{\mathrm{opt}} = \frac{\mathbb{E}_{x,\nu}[x\,\mathrm{ReLU}(x+\nu+b)]}{\mathbb{E}_{x,\nu}[\mathrm{ReLU}(x+\nu+b)^2]}. \quad (16)$$
Notice that the optimal a is always positive, which is consistent with the assumption we made earlier.

Appendix B Bounding the Reconstruction Error

On term: We can start to write the on term similarly as
$$\langle (u - \mathrm{ReLU}(u+\nu))^2 \rangle = \left\langle \int_{-\infty}^{\infty} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}\,(u - \mathrm{ReLU}(u+\sigma\nu))^2 \right\rangle_u.$$
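Appendix A's observation that optimizing a is exactly a linear regression (without intercept) can be checked numerically: the closed-form ratio of eq. 16 should coincide with a least-squares fit of x against $\mathrm{ReLU}(x+\nu+b)$. The parameter values below are illustrative assumptions.

```python
# Check that the closed-form a_opt (eq. 16) equals a no-intercept
# least-squares regression coefficient on sampled data.
import numpy as np

rng = np.random.default_rng(3)
sigma, b, N = 0.3, -0.1, 200_000
x = rng.uniform(0.0, 1.0, N)
nu = rng.normal(0.0, sigma, N)
g = np.maximum(x + nu + b, 0.0)        # ReLU(x + nu + b)

a_formula = (x * g).mean() / (g ** 2).mean()           # sample eq. 16
a_lstsq = np.linalg.lstsq(g[:, None], x, rcond=None)[0][0]

assert np.isclose(a_formula, a_lstsq)
assert a_formula > 0                   # consistent with the positivity claim
```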
Now we write the integral in two parts to get rid of the ReLU: one where $u + \sigma\nu < 0$ and one where $u + \sigma\nu > 0$. This gives
$$\underbrace{\left\langle \int_{-\infty}^{-u/\sigma} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}\,u^2 \right\rangle_u}_{E_r} + \underbrace{\left\langle \sigma^2 \int_{-u/\sigma}^{\infty} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}\,\nu^2 \right\rangle_u}_{E_\nu},$$
where the first term $E_r$ represents error coming from the ReLU and the second term $E_\nu$ represents error coming from the noise.
The scaling of $E_\nu$ is easily bounded:
$$E_\nu < \sigma^2 \int_{-\infty}^{\infty} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}\,\nu^2 \sim O(\sigma^2 + \sigma\mu + \mu^2),$$
and thus we see, unsurprisingly, that we need $\mu \ll 1$ to get a good bound. To upper bound $E_r$, write the u integral over two intervals, $[0, 2|\mu|]$ and $[2|\mu|, 1]$, corresponding to regions in which the interval of the ν integral does, and "decisively" does not, include the mean, respectively. In particular, we have
$$E_r < \underbrace{\int_0^{2|\mu|} u^2\,du \int_{-\infty}^{-u/\sigma} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}}_{E_r^{\mathrm{mean}}} + \underbrace{\int_{2|\mu|}^{1} u^2\,du \int_{-\infty}^{-u/\sigma} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}}}_{E_r^{\mathrm{tail}}}.$$
Since in $E_r^{\mathrm{mean}}$ the ν integral's interval includes the mean, we may as well extend the interval to the full real line to get the bound, giving
$$E_r^{\mathrm{mean}} < \int_0^{2|\mu|} u^2\,du = O(\mu^3).$$
$E_r^{\mathrm{tail}}$ can be upper bounded by replacing u in the inner integral's limit with its minimum value $2|\mu|$, so we have
$$E_r^{\mathrm{tail}} < (1-2|\mu|) \int_{-\infty}^{-2|\mu|/\sigma} d\nu\,\frac{e^{-(\nu+|\mu|/\sigma)^2/2}}{\sqrt{2\pi}} = (1-2|\mu|) \int_{-\infty}^{0} d\nu\,\frac{e^{-(\nu-|\mu|/\sigma)^2/2}}{\sqrt{2\pi}} < O(1)\,(1-2|\mu|)\,e^{-|\mu|^2/2\sigma^2}.$$
Putting it all together gives
$$L < (1-p)\,O\!\left(\sigma^2 e^{-\mu^2/2\sigma^2}\right) + p\,O\!\left(\mu^3 + e^{-|\mu|^2/2\sigma^2} + \mu e^{-|\mu|^2/2\sigma^2} + \sigma^2 + \sigma\mu + \mu^2\right).$$
Plugging in the μ scaling from eq. 12 and keeping only the lowest order terms gives
$$L < O\left(\sigma^2 p \log\frac{1}{p}\right) \sim O\left(\frac{p^2}{r}\log\frac{1}{p}\right). \quad (17)$$

Appendix C Minimal Variance Bound

We will show a minimum variance bound for matrices W which have all diagonals equal to 1 and rank at most $n_d$. In this case we know that $\mathrm{Tr}\,W = n_s$. On the other hand, the trace is the sum of the eigenvalues, and because W has rank at most $n_d$,
$$n_s = \sum_{i=1}^{n_d} \lambda_i \quad (18)$$
for the eigenvalues $\lambda_i$ of W.
Now we solve for the mean of the variance across rows,
$$\frac{1}{n_s}\sum_{i=1}^{n_s} \mathrm{Var}(\nu_i) = \frac{4p-3p^2}{12\,n_s} \sum_{i=1}^{n_s} \sum_{j=1,\,j\neq i}^{n_s} W_{ij}^2 = \frac{4p-3p^2}{12\,n_s}\left(\mathrm{Tr}(WW^\dagger) - n_s\right). \quad (19)$$
Here the first equality arises from the definition of $\nu_i$ (remembering that we have set the diagonals to exactly 1) and substituting the variance of $x_j$, while the second equality follows because $\mathrm{Tr}(WW^\dagger)$ is the sum of the squares of all entries of W, and we subtract off the diagonal entries. Because we want a bound on this quantity related to the eigenvalues of W, it is convenient to use the Schur decomposition $W = QUQ^\dagger$. Here Q is a unitary matrix and U is upper-triangular with the eigenvalues of W on the diagonal.
This allows us to lower bound the trace:
$$\mathrm{Tr}(WW^\dagger) = \mathrm{Tr}(QUQ^\dagger QU^\dagger Q^\dagger) = \mathrm{Tr}(UU^\dagger) = \sum_{i,j=1}^{n_s} |U_{ij}|^2 \geq \sum_{i=1}^{n_d} |\lambda_i|^2 \geq \frac{n_s^2}{n_d}, \quad (20)$$
where the last inequality follows from Cauchy-Schwarz and eq. 18. With this we find a bound on the variance,
$$\frac{1}{n_s}\sum_{i=1}^{n_s} \mathrm{Var}(\nu_i) \geq \frac{4p-3p^2}{12}\left(\frac{n_s}{n_d}-1\right), \quad (21)$$
with equality if W is symmetric with all non-zero eigenvalues equal; these are exactly the conditions under which the two inequalities in the proof become equalities. This naturally leads to a candidate for the optimal choice of W, namely matrices of the form
$$W \propto OPO^T \quad \text{and} \quad W_{ii} = 1, \quad (22)$$
where O is an orthogonal matrix and P is any rank-$n_d$ projection matrix. This kind of matrix saturates both bounds because it is symmetric and has all nonzero eigenvalues equal.

Appendix D Lower bound on loss scaling

We now show a lower bound on the loss in the $p \to 0$ limit. To do this, we will show a more general lower bound on the on loss for any deterministic function of the pre-activation.
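The trace inequality of eq. 20 holds for any unit-diagonal matrix of rank at most $n_d$, and is easy to probe numerically. The construction below (rescaling a random projector to have unit diagonal) and all sizes are illustrative assumptions.

```python
# Numerical sketch of eq. 20: unit-diagonal rank-n_d matrices W satisfy
# Tr(W W^T) >= n_s^2 / n_d.  Test matrices are random rescaled projectors.
import numpy as np

rng = np.random.default_rng(5)
n_s, n_d = 64, 12

for _ in range(20):
    O, _ = np.linalg.qr(rng.normal(size=(n_s, n_s)))   # random orthogonal O
    P = O[:, :n_d] @ O[:, :n_d].T                      # rank-n_d projector
    d = np.sqrt(np.diag(P))
    W = P / np.outer(d, d)                             # unit diagonal, rank n_d
    assert np.allclose(np.diag(W), 1.0)
    assert (W * W).sum() >= n_s ** 2 / n_d - 1e-9      # eq. 20
```

For the Persian rug matrix the inequality is saturated exactly, which is the content of eq. 21's equality condition.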
Specifically, we would like to lower bound $pL_{\mathrm{on}} \leq L$, where

$$L_{\mathrm{on}} = \mathbb{E}\!\left[(u - f(u+\nu))^2\right]$$

for any function $f$, with $u \sim \mathrm{Uniform}[0,1]$ and $\nu \sim \mathcal{N}(0,\sigma)$. Recall that there is no need to consider the bias $b$, as it can be absorbed into $\nu$, and that the optimal function $f$ is given by

$$f^*(u+\nu) = \mathbb{E}[u \mid u+\nu].$$

Let us change variables from $(u,\nu)$ to $(y \equiv u+\nu,\, u)$, and then use the tower property to rewrite $L_{\mathrm{on}}$ as

$$L_{\mathrm{on}} = \mathbb{E}_{y\sim P_{u+\nu}}\!\left[\mathbb{E}_{u|y}\!\left[(u - f^*(u+\nu))^2\right]\right]. \tag{23}$$

We first draw $y$ from the marginal distribution of $u+\nu$, and then draw $u$ from the conditional distribution given $y$. Because $f^*$ is exactly the conditional expectation, the interior expectation becomes the conditional variance:

$$L_{\mathrm{on}} = \mathbb{E}_y\!\left[\mathrm{Var}[u \mid y]\right]. \tag{24}$$

Because we want to lower bound $L_{\mathrm{on}}$, it is convenient to start with a lower bound on the conditional variance. We will lower bound the conditional variance for $y \in [\sigma, 1-\sigma]$, and then use that to lower bound the loss, with the goal of showing that the loss is bounded below by a constant multiple of $\sigma^2$ for $\sigma < \frac{1}{4}$.
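The identity (24), that the mean-squared error of the optimal estimator $f^*$ equals the expected conditional variance, can be verified by Monte Carlo, computing $f^*(y)$ by quadrature over the posterior of $u$ given $y$ (illustrative $\sigma$; this is a sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 0.1, 10_000                      # illustrative values
u = rng.uniform(0, 1, n)
y = u + sigma * rng.standard_normal(n)

# Posterior P[u | y] on a grid, then f*(y) = E[u | y] and Var[u | y].
grid = np.linspace(0, 1, 1001)
K = np.exp(-(y[:, None] - grid[None, :])**2 / (2 * sigma**2))
K /= K.sum(axis=1, keepdims=True)
f_star = K @ grid
var_post = (K * (grid[None, :] - f_star[:, None])**2).sum(axis=1)

mse = ((u - f_star)**2).mean()              # E[(u - f*(y))^2]
ev = var_post.mean()                        # E_y[Var[u | y]]
assert abs(mse - ev) / ev < 0.1             # agree up to Monte Carlo error
```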
This will show that the overall loss of any strategy, even one which can perfectly estimate which features are on or off, cannot achieve a reconstruction error better than $O(p^2/r)$.

The conditional distribution of $u$ is a truncated Gaussian. By Bayes' theorem,

$$P[u \mid u+\nu = y] = \frac{P[u+\nu = y \mid u]\, P[u]}{P[y]} \tag{25}$$

$$= \begin{cases} \dfrac{e^{-(u-y)^2/2\sigma^2}}{\int_0^1 dx\, e^{-(x-y)^2/2\sigma^2}} & \text{if } u \in [0,1], \\[1ex] 0 & \text{otherwise,} \end{cases} \tag{26}$$

with normalizing constant $Z(y) = \int_0^1 dx\, e^{-(x-y)^2/2\sigma^2} < \sqrt{2\pi\sigma^2}$.

Fix $y \in [\sigma, 1-\sigma]$, so that all distributions below are implicitly conditioned on $y$. Sample $u$ via the following procedure: first decide whether $|u-y| \leq \sigma$, and then sample from the conditional distribution $P[u \mid y,\, |u-y| \leq \sigma]$ or $P[u \mid y,\, |u-y| \geq \sigma]$ with the corresponding probabilities. Let $R$ be the indicator random variable for the event $|u-y| \leq \sigma$.
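That the posterior (26) really is the truncated Gaussian can be spot-checked by comparing a Monte Carlo histogram of $u$, conditioned on $y$ falling in a narrow window around some $y_0$, against the closed form (all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, y0, eps, n = 0.1, 0.3, 0.005, 2_000_000   # illustrative values
u = rng.uniform(0, 1, n)
y = u + sigma * rng.standard_normal(n)
u_cond = u[np.abs(y - y0) < eps]                 # condition on y ≈ y0

# Closed-form truncated-Gaussian posterior (eq. 26) on the same bins.
bins = np.linspace(0, 1, 51)
hist, _ = np.histogram(u_cond, bins=bins, density=True)
mid = 0.5 * (bins[:-1] + bins[1:])
dens = np.exp(-(mid - y0)**2 / (2 * sigma**2))
dens /= dens.sum() * (bins[1] - bins[0])         # normalize over [0, 1]
err = np.abs(hist - dens).max()
assert err < 0.5                                 # densities agree closely
```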
Then by the law of total variance,

$$\mathrm{Var}[u \mid y] = P[R=1]\,\mathrm{Var}[u \mid R=1] + P[R=0]\,\mathrm{Var}[u \mid R=0] + \mathrm{Var}_R\!\left[\mathbb{E}[u \mid R]\right] \tag{27}$$

$$\geq P[R=1]\,\mathrm{Var}[u \mid R=1], \tag{28}$$

where we have dropped the latter two non-negative terms to obtain the lower bound. We have $P[R=1] \geq \mathrm{erf}(2^{-1/2})$, because the chance that a truncated Gaussian lies within one $\sigma$ of its mode is larger than that for an untruncated Gaussian whenever the truncation is more than $\sigma$ away from the mode; this condition is satisfied by construction, since we have chosen $y$ to be more than $\sigma$ from the boundary. Additionally, a simple scaling argument shows that $\mathrm{Var}[u \mid R=1]$ is proportional to $\sigma^2$, which means there is some constant $C > 0$ such that

$$\mathrm{Var}[u \mid y] \geq C\sigma^2 \tag{29}$$

when $y \in [\sigma, 1-\sigma]$. To complete the argument we now return to

$$L_{\mathrm{on}} = \mathbb{E}_y\!\left[\mathrm{Var}[u \mid y]\right] \geq \mathbb{E}_y\!\left[\mathrm{Var}[u \mid y]\,\mathbf{1}_{y \in [\sigma,1-\sigma]}\right] \geq C\sigma^2\, P\!\left[y \in [\sigma, 1-\sigma]\right]. \tag{30}$$

For $\sigma = 1/4$ this probability is clearly finite, and for $\sigma < 1/4$ it increases as $\sigma$ decreases, so it is uniformly bounded below by a constant $C_1$.
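Both claims, $P[R=1] \geq \mathrm{erf}(2^{-1/2})$ and $\mathrm{Var}[u \mid y] \geq C\sigma^2$ on $y \in [\sigma, 1-\sigma]$, can be checked numerically on a grid; the constant $0.3$ below is an illustrative choice of $C$ that happens to work here, not a value from the paper:

```python
import numpy as np
from math import erf

def cond_stats(y, sigma, n=4001):
    """Posterior of u given y (a truncated Gaussian) on a grid.

    Returns (Var[u | y], P[R = 1]) where R indicates |u - y| <= sigma.
    """
    u = np.linspace(0.0, 1.0, n)
    w = np.exp(-(u - y)**2 / (2 * sigma**2))
    w /= w.sum()
    mean = (u * w).sum()
    var = ((u - mean)**2 * w).sum()
    p_near = w[np.abs(u - y) <= sigma].sum()
    return var, p_near

sigma = 0.1                                     # illustrative value
stats = [cond_stats(y, sigma) for y in np.linspace(sigma, 1 - sigma, 41)]
min_var = min(v for v, _ in stats)
min_p = min(p for _, p in stats)
assert min_p >= erf(2**-0.5) - 2e-3             # up to quadrature error
assert min_var >= 0.3 * sigma**2                # C = 0.3 suffices here
```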
So finally,

$$L_{\mathrm{on}} \geq C'\sigma^2 \;\implies\; L \geq C'p\,\sigma^2 \approx \frac{C'p^2}{r}. \tag{31}$$
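The $\sigma^2$ scaling of $L_{\mathrm{on}}$ behind (31) can also be checked directly: computing $\mathbb{E}_y[\mathrm{Var}[u \mid y]]$ by quadrature at two noise levels, the ratio $L_{\mathrm{on}}/\sigma^2$ should be roughly constant (and close to 1) as $\sigma \to 0$. A sketch with illustrative grid sizes:

```python
import numpy as np

def L_on(sigma, nu=2001, ny=2001):
    """E_y[Var[u | y]] for u ~ Uniform[0,1], y = u + N(0, sigma), by quadrature."""
    u = np.linspace(0.0, 1.0, nu)
    y = np.linspace(-5 * sigma, 1 + 5 * sigma, ny)
    K = np.exp(-(y[:, None] - u[None, :])**2 / (2 * sigma**2))
    pz = K.sum(axis=1)                    # unnormalized marginal of y
    cond = K / pz[:, None]                # posterior P[u | y] on the grid
    mean = cond @ u
    var = (cond * (u[None, :] - mean[:, None])**2).sum(axis=1)
    return (pz * var).sum() / pz.sum()    # average over the marginal of y

r1 = L_on(0.02) / 0.02**2
r2 = L_on(0.04) / 0.04**2
assert 0.5 < r1 < 1.01 and 0.5 < r2 < 1.01
assert abs(r1 - r2) < 0.1                 # L_on scales as sigma^2
```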