
Paper deep dive

Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross

Year: 2024 · Venue: arXiv preprint (submitted to ICLR 2025, rejected) · Area: Mechanistic Interp. · Type: Theoretical · Embeddings: 64

Models: modular addition model, one-layer transformer (ReLU MLP)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 6:25:56 PM

Summary

This paper presents a mechanistic interpretability study of ReLU MLP layers in one-layer transformer models trained on modular addition. The authors demonstrate that these MLPs can be interpreted as performing numerical integration (a quadrature scheme) to compute trigonometric integral identities, effectively 'doubling' input frequencies. This interpretation allows the compression of nonlinear feature-maps into analytic expressions, providing non-vacuous bounds on model behavior in time linear in the parameter count.

Entities (5)

Modular Addition · task · 100%
ReLU MLP · model-component · 100%
Pizza Algorithm · algorithm · 98%
Infinite-width lens · methodology · 95%
Quadrature scheme · mathematical-concept · 95%

Relation Signals (3)

Pizza Algorithm solves Modular Addition

confidence 100% · Models trained on the modular addition task... implement modular arithmetic using the following algorithm, which they call the 'pizza' algorithm

Infinite-width lens enables Numerical Integration Interpretation

confidence 95% · the infinite-width lens... turns post-activation matrix multiplications into approximate integrals

ReLU MLP implements Pizza Algorithm

confidence 90% · the MLP can be understood as evaluating a quadrature scheme... in one-layer transformers implementing the 'pizza' algorithm

Cypher Suggestions (2)

Find all algorithms associated with specific model components · confidence 90% · unvalidated

MATCH (c:Component)-[:IMPLEMENTS]->(a:Algorithm) RETURN c.name, a.name

Identify methodologies used to analyze model components · confidence 85% · unvalidated

MATCH (m:Methodology)-[:ENABLES]->(a:Analysis) RETURN m.name, a.name

Abstract

Abstract: The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps, like MLP layers, is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the "pizza" algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at this https URL.

Tags

ai-safety (imported, 100%) · mechanistic-interp (suggested, 92%) · theoretical (suggested, 88%)


Full Text

64,162 characters extracted from source content.


Under review

MODULAR ADDITION WITHOUT BLACK-BOXES: COMPRESSING EXPLANATIONS OF MLPS THAT COMPUTE NUMERICAL INTEGRATION

Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross

ABSTRACT

The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps, like MLP layers, is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models (Nanda et al., 2023), and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the "pizza" algorithm (Zhong et al., 2023): the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at https://tinyurl.com/mod-add-integration.

1 INTRODUCTION

Neural networks' ability to generalize suggests they implement simpler, low-rank algorithms despite performing high-dimensional computations (Olah et al., 2020). The field of mechanistic interpretability has made tremendous progress in finding low-rank approximations of activations, for example, discovering interpretable features (Bricken et al., 2023; Cunningham et al., 2023). However, finding low-rank approximations of feature-maps, particularly MLP layers, is still an open problem (Elhage et al., 2022).
Finding low-rank approximations of nonlinear feature-maps becomes particularly important in light of recent work that found evidence that sparse linear features do not fully capture the structure of frontier models (Marks et al., 2024; Engels et al., 2024). Interpretations that lack analyses of feature-maps may not be usable for ambitious applications of mechanistic interpretability, like anomaly detection and worst-case guarantees. The expressivity of MLPs constitutes a large attack surface for perturbing the model, so not compressing feature-maps leaves many free parameters in the interpretation. These free parameters diminish our ability to detect anomalous behaviour or to make strong guarantees.

To illustrate the difficulty of compressing MLPs, consider the following toy model comparing the effective parameter count of deep nonlinear networks and deep linear networks: adding or multiplying $k$ matrices of shape $m \times m$. For linear operations, we need only $m^2$ parameters to completely describe the input-output behaviour, regardless of the depth of the network. However, introducing nonlinearities like ReLU between these operations increases the effective parameter count to $km^2$, with the complexity growing with depth. While deep linear networks can be compressed to shallow networks of equivalent width without loss of expressivity, nonlinear networks resist such compression.

Even in the classic toy setting of modular addition models, the MLP layer is treated as a black box. Nanda et al. (2023) find sparse features to describe the input and output of the MLP layer; however, they do not tell us how the MLP layer processes the input features to generate the output features.

arXiv:2412.03773v1 [cs.LG] 4 Dec 2024

[Figure 1: (Left) There are finitely many neurons in the model (indexed by $i$), computing the sum $\sum_i f_x(\xi_i)\, g_c(\xi_i)\, w_i$. The function $f_x(\xi_i)$ is ReLU applied to the inputs $x$. The weight of the connection to each output $c$ is $g_c(\xi_i)$ times a neuron-specific, output-independent normalization factor $w_i$. (Right) Taking the limit as the number of neurons goes to infinity turns the sum over neurons into the integral $\int f_x(\xi)\, g_c(\xi)\, d\xi$. Compressing the resulting analytic expression allows us to compress the MLP.]

In this work, we present a case study in compressing nonlinear feature-maps. We extend the analysis in Nanda et al. (2023); Zhong et al. (2023); Gromov (2023) to the ReLU MLP layer, opening up the final black box remaining in the modular addition models (§2). To demonstrate that compressing the feature-map is essential for reducing free parameters, we use the formal compression metric from Gross et al. (2024). We measure compression by the computational complexity of verifying the interpretation, where proving the same bound with a lower complexity budget would correspond to finding an interpretation with fewer free parameters (§3).

We apply an infinite-width lens, wherein we treat densely connected MLPs as finite approximations to an infinite-MLP-width limit (Appendix J). The infinite-width lens permits us to turn post-activation matrix multiplications into approximate integrals and study the remaining operations of the network, including nonlinear operations, analytically. We find a low-rank approximation of the nonlinear function implemented by the MLPs of the "pizza" transformers (Zhong et al., 2023): doubling the frequencies of the input representations, that is, mapping from representations of the form $\cos(\tfrac{k}{2}(a+b)), \sin(\tfrac{k}{2}(a+b))$ to $\cos(k(a+b)), \sin(k(a+b))$.
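The infinite-width lens can be illustrated with a minimal numeric sketch (made-up uniform phases, not the paper's code): a finite sum over neurons with roughly evenly spaced phase offsets behaves like a Riemann sum for an integral over phase, here $\int_{-\pi}^{\pi} \mathrm{ReLU}(\cos\varphi)\,d\varphi = 2$.

```python
import numpy as np

def neuron_sum(n):
    """Riemann-sum analogue of an n-neuron MLP layer: each 'neuron' owns a
    phase offset and an equal slice of width 2*pi/n."""
    phis = np.linspace(-np.pi, np.pi, n, endpoint=False)
    width = 2 * np.pi / n
    return np.sum(np.maximum(np.cos(phis), 0.0) * width)

# Exact integral of ReLU(cos phi) over [-pi, pi] is 2; the finite sum
# converges to it as the "width" n grows.
print(abs(neuron_sum(512) - 2.0))   # small
print(abs(neuron_sum(4096) - 2.0))  # smaller
```

In the trained models the phases are not exactly uniform, which is why the paper must bound the quadrature error rather than assume it vanishes.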
Building on surprising patterns in the phase-amplitude representation of the MLP pre-activations and neuron-logit map, we find that the MLP can be understood as implementing a quadrature scheme for integrals of the form

$$\int_{-\pi}^{\pi} \mathrm{ReLU}\big[\cos(\tfrac{k}{2}(a+b) + \varphi)\big] \cos(kc + 2\varphi)\, d\varphi = \tfrac{2}{3}\cos(k(a+b-c)),$$

for a handful of key frequencies $k$, where the integral can be seen as the limiting case of the MLP's summation as the number of neurons increases (§4). We confirm this interpretation by creating non-vacuous bounds for the outputs of these MLPs in time linear in the number of parameters, i.e. without evaluating it on all $p^2$ possible inputs (§5). Finally, we resolve the puzzling observation that the logits of supposedly "pizza" models are closer to those of Nanda et al. (2023)'s "clock" algorithm, by explaining how the model uses secondary frequencies equal to twice each key frequency in order to compensate for how the pizza algorithm fails in cases when $k(a-b) \approx \pi$ (§6).

2 BACKGROUND

2.1 MECHANISTIC INTERPRETABILITY OF MODULAR ADDITION MODELS

Models trained on the modular addition task have become a classic testbed in the mechanistic interpretability literature. Originally, Nanda et al. (2023) studied a one-layer transformer model. They found low-rank features to describe all components of the model, and analysed the feature-map of the final linear layer. While they generated a human-intuitive algorithm of how the model works, they did not explain how the MLP layers compute logits that fit the form required by this algorithm. Despite the tremendous progress in analysing the non-MLP layers of the model, the ReLU MLP is still treated as a black-box.

Zhong et al. (2023) extended the analysis to a family of architectures parameterized by attention rate. The architecture from Nanda et al. (2023) corresponds to attention rate 0, while attention rate 1 corresponds to a ReLU MLP-only model, i.e. a transformer with constant attention.
Depending on attention rate, they showed that models may learn the "clock" or the "pizza" algorithm. They exhaustively enumerated inputs to the MLP-only model and found a description of the feature-map. However, exhaustive enumeration is not feasible for larger input sizes, and does not constitute an insight-rich explanation. Thus, their approximation of the feature-map is equal to the feature-map itself, failing to provide any compression.

Gromov (2023) considered a cleaner version of the MLP-only architecture considered by Zhong et al. (2023), and used quadratic activations instead of ReLU activations. They presented a formula corresponding to a compressed feature-map. However, they did not present sufficient evidence to show that the trained models follow the formula as suggested. They only showed that the weights are roughly "single-frequency", i.e. they are well approximated by the largest Fourier component. Establishing that the model in fact uses the stated algorithm, as opposed to a different algorithm, would require significantly more validation.

In this work, we analyze the ReLU MLP with the goal of demonstrating how it computes the functions described in prior work, and establish this with rigorous evidence and a formal proof.

2.2 EXPERIMENTAL SETUP

We study models which implement the "pizza" algorithm from Zhong et al. (2023): a one-layer ReLU transformer with four heads and constant attention $= \tfrac{1}{2}$ for all tokens, trained to compute $M: (a, b, \text{'='}) \mapsto (a+b) \bmod p$. As in Zhong et al. (2023), we take $p = 59$. The model takes input $(a, b, \text{'='})$ encoded as one-hot vectors, and we read off the logits $\mathrm{logit}(a,b,c)$ for all possible values $c \pmod p$ above the final sequence position '=', with the largest logit representing its predicted answer.
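A shape-level sketch of this setup (random weights standing in for the trained model; positional embeddings and per-head OV circuits are folded into the token embedding for brevity, which is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d_model, d_mlp = 59, 128, 512  # dimensions from the paper's Appendix A

W_E   = rng.normal(size=(d_model, p + 1))  # token embeddings, incl. '='
W_in  = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_model, d_mlp))
W_U   = rng.normal(size=(p, d_model))      # unembedding onto the p answers

a, b = 3, 17
# With attention fixed at 1/2, the residual stream above '=' mixes the two
# operand embeddings equally.
x1 = W_E[:, p] + 0.5 * (W_E[:, a] + W_E[:, b])
logits = W_U @ x1 + W_U @ W_out @ np.maximum(W_in @ x1, 0.0)
print(logits.shape)  # (59,)
```

With trained weights, the largest of these 59 logits would be the model's answer for (a + b) mod p.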
Since the attention is constant, the logits of the model given input $a, b$ are calculated as:

$x^{(0)}_i = W_E t_i + p_i$ (embedding)
$x^{(1)} = x^{(0)}_2 + \tfrac{1}{2} \sum_{j=1}^{4} W_O^j W_V^j \big(x^{(0)}_a + x^{(0)}_b\big)$ (post-attention residual stream)
$N = \mathrm{ReLU}(W_{\mathrm{in}}\, x^{(1)})$ (post-ReLU neuron activations)
$x^{(2)} = x^{(1)} + W_{\mathrm{out}} N$
$\mathrm{logits} = W_U\, x^{(2)} = W_U\, x^{(1)} + W_U W_{\mathrm{out}}\, \mathrm{ReLU}(W_{\mathrm{in}}\, x^{(1)})$

As such, the model architecture considered has constant attention $= \tfrac{1}{2}$ throughout, so the model consists of linear embeddings, followed by ReLU, followed by a further linear layer. As noted in both Nanda et al. (2023) and Zhong et al. (2023), the contribution from the skip connection around the MLP to the logits is small. Combining this with the fact that the attention is uniform across $a, b$, and using $W_L = W_U W_{\mathrm{out}}$ for the neuron-logit map, the logits can approximately be written as the sum of the contributions of each of the $d_{\mathrm{mlp}}$ neurons:

$$\mathrm{logit}(a,b,c) \approx \sum_{i=1}^{d_{\mathrm{mlp}}} (W_L)_{ci} \cdot \mathrm{ReLU}\big(\tfrac{1}{2}\mathrm{OV}(a)_i + \tfrac{1}{2}\mathrm{OV}(b)_i\big) \quad (1)$$

2.3 THE "PIZZA" ALGORITHM

Zhong et al. (2023) demonstrated that small transformers using constant attention implement modular arithmetic using the following algorithm, which they call the "pizza" algorithm:

1. $\mathrm{OV}(a), \mathrm{OV}(b)$ embed the one-hot encoded tokens $a, b$ as $(\cos(ka), \sin(ka))$ and $(\cos(kb), \sin(kb))$ for a small handful of key frequencies $k$.
2. Before the MLP layer, the representations of the two tokens are averaged: $(\cos(ka) + \cos(kb), \sin(ka) + \sin(kb))/2 = \cos(k(a-b)/2)\,(\cos(k(a+b)/2), \sin(k(a+b)/2))$.
3. The network then uses the MLP layer to "double the frequencies": $N = |\cos(k(a-b)/2)|\,(\cos(k(a+b)), \sin(k(a+b)))$.
4. Finally, $W_L$ scores possible outputs by taking the dot product with $(\cos(kc), \sin(kc))$: $\mathrm{logit}(a,b,c) = |\cos(k(a-b)/2)|\,\big(\cos(k(a+b))\cos(kc) + \sin(k(a+b))\sin(kc)\big) = |\cos(k(a-b)/2)|\cos(k(a+b-c))$.

They distinguish this from the "clock" algorithm of Nanda et al. (2023), whose logits have no dependence on $|\cos(k(a-b)/2)|$, and where the MLP layer instead multiplies together its inputs. Note that Zhong et al.
(2023) check that the MLP layer doubles frequencies empirically (i.e. by brute force), by treating the MLP as a black box and enumerating the MLP's outputs for all possible inputs. In this work, we seek to understand how the MLP performs its role.

3 COMPRESSING MLPS

Post-hoc mechanistic interpretability (Olah et al., 2020; Elhage et al., 2021; Black et al., 2022) can be formalized as finding a compact explanation (Gross et al., 2024) of how the model computes its outputs on the entire input distribution. Gross et al. (2024) demonstrated that finding non-trivial compact explanations, i.e. proving a meaningful bound instead of a null bound, necessarily requires more mechanistic information about the model behaviour. Moreover, if some marginal mechanistic analysis does not result in compression, then it does not reduce the free parameters of that component.

[Figure 2: The MLP approximately computes the integral $\int_{-\pi}^{\pi} h(\varphi)\, d\varphi$. The computed integral is for frequency $k = 12$ when $a + b = c = 0$. The widths and heights of the rectangles are generated by the actual weights in a trained model.]

Gross et al. (2024) conduct a limited empirical study of models trained to compute the maximum of $k$ integers, which are attention-only transformers with no MLP layers. Our work can be seen as applying this rigorous formalization to a more challenging case study.

Formally, we consider the computational complexity needed to check the behaviour of the MLP on all inputs. The lower the complexity is, the better we have understood the MLP and the better we can compress our description of how the MLP computes the function that it does. The naive baseline (as provided by both Nanda et al. (2023) and Zhong et al. (2023)) is to describe the MLP's behaviour by evaluating it on every possible input. For $n$ such inputs and an MLP with width $d_{\mathrm{mlp}}$ and input/output dimension $d_{\mathrm{model}}$, this requires checking the result of $O(n\, d_{\mathrm{mlp}}\, d_{\mathrm{model}})$ operations.
This corresponds to a null interpretation providing zero understanding of how the MLP internally functions. Ideally, we hope that we can bound the MLP's behaviour by referencing only the model weights, without evaluating it on all inputs, in time $O(d_{\mathrm{mlp}} d_{\mathrm{model}} + n)$, linear in the parameter count of the MLP. As this is the information-theoretic limit, we target interpretations that formally bound MLP behaviour in time linear in parameter count.

4 INTERPRETING "PIZZA" MLPS AS PERFORMING NUMERICAL INTEGRATION

Following Nanda et al. (2023), we focus on a particular ("mainline") model in the main body of the work (Appendix A). We confirm that our results generalize to another 150 transformers in Appendix C. We identify new structure in the model by using the amplitude-phase form of the Fourier series, and find an integral approximation of the MLP showing doubling of the frequency of the preactivations.

4.1 TRANSLATING INTO THE AMPLITUDE-PHASE FOURIER FORM

We first rewrite the algorithm in subsection 2.3 as an equation of the logits. Let $F$ be the $p \times p$ discrete Fourier transform matrix; then $W_L = F \widetilde{W}_L$, $\mathrm{OV}(a) = \widetilde{\mathrm{OV}}\, F_a$, and $\mathrm{OV}(b) = \widetilde{\mathrm{OV}}\, F_b$ for some matrices of Fourier coefficients $\widetilde{W}_L$, $\widetilde{\mathrm{OV}}$. The logits can thus be written:

$$\mathrm{logit}(a,b,c) \approx F\, \widetilde{W}_L\, \mathrm{ReLU}\big(\widetilde{\mathrm{OV}}\; \tfrac{1}{2}(F_a + F_b)\big) \quad (2)$$

where the factors correspond to the steps of the algorithm in subsection 2.3: $F_a, F_b$ to step 1, the averaging $\tfrac{1}{2}(F_a + F_b)$ to step 2, $\widetilde{\mathrm{OV}}$ and the ReLU to step 3, and $F$ to step 4. We translate the algorithm into amplitude-phase Fourier form, because phase terms naturally capture cyclic relationships. We rewrite the matrices of Fourier coefficients $\widetilde{W}_L$ and $\widetilde{\mathrm{OV}}$ in polar coordinates. Taking $s = k(a+b)/2$ and $s' = k(a-b)/2$, the algorithm becomes:

1. $F_a, F_b$ embed the one-hot encoded tokens $a, b$ as $(\cos(ka), \sin(ka))$ and $(\cos(kb), \sin(kb))$ for a small handful of key frequencies $k$.
2. Before the MLP layer, the representations of the two tokens are averaged: $(\cos(ka) + \cos(kb), \sin(ka) + \sin(kb))/2 = \cos s'\, (\cos s, \sin s)$.
3. The network then uses the MLP layer to "double the frequencies". For each output frequency $k'$, we write

$$\sum_i (\widetilde{W}_L)_{k'i}\, \mathrm{ReLU}\big(\widetilde{\mathrm{OV}}_i\, \tfrac{1}{2}(F_a + F_b)\big) = \sum_i r'_i (\cos\psi_{k'i}, \sin\psi_{k'i})\, \mathrm{ReLU}\big(r_i (\cos\varphi_i, \sin\varphi_i) \cdot \cos s'\, (\cos s, \sin s)\big)$$
$$= \sum_i r'_i |r_i| |\cos s'|\, (\cos\psi_{k'i}, \sin\psi_{k'i})\, \mathrm{ReLU}(\cos\varphi_i \cos s + \sin\varphi_i \sin s)$$
$$= \sum_i r'_i |r_i| |\cos s'|\, (\cos\psi_{k'i}, \sin\psi_{k'i})\, \mathrm{ReLU}(\cos(s - \varphi_i))$$
$$\approx Z_{k'} |\cos s'|\, (\cos(2s), \sin(2s)) \quad \text{(empirically)} \quad (3)$$

4. Finally, $F$ scores possible outputs by taking the dot product with $(\cos(kc), \sin(kc))$:

$$\mathrm{logit}(a,b,c) \approx Z_{k'} |\cos s'|\, \big(\cos(2s)\cos(kc) + \sin(2s)\sin(kc)\big) = Z_{k'} |\cos(k(a-b)/2)|\cos(k(a+b-c))$$

4.2 OBSERVATIONS

[Figure 3: The input ($\varphi_i$) and output ($\psi_i$) phase shift angles for frequency $k = 12$, where $\psi_i \approx 2\varphi_i \pmod{2\pi}$ for the primary frequency of each neuron. The line has $R^2 > 0.99$, and the intervals between angles have mean width 0.054 and standard deviation 0.049. This shows that the angles are roughly uniform.]

We find that each neuron has a primary frequency for both input and output. As seen in Figure 4, the majority of pre-activations for each neuron consist of terms from a single key frequency, as does the neuron-logit map $W_L$. The largest frequency component for the ReLU pre-activation is almost always equal to the largest frequency component for the corresponding row of $W_L$. This allows us to divide the neurons into clusters $I_k$, each of which corresponds to a single frequency $k$.

Furthermore, the output phase is double the input phase. In Figure 3, we plot the input and output phase shift angles $\varphi_i$ and $\psi_i$ for the largest two frequencies. For each neuron's primary frequency, the neuron's output phase shift $\psi_i$ is almost exactly twice its input phase shift $\varphi_i$.
[Figure 4: Histograms of the variance explained by the largest Fourier frequency component for the pre-activations and the neuron-logit map $W_L$ for each of the 512 neurons in the mainline model. (a) For most neurons, $>96\%$ of variance in pre-activations is explained by the largest frequency. (b) For most neurons, $>98\%$ of variance in $W_{\mathrm{out}}$ is explained by the largest frequency.]

4.3 INTEGRAL APPROXIMATION OF THE MLP

Treating the MLP layer as approximately performing numerical integration allows us to make sense of the previous observations. Recall that we can write the logits of the model as

$$\mathrm{logit}(a,b,c) \approx \sum_{i=1}^{d_{\mathrm{mlp}}} (W_L)_{ci} \cdot \mathrm{ReLU}\big(\tfrac{1}{2}\mathrm{OV}(a)_i + \tfrac{1}{2}\mathrm{OV}(b)_i\big) \approx \sum_{k \in K} \sum_{i \in I_k} \cos(kc + 2\varphi_i)\, \mathrm{ReLU}\big[\cos(\tfrac{k}{2}(a-b))\, \cos(\tfrac{k}{2}(a+b) + \varphi_i)\big]$$

For each frequency $k$, we can write the contribution to the logits from neurons of frequency $k$ as:

$$\mathrm{logit}^{(k)}(a,b,c) = \big|\cos(\tfrac{k}{2}(a-b))\big| \sum_{i \in I_k} \cos(kc + 2\varphi_i)\, \mathrm{ReLU}\big[\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi_i)\big]$$

where $\sigma_k = 1$ if $\cos(\tfrac{k}{2}(a-b)) \geq 0$ and $-1$ otherwise. Ignoring the $|\cos(k(a-b)/2)|$ scaling factor (which does not vary per neuron), we claim that the normalized MLP outputs

$$\sum_{i \in I_k} w_i\, \mathrm{ReLU}\big[\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi_i)\big] \cos(kc + 2\varphi_i) \quad (4)$$

can be well thought of as approximating the integral (see Appendix E):

$$\int_{-\pi}^{\pi} \mathrm{ReLU}\big[\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi)\big] \cos(kc + 2\varphi)\, d\varphi = \tfrac{2}{3} \cos(k(a+b-c)). \quad (5)$$

Note that the above integral is valid for both $\sigma_k = -1, 1$.
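The integral identity in Equation 5 can be checked numerically. A sketch (a plain Riemann sum, not the paper's code), writing $s = \tfrac{k}{2}(a+b)$ and $t = kc$ so the claimed value is $\tfrac{2}{3}\cos(2s - t)$:

```python
import numpy as np

def quadrature(s, t, sigma, n=200_000):
    """Riemann-sum approximation of
    int_{-pi}^{pi} ReLU[sigma * cos(s + phi)] * cos(t + 2*phi) dphi."""
    phi = np.linspace(-np.pi, np.pi, n, endpoint=False)
    integrand = np.maximum(sigma * np.cos(s + phi), 0.0) * np.cos(t + 2 * phi)
    return integrand.sum() * (2 * np.pi / n)

# Matches (2/3) * cos(2s - t) for either sign of sigma (shifting phi by pi
# flips the sign inside ReLU but leaves cos(t + 2*phi) unchanged).
for s, t, sigma in [(0.3, 1.1, 1), (2.0, -0.7, -1), (0.0, 0.0, 1)]:
    assert abs(quadrature(s, t, sigma) - (2 / 3) * np.cos(2 * s - t)) < 1e-4
```

This is exactly the frequency doubling of step 3: the $s$-dependence of the result is $\cos(2s)$ although the integrand only contains $\cos(s + \varphi)$.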
This gives us the desired form for the output logits:

$$\mathrm{logit}(a,b,c) = \sum_k |\cos(k(a-b)/2)| \cos(k(a+b-c)) \quad (6)$$

5 VALIDATION VIA COMPACT GUARANTEES ON MLP PERFORMANCE

As evidence for the usefulness of the numerical integration interpretation, we use it to derive non-vacuous bounds on the output of the MLP on all inputs in time linear in the parameters.

5.1 COMPUTING NUMERICAL INTEGRATION ERROR

The validation of our interpretation as an integral then becomes a mathematical question of evaluating the efficiency of the quadrature scheme

$$\int_{-\pi}^{\pi} h_{a,b,c}(\varphi)\, d\varphi \approx \sum_i w'_i\, h_{a,b,c}(\varphi_i)$$

where $h_{a,b,c}(\varphi) = \mathrm{ReLU}[\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi)] \cos(kc + 2\varphi)$. The absolute error is given by

$$\varepsilon_R := \Big| \int_{-\pi}^{\pi} h_{a,b,c}(\varphi)\, d\varphi - \sum_i w'_i\, h_{a,b,c}(\varphi_i) \Big| \quad (7)$$

and we can compute relative error by computing (the average value over $a, b, c$ of)

$$\varepsilon_0 := \Big| \int_{-\pi}^{\pi} h_{a,b,c}(\varphi)\, d\varphi \Big| \quad (8)$$

and then dividing Equation 7 by Equation 8 to give the relative error $\varepsilon_r := \varepsilon_R / \varepsilon_0$.

Following Gross et al. (2024), if we can compute a non-vacuous error bound (i.e. a relative error that is strictly less than 1) in time linear in the number of parameters of the computation, we can be assured that our interpretation has some validity. Since there are $d_{\mathrm{mlp}}$ neurons and $p$ points to evaluate, our target is to compute an error bound in time $O(d_{\mathrm{mlp}} + p)$. Thus, this part of the computation is sublinear in the number of parameters, as desired.

Recall the neuron contributions to the logits from Equation 5:

$$\mathrm{ReLU}\big[\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi)\big] \cos(kc + 2\varphi)$$

We can split ReLU as $\mathrm{ReLU}(x) = (x + |x|)/2$. We shall see below that the $x/2$ part integrates to 0. As a result, a relative error bound would not give a meaningful result for this part. However, this part of the network is linear; thus, we can still effectively compress the network behaviour here. For example, we can compute a matrix $A \in \mathbb{R}^{p \times p}$ such that the logit contribution of this part is $A[:, a] + A[:, b]$ for inputs $a$ and $b$.
Thus, in the section below, we can restrict our attention to the absolute value part, $|x|/2$. This turns $\mathrm{ReLU}[\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi)] \cos(kc + 2\varphi)$ into

$$\underbrace{\tfrac{1}{2} \big|\sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi)\big| \cos(kc + 2\varphi)}_{h_{a+b,c,\sigma_k}} + \tfrac{1}{2} \sigma_k \cos(\tfrac{k}{2}(a+b) + \varphi) \cos(kc + 2\varphi)$$

We sort the phases $\varphi_i$ for each neuron, turn the weights $w_i$ into widths of rectangles, and turn the function calls $h(\varphi_i)$ into the heights of the rectangles corresponding to the function evaluated at $\varphi_i$. This gives a picture similar to Figure 2. The absolute error is

$$\varepsilon_h := \Big| \int_{-\pi}^{\pi} \big(h(x) - h(\varphi_{i(x)})\big)\, dx \Big| \quad (9)$$

where $\varphi_{i(x)}$ is the phase of the rectangle containing $x$. A crude bound is:

$$\leq \int_{-\pi}^{\pi} \big|h(x) - h(\varphi_{i(x)})\big|\, dx \quad (10)$$
$$\leq \int_{-\pi}^{\pi} \big|x - \varphi_{i(x)}\big| \cdot \sup_x |h'(x)|\, dx \quad (11)$$
$$\leq \sup_x |h'(x)| \sum_i \int_{v_{i-1} - \varphi_i}^{v_i - \varphi_i} |x|\, dx \quad (12)$$
$$= \sup_x |h'(x)| \sum_i \begin{cases} \tfrac{1}{2}\big((v_i - \varphi_i)^2 + (v_{i-1} - \varphi_i)^2\big) & \text{if } \varphi_i \in [v_{i-1}, v_i] \\ \tfrac{1}{2}\big|(v_i - \varphi_i)^2 - (v_{i-1} - \varphi_i)^2\big| & \text{otherwise} \end{cases} \quad (13)$$

where the rectangle extends from $v_{i-1}$ to $v_i$ and the function is evaluated at $\varphi_i$. For $h = h_{a+b,c,\sigma_k}$, we can bound $\sup_x |h'(x)| \leq 2$ analytically, and the remaining sum can be evaluated easily in $O(d_{\mathrm{mlp}})$ time; call this $\varepsilon_R^\approx$. Notice that this upper bound has no dependency on $h$, and is therefore a valid upper bound for all possible $a, b, c$ triplets. In other words, we can bound the error of the quadrature approximation for any $a, b, c$ by evaluating an expression that takes time linear in the number of neurons. This bound is represented visually in Figure 5. Note that the figure is produced by analysing the trained model weights.

[Figure 5: We plot the error bound $\pm 2(\varphi - \varphi_i)$, depicted in red, for frequency $k = 12$. We observe that the red area includes both the actual curve and the numerical integration approximation. We use $\pm 2$ as a bound on $h'(\varphi)$ because the Lipschitz constant of $h$ is 2.]
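The linear-time bound of Equation 13 can be sketched in a few lines (random phases stand in for trained-model weights, which is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
phis = np.sort(rng.uniform(-np.pi, np.pi, size=128))  # stand-in neuron phases

# Rectangle edges v_i: midpoints between consecutive sorted phases, with the
# first and last edges pinned to cover all of [-pi, pi]. Each phase then lies
# inside its own rectangle, so the first case of Equation 13 applies.
edges = np.concatenate([[-np.pi], (phis[:-1] + phis[1:]) / 2, [np.pi]])

lipschitz = 2.0  # sup |h'| <= 2, bounded analytically in the text
bound = 0.0
for phi, lo, hi in zip(phis, edges[:-1], edges[1:]):
    # integral of |x - phi| over [lo, hi] when phi is inside the interval
    bound += lipschitz * 0.5 * ((hi - phi) ** 2 + (phi - lo) ** 2)
print(bound)
```

The loop is a single pass over the neurons, so the bound costs O(d_mlp) work and never evaluates the MLP on any of the p^2 inputs.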
We must also include the approximation error from $\psi_i \approx 2\varphi_i$, which we call the "angle approximation error" $\varepsilon_\varphi$: we have

$$\sum_i w'_i\, \big|\cos(k_i(a+b)/2 - \varphi_i)\big| \cdot \big|\cos(kc + \psi_i) - \cos(kc + 2\varphi_i)\big| \leq \sum_i w'_i\, \big|\cos(k_i(a+b)/2 - \varphi_i)\big| \cdot |\psi_i - 2\varphi_i| \leq \sum_i w'_i\, |\psi_i - 2\varphi_i|.$$

5.2 EMPIRICAL VALIDATION

[Figure 6: Numerical integration $\int_{-\pi}^{\pi} h(\varphi)\, d\varphi$ with boxes centered at $\varphi_i$ for frequency $k = 12$. Shading represents overlap of boxes.]

We now provide empirical validation of our interpretation.

The network is well-approximated as doing numerical integration. We can compute the actual error $\varepsilon_R / \varepsilon_0$ empirically, by evaluating the expression over all possible inputs, giving numbers between 0.03 and 0.05; see Table 1.

The interpretation gives useful compression because the integral explanation gives compact non-vacuous bounds. Our relative error bounds range from 0.48 to 0.70 (i.e. less than 1); see Table 1.

The bounds are far from the actual error because we do not completely understand the numerical integration. Intuitively, we would like to center each box at its corresponding $\varphi_i$ and compute the error that results from having box density above or below 1 at various points of the curve (Figure 6). We would also like to take into account the fact that if one region has a box density above 1 and is adjacent to a region with box density below 1, these density errors partially cancel out. However, we do not know how to efficiently compute these effects, and the bounds we are able to give using non-uniform box densities instead of non-uniform angle locations are too crude.

Table 1: Relative error bounds by splitting ReLU into absolute value and identity components.

Frequency $k$                                                              12      18      21      22
$\varepsilon_R/\varepsilon_0$ (actual error computed by brute force)       0.05    0.03    0.05    0.03
$(\varepsilon_R^\approx + \varepsilon_\varphi)/\varepsilon_0$ (bound computed in linear time)  0.70    0.49    0.54    0.48

[Figure 7: We plot the input phase shifts of the largest two frequencies for each neuron. Not only are the secondary frequencies approximately double the primary frequencies; the input phase shifts of the secondary frequencies are approximately twice the primary-frequency phase shifts plus $\pi$. Clusters: 1st: 12, 2nd: 24 (116 neurons); 1st: 18, 2nd: 36 (110 neurons); 1st: 21, 2nd: 42 (138 neurons); 1st: 22, 2nd: 44 (148 neurons); in each, $\varphi_2 \cong 2\varphi_1 + \pi \pmod{2\pi}$ with $R^2 \geq 0.993$.]

6 THE ROLE OF SECONDARY FREQUENCIES

6.1 REGRESSING MODEL LOGITS VERSUS "CLOCK" AND "PIZZA" LOGITS

As noted above, Nanda et al. (2023) claim that logits are of the form ("clock logits")

$$\mathrm{logit}(a,b,c) \propto \cos(k(a+b-c))$$

while Zhong et al. (2023) suggest that logits can also be of the form ("pizza logits")

$$\mathrm{logit}(a,b,c) \propto |\cos(k(a-b)/2)| \cos(k(a+b-c))$$

Interestingly, regressing the logits against the factors $|\cos(k(a-b)/2)| \cos(k(a+b-c))$ gives an $R^2$ of 0.86, while regressing them against just $\cos(k(a+b-c))$ gives an $R^2$ of 0.98, substantially higher. So overall, the "clock logits" give a more accurate expression, suggesting that the analysis above is importantly incomplete. Interestingly, if we consider only the contribution to the logits from the absolute value component of ReLU (Appendix I), the $R^2$ values become 0.99 and 0.85 respectively. This suggests that the absolute value component of ReLU indeed carries out the "pizza" algorithm and produces those logits. As we will demonstrate below, the discrepancy for the overall logits is due to the effects of the substantially smaller non-primary frequencies. In particular, it is explained by the action of the "secondary frequency", the second largest Fourier component.
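The two closed forms being regressed against can be compared directly; a hypothetical single-frequency sketch (real models sum over several key frequencies):

```python
import numpy as np

p, k = 59, 12
w = 2 * np.pi / p     # residues map to angles as multiples of 2*pi/p
a, b = 5, 9
c = np.arange(p)

clock = np.cos(k * w * (a + b - c))
pizza = np.abs(np.cos(k * w * (a - b) / 2)) * np.cos(k * w * (a + b - c))

# Both forms peak at the correct answer c = (a + b) mod p; they differ only
# in the input-dependent amplitude |cos(k(a-b)/2)|, which can approach zero.
assert clock.argmax() == (a + b) % p
assert pizza.argmax() == (a + b) % p
```

When $|\cos(k(a-b)/2)| \approx 0$ the pizza amplitude collapses, which is exactly the failure mode the secondary frequencies (next subsection) compensate for.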
6.2 USING SECONDARY FREQUENCIES TO BETTER APPROXIMATE CLOCK LOGITS

For each of the neurons, the largest secondary frequency is almost always twice the primary frequency. For example, for neurons of frequency 12, the largest secondary frequency is 24, while for neurons of frequency 22, the largest secondary frequency is 15 ($= 59 - 22 \cdot 2$; note that cosine is symmetric about 0). Note that the input phase shift of the secondary frequency is approximately twice the input phase shift of the primary frequency plus $\pi$ (Figure 7).

The contribution of the doubled secondary frequency to the logits can thus be written as (compare to Equation 5; note we lose the $\tfrac{1}{2}$ factor in the pre-ReLU expression because the secondary frequency is double the primary frequency)

$$\mathrm{logit}^{(2k)}(a,b,c) = \sum_{i \in I_k} \cos(kc + 2\varphi_i)\, \mathrm{ReLU}\big[\cos(k(a-b)) \cos(k(a+b) + 2\varphi_i + \pi)\big] \approx \int_{-\pi}^{\pi} \cos(kc + 2\varphi)\, \mathrm{ReLU}\big[-\cos(k(a-b)) \cos(k(a+b) + 2\varphi)\big]\, d\varphi$$

Letting $\theta = \varphi + k(a+b)/2$, we get

$$= \int_{k(a+b)/2 - \pi}^{k(a+b)/2 + \pi} \cos\big(k(c - (a+b)) + 2\theta\big)\, \mathrm{ReLU}\big[-\cos(k(a-b)) \cos(2\theta)\big]\, d\theta$$

Because the integrand is $2\pi$-periodic, the shift in the limits of integration is irrelevant:

$$= \int_{-\pi}^{\pi} \cos\big(k(c - (a+b)) + 2\theta\big)\, \mathrm{ReLU}\big[-\cos(k(a-b)) \cos(2\theta)\big]\, d\theta$$

By the cosine addition law:

$$= \int_{-\pi}^{\pi} \Big(\cos\big(k(c - (a+b))\big) \cos(2\theta) - \sin\big(k(c - (a+b))\big) \sin(2\theta)\Big) \cdot \mathrm{ReLU}\big[-\cos(k(a-b)) \cos(2\theta)\big]\, d\theta$$

Because $\sin$ is odd and $\mathrm{ReLU}[\cos]$ is even, the $\sin$ term integrates to 0:

$$= \cos\big(k(c - (a+b))\big) \int_{-\pi}^{\pi} \cos(2\theta)\, \mathrm{ReLU}\big[-\cos(k(a-b)) \cos(2\theta)\big]\, d\theta = -\tfrac{\pi}{2} \cos(k(a-b)) \cos(k(a+b-c))$$

This logit contribution helps make the model more robust. Note that in the expression for "pizza logits" (Equation 6), the model works by having $\cos(k(a+b-c))$ largest when $c = a + b \pmod p$, and choosing the largest logit as output. However, this logit difference has a factor of $|\cos(\tfrac{k}{2}(a-b))|$. As a result, when $|\cos(\tfrac{k}{2}(a-b))| \approx 0$, the logit difference is small and may not distinguish the correct output.
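The closed form derived for the secondary-frequency contribution can be checked numerically. A sketch (plain Riemann sum, not the paper's code), writing $u = k(a+b)$, $v = kc$, and $C = \cos(k(a-b))$ so the claimed value is $-\tfrac{\pi}{2}\, C \cos(u - v)$:

```python
import numpy as np

def secondary(u, v, C, n=200_000):
    """Riemann-sum approximation of
    int_{-pi}^{pi} cos(v + 2*phi) * ReLU[-C * cos(u + 2*phi)] dphi."""
    phi = np.linspace(-np.pi, np.pi, n, endpoint=False)
    vals = np.cos(v + 2 * phi) * np.maximum(-C * np.cos(u + 2 * phi), 0.0)
    return vals.sum() * (2 * np.pi / n)

# Agrees with -(pi/2) * C * cos(u - v) for either sign of C.
for u, v, C in [(0.4, 1.3, -0.9), (2.2, -0.5, 0.7), (1.0, 1.0, -1.0)]:
    assert abs(secondary(u, v, C) - (-(np.pi / 2) * C * np.cos(u - v))) < 1e-4
```

The sign of the contribution flips with $C = \cos(k(a-b))$, which is what lets it boost the correct logit precisely when the pizza amplitude collapses.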
However, from the above expression, and since $\cos(2x) = 2\cos^2(x) - 1$, we have that when $|\cos(\tfrac{k}{2}(a-b))| \approx 0$, $\cos(k(a-b)) \approx -1$, so this term contributes positively to the correct logit $\cos(k(a+b-c))$. This compensates for the weakness of the pizza logits in cases where $|\cos(\tfrac{k}{2}(a-b))| \approx 0$.

7 DISCUSSION

We provide a first case study in rigorously compressing nonlinear feature-maps. We demonstrate that interpreting feature-maps reveals additional insight about the model mechanism, even in models that the research community assumes it understands quite well. Our hope is that this work will inspire additional study of feature-maps in mechanistic interpretability.

The key steps in our derivation of efficient bounds were: splitting the input into orthogonal output-relevant and output-irrelevant directions; decomposing the pre-activations as a product of functions of these axes; reindexing the neurons by the output-relevant direction so that the reindexed post-activations depend only on the output-irrelevant directions; and compressing independently over the output-relevant direction term and the output-irrelevant direction term. We believe that some of these steps could be adapted to interpret other MLPs, even where we cannot derive closed-form analytic representations. The bounds we derive in Appendix H could also be improved to understand how the network is allocating boxes under the curve, which we hope to address in future work.

ACKNOWLEDGMENTS

We are thankful to Euan Ong, Alex Gibson, and Soufiane Noubir for helpful discussions throughout. Feedback from Wilson Wu, Louis Jaburi, and Jacob Drori's work inspired more clarity in our change-of-variables methodology for computing the integral in Appendix F. We thank the following organizations for their support: the Mentorship for Alignment Research Students (MARS) program for setting up the collaboration between the authors and providing funding for compute and in-person research sprints, and Constellation and FAR Labs for providing an excellent research environment.
Under review

REFERENCES

Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, and Connor Leahy. Interpreting neural networks through the polytope lens, 2022.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.

Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, and Max Tegmark.
Not all language model features are linear, 2024. URL https://arxiv.org/abs/2405.14860.

Andrey Gromov. Grokking modular arithmetic, 2023. URL https://arxiv.org/abs/2301.02679.

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, and Lawrence Chan. Compact proofs of model performance via mechanistic interpretability, June 2024. URL https://arxiv.org/abs/2406.11779.

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024. URL https://arxiv.org/abs/2403.19647.

Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint, 2023. doi: 10.48550/arXiv.2301.05217. URL https://arxiv.org/abs/2301.05217.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.

Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks, 2023. URL https://arxiv.org/abs/2306.17844.

A MODEL TRAINING DETAILS

We train 1-layer transformers with constant attention = 1/2 (equivalently, with QK clamped to 0), as implemented by TransformerLens (Nanda & Bloom, 2022), with the following parameters:

d_vocab = p + 1 = 59 + 1
n_ctx = 2 + 1
d_model = 128
d_mlp = 512
d_head = 32
n_heads = 4
n_layers = 1
act_fn = 'relu'

We take 80% of all (a, b) pairs mod p as the training set, and the rest as the validation set. We set the loss function to be the mean log probabilities of the correct logit positions, and train for 10000 epochs using the AdamW optimizer with the default weight_decay = 0.01.
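The setup above can be summarised as plain data; in this sketch the dict keys mirror TransformerLens's HookedTransformerConfig field names, but this is an illustration, not the authors' training script.

```python
# A sketch of the Appendix A setup as plain data; the dict keys mirror
# TransformerLens's HookedTransformerConfig field names (illustrative only).
p = 59  # modulus for the modular-addition task

config = {
    "d_vocab": p + 1,  # p residue tokens plus one extra (e.g. '=') token
    "n_ctx": 2 + 1,    # two operand tokens plus the final position
    "d_model": 128,
    "d_mlp": 512,
    "d_head": 32,
    "n_heads": 4,
    "n_layers": 1,
    "act_fn": "relu",
}

# 80% of the p*p (a, b) pairs form the training set, the rest validation.
n_train = int(0.8 * p * p)
print(config["d_vocab"], config["n_ctx"], n_train)
```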
The large number of epochs is such that the resulting model achieves 100% accuracy on all possible input pairs. We chose 151 random seeds: 0, along with 150 values read from /dev/urandom. Each training run takes approximately 11–12 GPU-minutes to complete on an NVIDIA GeForce RTX 3090. In total, the experiments in this paper took less than 1000 GPU-hours.

B MORE FIGURES AND RESULTS FOR MAINLINE MODEL

In this section we display variants of Figure 2, Figure 3, Figure 5, and Figure 7 for the other frequencies, as well as providing additional visualizations and supporting evidence in the form of Figure 8 and Figure 11.

Figure 8: The MLP computes logits by approximate numerical integration.

Figure 9: We plot the input phase shifts of primary and tertiary frequencies (best fit: φ_3 ≅ 3φ_1 + π (mod 2π), R² = 0.9936).
We observe that for the one secondary frequency (36 = 2·18) which is also a multiple of another primary frequency (3·12 = 36), the input phase shift of the tertiary frequency is approximately thrice the primary frequency phase shift plus π. Linear regression is computed for cases where there are at least 10 neurons with the same primary and tertiary frequencies, and the best-fit line is displayed in cases where R² > 0.9.

Figure 10: We plot the input phase shifts of primary and quaternary frequencies. We observe that there is no pattern. Linear regression is computed for cases where there are at least 10 neurons, and the R² is included in the legend.
Figure 11: The key frequencies for this model are 12, 18, 21, and 22 (multiplied by 2π/p).

Figure 12: Angles for frequency k = 12. ψ_i ≈ 2φ_i (mod 2π) for the primary frequency of each neuron, but not in general.

Figure 13: Angles for frequency k = 18. ψ_i ≈ 2φ_i (mod 2π) for the primary frequency of each neuron, but not in general.

Figure 14: Angles for frequency k = 21. ψ_i ≈ 2φ_i (mod 2π) for the primary frequency of each neuron, but not in general.

Figure 15: Angles for frequency k = 22. ψ_i ≈ 2φ_i (mod 2π) for the primary frequency of each neuron, but not in general.

Figure 16: Converting the weighted sum into rectangles to estimate the integral ∫_{−π}^{π} h(φ) dφ (for frequency k = 12), h(φ) = |cos(φ)| cos(2φ).

Figure 17: Error bound is the red area (for frequency k = 12), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.

Figure 18: Converting the weighted sum into rectangles to estimate the integral ∫_{0}^{π} h(φ) dφ (for frequency k = 12), h(φ) = |cos(φ)| cos(2φ).

Figure 19: Error bound is the red area (for frequency k = 12), h(φ) = |cos(φ)| cos(2φ).
Note how the red area includes both the actual curve and the numerical integration approximation.

Figure 20: Converting the weighted sum into rectangles to estimate the integral ∫_{−π}^{π} h(φ) dφ (for frequency k = 18), h(φ) = |cos(φ)| cos(2φ).

Figure 21: Error bound is the red area (for frequency k = 18), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.

Figure 22: Converting the weighted sum into rectangles to estimate the integral ∫_{0}^{π} h(φ) dφ (for frequency k = 18), h(φ) = |cos(φ)| cos(2φ).

Figure 23: Error bound is the red area (for frequency k = 18), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.

Figure 24: Converting the weighted sum into rectangles to estimate the integral ∫_{−π}^{π} h(φ) dφ (for frequency k = 21), h(φ) = |cos(φ)| cos(2φ).

Figure 25: Error bound is the red area (for frequency k = 21), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.

Figure 26: Converting the weighted sum into rectangles to estimate the integral ∫_{0}^{π} h(φ) dφ (for frequency k = 21), h(φ) = |cos(φ)| cos(2φ).

Figure 27: Error bound is the red area (for frequency k = 21), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.
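The rectangle construction in these figures can be reproduced numerically. Below is a minimal pure-Python sketch; the uniform box placement and box count are our illustrative choices, whereas the trained model's boxes come from its neuron phase shifts.

```python
import math

def h(phi):
    # The integrand from the figures above: h(φ) = |cos φ| · cos 2φ.
    return abs(math.cos(phi)) * math.cos(2 * phi)

n = 512  # number of boxes (illustrative; one per MLP neuron in spirit)
width = 2 * math.pi / n
# Midpoint rectangle rule over [-π, π].
approx = sum(h(-math.pi + (i + 0.5) * width) * width for i in range(n))

exact = 4 / 3  # closed form of ∫_{-π}^{π} |cos φ| cos 2φ dφ
print(abs(approx - exact))
```

With a few hundred boxes the rectangle sum already agrees with the closed-form integral to well under 0.001.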
Figure 28: Converting the weighted sum into rectangles to estimate the integral ∫_{−π}^{π} h(φ) dφ (for frequency k = 22), h(φ) = |cos(φ)| cos(2φ).

Figure 29: Error bound is the red area (for frequency k = 22), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.

Figure 30: Converting the weighted sum into rectangles to estimate the integral ∫_{0}^{π} h(φ) dφ (for frequency k = 22), h(φ) = |cos(φ)| cos(2φ).

Figure 31: Error bound is the red area (for frequency k = 22), h(φ) = |cos(φ)| cos(2φ). Note how the red area includes both the actual curve and the numerical integration approximation.

C RESULTS FOR OTHER RANDOM SEEDS

To ensure that this phenomenon is not specific to the model we looked at, we trained 151 models with the same setup and different random seeds (leading to different weight initializations and train–test splits). We replicate some of the analysis above to show that a significant proportion of these models follow the above explanation.

For our second observation, we notice that most neurons are single-frequency, with the largest frequency explaining more than 90% of the variance of the ReLU input; see Figure 35. The same is true for the W_out matrix; see Figure 36.

For our third observation, we check that (disregarding neurons where the largest frequency is the constant term) 100 out of 151 models ('good models') have the frequencies matching for all neurons, with a further 27 models where the frequencies fail to match for fewer than 10 (out of 512) neurons.

For our fourth observation, we look at the 100 good models and see that for 78 of them, we have R² > 0.9 between the angles for all key frequencies.
For our fifth observation, we look at the 100 good models and see that for 54 of them, the mean width of the intervals exceeds the standard deviation of the widths, showing that the angles are roughly uniformly distributed.

Finally, we carry out the error bound calculations for the 100 good models. For each (model, freq) pair, we can calculate the error bounds as in Table 1. We see that for 295 out of 358 such pairs (82%), the empirical errors (normalised abs cos error, normalised abs sin error, normalised id cos error, normalised id sin error) are all less than 0.1. For these pairs, the normalised abs integral bound has a median of 0.40, which is 47% of the naive abs bound of 0.85. Also, 99% of the normalised abs integral bounds are less than the naive abs bound (see Figure 37). Hence, the numerical integration phenomenon explained above indeed appears in most trained models, and our method of proof is able to produce a non-vacuous error bound whenever this phenomenon occurs.

For the relationship observed between primary and secondary frequencies, we see that it still holds when we use different random seeds. In particular, around 84% of the neurons have their second largest non-zero Fourier component at 2 times the primary frequency (see Figure 38). Also, amongst these neurons, the angle for the secondary frequency is approximately 2 times the angle for the primary frequency plus π, with R² > 0.9 for most (model, freq) pairs.

Figure 32: Most models have 3 to 5 key frequencies, which is in line with what we would expect to make the above argument work.

Figure 33: Key frequencies for the models are roughly uniformly distributed across all possible values.
Figure 34: The residual connection from the attention stream contributes less than 3% of the variance of the logits for most input pairs.

Figure 35: For most neurons, above 90% of the variance of the ReLU input is explained by the largest Fourier frequency component.

Figure 36: For most neurons in 'good' models, above 98% of the variance of the W_out matrix is explained by the largest Fourier frequency component.

Figure 37: Most of the pairs have normalised abs integral bounds less than 0.6, compared to the naive abs error bound of 0.85.

Figure 38: The second largest non-zero Fourier component is 2 times the primary frequency for 83% of the neurons.

Figure 39: Most of the (model, freq) pairs have the angle for the secondary frequency closely resembling 2 times the primary frequency plus π.
D TRIGONOMETRIC IDENTITIES

We use the following trigonometric identities throughout the paper:

sin α + sin β = 2 sin((α+β)/2) cos((α−β)/2)   (14)
cos α + cos β = 2 cos((α+β)/2) cos((α−β)/2)   (15)
cos α cos β = (cos(α+β) + cos(α−β)) / 2   (16)
cos(α+β) = cos α cos β − sin α sin β   (17)
cos(α−β) = cos α cos β + sin α sin β   (18)
cos(α+π) = −cos α   (19)

E DERIVATION OF TRIG INTEGRAL

Instead of splitting the summation into two chunks, converting each into an integral, and evaluating each integral, we can replace the summand with an integral directly and evaluate that. Also, we use the corrected phase shift ψ_i = 2φ_i:

logit(a,b,c) ≈ Σ_{k∈K} Σ_{i∈I_k} w_i ReLU(σ_k cos(k(a+b)/2 + φ_i)) cos(kc + 2φ_i)
            ≈ Σ_{k∈K} Z_k ∫_{−π}^{π} ReLU(σ_k cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ

Using F± to denote the integral

F± = ∫_{−π}^{π} ReLU(±cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ,

let s = k(a+b)/2 and t = kc. Using the periodicity of cosine (Equation 19), we have

F− = ∫_{−π}^{π} ReLU(−cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ
   = ∫_{−π}^{π} ReLU(cos(k(a+b)/2 + φ + π)) cos(kc + 2φ + 2π) dφ
   = ∫_{0}^{2π} ReLU(cos(k(a+b)/2 + φ′)) cos(kc + 2φ′) dφ′
   = ∫_{−π}^{π} ReLU(cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ
   = F+.

So we may write F := F+ = F−. Note that the integrand is non-zero only when φ ∈ [−π/2 − s, π/2 − s]. Applying the cosine product-sum identity (Equation 16) and doing some algebra:

F = ∫_{−π/2−s}^{π/2−s} cos(s+φ) cos(t+2φ) dφ
  = (1/2) ∫_{−π/2−s}^{π/2−s} (cos(s−t−φ) + cos(s+t+3φ)) dφ
  = (1/2) [sin(φ−s+t) + (1/3) sin(s+t+3φ)]_{−π/2−s}^{π/2−s}
  = (1/2) ((sin(π/2−2s+t) + (1/3) sin(3π/2−2s+t)) − (sin(−π/2−2s+t) + (1/3) sin(−3π/2−2s+t)))

Using the periodicity of sine and cosine, we have

F = (1/3) sin(π/2−2s+t) + (1/3) sin(π/2+2s−t)
  = (1/3) cos(2s−t) + (1/3) cos(−2s+t)
  = (2/3) cos(2s−t)
  = (2/3) cos(k(a+b−c)),

as desired.
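The closed form just derived can be spot-checked numerically. A minimal sketch in pure Python; the frequency, inputs, and grid size are our illustrative choices, not values from the paper's experiments:

```python
import math

def relu(x):
    return max(x, 0.0)

def quad_lhs(k, a, b, c, n=20000):
    # Midpoint rule for ∫_{-π}^{π} ReLU(cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ.
    width = 2 * math.pi / n
    total = 0.0
    for i in range(n):
        phi = -math.pi + (i + 0.5) * width
        total += relu(math.cos(k * (a + b) / 2 + phi)) * math.cos(k * c + 2 * phi) * width
    return total

p = 59
k = 2 * math.pi * 12 / p   # key frequency 12, expressed as an angle per token
a, b, c = 17.0, 30.0, 5.0  # arbitrary inputs
rhs = (2 / 3) * math.cos(k * (a + b - c))
print(abs(quad_lhs(k, a, b, c) - rhs))
```

Any real k, a, b, c give agreement up to quadrature error, matching F = (2/3) cos(k(a+b−c)).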
That is, the pizza model computes its logits using the trigonometric integral identity:

∫_{−π}^{π} ReLU(cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ = (2/3) cos(k(a+b−c))

Or equivalently:

cos(k(a+b−c)) = (3/2) ∫_{−π}^{π} ReLU(cos(k(a+b)/2 + φ)) cos(kc + 2φ) dφ

F DERIVATION OF TRIG INTEGRAL INCLUDING SECONDARY FREQUENCIES

Let

s_k = k(a+b)/2,   u_k = k(a−b)/2,   v_k = cos(u_k),   t_k = kc,

and let β_i be the coefficient of the secondary frequency term and γ_i the constant term. Consider a network of the form

logit(a,b,c) ≈ Σ_{k∈K} Σ_{i∈I_k} w_i f(ka + φ_i, kb + φ_i) cos(kc + 2φ_i)   (20)

for some symmetric, even, 2π-periodic function f; that is, f(x,y) = f(y,x) = f(−x,y) = f(x,−y) = f(x+2π, y) = f(x, y+2π). Then we have

logit(a,b,c) ≈ Σ_{k∈K} Σ_{i∈I_k} w_i f(ka + φ_i, kb + φ_i) cos(kc + 2φ_i)
            ≈ Σ_{k∈K} Z_k ∫_{−π}^{π} f(ka + φ, kb + φ) cos(kc + 2φ) dφ

Reindexing with θ = φ + k(a+b)/2 = φ + s_k:

≈ Σ_{k∈K} Z_k ∫_{k(a+b)/2−π}^{k(a+b)/2+π} f(k(a−b)/2 + θ, −k(a−b)/2 + θ) cos(k(c−(a+b)) + 2θ) dθ
= Σ_{k∈K} Z_k ∫_{s_k−π}^{s_k+π} f(u_k + θ, −u_k + θ) cos(t_k − 2s_k + 2θ) dθ

Using the fact that the integrand is 2π-periodic, and hence that we can arbitrarily shift the limits of integration:

= Σ_{k∈K} Z_k ∫_{−π}^{π} f(u_k + θ, −u_k + θ) cos(t_k − 2s_k + 2θ) dθ

Define g_k(θ) = f(u_k + θ, −u_k + θ), note that g_k is even¹, and use the cosine addition formula (Equation 17) to get:

= Σ_{k∈K} Z_k ∫_{−π}^{π} g_k(θ) (cos(t_k−2s_k) cos(2θ) − sin(t_k−2s_k) sin(2θ)) dθ
= Σ_{k∈K} (Z_k cos(t_k−2s_k) ∫_{−π}^{π} g_k(θ) cos(2θ) dθ − Z_k sin(t_k−2s_k) ∫_{−π}^{π} g_k(θ) sin(2θ) dθ)

Since g_k is even and sin is odd, the second integral evaluates to zero, giving

= Σ_{k∈K} Z_k cos(t_k−2s_k) ∫_{−π}^{π} g_k(θ) cos(2θ) dθ

This expression is maximized at t_k = 2s_k whenever g_k is non-negative.
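The even/odd argument above can be illustrated numerically: for an even g, the sin(2θ) integral vanishes while the cos(2θ) integral survives. A minimal sketch; the choice g(θ) = ReLU(v cos θ) with v = 0.8 is our illustrative stand-in for g_k:

```python
import math

def g(theta, v=0.8):
    # g_k(θ) = ReLU(v_k cos θ) is even in θ; v = 0.8 stands in for v_k = cos(u_k).
    return max(v * math.cos(theta), 0.0)

n = 20000
width = 2 * math.pi / n
grid = [-math.pi + (i + 0.5) * width for i in range(n)]

sin_int = sum(g(t) * math.sin(2 * t) * width for t in grid)  # vanishes: even * odd
cos_int = sum(g(t) * math.cos(2 * t) * width for t in grid)  # survives: 0.8 * 2/3 = 8/15

print(sin_int, cos_int)
```

The surviving cos(2θ) integral is exactly the coefficient that multiplies cos(t_k − 2s_k) in the derivation.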
¹ Because g_k(−θ) = f(u_k − θ, −u_k − θ) = f(−(−u_k + θ), −(u_k + θ)) = f(−u_k + θ, u_k + θ) = f(u_k + θ, −u_k + θ) = g_k(θ).

In the particular case of secondary frequencies which are double the primary frequencies, with phases also double the primary phases, we have:

logit(a,b,c) ≈ Σ_{k∈K} Σ_{i∈I_k} w_i ReLU(cos(k(a−b)/2) cos(k(a+b)/2 + φ_i) + β_i cos(k(a−b)) cos(k(a+b) + 2φ_i) + γ_i) cos(kc + 2φ_i)
≈ Σ_{k∈K} Z_k ∫_{−π}^{π} ReLU(cos(k(a−b)/2) cos(k(a+b)/2 + φ) + β̄_k cos(k(a−b)) cos(k(a+b) + 2φ) + γ̄_k) cos(kc + 2φ) dφ
= Σ_{k∈K} Z_k ∫_{−π}^{π} ReLU(cos(u_k) cos(s_k + φ) + β̄_k cos(2u_k) cos(2s_k + 2φ) + γ̄_k) cos(t_k + 2φ) dφ
= Σ_{k∈K} Z_k ∫_{−π}^{π} ReLU(v_k cos(s_k + φ) + β̄_k (2v_k² − 1) cos(2s_k + 2φ) + γ̄_k) cos(t_k + 2φ) dφ

Reindexing with θ = φ + s_k:

= Σ_{k∈K} Z_k ∫_{s_k−π}^{s_k+π} ReLU(v_k cos(θ) + β̄_k (2v_k² − 1) cos(2θ) + γ̄_k) cos(t_k − 2s_k + 2θ) dθ

Using the fact that the integrand is 2π-periodic:

= Σ_{k∈K} Z_k ∫_{−π}^{π} ReLU(v_k cos(θ) + β̄_k (2v_k² − 1) cos(2θ) + γ̄_k) cos(t_k − 2s_k + 2θ) dθ

Define

f_k(θ) = ReLU(v_k cos(θ) + β̄_k (2v_k² − 1) cos(2θ) + γ̄_k)

and use the cosine addition formula to get:

= Σ_{k∈K} Z_k ∫_{−π}^{π} f_k(θ) (cos(t_k−2s_k) cos(2θ) − sin(t_k−2s_k) sin(2θ)) dθ
= Σ_{k∈K} (Z_k cos(t_k−2s_k) ∫_{−π}^{π} f_k(θ) cos(2θ) dθ − Z_k sin(t_k−2s_k) ∫_{−π}^{π} f_k(θ) sin(2θ) dθ)

Since f_k is even and sin is odd, the second integral evaluates to zero, giving

= Σ_{k∈K} Z_k cos(t_k−2s_k) ∫_{−π}^{π} f_k(θ) cos(2θ) dθ

G NUMERICAL INTEGRATION ERROR BOUND FOR RELU FUNCTION

Recall that in subsection 5.2 we computed the error bound for integrating the absolute value function rather than the ReLU function in the network. This is because we can break down ReLU(x) = x/2 + |x|/2, and the identity part of ReLU integrates to zero. Hence, the baseline result (of approximating the integral by zero) makes more sense when we only consider the absolute value part of ReLU. However, using the general form of our error bound (Equation 9), it is simple to replicate the analysis for the whole ReLU function.
We have the same bound sup_x |h′(x)| ≤ 2.

Error Bound Type                        k=12  k=18  k=21  k=22  Equation
Normalised ReLU cos error               0.05  0.04  0.04  0.03  (7)/(9)
Normalised ReLU sin error               0.03  0.05  0.04  0.03  (7)/(9)
Angle approximation error               0.13  0.07  0.06  0.05  (22)
Numerical ReLU ∫_{−π}^{π} bound         0.55  0.44  0.42  0.37  (23)
Numerical ReLU ∫_{0}^{π} bound          0.21  0.15  0.17  0.15  (24)
Total numerical ReLU ∫_{−π}^{π} bound   0.68  0.50  0.48  0.42  (23) + (22)
Total numerical ReLU ∫_{0}^{π} bound    0.55  0.37  0.40  0.34  2·(24) + (22)
Naive ReLU cos baseline                 0.42  0.42  0.42  0.42  (8), (21)
Naive ReLU sin baseline                 0.42  0.42  0.42  0.42  (8), (21)

Table 2: Error bounds for the ReLU function

Moreover, utilising the fact that ReLU evaluates to 0 on half the range of integration, we can reduce our error bound by only considering the largest half of the boxes (with some boundary effects). This gives us the error bounds in Table 2. Note that the baseline is halved because the identity component of ReLU yields a zero integral.

H DETAILED COMPUTATION OF NUMERICAL INTEGRATION ERROR

The maximum empirical error (obtained by evaluating the expression for each value of a+b mod p) is shown in Table 3:

Error Bound Type                        k=12  k=18  k=21  k=22  Equation
Normalised abs cos error                0.04  0.03  0.04  0.03  (7)/(9)
Normalised abs sin error                0.05  0.05  0.03  0.02  (7)/(9)
Normalised id cos error                 0.06  0.05  0.04  0.04  (7)/(9)
Normalised id sin error                 0.02  0.05  0.04  0.04  (7)/(9)
Angle approximation error               0.14  0.07  0.06  0.06  (22)
Numerical abs ∫_{−π}^{π} bound          0.59  0.52  0.50  0.44  (23)
Numerical abs ∫_{0}^{π} bound           0.23  0.17  0.20  0.17  (24)
Total numerical abs ∫_{−π}^{π} bound    0.73  0.59  0.56  0.50  (23) + (22)
Total numerical abs ∫_{0}^{π} bound     0.59  0.41  0.46  0.40  2·(24) + (22)
Naive abs cos baseline                  0.85  0.85  0.85  0.85  (8), (21)
Naive abs sin baseline                  0.85  0.85  0.85  0.85  (8), (21)

Table 3: Error bounds by splitting ReLU into absolute value and identity components

The first four rows (normalised abs and id, cos and sin errors) compute the error by brute force exactly: ∫_{−π}^{π} h(x) − h(φ_i) dx for h ∈ {h_{a+b,|C|}, h_{a+b,|S|}, h_{a+b,C}, h_{a+b,S}}.
Rows ten and eleven compute the baseline for the error ∫_{−π}^{π} h(x) dx for h ∈ {h_{a+b,|C|}, h_{a+b,|S|}}, which is given by

E_{0<a+b≤p} |(4/3) cos(2πk(a+b)/p)| ≈ E_{0<a+b≤p} |(4/3) sin(2πk(a+b)/p)|   (21)

Note that the baseline for h ∈ {h_{a+b,C}, h_{a+b,S}} is 0.

Line five (angle discrepancy) computes the error from ψ ≈ 2φ:

Σ_i w′_i |ψ_i − 2φ_i|.   (22)

Line six (numerical abs ∫_{−π}^{π} bound) computes the error bound from the integral:

min_θ Σ_i { (v_i − (φ_i−θ))² − (v_{i−1} − (φ_i−θ))²   if φ_i ∈ [v_{i−1}, v_i]
          { (v_i − (φ_i−θ))² + (v_{i−1} − (φ_i−θ))²   otherwise   (23)

Shifting by θ is permitted because our functions are periodic with period π; it does not matter where we start the integral. Line seven (numerical abs ∫_{0}^{π} bound) takes advantage of the fact that h is π-periodic to integrate from 0 to π: taking ŵ_i := w′_i/2, v̂_i := v_i/2, and φ̂_i := φ_i for φ_i ≥ 0 and φ̂_i := φ_i + π for φ_i < 0, we compute

min_θ Σ_i { (v̂_i − (φ̂_i−θ))² − (v̂_{i−1} − (φ̂_i−θ))²   if φ̂_i ∈ [v̂_{i−1}, v̂_i]
          { (v̂_i − (φ̂_i−θ))² + (v̂_{i−1} − (φ̂_i−θ))²   otherwise   (24)

Line eight (total numerical abs ∫_{−π}^{π} bound) computes the combined error bound from the integral and angle discrepancy. Line nine (total numerical abs ∫_{0}^{π} bound) takes advantage of the fact that h is π-periodic to integrate from 0 to π. This allows us to overlap the two halves of sampled points to try to reduce the error of integration. (In this way, the rectangles in the approximation are narrower and so the error is smaller.)

Lines six and seven (the numerical abs ∫ bounds) also both take advantage of the fact that the function is 2π-periodic, allowing us to shift the intervals formed above by any constant. When bounding the approximation error, we use the shift that gives the lowest bound.

The exact error is much smaller than the size of the integral, and the mathematical error bound is also smaller than the size of the integral. This gives convincing evidence that the model is indeed performing numerical integration.
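The flavour of bound computed in (23)–(24) can be illustrated with a simplified, standard Lipschitz rectangle bound on a uniform grid; the box placement here is our illustrative choice, not the trained model's neuron angles:

```python
import math

def h(phi):
    return abs(math.cos(phi)) * math.cos(2 * phi)

L = 2.0  # sup |h'| ≤ 2, the Lipschitz constant used in Appendix G
n = 64
width = 2 * math.pi / n
samples = [-math.pi + (i + 0.5) * width for i in range(n)]  # midpoint of each box

approx = sum(h(s) * width for s in samples)
exact = 4 / 3  # ∫_{-π}^{π} |cos φ| cos 2φ dφ

# Per box, |∫ h - h(midpoint)·width| ≤ L · ∫ |x - midpoint| dx = L · width² / 4.
bound = n * L * width ** 2 / 4

print(abs(approx - exact) <= bound, round(bound, 3))
```

Even this crude uniform-grid bound (about 0.31) is much smaller than the size of the integral itself (4/3); equations (23)–(24) sharpen this idea by using the model's actual box widths and an optimal shift θ.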
I ANALYSIS OF THE 'IDENTITY' COMPONENT OF RELU

We can break down ReLU into two parts, ReLU(x) = x/2 + |x|/2. The integrals then split into

∫_{−π}^{π} cos(−2φ) · (1/2)|cos(k/2 + φ)| dφ = (2/3) cos(k)
∫_{−π}^{π} sin(−2φ) · (1/2)|cos(k/2 + φ)| dφ = (2/3) sin(k)
∫_{−π}^{π} cos(−2φ) · (1/2) cos(k/2 + φ) dφ = 0
∫_{−π}^{π} sin(−2φ) · (1/2) cos(k/2 + φ) dφ = 0

We see that the 'identity' part of the ReLU yields a zero integral. So does this part of the model contribute to the logits? It turns out that the answer is yes.

To resolve this issue, we look at the discrepancy between the results suggested by previous work: Zhong et al. (2023) claim that logits are of the form

logit(a,b,c) ∝ |cos(k(a−b)/2)| cos(k(a+b−c)),

while Nanda et al. (2023) claim that logits are of the form

logit(a,b,c) ∝ cos(k(a+b−c)).

To check which is correct, we regress the logits against the factors |cos(k(a−b)/2)| cos(k(a+b−c)), which gives an R² of 0.86, while if we regress them against just cos(k(a+b−c)), we obtain an R² of 0.98. So overall, Nanda et al. (2023) give a more accurate expression, but this seems to go against the analysis we did above, which led to the expression in Zhong et al. (2023). (A similar value is obtained if we just use the MLP output and drop the residual streams.) However, if we only consider the contribution to the logits from the absolute value component of ReLU, the R² values become 0.99 and 0.85 respectively.

Therefore, although the contribution from the identity component of ReLU is small, it does make a difference towards reducing the logit dependence on a−b, in particular on |cos(k(a−b)/2)|. This is a good thing, because when cos(k(a−b)/2) is small, the logit difference between the correct logit (a+b) and the other logits will also be small, which leads to a higher loss. The identity component slightly counters this effect.
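The R² comparison described above uses ordinary least-squares regression against a single factor. The quoted values (0.86, 0.98, 0.99, 0.85) come from the trained model's logits and are not reproduced here; this sketch only shows the regression computation, applied to the two candidate closed forms themselves:

```python
import math

def r_squared(xs, ys):
    # R² of the ordinary least-squares fit ys ≈ m·xs + b (single regressor).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

p, k, c = 59, 12, 7  # modulus, one key frequency, a fixed output logit position
w = 2 * math.pi * k / p
pairs = [(a, b) for a in range(p) for b in range(p)]
zhong = [abs(math.cos(w * (a - b) / 2)) * math.cos(w * (a + b - c)) for a, b in pairs]
nanda = [math.cos(w * (a + b - c)) for a, b in pairs]

# Sanity check: a perfect affine relationship gives R² = 1.
assert abs(r_squared(nanda, [3 * x - 2 for x in nanda]) - 1.0) < 1e-9
# The two candidate shapes are correlated but not affinely identical:
print(r_squared(nanda, zhong))
```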
We can rewrite the identity component as:

W_out OV(a)/2 + W_out OV(b)/2 + W_out embed(b)/2

Thus, we can store the matrices logit_id1[:, a] = W_out OV(a)/2 and logit_id2[:, a] = W_out embed(a)/2; then we have

F_2(a,b)_c = residual stream + absolute value terms + logit_id1[c,a] + logit_id1[c,b] + logit_id2[c,b]

We carry out a 2D Fourier transform to find the decomposition of the logit_id1 and logit_id2 matrices (because W_out and OV(a) are sparse in the (1D) Fourier basis, so their product will naturally be sparse in the 2D Fourier basis). We get logit_id1[c,a] ≈ 2ℜ(Σ_k a_k e^{i(kc−2ka)}), where the frequencies k here are the same as in subsection 2.3. Hence, the output from the identity component of ReLU is (ignoring logit_id2 for now, which comes from the residual stream and is smaller):

Σ_k D_k (cos(kc−2ka) + cos(kc−2kb)) + E_k (sin(kc−2ka) + sin(kc−2kb))
= Σ_k cos(k(b−a)) (D_k cos(k(c−a−b)) + E_k sin(k(c−a−b))).

The imaginary component of the Fourier coefficients is very small, c_k ≈ 0, so the contribution is Σ_k b_k cos(k(b−a)) cos(k(a+b−c)).

Why does this happen, and why does it help explain the R² values we got above? We first list the approximate coefficients:

Frequency         12     18     21     22
abs coefs (C_k)   13.9   15.1   12.1   11.2
id coefs (D_k)    −3.7   −3.9   −3.2   −3.3

Thus, the overall expression for the logits is

F_2(a,b)_c ≈ Σ_k (C_k |cos(k(b−a)/2)| + D_k cos(k(b−a))) cos(k(a+b−c))
           = Σ_k (2D_k |cos(k(b−a)/2)|² + C_k |cos(k(b−a)/2)| − D_k) cos(k(a+b−c)),

using the double angle formula. Since D_k < 0, the Σ_k (−D_k) cos(k(a+b−c)) term gives some cushion for the base performance of the model (since, as we discussed, the cos(k(a+b−c)) term is why the model gives the highest logit when c = a+b). Moreover, the 2D_k |cos(k(b−a)/2)|² term vanishes when |cos(k(b−a)/2)| is small, so the identity contribution is largest exactly where the absolute-value part is weakest. Hence, the contribution of the identity term evens out parts of the model and improves the logit difference when |cos(k(b−a)/2)| is small (where the absolute value part does not do well).
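The double-angle rewrite above (cos x = 2cos²(x/2) − 1) can be checked numerically with the k = 12 coefficients from the table:

```python
import math

C, D = 13.9, -3.7  # abs and id coefficients for frequency 12, from the table
for x in (0.0, 0.7, 2.0, -1.3, 3.1):
    lhs = C * abs(math.cos(x / 2)) + D * math.cos(x)
    rhs = 2 * D * abs(math.cos(x / 2)) ** 2 + C * abs(math.cos(x / 2)) - D
    assert abs(lhs - rhs) < 1e-12
print("double-angle rewrite verified")
```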
Note that the model would work on its own if we used only the absolute value part; but since ReLU is composed of both the absolute value and identity parts, the coefficients combine both parts in a way that improves model performance.

J THE INFINITE-WIDTH LENS

Consider a nonlinear layer MLP(a⃗) = W_out ReLU(W_in a⃗), where the activation n_i of neuron i is given by ReLU((W_in)_i a⃗). Our goal is to find an approximation F(a⃗) of each output MLP_j(a⃗) such that we can bound |MLP_j(a⃗) − F(a⃗)| < ε without computing MLP_j(a⃗) for each a⃗.

The infinite-width lens argument is given by

MLP_j(a⃗) = Σ_i (W_out)_{ji} ReLU((W_in)_i a⃗) = Σ_i w_i f(a⃗; ξ_i) ≈ Z ∫_{ξ_0}^{ξ_n} f(a⃗; ξ) dξ = F(a⃗),

where

1. ξ_i is a "locally one-dimensional" neuron-indexed variable of integration;
2. w_i is approximately linear in (ξ_{i+1} − ξ_{i−1})/2;
3. f(a⃗; ξ_i) − f(a⃗; ξ_{i−1}) is uniformly small over all values of a⃗, i.e. f is "continuous" in ξ.

If we are able to find values satisfying the constraints, then we can analytically evaluate the integral

∫_{ξ_0}^{ξ_n} f(a⃗; ξ) dξ = F(a⃗)

independently of a⃗, allowing us to bound the error of the numerical approximation of f at each ξ_i, independent of a⃗, for example via Lipschitz constant × box size.

K FUTURE WORK

There are several hurdles to replicating this approach to interpret other neural networks. It is highly labour-intensive, and requires extensive mathematical exploration. Thus, in order for the approach to have any practical use, we need to develop automated tools to make these approximations and interpretations. For example, we may want to use the first few terms of the Fourier expansion (or other low-rank approximations) to approximate the action of various layers in a neural network, and then combine those to obtain algebraic expressions for certain neuron outputs of interest. Such algebraic expressions will naturally admit phenomena like the numerical integration we described above.
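As a concrete toy of the kind of approximation such tools would need to verify, here is the infinite-width lens in miniature; the feature f(a; ξ) = ReLU(cos(a + ξ)) and the uniform ξ grid are our illustrative choices:

```python
import math

# Toy instance of the infinite-width lens: neurons indexed by an angle ξ_i,
# each computing f(a; ξ) = ReLU(cos(a + ξ)), with width-proportional weights.
# For every input a, ∫_{-π}^{π} ReLU(cos(a + ξ)) dξ = 2, so the neuron sum
# should approximate F(a) = 2 independently of a.
n = 512
width = 2 * math.pi / n
xis = [-math.pi + (i + 0.5) * width for i in range(n)]  # condition 1: a ξ grid
w = [width] * n  # condition 2: w_i ≈ (ξ_{i+1} - ξ_{i-1}) / 2 on a uniform grid

def mlp_out(a):
    # Σ_i w_i f(a; ξ_i), the finite-width sum being compared to the integral.
    return sum(wi * max(math.cos(a + xi), 0.0) for wi, xi in zip(w, xis))

F = 2.0  # the analytic integral, independent of a
errs = [abs(mlp_out(a) - F) for a in (0.0, 0.4, 1.7, 3.0)]
print(max(errs))
```

The error is uniformly small over inputs a, which is exactly the property that lets the bound be certified without enumerating all inputs.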
This sort of method may be particularly fruitful on problems in which Fourier transforms play a large role, such as signal processing and solutions to partial differential equations.

L FUTURE TECHNICAL WORK

To complete the technical work laid out in Section 7, we must accomplish two tasks, which we discuss in this appendix: constructing a parameterisation of the MLP which is checkable in less than O(p·d_mlp) time, and more generally a parameterisation of the entire 'pizza' model that is checkable in time linear in the number of parameters; and establishing a bound on the error in the model's logits that does not neglect any terms.

L.1 LINEAR PARAMETERISATION

Constructing a parameterisation of the model which is checkable in less than O(p·d_mlp) time is a relatively straightforward task, given the interpretation in the body of the paper. We expect that the parameters are:

• A choice of n_freq frequencies k_i.
• A splitting of the neurons into groups by frequency, and an ordering of the neurons within each group.
• An assignment of widths w_i to each neuron, and an assignment of angles φ_i to each neuron.
• An assignment of orthogonal planes into which each frequency is embedded by the embedding matrix, and by the unembedding matrix.
• Rotations and scalings of the low-rank subset of the hidden model dimension for each of the O and V matrices.

L.2 BOUNDING THE ERROR OF THE MLP

To bound the error of our interpretation of the MLP precisely, we would need to include a bound on the primary frequency contribution of the identity component (which integrates to 0 symbolically), and include bounds on the residual components: OVE on x and y, the MLP bias, and the embed of y, as inputs to ReLU; and UOVE on x and y and UE on y as output logits.

We could decompose every matrix in our model as a sum of the corresponding matrix from our parameterized model and a noise term.
Expanding out the resulting expression for the logits (and expanding |x+ε| as |x| + (|x+ε| − |x|)), we will have an expression which is, at top level, a sum of our parameterized model's result and a noise term that is expressed recursively as the difference between the actual model and the parameterized model. We can then ask two questions:

1. What worst-case bounds can we prove on the error terms at various complexities?
2. What are the empirical worst-case bounds on the relevant error terms?
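As a rough illustration of why the parameterisation in L.1 is small, here is a back-of-the-envelope count; the grouping below is our reading of the bullet list (embedding planes and O/V rotations omitted), not the authors' final scheme:

```python
# Hypothetical count of the compressed MLP description from Appendix L.1,
# versus the raw dense MLP weights; embedding planes and O/V rotations omitted.
d_mlp, d_model, n_freq = 512, 128, 4

compressed = (
    n_freq      # the chosen frequencies k_i
    + d_mlp     # frequency-group label (and ordering) per neuron
    + d_mlp     # width w_i per neuron
    + d_mlp     # angle φ_i per neuron
)
raw_mlp = 2 * d_model * d_mlp  # dense W_in and W_out, ignoring biases

print(compressed, raw_mlp)
```

Under this rough accounting the compressed description is nearly two orders of magnitude smaller than the raw MLP weights, which is what makes linear-time checking plausible.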