Paper deep dive
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
Ziqian Zhong, Ziming Liu, Max Tegmark, Jacob Andreas
Models: One-layer ReLU transformer (constant attention), One-layer ReLU transformer (full attention)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 8:08:19 PM
Summary
The paper investigates how neural networks learn modular addition, identifying two distinct algorithmic strategies: the 'Clock' algorithm (previously known) and the 'Pizza' algorithm (a newly described, less intuitive procedure). The authors demonstrate that model hyperparameters and architecture choices induce sharp algorithmic phase transitions between these solutions, and that networks can even implement multiple, parallel, imperfect versions of these algorithms.
Entities (5)
Relation Signals (3)
Neural Networks → implements → Clock algorithm
confidence 95% · Some networks trained to perform modular addition implement a familiar Clock algorithm
Neural Networks → implements → Pizza Algorithm
confidence 95% · others implement a previously undescribed, less intuitive, but comprehensible procedure we term the Pizza algorithm
Hyperparameters → influences → Algorithmic phase space
confidence 90% · Small changes to model hyperparameters and initializations can induce the discovery of qualitatively different algorithms
Cypher Suggestions (2)
Find all algorithms implemented by neural networks for a specific task. · confidence 90% · unvalidated
MATCH (n:Model)-[:IMPLEMENTS]->(a:Algorithm)-[:SOLVES]->(t:Task {name: 'modular addition'}) RETURN n, a
Identify relationships between hyperparameters and algorithmic outcomes. · confidence 85% · unvalidated
MATCH (h:Hyperparameter)-[:INFLUENCES]->(p:PhaseTransition)-[:AFFECTS]->(a:Algorithm) RETURN h, p, a
Abstract
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known algorithms for solving those tasks? Several recent studies, on tasks ranging from group arithmetic to in-context linear regression, have suggested that the answer is yes. Using modular addition as a prototypical problem, we show that algorithm discovery in neural networks is sometimes more complex. Small changes to model hyperparameters and initializations can induce the discovery of qualitatively different algorithms from a fixed training set, and even parallel implementations of multiple such algorithms. Some networks trained to perform modular addition implement a familiar Clock algorithm; others implement a previously undescribed, less intuitive, but comprehensible procedure which we term the Pizza algorithm, or a variety of even more complex procedures. Our results show that even simple learning problems can admit a surprising diversity of solutions, motivating the development of new tools for characterizing the behavior of neural networks across their algorithmic phase space.
Tags
Links
- Source: https://arxiv.org/abs/2306.17844
- Canonical: https://arxiv.org/abs/2306.17844
- Code: https://github.com/fjzzq2002/pizza
Full Text
62,361 characters extracted from source content.
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
Ziqian Zhong*, Ziming Liu*, Max Tegmark, Jacob Andreas
Massachusetts Institute of Technology
{ziqianz, zmliu, tegmark, jda}@mit.edu

Abstract

Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known algorithms for solving those tasks? Several recent studies, on tasks ranging from group arithmetic to in-context linear regression, have suggested that the answer is yes. Using modular addition as a prototypical problem, we show that algorithm discovery in neural networks is sometimes more complex. Small changes to model hyperparameters and initializations can induce discovery of qualitatively different algorithms from a fixed training set, and even parallel implementations of multiple such algorithms. Some networks trained to perform modular addition implement a familiar Clock algorithm (previously described by Nanda et al. [1]); others implement a previously undescribed, less intuitive, but comprehensible procedure we term the Pizza algorithm, or a variety of even more complex procedures. Our results show that even simple learning problems can admit a surprising diversity of solutions, motivating the development of new tools for characterizing the behavior of neural networks across their algorithmic phase space.

1 Introduction

Mechanistically understanding deep network models—reverse-engineering their learned algorithms and representation schemes—remains a major challenge across problem domains. Several recent studies [2,3,4,5,1] have exhibited specific examples of models apparently re-discovering interpretable (and in some cases familiar) solutions to tasks like curve detection, sequence copying and modular arithmetic. Are these models the exception or the rule? Under what conditions do neural network models discover familiar algorithmic solutions to algorithmic tasks?
In this paper, we focus specifically on the problem of learning modular addition, training networks to compute sums like 8 + 6 = 2 (mod 12). Modular arithmetic can be implemented with a simple geometric solution, familiar to anyone who has learned to read a clock: every integer is represented as an angle, input angles are added together, and the resulting angle is evaluated to obtain a modular sum (Figure 1, left). Nanda et al. [1] show that specific neural network architectures, when trained to perform modular addition, implement this Clock algorithm. In this work, we show that the Clock algorithm is only one part of a more complicated picture of algorithm learning in deep networks. In particular, networks structurally similar to the ones trained by Nanda et al. preferentially implement a qualitatively different approach to modular arithmetic, which we term the Pizza algorithm (Figure 1, right), and sometimes even more complex solutions. Models exhibit sharp algorithmic phase transitions [6] between the Clock and Pizza algorithms as their width and attention strength vary, and often implement multiple, imperfect copies of the Pizza algorithm in parallel.

* Equal contribution. 1 Code is available at https://github.com/fjzzq2002/pizza.
37th Conference on Neural Information Processing Systems (NeurIPS 2023). arXiv:2306.17844v2 [cs.LG] 21 Nov 2023

[Figure 1 diagram. Model pipeline: Input → Embed → MLP or Transformer → Unembed → Logit. Clock algorithm (left): Step 1, embed tokens a and b on a circle, E_a ≡ (E_a,x, E_a,y) = (cos(w_k a), sin(w_k a)) and E_b ≡ (E_b,x, E_b,y) = (cos(w_k b), sin(w_k b)), where w_k = 2πk/p for some k ∈ [1, p−1]. Step 2, compute the angle sum θ_ab = θ_a + θ_b using multiplication: E_ab = (E_a,x E_b,x − E_a,y E_b,y, E_a,x E_b,y + E_a,y E_b,x) = (cos(w_k(a+b)), sin(w_k(a+b))), so same-label predictions, e.g. 2+3 = 1+4 = 12+5 = 11+6 = 10+7 = 9+8 = 5 (mod 12), collapse to the same number on a clock. Step 3, score possible outputs c via the dot product Q_abc = U_c · H_ab with U_c ≡ (cos(w_k c), sin(w_k c)), giving Q_abc(Clock) = cos(w_k(a+b−c)). Pizza algorithm (right): Step 2.1 instead computes the vector mean E_ab = (E_a + E_b)/2 = (cos(w_k a) + cos(w_k b), sin(w_k a) + sin(w_k b))/2; Step 2.2 uses E_ab and nonlinearities to compute H_ab = |cos(w_k(a−b)/2)| (cos(w_k(a+b)), sin(w_k(a+b))); Step 3 dots with U_c, giving Q_abc(Pizza) = |cos(w_k(a−b)/2)| cos(w_k(a+b−c)). Same-label predictions are spread over two slices of pizza.]

Figure 1: Illustration of the Clock and the Pizza algorithm.

Our results highlight the complexity of mechanistic description in even models trained to perform simple tasks. They point to characterization of algorithmic phase spaces, not just single algorithmic solutions, as an important goal in algorithm-level interpretability.

Organization: In Section 2, we review the Clock algorithm [1] and show empirical evidence of deviation from it in models trained to perform modular addition. In Section 3, we show that these deviations can be explained by an alternative Pizza algorithm. In Section 4, we define additional metrics to distinguish between these algorithms, and detect phase transitions between these algorithms (and other Non-circular algorithms) when architectures and hyperparameters are varied. We discuss the relationship between these findings and other work on model interpretation in Section 5, and conclude in Section 6.

2 Modular Arithmetic and the Clock Algorithm

Setup: We train neural networks to perform modular addition a + b = c (mod p), where a, b, c = 0, 1, ..., p−1. We use p = 59 throughout the paper. In these networks, every integer t has an associated embedding vector E_t ∈ R^d.
Networks take as input embeddings [E_a, E_b] ∈ R^2d and predict a categorical output c. Both embeddings and network parameters are learned. In preliminary experiments, we train two different network architectures on the modular arithmetic task, which we refer to as Model A and Model B. Model A is a one-layer ReLU transformer [7] with constant attention, while Model B is a standard one-layer ReLU transformer (see Appendix F.1 for details). As attention is not involved in Model A, it can also be understood as a ReLU MLP (Appendix G).

2.1 Review of the Clock Algorithm

As in past work, we find that after training both Model A and Model B, embeddings (E_a, E_b in Figure 1) usually describe a circle [8] in the plane spanned by the first two principal components of the embedding matrix. Formally, E_a ≈ [cos(w_k a), sin(w_k a)], where w_k = 2πk/p and k is an integer in [1, p−1]. Nanda et al. [1] discovered a circuit that uses these circular embeddings to implement an interpretable algorithm for modular arithmetic, which we call the Clock algorithm.

Algorithm    | Learned Embeddings                | Gradient Symmetry | Required Non-linearity
Clock        | Circle                            | No                | Multiplication
Pizza        | Circle                            | Yes               | Absolute value
Non-circular | Line, Lissajous-like curves, etc. | N/A               | N/A

Table 1: Different neural algorithms for modular addition

"If a meeting starts at 10, and lasts for 3 hours, then it will end at 1." This familiar fact is a description of a modular sum, 10 + 3 = 1 (mod 12), and the movement of a clock describes a simple algorithm for modular arithmetic: the numbers 1 through 12 are arranged on a circle in 360°/12 = 30° increments, angles of 10×30° and 3×30° are added together, then this angle is evaluated to determine that it corresponds to 1×30°. Remarkably, Nanda et al. [1] find that neural networks like our Model B implement this Clock algorithm, visualized in Figure 1 (left): they represent tokens a and b as 2D vectors, and add their polar angles using trigonometric identities.
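This procedure (embed each token on a circle, add the angles with trigonometric identities, read off the sum) can be sketched numerically. The following is a minimal NumPy illustration of the idealized algorithm, not the trained network's circuit; p = 59 and the frequency k = 1 are fixed here for readability:

```python
import numpy as np

p, k = 59, 1                    # modulus and circle frequency, k in [1, p-1]
w = 2 * np.pi * k / p

def embed(t):
    # Step 1: place token t at angle w*t on a circle.
    return np.array([np.cos(w * t), np.sin(w * t)])

def clock_logit(a, b, c):
    # Step 2: add the polar angles of E_a and E_b via the angle-addition
    # (rotation) identity, which requires multiplying coordinates.
    Ea, Eb = embed(a), embed(b)
    Eab = np.array([Ea[0] * Eb[0] - Ea[1] * Eb[1],   # cos(w*(a+b))
                    Ea[0] * Eb[1] + Ea[1] * Eb[0]])  # sin(w*(a+b))
    # Step 3: score candidate c by a dot product with U_c = embed(c),
    # which equals cos(w*(a+b-c)).
    return Eab @ embed(c)

def clock_predict(a, b):
    return max(range(p), key=lambda c: clock_logit(a, b, c))

assert clock_predict(50, 30) == (50 + 30) % p   # 80 mod 59 = 21
```

With k = 1 the logit cos(w(a+b−c)) is maximized exactly at c = a+b (mod p); a trained model superposes several such circles with different frequencies k, as the circle-isolation analysis below shows.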
Concretely, the Clock algorithm consists of three steps. In Step 1, tokens a and b are embedded as E_a = [cos(w_k a), sin(w_k a)] and E_b = [cos(w_k b), sin(w_k b)], respectively, where w_k = 2πk/p (an everyday clock has p = 12 and k = 1). Then the polar angles of E_a and E_b are added (in Step 2) and extracted (in Step 3) via trigonometric identities. For each candidate output c, we denote the logit Q_abc; the predicted output is c* = argmax_c Q_abc. Crucial to this algorithm is the fact that the attention mechanism can be leveraged to perform multiplication. What happens in model variants when the attention mechanism is absent, as in Model A? We find two pieces of evidence of deviation from the Clock algorithm in Model A.

2.2 First Evidence for Clock Violation: Gradient Symmetricity

Since the Clock algorithm has logits

Q_abc(Clock) = (E_a,x E_b,x − E_a,y E_b,y) E_c,x + (E_a,x E_b,y + E_a,y E_b,x) E_c,y,   (1)

(see Figure 1), the gradients of Q_abc generically lack permutation symmetry in argument order: ∇_{E_a} Q_abc ≠ ∇_{E_b} Q_abc. Thus, if learned models exhibit permutation symmetry (∇_{E_a} Q_abc = ∇_{E_b} Q_abc), they must be implementing some other algorithm. We compute the 6 largest principal components of the input embedding vectors, then compute the gradients of output logits (unnormalized log-probabilities from the model) with respect to the input embeddings, and project them onto these 6 principal components (since the angles relevant to the Clock and Pizza algorithms are encoded in the first few principal components). These projections are shown in Figure 2. While Model B demonstrates asymmetry in general, Model A exhibits gradient symmetry.

Figure 2: Gradients on first six principal components of input embeddings. (a,b,c) in the title stands for taking gradients on the output logit c for input (a,b). The x and y axes represent the gradients for embeddings of the first and the second token. The dashed line y = x signals a symmetric gradient.
2.3 Second Evidence for Clock Violation: Logit Patterns

Inspecting models' outputs, in addition to inputs, reveals further differences. For each input pair (a,b), we compute the output logit assigned to the correct label a+b. We visualize these correct logits from Models A and B in Figure 3; the rows are indexed by a−b and the columns by a+b. From Figure 3, we can see that the correct logits of Model A have a clear dependency on a−b (within each row, the correct logits are roughly the same), while this pattern is not observed in Model B. This suggests that Models A and B are implementing different algorithms.

Figure 3: Correct logits of Model A & Model B. The correct logits of Model A (left) have a clear dependence on a−b, while those of Model B (right) do not.

3 An Alternative Solution: the Pizza Algorithm

How does Model A perform modular arithmetic? Whatever solution it implements must exhibit the gradient symmetricity in Figure 2 and the output patterns in Figure 3. In this section, we describe a new algorithm for modular arithmetic, which we call the Pizza algorithm, and then provide evidence that this is the procedure implemented by Model A.

3.1 The Pizza Algorithm

Unlike the Clock algorithm, the Pizza algorithm operates inside the circle formed by embeddings (just as pepperoni are spread all over a pizza), instead of operating on the circumference of the circle. The basic idea is illustrated in Figure 1: given a fixed label c, for all (a,b) with a+b = c (mod p), the points E_ab = (E_a + E_b)/2 lie on a line through the origin of a 2D plane, and the points closer to this line than to the lines corresponding to any other c form two out of 2p mirrored "pizza slices", as shown at the right of the figure. Thus, to perform modular arithmetic, a network can determine which slice pair the average of the two embedding vectors lies in. Concretely, the Pizza algorithm also consists of three steps.
Step 1 is the same as in the Clock algorithm: the tokens a and b are embedded at E_a = (cos(w_k a), sin(w_k a)) and E_b = (cos(w_k b), sin(w_k b)), respectively. Steps 2 and 3 are different from the Clock algorithm. In Step 2.1, E_a and E_b are averaged to produce an embedding E_ab. In Step 2.2 and Step 3, the polar angle of E_ab is (implicitly) computed by computing the logit Q_abc for every possible output c. While one way of doing so is to take the absolute value of the dot product of E_ab with (cos(w_k c/2), sin(w_k c/2)), this is not commonly observed in neural networks (and would result in a different logit pattern). Instead, Step 2.2 transforms E_ab into a vector H_ab = |cos(w_k(a−b)/2)| (cos(w_k(a+b)), sin(w_k(a+b))), which is then dotted with the output embedding U_c = (cos(w_k c), sin(w_k c)). Finally, the prediction is c* = argmax_c Q_abc. See Appendix A and Appendix L for a more detailed analysis of a neural circuit that computes H_ab in a real network.

The key difference between the two algorithms lies in what non-linear operations are required: Clock requires multiplication of inputs in Step 2, while Pizza requires only an absolute-value computation, which is easily implemented by ReLU layers. If neural networks lack inductive biases toward implementing multiplication, they may be more likely to implement Pizza rather than Clock, as we will verify in Section 4.

3.2 First Evidence for Pizza: Logit Patterns

Both the Clock and Pizza algorithms compute logits Q_abc in Step 3, but the logits have different forms, shown in Figure 1. Specifically, Q_abc(Pizza) has an extra multiplicative factor |cos(w_k(a−b)/2)| compared to Q_abc(Clock). As a result, given c = a+b, Q_abc(Pizza) depends on a−b, but Q_abc(Clock) does not. The intuition for the dependence is that a sample is more likely to be classified correctly if E_ab is longer, and the norm of this vector depends on a−b. As we observe in Figure 3, the logits in Model A indeed exhibit a strong dependence on a−b.
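The three Pizza steps, and the extra |cos(w_k(a−b)/2)| factor, can be sketched with the closed-form H_ab. This is again an idealized NumPy illustration with k = 1, not the learned circuit:

```python
import numpy as np

p, k = 59, 1
w = 2 * np.pi * k / p

def pizza_logit(a, b, c):
    # Step 2.1 in the network averages E_a and E_b; Step 2.2 turns that
    # midpoint into H_ab = |cos(w(a-b)/2)| * (cos(w(a+b)), sin(w(a+b))).
    Hab = abs(np.cos(w * (a - b) / 2)) * np.array(
        [np.cos(w * (a + b)), np.sin(w * (a + b))])
    Uc = np.array([np.cos(w * c), np.sin(w * c)])   # output embedding
    # Step 3: Q_abc(Pizza) = |cos(w(a-b)/2)| * cos(w(a+b-c)).
    return Hab @ Uc

def pizza_predict(a, b):
    return max(range(p), key=lambda c: pizza_logit(a, b, c))

# The |cos| factor is the same for every candidate c, so the argmax is
# unchanged -- but near-antipodal pairs shrink all logits toward zero,
# the failure mode discussed in Section 3.4.
assert pizza_predict(50, 30) == (50 + 30) % p
```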
3.3 Second Evidence for Pizza: Clearer Logit Patterns via Circle Isolation

To better understand the behavior of this algorithm, we replace the embedding matrix E with a series of rank-2 approximations: using only the first and second principal components, or only the third and fourth, etc. For each such matrix, embeddings lie in a two-dimensional subspace. For both Model A and Model B, we find that embeddings form a circle in this subspace (Figure 4 and Figure 5, bottom). We call this procedure circle isolation. Even after this drastic modification to the trained models' parameters, both Model A and Model B continue to behave in interpretable ways: a subset of predictions remain highly accurate, with this subset determined by the periodicity k of the isolated circle. As predicted by the Pizza and Clock algorithms described in Figure 1, Model A's accuracy drops to zero at specific values of a−b, while Model B's accuracy is invariant to a−b. Applying circle isolation to Model A on the first two principal components (one circle) yields a model with 32.8% overall accuracy, while retaining the first six principal components (three circles) yields an overall accuracy of 91.4%. See Appendix D for more discussion. By contrast, Model B achieves 100% when embeddings are truncated to the first six principal components. Circle isolation thus reveals an error correction mechanism achieved via ensembling: when an algorithm (clock or pizza) exhibits systematic errors on a subset of inputs, models can implement multiple algorithm variants in parallel to obtain more robust predictions.

Figure 4: Correct logits of Model A (Pizza) after circle isolation. The rightmost pizza accompanies the third pizza (discussed in Section 3.4 and Appendix D). Top: the logit pattern depends on a−b. Bottom: embeddings for each circle.

Using these isolated embeddings, we may additionally calculate the isolated logits directly with the formulas in Figure 1 and compare with the actual logits from Model A.
Results are displayed in Table 2. We find that Q_abc(Pizza) explains substantially more variance than Q_abc(Clock).

Why do we only analyze correct logits? The logits from the Pizza algorithm are given by Q_abc(Pizza) = |cos(w_k(a−b)/2)| cos(w_k(a+b−c)). By contrast, the Clock algorithm has logits Q_abc(Clock) = cos(w_k(a+b−c)). In a word, Q_abc(Pizza) has an extra multiplicative factor |cos(w_k(a−b)/2)| compared to Q_abc(Clock). By constraining c = a+b (thus cos(w_k(a+b−c)) = 1), the factor |cos(w_k(a−b)/2)| can be identified.

(Unexpected) dependence of logits Q_abc(Clock) on a+b: Although our analysis above expects the logits Q_abc(Clock) not to depend on a−b, it does not predict their dependence on a+b. In Figure 5, we surprisingly find that Q_abc(Clock) is sensitive to this sum. Our conjecture is that Steps 1 and 2 of the Clock are implemented (almost) noiselessly, such that same-label samples collapse to the same point after Step 2. However, Step 3 (classification) is imperfect after circle isolation, resulting in fluctuations of logits. Inputs with common sums a+b produce the same logits.

Figure 5: Correct logits of Model B (Clock) after circle isolation. Top: the logit pattern depends on a+b. Bottom: embeddings for each circle.

Circle | w_k         | Q_abc(Clock) FVE | Q_abc(Pizza) FVE
#1     | 2π/59 · 17  | 75.41%           | 99.18%
#2     | 2π/59 · 3   | 75.62%           | 99.18%
#3     | 2π/59 · 44  | 75.38%           | 99.28%

Table 2: After isolating circles in the input embedding, fraction of variance explained (FVE) of all of Model A's output logits (59×59×59 of them) by various formulas. Both the model's output logits and the formulas' results are normalized to mean 0, variance 1 before taking the FVE. The w_k's are calculated according to the visualization; for example, the distance between 0 and 1 in Circle #1 is 17, so w_k = 2π/59 · 17.

3.4 Third Evidence for Pizza: Accompanied & Accompanying Pizza

The Achilles' heel of the Pizza algorithm is antipodal pairs.
If two inputs (a,b) happen to lie antipodally, then their middle point will lie at the origin, where the correct "pizza slice" is difficult to identify. For example, in Figure 1 (right), the antipodal pairs are (1,7), (2,8), (3,9), etc.: their middle points all collapse to the origin, but their class labels are different. Models cannot distinguish between, and thus correctly classify, these pairs. Even for odd p, where there are no strict antipodal pairs, approximately antipodal pairs are more likely to be classified incorrectly than non-antipodal pairs.

Intriguingly, neural networks find a clever way to compensate for this failure mode. We find that pizzas usually come with "accompanying pizzas". An accompanied pizza and its accompanying pizza complement each other in the sense that near-antipodal pairs in the accompanied pizza become adjacent or close (i.e., very non-antipodal) in the accompanying pizza. If we denote the difference between adjacent numbers on the circle as δ, with δ_1 and δ_2 for accompanied and accompanying pizzas, respectively, then δ_1 = 2δ_2 (mod p). In our experiments, we found that pizzas #1/#2/#3 in Figure 4 all have accompanying pizzas, which we call pizzas #4/#5/#6 (see Appendix D for details). However, these accompanying pizzas do not play a significant role in final model predictions.² We conjecture that the training dynamics are as follows: (1) at initialization, pizzas #1/#2/#3 correspond to three different "lottery tickets" [9]; (2) in early stages of training, to compensate for the weaknesses (antipodal pairs) of pizzas #1/#2/#3, pizzas #4/#5/#6 are formed; (3) as training goes on (in the presence of weight decay), the neural network gets pruned, and as a result pizzas #4/#5/#6 are not significantly involved in prediction, although they continue to be visible in the embedding space.

² Accompanied pizzas #1/#2/#3 can achieve 99.7% accuracy, but accompanying pizzas #4/#5/#6 can only achieve 16.7% accuracy.
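This failure mode is visible directly in the closed-form Pizza logit factor: every logit for input (a,b) is scaled by |cos(w_k(a−b)/2)|, the distance of the midpoint E_ab from the origin. A small numerical check (a sketch with p = 59 and k = 1):

```python
import numpy as np

p, k = 59, 1
w = 2 * np.pi * k / p

def slice_margin(a, b):
    # Magnitude of the multiplicative factor in Q_abc(Pizza): how far the
    # midpoint of E_a and E_b sits from the origin.
    return abs(np.cos(w * (a - b) / 2))

# Nearly antipodal pair (a - b close to p/2 = 29.5): midpoint near the
# origin, so the correct slice is hard to identify.
assert slice_margin(0, 29) < 0.1
# Adjacent tokens: midpoint near the rim, large margin.
assert slice_margin(10, 12) > 0.9
```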
4 The Algorithmic Phase Space

In Section 3, we demonstrated a typical Pizza model (Model A) and a typical Clock model (Model B). In this section, we study how architectures and hyperparameters govern the selection of these two algorithmic "phases". In Section 4.1, we propose quantitative metrics that can distinguish between Pizza and Clock. In Section 4.2, we observe how these metrics behave with different architectures and hyperparameters, demonstrating sharp phase transitions. The results in this section focus on Clock and Pizza models, but other algorithmic solutions to modular addition are also discovered, and explored in more detail in Appendix B.

4.1 Metrics

We wish to study the distribution of the Pizza and Clock algorithms statistically, which requires us to distinguish between the two algorithms automatically. To do so, we formalize our observations in Sections 2.2 and 2.3, arriving at two metrics: gradient symmetricity and distance irrelevance.

4.1.1 Gradient Symmetricity

To measure the symmetricity of the gradients, we select input-output triples (a,b,c), compute the gradient vectors of the output logit at position c with respect to the two input embeddings, and then compute their cosine similarity. Averaging over many triples yields the gradient symmetricity.

Definition 4.1 (Gradient Symmetricity). For a fixed set S ⊆ Z_p^3 of input-output triples,³ define the gradient symmetricity of a network M with embedding layer E as

s_g ≡ (1/|S|) Σ_{(a,b,c)∈S} sim(∂Q_abc/∂E_a, ∂Q_abc/∂E_b),

where sim(u,v) = u·v/(|u||v|) is the cosine similarity and Q_abc is the logit for class c given inputs a and b. By construction, s_g ∈ [−1,1]. As discussed in Section 2.2, the Pizza algorithm has symmetric gradients while the Clock algorithm has asymmetric ones. Model A and Model B from Section 3 have gradient symmetricity 99.37% and 33.36%, respectively (Figure 2).
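A toy check of this metric, using the closed-form Clock logit of Eq. (1) and a simplified Pizza-style logit that depends on the inputs only through (E_a + E_b)/2. The function q_pizza below is an illustrative stand-in for the full H_ab circuit, and the specific embedding values are chosen for the demonstration:

```python
import numpy as np

# Fixed example embeddings for one (a, b, c) triple.
Ea, Eb, Uc = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.8])

def num_grad(f, x, eps=1e-6):
    # Central-difference gradient of a scalar function at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def q_clock(ea, eb):
    # Eq. (1): bilinear in E_a and E_b, so the two gradients generically differ.
    return (ea[0]*eb[0] - ea[1]*eb[1]) * Uc[0] + (ea[0]*eb[1] + ea[1]*eb[0]) * Uc[1]

def q_pizza(ea, eb):
    # Any function of (E_a + E_b)/2 has identical gradients w.r.t. both inputs.
    return np.tanh((ea + eb) / 2) @ Uc

s_clock = cos_sim(num_grad(lambda e: q_clock(e, Eb), Ea),
                  num_grad(lambda e: q_clock(Ea, e), Eb))
s_pizza = cos_sim(num_grad(lambda e: q_pizza(e, Eb), Ea),
                  num_grad(lambda e: q_pizza(Ea, e), Eb))
assert s_pizza > 0.999 and abs(s_clock) < 1e-4
```

For this input pair the Clock gradients come out orthogonal while the Pizza-style gradients coincide; averaging such cosine similarities over many sampled (a,b,c) triples gives s_g.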
4.1.2 Distance Irrelevance

To measure the dependence of correct logits on the difference between the two inputs (which reflects the distance between the inputs on the circle), we measure how much of the variance in the correct-logit matrix is explained by that difference. We do so by comparing the average standard deviation of correct logits over inputs with the same difference against the standard deviation over all inputs.

Definition 4.2 (Distance Irrelevance). For a network M with correct-logit matrix L (L_{i,j} = Q_{ij,i+j}), define its distance irrelevance as

q ≡ [ (1/p) Σ_{d∈Z_p} std(L_{i,i+d} | i ∈ Z_p) ] / std(L_{i,j} | (i,j) ∈ Z_p^2),

where std computes the standard deviation of a set. By construction, q ∈ [0,1]. Model A and Model B from Section 3 give distance irrelevance 0.17 and 0.85, respectively (Figure 3). A typical distance irrelevance for the Pizza algorithm ranges from 0 to 0.4, while a typical distance irrelevance for the Clock algorithm ranges from 0.4 to 1.

4.1.3 Which Metric is More Decisive?

When the two metrics give conflicting results, which one is more decisive? We consider distance irrelevance the decisive indicator of the Pizza algorithm, since output logits that depend on the distance are highly suggestive of Pizza. Gradient symmetricity, on the other hand, can be used to rule out the Clock algorithm, since Clock requires multiplying (transformed) inputs, which results in asymmetric gradients. Figure 6 confirms that at low distance irrelevance (suggesting Pizza), the gradient symmetricity is almost always close to 1 (suggesting non-Clock).

³ To speed up the calculations, in our experiments S is taken as a random subset of Z_p^3 of size 100.

Figure 6: Distance irrelevance vs. gradient symmetricity over all the main experiments.

4.2 Identifying Algorithmic Phase Transitions

How do models "choose" whether to implement the Clock or Pizza algorithm? We investigate this question by interpolating between Model A (transformer without attention) and Model B (transformer with attention).
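As a sanity check of Definition 4.2, the metric can be evaluated on idealized correct-logit matrices built from the closed forms in Figure 1. This is a NumPy sketch with k = 1; the Clock-like matrix uses an a+b-dependent fluctuation (cf. the pattern observed after circle isolation in Figure 5) as a stand-in:

```python
import numpy as np

p, k = 59, 1
w = 2 * np.pi * k / p

def distance_irrelevance(L):
    # Definition 4.2: mean std along fixed-difference "diagonals" of the
    # correct-logit matrix, normalized by the std of the whole matrix.
    i = np.arange(p)
    diag_stds = [np.std(L[i, (i + d) % p]) for d in range(p)]
    return np.mean(diag_stds) / np.std(L)

A, B = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
# Pizza correct logits (c = a + b): only the |cos(w(a-b)/2)| factor
# survives, so each fixed-difference diagonal is constant -> q near 0.
L_pizza = np.abs(np.cos(w * (A - B) / 2))
# Clock-like correct logits fluctuating with a + b vary over the full
# range within each diagonal -> q near 1.
L_clock = np.cos(w * (A + B))

assert distance_irrelevance(L_pizza) < 0.05
assert distance_irrelevance(L_clock) > 0.9
```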
To do so, we introduce a new hyperparameter α, which we call the attention rate. For a model with attention rate α, we modify the attention matrix M of each attention head to be M' = Mα + J(1−α), where J is the all-ones matrix. In other words, the modified matrix is a linear interpolation between the original (post-softmax) attention and the all-ones matrix, with the rate α controlling how much of the attention is kept. The transformer with and without attention corresponds to α = 1 (attention kept) and α = 0 (constant attention matrix), respectively. With this parameter, we can control the balance of attention versus linear layers in transformers.

We performed the following set of experiments on transformers (see Appendix F.1 for architecture and training details): (1) one-layer transformers with width 128 and attention rate uniformly sampled in [0,1] (Figure 7); (2) one-layer transformers with width log-uniformly sampled in [32,512] and attention rate uniformly sampled in [0,1] (Figure 7); (3) transformers with 2 to 4 layers, width 128, and attention rate uniformly sampled in [0,1] (Figure 11).

The Pizza and Clock algorithms are the dominant algorithms with circular embeddings. For circular models, most observed models either have low gradient symmetricity (corresponding to the Clock algorithm) or low distance irrelevance (corresponding to the Pizza algorithm).

A two-dimensional phase change is observed for attention rate and layer width. For the fixed-width experiment, we observed a clear phase transition from the Pizza algorithm to the Clock algorithm (characterized by gradient symmetricity and distance irrelevance). We also observe an almost linear phase boundary with respect to both attention rate and layer width: the attention-rate transition point increases as the model gets wider. The dominance of linear layers determines whether the Pizza or the Clock algorithm is preferred.
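The attention-rate modification M' = Mα + J(1−α) is straightforward to sketch for a single head's post-softmax attention matrix (a NumPy illustration, not the paper's training code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interpolated_attention(scores, alpha):
    # Blend the post-softmax attention M with the all-ones matrix J:
    # alpha = 1 keeps full attention (Model B), alpha = 0 makes the
    # attention constant (Model A).
    M = softmax(scores)
    J = np.ones_like(M)
    return M * alpha + J * (1 - alpha)

scores = np.random.default_rng(0).normal(size=(4, 4))
assert np.allclose(interpolated_attention(scores, 0.0), 1.0)
assert np.allclose(interpolated_attention(scores, 1.0), softmax(scores))
```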
For one-layer transformers, we study the transition point against the attention rate and the width:

• The Clock algorithm dominates when the attention rate is higher than the phase-change point, and the Pizza algorithm dominates when the attention rate is lower. Our explanation: at a high attention rate, the attention mechanism is more prominent in the network, giving rise to the Clock algorithm; at a low attention rate, the linear layers are more prominent, giving rise to the Pizza algorithm.

• The phase-change point gets higher as the model width increases. Our explanation: when the model gets wider, the linear layers become more capable while the attention mechanism receives less benefit (attention weights remain scalars while outputs from linear layers become wider vectors). The linear layers therefore gain prominence in a wider model.

Possibly hybrid algorithms between the Clock and the Pizza algorithms. The continuous phase change suggests the existence of networks that lie between the Clock and the Pizza algorithms. This is achievable by having some principal components act as the Clock and some principal components act as the Pizza.

Figure 7: Training results from 1-layer transformers. Each point in the plots represents a training run reaching circular embeddings and 100% validation accuracy. See Appendix C for additional plots. Top: model width fixed to 128. Bottom: model width varies. The phase transition lines are calculated by logistic regression (classifying the runs by whether gradient symmetricity > 98% and whether distance irrelevance < 0.6).

Existence of non-circular algorithms. Although our presentation focuses on circular algorithms (i.e., those whose embeddings are circular), we find non-circular algorithms (i.e., those whose embeddings do not form a circle when projected onto any plane) to be present in neural networks. See Appendix B for preliminary findings. We find that deeper networks are more likely to form non-circular algorithms.
We also observe the appearance of non-circular networks at low attention rates. Nevertheless, the Pizza algorithm continues to be observed (low distance irrelevance, high gradient symmetricity).

5 Related Work

Mechanistic interpretability aims to mechanically understand neural networks by reverse engineering them [2,3,5,4,10,11,1,12,13,14]. One can either look for patterns in weights and activations by studying single-neuron behavior (superposition [11], monosemantic neurons [15]), or study meaningful modules or circuits formed by groups of neurons [4,14]. Mechanistic interpretability is closely related to training dynamics [8,13,1].

Learning mathematical tasks: Mathematical tasks provide useful benchmarks for neural network interpretability, since the tasks themselves are well understood. The setup can involve learning from images [16,17], with trainable embeddings [18], or with numbers as inputs [19,5]. Beyond arithmetic relations, machine learning has been applied to learn other mathematical structures, including geometry [20], knot theory [21] and group theory [22].

Algorithmic phase transitions: Phase transitions are present in classical algorithms [23] and in deep learning [6,24,25]. Usually the phase transition means that algorithmic performance changes sharply when a parameter is varied (e.g., amount of data, network capacity, etc.). However, the phase transition studied in this paper is representational: both Clock and Pizza give perfect accuracy, but arrive at answers via different internal computations. These model-internal phase transitions are harder to study, but closer to corresponding phenomena in physical systems [24].

Algorithm learning in neural networks: Emergent abilities in deep neural networks, especially large language models, have recently attracted significant attention [26]. An ability is "emergent" if the performance on a subtask suddenly increases with growing model size, though such claims depend on the choice of metric [27].
It has been hypothesized that the emergence of a specific capability in a model corresponds to the emergence of a modular circuit responsible for that capability, and that the emergence of some model behaviors thus results from a sequence of quantized circuit-discovery steps [5].

6 Conclusions

We have offered a closer look at recent findings that familiar algorithms arise in neural networks trained on specific algorithmic tasks. In modular arithmetic, we have shown that such algorithmic discoveries are not inevitable: in addition to the Clock algorithm reverse-engineered by [1], we find other algorithms (including a Pizza algorithm, and more complicated procedures) to be prevalent in trained models. These different algorithmic phases can be distinguished using a variety of new and existing interpretability techniques, including logit visualization, isolation of principal components in embedding space, and gradient-based measures of model symmetry. These techniques make it possible to automatically classify trained networks according to the algorithms they implement, and reveal algorithmic phase transitions in the space of model hyperparameters. Here we found specifically that the emergence of a Pizza or Clock algorithm depends on the relative strength of linear layers and attention outputs. We additionally showed that these algorithms are not implemented in isolation; instead, networks sometimes ensemble multiple copies of an algorithm in parallel. These results offer exciting new challenges for mechanistic interpretability: (1) how to find, classify, and interpret unfamiliar algorithms in a systematic way? (2) how to disentangle multiple, parallel algorithm implementations in the presence of ensembling?

Limitations: We have focused on a single learning problem: modular addition. Even in this restricted domain, qualitatively different model behaviors emerge across architectures and seeds.
Significant additional work is needed to scale these techniques to the even more complex models used in real-world tasks.

Broader Impact: We believe interpretability techniques can play a crucial role in creating and improving safe AI systems. However, they may also be used to build more accurate systems, with the attendant risks inherent in all dual-use technologies. It is therefore necessary to exercise caution and responsible decision-making when deploying such techniques.

Acknowledgement

We would like to thank Mingyang Deng and anonymous reviewers for valuable and fruitful discussions, and MIT SuperCloud for providing computation resources. ZL and MT are supported by the Foundational Questions Institute, the Rothberg Family Fund for Cognitive Science, and IAIFI through NSF grant PHY-2019786. JA is supported by a gift from the Open Philanthropy Foundation.

References

[1] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023.
[2] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in.
[3] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
[4] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt.
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023.
[5] Eric J Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling. arXiv preprint arXiv:2303.13506, 2023.
[6] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[8] Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022.
[9] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
[10] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
[11] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah.
Toy models of superposition. Transformer Circuits Thread, 2022.
[12] Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In The Fortieth International Conference on Machine Learning, 2023.
[13] Ziming Liu, Eric J Michaud, and Max Tegmark. Omnigrok: Grokking beyond algorithmic data. In The Eleventh International Conference on Learning Representations, 2023.
[14] Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997, 2023.
[15] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
[16] Yedid Hoshen and Shmuel Peleg. Visual learning of arithmetic operation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
[17] Samuel Kim, Peter Y Lu, Srijon Mukherjee, Michael Gilbert, Li Jing, Vladimir Čeperić, and Marin Soljačić. Integration of neural network-based symbolic regression in deep learning for scientific discovery. IEEE Transactions on Neural Networks and Learning Systems, 32(9):4166–4177, 2020.
[18] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
[19] Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. Advances in Neural Information Processing Systems, 35:21750–21764, 2022.
[20] Yang-Hui He. Machine-learning mathematical structures. International Journal of Data Science in the Mathematical Sciences, 1(01):23–47, 2023.
[21] Sergei Gukov, James Halverson, Fabian Ruehle, and Piotr Sułkowski. Learning to unknot.
Machine Learning: Science and Technology, 2(2):025035, 2021.
[22] Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, et al. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74, 2021.
[23] Alexander K Hartmann and Martin Weigt. Phase transitions in combinatorial optimization problems: basics, algorithms and statistical mechanics. John Wiley & Sons, 2006.
[24] Lorenza Saitta, Attilio Giordana, and Antoine Cornuejols. Phase transitions in machine learning. Cambridge University Press, 2011.
[25] Ekdeep Singh Lubana, Eric J Bigelow, Robert P Dick, David Krueger, and Hidenori Tanaka. Mechanistic mode connectivity. In International Conference on Machine Learning, pages 22965–23004. PMLR, 2023.
[26] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. Survey Certification.
[27] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
[28] Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, and Roy Schwartz. How much does attention actually attend? Questioning the importance of attention in pretrained transformers. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1403–1416, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Supplementary Material

A Mathematical Analysis and an Example of the Pizza Algorithm

In the Pizza algorithm, we have
$$E_{ab} = \cos(w_k(a-b)/2) \cdot \big(\cos(w_k(a+b)/2),\ \sin(w_k(a+b)/2)\big),$$
since $\cos x + \cos y = \cos((x-y)/2)\,(2\cos((x+y)/2))$ and $\sin x + \sin y = \cos((x-y)/2)\,(2\sin((x+y)/2))$. To get $|\cos(w_k(a-b)/2)| \cdot (\cos(w_k(a+b)), \sin(w_k(a+b)))$, we generalize the target to $|\cos(w_k(a-b)/2)| \cos(w_k(a+b-u))$ (the two given cases correspond to $u = 0$ and $u = (\pi/2)/w_k$). We have
$$|(\cos(w_k u/2), \sin(w_k u/2)) \cdot E_{ab}| = |\cos(w_k(a-b)/2)\cos(w_k(a+b-u)/2)|,$$
$$|(-\sin(w_k u/2), \cos(w_k u/2)) \cdot E_{ab}| = |\cos(w_k(a-b)/2)\sin(w_k(a+b-u)/2)|,$$
so their difference equals
$$|\cos(w_k(a-b)/2)|\,\big(|\cos(w_k(a+b-u)/2)| - |\sin(w_k(a+b-u)/2)|\big).$$
Now notice that $|\cos(t)| - |\sin(t)| \approx \cos(2t)$ for any $t \in \mathbb{R}$ (Figure 8), so the difference is approximately
$$|\cos(w_k(a-b)/2)| \cos(w_k(a+b-u)).$$

Figure 8: $|\cos(t)| - |\sin(t)|$ is approximately $\cos(2t)$ for any $t \in \mathbb{R}$.

Plugging in $u = 0$ and $u = (\pi/2)/w_k$ as mentioned, we get the following particular implementation of the Pizza algorithm.

Algorithm: Pizza, Example

Step 1: On given inputs $a$ and $b$, circularly embed them as two vectors on the circumference, $(\cos(w_k a), \sin(w_k a))$ and $(\cos(w_k b), \sin(w_k b))$.

Step 2: Compute
$$\alpha = |\cos(w_k a) + \cos(w_k b)|/2 - |\sin(w_k a) + \sin(w_k b)|/2 \approx |\cos(w_k(a-b)/2)| \cos(w_k(a+b)),$$
$$\begin{aligned}
\beta &= |\cos(w_k a) + \cos(w_k b) + \sin(w_k a) + \sin(w_k b)|/(2\sqrt{2}) - |\cos(w_k a) + \cos(w_k b) - \sin(w_k a) - \sin(w_k b)|/(2\sqrt{2}) \\
&= |\cos(w_k a - \pi/4) + \cos(w_k b - \pi/4)|/2 - |\sin(w_k a - \pi/4) + \sin(w_k b - \pi/4)|/2 \\
&\approx |\cos(w_k(a-b)/2)| \cos(w_k(a+b) - \pi/2) = |\cos(w_k(a-b)/2)| \sin(w_k(a+b)).
\end{aligned}$$

Step 3: The output of this pizza is computed as a dot product:
$$Q'_{abc} = \alpha\cos(w_k c) + \beta\sin(w_k c) \approx |\cos(w_k(a-b)/2)| \cos(w_k(a+b-c)).$$

Similar circuits are observed in the wild, but with a more complicated approximation in place of the two-term one above; see Appendix L for details. The extra $|\cos(w_k(a-b)/2)|$ term is not a coincidence: our derivation generalizes as follows.
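Before stating the generalization, the example Pizza above can be checked numerically. The following is an illustrative NumPy sketch (with a hypothetical frequency index $k = 7$, coprime to $p$), not the authors' code; it bounds the $|\cos(t)| - |\sin(t)| \approx \cos(2t)$ approximation of Figure 8 and compares the two-term logits against the target pattern $|\cos(w_k(a-b)/2)|\cos(w_k(a+b-c))$.

```python
import numpy as np

# Illustrative sketch (not the paper's released code); k = 7 is an arbitrary
# frequency index with gcd(k, p) = 1.
p, k = 59, 7
w = 2 * np.pi * k / p

# The approximation behind Figure 8: |cos(t)| - |sin(t)| is close to cos(2t).
t = np.linspace(0.0, 2.0 * np.pi, 2000)
approx_err = np.max(np.abs(np.abs(np.cos(t)) - np.abs(np.sin(t)) - np.cos(2.0 * t)))

def pizza_logits(a, b):
    """Steps 1-3 of the example Pizza: absolute-value features alpha and beta,
    read out against (cos(w c), sin(w c))."""
    ca, cb = np.cos(w * a), np.cos(w * b)
    sa, sb = np.sin(w * a), np.sin(w * b)
    alpha = abs(ca + cb) / 2 - abs(sa + sb) / 2
    beta = (abs(ca + cb + sa + sb) - abs(ca + cb - sa - sb)) / (2 * np.sqrt(2))
    c = np.arange(p)
    return alpha * np.cos(w * c) + beta * np.sin(w * c)

def target_pattern(a, b):
    # The pattern the Pizza approximates: |cos(w(a-b)/2)| * cos(w(a+b-c)).
    c = np.arange(p)
    return np.abs(np.cos(w * (a - b) / 2)) * np.cos(w * (a + b - c))
```

For odd $p$ the prefactor $|\cos(w_k(a-b)/2)|$ never vanishes exactly, so the idealized pattern's argmax over $c$ is $(a+b) \bmod p$ for every input pair, while the two-term approximation tracks it up to a small additive error.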
Lemma A.1. A symmetric function $f(x,y)$ that is a linear combination of $\cos x$, $\sin x$, $\cos y$, $\sin y$ can always be written as $\cos((x-y)/2)\,g(x+y)$ for some function $g$.

Proof. Notice $\cos x + \cos y = \cos((x-y)/2)\,(2\cos((x+y)/2))$ and $\sin x + \sin y = \cos((x-y)/2)\,(2\sin((x+y)/2))$, so $\alpha(\cos x + \cos y) + \beta(\sin x + \sin y) = \cos((x-y)/2)\,\big(2\alpha\cos((x+y)/2) + 2\beta\sin((x+y)/2)\big)$.

This is why we consider the output pattern with the $|\cos(w_k(a-b)/2)|$ terms, rather than the actual computation circuits, as the defining feature of the Pizza algorithm.

B Non-Circular Algorithms

One thing that further complicates our experiments is the existence of non-circular embeddings. While only circular algorithms are reported in previous works [8,1], many non-circular embeddings appear in our experiments, e.g., 1D lines or 3D Lissajous-like curves, as shown in Figure 9. We leave the detailed analysis of these non-circular algorithms for future study. Since circular algorithms are our primary focus, we propose the following circularity metric to filter out non-circular algorithms. The metric reaches its maximum of 1 when the principal components align with cosine waves.

Figure 9: Visualization of the principal components of input embeddings for two trained non-circular models. Top: a line-like first principal component; note the re-arranged x axis (token id). Bottom: first three principal components forming a three-dimensional non-circular pattern. Each point represents the embedding of a token.

Definition B.1 (Circularity). For some network, suppose the $l$-th principal component of its input embeddings is $v_{l,0}, v_{l,1}, \dots, v_{l,p-1}$. Define its circularity based on the first four components as
$$c = \frac{1}{4}\sum_{l=1}^{4}\max_{k\in\{1,2,\dots,p-1\}} \frac{2}{p\sum_{j=0}^{p-1} v_{l,j}^2}\left|\sum_{j=0}^{p-1} v_{l,j}\, e^{2\pi i\cdot jk/p}\right|^2,$$
where $i$ is the imaginary unit. By Fourier analysis, $c \in [0,1]$, and $c = 1$ means the first four components are Fourier waves.

Both Model A and Model B in Section 3 have a circularity around 99.8%, and we consider models with circularity $\geq$ 99.5% circular.
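Definition B.1 can be evaluated with a discrete Fourier transform. The sketch below is our illustration (not the authors' code): it assumes the first four principal components are given as rows of a $(4, p)$ array; a perfectly circular embedding scores 1, while random components score far lower.

```python
import numpy as np

# Illustrative implementation of Definition B.1 (circularity), assuming the
# four principal components are already extracted as rows of shape (4, p).
p = 59

def circularity(components):
    total = 0.0
    for v in components:
        # |sum_j v_j e^{-2pi i jk/p}|^2 for all k; the FFT sign convention
        # differs from the definition only by conjugation, so magnitudes agree.
        power = np.abs(np.fft.fft(v)) ** 2
        total += 2.0 * power[1:].max() / (p * (v ** 2).sum())
    return total / 4.0

j = np.arange(p)
# A perfectly circular embedding: cos/sin pairs at two (arbitrary) frequencies.
circular = np.stack([np.cos(2 * np.pi * 3 * j / p),  np.sin(2 * np.pi * 3 * j / p),
                     np.cos(2 * np.pi * 17 * j / p), np.sin(2 * np.pi * 17 * j / p)])
# Random components, for contrast.
noise = np.random.default_rng(0).normal(size=(4, p))
```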
C More Results from the Main Experiments

Here we provide Figure 7 with non-circular networks left unfiltered (Figure 10); more noise emerges in the plot. We also provide training results from multi-layer transformers (Figure 11).

(Footnote to Lemma A.1: the actual neural networks could be more complicated; even if a network is locally linear and symmetric, it could be locally asymmetric, e.g., $|x| + |y|$ could locally equal $x - y$. Nevertheless, the pattern is observed in our trained networks.)

Figure 10: Training results from 1-layer transformers. Each point in the plots represents a training run reaching 100% validation accuracy. Among all the trained 1-layer transformers, 34.31% are circular. Top: model width fixed to 128. Bottom: model width varies.

D Pizzas Come in Pairs

Cautious readers might notice that the Pizza algorithm is imperfect: for near-antipodal points, the sum vector has a very small norm and the result is noise-sensitive. While this problem is partially alleviated by the use of multiple circles instead of one, we also noticed another emergent pattern: accompanying pizzas. The idea is the following: suppose the difference between adjacent points is $2k \bmod p$; then antipodal points have difference $\pm k$. Therefore, if we arrange a new circle with difference $k$ between adjacent points, we get a pizza that works best for the formerly antipodal points.

Algorithm: Accompanying Pizza

Step 1: Take $w_k$ from the accompanied pizza. On given inputs $a$ and $b$, circularly embed them as two vectors on the circumference, $(\cos(2w_k a), \sin(2w_k a))$ and $(\cos(2w_k b), \sin(2w_k b))$.

Step 2: Compute the midpoint
$$s = \tfrac{1}{2}\big(\cos(2w_k a) + \cos(2w_k b),\ \sin(2w_k a) + \sin(2w_k b)\big).$$

Step 3: The output of this pizza is computed as a dot product:
$$A_c = -(\cos(w_k c), \sin(w_k c)) \cdot s.$$

This is exactly what we observed in Model A (Table 3, Figure 13). With the six circles (pizzas and accompanying pizzas) included in the embedding, Model A also reaches 100% accuracy.
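The hand-off between a pizza and its accompanying pizza can be checked numerically. The sketch below uses assumed parameters ($k = 1$, not Model A's actual circles): at a near-antipodal input pair, the main pizza's logit magnitude nearly vanishes while the accompanying pizza's readout is strong and peaks at the correct answer.

```python
import numpy as np

# Illustrative sketch with assumed parameters, not Model A's actual weights.
p, k = 59, 1
w = 2 * np.pi * k / p

def main_pizza_strength(a, b):
    # Magnitude of the main pizza's logit pattern: |cos(w(a-b)/2)|.
    return abs(np.cos(w * (a - b) / 2))

def accompanying_logits(a, b):
    # Steps 1-3 above: embed at frequency 2w, read out at frequency w.
    s = 0.5 * np.array([np.cos(2 * w * a) + np.cos(2 * w * b),
                        np.sin(2 * w * a) + np.sin(2 * w * b)])
    c = np.arange(p)
    return -(s[0] * np.cos(w * c) + s[1] * np.sin(w * c))

a, b = 30, 1   # a - b = 29, close to p/2: near-antipodal for the main pizza
```

Here the midpoint $s$ equals $\cos(w(a-b))\cdot(\cos(w(a+b)), \sin(w(a+b)))$, so near-antipodal pairs make the coefficient $-\cos(w(a-b))$ close to $+1$, which is exactly where the main pizza is weakest.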
Figure 11: Training results from transformers with 2, 3 and 4 layers. Among all the trained transformers with 2, 3 and 4 layers, 9.95%, 11.55% and 6.08% are circular, respectively.

Figure 12: An illustration of the Accompanying Pizza algorithm.

Circle                  $w_k$                 $A_c$ FVE
#4 (accompanying #1)    $2\pi/59 \cdot 17$    97.56%
#5 (accompanying #2)    $2\pi/59 \cdot 3$     97.23%
#6 (accompanying #3)    $2\pi/59 \cdot 44$    97.69%

Table 3: After isolating accompanying circles in the input embedding, fraction of variance explained (FVE) of all of Model A's output logits by various formulas. Both model output logits and formula results are normalized to mean 0 and variance 1 before taking FVE. Accompanying and accompanied pizzas share the same $w_k$.

Figure 13: Correct logits of Model A (Pizza) after circle isolation. Only accompanying pizzas are displayed. Note the complementary logit pattern (Figure 4).

E Results in Other Linear Architectures

While this is not the primary focus of our paper, we also ran experiments on the following four linear model setups (see Section F.2 for setup details).

• For all models, we first encode input tokens $(a,b)$ with a trainable embedding layer $W_E$: $x_1 = W_{E,a}$, $x_2 = W_{E,b}$ (positional embedding removed for simplicity). $L_1, L_2, L_3$ are trainable linear layers. The outermost layers (commonly referred to as unembedding layers) have no biases; the other layers include biases for generality.
• Model α: output logits are $L_2(\mathrm{ReLU}(L_1(x_1 + x_2)))$.
• Model β: output logits are $L_3(\mathrm{ReLU}(L_2(\mathrm{ReLU}(L_1(x_1 + x_2)))))$.
• Model γ: output logits are $L_3(\mathrm{ReLU}(L_2(\mathrm{ReLU}(L_1(x_1) + L_1(x_2)))))$.
• Model δ: output logits are $L_2(\mathrm{ReLU}(L_1([x_1; x_2])))$, where $[x_1; x_2]$ denotes the concatenation of $x_1$ and $x_2$.

The results are shown in Figure 14. Rather surprisingly, Model α, Model β and Model δ gave radically different results. Model β and Model γ are very similar, and in general they are more pizza-like

Figure 14: Training results from linear models.
Each point in the first-row plots represents a training run; the second row shows histograms of distance irrelevance for each model type.

than Model α, with lower distance irrelevance and higher circularity. This could be explained by the addition of an extra linear layer. However, Model δ gave very different results from Model α, although both are one-layer linear models: Model δ is more likely to be non-circular and has very high distance irrelevance in general. In other words, concatenating instead of adding embeddings yields radically different behavior in a one-layer linear model. This result again underscores the significance of inductive biases in neural networks.

We also note that using different embeddings for the two tokens of Model α does not resolve the discrepancy. The following model:

• Model α′: output logits are $L_2(\mathrm{ReLU}(L_1(x_1 + x_2)))$, where $x_1 = W^A_{E,a}$, $x_2 = W^B_{E,b}$ on input $(a,b)$, and $W^A_E, W^B_E$ are different embedding layers,

gives roughly the same result as Model α (Figure 14, lower right corner).

Figure 15 shows the correct logits after circle isolation (Section 3.3) of a circular model of type Model β implementing the Pizza algorithm. Figure 16 shows the correct logits after circle isolation (Section 3.3) of a circular model of type Model δ. We can see that the pattern is similar to, but different from, that of the Clock algorithm (Figure 5). We leave the study of such models to future work.

Figure 15: Correct logits from Model β after circle isolation.

Figure 16: Correct logits from Model δ after circle isolation.

F Architecture and Training Details

F.1 Transformers

Here we describe our setup for the main experiments. See Appendix E and Appendix I for experiments on different setups.

Architecture: We train bidirectional transformers (attention unmasked) to perform modular addition mod $p$, where $p = 59$. To calculate $(a+b) \bmod p$, the input is provided to the model as a sequence of two tokens $[a, b]$.
The output logit at the last token is taken as the output of the model. For a transformer with "width" $d$, the input embedding and the residual stream are $d$-dimensional, 4 attention heads of $\lfloor d/4 \rfloor$ dimensions are employed, and the MLP has $4d$ hidden units. By default $d = 128$. ReLU is used as the activation function, and layer normalization is not applied. The post-softmax attention matrix is interpolated between an all-ones matrix and the original, as specified by the attention rate (Section 4.2). We note that the constant-attention transformer setup was also considered in previous work [28].

Data: Among all possible data points ($p^2 = 3481$ of them), we randomly select 80% as training samples and 20% as validation samples. This choice (small $p$ and a high training-data fraction) helps accelerate training.

Training: We use the AdamW optimizer [29] with learning rate $\gamma = 0.001$ and weight decay factor $\beta = 2$. We do not use minibatches; the shuffled training data is provided as a whole batch in every epoch. For each run, we train from scratch for 20,000 epochs. We removed the runs that did not reach 100% validation accuracy by the end of training (the majority of runs reached 100%).

F.2 Linear Models

Here we describe our setup for the linear model experiments (Appendix E).

Architecture: We train several types of linear models to perform modular addition mod $p$, where $p = 59$. The input embedding, residual stream and hidden layers are all $d = 256$ dimensional. ReLU is used as the activation function. The actual structures of the network types are specified in Appendix E.

Data & Training: Same as in the previous section (Section F.1).

F.3 Computing Resources

A total of 226 GPU-days of NVIDIA V100 compute was spent on this project, although we expect a replication would take significantly fewer resources.
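As a concrete illustration of the constant-attention setup described in F.1 (formalized in Appendix G), here is a minimal NumPy forward pass for the one-layer, two-token case at attention rate 0, where the post-softmax attention matrix is all ones. The weights are random stand-ins at a reduced width; this is a sketch of the dataflow, not a trained model.

```python
import numpy as np

# Random, untrained stand-in weights at a reduced width (the paper uses d = 128);
# purely an illustration of the constant-attention dataflow.
rng = np.random.default_rng(0)
p, d, n_heads = 59, 32, 4
dh = d // n_heads   # head dimension, floor(d/4)

W_E, W_pos = rng.normal(size=(p, d)), rng.normal(size=(2, d))
W_V = rng.normal(size=(n_heads, d, dh))
W_O = rng.normal(size=(n_heads, dh, d))
W_in, b_in = rng.normal(size=(d, 4 * d)), rng.normal(size=4 * d)
W_out, b_out = rng.normal(size=(4 * d, d)), rng.normal(size=d)
W_U = rng.normal(size=(d, p))

def logits(c1, c2):
    # Embedding: x_i = W_E[c_i] + W_pos[i]
    x1, x2 = W_E[c1] + W_pos[0], W_E[c2] + W_pos[1]
    # Constant attention: with an all-ones attention matrix, every head reads
    # the (unnormalized) sum of the residual stream over positions.
    pooled = x1 + x2
    attn = sum((pooled @ W_V[j]) @ W_O[j] for j in range(n_heads))
    y = x2 + attn                       # residual stream at the last position
    # MLP with residual connection
    z = y + b_out + np.maximum(0.0, y @ W_in + b_in) @ W_out
    return z @ W_U                      # output logits o = W_U z
```

Note that, as remarked in Appendix G, the query and key matrices never appear: at attention rate 0, the attention pattern is fixed.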
G Mathematical Description of the Constant-Attention Transformer

In this section, we examine the structure of constant-attention transformers, loosely following the notation of [10]. Denote the weight of the embedding layer as $W_E$, the positional embedding as $W_{pos}$, the value and output matrices of the $j$-th head of the $t$-th layer as $W^{t,j}_V$ and $W^{t,j}_O$, the weights and biases of the input linear map of the MLP in the $t$-th layer as $W^t_{in}$ and $b^t_{in}$, the corresponding weights and biases of the output linear map as $W^t_{out}$ and $b^t_{out}$, and the weight of the unembedding layer as $W_U$. Note that the query and key matrices are irrelevant, as the attention matrix is replaced with an all-ones matrix. Denote by $x^t$ the value of the residual stream after the first $t$ layers, and by $c_i$ the token in the $i$-th position; subscripts such as $x_i$ denote taking a specific position of a stream. We can formalize the logit calculation as follows:

• Embedding: $x^0_i = W_{E,c_i} + W_{pos,i}$.
• For each layer $t$ from 1 to $n_{layer}$:
  – Constant attention: $w^t_i = x^{t-1}_i + \sum_j W^{t,j}_O W^{t,j}_V \sum_k x^{t-1}_k$.
  – MLP: $x^t = w^t + b^t_{out} + W^t_{out}\,\mathrm{ReLU}(b^t_{in} + W^t_{in} w^t)$.
• Output: $O = W_U x^{n_{layer}}$.

In the particular case where the input length is 2, the number of layers is 1, and we focus on the logits at the last position, we may restate this as follows (writing $z$ for $x^1$ and $y$ for $w^1$):

• Embedding: $x_1 = W_{E,c_1} + W_{pos,1}$, $x_2 = W_{E,c_2} + W_{pos,2}$.
• Constant attention: $y = x_2 + \sum_j W^j_O W^j_V (x_1 + x_2)$.
• MLP: $z = y + b_{out} + W_{out}\,\mathrm{ReLU}(b_{in} + W_{in} y)$.
• Output: $o = W_U z$.

If we remove the skip connections, the network after the embedding can be seen as
$$o = L_U\, L_{out}\, \mathrm{ReLU}\Big(L_{in} \sum_j L^j_O L^j_V (x_1 + x_2)\Big),$$
where $L^j_V, L^j_O, L_{in}, L_{out}, L_U$ are linear layers corresponding to the matrices above.

H Pizza with Attention

Extrapolating from Figure 7, we trained transformers with width 1024 and attention rate 1 (normal attention).
After several tries, we were able to observe a trained circular model with distance irrelevance 0.156 and gradient symmetricity 0.995, which fits our definition of Pizza (Figure 17).

Figure 17: Correct logits of the trained model in Section H after circle isolation (Section 3.3).

I Results on Slightly Different Setups

We considered the following variations of our setups (Appendix F.1, Section 4); the existence of pizzas and clocks, as well as the phase changes, is still observed in all of them.

GeLU instead of ReLU: We conducted the same 1-layer transformer experiment with GeLU instead of ReLU as the activation function. Very similar results are observed (Figure 18).

Encoding the two tokens differently: We conducted the 1-layer transformer experiments with different embeddings for the two tokens. Again, very similar results are observed (Figure 19). We also discovered that the two tokens' embeddings are often aligned to implement the Pizza and Clock algorithms (Figure 20).

Adding an equal sign: We conducted the 1-layer transformer experiment with an equal sign added. Very similar results are observed (Figure 21).

J Pizza Occurs Early in Clock Training

We plotted intermediate states during the training of a model with attention (attention rate 1). A pizza-like pattern was observed early in training, but the pattern gradually disappeared during the run (Figure 22).

Figure 18: Training results from 1-layer transformers with GeLU instead of ReLU as the activation function. Each point in the plots represents a training run that reached 100% validation accuracy.

Figure 19: Training results from 1-layer transformers where the two tokens use different embeddings (feed $[a, b+p]$ to the model on input $(a,b)$; $2p$ tokens are handled in the embedding layer). Each point in the plots represents a training run that reached 100% validation accuracy. We did not use circularity to filter the results because it is no longer well-defined.
Figure 20: Correct logits after circle isolation from a trained model where the two tokens use different embeddings. The blue points represent the embeddings of the first token and the green points the embeddings of the second token. The model implements the Pizza algorithm. The correct-logit pattern is shifted relative to the previous patterns because the embeddings of the two tokens do not line up exactly. For example, the third circle has near-maximum correct logit for $a = 6$, $b = 3$ (the two points lining up at the top), and $(a-b)/18 \equiv 10 \pmod{59}$. This is why the correct-logit pattern appears shifted 10 units down.

Figure 21: Training results from 1-layer transformers where an equal sign is added (feed $[a, b, =]$ to the model on input $(a,b)$, where $=$ is a special token; $p+1$ tokens are handled in the embedding layer; the context length of the model becomes 3). Each point in the plots represents a training run that reached 100% validation accuracy. We did not use circularity to filter the results because it is no longer well-defined.

K Accompanying Pizza Occurs Early in Pizza Training

We plotted intermediate states during the training of a model without attention (attention rate 0). We observed the early emergence of a pattern similar to an accompanying pizza in training runs (Figure 23), and removing that circle brings accuracy down from 99.7% to 97.9%. Accompanying pizzas are less helpful later in training (removing them from the fully trained Model A only brings accuracy down to 99.7%).

Figure 22: For a 1-layer transformer with attention, correct logits after principal-component (possibly non-circle) isolation at various states during training. The pizza-like pattern gradually dissolved.

Figure 23: Intermediate state after 600 epochs of training for a 1-layer transformer with constant attention.
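Before the closer look at a trained linear model in Appendix L, the linear setups of Appendix E can be made concrete. The sketch below builds Model α (summed embeddings) and Model δ (concatenated embeddings) with random, untrained stand-in weights; its only purpose is to exhibit the inductive-bias difference noted in Appendix E, namely that summing hard-wires symmetry in $(a, b)$ while concatenation does not.

```python
import numpy as np

# Untrained sketch of Model alpha vs Model delta (random stand-in weights,
# not the trained models from Appendix E).
rng = np.random.default_rng(0)
p, d = 59, 256

W_E = rng.normal(size=(p, d))
W1_add, b1 = rng.normal(size=(d, d)) / np.sqrt(d), rng.normal(size=d)
W1_cat = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
W2 = rng.normal(size=(d, p)) / np.sqrt(d)   # unembedding layer, no bias

def model_alpha(a, b):
    # L2(ReLU(L1(x1 + x2))): embeddings are summed before the first layer.
    h = np.maximum(0.0, (W_E[a] + W_E[b]) @ W1_add + b1)
    return h @ W2

def model_delta(a, b):
    # L2(ReLU(L1([x1; x2]))): embeddings are concatenated instead.
    h = np.maximum(0.0, np.concatenate([W_E[a], W_E[b]]) @ W1_cat + b1)
    return h @ W2
```

Model α is symmetric in its two inputs for any choice of weights, whereas Model δ's output generally changes when the inputs are swapped; the symmetry must instead be learned, which is one plausible face of the "radically different behaviors" reported in Appendix E.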
L A Closer Look at a Linear Pizza Model

In this section, we provide a full picture of the linear model shown in Figure 15 by investigating the actual weights of the model.

L.1 Model Structure

As described in Appendix E, on input $(a,b)$, the output logits of the model are computed as
$$L_3(\mathrm{ReLU}(L_2(\mathrm{ReLU}(L_1(\mathrm{Embed}[a] + \mathrm{Embed}[b]))))).$$
Denote the weight of the embedding layer as $W_E$, the weight of the unembedding layer ($L_3$) as $W_U$, and the weights and biases of $L_1$ and $L_2$ as $W_1, b_1$ and $W_2, b_2$, respectively. Then the output logits on input $(a,b)$ can be written as
$$W_U\,\mathrm{ReLU}(b_2 + W_2\,\mathrm{ReLU}(b_1 + W_1(W_E[a] + W_E[b]))).$$

L.2 General Picture

We first perform principal component visualizations of the embedding and unembedding matrices. From Figure 24, we can see that the embedding and unembedding matrices form matching circles (circles with the same gap $\delta$ between adjacent entries).

We now give a general overview of the circuit. Each pair of matching circles forms an instance of Pizza, and the instances operate independently (with rather limited interference). Specifically, for each pair:

• The embedding matrix first places the inputs $a, b$ on the circumference: $W'_E[a] \approx (\cos(w_k a), \sin(w_k a))$ and $W'_E[b] \approx (\cos(w_k b), \sin(w_k b))$, where $w_k = 2\pi k/p$ for some integer $k \in [1, p-1]$ as in Section 2.1; $W'_E$ stands for the two currently considered principal components of $W_E$ (rotation and scaling omitted for brevity).
• The embeddings are added to get
$$(\cos(w_k a) + \cos(w_k b),\ \sin(w_k a) + \sin(w_k b)) = \cos(w_k(a-b)/2)\cdot\big(2\cos(w_k(a+b)/2),\ 2\sin(w_k(a+b)/2)\big).$$
• This is then passed through the first linear layer $L_1$. Each resulting entry, pre-ReLU, is a linear combination of the two dimensions of the aforementioned vector, i.e., $\cos(w_k(a-b)/2)\cdot(\alpha\cos(w_k(a+b)/2) + \beta\sin(w_k(a+b)/2))$ for some $\alpha, \beta$, which becomes $|\cos(w_k(a-b)/2)|\,|\alpha\cos(w_k(a+b)/2) + \beta\sin(w_k(a+b)/2)|$ after ReLU.
• These values are then passed through the second linear layer $L_2$.
Empirically, the ReLU here is not observed to be effective, as the majority of values are positive; the output entries are then simply linear combinations of the aforementioned outputs of $L_1$.
• The unembedding matrix is finally applied. In the principal components we are considering, $W'_U[c] \approx (\cos(w_k c), \sin(w_k c))$ ($W'_U$ stands for the two currently considered principal components of $W_U$; rotation and scaling omitted for brevity), and these two principal components correspond to a linear combination of the output entries of $L_2$, which in turn correspond to a linear combination of the outputs of $L_1$ (thanks to the non-functional ReLU).
• Similar to the approximation $|\cos(t)| - |\sin(t)| \approx \cos(2t)$ discussed in Appendix A, these linear combinations provide good approximations of $|\cos(w_k(a-b)/2)|\cos(w_k(a+b))$ and $|\cos(w_k(a-b)/2)|\sin(w_k(a+b))$. Finally, we arrive at
$$|\cos(w_k(a-b)/2)|\,\big(\cos(w_k c)\cos(w_k(a+b)) + \sin(w_k c)\sin(w_k(a+b))\big) = |\cos(w_k(a-b)/2)|\cos(w_k(a+b-c)).$$

Figure 24: Visualization of the principal components of the embedding and unembedding matrices.

L.3 Aligning Weight Matrices

We first verify that the ReLU in the second layer is not functional: after removing it, the accuracy of the model remains 100%, and the cross-entropy loss actually decreases from $6.20 \times 10^{-7}$ to $5.89 \times 10^{-7}$. Therefore, the model output can be approximately written as
$$W_U(b_2 + W_2\,\mathrm{ReLU}(b_1 + W_1(W_E[a] + W_E[b]))) = W_U b_2 + W_U W_2\,\mathrm{ReLU}(b_1 + W_1(W_E[a] + W_E[b])).$$

We now "align" the weight matrices $W_1$ and $W_2$ by mapping through the directions of the principal components of the embeddings and unembeddings. That is, we calculate how these matrices act on and onto the principal directions (consider $W_1 v$ for every principal direction $v$ of $W_E$, and $v^T W_2$ for every principal direction $v$ of $W_U$). We call the remaining dimensions of the aligned $W_1$ and $W_2$ the output and source dimensions, respectively (Figure 25).
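The alignment procedure just described can be sketched as follows. The matrices here are random stand-ins (with the trained model's actual weights, this is the computation behind Figure 25); the sketch is our illustration of the mapping, not the authors' code.

```python
import numpy as np

# Random stand-ins at toy sizes (the real model has d = 256); only the
# alignment mechanics are illustrated here.
rng = np.random.default_rng(0)
p, d, h, n_pc = 59, 16, 64, 4

W_E = rng.normal(size=(p, d))   # embedding
W_1 = rng.normal(size=(h, d))   # first linear layer (acts on embeddings)
W_2 = rng.normal(size=(d, h))   # second linear layer (feeds the unembedding)
W_U = rng.normal(size=(p, d))   # unembedding

def principal_directions(M, n):
    # Top right-singular vectors = principal directions of M's rows.
    _, _, Vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
    return Vt[:n]               # shape (n, d)

V_E = principal_directions(W_E, n_pc)   # directions the embedding writes to
V_U = principal_directions(W_U, n_pc)   # directions the unembedding reads from

aligned_W1 = W_1 @ V_E.T   # (h, n_pc): W_1 v for each embedding direction v
aligned_W2 = V_U @ W_2     # (n_pc, h): v^T W_2 for each unembedding direction v
```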
In the aligned weight matrices, we can see a clear domino-like pattern: in most output or source dimensions, only two principal components have significant non-zero values, and they correspond to a pair of matching circles, i.e., a pizza. In this way, every intermediate dimension serves exactly one pizza, so the pizzas do not interfere with each other.

Figure 25: Visualization of the aligned $W_1$ and $W_2$.

L.4 Approximation

Everything becomes much clearer after realigning the matrices. For a pizza and its two corresponding principal embedding/unembedding dimensions, $W'_E[a] + W'_E[b] \approx \cos(w_k(a-b)/2)\cdot(\cos(w_k(a+b)/2), \sin(w_k(a+b)/2))$ is mapped by the realigned $W_1$ into its corresponding columns (which are different for every pizza), added to $b_1$, and passed through ReLU. The result is then mapped by the realigned $W_2$, added to the realigned $b_2$, and finally multiplied by $(\cos(w_k c), \sin(w_k c))$. For the first two principal dimensions, the realigned $W_1$ has 44 corresponding columns (with coefficients of absolute value > 0.1).
Let the embedded input be $(x, y) = W'_E[a] + W'_E[b] \approx \cos(w_k(a-b)/2)\cdot(\cos(w_k(a+b)/2), \sin(w_k(a+b)/2))$. The intermediate columns are

ReLU([0.530x−1.135y+0.253, −0.164x−1.100y+0.205, 1.210x−0.370y+0.198, −0.478x−1.072y+0.215, −1.017x+0.799y+0.249, 0.342x−0.048y+0.085, 1.149x−0.598y+0.212, −0.443x+1.336y+0.159, −1.580x−0.000y+0.131, −1.463x+0.410y+0.178, 1.038x+0.905y+0.190, 0.568x+1.188y+0.128, 0.235x−1.337y+0.164, −1.180x+1.052y+0.139, −0.173x+0.918y+0.148, −0.200x+1.060y+0.173, −1.342x+0.390y+0.256, 0.105x−1.246y+0.209, 0.115x+1.293y+0.197, 0.252x+1.247y+0.140, −0.493x+1.252y+0.213, 1.120x+0.262y+0.239, 0.668x+1.096y+0.205, −0.487x−1.302y+0.145, 1.134x−0.862y+0.273, 1.143x+0.435y+0.171, −1.285x−0.644y+0.142, −1.454x−0.285y+0.218, −0.924x+1.068y+0.145, −0.401x+0.167y+0.106, −0.411x−1.389y+0.249, 1.422x−0.117y+0.227, −0.859x−0.778y+0.121, −0.528x−0.216y+0.097, −0.884x−0.724y+0.171, 1.193x+0.724y+0.131, 1.086x+0.667y+0.218, 0.402x+1.240y+0.213, 1.069x−0.903y+0.120, 0.506x−1.042y+0.153, 1.404x−0.064y+0.152, 0.696x−1.249y+0.199, −0.752x−0.880y+0.106, −0.956x−0.581y+0.223]).

For the first principal unembedding dimension, this is then taken in a dot product with

[1.326, 0.179, 0.142, −0.458, 1.101, −0.083, 0.621, 1.255, −0.709, 0.123, −1.346, −0.571, 1.016, 1.337, 0.732, 0.839, 0.129, 0.804, 0.377, 0.078, 1.322, −1.021, −0.799, −0.339, 1.117, −1.162, −1.423, −1.157, 1.363, 0.156, −0.165, −0.451, −1.101, −0.572, −1.180, −1.386, −1.346, −0.226, 1.091, 1.159, −0.524, 1.441, −0.949, −1.248].

Call this function $f(x, y)$. When we plug in $x = \cos(t)$, $y = \sin(t)$, we get a function that well-approximates $8\cos(2t + 2)$ (Figure 26). Therefore, letting $t = w_k(a+b)/2$, the dot product is approximately $8|\cos(w_k(a-b)/2)|\cos(w_k(a+b) + 2)$, or $|\cos(w_k(a-b)/2)|\cos(w_k(a+b))$ if we ignore the phase and scaling. This completes the picture described above.

Figure 26: $f(\cos(t), \sin(t))$ well-approximates $8\cos(2t + 2)$.