Paper deep dive
Grokking Group Multiplication with Cosets
Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman
Models: one-layer feedforward network (custom, trained on S5/S6 multiplication)
Summary
This paper investigates the mechanistic interpretability of deep neural networks that have 'grokked' the arithmetic of symmetric groups S5 and S6. By reverse engineering fully connected one-hidden layer networks, the authors demonstrate that these models discover the subgroup structure of the symmetric groups and converge on neural circuits that decompose group arithmetic using cosets. The study highlights the challenges in interpretability research by contrasting their findings with previous work by Chughtai et al.
Entities (5)
Relation Signals (3)
Dashiell Stander → authored → Grokking Group Multiplication with Cosets
confidence 100% · Paper title and author list
Neural Network → grokked → Symmetric Group S5
confidence 95% · fully connected one-hidden layer networks that have “grokked” the arithmetic of the permutation groups S5
Neural Network → discovered → Subgroup Structure
confidence 90% · The models discover the true subgroup structure of the full group
Cypher Suggestions (2)
Find all mathematical structures studied in the paper · confidence 90% · unvalidated
MATCH (n:Entity {entity_type: 'Mathematical Structure'}) RETURN n.name
Identify relations between models and mathematical tasks · confidence 85% · unvalidated
MATCH (m:Model)-[r]->(t:Task) RETURN m.name, type(r), t.name
Abstract
The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer fully connected one-hidden-layer networks that have "grokked" the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We relate how we reverse engineered the model's mechanisms and confirmed our theory was a faithful description of the circuit's functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. [4], which alleges to find a different algorithm for this same problem.
Tags
Links
- Source: https://arxiv.org/abs/2312.06581
- Canonical: https://arxiv.org/abs/2312.06581
Full Text
Grokking Group Multiplication with Cosets
Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman

Abstract: The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer fully connected one-hidden-layer networks that have "grokked" the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We relate how we reverse engineered the model's mechanisms and confirmed our theory was a faithful description of the circuit's functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. [4], which alleges to find a different algorithm for this same problem.

Keywords: Machine Learning, Grokking, Interpretability, Group Theory, Harmonic Analysis

1 Introduction

Many methods have been proposed to render deep neural networks interpretable. There is both an academic interest in understanding how neural networks do what they do and a societal interest in ensuring that decisions made by such models are sound, unbiased, and subject to human review. These concerns are not new, nor are they unique to deep neural networks. Many of the techniques developed (such as SHAP values [33], saliency maps [52], gradient attribution [51], dimension reduction [61], etc.) are still widely used today, but there is an understanding that such methods must be used as just one part of a careful analysis. Naive applications of even the most sophisticated algorithms will give misleading results [1, 2, 9, 25].
Mechanistic interpretability seeks to find "neural circuits" within deep neural networks: small sub-networks that act as connected computation graphs and accomplish a task. In "toy" (highly constrained) settings mechanistic interpretability has been successful, with multiple examples where the inner workings of neural networks have been reverse engineered end-to-end [20, 40, 41, 49, 63]. There have also been encouraging early successes in finding interpretable circuits within real-world models [18, 32, 35, 44, 57], but there is already work emerging that illustrates how neural networks can resist common "mechanistic interpretability" methods [14, 34, 60]. The toy interpretability projects that have succeeded have done so in large part because a distinct ground-truth circuit, encoding the true nature of the task or environment, emerged in the model. We build on this tradition and study a model that has perfectly learned to multiply permutations of five and six elements; in mathematics these are the symmetric groups $S_5$ and $S_6$, deeply studied and well-understood objects [8, 10, 15]. We succeed in completely reverse engineering the model and enumerating the diverse circuits that it converges on to implement the multiplication of the symmetric group.

Our work does not, however, represent an unmitigated success for the project of mechanistic interpretability. The prior work of Chughtai et al. [4] studied the exact same model and setting, but came to completely different conclusions. Understanding why our interpretation and that of Chughtai et al. [4] diverged from the same data required extensive effort (see Appendix 7 for a thorough comparison). We find that even in a setting as simple and well understood as group arithmetic, it is incredibly difficult to do interpretability research and be confident about one's conclusions.
Our main contributions are as follows:
• We completely reverse engineer a one-hidden-layer fully connected network trained on the permutation groups $S_5$ and $S_6$.
• We apply a methodology inspired by Geiger et al. [17] to use causal experiments to thoroughly test all of the properties of our proposed circuit.
• We survey current research in mechanistic interpretability and draw connections between the difficulty of our work and broader challenges in the field.

2 Related Work

Mechanistic Interpretability. Interpreting and reverse engineering the mechanisms used to complete a given task is an active field in interpretability. Such mechanisms and circuits are discovered mainly through a top-down approach of causal mediation analysis. In previous work (Hanna et al. [21], Meng et al. [36], Tigges et al. [54], Wang et al. [58]), circuits are composed at the "component level" out of feed-forward layers and attention heads. We instead analyze the mechanisms of neural networks at the level of individual and small groups of neurons, drawing directly on the work of Nanda et al. [40, 41], Olah et al. [43], Quirke & Barez [49], Zhong et al. [64], Zhang et al. [63]. Our work builds directly on "A Toy Model of Universality" by Chughtai et al. [4]. We recreated precisely their experimental setup for the groups $S_5$ and $S_6$, though we came to different conclusions.

Grokking. The models we study exhibit "grokking," wherein the model first memorizes the training set and only much later generalizes perfectly to the held-out data. Grokking was first identified by Power et al. [47] and has been well studied for its counter-intuitive training dynamics [31, 37, 50, 55, 59, 62]. We conducted all the analysis on fully grokked models with perfect test accuracy, as models that show this behavior have often formed clean generalizing circuits that are more easily interpreted [20, 40].
Group Theory. We used many of the tools of group theory for our analysis, in particular the well-developed representation theory of the symmetric group. Tools for analyzing data on groups are well laid out in Clausen & Baum [5], Cohen & Welling [6], Diaconis [8], Kondor [29], Kondor & Trivedi [30], Huang et al. [24], Karjol et al. [27], Plumb et al. [46].

3 Mathematical Preliminaries

Figure 1: Model Architecture: we follow the model architecture used by Chughtai et al. [4]. The one-hot vectors of the left and right permutations pass through separate embeddings. We concatenate the embeddings and pass them through a single fully connected hidden layer with ReLU activations. An unembedding matrix transforms the activations into logits.

This paper requires a familiarity with functions on groups, a topic that is uncommon in machine learning research. In this section we give an overview of the major concepts as they are realized in the permutation groups that we study. For a more formal introduction to group theory, please refer to Appendix D.

3.1 Permutations and the Symmetric Group

A permutation of $n$ elements is a map $\sigma$ that sends one ordering of $n$ elements to a different ordering. For example, the order-reversing permutation on four elements would be:

$(1\;2\;3\;4) \overset{\sigma}{\mapsto} (4\;3\;2\;1)$

The identity permutation, denoted $e$, leaves the ordering unchanged:

$(1\;2\;3\;4) \overset{e}{\mapsto} (1\;2\;3\;4)$

We refer to specific permutations by identifying them with the image of their action on the elements $[n] \coloneqq \{1, 2, \ldots, n\}$ in increasing order. For the above example we would simply denote the order-reversing permutation on four elements as $(4\;3\;2\;1)$.
We multiply two permutations on $n$ elements $\sigma, \tau$ by composition, read from right to left. If $\sigma = (4\;3\;2\;1)$ and $\tau = (3\;2\;1\;4)$, then $\sigma\tau$ is the permutation we obtain by first applying $\tau$ and then applying $\sigma$ to the output of $\tau$:

$(1\;2\;3\;4) \overset{\tau}{\mapsto} (3\;2\;1\;4) \overset{\sigma}{\mapsto} (4\;1\;2\;3)$

First applying $\tau$ and then $\sigma$ has the same effect as just applying the permutation $(4\;1\;2\;3)$. Additionally, every permutation $\sigma$ has an inverse $\sigma^{-1}$ such that $\sigma\sigma^{-1} = e$. These properties make all of the permutations on $n$ elements a group called the symmetric group, which we write $S_n$. There are six permutations in $S_4$ that do not change the position of 4:

$(1\;2\;3\;4) \quad (2\;1\;3\;4) \quad (3\;2\;1\;4)$
$(1\;3\;2\;4) \quad (3\;1\;2\;4) \quad (2\;3\;1\;4)$

These six permutations form a subgroup of $S_4$ because multiplication is closed within that subset: multiplying any two permutations that leave 4 unchanged results in another permutation that leaves 4 unchanged. You can see that these six permutations are isomorphic to $S_3$ by simply "forgetting" about the 4 that is fixed in the fourth position.
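As a concrete check, the composition rule can be sketched in a few lines of Python, storing each permutation as the tuple of its image notation. The helper names (`compose`, `inverse`) are illustrative, not from the paper's released code:

```python
def compose(sigma, tau):
    """Return the product sigma*tau (apply tau first, then sigma).

    Permutations are tuples in image notation, e.g. (4, 3, 2, 1)
    is the order-reversing permutation on four elements.
    """
    return tuple(tau[s - 1] for s in sigma)

def inverse(sigma):
    # The unique permutation with sigma * inverse(sigma) = identity.
    inv = [0] * len(sigma)
    for i, s in enumerate(sigma):
        inv[s - 1] = i + 1
    return tuple(inv)

sigma = (4, 3, 2, 1)
tau = (3, 2, 1, 4)
product = compose(sigma, tau)
assert product == (4, 1, 2, 3)          # matches the worked example above

identity = tuple(range(1, 5))
assert compose(sigma, inverse(sigma)) == identity
```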
In this paper, we will refer to the subgroup of $S_n$ isomorphic to $S_{n-1}$ that leaves element $i$ fixed as $H_i$. One of the simplest types of permutations is a "transposition," a permutation $\tau \in S_n$ that switches ("transposes") two elements $i, j \in [n]$ and leaves the remaining elements fixed. Every element of $S_n$ can be decomposed into a product of transpositions. A given decomposition of a permutation is not unique, but the parity of the number of transpositions is an invariant of the permutation: for a permutation $g \in S_n$, if $\tau_1\tau_2\cdots\tau_k = g$ is one decomposition into transpositions, then the length of every other such decomposition has the same parity as $k$. The permutations that decompose into an even number of transpositions are referred to as "even" permutations and those with an odd number are "odd." The set of all even permutations in $S_n$ is a subgroup referred to as the "alternating group" $A_n$.

If we take $H_4 < S_4$ and multiply every element of it on the left by some element $\sigma \in S_4$, then we get a left coset of $H_4$, denoted $\sigma H_4$. The transposition $\tau = (4\;2\;3\;1)$ switches the elements in the first and fourth positions.
The elements of $\tau H_4$ are:

$(4\;2\;3\;1) \quad (4\;1\;3\;2) \quad (4\;2\;1\;3)$
$(4\;3\;2\;1) \quad (4\;1\;2\;3) \quad (4\;3\;1\;2)$

This coset is characterized by every element having 4 in the first position. Every element of $H_4$ has 4 in the fourth position and $\tau$ switches the first and fourth positions, so for any $h \in H_4$, $\tau h$ has 4 in the first position because $\tau$ moves it from the fourth. We would get a coset consisting of all the elements of $S_4$ with 4 in the third position if we multiplied $H_4$ on the left by any permutation that switches three and four. There are also right cosets, where every element of the subgroup is multiplied from the right. The elements of $H_4\tau$ are:

$(4\;2\;3\;1) \quad (2\;4\;3\;1) \quad (3\;2\;4\;1)$
$(4\;3\;2\;1) \quad (3\;4\;2\;1) \quad (2\;3\;4\;1)$

This right coset is characterized by every element having 1 in the fourth position. There are in fact four subgroups $H_i < S_4$ that are isomorphic to $S_3$, one fixing each of the elements $1, \ldots, 4$. In general there are at least $n$ subgroups of $S_n$ that are isomorphic to $S_{n-1}$.
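Both coset computations can be reproduced directly. A minimal sketch, with permutations stored as image-notation tuples and illustrative helper names (not the paper's code):

```python
from itertools import permutations

def compose(sigma, tau):
    # Product sigma*tau in image notation: apply tau first, then sigma.
    return tuple(tau[s - 1] for s in sigma)

S4 = list(permutations(range(1, 5)))
H4 = [p for p in S4 if p[3] == 4]            # subgroup fixing 4 in place
tau = (4, 2, 3, 1)                           # swaps elements 1 and 4

left_coset = {compose(tau, h) for h in H4}   # tau * H4
right_coset = {compose(h, tau) for h in H4}  # H4 * tau

# Every element of the left coset has 4 in the first position ...
assert all(p[0] == 4 for p in left_coset)
# ... and every element of the right coset has 1 in the fourth position.
assert all(p[3] == 1 for p in right_coset)
assert len(left_coset) == len(right_coset) == 6
```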
Any two $H_i, H_j$ are conjugate to each other. Conjugation by an element $\sigma$ maps $x \mapsto \sigma x \sigma^{-1}$. So if we take $H_4$ and conjugate it by $\sigma = (1\;4\;3\;2)$, then $\sigma H_4 \sigma^{-1}$ is $H_2$:

$(1\;2\;3\;4) \quad (3\;2\;1\;4) \quad (4\;2\;3\;1)$
$(1\;2\;4\;3) \quad (4\;2\;1\;3) \quad (3\;2\;4\;1)$

If a subgroup is invariant to conjugation it is a normal subgroup. The only normal subgroup of $S_n$ for $n > 4$ is the alternating group $A_n$ of even permutations. We will mostly refer to groups by name, but we will denote a general group as capital $G$ and a general subgroup as $H \le G$. For a proper subgroup ($H \neq G$), we will write $H < G$. For a normal subgroup, we will use $N \trianglelefteq G$.

3.2 Fourier Transform over Groups

Though the Group Fourier Transform is not central to our presentation of the coset circuit, it was an important tool that we used to analyze the activations of the trained models. It is also a critical part of [4]. We introduce the concepts here and go over the similarities and differences between our work and [4] in Section 7. We begin with a presentation of the Discrete Fourier Transform (DFT), and then present the Group Fourier Transform by analogy.
The DFT converts a function $f$ defined on $\{0, 1, \ldots, n-1\}$ to a complex-valued function via the formula:

$\hat{f}(k) = \sum_{t=0}^{n-1} f(t)\, e^{-2i\pi kt/n}, \qquad k \in \{0, \ldots, n-1\}$

The DFT is commonly interpreted as a conversion from the time domain to the frequency domain because the $e^{-2i\pi kt/n}$ terms define complex sinusoids in $t$ with frequency $2\pi k/n$. "Frequency domain" here means that these sinusoids provide an alternative orthonormal basis in which we can work with functions. A function on $\{0, 1, \ldots, n-1\}$ can be represented as a vector $f = (x_0\;\; x_1\;\; \ldots\;\; x_{n-1})^\top$ whose standard basis is given by the identity matrix $I_n$. The DFT defines a basis transformation, much like any other. The Fourier basis is given by $n$ vectors. The first basis vector, corresponding to $k = 0$, is all ones. The $k = 1$ basis vector is $(1\;\; e^{-2i\pi/n}\;\; \ldots\;\; e^{-2i\pi(n-1)/n})$, and in general the $k$th basis vector, for $k$ up to $n-1$, is $(1\;\; e^{-2i\pi k/n}\;\; \ldots\;\; e^{-2i\pi(n-1)k/n})$.
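The change of basis can be verified numerically. A minimal sketch, assuming NumPy is available, builds the matrix whose rows are the Fourier basis vectors and compares applying it against `numpy.fft.fft`:

```python
import numpy as np

n = 8
t = np.arange(n)
# Rows are the Fourier basis vectors: F[k, t] = exp(-2j*pi*k*t/n).
F = np.exp(-2j * np.pi * np.outer(t, t) / n)

rng = np.random.default_rng(0)
f = rng.standard_normal(n)

# Applying the matrix to f is exactly the DFT of f.
assert np.allclose(F @ f, np.fft.fft(f))

# The k = 0 basis vector is all ones, and the 1/sqrt(n)-scaled basis
# is orthonormal (F F^H / n = I).
assert np.allclose(F[0], np.ones(n))
assert np.allclose(F @ F.conj().T / n, np.eye(n))
```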
The DFT has a particularly nice interpretation as a function on the cyclic group $C_n$, which is isomorphic to addition modulo $n$. Please refer to Appendix D or to references such as [8, 15, 29] for a more detailed discussion. The interpretation of the DFT as being over the cyclic groups can be generalized to non-commutative groups. We go over the construction in Appendices E and F. The high-level interpretation, however, is the same. For functions $S_n \to \mathbb{C}$ there is an orthonormal basis that is equivariant to translations and convolutions. The frequencies for the Fourier transform over $S_n$ are given by the partitions of $n$. The "highest" frequencies can be interpreted as representing functions that are constant on permutations that all agree on a small number of elements of $[n]$ [12].

4 Model Architecture

As shown in Figure 1, the model we study contains separate left and right embeddings, followed by a fully connected linear layer with ReLU activations, and an unembedding layer. We use the same architecture as in [4] to enable consistent comparisons. (All code necessary for reproducing results and analysis is available at https://github.com/dashstander/sn-grok.)
• One-hot vectors $x_g$ with length $|G|$.
• Two embedding matrices, $E_l, E_r$, with dimensions $(d, |G|)$, where $d$ is the embedding dimension. $S_n$ is non-abelian, i.e. not commutative, and the separate embeddings give the model extra capacity.
• A linear layer $W$ with dimension $(w, 2d)$, with $w$ denoting the width of the linear layer. After the linear layer we apply the ReLU pointwise nonlinearity.
• An unembedding layer $U$ with dimension $(|G|, w)$, which transforms the outputs of the ReLU into logits over the group.
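The shapes above can be sketched as a NumPy forward pass. The weights here are random stand-ins, not trained parameters, and the helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
G, d, w = 120, 64, 128              # |S_5| = 120; d and w are illustrative

E_l = rng.standard_normal((d, G))   # left embedding
E_r = rng.standard_normal((d, G))   # right embedding
W = rng.standard_normal((w, 2 * d)) # hidden linear layer
U = rng.standard_normal((G, w))     # unembedding

def one_hot(i, n=G):
    x = np.zeros(n)
    x[i] = 1.0
    return x

x_g, x_h = one_hot(3), one_hot(17)  # arbitrary group-element indices
pre = W @ np.concatenate([E_l @ x_g, E_r @ x_h])   # pre-activations
logits = U @ np.maximum(pre, 0.0)                  # ReLU, then unembed

# Splitting W column-wise separates the left and right contributions.
L_mat, R_mat = W[:, :d], W[:, d:]
assert np.allclose(pre, L_mat @ E_l @ x_g + R_mat @ E_r @ x_h)
assert logits.shape == (G,)
```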
We also note that the first $d$ columns of the linear layer act only on the left embeddings and the second $d$ columns act only on the right embeddings, so we can analyze $W$ as the concatenation of two $(w, d)$ matrices: $W = [L\;R]$.

$W \begin{bmatrix} E_l x_g \\ E_r x_h \end{bmatrix} = L E_l x_g + R E_r x_h$

Throughout the paper we will refer to the values $L E_l x_g$, $R E_r x_h$, and their sum as "pre-activations" to denote that the ReLU activation function has not been applied. Post-ReLU values we refer to as "activations."

5 Coset Circuits

5.1 Sign Neurons Implement the Sign Circuit

Figure 2: A diagram showing the four possible paths through a single neuron (i.e. one row of $R E_r$) that implements part of a "sign circuit." The model stores whether a permutation is "even" or "odd" in the embeddings, represented in the left or right pre-activation values. The pre-activations are added together and then the ReLU activation is applied. The neuron only fires when the left permutation is even and the right is odd. If the neuron does not fire, then in 1/3 of cases the product is odd and in 2/3 it is even.

The even permutations form a subgroup called the alternating group $A_n$. The two cosets of $A_n$ are the group itself and the set of all odd permutations, $\tau A_n$. The multiplication of even and odd permutations has similar features to the addition of even and odd integers (hence the name).
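This even/odd structure can be checked exhaustively on a small symmetric group. A sketch that computes parity by counting inversions (helper names are illustrative):

```python
from itertools import combinations, permutations

def sign(p):
    # Parity via inversion count: even permutations have sign +1.
    inversions = sum(p[i] > p[j] for i, j in combinations(range(len(p)), 2))
    return 1 if inversions % 2 == 0 else -1

def compose(sigma, tau):
    # Product sigma*tau in image notation: apply tau first, then sigma.
    return tuple(tau[s - 1] for s in sigma)

S4 = list(permutations(range(1, 5)))
A4 = [p for p in S4 if sign(p) == 1]
assert len(A4) == len(S4) // 2      # A_n contains half of the permutations

# Like even/odd integers under addition: the sign of a product is the
# product of the signs.
for sl in S4:
    for sr in S4:
        assert sign(compose(sl, sr)) == sign(sl) * sign(sr)
```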
The sign map on a permutation in $S_n$, $\mathrm{sgn}$, is given by:

$\mathrm{sgn}(\sigma) = \begin{cases} 1 & \sigma \in A_n \\ -1 & \sigma \in \tau A_n \end{cases}$

An "even" permutation in $A_n$ is mapped to 1 and an "odd" permutation not in $A_n$ is mapped to $-1$. For any $\sigma, \rho \in S_n$, the sign of their product is the product of their signs: $\mathrm{sgn}(\sigma\rho) = \mathrm{sgn}(\sigma)\mathrm{sgn}(\rho)$. The one-layer model that we train uses this relationship to help solve the general group multiplication. Every single model we trained had at least two neurons dedicated to encoding the sign of the permutation product. Though the model cannot use the alternating group to completely solve multiplication in $S_n$, this sign circuit is emblematic of the general coset circuits the model forms.

Consider the neuron shown in Fig. 2. The left pre-activations are given by $L(\sigma) = \mathrm{sgn}(\sigma)$ and the right pre-activations are $R(\sigma) = -\mathrm{sgn}(\sigma)$. The full action of the neuron is given by $\mathrm{ReLU}(L(\sigma_l) + R(\sigma_r))$ and there are three cases:

1. $\mathrm{sgn}(\sigma_l) = \mathrm{sgn}(\sigma_r) \Rightarrow \mathrm{sgn}(\sigma_l\sigma_r) = 1$. In this case $L(\sigma_l)$ and $R(\sigma_r)$ destructively interfere, cancelling out to 0. Both the pre-activation and activation are 0.
2. $\mathrm{sgn}(\sigma_l) = 1,\; \mathrm{sgn}(\sigma_r) = -1 \Rightarrow \mathrm{sgn}(\sigma_l\sigma_r) = -1$. In this case $L(\sigma_l)$ and $R(\sigma_r)$ reinforce each other and sum to a positive value. Since $2 > 0$, the activation value is 2.

3. $\mathrm{sgn}(\sigma_l) = -1,\; \mathrm{sgn}(\sigma_r) = 1 \Rightarrow \mathrm{sgn}(\sigma_l\sigma_r) = -1$. Like in (2), the product $\sigma_l\sigma_r$ is an odd permutation and $L(\sigma_l)$ and $R(\sigma_r)$ constructively interfere, though this time $L(\sigma_l) + R(\sigma_r) = -2$, which is less than 0. Thus the ReLU clips the pre-activation and sends it to 0.

5.2 Conjugate Subgroup Circuit

All four ways to multiply two cosets of $A_n$ are well-defined: for each of the four options (even-even, odd-even, etc.) we know which coset of $A_n$ the product will be in. No other subgroup of $S_n$ has this property, so the model instead learns to use sets of conjugate subgroups. Recall that $H_i < S_n$ is the subgroup isomorphic to $S_{n-1}$ that fixes the element $i \in [n]$ in the $i$th place, and $\tau_{ij}$ is the permutation that swaps $i$ and $j$. Any two $H_i$ and $H_j$ are conjugate to each other: $\tau_{ij} H_i \tau_{ij} = H_j$ and $\tau_{ij} H_j \tau_{ij} = H_i$.
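This conjugacy relation can be verified exhaustively on $S_5$. A sketch with permutations as image-notation tuples (helper names are illustrative; note that each transposition is its own inverse):

```python
from itertools import permutations

def compose(sigma, tau):
    # Product sigma*tau in image notation: apply tau first, then sigma.
    return tuple(tau[s - 1] for s in sigma)

def transposition(i, j, n):
    # tau_ij swaps i and j and fixes everything else.
    img = list(range(1, n + 1))
    img[i - 1], img[j - 1] = j, i
    return tuple(img)

n = 5
S5 = list(permutations(range(1, n + 1)))
# H_i: the subgroup fixing element i in the i-th position.
H = {i: {p for p in S5 if p[i - 1] == i} for i in range(1, n + 1)}

for i in range(1, n + 1):
    for j in range(1, n + 1):
        if i == j:
            continue
        t = transposition(i, j, n)
        conjugated = {compose(compose(t, h), t) for h in H[i]}
        assert conjugated == H[j]    # tau_ij H_i tau_ij = H_j
```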
This means that there are two shared cosets between $H_i$ and $H_j$, because $H_i \tau_{ij} = \tau_{ij} H_j$ and $H_j \tau_{ij} = \tau_{ij} H_i$. The model implements the full group multiplication by picking out the shared cosets of conjugate subgroups. As an example, consider a neuron that corresponds to $H_1$ for the left permutation and $H_5$ for the right permutation. The shared coset is $H_1 \tau_{15} = \tau_{15} H_5$, the set of all $\sigma \in S_5$ with $\sigma(1) = 5$. The pre-activations for the left and right permutations will be:

$L(\sigma) = \begin{cases} 4 & \sigma \in H_1 \\ 2 & \sigma \in H_1\tau_{12} \\ 0 & \sigma \in H_1\tau_{13} \\ -2 & \sigma \in H_1\tau_{14} \\ -4 & \sigma \in H_1\tau_{15} \end{cases} \qquad R(\sigma) = \begin{cases} -4 & \sigma \in \tau_{15}H_5 \\ -2 & \sigma \in \tau_{25}H_5 \\ 0 & \sigma \in \tau_{35}H_5 \\ 2 & \sigma \in \tau_{45}H_5 \\ 4 & \sigma \in H_5 \end{cases}$ (1)

The final activation is still $\mathrm{ReLU}(L(\sigma_l) + R(\sigma_r))$, but now there are twenty-five possible pairs of cosets. All twenty-five combinations can be boiled down to two meaningful cases:

1. If $L(\sigma_l) + R(\sigma_r) = 0$, then $\sigma_l\sigma_r$ is in the shared coset $H_1\tau_{15}$.
2. If $L(\sigma_l) + R(\sigma_r) \neq 0$, then $\sigma_l\sigma_r$ is not in the shared coset $H_1\tau_{15}$.

Each left coset $yH_5$ has a paired right coset $H_1x$ such that $H_1xyH_5 = H_1\tau_{15} = \tau_{15}H_5$. The discrete values that $L$ and $R$ can take are precisely tuned so that those pairs of left and right cosets cancel out. Just like with the sign neuron, information about the pre-activation being negative is lost with the ReLU. This lost information has to be made up with extra neurons that correspond to $(H_1, H_5)$ and assign different values to the cosets. For example, a different neuron that uses $-L(\sigma_l) - R(\sigma_r)$ will fail to fire for a different set of permutations. The combination $\mathrm{ReLU}(L(\sigma_l) + R(\sigma_r)) + \mathrm{ReLU}(-L(\sigma_l) - R(\sigma_r))$ will be much closer to a perfect on/off switch for coset membership.
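The two cases can be verified exhaustively on $S_5$. A sketch of the idealized pre-activations with permutations as image-notation tuples and illustrative helper names; it uses the coset characterizations $\sigma \in H_1\tau_{1a} \iff \sigma(1) = a$ and $\sigma \in \tau_{b5}H_5 \iff \sigma(b) = 5$, which follow from the definitions above:

```python
from itertools import permutations

def compose(sigma, tau):
    # Product sigma*tau in image notation: apply tau first, then sigma.
    return tuple(tau[s - 1] for s in sigma)

def L(sigma):
    # sigma is in the right coset H_1 tau_{1a} exactly when sigma(1) = a;
    # the cosets H_1, H_1 tau_12, ..., H_1 tau_15 take values 4, 2, 0, -2, -4.
    a = sigma[0]
    return 6 - 2 * a

def R(sigma):
    # sigma is in the left coset tau_{b5} H_5 exactly when sigma(b) = 5;
    # the cosets tau_15 H_5, ..., tau_45 H_5, H_5 take values -4, -2, 0, 2, 4.
    b = sigma.index(5) + 1
    return 2 * b - 6

S5 = list(permutations(range(1, 6)))
for sl in S5:
    for sr in S5:
        # The product is in the shared coset iff it maps 1 to 5,
        # which happens exactly when the pre-activations cancel.
        in_shared_coset = compose(sl, sr)[0] == 5
        assert (L(sl) + R(sr) == 0) == in_shared_coset
```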
5.3 Decoding Permutations with Coset Membership

There are $n^2$ combinations of $(H_i, H_j)$ subgroups. Each pair can be interpreted directly as encoding the set of permutations with $i$ in the $j$th position. Because of the way the coset neurons function, each neuron is better understood as firing when the value in the $j$th position is certainly not $i$. The $n^2$ combinations of $(H_i, H_j)$ uniquely identify each element of $S_n$: we can use the outputs of twenty-five $(H_i, H_j)$ neurons as a code that uniquely encodes each element of $S_5$. By analyzing the unembedding layer to see how the model makes use of the $(H_i, H_j)$ neurons, we see that this is almost exactly what the model does. This same construction works for every subgroup of $S_n$ except for $A_n$.

Figure 3: An illustration of the phenomenon of "concentration on cosets," depicting the 115th neuron from seed 11. We show the evolution of the left pre-activations (the pre-ReLU outputs of a layer) of an $F_{20}$ neuron from 100k to 130k steps of training. The seed of the neuron's functionality is already present at 100k steps, where it fires very strongly and negatively for permutations in the coset $F_{20}(1\;2\;3\;5\;4)$, but it takes time for the action of the neuron to "clean up" on the other cosets of $F_{20}$. The distribution found at 130k steps does not change very much afterwards.

Noticing this common pattern of neurons taking on discrete values was a striking piece of evidence that required further investigation.
6 The Process of Reverse Engineering

6.1 Identifying Coset Circuits

The first step in attempting to reverse engineer the mechanisms of a neural network is to spend some time staring at the weights and activations. Even a small one-layer model such as ours is too large to visualize all at once. It was not until we looked closely at the pre-ReLU activations that we produced a histogram similar to Figure 3. The left and right pre-activations of one neuron were nearly constant on the distinct cosets of the Frobenius group of order 20 ($F_{20}$), one of the subgroups of $S_5$.² Further investigation revealed that almost every neuron had this property of producing only a discrete number of values, values that corresponded directly to the cosets of one of the subgroups of $S_5$ or $S_6$.

For a function $f : G \to \mathbb{R}$, we define $C_H(f)$ to be the degree to which $f$ concentrates on the cosets of a subgroup $H \leq G$:

$$C_H(f) \coloneqq \frac{\sum_{gH} \operatorname{Var}[f|_{gH}]}{\operatorname{Var}[f]},$$

where $\operatorname{Var}[f|_{gH}]$ is the variance of $f$ when its domain is restricted to the coset $gH$. Intuitively, $C_H(f)$ measures the degree to which restricting to the cosets of $H$ reduces the variance of $f$. If $C_H(f) < 1$, the activations $f$ can be meaningfully better understood by looking at the values they take on the cosets of the subgroup.

² $F_{20}$ is equivalent to the group of affine transformations $x \mapsto ax + b$, where $a, b, x$ are in the field with five elements and $a \neq 0$.
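The measure $C_H(f)$ is simple to compute by brute force on a small group. A sketch (pure Python; $S_4$ and the cyclic subgroup generated by a 4-cycle serve as stand-ins, with population variance throughout):

```python
from itertools import permutations
from statistics import pvariance

def compose(p, q):
    # (p ∘ q)(i) = p(q(i)); permutations stored as tuples.
    return tuple(p[i] for i in q)

def left_cosets(G, H):
    cosets, seen = [], set()
    for g in G:
        if g in seen:
            continue
        coset = frozenset(compose(g, h) for h in H)
        seen |= coset
        cosets.append(coset)
    return cosets

def coset_concentration(f, G, H):
    # C_H(f) = (sum over cosets gH of Var[f restricted to gH]) / Var[f].
    within = sum(pvariance([f[g] for g in c]) for c in left_cosets(G, H))
    return within / pvariance([f[g] for g in G])

G = list(permutations(range(4)))  # S_4
r = (1, 2, 3, 0)                  # a 4-cycle
H = [(0, 1, 2, 3), r, compose(r, r), compose(compose(r, r), r)]

# A function constant on each left coset of H concentrates perfectly.
f_const = {g: float(k) for k, c in enumerate(left_cosets(G, H)) for g in c}
print(coset_concentration(f_const, G, H))      # 0.0

# A generic function does not: C_H(f) > 0.
f_generic = {g: float(i) for i, g in enumerate(G)}
print(coset_concentration(f_generic, G, H) > 0)  # True
```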
Recall that a single neuron is a function $N_i : S_n \times S_n \to \mathbb{R}$ given by the sum of two functions $G \to \mathbb{R}$, one for the left and one for the right permutation. We can calculate $\min_H C_H$ for each. Take as an example $N^l_{115}$, the neuron shown in Figure 3. At 100,000 steps (on the far left), $\operatorname{Var}[N^l_{115}] = 5.23$. Its activations are not yet concentrated on the cosets of $F_{20}$, however, and $C_{F_{20}}(N^l_{115}) = 2.96$. At 130,000 steps (on the far right), $\operatorname{Var}[N^l_{115}]$ has increased to 9.06, but $C_{F_{20}}(N^l_{115}) < 10^{-5}$: the distribution within each coset of $F_{20}$ has close to zero variance. We see a typical example of what this looks like for the entire model in Figure 4. As the validation loss approaches a small value, there is a rapid transition of the median coset concentration from approximately 1 to a minuscule value.

Even if it is apparent that a neuron is taking on discrete values and is a good candidate for being a coset neuron, it is difficult to tell by sight which subgroup the neuron is activating for. $S_5$ and $S_6$ have only 156 and 1,455 subgroups, respectively,³ so it is tractable to do an exhaustive search and calculate $\operatorname{argmin}_{H \in \operatorname{Sub}(G)} C_H(f)$, the subgroup that minimizes the coset concentration of $f$, for every neuron in the model.

³ Sequence A005432 in the OEIS [42].
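The exhaustive search can be illustrated on $S_3$, where the subgroups can themselves be enumerated by brute force. A sketch (pure Python; it considers only proper non-trivial subgroups, since the trivial subgroup's singleton cosets make $C_H(f) = 0$ for every $f$):

```python
from itertools import combinations, permutations
from statistics import pvariance

def compose(p, q):
    return tuple(p[i] for i in q)

def left_cosets(G, H):
    cosets, seen = [], set()
    for g in G:
        if g in seen:
            continue
        coset = frozenset(compose(g, h) for h in H)
        seen |= coset
        cosets.append(coset)
    return cosets

def coset_concentration(f, G, H):
    within = sum(pvariance([f[g] for g in c]) for c in left_cosets(G, H))
    return within / pvariance([f[g] for g in G])

G = list(permutations(range(3)))  # S_3: small enough to enumerate subgroups
e = (0, 1, 2)

# A finite subset containing the identity and closed under composition
# is a subgroup (finiteness supplies inverses for free).
subgroups = []
for k in range(2, len(G)):  # proper, non-trivial orders only
    for S in combinations(G, k):
        S = frozenset(S)
        if e in S and all(compose(a, b) in S for a in S for b in S):
            subgroups.append(S)

# The sign function is constant on the two cosets of A_3.
def sign(p):
    inv = sum(p[i] > p[j] for i in range(3) for j in range(i + 1, 3))
    return float((-1) ** inv)

f = {g: sign(g) for g in G}
best = min(subgroups, key=lambda H: coset_concentration(f, G, list(H)))
print(sorted(best))  # [(0, 1, 2), (1, 2, 0), (2, 0, 1)], i.e. A_3
```

As expected, the argmin recovers $A_3$, the subgroup on whose cosets the sign function is constant.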
Running these calculations shows that, for the 128 $S_5$ models and 100 $S_6$ models we trained, over 99.2% of the neurons in the linear layer had $\min_{H \in \operatorname{Sub}(G)} C_H(f) < 1.0$, and the vast majority of those were less than $10^{-6}$. With the ability to calculate directly which neurons corresponded to which subgroup, our theories for exactly what the neurons were representing fell into place. The next step was to confirm that these neurons were actually responsible for the models' performance.

Figure 4: The paired evolution of the validation loss and $\min_{H \in \operatorname{Sub}(G)} C_H$, which tracks the formation of coset circuits. Displayed is the $S_5$ model with random seed 1. Different runs form coset circuits at different times in training, but the effect is representative.

Figure 5: We perform ablations by re-calculating the accuracy after removing any neurons $N_i$ that have $\min_{H \in \operatorname{Sub}(G)} C_H(N_i)$ greater than (top figure) or less than (bottom figure) the thresholds on the x-axis.

6.2 Ablations

We have described how coset neurons function and how they can be identified. We will now show via ablations that coset neurons are not only sufficient but also necessary to implement multiplication in $S_n$. We conduct ablations by removing neurons whose coset concentration $\min_{H \in \operatorname{Sub}(G)} C_H(N_i)$ is above a threshold. If coset circuits are in fact responsible for the performance of our models, then we expect to see no change in accuracy when the neurons that have not converged onto the cosets of a subgroup are removed from the model.
This is precisely what we see on the far right of Figure 5. Of the 128 $S_5$ models that we trained, 126 saw no change in accuracy when we removed the neurons with $\min_{H \in \operatorname{Sub}(G)} C_H(N_i) \geq 1$. Recall that if $C_H(N_i) \geq 1$, restricting to the cosets of $H$ at best does not reduce the variance of $N_i$. The two models that did show a decrease dropped only to 99% and 98% accuracy. We see more between-run variation when we remove more neurons. The median model has 24 out of 128 neurons with $\min_H C_H(N_i) \geq 10^{-5}$, but the 50th- and 25th-percentile accuracy is still 100%. It is not until we set the threshold to $10^{-6}$ that the 25th percentile moves at all. When we set the threshold at $10^{-7}$ the performance of many models collapses, but the median model has had 42 neurons removed and the median accuracy is still 100%. Recall also that the neuron shown in the far right of Figure 3 has a coset concentration of $10^{-5}$. The overwhelming majority of neurons are identifiable as coset neurons, and of those, the neurons with the very highest concentration on cosets account for the largest portion of each model's performance.
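The ablation protocol itself is simple to express. A schematic numpy sketch, in which every tensor is a random stand-in for the trained model's weights and scores, so it illustrates only the masking step, not the reported accuracies:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, batch, n_logits = 128, 32, 120  # hypothetical sizes for S_5

# Stand-ins: per-neuron min_H C_H(N_i) scores, post-ReLU activations,
# and the unembedding matrix.
concentration = rng.uniform(0.0, 2.0, size=n_neurons)
hidden = np.maximum(rng.normal(size=(batch, n_neurons)), 0.0)
unembed = rng.normal(size=(n_neurons, n_logits))

def logits_after_ablation(threshold):
    # Keep only neurons whose coset concentration is below the threshold;
    # every other neuron is zeroed out before the unembedding.
    keep = concentration < threshold
    return (hidden * keep) @ unembed

# An infinite threshold keeps every neuron: identical to the full model.
assert np.allclose(logits_after_ablation(np.inf), hidden @ unembed)
print((concentration >= 1.0).sum(), "neurons ablated at threshold 1.0")
```

In the paper, accuracy over the full group is recomputed after this masking for a sweep of thresholds.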
6.3 Causal Interventions

Table 1: Causal interventions aggregated over 128 runs on $S_5$ with different random seeds.

Intervention                     Mean Accuracy   Mean Loss
Base Model                       99.99%          1.97e-6
Embedding Swap                   1%              4.76
Switch Left and Right Sign       100%            1.97e-6
Switch Left Permutation Sign     0%              22.39
Switch Right Permutation Sign    0%              22.36
Perturb $N(0, 0.1)$              99.99%          2.96e-6
Perturb $N(0, 1)$                97.8%           0.0017
Absolute Value Non-Linearity     100%            3.69e-13
Perturb $N(1, 1)$                88%             0.029
Perturb $N(-1, 1)$               98%             0.0021

To rigorously test the properties of the coset circuit, we designed causal experiments that each probe a specific property of the circuits. Observing a circuit's behavior over the entire data distribution (the full group $S_n$) shows that our model of the circuit is consistent with the behavior of the true circuit. To confirm that our model of the circuit is correct, however, we need to "break" the circuit in targeted ways and check that it responds the way we predict; neural circuits are complex enough that observational evidence alone is not sufficient. We aggregated results over 128 $S_5$ models trained with different random seeds and recorded their average loss and accuracy. Without intervention, the models' accuracy is extremely close to 1.

Embedding Exchange. The left and right embeddings encode different information (membership in right and left cosets, respectively) and cannot be interchanged. To test this we intervene to swap the left and right embeddings. After the intervention, accuracy drops to 1% and the loss rises sharply, matching our expectation that the two kinds of membership cannot be switched.

Switch Permutation Sign. The pre-activations are symmetric about the origin: the absolute sign of a pre-activation does not matter, only whether or not it equals zero.
The relative sign of the left and right pre-activations, however, should matter a great deal. To test this we run three interventions: flipping the sign of only the left embeddings, of only the right embeddings, and of both. When we flip both signs the cancellation is preserved: if $L(\sigma_l) + R(\sigma_r) = 0$ then $-L(\sigma_l) - R(\sigma_r) = 0$ as well, so we expect near-perfect accuracy and near-zero loss, and that is what we observe. When we flip the sign of only the left or only the right embedding, the cancellation no longer holds, and accordingly we observe 0% accuracy in both cases.

Absolute Value Non-linearity. The circuit can create a near-perfect 0–1 coset membership switch by combining multiple neurons constructively, but every single neuron is noisy and fundamentally limited by the ReLU non-linearity. To test this we replace the ReLU activation function with the absolute value function $x \mapsto |x|$. We observe perfect accuracy and a loss several orders of magnitude below the original.

Distribution Change. It is essential to the functioning of each neuron that a large proportion of the pre-activations be close to zero. To test this we compare how adding noise drawn from $N(-1, 1)$ versus $N(1, 1)$ affects the performance of the model. Shifting the pre-activations with $N(-1, 1)$ noise degrades performance much less than shifting them with $N(1, 1)$ noise, consistent with the circuit using 0 as the threshold that decides membership: a negative shift is largely absorbed by the ReLU, while a positive shift is not.

The results of these interventions are shown in Table 1.

7 The Group Composition via Representations Algorithm

Our experimental setup is identical to that of Chughtai et al. [4], but our analysis led us to a different conclusion. Chughtai et al. [4] proposed the "Group Composition via Representations" (GCR) algorithm.
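GCR proposes that the model computes $ab$ as $\operatorname{argmax}_{c}\operatorname{tr}[\rho(a)\rho(b)\rho(c^{-1})]$ for a representation $\rho$. That identity itself is easy to check numerically. A sketch using the (reducible) permutation representation of $S_4$ as a stand-in for a learned irrep, since its character is also uniquely maximized at the identity, which is all the argmax requires:

```python
from itertools import permutations
import numpy as np

def compose(p, q):
    return tuple(p[i] for i in q)

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def rho(p):
    # Permutation representation: rho(p)[p[i], i] = 1, so that
    # rho(p) @ rho(q) == rho(compose(p, q)).
    M = np.zeros((len(p), len(p)))
    for i, v in enumerate(p):
        M[v, i] = 1.0
    return M

G = list(permutations(range(4)))
a, b = (1, 0, 3, 2), (2, 3, 1, 0)

# tr[rho(a) rho(b) rho(c^-1)] counts the fixed points of a·b·c^-1,
# which is maximal (= 4) exactly when c = ab.
scores = {c: np.trace(rho(a) @ rho(b) @ rho(inverse(c))) for c in G}
pred = max(scores, key=scores.get)
print(pred == compose(a, b))  # True
```

The paper's dispute is not with this identity but with whether the trained network actually implements it.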
They show that, given an irrep $\rho$ of $S_n$,

$$\operatorname*{argmax}_{c \in S_n} \operatorname{tr}[\rho(a)\rho(b)\rho(c^{-1})] = ab,$$

and propose that this is the algorithm the model is implementing. This requires not only that the model store the matrix irreps, but that it perform the matrix multiplication within its mechanism. We find that most of the evidence [4] put forward is also consistent with coset circuits; the rest we were not able to independently replicate. We also find evidence that, to our understanding, is not consistent with the GCR algorithm but is explained by coset circuits.

7.1 Our Interpretation of the Evidence for GCR

Chughtai et al. [4] put forward four main pieces of evidence, which we restate here for clarity: (1) correlation between the model's logits and the characters of a learned representation $\rho$; (2) the embedding and unembedding layers function as a "lookup table" for the representations of the input elements $\rho(a), \rho(b)$ and the inverse of the target $\rho(c^{-1})$; (3) the neurons in the linear layer calculate the matrix product $\rho(a)\rho(b) = \rho(ab)$; (4) ablations showing that the circuit they identify is responsible for the majority of the model's performance. Several of these points are equally consistent with the coset circuit; for the others we could not find supporting evidence.

Ablations. Though we do not perform all of the exact ablations that Chughtai et al. [4] perform, we likewise find that the weights that show high Fourier concentration and perform the coset multiplication are integral to the model's performance; see Section 6.2.

Irrep Lookup Table. We were not able to find any evidence that the embedding or unembedding layers function as a lookup table for any representation except the one-dimensional sign representation.
We did find that the model's weights and activations concentrate on specific irreps in the group Fourier basis. This is due, however, to concentration on cosets of specific subgroups, not because the matrix representations are realized anywhere in the weights. The relationship between functions that are constant on cosets and specific irreps is shown in Appendix G.2.

Logit Attribution. The trace of a group representation is referred to as the "character" and is often denoted $\chi$. We find that the model's logits correlate with the character $\chi_\rho(abc^{-1})$ when the irrep $\rho$ appears in the Fourier transform of the model's weights. This is not, however, because the model has implemented the matrix product $\rho(ab)\rho(c^{-1})$, but because the model is "counting" the number of cosets that $ab$ and $c$ are both in. We prove in Appendix G.2 that if the cosets are of conjugate subgroups whose Fourier transforms concentrate on the irrep $\rho$ (as we observe for the models in question), then the number of shared cosets will also correlate with the characters of $\rho$.

Matrix Multiplication of Irreps. We were not able to find any evidence that the linear layer implements matrix multiplication, again excepting scalar multiplication by the sign irrep.

7.2 Evidence GCR Does Not Explain

Concentration on Cosets. In the standard basis, the pre-activations of the overwhelming majority of neurons concentrate heavily on the cosets of subgroups. This behavior is not predicted by the GCR algorithm.

The Difference Between Subgroups and Irreps. The GCR algorithm and the coset circuit cannot be equivalent because there is not, in fact, a one-to-one relationship between cosets and irreps.
Most subgroups of $S_n$ have Fourier transforms that concentrate on more than a single irrep (see Table 5 for the spectral properties of all of the subgroups of $S_5$); indeed this must be the case, as there are many more subgroups than irreps. Please refer to Table 4 for a concrete comparison and Appendix 7.2 for an asymptotic analysis. We also observe that coset circuits for some subgroups, such as $D_{10}$,⁴ will concentrate on either the $(3, 2)$ or the $(2, 2, 1)$ irrep, depending on the run. The GCR algorithm would treat these as different circuits, though their behavior is in fact identical.

⁴ The dihedral group of order 10, the symmetry group of a pentagon.

Unembedding Correlations of Neurons. We observe that the correlation between the unembeddings of neurons that concentrate on the same coset is on average 81.4% (see Table 3). The correlation between neurons concentrated only on the same conjugacy class of subgroups (e.g. $H_1$ and $H_2$) is on average −0.2%. Neurons that represent subgroups in the same conjugacy class will often, though not always, be concentrated on the same irrep. The model treats cosets of the same subgroup together, but irreps and conjugacy classes separately.

Coset-Circuit-Specific Causal Interventions. The fact that the loss goes down when we replace the ReLU activation function with absolute value is a very strange property that GCR does not predict.

The concentration of the model's activations on irreps of $S_n$ is striking evidence, and the GCR algorithm that [4] detail could indeed solve the problem of group multiplication. The coset circuit, however, is consistent with all of the evidence that [4] provide, and it is additionally consistent with evidence that the GCR algorithm does not explain.
8 Discussion and Conclusion

We performed a circuit-level analysis to discover the concrete mechanism a one-layer fully connected network uses to solve group multiplication in $S_5$ and $S_6$. We showed that the model decomposes $S_5$ and $S_6$ into the cosets of their subgroups and uses this structural information to perfectly implement the task. Though our work concerns a toy problem, we highlight a core takeaway that applies broadly to the field of interpretability: we must treat proposed neural mechanisms as theories until they have been thoroughly tested. When we identify what we believe to be a circuit within a larger network, found via techniques such as [7, 19], we have taken the first step towards mechanistically understanding how a model performs a task. The evidence we have for the circuit's role in that task is, however, fundamentally observational and correlational. The nodes in the circuit's computation graph are causally connected, but the action of those nodes is only observed to be correlated with a certain task with respect to a distribution. This is valuable information to have, but the understanding it imparts is limited and must be recognized as such.

When beginning this project we quickly noticed that the activations of sub-circuits of our model were concentrated on specific irreps of $S_n$. It was only with additional investigation that we were able to attach semantic meaning to this phenomenon. We observed that the neurons concentrated on a single irrep were activating for specific subgroups. The hypothesis of the coset circuit had formed, but it was still only a theory. The facts we had observed were incontrovertible, but their explanation was unclear. It was only after performing the causal experiments detailed in Section 6.3 that we became confident we understood the mechanism.
The simple reality is that more than one theory can be consistent with observational data, especially when that data only comes from a small subset of the full distribution. There is a long history of scholarship showing that interpretability techniques, including state-of-the-art ones, can give misleading and contradictory results [1, 2, 3, 9, 14, 23, 25, 34, 35]. In doing this work we had many advantages not available when interpreting real-world models: access to the entire distribution, an orthonormal basis for the function space of the network, and a relatively small model. The task of multiplication in $S_n$ is deterministic and very well studied, so we had many mathematical tools to bring to bear in analyzing the model. Even so, this project was quite challenging and the circuits we found surprised us. Interpreting real models will be even more difficult. We encourage future work to apply interpretability tools cautiously and to validate observational results with rigorous experimental tests.

Impact Statement

This paper presents work whose goal is to make the function and mechanisms of deep neural networks interpretable to humans. We present methods for reasoning about counterfactual and out-of-distribution behavior in the models that we train. Though our setting is too small to be directly relevant to real-world use cases, we hope that similar techniques will be able to test, audit, and monitor deep neural networks that have been deployed in the real world. We also present results that urge caution and humility when attempting to interpret neural networks. We believe that robust and effective interpretability techniques may mitigate some societal harms that could arise from the use of deep neural networks, but that mistakenly trusting illusory interpretability techniques could be disastrous.
Acknowledgements

We would like to thank Coreweave for donating the computing resources that we used to run all of our experiments, Bilal Chughtai for helpful discussions throughout the project, and Neel Nanda for telling us to "not hold back for fear of offending [him]." We would also like to thank Nora Belrose, Neils uit de Bos, Aidan Ewart, Sara Price, Hailey Schoelkopf, Cédric Simal, and Benjamin Wright for their helpful feedback on earlier drafts of this paper.

References

Adebayo et al. [2020] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps, 2020.
Bolukbasi et al. [2021] Bolukbasi, T., Pearce, A., Yuan, A., Coenen, A., Reif, E., Viégas, F., and Wattenberg, M. An interpretability illusion for BERT, 2021.
Casper et al. [2023] Casper, S., Li, Y., Li, J., Bu, T., Zhang, K., Hariharan, K., and Hadfield-Menell, D. Red teaming deep neural networks with feature synthesis tools, 2023.
Chughtai et al. [2023] Chughtai, B., Chan, L., and Nanda, N. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations. Technical Report arXiv:2302.03025, arXiv, May 2023. URL http://arxiv.org/abs/2302.03025.
Clausen & Baum [1993] Clausen, M. and Baum, U. Fast Fourier Transforms for Symmetric Groups: Theory and Implementation. Mathematics of Computation, 61(204):833–847, 1993. doi: 10.2307/2153256. URL https://www.jstor.org/stable/2153256.
Cohen & Welling [2016] Cohen, T. and Welling, M. Group Equivariant Convolutional Networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2990–2999. PMLR, June 2016. URL https://proceedings.mlr.press/v48/cohenc16.html.
Conmy et al. [2023] Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards Automated Circuit Discovery for Mechanistic Interpretability, October 2023.
URL http://arxiv.org/abs/2304.14997.
Diaconis [1988] Diaconis, P. Group Representations in Probability and Statistics, volume 11 of Institute of Mathematical Statistics Lecture Notes. Institute of Mathematical Statistics, Hayward, CA, 1988. ISBN 0-940600-14-5.
Doshi-Velez & Kim [2017] Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning, 2017.
Dummit & Foote [2003] Dummit, D. S. and Foote, R. M. Abstract Algebra. Wiley, 3rd edition, July 2003. ISBN 978-0-471-43334-7.
Stein & Shakarchi [2003] Stein, E. M. and Shakarchi, R. Fourier Analysis: An Introduction. Princeton Lectures in Analysis 1. Princeton University Press, 2003. ISBN 0-691-11384-X.
Ellis et al. [2017] Ellis, D., Friedgut, E., and Pilpel, H. Intersecting Families of Permutations, July 2017. URL http://arxiv.org/abs/1011.3342.
Erdős [1942] Erdős, P. On an elementary proof of some asymptotic formulas in the theory of partitions. Annals of Mathematics, 43(3):437–450, 1942. URL http://www.jstor.org/stable/1968802.
Friedman et al. [2023] Friedman, D., Lampinen, A., Dixon, L., Chen, D., and Ghandeharioun, A. Interpretability illusions in the generalization of simplified models, 2023.
Fulton & Harris [1991] Fulton, W. and Harris, J. Representation Theory. Graduate Texts in Mathematics. Springer, New York, NY, October 1991. ISBN 978-0-387-97495-8.
GAP [2023] GAP – Groups, Algorithms, and Programming, Version 4.12.2, 2023. URL https://www.gap-system.org.
Geiger et al. [2023] Geiger, A., Potts, C., and Icard, T. Causal abstraction for faithful model interpretation, 2023.
Geva et al. [2021] Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories, 2021.
Goldowsky-Dill et al. [2023] Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. Localizing model behavior with path patching, 2023.
Gromov [2023] Gromov, A. Grokking modular arithmetic, January 2023.
URL http://arxiv.org/abs/2301.02679.
Hanna et al. [2023] Hanna, M., Liu, O., and Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, November 2023. URL http://arxiv.org/abs/2305.00586.
Harris et al. [2020] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
Hase et al. [2023] Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023.
Huang et al. [2009] Huang, J., Guestrin, C., and Guibas, L. Fourier Theoretic Probabilistic Inference over Permutations. Journal of Machine Learning Research, 10(37):997–1070, 2009. URL http://jmlr.org/papers/v10/huang09a.html.
Jain & Wallace [2019] Jain, S. and Wallace, B. C. Attention is not explanation. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:67855860.
Janusz & Rotman [1982] Janusz, G. and Rotman, J. Outer automorphisms of $S_6$. The American Mathematical Monthly, 89(6):407–410, 1982. URL http://www.jstor.org/stable/2321657.
Karjol et al. [2023] Karjol, P., Kashyap, R., and Ap, P. Neural Discovery of Permutation Subgroups. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 4668–4678. PMLR, April 2023.
URL https://proceedings.mlr.press/v206/karjol23a.html.
Kingma & Ba [2014] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:6628106.
Kondor [2008] Kondor, R. Group theoretical methods in machine learning. PhD thesis, Columbia University, New York, NY, 2008. URL https://dl.acm.org/doi/abs/10.5555/1570977.
Kondor & Trivedi [2018] Kondor, R. and Trivedi, S. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups. In Proceedings of the 35th International Conference on Machine Learning, pp. 2747–2755. PMLR, July 2018. URL https://proceedings.mlr.press/v80/kondor18a.html.
Kumar et al. [2023] Kumar, T., Bordelon, B., Gershman, S. J., and Pehlevan, C. Grokking as the Transition from Lazy to Rich Training Dynamics, October 2023. URL http://arxiv.org/abs/2310.06110.
Lieberum et al. [2023] Lieberum, T., Rahtz, M., Kramár, J., Nanda, N., Irving, G., Shah, R., and Mikulik, V. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla, 2023.
Lundberg & Lee [2017] Lundberg, S. and Lee, S.-I. A unified approach to interpreting model predictions, 2017.
Makelov et al. [2023] Makelov, A., Lange, G., and Nanda, N. Is this the subspace you are looking for? An interpretability illusion for subspace activation patching, 2023.
McGrath et al. [2023] McGrath, T., Rahtz, M., Kramar, J., Mikulik, V., and Legg, S. The hydra effect: Emergent self-repair in language model computations, 2023.
Meng et al. [2023] Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT, 2023.
Merrill et al. [2023] Merrill, W., Tsilivis, N., and Shukla, A. A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks, March 2023. URL http://arxiv.org/abs/2303.11873.
Morwani et al. [2024] Morwani, D., Edelman, B. L., Oncescu, C.-A., Zhao, R., and Kakade, S. Feature emergence via margin maximization: Case studies in algebraic tasks, 2024.
Nanda & Bloom [2022] Nanda, N. and Bloom, J. TransformerLens. https://github.com/neelnanda-io/TransformerLens, 2022.
Nanda et al. [2023a] Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. Technical Report arXiv:2301.05217, arXiv, January 2023a. URL http://arxiv.org/abs/2301.05217.
Nanda et al. [2023b] Nanda, N., Lee, A., and Wattenberg, M. Emergent Linear Representations in World Models of Self-Supervised Sequence Models, September 2023b. URL http://arxiv.org/abs/2309.00941.
OEIS Foundation Inc. [2023] OEIS Foundation Inc. The On-Line Encyclopedia of Integer Sequences, 2023. Published electronically at http://oeis.org.
Olah et al. [2020] Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom In: An Introduction to Circuits. Distill, 5(3):e00024.001, March 2020. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.
Olsson et al. [2022] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context Learning and Induction Heads, September 2022. URL https://arxiv.org/abs/2209.11895v1.
Paszke et al. [2019] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.
PyTorch: An Imperative Style, High-Performance Deep Learning Library, December 2019. URL http://arxiv.org/abs/1912.01703.
Plumb et al. [2015] Plumb, G., Pachauri, D., Kondor, R., and Singh, V. SnFFT: A Julia Toolkit for Fourier Analysis of Functions over Permutations. Journal of Machine Learning Research, 16(107):3469–3473, 2015. URL http://jmlr.org/papers/v16/plumb15a.html.
Power et al. [2022] Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URL http://arxiv.org/abs/2201.02177.
Pyber [1993] Pyber, L. Enumerating finite groups of given order. Annals of Mathematics, 137(1):203–220, 1993. URL http://www.jstor.org/stable/2946623.
Quirke & Barez [2024] Quirke, P. and Barez, F. Understanding addition in transformers, 2024.
Rubin et al. [2023] Rubin, N., Seroussi, I., and Ringel, Z. Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks, October 2023. URL http://arxiv.org/abs/2310.03789.
Shrikumar et al. [2017] Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Not just a black box: Learning important features through propagating activation differences, 2017.
Simonyan et al. [2014] Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
Stein et al. [2023] Stein, W. et al. Sage Mathematics Software (Version 10.0.0). The Sage Development Team, 2023. URL http://www.sagemath.org.
Tigges et al. [2023] Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. Linear representations of sentiment in large language models, 2023.
Varma et al. [2023] Varma, V., Shah, R., Kenton, Z., Kramár, J., and Kumar, R. Explaining grokking through circuit efficiency, September 2023. URL http://arxiv.org/abs/2309.02390.
Vink et al. [2023] Vink, R., Gooijer, S. d., Beedie, A., Gorelli, M. E., Zundert, J. v., Hulselmans, G., Grinstead, C., Santamaria, M., Guo, W., Heres, D., Magarick, J., Marshall, ibENPC, Peters, O., Leitao, J., Wilksch, M., Heerden, M. v., Borchert, O., Jermain, C., Haag, J., Peek, J., Russell, R., Pryer, C., Castellanos, A. G., Goh, J., illumination-k, Brannigan, L., Conradt, M., and Robert. pola-rs/polars: Python Polars 0.19.0, August 2023. URL https://doi.org/10.5281/zenodo.8301818. Wang et al. [2022a] Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, November 2022a. URL http://arxiv.org/abs/2211.00593. arXiv:2211.00593 [cs]. Wang et al. [2022b] Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022b. Wei et al. [2022] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models, 2022. Wen et al. [2023] Wen, K., Li, Y., Liu, B., and Risteski, A. Transformers are uninterpretable with myopic methods: a case study with bounded dyck grammars. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=OitmaxSAUu. Wold et al. [1987] Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987. Xu et al. [2023] Xu, Z., Wang, Y., Frei, S., Vardi, G., and Hu, W. Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data, October 2023. URL http://arxiv.org/abs/2310.02541. arXiv:2310.02541 [cs, stat]. Zhang et al. [2023] Zhang, S. D., Tigges, C., Biderman, S., Raginsky, M., and Ringer, T. 
Can Transformers Learn to Solve Problems Recursively?, June 2023. URL http://arxiv.org/abs/2305.14699. arXiv:2305.14699 [cs]. Zhong et al. [2023] Zhong, Z., Liu, Z., Tegmark, M., and Andreas, J. The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks, June 2023. URL http://arxiv.org/abs/2306.17844. arXiv:2306.17844 [cs].

Appendix A Author Contributions

Dashiell Wrote the code for training and for calculating the Group Fourier Transform over $S_n$. Performed the initial analyses of models trained on $S_5$ and initially found what we came to call the coset circuit. Designed and ran causal experiments to confirm our understanding of the coset circuit. Derived formal properties of the coset circuit. Participated in discussions throughout the project and in writing the paper.

Qinan Ran training jobs and performed the bulk of circuit analysis on $S_6$, designed and ran ablation experiments and causal interchange interventions, and participated in discussions throughout the project and the writing of the paper.

Honglu Derived formal properties of the coset circuit, participated in discussions throughout the project, and contributed to the writing of the paper.

Stella Helped scope the problem and identify and plan the core experiments. Advised on the interpretation of the analysis and the writing of the paper.

Appendix B Structure of the Appendix

In the Appendix we provide more of the mathematical background needed to fully describe some of our results and techniques. In particular, we explain the Group Fourier Transform and how we used it to analyze our models. We do this because we believe it is of independent interest and also because it is necessary to fully explain where our results and those of Chughtai et al. [4] diverge. In Appendix C we go over the precise experimental setup of the models that we trained.
In Appendix D we introduce the necessary concepts from group theory needed to rigorously discuss the more mathematical aspects of our results. In Appendix E we introduce representation theory, representations of the symmetric group, and the group Fourier transform. In Appendix G we return to the coset circuit and coset neurons, with the presentation grounded in the mathematical concepts introduced in Appendices D and E. Finally, in Appendix H we present extra graphs that did not fit in the main paper, and in Appendix I we present a table of all of the conjugacy classes of subgroups of $S_5$.

Appendix C Experiment Details

We conducted experiments focusing on the permutation groups $S_5$ and $S_6$. All models were trained on NVIDIA GeForce RTX 2080 GPUs. All models were implemented in PyTorch (Paszke et al. [45]) and trained with the Adam optimizer [28] with a fixed learning rate of 0.001, weight decay set to 1.0, $\beta_1 = 0.9$, and $\beta_2 = 0.98$. At the beginning of each training run, the training set is sampled uniformly from all $|S_n|^2$ combinations of permutations. Each optimization step was made on the entire training set. Using our setup, a single $S_5$ model trained in approximately 8 hours and a single $S_6$ model in approximately 100 hours, though multiple training jobs could be scheduled on a single GPU. Analysis and reverse engineering were performed with Vink et al. [56], Nanda & Bloom [39], Harris et al. [22], GAP [16], and Stein et al. [53].

Table 2: Experiment hyperparameters.

Group | % Train Set | Num. Runs | Num. Epochs | Linear Layer Size | Embedding Size
$S_5$ | 40%         | 128       | 250,000     | 128               | 256
$S_6$ | 40%         | 100       | 50,000      | 256               | 512

Appendix D Group Theory

In this section, let us recall some basic definitions and propositions in group theory that are relevant to this paper.
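Before turning to the definitions, the data setup of Appendix C above can be made concrete. The sketch below (helper names and the exact sampling code are ours, not the paper's; the 40% split and the enumeration of all $|S_n|^2$ ordered pairs follow Table 2) generates the training data for an $S_n$ multiplication task:

```python
import itertools
import random

def make_dataset(n, train_frac=0.4, seed=0):
    """Enumerate all |S_n|^2 ordered pairs of permutations and sample a train split."""
    perms = list(itertools.permutations(range(1, n + 1)))

    def compose(g, h):
        # Composition convention matching this paper's examples: (g * h)(i) = h(g(i)).
        return tuple(h[g[i] - 1] for i in range(n))

    # One labeled example per ordered pair: input (g, h), target g * h.
    pairs = [((g, h), compose(g, h)) for g in perms for h in perms]
    train = random.Random(seed).sample(pairs, int(train_frac * len(pairs)))
    return perms, pairs, train

perms, pairs, train = make_dataset(5)
```

For $S_5$ this yields $120^2 = 14{,}400$ labeled pairs, of which 5,760 form the training set; the model then trains full-batch on that fixed subset.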
D.1 Groups

A group $G$ is a nonempty set equipped with a special element $e \in G$ called the identity and a multiplication operator $\cdot$ satisfying the following:
• (inverse) For each element $a \in G$, there exists an element $b \in G$ such that $a \cdot b = b \cdot a = e$.
• (identity) For each element $a \in G$, $a \cdot e = e \cdot a = a$.
• (associativity) For elements $a, b, c \in G$, we have $(a \cdot b) \cdot c = a \cdot (b \cdot c)$.
The inverse of $a \in G$ is denoted by $a^{-1}$.

Example D.1. The set of integers $\mathbb{Z}$ together with addition $+$ forms a group; the identity element is $0$. The same is true, with addition, of the rational numbers $\mathbb{Q}$, the real numbers $\mathbb{R}$, and the complex numbers $\mathbb{C}$.

Example D.2. The symmetric group introduced in Section 3.1, together with composition of permutations, satisfies the group axioms. The identity element is the identity permutation, which leaves every element unchanged.

Example D.3. The set of natural numbers $\mathbb{N}$ with addition does not form a group: inverse elements do not exist except for $0$.

Definition D.4. Given a group $G$, a subgroup $H$ is a subset of $G$ such that
• $a \cdot b \in H$ for any $a, b \in H$;
• $e \in H$;
• $a^{-1} \in H$ for any $a \in H$.
One can check that $H$ together with the multiplication satisfies the group axioms as well. $H$ being a subgroup of $G$ is denoted by $H \leq G$.

D.2 Cosets and double cosets

Definition D.5. Given a proper subgroup $H < G$ and an element $g \in G$, the set $gH := \{gh \mid h \in H\}$ is called a left $H$-coset. Similarly, $Hg := \{hg \mid h \in H\}$ is called a right $H$-coset. $gH$ is sometimes simply called a coset if the subgroup $H$ is clear from the context.
When we do not mention whether a coset is a left coset or a right coset, left coset is the default.

Lemma D.6. Two cosets $g_1 H$ and $g_2 H$ are either the same subset of $G$ or disjoint (i.e., $g_1 H \cap g_2 H = \emptyset$).

Lemma D.7. If $G$ is a finite group, any two $H$-cosets have the same number of elements.

As a result, one can pick suitable (but not unique) representative elements $g_1, \dots, g_n \in G$ so that $g_1 H, \dots, g_n H$ form a partition of $G$. Because the cosets have equal sizes, we can also conclude that $|G|$ is always divisible by $|H|$.

Definition D.8. Given two subgroups $H, L < G$ and an element $g \in G$, the set $HgL := \{hgl \mid h \in H, l \in L\}$ is called the $(H, L)$-double coset, or simply the double coset if the pair $(H, L)$ is clear from the context.

Double cosets enjoy a similar property to cosets:

Lemma D.9. Two double cosets $Hg_1L$ and $Hg_2L$ are either the same or disjoint.

As a result, $G$ can be similarly decomposed as a disjoint union of $(H, L)$-double cosets. However, when $G$ is finite, the $(H, L)$-double cosets do not always have equal sizes, so the decomposition is not equal-sized. For simplicity, we call the $(H, H)$-double coset the $H$-double coset.

D.3 Normal Subgroups

Definition D.10. A subgroup $N$ is normal in $G$, denoted $N \trianglelefteq G$, if for any $g \in G$ and any $n \in N$, we have $gng^{-1} \in N$.

A subgroup is normal if and only if its left and right cosets coincide, i.e., for any $g \in G$, $gN = Ng$.
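These facts are easy to check computationally. The following minimal sketch (the helper names and composition convention are ours) represents permutations in one-line notation and verifies Lemmas D.6, D.7, and D.9 on $S_3$:

```python
import itertools

def compose(g, h):
    """Group product g * h: apply g first, then h, matching the paper's examples."""
    return tuple(h[g[i] - 1] for i in range(len(g)))

def inverse(g):
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi - 1] = i + 1
    return tuple(inv)

G = list(itertools.permutations((1, 2, 3)))
e = (1, 2, 3)
H = [(1, 2, 3), (2, 1, 3)]  # an order-2 subgroup of S_3

# Group axioms (Appendix D.1): identity and inverses.
assert all(compose(g, e) == compose(e, g) == g for g in G)
assert all(compose(g, inverse(g)) == e for g in G)

# Lemmas D.6 and D.7: the left cosets gH partition G into equal-sized pieces.
cosets = {frozenset(compose(g, h) for h in H) for g in G}
assert len(cosets) == 3 and all(len(c) == 2 for c in cosets)
assert set().union(*cosets) == set(G)

# Lemma D.9: the double cosets HgH also partition G, but with unequal sizes.
doubles = {frozenset(compose(compose(h1, g), h2) for h1 in H for h2 in H)
           for g in G}
assert set().union(*doubles) == set(G)
assert sorted(len(d) for d in doubles) == [2, 4]
```

Here $H$ is generated by a single transposition; its two double cosets have sizes 2 and 4, illustrating that the double-coset decomposition need not be equal-sized.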
Normal subgroups are important because they are precisely the subgroups for which the set of $N$-cosets $G/N$ has a natural group structure.

Definition D.11. Given a group $G$ and a normal subgroup $N \trianglelefteq G$, the quotient group $G/N$ is defined to be the set of $N$-cosets endowed with the multiplication $gN \cdot hN = ghN$ for any $g, h \in G$.

The well-definedness of this multiplication is a consequence of $N$ being normal, and the group axioms are straightforward to check.

Example D.12. If $G$ is commutative (for every $g, h \in G$ we have $gh = hg$), every subgroup $H \leq G$ is normal.

Example D.13. If $G = S_n$, the subgroup $S_{n-1}$ fixing the first element is not a normal subgroup. On the other hand, the alternating subgroup $A_n$ (consisting of the even permutations) is a normal subgroup of $S_n$.

The double cosets of a normal subgroup are simply the usual cosets.

Lemma D.14. Given a normal subgroup $H \trianglelefteq G$, the left $H$-cosets and the right $H$-cosets are in one-to-one correspondence. Furthermore, the set of $H$-double cosets is also in one-to-one correspondence with the $H$-cosets.

Proof. By definition, $gHg^{-1} = H$. Therefore $gH = Hg$, and $HgH = gHH = gH$. ∎

D.4 Conjugate Subgroups

The cosets of a normal subgroup $N \trianglelefteq G$ themselves form a group: if $x, y \in G$ with $x \in gN$ and $y \in hN$, then $xy \in ghN$. If $G$ is not abelian, however, many or even all subgroups are not normal and do not have this property. For a non-normal subgroup $H$, some $g \notin H$ gives rise to a different conjugate subgroup $gHg^{-1}$.
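Example D.13 and the contrast between normal and non-normal subgroups can be verified in a few lines (a toy check on $S_3$; the helper names are ours, and the final assertion is the coset identity $gH = (gHg^{-1})g$ that the coset circuits exploit):

```python
import itertools

def compose(g, h):
    return tuple(h[g[i] - 1] for i in range(len(g)))

def inverse(g):
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi - 1] = i + 1
    return tuple(inv)

def conjugate(g, n):
    """g n g^{-1} (Definition D.10)."""
    return compose(compose(g, n), inverse(g))

G = list(itertools.permutations((1, 2, 3)))
A3 = [(1, 2, 3), (2, 3, 1), (3, 1, 2)]  # even permutations: normal (Example D.13)
H = [(1, 2, 3), (2, 1, 3)]              # a point stabilizer: not normal

assert all(conjugate(g, n) in set(A3) for g in G for n in A3)
assert any(conjugate(g, n) not in set(H) for g in G for n in H)

# A normal subgroup satisfies gN = Ng for every g ...
assert all({compose(g, x) for x in A3} == {compose(x, g) for x in A3} for g in G)

# ... while a non-normal H yields a different conjugate subgroup K = gHg^{-1},
# whose right coset Kg nevertheless coincides with the left coset gH.
g = (3, 2, 1)
K = {conjugate(g, h) for h in H}
assert K != set(H)
assert {compose(g, h) for h in H} == {compose(k, g) for k in K}
```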
In general, the relationship between the cosets of $H$ and $gHg^{-1}$ is complex, but they always have at least one left and one right coset in common: $Hg^{-1} = g^{-1}(gHg^{-1})$. Every right coset $Hx$ has a left-coset pair $y(gHg^{-1})$ such that, when multiplied with the right coset on the left and the left coset on the right, $Hx \, y(gHg^{-1}) = Hg^{-1}$; specifically, when $xy = g^{-1}$. This relationship between the cosets of pairs of conjugate subgroups is not as powerful as that between the cosets of normal subgroups, but conjugate subgroups are guaranteed to exist in non-abelian groups, whereas there are many simple groups with no proper nontrivial normal subgroups at all. The relationship is also useful enough that it is exploited by every model we trained. In general, we have the following:

Lemma D.15. For any $H \leq G$ and any element $g \in G$, the set of conjugate elements $gHg^{-1}$ forms a subgroup of $G$.

If the conjugate subgroup $gHg^{-1}$ is different from $H$, the left and right cosets $gH$ and $Hg$ are different. The double-coset circuits operate by first identifying a pair of distinct conjugate subgroups $H$ and $gHg^{-1}$. They exploit the fact that the left coset $gH$ and the right coset $(gHg^{-1})g$ are the same subset of $G$, which will be fully generalized and elaborated in the later sections.

D.5 An important case

When a group $G$ decomposes into only two disjoint $H$-double cosets, any pair of subgroups conjugate to $H$ shares a left coset with the other's right coset.

Lemma D.16.
Let $H_1, \dots, H_n$ be conjugate subgroups of $G$ such that for each $H_i$ the double coset $H_i g H_i$ is equal to either $H_i$ or $G \setminus H_i$. Then for each pair of subgroups $H_i$ and $H_j$ there exists a $g \in G$ such that $H_i g = g H_j$. Moreover, the only double cosets of $H_i$ and $H_j$ are $H_i g H_j = g H_j$ and $H_i x H_j = G \setminus g H_j$.

Proof. If $i = j$, for any $h \in H_i$ the shared coset is the subgroup itself. If $i \neq j$, because $H_i$ and $H_j$ are conjugate, there exists a $g \in G$ such that $H_j = g^{-1} H_i g$. The left coset is then equal to the right coset:

$gH_j = g(g^{-1} H_i g) = H_i g.$

Notice that the double coset $H_i g H_j = H_i (H_i g) = H_i g$. But for $x \neq g$:

$H_i x H_j = H_i x g^{-1} H_i g$  (2)
$= (G \setminus H_i) g$  (3)
$= G \setminus H_i g$  (4)

∎

Appendix E Representation Theory

E.1 Preliminaries

Definition E.1. Given a group $G$, a representation of $G$ is a group homomorphism $\rho_V: G \to GL(V)$ for some finite- (but nonzero-) dimensional vector space $V$ over a field $k$.
When we do not specifically mention $k$, we use $\mathbb{C}$ as the default. In other words, a representation maps a group element $g$ to a linear operator $f(g): V \to V$, where $V$ is a vector space of dimension $d$, so that group multiplication becomes composition of linear operators: $f(g \cdot h) = f(g) \circ f(h)$. Without explicit specification, all representations in this paper are assumed to be over the complex numbers. Recall also that finite-dimensional linear operators can be represented as matrices, and composition of linear operators is then given by matrix multiplication. When the context is clear, we sometimes omit the subscript $V$ in the notation $\rho_V$.

The representations of finite groups have a rich and beautiful theory (see Diaconis [8], Fulton & Harris [15]). Here, we recall a few basic definitions and facts without going into details.

Definition E.2. A representation $\rho_V: G \to GL(V)$ is a sub-representation of $\rho_W: G \to GL(W)$ if $V$ can be identified with a linear subspace of $W$ so that $\rho_W(g)$ restricts to $\rho_V(g)$ for all $g \in G$.

Example E.3. For any group $G$, the map $G \to GL(V)$ sending all elements to the identity matrix is a representation. When $\dim(V) = 1$, we call it the trivial representation of $G$.

Definition E.4. Given two representations $\rho_V, \rho_W$ of $G$, the direct sum of vector spaces $V \oplus W$ admits a natural representation of $G$ by letting $\rho_V, \rho_W$ act on each component separately. We call this the direct sum of the representations $\rho_V, \rho_W$, and denote it by $\rho_V \oplus \rho_W$.

Definition E.5.
Similarly, given two representations $\rho_V, \rho_W$, the tensor product $V \otimes W$ admits a natural representation of $G$ by acting on $V$ and $W$ separately and extending by linearity. We call this the tensor product of the representations $\rho_V, \rho_W$, and denote it by $\rho_V \otimes \rho_W$.

Definition E.6. A representation $\rho$ of a group $G$ is irreducible if it does not have sub-representations other than $\rho$ itself. We denote the set of all irreducible representations of $G$ by $\mathrm{Irr}(G)$.

Lemma E.7. A representation $\rho$ of a finite group $G$ is a direct sum of irreducible representations.

Example E.8. The trivial representation of $G$ is irreducible.

Example E.9. The permutation representation maps $S_3 \to GL(\mathbb{C}^3)$, i.e., to the $3 \times 3$ matrices with a single $1$ in each row and column and zeros everywhere else.
For example,

$(2\;1\;3) \mapsto \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad (3\;2\;1) \mapsto \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.$

You can see that the matrices of the permutation representation act on the basis vectors of $\mathbb{C}^3$:

$\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} z \\ y \\ x \end{pmatrix}.$

What it means to be a representation is that group multiplication becomes matrix multiplication, so just as $(2\;1\;3)(3\;2\;1) = (2\;3\;1)$,

$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$

Example E.10. The permutation representation is reducible, because there is a subspace of $\mathbb{C}^3$ that is invariant under its action:

$\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} x \\ x \\ x \end{pmatrix} = \begin{pmatrix} x \\ x \\ x \end{pmatrix}.$

Note that no permutation matrix acting on the vector $(x\;\,x\;\,x)^T$ will change it, because all of its components are equal.
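The correspondence in Example E.9 between composition and matrix multiplication, and the invariant subspace of Example E.10, can be reproduced with NumPy (the matrix for $p$ has a $1$ in row $i$, column $p(i)$, matching the matrices displayed above; helper names are ours):

```python
import numpy as np

def perm_matrix(p):
    """Permutation matrix with a 1 at (i, p(i)) for each row i (one-line notation)."""
    n = len(p)
    M = np.zeros((n, n), dtype=int)
    for i, pi in enumerate(p):
        M[i, pi - 1] = 1
    return M

def compose(g, h):
    """(g * h)(i) = h(g(i)): the left-to-right convention used in the text."""
    return tuple(h[g[i] - 1] for i in range(len(g)))

A = perm_matrix((2, 1, 3))
B = perm_matrix((3, 2, 1))

# Group multiplication becomes matrix multiplication:
assert compose((2, 1, 3), (3, 2, 1)) == (2, 3, 1)
assert np.array_equal(A @ B, perm_matrix((2, 3, 1)))

# The all-equal vector spans an invariant subspace (Example E.10):
v = np.ones(3)
assert np.array_equal(B @ v, v)
```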
As it turns out, there are no irreducible representations of $S_3$ that are three-dimensional. The largest irrep of $S_3$ is $\rho_{(2,1)}$, which consists of $2 \times 2$ matrices. The matrices of the $(2,1)$ irrep of $S_3$ are as follows:

$(1\;2\;3) \mapsto \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad (2\;1\;3) \mapsto \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}$

$(3\;2\;1) \mapsto \begin{pmatrix} 1/2 & -\sqrt{3}/2 \\ -\sqrt{3}/2 & -1/2 \end{pmatrix} \qquad (1\;3\;2) \mapsto \begin{pmatrix} 1/2 & \sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{pmatrix}$

$(3\;1\;2) \mapsto \begin{pmatrix} -1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{pmatrix} \qquad (2\;3\;1) \mapsto \begin{pmatrix} -1/2 & \sqrt{3}/2 \\ -\sqrt{3}/2 & -1/2 \end{pmatrix}$

We leave it as an exercise to the reader to verify that $\rho_{(2,1)}(2\;1\;3)\,\rho_{(2,1)}(3\;2\;1) = \rho_{(2,1)}(2\;3\;1)$.

The trace is an important notion in linear algebra. Taking the trace of a representation induces an important map from $G$ to $\mathbb{C}$.

Definition E.11. Let $\rho_V$ be a representation of $G$. The character of $\rho_V$ is the map $\chi(\rho_V): G \to \mathbb{C}$ given by $\chi(\rho_V)(g) = \mathrm{tr}(\rho_V(g))$.

Lemma E.12. The character $\chi(\rho_V)$ takes the same value on each conjugacy class of $G$. In other words, $\chi(\rho_V)(h) = \chi(\rho_V)(ghg^{-1})$.

To distill this property for a wider range of functions, we have the following definition:

Definition E.13. Let $f: G \to \mathbb{C}$ be a map. If $f(h) = f(ghg^{-1})$ for any $g, h \in G$, then $f$ is called a class function.

For a finite group $G$, the class functions form a finite-dimensional vector space, and there is an important inner product between class functions.

Definition E.14. The inner product of two class functions $\phi, \psi$ is defined as

$\langle \phi, \psi \rangle = \frac{1}{|G|} \sum_{g \in G} \phi(g) \overline{\psi(g)}.$

Because we require class functions to take the same value on each conjugacy class, the dimension of the vector space of class functions is equal to the number of conjugacy classes in $G$.
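The exercise above, together with the class-function property of Lemma E.12 and the norm from Definition E.14, can be checked numerically. This sketch hardcodes the $\rho_{(2,1)}$ matrices listed above (the dictionary layout and helper names are ours):

```python
import itertools
import math
import numpy as np

s = math.sqrt(3) / 2
R = {  # the (2,1) irrep of S_3: one-line notation -> 2x2 matrix
    (1, 2, 3): np.array([[1.0, 0.0], [0.0, 1.0]]),
    (2, 1, 3): np.array([[-1.0, 0.0], [0.0, 1.0]]),
    (3, 2, 1): np.array([[0.5, -s], [-s, -0.5]]),
    (1, 3, 2): np.array([[0.5, s], [s, -0.5]]),
    (3, 1, 2): np.array([[-0.5, -s], [s, -0.5]]),
    (2, 3, 1): np.array([[-0.5, s], [-s, -0.5]]),
}

def compose(g, h):  # (g * h)(i) = h(g(i))
    return tuple(h[g[i] - 1] for i in range(len(g)))

# rho is a homomorphism -- the exercise, checked for every pair of elements.
for g, h in itertools.product(R, R):
    assert np.allclose(R[g] @ R[h], R[compose(g, h)])

# The character chi(g) = tr(rho(g)) is constant on conjugacy classes ...
chi = {g: np.trace(R[g]) for g in R}
transpositions = [(2, 1, 3), (3, 2, 1), (1, 3, 2)]
three_cycles = [(3, 1, 2), (2, 3, 1)]
assert all(abs(chi[t]) < 1e-9 for t in transpositions)
assert all(abs(chi[c] + 1) < 1e-9 for c in three_cycles)

# ... and has unit norm under the inner product of Definition E.14.
norm = sum(chi[g] * chi[g] for g in R) / len(R)
assert abs(norm - 1) < 1e-9
```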
On the other hand, we have the following important theorem:

Theorem E.15. The characters of $\mathrm{Irr}(G)$ form an orthonormal basis of the vector space of class functions.

Lemma E.16. For a finite group $G$, $\mathrm{Irr}(G)$ is a finite set. Furthermore, the order of $\mathrm{Irr}(G)$ is equal to the number of conjugacy classes in $G$.

Appendix F Fourier transform over finite groups

Despite being mostly perceived as a powerful tool in physics and engineering, the Fourier transform has also been successfully applied in group theory, thanks to its generalization to locally compact abelian groups as well as an analog over finite groups. The purpose the group Fourier transform serves is largely analogous to that of the classical Fourier transform: it provides an alternate orthogonal basis with which to analyze functions from a group $G$ to either $\mathbb{R}$ or $\mathbb{C}$. To motivate the transition from the classical Fourier theory to Fourier theory over groups, we start with a brief recollection of the definitions. The classical Fourier transform converts a complex-valued Lebesgue-integrable function $f: \mathbb{R} \to \mathbb{C}$ into a function $\hat{f}$ of the frequency $\xi$ with the following formula:

$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i \xi x} \, dx.$
(5)

Taking one step further in abstraction, we note that $e^{-2\pi i \xi x}$, as a function of $x$, has the defining properties of turning addition into multiplication (being a group homomorphism) and always having complex norm $1$:

$e^{-2\pi i \xi (x_1 + x_2)} = e^{-2\pi i \xi x_1} \cdot e^{-2\pi i \xi x_2}, \qquad |e^{-2\pi i \xi x}| = 1.$

We call such functions the characters of $\mathbb{R}$, though they are often thought of as frequencies. One can prove that every character of $\mathbb{R}$ can be written as $e^{-2\pi i \xi x}$ for a suitable $\xi \in \mathbb{R}$. Looking back at (5), the properties we need in order to define the Fourier transform over $\mathbb{R}$ are:
• $\mathbb{R}$ has the Lebesgue measure (allowing for integration to happen).
• $\mathbb{R}$ is a group (so that the characters make sense as group homomorphisms from $\mathbb{R}$ to the unit circle group $S^1 \subset \mathbb{C}$).

Now, given a finite group $G$, the Fourier transform over $G$ is an operator converting a map $f: G \to \mathbb{C}$ into a function between $\mathrm{Irr}(G)$ and the set of linear operators $M(V)$.

Definition F.1. Given a group $G$, the Fourier transform of a map $f: G \to \mathbb{C}$ is the function $\hat{f}$ from $\mathrm{Irr}(G)$ to the union of $M(\mathbb{C}^n)$ over all $n$ such that

$\hat{f}(\rho) = \sum_{a \in G} f(a) \rho(a)$

for each irreducible representation $\rho$.

The analogy comes from the following similar facts:
• $G$, as a finite set, has the invariant discrete measure (where the "integration" becomes a sum).
• $G$ is a group, and the irreps $\rho$ are in a sense the "smallest" group homomorphisms from $G$ to $GL(n, \mathbb{C})$ (note that the images of $\rho$ similarly have determinants of complex norm $1$, because $G$ is a finite group).

For more details and applications, one can refer to, for example, Elias M. Stein [11]. We would like to note that there is also an inverse transform that restores the original function $f$ from $\hat{f}$:

$f(g) = \frac{1}{|G|} \sum_{\rho \in \mathrm{Irr}(G)} d_\rho \, \mathrm{tr}\!\left[\hat{f}(\rho) \rho(g^{-1})\right]$  (6)

Appendix G The Coset Circuit (with more math)

We did not introduce the group Fourier transform in the main body of the paper because it would have distracted from the core of our results, but for the first half of our investigation the Fourier transform over the symmetric group was integral to our analysis. We were building directly on Chughtai et al. [4], who had shown striking results in which the weights of single-layer models correlate highly with the irreps of the symmetric group, and we wished to cast those results in the language of the group Fourier transform. Even after we realized that the mechanism of the model was based on cosets, it remained extremely important to understand why our coset circuit was so concentrated in Fourier space.

G.1 Harmonic Analysis on the Symmetric Group

The presentation in Appendix E was given in terms of functions to $\mathbb{C}$ because that is required for arbitrary groups. For $S_n$, all of the irreps are rational [15], and the Fourier transform of functions on $S_n$ can safely be defined over $\mathbb{R}$. In this section we describe how we use the Fourier transform to analyze the weights and activations of an MLP.
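Definition F.1 and the inversion formula (6) can be checked end-to-end on $S_3$, whose complete set of irreps consists of the trivial representation, the sign representation, and the two-dimensional $(2,1)$ representation (matrices as listed in Appendix E; the helper names are ours):

```python
import itertools
import math
import numpy as np

def inverse(g):
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi - 1] = i + 1
    return tuple(inv)

G = list(itertools.permutations((1, 2, 3)))
s = math.sqrt(3) / 2
std = {  # the (2,1) irrep from Appendix E
    (1, 2, 3): np.array([[1.0, 0.0], [0.0, 1.0]]),
    (2, 1, 3): np.array([[-1.0, 0.0], [0.0, 1.0]]),
    (3, 2, 1): np.array([[0.5, -s], [-s, -0.5]]),
    (1, 3, 2): np.array([[0.5, s], [s, -0.5]]),
    (3, 1, 2): np.array([[-0.5, -s], [s, -0.5]]),
    (2, 3, 1): np.array([[-0.5, s], [-s, -0.5]]),
}
sign = {g: np.array([[1.0]]) if g in [(1, 2, 3), (2, 3, 1), (3, 1, 2)]
        else np.array([[-1.0]]) for g in G}
triv = {g: np.array([[1.0]]) for g in G}
irreps = [triv, sign, std]  # note 1^2 + 1^2 + 2^2 = |S_3| = 6

def fourier(f):
    """hat f(rho) = sum_a f(a) rho(a)  (Definition F.1)."""
    return [sum(f[g] * rho[g] for g in G) for rho in irreps]

def inverse_fourier(fhat):
    """f(g) = (1/|G|) sum_rho d_rho tr[hat f(rho) rho(g^{-1})]  (Eq. 6)."""
    return {g: sum(rho[g].shape[0] * np.trace(F @ rho[inverse(g)])
                   for F, rho in zip(fhat, irreps)) / len(G) for g in G}

rng = np.random.default_rng(0)
f = {g: rng.normal() for g in G}
assert all(abs(inverse_fourier(fourier(f))[g] - f[g]) < 1e-9 for g in G)
```

The round trip recovers $f$ exactly (up to floating point), confirming that the three irreps form a complete Fourier basis for real-valued functions on $S_3$.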
The inputs to the model are two one-hot vectors, $x_l$ and $x_r$, which multiply the embedding matrices to give $E_l x_l$ and $E_r x_r$. $E_l$ and $E_r$ are $d \times |G|$ matrices, where $d$ is the embedding dimension and $|G|$ is the size of the group; each column is the embedding vector for a single element $g \in G$. The normal approach would be to look at the column spaces of $E_l$ and $E_r$, as these columns are the inputs to the model. However, since each row of $E_l$ and $E_r$ associates each of its $|G|$ values with a single element of $G$, we instead treat each row of the embedding as a function $f: G \to \mathbb{R}$. In fact, anywhere in the model where a matrix or set of activations has $|G|$ in its shape, we can expand it in the Fourier basis. For non-abelian groups, each Fourier frequency is an irrep, and the Fourier transform at each irrep is matrix-valued. This is, on its face, less interpretable than what we started with. Following the techniques outlined in Diaconis [8], however, we can expand the function at each element $g \in G$ into a new Fourier basis.
Concretely, if our function $f : G \to \mathbb{R}$ is represented as a vector, we know from (6) that each element of the vector is a sum of Fourier components:

$$\begin{bmatrix} f(g_1) \\ f(g_2) \\ \vdots \\ f(g_{|G|}) \end{bmatrix} = \frac{1}{|G|} \begin{bmatrix} \sum_\rho d_\rho \mathrm{tr}[\hat{f}(\rho)\rho(g_1^{-1})] \\ \sum_\rho d_\rho \mathrm{tr}[\hat{f}(\rho)\rho(g_2^{-1})] \\ \vdots \\ \sum_\rho d_\rho \mathrm{tr}[\hat{f}(\rho)\rho(g_{|G|}^{-1})] \end{bmatrix}$$

We can keep track of all of the Fourier components at once by purposefully not completing the sum from (6), but instead keeping each term in a new dimension:

$$\frac{1}{|G|} \begin{bmatrix} d_{\rho_1} \mathrm{tr}[\hat{f}(\rho_1)\rho_1(g_1^{-1})] & \dots & d_{\rho_k} \mathrm{tr}[\hat{f}(\rho_k)\rho_k(g_1^{-1})] \\ d_{\rho_1} \mathrm{tr}[\hat{f}(\rho_1)\rho_1(g_2^{-1})] & \dots & d_{\rho_k} \mathrm{tr}[\hat{f}(\rho_k)\rho_k(g_2^{-1})] \\ \vdots & & \vdots \\ d_{\rho_1} \mathrm{tr}[\hat{f}(\rho_1)\rho_1(g_{|G|}^{-1})] & \dots & d_{\rho_k} \mathrm{tr}[\hat{f}(\rho_k)\rho_k(g_{|G|}^{-1})] \end{bmatrix}$$

Though this may seem like it is only making the data more complicated, it gives us many tools for analyzing the data. In particular, it turns out that the weights and activations are sparse in this new basis, which gives us a path forward in analyzing the mechanisms.

Corollary G.1.
If $H_i$ and $H_j$ are conjugate subgroups of $G$ such that the only two double cosets are $H_i g H_j$ and $H_i H_j$, then each right coset $H_i x$ has a paired left coset $y H_j$, where $y = x^{-1} g$, such that for all $h_x \in H_i x$ and $h_y \in y H_j$, $h_x h_y \in H_i g H_j$.

Lemma G.2. Let $f : G \to \mathbb{C}$ be constant on the cosets of $H \le G$ and non-zero on at least one coset. Then $\hat{f}(\rho) = 0$ if the restriction of $\rho$ to $H$, $\rho|_H$, does not contain the trivial representation as a subrepresentation.

Proof. The function $f$ can be decomposed as the sum of functions

$$f_{xH}(\sigma) = \begin{cases} \alpha_x & \sigma \in xH \\ 0 & \text{otherwise} \end{cases}$$

for each coset $xH$. Because the Fourier transform $\hat{f}$ is invariant under translation we may, without loss of generality, analyze only the function $f_H$. For a given $\alpha_x$, $\hat{f}_{xH}(\rho) = \hat{f}_H^{\,x}(\rho) = \rho(x)\hat{f}_H(\rho)$ for all $x \in G$.
Recall the definition of $\hat{f}_H(\rho)$ from F.1:

$$\hat{f}_H(\rho) = \sum_{g \in G} f_H(g)\,\rho(g) \qquad (7)$$
$$= \alpha_H \sum_{h \in H} \rho|_H(h) \qquad (8)$$
$$= \alpha_H \sum_{h \in H} T^{-1}\Big[\bigoplus_i \tau_i(h)\Big] T \qquad (9)$$
$$= \alpha_H\, T^{-1}\Big[\bigoplus_i \sum_{h \in H} \tau_i(h)\Big] T \qquad (10)$$

where in (9) we decompose $\rho|_H$ into a direct sum of irreps of $H$. But because each $\tau_i$ is irreducible, $\sum_{h \in H} \tau_i(h) = 0$ unless $\tau_i$ is the trivial irrep. Thus, unless the decomposition of $\rho|_H$ into irreps of $H$ includes the trivial representation, $\hat{f}_H(\rho) = 0$, and hence $\hat{f}(\rho) = 0$. ∎

G.2  Logits and Counting Cosets

In Chughtai et al. [4], one way of justifying the GCR algorithm is to study the correlation between the character functions and the neuron activations. We would like to argue that the correlation between the GCR and the coset-membership counting function may already exist, and in some simple cases it can be made explicit. More precisely, we are measuring the correlation between the character function $\chi(\rho)$ of an irrep $\rho$ and a set function $f : G \to \mathbb{C}$. In this section, we provide an explicit characterization of $f$ in terms of traces and irreps, when $f$ counts the membership of cosets. We are specifically interested in the following situation:

Lemma G.3.
Suppose $f : G \to \mathbb{C}$ is a function such that its Fourier transform $\hat{f}$ is nonzero only on an irreducible representation $\rho$ and the trivial representation. Let $\hat{f}(\rho) = A \in M(\mathbb{C}^n)$. We have the following explicit formula:

$$f(\sigma) = \frac{d_\rho}{|G|}\,\mathrm{tr}\!\left(A \cdot \rho(\sigma^{-1})\right) + \frac{|H|}{|G|}. \qquad (11)$$

Proof. This is immediate by the Fourier inversion formula. ∎

In this case, although $f$ is not directly written in terms of $\mathrm{tr}(\rho(\sigma^{-1}))$, $f$ is correlated with $\mathrm{tr}(\rho(\sigma^{-1}))$ to a degree depending on how concentrated $A$ is on its diagonal and how even its diagonal entries are. For the rest of the section, we show that under certain conditions, Equation (11) applies verbatim to the functions that count membership of cosets for a collection of conjugate subgroups.

Given a subgroup $H \le G$, let $1_H$ be the function that takes the value $1$ on the subgroup $H$, and $0$ otherwise. The action of $G$ on the cosets $G/H$ induces a representation of $G$ on $\mathbb{C}^{|G/H|}$ by permuting the basis accordingly. We call it the permutation representation of $G$ on $G/H$.

Lemma G.4. The Fourier transform of $1_H$ is nonzero only at the irreducible components of the permutation representation of $G$ on $G/H$.

Proof. By definition, the Fourier transform of $1_H$ at an irrep $\rho$ is

$$\hat{1}_H(\rho) = \sum_{a \in H} \rho(a).$$

Notice that the image of $\sum_{a \in H} \rho(a)$ is invariant under $H$ due to the symmetry of this expression. Let $V$ be the vector space on which $\rho$ acts.
Under the action of the subgroup $H$ through $\rho$, one can decompose $V$ into irreps of $H$. We group them into two parts:

$$V = V^H \oplus V',$$

where $V^H$ is a direct sum of copies of the trivial representation of $H$ (in other words, the invariant subspace of $V$ under $H$), and $V'$ is the direct sum of the nontrivial irreducible components of $V$. We immediately see the following by definition:

$$\sum_{a \in H} \rho(a)\Big|_{V^H} = |H| \cdot \mathrm{Id}_{V^H}.$$

Also by definition, nontrivial irreps of $H$ do not have invariant subspaces, since they do not admit proper sub-representations. Therefore,

$$\sum_{a \in H} \rho(a)\Big|_{V'} = 0.$$

As a result, $\hat{1}_H(\rho)$ is simply a scaled projection onto the invariant subspace of $V$. Whether it is zero depends on whether $\mathrm{Res}_H\rho$ has any trivial components. By Frobenius reciprocity,

$$\left\langle \mathrm{Ind}_H^G(1_H),\, \chi(\rho) \right\rangle = \left\langle 1_H,\, \chi(\mathrm{Res}_H(\rho)) \right\rangle_H,$$

where $\chi(\rho)$ is the character of the irrep $\rho \in \mathrm{Irr}(G)$ given by its traces, and $\langle \cdot \rangle$ is the inner product between class functions. The left-hand side $\langle \mathrm{Ind}_H^G(1_H), \chi(\rho) \rangle$ is nonzero if and only if $\rho$ is an irreducible component of the permutation representation of $G$ on $G/H$.
The right-hand side $\langle 1_H, \chi(\mathrm{Res}_H(\rho)) \rangle_H$ is nonzero if and only if $\dim(V^H) \neq 0$. ∎

Note that this lemma also works for $1_{gH}$ for a coset $gH$, since the Fourier transform turns the translation action by $g$ into group multiplication by $\rho(g)$. In the double coset circuit, we are specifically interested in the membership counting functions. More specifically, let $H_1, \dots, H_n$ be a collection of conjugate subgroups of $G$. Given an element $\sigma \in G$, define the membership counting function as

$$F = \sum_{i=1}^{n} 1_{\sigma H_i}.$$

Combining all previous results, we have the following corollary describing the membership counting function $F$.

Corollary G.5. If the permutation representation of $G$ on $G/H_1$ has only $2$ irreducible components, the Fourier transform $\hat{F}$ of the membership counting function $F$ is nonzero only at these $2$ irreducible components. In particular, Equation (11) applies to $F$.

One may wonder how restrictive it is for the permutation representation on $G/H$ to have only $2$ irreducible components. The following lemma shows that it applies to our case, when $G = S_n$ and $H = S_{n-1}$.

Lemma G.6. For $S_n$ and the subgroup $S_{n-1}$ fixing one element, the permutation representation has only two irreducible components.

Proof. The natural representation of $S_n$ on $\mathbb{C}^n$ (by permuting the basis) decomposes as a direct sum of the trivial representation and the standard representation of dimension $n-1$.
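The stabilizer case admits a direct numerical check. Below is a pure-Python sketch (our own illustration, not code from the paper; all names are ours): for $H_i = \mathrm{Stab}(i) \le S_5$, an element $\tau$ lies in the coset $\sigma H_i$ exactly when $\tau(i) = \sigma(i)$, so the membership count equals the number of fixed points of $\sigma^{-1}\tau$, which is the character of the permutation representation (trivial plus standard) at $\sigma^{-1}\tau$.

```python
from itertools import permutations

n = 5
G = list(permutations(range(n)))          # S_5, |G| = 120

def compose(p, q):                        # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(n))

def inverse(p):
    inv = [0] * n
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

sigma = G[17]                             # any fixed group element
for tau in G:
    # Membership count: how many stabilizer cosets sigma*H_i contain tau.
    count = sum(tau[i] == sigma[i] for i in range(n))
    # Fixed points of sigma^{-1} tau: the permutation-rep character there.
    rho = compose(inverse(sigma), tau)
    assert count == sum(rho[i] == i for i in range(n))
```

This is exactly the two-irrep concentration predicted by Corollary G.5.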
∎

Indeed, when we look at the action of an individual neuron on the prediction space (i.e., "if this neuron fires, which predictions become more likely and which less?"), we see that only neurons predicting the same coset are correlated. The average pairwise correlation of neuron actions is near zero, as is the correlation of neurons associated with the same irrep. Refer to Table 3 for the full results.

Table 3: The correlation of unembedding neurons. Neurons that correspond to the same coset are averaged together in the unembedding, leading to the unembedding vectors being highly correlated.

                                   Mean Correlation   Std Dev Correlation
  Within Coset                          0.814               0.445
  Within Subgroup Conjugacy Class      -0.002               0.222
  Baseline                             -0.003               0.163

G.3  An Asymptotic Analysis

Our theory of coset circuits and the GCR algorithm of [4] cannot be equivalent, because there is no one-to-one relationship between irreps and subgroups. Even for $S_5$, there are more subgroups than irreps. Quantitatively, the number of irreps fails entirely to keep up with the number of subgroups.
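The gap between the two counts is easy to reproduce. In the sketch below (our own illustration, stdlib only), the number of irreps of $S_n$ is the partition number $p(n)$, computed by recursion over the smallest allowed part, while the subgroup counts are hard-coded from OEIS sequence A005432 as cited in Table 4.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, smallest=1):
    """Number of partitions of n into parts of size >= smallest."""
    if n == 0:
        return 1
    return sum(partitions(n - part, part) for part in range(smallest, n + 1))

# Subgroup counts of S_n, OEIS A005432 (values from Table 4).
subgroups = {5: 156, 6: 1455, 7: 11300, 8: 151221, 9: 1694723,
             10: 29594446, 11: 404126228, 12: 10594925360}
for n in range(5, 13):
    print(f"S_{n}: {partitions(n)} irreps vs {subgroups[n]} subgroups")
```

Already at $n = 12$ the subgroup count exceeds the irrep count by eight orders of magnitude.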
For the direct comparison for $S_n$, refer to Table 4. Asymptotically, the number of subgroups of $S_n$ is bounded below as follows (see Pyber [48, Corollary 3.3]):

$$2^{(\frac{1}{16}+o(1))n^2} \le |\mathrm{Sub}(S_n)|,$$

whereas the number of irreps of $S_n$ grows asymptotically as (see Erdős [13]):

$$|\mathrm{Irr}(S_n)| \sim \frac{1}{4n \cdot 3^{\frac{1}{2}}}\, e^{\pi\left(\frac{2}{3}\right)^{\frac{1}{2}} n^{\frac{1}{2}}}.$$

We see that the former has a much higher asymptotic growth than the latter. In practice, as can be seen in Table 5, many subgroups concentrate on more than one irrep. We do not have an explanation for why the coset circuits always concentrate on one irrep. In practice, the different values for the cosets are arranged so that the contributions of all but one irrep cancel out. We hypothesize that it may have something to do with the margin maximization effect discussed in [38]. As we mention in the main body, we observe that subgroups which concentrate on more than one irrep will form coset circuits that concentrate entirely on any one of those irreps, while still behaving equivalently. We do not think that there is, in fact, a connection between what the circuit is doing and the irrep it concentrates on.

Table 4: The number of subgroups and the number of irreps from $S_5$ to $S_{12}$. The numbers of subgroups use the A005432 sequence of the OEIS [42].
The numbers of irreps correspond to the number of integer partitions of $n$ and use the A000041 sequence of the OEIS [42].

                        S_5    S_6     S_7      S_8       S_9        S_10        S_11         S_12
  Number of subgroups   156    1455    11300    151221    1694723    29594446    404126228    10594925360
  Number of irreps      7      11      15       22        30         42          56           77

Appendix H  Extra Graphs

H.1  Distribution over Subgroups and Cosets

Figure 6: Distribution of coset circuits for models trained on $S_5$ ((a), 128 models) and $S_6$ ((b), 100 models) with different initial seeds. Every model has a few sign circuit neurons that correspond to $A_n < S_n$, but the model cannot completely solve the task with only the sign circuit, so there are never more than a few. Every other subgroup could, with enough neurons, be used to completely solve the multiplication, but in general if a model primarily uses a single subgroup it is $S_{n-1}$ (in the main body of the paper we refer to these subgroups as $H_i$, for the element $i \in [n]$ that is fixed). Every model has at least a few $S_{n-1}$ neurons. Many models use a mix of subgroups, and there is often a "long tail" of a subgroup being represented by only one or two neurons. The subgroups marked with asterisks, $A_5^*$ and $S_5^*$, correspond to the "exceptional" subgroups of $S_6$, which come from an outer automorphism that only $S_6$ has [26].
These subgroups are isomorphic to $S_5$ and $A_5$, but not conjugate to the subgroups that come from fixing an element of $\{1, \dots, 6\}$.

H.2  Other Examples of Coset Circuits Forming

Figure 7: The formation of an $S_3 \times S_2$ neuron ((a) left permutations, (b) right permutations).

Figure 8: The formation of an $A_4$ neuron ((a) left permutations, (b) right permutations).

Appendix I  Irreducible Representations

I.1  Symmetric Group $S_5$

For a subgroup $H \le G$, we can investigate the Fourier transform of the indicator function $1_H$ by looking at its evaluation at each irrep. Concretely, we first center the indicator function by defining

$$f(g) = \begin{cases} -\frac{|H|}{|G|}, & g \notin H \\ 1 - \frac{|H|}{|G|}, & g \in H. \end{cases}$$

By doing so, $\hat{f}$ evaluates to $0$ on the trivial representation of $G$. Given an irrep $\rho \in \mathrm{Irr}(G)$, we denote the value of the Fourier transform of $f$ at $\rho$ by $\hat{f}|_\rho$. The contribution of $\rho$ to $\hat{f}$ is defined by the following:

$$\frac{\|\hat{f}|_\rho\|^2}{\sum_{\delta \in \mathrm{Irr}(G)} \|\hat{f}|_\delta\|^2}$$

Here, we list all the conjugacy classes of subgroups of $S_5$ and how each irrep of $S_5$ contributes to their centered indicator function.
We center the indicator function to remove the contribution of the trivial irrep, which depends only on the index of the subgroup. This step makes the contributions comparable. In the first column, we show the isomorphism type of each subgroup. Recall that two groups $G, G'$ are isomorphic if there exists a bijection $f : G \to G'$ such that for all $g, h \in G$, $f(gh) = f(g)f(h)$. Every group within a conjugacy class of subgroups is isomorphic to every other, with the isomorphism between two subgroups $H, H'$ of $G$ given by conjugation by an element $g \in G$: $h \mapsto ghg^{-1}$. Two conjugacy classes of subgroups may, however, be isomorphic as groups even though no isomorphism between them is given by conjugation by an element of $G$. Different conjugacy classes of subgroups that are isomorphic are distinguished in the second column by an example set of generators. In the list:

• $C_n$ means the cyclic group of order $n$.
• $S_n$ means the symmetric group on $n$ elements.
• $A_n$ means the alternating group on $n$ elements, the subgroup of $S_n$ consisting of even permutations. Recall that an "even" permutation is one that is a product of an even number of transpositions.
• $D_{2n}$ means the dihedral group of order $2n$ (the symmetry group of a regular polygon with $n$ edges).
• $F_{20}$ means the Frobenius group of order $20$, isomorphic to $C_4 \ltimes C_5$ [10].
  Isomorphism type   Generators             Size   (4,1)    (3,2)    (3,1^2)   (2^2,1)   (2,1^3)   (1^5)
  C_2                ⟨(12)⟩                 2      20.3%    25.4%    30.5%     17%       6.8%      -
  C_2                ⟨(12)(34)⟩             2      13.6%    25.4%    20.3%     25.4%     13.6%     1.7%
  C_3                ⟨(123)⟩                3      20.1%    12.8%    30.8%     12.8%     20.5%     2.6%
  C_4                ⟨(1234)⟩               4      13.6%    25.4%    20.3%     25.4%     13.6%     1.7%
  C_2 × C_2          ⟨(12),(34)⟩            4      27.6%    34.5%    20.7%     17.2%     -         -
  C_2 × C_2          ⟨(12)(34),(13)(24)⟩    4      13.8%    34.5%    -         34.5%     13.8%     3.5%
  C_5                ⟨(12345)⟩              5      -        21.7%    52.2%     21.7%     -         4.4%
  C_6                ⟨(123),(45)⟩           6      21.1%    26.3%    31.6%     -         21.1%     -
  S_3                ⟨(123),(12)⟩           6      42.1%    26.3%    31.6%     -         -         -
  S_3 †              ⟨(123),(12)(45)⟩       6      21.1%    26.3%    -         26.3%     21.1%     5.3%
  D_8                ⟨(1234),(13)⟩          8      28.6%    35.7%    -         35.7%     -         -
  D_10               ⟨(12345),(25)(34)⟩     10     -        45.5%    -         45.5%     -         1%
  S_3 × S_2          ⟨(123),(12),(45)⟩      12     55.6%    44.4%    -         -         -         -
  A_4                ⟨(12)(34),(123)⟩       12     44.4%    -        -         -         44.4%     11.2%
  F_20               ⟨(12345),(2354)⟩       20     -        -        -         100%      -         -
  S_4                ⟨(1234),(12)⟩          24     100%     -        -         -         -         -
  A_5                ⟨(12345),(123)⟩        60     -        -        -         -         -         100%

† Referred to as "twisted" $S_3$ in plots.

Table 5: Subgroups of $S_5$ and the contribution of each irrep to their centered indicator function.
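The contributions in Table 5 can be reproduced in miniature. The NumPy sketch below (our own illustration, hard-coded for $S_3$ rather than $S_5$; the function name `contributions` and the setup are ours) computes the centered indicator of a subgroup and its per-irrep contributions: for $H = \langle(12)\rangle$ the mass concentrates entirely on the standard irrep, while for $H = A_3$ it concentrates entirely on the sign irrep, mirroring the $A_5$ row of the table.

```python
import numpy as np

def compose(p, q):                        # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(3))

# Generate S_3 with its 2-d standard irrep by closure from two generators,
# so the matrix assignment is a homomorphism by construction.
c, s = np.cos(2 * np.pi / 3), np.sin(2 * np.pi / 3)
gens = {(1, 2, 0): np.array([[c, -s], [s, c]]),
        (1, 0, 2): np.array([[1.0, 0.0], [0.0, -1.0]])}
std = {(0, 1, 2): np.eye(2)}
frontier = list(std)
while frontier:
    g = frontier.pop()
    for h, M in gens.items():
        gh = compose(h, g)
        if gh not in std:
            std[gh] = M @ std[g]
            frontier.append(gh)

def sgn(p):
    return round(np.linalg.det(np.eye(3)[list(p)]))

irreps = {"trivial":  {g: np.eye(1) for g in std},
          "sign":     {g: np.array([[float(sgn(g))]]) for g in std},
          "standard": std}

def contributions(H):
    # Centered indicator f = 1_H - |H|/|G|, then per-irrep share of the
    # squared Frobenius norms of the Fourier coefficients.
    f = {g: (g in H) - len(H) / 6 for g in std}
    norms = {name: np.linalg.norm(sum(f[g] * rho[g] for g in std)) ** 2
             for name, rho in irreps.items()}
    total = sum(norms.values())
    return {name: v / total for name, v in norms.items()}

print(contributions({(0, 1, 2), (1, 0, 2)}))              # C_2 = <(12)>
print(contributions({(0, 1, 2), (1, 2, 0), (2, 0, 1)}))   # A_3 = <(123)>
```

Note that the trivial contribution vanishes by construction, which is exactly what centering is for.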