
Paper deep dive

Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition

Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, Daniel Murfet

Year: 2023 | Venue: arXiv preprint | Area: Training Dynamics | Type: Theoretical | Embeddings: 129

Models: Toy Model of Superposition (2 hidden dims, 6 features)

Abstract

Abstract: We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, discover that regular $k$-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show empirically that the same $k$-gon critical points also determine the behavior of SGD training. The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism. Specifically, we find that the learning process in TMS, be it through SGD or Bayesian learning, can be characterized by a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.

Tags

ai-safety (imported, 100%) · theoretical (suggested, 88%) · training-dynamics (suggested, 92%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 7:36:36 PM

Summary

The paper investigates phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT). It establishes a connection between dynamical transitions observed during SGD training and Bayesian phase transitions, showing that both processes involve a sequential progression from high-loss/low-complexity solutions to low-loss/high-complexity solutions, characterized by the local learning coefficient of k-gon critical points.

Entities (5)

SGD · algorithm · 100%
Singular Learning Theory · theory · 100%
Toy Model of Superposition · model · 100%
Local Learning Coefficient · metric · 95%
k-gon · critical-point · 95%

Relation Signals (3)

k-gon determines phase transitions

confidence 95% · the local learning coefficient (a geometric invariant) of these k-gons determines phase transitions in the Bayesian posterior

Bayesian posterior undergoes phase transitions

confidence 95% · the posterior prefers, for small training sample size n, critical points with low complexity but potentially high loss.

SGD exhibits dynamical transitions

confidence 90% · we often see steady plateaus in the training (and test) loss separated by sudden transitions... We refer to these as dynamical transitions.

Cypher Suggestions (2)

Find all critical points associated with the TMS model · confidence 90% · unvalidated

MATCH (m:Model {name: 'Toy Model of Superposition'})-[:HAS_CRITICAL_POINT]->(c:CriticalPoint) RETURN c.name, c.type

Map the relationship between algorithms and the phenomena they exhibit · confidence 85% · unvalidated

MATCH (a:Algorithm)-[:EXHIBITS]->(p:Phenomenon) RETURN a.name, p.name

Full Text

128,768 characters extracted from source content.


DYNAMICAL VERSUS BAYESIAN PHASE TRANSITIONS IN A TOY MODEL OF SUPERPOSITION

Zhongtian Chen ∗† (zhongtianc@student.unimelb.edu.au), Edmund Lau ∗† (elau1@student.unimelb.edu.au), Jake Mendel (jakeamendel@gmail.com), Susan Wei † (susan.wei@unimelb.edu.au), Daniel Murfet † (d.murfet@unimelb.edu.au)

ABSTRACT

We investigate phase transitions in a Toy Model of Superposition (TMS) (Elhage et al., 2022) using Singular Learning Theory (SLT). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, discover that regular $k$-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show empirically that the same $k$-gon critical points also determine the behavior of SGD training. The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism. Specifically, we find that the learning process in TMS, be it through SGD or Bayesian learning, can be characterized by a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.

1 Introduction

The apparent simplicity of the Toy Model of Superposition (TMS) proposed in Elhage et al. (2022) conceals a remarkably intricate phase structure. During training, a plateau in the loss is often followed by a sudden discrete drop, suggesting some development in the network’s internal structure. To shed light on these transitions and their significance, this paper examines the dynamical transitions in TMS during SGD training, connecting them to phase transitions of the Bayesian posterior with respect to sample size $n$. While the former transitions have been observed in several recent works in deep learning (Olsson et al., 2022; McGrath et al., 2022; Wei et al., 2022a), their formal status has remained elusive.
In contrast, phase transitions of the Bayesian posterior are mathematically well-defined in Singular Learning Theory (SLT) (Watanabe, 2009). Using SLT, we can show formally that the Bayesian posterior is subject to an internal model selection mechanism in the following sense: the posterior prefers, for small training sample size $n$, critical points with low complexity but potentially high loss. The opposite is true for high $n$, where the posterior prefers low loss critical points at the cost of higher complexity. The measure of complexity here is very specific: it is the local learning coefficient, $\lambda$, of the critical points, first alluded to by Watanabe (2009, §7.6) and clarified recently in Lau et al. (2023). We can think of this internal model selection as a discrete dynamical process: at various critical sample sizes the posterior concentration “jumps” from one region $\mathcal{W}_\alpha$ of parameter space to another region $\mathcal{W}_\beta$. We refer to an event of this kind as a Bayesian phase transition $\alpha \to \beta$.

For the TMS model with two hidden dimensions we show that these Bayesian phase transitions actually occur and do so between phases dominated by weight configurations representing regular polygons (termed here $k$-gons). The main result of SLT, the asymptotic expansion of the free energy (Watanabe, 2018), predicts phase transitions as a function of the loss and local learning coefficient of each phase. For TMS, we are in the fortunate position of being able to derive theoretically the exact local learning coefficient of the $k$-gons which are most commonly encountered during MCMC sampling of the posterior, and thereby verify that the mathematical theory correctly predicts the empirically observed phases and phase transitions.

∗ These authors contributed equally to this work.
† School of Mathematics and Statistics, University of Melbourne.
arXiv:2310.06301v1 [cs.LG] 10 Oct 2023
Altogether, this forms a mathematically well-founded toolkit for reasoning about phase transitions in the Bayesian posterior of TMS.

Figure 1: In TMS for $r = 2$ hidden dimensions and $c = 6$ feature dimensions, SGD seems to perform an internal form of Occam’s Razor: at the beginning of training, high loss solutions are tolerated because they have low complexity (low local learning coefficient $\hat{\lambda}$) but at the end of training low loss solutions are attractive despite their high complexity (high local learning coefficient $\hat{\lambda}$). The top row shows a visualization of the columns $W_i$ of three snapshots (timestamps shown as red dots in the loss plot). For more examples and a guide to reading these plots, see Appendix B.

It has been observed empirically in TMS that SGD training also undergoes “phase transitions” (Elhage et al., 2022) in the sense that we often see steady plateaus in the training (and test) loss separated by sudden transitions, associated with geometric transformations in the configuration of the columns of the weight matrix. Figure 1 shows a typical example. We refer to these as dynamical transitions. A striking pattern emerges when we observe the evolution of the loss and the estimated local learning coefficient, $\hat{\lambda}$, over the course of training: we see “opposing staircases” where each drop in the training and test loss is accompanied by a jump in the (estimated) local complexity measure. In essence, during the training process, as SGD reduces the loss, it exhibits an increasing tolerance for complex solutions. On these grounds we propose the Bayesian antecedent hypothesis, which says that these dynamical transitions have “standing behind them” a Bayesian phase transition. We begin in Section 3.1 by recalling the TMS, and present a closed form for the population loss in the high sparsity limit.
In our first contribution, we provide a partial classification of critical points of the population loss (Section 3.2) and document the local learning coefficients of several of these critical points (Section 3.3). In our second contribution, we experimentally verify that the main phase transition predicted by the internal model selection theory, using the theoretically derived local learning coefficients, actually takes place (Section 4.2). In Section 5 we present experimental results on dynamical transitions in TMS. Our third contribution is to show empirically that SGD training in TMS transitions from high-loss-low-complexity solutions to low-loss-high-complexity solutions, where complexity is measured by the estimated local learning coefficient. This provides support for our proposed relation between Bayesian and dynamical transitions (Section 5.1).

2 Related work

The TMS problem is, with the nonlinearity removed and varying importance factors, solved by computing principal components; it has long been understood that the learning dynamics of computing principal components is determined by a unique global minimum and a hierarchy of saddle points of decreasing loss (Baldi & Hornik, 1989), (Amari, 2016, §13.1.3). In recent decades an extensive literature has emerged on Deep Linear Networks (DLNs) building on these results, and applying them to explain phenomena in the development of both natural and artificial neural networks (Saxe et al., 2019). Under some hypotheses the saddles of a DLN are strict (Kawaguchi, 2016) and all local minima are global; this suggests a picture of gradient descent dynamics moving through neighbourhoods of saddles of ever-decreasing index until reaching a global minimum. This has been termed “saddle-to-saddle” dynamics by Jacot et al. (2021).
Through careful analysis of training dynamics it has been shown for DLNs that there is a general tendency of optimization trajectories towards solutions of lower loss and higher “complexity”, which is generally defined in an ad-hoc way depending on the data distribution (Arora et al., 2018; Li et al., 2020; Eftekhari, 2020; Advani et al., 2020). For example, it has been shown that gradient-based optimization introduces a form of implicit regularization towards low-rank solutions in deep matrix factorization (Arora et al., 2019).

Viewing the optimization process as a search for solutions which begins at candidates of low complexity, the tendency to gradually increase complexity “only when necessary” has been put forward as a potential explanation for the generalization performance of neural networks (Gissin et al., 2019). This intuition is backed by results such as (Gidel et al., 2019; Saxe et al., 2013), which show that for DLNs the singular values of the model are learned separately at different rates, with features corresponding to larger singular values learned first.

Outside of the DLN models, saddle-to-saddle dynamics of SGD training have been studied in toy non-linear models often referred to as single-index or multi-index models. In these models, the target function for input $x \in \mathbb{R}^d$ is generated by a non-linear, low dimensional function $\varphi : \mathbb{R}^k \to \mathbb{R}$ via $f(x) = \varphi(\theta^T x)$ with $\theta \in \mathbb{R}^{d \times k}$ where $k \ll d$. Single-index refers to $k = 1$. In a very recent work, Abbe et al. (2023) showed that for a particular multi-index model with certain restrictions on the input data distribution, SGD follows a saddle-to-saddle dynamic where the learning process adaptively selects target functions of increasing complexity. Their Figure 1 tells the same story as our Figure 1: at the beginning of training, low complexity solutions are preferred, and the opposite preference develops as training progresses.
One attempt to put these intuitions in a broader context is (Zhang et al., 2018) which relates the above phenomena to entropy-energy competition in statistical physics. However this approach suffers from a lack of theoretical justification due to an incorrect application of the Laplace approximation (Wei et al., 2022b; Lau et al., 2023). The internal model selection principle (Section 4.1) in singular learning theory provides the correct form of entropy-energy competition for neural networks and potentially gives a theoretical backing for the intuitions developed in the DLN literature.

3 Toy Model of Superposition

3.1 The TMS Potential

We recall the Toy Model of Superposition (TMS) setup from (Elhage et al., 2022) and derive a closed-form expression for the population loss in the high sparsity limit. The TMS is an autoencoder with input and output dimension $c$ and hidden dimension $r < c$:

$$f : X \times \mathcal{W} \to \mathbb{R}^c, \qquad f(x, w) = \mathrm{ReLU}(W^T W x + b), \tag{1}$$

where $w = (W, b) \in \mathcal{W} \subseteq M_{r,c}(\mathbb{R}) \times \mathbb{R}^c$ and inputs are taken from $x \in X = [0,1]^c$. We suppose that the (unknown) true generating mechanism of $x$ is given by the distribution

$$q(x) = \sum_{i=1}^{c} \frac{1}{c}\, \delta_{x \in C_i} \tag{2}$$

where $C_i$ denotes the $i$th coordinate axis intersected with $X$. Sampling from $q(x)$ can be described as follows: uniformly sample a coordinate $1 \le i \le c$ and then uniformly sample a length $0 \le \mu \le 1$, return $\mu e_i$ where $e_i$ is the $i$th unit vector. This is the high sparsity limit of the TMS input distribution of Elhage et al. (2022), see also Henighan et al. (2023). We posit the probability model

$$p(x \mid w) \propto \exp\left(-\tfrac{1}{2}\,\|x - f(x, w)\|^2\right), \tag{3}$$

which leads to the expected negative log likelihood $-\int q(x) \log p(x \mid w)\, dx$. Dropping terms constant with respect to $w$ we arrive at the population loss function

$$L(w) = \int q(x)\, \|x - f(x, w)\|^2\, dx.$$

Given $W \in M_{r,c}(\mathbb{R})$ we denote by $W_1, \dots, W_c$ the columns of $W$.
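The model (1), the input distribution (2) and the population loss can be sketched numerically as follows. This is a minimal illustration rather than the authors' code: the function names `tms_forward` and `population_loss` and the midpoint-rule integration over $\mu$ (with a hypothetical default of 500 steps) are assumptions introduced here.

```python
def tms_forward(W, b, x):
    """Evaluate f(x, w) = ReLU(W^T W x + b); W is a list of r rows of length c."""
    r, c = len(W), len(b)
    hidden = [sum(W[a][j] * x[j] for j in range(c)) for a in range(r)]  # W x
    pre = [sum(W[a][i] * hidden[a] for a in range(r)) + b[i] for i in range(c)]
    return [max(0.0, t) for t in pre]


def population_loss(W, b, steps=500):
    """Approximate L(w) = (1/c) * sum_i int_0^1 ||mu e_i - f(mu e_i, w)||^2 dmu.

    Inputs under q(x) are mu * e_i with i uniform over coordinates and mu
    uniform on [0, 1]; the integral over mu uses a midpoint rule (assumption).
    """
    c = len(b)
    total = 0.0
    for i in range(c):
        for s in range(steps):
            mu = (s + 0.5) / steps
            x = [0.0] * c
            x[i] = mu
            y = tms_forward(W, b, x)
            total += sum((xj - yj) ** 2 for xj, yj in zip(x, y)) / steps
    return total / c


# Example: r = 2, c = 4 with the columns of W forming a unit square.
W_square = [[1.0, 0.0, -1.0, 0.0],
            [0.0, 1.0, 0.0, -1.0]]
loss_square = population_loss(W_square, [0.0] * 4)
```

With all weights and biases zero the model outputs zero, so the loss reduces to $\int_0^1 \mu^2\, d\mu = 1/3$. In the square example each input $\mu e_i$ is reconstructed exactly, so the loss vanishes.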
We set

1. $P_{i,j} = \{(W, b) \in M_{r,c}(\mathbb{R}) \times \mathbb{R}^c \mid W_i \cdot W_j > 0 \text{ and } -W_i \cdot W_j \le b_i \le 0\}$;
2. $P_i = \{(W, b) \in M_{r,c}(\mathbb{R}) \times \mathbb{R}^c \mid \|W_i\|^2 > 0 \text{ and } -\|W_i\|^2 \le b_i \le 0\}$;
3. $Q_{i,j} = \{(W, b) \in M_{r,c}(\mathbb{R}) \times \mathbb{R}^c \mid -W_i \cdot W_j > b_i > 0\}$.

For $w = (W, b)$ we set $\delta(P_{i,j})$ to be $1$ if $w \in P_{i,j}$ and $0$ otherwise, similarly for $\delta(P_i)$, $\delta(Q_{i,j})$.

Lemma 3.1. For $w = (W, b) \in M_{r,c}(\mathbb{R}) \times \mathbb{R}^c$ we have $L(w) = \frac{1}{3c} H(w)$ where

$$H(W, b) = \sum_{i=1}^{c} \Big[ \delta(b_i \le 0)\, H_i^-(W, b) + \delta(b_i > 0)\, H_i^+(W, b) \Big] \tag{4}$$

and

$$H_i^-(W, b) = \sum_{j \ne i} \delta(P_{i,j})\, \frac{1}{W_i \cdot W_j}\, (W_i \cdot W_j + b_i)^3 + \delta(P_i) \left[ \frac{b_i^3}{\|W_i\|^4} + \frac{b_i^3}{\|W_i\|^2} \right] + (1 - \delta(P_i)) + \delta(P_i)\, N_i$$

$$H_i^+(W, b) = \sum_{j \ne i} \delta(Q_{i,j}) \left[ -\frac{1}{W_i \cdot W_j}\, b_i^3 \right] + \sum_{j \ne i} (1 - \delta(Q_{i,j})) \left[ (W_i \cdot W_j)^2 + 3 (W_i \cdot W_j)\, b_i + 3 b_i^2 \right] + N_i$$

where $N_i = (1 - \|W_i\|^2)^2 - 3 (1 - \|W_i\|^2)\, b_i + 3 b_i^2$.

Proof. See Appendix G.

We refer to $H(w)$ as the TMS potential. While this function is analytic at many of the critical points of relevance when $r = 2$, it is not analytic at the $4$-gons (see Appendix J).

3.2 $k$-gon critical points

We prove that various $k$-gons are critical points for $H$ when $r = 2$. Recall that $w^* \in \mathcal{W}$ is a critical point of $H$ if $\nabla H|_{w = w^*} = 0$. The function $H$ is clearly $O(r)$-invariant: if $O$ is an orthogonal matrix then $H(OW, b) = H(W, b)$. The potential is also invariant to jointly permuting the columns and biases. Due to these generic symmetries we may without loss of generality assume that the columns $W_i$ of $W$ are ordered anti-clockwise in $\mathbb{R}^2$ with zero columns coming last. For $i = 1, \dots, c$, let $\theta_i \in [0, 2\pi)$ denote the angle between nonzero columns $W_i$ and $W_{i+1}$, where $c + 1$ is defined to be $1$. Let $l_i \in \mathbb{R}_{\ge 0}$ denote $\|W_i\|$. In this parametrization $W$ has coordinate $(l_1, \dots, l_c, \theta_1, \dots, \theta_c, b_1, \dots, b_c)$ with constraint $\theta_1 + \cdots + \theta_c = 2\pi$. Since $O(2)$ has dimension $1$ any critical point of $H$ is automatically part of a $1$-parameter family. For convenience we refer to a critical point as non-degenerate (resp. minimally singular) if it has these properties modulo the generic symmetries, that is, in the $\theta, l, b$ parametrization. Thus, a critical point is non-degenerate (resp.
minimally singular) if in a local neighbourhood in the $\theta, l, b$ parametrization $H$ can be written as a full sum of squares with nonzero coefficients (resp. a non-full sum of squares). For background on the minimal singularity condition see (Wei et al., 2022b, §4) and (Lau et al., 2023, Appendix A).

We call $w \in M_{2,c}(\mathbb{R}) \times \mathbb{R}^c$ a standard $k$-gon for $k \in \{4, 5, 6, 7, 8\}$ and $k \le c$ if it has coordinate

$$l_1 = \cdots = l_k = l^*, \qquad l_{k+1} = \cdots = l_c = 0,$$
$$\theta_1 = \cdots = \theta_{k-1} = \frac{2\pi}{k}, \qquad \theta_k + \cdots + \theta_c = \frac{2\pi}{k},$$
$$b_1 = \cdots = b_k = b^*, \qquad b_{k+1} < 0, \dots, b_c < 0$$

where $l^* \in \mathbb{R}_{>0}$, $b^* \in \mathbb{R}_{\le 0}$ are the unique joint solution to $-(l^*)^2 \cos\!\left(s \frac{2\pi}{k}\right) \le b^*$ where $s$ is the unique integer in $\left[\frac{k}{4} - 1, \frac{k}{4}\right)$ (see Theorem H.1). For values of $l^*, b^*$ see Table A.1. Any parameter of this form is proven to be a critical point of $H$ in Appendix H.

For $k$ as above and $0 \le \sigma \le c - k$, a $k^{\sigma+}$-gon is a parameter with the same specification as the standard $k$-gon except that $\sigma$ of the biases $b_{k+1}, \dots, b_c$ are equal to $1/(2c)$ and the rest have arbitrary negative values. We usually write for example $k^{++}$ when $\sigma = 2$, noting that the $k^{0+}$-gon is the standard $k$-gon. These parameters are proven to be critical points of $H$ when $k \ge 5$ in Appendix I and for $k = 4$ in Appendix J.2.

For $k = 4$ there are a number of additional “exotic” $4$-gons. They are parametrized by $0 \le \sigma \le c - k$ and $0 \le \phi \le 4$. A $4^{\sigma+,\phi-}$-gon has the same specification as the $4^{\sigma+}$-gon, except that a subset of the biases $I \subseteq \{1, 2, 3, 4\}$ of size $|I| = \phi$ are special in the following sense: for any $i \notin I$ the bias $b_i$ has the optimal value $b_i = b^* = 0$ and the corresponding length is standard $l_i = l^* = 1$, but if $i \in I$ then $b_i < 0$ and $l_i$ is subject only to the constraint $l_i^2 < -b_i$. We write for example $4^{++-}$ for the $4^{2+,1-}$-gon. These are proven to be critical points of $H$ in Appendix J.2.

In Appendix A, we provide visualizations and a quick guide for recognizing these critical points and their variants.
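To make the coordinates of a standard $k$-gon concrete, the sketch below builds one representative of the $O(2)$-orbit, with the first column placed on the positive horizontal axis. The function name `standard_k_gon` and the default value `b_dead = -1.0` for the "arbitrary negative" biases are hypothetical. For $k = 4$ the text above gives $l^* = 1$, $b^* = 0$; for other $k$ the values come from Table A.1, which is not reproduced here, so they are left as arguments.

```python
import math


def standard_k_gon(k, c, l_star, b_star, b_dead=-1.0):
    """Return (W, b) for a standard k-gon with r = 2 hidden dimensions.

    W is a list of 2 rows of length c. Columns 1..k have norm l_star and are
    separated by the angle 2*pi/k; columns k+1..c are zero. Biases 1..k equal
    b_star, and the remaining biases take the negative value b_dead
    (an arbitrary choice made here for illustration).
    """
    if k not in (4, 5, 6, 7, 8) or k > c:
        raise ValueError("need k in {4,...,8} and k <= c")
    if b_dead >= 0:
        raise ValueError("biases of the zero columns must be negative")
    W = [[0.0] * c for _ in range(2)]
    for i in range(k):
        angle = 2.0 * math.pi * i / k
        W[0][i] = l_star * math.cos(angle)
        W[1][i] = l_star * math.sin(angle)
    b = [b_star] * k + [b_dead] * (c - k)
    return W, b


# The standard 4-gon for c = 6, using l* = 1 and b* = 0 as stated in the text.
W4, b4 = standard_k_gon(4, 6, 1.0, 0.0)
```

Any rotation or reflection of the returned $W$, or any joint permutation of columns and biases, describes the same critical point up to the generic symmetries.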
What we know is the following: the standard $k$-gon for $k = c$ is a non-degenerate critical point (modulo the generic symmetries) for $c \in \{5, 6, 7, 8\}$ in the sense that in local coordinates in the $l, \theta, b$-parametrization near the critical point $H$ can be written as a full sum of squares (Section H.1). For $c > 8$ and $c$ being a multiple of $4$, we conjecture that the $c$-gon is a critical point (Section H.2), and we also conjecture that for $c > 8$ and $c$ not a multiple of $4$ there is no $c$-gon which is a critical point. When $k \in \{5, 6, 7, 8\}$ and $k < c$ the standard $k$-gon is minimally singular (Appendix H.3).

3.3 Local learning coefficients

The local learning coefficient was proposed in Lau et al. (2023) as a general measure to quantify the degeneracy of a critical point in singular models. Table 1 summarises theoretical local learning coefficients $\lambda$ and losses $L$ for some critical points.³ For more theoretical values see Table H.2 and Appendix H, and for empirical estimates Appendix K. In minimally singular cases (including $5$, $5^+$, $6$) the local learning coefficient agrees with a simple dimension count (half the number of normal directions to the level set, which is locally a manifold). This explains why the coefficient increases by $\frac{3}{2}$ as we move from the $5$-gon to the $6$-gon: this transition fixes one column of $W$ ($2$ parameters) and the corresponding entry in the bias $b$, and so reduces by $3$ the number of free parameters, increasing the learning coefficient by $\frac{3}{2}$ (for further discussion see Appendix E).

Critical point | Local learning coefficient $\lambda$ | Loss $L$
$5$ | 7 | 0.06874
$5^+$ | 8.5 | 0.06180
$6$ | 8.5 | 0.04819

Table 1: Critical points and their theoretical $\lambda$ and $L$ values for the $r = 2$, $c = 6$ TMS potential.

4 Bayesian Phase Transitions

In Bayesian statistics there is a fundamental distinction between the learning process for regular models and singular models. In regular models, as the number of samples $n$ increases, the posterior concentrates at the MAP estimator and looks increasingly Gaussian.
In singular models, which include neural networks, we expect rather that the learning process is dominated by phase transitions, where at some critical values $n \approx n_{cr}$ the posterior “jumps” from one region of parameter space to another.⁴ This is a universal phenomenon in singular learning theory (Watanabe, 2009, 2020).

³ The $4$-gons are on the boundary of multiple chambers (see Appendix J).
⁴ Another important class of phase transitions, where the posterior jumps as a hyperparameter in the prior or true distribution is varied, will not be discussed here; see (Watanabe, 2018, §9.4), (Carroll, 2021).

4.1 Internal Model Selection

We present phase transitions of the Bayesian posterior in SLT building on (Watanabe, 2009, §7.6), (Watanabe, 2018, §9.4), Watanabe (2020). We assume $(p(x \mid w), q(x), \varphi(w))$ is a model-truth-prior triplet with parameter space $\mathcal{W} \subseteq \mathbb{R}^d$ satisfying the fundamental conditions of (Watanabe, 2009) and the relative finite variance condition (Watanabe, 2018). Given a dataset $D_n = \{x_1, \dots, x_n\}$, we define the empirical negative log likelihood function

$$L_n(w) = -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid w).$$

The posterior distribution $p(w \mid D_n)$ is, up to a normalizing constant, given by $\exp(-n L_n(w))\, \varphi(w)$. The marginal likelihood is the intractable normalizing constant of the posterior distribution. The free energy $F_n$ is defined to be the negative log of the marginal likelihood:

$$F_n = -\log \int_{\mathcal{W}} \exp(-n L_n(w))\, \varphi(w)\, dw. \tag{5}$$

The asymptotic expansion in $n$ is (Watanabe, 2018, §6.3) given by

$$F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1) \tag{6}$$

where $w_0$ is an optimal parameter, $\lambda$ is the learning coefficient and $m$ is the multiplicity. We refer to this as the free energy formula. The philosophy behind using the marginal likelihood (or equivalently, the free energy) to perform model selection is well established.
Thus we could use the first two terms in (6) to choose between two competing models on the basis of their fit (as measured by $n L_n$) and their complexity (as measured by $\lambda$). We can also apply the same principle to different regions of the parameter space in the same model.

Let $\{\mathcal{W}_\alpha\}_\alpha$ be a finite collection of compact semi-analytic subsets of $\mathcal{W}$ with nonempty interior, whose interiors cover $\mathcal{W}$. We assume each $\mathcal{W}_\alpha$ contains in its interior a point $w^*_\alpha$ minimising $L$ on $\mathcal{W}_\alpha$ and that the triple $(p, q, \varphi)$ restricted to $\mathcal{W}_\alpha$ in the obvious sense has relative finite variance. We refer to the $\alpha$ rather loosely as phases. We can choose a partition of unity $\rho_\alpha$ subordinate to a suitably chosen cover, so as to define $\varphi_\alpha(w) = \rho_\alpha(w)\, \varphi(w)$ with

$$F_n = -\log \int_{\mathcal{W}} e^{-n L_n(w)} \varphi(w)\, dw = -\log \sum_\alpha \int_{\mathcal{W}_\alpha} e^{-n L_n(w)} \varphi_\alpha(w)\, dw$$
$$= -\log \sum_\alpha V_\alpha \int_{\mathcal{W}_\alpha} e^{-n L_n(w)} \bar{\varphi}_\alpha(w)\, dw = -\log \sum_\alpha e^{-F_n(\mathcal{W}_\alpha) - v_\alpha}$$

where $\bar{\varphi}_\alpha = \frac{1}{V_\alpha} \varphi_\alpha$ for $V_\alpha = \int_{\mathcal{W}_\alpha} \varphi_\alpha\, dw$, $v_\alpha = -\log(V_\alpha)$ and

$$F_n(\mathcal{W}_\alpha) = -\log \int_{\mathcal{W}_\alpha} e^{-n L_n(w)} \bar{\varphi}_\alpha(w)\, dw \tag{7}$$

denotes the free energy of the restricted tuple $(p, q, \bar{\varphi}_\alpha, \mathcal{W}_\alpha)$. We will refer to $F_n(\mathcal{W}_\alpha)$ as the local free energy. Using the log-sum-exp approximation, we can write $F_n = -\log \sum_\alpha e^{-F_n(\mathcal{W}_\alpha) - v_\alpha} \approx \min_\alpha \left[ F_n(\mathcal{W}_\alpha) + v_\alpha \right]$. Since (6) applies to the restricted tuple $(p, q, \bar{\varphi}_\alpha, \mathcal{W}_\alpha)$ we have

$$F_n(\mathcal{W}_\alpha) = n L_n(w^*_\alpha) + \lambda_\alpha \log n - (m_\alpha - 1) \log\log n + O_p(1) \tag{8}$$

which we refer to as the local free energy formula.⁵ In this paper we absorb the volume constant $v_\alpha$ and terms of order $\log\log n$ or lower in (8) into a term $c_\alpha$ that we treat as effectively constant, giving

$$F_n \approx \min_\alpha \left[ n L_n(w^*_\alpha) + \lambda_\alpha \log n + c_\alpha \right]. \tag{9}$$

A principle of internal model selection is suggested by (9) whereby the Bayesian posterior “selects” a phase $\alpha$ based on the local free energy of the phase, in the sense that this phase contains most of the probability mass of the posterior for this value of $n$ (Watanabe, 2009, §7.6).⁶ At a given value of $n$ we can order the phases by their posterior concentration, or what is the same, their free energies $F_n(\mathcal{W}_\alpha)$.
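The selection rule in (9) can be sketched in a few lines. The example below is illustrative only: it plugs the theoretical values of Table 1 into $n L + \lambda \log n + c_\alpha$, treating the tabulated $L$ as the per-sample loss in the formula and setting every $c_\alpha$ to zero. Both choices are assumptions made here, so the sample size at which the preferred phase changes should not be read as the paper's own prediction. The function names are hypothetical.

```python
import math

# (loss, lambda, c_alpha) per phase. Loss and lambda are the theoretical
# values of Table 1 (r = 2, c = 6); c_alpha = 0.0 is an ASSUMPTION.
TABLE_1_PHASES = {
    "5": (0.06874, 7.0, 0.0),
    "5+": (0.06180, 8.5, 0.0),
    "6": (0.04819, 8.5, 0.0),
}


def local_free_energy(n, loss, lam, const=0.0):
    """Approximate local free energy n * L + lambda * log(n) + c, as in (9)."""
    return n * loss + lam * math.log(n) + const


def dominant_phase(n, phases):
    """Label of the phase with the smallest approximate local free energy."""
    return min(phases, key=lambda a: local_free_energy(n, *phases[a]))


def posterior_proportions(n, phases):
    """Posterior mass per phase, taken proportional to exp(-F_n(W_alpha))."""
    energies = {a: local_free_energy(n, *p) for a, p in phases.items()}
    lowest = min(energies.values())  # subtract the minimum for stability
    weights = {a: math.exp(lowest - f) for a, f in energies.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def critical_sample_size(phase_a, phase_b, lo, hi, iters=200):
    """Bisection for the n in [lo, hi] where the two free energies cross."""
    def gap(n):
        return local_free_energy(n, *phase_b) - local_free_energy(n, *phase_a)
    if not (gap(lo) > 0 > gap(hi)):
        raise ValueError("expected phase_a preferred at lo and phase_b at hi")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under these assumptions the low-complexity $5$-gon has the smallest free energy at small $n$ and the low-loss $6$-gon takes over at larger $n$. Nonzero constants $c_\alpha$ would shift the crossing point.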
We say there is a local phase transition between phases $\alpha, \beta$ at critical sample size $n_{cr}$, written $\alpha \to \beta$, if the position of $\alpha, \beta$ in this ordered list of phases swaps. That is, for $n \approx n_{cr}$ and $n < n_{cr}$ the Bayesian posterior prefers $\alpha$ to $\beta$, and the reverse is true for $n > n_{cr}$. We say that a phase $\alpha$ dominates the posterior at $n$ if it has the highest posterior mass, that is, $F_n(\mathcal{W}_\alpha) < F_n(\mathcal{W}_\beta)$ for all $\beta \ne \alpha$. A global phase transition is a local phase transition where $\alpha$ dominates the posterior for $n < n_{cr}$ and $\beta$ dominates for $n > n_{cr}$ with $n$ near $n_{cr}$. Generally when we speak of a phase transition in this paper we mean a local transition. Generically, phase transitions occur when, as $n$ increases, phases with lower loss and higher complexity are preferred; this expectation is verified in TMS in the next section. For more on the theory of Bayesian phase transitions see Appendix C.

⁵ In general deriving the free energy formula requires some sophisticated mathematics (Watanabe, 2009, 2018) but when the critical point $w^*_\alpha$ dominating the phase $\mathcal{W}_\alpha$ is minimally singular, simpler techniques similar to (Balasubramanian, 1997) suffice; see (Lau et al., 2023, Appendix A). Many, but not all, of the singularities appearing in this paper are minimally singular.
⁶ We often replace $L_n(w^*_\alpha)$ by $L(w^*_\alpha)$ in comparing phases; see (Watanabe, 2018, §9.4).

Figure 2: Proportion of Bayesian posterior concentrated in regions $\mathcal{W}_{k,\sigma}$ for $r = 2$, $c = 6$ according to the free energy formula (theory, left) and MCMC sampling of the posterior (experimental, right). Theory predicts, and experiments show, a phase transition $5 \to 6$ in the range $600 \le n \le 700$.

4.2 Experiments

There is a fundamental tension in the internal model selection story elaborated above: the free energy formula is asymptotic in $n$, but a theoretical discussion of phase transitions involves comparing local free energies $F_n(\mathcal{W}_\alpha)$ at finite $n$.
Whether or not this is valid, in a given range of $n$ and for a given system, is a question that may be difficult to resolve purely theoretically. We show experimentally for $r = 2$, $c = 6$ that a Bayesian phase transition actually takes place between the $5$-gon and the $6$-gon, within a range of $n$ values consistent with the free energy formula. In this section we focus on the case $r = 2$, $c = 6$. For $c \in \{4, 5\}$ see Appendix F.4.

We first define regions of parameter space $\{\mathcal{W}_\alpha\}_\alpha$. Given a matrix $W$ we write $\mathrm{ConvHull}(W)$ for the number of points in the convex hull of the set of columns. For $3 \le k \le c$, $0 \le \sigma \le c - k$ we define

$$\mathcal{W}_{k,\sigma} = \{ w = (W, b) \in \mathcal{W} \mid \mathrm{ConvHull}(W) = k \text{ and } b \text{ has } \sigma \text{ positive entries} \}.$$

The set $\mathcal{W}_{k,\sigma} \subseteq \mathbb{R}^d$ is semi-analytic and contains the $k^{\sigma+}$-gon in its interior. For $\alpha = (k, \sigma)$ we let $w^*_\alpha$ denote the parameter of the $k^{\sigma+}$-gon. We verify experimentally the hypothesis that this parameter dominates the Bayesian posterior of $\mathcal{W}_\alpha$ (see Appendix F) by which we mean that most samples from the posterior for a relevant range of $n$ values are “close” to $w^*_\alpha$.⁷ In this sense the choice of phases $\mathcal{W}_\alpha$ is appropriate for the range of sample sizes we consider.

We draw posterior samples using MCMC-NUTS (Homan & Gelman, 2014) with prior distribution $N(0, 1)$ and sample sizes $n$. Each posterior sample is then classified into some $\mathcal{W}_{k,\sigma}$ (our classification algorithm for deciding the appropriate value of $k$ is not error-free). For each $n$, $10$ datasets are generated and the average proportion of the $k$-gons, and standard error, is reported in Figure 2. Details of the theoretical proportion plot are given in Appendix F.1.

Let $w^*_\alpha, w^*_\beta$ be $k$-gons dominating phases $\mathcal{W}_\alpha, \mathcal{W}_\beta$. A Bayesian phase transition $\alpha \to \beta$ occurs when the difference between the free energies $F_n(\mathcal{W}_\beta) - F_n(\mathcal{W}_\alpha)$ swaps sign, from positive to negative, as explained in the previous section. The most distinctive feature in the experimental plot is the $5 \to 6$ transition in the range $600 \le n \le 700$. The free energy formula predicts this transition at $n_{cr} \approx 600$ (Appendix C.2).
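The regions $\mathcal{W}_{k,\sigma}$ defined in this section suggest a simple way to label a parameter: count the vertices of the convex hull of the columns of $W$ and count the positive biases. The sketch below does this with Andrew's monotone chain algorithm. It is not the authors' classifier, which the text notes is not error-free; the tolerance `tol` and all function names are hypothetical.

```python
import math


def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertex_count(points, tol=1e-9):
    """Number of vertices of the convex hull of 2D points (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return len(pts)

    def half_hull(seq):
        chain = []
        for p in seq:
            # Drop points that do not make a strict counter-clockwise turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= tol:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(list(reversed(pts)))
    return len(lower) + len(upper) - 2  # the two end points are shared


def classify(W, b):
    """Return (k, sigma): hull vertex count of the columns, positive-bias count."""
    columns = [(W[0][i], W[1][i]) for i in range(len(b))]
    sigma = sum(1 for value in b if value > 0)
    return hull_vertex_count(columns), sigma


def polygon_columns(k, c):
    """Helper for the example: k unit columns on a regular k-gon, rest zero."""
    W = [[0.0] * c for _ in range(2)]
    for i in range(k):
        W[0][i] = math.cos(2.0 * math.pi * i / k)
        W[1][i] = math.sin(2.0 * math.pi * i / k)
    return W


# A pentagon with one zero column and one positive bias lands in W_{5,1}.
label = classify(polygon_columns(5, 6), [-0.1] * 5 + [1.0 / 12.0])
```

A zero column lies strictly inside the hull of the other columns, so it does not contribute to $k$. For noisy posterior samples the result depends on `tol`, which is one way such a classification can go wrong.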
An alternative visualization of the $5 \to 6$ transition using t-SNE is given in Appendix F.2. As $n$ decreases past $400$ the MCMC classification becomes increasingly uncertain, and it is less clear that we should expect the free energy formula to be a good model of the Bayesian posterior, so we should not read too much into any correspondence between the plots for $n \le 400$ (see Appendix F.2).

⁷ The $k^{\sigma+,\phi-}$-gons for $\phi > 0$ have high loss but may nonetheless dominate the posterior for very low $n$, however this is outside the scope of our experiments, which ultimately dictates the choice of the set $\mathcal{W}_{k,\sigma}$.

Figure 3: Visualization of $400$ SGD trajectories initialized at MCMC samples from the Bayesian posterior for $r = 2$, $c = 6$ at sample size $n = 100$. We see that SGD trajectories are dominated by plateaus at loss values corresponding to our classification of critical points (Appendix A) and that lower loss critical points have higher estimated local learning coefficients. Note that for highly singular critical points we see that $\hat{\lambda}$ is unable to provide non-negative values without additional hyperparameter tuning, but the ordinality (more positive is less degenerate) is nonetheless correct. See Appendix K for details and caveats for the $\hat{\lambda}$ estimation.

5 Dynamical Phase Transitions

A dynamical transition $\alpha \to \beta$ occurs in a trajectory if it is near a critical point $w^*_\alpha$ of the loss at some time $\tau_1$ (e.g. there is a visible plateau in the loss) and at some later time $\tau_2 > \tau_1$ it is near $w^*_\beta$ without encountering an intermediate critical point. We conduct an empirical investigation into whether the $k$-gon critical points of the TMS potential dominate the behaviour of SGD trajectories for $r = 2$, $c = 6$, and the existence of dynamical transitions.

There are two sets of experiments. In the first we draw a training dataset $D_n = \{x_1, \dots, x_n\}$ where $n = 1000$ from the true distribution $q(x)$. We also draw a test set of size $5000$.
We use minibatch-SGD initialized at a $4$-gon plus Gaussian noise of standard deviation $0.01$, and run for $4500$ epochs with batch size $20$ and learning rate $0.005$. This initialisation is chosen to encourage the SGD trajectory to pass through critical points with high loss after a small number of SGD steps, allowing us to observe phase transitions more easily. Along the trajectory, we keep track of each iterate’s training loss, test set loss and theoretical test loss. Figure 1 is a typical example; additional plots are collected in Figures B.1-B.7. In the second set of experiments, summarized in Figure 3, we take the same size training dataset but initialize SGD trajectories differently, at random MCMC samples from the Bayesian posterior at $n = 100$ (a small value of $n$). The number of epochs is $5000$.

In both cases we estimate the local learning coefficient of the training iterates $w_t$. This is a newly-developed estimator (Lau et al., 2023) that uses SGLD (Welling & Teh, 2011) to estimate a version of the WBIC (Watanabe, 2013) localized to $w_t$ and then forms an estimate of the local learning coefficient $\hat{\lambda}(w_t)$ based on the approximation that $\mathrm{WBIC}(w_t) \approx n L_n(w_t) + \lambda(w_t) \log n$ where $\lambda(w_t)$ is the local RLCT. In the language of Lau et al. (2023), we use a full-batch version of SGLD with hyperparameters $\epsilon = 0.001$, $\gamma = 1$ and $500$ steps to estimate the local WBIC.

These experiments support the following description of SGD training for the TMS potential when $r = 2$, $c = 6$: trajectories are characterised by plateaus associated to the critical points described in Section 3.2 and further discussed in Appendix A. The dynamical transitions encountered are

$$4^{++-} \to 4^{+-} \to 4^{-}, \qquad 4^{+-} \to 4^{-}, \qquad 4^{++-} \to 4^{+-} \to 4, \qquad 4^{++-} \to 4^{+} \to 5 \to 5^{+}. \tag{10}$$

The dominance of the classified critical points, and the general relationship of decreasing loss and increasing complexity, can be seen in Figure 3.
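The approximation $\mathrm{WBIC}(w_t) \approx n L_n(w_t) + \lambda(w_t) \log n$ quoted above can be rearranged into an estimate of $\lambda(w_t)$. The sketch below shows only this rearrangement. It assumes the localized WBIC is estimated as $n$ times the average of $L_n$ over SGLD samples drawn near $w_t$, which is a simplification introduced here; the sampler and its hyperparameters are not reproduced. The function names are hypothetical.

```python
import math


def wbic_estimate(sampled_losses, n):
    """n times the mean of L_n over samples near w_t (ASSUMED WBIC estimate)."""
    return n * sum(sampled_losses) / len(sampled_losses)


def lambda_hat(sampled_losses, loss_at_wt, n):
    """Solve WBIC(w_t) ~ n * L_n(w_t) + lambda * log(n) for lambda."""
    return (wbic_estimate(sampled_losses, n) - n * loss_at_wt) / math.log(n)


# Usage with made-up numbers: three sampled losses slightly above L_n(w_t).
estimate = lambda_hat([0.101, 0.103, 0.102], loss_at_wt=0.05, n=1000)
```

If the sampled losses average below $L_n(w_t)$, for instance when $w_t$ is not close to a local minimum, this expression is negative. That is consistent with the remark in the caption of Figure 3 that $\hat{\lambda}$ can fail to be non-negative without further tuning.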
5.1 Relation Between Bayesian and Dynamical Transitions

Phases of the Bayesian posterior for TMS with $r = 2$, $c = 6$ are dominated by $k$-gons which are critical points of the TMS potential (Section 4). The same critical points explain plateaus of the SGD training curves (Section 5). This is not a coincidence: on the one hand SLT predicts that phases of the Bayesian posterior will be associated to singularities of the KL divergence, and on the other hand it is a general principle of nonlinear dynamics that singularities of a potential dictate the global behaviour of solution trajectories (Strogatz, 2018; Gilmore, 1981).

However, the relation between transitions of the Bayesian posterior and transitions over SGD training is more subtle. There is no necessary relation between these two kinds of transitions. A Bayesian transition $\alpha \to \beta$ might not have an associated dynamical transition if, for example, the regions $W_\alpha$, $W_\beta$ are distant or separated by high energy barriers. For example, the Bayesian phase transition $5 \to 6$ has not been observed as a dynamical transition (it may occur, just with low probability per SGD step). However, it seems reasonable to expect that for many dynamical transitions there exists a Bayesian transition between the same phases. We call this the Bayesian antecedent of the dynamical transition if it exists. This leads us to:

Bayesian Antecedent Hypothesis (BAH). The dynamical transitions $\alpha \to \beta$ encountered in neural network training have Bayesian antecedents.

Since a dynamical transition decreases the loss, the main obstruction to having a Bayesian antecedent is that in a Bayesian phase transition $\alpha \to \beta$ the local learning coefficient should increase (Appendix D). Thus the BAH is in a similar conceptual vein to the expectation, discussed in Section 2, that SGD prefers higher complexity critical points as training progresses.
While the dynamical transitions in (10) are all associated with increases in our estimate of the local learning coefficient, we also know that at low $n$ the constant terms can play a nontrivial role in the free energy formula. Our analysis (Appendix D.1) suggests that all dynamical transitions in (10) have Bayesian antecedents, with the possible exception of $4^{++-} \to 4^{+-}$ and $4^{+-} \to 4^{-}$, where the analysis is inconclusive.

6 Conclusion

Phase transitions and emergent structure are among the most interesting phenomena in modern deep learning (Wei et al., 2022a; Barak et al., 2022; Liu et al., 2022) and provide an interesting avenue for fundamental progress in neural network interpretability (Olsson et al., 2022; Nanda et al., 2023) and AI safety (Hoogland et al., 2023). Building on Elhage et al. (2022) we have shown that the Toy Model of Superposition with two hidden dimensions has, in the high sparsity limit, phase transitions in both stochastic gradient-based and Bayesian learning. We have shown that phases are in both cases dominated by $k$-gon critical points which we have classified, and we have proposed with the BAH a relation between transitions in SGD training and phase transitions in the Bayesian posterior.

Our analysis of TMS also demonstrates the practical utility of the local complexity measure $\hat{\lambda}$ introduced in Lau et al. (2023), which is an all-purpose tool for measuring model complexity. In this paper we have shown that this tool reveals in TMS an interesting sequential learning mechanism underlying SGD training, consistent with observations derived in other settings including DLNs (Arora et al., 2019; Gissin et al., 2019) and multi-index models (Abbe et al., 2023). However, we emphasise that $\hat{\lambda}$ has not been specifically engineered for studying complexity in TMS. In principle it can be used to study the development of internal structure over training in any neural network, from toy models like the ones considered here through to large language models.
Acknowledgement

SW is the recipient of an Australian Research Council Discovery Early Career Researcher Award (project number DE200101253) funded by the Australian Government. SW is also partially funded by an unrestricted gift from Google. We would like to thank Matthew Farrugia-Roberts, Jesse Hoogland and Liam Carroll for discussions and valuable feedback on the manuscript.

References

Emmanuel Abbe, Enric Boix-Adserà, and Theodor Misiakiewicz. SGD Learning on Neural Networks: Leap Complexity and Saddle-to-Saddle Dynamics. In Proceedings of Thirty Sixth Conference on Learning Theory, p. 2552–2623. PMLR, 2023.

Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.

Shun-ichi Amari. Information Geometry and its Applications, volume 194. Springer, 2016.

Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.

Vijay Balasubramanian. Statistical inference, Occam's razor, and Statistical Mechanics on the Space of Probability Distributions. Neural Computation, 9(2):349–368, 1997.

Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. Advances in Neural Information Processing Systems, 35:21750–21764, 2022.

Liam Carroll. Phase Transitions in Neural Networks. MSc Thesis at the University of Melbourne, 2021.

Armin Eftekhari. Training linear neural networks: Non-local convergence and complexity results. In International Conference on Machine Learning, p. 2836–2847. PMLR, 2020.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy Models of Superposition. Transformer Circuits Thread, 2022.

Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. Advances in Neural Information Processing Systems, 32, 2019.

R. Gilmore. Catastrophe Theory for Scientists and Engineers. Wiley, 1981.

Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The implicit bias of depth: How incremental learning drives generalization. arXiv preprint arXiv:1909.12051, 2019.

Tom Henighan, Shan Carter, Tristan Hume, Nelson Elhage, Robert Lasenby, Stanislav Fort, Nicholas Schiefer, and Christopher Olah. Superposition, Memorization, and Double Descent. Transformer Circuits Thread, 2023.

Matthew D Hoffman and Andrew Gelman. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.

Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet, and Stan van Wingerden. Towards Developmental Interpretability. https://www.lesswrong.com/posts/TjaeCWvLZtEDAS5Ex/towards-developmental-interpretability, 2023. [Online; accessed 28-September-2023].

Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, and Franck Gabriel. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021.

Kenji Kawaguchi. Deep learning without poor local minima. Advances in Neural Information Processing Systems, 29, 2016.

E. Lau, S. Wei, and D. Murfet. Quantifying degeneracy in singular models via the learning coefficient. arXiv preprint arXiv:2308.12108, 2023.

Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. arXiv preprint arXiv:2012.09839, 2020.

Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022.

Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. Acquisition of Chess Knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 119(47), 2022.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context Learning and Induction Heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019.

Steven H Strogatz. Nonlinear dynamics and chaos with student solutions manual: With applications to physics, biology, chemistry, and engineering. CRC Press, 2018.

Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.

Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, USA, 2009.

Sumio Watanabe. A Widely Applicable Bayesian Information Criterion. Journal of Machine Learning Research, 14:867–897, 2013.

Sumio Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, Taylor and Francis group, USA, 2018.

Sumio Watanabe. Cross Validation, Information Criterion and Phase Transition. Talk, 2020.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a. ISSN 2835-8856. URL https://openreview.net/forum?id=yzkSU5zdwD. Survey Certification.

Susan Wei, Daniel Murfet, Mingming Gong, Hui Li, Jesse Gell-Redman, and Thomas Quella. Deep Learning is Singular, and That's Good. IEEE Transactions on Neural Networks and Learning Systems, 2022b.

M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, 2011.

Yao Zhang, Andrew M Saxe, Madhu S Advani, and Alpha A Lee. Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning. Molecular Physics, 116(21-22):3214–3223, 2018.
A Fantastic Critical Points and Where to Find Them

In this section we provide a brief guide, using loose intuitive language, to recognising known critical parameters and their variations. Rigorous derivations and details are given in Appendices H, I and J. Broadly speaking, known critical parameters $(W^*, b^*)$ are classified by three discrete numbers:

• $k$: the number of vertices in the regular polygon formed by the convex hull of the columns of the $W^*$ matrix, each interpreted as a vector in $\mathbb{R}^2$. The lengths of these vectors (for any $k \le c$) have to be at the optimal values derived in Appendix H and listed in Table A.1.
• $\sigma$: the number of positive values in the bias vector. These positive biases are required to take the optimal value $b^* = 1/(2c)$ and have to occur at indices that do not correspond to the $k$-gon vertices.
• $\phi$: the number of large negative values in the bias vector. So far, we have only observed $\phi > 0$ when $k = 4$, i.e. this discrete subcategory only applies to $4$-gons. These biases have to occur at indices that do correspond to the $4$-gon vertices.

For $r = 2$, $c = 6$, the above description and constraints result in the $18$ families of critical points whose representative members are shown in Figure A.1 and whose loss or potential energy levels are shown in Figure A.2.

Next, we discuss possible variations within these families of critical points. Aside from the ever-present rotational and permutation symmetries discussed elsewhere, there are variations of these standard descriptions that allow the parameter to stay on the same critical submanifold. Figure A.3 shows some examples of irregular versions of known critical points. One can cross-check that their potential values $L$ are the same as those of their regular counterparts. Most of these variations are the result of negative values in the bias vector, which allow extra variability without changing the loss value. To explain the examples in Figure A.3:

• $5$-gon (top left). In the standard $5$-gon family, the vestigial bias $b'$ can take an arbitrary negative value and the corresponding weight column can be any vector so long as its length $l'$ is smaller than $\sqrt{\min\{|b'|, |b^*|\}}$, where $b^*$ is the optimal negative bias for the main columns (see Table A.1).
• $4$-gon (top right). The two vestigial biases can take arbitrary negative values.
• $4^{+--}$-gon (bottom left). Having two negative biases with large magnitude affords a few other degrees of freedom. The weight columns $W_3$, $W_4$ with those large negative biases can be any vectors as long as (1) their lengths are smaller than $\sqrt{|b_i|}$ for their respective biases and (2) they form obtuse angles relative to the other columns $W_1$ and $W_2$, i.e. the other two vertices of the $4$-gon.
• $4^{--}$-gon (bottom right). Other than the variation in the main columns $W_3$, $W_6$ with large negative biases, the two vestigial columns can also be any vectors as long as they stay within the sector between $W_1$ and $W_2$ and their lengths are bounded by $\min\{\sqrt{|b_i|} : i = 3, 4, 5, 6\}$.

Critical point    $l^*$      $b^*$
4-gon             1          0
5-gon             1.17046    −0.28230
6-gon             1.32053    −0.61814

Table A.1: Parameters of certain $k$-gons.

Figure A.1: Representative of each known class of critical parameters in $r = 2$, $c = 6$.

Figure A.2: Potential energy levels $L$ for known critical points in $r = 2$, $c = 6$.

Figure A.3: Irregular versions of known critical points in $r = 2$, $c = 6$.

B Additional examples of SGD trajectories for the $r = 2$, $c = 6$ TMS potential

In this Appendix we collect additional individual SGD trajectories for the $r = 2$, $c = 6$ TMS potential, with the same hyperparameters as discussed in Section 5. These are all of the runs from $30$ random seeds that had a dynamical transition.
We note that each of the critical points encountered in a plateau falls into the classification discussed in Appendix A and that the estimator $\hat{\lambda}$ for the local learning coefficient jumps in each transition. Note that the transitions in Figure 1 are $4^{++-} \to 4^{+} \to 5$.

First a brief guide to reading these figures, for example Figure B.1. The top row contains a visualization of the weight matrix $W$, one black arrow per column. Adjacent columns are connected by a blue line, and the red dashed line shows the convex hull. The middle row shows the parameter more quantitatively: the columns $W_i$ for $1 \le i \le 6$ are ordered counter-clockwise starting from the negative $x$-axis, and $\|W_i\|$ (black) and $|b_i|$ (red, green) are shown. A white plus sign indicates a bias that exceeds $1.25 \cdot \max_{i \in I} \|W_i\|^2$, where $I$ is the set of all columns $i$ with $b_i \ge \|W_i\|$. All columns share the same axes. Each column of the top and middle rows jointly displays the same parameter $w = (W, b)$, which corresponds (in order) to the points marked during training by red dots in the bottom row. The bottom row shows the losses and the local learning coefficient, with the latter smoothed over a window of size $6$, where $\hat{\lambda}$ is measured every $30$ epochs.

We note that in some runs containing particularly degenerate $k$-gons, such as the $4^{++-}$ in Figure B.1, the estimator $\hat{\lambda}$ produces negative values for the standard hyperparameter $\epsilon = 0.001$. By adapting this hyperparameter to the level of degeneracy we can correct for this and avoid invalid estimates (see Appendix K). But since we cannot predict the trajectory of SGD iterates, we choose to use the fixed hyperparameters $\gamma = 1.0$, $\epsilon = 0.001$ and $500$ SGLD steps in all figures of this form.

Figure B.1: Trajectory with dynamical transitions $4^{++-} \to 4^{+-} \to 4^{-}$.

Figure B.2: Trajectory with dynamical transition $4^{+-} \to 4$.

Figure B.3: Trajectory with dynamical transition $4^{+} \to 5$.
Figure B.4: Trajectory with dynamical transition $4^{+} \to 5$.

Figure B.5: Trajectory with dynamical transition $4^{+-} \to 4^{-}$.

Figure B.6: Trajectory with dynamical transition $4^{+} \to 5$.

Figure B.7: Trajectory with dynamical transition $4^{++-} \to 4^{+-} \to 4$.

C Using the Free Energy Formula

In Section 4.1 we defined a (local) phase transition $\alpha \to \beta$ at a critical sample size $n_{cr}$ to take place when the local free energies swap order, as in the following table, with $n$ taken close to $n_{cr}$:

$n < n_{cr}$                        $n = n_{cr}$                              $n > n_{cr}$
$F_n(W_\alpha) < F_n(W_\beta)$      $F_n(W_\alpha) \approx F_n(W_\beta)$      $F_n(W_\alpha) > F_n(W_\beta)$

We assume $n$ is in a range where the local free energies for $\gamma \in \{\alpha, \beta\}$ are well-approximated by the right hand side of the following

$$F_n(W_\gamma) \approx n L(w^*_\gamma) + \lambda_\gamma \log n + c_\gamma \tag{11}$$

for some constant $c_\gamma$. While $n$ is of course an integer, to find where the free energy curves cross we may treat $n$ as a real variable. To an ordered pair $\alpha, \beta$ we may associate

$$\Delta L = L(w^*_\beta) - L(w^*_\alpha), \qquad \Delta\lambda = \lambda_\beta - \lambda_\alpha, \qquad \Delta c = c_\beta - c_\alpha.$$

Then to solve $F_n(W_\alpha) = F_n(W_\beta)$ for $n$ we may instead solve

$$n \Delta L + \Delta\lambda \log n + \Delta c = 0. \tag{12}$$

Theoretically, a phase transition $\alpha \to \beta$ exists if and only if this equation has a positive solution. However, in practice the free energy formula on which this equation is based will only describe the Bayesian posterior well for sufficiently large $n$, and it is an empirical question what this $n$ may be. In the following, when we say that a phase transition is predicted to exist (or not), the reader should keep this caveat in mind. When we refer to theoretically derived values for phase transitions, we mean that we solve (12) with the given values of $\Delta L$, $\Delta\lambda$, $\Delta c$.
Note that if the phase $\beta$ has lower loss, learning coefficient and constant term (so that $\Delta L$, $\Delta\lambda$ and $\Delta c$ are all negative) then there can be no phase transition $\alpha \to \beta$, as $F_n(W_\alpha)$ is never lower than $F_n(W_\beta)$. Although the constant (and lower order) terms in the free energy expansion are not well understood, in this paper we proceed assuming that the leading contribution comes from the prior in the manner described in Section C.1 below.

C.1 Constant terms in the Free Energy Formula

Recall from Section 4.1 that given a collection of phases $W_\alpha$ the free energy is

$$F_n = -\log \sum_\alpha V_\alpha \int_{W_\alpha} e^{-n L_n(w)} \bar{\varphi}_\alpha(w) \, dw$$

where $\bar{\varphi}_\alpha = \frac{1}{V_\alpha} \varphi_\alpha$ for $V_\alpha = \int_{W_\alpha} \varphi_\alpha \, dw$. Suppose that the phase $W_\alpha$ is dominated by a critical point $w^*_\alpha$ and that the partition of unity is chosen so that $\varphi_\alpha(w^*_\alpha) \approx \varphi(w^*_\alpha)$ (this is reasonable since the critical point is in the interior). We explore the following approximation to the contribution of $\alpha$ to the above integral:

$$\int_{W_\alpha} e^{-n L_n(w)} \varphi_\alpha(w) \, dw \approx \varphi(w^*_\alpha) \int_{W_\alpha} e^{-n L_n(w)} \, dw.$$

This means that the prior contributes to $c_\alpha$ of (8) through $-\log(V_\alpha \bar{\varphi}_\alpha(w^*_\alpha))$ as well as through the $O_P(1)$ term of the asymptotic expansion. With a normal prior $\varphi(w) = \frac{1}{\sigma \sqrt{(2\pi)^d}} \exp\left(-\frac{1}{2\sigma^2} \|w\|^2\right)$ we have

$$V_\alpha \bar{\varphi}_\alpha(w^*_\alpha) = \varphi_\alpha(w^*_\alpha) \approx \frac{1}{\sigma \sqrt{(2\pi)^d}} \exp\left(-\frac{1}{2\sigma^2} \|w^*_\alpha\|^2\right).$$

Hence $-\log(V_\alpha \bar{\varphi}_\alpha(w^*_\alpha))$ depends on $\sigma$ through the sum $\log \sigma + \frac{1}{2\sigma^2} \|w^*_\alpha\|^2$. Here if $w^*_\alpha = (W^*_\alpha, b^*_\alpha)$ we have $\|w^*_\alpha\|^2 = \|W^*_\alpha\|^2 + \|b^*_\alpha\|^2$. In Table C.1 and Table C.2 we show the value of this contribution when $\sigma = 1$ for $c \in \{5, 6\}$. Note that for some $k$-gons there are negative biases that can take arbitrarily large values, so the shown values are lower bounds.

C.2 Theoretical predictions for the $5$-gon to $6$-gon transition for $r = 2$, $c = 6$

With $\alpha = 5$ and $\beta = 6$ we have from Table 1 and Table C.2 that

$$\Delta L = 0.04819 - 0.06874 = -0.02055, \qquad \Delta\lambda = 8.5 - 7 = 1.5, \qquad \Delta c = 6.37767 - 3.62417 = 2.7535.$$

Solving (12) numerically gives $n_{cr} = 601$ as the closest integer.
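The numerical solution of (12) quoted above can be reproduced with a few lines of code. The following is a minimal sketch using plain bisection, not the authors' solver; the function names and the upper bracket `n_max` are assumptions made for illustration. For $\Delta L < 0$ and $\Delta\lambda > 0$ the left hand side of (12) is concave in $n$ with its maximum at $n = -\Delta\lambda/\Delta L$, so the crossing at which phase $\beta$ takes over is the root to the right of that maximum.

```python
import math


def free_energy_gap(n, d_loss, d_lambda, d_const):
    """Left hand side of (12): F_n(W_beta) - F_n(W_alpha) under approximation (11)."""
    return n * d_loss + d_lambda * math.log(n) + d_const


def critical_sample_size(d_loss, d_lambda, d_const, n_max=1e7, iters=200):
    """Bisection for the root of (12) to the right of the gap's maximum.

    Assumes d_loss < 0 and d_lambda > 0. Returns None if no crossing is found.
    """
    if d_loss >= 0 or d_lambda <= 0:
        raise ValueError("this sketch only handles d_loss < 0 and d_lambda > 0")
    lo = -d_lambda / d_loss  # maximiser of the concave gap
    hi = n_max               # assumed upper bracket, chosen arbitrarily
    if free_energy_gap(lo, d_loss, d_lambda, d_const) < 0:
        return None          # beta never has lower free energy
    if free_energy_gap(hi, d_loss, d_lambda, d_const) > 0:
        return None          # crossing lies beyond the bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if free_energy_gap(mid, d_loss, d_lambda, d_const) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Values for the 5 -> 6 transition from Section C.2.
    n_cr = critical_sample_size(-0.02055, 1.5, 2.7535)
    print(round(n_cr))  # 601
```

Setting `d_const` to zero in the same call gives the critical sample size predicted when the constant terms are ignored.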
Critical point    $\frac{1}{2}\|w^*_\alpha\|^2$
$4$               $\ge 2$
$4^{+}$           $\ge 2.05$
$5$               $\ge 3.62417$

Table C.1: Prior factors for $k$-gons when $c = 5$.

Critical point    $\frac{1}{2}\|w^*_\alpha\|^2$
$6$               $6.37767$
$5$               $> 3.62417$
$5^{+}$           $3.62764$
$4$               $2$
$4^{-}$           $> 1.5$
$4^{--}$          $> 1$
$4^{---}$         $> 0.5$
$4^{----}$        $> 0$
$4^{+}$           $2.00347$
$4^{+-}$          $> 1.50347$
$4^{+--}$         $> 1.00347$
$4^{+---}$        $> 0.50347$
$4^{+----}$       $> 0.00347$
$4^{++}$          $2.00694$
$4^{++-}$         $> 1.50694$
$4^{++--}$        $> 1.00694$
$4^{++---}$       $> 0.50694$
$4^{++----}$      $> 0.00694$

Table C.2: Prior factors for $k$-gons when $c = 6$.

C.3 Influence of constant terms

Dividing (12) through by $\log n$ we have

$$\frac{n}{\log n} = -\frac{1}{\log n} \frac{\Delta c}{\Delta L} - \frac{\Delta\lambda}{\Delta L} = -\frac{1}{\Delta L} \left[ \frac{\Delta c}{\log n} + \Delta\lambda \right]. \tag{13}$$

In the phase transitions we analyse in this paper $\Delta L$ is on the order of $0.01$, $\Delta\lambda$ is on the order of $1$, and $\Delta c$ is on the order of $1$, so $\Delta\lambda/\Delta L$ and $\Delta c/\Delta L$ are on the order of $10$ to $100$. In Figure 2 we care about roughly $200 \le n \le 1000$, so $5 \le \log n \le 7$. Hence in practice the first term in (13) is roughly one order of magnitude lower than the second; the upshot is that the primary determinant of $n_{cr}$ is $|\Delta\lambda/\Delta L|$, but the influence of the constant terms can be significant.

In the second transition of Figure 1, $4^{+} \to 5$, we have $\Delta\lambda = 2$, $\Delta L = 0.06874 - 0.10417 = -0.03543$ (based on Table 1) and $\Delta c = 3.62417 - 2.00347 = 1.6207$ (based on Table C.2), so $-\Delta c/\Delta L \approx 45$ and $-\Delta\lambda/\Delta L \approx 56$. Solving (12) numerically yields $n_{cr} \approx 380$. Solving the equation with $\Delta c = 0$ gives $n_{cr} \approx 327$, so as suggested above including the constant term shifts the critical sample size by a lower order term.

C.4 Double Transitions

Assume that there are transitions $\alpha \to \beta$ at critical sample size $n_1$ and $\beta \to \gamma$ at critical sample size $n_2$, both involving no change in constant terms so that (16) applies. Since $n/\log n$ is increasing, if $n_1 < n_2$ we deduce

$$-\frac{\Delta\lambda_1}{\Delta L_1} < -\frac{\Delta\lambda_2}{\Delta L_2} \tag{14}$$

where $\Delta\lambda_1 = \lambda_\beta - \lambda_\alpha$, $\Delta\lambda_2 = \lambda_\gamma - \lambda_\beta$, $\Delta L_1 = L(w^*_\beta) - L(w^*_\alpha)$, $\Delta L_2 = L(w^*_\gamma) - L(w^*_\beta)$.
From (14) we obtain, multiplying through by $\Delta L_1 \Delta L_2 > 0$ and then dividing by $\Delta\lambda_1 \Delta\lambda_2 > 0$ (both losses decrease and both learning coefficients increase), the inequality

$$\Delta L_2 \, \Delta\lambda_1 > \Delta\lambda_2 \, \Delta L_1 \implies \frac{\Delta L_2}{\Delta\lambda_2} > \frac{\Delta L_1}{\Delta\lambda_1} \tag{15}$$

which says that along any curve in the $(\lambda, L)$ plane following a sequence of Bayesian phase transitions, the slope must increase. For example, we observe in Figure D.1 that the negative slopes become successively less negative as we move along a sequence of transitions. This is the least obvious for the pair of transitions $4^{+-} \to 4^{+}$ and $4^{+} \to 5$, which corresponds to the fact that the gap $n_2 - n_1$ is small in Figure D.5.

D Bayesian Antecedents

In this section we review whether the phase transitions we find empirically have Bayesian antecedents. To begin we consider the case where $\Delta c = 0$. Then from (13) we deduce

$$\frac{n}{\log n} = -\frac{\Delta\lambda}{\Delta L}. \tag{16}$$

For $n > 3$, $n/\log n$ is positive and an increasing function of $n$, and we denote the inverse function by $N$. Since the critical sample size for a transition $\alpha \to \beta$ must be positive, if $\Delta L < 0$ (the loss decreases) then (16) has a (unique) solution if and only if $\Delta\lambda > 0$ (the complexity increases). The unique solution is the critical sample size

$$n_{cr} = N\left(-\frac{\Delta\lambda}{\Delta L}\right).$$

If $\Delta L < 0$ and $\Delta c \ne 0$ we simply plot the free energy curves and see if they intersect. Given the orders of magnitude discussed in Section C.3, we expect that if $\Delta c > 0$, $\Delta\lambda > 0$ then there is likely to be a solution, whereas if $\Delta c < 0$, $\Delta\lambda < 0$ then the right hand side of (13) is negative and no transition can exist. The mixed cases are harder to argue about in general terms.

D.1 The BAH for $r = 2$, $c = 6$

We examine the evidence for the existence of Bayesian antecedents of the dynamical transitions in TMS for $r = 2$, $c = 6$ exhibited in Section 5. The known dynamical transitions are summarised in Figure D.1. The slope of the lines is, in the notation of (16), equal to $\frac{\Delta L}{\Delta\lambda}$, and so the fact that all observed phase transitions go down and to the right would indicate, if the constant terms were ignored, that the critical sample size is positive and a Bayesian phase transition exists.
Here the $L$ values are from Appendix A and the $\hat{\lambda}$ values from Table K.1 (note the caveats there) for those $k$-gons where we do not have theoretically derived values (for $\alpha \in \{5, 5^{+}, 6\}$ see Table 1).

To perform a more refined analysis which includes the constant terms we use Table C.2 and compare plots of free energy curves. In the cases where we use an empirical estimate of the learning coefficient, we display the curve as part of a shaded region made up of curves with coefficients of $\log n$ within one standard deviation of the estimate. The results are shown in Figures D.2-D.5.

For phase transitions occurring at large values of $n$, the existence of a transition is relatively insensitive to small changes in the learning coefficient or constant terms, and we can also be more confident that the predicted transition translates (via the correspondence between the free energy formula and the posterior, which is only valid for sufficiently large $n$) to an actual phase transition in the posterior. For transitions occurring at low $n$, such as those in Figure D.2 and Figure D.3, the analysis is strongly affected by small changes in learning coefficient or constant terms, and so we cannot be sure that a Bayesian transition exists.

Figure D.1: Summary of known dynamical transitions and the phases involved. Each scatter point $(\hat{\lambda}_\alpha, L_\alpha)$ corresponds to one of our classified critical points $w^*_\alpha$ and a red line is drawn between phases with dynamical transitions connecting them (in the direction that goes right and down) as listed in (10). These "curves" are necessarily concave up if the time order of dynamical transitions matches the sample size order of Bayesian transitions, see Section C.4.

Figure D.2: Free energy plot providing evidence of Bayesian transitions $4^{++-} \to 4^{+-}$ and $4^{+-} \to 4^{-}$.
In the former case the plot is merely suggestive, since the transition takes place at low $n$ and is very sensitive to the learning coefficients and constant terms.

Figure D.3: Free energy plot providing weak evidence of a Bayesian transition $4^{+-} \to 4^{-}$. The transition takes place at low $n$ and is very sensitive to the learning coefficients and constant terms.

Figure D.4: Free energy plots suggesting Bayesian transitions $4^{++-} \to 4^{+-}$ and $4^{+-} \to 4$.

Figure D.5: Free energy plots suggesting Bayesian transitions $4^{++-} \to 4^{+}$, $4^{+} \to 5$ and $5 \to 5^{+}$.

E Intuition for the local learning coefficient as complexity measure for $c = 6$

Here we provide some intuition for why critical points in TMS with higher local learning coefficient should be thought of as more complex. It seems uncontroversial that the standard $(k+1)$-gon is more complex than the standard $k$-gon. We focus on explaining why increasing the number of positive biases on a $k$-gon causes a slight increase in the model complexity and increasing the number of large negative biases on the $4$-gon causes a large decrease in the complexity. This pattern can be seen empirically in Table K.1.

The basic fact that informs this discussion is that the local learning coefficient is half the number of normal directions to the level set $L(w) = L(w^*_\alpha)$ at $w^*_\alpha$ when $L$ is Morse-Bott at $w^*_\alpha$, so that a naive count of normal directions captures the degeneracy. See (Watanabe, 2009, §7.1), (Wei et al., 2022b, §4) and (Lau et al., 2023, Appendix A) for relevant mathematical discussion. That means that if we increase the number of directions we can travel in the level set by $1$ when we move from $w^*_\alpha$ to $w^*_\beta$ then we expect to decrease the learning coefficient by $\frac{1}{2}$.
When the level set is more degenerate at $w^*_\alpha$ such naive dimension counts fail to be the correct measure and it is more difficult to provide simple intuitions. However, in TMS we are fortunate that some of the critical points (e.g. $5$, $5^{+}$) are minimally singular, so naive counts actually do capture what is going on.

So let us do some naive counting. Recall that any positive bias $b_i$ associated with a column $W_i$ with zero norm must have the exact value $\frac{1}{2c}$, whereas negative biases associated with such columns can take on any value. Fixing the value of the bias reduces the number of free parameters by $3$, since if we have a positive bias at $b_6$ then $l_6 = \|W_6\|$ must be zero and the parameter $\theta_5$ does not exist in the parametrisation. This explains why the learning coefficient of the $5^{+}$-gon is larger than that of the $5$-gon by $\frac{3}{2}$, since both are minimally singular and we have decreased the number of free parameters (dimension of the level set) by $3$.

Next we consider large negative biases. For the $4^{-}$-gon, note that the neuron with the large negative bias never fires (it is a "dead" neuron), so this critical point only really has representations for three inputs. In fact, when there are no positive biases, the family of parameters that we call a $4^{-}$-gon includes $w \in W$ with (using the $l, \theta, b$ parametrization) any $b_4 < 0$ and any $l_4^2 < -b_4$, including $l_4$ arbitrarily close to zero, with a convex hull containing only three vertices. Further, in the case of the $4^{--}$-gon, the configuration only has representations for two inputs. In this case, if the two weights with negative biases are adjacent then there is an entire "dead" quarter-plane of the activation space, and the $5$th and $6$th columns of $W$ can take on nonzero values in that quarter-plane (provided they satisfy $l_i^2 < -b_i$ for $i \in \{5, 6\}$). This extra freedom means that the number of bits required to "pin down" a $4^{--}$-gon is less than for a $4^{-}$-gon, which is less than for a $4$-gon.
Similarly, specifying the $4^{---}$-gon and $4^{----}$-gon requires even less information, so it is appropriate that the local learning coefficient classifies them as less complex.

F MCMC Experiments

We give further details about the experiments where we use MCMC to sample the Bayesian posterior to establish the phase occupancy plot in Figure 2. For a given number of columns $c$ and sample size $n$, we generate $n$ samples $X_i$ from the true distribution $q(x)$ and obtain a likelihood function $\prod_{i=1}^{n} p(X_i \mid w)$ with $q(x)$ and $p(x \mid w)$ given by (2) and (3) respectively. We choose the prior on the parameter space $w = (W, b)$ to be the standard multivariate Gaussian prior (i.e. with zero mean and identity covariance matrix). To sample from the corresponding posterior distribution, we run Markov Chain Monte Carlo (MCMC), specifically the No U-Turn sampler (NUTS) (Hoffman & Gelman, 2014), with $6$ randomly initialised MCMC chains, each with $5000$ iterations after $500$ iterations of burn-in. We thin the resulting chains by a factor of $10$, resulting in a total posterior sample size of $3000$. For each combination of $n$ and $c$, we run the above posterior sampling procedure for $10$ different PRNG seeds, which produces different input samples $X_i$ as well as different MCMC trajectories.

F.1 Details of Theoretical Proportion Curves

Figure F.1: Extended version of the theoretical occupancy plot shown previously in Figure 2, where the effect of sub-dominant phases is now included. Note that exact theoretical values of the loss and prior contributions were used for all critical points shown, and exact values of the local learning coefficient were used for the $6$-gon, $5^{+}$-gon and $5$-gon, but estimates of the local learning coefficients were used for other critical points (Table K.1).

This section contains details of the theoretical component of Figure 2.
For each $\alpha \in \{4, 4^+, 5, 5^+, 6\}$ we consider the free energy approximation

$$f_\alpha(n) = n L_\alpha + \lambda_\alpha \log n + c_\alpha$$

where $L_\alpha$ is the theoretical value taken from Section A and $c_\alpha$ are the constant terms from Table C.2. We use the theoretically derived value of $\lambda_\alpha$ in Table 1 for $\alpha \in \{5, 5^+, 6\}$ and the empirically estimated $\hat{\lambda}_\alpha$ for $\alpha \in \{4, 4^+\}$ from Table K.1. We then define

$$p_\alpha(n) = \exp(-f_\alpha(n)), \qquad Z(n) = \sum_\alpha p_\alpha(n)$$

and the theory plot in Figure 2 shows the curves $\left\{ \frac{1}{Z(n)} p_\alpha(n) \right\}_\alpha$. Figure F.1 is produced in the same way, with a larger range of phases $\alpha$.

F.2 Verifying dominant phases for $r = 2$, $c = 6$

To quantify the relative frequency of each phase at a given sample size $n$, we classify all posterior samples into various phases $W_{k,\sigma}$ by counting the vertices in their convex hull ($k$) and the number of positive biases ($\sigma$), and compute the proportion of samples that falls into each phase. We then plot the frequencies of each phase as a function of sample size $n$ to visualize how the preferred phases change with $n$. Figure 2 shows the corresponding plot for the case $r = 2$, $c = 6$. The lines show the frequencies of the phases $6$, $5$, $5^+$, $4$ and $4^+$, while unclassified posterior samples are labelled as "other".

While this convex-hull and positive-bias counting classification scheme is based on the characteristics of known critical points, it is only an imperfect reflection of them. There is the risk of mistaking posterior samples in $W_{k,\sigma}$ as evidence of occupation of the $k^{\sigma+}$-gon phase when they are not. Since we do not claim to have found all possible critical points of the TMS potential and have neglected the higher loss variants of the 4-gon (explained below), it is possible that MCMC samples do not reflect known phases. If this misclassification happens sufficiently often, it could invalidate the comparison of the occupancy plots with theoretical predictions.
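The free energy approximation of Section F.1 can be turned into occupancy proportions with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function name `occupancy` and the dictionary layout are hypothetical, the loss and learning coefficient values are taken from Table F.2 ($r = 2$, $c = 5$, using the smallest listed $\lambda$ for each 4-gon variant), and the constants $c_\alpha$ are set to zero purely as a placeholder because Table C.2 is not reproduced here.

```python
import math

def occupancy(n, phases):
    """Return {phase: p_alpha(n) / Z(n)} for f_alpha(n) = n*L + lam*log(n) + const."""
    neg_f = {name: -(n * L + lam * math.log(n) + const)
             for name, (L, lam, const) in phases.items()}
    # Subtract the maximum before exponentiating (log-sum-exp trick) so that
    # large n does not underflow every weight to zero.
    shift = max(neg_f.values())
    weights = {name: math.exp(v - shift) for name, v in neg_f.items()}
    Z = sum(weights.values())
    return {name: w / Z for name, w in weights.items()}

# (L_alpha, lambda_alpha, c_alpha). L and lambda follow Table F.2;
# c_alpha = 0.0 is an assumption made only for this illustration.
PHASES_C5 = {
    "4": (0.06667, 4.0, 0.0),
    "4+": (0.05667, 5.0, 0.0),
    "5": (0.01583, 7.0, 0.0),
}

for n in (10, 100, 1000):
    probs = occupancy(n, PHASES_C5)
    print(n, {name: round(p, 3) for name, p in probs.items()})
```

With these placeholder constants the low-loss, high-complexity 5-gon takes over from the 4-gon as $n$ grows, which is the qualitative behaviour the theory curves describe. The actual crossover location depends on the true $c_\alpha$.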
To guard against this, we should check that every MCMC sample in $W_{k,\sigma}$ is close to a known critical point in $W_{k,\sigma}$, or is classified as "other". To reduce the amount of labour for this task, we run a t-SNE projection (van der Maaten & Hinton, 2008) of the parameters into a 2D space with a custom metric designed to remove known symmetries. This allows samples that are similar to each other to show up in t-SNE projections as clusters regardless of irrelevant differences in their angular displacement and column permutation. The custom t-SNE metric is such that the distance between a pair of parameters $(W, b)$, $(W', b')$ is given by the sum

$$\mathrm{HammingDistance}(b > 0,\, b' > 0) + \min_{i,j \in \{1, \dots, c\}} \left\| \mathrm{Normalize}(W, i) - \mathrm{Normalize}(W', j) \right\|_{\mathrm{Frobenius}}$$

where $b > 0$ denotes the binary array $(\mathbf{1}_{b_1 > 0}, \dots, \mathbf{1}_{b_c > 0})$ and $\mathrm{Normalize}(W, i)$ denotes a normalised weight matrix in which all column vectors are rotated by a fixed angle so that the $i$-th column vector is aligned with the positive x-axis, and the columns are reordered so that the column indices reflect the order of the vectors when read counter-clockwise starting from the positive x-axis. With this, we can verify the occupancy of dominant phases by checking several samples in each cluster to verify the phase classification of the entire cluster.

To illustrate, let us verify the phase occupancy for $c = 6$ at $n = 1000$ as shown in Figure 2. Figure F.2 shows the t-SNE projection of the samples for a particular MCMC run. Looking at both the theoretical and empirical occupancy curves at $n = 1000$, the posterior is dominated by the 6-gon, followed by the 5-gon and then the $5^+$-gon. Inspecting various samples in the largest (green) t-SNE cluster shows that they do correspond to the 6-gon, all with biases near the optimal negative value. The minor cluster (in dark purple) corresponds to the 5-gon. This cluster of 5-gons includes samples with a sixth "vestigial leg" with non-zero length.
However, these belong to the same phase (same critical submanifold as the 5-gon) since the corresponding bias has a large negative value. The t-SNE projection also reveals a small number of $5^+$-gon samples. Performing similar inspections for MCMC chains at $n = 500, 700$ allows us to confirm that the dominant phase switches from the 5-gon to the 6-gon in the interval $600 \le n \le 700$. This inspection also confirms that clusters of $5^+$-gons coexist with the two dominant phases, albeit at a much lower probability.

For sufficiently low values of $n$, we encounter two issues in establishing phase occupancy.

1. Other higher loss phases, such as variants of the 4-gon with large negative biases, and potentially other higher energy phases that we have not characterised, start to have non-negligible occupancy.
2. As $n$ becomes lower, the posterior distribution becomes less concentrated. This means that significantly more posterior mass, and hence a higher fraction of MCMC samples, is accounted for by regions of parameter space that are further away from critical points. These points may be close to the boundaries between different regions, increasing the chance of misclassification, or they may bear little resemblance to the critical point associated with the region they are classified into.

For $n > 400$, from inspecting t-SNE clusters, the above issues do not arise: the samples are close to known critical points, and the frequency of unclassified "other" samples is low enough that it will not significantly affect the relative frequency of the dominant phases. Furthermore, we also do not observe many samples that are close to high loss 4-gon variants. This supports the prediction depicted in the extended theoretical occupancy curve shown in Figure F.1, which suggests that these 4-gon variants only show up in the $n < 400$ regime. We caution the reader in regards to interpreting the phase occupancy diagrams for the range $100 < n < 400$, where one or more of the issues above could affect the empirical frequency.
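The convex-hull and positive-bias counting scheme used throughout this section can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the names `convex_hull_vertex_count` and `classify` are hypothetical, the hull is assumed to be taken over the column vectors only, and the collinearity tolerance `eps` is an arbitrary choice.

```python
import math

def convex_hull_vertex_count(points, eps=1e-9):
    """Number of vertices of the convex hull of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return len(pts)

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a counter-clockwise turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            # Drop the last point while it does not make a strict left turn.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= eps:
                hull.pop()
            hull.append(p)
        return hull

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The two chains share their endpoints, so subtract them once.
    return len(lower) + len(upper) - 2

def classify(columns, biases):
    """Return (k, sigma): hull vertex count and number of positive biases."""
    k = convex_hull_vertex_count([tuple(col) for col in columns])
    sigma = sum(1 for b in biases if b > 0)
    return k, sigma

# Example: a regular pentagon plus a zero column with a positive bias
# (the values are illustrative, loosely modelled on a 5+-gon for c = 6).
pentagon = [(1.17 * math.cos(2 * math.pi * i / 5), 1.17 * math.sin(2 * math.pi * i / 5))
            for i in range(5)]
print(classify(pentagon + [(0.0, 0.0)], [-0.28] * 5 + [1.0 / 12.0]))
```

A sample would then be assigned to $W_{k,\sigma}$ by the returned pair, with anything outside the tracked phases labelled "other". As the text notes, such a rule cannot by itself tell whether a sample is actually close to a critical point.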
Figure F.2: t-SNE plots of MCMC samples from the posterior at a range of sample sizes $n$ encompassing the phase transition from the 5-gon to the 6-gon.

F.3 MCMC Health

MCMC sampling for high dimensional posterior distributions is challenging. In our case there is the added challenge of the posterior being multi-modal (the posterior density has local maxima at the dominant phases), where the modes are not points but submanifolds of varying codimension. To ensure that the proportion of MCMC samples that falls into $W_{k,\sigma}$ is a good reflection of the probability of $W_{k,\sigma}$, we need to make sure that our Markov chains are well mixed. For this purpose, we produce and check two different types of diagnostic plots for each MCMC run:

• Theoretical loss trace plots. We plot the theoretical loss of each MCMC sample against its sample index, which orders the samples in each MCMC chain in increasing order of the MCMC iterations required to generate the sample. An unhealthy MCMC chain will show up on such a plot as points occupying a very narrow band of theoretical loss values.

• Phase type trace plots. On the same trace plots, we color each sample by its phase classification. Successful posterior sampling should produce samples in each phase with a frequency that is roughly the same as the posterior probability of that phase. While we do not know the true probability of a given phase, we can cross-reference each MCMC chain with other chains sampling the same posterior to see that every MCMC chain visits the phases discovered by any other MCMC chain. An unhealthy MCMC chain will show up on such a plot as a chain that only contains samples of one phase type when there is more than one phase type observed across all chains.
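The two visual diagnostics above can also be summarised numerically. The sketch below is only one possible reading of them: `chain_health` is a hypothetical helper, each chain is assumed to be a list of `(theoretical_loss, phase_label)` pairs, and the threshold `min_loss_band` is an arbitrary illustrative value rather than anything specified in the experiments.

```python
def chain_health(chains, min_loss_band=1e-3):
    """Summarise, per chain, the width of its loss band and the phases it never visits.

    chains: list of chains, each a list of (theoretical_loss, phase_label) samples.
    """
    # Phases discovered by any chain sampling the same posterior.
    all_phases = {phase for chain in chains for _, phase in chain}
    report = []
    for idx, chain in enumerate(chains):
        losses = [loss for loss, _ in chain]
        visited = {phase for _, phase in chain}
        band = max(losses) - min(losses)
        report.append({
            "chain": idx,
            "loss_band": band,
            # Flag chains stuck in a very narrow band of theoretical loss.
            "narrow_band": band < min_loss_band,
            # Flag phases seen by other chains but never by this one.
            "missing_phases": sorted(all_phases - visited),
        })
    return report

# Toy data: chain 0 mixes between two phases, chain 1 is stuck in one of them.
toy_chains = [
    [(0.020, "6"), (0.031, "5"), (0.022, "6"), (0.029, "5")],
    [(0.0300, "5"), (0.0301, "5"), (0.0302, "5"), (0.0301, "5")],
]
for row in chain_health(toy_chains):
    print(row)
```

A chain flagged on either criterion would then be inspected on the trace plots rather than discarded automatically.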
Figure F.3 shows examples of such diagnostic trace plots for a few experiments (with $c = 6$ and matching those in Figure F.2) at $n = 300, 1000$. All six chains run in these experiments are plotted on the same plot and distinguished by color. At the higher sample size $n = 1000$, we expect and do indeed observe that a particular phase, the 6-gon, dominates the posterior, but every chain visits sub-dominant phases as well.

The diagnostics detect no sign of problems for the experiments used to establish the phase occupancy curves in Figures 2, F.5 and F.6. However, we do observe that MCMC fails for sample sizes $n$ significantly greater than those we report in this paper. With large sample sizes, the posterior distribution becomes highly concentrated at each phase, posing a significant challenge for an MCMC chain to escape its starting point (controlled by random initialization and the trajectory of the burn-in phase). Figure F.4 shows an example at $n = 4000$, where we see

• A chain (colored pink) which, for many iterations, produces samples in a very narrow band of loss.
• Most chains have a starting point falling into the $5^+$-gon phase and rarely escape (only the red chain found the lower loss 6-gon region).
• The proportion of 6-gons is mostly determined by how many chains have their starting point already in the 6-gon phase. In this run, this proportion is dominated by the last orange chain.

Figure F.3: Trace plots displaying the theoretical loss of the MCMC samples ordered by their MCMC iteration number and colored by MCMC chain index (top), and the same scatter plot but colored by phase classification (bottom).

Figure F.4: Unhealthy trace plots.
F.4 Theory and experiments for $c = 4$ and $c = 5$

In this section we repeat the analysis of Section 4.2 for $c = 4$ and $c = 5$ with the same experimental setup as explained at the beginning of that section. When $c = 4$ the 4-gon is a true parameter (it has zero loss) and any 3-gon is not, so the theory predicts that the 4-gon must dominate the posterior for all $n$, as seen in Figure F.5.

Critical point | Local learning coefficient $\lambda$ | Loss $L$
4-gon | 4, 4.5, 5, 5.5 | 0

Table F.1: $r = 2$, $c = 4$

Figure F.5: $r = 2$, $c = 4$. The standard 4-gon dominates for all $n$.

When $c = 5$ the theory and experimental curves in Figure F.6 show the $4 \to 5$ transition. Note that $4^+$ is correctly predicted to never dominate the posterior despite having lower energy than the standard 4-gon.

Critical point | Local learning coefficient $\lambda$ | Loss $L$
4-gon | 4, 4.5, 5, 5.5 | 0.06667
$4^+$-gon | 5, 5.5, 6, 6.5 | 0.05667
$5^-$-gon | 7 | 0.01583

Table F.2: $r = 2$, $c = 5$

Figure F.6: Proportion of Bayesian posterior density concentrated in regions $W_{k,\sigma}$ associated to $k$-gons, as a function of the number $n$ of samples for $r = 2$, $c = 5$.

We note that the classification of MCMC samples for $c = 4, 5$ described in this section is slightly different from what was described for $c = 6$ in Section 4.2. The main reason is that we need to handle variants of the 4-gon more carefully for $c = 4, 5$.

• For $c = 4$, we classify a sample $(W, b)$ only by the number of vertices on the convex hull formed by the column vectors, with no additional subcategories defined by the number of positive biases. Theoretically we know that there are no critical 4-gons with positive bias, and the standard 4-gon is a critical point with all biases zero and is thus susceptible to misclassification, even under slight perturbation, when the number of positive biases is counted.
• For $c = 5$, the situation is similar except for one extra case where we need to allow for the possibility of a $4^+$-gon.
A sample $(W, b)$ is classified as a $4^+$-gon when it has 4 vertices in its convex hull, and if $b_i > 0$ then $l_i = \|W_i\| < 0.5$.

In the cases $c = 4, 5$ we also manually verify the dominant phases by visually inspecting t-SNE clusters of MCMC samples at multiple sample sizes.

G Potential in local coordinates

Proof of Lemma 3.1. By definition

$$L(W, b) = \frac{1}{c} \left( \sum_{i \neq j} \int_0^1 \mathrm{ReLU}(W_j \cdot W_i x_i + b_j)^2 \, dx_i + \sum_{i=1}^{c} \int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i \right).$$

Let $i, j \in \{1, \dots, c\}$ be such that $i \neq j$. To compute the integral

$$\int_0^1 \mathrm{ReLU}(W_j \cdot W_i x_i + b_j)^2 \, dx_i,$$

we first find the region in $[0, 1]$ on which $W_j \cdot W_i x_i + b_j \ge 0$. Let

$$D_{j,i} = \{ x_i \in [0, 1] \mid W_j \cdot W_i x_i + b_j \ge 0 \}.$$

1. If $W_j \cdot W_i > 0$ and $-W_j \cdot W_i \le b_j \le 0$, then $D_{j,i} = \left[ \frac{-b_j}{W_j \cdot W_i}, 1 \right]$.
2. If $W_j \cdot W_i > 0$ and $b_j \le -W_j \cdot W_i$, then $D_{j,i} = \emptyset$.
3. If $W_j \cdot W_i = 0$ and $b_j = 0$, then $D_{j,i} = [0, 1]$. Note that in this case, $W_j \cdot W_i x_i + b_j = 0$ for all $x_i \in [0, 1]$.
4. If $W_j \cdot W_i = 0$ and $b_j < 0$, then $D_{j,i} = \emptyset$.
5. If $W_j \cdot W_i < 0$ and $b_j \le 0$, then $D_{j,i} = \emptyset$.
6. If $W_j \cdot W_i > 0$ and $b_j > 0$, then $D_{j,i} = [0, 1]$.
7. If $W_j \cdot W_i = 0$ and $b_j > 0$, then $D_{j,i} = [0, 1]$.
8. If $W_j \cdot W_i < 0$ and $b_j \ge -W_j \cdot W_i > 0$, then $D_{j,i} = [0, 1]$.
9. If $W_j \cdot W_i < 0$ and $-W_j \cdot W_i > b_j > 0$, then $D_{j,i} = \left[ 0, \frac{-b_j}{W_j \cdot W_i} \right]$.

Recall the definition of $P_{j,i}$, $P_i$, and $Q_{j,i}$ from Lemma 3.1. Then for $b_j \le 0$,

$$\begin{aligned}
\int_0^1 \mathrm{ReLU}(W_j \cdot W_i x_i + b_j)^2 \, dx_i
&= \delta(P_{j,i}) \int_{-b_j / (W_j \cdot W_i)}^{1} (W_j \cdot W_i x_i + b_j)^2 \, dx_i \\
&= \delta(P_{j,i}) \left[ \frac{1}{3 W_j \cdot W_i} (W_j \cdot W_i x_i + b_j)^3 \right]_{-b_j / (W_j \cdot W_i)}^{1} \\
&= \delta(P_{j,i}) \, \frac{1}{3 W_j \cdot W_i} (W_j \cdot W_i + b_j)^3,
\end{aligned}$$

and for $b_j > 0$,

$$\begin{aligned}
\int_0^1 \mathrm{ReLU}(W_j \cdot W_i x_i + b_j)^2 \, dx_i
&= \delta(Q_{j,i}) \int_0^{-b_j / (W_j \cdot W_i)} (W_j \cdot W_i x_i + b_j)^2 \, dx_i + \big( 1 - \delta(Q_{j,i}) \big) \int_0^1 (W_j \cdot W_i x_i + b_j)^2 \, dx_i \\
&= \delta(Q_{j,i}) \left[ \frac{1}{3 W_j \cdot W_i} (W_j \cdot W_i x_i + b_j)^3 \right]_0^{-b_j / (W_j \cdot W_i)} + \big( 1 - \delta(Q_{j,i}) \big) \left[ \frac{1}{3 W_j \cdot W_i} (W_j \cdot W_i x_i + b_j)^3 \right]_0^1 \\
&= \delta(Q_{j,i}) \, \frac{1}{3} \left( \frac{-b_j^3}{W_j \cdot W_i} \right) + \big( 1 - \delta(Q_{j,i}) \big) \frac{1}{3} \left[ (W_j \cdot W_i)^2 + 3 (W_j \cdot W_i) b_j + 3 b_j^2 \right].
\end{aligned}$$

It remains to compute

$$\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i$$

for each $i \in \{1, \dots, c\}$. We first find the region in $[0, 1]$ on which $\|W_i\|^2 x_i + b_i \ge 0$. Let

$$D_i = \{ x_i \in [0, 1] \mid \|W_i\|^2 x_i + b_i \ge 0 \}.$$

1. If $\|W_i\|^2 > 0$ and $-\|W_i\|^2 \le b_i \le 0$, then $D_i = \left[ \frac{-b_i}{\|W_i\|^2}, 1 \right]$.
2. If $\|W_i\|^2 > 0$ and $b_i \le -\|W_i\|^2$, then $D_i = \emptyset$. In this case
$$\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i = \int_0^1 x_i^2 \, dx_i = \frac{1}{3}.$$
3. If $\|W_i\|^2 = 0$ and $b_i = 0$, then $D_i = [0, 1]$. Note that in this case, $\|W_i\|^2 x_i + b_i = 0$ for all $x_i \in [0, 1]$. So
$$\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i = \int_0^1 x_i^2 \, dx_i = \frac{1}{3}.$$
4. If $\|W_i\|^2 = 0$ and $b_i < 0$, then $D_i = \emptyset$. In this case
$$\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i = \int_0^1 x_i^2 \, dx_i = \frac{1}{3}.$$
5. If $\|W_i\|^2 > 0$ and $b_i > 0$, then $D_i = [0, 1]$.
6. If $\|W_i\|^2 = 0$ and $b_i > 0$, then $D_i = [0, 1]$.

For $i$ with $b_i \le 0$, on $P_i$,

$$\begin{aligned}
\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i
&= \int_{-b_i / \|W_i\|^2}^{1} \big( x_i - \|W_i\|^2 x_i - b_i \big)^2 \, dx_i + \int_0^{-b_i / \|W_i\|^2} x_i^2 \, dx_i \\
&= \int_{-b_i / \|W_i\|^2}^{1} \big( (1 - \|W_i\|^2) x_i - b_i \big)^2 \, dx_i + \left[ \frac{1}{3} x_i^3 \right]_0^{-b_i / \|W_i\|^2}.
\end{aligned}$$

If $\|W_i\| \neq 1$, then the integral is equal to

$$\frac{1}{3} \left[ (1 - \|W_i\|^2)^2 - 3 (1 - \|W_i\|^2) b_i + 3 b_i^2 + \frac{b_i^3}{\|W_i\|^4} + \frac{b_i^3}{\|W_i\|^2} \right].$$

If $\|W_i\| = 1$, then the integral is equal to $\frac{1}{3} (3 b_i^2 + 2 b_i^3)$. Since

$$\lim_{\|W_i\| \to 1} \frac{1}{3} \left[ (1 - \|W_i\|^2)^2 - 3 (1 - \|W_i\|^2) b_i + 3 b_i^2 + \frac{b_i^3}{\|W_i\|^4} + \frac{b_i^3}{\|W_i\|^2} \right] = \frac{1}{3} (3 b_i^2 + 2 b_i^3),$$

we know that in $P_i$,

$$\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i = \frac{1}{3} \left[ (1 - \|W_i\|^2)^2 - 3 (1 - \|W_i\|^2) b_i + 3 b_i^2 + \frac{b_i^3}{\|W_i\|^4} + \frac{b_i^3}{\|W_i\|^2} \right].$$

For $i$ with $b_i > 0$,

$$\begin{aligned}
\int_0^1 \big( x_i - \mathrm{ReLU}(\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i
&= \int_0^1 \big( x_i - (\|W_i\|^2 x_i + b_i) \big)^2 \, dx_i \\
&= \frac{1}{3 (1 - \|W_i\|^2)} \left[ \big( (1 - \|W_i\|^2) - b_i \big)^3 + b_i^3 \right] \\
&= \frac{1}{3} \left[ (1 - \|W_i\|^2)^2 - 3 (1 - \|W_i\|^2) b_i + 3 b_i^2 \right].
\end{aligned}$$

Thus, $L(W, b) = \frac{1}{3c} H(W, b)$ as claimed.
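The closed forms obtained in the proof can be sanity-checked against direct numerical integration. The following is a small sketch under stated assumptions: the helper names are hypothetical, `w` stands for $W_j \cdot W_i$, `a` stands for $\|W_i\|^2$, the test values are arbitrary points inside the relevant cases, and a plain midpoint rule is used instead of any particular quadrature library.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def midpoint(f, n=20000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))

def cross_term_closed(w, b):
    """int_0^1 ReLU(w*x + b)^2 dx for w > 0 and -w <= b <= 0."""
    return (w + b) ** 3 / (3.0 * w)

def diag_term_closed(a, b):
    """int_0^1 (x - ReLU(a*x + b))^2 dx for a > 0 and -a <= b <= 0."""
    m = 1.0 - a
    return (m * m - 3.0 * m * b + 3.0 * b * b + b ** 3 / a ** 2 + b ** 3 / a) / 3.0

def diag_term_closed_pos(a, b):
    """int_0^1 (x - ReLU(a*x + b))^2 dx for a >= 0 and b > 0."""
    m = 1.0 - a
    return (m * m - 3.0 * m * b + 3.0 * b * b) / 3.0

checks = [
    ("cross term, b <= 0",
     midpoint(lambda x: relu(0.42 * x - 0.28) ** 2), cross_term_closed(0.42, -0.28)),
    ("diagonal term, b <= 0",
     midpoint(lambda x: (x - relu(1.37 * x - 0.2823)) ** 2), diag_term_closed(1.37, -0.2823)),
    ("diagonal term, b > 0",
     midpoint(lambda x: (x - relu(0.5 * x + 0.1)) ** 2), diag_term_closed_pos(0.5, 0.1)),
]
for label, numeric, closed in checks:
    print(f"{label}: numeric={numeric:.8f} closed={closed:.8f}")
```

Agreement at a few points does not replace the proof, but it is a quick way to catch sign or exponent slips when transcribing the formulas.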
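Now we introduce a new coordinate system on the parameter space which is used to analyse the local geometry around a critical point. Let $\mathcal{C} = \{ \{i, j\} \mid i \neq j \in \{1, 2, \dots, c\} \}$. For a subset $C \subset \mathcal{C}$, define a subset $W_C$ of $M_{r,c}(\mathbb{R})$, called a chamber, by

$$W_C = \{ W \in M_{r,c}(\mathbb{R}) \mid W_i \cdot W_j > 0 \text{ if and only if } \{i, j\} \in C \}.$$

Note that

1. Some subsets $C$ of $\mathcal{C}$ define an empty chamber $W_C = \emptyset$. For example, when $r = 2$ and $c = 4$, the set $C = \{ \{1,2\}, \{2,3\}, \{3,4\}, \{4,1\} \} \subset \mathcal{C}$ defines an empty chamber: to satisfy $W_i \cdot W_{i+1} > 0$, $W_{i+1} \cdot W_{i+2} > 0$, $W_i \cdot W_{i+2} \le 0$ we must have that $W_{i+1}$ is between $W_i$ and $W_{i+2}$ on the circle. Therefore the four angles between consecutive columns $W_i$ and $W_{i+1}$ must sum to $2\pi$, but each of these 4 angles must be less than $\pi/2$, so the configuration is not possible.
2. If $C \subset \mathcal{C}$ defines a nonempty chamber $W_C$ containing a $W$ with $W_i \cdot W_j > 0$ for all $\{i, j\} \in C$ and $W_u \cdot W_v < 0$ for all $\{u, v\} \notin C$, then $W_C$ contains the open subset
$$\{ W \in M_{r,c}(\mathbb{R}) \mid W_i \cdot W_j > 0 \ \forall \{i, j\} \in C \text{ and } W_u \cdot W_v < 0 \ \forall \{u, v\} \notin C \}$$
of $M_{r,c}(\mathbb{R})$;
3. if $C \subset \mathcal{C}$ defines a nonempty chamber $W_C$, then for any $Q \subset \mathcal{C}$,
$$\{ W \in M_{r,c}(\mathbb{R}) \mid W_i \cdot W_j = 0 \ \forall \{i, j\} \in Q \text{ and } W_u \cdot W_v > 0 \Leftrightarrow \{u, v\} \in C \}$$
defines a boundary of $W_C$;
4. for any distinct subsets $C, Q$ of $\mathcal{C}$, $W_C \cap W_Q = \emptyset$;
5. the union of all chambers covers $M_{r,c}(\mathbb{R})$.

Consider a nonempty chamber $W_C$ for some $C \subset \mathcal{C}$. Suppose that $W_C$ contains an open subset of $M_{r,c}(\mathbb{R})$. Let $i \neq j \in \{1, 2, \dots, c\}$. If $\{j, i\} \notin C$, then for all $W \in W_C$, $\delta(P_{j,i}) = 0$. Thus, in $W_C \times \mathbb{R}^c$, for each $i \in \{1, 2, \dots, c\}$, $H_i^-(W, b)$ (see Lemma 3.1) is given by

$$H_i^-(W, b) = \sum_{j \neq i : \{j, i\} \in C} \delta(P_{i,j}) \frac{1}{W_i \cdot W_j} (W_i \cdot W_j + b_i)^3 + \delta(P_i) \left[ \frac{b_i^3}{\|W_i\|^4} + \frac{b_i^3}{\|W_i\|^2} \right] + (1 - \delta(P_i)) + \delta(P_i) N_i.$$

Now we focus on the case $r = 2$. Let $W \in M_{2,c}(\mathbb{R})$. Then $W$ is contained in some chamber $W_C$. In the new parametrization $(l, \theta)$, we can describe the chamber $W_C$ in a different way. Let $(l_1, \dots, l_c, \theta_1, \dots, \theta_c)$ be the coordinate of $W$. For each $i = 1, \dots, c$, a wedge $M_{ij}$ is defined by

$$M_{ij} = \begin{cases} (i, i+1, \dots, i+j-1), & \text{if } j > 0 \text{ and } \theta_i + \cdots + \theta_{i+j-1} < \frac{\pi}{2}; \\ \emptyset, & \text{if } j = 0 \text{ or } \theta_i + \cdots + \theta_{i+j-1} \ge \frac{\pi}{2}, \end{cases}$$

where additions are computed cyclically.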
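For each $i = 1, \dots, c$, let $t(i)$ denote the integer such that

$$\theta_i + \cdots + \theta_{i+t(i)-1} < \frac{\pi}{2} \quad \text{and} \quad \theta_i + \cdots + \theta_{i+t(i)} \ge \frac{\pi}{2}.$$

We use the convention where $t(i) = 0$ if $\theta_i \ge \pi/2$. Then $M_{ij} = \emptyset$ for all $j > t(i)$. So we can list the set of all wedges $M = \{M_{ij}\}$ in a table:

$$\begin{array}{llll}
M_{11} & M_{12} & \cdots & M_{1t(1)} \\
M_{21} & M_{22} & \cdots & M_{2t(2)} \\
\vdots & \vdots & & \vdots \\
M_{(c-1)1} & M_{(c-1)2} & \cdots & M_{(c-1)t(c-1)} \\
M_{c1} & M_{c2} & \cdots & M_{ct(c)}
\end{array}$$

This set of wedges $M$ describes a chamber $W_C$ containing $W$, where

$$C = \big\{ \{1,2\}, \{1,3\}, \dots, \{1, t(1)+1\}, \{2,3\}, \{2,4\}, \dots, \{2, t(2)+2\}, \dots, \{c-1, c\}, \dots, \{c-1, t(c-1)+c-1\}, \{c, 1\}, \dots, \{c, t(c)+c\} \big\},$$

and the additions are computed cyclically. For example, if $c = 5$, then a 5-gon has coordinate

$$l_1 = l_2 = l_3 = l_4 = l_5; \qquad \theta_1 = \theta_2 = \theta_3 = \theta_4 = \frac{2\pi}{5}.$$

It is contained in the interior of the chamber $W_C$, where $C = \{ \{1,2\}, \{2,3\}, \{3,4\}, \{4,5\}, \{5,1\} \}$. This chamber is described by the set of wedges $(1)$, $(2)$, $(3)$, $(4)$, $(5)$.

Now let $W_C \subset M_{2,c}(\mathbb{R})$ be a nonempty chamber. Suppose that $W_C$ contains an open subset of $M_{2,c}(\mathbb{R})$. Let $M$ be the set of wedges describing $W_C$. Set

1. $T^{(1)}_{M_{ij}} = \left\{ (l, \theta, b) \mid -l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) \le b_{i+j} \right\}$;
2. $T^{(2)}_{M_{ij}} = \left\{ (l, \theta, b) \mid -l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) \le b_i \right\}$;
3. $T_i = \{ (l, \theta, b) \mid -l_i^2 \le b_i \}$;
4. $S_{ij} = \{ (l, \theta, b) \mid -l_i l_j \cos(\theta_i + \cdots + \theta_{j-1}) > b_i > 0 \}$.

Then in $W_C \times \mathbb{R}^c$, the TMS potential $H(l, \theta, b)$ in the new parametrization is

$$H(l, \theta, b) = \sum_{i=1}^{c} \delta(b_i \le 0) H_i^-(l, \theta, b) + \delta(b_i > 0) H_i^+(l, \theta, b), \qquad (17)$$

where

$$\begin{aligned}
H_i^-(l, \theta, b) ={}& \sum_{j : M_{(i-j)j} \in M} \delta\big( T^{(1)}_{M_{(i-j)j}} \big) \frac{\left[ l_{i-j} l_i \cos\left( \sum_{k \in M_{(i-j)j}} \theta_k \right) + b_i \right]^3}{l_{i-j} l_i \cos\left( \sum_{k \in M_{(i-j)j}} \theta_k \right)} \\
&+ \sum_{j : M_{ij} \in M} \delta\big( T^{(2)}_{M_{ij}} \big) \frac{\left[ l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) + b_i \right]^3}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \\
&+ \delta(T_i) \left[ (1 - l_i^2)^2 - 3 (1 - l_i^2) b_i + 3 b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2} \right] + 1 - \delta(T_i)
\end{aligned}$$

and

$$\begin{aligned}
H_i^+(l, \theta, b) ={}& \sum_{j \neq i} \left\{ \delta(S_{ij}) \frac{-b_i^3}{l_i l_j \cos(\theta_i + \theta_{i+1} + \cdots + \theta_{j-1})} + \big( 1 - \delta(S_{ij}) \big) \left[ \big( l_i l_j \cos(\theta_{ij}) \big)^2 + 3 \big( l_i l_j \cos(\theta_{ij}) \big) b_i + 3 b_i^2 \right] \right\} \\
&+ (1 - l_i^2)^2 - 3 (1 - l_i^2) b_i + 3 b_i^2,
\end{aligned}$$

where $\theta_{ij} = \theta_i + \theta_{i+1} + \cdots + \theta_{j-1}$ is the angle between $W_i$ and $W_j$.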
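The wedge table that describes a chamber can be computed mechanically from the angles. The sketch below follows the definition of $M_{ij}$ and $t(i)$ given above. The function name `wedges` and the dictionary output format are hypothetical, and the angles are assumed to be non-negative and to sum to $2\pi$.

```python
import math

def wedges(theta):
    """Return {i: [M_i1, ..., M_it(i)]} for angles theta_1..theta_c (1-based, cyclic).

    theta[i - 1] is theta_i, the angle from column W_i to column W_{i+1}.
    A wedge M_ij = (i, ..., i+j-1) is kept while theta_i + ... + theta_{i+j-1} < pi/2.
    """
    c = len(theta)
    table = {}
    for i in range(1, c + 1):
        row, total = [], 0.0
        for j in range(1, c + 1):
            total += theta[(i - 1 + j - 1) % c]
            if total >= math.pi / 2:
                break  # all longer wedges starting at i are empty as well
            row.append(tuple((i - 1 + m) % c + 1 for m in range(j)))
        table[i] = row
    return table

# Regular 5-gon: every angle is 2*pi/5, so only the length-1 wedges survive.
print(wedges([2 * math.pi / 5] * 5))

# A 6-column configuration with four angles 2*pi/5 and a small fifth angle.
alpha = math.pi / 20
print(wedges([2 * math.pi / 5] * 4 + [alpha, 2 * math.pi / 5 - alpha]))
```

For the regular 5-gon this returns the wedges $(1), \dots, (5)$ listed in the example above. Angles that sum to exactly $\pi/2$ sit on a chamber boundary, where floating-point rounding can decide the comparison, so such cases are better handled symbolically.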
Remark G.1. If a critical point of $H$ is contained in the interior of a chamber (see Appendix H.1), then there is an open neighbourhood of the critical point in which $H$ is of the above form. However, critical points are not always contained in the interior of some chamber. If a critical point is contained in the boundary of different chambers (see Appendix J and Appendix H.3), then extra effort is required to analyse the local geometry around the critical point.

H Derivation of local learning coefficients

H.1 $k$-gons with negative bias for $k = c \notin 4\mathbb{Z}$

In this section, we show that for $c \in \{5, 6, 7\}$, the $c$-gon with coordinate

$$l_1 = \cdots = l_c = x^*, \qquad \theta_1 = \cdots = \theta_c = \alpha = \frac{2\pi}{c}, \qquad b_1 = \cdots = b_c = y^*,$$

where the values of $x^*$ and $y^*$ are given in Table H.1, is a non-degenerate critical point of the TMS potential. Therefore, the local learning coefficient of the $c$-gon is $(3c - 1)/2$.

$c$ | $x^*$ | $y^*$
5 | 1.17046 | −0.28230
6 | 1.32053 | −0.61814
7 | 1.44839 | −0.96691

Table H.1: Parameters of $c$-gons.

Let $c$ be an integer greater than or equal to 4. We consider the case where $c$ is not a multiple of 4. Consider the $c$-gon with coordinate $(l^*, \theta^*, b^*)$

$$l^* : l_1 = \cdots = l_c = x, \qquad \theta^* : \theta_1 = \cdots = \theta_c = \alpha = \frac{2\pi}{c}, \qquad b^* : b_1 = \cdots = b_c = y,$$

for some $x > 0$ and $y < 0$. The chamber containing $c$-gons is described by the following wedges (see Appendix G):

$$\begin{array}{llll}
(1) & (1, 2) & \cdots & (1, 2, \dots, s) \\
(2) & (2, 3) & \cdots & (2, 3, \dots, s+1) \\
\vdots & \vdots & & \vdots \\
(c+1-s) & (c+1-s,\, c+2-s) & \cdots & (c+1-s,\, c+2-s, \dots, c) \\
(c+2-s) & (c+2-s,\, c+3-s) & \cdots & (c+2-s,\, c+3-s, \dots, c, 1) \\
\vdots & \vdots & & \vdots \\
(c-1) & (c-1, c) & \cdots & (c-1, c, 1, \dots, s-2) \\
(c) & (c, 1) & \cdots & (c, 1, 2, \dots, s-1)
\end{array}$$

where $s$ is the unique integer in the interval $\left[ \frac{c}{4} - 1, \frac{c}{4} \right)$. Let $M_{ij}$ be the wedge in the $(i, j)$-position in the above table. Then $M_{ij} = (i, i+1, \dots, i+j-1)$, where additions are computed cyclically.
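Then the local TMS potential (see Appendix G) is

$$\begin{aligned}
H(l, \theta, b) ={}& \sum_{M_{ij} \in M} \delta\big( T^{(1)}_{M_{ij}} \big) \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
&+ \sum_{M_{ij} \in M} \delta\big( T^{(2)}_{M_{ij}} \big) \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 \\
&+ \sum_{i=1}^{c} \left\{ \delta(T_i) \left[ (1 - l_i^2)^2 - 3 (1 - l_i^2) b_i + 3 b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2} \right] + (1 - \delta(T_i)) \right\},
\end{aligned}$$

where

1. $T^{(1)}_{M_{ij}} = \left\{ (l, \theta, b) \mid -l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) \le b_{i+j} \right\}$;
2. $T^{(2)}_{M_{ij}} = \left\{ (l, \theta, b) \mid -l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) \le b_i \right\}$;
3. $T_i = \{ (l, \theta, b) \mid -l_i^2 \le b_i \}$;
4. $\theta_1 + \theta_2 + \cdots + \theta_c = 2\pi$.

Consider the open subset of the parameter space defined by

$$-l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) < b_{i+j}, \qquad -l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) < b_i$$

for all $M_{ij}$ and $-l_i^2 < b_i$ for all $i = 1, \dots, c$. In this open subset, we have

$$\begin{aligned}
H(l, \theta, b) ={}& \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
&+ \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 \\
&+ \sum_{i=1}^{c} \left[ (1 - l_i^2)^2 - 3 (1 - l_i^2) b_i + 3 b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2} \right],
\end{aligned}$$

with constraint $\theta_1 + \theta_2 + \cdots + \theta_c = 2\pi$. It follows from the Lagrange multiplier method that a point $(l^*, \theta^*, b^*)$ is a critical point of $H(l, \theta, b)$ with constraint $\theta_1 + \cdots + \theta_c = 2\pi$ if and only if there exists $\lambda \in \mathbb{R}$ such that for all $a = 1, 2, \dots, c$,

1. $\left. \frac{\partial}{\partial \theta_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} H(l, \theta, b) = \lambda$;
2. $\left. \frac{\partial}{\partial l_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} H(l, \theta, b) = 0$;
3. $\left. \frac{\partial}{\partial b_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} H(l, \theta, b) = 0$.

We compute the partial derivative of $H(l, \theta, b)$ with respect to $b_a$:

$$\begin{aligned}
\frac{\partial}{\partial b_a} H(l, \theta, b) ={}& \sum_{M_{ij} : i+j = a} \frac{3}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^2 \\
&+ \sum_{M_{ij} : i = a} \frac{3}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^2 \\
&- 3 (1 - l_a^2) + 6 b_a + \frac{3}{l_a^4} b_a^2 + \frac{3}{l_a^2} b_a^2.
\end{aligned}$$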
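We list all $M_{ij}$ with $i = a$ and all $M_{ij}$ with $i + j = a$:

$$\begin{array}{lllll}
i = a: & (a) & (a, a+1) & \cdots & (a, a+1, \dots, a+s-1) \\
i + j = a: & (a-1) & (a-2, a-1) & \cdots & (a-s, \dots, a-2, a-1)
\end{array}$$

Then

$$\left. \frac{\partial}{\partial b_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} H(l, \theta, b) = x^2 \left( 3 + 6 \sum_{j=1}^{s} \cos(j\alpha) \right) + \frac{y^2}{x^2} \left( 3 + 6 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)} \right) + y (12 s + 6) + 3 \frac{y^2}{x^4} - 3.$$

Multiplying both sides of the equation

$$x^2 \left( 3 + 6 \sum_{j=1}^{s} \cos(j\alpha) \right) + \frac{y^2}{x^2} \left( 3 + 6 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)} \right) + y (12 s + 6) + 3 \frac{y^2}{x^4} - 3 = 0$$

by $\frac{1}{3} x^4$, we have

$$x^6 \left( 1 + 2 \sum_{j=1}^{s} \cos(j\alpha) \right) + x^2 y^2 \left( 1 + 2 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)} \right) + 2 x^4 y (2 s + 1) + y^2 - x^4 = 0.$$

Let $G(s) = 1 + 2 \sum_{j=1}^{s} \cos(j\alpha)$, $H(s) = 1 + 2 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)}$, $M(s) = 1 + 2s$. Then we obtain a parametrized polynomial equation in two variables:

$$x^6 G(s) + x^2 y^2 H(s) + 2 x^4 y M(s) + y^2 - x^4 = 0.$$

Now we compute the partial derivative of $H(l, \theta, b)$ with respect to $l_a$:

$$\begin{aligned}
\frac{\partial}{\partial l_a} H(l, \theta, b) ={}& \frac{\partial}{\partial l_a} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
&+ \frac{\partial}{\partial l_a} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 \\
&- 4 (1 - l_a^2) l_a + 6 l_a b_a - 4 \frac{b_a^3}{l_a^5} - 2 \frac{b_a^3}{l_a^3}.
\end{aligned}$$

From the list of all $M_{ij}$ with $i = a$ and all $M_{ij}$ with $i + j = a$, we have

$$\begin{aligned}
\frac{\partial}{\partial l_a} & \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
={}& \sum_{j=1}^{s} \left\{ \frac{3 \left[ l_a l_{a+j} \cos\left( \sum_{k \in M_{aj}} \theta_k \right) + b_{a+j} \right]^2}{l_a} - \frac{\left[ l_a l_{a+j} \cos\left( \sum_{k \in M_{aj}} \theta_k \right) + b_{a+j} \right]^3}{l_a^2 l_{a+j} \cos\left( \sum_{k \in M_{aj}} \theta_k \right)} \right\} \\
&+ \sum_{j=1}^{s} \left\{ \frac{3 \left[ l_{a-j} l_a \cos\left( \sum_{k \in M_{(a-j)j}} \theta_k \right) + b_a \right]^2}{l_a} - \frac{\left[ l_{a-j} l_a \cos\left( \sum_{k \in M_{(a-j)j}} \theta_k \right) + b_a \right]^3}{l_{a-j} l_a^2 \cos\left( \sum_{k \in M_{(a-j)j}} \theta_k \right)} \right\}.
\end{aligned}$$

So

$$\left. \frac{\partial}{\partial l_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 = 2 \sum_{j=1}^{s} \left( \frac{3 \left( x^2 \cos(j\alpha) + y \right)^2}{x} - \frac{\left( x^2 \cos(j\alpha) + y \right)^3}{x^3 \cos(j\alpha)} \right).$$

Similarly,

$$\left. \frac{\partial}{\partial l_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 = 2 \sum_{j=1}^{s} \left( \frac{3 \left( x^2 \cos(j\alpha) + y \right)^2}{x} - \frac{\left( x^2 \cos(j\alpha) + y \right)^3}{x^3 \cos(j\alpha)} \right).$$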
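Thus,

$$\left. \frac{\partial}{\partial l_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} H(l, \theta, b) = x^3 \left( 4 + 8 \sum_{j=1}^{s} \cos^2(j\alpha) \right) + x y \left( 6 + 12 \sum_{j=1}^{s} \cos(j\alpha) \right) - \frac{y^3}{x^3} \left( 2 + 4 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)} \right) - 4x - 4 \frac{y^3}{x^5}.$$

Multiplying both sides of the equation

$$x^3 \left( 4 + 8 \sum_{j=1}^{s} \cos^2(j\alpha) \right) + x y \left( 6 + 12 \sum_{j=1}^{s} \cos(j\alpha) \right) - \frac{y^3}{x^3} \left( 2 + 4 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)} \right) - 4x - 4 \frac{y^3}{x^5} = 0$$

by $\frac{1}{2} x^5$, we have

$$2 x^8 \left( 1 + 2 \sum_{j=1}^{s} \cos^2(j\alpha) \right) + 3 x^6 y \left( 1 + 2 \sum_{j=1}^{s} \cos(j\alpha) \right) - x^2 y^3 \left( 1 + 2 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)} \right) - 2 x^6 - 2 y^3 = 0.$$

Let $F(s) = 1 + 2 \sum_{j=1}^{s} \cos^2(j\alpha)$. Then we obtain a parametrized polynomial equation in two variables:

$$2 x^8 F(s) + 3 x^6 y G(s) - x^2 y^3 H(s) - 2 x^6 - 2 y^3 = 0.$$

Therefore, we need to determine whether the system of two parametrized polynomial equations in two variables

$$\begin{aligned}
x^6 G(s) + x^2 y^2 H(s) + 2 x^4 y M(s) + y^2 - x^4 &= 0 \\
2 x^8 F(s) + 3 x^6 y G(s) - x^2 y^3 H(s) - 2 x^6 - 2 y^3 &= 0
\end{aligned}$$

where

1. $F(s) = 1 + 2 \sum_{j=1}^{s} \cos^2(j\alpha)$;
2. $G(s) = 1 + 2 \sum_{j=1}^{s} \cos(j\alpha)$;
3. $H(s) = 1 + 2 \sum_{j=1}^{s} \frac{1}{\cos(j\alpha)}$;
4. $M(s) = 1 + 2s$,

has common solutions in $\mathbb{R}_{>0} \times \mathbb{R}_{<0}$ with $-x^2 \cos(s\alpha) < y$ or not.

Now we compute the partial derivative of $H(l, \theta, b)$ with respect to $\theta_a$:

$$\begin{aligned}
\frac{\partial}{\partial \theta_a} H(l, \theta, b) ={}& \frac{\partial}{\partial \theta_a} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
&+ \frac{\partial}{\partial \theta_a} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3.
\end{aligned}$$

We list all $M_{ij}$ containing $a$:

$$\begin{array}{llll}
M_{a1} & M_{a2} & \cdots & M_{as} \\
 & & & M_{(a-s+1)s} \\
 & & M_{(a-s+2)(s-1)} & M_{(a-s+2)s} \\
 & & \vdots & \vdots \\
 & M_{(a-2)3} & \cdots & M_{(a-2)s} \\
M_{(a-1)2} & M_{(a-1)3} & \cdots & M_{(a-1)s}
\end{array}$$

Rearrange these wedges in terms of their length:

$$\begin{array}{ll}
\text{length } 1: & M_{a1} \\
\text{length } 2: & M_{a2}, \ M_{(a-1)2} \\
\text{length } 3: & M_{a3}, \ M_{(a-1)3}, \ M_{(a-2)3} \\
\vdots & \\
\text{length } s-1: & M_{a(s-1)}, \ M_{(a-1)(s-1)}, \ \dots, \ M_{(a-s+2)(s-1)} \\
\text{length } s: & M_{as}, \ M_{(a-1)s}, \ \dots, \ M_{(a-s+2)s}, \ M_{(a-s+1)s}
\end{array}$$

Note that for $j = 1, \dots, s$, there are exactly $j$ wedges of length $j$ containing $a$.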
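Then

$$\begin{aligned}
\frac{\partial}{\partial \theta_a} & \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
={}& \sum_{M_{ij} : a \in M_{ij}} \frac{-3 \left[ l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) + b_{i+j} \right]^2 \sin\left( \sum_{k \in M_{ij}} \theta_k \right)}{\cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \\
&+ \sum_{M_{ij} : a \in M_{ij}} \frac{\sin\left( \sum_{k \in M_{ij}} \theta_k \right) \left[ l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) + b_{i+j} \right]^3}{l_i l_{i+j} \cos^2\left( \sum_{k \in M_{ij}} \theta_k \right)}.
\end{aligned}$$

Then

$$\left. \frac{\partial}{\partial \theta_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 = \sum_{j=1}^{s} j \cdot \left( \frac{-3 \left( x^2 \cos(j\alpha) + y \right)^2 \sin(j\alpha)}{\cos(j\alpha)} + \frac{\sin(j\alpha) \left( x^2 \cos(j\alpha) + y \right)^3}{x^2 \cos^2(j\alpha)} \right).$$

Similarly,

$$\left. \frac{\partial}{\partial \theta_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 = \sum_{j=1}^{s} j \cdot \left( \frac{-3 \left( x^2 \cos(j\alpha) + y \right)^2 \sin(j\alpha)}{\cos(j\alpha)} + \frac{\sin(j\alpha) \left( x^2 \cos(j\alpha) + y \right)^3}{x^2 \cos^2(j\alpha)} \right).$$

Thus,

$$\left. \frac{\partial}{\partial \theta_a} \right|_{l = l^*, \theta = \theta^*, b = b^*} H(l, \theta, b) = 2 \sum_{j=1}^{s} j \cdot \left( \frac{-3 \left( x^2 \cos(j\alpha) + y \right)^2 \sin(j\alpha)}{\cos(j\alpha)} + \frac{\sin(j\alpha) \left( x^2 \cos(j\alpha) + y \right)^3}{x^2 \cos^2(j\alpha)} \right),$$

which is independent of $a$. So if the system of polynomial equations

$$\begin{aligned}
x^6 G(s) + x^2 y^2 H(s) + 2 x^4 y M(s) + y^2 - x^4 &= 0 \\
2 x^8 F(s) + 3 x^6 y G(s) - x^2 y^3 H(s) - 2 x^6 - 2 y^3 &= 0
\end{aligned}$$

has a common solution $(x^*, y^*) \in \mathbb{R}_{>0} \times \mathbb{R}_{<0}$ with $-x^2 \cos(s\alpha) < y$, then the Lagrange multiplier is

$$2 \sum_{j=1}^{s} j \cdot \left( \frac{-3 \left( (x^*)^2 \cos(j\alpha) + y^* \right)^2 \sin(j\alpha)}{\cos(j\alpha)} + \frac{\sin(j\alpha) \left( (x^*)^2 \cos(j\alpha) + y^* \right)^3}{(x^*)^2 \cos^2(j\alpha)} \right)$$

and the Lagrange multiplier method implies that the $c$-gon with coordinate

$$l_1 = \cdots = l_c = x^*, \qquad b_1 = \cdots = b_c = y^*, \qquad \theta_1 = \cdots = \theta_c = \alpha = \frac{2\pi}{c},$$

is a critical point of $H(l, \theta, b)$ with constraint $\theta_1 + \cdots + \theta_c = 2\pi$. So it remains to determine whether the system of two parametrized polynomial equations in two variables

$$\begin{aligned}
x^6 G(s) + x^2 y^2 H(s) + 2 x^4 y M(s) + y^2 - x^4 &= 0 \\
2 x^8 F(s) + 3 x^6 y G(s) - x^2 y^3 H(s) - 2 x^6 - 2 y^3 &= 0
\end{aligned}$$

has common solutions $(x^*, y^*) \in \mathbb{R}_{>0} \times \mathbb{R}_{\le 0}$ with $-x^2 \cos(s\alpha) < y$ or not. We compute the common solution using Mathematica.

1. $c = 5$: in this case, $s = 1$. Then $(x^*, y^*) \approx (1.17046, -0.28230)$ is the unique common solution such that $-x^2 \cos(2\pi/5) < y$.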
Moreover, the Hessian of $H$ at this 5-gon is non-degenerate. So the 5-gon with coordinate
$$l_1 = l_2 = \cdots = l_5 = x \approx 1.17046; \qquad \theta_1 = \theta_2 = \cdots = \theta_4 = \frac{2\pi}{5}; \qquad b_1 = b_2 = \cdots = b_5 = y \approx -0.28230,$$
is a non-degenerate critical point. So its local learning coefficient is $(3c - 1)/2 = 7$.
2. $c = 6$: in this case $s = 1$. Then $(x^*, y^*) \approx (1.32053, -0.61814)$ is the unique common solution such that $-x^2 \cos(\pi/3) < y$. Moreover, the Hessian of $H$ at this 6-gon is non-degenerate. So the 6-gon with coordinate
$$l_1 = l_2 = \cdots = l_6 \approx 1.32053; \qquad \theta_1 = \theta_2 = \cdots = \theta_5 = \frac{\pi}{3}; \qquad b_1 = b_2 = \cdots = b_6 \approx -0.61814,$$
is a non-degenerate critical point. So its local learning coefficient is $(3c - 1)/2 = 8.5$.
3. $c = 7$: in this case $s = 1$. Then $(x^*, y^*) \approx (1.44839, -0.96691)$ is the unique common solution such that $-x^2 \cos(2\pi/7) < y$. Moreover, the Hessian of $H$ at this 7-gon is non-degenerate. So the 7-gon with coordinate
$$l_1 = l_2 = \cdots = l_7 \approx 1.44839; \qquad \theta_1 = \theta_2 = \cdots = \theta_6 = \frac{2\pi}{7}; \qquad b_1 = b_2 = \cdots = b_7 \approx -0.96691,$$
is a non-degenerate critical point. So its local learning coefficient is $(3c - 1)/2 = 10$.
4. $c = 9$: in this case $s = 2$. There is no common solution.
5. $c = 10$: in this case $s = 2$. There is no common solution.
6. $c = 11$: in this case $s = 2$. There is no common solution.
7. $c = 13$: in this case $s = 3$. There is no common solution.
8. We checked that for any $9 \le c \le 203$ which is not a multiple of 4, there is no common solution.

H.2 $k$-gons with negative bias for $k = c \in 4\mathbb{Z}$

In this section, we show that for $c = 8$, the 8-gon with coordinate

$$l_1 = l_2 = \cdots = l_8 \approx 1.55041, \qquad \theta_1 = \theta_2 = \cdots = \theta_8 = \frac{\pi}{4}, \qquad b_1 = b_2 = \cdots = b_8 \approx -1.29122,$$

is a non-degenerate critical point of the TMS potential. So the local learning coefficient of the 8-gon is $11.5$.

Let $c > 4$ be a multiple of 4. A $c$-gon has $\theta$-coordinate

$$\theta_1 = \theta_2 = \cdots = \theta_c = \frac{2\pi}{c}.$$

Let $s = c/4 - 1 \in \mathbb{Z}$. Then $(s + 1) \cdot (2\pi/c) = \pi/2$. So

$$W_i \cdot W_{i+s+1} = l_i l_{i+s+1} \cos\left( (s + 1) \cdot \frac{2\pi}{c} \right) = l_i l_{i+s+1} \cos\left( \frac{\pi}{2} \right) = 0$$

for all $i = 1, \dots, c$. So $c$-gons are on the boundaries of some chambers. Since $c > 4$, $s \ge 1$.
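The two polynomial critical-point equations recur in both H.1 and H.2, so it can be useful to check tabulated solutions against them. The sketch below only evaluates the residuals of the system at the rounded values of Table H.1. It is not a substitute for the Mathematica computation, the function names are hypothetical, and it does not test uniqueness or the non-degeneracy of the Hessian.

```python
import math

def s_for(c):
    """Unique integer in the half-open interval [c/4 - 1, c/4)."""
    return math.ceil(c / 4 - 1)

def residuals(c, x, y):
    """Left-hand sides of the two critical-point equations for the c-gon at (x, y)."""
    alpha = 2 * math.pi / c
    s = s_for(c)
    cosines = [math.cos(j * alpha) for j in range(1, s + 1)]
    F = 1 + 2 * sum(v * v for v in cosines)
    G = 1 + 2 * sum(cosines)
    H = 1 + 2 * sum(1 / v for v in cosines)
    M = 1 + 2 * s
    r1 = x**6 * G + x**2 * y**2 * H + 2 * x**4 * y * M + y**2 - x**4
    r2 = 2 * x**8 * F + 3 * x**6 * y * G - x**2 * y**3 * H - 2 * x**6 - 2 * y**3
    return r1, r2

# Rounded (x*, y*) values copied from Table H.1.
TABLE_H1 = {5: (1.17046, -0.28230), 6: (1.32053, -0.61814), 7: (1.44839, -0.96691)}

for c, (x, y) in TABLE_H1.items():
    r1, r2 = residuals(c, x, y)
    inside = -x**2 * math.cos(s_for(c) * 2 * math.pi / c) < y
    print(f"c={c}: r1={r1:.2e}, r2={r2:.2e}, constraint satisfied: {inside}")
```

Because the table entries are rounded to five decimals, the residuals are small but not zero. A root-finder started from these values could be used to refine them.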
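Let $M = \{M_{ij}\}$ be wedges describing a chamber whose boundary contains $c$-gons. If $b_i < 0$ for all $i = 1, \dots, c$, then the TMS potential (see Appendix G) is

$$\begin{aligned}
H(l, \theta, b) ={}& \sum_{M_{ij} \in M} \delta\big( T^{(1)}_{M_{ij}} \big) \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
&+ \sum_{M_{ij} \in M} \delta\big( T^{(2)}_{M_{ij}} \big) \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 \\
&+ \sum_{i=1}^{c} \left\{ \delta(T_i) \left[ (1 - l_i^2)^2 - 3 (1 - l_i^2) b_i + 3 b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2} \right] + (1 - \delta(T_i)) \right\},
\end{aligned}$$

where

1. $T^{(1)}_{M_{ij}} = \left\{ (l, \theta, b) \mid -l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) \le b_{i+j} \right\}$;
2. $T^{(2)}_{M_{ij}} = \left\{ (l, \theta, b) \mid -l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right) \le b_i \right\}$;
3. $T_i = \{ (l, \theta, b) \mid -l_i^2 \le b_i \}$;
4. $\theta_1 + \theta_2 + \cdots + \theta_c = 2\pi$.

Consider the $c$-gon with coordinate

$$l^* : l_1 = \cdots = l_c = x, \qquad \theta^* : \theta_1 = \cdots = \theta_c = \alpha = \frac{2\pi}{c}, \qquad b^* : b_1 = \cdots = b_c = y$$

for some $x > 0$ and $y < 0$. Suppose that $-x^2 < -x^2 \cos(\alpha) < \cdots < -x^2 \cos(s \cdot \alpha) < y$.

Lemma H.1. The $c$-gon $(l^*, \theta^*, b^*)$ has an open neighbourhood in which the TMS potential is

$$\begin{aligned}
H(l, \theta, b) ={}& \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_{i+j} \right]^3 \\
&+ \sum_{M_{ij}} \frac{1}{l_i l_{i+j} \cos\left( \sum_{k \in M_{ij}} \theta_k \right)} \left[ l_i l_{i+j} \cos\Big( \sum_{k \in M_{ij}} \theta_k \Big) + b_i \right]^3 \\
&+ \sum_{i=1}^{c} \left[ (1 - l_i^2)^2 - 3 (1 - l_i^2) b_i + 3 b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2} \right],
\end{aligned}$$

where $\theta_1 + \theta_2 + \cdots + \theta_c = 2\pi$, and $M = \{M_{ij}\}$ is

$$\begin{array}{llll}
(1) & (1, 2) & \cdots & (1, 2, \dots, s) \\
(2) & (2, 3) & \cdots & (2, 3, \dots, s+1) \\
\vdots & \vdots & & \vdots \\
(c+1-s) & (c+1-s,\, c+2-s) & \cdots & (c+1-s,\, c+2-s, \dots, c) \\
(c+2-s) & (c+2-s,\, c+3-s) & \cdots & (c+2-s,\, c+3-s, \dots, c, 1) \\
\vdots & \vdots & & \vdots \\
(c-1) & (c-1, c) & \cdots & (c-1, c, 1, \dots, s-2) \\
(c) & (c, 1) & \cdots & (c, 1, 2, \dots, s-1).
\end{array}$$

Proof. Since $x > 0$ and $y < 0$, there are small positive real numbers $\varepsilon_x, \varepsilon_b$ such that $x - \varepsilon_x > 0$ and $y + \varepsilon_b < 0$. Then $\frac{y + \varepsilon_b}{-(x + \varepsilon_x)^2}$ is a positive number. Then there exists $\gamma > 0$ such that

$$\cos\left( \frac{\pi}{2} - \gamma \right) < \frac{y + \varepsilon_b}{-(x + \varepsilon_x)^2}.$$

Let $\varepsilon_\theta = \gamma / (s + 1)$. Consider the open neighbourhood of $(l^*, \theta^*, b^*)$ given by

$$(x - \varepsilon_x, x + \varepsilon_x) \times (\alpha - \varepsilon_\theta, \alpha + \varepsilon_\theta) \times (y - \varepsilon_b, y + \varepsilon_b).$$

Let $(l, \theta, b)$ be a point in this open neighbourhood. Consider any wedge $M_{ij}$ containing more than $s$ numbers.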
Without loss of generality, assume that $M_{ij}$ contains $s+1$ elements. If $\sum_{k\in M_{ij}}\theta_k > \pi/2$, then
$$-l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) > 0 > y + \varepsilon_b.$$
Otherwise, suppose $\sum_{k\in M_{ij}}\theta_k \le \pi/2$. Then
$$\begin{aligned}
-l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) &> -(x+\varepsilon_x)^2\cos\left(\sum_{k\in M_{ij}}\theta_k\right) \\
&> -(x+\varepsilon_x)^2\cos\left((s+1)\times\left(\frac{2\pi}{c} - \varepsilon_\theta\right)\right) \\
&= -(x+\varepsilon_x)^2\cos\left(\frac{\pi}{2} - \gamma\right) \\
&> y + \varepsilon_b > b_i \text{ or } b_{i+j}.
\end{aligned}$$
Thus $(l,\theta,b) \notin T^{(1)}_{M_{ij}}$ and $(l,\theta,b) \notin T^{(2)}_{M_{ij}}$. Thus, the term in $H(l,\theta,b)$ indexed by this $M_{ij}$ disappears. So only $M_{ij}$ containing fewer than $s+1$ numbers remain in the sum.

It follows from the calculations in Appendix H.1 that
$$\left.\frac{\partial}{\partial\theta_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b)$$
is independent of $a$. Moreover, the same calculations show that
$$\left.\frac{\partial}{\partial b_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 0 \quad\text{and}\quad \left.\frac{\partial}{\partial l_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 0$$
give rise to a system of parametrized polynomial equations in $x$ and $y$:
$$\begin{aligned}
x^6 G(s) + x^2y^2 H(s) + 2x^4 y M(s) + y^2 - x^4 &= 0, \\
2x^8 F(s) + 3x^6 y G(s) - x^2y^3 H(s) - 2x^6 - 2y^3 &= 0,
\end{aligned}$$
where

1. $F(s) = 1 + 2\sum_{j=1}^{s}\cos^2(j\alpha)$;

2. $G(s) = 1 + 2\sum_{j=1}^{s}\cos(j\alpha)$;

3. $H(s) = 1 + 2\sum_{j=1}^{s}\frac{1}{\cos(j\alpha)}$;

4. $M(s) = 1 + 2s$.

We compute the common solution using Mathematica.

1. $c = 8$: in this case, $s = 1$. Then $(x^*, y^*) \approx (1.55045, -1.29119)$ is a common solution such that $-x^2\cos(\pi/4) < y$. Moreover, the Hessian of $H(l,\theta,b)$ at this $8$-gon is non-degenerate. So the $8$-gon with coordinates
$$l_1 = l_2 = \cdots = l_8 \approx 1.55045;\quad \theta_1 = \theta_2 = \cdots = \theta_8 = \frac{\pi}{4};\quad b_1 = b_2 = \cdots = b_8 \approx -1.29119$$
is a non-degenerate critical point. So its local learning coefficient is $(3c-1)/2 = 11.5$.

2. $c = 12$: in this case, $s = 2$. Then $(x^*, y^*) \approx (1.03322, -0.46654)$ and $(x^*, y^*) \approx (1.24975, -0.85483)$ are common solutions such that $-x^2\cos(\pi/6) < y$.

3. We plot the level sets of the two polynomial equations for $1 \le s \le 50$. There is always a common solution.

H.3 $k$-gons with negative bias for $k < c$

Now for a fixed $c$, we discuss arbitrary $k$-gons where $k \le c$. A $k$-gon is obtained by shrinking $c-k$ of the $W_i$'s to zero.
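The $c = 8$ solution reported in Appendix H.2 can be cross-checked numerically. The sketch below is not the Mathematica computation used in the paper: it evaluates the two polynomial equations at the reported $(x^*, y^*)$ and refines the point with a plain Newton iteration. The function names, the hand-derived Jacobian, and the choice of 20 Newton steps are our assumptions.

```python
import numpy as np


def coefficients(c: int, s: int):
    """F(s), G(s), H(s), M(s) from Appendix H.2 with alpha = 2*pi/c."""
    alpha = 2 * np.pi / c
    cos_j = np.cos(np.arange(1, s + 1) * alpha)
    F = 1 + 2 * np.sum(cos_j ** 2)
    G = 1 + 2 * np.sum(cos_j)
    H = 1 + 2 * np.sum(1 / cos_j)
    M = 1 + 2 * s
    return F, G, H, M


def residual(v, c, s):
    """Left-hand sides of the two polynomial equations in (x, y)."""
    x, y = v
    F, G, H, M = coefficients(c, s)
    eq1 = x**6 * G + x**2 * y**2 * H + 2 * x**4 * y * M + y**2 - x**4
    eq2 = 2 * x**8 * F + 3 * x**6 * y * G - x**2 * y**3 * H - 2 * x**6 - 2 * y**3
    return np.array([eq1, eq2])


def jacobian(v, c, s):
    """Partial derivatives of the two equations with respect to x and y."""
    x, y = v
    F, G, H, M = coefficients(c, s)
    return np.array([
        [6 * x**5 * G + 2 * x * y**2 * H + 8 * x**3 * y * M - 4 * x**3,
         2 * x**2 * y * H + 2 * x**4 * M + 2 * y],
        [16 * x**7 * F + 18 * x**5 * y * G - 2 * x * y**3 * H - 12 * x**5,
         3 * x**6 * G - 3 * x**2 * y**2 * H - 6 * y**2],
    ])


def newton(v0, c, s, steps=20):
    """Plain Newton iteration started at v0 (no line search or damping)."""
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        v = v - np.linalg.solve(jacobian(v, c, s), residual(v, c, s))
    return v


reported = (1.55045, -1.29119)  # values quoted for c = 8, s = 1
refined = newton(reported, c=8, s=1)
print("residual at reported point:", residual(reported, 8, 1))
print("Newton-refined point:", refined)
print("condition -x^2 cos(pi/4) < y holds:",
      -refined[0] ** 2 * np.cos(np.pi / 4) < refined[1])
```

Because the iteration starts at the reported point, it only confirms that a root lies nearby; it says nothing about uniqueness or about the other values of $c$.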
Without loss of generality, we assume that $W_c, W_{c-1}, \ldots, W_{k+1}$ are shrinking to zero. So a $k$-gon is on the boundary of some chamber. Note that different angles between $W_i$ and $W_j$ for $i, j \in \{c-k+1, \ldots, c\}$ might give different chambers whose boundary contains the $k$-gon. The following example illustrates the idea.

Let $c = 6$ and $k = 5$. Then there are three different chambers whose boundary contains the $5$-gon.

1. Consider a family of $6$-gons with coordinates $l_1 = l_2 = l_3 = l_4 = l_5 = l$, $l_6 = u$, $\theta_1 = \theta_2 = \theta_3 = \theta_4 = \frac{2\pi}{5}$, $\theta_5 = \alpha$, where $l, u > 0$ and $\alpha \in \left(0, \frac{\pi}{10}\right)$. This family of $6$-gons is contained in the chamber described by the following wedges:
$$\begin{matrix} (1) & \\ (2) & \\ (3) & \\ (4) & (4,5) \\ (5) & (5,6) \\ (6) & \end{matrix}$$
The $5$-gon is obtained from this family of $6$-gons by taking the limit as $u \to 0$. So the $5$-gon is on the boundary of this chamber.

2. Consider another family of $6$-gons with coordinates $l_1 = l_2 = l_3 = l_4 = l_5 = l$, $l_6 = u$, $\theta_1 = \theta_2 = \theta_3 = \theta_4 = \frac{2\pi}{5}$, $\theta_5 = \alpha$, where $l, u > 0$ and $\alpha \in \left(\frac{\pi}{10}, \frac{3\pi}{10}\right)$. This family of $6$-gons is contained in the chamber described by the following wedges:
$$\begin{matrix} (1) & \\ (2) & \\ (3) & \\ (4) & \\ (5) & (5,6) \\ (6) & \end{matrix}$$
The $5$-gon is obtained from this family of $6$-gons by taking the limit as $u \to 0$. So the $5$-gon is on the boundary of this chamber.

3. Finally, consider the family of $6$-gons with coordinates $l_1 = l_2 = l_3 = l_4 = l_5 = l$, $l_6 = u$, $\theta_1 = \theta_2 = \theta_3 = \theta_4 = \frac{2\pi}{5}$, $\theta_5 = \alpha$, where $l, u > 0$ and $\alpha \in \left(\frac{3\pi}{10}, \frac{2\pi}{5}\right)$. This family of $6$-gons is contained in the chamber described by the following wedges:
$$\begin{matrix} (1) & \\ (2) & \\ (3) & \\ (4) & \\ (5) & (5,6) \\ (6) & (6,1) \end{matrix}$$
The $5$-gon is obtained from this family of $6$-gons by taking the limit as $u \to 0$. So the $5$-gon is on the boundary of this chamber.

Theorem H.1. Let $k \in \mathbb{Z}_{>4}$ and $s$ be the unique integer in the interval $\left[\frac{k}{4} - 1, \frac{k}{4}\right)$.
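If a $k$-gon with coordinates
$$l_1 = \cdots = l_k = x;\quad \theta_1 = \cdots = \theta_k = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y$$
for some $x > 0$ and $y < 0$ satisfying
$$-x^2\cos\left(s\cdot\frac{2\pi}{k}\right) \le y$$
is a critical point of $H(l,\theta,b)$ with constraint $\theta_1 + \cdots + \theta_k = 2\pi$ for $c = k$, then the $k$-gons with coordinates
$$l_1 = \cdots = l_k = x,\ l_{k+1} = \cdots = l_c = 0;\quad \theta_1 = \cdots = \theta_{k-1} = \frac{2\pi}{k},\ \theta_k + \cdots + \theta_c = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y,\ b_{k+1}, \ldots, b_c < 0$$
are critical points of $H(l,\theta,b)$ with constraint $\theta_1 + \cdots + \theta_c = 2\pi$ for any $c > k$.

Proof. We first show that for any $b_{k+1}, \ldots, b_c < 0$ and $\theta_k, \ldots, \theta_c \in [0, 2\pi)$ with $\theta_k + \cdots + \theta_c = 2\pi/k$, the $k$-gon $(l^*,\theta^*,b^*)$ with coordinates
$$l_1 = \cdots = l_k = x,\ l_{k+1} = \cdots = l_c = 0;\quad \theta_1 = \cdots = \theta_{k-1} = \alpha = \frac{2\pi}{k},\ \theta_k, \ldots, \theta_c \in [0, 2\pi);\quad b_1 = \cdots = b_k = y,\ b_{k+1}, \ldots, b_c < 0$$
has an open neighbourhood in which only the following types of wedges show up in $H(l,\theta,b)$:

1. $M_{ij}$ does not contain any of $k, k+1, \ldots, c$ and has length at most $s$;

2. $M_{ij}$ contains $(k, k+1, \ldots, c)$ and has length at most $s + c - k$.

Since each coordinate in $b^*$ is less than zero, there is an open neighbourhood $B$ of $b^*$ contained in $\mathbb{R}^c_{<0}$. Since $l_1 = \cdots = l_k = x > 0$ and $l_{k+1} = \cdots = l_c = 0$, there exists an open neighbourhood $L$ of $l^*$ contained in $\mathbb{R}^k_{>0} \times \mathbb{R}^{c-k}_{\ge 0}$.

If $k$ is not a multiple of $4$, then $s\cdot\alpha < \pi/2$ and $(s+1)\cdot\alpha > \pi/2$. So we can perturb each $\theta_i$ to obtain an open neighbourhood $\Theta$ of $\theta^*$ in which

• $\theta_i + \cdots + \theta_{i+s-1} < \pi/2$ and $\theta_i + \cdots + \theta_{i+s} > \pi/2$ for $\{i, i+1, \ldots, i+s\} \subset \{1, \ldots, k-1\}$;

• $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$, $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-k+i} > \pi/2$, and $\theta_{i-1} + \theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$.

If $k$ is a multiple of $4$, then $s\cdot\alpha < \pi/2$ and $(s+1)\cdot\alpha = \pi/2$. So we can perturb each $\theta_i$ to obtain an open neighbourhood $\Theta$ of $\theta^*$ in which

• $\theta_i + \cdots + \theta_{i+s-1} < \pi/2$ and $\theta_i + \cdots + \theta_{i+s}$ is close to $\pi/2$ for $\{i, i+1, \ldots, i+s\} \subset \{1, \ldots, k-1\}$;

• $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$, $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-k+i}$ is close to $\pi/2$, and $\theta_{i-1} + \theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$.

Then $L \times \Theta \times B$ is an open neighbourhood of $(l^*,\theta^*,b^*)$.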
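Since $l_{k+1} = \cdots = l_c = 0$, we can shrink $L$ so that for every $(l,\theta,b) \in L \times \Theta \times B$,
$$-l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) > b_i,\qquad -l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) > b_{i+j}$$
for all $M_{ij}$ with either $i \in \{k+1, \ldots, c\}$ or $i+j-1 \in \{k, \ldots, c-1\}$, and
$$-l_i^2 > b_i$$
for all $i = k+1, \ldots, c$. Since
$$-x^2 < \cdots < -x^2\cos\left((s-1)\cdot\frac{2\pi}{k}\right) < -x^2\cos\left(s\cdot\frac{2\pi}{k}\right) < y,$$
we can shrink $L \times \Theta \times B$ so that for every $(l,\theta,b) \in L \times \Theta \times B$,
$$-l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) < b_i,\qquad -l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) < b_{i+j}$$
for all $i, j$ with $i, i+j \in \{1, \ldots, k\}$, and
$$-l_i^2 < b_i$$
for all $i = 1, \ldots, k$.

Since any $M_{ij}$ containing only some of $k, k+1, \ldots, c$ has either $i \in \{k+1, \ldots, c\}$ or $i+j-1 \in \{k, \ldots, c-1\}$, we know that for $M_{ij}$ containing only some of $k, k+1, \ldots, c$, we have $(l,\theta,b) \notin T^{(1)}_{M_{ij}}$ and $(l,\theta,b) \notin T^{(2)}_{M_{ij}}$ for all $(l,\theta,b) \in L \times \Theta \times B$. We have shown that if $M_{ij}$ contains some of $k, k+1, \ldots, c$, then it must contain $(k, k+1, \ldots, c)$.

Moreover, for $k$ not a multiple of $4$, since

• $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$, $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-k+i} > \pi/2$, and $\theta_{i-1} + \theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$,

we know that if $M_{ij}$ contains $(k, k+1, \ldots, c)$, then it has length at most $s + c - k$. For $k$ a multiple of $4$, since

• $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$, $\theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-k+i}$ is close to $\pi/2$, and $\theta_{i-1} + \theta_i + \cdots + \theta_{k-1} + \theta_k + \cdots + \theta_c + \theta_1 + \cdots + \theta_{s-1-k+i} < \pi/2$,

by the same argument we use in the proof of Lemma H.1, we know that if $M_{ij}$ contains $(k, k+1, \ldots, c)$, then it has length at most $s + c - k$.

Now for $M_{ij}$ not containing any of $k, k+1, \ldots, c$, if $k$ is not a multiple of $4$, it follows from

• $\theta_i + \cdots + \theta_{i+s-1} < \pi/2$ and $\theta_i + \cdots + \theta_{i+s} > \pi/2$ for $\{i, i+1, \ldots, i+s\} \subset \{1, \ldots, k-1\}$

that $M_{ij}$ has length at most $s$. If $k$ is a multiple of $4$, it follows from

• $\theta_i + \cdots + \theta_{i+s-1} < \pi/2$ and $\theta_i + \cdots + \theta_{i+s}$ is close to $\pi/2$ for $\{i, i+1, \ldots, i+s\} \subset \{1, \ldots, k-1\}$

and the same argument in Lemma H.1 that $M_{ij}$ has length at most $s$.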
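Therefore in the open neighbourhood $L \times \Theta \times B$ of $(l^*,\theta^*,b^*)$, only the following types of wedges show up in $H$:

1. $M_{ij}$ does not contain any of $k, k+1, \ldots, c$ and has length at most $s$;

2. $M_{ij}$ contains $(k, k+1, \ldots, c)$ and has length at most $s + c - k$.

For $a = k+1, \ldots, c$, since $(L \times \Theta \times B) \cap T_a = \emptyset$, and $(L \times \Theta \times B) \cap T^{(1)}_{M_{ij}} = (L \times \Theta \times B) \cap T^{(2)}_{M_{ij}} = \emptyset$ for $M_{ij}$ with either $i \in \{k+1, \ldots, c\}$ or $i+j-1 \in \{k, \ldots, c-1\}$, we know that
$$\left.\frac{\partial}{\partial b_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 0 \quad\text{and}\quad \left.\frac{\partial}{\partial l_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 0.$$
For $a = 1, \ldots, k$, since
$$l_1 = \cdots = l_k = x;\quad \theta_1 = \cdots = \theta_k = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y$$
is a critical point of the TMS potential for $c = k$, we know that
$$\left.\frac{\partial}{\partial b_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 0 \quad\text{and}\quad \left.\frac{\partial}{\partial l_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 0.$$
For $a = 1, \ldots, c$, we have
$$\left.\frac{\partial}{\partial\theta_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} H(l,\theta,b) = 2\sum_{j=1}^{s} j\cdot\left(\frac{-3\left(x^2\cos(j\alpha) + y\right)^2\sin(j\alpha)}{\cos(j\alpha)} + \frac{\sin(j\alpha)\left(x^2\cos(j\alpha) + y\right)^3}{x^2\cos^2(j\alpha)}\right),$$
which is independent of $a$. Therefore, by the Lagrange multiplier method, the $k$-gon $(l^*,\theta^*,b^*)$ is a critical point of $H(l,\theta,b)$ with constraint $\theta_1 + \cdots + \theta_c = 2\pi$.

In the previous example, we see that for $c = 6$, there are three different chambers whose boundary contains the $5$-gon. As different chambers give different explicit forms of $H$, one might think $H$ is not differentiable at the $5$-gon. However, in our proof, we show that the $5$-gon has an open neighbourhood in which $H$ has the explicit form
$$\begin{aligned}
&\sum_{M_{ij}\in M} \frac{\left(l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) + b_{i+j}\right)^3}{l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right)} + \sum_{M_{ij}\in M} \frac{\left(l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) + b_{i}\right)^3}{l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right)} \\
&\quad + \sum_{i=1}^{5} \left[(1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2}\right] + 1,
\end{aligned}$$
where $\theta_1 + \cdots + \theta_6 = 2\pi$ and $M$ consists of the following wedges:
$$(1)\quad (2)\quad (3)\quad (4)\quad (5,6).$$
So $H$ is actually analytic at the $5$-gon.

Corollary H.1. Let $k \in \mathbb{Z}_{>4}$ and $s$ be the unique integer in the interval $\left[\frac{k}{4} - 1, \frac{k}{4}\right)$. Let $H^{(k)}(l,\theta,b)$ denote the TMS potential for $c = k$.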
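Suppose that a $k$-gon with coordinates
$$l_1 = \cdots = l_k = x;\quad \theta_1 = \cdots = \theta_k = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y$$
for some $x > 0$ and $y < 0$ satisfying
$$-x^2\cos\left(s\cdot\frac{2\pi}{k}\right) \le y$$
is a critical point of $H^{(k)}(l,\theta,b)$. Let $\Pi_{c,k}$ be the projection defined by
$$\Pi_{c,k}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_k, \theta_1, \ldots, \theta_{k-1}, b_1, \ldots, b_k).$$
Then for $c > k$, the $k$-gon with coordinates
$$l_1 = \cdots = l_k = x,\ l_{k+1} = \cdots = l_c = 0;\quad \theta_1 = \cdots = \theta_{k-1} = \frac{2\pi}{k},\ \theta_k + \cdots + \theta_c = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y,\ b_{k+1}, \ldots, b_c < 0$$
has an open neighbourhood in which
$$H(l,\theta,b) = H^{(k)} \circ \Pi_{c,k}(l,\theta,b) + (c-k).$$

Proof. Let $\tau = (k, k+1, \ldots, c)$. From the proof of the theorem, we know that for $c > k$, the $k$-gon
$$l_1 = \cdots = l_k = x,\ l_{k+1} = \cdots = l_c = 0;\quad \theta_1 = \cdots = \theta_{k-1} = \frac{2\pi}{k},\ \theta_k + \cdots + \theta_c = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y,\ b_{k+1}, \ldots, b_c < 0$$
has an open neighbourhood in which
$$\begin{aligned}
H(l,\theta,b) ={}& \sum_{M_{ij}\in M} \frac{\left(l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) + b_{i+j}\right)^3}{l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right)} + \sum_{M_{ij}\in M} \frac{\left(l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right) + b_{i}\right)^3}{l_i l_{i+j}\cos\left(\sum_{k\in M_{ij}}\theta_k\right)} \\
&+ \sum_{i=1}^{k} \left[(1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2}\right] + (c-k),
\end{aligned}$$
where $\theta_1 + \cdots + \theta_c = 2\pi$ and $M$ is
$$\begin{matrix}
(1) & (1,2) & \cdots & (1,2,\ldots,s) \\
(2) & (2,3) & \cdots & (2,3,\ldots,s+1) \\
\vdots & \vdots & & \vdots \\
(k-s) & (k-s,\,k-s+1) & \cdots & (k-s,\ldots,k-1) \\
(k-s+1) & (k-s+1,\,k-s+2) & \cdots & (k-s+1,\ldots,k-1,\tau) \\
(k-s+2) & (k-s+1,\,k-s+2) & \cdots & (k-s+1,\ldots,k-1,\tau,1) \\
\vdots & \vdots & & \vdots \\
(\tau) & (\tau,1) & \cdots & (\tau,1,\ldots,s-1)
\end{matrix}$$
Let $H^{(k)}(l_1, \ldots, l_k, \theta_1, \ldots, \theta_{k-1}, b_1, \ldots, b_k)$ denote the TMS potential for $c = k$. Consider the projection
$$\Pi_{c,k}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_k, \theta_1, \ldots, \theta_{k-1}, b_1, \ldots, b_k).$$
Since
$$\sum_{i\in\tau}\theta_i = 2\pi - (\theta_1 + \cdots + \theta_{k-1}),$$
we have
$$H(l,\theta,b) = H^{(k)} \circ \Pi_{c,k} + (c-k).$$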
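Returning to the $c = 6$, $k = 5$ example at the start of this section, the three wedge lists can be recovered by direct enumeration. The sketch below assumes, following our reading of Appendix G (which lies outside this section), that a wedge is a cyclic run of consecutive indices whose angles sum to less than $\pi/2$; the function names `wedges` and `family` are hypothetical.

```python
import math


def wedges(theta, tol=1e-12):
    """Cyclic runs of consecutive 1-based indices whose angles sum to < pi/2."""
    c = len(theta)
    found = []
    for start in range(c):
        total = 0.0
        run = []
        for step in range(c - 1):
            idx = (start + step) % c
            total += theta[idx]
            if total >= math.pi / 2 - tol:
                break  # longer runs from this start only have larger sums
            run.append(idx + 1)
            found.append(tuple(run))
    return found


def family(alpha):
    """Angles of the 6-gon family: theta_1..theta_4 = 2*pi/5, theta_5 = alpha."""
    a = 2 * math.pi / 5
    return [a, a, a, a, alpha, a - alpha]  # theta_6 is fixed by the 2*pi constraint


# One representative alpha from each of the three intervals in the example.
for alpha in (math.pi / 20, math.pi / 5, 7 * math.pi / 20):
    print(f"alpha = {alpha:.4f}:", wedges(family(alpha)))
```

Only the wedges $(4,5)$ and $(6,1)$ change between the three intervals, and both involve the vanishing vector $W_6$, which is consistent with the single explicit form of $H$ found near the $5$-gon.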
Since the $c$-gons are non-degenerate (modulo $O(c)$-action) critical points for $c = 5, 6, 7, 8$, the corollary implies that for $c > k$ and $k \in \{5, 6, 7, 8\}$, the $k$-gon is a degenerate critical point but it is minimally singular in the sense of the potential $H$ being locally a sum of squares with all squares having positive coefficients around the $k$-gon. We can compute the local learning coefficient of each $k$-gon for $k = 5, 6, 7, 8$:

| Critical point | Local learning coefficient | $L$ |
|---|---|---|
| 5 | 7 | $(0.23738 + c - 5)/(3c)$ |
| 6 | 8.5 | $(0.86746 + c - 6)/(3c)$ |
| 7 | 10 | $(1.74870 + c - 7)/(3c)$ |
| 8 | 11.5 | $(2.77311 + c - 8)/(3c)$ |

Table H.2: Local learning coefficients and losses for $k$-gons

I Including positive biases

In this section, we discuss the $k^{\sigma+}$-gon defined in Section 3.2. We show that for $c = 6$, the $5^{+}$-gon is a critical point and its local learning coefficient is $8.5$. We also show that in general $k^{\sigma+}$-gons are critical points.

Let $W_C$ be a nonempty chamber. Suppose that $W_C$ contains an open subset of $M_{2,c}(\mathbb{R})$. Let $M$ be the set of wedges describing $W_C$. Recall that the local TMS potential is given by (17) in Appendix G.

Let's first discuss $c = 6$. Consider the $5$-gon $(l^*,\theta^*,b^*)$ with coordinates
$$\begin{aligned}
l^* &: l_1 = \cdots = l_5 = x \approx 1.17046,\ l_6 = 0; \\
\theta^* &: \theta_1 = \cdots = \theta_4 = \frac{2\pi}{5},\ \theta_5 + \theta_6 = \frac{2\pi}{5}; \\
b^* &: b_1 = \cdots = b_5 = y \approx -0.28230,\ b_6 = z \in \mathbb{R}_{>0}.
\end{aligned}$$
In Appendix H.3, we see that depending on the value of $\theta_5$, there are three different chambers whose boundary contains the $5$-gon, but for the negative bias case ($b_i < 0$ for all $i = 1, \ldots, c$), there is an open neighbourhood in which the local TMS potential is smooth (Corollary H.1). Now we claim that this holds for the general case, i.e. there is an open neighbourhood of the $5$-gon in which the local TMS potential is smooth.
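Since $l_6 = 0$ and $b_i < 0$ for all $i = 1, \ldots, 5$, we know that
$$-l_6 l_1\cos(\theta_6) = 0 > b_1,\quad -l_6 l_2\cos(\theta_6 + \theta_1) = 0 > b_2,\quad -l_5 l_6\cos(\theta_5) = 0 > b_5,\quad -l_4 l_6\cos(\theta_4 + \theta_5) = 0 > b_4.$$
So there is an open neighbourhood $U$ of the $5$-gon in which these inequalities hold. Thus, in $U$, the wedges appearing in the local TMS potential are
$$(1)\quad (2)\quad (3)\quad (4)\quad (5,6).$$
Moreover, since $b_6 > 0$, we have
$$-l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1}) = 0 < b_6$$
for all $i = 1, \ldots, 5$. We can shrink $U$ so that these inequalities hold in $U$. Thus, in $U$, $\delta(S_{6j}) = 0$ for all $j = 1, \ldots, 5$. Therefore, the local TMS potential is
$$\begin{aligned}
&\sum_{i=1}^{4} \frac{\left(l_i l_{i+1}\cos(\theta_i) + b_i\right)^3}{l_i l_{i+1}\cos(\theta_i)} + \frac{\left(l_5 l_1\cos(\theta_5+\theta_6) + b_5\right)^3}{l_5 l_1\cos(\theta_5+\theta_6)} \\
&+ \sum_{i=1}^{4} \frac{\left(l_i l_{i+1}\cos(\theta_i) + b_{i+1}\right)^3}{l_i l_{i+1}\cos(\theta_i)} + \frac{\left(l_5 l_1\cos(\theta_5+\theta_6) + b_1\right)^3}{l_5 l_1\cos(\theta_5+\theta_6)} \\
&+ \sum_{i=1}^{5} \left[(1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2}\right] \\
&+ \sum_{i=1}^{5} \left[\left(l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\right)^2 + 3\, l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\, b_6 + 3b_6^2\right] \\
&+ (1-l_6^2)^2 - 3(1-l_6^2)b_6 + 3b_6^2.
\end{aligned}$$
Let
$$\begin{aligned}
\Phi(l,\theta,b) ={}& \sum_{i=1}^{4} \frac{\left(l_i l_{i+1}\cos(\theta_i) + b_i\right)^3}{l_i l_{i+1}\cos(\theta_i)} + \frac{\left(l_5 l_1\cos(\theta_5+\theta_6) + b_5\right)^3}{l_5 l_1\cos(\theta_5+\theta_6)} \\
&+ \sum_{i=1}^{4} \frac{\left(l_i l_{i+1}\cos(\theta_i) + b_{i+1}\right)^3}{l_i l_{i+1}\cos(\theta_i)} + \frac{\left(l_5 l_1\cos(\theta_5+\theta_6) + b_1\right)^3}{l_5 l_1\cos(\theta_5+\theta_6)} \\
&+ \sum_{i=1}^{5} \left[(1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2}\right]
\end{aligned}$$
and
$$\Psi(l,\theta,b) = \sum_{i=1}^{5} \left[\left(l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\right)^2 + 3\, l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\, b_6 + 3b_6^2\right] + (1-l_6^2)^2 - 3(1-l_6^2)b_6 + 3b_6^2.$$
Then the local TMS potential is
$$H(l,\theta,b) = \Phi(l,\theta,b) + \Psi(l,\theta,b).$$
Note that Corollary H.1 implies that $\Phi(l,\theta,b) = H^{(5)} \circ \Pi_{6,5}(l,\theta,b)$, where $H^{(5)}$ is the TMS potential for $c = 5$ and $\Pi_{6,5}$ is the projection
$$\Pi_{6,5}: (l_1, \ldots, l_6, \theta_1, \ldots, \theta_5, b_1, \ldots, b_6) \mapsto (l_1, \ldots, l_5, \theta_1, \ldots, \theta_4, b_1, \ldots, b_5).$$
Thus, for all $a = 1, \ldots, 6$,
$$\left.\frac{\partial}{\partial l_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Phi(l,\theta,b) = 0,\qquad \left.\frac{\partial}{\partial b_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Phi(l,\theta,b) = 0,$$
and
$$\left.\frac{\partial}{\partial\theta_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Phi(l,\theta,b)$$
is independent of $a$.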
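For $a = 1, \ldots, 5$,
$$\frac{\partial}{\partial l_a}\Psi(l,\theta,b) = 2\, l_6 l_a\cos(\theta_6 + \cdots + \theta_{a-1})\, l_6\cos(\theta_6 + \cdots + \theta_{a-1}) + 3\, l_6\cos(\theta_6 + \cdots + \theta_{a-1})\, b_6.$$
Since $l_6 = 0$ for the $5$-gon, for all $a = 1, \ldots, 5$,
$$\left.\frac{\partial}{\partial l_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Psi(l,\theta,b) = 0.$$
Compute
$$\frac{\partial}{\partial l_6}\Psi(l,\theta,b) = \left(\sum_{i=1}^{5} 2\, l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\, l_i\cos(\theta_6 + \cdots + \theta_{i-1}) + 3\, l_i\cos(\theta_6 + \cdots + \theta_{i-1})\, b_6\right) - 4(1-l_6^2)l_6 + 6 l_6 b_6.$$
Then
$$\begin{aligned}
\left.\frac{\partial}{\partial l_6}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Psi(l,\theta,b) &= 3xz\sum_{i=1}^{5}\cos\left(\theta_6 + (i-1)\cdot\frac{2\pi}{5}\right) \\
&= 3xz\sum_{i=1}^{5}\left[\cos(\theta_6)\cos\left((i-1)\cdot\frac{2\pi}{5}\right) - \sin(\theta_6)\sin\left((i-1)\cdot\frac{2\pi}{5}\right)\right] \\
&= 3xz\left[\cos(\theta_6)\sum_{i=1}^{5}\cos\left((i-1)\cdot\frac{2\pi}{5}\right) - \sin(\theta_6)\sum_{i=1}^{5}\sin\left((i-1)\cdot\frac{2\pi}{5}\right)\right] \\
&= 3xz\left[\cos(\theta_6)\cdot 0 - \sin(\theta_6)\cdot 0\right] = 0.
\end{aligned}$$
For $a = 1, \ldots, 5$,
$$\frac{\partial}{\partial b_a}\Psi(l,\theta,b) = 0.$$
Compute
$$\frac{\partial}{\partial b_6}\Psi(l,\theta,b) = \sum_{i=1}^{5}\left[3\, l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1}) + 6b_6\right] - 3(1-l_6^2) + 6b_6.$$
Then
$$\left.\frac{\partial}{\partial b_6}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Psi(l,\theta,b) = \sum_{i=1}^{5} 6z - 3 + 6z = 36z - 3.$$
So
$$\left.\frac{\partial}{\partial b_6}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Psi(l,\theta,b) = 0$$
if and only if $z = \frac{1}{12}$. Note that
$$\frac{\partial}{\partial\theta_5}\Psi(l,\theta,b) = 0$$
as $\theta_5$ is not in $\Psi(l,\theta,b)$. For $a = 1, \ldots, 4$,
$$\frac{\partial}{\partial\theta_a}\Psi(l,\theta,b) = \sum_{i=a+1}^{5}\left[-2\, l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\, l_6 l_i\sin(\theta_6 + \cdots + \theta_{i-1}) - 3\, l_6 l_i\sin(\theta_6 + \cdots + \theta_{i-1})\, b_6\right].$$
Since for the $5$-gon, $l_6 = 0$, then for $a = 1, \ldots, 4$,
$$\left.\frac{\partial}{\partial\theta_a}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Psi(l,\theta,b) = 0.$$
Compute
$$\frac{\partial}{\partial\theta_6}\Psi(l,\theta,b) = \sum_{i=1}^{5}\left[-2\, l_6 l_i\cos(\theta_6 + \cdots + \theta_{i-1})\, l_6 l_i\sin(\theta_6 + \cdots + \theta_{i-1}) - 3\, l_6 l_i\sin(\theta_6 + \cdots + \theta_{i-1})\, b_6\right].$$
Since for the $5$-gon, $l_6 = 0$, then
$$\left.\frac{\partial}{\partial\theta_6}\right|_{l=l^*,\theta=\theta^*,b=b^*} \Psi(l,\theta,b) = 0.$$
Therefore, the $5$-gon with coordinates
$$\begin{aligned}
l^* &: l_1 = \cdots = l_5 = x \approx 1.17046,\ l_6 = 0; \\
\theta^* &: \theta_1 = \cdots = \theta_4 = \frac{2\pi}{5},\ \theta_5 + \theta_6 = \frac{2\pi}{5}; \\
b^* &: b_1 = \cdots = b_5 = y \approx -0.28230,\ b_6 = \frac{1}{12}
\end{aligned}$$
is a critical point of $H(l,\theta,b)$. Note that this is the $5^{+}$-gon defined in Section 3.2. We claim that the TMS potential is minimally singular in some neighbourhood of the $5^{+}$-gon in the original parameter space $M_{r,c}(\mathbb{R}) \times \mathbb{R}^c$ and compute its local learning coefficient.

Lemma I.1. For $c = 6$, the $5^{+}$-gon with coordinates $(l^*,\theta^*,b^*)$ has an open neighbourhood in which the TMS potential is minimally singular.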
Moreover, its local learning coefficient is $8.5$.

Proof. We compute the Hessian of the TMS potential at the $5^{+}$-gon in the original parameter space $M_{r,c}(\mathbb{R}) \times \mathbb{R}^c$. All eigenvalues of the Hessian are positive except one zero eigenvalue caused by the $O(2)$ symmetry. Thus, the $5^{+}$-gon is minimally singular and has local learning coefficient $17/2 = 8.5$.

Now we discuss the $k^{\sigma+}$-gons (in Section 3.2) for $k < c$. Let $B^{+} \subset \{k+1, \ldots, c\}$. Let $\sigma = |B^{+}|$. Consider the $k$-gon $(l^*,\theta^*,b^*)$ with coordinates
$$\begin{aligned}
l^* &: l_1 = \cdots = l_k = x > 0,\ l_{k+1} = \cdots = l_c = 0; \\
\theta^* &: \theta_1 = \cdots = \theta_{k-1} = \frac{2\pi}{k},\ \theta_k + \cdots + \theta_c = \frac{2\pi}{k}; \\
b^* &: b_1 = \cdots = b_k = y < 0, \text{ and for } i = k+1, \ldots, c,\ b_i < 0 \text{ if } i \notin B^{+} \text{ and } b_i = z > 0 \text{ if } i \in B^{+}.
\end{aligned}$$

Theorem I.1. There is an open neighbourhood of $(l^*,\theta^*,b^*)$ in which the TMS potential is
$$H(l,\theta,b) = H^{(k)} \circ \Pi_{c,k}(l,\theta,b) + c - (k+\sigma) + \sum_{i\in B^{+}} H^{+}_i(l,\theta,b),$$
where

1. $H^{(k)}$ is the TMS potential for $c = k$;

2. $\Pi_{c,k}$ is the projection
$$\Pi_{c,k}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_k, \theta_1, \ldots, \theta_{k-1}, b_1, \ldots, b_k);$$

3. for each $i \in B^{+}$,
$$H^{+}_i(l,\theta,b) = \sum_{j\ne i}\left[\left(l_i l_j\cos(\theta_{ij})\right)^2 + 3\, l_i l_j\cos(\theta_{ij})\, b_i + 3b_i^2\right] + (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2,$$
where $\theta_{ij}$ denotes the angle between $W_i$ and $W_j$.

Proof. This follows from applying the same argument as in the $c = 6$ case.

Theorem I.2. Let $k \in \mathbb{Z}_{>4}$ and $s$ be the unique integer in the interval $\left[\frac{k}{4} - 1, \frac{k}{4}\right)$. If a $k$-gon with coordinates
$$l_1 = \cdots = l_k = x;\quad \theta_1 = \cdots = \theta_k = \frac{2\pi}{k};\quad b_1 = \cdots = b_k = y$$
for some $x > 0$ and $y < 0$ satisfying
$$-x^2\cos\left(s\cdot\frac{2\pi}{k}\right) \le y$$
is a critical point of $H(l,\theta,b)$ with constraint $\theta_1 + \cdots + \theta_k = 2\pi$ for $c = k$, then for any integer $0 \le \sigma \le c-k$, the $k^{\sigma+}$-gons defined in Section 3.2 are critical points of $H(l,\theta,b)$ with constraint $\theta_1 + \cdots + \theta_c = 2\pi$ for any $c > k$.

Proof. This follows from applying the same argument as in the $c = 6$ case.

Remark I.1. In Lemma I.1, we show that the $5^{+}$-gon for $c = 6$ is minimally singular and compute the local learning coefficient. In general, we do not know whether the $k^{\sigma+}$-gons are minimally singular or not for $c > k$.
However, given a $k^{\sigma+}$-gon, the method to check whether it is minimally singular or not is the same as the method used in Lemma I.1. We compute the Hessian of the TMS potential at each $k^{\sigma+}$-gon in the original parameter space $M_{r,c}(\mathbb{R}) \times \mathbb{R}^c$. Then we check that the Hessian of the TMS potential is non-degenerate in the direction normal to the tangent space of $k^{\sigma+}$-gons. If this is the case, then we conclude that the $k^{\sigma+}$-gon is minimally singular by the Morse-Bott lemma.

J $4$-gons

In this section we discuss $4$-gons. The subtlety here is that the TMS potential is not analytic at $4$-gons. In particular, the directional Hessian of the TMS potential at $4$-gons depends on the direction.

J.1 $c = 4$

J.1.1 Standard $4$-gon

For $c = 4$, consider the (standard) $4$-gon $(l^*,\theta^*,b^*)$, where
$$l^* = (1,1,1,1),\qquad \theta^* = \left(\frac{\pi}{2}, \frac{\pi}{2}, \frac{\pi}{2}, \frac{\pi}{2}\right),\qquad b^* = (0,0,0,0).$$
Since $H(l,\theta,b) \ge 0$ and $H(l^*,\theta^*,b^*) = 0$, we know the $4$-gon is a global minimum.

Let's work out the explicit form of $H$ around the $4$-gon $(l^*,\theta^*,b^*)$. Since $\cos(\pi/2) = 0$, the $4$-gon is on the boundary of some chambers. Let $M = \{M_{ij}\}$ be a chamber whose boundary contains the $4$-gon $(l^*,\theta^*,b^*)$. We claim that each wedge $M_{ij}$ is either empty or contains one number. Assume, for contradiction, that there is an $M_{ij}$ in $M$ containing more than one number. Then we show that the $4$-gon $(l^*,\theta^*,b^*)$ is not on the boundary of the chamber described by $M$. Let $\varepsilon_\theta \in \mathbb{R}_{>0}$ be such that
$$\pi - 2\varepsilon_\theta > \frac{\pi}{2}.$$
Then for any $\theta, \theta' \in \left(\frac{\pi}{2} - \varepsilon_\theta, \frac{\pi}{2} + \varepsilon_\theta\right)$,
$$\theta + \theta' > \frac{\pi}{2} - \varepsilon_\theta + \frac{\pi}{2} - \varepsilon_\theta = \pi - 2\varepsilon_\theta > \frac{\pi}{2}.$$
Let $\varepsilon_l \in \mathbb{R}_{>0}$ be such that $1 - \varepsilon_l > 0$. So $(l^*,\theta^*,b^*)$ has an open neighbourhood given by
$$(1-\varepsilon_l,\, 1+\varepsilon_l)^4 \times \left(\frac{\pi}{2} - \varepsilon_\theta,\, \frac{\pi}{2} + \varepsilon_\theta\right)^3 \times \mathbb{R}^4$$
which does not intersect the interior of the chamber described by $M$. Thus, $(l^*,\theta^*,b^*)$ is not on the boundary of the chamber described by $M$ when some $M_{ij}$ in $M$ contains more than one element. So each wedge $M_{ij}$ in $M$ contains at most one element.
Because of the permutation symmetry, there are four possible $M$:
$$M_1 = (1),\qquad M_2 = (1)\ (2),\qquad M_3 = (1)\ (3),\qquad M_4 = (1)\ (2)\ (3).$$
Since $-1^2 < 0$, $(l^*,\theta^*,b^*)$ has an open neighbourhood in which
$$-l_i^2 < b_i$$
for $i = 1, 2, 3, 4$. Since $\cos(\pi) = -1 < 0$, the $4$-gon $(l^*,\theta^*,b^*)$ has an open neighbourhood in which for all $i = 1, 2, 3, 4$, if $b_i > 0$, then
$$l_i l_{i+2}\cos(\theta_i + \theta_{i+1}) < -b_i.$$
Recall the formula (17) for the local TMS potential in Appendix G. The $4$-gon $(l^*,\theta^*,b^*)$ has an open neighbourhood in which the local TMS potential is
$$H(l,\theta,b) = \sum_{i=1}^{4}\left[\delta(b_i \le 0)\, H^{-}_i(l,\theta,b) + \delta(b_i > 0)\, H^{+}_i(l,\theta,b)\right], \tag{18}$$
where
$$\begin{aligned}
H^{-}_i(l,\theta,b) ={}& \delta\left(T^{(1)}_{M_{(i-1)1}}\right)\frac{\left(l_{i-1} l_i\cos(\theta_{i-1}) + b_i\right)^3}{l_{i-1} l_i\cos(\theta_{i-1})} + \delta\left(T^{(2)}_{M_{i1}}\right)\frac{\left(l_i l_{i+1}\cos(\theta_i) + b_i\right)^3}{l_i l_{i+1}\cos(\theta_i)} \\
&+ (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2}
\end{aligned}$$
and
$$\begin{aligned}
H^{+}_i(l,\theta,b) ={}& \delta\left(S_{i(i+1)}\right)\left(\frac{-b_i^3}{l_i l_{i+1}\cos(\theta_i)}\right) + \left(1 - \delta\left(S_{i(i+1)}\right)\right)\left[\left(l_i l_{i+1}\cos(\theta_i)\right)^2 + 3\, l_i l_{i+1}\cos(\theta_i)\, b_i + 3b_i^2\right] \\
&+ \frac{-b_i^3}{l_i l_{i+2}\cos(\theta_i + \theta_{i+1})} \\
&+ \delta\left(S_{i(i+3)}\right)\left(\frac{-b_i^3}{l_i l_{i+3}\cos(\theta_{i+3})}\right) + \left(1 - \delta\left(S_{i(i+3)}\right)\right)\left[\left(l_i l_{i+3}\cos(\theta_{i+3})\right)^2 + 3\, l_i l_{i+3}\cos(\theta_{i+3})\, b_i + 3b_i^2\right] \\
&+ (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2.
\end{aligned}$$
We checked that each term in $H^{-}_i(l,\theta,b)$ and $H^{+}_i(l,\theta,b)$ has gradient zero when approaching $(l^*,\theta^*,b^*)$ in the region specified by the indicator function associated with it. Thus, the TMS potential is differentiable at $(l^*,\theta^*,b^*)$, and $(l^*,\theta^*,b^*)$ is a critical point. However, the TMS potential is not twice continuously differentiable, i.e. there are different directions in which the directional Hessians are different. We checked that $(l^*,\theta^*,b^*)$ is minimally singular in each subspace with nonempty interior containing $(l^*,\theta^*,b^*)$ in the boundary. So we obtain a list $4, 4.5, 5, 5.5$ of local learning coefficients when approached from these different subspaces.

J.1.2 $4^{\phi-}$-gon

Let $B^{-} \subset \{1, 2, 3, 4\}$ and $\phi = |B^{-}|$.
Consider the $4^{\phi-}$-gon (Section 3.2) with coordinates
$$\begin{aligned}
\theta^* &: \theta_1 = \theta_2 = \theta_3 = \theta_4 = \frac{\pi}{2}; \\
b^* &: b_i < 0 \text{ if } i \in B^{-} \text{ and } b_i = 0 \text{ if } i \notin B^{-}; \\
l^* &: 0 < l_i^2 < -b_i \text{ if } i \in B^{-} \text{ and } l_i = 1 \text{ if } i \notin B^{-}.
\end{aligned}$$
Since $\cos(\pi/2) = 0$, the $4^{\phi-}$-gon is on the boundary of some chambers. Let $M = \{M_{ij}\}$ be a chamber whose boundary contains the $4^{\phi-}$-gon. Using the same arguments as in Section J.1.1, we know that each $M_{ij}$ is either empty or contains one number. Because of the $O(2)$-symmetry, there are four possible $M$:
$$M_1 = (1),\qquad M_2 = (1)\ (2),\qquad M_3 = (1)\ (3),\qquad M_4 = (1)\ (2)\ (3).$$
For $i \notin B^{-}$, we have $l_i^2 = 1 > 0 = b_i$. For $i \in B^{-}$, we have $-l_i^2 > b_i$, $-l_i l_{i+1}\cos(\theta_i) = 0 > b_i$, and $-l_{i-1} l_i\cos(\theta_{i-1}) = 0 > b_i$. Since $\cos(\pi) = -1 < 0$, the $4^{\phi-}$-gon has an open neighbourhood in which for all $i = 1, 2, 3, 4$, if $b_i > 0$, then
$$l_i l_{i+2}\cos(\theta_i + \theta_{i+1}) < -b_i.$$
So the $4^{\phi-}$-gon has an open neighbourhood in which the local TMS potential is
$$H(l,\theta,b) = \sum_{i=1}^{4}\left[\delta(b_i \le 0)\, H^{-}_i(l,\theta,b) + \delta(b_i > 0)\, H^{+}_i(l,\theta,b)\right], \tag{19}$$
where

1. if $i \notin B^{-}$, then
$$\begin{aligned}
H^{-}_i(l,\theta,b) ={}& \delta\left(T^{(1)}_{M_{(i-1)1}}\right)\frac{\left(l_{i-1} l_i\cos(\theta_{i-1}) + b_i\right)^3}{l_{i-1} l_i\cos(\theta_{i-1})} + \delta\left(T^{(2)}_{M_{i1}}\right)\frac{\left(l_i l_{i+1}\cos(\theta_i) + b_i\right)^3}{l_i l_{i+1}\cos(\theta_i)} \\
&+ (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2 + \frac{b_i^3}{l_i^4} + \frac{b_i^3}{l_i^2}
\end{aligned}$$
and
$$\begin{aligned}
H^{+}_i(l,\theta,b) ={}& \delta\left(S_{i(i+1)}\right)\left(\frac{-b_i^3}{l_i l_{i+1}\cos(\theta_i)}\right) + \left(1 - \delta\left(S_{i(i+1)}\right)\right)\left[\left(l_i l_{i+1}\cos(\theta_i)\right)^2 + 3\, l_i l_{i+1}\cos(\theta_i)\, b_i + 3b_i^2\right] \\
&+ \frac{-b_i^3}{l_i l_{i+2}\cos(\theta_i + \theta_{i+1})} \\
&+ \delta\left(S_{i(i+3)}\right)\left(\frac{-b_i^3}{l_i l_{i+3}\cos(\theta_{i+3})}\right) + \left(1 - \delta\left(S_{i(i+3)}\right)\right)\left[\left(l_i l_{i+3}\cos(\theta_{i+3})\right)^2 + 3\, l_i l_{i+3}\cos(\theta_{i+3})\, b_i + 3b_i^2\right] \\
&+ (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2;
\end{aligned}$$

2. if $i \in B^{-}$, then
$$H^{-}_i(l,\theta,b) = 1 \quad\text{and}\quad H^{+}_i(l,\theta,b) = 0.$$

If $\phi = 4$, then the $4^{\phi-}$-gon has an open neighbourhood in which the TMS potential is the zero function, hence it is a critical point with local learning coefficient $0$.
For $0 \le \phi \le 3$, we checked that each term in $H^{-}_i(l,\theta,b)$ and $H^{+}_i(l,\theta,b)$ has gradient zero when approaching the $4^{\phi-}$-gon in the region specified by the indicator function associated with it. Thus, the TMS potential is differentiable at the $4^{\phi-}$-gon, and the $4^{\phi-}$-gon is a critical point. However, the TMS potential is not twice continuously differentiable, i.e. there are different directions in which the directional Hessians are different. We checked that the $4^{\phi-}$-gon is minimally singular in each subspace with nonempty interior containing the $4^{\phi-}$-gon in the boundary. So we obtain the lists

$\phi = 1$: $3, 3.5, 4, 4.5$;

$\phi = 2$: $2, 2.5, 3, 3.5$ for $B^{-} = \{i, i+1\}$, where $i = 1, 2, 3, 4$; and $2.5, 3, 3.5, 4$ for $B^{-} = \{i, i+2\}$, where $i = 1, 2, 3, 4$;

$\phi = 3$: $1, 1.5, 2, 2.5$;

$\phi = 4$: $0$

of local learning coefficients for each $\phi$ when approached from these different subspaces.

J.2 $c > 4$

In this section we analyse four typical $4$-gons appearing as critical points of the TMS potential when $c > 4$. In particular, we state their coordinates (hence computing their loss), and show they are actually critical points of the TMS potential.

Because of the permutation symmetry, we may assume that $4$-gons have $\theta$-coordinate
$$\theta_1 = \theta_2 = \theta_3 = \frac{\pi}{2},\qquad \theta_4 + \cdots + \theta_c = \frac{\pi}{2}.$$
So for given $l_1, \ldots, l_c$ and biases $b_1, \ldots, b_c$, there is a set of $4$-gons given by $(\theta_4, \ldots, \theta_c)$ with $\theta_4 + \cdots + \theta_c = \pi/2$. As discussed in Appendix J.1, $4$-gons are on the boundary of some chambers. Let $M = \{M_{ij}\}$ be wedges describing a chamber containing $4$-gons in its boundary.

Consider the standard $4$-gons with coordinates
$$\begin{aligned}
& l_1 = l_2 = l_3 = l_4 = 1,\ l_5 = \cdots = l_c = 0; \\
& \theta_1 = \theta_2 = \theta_3 = \frac{\pi}{2},\ \theta_4 + \cdots + \theta_c = \frac{\pi}{2}; \\
& b_1 = b_2 = b_3 = b_4 = 0,\ b_5, \ldots, b_c < 0.
\end{aligned}$$

Theorem J.1. For a fixed $M$, let $H^{(4)}$ denote the TMS potential in some neighbourhood of the standard $4$-gon for $c = 4$ (Appendix J.1.1). Consider the projection
$$\Pi_{c,4}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_4, \theta_1, \ldots, \theta_3, b_1, \ldots, b_4).$$
The TMS potential in the chamber described by $M$ is
$$H(l,\theta,b) = H^{(4)} \circ \Pi_{c,4}(l,\theta,b) + \sum_{i=5}^{c} G_i(l,\theta,b) + (c-4),$$
where
$$\begin{aligned}
G_i(l,\theta,b) ={}& \delta\left(T^{(1)}_{M_{i(c-i+1)}}\right)\frac{\left(l_i l_1\cos\left(\sum_{k\in M_{i(c-i+1)}}\theta_k\right) + b_1\right)^3}{l_i l_1\cos\left(\sum_{k\in M_{i(c-i+1)}}\theta_k\right)} + \delta\left(T^{(1)}_{M_{i(c-i+2)}}\right)\frac{\left(l_i l_2\cos\left(\sum_{k\in M_{i(c-i+2)}}\theta_k\right) + b_2\right)^3}{l_i l_2\cos\left(\sum_{k\in M_{i(c-i+2)}}\theta_k\right)} \\
&+ \delta\left(T^{(2)}_{M_{3(i-3)}}\right)\frac{\left(l_3 l_i\cos\left(\sum_{k\in M_{3(i-3)}}\theta_k\right) + b_3\right)^3}{l_3 l_i\cos\left(\sum_{k\in M_{3(i-3)}}\theta_k\right)} + \delta\left(T^{(2)}_{M_{4(i-4)}}\right)\frac{\left(l_4 l_i\cos\left(\sum_{k\in M_{4(i-4)}}\theta_k\right) + b_4\right)^3}{l_4 l_i\cos\left(\sum_{k\in M_{4(i-4)}}\theta_k\right)}.
\end{aligned}$$

Proof. This theorem follows from the same arguments used in Corollary H.1.

Thus, we conclude that the standard $4$-gons are critical points of the TMS potential by checking that each term in the TMS potential has gradient zero when approaching the standard $4$-gons in the region specified by the indicator function associated to it.

Let $B^{-} \subset \{1, 2, 3, 4\}$. Let $\phi = |B^{-}|$. Consider the $4^{\phi-}$-gons (Section 3.2) with coordinates
$$\begin{aligned}
\theta^* &: \theta_1 = \theta_2 = \theta_3 = \frac{\pi}{2},\ \theta_4 + \cdots + \theta_c = \frac{\pi}{2}; \\
b^* &: \text{for } i = 1, 2, 3, 4,\ b_i < 0 \text{ if } i \in B^{-} \text{ and } b_i = 0 \text{ if } i \notin B^{-};\ \text{for } j = 5, 6, \ldots, c,\ b_j < 0; \\
l^* &: \text{for } i = 1, 2, 3, 4,\ 0 < l_i^2 < -b_i \text{ if } i \in B^{-} \text{ and } l_i = 1 \text{ if } i \notin B^{-};\ \text{for } j = 5, 6, \ldots, c,\ l_j = 0.
\end{aligned}$$

Theorem J.2. For a fixed $M$, let $H^{(4,-)}$ denote the TMS potential in some neighbourhood of the $4^{\phi-}$-gon for $c = 4$ (Appendix J.1.2). Consider the projection
$$\Pi_{c,4}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_4, \theta_1, \ldots, \theta_3, b_1, \ldots, b_4).$$
The TMS potential in the chamber described by $M$ is
$$H(l,\theta,b) = H^{(4,-)} \circ \Pi_{c,4}(l,\theta,b) + \sum_{i=5}^{c} G_i(l,\theta,b) + (c-4),$$
where $G_i(l,\theta,b)$ is defined in Theorem J.1.

Proof. This theorem follows from the same arguments used in Corollary H.1.

Thus, we conclude that the $4^{\phi-}$-gons are critical points of the TMS potential by checking that each term in the TMS potential has gradient zero when approaching the $4^{\phi-}$-gons in the region specified by the indicator function associated to it.

Let $B^{+} \subset \{5, 6, \ldots, c\}$. Let $\sigma = |B^{+}|$.
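Consider the $4^{\sigma+}$-gons (Section 3.2) with coordinates
$$\begin{aligned}
l^* &: l_1 = l_2 = l_3 = l_4 = 1 > 0,\ l_5 = \cdots = l_c = 0; \\
\theta^* &: \theta_1 = \theta_2 = \theta_3 = \frac{\pi}{2},\ \theta_4 + \cdots + \theta_c = \frac{\pi}{2}; \\
b^* &: b_1 = b_2 = b_3 = b_4 = 0,\ \text{and for } i = 5, 6, \ldots, c,\ b_i < 0 \text{ if } i \notin B^{+} \text{ and } b_i = \frac{1}{2c} \text{ if } i \in B^{+}.
\end{aligned}$$

Theorem J.3. For a fixed $M$, let $H^{(4)}$ denote the TMS potential in some neighbourhood of the standard $4$-gon for $c = 4$ (Appendix J.1.1). Consider the projection
$$\Pi_{c,4}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_4, \theta_1, \ldots, \theta_3, b_1, \ldots, b_4).$$
The TMS potential in the chamber described by $M$ is
$$H(l,\theta,b) = H^{(4)} \circ \Pi_{c,4}(l,\theta,b) + \sum_{i=5}^{c} G_i(l,\theta,b) + c - (4+\sigma) + \sum_{i\in B^{+}} H^{+}_i(l,\theta,b),$$
where $G_i(l,\theta,b)$ is defined in Theorem J.1, and for each $i \in B^{+}$,
$$H^{+}_i(l,\theta,b) = \sum_{j\ne i}\left[\left(l_i l_j\cos(\theta_{ij})\right)^2 + 3\, l_i l_j\cos(\theta_{ij})\, b_i + 3b_i^2\right] + (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2,$$
where $\theta_{ij}$ denotes the angle between $W_i$ and $W_j$.

Proof. This follows from the same arguments as in Theorem I.1.

Thus, we conclude that the $4^{\sigma+}$-gons are critical points of the TMS potential by checking that each term in the TMS potential has gradient zero when approaching the $4^{\sigma+}$-gons in the region specified by the indicator function associated to it.

Let $B^{-} \subset \{1, 2, 3, 4\}$ and $B^{+} \subset \{5, 6, \ldots, c\}$. Consider the $4^{\sigma+,\phi-}$-gon (Section 3.2) with coordinates
$$\begin{aligned}
\theta^* &: \theta_1 = \theta_2 = \theta_3 = \frac{\pi}{2},\ \theta_4 + \cdots + \theta_c = \frac{\pi}{2}; \\
b^* &: \text{for } i = 1, 2, 3, 4,\ b_i < 0 \text{ if } i \in B^{-} \text{ and } b_i = 0 \text{ if } i \notin B^{-};\ \text{for } j = 5, 6, \ldots, c,\ b_j < 0 \text{ if } j \notin B^{+} \text{ and } b_j = \frac{1}{2c} \text{ if } j \in B^{+}; \\
l^* &: \text{for } i = 1, 2, 3, 4,\ 0 < l_i^2 < -b_i \text{ if } i \in B^{-} \text{ and } l_i = 1 \text{ if } i \notin B^{-};\ \text{for } j = 5, 6, \ldots, c,\ l_j = 0.
\end{aligned}$$

Theorem J.4. For a fixed $M$, let $H^{(4,-)}$ denote the TMS potential in some neighbourhood of the $4^{\phi-}$-gon for $c = 4$ (Appendix J.1.2). Consider the projection
$$\Pi_{c,4}: (l_1, \ldots, l_c, \theta_1, \ldots, \theta_{c-1}, b_1, \ldots, b_c) \mapsto (l_1, \ldots, l_4, \theta_1, \ldots, \theta_3, b_1, \ldots, b_4).$$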
The TMS potential in the chamber described by $M$ is
$$H(l,\theta,b) = H^{(4,-)} \circ \Pi_{c,4}(l,\theta,b) + \sum_{i=5}^{c} G_i(l,\theta,b) + c - (4+\sigma) + \sum_{i\in B^{+}} H^{+}_i(l,\theta,b),$$
where $G_i(l,\theta,b)$ is defined in Theorem J.1, and for each $i \in B^{+}$,
$$H^{+}_i(l,\theta,b) = \sum_{j\ne i}\left[\left(l_i l_j\cos(\theta_{ij})\right)^2 + 3\, l_i l_j\cos(\theta_{ij})\, b_i + 3b_i^2\right] + (1-l_i^2)^2 - 3(1-l_i^2)b_i + 3b_i^2,$$
where $\theta_{ij}$ denotes the angle between $W_i$ and $W_j$.

Proof. This follows from the same arguments as in Theorem I.1.

Thus, we conclude that the $4^{\sigma+,\phi-}$-gons are critical points of the TMS potential by checking that each term in the TMS potential has gradient zero when approaching the $4^{\sigma+,\phi-}$-gons in the region specified by the indicator function associated to it.

Remark J.1. For $c > 4$, we do not know the theoretical local learning coefficients of these $4$-gons. In Appendix K, we provide an estimation of local learning coefficients for various $4$-gons when $c = 6$.

Figure K.1: Loss trace plots for samples used for $\hat{\lambda}$ produced via SGLD sampling. The left plot shows SGLD chains that are all "healthy" and the right plot shows a trajectory that escapes the phase of the initialising critical point (loss indicated by red horizontal line) to another phase with lower loss. To obtain the estimates $\hat{\lambda}$ listed in Table K.1, such unhealthy chains are removed from consideration.

K Details of local learning coefficient estimation

In this section, we discuss technical details and caveats about the values of the local learning coefficient estimates $\hat{\lambda}$ given throughout the paper. It was claimed in Lau et al. (2023) that the $\hat{\lambda}$ algorithm is valid for comparing or ordering critical points by their level of degeneracy. Obtaining the correct local learning coefficient can prove challenging. For TMS, we find that

• The ordering of $\hat{\lambda}$ for different critical points lines up with the theoretical prediction.
• For critical points with low loss, such as the $6$-, $5$- and $5^{+}$-gons depicted in Figure A.2, the $\hat\lambda$ values are close to the theoretically derived values.
• However, for critical points with higher loss, a misconfigured SGLD step size in the algorithm can cause the sample path itself to undergo a phase transition to a lower-loss state. See the diagnostic trace plot on the right of Figure K.1 for an example where an SGLD trajectory drops to a different phase. This is the reason for the negative $\hat\lambda$ values shown in Figure 3, in which we opted to use a uniform set of SGLD hyperparameters since we cannot a priori predict which critical point an SGD trajectory will visit.
• Lowering the SGLD step size can ameliorate this issue, at the cost of increasing the number of sampling steps required.

Table K.1 shows a set of $\hat\lambda$ values computed using a bespoke SGLD step size (explained below) for each group of 3 critical points with similar loss (again cf. Figure A.2). Specifically, we take the dataset size $n = 5000$, and SGLD hyperparameters $\gamma = 0.1$, number of steps $= 10000$. Furthermore, for each critical point, we run $10$ independent SGLD chains and discard any chain where more than $5\%$ of the samples have loss values lower than that of the critical point itself. The $\hat\lambda$ estimate and the standard deviation are then calculated from the remaining chains. The SGLD step size is manually chosen so that the majority of chains in each group pass the test above.
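The chain-filtering rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. The per-chain formula $\hat\lambda = n\beta\,(\overline{L_n} - L_n(w^*))$ with $\beta = 1/\log n$ is our reading of the estimator in Lau et al. (2023) and is not stated in this section. The function names, the use of the population standard deviation, and the synthetic loss traces are likewise hypothetical.

```python
import math
from statistics import mean, pstdev


def llc_from_chain(losses, critical_loss, n):
    """Per-chain estimate: n * beta * (mean sampled loss - loss at the critical point).

    ASSUMPTION: beta = 1 / log(n), following our reading of Lau et al. (2023).
    """
    beta = 1.0 / math.log(n)
    return n * beta * (mean(losses) - critical_loss)


def escape_fraction(losses, critical_loss):
    """Fraction of samples whose loss is below the loss of the critical point."""
    return sum(loss < critical_loss for loss in losses) / len(losses)


def estimate_llc(chains, critical_loss, n, max_escape=0.05):
    """Discard chains with more than `max_escape` of samples below the critical
    loss, then return (mean, std, kept_flags) over the surviving chains."""
    kept = [escape_fraction(c, critical_loss) <= max_escape for c in chains]
    estimates = [
        llc_from_chain(c, critical_loss, n) for c, ok in zip(chains, kept) if ok
    ]
    if not estimates:
        raise ValueError("no healthy chains; a smaller SGLD step size may help")
    # ASSUMPTION: population standard deviation; the text does not say which is used.
    return mean(estimates), pstdev(estimates), kept


# Synthetic loss traces (made up for illustration, not data from the paper).
n = 5000
critical_loss = 0.10
healthy_a = [0.11] * 100
healthy_b = [0.12] * 100
escaped = [0.11] * 50 + [0.05] * 50  # drops to a lower-loss phase halfway through

lam_hat, lam_std, kept = estimate_llc([healthy_a, healthy_b, escaped], critical_loss, n)
escaped_estimate = llc_from_chain(escaped, critical_loss, n)
print(kept, lam_hat, lam_std, escaped_estimate)
```

In this toy run the escaped chain has a mean loss below the loss at the critical point, so its unfiltered estimate is negative. Under the assumed formula, this is the same mechanism as the negative $\hat\lambda$ values described above.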
| Critical point | Estimated $\hat\lambda$ (std) | SGLD step size |
|---|---|---|
| $4^{-}$ | 0.000 (0.00) | 0.0000005 |
| $4^{+-}$ | 0.540 (0.16) | 0.0000005 |
| $4^{++-}$ | 0.998 (0.54) | 0.0000005 |
| $4^{-}$ | 1.024 (0.74) | 0.000001 |
| $4^{+-}$ | 1.619 (0.89) | 0.000001 |
| $4^{++-}$ | 1.899 (0.76) | 0.000001 |
| $4^{-}$ | 1.689 (0.88) | 0.000001 |
| $4^{+-}$ | 2.096 (1.00) | 0.000001 |
| $4^{++-}$ | 2.597 (0.88) | 0.000001 |
| $4^{-}$ | 2.991 (0.35) | 0.000005 |
| $4^{+-}$ | 3.393 (0.65) | 0.000005 |
| $4^{++-}$ | 4.097 (0.65) | 0.000005 |
| $4$ | 5.297 (0.04) | 0.00001 |
| $4^{+}$ | 5.761 (1.53) | 0.00001 |
| $4^{++}$ | 6.203 (0.99) | 0.00001 |
| $5$ | 7.705 (0.85) | 0.00005 |
| $5^{+}$ | 9.906 (1.27) | 0.00005 |
| $6$ | 9.027 (0.59) | 0.00005 |

Table K.1: $\hat\lambda$ for known critical points in $r = 2$, $c = 6$, their standard deviation across viable SGLD chains and the SGLD step size used.
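As a small sanity check on the ordering of the estimates, the snippet below uses only the $\hat\lambda$ values printed in Table K.1. It assumes the five $4$-gon triples are grouped in the row order of the table. It checks that $\hat\lambda$ increases within each triple and that the first entry of each triple increases from one triple to the next. The variable and function names are hypothetical, and the check ignores the reported standard deviations, so it describes the point estimates only.

```python
# Estimated LLC values for the 4-gon variants, copied from Table K.1 and
# grouped into triples in the order the rows appear (an assumption about grouping).
four_gon_groups = [
    [0.000, 0.540, 0.998],
    [1.024, 1.619, 1.899],
    [1.689, 2.096, 2.597],
    [2.991, 3.393, 4.097],
    [5.297, 5.761, 6.203],
]


def is_strictly_increasing(values):
    """True if each value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Within each triple, the estimate grows as positive biases are added.
within_group = [is_strictly_increasing(group) for group in four_gon_groups]

# Across triples, compare the first entry of each group.
across_groups = is_strictly_increasing([group[0] for group in four_gon_groups])

print(within_group, across_groups)
```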