
Paper deep dive

Using physics-inspired Singular Learning Theory to understand grokking & other phase transitions in modern neural networks

Anish Lakkapragada

Year: 2025 · Venue: arXiv preprint · Area: Training Dynamics · Type: Empirical · Embeddings: 23

Models: Anthropic Toy Models of Superposition, Autoencoders, Low-rank matrix networks, Toy polynomial regressors

Abstract

Classical statistical inference and learning theory often fail to explain the success of modern neural networks. A key reason is that these models are non-identifiable (singular), violating core assumptions behind PAC bounds and asymptotic normality. Singular learning theory (SLT), a physics-inspired framework grounded in algebraic geometry, has gained popularity for its ability to close this theory-practice gap. In this paper, we empirically study SLT in toy settings relevant to interpretability and phase transitions. First, we understand the SLT free energy $\mathcal{F}_n$ by testing an Arrhenius-style rate hypothesis using both a grokking modulo-arithmetic model and Anthropic's Toy Models of Superposition. Second, we understand the local learning coefficient $\lambda_{\alpha}$ by measuring how it scales with problem difficulty across several controlled network families (polynomial regressors, low-rank linear networks, and low-rank autoencoders). Some of our experiments recover known scaling laws, while others yield meaningful deviations from theoretical expectations. Overall, our paper illustrates the many merits of SLT for understanding neural network phase transitions, and poses open research questions for the field.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · training-dynamics (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/11/2026, 12:33:01 AM

Summary

This paper investigates Singular Learning Theory (SLT) as a framework to understand phase transitions, grokking, and model complexity in neural networks. The author empirically tests an Arrhenius-style reaction rate hypothesis for grokking and phase transitions, and analyzes the scaling of the Local Learning Coefficient (LLC) across polynomial regressors, low-rank networks, and autoencoders, finding that domain constraints and model architecture significantly influence singularity and complexity.

Entities (5)

Local Learning Coefficient · metric · 100%
Singular Learning Theory · framework · 100%
Sumio Watanabe · researcher · 100%
Grokking · phenomenon · 95%
Toy Models of Superposition · model · 95%

Relation Signals (3)

Sumio Watanabe developed Singular Learning Theory

confidence 100% · SLT is an extremely mathematically rigorous framework developed from algebraic geometry by Dr. Sumio Watanabe.

Local Learning Coefficient measures Model Complexity

confidence 95% · the LLC is a complexity measure of a model (↑ LLC =⇒ ↑ complexity)

Singular Learning Theory explains Grokking

confidence 90% · How can we understand grokking and other phase transitions in neural networks through singular learning theory?

Cypher Suggestions (2)

Find all models studied in the context of SLT · confidence 90% · unvalidated

MATCH (m:Model)-[:STUDIED_IN]->(f:Framework {name: 'Singular Learning Theory'}) RETURN m.name

Map the relationship between metrics and phenomena · confidence 85% · unvalidated

MATCH (m:Metric)-[r:MEASURES]->(c:Concept) MATCH (p:Phenomenon)-[:UNDERSTOOD_VIA]->(f:Framework) RETURN m, r, c, p, f

Full Text

22,567 characters extracted from source content.


Using physics-inspired Singular Learning Theory to understand grokking & other phase transitions in modern neural networks

Anish Lakkapragada
Department of Statistics, Yale University
anish.lakkapragada@yale.edu
arXiv:2512.00686v3 [cs.LG], 3 Dec 2025

Abstract

Classical statistical inference and learning theory often fail to explain the success of modern neural networks. A key reason is that these models are non-identifiable (singular), violating core assumptions behind PAC bounds and asymptotic normality. Singular learning theory (SLT), a physics-inspired framework grounded in algebraic geometry, has gained popularity for its ability to close this theory-practice gap. In this paper, we empirically study SLT in toy settings relevant to interpretability and phase transitions. First, we understand the SLT free energy $\mathcal{F}_n$ by testing an Arrhenius-style rate hypothesis using both a grokking modulo-arithmetic model and Anthropic's Toy Models of Superposition. Second, we understand the local learning coefficient $\lambda_{\alpha}$ by measuring how it scales with problem difficulty across several controlled network families (polynomial regressors, low-rank linear networks, and low-rank autoencoders). Some of our experiments recover known scaling laws, while others yield meaningful deviations from theoretical expectations. Overall, our paper illustrates the many merits of SLT for understanding neural network phase transitions, and poses open research questions for the field.

1 Introduction

Classical statistics is insufficient for understanding modern machine learning. For example, well-studied PAC bounds from statistical learning theory are vacuous and fail to explain the remarkable generalization power of neural networks. The reason for this comes from what has been a well-kept secret for the last 15 years: neural networks are singular statistical models (Wei et al., 2022). Namely, unlike regular models, singular models can implement the same function through multiple distinct parameter values. As an example of how this can occur in a singular model, consider that for all $\alpha > 0$, $\mathrm{ReLU}(x) = \frac{1}{\alpha}\,\mathrm{ReLU}(\alpha x)$ (Hoogland, 2023). Moreover, singular models do not exhibit the expected statistical behavior. As an example, the Fisher Information matrix, the basis of asymptotic normality theory for MLEs, is often non-invertible at the true parameters of singular models (Watanabe, 2010).
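To make the ReLU rescaling example above concrete, the following is a minimal numerical check (our own illustrative sketch in NumPy, not code from the paper) that a one-hidden-unit ReLU model computes exactly the same function after the rescaling $w \mapsto \alpha w$, $v \mapsto v / \alpha$, which is one direction of non-identifiability:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# One-hidden-unit ReLU model: f(x) = v * ReLU(w * x).
def f(x, w, v):
    return v * relu(w * x)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
w, v, alpha = 1.7, -0.3, 4.0

# The rescaled parameters (alpha * w, v / alpha) implement exactly the same function,
# so this direction in parameter space is non-identifiable.
assert np.allclose(f(x, w, v), f(x, alpha * w, v / alpha))
print("outputs identical under rescaling: the parameterization is non-identifiable")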
Most pressingly, the majority of AI models today (neural networks, LLMs) are singular. Thus, out of a dire need to study them for AI safety (Lehalleur et al., 2025), the extremely mathematically rigorous field of singular learning theory (SLT), developed in 2009, has gained renewed attention.

Introduction to Singular Learning Theory (SLT)

SLT is an extremely mathematically rigorous framework developed from algebraic geometry by Dr. Sumio Watanabe in his two books (Watanabe, 2009) and (Watanabe, 2018). At its core, SLT posits that the singularities of the model – the weights at which the model is non-identifiable (weights $w \in W$ are non-identifiable if there exists $w' \in W$ such that the model functions $f(x, w)$ and $f(x, w')$ coincide) – determine the weight spaces that the model will occupy as $n \to \infty$. More concretely, if we assume our model has parameter space $W \subseteq \mathbb{R}^d$ and we take $\{W_\alpha\}_{\alpha}$ to be a sequence of subsets of $W$ satisfying mild analytical conditions (for a full discussion of these conditions, see Section 4.1 of (Chen et al., 2023)), then through application of (Watanabe, 2018, §6.3) we get the following approximation:

$$\mathcal{F}_n := -\log p(D_n) \approx \min_{\alpha} \left[ n L_n(w^*_{\alpha}) + \lambda_{\alpha} \log n \right]$$

where $\mathcal{F}_n$ is the "free energy" of the model, $w^*_{\alpha}$ is a $W_{\alpha}$-global minimum of the population negative log-likelihood, $L_n$ is the empirical negative log-likelihood, and $\lambda_{\alpha}$ is the local learning coefficient (LLC; higher values of $\lambda_{\alpha}$ indicate greater model complexity) (Lau et al., 2023). From this equation, we gain a principled understanding of the internal model selection a neural network performs during training (Chen et al., 2023). Specifically, we see that a neural network during training is consistently trying to minimize the free energy through a tradeoff between the loss $L_n(w^*_{\alpha})$ and the model complexity $\lambda_{\alpha}$.

Research Questions

We are interested in the following broad question: how can we understand grokking and other phase transitions in neural networks through singular learning theory? (By "understand" we mean in the when/what/where/why sense.) Two main research questions we tackle in this vein are:

• How far can we extend SLT's physics-based "free energy" quantity $\mathcal{F}_n$ to understand when a model will undergo a phase transition or grok?

• Can we better understand SLT's local learning coefficient $\lambda_{\alpha}$ (Lau et al., 2023) and its sensitivity to various problems and their difficulty?

2 Methods

We aim to use lightweight models known to grok in our study of SLT, as we can train them many times to gain good statistics.

LLC Estimation. We are interested in the free energy of these models over time, which is fundamentally based on SLT's local learning coefficient $\lambda_{\alpha}$. Measuring the $\lambda_{\alpha}$ of a model is extremely non-trivial, so we defer to the Stochastic Gradient Langevin Dynamics (SGLD) MCMC estimation procedure of (Lau et al., 2023). All SGLD hyperparameters were tuned for stability and are summarized in the Appendix.

3 Experiments

Question 1: Understanding the temporality of grokking and phase transitions

Our Arrhenius Reaction Rate Hypothesis. Suppose we are training a model, and let us define $W_t$ to be the weights of the model at iteration $t$. Now suppose that at iteration $i$ the model has only memorized (i.e. 100% training accuracy but poor testing accuracy), whereas at iteration $j$ the model has grokked (i.e. 100% testing & training accuracy). Let us define $\mathcal{F}_i$ and $\mathcal{F}_j$ to be the model's free energy, per the SLT definition, at each of these respective iterations. Borrowing from Arrhenius reaction rate theory in chemical kinetics (Arrhenius, 1889), we present the following hypothesis to explain the time it takes to grok:

$$r_{i \to j} \propto \exp\!\left(\beta_{\mathrm{eff}}\, \Delta\mathcal{F}_{i \to j}\right), \qquad r_{i \to j} := j - i, \qquad \Delta\mathcal{F}_{i \to j} := \mathcal{F}_i - \mathcal{F}_j,$$

where $\Delta\mathcal{F}_{i \to j} < 0$ and $\beta_{\mathrm{eff}}$ is an effective inverse temperature dependent on global hyperparameters (learning rate, batch size, etc.). The main intuition here is that the amount of time a phase transition takes is exponential in how much it decreases the (free) energy of the model.

[Figure 1: Diagram of the modulo arithmetic neural network we use for our experiments, taken from (Panickssery and Vaintrob, 2023). Note that this task can be thought of as taking two numbers in $\mathbb{Z}_p$ and outputting their sum modulo $p$ in $\mathbb{Z}_p$.]

Experiment One: Grokking on Modulo Arithmetic. We first test our hypothesis in the setting of modulo arithmetic, which is a common setup for studying grokking (Power et al., 2022; Panickssery and Vaintrob, 2023; Pearce et al., 2023). In more detail, we train a neural network to compute $(a, b) \mapsto a + b \bmod p$ for some prime $p$, using the architecture shown in Figure 1.
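The paper does not provide code, but a minimal sketch of this kind of setup might look as follows (hypothetical PyTorch; the embedding-plus-MLP classifier below merely stands in for the Figure 1 architecture, which we have not reproduced exactly):

import torch
import torch.nn as nn

p = 53  # prime modulus, as in the paper's experiments

# Full dataset: all (a, b) pairs in Z_p x Z_p, labeled with (a + b) mod p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Toy classifier: embed both inputs, concatenate, and predict the sum mod p.
class ModAddNet(nn.Module):
    def __init__(self, p, d_embed=32, d_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden), nn.ReLU(), nn.Linear(d_hidden, p)
        )

    def forward(self, ab):
        e = self.embed(ab)             # (batch, 2, d_embed)
        return self.mlp(e.flatten(1))  # (batch, p) logits

model = ModAddNet(p)
logits = model(pairs[:5])
print(logits.shape)  # torch.Size([5, 53])

Training a fraction of the (a, b) pairs with cross-entropy and tracking train/validation accuracy over checkpoints is then enough to observe (or not observe) grokking on a given run.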
For our specific experiments, we take $p = 53$ and train all our models on a randomly chosen 40% subset of $\mathbb{Z}_{53} \times \mathbb{Z}_{53}$ (the remaining 60% was used for validation). We trained 500 different models and, for each of them, stored the training loss & accuracy and validation loss & accuracy across 100 linearly spaced checkpoints during training, and, if grokking occurred, computed pre-grok and post-grok LLCs. We stored all of this information in a Weights & Biases project; in total this process took over 7 days of continuous CPU compute to run. For the analyses we present momentarily, we only considered the 168 runs (33.6%) during which grokking occurred. We report a histogram of the distribution of observed LLC jumps $\Delta\lambda$ and log grokking times $\log(r_{i \to j})$ in Figure 2, and a plot of $\log(r_{i \to j})$ versus $\Delta\mathcal{F}_{i \to j}$ across our runs in Figure 3.

[Figure 2: Distribution of $\Delta\lambda$ and $\log r_{i \to j}$.]

[Figure 3: $\Delta\mathcal{F}_{i \to j}$ vs. $\log r_{i \to j}$ with linear fit on modulo arithmetic toy networks with $p = 53$.]

The relationship observed in Figure 3 is consistent with our hypothesis, as it demonstrates a consistent negative slope, albeit with a low $R^2$ score.

Experiment Two: Phase Transitions on Anthropic's Toy Models of Superposition. To re-test our hypothesis on another problem, we perform the same experiment on Anthropic's Toy Models of Superposition (TMS) (Elhage et al., 2022). TMS are extremely small networks that reliably undergo aesthetic phase transitions, in which the weight columns (when visualized) form increasingly complex shapes (e.g. 2-gon → 3-gon → 4-gon). Note that this is in contrast to grokking, as multiple phase transitions can now occur in a single run.

This experiment is largely similar to the first, with a few changes. We trained 60 models (again with Weights & Biases) for 4500 iterations. For each run, we tracked the LLC across 100 linearly- and logarithmically-spaced checkpoints. Moreover, for a given run we detected transitions by (1) smoothing the training loss curve over windows of roughly 10 steps and (2) finding segments of the smoothed curve where the loss decreased by at least 10% of the total loss decrease over training; a sketch of this heuristic is given at the end of this subsection. We observed our end results to vary greatly depending on the method used to detect these transitions. We then computed $\Delta\mathcal{F}_{i \to j}$ as the change in free energy across any two consecutive transitions. We provide a plot of $\Delta\mathcal{F}_{i \to j}$ versus $\log r_{i \to j}$ with linear fit in Figure 4.

[Figure 4: $\Delta\mathcal{F}_{i \to j}$ vs. $\log r_{i \to j}$ with linear fit on Anthropic's TMS models.]

While these results differ drastically from those of our prior experiment, we contend that we can still glean valuable information from them. We first address the three vertical clusters of transition times: we conjecture these clusters occurred as a result of (1) our method for detecting transitions and (2) the uniqueness of the TMS experimental setup (i.e. an extremely low-parameter model may have more fixed transition timesteps). Moreover, the linear fit shown has an upward slope, which we note was highly sensitive to (1): we actually recovered a downward-sloping linear fit using a more rudimentary, non-smoothing transition-detection approach. Given the variability in our results, we deemed this experiment inconclusive with respect to our hypothesis and proceeded to Question 2.
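For reference, here is a rough sketch of the transition-detection heuristic described above (our own reconstruction, not the paper's code; only the roughly 10-step smoothing window and the 10%-of-total-decrease threshold come from the text):

import numpy as np

def detect_transitions(loss, window=10, frac=0.10):
    """Return (start, end) index pairs of segments whose smoothed loss
    drops by at least `frac` of the total loss decrease over training."""
    # (1) Smooth the loss curve with a simple moving average.
    kernel = np.ones(window) / window
    smooth = np.convolve(np.asarray(loss, dtype=float), kernel, mode="valid")
    total_drop = smooth[0] - smooth.min()
    # (2) Scan for contiguous decreasing segments with a large enough drop.
    segments, start = [], None
    for t in range(1, len(smooth)):
        if smooth[t] < smooth[t - 1]:
            start = t - 1 if start is None else start
        else:
            if start is not None and smooth[start] - smooth[t - 1] >= frac * total_drop:
                segments.append((start, t - 1))
            start = None
    if start is not None and smooth[start] - smooth[-1] >= frac * total_drop:
        segments.append((start, len(smooth) - 1))
    return segments

# One could then take consecutive detected transitions (i, j), compute r = j - i and the
# free-energy change between them, and regress log(r) on ΔF as in Figures 3 and 4.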
Question 2: Understanding how the LLC $\lambda_{\alpha}$ Scales with Problem Difficulty

Motivation. The second broad question we tackle in this project is understanding the behavior of the local learning coefficient (LLC) $\lambda_{\alpha}$. For context, the LLC is a complexity measure of a model (higher LLC implies higher complexity) designed to remain robust for singular models. However, the LLC is extremely unintuitive, as seen from its definition (reproduced from (Lau et al., 2023)) below; here $V(\varepsilon)$ is, roughly, the volume of the set of parameters near $w^*$ whose loss lies within $\varepsilon$ of the optimum.

Definition 3.1 (Local Learning Coefficient (LLC)). There exists a unique rational number $\lambda(w^*)$, a positive integer $m(w^*)$, and some constant $c > 0$ such that, asymptotically as $\varepsilon \to 0$,

$$V(\varepsilon) = c\,\varepsilon^{\lambda(w^*)} (-\log \varepsilon)^{m(w^*)-1} + o\!\left(\varepsilon^{\lambda(w^*)} (-\log \varepsilon)^{m(w^*)-1}\right).$$

We call $\lambda(w^*)$ the Local Learning Coefficient (LLC) and $m(w^*)$ the local multiplicity.

This is in stark contrast to statistical learning theory's famous complexity measures (e.g. the Vapnik-Chervonenkis dimension), which are not only perhaps more intuitive but also very well studied. As such, we aim to take a stab at understanding the LLC by testing how it scales with problem difficulty. These experiments were inspired by (Panickssery and Vaintrob, 2023), who demonstrated that the LLC scales linearly with $p$ in (generalizing) modulo arithmetic networks.

Experiment One: Polynomial Regressors of Increasing Degree. The first problem setup we study is univariate polynomial regressors of increasing degree. We say a function $f : \mathbb{R} \to \mathbb{R}$ is a polynomial regressor of degree $d$ if $f(x) = \sum_{i=0}^{d} a_i x^i$ for coefficients $\{a_i\}_{i=0}^{d} \subset \mathbb{R}$ and all inputs $x \in X \subseteq \mathbb{R}$. Note that the use of this constrained instance space $X$ is required, as it is numerically unstable to use inputs with absolute magnitude greater than one when scaling the degree to $d \sim 10^3$. For a given degree $d$ and instance space $X$, we run the following procedure (sketched in code at the end of this experiment) ten times to get an accurate estimate of the mean LLC $\lambda_d$ and its corresponding standard deviation:

1. Generate a degree-$d$ dataset $D_N = \{(x_i, y_i)\}_{i=1}^{N}$ of $N = 500$ samples realizable by some polynomial regressor with all coefficients in $[-1, 1]$ and all samples $x_i \in X$.
2. Initialize a polynomial regressor of degree $d$ from scratch and train it on $D_N$ until convergence.
3. Measure the LLC of this converged model.

We use $X = [-1, 1]$, $[-0.75, 0.75]$, and $[-0.5, 0.5]$ for our experiments. For each choice of $X$, we present $\lambda_d$ for twenty choices of $d$ from $10^0$ to $10^3$ in Figure 5.

[Figure 5: Estimated LLC versus degree of polynomial regressor for all choices of instance space $X$.]

While deceptively simple, this result is actually quite interesting. Observe that polynomial regressors are regular models, meaning there is a bijection from $\{a_i\}_{i=0}^{d}$ to the induced function from $\mathbb{R}$ to $\mathbb{R}$. As such, following well-established results from SLT (Wei et al., 2022), we would expect $\lambda_d = \frac{d}{2}$. This means our empirical LLC estimates, across all $X$, are lower than the theoretical expectations. We reason that this is the case because the input space of our polynomial regressors is not $\mathbb{R}$ but instead some choice of $X \subset \mathbb{R}$, which induces parameter singularities. Restating this explicitly: it is likely that for some polynomial regression function $f_A$ induced by $A := \{a_i\}_{i=0}^{d} \subset \mathbb{R}$ and some polynomial regression function $f_B$ induced by $B := \{b_i\}_{i=0}^{d} \subset \mathbb{R}$, the functions satisfy $f_A|_X = f_B|_X$ despite $A \neq B$. Such singularities decrease the LLC (Hoogland, 2023), which is consistent with our results. Furthermore, tighter interval choices for $X$ would lead to more singularities and lower LLCs – this is again consistent with our results in Figure 5. This experiment establishes that models on practically constrained domains can often be more singular (and thus have higher generalization capability) than expected.
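A minimal sketch of steps 1 and 2 of the procedure above (our own illustrative code; a least-squares fit stands in for gradient-descent training, and step 3, the LLC measurement via the SGLD estimator of (Lau et al., 2023), is not reproduced here):

import numpy as np

def make_poly_dataset(d, interval=(-1.0, 1.0), n=500, seed=0):
    """Step 1: data realizable by a degree-d polynomial with coefficients in [-1, 1]."""
    rng = np.random.default_rng(seed)
    true_coeffs = rng.uniform(-1.0, 1.0, size=d + 1)
    x = rng.uniform(*interval, size=n)
    y = np.polynomial.polynomial.polyval(x, true_coeffs)
    return x, y

def fit_poly(x, y, d):
    """Step 2: fit a degree-d polynomial regressor to the dataset."""
    return np.polynomial.polynomial.polyfit(x, y, d)

x, y = make_poly_dataset(d=20, interval=(-0.5, 0.5))
coeffs = fit_poly(x, y, d=20)
# Step 3 (not shown): estimate the LLC of the converged regressor with SGLD.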
Experiment Two: Low-Rank Neural Network with Matrix Factorization. For our second experiment, we choose a problem that a priori has explicit singularities. Consider a 1-layer neural network $f : \mathbb{R}^d \to \mathbb{R}^d$ given by $f(x) = W_2 W_1 x$, where $W_2 \in \mathbb{R}^{d \times r}$ and $W_1 \in \mathbb{R}^{r \times d}$ for constants $r \leq d$. Then $\mathrm{rank}(W_2 W_1) \leq r$, and moreover this model is full of singularities, since for any invertible $G \in \mathbb{R}^{r \times r}$ we have $W_2 W_1 = (W_2 G)(G^{-1} W_1)$. We fix $d := 100$ and use the same methodology as in the previous experiment, now measuring the LLC $\lambda_r$ across 20 choices of rank $r$ linearly spaced from 1 to $d$. We report our results in Figure 6.

[Figure 6: Estimated LLC versus rank of neural network, with quadratic fit.]

We see an extremely strong quadratic fit, peaking when the rank $r = d$. This fit matches the expected result quite nicely. Specifically, it is a standard fact from algebraic geometry that the set $M_r$ of matrices of rank $r$ is a smooth manifold of dimension $r(2d - r)$. Hence, on that manifold, we can essentially treat our low-rank neural network $f$ as a regular model with $r(2d - r)$ parameters, which gives $\lambda_r = \frac{1}{2} r(2d - r) = -\frac{1}{2} r^2 + 100r$ for $d = 100$. This quadratic matches the $-0.429 r^2 + 87.342 r$ terms in our fit fairly closely.

Experiment Three: Autoencoders of Increasing Bottleneck Dimension. For our final experiment, we consider an autoencoder $f : \mathbb{R}^d \to \mathbb{R}^d$. We again fix $d := 100$ and, for a given rank $r \leq d$, generate a sample $x \in \mathbb{R}^d$ as $x = Az$, where $z \sim \mathcal{N}(0, I_r)$ and $A \in \mathbb{R}^{d \times r}$. Thus, increasing $r$ increases the dimension of the subspace in which our samples $x$ live. We design our autoencoder as the sequential composition of a ReLU MLP encoder ($d \to 128 \to r$ units) and a ReLU MLP decoder ($r \to 128 \to d$ units). We train our autoencoders with MSE and present our results in Figure 7.

[Figure 7: Estimated LLC versus rank of input data, with a linear fit.]

This result is unique because it gives a clean linear fit ($R^2 = 0.998$) despite most MLPs certainly not being regular models (i.e. they are full of symmetries and singularities). To the best of our knowledge, we have not seen other scaling-law results like this in the SLT literature. Additionally, this result is also meaningful because the autoencoder's bottleneck layer effectively performs (non-linear) PCA (Kramer, 1991).
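A minimal sketch of this data-generating process and bottleneck architecture (hypothetical PyTorch; the layer sizes follow the text, while the batch size and rank below are arbitrary illustrative choices):

import torch
import torch.nn as nn

d, r = 100, 20  # ambient dimension and rank of the data subspace

# Data: x = A z with z ~ N(0, I_r), so samples lie in an r-dimensional subspace of R^d.
A = torch.randn(d, r)
z = torch.randn(1024, r)
x = z @ A.T  # shape (1024, d)

# Bottleneck autoencoder: ReLU MLP encoder (d -> 128 -> r) and decoder (r -> 128 -> d).
autoencoder = nn.Sequential(
    nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, r),   # encoder
    nn.Linear(r, 128), nn.ReLU(), nn.Linear(128, d),   # decoder
)

recon = autoencoder(x)
loss = nn.functional.mse_loss(recon, x)  # trained with MSE, as in the paper
print(loss.item())

Repeating this for each rank r (and estimating the LLC of the converged autoencoder) yields the per-rank measurements plotted in Figure 7.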
4 Conclusion and Future Directions

We present a collection of empirical results on using singular learning theory (SLT) as a tool to study both phase transitions and problem complexity in modern neural networks. In Question 1, we obtain mixed evidence for an Arrhenius-style reaction-rate relationship between the change in free energy $\Delta\mathcal{F}_{i \to j}$ and the time-to-transition $r_{i \to j}$ on two popular phase-transition setups (modular addition and Anthropic's Toy Models of Superposition). In Question 2, we show that practically constrained domains can induce singularities in regular models (Experiment 1), and we recover the expected algebraic-geometry prediction $\lambda_r \approx \frac{1}{2} r(2d - r)$ for low-rank matrix factorization when $f(x) = W_2 W_1 x$ (Experiment 2). Finally, we demonstrate a peculiar linear fit for the LLC on singular autoencoders of growing bottleneck dimension (Experiment 3). These results provide concrete examples where SLT reproduces known scaling laws but also sometimes does not align with real-world measurements.

Future Directions. We leave the following open questions and directions for future work:

• Free-energy barriers and reaction rates. Develop a principled method to estimate the free-energy barrier(s) between two subsets of weight space $W_i$ and $W_j$. This would enable a more direct test of the Arrhenius-style reaction-rate hypothesis relating $\Delta\mathcal{F}_{i \to j}$ to $r_{i \to j}$.

• Constrained domains induce singularities. Identify other standard deep-learning settings where natural domain constraints (e.g. pixel intensities in [0, 255]) cause distinct parameterizations to yield the same function behavior. Then replicate the polynomial-regressor study of Question 2, Experiment 1 in these settings. This could clarify when such constraints meaningfully lower the LLC, which oftentimes enables generalization.

• Comparing the LLC across memorization-vs-generalization architectures. For a fixed task family (e.g. image classification), compare the LLC of a heavily over-parameterized "memorization" network and a more compact "generalization" network, and see how this difference scales with problem difficulty (e.g. adding more classification classes). (We did try this on a sinusoidal regression problem in which the memorization and generalization networks were a wide ReLU MLP and a linear regressor in the Fourier basis, respectively. Our results indicated that the MLP was so singularity-full that its LLC was actually smaller than that of the generalization network, as would be expected by SLT. While we did not present this result, we posit that future work in this direction could be meaningful.)

5 Appendix

Q1 & Q2: SGLD Parameters Used for LLC Estimation Across All Experiments

As stated, estimating the LLC relies on Stochastic Gradient Langevin Dynamics (SGLD), which required hyperparameter tuning in order to be numerically stable. For each of our experiments, the table below lists the learning rate $\varepsilon$, the choice of $\gamma$, and the number of steps $n$ that we used.

Question & Experiment          | ε    | γ   | n
Question One, Experiment One   | 3e-3 | 5.0 | 500
Question One, Experiment Two   | 5e-4 | 1.0 | 400
Question Two, Experiment One   | 1e-3 | 1.0 | 2000
Question Two, Experiment Two   | 1e-3 | 1.0 | 2000
Question Two, Experiment Three | 1e-5 | 1.0 | 2000

Table 1: SGLD hyperparameters used for LLC estimation (Lau et al., 2023) across all experiments.

References

Arrhenius, Svante (1889). "Über die Reaktionsgeschwindigkeit bei der Inversion von Rohrzucker durch Säuren". In: Zeitschrift für physikalische Chemie 4.1, pp. 226–248.

Chen, Zhongtian, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet (2023). "Dynamical versus Bayesian phase transitions in a toy model of superposition". In: arXiv preprint arXiv:2310.06301.

Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah (2022). "Toy Models of Superposition". In: arXiv preprint arXiv:2209.10652. URL: https://arxiv.org/abs/2209.10652.

Hoogland, Jesse (Jan. 2023). Neural Networks Generalize Because of This One Weird Trick. LessWrong. URL: https://w.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick.

Kramer, Mark A. (1991). "Nonlinear Principal Component Analysis Using Autoassociative Neural Networks". In: AIChE Journal 37.2, pp. 233–243. DOI: 10.1002/aic.690370209.

Lau, Edmund, Zach Furman, George Wang, Daniel Murfet, and Susan Wei (2023). "The local learning coefficient: A singularity-aware complexity measure". In: arXiv preprint arXiv:2308.12108.

Lehalleur, Simon Pepin, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, and Daniel Murfet (2025). "You Are What You Eat – AI Alignment Requires Understanding How Data Shapes Structure and Generalisation". In: arXiv preprint arXiv:2502.05475.
Panickssery, Nina and Dmitry Vaintrob (Oct. 17, 2023). Investigating the learning coefficient of modular addition: hackathon project. LessWrong. URL: https://w.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition (visited on 09/22/2025).

Pearce, Adam, Asma Ghandeharioun, Nada Hussein, Nithum Thain, Martin Wattenberg, and Lucas Dixon (Aug. 2023). Do Machine Learning Models Memorize or Generalize? Explorables, PAIR (People + AI Research) at Google. URL: https://pair.withgoogle.com/explorables/grokking/ (visited on 10/27/2025).

Power, Alethea, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". In: arXiv preprint arXiv:2201.02177.

Watanabe, Sumio (2009). Algebraic Geometry and Statistical Learning Theory. Cambridge University Press.

Watanabe, Sumio (2010). "Equations of states in singular statistical estimation". In: Neural Networks 23.1, pp. 20–34.

Watanabe, Sumio (2018). Mathematical Theory of Bayesian Statistics. Cambridge University Press.

Wei, Susan, Daniel Murfet, Mingming Gong, Hui Li, Jesse Gell-Redman, and Thomas Quella (2022). "Deep learning is singular, and that's good". In: IEEE Transactions on Neural Networks and Learning Systems 34.12, pp. 10473–10486.