Paper deep dive
The Local Learning Coefficient: A Singularity-Aware Complexity Measure
Edmund Lau, Zach Furman, George Wang, Daniel Murfet, Susan Wei
Models: Deep linear networks (up to 100M params), Feedforward ReLU network (1.9M params), ResNet, Transformer language models
Abstract
The Local Learning Coefficient (LLC) is introduced as a novel complexity measure for deep neural networks (DNNs). Recognizing the limitations of traditional complexity measures, the LLC leverages Singular Learning Theory (SLT), which has long recognized the significance of singularities in the loss landscape geometry. This paper provides an extensive exploration of the LLC's theoretical underpinnings, offering both a clear definition and intuitive insights into its application. Moreover, we propose a new scalable estimator for the LLC, which is then effectively applied across diverse architectures including deep linear networks up to 100M parameters, ResNet image models, and transformer language models. Empirical evidence suggests that the LLC provides valuable insights into how training heuristics might influence the effective complexity of DNNs. Ultimately, the LLC emerges as a crucial tool for reconciling the apparent contradiction between deep learning's complexity and the principle of parsimony.
Summary
The paper introduces the Local Learning Coefficient (LLC) as a novel, singularity-aware complexity measure for deep neural networks. By leveraging Singular Learning Theory (SLT), the authors define the LLC as an asymptotic volume scaling exponent of the loss landscape near local minima. They propose a scalable estimator for the LLC and demonstrate its utility in analyzing modern architectures like ResNet and transformers, showing that it effectively captures the implicit regularization effects of training heuristics.
Full Text
The Local Learning Coefficient: A Singularity-Aware Complexity Measure

Edmund Lau* (School of Mathematics and Statistics, University of Melbourne; elau1@student.unimelb.edu.au), Zach Furman* (Timaeus; zach.furman1@gmail.com), George Wang (Timaeus; george@timaeus.co), Daniel Murfet (School of Mathematics and Statistics, University of Melbourne; d.murfet@unimelb.edu.au), Susan Wei (Department of Econometrics and Business Statistics, Monash University; susan.wei@monash.edu)

1 Introduction

Occam's razor, a foundational principle in scientific inquiry, suggests that the simplest among competing hypotheses should be selected. This principle has emphasized simplicity and parsimony in scientific thinking for centuries. Yet, the advent of deep neural networks (DNNs), with their multi-million parameter configurations and complex architectures, poses a stark challenge to this time-honored principle.
The question arises: how do we reconcile the effectiveness of these intricate models with the pursuit of simplicity? It is natural to pose this question in terms of model complexity. Unfortunately, many existing definitions of model complexity are problematic for DNNs. For instance, the parameter count, which in classical statistical learning theory is a commonly used measure of the amount of information captured in a fitted model, is well known to be inappropriate in deep learning. This is clear from technical results on generalization (Zhang et al., 2017), pruning (Blalock et al., 2020) and distillation (Hinton et al., 2015): the amount of information captured in a trained network is in some sense decoupled from the number of parameters.

As forcefully argued in Wei et al. (2022), the gap in our understanding of DNN model complexity is induced by singularities, hence the need for a singularity-aware complexity measure. This motivates our appeal to Singular Learning Theory (SLT), which recognizes the necessity for sophisticated tools in studying statistical models exhibiting singularities. Roughly speaking, a model is singular if there are many ways to vary the parameters without changing the function; the more ways to vary without a change, the more singular (and thus more degenerate). DNNs are prime examples of such singular statistical models, characterized by complex degeneracies that make them highly singular. SLT continues a longstanding tradition of using free energy as the starting point for measuring model complexity (Bialek et al., 2001), and here we see the implications of singularities.
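This kind of parameter degeneracy is easy to exhibit concretely. In a two-layer linear network $f(x) = W_2 W_1 x$, replacing $(W_1, W_2)$ with $(A W_1, W_2 A^{-1})$ for any invertible $A$ leaves the function unchanged, so the parameter-to-function map is far from one-to-one. A minimal sketch (illustrative toy code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer deep linear network: f(x) = W2 @ W1 @ x
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))

# Any invertible A gives different parameters realizing the same function:
# (A @ W1, W2 @ inv(A)) represents the same linear map W2 @ W1.
A = np.diag([2.0, 3.0, 0.5, 1.5, 4.0])
W1_alt = A @ W1
W2_alt = W2 @ np.linalg.inv(A)

x = rng.standard_normal(3)
assert np.allclose(W2 @ W1 @ x, W2_alt @ W1_alt @ x)

# The parameters differ substantially even though the function is identical.
print(np.linalg.norm(W1 - W1_alt))
```

The whole continuous family of such reparameterizations forms a positive-dimensional set of equivalent parameters, which is exactly the degeneracy that makes the model singular.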
Specifically, the free energy, also known as the negative log marginal likelihood, can be shown to asymptotically diverge with sample size $n$ according to the law $an + b\log n + o(\log\log n)$; the coefficient $a$ of the linear term is the minimal loss achievable, and the coefficient $b$ of the remaining logarithmic divergence is then taken as the model complexity of the model class. In regular statistical models, the coefficient $b$ is the number of parameters (divided by 2). In singular statistical models, $b$ is not tied to the number of parameters, indicating a different kind of complexity at play. The LLC arises out of a consideration of the local free energy. We explore the mathematical underpinnings of the LLC in Section 3, utilizing intuitive concepts like volume scaling to aid understanding.

Our contributions encompass 1) the definition of the new LLC complexity measure, 2) the development of a scalable estimator for the LLC, and 3) empirical validations that underscore the accuracy and practicality of the LLC estimator. In particular, we demonstrate that the estimator is accurate and scalable to modern network size in a setting where theoretical learning coefficients are available for comparison. Furthermore, we show empirically that some common training heuristics effectively control the LLC.

Figure 1: Impact of SGD learning rate (top), batch size (middle) and momentum (bottom) when training ResNet18 on CIFAR10. We plot the LLC estimate (left), test accuracy (middle) and train loss (right) across training time. As the strength of the implicit regularization increases, through higher learning rate, lower batch size and higher momentum, the LLC decreases (the network gets "simpler") and test accuracy increases. Even though most training losses collapse to zero, the LLC can discern the implicit regularization pressure applied by various training heuristics.
On this last point, we preview some results; Figure 1 displays the LLC estimate, test accuracy, and training loss over the course of training ResNet18 on CIFAR10. Loosely speaking, lower LLC means a less complex, and thus more degenerate, neural network. In each of the rows, lighter colors represent stronger implicit regularization, e.g., higher learning rate, lower batch size, higher SGD momentum, which we see corresponds to a preference for lower LLC, i.e., simpler neural networks.

2 Setup

Let $W \subset \mathbb{R}^d$ be a compact space of parameters $w \in W$. Consider the model-truth-prior triplet
$$(p(x,y|w),\ q(x,y),\ \varphi(w)), \tag{1}$$
where $q(x,y) = q(y|x)q(x)$ is the true data-generating mechanism, $p(x,y|w) = p(y|x,w)q(x)$ is the posited model with parameter $w$ representing the neural network weights, and $\varphi$ is a prior on $w$. Suppose we are given a training dataset of $n$ input-output pairs, $D_n = \{(x_i, y_i)\}_{i=1}^n$, drawn i.i.d. from $q(x,y)$. To these objects, we can associate the sample negative log likelihood function defined as
$$L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(y_i|x_i,w),$$
and its theoretical counterpart defined as
$$L(w) = -\mathbb{E}_{q(x,y)} \log p(y|x,w).$$
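As a concrete instance of how the negative log likelihood subsumes familiar losses: for a Gaussian observation model $p(y|x,w) = N(y;\, f_w(x), 1)$, the sample NLL $L_n$ equals half the mean squared error plus a constant. A quick numerical check (toy linear model and data, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_normal(n)  # toy data; true slope 2

def nll(w):
    # L_n(w) = -(1/n) sum_i log N(y_i; w * x_i, 1)
    resid = y - w * x
    return np.mean(0.5 * resid**2 + 0.5 * np.log(2 * np.pi))

def half_mse(w):
    return 0.5 * np.mean((y - w * x) ** 2)

# NLL and half-MSE differ only by the constant (1/2) log(2*pi).
w = 1.7
assert np.isclose(nll(w), half_mse(w) + 0.5 * np.log(2 * np.pi))
```

Since additive constants do not affect minimizers or loss differences, MSE training is NLL training under this model; an analogous identity gives cross-entropy for categorical models.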
It is appropriate to also call $L_n$ and $L$ the training and population loss, respectively, since using the negative log likelihood encompasses many classic loss functions used in machine learning and deep learning, such as mean squared error (MSE) and cross-entropy. The behavior of the training and population losses is highly nontrivial for neural networks. To properly account for the model complexity of neural network models, it is critical to engage with the challenges posed by singularities. To appreciate this, we follow Watanabe (2009) and make the following distinction between regular and singular statistical models. A statistical model $p(x,y|w)$ is called regular if it is 1) identifiable, i.e., the parameter-to-distribution map $w \mapsto p(x,y|w)$ is one-to-one, and 2) its Fisher information matrix $I(w)$ is everywhere positive definite. We call a model singular if it is not regular. For an introduction to the implications of singular learning theory for deep learning, we refer the reader to Appendix A and further reading in Wei et al. (2022). We shall assume throughout that the triplet (1) satisfies a few fundamental conditions in SLT (Watanabe, 2009). These conditions are stated and discussed in an accessible manner in Appendix A.1.

3 The local learning coefficient

In this paper, we introduce the Local Learning Coefficient (LLC), an extension of Watanabe (2009)'s global learning coefficient, referred to there simply as the learning coefficient. Below we focus on explaining the LLC through its geometric intuition, specifically as an invariant based on the volume of the loss landscape basin. For readers interested in the detailed theoretical foundations, we have included comprehensive explanations in the appendices. Appendix A offers a short introduction to the basics of Singular Learning Theory (SLT), and Appendix B sets out formal conditions for the well-definedness of the LLC.
This structure ensures that readers with varying levels of familiarity with SLT can engage with the content at their own pace.

3.1 Complexity via counting low-loss parameters

At a local minimum of the population loss landscape there is a natural notion of complexity, given by the number of bits required to specify the minimum to within a tolerance $\epsilon$. This idea is well known in the literature on minimum description length (Grünwald and Roos, 2019) and was used by Hochreiter and Schmidhuber (1997) in an attempt to quantify the complexity of a trained neural network. However, a correct treatment has to take into consideration the degeneracy of the geometry of the population loss $L$, as we now explain.

Consider a local minimum $w^*$ of the population loss $L$ and a closed ball $B(w^*)$ centered on $w^*$ such that for all $w \in B(w^*)$ we have $L(w) \geq L(w^*)$. Given a tolerance $\epsilon > 0$ we can consider the set of parameters
$$B(w^*, \epsilon) = \{w \in B(w^*) \mid L(w) - L(w^*) < \epsilon\}$$
whose loss is within the tolerance, the volume of which we define to be
$$V(\epsilon) := \mathrm{Vol}(B(w^*, \epsilon)) = \int_{B(w^*, \epsilon)} dw. \tag{2}$$
The minimal number of bits to specify this set within the ball is
$$-\log_2\big(V(\epsilon)/\mathrm{Vol}(B(w^*))\big). \tag{3}$$
This can be taken as a measure of the complexity of the set of low-loss parameters near $w^*$. However, as it stands this notion depends on $\epsilon$ and has no intrinsic meaning.
Classically, this is addressed as follows: if a model is regular and thus $L(w)$ is locally quadratic around $w^*$, the volume satisfies a law of the form
$$V(\epsilon) \approx c\,\epsilon^{d/2},$$
where $c$ is a constant that depends on the curvature of the basin around the local minimum $w^*$, and $d$ is the dimension of $W$. This explains why $\frac{d}{2}$ is a valid measure of complexity in the regular case, albeit one that cannot distinguish $w^*$ from any other local minimum. The curvature $c$ is less significant than the scaling exponent, but it can distinguish local minima, and $\log c$ is sometimes used as a complexity measure.

The population loss of a neural network is not locally quadratic near its local minima, from which it follows that the functional form of the volume $V(\epsilon)$ is more complicated than in the regular case. The correct functional form was discovered by Watanabe (2009) and we adapt it here to a local neighborhood of a parameter (see Appendix A for details):

Definition 1 (The Local Learning Coefficient (LLC), $\lambda(w^*)$). There exists a unique rational¹ [¹The fact that $\lambda(w^*)$ is rational-valued, and not real-valued as one would naively assume, is a deep fact of algebraic geometry derived from Hironaka's celebrated resolution of singularities.]
number $\lambda(w^*)$, a positive integer $m(w^*)$, and a constant $c > 0$ such that asymptotically as $\epsilon \to 0$,
$$V(\epsilon) = c\,\epsilon^{\lambda(w^*)}(-\log\epsilon)^{m(w^*)-1} + o\big(\epsilon^{\lambda(w^*)}(-\log\epsilon)^{m(w^*)-1}\big). \tag{4}$$
We call $\lambda(w^*)$ the Local Learning Coefficient (LLC), and $m(w^*)$ the local multiplicity. In the case where $m(w^*) = 1$, the formula simplifies, and
$$V(\epsilon) \propto \epsilon^{\lambda(w^*)}. \tag{5}$$
Thus, the LLC $\lambda(w^*)$ is the (asymptotic) volume scaling exponent near a minimum $w^*$ in the loss landscape: increasing the error tolerance by a factor of $a$ increases the volume by a factor of $a^{\lambda(w^*)}$. Applying Equation (3), the number of bits needed to specify $V(\epsilon)$ within $B(w^*)$, for sufficiently small $\epsilon$ and in the case $m(w^*) = 1$, is approximated by
$$-\lambda(w^*)\log_2\epsilon + O(\log_2\log_2\epsilon).$$
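The scaling law (5) can be probed numerically: estimate $V(\epsilon)$ by Monte Carlo over a box around the minimum and fit the slope of $\log V$ against $\log\epsilon$. A sketch with toy losses (not the paper's experiments) comparing a regular quadratic basin in $d = 2$, where $\lambda = d/2 = 1$, with the classic singular loss $L(w) = w_1^2 w_2^2$, where $\lambda = 1/2$ and $m = 2$; note the $m = 2$ log factor drags the fitted slope somewhat below $1/2$ at these moderate tolerances:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 2, 2_000_000
w = rng.uniform(-1.0, 1.0, size=(n_samples, d))  # uniform over the box B(0)

def volume_exponent(loss_values):
    # Fit log V(eps) ~ lambda * log eps over a range of tolerances;
    # V(eps) is estimated as the fraction of samples with loss < eps.
    eps = np.array([0.1, 0.05, 0.02, 0.01, 0.005])
    vols = np.array([(loss_values < e).mean() for e in eps])
    slope, _ = np.polyfit(np.log(eps), np.log(vols), 1)
    return slope

lam_regular = volume_exponent(w[:, 0]**2 + w[:, 1]**2)   # expect ~ d/2 = 1
lam_singular = volume_exponent(w[:, 0]**2 * w[:, 1]**2)  # smaller exponent

print(lam_regular, lam_singular)
```

The singular minimum has markedly smaller volume-scaling exponent than the regular one, i.e., far more low-loss volume at small tolerances, which is exactly the degeneracy the LLC quantifies.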
Informally, the LLC tells us the number of additional bits needed to halve an already small error tolerance $\epsilon$:
$$-\log_2\big[V(\tfrac{\epsilon}{2})/V(\epsilon)\big] \approx -\lambda(w^*)\log_2\tfrac{\epsilon}{2} + \lambda(w^*)\log_2\epsilon = \lambda(w^*).$$
See Figure LABEL:fig:volume_scaling for simple examples of the LLC as a scaling exponent. We note that, in contrast to the regular case, the scaling exponent $\lambda(w^*)$ depends on the underlying data distribution $q(x,y)$.

The (global) learning coefficient $\lambda$ was defined by Watanabe (2001). If $w^*$ is taken to be a global minimum of the population loss $L(w)$, and the ball $B(w^*)$ is taken to be the entire parameter space $W$, then we obtain the global learning coefficient $\lambda$ as the scaling exponent of the volume $V(\epsilon)$. The learning coefficient and related quantities like the WBIC (Watanabe, 2013) have historically seen significant application in Bayesian model selection (e.g., Endo et al., 2020; Fontanesi et al., 2019; Hooten and Hobbs, 2015; Sharma, 2017; Kafashan et al., 2021; Semenova et al., 2020).

In Appendix C, we prove that the LLC is invariant to local diffeomorphism, that is, roughly, a locally smooth and invertible change of variables. This property is motivated by the desire that a good complexity measure should not be confounded by superficial differences in how a model is represented: two models which are essentially the same should have the same complexity.

4 LLC estimation

Having established the intuition behind the theoretical LLC in terms of volume scaling, we now turn to the task of estimating it.
As described in the introduction, the LLC is a coefficient in the asymptotic expansion of the local free energy. It is this fact that we leverage for estimation. There is a mathematically rigorous link between volume scaling and the appearance of the LLC in the local free energy which we will not discuss in detail; see (Watanabe, 2009, Theorem 7.1). We first introduce what we call the idealized LLC estimator, which is theoretically sound albeit intractable from a computational point of view. In the sections that follow, we walk through the steps taken to engineer a practically implementable version of the idealized LLC estimator.

4.1 Idealized LLC estimator

Consider the following integral
$$Z_n(B_\gamma(w^*)) = \int_{B_\gamma(w^*)} \exp\{-nL_n(w)\}\,\varphi(w)\,dw, \tag{6}$$
where $L_n(w)$ is again the sample negative log likelihood, $B_\gamma(w^*)$ denotes a small ball of radius $\gamma$ around $w^*$, and $\varphi$ is a prior over model parameters $w$. If (6) is high, there is high posterior concentration around $w^*$. In this sense, (6) is a measure of the concentration of low-loss solutions near $w^*$. Next consider a log transformation of $Z_n(B_\gamma(w^*))$ with a negative sign, i.e.,
$$F_n(B_\gamma(w^*)) = -\log Z_n(B_\gamma(w^*)). \tag{7}$$
This quantity is sometimes called the (negative) log marginal likelihood or free energy, depending on the discipline.
Given a local minimum $w^*$ of the population negative log likelihood $L(w)$, it can be shown using the machinery of SLT that, asymptotically in $n$, we have
$$F_n(B_\gamma(w^*)) = \underbrace{nL_n(w^*)}_{\text{energy}} + \underbrace{\lambda(w^*)}_{\text{entropy}}\log n + o_p(\log\log n), \tag{8}$$
where $\lambda(w^*)$ is the theoretical LLC expounded in Section 3. Remarkably, the asymptotic approximation in (8) holds even for singular models such as neural networks; further discussion on this can be found in Appendix B. The asymptotic approximation in (8) suggests that a reasonable estimator of $\lambda(w^*)$ might come from re-arranging (8) to give what we call the idealized LLC estimator,
$$\hat{\lambda}^{\text{idealized}}(w^*) = \frac{F_n(B_\gamma(w^*)) - nL_n(w^*)}{\log n}. \tag{9}$$
But as indicated by the name, the idealized LLC estimator cannot be easily implemented; computing, or even MCMC sampling from, the posterior to estimate $F_n(B_\gamma(w^*))$ is made no less challenging by the need to confine sampling to the neighborhood $B_\gamma(w^*)$. In what follows, we use the idealized LLC estimator as inspiration for a practically-minded LLC estimator.
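In a toy regular model, the idealized estimator (9) can be evaluated directly by quadrature. Take a one-dimensional quadratic loss $L_n(w) = w^2/2$ with $w^* = 0$, $L_n(w^*) = 0$, and a flat prior on a small interval; then $F_n \approx \frac{1}{2}\log n - \frac{1}{2}\log 2\pi$, and (9) approaches the true $\lambda = d/2 = 1/2$ as $n$ grows. A sketch under these assumed simplifications (not the paper's code):

```python
import numpy as np

def idealized_llc(n, radius=1.0, grid=2_000_001):
    # F_n(B) = -log \int_B exp(-n L_n(w)) dw with L_n(w) = w^2 / 2,
    # a flat prior on the ball B = [-radius, radius], and L_n(w*) = 0.
    w = np.linspace(-radius, radius, grid)
    dw = w[1] - w[0]
    Z = np.sum(np.exp(-n * w**2 / 2)) * dw  # Riemann sum for Z_n(B)
    F = -np.log(Z)
    return F / np.log(n)  # (F_n - n L_n(w*)) / log n, with L_n(w*) = 0

for n in [10**3, 10**5, 10**7]:
    print(n, idealized_llc(n))  # drifts toward the true lambda = 0.5
```

The convergence is slow (the neglected $O(1)$ term is divided only by $\log n$), which is one reason practical LLC estimation needs care beyond this idealized form.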
4.2 Surrogate for enforcing $B_\gamma(w^*)$

The first step towards a practically-minded LLC estimator is to circumvent the constraint posed by the neighborhood $B_\gamma(w^*)$. To this end, we introduce a localizing Gaussian prior that acts as a surrogate for enforcing the domain of integration given by $B_\gamma(w^*)$. Specifically, let
$$\varphi_\gamma(w) \propto \exp\Big\{-\frac{\gamma}{2}\|w\|_2^2\Big\}$$
be a Gaussian prior centered at the origin with scale parameter $\gamma > 0$. We replace (6) with
$$Z_n(w^*, \gamma) = \int \exp\{-nL_n(w)\}\,\varphi_\gamma(w - w^*)\,dw,$$
which, for $\beta = 1$, can also be recognized as the normalizing constant of the posterior distribution given by
$$p(w|w^*, \beta, \gamma) \propto \exp\Big\{-n\beta L_n(w) - \frac{\gamma}{2}\|w - w^*\|_2^2\Big\}, \tag{10}$$
where $\beta > 0$ plays the role of an inverse temperature. Large values of $\gamma$ force the posterior distribution in (10) to stay close to $w^*$. A word on the notation $p(w|w^*, \beta, \gamma)$: this is a distribution in $w$ solely; the parameters $w^*, \beta, \gamma$ are fixed, hence the normalizing constant of (10) is an integral over $w$ only.
As $Z_n(w^*, \gamma)$ is to be viewed as a proxy for (6), we shall accordingly treat $F_n(w^*, \gamma) := -\log Z_n(w^*, \gamma)$ as a proxy for $F_n(B_\gamma(w^*))$ in (7). Although it is tempting at this stage to simply drop $F_n(w^*, \gamma)$ into the idealized LLC estimator in place of $F_n(B_\gamma(w^*))$, we have to address estimation of $F_n(w^*, \gamma)$, the subject of the next section.

4.3 The LLC estimator

Let us denote the expectation of a function $f(w)$ with respect to the posterior distribution in (10) as
$$\mathbb{E}_{w|w^*,\beta,\gamma} f(w) := \int f(w)\, p(w|w^*, \beta, \gamma)\,dw.$$
Consider the quantity
$$\mathbb{E}_{w|w^*,\beta^*,\gamma}[nL_n(w)], \tag{11}$$
where the inverse temperature is deliberately set to $\beta^* = 1/\log n$. The quantity in (11) may be regarded as a localized version of the widely applicable Bayesian information criterion (WBIC) first introduced in (Watanabe, 2013). It can be shown that (11) is a good estimator of $F_n(w^*, \gamma)$ in the following sense: the leading-order terms of (11) match those of $F_n(w^*, \gamma)$ when we perform an asymptotic expansion in the sample size $n$. This justifies using (11) to estimate $F_n(w^*, \gamma)$.
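The leading-order match can be checked by quadrature in the same toy quadratic setting as before: with $L_n(w) = w^2/2$ and $w^* = 0$, both $F_n(w^*, \gamma)$ and the localized WBIC quantity (11) grow like $\frac{1}{2}\log n$ plus an $O(1)$ term. A sketch under these assumed simplifications:

```python
import numpy as np

n, gamma = 100_000, 1.0
beta_star = 1.0 / np.log(n)

w = np.linspace(-0.1, 0.1, 400_001)  # fine grid around w* = 0
dw = w[1] - w[0]
Ln = w**2 / 2  # toy sample loss with minimum L_n(w*) = 0 at w* = 0

# F_n(w*, gamma): free energy with the localizing prior, at beta = 1.
Z = np.sum(np.exp(-n * Ln - gamma * w**2 / 2)) * dw
F = -np.log(Z)

# Quantity (11): posterior expectation of n L_n at inverse temperature beta*.
post = np.exp(-n * beta_star * Ln - gamma * w**2 / 2)
post /= post.sum()
wbic_local = np.sum(post * n * Ln)

half_log_n = 0.5 * np.log(n)
print(F, wbic_local, half_log_n)  # both ~ (1/2) log n up to O(1)
```

Both quantities track $\frac{1}{2}\log n$, differing only in their $O(1)$ terms, which is exactly the leading-order agreement invoked above.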
Further discussion on this can be found in Appendix D. Going back to (9), we approximate $F_n(B_\gamma(w^*))$ first with $F_n(w^*, \gamma)$, which is further estimated by (11). We are finally ready to define the LLC estimator.

Definition 2 (Local Learning Coefficient (LLC) estimator). Let $w^*$ be a local minimum of $L(w)$. Let $\beta^* = 1/\log n$. The associated local learning coefficient estimator is given by
$$\hat{\lambda}(w^*) := n\beta^*\big[\mathbb{E}_{w|w^*,\beta^*,\gamma} L_n(w) - L_n(w^*)\big]. \tag{12}$$
Note that $\hat{\lambda}(w^*)$ depends on $\gamma$, but we have suppressed this in the notation.

Let us ponder the pleasingly simple form of the LLC estimator. The expectation term in (12) is a measure of the loss $L_n$ under perturbation near $w^*$. If the perturbed loss, under this expectation, is very close to $L_n(w^*)$, then $\hat{\lambda}(w^*)$ is small. This accords with our intuition that if $w^*$ is simple, its loss should not change too much under reasonable perturbations.

Finally, we note that in applications, we use the empirical loss $L_n(w)$ to determine a critical point of interest, i.e.,
$$\hat{w}_n^* := \arg\min_w L_n(w).$$
We lose something by plugging $\hat{w}_n^*$ into (12) directly, since we end up using the dataset $D_n$ twice.
However, we do not observe adverse effects in our experiments; see Figure 4 for an example.

4.4 The SGLD-based LLC estimator

The LLC estimator defined in (12) is not prescriptive as to how the expectation with respect to the posterior distribution should actually be approximated. A wide array of MCMC techniques are possible. However, to be able to estimate the LLC at the scale of modern deep learning, we must look at efficiency. In practice, the computational bottleneck to implementing (12) is the MCMC sampler. In particular, traditional MCMC samplers must compute log-likelihood gradients across the entire training dataset, which is prohibitively expensive at modern dataset sizes. If one modifies these samplers to take minibatch gradients instead of full-batch gradients, the result is stochastic-gradient MCMC, the prototypical example of which is Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011). The computational cost of this sampler is much lower: roughly the cost of a single SGD step times the number of samples required.

The standard SGLD update applied to sampling (10) at the temperature $\beta^*$ required for the LLC estimator is given by
$$\Delta w_t = \frac{\epsilon}{2}\Big(\frac{\beta^* n}{m}\sum_{(x,y)\in B_t} \nabla \log p(y|x, w_t) + \gamma(w^* - w_t)\Big) + N(0, \epsilon),$$
where $B_t = \{(x_i, y_i)\}_{i=1}^m$ is a randomly sampled minibatch of size $m$ for step $t$, and $\epsilon$ controls both the step size and the variance of the injected Gaussian noise.
Crucially, the log-likelihood gradient is evaluated using minibatches. In practice, we choose to shuffle the dataset once and partition it into a sequence of size-$m$ segments as minibatches, instead of drawing fresh random samples of size $m$. Let us now suppose we have obtained $T$ approximate samples $w_1, w_2, \ldots, w_T$ from the tempered posterior distribution at inverse temperature $\beta^*$ via SGLD. This is usually taken from the SGLD trajectory after burn-in. We can then form what we call the SGLD-based LLC estimator,
$$\hat{\lambda}^{\text{SGLD}}(w^*) := n\beta^*\Big[\frac{1}{T}\sum_{t=1}^T L_n(w_t) - L_n(w^*)\Big]. \tag{13}$$
For further computational savings, we also recycle the forward passes that compute $L_m(w_t)$, which is required for computing $\nabla_w L_m(w_t)$ via back-propagation, as an unbiased estimate of $L_n(w_t)$. Here by $L_m(w_t)$ we mean $-\frac{1}{m}\sum_{(x,y)\in B_t} \log p(y|x, w_t)$, though the notation suppresses the dependence on $B_t$ for brevity. Pseudocode for this minibatch version of the SGLD-based LLC estimator is provided in Appendix G. Henceforth, when we say the SGLD-based LLC estimator, we are referring to the minibatch version.
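To make the recipe concrete, here is a minimal SGLD-based LLC estimate for a toy one-parameter Gaussian location model $p(y|w) = N(y; w, 1)$, which is regular, so the true LLC is $d/2 = 1/2$. All hyperparameters ($\epsilon$, $\gamma$, $m$, $T$) are illustrative choices, not the paper's settings, and the estimate carries some discretization and minibatch bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and model: y ~ N(0, 1), p(y|w) = N(y; w, 1),
# so L_n(w) = (1/2n) sum_i (y_i - w)^2 up to an additive constant.
n = 10_000
y = rng.standard_normal(n)
w_star = y.mean()  # argmin of L_n

def L(batch, w):
    return 0.5 * np.mean((batch - w) ** 2)

# SGLD hyperparameters (illustrative choices).
beta_star = 1.0 / np.log(n)
gamma, eps, m = 100.0, 1e-4, 1_000
T, burn_in = 6_000, 2_000

# Shuffle once and cycle through fixed minibatch segments, as in Section 4.4.
perm = rng.permutation(n)
batches = [y[perm[i:i + m]] for i in range(0, n, m)]

w, losses = w_star, []
for t in range(T):
    batch = batches[t % len(batches)]
    grad_log_p = np.sum(batch - w)  # sum over batch of d/dw log p(y|w)
    drift = (eps / 2) * (beta_star * n / m * grad_log_p + gamma * (w_star - w))
    w += drift + rng.normal(0.0, np.sqrt(eps))
    if t >= burn_in:
        losses.append(L(batch, w))  # recycle the minibatch loss, as in the text

llc = n * beta_star * (np.mean(losses) - L(y, w_star))
print(llc)  # roughly d/2 = 0.5 for this regular model
```

Note that the additive constant in $L_n$ cancels in the difference in (13), and the localizing term $\gamma(w^* - w_t)$ keeps the chain in the basin around $w^*$.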
In Appendix H, we give a comprehensive guide on best practices for implementing the SGLD-based LLC estimator, including choices for $\gamma$, $\epsilon$, the number of SGLD iterations, and the required burn-in.

5 Experiments

The goal of our experiments is to give evidence that the LLC estimator is accurate, scalable, and can reveal insights into deep learning practice. Throughout the experiments, we implement the minibatch version of the SGLD-based LLC estimator presented in the pseudo-algorithm in Appendix G. A number of additional experiments are relegated to the appendices due to space constraints. They include:
• deploying LLC estimation on transformer language models (Appendix L)
• an experiment on a small ReLU network verifying that the SGLD sampler is just as accurate as one that uses full-batch gradients (Appendix M)
• an experiment comparing the optimizers SGD and entropy-SGD, the latter having an intimate connection to our notion of complexity (Appendix N)
• an experiment verifying the scaling invariance property (Appendix O) set out in Appendix C
Every experiment described in the main text below has an accompanying section in the appendix that offers full experimental details, further discussion, and additional figures/tables.

5.1 LLC for Deep Linear Networks (DLNs)

In this section, we verify the accuracy and scalability of our LLC estimator against theoretical LLC values in deep linear networks (DLNs) with up to 100M parameters. Recall that DLNs are fully-connected feedforward neural networks with the identity activation function. The input-output behavior of a DLN is trivial: it is equivalent to a single-layer linear network obtained by multiplying together the weight matrices. However, the geometry of such a model is highly non-trivial; in particular, the optimization dynamics and inductive biases of such networks have seen significant research interest (Saxe et al., 2013; Ji and Telgarsky, 2018; Arora et al., 2018).
Thus one reason we chose to study DLN model complexity is that DLNs have long served as an important sandbox for deep learning theory. Another key factor is the recent derivation of theoretical LLCs for DLNs by Aoyagi (2024), making DLNs the most realistic setting in which theoretical learning coefficients are available. The significance of Aoyagi (2024) lies in the substantial technical difficulty of deriving theoretical global learning coefficients (and, by extension, theoretical LLCs), which means that these coefficients are generally unavailable except in a few exceptional cases, with most of the research conducted decades ago (Yamazaki and Watanabe, 2005b; Aoyagi et al., 2005; Yamazaki and Watanabe, 2003; Yamazaki and Watanabe, 2005a). Further details and discussion of the theoretical result in Aoyagi (2024) can be found in Appendix I.

Figure 4: Estimated LLC against true learning coefficient; model dimension shown in color. On the left, we evaluate the LLC estimator at a global minimum $w^*$ of the population loss. On the right, we evaluate the LLC estimator at a minimum $\hat{w}_n^*$ found by SGD. Fortunately, we do not see an adverse effect of using the training data twice, a minor concern raised at the end of Section 4.3. The estimated LLCs accurately measure the learning coefficient $\lambda$ up to 100 million parameters in deep linear networks, as compared to known theoretical values (dashed line). See Figure J.1 for linear-scale plots.

The results are summarized in Figure 4. Our LLC estimator is able to accurately estimate the learning coefficient in DLNs with up to 100M parameters. We further show that accuracy is maintained even if (as is typical in practice) one does not evaluate the LLC at a local minimum of the population loss, but is instead forced to use SGD to first find a local minimum $\hat{w}_n^*$.
Further experimental details and results for this section are given in Appendix J. Overall, it is quite remarkable that our LLC estimator, after the series of engineering steps described in Section 4, maintains such a high level of accuracy.

5.2 LLC for ResNet

Here, we empirically test whether implicit regularization induces a preference for simplicity, as manifested by a lower LLC. We examine the effects of SGD learning rate, batch size, and momentum on the LLC. Note that we do so in isolation, i.e., we do not look at interactions between these factors: for instance, when we vary the learning rate, we hold everything else constant, including batch size and momentum. We perform experiments on ResNet18 trained on CIFAR10 and show the results in Figure 1. The training loss reaches zero in most instances and therefore cannot, on its own, distinguish between the effects of different implicit regularization strengths. In contrast, we see a consistent pattern of "stronger implicit regularization = higher test accuracy = lower LLC". Specifically, higher learning rate, lower batch size, and higher momentum all apply stronger implicit regularization, which is reflected in a lower LLC. Full experimental details can be found in Appendix K. Note that in Figure 1 we employed SGD without momentum in the top two rows. We repeat the experiments in these top two rows for SGD with momentum; the associated results in Figure K.1 support very similar conclusions. We also conducted explicit regularization experiments involving L2 regularization (Figure K.2 in Appendix K) and again conclude that stronger regularization is accompanied by a lower LLC. In contrast to the LLC estimates for DLNs, the LLC estimates for ResNet18 cannot be calibrated, as we do not know the true LLC values. As with many models in practice, we find value in the LLC estimates by comparing their relative values between models with shared context.
For instance, when comparing LLC values we might hold everything constant while varying one factor such as SGD batch size, learning rate, or momentum, as we have done in this section.

6 Related work

We briefly cover related work here; more detail may be found in Appendix E. The primary reference for the theoretical foundation of this work, known as SLT, is Watanabe (2009). The global learning coefficient, first introduced by Watanabe (2001), governs the asymptotic expansion of the free energy, which is equivalent to the negative log Bayes marginal likelihood, an all-important quantity in Bayesian analysis. Subsequent research has utilized algebraic-geometric tools to calculate the learning coefficient for various machine learning models (Yamazaki and Watanabe, 2005b; Aoyagi et al., 2005; Aoyagi, 2024; Yamazaki and Watanabe, 2003; Yamazaki and Watanabe, 2005a). SLT has also enhanced the understanding of model selection criteria in Bayesian statistics. Of particular relevance to this work is Watanabe (2013), which introduced the WBIC estimator of the free energy. This estimator has been applied in various practical settings (Endo et al., 2020; Fontanesi et al., 2019; Hooten and Hobbs, 2015; Sharma, 2017; Kafashan et al., 2021; Semenova et al., 2020). The LLC can be seen as a singularity-aware version of basin broadness measures, which attempt to connect geometric "broadness" or "flatness" with model complexity (Hochreiter and Schmidhuber, 1997; Jiang et al., 2019). In particular, the LLC estimator bears a resemblance to PAC-Bayes-inspired flatness/sharpness measures (Neyshabur et al., 2017), but takes into account the non-Gaussian nature of the posterior distribution in singular models.
The LLC is set apart from classic model complexity measures such as Rademacher complexity (Koltchinskii and Panchenko, 2000) and the VC dimension (Vapnik and Chervonenkis, 1971) because it measures the complexity of a specific model $p(y|x,w)$ rather than the complexity of the function class $\{f(x|w) : w \in W\}$, where $f$ is the neural network function. This makes the LLC an appealing tool for understanding the interplay between function class, data properties, and training heuristics. Concurrent work by Chen et al. (2023a) proposes a measure called the learning capacity, which can be viewed as a finite-$n$ version of the learning coefficient, and investigates its behavior as a function of training set size $n$.

7 Outlook

An exciting direction for future research is to study the role of the LLC in detecting phase transitions and emergent abilities in deep learning models. A first step in this direction was undertaken in Chen et al. (2023b), where the energy (training loss) and entropy (both estimated and theoretical LLC) were tracked as training progressed; it was observed that energy and entropy proceed along staircases in opposing directions. Further, Hoogland et al. (2024) showed how the estimated LLC can be used to detect phase transitions in the formation of in-context learning in transformer language models. It is natural to wonder whether the LLC could shed light on the hypothesis that SGD-trained neural networks sequentially learn the target function with a "saddle-to-saddle" dynamic. Previous theoretical works had to devise complexity measures on a case-by-case basis (Abbe et al., 2023; Berthier, 2023). We posit that the free energy perspective could offer a more unified and general approach to understanding the intricate dynamics of learning in deep neural networks by accounting for the competition between model fit and model complexity during training.

References

Abbe et al., (2023) Abbe, E., Adserà, E.
B., and Misiakiewicz, T. (2023). Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics. In Neu, G. and Rosasco, L., editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 2552–2623. PMLR. Aoyagi, (2024) Aoyagi, M. (2024). Consideration on the learning efficiency of multiple-layered neural networks with linear units. Neural Networks, page 106132. Aoyagi et al., (2005) Aoyagi, M., Watanabe, S., et al. (2005). Resolution of singularities and the generalization error with Bayesian estimation for layered neural network. IEICE Trans, 88(10):2112–2124. Arora et al., (2018) Arora, S., Cohen, N., Golowich, N., and Hu, W. (2018). A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281. Balasubramanian, (1997) Balasubramanian, V. (1997). Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9(2):349–368. Berner et al., (2019) Berner, J., Elbrächter, D., and Grohs, P. (2019). How degenerate is the parametrization of neural networks with the ReLU activation function? In Neural Information Processing Systems. Berthier, (2023) Berthier, R. (2023). Incremental learning in diagonal linear networks. Journal of Machine Learning Research, 24(171):1–26. Bialek et al., (2001) Bialek, W., Nemenman, I., and Tishby, N. (2001). Predictability, complexity, and learning. Neural computation, 13(11):2409–2463. Blalock et al., (2020) Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., and Guttag, J. (2020). What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2:129–146. Chaudhari et al., (2019) Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2019). Entropy-SGD: Biasing gradient descent into wide valleys. 
Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018. Chen et al., (2023a) Chen, D., Chang, W., and Chaudhari, P. (2023a). Learning Capacity: A Measure of the Effective Dimensionality of a Model. Chen et al., (2023b) Chen, Z., Lau, E., Mendel, J., Wei, S., and Murfet, D. (2023b). Dynamical versus Bayesian phase transitions in a toy model of superposition. Deng, (2012) Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142. Dherin et al., (2022) Dherin, B., Munn, M., Rosca, M., and Barrett, D. G. T. (2022). Why neural networks find simple solutions: the many regularizers of geometric complexity. Dinh et al., (2017) Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR. Endo et al., (2020) Endo, A., Abbott, S., Kucharski, A. J., Funk, S., et al. (2020). Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China. Wellcome Open Research, 5. Farrugia-Roberts, (2023) Farrugia-Roberts, M. (2023). Functional equivalence and path connectivity of reducible hyperbolic tangent networks. In Thirty-seventh Conference on Neural Information Processing Systems. Fefferman, (1994) Fefferman, C. (1994). Reconstructing a neural net from its output. Revista Matemática Iberoamericana, 10(3):507–555. Fontanesi et al., (2019) Fontanesi, L., Gluth, S., Spektor, M. S., and Rieskamp, J. (2019). A reinforcement learning diffusion decision model for value-based decisions. Psychonomic Bulletin & Review, 26(4):1099–1121. Fukumizu, (1996) Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871–879. Gao et al., (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. (2020).
The Pile: an 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027. Grünwald and Roos, (2019) Grünwald, P. and Roos, T. (2019). Minimum description length revisited. International Journal of Mathematics for Industry, 11(01):1930001. Haario et al., (1999) Haario, H., Saksman, E., and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14:375–395. Haario et al., (2001) Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, pages 223–242. He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE. Hinton et al., (2015) Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Hironaka, (1964) Hironaka, H. (1964). Resolution of Singularities of an Algebraic Variety Over a Field of Characteristic Zero: I. Annals of Mathematics, 79(1):109–203. Hochreiter and Schmidhuber, (1997) Hochreiter, S. and Schmidhuber, J. (1997). Flat Minima. Neural Computation, 9(1):1–42. Hoogland and van Wingerden, (2023) Hoogland, J. and van Wingerden, S. (2023). You’re measuring model complexity wrong. Hoogland et al., (2024) Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., and Murfet, D. (2024). The developmental landscape of in-context learning. Hooten and Hobbs, (2015) Hooten, M. B. and Hobbs, N. T. (2015). A guide to Bayesian model selection for ecologists. Ecological Monographs, 85(1):3–28. Imai, (2019a) Imai, T. (2019a). Estimating real log canonical thresholds. arXiv preprint arXiv:1906.01341. Imai, (2019b) Imai, T. (2019b). On the overestimation of widely applicable Bayesian information criterion. arXiv preprint arXiv:1908.10572. Iriguchi and Watanabe, (2007) Iriguchi, R. and Watanabe, S. (2007).
Estimation of poles of zeta function in learning theory using Padé approximation. In Artificial Neural Networks–ICANN 2007: 17th International Conference, Porto, Portugal, September 9-13, 2007, Proceedings, Part I 17, pages 88–97. Springer. Ji and Telgarsky, (2018) Ji, Z. and Telgarsky, M. (2018). Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032. Jiang et al., (2019) Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2019). Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations. Jules et al., (2023) Jules, T., Brener, G., Kachman, T., Levi, N., and Bar-Sinai, Y. (2023). Charting the topography of the neural network landscape with thermal-like noise. Kafashan et al., (2021) Kafashan, M., Jaffe, A. W., Chettih, S. N., Nogueira, R., Arandia-Romero, I., Harvey, C. D., Moreno-Bote, R., and Drugowitsch, J. (2021). Scaling of sensory information in large neural populations shows signatures of information-limiting correlations. Nature Communications, 12(1):473. Koltchinskii and Panchenko, (2000) Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In Giné, E., Mason, D. M., and Wellner, J. A., editors, High Dimensional Probability I, pages 443–457, Boston, MA. Birkhäuser Boston. Krizhevsky, (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Kůrková and Kainen, (1994) Kůrková, V. and Kainen, P. C. (1994). Functionally equivalent feedforward neural networks. Neural Computation, 6:543–558. LaMont and Wiggins, (2019) LaMont, C. H. and Wiggins, P. A. (2019). Correspondence between thermodynamics and inference. Physical Review E, 99(5):052140. Smith and Le, (2018) Smith, S. L. and Le, Q. V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent. In International Conference on Learning Representations. Nanda and Bloom, (2022) Nanda, N. and Bloom, J. (2022).
TransformerLens. Neyshabur et al., (2017) Neyshabur, B., Bhojanapalli, S., Mcallester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Petersen et al., (2020) Petersen, P. C., Raslan, M., and Voigtlaender, F. (2020). Topological properties of the set of functions generated by neural networks of fixed size. Foundations of Computational Mathematics, 21:375 – 444. Phuong and Lampert, (2020) Phuong, M. and Lampert, C. H. (2020). Functional vs. parametric equivalence of ReLU networks. In International Conference on Learning Representations. Press et al., (2021) Press, O., Smith, N. A., and Lewis, M. (2021). Shortformer: Better language modeling using shorter inputs. Roberts and Rosenthal, (1998) Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268. Saxe et al., (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Semenova et al., (2020) Semenova, E., Williams, D. P., Afzal, A. M., and Lazic, S. E. (2020). A Bayesian neural network for toxicity prediction. Computational Toxicology, 16:100133. Sharma, (2017) Sharma, S. (2017). Markov chain Monte Carlo methods for Bayesian data analysis in astronomy. Annual Review of Astronomy and Astrophysics, 55:213–259. Skalse, (2023) Skalse, J. (2023). My criticism of singular learning theory. Sussmann, (1992) Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5(4):589–593. Valle-Perez et al., (2018) Valle-Perez, G., Camargo, C. Q., and Louis, A. A. (2018). 
Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522. Vapnik and Chervonenkis, (1971) Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280. Waskom, (2021) Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021. Watanabe, (2001) Watanabe, S. (2001). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933. Watanabe, (2009) Watanabe, S. (2009). Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, USA. Watanabe, (2010) Watanabe, S. (2010). Asymptotic learning curve and renormalizable condition in statistical learning theory. Journal of Physics: Conference Series, 233:012014. Watanabe, (2013) Watanabe, S. (2013). A Widely Applicable Bayesian Information Criterion. Journal of Machine Learning Research, 14(Mar):867–897. Watanabe, (2018) Watanabe, S. (2018). Mathematical theory of Bayesian statistics. CRC Press. Wei et al., (2022) Wei, S., Murfet, D., Gong, M., Li, H., Gell-Redman, J., and Quella, T. (2022). Deep Learning Is Singular, and That’s Good. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14. Welling and Teh, (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning. Xie et al., (2023) Xie, S. M., Santurkar, S., Ma, T., and Liang, P. (2023). Data selection for language models via importance resampling. arXiv preprint arXiv:2302.03169. Yamazaki and Watanabe, (2003) Yamazaki, K. and Watanabe, S. (2003). Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16(7):1029–1038. Yamazaki and Watanabe, (2005a) Yamazaki, K. and Watanabe, S. (2005a).
Algebraic geometry and stochastic complexity of hidden Markov models. Neurocomputing, 69(1-3):62–84. Yamazaki and Watanabe, (2005b) Yamazaki, K. and Watanabe, S. (2005b). Singularities in complete bipartite graph-type Boltzmann machines and upper bounds of stochastic complexities. IEEE Transactions on Neural Networks, 16(2):312–324. Yang et al., (2022) Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. (2022). Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466. Zhang et al., (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. Zhang et al., (2018) Zhang, Y., Saxe, A. M., Advani, M. S., and Lee, A. A. (2018). Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning. Molecular Physics, 116(21-22):3214–3223.

Appendix A Background on Singular Learning Theory

Most models in machine learning are singular: they contain parameters where the Fisher information matrix is singular. While the parameters at which the Fisher information is degenerate form a measure-zero subset, their effect is far from negligible. Singular Learning Theory (SLT) shows that the geometry in the neighbourhood of these degenerate points determines the asymptotics of learning (Watanabe, 2009). The theory explains observable effects of degeneracy in common machine learning models under practical settings (Watanabe, 2018) and has the potential to account for important phenomena in deep learning (Wei et al., 2022). The central quantity of SLT is the learning coefficient $\lambda$. Many notable SLT results are conveyed through the learning coefficient, which can be thought of as the complexity of the model class relative to the true distribution.
In this section we carefully define the (global) learning coefficient $\lambda$, which we then contrast with the local learning coefficient $\lambda(w^*)$. For notational simplicity, we consider the unsupervised setting, where we have a model $p(x|w)$ parameterized by a compact parameter space $W \subset \mathbb{R}^d$. We assume fundamental conditions I and II of (Watanabe, 2009, §6.1, §6.2). In particular, $W$ is defined by a finite set of real analytic inequalities; at every parameter $w \in W$ the distribution $p(x|w)$ has the same support as the true density $q(x)$; and the prior density $\varphi(w) = \varphi_1(w)\varphi_2(w)$ is a product of a positive smooth function $\varphi_1(w)$ and a non-negative real analytic function $\varphi_2(w)$. We refer to

$$(p(x|w),\; q(x),\; \varphi(w)) \tag{14}$$

as the model-truth-prior triplet. Let $K(w)$ be the Kullback-Leibler divergence between the truth and the model,

$$K(w) := \mathrm{KL}\big(q(x)\,\|\,p(x|w)\big) = \int q(x)\log\frac{q(x)}{p(x|w)}\,dx,$$

and define the (average) negative log likelihood to be

$$L(w) := -\int q(x)\log p(x|w)\,dx = K(w) + S,$$

where $S = -\int q(x)\log q(x)\,dx$ is the entropy of the true distribution. Set $K_0 = \inf_{w \in W} K(w)$ and let

$$W_0 := \{w \in W : K(w) = K_0\}$$

be the set of optimal parameters. We say the truth is realizable by the model if $K_0 = 0$.
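As a concrete illustration of these definitions (ours, not from the paper), consider the one-dimensional Gaussian model $p(x|w) = N(x; w, 1)$ with truth $q(x) = N(x; 0, 1)$:

```latex
K(w) = \mathrm{KL}\big(N(x;0,1)\,\|\,N(x;w,1)\big) = \tfrac{1}{2}w^2,
\qquad
K_0 = \inf_{w \in W} K(w) = 0,
\qquad
W_0 = \{0\}.
```

Here the truth is realizable ($K_0 = 0$) and $W_0$ is a single nondegenerate minimum; singular models are precisely those in which $W_0$ fails to be a set of isolated points with non-singular Fisher information.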
We do not assume that our models are regular or that the truth is realizable, but we do assume that the model-truth-prior triplet satisfies the more general condition of relatively finite variance (Watanabe, 2013). We also assume that there exists $w_0^*$ in the interior of $W$ satisfying $K(w_0^*) = K_0$. Following Watanabe (2009) we define:

Definition 3. The zeta function of (14) is defined for $\mathrm{Re}(z) > 0$ by

$$\zeta(z) = \int_W \big(K(w) - K_0\big)^z\, \varphi(w)\, dw$$

and can be analytically continued to a meromorphic function on the complex plane whose poles are all real, negative, and rational (Watanabe, 2009, Theorem 6.6). Let $-\lambda \in \mathbb{R}$ be the largest pole of $\zeta$ and $m$ its multiplicity. Then the learning coefficient and its multiplicity for the triplet (14) are defined to be $\lambda$ and $m$, respectively.

When $p(x|w)$ is a singular model, $W_0$ is an analytic variety which is in general positive-dimensional (not a collection of isolated points). As long as $\varphi > 0$ on $W_0$, the learning coefficient $\lambda$ is equal to a birational invariant of $W_0$ known in algebraic geometry as the Real Log Canonical Threshold (RLCT). We will always assume this is the case, and now recall how $\lambda$ is described geometrically.
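A minimal worked example of Definition 3 (ours, not from the paper): take $W = [-1, 1]$ with uniform prior $\varphi(w) = \tfrac{1}{2}$ and $K(w) - K_0 = w^2$. Then

```latex
\zeta(z) = \int_{-1}^{1} (w^2)^z \,\tfrac{1}{2}\, dw
         = \int_{0}^{1} w^{2z}\, dw
         = \frac{1}{2z+1},
```

a meromorphic function with a single simple pole at $z = -\tfrac{1}{2}$, so $\lambda = \tfrac{1}{2}$ and $m = 1$, the regular value $d/2$ for $d = 1$. Replacing $w^2$ by the degenerate $w^4$ gives $\zeta(z) = 1/(4z+1)$ with largest pole $-\tfrac{1}{4}$, hence $\lambda = \tfrac{1}{4} < d/2$: flatter singularities yield smaller learning coefficients.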
With $W_\epsilon = \{w \in W : K(w) - K_0 \le \epsilon\}$ for sufficiently small $\epsilon$, resolution of singularities (Hironaka, 1964) gives the existence of a birational proper map $g : M \to W_\epsilon$ from an analytic manifold $M$ which monomializes $K(w) - K_0$ in the following sense, described precisely in (Watanabe, 2009, Theorem 6.5): there are local coordinate charts $M_\alpha$ covering $g^{-1}(W_0)$ with coordinates $u$ such that the reparameterization $w = g(u)$ puts $K(w) - K_0$ and $\varphi(w)\,dw$ into normal crossing form

$$K(g(u)) - K_0 = u_1^{2k_1}\cdots u_d^{2k_d} \tag{15}$$

$$|g'(u)| = b(u)\, u_1^{h_1}\cdots u_d^{h_d} \tag{16}$$

$$\varphi(w)\,dw = \varphi(g(u))\,|g'(u)|\,du \tag{17}$$

for some positive smooth function $b(u)$. The RLCT of $K - K_0$ is independent of the (non-unique) resolution map $g$, and may be computed as (Watanabe, 2009, Definition 6.4)

$$\lambda = \min_\alpha \min_{j=1,\ldots,d} \frac{h_j + 1}{2k_j}. \tag{18}$$

The multiplicity is defined as $m = \max_\alpha \#\{j : \lambda_j = \lambda\}$, where $\lambda_j = (h_j + 1)/(2k_j)$. For $P \in W_0$ there exist coordinate charts $M_{\alpha^*}$ such that $g(0) = P$ and (15), (16) hold.
The RLCT of $K - K_0$ at $P$ is then (Watanabe, 2009, Definition 2.7)

$$\lambda(P) = \min_{\alpha^*} \min_{j=1,\ldots,d} \frac{h_j + 1}{2k_j}. \tag{19}$$

We then have $\lambda = \inf_{P \in W_0} \lambda(P)$. In regular models, all $\lambda(P)$ equal $d/2$, hence the RLCT is $\lambda = d/2$ with $m = 1$; see (Watanabe, 2009, Remark 1.15).

A.1 Background assumptions in SLT

There are a few technical assumptions throughout SLT that we collect in this section. We note that these are sufficient but not necessary conditions for many of the results in SLT; most of the assumptions can be relaxed on a case-by-case basis without invalidating the main conclusions of SLT. For an in-depth discussion see Watanabe (2009, 2010, 2018). With the same hypotheses as above, the log likelihood ratio

$$r(x,w) := \log\frac{p(x|w_0)}{p(x|w)}$$

is assumed to be an $L^s(q(x))$-valued analytic function of $w$ with $s \ge 2$ that can be extended to a complex analytic function on $W_{\mathbb{C}} \subset \mathbb{C}^d$. A more conceptually significant assumption is the condition of relatively finite variance, which consists of the following two requirements:

1. For any optimal parameters $w_1, w_2 \in W_0$, we have $p(x|w_1) = p(x|w_2)$ almost everywhere. This is also known as essential uniqueness.

2.
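Formulas (18) and (19) reduce to simple arithmetic once the exponents $(k_j, h_j)$ of a resolution are known. The sketch below (ours) takes hypothetical chart data and returns $\lambda$ and its multiplicity as exact fractions; coordinates with $k_j = 0$ do not constrain the threshold and are skipped.

```python
from fractions import Fraction

def rlct_of_chart(k, h):
    """RLCT contribution of one chart in which, per (15)-(16),
    K(g(u)) - K_0 = u_1^{2 k_1} ... u_d^{2 k_d} and
    |g'(u)| = b(u) u_1^{h_1} ... u_d^{h_d}."""
    ratios = [Fraction(hj + 1, 2 * kj) for kj, hj in zip(k, h) if kj > 0]
    lam = min(ratios)
    return lam, ratios.count(lam)

def rlct(charts):
    """Global lambda via (18): minimize over charts; the multiplicity is the
    largest per-chart count among charts attaining the minimum."""
    lam = min(rlct_of_chart(k, h)[0] for k, h in charts)
    mult = max(m for k, h in charts
               for l, m in [rlct_of_chart(k, h)] if l == lam)
    return lam, mult
```

For instance, the chart $K = u_1^2 u_2^2$ with trivial Jacobian (`k = [1, 1]`, `h = [0, 0]`) gives $\lambda = \tfrac{1}{2}$ with multiplicity 2, while $K = u_1^4$ (`k = [2]`, `h = [0]`) gives $\lambda = \tfrac{1}{4}$.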
There exists $c > 0$ such that for all $w \in W$,

$$\mathbb{E}_{q(x)}\big[r(x,w)\big] \ge c\, \mathbb{E}_{q(x)}\big[r(x,w)^2\big].$$

Note that if the true density $q(x)$ is realizable by the model, i.e. there exists $w^* \in W$ such that $q(x) = p(x|w^*)$ almost everywhere, then these conditions are automatically satisfied. This includes settings in which the training labels are synthetically generated by passing inputs through a target model (so realizability holds by construction), such as our experiments involving DLNs and some of the experiments involving MLPs. For experiments involving real data, it is unclear how reasonable it is to assume that relatively finite variance holds.

Appendix B Well-definedness of the theoretical LLC

Here we remark on how we adapt the (global) learning coefficient of Definition 3 to define the local learning coefficient in terms of the poles of a zeta function, and how this relates to the asymptotic volume formula in Definition 1 of the main text. In addition to the setup described above, suppose now that we have a local minimum $w^*$ of the negative log likelihood $L(w)$, and as in Section 3.1 we take $B(w^*)$ to be a closed ball centered on $w^*$ such that $L(w^*)$ is the minimum value of $L$ on the ball. We define

$$V := \mathrm{Vol}(B(w^*)) = \int_{B(w^*)} \varphi(w)\,dw, \qquad \bar{\varphi}(w) = \frac{1}{V}\varphi(w),$$

and then form the local triplet $(p, q, \bar{\varphi})$ with parameter space $B(w^*)$.
Note that $W$ is cut out by a finite number of inequalities between analytic functions, and hence so is $B(w^*)$. Provided $\varphi(w^*) > 0$, the prior does not contribute to the leading terms of the asymptotic expansions considered; in particular it does not appear in the asymptotic formula for the volume in Definition 1, and so we disregard it in the main text for simplicity. Assuming relatively finite variance of the local triplet, we can apply the discussion of Section A. In particular, we can define the LLC $\lambda(w^*)$ and the local multiplicity $m(w^*)$ in terms of the poles of the "local" zeta function
\[
\zeta(z, w^*) = \int_{B(w^*)} \big( K(w) - K(w^*) \big)^z\, \bar{\varphi}(w)\, dw.
\]
Note that $L(w) - L(w^*) = K(w) - K(w^*)$, since $K$ and $L$ differ by $S$, which does not depend on $w$. In order for the LLC to be a learning coefficient in its own right, it must be related, asymptotically, to the local free energy in the manner stipulated in (8) in the main text. We verify this next. To derive the asymptotic expansion of the local free energy $F_n(B(w^*))$, we assume in addition that $\lambda(w^*) \leq \lambda(P)$ whenever $P \in B(w^*)$ and $L(P) = L(w^*)$. That is, $w^*$ is at least as degenerate as any nearby minimiser.
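As a toy illustration of the local zeta function (our own example, not from the paper), consider the one-dimensional loss $K(w) = w^4$ on $B = [-1, 1]$ with the uniform normalized prior $\bar{\varphi}(w) = 1/2$. Here $\zeta(z)$ has a simple closed form whose single pole gives the LLC directly, and the same value is produced by the normal crossing recipe of (19):

```python
import numpy as np

# Toy sketch: for K(w) = w^4 on B = [-1, 1] with uniform normalized prior 1/2,
# the local zeta function is
#   zeta(z) = \int_B K(w)^z * (1/2) dw = 1 / (4z + 1),
# whose single pole at z = -1/4 gives lambda = 1/4 with multiplicity m = 1,
# matching (h + 1) / (2k) = (0 + 1) / (2 * 2) from the normal crossing form (19).
def zeta_numeric(z, n=4_000_001):
    """Riemann-sum approximation of zeta(z) for real z > 0."""
    w = np.linspace(-1.0, 1.0, n)
    return np.mean(np.abs(w) ** (4.0 * z) * 0.5) * 2.0

zeta_half = zeta_numeric(0.5)  # closed form gives 1 / (4 * 0.5 + 1) = 1/3
```

The numeric values agree with $1/(4z+1)$ to a few decimal places, which is what pins down the pole location and hence the LLC in this toy case.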
Note that the KL divergence for this triplet is just the restriction of $K : W \to \mathbb{R}$ to $B(w^*)$, but the local triplet has its own set of optimal parameters
\[
W_0(w^*) = \{ w \in B(w^*) : L(w) = L(w^*) \}. \tag{20}
\]
Borrowing the proof in Watanabe (2009, §3), we can show that
\[
F_n(B(w^*)) = n L_n(w^*) + \lambda(w^*) \log n - (m(w^*) - 1) \log \log n + O_P(1), \tag{21}
\]
where the difference between $\varphi$ and $\bar{\varphi}$ contributes a summand $\log V$ to the constant term. Note that the condition of relatively finite variance is used to establish (21). This explains why we can consider $\lambda(w^*)$ a local learning coefficient, following the ideas sketched in (Watanabe, 2009, Section 7.6). The presentation of the LLC in terms of volume scaling given in the main text now follows from (Watanabe, 2009, Theorem 7.1).

Appendix C Reparameterization invariance of the LLC

The LLC is invariant to local diffeomorphisms of the parameter space, that is, roughly, to locally smooth and invertible changes of variables. Note that this automatically implies several weaker notions of invariance, such as rescaling invariance in feedforward neural networks; other complexity measures, e.g. Hessian-based ones, have been undermined by their failure to remain invariant under this symmetry (Dinh et al., 2017). We now show how the LLC is invariant to local diffeomorphism. The fact that the LLC is an asymptotic quantity is crucial; this property would not hold for the volume $V(\epsilon)$ itself, for any value of $\epsilon$.
Let $U \subset W$ and $\tilde{U} \subset \tilde{W}$ be open subsets of parameter spaces $W$ and $\tilde{W}$. A local diffeomorphism is an invertible map $\phi : U \to \tilde{U}$ such that both $\phi$ and $\phi^{-1}$ are infinitely differentiable. We further require that $\phi$ respect the loss function: that is, if $L : U \to \mathbb{R}$ and $\tilde{L} : \tilde{U} \to \mathbb{R}$ are loss functions on each space, we insist that $L(u) = \tilde{L}(\phi(u))$ for all $u \in U$. Supposing such a $\phi$ exists, the statement to be proved is that the LLC at $u^* \in U$ under $L(u)$ equals the LLC at $\tilde{u}^* = \phi(u^*) \in \tilde{U}$ under $\tilde{L}(\tilde{u})$. Define
\[
V(\epsilon) = \int_{L(u) - L(u^*) < \epsilon} du, \qquad \tilde{V}(\epsilon) = \int_{\tilde{L}(\tilde{u}) - \tilde{L}(\tilde{u}^*) < \epsilon} d\tilde{u}.
\]
Now note that by the change of variables formula for diffeomorphisms, we have
\[
\tilde{V}(\epsilon) = \int_{L(u) - L(u^*) < \epsilon} |\det D\phi(u)|\, du,
\]
where $\det D\phi(u)$ is the Jacobian determinant of $\phi$ at $u$.
The fact that $\phi$ is a local diffeomorphism implies that there exist constants $c_1, c_2 > 0$ such that $c_1 \leq |\det D\phi(u)| \leq c_2$ for all $u \in U$. This means that
\[
c_1 V(\epsilon) \leq \tilde{V}(\epsilon) \leq c_2 V(\epsilon).
\]
Finally, applying the definition of the LLC $\lambda$ and its multiplicity $m$, and leveraging the fact that this definition is asymptotic as $\epsilon \to 0$, we can conclude that
\[
V(\epsilon) \propto \tilde{V}(\epsilon) \propto \epsilon^{\lambda} (-\log \epsilon)^{m-1},
\]
which demonstrates that the LLC is preserved by the local diffeomorphism $\phi$.

Appendix D Consistency of local WBIC

In the main text, we introduced (11) as an estimator of $F_n(w^*, \gamma)$. Our motivation for this estimator comes directly from the well-known widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013). In this section, we refer to (11) as the local WBIC and denote it $\mathrm{WBIC}(w^*)$. It is a direct extension of the proofs in Watanabe (2013) to show that the first two terms in the asymptotic expansion of the local WBIC match those of $F_n(w^*, \gamma)$.
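The invariance argument above can be checked numerically in a toy one-dimensional example (our own illustration, not from the paper): under the rescaling $\phi(u) = 2u$, the sublevel-set volume $V(\epsilon)$ itself changes by a constant factor, but the fitted scaling exponent $\lambda$ does not:

```python
import numpy as np

# Toy numerical check: under phi(u) = 2u, the loss L(u) = u^2 becomes
# Ltilde(v) = (v / 2)^2. The sublevel-set volumes differ by a factor of 2,
# but the exponent in V(eps) ~ eps^lambda is 1/2 in both parameterizations.
def volume_exponent(loss, lo=-4.0, hi=4.0, n=2_000_001):
    u = np.linspace(lo, hi, n)
    Lu = loss(u)
    eps = np.array([1e-2, 1e-3, 1e-4])
    vols = np.array([(Lu < e).mean() * (hi - lo) for e in eps])
    # slope of log V(eps) against log eps estimates lambda
    return np.polyfit(np.log(eps), np.log(vols), 1)[0]

lam_original = volume_exponent(lambda u: u ** 2)           # approx 0.5
lam_rescaled = volume_exponent(lambda v: (v / 2.0) ** 2)   # approx 0.5
```

Note that the raw volumes at any fixed $\epsilon$ differ by a factor of $2$ between the two parameterizations, which is exactly why only the asymptotic exponent, not $V(\epsilon)$ itself, is a well-defined invariant.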
By this we mean that it can be shown that
\[
F_n(w^*, \gamma) = n L_n(w^*) + \tilde{\lambda}(w^*) \log n - (m - 1) \log \log n + R_n
\]
and
\[
\mathrm{WBIC}(w^*) = n L_n(w^*) + \tilde{\lambda}(w^*) \log n + U_n \sqrt{\tilde{\lambda}(w^*) \log n / 2} + O_P(1). \tag{22}
\]
This firmly establishes that (11) is a good estimator of $F_n(w^*, \gamma)$. However, it is important to understand that we cannot immediately conclude that it is a good estimator of $F_n(B_\gamma(w^*))$. We conjecture that $\tilde{\lambda}(w^*)$ is equal to the LLC $\lambda(w^*)$ given certain conditions on $\gamma$; so far, all our empirical findings support this. A detailed proof of this conjecture, in particular ascertaining the exact conditions on $\gamma$, is left as future theoretical work.

Appendix E Related work

We review both the singular learning theory literature directly involving the learning coefficient, as well as research from other areas of machine learning that may be relevant. This is an expanded version of the discussion in Section 6.

Singular learning theory. Our work builds upon the singular learning theory (SLT) of Bayesian statistics; good references are Watanabe (2009, 2018). The global learning coefficient, first introduced by Watanabe (2001), provides the asymptotic expansion of the free energy, which is equivalent to the negative log Bayes marginal likelihood, an all-important quantity in Bayesian analysis.
Later work used algebro-geometric tools to bound or exactly calculate the learning coefficient for a wide range of machine learning models, including Boltzmann machines (Yamazaki and Watanabe, 2005b), single-hidden-layer neural networks (Aoyagi et al., 2005), DLNs (Aoyagi, 2024), Gaussian mixture models (Yamazaki and Watanabe, 2003), and hidden Markov models (Yamazaki and Watanabe, 2005a). SLT has also enhanced the understanding of model selection criteria in Bayesian statistics. Of particular relevance to this work is Watanabe (2013), which introduced the WBIC estimator of the free energy. This estimator has been applied in various practical settings (e.g. Endo et al., 2020; Fontanesi et al., 2019; Hooten and Hobbs, 2015; Sharma, 2017; Kafashan et al., 2021; Semenova et al., 2020). Some of the estimation methodology in this paper can be seen as a localized extension of the WBIC. Several other papers have explored improvements or alternatives to this estimator (Iriguchi and Watanabe, 2007; Imai, 2019a,b).

Basin broadness. The learning coefficient can be seen as a Bayesian version of basin broadness measures, which typically attempt to empirically connect notions of geometric "broadness" or "flatness" with model complexity (Hochreiter and Schmidhuber, 1997; Jiang et al., 2019). However, the evidence supporting the (global) learning coefficient in the Bayesian setting is significantly stronger: the learning coefficient provably determines the Bayesian free energy to leading order (Watanabe, 2009). We expect the utility of the learning coefficient as a geometric measure to apply beyond the Bayesian setting, but whether the connection with generalization will continue to hold is unknown.

Neural network identifiability.
A core observation leading to singular learning theory is that the map $w \mapsto p(x \mid w)$ from parameters $w$ to statistical models $p(x \mid w)$ may not be one-to-one (in which case the model is singular). This observation has been made in parallel by researchers studying neural network identifiability (Sussmann, 1992; Fefferman, 1994; Kůrková and Kainen, 1994; Phuong and Lampert, 2020). Recent work (within the context of singular learning theory, this fact was known at least as early as Fukumizu, 1996) has shown that the degree to which a network is identifiable (or inverse stable) is not uniform across parameter space (Berner et al., 2019; Petersen et al., 2020; Farrugia-Roberts, 2023). From this perspective, the LLC can be viewed as a quantitative measure of "how identifiable" the network is near a particular parameter.

Statistical mechanics of the loss landscape. A handful of papers have explored related ideas from a statistical mechanics perspective. Jules et al. (2023) use Langevin dynamics to probe the geometry of the loss landscape. Zhang et al. (2018) show how a bias towards "wide minima" may be explained by free energy minimization. These observations may be formalized using singular learning theory (LaMont and Wiggins, 2019). In particular, the learning coefficient may be viewed as a heat capacity (LaMont and Wiggins, 2019), and learning coefficient estimation corresponds to measuring the heat capacity by molecular dynamics sampling.

Other model complexity measures. The LLC is set apart from a number of classic model complexity measures, such as Rademacher complexity (Koltchinskii and Panchenko, 2000) and the VC dimension (Vapnik and Chervonenkis, 1971), because the latter measures act on an entire class of functions, while the LLC measures the complexity of a specific individual function within the context of the function class carved out by the model (e.g. via the DNN architecture).
This affords the LLC a better position for unraveling the theoretical mysteries of deep learning, which cannot be disentangled from the way in which DNNs are trained or the data that they are trained on. In the context studied here, our proposed LLC measures the complexity of a trained neural network rather than complexity over the entire function class of neural networks. It is also sensitive to the data distribution, making it ideal for understanding the intricate dance between function class, data properties, and the implicit biases baked into different training heuristics. Like earlier investigations by Le (2018), Zhang et al. (2018), and LaMont and Wiggins (2019), our notion of model complexity appeals to the correspondence between parameter inference and free energy minimization. This mostly refers to the fact that the posterior distribution over $w$ concentrates around $w^*$ if the associated local free energy, $F_n(B_\gamma(w^*))$, is low. Viewed through the free energy lens, it is thus not surprising that "flatter" minima (low $\lambda(w^*)$) might be preferred over "sharper" minima (high $\lambda(w^*)$), even if the former have higher training loss (higher $L_n(w^*)$). Put another way, (8) reveals that parameter inference is not about seeking the solution with the lowest loss. In the terminology of Zhang et al. (2018), parameter inference plays out as a competition between energy (loss) and entropy (Occam's factor). Despite spiritual similarities, our work departs from that of Zhang et al. (2018) and Le (2018) in our technical treatment of the local free energy.
These prior works rely on the Laplace approximation to arrive at the result
\[
F_n(B_\gamma(w^*)) = n L_n(w^*) + \frac{d}{2} \log n + \frac{1}{2} \log \det H(w^*) + O_P(1), \tag{23}
\]
where $H$ is the Hessian of the loss. The flatness of a local minimum is then measured by $\log \det H(w^*)$, which is, notably, ill-defined for neural networks (Balasubramanian, 1997, includes all $O(1)$ terms in the Laplace expansion as a type of complexity measure). Indeed, a concluding remark in Zhang et al. (2018) points out that "a more nuanced metric is needed to characterise flat minima with singular Hessian matrices." Le (2018) likewise states in the introduction that "to compute the (model) evidence, we must carefully account for this degeneracy", but then argues that degeneracy is not a major limitation to applying (23). This is only partially true: for a very benign type of degeneracy, (23) is indeed valid, but under general conditions the correct asymptotic expansion of the local free energy is the one given in (8). It might be said that while Zhang et al. (2018) and Le (2018) make an effort to account for degeneracies in the DNN loss landscape, they take only a small step up the degeneracy ladder while we take a full leap.

Similarity to PAC-Bayes. We have just described how the theoretical LLC is the sought-after notion of model complexity arising from earlier works that adopt the energy-entropy competition perspective. Interestingly, the actual LLC estimator also has connections to another familiar notion of model complexity.
Among the diverse cast of complexity measures (see, e.g., Jiang et al., 2019, for a comprehensive overview of over forty complexity measures in modern deep learning), the LLC estimator bears the most resemblance to PAC-Bayes inspired flatness/sharpness measures (Neyshabur et al., 2017). Indeed, apart from the scaling by $n\beta^*$, $\hat{\lambda}(w^*)$ can be viewed as a PAC-Bayes flatness measure which utilises a very specific posterior distribution localised to $w^*$. Recall that the canonical PAC-Bayes flatness measure is based on
\[
\lambda_{\mathrm{PAC\text{-}Bayes}}(w^*) = \mathbb{E}_{q(w \mid w^*)} \big[ \ell_n(w) - \ell_n(w^*) \big], \tag{24}
\]
where $\ell_n$ is a general empirical loss function (which in our case is the sample negative log likelihood) and the "posterior" distribution $q$ is often taken to be Gaussian, i.e., $q(w \mid w^*) = N(w^*, \sigma^2 I)$. A simple derivation shows that the quantity in (24), if we use a Gaussian $q$ around $w^*$, reduces approximately to
\[
\frac{1}{2} \sigma^2 \operatorname{Tr}(H(w^*)),
\]
where $H(w^*) = \nabla_w^2\, \ell_n(w) \big|_{w^*}$ is the Hessian of the loss. However, for singular models, the posterior distribution around $w^*$, e.g. (10), is decidedly not Gaussian. This calls into question the standard choice of the Gaussian posterior in (24).

Learning capacity. Finally, we briefly discuss concurrent work that measures a quantity related to the learning coefficient.
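The reduction of (24) to the trace form can be verified numerically for a toy quadratic loss (our own sketch; the Hessian and the value of $\sigma$ below are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: for the quadratic loss l(w) = 0.5 * (w - w*)^T H (w - w*) with the
# Gaussian "posterior" q(w | w*) = N(w*, sigma^2 I), the PAC-Bayes flatness
# quantity (24) equals (sigma^2 / 2) * Tr(H) exactly; we check by Monte Carlo.
rng = np.random.default_rng(0)
H = np.diag([4.0, 1.0, 0.25])  # toy Hessian at the minimum w* = 0
sigma = 0.1

w = rng.normal(0.0, sigma, size=(200_000, 3))      # draws from q(w | w*)
losses = 0.5 * np.einsum("ni,ij,nj->n", w, H, w)   # l(w) - l(w*) per sample
mc_estimate = losses.mean()                        # Monte Carlo value of (24)
closed_form = 0.5 * sigma ** 2 * np.trace(H)       # = 0.02625
```

For a diagonal $H$ the identity is elementary: $\mathbb{E}[\tfrac{1}{2} h_i w_i^2] = \tfrac{1}{2} h_i \sigma^2$ term by term, which is what the Monte Carlo estimate reproduces.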
In Chen et al. (2023a), a measure called the learning capacity is proposed to estimate the complexity of a hypothesis class. The learning capacity can be viewed as a finite-$n$ version of the learning coefficient; the latter only appears in the $n \to \infty$ limit. Chen et al. (2023a) are largely interested in the learning capacity as a function of the training set size $n$. They discover that the learning capacity saturates at very small and very large $n$, with a sharp transition in between.

Applications. Recently, the LLC estimation method we introduce here has been used to empirically detect "phase transitions" in toy ReLU networks (Chen et al., 2023a) and in the development of in-context learning in transformers (Hoogland et al., 2024).

Appendix F Model complexity vs model-independent complexity

In this paper, we have described the LLC as a measure of "model complexity." It is worth clarifying what we mean here, or rather, what we do not mean. This clarification is in part a response to Skalse (2023). We distinguish measures of "model complexity," such as those traditionally found in statistical learning theory, from measures of "model-independent complexity," such as those found in algorithmic information theory. Measures of model complexity, like the parameter count, describe the expressivity or degrees of freedom available to a particular model. Measures of model-independent complexity, like Kolmogorov complexity, describe the complexity inherent to the task itself. In particular, we emphasize that, a priori, the LLC is a measure of model complexity, not model-independent complexity. It can be seen as the amount of information required to nudge a model towards $w^*$ and away from other parameters. Parameters with higher LLC are more complex for that particular model to implement.
Alternatively, the model is inductively biased (the role of the LLC in inductive biases is only rigorously established for Bayesian learning, but we suspect it also applies to learning with SGD) towards parameters with lower LLC; a different model could have different inductive biases, and thus a different LLC for the same task. This is why it is not sensible to conclude that a bias towards low LLC would, on its own, explain the observed "simplicity bias" in neural networks (Valle-Perez et al., 2018); this is tautological, as Skalse (2023) noted. To highlight this distinction, we construct a statistical model where these two notions of complexity diverge. Let $f_1(x)$ be a Kolmogorov-simple function, like the identity function. Let $f_2(x)$ be a Kolmogorov-complex function, like a random lookup table. Then consider the following regression model with a single parameter $w \in [0, 1]$:
\[
f(x, w) = w^8 f_1(x) + (1 - w^8) f_2(x).
\]
For this model, $f_1(x)$ has a learning coefficient of $\lambda = \frac{1}{2}$, whereas $f_2(x)$ has a learning coefficient of $\lambda = \frac{1}{16}$. Therefore, despite $f_1(x)$ being more Kolmogorov-simple, it is more complex for $f(x, w)$ to implement: the model is biased towards $f_2(x)$ instead of $f_1(x)$, and so $f_1(x)$ requires relatively more information to learn. Yet this example feels contrived: in realistic deep learning settings, the parameters $w$ do not merely interpolate between handpicked possible algorithms, but themselves define an internal algorithm based on their values.
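The two learning coefficients quoted for this toy model can be recovered numerically from the volume-scaling definition (our own sketch; constants in the population loss are suppressed, so only the leading local behaviour of the loss near each optimum is used):

```python
import numpy as np

# Near the two optima of f(x, w) = w^8 f1(x) + (1 - w^8) f2(x), the population
# loss behaves (up to constants) like K1(w) = (1 - w^8)^2 at w = 1 (realizing
# f1) and K2(w) = w^16 at w = 0 (realizing f2). Fitting the volume-scaling
# exponent V(eps) ~ eps^lambda recovers lambda = 1/2 and lambda = 1/16.
def volume_exponent(Kw, eps_list):
    vols = np.array([(Kw < e).mean() for e in eps_list])  # w lives in [0, 1]
    return np.polyfit(np.log(eps_list), np.log(vols), 1)[0]

w = np.linspace(0.0, 1.0, 4_000_001)
eps = np.array([1e-4, 1e-5, 1e-6])
lam_f1 = volume_exponent((1.0 - w ** 8) ** 2, eps)  # approx 1/2
lam_f2 = volume_exponent(w ** 16, eps)              # approx 1/16
```

The asymmetry is visible directly: $\{w : w^{16} < \epsilon\}$ has length $\epsilon^{1/16}$, a far larger sliver of parameter space than the length-$O(\sqrt{\epsilon})$ neighbourhood realizing $f_1$.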
That is, it seems intuitively that the parameters play a role closer to "source code" than "tuning constants." Thus, while in general the LLC is not a model-independent complexity measure, it seems distinctly possible that for neural networks (perhaps even models in some broader "universality class"), the LLC could be model-independent in some way. This would theoretically establish the inductive biases of neural networks. We believe this to be an intriguing direction for future work.

Appendix G SGLD-based LLC estimator: minibatch version pseudocode

Algorithm 1: computing $\hat{\lambda}(w^*)$

Input:
• initialization point $w^*$
• scale $\gamma$
• step size $\epsilon$
• number of iterations SGLD_iters
• batch size $m$
• dataset of size $n$: $D_n = \{(x_i, y_i)\}_{i=1,\dots,n}$
• averaged log-likelihood function, for $w \in \mathbb{R}^d$ and an arbitrary subset $D$ of the data:
\[
\mathrm{logL}(D, w) = \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \log p(y_i \mid x_i, w)
\]
Output: $\hat{\lambda}(w^*)$

1: $\beta^* \leftarrow \frac{1}{\log n}$  (optimal sampling temperature)
2: $w \leftarrow w^*$  (initialize at the given parameter)
3: arrayLogL $\leftarrow [\,]$
4: for $t = 1, \dots,$ SGLD_iters do
5:   $B \leftarrow$ random minibatch of size $m$
6:   append $\mathrm{logL}(B, w)$ to arrayLogL
7:   $\eta \sim N(0, \epsilon)$  ($d$-dimensional Gaussian with variance $\epsilon$)
8:   $\Delta w \leftarrow \frac{\epsilon}{2} \big[ \gamma (w^* - w) + n \beta^* \nabla_w \mathrm{logL}(B, w) \big] + \eta$
9:   $w \leftarrow w + \Delta w$
10: end for
11: $\widehat{\mathrm{WBIC}} \leftarrow -n \cdot \mathrm{Mean}(\mathrm{arrayLogL})$
12: $n L_n(w^*) \leftarrow -n \cdot \mathrm{logL}(D_n, w^*)$
13: $\hat{\lambda}(w^*) \leftarrow \big( \widehat{\mathrm{WBIC}} - n L_n(w^*) \big) / \log n$
14: return $\hat{\lambda}(w^*)$

Appendix H Recommendations for accurate LLC estimation and troubleshooting

In this section, we collect several recommendations for estimating the LLC accurately in practice. Note that these recommendations are largely based on our experience with LLC estimation for DLNs (see Section 5.1), as this is the only realistic model (being wide and deep) for which the theoretical learning coefficient is available.

H.1 Step size

From experience, the hyperparameter most important to the performance and accuracy of the method is the step size $\epsilon$. If the step size is too low, the sampler may fail to equilibrate, leading to underestimation.
If the step size is too high, the sampler can become numerically unstable, causing overestimation or even "blowing up" to NaN values. Manual tuning of the step size is possible; however, we strongly recommend a particular diagnostic based on the acceptance criterion for Metropolis-adjusted Langevin dynamics (MALA). This criterion is used to correct numerical errors in traditional MCMC, but here we use it only to detect them. In traditional (full-gradient) MCMC, numerical errors caused by the step size are completely corrected by a secondary step in the algorithm, the acceptance check or Metropolis correction, which accepts or rejects steps with a probability roughly based on the likelihood of numerical error (technically, the acceptance probability is based on maintaining detailed balance, not necessarily numerical error, as can be seen in the case of e.g. Metropolis-Hastings; but this is a fine intuition for gradient-based algorithms like MALA or HMC). The proportion of steps accepted then becomes an important diagnostic of the health of the algorithm: a low acceptance ratio indicates that the acceptance check is having to compensate for high levels of numerical error. The acceptance probability of a move from the current state $X_k$ to the proposed state $X_{k+1}$ is calculated as
\[
\min \left( 1, \frac{\pi(X_{k+1})\, q(X_k \mid X_{k+1})}{\pi(X_k)\, q(X_{k+1} \mid X_k)} \right),
\]
where $\pi(x)$ is the target probability density at $x$ (in our case, $\log \pi(x) = -\beta n L_n(x)$ up to an additive constant), and $q(x' \mid x)$ is the probability of our sampler transitioning from $x$ to $x'$.
In the case of MALA, $q(x' \mid x) \neq q(x \mid x')$, and so we must calculate this term explicitly. For MALA it is
\[
q(x' \mid x) \propto \exp \left( -\frac{1}{4\epsilon} \big\| x' - x - \epsilon \nabla \log \pi(x) \big\|^2 \right).
\]
We choose to use MALA's formula because we are using SGLD, and both MALA and SGLD propose steps using Langevin dynamics; MALA's formula is the correct one to use when attempting to apply a Metropolis correction to Langevin dynamics. For various reasons, directly implementing such an acceptance check for stochastic-gradient MCMC (while possible) is typically either ineffective or inefficient. Instead, we use the acceptance probability merely as a diagnostic. We recommend tuning the step size such that the average acceptance probability is in the range 0.9-0.95. Below this range, decrease the step size to avoid numerical error. Above this range, consider increasing the step size for computational efficiency (to save on the number of steps required). For efficiency, we recommend calculating the acceptance probability for only a fraction of steps, say one out of every twenty. Note that since we are not actually using an acceptance check, these acceptance "probabilities" are not really probabilities, but merely diagnostic values.

H.2 Step count and burn-in

The step count for sampling should be chosen such that the sampler has time to equilibrate, or "burn in". An insufficient step count may lead to underestimating the LLC. An excessive step count will not degrade accuracy, but is unnecessarily time-consuming. We recommend increasing the number of steps until the loss, $L_m(w_t)$, stops increasing after some period of time. This can be done by manual inspection of the loss trace.
See Figure H.1 for examples of the loss trace and MALA acceptance probability over SGLD trajectories for DLN models at different scales. It is worth noting that the loss trace should truly be flat: a slow upwards slope can still be indicative of significant underestimation. We also recommend that samples taken during this burn-in period be discarded; that is, loss values should only be tallied once they have flattened out. This avoids underestimation.

H.3 Other issues and troubleshooting

We note some miscellaneous other issues and troubleshooting recommendations:

• Negative LLC estimates: This can happen when $w^*$ fails to be near a local minimum of $L(w)$. However, even when $w^*$ is a local minimum, we might still get negative LLC estimates if we are not careful. This can happen when the SGLD trajectory wanders into an area with lower loss than the initialization, causing the numerator in (12) to be negative. It can be alleviated by a smaller step size or a shorter chain length, though this risks under-exploration; it can also be alleviated by a larger restoring force $\gamma$, which risks the issue discussed next.

• Large $\gamma$: An overly concentrated localizing prior ($\gamma$ too large) can overwhelm the gradient signal coming from the log-likelihood. This can result in samples that differ from the posterior, destroying SGLD's sensitivity to the local geometry.

To sum up, in pathological cases, such as the SGLD trajectory falling into a lower-loss region or blowing up beyond machine floating-point limits, we recommend keeping $\gamma$ small (1.0 to 10.0) and gradually lowering the step size $\epsilon$ while lengthening the sampling chain so that the loss trace still equilibrates.

Figure H.1: Sample loss trace (blue, left axis) and MALA acceptance probability (red, right axis) over DLN training trajectories at different model sizes.
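Putting Algorithm 1 together with the burn-in recommendation above, a minimal NumPy sketch of the estimator is given below. This is our own illustration on a hypothetical one-parameter Gaussian regression model; the dataset and all hyperparameter values are illustrative, not the settings used in the paper. Since the toy model is regular, the estimate should land near $d/2 = 1/2$:

```python
import numpy as np

# Minimal sketch of Algorithm 1 on toy 1-D regression y = w * x + noise.
# The model is regular, so the true LLC at the minimum is d/2 = 0.5.
rng = np.random.default_rng(0)
n, m = 1000, 100                        # dataset size, minibatch size
x = rng.normal(size=n)
y = rng.normal(size=n)                  # data generated with true w = 0
w_star = x @ y / (x @ x)                # minimizer of the empirical loss

def logL(idx, w):                       # averaged log-likelihood, up to constants
    r = y[idx] - w * x[idx]
    return -0.5 * np.mean(r * r)

def grad_logL(idx, w):
    return np.mean((y[idx] - w * x[idx]) * x[idx])

beta = 1.0 / np.log(n)                  # line 1: optimal sampling temperature
gamma, eps = 1.0, 5e-4                  # localization strength, step size
sgld_iters, burn_in = 20_000, 2_000

w, trace = w_star, []                   # line 2: initialize at w*
for _ in range(sgld_iters):             # lines 4-10: SGLD chain
    idx = rng.choice(n, size=m, replace=False)
    trace.append(logL(idx, w))
    eta = rng.normal(0.0, np.sqrt(eps))  # line 7: injected noise, variance eps
    w += 0.5 * eps * (gamma * (w_star - w) + n * beta * grad_logL(idx, w)) + eta

# Lines 11-13, discarding burn-in samples as recommended in Appendix H.2.
wbic_hat = -n * np.mean(trace[burn_in:])
nLn_star = -n * logL(np.arange(n), w_star)
lam_hat = (wbic_hat - nLn_star) / np.log(n)  # should come out near 0.5
```

The small positive biases from discretization and minibatch gradient noise illustrate why the step-size diagnostic of H.1 matters: larger $\epsilon$ would inflate the estimate further.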
Appendix I Learning coefficient of DLNs (Aoyagi, 2024)

A DLN is a feedforward neural network without nonlinear activations. Specifically, a biasless DLN with $M$ hidden layers, layer sizes $H_1, H_2, \dots, H_M$, and input dimension $H_0$ is given by
\[
y = f(x, w) = W_M \cdots W_2 W_1 x, \tag{25}
\]
where $x \in \mathbb{R}^{H_0}$ is the input vector and the model parameter $w$ consists of the weight matrices $W_j$ of shape $H_j \times H_{j-1}$ for $j = 1, \dots, M$. Given a DLN $f(x, w)$ (cf. Equation 25) with $M$ hidden layers, layer sizes $H_1, H_2, \dots, H_M$, and input dimension $H_0$, the associated regression model with additive Gaussian noise is given by
\[
p(x, y \mid w) = \frac{q(x)}{\sqrt{2\pi\sigma^2}^{\,H_M}}\, e^{-\frac{1}{2\sigma^2} \| y - W_M \cdots W_2 W_1 x \|^2}, \tag{26}
\]
where $q(x)$ is some distribution on the input $x \in \mathbb{R}^{H_0}$, $w = (W_1, \dots, W_M)$ is the parameter consisting of the weight matrices $W_j$ of shape $H_j \times H_{j-1}$ for $j = 1, \dots, M$, and $\sigma^2$ is the variance of the additive Gaussian noise.
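The biasless DLN of (25) can be instantiated in a few lines; the sketch below (our own, with layer sizes chosen arbitrarily) computes the rank $r$ of the implemented linear map, the quantity that enters the theorem that follows:

```python
import numpy as np

# Sketch of the biasless DLN in (25): the network is just a product of weight
# matrices, and the rank r of that product characterizes the linear map the
# network implements.
rng = np.random.default_rng(0)
sizes = [10, 8, 5, 8]                  # H_0 (input), H_1, H_2, H_3 = H_M
Ws = [rng.normal(size=(sizes[j + 1], sizes[j])) for j in range(len(sizes) - 1)]

def dln(x, Ws):
    for W in Ws:                       # y = W_M ... W_2 W_1 x
        x = W @ x
    return x

product = np.linalg.multi_dot(Ws[::-1])  # the matrix W_M ... W_2 W_1
r = np.linalg.matrix_rank(product)       # bounded by the minimum layer width, 5
```

The bottleneck layer of width 5 caps the rank at 5 regardless of the other widths, which is why the gaps $\Delta_j = H_j - r$ below are the natural quantities for measuring degeneracy.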
Let q(x, y) be the density of the true data-generating process and $w^* = (W_1^*, \dots, W_M^*)$ be an optimal parameter that minimizes the KL divergence between q(x, y) and p(x, y | w). Here we shall pause and emphasize that this result gives us the (global) learning coefficient, which is conceptually distinct from the LLC. They are related: the learning coefficient is the minimum of the LLCs of the global minima of the population loss. In our experiments, we measure the LLC at a randomly chosen global minimum of the population loss. While we do not expect LLCs to differ much among global minima for DLNs, we do not know that for certain, and it is of independent interest what the estimated LLC can tell us about the learning coefficient.

Theorem 1 (DLN learning coefficient, Aoyagi, 2024). Let $r := \mathrm{rank}(W_M^* \cdots W_2^* W_1^*)$ be the rank of the linear transformation implemented by the true DLN f(x, w), and set $\Delta_j := H_j - r$ for j = 0, …, M. There exists a subset $\Sigma \subset \{0, 1, \dots, M\}$ of indices, $\Sigma = \{\sigma_1, \dots, \sigma_{\ell+1}\}$ with cardinality ℓ + 1, satisfying the following conditions:

$$\max\{\Delta_\sigma \mid \sigma \in \Sigma\} < \min\{\Delta_k \mid k \notin \Sigma\}$$
$$\sum_{\sigma \in \Sigma} \Delta_\sigma \ge \ell \cdot \max\{\Delta_\sigma \mid \sigma \in \Sigma\}$$
$$\sum_{\sigma \in \Sigma} \Delta_\sigma < \ell \cdot \min\{\Delta_k \mid k \notin \Sigma\}.$$
Assuming that the DLN truth-model pair (q(x, y), p(x, y | w)) satisfies the relatively finite variance condition (Appendix A.1), the learning coefficient is then given by

$$\lambda = \frac{-r^2 + r(H_0 + H_M)}{2} + \frac{a(\ell - a)}{4\ell} - \frac{\ell(\ell - 1)}{4}\left(\frac{1}{\ell}\sum_{j=1}^{\ell+1}\Delta_{\sigma_j}\right)^2 + \frac{1}{2}\sum_{1 \le i < j \le \ell+1}\Delta_{\sigma_i}\Delta_{\sigma_j},$$

with the integer a as defined in Aoyagi (2024).

As we mention in the introduction, trained neural networks are less complex than they seem. It is natural to expect that this, if true, is reflected in deep networks having a more degenerate loss landscape (good parameters have higher volume). The theorem above gives us a window into the volume-scaling behaviour in the case of DLNs, allowing us to investigate an aspect of this hypothesis. Figure I.1 shows the true learning coefficient λ and the multiplicity m of many randomly drawn DLNs with different numbers of hidden layers. Observe that λ decreases with network depth. This plot is generated by creating networks with 2 to 800 hidden layers, with widths (including the input dimension) randomly drawn from 100 to 2000. The overall rank of the DLN is randomly drawn from the range of zero to the maximum allowed rank, which is the minimum of the layer widths. See also Figure 1 in Aoyagi (2024) for more theoretical examples of this phenomenon.
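The rank r appearing in Theorem 1 is bounded by the narrowest layer, which is why the experiments draw the target rank from zero up to the minimum layer width. A quick numerical illustration (the sizes are our own):

```python
import numpy as np

# The rank of the composite map W_M ... W_1 can never exceed the narrowest
# layer of the network, here the width-5 bottleneck.
rng = np.random.default_rng(1)
sizes = [30, 40, 5, 40, 30]  # H_0, ..., H_4; bottleneck width is 5
weights = [rng.standard_normal((sizes[j + 1], sizes[j])) for j in range(4)]
product = weights[3] @ weights[2] @ weights[1] @ weights[0]
r = np.linalg.matrix_rank(product)
```

Generic Gaussian weights attain the bottleneck rank exactly, so here the composite 30 × 30 map has rank 5.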
Figure I.1: The top graph shows λ decreasing as the DLN becomes deeper, even though model parameter count increases with the number of layers. The bottom graph shows the true multiplicities m. Since regular models can only have m = 1, the graph shows that most of these randomly generated DLNs are singular.

Appendix J Experiment: DLN

We compare the estimate λ̂(w*) against the theoretical λ (with w* being a randomly generated true parameter) by randomly generating many DLNs with different architectures and model sizes that span several orders of magnitude (OOM). Each DLN is constructed randomly, as follows. Draw an integer M ∼ U(M_low, …, M_high) as the number of hidden layers, where U(a, …, b) denotes the discrete uniform distribution on the finite set {a, a+1, …, b}. Then, draw a layer size H_j ∼ U(H_low, …, H_high) for each j = 0, …, M, where H_0 denotes the input dimension. The weight matrix W_j for layer j is then an H_j × H_{j−1} matrix with each matrix element independently sampled from N(0, 1) (random initialization). To obtain a more realistic true parameter, with probability 0.5, each matrix W_j* is modified to have a random rank r ∼ U(0, …, min(H_{j−1}, H_j)). For each DLN generated, a corresponding synthetic training dataset of size n is generated to be used in SGLD sampling.
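The random construction described above can be sketched as follows. This is our own rendering under assumed names; in particular, realizing a random rank-r matrix as the product of two thin Gaussian factors is one convenient choice, not necessarily the paper's exact implementation.

```python
import numpy as np

def random_dln(M_low, M_high, H_low, H_high, rng):
    """Randomly construct a biasless DLN true parameter, as in Appendix J."""
    M = rng.integers(M_low, M_high + 1)                  # number of hidden layers
    sizes = rng.integers(H_low, H_high + 1, size=M + 1)  # H_0 (input), ..., H_M
    weights = []
    for j in range(1, M + 1):
        W = rng.standard_normal((sizes[j], sizes[j - 1]))
        if rng.random() < 0.5:  # with probability 0.5, force a random rank
            r = rng.integers(0, min(sizes[j - 1], sizes[j]) + 1)
            # A product of (H_j x r) and (r x H_{j-1}) factors has rank <= r.
            W = rng.standard_normal((sizes[j], r)) @ rng.standard_normal((r, sizes[j - 1]))
        weights.append(W)
    return sizes, weights

# Example draw with the 1k-OOM configuration ranges from Table 1.
sizes, weights = random_dln(2, 5, 5, 50, np.random.default_rng(2))
```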
The configuration values M_low, M_high, H_low, H_high are chosen differently for separate sets of experiments, with each set targeting DLNs of a different order of magnitude in size. See Table 1 for the values used in the experiments. The SGLD hyperparameters ϵ, γ, and the number of steps are chosen to suit each set of experiments according to our recommendations outlined in Appendix H. The exact hyperparameter values are given in the following section. We emphasize that hyperparameters must be tuned independently for different model sizes. In particular, the required step size for numerical stability tends to decrease for larger models, forcing a compensatory increase in the step count. See Table 1 for an example of this tuning with scale. Future work using, e.g., μ-parameterization (Yang et al., 2022) may be able to alleviate this issue.

J.1 Hyperparameters and further details

As described in the prior section, the experiments shown in Figure 4 consist of randomly constructed DLNs. For each target order of magnitude of DLN parameter count, we randomly sample the number of layers and their widths from a different range. We also use a different set of SGLD hyperparameters chosen according to the recommendations made in Appendix H. Configuration values that vary across OOMs are shown in Table 1; other configurations are as follows:

• The batch size used in SGLD is 500.
• The number of burn-in steps for SGLD is set to 90% of the total SGLD chain length, i.e., only the last 10% of SGLD samples are used in estimating the LLC.
• The parameter γ is set to 1.0.
• For each DLN f(x, w*) with a chosen true parameter w*, a synthetic dataset {(x_i, y_i)}_{i=1,…,n} is generated by randomly sampling each element of the input vector x uniformly from the interval [−10, 10] and setting the output as y = f(x, w*), which effectively means we are setting a very small noise variance σ².
• For LLC estimation done at a trained parameter instead of the true parameter (shown in the right plot in Figure 4), the network is first trained using SGD with learning rate 0.01 and momentum 0.9 for 50000 steps.

For each target OOM, a number of different experiments are run with different random seeds. The number of such experiments is determined by our compute resources and is reported in Table 1; some experiments failed due to SGLD chains "blowing up" (see discussion in Appendix H) under the SGLD hyperparameters used. Figure J.3 shows that there is a left tail to the mean MALA acceptance rate distribution, hinting at instability in the SGLD chains encountered in some λ̂(w*) estimation runs.
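The synthetic dataset generation described in the first bullet above can be sketched as follows (our own naming; the targets are noiseless, matching the very small noise variance σ²):

```python
import numpy as np

def make_dataset(weights, input_dim, n, rng):
    """Noiseless synthetic data: x ~ Uniform[-10, 10]^{H_0}, y = f(x, w*)."""
    X = rng.uniform(-10.0, 10.0, size=(n, input_dim))
    product = weights[0]
    for W in weights[1:]:
        product = W @ product  # collapse W_M ... W_1 into one matrix once
    Y = X @ product.T          # row-wise: y_i = (W_M ... W_1) x_i
    return X, Y

rng = np.random.default_rng(3)
weights = [rng.standard_normal((6, 4)), rng.standard_normal((3, 6))]
X, Y = make_dataset(weights, input_dim=4, n=100, rng=rng)
```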
| OOM | Num layers, M_low–M_high | Widths, H_low–H_high | ϵ | Num SGLD steps | n | Num experiments |
|-----|-----|-----|-----|-----|-----|-----|
| 1k | 2–5 | 5–50 | 5 × 10⁻⁷ | 10k | 10⁵ | 99 |
| 10k | 2–10 | 5–100 | 5 × 10⁻⁷ | 10k | 10⁵ | 100 |
| 100k | 2–10 | 50–500 | 1 × 10⁻⁷ | 50k | 10⁶ | 100 |
| 1M | 5–20 | 100–1000 | 5 × 10⁻⁸ | 50k | 10⁶ | 99 |
| 10M | 2–20 | 500–2000 | 2 × 10⁻⁸ | 50k | 10⁶ | 93 |
| 100M | 2–40 | 500–3000 | 2 × 10⁻⁸ | 50k | 10⁶ | 54 |

Table 1: Experimental configuration for each batch of experiments at different orders of magnitude (OOM) of DLN model size. n denotes the training dataset size and ϵ denotes the SGLD step size.

J.2 Additional plots for DLN experiments

• Figure J.1 is a linear-scale version of Figure 4 in the main text. It shows the estimated LLC against the true learning coefficients for experiments at different model size ranges without log-scale distortion.
• Figure J.2 shows the relative error (λ − λ̂(w*))/λ across multiple orders of magnitude of DLN model size.

Figure J.1: Supplementary plot to Figure 4. Each plot shows a single batch of DLN experiments with model sizes at a different order of magnitude. The SGLD hyperparameters are tuned once for each batch; their values are listed in Table 1. In contrast to Figure 4, which is in log scale, all plots here are in linear scale.

Figure J.2: Relative error of the estimated LLC compared to the theoretical learning coefficient, for DLNs across different orders of magnitude of model size.

Figure J.3: Mean MALA acceptance probability over the entire SGLD trajectory for every DLN experiment.
Model size is not the only factor affecting the correct scale for the SGLD step size. Local geometry varies significantly among different models, and among different neighbourhoods in parameter space. Without tuning SGLD hyperparameters individually for each experiment, we get a spread of (mean) MALA acceptance probabilities over all experiments. Those with low acceptance probability may indicate poor λ̂(w*) estimation quality.

Appendix K Experiment: LLC for ResNet

This section provides details for the experiments in which we investigate how the LLC varies over neural network training with different training configurations. These experiments parallel those performed in Dherin et al. (2022), who propose the geometric complexity measure. We train ResNet18 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky, 2009) using SGD with cross-entropy loss. We vary SGD hyperparameters such as learning rate, batch size, momentum, and L²-regularization rate, and track the resulting LLC estimates over evenly spaced checkpoints of SGD iterations. For the LLC estimates to be comparable, we need to ensure that the SGLD hyperparameters used in the LLC algorithm are the same. To this end, for every set of experiments where we vary a single optimizer hyperparameter, we first perform a calibration run to select SGLD hyperparameters according to the recommendations outlined in Appendix H. Once selected, this set of SGLD hyperparameters is then used for all LLC estimation within that set of experiments, including the LLC estimation for every checkpoint of every ResNet18 training run under every optimizer configuration. Following the same recommendations, we also burn away 90% of the sample trajectory and only use the last 10% of samples. We also note that, since the SGLD hyperparameters are not tuned for each experiment within a single set, negative LLC estimates or divergent SGLD chains are possible.
See Appendix H.3 for discussion of such cases and how to troubleshoot them. We manually remove these cases. They are rare enough that they do not change the LLC curves; they only widen the error bars (confidence intervals) due to there being fewer repeated experiments. Each experiment is repeated with 5 different random seeds. While the model architecture and dataset (including the train-test split) are fixed, other factors such as the network initialization, training trajectories, and SGLD samples are randomized. In each plot, the error bars show the 95% confidence intervals of the plotted statistics over 5 repeated experiments. The error bars were calculated using the built-in error bar functionality of the Python Seaborn library (Waskom, 2021): seaborn.lineplot(..., errorbar=("ci", 95), ...).

K.1 Details for main text Figure 1

For the experiments that vary the learning rate in Figure 1 (top), for each learning rate value in [0.005, 0.05, 0.01, 0.1, 0.2] we run SGD without momentum with a fixed batch size of 512 for 30000 iterations. LLC estimation was performed every 1000 iterations with SGLD hyperparameters as follows: step size ϵ = 2 × 10⁻⁷, chain length of 3000 iterations, batch size of 2048, and γ = 1.0. For the experiments that vary the batch size in Figure 1 (middle), for each batch size value in [16, 32, 64, 128, 256, 512, 1024] we run SGD without momentum with a fixed learning rate of 0.01 for 100000 iterations. LLC estimation was performed every 2500 iterations with SGLD hyperparameters as follows: step size ϵ = 2 × 10⁻⁷, chain length of 2500 iterations, batch size of 2048, and γ = 1.0.
For the experiments that vary the SGD momentum in Figure 1 (bottom), for each momentum value in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] we run SGD with momentum with a fixed learning rate of 0.05 and a fixed batch size of 512 for 20000 iterations. LLC estimation was performed every 1000 iterations with SGLD hyperparameters as follows: step size ϵ = 2 × 10⁻⁷, chain length of 3000 iterations, batch size of 2048, and γ = 1.0.

K.2 Additional ResNet18 + CIFAR10 LLC experiments

Experiments using SGD with momentum. We repeat the experiments varying the learning rate and batch size shown in Figure 1 (top and middle) in the main text, but this time we train ResNet18 on CIFAR10 using SGD with momentum instead. The results are shown in Figure K.1 and the experimental details are as follows:

• For the experiments that vary the learning rate (top), for each learning rate value in [0.005, 0.05, 0.01, 0.1, 0.2] we run SGD with momentum of 0.9 with a fixed batch size of 512 for 30000 iterations. LLC estimation was performed every 1000 iterations with SGLD hyperparameters as follows: step size ϵ = 2 × 10⁻⁷, chain length of 3000 iterations, batch size of 2048, and γ = 1.0.
• For the experiments that vary the batch size (bottom), for each batch size value in [16, 32, 64, 128, 256, 512, 1024] we run SGD with momentum of 0.9 with a fixed learning rate of 0.01 for 100000 iterations.
LLC estimation was performed every 2500 iterations with SGLD hyperparameters as follows: step size ϵ = 2 × 10⁻⁷, chain length of 2500 iterations, batch size of 2048, and γ = 1.0.

Explicit L²-regularization. We also run analogous experiments using explicit L²-regularization. We train ResNet18 on the CIFAR10 dataset using SGD, both with and without momentum, with the usual cross-entropy loss but with an added L²-regularization term α‖w‖₂², where w denotes the network weight vector and α is the regularization rate hyperparameter that we vary in this set of experiments. As in the other experiments in this section, we track LLC estimates over evenly spaced training checkpoints. However, it is worth noting that the loss function used for SGLD sampling as part of the LLC estimation is the original cross-entropy loss function without the added L²-regularization term. The results are shown in Figure K.2 and the details are as follows:

• For the experiment with momentum (Figure K.2, top), for each regularization rate α ∈ [0.0, 0.01, 0.025, 0.05, 0.075, 0.1], we run SGD with momentum of 0.9 with a fixed learning rate of 0.0005 and batch size of 512 for 15000 iterations. LLC estimation was performed every 500 iterations with SGLD hyperparameters as follows: step size ϵ = 5 × 10⁻⁸, chain length of 2000 iterations, batch size of 2048, and γ = 1.0.
• For the experiment without momentum (Figure K.2, bottom), for each regularization rate α ∈ [0.0, 0.01, 0.025, 0.05, 0.075, 0.1], we run SGD without momentum with a fixed learning rate of 0.001 and batch size of 512 for 50000 iterations. LLC estimation was performed every 1000 iterations with SGLD hyperparameters as follows: step size ϵ = 5 × 10⁻⁸, chain length of 2000 iterations, batch size of 2048, and γ = 1.0.

Figure K.1: Impact of varying training configurations believed to exert implicit regularization pressure when training ResNet18 on CIFAR10 data using SGD with momentum (contrast with those without momentum reported in Figure 1 in the main text). Top: varying learning rate. Bottom: varying batch size.

Figure K.2: Impact of varying the explicit L²-regularization rate when training ResNet18 on CIFAR10 data using SGD with (top) and without (bottom) momentum.

Appendix L Additional Experiment: LLC for language model

The purpose of this experiment is simply to verify that the LLC can be consistently estimated for a transformer trained on language data. In Figure L.1, we show the LLC estimates for ŵ*_n at the end of training over a few training runs. We see the LLC estimates are a small fraction of the 3.3m total parameters. We also notice that the values of the LLC estimates are remarkably stable over multiple training runs. Experimental details follow below.

Figure L.1: SGLD-based LLC estimates for ŵ*_n at the end of training an attention-only transformer on a subset of the Pile dataset. The distribution reports LLC estimates over 10 training repetitions. Again, the LLC is a tiny fraction of the 3.3m parameters in the transformer.
We trained a two-layer attention-only (no MLP layers) transformer architecture on a resampled subset of the Pile dataset (Gao et al., 2020; Xie et al., 2023) with a context length of 1024, a residual stream dimension of d_model = 256, and 8 attention heads per layer. The architecture also uses a learnable Shortformer embedding (Press et al., 2021) and includes layer norm layers. Additionally, we truncated the full GPT-2 tokenizer, which has a vocabulary of around 50,000 tokens, down to the first 5,000 tokens in the vocabulary to reduce the size of the model. The resulting model has a total parameter count of d = 3,355,016. We instantiated these models using an implementation provided by TransformerLens (Nanda and Bloom, 2022). We trained 10 different seeds over a single epoch of 50,000 steps with a minibatch size of 100, resulting in about 5 billion tokens used during training for each model. We used the AdamW optimizer with a weight decay value of 0.05 and a learning rate of 0.001 with no scheduler. We ran SGLD-based LLC estimation once at the end of training for each seed at a temperature of β = 1/log(100). We set γ = 100 and ϵ = 0.001. We take samples over 20 SGLD chains with 200 draws per chain using a validation set.

Appendix M Additional Experiment: MALA versus SGLD

Here we verify empirically that our SGLD-based LLC estimator (Algorithm 1) does not suffer from using the minibatch loss for both SGLD sampling and the LLC calculation. Specifically, we compare to LLC estimation via Equation 12 and the Metropolis-adjusted Langevin algorithm (MALA), a standard gradient-based MCMC algorithm (Roberts and Rosenthal, 1998).
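For reference, a single MALA step with the standard Langevin proposal and Metropolis correction can be sketched as follows. This is a generic textbook implementation over a user-supplied log-density, not the paper's experimental code; the Gaussian sanity check at the end is our own.

```python
import numpy as np

def mala_step(w, log_prob, grad_log_prob, eps, rng):
    """One Metropolis-adjusted Langevin step; returns (new_w, accepted)."""
    # Langevin proposal: drift along the score plus Gaussian noise.
    mean_fwd = w + 0.5 * eps * grad_log_prob(w)
    proposal = mean_fwd + np.sqrt(eps) * rng.standard_normal(w.shape)
    # log q(w' | w) and log q(w | w') for the asymmetric proposal density.
    mean_bwd = proposal + 0.5 * eps * grad_log_prob(proposal)
    log_q_fwd = -np.sum((proposal - mean_fwd) ** 2) / (2.0 * eps)
    log_q_bwd = -np.sum((w - mean_bwd) ** 2) / (2.0 * eps)
    log_alpha = log_prob(proposal) - log_prob(w) + log_q_bwd - log_q_fwd
    if np.log(rng.random()) < log_alpha:
        return proposal, True
    return w, False

# Sanity check: sample a 2D standard Gaussian.
log_prob = lambda w: -0.5 * np.sum(w ** 2)
grad_log_prob = lambda w: -w
rng = np.random.default_rng(4)
w = np.zeros(2)
accepted = 0
samples = []
for _ in range(5000):
    w, ok = mala_step(w, log_prob, grad_log_prob, eps=0.5, rng=rng)
    accepted += ok
    samples.append(w)
samples = np.array(samples)
```

The accept/reject correction is what makes MALA exact (and expensive, since it needs full-batch densities), in contrast to SGLD, which omits it.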
Notice that this comparison rather stacks the odds against the minibatch SGLD-based LLC estimator in Algorithm 1, so it is all the more surprising that we see such good results below. We test with a two-hidden-layer ReLU network with ten inputs, ten outputs, and twenty neurons per hidden layer. Denote the inputs by x, the parameters by w, and the output of this network f(x, w). The data are generated to create a "realizable" data-generating process with "true parameter" w*: inputs X are generated from a uniform distribution, and labels Y are (noiselessly) generated from the true network, so that Y_i = f(X_i, w*). Hyperparameters and experimental details are as follows. We sweep the dataset size from 100 to 100000 and compare our LLC estimator with the MALA-based one. For all dataset sizes, the SGLD batch size was set to 32 and γ = 1.0, and MALA and SGLD shared the same true parameter w* (set at random according to a normal distribution). Both MALA and SGLD used a step size of 1e-5 and the asymptotically optimal inverse temperature β* = 1/log n. Experiments were run on CPU.

Figure M.1: We compare LLC estimation using SGLD with LLC estimation using MALA. They make similar estimates (top), but the SGLD-based method is significantly faster (bottom), especially for large dataset sizes.

The results are summarized in Figure M.1. We find that across all dataset sizes, the SGLD and MALA estimates of the LLC agree (Figure M.1, top), but the SGLD-based estimate has far lower computational cost, especially as the dataset size grows (Figure M.1, bottom).

Appendix N Additional Experiment: SGD versus eSGD

We fit a two-hidden-layer feedforward ReLU network with 1.9m parameters to MNIST using two stochastic optimizers: SGD and entropy-SGD (Chaudhari et al., 2019).
We choose entropy-SGD because its objective is to minimize F_n(w*, γ) over w*, so we expect that the local minima found by entropy-SGD will have lower λ̂(w*). Figure N.1 shows the LLC estimates λ̂(ŵ_n*) for ŵ_n* at the end of training, optimized by either entropy-SGD or standard SGD. Notice that the LLC estimates, for both stochastic optimizers, are on the order of 1000, much lower than the 1.9m parameter count of the ReLU network.

Figure N.1: LLC estimates for ŵ_n* at the end of training a feedforward ReLU network on MNIST. The distribution reports λ̂(ŵ_n*) over 80 training repetitions, where the training data remains fixed across repetitions; only the randomness in the stochastic optimizer is being modded out. We compare two stochastic optimizers, SGD and entropy-SGD. Note that all λ̂(ŵ_n*) are on the order of 1000, while the parameter count of the ReLU network is 1.9m.

Figure N.1 confirms our expectation that entropy-SGD finds local minima with lower LLC, i.e., entropy-SGD is attracted to more degenerate (simpler) critical points than SGD. Interestingly, Figure N.1 also reveals that the LLC estimate has remarkably low variance over the randomness of the stochastic optimizer. Finally, it is noteworthy that the LLC estimate of a learned NN model for both stochastic optimizers is, on average, a tiny percentage of the total number of weights in the NN model: λ̂(w*) ≈ 1000.
In this experiment, we trained a feedforward ReLU network on the MNIST dataset (Deng, 2012). The dataset consists of 60000 training samples and 10000 testing samples. The network has 2 hidden layers of sizes [1024, 1024], and it contains a total of 1863690 parameters. For training, we employed two different optimizers, SGD and entropy-SGD, minimizing cross-entropy loss. Both optimizers are set to have a learning rate of 0.01, a momentum parameter of 0.9, and a batch size of 512. SGD is trained with the Nesterov look-ahead gradient estimator. The number of samples L used by entropy-SGD for local free energy estimation is set to 5. The network is trained for 200 epochs; the number of epochs is chosen so that the classification error rate on the training set falls below 10⁻⁴. The hyperparameters used for SGLD are as follows: ϵ is set to 10⁻⁵, the chain length to 400, the minibatch size to 512, and γ = 100. We repeat each SGLD chain 4 times to compute the variance of estimated quantities, which also serves as a diagnostic tool, a proxy for estimation stability.

Appendix O Additional Experiment: scaling invariance

In Appendix C, we show theoretically that the theoretical LLC is invariant to local diffeomorphism. Here we empirically verify that our LLC estimator (with preconditioning) is capable of satisfying this property in a specific easy-to-test case (though we do not test it in general). The function implemented by a feedforward ReLU network is invariant to a type of reparameterization known as rescaling symmetry. Invariance of other measures to this symmetry is not trivial or automatic, and other geometric measures, such as Hessian-based basin broadness, have been undermined by their failure to stay invariant to these symmetries (Dinh et al., 2017).
For simplicity, suppose we have a two-layer ReLU network with weights W_1, W_2 and biases b_1, b_2. Then rescaling symmetry is captured by the following fact, for an arbitrary positive scalar α:

$$W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2 = \alpha W_2\,\mathrm{ReLU}\!\left(\tfrac{1}{\alpha}W_1 x + \tfrac{1}{\alpha}b_1\right) + b_2.$$

That is, we may choose new parameters $W_1' = \tfrac{1}{\alpha}W_1$, $b_1' = \tfrac{1}{\alpha}b_1$, $W_2' = \alpha W_2$, $b_2' = b_2$ without affecting the input-output behavior of the network in any way. This symmetry generalizes to any two adjacent layers in ReLU networks of arbitrary depth. Given that these symmetries do not affect the function implemented by the network, and are present globally throughout all of parameter space, these degrees of freedom appear "superfluous" and should ideally not affect our tools. Importantly, this is the case for the theoretical LLC. We verify empirically that this property also appears to hold for our LLC estimator, when proper preconditioning is used.

O.1 Preconditioned SGLD

We must slightly modify our SGLD sampler to perform this experiment tractably. This is because applying the rescaling symmetry makes the loss landscape significantly anisotropic, forcing prohibitively small step sizes with the original algorithm.
Thus we add preconditioning to the SGLD sampler from Section 4.4, with the only modification being a fixed preconditioning matrix A:

$$\Delta w_t = A\,\frac{\epsilon}{2}\left(\frac{\beta^* n}{m}\sum_{(x,y)\in B_t}\nabla\log p(y \mid x, w_t) + \gamma(w^* - w_t)\right) + N(0, \epsilon) \qquad (27)$$

In the experiment to follow, this preconditioning matrix is hardcoded for convenience, but if this algorithm is to be used in practice, the preconditioning matrix must be learned adaptively using standard methods for adaptive preconditioning (Haario et al., 1999).

O.2 Experiment

We take a small feedforward ReLU network and rescale two adjacent layers in the network by α in the fashion described above. We vary the value of α across eight orders of magnitude and measure the estimated LLC, using Equation 27 for SGLD sampling and Equation 12 for calculating the LLC from samples. (Note that this means we are not using Algorithm 1 here, both because of the preconditioned SGLD and because we are calculating the LLC using the full-batch loss instead of the minibatch loss.) Crucially, we must use preconditioning here to avoid prohibitively small step size requirements. In this case, the preconditioning matrix A is set manually to be a diagonal matrix with entries α² for parameters corresponding to W_1 and b_1, entries 1/α² for W_2, and entries 1 otherwise. (In practical situations, the preconditioning matrix cannot be set manually and must be learned adaptively; standard methods for adaptive preconditioning exist in the MCMC literature (Haario et al., 2001, 1999).)
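A single update in the style of Equation 27 can be sketched as follows (our own rendering with a diagonal A; grad_log_lik stands for the minibatch sum of ∇ log p(y | x, w_t)):

```python
import numpy as np

def preconditioned_sgld_step(w, w_star, grad_log_lik, A_diag, eps, gamma,
                             beta_star, n, m, rng):
    """One preconditioned SGLD update in the style of Equation 27.

    grad_log_lik : sum over the minibatch B_t of grad log p(y | x, w_t)
    A_diag       : diagonal of the fixed preconditioning matrix A
    """
    drift = (beta_star * n / m) * grad_log_lik + gamma * (w_star - w)
    noise = np.sqrt(eps) * rng.standard_normal(w.shape)  # N(0, eps I)
    return w + A_diag * (eps / 2.0) * drift + noise

# Illustrative call with toy values (names and numbers are ours).
rng = np.random.default_rng(6)
w = np.ones(4)
w_next = preconditioned_sgld_step(
    w, w_star=np.zeros(4), grad_log_lik=np.zeros(4), A_diag=np.ones(4),
    eps=1e-4, gamma=1.0, beta_star=1.0, n=1000, m=32, rng=rng)
```

With A the identity, this reduces to the unpreconditioned sampler; the diagonal entries α² and 1/α² described above simply rescale the drift per parameter group.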
Figure O.1: LLC estimation is invariant to rescaling symmetries in ReLU networks. As the rescaling parameter α is varied over eight orders of magnitude, the estimated value of the LLC remains invariant (up to statistical error). The small error bars across multiple SGLD runs illustrate the stability of the estimation method. Model layer sizes, including the input dimension, are shown in the legend.

The results can be found in Figure O.1. We conclude that LLC estimation appears invariant to ReLU network rescaling symmetries.

Appendix P Compute resources disclosure

Specific details about each type of experiment we carried out are listed below. Additional compute resources were required for finding suitable SGLD hyperparameters for LLC estimation, but they did not make a significant difference to the overall resource requirements.

ResNet18 + CIFAR10 experiments. Each experiment, i.e., training a ResNet18 for the reported number of iterations and performing LLC estimation for the stated number of checkpoints, is run on a node with either a single V100 or A100 NVIDIA GPU (depending on availability) hosted on an internal HPC cluster, with 2 CPU cores and 16GB memory allocated. No significant storage was required. Each experiment took between a few minutes and 3 hours depending on the configuration (mean: 87.6 minutes, median: 76.7 minutes). Estimated total compute is 287 GPU hours spread across 208 experiments.

DLN experiments. Each experiment, either estimating the LLC at the true parameter or at a trained SGD parameter (thus requiring training), is run on a single A100 NVIDIA GPU node hosted on an internal HPC cluster, with 2 CPU cores and no more than 8GB memory allocated. No significant storage was required. Each experiment took less than 15 minutes. Estimated total compute is 150 GPU hours: 600 experiments (including failed ones), each around 15 GPU minutes.

MNIST experiments.
Each repetition of training a feedforward ReLU network using the SGD or eSGD optimizer on MNIST data and estimating the LLC at the end of training is run on a single A100 NVIDIA GPU node hosted on an internal HPC cluster, with 8 CPU cores and no more than 16GB memory allocated. No significant storage was required. Each experiment took less than 10 minutes. Estimated total compute is 27 GPU hours: 2 sets of 80 repetitions, 10 GPU minutes each.

Language model experiments. Training the language model and estimating its LLC at the last checkpoint took around 30 minutes on Google Colab on a single A100 NVIDIA GPU with 84GB memory and 1 CPU allocated. The storage space used for the training data is around 27GB. Estimated total compute is 5 GPU hours: 10 repetitions of 30 GPU minutes each.