Paper deep dive
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory
Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet
Models: Pythia-70M, Pythia-160M, Pythia-410M, Pythia-1B, Pythia-1.4B, Pythia-2.8B, Pythia-6.9B
Abstract
We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases linearly, correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.
Tags
Links
- Source: https://arxiv.org/abs/2510.12077
- Canonical: https://arxiv.org/abs/2510.12077
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 6:09:03 PM
Summary
The paper extends the Minimum Description Length (MDL) principle to singular models like neural networks using Singular Learning Theory (SLT). It demonstrates that the local learning coefficient (LLC) is a robust measure of model complexity, showing a strong correlation with empirical compressibility in large-scale models like the Pythia suite.
Entities (5)
Relation Signals (3)
Local Learning Coefficient → correlateswith → Compressibility
confidence 95% · complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases, linearly correlated with compressibility
Singular Learning Theory → extends → Minimum Description Length
confidence 90% · using singular learning theory to extend the minimum description length (MDL) principle to singular models
Pythia → usedinexperiment → Compressibility
confidence 90% · Through extensive experiments on the Pythia suite... we find that complexity estimates... are closely... correlated with compressibility
Cypher Suggestions (2)
Find all metrics related to model complexity · confidence 90% · unvalidated
MATCH (m:Metric)-[:CORRELATES_WITH]->(c:Metric) RETURN m, c
Identify theories applied to specific technologies · confidence 85% · unvalidated
MATCH (t:Theory)-[:EXTENDS|APPLIED_TO]->(tech:Technology) RETURN t, tech
Full Text
98,681 characters extracted from source content.
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory

Einar Urdshals† (Timaeus, einar@timaeus.co), Edmund Lau† (UK AI Security Institute, edmund.lau@dsit.gov.uk), Jesse Hoogland (Timaeus, jesse@timaeus.co), Stan van Wingerden (Timaeus, stan@timaeus.co), Daniel Murfet (Timaeus, daniel@timaeus.co). †Equal contribution.

Abstract

We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases linearly, correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.

1 Introduction

A central challenge in deep learning is to measure a model’s complexity, that is, the amount of information about the dataset that is encoded in its parameters. This cannot be trivially derived from the loss because there are ways to achieve a given level of loss that involve different quantities of information: for example, the network can memorize the training data (encoded using a relatively large fraction of its weights) or discover a general solution (encoded using a small number of weights). A measurement that could distinguish these two kinds of solutions would be useful, for example, in predicting how a network will behave out of distribution. How then are we to measure this quantity? One simple practical answer involves compression: given a loss tolerance ε > 0 and some compression scheme with parameter P (such that larger P means more compression), let P_max be the amount of compression that increases the loss from its original value L up to the threshold L + ε.
Intuitively, if the network encoded its solution to the constraints in the data using a small fraction of its weights, then it could “withstand” a large amount of compression and P_max will be large. If the network has used all of its capacity to encode the solution, then we expect P_max to be small. Given the practical importance of compression techniques like quantization, this seems like a useful measure of model complexity. However, the theoretical status of this notion of “compressibility” is a priori unclear. The informal relationship between compressibility and complexity goes back to LeCun et al. (1989); Hochreiter and Schmidhuber (1997) and has been the basis for theoretical bounds on generalization error (Arora et al., 2018). It is clear that compressibility in the above sense must be related to ideas like minimum description length (MDL; Grünwald and Roos 2019). In this paper we investigate the relation between various practical compression schemes and MDL via singular learning theory (SLT; Watanabe 2009) and the estimator for a measure of model complexity known as the local learning coefficient (Lau et al., 2024), and in this way provide some theoretical basis for the intuitive connection between compressibility and complexity in the setting of deep learning.

Contributions. We make the following contributions:

• We derive a singular MDL principle (Section 3): Using ideas from singular learning theory (SLT; Watanabe 2009), we extend the minimum description length (MDL; Grünwald and Roos 2019) principle to neural networks and prove that there is a two-part code for which the asymptotic redundancy involves the local learning coefficient (LLC; Lau et al. 2024), a measure of model complexity from SLT. In contrast to the classical treatment of MDL, where geometric invariants like the curvature determined by the Hessian appear in the description length, the important geometric feature in the singular case is degeneracy (Figure 1).
• We compare the LLC to compressibility: in the setting of compression via quantization and factorization, we study empirically the relation between compressibility and the LLC by plotting them against each other for a range of models from the Pythia family up to 6.9B parameters, across training checkpoints. As expected, we find that models with larger LLCs tend to be less compressible. For quantization we observe a particularly close relationship: over a large fraction of training steps there is a linear relationship between the estimated LLC and the compressibility measured in bits.

From these results we draw two main conclusions. Firstly, the informal notion of compressibility as a measure of model complexity is consistent with the LLC estimate, which has a sound theoretical foundation. Secondly, compressibility in Pythia models serves as an independent check on the practice of using LLC estimates for models at these scales; this is valuable since we lack theoretical knowledge of the true LLC for large transformer models (see Section D.2).

Figure 1: Loss landscape volume determines model compressibility. We generalize the minimum description length (MDL) principle to singular models like neural networks, which exhibit redundancy in their loss landscape (right two panels). This redundancy, or “degeneracy”, is the leading-order contribution to model compressibility, not the curvature as determined by the Hessian (left two panels). From left to right: (1) regular model with symmetric quadratic loss; (2) regular model with elliptical quadratic loss showing different curvatures along principal axes; (3) minimally singular model with a redundant parameter, creating a valley of degenerate optima; (4) “normal-crossing” singularity showing higher-order degeneracy.

2 Related Work

Network compression in deep learning. There is a large literature on model compression, which is evolving rapidly. A standard reference is Han et al.
(2016), and newer surveys include Hoefler et al. (2021); Wang et al. (2024b). It has long been recognized that the “effective dimension” of deep neural networks is typically much smaller than the number of parameters (Maddox et al., 2020). This is widely understood as one reason why model compression is possible (LeCun et al., 1989; Hassibi et al., 1993; Denil et al., 2013). Pruning models by discarding small-magnitude weights, or using the spectrum of the Hessian to determine low-saliency weights, coupled with the empirical success of such pruning methods, has led to an informal working understanding of effective dimension in terms of “how much compression can be done without sacrificing too much performance.” Nonetheless, the theoretical basis for using, e.g., the Hessian spectrum to determine effective dimension remains weak. The existence of “lottery tickets” (that is, sparse and trainable subnetworks at initialization) also suggests a large degree of redundancy in the final trained parameter (Frankle and Carbin, 2019).

Intrinsic dimension of fine-tuning. Related to, but distinct from, the low effective dimension of trained neural networks is the low observed “intrinsic dimension” of fine-tuning pretrained LLMs (Li et al., 2018). Here the intrinsic dimension refers to the minimum dimension of a hyperplane in the full parameter space in which the fine-tuning optimization problem can be solved to some precision level. This can be many orders of magnitude smaller than the full dimension; for example, Aghajanyan et al. (2021) note that 200 parameters are enough to solve a fine-tuning problem for a RoBERTa model (with 335M parameters) to within 90% of the performance of the full model. This observation that the update matrices in LLM fine-tuning have a low “intrinsic rank” led to the introduction and widespread usage of low-rank adaptation for fine-tuning (Hu et al., 2022).
The relation of this intrinsic dimension to the effective dimension of the full pretrained model is unclear. For additional related work see Appendix A.

3 Theory: Singular MDL

MDL is the canonical theoretical framework that relates compressibility and model complexity. The idea of measuring the complexity of a trained model, or of a given local minimum of the population loss landscape, is well known in the literature on MDL (Grünwald and Roos, 2019) and was used by Hochreiter and Schmidhuber (1997) in an attempt to quantify the complexity of a trained neural network. However, these classical MDL treatments assume that models are “regular”, meaning that the parameter-to-distribution map w ↦ p_w is one-to-one (which implies that there is a unique global minimum) and that the Fisher information matrix I(w) is everywhere non-singular (middle-left panel of Figure 2). Since this assumption is invalid for neural networks, the resulting theory does not apply. In this section, we start from the MDL principles and use insights from SLT to extend its applicability to singular models like neural networks.

3.1 Setup

Let X denote a sample space and let q^(n) ∈ Δ(X^n) be an unknown data-generating distribution on the space of X-sequences x^(n) = (x_1, …, x_n) ∈ X^n of length n ∈ ℕ. We assume that X is finite (e.g., the token vocabulary for modern transformer language models). Any distribution p^(n) on X^n gives rise to a prefix-free (thus uniquely decodable) encoding ⟦x^(n)⟧_{p^(n)} for any x^(n) ∈ X^n, with code length given by len(⟦x^(n)⟧_{p^(n)}) = −log p^(n)(x^(n)). Conversely, every prefix-free encoding can be used to define such a distribution (Kraft, 1949; McMillan, 1956), which we shall call an encoding distribution. A central observation of the MDL principle is that any statistical pattern or regularity in q^(n) can be exploited to compress samples x^(n) of q^(n).
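The correspondence len(⟦x^(n)⟧_{p^(n)}) = −log p^(n)(x^(n)), together with the Kraft inequality, can be checked numerically. A minimal sketch; the two-symbol alphabet and its probabilities are our own illustrative choices, not from the paper:

```python
import math

# Toy encoding distribution over the alphabet X = {"a", "b"}
# (illustrative numbers; any distribution defines code lengths this way).
p = {"a": 0.75, "b": 0.25}

def code_length_bits(x_seq, p):
    """len of x^(n) under the i.i.d. encoding distribution p^(n) = p x ... x p,
    i.e. -log2 p^(n)(x^(n)) = sum_i -log2 p(x_i)."""
    return sum(-math.log2(p[x]) for x in x_seq)

# Kraft inequality: lengths l(x) = -log2 p(x) satisfy sum_x 2^(-l(x)) <= 1,
# so a prefix-free (uniquely decodable) code with these lengths exists.
lengths = {x: -math.log2(px) for x, px in p.items()}
assert sum(2.0 ** -l for l in lengths.values()) <= 1.0 + 1e-12

# Encoding "bb" costs -log2(1/4) = 2 bits per symbol, 4 bits total.
assert code_length_bits("bb", p) == 4.0
```

Rarer symbols get longer codewords, which is the sense in which regularity in the data distribution can be exploited for compression.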
If a learning algorithm can extract these regularities through samples x^(n) alone, then it has implicitly learned a good compression of x^(n). This is the oft-invoked principle of “learning as compression”. Throughout, we will consider learning machines with a finite-dimensional parameterized statistical model, denoted M = {p_w^(n) ∈ Δ(X^n) | w ∈ W}, where W ⊂ ℝ^d is a compact d-dimensional parameter space. An important example for this work is the case of modern autoregressive language models, where the x^(n) are token sequences and the model p_w^(n) takes the form

p_w^(n)(x^(n)) = p_w(x_1) p_w(x_2 | x_1) p_w(x_3 | x_1, x_2) ⋯ p_w(x_n | x_1, …, x_{n−1})

for some learned sequence-to-next-token model p_w, such as a transformer (Vaswani et al., 2017). (This is an example of what is known as a prequential code in the MDL literature.) For exposition, we will focus on the case where both the data and model are independent and identically distributed (i.i.d.; see discussion of assumptions in Appendix F). This means that, for every n ∈ ℕ, the data distribution and model distribution on X^n are respectively given by

q^(n)(x^(n)) = ∏_{i=1}^n q(x_i)  and  p_w^(n)(x^(n)) = ∏_{i=1}^n p_w(x_i)

for some unknown q and model {p_w}. Under such assumptions, the unique minimum average code length in the long run (large n) is achieved by setting the data-generating distribution itself as the encoding distribution, i.e., setting p^(n) = q^(n). The expected per-symbol excess length compared to this minimum is measured by the Kullback-Leibler (KL) divergence,

D_KL(q ‖ p_w) := E_{x∼q}[log 1/p_w(x)] − E_{x∼q}[log 1/q(x)] = H(q, p_w) − H(q),

where H denotes the (cross-)entropy. We will call the first, parameter-dependent term above the population loss and denote it L(w).
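The identity D_KL(q ‖ p) = H(q, p) − H(q), read as per-symbol excess code length, can be checked numerically. A minimal sketch with a hypothetical three-symbol alphabet; the distributions are illustrative, not from the paper:

```python
import math

def cross_entropy(q, p):
    """H(q, p) = E_{x~q}[-log2 p(x)]: expected bits per symbol when data
    drawn from q is encoded with the code induced by p."""
    return sum(qx * -math.log2(px) for qx, px in zip(q, p))

def kl_divergence(q, p):
    """D_KL(q || p) = H(q, p) - H(q): per-symbol excess code length paid
    for encoding with p instead of the true distribution q."""
    return cross_entropy(q, p) - cross_entropy(q, q)

q = [0.5, 0.25, 0.25]  # true data distribution
p = [0.4, 0.4, 0.2]    # encoding distribution

assert kl_divergence(q, q) == 0.0  # encoding with q itself is optimal
assert kl_divergence(q, p) > 0.0   # any other encoding pays a positive premium
```

Gibbs’ inequality guarantees the premium is non-negative, which is why p^(n) = q^(n) achieves the minimum average code length.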
Note that the empirical estimate of L given by L_n(w) = −(1/n) ∑_{i=1}^n log p_w(x_i) is the usual per-token cross-entropy criterion used for training modern transformer networks, also known as the average negative log-likelihood at w.

3.2 Two-Part Codes

We will focus on the case of two-part codes to clarify the underlying geometrical phenomenon and explain the direct relevance to neural network compression. To communicate with two-part codes, the sender and receiver agree to first communicate an encoding distribution p by sending some encoded representation ⟦p⟧, before sending the data encoded with p, ⟦x^(n)⟧_p. Once ⟦p⟧ is received, the receiver can reconstruct the encoding distribution and decode any message encoded with p. The result is a message of length

len(⟦p⟧) + len(⟦x^(n)⟧_p) = len(⟦p⟧) + ∑_{i=1}^n log 1/p(x_i).

One measure of code performance, known as redundancy, is defined to be the excess code length compared to the encoding generated using the data distribution itself as the encoding distribution. So, if the data is drawn i.i.d. from q, then the redundancy of the two-part code is given by

R_n := len(⟦p⟧) + len(⟦x^(n)⟧_p) − ∑_i log 1/q(x_i) = len(⟦p⟧) + ∑_i log q(x_i)/p(x_i).   (1)

Notice that if the chosen encoding distribution p is sufficiently good, in the sense of having small KL-divergence E_q[log q(x)/p(x)] = D_KL(q ‖ p) from q, then it is worth paying len(⟦p⟧) bits to obtain a cheap encoding for samples drawn from q. Suppose the sender and receiver have shared knowledge of a finite-dimensional statistical model (e.g., a neural network architecture). This allows them to communicate using codes specific to the biases implicit in the model architecture. In other words, the model provides an implicit prior on the set of distributions M, allocating codes of varying length depending on which hypotheses are considered simple or complex.
To state the main theorem, let M = {p_w ∈ Δ̊_m(X) : w ∈ W ⊂ ℝ^d} be a statistical model of finite dimension d ∈ ℕ. We will assume some technical conditions on the model, laid out in Appendix E.1. In particular, we require that all distributions in our model lie in the restricted simplex Δ̊_m(X) of uniformly lower-bounded distributions: given m > 0 and a distribution p over X, we say p ∈ Δ̊_m(X) if min_{x∈X} p(x) ≥ m.

Theorem 1. There exists a two-part code such that, for any realizable data-generating distribution q ∈ M and dataset x^(n) drawn i.i.d. from q, the asymptotic redundancy is

R_n = λ log n − (m − 1) log log n + O_p(1),

where λ is the learning coefficient of q for the model and m is the multiplicity.

We refer to Watanabe (2009) for the definitions of the learning coefficient λ and the multiplicity m. To establish this result, we need to specify a way for the sender to communicate a specific hypothesis or distribution p in the model. We note that a model generally contains uncountably many distinct distributions, yet any parameter encoding can specify at most countably many; thus, discretization is needed. We assume the sender and receiver have a way to construct, for any ε > 0, shared finite sets Q_ε = {p_1, p_2, …, p_{N_ε}} ⊂ M such that any p ∈ M belongs to some set of the form P_ε(p*) := {p ∈ M : D_KL(p ‖ p*) ≤ ε} with p* ∈ Q_ε. (This is a finite ε-net of the model in distribution space.) Let us define (R for Reversed)

V_p^R(ε) := Vol({w ∈ W : D_KL(p_w ‖ p) ≤ ε}).

Given p ∈ M, we assume a consistent and shared algorithm for choosing (breaking ties consistently when needed) p* ∈ Q_ε such that p ∈ P_ε(p*) and V_{p*}^R(ε) = max{V_{p′}^R(ε) | p′ ∈ Q_ε and p ∈ P_ε(p′)}.
Observe that this produces a partition of the model M, with each set in the partition represented by a grid point p* in Q_ε (possibly a smaller set, in which case we take Q_ε to be this non-redundant set; ideally we would like a tight ε-KL-sphere packing, and if M is a subset of the interior of the simplex Δ(X) with non-empty interior, a construction similar to Balasubramanian (1996) yields such an ε-net). We then assign probability

≈ V_{p*}^R(ε) / Vol(W)

to p*, and therefore a code of length

len(⟦p*⟧) := log( Vol(W) / V_{p*}^R(ε) ).

(The equality is only approximate because the true volume should be the volume of the partition cells rather than that of the ε-balls; the difference can be made small via careful selection of the centers p* and sphere-packing arguments.) Notice that this is very different from putting the uniform distribution on M (e.g., by using Jeffreys prior on W if M is regular). We are deliberately assigning shorter codes to hypotheses p* ∈ Q_ε that are simpler according to the model’s own implicit bias: a hypothesis is simpler to state relative to a given model if it takes up more parameter volume (requires lower parameter precision to specify its distribution) up to ε error tolerance.

With such a construction, we can now calculate the two-part code length with respect to some model M for i.i.d. data drawn from a data distribution q that is realizable (q ∈ M) and satisfies the assumptions in Appendix E.1. Let p̂ = argmin_{p∈M} ∑_{i=1}^n log 1/p(x_i) be the maximum likelihood distribution, and let p*_n be the grid point in Q_ε closest to p̂, i.e., p*_n := argmin_{p∈Q_ε} D_KL(p̂ ‖ p). To send the data x^(n), we send the encoding of p*_n and the data encoded with this distribution.
Writing K_n(p) = (1/n) ∑_{i=1}^n log q(x_i)/p(x_i), the redundancy of the code at tolerance ε is given by

R_n = log( Vol(W) / V_{p*_n}^R(ε) ) + n K_n(p*_n)   (2)
    = log( Vol(W) / V_{p*_n}^R(ε) ) + n D_KL(q ‖ p*_n) + n ( K_n(p*_n) − D_KL(q ‖ p*_n) ),   (3)

where the middle term n D_KL(q ‖ p*_n) of (3) is denoted (⋆). Now, we introduce a dependency of the tolerance on n by setting ε_n = a/n for some a > 0. With this assumption, both n D_KL(q ‖ p*_n) and n (K_n(p*_n) − D_KL(q ‖ p*_n)) are O_p(1), by Theorem 5 and Theorem 3 respectively. Therefore, the redundancy is given asymptotically by

R_n = −log V_{p*_n}^R(ε_n) + O_p(1).

In (3) we see a fundamental tradeoff: decreasing the error tolerance ε (a finer grid) decreases the excess code length (⋆), because we can find a grid point p*_n closer to p̂ and thus to q, but decreasing the tolerance also decreases the volume V_{p*_n}^R(ε) and thus increases the cost of communicating p*_n. As in the case of regular models (see for example Balasubramanian (1996)), the optimal grid size for a dataset of size n scales as ε_n = O(1/n): any faster rate of decay for ε_n implies a finer distinguishability of grid points than the number of data points n can justify (the MLE itself has D_KL(q ‖ p̂) = O_p(1/n); see further discussion in Appendix E.4). It remains to determine the behavior of V_{p*_n}^R(ε_n). One difficulty is that the center p*_n is a random variable that depends on the data and changes with ε_n. However, it is also clear that, as ε_n → 0, p*_n approaches the data-generating distribution q in KL-divergence. Furthermore, the relevant volume will be similar to that of the set of parameters where p_w is close to q, defined by

V_q(ε) := Vol({w ∈ W : D_KL(q ‖ p_w) ≤ ε}).
Theorem 4 shows that there exists C > 0 such that for any ε > 0 and p* ∈ M with D_KL(q ‖ p*) ≤ ε, we get

V_q(ε) ≤ V_{p*}^R(Cε) ≤ V_q( (C(C+1)/2) ε ).   (4)

This allows us to draw conclusions about V_{p*}^R(ε), for p* such that D_KL(q ‖ p*) ≤ ε, by investigating V_q(ε). To do that, we invoke a central result of SLT:

Theorem 2 (Watanabe, 2009; Arnold et al., 2012). Let f : W → ℝ_{≥0} be a non-negative analytic function. Then there exist λ ∈ ℚ and m ∈ ℕ such that the volume of the ε-sublevel sets is given by

Vol({w ∈ W : f(w) ≤ ε}) ∼ c ε^λ (−log ε)^{m−1}

as ε → 0, for some constant c > 0.

Applying this theorem to f(w) = D_KL(q ‖ p_w), whose sublevel-set volume is V_q(ε), and using Equation (4), we get

c ε^λ (−log ε)^{m−1} ≤ V_{p*}^R(Cε) ≤ c (C(C+1)/2)^λ ε^λ ( −log ε − log(C(C+1)/2) )^{m−1}.

Using the fact that (−log ε + a)^{m−1} / (−log ε)^{m−1} → 1 as ε → 0 for any a ∈ ℝ, we conclude that there exist c′, c″ > 0 such that for sufficiently small ε,

c′ ε^λ (−log ε)^{m−1} ≤ V_{p*}^R(Cε) ≤ c″ ε^λ (−log ε)^{m−1}.   (5)

This in turn implies (note that Equation (5) shows V_{p*}^R(Cε) / V_{p*}^R(ε) = O(1) as ε → 0)

−log V_{p*}^R(ε) = λ log(1/ε) − (m − 1) log log(1/ε) + O_p(1).   (6)

Finally, recall that we took the grid scale to be ε_n = a/n. For sufficiently large a > 0, we have D_KL(q ‖ p*_n) ≤ ε_n with high probability by Theorem 5. Therefore the result in Equation (6) applies, and plugging in the expression for ε_n, we get

R_n = λ log n − (m − 1) log log n + O_p(1),

which concludes the proof of Theorem 1.
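To make the volume-scaling law in Theorem 2 concrete, λ and m can be computed by hand for simple analytic functions on W = [−1, 1]². The following toy examples are ours, chosen to mirror the regular, minimally singular, and normal-crossing geometries of Figure 1; they are not calculations from the paper:

```latex
% Regular: f(w) = w_1^2 + w_2^2 (ellipsoidal sublevel sets)
\mathrm{Vol}\{f \le \varepsilon\} = \pi\varepsilon
  \;\Rightarrow\; \lambda = 1 = \tfrac{d}{2},\quad m = 1.

% Minimally singular: f(w) = w_1^2 (w_2 is a free parameter)
\mathrm{Vol}\{f \le \varepsilon\} = 2\sqrt{\varepsilon}\cdot 2 = 4\,\varepsilon^{1/2}
  \;\Rightarrow\; \lambda = \tfrac{1}{2},\quad m = 1.

% Normal crossing: f(w) = w_1^2 w_2^2, i.e. |w_1 w_2| \le \sqrt{\varepsilon}
\mathrm{Vol}\{f \le \varepsilon\}
  = 4\bigl(\sqrt{\varepsilon} + \sqrt{\varepsilon}\,\log\tfrac{1}{\sqrt{\varepsilon}}\bigr)
  \sim 2\,\varepsilon^{1/2}\log\tfrac{1}{\varepsilon}
  \;\Rightarrow\; \lambda = \tfrac{1}{2},\quad m = 2.
```

The free direction in the minimally singular case halves λ, and the normal crossing raises the multiplicity m; this is exactly the degeneracy that enlarges sublevel-set volume and makes singular models cheaper to encode.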
Notice that the leading-order terms above can be interpreted as model complexity: it is the code length required to communicate a sufficiently good encoding distribution p*_n in the model while maintaining an O(1) excess length for the encoded message, even as the number of samples n → ∞. We remark that Theorem 2 is a consequence of the celebrated theorem on the resolution of singularities by Hironaka (1964). The scaling exponent λ is known as the real log canonical threshold (RLCT) of the analytic function w ↦ D_KL(q ‖ p_w), and m is its multiplicity. Watanabe (2009) was the first to make use of the resolution of singularities and thereby connect these geometric invariants to statistical learning, showing that λ log n gives the leading-order term for model complexity and λ/n gives the leading-order term for generalization error in Bayesian learning. In the context of machine learning, λ is referred to as the learning coefficient. For regular models, the sublevel sets look ellipsoidal (Figure 2, top-left), with volume ∼ ε^{d/2}, and thus the learning coefficient is λ = d/2, where d is the parameter count; its multiplicity is m = 1. Indeed, there are simpler two-part code constructions for regular models that achieve R_n = (d/2) log n + O_p(1) by just using a regular rectangular grid in W of scale O(1/√n) (corresponding to a KL-divergence scale of O(1/n) in the space of distributions). Observe that this leading-order behavior of R_n for regular models is independent of the data distribution q. For singular models, λ < d/2, which means models can potentially be much more compressible than their explicit parameter count suggests. Figure 2 (top-middle and top-right) illustrates how the sublevel sets can have complex geometry with degeneracies, resulting in larger sublevel-set volumes that allow for a higher level of compressibility.
3.3 Relation to compressibility

Figure 2: Degeneracy determines volume and compressibility. Left: The relationship between error tolerance ε and the number of bits required to encode parameters for three different model geometries shows that the volume-scaling exponent λ (the “local learning coefficient,” LLC) determines compressibility. The right three panels show level-set contours of the respective loss landscapes at ε = 2, …, 2^{−8} (same as in Figure 1). (1) Regular: a model with elliptical level sets, requiring approximately (d/2) log(1/ε) bits (λ = 1 for d = 2) to specify a point within tolerance ε. (2) Minimally singular: a model with a free parameter, requiring only ((d−1)/2) log(1/ε) bits (λ = 1/2 for d = 2) due to degeneracy along the w_2 direction. (3) Singular: a model exhibiting a more complex geometric structure, requiring approximately λ log(1/ε) bits, with λ = 1/4.

In the previous section we established the existence of a two-part code in which the leading term of the asymptotic redundancy (excess code length compared to the encoding we would use if we knew the true data distribution) is λ log n, where λ is the learning coefficient. This is directly related to compression (Grünwald and Roos, 2019), as it tells us the number of bits needed to communicate a set of samples x^(n) between a sender and receiver who share a statistical model. This MDL perspective captures the idea that a model class which allows for simpler representations of a given data distribution (smaller λ) offers better compression of its samples. However, it remains to explain how any given practical compression scheme (e.g., quantization) fits into this story. In this section we provide a less formal argument, based on the concepts introduced in the previous section, which aims to explain this connection in a straightforward way.
From a mathematical perspective, parameters live in a continuous space W ⊆ ℝ^d, but any realization in a computer uses some kind of grid with spacing h > 0. Fix a local minimum w* of the population loss L and define the local excess loss K(w) = L(w) − L(w*). We consider only parameters in a neighborhood of w* that is small enough that K(w) is non-negative. Invoking the sublevel-set volume law (Theorem 2), there exist numbers λ(w*) and m(w*) such that

V(ε) := Vol({w ∈ W : K(w) ≤ ε}) ∼ c ε^{λ(w*)} (−log ε)^{m(w*)−1}.   (7)

Here λ(w*) and m(w*) are known as the local learning coefficient (LLC) and multiplicity, introduced in Lau et al. (2024). Our aim in the remainder of this section is to connect the resolution h to the loss tolerance ε through the LLC λ(w*). Consider the quantization cell C_h(w) = {u : ‖u − w‖_2 ≤ h/2} around a parameter w, with volume proportional to h^d. To guarantee that quantization does not increase the excess loss beyond ε, it is sufficient that the cell containing w* be contained in the ε-sublevel set around w*. A surrogate for this containment is the volume condition Vol(C_h) ≤ V(ε), or

h^d ≤ ε^{λ(w*)} (log 1/ε)^{m(w*)−1}.   (8)

If we write n_q for the number of intervals in each coordinate of our grid, then n_q behaves like 1/h. We denote by h* and n_q* the level of quantization that reaches the loss tolerance ε and therefore makes (8) an equality. Hence, writing the per-coordinate bit budget as b*(ε) = log_2 n_q*, we have

b*(ε) = (λ(w*)/d) log_2(1/ε) + O( log log(1/ε) / d ).   (9)

Thus, for a fixed loss tolerance ε, the critical bits per coordinate grow linearly with the LLC. Intuitively, with larger λ (less degeneracy), the admissible basin is smaller, so smaller cells (finer grids, more bits) are needed to keep the entire cell inside the basin.
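A minimal numerical sketch of the leading-order bit-budget relation (9), dropping the log log correction; the λ, d, and ε values below are illustrative, not measurements from the paper:

```python
import math

def bits_per_coordinate(llc, d, eps):
    """Leading-order per-coordinate bit budget: b*(eps) ~ (lambda(w*)/d) * log2(1/eps).

    llc: local learning coefficient lambda(w*) (illustrative values below)
    d:   parameter count
    eps: loss tolerance
    """
    return (llc / d) * math.log2(1.0 / eps)

d, eps = 1000, 0.5

# For a fixed tolerance, critical bits per coordinate grow linearly in the LLC:
assert abs(bits_per_coordinate(200.0, d, eps) - 2 * bits_per_coordinate(100.0, d, eps)) < 1e-12

# A fully regular model (lambda = d/2) recovers the familiar cost of half a bit
# per coordinate for each halving of eps: b*(1/4) = (1/2) * log2(4) = 1 bit.
assert bits_per_coordinate(d / 2, d, 0.25) == 1.0
```

Larger λ at fixed d gives a steeper bit requirement, matching the intuition that less degenerate minima tolerate coarser grids less well.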
This is illustrated in Figure 2.

4 Methodology

To complement the theory on the singular MDL principle, we study how compressibility relates to local learning coefficient (LLC) estimates in practice. In the main text we focus on quantization (Section 4.1). In the appendices, we also treat tensor factorization (Section C.2), pruning (Section C.5), and adding Gaussian noise to the model parameters (Section C.4). For estimating the LLC, in Section 4.2 we describe a preconditioned variant of the estimator in Lau et al. (2024).

4.1 Quantization

We quantize models using a symmetric quantization scheme that includes 0. Given n_q ∈ 2ℤ_{>0} and m > 0, we subdivide each of the intervals [0, m] and [−m, 0] into subintervals of length Δ = m/(n_q/2 − 1), so that each contains n_q/2 possible values lying at the endpoints of the subintervals (including 0 and ±m). Combining these to form [−m, m], and accounting for the double counting of 0, there are n_q intervals and n_q − 1 possible quantized values. To quantize a parameter w = (w_1, …, w_d) ∈ W ⊆ ℝ^d means firstly to “clamp” each w_i to the interval [−m, m] and then round these values to the nearest quantized value in this interval according to the above subdivision. More precisely, we define w_i^quant := round(w_i/Δ) · Δ and w^quant = (w_1^quant, …, w_d^quant). Note that specifying each w_i^quant requires log_2(n_q) bits. In Section 5 we treat m as a free parameter and search for a value that minimizes the loss of the quantized model. This is our baseline method, inspired by a more sophisticated approach in Cheong and Daniel (2019), which allows for non-evenly spaced quantization intervals. The increase in loss caused by quantization is a function ΔLoss = L_n(w^quant) − L_n(w) of n_q and w, and is typically larger when n_q is smaller.
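A minimal sketch of the clamp-and-round scheme as we read it, together with a binary search for the smallest even n_q whose loss increase stays within a tolerance ε (our own code, not the authors’ implementation; the loss curve in the example is a toy stand-in, not real data):

```python
def quantize(w, n_q, m):
    """Symmetric quantization onto the n_q - 1 values {k*delta : |k| <= n_q/2 - 1}
    in [-m, m] (delta = m / (n_q/2 - 1)), after clamping each coordinate to [-m, m]."""
    assert n_q > 0 and n_q % 2 == 0
    delta = m / (n_q / 2 - 1)
    return [round(max(-m, min(m, wi)) / delta) * delta for wi in w]

def critical_n_q(delta_loss, eps, k_max=2**15):
    """Smallest even n_q = 2k with delta_loss(n_q) <= eps, assuming delta_loss
    is (roughly) non-increasing in n_q; None if even n_q = 2*k_max is not enough."""
    if delta_loss(2 * k_max) > eps:
        return None
    lo, hi = 1, k_max
    while lo < hi:  # binary search over k for the threshold crossing
        mid = (lo + hi) // 2
        if delta_loss(2 * mid) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return 2 * lo

# quantize with n_q = 8, m = 1: grid {-1, -2/3, -1/3, 0, 1/3, 2/3, 1};
# out-of-range coordinates are clamped before rounding.
q = quantize([0.1, 0.4, -2.5], n_q=8, m=1.0)

# Toy monotone stand-in for the DeltaLoss curve (illustrative only):
curve = lambda n_q: 8.0 / n_q
assert critical_n_q(curve, eps=0.5) == 16  # 8/16 <= 0.5, while 8/14 > 0.5
```

In practice delta_loss would evaluate L_n on the quantized model; the binary search assumes the loss increase falls off monotonically as the grid gets finer.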
We measure the compressibility of a language model with parameter w by finding the smallest n_q with ΔLoss(w) ≤ ε; we call this value the critical n_q and denote it n_q*. When n_q* is large the model is less compressible (we hit the threshold with a smaller amount of compression), and conversely, when n_q* is small the model is more compressible. In Section C.3 we show results for the cruder quantization method of setting m to the largest parameter absolute value, which is equivalent to the scheme used by Kumar et al. (2025).

4.2 LLC estimation

We consider a transformer neural network that models the conditional distribution p(y | x; w) of outputs y (next tokens) given inputs x (contexts), where w ∈ W represents the network parameters in a compact parameter space W. Given samples D_n from a true distribution with associated empirical loss L_n, we define the estimated local learning coefficient at a parameter point w* to be

λ̂(w*) = nβ [ E_{w | w*, γ}^β [L_n(w)] − L_n(w*) ],   (10)

where E_{w | w*, γ}^β is the expectation with respect to the Gibbs posterior (Bissiri et al., 2016),

p(w | w*, β, γ) ∝ exp( −nβ L_n(w) − (γ/2) ‖w − w*‖_2^2 ).   (11)

The hyperparameters are the sample size n, the inverse temperature β, which controls the contribution of the loss, and the localization strength γ, which controls proximity to w*. For a full account of these hyperparameters, we refer the reader to Watanabe (2013); Lau et al. (2024); Hoogland et al. (2025). Our LLC estimation procedure uses the preconditioned stochastic gradient Langevin dynamics (pSGLD) algorithm (Li et al., 2015), which combines RMSProp-style adaptive step sizes with SGLD (Welling and Teh, 2011). For more details on LLC estimation and its uncertainties, see Appendix D.

5 Results

Figure 3: Sensitivity of Pythia models of different sizes to quantization.
On the left, we show the loss increase as a function of n_q, with the black dashed line indicating our choice of loss tolerance ε = 0.5. On the right, we show the critical n_q, i.e., the n_q for which ΔLoss = ε, as a function of the LLC for different training checkpoints of Pythia models. The transparent points are not included in the linear fits. The checkpoints increase with the LLC; that is, training time moves from left to right along all curves.

In this section we give experimental results relating compressibility under quantization to LLC estimates. For results on tensor factorization see Section C.2. As explained in Section 4.1, given a loss threshold ε we measure compressibility by the critical number of quantization intervals n_q* at which the increase in loss ΔLoss hits the threshold. In Figure 3, left side, we show the increase in loss due to compression as a function of the number of quantized values, n_q. We observe that the loss curves feature a knee at a loss increase of around ε = 0.5, and we therefore choose this as our loss tolerance. In the appendix, we show a selection of other ε values and find that they lead to similar results. In the panel on the right, we observe the critical n_q increasing linearly with the LLC over a large range of training steps, as expected from (9). We find a linear fit with R² = 0.98 for all the shown models. In Section C.1 we show results for additional Pythia models, which also feature ranges of training checkpoints with a linear relation between critical n_q and LLC.

6 Conclusion

We have established a theoretical foundation for understanding neural network compression through the lens of singular learning theory, extending the minimum description length principle to account for the degenerate geometry of neural network loss landscapes.
Our experiments demonstrate that the local learning coefficient (LLC) provides a principled measure of compressibility: model checkpoints with larger estimated LLCs prove to be less compressible across multiple compression techniques, including quantization and factorization. The strong linear relationship observed between LLC estimates and critical compression thresholds for quantization (R² ≥ 0.98) is an independent check that our current SGLD-based estimates capture meaningful information about model complexity for transformer models up to 6.9B parameters. This is an encouraging signal for applications of SLT to large neural networks, but significant methodological challenges remain for LLC estimation and similar techniques. The sensitivity of LLC estimates to hyperparameters and the likely gap between estimated and true values are the primary limitations of our current framework.

Looking forward, the field is advancing along two complementary paths that will eventually converge. From one direction, practical compression techniques continue to improve, pushing closer to theoretical limits. From the other direction, the developing science of LLC estimation offers a path toward more accurate estimation of these limits. As these approaches converge, we will gain a precise understanding of both the fundamental limits of compression and how closely practical techniques approach them.

Acknowledgments

We would like to thank George Wang for his feedback on earlier drafts and his advice on hyperparameter search. We would also like to thank Jacob Pfau, Geoffrey Irving, and Tomek Korbak for their valuable feedback and discussions. This project was funded by the UK AI Security Institute (AISI).

Author Contributions

Einar Urdshals led the experimental component, including implementing, running, and analyzing all compression experiments. Edmund Lau developed the theoretical framework and the singular MDL principle.
Jesse Hoogland contributed to implementation, analysis, writing, and figure design. Stan van Wingerden ran initial hyperparameter sweeps, contributed to implementation, and managed the experimental infrastructure. Daniel Murfet led the project and contributed to analysis, theory, and writing.

References

Aghajanyan et al. [2021] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.568. URL https://aclanthology.org/2021.acl-long.568/. AI [2024] Meta AI. Introducing quantized llama models with increased speed and a reduced memory footprint. https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/, 2024. Anthropic [2024] Anthropic. Claude 3.5 Haiku on AWS Trainium2 and model distillation in Amazon Bedrock. https://w.anthropic.com/news/trainium2-and-distillation, 2024. Aoyagi [2024] Miki Aoyagi. Consideration on the learning efficiency of multiple-layered neural networks with linear units. Neural Networks, page 106132, 2024. Arnold et al. [2012] V I Arnold, S M Gusein-Zade, and A N Varchenko. Singularities of differentiable maps, volume 2: Monodromy and asymptotics of integrals, volume 2 of Modern Birkhäuser Classics. Birkhauser Boston, Secaucus, NJ, 2012 edition, May 2012. Arora et al. [2018] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International conference on machine learning, pages 254–263. PMLR, 2018. Balasubramanian [1996] Vijay Balasubramanian.
A geometric formulation of Occam's razor for inference of parametric distributions. arXiv [adap-org], January 1996. Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023. Bissiri et al. [2016] Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016. Busbridge et al. [2025] Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. arXiv preprint arXiv:2502.08606, 2025. Carlini et al. [2023] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=TatRHT_1cK. Carroll et al. [2025] Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, and Daniel Murfet. Dynamics of transient structure in in-context linear regression transformers, 2025. URL https://arxiv.org/abs/2501.17745. Chen et al. [2015] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient mcmc algorithms with high-order integrators. Advances in neural information processing systems, 28, 2015. Chen et al. [2023] Z Chen, E Lau, J Mendel, S Wei, and D Murfet. Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition. arXiv preprint arXiv, 2023. Chen and Murfet [2025] Zhongtian Chen and Daniel Murfet. Modes of sequence models and learning coefficients. arXiv preprint arXiv:2504.18048, 2025. Cheng et al.
[2017] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017. Cheong and Daniel [2019] Robin Cheong and Robel Daniel. Compressing transformers with pruning and quantization. Department of Computer Science, Stanford University, 2019. Denil et al. [2013] Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando De Freitas. Predicting parameters in deep learning. Advances in neural information processing systems, 26, 2013. Dettmers and Zettlemoyer [2023] Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning, pages 7750–7774. PMLR, 2023. Elhage et al. [2022] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. arXiv [cs.LG], September 2022. Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7. Frantar et al. [2025] Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, and Dan Alistarh. Compression scaling laws: Unifying sparsity and quantization. arXiv preprint arXiv:2502.16440, 2025. Gao et al. [2020] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. Grünwald and Roos [2019] Peter Grünwald and Teemu Roos. Minimum description length revisited. International journal of mathematics for industry, 11(01):1930001, 2019. Han et al. 
[2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a. Han et al. [2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015b. URL https://proceedings.neurips.c/paper_files/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf. Han et al. [2016] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016. Hassibi et al. [1993] Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993. Hironaka [1964] Heisuke Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero: I. Ann. Math., 79(2):205–326, 1964. Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Flat Minima. Neural Computation, 9(1):1–42, 1997. Hoefler et al. [2021] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124, 2021. Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. 
An empirical analysis of compute-optimal large language model training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30016–30030. Curran Associates, Inc., 2022. URL https://proceedings.neurips.c/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf. Hoogland et al. [2025] Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. Loss landscape degeneracy drives stagewise development in transformers, 2025. URL https://arxiv.org/abs/2402.02364. Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. Kraft [1949] Leon Gordon Kraft. A device for quantizing, grouping, and coding amplitude-modulated pulses. PhD thesis, Massachusetts Institute of Technology, 1949. Kumar et al. [2025] Tanishq Kumar, Zachary Ankner, Benjamin Frederick Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Re, and Aditi Raghunathan. Scaling laws for precision. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=wg1PCg3CUP. Lau et al. [2024] Edmund Lau, Zach Furman, George Wang, Daniel Murfet, and Susan Wei. The Local Learning Coefficient: A Singularity-Aware Complexity Measure, September 2024. LeCun et al. [1989] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989. Li et al. [2015] Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. 
Preconditioned stochastic gradient langevin dynamics for deep neural networks, 2015. URL https://arxiv.org/abs/1512.07666. Li et al. [2018] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryup8-WCW. Maddox et al. [2020] Wesley J Maddox, Gregory Benton, and Andrew Gordon Wilson. Rethinking parameter counting in deep models: Effective dimensionality revisited. arXiv preprint arXiv:2003.02139, 2020. McMillan [1956] B McMillan. Two inequalities implied by unique decipherability. IEEE Trans. Inf. Theory, 2(4):115–116, December 1956. Mishra et al. [2021] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks, 2021. URL https://arxiv.org/abs/2104.08378. Moar et al. [2024] Chakshu Moar, Faraz Tahmasebi, Michael Pellauer, and Hyoukjun Kwon. Characterizing the accuracy – efficiency trade-off of low-rank decomposition in language models, 2024. URL https://arxiv.org/abs/2405.06626. Nanda and Bloom [2022] Neel Nanda and Joseph Bloom. TransformerLens, 2022. URL https://github.com/neelnanda-io/TransformerLens. Ouyang et al. [2024] Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, and Dong Yu. Low-bit quantization favors undertrained llms: Scaling laws for quantized llms with 100t training tokens. arXiv preprint arXiv:2411.17691, 2024. Teh et al. [2016] Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016. Urdshals and Urdshals [2025] Einar Urdshals and Jasmina Urdshals. Structure development in list-sorting transformers. arXiv preprint arXiv:2501.18666, 2025. Vaswani et al. 
[2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv [cs.CL], June 2017. Wang et al. [2024a] George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, and Daniel Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient, 2024a. URL https://arxiv.org/abs/2410.02984. Wang et al. [2024b] Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, and Xiaofei He. Model compression and efficient inference for large language models: A survey, 2024b. URL https://arxiv.org/abs/2402.09748. Watanabe [2009] Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, USA, 2009. Watanabe [2013] Sumio Watanabe. A Widely Applicable Bayesian Information Criterion. Journal of Machine Learning Research, 14:867–897, 2013. Watanabe [2018] Sumio Watanabe. Mathematical theory of Bayesian statistics. CRC Press, 2018. Welling and Teh [2011] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, 2011. Xu et al. [2024] Zifei Xu, Alexander Lan, Tristan Webb, Sayeh Sharify, Xin Wang, et al. Scaling laws for post-training quantized large language models. arXiv preprint arXiv:2410.12119, 2024. Zhu et al. [2024] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models, 2024. URL https://arxiv.org/abs/2308.07633.

Appendix

This appendix provides detailed information about our methodology, experimental setup, and additional results that supplement the main paper. We organize the appendix into the following sections:

• Appendix A: Additional related work and discussion.
• Appendix B: Descriptions of the Pythia model architectures used in our experiments.
• Appendix C: Additional experimental results:
  – Additional quantization results with loss minimization, including more models, more ε values, and critical n_q vs. steps (Section C.1)
  – Results on tensor factorization (Section C.2)
  – Quantization results without loss minimization, including loss increase plotted against n_q, three different ε values, and critical n_q vs. LLC and training step (Section C.3)
  – Addition of Gaussian noise, loss increase plotted against the strength of Gaussian noise, three different ε values, and the critical amount of Gaussian noise vs. LLCs and steps (Section C.4)
  – Structured pruning, our retraining protocol, and loss increase as a function of pruning amount (Section C.5)
• Appendix D: Details on our LLC estimation procedure, including hyperparameter settings and computational resources required.
• Appendix E: Supplementary mathematical derivations and proofs that extend the theoretical framework presented in Section 3.
• Appendix F: Further details on the derivation of the singular MDL principle and its implications.
• Appendix G: Details on LLM usage.

Appendix A Additional related work and discussion

Model compression in industry. Model compression techniques are widely employed by industry leaders to scale inference of large language models (LLMs), as they significantly reduce model size, memory footprint, and inference latency. The connection to compression is due to the fact that memory and latency are primarily determined by the total number of bits in the parameters [Dettmers and Zettlemoyer, 2023]. Quantization is used by Meta to compress their LLaMA models, approximately halving their memory footprint and doubling inference throughput [AI, 2024]. Knowledge distillation has similarly been utilized by Anthropic to create smaller models like Claude 3.5 Haiku, which achieves near-identical performance to its larger predecessor, Claude 3.5 Sonnet, while substantially lowering deployment costs [Anthropic, 2024].
Pruning, particularly structured sparsity supported by NVIDIA GPUs, also shows empirical evidence of approximately doubling inference throughput by eliminating around half of the model’s weights [Mishra et al., 2021]. Scaling laws and compression. The training of large-scale neural networks obeys empirical scaling laws [Kaplan et al., 2020, Hoffmann et al., 2022], which relate test loss to parameter count and tokens seen during training. Since model compression techniques work by reducing the effective parameter count, at the cost of an increase in loss, it is natural to wonder how to incorporate compression into the neural scaling laws. Most of the work to date has been on empirical scaling laws for quantization [Dettmers and Zettlemoyer, 2023, Ouyang et al., 2024, Xu et al., 2024, Frantar et al., 2025, Kumar et al., 2025], although there is some work on distillation [Busbridge et al., 2025]. Data-dependent compression bounds. Lossy compression is always defined relative to a specific loss function on a particular dataset, which implicitly chooses which capabilities to prioritize and preserve. A corollary is that any attempts to derive compression bounds based on the pretraining objective may be unnecessarily conservative: a large fraction of a model’s capacity goes to memorization [Carlini et al., 2023], much of which may be irrelevant to particular capabilities. Understanding compression requires data-dependent bounds such as those considered here. Security implications. Our work establishes a theoretical connection between SLT and neural network compressibility, providing a principled framework that could inform future security research. By demonstrating that the local learning coefficient (LLC) correlates with practical compression limits, we lay groundwork for developing rigorous bounds on how much specific capabilities can be compressed. 
Future work building on these theoretical foundations could provide robust bounds on the information required to transmit specific capabilities, helping calibrate security measures and inform discussions about model weight protection.

Economic drivers and theoretical limits of compression. Halving the memory cost of a model can potentially double its operational value: under fixed GPU budgets, compressing parameters (e.g., pruning or quantization) directly raises the volume of token processing and thus revenue [Wang et al., 2024b, Zhu et al., 2024]. This incentive is driving substantial private research and development, so the state of the art in model compression likely surpasses known public benchmarks [Cheng et al., 2017, Han et al., 2015a]. In this situation, it is particularly valuable to understand the theoretical limit of model compression, since this limit is a key factor in the economic feedback loops driving investment. This is particularly true as AI systems start to do autonomous research.

Appendix B Model details

We conduct experiments on models from the Pythia suite [Biderman et al., 2023], ranging from 14M to 6.9B parameters for most experiments. For Pythia, we include model checkpoints ranging from 2k to 90k, excluding later checkpoints because of apparent instability in the original training runs. Note that all Pythia models are trained on the same data in the same order from the Pile [Gao et al., 2020]. We begin with an already (losslessly) compressed version of the Pythia models, in which layer norm weights are folded into subsequent linear layers, following the default settings in our TransformerLens-based implementation [Nanda and Bloom, 2022].

Appendix C Additional Results

In this appendix, we provide quantization and factorization results for additional model sizes, as well as results for quantization without loss minimization, addition of Gaussian noise to the parameters, and pruning.
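The lossless layer norm folding mentioned in Appendix B rests on a simple reparameterization: the LayerNorm affine parameters can be absorbed into the weights and bias of the subsequent linear layer. A minimal NumPy sketch of the underlying algebra (not the TransformerLens implementation itself, and with hypothetical dimensions) is:

```python
import numpy as np

def fold_layernorm(gamma, beta, W, b):
    """Fold LayerNorm affine parameters into the next linear layer.

    If x_hat is the normalized (zero-mean, unit-variance) activation, then
        W @ (gamma * x_hat + beta) + b == W_folded @ x_hat + b_folded,
    where W_folded = W * gamma (scaling each input column) and
    b_folded = W @ beta + b. This is an exact, lossless reparameterization.
    """
    W_folded = W * gamma[None, :]   # scale column j of W by gamma[j]
    b_folded = W @ beta + b
    return W_folded, b_folded

# Check the identity on random (made-up) shapes.
rng = np.random.default_rng(0)
d_in, d_out = 5, 3
gamma, beta = rng.normal(size=d_in), rng.normal(size=d_in)
W, b = rng.normal(size=(d_out, d_in)), rng.normal(size=d_out)
x_hat = rng.normal(size=d_in)

Wf, bf = fold_layernorm(gamma, beta, W, b)
assert np.allclose(W @ (gamma * x_hat + beta) + b, Wf @ x_hat + bf)
```

Because the identity is exact, the folded model computes the same function as the original, which is why this form of compression is lossless.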
C.1 Quantization with Loss Minimization

Figure 4: Sensitivity of Pythia models to quantization. The panels show loss increase as a function of n_q, with the lower right panel showing the critical n_q as a function of the LLC for ε = 0.5. Note that for Pythia-14M and Pythia-70M, checkpoint 90k is not included since our quantization algorithm fails for these checkpoints. We do a linear fit excluding the transparent points, finding R² values of 0.98 and above.

Figure 5: Critical n_q vs. LLC for different ε values. We see that the curves are qualitatively ε-insensitive. Note that for Pythia-14M and Pythia-70M, checkpoint 90k is not included since our quantization algorithm fails for these checkpoints.

Figure 6: Critical n_q vs. training step for different ε values. Note that for Pythia-14M and Pythia-70M, checkpoint 90k is not included since our quantization algorithm fails for these checkpoints.

We show results for additional models with the quantization method used in Section 5 in Figure 4, finding linear fits with R² ≥ 0.98 over checkpoint ranges across a wide range of Pythia models. The comparison of LLC vs. critical n_q for three different choices of ε is shown in Figure 5. As stated in the main body, these curves are qualitatively ε-insensitive. For comparison, we show the critical n_q as a function of training step in Figure 6, and observe that the curves are qualitatively similar to the ones with the LLC on the x-axis. This is expected, as the LLC is an increasing function of training steps, as shown in Figure 21.

C.2 Tensor Factorization

Figure 7: Sensitivity of Pythia models of different sizes to factorization. On the left, we show the loss as a function of the compression fraction, with the black dashed line indicating our choice of loss tolerance ε = 0.5. The data points at (1, 0) are the models before compression.
On the right, we show the critical compression fraction, i.e., the compression fraction for which ΔLoss = ε, as a function of the LLC.

Tensor factorization techniques decompose weight matrices in neural networks into products of smaller matrices, reducing the total number of parameters. We perform the factorization via singular value decomposition (SVD) of each weight matrix W, truncating a fixed fraction of the singular values, leaving the weight matrix approximated as

W ← U × S × V,   (12)

where S is a diagonal matrix with n diagonal entries. We follow the heuristics outlined in Moar et al. [2024]: we target a selection of layers and factorize all matrices in those layers. We avoid the very last and very first layers, and also avoid factorizing consecutive layers. In the experiments shown in Section 5, we avoid factorizing the embedding and unembedding matrices. If W has dimensions d_1 by d_2, then before factorization the matrix has d_1 × d_2 parameters, whereas after factorization it has d_1 × n + n + n × d_2 parameters. The reported compression fraction is the ratio of the total number of parameters in the model after and before factorization, i.e.,

Compression Fraction = (# parameters after factorization) / (# parameters before factorization).   (13)

For the smaller Pythia models, where the embedding and unembedding matrices dominate the parameter count, the compression fraction is always close to 1. To measure the compressibility of the models under factorization, we find the critical compression fraction, i.e., the value of the compression fraction that causes a loss increase of ε. In Figure 7, left side, we show the compression-induced loss increase as a function of compression fraction. To a lesser extent than with quantization, we observe the loss curves featuring a knee at a loss increase of around ε = 0.5.
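A minimal sketch of this factorization step, using NumPy's SVD on a single randomly generated matrix, may help fix the parameter counts in Eq. (12) and (13). This is hypothetical illustration code; the layer-selection heuristics from Moar et al. [2024] are omitted.

```python
import numpy as np

def factorize(W, keep_frac):
    """Truncated-SVD factorization of a weight matrix.

    Keeps n = ceil(keep_frac * min(d1, d2)) singular values, replacing W
    by factors (U, s, Vt) with d1*n + n + n*d2 parameters in total,
    matching the count stated in the text.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    n = max(1, int(np.ceil(keep_frac * len(s))))
    return U[:, :n], s[:n], Vt[:n, :]

# Made-up dimensions and a random matrix for illustration.
d1, d2 = 64, 32
rng = np.random.default_rng(1)
W = rng.normal(size=(d1, d2))

U, s, Vt = factorize(W, keep_frac=0.25)
n = len(s)
W_approx = U @ np.diag(s) @ Vt  # the low-rank approximation of W

params_before = d1 * d2
params_after = d1 * n + n + n * d2
print("compression fraction:", params_after / params_before)
```

In a real model one would store the factor pair instead of W, so the forward pass computes U(S(V·x)) at the reduced parameter cost.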
For consistency, we stick to the same value of ε for factorization as for quantization. Later in this section, we show a selection of other ε values, which lead to qualitatively similar results. In the panel on the right, we observe the critical compression fraction largely increasing with increasing LLC, with the exception of Pythia-6.9B, where it seems to flat-line at later steps. This might be related to Pythia-6.9B late in training featuring a knee at considerably higher ε values, between 1 and 1.5.

Figure 8: Sensitivity of Pythia models to factorization. The panels show loss increase as a function of compression fraction. We exclude checkpoint 90000 of Pythia-14M due to suspected training instability causing a very large estimated LLC.

Figure 9: Critical compression fraction vs. LLC for different choices of ε. Checkpoint 90000 of Pythia-14M is excluded due to suspected training instability.

Figure 10: Critical compression fraction vs. training step for different choices of ε. Checkpoint 90000 of Pythia-14M is excluded due to suspected training instability.

In Figure 8 we show the loss increase as a function of compression fraction for all Pythia models up to and including 6.9B. We compare different choices of ε in Figure 9 and Figure 10, and observe that the curves are largely ε-insensitive. We observe that the critical compression fraction is mostly an increasing function of both the LLC and the training step.

C.3 Quantization without Loss Minimization

Figure 11: Sensitivity of Pythia models to quantization without loss optimization. The panels show loss increase as a function of n_q, with the lower right panel showing the critical n_q as a function of the LLC for ε = 0.5. Note that for Pythia-14M, checkpoint 90k is not included due to suspected training instability. Transparent data points are not included in the linear fits.

Figure 12: Critical n_q vs. LLC for different ε values for quantization without loss optimization.
We see that the curves are qualitatively ε-insensitive. Note that for Pythia-14M, checkpoint 90k is not included due to suspected training instability. Transparent data points are not included in the linear fits.

Figure 13: Critical n_q vs. training step for different ε values for quantization without loss optimization. We see that the curves are qualitatively ε-insensitive. Note that for Pythia-14M, checkpoint 90k is not included due to suspected training instability.

Here we quantize by setting m to the largest parameter absolute value, rather than selecting it by minimizing the post-quantization loss. We show the loss increase for this form of quantization in Figure 11. A comparison of the critical n_q for three different choices of ε is shown in Figure 12 and Figure 13, and we observe that the curves are largely ε-insensitive. We observe that this form of quantization also features the critical n_q increasing as a function of the LLC, but we find worse linear fits for critical n_q vs. LLC than for quantization with loss minimization in Section C.1. This might be because quantization with loss minimization better probes the loss landscape near w*.

C.4 Adding Gaussian Noise

Figure 14: Sensitivity of Pythia models to absolute Gaussian noise. As in the other settings, we exclude checkpoint 90000 of Pythia-14M.

Figure 15: Sensitivity of Pythia models to relative Gaussian noise. As in the other settings, we exclude checkpoint 90000 of Pythia-14M.

Figure 16: Critical σ as a function of the LLC for absolute Gaussian noise.

Figure 17: Critical σ as a function of the LLC for relative Gaussian noise.

Figure 18: Critical σ as a function of the training step for absolute Gaussian noise.

Figure 19: Critical σ as a function of the training step for relative Gaussian noise.

We have two ways of adding Gaussian noise. The first, which we call absolute Gaussian noise, involves updating the parameters of the model according to

w ← w* + σ N(0, 1).   (14)

We use relative Gaussian noise to refer to adding noise proportional to the parameter,

w ← w* + w* σ N(0, 1).   (15)

In Figure 14 and Figure 15, we show the loss increase as a function of σ for absolute and relative noise, respectively, with the lower right corner showing the critical σ for ε = 0.5 as a function of the LLC. We observe that for relative noise, the critical σ largely decreases with increasing LLC, as expected. For absolute Gaussian noise, the picture is more complicated, and is probably impacted by the change in magnitude of the model parameters over the course of training. In Figure 16 and Figure 17, we show the critical σ as a function of the LLC for different values of the loss tolerance ε. We observe that the qualitative shape of the curves is ε-insensitive. In Figure 18 and Figure 19 we plot the critical σ as a function of training step. We observe a similar relation between training step and critical σ as between the LLC and critical σ. Again, we observe that the curves are qualitatively ε-insensitive.

C.5 Pruning

Figure 20: Sensitivity of Pythia models to pruning. The panels show loss increase as a function of the fraction of heads remaining. Here we include checkpoint 90000 of Pythia-14M, as we do not consider critical fractions in this setting.

Pruning techniques can be broadly categorized into structured and unstructured approaches. Unstructured pruning involves removing individual weights throughout the network without any regular pattern, potentially achieving higher compression rates but requiring specialized hardware or software to realize computational speedups. Structured pruning, on the other hand, removes entire structured components (e.g., neurons, filters, or attention heads), resulting in models that are inherently smaller and faster on standard hardware. For our experiments, we focus on structured pruning of attention heads in transformer models. When pruning a model, we first specify a desired fraction of heads to keep, p.
From this, we compute the number of heads to prune, $n$, as

$$n = \lfloor (1-p) N_h \rfloor, \tag{16}$$

where $N_h$ is the total number of attention heads in the model. We then select $n$ heads at random and set their weight matrices to zero (excluding biases). Following pruning, we implement a retraining phase with the following specifications [Han et al., 2015b]:

• The gradients of the weight matrices in pruned heads are fixed to zero to maintain the pruning structure.
• We use a learning rate 1/10th of the one used during initial training.
• We retrain for 1000 steps, taking the post-retraining loss to be the minimal training loss during retraining.

In Figure 20, we show how the loss changes during pruning of a selection of Pythia models. Since several of these curves are very rugged, we refrain from plotting the LLC vs. critical values of $p$.

Appendix D LLC Estimation Details

Sanity checks for LLCs. It was shown in Lau et al. [2024] that variations in training hyperparameters (learning rate, batch size, and momentum) affect LLC estimates in the way one would expect for a measure of model complexity. Outside of the limited cases where theoretical values of the LLC for large neural networks are available (principally deep linear networks; Aoyagi, 2024), such experiments serve as a crucial "sanity check" on LLC estimates. The experimental results on the effect of compression on the LLC in this paper serve as a complementary set of sanity checks for LLC estimation in models up to 6.9B parameters.

D.1 Implementation of the LLC Estimator

Computational Resources. LLC estimation for our largest models required substantial computational resources. For reference, a single LLC estimation for the Pythia-6.9B model required approximately 3.5 hours on an H200 GPU with 141GB memory.

Hyperparameters.
We estimate the LLC of the Pythia models on the Pile [Gao et al., 2020], using the full context of 2048 tokens, with localization $\gamma = 300$, inverse temperature $n\beta = 30$, and 4 SGLD chains with 200 steps for models smaller than 1B parameters and 100 steps for models equal to or larger than 1B parameters. We use a batch size of 32, and use 8 batches to calculate $L_n(w^*)$. The SGLD learning rate varies with model size: $10^{-3}$ for Pythia-14M, $3 \times 10^{-4}$ for Pythia-31M and Pythia-70M, $10^{-4}$ for Pythia-160M and Pythia-410M, $3 \times 10^{-5}$ for Pythia-1B and Pythia-1.4B, $10^{-5}$ for Pythia-2.8B, and $3 \times 10^{-6}$ for Pythia-6.9B.

Figure 21: LLC vs. training step for the Pythia models.

Estimated LLCs for Pythia models. In Figure 21 we show the LLC as a function of training step for the Pythia models. We see that, with the exception of Pythia-14M through Pythia-70M, the LLC rises smoothly as a function of training step.

D.2 (Challenges in) Estimating the LLC

The main obstacle to using the LLC in practice as a tool for evaluating compression techniques is that we usually do not have direct access to the true LLC, $\lambda$, but must instead estimate its value, $\hat{\lambda}$, and these estimates may be systematically biased. Currently, the only scalable approach to estimating LLCs for large neural networks is via gradient-based approximate posterior sampling methods like SGLD [Lau et al., 2024]. The resulting estimates have been found in recent years to be useful in practice for understanding the development of neural networks [Hoogland et al., 2025, Wang et al., 2024a, Carroll et al., 2025, Urdshals and Urdshals, 2025]. However, while there is a deep mathematical theory behind the definition of the LLC, there are several serious problems with the current state of empirical practice:

1. There are gaps in the theory of SGLD.
Although there is a theoretical literature [Welling and Teh, 2011, Chen et al., 2015, Teh et al., 2016] that provides conditions (for example, decaying step size and long chains) under which gradient-based posterior sampling methods converge weakly to the true posterior for some classes of statistical models, some of the technical conditions in these theorems do not hold for all neural networks. Thus, the theoretical status of SGLD-based estimation is unclear.

2. We do not fully understand the role of hyperparameters like the inverse temperature. In practice, we know that varying the inverse temperature $\beta$ used for estimation does affect the estimates. In principle, any inverse temperature is valid (since the effect due to the tempering of the posterior should be canceled by the $n\beta$ occurring as a prefactor), but in practice, SGLD-based estimation appears sensitive to this factor. Since the only principled setting is $\beta = 1/\log n$ [Watanabe, 2013], which is too small for stable estimation in our settings, we know that the LLC estimates can, at best, be meaningful up to whatever effect this variation has on estimates. Chen and Murfet [2025] prove that the temperature acts as a resolution dial on the loss landscape, so that we sample from an effectively truncated posterior, but this effect is not yet fully understood. This is why we have focused on applications of LLC estimation to a single model with the same hyperparameters across training, under the hypothesis that this effect does not confound comparisons of LLC values at different training timesteps.

3. Unrealistic values for large networks. SGLD-based LLC estimation can produce accurate estimates for deep linear networks [Lau et al., 2024]. Keeping in mind the previous point, the hyperparameters we select for the Pythia suite lead to LLC estimates that are on the order of hundreds, for models with parameter counts ranging from millions to billions.
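To make the estimator under discussion concrete, the following is a toy sketch of SGLD-based LLC estimation on a synthetic loss (our illustration under simplifying assumptions, not the paper's implementation; all function names are hypothetical). It implements the standard form $\hat\lambda = n\beta\,(\mathbb{E}[L_n(w)] - L_n(w^*))$, where the expectation is over SGLD samples localized around $w^*$ by the drift term $\gamma(w - w^*)$:

```python
import numpy as np

def estimate_llc(grad_loss, loss, w_star, n_beta=30.0, gamma=300.0,
                 lr=1e-4, n_steps=200, n_chains=4, seed=0):
    """Toy SGLD-based LLC estimator (sketch, not the paper's implementation).

    Runs SGLD chains targeting the tempered, localized posterior
    exp(-n_beta * L(w) - (gamma/2) * ||w - w_star||^2) and returns
    n_beta * (mean sampled loss - loss at w_star).
    """
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_chains):
        w = w_star.copy()
        for _ in range(n_steps):
            # SGLD step: half-step drift down the tempered, localized
            # log-posterior gradient plus Gaussian noise of variance lr
            drift = -(lr / 2) * (n_beta * grad_loss(w) + gamma * (w - w_star))
            w = w + drift + np.sqrt(lr) * rng.standard_normal(w.shape)
            draws.append(loss(w))
    return n_beta * (np.mean(draws) - loss(w_star))
```

On a simple quadratic loss centered at $w^*$, the estimate is a small positive number whose exact value depends on $n\beta$, $\gamma$, and the SGLD step size, illustrating the hyperparameter sensitivity discussed above.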
Appendix E Theoretical results for Singular MDL

E.1 Assumptions

In this section, we list the sufficient conditions for the results discussed in this work to hold. Recall that we have an outcome space $X$, a data distribution $q \in \Delta(X)$, and a model $\mathcal{M} = \{ p_w \in \Delta(X) : w \in W \subset \mathbb{R}^d \}$.

Finite outcome space. We assume that the outcome space $X$ is finite, so that the data distribution $q$ and the distributions $p_w$ in a model are members of the finite-dimensional simplex $\Delta(X)$. This is a severe restriction stated for expository ease; there is no fundamental obstruction to relaxing it to the continuous case.

Conditions for SLT. As we rely heavily on the core result of SLT, we require sufficient conditions similar to those stated by Watanabe [2009, Definitions 6.1 and 6.3] and the relaxation of the realisability assumptions stated in Watanabe [2018, Chapter 3.1]. Importantly, we require that

• The parameter space is a compact subset of $\mathbb{R}^d$ with non-empty interior.
• The data distribution $q$ satisfies the relatively finite variance condition set out in Watanabe [2018, Definition 7].
• The loss function $w \mapsto \left( x \mapsto \log \frac{1}{p_w(x)} \right)$ can be extended to an $L^2(q)$-valued complex analytic function.

Uniformly bounded away from the boundary. This is a technical condition that allows us to treat the KL-divergence as almost a metric on $\mathcal{M}$. We require the model under consideration to be a subset of the restricted simplex $\mathring{\Delta}_m(X)$ for some $m > 0$, defined as follows.

Definition 1. Let $X$ be a finite set and let $m$ be a number $0 < m < 1/|X|$. We define $\mathring{\Delta}_m(X)$ as the set of distributions in the interior of the simplex $\Delta(X)$ that are uniformly bounded away from the simplex boundary by $m$. That is,

$$\mathring{\Delta}_m(X) := \left\{ p \in \Delta(X) : \min_{x \in X} p(x) \ge m \right\}.$$

E.2 Lemmas

Lemma 1. Let $0 < m \le 1/|X|$ be fixed.
There exist constants $c, c' > 0$ such that for any $q, p \in \mathring{\Delta}_m(X)$,

$$c\,\|p-q\|_2^2 \le D_{\mathrm{KL}}(q\|p) \le c'\,\|p-q\|_2^2,$$

where $\|\cdot\|_2$ denotes the $\ell^2$-norm.

Proof. Let $r(x) := \frac{q(x)}{p(x)}$. Note that $r(x) \in [m, 1/m]$. Now,

$$D_{\mathrm{KL}}(q\|p) = \sum_{x\in X} p(x)\,r(x)\log r(x) = \sum_{x\in X} p(x)\big(r(x)\log r(x) - r(x) + 1\big)$$

since $\sum_x p(x) r(x) = 1$. Let $f(z) := z\log z - z + 1$. Taylor expanding $f$ at $z=1$ up to order 2 with Lagrange remainder gives $f(z) = \frac{1}{2t}(z-1)^2$ for some $t$ between $1$ and $z$. Therefore,

$$\frac{(z-1)^2}{2\max(1,z)} \le f(z) \le \frac{(z-1)^2}{2\min(1,z)}.$$

Now, from the calculation above, we have $D_{\mathrm{KL}}(q\|p) = \sum_{x\in X} p(x) f(r(x))$, and therefore

$$\frac12 \sum_{x\in X} p(x)\,\frac{\big(\frac{q(x)}{p(x)}-1\big)^2}{\max\big(1,\frac{q(x)}{p(x)}\big)} \le D_{\mathrm{KL}}(q\|p) \le \frac12 \sum_{x\in X} p(x)\,\frac{\big(\frac{q(x)}{p(x)}-1\big)^2}{\min\big(1,\frac{q(x)}{p(x)}\big)}$$

$$\implies \quad \frac12 \sum_{x\in X} \frac{(q(x)-p(x))^2}{\max(q(x),p(x))} \le D_{\mathrm{KL}}(q\|p) \le \frac12 \sum_{x\in X} \frac{(q(x)-p(x))^2}{\min(q(x),p(x))},$$

where we have used the facts that $p(x)\min(1,r(x)) = \min(q(x),p(x))$ and $p(x)\max(1,r(x)) = \max(q(x),p(x))$. Finally, since $\max_x \max(p(x),q(x)) \le 1$ and $\min_x \min(p(x),q(x)) \ge m$, we get

$$\frac12 \|p-q\|_2^2 \le D_{\mathrm{KL}}(q\|p) \le \frac{1}{2m}\|p-q\|_2^2. \qquad ∎$$

The above result allows us to show that the KL-divergence on this restricted space of distributions satisfies a form of triangle inequality.

Lemma 2. With the same assumptions as Lemma 1, there exists $C > 0$ such that for any $q, p, p' \in \mathring{\Delta}_m(X)$,

$$D_{\mathrm{KL}}(p\|p') \le C\big(D_{\mathrm{KL}}(q\|p) + D_{\mathrm{KL}}(q\|p')\big).$$

Since this holds over all $q, p, p'$, the ordering of the arguments of each KL-divergence above does not matter.

Proof.
Applying Lemma 1 once in each direction of the inequality, together with the fact that $(a+b)^2 \le 2a^2 + 2b^2$, gives

$$D_{\mathrm{KL}}(p\|p') \le \frac{1}{2m}\|p-p'\|_2^2 \le \frac{1}{2m}\big(\|p-q\|_2 + \|q-p'\|_2\big)^2 \le \frac{1}{m}\big(\|p-q\|_2^2 + \|p'-q\|_2^2\big) \le \frac{1}{m}\big(2D_{\mathrm{KL}}(q\|p) + 2D_{\mathrm{KL}}(q\|p')\big) = \frac{2}{m}\big(D_{\mathrm{KL}}(q\|p) + D_{\mathrm{KL}}(q\|p')\big),$$

where the last inequality uses the lower bound of Lemma 1. This is the desired result with $C = 2/m$. We note that $C \ge 1$ whenever $X$ has more than one element. ∎

Lemma 3. Let $q, p \in \Delta(X)$, and let $M := \sup_x \log\frac{q(x)}{p(x)}$ and $m := \inf_x \log\frac{q(x)}{p(x)}$. Suppose $|m|$ and $M$ are both finite. Then there exist constants $c, c' > 0$, independent of $q, p$, such that

$$\big(c - D_{\mathrm{KL}}(q\|p)\big) D_{\mathrm{KL}}(q\|p) \le \mathbb{V}_q\!\left(\log\frac{q(x)}{p(x)}\right) \le \big(c' - D_{\mathrm{KL}}(q\|p)\big) D_{\mathrm{KL}}(q\|p).$$

Proof. Let $\ell(x) = \log\frac{q(x)}{p(x)}$. Using Taylor's theorem with Lagrange remainder, for each $x$ there exists $\alpha \in (0,1)$ such that

$$e^{-\ell(x)} + \ell(x) - 1 = \frac{e^{-\alpha\ell(x)}}{2}\,\ell(x)^2.$$

Furthermore, observe that

$$\mathbb{E}_q\big[e^{-\ell(x)} + \ell(x) - 1\big] = \sum_x q(x)\frac{p(x)}{q(x)} + q(x)\log\frac{q(x)}{p(x)} - q(x) = \sum_x q(x)\log\frac{q(x)}{p(x)} + \sum_x \big(p(x) - q(x)\big) = D_{\mathrm{KL}}(q\|p).$$

Combining the above, we have

$$D_{\mathrm{KL}}(q\|p) = \mathbb{E}_q\!\left[\frac{e^{-\alpha\ell(x)}}{2}\,\ell(x)^2\right].$$

Given the condition on $\ell(x)$, we have

$$\frac12\min(1, e^{-M}) \le \frac{e^{-\alpha\ell(x)}}{2} \le \frac12\max(1, e^{-m})$$

$$\implies \quad \frac12\min(1, e^{-M})\,\mathbb{E}_q\big[\ell(x)^2\big] \le D_{\mathrm{KL}}(q\|p) \le \frac12\max(1, e^{-m})\,\mathbb{E}_q\big[\ell(x)^2\big]$$

$$\implies \quad \frac{2}{\max(1, e^{-m})}\,D_{\mathrm{KL}}(q\|p) \le \mathbb{E}_q\big[\ell(x)^2\big] \le \frac{2}{\min(1, e^{-M})}\,D_{\mathrm{KL}}(q\|p).$$
Taking $c = 2/\max(1, e^{-m})$ and $c' = 2/\min(1, e^{-M})$, we get

$$c\,D_{\mathrm{KL}}(q\|p) \le \mathbb{E}_q\big[\ell(x)^2\big] \le c'\,D_{\mathrm{KL}}(q\|p)$$

$$\implies \quad c\,D_{\mathrm{KL}}(q\|p) - D_{\mathrm{KL}}(q\|p)^2 \le \mathbb{E}_q\big[\ell(x)^2\big] - D_{\mathrm{KL}}(q\|p)^2 \le c'\,D_{\mathrm{KL}}(q\|p) - D_{\mathrm{KL}}(q\|p)^2$$

$$\implies \quad \big(c - D_{\mathrm{KL}}(q\|p)\big)D_{\mathrm{KL}}(q\|p) \le \mathbb{V}_q\!\left(\log\frac{q(x)}{p(x)}\right) \le \big(c' - D_{\mathrm{KL}}(q\|p)\big)D_{\mathrm{KL}}(q\|p),$$

where the last step uses $\mathbb{V}_q(\ell) = \mathbb{E}_q[\ell(x)^2] - \mathbb{E}_q[\ell(x)]^2$ together with $\mathbb{E}_q[\ell(x)] = D_{\mathrm{KL}}(q\|p)$. Note that this implies $\mathbb{V}_q\big(\log\frac{q(x)}{p(x)}\big) = O\big(D_{\mathrm{KL}}(q\|p)\big)$ as $D_{\mathrm{KL}}(q\|p) \to 0$. ∎

E.3 Theorems

A rather straightforward application of Bernstein's inequality, together with the variance bound above, gives the following result.

Theorem 3. Let $q, p \in \Delta(X)$ and let $x^{(n)}$ be a data sequence of size $n$ drawn i.i.d. from $q$. Define the random variable $K_n := \frac{1}{n}\sum_{i=1}^n \log\frac{q(x_i)}{p(x_i)}$. Suppose $\max_x \big|\log\frac{q(x)}{p(x)} - D_{\mathrm{KL}}(q\|p)\big| \le M < \infty$ and $D_{\mathrm{KL}}(q\|p) \le \frac{c}{n} + o\big(\frac1n\big)$ for some constant $c > 0$. Then, for sufficiently large $n$,

$$\mathbb{P}\big(n \cdot |K_n - D_{\mathrm{KL}}(q\|p)| \ge t\big) \le \exp\left(-\frac{t^2/2}{C + \frac13 Mt}\right)$$

for some constant $C > 0$ independent of $p, q$. In other words, $n\big(K_n - D_{\mathrm{KL}}(q\|p)\big) = O_p(1)$.

Proof. We apply Bernstein's inequality to the centered random variables $X_i = \log\frac{q(x_i)}{p(x_i)} - D_{\mathrm{KL}}(q\|p)$, which are bounded in norm by $M$, to get

$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(-\frac{t^2/2}{\sum_i \mathbb{E}_q[X_i^2] + \frac13 Mt}\right).$$

Unpacking definitions, we get

$$\mathbb{P}\big(n(K_n - D_{\mathrm{KL}}(q\|p)) \ge t\big) \le \exp\left(-\frac{t^2/2}{n\,\mathbb{V}_q\big(\log\frac{q(x)}{p(x)}\big) + \frac13 Mt}\right).$$

Using the result from Lemma 3, we know that for sufficiently large $n$ there exists $c' > 0$ such that

$$\mathbb{V}_q\!\left(\log\frac{q(x)}{p(x)}\right) \le \frac{c'}{2} D_{\mathrm{KL}}(q\|p) \le \frac{cc'}{2n}.$$

Choosing $C = cc'/2$ gives the required result. We can apply the same argument to $X_i' := -X_i$ to get the lower tail bound. ∎

Theorem 4.
Let $\mathcal{M} = \{p_w \in \mathring{\Delta}_m(X) : w \in W \subset \mathbb{R}^d\}$ be a model consisting of distributions with uniform lower bound $m > 0$. There exists a constant $C > 0$ such that for any $\epsilon > 0$ and any $q, p^* \in \mathcal{M}$ satisfying $D_{\mathrm{KL}}(q\|p^*) \le \epsilon$, the following holds:

$$\{w \in W : D_{\mathrm{KL}}(q\|p_w) \le \epsilon\} \subseteq \{w \in W : D_{\mathrm{KL}}(p_w\|p^*) \le C\epsilon\} \subseteq \Big\{w \in W : D_{\mathrm{KL}}(q\|p_w) \le \tfrac{C}{2}(C+1)\epsilon\Big\}.$$

Proof. Applying Lemma 2 gives us a constant $C' > 0$ such that

$$D_{\mathrm{KL}}(p\|p') \le C'\big(D_{\mathrm{KL}}(p''\|p) + D_{\mathrm{KL}}(p''\|p')\big)$$

for any $p, p', p'' \in \mathcal{M}$ (recall that the ordering of the arguments does not matter). Now, set $C = 2C'$. Let $\epsilon > 0$ and $q, p^* \in \mathcal{M}$ be given such that $D_{\mathrm{KL}}(q\|p^*) \le \epsilon$. For any given $w \in W$ such that $D_{\mathrm{KL}}(q\|p_w) \le \epsilon$, we get

$$D_{\mathrm{KL}}(p_w\|p^*) \le C'\big(D_{\mathrm{KL}}(q\|p_w) + D_{\mathrm{KL}}(q\|p^*)\big) \le C'(\epsilon + \epsilon) = 2C'\epsilon = C\epsilon.$$

This proves the first inclusion. Similarly, whenever $D_{\mathrm{KL}}(p_w\|p^*) \le C\epsilon$,

$$D_{\mathrm{KL}}(q\|p_w) \le C'\big(D_{\mathrm{KL}}(p_w\|p^*) + D_{\mathrm{KL}}(q\|p^*)\big) \le C'(C\epsilon + \epsilon) = \frac{C}{2}(C+1)\epsilon.$$

This proves the second inclusion. ∎

Theorem 5. Let $\mathcal{M} = \{p_w \in \mathring{\Delta}_m(X) : w \in W \subset \mathbb{R}^d\}$ be a model consisting of distributions with uniform lower bound $m > 0$, and let $q$ be a data distribution in $\mathcal{M}$ (realizable). Given any $c > 0$ and $n \in \mathbb{N}$, suppose there exist sets $Q_n \subset \mathcal{M}$ such that for every $p \in \mathcal{M}$ there exists $p^* \in Q_n$ with $D_{\mathrm{KL}}(p\|p^*) \le \epsilon_n$, where $\epsilon_n = O\big(\frac1n\big)$. Given i.i.d. samples $x^{(n)} \sim q$ of size $n$, let $\hat{p} = \arg\min_{p\in\mathcal{M}} \sum_{i=1}^n \log\frac{1}{p(x_i)}$ be the maximum likelihood hypothesis in $\mathcal{M}$, and let $p_n^* \in Q_n$ be the closest grid point to $\hat p$, i.e. $D_{\mathrm{KL}}(\hat p\,\|\,p_n^*) = \min_{p \in Q_n} D_{\mathrm{KL}}(\hat p\,\|\,p) \le \frac{c}{n}$. Then the random variable $D_{\mathrm{KL}}(q\|p_n^*)$ satisfies

$$r_n \le D_{\mathrm{KL}}(q\|p_n^*) \le R_n$$

for some sequences of random variables $r_n$ and $R_n$ that are both of order $O_p\big(\frac1n\big)$.
Furthermore, $n\,D_{\mathrm{KL}}(q\|p_n^*)$ is bounded with high probability.

Proof. Using the inequality from Lemma 2 twice, we get

$$D_{\mathrm{KL}}(q\|p_n^*) \le C\big(D_{\mathrm{KL}}(q\|\hat p) + D_{\mathrm{KL}}(\hat p\,\|\,p_n^*)\big) \le C\,D_{\mathrm{KL}}(q\|\hat p) + C\epsilon_n$$

$$D_{\mathrm{KL}}(q\|\hat p) \le C\big(D_{\mathrm{KL}}(q\|p_n^*) + D_{\mathrm{KL}}(\hat p\,\|\,p_n^*)\big) \le C\,D_{\mathrm{KL}}(q\|p_n^*) + C\epsilon_n$$

for some constant $C > 0$. These jointly imply

$$a\,D_{\mathrm{KL}}(q\|\hat p) - b\epsilon_n \le D_{\mathrm{KL}}(q\|p_n^*) \le C\,D_{\mathrm{KL}}(q\|\hat p) + C\epsilon_n \tag{17}$$

for some constants $a, b > 0$. Now, Watanabe [2009, Main Theorem 6.4] shows that $n\,D_{\mathrm{KL}}(q\|\hat p) \to R$, where $R$ is a non-zero random variable with non-zero expectation. So $D_{\mathrm{KL}}(q\|\hat p) = O_p(1/n)$. Combined with the fact that $\epsilon_n = O(1/n)$, we get the desired result. Finally, to show that $n\,D_{\mathrm{KL}}(q\|p_n^*)$ is bounded with high probability, we observe that Watanabe [2009, Main Theorem 6.4] also shows that the random variable $R$, being a maximum of a Gaussian process with continuous sample paths on a compact parameter space $W$, is almost surely bounded. Hence, for sufficiently large $n$, we have

$$n\,D_{\mathrm{KL}}(q\|p_n^*) \le Cn\,D_{\mathrm{KL}}(q\|\hat p) + Cn\epsilon_n,$$

which is bounded with high probability. ∎

E.4 Justification for $\epsilon_n = O(1/n)$

Equation (17) shows that even if $\epsilon_n \ll \frac1n$, then $n\,D_{\mathrm{KL}}(q\|p_n^*)$ will still converge to a non-zero random variable (with non-zero expectation), since $D_{\mathrm{KL}}(q\|\hat p)$ will then be the sole dominant term instead, which is still $O_p(1/n)$. For the purpose of minimizing the redundancy in Eq. (1), we would not want $\epsilon_n$ to decay faster than $O(1/n)$, since that would increase the cost of the model description term $-\log V^R_{p_n^*}(\epsilon)$ without saving message length. On the other hand, if $\epsilon_n \gg 1/n$, i.e.
$\epsilon_n = g(n)/n$ for some $g(n)$ that diverges to infinity, then $D_{\mathrm{KL}}(q\|p_n^*)$ can be dominated by the discretisation cost, leaving $n\,D_{\mathrm{KL}}(q\|p_n^*) = O_p(g(n))$. Yet, assuming $-\log V^R_{p_n^*}(\epsilon) = -C\log\epsilon + o(\log\epsilon)$ for some $C > 0$, then, relative to $\epsilon_n = O(1/n)$, this only provides a saving in the model description term of order $O(\log g(n))$ and is thus not optimal.

Appendix F Further theoretical discussion on Singular MDL

F.1 Phases and phase transitions in code lengths

In regular models, regardless of the underlying data distribution being modeled and regardless of which minimum of the population loss is under consideration, the complexity, as measured by the LLC, is always $d/2$, where $d$ is the model parameter count. Differences in complexity only show up in lower-order terms, in the form of local curvature. In contrast, the geometry of the loss landscape can change drastically for singular models with even small changes in the data distribution, and each minimum of the population loss can have a different LLC value. Importantly for compression, there can be sudden reversals in the balance of the loss-complexity trade-off as the data size $n$ increases. This is a consequence of the fact that for different minima $w_j^*$ of $\mathcal{L}(w)$, the associated total code length has leading-order terms

$$f_j(n) = n\,\mathcal{L}(w_j^*) + \lambda(w_j^*)\log n.$$

For low $n$, minima with lower complexity but higher loss can be preferred, since they can give rise to a lower code length. But as $n$ increases, the $O(n)$ term dominates, and it becomes increasingly favorable to pay a high $\lambda(w^*)\log n$ upfront cost for specifying a high-complexity set of weights, which is then amortized by having a lower marginal cost for using those weights to send more symbols.
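The reversal described above can be illustrated numerically. With hypothetical loss and LLC values for two minima (these numbers are ours, purely for illustration), one can locate the sample size at which the leading-order code lengths cross:

```python
import numpy as np

def code_length(n, loss, llc):
    """Leading-order code length f_j(n) = n * L(w_j*) + lambda_j * log n."""
    return n * loss + llc * np.log(n)

# hypothetical phases: a simple/high-loss minimum vs. a complex/low-loss one
simple_phase = dict(loss=0.50, llc=10.0)
complex_phase = dict(loss=0.45, llc=60.0)

ns = np.arange(10, 20000)
f_simple = code_length(ns, **simple_phase)
f_complex = code_length(ns, **complex_phase)
# first sample size at which the complex minimum gives the shorter code
crossover = int(ns[np.argmax(f_complex < f_simple)])
```

Below the crossover, the simpler minimum yields the shorter total code; above it, paying the larger $\lambda \log n$ upfront cost is amortized by the lower per-sample loss.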
This is the usual model selection procedure statisticians perform to balance model complexity and fit, but for singular models, this process can happen implicitly and internally to the model. These phenomena are collectively known as phase transitions in statistical learning. Watanabe [2009] first described these phase transitions, which have since been observed and measured in various settings. For example, Chen et al. [2023] track phase transitions in a toy model of neural network superposition [Elhage et al., 2022]. They find that loss decreases as the LLC increases in a Bayesian-learning setting (performing Bayesian updates on increasing numbers of data points) and also in an SGD-training setting (taking an increasing number of gradient steps for a fixed dataset size). While there are mature theoretical explanations for the Bayesian setting, the observations in the SGD setting remain empirical results [Urdshals and Urdshals, 2025, Hoogland et al., 2025, Wang et al., 2024a]. Nonetheless, those results provide important context for the present work, where the trade-off between loss and model complexity, and thus compressibility, is also a primary concern.

F.2 I.I.D. Assumption

The need to assume i.i.d. data is a severe theoretical weakness for applications of the theory in linguistic domains. Neither the data-generating distribution $q^{(n)}$ nor the auto-regressive training objective treats sequences of tokens as i.i.d. sequences. Results in MDL and SLT can be generalized to non-i.i.d. settings like Markovian processes, but usually some form of ergodicity assumption is needed, and such assumptions are likely violated by natural language-generating processes. This is less of a theoretical issue when we are mainly discussing pretraining loss, as we can treat each chunk of text the size of the maximum context window, $M$, as a single data point and treat these chunks as i.i.d. data. This means our outcome space is $X = \mathrm{Vocab}^M$.
While the underlying data-generating process for internet text certainly does not have this structure, it is a reasonable model of a pretraining data-loading process in which chunks are fed in as independent data points. The critical caveats are thus:

• Our framework has yet to explain any capability gains from post-training methods such as various forms of fine-tuning and reinforcement learning.
• The framework is also not strong enough to discuss base model capabilities (as opposed to the ability to compress internet text), as most capability measures require modeling a joint distribution of the form $p(\text{long token sequence} \mid \text{prompt})$, which is likely non-stationary.

It is plausible that leading-order indicators like the loss and the LLC are unaffected by these considerations, but that is an open theoretical and empirical question at this stage. There is evidence that the emergence of certain algorithmic capabilities correlates sharply with changes in the LLC [Wang et al., 2024a, Urdshals and Urdshals, 2025].

Appendix G LLM usage

In the process of writing this paper, LLMs were used for literature search and copy-editing.