
Paper deep dive

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn

Year: 2024 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Theoretical · Embeddings: 219

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 6:47:53 PM

Summary

The paper introduces a framework for mechanistic interpretability by leveraging degeneracy in the loss landscape of neural networks. It proposes using Singular Learning Theory (SLT) to quantify parameter degeneracy and introduces the 'Interaction Basis' as a technique to obtain representations invariant to these degeneracies, thereby improving interpretability.

Entities (5)

Singular Learning Theory · theoretical-framework · 99%
Mechanistic Interpretability · field-of-study · 98%
Interaction Basis · technique · 97%
Local Learning Coefficient · metric · 96%
Behavioral Loss · metric · 95%

Relation Signals (3)

Interaction Basis provides Invariant Representation

confidence 96% · We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies

Singular Learning Theory quantifies Degeneracy

confidence 95% · Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate

Local Learning Coefficient measures Basin Broadness

confidence 94% · The learning coefficient quantifies the way the volume of a region of low loss changes... It is a measure of basin broadness

Cypher Suggestions (2)

Map the relationship between metrics and the concepts they quantify · confidence 92% · unvalidated

MATCH (m:Metric)-[r:MEASURES]->(c:Concept) RETURN m.name, r.relation, c.name

Find all techniques related to improving interpretability · confidence 90% · unvalidated

MATCH (t:Technique)-[:IMPROVES]->(i:Field {name: 'Mechanistic Interpretability'}) RETURN t.name

Abstract

Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.

Tags

ai-safety (imported, 100%) · interpretability (suggested, 80%) · mechanistic-interp (suggested, 92%) · theoretical (suggested, 88%)


Full Text

218,803 characters extracted from source content.


Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn

Apollo Research · Cadenza Labs · Independent. Correspondence to Lucius Bushnaq <lucius@apolloresearch.ai>
1 Introduction

Mechanistic Interpretability aims to understand the algorithms implemented by neural networks (Olah et al., 2017; Elhage et al., 2021; Räuker et al., 2023; Olah et al., 2020; Meng et al., 2023; Geiger et al., 2021; Wang et al., 2022; Conmy et al., 2024). A key challenge in mechanistic interpretability is that neurons tend to fire on many unrelated inputs (Fusi et al., 2016; Nguyen et al., 2016; Olah et al., 2017; Geva et al., 2021; Goh et al., 2021), and any apparent circuits in the model often do not show a single clear functionality and do not have clear boundaries separating them from the rest of the network (Conmy et al., 2023; Chan et al., 2022). We suggest that a central problem for current methods of reverse engineering networks is that neural networks are degenerate: there are many different choices of parameters that implement the same function (Wei et al., 2022; Watanabe, 2009). For example, in a transformer attention head, only the product $W_{OV} = W_O W_V$ of the $W_V$ and $W_O$ matrices influences the network's output; thus, many different choices of $W_O$ and $W_V$ are parameterizations of the same network (Elhage et al., 2021). This degeneracy makes parameters and activations an obfuscated view of a network's computational features, hindering interpretability. While we have workarounds for known architecture-dependent degeneracies such as the $W_{OV}$ case, Singular Learning Theory (SLT; Watanabe, 2009, 2013) suggests that we should expect additional degeneracy in trained networks that generalize well. SLT quantifies the degeneracy of the loss landscape around a solution using the local learning coefficient (LLC) (Lau et al., 2023; Watanabe, 2009, 2013).
More degenerate solutions lie in broader 'basins' of the loss landscape, where many alternative parameterizations implement a similar function. Networks with lower LLCs are more degenerate, implement more general algorithms, and generalize better to new data (Watanabe, 2009, 2013). These predictions of SLT are only straightforwardly applicable to the global minimum of the loss landscape; a generalization is required to apply these insights to real networks. In this paper we make the following contributions. First, in Section 2 we propose changes to SLT to make it useful for interpretability on real networks. Then, in Section 3 we characterize three ways in which neural networks can be degenerate. In Section 4, we prove a link between some of these degeneracies and sparsity in the interactions between features. In Section 5, we develop a technique for searching for modularity based on its relation to degeneracy in the loss landscape. Finally, in Section 6, we propose a practical technique for removing some of these degeneracies in the form of the Interaction Basis.

2 Singular learning theory and the effective parameter count

If a neural network's parameterisation is degenerate, there are many choices of parameters that achieve the same loss. At a global minimum in the loss landscape, more degeneracy in the parametrisation implies that the network lies in a broader basin of the loss. We can quantify how broad the basin is using Singular Learning Theory (SLT; Watanabe, 2009, 2013; Wei et al., 2022). In Section 2.1, we provide an overview of the key concepts from SLT that we will make use of. In Section 2.2 we explain why the tools of SLT are not completely suitable for identifying degeneracy in model internals. As a proposal to resolve some of these limitations, we introduce the behavioral loss in Section 2.2.1, and finite data singular learning theory in Section 2.2.2.
Together, these concepts will allow us to define the effective parameter count, a measure of the number of computationally relevant parameters in the network. If we achieved our goal of a fully parameterisation-invariant representation of a neural network, its explicit parameter count would equal its effective parameter count.

2.1 Background: the local learning coefficient

The most important quantity in SLT is the learning coefficient $\lambda$. We define a data distribution $x \sim X$ and a family of models with $N$ parameters, parameterised by a vector $\theta$ in a parameter space $\Theta \subseteq \mathbb{R}^N$. We also define a population loss function $L(\theta \mid X)$ which is normalised so that $L(\theta_0 \mid X) = 0$ at the global minimum $\theta_0 = \arg\min_\theta L(\theta \mid X)$. Then $\lambda$ is defined as (Watanabe, 2009):[1]

$$\lambda := \lim_{\epsilon \to 0} \left[ \epsilon \, \frac{d}{d\epsilon} \log V(\epsilon) \right], \tag{1}$$

where $V(\epsilon)$ is the volume of the region of parameter space $\Theta$ with loss less than $\epsilon$:

$$V(\epsilon) := \int_{\{\theta \in \Theta \,:\, L(\theta) < \epsilon\}} d\theta \tag{2}$$

[1] See Watanabe (2009) for a more rigorous definition of the learning coefficient.

The learning coefficient quantifies the way the volume of a region of low loss changes as we 'zoom in' to lower and lower loss. It is a measure of basin broadness, and SLT predicts that networks are biased towards points in the loss landscape with lower learning coefficient. Since the loss landscape can have many different solutions with minimum loss, this definition does not necessarily single out a region corresponding to a single solution. Therefore Lau et al.
(2023) introduce the local learning coefficient (LLC, denoted $\hat\lambda$) as a way to use the machinery of SLT to study the loss landscape geometry in the neighbourhood of a particular local minimum at $\theta^*$, by restricting the volume in the definition of the learning coefficient to a neighbourhood $\Theta_{\theta^*} \subset \Theta$ of that minimum satisfying $\theta^* = \arg\min_{\theta \in \Theta_{\theta^*}} L(\theta \mid X)$. Then we define the local volume:

$$V_{\theta^*}(\epsilon) = \int_{\{\theta \in \Theta_{\theta^*} \,:\, L(\theta) < L(\theta^*) + \epsilon\}} d\theta \tag{3}$$

and the local learning coefficient:

$$\hat\lambda(\theta^*) = \lim_{\epsilon \to 0} \left[ \epsilon \, \frac{d}{d\epsilon} \log V_{\theta^*}(\epsilon) \right]. \tag{4}$$

To see why the LLC can be thought of as counting the degeneracy in the network, consider a network with $N$ parameters and $N_{\text{free}}$ degrees of freedom in the parameterisation (such that $N_{\text{free}}$ of the parameters can be freely varied, together or independently, without affecting the loss).
Then, we can approximate the loss by a Taylor series around the minimum:

$$L(\theta \mid X) = L(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H(\theta^*) (\theta - \theta^*) + O(\|\theta - \theta^*\|^3) \tag{5}$$

where $H(\theta^*)$ is the Hessian at the minimum. Consider the case that all functionally relevant parameters contribute a quadratic term to the loss to leading order, and degrees of freedom correspond to parameters which the loss does not depend on at all. In this case, Murfet (2020) explicitly calculates the LLC, showing that it equals $\frac{1}{2}(N - N_{\text{free}})$, i.e. the LLC counts the number of functionally relevant parameters in the model. There is a sense in which, in such a model, the nominal parameter count is misleading: if there are $N_{\text{free}}$ degrees of freedom, then there are effectively only $N - N_{\text{free}}$ actual parameters in the model. Indeed, this is the right perspective to take for selecting a model class to fit data with. Watanabe (2013) demonstrates that for models with parameter-function maps that are not one-to-one, the Bayesian Information Criterion (Schwarz, 1978), which predicts which model fit to given data generalizes best (Hoogland, 2023), should be modified: the parameter count $N$ of the model should be replaced with $2\lambda$. In this simple example, the LLC is equal to half the rank of the Hessian at the minimum, and one might wonder if these two quantities are always related in this way. It turns out that they are only the same when the loss landscape can be written locally as a sum of quadratic terms, but this isn't always true.
For example, the loss landscape could be locally quartic in some directions, or the set of points with loss equal to 0 may form complicated self-intersecting shapes like a cross. In these cases, it is the LLC, not the rank of the Hessian, that measures how much freedom there is to change parameters and how much we expect a particular model to generalise.

2.2 Modifying SLT for interpretability

We would like to use the local learning coefficient to quantify the number of degrees of freedom in the parameterisation of a neural network: the number of ways the parameters in a neural network can be changed while still implementing the same function, or at least a highly similar function. However, there are some obstacles to using the LLC for this purpose:

1. The LLC $\hat\lambda(\theta^*)$ measures the size of the region of equal loss around a particular local minimum $\theta^* \in \Theta$ in the loss landscape. This loss landscape is defined by a loss function and a dataset of inputs and labels. Unless the network achieves optimal loss on this dataset, points in the region could have equal loss even though they correspond to different functions, if these functions achieve the same average performance over the dataset. We do not want our measure of the number of degrees of freedom to include different functions which achieve the same overall loss.

2. The local learning coefficient is only well defined at a local minimum of the loss, but we frequently want to interpret neural networks that have not been trained to convergence and are not at a minimum of the loss on their training distribution.

3. We would like to be able to consider two very similar but not identical functions to be the same function, if they only differ in ways that can be considered noise.
This is partially because, after finite training time, a network will not have fully converged on the cleanest version of an algorithm without any noise.[2] However, the formal approach of SLT studies models in the limit of infinite data. This turns out to correspond to taking the limit $\epsilon \to 0$ in the definition of the LLC (equation 4): after infinite data, the LLC is determined by the scaling of the volume function at loss equal to $L(\theta^*)$. This means that the LLC contains information only about exact degeneracies in the parameterization, i.e. only about different parameterisations that are exactly at the local minimum. Instead, we would prefer to work with a modified LLC which quantifies the number of parameterization choices which correspond to approximately identical functions. We introduce the behavioral loss as a resolution to problems (1) and (2), and finite data SLT as a resolution to problem (3).

[2] Indeed, sometimes it is possible to remove this noise and improve performance (Nanda et al., 2023).

2.2.1 Behavioral loss

In this section, we describe how we can define the local learning coefficient of a network to avoid problems 1 and 2 listed above. We want to define a new loss function and corresponding loss landscape for the sake of the SLT formalism (we do not train with this loss) such that all the parameter choices in a region with zero loss correspond to the same function on the training dataset: the same map of inputs to outputs.
This loss function, which we call the Behavioral Loss $L_B$, is defined with respect to an original neural network with an original set of parameters $\theta^*$, and measures how similar the function $f_\theta$ implemented by a different set of parameters $\theta$ is to the original function $f_{\theta^*}$:

$$L_B(\theta \mid \theta^*, D) = \frac{1}{n} \sum_{x \in D} \| f_\theta(x) - f_{\theta^*}(x) \|^2 \tag{6}$$

where $D$ is the training dataset and $\|v\|$ denotes the $\ell^2$-norm of $v$.[3] By definition, this loss landscape always has a global minimum at the parameters the model actually uses, $\theta = \theta^*$, solving problem 2 above. Additionally, parameter choices which achieve zero behavioral loss must have the same input-output behaviour as $f_{\theta^*}$ on the entire training dataset, solving problem 1. Note that achieving zero behavioral loss relative to a model with parameters $\theta^*$ is a stricter requirement than achieving the same loss as that model on the training data. Therefore, the behavioral loss LLC $\hat\lambda_B$ will be equal to or higher than the training loss LLC $\hat\lambda$.

[3] We arbitrarily chose an MSE loss here, but conceptually we require a loss which is non-negative and satisfies identity of indiscernibles: $L = 0 \iff \forall x : f_\theta(x) = f_{\theta^*}(x)$. For example, when studying an LLM, it may be more suitable to use KL-divergence.
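The behavioral loss and the $W_{OV}$-style degeneracy it is designed to capture can be illustrated with a minimal sketch, not from the paper: a hypothetical two-layer linear toy network in which only the product of the two weight matrices affects the output, so rescaling one matrix up and the other down is a reparameterization with zero behavioral loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear network f(x) = B @ A @ x. Only the product B @ A
# affects the output, analogous to the W_OV = W_O W_V degeneracy.
A = rng.normal(size=(4, 3))
B = rng.normal(size=(2, 4))

def f(A2, B2, x):
    return B2 @ A2 @ x

def behavioral_loss(A2, B2, data):
    # Equation (6): mean squared difference from the reference network (A, B).
    return np.mean([np.sum((f(A2, B2, x) - f(A, B, x)) ** 2) for x in data])

data = [rng.normal(size=3) for _ in range(100)]

# Move along the degenerate direction: scale A up, B down by the same factor.
c = 3.7
print(behavioral_loss(c * A, B / c, data))  # ~0: same function
print(behavioral_loss(A + 0.1, B, data))    # > 0: genuinely different function
```

Any parameter change that stays on the degenerate manifold leaves the behavioral loss at (numerically) zero, while changes that alter the implemented function do not.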
2.2.2 Singular learning theory at finite data

Next we want to resolve the problem that standard SLT formulae concern only the limit of infinite data, when the model is certainly at a local minimum of the loss landscape. We would like to think of a neural network trained on a finite amount of data as implementing a core algorithm we are interested in reverse engineering, plus some amount of 'noise' which may vary with the parameterisation and which is not important to interpret. For example, in a modular addition transformer (Nanda et al., 2023), there are parts of the network which can be ablated to improve loss: these parts of the network may be present because the model has not yet fully converged to a minimum. In this case, if we have two transformers trained on modular addition which have the same input-output behaviour after we have ablated parts to improve performance, then we would like to consider these models as implementing the same function 'up to' noise, before we ablate those parts. In this section, we sketch how to modify SLT so that the LLC becomes a measure of how many different parameterisations implement nearly the same function, rather than exactly the same function. In this way, we can numerically vary how much the functions implemented by two different parameterisations are allowed to differ from each other on the training data. We start by explaining why SLT takes the limit $\epsilon \to 0$ in the definition of the learning coefficient (equation 1). SLT is a theory of Bayesian learning machines: learning machines which start with some prior over parameters $\varphi : \Theta \mapsto (0, 1)$ which is nonzero everywhere, and which learn by performing a Bayesian update on each datapoint they observe.
After a dataset $D_n$ of $n$ datapoints, the posterior distribution over parameters is:

$$p(\theta \mid D_n) = \frac{e^{-n L(\theta \mid D_n)} \, \varphi(\theta)}{p(D_n)}\,. \tag{7}$$

where $L(\theta \mid D_n)$ is the negative log likelihood of the dataset given the model $f_\theta$, which we identify with the loss function when making a connection between Bayesian learning and SGD (Murphy, 2012), and $p(D_n)$ is a normalisation factor. The exponential dependence on $n$ ensures that in the limit $n \to \infty$, a Bayesian learning machine's posterior is only nonzero at points of minimum loss. This means that the asymptotic behaviour of the learning machine depends only on properties of the loss landscape that are asymptotically close to having zero loss. This is the reason that we take $\epsilon \to 0$ in the definition of the learning coefficient. However, since the parameters $\theta^*$ we find after finite steps of SGD correspond to an algorithm plus noise, we want to consider the size of the region of parameter space that achieves a behavioral loss less than the noise scale. From a Bayesian learning perspective, we can see in equation 7 that for a large but finite number of data points, most of the posterior concentrates around the regions of low loss, but it does not fully concentrate on the region with exactly minimum loss.
Therefore, we simply refrain from taking the limit as the loss scale $\epsilon$ goes to $0$ in the definition of the learning coefficient, and consider the learning coefficient at a particular loss scale:

$$\lambda(\epsilon) := \epsilon \, \frac{d}{d\epsilon} \log V(\epsilon) \tag{8}$$

To understand how the learning coefficient can vary with $\epsilon$, consider an illustrative example: an extremely simple setup with a single parameter $w \in \mathbb{R}$, and a loss function $L(w) = c^2 w^2 + w^4$ with $c \ll 1$. This is a toy model of a scenario where there is a very small quadratic term in the loss. This term is only 'visible' to the learning coefficient when we zoom in to very small loss values. To see this, we must calculate how the volume (equation 2) depends on the loss scale $\epsilon$. For large $\epsilon \gg c^4$ (equivalently, $\epsilon^{1/4} \gg c$), the quartic term dominates the loss at the boundary of the region, and the region of loss less than $\epsilon$ is roughly the interval $[-\epsilon^{1/4}, \epsilon^{1/4}]$. This gives $V(\epsilon) \approx 2\epsilon^{1/4}$, so the learning coefficient is $\lambda(\epsilon \gg c^4) = \frac{1}{4}$, the same as if the quadratic term were not present.
On the other hand, for small enough $\epsilon \ll c^4$, the quadratic term becomes visible: $V(\epsilon) \approx 2\epsilon^{1/2}/c$, so $\lambda(\epsilon \ll c^4) = \frac{1}{2}$. Determining how to choose an appropriate cutoff $\epsilon$ is still an open problem. We suggest that researchers choose the value of the behavioral loss cutoff in the context of the question they would like to answer. For example, if one trains multiple models with different seeds on the same task, then the appropriate loss cutoff may be on the order of the variance between the seeds. Finally, we are able to quantify the amount of degeneracy in a neural network. We define the Effective Parameter Count of a neural network $f_{\theta^*}$ at noise scale $\epsilon$ as two times the local learning coefficient $\lambda_B(\epsilon)$ of the behavioral loss with respect to the network at that noise scale:

$$N_{\text{eff}}(\epsilon) := 2\lambda_B(\epsilon) \tag{9}$$

We conjecture that a fully parameterisation-invariant representation of a neural network which captures all of the behaviour up to noise scale $\epsilon$ would require $N_{\text{eff}}(\epsilon)$ parameters.

3 Internal structures that contribute to degeneracy

In this section, we will show three ways the internal structure of neural networks can induce degrees of re-parametrization freedom $N_{\text{free}}$ in the loss landscape.
Since $N_{\text{eff}} = N - N_{\text{free}}$, this is equivalent to showing three ways the internal structures of neural networks determine their effective parameter count. We do not expect that these three sources of re-parametrization freedom offer a complete account of all degeneracy in real networks. They are merely a starting point for relating the degeneracy of networks to their computational structure. For ease of presentation, most of the expressions in this section are derived only for the example case of fully connected networks. They can be generalized to transformers, though we do not show this explicitly here. In Section 3.1, we show a relationship between the effective parameter count and the dimensions of the spaces spanned by the network's activation vectors (Section 3.1.1) and Jacobians (Section 3.1.2) recorded over the training data. In Section 3.2, we show a relationship between the number of distinct nonlinearities implemented in a layer of the network on the training data and the effective parameter count.

3.1 Activations and Jacobians

In this section, we show how a network having low-dimensional hidden activations or Jacobians leads to re-parametrisation freedom. We begin by bringing the network's Hessian, which gives the first non-zero term in the Taylor expansion of the loss around an optimum (see equation 5), into a more convenient form. Each local free direction in the loss landscape corresponds to an eigenvector of the Hessian with zero eigenvalue.[4] Therefore, the rank of the Hessian can be used to obtain a lower bound for the learning coefficient.

[4] The reverse does not hold, due to higher order terms in the expansion in equation 5. See Watanabe (2009, 2013).
Consider the Hessian of a fully connected network, with parameters $\theta = \theta^*$, network inputs $x$ and network outputs $f_\theta(x)$, on a behavioural loss $L_B(\theta \mid \theta^*, D)$ evaluated on a dataset consisting of $|D| = n$ inputs. Using the chain rule, the Hessian at the global minimum $\theta = \theta^*$ can be written as:

$$\left.\frac{\partial^2 L_B(\theta \mid \theta^*, D)}{\partial \theta^l_{i,j} \, \partial \theta^{l'}_{i',j'}}\right|_{\theta=\theta^*} = \sum_{x \in D} \sum_{k,k'} \left.\frac{\partial^2 L_B(\theta \mid \theta^*, D)}{\partial f^{l_{\text{final}}}_k \, \partial f^{l_{\text{final}}}_{k'}}\right|_{\theta=\theta^*} \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial \theta^{l'}_{i',j'}} \, \frac{\partial f^{l_{\text{final}}}_{k'}(x)}{\partial \theta^l_{i,j}} \overset{\text{MSE loss}}{=} \frac{1}{n} \sum_{x \in D} \sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial \theta^{l'}_{i',j'}} \, \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial \theta^l_{i,j}} \tag{10}$$

for $l = 1, \ldots, l_{\text{final}}$; $j = 1, \ldots, d^l$; $i = 1, \ldots, d^{l+1}$. In the second equality, we have used that the loss function is MSE from the outputs at $\theta = \theta^*$ to simplify the expression, and we have also used that the first derivatives of the loss are zero at the minimum.[5] Thus, the Hessian is equal to a Gram matrix of the network's weight gradients $\frac{\partial f^{l_{\text{final}}}_k}{\partial \theta^l_{i,j}}$, and linear dependence of entries of the weight gradients over the training set $D$ corresponds to zero eigenvalues in the Hessian.

[5] If we were to use a different behavioural loss such as KL divergence, the term $\left.\frac{\partial^2 L}{\partial f^{l_{\text{final}}}_k \partial f^{l_{\text{final}}}_{k'}}\right|_{\theta=\theta^*}$ would not be equal to $\delta_{kk'}$. This means that different output activations (logits for a language model) would be weighted differently, but the story of this section would be largely the same.
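The Gram-matrix identity above can be checked numerically. The following sketch, not from the paper, uses a hypothetical two-layer tanh network with six parameters and compares a finite-difference Hessian of the behavioral loss at $\theta^*$ against the Gram matrix of weight gradients; with the $1/n$ MSE normalisation of equation (6), the Hessian equals twice the Gram matrix (the overall constant does not affect which eigenvalues are zero).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully connected network: f(x) = w2 @ tanh(W1 @ x), scalar output.
W1 = rng.normal(size=(2, 2))
w2 = rng.normal(size=2)
theta_star = np.concatenate([W1.ravel(), w2])  # 6 parameters
data = rng.normal(size=(50, 2))

def f(theta, x):
    W1 = theta[:4].reshape(2, 2)
    w2 = theta[4:]
    return w2 @ np.tanh(W1 @ x)

def L_B(theta):
    # Behavioral loss of equation (6) relative to theta_star (zero there).
    return np.mean([(f(theta, x) - f(theta_star, x)) ** 2 for x in data])

h = 1e-4
n_p = len(theta_star)
eye = np.eye(n_p)

# Finite-difference Hessian of L_B at theta_star.
H = np.empty((n_p, n_p))
for i in range(n_p):
    for j in range(n_p):
        H[i, j] = (L_B(theta_star + h*eye[i] + h*eye[j])
                   - L_B(theta_star + h*eye[i] - h*eye[j])
                   - L_B(theta_star - h*eye[i] + h*eye[j])
                   + L_B(theta_star - h*eye[i] - h*eye[j])) / (4 * h**2)

# Gram matrix of the weight gradients df/dtheta over the dataset.
grads = np.array([[(f(theta_star + h*eye[i], x) - f(theta_star - h*eye[i], x))
                   / (2 * h) for i in range(n_p)] for x in data])
gram = grads.T @ grads / len(data)

print(np.allclose(H, 2 * gram, atol=1e-4))  # True
```

Because the residual $f_\theta - f_{\theta^*}$ vanishes at $\theta = \theta^*$, only the outer-product term of the chain rule survives, which is exactly why the Hessian reduces to a Gram matrix of weight gradients.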
We can apply the chain rule again to rewrite the gradient vector on each datapoint as an outer product of Jacobians and activations:

\[
\frac{\partial f^{l_{\text{final}}}_k(x)}{\partial\theta^l_{i,j}} = \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_i}\, f^l_j(x) \tag{11}
\]

where the Jacobian is taken with respect to the preactivations of layer $l+1$:

\[
p^{l+1}(x) = W^l f^l(x). \tag{12}
\]

Thus, every degree of linear dependence in the activations $f^l_j$ or the Jacobians $\partial f^{l_{\text{final}}}_k / \partial p^{l+1}_i$ in a layer $l$ of the network also causes degrees of linear dependence in the weight gradients $\partial f^{l_{\text{final}}}_k(x) / \partial\theta^l_{i,j}$ of the network, potentially resulting in re-parametrisation freedom for the network. In the next two sections, we explore how linear dependence in the activations and the Jacobians, respectively, impacts the effective parameter count.
3.1.1 Activation vectors spanning a low dimensional subspace

Looking at equation 11, each degree of linear dependence of the activations $f^l_j$ in a hidden layer $l$ of width $d^l$ over the training dataset $\mathcal{D}$,

\[
\sum_j c_j f^l_j(x) = 0 \quad \forall x \in \mathcal{D}, \tag{13}
\]

corresponds to $d^{l+1}$ linearly dependent entries in the weight gradients $\partial f^{l_{\text{final}}}_k / \partial\theta^l_{i,j}$, $d^{l+1}$ eigenvectors of the Hessian with eigenvalue zero, and $d^{l+1}$ fully independent free directions in the loss landscape that span a fully free $d^{l+1}$-dimensional hyperplane. So the effective parameter count $N_{\text{eff}}$ will be lower than the nominal number of parameters $N$ in the model by $d^{l+1}$ for each such degree of linear dependence in the hidden representations. More generally, we can take a PCA of the activation vectors in layer $l$ by diagonalising the Gram matrix of activations:

\[
G^l := \frac{1}{n} \sum_{x\in\mathcal{D}} f^l(x)\, f^l(x)^T =: U^{l\,T} D^l_G U^l \tag{14}
\]

If there is linear dependence between the activations on the dataset, some of the singular values (eigenvalues of $G^l$) will be zero.
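As a minimal numerical illustration (dimensions and variable names invented for this sketch), activations confined to a low-dimensional subspace show up directly as zero eigenvalues of the Gram matrix $G^l$ of equation 14:

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations in a width-5 hidden layer that secretly live in a
# 3-dimensional subspace: f^l(x) = A z(x) with A of rank 3.
n, d_l, d_sub = 100, 5, 3
A = rng.normal(size=(d_l, d_sub))
F = rng.normal(size=(n, d_sub)) @ A.T   # (n, d_l): one activation vector per row

G = F.T @ F / n                          # Gram matrix G^l (equation 14)
eigvals = np.linalg.eigvalsh(G)
print(np.sum(eigvals > 1e-10))           # prints 3: two eigenvalues are zero
```

Each of the two zero eigenvalues here would, by the argument above, lower the effective parameter count by $d^{l+1}$.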
If we transform into rotated layer coordinates $\tilde{f}^l(x) = U^l f^l(x)$, $\tilde{W}^l = W^l U^{l\,T}$, then the parameters of the transformed weight matrix in rows which connect to the directions with zero variance can be changed freely without changing the product $\tilde{W}^l \tilde{f}^l$. In reality, a Gram matrix of activation vectors will never have eigenvalues that are exactly 0. However, if a particular eigenvalue has size

\[
\sqrt{\frac{1}{n}\sum_{x\in\mathcal{D}} \left(\tilde{f}^l_j(x)\right)^2} = O(\epsilon^k)
\]

for some $\epsilon \ll 1$, the transformed parameters inside $\tilde{W}^l$ can be changed by $O(\epsilon^{\frac{1}{2}-k})$ while only impacting the loss $L$ by $O(\epsilon)$. This suggests that, under the finite-data SLT picture introduced in Section 2.2.2, singular values of the set of activation vectors that are smaller than $\epsilon^{\frac{1}{2}}$ for noise scale $\epsilon$ result in a lower effective parameter count, with $d^{l+1}$ fewer effective parameters for every small singular value. So, if we view the PCA components in a layer $l$ as the 'elementary variables' of that layer, then the fewer elementary variables the network has in total, the lower the effective parameter count will be.
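This reparametrisation freedom can be checked directly. In the sketch below (conventions chosen to match NumPy's eigendecomposition, i.e. $\tilde{f}^l = U^T f^l$ and $\tilde{W}^l = W^l U$ with $G^l = U D U^T$; all sizes invented), the entries of the rotated weight matrix that connect to zero-variance directions are overwritten at random without affecting the layer's output on the data:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d_l, d_next, d_sub = 100, 5, 4, 3
F = rng.normal(size=(n, d_sub)) @ rng.normal(size=(d_sub, d_l))  # rank-3 activations
W = rng.normal(size=(d_next, d_l))          # weights reading from this layer

eigvals, U = np.linalg.eigh(F.T @ F / n)    # G^l = U diag(eigvals) U^T
dead = eigvals < 1e-10                       # zero-variance PCA directions

F_rot = F @ U                                # rows are rotated activations
W_rot = W @ U                                # so that W_rot acts on F_rot

# Freely overwrite the entries of the rotated weights that connect to
# zero-variance directions...
W_rot[:, dead] = rng.normal(size=(d_next, int(dead.sum())))

# ...and the layer's contribution to the next layer is unchanged on the data.
print(np.max(np.abs(F_rot @ W_rot.T - F @ W.T)))  # ~0
```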
Relationship to weight norm

One might be concerned that linear dependencies between the activation vectors on the training dataset might not hold for activation vectors outside the training dataset, such that the entries of the weight matrix that we are treating as free do in fact affect the off-distribution outputs of the network. However, SOTA optimisers often use weight decay or $\ell_2$ weight regularisation during training to improve network generalisation (Loshchilov and Hutter, 2019). This biases training towards networks with a smaller total $\ell_2$ weight norm, $\|\theta\|_2 = \sum_{l=1}^{l_{\text{final}}} \|W^l\|_F$. Since the Frobenius norm $\|W^l\|_F$ is invariant under orthogonal transformations, the weight regularisation can equivalently be thought of as biasing training towards low $\|\tilde{W}^l\|_F$. Since the entries of $\tilde{W}^l$ which connect to the zero principal components do not affect the output, training will be biased to push them to 0. This is an example of weight regularisation improving generalisation performance: if, at inference time, an activation vector has variation in a direction not seen during training, a regularised model ignores that component of the activation vector.

3.1.2 Jacobians spanning a low dimensional subspace

We have shown that if the set of activation vectors in some layer has linear dependence over a dataset, then some parameters are free to vary without affecting outputs on that dataset.
A similar story can be told when the Jacobians $J^l_{ij} = \partial f^{l_{\text{final}}}_i(x) / \partial p^{l+1}_j$ do not span the full space of the layer. As with the activations, we look for zero eigenvalues in the Gram matrix of the Jacobians:

\[
K^l := \frac{1}{n} \sum_{x\in\mathcal{D}} J^{l\,T} J^l =: R^{l\,T} D^l_P R^l \tag{15}
\]

Any zero eigenvalue in this Gram matrix leads to $d^l$ zero eigenvalues in the Hessian, analogous to the previous section. We can transform into rotated layer coordinates $\tilde{W}^l = R^l W^l$, $\tilde{J}^l = J^l R^{l\,T}$, and the parameters of the transformed weight matrix in columns which connect to the directions with zero variance can be changed freely without changing the product $\tilde{J}^l \tilde{W}^l$. However, unlike with the activation PCA components, the $d^l$ free directions in the Hessian from Jacobians spanning a low-dimensional subspace may not always correspond to $d^l$ full degrees of freedom in the parametrisation. This is due to the potential presence of terms above second order in the perturbative expansion around the loss optimum (see equation 5), which can cause the loss to change if the parameters are varied along those directions despite the Hessian being zero (Watanabe, 2009).
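A low-dimensional span of the Jacobians is likewise visible in the spectrum of $K^l$. A minimal sketch (random matrices standing in for per-datapoint Jacobians; all sizes invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d_out, d_l = 50, 4, 6
# Per-datapoint Jacobians J^l(x) that all factor through the same
# 3-dimensional subspace of the layer's preactivation space.
B = rng.normal(size=(3, d_l))
Js = [rng.normal(size=(d_out, 3)) @ B for _ in range(n)]

K = sum(J.T @ J for J in Js) / n   # Gram matrix K^l (equation 15)
eigvals = np.linalg.eigvalsh(K)
print(np.sum(eigvals > 1e-10))     # prints 3: the Jacobians span 3 directions
```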
Jacobians between hidden layers

Note that we can decompose each Jacobian from layer $l$ to layer $l_{\text{final}}$ into a product of Jacobians between adjacent layers by the chain rule:

\[
\frac{\partial f^{l_{\text{final}}}(x)}{\partial p^{l+1}} = \frac{\partial f^{l_{\text{final}}}(x)}{\partial f^{l_{\text{final}}-1}}\, \frac{\partial f^{l_{\text{final}}-1}(x)}{\partial f^{l_{\text{final}}-2}} \cdots \frac{\partial f^{l+2}(x)}{\partial f^{l+1}}\, \frac{\partial f^{l+1}(x)}{\partial p^{l+1}}. \tag{16}
\]

Thus, any rank drop in a Gram matrix of Jacobians from layer $l+k$ to layer $l+k+1$ necessarily also leads to a rank drop in the Gram matrix of the Jacobians from layer $l$ to layer $l_{\text{final}}$, and thus to $d^l$ zero eigenvalues in the Hessian.

3.2 Synchronized nonlinearities

In this section, we demonstrate a third example of internal structure that affects the effective parameter count of the model. The two examples we presented in the previous sections might be thought of as showing how the network having fewer relevant variables in its representation in a layer leads to more degeneracy. The example we present in this section shows how the network performing "fewer operations" leads to more degeneracy.
In a dense layer with piecewise linear activation functions (ReLU or LeakyReLU), the effective parameter count is reduced if two neurons have the same set of data points for which they are 'on' and 'off'. We call neurons with this property synchronized with each other. For simplicity, in this section, we will consider a dense feedforward network with ReLU nonlinearities at each layer and the same hidden width $d$ throughout. We define the neuron firing pattern

\[
r^l_i(x) = \frac{f^l_i(x)}{p^l_i(x)} \ \text{ if }\ p^l_i(x) \neq 0, \quad \text{else}\ r^l_i(x) = 1, \tag{17}
\]

where $p^l_i(x) = \sum_j W^{l-1}_{i,j} f^{l-1}_j(x)$ is the preactivation of neuron $i$. We call two neurons $i$ and $j$ synchronized if they always fire simultaneously on the training data, $r^l_i(x) = r^l_j(x)\ \forall x \in \mathcal{D}$.

All synchronized

As a pedagogical aid, and to demonstrate a point about how the effective parameter count is invariant to linear layer transitions, we first consider the case of all the neurons in layer $l+1$ being synchronized together in the same firing pattern $r^{l+1}(x)$.
Then we can write:

\[
f^{l+2}(x) = \operatorname{ReLU}\!\left(W^{l+1} \operatorname{ReLU}\!\left(W^l f^l(x)\right)\right) = \operatorname{ReLU}\!\left(W^{l+1}\, r^{l+1}(x)\, W^l f^l(x)\right),
\]

meaning $W^l$ and $W^{l+1}$ effectively act as a single $d \times d$ dimensional matrix $\tilde{W} = W^{l+1} W^l$. Thus, any setting of the weights $W^l$ and $W^{l+1}$ that yields the same $\tilde{W}$ does not change the network's outputs on the training data, so long as we avoid changing any of the $r^{l+1}_i(x)$. We can ensure that the $r^{l+1}_i(x)$ do not change as we vary the weights by restricting ourselves to alternate weight matrices

\[
W^{l+1} \to W^{l+1} C^{-1}, \quad W^l \to C W^l \quad \text{with } C \text{ invertible and } C_{i,j} \geq 0\ \forall i,j. \tag{18}
\]

Note that a linear layer (without activation function, i.e. $f_i = p_i$) is just a special case of all neurons being synchronized, $\forall i, x: r^{l+1}_i(x) = 1$. When $W^l$ is full rank, the drop in the effective parameter count from full synchronisation is the number of parameters in layer $l$.
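The reparametrisation freedom of equation 18 can be sketched in the fully synchronized case. Here the data and weights are deliberately constructed (an invented setup for the illustration) so that every preactivation in the middle layer is positive, giving all its neurons the shared firing pattern $r(x) = 1$:

```python
import numpy as np

rng = np.random.default_rng(3)

d, n = 4, 50
# Positive weights and positive inputs keep every preactivation in the
# middle layer positive, so all its neurons share the firing pattern r(x) = 1.
W_l = rng.uniform(0.1, 1.0, size=(d, d))
W_l1 = rng.normal(size=(d, d))
X = rng.uniform(0.1, 1.0, size=(n, d))

relu = lambda z: np.maximum(z, 0.0)
forward = lambda a, b: relu(relu(X @ a.T) @ b.T)

# Reparametrise with an invertible, entrywise non-negative C (equation 18);
# only W_tilde = W^{l+1} W^l matters, so the outputs cannot change.
C = rng.uniform(0.0, 1.0, size=(d, d)) + np.eye(d)
out_orig = forward(W_l, W_l1)
out_new = forward(C @ W_l, W_l1 @ np.linalg.inv(C))
print(np.max(np.abs(out_new - out_orig)))  # ~0
```

The non-negativity of $C$ is what keeps the middle-layer preactivations positive, so the firing pattern, and hence the output, is preserved exactly.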
So we see that, from the perspective of the effective parameter count, linear transitions 'do not cost anything': including the linear transition in the model does not meaningfully increase the effective parameter count compared to skipping the layer entirely. We are simply passing variables to the next layer without computing anything new with them.

Footnote 7: See (Aoyagi, 2024) for a more complete treatment of effective parameter counts in deep linear networks.

Synchronized blocks

Now, we consider the general case of arbitrary neuron pairs in a layer being synchronized or approximately synchronized. We can organise the neurons into sets $S_a$, $a = 1, \dots, a_{\text{max}}$, with the same activation pattern $r^{l+1}_{S_a}(x)$ for all neurons in a set. We call these sets synchronized blocks. This works because synchronisation is a transitive property: if $r^{l+1}_1(x) = r^{l+1}_2(x)$ and $r^{l+1}_2(x) = r^{l+1}_3(x)$, then $r^{l+1}_1(x) = r^{l+1}_3(x)$. Each neuron belongs to exactly one block, so $\sum_{a=1}^{a_{\text{max}}} |S_a| = d$.
Then we have:

\[
f^{l+2}_i(x) = \operatorname{ReLU}\!\left(\sum_j \sum_{a=1}^{a_{\text{max}}} r^{l+1}_{S_a}(x) \sum_{k \in S_a} W^{l+1}_{ik}\, W^l_{kj}\, f^l_j(x)\right). \tag{19}
\]

We can replace $W^{l+1} \to W^{l+1} C^{-1}$, $W^l \to C W^l$, where the matrix $C$ has a block-diagonal structure

\[
C = \begin{pmatrix} C_{[1]} & & 0 \\ & \ddots & \\ 0 & & C_{[a_{\text{max}}]} \end{pmatrix} \quad \text{with invertible blocks } C_{[a]} \in \mathbb{R}^{|S_a| \times |S_a|} \text{ and } C_{[a],k',k} > 0 \ \forall k, k' \in (1, \dots, |S_a|).
\]

Just as we do not expect activations and gradients to have exact rank drops, we do not expect exact neuron synchronisation to be common in real models.
Instead, we can consider two neurons to be approximately synchronized if their activations only meaningfully differ on a few datapoints. Numerically, we can define:

\[
|r^{l+1}_a|^2 := \frac{1}{|\mathcal{D}|} \sum_{x\in\mathcal{D}} \sum_{i,i' \in S_a} \left(r^{l+1}_i(x)\, p^{l+1}_i(x) - r^{l+1}_{i'}(x)\, p^{l+1}_{i'}(x)\right)^2. \tag{20}
\]

If $|r^{l+1}_a|^2$ is non-zero but small, choosing different weight matrices as above will only increase the loss by an amount of order $O(|r^{l+1}_a|^2)$.

Degeneracy counting: For each pair of synchronized neurons $r^{l+1}_i(x), r^{l+1}_{i'}(x)$, we can set a pair of off-diagonal entries $C_{k,k'}, C_{k',k}$ in $C$ to arbitrary positive values when we change the weights to $W^{l+1} \to W^{l+1} C^{-1}$, $W^l \to C W^l$. If $W^l$ is full rank, the rows $k$ and $k'$ are linearly independent, so this synchronized pair will result in two free directions in parameter space. Thus, we have as many free directions in parameter space as we have synchronized neurons.
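This counting can be sketched for a single synchronized pair. In the invented construction below, neuron 1 is an exact positive rescaling of neuron 0, so the two fire on exactly the same inputs, and the off-diagonal entries $C_{0,1}, C_{1,0}$ can be set to arbitrary positive values without changing the outputs:

```python
import numpy as np

rng = np.random.default_rng(4)

d, n = 4, 50
W_l = rng.normal(size=(d, d))
W_l[1] = 2.0 * W_l[0]   # neuron 1's preactivation is always 2x neuron 0's,
                        # so the pair is synchronized on every input
W_l1 = rng.normal(size=(d, d))
X = rng.normal(size=(n, d))

relu = lambda z: np.maximum(z, 0.0)
forward = lambda a, b: relu(relu(X @ a.T) @ b.T)

# Mix the synchronized pair with positive off-diagonal entries of C and
# undo the mixing in the next layer's weights.
C = np.eye(d)
C[0, 1], C[1, 0] = 0.3, 0.7
out_orig = forward(W_l, W_l1)
out_new = forward(C @ W_l, W_l1 @ np.linalg.inv(C))
print(np.max(np.abs(out_new - out_orig)))  # ~0: two free directions per pair
```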
We can also count this as the sum over blocks of the squared number of synchronized neurons in each block,

\[
N^{l+1} = \sum_{a=1}^{a_{\text{max}}} |S_a|^2.
\]

We then see that $N^{l+1}$ is highest if all the neuron firing patterns are synchronized, and lowest when all neurons have different firing patterns. However, $W^l$ is not always full rank. Further, if we want to combine the degrees of freedom from neuron synchronisation with the other degrees of freedom from this section, we have to be careful to avoid double-counting. If the activations in layer $l$ lie in low-dimensional subspaces, then some of the $d^2$ degrees of freedom above may already have been accounted for. If we remove those double-counted degrees of freedom and control for the rank of $W^l$, each synchronized block only provides additional degrees of freedom equal to the squared dimensionality of the space spanned by the preactivations of block $S_a$ over the dataset $\mathcal{D}$, which we denote

\[
s^{l+1}_a := \dim\left(\operatorname{span}\left\{p^{l+1}_k \,\middle|\, k \in S_a\right\}\right). \tag{21}
\]

So, more generally, the additional amount of degeneracy by which the effective parameter count is lowered will be

\[
N^{l+1} = \sum_a \left(s^{l+1}_a\right)^2. \tag{22}
\]

The trivial case of self-synchronisation, $i = i'$, is not excluded in this formula.
It corresponds to the generic freedom to vary the diagonal entries $C_{k,k}$ of $C$ in a ReLU layer: scaling all the weights going into a neuron by $C_{k,k} \in \mathbb{R}^+$ and scaling all the weights going out of the neuron by $1/C_{k,k}$ does not change the network's behaviour.

Attention

A similar dynamic holds in the attention layers of transformers, with the attention patterns of different attention heads playing the role of the ReLU activation patterns. If two different attention heads $h_1, h_2$ in the same attention layer have synchronized attention patterns on the training dataset, their value matrices $W^{h_1}_V, W^{h_2}_V$ can be changed to add elements in the span of the value vectors of one head to the other head, with the output matrices $W^{h_1}_O, W^{h_2}_O$ that project results back into the residual stream being modified to undo the change. If $W^{h_1}_V, W^{h_2}_V$ are full rank, this results in $2 d^2_{\text{head}}$ degrees of freedom in the loss landscape for each synchronized attention head, in addition to the generic $d^2_{\text{head}}$ degrees of freedom per attention head that are present in every transformer model. If $W^{h_1}_V, W^{h_2}_V$ are not full rank, we account for this similarly to how we did with the neurons above.
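The value/output freedom for two heads sharing an attention pattern can be sketched as follows (a simplified single attention layer with invented shapes; softmax masking, biases and other heads are omitted):

```python
import numpy as np

rng = np.random.default_rng(5)

seq, d_model, d_head = 5, 8, 2
X = rng.normal(size=(seq, d_model))

# Two heads h1, h2 that share the same attention pattern A on this data.
scores = rng.normal(size=(seq, seq))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
Wv1, Wv2 = rng.normal(size=(d_model, d_head)), rng.normal(size=(d_model, d_head))
Wo1, Wo2 = rng.normal(size=(d_head, d_model)), rng.normal(size=(d_head, d_model))

def heads_out(wv1, wo1, wv2, wo2):
    return A @ X @ wv1 @ wo1 + A @ X @ wv2 @ wo2

# Shift value directions from head 1 into head 2 and compensate in the
# output matrices; the combined contribution is unchanged because the
# shared attention pattern factors out of both terms.
M = rng.normal(size=(d_head, d_head))
out_orig = heads_out(Wv1, Wo1, Wv2, Wo2)
out_new = heads_out(Wv1, Wo1 - M @ Wo2, Wv2 + Wv1 @ M, Wo2)
print(np.max(np.abs(out_new - out_orig)))  # ~0
```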
4 Interaction sparsity from parameterisation-invariance

In the introduction, we argued that if we can represent a neural network in a parameterisation-invariant way, then this representation is likely to be a good starting point for reverse-engineering the computation in the network. The intuition behind this claim is that, in the standard representation, parts of the network which do not affect the outputs act to obfuscate and hide the relevant computational structure; once these are stripped away, the computational structure is likely to become easier to see. One way this could manifest is through the new representation having greater interaction sparsity. In this section, we demonstrate that picking the right representation can indeed lead to sparser interactions throughout the network. Specifically, we show that we can find a representation such that, for every drop in the effective parameter count caused by either (a) activation vectors not spanning the activation space (Section 3.1.1) or (b) neuron synchronisation (Section 3.2), there is at least one pair of basis directions in adjacent layers of the network that do not interact. The role of this section is to provide a first example of a representation of a network which has been made invariant to some reparameterisations, and to show that this representation has correspondingly fewer interactions between variables. The algorithm sketch used to find the representation here is not very suitable for selecting sparsely connected bases in practical applications, since it is somewhat cumbersome to extend to non-exact linear dependencies. We introduce a way to choose a basis for the activation spaces that is more suitable for practical applications in Section 6.

Consider a dense feedforward network with ReLU activation functions, with $N_{\text{free}}$ degrees of freedom in its parameterisation that arise from a combination of:

1. The Gram matrix of activation vectors in some layers being low rank (see Section 3.1.1).
2. Blocks of neurons being synchronized (see Section 3.2).

We will now show that we can find a representation of the network that:

1. exploits the degrees of freedom due to low-dimensional activations to sparsify interactions through a re-parametrisation;
2. exploits the degrees of freedom from neuron synchronisation to sparsify interactions through a coordinate transformation, without losing the sparsity gained in step 1.

Sparsifying using low dimensional activations

Here, we show how to exploit the degrees of freedom in the network due to low-dimensional activations in the input layer to sparsify interactions. Suppose that the Gram matrix of the activations $f^{(1)}(x)$ of the input layer, $G^{(1)}_{ij} = \frac{1}{n} \sum_x f^{(1)}_i(x)\, f^{(1)}_j(x)$, is not full rank. This means that we can take a set of $\operatorname{rank}(G^{(1)})$ neurons as a basis for the space. This will be fewer neurons than the width $d^{(1)}$ of the input layer.
Writing

\[
\forall j \in \left(\operatorname{rank}(G^{(1)})+1, \dots, d^{(1)}\right): \quad f^{(1)}_j = \sum_{i=1}^{\operatorname{rank}(G^{(1)})} (c_j)_i\, f^{(1)}_i, \tag{23}
\]

we can replace the weights $W^{(1)}$ with new weights

\[
\tilde{W}^{(1)}_{ij} := \begin{cases} W^{(1)}_{ij} + \sum_{k=\operatorname{rank}(G^{(1)})+1}^{d^{(1)}} (c_k)_j\, W_{ik} & 1 \leq j \leq \operatorname{rank}(G^{(1)}) \\ 0 & \operatorname{rank}(G^{(1)}) < j \leq d^{(1)} \end{cases} \tag{24}
\]

In this way, we can disconnect $(d^{(1)} - \operatorname{rank}(G^{(1)}))$ neurons from the next layer without changing the activations in layer 2 at all on the training dataset, since $\tilde{W}^{(1)} f^{(1)} = W^{(1)} f^{(1)}$. For every degree of linear dependence we may have had in layer 1, we now have $d^{(2)}$ weights set to zero, where $d^{(2)}$ is the width of the second MLP layer.
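Equations 23 and 24 can be sketched directly (sizes and names invented for the example): the redundant input neurons' weights are folded into the basis neurons and then zeroed, leaving the inputs to layer 2 unchanged on the data:

```python
import numpy as np

rng = np.random.default_rng(6)

d1, d2, n, r = 5, 4, 100, 3           # input width, layer-2 width, data, rank(G^(1))
F1 = np.zeros((n, d1))
F1[:, :r] = rng.normal(size=(n, r))   # the first r input neurons form a basis
c = rng.normal(size=(d1 - r, r))      # f_j = sum_i (c_j)_i f_i for j > r (eq. 23)
F1[:, r:] = F1[:, :r] @ c.T
W1 = rng.normal(size=(d2, d1))

# New weights (eq. 24): fold the redundant neurons' weights into the basis
# neurons, then disconnect the redundant neurons entirely.
W1_new = np.zeros_like(W1)
W1_new[:, :r] = W1[:, :r] + W1[:, r:] @ c

print(np.allclose(F1 @ W1_new.T, F1 @ W1.T))  # True: layer-2 inputs unchanged
print(int((W1_new[:, r:] == 0).sum()))        # 8 = (d1 - r) * d2 zeroed weights
```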
Since two neurons that are connected by a weight of 0 do not interact, this means that we can associate each drop in the effective parameter count caused by linear dependence between activations in layer 1 with a pair of nodes in the interaction graph which do not interact.

Sparsifying using synchronized neurons

Now, we show that we can exploit the degrees of freedom in the network arising from the synchronisation of neurons in the first hidden layer to sparsify interactions, without losing any of the sparsity we gained in the previous step. Taking the example of the second layer $f^{(2)}$, we want to find a new coordinate basis $\hat{f}^{(2)} = C^{(2)} f^{(2)}$ in which there is at least one pair of variables $(\hat{f}^{(2)}_i, f^{(1)}_j)$ that does not interact for each drop in the effective parameter count caused by neuron synchronisation. To choose this basis, we start by finding all pairs of neuron firing patterns $r^l_i(x)$ in layer 2 that are synchronized, and group them into synchronized blocks. Continuing with the same notation as in Section 3.2, we denote the blocks of synchronized neurons $S_a$, $a \in (1, \dots, a_{\text{max}})$, with sizes $|S_a|$, and we use the notation $M_{[a]}$ to denote the matrix in $\mathbb{R}^{|S_a| \times |S_a|}$ with entries given by $M_{ij}\ \forall i, j \in S_a$.
Then, we choose the transformation $C^{(2)}$ to be block diagonal

$$C^{(2)} = \begin{pmatrix} C^{(2)}_{[1]} & & 0 \\ & \ddots & \\ 0 & & C^{(2)}_{[a_{\max}]} \end{pmatrix}, \tag{25}$$

with the blocks given by the inverse (technically the pseudoinverse, since $\tilde{W}^{(1)}_{[a]}$ need not be invertible) of the $|S_a| \times |S_a|$ blocks of $\tilde{W}^{(1)}$:

$$C^{(2)}_{[a]} = \left( \tilde{W}^{(1)}_{[a]} \right)^{-1}, \tag{26}$$

$$\tilde{W}^{(1)}_{[a]} := \begin{pmatrix} W^{(1)}_{\sigma_{a-1}+1,\,\sigma_{a-1}+1} & \cdots & W^{(1)}_{\sigma_a,\,\sigma_{a-1}+1} \\ \vdots & \ddots & \vdots \\ W^{(1)}_{\sigma_{a-1}+1,\,\sigma_a} & \cdots & W^{(1)}_{\sigma_a,\,\sigma_a} \end{pmatrix} \quad \text{for } \sigma_a = \sum_{b=1}^{a} s_b. \tag{27}$$

This coordinate transformation will set one interaction to zero per drop in the effective parameter count caused by neuron synchronisation. To see this, we first consider that $C^{(2)}$ commutes with the nonlinearity applied between layers 1 and 2:

$$\forall x: \quad C^{(2)}\,\mathrm{ReLU}\!\left(W^{(1)} f^{(1)}(x)\right) = \mathrm{ReLU}\!\left(C^{(2)} W^{(1)} f^{(1)}(x)\right). \tag{28}$$

The product $\hat{W}^{(1)} = C^{(2)} \tilde{W}^{(1)}$ will thus have block diagonal entries equal to the identity, $\hat{W}^{(1)}_{[a]} = I_{|S_a|}$. This means $\hat{W}^{(1)}$ will at minimum have an additional $\sum_a (s^{(2)}_a)^2 - d^{(2)}$ entries that are zero: one non-interacting pair of nodes per degree of non-generic parametrization freedom caused by neuron synchronization (see equations 21 and 22). These absent interactions are distinct from those due to the activation vectors in layer 1 not spanning the full activation space, which we found in the previous step.
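As a quick sanity check, the block construction of equations 25 and 26 can be reproduced in a few lines of numpy. The block sizes and the random stand-in for $\tilde{W}^{(1)}$ below are hypothetical, chosen only to illustrate that each diagonal block of $\hat{W}^{(1)} = C^{(2)} \tilde{W}^{(1)}$ becomes the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synchronisation pattern: blocks S_a of sizes s_a partition 5 neurons.
sizes = [2, 3]
W_tilde = rng.normal(size=(5, 5))  # stand-in for the modified weights W~(1)

# Build the block-diagonal transformation C(2) from pseudoinverses of the
# diagonal blocks of W~(1) (equations 25-26).
C = np.zeros((5, 5))
start = 0
for s in sizes:
    block = W_tilde[start:start + s, start:start + s]
    C[start:start + s, start:start + s] = np.linalg.pinv(block)
    start += s

W_hat = C @ W_tilde  # W^(1) = C(2) W~(1)

# Each diagonal block of W^(1) is now the identity I_{|S_a|}.
assert np.allclose(W_hat[:2, :2], np.eye(2))
assert np.allclose(W_hat[2:, 2:], np.eye(3))
```

The off-diagonal blocks of $\hat{W}^{(1)}$ are unconstrained here; the zero entries discussed in the text come from the synchronisation structure of the true weights, which this random stand-in does not have.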
Thus, the minimum number of absent interactions is equal to or greater than the number of degrees of freedom in the loss landscape stemming from low-dimensional activations in the input layer $f^{(1)}$ or synchronized neurons in the first hidden layer $f^{(2)}$.

Repeat for every layer

Now, we can repeat the previous two steps for all layers, moving recursively from the input layer to the output layer. We check whether the activation vectors in layer 2 fail to span the activation space and pick new weights $\tilde{W}^{(2)}$ accordingly. Then we check if any neurons in layer 3 are synchronized and transform $\hat{f}^{(3)} = C^{(3)} f^{(3)}$ accordingly. We repeat this for every layer in the network. We thus obtain new weight matrices and a new basis for the activations of every layer in the network. Treating the new basis vectors in each layer as nodes in a graph, we can build a graph representing the interactions in the network. This graph will have two properties:

1. It has at least one interaction that is zero for every drop in the effective parameter count introduced by neuron synchronisation or activation vectors spanning a low-dimensional subspace.
2. It is invariant to reparameterisations that exploit these degeneracies.

5 Modularity may contribute to degeneracy

A core goal of interpretability is breaking up a neural network into smaller parts, such that we can understand the entire network by understanding the individual parts. In this section we propose a particular notion of modularity that could be used to identify these smaller parts. We argue that this notion of modularity is likely to occur in real networks due to its relation to the LLC. The core claim of this section is that more modular networks are biased towards lower LLC.
We argue that if modules in a network interact less (i.e., the network is more modular), this yields a higher total degeneracy and thus a lower LLC. Each module has internal degeneracies: if two modules do not interact, then the degeneracies in each are independent of each other, so the total amount of degeneracy in the network (from these modules) is at least the sum of the amount of degeneracy within each module. However, if the modules are interacting, then the degeneracies may interact with each other, and the total amount of degeneracy in the network can be less. Therefore, networks with non-interacting or weakly interacting modules typically have more degeneracy and thus a lower LLC, which means that neural networks are biased towards solutions which are modular. The argument in this section does not preclude non-modular networks from having a lower LLC than modular networks in any specific instance. Instead, this section presents an argument that, all else equal, modularity is associated with a lower effective parameter count. This argument could fail in practice if more modularity turns out to increase the effective parameter count of models for a different reason, or if real neural networks simply do not have low-loss modular solutions. In Section 5.1 we define interacting and non-interacting degeneracies, and show that the total degeneracy is higher when individual degeneracies do not interact. In Section 5.2 we quantify how modularity affects the LLC by studying a series of increasingly realistic scenarios. First, we consider the case of two modules which do not interact at all in Section 5.2.1. Then we explore how to modify the analysis for modules which have a small number of interacting variables in Section 5.2.2. Finally, in Section 5.2.3 we extend our analysis to allow for the strength of interactions to vary. We arrive at a modularity metric which can be used to search for modules in a computational graph.
5.1 Interacting and non-interacting degeneracies

Figure 1: Example of a loss landscape with interacting free directions, from (Carroll, 2023), lightly edited. The loss does not change when changing $w_1$ alone or $w_2$ alone, so there are two free directions in the landscape. However, the loss does change when changing both $w_1$ and $w_2$ together, so the set of zero loss is cross-shaped rather than spanning the whole plane. Thus, despite there apparently being two free directions, the effective parameter count that characterises the dimensionality of the low loss volume is 1 rather than 0. Non-interacting sets of parameters have no joined terms like this in the loss function, so their free directions always span full subspaces with each other.

If a network's parameterization has a degeneracy, then there is some way the parameters of the network can change without changing the input-output behavior of the network. This change corresponds to a direction that can be traversed through the parameter space along which the behavioral loss stays zero. We call such a direction a free direction in the parameter space. It's also possible for a parameterization to have multiple degeneracies and thus multiple free directions. We call a set of free directions non-interacting if traversing along one free direction does not affect whether the other directions remain free. In this case, the set of non-interacting free directions spans an entire free subspace of the parameter space. In a parameter space with $\theta = (w_1, w_2, w_3)$ and loss given by $L(w_1, w_2, w_3) = w_1^2$, we are free to pick any value of $w_2$ and $w_3$ while remaining at the minimum of the loss provided that $w_1 = 0$. The area of constant loss is a 2-dimensional plane.
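The connection between free directions and the volume of the low-loss region can be checked numerically. The sketch below is our illustration rather than a calculation from the paper: it Monte Carlo estimates how the volume of $\{w : L(w) < \epsilon\}$ scales with the cutoff for three toy 2D landscapes, including the cross-shaped $L = w_1^2 w_2^2$ from Figure 1. The log-log slope approximates half the effective parameter count:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_volume_slope(loss_fn, eps_hi=1e-3, eps_lo=1e-5, n=5_000_000):
    """Estimate lambda in V(eps) ~ eps^lambda, where V(eps) is the volume of
    {w in [-1,1]^2 : loss(w) < eps}, via Monte Carlo at two cutoffs."""
    w = rng.uniform(-1.0, 1.0, size=(n, 2))
    loss = loss_fn(w[:, 0], w[:, 1])
    v_hi = np.mean(loss < eps_hi)
    v_lo = np.mean(loss < eps_lo)
    return np.log(v_hi / v_lo) / np.log(eps_hi / eps_lo)

print(log_volume_slope(lambda a, b: a**2 + b**2))  # ~1.0: no free directions
print(log_volume_slope(lambda a, b: a**2))         # ~0.5: one free direction
print(log_volume_slope(lambda a, b: a**2 * b**2))  # ~0.4: two interacting free
                                                   # directions behave like one
```

The cross-shaped landscape's slope sits close to the single-free-direction case (up to log corrections), not to zero, matching the caption's claim that its effective parameter count is 1 rather than 0.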
The set of free directions is called interacting if traversing along one free direction does affect whether other directions remain free. For an extreme example, consider the loss function $L(w_1, w_2) = w_1^2 w_2^2$ (Figure 1) at its minimum $(0,0)$. In this case there are two free directions, but when we traverse along one free direction the other direction ceases to be free. The area of constant loss does not span a full subspace (a 2-dimensional plane); here it resembles a cross (see Figure 1), which is a 1-dimensional object. We can explicitly calculate the number of degrees of freedom (the difference between the effective parameter count (equation 9) and the nominal parameter count) in each of these two loss landscapes. We find that the first landscape has two degrees of freedom but the second has only one. These are two extremes of fully interacting and fully non-interacting free directions. It is also possible to construct intermediate loss landscapes in which the number of degrees of freedom arising from two free directions is a non-integer value between 1 and 2. In general, for a given set of free directions, the lowest the effective parameter count can be is in the non-interacting case.

5.2 Degeneracies in separate modules only interact if the modules are interacting

In this section we quantify the increase in the effective parameter count, and equivalently the LLC, from perfect and near-perfect modularity. We show that a network consisting of non-interacting modules has a low effective parameter count, and that a network with modules which interact through a single variable has only a slightly higher effective parameter count. Consider a modular neural network $f_\theta(x)$ consisting of two parallel modules $M_1$ and $M_2$.
The modules take in different variables $x_1, x_2$ from the input $x = (x_1, x_2)$, and the output of the network is the concatenation of the module outputs $f_\theta(x) = (M_1(x_1), M_2(x_2))$. We assign every activation direction in the network to either $M_1$ or $M_2$. We split the parameter space $\Theta$ into 3 subspaces: $\Theta = \Theta_1 \oplus \Theta_2 \oplus \Theta_{1\leftrightarrow 2}$. Here $\theta_1 \in \Theta_1$ are the parameters inside $M_1$ (i.e. parameters that affect interactions between two activations within $M_1$), $\Theta_2$ is the space of the parameters inside $M_2$, and $\Theta_{1\leftrightarrow 2}$ is the space of parameters which affect interactions between activations of both modules.

5.2.1 Non-interacting case

We start by analyzing a network consisting of two perfectly separated modules; the values of activations in $M_1$ have no effect on activations in $M_2$, i.e. $\theta_{1\leftrightarrow 2} = 0$ and the network output is given by

$$f_\theta(x) = \left(M_1(\theta_1, x_1),\; M_2(\theta_2, x_2)\right). \tag{29}$$

Consider now two free directions in parameter space, where one lies entirely in $\Theta_1$ and the other lies entirely in $\Theta_2$. Since $M_1$ and $M_2$ share no variables and do not interact, there is no way for a change to parameters along one free direction to affect the freedom of the other direction. Therefore, one-dimensional degeneracies that are in different disconnected modules must be non-interacting.
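This independence can be made concrete with a toy example. The sketch below (our own construction, assuming simple two-layer linear modules rather than anything from the paper) exhibits a rescaling degeneracy inside $M_1$, a free direction in $\Theta_1$, and checks that traversing it leaves the whole network output, including $M_2$'s half, unchanged:

```python
import numpy as np

def module(W1, W2, x):
    # Two-layer linear module: output = W2 @ W1 @ x.
    return W2 @ (W1 @ x)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
A1, A2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 2))  # parameters of M1
B1, B2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 2))  # parameters of M2

def network(A1, A2, B1, B2):
    # Output is the concatenation of the two module outputs (equation 29).
    return np.concatenate([module(A1, A2, x1), module(B1, B2, x2)])

base = network(A1, A2, B1, B2)

# Free direction inside M1: scale the first layer up and the second down.
for alpha in (0.5, 2.0, 7.0):
    out = network(alpha * A1, A2 / alpha, B1, B2)
    assert np.allclose(out, base)  # behavior unchanged, M2 untouched
```

Since the rescaling only touches parameters in $\Theta_1$ and $M_2$ shares no variables with $M_1$, no choice of $\alpha$ can affect whether a free direction inside $M_2$ remains free.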
By contrast, if $M_1$ and $M_2$ were connected, their free directions could interact. We break up the behavioral loss with respect to this network into three terms:

$$L_B(\theta \mid \theta^*, D) = L_1(\theta_1 \mid \theta_1^*, D) + L_2(\theta_2 \mid \theta_2^*, D) + L_{1\leftrightarrow 2}(\theta_1, \theta_2, \theta_{1\leftrightarrow 2} \mid \theta_1^*, \theta_2^*, 0, D) \tag{30}$$

$L_1$ and $L_2$ are the parts of the behavioral loss that involve only $\theta_1$ and $\theta_2$ respectively, and $L_{1\leftrightarrow 2}$ contains all the other parts. So long as we ensure $\theta_{1\leftrightarrow 2} = 0$, we have $L_{1\leftrightarrow 2} = 0$. Then a calculation shows that the overall number of degrees of freedom ($N_{\text{free}} = N - N_{\text{eff}}$) for this behavioral loss, restricted to the subspace in which $\theta_{1\leftrightarrow 2} = 0$, is equal to the sum of the number of degrees of freedom in each module. There could be additional free directions involving moving $\theta_{1\leftrightarrow 2}$ away from $0$. These free directions are not guaranteed not to interact with the free directions in each module, and our argument says nothing about how large additional contributions to the effective parameter count from varying $\theta_{1\leftrightarrow 2}$ may be.
5.2.2 Adding in interactions between modules

Next, we consider the case that there is a small set of activations $v_1, \dots, v_m$ inside $M_1$ that causally affect the value of some activations inside $M_2$ (due to not all the parameters in $\theta_{1\leftrightarrow 2}$ being 0). This means that the two modules are now interacting with each other. In that case, the only degeneracies in $M_1$ which are guaranteed not to interact with the degeneracies in $M_2$ are those which do not affect the value of any of the $v_i$. Picture $M_1$ as a causal graph, where the nodes are activations and the edges are weights or nonlinearities. The nodes inside $M_1$ are connected to the 'outside' of $M_1$ via (a) the input layer, where $M_1$ takes in inputs, (b) the output layer, where $M_1$ passes on its outputs, and (c) the 'mediating' nodes $v_i$ where variations affect what happens inside $M_2$. The free directions inside $M_1$ that are guaranteed not to interact with free directions outside $M_1$ are those directions that leave this entire interaction surface invariant: the directions which do not change any of the mediating nodes as we traverse along them. Each mediating node that is present is an additional constraint on which free directions are guaranteed to be non-interacting. The more approximately independent nodes that are part of that interaction surface, the fewer free directions in $M_1$ might generically be expected to satisfy these constraints. In the previous section, we argued that the number of degrees of freedom of the network with non-interacting modules, restricted to the subset of parameter space in which $\theta_{1\leftrightarrow 2} = 0$, was equal to the sum of the degrees of freedom in each module.
In this section, $\theta_{1\leftrightarrow 2}^* \neq 0$, but modifying the argument to restrict to the subset of parameter space in which $\theta_{1\leftrightarrow 2} = \theta_{1\leftrightarrow 2}^*$ is not sufficient to fix the argument, because the degeneracies interact. To fix the argument, we introduce the constrained loss function for parameters in $M_1$:

$$L_{1,C}(\theta_1 \mid \theta_1^*, D, v_1, \dots, v_m) = L_1(\theta_1 \mid \theta_1^*, D) + \frac{1}{n} \sum_{i=1}^{m} \sum_{x \in D} \left( v_i(\theta_1^*, x) - v_i(\theta_1, x) \right)^2 \tag{31}$$

This loss function is the same as the part of the behavioral loss that depends only on parameters in $M_1$, except that it has extra MSE terms added to ensure that the points with very small loss also preserve the values of $v_1, \dots, v_m$ on all datapoints. This means its learning coefficient is higher than for the unconstrained behavioral loss. The key property of the constrained loss landscape is that its free directions are guaranteed to be non-interacting with free directions in the loss landscape $L_2$.
Therefore, we are able to say that the total effective parameter count of the network consisting of two interacting modules, when constrained to the subspace $\theta_{1\leftrightarrow 2} = \theta_{1\leftrightarrow 2}^*$, really is twice the sum of the learning coefficients for the loss function $L_2$ and for the loss function $L_{1,C}$. (For simplicity in this section, we have considered the case in which nodes in $M_1$ affect nodes in $M_2$ but the converse is not true. If we wanted interactions to be bidirectional, we could modify the argument of this section by introducing a second constrained loss function $L_{2,C}$.) As before, there could be additional free directions involving moving $\theta_{1\leftrightarrow 2}$ away from $\theta_{1\leftrightarrow 2}^*$, which may interact with the free directions in each module. Since we have not characterized the effect of these free directions on the effective parameter count, we cannot confidently conclude that networks with more separated modules reliably have lower effective parameter counts overall. For example, it may be possible that on most real-world loss landscapes, there are many more non-modular solutions than modular ones, and that typically the place in parameter space with lowest loss and lowest effective parameter count is not modular. However, we are not aware of any compelling reason why non-modular networks should have some advantage in terms of having low effective parameter counts, to combat the advantage of modular networks discussed in this section.

5.2.3 Varying the strength of an interaction

In the previous section, we discussed the case that two modules interact via $m$ nodes.
However, this model had no notion of how strong an interaction is: every node inside $M_1$ either is on the interaction surface or it is not, and all nodes on the interaction surface affect the nodes inside $M_2$ by the same amount. In real networks, the extent to which one activation can affect another is continuous. Therefore, we would like to be able to answer questions like the following: Suppose that we have two networks, both consisting of two modules $M_1$ and $M_2$. In the first network, there is a single node inside $M_1$ that strongly influences $M_2$, and in the second there are two nodes inside $M_1$ that both weakly influence $M_2$. Which of these two networks is likely to have a lower effective parameter count? In this section we will attempt to answer this question. To do so, we will make use of the notion of an effective parameter count at a finite loss cutoff $\epsilon$ (Section 2.2.2). We show that the magnitudes of the total connections through different independent mediating nodes $v_1, v_2$ seem to add approximately logarithmically to determine the effective 'size' of the total interaction surface between modules. As before, we consider two modules $M_1$ and $M_2$, connected through a number of mediating variables $v_1, \dots, v_m$ that are part of $M_1$ and which $M_2$ depends on. Let each of these mediating variables connect to $M_2$ through a single weight, $w_1, \dots, w_m$ (we could also consider $w_i$ to be the sum of weights connecting node $v_i$ to $M_2$).
If $w_i$ is sufficiently small relative to the loss cutoff $\epsilon$, the connection between modules via $v_i$ will be so small that it can be considered no connection at all from the perspective of interactions between free directions in different modules. This is the case if the loss increases by an amount smaller than $\epsilon$ when we traverse along both free directions simultaneously. Quantitatively, if we traverse along a free direction in $\Theta_1$ that changes the value of $v_i(\theta_1 \mid x)$, then for small enough $\epsilon$ (and a network with locally smooth-enough activation functions), the resulting change in the MSE loss $L$ of the whole network will be proportional to $w_i^2$. If $w_i = O(\epsilon^{\frac{1}{2}})$, that means the connection is 'effectively zero' relative to the given cutoff $\epsilon$, in the sense that the volume of points with $L(\theta) < \epsilon$ is not substantially impacted by the terms in the loss involving $w_i$. Now we consider larger connections $w_i = \epsilon^{k_i}$ with $k_i \in (0, \frac{1}{2})$. We can model this situation by taking the size of $w_i$ into account in the constrained loss (equation 31).
We define the weighted constrained loss by a sum over mean squared errors for preserving each mediating variable, weighted by the size of the variable:

$$L_{1,C}(\theta_1 \mid \theta_1^*, \theta_{1\leftrightarrow 2}^*, D, v_1, \dots, v_m) = L_1(\theta_1 \mid \theta_1^*, D) + \frac{1}{n} \sum_{i=1}^{m} \epsilon^{2k_i} \sum_{x \in D} \left( v_i(\theta_1^*, x) - v_i(\theta_1, x) \right)^2 \tag{32}$$

where we have made $L_{1,C}$ depend on $\theta_{1\leftrightarrow 2}^*$ here because the $w_i$ are parameters in $\theta_{1\leftrightarrow 2}^*$. We are interested then in how much smaller the learning coefficient for loss landscape $L_1$ is than the learning coefficient on landscape $L_{1,C}$, as a function of the loss cutoff $\epsilon$. This depends heavily on the details of the model. If the constraints are completely independent, we could perhaps model the presence of each constraint as destroying some number $\gamma_i$ of degrees of freedom compared to the model in which the constraints were not present (and the modules were fully non-interacting):

$$N_{\text{eff},C} = N_{\text{eff}} + \sum_{i=1}^{m} \gamma_i\,.$$

Now, we seek an expression for $\gamma_i$ in terms of $w_i$.
Since we require $L_B(\theta) < \epsilon$, and each term in $L_B$ is positive, each weighted constraint term must also be smaller than $\epsilon$. Rearranging, we find that

$$\frac{1}{n} \sum_{x \in D} \left( v_i(\theta_1^*, x) - v_i(\theta_1, x) \right)^2 < \epsilon^{1 - 2k_i} =: \tilde{\epsilon}_i\,. \tag{33}$$

Therefore, the weights $\epsilon^{2k_i}$ of each constraint effectively correspond to measuring the volume of points satisfying that constraint at a larger loss cutoff $\tilde{\epsilon}_i = \epsilon^{1-2k_i}$. Now, we make an assumption that if all the weights were 1, then each constraint would be responsible for removing a similar number $\tilde{\gamma}$ of degrees of freedom from the network. In other words, each constraint would restrict the volume of parameter space that achieves loss less than $\epsilon$ by the same amount. Then, we can rescale this region by the factor $\epsilon^{1-2k_i}$ and we find that

$$\gamma_i = (1 - 2k_i)\,\tilde{\gamma} = \left(1 - 2\,\frac{\log w_i}{\log \epsilon}\right) \tilde{\gamma}\,. \tag{34}$$

Therefore, the size of the logarithm of the weight $w_i$ relative to the logarithm of the cutoff $\epsilon$ becomes a prefactor reducing the number of degrees of freedom removed by constraint $i$.
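Equation 34 can be checked numerically. The sketch below sets the hypothetical per-constraint count $\tilde{\gamma}$ to 1 for illustration and verifies the two endpoints ($w_i = 1$ and $w_i = \epsilon^{1/2}$) together with the additivity that motivates the next comparison between networks:

```python
import math

def dof_removed(w, eps, gamma_tilde=1.0):
    """Degrees of freedom removed by one mediating connection of weight w at
    loss cutoff eps (equation 34), clipped to the range [0, gamma_tilde]."""
    k = math.log(w) / math.log(eps)  # w = eps**k
    return gamma_tilde * min(max(1.0 - 2.0 * k, 0.0), 1.0)

eps = 1e-6
print(dof_removed(1.0, eps))       # full constraint: gamma_tilde
print(dof_removed(eps**0.5, eps))  # effectively no connection: ~0

# Two weak connections (k = 0.4 each) remove as many degrees of freedom as
# one strong connection with k_B = k_A1 + k_A2 - 1/2 = 0.3.
assert abs(2 * dof_removed(eps**0.4, eps) - dof_removed(eps**0.3, eps)) < 1e-9
```

The final assertion is just equation 35 restated: the contributions of the $w_i$ add in $\log(w_i / \epsilon^{1/2})$ rather than in $w_i$ itself.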
If $w_i = 1$, then $\gamma_i = \tilde{\gamma}$, and if $w_i \leq \epsilon^{\frac{1}{2}}$, then $\gamma_i = 0$ (for $w_i < \epsilon^{\frac{1}{2}}$, the connection is effectively zero at the resolution available at loss cutoff $\epsilon$). With this in mind, let us return to the question introduced at the start of this section. We will call the network with two weak interactions between modules network A, with two mediating nodes $v_{A,1}, v_{A,2}$ and mediating weights $w_{A,1} = w_{A,2}$. Likewise, we denote the network with one strong interaction between modules by network B, with one mediating node $v_{B,1}$ and one mediating weight $w_B$. How large must $w_B$ be compared to $w_{A,1}$ and $w_{A,2}$ for the interactions between modules in network B to effectively remove the same number of degrees of freedom as the interactions between modules in network A? Using equation 34, we find that

$$\log\!\left(\frac{w_B}{\epsilon^{\frac{1}{2}}}\right) = \log\!\left(\frac{w_{A,1}}{\epsilon^{\frac{1}{2}}}\right) + \log\!\left(\frac{w_{A,2}}{\epsilon^{\frac{1}{2}}}\right). \tag{35}$$

So, the analysis in this section implies that connections through different mediating nodes should be considered to add together logarithmically for the purpose of estimating the number of interaction terms between degrees of freedom that live in different modules. In practice, the constraints that different mediating variables impose on the loss (equation 32) are rarely likely to be completely independent, so this should be seen as a rough approximation to be used as a starting guess for the relevant scale of the problem. If circuits in neural networks correspond to modules, the analysis in this section implies that we could identify circuits in networks by searching for a partition of the interaction graph of the network into modules which minimises the sum of logs of cutoff-normalised interaction strengths between modules.

6 The Interaction Basis

In this section, we propose a technique for representing a neural network as an interaction graph that is invariant to reparameterisations that exploit the freedoms in Sections 3.1.1 and 3.1.2. The technique consists of performing a basis transformation in each layer of the network to represent the activations in a different basis that we call the Interaction Basis. This basis transformation removes degeneracies in activations and Jacobians of the layer to make the basis smaller. The basis is also intended to 'disentangle' interactions between adjacent layers as much as possible. While we do not know whether it accomplishes this in general, we do show that it does so when the layer transitions are linear. In that case, the layer transition becomes diagonal (appendix A). The interaction basis is invariant to invertible linear transformations (technically, as we will see, it is only invariant up to the uniqueness of the eigenvectors of a certain matrix, but in practice that usually just amounts to a freedom under reflections of coordinate axes),
meaning the basis itself is a largely coordinate-independent object, much like an eigendecomposition (see Section 6.2). We conjecture that if we apply the interaction basis transformation to a real neural network, the resulting representation is likely to be more interpretable. In a companion paper, Bushnaq et al. (2024), we develop the interaction basis further and test this hypothesis. 6.1 Motivating the interaction basis To find a transformation of network’s weights and activations that is invariant to reparameterisations based on low-rank activations or low-rank Jacobians, we take equation 10, and use equation 11 to rewrite it as Hi⁢j,i′⁢j′l,l′⁢(θ∗)=∂2L∂θi,jl⁢∂θi′,j′l′|θ=θ∗subscriptsuperscriptsuperscript′superscript′evaluated-atsuperscript2subscriptsuperscriptsubscriptsuperscriptsuperscript′superscript′ H^l,l _ij,i j (θ^*)= .% ∂^2L∂θ^l_i,j∂θ^l _i^% ,j |_θ=θ^*Hitalic_l , l start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPTi j , i′ j′ ( θ∗ ) = divide start_ARG ∂2 L end_ARG start_ARG ∂ θitalic_litalic_i , j ∂ θitalic_l start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPTi′ , j′ end_ARG |θ = θ∗ =1n⁢∑x∈fjl⁢(x)⁢fj′l′⁢(x)⁢∑k∂fklfinal⁢(x)∂pil+1⁢∂fklfinal⁢(x)∂pi′l′+1.absent1subscriptsubscriptsuperscriptsubscriptsuperscriptsuperscript′subscriptsubscriptsuperscriptsubscriptfinalsubscriptsuperscript1subscriptsuperscriptsubscriptfinalsubscriptsuperscriptsuperscript′1superscript′ = 1n _x∈ Df^l_j(x)f^l% _j (x) _k ∂ f^l_final_k(x)% ∂ p^l+1_i ∂ f^l_final_k(x)∂ p^l% +1_i .= divide start_ARG 1 end_ARG start_ARG n end_ARG ∑x ∈ D fitalic_litalic_j ( x ) fitalic_l start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPTj′ ( x ) ∑k divide start_ARG ∂ fitalic_lfinalitalic_k ( x ) end_ARG start_ARG ∂ pitalic_l + 1i end_ARG divide start_ARG ∂ fitalic_lfinalitalic_k ( x ) end_ARG start_ARG ∂ pitalic_l start_POSTSUPERSCRIPT ′ + 1 end_POSTSUPERSCRIPTi′ end_ARG . (36) Next, we make two presumptions of independence (Christiano et al., 2022), assuming that 1. 
We can take expectations over the activations and Jacobians in each layer independently 2. Different layers are somewhat independent such that the Hessian eigenvectors can be largely localised to a particular layer Both of these assumptions are investigated in Martens and Grosse (2020), who test their validity in small networks and use it to derive a cheap approximation to the Hessian and its inverse. This allows us to approximate the Hessian as Hi⁢j,i′⁢j′l,l′⁢(θ∗)subscriptsuperscriptsuperscript′superscript′ H^l,l _ij,i j (θ^*)Hitalic_l , l start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPTi j , i′ j′ ( θ∗ ) ≈δl,l′⁢[1n⁢∑x∈fjl⁢(x)⁢fj′l⁢(x)]⁢[1n⁢∑x∈∑k∂fklfinal⁢(x)∂pil+1⁢∂fklfinal⁢(x)∂pi′l+1].absentsubscriptsuperscript′delimited-[]1subscriptsubscriptsuperscriptsubscriptsuperscriptsuperscript′delimited-[]1subscriptsubscriptsubscriptsuperscriptsubscriptfinalsubscriptsuperscript1subscriptsuperscriptsubscriptfinalsubscriptsuperscript1superscript′ ≈ _l,l [ 1n _x∈% Df^l_j(x)f^l_j (x) ] [ % 1n _x∈ D _k ∂ f^l_% final_k(x)∂ p^l+1_i ∂ f^l_final_k(% x)∂ p^l+1_i ]\,.≈ δitalic_l , l′ [ divide start_ARG 1 end_ARG start_ARG n end_ARG ∑x ∈ D fitalic_litalic_j ( x ) fitalic_litalic_j′ ( x ) ] [ divide start_ARG 1 end_ARG start_ARG n end_ARG ∑x ∈ D ∑k divide start_ARG ∂ fitalic_lfinalitalic_k ( x ) end_ARG start_ARG ∂ pitalic_l + 1i end_ARG divide start_ARG ∂ fitalic_lfinalitalic_k ( x ) end_ARG start_ARG ∂ pitalic_l + 1i′ end_ARG ] . 
This effectively turns the Hessian into a product of two matrices: a Gram matrix of activations in each layer,

$$G^l_{jj'} = \frac{1}{n}\sum_{x\in D} f^l_j(x)\, f^l_{j'}(x), \quad (38)$$

and a Gram matrix of Jacobians with respect to the next layer's preactivations,

$$K^l_{ii'} = \frac{1}{n}\sum_{x\in D}\sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_i}\, \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_{i'}}. \quad (39)$$

We can then find the eigenvectors of this approximated Hessian by separately diagonalising these two matrices. We would like to find a basis for $f^l$ that excludes directions connected exclusively to zero eigenvectors of the Hessian. That is, we want to exclude directions in $f^l$ that lie along zero eigenvectors of $G^l$, and directions that are mapped by the weight matrix $W^l$ to lie along zero eigenvectors of $K^l$.
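To make the factorised structure concrete, here is a small numerical sketch (our own code; the array names and dimensions are illustrative, not from the paper). Per sample, the Hessian block of equation 36 is exactly a Kronecker product of an activation outer product and a Jacobian Gram matrix; equation 37 replaces the expectation of this product with the product of the expectations, $K^l \otimes G^l$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, d_final = 200, 3, 4, 5

# Illustrative stand-ins: activations f^l(x) and output Jacobians
# J[x, k, i] = d f^{l_final}_k(x) / d p^{l+1}_i.
f = rng.normal(size=(n, d_in))
J = rng.normal(size=(n, d_final, d_out))

# Empirical Hessian block of eq. (36), flattening theta^l_{i,j} as i * d_in + j:
# per sample it is exactly kron(J^T J, f f^T).
H = sum(np.kron(J[x].T @ J[x], np.outer(f[x], f[x])) for x in range(n)) / n

# Gram matrices of eqs. (38) and (39).
G = f.T @ f / n                         # G^l_{jj'}: activation Gram matrix
K = np.einsum('xki,xkj->ij', J, J) / n  # K^l_{ii'}: Jacobian Gram matrix

# Eq. (37): swap the expectation of the Kronecker product
# for the Kronecker product of the expectations.
H_approx = np.kron(K, G)
```

The two expressions agree sample by sample; averaged over data they differ unless the activation and Jacobian statistics decouple, which is exactly the first presumption of independence.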
To exclude these directions, we can backpropagate the Jacobians in equation 39 one step further, to include the weight matrices $W^l$:

$$M^l_{ii'} = \frac{1}{n}\sum_{x\in D}\sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial f^l_i}\, \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial f^l_{i'}}, \quad (40)$$

and then search for a basis in $f^l$ that diagonalises $M^l$ and $G^l$ at the same time. This basis will have one basis vector fewer for each zero eigenvalue of the Gram matrices of the activations and Jacobians, respectively. It will also exclude directions that lie in the null space of $W^l$.

The matrices $G^l, M^l$ are symmetric, so we can write $G^l = U^{l\,T} D^l_G U^l$ and $M^l = V^{l\,T} D^l_M V^l$ for diagonal $D_G, D_M$ and orthogonal $U^l, V^l$. We can find a basis transformation $\hat{f}^l = C^l f^l$ in which both $G^l$ and $M^l$ are diagonal, in two steps:

1. Apply a whitening transformation with respect to $G^l$: $\tilde{f}^l = (D^{l\,1/2}_G)^+ U^l f^l$, where the plus denotes the Moore-Penrose pseudoinverse. If the activations in layer $l$ do not span the full activation space, then the Gram matrix $G^l$ is not full rank, and some diagonal entries of $D^l_G$ are zero. By choosing this pseudoinverse, we effectively eliminate all the degeneracies from low-rank activations from our final basis. In this basis, $\tilde{G}^l_{ij} = \delta_{ij}$.

2. Now that $G^l$ is whitened, we can apply the transformation $V^l$ which diagonalises $M^l$ without un-diagonalising $G^l$, since the identity matrix is isotropic (footnote 13: we need to be careful which coordinate basis we are working in; the entries of $V^l$ in the basis that whitens $G^l$ and in the standard basis are different). At this point both $M^l$ and $G^l$ are diagonal, and $C^l$ is defined up to multiplication by a diagonal matrix. We choose to multiply at the end by $(D^{l\,1/2}_M)^+$ because this eliminates degeneracies from low-rank Jacobians.

We call the basis $\hat{f}^l = (D^{l\,1/2}_M)^+ V^l (D^{l\,1/2}_G)^+ U^l f^l$ the interaction basis.
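The two-step construction can be written out directly in linear algebra. The following sketch is our own illustration, not the authors' implementation: it builds $C^l = (D^{l\,1/2}_M)^+ V^l (D^{l\,1/2}_G)^+ U^l$ from given Gram matrices, expressing $M^l$ in the whitened coordinates before diagonalising it, as the footnote requires.

```python
import numpy as np

def interaction_basis(G, M, tol=1e-10):
    """Sketch of the two-step interaction basis construction (our notation).
    G: activation Gram matrix G^l; M: gradient Gram matrix M^l (both PSD).
    Returns C^l = (D_M^{1/2})^+ V^l (D_G^{1/2})^+ U^l."""
    # Step 1: whiten G.  eigh gives G = vecs @ diag(vals) @ vecs.T,
    # so U = vecs.T matches the paper's convention G = U^T D_G U.
    d_G, vecs = np.linalg.eigh(G)
    U = vecs.T
    S_G = np.where(d_G > tol, 1.0 / np.sqrt(np.clip(d_G, tol, None)), 0.0)
    W = np.diag(S_G) @ U  # whitening map: W @ G @ W.T = I on the support of G

    # Express M in the whitened coordinates: gradients transform contravariantly.
    W_pinv = U.T @ np.diag(np.sqrt(np.clip(d_G, 0.0, None)))
    M_tilde = W_pinv.T @ M @ W_pinv

    # Step 2: diagonalise M_tilde, then rescale by (D_M^{1/2})^+.
    d_M, vecs_M = np.linalg.eigh(M_tilde)
    V = vecs_M.T
    S_M = np.where(d_M > tol, 1.0 / np.sqrt(np.clip(d_M, tol, None)), 0.0)
    return np.diag(S_M) @ V @ W

# Toy full-rank Gram matrices standing in for G^l and M^l.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 20)), rng.normal(size=(5, 20))
G, M = A @ A.T / 20, B @ B.T / 20
C = interaction_basis(G, M)
G_hat = C @ G @ C.T                                # activations transform covariantly
M_hat = np.linalg.inv(C).T @ M @ np.linalg.inv(C)  # gradients contravariantly
```

On full-rank inputs both Gram matrices come out diagonal in the new basis; with rank-deficient inputs the pseudoinverses drop the corresponding directions, removing the degenerate ones.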
Basis vectors in this basis are aligned with the directions that affect the output most. In the case of a deep linear network, this means that transforming to the interaction basis provably performs an SVD of each weight matrix, resulting in basis directions which are aligned with the principal components of the output of the network (see Appendix A).

We made two simplifying assumptions of independence about the Hessian to motivate this basis. While they have been used in other contexts with some success, these are still strong assumptions. Future work might investigate alternative techniques for finding a basis without these assumptions. This might only be possible with an overcomplete basis, which could connect the framework of this paper to superposition.

6.2 Invariance to linear transformations

The interaction basis is largely a coordinate-independent object, in the sense that it is invariant under linear transformations. If we apply a transformation $f^l \to f^l_R = R f^l$, $W^l \to W^l_R = W^l R^{-1}$ to the activation space, the final interaction basis is unchanged ($\hat{f}^l_R = \hat{f}^l$) for any $R \in \mathrm{GL}_{d^l}(\mathbb{R})$, up to trivial axis reflections, unless $M^l$ has repeated eigenvalues.
To show this, first note that in the whitened basis $\tilde{f}^l = (D^{l\,1/2}_G)^+ U^l f^l$, $G^l$ is by definition always transformed to the identity matrix:

$$\tilde{G}^l = (D^{l\,1/2}_G)^+ \, G^l \left((D^{l\,1/2}_G)^+\right)^T = I. \quad (41)$$

So if we whiten after applying the transformation $R$, $\tilde{f}^l_R$ can only differ from $\tilde{f}^l$ by an orthogonal transformation. Call this orthogonal matrix $Q_R$. In the whitened basis, $M^l_R$ will then be

$$M^l_R = Q_R M^l Q_R^T. \quad (42)$$

So $M^l_R$ and $M^l$ only differ by an orthogonal transformation. The interaction basis will be the eigenbasis of $M^l_R$ and $M^l$, respectively. So long as a real matrix does not have degenerate eigenvalues, its eigendecomposition is basis invariant if a convention for the eigenvector normalisation is chosen, up to reflections. So if $M^l$ does not have multiple identical eigenvalues, the interaction basis we end up in is the same up to reflections, whether we transformed with $R$ first or not. If $M^l$ does have identical eigenvalues, the basis will still be identical up to orthogonal transformations within the eigenspaces of $M^l$.
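This argument can be checked numerically. In the sketch below (our own code; random full-rank matrices stand in for $G^l$ and $M^l$), the whitening maps before and after a random reparameterisation $R$ differ by an orthogonal matrix $Q_R$, and the whitened $M$ matrices are related by conjugation with $Q_R$, so they share a spectrum and, generically, an eigenbasis up to reflections:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

def whiten(G):
    # Returns W with W @ G @ W.T = I (full-rank G assumed in this sketch).
    vals, vecs = np.linalg.eigh(G)
    return np.diag(vals ** -0.5) @ vecs.T

A, B = rng.normal(size=(d, 3 * d)), rng.normal(size=(d, 3 * d))
G = A @ A.T / (3 * d)  # stand-in for the activation Gram matrix G^l
M = B @ B.T / (3 * d)  # stand-in for the gradient Gram matrix M^l

R = rng.normal(size=(d, d))  # generic invertible reparameterisation
G_R = R @ G @ R.T            # activations transform as f -> R f
R_inv = np.linalg.inv(R)
M_R = R_inv.T @ M @ R_inv    # gradients transform contravariantly

W, W_R = whiten(G), whiten(G_R)
Q = W_R @ R @ np.linalg.inv(W)  # the two whitened bases differ by an orthogonal Q_R

W_inv, W_R_inv = np.linalg.inv(W), np.linalg.inv(W_R)
M_t = W_inv.T @ M @ W_inv        # M^l in the whitened basis
M_tR = W_R_inv.T @ M_R @ W_R_inv # the same, after first transforming by R
```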
7 Related Work

Explaining generalisation. The inductive biases of deep neural networks that lead them to generalise well past their training data have been an object of extensive study (Zhang et al., 2021). Attempts to understand generalisation involve studying simplicity biases (Mingard et al., 2021) and are closely related to attempts to quantify model complexity, for example via VC dimension (Vapnik, 1998), Rademacher complexity (Mohri et al., 2018), or less widely known methods (Liang et al., 2019; Novak et al., 2018). This paper is heavily influenced by Singular Learning Theory (Watanabe, 2009), which uses the local learning coefficient (Lau et al., 2023) to quantify the effective number of parameters in the model via the flatness of minima in the loss landscape. The flatness of minima has been found to predict model generalisation, for example in Li et al. (2018) for networks trained on CIFAR-10. SLT has been used to study the formation of internal structure in neural networks (Chen et al., 2023; Hoogland et al., 2024). Understanding the internals of neural networks through the geometry of their loss landscapes was also proposed as a research direction in Hoogland et al. (2023).

Local structure of the loss landscape. Other works have investigated the structure of neural network loss landscapes and their degeneracies around solutions found in training. In Martens and Grosse (2020), it was proposed that the Hessian matrix of MLPs can be approximated as factorising into independent outer products of activations and gradients, and that its eigenvectors might be approximated as being localised in a particular layer of the network. This approximation was later extended to CNNs, RNNs, and transformers in Grosse and Martens (2016); Martens et al. (2018); Grosse et al. (2023). The approximation was used by Wang et al. (2019) to compress models by pruning weights along directions with small Hessian eigenvalues.
For deep linear networks, an analytical expression for the learning coefficient was derived in Aoyagi (2024). Generic degeneracies in the loss shared by all models with an MLP ReLU architecture were investigated in Carroll (2021), and degeneracies of one-hidden-layer MLPs with tanh activation functions in Farrugia-Roberts (2022). It has been found that most minima in the loss landscape can often be connected by a continuous path of minimum loss, for example in Draxler et al. (2019) for models trained on CIFAR.

Selection for modularity. In Filan et al. (2021), it was found that MLPs and CNNs trained on vision tasks showed more modularity in the weights connecting their neurons than comparable random networks. The observed tendency for biological networks created by evolution to be modular has been widely investigated, with various explanations for the phenomenon being proposed. Clune et al. (2013) offer a good overview of this work for machine learning researchers, and suggest direct minimisation of connection costs between components as a primary driver of modularity in biological networks. Kashtan and Alon (2005) propose that genetic algorithms select systems to be modular because this makes them more robust to modular changes in the systems' environments. In Liu et al. (2023), connection costs were used to regularise MLPs trained on various tasks, including modular addition, to be more modular in their weights, in order to make them more interpretable.

8 Conclusion

We introduced the idea that the presence of degeneracy in neural networks' parameterizations may be a source of challenges for reverse engineering them. We identified some of the sources of this degeneracy, and suggested a technique (the interaction basis) for removing this degeneracy from the representation of the network.
We argued that this representation is likely to have sparser interactions, and we introduced a formula for searching for modules in the new representation of the network, based on a toy model of how modularity affects degeneracy. The follow-up paper Bushnaq et al. (2024) tests a variant of the interaction basis, finding that it results in representations which are sparse, modular, and interpretable on toy models, but that it is much less useful when applied to LLMs.

9 Contribution Statement

LB developed the ideas in this paper with contributions from JM and KH. JM and LB developed the presentation of these ideas together. JM led the writing, with substantial support from LB and feedback from SH and NGD. SH, DB, and NGD ran experiments to provide feedback on early versions of the interaction basis. CW ran experiments to test neuron synchronisation.

10 Acknowledgements

We thank Daniel Murfet, Tom McGrath, James Fox, and Lawrence Chan for comments on the manuscript, Dmitry Vaintrob for suggesting the concept of finite data SLT, and Vivek Hebbar, Jesse Hoogland, and Linda Linsefors for valuable discussions. Apollo Research is a fiscally sponsored project of Rethink Priorities.

References

Aoyagi (2024) Miki Aoyagi. Consideration on the learning efficiency of multiple-layered neural networks with linear units. Neural Networks, 172:106132, 04 2024. doi: 10.1016/j.neunet.2024.106132.

Bushnaq et al. (2024) Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, and Marius Hobbhahn. The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks. arXiv e-prints, art. arXiv:2405.10928, May 2024.

Carroll (2021) Liam Carroll. Phase transitions in neural networks. Master's thesis, School of Computing and Information Systems, The University of Melbourne, October 2021. URL http://therisingsea.org/notes/MSc-Carroll.pdf.
Carroll (2023) Liam Carroll. DSLT 1. The RLCT measures the effective dimension of neural networks, Jun 2023. URL https://www.alignmentforum.org/posts/4eZtmwaqhAgdJQDEg/dslt-1-the-rlct-measures-the-effective-dimension-of-neural.

Chan et al. (2022) Lawrence Chan, Adria Garriga-Alonso, Nix Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing: A method for rigorously testing interpretability hypotheses. Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.

Chen et al. (2023) Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus bayesian phase transitions in a toy model of superposition. arXiv preprint arXiv:2310.06301, 2023.

Christiano et al. (2022) Paul Christiano, Eric Neyman, and Mark Xu. Formalizing the presumption of independence. arXiv preprint arXiv:2211.06738, 2022.

Clune et al. (2013) Jeff Clune, Jean-Baptiste Mouret, and Hod Lipson. The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755):20122863, March 2013. ISSN 1471-2954. doi: 10.1098/rspb.2012.2863. URL http://dx.doi.org/10.1098/rspb.2012.2863.

Conmy et al. (2023) Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023.

Conmy et al. (2024) Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36, 2024.

Draxler et al. (2019) Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape, 2019.

Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.

Farrugia-Roberts (2022) Matthew Farrugia-Roberts. Structural degeneracy in neural networks. Master's thesis, School of Computing and Information Systems, The University of Melbourne, December 2022. URL https://far.in.net/mthesis.

Filan et al. (2021) Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, and Stuart Russell. Clusterability in neural networks, 2021.

Fusi et al. (2016) Stefano Fusi, Earl K Miller, and Mattia Rigotti. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37:66–74, 2016. ISSN 0959-4388. doi: 10.1016/j.conb.2016.01.010. URL https://www.sciencedirect.com/science/article/pii/S0959438816000118.

Geiger et al. (2021) Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.

Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories, September 2021. URL http://arxiv.org/abs/2012.14913. arXiv:2012.14913 [cs].

Goh et al. (2021) Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons.

Grosse and Martens (2016) Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers, 2016.

Grosse et al. (2023) Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, and Samuel R. Bowman. Studying large language model generalization with influence functions, 2023.

Hoogland (2023) Jesse Hoogland. Neural networks generalise because of this one weird trick. https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick, January 2023.

Hoogland et al. (2023) Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet, and Stan van Wingerden. Towards developmental interpretability, Jul 2023. URL https://www.alignmentforum.org/posts/TjaeCWvLZtEDAS5Ex/towards-developmental-interpretability.

Hoogland et al. (2024) Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. The developmental landscape of in-context learning, 2024.

Kashtan and Alon (2005) Nadav Kashtan and Uri Alon. Spontaneous evolution of modularity and network motifs. Proceedings of the National Academy of Sciences of the United States of America, 102:13773–8, 10 2005. doi: 10.1073/pnas.0503610102.

Lau et al. (2023) Edmund Lau, Daniel Murfet, and Susan Wei. Quantifying degeneracy in singular models via the learning coefficient. arXiv preprint arXiv:2308.12108, 2023.

Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets, 2018.

Liang et al. (2019) Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. In The 22nd international conference on artificial intelligence and statistics, pages 888–896. PMLR, 2019.

Liu et al. (2023) Ziming Liu, Eric Gan, and Max Tegmark. Seeing is believing: Brain-inspired modular training for mechanistic interpretability, 2023.

Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

Martens and Grosse (2020) James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature, 2020.

Martens et al. (2018) James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyMTkQZAb.

Meng et al. (2023) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt, 2023.

Mingard et al. (2021) Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A Louis. Is sgd a bayesian sampler? well, almost. Journal of Machine Learning Research, 22(79):1–64, 2021.

Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.

Murfet (2020) Daniel Murfet. Singular learning theory iv: the rlct. http://www.therisingsea.org/notes/metauni/slt4.pdf, April 2020. Lecture notes.

Murphy (2012) Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT press, 2012.

Nanda et al. (2023) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023.

Nguyen et al. (2016) Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks, 2016.

Novak et al. (2018) Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018.

Olah et al. (2017) Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.

Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.

Räuker et al. (2023) Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks, 2023.

Schwarz (1978) Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pages 461–464, 1978.

Vapnik (1998) Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Wang et al. (2019) Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. Eigendamage: Structured pruning in the kronecker-factored eigenbasis, 2019.

Wang et al. (2022) Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.

Watanabe (2009) Sumio Watanabe. Algebraic geometry and statistical learning theory, volume 25. Cambridge university press, 2009.

Watanabe (2013) Sumio Watanabe. A widely applicable bayesian information criterion. The Journal of Machine Learning Research, 14(1):867–897, 2013.

Wei et al. (2022) Susan Wei, Daniel Murfet, Mingming Gong, Hui Li, Jesse Gell-Redman, and Thomas Quella. Deep learning is singular, and that's good. IEEE Transactions on Neural Networks and Learning Systems, 2022.

Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Appendix A The local interaction basis on deep linear networks

The interaction basis diagonalizes interactions between neural network layers if the layer transitions are linear.
We derive this property for the local interaction basis, a modified interaction basis in which gradients to the final layer are replaced with gradients to the immediately subsequent layer, in order to sparsify interactions between adjacent layers. In the experimental follow-up to this paper, Bushnaq et al. [2024] discuss the local interaction basis in more detail before testing it on real networks.

In this appendix, we show that the local interaction basis diagonalizes the interactions between neural network layers if the layer transitions are linear. The derivation for the non-local interaction basis follows the same structure. In the absence of nonlinearities, a deep neural network is just a series of matrix multiplications (once an extra component with a constant value of 1 is added to the activation vectors, to include the bias). The sparsest way to describe this series of matrix multiplications is to multiply the network out into a single matrix, and then to rotate into the right singular basis of this matrix in the inputs, and the left singular basis in the outputs. To see that transforming to the local interaction basis does indeed perform an SVD for deep linear networks, consider the penultimate layer of the network.
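Before following the layer-by-layer derivation, the multiply-out-and-SVD picture is worth stating in code. A minimal sketch (our own, with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
layer_weights = [rng.normal(size=(6, 6)) for _ in range(3)]  # a 3-layer linear net

# Multiply the network out into a single matrix ...
W_total = layer_weights[2] @ layer_weights[1] @ layer_weights[0]

# ... and rotate inputs into its right singular basis and outputs into its
# left singular basis.  The end-to-end map is then diagonal: the sparsest
# possible description of the computation.
U, S, Vt = np.linalg.svd(W_total)
end_to_end = U.T @ W_total @ Vt.T
```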
We neglect mean centering to make this derivation cleaner, and start by transforming layer $l_{\text{final}}-1$ to a basis which whitens the activations:

$$f^{l_{\text{final}}} = W^{l_{\text{final}}-1} f^{l_{\text{final}}-1} = \underbrace{W^{l_{\text{final}}-1} (U^{l_{\text{final}}-1})^T (D^{l_{\text{final}}-1})^{\frac{1}{2}}}_{W'^{\,l_{\text{final}}-1}} \underbrace{\left((D^{l_{\text{final}}-1})^{\frac{1}{2}}\right)^+ U^{l_{\text{final}}-1} f^{l_{\text{final}}-1}}_{f'^{\,l_{\text{final}}-1}}$$

We have wrapped these transformations into definitions of $W'^{\,l_{\text{final}}-1}$ and $f'^{\,l_{\text{final}}-1}$. We'll show that the other transformations perform an SVD of $W'^{\,l_{\text{final}}-1}$. First, we have to transform to the (uncentered) PCA basis in the final layer.
$$G^{l_{\text{final}}}_{ij} = \frac{1}{n}\sum_x f^{l_{\text{final}}}_i(x)\, f^{l_{\text{final}}}_j(x) = \frac{1}{n}\sum_x W^{l_{\text{final}}-1}_{ik} f^{l_{\text{final}}-1}_k\, W^{l_{\text{final}}-1}_{jm} f^{l_{\text{final}}-1}_m$$

$$G^{l_{\text{final}}} = W^{l_{\text{final}}-1} G^{l_{\text{final}}-1} \left(W^{l_{\text{final}}-1}\right)^T = W'^{\,l_{\text{final}}-1} \left(W'^{\,l_{\text{final}}-1}\right)^T$$

where in the last step we have leveraged that $G^{l_{\text{final}}-1} = (U^{l_{\text{final}}-1})^T D^{l_{\text{final}}-1} U^{l_{\text{final}}-1}$ by definition.
Writing $W'^{\,l_{\text{final}}-1} = U_{W'} \Sigma_{W'} V_{W'}^T$, we have that $G^L = U_{W'} \Sigma_{W'}^2 U_{W'}^T$, so $U^L = U_{W'}^T$. Since there is no layer after the final layer, the matrix $M$ is not defined for the final layer, so the LI basis in the final layer is just the PCA basis (footnote 14: this is also true in the nonlocal interaction basis, since $\partial f^{l_{\text{final}}}_i(x) / \partial f^{l_{\text{final}}}_j = \delta_{ij}$):

$$\hat{f}^{l_{\text{final}}} = U^{l_{\text{final}}} f^{l_{\text{final}}} = U_{W'}^T W'^{\,l_{\text{final}}-1} f'^{\,l_{\text{final}}-1} \quad (43)$$

For the final part of the transformation into the LIB, we need to calculate $M$, which depends on the Jacobian from the LIB functions in the next layer to the PCA functions in the current layer:

$$M^{l_{\text{final}}-1}_{j,j'} = \frac{1}{n}\sum_x \frac{\partial \hat{f}^{l_{\text{final}}}_i(x)}{\partial f'^{\,l_{\text{final}}-1}_j}\, \frac{\partial \hat{f}^{l_{\text{final}}}_i(x)}{\partial f'^{\,l_{\text{final}}-1}_{j'}}$$

$$M^{l_{\text{final}}-1} = \left(W'^{\,l_{\text{final}}-1}\right)^T U_{W'} U_{W'}^T W'^{\,l_{\text{final}}-1} = \left(W'^{\,l_{\text{final}}-1}\right)^T W'^{\,l_{\text{final}}-1} = V_{W'} \Sigma_{W'}^2 V_{W'}^T =: \left(V^{l_{\text{final}}-1}\right)^T \Lambda^{l_{\text{final}}-1} V^{l_{\text{final}}-1}$$

so $V^{l_{\text{final}}-1} = V_{W'}^T$ and $\Lambda^{l_{\text{final}}-1} = \Sigma_{W'}^2$.
Now,

$$\hat f^{\,l_\text{final}-1} = C^{l_\text{final}-1} f^{l_\text{final}-1} = \left(\Lambda^{l_\text{final}-1}\right)^{\frac12} V^{l_\text{final}-1} f'^{\,l_\text{final}-1}.$$

Using equation 43, we have:

$$\hat f^{L} = U_{W'}^T\, W'^{\,l_\text{final}-1}\, V_{W'} \left(\left(\Lambda^{l_\text{final}-1}\right)^{\frac12}\right)^{\!+} \hat f^{\,l_\text{final}-1} = \Sigma_{W'} \left(\left(\Lambda^{l_\text{final}-1}\right)^{\frac12}\right)^{\!+} \hat f^{\,l_\text{final}-1} = \hat f^{\,l_\text{final}-1} \qquad (44)$$

since $\Lambda^{l_\text{final}-1}=\Sigma_{W'}^2$. For layers which are not the final layer of the network, the procedure is very similar.
As before, we have:

$$f'^{\,l} := \left(\left(D^l\right)^{\frac12}\right)^{\!+} U^l f^l, \qquad W'^{\,l} := W^l \left(U^l\right)^T \left(D^l\right)^{\frac12}$$

$$G^{l+1} = W'^{\,l}\, W'^{\,l,T}, \qquad U^{l+1} = U_{W'^{\,l}}^T$$

Now, we need to remember that $\hat f^{\,l+1} = C^{l+1} f^{l+1}$:

$$f'^{\,l+1} = \left(\left(D^{l+1}\right)^{\frac12}\right)^{\!+} U^{l+1}\, W'^{\,l}\, f'^{\,l} = \Sigma_{W'^{\,l}}^{+}\, U_{W'^{\,l}}^T\, W'^{\,l}\, f'^{\,l} = V_{W'^{\,l}}^T f'^{\,l}$$

$$\hat f^{\,l+1} = \left(\Lambda^{l+1}\right)^{\frac12} V^{l+1} f'^{\,l+1} = \left(\Lambda^{l+1}\right)^{\frac12} V^{l+1}\, V_{W'^{\,l}}^T f'^{\,l}$$

$$M^l_{j,j'} = \frac{1}{n}\sum_x \sum_i \frac{\partial \hat f^{\,l+1}_i(x)}{\partial f'^{\,l}_j}\,\frac{\partial \hat f^{\,l+1}_i(x)}{\partial f'^{\,l}_{j'}}$$

$$M^l = V_{W'^{\,l}}\, V^{l+1,T}\, \Lambda^{l+1}\, V^{l+1}\, V_{W'^{\,l}}^T$$

Once again, note that this expression for $M^l$ is manifestly of the diagonalized form $V^{l,T}\Lambda^l V^l$, so

$$V^l = V^{l+1} V_{W'^{\,l}}^T, \qquad \Lambda^l = \Lambda^{l+1}.$$

So $V^l$ is exactly what we need in order to diagonalize the relationship, and we end up with

$$\hat f^{\,l+1} = \left(\Lambda^{l+1}\right)^{\frac12} V^{l+1}\, V_{W'^{\,l}}^T\, V^{l,T} \left(\left(\Lambda^{l}\right)^{\frac12}\right)^{\!+} \hat f^{\,l} = \hat f^{\,l} \qquad (45)$$

So the LIB activations in each layer of the network are the same as in the final layer: the final-layer activations rotated into their PCA basis, but without whitening.
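The whole derivation can be checked numerically on a small random deep linear network: building the whitened PCA coordinates $f'^{\,l}$ layer by layer and the LIB rotations by the backwards recursion above, the hatted activations of every layer should coincide with the final-layer activations in their unwhitened PCA basis. The sketch below assumes generic full-rank weights and strictly decreasing widths, so every pseudo-inverse is an ordinary inverse; the sample count and widths are arbitrary choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dims = 2000, [6, 5, 4, 3]          # arbitrary sample count and layer widths
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

# Forward pass of a deep *linear* network (the setting of this appendix).
f = [rng.standard_normal((dims[0], n))]
for W in Ws:
    f.append(W @ f[-1])

# Layer 0: PCA of the inputs, G^0 = (1/n) sum_x f^0 f^{0,T} = U^{0,T} D^0 U^0.
evals, evecs = np.linalg.eigh(f[0] @ f[0].T / n)
U, D = evecs.T[::-1], evals[::-1]     # rows of U are principal directions

f_prime, Vts, sigmas = [], [], []
for l, W in enumerate(Ws):
    f_prime.append(np.diag(D**-0.5) @ U @ f[l])   # f'^l = ((D^l)^{1/2})^+ U^l f^l
    W_prime = W @ U.T @ np.diag(D**0.5)           # W'^l = W^l (U^l)^T (D^l)^{1/2}
    Uw, Sw, Vwt = np.linalg.svd(W_prime, full_matrices=False)
    Vts.append(Vwt)                               # rows of Vwt give V_{W'^l}^T
    sigmas.append(Sw)
    U, D = Uw.T, Sw**2                # U^{l+1} = U_{W'^l}^T, D^{l+1} = Sigma^2

f_hat_L = U @ f[-1]                   # final layer: PCA basis, no whitening

# Backwards recursion: V^{L-1} = V_{W'}^T with Lambda^{1/2} = Sigma_{W'};
# thereafter V^l = V^{l+1} V_{W'^l}^T with Lambda unchanged.
V, lam_sqrt = Vts[-1], sigmas[-1]
for l in range(len(Ws) - 1, -1, -1):
    f_hat_l = np.diag(lam_sqrt) @ V @ f_prime[l]
    assert np.allclose(f_hat_l, f_hat_L)          # eqs. (44)/(45): layers agree
    if l > 0:
        V = V @ Vts[l - 1]
```

Every layer's hatted activations match the final layer's to numerical precision, illustrating the claim that in the linear case the LIB collapses to the final-layer PCA basis without whitening.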