
Paper deep dive

On Privileged and Convergent Bases in Neural Network Representations

Davis Brown, Nikhil Vyas, Yamini Bansal

Year: 2023 · Venue: Workshop on High-dimensional Learning Dynamics at ICML 2023 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 32

Models: WideResNet

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 1:04:25 AM

Summary

This paper investigates whether neural network representations possess a privileged and convergent basis. The authors demonstrate that neural networks do not converge to a unique basis even at high widths, and that arbitrary rotations of representations cannot be inverted, indicating a lack of rotational invariance. While Linear Mode Connectivity (LMC) improves with width, the authors show this is not due to increased basis correlation. Finally, they propose that freezing early layers during training can enforce a convergent basis across different runs, which is beneficial for interpretability and modular training.

Entities (6)

Neural Network · model-architecture · 100%
Linear Mode Connectivity · metric · 98%
ConvNeXt · model-architecture · 95%
Permutation-Correlation · metric · 95%
Vision Transformer · model-architecture · 95%
WideResNet · model-architecture · 95%

Relation Signals (3)

Freezing Early Layers increases Basis Correlation

confidence 95% · Basis correlation increases significantly when a few early layers of the network are frozen identically.

Linear Mode Connectivity improves with Network Width

confidence 90% · while Linear Mode Connectivity improves with increased network width, this improvement is not due to an increase in basis correlation

Neural Network lacks Rotational Invariance

confidence 90% · arbitrary rotations of neural representations cannot be inverted... indicating that they do not exhibit complete rotational invariance

Cypher Suggestions (2)

Identify techniques that improve basis correlation. · confidence 90% · unvalidated

MATCH (t:Technique)-[:INCREASES]->(m:Metric {name: 'Basis Correlation'}) RETURN t.name

Find all metrics used to evaluate neural network basis convergence. · confidence 85% · unvalidated

MATCH (m:Metric)-[:EVALUATES]->(n:ModelArchitecture) RETURN m.name, n.name

Abstract

In this study, we investigate whether the representations learned by neural networks possess a privileged and convergent basis. Specifically, we examine the significance of feature directions represented by individual neurons. First, we establish that arbitrary rotations of neural representations cannot be inverted (unlike linear networks), indicating that they do not exhibit complete rotational invariance. Subsequently, we explore the possibility of multiple bases achieving identical performance. To do this, we compare the bases of networks trained with the same parameters but with varying random initializations. Our study reveals two findings: (1) Even in wide networks such as WideResNets, neural networks do not converge to a unique basis; (2) Basis correlation increases significantly when a few early layers of the network are frozen identically. Furthermore, we analyze Linear Mode Connectivity, which has been studied as a measure of basis correlation. Our findings give evidence that while Linear Mode Connectivity improves with increased network width, this improvement is not due to an increase in basis correlation.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

32,193 characters extracted from source content.


On Privileged and Convergent Bases in Neural Network Representations

Davis Brown* (DAVIS.BROWN@PNNL.GOV), Pacific Northwest National Laboratory
Nikhil Vyas* (NIKHIL@G.HARVARD.EDU), Harvard University
Yamini Bansal (YBANSAL@GOOGLE.COM), Google DeepMind

Abstract: In this study, we investigate whether the representations learned by neural networks possess a privileged and convergent basis. Specifically, we examine the significance of feature directions represented by individual neurons. First, we establish that arbitrary rotations of neural representations cannot be inverted (unlike linear networks), indicating that they do not exhibit complete rotational invariance. Subsequently, we explore the possibility of multiple bases achieving identical performance. To do this, we compare the bases of networks trained with the same parameters but with varying random initializations. Our study reveals two findings: (1) Even in wide networks such as WideResNets, neural networks do not converge to a unique basis; (2) Basis correlation increases significantly when a few early layers of the network are frozen identically. Furthermore, we analyze Linear Mode Connectivity, which has been studied as a measure of basis correlation. Our findings give evidence that while Linear Mode Connectivity improves with increased network width, this improvement is not due to an increase in basis correlation.

1. Introduction

While neural networks are black-box function approximators that are trained end-to-end to optimize a loss objective, their emergent internal layer-wise representations are important objects for both understanding deep learning and direct downstream use. Internal representations of neural networks can be useful tools for interpretability [4, 5, 24], teaching us how neural networks perform the computations they do, as well as for understanding the implicit biases of gradient-based neural network training [1, 2].
Moreover, representations are often directly used for downstream tasks that the network was not originally trained for, as in transfer learning or representation learning. Thus, we would like to develop a better understanding of the mathematical properties of neural network representations. One such property is whether neural network representations have a privileged basis [10]. That is, are the features represented by each individual neuron significant, or is information stored only at a population level in neurons? This question is important, for instance, in interpretability, where attempts have been made to interpret features represented by individual neurons (such as edge or curve detectors in convolutional networks [5]). This question is also closely related to that of invariances exhibited by neural representations: what is the set of transformations that can be applied to representations while keeping the final network accuracy unchanged? In particular, if the representations are rotation invariant, then an individual neuron does not carry significant information. To understand this further, consider the simple case of a two-layer neural network without any non-linear activation functions. That is, the function output of the network is f(x) = W_2 W_1 x with weights W_2, W_1 and inputs x. Here, the first-layer representations are W_1 x. This representation exhibits rotation invariance: we can rotate the first-layer representations by an arbitrary orthonormal matrix O (giving us rotated first-layer representations O W_1 x), but the subsequent layer can invert this rotation and recover the original function, f(x) = W_2 O^{-1} O W_1 x. Thus, an individual neuron could represent any arbitrary feature for the same functional outputs in the network.

* Equal contribution. © D. Brown, N. Vyas & Y. Bansal. arXiv:2307.12941v1 [cs.LG] 24 Jul 2023
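The linear two-layer example can be checked numerically. The sketch below (toy layer sizes, NumPy) rotates the first-layer representation by a random orthonormal O and shows the second layer exactly undoing it by absorbing O^T:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer linear network f(x) = W_2 W_1 x (toy sizes are arbitrary).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=(4,))

# Random orthonormal matrix O via QR decomposition of a Gaussian matrix.
O, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# Rotate the first-layer representation; the second layer absorbs O^{-1} = O^T.
out_original = W2 @ (W1 @ x)
out_rotated = (W2 @ O.T) @ (O @ (W1 @ x))

assert np.allclose(out_original, out_rotated)
```

This is exactly why individual neuron directions carry no special meaning in a deep linear network; Section 4 tests whether the same holds once nonlinearities are present.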
Our Contributions: In our work, we start by showing that the perceived permutation invariance of representations at high width is actually a result of a kind of noise-averaging: the correlation between the activities of neurons, after accounting for permutations, remains nearly constant as we scale the width (Section 3). This shows that while metrics like linear mode connectivity may suggest permutation invariance, the effect disappears when examined at a neuron level. Since this casts some doubt on the presence of a privileged basis, in Section 4 we ask if any basis of neural representations is likely to work equally well. To do so, we consider a random rotation of a layer and ask if it can be inverted by the later layers with training. We find that this is not the case. Combined, these results suggest that while the basis of neural representations matters, there is no one unique basis required to achieve the same functional accuracy. Finally, in Section 5, we ask what kinds of constraints can be imposed on the network to obtain a unique neural basis consistent across different training runs.

2. Related Work

Convergent learning. Also referred to as universality, convergent learning is the conjecture that different deep learning models learn very similar representations when trained on similar data [20, 24]. Much of the work in mechanistic interpretability [9, 23–25] has leveraged the universality conjecture to motivate research on toy models, with the hope that the methods and interpretations developed for these more tractable models will scale to larger and more capable models. Recently, [6] examined universality by reverse-engineering a toy transformer model for a group composition task. Attempts to test for convergent learning include representation dissimilarity comparisons, notably neuron alignment [14, 19, 20] and correlation analysis / centered kernel alignment [17, 22].
Model stitching [2, 18] extracts features from the early layers of model f and inserts them into the later layers of model g (usually via a learned, low-capacity connecting layer φ). If the representations between these models can be combined such that the resulting 'stitched' model, g_{>l} ∘ φ ∘ f_{≤l}, achieves a low loss on a downstream task, the models are called 'stitching connected' at layer l for that task.

Linear Mode Connectivity: It has been conjectured in [1, 12, 15] that, for different models learned by SGD with equal loss, once the permutation symmetries of neural networks are taken into consideration, linear interpolations between them of the form θ_α = (1−α)θ_1 + αθ_2 for 0 < α < 1 have near-constant loss.

Privileged Basis: It is often taken for granted in the interpretability literature that the activation basis is privileged, at least in layers with elementwise operations (namely nonlinearities) [3, 13, 28–30]. On the other hand, while the residual stream of transformer models has no obvious elementwise operation and is thus not an obviously privileged basis, [7] provides evidence for outlier basis-aligned features. To study this phenomenon, [11] demonstrate that transformers can learn rotationally invariant representations in their residual stream using a procedure similar to the one we describe in Section 4. We ask the complementary question of whether layers with nonlinearities can learn in an arbitrary basis.

3. SGD Basis and Linear Mode Connectivity

In this section we explore whether two neural networks with different random initializations converge to the same basis. In other words, are the two neural networks the same up to a permutation of neurons per layer? It is clear that this will hold for an infinite-width network, since at infinite width there is no 'randomness' in initialization. We are interested in whether it holds for large yet feasible widths.
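The interpolation θ_α = (1−α)θ_1 + αθ_2 and the associated loss barrier can be sketched as below. This is a minimal NumPy sketch, not the paper's implementation: `loss_fn` is a hypothetical stand-in for evaluating a network's loss from its parameter list, and the barrier is measured as the maximum excess of the path loss over the linear interpolation of the endpoint losses:

```python
import numpy as np

def interpolate_params(theta_1, theta_2, alpha):
    """theta_alpha = (1 - alpha) * theta_1 + alpha * theta_2, per parameter tensor."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_1, theta_2)]

def lmc_barrier(loss_fn, theta_1, theta_2, n_points=11):
    """Max loss along the linear path minus the linear interpolation of endpoint losses."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path = [loss_fn(interpolate_params(theta_1, theta_2, a)) for a in alphas]
    endpoints = [(1 - a) * path[0] + a * path[-1] for a in alphas]
    return max(p - e for p, e in zip(path, endpoints))
```

The perm-LMC barrier studied below is the same quantity computed after first permuting the neurons of one network to best match the other.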
Recently, [1, 12, 15] have studied the weaker claim of whether different neural networks are linear mode connected after an appropriate permutation of the neurons per layer. Specifically, they study the loss/error barrier when interpolating between the two networks (after permutation). We call this barrier the perm-LMC barrier. In [1, 15] it was found that for networks trained with SGD on CIFAR-10, the perm-LMC barrier becomes very small (< 2% error barrier) for large yet feasible widths. On the other hand, for ImageNet-trained models, while the barrier decreases with width, it remains quite large. There are at least two reasons why the perm-LMC barrier could decrease with width:

1. The bases of two trained networks become closer with larger width.
2. LMC improves with width even if the two networks do not become closer in their basis with width.

To answer this question, we need a measure of how close two networks are to being permutations of each other. Let Perm_n be the set of permutations over n elements. For a layer with n neurons and activations X^1 and X^2 of the two networks, we use max_{p ∈ Perm_n} Σ_{i ∈ [n]} Cov(X^1_i, X^2_{p(i)}), which we call permutation-correlation (perm-corr). This is also the measure used by Li et al. [19] and Jordan et al. [15] to find an appropriate permutation for calculating the perm-LMC barrier.

perm-corr is nearly constant across width: In Figure 1 (left) we use the exact setup of Jordan et al. [15] and plot perm-corr and the perm-LMC barrier (between two networks trained with independent initializations and batch orderings) as a function of width, where we use ResNet-20 as the base network and ResNet-20-32x as the widest one. perm-corr remains almost constant with width. This suggests that we are very far from the widths needed for the neuron bases to be close, i.e. for perm-corr to be near 1. As in prior works, we do find that the perm-LMC barrier goes down with width.
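The maximization over permutations in the perm-corr definition is a linear assignment problem on the cross-covariance matrix, so it can be solved exactly. The sketch below assumes SciPy's `linear_sum_assignment` for the matching; the paper additionally averages the resulting values across all layers, which is omitted here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def perm_corr(X1, X2):
    """Permutation-correlation: max over permutations p of sum_i Cov(X1_i, X2_p(i)).

    X1, X2: (n_samples, n_neurons) activations of the same layer in two networks.
    Returns the optimal total covariance and the matching permutation.
    """
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    cov = X1c.T @ X2c / (len(X1) - 1)  # (n_neurons, n_neurons) cross-covariance
    rows, cols = linear_sum_assignment(cov, maximize=True)
    return cov[rows, cols].sum(), cols
```

When the second network is an exact neuron permutation of the first, the recovered matching is the inverse of that permutation and the total covariance equals the summed neuron variances.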
Combined, these results suggest that the perm-LMC barrier goes down with width because linear mode connectivity becomes more feasible at high width, rather than due to a better match between the bases of neurons. We now explore this further.

Footnotes: 1. Following Jordan et al. [15], we reset batch norm statistics in all of the experiments. 2. For all experiments we report values of perm-corr averaged across all the layers.

Figure 1: Do randomly initialized neural networks converge to the same basis? (left) The perm-LMC barrier drops with width but perm-corr is nearly constant. (middle) The LMC barrier (without permutation) also drops with width. (right) perm-corr does not improve through training for most widths.

The previous experiment already verified that basis correlation (operationalized by perm-corr) does not improve with width. We now verify that linear mode connectivity improves with width even without an increase in similarity of basis. To do so, we consider the LMC barrier between two networks without permuting the neurons (Jordan et al. [15] did not reset batch norm statistics for this experiment and hence did not observe this improvement with width). In Figure 1 (middle) we find that this quantity (green) improves significantly with width, adding support to our hypothesis. We can now ask why width has this effect. Intuitively, the LMC barrier for two uncorrelated networks (or even networks with some fixed amount of correlation) improves if the networks are robust to noise. This is because being robust to noise means that we can treat the other network as noise and maintain the performance of the first network.
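The 'noisy network' baseline, averaging a trained network with norm-matched random weights, can be sketched as follows. This is a hedged reconstruction of the experimental setup described in the text (random weights scaled to the same per-tensor norm as the trained ones), not the paper's code:

```python
import numpy as np

def average_with_noise(theta_trained, rng):
    """'Noisy network' baseline: average trained parameters with random weights
    whose per-tensor norms match the trained ones, i.e. 0.5 * (theta + noise)."""
    averaged = []
    for w in theta_trained:
        noise = rng.normal(size=w.shape)
        noise *= np.linalg.norm(w) / np.linalg.norm(noise)  # match weight norm
        averaged.append(0.5 * (w + noise))
    return averaged
```

If the accuracy of such a network tracks the trained network's accuracy as width grows, noise robustness alone can explain much of the LMC improvement.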
In Figure 1 (middle) we plot (orange) the performance of a network formed by averaging a trained and a random network (with the same weight norms). We find that the accuracy drop (compared to the trained network) of this network also improves with width, adding support to our hypothesis. Together, these experiments suggest that while the perm-LMC barrier going down with width is an interesting empirical finding in its own right, it overestimates the similarity of neurons across different networks and is, to a large extent, driven by other factors such as robustness to noise.

3.1. Change in Permutation-Correlation due to Training

Another way to think about the emergence of a unique basis is to ask how the training process affects perm-corr, i.e., does perm-corr improve due to training? In Figure 1 (right) we plot perm-corr for trained and random networks and find that two random networks have higher perm-corr than two trained networks, for all but the narrowest widths. This suggests that there is a large variance in the bases of neurons found in a trained network. This argues against convergent learning in neural networks from the perspective of the basis of neurons.

4. Can Networks Learn Rotationally Invariant Representations?

In the last section we saw that the neuron bases of networks trained from independent initializations are not aligned (compared to the baseline of two randomly initialized networks), i.e., there is no unique basis. This raises the following natural question: do all bases lead to good representations? Specifically, in this section we ask if the representations learnt by neural networks can be made rotation invariant.
We build on the linear example considered in the introduction, where any orthonormal transformation of the representation can be inverted by the following layer. To test this in non-linear networks trained with gradient descent, we take a Myrtle-CNN trained with SGD on CIFAR-10 to 92% accuracy. Say the network is f(x) = σ(A_l(σ(A_{l−1}(...)))), where A_l denotes the pre-activation of layer l and σ denotes the non-linearity. Then, we perform the following procedure: We sample a random orthonormal matrix O_1 and multiply it with the pre-activations of the first layer, O_1 A_1. Then, we freeze this rotated layer and retrain all the remaining layers of the network on the same training dataset. Next, we apply a random orthonormal matrix to the second layer, O_2 A_2, freeze the first two layers, and retrain all layers l > 2. We repeat this procedure successively for all the layers in the network. Thus, f_{l,rotated}(x) = σ(A_l(σ(O_{l−1} A_{l−1}(σ(O_{l−2} A_{l−2}(...)))))). Figure 2 (left) shows the resulting error when we freeze up to l layers. We find that retraining cannot invert this random rotation and that the network accuracy degrades significantly. We also observe that the error gets worse for later layers in the network. For comparison, we also plot the error of a network with random weights (with the same distribution as the initialization) for the first l layers, and show that the increase in error from performing random rotations is similar to random features: not training the layers at all! For reference, we repeat the above procedure but freeze only one layer at a time (Figure 2, right). We again observe a significant increase in the error of the network, which gets worse as we go deeper into the network. Our findings highlight that the bases of the network are not in fact rotation invariant. This suggests that the directions represented by individual neurons are in fact significant, and we cannot use an arbitrary basis for the network.
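The rotation step of this procedure can be sketched on a toy ReLU network. This is not the Myrtle-CNN setup, just a minimal stand-in showing why the linear-network argument fails with nonlinearities: rotating a pre-activation before the ReLU changes the function, and no fixed linear map in the next layer can undo it:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy 3-layer ReLU MLP standing in for the Myrtle-CNN (sizes are arbitrary).
W = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(10, 16))]

def forward(x, rotations):
    """Forward pass, left-multiplying layer-l pre-activations by O_l when given."""
    h = x
    for l, Wl in enumerate(W[:-1]):
        z = Wl @ h
        if rotations[l] is not None:
            z = rotations[l] @ z  # rotate the pre-activation of layer l
        h = relu(z)
    return W[-1] @ h

# Random orthonormal matrix via QR of a Gaussian matrix.
O1, _ = np.linalg.qr(rng.normal(size=(16, 16)))
x = rng.normal(size=(8,))
rotated_out = forward(x, [O1, None])
plain_out = forward(x, [None, None])
# Because relu(O1 @ z) != O1 @ relu(z) in general, the later layers cannot simply
# absorb O1^T as in the linear case; the paper retrains layers l > 1 and finds
# the rotation is not recovered.
```

The paper's experiment adds the retraining step: freeze the rotated layers and retrain everything above them, measuring the resulting error.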
Figure 2: Random feature vs. recovering random rotation performance for Myrtle CNN models trained on CIFAR-10, at widths 16 and 256. (left) successive freezing, rotating, and retraining; (right) single-layer freezing and rotating.

5. The residual stream might enable convergent bases

Finally, we ask if there are other empirically common choices that may enable the basis to be fixed in certain ways. We highlight the candidate of a residual stream. In transformer networks, for instance, there is a largely linear residual connection that runs from the input (after the embedding layer) to the remainder of the network [9].

Figure 3: Enforcing a consistent basis by freezing l early layers. (left) perm-corr for ConvNeXt models trained on CIFAR-10; (middle) identity stitching for CIFAR-10 ConvNeXt models; (right) perm-corr for ViT models trained on ImageNet.
If the residual stream is indeed important to the computations performed by the network, it may impose a basis on the network that roughly tries to align with the input embedding. For models that share the same tokenizer or embedding layer, this might suggest that we can combine networks with residual streams as is, without even having to compute permutations to match the neurons for symmetry correction [1]. However, prior work [2] has shown that the early layers of a vision network can be replaced with random features without significant loss in performance. We see in Figure 2 that the first few layers of the network are relatively more resistant to random rotations. This suggests that even if the residual connections were providing a force for the basis to align, they would align to the basis computed by these random transformations of the early layers. Thus, we experiment with freezing early blocks of the network. First, we train a model A with n layers. Then, we train a new model B with the first l layers frozen to those of A, training the n − l top layers from scratch with a new random initialization. perm-corr is computed between layers. Figure 3 shows our results for a ConvNeXt [21] model trained on CIFAR-10 (left) and a Vision Transformer (ViT) [8] trained on ImageNet (right). We also measure the identity stitching penalty, where we evaluate the performance of pairs of models stitched with the identity function φ = id, i.e. models of the form g_{>l} ∘ f_{≤l}. We find that freezing only 2 blocks is sufficient for significantly higher perm-corr (and, likewise, lower error for identity stitching; Figure 3, middle) in ConvNeXt models and ViTs. This convergence phenomenon is examined across different model widths and different residual stream structures in Appendix A.
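The freeze-and-retrain procedure and identity stitching can be sketched as below. This is a schematic, not the paper's training code: `train_fn` is a hypothetical helper that optimizes only the unfrozen layers, and `forward_fn` is a toy ReLU stack standing in for a real forward pass:

```python
import numpy as np
from functools import reduce

def train_with_frozen_prefix(model_a_layers, l, train_fn, rng):
    """Train a model B that reuses the first l layers of model A, frozen,
    and trains the remaining layers from a fresh random initialization.
    train_fn is a hypothetical helper that optimizes only the fresh layers."""
    frozen = [w.copy() for w in model_a_layers[:l]]
    fresh = [rng.normal(size=w.shape) for w in model_a_layers[l:]]
    return frozen + train_fn(frozen, fresh)

def identity_stitch(f_layers, g_layers, l, forward_fn, x):
    """Evaluate the identity-stitched model g_{>l}(f_{<=l}(x)), with phi = id."""
    return forward_fn(g_layers[l:], forward_fn(f_layers[:l], x))

# Minimal forward pass for a stack of ReLU layers, used by identity_stitch.
forward_fn = lambda layers, h: reduce(lambda a, Wl: np.maximum(Wl @ a, 0.0), layers, h)
```

When A and B share a frozen prefix, identity stitching at depth l is exact by construction; the paper's measurement is how large this penalty is for independently trained top layers.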
Together, these results suggest a relatively cheap procedure for fixing the basis when training different neural networks, an (as noted) desirable property for interpretability research that is also relevant for modular and distributed neural network training [16, 26]. For example, a related phenomenon for networks fine-tuned from the same base model was used in [27] to create a combined model more accurate than its constituent models.

6. Discussion and conclusion

We find that, in some important respects, linear mode connectivity overstates the similarity of neurons across runs. Our results suggest that while the LMC barrier improves with network width, this can in part be explained by factors beyond similarity. On the other hand, we find strong evidence that for neural networks, only certain bases (namely, those that are not rotation invariant) lead to good representations. Finally, we provide a straightforward procedure to enable a convergent basis; this is desirable for both interpretability and modular training.

Acknowledgements

NV is supported by a Simons Investigator Fellowship, NSF grant DMS-2134157, DARPA grant W911NF2010021, and DOE grant DE-SC0022199.

References

[1] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CQsmMYmlP5T.

[2] Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. In Advances in Neural Information Processing Systems, volume 34, pages 225–236. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/01ded4259d101feb739b06c399e9cd9c-Paper.pdf.

[3] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3327, 2017.

[4] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

[5] Nick Cammarata, Gabriel Goh, Shan Carter, Ludwig Schubert, Michael Petrov, and Chris Olah. Curve detectors. Distill, 2020. doi: 10.23915/distill.00024.003. https://distill.pub/2020/circuits/curve-detectors.

[6] B. Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. arXiv, 2023. doi: 10.48550/arXiv.2302.03025.

[7] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

[8] A. Dosovitskiy, L. Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, M. Dehghani, Matthias Minderer, G. Heigold, S. Gelly, Jakob Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2020.

[9] N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.

[10] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.
[11] Nelson Elhage, Robert Lasenby, and Christopher Olah. Privileged bases in the transformer residual stream. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/privileged-basis/index.html.

[12] Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=dNigytemkL.

[13] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

[14] Charles Godfrey, Davis Brown, Tegan Emerson, and Henry Kvinge. On the symmetries of deep learning models and their internal representations. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=8qugS9JqAxD.

[15] Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, and Behnam Neyshabur. REPAIR: REnormalizing permuted activations for interpolation repair. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=gU5sJ6ZggcX.

[16] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[17] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.

[18] Karel Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. Computer Vision and Pattern Recognition, 2014. doi: 10.1007/s11263-018-1098-y.
[19] Xuhong Li, Yves Grandvalet, Rémi Flamary, Nicolas Courty, and Dejing Dou. Representation transfer by optimal transport. arXiv preprint arXiv:2007.06737, 2020.

[20] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E. Hopcroft. Convergent learning: Do different neural networks learn the same representations? In Proceedings of the 1st Workshop on Feature Extraction: Modern Questions and Challenges, FE 2015, volume 44 of JMLR Workshop and Conference Proceedings, pages 196–212. JMLR.org, 2015. URL http://proceedings.mlr.press/v44/li15convergent.html.

[21] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. CVPR, 2022.

[22] Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization, 2018. URL https://arxiv.org/abs/1803.06959.

[23] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

[24] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. An overview of early vision in InceptionV1. Distill, 2020. doi: 10.23915/distill.00024.002. https://distill.pub/2020/circuits/early-vision.

[25] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

[26] Colin Raffel. Building machine learning models like open source software. Commun. ACM, 66(2):38–40, Jan 2023. ISSN 0001-0782. doi: 10.1145/3545111. URL https://doi.org/10.1145/3545111.

[27] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, July 2022.

[28] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. In Deep Learning Workshop, International Conference on Machine Learning (ICML), 2015.

[29] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[30] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015. URL http://arxiv.org/abs/1412.6856.

Appendix A. Additional Experiments for Section 5

Figure 4: Identity stitching for Vision Transformers trained on ImageNet, freezing up to block 1, 3, or 8.

Here, we examine the phenomenon of convergent bases across 1) different model widths and 2) different residual stream structures. One plausible candidate for the converging-basis phenomenon displayed in Figure 3 is the structure of the residual stream. We modify a ResNet-20 so that there is no longer a non-linearity after skip connections. For the modified ResNet-20, the residual stream is now completely 'linear': all layers exclusively perform linear operations on the residual stream (i.e., there are still nonlinearities within residual blocks, but no nonlinear operations on the residual stream). Figure 5 compares the perm-corr and identity stitching between a normal and modified ResNet-20.
The results are largely comparable, suggesting that the convergent basis phenomenon is not caused by the presence of a linear residual stream. Next, we examine the effect of width. Figure 6 measures the perm-corr for 4x-width and 8x-width ResNet-20s, and Figure 7 measures the identity stitching penalty for 4x-width and 8x-width ResNet-20s. We find that width can explain some of the identity stitching success; however, there is little difference between the respective networks for perm-corr in Figure 6.

Figure 5: Measuring basis convergence for a normal ResNet-20 and a modified ResNet-20 when freezing l early layers. The modified ResNet-20 has a fully linear residual stream (i.e., no nonlinearity after the identity connection). (left) and (middle): perm-corr for ResNet-20 models trained on CIFAR-10; (right): identity stitching penalties for ResNet-20 and modified ResNet-20 models on CIFAR-10.

Figure 6: Measuring perm-corr for wide ResNet-20s when freezing l early layers (note different y-axis scales). (left) perm-corr for 4x-width ResNet-20 models trained on CIFAR-10; (right) 8x-width ResNet-20 models trained on CIFAR-10.

Figure 7: Measuring identity stitching for wide ResNet-20s when freezing l early layers. (left) identity stitching penalty for 4x-width ResNet-20 models trained on CIFAR-10; (right) identity stitching penalty for 8x-width ResNet-20 models trained on CIFAR-10.