Paper deep dive
Polysemanticity and Capacity in Neural Networks
Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris
Models: Anthropic autoencoder toy model, Custom 2-layer quadratic activation model, GeLU variants, ReLU variants
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 92%
Last extracted: 3/12/2026, 8:03:33 PM
Summary
This paper investigates the phenomenon of polysemanticity in neural networks—where individual neurons represent multiple unrelated features—through the lens of 'feature capacity'. The authors propose a framework where capacity is defined as the fractional dimension a feature consumes in the embedding space. They demonstrate that optimal capacity allocation leads to monosemantic representation for important features, polysemantic representation for less important ones, and exclusion of unimportant features. The study identifies a block-semi-orthogonal structure in the embedding space and shows that polysemanticity is influenced by input sparsity, kurtosis, and model architecture.
Entities (5)
Relation Signals (3)
Embedding Space → exhibits → Block-semi-orthogonal structure
confidence 95% · We find a block-semi-orthogonal structure, with differing block sizes in different models
Polysemanticity → is influenced by → Input Sparsity
confidence 90% · Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity
Feature Capacity → determines → Polysemanticity
confidence 85% · We propose doing so through the lens of feature capacity... to understand its causes [of polysemanticity].
Cypher Suggestions (2)
Find all factors that influence polysemanticity · confidence 90% · unvalidated
MATCH (f:Entity {name: 'Polysemanticity'})<-[:is_influenced_by]-(factor) RETURN factor.name
Map the relationship between capacity and neural network interpretability · confidence 80% · unvalidated
MATCH (c:Entity {name: 'Feature Capacity'})-[:impacts]->(i:Entity {name: 'Interpretability'}) RETURN c, i
Abstract
Abstract: Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature capacity, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and more prevalent in some architectures than others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with differing block sizes in different models, highlighting the impact of model architecture on the interpretability of its neurons.
Tags
Links
- Source: https://arxiv.org/abs/2210.01892
- Canonical: https://arxiv.org/abs/2210.01892
Full Text
164,446 characters extracted from source content.
Polysemanticity and Capacity in Neural Networks
Adam Scherlis (1), Kshitij Sachan (1), Adam S. Jermyn (2), Joe Benton, Buck Shlegeris (1)
(1) Redwood Research, (2) Flatiron Institute
Abstract
Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature capacity, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and more prevalent in some architectures than others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with differing block sizes in different models, highlighting the impact of model architecture on the interpretability of its neurons.
1 Introduction
Individual neurons in neural networks often represent multiple unrelated features in the input [OMS17, OCS+20]. This phenomenon is known as polysemanticity, and makes it more difficult to interpret neural networks [OCS+20]. While "feature" is a somewhat fuzzy concept [EHO+22b], there are at least some cases where we "know it when we see it". For example, when the input features are independent random variables that do not interact in the data-generating process, neurons that represent combinations of these input features can be confidently called polysemantic. In this work we explore how loss functions incentivize polysemanticity in this setting, and the structure of the learned solutions. Fittingly, there are multiple ways that polysemanticity can manifest.
Here we focus on one form that seems particularly fundamental, namely superposition [EHO+22b]. Suppose we have a linear layer that embeds features which then pass through a layer with a nonlinear activation function. The feature embedding vectors might not be orthogonal, in which case multiple neurons (nonlinear units) are involved in representing each feature. When there are at least as many features as neurons this means that some neurons represent multiple features, and so are polysemantic (Figure 1, right). There are other causes of polysemanticity, e.g. feature embedding vectors could be rotated relative to the neuron basis (Figure 1, left), but we do not study these in this work.
Figure 1: Feature embedding vectors are shown in two dimensions. The neuron basis corresponds to the coordinate axes. Left: rotated embeddings. Right: non-orthogonal embeddings. In both cases the result is polysemanticity because each neuron receives some input when either feature is present.
Here we build on the work of [EHO+22b], who studied polysemanticity in the context of toy models of autoencoders. They found that models can support both monosemantic and polysemantic neurons, that polysemantic neurons can perform certain kinds of computations, and that the embedding vectors of features often formed repeating motifs of a few features symmetrically embedded in a low-dimensional subspace. Moreover, in their models they found distinct "phases" where superposition was either significant or completely absent. Sparser inputs resulted in more superposition. Features with similar importance were more likely to be in superposition. This reflects an abundance of unexpected structure, and gives new handles on the phenomenon of polysemanticity. We study these phenomena through the lens of capacity, or the fraction of an embedding dimension allocated to each feature (Section 2; also termed "dimensionality" by [EHO+22b]).
This ranges from 0 to 1 for each feature, and the total capacity across all features is bounded by the dimension of the embedding space. Because the model has a limited number of neurons and so a limited number of embedding dimensions, there is a trade-off between representing different features. We find that the capacity constraint on individual features (0 to 1) means that many features are either ignored altogether (not embedded) or else allocated a full dimension orthogonal to all the other features in the embedding space, depending on the relative importance of each feature to the loss. Features are represented polysemantically only when the marginal loss reduction of assigning more capacity to each is equal (Figure 2). This neatly explains the sharp "pinning" of features to either 0 or 1 capacity noted by [EHO+22b], and gives us a framework for understanding the circumstances under which features are represented polysemantically.
Figure 2: The marginal loss reduction $-\partial L/\partial C_i$ is shown for several features as a function of feature capacity in our toy model. Circles represent the optimal capacity allocation for a particular total embedding dimension. Colors vary to make individual curves more distinguishable.
To explore capacity allocation in a concrete model, we instantiate our theory for a one-layer model with quadratic activations (Section 3). Our model differs from the Anthropic toy model in that ours uses a different activation function to make the math more tractable, and, more importantly, ours is focused on polysemantic computation rather than data compression. We contrast these toy models in Figure 3.
Figure 3: Comparison between the Anthropic toy model of [EHO+22b] (left) and our toy model (right). Model inputs are at the bottom of the diagram and outputs are at the top.
The key difference is that the Anthropic model studies the compression and recovery of high-dimensional vectors, while ours examines how a smaller number of polysemantic neurons can simulate the computation done by a larger number of monosemantic ones. Figure kindly provided by Chris Olah.
For our toy model we can analytically determine the capacity allocation as a function of feature sparsity and importance (i.e. weight in the loss), and so construct a "phase diagram" (Figure 4). While the details of our phase diagram differ from those of [EHO+22b], reflecting our different toy model, there are three qualitative features that are in good agreement. First, when a feature is much more important than the rest, it is always represented fully with its own embedding dimension. Second, when a feature is much less important than the rest, it is ignored entirely. Finally, in a sparsity-dependent intermediate region features are partially represented, sharing embedding dimensions. In addition, this confirms our theoretical expectation that capacity is allocated according to how much each feature matters to the loss (a mixture of importance and sparsity) and that it is often allocated to fully ignore some features while fully representing others. We supplement this with empirical results for a variety of activation functions showing that the phase diagram predicts the behavior of a broad family of 2-layer models.
Figure 4: Upper: Analytical and empirical phase diagrams for our toy model with 6 features and 3 neurons. In both panels one feature has a different importance from the rest, and colors show the resulting capacity allocation for that feature as a function of sparsity and relative importance. Lower: Plots of marginal loss reduction $-\partial L/\partial C_i$ as a function of feature capacity for each labelled point in the analytical phase diagram.
The blue curve represents the feature with varied importance and the black one represents the feature with constant importance. Black dots are optimal allocations of capacity.
We then turn to study the geometry of the embedding space (Section 4). When embedding matrices fully utilize the available capacity we call them "efficient". We find that every efficient embedding matrix has a block-semi-orthogonal structure, with features partitioned into different blocks. When multiple features in a block are present they interfere with each other, causing spurious correlations in the output and hence greater loss. Features do not, however, interfere across blocks.
Figure 5: Left: An embedding matrix with two blocks. Center: The relationship between features and (principal-component-aligned) neurons for this matrix. Right: Embedding vector geometry for this matrix.
The blocks in efficient matrices correspond to the polytope structure found by [EHO+22b], with small blocks corresponding to features embedded as regular polytopes and large blocks corresponding to less-ordered structures. Large- and small-block arrangements come with different advantages. With large blocks there is significant freedom to allocate capacity across features, whereas with small blocks there is the additional constraint that the capacity of each block be an integer and that the block capacities add up to the total capacity. On the other hand, with small blocks the lengths of embedding vectors can be chosen more freely because blocks can be scaled independently of each other without affecting the capacity allocation. In our quadratic toy model the embedding matrices always have one large block, which is correspondingly less structured. We expect that differences in architecture can lead to different sizes of blocks, which could provide a way to control the extent of polysemanticity in models, alongside other approaches such as changing the activation function [EHO+22a].
2 Capacity and Superposition
2.1 Definitions
Suppose we have a model composed of stacks of linear layers with nonlinear activation functions. In each layer, the model applies a linear transform to the input vector $x$ to produce an embedding vector $e$, and then performs an element-wise non-linear calculation on those embeddings to produce the non-linear activation vector $h$. For instance, we might have

$$e \equiv W \cdot x \qquad (1)$$
$$h \equiv \mathrm{ReLU}(e) \qquad (2)$$

with $W \in \mathbb{R}^{d \times p}$, $x \in \mathbb{R}^p$, $e, h \in \mathbb{R}^d$. We associate each dimension of the input vector $x$ with a feature, and we call each dimension of the non-linear layer a neuron. For simplicity, in the rest of this paper we work with a one-layer model, but our capacity definition should be valid for any layer in a multi-layer model.
When a model represents a feature in the input space, it is convenient to think that it expends some capacity to do so. Our intuition here is that as we ask a model to represent more and more features we eventually exhaust its ability to do so, resulting in features interfering. We study the superposition phenomenon by asking: "How do models allocate limited representational capacity to input features?" In what follows we assume that each input feature is assigned a unique dimension in the input space (e.g. feature $i$ is input dimension $i$); we define capacity below. Let $W_{\cdot,i} \in \mathbb{R}^d$ be the embedding vector for feature $i$.
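The layer in Eqs. (1)-(2) is a one-line computation; a minimal NumPy sketch (the sizes $d = 3$, $p = 6$ and the random draws are illustrative choices of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 6                     # neurons (embedding dim) and features
W = rng.normal(size=(d, p))     # embedding matrix W in R^{d x p}

x = rng.normal(size=p)          # input: one value per feature
e = W @ x                       # embedding vector, Eq. (1)
h = np.maximum(e, 0.0)          # ReLU neuron activations, Eq. (2)
```

Column $W_{\cdot,i}$ of this matrix is the embedding vector of feature $i$ used throughout the rest of the section.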
The capacity allocated to feature $i$ is

$$C_i = \frac{(W_{\cdot,i} \cdot W_{\cdot,i})^2}{\sum_j (W_{\cdot,i} \cdot W_{\cdot,j})^2} \qquad (3)$$

We can think of $C_i$ as "the fraction of a dimension" allocated to feature $i$ [EHO+22b]. (We can also interpret $C_i$ as the squared correlation coefficient between $x_i$ and $(W^T W x)_i$; see Appendix E.) The numerator measures the size of the embedding and the denominator tracks the interference from other features. By this definition, $C_i$ is bounded between 0 and 1. In the case $W_{\cdot,i} = 0$, where this expression is undefined, we set $C_i = 0$. (This is the limit of the expression from almost all directions, so in practice $W_{\cdot,i} \approx 0$ implies $C_i \approx 0$.) We define the total model capacity to be $C = \sum_i C_i$ (in a multi-layer model, this would be a single layer-pair's capacity). This is bounded between 1 and the embedding dimension $D$ (see Appendix F for a proof of the upper bound).
Figure 6: Example capacity allocations for different embeddings.
Note that a set of capacities does not uniquely specify a weight matrix. For example, capacity is invariant to the overall scaling and rotation of $W$. In what follows it will be useful to have a full parameterization of $W$ that includes $C_i$, so we define $S$ to be a set of additional parameters that uniquely specify a weight matrix $W$ given its capacities $C_1, \ldots, C_N$. We can then parameterize the loss using $(C_1, \ldots, C_N, S)$ rather than $W$.
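Eq. (3) is easy to compute from the Gram matrix of the embedding columns. A minimal sketch (the function name and shape convention are our own):

```python
import numpy as np

def capacities(W):
    """Per-feature capacity C_i from Eq. (3).

    W has shape (d, p); column W[:, i] is the embedding vector of feature i.
    """
    G = W.T @ W                       # Gram matrix: G[i, j] = W_i . W_j
    num = np.diag(G) ** 2             # (W_i . W_i)^2
    den = (G ** 2).sum(axis=1)        # sum_j (W_i . W_j)^2
    # Convention from the text: C_i = 0 when the embedding vector is zero.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Orthonormal embeddings: every feature gets a full dimension.
C = capacities(np.eye(3))             # each capacity equals 1
```

Each $C_i$ lies in $[0, 1]$ (the numerator is one term of the denominator sum), and for a $d$-dimensional embedding the capacities sum to at most $d$, matching the bound cited from Appendix F.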
2.2 Loss Minimization
We are interested in how loss minimization allocates capacity among different features. Because the capacity of each feature lies in $[0, 1]$ and there is also a constraint on the total capacity of a model, this is a constrained optimization problem:

$$\min_{C_{1:n},\, S} \; L(C_{1:n}, S) \quad \text{s.t.} \quad 0 \le C_i \le 1, \quad 1 \le \sum_i C_i \le D$$

In brief, we minimize over $S$ first to find the $S^*(C_i)$ minimizing the loss. (There are multiple ways to define $S$, which give different $\partial L/\partial C_i$ holding $S$ constant; minimizing over $S$ removes this ambiguity.) We then minimize over $C_i$. Intuitively, capacity should always be a good thing, so we are interested in situations where the loss function $L$ is monotonically decreasing in $C_i$. In this case, the bound on the total capacity $\sum_i C_i \le D$ will be saturated, and we can replace this constraint with a Lagrange multiplier:

$$\min_{C_{1:n},\, \lambda} \; L(C_{1:n}) + \lambda \left( \sum_i C_i - D \right) \qquad (4)$$

s.t.
$$0 \le C_i \le 1$$

Because we have independent constraints on each $C_i$, for each $i$ the minimum occurs either at a boundary ($C_i = 0$ or $1$) or in between, at a point where

$$\frac{\partial}{\partial C_i} \left( L(C_{1:n}) + \lambda \left( \sum_i C_i - D \right) \right) = 0 \qquad (5)$$

The derivative in the three cases will obey

$$\frac{\partial L(C_{1:n})}{\partial C_i} \ge -\lambda \quad (C_i = 0) \qquad (6)$$
$$\frac{\partial L(C_{1:n})}{\partial C_i} = -\lambda \quad (0 < C_i < 1) \qquad (7)$$
$$\frac{\partial L(C_{1:n})}{\partial C_i} \le -\lambda \quad (C_i = 1) \qquad (8)$$

In other words, at the minimum of the loss, capacity is allocated to each feature until the marginal value (decrease in loss) from additional capacity is at a constant threshold – unless the feature is so important that the marginal value is above threshold even when fully represented, or so unimportant that the marginal value is below threshold even when the feature is ignored. Note that none of this derivation depended on the precise definition of capacity. We only had to assume that (1) allocating more capacity to a feature never makes the loss greater, and (2) all capacity allocations satisfying the feature-level and total constraints are realizable by some embedding matrix. Thus we could have used other definitions, and indeed in Appendix G we explore one such alternative.
We can visualize this in terms of the graph of $\partial L/\partial C_i$ as a function of $C_i$, holding $C_j$ constant for $j \ne i$. For instance, in Figure 2 there are diminishing returns: $\partial^2 L/\partial C_i^2 > 0$, so that each additional unit of capacity makes the next one less useful. If $\partial L/\partial C_i$ crosses $-\lambda$, the value of $C_i$ where it crosses will be the optimal allocation. Features only get 0 or 1 capacity if the entire $\partial L/\partial C_i$ curve is above or below $-\lambda$ (respectively). By contrast, in the case of increasing returns ($\partial^2 L/\partial C_i^2 < 0$), every feature will have $C_i = 0$ or $C_i = 1$ and superposition will not occur at all. As we show below for several different toy models, diminishing returns occur for inputs of high sparsity or kurtosis, and accelerating returns for inputs of low sparsity or kurtosis.
3 Quadratic model
We now instantiate our theory for a two-layer model with quadratic activations. This choice of activation function allows us to study the model analytically in some detail. We define our model in Section 3.1. In Section 3.2 we analyze the loss function for the case of independent random input features. We are able to write the expected loss entirely in terms of the embedding lengths and the variance and kurtosis of feature strengths. In Section 3.3 we demonstrate a surprising feature of this quadratic model: the loss function can be written in terms of the feature capacities and the embedding lengths. This allows us to analytically explore the marginal benefit of allocating more capacity to each feature.
We then go on to solve for the optimal allocation of capacity across features (Section 3.4) and relate this to our results from Section 2.
3.1 The model
We study a regression task, with ground-truth data $x, y(x)$ and modeled data $x, \tilde{y}(x)$. The ground-truth data is generated by a random vector of IID variables, $x$, and a set of "importance" coefficients $v_i$:

$$y = \sum_i v_i x_i^2. \qquad (9)$$

We think of $v_i$ as the importance of feature $i$, and highlight that $y$ here is a scalar. The model $\tilde{y}(x)$ is parameterized as

$$\tilde{y} = \sum_i (W_{i,\cdot} \cdot x)^2 + b \qquad (10)$$

where $b$ is a bias and $W$ is a rectangular embedding matrix. Like $y$, $\tilde{y}$ is a scalar. Note that the ground-truth and modeled data are both of the form $x^T A x$ for a matrix $A$. For the ground truth, this is a full-rank diagonal matrix; for our model it is limited in rank by the shape of $W$. (In general, we could consider a ground truth with an arbitrary full-rank $A$ instead of a diagonal matrix. Forcing $A$ to be diagonal is then equivalent to choosing to always write $x$ in the principal-component (SVD) basis of $A$. This gives a slightly stronger motivation for our choice to interpret the components of $x$ as separate "features": we are really choosing features based on the principal axes of $A$.) Therefore, perfect loss is not possible in general. Our loss function is squared error

$$L(x) = (\tilde{y}(x) - y(x))^2. \qquad (11)$$
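Eqs. (9)-(11) transcribe directly into code; a sketch with illustrative sizes and random draws of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 3                          # features, neurons
v = rng.uniform(0.5, 1.5, size=p)    # importance coefficients v_i
W = rng.normal(size=(d, p))          # rectangular embedding matrix
b = 0.0                              # bias

def y_true(x):
    # Ground truth, Eq. (9): scalar sum of importance-weighted squares.
    return (v * x**2).sum()

def y_model(x):
    # Model, Eq. (10): sum the squared pre-activations, plus a bias.
    return ((W @ x) ** 2).sum() + b

def loss(x):
    # Squared error, Eq. (11).
    return (y_model(x) - y_true(x)) ** 2
```

Note that both outputs are scalars: the model sums over its $d$ quadratic neurons before the loss is computed, which is the "computation rather than compression" distinction drawn in Figure 3.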
3.2 Interpreting the loss function
We now assume that $\mathbb{E}[x_i] = 0$ and $\mathrm{Cov}[x_i, x_j] = \delta_{ij}$, but do not impose any additional restrictions on the distribution of the $x_i$. With this, we find (Appendix A)

$$\mathbb{E}[L] = (\mathbb{E}[x_i^4] - 1) \sum_i \left( \|W_{\cdot,i}\|^2 - v_i \right)^2 + 2 \sum_{i \ne j} (W_{\cdot,i} \cdot W_{\cdot,j})^2 \qquad (12)$$

The loss is a weighted sum of two components. The first term captures true correlations in the data, because $v_i \propto \mathrm{Cov}[y, x_i^2]$, while the second represents "hallucinated correlations" which are not present in the data ($W_{\cdot,i} \cdot W_{\cdot,j} \propto \mathrm{Cov}[\tilde{y}, x_i x_j]$) but which the model produces anyway (Appendix B). When the fourth moment is large (for example due to sparsity; see Appendix D) the hallucinated term becomes less important and we get polysemanticity/superposition. More generally, when a model is incapable of representing the precise correlations in the data there is often a tradeoff, where the model can improve its representation of the true correlations at the cost of incorporating hallucinated ones. This will often be beneficial to the overall loss, but can end up spreading the task of representing a single feature across multiple neurons. If the model has a limited number of neurons this can result in polysemanticity.
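Because the expectation in Eq. (12) is over the input distribution alone, it can be checked exactly on a small discrete distribution, where the expectation is a finite sum. The sketch below uses a symmetric three-point distribution (unit variance, fourth moment $k = 1/q$) and sets the bias to $b = -\sum_i (n_i - v_i)$, which cancels the mean error term; the sizes, distribution, and bias choice are our own illustrative assumptions, not taken from Appendix A:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p, d = 4, 2
v = rng.uniform(0.5, 1.5, size=p)       # importances
W = rng.normal(size=(d, p))             # embedding matrix
G = W.T @ W
n = np.diag(G)                          # squared embedding lengths ||W_i||^2
b = -(n - v).sum()                      # bias cancelling the mean error

# Sparse three-point features: x_i in {-a, 0, +a}, unit variance, E[x^4] = 1/q.
q = 0.25
a = 1.0 / np.sqrt(q)
vals = np.array([-a, 0.0, a])
probs = np.array([q / 2, 1.0 - q, q / 2])
k = 1.0 / q                             # fourth moment

def loss(x):
    y = (v * x**2).sum()                # Eq. (9)
    y_hat = ((W @ x) ** 2).sum() + b    # Eq. (10)
    return (y_hat - y) ** 2             # Eq. (11)

# Exact expected loss: enumerate all 3^p inputs with their probabilities.
EL = sum(
    np.prod(probs[list(c)]) * loss(vals[list(c)])
    for c in itertools.product(range(3), repeat=p)
)

# Closed form, Eq. (12).
off_diag = (G**2).sum() - (n**2).sum()
EL_formula = (k - 1) * ((n - v) ** 2).sum() + 2 * off_diag
```

With this bias choice the two quantities agree to floating-point precision, which is a useful sanity check on the decomposition into "true" and "hallucinated" terms.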
3.3 Capacity
We would like to write the expected loss in terms of the capacity of each feature. Recalling the definition of capacity, the expected loss can be written as

$$\mathcal{L}(\vec{C}, \vec{n}) = (k - 1) \sum_i (n_i - v_i)^2 - 2 \sum_i n_i^2 + 2 \sum_i \frac{n_i^2}{C_i}, \qquad (13)$$

where $\mathcal{L} \equiv \mathbb{E}[L]$ is the expected loss, $k \equiv \mathbb{E}[x_i^4]$ is the kurtosis, and $n_i \equiv \|W_{\cdot,i}\|^2$ is the squared embedding length. This form is surprising because it says that the loss is composed of (1) a term that depends only on the embedding lengths and (2) a sum of terms that each depend on the capacity of just one feature. In particular, what is surprising is that the effects of interference between features are entirely captured by the feature capacity, so the second term can be interpreted as the loss due to interference. Perhaps more striking, the partial derivative of the loss with respect to capacity is just

$$\frac{\partial \mathcal{L}}{\partial C_i} = -\frac{2 n_i^2}{C_i^2}. \qquad (14)$$

That is, the marginal benefit of capacity for feature $i$ depends only on the embedding length and capacity of that feature, and not at all on any other aspect of the model. In what follows we make extensive use of this property to find the optimal capacity allocation.
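The only $C_i$-dependent piece of Eq. (13) is the interference term $2 n_i^2 / C_i$, so Eq. (14) follows by direct differentiation; a quick finite-difference check (the numeric values are arbitrary):

```python
n_i = 0.8                        # squared embedding length (arbitrary value)

def interference(C):
    # The single-feature interference term from Eq. (13).
    return 2.0 * n_i**2 / C

C, eps = 0.5, 1e-6
# Central finite difference of the interference term at C.
fd = (interference(C + eps) - interference(C - eps)) / (2 * eps)
analytic = -2.0 * n_i**2 / C**2  # Eq. (14)
```

The two derivatives agree to within the truncation error of the finite difference, confirming that each feature's marginal benefit is independent of every other feature.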
3.4 Optimal capacity allocation
We now minimize the loss to find (Appendix C):

$$C_i = \max\left( \min\left( \frac{k-1}{k-3} \frac{v_i}{\lambda} - \frac{2}{k-3},\; 1 \right),\; 0 \right) \qquad (15)$$

$$n_i = \begin{cases} 0 & C_i = 0 \\ \frac{k-1}{k-3} v_i - \frac{2\lambda}{k-3} & 0 < C_i < 1 \\ v_i & C_i = 1 \end{cases} \qquad (16)$$

Here $\lambda$ is a Lagrange multiplier determined by the condition that $\sum_i C_i = D$. In Appendix C we also verify that this solution is indeed feasible (i.e. there is an embedding matrix producing the above $C_i$ and $n_i$). For kurtosis $k > 3$, features will be ignored if

$$\frac{k-1}{2} v_i < \lambda \qquad (17)$$

and fully represented if

$$v_i > \lambda \qquad (18)$$

If $k \le 3$, features will always either be ignored or fully represented because the loss is concave-down (or flat) in $C_i$. Hence the $D$ features with highest $v_i$ will be represented.
3.5 Phase diagrams
We now calculate phase diagrams for capacity allocation.
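For $k > 3$, the total capacity implied by Eq. (15) is monotonically decreasing in $\lambda$, so $\lambda$ can be found numerically by bisection; a sketch (the function name and example values are our own):

```python
import numpy as np

def allocate(v, k, D, iters=200):
    """Optimal capacities via Eq. (15), with the Lagrange multiplier
    lambda chosen so that sum_i C_i = D. Assumes k > 3."""
    v = np.asarray(v, dtype=float)

    def caps(lam):
        return np.clip((k - 1) / (k - 3) * v / lam - 2 / (k - 3), 0.0, 1.0)

    lo, hi = 1e-9, 1e9            # total capacity is decreasing in lambda
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if caps(mid).sum() > D:
            lo = mid              # too much capacity: raise lambda
        else:
            hi = mid
    return caps(0.5 * (lo + hi))

# Six equally important features in three dimensions: each gets D/N = 1/2.
C = allocate(np.ones(6), k=6.0, D=3)
```

Making one feature much more important (e.g. $v_1 = 10$) pins it to $C_1 = 1$ while the remaining features share the leftover capacity equally, matching the thresholds in Eqs. (17)-(18).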
3.5.1 Phase diagram boundaries
First, we solve for $\lambda$ by inserting the set of $v_i$ into equation (15):

$$C_i = \max\left( \min\left( \frac{k-1}{k-3} \frac{v_i}{\lambda} - \frac{2}{k-3},\; 1 \right),\; 0 \right).$$

We then sum over $i$ and set this equal to $D$, allowing us to solve for $\lambda$. Features will be polysemantic if (equations 17 and 18)

$$v_i < \lambda < \frac{k-1}{2} v_i$$

or equivalently

$$\frac{2}{k-1} \lambda < v_i < \lambda$$

Note that for $k < 3$ this is impossible, so features will never be partially represented (polysemantic).
3.5.2 Most-importances-equal case
We now consider the case with $N$ features, $D < N$ dimensions, and

$$v_1 = V \qquad (19)$$
$$v_i = 1, \quad i \ne 1 \qquad (20)$$

When all features are partially represented we can use the constraint $\sum_i C_i = D$ to solve for $\lambda$ and find

$$\lambda = \frac{(k-1)(N-1+V)}{(k-3)D + 2N}.$$

Feature #1 therefore has capacity

$$C_1 = \frac{(k-3)D + 2N}{(k-3)(N-1+V)} V - \frac{2}{k-3}.$$

As a sanity check, this is $D/N$ when $V = 1$.
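A quick numeric check of the all-partially-represented formulas above (the parameter values are illustrative choices of ours, picked so every feature falls in the polysemantic window; $k > 3$):

```python
N, D, k, V = 6, 3, 6.0, 1.2
v = [V] + [1.0] * (N - 1)       # Eqs. (19)-(20)

# Lambda in the regime where all features are partially represented.
lam = (k - 1) * (N - 1 + V) / ((k - 3) * D + 2 * N)

# Eq. (15) with this lambda; in the all-partial regime no clipping occurs.
C = [(k - 1) / (k - 3) * vi / lam - 2 / (k - 3) for vi in v]

# Closed form for feature 1's capacity.
C1 = ((k - 3) * D + 2 * N) / ((k - 3) * (N - 1 + V)) * V - 2 / (k - 3)
```

The capacities sum to $D$ and the closed form for $C_1$ agrees with the generic Eq. (15), as does the $V = 1 \Rightarrow C_1 = D/N$ sanity check from the text.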
Feature #1 becomes fully represented ($C_1 = 1$) when

$$V \ge \frac{(k-1)(N-1)}{(k-3)(D-1) + 2(N-1)}$$

When $k$ is large, this approaches $(N-1)/(D-1)$. When $V \ge (N-1)/(D-1)$, feature #1 is fully represented regardless of $k$. Feature #1 is ignored ($C_1 = 0$) when

$$V \le \frac{2(N-1)}{(k-3)D + 2(N-1)}$$

When $k$ is large, this goes to zero like $1/k$. For $k = 3$, both bounds are at $V = 1$ and the feature instantly changes from being ignored to fully represented when $V$ crosses 1.
3.6 Phase Diagram Intuition
Figure 4 shows the analytic phase diagram and the corresponding empirical results. The lower panel shows the marginal loss benefit ($-\partial L/\partial C_i$) as a function of feature capacity for each labelled point in the analytical phase diagram. We first focus on relative feature importance. In our model an increase in relative feature importance corresponds to shifting $\partial L/\partial C_i$ up by a constant factor. Going from point D to point F, the blue feature increases in relative importance, and the corresponding marginal loss curve is shifted up. At point D the blue feature is too unimportant to be represented, at point E the blue feature is similar enough in importance to be allocated a fractional dimension with equal marginal loss benefit to the black reference feature, and at point F the blue feature is so much more important than the black feature that it consumes all available capacity. We next turn to sparsity (kurtosis). As sparsity increases, the marginal loss curves go from convex (A) to flat (C), and eventually to concave (E).
Thus, there are diminishing returns to representing a feature when it is sparse and increasing returns when it is dense. When features are dense, the curves are concave, so the feature with larger marginal loss benefit is purely represented and the other feature is ignored. This explains the jump from ignored to monosemantic at the top of the phase diagram between points A and B. When features are sparse, we see a smoother transition from ignored to polysemantic to monosemantic representations. At the critical point (labelled C in the figure), the marginal loss benefit curves are flat and equal in scale, so any allocation of capacity that makes full use of the available embedding dimensions is optimal. Thus, any given feature could be ignored, represented monosemantically, or represented polysemantically. The shapes of the $\partial L/\partial C_i$ curves are specific to our toy model, but it is generally true that the boundaries between phases are determined by the marginal loss benefit of capacity. Note that this is true for any notion of capacity satisfying the properties discussed in Section 2.2. Finally, we note that our phase diagram bears a striking resemblance to that of a physical system with a second-order phase transition: there is a line of discontinuous "first-order" change (above the critical point), giving way to a continuous transition at higher sparsity. The resemblance is stronger if we think of $\partial L/\partial C_i$ as our "order parameter" instead of $C_i$ itself: $\partial L/\partial C_i$ is constant along one side of the discontinuity, and changes as a function of sparsity along the other side, with the discontinuity going to zero at the critical point. $\partial L/\partial C_i$ is continuous everywhere else.
That said, there are some notable differences from typical second-order transitions; most importantly, $\partial L/\partial C_i$ is not a smooth analytic function of importance and sparsity, but rather a piecewise-analytic function whose derivatives are discontinuous at the edges of the superposition phase. It is also unclear whether there is any meaningful analog of critical-point phenomena such as critical exponents.

3.7 Other models and nonlinearities

We now compare the numerical phase diagram for our toy model to that of Anthropic's autoencoder model [EHO+22b]. The key difference between these models is that we sum the output of the nonlinear layer before computing the loss, whereas in Anthropic's model the loss is computed by summing the squared differences on the output layer. This means that theirs is tasked with data compression (i.e. recovering an encoded vector) while ours is tasked with a nonlinear computation. We compare these models across three nonlinearities: quadratic, ReLU, and GeLU. In our model, when we vary the nonlinearity we do so both in the ground truth and in the activations. We further consider two cases: one with six features embedded in three dimensions ("6 in 3") and one with six features embedded in five dimensions ("6 in 5"). The top row of Figure 7 shows the results for our toy model. The same general shape appears in all cases, as predicted from our theoretical results: a sharp 0-1 transition at mild sparsity (low kurtosis), and a smoother transition with a region of superposition at more extreme sparsity (higher kurtosis). The superposition phase of our model also seems to be smoother, with capacity changing steadily from 0 to 1 (as predicted by our analytic results). In contrast, the Anthropic model has a large region where the capacity used by a feature is "stuck" close to 1/2.
This presumably corresponds to an antipodal embedding geometry; or, in the terminology of Section 4, a separate block of the embedding matrix containing two features in a one-dimensional subspace. If the neuron basis is rotated appropriately, this becomes a single neuron that represents two features, with neither the neuron nor the features overlapping with anything else. By contrast, the superposed features in our model unavoidably project onto many neurons regardless of how the neuron basis is rotated. Interestingly, some aspects of the phase diagram seem to depend on the nonlinearity and/or model. Our model has a narrower superposed region than Anthropic's model. For the "6 in 3" case, with significantly more features than neurons, Anthropic's model shows features in superposition even when their importance is orders of magnitude larger than any other feature, which does not happen for our model (or for Anthropic's when there is only one more feature than neurons).

Figure 7: Empirical phase diagrams are shown for our toy model, Anthropic's toy model with 6 features in 3 dimensions, and Anthropic's toy model with 6 features in 5 dimensions. For each model we study three different nonlinearities. Our toy model looks similar for 6 features in 3 dimensions and 6 features in 5 dimensions for all nonlinearities, so we only display plots of 6 features in 3 dimensions. The contour at $c = 0.5$ has been replaced with two contours at $c = 0.47, 0.53$ to give more detail in the large $c \approx 1/2$ regions, which correspond to the antipode polytope from [EHO+22b].

4 Geometry of efficient matrices

In this section we study the geometry of efficient embedding matrices, namely those which saturate the total capacity bound $\sum_i C_i \leq D$, with the capacity defined in equation (3).
In particular, we exhaustively enumerate the forms these matrices can take and comment on the implications for the optimal capacity allocation in models.

4.1 Diagonal (monosemantic)

The simplest way to make $W$ efficient is to put $D$ of the vectors orthogonal to each other with arbitrary lengths and set the rest to zero. This makes $W$ a diagonal matrix padded with zeros. In this case, the capacities $C_i$ are either 0 or 1. The squared norms $n_i$ are zero (if $C_i = 0$) or nonzero but arbitrary (if $C_i = 1$).

4.2 Semiorthogonal ("everything bagel")

At the other extreme, we can choose $W = \sqrt{\lambda}\, R$, where $R$ is semiorthogonal, meaning that $R R^T = I$. (This means the rows of $W$ are orthogonal and of equal norm, but the columns aren't.) In this case, we have $W W^T = \lambda I$, so

$$(W^T W)^2 = \lambda\, W^T W \qquad (21)$$

and the capacity simplifies to

$$C_i = \frac{1}{\lambda} [W^T W]_{ii} = \frac{n_i}{\lambda} \qquad (22)$$

so the capacities and norms of vectors are directly related. We now confirm that this is efficient:

$$\sum_i C_i = \frac{1}{\lambda} \mathrm{Tr}(W^T W) = \frac{1}{\lambda} \mathrm{Tr}(W W^T) = \frac{1}{\lambda}\, \lambda D = D \qquad (23)$$

Theorem 1. Such a matrix exists for any $\vec{C}$ that sums to $D$, so this gives us a way of dividing up capacity efficiently however we like. The proof is shown in Appendix H.
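As a quick numerical illustration (a sketch of our own, not from the paper), one can build a semiorthogonal $R$ by QR-decomposing a random matrix, set $W = \sqrt{\lambda} R$, and check that the capacities, computed as $C_i = \|w_i\|^4 / \sum_j (w_i \cdot w_j)^2$ (the form that appears in equation (69)), satisfy $C_i = n_i/\lambda$ and sum to $D$. The dimensions and $\lambda$ are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, lam = 3, 6, 2.0

# Orthonormal rows via QR: Q is N x D with orthonormal columns, so R = Q.T
# is D x N with R @ R.T = I_D.
Q, _ = np.linalg.qr(rng.standard_normal((N, D)))
R = Q.T
W = np.sqrt(lam) * R

def capacity(W):
    """C_i = ||w_i||^4 / sum_j (w_i . w_j)^2 for the columns w_i of W."""
    G = W.T @ W  # Gram matrix of the embedding vectors
    return np.diag(G) ** 2 / (G ** 2).sum(axis=1)

C = capacity(W)
n = (W ** 2).sum(axis=0)  # squared norms n_i
```

Here `C` matches `n / lam` (equation 22) and sums to `D` (equation 23).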
The singular values of a semiorthogonal matrix are all equal (and vice versa), so the condition of semiorthogonality is equivalent to requiring the embedding vectors to be distributed in all directions in a balanced, isotropic way. (Footnote 5: The exact sense of "balanced" is a little tricky: the vectors don't have to be centered around the origin! Instead, think about a dataset whose principal components all have the same variance.) This gives some geometric intuition for what the angles between vectors are doing: shorter vectors, for features with less capacity, will be spaced closer together in order to keep the overall arrangement balanced.

4.3 Mixed (tegum product)

We can combine these approaches by splitting our embedding space into orthogonal subspaces of dimension $D_k$ and using a semiorthogonal matrix to embed $N_k$ of our vectors into the $k$th subspace. To do this, make $W$ block-diagonal with blocks $\sqrt{\lambda_k}\, R_k$, where $R_k$ is a $D_k \times N_k$ semiorthogonal matrix. Here $\sum_k D_k = D$ and $\sum_k N_k = N$. For an example with two blocks, see Figure 5. Within each block, the dimensions of a subspace are divided up as before:

$$n_i = \lambda_k C_i \quad \text{if } i \in \text{block } k \qquad (24)$$
$$\sum_{i \in \text{block } k} C_i = D_k \qquad (25)$$

Note that the independent scalars $\lambda_k$ let us scale different subspaces arbitrarily. (This is the only thing that makes this more general than the previous case.)

4.4 General

So far we have chosen subspaces aligned with the standard basis on $\mathbb{R}^D$ (i.e. the neurons), but we can rotate our set of embedding vectors however we like without changing $\vec{C}$ or $\vec{n}$. Similarly, we have assigned dimensions of $\mathbb{R}^N$ (i.e. features) to subspaces sequentially, but we can do it in any order. In other words: a sufficient condition for efficiency is that $W$ is a $D \times N$ matrix of the form $QBP$, where

• $Q$ is a $D \times D$ orthogonal matrix,
• $P$ is an $N \times N$ permutation matrix, and
• $B$ is a $D \times N$ block-diagonal matrix whose (rectangular) blocks are of the form $\sqrt{\gamma_k}\, R_k$, where $\gamma_k$ is a positive scalar and $R_k$ is a semiorthogonal matrix.

In terms of the SVD $W = QSR$, an equivalent condition is that rows of $R$ corresponding to unequal singular values $S_{ii}$ are never nonzero in the same column.

Theorem 2. This is also a necessary condition; all efficient matrices are of this form. The proof is shown in Appendix H.

4.5 Interpolating between regimes

Consider some block of an efficient matrix (factor of a tegum product), and some vector $\vec{w}_i$ in that block. If the squared norm $n_i$ of that vector is increased, its capacity $C_i$ increases in order to keep $C \propto n$ within the block. At the same time, the vector becomes more orthogonal to the rest of the block. If the capacity reaches $C_i = 1$, the vector becomes fully orthogonal and separates into a new block. Once that happens, $n_i$ can be varied arbitrarily without affecting $C_i$. On the other hand, if the capacity decreases to $C_i = 0$, then $n_i = 0$ and the vector becomes zero; at that point the feature is ignored.
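The $QBP$ form can be demonstrated numerically. Below is a minimal sketch of our own (the block sizes and scales are assumptions for the demo): assemble $B$ from two independently scaled semiorthogonal blocks, apply a random rotation $Q$ and permutation $P$, and check that the result is efficient ($\sum_i C_i = D$) and that the ratio $n_i/C_i$ recovers the block scale $\gamma_k$, per equation (24).

```python
import numpy as np

rng = np.random.default_rng(1)

def semiorth(d, n):
    """A d x n matrix with orthonormal rows (d <= n)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return Q.T

(D1, N1), (D2, N2) = (2, 4), (1, 2)  # block dimensions (assumed for the demo)
g1, g2 = 1.5, 0.5                    # independent block scales gamma_k

B = np.zeros((D1 + D2, N1 + N2))
B[:D1, :N1] = np.sqrt(g1) * semiorth(D1, N1)
B[D1:, N1:] = np.sqrt(g2) * semiorth(D2, N2)

Qrot, _ = np.linalg.qr(rng.standard_normal((D1 + D2, D1 + D2)))  # rotation Q
P = np.eye(N1 + N2)[rng.permutation(N1 + N2)]                    # permutation P
W = Qrot @ B @ P

def capacity(W):
    """C_i = ||w_i||^4 / sum_j (w_i . w_j)^2 for the columns w_i of W."""
    G = W.T @ W
    return np.diag(G) ** 2 / (G ** 2).sum(axis=1)

C = capacity(W)
n = (W ** 2).sum(axis=0)  # squared norms; equals gamma_k * C_i within block k
```

Note that the rotation and permutation leave the capacities and norms unchanged, as stated above.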
4.6 Application to toy models

Empirically, Anthropic found that (when importances are equal) their embedding matrices often factor into orthogonal subspaces containing simple geometric structures [EHO+22b]. This is equivalent to the statement that their embedding matrices have a large number of small blocks. (This also explains some features of the phase diagram for their model, such as a large region near $c = 1/2$; see above.) In contrast, the embedding matrices in our model have one large superposed block (except for ignored or fully-represented features). We can understand this phenomenon by thinking about the constraints on $\vec{C}$ and $\vec{n}$ in the large-block and small-block regimes. When there is one large block, capacity can be divided up arbitrarily; however, the lengths of embedding vectors are determined entirely by the capacities (up to an overall scalar). For our model, this is compatible with good loss, because minimizing loss gives the relationship $C_i \propto n_i$ anyway (see above). However, for Anthropic's model, this may be more constraining of embedding vector norms than the loss function "wants". In contrast, a collection of small blocks imposes constraints on capacity allocation, because each block must separately add up to the correct capacity. Once capacity is allocated, however, the lengths of vectors can be chosen much more freely by scaling blocks independently of each other. This doesn't provide any benefit for our toy model, but seems to be useful to Anthropic's. This qualitative difference matters because it affects the interpretability of the network: when there are many small blocks, it is possible to choose an orthogonal basis of neurons such that each neuron is only polysemantic between a small number of features. On the other hand, large blocks imply that neurons will be polysemantic between many features at once.
5 Conclusions

We have studied polysemanticity through the lens of feature capacity, or the fraction of an embedding dimension allocated to each feature. Treating capacity allocation as a constrained optimization problem, we find that many features are either ignored (not embedded) or else allocated a full embedding dimension orthogonal to all other features, depending on their relative importance to the loss. Features are represented polysemantically only when the marginal loss reduction of assigning more capacity to each is equal and significant (Figure 2). This neatly explains the sharp "pinning" of features to either 0 or 1 capacity noted by [EHO+22b]. To explore capacity allocation in a concrete model, we investigated a 2-layer model with quadratic activations (Section 3) and constructed a phase diagram of capacity allocation. We found good qualitative agreement with the phase diagram of [EHO+22b], and confirmed our theoretical predictions regarding optimal capacity allocation. Finally, we studied the geometry of the embedding space (Section 4), and characterized the embedding matrices which make full use of the available dimensions. We found that efficient embedding matrices are block-semiorthogonal. These blocks correspond to the polytope structure found by [EHO+22b], with small blocks corresponding to features embedded as regular polytopes and large blocks corresponding to less-ordered structures. Large- and small-block arrangements come with different advantages. With large blocks there is significant freedom to allocate capacity across features, whereas with small blocks there is the additional constraint that the capacity of each block be an integer and that the block capacities add up to the total capacity. On the other hand, with small blocks the lengths of embedding vectors can be chosen more freely, because blocks can be scaled independently of each other without affecting the capacity allocation.
The embedding matrices in our quadratic toy model always have one large block, while [EHO+22b] found a range of block sizes depending on the properties of the training distribution. This suggests that differences in architecture can lead to different sizes of blocks, which could provide a way to control the extent of polysemanticity in models, alongside other approaches such as changing the activation function [EHO+22a].

Author Contributions

AS derived the geometry of efficient matrices, proved the upper bound on the model capacity, and proposed an alternate definition of capacity. KS performed the numerical calculations in this work. ASJ, KS, and BS developed the capacity allocation and constrained optimization framework. AS, KS, and BS developed the quadratic model, derived the expected loss, and related this to the input kurtosis and sparsity. AS, KS, ASJ, and BS derived the analytic optima for the quadratic model. JB proved the existence of feasible matrices for optimal capacity allocations and proved that single-block semiorthogonal matrices can be found for any capacity allocation. JB, KS, ASJ, and BS provided feedback on drafts and the organization of this manuscript. AS, ASJ, and KS contributed to writing this manuscript.

Acknowledgments

The Flatiron Institute is supported by the Simons Foundation. We are grateful to Jacob Steinhardt, David Lindner, Oliver Balfour, Jeff Wu, Ryan Greenblatt, and Martin Wattenberg for helpful comments on this manuscript, as well as to Chris Olah and Nicholas Schiefer for discussions about this work. We thank Chris Olah for producing Figure 3.

Appendix A Quadratic Model Expected Loss

Our loss function is squared error

$$L(x) = (\tilde{y}(x) - y(x))^2.$$
(26)

We assume that the $x_i$ are IID with $\mathbb{E}[x_i] = 0$ and $\mathrm{Var}[x_i] = 1$. With this, we can write the expected loss as

$$\mathbb{E}[L] = \mathbb{E}[(x^T D x + b)^2] \qquad (27)$$
$$= \mathbb{E}\Big[\Big(\sum_{i,j} D_{ij} x_i x_j + b\Big)^2\Big] \qquad (28)$$
$$= \sum_{i,j,k,l} D_{ij} D_{kl}\, \mathbb{E}[x_i x_j x_k x_l] + 2b \sum_{i,j} D_{ij}\, \mathbb{E}[x_i x_j] + b^2, \qquad (29)$$

where $D = W^T W - v$ and $v$ is the diagonal matrix formed from the $v_i$. We can split the first summation term into several summations by considering all partitions of the four indices (4, 3|1, 2|2, 2|1|1, 1|1|1|1). For example, the 2|2 cases are $\sum_{(i=j)\neq(k=l)}$, $\sum_{(i=k)\neq(j=l)}$, and $\sum_{(i=l)\neq(k=j)}$.
Because the $x$'s are independent with mean 0 and variance 1, the only partition terms that are non-zero are 4 and 2|2, so we find

$$\mathbb{E}[L] = \mathbb{E}[x_i^4] \sum_i D_{ii}^2 + \sum_{i \neq j} D_{ii} D_{jj} + 2 \sum_{i \neq j} D_{ij}^2 + 2b \sum_i D_{ii} + b^2 \qquad (30)$$
$$= (\mathbb{E}[x_i^4] - 1) \sum_i D_{ii}^2 + \sum_{i,j} D_{ii} D_{jj} + 2 \sum_{i \neq j} D_{ij}^2 + 2b \sum_i D_{ii} + b^2 \qquad (31)$$
$$= (\mathbb{E}[x_i^4] - 1) \sum_i D_{ii}^2 + \Big(\sum_i D_{ii}\Big)^2 + 2 \sum_{i \neq j} D_{ij}^2 + 2b \sum_i D_{ii} + b^2 \qquad (32)$$
$$= (\mathbb{E}[x_i^4] - 1) \sum_i D_{ii}^2 + \Big(b + \sum_i D_{ii}\Big)^2 + 2 \sum_{i \neq j} D_{ij}^2. \qquad (33)$$

Inspecting the second term, we see that at the optimum $b = -\sum_i D_{ii}$.
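Equation (33) can be verified exactly for a discrete sparse input distribution, by enumerating its finite support. This is a sketch of our own; the three-point distribution, dimensions, and random symmetric $D$ are illustrative assumptions, not from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 0.25               # three inputs; density p (sparsity 1 - p)
a = 1.0 / np.sqrt(p)         # x_i in {-a, 0, +a} has mean 0 and variance 1
vals = np.array([-a, 0.0, a])
probs = np.array([p / 2, 1 - p, p / 2])
E_x4 = p * a ** 4            # fourth moment E[x_i^4] (= 1/p here)

Dm = rng.standard_normal((n, n))
Dm = (Dm + Dm.T) / 2         # D is symmetric, as D = W^T W - v is
b = rng.standard_normal()

# Exact E[(x^T D x + b)^2] by enumerating the finite support of x.
E_L = 0.0
for combo in itertools.product(range(3), repeat=n):
    x = vals[list(combo)]
    weight = probs[list(combo)].prod()
    E_L += weight * (x @ Dm @ x + b) ** 2

# Right-hand side of equation (33).
diag2 = np.sum(np.diag(Dm) ** 2)
closed = ((E_x4 - 1) * diag2
          + (b + np.trace(Dm)) ** 2
          + 2 * (np.sum(Dm ** 2) - diag2))
```

Because the enumeration is exact (no sampling), `E_L` and `closed` agree to floating-point precision.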
Making this assignment, we find

$$\mathbb{E}[L] = (\mathbb{E}[x_i^4] - 1) \sum_i D_{ii}^2 + 2 \sum_{i \neq j} D_{ij}^2. \qquad (34)$$

Substituting in for $D$ gives us

$$\mathbb{E}[L] = (\mathbb{E}[x_i^4] - 1) \sum_i \left(\|W_{\cdot,i}\|^2 - v_i\right)^2 + 2 \sum_{i \neq j} (W_{\cdot,i} \cdot W_{\cdot,j})^2 \qquad (35)$$

Appendix B Covariances and Correlations

Here we derive the relationship between the terms in the quadratic model loss and true/hallucinated correlations:

$$\mathrm{Cov}[y, x_i^2] = \mathrm{Cov}\Big[\sum_j v_j x_j^2,\ x_i^2\Big] \qquad (36)$$
$$= v_i\, \mathrm{Cov}[x_i^2, x_i^2] \qquad (37)$$
$$= v_i \left(\mathbb{E}[x_i^4] - \mathbb{E}[x_i^2]^2\right) \qquad (38)$$
$$= v_i \left(\mathbb{E}[x_i^4] - 1\right) \qquad (39)$$

$$\mathrm{Cov}[\tilde{y}, x_i^2] = \mathrm{Cov}\Big[\sum_{j,k} (W^T W)_{jk} x_j x_k,\ x_i^2\Big] \qquad (40)$$
$$= (W^T W)_{ii}\, \mathrm{Cov}[x_i^2, x_i^2] \qquad (41)$$
$$= \|W_{\cdot,i}\|^2 \left(\mathbb{E}[x_i^4] - 1\right) \qquad (42)$$

$$\mathrm{Cov}[y, x_i x_j] = 0 \quad (i \neq j) \qquad (44)$$

$$\mathrm{Cov}[\tilde{y}, x_i x_j] = \mathrm{Cov}\Big[\sum_{k,l} (W^T W)_{kl} x_k x_l,\ x_i x_j\Big] \quad (i \neq j) \qquad (45)$$
$$= 2 (W^T W)_{ij}\, \mathrm{Cov}[x_i x_j, x_i x_j] \qquad (46)$$
$$= 2 (W_{\cdot,i} \cdot W_{\cdot,j}) \qquad (47)$$

Here the true correlations in the model are described by the term $\mathrm{Cov}[\tilde{y}, x_i^2]$, as these are what appear in the ground truth ($\mathrm{Cov}[y, x_i^2]$). By contrast, the hallucinated correlations are those of the form $\mathrm{Cov}[\tilde{y}, x_i x_j],\ i \neq j$, which are zero in the ground truth but are generally non-zero in the quadratic toy model.

Appendix C Quadratic Model Loss Minimization

In minimizing the loss we have to consider three constraints:

1. $1 \leq \sum_i C_i \leq D$
2. $0 \leq C_i \leq 1$
3.
$(\vec{C}, \vec{n})$ is feasible

Here feasibility means that there is actually a set of embedding directions we can choose such that $W$ realizes the capacity allocation $\vec{C}$ and embedding lengths $\vec{n}$. Our strategy here will be to "ask forgiveness rather than permission": we minimize $\mathcal{L}$, ignoring the feasibility constraint, and check for feasibility after the fact. First we hold $\vec{C}$ fixed and optimize over $\vec{n}$:

$$\frac{\partial \mathcal{L}}{\partial n_i} = 2(k-1)(n_i - v_i) - 4 n_i + 4 \frac{n_i}{C_i} \qquad (49)$$
$$0 = 2(k-1)(n_i^* - v_i) - 4 n_i^* + 4 \frac{n_i^*}{C_i} \qquad (50)$$
$$n_i^* \left((k-3) + 2/C_i\right) = (k-1) v_i \qquad (51)$$
$$n_i^* = \frac{(k-1) v_i}{(k-3) + 2/C_i} \qquad (52)$$

Note that $\lim_{C_i \to 0} n_i^* = 0$ and $\lim_{C_i \to 0} \left(n_i^{*2}/C_i\right) = 0$, so the final two terms of the loss are zero for non-represented features.
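The algebra from (49) to (52) is easy to sanity-check numerically. A small sketch of our own ($k$ and the $(C_i, v_i)$ pairs are arbitrary test values): the stationary point from (52) should zero out the derivative in (49).

```python
def dL_dn(n, C, v, k):
    """Equation (49): the derivative of the loss with respect to n_i."""
    return 2 * (k - 1) * (n - v) - 4 * n + 4 * n / C

def n_star(C, v, k):
    """Equation (52): the stationary point n_i*."""
    return (k - 1) * v / ((k - 3) + 2 / C)

k = 5.0
residuals = [dL_dn(n_star(C, v, k), C, v, k)
             for C, v in [(0.3, 1.0), (0.7, 2.5), (1.0, 0.4)]]
```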
Next we differentiate with respect to capacity to find

$$\frac{\partial \mathcal{L}(\vec{C}, \vec{n}^*(\vec{C}))}{\partial C_i} = \frac{\partial \mathcal{L}}{\partial C_i} + \sum_j \frac{\partial \mathcal{L}}{\partial n_j} \frac{\partial n_j^*}{\partial C_i} \qquad (53)$$

The second term vanishes (Footnote 6: This is because we optimized over the $n_i$ without any constraints, so $\vec{n}^*$ is a local optimum for any $\vec{C}$.), so we only have the first term:

$$\frac{\partial \mathcal{L}}{\partial C_i} = -2 \frac{\left(n_i^*(C_i)\right)^2}{C_i^2} \qquad (54)$$

This is non-positive, so our loss is monotonically decreasing in $C_i$. That means that we will saturate our first constraint, $\sum_i C_i = D$.
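Combining (52) and (54), the marginal loss benefit is $-\partial \mathcal{L}/\partial C_i = 2(n_i^*/C_i)^2 = 2\big((k-1)v_i/((k-3)C_i+2)\big)^2$, which makes the Section 3.6 picture easy to reproduce numerically. In this sketch of our own (the $k$ values are illustrative), for $k > 3$ the benefit falls with capacity (diminishing returns), at $k = 3$ it is flat, and for $k < 3$ it rises:

```python
import numpy as np

def marginal_benefit(C, v, k):
    """-dL/dC_i = 2 (n_i*/C_i)^2, from equations (52) and (54)."""
    return 2 * ((k - 1) * v / ((k - 3) * C + 2)) ** 2

C = np.linspace(0.05, 1.0, 20)
sparse = marginal_benefit(C, 1.0, k=6.0)  # high kurtosis: diminishing returns
flat = marginal_benefit(C, 1.0, k=3.0)    # critical kurtosis: constant benefit
dense = marginal_benefit(C, 1.0, k=2.0)   # low kurtosis: increasing returns
```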
Next let's solve for $C_i$ and $n_i$:

$$\lambda C_i = \frac{(k-1) v_i}{(k-3) + 2/C_i} \qquad (55)$$
$$\frac{v_i}{\lambda} = C_i \frac{k-3}{k-1} + \frac{2}{k-1} \qquad (56)$$
$$C_i = \max\left(\min\left(\frac{k-1}{k-3}\,\frac{v_i}{\lambda} - \frac{2}{k-3},\ 1\right),\ 0\right) \qquad (57)$$

where we have introduced the clipping at 0 and 1 to make this also apply to the capacities on the boundary. With $C_i$ we then obtain

$$n_i = \begin{cases} 0 & C_i = 0 \\ \frac{k-1}{k-3} v_i - \frac{2\lambda}{k-3} & 0 < C_i < 1 \\ v_i & C_i = 1 \end{cases} \qquad (58)$$

C.0.1 Checking feasibility

For our solution $(\vec{C}, \vec{n})$ to be feasible, there must exist a corresponding matrix $W$. As shown in Appendix H, for any $\vec{C}$ with $\sum_i C_i = D \leq N$ and $0 \leq C_i \leq 1$, there is a semiorthogonal matrix $W$ of shape $(N, D)$ with $C_i(W) = C_i$.
We next need to show that $\vec{n}$ is compatible with $\vec{C}$. The condition $n_i \propto C_i$ holds for any semiorthogonal matrix, and we can multiply each $W_{\cdot,i}$ by a scalar to get the constant of proportionality right. Therefore, this solution is feasible!

Appendix D Kurtosis and Sparsity

D.1 Sparsity increases fourth moment

Suppose that, with probability $p$, $x$ takes the value of a new random variable $z$, and otherwise it is 0 (i.e. $x = s z$, $s \sim B(p)$). We set the mean $\mathbb{E}[z] = 0$ and variance $\mathrm{Var}[z] = 1/p$ so that $\mathbb{E}[x] = 0$, $\mathrm{Var}[x] = 1$. As $p$ decreases (i.e. sparsity increases), $\mathbb{E}[x^4]$ increases:

$$\mathbb{E}[x^4] = \mathbb{E}[s^4 z^4] = \mathbb{E}[s^4]\, \mathbb{E}[z^4] = p \cdot \mathrm{Kurt}[z]\, \mathrm{Var}[z]^2 = \mathrm{Kurt}[z]/p \qquad (59)$$

D.2 Uniform Random Variable Moments

We want to choose $a$ such that for $x \sim U(-a, a) \cdot B(p)$, $\mathrm{Var}[x] = 1$.
Choose $a = \sqrt{3/p}$:

$$\mathrm{Var}[x] = \mathbb{E}[s^2]\, \mathbb{E}[z^2] = p \cdot \frac{(2\sqrt{3/p})^2}{12} = 1 \qquad (60)$$
$$\mathrm{Kurt}[x] = \frac{\mathbb{E}[x^4]}{\mathbb{E}[x^2]^2} = \frac{9}{5} \cdot \frac{1}{p} \qquad (61)$$

Appendix E Interpretation as correlation coefficient

The (population) Pearson correlation coefficient $\rho^2$, also called "fraction of variance explained", is defined as the ratio between the squared covariance and the product of variances:

$$\rho^2(X, Y) = \frac{\mathrm{Cov}(X, Y)^2}{\mathrm{Cov}(X, X)\, \mathrm{Cov}(Y, Y)} \qquad (62)$$

(The more familiar $r^2$ is an estimator for $\rho^2$ computed from a sample.) Let's assume our input features $X_i$ are linearly uncorrelated and have mean zero and variance one:

$$\mathbb{E}(X_i) = 0 \qquad (63)$$
$$\mathrm{Cov}(X_i, X_j) = \delta_{ij} \qquad (64)$$

We'll embed them with embedding vectors $w_i = W \hat{e}_i = W_{:,i}$ and immediately unembed with $W^T$.
Then $\rho^2$ between the $i$-th input and output is (using Einstein notation for everything except $i$, which is never summed)
$$\rho^2(X_i, [W^T W X]_i) = \frac{\mathrm{Cov}(X_i, W_{ik}W_{kj}X_j)^2}{\mathrm{Cov}(X_i, X_i)\,\mathrm{Cov}(W_{ik}W_{kj}X_j, W_{ik'}W_{k'j'}X_{j'})} \tag{65}$$
$$= \frac{(W_{ik}W_{kj}\,\mathrm{Cov}(X_i, X_j))^2}{W_{ik}W_{kj}W_{ik'}W_{k'j'}\,\mathrm{Cov}(X_i, X_i)\,\mathrm{Cov}(X_j, X_{j'})} \tag{66}$$
$$= \frac{(W_{ik}W_{kj}\,\delta_{ij})^2}{W_{ik}W_{kj}W_{ik'}W_{k'j'}\,\delta_{jj'}} \tag{67}$$
$$= \frac{(W_{ik}W_{ki})^2}{(W_{ik}W_{kj})(W_{ik'}W_{k'j})} \tag{68}$$
$$= \frac{(w_i \cdot w_i)^2}{\sum_j (w_i \cdot w_j)^2} \tag{69}$$
$$= C_i \tag{70}$$
where we've written the sum over $j$ explicitly for the sake of clarity. Note that $W^T$ is not always the optimal way to reconstruct $X_i$ from $WX$. We haven't checked the math, but it seems to be the case that using the optimal unembedding matrix instead of $W^T$ gives the alternate definition of capacity in Appendix G.

Appendix F: Proof of capacity constraint

Suppose we have $N$ features and $D$ neurons, with $N \ge D$. Let our embedding matrix be $W_{ai}$, with shape $[D, N]$. The embedding vectors in $\mathbb{R}^D$ are column vectors $\vec{w}_i := W_{:,i}$ with components $[w_i]_a = W_{ai}$.
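As a numerical aside (a sketch, not from the paper), the identification of capacity with $\rho^2$ from Appendix E can be checked directly for a random $W$ of this shape; the dimensions, seed, and sample count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 8
W = rng.normal(size=(D, N))  # arbitrary embedding matrix of shape [D, N]

# Capacity from Eqs. (69)-(70): C_i = (w_i . w_i)^2 / sum_j (w_i . w_j)^2
G = W.T @ W  # Gram matrix of the embedding vectors, G[i, j] = w_i . w_j
C = np.diag(G) ** 2 / (G**2).sum(axis=1)

# Empirical rho^2 between X_i and the i-th coordinate of W^T W X
X = rng.normal(size=(N, 200_000))  # uncorrelated features, mean 0, variance 1
Y = W.T @ (W @ X)
rho2 = np.array([np.corrcoef(X[i], Y[i])[0, 1] ** 2 for i in range(N)])
print(np.abs(C - rho2).max())  # small: capacity matches fraction of variance explained
```

The sampled $\rho^2$ values agree with the closed-form capacities up to Monte Carlo error.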
Define
$$V_{ai} := \frac{W_{ai}}{\sqrt{[W^T W W^T W]_{ii}}} \tag{71}$$
$$\vec{v}_i := V_{:,i} = \frac{\vec{w}_i}{\sqrt{\sum_j (\vec{w}_i \cdot \vec{w}_j)^2}} \tag{72}$$
and
$$U := V^T W \tag{73}$$
$$U_{ij} = \vec{v}_i \cdot \vec{w}_j \tag{74}$$
Note that
$$U_{ii}^2 = \frac{(\vec{w}_i \cdot \vec{w}_i)^2}{\sum_j (\vec{w}_i \cdot \vec{w}_j)^2} = C_i \tag{75}$$
Also,
$$\sum_{j=1}^{N} U_{ij}^2 = 1 \tag{76}$$
so the row vectors $\hat{u}_i := U_{i,:} \in \mathbb{R}^N$ are unit vectors. Let $D' := \mathrm{rank}(U) \le D$, so that the vectors $\hat{u}_i$ span a subspace of dimension $D'$. Let the $D'$ unit vectors $\hat{y}_b \in \mathbb{R}^N$ with components $[y_b]_i =: Y_{bi}$ be an orthonormal basis for this subspace.
Let $X_{ib}$ be the components of $\hat{u}_i$ in the $\hat{y}_b$ basis, so that
$$U = XY \tag{77}$$
$$\hat{u}_i = \sum_{b=1}^{D'} X_{ib}\,\hat{y}_b \tag{78}$$
Each $\hat{u}_i$ is a unit vector, so its components in an orthonormal basis obey
$$\sum_{b=1}^{D'} X_{ib}^2 = 1 \tag{79}$$
Therefore, the row vectors $\hat{x}_i := X_{i,:} \in \mathbb{R}^{D'}$ are unit vectors. We can also define vectors $\vec{y}_i := Y_{:,i}$; note that these are the columns of $Y$, while the $\hat{y}_b$ are the rows.
Note that $U_{ij} = \hat{x}_i \cdot \vec{y}_j$, so we have
$$C_i = U_{ii}^2 = (\hat{x}_i \cdot \vec{y}_i)^2 \tag{80}$$
$$\le (\hat{x}_i \cdot \hat{x}_i)(\vec{y}_i \cdot \vec{y}_i) \quad \text{by Cauchy–Schwarz} \tag{81}$$
$$= \vec{y}_i \cdot \vec{y}_i \tag{82}$$
$$= \sum_{b=1}^{D'} Y_{bi}^2 \tag{83}$$
Finally, summing over $i$,
$$\sum_{i=1}^{N} C_i \le \sum_{i=1}^{N} \sum_{b=1}^{D'} Y_{bi}^2 \tag{84}$$
$$= \sum_{b=1}^{D'} \sum_{i=1}^{N} Y_{bi}^2 \tag{85}$$
$$= \sum_{b=1}^{D'} |\hat{y}_b|^2 \tag{86}$$
$$= D' \le D \tag{87}$$

Appendix G: An alternative definition of capacity

Instead of applying equation (2) directly to $W$, we can take the (compact) SVD $W = QSR$ and apply it to $R$:
$$\tilde{C}_i(W) := \frac{([R^T R]_{ii})^2}{[(R^T R)^2]_{ii}} = [R^T R]_{ii} \tag{88}$$
An equivalent prescription is to first rotate (by $Q^{-1}$) and then scale (by $S^{-1}$) the embedding space $\mathbb{R}^D$ so that the embedding vectors are isotropically distributed, then take $\tilde{C}_i$ to be the norm-squared of the new embedding vector $S^{-1}Q^{-1}\vec{w}_i$. This is equivalent to our previous definition when $W$ is "c-efficient",
$$\sum_{i=1}^{N} C_i = D \implies \tilde{C}_i = C_i \tag{89}$$
but relaxes the condition for "$\tilde{c}$-efficiency" dramatically, to the point that generic matrices are efficient:
$$\sum_{i=1}^{N} \tilde{C}_i = \mathrm{rank}(W) \tag{90}$$
$$W \text{ full rank} \implies \sum_{i=1}^{N} \tilde{C}_i = D \tag{91}$$
On the other hand, $\tilde{C}_i(W)$ is a somewhat more discontinuous function of $W$ than $C_i(W)$ is. Our results for the quadratic toy model are essentially unchanged; $W$ is efficient at the minimum of $\mathbb{E}[\mathcal{L}]$, so these definitions coincide (although we're not sure their derivatives do). In more realistic models, it remains to be seen which version is more useful. This new definition also sheds some light on "c-efficiency": a matrix is efficient if its embedding vectors are isotropically distributed within each block.
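Both claims are easy to verify numerically for a generic matrix. The following sketch (not from the paper; dimensions and seed are arbitrary) checks the bound $\sum_i C_i \le D$ from Appendix F and the identity $\sum_i \tilde{C}_i = \mathrm{rank}(W)$ from Eq. (90):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 7
W = rng.normal(size=(D, N))  # a generic (full-rank) embedding matrix

# Original capacities: their sum is bounded by D (Appendix F)
G = W.T @ W
C = np.diag(G) ** 2 / (G**2).sum(axis=1)

# Alternative capacities (Eq. 88): apply the same formula to R from the compact SVD
_, _, R = np.linalg.svd(W, full_matrices=False)  # R has orthonormal rows
P = R.T @ R  # projection onto the row space of W, so P @ P == P
C_alt = np.diag(P) ** 2 / np.diag(P @ P)  # equals diag(P) since P is a projection
print(C.sum(), C_alt.sum())  # C.sum() <= D, while C_alt.sum() == rank(W) == D
```

Because $R^T R$ is an orthogonal projection, the two expressions in Eq. (88) coincide, and the alternative capacities always sum to exactly $\mathrm{rank}(W)$.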
In the case of only one block, a matrix is efficient if identity covariance for a random $\vec{x}$ implies identity covariance for $W\vec{x}$.

Appendix H: Feasibility Proofs

Set-up: We have vectors $w_1, \dots, w_N \in \mathbb{R}^D$ spanning $\mathbb{R}^D$, where $D \le N$, arranged into a $D \times N$ matrix
$$W = \begin{pmatrix} | & & | \\ w_1 & \cdots & w_N \\ | & & | \end{pmatrix}.$$
Our aim is to show that $\sum_{i=1}^{N} C_i \le D$ and to determine the conditions under which equality holds.

Solution: We begin by finding a maximal decomposition of $\mathbb{R}^D$ into orthogonal subspaces $V_1 \oplus \cdots \oplus V_r$ such that each $w_i$ lies in some $V_s$, and we partition the $w_i$ accordingly, discarding any index with $w_i = 0$. It then suffices to examine the problem separately on each subspace $V_s$. Therefore, we assume that $w_1, \dots, w_N$ are such that we cannot decompose our problem further. In particular, this implies that $w_i \ne 0$ for all $i$ and that we cannot partition the indices into two subsets $S, T$ such that $w_i \cdot w_j = 0$ for all $i \in S, j \in T$.
Lemma 1. Let $\{e_1, \dots, e_N\}$ be the standard basis of $\mathbb{R}^N$ and let $V \subseteq \mathbb{R}^N$ be a linear subspace of dimension $D$ such that $e_i \notin V^\perp$ for all $i$. Then for any unit vectors $v_1, \dots, v_N \in V$, we have
$$\sum_{i=1}^{N} (v_i \cdot e_i)^2 \le D$$
with equality if and only if each $v_i$ is parallel to $\mathrm{proj}(e_i, V)$, where $\mathrm{proj}(u, W)$ denotes the orthogonal projection of a vector $u$ onto the subspace $W$.

Proof: Define $v_i^* \in V$ to be a unit vector in the direction of $\mathrm{proj}(e_i, V)$, which exists by the assumption $e_i \notin V^\perp$. Then we have $|v_i \cdot e_i| \le |v_i^* \cdot e_i|$ for any unit vector $v_i \in V$, with equality if and only if $v_i$ is parallel to $v_i^*$. So it suffices to show that
$$\sum_{i=1}^{N} (v_i^* \cdot e_i)^2 = D.$$
Let $b_1, \dots, b_D$ be an orthonormal basis for $V$, and let $b_{D+1}, \dots, b_N$ be an orthonormal basis for $V^\perp$.
Then $v_i^* = \lambda_i \left( (b_1 \cdot e_i) b_1 + \cdots + (b_D \cdot e_i) b_D \right)$, where the normalising constant $\lambda_i = \left( \sum_{j=1}^{D} (b_j \cdot e_i)^2 \right)^{-1/2}$ is non-zero since $v_i^*$ is non-zero (and we may wlog assume $\lambda_i > 0$). Then
$$(v_i^* \cdot e_i)^2 = \lambda_i^2 \left( \sum_{j=1}^{D} (b_j \cdot e_i)^2 \right)^2 = \sum_{j=1}^{D} (b_j \cdot e_i)^2,$$
and summing over $i$ gives $\sum_{i=1}^{N} \sum_{j=1}^{D} (b_j \cdot e_i)^2 = \sum_{j=1}^{D} \|b_j\|^2 = D$, from which the lemma follows.
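The saturating case of Lemma 1 can be illustrated numerically. The sketch below (not from the paper; the subspace is a random choice, so $e_i \notin V^\perp$ holds almost surely) builds the projection-based unit vectors $v_i^*$ and checks that their squared overlaps with the standard basis sum to exactly $\dim V$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 7, 3
# Orthonormal basis b_1, ..., b_D for a random D-dimensional subspace V of R^N
B, _ = np.linalg.qr(rng.normal(size=(N, D)))

P = B @ B.T  # orthogonal projection onto V; proj(e_i, V) is column i of P
V_star = P / np.linalg.norm(P, axis=0)  # column i is v_i^*, the normalised projection
total = sum(V_star[i, i] ** 2 for i in range(N))  # sum_i (v_i^* . e_i)^2
print(total)  # equals dim V = D, saturating the bound of Lemma 1
```

Since $P$ is idempotent, each term reduces to $P_{ii}$ and the sum is $\mathrm{tr}(P) = D$, matching the proof.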
Define the matrix $A = (a_{ij})_{i,j = 1, \dots, N}$ via $a_{ij} = w_i \cdot w_j$, and denote by $a_i$ the vector $(a_{i1}, \dots, a_{iN})^T \in \mathbb{R}^N$, so that
$$A = \begin{pmatrix} | & & | \\ a_1 & \cdots & a_N \\ | & & | \end{pmatrix}$$
and the capacities can be expressed as $C_i = a_{ii}^2 / \|a_i\|^2$. Note that $a_{ii} = \|w_i\|^2 \ne 0$, and so none of the $a_i$ are trivial. Also, note that $A = W^T W$, so the rank of $A$ is the same as the rank of $W$, and thus equal to $D$. Now let $v_i = a_i / \|a_i\|$, so that $v_i$ is a unit vector. Since the rank of $A$ is $D$, the $a_i$ span a space $V \subseteq \mathbb{R}^N$ of dimension $D$. We can write
$$v_i \cdot e_i = a_{ii} / \|a_i\|,$$
from which it also follows that $e_i \notin V^\perp$ (otherwise we would have $a_{ii} = 0$, which is not possible).
Therefore we may apply Lemma 1 to get $\sum_{i=1}^{N} C_i \le D$, with equality if and only if each $a_i$ is parallel to $\mathrm{proj}(e_i, V)$. Assuming that equality holds, using the proof of Lemma 1 we can write
$$a_i = \mu_i \sum_{k=1}^{D} (b_k \cdot e_i) b_k, \qquad a_{ij} = \mu_i \sum_{k=1}^{D} (b_k \cdot e_i)(b_k \cdot e_j)$$
for some constant $\mu_i$. Since $a_{ii} = a_i \cdot e_i > 0$, we see that $\mu_i > 0$. Now, if $\mu_i \ne \mu_j$ then
$$\mu_j a_{ij} = \mu_j \mu_i \sum_{k=1}^{D} (b_k \cdot e_i)(b_k \cdot e_j) = \mu_i a_{ji},$$
from which we conclude, by the symmetry of $a_{ij}$, that $a_{ij} = w_i \cdot w_j = 0$. Hence if we have any $\mu_i \ne \mu_j$, we can perform a non-trivial decomposition into orthogonal subspaces $V_1 \oplus \cdots \oplus V_r$ satisfying the conditions at the start of the solution. Since we assumed this was not possible, we must in fact have $\mu := \mu_i$ constant for all $i$.
Consider the map
$$\Lambda : w_i \mapsto \mu^{1/2} \sum_{k=1}^{D} (b_k \cdot e_i) b_k.$$
If we let $T$ be the linear subspace spanned by $w_1, \dots, w_N$, then $\dim T = D$ and $\Lambda$ maps $T$ to $V$. We can also check that the orthogonality of the bases $\{e_i\}_i$ and $\{b_i\}_i$ implies that $\Lambda$ preserves the inner products of the $w_i$. It follows that $\Lambda$ is invertible and orthogonal. In addition, we have
$$b_j^T \Lambda W e_i = b_j^T \Lambda w_i = \mu^{1/2} \sum_{k=1}^{D} (b_k \cdot e_i)(b_k \cdot b_j) = \mu^{1/2} (b_j \cdot e_i),$$
so if we consider $\Lambda W$ as a map from $\mathbb{R}^N$ with basis $\{e_i\}_i$ to $V$ with basis $\{b_i\}_i$, then it has matrix
$$\Lambda W = \mu^{1/2} \begin{pmatrix} b_1 \cdot e_1 & \cdots & b_1 \cdot e_N \\ \vdots & & \vdots \\ b_D \cdot e_1 & \cdots & b_D \cdot e_N \end{pmatrix}.$$
Using $\sum_{j=1}^{N} (b_i \cdot e_j)^2 = 1$, it follows that $\Lambda W W^T \Lambda^T = \mu I_D$. Since $\Lambda$ is orthogonal, we deduce that $W W^T = \mu I_D$, as required.

Showing that every saturating case is achievable: We want to show that for every tuple of real numbers $(C_1, \dots, C_N)$ with $0 \le C_i \le 1$ and $\sum_{i=1}^{N} C_i = D$, we can find $w_1, \dots, w_N \in \mathbb{R}^D$ such that
$$C_i = \frac{(w_i \cdot w_i)^2}{\sum_{j=1}^{N} (w_i \cdot w_j)^2}.$$
A tuple $(C_1, \dots, C_N)$ is feasible if there exists an $N \times N$ unitary matrix $U$ such that, if we let $\hat{U}$ denote the first $D$ rows of $U$ and define $u_1, \dots, u_N \in \mathbb{R}^D$ by
$$\hat{U} = \begin{pmatrix} | & & | \\ u_1 & \cdots & u_N \\ | & & | \end{pmatrix},$$
then $\|u_i\|^2 = C_i$ for each $i = 1, \dots, N$.
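This construction is straightforward to check numerically. The sketch below (not from the paper; sizes and seed are arbitrary) draws a random orthogonal matrix, keeps its first $D$ rows, and verifies that the resulting columns $u_i$ have capacities $C_i = \|u_i\|^2$ summing to exactly $D$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 3
# Random N x N orthogonal matrix via QR, standing in for U
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
U_hat = Q[:D, :]  # first D rows; column i is u_i in R^D

G = U_hat.T @ U_hat
C = np.diag(G) ** 2 / (G**2).sum(axis=1)  # capacities of w_i = u_i
norms2 = (U_hat**2).sum(axis=0)  # squared column norms ||u_i||^2
print(np.allclose(C, norms2), C.sum())  # True, and the capacities sum to D
```

Because the rows of $\hat{U}$ are orthonormal, $\sum_j (u_i \cdot u_j)^2$ collapses to $\|u_i\|^2$, which is exactly the computation carried out symbolically below.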
Note that if $(C_1, \dots, C_N)$ is feasible, then for the corresponding $u_i$, denoting their components by $\{u_{ki}\}_{k=1}^{D}$, we have
$$\frac{(u_i \cdot u_i)^2}{\sum_{j=1}^{N} (u_i \cdot u_j)^2} = \frac{(u_i \cdot u_i)^2}{\sum_{j=1}^{N} \sum_{k,l=1}^{D} u_{ki} u_{kj} u_{li} u_{lj}} = \frac{(u_i \cdot u_i)^2}{\sum_{k,l=1}^{D} u_{ki} u_{li}\, \delta_{kl}} = \frac{(u_i \cdot u_i)^2}{\sum_{k=1}^{D} u_{ki} u_{ki}} = \|u_i\|^2 = C_i,$$
so we see that $(C_1, \dots, C_N)$ is a valid set of capacities, by taking $w_i = u_i$ for each $i$. Hence it suffices to show that all the relevant tuples are feasible.

Lemma 2. Suppose that $(C_1, \dots, C_N)$ is a feasible tuple. Then:
1. any permutation of $(C_1, \dots, C_N)$ is feasible;
2.
for any $C'_N, C'_{N+1} \ge 0$ such that $C'_N + C'_{N+1} = C_N$, the tuple $(C_1, \dots, C_{N-1}, C'_N, C'_{N+1})$ is feasible;
3. the tuple $(1 - C_1, \dots, 1 - C_N)$ is feasible, with $D$ replaced by $N - D$.

Proof: Let $U$ be an $N \times N$ unitary matrix corresponding to $(C_1, \dots, C_N)$ as above. Then (1) is immediate by permuting the columns of $U$, and (3) follows by exchanging the first $D$ and last $N - D$ rows of $U$, noting that the squared norm of each column is $1$. To prove (2), consider extending $U$ to an $(N+1) \times (N+1)$ unitary matrix $U'$ by placing $U$ in the upper $N \times N$ quadrant and placing a $1$ at the bottom-right position.
Then write
$$U' = \begin{pmatrix} U & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} | & & | & | \\ u'_1 & \cdots & u'_N & u'_{N+1} \\ | & & | & | \end{pmatrix},$$
for vectors $u'_i \in \mathbb{R}^{N+1}$.
Now we may rotate the final pair of vectors together by an arbitrary angle $\theta$, and the resulting matrix remains orthogonal:
$$U'_\theta = \begin{pmatrix} u'_1 & \cdots & u'_N \cos(\theta) + u'_{N+1} \sin(\theta) & \; -u'_N \sin(\theta) + u'_{N+1} \cos(\theta) \end{pmatrix}.$$
Restricting to the first $D$ rows of $U'_\theta$, we see that the first $N - 1$ columns still have squared norms $C_1, \dots, C_{N-1}$ respectively (inherited from $U$). Therefore, as $\theta$ varies, the sum of the squared norms of the final two columns must remain $C_N$. Moreover, the split between the two columns varies continuously between $(C_N, 0)$ and $(0, C_N)$ as $\theta$ ranges between $0$ and $\pi/2$. Therefore we can achieve any split $(C'_N, C'_{N+1})$ between the final two columns, proving part (2).

Finally, we show that every tuple $(C_1, \dots, C_N)$ satisfying $0 \le C_i \le 1$ and $\sum_{i=1}^{N} C_i = D$ is feasible. Suppose not, for a contradiction. Then we may pick such a tuple with minimal $N$. Clearly all such tuples with $N \le 1$ are feasible, so we may assume $N \ge 2$.
If there are any $C_i, C_j$ with $C_i + C_j \le 1$, then we may replace $C_i, C_j$ by $C_i + C_j$ to get another tuple satisfying the given conditions but with smaller $N$. By our minimality assumption, we conclude that this new tuple must be feasible. But then our original tuple is feasible by Lemma 2.1 and 2.2. On the other hand, if there are no $C_i, C_j$ with $C_i + C_j \le 1$, then consider the tuple $(1 - C_1, \dots, 1 - C_N)$. We know that we must have $(1 - C_i) + (1 - C_j) \le 1$ for any $i, j$. Replacing $(1 - C_i), (1 - C_j)$ by $(1 - C_i) + (1 - C_j)$ to get a new tuple and following the argument of the previous paragraph, we conclude that the new tuple must be feasible, and so $(1 - C_1, \dots, 1 - C_N)$ must be feasible also. But then $(C_1, \dots, C_N)$ is feasible too, by Lemma 2.3. We are thus done in all cases.