Paper deep dive
A Technical Note on Bilinear Layers for Interpretability
Lee Sharkey
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 92%
Last extracted: 3/12/2026, 6:53:18 PM
Summary
The paper proposes using bilinear layers as an alternative to standard MLP layers in neural networks to improve mathematical interpretability. By expressing bilinear layers through linear operations and third-order tensors, the author extends the 'Mathematical Framework for Transformer Circuits' to include MLP layers, potentially enabling deeper mechanistic insights into feature construction and safety.
Entities (5)
Relation Signals (3)
Bilinear Layers → integratedinto → Transformer Circuits
confidence 95% · We can integrate this expression for bilinear layers into a mathematical framework for transformer circuits
Bilinear Layers → improves → Mechanistic Interpretability
confidence 90% · bilinear layers may offer an alternative path for mechanistic interpretability
Superposition → motivates → Bilinear Layers
confidence 85% · This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable
Cypher Suggestions (2)
Map the integration of components into the transformer framework. · confidence 90% · unvalidated
MATCH (c:Component)-[:INTEGRATED_INTO]->(f:Framework {name: 'Transformer Circuits'}) RETURN c, f
Find all architectures related to interpretability research. · confidence 80% · unvalidated
MATCH (a:Architecture)-[:USED_FOR]->(i:ResearchField {name: 'Interpretability'}) RETURN a
Abstract
The ability of neural networks to represent more features than neurons makes interpreting them challenging. This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable than standard multilayer perceptrons (MLPs) with elementwise activation functions. In this note, I examine bilinear layers, which are a type of MLP layer that are mathematically much easier to analyze while simultaneously performing better than standard MLPs. Although they are nonlinear functions of their input, I demonstrate that bilinear layers can be expressed using only linear operations and third order tensors. We can integrate this expression for bilinear layers into a mathematical framework for transformer circuits, which was previously limited to attention-only transformers. These results suggest that bilinear layers are easier to analyze mathematically than current architectures and thus may lend themselves to deeper safety insights by allowing us to talk more formally about circuits in neural networks. Additionally, bilinear layers may offer an alternative path for mechanistic interpretability through understanding the mechanisms of feature construction instead of enumerating a (potentially exponentially) large number of features in large models.
Tags
Links
- Source: https://arxiv.org/abs/2305.03452
- Canonical: https://arxiv.org/abs/2305.03452
Full Text
39,470 characters extracted from source content.
arXiv:2305.03452v1 [cs.LG] 5 May 2023

A technical note on bilinear layers for interpretability

Lee Sharkey
lee@conjecture.dev, leedsharkey@gmail.com
Conjecture

Abstract

The ability of neural networks to represent more features than neurons makes interpreting them challenging. This phenomenon, known as superposition [Olah et al., 2020, Elhage et al., 2022b], has spurred efforts to find architectures that are more interpretable than standard multilayer perceptrons (MLPs) with elementwise activation functions. In this note, I examine bilinear layers [Shazeer, 2020], which are a type of MLP layer that are mathematically much easier to analyze while simultaneously performing better than standard MLPs. Although they are nonlinear functions of their input, I demonstrate that bilinear layers can be expressed using only linear operations and third order tensors. We can integrate this expression for bilinear layers into a mathematical framework for transformer circuits [Elhage et al., 2021], which was previously limited to attention-only transformers. These results suggest that bilinear layers are easier to analyze mathematically than current architectures and thus may lend themselves to deeper safety insights by allowing us to talk more formally about circuits in neural networks. Additionally, bilinear layers may offer an alternative path for mechanistic interpretability through understanding the mechanisms of feature construction instead of enumerating a (potentially exponentially) large number of features in large models.

1 Introduction

Neural networks can learn to compute interesting and complicated functions. To a first approximation, these functions appear to be structured such that particular computational roles or representations are assigned to particular directions in neural activation space [Olah et al., 2020]. We call these representations features.
Somewhat surprisingly, neural networks are believed to be able to represent more features than they have neurons [Elhage et al., 2022b, Gurnee et al., 2023]. This phenomenon is known as superposition, since they assign features to non-orthogonal directions which 'overlap' in high-dimensional space.

We are particularly interested in mechanistically understanding large language models that use the transformer architecture [Vaswani et al., 2017]. This architecture mostly consists of a series of alternating attention layers (which let activations at different points in a sequence interact with each other) and MLP layers (which, at each point in the sequence, construct useful output features that are nonlinear transformations of the input features). About two thirds of the parameters in these models are in the MLP layers, which are thought to make prodigious use of superposition [Elhage et al., 2022a, Gurnee et al., 2023].

Nonlinear elementwise activation functions (such as ReLU [Nair and Hinton, 2010] or GeLU [Hendrycks and Gimpel, 2020]) in MLP layers can remove small amounts of interference between non-orthogonal features [Elhage et al., 2022b], thus making it possible for layers to represent features in superposition without increasing the loss. Unfortunately, while the activation function is very useful for the performance of neural networks, it makes it quite difficult to analyze MLPs mathematically because the powerful tools of linear algebra can no longer be readily applied.

Work in progress

However, it turns out that another kind of MLP layer, the bilinear layer [Shazeer, 2020, Dauphin et al., 2016, Mnih and Hinton, 2007], is much easier to analyze than MLPs with elementwise activation functions. Even though bilinear layers are nonlinear functions of the input vector, bilinear layers can be described using only linear operations and third order tensors!
This nice property lets us extend 'A Mathematical Framework for Transformer Circuits' [Elhage et al., 2021] to transformers with MLP layers as well as attention, not just attention-only transformers. We hope that this simple change will give us a firmer analytical footing to understand large models on a deep, mechanistic level. This might eventually let us make deeper claims about their safety, since it could permit us to describe classes of circuits as mathematical objects with certain properties (as induction heads were in Elhage et al. [2021]) and to analyze learning dynamics and predict the emergence of particular kinds of circuits.

It has been hypothesized (though not yet observed) that neural networks might represent a number of features that is exponential in the number of neurons in a layer [Elhage et al., 2022b]. If this is true, it would not bode well for our ability to mechanistically understand large neural networks, which in a sense relies on our being able to enumerate all their features. However, as discussed in the last section of this work, bilinear layers may offer a potential alternative path to 'enumerative safety' [Elhage et al., 2022b]. Instead of attempting to understand each of a large number of features, with bilinear networks we may be able to understand a smaller number of primitive features that bilinear layers use to 'construct' their (potentially exponentially large) number of features. Thus, in the same way that we might be able to understand an exponentially large number of executed programs by understanding their code, we might be able to understand an exponentially large number of features by understanding the process by which features with certain properties are constructed.
Here, we make some preliminary steps toward understanding the mechanisms of feature construction in bilinear layers; we show that in bilinear layers, output features are constructed through sums of pairwise interactions between input features, whereas, in standard MLPs, output features are constructed using all-to-all interactions between input features that appear not to be decomposable.

2 Bilinear layers

2.1 Introducing bilinear layers

A standard MLP layer consists of an input vector $x$, a weight matrix $W$, and an elementwise nonlinear activation function $\sigma$, such as the ReLU function (an optional bias term is omitted for notational simplicity). The input vector is linearly transformed by the weight matrix to yield the preactivation vector $Wx$, to which the activation function is applied elementwise:

$$\mathrm{MLP}_{\mathrm{ReLU}}(x) = \sigma(Wx)$$

Bilinear layers are slightly different. They take the form

$$\mathrm{MLP}_{\mathrm{Bilinear}}(x) = (W_1 x) \odot (W_2 x),$$

where $\odot$ denotes elementwise multiplication. They have two weight matrices, which each separately transform the input vector. They were introduced in different forms by Dauphin et al. [2016] and Mnih and Hinton [2007]. They were later studied by Shazeer [2020], who showed that bilinear layers, when used as the MLP layer in transformer language models, are surprisingly competitive¹: they are at least as performant per parameter as standard MLPs with ReLU or GELU activation functions and only slightly less performant than state-of-the-art SwiGLU layers².

2.2 Describing bilinear layers using only linear operations and third order tensors

The lack of an elementwise activation function in bilinear layers makes them mathematically very simple. In fact, despite being nonlinear functions of $x$, they can be expressed using only linear operations and third order tensors.

¹ At least for the model size they explored, which was approximately 120M parameters, a similar size to GPT2-small [Radford et al., 2019].
To my knowledge, it remains to be determined whether bilinear layers continue to perform competitively at larger scales.

² A SwiGLU layer is equivalent to a bilinear layer but where an elementwise Swish activation function [Ramachandran et al., 2017] is applied to $W_1 x$.

First, we'll define the tensor inner product (see Appendix A for some examples of tensor inner products, which may help build intuitions). Unlike the inner product between vectors, the tensor inner product needs to define the axes along which the inner product is taken. The tensor inner product is thus defined as $U^{(n)} \cdot_{jk} V^{(m)} = T^{(n+m-2)}$, where

$$T_{\gamma_1 \cdots \gamma_{j-1}\,\gamma_{j+1} \cdots \gamma_n\,\gamma'_1 \cdots \gamma'_{k-1}\,\gamma'_{k+1} \cdots \gamma'_m} = \sum_\beta U_{\gamma_1 \cdots \gamma_{j-1}\,\beta\,\gamma_{j+1} \cdots \gamma_n}\, V_{\gamma'_1 \cdots \gamma'_{k-1}\,\beta\,\gamma'_{k+1} \cdots \gamma'_m} \quad (1)$$

For the tensor inner product between $n$th order tensor $U$ and $m$th order tensor $V$ to be defined, the dimension of axis $j$ of tensor $U$ must be the same as the dimension of axis $k$ of tensor $V$.

Now we show how bilinear layers can be expressed using linear operations and third order tensors. Suppose we want to find the third order tensor $B$ such that $(W_1 x) \odot (W_2 x) = x \cdot_{12} B \cdot_{21} x$, if it exists. We'll first identify the terms in the vector on the left hand side,

$$((W_1 x) \odot (W_2 x))_i = \Big(\sum_j W_{1(ij)} x_j\Big)\Big(\sum_k W_{2(ik)} x_k\Big) = \sum_j \sum_k W_{1(ij)} x_j W_{2(ik)} x_k \quad (2)$$

Now let's express the terms of the third order tensor $B$ using tensor inner products. We have

$$(x \cdot_{12} B \cdot_{21} x)_i = \sum_j x_j \sum_k x_k B_{ijk} = \sum_k x_k \sum_j x_j B_{ijk} = \sum_j \sum_k x_j x_k B_{ijk}. \quad (3)$$

Note that it doesn't matter whether we take the tensor inner product between $B$ and $x$ on the 2nd or 3rd axis first, which is why $x \cdot_{12} B \cdot_{21} x$ is associative, i.e. $(x \cdot_{12} B) \cdot_{21} x = x \cdot_{12} (B \cdot_{21} x)$. We'll use this property when extending A Mathematical Framework for Transformer Circuits [Elhage et al., 2021] (Section 2.3).

Comparing the terms from equations 2 and 3, we can see they are equal if $B_{ijk} = W_{1(ij)} W_{2(ik)}$.
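The identity just derived is easy to check numerically. The following sketch (pure Python with nested lists; all matrix shapes and values are invented for illustration, not taken from the paper) implements the bilinear forward pass from Section 2.1 and confirms that the tensor with entries $B_{ijk} = W_{1(ij)} W_{2(ik)}$ reproduces $(W_1 x) \odot (W_2 x)$ via Equation 3:

```python
# Illustrative check of Equations 2-3: the bilinear layer output
# (W1 x) * (W2 x) elementwise equals sum_{j,k} x_j x_k B_ijk
# when B_ijk = W1_ij * W2_ik. All values below are invented.

def matvec(W, x):
    """Ordinary matrix-vector product with nested lists."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def mlp_bilinear(W1, W2, x):
    """MLP_Bilinear(x): elementwise product of two linear transforms of x."""
    return [a * b for a, b in zip(matvec(W1, x), matvec(W2, x))]

def build_b(W1, W2):
    """Third order tensor with B[i][j][k] = W1[i][j] * W2[i][k]."""
    m, n = len(W1), len(W1[0])
    return [[[W1[i][j] * W2[i][k] for k in range(n)] for j in range(n)]
            for i in range(m)]

def bilinear_form(B, u, v):
    """u .12 B .21 v: the m-vector with entries sum_{j,k} u_j v_k B_ijk (Eq. 3)."""
    m, n = len(B), len(B[0])
    return [sum(u[j] * v[k] * B[i][j][k] for j in range(n) for k in range(n))
            for i in range(m)]

W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # m x n = 3 x 2
W2 = [[-1.0, 1.0], [2.0, 2.0], [0.0, 1.0]]
x = [1.5, -2.0]

B = build_b(W1, W2)
direct = mlp_bilinear(W1, W2, x)             # (W1 x) elementwise (W2 x)
via_tensor = bilinear_form(B, x, x)          # x .12 B .21 x
agree = all(abs(a - b) < 1e-9 for a, b in zip(direct, via_tensor))
```

The agreement holds for any `W1`, `W2`, `x` of compatible shapes, since expanding the double sum recovers Equation 2 term by term.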
Thus, we can construct the tensor $B$ using the bilinear layer weights $W_1, W_2 \in \mathbb{R}^{m \times n}$ and a third order tensor $Z$ such that $Z_{ijk} = 1$ where $i = j = k$ and $0$ otherwise, because $B = W_1 \cdot_{12} Z \cdot_{21} W_2$. One helpful way to think about the $m \times n \times n$ tensor $B$ is that the column vector $B_{:jk}$ consists of the elementwise multiplication of the $j$th column of $W_1$ with the $k$th column of $W_2$.

2.3 Extending a Mathematical Framework for Transformer Circuits

When Elhage et al. [2021] analyzed the equations for 1- and 2-layer attention-only transformers, it offered interesting insights on the structure of these models. It helped to reveal QK- and OV-circuits, induction heads, and virtual attention heads, which formed the basis of much interesting follow-up work in interpretability [Olsson et al., 2022, Wang et al., 2022].

However, one of the biggest shortcomings of Elhage et al. [2021] was that the transformers they analyzed had no MLP layers. MLPs comprise around two thirds of all parameters in standard transformer language models and are thought to be necessary for a great deal of interesting behaviour [Geva et al., 2021]. The reason MLPs were excluded was that they could not be linearised, which made their analysis intractable. But, as we've seen, it is possible to describe bilinear layers using only linear operations. This means we can write linearized expressions for transformers with both attention and MLP layers! It's important to stress that the MLPs we achieve this with are close to state of the art [Shazeer, 2020]. This opens up the possibility that we may be able to formally analyze some very capable language models.

In this section, we'll identify the expression for a one-layer transformer with attention and (bilinear) MLPs. The expressions for two- and N-layer transformers are left as lengthy exercises for the reader. We'll update our notation in order to be consistent with Elhage et al. [2021], with which we expect readers to be familiar.
The input to the language model is a sequence of tokens $t$ of length $n_{\mathrm{context}}$. These are embedded by the $d_{\mathrm{model}} \times n_{\mathrm{vocab}}$ embedding matrix $W_E$. The token embeddings $x_0 = W_E t$ (which have shape $n_{\mathrm{context}} \times d_{\mathrm{model}}$) become the residual stream, which is passed through multiple residual blocks, each consisting of a multihead attention layer and an MLP layer, and each added back into the residual stream. Finally, the residual stream is unembedded by the unembedding matrix $W_U$ to make the token logits.

In Elhage et al. [2021], they assumed MLPs that had an elementwise GeLU activation function, which are very difficult to analyze. Here, we'll instead use bilinear layers. Define the bilinear MLP layer as

$$F(x) = W^m_O \big(x \cdot_{12} W^m_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} \cdot_{21} x\big) \quad (4)$$

where $W^m_O$ is the $d_{\mathrm{model}} \times d_{\mathrm{mlp}}$ output weight matrix for the MLP layer and $W^m_{I_1}, W^m_{I_2}$ are the two $d_{\mathrm{mlp}} \times d_{\mathrm{model}}$ input weight matrices for the bilinear layer.

Using the path expansion trick described by Elhage et al. [2021], the input to the MLP in a one layer transformer can be described as

$$x_1 = \Big(\mathrm{Id} + \sum_{h \in H} A^h \otimes W^h_{OV}\Big) \cdot W_E t = \Big(W_E + \sum_{h \in H} A^h \otimes W^h_{OV} W_E\Big) t \quad (5)$$

where $W^h_{OV} = W^h_O W^h_V$ and $A^h = \mathrm{softmax}^*\big(t^\top \cdot W_E^\top W^h_{QK} W_E \cdot t\big)$, in which $\mathrm{softmax}^*$ is the softmax function with autoregressive masking and $W^h_{QK} = W^{h\top}_Q W^h_K$.

Putting our definition of $x_1$ into our definition of $F(\cdot)$, we get

$$F(x_1) = W^m_O \Big(\Big(\Big(W_E + \sum_{h \in H} A^h \otimes W^h_{OV} W_E\Big) t\Big) \cdot_{12} W^m_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} \cdot_{21} \Big(\Big(W_E + \sum_{h \in H} A^h \otimes W^h_{OV} W_E\Big) t\Big)\Big) \quad (6)$$

Note that for arbitrary matrices $M, M'$, it's true that $M \cdot_{12} M' = M^\top M'^\top$.
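This transpose identity, which drives the next few manipulation steps, can be sanity-checked directly from the definition of the tensor inner product. A toy check with made-up matrices (their only constraint is that axis 1 of $M$ and axis 2 of $M'$ share a dimension):

```python
# Checks that contracting axis 1 of M with axis 2 of M' (per Eq. 1)
# gives the same matrix as the ordinary product of the two transposes.
# Example matrices are arbitrary.

M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # 2 x 3
Mp = [[1.0, -1.0],
      [0.5, 2.0],
      [3.0, 0.0]]              # 3 x 2 (axis 2 has size 2, matching M's axis 1)

rows_M, cols_M = len(M), len(M[0])
rows_Mp = len(Mp)

# (M .12 M')_{ac} = sum_b M_{ba} * M'_{cb}, per the tensor inner product
contracted = [[sum(M[b][a] * Mp[c][b] for b in range(rows_M))
               for c in range(rows_Mp)] for a in range(cols_M)]

# M^T M'^T computed the ordinary way
MT = [[M[i][j] for i in range(rows_M)] for j in range(cols_M)]         # 3 x 2
MpT = [[Mp[i][j] for i in range(rows_Mp)] for j in range(len(Mp[0]))]  # 2 x 3
product = [[sum(MT[a][b] * MpT[b][c] for b in range(rows_M))
            for c in range(rows_Mp)] for a in range(cols_M)]

identity_holds = all(abs(contracted[a][c] - product[a][c]) < 1e-9
                     for a in range(cols_M) for c in range(rows_Mp))
```

Both expressions reduce to $\sum_b M_{ba} M'_{cb}$, which is why the identity holds for any compatible pair of matrices.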
So we transpose the left hand bracket and $W^m_{I_1}$, and move the weight matrix into the brackets:

$$= W^m_O \Big(\Big(t^\top \Big(W_E^\top W^{m\top}_{I_1} + \sum_{h \in H} A^h \otimes W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1}\Big)\Big) \cdot_{12} Z \cdot_{21} W^m_{I_2} \cdot_{21} \Big(\Big(W_E + \sum_{h \in H} A^h \otimes W^h_{OV} W_E\Big) t\Big)\Big) \quad (7)$$

And next, noting that $M \cdot_{21} M' = M M'$, we move $W^m_{I_2}$ into the right hand brackets:

$$= W^m_O \Big(\Big(t^\top \Big(W_E^\top W^{m\top}_{I_1} + \sum_{h \in H} A^h \otimes W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1}\Big)\Big) \cdot_{12} Z \cdot_{21} \Big(\Big(W^m_{I_2} W_E + \sum_{h \in H} A^h \otimes W^m_{I_2} W^h_{OV} W_E\Big) t\Big)\Big) \quad (8)$$

Next, we move the $Z$ tensor into the left hand brackets:

$$= W^m_O \Big(\Big(t^\top \Big(W_E^\top W^{m\top}_{I_1} \cdot_{12} Z + \sum_{h \in H} A^h \otimes W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1} \cdot_{12} Z\Big)\Big) \cdot_{21} \Big(\Big(W^m_{I_2} W_E + \sum_{h \in H} A^h \otimes W^m_{I_2} W^h_{OV} W_E\Big) t\Big)\Big) \quad (9)$$

And combining both the left hand and right hand brackets, we get the expression for a bilinear feedforward layer:

$$= W^m_O \Big(t^\top \Big( W_E^\top W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W_E + \sum_{h \in H} A^h \otimes \big(W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W_E\big) + \sum_{h \in H} A^h \otimes \big(W_E^\top W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W^h_{OV} W_E\big) + \sum_{h \in H} \sum_{h' \in H} A^h A^{h'} \otimes \big(W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W^{h'}_{OV} W_E\big) \Big) t\Big) \quad (10)$$

We can analyze each of the terms in this equation. The first summand expresses a direct path from the token embedding matrix straight to the MLP without passing through any attention heads. The second summand expresses the components of the token embeddings that pass through the attention head and then pass into only the first MLP input matrix. The third summand is similar, but the embeddings pass through the attention heads and into the second MLP input matrix. The last summand corresponds to token embeddings that pass through the attention heads and then into both the first and second MLP input matrices.
With this expression for the MLP layer, we can now express the path expansion for the full one layer transformer, which is simply the above expression for $F(x)$ added to the token embedding-unembedding pathway (the 'direct pathway') and the pathways through the attention heads:

$$T(t) = (\mathrm{Id} \otimes W_U W_E) t + \sum_{h \in H} \big(A^h \otimes W_U W^h_{OV} W_E\big) t + W^m_O \Big(t^\top \Big( W_E^\top W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W_E + \sum_{h \in H} A^h \otimes \big(W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W_E\big) + \sum_{h \in H} A^h \otimes \big(W_E^\top W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W^h_{OV} W_E\big) + \sum_{h \in H} \sum_{h' \in H} A^h A^{h'} \otimes \big(W_E^\top W^{h\top}_{OV} W^{m\top}_{I_1} \cdot_{12} Z \cdot_{21} W^m_{I_2} W^{h'}_{OV} W_E\big) \Big) t\Big) \quad (11)$$

3 Understanding feature construction in bilinear layers

One of the problems we may face when trying to mechanistically understand neural networks is that they may be able to represent an exponential number of features. If this hypothesis resolves true, then enumerating all the features in large networks may become computationally intractable. One analogy that gives us hope is discussed by Olah [2022]: even though the input space to a particular computer program might be exponentially large, we can still say that we understand that exponentially large space of executed programs if we understand its code. In the same way, if we can understand the process by which features with certain properties are constructed from simpler primitives, we may be able to overcome the issue of having to understand an exponential number of features. In this section, which is more speculative than earlier sections, I outline why this hopeful vision seems very hard to realise in standard MLPs, but seems quite possible in bilinear layers.

3.1 Feature construction in standard MLPs is non-decomposable

Suppose we have a standard MLP layer $\mathrm{MLP}_{\mathrm{ReLU}}(x) = \sigma(Wx)$ with a ReLU activation $\sigma$ (where the bias term is omitted).
Also suppose that the input vector $x \in X$ consists of sparse linear combinations of input features, $x = D^{I\top} a^I$, where $D^I$ is a dictionary of input features represented as an $n_{\mathrm{features}} \times d_{\mathrm{input}}$ matrix and $a^I \in A^I$ is a sparse vector of coefficients (with values in $[0, \infty)$, of size $n_{\mathrm{features}}$) such that the dataset $X$ can be reconstructed from the features and their coefficients, $X = D^{I\top} A^I$. Similarly suppose there is a dictionary of output features for this layer, $D^O$, and that sparse linear combinations of those output features describe the activations observed in a large representative sample from $p_x(\mathrm{MLP}_{\mathrm{ReLU}}(x))$, i.e.

$$\mathrm{MLP}_{\mathrm{ReLU}}(x) = \sigma(Wx) = \sigma(W(D^{I\top} a^I)) = D^{O\top} a^O \quad (12)$$

Therefore $D^I$ and $D^O$ are overcomplete bases³ for the input space $X$ and output space $\mathrm{MLP}_{\mathrm{ReLU}}(X)$ respectively.

One way to view the process of feature construction is to say that output features $D^O$ are all implicitly represented in superposition in the weight matrix $W$, and that the nonlinearity, when applied elementwise to the preactivation vector $Wx$, modifies a set of default output features in order to select particular output features. One candidate for the default output features are the left singular vectors of $W$, i.e. the columns of a matrix $U$ (we'll discuss other candidates in the next section). We can thus introduce a modifier vector $m(x)$ that is a function of $x$ such that $\mathrm{MLP}_{\mathrm{ReLU}}(x) = m(x) \odot Wx = (m(x) \odot U)\Sigma V^\top x = D^{O\top} a^O$. Therefore we can view linear combinations of the output features (namely $D^{O\top} a^O$) as consisting of linear combinations of modified default output features (namely $(m(x) \odot U)\Sigma V^\top x$).

With a ReLU activation function, $m(x)$ is a binary vector of ones and zeros: $m(x)_i = 1$ where $\sigma(Wx)_i > 0$ and $m(x)_i = 0$ otherwise. In general, for vanilla MLPs with any elementwise activation function $\sigma$:

$$m(x)_i = \frac{\sigma(Wx)_i}{(Wx)_i} \quad (13)^4$$

It is the modifier vector that 'selects' from the features represented in superposition in $W$, or, equivalently, 'constructs' them by modifying the default output features.
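Equation 13 can be illustrated with a tiny ReLU example. The sketch below uses toy weights, chosen so that no preactivation is exactly zero (where Equation 13 is undefined), and confirms that $m(x) \odot Wx$ recovers $\sigma(Wx)$:

```python
# Illustrates Eq. 13 for a ReLU MLP: m(x)_i = sigma(Wx)_i / (Wx)_i is
# 1 where the preactivation is positive and 0 where it is negative, and
# m(x) * Wx elementwise reconstructs the post-activations. Toy values;
# preactivations are deliberately nonzero.

W = [[1.0, -3.0],
     [0.5, 1.0],
     [-1.0, 3.0]]
x = [2.0, 1.0]

pre = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]  # W x
post = [max(0.0, p) for p in pre]                                          # sigma(W x)
m = [po / pr for po, pr in zip(post, pre)]                                 # Eq. 13

reconstructed = [mi * pr for mi, pr in zip(m, pre)]                        # m(x) * Wx
```

As the note points out, the catch is that computing `m` required running the forward pass first (`post` is the numerator), which is exactly the circularity that makes this framing unhelpful for standard MLPs.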
If we could understand how $m(x)$ is computed in terms of input features $D^I$, then we could begin to understand why particular output features $D^O$ are constructed and not others. Unfortunately, in vanilla MLPs, the only way to calculate the value of $m(x)$ in general is Equation 13. In other words, to get the value of the modifier vector, we first have to pass the input through the network to observe what the post-activations (the numerator) and pre-activations (the denominator) are. But this is circular: we would have to already understand the nonlinear computation in the numerator in order to understand how output features are constructed. This framing doesn't simplify anything at all! Feature construction in standard MLPs can thus be considered 'non-decomposable'.

3.2 Feature construction in bilinear layers

In mechanistic interpretability, one of the major assumptions that we need to make is that we can interpret linear transformations of almost arbitrary dimensionality. They may still be large objects, but linear transformations are as simple as transformations get. For large linear transformations with non-sparse coefficients, we may have to spend more time studying them or prioritize analysis of the largest coefficients. But overall we assume that we can understand them to a satisfying extent. If we can't, then the whole business of mechanistic interpretability would be doomed even for large linear regressions, never mind deep neural networks.

³ In linear algebra, a basis of a vector space is a set of vectors from which every vector in that space can be expressed as a linear combination. An overcomplete basis is a basis where at least one element of the basis set can be removed yet the set remains a basis.

⁴ Note that $m(x)_i$ is discontinuous at $(Wx)_i = 0$.
Granting this assumption, if we could describe the modifier vector $m(x)$ from the previous section as a linear function of input features (instead of a nonlinear one), then we could begin to understand how a layer constructs output features. Fortunately, in bilinear layers the modifier vector is a linear function of the input: $\mathrm{MLP}_{\mathrm{Bilinear}}(x) = m(x) \odot (W_2 x)$, where $m(x) = W_1 x$. We'll say that the modifier vector modifies the default output features represented in $W_2$ to construct output features.

We still need to define what the default output feature directions and the modifier feature directions are concretely. Ultimately this choice will always be somewhat arbitrary, because linear transformations do not imply any particular privileged basis. As before, perhaps the most obvious candidates for the default output feature directions are the left singular vectors of $W_2$. But the largest directions in the output activations may not necessarily have a strong relationship with the weights, because the output directions depend on both the weights and the input directions. Therefore, we may be able to do better than the left singular vectors of $W_2$ by incorporating the data distribution into the choice of bases. One way might use the right singular vectors of $\mathrm{MLP}_{\mathrm{Bilinear}}(X)$ or $W_2 X$. Another (perhaps better) way is to identify default output features that are maximally statistically independent. This may be better because statistically independent directions tend to be activated somewhat sparsely and therefore might be better default output features than singular vectors, since fewer will be significantly 'activated' at any one time. We could achieve this by performing linear ICA [Hyvärinen and Oja, 2000] on the preactivations $W_2 X$. This would yield a matrix $U^{(2)}$, which is the set of vectors that are maximally statistically independent directions of the output dataset while still being a basis of it.
We can then use multiple linear regression to find the corresponding matrix $V^{(2)\top}$ such that $W_2 = U^{(2)} V^{(2)\top}$. Slightly abusing terminology, we'll call $U^{(2)}$ and $V^{(2)\top}$ the left and right independent components of $W_2$ respectively. We can define the modifier features using the same procedure, identifying the left and right independent components of $W_1 = U^{(1)} V^{(1)\top}$.

Armed with such features, we may be able to describe feature construction in bilinear networks in terms of interactions between two relatively small, relatively sparse sets of vectors (the default output features and the modifier features). We hope we can use this approach to tell mechanistic stories for how features with certain properties are constructed by the network. We might be able to do this by understanding the functional properties of the default output features and how modifier features tend to modify them. Optimistically, mechanistic stories like these may let us understand an exponentially large space of features. Whether or not such an approach will work is ultimately an empirical question, which we leave for future work. In the next section, we explore the mathematical simplicity of feature construction in bilinear layers, which gives us some reason to suspect that feature construction may be understandable.⁵

⁵ We can make further modifications to the modifier features and default output features that assist either the intuitiveness or interpretability of bilinear networks. I'll note them here but won't explore them further in this work. Improving intuitiveness: if, during training, we constrain $W_1 x$ to be low $L_2$ norm and add the one vector as a bias, the modifier vector would always be close to the one vector. In other words: $m(x) = W_1 x + \mathbf{1}$ where $||W_1 x|| \approx 0$. This would mean that modifier features simply cause slight modifications of default output features.
This addition would also help us make analysis prioritization decisions later (see Section 3.4), but fundamentally the modification isn't necessary. This addition also opens up an experimental avenue (which we won't explore here): by imposing more or less regularization on the norm, it allows us to control the amount of superposition a network is able to do. This would be an interesting experimental lever to pull, since it would allow us to directly test how much of a network's performance is due to superposition. Improving interpretability: we could choose an $L_1$ penalty for the norm constraint on the modifier vector (instead of the $L_2$ norm); or we could constrain $W_1$ to be low rank; alternatively, we could quantize the output of $W_1 x$ in order to put hard limits on the amount of superposition a network can do.

3.3 Feature construction in bilinear layers decomposes into a sum of pairwise interactions between input features

Not all layers have the same 'amount' of nonlinearity. Some are more nonlinear than others. Here we characterize the amount of nonlinearity layers can have, which sheds light on how bilinear layers differ from standard MLPs. Let $C(d^I_i, d^O_j, a^I)$ quantify the contribution of input feature $d^I_i \in D^I$ to the activation (or 'selection') of output feature $d^O_j \in D^O$. We then have the following (non-comprehensive) set of degrees of nonlinearity.

- Linear: Fully linear layers have no nonlinearity. There are therefore no interactions between input features during feature construction (since there is no modifier vector). The amount that input feature $d^I_i$ contributes to the selection of output feature $d^O_j$ is quantified simply as $C(d^I_i, d^O_j, a^I) = [W d^I_i a^I_i]^\top d^O_j$, which is just the inner product between the preactivation caused by that input feature and the output feature.

- Additively pairwise nonlinear: In this case, output features are determined by a sum of pairwise interactions between features.
For example, if input features $d^I_1, d^I_2, d^I_3$ are active in the input, the contribution of $d^I_i$ (where $i \in \{1, 2, 3\}$) to each output feature can be described as a sum of pairwise nonlinear interactions, $C(d^I_i, d^O_j, a^I) = [f(d^I_i; d^I_1, a^I_1) + f(d^I_i; d^I_2, a^I_2) + f(d^I_i; d^I_3, a^I_3)]^\top d^O_j$, where $f(\cdot)$ is some nonlinear function of the two interacting features.

- Fully nonlinear: The contribution an input feature makes to the selection of an output feature depends on every other feature in a way that can't be decomposed into a sum. The contribution of $d^I_i$ to each output feature can only be described as an all-to-all nonlinear interaction between input features that cannot be broken down into linear components: $C(d^I_i, d^O_j, a^I) = g(d^I_i; d^I_1, d^I_2, d^I_3, a^I)^\top d^O_j$, where $g(\cdot)$ is some (non-additively-pairwise) nonlinear function.

The task of understanding additively pairwise nonlinearity is easier than full nonlinearity because we can study each pairwise interaction between features and sum them up. Understanding full nonlinearity is significantly harder because there is no way to linearly decompose the function $g$. Sadly, standard MLPs are fully nonlinear. However, we show that bilinear layers are additively pairwise nonlinear, making them significantly easier to analyze.

Suppose the input to a bilinear layer, $x'$, consists of a linear combination of two input features $d^I_1$ and $d^I_2$, i.e. $x' = a_1 d^I_1 + a_2 d^I_2$. Using the re-expression of the bilinear layer, inputting $x'$ into equation 2 yields

$$(a_1 d_1 + a_2 d_2) \cdot_{12} B \cdot_{21} (a_1 d_1 + a_2 d_2) = a_1 d_1 \cdot_{12} B \cdot_{21} a_1 d_1 + a_1 d_1 \cdot_{12} B \cdot_{21} a_2 d_2 + a_2 d_2 \cdot_{12} B \cdot_{21} a_1 d_1 + a_2 d_2 \cdot_{12} B \cdot_{21} a_2 d_2 \quad (14)$$

More generally, for arbitrary linear combinations of input features:

$$(W_1 x) \odot (W_2 x) = \Big(\sum_{i \in R} a_i d_i\Big) \cdot_{12} B \cdot_{21} \Big(\sum_{i \in R} a_i d_i\Big) = \sum_{i \in R} \sum_{j \in R} a_i a_j\, d_i \cdot_{12} B \cdot_{21} d_j \quad (15)$$

where $R$ is the set of indices of nonzero feature coefficients.
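Equation 15 can be checked numerically. The sketch below (toy weights and an invented two-feature dictionary; nothing here comes from a trained model) confirms that evaluating the bilinear form on $x = \sum_i a_i d_i$ agrees with summing the pairwise terms $a_i a_j (d_i \cdot_{12} B \cdot_{21} d_j)$:

```python
# Numerical check of Eq. 15: the bilinear output on a linear combination
# of input features equals the sum of pairwise feature interactions.
# Weights, dictionary vectors, and coefficients are all invented.

W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[-1.0, 1.0], [2.0, 2.0]]
d = [[1.0, 0.0], [0.5, 1.0]]      # two input features d_1, d_2
a = [2.0, -1.0]                   # their coefficients a_1, a_2
m, n = len(W1), len(W1[0])

# B_ijk = W1_ij * W2_ik, as in Section 2.2
B = [[[W1[i][j] * W2[i][k] for k in range(n)] for j in range(n)] for i in range(m)]

def bilinear_form(u, v):
    """u .12 B .21 v: an m-vector with entries sum_{j,k} u_j v_k B_ijk."""
    return [sum(u[j] * v[k] * B[i][j][k] for j in range(n) for k in range(n))
            for i in range(m)]

# Left-hand side: evaluate directly on x = a_1 d_1 + a_2 d_2
x = [sum(a[i] * d[i][j] for i in range(len(d))) for j in range(n)]
direct = bilinear_form(x, x)

# Right-hand side: sum over pairwise interactions a_i a_j (d_i .12 B .21 d_j)
pairwise = [0.0] * m
for i in range(len(d)):
    for j in range(len(d)):
        term = bilinear_form(d[i], d[j])
        pairwise = [p + a[i] * a[j] * t for p, t in zip(pairwise, term)]

decomposes = all(abs(u - v) < 1e-9 for u, v in zip(direct, pairwise))
```

The agreement is guaranteed by bilinearity: each side expands to the same double sum over feature indices, which is exactly what makes the layer "additively pairwise nonlinear".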
Equation 15 shows that, although all features interact to determine the output features, these interactions can be understood as a sum of pairwise interactions between features. Hence bilinear layers are only additively pairwise nonlinear. We hope that this simplicity can be leveraged to tell simple stories about how particular input features (hopefully sparsely) activate particular default output features and modifier features. Then, if we understand the functional properties of those default output features and the kinds of functional modifications that those modifier features make, we may be able to understand the properties of the output features.

3.4 How should we study feature construction?

At this early stage, it's not totally clear how best to analyze the structure of bilinear networks. What is clear is that doing so will be easier than analyzing fully nonlinear computations, since we're simply studying the structure of tensors, which is a relatively well understood domain in mathematics. In advance of empirical results, I speculate on a few non-mutually exclusive ways to proceed in this section.

1. Large coefficients of $B$: As discussed at the beginning of section 2, when interpreting any linear transformation, there may be so many coefficients that it may be necessary to prioritize our analyses by studying only the largest coefficients. One way to leverage this is simply to study the largest coefficients of $B$ and how they would influence interactions between commonly observed pairs or groups of input features.

2. Tensor decomposition: Building on (1), we could perform Higher Order Singular Value Decomposition (HOSVD) and study the structure of the most influential ranks of the tensor.

3. Maximally modified default output features: Recall that one way to view the bilinear network is that one side of the elementwise multiplication modifies the linear transformation on the other side.
This suggests a way to prioritize the analysis of how particular features are constructed: for each input feature, we should prioritize analysis of the most modified default output features. Concretely, define
$$U^{(2, d_i)} := d_i^\top W_1 \cdot_{12} Z \cdot_{21} U^{(2)}.$$
This is the set of output features caused by the modifications that input feature $d_i$ makes to the default output features $U^{(2)}$. Then, for each input feature $d_i$, we should study the top $k$ most modified default output features, i.e.
$$\operatorname*{arg\,top\text{-}k}_{l} \left( \left\| U^{(2)}_{:,l} - U^{(2, d_i)}_{:,l} \right\| \right) \qquad (16)$$
This would let us focus on the most significant modifications that a given input feature makes to the default output features. But we can prioritize our analyses further than that. The modifications that an input feature makes to the default output features don't matter unless the default output feature is actually activated by that feature or some other feature that is simultaneously present in $x$. Therefore we can identify pairs of features $(d_l, d_m)$ that are correlated (or that have often appeared at the same time) and where $U^{(2, d_l)}$ is both one of the default output features that is most modified by $d_m$ and simultaneously one of the default output features that is most activated by $d_m$.

4 Conclusion

The simplicity of bilinear layers makes formal analysis much easier than for standard MLPs. One of the most important things bilinear layers give us is analysable expressions for performant transformers with both attention heads and MLP layers. I hope that this will eventually let us formally analyze the structure of the representations of large language models in this class. This might reveal interesting features and circuits in a similar way that the mathematical framework for attention-only transformers introduced by Elhage et al. [2021] helped to reveal QK- and OV-circuits, induction heads, and virtual attention heads. Curiosity aside, an expression for models with bilinear layers may let us make stronger claims about safety.
For instance, it may let us more directly compare circuit structure in different models, and enable us to make inferences about model behaviour without necessarily running the model. Another potential research direction is analyzing learning dynamics. Models with bilinear layers seem like they might lend themselves to mathematical analysis in a similar fashion to the deep linear layers studied by Saxe et al. [2013]. Learning dynamics may be important for safety, since understanding them may be necessary to be able to predict dangerous model behaviours before they emerge. Lastly, and most speculatively, bilinear layers offer the potential to understand the mechanisms of feature construction, which may be necessary for understanding a potentially exponentially large number of features represented in language models. There is still much empirical work to do to evaluate whether intuiting the mechanisms of feature construction is possible. Overall, I hope that this note might pique the interest of the interpretability community by highlighting an architecture that is much gentler on the intuitions than standard MLPs.

Acknowledgements

I thank Trenton Bricken for helpful discussions that initiated my search for layers that could be described in terms of higher order tensors. I thank Beren Millidge, Sid Black, and Dan Braun for helpful discussions and detailed feedback on this work.

References

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. CoRR, abs/1612.08083, 2016. URL http://arxiv.org/abs/1612.08083.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah.
A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. Softmax linear units. Transformer Circuits Thread, 2022a. URL https://transformer-circuits.pub/2022/solu/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022b. URL https://transformer-circuits.pub/2022/toy_model/index.html.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories, 2021.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2020.

A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(00)00026-5. URL https://www.sciencedirect.com/science/article/pii/S0893608000000265.

Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, page 641–648, New York, NY, USA, 2007. Association for Computing Machinery.
ISBN 9781595937933. doi: 10.1145/1273496.1273577. URL https://doi.org/10.1145/1273496.1273577.

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, page 807–814, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077.

Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases, 2022. URL https://transformer-circuits.pub/2022/mech-interp-essay/index.html.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads, 2022.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2017.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013.

Noam Shazeer. GLU variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt.
Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022.

A Tensor inner product examples

The definition of tensor inner product we use is
$$U^{(n)} \cdot_{jk} V^{(m)} = T^{(n+m-2)}$$
where
$$T_{\gamma_1 \cdots \gamma_{j-1} \gamma_{j+1} \cdots \gamma_n \, \gamma'_1 \cdots \gamma'_{k-1} \gamma'_{k+1} \cdots \gamma'_m} = \sum_\beta U_{\gamma_1 \cdots \gamma_{j-1} \beta \gamma_{j+1} \cdots \gamma_n} V_{\gamma'_1 \cdots \gamma'_{k-1} \beta \gamma'_{k+1} \cdots \gamma'_m}$$

Example 1: $U^{(1)} \cdot_{11} V^{(1)} = T^{(0)} = \sum_\beta u_\beta v_\beta = u^\top v$, which is just the standard inner product, resulting in a scalar.

Example 2: $U^{(2)} \cdot_{21} V^{(1)} = T^{(1)}$ where $T_i = \sum_\beta U_{i\beta} v_\beta$. This is multiplication of a matrix on the left and a vector on the right: $T = Uv$.

Example 3: $U^{(2)} \cdot_{11} V^{(1)} = T^{(1)}$ where $T_i = \sum_\beta U_{\beta i} v_\beta$. This is equivalent to multiplication of a transposed matrix on the left and a vector on the right: $T = U^\top v$.

Example 4: $U^{(1)} \cdot_{12} V^{(2)} = T^{(1)}$ where $T_i = \sum_\beta u_\beta V_{i\beta}$. This is equivalent to multiplication of a transposed vector on the left and a transposed matrix on the right: $T = u^\top V^\top$. Note that $T$ is a rank-one tensor, so $T = u^\top V^\top = Vu$, since tensor notation disposes of the convention that vectors are column vectors or row vectors; instead they are just rank-one tensors. We somewhat abuse notation in this work by assuming standard vector-matrix conventions for multiplication unless the tensors we're dealing with are rank three or above, in which case we use tensor inner product notation.

Example 5: $U^{(3)} \cdot_{11} V^{(1)} = T^{(2)}$, which is the matrix that is a weighted sum of slices of the rank-three tensor: $T = \sum_\beta U_{\beta::} \, v_\beta$. If we imagine the rank-three tensor as a cube, this example flattens the tensor along its height by taking the inner product between $v$ and every 3-d column of $U$.

Example 6: $U^{(2)} \cdot_{23} V^{(3)} = T^{(3)}$, the rank-three tensor where $T_{i::} = \sum_\beta U_{:\beta} V_{i:\beta}$. If we imagine tensor $V$ as a cube, here we take each front-to-back row and take its inner product with the corresponding row $i$ of matrix $U$.
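These tensor inner products map directly onto `numpy.einsum` contractions. The snippet below is my own sketch (dimension sizes and variable names are arbitrary); it checks Examples 1 through 6 against ordinary matrix-vector algebra:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
u, v = rng.normal(size=n), rng.normal(size=n)
M = rng.normal(size=(n, n))        # a generic rank-two tensor
U3 = rng.normal(size=(n, n, n))    # a generic rank-three tensor
V3 = rng.normal(size=(n, n, n))

# Example 1: U(1) ._11 V(1) -> scalar; the standard inner product u^T v.
assert np.allclose(np.einsum('b,b->', u, v), u @ v)

# Example 2: M ._21 v: T_i = sum_b M_ib v_b, i.e. T = M v.
assert np.allclose(np.einsum('ib,b->i', M, v), M @ v)

# Example 3: M ._11 v: T_i = sum_b M_bi v_b, i.e. T = M^T v.
assert np.allclose(np.einsum('bi,b->i', M, v), M.T @ v)

# Example 4: u ._12 M: T_i = sum_b u_b M_ib, i.e. T = M u (rank-one result).
assert np.allclose(np.einsum('b,ib->i', u, M), M @ u)

# Example 5: U3 ._11 v: T = sum_b v_b * U3[b, :, :], flattening the cube.
T5 = np.einsum('bij,b->ij', U3, v)
assert np.allclose(T5, sum(v[b] * U3[b] for b in range(n)))

# Example 6: M ._23 V3: contract M's 2nd index with V3's 3rd index;
# remaining indices are M's first, then V3's first two.
T6 = np.einsum('ab,ijb->aij', M, V3)
assert T6.shape == (n, n, n)
```

The index strings make the contracted axis explicit in each case, which is exactly the role the $\cdot_{jk}$ subscripts play in the appendix's notation.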