Paper deep dive

Not All Language Model Features Are Linear

Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

Year: 2024 | Venue: arXiv preprint | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 150

Models: GPT-2, Mistral-7B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 8:15:44 PM

Summary

The paper challenges the 'Linear Representation Hypothesis' by demonstrating that language models (GPT-2, Mistral 7B, Llama 3 8B) utilize inherently multi-dimensional, irreducible features rather than just one-dimensional ones. The authors formalize multi-dimensional features, develop a method using sparse autoencoders to discover them, and provide evidence that circular features (e.g., days of the week, months) are fundamental units of computation for modular arithmetic tasks.

Entities (6)

GPT-2 · language-model · 100%
Llama-3-8B · language-model · 100%
Mistral 7B · language-model · 100%
Linear Representation Hypothesis · hypothesis · 95%
Multi-dimensional features · concept · 95%
Sparse Autoencoders · methodology · 95%

Relation Signals (4)

GPT-2 → exhibits → Multi-dimensional features

confidence 95% · we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2

Mistral 7B → exhibits → Multi-dimensional features

confidence 95% · we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in... Mistral 7B

Sparse Autoencoders → discovers → Multi-dimensional features

confidence 90% · design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features

Multi-dimensional features → usedin → Modular arithmetic

confidence 90% · identify tasks where these exact circles are used to solve computational problems involving modular arithmetic

Cypher Suggestions (2)

Find all language models that exhibit multi-dimensional features. · confidence 90% · unvalidated

MATCH (m:LanguageModel)-[:EXHIBITS]->(f:Feature {type: 'multi-dimensional'}) RETURN m.name

Identify methodologies used to discover specific feature types. · confidence 85% · unvalidated

MATCH (m:Methodology)-[:DISCOVERS]->(f:Feature) RETURN m.name, f.name

Abstract

Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we examine the continuity of the days of the week feature in Mistral 7B. Overall, our work argues that understanding multi-dimensional features is necessary to mechanistically decompose some model behaviors.

Full Text

150,106 characters extracted from source content.

Not All Language Model Features Are One-Dimensionally Linear

Joshua Engels (MIT, jengels@mit.edu), Eric J. Michaud (MIT & IAIFI, ericjm@mit.edu), Isaac Liao (MIT, iliao@mit.edu), Wes Gurnee (MIT, wesg@mit.edu), Max Tegmark (MIT & IAIFI, tegmark@mit.edu)

Abstract

Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts (“features”) in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we examine the continuity of the days of the week feature in Mistral 7B. Overall, our work argues that understanding multi-dimensional features is necessary to mechanistically decompose some model behaviors.

1 Introduction

Language models trained for next-token prediction on large text corpora have demonstrated remarkable capabilities, including coding, reasoning, and in-context learning (Bubeck et al., 2023; Achiam et al., 2023; Anthropic, 2024; Team et al., 2023).
However, the specific algorithms models learn to achieve these capabilities remain largely a mystery to researchers; we do not understand how language models write poetry. Mechanistic interpretability is a field that seeks to address this gap by reverse-engineering trained models from the ground up into variables (features) and the programs (circuits) that process these variables (Olah et al., 2020).

One mechanistic interpretability research direction has focused on understanding toy models in detail. This work has found multi-dimensional representations of inputs such as lattices (Michaud et al., 2024) and circles (Liu et al., 2022; Nanda et al., 2023a), and has successfully reverse-engineered the algorithms that models use to manipulate these representations. A separate direction has identified one-dimensional representations of high level concepts and quantities in large language models (Gurnee & Tegmark, 2023; Marks & Tegmark, 2023; Heinzerling & Inui, 2024; Bricken et al., 2023). These findings have led to the linear representation hypothesis (LRH): a hypothesis which has historically claimed both that (1) all representations in pretrained large language models lie along one-dimensional lines, and (2) model states are a simple sparse sum of these representations (Park et al., 2023; Bricken et al., 2023). In this work, we specifically call into question the first part of the LRH. [Footnote 1: An earlier version of this manuscript sparked discussion in the mechanistic interpretability community on the distinction between non-linear features and multi-dimensional features, and in fact this discussion directly led to a consensus around the two part LRH we describe above (see Olah, 2024; Csordás et al., 2024; Mendel, 2024; Kantamneni & Tegmark, 2025). To clarify, we agree with these discussions, and believe the multi-dimensional features we find are “linear” in the sense that they are contained in a low-dimensional linear subspace, but “non-linear” in the sense that this low-dimensional subspace is not one-dimensional (and this is the sense we mean in the title).]

For the most part, these two directions have been disconnected: Yedidia (2023a) and Gould et al. (2023) find intriguing hints of circular language model representations, and Bricken et al. (2023) speculate about the existence of feature manifolds, but these brief results only serve to further emphasize the lack of a unifying and satisfying perspective on the nature of language model features. In this work, we seek to bridge this gap by formalizing, investigating, and systematically searching for multi-dimensional language model features.

1.1 Contributions

1. In Section 3, we generalize the one-dimensional definition of a language model feature to multi-dimensional features and provide an updated multi-dimensional superposition hypothesis to account for these new features.

2. In Section 4, we build on the definitions proposed in Section 3 to develop a theoretically grounded and empirically practical test that uses sparse autoencoders to find irreducible features. Using this test, we identify multi-dimensional representations automatically in GPT-2 and Mistral 7B, including circular representations for the day of the week and month of the year.

3. In Section 5, we show that Mistral 7B and Llama 3 8B use these circular representations when performing modular addition in days of the week and in months of the year. To the best of our knowledge, we are the first to find causal circular representations of concepts in a language model. We additionally find that the model’s circular representations respect a continuous notion of time.
Figure 1: Circular representations of days of the week, months of the year, and years of the 20th century in layer 7 of GPT-2-small colored by the token they fire on. These representations were discovered via clustering SAE dictionary elements, described in Section 4. Points are colored according to the token which created the representation. See Fig. 14 for other axes and Fig. 15 for similar plots for Mistral 7B.

2 Related Work

Linear Representations: Early word embedding methods such as GloVe and Word2vec, although only trained using co-occurrence data, contained directions in their vector spaces corresponding to semantic concepts (Mikolov et al., 2013b; Pennington et al., 2014; Mikolov et al., 2013a). Recent research has found similar evidence of one-dimensional linear representations in sequence models trained only on next token prediction, including Othello board positions (Nanda et al., 2023b; Li et al., 2022), the truth value of assertions (Marks & Tegmark, 2023), and numeric quantities such as longitude, latitude, birth year, and death year (Gurnee & Tegmark, 2023; Heinzerling & Inui, 2024). These results have inspired the linear representation hypothesis (Park et al., 2023; Elhage et al., 2022) defined above. Jiang et al. (2024) provide theoretical evidence for this hypothesis, assuming a latent (binary) variable-based model of language. Empirically, Bricken et al. (2023) and Cunningham et al. (2023) successfully use sparse autoencoders to break down a model’s feature space into an over-complete basis of linear features. These works assume that the number of linear features stored in superposition exceeds the model dimensionality (Elhage et al., 2022).

Multi-Dimensional Representations: There has been comparatively little research on multi-dimensional features in language models. Shai et al. (2024) predict and verify that a transformer trained on a hidden Markov model uses a fractal structure to represent the probability of each next token, a clear example of a necessary multi-dimensional feature, but the analysis is restricted to a toy setting. Yedidia (2023a; b) finds that GPT-2 learned position vectors form a helix, which implies a circle when “viewed” from below. Thus, we are not the first to find a circular feature in a language model. However, our work finds circular features that represent latent concepts from text, while the GPT-2 learned position vectors are specific to tokenization, separate from the rest of the model parameters, and causally implicated only due to positional attention masking. Another suggestive result, due to Hanna et al. (2024), is the presence of a U-shape in the representation of numbers between 0 and 100; however, Hanna et al. (2024) find that this representation is not causal, and they only show it exists within a specific prompt distribution. Recent work on dictionary learning (Bricken et al., 2023) has speculated about multi-dimensional feature manifolds; our work is similar to this direction and develops the idea of feature manifolds theoretically and empirically. Finally, in a separate direction, Black et al. (2022) argue for interpreting neural networks through the polytopes they split the input space into, and identify regions of low polytope density as “valid” regions for a potential linear representation.

Circuits: Circuits research seeks to identify and understand circuits, subsets of a model (usually represented as a directed acyclic graph) that explain specific behaviors (Olah et al., 2020). The base units that form a circuit can be layers, neurons (Olah et al., 2020), or sparse autoencoder features (Marks et al., 2024). In the first circuits-style work, Olah et al. (2020) found line features that were combined into curve detection features in the InceptionV1 image model.
More recent work has examined language models, for example the indirect object identification circuit in GPT-2 (Wang et al., 2022). Given the difficulty of designing bespoke experiments, there has been increased research in automated circuit discovery methods (Marks et al., 2024; Conmy et al., 2023; Syed et al., 2023).

Interpretability for Arithmetic Problems: Liu et al. (2022) study models trained on modular arithmetic problems $a + b = c \pmod{m}$ and find that models that generalize well have circular representations for $a$ and $b$. Further work by Nanda et al. (2023a) and Zhong et al. (2024) shows that models use these circular representations to compute $c$ via a “clock” algorithm and a separate “pizza” algorithm. These papers are limited to the case of a small model trained only on modular arithmetic. Another direction has studied how large language models perform basic arithmetic, including a circuits level description of the greater-than operation in GPT-2 (Hanna et al., 2024) and addition in GPT-J (Stolfo et al., 2023). These works find that to perform a computation, models copy pertinent information to the token before the computed result and perform the computation in the subsequent MLP layers. Finally, recent work by Gould et al. (2023) investigates language models’ ability to increment numbers and finds linear features that fire on tokens equivalent modulo 10.

3 Definitions

This section focuses on hypotheses for how hidden states of language models can be decomposed into sums of functions of the input (features). We focus on $L$-layer transformer models $M$ that take in token input $\mathbf{t} = (t_1, \ldots, t_n)$ from input token distribution $\mathcal{T}$, have hidden states $\mathbf{x}_{1,l}, \ldots, \mathbf{x}_{n,l}$ for layers $l$, and output logit vectors $\mathbf{y}_1, \ldots, \mathbf{y}_n$.
Given a set of inputs $T$, we let $X_{i,l}$ be the set of all corresponding $\mathbf{x}_{i,l}$. We write matrices in capital bold, vectors and vector valued functions in lowercase bold, and sets in capital non-bold.

3.1 Multi-Dimensional Features

Definition 1 (Feature). We define a $d_f$-dimensional feature as a function $\mathbf{f}$ that maps a subset of the input space into $\mathbb{R}^{d_f}$. We say that a feature is active on the aforementioned subset. The input token distribution $\mathcal{T}$ induces a $d_f$-dimensional probability distribution over feature vectors $\mathbf{f}(\mathbf{t})$.

As an example, let $n = 1$ (so inputs are single tokens) and consider a feature $\mathbf{f}$ that maps integer tokens to their integer values in $\mathbb{R}^1$. Then $\mathbf{f}$ is a 1-dimensional feature that is active on integer tokens, and the distribution of $\mathbf{f}(\mathbf{t})$ is the marginal integer occurrence distribution from the token distribution.

How can we differentiate "true" multi-dimensional features from sums of lower dimensional features? We make this distinction by examining the reducibility of a potential multi-dimensional feature. That is, $\mathbf{f}$ is a "true" multi-dimensional feature if it cannot be written as the sum of two statistically independent features and it cannot be written as the sum of two non-co-occurring features. Formally, we have the following definition:

Definition 2. A feature $\mathbf{f}$ is reducible into features $\mathbf{a}$ and $\mathbf{b}$ if there exists an affine transformation

$$\mathbf{f} \mapsto \mathbf{R}\mathbf{f} + \mathbf{c} \equiv \begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \tag{1}$$

for some orthonormal $d_f \times d_f$ matrix $\mathbf{R}$ and additive constant $\mathbf{c}$, such that the transformed feature probability distribution $p(\mathbf{a}, \mathbf{b})$ satisfies at least one of these conditions:

1. $p$ is separable, i.e., factorizable as a product of its marginal distributions: $p(\mathbf{a}, \mathbf{b}) = p(\mathbf{a})\,p(\mathbf{b})$.

2. $p$ is a mixture, i.e., a sum of disjoint distributions, one of which is lower dimensional: $p(\mathbf{a}, \mathbf{b}) = w\,p(\mathbf{a})\,\delta(\mathbf{b}) + (1 - w)\,p(\mathbf{a}, \mathbf{b})$.

Here, $p$ is a probability density function that is conditional on the subset of $\mathcal{T}$ that $\mathbf{f}$ is active on, $\delta$ is the Dirac delta function, and $0 < w < 1$. By two probability distributions being disjoint, we mean that they have disjoint support (there is no set where both have positive probability measure, or equivalently the two features $\mathbf{a}$ and $\mathbf{b}$ cannot be active at the same time). In Eq. 1, $\mathbf{a}$ is the first $k$ components of the vector $\mathbf{R}\mathbf{f} + \mathbf{c}$ and $\mathbf{b}$ is the remaining $d_f - k$ components. When $p$ is separable or a mixture, we also say that $\mathbf{f}$ is separable or a mixture. We term a feature irreducible if it is not reducible, i.e., if no rotation and translation makes it separable or a mixture.

An example of a feature that is a mixture is a one hot encoding along a simplex; an example of a feature that is separable is a normal distribution. [Footnote 2: since any multidimensional Gaussian can be rotated to have a diagonal covariance matrix.] In natural language, a mixture might be a one hot encoding of “breed of dog”, while a separable distribution might be the “latitude” and “longitude” of location tokens. In practice, the mixture and separability definitions may not be precisely satisfied. Thus, we soften our definitions to permit degrees of reducibility:

Definition 3 (Separability Index and $\epsilon$-Mixture Index). Consider a feature $\mathbf{f}$. The separability index $S(\mathbf{f})$ measures the minimal mutual information between all possible $\mathbf{a}$ and $\mathbf{b}$ defined in Eq. 1:

$$S(\mathbf{f}) \equiv \min I(\mathbf{a}; \mathbf{b}) \tag{2}$$

where $I$ denotes the mutual information.
Smaller values of $S(\mathbf{f})$ mean that $\mathbf{f}$ is more separable. The $\epsilon$-mixture index $M_\epsilon(\mathbf{f})$ tests how often $\mathbf{f}$ can be projected near zero while it is active:

$$M_\epsilon(\mathbf{f}) = \max_{\mathbf{v} \in \mathbb{R}^{d_f},\; c \in \mathbb{R}} \mathbb{P}_{\mathbf{t} \in \mathcal{T}}\left( |\mathbf{v} \cdot \mathbf{f}(\mathbf{t}) + c| < \epsilon \sqrt{\mathbb{E}\left[ (\mathbf{v} \cdot \mathbf{f}(\mathbf{t}) + c)^2 \right]} \right) \tag{3}$$

Larger values of $M_\epsilon(\mathbf{f})$ mean that $\mathbf{f}$ is more of a mixture.

In Appendix B, we expand on the intuition behind why the separability and $\epsilon$-mixture indices as defined here correspond to weakened versions of Definition 2. We develop optimization procedures to empirically solve for the separability and $\epsilon$-mixture indices of two dimensional feature distributions. At a high level, the separability procedure iterates over a sweep of rotations and estimates the mutual information between the axes for each angle, while the $\epsilon$-mixture index procedure performs gradient descent to find the $\epsilon$ band that contains the largest possible fraction of the feature distribution. For more details on the implementation of the tests, see Section B.2. In Section 4, we apply these empirical tests to real language model feature distributions to find irreducible multi-dimensional features; we show the detailed test results on the “days of the week” cluster in Fig. 2.

3.2 Superposition

In this section, we propose an updated superposition hypothesis (Elhage et al., 2022) that takes into account multi-dimensional features. First, we restate the original superposition hypothesis:

Definition 4 ($\delta$-orthogonal matrices). Two matrices $\mathbf{A}_1 \in \mathbb{R}^{d \times d_1}$ and $\mathbf{A}_2 \in \mathbb{R}^{d \times d_2}$ are $\delta$-orthogonal if $|\mathbf{x}_1 \cdot \mathbf{x}_2| \le \delta$ for all unit vectors $\mathbf{x}_1 \in \mathrm{colspace}(\mathbf{A}_1)$ and $\mathbf{x}_2 \in \mathrm{colspace}(\mathbf{A}_2)$.

Hypothesis 1 (One-Dimensional Superposition Hypothesis, paraphrased from Elhage et al., 2022). Hidden states $\mathbf{x}_{i,l}$ are the sum of many ($\gg d$) sparse one-dimensional features $f_i$ and pairwise $\delta$-orthogonal vectors $\mathbf{v}_i$ such that $\mathbf{x}_{i,l}(\mathbf{t}) = \sum_i \mathbf{v}_i f_i(\mathbf{t})$. We set $f_i(\mathbf{t})$ to zero when $\mathbf{t}$ is outside the domain of $f_i$.

In contrast, our new superposition hypothesis posits independence between irreducible multi-dimensional features instead of unknown levels of independence between one-dimensional features:

Hypothesis 2 (Multi-Dimensional Superposition Hypothesis, changes underlined). Hidden states $\mathbf{x}_{i,l}$ are the sum of many ($\gg d$) sparse low-dimensional irreducible features $\mathbf{f}_i$ and pairwise $\delta$-orthogonal matrices $\mathbf{V}_i \in \mathbb{R}^{d \times d_{\mathbf{f}_i}}$ such that $\mathbf{x}_{i,l}(\mathbf{t}) = \sum_i \underline{\mathbf{V}_i}\, \mathbf{f}_i(\mathbf{t})$. We set $\mathbf{f}_i(\mathbf{t})$ to zero when $\mathbf{t}$ is outside the domain of $\mathbf{f}_i$.
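The δ-orthogonality condition in Definition 4 can be checked numerically. The following is a minimal sketch, not the authors' code: it relies on the standard fact that the largest value of |x1 · x2| over unit vectors in two subspaces equals the largest singular value of Q1ᵀQ2, where Q1 and Q2 are orthonormal bases of the subspaces. The function names are hypothetical, and the sketch assumes both matrices have full column rank.

```python
import numpy as np


def max_subspace_overlap(A1: np.ndarray, A2: np.ndarray) -> float:
    """Largest |x1 . x2| over unit vectors x1 in colspace(A1), x2 in colspace(A2).

    Assumes A1 and A2 have full column rank (an assumption of this sketch).
    """
    # Orthonormal bases for the two column spaces.
    Q1, _ = np.linalg.qr(A1)
    Q2, _ = np.linalg.qr(A2)
    # Singular values of Q1^T Q2 are the cosines of the principal angles
    # between the subspaces; the largest one is the maximum overlap.
    singular_values = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return float(singular_values[0])


def is_delta_orthogonal(A1: np.ndarray, A2: np.ndarray, delta: float) -> bool:
    """Check the condition of Definition 4 for a given tolerance delta."""
    return max_subspace_overlap(A1, A2) <= delta


# Illustrative inputs: a plane and two candidate lines in R^3.
plane_xy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
line_z = np.array([[0.0], [0.0], [1.0]])
line_xz = np.array([[1.0], [0.0], [1.0]])
```

Here `plane_xy` and `line_z` are exactly orthogonal, while `line_xz` sits at 45 degrees to the plane, so it is δ-orthogonal to `plane_xy` only for δ of at least cos(45°).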
Note that since multi-dimensional features can be written as the sums of projections of lower-dimensional features, our new superposition hypothesis is a stricter version of Hypothesis 1. In the next section, we will explore empirical evidence for our hypothesis, while in Appendix A, we prove upper and lower bounds on the number of $\delta$-almost orthogonal matrices $\mathbf{V}_i$ that can be packed into $d$ dimensional space.

Figure 2: Empirical $\epsilon$-mixture index and separability index for the “days of the week” cluster along PCA components 2 and 3. Left: The $\epsilon$ band parameterized by $\mathbf{v}$ and $c$ that the optimization procedure found contained the highest fraction of points. Mid: Dot products of points in the feature distribution with the $\epsilon$ band; $M_\epsilon(\mathbf{f})$ is the percent of dot products within $\epsilon = 0.1$ of 0. Right: Estimated mutual information for different rotations of the space; $S(\mathbf{f})$ is the minimum over all rotations. This point cloud has a lower $\epsilon$-mixture index and higher separability index than PCA projections within typical clusters (see Fig. 3), indicating that it is more likely to be an irreducible multi-dimensional feature.

4 Sparse Autoencoders Find Multi-Dimensional Features

In this section, we describe a method to identify multi-dimensional features in language model hidden states using sparse autoencoders (SAEs). Sparse autoencoders (SAEs) deconstruct model hidden states into sparse vector sums from an over-complete basis (Bricken et al., 2023; Cunningham et al., 2023).
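To make "sparse vector sums from an over-complete basis" concrete, here is a small illustrative sketch of a one-layer SAE forward pass and a per-example loss. It is not the training code used in the paper: the function names are hypothetical, there are no bias terms, and the sparsity term is an L1 penalty, chosen here as one common differentiable choice.

```python
import numpy as np


def sae_forward(x: np.ndarray, E: np.ndarray, D: np.ndarray):
    """Encode hidden state x (shape (d,)) and decode it again.

    E has shape (m, d) (encoder) and D has shape (d, m) (decoder), with m > d
    so that the m columns of D form an over-complete basis.
    """
    acts = np.maximum(E @ x, 0.0)  # ReLU keeps activations non-negative
    recon = D @ acts               # sum of dictionary columns weighted by acts
    return acts, recon


def sae_loss(x: np.ndarray, E: np.ndarray, D: np.ndarray, lam: float) -> float:
    """Squared reconstruction error plus an L1 sparsity penalty (assumed here)."""
    acts, recon = sae_forward(x, E, D)
    return float(np.sum((x - recon) ** 2) + lam * np.sum(np.abs(acts)))


# Toy dictionary in d = 2 with m = 3 elements: e1, e2, and -e1.
D_toy = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
E_toy = D_toy.T
x_toy = np.array([-2.0, 3.0])
acts_toy, recon_toy = sae_forward(x_toy, E_toy, D_toy)
```

In this toy case the negative first coordinate of `x_toy` is carried by the third dictionary element, so the reconstruction is exact while one of the three activations stays at zero.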
For hidden states $X_{i,l}$, a one-layer SAE of size $m$ with sparsity penalty $\lambda$ minimizes the following dictionary learning loss (Bricken et al., 2023; Cunningham et al., 2023):

$$\mathrm{DL}(X_{i,l}) = \operatorname*{arg\,min}_{\mathbf{E} \in \mathbb{R}^{m \times d},\; \mathbf{D} \in \mathbb{R}^{d \times m}} \sum_{\mathbf{x}_{i,l} \in X_{i,l}} \left[ \lVert \mathbf{x}_{i,l} - \mathbf{D} \cdot \mathrm{ReLU}(\mathbf{E} \cdot \mathbf{x}_{i,l}) \rVert_2^2 + \lambda \lVert \mathrm{ReLU}(\mathbf{E} \cdot \mathbf{x}_{i,l}) \rVert_0 \right] \tag{4}$$

In practice, the $L_0$ loss on the last term is relaxed to $L_p$ for $0 < p \le 1$ to make the loss differentiable. We call the $m$ columns of $\mathbf{D}$ (vectors in $\mathbb{R}^d$) dictionary elements.

We now argue that SAEs can discover irreducible multi-dimensional features by clustering $\mathbf{D}$. We will consider a simple form of clustering: build a complete graph on $\mathbf{D}$ with edge weights equal to the cosine similarity between dictionary elements, prune all edges below a threshold $T$, and then set the clusters equal to the connected components of the graph. If we now consider the spaces spanned by each cluster, they will be approximately $T$-orthogonal by construction, since their basis vectors are all $T$-orthogonal. Now, consider some irreducible two-dimensional feature $\mathbf{f}$; we claim that if the SAE is large enough and $\mathbf{f}$ is active enough such that the SAE can reconstruct $\mathbf{f}$ when $\mathbf{f}$ is active, one of the clusters is likely to be exactly equal to $\mathbf{f}$.
If $\mathbf{D}$ includes just two dictionary elements spanning $\mathbf{f}$, then these elements both must have nonzero activations post-ReLU to reconstruct $\mathbf{f}$ (otherwise $\mathbf{f}$ is a mixture). Because of the sparsity penalty in Eq. 4, this two-vector solution to reconstruct $\mathbf{f}$ is disincentivized, so instead the dictionary is likely to learn many elements that span $\mathbf{f}$. These dictionary elements will then have a high cosine similarity, and so the edges between them will not be pruned away during the clustering process; hence, they will be in a cluster. Thus, we have a way to operationalize Hypothesis 2: clustering $\mathbf{D}$ finds $T$-orthogonal subspaces, and if irreducible multi-dimensional features exist, they are likely to be equal to some of these subspaces. This suggests a natural approach to using sparse autoencoders to search for irreducible multi-dimensional features:

1. Cluster dictionary elements by their pairwise cosine similarity. We use both the simple similarity-based pruning technique described above, as well as spectral clustering; see Appendix F for details, including comments on scalability.

2. For each cluster, run the SAEs on all $\mathbf{x}_{i,l} \in X_{i,l}$ and ablate all dictionary elements not in the cluster. This will give the reconstruction of each $\mathbf{x}_{i,l}$ restricted to the cluster found in step 1 (if no cluster dictionary elements are non-zero for a given point, we ignore the point).

3. Examine the resulting reconstructed activation vectors for irreducible multi-dimensional features. This step can be done manually by visually inspecting the PCA projections for known irreducible multi-dimensional structures (e.g. circles, see Fig. 10) or automatically by passing the PCA projections to the tests for Definition 3.

Figure 3: Mixture index and separability index of GPT-2 features. Features from Fig. 1, which we had manually identified, score highly as candidate multidimensional features with these metrics.
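Steps 1 and 2 above can be sketched in a few lines. This is an illustrative sketch of the simple similarity-based pruning variant only (not spectral clustering, and not the authors' released code); the function names, the use of union-find for connected components, and the toy data are assumptions made for the example. It also assumes no dictionary element has zero norm.

```python
import numpy as np


def cluster_dictionary(D: np.ndarray, threshold: float) -> list[list[int]]:
    """Cluster the columns of D (shape (d, m)) by cosine similarity.

    Pairs with cosine similarity >= threshold stay connected; clusters are
    the connected components of the resulting graph.
    """
    m = D.shape[1]
    unit = D / np.linalg.norm(D, axis=0, keepdims=True)
    sim = unit.T @ unit
    parent = list(range(m))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters: dict[int, list[int]] = {}
    for i in range(m):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def cluster_reconstruction(acts: np.ndarray, D: np.ndarray, cluster: list[int]) -> np.ndarray:
    """Reconstruct activations using only the dictionary elements in `cluster`.

    acts has shape (n, m). Points where no cluster element is active are dropped.
    """
    sub = acts[:, cluster]
    keep = (sub != 0).any(axis=1)
    return sub[keep] @ D[:, cluster].T


# Toy dictionary: two nearly parallel elements and one orthogonal element.
D_toy = np.array([[1.0, 0.9, 0.0],
                  [0.0, 0.1, 0.0],
                  [0.0, 0.0, 1.0]])
acts_toy = np.array([[1.0, 2.0, 0.0],
                     [0.0, 0.0, 3.0]])
clusters_toy = cluster_dictionary(D_toy, threshold=0.5)
recon_toy = cluster_reconstruction(acts_toy, D_toy, clusters_toy[0])
```

The rows returned by `cluster_reconstruction` are what step 3 would then project with PCA and inspect or score.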
Pseudocode for this method is in the appendix in Algorithm 1. This method succeeds on toy datasets of synthetic irreducible multi-dimensional features; see Appendix D. [Footnote 3: Code: https://github.com/JoshEngels/MultiDimensionalFeatures]

We apply this method to language models using GPT-2 (Radford et al., 2019) SAEs trained by Bloom (2024) for every layer and Mistral 7B (Jiang et al., 2023) SAEs that we train on layers 8, 16, and 24 (training details in Appendix E). Strikingly, we reconstruct irreducible multi-dimensional features that are interpretable circles: in GPT-2, days, months, and years are arranged circularly in order (see Fig. 1); in Mistral 7B, days and months are arranged circularly in order (see Fig. 15). These plots contain the PCA dimensions that most clearly show circular structure; these best dimensions are usually the second and third because the first PCA dim is an “intensity” direction that manifests as the radius of the circle in Fig. 1 (thus the overall structure for these multi-d features is perhaps best thought of as a cone). See Fig. 14 for all PCA dimensions visualized.

For each cluster of GPT-2 SAE features, we take the reconstructed activations and project them onto PCA components 1-2, 2-3, 3-4, and 4-5 (or fewer if there are fewer features in the cluster) and measure the separability index and $\epsilon$-mixture index of each 2D point cloud as described in Section B.2. The mean scores across these planes are a computationally tractable approximation of Definition 3. We plot these mean scores in Fig. 3, and find that the features which we had manually identified in Fig. 1 are among the top scoring features along both measures of irreducibility. Thus, our theoretical tests can indeed be used to find interpretable irreducible features. We show the top 20 feature clusters, measured by the product of (1 − $\epsilon$-mixture index) and separability index, in Appendix G. Out of all 1000 clusters, the Fig. 1 clusters rank 9, 28, and 15 by this metric, respectively. [Footnote 4: We also tried an alternative ranking scheme: we sorted the clusters by separability and irreducibility and set the cluster score equal to the minimum sorted position between the two sorted lists. The Fig. 1 clusters rank 8, 105, and 12 by this metric.]

5 Circular Representations in Large Language Models

In this section, we examine tasks in which models use the multi-dimensional features we discovered in Section 4, thereby providing evidence that these representations are indeed the fundamental unit of computation for some problems. Inspired by prior work studying circular representations in modular arithmetic (Liu et al., 2022), we define two prompts that represent “natural” modular arithmetic tasks:

Weekdays task: “Let’s do some day of the week math. Two days from Monday is”

Months task: “Let’s do some calendar math. Four months from January is”

Table 1: Aggregate model accuracy on days of the week and months of the year modular arithmetic tasks. Performance broken down by problem instance in Appendix I.

Model        Weekdays   Months
Llama 3 8B   29 / 49    143 / 144
Mistral 7B   31 / 49    125 / 144
GPT-2        8 / 49     10 / 144

Figure 4: Top two PCA components on the α token. Colors show α. Left: Layer 30 of Mistral on Weekdays. Right: Layer 5 of Llama on Months.

For Weekdays, we range over the 7 days of the week and durations between 1 and 7 days to get 49 prompts. For Months, we range over the 12 months of the year and durations between 1 and 12 months to get 144 prompts. Mistral 7B and Llama 3 8B (AI@Meta, 2024) achieve reasonable performance on the Weekdays task and excellent performance on the Months task (measured by comparing the highest logit valid token against the ground truth answer), as summarized in Table 1.
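The two task datasets can be enumerated mechanically. The sketch below is an illustration under assumptions: the helper name `build_task` is hypothetical, durations are spelled as capitalized number words following the two example prompts, and the singular wording for a duration of one ("One day from ...") is a guess rather than something stated in the text.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
NUMBER_WORDS = ["one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve"]


def build_task(items: list[str], unit: str, preamble: str) -> list[tuple[str, str]]:
    """Return (prompt, ground-truth answer) pairs for every start item and duration."""
    m = len(items)
    pairs = []
    for start_index, start in enumerate(items):
        for duration in range(1, m + 1):
            # Modular addition on the cycle of days or months.
            answer = items[(start_index + duration) % m]
            unit_word = unit if duration == 1 else unit + "s"  # assumed wording
            number = NUMBER_WORDS[duration - 1].capitalize()
            prompt = f"{preamble} {number} {unit_word} from {start} is"
            pairs.append((prompt, answer))
    return pairs


weekday_task = build_task(DAYS, "day", "Let's do some day of the week math.")
month_task = build_task(MONTHS, "month", "Let's do some calendar math.")
```

Each pair holds the prompt and the token that modular addition says should follow it, which is the ground truth a model's highest-logit valid token would be compared against.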
Interestingly, although these problems are equivalent to modular arithmetic problems $\alpha + \beta \equiv\ ?\ (\mathrm{mod}\ m)$ for $m = 7, 12$, both models get trivial accuracy on plain modular addition prompts, e.g. “$5 + 3\ (\mathrm{mod}\ 7) \equiv$”. Finally, although GPT-2 has circular representations, it gets trivial accuracy on Weekdays and Months.

To simplify discussion, let α be the day of the week or month of the year token (e.g. “Monday” or “April”), β be the duration token (e.g. “four” or “eleven”), and γ be the target ground truth token the model should predict, such that (abusing notation) we have $\alpha + \beta = \gamma$. Let the prompts of the task be parameterized by $j$, such that the $j$th prompt asks about $\alpha_j$, $\beta_j$, and $\gamma_j$. We confirm that Llama 3 8B and Mistral 7B have circular representations of α on this task by examining the PCA projections of hidden states across prompts at various layers on the α token. We plot two of these in Fig. 4 and show all layers in Fig. 18. These plots show circular representations as the two highest-variance components in the model’s representation of α at many layers.

5.1 Intervening on Circular Day and Month Representations

Figure 5: Visual representation of the circular intervention process. Top: We learn a circular probe on the PCA projection of a training set. Bottom: To intervene, we change the circular representation to $\alpha_j'$ and average ablate other dimensions.

Figure 6: Mean and 96% error bars for intervening on the α token across layers using different intervention methods. The circular intervention technique outperforms patching only the top 5 PCA components and leaving the rest unchanged, and almost reaches the upper bound performance of patching the entire layer.

We now experiment with intervening on these circular representations.
We base our experiments on the common interpretability technique of activation patching, which replaces activations from a “dirty” run of the model with the corresponding activations from a “clean” run (Zhang & Nanda, 2023). Activation patching empirically tests whether a specific model component, position, and/or representation has a causal influence on the model’s output. We employ a custom subspace patching method to test whether a specific circular subspace of a hidden state is sufficient to causally explain model output. Specifically, our patching technique relies on the following steps (visualized in Fig. 5):

1. Find a subspace with a circle to intervene on: Using a PCA-reduced activation subspace to avoid overfitting, we train a “circular probe” to identify representations which exhibit strong circular patterns. More formally, let $\mathbf{x}^j_{i,l}$ be the hidden state at layer $l$, token position $i$, for prompt $j$. Let $\mathbf{W}_{i,l} \in \mathbb{R}^{k \times d}$ be the matrix consisting of the top $k$ principal component directions of $\mathbf{x}^j_{i,l}$. In our experiments, we set $k = 5$. We learn a linear probe $\mathbf{P} \in \mathbb{R}^{2,k}$ from $\mathbf{W}_{i,l} \cdot X_{i,l}$ to a unit circle in α.
In other words, if $\mathrm{circle}(\alpha) = [\cos(2\pi\alpha/7), \sin(2\pi\alpha/7)]$ for Weekdays and $\mathrm{circle}(\alpha) = [\cos(2\pi\alpha/12), \sin(2\pi\alpha/12)]$ for Months, $\mathbf{P}$ is defined as follows:

$$\mathbf{P} = \operatorname*{arg\,min}_{\mathbf{P}' \in \mathbb{R}^{2,k}} \sum_{\mathbf{x}^j_{i,l}} \left\lVert \mathbf{P}' \cdot \mathbf{W}_{i,l} \cdot \mathbf{x}^j_{i,l} - \mathrm{circle}(\alpha) \right\rVert_2^2 \tag{5}$$

2. Intervene on the subspace: Say our initial prompt had $\alpha = \alpha_j$ and we are intervening with $\alpha = \alpha_j'$. In this step, we replace the model’s projection on the subspace $\mathbf{P} \cdot \mathbf{W}_{i,l}$, which will be close to $\mathrm{circle}(\alpha_j)$, with the “clean” point $\mathrm{circle}(\alpha_j')$. Note that we do not use the hidden state $\mathbf{x}^{j'}_{i,l}$ from the “clean” run, only the “clean” label $\alpha_j'$. In practice, other subspaces of $\mathbf{x}^j_{i,l}$ may be used concurrently by the model in “backup” circuits (see e.g. Wang et al. (2022)) to compute the answer, so if we intervene only on the circular subspace, the remaining components of the activation may interfere in downstream computations. Thus, to isolate the effect of our intervention, we average ablate the portion of the activation not in the intervened subspace.
Letting $\overline{\mathbf{x}_{i,l}}$ be the average of $\mathbf{x}^j_{i,l}$ across all prompts indexed by $j$ and $\mathbf{P}^+$ be the pseudoinverse of $\mathbf{P}$, we intervene via the formula

$$\mathbf{x}^{j*}_{i,l} = \overline{\mathbf{x}_{i,l}} + \mathbf{W}_{i,l}^T \mathbf{P}^+ \left( \mathrm{circle}(\alpha_j') - \overline{\mathbf{x}_{i,l}} \right) \tag{6}$$

We run our patching on all 49 Weekday problems and 144 Month problems and use as “clean” runs the 6 or 11 other possible values for β, resulting in a total of 49 * 6 patching experiments for Weekdays and 144 * 11 patching experiments for Months. We also run baselines where we (1) replace the entire subspace corresponding to the first 5 PCA dimensions with the corresponding subspace from the clean run, (2) replace the entire layer with the corresponding layer from the clean run, and (3) replace the entire layer with the average across the task. The metric we use is average logit difference across all patching experiments between the original correct token ($\alpha_j$) and the target token ($\alpha_j'$). See Fig. 6 for these interventions on all layers of Mistral 7B and Llama 3 8B on Weekdays and Months.

Figure 7: Off distribution interventions on Mistral layer 5 on the Weekdays task. The color corresponds to the highest logit γ after performing the circular subspace intervention on that point.

The main takeaway from Fig. 6 is that circular subspaces are causally implicated in computing γ, especially for Weekdays.
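The probe fit of Eq. 5 and the intervention of Eq. 6 can be sketched on synthetic data as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy, the function names are hypothetical, the hidden states are synthetic, and the subtraction in Eq. 6 is read as taking place in the probe's 2D output space, i.e. $\mathrm{circle}(\alpha_j') - \mathbf{P}\mathbf{W}\overline{\mathbf{x}}$, so that the dimensions match.

```python
import numpy as np


def circle(alpha, m):
    """Map label(s) alpha in {0..m-1} to point(s) on the unit circle."""
    angle = 2 * np.pi * np.asarray(alpha) / m
    return np.stack([np.cos(angle), np.sin(angle)], axis=-1)


def fit_circular_probe(X, alpha, m, k=5):
    """Fit a linear probe from the top-k PCA projection to circle(alpha).

    X: (n, d) hidden states, alpha: (n,) integer labels.
    Returns the mean x_bar (d,), PCA directions W (k, d), and probe P (2, k).
    """
    x_bar = X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    W = Vt[:k]
    Z = X @ W.T                                   # (n, k) projected activations
    # Least squares solution of Z @ P.T ~= circle(alpha), as in Eq. 5.
    P_T, *_ = np.linalg.lstsq(Z, circle(alpha, m), rcond=None)
    return x_bar, W, P_T.T


def intervene(x_bar, W, P, alpha_new, m):
    """Eq. 6 style edit: mean-ablate everything, then set the circular readout.

    Assumption: the mean is subtracted after mapping it through P @ W.
    """
    target = circle(alpha_new, m)                 # (2,)
    P_pinv = np.linalg.pinv(P)                    # (k, 2)
    return x_bar + W.T @ P_pinv @ (target - P @ W @ x_bar)


# Synthetic "hidden states": a noisy circle embedded in d dimensions plus an offset.
rng = np.random.default_rng(0)
m, d = 7, 16
alpha = np.repeat(np.arange(m), 7)                # 49 labels, 7 per weekday
B = rng.normal(size=(2, d))
X = circle(alpha, m) @ B + rng.normal(size=d) + 0.05 * rng.normal(size=(len(alpha), d))

x_bar, W, P = fit_circular_probe(X, alpha, m)
x_star = intervene(x_bar, W, P, alpha_new=3, m=m)
readout = P @ W @ x_star                          # probe output after the edit
print(np.round(readout, 3), np.round(circle(3, m), 3))
```

Because the rows of `W` are orthonormal and `P` has full row rank, the probe readout of the edited state equals the requested point on the circle, while every direction outside the probed subspace is left at its dataset mean. Note that the edited state does not depend on the original hidden state, which is what average ablation implies.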
Across all models and tasks, early layer interventions on the circular subspace have almost the same intervention effect as patching the entire layer, and are usually better than patching the top PCA dimensions from the clean problem. Patching experiments in Appendix J show α is copied to the final token on layers 15 to 17, which is why interventions drop off there. Additionally, while in this section we train a probe on a dataset of prompts, in Section 5.2 we show that intervening on the circle discovered via SAE clustering in Section 4 also works.

To investigate exactly how models use the circular subspace, we perform off distribution interventions. We modify Eq. 6 so that instead of intervening on the circumference $\mathrm{circle}(\alpha)$, we sweep over a grid of positions $(r, \theta)$ within the circle:

$$\mathbf{x}^{j*}_{i,l} = \overline{\mathbf{x}_{i,l}} + \mathbf{W}_{i,l}^T \mathbf{P}^+ \left( [r\cos(\theta), r\sin(\theta)]^T - \overline{\mathbf{x}_{i,l}} \right) \tag{7}$$

We intervene with $r \in [0, 0.1, \ldots, 2]$, $\theta \in [0, 2\pi/100, \ldots, 198\pi/100]$ and record the highest logit γ after the forward pass. Fig. 7 displays these results on Mistral layer 5 for $\beta \in [2, 3, 4, 5]$. They imply that Mistral treats the circle as a multi-dimensional representation with α encoded in the angle.

5.2 Intervening with the SAE Plane

In the last section, we trained probes manually on the PCA of the activations to fit a circle. A perhaps more natural approach is intervening with the precise circle we found in Section 4. To determine if this approach is feasible, we first project layer 8 Mistral 7B Weekdays activations into the weekdays plane that was discovered by clustering (see Fig.
15; the plane is defined by PCA dimensions 2 and 3 of the cluster). In the rest of this section, we call this plane the SAE plane. In Fig. 19, we find that the Weekdays representations projected into the SAE plane indeed form a circle. We can thus fit a circular probe to this 2D plane as in Eq. 5. Similarly, we call this probe the SAE probe.

Figure 8: Interventions on the Mistral 7B Weekdays task with different methods of determining the probe.

Because we only have layer 8 clustering results for Mistral, we train an SAE probe only on layer 8. We then evaluate interventions using this SAE probe on layer 8 of Mistral, but also at neighboring layers, since nearby layers should have similar representations (see e.g. Belrose et al. (2023)). We compare to two baselines: 1) training a normal circular probe on the PCA projections of each layer as described in Section 5.1, and 2) training a circular probe only on layer 8 and then evaluating on adjacent layers (in the same way as for the layer 8 SAE probe). We show the results of these methods in Fig. 8. We find that on all layers, using the SAE probe only slightly decreases intervention performance as compared to training a circular probe (from -2.58 to -2.01 average logit difference on layer 8). Even more interestingly, the layer 8 SAE probe is much more robust to layer shifts than the layer 8 circular probe; for example, using the layer 8 circular probe on layer 6 results in an average logit difference of 0.029, whereas using the layer 8 SAE probe results in an average logit difference of -2.32. This is intriguing evidence that the SAE is perhaps finding more “true” (or at least more robust) features than our circular probing technique.

5.3 Continuity of Circular Representations

Figure 9: Layer 30 Mistral 7B activations for [morning/evening] on [Monday/Tuesday/…/Sunday], plotted projected into the PCA plane for [Monday/Tuesday/…/Sunday].
In past sections, the representations of the interpretable numeric quantities we have discovered have been mostly discontinuous; that is, the days of the week and months of the year in Fig. 1 and Fig. 15 are clustered at the vertices of a heptagon and dodecagon, and there is nothing "between" adjacent weekdays or months along the circle. In this section, we examine the "continuity" of the circular features we have discovered. Although continuity of the representation is not a requirement of Definition 3, it would further decrease the $\epsilon$-mixture index, and would also increase our subjective perception of the circular feature as an intrinsic model feature representing a continuous quantity (time). Thus, we create a synthetic dataset containing the text "[very early/very late] on [Monday/Tuesday/…/Sunday]" and simply plot the projections of the layer 30 activations into the top two PCA components of the activations of [Monday/Tuesday/…/Sunday]. The results, shown in Fig. 9, show that Mistral 7B can indeed map intermediate quantities to their expected place in the circle: the very early and very late versions of each weekday lie closer to the previous and the next weekday along the circle, respectively. We show similar results for "[morning/evening] on [Monday/Tuesday/…/Sunday]" in Appendix Fig. 22.

6 Discussion

Our work proposes a significant refinement to the simple one-dimensional linear representation hypothesis. While previous work has convincingly shown the existence of one-dimensional features, we find evidence for irreducible multi-dimensional representations, requiring us to generalize the notion of a feature to higher dimensions. Fortunately, we find that existing unsupervised feature extraction methodologies like sparse autoencoders can readily be applied to discover multi-dimensional representations.
However, we think our work raises interesting questions about whether individual SAE features are appropriate “mediators” (Mueller et al., 2024) for understanding model computation, if some features are in fact multi-dimensional. Although taking a multi-dimensional representation perspective may be more complicated, we believe that uncovering the true (perhaps multi-dimensional) nature of model representations is necessary for discovering the underlying algorithms that use these representations. Ultimately, our field aims to turn complex circuits in future more-capable models into formally verifiable programs (Tegmark & Omohundro, 2023; Dalrymple et al., 2024), which requires the ground truth “variables” of language models; we believe this work takes an important step towards discovering these variables. Limitations: It is unclear why we did not find more interpretable multi-dimensional features. We are unsure if we are failing to interpret some of the high-scoring multi-dimensional features, if most multi-dimensional features lie in dimensions higher than two, if our clustering technique is not powerful enough to find some features, or if there are truly not that many. Additionally, our definitions for irreducible features (Definition 2) are purely statistical and not intervention based, and also had to be relaxed to hold in practice, resulting in measures that return a possibly subjective “degree” of reducibility (Definition 3). Thus, although this work provides preliminary evidence for the multi-dimensional superposition hypothesis (2), it is still unclear if this theory provides the best description for the representations models use. Future work might make progress on this question by investigating new techniques for decomposing model representations, exploring higher dimensional representations, or determining conclusively whether models use representations in ways that necessitate the representations are non-linear. 
Acknowledgments We thank (in alphabetical order) Dowon Baek, Kaivu Hariharan, Vedang Lad, Ziming Liu, and Tony Wang for helpful discussions and suggestions. This work is supported by Erik Otto, Jaan Tallinn, the Rothberg Family Fund for Cognitive Science, the NSF Graduate Research Fellowship (Grant No. 2141064), and IAIFI through NSF grant PHY-2019786. References Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. AI@Meta (2024) AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md. Alon (2003) Noga Alon. Problems and results in extremal combinatorics—i. Discrete Mathematics, 273(1-3):31–53, 2003. Anthropic (2024) Anthropic. The claude 3 model family: Opus, sonnet, haiku. Technical report, Anthropic, 2024. Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023. Black et al. (2022) Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, et al. Interpreting neural networks through the polytope lens. arXiv preprint arXiv:2211.12312, 2022. Bloom (2024) Joseph Bloom. Open source sparse autoencoders for all residual stream layers of gpt2 small. https://w.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream, 2024. Bricken et al. 
(2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html. Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. Conmy et al. (2023) Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023. Csordás et al. (2024) Róbert Csordás, Christopher Potts, Christopher D Manning, and Atticus Geiger. Recurrent neural networks learn to store and generate sequences using non-linear representations. arXiv preprint arXiv:2408.10920, 2024. Cunningham et al. (2023) Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023. Dalrymple et al. (2024) David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, et al. Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems. arXiv preprint arXiv:2405.06624, 2024. Elhage et al. 
(2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html. Foucart & Rauhut (2013) Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. 2013. Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. Gershgorin (1931) Semyon Aranovich Gershgorin. über die abgrenzung der eigenwerte einer matrix. Izvestiya Rossiĭskoi akademii nauk. Seriya matematicheskaya, (6):749–754, 1931. Gould et al. (2023) Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild. arXiv preprint arXiv:2312.09230, 2023. Gurnee & Tegmark (2023) Wes Gurnee and Max Tegmark. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023. Hanna et al. (2024) Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Advances in Neural Information Processing Systems, 36, 2024. Heinzerling & Inui (2024) Benjamin Heinzerling and Kentaro Inui. Monotonic representation of numeric properties in language models. arXiv preprint arXiv:2403.10381, 2024. Higham (2021) Nicholas J. Higham. Singular value inequalities. https://nhigham.com/2021/05/04/singular-value-inequalities/, May 2021. (23) Bill Johnson (https://mathoverflow.net/users/2554/bill johnson). Almost orthogonal vectors. MathOverflow. URL https://mathoverflow.net/q/24873. 
URL:https://mathoverflow.net/q/24873 (version: 2010-05-16). Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023. Jiang et al. (2024) Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, and Victor Veitch. On the origins of linear representations in large language models. arXiv preprint arXiv:2403.03867, 2024. Johnson & Lindenstrauss (1984) William B. Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, p. 189–206. American Mathematical Society, Providence, RI, 1984. ISBN 0-8218-5030-X. doi: 10.1090/conm/026/737400. Kantamneni & Tegmark (2025) Subhash Kantamneni and Max Tegmark. Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873, 2025. Lattanzi et al. (2020) Silvio Lattanzi, Thomas Lavastida, Kefu Lu, and Benjamin Moseley. A framework for parallelizing hierarchical clustering methods. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I, p. 73–89. Springer, 2020. Li et al. (2022) Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022. Liu et al. (2022) Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022. Marks & Tegmark (2023) Samuel Marks and Max Tegmark. 
The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023. Marks et al. (2024) Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024. Mendel (2024) Jake Mendel. Sae feature geometry is outside the superposition hypothesis. AI Alignment Forum, 2024. URL https://w.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis. Michaud et al. (2024) Eric J Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, and Max Tegmark. Opening the ai black box: program synthesis via mechanistic interpretability. arXiv preprint arXiv:2402.05110, 2024. Mikolov et al. (2013a) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013a. Mikolov et al. (2013b) Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, p. 746–751, 2013b. Morwani et al. (2023) Depen Morwani, Benjamin L Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. arXiv preprint arXiv:2311.07568, 2023. Mueller et al. (2024) Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, et al. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. 
arXiv preprint arXiv:2408.01416, 2024. Nanda & Bloom (2022) Neel Nanda and Joseph Bloom. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens, 2022. Nanda et al. (2023a) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023a. Nanda et al. (2023b) Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023b. Olah (2024) Chris Olah. What is a linear representation? what is a multidimensional feature? Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/july-update/index.html#linear-representations. Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in. Park et al. (2023) Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023. Peng et al. (2023) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), p. 1532–1543, 2014. Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. Shai et al. (2024) Adam Shai, Paul Riechers, Lucas Teixeira, Alexander Oldenziel, and Sarah Marzen. Transformers represent belief state geometry in their residual stream. 
https://w.alignmentforum.org/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their, 2024. Stolfo et al. (2023) Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, p. 7035–7052, 2023. Syed et al. (2023) Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348, 2023. Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. Tegmark & Omohundro (2023) Max Tegmark and Steve Omohundro. Provably safe systems: the only path to controllable agi. arXiv preprint arXiv:2309.01933, 2023. Wang et al. (2022) Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022. Yedidia (2023a) Adam Yedidia. Gpt-2’s positional embedding matrix is a helix, 2023a. URL https://w.lesswrong.com/posts/qvWP3aBDBaqXvPNhS/gpt-2-s-positional-embedding-matrix-is-a-helix. Accessed: 2024-09-16. Yedidia (2023b) Adam Yedidia. The positional embedding matrix and previous-token heads: how do they actually work?, 2023b. URL https://w.lesswrong.com/posts/zRA8B2FJLtTYRgie6/the-positional-embedding-matrix-and-previous-token-heads-how. Accessed: 2024-09-17. Zhang & Nanda (2023) Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023. Zhong et al. (2024) Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. 
The clock and the pizza: Two stories in mechanistic explanation of neural networks. Advances in Neural Information Processing Systems, 36, 2024.

Appendix A Multi-Dimensional Feature Capacity

The Johnson-Lindenstrauss (JL) Lemma (Johnson & Lindenstrauss, 1984) implies that we can choose $e^{C d \delta^2}$ pairwise one-dimensional δ-orthogonal vectors to satisfy 1 for some constant $C$, thus allowing us to build the model’s feature space with a number of one-dimensional δ-orthogonal features exponential in $d$. We now prove a similar result for low-dimensional projections (the main idea of the proof is to combine δ-orthogonal vectors as guaranteed from the JL lemma):

Theorem 1. For any $d'$ and δ, it is possible to choose $\frac{1}{d_{\max}} e^{C_1 (d/d'^2) \delta^2}$ pairwise δ-orthogonal matrices $\mathbf{A}_i \in \mathbb{R}^{n_i \times d'}$ for some constant $C_1$. Furthermore, it is not possible to choose more than $e^{C_2 (d - d_{\max} \delta \log(\frac{1}{\delta}))}$ for some constant $C_2$.

We will first prove a lemma that will help us prove Theorem 1.

Lemma 1. Pick $n$ pairwise δ-orthogonal unit vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n \in \mathbb{R}^d$. Let $\mathbf{y} \in \mathbb{R}^d$ be a unit norm vector that is a linear combination of the unit norm vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with coefficients $z_1, \ldots, z_n \in \mathbb{R}$.
We can write $\mathbf{A} = [\mathbf{v}_1, \ldots, \mathbf{v}_n]$ and $\mathbf{z} = [z_1, \ldots, z_n]^T$, so that we have $\mathbf{y} = \sum_{k=1}^{n} z_k \mathbf{v}_k = \mathbf{A}\mathbf{z}$ with $\lVert \mathbf{y} \rVert_2 = 1$. Then,

$$\sum_{k=1}^{n} |z_k| = \lVert \mathbf{z} \rVert_1 \le \sqrt{\frac{n}{1 - \delta n}}$$

Proof. We will first bound the $L_2$ norm of $\mathbf{z}$. If $\sigma_n$ is the minimum singular value of $\mathbf{A}$, then we have via standard singular value inequalities (Higham, 2021)

$$\sigma_n \le \frac{\lVert \mathbf{y} \rVert_2}{\lVert \mathbf{z} \rVert_2} \implies \lVert \mathbf{z} \rVert_2 \le \frac{\lVert \mathbf{y} \rVert_2}{\sigma_n} = \frac{1}{\sigma_n}$$

Thus we now lower bound $\sigma_n$. The singular values of $\mathbf{A}$ are the square roots of the eigenvalues of the matrix $\mathbf{A}^T\mathbf{A}$, so we now examine $\mathbf{A}^T\mathbf{A}$. Since all columns of $\mathbf{A}$ are unit vectors, the diagonal of $\mathbf{A}^T\mathbf{A}$ is all ones. The off diagonal elements are dot products of pairs of δ-orthogonal vectors, and so are within the range $[-\delta, \delta]$.
Then by the Gershgorin circle theorem (Gershgorin, 1931), all eigenvalues $\lambda_i$ of $\mathbf{A}^T\mathbf{A}$ are in the range

$$(1 - \delta(n-1),\ 1 + \delta(n-1))$$

In particular, $\sigma_n^2 = \lambda_n \ge 1 - \delta(n-1)$, and thus $\sigma_n \ge \sqrt{1 - \delta(n-1)}$. Plugging into our upper bound for $\lVert \mathbf{z} \rVert_2$, we have that $\lVert \mathbf{z} \rVert_2 \le 1/\sqrt{1 - \delta(n-1)}$. Finally, the largest $L_1$ norm for a point on an $n$-hypersphere of radius $r$ is attained when all coordinates are equal, and such a point has $L_1$ norm $\sqrt{n}\,r$, so

$$\lVert \mathbf{z} \rVert_1 \le \sqrt{\frac{n}{1 - \delta(n-1)}} \le \sqrt{\frac{n}{1 - \delta n}}$$

∎

Proof of Theorem 1. By the JL lemma (Johnson & Lindenstrauss, 1984; Bill Johnson, MathOverflow), for any $d$ and δ, we can choose $e^{C d \delta^2}$ δ-orthogonal unit vectors in $\mathbb{R}^d$ indexed as $\mathbf{v}_i$, for some constant $C$. Let $\mathbf{A}_i = [\mathbf{v}_{d_{\max} * i}, \ldots, \mathbf{v}_{d_{\max} * i + n_i - 1}]$ where each element in the brackets is a column.
Then by construction all $\mathbf{A}_i$ are matrices composed of unique $\delta$-orthogonal vectors, and there are $\frac{1}{d_{\max}}e^{Cd\delta^2}$ matrices $\mathbf{A}_i$. Now consider two of these matrices, $\mathbf{A}_i=[\mathbf{v}_1,\dots,\mathbf{v}_{n_i}]$ and $\mathbf{A}_j=[\mathbf{u}_1,\dots,\mathbf{u}_{n_j}]$ with $i\ne j$; we will prove that they are $f(\delta)$-orthogonal for some function $f$. Let $\mathbf{y}_i=\sum_{k=1}^{n_i}z_{i,k}\mathbf{v}_k$ be a vector in the column space of $\mathbf{A}_i$ and $\mathbf{y}_j=\sum_{k=1}^{n_j}z_{j,k}\mathbf{u}_k$ be a vector in the column space of $\mathbf{A}_j$, such that $\mathbf{y}_i$ and $\mathbf{y}_j$ are unit vectors.
To prove $f(\delta)$-orthogonality, we must bound the absolute dot product between $\mathbf{y}_i$ and $\mathbf{y}_j$:
\begin{align*}
|\mathbf{y}_i\cdot\mathbf{y}_j|
&= \left|\left(\sum_{k=1}^{n_i}z_{i,k}\mathbf{v}_k\right)\cdot\left(\sum_{k=1}^{n_j}z_{j,k}\mathbf{u}_k\right)\right| \\
&= \left|\sum_{k_1=1}^{n_i}\sum_{k_2=1}^{n_j}\left(z_{i,k_1}\mathbf{v}_{k_1}\right)\cdot\left(z_{j,k_2}\mathbf{u}_{k_2}\right)\right| \\
&\le \sum_{k_1=1}^{n_i}\sum_{k_2=1}^{n_j}\left|z_{i,k_1}z_{j,k_2}\right|\,\left|\mathbf{v}_{k_1}\cdot\mathbf{u}_{k_2}\right| && \text{Triangle inequality} \\
&\le \sum_{k_1=1}^{n_i}\sum_{k_2=1}^{n_j}\left|z_{i,k_1}z_{j,k_2}\right|\,\delta && \text{All } \mathbf{v}_i,\mathbf{u}_j \text{ are } \delta\text{-orthogonal} \\
&= \delta\sum_{k_1=1}^{n_i}\sum_{k_2=1}^{n_j}\left|z_{i,k_1}z_{j,k_2}\right| \\
&= \delta\left(\sum_{k=1}^{n_i}|z_{i,k}|\right)\left(\sum_{k=1}^{n_j}|z_{j,k}|\right) = \delta\,\|\mathbf{z}_i\|_1\,\|\mathbf{z}_j\|_1 && \text{Factoring the product} \\
&\le \delta\sqrt{\frac{n_i}{1-\delta n_i}}\sqrt{\frac{n_j}{1-\delta n_j}} && \text{By Lemma 1} \\
&\le \frac{\delta d_{\max}}{1-\delta d_{\max}} && n_i,n_j\le d_{\max} \text{ by assumption}
\end{align*}
Thus $\mathbf{A}_i$ and $\mathbf{A}_j$ are $f(\delta)$-orthogonal for $f(\delta)=\delta d_{\max}/(1-\delta d_{\max})$, and so it is possible to choose $\frac{1}{d_{\max}}e^{Cd\delta^2}$ pairwise $f(\delta)$-orthogonal projection matrices.
Remapping the variable $\delta$ with $\delta\mapsto f^{-1}(\delta)=\delta/(d_{\max}(1+\delta))$, we find that it is possible to choose $\frac{1}{d_{\max}}e^{Cd\delta^2/((1+\delta)^2d_{\max}^2)}$ pairwise $\delta$-orthogonal projection matrices. Because $1+\delta$ is at most $2$ for $\delta\in(0,1)$, we can further simplify the exponent and find that it is possible to choose $\frac{1}{d_{\max}}e^{C(d/d_{\max}^2)\delta^2/4}$ pairwise $\delta$-orthogonal projection matrices. Absorbing the $4$ into the constant $C$ finishes the proof of the lower bound.

For the upper bound, we can proceed much more simply. Consider $k$ pairwise $\delta$-orthogonal matrices $\mathbf{A}_i\in\mathbb{R}^{d'}$. Since these matrices are full rank, their column spaces each parameterize a subspace of dimension $d'$, and so by a result from (Alon, 2003) it is possible to choose $e^{Cd'\delta^2\log(1/\delta)}$ almost orthogonal vectors in this subspace. Furthermore, by our definition of $\delta$-orthogonal matrices, all pairs of these vectors between subspaces will be $\delta$-orthogonal.
Finally, again by (Alon, 2003) we cannot have more than $e^{Cd\delta^2\log(1/\delta)}$ $\delta$-orthogonal vectors overall, so we have that
\[ k\,e^{Cd_{\max}\delta^2\log(1/\delta)} < e^{Cd\delta^2\log(1/\delta)}, \]
and simplifying,
\[ k < e^{C(d-d_{\max})\delta^2\log(1/\delta)}. \]
∎

These results imply that models can still represent an exponential number of higher-dimensional features. However, there is a large exponential gap between the lower and upper bounds we have shown. If the lower bound is reasonably tight, then models would be highly incentivized to fit features within the smallest-dimensional space possible, suggesting a reason for recent work showing interesting compressed encodings of multi-dimensional features in toy problems (Morwani et al., 2023). Note that the proof assumes the “worst case” scenario that all of the features have dimension $d_{\max}$, while in practice many of the features may be one-dimensional or low-dimensional, so the effect on the capacity of a real model that represents multi-dimensional features is unlikely to be this extreme. Finally, we note that the dictionary learning literature may have discovered similar results in the past (which we were unaware of); see Foucart & Rauhut (2013).
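The bound of Lemma 1 can be sanity-checked numerically. The following is a minimal sketch and not part of the paper's code: it assumes NumPy, draws random unit vectors in a high-dimensional space (which are nearly orthogonal), measures their actual $\delta$, and compares the largest observed $\|\mathbf{z}\|_1$ against $\sqrt{n/(1-\delta n)}$. The function name `lemma1_check` and all parameter values are illustrative choices.

```python
import numpy as np

def lemma1_check(n=5, d=2000, trials=200, seed=0):
    """Empirically compare ||z||_1 with the Lemma 1 bound sqrt(n / (1 - delta * n))."""
    rng = np.random.default_rng(seed)
    # Columns of A are random unit vectors; in high dimension they are nearly orthogonal.
    A = rng.standard_normal((d, n))
    A /= np.linalg.norm(A, axis=0)
    gram = A.T @ A
    # delta is the largest absolute dot product between two distinct columns.
    delta = np.abs(gram - np.eye(n)).max()
    assert delta * n < 1, "the bound is only defined when delta * n < 1"
    bound = np.sqrt(n / (1 - delta * n))
    worst = 0.0
    for _ in range(trials):
        z = rng.standard_normal(n)
        z /= np.linalg.norm(A @ z)  # rescale so that y = A z is a unit vector
        worst = max(worst, float(np.abs(z).sum()))
    return delta, worst, bound

delta, worst, bound = lemma1_check()
print(f"delta={delta:.4f}  max ||z||_1={worst:.4f}  bound={bound:.4f}")
```

Random coefficient vectors will usually not attain the bound; the check only illustrates that the inequality is not violated.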
Appendix B More on Reducibility

B.1 Additional Intuition for Definitions

Here, we present some extra intuition and high-level ideas for understanding our definitions and the motivation behind them. Roughly, we intend for our definitions in the main text to identify representations in the model that describe an object or concept in a way that fundamentally takes multiple dimensions. We operationalize this as finding a subspace of representations that

1. has basis vectors that “always co-occur” no matter the orientation, and
2. is not made up of combinations of independent lower-dimensional features.

1. The first condition is met by the mixture part of our definition. The feature in question should be part of an irreducible manifold, and so should “fill” a plane or hyperplane. There shouldn’t be any part of the plane where the probability distribution of the feature is concentrated, because this region is then likely part of a lower-dimensional feature. The idea of this part of the definition is to capture multi-dimensional objects; if the entire multi-dimensional space is truly being used to represent a high-dimensional object, then the representations for the object should be “spread out” entirely through the space.

2. The second condition is met by the separability part of our definition. This part of the definition is intended to rule out features that co-occur frequently but are fundamentally not describing the same object or concept. For example, latitude and longitude are not a mixture in that they frequently co-occur, but we do not think it is necessarily correct to say they are part of the same multi-dimensional feature because they are independent.

B.2 Empirical Irreducible Feature Test Details

Our tests for reducibility require the computation of two quantities: $S(\mathbf{f})$ for the separability index and $M_\epsilon(\mathbf{f})$ for the $\epsilon$-mixture index. We describe how we compute each index in the following two subsections.
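As a concrete companion to the separability computation detailed in B.2.1, the following is a minimal sketch of a histogram-based estimate for two-dimensional features; it is an illustration under stated assumptions, not the paper's implementation. It assumes NumPy, clamps points to the clipping square rather than discarding them, measures mutual information in bits, and scans angles in $[0,\pi/2)$ only (a rotation by $\pi/2$ merely swaps the two axes up to sign, which leaves the estimate unchanged). The function names are hypothetical.

```python
import numpy as np

def mutual_information_bits(a, b, half_width=3.0, bins=40):
    """Histogram estimate of I(a; b) in bits on a clipped, binned square."""
    # Assumption: "clipping" is implemented by clamping to the 6 x 6 square.
    a = np.clip(a, -half_width, half_width)
    b = np.clip(b, -half_width, half_width)
    edges = np.linspace(-half_width, half_width, bins + 1)
    counts, _, _ = np.histogram2d(a, b, bins=[edges, edges])
    p_ab = counts / counts.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal over b
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal over a
    mask = p_ab > 0                          # skip empty cells (0 log 0 = 0)
    ratio = p_ab[mask] / (p_a @ p_b)[mask]
    return float((p_ab[mask] * np.log2(ratio)).sum())

def separability_index(f, n_angles=1000):
    """Minimum over rotation angles of the estimated I(a; b) for f of shape (N, 2)."""
    f = f - f.mean(axis=0)
    # Divide by the root mean squared norm and multiply by sqrt(2) (two dimensions).
    f = f / np.sqrt((f ** 2).sum(axis=1).mean()) * np.sqrt(2.0)
    best = np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        a = c * f[:, 0] - s * f[:, 1]
        b = s * f[:, 0] + c * f[:, 1]
        best = min(best, mutual_information_bits(a, b))
    return best
```

On independent Gaussian coordinates the estimate should be close to zero (up to the positive bias of histogram estimators), while points on a circle should give a clearly larger value at every angle.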
B.2.1 Separability Index

We define the separability index in Equation 2 as
\[ S(\mathbf{f}) = \min I(\mathbf{a};\mathbf{b}), \]
where the min is over rotations $\mathbf{R}$ used to split $\mathbf{f}'=\mathbf{R}\mathbf{f}+\mathbf{c}$ into $\mathbf{a}$ and $\mathbf{b}$. In two dimensions, the rotation is defined by a single angle, so we can iterate over a grid of 1000 angles and estimate the mutual information between $\mathbf{a}$ and $\mathbf{b}$ for each angle. We first normalize $\mathbf{f}$ by subtracting off the mean and then dividing by the root mean squared norm of $\mathbf{f}$ (and multiplying by $\sqrt{2}$ since the toy datasets are in two dimensions). To estimate the mutual information, we first clip the data $\mathbf{f}$ to a 6 by 6 square centered on the origin. We then bin the points into a 40 by 40 grid to produce a discrete distribution $p(a,b)$. After computing the marginals $p(a)$ and $p(b)$ by summing the distribution over each axis, we obtain the mutual information via the formula
\[ I(\mathbf{a};\mathbf{b}) = \sum_{a,b} p(a,b)\log\frac{p(a,b)}{p(a)\,p(b)}. \tag{8} \]

B.2.2 $\epsilon$-Mixture Index

We define the $\epsilon$-mixture index in Equation 3 as
\[ M_\epsilon(\mathbf{f}) = \max_{\mathbf{v}\in\mathbb{R}^{d_f},\;c\in\mathbb{R}} \mathbb{P}\left(|\mathbf{v}\cdot\mathbf{f}+c| < \epsilon\sqrt{\mathbb{E}\left[(\mathbf{v}\cdot\mathbf{f}+c)^2\right]}\right). \]
The challenge with computing $M_\epsilon(\mathbf{f})$ is to compute the maximum.
We opted to maximize via gradient descent, and we guaranteed differentiability by softening the inequality $<$ with a sigmoid,
\[ M_{\epsilon,T}(\mathbf{f},\mathbf{v},c) = \mathbb{E}\left[\sigma\left(\frac{1}{T}\left(\epsilon - \frac{|\mathbf{v}\cdot\mathbf{f}+c|}{\sqrt{\mathbb{E}\left[(\mathbf{v}\cdot\mathbf{f}+c)^2\right]}}\right)\right)\right], \tag{9} \]
where $T$ is a temperature, which we linearly decay from $1$ to $0$ throughout training. We optimize $\mathbf{v}$ and $c$ on the objective $M_{\epsilon,T}(\mathbf{f},\mathbf{v},c)$ using full-batch gradient descent over 10000 steps with learning rate $0.1$. With the solution $(\mathbf{v}^*,c^*)$, the final value of $M_{\epsilon,T=0}(\mathbf{f},\mathbf{v}^*,c^*)$ is then our estimate of $M_\epsilon(\mathbf{f})$.

We also run the irreducibility tests on additional synthetic feature distributions in Fig. 11a and Fig. 11b.

Figure 10: Testing irreducibility of synthetic features. (a) Testing $S(\mathbf{a})$ and $M_\epsilon(\mathbf{a})$ on a reducible feature $\mathbf{a}$. (b) Testing $S(\mathbf{b})$ and $M_\epsilon(\mathbf{b})$ on an irreducible feature $\mathbf{b}$. Left in each subfigure: Distributions of $\mathbf{x}$. For feature $\mathbf{a}$, 63.96% lies within the narrow dotted lines, indicating the feature is likely a mixture. For feature $\mathbf{b}$, 17.84% lies within the wide lines, indicating the feature is unlikely to be a mixture. The green cross indicates the angle $\theta$ that minimizes mutual information. Middle in each subfigure: Histograms of the distribution of $\mathbf{v}\cdot\mathbf{x}$ with red lines indicating a $2\epsilon$-wide region. Right in each subfigure: Mutual information between $\mathbf{a}$ and $\mathbf{b}$ as a function of the rotation angle $\theta$ of matrix $\mathbf{R}$. Feature $\mathbf{b}$ has a large minimum mutual information so is unlikely to be separable; feature $\mathbf{a}$ has a medium value of minimum mutual information of about $0.37$ bits.

Figure 11: Testing irreducibility of synthetic features. (a) Testing $S(\mathbf{c})$ and $M_\epsilon(\mathbf{c})$ on a reducible feature $\mathbf{c}$. (b) Testing $S(\mathbf{d})$ and $M_\epsilon(\mathbf{d})$ on an irreducible feature $\mathbf{d}$. Left in each subfigure: Distributions of $\mathbf{x}$. For feature $\mathbf{c}$, 7.94% lies within the narrow dotted lines, indicating the feature is unlikely to be a mixture. For feature $\mathbf{d}$, 25.90% lies within the wide lines, indicating the feature is likely a mixture. The green cross indicates the angle $\theta$ that minimizes mutual information. Middle in each subfigure: Histograms of the distribution of $\mathbf{v}\cdot\mathbf{x}$ with red lines indicating a $2\epsilon$-wide region. Right in each subfigure: Mutual information between $\mathbf{a}$ and $\mathbf{b}$ as a function of the rotation angle $\theta$ of matrix $\mathbf{R}$. Both features have a small ($<0.5$ bits) minimum mutual information and so are likely separable.

Appendix C Alternative Definitions

In this section, we present an alternative definition of a reducible feature that we considered during our work. This definition approaches multi-dimensional features from the angle of computational reducibility as opposed to statistical reducibility. In other words, it considers whether representations of features on a specific set of tasks can be split up without changing the accuracy of the task. This captures an interesting (and important) aspect of feature reducibility, but because it requires a specific set of prompts (as opposed to allowing unsupervised discovery) we chose not to use it as our main definition. Our alternative definitions consider representation spaces that are possibly multi-dimensional, and define these spaces through whether they can completely explain a function $h$ on the output logits.
We consider a group-theoretic approach to irreducible representations, based on whether computation involving multiple group elements can be decomposed.

C.1 Alternative Definition: Interventions and Representation Spaces

Assume that we restrict the input set of prompts $T=\{t^j\}$ to some subset of prompts and that we have some evaluation function $h$ that maps from the output logit distribution of $M$ to a real number. For example, for the Weekdays problems, $T$ is the set of $49$ prompts and $h$ could be the $\operatorname{arg\,max}$ over the days-of-the-week logits. Abusing notation, we let $M$ also be the function from the layer we are intervening on; this is always clear from context. Then we can define a representation space of $\mathbf{x}^j_{i,l}$ as a subspace in which interventions always work:

Definition 5 (Representation Space). Given a prompt set $T=\{t^j\}$, a rank-$r$ representation space of intermediate value $\mathbf{x}^j_{i,l}$ is a rank-$r$ projection matrix $P$ such that for all $j,j'$,
\[ h\left(M\left((I-P)\mathbf{x}^j_{i,l}+P\mathbf{x}^{j'}_{i,l}\right)\right)=h\left(M\left(\mathbf{x}^{j'}_{i,l}\right)\right). \]
Note that it immediately follows that the rank-$d$ identity matrix $I_d$ is trivially a rank-$d$ representation space for all prompt sets $T$.

Definition 6 (Minimality). A representation space $P$ of rank $r$ is minimal if there does not exist a lower-rank representation space. A minimal representation with rank $>1$ is a multi-dimensional representation.

Definition 7 (Alternative Reducibility). A representation space $P$ of rank $r$ is reducible if there are orthonormal representation spaces $P_1$ and $P_2$ (such that $P_1+P_2=P$, $P_1P_2=0$) where
\[ h\left(M\left(P_1\mathbf{x}^j_{i,l}\right)+M\left(P_2\mathbf{x}^j_{i,l}\right)\right)=h\left(M\left(P_1\mathbf{x}^j_{i,l}+P_2\mathbf{x}^j_{i,l}\right)\right) \]
for all $j,j'$.

Suppose $T$, $h$, and $M$ define the multiplication of two elements in a finite group $G$ of order $n$. Then if we interpret the embedding vectors as the group representations, our definition of reducibility corresponds to the standard group-theoretical definition of irreducibility; specifically, reducibility into a tensor product representation.

Appendix D Toy Case of Training SAEs on Circles

To explore how SAEs behave when reconstructing irreducible features of dimension $d_f>1$, we perform experiments with the following toy setup. Inspired by the circular representations of integers that networks learn when trained on modular addition (Nanda et al., 2023a; Liu et al., 2022), we create synthetic datasets of activations containing multiple features which are each 2D irreducible circles. First, however, consider activations for a single circle: points uniformly distributed on the unit circle in $\mathbb{R}^2$. We train SAEs on this data with encoder $\mathrm{Enc}(\mathbf{x})=\mathrm{ReLU}(\mathbf{W}_e(\mathbf{x}-\mathbf{b}_d)+\mathbf{b}_e)$ and decoder $\mathrm{Dec}(\mathbf{f})=\mathbf{W}_d\mathbf{f}+\mathbf{b}_d$.
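To make the shapes in the encoder and decoder formulas concrete, here is a minimal forward-pass sketch using NumPy. It is an illustration only: no training is performed, and the initialization (decoder columns spaced evenly on the circle, encoder tied to the decoder, zero biases) is a hypothetical choice made for this example rather than the setup used in the experiments.

```python
import numpy as np

def sample_unit_circle(n, rng):
    """Draw n points uniformly from the unit circle in R^2."""
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.stack([np.cos(phi), np.sin(phi)], axis=1)

def sae_encode(x, W_e, b_e, b_d):
    """Enc(x) = ReLU(W_e (x - b_d) + b_e); x is (N, 2), W_e is (m, 2)."""
    return np.maximum((x - b_d) @ W_e.T + b_e, 0.0)

def sae_decode(f, W_d, b_d):
    """Dec(f) = W_d f + b_d; f is (N, m), W_d is (2, m)."""
    return f @ W_d.T + b_d

rng = np.random.default_rng(0)
m = 10                                   # number of dictionary elements
x = sample_unit_circle(1000, rng)

# Hypothetical untrained initialization, chosen only for illustration.
angles = 2.0 * np.pi * np.arange(m) / m
W_d = np.stack([np.cos(angles), np.sin(angles)])  # (2, m), unit-norm columns
W_e = W_d.T.copy()                                # (m, 2), tied to the decoder
b_e = np.zeros(m)
b_d = np.zeros(2)

f = sae_encode(x, W_e, b_e, b_d)         # sparse, non-negative hidden activations
x_hat = sae_decode(f, W_d, b_d)          # untrained, so x_hat need not match x
```

Because of the ReLU, only the dictionary elements pointing roughly toward a given input are active for it, which is the mechanism behind only a subset of features firing on each point when the dictionary is larger than the data dimension.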
We train SAEs with $m=2$ and $m=10$ using the Adam optimizer with a learning rate of $10^{-3}$ and sparsity penalty $\lambda=0.1$, for 20,000 steps with a warmup of 1000 steps. In Fig. 12 we show the dictionary elements of these SAEs. When $m=2$, the SAE must use both SAE features on each input point, and uses $\mathbf{b}_d$ to shift the reconstructed circle so it is centered at the origin. When $m=10$, the SAE learns $\mathbf{b}_d\approx 0$ and the features spread out across the circle, arranged close together, and only a subset are active on each input.

Figure 12: SAEs trained to reconstruct a single 2D circle with $m=2$ (left) and $m=10$ (middle and right) dictionary elements. When there are several SAE features, there is no natural/canonical choice of feature directions, and the dictionary elements spread out across the circle.

We now consider synthetic activations with multiple circular features. Our data consists of points in $\mathbb{R}^{10}$, where we choose two orthogonal planes spanned by $(\mathbf{e}_1,\mathbf{e}_2)$ and $(\mathbf{e}_3,\mathbf{e}_4)$, respectively. With probability one half, a point is sampled uniformly on the unit circle in the $\mathbf{e}_1$-$\mathbf{e}_2$ plane; otherwise, the point is sampled uniformly on the unit circle in the $\mathbf{e}_3$-$\mathbf{e}_4$ plane. We train SAEs with $m=64$ on this data with the same hyperparameters as the single-circle case. We now apply the procedure described in Section 4 to see if we can automatically rediscover these circles. Encouragingly, we first find that the alive SAE features align almost exactly with either the $\mathbf{e}_1$-$\mathbf{e}_2$ or the $\mathbf{e}_3$-$\mathbf{e}_4$ plane.
When we apply spectral clustering with $\mathrm{n\_clusters}=2$ to the features, using the pairwise angular similarities between dictionary elements as the similarity matrix (Fig. 13, left), the two clusters correspond exactly to the features which span each plane. As described in Section 4, given a cluster of dictionary elements $S\subset\{1,\dots,m\}$, we run a large set of activations through the SAE, then filter out samples which don’t activate any element in $S$. For samples which do activate an element of $S$, we reconstruct the activation while setting all SAE features not in $S$ to have a hidden activation of zero. If some collection of SAE features together represents some irreducible feature, we want to remove all other features from the activation vector, and so we only allow SAE features in the collection to participate in reconstructing the input activation. We find that this procedure almost exactly recovers the original two circles, which encouraged us to apply this method for discovering the features shown in Fig. 1 and Fig. 15.

Figure 13: Automatic discovery of synthetic circular features by clustering SAE dictionary elements.

Appendix E Training Mistral SAEs

Our Mistral 7B (Jiang et al., 2023) sparse autoencoders (SAEs) are trained on over one billion tokens from a subset of the Pile (Gao et al., 2020) and Alpaca (Peng et al., 2023) datasets. We train our SAEs on layers 8, 16, and 24 out of 32 total layers to maximize coverage of the model’s representations. We use a $16\times$ expansion factor, yielding a total of 65536 dictionary elements for each SAE. To train our SAEs, we use an $L_p$ sparsity penalty for $p=1/2$ with sparsity coefficient $\lambda=0.012$. Before an SAE forward pass, we normalize our activation vectors to have norm $\sqrt{d_{model}}=64$ in the case of Mistral. We do not apply a pre-encoder bias.
We use an AdamW optimizer with weight decay $10^{-3}$ and learning rate 0.0002 with a linear warmup. We apply dead feature resampling (Bricken et al., 2023) five times over the course of training to converge on SAEs with around 1000 dead features.

Appendix F GPT-2 and Mistral 7B Dictionary Element Clustering

In this section, we first present pseudocode in Algorithm 1 for the overall high-level technique that finds multi-dimensional features and that uses clustering as a subroutine. We then provide the specific clustering algorithm implementations we use for GPT-2 and Mistral.

Algorithm 1 High Level Clustering Approach For Finding Multi-D Features
    Input: Dictionary elements $D$, activation vectors $X_{i,l}$, SAE
    Output: Irreducible multi-dimensional features
    $S_{i,j}$ ← CosineSim($D_i$, $D_j$);
    clusters ← Cluster($S$);
    reconstructions ← {};
    for cluster in clusters do
        $R_{cluster}$ ← ids of dictionary elements in cluster;
        recs ← {};
        for $x_{i,l}$ in $X_{i,l}$ do
            encoding ← ReLU($E \cdot x_{i,l}$);
            if max(encoding[$R_{cluster}$]) > 0 then
                r ← $D[:, R_{cluster}]$ · encoding[$R_{cluster}$];
                recs ← recs ∪ {r};
            end if
        end for
        reconstructions ← reconstructions ∪ {recs};
    end for
    features ← {};
    for R in reconstructions do
        proj ← PCA(R);
        if TestIrreducible(proj) then
            Add proj to features;
        end if
    end for
    return features

F.1 GPT-2-small methods and results

For GPT-2-small, we perform spectral clustering on the roughly 25k layer 7 SAE features from (Bloom, 2024), using pairwise angular similarities between dictionary elements as the similarity matrix. We use $\mathrm{n\_clusters}=1000$ and manually inspected roughly 500 of these clusters. For each cluster, we looked at projections of its reconstructed activations onto principal components 1-4. In Fig. 14, we show projections for the most interesting clusters we identified, which appear to be circular representations of days of the week, months of the year, and years of the 20th century.

Figure 14: Projections of days of week, months of year, and years of the 20th century representations onto the top four principal components, showing more dimensions of the representations than Fig. 1.

F.2 Mistral 7B methods and results

For Mistral 7B, our SAEs have 65536 dictionary elements and we found it difficult to run spectral clustering on all of these at once. We therefore develop a simple graph-based clustering algorithm that we run on Mistral 7B SAEs:

1. Create a graph $G$ out of the dictionary elements by adding directed edges from each dictionary element to its $k$ closest dictionary elements by cosine similarity. We use $k=2$.
2. Make the graph undirected by turning every directed edge into an undirected edge.
3. Prune edges with cosine similarity less than a threshold value $\tau$. We use $\tau=0.5$.
4. Return the connected components as clusters.
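The four steps above can be sketched in a few lines. The following is a minimal illustration and not the implementation used for the Mistral 7B SAEs: it assumes NumPy, stores one dictionary element per row, builds a dense similarity matrix (feasible only for small dictionaries; $2^{16}$ elements would need a blocked or approximate nearest-neighbour search), and finds connected components with a small union-find. The function name `graph_cluster` is hypothetical.

```python
import numpy as np

def graph_cluster(D, k=2, tau=0.5):
    """Cluster the rows of D (one dictionary element per row) via a pruned kNN graph."""
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = D @ D.T                          # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)         # an element is not its own neighbour
    n = sim.shape[0]
    parent = list(range(n))                # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Steps 1-3: each element proposes edges to its k most similar elements;
    # an edge is kept, as an undirected edge, only if its similarity is >= tau.
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    for i in range(n):
        for j in neighbours[i]:
            if sim[i, j] >= tau:
                parent[find(i)] = find(int(j))

    # Step 4: connected components of the surviving graph are the clusters.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Merging endpoints with union-find as edges are accepted makes the direction of each edge irrelevant, which is why step 2 needs no separate pass in this sketch.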
We run this algorithm on the Mistral 7B layer $8$ SAE ($2^{16}$ dictionary elements) and find roughly $2700$ clusters containing between $2$ and $1000$ elements. We manually inspected roughly 2000 of these. From these, we re-discover circular representations of days of the week and months of the year, shown in Fig. 15. However, we did not find other obviously interesting and clearly irreducible features. We also investigate the sensitivity of this method to $\tau$ and $k$ by varying both and reporting the maximum Jaccard similarity between any of the resulting clusters and the days of the week cluster we show in Fig. 15. We show the results in Fig. 16, where we find that varying $k$ has minimal effect, while varying $\tau$ shows $3$ regimes: small $\tau$ causes all features to group in one cluster, so the days of the week cluster is not found; medium $\tau$ causes the days of the week cluster to become identifiable; large $\tau$ causes all features to be divided into their own clusters.

Figure 15: Circular representations of days of the week and months of the year which we discover with our unsupervised SAE clustering method in Mistral 7B. Unlike similar features in GPT-2, we also find an additional “weekend” representation in between Saturday and Sunday representations (left) and additional representations of seasons among the months (right). For instance, “winter” tokens activate a region of the circle in between the representations of January and December.

Figure 16: Hyperparameter regimes where the days of the week cluster exists. The cluster exists in the regime between all features clumping together and all features being in their own cluster; this regime seems reasonably stable.

As future work, we think it would be exciting to develop better clustering techniques for SAE features. Our graph-based clustering technique could likely be improved by more recent efficient and high-quality graph-based clustering techniques, e.g. hierarchical agglomerative clustering with single-linkage (Lattanzi et al., 2020). Additionally, we believe we would see a large improvement by setting edge weights to be a combination of both the cosine and Jaccard similarity of the dictionary elements, e.g. max(cosine, Jaccard).

Appendix G Other discovered clusters

In Fig. 17, we plot the top $11$ ranked clusters by the product of a) the measured separability index and b) one minus the measured $\epsilon$-mixture index with $\epsilon=0.1$ (this is just one of many possible ways to get an ordered ranking from a two-parameter score). We color by both the current token (which results in clear patterns for all tokens) and the next token (to see if we find belief states as found by Shai et al. (2024) in toy transformers). We note that weekdays are ranked 9 and so are shown in the plot. Additionally, the next token patterns of the ‘such’ cluster and the ‘B’ cluster do seem to display some clustering independently of the current token pattern, which might lend the belief state hypothesis some support.

Appendix H Further Experiment Details

H.1 Assets Information

We use the following open source models for our experiments: Llama 3 8B (AI@Meta, 2024) (custom Llama 3 license https://llama.meta.com/llama3/license/), Mistral 7B (Jiang et al., 2023) (released under the Apache 2 License), and GPT-2 (Radford et al., 2019) (modified MIT license, see https://github.com/openai/gpt-2/blob/master/LICENSE).

H.2 Machine Information

Intervention experiments were run on two V100 GPUs using less than $64$ GB of CPU RAM; all experiments can be reproduced from our open source repository in less than a day with this configuration. We use the TransformerLens library (Nanda & Bloom, 2022) for intervention experiments. $\epsilon$-mixture index measurements on toy datasets took about one minute each, on 8GB of CPU RAM.
EVR experiments take seconds on 8GB of CPU RAM and are dominated by the time taken to human-interpret the RGB plots. GPT-2 SAE clustering and plotting were run on a cluster of heterogeneous hardware. Spectral clustering, computing reconstructions, and plotting were done on CPUs only. We made reconstruction plots for 500 clusters, with each taking less than 10 minutes. Mistral 7B SAE reconstruction plots were made on the same cluster. We made roughly 2000 reconstruction plots for Mistral 7B (and manually inspected each), with each taking less than 20 minutes to generate. Jobs were allocated 64GB of memory each. Mistral SAE training was run on a single V100 GPU. Initially caching activations from Mistral 7B on one billion tokens took approximately 60 hours. Training the SAEs on the saved activations took another 36 hours.

H.3 Error Bar Calculation

In Fig. 6 we report 96% error bars for all intervention methods. To compute these error bars, we loop over all intervention methods and all layers and compute a confidence interval for each (method, layer) pair across all prompts. Assuming normally distributed errors, we compute error bars with the following standard formula:
\[ EB = \mu \pm z \cdot SE, \]
where $\mu$ is the sample mean, $z$ is the z score (slightly larger than $2$ for 96% error bars), and $SE$ is the standard error (the standard deviation divided by the square root of the number of samples). We use standard Python functions to compute this value.

Figure 17: Top $10$ GPT-2 clusters by Mixture and Separability Index.

The Months error bars are smaller than the Weekdays error bars because there are more Months prompts: there are $12\cdot 12\cdot 11=1584$ intervention effect values, rather than $7\cdot 7\cdot 6=294$ intervention effect values.
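The error bar formula above can be written out with the Python standard library alone. This is a minimal sketch rather than the exact code used for Fig. 6; it assumes the sample standard deviation (with the $n-1$ correction) and a two-sided interval, and the function name `error_bar` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def error_bar(values, confidence=0.96):
    """Return (mu, half_width) such that the interval is mu +/- half_width."""
    # Two-sided z score: about 2.054 for a 96% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean, using the sample standard deviation.
    se = stdev(values) / sqrt(len(values))
    return mean(values), z * se

mu, half_width = error_bar([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"{mu:.3f} +/- {half_width:.3f}")
```

Since the standard error scales as one over the square root of the sample count, equal standard deviations would make the Months error bars about $\sqrt{1584/294}\approx 2.3$ times narrower than the Weekdays ones.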
(a) Mistral 7B, Weekdays (b) Llama 3 8B, Weekdays (c) Mistral 7B, Months (d) Llama 3 8B, Months
Figure 18: Projections onto the top two PCA dimensions of model hidden states on the α token show that circular representations of α are present in various layers.

Appendix I More Weekdays and Months Plots and Details

Figure 19: Projections of Mistral 7B Weekdays task activations at layer 8 into the plane discovered by clustering layer 8 SAE features.

I.0.1 Basic Plots

We show the results of Mistral 7B and Llama 3 8B on all individual instances of Weekdays that at least one of the models gets wrong in Table 2 and present a similar table for Months in Table 3. We show projections onto the top two PCA directions for both Mistral 7B and Llama 3 8B in Fig. 18 on the hidden layers on top of the α token, colored by α. These are similar plots to Fig. 4, except they are on all layers. The circular structure in α is visible on many, but not all, layers. Much of the linear structure visible is due to β.

I.0.2 Intervening with the SAE Probe

We show the results of projecting Mistral Weekdays representations into the plane discovered by clustering SAE features in Fig. 19. The result is clearly circular.

(a) Mistral 7B MLP patching (b) Mistral 7B attention patching (c) Llama 3 8B MLP patching (d) Llama 3 8B attention patching
Figure 20: Attention and MLP patching results on Weekdays. Results are averaged over 20 different runs with fixed α and varying β and 20 different runs with fixed β and varying α.

(a) Mistral 7B MLP patching (b) Mistral 7B attention patching (c) Llama 3 8B MLP patching (d) Llama 3 8B attention patching
Figure 21: Attention and MLP patching results on Months. Results are averaged over 20 different runs with fixed α and varying β and 20 different runs with fixed β and varying α.

I.0.3 Basic Patching

In Fig. 20 and Fig. 21, we report MLP and attention head patching results for Weekdays and Months.
We experiment on 20 pairs of problems with the same α and different β and 20 pairs of problems with the same β and different α, for a total of 40 pairs of problems. For each pair of problems, we patch the MLP/attention outputs from the "clean" to the "dirty" problem for each layer and token, and then complete the forward pass. Defining the logit difference as the logit of the clean γ minus the logit of the dirty γ, we record what percent of the difference between the original logit difference of the dirty problem and the logit difference of the clean problem is recovered upon intervening, and average across these 40 percentages for each layer and token. This gives us a score we call the Average Intervention Effect. For simplicity of presentation, we clip all of the (few) negative intervention averages to 0 (prior work (Zhang & Nanda, 2023) has also found negative-effect attention heads during patching experiments).

I.0.4 Circle Continuity

Finally, in Fig. 22, we show another example of the continuity of the circular days of the week representation in Mistral 7B.

Table 2: Weekdays fine-grained results. Rows are omitted if both models get the problem correct.

| α | β | Ground truth γ | Mistral top γ | Mistral correct? | Llama top γ | Llama correct? |
|---|---|---|---|---|---|---|
| 1 | 1 | Wednesday | Wednesday | Yes | Thursday | No |
| 3 | 1 | Friday | Friday | Yes | Tuesday | No |
| 4 | 1 | Saturday | Saturday | Yes | Thursday | No |
| 3 | 2 | Saturday | Saturday | Yes | Tuesday | No |
| 4 | 2 | Sunday | Sunday | Yes | Wednesday | No |
| 5 | 2 | Monday | Monday | Yes | Tuesday | No |
| 2 | 3 | Saturday | Friday | No | Saturday | Yes |
| 3 | 3 | Sunday | Sunday | Yes | Tuesday | No |
| 4 | 3 | Monday | Monday | Yes | Tuesday | No |
| 0 | 4 | Friday | Thursday | No | Friday | Yes |
| 3 | 4 | Monday | Monday | Yes | Tuesday | No |
| 0 | 5 | Saturday | Friday | No | Saturday | Yes |
| 1 | 5 | Sunday | Saturday | No | Wednesday | No |
| 2 | 5 | Monday | Sunday | No | Monday | Yes |
| 4 | 5 | Wednesday | Tuesday | No | Tuesday | No |
| 6 | 5 | Friday | Thursday | No | Thursday | No |
| 1 | 6 | Monday | Sunday | No | Thursday | No |
| 2 | 6 | Tuesday | Monday | No | Tuesday | Yes |
| 3 | 6 | Wednesday | Tuesday | No | Tuesday | No |
| 4 | 6 | Thursday | Thursday | Yes | Tuesday | No |
| 5 | 6 | Friday | Friday | Yes | Thursday | No |
| 6 | 6 | Saturday | Thursday | No | Thursday | No |
| 0 | 7 | Monday | Sunday | No | Tuesday | No |
| 1 | 7 | Tuesday | Sunday | No | Tuesday | Yes |
| 2 | 7 | Wednesday | Sunday | No | Wednesday | Yes |
| 3 | 7 | Thursday | Sunday | No | Thursday | Yes |
| 4 | 7 | Friday | Thursday | No | Tuesday | No |
| 5 | 7 | Saturday | Friday | No | Saturday | Yes |
| 6 | 7 | Sunday | Friday | No | Thursday | No |

Table 3: Months fine-grained results. Rows are omitted if both models get the problem correct.

| α | β | Ground truth γ | Mistral top γ | Mistral correct? | Llama top γ | Llama correct? |
|---|---|---|---|---|---|---|
| 0 | 4 | May | April | No | May | Yes |
| 6 | 4 | November | October | No | November | Yes |
| 0 | 6 | July | June | No | July | Yes |
| 0 | 7 | August | July | No | August | Yes |
| 1 | 7 | September | October | No | September | Yes |
| 3 | 7 | November | October | No | November | Yes |
| 5 | 7 | January | December | No | January | Yes |
| 6 | 7 | February | January | No | February | Yes |
| 7 | 7 | March | February | No | March | Yes |
| 9 | 7 | May | April | No | May | Yes |
| 4 | 9 | February | February | Yes | January | No |
| 2 | 10 | January | December | No | January | Yes |
| 8 | 10 | July | June | No | July | Yes |
| 1 | 11 | January | December | No | January | Yes |
| 2 | 11 | February | December | No | February | Yes |
| 3 | 11 | March | February | No | March | Yes |
| 7 | 11 | July | June | No | July | Yes |
| 8 | 11 | August | July | No | August | Yes |
| 9 | 11 | September | August | No | September | Yes |
| 0 | 12 | January | December | No | January | Yes |

Appendix J Patching

In this section, we present results to support a claim that MLPs (and not attention blocks) are responsible for computing γ. In Fig. 25, we deconstruct states on top of the final token (before predicting γ) on Llama 3 8B Months (we show a similar plot for the states on the final token of Mistral 7B on Weekdays in the main text in Fig. 24). These plots show that the value of γ is computed on the final token around layers 20 to 25. To show that this computation of γ occurs in the MLPs, we must show that no attention head is copying γ from a prior token or directly computing γ.

Figure 22: Layer 30 Mistral 7B activations for [very early/very late] on [Monday/Tuesday/…/Sunday], plotted projected into the PCA plane for [Monday/Tuesday/…/Sunday].

We first perform a patching experiment with the same setup as Fig. 20 and Fig. 21 on individual attention heads on the final token. From the patching results we identify the top 10 attention heads by average intervention effect. For each attention head, we compute one EVR run with explanatory functions equal to one-hot functions of α and β (resulting in 14 functions $g_i$ for Weekdays and 24 for Months) and one with explanatory functions equal to one-hot functions of α, β, and γ. We find that for all layers before 25, adding γ to the explanatory functions adds almost no explanatory power. Since we established above that the model has already computed γ at this point, we know that attention heads do not participate in computing γ.

To isolate the rough circuit for Weekdays and Months, we perform layer-wise activation patching on 40 random pairs of prompts. The results, displayed in Fig. 21, show that the circuit to compute γ consists of MLPs on top of the α and β tokens, a copy to the token before γ, and further MLPs there (roughly similar to prior work studying arithmetic circuits (Stolfo et al., 2023)). Moreover, fine-grained patching in Appendix K shows that there are just a few responsible attention heads for the writes to the token before γ.
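The Average Intervention Effect used in these patching experiments (defined in I.0.3) can be illustrated with a short sketch. The function names and the example logit differences below are hypothetical, and the normalisation (recovered fraction of the clean-minus-dirty gap) is one reading of the prose definition rather than the authors' exact code.

```python
def recovered_fraction(ld_clean, ld_dirty, ld_patched):
    """Fraction of the gap between the dirty and clean logit differences
    that is recovered after patching. Each logit difference is
    logit(clean gamma) - logit(dirty gamma) for the corresponding run."""
    return (ld_patched - ld_dirty) / (ld_clean - ld_dirty)

def average_intervention_effect(pairs):
    """Average the recovered fraction over prompt pairs for one
    (layer, token) position, clipping a negative average to 0."""
    fractions = [recovered_fraction(*p) for p in pairs]
    return max(sum(fractions) / len(fractions), 0.0)

# Hypothetical (ld_clean, ld_dirty, ld_patched) triples for three pairs.
pairs = [(4.0, -4.0, 0.0),   # half of the gap recovered
         (3.0, -3.0, -3.0),  # nothing recovered
         (2.0, -2.0, 2.0)]   # everything recovered
score = average_intervention_effect(pairs)
clipped = average_intervention_effect([(4.0, -4.0, -6.0)])
```

In the experiments described here the average would run over 40 pairs per layer and token; the three pairs above only illustrate the arithmetic.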
However, patching alone cannot tell us how or where γ is represented. For that, we need a new technique, which we expand on in the next section.

Appendix K Explanation via Regression (EVR)

So far, we have focused on examining and intervening on the representation for α, which we present as a circle in the top PCA components on top of the α token. In this section, we examine how the generated output, γ, is represented. First, to isolate the rough circuit for Weekdays and Months, we perform layer-wise activation patching on 40 random pairs of prompts. The results, displayed in Fig. 21 and Fig. 20, show that the circuit to compute γ consists of MLPs on top of the α and β tokens, a copy to the token before γ, and further MLPs there (roughly similar to what Stolfo et al. (2023) find in prior work studying arithmetic circuits). Thus, we know where to look for a representation of γ: in the second half of the layers on the token before γ. However, patching alone cannot tell us how γ is represented.

Figure 23: Top two PCA components of residual errors after EVR with one-hot in α and β. Mistral 7B Weekdays, layer 25, final token. Colored by γ.

Unlike α, γ has no obvious circular (or linear) pattern in the top PCA components on these layers. To determine the representation for γ, we introduce a more powerful technique we call Explanation via Regression (EVR): given a set of token sequences with a corresponding set of hidden states $X_{i,l}$, we choose a set of interpretable explanation functions of the input tokens $\{g_j(t)\}$. The $r^2$ value of a linear regression from $\{g_j(t)\}$ to $X_{i,l}$ tells us how much of the variance in the activations the $\{g_j(t)\}$ explain, and conversely the residuals show the exact components of the representation we have yet to explain.
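A minimal sketch of the EVR idea on synthetic data may help make this concrete. Everything here is an assumption for illustration: the hidden states are simulated (not real model activations), the helper names `one_hot` and `evr` are hypothetical, and $r^2$ is pooled over all hidden dimensions, which may differ from the authors' exact computation.

```python
import numpy as np

def one_hot(values, n_classes):
    """One-hot explanation functions, e.g. [alpha = 0], [alpha = 1], ..."""
    return np.eye(n_classes)[values]

def evr(G, X):
    """Regress hidden states X (n, d) on explanation functions G (n, k).

    Returns the pooled r^2 and the residuals (what G fails to explain).
    """
    G1 = np.column_stack([G, np.ones(len(G))])     # add an intercept column
    coef, *_ = np.linalg.lstsq(G1, X, rcond=None)  # least-squares fit
    residuals = X - G1 @ coef
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - (residuals ** 2).sum() / total, residuals

# Simulated "hidden states" for all 49 (alpha, beta) pairs of a mod-7 task:
# additive alpha and beta parts plus a circular encoding of gamma.
rng = np.random.default_rng(0)
n, d = 7, 16
alpha, beta = np.divmod(np.arange(n * n), n)
gamma = (alpha + beta) % n
angle = 2 * np.pi * gamma / n
circle = np.column_stack([np.cos(angle), np.sin(angle)])
X = (one_hot(alpha, n) @ rng.normal(size=(n, d))
     + one_hot(beta, n) @ rng.normal(size=(n, d))
     + circle @ rng.normal(size=(2, d))
     + 0.01 * rng.normal(size=(n * n, d)))

# EVR with one-hot alpha, beta versus one-hot alpha, beta, gamma.
G_ab = np.column_stack([one_hot(alpha, n), one_hot(beta, n)])
G_abg = np.column_stack([G_ab, one_hot(gamma, n)])
r2_ab, resid_ab = evr(G_ab, X)
r2_abg, _ = evr(G_abg, X)

# Top two PCA components of the residuals left after explaining alpha, beta.
centered = resid_ab - resid_ab.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
projection = centered @ vt[:2].T
top2_share = (s[:2] ** 2).sum() / (s ** 2).sum()
```

In this toy setting the γ circle is invisible to an additive fit in α and β, so `r2_ab` stays well below 1 while `r2_abg` is close to 1, and the residuals concentrate in a two-dimensional plane. Plotting `projection` colored by `gamma` would be the analogue of a residual PCA plot.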
K.1 Using EVR to uncover a circular representation for γ

We first use EVR to determine the representation for γ by plotting the top two PCA components of the layer 25 Mistral 7B activations after subtracting the components that can be explained using a regression with one-hot functions in α and β (i.e. $g_1 = [\alpha = 0], g_2 = [\beta = 1], g_3 = [\alpha = 1], \ldots$). The result, shown in Fig. 23, is an incredibly clear circle in γ, which suggests that the model’s generated representation of γ lies along a circle. A simple PCA projection was not enough to find this result because the representation for γ has interference from α and β, which the EVR removes. This suggests that the models may be generating γ by using a trigonometry-based algorithm like the “clock” (Nanda et al., 2023a) or “pizza” (Zhong et al., 2024) algorithm in late MLP layers.

K.2 More Experiments with EVR

Figure 24: EVR residual RGB plots on Mistral hidden states on the Weekdays final token, layers 17 to 29. From top to bottom, we show each residual RGB plot after adding the function(s) $g_i$ labelled just underneath, as well as the resulting $r^2$ value. We write “tmr” meaning “tomorrow” for $\beta = 1$. We also write “circle for x” meaning the inclusion of two functions $g_i(x) = \{\cos, \sin\}(2\pi x/7)$.

We now apply EVR to Months and Weekdays to break down $X_{i,l}$ completely into interpretable functions. We build a list of $g_i$ iteratively and greedily.
At each iteration, we perform a linear regression with the current list $g_1, \ldots, g_k$, visualize and interpret the residual prediction errors, and build a new function $g_{k+1}$ representing these errors to add to the list. Once most variance is explained, we can conclude that $g_1, \ldots, g_k$ constitutes the entirety of what is represented in the hidden states. This information tells us what can and cannot be extracted via a linear probe, without having to train any probes. Furthermore, if we treat each $g_i$ as a feature (see Definition 1), then the linear regression coefficients tell us which directions in $X_{i,l}$ these features are represented in, connecting back to 2.

Since $X_{i,l}$ consists of modular addition problems with two inputs α and β, we can visualize the errors as we iteratively construct $g_1, \ldots, g_k$ by making a heatmap with α and β on the two axes, where the color shows what kind of error is made. More specifically, we take the top 3 PCA components of the error distribution and assign them to the colors red, green, and blue. We call the resulting heatmap a residual RGB plot. Errors that depend primarily on α, β, or γ show up as horizontal, vertical, or diagonal stripes on the residual RGB plot.

In Fig. 24, we perform EVR on the layer 17-29 hidden states of Mistral 7B on the Weekdays task; additional deconstructions are in Appendix K. We find that a circle in γ develops and grows in explanatory power; we plot the layer 25 residuals after explaining with one-hot functions in α and β (i.e. $g_1 = [\alpha = 0], g_2 = [\beta = 1], g_3 = [\alpha = 1], \ldots$) in Fig. 23 to show this incredibly clear circle in γ. This suggests that the models may be generating γ by using a trigonometry-based algorithm like the “clock” (Nanda et al., 2023a) or “pizza” (Zhong et al., 2024) algorithm in late MLP layers.

Table 4: Highest intervention effect attention heads from fine-grained attention head patching, as well as EVR results with one-hot α, β and one-hot α, β, γ.

| L (layer) | H (head) | Average Intervention Effect | EVR $R^2$, one-hot α, β | EVR $R^2$, one-hot α, β, γ |
|---|---|---|---|---|
| 28 | 18 | 0.22 | 0.39 | 0.73 |
| 18 | 30 | 0.17 | 0.95 | 0.96 |
| 15 | 13 | 0.17 | 0.94 | 0.95 |
| 22 | 15 | 0.11 | 0.77 | 0.82 |
| 16 | 21 | 0.09 | 0.92 | 0.93 |
| 28 | 16 | 0.08 | 0.42 | 0.69 |
| 15 | 14 | 0.06 | 0.98 | 0.99 |
| 30 | 24 | 0.05 | 0.43 | 0.79 |
| 21 | 26 | 0.04 | 0.53 | 0.63 |
| 14 | 2 | 0.04 | 0.93 | 0.95 |

(a) Mistral 7B, Weekdays.

| L (layer) | H (head) | Average Intervention Effect | EVR $R^2$, one-hot α, β | EVR $R^2$, one-hot α, β, γ |
|---|---|---|---|---|
| 17 | 0 | 0.18 | 0.98 | 0.99 |
| 17 | 1 | 0.08 | 0.98 | 0.98 |
| 19 | 10 | 0.08 | 0.95 | 0.96 |
| 30 | 17 | 0.07 | 0.85 | 0.90 |
| 17 | 3 | 0.07 | 0.93 | 0.95 |
| 17 | 27 | 0.06 | 1.00 | 1.00 |
| 31 | 22 | 0.05 | 0.37 | 0.78 |
| 21 | 9 | 0.04 | 0.73 | 0.78 |
| 20 | 28 | 0.04 | 1.00 | 1.00 |
| 30 | 16 | 0.04 | 0.73 | 0.85 |

(b) Llama 3 8B, Weekdays.

| L (layer) | H (head) | Average Intervention Effect | EVR $R^2$, one-hot α, β | EVR $R^2$, one-hot α, β, γ |
|---|---|---|---|---|
| 20 | 28 | 0.15 | 0.76 | 0.76 |
| 17 | 0 | 0.10 | 0.77 | 0.77 |
| 25 | 14 | 0.08 | 0.19 | 0.61 |
| 17 | 1 | 0.07 | 0.80 | 0.82 |
| 17 | 3 | 0.06 | 0.71 | 0.71 |
| 31 | 22 | 0.06 | 0.12 | 0.67 |
| 17 | 27 | 0.05 | 0.58 | 0.58 |
| 19 | 4 | 0.05 | 0.40 | 0.66 |
| 19 | 10 | 0.04 | 0.62 | 0.62 |
| 30 | 26 | 0.04 | 0.51 | 0.62 |

(c) Mistral 7B, Months.

| L (layer) | H (head) | Average Intervention Effect | EVR $R^2$, one-hot α, β | EVR $R^2$, one-hot α, β, γ |
|---|---|---|---|---|
| 15 | 13 | 0.26 | 0.62 | 0.62 |
| 16 | 21 | 0.17 | 0.76 | 0.76 |
| 18 | 30 | 0.13 | 0.77 | 0.77 |
| 28 | 18 | 0.11 | 0.13 | 0.52 |
| 28 | 16 | 0.07 | 0.13 | 0.52 |
| 21 | 25 | 0.05 | 0.65 | 0.70 |
| 15 | 14 | 0.03 | 0.72 | 0.72 |
| 17 | 26 | 0.02 | 0.77 | 0.77 |
| 31 | 1 | 0.02 | 0.11 | 0.57 |
| 21 | 24 | 0.02 | 0.30 | 0.45 |

(d) Llama 3 8B, Months.

Figure 25: Iterative deconstruction of hidden state representations on the final token on Llama 3 8B, Months.
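The residual RGB plot described in K.2 can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name `residual_rgb` is hypothetical, the residuals are synthetic, and the per-channel min-max scaling to [0, 1] is a choice made here for display rather than something specified in the text.

```python
import numpy as np

def residual_rgb(residuals, alphas, betas, n_alpha, n_beta):
    """Map the top 3 PCA components of regression residuals to RGB colors
    on an (alpha, beta) grid. Returns an array of shape (n_alpha, n_beta, 3)."""
    centered = residuals - residuals.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = centered @ vt[:3].T                       # (n_samples, 3)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    # Assumption: scale each channel independently to [0, 1] for display.
    scaled = (comps - lo) / np.where(hi > lo, hi - lo, 1.0)
    image = np.zeros((n_alpha, n_beta, 3))
    image[alphas, betas] = scaled                     # rows: alpha, cols: beta
    return image

# Synthetic residuals that depend only on gamma = (alpha + beta) mod 7.
rng = np.random.default_rng(0)
n, d = 7, 16
alphas, betas = np.divmod(np.arange(n * n), n)
gammas = (alphas + betas) % n
residuals = rng.normal(size=(n, d))[gammas]
image = residual_rgb(residuals, alphas, betas, n, n)
```

Since these synthetic errors depend only on γ, cells with equal (α + β) mod 7 receive the same color, which is the diagonal-stripe pattern the text describes. The returned array has the shape and value range that image-display functions such as matplotlib's `imshow` accept for RGB input.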