Paper deep dive
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark
Models: Gemma-2-2B, Gemma-2-9B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/12/2026, 7:42:20 PM
Summary
The paper investigates the geometric structure of sparse autoencoder (SAE) features in large language models across three scales: 'atomic' (parallelograms/trapezoids representing semantic relations), 'brain' (spatial modularity of co-occurring features), and 'galaxy' (large-scale distribution of the feature point cloud). It demonstrates that projecting out distractor dimensions improves the quality of semantic crystals and that functionally related features exhibit significant spatial clustering.
Entities (5)
Relation Signals (3)
Sparse Autoencoder → discovers → Features
confidence 95% · sparse autoencoders (SAEs) have discovered large numbers of vectors (“features”) in their activation space
Functional Modularity → exhibitedby → SAE Features
confidence 90% · clusters of co-occurring features, at coarse enough scale, also cluster together spatially
Linear Discriminant Analysis → improves → Crystal Structure
confidence 90% · LDA results in tighter clusters of pairwise Gemma-2-2b activation differences
Cypher Suggestions (2)
Find all features associated with a specific model · confidence 90% · unvalidated
MATCH (f:Feature)-[:BELONGS_TO]->(m:Model {name: 'Gemma-2-2b'}) RETURN f
Identify relationships between methods and the structures they analyze · confidence 85% · unvalidated
MATCH (m:Method)-[:ANALYZES]->(s:Structure) RETURN m.name, s.name
Abstract
Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The "brain" intermediate-scale structure has significant spatial modularity; for example, math and code features form a "lobe" akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The "galaxy"-scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
Tags
Links
- Source: https://arxiv.org/abs/2410.19750
- Canonical: https://arxiv.org/abs/2410.19750
Full Text
60,778 characters extracted from source content.
Abstract Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: (1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently performed with linear discriminant analysis. (2) The “brain” intermediate-scale structure has significant spatial modularity; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. (3) The “galaxy”-scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer. keywords: sparse coding; mechanistic interpretability; neural networks; large language models; clustering Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun and Max Tegmark
Li, Y.; Michaud, E.J.; Baek, D.D.; Engels, J.; Sun, X.; Tegmark, M. Correspondence: tegmark@mit.edu. † These authors contributed equally to this work. 1 Introduction While large language models (LLMs) now exhibit a variety of impressive abilities Hurst et al. (2024); Guo et al. (2025), we largely do not understand the internal cognition that underlies the behavior of these systems. This lack of transparency may pose a challenge for a variety of AI safety Slattery et al. (2024) concerns. For instance, it may be difficult to tell whether seemingly benign model behavior in any particular instance is sycophantic Sharma et al. (2023) or deceptive Park et al. (2023) without an analysis of the internals of the system. Such “interpretability” analysis has already shown promise in auditing AI systems Marks et al. (2024) to identify misaligned goals Ngo et al. (2022). As systems become more powerful, there is a need for methods to further our understanding of the internal representations and algorithms learned by these systems Bereska and Gavves (2024); Sharkey et al. (2025). The past year has seen a breakthrough in understanding how large language models work: sparse autoencoders (SAEs) have discovered large numbers of vectors (“features”) in their activation space that can be interpreted as concepts (Huben et al., 2023; Bricken et al., 2023; Templeton et al., 2024). These advances build on earlier studies applying sparse coding to artificial neural network representations Faruqui et al. (2015); Zhang et al. (2019); Yun et al. (2021), and to earlier work in neuroscience on biological neural representations Olshausen and Field (1996, 1997). Underlying this work is the idea that neural networks use sparse coding to represent concepts in their activation space Elhage et al. (2022).
In particular, sparse autoencoders are motivated by the assumptions that (1) networks compute a variety of “features” from their input, (2) features are represented as one-dimensional directions $d_i$ in activation space, (3) features are represented simply by adding them to the network’s activations, so activation vectors take the form $\sum_i f_i d_i$, and (4) the coefficients $f_i$ are sparse: only a small subset of all possible features “fire” at once. The combination of assumptions (2)–(4) has been called the Linear Representation Hypothesis Park et al. (2023); Olah (2024); Engels et al. (2024). If these assumptions hold, we could automatically discover these features with sparse dictionary learning. Sparse dictionary learning attempts to learn an overcomplete basis (dictionary) $\{d_i\}$ such that vectors $x$ from a given distribution can be represented as sparse linear combinations of dictionary elements. Sparse autoencoders offer a simple approach to sparse dictionary learning. Sparse autoencoders consist of a learnable encoder function Enc, which maps vectors $x \in \mathbb{R}^n$ to a hidden latent representation $f \in \mathbb{R}^m$, and a decoder Dec, which maps the latent $f$ back to $\hat{x} \in \mathbb{R}^n$. The objective of the sparse autoencoder is to accurately reconstruct the input $x$ from a sparse latent representation, and they are trained with gradient descent with a loss function like $\mathcal{L} = \| x - \mathrm{Dec}(\mathrm{Enc}(x)) \|_2^2 + \lambda \| f \|_0$.
Sparse autoencoders use a linear decoder $\mathrm{Dec}(f) = W_d f + b_d$, so that the output of the SAE can be interpreted as a linear combination of features: $\hat{x} = \sum_i f_i W_d^i + b_d$, where $W_d^i$ is the $i$-th column of $W_d$. In practice, hidden latents discovered by sparse autoencoders tend to be more interpretable than neurons, activating in more consistent contexts Huben et al. (2023); Bricken et al. (2023), suggesting that they may be learning the true latents underlying the network’s computation. For AI safety, sparse autoencoders have shown some preliminary success: Ref. Marks et al. (2024) reports specially training an LLM to have a hidden objective, and then challenging separate teams of researchers to identify this objective. One team was able to quickly identify this objective by looking at sparse autoencoder features that activated when the LLM was prompted to exhibit “potentially concerning behaviors”, and then looking at examples in the training data where that same feature fired. Although some early work motivating sparse autoencoders suggested that networks would arrange features maximally spread apart (approximately orthogonal) Elhage et al. (2022), recent works have suggested that features may have a more sophisticated geometric structure Engels et al. (2024); Templeton et al. (2024). Recently, a large collection of SAEs have been made publicly available Lieberum et al. (2024), so it is timely to study their structure at various scales. Thus, the present paper examines sparse autoencoder feature structure at three separate spatial scales, which we refer to informally as the “atom”-scale, “brain”-scale, and “galaxy”-scale.
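The encoder/decoder structure above can be sketched in a few lines of NumPy. This is an illustrative toy (random untrained weights, toy dimensions, and an L1 surrogate for the non-differentiable L0 penalty), not the SAEs used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 8, 32  # activation dim and dictionary size (toy values, not the paper's)
W_e, b_e = rng.normal(size=(m, n)) * 0.1, np.zeros(m)
W_d, b_d = rng.normal(size=(n, m)) * 0.1, np.zeros(n)

def encode(x):
    # Enc: R^n -> R^m; the ReLU keeps most latent coefficients at exactly zero
    return np.maximum(0.0, W_e @ x + b_e)

def decode(f):
    # Dec is linear, so the reconstruction is a linear combination of the
    # dictionary columns W_d[:, i], weighted by the coefficients f_i, plus b_d
    return W_d @ f + b_d

def sae_loss(x, lam=1e-3):
    # Reconstruction error plus sparsity penalty; training typically uses an
    # L1 (or similar) surrogate because the L0 norm has no useful gradient
    f = encode(x)
    return np.sum((x - decode(f)) ** 2) + lam * np.sum(np.abs(f))

x = rng.normal(size=n)
loss = sae_loss(x)
```

The decoder columns `W_d[:, i]` play the role of the feature directions $d_i$ whose geometry the paper studies.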
These playful analogies are not meant to be precise, but instead gesture at certain concepts and methods of analysis from other fields which we apply to understanding language model feature structure. We provide project code at https://github.com/ejmichaud/feature-geometry (accessed on 24 March 2025). This paper is organized as follows. In Section 2, we summarize related work. In Section 3, we investigate if the “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen). In Section 4, we test if the “brain” intermediate-scale structure has functional modularity akin to biological brains. In Section 5, we study the “galaxy” large-scale structure of the feature point cloud, testing whether it is more interestingly shaped and clustered than an isotropic Gaussian distribution, and conclude in Section 6. 2 Related Work Neural network geometry: Many past works have studied the geometry of neural network activations. These works find that the intrinsic dimension of neural network hidden states is much lower than the full model dimension Ansuini et al. (2019), that nearby vectors in activation space are semantically similar Chandrasekaran and Mago (2021), and that well-generalizing neural networks sit at local minima with many “flat” directions in the loss landscape Watanabe (2009). Other works study how representations evolve through models; one hypothesis is “iterative inference”, which claims that neural networks iteratively refine activations layer by layer Rushing and Nanda (2024); Belrose et al. (2023). A contrasting hypothesis is a circuits view, which holds that information flows in discrete steps along a directed acyclic graph through the model, and representations cleanly change between steps Conmy et al. (2023). Another work Park et al.
(2024) found that representations of hierarchically related concepts are orthogonal to each other while categorical concepts are represented as polytopes. Our work is in the same vein as these earlier analyses, but differs in an important way because we use the SAE basis, which represents the model’s atomic concept space instead of its activation space. SAE feature structure: Sparse autoencoders (SAEs) are a recent approach for discovering interpretable language model features without supervision, although relatively few works have examined SAE feature structure. Bricken et al. (2023) and Templeton et al. (2024) both visualize SAE features with UMAP projections and notice that features tend to group together in “neighborhoods” of related features, in contrast to the approximately orthogonal geometry observed in the toy model of Elhage et al. (2022). Engels et al. (2024) find examples of SAE structure where multiple SAE features appear to reconstruct a multi-dimensional feature with interesting geometry, and multiple authors have recently speculated that SAE vectors might contain more important structures (Mendel, 2024; Smith, 2024). Bussmann et al. (2024) suggest that SAE features are in fact linear combinations of more atomic features, and discover these more atomic latents with “meta SAEs”. Our discussion of crystal structure in SAE features is related to this idea that seemingly atomic representations might be composed of more atomic representations. Semantically meaningful linear representations: Early work found that word embedding methods such as GloVe and Word2vec contained directions encoding semantic concepts, e.g., the well-known formula f(king) − f(man) + f(woman) = f(queen) (Drozd et al., 2016; Pennington et al., 2014; Ma and Zhang, 2015).
More recent research has found similar evidence of linear representations in sequence models trained only on next token prediction, including Othello board positions (Nanda et al., 2023; Li et al., 2022), integer lattices (Michaud et al., 2024), the truth value of assertions (Marks and Tegmark, 2023), and numeric quantities such as longitude, latitude, birth year, and death year (Gurnee and Tegmark, 2023; Heinzerling and Inui, 2024), inspiring the Linear Representation Hypothesis (see above). Recent works have also found causal function vectors for in-context learning (Todd et al., 2023; Hendel et al., 2023; Kharlapenko et al., 2024). These function vectors induce the model to perform a certain task when added into the model’s hidden states. Our discussion of crystal structures builds upon these previous works by finding these task vectors and parallelogram structures in sparse autoencoder representations. 3 “Atom”-Scale: Crystal Structure In this section, we search for what we term crystal structure in the point cloud of SAE features. By this we mean geometric structure reflecting semantic relations between concepts, generalizing the classic example of $(a, b, c, d)$ = (man, woman, king, queen) forming an approximate parallelogram where $b - a \approx d - c$. This can be interpreted in terms of two function vectors, $b - a$ and $c - a$, that turn male entities female and turn entities royal, respectively. We also search for trapezoids with only one pair of parallel edges, $b - a \propto d - c$ (corresponding to only one function vector); Figure 1 (right) shows such an example with $(a, b, c, d)$ = (Austria, Vienna, Switzerland, Bern), where the function vector can be interpreted as mapping countries to their capitals. Studying these crystal structures is important because they provide insight into how LLMs internally represent semantic operations and relational knowledge.
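As a toy illustration of this kind of search (hypothetical 2-D "embeddings" and a minimal hand-rolled k-means rather than a library call), one can compute all pairwise difference vectors, normalize them so that clustering compares directions only (the trapezoid case), and cluster; parallel difference vectors, i.e. copies of the same candidate function vector, land in the same cluster:

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_differences(X):
    # All ordered difference vectors x_j - x_i for i != j
    n = len(X)
    return np.array([X[j] - X[i] for i in range(n) for j in range(n) if i != j])

def kmeans(V, k, iters=50):
    # Minimal k-means; on unit-normalized rows, Euclidean distance orders the
    # same way as cosine distance, so this clusters directions
    centers = V[rng.choice(len(V), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((V[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = V[labels == c].mean(axis=0)
    return labels

# Hypothetical 2-D "embeddings": two pairs related by the same +x transformation
words = np.array([[0.0, 0.0], [1.0, 0.0],
                  [0.0, 5.0], [1.0, 5.0]])
diffs = pairwise_differences(words)
directions = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
labels = kmeans(directions, k=2)
```

Here `diffs[0]` and `diffs[8]` are both the transformation vector (1, 0), so they necessarily receive the same cluster label.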
For instance, function vectors help us assess the extent to which semantic structures within models align with human intuition and language logic, as explored in the recent literature Todd et al. (2023); Meng et al. (2022). Figure 1: Parallelogram and trapezoid structure is revealed (left) when distractor dimensions are projected out from the activations using LDA. LDA results in tighter clusters of pairwise Gemma-2-2b activation differences (right), where each cluster corresponds to a different semantic transformation. Distractor features are defined as those that are not related to semantics of the text; for instance, the first principal component of Gemma-2-2b’s Layer 0 activations (top left figure on the right panel) represents word length. Parallelogram or trapezoid structures suggest that there is a unique direction in the activation space that represents each semantic transformation. We search for crystals by computing all pairwise difference vectors and clustering them using the K-means algorithm Ahmed et al. (2020), where the vectors could be either the original model’s hidden state activations (model crystal) or SAE features’ decoder vectors (SAE crystal). We use Gemma-2-2b for the experiment. If there is a direction that represents each semantic transformation, we expect each resulting cluster to correspond to one function vector. In other words, any pair of difference vectors in a cluster will form a trapezoid or parallelogram, depending on whether the difference vectors are normalized or not before clustering (or, equivalently, whether we quantify similarity between two difference vectors via Euclidean distance or cosine similarity). Our initial search for SAE crystals found mostly noise. To investigate why, we decided to focus on activations of the model’s early layers, where many SAE features correspond to a single token.
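A minimal sketch of the LDA step under the "signal/noise" framing: the discriminant directions are eigenvectors of $S_w^{-1} S_b$, where $S_b$ and $S_w$ are the between-cluster and within-cluster scatter matrices. The data below are synthetic (the paper applies this to Gemma-2-2b activations grouped by semantic transformation), with a high-variance "distractor" axis that LDA learns to ignore:

```python
import numpy as np

def lda_directions(X, y, n_components=1):
    # Fisher LDA: maximize between-cluster scatter S_b ("signal") relative to
    # within-cluster scatter S_w ("noise") via eigenvectors of S_w^{-1} S_b
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        S_b += len(Xc) * np.outer(mc - mu, mc - mu)
    # Small ridge on S_w for numerical stability
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(d), S_b))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_components]]

rng = np.random.default_rng(2)
# Two clusters separated along axis 0; axis 1 is a high-variance "distractor"
X = np.vstack([rng.normal([0.0, 0.0], [0.1, 3.0], size=(50, 2)),
               rng.normal([5.0, 0.0], [0.1, 3.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = lda_directions(X, y)  # top direction aligns with axis 0, not the distractor
```

Projecting onto `W` (and discarding the remaining directions) is the analogue of projecting out the word-length distractor in Figure 1.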
Since SAE feature vectors in the early layers are often closely related to the corresponding model activations, we believed that studying the activations of these early layers could help clarify why our initial crystal search primarily found noise. Therefore, we studied Gemma-2-2b residual stream activations for previously reported word ↦ word function vectors from the dataset of (Todd et al., 2023). Figure 1 illustrates that candidate crystal quadruplets are typically far from being parallelograms or trapezoids. This is consistent with multiple papers pointing out that (man, woman, king, queen) is not an accurate parallelogram either. We believe the reason to be the presence of what we term distractor features. We define distractor features to be the features that are not related to semantics of the text. For example, we find that the horizontal axis in Figure 1 (right) corresponds mainly to word length (Appendix B, Figure 10), which is semantically irrelevant and wreaks havoc on the trapezoid (left), since “Switzerland” is much longer than the other words. However, these distractor features were not always interpretable; in some cases, it was difficult to associate features with any clear linguistic property. To eliminate such semantically irrelevant distractor vectors, we wish to project the data onto a lower-dimensional subspace orthogonal to them. For the (Todd et al., 2023) dataset, we accomplish this with linear discriminant analysis (LDA) (Xanthopoulos et al., 2013), which projects onto signal-to-noise eigenmodes where “signal” and “noise” are defined as the covariance matrices of inter-cluster variation and intra-cluster variation, respectively. Figure 1 illustrates that this dramatically improves the cluster and trapezoid/parallelogram quality, highlighting that distractor features can hide existing crystals. 4 “Brain”-Scale: Meso-Scale Modular Structure We now zoom out and look for larger-scale structure.
In particular, we investigate if functionally similar groups of SAE features (which tend to fire together) are also geometrically similar, forming “lobes” in the activation space. We refer to this analysis as “brain”-scale because, in animal brains, functionally similar groups of neurons also typically cluster together spatially. For example, Broca’s area is involved in speech production, the auditory cortex processes sound, and the amygdala is primarily associated with processing emotions. We are curious whether we can find analogous functional modularity in the SAE feature space. While prior work has qualitatively observed that semantically related features are spatially close via UMAP projections of features Bricken et al. (2023); Templeton et al. (2024), we aim to more precisely quantify the relationship between functional similarity and spatial similarity. We test a variety of methods for automatically discovering such functional “lobes” and for quantifying if they are spatially modular. We define a lobe partition as a partition of the SAE feature point cloud into k subsets (“lobes”) that are computed without positional information. Instead, we identify such lobes based on them being functionally related, specifically, tending to fire together within a document. To automatically identify functional lobes, we first compute a histogram of SAE feature co-occurrences. We take Gemma-2-2b and pass documents from The Pile Gao et al. (2020) through it. In this section, we report results with a Layer 12 residual stream SAE with 16k features and an average L0 of 41. For this SAE, we record the features that fire (we count a feature i as firing if its encoder assigns it a coefficient $f_i > 1$). Features are counted as co-occurring if they both fire within the same block of 256 tokens—this length provides a coarse “time resolution” allowing us to find features that tend to fire together within the same document rather than just at the same token.
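For example, one affinity measure between two features, the phi coefficient, can be computed directly from their binarized firing records. A toy sketch (hypothetical boolean firing vectors, one entry per 256-token block):

```python
import numpy as np

def phi_coefficient(fires_a, fires_b):
    # 2x2 contingency counts over blocks: both fire, only a, only b, neither
    a, b = np.asarray(fires_a, bool), np.asarray(fires_b, bool)
    n11 = np.sum(a & b)
    n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b)
    n00 = np.sum(~a & ~b)
    # Phi is Pearson correlation specialized to binary variables
    denom = np.sqrt(float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    return 0.0 if denom == 0 else float(n11 * n00 - n10 * n01) / denom

a = np.array([1, 1, 0, 0, 1, 0], dtype=bool)  # hypothetical firing record
phi_same = phi_coefficient(a, a)    # perfectly co-occurring features
phi_anti = phi_coefficient(a, ~a)   # features that never co-occur
```

A symmetric matrix of such pairwise scores is exactly the kind of affinity matrix that can be fed to spectral clustering.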
We use a max context length of 1024, and only use one such context per document, giving us at most four blocks (and histogram updates) per document of The Pile. We compute histograms across 50k documents. Given this histogram, we compute an affinity score between each pair of SAE features based on their co-occurrence statistics and perform spectral clustering on the resulting affinity matrix. We use the spectral clustering implementation of scikit-learn Pedregosa et al. (2011) with default settings with varying choice of n_clusters. In Figure 2, we visualize lobes discovered with this method with n_clusters=2, 3 via a t-SNE projection Van der Maaten and Hinton (2008). For this figure, we used the “phi coefficient” as the measure of co-occurrence similarity between features. We find that lobes visually appear to be spatially localized. For instance, features which fire primarily on math and code documents tend to cluster together spatially. We experiment with the following notions of co-occurrence-based affinity: simple matching coefficient, Jaccard similarity, Dice coefficient, overlap coefficient, and phi coefficient, which can all be computed just from a co-occurrence histogram. In Appendix A.1, we review definitions for each of these and in Figure 3 illustrate how the choice between them affects the resulting lobe t-SNE plots. We also show how lobes appear when we cluster based on geometry directly using cosine similarities, as described below. Figure 2: Features in the SAE point cloud that tend to fire together within documents are seen to also be geometrically co-located in functional “lobes”, here down-projected to 2D with t-SNE with point size proportional to feature frequency. A 2-lobe partition (left) is seen to break the point cloud into roughly equal parts, active on code/math documents and English language documents, respectively.
A 3-lobe partition (right) is seen to mainly subdivide the English lobe into a part for short messages and dialogue (e.g., chat rooms and parliament proceedings) and one primarily containing long-form scientific papers. Figure 3: Comparison of the lobe partitions of the SAE point cloud discovered with different affinity measures, with the same t-SNE projection as Figure 2. In the top left, we show clusters computed from geometry, using the cosine similarity between features as the affinity score for spectral clustering. All other measures are based on whether SAE features co-occur (fire together) within 256-token blocks, using different measures of affinity. Although the phi coefficient predicts spatial structure best, all co-occurrence measures are seen to discover the code/math lobe. While these plots show a qualitative relationship between co-occurrence and feature geometry, we aim to quantify this relationship. Our null hypothesis is that functionally similar points (of commonly co-occurring SAE features) are uniformly distributed throughout the activation space, showing no spatial modularity. To quantify how statistically significant this is, we use two approaches to rule out the null hypothesis: 1. While we can cluster features based on whether they co-occur, we can also perform spectral clustering based on the cosine similarity between SAE feature decoder vectors. So instead of feature affinity values being, e.g., their co-occurrence phi coefficient, affinity matrix values are instead computed simply from feature geometry as $A_{ij} = d_i \cdot d_j$. Given a clustering of SAE features using cosine similarity and a clustering using co-occurrence, we compute the mutual information between these two sets of labels. In some sense, this measures the amount of information about geometric structure that one obtains from knowing functional structure.
We report the adjusted mutual information Vinh et al. (2009) as implemented by scikit-learn Pedregosa et al. (2011), which corrects for chance agreements between the clusters. 2. Another conceptually simple approach is to train models to predict which functional lobe a feature is in from its geometry. To accomplish this, we take a given set of lobe labels from our co-occurrence-based clustering, and train a logistic regression model to predict these labels directly from the point positions, using an 80-20 train–test split and reporting the balanced test accuracy of this classifier. Figure 4 shows that for both measures, the phi coefficient gives the best correspondence between functional lobes and feature geometry. To show that this is statistically significant, we randomly permute the cluster labels from the cosine similarity-based clustering and measure the adjusted mutual information. We also re-initialize the SAE feature decoder directions from a random Gaussian and normalize, and then train logistic regression models to predict functional lobe from these random feature directions. Figure 4 (bottom) shows that both tests rule out the null hypothesis with high significance, at 954 and 74 standard deviations, respectively, clearly demonstrating that the lobes we see are real and not a statistical fluke. Figure 4: (top left): Adjusted mutual information between spatial clusters and functional (co-occurrence-based) clusters. (top right): logistic regression balanced test accuracy, predicting co-occurrence-based cluster label from position. (bottom left): Adjusted mutual information with randomly permuted cosine similarity-based clustering labels. (bottom right): balanced test accuracy with random unit-norm feature vectors. The statistical significance reported is for phi-based clustering into lobes. 
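The logic of these significance tests can be sketched with toy labels: estimate the mutual information between two labelings, then compare it against a permutation null. This sketch uses plain (unadjusted) MI in pure NumPy; the paper uses scikit-learn's adjusted mutual information, which additionally corrects for chance agreement:

```python
import numpy as np

def mutual_information(u, v):
    # MI (in nats) between two hard clusterings, from the joint label histogram
    u, v = np.asarray(u), np.asarray(v)
    mi = 0.0
    for cu in np.unique(u):
        p_u = np.mean(u == cu)
        for cv in np.unique(v):
            p_uv = np.mean((u == cu) & (v == cv))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (p_u * np.mean(v == cv)))
    return mi

rng = np.random.default_rng(3)
geom = np.repeat([0, 1], 100)  # toy "geometric" cluster labels
func = geom.copy()             # toy "functional" labels, identical here
observed = mutual_information(geom, func)  # = log 2 for two matched halves

# Permutation null: shuffle one labeling and re-measure MI many times
null = np.array([mutual_information(geom, rng.permutation(func))
                 for _ in range(200)])
z_score = (observed - null.mean()) / null.std()
```

Reporting how many null standard deviations the observed value sits above (the `z_score` here) is the same style of statistic as the 954 and 74 standard deviations quoted in the text.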
To assess what each lobe specializes in, we run 10k documents from The Pile through Gemma-2-2b, and again record which SAE features at Layer 12 fire within blocks of 256 tokens. For each 256-token block, we record which lobe has the highest proportion of its features firing. Each document in The Pile is labeled with a name specifying the subset of the corpus that document is from, so across thousands of documents we can then look at a histogram of which lobes were maximally activating for each document type. We show these results for three lobes, computed with the phi coefficient as the co-occurrence measure, in Figure 5. This forms the basis for our lobe labeling in Figure 2. These findings raise interesting questions about whether individual sparse autoencoder features are the most natural units for understanding neural networks Mueller et al. (2024); Olah (2023). In biological brains, one can study individual neurons, groups of neurons, groups of groups of neurons, and so on up to very large-scale structures, and it is not clear a priori what “scale” of analysis will be most fruitful Hoel et al. (2013). We may face a similar ambiguity with sparse autoencoder features, since, as we have seen, groups of co-occurring, geometrically related features can be interpretable and studied in their own right. This question, of whether there is a right “scale” of analysis for SAE features, is made even more salient by the observation in prior work of “feature splitting” Bricken et al. (2023). Figure 5: Fraction of contexts in which each lobe had the highest proportion of activating features. For each document type, these fractions sum to 1 across the lobes. We see that lobe 2 typically disproportionately activates on code and math documents.
Lobes 0 and 1 activate on other documents, with lobe 0 activating more on documents containing short text and dialogue (chat comments, parliamentary proceedings) and lobe 1 activating more on scientific papers. 5 “Galaxy”-Scale: Large-Scale Point Cloud Structure In this section, we further broaden our perspective to analyze the “galaxy”-scale structure of the point cloud, focusing on its overall shape and clustering properties. This analysis is loosely inspired by work in astronomy Kennicutt Jr (1998) characterizing the shape Hubble (1926) and substructure Kravtsov (2010) of galaxies. We start by formulating a simple null hypothesis: The point cloud is drawn from an isotropic multivariate Gaussian distribution. To test this, we analyze the covariance of the data. As illustrated in Figure 6, the eigenvalue spectrum deviates from isotropy, meaning the cloud exhibits directional structure rather than being purely spherical. Even within the first three principal components, the point cloud is anisotropic, with some principal axes slightly wider than others. To quantify these deviations, we analyze the eigenvalue spectrum of the covariance matrix, comparing it to theoretical expectations from random matrix theory (RMT). Figure 6: Three-dimensional point cloud visualizations of top PCA components for the Gemma-2-2b Layer 12 SAE features. 5.1 Shape Analysis In RMT, the covariance matrix of N random vectors from a multivariate Gaussian distribution follows a Wishart distribution Wishart (1928). Under this assumption, we would expect the eigenvalues to be relatively uniform or to follow the Marchenko–Pastur law Marchenko and Pastur (1967). In contrast, we observe a surprising deviation: • The eigenvalue spectrum of the point cloud decays as a power law rather than following the expected Wishart behavior. • As shown in Figure 6, this power law decay is more pronounced in SAE features compared to raw activations.
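A sketch of measuring such a power-law decay (on a synthetic spectrum decaying exactly as rank^(-0.5), not the paper's data): fit a line to log eigenvalue versus log rank, so the fitted slope is the power-law exponent:

```python
import numpy as np

def powerlaw_slope(eigvals, k=100):
    # Slope of log(lambda) vs. log(rank) over the k largest eigenvalues
    lam = np.sort(np.asarray(eigvals))[::-1][:k]
    ranks = np.arange(1, len(lam) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(lam), 1)
    return slope

# Synthetic spectrum decaying exactly as rank^{-0.5} (illustrative only)
spectrum = np.arange(1, 201, dtype=float) ** -0.5
slope = powerlaw_slope(spectrum, k=100)  # recovers -0.5
```

Running this on the covariance eigenvalues of a point cloud, layer by layer, is the kind of regression behind the per-layer slopes reported below.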
Since the abrupt drop-off seen for the smallest eigenvalues is caused by limited data and vanishes in the limit N → ∞, we dimensionally reduce the point cloud to its 100 largest principal components for all subsequent analysis in this section. We describe the shape of this high-dimensional point cloud as resembling a “fractal cucumber”, whose width in successive dimensions falls off like a power law. We find such power law scaling is significantly less prominent for activations than for SAE features; it will be interesting for future work to investigate its origins. Figure 7 (left) shows how the slope of the aforementioned power law depends on LLM layer, computed via linear regression against the 100 largest eigenvalues. We see a clear pattern where middle layers have the steepest power law slopes: Layer 12 has slope −0.47, while early and late layers (e.g., Layers 0 and 24) have shallower slopes of −0.24 and −0.25, respectively. This may hint that middle layers act as a bottleneck, compressing information into fewer principal components, perhaps optimizing for more efficient representation of high-level abstractions. Figure 7 (right) compares the eigenvalue spectra of SAE features and neural activations, indicating a significantly steeper power law decay for SAE features. Activations, in contrast, exhibit a much slower decay, indicating weaker power law behavior and distinct geometric structures in the latent space. Figure 8 (left) explores the effective volume of the point cloud, quantified by the log-determinant of the covariance matrix, across layers. This volume variation further reflects the layer-specific changes in the structure and complexity of the latent space. Figure 7: Eigenvalue distributions for SAE features and activations. Eigenvalues of the covariance matrix for SAE features (left) and neural activations (right) decay approximately as a power law, with slopes varying across layers.
A scaled isotropic Gaussian spectrum is shown for comparison, highlighting the significantly steeper decay for SAE features. Eigenvalue spectra for activations show a much slower decay compared to SAE features, indicating weaker power law behavior and distinct geometric structures.

Figure 8: Layer-wise analysis of latent representations. (left): The power law slope (α) of the eigenvalue spectrum (blue) and the log-determinant of the covariance matrix (green) vary across layers. Both metrics peak in intermediate layers, indicating significant structural changes in the latent space. (right): Estimated clustering entropy across layers with 95% confidence intervals. Middle layers exhibit reduced clustering entropy, while earlier and later layers show higher entropy, reflecting distributed and concentrated feature representations, respectively.

5.2 Clustering Analysis

Clustering of galaxies or microscopic particles is often quantified in terms of a power spectrum or correlation function. This is complicated for our very high-dimensional data, since the underlying density varies with radius and, for a high-dimensional Gaussian distribution, is strongly concentrated around a relatively thin spherical shell. For this reason, we instead quantify clustering by estimating the entropy of the distribution that the point cloud is assumed to be sampled from.
We estimate the entropy H from our SAE feature point cloud using the k-th nearest neighbor (k-NN) method Dasarathy (1991); Kozachenko and Leonenko (1987), computed as follows:

H_features = (d/n) ∑_{i=1}^{n} log(r_i + θ) + log(n − 1) − Ψ    (1)

where r_i is the distance from point i to its k-th nearest neighbor, d is the dimensionality of the point cloud, n is the number of points, and the constant Ψ is the digamma term from the k-NN estimation. As a baseline, the Gaussian entropy represents the maximum possible entropy for a given covariance matrix. For a Gaussian distribution with the same covariance matrix, the entropy is computed as follows:

H_gauss = (d/2)(1 + log(2π)) + (1/2) ∑_{i=1}^{d} log(λ_i)    (2)

where λ_i are the eigenvalues of the covariance matrix. We define the clustering entropy (often referred to as “negentropy” in physics) as H_gauss − H, i.e., how much lower the entropy is than its maximum allowed value. The estimated clustering entropy is shown in Figure 8 (right), plotted across different layers. The results indicate that the SAE point cloud is strongly clustered, particularly in the middle layers. This observation aligns with the reduced clustering entropy seen at intermediate layers, suggesting significant structural differences in the latent representations. In future work, it will be interesting to investigate whether these variations depend mainly on the prominence of crystals or lobes in different layers, or have an altogether different origin.
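The clustering-entropy computation can be sketched as below. The choices k = 3, θ = 1e−10, and the synthetic clouds are illustrative assumptions; note also that the standard Kozachenko–Leonenko estimator includes an additional unit-ball log-volume term omitted from Equation (1) as printed, so absolute values carry a dimension-dependent offset and only comparisons between clouds of the same dimension are meaningful here, matching the paper's use of layer-wise comparisons.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(points, k=3, theta=1e-10):
    """k-NN entropy estimate following Equation (1):
    H = (d/n) * sum_i log(r_i + theta) + log(n-1) - digamma(k)."""
    n, d = points.shape
    tree = cKDTree(points)
    # query k+1 neighbors because the nearest "neighbor" is the point itself
    dists, _ = tree.query(points, k=k + 1)
    r = dists[:, k]  # distance to the k-th genuine neighbor
    return (d / n) * np.sum(np.log(r + theta)) + np.log(n - 1) - digamma(k)

def gaussian_entropy(points):
    """Maximum-entropy baseline for the same covariance (Equation (2))."""
    eigvals = np.linalg.eigvalsh(np.cov(points, rowvar=False))
    d = points.shape[1]
    return 0.5 * d * (1 + np.log(2 * np.pi)) + 0.5 * np.sum(np.log(eigvals))

rng = np.random.default_rng(0)
gauss = rng.standard_normal((4000, 5))                       # unclustered cloud
clustered = np.concatenate([rng.standard_normal((2000, 5)) * 0.1 + c
                            for c in (-3, 3)])               # two tight clusters

neg_gauss = gaussian_entropy(gauss) - knn_entropy(gauss)
neg_clustered = gaussian_entropy(clustered) - knn_entropy(clustered)
print(neg_gauss, neg_clustered)  # the clustered cloud has much higher negentropy
```

The clustered cloud wastes most of its covariance on the separation between clusters, so its entropy falls far below the Gaussian maximum, which is exactly the signal the clustering entropy is designed to detect.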
6 Conclusions

We have searched for structure in the SAE concept universe at three levels: (1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen); these crystals may be revealed by projecting out semantically irrelevant distractor features. (2) The “brain” intermediate-scale structure has significant spatial modularity; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images. (3) The “galaxy” large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers.

While we have observed that SAE features exhibit geometric structure at multiple scales, we have not explained why this structure forms. We think that further work that not only studies the structure of SAE features but also seeks to explain its origin could be highly valuable. Such work may lead to refinements to our theory of how networks represent features in superposition, or to insights that improve sparse autoencoder performance. We hope that our findings serve as a stepping stone toward deeper understanding of SAE features and the workings of large language models, and that this deeper understanding will eventually help to improve the safety of AI systems as they continue to grow in power.

Conceptualization, J.E. and M.T.; software, E.J.M., D.D.B. and X.S.; formal analysis, Y.L., E.J.M. and D.D.B.; investigation, M.T.; writing—original draft preparation, Y.L., E.J.M., D.D.B., J.E., X.S. and M.T.; writing—review and editing, Y.L., E.J.M., D.D.B., J.E. and M.T.; visualization, Y.L., E.J.M. and D.D.B. All authors have read and agreed to the published version of the manuscript. E.J.M., D.D.B. and M.T. are supported by IAIFI through NSF grant PHY-2019786. E.J.M. and J.E. are supported through the NSF GRFP (Grant No. 2141064).
This work is supported by the Rothberg Family Fund for Cognitive Science. Not applicable. We provide code to replicate our results at this repository: https://github.com/ejmichaud/feature-geometry (accessed on 24 March 2025). The authors declare no conflicts of interest.

Appendix A Additional Information on Brain Lobes

A.1 Co-Occurrence Measures

Definitions of co-occurrence-based affinity measures: Let n_ij be the number of times features i and j co-occur. Let m_11 be the number of times i and j co-occur, m_00 the number of times i and j both do not occur, m_10 the number of times i occurs but j does not, m_1∙ the number of times i occurs regardless of whether j occurs, and so on. Then, the following can be determined.

Jaccard similarity, Ref. Jaccard (1908), is as follows:

J_ij = |i ∩ j| / |i ∪ j| = n_ij / (n_ii + n_jj − n_ij)

Dice score, Ref.
Dice (1945), is as follows:

DSC_ij = 2|i ∩ j| / (|i| + |j|) = 2 n_ij / (n_ii + n_jj)

The overlap coefficient is as follows:

overlap_ij = |i ∩ j| / min(|i|, |j|) = n_ij / min(n_ii, n_jj)

The simple matching coefficient is as follows:

SMC_ij = (m_00 + m_11) / (m_00 + m_11 + m_01 + m_10)

The phi coefficient, Ref. Yule (1912), is as follows:

φ_ij = (m_11 m_00 − m_10 m_01) / √(m_1∙ m_0∙ m_∙1 m_∙0)

Figure 9: Histogram, over all features, of phi coefficient with k-th nearest cosine similarity neighbor for (left) Layer 0 and (right) Layer 12. Each line represents a different k. The “random” line is plotted by drawing a random feature for each feature, then computing the phi coefficient. Features with higher cosine similarity have higher phi coefficients, but this is less pronounced in Layer 0 compared to Layer 12.

Appendix B Understanding Principal Components in Difference Space

Figure 10 shows that the first principal component encodes mainly the length difference between two words’ last tokens in Gemma-2-2b Layer 0.
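The co-occurrence affinities of Appendix A.1 can all be computed from a pair of binary firing masks; a minimal sketch, where the function name and the toy occurrence vectors are illustrative, not from the paper:

```python
import numpy as np

def affinities(a, b):
    """Co-occurrence affinities between two binary occurrence vectors
    (1 = the feature fired on a given block, 0 = it did not)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    m11 = np.sum(a & b)    # both fire
    m00 = np.sum(~a & ~b)  # neither fires
    m10 = np.sum(a & ~b)   # only i fires
    m01 = np.sum(~a & b)   # only j fires
    n = len(a)
    return dict(
        jaccard=m11 / (m11 + m10 + m01),
        dice=2 * m11 / (2 * m11 + m10 + m01),
        overlap=m11 / min(m11 + m10, m11 + m01),
        smc=(m00 + m11) / n,
        phi=(m11 * m00 - m10 * m01) / np.sqrt(
            (m11 + m10) * (m01 + m00) * (m11 + m01) * (m10 + m00)),
    )

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
print(affinities(a, b))
# jaccard 0.6, dice 0.75, overlap 0.75, smc 0.75, phi 0.5
```

Note that Jaccard, Dice, and overlap ignore m_00 (joint absences), while SMC and phi account for it, which is why phi is the measure compared against cosine similarity in Figure 9.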
Figure 10: Plot of the first principal component in the difference space as a function of last-token length difference in Gemma-2-2b Layer 0. The linear relationship indicates that the first principal component encodes the length difference between two words’ last tokens.

Figure 11: Silhouette score, a measure of clustering quality, as a function of reduced dimension in LDA. The plot indicates that training an affine transformation for semantic cluster separation is easier in middle layers (Layer 12), where the model starts to develop concept-level understanding of the input.

Appendix C Breaking Down SAE Vectors by PCA Component

An additional investigation of structure we undertake is quantifying how SAE vectors are distributed throughout the PCA components of the activation vectors. To accomplish this, we define a PCA score:

PCA score(feature_j) = (1/n) ∑_i i · (pca_i · feature_j)²

This metric is a weighted sum between 0 and 1 measuring approximately where in the PCA each SAE feature lies. In Figure 12, we plot this metric on a single Gemma Scope SAE (the results look similar on all Gemma Scope SAEs), and we see that there is an intriguing dip into earlier PCA features in the last third of SAE features.

Figure 12: Smoothed PCA scores for each SAE feature of the Layer 12, width 16k, L0 = 176 Gemma Scope 2b SAE, sorted by frequency. PCA score = (1/n) ∑_i i · (pca_i · feature_j)², where n is the number of PCA components. The smoothed curves average this somewhat noisy metric over adjacent sorted features.
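The PCA score above can be sketched directly from its definition. This is an illustrative implementation, assuming unit-norm feature directions and an orthonormal PCA basis, and using 1-based rank weights as in the printed formula:

```python
import numpy as np

def pca_score(feature, pca_dirs):
    """(1/n) * sum_i i * (pca_i . feature)^2 -- a rank-weighted measure of
    where a unit-norm SAE feature lies within the (orthonormal) PCA basis."""
    n = len(pca_dirs)
    proj_sq = (pca_dirs @ feature) ** 2   # squared projection onto each PC
    return np.sum(np.arange(1, n + 1) * proj_sq) / n

n = 10
pca_dirs = np.eye(n)                       # toy orthonormal PCA basis
print(pca_score(pca_dirs[0], pca_dirs))    # feature on the 1st PC -> 1/n = 0.1
print(pca_score(pca_dirs[-1], pca_dirs))   # feature on the last PC -> 1.0
```

Because the squared projections of a unit-norm feature onto an orthonormal basis sum to one, the score is a weighted average of PC ranks: small values mean the feature lives in the leading PCA components, values near one mean it lives in the tail.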
References

Hurst et al. (2024) Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. (2) The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://w-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (accessed on 24 March 2025). Guo et al. (2025) Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. Slattery et al. (2024) Slattery, P.; Saeri, A.K.; Grundy, E.A.; Graham, J.; Noetel, M.; Uuk, R.; Dao, J.; Pour, S.; Casper, S.; Thompson, N. The ai risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv 2024, arXiv:2408.12622. Sharma et al. (2023) Sharma, M.; Tong, M.; Korbak, T.; Duvenaud, D.; Askell, A.; Bowman, S.R.; Cheng, N.; Durmus, E.; Hatfield-Dodds, Z.; Johnston, S.R.; et al. Towards understanding sycophancy in language models. arXiv 2023, arXiv:2310.13548. Park et al. (2023) Park, P.S.; Goldstein, S.; O’Gara, A.; Chen, M.; Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. arXiv 2023, arXiv:2308.14752. Marks et al. (2024) Marks, S.; Treutlein, J.; Bricken, T.; Lindsey, J.; Marcus, J.; Mishra-Sharma, S.; Ziegler, D.; Ameisen, E.; Batson, J.; Belonax, T.; et al. Auditing Language Models for Hidden Objectives. arXiv 2024, arXiv:2503.10965. Ngo et al. (2022) Ngo, R.; Chan, L.; Mindermann, S. The alignment problem from a deep learning perspective. arXiv 2022, arXiv:2209.00626. Bereska and Gavves (2024) Bereska, L.; Gavves, E.
Mechanistic Interpretability for AI Safety—A Review. arXiv 2024, arXiv:2404.14082. Sharkey et al. (2025) Sharkey, L.; Chughtai, B.; Batson, J.; Lindsey, J.; Wu, J.; Bushnaq, L.; Goldowsky-Dill, N.; Heimersheim, S.; Ortega, A.; Bloom, J.; et al. Open Problems in Mechanistic Interpretability. arXiv 2025, arXiv:2501.16496. Huben et al. (2023) Huben, R.; Cunningham, H.; Smith, L.R.; Ewart, A.; Sharkey, L. Sparse Autoencoders Find Highly Interpretable Features in Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. Bricken et al. (2023) Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; et al. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread 2023. Available online: https://transformer-circuits.pub/2023/monosemantic-features/index.html (accessed on 24 March 2025). Templeton et al. (2024) Templeton, A.; Conerly, T.; Marcus, J.; Lindsey, J.; Bricken, T.; Chen, B.; Pearce, A.; Citro, C.; Ameisen, E.; Jones, A.; et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread 2024. Available online: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html (accessed on 24 March 2025). Faruqui et al. (2015) Faruqui, M.; Tsvetkov, Y.; Yogatama, D.; Dyer, C.; Smith, N. Sparse overcomplete word vector representations. arXiv 2015, arXiv:1506.02004. Zhang et al. (2019) Zhang, J.; Chen, Y.; Cheung, B.; Olshausen, B.A. Word embedding visualization via dictionary learning. arXiv 2019, arXiv:1910.03833. Yun et al. (2021) Yun, Z.; Chen, Y.; Olshausen, B.A.; LeCun, Y. Transformer visualization via dictionary learning: Contextualized embedding as a linear superposition of transformer factors. arXiv 2021, arXiv:2103.15949. Olshausen and Field (1996) Olshausen, B.A.; Field, D.J. 
Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 1996, 381, 607–609. Olshausen and Field (1997) Olshausen, B.A.; Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vis. Res. 1997, 37, 3311–3325. Elhage et al. (2022) Elhage, N.; Hume, T.; Olsson, C.; Schiefer, N.; Henighan, T.; Kravec, S.; Hatfield-Dodds, Z.; Lasenby, R.; Drain, D.; Chen, C.; et al. Toy Models of Superposition. Transformer Circuits Thread 2022. Available online: https://transformer-circuits.pub/2022/toy_model/index.html (accessed on 24 March 2025). Park et al. (2023) Park, K.; Choe, Y.J.; Veitch, V. The linear representation hypothesis and the geometry of large language models. arXiv 2023, arXiv:2311.03658. Olah (2024) Olah, C. What is a Linear Representation? What is a Multidimensional Feature? Transformer Circuits Thread 2024. Available online: https://transformer-circuits.pub/2024/july-update/index.html#linear-representations (accessed on 24 March 2025). Engels et al. (2024) Engels, J.; Liao, I.; Michaud, E.J.; Gurnee, W.; Tegmark, M. Not All Language Model Features Are Linear. arXiv 2024, arXiv:2405.14860. Lieberum et al. (2024) Lieberum, T.; Rajamanoharan, S.; Conmy, A.; Smith, L.; Sonnerat, N.; Varma, V.; Kramár, J.; Dragan, A.; Shah, R.; Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv 2024, arXiv:2408.05147. Ansuini et al. (2019) Ansuini, A.; Laio, A.; Macke, J.H.; Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 1–15. Chandrasekaran and Mago (2021) Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–37. Watanabe (2009) Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009; Volume 25. Rushing and Nanda (2024) Rushing, C.; Nanda, N. 
Explorations of Self-Repair in Language Models. arXiv 2024, arXiv:2402.15390. Belrose et al. (2023) Belrose, N.; Furman, Z.; Smith, L.; Halawi, D.; Ostrovsky, I.; McKinney, L.; Biderman, S.; Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv 2023, arXiv:2303.08112. Conmy et al. (2023) Conmy, A.; Mavor-Parker, A.; Lynch, A.; Heimersheim, S.; Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. Adv. Neural Inf. Process. Syst. 2023, 36, 16318–16352. Park et al. (2024) Park, K.; Choe, Y.J.; Jiang, Y.; Veitch, V. The geometry of categorical and hierarchical concepts in large language models. arXiv 2024, arXiv:2406.01506. Mendel (2024) Mendel, J. SAE feature geometry is outside the superposition hypothesis. AI Alignment Forum 2024. Available online: https://w.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis (accessed on 24 March 2025). Smith (2024) Smith, L. The ‘strong’ feature hypothesis could be wrong. AI Alignment Forum 2024. Available online: https://w.lesswrong.com/posts/tojtPCCRpKLSHBdpn/the-strong-feature-hypothesis-could-be-wrong (accessed on 24 March 2025). Bussmann et al. (2024) Bussmann, B.; Pearce, M.; Leask, P.; Bloom, J.I.; Sharkey, L.; Nanda, N. Showing SAE Latents Are Not Atomic Using Meta-SAEs. AI Alignment Forum 2024. Available online: https://w.alignmentforum.org/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes (accessed on 24 March 2025). Drozd et al. (2016) Drozd, A.; Gladkova, A.; Matsuoka, S. Word embeddings, analogies, and machine learning: Beyond king − man + woman = queen. In Proceedings of the Coling 2016, the 26th International Conference on Computational Linguistics: Technical Papers; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; p. 3519–3530. Pennington et al. (2014) Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Doha, Qatar, 2014; p. 1532–1543. Ma and Zhang (2015) Ma, L.; Zhang, Y. Using Word2Vec to process big text data. In Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, USA, 29 October–1 November 2015; p. 2895–2897. Nanda et al. (2023) Nanda, N.; Lee, A.; Wattenberg, M. Emergent linear representations in world models of self-supervised sequence models. arXiv 2023, arXiv:2309.00941. Li et al. (2022) Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv 2022, arXiv:2210.13382. Michaud et al. (2024) Michaud, E.J.; Liao, I.; Lad, V.; Liu, Z.; Mudide, A.; Loughridge, C.; Guo, Z.C.; Kheirkhah, T.R.; Vukelić, M.; Tegmark, M. Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code. Entropy 2024, 26, 1046. Marks and Tegmark (2023) Marks, S.; Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv 2023, arXiv:2310.06824. Gurnee and Tegmark (2023) Gurnee, W.; Tegmark, M. Language models represent space and time. arXiv 2023, arXiv:2310.02207. Heinzerling and Inui (2024) Heinzerling, B.; Inui, K. Monotonic representation of numeric properties in language models. arXiv 2024, arXiv:2403.10381. Todd et al. (2023) Todd, E.; Li, M.L.; Sharma, A.S.; Mueller, A.; Wallace, B.C.; Bau, D. Function vectors in large language models. arXiv 2023, arXiv:2310.15213. Hendel et al. (2023) Hendel, R.; Geva, M.; Globerson, A. In-context learning creates task vectors. arXiv 2023, arXiv:2310.15916. Kharlapenko et al. (2024) Kharlapenko, D.; neverix.; Nanda, N.; Conmy, A. Extracting SAE task features for in-context learning. AI Alignment Forum 2024.
Available online: https://w.alignmentforum.org/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-in-context-learning (accessed on 24 March 2024). Meng et al. (2022) Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in gpt. Adv. Neural Inf. Process. Syst. 2022, 35, 17359–17372. Ahmed et al. (2020) Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295. Xanthopoulos et al. (2013) Xanthopoulos, P.; Pardalos, P.M.; Trafalis, T.B. Linear discriminant analysis. In Robust Data Mining; Springer: New York, NY, USA, 2013; p. 27–33. Gao et al. (2020) Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv 2020, arXiv:2101.00027. Pedregosa et al. (2011) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. Van der Maaten and Hinton (2008) Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. Vinh et al. (2009) Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; p. 1073–1080. Mueller et al. (2024) Mueller, A.; Brinkmann, J.; Li, M.; Marks, S.; Pal, K.; Prakash, N.; Rager, C.; Sankaranarayanan, A.; Sharma, A.S.; Sun, J.; et al. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. arXiv 2024, arXiv:2408.01416. Olah (2023) Olah, C.
Transformer Circuits Thread: Interpretability Dreams; An Informal Note on Future Goals for Mechanistic Interpretability. Transformer Circuits Thread 2023. Available online: https://transformer-circuits.pub/2023/interpretability-dreams/index.html (accessed on 24 March 2025). Hoel et al. (2013) Hoel, E.P.; Albantakis, L.; Tononi, G. Quantifying causal emergence shows that macro can beat micro. Proc. Natl. Acad. Sci. USA 2013, 110, 19790–19795. Kennicutt Jr (1998) Kennicutt, R.C., Jr. Star formation in galaxies along the Hubble sequence. Annu. Rev. Astron. Astrophys. 1998, 36, 189–231. Hubble (1926) Hubble, E.P. Extragalactic Nebulae. Astrophys. J. 1926, 64, 321–369. https://doi.org/10.1086/143018. Kravtsov (2010) Kravtsov, A. Dark matter substructure and dwarf galactic satellites. Adv. Astron. 2010, 2010, 281913. Wishart (1928) Wishart, J. The generalised product moment distribution in samples from a normal multivariate population. Biometrika 1928, 20, 32–52. Marchenko and Pastur (1967) Marchenko, V.; Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Mat. Sb. 1967, 72, 4. Dasarathy (1991) Dasarathy, B.V. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques; IEEE Computer Society Tutorial: Los Alamitos, CA, USA, 1991. Kozachenko and Leonenko (1987) Kozachenko, L.F.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Peredachi Informatsii 1987, 23, 9–16. Jaccard (1908) Jaccard, P. Nouvelles Recherches Sur La Distribution Florale. Bull. De La Société Vaudoise Des Sci. Nat. 1908, 44, 223–270. Dice (1945) Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302. Yule (1912) Yule, G.U. On the Methods of Measuring Association Between Two Attributes. J. R. Stat. Soc. 1912, 75, 579–652.