
Paper deep dive

Towards Atoms of Large Language Models

Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Theoretical · Embeddings: 100

Models: Gemma2-2B, Gemma2-9B, Llama3.1-8B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%

Last extracted: 3/11/2026, 12:31:47 AM

Summary

The paper introduces 'Atom Theory' to define, evaluate, and identify the fundamental representational units (FRUs) of large language models (LLMs), termed 'atoms'. By utilizing the Atomic Inner Product (AIP) to account for non-Euclidean representational geometry, the authors establish criteria for ideal atoms based on faithfulness (R^2) and stability (q^*). They demonstrate that traditional units like neurons and features fail these criteria, and subsequently use threshold-activated sparse autoencoders (TSAEs) to identify atoms in models like Gemma2 and Llama3.1 that exhibit near-perfect faithfulness and stability.

Entities (5)

Atom Theory · theoretical-framework · 100%
Atomic Inner Product · mathematical-metric · 100%
Gemma2-2B · large-language-model · 100%
Llama3.1-8B · large-language-model · 100%
Threshold-activated Sparse Autoencoders · model-architecture · 100%

Relation Signals (3)

Atom Theory defines Atoms

confidence 100% · we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms.

Threshold-activated Sparse Autoencoders identifies Atoms

confidence 100% · we prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs).

Atomic Inner Product grounds Atom Theory

confidence 95% · demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory.

Cypher Suggestions (2)

Retrieve the relationship between the theoretical framework and its core metric. · confidence 95% · unvalidated

MATCH (t:Theory {name: 'Atom Theory'})-[:GROUNDED_BY]->(m:Metric {name: 'Atomic Inner Product'}) RETURN t, m

Find all models where atoms have been identified using Atom Theory. · confidence 90% · unvalidated

MATCH (m:Model)-[:HAS_IDENTIFIED_ATOMS]->(a:Atom) RETURN m.name, count(a)

Abstract

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code available at https://github.com/ChenhuiHu/towards_atoms.

Tags

ai-safety (imported, 100%) · mechanistic-interp (suggested, 92%) · theoretical (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

99,691 characters extracted from source content.


Towards Atoms of Large Language Models

Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Abstract

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code available at https://github.com/ChenhuiHu/towards_atoms.

Keywords: Atom Theory, Fundamental Representational Units, Atomic Inner Product, Atoms, Large Language Models

Figure 1: Illustration of Atom Theory. (a) Atoms are defined based on the atomic inner product, inducing representability, sparsity, and separability. (b) Atoms are evaluated by faithfulness ($R^2$) and stability ($q^*$), measuring fidelity and stable-atom fraction. (c) Threshold-activated SAEs enable atom identification, with the encoder as an atom detector and the decoder as the target atom set.

1 Introduction

Large language models (LLMs), trained on vast corpora, exhibit emergent knowledge and reasoning abilities (Petroni et al., 2019; Brown et al., 2020; Achiam et al., 2023). Yet such information is not stored in explicit symbolic structures, but implicitly embedded within high-dimensional representations (Nanda et al., 2023; Gurnee et al., 2023). This raises a fundamental question: do LLMs contain fundamental representational units (FRUs), an atomic structure underlying how they encode and compose information? These representational units are critical for understanding, interpreting, and controlling LLMs (Olah et al., 2020).

Traditionally, neurons have been regarded as such FRUs (Olah et al., 2017; Dai et al., 2022; Chen et al., 2024). However, neurons frequently exhibit substantial polysemanticity (Elhage et al., 2022), raising doubts about their validity for analysis. To address this issue, features extracted from internal representations (Cunningham et al., 2023; Chen et al., 2025) have been proposed as alternative FRUs (Olah et al., 2020). Yet this perspective remains controversial: (i) features fail to fully reconstruct the original representations (Bricken et al., 2023), raising concerns about faithfulness; and (ii) features split into finer ones or merge into broader ones under varying decomposition settings (Bussmann et al., 2025; Chanin et al., 2025), undermining stability. Although prior work implicitly treats neurons and features as FRUs, there is still no formal definition of FRUs for LLMs, which hinders principled evaluation and identification and ultimately limits theoretical clarity.

In this paper, we propose Atom Theory to systematically define, evaluate, and identify the FRUs of LLMs, termed atoms (Fig. 1). Specifically, to characterize the underlying geometry of LLM representations, we introduce the (non-Euclidean) atomic inner product (AIP). Based on the AIP, we formally define atoms (Fig. 1(a)) by three properties: representability (representations can be faithfully reconstructed from atoms), sparsity (each representation involves only a few atoms), and separability (atoms are approximately orthogonal under the AIP). Representability ensures that atoms form a faithful set for the representation space. Sparsity and separability jointly enable efficient encoding of representations by approximately orthogonal atoms with minimal overlap (Elhage et al., 2022). Accordingly, sparsity and separability are tightly coupled, and we explicitly quantify their relationship in the following criteria.

To operationalize this definition, we next introduce quantitative criteria for evaluating whether candidate units qualify as atoms (Fig. 1(b)). Representability is measured by the coefficient of determination $R^2$, which quantifies faithfulness. Sparsity and separability, drawing on compressed sensing (Donoho, 2006; Candès et al., 2006), are unified into a single metric $q^*$ to quantify stability. This metric corresponds to monorepresentationality: within the regime characterized by $q^*$, atoms and their combinations are distinguishable (i.e., not confusable), which is a desirable property under approximate orthogonality. Monorepresentationality provides a structural foundation for understanding LLM representations, offering the stability required for monosemanticity. Finally, to provide theoretical guarantees for atom identification, we prove that threshold-activated sparse autoencoders (TSAEs) (Fig. 1(c)) can identify the target atom set. Overall, we establish a unified theoretical framework that provides guarantees for defining, evaluating, and identifying atoms.

Having introduced Atom Theory, we validate that its foundation, the AIP, provides a principled basis for understanding LLM representations. Empirically, we uncover a pervasive representation shift across layers of differently scaled models from multiple LLM families (Fig. 2), including GPT (Radford et al., 2019; Wang and Komatsuzaki, 2022), Pythia (Biderman et al., 2023), Llama (Touvron et al., 2023; Dubey et al., 2024), and Gemma (Team et al., 2024). This shift arises from the Softmax operation in LLMs, which drives the centroid of the distribution of pairwise representation angles away from 90° under the Euclidean inner product, thereby inducing a global bias in the representation space and distorting the representational geometry. Introducing the AIP effectively corrects this shift (Fig. 3), removing the global bias and restoring the centroid to 90°.
This demonstrates that the AIP captures the underlying geometry of LLM representations, thereby grounding Atom Theory.

Building on Atom Theory, we evaluate whether candidate representational units satisfy the criteria for ideal atoms. We show that widely used representational units, neurons and features, remain substantially distant from ideal atoms (Fig. 4). Although neurons, as the basic computational units of neural networks, exhibit perfect faithfulness ($R^2=1$), they display extremely low stability ($q^*=0.5\%$). Features achieve improved stability ($q^*=68.2\%$) but remain unstable and exhibit low faithfulness ($R^2=48.8\%$). These results quantitatively reveal the limitations of neurons and features, indicating that these common units are not ideal atoms.

Leveraging Atom Theory, we identify the atoms of LLMs. Based on the theoretical identifiability of TSAEs, we conduct large-scale experiments to characterize the relationship between data scale and TSAE capacity (Fig. 5), showing that reliable atom identification is achieved only when the TSAE capacity exceeds a critical threshold for a given data scale. This is intuitive: data scale determines the scale of atoms, which in turn dictates the TSAE capacity required for their identification. Guided by this insight, we achieve faithful reconstructions ($R^2=99.90\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B using TSAEs with JumpReLU activation (Erichson et al., 2019; Rajamanoharan et al., 2024b), and verify high stability of the identified units ($q^*=99.85\%$), yielding ideal atoms statistically (Tab. 1). Further analysis demonstrates that the identified atoms are consistent with theoretical expectations and exhibit substantially higher monosemanticity (Fig. 6).

In summary, our contributions are as follows:
• We propose Atom Theory, a rigorous theoretical framework based on the AIP that systematically defines, evaluates, and identifies the FRUs of LLMs, i.e., atoms.
• We empirically uncover a representation shift in LLMs and show that the AIP corrects this shift to characterize the underlying representational geometry, validating the foundation of Atom Theory.
• Building on Atom Theory, we use faithfulness and stability to systematically and quantitatively reveal the limitations of neurons and features as FRUs.
• Leveraging Atom Theory, we establish atom identifiability under TSAEs, characterize the relationship between data scale and TSAE capacity, and identify FRUs in LLMs that satisfy the criteria of ideal atoms. Further analysis shows that these atoms align with theoretical expectations and exhibit high monosemanticity.

2 Preliminary

Background on Language Models. Consider an $L$-layer language model over a vocabulary $V$. For an input sequence $x=[x_1,x_2,\cdots,x_T]$ with $x_i\in V$, the model assigns each token $x_i$ an embedded representation $h_i^0\in\mathbb{R}^H$, which is updated at layer $l$ as $h_i^l = h_i^{l-1} + a_i^l + v_i^l$, where $a_i^l$ and $v_i^l$ denote the outputs of the attention and MLP modules, respectively. From the residual-stream perspective, the representation after $L$ layers is $h_i^L = h_i^0 + \sum_{l=1}^L a_i^l + \sum_{l=1}^L v_i^l$. The probability distribution $y$ over the next token is obtained from the final representation via $y=\mathrm{Softmax}(W_U^\top h_T^L)$, where $W_U\in\mathbb{R}^{H\times|V|}$ is the unembedding matrix.
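As a reading aid, the residual-stream view above can be written in a few lines of code. This is a toy sketch only: attn and mlp below are placeholder functions standing in for the model's attention and MLP modules, and all sizes are arbitrary.

import numpy as np

H, L, V = 64, 4, 1000            # toy sizes: hidden dim, layers, vocabulary

def attn(h):                     # placeholder for the attention output a^l
    return 0.01 * h

def mlp(h):                      # placeholder for the MLP output v^l
    return 0.01 * h

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = np.random.randn(H)           # h^0: embedded representation of one token
for l in range(L):               # residual stream: h^l = h^{l-1} + a^l + v^l
    h = h + attn(h) + mlp(h)

W_U = np.random.randn(H, V)      # unembedding matrix
y = softmax(W_U.T @ h)           # next-token distribution y = Softmax(W_U^T h^L)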
3 Atom Theory

In language models, all information is embedded in high-dimensional representations. Our objective is to identify the fundamental representational units (FRUs), which we term atoms. Formally, we consider a collection of representations $M=\{m_i\}_{i=1}^{|M|}$, where $m_i\in\mathbb{R}^H$. Each representation can be expressed as $m_i=\sum_j \delta(i,j)\,d_j$, where $\delta(i,j)\ge 0$ denotes the presence and magnitude of the $j$-th representational unit $d_j\in\mathbb{R}^H$ in the $i$-th representation $m_i$. However, in representation space, the family of representational units admitting such a decomposition is, in principle, infinite. The central question is how to define FRUs, i.e., atoms.

A natural criterion is distinguishability: each atom should be detectable or manipulable without interfering with others. In high-dimensional spaces, this criterion translates into orthogonality: atoms occupy mutually orthogonal directions, making their identities distinguishable via inner products. Thus, the choice of inner product is critical. While the Euclidean inner product is commonly used, it is not necessarily appropriate for language models. Following Park et al. (2023) and Hu et al. (2025), we consider the following reparameterization of $W_U$ and $h^L$:

$W_U' \leftarrow A^{-\top} W_U + b\,\mathbf{1}^\top, \quad h'^L \leftarrow A\,h^L$, (3.1)

where $A\in\mathbb{R}^{H\times H}$ is an invertible linear transform, $b\in\mathbb{R}^H$, and $\mathbf{1}\in\mathbb{R}^{|V|}$ is the all-ones vector. Owing to the translation invariance of Softmax, this reparameterization leaves the output distribution unchanged: $y=\mathrm{Softmax}(W_U^\top h^L)=\mathrm{Softmax}(W_U'^\top h'^L)$. See Appendix A.1 for further details. Since the training objective of language models depends on representations solely through Softmax probabilities, different pairs $(W_U, h^L)$ under (3.1) are observationally indistinguishable, producing exactly the same outputs for all inputs. Thus, even for a trained checkpoint, such reparameterizations leave the model's input-output behavior unchanged, so $h^L$ is identifiable only up to an invertible linear transformation $A$ in principle. Due to the residual-stream architecture and the linearity of matrix multiplication, this invariance propagates to all hidden representations and their representational units $d_j$, which are likewise identifiable only up to $A$. Crucially, the Euclidean inner product is not invariant under reparameterization (3.1): $\langle d_i, d_j\rangle \ne \langle A d_i, A d_j\rangle$. Therefore, the Euclidean geometric relations (e.g., angles and orthogonality) between $d_i$ and $d_j$ depend on the chosen parameterization, so the Euclidean inner product does not provide a canonical geometry for language-model representations.

3.1 Atomic Inner Product

To better understand language-model representations and thereby define atoms within them, we require additional principles to specify an appropriate inner product. We therefore introduce an inner product with the desired properties.

Definition 3.1 (Atomic Inner Product; AIP). Let $\mathcal{D}=\mathrm{span}(D)$, where $D=\{d_j\}_{j=1}^{|D|}$ denotes the atom set. The atomic inner product $\langle\cdot,\cdot\rangle_S$ is an inner product on $\mathcal{D}$ such that $\langle d_i, d_j\rangle_S=0$ for all $d_i,d_j\in D$ with $i\ne j$.

Atoms are indexed arbitrarily, and any permutation of indices leaves their geometry invariant. Hence, there is no principled basis for assigning different norms to different atoms. By this symmetry, we assume a common norm under the chosen inner product, i.e., $\|d_i\|_S=c>0$ for all $i\in[|D|]$. The constant $c$ cancels naturally in the subsequent analysis.
By abuse of notation, we also use $D\in\mathbb{R}^{H\times|D|}$ to denote the matrix with columns $d_j$. We next characterize the atomic inner product in an explicit form.

Theorem 3.2 (Explicit Form of the Atomic Inner Product). Let $\langle d_i, d_j\rangle_S = d_i^\top S d_j$ be an atomic inner product with $S\in\mathbb{R}^{H\times H}$ symmetric and positive definite. If the columns of $D=[d_1,d_2,\cdots,d_{|D|}]$ form the atom set such that $\|d_i\|_S=c>0$ for all $i$, and $\mathcal{D}\simeq\mathbb{R}^H$, then $S=c^2(DD^\top)^{-1}$.

All proofs are provided in Appendix A. By analogy with cosine similarity, we introduce the normalized atomic inner product to remove the dependence on $c$.

Corollary 3.3 (Normalized Atomic Inner Product; NAIP). Let the atomic inner product be defined by $\langle d_i, d_j\rangle_S = d_i^\top S d_j$, where $S$ is symmetric and positive definite. Suppose the columns of $D=[d_1,\cdots,d_{|D|}]$ form the atom set satisfying $\|d_i\|_S=c>0$ for all $i$. Then, for any $i,j$,

$\langle d_i,d_j\rangle_{\tilde S} := \frac{\langle d_i,d_j\rangle_S}{\|d_i\|_S\,\|d_j\|_S} = d_i^\top \tilde S d_j, \quad \tilde S=(DD^\top)^{-1}$. (3.2)

Consequently, the bilinear form $\langle d_i,d_j\rangle_{\tilde S}=d_i^\top\tilde S d_j$ defines the normalized atomic inner product.

Unlike the causal inner product (Park et al., 2023), which is defined on the static word-vector space of the unembedding matrix, our AIP (or NAIP) is defined on the dynamic, input-dependent representation space. This enables direct analysis of internal representations and thereby allows the definition of atoms as fundamental representational units under the AIP.

Remark. Define $\tilde d_i=\tilde S^{1/2} d_i$ and $\tilde d_j=\tilde S^{1/2} d_j$. Under this transformation, $\langle d_i,d_j\rangle_{\tilde S}=\langle\tilde d_i,\tilde d_j\rangle$, where the right-hand side denotes the Euclidean inner product. Hence, properties of the Euclidean inner product transfer directly to the NAIP; $\tilde d_i$ and $\tilde d_j$ are accordingly termed normalized atoms.
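Theorem 3.2 and Corollary 3.3 are easy to check numerically in the square case $|D|=H$. The sketch below is a toy example with a random invertible atom matrix (not code from the paper's repository): it builds $\tilde S=(DD^\top)^{-1}$ and verifies that the atoms come out orthonormal under the resulting NAIP.

import numpy as np

rng = np.random.default_rng(0)
H = 16
D = rng.standard_normal((H, H))       # toy atom set spanning R^H (|D| = H)

S_tilde = np.linalg.inv(D @ D.T)      # NAIP matrix (DD^T)^{-1}

def naip(u, v, S):
    # normalized atomic inner product <u, v>_S = u^T S v
    return u @ S @ v

G = D.T @ S_tilde @ D                 # Gram matrix of the atoms under the NAIP
assert np.allclose(G, np.eye(H), atol=1e-6)   # orthonormal when |D| = H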
3.2 Formal Definition of Atoms

Having established a principled perspective for understanding representations in language models, we proceed to formally define atoms. However, the preceding analysis assumes an idealized setting in which atoms are strictly orthogonal under the chosen inner product. Although this assumption yields a clean notion, it constrains the number of atoms to at most the representation dimension ($|D|=H$), rendering the formulation impractical. Elhage et al. (2022) observed that sparsity induces the emergence of approximately orthogonal representations in neural networks to cope with limited representational dimensionality, a phenomenon termed superposition. Hu et al. (2025) later identified pervasive superposition in language models. Motivated by these findings, we introduce sparsity, which naturally leads to approximate orthogonality and enables a formally grounded, practical definition of atoms.

Definition 3.4 (Sparsity Level). Let $M=\{m_i\}_{i=1}^{|M|}\subset\mathbb{R}^H$ be a collection of representations. Suppose there exist $D=[d_1,\cdots,d_{|D|}]\in\mathbb{R}^{H\times|D|}$ and $\delta_i\in\mathbb{R}_{\ge 0}^{|D|}$ such that $m_i=D\delta_i$ for all $i\in[|M|]$. The sparsity level is $K:=\max_i\|\delta_i\|_0$, where $\|\cdot\|_0$ denotes the $\ell_0$ norm.

Sparsity enables the number of atoms to substantially exceed the ambient dimension, yielding an overcomplete structure with $|D|\gg H$ that captures rich world knowledge, while simultaneously minimizing mutual interference.

Remark. In the basic setting of §3.1 with $|D|=H$, one can verify that $\tilde S=(DD^\top)^{-1}$ satisfies $D^\top\tilde S D=I_{|D|}$, so the atoms form an orthonormal basis under $\langle\cdot,\cdot\rangle_S$. When $|D|\gg H$, $\tilde S=(DD^\top)^{-1}$ is well defined provided $\mathrm{rank}(D)=H$, but $G:=D^\top\tilde S D$ becomes a rank-$H$ projection rather than $I_{|D|}$, so the atoms cannot all be mutually orthogonal under $\langle\cdot,\cdot\rangle_S$. Therefore, exact orthogonality is unattainable, motivating the introduction of approximate orthogonality. This consideration motivates the following definition.

Definition 3.5 ($\varepsilon$-Approximately Orthogonal Atoms). The atom set $\{d_i\}_{i=1}^{|D|}$ is $\varepsilon$-approximately orthogonal if

$|\langle d_i,d_j\rangle_{\tilde S}|\le\varepsilon, \quad \forall i\ne j$,

where $\langle x,y\rangle_{\tilde S}:=x^\top\tilde S y$ denotes the normalized atomic inner product and $\tilde S:=(DD^\top)^{-1}$.

Remark. In the ideal setting of exact orthogonality, the inner products $\langle d_i,d_j\rangle_{\tilde S}$ (equivalently, $\langle\tilde d_i,\tilde d_j\rangle$) for $i\ne j$ follow a Dirac measure concentrated at the origin. Under $\varepsilon$-approximate orthogonality, they are instead expected to follow a Gaussian distribution $\mathcal{N}(0,s^2)$ with small variance, which converges to the Dirac measure as $s\to 0$.

We now introduce a formal definition of atoms.

Definition 3.6 (Atoms). Let $M=\{m_i\}_{i=1}^{|M|}$ be a collection of representations. Suppose there exist $D=[d_1,\cdots,d_{|D|}]\in\mathbb{R}^{H\times|D|}$ and $\Delta=[\delta_1,\cdots,\delta_{|M|}]\in\mathbb{R}_{\ge 0}^{|D|\times|M|}$ such that, for a given sparsity level $K\in\mathbb{N}$,

$\forall i\in[|M|], \quad m_i=D\delta_i \text{ with } \|\delta_i\|_0\le K$. (3.3)

Furthermore, $|\langle d_i,d_j\rangle_{\tilde S}|\le\varepsilon$ for all $i\ne j$, where $\tilde S:=(DD^\top)^{-1}$. Under these conditions, $\{d_i\}_{i=1}^{|D|}$ is called the atom set of $M$, and each $d_i$ is referred to as an atom.

Intuitively, atoms are characterized by three properties: representability, where each representation can be faithfully expressed by atoms, i.e., $m_i=D\delta_i$; sparsity, where each representation involves only a few atoms, i.e., $\|\delta_i\|_0\le K$; and separability, where atoms are approximately orthogonal, i.e., $|\langle\tilde d_i,\tilde d_j\rangle|\le\varepsilon$. Representability is a natural requirement, while sparsity and separability jointly enable efficient encoding of separable information under approximate orthogonality. We further quantitatively characterize the relationship between sparsity and separability in §3.3.

Remark. $K$ quantifies sparsity without enforcing a specific sparsity regime, ensuring broad applicability of the definition. Moreover, pre-multiplying both sides of (3.3) by $\tilde S^{1/2}$ yields $\tilde m_i=\tilde D\delta_i$, with $\tilde m_i:=\tilde S^{1/2}m_i$ and $\tilde D:=\tilde S^{1/2}D=[\tilde d_1,\cdots,\tilde d_{|D|}]$, which simplifies subsequent derivations.
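The three defining properties can be probed on synthetic data. The sketch below uses arbitrary toy sizes and a random overcomplete dictionary: it constructs K-sparse representations $m_i=D\delta_i$, then measures the observed sparsity level and the $\varepsilon$ of approximate orthogonality; normalizing atom norms under the NAIP via the Gram diagonal is an implementation choice assumed here, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
H, nD, nM, K = 64, 256, 1000, 4      # toy sizes: dim, atoms, samples, sparsity

D = rng.standard_normal((H, nD))
D /= np.linalg.norm(D, axis=0)       # toy overcomplete atom set, |D| >> H

Delta = np.zeros((nD, nM))
for i in range(nM):                  # K-sparse nonnegative coefficients
    supp = rng.choice(nD, size=K, replace=False)
    Delta[supp, i] = rng.uniform(0.5, 1.0, size=K)
M = D @ Delta                        # representability: m_i = D delta_i

S_tilde = np.linalg.inv(D @ D.T)     # NAIP matrix (DD^T)^{-1}
G = D.T @ S_tilde @ D                # Gram matrix of atoms under the NAIP
d_norm = np.sqrt(np.diag(G))
Gn = G / np.outer(d_norm, d_norm)    # cosine under the NAIP
off = Gn - np.diag(np.diag(Gn))
eps = np.abs(off).max()              # separability: epsilon of approx orthogonality
K_obs = int((Delta != 0).sum(axis=0).max())   # observed sparsity level K

print(f"K = {K_obs}, epsilon = {eps:.3f}")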
3.3 Evaluation of Atoms

Having formalized atoms via representability, sparsity, and separability, we now evaluate whether candidate representational units satisfy the criteria of ideal atoms. Representability is measured by the coefficient of determination $R^2 := 1 - \frac{\sum_i\|x_i-\hat x_i\|^2}{\sum_i\|x_i-\bar x\|^2}$, which quantifies the proportion of variance in the original representations explained by atoms and thus reflects faithfulness. While sparsity and separability can be quantified by $K$ and $\varepsilon$, they do not directly indicate unit quality. In this section, we introduce a unified metric that integrates sparsity and separability to characterize the stability of representational units.

We note that $\delta_i\in\mathbb{R}^{|D|}$ can be viewed as a sparse representation of high-dimensional semantics, which is compressed by $D\in\mathbb{R}^{H\times|D|}$ to produce $m_i\in\mathbb{R}^H$. This perspective reveals a close connection to compressed sensing (Donoho, 2006; Candès et al., 2006), whose core idea is that a signal sparse in some basis can be recovered from far fewer linear measurements than its ambient dimension, with the Restricted Isometry Property (RIP) providing the essential guarantee.

Definition 3.7 (Restricted Isometry Property; RIP). A matrix $\tilde D\in\mathbb{R}^{H\times|D|}$ is said to satisfy the $K$-RIP if there exists a constant $\delta_K\in[0,1)$ such that, for any $K$-sparse vector $\delta\in\mathbb{R}^{|D|}$ (i.e., $\|\delta\|_0\le K$),

$(1-\delta_K)\|\delta\|_2^2 \le \|\tilde D\delta\|_2^2 \le (1+\delta_K)\|\delta\|_2^2$. (3.4)

Here, $\delta_K$ is called the $K$-RIP constant of $\tilde D$. Intuitively, projecting a sparse vector via $\tilde D$ to a lower dimension preserves its geometric structure, ensuring the possibility of recovery. Direct verification of the RIP is NP-hard, but coherence provides a computable upper bound on $\delta_K$.

Theorem 3.8 (Coherence-RIP Upper Bound). Let $\tilde D\in\mathbb{R}^{H\times|D|}$ and define the coherence $\mu:=\max_{i\ne j}|\langle\tilde d_i,\tilde d_j\rangle|\le\varepsilon$. For any $K$-sparse vector $\delta\in\mathbb{R}^{|D|}$ with $\|\delta\|_0\le K$,

$(1-(K-1)\mu)\|\delta\|_2^2 \le \|\tilde D\delta\|_2^2 \le (1+(K-1)\mu)\|\delta\|_2^2$. (3.5)

Hence $\delta_K(\tilde D)\le(K-1)\mu$; in particular, $\tilde D$ satisfies the $K$-RIP whenever $(K-1)\mu<1$.

In other words, coherence provides a computable criterion for verifying the RIP, ensuring that all $K$-sparse vectors projected through $\tilde D$ preserve geometric structure, an essential prerequisite in compressed sensing. Nevertheless, the RIP alone does not preclude non-uniqueness: even if $(K-1)\mu<1$ holds, the sparse coefficients associated with representations need not be unique.

Theorem 3.9 (Uniqueness and Exact $\ell_1$ Recoverability). Let $\tilde D\in\mathbb{R}^{H\times|D|}$ and define the coherence $\mu:=\max_{i\ne j}|\langle\tilde d_i,\tilde d_j\rangle|\le\varepsilon$. If $\mu<\frac{1}{2K-1}$, then for every $\delta\in\mathbb{R}^{|D|}$ with $\|\delta\|_0\le K$, the $K$-sparse representation determined by $\tilde m=\tilde D\delta$ is unique; that is, no other $K$-sparse vector yields the same $\tilde m$. Moreover, $\delta$ is the unique minimizer of the convex program

$\min_{x\in\mathbb{R}^{|D|}}\|x\|_1 \quad \text{subject to} \quad \tilde D x=\tilde m$. (3.6)

Crucially, this intrinsically characterizes the monorepresentationality of $\tilde D$: under the condition $\mu<\frac{1}{2K-1}$, any representation formed as a $K$-sparse linear combination of representational units in $\tilde D$ has a unique sparse decomposition, with no other combination yielding the same representation.

Corollary 3.10 (Monorepresentationality / Injectivity). Under the condition $\mu<\frac{1}{2K-1}$ of Theorem 3.9, define $\Sigma_K:=\{\delta\in\mathbb{R}^{|D|}:\|\delta\|_0\le K\}$ and $\Phi:\Sigma_K\to\mathbb{R}^H$ by $\Phi(\delta):=\tilde D\delta$. Then $\Phi$ is injective on $\Sigma_K$. That is, for any $x,y\in\Sigma_K$, if $\tilde D x=\tilde D y$, it follows that $x=y$.

Within this regime, representational units and their combinations are unambiguous in representation space, i.e., monorepresentationality.
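As a concrete reading of the condition $\mu<\frac{1}{2K-1}$, the sketch below (toy data, not the paper's evaluation code) computes the coherence of a random dictionary with normalized columns and the largest sparsity level $K$ for which Theorem 3.9 guarantees unique recovery.

import numpy as np

rng = np.random.default_rng(2)
H, nD = 64, 128
D_tilde = rng.standard_normal((H, nD))
D_tilde /= np.linalg.norm(D_tilde, axis=0)   # normalized atoms (unit columns)

G = D_tilde.T @ D_tilde                      # Gram matrix of normalized atoms
mu = np.abs(G - np.eye(nD)).max()            # coherence mu = max_{i != j} |<d_i, d_j>|

# Largest K with mu < 1/(2K - 1), i.e. K < (1/mu + 1)/2 strictly,
# under which Theorem 3.9 guarantees unique K-sparse decompositions.
K_max = int(np.ceil((1.0 / mu + 1.0) / 2.0)) - 1
print(f"mu = {mu:.3f}, unique recovery guaranteed up to K = {K_max}")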
By contrast, monosemanticity (Bricken et al., 2023; Templeton et al., 2024) concerns the alignment of a unit with a specific meaning, concept, or function. The former is formally provable, whereas the latter is statistical and interpretive. Monorepresentationality is a prerequisite for monosemanticity, as it provides the required structural stability; otherwise, units would be non-unique and interpretation unstable. Without additional semantic anchoring or inductive assumptions, theoretical guarantees primarily concern the structural stability induced by monorepresentationality, while monosemanticity must be determined empirically via further semantic-alignment experiments. Sparsity ($K$) and separability ($\mu$) are thus unified by the condition $\mu<\frac{1}{2K-1}$, under which stability holds. More practical metrics are discussed in §4.2 and Appendix C.5.

Remark. For intuition, the above result can be viewed as a generalization of the strictly orthogonal case ($|D|=H$). In this setting, $\tilde D\in\mathbb{R}^{H\times H}$ has orthogonal columns ($\mu=0$), i.e., $\tilde D^\top\tilde D=I$. Any representation $\tilde m\in\mathbb{R}^H$ then admits a unique decomposition $\delta=\tilde D^\top\tilde m$, satisfying $\tilde m=\tilde D\delta$. Here the sparsity level $K=H=|D|$ satisfies $\mu<\frac{1}{2K-1}$. Thus, under strict orthogonality, representations and atom coefficients are in one-to-one correspondence, yielding a direct and unique identification of atoms.

Figure 2: Representation shift at the final layer across multiple LLMs under the Euclidean inner product, with the centroid of pairwise representation angles deviating from 90°. See Appendix B for full results.

Figure 3: Correction of representation shift at the final layer across multiple LLMs via the atomic inner product, with the centroid of pairwise representation angles consistently approaching 90°. See Appendix B for full results.

3.4 Identification of Atoms

We have defined atoms and established criteria for ideal atoms. A central question remains whether such atoms can be identified in practice. Since sparse autoencoders (SAEs) are a standard approach for learning disentangled representations (Cunningham et al., 2023), we next demonstrate that, under appropriate conditions, SAEs can indeed identify these atoms, rendering the theory practically applicable.

Theorem 3.11 (Identifiability of Threshold-activated SAEs; TSAEs). Let $M=\{m_i\}_{i=1}^{|M|}\subset\mathbb{R}^H$ with $m_i=D\delta_i$, where $D=[d_1,\cdots,d_{|D|}]\in\mathbb{R}^{H\times|D|}$ satisfies $|\langle\tilde d_i,\tilde d_j\rangle|\le\varepsilon$ for all $i\ne j$. Suppose each $\delta_i\in\mathbb{R}^{|D|}$ is $K$-sparse, i.e., $\|\delta_i\|_0\le K$. Consider the threshold activation function

$\sigma_\tau(x)=\begin{cases}0, & x<\tau,\\ x, & x\ge\tau,\end{cases}$ (3.7)

with threshold $\tau>0$. Assume there exist constants $0<\delta_{\min}\le\delta_{\max}$ such that, for each support $S_i=\mathrm{supp}(\delta_i)$, $\delta_{\min}\le\delta_{ij}\le\delta_{\max}$ for all $j\in S_i$. If the amplitude gap and threshold satisfy $\varepsilon K\delta_{\max}<\tau<\delta_{\min}-\varepsilon(K-1)\delta_{\max}$, which is feasible whenever $\delta_{\min}>\varepsilon(2K-1)\delta_{\max}$, then setting the TSAE with $W_{\mathrm{dec}}=D$ and $W_{\mathrm{enc}}=D^\top\tilde S$ yields

$W_{\mathrm{dec}}\,\sigma_\tau(W_{\mathrm{enc}} m_i)=m_i, \quad \forall i$. (3.8)

This parameterization exactly recovers the atom set $D$.

Therefore, TSAEs can identify atoms in principle. By contrast, ReLU (Templeton et al., 2024), lacking a threshold term, fails to satisfy the support-separation condition and is thus theoretically insufficient. This responds to O'Neill et al. (2024): the limitation of SAEs arises not from their linear-nonlinear mechanism, but from the absence of threshold activation, which prevents effective atom identification. The threshold activation considered here denotes a class of activation functions rather than a specific instantiation. Existing examples include JumpReLU (Erichson et al., 2019; Rajamanoharan et al., 2024a) and TopK (Makhzani and Frey, 2013; Gao et al., 2024). Although TopK is analogous in some respects, it relies on a fixed $K$, limiting adaptivity and practical applicability. Moreover, the motivation differs fundamentally: threshold activation is proposed to enable effective atom identification under approximate orthogonality, whereas JumpReLU was introduced to address feature shrinkage (Rajamanoharan et al., 2024a), and TopK is designed to directly control sparsity (Gao et al., 2024).

Remark. Although the theorem is stated for a uniform scalar threshold $\tau$, it extends directly to a coordinate-wise threshold vector $\boldsymbol{\tau}$, without affecting the squeeze condition or the proof. This generalization enlarges the feasible interval when activation magnitudes differ, thereby improving support separation and the robustness of atom identification.
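The construction in Theorem 3.11 can be verified numerically. The sketch below works directly in the normalized space (so $W_{\mathrm{enc}}=\tilde D^\top$ and $W_{\mathrm{dec}}=\tilde D$, as in the remark following Corollary 3.3); the sizes and amplitude bounds are arbitrary choices made so that the squeeze condition typically holds, and reconstruction is only approximate per sample (exact in expectation), so a tolerance is used. None of this is the paper's training code.

import numpy as np

rng = np.random.default_rng(3)
H, nD, K = 2048, 2300, 3                 # toy sizes; increase H if the assert trips
d_min, d_max = 0.8, 1.0                  # amplitude bounds on supported coefficients

D = rng.standard_normal((H, nD))
D /= np.linalg.norm(D, axis=0)           # columns play the role of normalized atoms

G = D.T @ D
eps = np.abs(G - np.eye(nD)).max()       # coherence of the atom set
assert d_min > eps * (2 * K - 1) * d_max, "squeeze condition infeasible; raise H"

lo = eps * K * d_max                     # feasible threshold interval (Thm 3.11)
hi = d_min - eps * (K - 1) * d_max
tau = (lo + hi) / 2                      # any tau in (lo, hi) separates supports

def tsae(m):
    z = D.T @ m                          # encoder (W_enc = D^T in normalized space)
    z[z < tau] = 0.0                     # threshold activation sigma_tau
    return D @ z, z                      # decoder (W_dec = D)

for _ in range(20):                      # check support recovery on toy samples
    supp = rng.choice(nD, size=K, replace=False)
    delta = np.zeros(nD)
    delta[supp] = rng.uniform(d_min, d_max, size=K)
    m = D @ delta
    m_hat, z = tsae(m)
    assert set(np.flatnonzero(z)) == set(supp)    # exact support recovery
    rel_err = np.linalg.norm(m_hat - m) / np.linalg.norm(m)
    assert rel_err < 0.25                # approximate reconstruction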
4 Experiments

We next empirically validate and apply Atom Theory. In §4.1, we uncover a pervasive representation shift in large language models (LLMs) and show that the atomic inner product (AIP) corrects it, capturing the representational geometry and validating the foundation of Atom Theory. In §4.2, we use faithfulness and stability to quantitatively reveal the limitations of neurons and features as fundamental representational units (FRUs). Leveraging the identifiability guarantees of threshold-activated SAEs (TSAEs), we establish in §4.3 the relationship between data scale and TSAE capacity through large-scale experiments. Finally, in §4.4, we identify FRUs across LLMs that exhibit high faithfulness and stability, and demonstrate strong monosemanticity.

4.1 Representation Shift

In this section, we uncover a pervasive representation shift in LLMs and show that the AIP corrects it, capturing the representational geometry and grounding Atom Theory.

Experimental Setup. We randomly sample 128 subject entities from WikiData (Vrandečić and Krötzsch, 2014) and extract the corresponding activations across all layers of multiple LLM families, including GPT-2, GPT-J, Pythia, Llama-2, Llama-3, Llama-3.1, and Gemma2, which serve as the target representations. This sample size suffices to characterize the distribution of pairwise representation angles, and additional samples do not change the overall distribution (see Appendix B, Figs. 17-21). It also facilitates ensuring no overlap with activations used in subsequent sampling. We then collect 100K activations $k$ per layer from Wikipedia corpora (with no overlap with the target representations) and compute $\mathbb{E}[kk^\top]$ to estimate the normalized AIP. To analyze the distribution of representations, we use cosine similarity, which removes scale effects and aligns with the theoretical framework. For representations $u,v\in\mathbb{R}^H$, the cosine similarity under the Euclidean inner product (EIP) is defined as $\cos(u,v)=\frac{\langle u,v\rangle}{\|u\|_2\|v\|_2}$. Under the AIP induced by $\tilde S$, the corresponding cosine similarity is $\cos_{\tilde S}(u,v)=\frac{\langle u,v\rangle_{\tilde S}}{\|u\|_{\tilde S}\|v\|_{\tilde S}}=\frac{u^\top\tilde S v}{\sqrt{u^\top\tilde S u}\,\sqrt{v^\top\tilde S v}}$, where $\tilde S$ is estimated in practice as $(\mathbb{E}[kk^\top])^{-1}$. For clarity, cosine similarities are further converted into angles. See Appendix B for more details.
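Given activation matrices, the angle statistics described above reduce to a few lines. The sketch below uses synthetic activations with an artificial global bias to mimic the shift; the estimator $\tilde S=(\mathbb{E}[kk^\top])^{-1}$ follows the setup above, while everything else (sizes, the bias of 3.0) is a toy assumption.

import numpy as np

def angles(reps, S=None):
    """Pairwise angles (degrees) under the EIP (S=None) or the AIP induced by S."""
    if S is None:
        S = np.eye(reps.shape[1])
    g = reps @ S @ reps.T                        # Gram matrix u^T S v
    norms = np.sqrt(np.diag(g))
    cos = g / np.outer(norms, norms)
    iu = np.triu_indices_from(cos, k=1)          # distinct pairs only
    return np.degrees(np.arccos(np.clip(cos[iu], -1.0, 1.0)))

rng = np.random.default_rng(4)
H = 128
k = rng.standard_normal((20_000, H)) + 3.0      # toy corpus activations, biased
reps = rng.standard_normal((128, H)) + 3.0      # toy "target representations"

S_tilde = np.linalg.inv(k.T @ k / len(k))       # estimate (E[k k^T])^{-1}

print("EIP centroid:", angles(reps).mean())             # shifted away from 90
print("AIP centroid:", angles(reps, S_tilde).mean())    # restored toward 90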
Experimental Results. When representation angles are computed under the EIP, the centroid of the angular distribution deviates markedly from 90°, revealing a representation shift (Fig. 2; full results in Appendix B, Figs. 7-16). This phenomenon consistently appears across all layers of LLM families, regardless of architecture or training corpus, indicating a systematic angular bias among representations. This bias shows that LLM representations are globally attracted toward a dominant direction, such that even unrelated representations exhibit high cosine similarity. As shown in Appendix B (Figs. 17-21), increasing the sample size used to characterize the distribution leaves the centroid unchanged, confirming attraction toward the same direction. In contrast, the AIP corrects this shift, restoring the centroid of the angular distribution to 90° (Fig. 3; full results in Appendix B, Figs. 22-31). This indicates that the AIP removes the systematic angular bias induced by the EIP, so that angles between representations reflect genuine differences rather than metric-induced artifacts, which is essential for further analysis of LLM representations. Collectively, these results show that the AIP captures the underlying geometry of LLM representations and grounds Atom Theory.

Figure 4: Comparison of neurons, features, and ideal atoms across all layers of different LLMs. Ideal atoms are required to exhibit both high faithfulness and high stability, corresponding to $R^2=1$ and $q^*=1$, respectively. Values of $R^2$ below 0 are clipped to 0.

4.2 Neurons or Features as Ideal Atoms?

In this section, we evaluate whether commonly used representational units, including neurons and features, satisfy the criteria of ideal atoms, i.e., faithfulness and stability.

Experimental Setup. We use all 20K subject entities from the CounterFact dataset (Meng et al., 2022), a subset of WikiData, and extract the corresponding neurons and features (details in Appendix C.4) activated on these entities across all layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B as the target representational units. Under Atom Theory, ideal atoms are evaluated along two dimensions. Faithfulness measures how accurately representational units reconstruct the original representations and is quantified by the coefficient of determination $R^2:=1-\frac{\sum_i\|x_i-\hat x_i\|^2}{\sum_i\|x_i-\bar x\|^2}$, where $\hat x_i$ and $\bar x$ denote the predicted representation and the sample mean, respectively. Stability captures structural robustness and is quantified by the maximal quantile $q^*:=\sup\{q \mid \mu_q<\frac{1}{2K_q-1}\}$, where $\mu_q$ and $K_q$ denote quantile coherence and quantile sparsity. Ideal atoms satisfy $R^2=1$ and $q^*=1$. Full definitions and evaluation details are provided in Appendix C.5.

Experimental Results. As shown in Fig. 4, neurons, as the basic computational units of neural networks, exhibit perfect faithfulness ($\overline{R^2}=1$) but extremely low stability ($\overline{q^*}=0.5\%$), where the overline denotes an average over layers (and models). Features improve stability ($\overline{q^*}=68.2\%$) relative to neurons, yet remain unstable and display low faithfulness ($\overline{R^2}=48.8\%$). Theoretically, stability is necessary for monosemanticity (§3.3). Accordingly, the extremely low stability of neurons implies polysemanticity, consistent with prior findings (Olah et al., 2020). Features exhibit higher stability and thus improved monosemanticity (Chen et al., 2025), but still fall short of ideal atoms. Overall, both neurons and features exhibit a clear gap from ideal atoms.
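For reference, the two criteria can be sketched in code. $R^2$ is standard; for $q^*$, the full definitions of quantile coherence $\mu_q$ and quantile sparsity $K_q$ live in Appendix C.5 (not included here), so the simple quantile choices below are assumptions, not the paper's exact procedure.

import numpy as np

def faithfulness(X, X_hat):
    """Coefficient of determination R^2 over a set of representations."""
    ss_res = np.sum((X - X_hat) ** 2)
    ss_tot = np.sum((X - X.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def stability(D, Delta, grid=1000):
    """q* = sup{q : mu_q < 1/(2 K_q - 1)}, with mu_q and K_q taken here as plain
    q-quantiles of pairwise coherences and per-sample L0 sparsity (an assumption)."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)
    coh = G[np.triu_indices_from(G, k=1)]        # pairwise coherences
    l0 = (Delta != 0).sum(axis=0)                # per-sample sparsity
    q_star = 0.0
    for q in np.linspace(0.0, 1.0, grid):
        mu_q = np.quantile(coh, q)
        K_q = max(np.quantile(l0, q), 1.0)
        if mu_q < 1.0 / (2.0 * K_q - 1.0):
            q_star = q
    return q_star

rng = np.random.default_rng(6)                   # tiny usage example on toy data
D = rng.standard_normal((64, 256))
Delta = np.zeros((256, 500))
for i in range(500):
    idx = rng.choice(256, size=4, replace=False)
    Delta[idx, i] = rng.uniform(0.5, 1.0, 4)
X = (D @ Delta).T
print("R^2 =", faithfulness(X, X), " q* ~", stability(D, Delta))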
Figure 5: Matching TSAE capacity and data scale on Gemma2-2B (measured by $R^2$). Data × and TSAE × denote data scale and model capacity (interval 9,216). Red dashed lines mark the capacity range enabling reliable atom identification.

4.3 TSAE Capacity Meets Data Scale

Next, we address the practical problem of identifying the FRUs of LLMs. Theorem 3.11 shows that TSAEs are in principle capable of identifying atoms, but it does not specify how to choose the model capacity given the data scale in practice. In particular, we must determine the relationship between the data scale $|M|$ and the model capacity $|D|$ of TSAEs in order to reliably identify atoms.

Experimental Setup. We use subject entities from WikiData and extract the corresponding activations from the first layer of Gemma2-2B, yielding a sufficiently large representation dataset (1.9B samples), which is randomly sampled during training (details in Appendix C). Using an interval of 9,216, we evaluate the $R^2$ obtained by training TSAEs across varying data scales and model capacities. Due to the high computational cost of this grid search, experiments are conducted exclusively on Gemma2-2B. As implied by the feasible-interval condition in Theorem 3.11, faithfulness is achievable only when stability holds. For brevity, we report faithfulness here, and present extensive empirical results on the stability of the identified atoms in §4.4, confirming that the representations indeed exhibit a stable atomic structure.

Experimental Results. As shown in Fig. 5, we find that high faithfulness occurs only when the TSAE capacity exceeds a critical threshold for a given data scale. Intuitively, the data scale determines the scale of FRUs, which in turn determines the TSAE capacity required for their identification. This finding also prompts a rethink of current SAE training paradigms, which heuristically choose model capacity and then train on massive activations from large corpora. While broadly applicable across tasks, their limited faithfulness ultimately constrains downstream reliability.

4.4 Atoms of LLMs

In this section, we identify widespread atoms of LLMs that exhibit high faithfulness and stability. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially stronger monosemanticity.

Experimental Setup. Consistent with §4.2, we use all subject entities from CounterFact and extract the corresponding activations across all layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B. Guided by the insights from §4.3, we uniformly adopt a 4× TSAE with JumpReLU activation (Erichson et al., 2019; Rajamanoharan et al., 2024b) to identify atoms. We evaluate the identified units with faithfulness ($R^2$) and stability ($q^*$). Full details on data, training, and computational costs are provided in Appendix C.

Experimental Results.

Table 1: Faithfulness and stability of identified units across models. Reported values are averaged over all layers. More detailed and supplementary results are provided in Appendix C.5 (Tabs. 3-5).

Model         Faithfulness ($R^2$)   Stability ($q^*$)
Gemma2-2B     99.92%                 99.74%
Gemma2-9B     99.93%                 99.87%
Llama3.1-8B   99.85%                 99.95%

Across Gemma2-2B (layers 1-26), Gemma2-9B (layers 1-42), and Llama3.1-8B (layers 1-30), we consistently achieve high faithfulness ($\overline{R^2}=99.90\%$) and stability ($\overline{q^*}=99.85\%$), as shown in Tab. 1. These results indicate that the identified units approach ideal atoms statistically.
Further analysis (see Appendix C.6) reveals that the training process is minimally sensitive to hyperparameters (Fig. 36); that the encoder and decoder, when randomly initialized and trained independently without weight tying or additional constraints, converge to structures consistent with Theorem 3.11 (Figs. 37-39); and that the identified atoms are approximately orthogonal under the AIP, approaching the theoretical Dirac distribution (Figs. 40-42). Although stability provides a necessary structural foundation for monosemanticity, we validate the latter empirically. Specifically, we uniformly sample representational units across layers of the models and evaluate monosemanticity using LLM-as-a-Judge with manual verification (see Appendix C.7 for details and case studies). As shown in Fig. 6, atoms with high faithfulness and stability consistently exhibit stronger monosemanticity.

Figure 6: Monosemanticity scores of representational units across models, using GPT-5.2 with manual verification. The blue dashed line indicates random-guess performance (0.1).

These findings reveal that LLMs contain FRUs and offer a new perspective on their internal representations.

5 Related Work

Neurons. Early interpretability studies regarded neurons, the minimal computational units of neural networks, as the basic units of analysis. Subsequent work sought to ascribe functional interpretations to individual neurons (Bills et al., 2023; Geva et al., 2020). However, neurons face the polysemanticity problem, activating for multiple semantically unrelated patterns (Olah et al., 2020), a phenomenon attributed to superposition (Elhage et al., 2022). These findings indicate that neurons are unsuited as such units, motivating a shift from neurons to features (Olah et al., 2020).

Features. Although features initially lacked a unified formal definition (Elhage et al., 2022), they are commonly understood as linear directions with specific meaning (Hewitt and Manning, 2019; Park et al., 2023; Gurnee et al., 2023; Chen et al., 2025). Sparse autoencoders (SAEs) were introduced to learn such features (Cunningham et al., 2023) and subsequently scaled to larger settings (Gao et al., 2024; Templeton et al., 2024). Rajamanoharan et al. (2024a) and Rajamanoharan et al. (2024b) optimized the architecture to mitigate feature shrinkage (Wright and Sharkey, 2024). Despite widespread adoption (Lieberum et al., 2024; He et al., 2024), existing SAEs remain limited by incomplete reconstruction, with the unreconstructed component termed "dark matter" (Engels et al., 2024), and by instability from feature splitting and merging (Bussmann et al., 2025; Chanin et al., 2025), undermining their suitability as FRUs. We therefore propose atoms as FRUs and develop Atom Theory.

6 Conclusion

This paper introduces and validates Atom Theory for characterizing the fundamental representational units (FRUs) of large language models (LLMs). We formalize atoms, establish criteria for ideal atoms, and prove their identifiability via threshold-activated sparse autoencoders (TSAEs). Empirically, we show that LLMs widely contain FRUs with high faithfulness and stability, exhibiting strong monosemanticity. Overall, these results elevate interpretability from heuristic analysis to a principled theory of FRUs in LLMs.

7 Impact Statement

This work contributes to the interpretability of large language models by introducing Atom Theory to identify fundamental representational units. Improved understanding of internal representations may enhance the transparency, safety, and reliability of AI systems, supporting model auditing and alignment. We expect the net impact to be positive.

References
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. (2023) Pythia: a suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397-2430.
S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders (2023) Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html
A. Bridge (2001) Wikipedia, the free encyclopedia. San Francisco (CA): Wikimedia Foundation.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877-1901.
B. Bussmann, N. Nabeshima, A. Karvonen, and N. Nanda (2025) Learning multi-level features with matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547.
E. J. Candès, J. Romberg, and T. Tao (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2), pp. 489-509.
D. Chanin, T. Dulka, and A. Garriga-Alonso (2025) Feature hedging: correlated features break narrow sparse autoencoders. arXiv preprint arXiv:2505.11756.
Y. Chen, P. Cao, Y. Chen, K. Liu, and J. Zhao (2024) Journey to the center of the knowledge neurons: discoveries of language-independent knowledge neurons and degenerate knowledge neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 17817-17825.
Y. Chen, P. Cao, K. Liu, and J. Zhao (2025) The knowledge microscope: features as better analytical lenses than neurons. arXiv preprint arXiv:2502.12483.
H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei (2022) Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8493-8502.
D. L. Donoho (2006) Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289-1306.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The Llama 3 herd of models. arXiv e-prints, arXiv:2407.
N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652.
J. Engels, L. Riggs, and M. Tegmark (2024) Decomposing the dark matter of sparse autoencoders. arXiv preprint arXiv:2410.14670.
N. B. Erichson, Z. Yao, and M. W. Mahoney (2019) JumpReLU: a retrofit defense strategy for adversarial attacks. arXiv preprint arXiv:1904.03750.
L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu (2024) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.
M. Geva, R. Schuster, J. Berant, and O. Levy (2020) Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913.
W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas (2023) Finding neurons in a haystack: case studies with sparse probing. arXiv preprint arXiv:2305.01610.
Z. He, W. Shu, X. Ge, L. Chen, J. Wang, Y. Zhou, F. Liu, Q. Guo, X. Huang, Z. Wu, et al. (2024) Llama Scope: extracting millions of features from Llama-3.1-8B with sparse autoencoders. arXiv preprint arXiv:2410.20526.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129-4138.
C. Hu, P. Cao, Y. Chen, K. Liu, and J. Zhao (2025) Knowledge in superposition: unveiling the failures of lifelong knowledge editing for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 24086-24094.
T. Lieberum, S. Rajamanoharan, A. Conmy, L. Smith, N. Sonnerat, V. Varma, J. Kramár, A. Dragan, R. Shah, and N. Nanda (2024) Gemma Scope: open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147.
A. Makhzani and B. Frey (2013) k-sparse autoencoders. arXiv preprint arXiv:1312.5663.
K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022) Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35, pp. 17359-17372.
N. Nanda, A. Lee, and M. Wattenberg (2023) Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941.
C. O'Neill, A. Gumran, and D. Klindt (2024) Compute optimal inference and provable amortisation gap in sparse autoencoders. arXiv preprint arXiv:2411.13117.
C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter (2020) Zoom in: an introduction to circuits. Distill 5 (3), e00024-001.
C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill. https://distill.pub/2017/feature-visualization
K. Park, Y. J. Choe, and V. Veitch (2023) The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.
F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), p. 9.
S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramár, R. Shah, and N. Nanda (2024a) Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014.
S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, and N. Nanda (2024b) Jumping ahead: improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435.
G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al. (2024) Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan (2024) Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
D. Vrandečić and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78-85.
B. Wang and A. Komatsuzaki (2022) GPT-J-6B: a 6 billion parameter autoregressive language model (2021). https://github.com/kingoflolz/mesh-transformer-jax
B. Wright and L. Sharkey (2024) Addressing feature suppression in SAEs. In AI Alignment Forum, Vol. 6.

Appendix A Proofs

A.1 Explanation for Equation 3.1

For $W_U' \leftarrow A^{-\top}W_U + b\,\mathbf{1}^\top$, $h'^L \leftarrow A h^L$ (3.1), we provide a simple derivation as follows:

$W_U'^\top h' = (A^{-\top}W_U + b\,\mathbf{1}^\top)^\top (A h)$ (A.1)
$= W_U^\top (A^{-1}A) h + \mathbf{1}(b^\top A h)$ (A.2)
$= W_U^\top h + c(h)\,\mathbf{1}$, (A.3)

where $c(h)=b^\top A h\in\mathbb{R}$ is a scalar. Using the translation-invariance property of Softmax, $\mathrm{Softmax}(z+c\mathbf{1})=\mathrm{Softmax}(z)$, the result follows. It is important to note that this is the only such form: $W_U$ can only be identified up to an invertible transformation plus a bias, and $h$ can only be identified up to an invertible transformation.
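The invariance argument in A.1 is simple to check numerically. The sketch below draws a random reparameterization $(A, b)$ (a random Gaussian matrix is almost surely invertible) and confirms that the output distribution is unchanged; all sizes are toy choices.

import numpy as np

rng = np.random.default_rng(5)
H, V = 32, 100

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

W_U = rng.standard_normal((H, V))
h = rng.standard_normal(H)

A = rng.standard_normal((H, H))          # almost surely invertible
b = rng.standard_normal(H)
ones = np.ones(V)

W_U_prime = np.linalg.inv(A).T @ W_U + np.outer(b, ones)  # A^{-T} W_U + b 1^T
h_prime = A @ h

y1 = softmax(W_U.T @ h)
y2 = softmax(W_U_prime.T @ h_prime)
assert np.allclose(y1, y2)               # output distribution is unchanged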
Combining this with the earlier condition gives |D|=rank​(D)=H|D|=rank(D)=H, which shows that D is invertible. Consequently, S=c2​(D​D⊤)−1.S=c^2(D )^-1. (A.7) ∎ A.3 Proof of Corollary 3.3 See 3.3 Proof. Since ⟨i,j⟩S=0 d_i, d_j _S=0 for i≠ji≠ j, and ‖i‖S=‖j‖S=c>0\| d_i\|_S=\| d_j\|_S=c>0, it follows that D⊤​S​D=c2​ID SD=c^2I. Therefore, S~=1c2​S S= 1c^2S satisfies D⊤​S~​D=ID SD=I, which implies that the atoms are orthonormal. Since D is invertible, we also have S~=D−⊤​D−1=(D​D⊤)−1 S=D^- D^-1=(D )^-1, and thus ⟨⋅,⋅⟩S~ ·,· _ S is a symmetric positive-definite inner product. ∎ A.4 Proof of Theorem 3.8 See 3.8 Proof. Let supp​()=⊆[|D|]supp( δ)=S [|D|] and ||≤K|S|≤ K. Then, we have ‖D~​‖22=⊤​(D~⊤​D~)​=∑i∈δi2+2​∑i<ji,j∈δi​δj​⟨i~,j~⟩.\| D δ\|_2^2= δ ( D D) δ= _i _i^2+2 _ subarrayci<j\\ i,j subarray _i _j d_i, d_j . (A.8) By the fact that |⟨i~,j~⟩|≤μ| d_i, d_j |≤μ and applying the triangle inequality, we obtain: ‖D~​‖22 \| D δ\|_2^2 ≥∑i∈δi2−2​μ​∑i<j|δi​δj|, ≥ _i _i^2-2μ _i<j| _i _j|, (A.9) ‖D~​‖22 \| D δ\|_2^2 ≤∑i∈δi2+2​μ​∑i<j|δi​δj|. ≤ _i _i^2+2μ _i<j| _i _j|. (A.10) Next, we observe that: (∑i∈|δi|)2=∑i∈δi2+2​∑i<j|δi​δj|≤||​∑i∈δi2≤K​∑i∈δi2. ( _i | _i| )^2= _i _i^2+2 _i<j| _i _j|≤|S| _i _i^2≤ K _i _i^2. (A.11) Thus, we have 2​∑i<j|δi​δj|≤(K−1)​∑i∈δi22 _i<j| _i _j|≤(K-1) _i _i^2. Substituting this back, we conclude the proof. ∎ A.5 Proof of Theorem 3.9 See 3.9 Proof. We first prove that under the condition μ<12​K−1μ< 12K-1, the K-sparse representation is unique. Suppose there exist two distinct K-sparse coefficient vectors ,′ δ, δ such that D~​=D~​′ D δ= D δ . Let =−′≠ h= δ- δ ≠ 0. Then D~​=0 D h=0 and ‖0≤2​K\| h\|_0≤ 2K. By Theorem 3.8 (applied with K replaced by 2​K2K), we have (1−(2​K−1)​μ)​‖22≤‖D~​‖22= 0. (1-(2K-1)μ )\,\| h\|_2^2\ ≤\ \| D h\|_2^2\ =\ 0. (A.12) If μ<12​K−1μ< 12K-1, then the prefactor on the left is strictly positive, which forces ‖2=0\| h\|_2=0. This contradicts ≠ h≠ 0. Hence uniqueness holds. Next, we prove that under the same condition μ<12​K−1μ< 12K-1, the sparse vector δ is also the unique solution of the convex optimization problem min∈ℝ|D|⁡‖1s.t.D~​=~. _ x ^|D|\| x\|_1 .t. D x= m. (A.13) The overall strategy is as follows: (i) show that the null space property of order K (NSPK) holds under the assumption μ<12​K−1μ< 12K-1; (i) recall the equivalence NSPK ⇔ exact and unique recovery of any K-sparse solution via noiseless ℓ1 _1-minimization. Formally, the null space property of order K (NSPK) is defined as ∀∈ker(D~)∖,∀⊆[n],||≤K:‖1<‖c‖1,∀ h∈ ( D) \ 0\,\ [n],\ |S|≤ K: \,\| h_S\|_1<\| h_S^c\|_1\,, (A.14) where ker⁡(D~)=:D~​= ( D)=\ h: D h= 0\, ⊆[n]S [n] is an index set with [n]=1,…,n[n]=\1,…,n\, h_S denotes the restriction of h to the coordinates in S (with other entries set to zero), and c=[n]∖S^c=[n] is a complementary set. Step (i): Proof of NSPK. Let G=D~⊤​D~G= D D denote the Gram matrix. Since each column of D~ D is normalized, we have Gj​j=1G_j=1 and |Gi​j|≤μ|G_ij|≤μ for i≠ji≠ j. Take any ∈ker⁡(D~)∖ h∈ ( D) \ 0\ and any index set S with ||=K|S|=K. Since D~​=0 D h=0, we have G​=0G h=0. For any j, 0=(G​)j=∑iGj​i​hi=Gj​j​hj+∑i≠jGj​i​hi⇒hj=−∑i≠jGj​i​hi.0=(G h)_j= _iG_jih_i=G_jh_j+ _i≠ jG_jih_i h_j=- _i≠ jG_jih_i. (A.15) Taking absolute values and using |Gj​i|≤μ|G_ji|≤μ, we obtain |hj|≤μ​∑i≠j|hi|.|h_j|≤μ _i≠ j|h_i|. (A.16) Summing over j∈j gives ∑j∈|hj|≤μ​∑j∈∑i≠j|hi|. _j |h_j|\ ≤\ μ _j _i≠ j|h_i|. 
A.5 Proof of Theorem 3.9

See Theorem 3.9.

Proof. We first prove that under the condition $\mu < \frac{1}{2K-1}$, the $K$-sparse representation is unique. Suppose there exist two distinct $K$-sparse coefficient vectors $\delta, \delta'$ such that $\tilde{D}\delta = \tilde{D}\delta'$. Let $h = \delta - \delta' \neq 0$. Then $\tilde{D}h = 0$ and $\|h\|_0 \leq 2K$. By Theorem 3.8 (applied with $K$ replaced by $2K$), we have
$$(1 - (2K-1)\mu)\,\|h\|_2^2 \leq \|\tilde{D}h\|_2^2 = 0. \tag{A.12}$$
If $\mu < \frac{1}{2K-1}$, then the prefactor on the left is strictly positive, which forces $\|h\|_2 = 0$. This contradicts $h \neq 0$. Hence uniqueness holds.

Next, we prove that under the same condition $\mu < \frac{1}{2K-1}$, the sparse vector $\delta$ is also the unique solution of the convex optimization problem
$$\min_{x\in\mathbb{R}^{|D|}} \|x\|_1 \quad \text{s.t.} \quad \tilde{D}x = \tilde{m}. \tag{A.13}$$
The overall strategy is as follows: (i) show that the null space property of order $K$ (NSP$_K$) holds under the assumption $\mu < \frac{1}{2K-1}$; (ii) recall the equivalence NSP$_K$ $\Leftrightarrow$ exact and unique recovery of any $K$-sparse solution via noiseless $\ell_1$-minimization. Formally, the null space property of order $K$ (NSP$_K$) is defined as
$$\forall h \in \ker(\tilde{D})\setminus\{0\},\ \forall S \subseteq [n],\ |S| \leq K:\ \|h_S\|_1 < \|h_{S^c}\|_1, \tag{A.14}$$
where $\ker(\tilde{D}) = \{h : \tilde{D}h = 0\}$, $S \subseteq [n]$ is an index set with $[n] = \{1,\dots,n\}$, $h_S$ denotes the restriction of $h$ to the coordinates in $S$ (with other entries set to zero), and $S^c = [n]\setminus S$ is its complementary set.

Step (i): Proof of NSP$_K$. Let $G = \tilde{D}^{\top}\tilde{D}$ denote the Gram matrix. Since each column of $\tilde{D}$ is normalized, we have $G_{jj} = 1$ and $|G_{ij}| \leq \mu$ for $i \neq j$. Take any $h \in \ker(\tilde{D})\setminus\{0\}$ and any index set $S$ with $|S| = K$. Since $\tilde{D}h = 0$, we have $Gh = 0$. For any $j$,
$$0 = (Gh)_j = \sum_i G_{ji}h_i = G_{jj}h_j + \sum_{i\neq j}G_{ji}h_i \;\Rightarrow\; h_j = -\sum_{i\neq j}G_{ji}h_i. \tag{A.15}$$
Taking absolute values and using $|G_{ji}| \leq \mu$, we obtain
$$|h_j| \leq \mu\sum_{i\neq j}|h_i|. \tag{A.16}$$
Summing over $j \in S$ gives
$$\sum_{j\in S}|h_j| \leq \mu\sum_{j\in S}\sum_{i\neq j}|h_i|. \tag{A.17}$$
The inner summation can be decomposed into contributions from $i \in S\setminus\{j\}$ and $i \in S^c$:
$$\sum_{j\in S}\sum_{i\neq j}|h_i| = \underbrace{\sum_{j\in S}\sum_{\substack{i\in S\\ i\neq j}}|h_i|}_{\text{each } i\in S \text{ counted } K-1 \text{ times}} + \underbrace{\sum_{j\in S}\sum_{i\in S^c}|h_i|}_{\text{each } i\in S^c \text{ counted } K \text{ times}}. \tag{A.18}$$
Hence,
$$\|h_S\|_1 \leq \mu\big((K-1)\|h_S\|_1 + K\|h_{S^c}\|_1\big). \tag{A.19}$$
Rearranging,
$$(1 - (K-1)\mu)\,\|h_S\|_1 \leq K\mu\,\|h_{S^c}\|_1. \tag{A.20}$$
Dividing through by the positive factor $1 - (K-1)\mu$, define
$$\alpha := \frac{K\mu}{1-(K-1)\mu}. \tag{A.21}$$
When $\mu < \frac{1}{2K-1}$, we have $\alpha < 1$. Since $h \neq 0$ and $\tilde{D}h = 0$, it is impossible for $\|h_{S^c}\|_1 = 0$ (otherwise both terms would vanish, forcing $h = 0$, a contradiction). Therefore,
$$\|h_S\|_1 \leq \alpha\|h_{S^c}\|_1 < \|h_{S^c}\|_1. \tag{A.22}$$
Take any $S_0$ with $|S_0| = k \leq K$, and extend it to a superset $S \supseteq S_0$ such that $|S| = K$. Then,
$$\|h_{S_0}\|_1 \leq \|h_S\|_1, \qquad \|h_{S_0^c}\|_1 \geq \|h_{S^c}\|_1. \tag{A.23}$$
If we know that $\|h_S\|_1 < \|h_{S^c}\|_1$ holds for all $S$ of size $K$, then it follows that
$$\|h_{S_0}\|_1 \leq \|h_S\|_1 < \|h_{S^c}\|_1 \leq \|h_{S_0^c}\|_1. \tag{A.24}$$
Thus, the inequality also holds for any $S_0$ with $|S_0| \leq K$, which establishes NSP$_K$.

Step (ii): Equivalence between NSP$_K$ and $\ell_1$ recovery.

NSP$_K$ $\Rightarrow$ unique $\ell_1$ recovery: Suppose $\hat{x}$ is another feasible solution such that $\tilde{D}\hat{x} = \tilde{D}\delta$. Let $h = \hat{x} - \delta \in \ker(\tilde{D})\setminus\{0\}$, and let $S = \mathrm{supp}(\delta)$ with $|S| \leq K$. Then
$$\|\hat{x}\|_1 = \|\delta + h\|_1 = \|\delta_S + h_S\|_1 + \|h_{S^c}\|_1 \geq \|\delta_S\|_1 - \|h_S\|_1 + \|h_{S^c}\|_1 > \|\delta_S\|_1 = \|\delta\|_1, \tag{A.25}$$
where the strict inequality follows from NSP$_K$. Hence, $\delta$ is the unique minimizer of the $\ell_1$ problem.

Unique $\ell_1$ recovery $\Rightarrow$ NSP$_K$: We argue by contradiction. If NSP$_K$ does not hold, then there exists $h \in \ker(\tilde{D})\setminus\{0\}$ and some $S$ with $|S| \leq K$ such that $\|h_S\|_1 \geq \|h_{S^c}\|_1$. Take any nonzero $K$-sparse $\delta$ with $\mathrm{supp}(\delta) = S$, and choose $\delta_j = \alpha_j\,\mathrm{sgn}(h_j)$ with $\alpha_j \geq |h_j|$ coordinate-wise. Consider $\hat{x} = \delta - h$. Since $\tilde{D}h = 0$, both $\delta$ and $\hat{x}$ are feasible, and
$$\|\hat{x}\|_1 = \|\delta_S - h_S\|_1 + \|h_{S^c}\|_1 = \|\delta_S\|_1 - \|h_S\|_1 + \|h_{S^c}\|_1 \leq \|\delta_S\|_1 = \|\delta\|_1. \tag{A.26}$$
Thus, $\delta$ is not the unique minimizer of the $\ell_1$ problem (and may even fail to be a minimizer). This contradicts the uniqueness assumption. Therefore, NSP$_K$ must hold. ∎

A.6 Proof of Corollary 3.10

See Corollary 3.10.

Proof. Take arbitrary $x, y \in \Sigma_K$ such that $\tilde{D}x = \tilde{D}y$. Let $\tilde{m} := \tilde{D}x$; then clearly $\tilde{m} = \tilde{D}y$, with $\|x\|_0 \leq K$ and $\|y\|_0 \leq K$. By Theorem 3.9, under the condition $\mu < \frac{1}{2K-1}$, the $K$-sparse representation $\delta$ determined by the equation $\tilde{m} = \tilde{D}\delta$ is unique: among all coefficient vectors satisfying $\|\delta\|_0 \leq K$, there exists only one that generates $\tilde{m}$. Since both $x$ and $y$ are feasible solutions satisfying $\|\delta\|_0 \leq K$ and $\tilde{D}\delta = \tilde{m}$, uniqueness forces $x = y$. Hence, $\Phi$ is injective on $\Sigma_K$. ∎
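To make the $\ell_1$ equivalence concrete, the sketch below recovers a $K$-sparse vector from $\tilde{D}\delta$ by solving the linear program corresponding to Eq. A.13 (using the standard split $x = p - q$ with $p, q \geq 0$). The dictionary and dimensions are illustrative; the coherence condition above is sufficient but not necessary, and random Gaussian dictionaries of this size typically admit exact recovery.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
H, n, K = 128, 256, 5
D = rng.standard_normal((H, n))
D /= np.linalg.norm(D, axis=0)                       # unit-norm atoms

delta = np.zeros(n)
support = rng.choice(n, K, replace=False)
delta[support] = rng.uniform(0.5, 1.5, K) * rng.choice([-1.0, 1.0], K)
m = D @ delta                                        # observed representation

# min ||x||_1  s.t.  D x = m, written as an LP over x = p - q with p, q >= 0
c = np.ones(2 * n)
A_eq = np.hstack([D, -D])
res = linprog(c, A_eq=A_eq, b_eq=m, bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]

print("support recovered:", set(np.flatnonzero(np.abs(x_hat) > 1e-6)) == set(support))
print("max coefficient error:", float(np.abs(x_hat - delta).max()))
```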
A.7 Proof of Theorem 3.11

See Theorem 3.11.

Proof. Consider a single-layer linear–nonlinear encoder of the form $W_{\mathrm{dec}}\,\sigma_{\tau}(W_{\mathrm{enc}}\,m_i)$, with training objective
$$W_{\mathrm{dec}}\,\sigma_{\tau}(W_{\mathrm{enc}}\,m_i) = m_i, \quad \forall i. \tag{A.27}$$
Set $W_{\mathrm{dec}} = D$ and $W_{\mathrm{enc}} = D^{\top}\tilde{S}$. Denote $S_i = \mathrm{supp}(\delta_i)$. Then
$$D^{\top}\tilde{S}\,m_i = \begin{bmatrix} d_1^{\top} \\ d_2^{\top} \\ \vdots \\ d_{|D|}^{\top} \end{bmatrix}\tilde{S}\begin{bmatrix} d_1 & d_2 & \cdots & d_{|D|} \end{bmatrix}\delta_i \tag{A.28}$$
$$= \begin{bmatrix} d_1^{\top}\tilde{S}d_1 & \cdots & d_1^{\top}\tilde{S}d_{|D|} \\ d_2^{\top}\tilde{S}d_1 & \cdots & d_2^{\top}\tilde{S}d_{|D|} \\ \vdots & \ddots & \vdots \\ d_{|D|}^{\top}\tilde{S}d_1 & \cdots & d_{|D|}^{\top}\tilde{S}d_{|D|} \end{bmatrix}\delta_i \tag{A.29}$$
$$= G\,\delta_i, \qquad G := D^{\top}\tilde{S}D. \tag{A.30}$$
By NAIP, we have $G_{kk} = 1$ and, for $k \neq j$, $|G_{kj}| = |\langle\tilde{d}_k,\tilde{d}_j\rangle| \leq \varepsilon$. Thus, for any index $k$,
$$(G\delta_i)_k = \begin{cases} \delta_{ik} + \underbrace{\sum_{j\in S_i\setminus\{k\}}\delta_{ij}G_{kj}}_{=:e_{ik}}, & k \in S_i, \\[4pt] \underbrace{\sum_{j\in S_i}\delta_{ij}G_{kj}}_{=:e_{ik}}, & k \notin S_i. \end{cases} \tag{A.31}$$
Using the coherence bound, we obtain the deterministic perturbation estimate
$$\begin{cases} (G\delta_i)_k \geq \delta_{ik} - \varepsilon(K-1)\delta_{\max}, & k \in S_i, \\ (G\delta_i)_k \leq \varepsilon K\delta_{\max}, & k \notin S_i. \end{cases} \tag{A.32}$$
Choose a threshold $\tau$ such that
$$\varepsilon K\delta_{\max} < \tau < \delta_{\min} - \varepsilon(K-1)\delta_{\max}. \tag{A.33}$$
This ensures support separation
$$\begin{cases} (G\delta_i)_k > \tau, & k \in S_i, \\ (G\delta_i)_k < \tau, & k \notin S_i. \end{cases} \tag{A.34}$$
Therefore, the coordinate-wise nonlinearity
$$\sigma_{\tau}(x) := \begin{cases} 0 & x < \tau, \\ x & x \geq \tau \end{cases} \tag{A.35}$$
produces activations $z_i := \sigma_{\tau}(G\delta_i)$ with $\mathrm{supp}(z_i) = S_i$. For $k \in S_i$,
$$z_{ik} = (G\delta_i)_k = \delta_{ik} + e_{ik}. \tag{A.36}$$
Since for any $j \neq k$, $G_{kj} = d_k^{\top}\tilde{S}d_j$ is distributed approximately as $\mathcal{N}(0, s^2)$ with small variance $s^2$, it follows that
$$\mathbb{E}[e_{ik}] = \mathbb{E}\Big[\sum_{j\in S_i\setminus\{k\}}\delta_{ij}G_{kj}\Big] = \sum_{j\in S_i\setminus\{k\}}\delta_{ij}\,\mathbb{E}[G_{kj}] = 0. \tag{A.37}$$
Since $\mathbb{E}[e_{ik}] = 0$ for all $i, k$, the law of large numbers implies that, in probability,
$$D\,\sigma_{\tau}(D^{\top}\tilde{S}\,m_i) = m_i, \quad \forall i. \tag{A.38}$$
Thus, under this parametrization, the SAE recovers the target atom set $D$. ∎
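The support-separation step (Eqs. A.31–A.34) can be illustrated numerically. The sketch below uses a random near-orthonormal dictionary as a stand-in for atoms under the NAIP (so that its Gram matrix plays the role of $G$), builds a sparse $\delta$, and shows that a single threshold recovers the support exactly; the dimensions, sparsity, and coefficient range are illustrative and chosen so that the threshold interval of Eq. A.33 is non-empty.

```python
import numpy as np

rng = np.random.default_rng(0)
H, n, K = 1024, 512, 2                       # conservative sizes so Eq. (A.33) admits a valid tau

D = rng.standard_normal((H, n))
D /= np.linalg.norm(D, axis=0)               # near-orthonormal columns: small coherence eps
G = D.T @ D
eps = np.abs(G - np.eye(n)).max()

d_min, d_max = 1.0, 1.0                      # all nonzero coefficients set to 1 here
delta = np.zeros(n)
support = rng.choice(n, K, replace=False)
delta[support] = 1.0

low, high = eps * K * d_max, d_min - eps * (K - 1) * d_max
print("threshold interval non-empty:", low < high)

tau = 0.5 * (low + high)                     # any tau in (low, high) works, per Eq. (A.33)
z = np.where(G @ delta >= tau, G @ delta, 0.0)
print("support recovered:", set(np.flatnonzero(z)) == set(support))
```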
Appendix B Representation Shift

In this section, we provide further details on representation shift, including the experimental setup, ablation studies, and additional analyses, to enable a comprehensive understanding of this phenomenon. Specifically, we randomly sample 128 subject entities from the WikiData dataset (Vrandečić and Krötzsch, 2014) and extract the corresponding activations across all layers of multiple language model families, including GPT-2 (GPT2-Small, GPT2-Medium, GPT2-Large), GPT-J (GPT-J-6B), Pythia (Pythia-1B, Pythia-1.4B, Pythia-2.8B, Pythia-6.9B), Llama-2 (Llama2-7B, Llama2-13B), Llama-3 (Llama3-8B), Llama-3.1 (Llama3.1-8B), and Gemma-2 (Gemma2-2B, Gemma2-9B), which serve as the target representations for our analysis. For each entity, we extract activations at the position of its final token, which has been empirically identified via causal tracing as the key site of knowledge extraction in language models (Meng et al., 2022). Thus, for each layer we obtain 128 representations, yielding $128 \times 128 = 16{,}384$ representation pairs for analyzing angular distributions. This sample size is sufficient to capture the overall distributional characteristics, as additional samples do not alter the distribution (Figs. 17–21), while also facilitating non-overlap with activations used in subsequent sampling.

We first compute the angles between representation pairs using the Euclidean inner product. Specifically, for representations $u, v \in \mathbb{R}^H$, the Euclidean cosine similarity is defined as $\cos(u, v) = \frac{\langle u, v\rangle}{\|u\|_2\|v\|_2}$, which we then convert to angles for clearer visualization. The full results (Figs. 7–16) show that when angles are computed under the Euclidean inner product, the centroid of the angular distribution deviates markedly from $90^\circ$, indicating a representation shift. This phenomenon appears consistently across all layers and model families, independent of model architecture and training corpus, and thus reflects a systematic non-uniformity in representation distributions. Such anisotropy distorts the underlying geometry of representations, which should be isotropic for random representations. In LLMs, this anisotropy is nearly indiscriminate: representations are globally attracted toward a dominant direction, yielding high cosine similarity even for unrelated samples. Direct evidence is provided in Figs. 17–21: increasing the sample size used to estimate the distribution leaves the centroid unchanged, confirming that representations are indeed attracted toward the same direction.

We then collect 100K activations $h_k$ per layer for each model from Wikipedia (manually verified to have no overlap with the target representations) and compute $\mathbb{E}[h_k h_k^{\top}]$ to estimate the normalized atomic inner product. Under the inner-product space induced by $\tilde{S}$, the cosine similarity is defined as $\cos_{\tilde{S}}(u, v) = \frac{\langle u, v\rangle_{\tilde{S}}}{\|u\|_{\tilde{S}}\|v\|_{\tilde{S}}} = \frac{u^{\top}\tilde{S}v}{\sqrt{u^{\top}\tilde{S}u}\sqrt{v^{\top}\tilde{S}v}}$, where $\tilde{S}$ is estimated in practice as $(\mathbb{E}[h_k h_k^{\top}])^{-1}$. As before, cosine similarities are converted to angles for clearer visualization. As shown in Figs. 22–31, the atomic inner product corrects the representation shift, restoring the centroid of the angular distribution to $90^\circ$ and thereby capturing the underlying representational geometry, in line with theoretical expectations. This effect holds consistently across all layers and model families. Further inspection shows that the long tail of the distribution (high-similarity pairs) corresponds to highly related samples, indicating that LLM representations are intrinsically isotropic when measured with the appropriate metric.

Although we identify the correct inner product for LLM representations and verify that their global geometry matches theoretical expectations, substantial superposition remains pervasive (Hu et al., 2025). For example, Fig. 32 shows that activations in Gemma2-2B still exhibit widespread superposition, indicating that these representations are not fully disentangled. The persistence of superposition suggests that raw activations are not the most suitable fundamental representational units. We therefore decompose high-dimensional representations into atoms that better satisfy the criteria for FRUs; the resulting decomposition (Fig. 33) demonstrates that the identified atoms effectively disentangle representations and resolve superposition.
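A minimal sketch of the angle computation described above: given a matrix of target representations and a pool of background activations, it compares pairwise angles under the Euclidean inner product with angles under the estimated metric $\tilde{S} \approx (\mathbb{E}[h_k h_k^{\top}])^{-1}$. Array shapes and names are illustrative.

```python
import numpy as np

def pairwise_angles(U, S=None):
    """Angles (degrees) between all pairs of rows of U under <u, v>_S = u^T S v.
    S=None reduces to the ordinary Euclidean inner product."""
    G = U @ U.T if S is None else U @ S @ U.T
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def estimate_metric(H_bg):
    """Estimate S ~ (E[h h^T])^{-1} from background activations H_bg of shape (N, H)."""
    second_moment = (H_bg.T @ H_bg) / H_bg.shape[0]
    return np.linalg.inv(second_moment)

# Illustrative usage with U of shape (128, H) and H_bg of shape (100_000, H):
#   euclid_angles = pairwise_angles(U)                          # centroid shifted away from 90 degrees
#   aip_angles    = pairwise_angles(U, estimate_metric(H_bg))   # centroid restored to ~90 degrees
```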
Figure 7: Representation shift of GPT2-Small.
Figure 8: Representation shift of GPT2-Large.
Figure 9: Representation shift of Pythia-1B.
Figure 10: Representation shift of Pythia-6.9B.
Figure 11: Representation shift of Llama2-7B.
Figure 12: Representation shift of Llama2-13B.
Figure 13: Representation shift of Llama3-8B.
Figure 14: Representation shift of Llama3.1-8B.
Figure 15: Representation shift of Gemma2-2B.
Figure 16: Representation shift of Gemma2-9B.
Figure 17: Representation shift of GPT2-Small under increasing sample sizes.
Figure 18: Representation shift of GPT2-Medium under increasing sample sizes.
Figure 19: Representation shift of GPT2-Large under increasing sample sizes.
Figure 20: Representation shift of Gemma2-2B under increasing sample sizes.
Figure 21: Representation shift of Gemma2-9B under increasing sample sizes.
Figure 22: Correction of representation shift on GPT2-Small.
Figure 23: Correction of representation shift on GPT2-Large.
Figure 24: Correction of representation shift on Pythia-1B.
Figure 25: Correction of representation shift on Pythia-6.9B.
Figure 26: Correction of representation shift on Llama2-7B.
Figure 27: Correction of representation shift on Llama2-13B.
Figure 28: Correction of representation shift on Llama3-8B.
Figure 29: Correction of representation shift on Llama3.1-8B.
Figure 30: Correction of representation shift on Gemma2-2B.
Figure 31: Correction of representation shift on Gemma2-9B.
Figure 32: Superposition of activations on Gemma2-2B.
Figure 33: Solving superposition on Gemma2-2B.

Appendix C Atoms of LLMs

C.1 Training Paradigm

We train threshold-activated sparse autoencoders (TSAEs) on activations elicited by entity knowledge, a setting we term the knowledge atomization task. Unlike the common practice of training on activations from natural corpora, this formulation enables precise control and quantification of data scale, facilitating scalable and systematic study across model and dataset scales. Moreover, entity-induced activations exhibit higher normalized rank (Fig. 34), spanning a broader set of representational dimensions and thus providing richer information. We also conduct task-level ablations. Specifically, we train TSAEs on activations extracted from natural text (Wikipedia (Bridge, 2001)) and complex mathematical reasoning data (MATH500 (Hendrycks et al., 2021)), following the same training pipeline as knowledge atomization while matching model and data scales. As shown in Tab. 2, representations obtained from different sources yield consistent results across language models, demonstrating the generality of our findings.

Figure 34: Cumulative normalized rank of (a) Gemma2-2B and (b) Gemma2-9B. Each data point corresponds to the ratio between the rank of the accumulated activation matrix (formed by stacking samples up to that point) and the total dimensionality (i.e., the theoretical maximum rank). Here we illustrate this for randomly selected early layers of Gemma2-2B and Gemma2-9B.

Table 2: Comparison of reconstruction quality and sparsity under different training data sources.

Training Data Source | $R^2$ | $L_0$
General corpora (Wikipedia) | 99.84% | 9.36
Complex reasoning (MATH500) | 99.86% | 9.56
Knowledge atomization (WikiData) | 99.89% | 8.32

C.2 Data Collection

In § 4.1 and § 4.3, we use the WikiData dataset (Vrandečić and Krötzsch, 2014), while in § 4.2 and § 4.4 we adopt the CounterFact dataset (Meng et al., 2022). Specifically, we collect activations from every layer of Gemma2-2B, Gemma2-9B, and Llama3.1-8B using the subject entities in the corresponding datasets (e.g., "Danielle Darrieux," "Edwin of Northumbria," and "Toko Yasuda"). Activations are collected in a uniform manner: each subject name is used as a prompt, and hooks record activations at the final token of the subject mention, a position previously identified as critical for knowledge recall in language models (Meng et al., 2022). The resulting activations are aggregated as static training data.
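A minimal sketch of this collection procedure with Hugging Face transformers, assuming the public model identifier "google/gemma-2-2b"; it reads hidden states at the final token of each subject prompt rather than registering explicit hooks, which yields the same per-layer vectors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"   # assumed Hugging Face identifier
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def collect_final_token_activations(subjects):
    """Return {layer_index: tensor of shape (num_subjects, hidden_size)} taken at
    the last token of each subject prompt."""
    per_layer = {}
    for name in subjects:
        inputs = tok(name, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states is a tuple with one (1, seq_len, hidden_size) tensor per layer
        for layer, h in enumerate(out.hidden_states):
            per_layer.setdefault(layer, []).append(h[0, -1].clone())
    return {layer: torch.stack(vecs) for layer, vecs in per_layer.items()}

activations = collect_final_token_activations(
    ["Danielle Darrieux", "Edwin of Northumbria", "Toko Yasuda"]
)
```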
C.3 Training Details

We employ single-layer SAEs with threshold activation, denoted as $f: x \mapsto \hat{x} = W_{\mathrm{dec}}\,\sigma(W_{\mathrm{enc}}\,x)$, and train them by minimizing a joint reconstruction–sparsity objective
$$\mathcal{L}(x) = \underbrace{\|x - \hat{x}\|_2^2}_{\mathcal{L}_{\mathrm{reconstruct}}} + \lambda\,\underbrace{\|\sigma(z)\|_1}_{\mathcal{L}_{\mathrm{sparsity}}}, \tag{C.1}$$
where $z = W_{\mathrm{enc}}\,x$, and $\sigma$ is the coordinate-wise JumpReLU activation (Rajamanoharan et al., 2024b),
$$(\sigma(z))_i = \begin{cases} 0, & z_i < \tau_i, \\ z_i, & z_i \geq \tau_i, \end{cases} \qquad \tau = (\tau_i)_i. \tag{C.2}$$
The key hyperparameters are the sparsity coefficient $\lambda$ in the loss function (Eq. C.1) and the threshold initialization. We fix $\lambda = 0.1$ (training is later shown to be insensitive to this choice) and initialize the threshold at 0.001 (or 0.0001), which provides a good trade-off between training efficiency and effectiveness: smaller initial thresholds make it easier to satisfy the support-separation condition but substantially increase training time, whereas 0.001 serves as a stable and reliable default in our experiments. During training, we employ the straight-through estimator (Rajamanoharan et al., 2024b) to approximate gradients at the non-differentiable threshold. We select the final model as the checkpoint on the Pareto front that optimally balances reconstruction error and sparsity. Fig. 35 illustrates the Pareto front for Gemma2-2B. The specific computational cost is as follows:

• Gemma2-2B (per layer): ∼24 GPU-hours on RTX 3090-24G (on average);
• Gemma2-9B (per layer): ∼56 GPU-hours on A100-80G (on average);
• Llama3.1-8B (per layer): ∼58 GPU-hours on A100-80G (on average);
• Largest TSAE trained in this work (Fig. 5, top-right): ∼135 GPU-hours on A100-80G.

A minor training issue was observed in layers 30 and 31 of Llama3.1-8B, where unusually large activations caused optimization to fail; consequently, these layers are omitted from the reported results. This behavior is likely related to their proximity to the output, where activations may drive next-token prediction rather than encode entity-specific information. By contrast, Gemma2-2B and Gemma2-9B did not exhibit this problem, possibly because their extensive use of RMSNorm mitigates such activation outliers.

Figure 35: Pareto front during training on Gemma2-2B.
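For concreteness, a minimal PyTorch sketch of this objective is given below. It is not the authors' implementation: the JumpReLU straight-through estimator is approximated here with a sigmoid surrogate gate, and the widths, threshold initialization, and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    """Coordinate-wise JumpReLU (Eq. C.2): passes z_i when z_i >= tau_i, else 0.
    The hard gate is used in the forward pass; a sigmoid surrogate supplies
    gradients (one simple stand-in for the straight-through estimator)."""
    def __init__(self, width, init_threshold=1e-3, bandwidth=1e-3):
        super().__init__()
        self.threshold = nn.Parameter(torch.full((width,), init_threshold))
        self.bandwidth = bandwidth

    def forward(self, z):
        hard = (z >= self.threshold).float()
        soft = torch.sigmoid((z - self.threshold) / self.bandwidth)
        gate = hard + (soft - soft.detach())   # hard forward, soft backward
        return z * gate

class TSAE(nn.Module):
    """Single-layer threshold-activated SAE: x -> W_dec sigma(W_enc x)."""
    def __init__(self, d_model, n_atoms, init_threshold=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, n_atoms, bias=False)
        self.dec = nn.Linear(n_atoms, d_model, bias=False)
        self.act = JumpReLU(n_atoms, init_threshold)

    def forward(self, x):
        z = self.act(self.enc(x))
        return self.dec(z), z

def tsae_loss(x, x_hat, z, lam=0.1):
    """Joint reconstruction-sparsity objective of Eq. (C.1)."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return recon + lam * sparsity
```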
C.4 Baseline Details

The primary baselines used in this work are GemmaScope and LlamaScope. GemmaScope provides SAEs of widths 16k and 65k trained on the MLP layers of Gemma2-2B, as well as SAEs of widths 16k and 131k trained on the MLP layers of Gemma2-9B. LlamaScope offers SAEs with expansion factors of 8× and 32× trained on the MLP layers of Llama3.1-8B. Both GemmaScope and LlamaScope are widely used open-source tools for feature extraction. It is important to emphasize that these models are trained on activations derived from continuous text corpora. We use them as baselines not to demonstrate superior performance of our SAEs, but to highlight that feature-based reconstruction of raw activations remains unreliable in practice, whereas our results show that internal representations of language models can be reconstructed with high fidelity.

C.5 Evaluation Details

To practically assess the stability of identified representational units, we introduce quantile-based statistics that correspond to the prior conditions in Theorem 3.9. Specifically, we define two statistics:

• Quantile sparsity $K_q$: the quantile of sparsity $K_q$ is defined as
$$K_q = \inf\{k \in \mathbb{N} : \mathbb{P}_{\delta\sim\Delta}(K \leq k) \geq q\}, \tag{C.3}$$
where $\delta$ is a coefficient vector sampled from the distribution $\Delta$, and the random variable $K := \|\delta\|_0$ represents the sparsity of the sampled coefficient vector. In simple terms, the quantile sparsity $K_q$ indicates that at least a fraction $q$ of the samples have sparsity no greater than $K_q$.

• Quantile coherence $\mu_q$: similarly, the quantile of coherence $\mu_q$ is defined as
$$\mu_q = \inf\{\mu \geq 0 : \mathbb{P}_{(\mathcal{I},\mathcal{J})}(C \leq \mu \mid \tilde{D}) \geq q\}, \tag{C.4}$$
where $(\mathcal{I},\mathcal{J})$ is uniformly sampled from all unordered pairs of indices ($\mathcal{I} \neq \mathcal{J}$), and the random variable $C := |\langle\tilde{d}_{\mathcal{I}},\tilde{d}_{\mathcal{J}}\rangle|$ represents the coherence between two atoms. In simple terms, the probability of randomly selecting a pair of distinct atoms with coherence no greater than $\mu_q$ is at least $q$.

Based on these definitions, if the condition $\mu_q < \frac{1}{2K_q-1}$ holds, then for the supports of at least a proportion $q$ of the samples the sufficient conditions for uniqueness and recoverability are satisfied. To determine the maximal quantile $q^*$ satisfying the theoretical criterion, we perform a binary search over the interval $[0, 0.999999]$ for the quantile parameter $\alpha$. At each iteration we compute the linear quantiles
$$\mu_\alpha := \mathrm{Quantile}(\{\mu\}, \alpha), \qquad K_\alpha := \mathrm{Quantile}(\{K\}, \alpha), \tag{C.5}$$
and test whether $\mu_\alpha < \frac{1}{2K_\alpha-1}$ holds. If the condition is satisfied, the lower bound of the search interval is updated to $\alpha$; otherwise the upper bound is reduced. Upon convergence, the maximal $\alpha$ obtained is taken as the desired quantile $q^*$, together with the corresponding values of $\mu_\alpha$ and $K_\alpha$.

Note that verifying Theorem 3.9 requires the equality $\tilde{D}x = \tilde{m}$. However, as shown in Fig. 4, features generally fail to achieve reliable reconstruction, so the quantile $q$ obtained from the condition $\mu_q < \frac{1}{2K_q-1}$ serves only as an ideal upper bound. In contrast, the learned atoms satisfy reliable reconstruction, and 99.85% of atoms meet $\mu_q < \frac{1}{2K_q-1}$ on average, confirming their favorable properties.
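A minimal sketch of this binary search, assuming per-sample sparsities (each at least 1) and sampled pairwise coherences are already available as arrays; the names and tolerance are illustrative. Both quantiles are non-decreasing in the quantile level, so the criterion flips at most once and a binary search suffices.

```python
import numpy as np

def max_stable_quantile(sparsities, coherences, tol=1e-6):
    """Binary-search the largest q in [0, 0.999999] such that mu_q < 1 / (2 K_q - 1).
    sparsities: array of per-sample ||delta||_0 values (assumed >= 1);
    coherences: array of |<d_i, d_j>| values over sampled atom pairs."""
    lo, hi = 0.0, 0.999999
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        K_q = np.quantile(sparsities, mid)
        mu_q = np.quantile(coherences, mid)
        if mu_q < 1.0 / (2.0 * K_q - 1.0):
            lo = mid          # criterion holds at this quantile; try a larger one
        else:
            hi = mid
    return lo                 # the reported q*
```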
For further detail, Tabs. 3–5 report the corresponding values of $R^2$ and $q^*$ for identified units of Gemma2-2B, Gemma2-9B, and Llama3.1-8B. In all three models, we primarily use TSAEs with JumpReLU activations (Erichson et al., 2019; Rajamanoharan et al., 2024b) to identify atoms. For comparison, on Gemma2-2B we also train standard SAEs with ReLU activations (Cunningham et al., 2023) and find that they fail to identify units that satisfy the criteria of ideal atoms as fundamental representational units.

Table 3: Faithfulness and stability across layers on Gemma2-2B.

Layer | TSAE (JumpReLU) $R^2$ | TSAE (JumpReLU) $q^*$ | SAE (ReLU) $R^2$ | SAE (ReLU) $q^*$
0 | 0.9986 | 0.9974 | - | -
1 | 0.9984 | 0.9978 | - | -
2 | 0.9987 | 0.9978 | - | -
3 | 0.9992 | 0.9988 | - | -
4 | 0.9994 | 0.9991 | - | -
5 | 0.9996 | 0.9983 | 0.9680 | 0.8263
6 | 0.9995 | 0.9983 | 0.9510 | 0.6740
7 | 0.9993 | 0.9994 | 0.9624 | 0.6712
8 | 0.9996 | 0.9976 | 0.9650 | 0.6437
9 | 0.9995 | 0.9987 | 0.9383 | 0.5648
10 | 0.9992 | 0.9983 | 0.9133 | 0.4662
11 | 0.9992 | 0.9979 | 0.9179 | 0.4366
12 | 0.9991 | 0.9911 | 0.9129 | 0.4516
13 | 0.9973 | 0.9960 | 0.9104 | 0.4167
14 | 0.9989 | 0.9993 | - | -
15 | 0.9988 | 0.9989 | - | -
16 | 0.9992 | 0.9972 | - | -
17 | 0.9994 | 0.9991 | - | -
18 | 0.9996 | 0.9945 | - | -
19 | 0.9997 | 0.9980 | - | -
20 | 0.9989 | 0.9989 | - | -
21 | 0.9997 | 0.9964 | - | -
22 | 0.9995 | 0.9922 | - | -
23 | 0.9994 | 0.9979 | - | -
24 | 0.9994 | 0.9982 | - | -
25 | 0.9993 | 0.9943 | - | -

Table 4: Faithfulness and stability across layers on Gemma2-9B.

Layer | TSAE (JumpReLU) $R^2$ | TSAE (JumpReLU) $q^*$
0 | 0.9996 | 0.9915
1 | 0.9993 | 0.9992
2 | 0.9995 | 0.9981
3 | 0.9996 | 0.9975
4 | 0.9996 | 0.9985
5 | 0.9997 | 0.9999
6 | 0.9993 | 0.9961
7 | 0.9996 | 0.9995
8 | 0.9996 | 0.9996
9 | 0.9996 | 0.9997
10 | 0.9997 | 0.9997
11 | 0.9996 | 0.9994
12 | 0.9994 | 0.9996
13 | 0.9993 | 0.9991
14 | 0.9989 | 0.9991
15 | 0.9992 | 0.9997
16 | 0.9990 | 0.9997
17 | 0.9994 | 0.9996
18 | 0.9995 | 0.9995
19 | 0.9990 | 0.9996
20 | 0.9991 | 0.9995
21 | 0.9991 | 0.9994
22 | 0.9992 | 0.9993
23 | 0.9993 | 0.9991
24 | 0.9993 | 0.9990
25 | 0.9994 | 0.9988
26 | 0.9964 | 0.9974
27 | 0.9997 | 0.9980
28 | 0.9994 | 0.9997
29 | 0.9993 | 0.9993
30 | 0.9997 | 0.9982
31 | 0.9995 | 0.9982
32 | 0.9996 | 0.9982
33 | 0.9998 | 0.9986
34 | 0.9998 | 0.9987
35 | 0.9997 | 0.9997
36 | 0.9997 | 0.9991
37 | 0.9995 | 0.9992
38 | 0.9993 | 0.9996
39 | 0.9990 | 0.9998
40 | 0.9988 | 0.9999
41 | 0.9995 | 0.9951

Table 5: Faithfulness and stability across layers on Llama3.1-8B.

Layer | TSAE (JumpReLU) $R^2$ | TSAE (JumpReLU) $q^*$
0 | 0.9985 | 0.9968
1 | 0.9998 | 0.9996
2 | 0.9930 | 0.9998
3 | 0.9945 | 0.9999
4 | 0.9992 | 0.9999
5 | 0.9992 | 0.9998
6 | 0.9971 | 0.9999
7 | 0.9961 | 0.9999
8 | 0.9992 | 0.9999
9 | 0.9988 | 0.9999
10 | 0.9989 | 0.9998
11 | 0.9987 | 0.9999
12 | 0.9993 | 0.9997
13 | 0.9970 | 0.9999
14 | 0.9992 | 0.9999
15 | 0.9986 | 0.9999
16 | 0.9989 | 0.9999
17 | 0.9992 | 0.9998
18 | 0.9992 | 0.9994
19 | 0.9993 | 0.9998
20 | 0.9991 | 0.9998
21 | 0.9993 | 0.9996
22 | 0.9997 | 0.9992
23 | 0.9995 | 0.9996
24 | 0.9996 | 0.9993
25 | 0.9990 | 0.9999
26 | 0.9995 | 0.9997
27 | 0.9993 | 0.9999
28 | 0.9982 | 0.9993
29 | 0.9979 | 0.9940

C.6 Experimental Analysis

Notably, the training process is largely insensitive to hyperparameters: using sparsity coefficients $\lambda \in \{0.01, 0.1, 1\}$ yields nearly identical learning curves (Fig. 36), indicating strong robustness. This suggests that high-fidelity reconstruction primarily reflects the intrinsic sparsifiability of the representations, rather than careful hyperparameter tuning. The encoder and decoder of SAEs converge to alignment under the atomic inner product, namely the parameterization $W_{\mathrm{dec}} = D$ and $W_{\mathrm{enc}} = D^{\top}\tilde{S}$, consistent with Theorem 3.11, as shown in Figs. 37–39 for Gemma2-2B, Gemma2-9B, and Llama3.1-8B.

By Definition 3.6, atoms must satisfy approximate orthogonality under the normalized atomic inner product (NAIP), ensuring their mutual distinguishability. The NAIP among all atoms can be computed by directly evaluating the matrix $G = \tilde{D}^{\top}\tilde{D}$, with a more practical procedure, similar to Corollary 3.3, given by
$$G = \frac{D^{\top} S D}{\sqrt{\mathrm{diag}(D^{\top} S D)} \times \sqrt{\mathrm{diag}(D^{\top} S D)}}, \tag{C.6}$$
where $S = (D D^{\top})^{-1}$, $\mathrm{diag}(D^{\top} S D)$ denotes the diagonal of $D^{\top} S D$, $\sqrt{\mathrm{diag}(D^{\top} S D)}$ denotes its element-wise square root, and $\times$ indicates the outer product. If the vectors learned by the SAEs exhibit atomicity, the off-diagonal elements $G_{ij} = \langle\tilde{d}_i, \tilde{d}_j\rangle$ should cluster near zero with very small variance, demonstrating approximate orthogonality, while the diagonal entries are normalized. As shown in Figs. 40–42, across all layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, the off-diagonal elements of $G$ are tightly concentrated near zero, closely matching the theoretical Dirac delta distribution. This accords with Definition 3.5: although strict orthogonality is unattainable, sparsity drives convergence to approximately orthogonal atoms.
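The NAIP matrix of Eq. C.6 reduces to a few NumPy operations; the sketch below assumes the learned decoder matrix is available with atoms as columns, and the shapes are illustrative.

```python
import numpy as np

def naip_matrix(D):
    """Normalized atomic inner products between decoder columns (Eq. C.6).
    D: array of shape (H, n) whose columns are the candidate atoms; requires rank(D) = H."""
    S = np.linalg.inv(D @ D.T)            # S = (D D^T)^{-1}
    M = D.T @ S @ D
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d)             # unit diagonal; off-diagonals should cluster near 0

# Illustrative usage with a decoder matrix W_dec of shape (H, n):
#   G = naip_matrix(W_dec)
#   off_diag = G[~np.eye(G.shape[0], dtype=bool)]   # histograms of these values give Figs. 40-42
```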
Figure 36: Training loss is robust to hyperparameter selection on λ, maintaining stable performance across different configurations.
Figure 37: Spontaneous alignment between the encoder and decoder during training on Gemma2-2B.
Figure 38: Spontaneous alignment between the encoder and decoder during training on Gemma2-9B.
Figure 39: Spontaneous alignment between the encoder and decoder during training on Llama3.1-8B.
Figure 40: NAIP distribution of atoms across all layers of Gemma2-2B.
Figure 41: NAIP distribution of atoms across all layers of Gemma2-9B.
Figure 42: NAIP distribution of atoms across all layers of Llama3.1-8B.

C.7 Monosemanticity Evaluation

To evaluate the monosemanticity of the representational units, we adopt LLM-as-a-judge. Specifically, to ensure diversity, we first manually select a set of heterogeneous entities (e.g., "United Kingdom", "Google Maps", "Suzuki GSX-R750", "Windows Vista", "Intel 80286", "Beijing", "Hawaii", "Tim Duncan", "Microsoft Word", "Vladimir Putin", "Apple Watch", "Chrome OS"). We then collect the representational units activated by these entities and aggregate them into a candidate pool. From this pool, we randomly sample ten units per selected layer: Gemma2-2B (layers 0, 5, 10, 15, 20, 25), Gemma2-9B (layers 0, 7, 14, 21, 28, 35, 41), and Llama3.1-8B (layers 0, 6, 12, 18, 24, 29). For each sampled unit, we retrieve all entities that activate it and use GPT-5.2 to assess its monosemanticity. The evaluation prompt is provided below:

You are given a list of entities enclosed in square brackets [ ]. Inside the brackets, each entity is separated by a semicolon (;). Your task is to analyze the entities and determine how many of them belong to the same semantic category (i.e., refer to the same type of real-world concept).
Important instructions:
- You should identify the largest group of entities that belong to the same category.
- Only count entities that clearly belong to the same category.
- Your answer must be a single integer. You must provide your final answer strictly inside a box using the following format: NUMBER
Here is the list of entities: [entities]

We then compute, for each representational unit, the proportion of activated entities that are monosemantic, and report the mean and standard error of the mean (SEM) over units sampled at selected layers.
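A minimal sketch of how these per-unit scores can be aggregated, where `judge_largest_group` stands in for the hypothetical call to the LLM judge that returns the integer requested by the prompt above.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Callable, List, Tuple

def monosemanticity_score(entities: List[str],
                          judge_largest_group: Callable[[List[str]], int]) -> float:
    """Fraction of a unit's activating entities that the judge places in its
    largest shared semantic category."""
    if not entities:
        return 0.0
    return judge_largest_group(entities) / len(entities)

def layer_summary(units_entities: List[List[str]],
                  judge_largest_group: Callable[[List[str]], int]) -> Tuple[float, float]:
    """Mean and standard error of the mean over the units sampled in one layer."""
    scores = [monosemanticity_score(e, judge_largest_group) for e in units_entities]
    sem = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return mean(scores), sem
```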
Full results are shown in Fig. 43, revealing that monosemanticity increases with model scale and is generally higher in deeper layers than in shallower ones. Aggregating across layers within each model yields the results presented in Fig. 6. For example, in Gemma2-9B, an atom (ID 11346) in layer 28 is activated by entities including "Honolulu," "aloha," "Mufi Hannemann," "Kirk Caldwell," "Hawaii," "Hawaiian Islands," "Mauna Kea," "USS Honolulu," and "Aloha Stadium." Notably, Mufi Hannemann was born in Honolulu and served as its mayor; Kirk Caldwell is a former Hawaii state representative and former mayor of Honolulu; and Mauna Kea is a volcano in the Hawaiian Islands. This example demonstrates that the atom consistently captures a semantically coherent "Hawaii–Honolulu" concept region, exhibiting clear monosemanticity. Furthermore, we analyze atoms activated by "Beijing" in layers 1–6 of Gemma2-2B, and examine, at each layer, all entities that activate these atoms to characterize their corresponding concept regions (Tabs. 6–11).

Figure 43: Monosemanticity scores of representational units across models and layers, using GPT-5.2 with manual verification. The blue dashed line indicates random-guess performance (0.1).

Table 6: Entities grouped by atom ID for Beijing on layer 1 of Gemma2-2B.

Atom ID | Entities
15264 | Beijing, Seoul, 1 Maccabees, Ulysses Dove
15982 | Beijing, Siikainen, 36 China Town, Jim Allchin
23987 | Beijing, Swann Memorial Fountain, Charles Chilton, Otto Neurath
31322 | Shanghai, Beijing
35951 | Beijing, Russia, Arkansas, Paris
36035 | Beijing, Meiert Avis, Aviation Industry Corporation of China

Table 7: Entities grouped by atom ID for Beijing on layer 2 of Gemma2-2B.

Atom ID | Entities
620 | Shanghai, Beijing, Hanoi, Tokyo, Adam Maida
6258 | Beijing, Majorca, Thailand, Greg Dyke
7540 | 1300 Oslo, Beijing, Miami Horror, Lille
10761 | Moscow, Beijing, Canberra, Pyongyang
11519 | Karl Polanyi, Beijing, Cevdet Sunay, Mary Gaunt, Cyd Hayman, Les diamants de la couronne
13418 | Beijing, Tarnobrzeg Voivodeship, Yakuza, Longs Peak, Jeep Wrangler
15585 | Beijing, Ivan Koloff, Olinto Cristina
22622 | Shanghai, Cleveland, Beijing, Delhi, Saint Lucia, St Lucia, Venice
26002 | Beijing, Alte Oper, Intimate Stories, Seventeen, Five Star Krishna
27116 | Ankara, Mandarin Oriental, Bangkok, Cairo, Beijing, Dublin, Jakarta, Amsterdam, Bratislava, Toronto, Sydney, Edinburgh, London, Honolulu, Auckland, Bali, Tokyo, Manila, Queens Gardens, Brisbane, Budapest, Montreal, Perth, Kolkata, Dubai, Melbourne, Copenhagen, Nairobi, Bangkok, Bangalore

Table 8: Entities grouped by atom ID for Beijing on layer 3 of Gemma2-2B.

Atom ID | Entities
9444 | Shanghai, Beijing
24724 | Moscow, Beijing, Russia
30463 | Beijing, Thailand
32854 | Beijing, Madrid, Mariano Gonzalvo

Table 9: Entities grouped by atom ID for Beijing on layer 4 of Gemma2-2B.

Atom ID | Entities
1578 | Beijing, Cadbury
11098 | Beijing, Jakarta
11158 | Beijing
15601 | Oslo, Moscow, Stockholm, Berlin, Athens, Helsinki, Beijing, Vienna, Geneva, Amsterdam, Seoul, Prague, Madrid, London, Warsaw, Kyoto, Naples, Tokyo, Budapest, Paris, Rome, Bangkok
25755 | Stockholm, Helsinki, Beijing, Minneapolis, Minecraft, Copenhagen, Nairobi
33322 | Shanghai, Beijing, Guangzhou, Macau, Hong Kong, Chongqing, Shenzhen, Wuhan

Table 10: Entities grouped by atom ID for Beijing on layer 5 of Gemma2-2B.

Atom ID | Entities
11453 | Beijing, The Great Citizen
12661 | Beijing, Holycross-Ballycahill GAA
19018 | Beijing, Registro, 4th of August Regime, Witnesses
23750 | Moscow, Ankara, Beijing, Jakarta, Madrid

Table 11: Entities grouped by atom ID for Beijing on layer 6 of Gemma2-2B.

Atom ID | Entities
7533 | Johannesburg, Shanghai, Beijing, Colombo, Prafulla Chandra Ghosh
16414 | Shenyang, Shanghai, Beijing, Guangzhou, Yangtze, Google China, Taobao, Tianjin, Chongqing, National Development and Reform Commission, Shenzhen, Qing dynasty, Aviation Industry Corporation of China, Qzone, Youku, Wuhan, People's Republic of China
22386 | Beijing
33958 | Carol Zhao, Shenyang, Shanghai, Beijing, Guangzhou, Seoul, Yangtze, Macau, Hanoi, Taipei, Hong Kong, Kaohsiung, South Korea, Busan, United States Army Military Government in Korea, Tianjin, Pyongyang, Incheon, Chongqing, Vietnam, Dennis Hwang, Shenzhen, Daejeon, North Korea, Wuhan