Paper deep dive
Evolution of Concepts in Language Model Pre-Training
Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu
Models: Pythia-160M, Pythia-6.9B
Abstract
Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.
Tags
Links
- Source: https://arxiv.org/abs/2509.17196
- Canonical: https://arxiv.org/abs/2509.17196
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 6:00:24 PM
Summary
This paper introduces 'crosscoders', a sparse dictionary learning method, to track the evolution of interpretable features in language models across pre-training snapshots. The authors identify a two-stage learning process—statistical learning followed by feature learning—and demonstrate that feature complexity correlates with emergence time, providing a mechanistic link between microscopic feature development and macroscopic model performance.
Entities (5)
Relation Signals (3)
Crosscoders → tracks → Feature Evolution
confidence 95% · we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders.
Statistical Learning Phase → precedes → Feature Learning Phase
confidence 92% · Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase.
Feature Complexity → correlates with → Emergence Time
confidence 90% · The result in Figure 6a reveals a non-trivial correlation with moderate strength between peak timing and complexity scores
Cypher Suggestions (2)
Identify the relationship between training phases and model evolution. · confidence 90% · unvalidated
MATCH (p1:Phase)-[:PRECEDES]->(p2:Phase) RETURN p1.name, p2.name
Find all methods used to analyze language model training dynamics. · confidence 85% · unvalidated
MATCH (m:Method)-[:USED_FOR]->(p:Process {name: 'Training Dynamics Analysis'}) RETURN m.name
Full Text
109,468 characters extracted from source content.
Published as a conference paper at ICLR 2026

EVOLUTION OF CONCEPTS IN LANGUAGE MODEL PRE-TRAINING

Xuyang Ge†, Wentao Shu†, Jiaxing Wu†, Yunhua Zhou‡, Zhengfu He†, Xipeng Qiu†∗
†OpenMOSS Team, Shanghai Innovation Institute; Fudan University
‡Shanghai AI Laboratory
xyge24@m.fudan.edu.cn, zfhe19@fudan.edu.cn

[Teaser figure: training loss over training snapshots, divided into a statistical learning phase and a feature learning phase, with example features (single-token, previous token, concept of magazine, awareness of temporal relationships) and annotations: Initialization Features, features activating on specific tokens exist at initialization; Learning Zipf's Law, during early training the model learns statistical regularities; Feature Emergence, most features begin to emerge at this point; Complex Features, complex features (e.g., sentence structures) form in later training stages.]

ABSTRACT

Language models obtain extensive capabilities through pre-training. However, the pre-training dynamics remain a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance.
Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics. Our code is available at https://github.com/OpenMOSS/Language-Model-SAEs.

1 INTRODUCTION

Pre-training (Radford et al., 2018; Devlin et al., 2019) has emerged as the dominant paradigm for developing frontier large language models (LLMs) (OpenAI, 2023; Grattafiori et al., 2024; Yang et al., 2025). Despite its remarkable success, the pre-training process remains largely a black box. While scaling laws (Hestness et al., 2017; Kaplan et al., 2020; Bahri et al., 2021) reveal the predictable relationships between compute, data, and loss, they offer limited insight into the internal reorganization occurring within the model parameters. Other theoretical frameworks on learning dynamics, including neural tangent kernel (Jacot et al., 2018), information bottleneck theory (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017), and singular learning theory (Watanabe, 2009; Lau et al., 2024; Wang et al., 2025a), provide high-level explanations of why we can observe generalization and grokking (Power et al., 2022; Nanda et al., 2023). However, a fundamental question has remained largely opaque: how do LLMs actually develop their capabilities internally during pre-training?

∗ Corresponding author.

arXiv:2509.17196v2 [cs.CL] 14 Feb 2026

Recent advances in mechanistic interpretability, particularly through dictionary learning methods based on sparse autoencoders (SAEs), have begun to illuminate the internal representation of neural networks (Bricken et al., 2023; Templeton et al., 2024; Gao et al., 2025).
By disentangling the phenomenon of superposition (Elhage et al., 2022), sparse autoencoders show their capabilities in extracting millions of features from LLM activations, demonstrating that LLMs encode human-interpretable concepts as linear directions in their activation spaces. However, these analyses have predominantly focused on studying fully-trained models. Consequently, the process of how features initially emerge and evolve throughout the training process remains largely unexplored.

In this paper, we propose to track feature evolution across pre-training snapshots using crosscoders (Lindsey et al., 2024), a variant of sparse autoencoders designed to simultaneously identify and align features from a family of correlated model activations. While originally introduced to resolve cross-layer superposition and track features distributed across layers, we adapt this approach to analyze activations from different training checkpoints. By applying cross-snapshot crosscoders, we can observe where features emerge, rotate, and degenerate, thereby providing deeper insight into the internal dynamics of model pre-training.

Our main contributions are as follows:

1. To the best of our knowledge, this work is the first to adapt crosscoders to study training dynamics (Section 3). We evaluate completeness and faithfulness of interpretations provided by crosscoders in Appendix A.
2. We perform in-depth analyses on cross-snapshot features in Section 4. We empirically show that the decoder norms can serve as proxies of feature evolution status, and showcase both general and per-feature evolutionary properties.
3. Our method successfully connects the microscopic features to the macroscopic downstream task metrics using attribution-based circuit tracing techniques (Section 5).
4. We show evidence for the phase transition from a statistical learning phase to a feature learning phase in Section 6.

2 RELATED WORKS

Learning dynamics and phase transitions.
Multiple theoretical frameworks explain neural network training dynamics. Neural Tangent Kernel (NTK) theory (Jacot et al., 2018) establishes that infinite-width networks evolve as kernel machines with fixed kernels during gradient descent, with extensions to finite-width corrections (Dyer & Gur-Ari, 2020) and modern architectures (Yang, 2020; Yang & Littwin, 2021). Recent work identifies phase transitions and lazy regimes in training dynamics (Kumar et al., 2024; Zhou et al., 2025). Information Bottleneck (IB) theory (Tishby & Zaslavsky, 2015) formulates deep learning as optimizing compression-prediction tradeoffs. Shwartz-Ziv & Tishby (2017) empirically demonstrate two distinct training phases, initial fitting followed by compression, which aligns with our findings in Section 6. Singular Learning Theory (SLT) treats neural networks as singular statistical models. Watanabe (1999; 2009) introduces the Real Log Canonical Threshold to provide geometric complexity measures predicting phase transitions, with recent advances enabling practical estimation at scale through Local Learning Coefficients (Lau et al., 2024; Furman & Lau, 2024; Wang et al., 2025a). Our work complements these theoretical frameworks by providing detailed mechanistic accounts of feature evolution during transformer pre-training, bridging high-level theory with empirical observations.

Sparse dictionary learning. The superposition hypothesis posits that models use linear representations (Bengio et al., 2014; Alain & Bengio, 2017; Vargas & Cotterell, 2020) to embed more features than neurons (Olah et al., 2020; Elhage et al., 2022). Sparse Autoencoders (SAEs) (Bricken et al., 2023; Huben et al., 2024; He et al., 2024) extract monosemantic features from superposition using sparse dictionary learning, with subsequent improvements and scaling efforts (Gao et al., 2025; Rajamanoharan et al., 2024). Ge et al. (2024) and Dunefsky et al.
(2024) propose transcoders, an SAE variant that predicts future activations to improve circuit traceability. Recently, Lindsey et al. (2024) introduce crosscoders to simultaneously read and write to multiple layers. Mishra-Sharma et al. (2024) and Minder et al. (2025) leverage crosscoders for capturing the difference between pre-trained models and their chat-tuned versions.

3 TRACKING THE EVOLUTION OF FEATURES

[Figure 1: Overview of our method. The crosscoder is trained to decompose activations of multiple pre-training snapshots (left) into sparse features (right). Panel annotations: Collecting Activations, the crosscoder is trained to simultaneously decompose the activations, at a given position and a given input, of multiple pre-training snapshots; Decoder Norms, crosscoder decoder norms reflect the evolution of features throughout pre-training; Interpretation, features are interpreted by their top activations.]

Crosscoder architecture. For a given text corpus $C$ and a family of training snapshots $\Theta$ saved during LLM pre-training, let $\mathcal{A} = \{a_\theta : C \times \mathbb{N} \to \mathbb{R}^{d_{\mathrm{model}}} \mid \theta \in \Theta\}$ denote a corresponding family of parameterized functions mapping input datapoints to model activations at a specific layer. Each datapoint $x = (c, j) \in C \times \mathbb{N}$ is a training token in $C$, indexing the $j$-th token in sequence $c$. A cross-snapshot crosscoder (Lindsey et al., 2024) operates on the parameterized function family $\mathcal{A}$ over corpus $C$ (Figure 1).
The crosscoder architecture is defined by:

$$f(x) = \sigma\!\left(\sum_{\theta \in \Theta} W^{\theta}_{\mathrm{enc}}\, a_\theta(x) + b_{\mathrm{enc}}\right), \qquad \hat{a}_\theta(x) = W^{\theta}_{\mathrm{dec}}\, f(x) + b^{\theta}_{\mathrm{dec}} \tag{1}$$

with $W^{\theta}_{\mathrm{enc}} \in \mathbb{R}^{n_{\mathrm{features}} \times d_{\mathrm{model}}}$, $b_{\mathrm{enc}} \in \mathbb{R}^{n_{\mathrm{features}}}$, $W^{\theta}_{\mathrm{dec}} \in \mathbb{R}^{d_{\mathrm{model}} \times n_{\mathrm{features}}}$, and $b^{\theta}_{\mathrm{dec}} \in \mathbb{R}^{d_{\mathrm{model}}}$. The parameters $W^{\theta}_{\mathrm{enc}}$, $W^{\theta}_{\mathrm{dec}}$, and $b^{\theta}_{\mathrm{dec}}$ correspond to snapshot-specific encoder and decoder weights for parameter $\theta$. The activation function $\sigma(\cdot)$ produces sparse feature activations $f(x)$ shared for all snapshots, and $\hat{a}_\theta(x)$ denotes the reconstructed term of model activation $a_\theta(x)$.

Training objectives. The crosscoder is trained to minimize the loss:

$$\mathcal{L}(x) = \underbrace{\sum_{\theta \in \Theta} \|a_\theta(x) - \hat{a}_\theta(x)\|^2}_{\text{Reconstruction loss}} \;+\; \lambda_{\mathrm{sparsity}} \underbrace{\sum_{\theta \in \Theta} \sum_{i=1}^{n_{\mathrm{features}}} \Omega\!\left(f_i(x) \cdot \|W^{\theta}_{\mathrm{dec},i}\|\right)}_{\text{Sparsity loss}} \tag{2}$$

where $\lambda_{\mathrm{sparsity}}$ is a hyperparameter to control the trade-off between reconstruction fidelity and feature sparsity. The regularization function $\Omega(\cdot)$ serves as a differentiable substitute for L0 regularization, penalizing non-sparse feature activations. We include the decoder norm $\|W^{\theta}_{\mathrm{dec},i}\|$ in the regularization term to prevent the crosscoder from trivially reducing the feature activation $f_i(x)$ while inflating the decoder norms under imperfect L0 approximations such as L1 regularization.

[Figure 3: Comparison between crosscoders and per-snapshot SAEs. (a) The explained variance of crosscoders versus SAEs at each snapshot. (b) The L0 norm of crosscoders versus SAEs at each snapshot. (c) The Pareto frontier comparison of crosscoders and SAEs trained on the final snapshot.]

Appendix A.1
shows the details for selecting the proper activation function $\sigma(\cdot)$ and regularization function $\Omega(\cdot)$ to optimize crosscoder feature sparsity.

Experimental setup. We use Pythia-160M and Pythia-6.9B (Biderman et al., 2023) for our experiments throughout this work. Pythia is a Transformer language model suite with well-controlled training settings and accessibility to training snapshots. We select the middle layers (Layer 6 of Pythia-160M and Layer 16 of Pythia-6.9B) for training crosscoders. To balance training cost with the granularity of feature evolution analysis, we select 32 snapshots out of 154 open-source snapshots ranging from step 0 to 143,000, with a stratified sampling approach: (1) all 20 snapshots before step 10,000 to capture early feature evolution with maximum temporal resolution, and (2) 12 evenly spaced snapshots from later training stages, containing 4 snapshots from steps 14,000–34,000 and 8 snapshots from steps 47,000–143,000. We use SlimPajama (Shen et al., 2023), a comprehensive text dataset covering a variety of data sources, to sample activations.

[Figure 2: Explained variances versus L0 norms of our crosscoders, for Pythia-160M at dictionary expansion factors 8x, 16x, 32x, 64x, and 128x, and for Pythia-6.9B at 8x.]

Results. Figure 2 demonstrates crosscoder performance in decomposing model activations into sparse feature representations. We evaluate reconstruction quality (measured by activation variance explained) and sparsity (measured by L0 norm averaged across snapshots). Increasing dictionary size $n_{\mathrm{features}}$ yields significant Pareto improvements across both metrics.

To examine whether crosscoders can extract cross-snapshot features and align identical features in consistent directions within the feature space, we compare crosscoder performance against corresponding SAEs trained individually on each Pythia snapshot.
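As a concrete reading of Eq. 1 and Eq. 2, the following is a minimal numpy sketch of the cross-snapshot crosscoder forward pass and loss. This is not the authors' implementation: the choice of ReLU for $\sigma$, L1 for $\Omega$, and all variable names are assumptions for illustration (the paper selects the actual functions in Appendix A.1).

```python
import numpy as np

def crosscoder_forward(a, W_enc, b_enc, W_dec, b_dec):
    """a: dict snapshot -> activation vector of shape (d_model,).
    W_enc[s]: (n_features, d_model); W_dec[s]: (d_model, n_features)."""
    # Shared sparse code: sum of per-snapshot encoder projections (Eq. 1)
    pre = b_enc.astype(float).copy()
    for s, act in a.items():
        pre += W_enc[s] @ act
    f = np.maximum(pre, 0.0)  # sigma = ReLU (assumed for illustration)
    # Independent per-snapshot reconstructions from the shared code
    a_hat = {s: W_dec[s] @ f + b_dec[s] for s in a}
    return f, a_hat

def crosscoder_loss(a, f, a_hat, W_dec, lam=1e-3):
    """Eq. 2 with Omega = L1 (assumed), weighted by decoder column norms."""
    recon = sum(np.sum((a[s] - a_hat[s]) ** 2) for s in a)
    sparsity = sum(np.sum(np.abs(f) * np.linalg.norm(W_dec[s], axis=0))
                   for s in a)
    return recon + lam * sparsity
```

Weighting the sparsity term by $\|W^{\theta}_{\mathrm{dec},i}\|$ (the `np.linalg.norm(..., axis=0)` factor) is what prevents the trivial shrink-activations/inflate-decoder solution described above.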
We train SAEs using identical settings and hyperparameters as crosscoders on each snapshot. Figures 9a and 9b demonstrate that crosscoders achieve comparable performance in L0 norm and explained variance at each snapshot, even when SAEs are optimized for individual snapshots where they should have a theoretical advantage. We further compare the Pareto frontiers (explained variance versus L0 norm) between crosscoders and SAEs trained on the final snapshot. Figure 9c shows that crosscoders exhibit a slightly superior Pareto frontier, demonstrating their effectiveness in sparse dictionary learning beyond their primary capability of tracking feature evolution across snapshots.

Feature evolution revealed by crosscoders. An important advantage of crosscoders is the unified feature space (or sparse codes) they reveal. The crosscoder encoder aggregates cross-snapshot information to produce shared feature activations. Then these activations are translated back to recover the original activations by a group of independent decoders (or dictionaries).

If a feature activates but "exists" at only a subset of snapshots, the sparse penalty will suppress the decoder norms of this feature at irrelevant snapshots to near-zero so they won't interfere with reconstruction on these snapshots and also reduce sparsity loss. This design principle leads to a crucial observation: the decoder norm $\|W^{\theta}_{\mathrm{dec},i}\|$ directly reflects the strength and presence of feature $i$ at snapshot $\theta$. Therefore, tracking decoder norm changes across snapshots provides a direct window into feature evolution dynamics. Appendix C provides a sanity check of whether $\|W^{\theta}_{\mathrm{dec},i}\|$ can indeed serve as a proxy of feature strength using linear probes.

Potential failure modes of crosscoder feature alignment. Before assessing crosscoder features, we further discuss two theoretical concerns about cross-snapshot feature alignment:

Will crosscoders misalign unrelated features?
Such misalignment would strongly contradict the optimization objective of crosscoders. The distinct activation patterns of unrelated features would conflict when forced into the same dimension. Once a feature activates, it would cause others to activate simultaneously, introducing noise in the reconstruction. In terms of results, misalignment would lead to polysemanticity of crosscoder features, damaging the consistency of their interpretation, which our interpretability evaluation demonstrates does not occur.

Will crosscoders split shared features? Suppose features from different snapshots represent the same underlying concept with identical activation patterns; splitting them across multiple dimensions would be suboptimal: it wastes representational capacity in feature space, as both previous works (Gao et al., 2025; Templeton et al., 2024) and Appendix A.4 prove that feature space dimensionality is crucial for better reconstruction fidelity. Nevertheless, in rare cases, we do observe feature splitting among highly active features. These features share semantic and directional similarity but exhibit subtly different activation patterns that justify separate dimensions. We detail this phenomenon in Appendix F.

These considerations suggest that crosscoders should naturally achieve effective feature alignment, mapping semantically equivalent features from different snapshots to consistent positions in the unified feature space.

4 ASSESSING CROSS-SNAPSHOT FEATURES

4.1 OVERVIEW OF FEATURE EVOLUTION

[Figure 4: Overview of cross-snapshot feature decoder norm evolution. Features are extracted by a 98,304-feature crosscoder on Pythia-160M (top) and a 32,768-feature crosscoder on Pythia-6.9B (bottom).]
Figure 4 shows 50 randomly sampled features and their decoder norm evolution across snapshots. Each feature's decoder norms are linearly rescaled to a maximum of 1. Most cross-snapshot features exhibit two distinct developmental patterns:

1. Initialization features that exist from random initialization, exhibit a sudden drop and recovery around step 128, then gradually decay. The existence of these features has been established by Bricken et al. (2023) and Heap et al. (2025).
2. Emergent features that begin forming primarily around step 1000, reaching peak intensity at various subsequent training steps. There also exist emergent features only appearing in late training, which we discuss in later sections.

[Figure 5: Statistics of emergent features in a 98,304-feature crosscoder on Pythia-160M. (a) Distribution of peak emergence times. (b) Distribution of feature lifetime. (c) Mean projection of each feature's decoder vector of snapshot $\theta_i$ onto its decoder vector of snapshot $\theta_j$.]

Feature emergence steepness. Previous studies suggest that loss curves comprise discrete phase changes, each contributing to different circuits at distinct training stages, with evidence from toy model features (Elhage et al., 2022), 5-digit addition (Nanda et al., 2023), and in-context learning (Olsson et al., 2022). From a feature evolution perspective, we investigate whether features emerge abruptly enough to constitute the fundamental units of these phase transitions.
By evaluating the time spent by each feature from initial emergence to peak, we find that emergence steepness exhibits considerable variation, demonstrating the coexistence of both gradually-formed and abruptly-appearing features.

Feature persistence after emergence. We next examine whether emergent features persist after formation. We define the lifetime of a crosscoder feature as $\big|\{\, j \mid \|W^{\theta_j}_{\mathrm{dec},i}\| > 0.3 \,\}\big|$, where a threshold is introduced to block out near-zero decoder norms. Figure 5b shows the lifetimes of all emergent features. We find that most features persist for extended periods (above 60% of snapshots) after formation, indicating that: (1) LLMs retain learned features robustly, and (2) our crosscoders successfully align and track these features across snapshots.

A universal directional turning point. To further investigate the geometry of feature evolution, we compute the projections between feature directions across snapshots. The directional evolution patterns prove remarkably consistent: most features undergo drastic directional shifts around step 1,000, rendering pre- and post-step-1,000 directions nearly orthogonal (Figure 5c). Subsequently, features continue to rotate more gradually, with directions at the final snapshot maintaining notable cosine similarities to their initial post-step-1,000 orientations, a significant finding given the high-dimensional activation space.

4.2 CORRELATION BETWEEN EMERGENCE STEP AND COMPLEXITY

To gain deeper insights into features emerging at different training stages, we leverage LLMs to automatically assess their complexity based on top activation patterns. We follow Cunningham & Conerly (2024) to score feature complexity for 100 randomly sampled emergent features based on their top activations, ranging from 1 (simple) to 5 (complex).
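The lifetime count described above is simple to compute from the per-snapshot decoder norms. A minimal sketch, assuming (as in Figure 4) that each feature's norms are first rescaled to a per-feature maximum of 1 before thresholding:

```python
import numpy as np

def feature_lifetimes(decoder_norms, threshold=0.3):
    """decoder_norms: (n_features, n_snapshots) array of ||W_dec,i|| per snapshot.
    Returns, per feature, the lifetime |{j : normalized norm > threshold}|."""
    peak = decoder_norms.max(axis=1, keepdims=True)
    normalized = decoder_norms / np.maximum(peak, 1e-12)  # rescale to max 1
    return (normalized > threshold).sum(axis=1)
```

A feature whose norm trajectory is [0.0, 0.5, 1.0, 0.2] would get a lifetime of 2: only the two middle snapshots clear the 0.3 threshold after rescaling.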
The result in Figure 6a reveals a non-trivial correlation of moderate strength between peak timing and complexity scores, suggesting that more complex features tend to emerge later in training. The scoring rubrics and complete prompt for automated complexity scoring can be found in Appendix B.

[Figure 6: (a) Complexity scores (evaluated by Claude Sonnet 4) versus peak emergence times, showing a moderate positive correlation (Pearson r = 0.309, p = 0.002). (b) Decoder norm evolution trajectories for all previous token (n=230), induction (n=827), and context-sensitive (n=822) features across training. (c) More cases of feature decoder norm evolution trajectories, e.g. ":" in the bigram "Note:", "Not" in phrases like "Not have to do", "of" in phrases like "With the help of", previously appeared variable names, contexts of computer-related topics, and contexts of critic reviews.]

4.3 CASE STUDY ON TYPICAL CROSS-SNAPSHOT FEATURES

We employ simple rule-based approaches to find several well-studied feature types in a crosscoder with 32,768 features trained on Pythia-6.9B, including previous token features (which activate based on the preceding token), induction features (which activate on the second [A] in patterns [A][B]...[A][B] and help predict [B]), and context-sensitive features. These feature categories are extensively documented in prior research on neurons and SAE features (Gurnee et al., 2024; Huben et al., 2024; Ge et al., 2024).
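The induction pattern lends itself to a rule-based positional check. The paper does not spell out its exact rules, so the sketch below is only our guess at one plausible criterion: a position qualifies as an induction site when its token has occurred earlier with a successor, i.e. it is the second [A] of an [A][B]...[A] pattern; a feature firing predominantly at such positions would be a candidate induction feature.

```python
def induction_positions(tokens):
    """Positions i whose token appeared earlier followed by some successor,
    i.e. the second [A] in [A][B]...[A] (hedged approximation of the
    rule-based induction-feature criterion)."""
    seen = {}  # token -> earliest position where it had a successor
    out = []
    for i, t in enumerate(tokens):
        if t in seen:
            out.append(i)
        if i + 1 < len(tokens):  # only tokens with a successor can seed a pattern
            seen.setdefault(t, i)
    return out
```

For the token sequence A B C A B, positions 3 (the second A) and 4 (the second B) qualify, matching the intuition that an induction feature should fire where copying the earlier continuation helps predict the next token.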
Figure 6b demonstrates the distinct temporal pattern in feature emergence: previous token features arise early (around steps 1,000–5,000), while induction and context-sensitive features appear later and over a wider range (mostly steps 10,000–100,000). This suggests a general emergence hierarchy, from previous token to induction to context-sensitive features, which aligns with their increasing complexity. This finding is also consistent with the causal dependency between induction heads and previous token heads (Olsson et al., 2022).

For the majority of features that cannot fit into specific rule-based patterns, we further demonstrate additional cases of random emergent features with different evolutionary patterns (Figure 6c). Labels are annotated by summarizing the top activation samples.

5 CONNECTING MICROSCOPIC EVOLUTION TO MACROSCOPIC BEHAVIORS

One of the ambitious missions of mechanistic interpretability research is to connect feature-level findings with the model's downstream performance. To this end, we employ attribution-based circuit tracing techniques (Syed et al., 2023; Ge et al., 2024; Marks et al., 2025) to investigate the causal effects of crosscoder feature formation on downstream task metric improvements.

Method. Let metric $m : \mathbb{R}^{d_{\mathrm{model}}} \to \mathbb{R}$ be an arbitrary scalar-valued function of model activations $a_\theta(x)$.¹ To quantify the causal effect of each feature activation $f_i(x)$ on metric $m$, we first decompose model activations into per-feature representations:

$$a_\theta(x) = \hat{a}_\theta(x) + \varepsilon_\theta(x) = \sum_{i=1}^{n_{\mathrm{features}}} f_i(x) \cdot W^{\theta}_{\mathrm{dec},i} + b^{\theta}_{\mathrm{dec}} + \varepsilon_\theta(x) \tag{3}$$

¹ The original notation in Marks et al. (2025) uses $m$ as a function of input $x$, where the computation graph flows through nodes of interest. Since we focus on feature effects within a single layer (e.g., layer 6 of Pythia-160M), we simplify $m$ as a function of model activations to avoid complex intervention notation.
where $\varepsilon_\theta(x) \in \mathbb{R}^{d_{\mathrm{model}}}$ represents the crosscoder reconstruction error. This decomposition incorporates crosscoder features into the computation graph, enabling gradient computation with respect to features. We then estimate each feature's causal effect using its attribution score:

$$\mathrm{attr}^{\theta}_i(x) = f_i(x) \cdot \frac{\partial m(a_\theta(x))}{\partial f_i(x)} \tag{4}$$

where the gradient $\frac{\partial m}{\partial f_i}$ flows through the decomposition in Eq. 3. This attribution score employs a first-order Taylor expansion as a linear approximation of model computation.

For structured downstream tasks with clean/corrupted input pairs, such as subject-verb agreement (Finlayson et al., 2021), we can employ the full framework of attribution patching (Syed et al., 2023; Marks et al., 2025) by emphasizing differences between clean and corrupted inputs:

$$\mathrm{attr}^{\theta}_i(x, \tilde{x}) = [f_i(x) - f_i(\tilde{x})] \cdot \frac{\partial m(a_\theta(x))}{\partial f_i(x)} \tag{5}$$

where $\tilde{x}$ is the corrupted version of input $x$. Attribution patching isolates components explaining the transition from corrupted to clean performance, excluding the majority of components that contribute to baseline model function. This focused approach improves the validity of the linear approximation. In practice, we employ the integrated gradient (IG) version of the attribution scores defined in Eq. 4 and Eq. 5, which computes gradients at evenly-spaced points from $x$ to $\tilde{x}$ ($x$ to the zero vector in the pure attribution case). Details of IG computation can be found in Appendix E.
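Numerically, Eq. 4 and its integrated-gradients variant of Eq. 5 reduce to a few lines once a gradient oracle is available. In this sketch the metric gradient is supplied by the caller (in practice it would come from backpropagation through Eq. 3); the function names and the midpoint interpolation schedule are assumptions for illustration:

```python
import numpy as np

def attribution(f, grad_f):
    """Eq. 4: attr_i = f_i * dm/df_i, the first-order effect of feature i."""
    return f * grad_f

def ig_attribution(f_clean, f_corrupt, grad_fn, steps=10):
    """Integrated-gradients version of Eq. 5: average the metric gradient at
    evenly-spaced points between corrupted and clean feature activations."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1] sub-intervals
    avg_grad = np.mean(
        [grad_fn(f_corrupt + a * (f_clean - f_corrupt)) for a in alphas], axis=0)
    return (f_clean - f_corrupt) * avg_grad
```

For a linear metric $m(f) = w \cdot f$ (a toy assumption), the IG scores sum exactly to the metric difference between clean and corrupted runs, which is the sanity property that motivates integrated gradients over the single-point estimate.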
[Figure 7: Crosscoder feature attribution on the Across-P variant of the subject-verb agreement task, e.g. "The teachers near the desk are". We use a crosscoder trained at layer 6 of Pythia-160M with 98,304 features. (a) The attribution scores of top contributing features over time. (b) The metric recovery when ablating all features except the top k contributing features. (c) The metric recovery when ablating the top k contributing features. (d) Top activation samples of key features contributing to this task: Feature 18341 (plural), Feature 47045 (plural subjects), Feature 68813 (compounds/postposition attributives), and Feature 69636 (endings of postposition attributives). Features recognizing plural nouns appear before features recognizing postposition attributives.]

Experimental setup. We evaluate our method on subject-verb agreement (SVA) (Finlayson et al., 2021), induction, and indirect object identification (IOI) (Wang et al., 2023) tasks, using 1000 samples each to identify critical features.
For SVA and IOI, we create corrupted controls by swapping singular/plural forms (SVA) or altering the second subject (IOI), using logit differences between clean and corrupted answers as metrics. For induction, which lacks natural corruptions, we use target answer log probabilities.

We compute IG attribution scores for all features on a 98,304-feature crosscoder trained on Pythia-160M, and rank crosscoder feature contribution by mean attribution scores across all snapshots. We then perform complementary ablations: (1) removing top-ranked features, and (2) removing all except top-ranked features.

Results. Figure 7 shows results for the Across-P variant of SVA, where postpositional attributives separate subjects and verbs, e.g. "The teachers near the desk are". We identify key contributing features ordered by emergence time: (1) Features 18341 and 47045 capture plural nouns, with 47045 specialized for plural subjects; (2) Feature 68813 marks compound subjects and postpositional attributives; (3) Features 50159 and 69636 identify endings of postpositional attributives, with 69636 showing higher accuracy. Additional features include subject-specific and context-specialized plural markers with lower task contributions. Notably, Features 68813, 50159, and 69636 alternately dominate the metric, revealing circuit-level model evolution through component iteration.

Ablation experiments across all tasks demonstrate that within tens of features, we can consistently disrupt or recover model performance on downstream tasks across training snapshots, confirming that our method identifies necessary and sufficient task components.
6 OBSERVATIONS OF A STATISTICAL-TO-SUPERPOSITION TRANSITION

[Figure 8 image omitted; see caption.]

Figure 8: Observations supporting a statistical-to-superposition transition in Pythia-160M and Pythia-6.9B. (a) Unigram and bigram KL divergence evolution in Pythia-160M. The convergence timing of KL divergences coincides with when training loss approaches the theoretical minimum (unigram/bigram entropy). (b) Unigram and bigram KL divergence evolution in Pythia-6.9B. (c) Total feature dimensionality ratio (relative to activation space dimension) over training time.

What is the model learning at the beginning of training, while the training loss rapidly decreases, if no features are formed during this period? We hypothesize that the rapid loss decrease in early training is driven by learning coarse statistical patterns rather than by forming distinct features. After this initial optimization nears completion, sparse features emerge in superposition for further loss reduction. This two-stage structure is deeply consistent with the fitting-to-compressing phase transition predicted by the information bottleneck theory (Shwartz-Ziv & Tishby, 2017).

Early training almost exclusively learns uni- and bi-gram distributions. Following previous work on language models learning statistical patterns (Takahashi & Tanaka-Ishii, 2017; Xu et al., 2019; Choshen et al., 2022; Belrose et al., 2024; Svete & Cotterell, 2024; Nguyen, 2024; Chen et al., 2024), we compute the KL divergence between the true token distribution $Q$ and the model's predicted distribution $P$.
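This comparison can be sketched as follows. `unigram_kl` and the tiny token sample are illustrative stand-ins (the paper approximates the distributions from SlimPajama tokens), and the bigram divergence applies the same computation conditioned on the previous token.

```python
import numpy as np
from collections import Counter

def unigram_kl(p_model, tokens, vocab_size):
    """D_KL(P || Q): model unigram P against the empirical unigram Q."""
    counts = Counter(tokens)
    q = np.array([counts.get(t, 0) for t in range(vocab_size)], dtype=float)
    q /= q.sum()
    p = np.asarray(p_model, dtype=float)
    # restrict to shared support; in practice a smoothing scheme is needed
    mask = (p > 0) & (q > 0)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A model that exactly matches the corpus unigram statistics diverges by 0.
tokens = [0, 1, 2, 0, 1, 0]
p_matched = np.array([3, 2, 1], dtype=float) / 6
kl = unigram_kl(p_matched, tokens, vocab_size=3)
```

A mismatched model (e.g. a uniform predictor on this skewed sample) yields a strictly positive divergence, which is the quantity tracked over snapshots in Figure 8.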
We randomly sample 10M tokens from SlimPajama to approximate both distributions. We then evaluate the unigram KL divergence $D_{\mathrm{KL}}(P(x) \,\Vert\, Q(x))$ and the bigram KL divergence $D_{\mathrm{KL}}(P(x_i \mid x_{i-1}) \,\Vert\, Q(x_i \mid x_{i-1}))$.

The results show that both unigram and bigram KL divergences rapidly converge to low values during early training (Figure 8a and 8b). Furthermore, the training losses during this period approach the theoretical minimum achievable if the model perfectly matched the true token distributions, i.e., the entropy of these distributions. This suggests that the model primarily learns to fit statistical regularities (Zipf's law (Zipf, 1935; Piantadosi, 2014)) during the early training stage, which explains the dense nature of internal representations at this phase.

Total feature dimensionality undergoes compression and expansion. To directly measure superposition status at each snapshot, we adapt the approach from Elhage et al. (2022) and compute the dimensionality of each feature at snapshot $\theta$ as:

$$D_i = \frac{\lVert W^{\theta}_{\mathrm{dec},i} \rVert^2}{\sum_{j=1}^{n_{\mathrm{features}}} \bigl( \hat{W}^{\theta}_{\mathrm{dec},i} \cdot W^{\theta}_{\mathrm{dec},j} \bigr)^2} \qquad (6)$$

where $\hat{W}^{\theta}_{\mathrm{dec},i}$ is the normalized version of the decoder vector $W^{\theta}_{\mathrm{dec},i}$. In contrast to its original application in toy models with ground-truth features, we apply this metric to crosscoder features due to their consistent alignment across training snapshots, providing insight into superposition dynamics.

We compute total feature dimensionalities for crosscoders trained on Pythia-160M and Pythia-6.9B (Figure 8c). Under ideal superposition, features should form symmetric arrangements with total dimensionalities summing to the activation space dimension. Feature dimensionalities should decrease if large interference exists among features. We observe that total feature dimensionality first decreases then increases around the turning point, eventually reaching approximately 70% of available dimensions in the crosscoder with 98,304 features for Pythia-160M.
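Equation (6) is straightforward to compute from the decoder weights; a minimal sketch follows, where the weight matrix is a toy stand-in rather than a trained crosscoder decoder.

```python
import numpy as np

def feature_dimensionalities(W_dec):
    """Per-feature dimensionality D_i from Eq. (6).

    W_dec: decoder matrix at one snapshot,
           shape (n_features, d_model), one row per feature.
    """
    norms_sq = (W_dec ** 2).sum(axis=1)                 # ||W_i||^2
    W_hat = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    interference = (W_hat @ W_dec.T) ** 2               # (W_hat_i . W_j)^2
    return norms_sq / interference.sum(axis=1)

# Orthogonal unit features have D_i = 1 each, so their total
# dimensionality equals the activation space dimension d_model.
D = feature_dimensionalities(np.eye(4))
```

Summing `D` over features and dividing by the activation dimension gives the ratio plotted in Figure 8c; mutually interfering (superposed) features drive individual `D_i` below 1.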
Features from Pythia-6.9B account for a smaller proportion of activation space dimensions, likely due to the limited representational capacity of the 32,768-feature crosscoder. Nevertheless, both models exhibit the same trend, suggesting that initialization features initially form weak superposition, then become compressed to accommodate emergent features. This indicates that the model enters a feature learning phase.

7 LIMITATIONS

Scope and generalizability. Superposition has been shown to be a general phenomenon in deep neural networks (Elhage et al., 2022). However, our analysis focuses on feature evolution in Pythia suite models using their open-source training snapshots. While previous research on feature universality (Wang et al., 2025b) suggests our method might generalize to different architectures, datasets, post-training dynamics, and tasks beyond language modeling, the extent to which feature evolution patterns are consistent across diverse settings remains to be established. We leave broader generalization studies to future work.

Limited downstream task complexity. Section 5 establishes the connection between feature evolution and model behavior. However, the downstream tasks we examine are relatively simple, constrained by multiple factors including Pythia model capabilities, sparse dictionary quality, and the current state of circuit tracing methodologies. Scaling to more complex downstream tasks represents a natural direction for future work.

Discrete snapshot requirement. Crosscoder training requires activations from discrete training snapshots, with memory and computational costs scaling linearly with snapshot count, which limits observational granularity. Potential solutions include architectural modifications for online multi-snapshot processing or incorporating gradient information to capture continuous training dynamics.

8 CONCLUSION

We introduce crosscoders to study feature evolution in LLM pre-training.
Our analysis reveals two patterns: initialization-dependent features and emergent features, with complex patterns emerging later. We establish causal connections between feature evolution and downstream performance. Supported by uni- and bi-gram distribution analysis and feature dimensionality dynamics, we propose that model pre-training can be roughly divided into a statistical learning phase and a feature learning phase. This work bridges mechanistic interpretability with training dynamics.

REPRODUCIBILITY STATEMENT

All our crosscoders are based on open-source models (Pythia suite) and datasets (SlimPajama) with public accessibility. We provide source code for (1) generating model activations on each snapshot, (2) training crosscoders, and (3) analyzing trained crosscoders at https://github.com/OpenMOSS/Language-Model-SAEs. Detailed instructions for replicating our results are provided in the examples/reproduce_evolutionofconcepts/README.md file. We note that training large crosscoders requires substantial computational resources and disk space, as also listed in the README.md file.

REFERENCES

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=HJ4-rAVtl.

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. CoRR, abs/2102.06701, 2021. URL https://arxiv.org/abs/2102.06701.

Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, and Xiaoli Z. Fern. Neural networks learn statistics of increasing complexity. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=IGdpKP0N6w.
Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives, 2014. URL https://arxiv.org/abs/1206.5538.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, p. 2397–2430. PMLR, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.

Yihong Chen, Xiangxiang Xu, Yao Lu, Pontus Stenetorp, and Luca Franceschi. Jet expansions of residual computation. CoRR, abs/2410.06024, 2024. doi: 10.48550/ARXIV.2410.06024. URL https://doi.org/10.48550/arXiv.2410.06024.

Leshem Choshen, Guy Hacohen, Daphna Weinshall, and Omri Abend. The grammar-learning trajectories of neural language models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, p. 8281–8297. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.ACL-LONG.568. URL https://doi.org/10.18653/v1/2022.acl-long.568.

Hoagy Cunningham and Tom Conerly. Circuits updates - june 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/june-update/index.html#hurdles.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), p. 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423.
Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/2b8f4db0464c5b6e9d5e6bea4b9f308-Abstract-Conference.html.

Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from feynman diagrams. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=S1gFvANKDS.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html.

Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), p. 1828–1843, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.144. URL https://aclanthology.org/2021.acl-long.144/.

Zach Furman and Edmund Lau. Estimating the local learning coefficient at scale, 2024. URL https://arxiv.org/abs/2402.03698.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=tcsZt9ZNKD.

Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, p. 3816–3830. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.ACL-LONG.295. URL https://doi.org/10.18653/v1/2021.acl-long.295.

Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, and Xipeng Qiu. Automatically identifying local and global circuits with linear computation graphs, 2024.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, Danny Wyatt, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Francisco Guzmán, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Govind Thattai, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jack Zhang, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Karthik Prasad, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Kushal Lakhotia, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria Tsimpoukelli, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Ning Zhang, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vítor Albiero, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaofang Wang, Xiaoqing Ellen Tan, Xide Xia, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aayushi Srivastava, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Dong, Annie Franco, Anuj Goyal, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Cynthia Gao, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Eric-Tuan Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Filippos Kokkinos, Firat Ozgenel, Francesco Caggioni, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hakan Inan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Hongyuan Zhan, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Miao Liu, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in GPT2 language models. Trans. Mach. Learn. Res., 2024, 2024. URL https://openreview.net/forum?id=ZeI104QZ8I.

Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=TZ0CCGDcuT.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. CoRR, abs/2410.20526, 2024. doi: 10.48550/ARXIV.2410.20526. URL https://doi.org/10.48550/arXiv.2410.20526.

Thomas Heap, Tim Lawson, Lucy Farnik, and Laurence Aitchison. Sparse autoencoders can interpret randomly initialized transformers, 2025. URL https://arxiv.org/abs/2501.17727.

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory F. Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.
In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.

Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, p. 8580–8589, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62b8a6ec6b91d2462f5a-Abstract.html.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.

Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the transition from lazy to rich training dynamics. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=vt5mnLVIVo.

Edmund Lau, Zach Furman, George Wang, Daniel Murfet, and Susan Wei. The local learning coefficient: A singularity-aware complexity measure, 2024. URL https://arxiv.org/abs/2308.12108.

Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/crosscoders/index.html.

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=I4e82CIDxv.

Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, and Neel Nanda. Overcoming sparsity artifacts in crosscoders to interpret chat-tuning, 2025. URL https://arxiv.org/abs/2504.02922.

Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, Christopher Olah, and Thomas Henighan. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2025/crosscoder-diffing-update/index.html.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.05217.

Timothy Nguyen. Understanding transformers via n-gram statistics. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/b1c446eebd9a317d0e96b16908c821a-Abstract-Conference.html.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
15 Published as a conference paper at ICLR 2026 Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bha- gia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Py- atkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi. 2 olmo 2 furious, 2024. URL https://arxiv.org/abs/2501.00656. Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strat- egy employed by v1?Vision Research, 37(23):3311–3325, 1997. ISSN 0042-6989. doi: https://doi.org/10.1016/S0042-6989(97)00169-7.URL https://w.sciencedirect. com/science/article/pii/S0042698997001697. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. OpenAI. GPT-4 technical report, 2023. URL https://arxiv.org/abs/2303.08774. Guilherme Penedo, Hynek Kydl ́ ı ˇ cek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557. Steven T. Piantadosi. 
Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21:1112 – 1130, 2014. URL https://api. semanticscholar.org/CorpusID:14264582. Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gener- alization beyond overfitting on small algorithmic datasets. CoRR, abs/2201.02177, 2022. URL https://arxiv.org/abs/2201.02177. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under- standing by generative pre-training. Technical Report, 2018. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, J ́ anos Kram ́ ar, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. CoRR, abs/2407.14435, 2024. doi: 10.48550/ARXIV.2407.14435. URL https: //doi.org/10.48550/arXiv.2407.14435. Peter Sanders, Kurt Mehlhorn, Martin Dietzfelbinger, and Roman Dementiev. Sequential and Par- allel Algorithms and Data Structures: The Basic Toolbox. Springer Publishing Company, Incor- porated, 1st edition, 2019. ISBN 3030252086. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagn ́ e, Alexandra Sasha Luccioni, Franc ̧ois Yvon, Matthias Gall ́ e, Jonathan Tow, Alexan- der M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Beno ˆ ıt Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Baw- den, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurenc ̧on, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al. 
BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022. doi: 10.48550/ARXIV.2211.05100. URL https://doi.org/10.48550/arXiv.2211.05100.

Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric P. Xing. SlimPajama-DC: Understanding data combinations for LLM training. CoRR, abs/2309.10818, 2023. doi: 10.48550/ARXIV.2309.10818. URL https://doi.org/10.48550/arXiv.2309.10818.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.org/abs/1909.08053.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.

Lewis Smith, Sen Rajamanoharan, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, and Neel Nanda. Negative results for SAEs on downstream tasks and deprioritising SAE research (GDM mech interp team progress update #2), 2025. URL https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks. LessWrong.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, p. 3319–3328. PMLR, 2017. URL http://proceedings.mlr.press/v70/sundararajan17a.html.

Anej Svete and Ryan Cotterell. Transformers can represent n-gram language models.
In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, p. 6845–6881. Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.NAACL-LONG.381. URL https://doi.org/10.18653/v1/2024.naacl-long.381.

Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. CoRR, abs/2310.10348, 2023. doi: 10.48550/ARXIV.2310.10348. URL https://doi.org/10.48550/arXiv.2310.10348.

Shuntaro Takahashi and Kumiko Tanaka-Ishii. Do neural nets learn statistical laws behind natural language? CoRR, abs/1707.04848, 2017. URL http://arxiv.org/abs/1707.04848.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop, ITW 2015, Jerusalem, Israel, April 26 - May 1, 2015, p. 1–5. IEEE, 2015. doi: 10.1109/ITW.2015.7133169. URL https://doi.org/10.1109/ITW.2015.7133169.

Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. PolyPythias: Stability and outliers across fifty language model pre-training runs, 2025. URL https://arxiv.org/abs/2503.09543.

Francisco Vargas and Ryan Cotterell.
Exploring the linear subspace hypothesis in gender bias mitigation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, p. 2902–2913. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.232. URL https://doi.org/10.18653/v1/2020.emnlp-main.232.

George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, and Daniel Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025a. URL https://openreview.net/forum?id=SUc1UOWndp.

Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Towards universality: Studying mechanistic similarity across language model architectures. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=2J18i8T0oI.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.

Sumio Watanabe. Algebraic analysis for non-regular learning machines. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller (eds.), Advances in Neural Information Processing Systems 12, NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999, p. 356–362. The MIT Press, 1999. URL http://papers.nips.cc/paper/1739-algebraic-analysis-for-non-regular-learning-machines.

Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory.
Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2009.

Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In Tom Gedeon, Kok Wai Wong, and Minho Lee (eds.), Neural Information Processing - 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12-15, 2019, Proceedings, Part I, volume 11953 of Lecture Notes in Computer Science, p. 264–274. Springer, 2019. doi: 10.1007/978-3-030-36708-4_22. URL https://doi.org/10.1007/978-3-030-36708-4_22.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Greg Yang. Tensor programs I: neural matrix laws. CoRR, abs/2009.10685, 2020. URL https://arxiv.org/abs/2009.10685.

Greg Yang and Etai Littwin. Tensor programs IIb: Architectural universality of neural tangent kernel training dynamics. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, p. 11762–11772. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21f.html.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models.
In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, p. 12697–12706. PMLR, 2021. URL http://proceedings.mlr.press/v139/zhao21c.html.

Zhanpeng Zhou, Yongyi Yang, Mahito Sugiyama, and Junchi Yan. New evidence of the two-phase learning dynamics of neural networks, 2025. URL https://arxiv.org/abs/2505.13900.

George Kingsley Zipf. The Psychobiology of Language. Houghton-Mifflin, New York, NY, USA, 1935.

A CROSSCODER TRAINING DETAILS

Section 3 presents the mathematical definition of our cross-snapshot crosscoder. However, in practical applications, additional architectural design and training techniques are required to advance the Pareto frontier of sparsity versus reconstruction fidelity. In this section, we detail our selection of the activation function σ(·) and the regularization function Ω(·), and present our training hyperparameters and their results. We also compare crosscoder performance to standard SAEs to evaluate how well crosscoders perform at sparse dictionary learning.

A.1 SELECTION OF ACTIVATION FUNCTION AND REGULARIZATION FUNCTION

The sparsity of natural features is the fundamental hypothesis underlying superposition (Olshausen & Field, 1997; Elhage et al., 2022; Huben et al., 2024). To obtain crosscoder features with optimal sparsity for ideal interpretability and monosemanticity while maintaining reconstruction fidelity, we carefully select the activation function σ(·) and the regularization function Ω(·). Previous SAE studies predominantly use ReLU activation with L1 regularization (Bricken et al., 2023; Huben et al., 2024). However, this configuration produces weak feature activations that are largely noise, compromising both interpretability and sparsity.
We address this by adopting JumpReLU (Rajamanoharan et al., 2024) as the activation function, which eliminates activations below learned thresholds (trained via straight-through estimation (Bengio et al., 2013)).

To prevent features from becoming permanently inactive at certain snapshots, we incorporate decoder norms into the activation decision. Given the pre-activation z(x):

z(x) = \sum_{\theta \in \Theta} W^{\theta}_{\mathrm{enc}} a^{\theta}(x) + b_{\mathrm{enc}} \quad (7)

The i-th feature activation at snapshot θ is defined as:

f^{\theta}_i(x) = z_i(x) \cdot H\left( z_i(x) \cdot \| W^{\theta}_{\mathrm{dec},i} \| - t_i \right) \quad (8)

where H(·) is the Heaviside step function and t_i ∈ R is the JumpReLU threshold for feature i. This design ensures that features with small decoder norms require stronger pre-activations to activate, preventing complete feature death while maintaining sparsity. Although this means truly inactive features retain small positive decoder norms rather than zero values, this architectural choice significantly improves crosscoder performance and enables better feature tracking across snapshots.

For regularization, we employ a combination of tanh and quadratic frequency penalties (Smith et al., 2025): the tanh component provides a superior L0 approximation by reducing penalties on strong activations, while the quadratic frequency penalty suppresses high-frequency features. This yields the following batched sparsity loss:

\omega^{\theta}_i(B) = \frac{1}{|B|} \sum_{x \in B} \tanh\left( f^{\theta}_i(x) \cdot \| W^{\theta}_{\mathrm{dec},i} \| \right)

L_{\mathrm{sparsity}}(B) = \lambda_{\mathrm{sparsity}} \sum_{\theta \in \Theta} \sum_{i=1}^{n_{\mathrm{features}}} \omega^{\theta}_i(B) \cdot \left( 1 + \frac{\omega^{\theta}_i(B)}{\omega_0} \right) \quad (9)

where B ⊂ C × N is an input batch, and ω^θ_i(B) is a differentiable single-batch estimate of the activation frequency of the i-th crosscoder feature at snapshot θ, using tanh as the L0 approximation. A new hyperparameter ω_0 quadratically penalizes feature activation when ω^θ_i(B) ≫ ω_0.
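As an illustration, the crosscoder forward pass and batched sparsity loss above can be sketched in NumPy as follows. The shapes, the dictionary-of-snapshots layout, and all names here are our own illustrative choices; the straight-through gradients for the learned thresholds and the reconstruction loss are omitted.

```python
import numpy as np

def crosscoder_forward(acts, W_enc, b_enc, W_dec, thresholds):
    """Forward pass of a cross-snapshot crosscoder (sketch of Eqs. 7-8).

    acts:       dict snapshot -> (d_model,) activation a^theta(x)
    W_enc:      dict snapshot -> (d_model, n_features) encoder
    W_dec:      dict snapshot -> (n_features, d_model) decoder
    thresholds: (n_features,) learned JumpReLU thresholds t_i
    Returns dict snapshot -> (n_features,) feature activations f^theta(x).
    """
    # Eq. (7): a single shared pre-activation, summed over all snapshots.
    z = b_enc + sum(acts[s] @ W_enc[s] for s in acts)
    feats = {}
    for s in acts:
        dec_norms = np.linalg.norm(W_dec[s], axis=1)   # ||W_dec,i|| per feature
        gate = (z * dec_norms - thresholds) > 0        # Heaviside H(.) in Eq. (8)
        feats[s] = z * gate
    return feats

def sparsity_loss(feats, W_dec, lam, omega0):
    """Batched sparsity loss (Eq. 9); feats[s] has shape (batch, n_features)."""
    loss = 0.0
    for s, f in feats.items():
        dec_norms = np.linalg.norm(W_dec[s], axis=1)
        omega = np.tanh(f * dec_norms).mean(axis=0)    # omega^theta_i(B)
        loss += (omega * (1.0 + omega / omega0)).sum() # quadratic penalty past omega0
    return lam * loss
```

The gating in the forward pass makes the trade-off visible: a feature whose decoder norm shrinks at some snapshot needs a proportionally larger pre-activation to clear its threshold there.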
A.2 SELECTION OF PRE-TRAINING SNAPSHOTS

To balance training cost with the granularity of feature evolution analysis, we train crosscoders using 32 source snapshots from the 154 open-source snapshots in the Pythia suite. Figure 10 shows the training steps of the selected snapshots.

Figure 9: Comparison between crosscoders and per-snapshot SAEs. (a) The explained variance of crosscoders versus SAEs at each snapshot. (b) The L0 norm of crosscoders versus SAEs at each snapshot. (c) The Pareto frontier comparison of crosscoders and SAEs trained on the final snapshot.

Figure 10: Selected snapshots from the Pythia suite.

A.3 DISTRIBUTED TRAINING STRATEGY FOR CROSSCODERS

Crosscoders require parameters that scale with the number of source snapshots n_snapshots, resulting in significantly higher memory and computational requirements than SAEs with equivalent feature counts, particularly when many source snapshots are used. To train crosscoders efficiently, we employ a head parallelism distributed training strategy, a variant of tensor parallelism (Shoeybi et al., 2019). With k processes, where k divides n_snapshots, each process handles encoding and decoding for n_snapshots/k source snapshots. Pre-activations are computed via All-Reduce operations (Sanders et al., 2019). Unlike standard tensor parallelism, our approach processes activations from each snapshot separately on individual processes, reducing I/O overhead for reading activations from disk.
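The head-parallel strategy can be illustrated with a single-process NumPy simulation. Each "rank" below owns a shard of the snapshots and computes its partial pre-activation sum; the All-Reduce that a real implementation would issue (e.g., via a distributed framework) is mimicked here by a plain sum over the partials. All names are illustrative.

```python
import numpy as np

def head_parallel_preactivation(acts, W_enc, b_enc, k):
    """Simulate head-parallel computation of the shared pre-activation z(x).

    Each of the k 'processes' owns n_snapshots/k snapshots and computes its
    partial sum of a^theta(x) @ W_enc^theta; an All-Reduce (here: a plain sum
    over partials) then yields the full pre-activation on every rank.
    """
    snapshots = sorted(acts)
    assert len(snapshots) % k == 0, "k must divide n_snapshots"
    shard = len(snapshots) // k
    partials = []
    for rank in range(k):
        owned = snapshots[rank * shard : (rank + 1) * shard]
        partials.append(sum(acts[s] @ W_enc[s] for s in owned))
    # All-Reduce step: every rank ends up with the same summed pre-activation.
    return b_enc + sum(partials)
```

Because each rank only ever touches its own snapshots' activations, each process reads only its shard of the activation files from disk, which is the I/O advantage noted above.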
A.4 EXPERIMENTS

We train crosscoders on Pythia-160M and Pythia-6.9B snapshots at various scales (6,144 to 98,304 features on Pythia-160M, and 32,768 features on Pythia-6.9B). We primarily focus on middle layers (Layer 6 in Pythia-160M, and Layer 16 in Pythia-6.9B)², but also train crosscoders of 24,576 features on all layers of Pythia-160M for comprehensive analysis.

All crosscoders are trained on 800M tokens from the SlimPajama corpus using the Adam optimizer (Kingma & Ba, 2017) with β values of (0.9, 0.999). To prevent straight-through estimation from causing rapid threshold increases, we apply a reduced learning rate to JumpReLU threshold updates via a multiplier on the global learning rate. JumpReLU thresholds are initialized to 0.1 for all features. The learning rate schedule includes 10% warm-up steps followed by 20% decay steps. We initialize encoders as transposes of their corresponding decoders, with identical initialization matrices across all snapshots. We employ an initialization search to identify optimal decoder norms that minimize loss at initialization. Additional hyperparameters are listed in Table 1.

² By "Layer i", we refer to the activations at the output of the i-th transformer layer.

Table 1: Hyperparameters for crosscoder training

Parameter                          Pythia-160M                Pythia-6.9B
Learning Rate                      5e-5                       1e-5
Batch Size                         2048                       2048
Feature Expansion Ratio            8×, 16×, 32×, 64×, 128×    8×
Sparsity Coefficient (λ_sparsity)  0.3                        0.3
JumpReLU Threshold LR Multiplier   0.1                        0.3

Figure 11: Explained variance versus L0 norm of crosscoders trained on all layers (0-11) of Pythia-160M.

A.5 RESULTS ON OTHER LAYERS

We train additional crosscoders with 24,576 features on all layers (0-11) of Pythia-160M using the same hyperparameters as Layer 6 (Table 1).
Figure 11 shows the explained variance and L0 norm for these crosscoders, which exhibit similar performance but different trade-offs between sparsity and reconstruction quality. To demonstrate feature evolution across all layers, we apply the same visualization strategy from Section 4, plotting decoder norms for 50 randomly selected features per layer (Figures 12 and 13). Most middle layers exhibit the same evolutionary patterns as Layer 6, supporting our findings in Section 4. We would like to note that the majority of features in Layers 0 and 1 exist from initialization, which aligns with the observation that early layers implement more low-level, especially single-token, features (He et al., 2024).

B AUTOMATED SCORING OF COMPLEXITY

We leverage Claude Sonnet 4 for automated complexity score evaluation. For each feature, we select the top 10 activating samples with 100 surrounding tokens around the strongest activating tokens and apply the following prompt:

# Neural Network Feature Analysis Instructions

We're analyzing features in a neural network. Each feature activates on specific words, substrings, or concepts within short documents. Activating words are marked as `<<text, activation>>` where `activation` indicates the strength of activation (higher values = stronger activation). You'll receive documents containing the highest activating tokens and the tokens surrounding them.

## Your task:

### 1.
Summarize the Activation (<20 words)

Examine the marked activations and summarize what the feature detects in one sentence.
- Avoid being overly specific: your explanation should cover most/all activating words
- If all words in a sentence activate, focus on the sentence's concept rather than individual words
- Note relevant patterns in capitalization, punctuation, or formatting
- Prioritize strongly activated tokens
- Keep explanations simple and concise
- Avoid long word lists

Figure 12: Feature decoder norm evolution of Layer 0 to Layer 5 in a 24,576-feature crosscoder trained on Pythia-160M.

Figure 13: Feature decoder norm evolution of Layer 6 to Layer 11 in a 24,576-feature crosscoder trained on Pythia-160M.

### 2.
Assess Feature Complexity (1-5 scale, with decimal precision allowed)

Rate the feature's complexity:
- **5**: Rich feature with diverse contexts unified by an interesting theme
- **4**: High-level semantic structure with potentially dense activation
- **3**: Moderate complexity: phrases, categories, or sentence structures
- **2**: Synonyms or words of the same class
- **1**: Single specific word or token

You may use decimal values (e.g., 3.7) for more precise assessment.

### Output Format

Your output should be in JSON format, with two fields: summarization and complexity. You should directly output the JSON object, without any other text.

C ASSESSING CROSSCODER DECODER NORM

Figure 14: Linear probe training errors versus feature decoder norms of 20 randomly sampled features from a crosscoder with 6,144 features on Pythia-160M.

Section 4 takes advantage of crosscoder decoder norms to study feature evolution. But are crosscoder decoder norms good quantitative indicators of the extent to which features have evolved? Beyond theoretical arguments, we conduct experiments to examine whether crosscoder decoder norms reflect the intensity of features. We use linear probing to test the correlation between a feature's decoder norm and the linear separability of its activations.

We train linear probing classifiers (referred to as probes) separately on the activations of each model snapshot to classify whether each crosscoder feature activates. For each feature i and each snapshot θ, a probe maps model activations to a predicted activation probability p^θ_i(x):

p^{\theta}_i(x) = \mathrm{sigmoid}\left( w^{\theta}_{\mathrm{probe},i} \cdot a^{\theta}(x) + b^{\theta}_{\mathrm{probe},i} \right) \quad (10)

where w^θ_probe,i ∈ R^{d_model} and b^θ_probe,i ∈ R are the weight and bias of the probe for feature i and snapshot θ.
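A minimal sketch of such a probe is given below, trained with plain gradient descent on the binary cross-entropy loss. The actual probes are trained on 100M tokens; the function names and hyperparameters here are our own illustrative choices.

```python
import numpy as np

def train_probe(acts, labels, lr=0.1, steps=500):
    """Train a linear probe p(x) = sigmoid(w . a(x) + b) with BCE loss (Eq. 10).

    acts:   (n, d_model) model activations a^theta(x)
    labels: (n,) binary targets, 1 iff the crosscoder feature activates
    Returns (w, b, mean BCE) -- the final BCE is the linear-separability measure.
    """
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid predictions
        grad = p - labels                           # d(BCE)/d(logit)
        w -= lr * acts.T @ grad / n
        b -= lr * grad.mean()
    p = np.clip(1.0 / (1.0 + np.exp(-(acts @ w + b))), 1e-7, 1 - 1e-7)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()
    return w, b, bce
```

A feature whose activations are linearly separable in the model's activation space yields a low probe loss; the correlation between this loss and the feature's decoder norm is what the experiment measures.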
Each probe is trained to minimize the binary cross-entropy loss −[y_i log p^θ_i(x) + (1 − y_i) log(1 − p^θ_i(x))], given the label y_i = sgn(f_i(x)). The training loss of each probe is thus a direct measure of the linear separability of the feature activations. We train probes on 100M tokens for each feature in a 6,144-feature crosscoder on Pythia-160M.

The probe errors of each feature show a mean Pearson correlation of −0.867 with the corresponding decoder norms, with a standard deviation of 0.153. Example probe errors versus feature decoder norms for 20 randomly sampled features are shown in Figure 14. This strong negative linear relationship between probe errors and crosscoder decoder norms demonstrates the effectiveness of crosscoder decoder norms as indicators of feature evolution.

D RULES FOR FINDING TYPICAL CROSS-SNAPSHOT FEATURES

We define the rules used to identify previous token features, induction features, and context-sensitive features as follows:

1. Previous Token Features: We collect the directly preceding tokens of all activating tokens in the top 20 activating samples and assess their consistency. Token consistency is defined as the proportion of the largest single group when all tokens are normalized by stemming (removing leading and trailing spaces and ignoring case). To exclude bigram or multigram features, we also evaluate the consistency of the activating tokens themselves. A feature is classified as a previous token feature if it exhibits high consistency in previous tokens (above 0.8) and low consistency in activating tokens (below 0.3).

2. Induction Features: Induction features should activate on the second [A] in patterns [A][B]...[A][B]. For each activating token [A], we collect its following token [B] and search for previous occurrences of the bigram [A][B].
To distinguish induction features from simpler features that merely activate on any bigram [A][B], we require that the feature does not activate on the first appearance of [A][B]. A feature exhibiting this behavior in at least 20 instances within the top 20 activating samples is classified as an induction feature.

3. Context-sensitive Features: We identify context-sensitive features using a simpler rule based on activation density within specific contexts. Context-sensitive features should activate frequently in highly specific contexts, so we require features to have high activation counts within the top 20 activating samples (exceeding 4,000 activations). To exclude features that activate ubiquitously (such as positional or bias features), we filter out features with excessive total activations, requiring fewer than 2M total activations across 100M analyzed tokens.

E DETAILS IN DOWNSTREAM TASK ATTRIBUTION

E.1 FORMALIZATION OF IG ATTRIBUTION SCORE

To more accurately estimate the causal effect of features, we employ the integrated gradient (IG) version of the attribution score (Sundararajan et al., 2017; Hanna et al., 2024; Marks et al., 2025). IG attribution computes gradients along an interpolation path between baseline and target activations, providing more robust estimates than single-point gradients.
For the standard attribution score without clean/corrupted input pairs, we compute the IG version as:

\mathrm{attr}^{\theta}_{\mathrm{ig},i}(x) = \frac{1}{N} \sum_{\alpha} f_i(x) \cdot \frac{\partial m(a^{\theta}(x))}{\partial (\alpha f_i(x))} \quad (11)

Figure 15: Crosscoder feature attribution on the induction task, using a crosscoder with 98,304 features trained on Pythia-160M.

Figure 16: Crosscoder feature attribution on the Simple variant of the SVA task, using a crosscoder with 98,304 features trained on Pythia-160M.
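As a toy illustration of the single-input IG score in Eq. (11), the sketch below treats the metric gradient as a given scalar function. In practice m is a downstream-task metric of the model's activations and the gradients come from backpropagation; the names here are illustrative.

```python
import numpy as np

def ig_attribution(f, metric_grad, n_steps=10):
    """Single-input integrated-gradient attribution (sketch of Eq. 11).

    f:           scalar feature activation f_i(x)
    metric_grad: callable giving d m / d f_i at a given feature value
    Averages the gradient at interpolated activations alpha * f for
    alpha in {0, 1/N, ..., (N-1)/N}, then scales by the activation itself.
    """
    alphas = np.arange(n_steps) / n_steps
    grads = np.array([metric_grad(a * f) for a in alphas])
    return f * grads.mean()
```

For a quadratic metric m(f) = f², whose gradient is 2f, the score approaches the completeness target m(f) − m(0) = f² as the number of interpolation steps grows, which is the usual sanity check for an IG implementation.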
For attribution patching with clean/corrupted input pairs, the IG version is:

\mathrm{attr}^{\theta}_{\mathrm{ig},i}(x, \tilde{x}) = \frac{1}{N} \sum_{\alpha} \left[ f_i(x) - f_i(\tilde{x}) \right] \cdot \frac{\partial m(a^{\theta}(x))}{\partial \left( \alpha f_i(x) + (1 - \alpha) f_i(\tilde{x}) \right)} \quad (12)

where α ∈ {0, 1/N, ..., (N−1)/N} linearly interpolates between the baseline and target feature activations. Consistent with Marks et al. (2025), we use N = 10 interpolation steps for the IG attribution score.

E.2 INDUCTION TASK

Transformers exhibit in-context learning capabilities (Brown et al., 2020; Zhao et al., 2021; Gao et al., 2021) through induction heads (Olsson et al., 2022): circuits that look back over the sequence for previous instances of the current token (A), identify the subsequent token (B), and predict the same completion will occur again (forming sequences [A][B] ... [A] → [B]). To evaluate models' induction abilities and trace them at the feature level, we construct samples of random tokens in which identical patterns appear in the middle and at the end of sequences. We test whether models can correctly copy previous text as the next token. For precise feature-level analysis, we restrict next tokens to single capital letters with leading spaces. Since the induction task lacks natural corrupted counterparts, we use the log probability of the correct token as our evaluation metric. Results are shown in Figure 15.

E.3 SUBJECT-VERB AGREEMENT TASKS

Subject-verb agreement tasks evaluate whether models can predict verbs in the appropriate grammatical form based on their subjects.
Figure 17: Crosscoder feature attribution on the Across-RC variant of the SVA task, using a crosscoder with 98,304 features trained on Pythia-160M.

We test four variants: (1) Simple: the verb directly follows the subject, e.g., "The parents are"; (2) Across-RC: a relative clause intervenes between subject and verb, e.g., "The athlete that the managers like does"; (3) Within-RC: both subject and verb appear within the relative clause, e.g., "The athlete that the managers like"; (4) Across-P: a prepositional
phrase separates the subject and verb, e.g., "The secretaries near the cars have". We use data provided by Marks et al. (2025), sampling 1,000 examples from each variant. We use the verb in the wrong form as the counterpart in attribution patching. Results are shown in Figures 7, 16, 17, and 18.

Figure 18: Crosscoder feature attribution on the Within-RC variant of the SVA task, using a crosscoder with 98,304 features trained on Pythia-160M.

E.4 INDIRECT OBJECT IDENTIFICATION TASK

Indirect Object Identification (IOI) evaluates whether models can correctly predict the indirect object (IO) based on repeated occurrence of subjects (Wang et al., 2023). In IOI tasks, sentences such as "When Mary and John went to the store, John gave a drink to" should be completed with "Mary." We generate IOI samples following the same strategy as Wang et al. (2023), using template sentences with random person names. We use the subject name as the corrupted counterpart in attribution patching. Results are shown in Figure 19.

Note that while ablating top features significantly degrades performance, we cannot recover the original metric using only these top features. This likely occurs because IOI is a complex task requiring multiple feature interactions.
The features that distinguish between indirect objects and subjects represent only part of the full computational requirements, making isolated feature sets insufficient for complete task execution.

E.5 RANDOM BASELINE OF ABLATION

In Figure 20, we add baselines to Figure 7 by ablating/preserving k random features. The results show that the features selected by attribution patching are far more effective.

F CROSSCODER FEATURE SPLITTING

Feature splitting is a well-known phenomenon in SAEs where increasing the dictionary size n_features causes features from smaller SAEs to fragment into multiple distinct features. Under feature splitting, a single concept may not be represented by one feature, but rather by multiple specialized features that activate on the same concept in different contexts.

Figure 19: Crosscoder feature attribution on the IOI task, using a crosscoder with 98,304 features trained on Pythia-160M.
Figure 20: Crosscoder feature attribution on the Across-P variant of the SVA task, with random baselines.

We observe a similar feature splitting phenomenon in crosscoders, but across the temporal dimension. We find that, in rare cases, features with similar semantics and direction in different snapshots are not encoded in the same latent of the crosscoder feature space, but instead result in separate features.

For example, we identify features 66688, 53542, and 42307, which all activate on long sequences containing repeated text patterns, contributing to the induction task (Figures 21, 22, and 23). Each feature exists only during non-overlapping continuous periods across snapshots, with a drastic emergence and disappearance.

To examine whether feature splitting across snapshots relates to dictionary size, we search for feature decoder vectors (across all snapshots) with cosine similarity above 0.7, a high threshold in such a high-dimensional space, in crosscoders with varying n_features. Across dictionary sizes ranging from 6,144 to 98,304 features, we observe that while each snapshot after step 2,000 activates almost exactly one feature representing this concept, the total number of distinct crosscoder features increases with dictionary size (Figure 24). This confirms that larger dictionaries lead to temporal feature splitting, where a single underlying concept splits into multiple features active at different training stages.

We also observe that temporal feature splitting predominantly occurs among densely activating features, i.e., features that activate frequently across many contexts.
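The similarity search described above can be sketched as follows. The per-snapshot bookkeeping (selecting each feature's decoder vector at the relevant snapshot) is omitted, and names are illustrative.

```python
import numpy as np

def find_split_features(decoder, threshold=0.7):
    """Find decoder directions that likely encode the same concept.

    decoder: (n_features, d_model) array, one decoder vector per feature.
    Returns index pairs whose cosine similarity exceeds threshold,
    i.e., candidate temporally split features.
    """
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sims = unit @ unit.T                      # pairwise cosine similarities
    pairs = []
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            if sims[i, j] > threshold:
                pairs.append((i, j))
    return pairs
```

In high-dimensional activation spaces, random directions are nearly orthogonal, which is why a 0.7 cosine threshold is a strong match criterion.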
This crosscoder feature splitting likely arises from subtly different activation patterns that emerge over training. These findings suggest that language model features may evolve by refining their activation patterns, leading to more specialized representations that warrant separate feature assignments at different training stages.

G GENERALIZABILITY STUDIES

In this section, we present additional experiments that train and analyze crosscoders on different base models, initializations, and datasets, and with different numbers of selected snapshots, to examine the generalizability of our methods.

Figure 21: Top activation of Feature 66688 in the 98,304-feature crosscoder on Pythia-160M.

Figure 22: Top activation of Feature 53542 in the 98,304-feature crosscoder on Pythia-160M.

G.1 GENERALIZATION ACROSS MODEL FAMILIES/SEEDS

We conduct experiments on two additional Pythia 160M models trained with different random seeds (Pythia 160M Seed1 and Seed2) (van der Wal et al., 2025) and on the Alias run of Stanford CRFM's GPT-2 model.³ All of these models have an activation space of 768 dimensions. We train crosscoders of 24,576 features (32x) on Layer 6 of each model and inspect their feature evolution.

Different initializations (Pythia 160M variants). We observe highly consistent results across differently initialized versions of Pythia 160M. Decoder norms of initialization features follow identical trajectories, and emergent features begin rising around step 1,000, exhibiting a clear two-phase pattern with the same transition point (Figures 26 and 27). Attribution experiments confirm that we successfully identify sparse crosscoder features with significant contributions to downstream tasks (Figures 29 and 30).

Different base model (GPT-2). For Stanford CRFM's GPT-2 model, we observe a clear two-phase pattern, with emergent features rising around step 1,000.
Upon manual inspection of these features, we still find that initialization features exhibit token-level information, while more complex patterns emerge in emergent features.

However, we find a notable difference: due to absolute positional embeddings, initialization feature norms do not peak at the beginning but rather at later steps (while remaining above threshold at the beginning). This difference arises because the absolute positional embeddings initially have large norms (larger than the word embeddings) and dominate the activation norm. Since we normalize all activation sources to the same average activation norm √d_model, this large positional contribution substantially reduces the observed norms of initialization features (Figure 28).

To sum up, the statistical-to-feature-learning phase transition persists across models, despite small differences. Attribution experiments (Figure 31) again confirm the effectiveness of sparse crosscoder features, with generally similar feature evolution patterns for specific downstream tasks.

Other models with accessible checkpoints. We also investigated other model families, including BLOOM (Scao et al., 2022) and OLMo (OLMo et al., 2024). Unfortunately, these models either lack sufficient intermediate checkpoints for fine-grained analysis of pretraining dynamics (BLOOM), or lack sufficiently early checkpoints to capture the two-stage transition (BLOOM and OLMo).

Figure 23: Top activation of Feature 42307 in the 98,304-feature crosscoder on Pythia-160M.

G.2 GENERALIZABILITY ACROSS DATASETS

We conduct experiments on the original Pythia 160M snapshots, but train and analyze 24,576-feature crosscoders on the FineWeb-Edu dataset (Penedo et al., 2024), which consists of educational webpages. Our results (Figure 32) show nearly identical patterns to those observed with Pythia 160M crosscoders trained on SlimPajama, exhibiting very similar feature evolution dynamics.
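The normalization mentioned above (every activation source rescaled so its average L2 norm is √d_model before crosscoder training) can be sketched as follows. The function name is illustrative, and the exact estimator used in the paper may differ.

```python
import math
import numpy as np

def normalize_to_sqrt_dmodel(acts):
    """Rescale activations so the average L2 norm over tokens equals
    sqrt(d_model). When a large positional-embedding component inflates
    the raw norms, this rescaling shrinks every other component too,
    including initialization-feature directions.

    acts: (n_tokens, d_model) array of residual-stream activations.
    """
    d_model = acts.shape[-1]
    mean_norm = np.linalg.norm(acts, axis=-1).mean()
    return acts * (math.sqrt(d_model) / mean_norm)

# Activations with an arbitrary scale get mapped to mean norm sqrt(768) ~ 27.7.
acts = np.random.default_rng(0).standard_normal((1000, 768)) * 3.0
normed = normalize_to_sqrt_dmodel(acts)
```

This makes clear why dominant positional norms depress the observed norms of initialization features: the rescaling factor is shared across all components of the activation vector.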
G.3 SENSITIVITY TO SNAPSHOT SELECTION

All of the above experiments train crosscoders on 32 pre-training snapshots of the language model. To examine whether snapshot selection affects the trained crosscoder and the captured feature evolution, we further train a 24,576-feature crosscoder on 16 pre-training snapshots of Pythia 160M, downsampled from the original set by selecting every other snapshot. The result (Figure 33) shows nearly identical feature evolution trends, with only a reduction in temporal resolution.

H MORE EXAMPLES OF FEATURES

We additionally list some emergent features and show their top activating samples and decoder norm evolution in Figure 34.

I THE USE OF LARGE LANGUAGE MODELS

The use of Large Language Models as an assistive tool in this paper is limited to the following two aspects:
1. The automated generation of feature complexity scores, as detailed in Section 4.2 and Appendix B.
2. Grammar correction and stylistic refinement of the manuscript.

³ Obtained from https://huggingface.co/stanford-crfm/alias-gpt2-small-x21.

Figure 24: Crosscoder features similar to Feature 53542 of the 98,304-feature crosscoder, matched across crosscoders with dictionary sizes from 6,144 to 98,304 features (3, 4, 6, 7, and 8 matching features, respectively). The number of split features increases with dictionary size.
Figure 25: UMAP visualization of the features shown in Figure 24.

Figure 26: Overview of cross-snapshot feature decoder norm evolution in Pythia-160M Seed1.

Figure 27: Overview of cross-snapshot feature decoder norm evolution in Pythia-160M Seed2.

Figure 28: Overview of cross-snapshot feature decoder norm evolution in the Alias run of Stanford CRFM GPT-2.

Figure 29: Crosscoder feature attribution on the Across-P variant of the SVA task, using a crosscoder with 24,576 features trained on Pythia-160M Seed1. (a) Attribution scores of the top features; (b) ablating the top-k features; (c) preserving only the top-k features.
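The quantities plotted in the decoder-norm-evolution figures (per-feature decoder norm trajectories, rescaled per feature, plus each feature's peak snapshot) can be computed as in this sketch; the decoder layout is again an assumed (n_features, n_snapshots, d_model) tensor.

```python
import numpy as np

def decoder_norm_evolution(decoder):
    """decoder: (n_features, n_snapshots, d_model) crosscoder decoder
    weights (layout assumed). Returns each feature's decoder-norm
    trajectory rescaled to a maximum of 1, and its peak snapshot index.
    """
    norms = np.linalg.norm(decoder, axis=-1)               # (f, s)
    normalized = norms / norms.max(axis=1, keepdims=True)  # per-feature rescale
    peak_snapshot = norms.argmax(axis=1)                   # when the feature peaks
    return normalized, peak_snapshot

# Toy example: feature 0 peaks at the first snapshot (an "initialization"
# feature), feature 1 peaks late (an "emergent" feature).
dec = np.zeros((2, 4, 8))
dec[0, 0] = 2.0
dec[0, 1:] = 0.5
dec[1, 3] = 3.0
dec[1, :3] = 0.5
traj, peak = decoder_norm_evolution(dec)
```

Sorting features by `peak_snapshot` and stacking the rows of `traj` reproduces the heatmap-style overview in these figures.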
Figure 30: Crosscoder feature attribution on the Across-P variant of the SVA task, using a crosscoder with 24,576 features trained on Pythia-160M Seed2.

Figure 31: Crosscoder feature attribution on the Across-P variant of the SVA task, using a crosscoder with 24,576 features trained on the Alias run of Stanford CRFM GPT-2.
Figure 32: Overview of cross-snapshot feature decoder norm evolution in Pythia-160M, trained on the FineWeb-Edu dataset.

Figure 33: Overview of cross-snapshot feature decoder norm evolution in Pythia-160M, trained on 16 snapshots.

Figure 34: More features in the 32,768-feature crosscoder on Pythia-6.9B, with their top activating samples: "Personal Opinion" (Feature 26654), "Emphasis Sentence" (Feature 29335), "Say Register" (Feature 20887), "Cannot be" (Feature 28830), "ID in URL" (Feature 2907), and "Serial Numbers" (Feature 12295).