
Paper deep dive

Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3

Callum McDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan, Neel Nanda

Year: 2025 · Venue: Google DeepMind · Area: Mechanistic Interp. · Type: Tool · Embeddings: 53

Models: Gemma 3 12B, Gemma 3 1B, Gemma 3 270M, Gemma 3 27B, Gemma 3 4B

Abstract

Announcing Gemma Scope 2, a comprehensive, open suite of interpretability tools for the entire Gemma 3 family to accelerate AI safety research.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%

Last extracted: 3/11/2026, 12:33:19 AM

Summary

Gemma Scope 2 is an open-source suite of interpretability tools, including JumpReLU sparse autoencoders (SAEs) and transcoders, designed for the Gemma 3 model family (270M to 27B parameters). It aims to facilitate AI safety research by enabling circuit-level analysis of complex, multi-layer behaviors through cross-layer transcoders and weakly causal crosscoders.

Entities (5)

Callum McDougall · researcher · 100%
Cross-layer transcoder · interpretability-method · 100%
Gemma 3 · language-model-family · 100%
Gemma Scope 2 · software-suite · 100%
JumpReLU SAE · interpretability-method · 100%

Relation Signals (3)

Callum McDougall authored Gemma Scope 2

confidence 100% · Gemma Scope 2 - Technical Paper Callum McDougall

Gemma Scope 2 provides tools for Gemma 3

confidence 100% · we train and release an open suite of JumpReLU sparse autoencoders (SAEs) and skip-transcoders on all layers and sub-layers of Gemma 3 models

JumpReLU SAE used for AI safety

confidence 90% · to accelerate AI safety research

Cypher Suggestions (2)

List researchers associated with the Gemma Scope 2 project · confidence 95% · unvalidated

MATCH (r:Researcher)-[:AUTHORED]->(p:Paper {title: 'Gemma Scope 2'}) RETURN r.name

Find all interpretability tools released for a specific model family · confidence 90% · unvalidated

MATCH (t:Tool)-[:PROVIDES_TOOLS_FOR]->(m:Model {name: 'Gemma 3'}) RETURN t.name, t.type

Full Text

52,797 characters extracted from source content.


2025-09-16 · Gemma Scope 2 - Technical Paper

Callum McDougall¹, Arthur Conmy¹, János Kramár¹, Tom Lieberum¹, Senthooran Rajamanoharan¹ and Neel Nanda¹ · ¹Google

In response to a surge of recent work using SAEs to study model biology and to analyze circuits that explain complex, multi-step behaviors, we train and release an open suite of JumpReLU sparse autoencoders (SAEs) and skip-transcoders on all layers and sub-layers of Gemma 3 models at 270M, 1B, 4B, 12B, and 27B, as well as a set of multi-layer models to enable circuit-level analyses that span across layers. In this way, we hope not only to enable interpretability research on the Gemma 3 model series (more advanced than the previous Gemma 2 series) but also to enable analysis of multi-layer representations and circuits, which allows the study of more complex and potentially harmful behaviors. We are encouraged by the quality of open-source research enabled by our prior release, and aim to further accelerate this work by releasing updated weights, evaluations, and tooling.

Keywords: gemma scope, sparse autoencoders, transcoders

1. Introduction

A growing body of work suggests many internal activations of language models can be well-approximated by sparse, linear combinations of dictionary vectors (Elhage et al., 2022; Gurnee et al., 2023; Mikolov et al., 2013; Nanda et al., 2023a; Olah et al., 2020; Park et al., 2023). Sparse autoencoders (SAEs) provide an unsupervised route to discover such directions and have repeatedly yielded causally relevant, interpretable latents (Bricken et al., 2023; Cunningham et al., 2023; Gao et al., 2024; Marks et al., 2024; Templeton et al., 2024).
Realizing this promise requires maturing the methodology, validating reliability, and scaling training and evaluation to modern models, so SAEs can support applications like detecting hallucinations, debugging unexpected behaviors, and increasing reliability and safety (Hubinger, 2022; Nanda, 2022; Olah, 2021).

Despite rapid progress, training comprehensive, high-quality SAE suites remains costly compared to techniques such as steering vectors (Li et al., 2023; Turner et al., 2024) or probing (Belinkov, 2022). Much prior work has focused on single-layer settings (Engels et al., 2024; Gao et al., 2024; Templeton et al., 2024), leaving open how best to scale to multi-layer analyses and circuit-style work.

Recent work from Anthropic on cross-layer transcoders (CLTs) highlights the value of modeling interactions among latents across layers, rather than treating latents as isolated, single-layer objects. Cross-layer approaches can synthesize information flowing through multiple transformer blocks, enabling new forms of understanding and control for complex behaviors such as jailbreaks and unfaithful chain-of-thought reasoning (Lindsey et al., 2025). Together with transcoders (Dunefsky et al., 2024) and multi-layer SAE models, this points toward circuit-level analyses that capture multi-step computations spanning several layers and modules.

To better enable this kind of analysis, we have trained and released the weights of Gemma Scope 2: an open suite of models which builds on our previous Gemma Scope release (Lieberum et al., 2024). This new release includes SAEs and transcoders for every layer and sublayer of Gemma 3 270M, 1B, 4B, 12B, and 27B. We release these weights under a permissive CC-BY-4.0 license on HuggingFace to enable and accelerate research by other members of the research community.

© 2025 Google. All rights reserved.

Engineering challenges for this work were greater than for our previous Gemma Scope release, owing not only to the greater scope of single-layer models in the release, but also to the added difficulty of training and evaluating multi-layer models. Increasing the number of SAE layers directly impacts compute and memory: input batch sizes scale as O(layers), naive FLOPs scale as O(layers²) (since the decoder is dense over every pair of layers), and parameter counts also scale as O(layers²). We mitigate the computational overhead by employing a variant of Leo Gao's sparse kernels so that effective FLOPs scale approximately linearly, O(layers), and we address the parameter scaling via extreme model sharding (splitting decoder weights across many devices) while using minimal data sharding to avoid costly all-reduces. This setup allows comprehensive multi-layer training and evaluation while maintaining practical throughput and stability.

In Section 2 we provide background on SAEs and transcoders, covering the context relevant for this updated release. Section 3 contains details of our training procedure, hyperparameters and computational infrastructure. We run extensive evaluations on the trained SAEs in Section 4, and provide a list of open problems that Gemma Scope 2 could help tackle in Section 5.

2. Preliminaries

2.1. Sparse autoencoders

Given activations x ∈ ℝ^n from a language model, a sparse autoencoder (SAE) decomposes and reconstructs the activations using a pair of encoder and decoder functions (f, x̂) defined by:

f(x) := σ(W_enc x + b_enc)    (1)
x̂(f) := W_dec f + b_dec.    (2)

These functions are trained to map x̂(f(x)) back to x, making them an autoencoder. Thus, f(x) ∈ ℝ^M is a set of linear weights that specify how to combine the M ≫ n columns of W_dec to reproduce x. The columns of W_dec, which we denote by d_i for i = 1...M, represent the dictionary of directions into which the SAE decomposes x.
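A minimal sketch of Eqs. (1) and (2), with illustrative shapes and a plain ReLU standing in for the activation σ (the released SAEs use the JumpReLU activation described later in this section):

```python
import numpy as np

# Minimal SAE sketch: f(x) = sigma(W_enc x + b_enc), x_hat(f) = W_dec f + b_dec.
# Shapes and the ReLU stand-in are illustrative, not the released configuration.
rng = np.random.default_rng(0)
n, M = 8, 32          # model width n, dictionary size M (M >> n in practice)

W_enc = rng.normal(size=(M, n)) * 0.1
b_enc = np.zeros(M)
W_dec = rng.normal(size=(n, M)) * 0.1
b_dec = np.zeros(n)

def encode(x):
    # Non-negative coefficients over the dictionary columns d_i of W_dec.
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    # Reconstruction as a linear combination of dictionary directions.
    return W_dec @ f + b_dec

x = rng.normal(size=n)
f = encode(x)
x_hat = decode(f)
assert f.shape == (M,) and x_hat.shape == (n,)
assert np.all(f >= 0.0)   # non-negativity from the ReLU stand-in
```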
We will refer to these learned directions as latents, to disambiguate between learned 'features' and the conceptual features which are hypothesized to comprise the language model's representation vectors.

The decomposition f(x) is made non-negative and sparse through the choice of activation function σ and appropriate regularization, such that f(x) typically has far fewer than n non-zero entries. Initial work (Bricken et al., 2023; Cunningham et al., 2023) used a ReLU activation function to enforce non-negativity, and an L1 penalty on the decomposition f(x) to encourage sparsity. TopK SAEs (Gao et al., 2024) enforce sparsity by zeroing all but the top K entries of f(x), whereas JumpReLU SAEs (Rajamanoharan et al., 2024b) enforce sparsity by zeroing out all entries of f(x) below a positive threshold. Both TopK and JumpReLU SAEs allow for greater separation between the tasks of determining which latents are active and estimating their magnitudes.

2.2. Transcoders

Transcoders are closely related to SAEs but target a different objective: rather than sparsely reconstructing their inputs, they are trained to sparsely reconstruct the computation of an MLP sublayer. Concretely, a transcoder takes as input the pre-MLP residual stream (just after the pre-MLP RMSNorm) and learns to approximate the MLP's output. This makes transcoders particularly useful for circuit analysis: if we freeze (or otherwise control) attention patterns, the direct connections between two transcoder latents become linear, so both upstream attributions to a latent and downstream effects from that latent can be analyzed with far fewer confounders.

Formally, letting x denote the pre-MLP residual and y_MLP(x) the MLP output, a standard transcoder has encoder and decoder

f_TC(x) := σ(W_enc x + b_enc)
ŷ_TC(f) := W_dec f + b_dec,

and is trained to minimize the reconstruction loss

L_TC := ‖y_MLP(x) − ŷ_TC(f_TC(x))‖²₂.
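The transcoder objective differs from the SAE only in its target; a minimal sketch with hypothetical shapes and a ReLU stand-in for the trained activation:

```python
import numpy as np

# Transcoder sketch: same encoder/decoder form as an SAE, but trained to
# reconstruct the MLP output y_MLP(x) from the pre-MLP residual x.
# Shapes and the ReLU stand-in are illustrative.
rng = np.random.default_rng(0)
n, M = 8, 32
W_enc, b_enc = rng.normal(size=(M, n)) * 0.1, np.zeros(M)
W_dec, b_dec = rng.normal(size=(n, M)) * 0.1, np.zeros(n)

def transcoder_loss(x, y_mlp):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # f_TC(x)
    y_hat = W_dec @ f + b_dec                # y_hat_TC(f)
    return np.sum((y_mlp - y_hat) ** 2)      # L_TC = ||y_MLP - y_hat||_2^2

x = rng.normal(size=n)
y_mlp = rng.normal(size=n)                   # stand-in for the MLP's output
assert transcoder_loss(x, y_mlp) >= 0.0
```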
Skip transcoders. Despite their nonlinearity, it has been theorized that MLP sublayers exhibit some degree of linear behavior (Dunefsky et al., 2024). To capture such structure explicitly, we follow the approach in the aforementioned work and train skip transcoders that include an affine skip connection from the input directly to the output:

ŷ_skip(f, x) := W_dec f + b_dec + W_skip x.

Another motivation for this choice comes from the phenomenon of cross-layer superposition, as described in e.g. Anthropic's circuit tracing work (Lindsey et al., 2024). This term describes when a single feature is distributed over latents in several layers, so training SAEs on each layer independently can give an incomplete picture. In such cases, asking a transcoder to model this component as a learned linear map W_skip is more faithful and leads to cleaner attributions: the decoder W_dec focuses on genuinely new or nonlinear structure, while the skip term captures direct linear carry-through of latents, such as rotations or other affine mappings.

2.3. JumpReLU SAEs

As in the previous release, we focus heavily on JumpReLU SAEs, as they have been shown to be a slight Pareto improvement over other approaches and have additional beneficial properties for training, which will be discussed later in this section.

JumpReLU activation. The JumpReLU activation uses a shifted Heaviside step function as a gating mechanism together with a conventional ReLU:

σ(z) = JumpReLU_θ(z) := z ⊙ H(z − θ).    (3)

Here, θ > 0 is the JumpReLU's vector-valued learnable threshold parameter, ⊙ denotes elementwise multiplication, and H is the Heaviside step function, which is 1 if its input is positive and 0 otherwise. Intuitively, the JumpReLU leaves the pre-activations unchanged above the threshold, but sets them to zero below the threshold, with a different learned threshold per latent.
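The JumpReLU of Eq. (3) can be sketched in a few lines (illustrative values; each latent has its own threshold):

```python
import numpy as np

# Sketch of Eq. (3): sigma(z) = z * H(z - theta), with a separate learned
# threshold theta_i > 0 per latent.
def jumprelu(z, theta):
    return z * (z > theta)   # Heaviside gate: pass z above threshold, else 0

z = np.array([-1.0, 0.02, 0.5, 2.0])
theta = np.array([0.1, 0.1, 0.1, 1.0])
out = jumprelu(z, theta)
# Entries at or below their threshold are zeroed; the rest pass unchanged.
assert np.allclose(out, [0.0, 0.0, 0.5, 2.0])
```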
Loss function. As loss function we use a squared-error reconstruction loss, and directly regularize the number of active (non-zero) latents using an L0 penalty:

L := ‖x − x̂(f(x))‖²₂ + λ‖f(x)‖₀,    (4)

where λ is the sparsity penalty coefficient. Since the L0 penalty and JumpReLU activation function are piecewise constant with respect to the threshold parameters θ, we use straight-through estimators (STEs) to train θ, following the approach described in Rajamanoharan et al. (2024b). This introduces an additional hyperparameter, the kernel density estimator bandwidth ε, which controls the quality of the gradient estimates used to train the threshold parameters θ.

Quadratic L0 penalty. To target a specific expected sparsity, we also consider replacing the linear L0 term with a quadratic penalty around a target number of active latents L*₀:

L_quad := ‖x − x̂(f(x))‖²₂ + (λ / (2 L*₀)) (‖f(x)‖₀ − L*₀)².    (5)

The 1/(2 L*₀) factor scales gradients so that, when ‖f(x)‖₀ ≈ 2 L*₀, the magnitude of the sparsity gradient roughly matches that of the linear JumpReLU objective (Eq. (4)) at the same effective sparsity. This stabilizes training around the target L*₀ while providing a smooth force toward the desired activation frequency.

Direct frequency penalization. One other advantage of JumpReLU SAEs is that we can directly target high-density latents by using their frequency in our sparsity penalty. We do this using the STE approximation for L0, since the frequency of a given latent is simply its average L0 across a batch of data. This method was described in Rajamanoharan et al. (2024), but we modify it slightly for the models trained in this release: rather than replacing the sparsity penalty with one directly targeting frequency, we use the quadratic L0 penalty as our primary sparsity penalty but add a secondary penalty which specifically targets high-frequency latents.
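The quadratic penalty of Eq. (5) can be sketched numerically (illustrative values; in training the L0 term is replaced by the STE surrogate so it has usable gradients):

```python
import numpy as np

# Sketch of Eq. (5):
#   L = ||x - x_hat||^2 + (lam / (2 * l0_star)) * (||f||_0 - l0_star)^2
def quadratic_l0_loss(x, x_hat, f, lam, l0_star):
    recon = np.sum((x - x_hat) ** 2)
    l0 = np.count_nonzero(f)            # hard L0; STE surrogate in training
    return recon + lam / (2.0 * l0_star) * (l0 - l0_star) ** 2

x = np.array([1.0, 2.0])
x_hat = np.array([1.0, 1.0])                 # squared error = 1.0
f = np.array([0.0, 3.0, 0.5, 0.0, 1.0])      # L0 = 3
loss = quadratic_l0_loss(x, x_hat, f, lam=0.1, l0_star=2)
# 1.0 + 0.1/(2*2) * (3-2)^2 = 1.025
assert np.isclose(loss, 1.025)
```

Note that the gradient of the penalty with respect to the L0 term is λ(L0 − L*₀)/L*₀, which equals λ at L0 = 2 L*₀, matching the linear objective's sparsity gradient as stated above.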
2.4. End-to-End SAEs

After training our JumpReLU SAEs with MSE as our reconstruction loss, we finetune a select few using the end-to-end finetuning method introduced in Braun et al. (2024) and further refined in Karvonen (2025). These methods propagate gradients through the base model during a short finetuning phase, with the goal of learning latents which are functionally important for the model's predictions rather than just for reconstructing activations. Concretely, we finetune our SAEs and transcoders by optimizing the following finetuning objective:

L_finetune := (MSE + α β KL(p(x), p(x̂))) / (1 + β),    (6)

where MSE denotes the SAE reconstruction loss, KL(p(x), p(x̂)) is the KL divergence between the base model's distribution p(x) and the distribution with SAE reconstructions injected into the model's forward pass p(x̂), and β is a user-defined hyperparameter (e.g. if β = 0 this reduces to regular MSE training). Following the general motivation of KL-regularized end-to-end training in Braun et al. (2024) and Karvonen (2025), we use a dynamically adjusted scaling factor

α := MSE / (KL + 10⁻⁸),    (7)

which is treated as a constant with respect to gradients. This normalization ensures that β can be interpreted as the intended relative weight of the KL penalty compared to the reconstruction error, independent of their absolute magnitudes. This stabilizes training and simplifies hyperparameter selection across different layers, widths, and sparsity targets.

2.5. Instruction-tuned (IT) SAEs

For instruction-tuned models, we depart from the pretraining (PT) setup in two ways. First, rather than sampling from the same pretraining distribution, we construct training data from actual model rollouts (specifically, we take open-source datasets of user prompts and generate responses from the corresponding Gemma models). Second, we do not train IT SAEs from scratch: we initialize from the corresponding PT SAEs and finetune on the rollout-derived datasets.
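A numeric sketch of the end-to-end objective in Eqs. (6) and (7), with hypothetical MSE/KL values; since α is computed from the MSE, the loss value itself stays at MSE scale, and β only rebalances the gradients:

```python
import numpy as np

# Sketch of Eqs. (6)-(7): L = (MSE + alpha*beta*KL) / (1 + beta),
# with alpha = MSE / (KL + 1e-8). In a real implementation alpha is
# stop-gradded (a constant w.r.t. gradients), so beta controls only the
# relative gradient weight of the KL term.
def finetune_loss(mse, kl, beta):
    alpha = mse / (kl + 1e-8)      # would be wrapped in stop_gradient
    return (mse + alpha * beta * kl) / (1.0 + beta)

# beta = 0 recovers plain MSE training.
assert np.isclose(finetune_loss(0.5, 3.0, beta=0.0), 0.5)
# For beta > 0, alpha normalizes the KL term to the scale of the MSE term,
# so the loss value remains ~MSE while gradients are reweighted.
assert np.isclose(finetune_loss(0.5, 3.0, beta=2.0), 0.5)
```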
This approach is consistent with prior DeepMind results indicating that PT-to-IT transfer typically does not require resampling a large fraction of latents, and preserves both reconstruction quality and interpretability of learned latents (cf. Kissane et al., 2024b). In practice, initializing from PT SAEs accelerates convergence, stabilizes sparsity calibration, and yields IT SAEs that can be directly compared to their PT counterparts for circuit-level analyses.

2.6. Multi-layer SAEs

We release two different types of multi-layer autoencoder models: weakly causal crosscoders and cross-layer transcoders.

Weakly causal crosscoders. Crosscoders were first introduced in Lindsey et al. (2024). They are variants of regular sparse autoencoders which are trained not on a single activation site but on the concatenation of activations from multiple sites. This could mean the concatenation of activations from different base models, or from the same base model at different layers; in this paper, we refer only to the latter. Much like skip transcoders, the motivation for these models is to recover features which have been distributed across multiple layers, due to linear components of MLP or attention layers, or other effects. There are many variants of multi-layer crosscoders, depending on which layers are trained on and which architectural restrictions are imposed on the model. In this paper we focus on crosscoders which are trained only on a partial subset of layers rather than the full model, and assume weak causality: in other words, a latent's encoder weights are restricted to a single layer, and its decoder may reconstruct activations from that layer or any future layer. This ensures latents cannot use future-layer information to encode past activations.

Cross-layer transcoders. The cross-layer transcoder (CLT) architecture was introduced in Lindsey et al. (2024).
Much like crosscoders generalize SAEs by training on the concatenation of multiple layers, cross-layer transcoders generalize transcoders by training to reconstruct the map from concatenated pre-MLP activations to concatenated MLP outputs. Note that cross-layer transcoders can also be combined with affine skip connections in exactly the same way as skip transcoders, with each affine skip connection mapping only from a layer's MLP input to that same layer's MLP output.

3. Training details

For this release, we largely kept to the same training methodology as Lieberum et al. (2024). In particular, the topology and sharding configuration for our single-layer models was identical to the description given in the original Gemma Scope technical report, as is our shuffling method. In this report, we only discuss an attribute of our training in detail when it differs from our methodology in the original release.

3.1. Data

We train SAEs on the activations of Gemma 3 models generated using text data from the same distribution as the pretraining text data for Gemma 3 (Gemma Team, 2025). For the instruction-tuned models, we finetuned our SAEs using chat data: the user prompts were taken from the open-source datasets OpenAssistant/oasst1 (Köpf et al., 2023) and LMSYS-Chat-1M (Zheng et al., 2023).

During training, activation vectors are normalized by a fixed scalar to have unit mean squared norm. This allows more reliable transfer of hyperparameters between layers and sites, as the raw activation norms can vary over multiple orders of magnitude, changing the scale of the reconstruction loss in Eq. (4). Once training is complete, we rescale the trained SAE parameters so that no input normalization is required for inference (see Appendix A in Lieberum et al. (2024) for more details). This process is similar for multi-layer models; the only difference is that we normalize each layer separately.
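The per-site normalization just described can be sketched as follows (hypothetical shapes and scale; for multi-layer models the same scalar is computed per layer):

```python
import numpy as np

# Sketch: a single fixed scalar c per site scales activations to unit mean
# squared norm; after training, c can be folded into the SAE parameters so
# inference needs no rescaling.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 16)) * 37.0   # raw norms vary wildly by site

c = np.sqrt(np.mean(np.sum(acts ** 2, axis=-1)))
normed = acts / c
assert np.isclose(np.mean(np.sum(normed ** 2, axis=-1)), 1.0)
```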
This increases the stability of training, especially when we initialize our multi-layer models from the concatenated weights of single-layer models (see Section 3.3).

Figure 1 | Illustration of the three locations per layer where SAEs are trained: attention head outputs, MLP outputs, and post-MLP residual stream.

Location. As in the previous Gemma Scope release, we train SAEs on three locations per layer. We train on the attention head outputs before the final linear transformation W_O and RMSNorm have been applied (Kissane et al., 2024a), on the MLP outputs after the RMSNorm has been applied, and on the post-MLP residual stream. For the attention output SAEs, we concatenate the outputs of the individual attention heads and learn a joint SAE for the full set of heads. We zero-index the layers, so layer 0 refers to the first transformer block after the embedding layer. We also train a full suite of skip transcoders on every layer. This is illustrated in Fig. 1. Additionally, for each model we train a partial weakly causal crosscoder on 4 layers chosen at fixed-depth percentiles (25%, 50%, 65% and 85% of the way through the model), and for the two smaller models (270M, 1B) we also train cross-layer transcoders.

Table 1 | Overview of the SAEs & variants that were trained for different Gemma 3 models. The layers column indicates either multiple different releases for the single-layer models, or multiple layers trained on simultaneously for the multi-layer models.

Gemma 3 Model | SAE Type         | Layers         | SAE Widths           | L0s
270M          | SAE (a)          | All            | 16k, 256k            | 10, 100
270M          | SAE (a,c)        | 5, 9, 12, 15   | 16k, 64k, 256k, 1m   | 10, 50, 150
270M          | transcoder (b)   | All            | 16k, 256k            | 10, 100
270M          | transcoder (b,c) | 5, 9, 12, 15   | 16k, 64k, 256k       | 10, 50, 150
270M          | crosscoder (c)   | 5, 9, 12, 15   | 64k, 256k, 512k, 1m  | 50, 150
270M          | CLT (b,c)        | All            | 256k, 512k           | 50, 150
1B            | SAE (a)          | All            | 16k, 256k            | 10, 100
1B            | SAE (a,c)        | 7, 13, 17, 22  | 16k, 64k, 256k, 1m   | 10, 50, 150
1B            | transcoder (b)   | All            | 16k, 256k            | 10, 100
1B            | transcoder (b,c) | 7, 13, 17, 22  | 16k, 64k, 256k       | 10, 50, 150
1B            | crosscoder (c)   | 7, 13, 17, 22  | 64k, 256k, 512k, 1m  | 50, 150
1B            | CLT (b,c)        | All            | 256k, 512k           | 50, 150
4B            | SAE (a)          | All            | 16k, 256k            | 10, 100
4B            | SAE (a,c)        | 9, 17, 22, 29  | 16k, 64k, 256k, 1m   | 10, 50, 150
4B            | transcoder (b)   | All            | 16k, 256k            | 10, 100
4B            | transcoder (b,c) | 9, 17, 22, 29  | 16k, 64k, 256k       | 10, 50, 150
4B            | crosscoder (c)   | 9, 17, 22, 29  | 64k, 256k, 512k, 1m  | 50, 150
12B           | SAE (a)          | All            | 16k, 256k            | 10, 100
12B           | SAE (a,c)        | 12, 24, 31, 41 | 16k, 64k, 256k, 1m   | 10, 50, 150
12B           | transcoder (b)   | All            | 16k, 256k            | 10, 100
12B           | transcoder (b,c) | 12, 24, 31, 41 | 16k, 64k, 256k       | 10, 50, 150
12B           | crosscoder (c)   | 12, 24, 31, 41 | 64k, 256k, 512k, 1m  | 50, 150
27B           | SAE (a)          | All            | 16k, 256k            | 10, 100
27B           | SAE (a,c)        | 16, 31, 40, 53 | 16k, 64k, 256k, 1m   | 10, 50, 150
27B           | transcoder (b)   | All            | 16k, 256k            | 10, 100
27B           | transcoder (b,c) | 16, 31, 40, 53 | 16k, 64k, 256k       | 10, 50, 150
27B           | crosscoder (c)   | 16, 31, 40, 53 | 64k, 256k, 512k, 1m  | 50, 150

(a) Each SAE row corresponds to SAEs trained on 3 different sites: attention output, MLP output and post-MLP residual. Only the residual stream SAEs have a 1m-width model released. (b) Each transcoder and CLT corresponds to a sweep over 2 different configs: with and without affine skip connections. (c) These variants also include random seeds for exactly one of the combinations of SAE width & L0. Every model listed in this table comes with a finetuned variant for the instruction-tuned version of the corresponding Gemma 3 model.

3.2. Hyperparameters

Optimization. We use the same bandwidth ε = 0.001 and learning rate η = 7 × 10⁻⁵ across all training runs. We use a cosine learning rate warmup from 0.1η to η over the first 1,000 training steps. We train with the Adam optimizer (Kingma and Ba, 2017) with (β₁, β₂) = (0, 0.999), ϵ = 10⁻⁸ and a batch size of 4,096. We use a quadratic L0 penalty, and combine this with a linear warmup of the sparsity coefficient from 0 to λ over the first 50,000 training steps.

During training, we parameterize the SAE using a pre-encoder bias (Bricken et al., 2023), subtracting b_dec from activations before the encoder.
However, after training is complete, we fold this bias into the encoder parameters, so that no pre-encoder bias needs to be applied during inference. Throughout training, we restrict the columns of W_dec to have unit norm by renormalizing after every update. We also project out the part of the gradients parallel to these columns before computing the Adam update, as described in Bricken et al. (2023).

Initialization. We initialize the JumpReLU threshold as the vector θ = 0.001 · 1_M (i.e. every entry set to 0.001). We initialize W_dec using He-uniform initialization (He et al., 2015) and rescale each latent vector to be unit norm. W_enc is initialized as the transpose of W_dec, but the two are not tied afterwards (Conerly et al., 2024; Gao et al., 2024). The biases b_dec and b_enc are initialized to zero vectors. For multi-layer models we initialize using the parameters of the corresponding single-layer models, as discussed in Section 3.3.

3.3. Multi-layer model initialization

Despite these improvements, multi-layer models are still much more costly to train than single-layer models. To overcome this, we initialize our multi-layer models using our single-layer models as a starting point.

One possible method we explored was to initialize our multi-layer models by simply concatenating single-layer models. Motivated by the fact that we were using Matryoshka training for our SAEs, we would choose prefixes of latents from each single-layer SAE to include in the multi-layer model. The problem we ran into was redundant latents: this method would pick latents on different layers which represented more or less the same concept. To fix this, we developed a novel initialization strategy which works as follows: we iterate through SAEs (starting from the earliest layers), choosing prefixes of latents from each SAE.
For each latent we choose, we mark off the latents in later-layer SAEs which have the maximum similarity to this latent (as measured by the dot product between the early-layer decoder and the later-layer encoder). In this way, we get much better global coverage, because at each layer we avoid choosing latents which are too similar to one that we already chose in a previous layer.

Generally, for our multi-layer models we target smaller L0 values than this initialization strategy would yield, but we also want the finetuning process to be stable. To resolve this, we initially set the target L0 value high (based on the sum of the single-layer L0 values of all the latents chosen in our initialization strategy) and then decay it over 50,000 steps to our target value for the multi-layer model. We do this for both the weakly causal crosscoders and the CLTs.

4. Evaluation

In this section we evaluate the trained SAEs from various different angles. We note, however, that there is as yet no consensus on what constitutes a reliable metric for the quality of a sparse autoencoder or its learned latents, and that this is an ongoing area of research and debate (Gao et al., 2024; Karvonen et al., 2024; Makelov et al., 2024). Unless otherwise noted, all evaluations are on sequences from the same distribution as the SAE training data, i.e. the pretraining distribution of Gemma 3.

4.1. Evaluating the sparsity-fidelity trade-off

Methodology. For a fixed dictionary size, we trained SAEs at varying levels of sparsity by sweeping the L0 target value L*₀. We then plot curves showing the level of reconstruction fidelity attainable at a given level of sparsity.

Metrics. We use the mean L0-norm of latent activations, E_x ‖f(x)‖₀, as a measure of sparsity.
To measure reconstruction fidelity, our primary metrics are delta LM loss, the increase in the cross-entropy loss experienced by the LM when we splice the SAE into the LM's forward pass, and the fraction of variance unexplained (FVU), also called the normalized loss (Gao et al., 2024). FVU is the mean reconstruction loss L_reconstruct of a SAE normalized by the reconstruction loss obtained by always predicting the dataset mean. Note that FVU is purely a measure of the SAE's ability to reconstruct the input activations, and does not take into account the causal effect of any error on the downstream loss.

All metrics were computed on 2,048 sequences of length 1,024, after masking out special tokens (pad, start and end of sequence) when aggregating the results.

Results. The sparsity-fidelity trade-off for SAEs in the middle of each Gemma model is illustrated in Figure 7. As in the previous release, we found delta loss to be consistently higher for residual stream SAEs compared to MLP and attention SAEs, whereas FVU is roughly comparable across sites.

4.2. Latent firing frequency

Fig. 2 shows the distribution of latent activation frequencies for the latents in the residual stream SAEs across model sizes and depths. This was computed across a set of 50,000 sequences of length 1,024, after masking out special tokens. With an aggressive version of the dense latent penalization discussed in Section 2.3, we find that we can entirely eliminate latents with frequency greater than 10%.

Figure 2 | Feature activation frequency distributions for residual post-MLP SAEs across model sizes and depths; most latents remain low frequency with long-tailed densities.

4.3. Interpretability of latents

We evaluate interpretability using an automated interpretability system rather than human raters.
The method involves binary classification: we present sequences where a particular latent fires and sequences where it doesn't, and ask a model to generate an explanation for this latent. Next, we present this explanation along with a randomly ordered list of sequences (some of which cause the latent to fire, some of which don't) and ask the model to classify which ones fire. Our findings are broadly consistent with a snippet we published earlier this year: lower-frequency latents tend to be more interpretable. Figure 3 shows the distribution of interpretability scores for the latents in residual stream SAEs trained on Gemma 3 1B PT, at four different layers.

Figure 3 | Automated interpretability scores as a function of feature activation frequency across model sizes and depths, for Gemma 3 1B PT SAEs trained on the residual stream. Higher-frequency features are slightly less interpretable.

4.4. Affine skip connections in transcoders

By giving us more learnable parameters and the ability to model the linear parts of MLP layers without dedicating transcoder latents to them, affine skip connections can improve performance in both the single- and multi-layer settings. This is shown in Figure 4, where we compare the effect of adding affine skip connections to transcoders and CLTs respectively on the model FVU.

Figure 4 | Effect of affine skip connections on reconstruction quality: FVU versus L0 for transcoders and CLTs, showing improved trade-offs with skip connections.

We can also measure the usefulness of affine skip connections another way. In Lindsey et al. (2025), the authors show that the circuit-tracing algorithm can be applied to cross-layer transcoders (CLTs) to generate graphs of latents which fire on a particular prompt, and then prune that graph to leave only latents which are important for a particular token prediction. The authors also compare this to the graphs generated from a suite of single-layer transcoders.
We compare our trained CLTs and transcoders by generating attribution graphs for each of them, and generally find the same results as the authors: CLTs generate sparser graphs, as measured both by the number of nodes and the number of edges in the graph. Figure 5 visualizes this by showing the number of latent nodes required to reach a given fraction of total circuit influence (measured using Anthropic's influence metric). Not only do we see CLTs outperforming transcoders (since any given prefix of nodes leads to a greater total influence), but we also see affine skip connections outperform for both transcoders and CLTs.

Figure 5 | Cumulative influence graph for CLTs vs transcoders for Gemma 3 1B IT, on the prompt "The National Data Authority (N".

4.5. Initializing multi-layer models from trained single-layer models

We initialize multi-layer models using weights from trained single-layer models and gradually decay the target L0. This reduces wall-clock training time, because single-layer models train and parallelize more efficiently than randomly-initialized multi-layer models. Figure 6 shows the average cosine similarity between decoder weights and their initialized values over the course of training, for a crosscoder which was initialized from several single-layer SAEs. Although the cosine similarity does come down by the end of training, it remains fairly high, suggesting that the initialized features from single-layer models are good approximations to what the multi-layer model eventually needs to learn.

Figure 6 | Training dynamics for weakly causal crosscoders initialized from single-layer SAEs.

Figure 7 | Sparsity-fidelity trade-off for Gemma 3 1B resid-post SAEs, and autointerpretability scores. Higher L0s (and wider SAEs) lead to better performance, without having a significant impact on latent interpretability.

5. Open problems that Gemma Scope 2 may help tackle

As with the original Gemma Scope release, we're excited to help the broader safety and interpretability communities advance our understanding of interpretability, and how it can be used to make models safer. In this section we provide a list of open problems we're particularly excited to see progress on. The list reflects how our own views on sparse autoencoder research (and interpretability more broadly) have changed over the past year, as well as the kinds of research this release is especially well suited for. For example, we're interested in large-scale circuit analysis as well as using sparse autoencoders for real-world tasks such as debugging strange model behaviors, but we're less excited about fundamental research into new SAE architectures.

Deepening our understanding of SAEs

1. Comparisons of residual stream SAE features across layers, e.g. are there persistent features that one can "match up" across adjacent layers? How can multi-layer models help us understand this?
2. Better understanding the phenomenon of "feature splitting" (Bricken et al., 2023), where high-level features in a small SAE break apart into several finer-grained features in a wider SAE. Do Matryoshka SAEs help resolve this?
3. We know that SAEs introduce error, and completely miss some features that are captured by wider SAEs (Bussmann et al., 2024; Templeton et al., 2024). Can we quantify and easily measure "how much" they miss, and how much this matters in practice?
4. How are circuits connecting up superposed features represented in the weights? How do models deal with the interference between features (Nanda et al., 2023b)?

Using SAEs for real-world applications and understanding model behavior

1. Detecting or fixing jailbreaks, and understanding the mechanisms by which jailbreaks succeed or fail.
2. Helping find new jailbreaks/red-teaming models (Ziegler et al., 2022).
3. Understanding real-world failures in model reasoning and alignment, such as hallucinations, unfaithful chain of thought, and emergent misalignment in finetuned or in-context learning scenarios.
4. Comparing steering vectors (Turner et al., 2024) to SAE feature steering (Conmy and Nanda, 2024) or clamping (Templeton et al., 2024) for controlling model behavior.
5. Can SAEs be used to improve interpretability techniques, like steering vectors, such as by removing irrelevant features (Conmy and Nanda, 2024)?
6. Using SAEs to identify and remove spurious correlations or discover causal structures in model reasoning (Marks et al., 2024).
7. Auditing games: can we use SAEs to verify whether models are reasoning faithfully, planning deceptively, or pursuing hidden goals?

Red-teaming SAEs

1. Can we find downstream tasks where SAEs can be measured against simple baselines (either black-box or simpler white-box methods)? How do they perform?
2. How robust are claims about the interpretability of SAE features (Huang et al., 2023)?
3. Can we find the "dark matter" of truly nonlinear features?
4. Do SAEs learn spurious compositions of independent features to improve sparsity, as has been shown to happen in toy models (Anders et al., 2024), and can we fix this?

Scalable circuit analysis: What interesting circuits can we find in these models?

1. What's the learned algorithm for addition (Stolfo et al., 2023) in Gemma 3 4B? Does it resemble that found by (Lindsey et al., 2025)?
2. How can we practically extend the SAE feature circuit finding algorithm in (Marks et al., 2024) to larger models?
3. Can we use single-layer transcoders (Dunefsky et al., 2024) to find input-independent, weights-based circuits?

Using SAEs as a tool to answer existing questions in interpretability

1. What does finetuning do to a model's internals (Jain et al., 2024)? Can SAEs detect the traces left by finetuning (Minder et al., 2025)?
2. What is actually going on when a model uses chain of thought? What changes when the chain of thought is unfaithful?
3. Is in-context learning true learning, or just promoting existing circuits ((Hendel et al., 2023); (Todd et al., 2024))?
4. Can we find any "macroscopic structure" in language models, e.g. families of features that work together to perform specialised roles, like organs in biological organisms?

Acknowledgements

We are incredibly grateful to Joseph Bloom, Johnny Lin and Curt Tigges for their help supporting more interactive demos of Gemma Scope on Neuronpedia (Lin and Bloom, 2023), creating tooling for researchers like feature dashboards, and helping make educational materials. Their work extended beyond the original feature dashboards used in Gemma Scope, and included visualizations of the circuit tracing methodology applied to our transcoder models.

Author contributions

Callum McDougall (CM) led the writing of the report and the bulk of this project, but it wouldn't have been possible without much supporting work, such as the implementation and running of evaluations. Tom Lieberum and Vikrant Varma primarily designed the original sparse autoencoder training codebase which was adapted for this work, with significant contributions from Arthur Conmy (AC). Lewis Smith (LS) wrote the original Gemma Scope tutorial, which was adapted by CM. Senthooran Rajamanoharan (SR) developed the JumpReLU architecture which was primarily used in this work. CM led the early access and open sourcing of code and weights. Neel Nanda (N) provided advice and mentorship throughout the project. Sparse autoencoder visualization and autointerpretability evaluations were written and implemented by CM.

References

E. Anders, C. Neo, J. Hoelscher-Obermaier, and J. N. Howard. Sparse autoencoders find composed features in small toy models. LessWrong, 2024.

Y. Belinkov. Probing classifiers: Promises, shortcomings, and advances.
Computational Linguistics, 48(1):207–219, 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7.

S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders. Language models can explain neurons in language models. OpenAI, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.

T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.

B. Bussmann, P. Leask, J. Bloom, C. Tigges, and N. Nanda. Stitching SAEs of different sizes. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/baJyjpktzmcmRfosq/stitching-saes-of-different-sizes.

T. Conerly, A. Templeton, T. Bricken, J. Marcus, and T. Henighan. Update on how we train SAEs. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#training-saes.

A. Conmy and N. Nanda. Activation steering with SAEs. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/progress-update-1.

H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models. 2023.

J. Dunefsky, P. Chlenski, and N. Nanda. Transcoders find interpretable LM feature circuits. 2024. URL https://arxiv.org/abs/2406.11944.

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition.
Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.

J. Engels, I. Liao, E. J. Michaud, W. Gurnee, and M. Tegmark. Not all language model features are linear. 2024. URL https://arxiv.org/abs/2405.14860.

L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu. Scaling and evaluating sparse autoencoders. 2024. URL https://arxiv.org/abs/2406.04093.

Gemma Team. Gemma 3: Open language models. 2025. URL https://arxiv.org/abs/2503.19786.

W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=JYs1R9IMJr.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. 2015. URL https://arxiv.org/abs/1502.01852.

J. Minder, C. Dumas, S. Slocum, H. Casademunt, C. Holmes, R. West, and N. Nanda. Narrow finetuning leaves clearly readable traces in activation differences. URL https://arxiv.org/abs/2510.13900.

R. Hendel, M. Geva, and A. Globerson. In-context learning creates task vectors. In EMNLP 2023. URL https://openreview.net/forum?id=QYvFUlF19n.

J. Huang, A. Geiger, K. D'Oosterlinck, Z. Wu, and C. Potts. Rigorously assessing natural language explanations of neurons. 2023. URL https://arxiv.org/abs/2309.10312.

E. Hubinger. A transparency and interpretability tech tree. Alignment Forum, 2022.

S. Jain, E. S. Lubana, K. Oksuz, T. Joy, P. H. S. Torr, A. Sanyal, and P. K. Dokania. What makes and breaks safety fine-tuning? A mechanistic study. 2024. URL https://arxiv.org/abs/2407.10264.

A. Jermyn, C. Olah, and T. Henighan. Attention head superposition. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/may-update/index.html#attention-superposition.

A. Karvonen, B. Wright, C. Rager, R. Angell, J. Brinkmann, L. R. Smith, C. M.
Verdun, D. Bau, and S. Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=qzsDKwGJyB.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. 2017. URL https://arxiv.org/abs/1412.6980.

C. Kissane, R. Krzyzanowski, J. I. Bloom, A. Conmy, and N. Nanda. Interpreting attention layer outputs with sparse autoencoders. 2024. URL https://arxiv.org/abs/2406.17759.

D. Braun, J. Taylor, N. Goldowsky-Dill, and L. Sharkey. Identifying functionally important features with end-to-end sparse dictionary learning. arXiv preprint arXiv:2405.12241, 2024.

A. Karvonen. Revisiting end-to-end sparse autoencoder training: A short finetune is all you need. arXiv preprint arXiv:2503.17272, 2025.

C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. SAEs (usually) transfer between base and chat models. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer.

A. Köpf, E. Kilincer, I. Kutlu, Y. Sawada, J. Long, O. Basturk, R. Rychkova, A. Hartmann, V. Nguyen, N. Matti, J. Wei, K. Wong, M. Wolff, J. Wang, E. Abbott, K. Nguyen, A. Presland, F. Münch, C. Liu, C. Panait, D. Hay, R. Paynter, A. Pelakh, B. Chen, D. Reddy, J. Jain, T. Mann, M. Schmidt, T. Georgiou, Z. Hu, V. Silkin, N. Vogt, J. Cantara, H. Prince, M. Balint, and A. Meemken. OpenAssistant conversations – democratizing large language model alignment. NeurIPS 2023 Datasets and Benchmarks Track, 2023. URL https://arxiv.org/abs/2304.07327.

K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. NeurIPS 2023. URL https://openreview.net/forum?id=aLLuYpn83y.

T. Lieberum, V. Varma, A. Conmy, S. Rajamanoharan, and N. Nanda.
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. 2024. URL https://arxiv.org/abs/2408.05147.

J. Lin and J. Bloom. Analyzing neural networks with dictionary learning. 2023. URL https://www.neuronpedia.org.

J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, and C. Olah. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/crosscoders/index.html.

J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson. On the biology of a large language model. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html.

A. Makelov, G. Lange, and N. Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024. URL https://openreview.net/forum?id=MHIX9H8aYF.

S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. 2024.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. 2013. URL https://arxiv.org/abs/1301.3781.

N. Nanda. A longlist of theories of impact for interpretability. Alignment Forum, 2022.

N. Nanda and J. Bloom. TransformerLens. 2022. URL https://github.com/TransformerLensOrg/TransformerLens.

N. Nanda, A. Lee, and M. Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In BlackboxNLP 2023, pages 16–30, 2023. URL https://aclanthology.org/2023.blackboxnlp-1.2.

N. Nanda, S. Rajamanoharan, J.
Kramár, and R. Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. 2023.

C. Olah. Interpretability. Alignment Forum, 2021.

C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.

K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models. 2023.

S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramár, R. Shah, and N. Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024.

S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, and N. Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. 2024. URL https://arxiv.org/abs/2407.14435.

A. Stolfo, Y. Belinkov, and M. Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In EMNLP 2023, 2023. URL https://openreview.net/forum?id=aB3Hwh4UzP.

A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

E. Todd, M. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function vectors in large language models. In ICLR 2024, 2024. URL https://openreview.net/forum?id=AwyxtyMwaG.

A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization. 2024. URL https://arxiv.org/abs/2308.10248.

M. Wattenberg and F. Viégas.
Relational composition in neural networks: A gentle survey and call to action. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=zzCEiUIPk9.

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023. URL https://arxiv.org/abs/2306.05685.

D. Ziegler, S. Nix, L. Chan, T. Bauman, P. Schmidt-Nielsen, T. Lin, A. Scherlis, N. Nabeshima, B. Weinstein-Raun, D. de Haas, B. Shlegeris, and N. Thomas. Adversarial training for high-stakes reliability. In NeurIPS 2022, volume 35, pages 9274–9286. Curran Associates, Inc., 2022.

A. Matryoshka Frequencies

We can analyze the effect of the Matryoshka loss function on our features. Based on this penalty, we would expect that the features with the smallest indices are generally the more important ones for reconstructing the model's activations. Since a latent's contribution to loss can be decomposed as the product of its firing frequency and average loss contribution when it fires, we would also expect early latents to have higher frequencies. This is borne out in Figure 8, which shows the early features having significantly higher frequency than the later features.

Figure 8 | Matryoshka feature frequencies for SAEs trained on Gemma V3 1B PT, residual stream. The solid line lies above the diagonal (indicating earlier features have higher frequency), but below the dotted line (indicating that the Matryoshka loss isn't so strong that it makes the SAEs strictly order their latents by frequency).

B. Sparse Kernels

We experimented with training BatchTopK SAEs, and used sparse kernels to implement this training.
Although we ended up including only JumpReLU SAEs in our final release, and didn't find it beneficial to extend the sparse kernel methodology to JumpReLU SAEs at the scale we were training at, we include the theory and implementation details here in case they are of use to anyone who might train their own, particularly if using JAX and not working inside an existing training framework.

B.1. TopK and BatchTopK SAEs

TopK SAEs enforce sparsity by selecting exactly the $K$ most active latents per token, zeroing all others. Using the same notation as Section 2.1, let

$$f_{\text{TopK}}(\mathbf{x}) := \text{TopK}_K\left(\mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}}\right). \tag{8}$$

BatchTopK extends this across a batch while keeping a fixed total number of active latents. For inputs $\{\mathbf{x}_b\}_{b=1}^B$, define pre-activations $\mathbf{z}_b := \mathbf{W}_{\text{enc}}\mathbf{x}_b + \mathbf{b}_{\text{enc}}$. BatchTopK selects the $K \cdot B$ largest entries across $\{\mathbf{z}_b\}_{b=1}^B$ and zeros out the rest:

$$\{\mathbf{f}_b\}_{b=1}^B := \text{BatchTopK}_K\left(\{\mathbf{z}_b\}_{b=1}^B\right), \quad \mathbf{z}_b := \mathbf{W}_{\text{enc}}\mathbf{x}_b + \mathbf{b}_{\text{enc}}. \tag{9}$$

Both approaches use the same linear decoder and reconstruction as Eq. (2), typically optimizing only the reconstruction loss

$$\mathcal{L}_{\text{reconstruct}} := \left\|\mathbf{x} - \hat{\mathbf{x}}(\mathbf{f}(\mathbf{x}))\right\|_2^2, \tag{10}$$

since sparsity is enforced by the hard TopK constraints rather than an $\ell_0$/$\ell_1$ penalty. These methods provide TPU benefits similar to JumpReLU's sparse regimes: knowing the exact number of active latents (per token for TopK, per batch for BatchTopK) allows JIT compilation of sparse kernels with predictable shapes and memory traffic.

Crucially, BatchTopK can be converted to a JumpReLU parameterization at inference by selecting per-latent thresholds $\boldsymbol{\theta}$ that match the empirical activation quantiles observed during training, yielding

$$f_{\text{JR}}(\mathbf{x}) := \text{JumpReLU}_{\boldsymbol{\theta}}\left(\mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}}\right), \tag{11}$$

with thresholds chosen so that active sets closely match those induced by BatchTopK.

In practice, we generally use JumpReLU for single-layer SAEs.
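The BatchTopK selection of Eq. (9) and its conversion to JumpReLU thresholds can be sketched in NumPy. This is a simplified stand-in for the sharded JAX implementation; the function names are illustrative, and thresholds here are fit on a single batch rather than on activation quantiles collected over training.

```python
import numpy as np

def batch_topk(z, k):
    """Eq. (9): keep the B*k largest pre-activations across the whole
    batch z of shape (B, F); zero out everything else."""
    n_keep = z.shape[0] * k
    cutoff = np.sort(z, axis=None)[-n_keep]   # value of the (B*k)-th largest entry
    return np.where(z >= cutoff, z, 0.0)

def jumprelu_thresholds(z, f):
    """Eq. (11): set each latent's threshold to the smallest activation
    BatchTopK kept for that latent, so JumpReLU with these thresholds
    reproduces the same active set on this batch."""
    theta = np.full(z.shape[1], np.inf)       # latents that never fired stay off
    for j in range(z.shape[1]):
        kept = z[f[:, j] > 0, j]
        if kept.size:
            theta[j] = kept.min()
    return theta

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 32))                 # B=16 tokens, F=32 latents
f = batch_topk(z, k=4)                        # 16 * 4 = 64 active latents in total
theta = jumprelu_thresholds(z, f)
f_jr = np.where(z >= theta, z, 0.0)           # JumpReLU with converted thresholds
```

On the batch used to fit the thresholds, the JumpReLU active set matches BatchTopK exactly; on held-out data the match is only approximate, which is why quantiles over many training batches would be used in practice.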
For multi-layer SAEs and transcoders, we typically train with BatchTopK (to realize TPU efficiency from sparse kernels) and convert to JumpReLU for inference, which is a favorable trade-off for large-scale circuit analyses.

B.2. Sparse Kernel Implementation

We implement sparse decoding in a JAX-friendly way, adapting the sparse kernel ideas of Gao et al. to our multi-layer models. During training we use only model parallelism, sharding along the latent dimension.

Sharding and activation selection. Let activations have shape $(B, L_{\text{in}}, F)$ for batch size $B$, input layers $L_{\text{in}}$, and latents per layer $F$. With $S$ shards over the latent axis, each shard holds $(B, L_{\text{in}}, F/S)$. For TopK (target total $K$ active latents across all layers), each shard independently selects $K/(L_{\text{in}} S)$ per example from its last dimension. For BatchTopK, each shard selects $(K \cdot B)/S$ across the shard. We return sparse tensors (values and indices), which remain sharded over latents (uniform by construction) but not over batch.

Sparse decoder. The decoder has shape $(L_{\text{in}}, F, L_{\text{out}}, d_{\text{model}})$ and is sharded on its latent axis. For each shard we gather decoder vectors at the sparse indices and sum within each (batch, layer) group, producing per-shard outputs of shape $(B, L_{\text{out}}, d_{\text{model}})$. Summing across shards yields the final output.

Why stack by layer? Enforcing uniformity over the flattened $(B, L_{\text{in}} \cdot F)$ axis would implicitly impose a uniform activation budget across layers, which is undesirable. Multi-layer model training should allocate activations non-uniformly across depth (empirically we see allocations rise through the network and drop near the end). Enforcing uniformity only across the latent axis within a layer is a much weaker constraint. One residual limitation is that latents in different shards never compete in the TopK, so cross-shard suppression is reduced.
If this proves costly, an alternative is to broadcast the global sparse set to all shards and apply a per-shard mask before decoding. This restores cross-shard competition at the price of $S\times$ more indexed gathers (costly in JAX), but encoder and other costs may still dominate end-to-end time.

Approximate TopK. We use JAX's approximate TopK with a recall parameter $r \in [0, 1]$, which returns a set whose expected overlap with the true TopK is $\geq r$. Values $r \in [0.75, 0.975]$ worked well. We note a minor implementation issue in the reference formula for the internal candidate size, which slightly overestimates it, leading to higher-than-requested recall and modestly higher runtime; this does not affect results materially.
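The per-shard selection scheme in Section B.2 can be sketched in NumPy as follows. This is a simplified, single-host stand-in for the sharded JAX kernels (which would use exact or approximate TopK per device); function and variable names are illustrative.

```python
import numpy as np

def per_shard_topk(acts, n_shards, k_total):
    """Split the latent axis of acts (shape (B, L_in, F)) into n_shards
    slices, and within each slice independently keep
    k_total / (L_in * n_shards) latents per (example, layer).

    Latents in different shards never compete for the budget, which is
    the cross-shard-suppression limitation noted in the text."""
    B, L_in, F = acts.shape
    k_shard = k_total // (L_in * n_shards)
    results = []
    for shard in np.split(acts, n_shards, axis=-1):      # each (B, L_in, F/S)
        idx = np.argsort(shard, axis=-1)[..., -k_shard:]  # top-k_shard indices
        vals = np.take_along_axis(shard, idx, axis=-1)
        results.append((vals, idx))                       # sparse (values, indices)
    return results

acts = np.random.default_rng(1).normal(size=(2, 3, 16))   # B=2, L_in=3, F=16
selected = per_shard_topk(acts, n_shards=4, k_total=24)   # 24/(3*4) = 2 per shard
```

Summing the kept entries over layers and shards recovers exactly `k_total` active latents per example, while the budget stays uniform across shards (by construction) but not across layers in a true global-TopK scheme.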