Paper deep dive
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
Models: Pythia-410M
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 7:11:04 PM
Summary
The paper introduces Jacobian Sparse Autoencoders (JSAEs), a novel method to discover sparse computational graphs in LLMs by sparsifying the Jacobian of the mapping between input and output latent activations of model components (specifically MLPs). Unlike traditional SAEs that only sparsify representations, JSAEs use an L1 penalty on the Jacobian to encourage sparse computational dependencies, while maintaining performance and interpretability. The authors provide an efficient method to compute these Jacobians, enabling the discovery of interpretable computational units.
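As a rough illustration of the setup described above, a pair of TopK SAEs is trained around a frozen MLP with two reconstruction terms plus an L1 penalty, weighted by lambda/k^2, on the Jacobian of the sparse-to-sparse map. The PyTorch sketch below is ours, not the authors' implementation; the autograd-based Jacobian it uses is the brute-force route that the paper replaces with an efficient closed-form computation.

```python
# Illustrative sketch only, not the authors' implementation.
import torch
import torch.nn as nn


def topk_mask(latents: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations, zero the rest (the TopK function tau_k)."""
    vals, idx = latents.topk(k, dim=-1)
    return torch.zeros_like(latents).scatter(-1, idx, vals)


class TopKSAE(nn.Module):
    """A k-sparse autoencoder: overcomplete linear encoder/decoder with TopK."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
        self.k = k

    def encode(self, x):
        return topk_mask(self.enc(x), self.k)

    def decode(self, s):
        return self.dec(s)


def jsae_loss(x, mlp, sae_in: TopKSAE, sae_out: TopKSAE, lam: float, k: int):
    """Loss for one token activation x: MSE(x, x_hat) + MSE(y, y_hat)
    + (lam / k^2) * sum |J|, where J is the Jacobian of f_s = e_y o f o d_x o tau_k."""
    y = mlp(x)                                    # dense MLP output
    s_x, s_y = sae_in.encode(x), sae_out.encode(y)
    recon = nn.functional.mse_loss(sae_in.decode(s_x), x) \
          + nn.functional.mse_loss(sae_out.decode(s_y), y)

    def f_s(s):  # sparse input latents -> sparse output latents
        return sae_out.encode(mlp(sae_in.decode(topk_mask(s, k))))

    # Naive autograd Jacobian; the paper instead derives a closed-form k x k block.
    jac = torch.autograd.functional.jacobian(f_s, s_x.detach(), create_graph=True)
    return recon + lam / k ** 2 * jac.abs().sum()
```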
Entities (5)
Relation Signals (3)
Jacobian Sparse Autoencoders → applied to → MLP
confidence 95% · In this paper, we apply Jacobian SAEs to multi-layer perceptrons (MLPs)
Jacobian Sparse Autoencoders → extends → Sparse Autoencoders
confidence 95% · JSAEs are fundamentally an extension of standard SAEs
Jacobian Sparse Autoencoders → compared with → Transcoders
confidence 90% · JSAEs and transcoders take radically different approaches and solve radically different problems.
Cypher Suggestions (2)
Identify the relationship between JSAEs and other interpretability methods. · confidence 95% · unvalidated
MATCH (a:Method {name: 'Jacobian Sparse Autoencoders'})-[r]->(b:Method) RETURN a, type(r), b
Find all methods related to sparse representation or computation in LLMs. · confidence 90% · unvalidated
MATCH (m:Method)-[:EXTENDS|APPLIED_TO]->(target) RETURN m, target
Abstract
Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of LLMs. However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to "sparsify" computations in any sense, only latent activations. To solve this, we propose Jacobian SAEs (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. One key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that Jacobians are a reasonable proxy for computational sparsity because MLPs are approximately linear when rewritten in the JSAE basis. Lastly, we show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.
Tags
Links
Full Text
152,255 characters extracted from source content.
arXiv:2502.18147v2 [cs.LG] 6 Jun 2025 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Lucy Farnik 1 Tim Lawson 1 Conor Houghton 1 Laurence Aitchison 1 Abstract Sparse autoencoders (SAEs) have been suc- cessfully used to discover sparse and human- interpretable representations of the latent activa- tions of LLMs. However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand com- putations is unclear because they are not designed to “sparsify” computations in any sense, only la- tent activations. To solve this, we propose Jaco- bian SAEs (JSAEs), which yield not only spar- sity in the input and output activations of a given model component but also sparsity in the compu- tation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. One key technical contribution is thus finding an efficient way of computing Jaco- bians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that Jacobians are a reasonable proxy for computational sparsity because MLPs are ap- proximately linear when rewritten in the JSAE ba- sis. Lastly, we show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned trans- former computations than standard SAEs. 1 School of Engineering Mathematics and Technology, Uni- versity of Bristol, Bristol, UK. Correspondence to: Lucy Farnik <lucyfarnik@gmail.com>. Proceedings of the42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). 1. Introduction Sparse autoencoders (SAEs) have emerged as a power- ful tool for understanding the internal representations of large language models (Bricken et al., 2023; Cunningham et al., 2023; Gao et al., 2024; Rajamanoharan et al., 2024b; Lieberum et al., 2024; Lawson et al., 2024; Braun et al., 2024; Kissane et al., 2024; Rajamanoharan et al., 2024a). By decomposing neural network activations into sparse, in- terpretable components, SAEs have helped researchers gain significant insights into how these models process informa- tion (Marks et al., 2024; Lieberum et al., 2024; Temple- ton et al., 2024b; O’Brien et al., 2024; Farrell et al., 2024; Paulo et al., 2024; Balcells et al., 2024; Lan et al., 2024; Brinkmann et al., 2025; Spies et al., 2024). When trained on the activation vectors from neural network layers, SAEs learn to reconstruct the inputs using a dic- tionary of sparse ‘features’, where there are many more features than basis dimensions of the inputs, and each fea- ture tends to capture a specific, interpretable concept. How- ever, the goal of this paper is to improve understanding of computationsin transformers. While SAEs are designed to disentangle the representations of concepts in the LLM, they are not designed to help us understand the computations performed with those representations. One approach to understanding computation would be to train two SAEs, one at the input and one at the output of an MLP in a transformer. 
We can then ask how the MLP maps sparse latent features at the inputs to sparse features in the outputs. For this mapping to be interpretable, it would be desirable that it is sparse, in the sense that each latent in the SAE trained on the output depends on a small number of latents of the SAE trained on the input. These dependencies can be understood as a computation graph or ‘circuit’ (Olah et al., 2020; Cammarata et al., 2020). SAEs are not designed to encourage this computation graph to be sparse. To address this, we develop Jacobian SAEs (JSAEs), where we include a term in the objective to encourage SAE bases with sparse computational graphs, not just sparse activations. Specifically, we treat the mapping between the latent activations of the input and output SAEs as a function and encourage its Jacobian to be sparse by including anL 1 penalty term in the loss function. With a naïve implementation, it is intractable to compute 1 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations MLP SAE1 activations SAE2 activations Dense input activations Dense output activations Decoder Encoder = (includes activation function ) Traditional SAEsJSAEs SAE1 activations SAE2 activations SAE1 activations SAE2 activations Figure 1.A diagram illustrating our setup. We have two SAEs: one trained on the MLP inputs and the other trained on the MLP outputs. We then consider the functionf s , which takes the latent activations of the first SAE and returns the latent activations of the second SAE, i.e.,f s (s x ) =s y . The functionf s is described by the function composition of the TopK activation function of the first (input) SAEτ k , the decoder of the first SAEd x , the MLPf, and the encoder of the second (output) SAEe y . We note that the activation functionτ k is included for computational efficiency only; see Section 4.2 for details. JSAEs optimize forf s having a sparse Jacobian matrix, which we illustrate by reducing the number of edges in the computational graph that corresponds tof s . Traditional SAEs have sparse SAE latents on either side of the MLP but a dense computational graph between them; JSAEs have both sparse SAE latentsanda sparse computational graph. Importantly, Jacobian sparsity approximates the computational graph notion, but, as we discuss in Section 5.4 and Appendix B, this approximation is highly accurate due to the fact thatf s is a mostly linear function. Jacobian matrices because each matrix would have on the or- der of a trillion elements, even for modestly sized language models and SAEs. Therefore, one of our core contributions is to develop an efficient means to compute Jacobian ma- trices in this context. The approach we develop makes it possible to train a pair of Jacobian SAEs with only approxi- mately double the computational requirements of training a single standard SAE (Section 4.2). These methods enabled us to make three downstream findings. First, we find that Jacobian SAEs successfully induce spar- sity in the Jacobian matrices between input and output SAE latents relative to standard SAEs without a Jacobian term (Section 5.1). We find that JSAEs achieve the desired in- crease in the sparsity of the Jacobian with only a slight decrease in reconstruction quality and model performance preservation, which remain roughly on par with standard SAEs. We also find that the input and output latents learned by Jacobian SAEs are approximately as interpretable as standard SAEs, as quantified by auto-interpretability scores. 
Importantly, we also find that the "computational units" discovered by JSAEs are often highly interpretable – for ex- ample, JSAEs find an output latent corresponding to whether the text is in German, which is computed using several input latents corresponding to tokens frequently found in German text (Section 5.2). Second, inspired by Heap et al. (2025), we investigated the behavior of Jacobian SAEs when applied to random transformers, i.e., where the parameters have been reini- tialized. We find that the degree of Jacobian sparsity that can be achieved when JSAEs are applied to a pre-trained transformer is much greater than the sparsity achieved for a random transformer (Section 5.3). This preliminary find- ing suggests that Jacobian sparsity may be a useful tool for discovering learned computational structure. Lastly, we find that Jacobians accurately approximate com- putational sparsity in this context because the function we are analyzing (i.e., the combination of JSAEs and MLP) is approximately linear (Section 5.4). Oursourcecodecanbefoundat https://github.com/lucyfarnik/jacobian-saes. 2. Related work 2.1. Sparse autoencoders SAEs have been widely applied to ‘disentangle’ the repre- sentations learned by transformer language models into a very large number of concepts, a.k.a. sparse latents, features, or dictionary elements (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023; Gao et al., 2024; Raja- manoharan et al., 2024b; Lieberum et al., 2024). Human experiments and quantitative proxies apparently confirm that SAE latents are much more likely to correspond to human- interpretable concepts than raw language-model neurons, 2 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations i.e., the basis dimensions of their activation vectors (Cun- ningham et al., 2023; Bricken et al., 2023; Rajamanoharan et al., 2024a). SAEs have been successfully applied to mod- ifying the behavior of LLMs by using a direction discovered by an SAE to “steer” the model towards a certain concept (Makelov, 2024; O’Brien et al., 2024; Templeton et al., 2024b). Our work is based on SAEs but has a very different aim: standard SAEs only sparsify activations, while JSAEs also sparsify the computation graph between them (Figure 1). 2.2. Transcoders In this paper, we focus on MLPs. Dunefsky et al. (2024); Templeton et al. (2024a) developedtranscoders, an alter- native SAE-like method to understand MLPs. However, JSAEs and transcoders take radically different approaches and solve radically different problems. This is perhaps easi- est to see if we look at what transcoders and JSAEs sparsify. JSAEs are fundamentally an extension of standard SAEs: they train SAEs at the input and output of the MLP and add an extra term to the objective such that these sparse latents are also appropriate for interpreting the MLP (Fig- ure 1). In contrast, transcoders do not sparsify the inputs and outputs; they work with dense inputs and outputs. Instead, transcoders, in essence, sparsify the MLP hidden states. Specifically, a transcoder is an MLP that you train to match (using a mean squared error objective) the input-to-output mapping of the underlying MLP from the transformer. The key difference between the transcoder MLP and the under- lying MLP is that the transcoder MLP is much wider, and its hidden layer is trained to be sparse. Thus, transcoders and JSAEs take fundamentally different approaches. 
Each transcoder latent tells us ‘there is com- putation in the MLP related to [concept].’ By comparison, JSAEs learn a pair of SAEs (which have mostly interpretable latents) and sparse connections between them. At a con- ceptual level, JSAEs tell us that ‘this feature in the MLP’s output was computed using only these few input features’. Ultimately, we believe that the JSAE approach, grounded in understanding how the SAE basis at one layer is mapped to the SAE basis at another layer, is potentially powerful and worth thoroughly exploring. Importantly, it is worth emphasizing that JSAEs and transcoders are asking fundamentally different questions, as can be seen in terms of e.g., differences in what they sparsify. As such, it is not, to our knowledge, possible to design meaningful quantitative comparisons, at least not without extensive future work to develop very general auto- interpretability methods for evaluating methods of under- standing MLP circuits. 2.3. Automated circuit discovery In “automated circuit discovery”, the goal is to isolate the causally relevant intermediate variables and connections between them necessary for a neural network to perform a given task (Olah et al., 2020). In this context, a circuit is defined as a computational subgraph with an interpretable function. The causal connections between elements are de- termined via activation patching, i.e., modifying or replacing the activations at a particular site of the model (Meng et al., 2022; Zhang & Nanda, 2023; Wang et al., 2022; Hanna et al., 2023). In some cases, researchers have identified sub- components of transformer language models with simple algorithmic roles that appear to generalize across models (Olsson et al., 2022). Conmy et al. (2023) proposed a means to automatically prune the connections between the sub-components of a neural network to the most relevant for a given task using ac- tivation patching. Given a choice of task (i.e., a dataset and evaluation metric), this approach to automated circuit dis- covery (ACDC) returns a minimal computational subgraph needed to implement the task, e.g., previously identified ‘circuits’ like Hanna et al. (2023). Naturally, this is compu- tationally expensive, leading other authors to explore using linear approximations to activation patching (Nanda, 2023; Syed et al., 2024; Kramár et al., 2024). Marks et al. (2024) later improved on this technique by using SAE latents as the nodes in the computational graph. In a sense, these methods are supervised because they re- quire the user to specify a task. Naturally, it is not feasible to manually iterate over all tasks an LLM can perform, so a fully unsupervised approach is desirable. With JSAEs, we take a step towards resolving this problem, although the architecture introduced in this paper initially only applies to a single MLP layer and not an entire model. Additionally, to the best of our knowledge, no automated circuit discovery algorithm sparsifies the computations inside of MLPs. There are also other approaches which focus on locating relevant computation in ML models by estimating the con- tribution of individual model components (Shah et al., 2024; Balasubramanian et al., 2024). 3. Background 3.1. Sparse autoencoders In an SAE, we have input vectors,x∈X=R m x . We want to approximate each vectorxby a sparse linear combination of vectors,s x ∈ S x =R n x . The dimension of the sparse vector,n x , is typically much larger than the dimension of the input vectorsm x (i.e. the basis is overcomplete). 
In the case of SAEs, we treat the vectors as inputs to an autoencoder with an encodere x :X → S x and a decoder 3 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations d x :S x →Xdefined by, s x =e x (x) =φ(W enc x x+b enc x )(1) ˆx=d x (s x ) =W dec x s x +b dec x (2) Here, the parameters are the encoder weightsW enc ∈ R n x ×m x , decoder weightsW dec ∈R m x ×n x , encoder bias b enc x ∈R n x , and decoder biasb dec x ∈R m x . The non- linearityφcan be, for instance, ReLU. These parameters are then optimized to minimize the difference betweenx andˆx, typically measured in terms of the mean squared error (MSE), while imposing anL 1 penalty on the latent activationss x to incentivize sparsity. 3.2. Automatic interpretability of SAE latents In order to compare the quality of different SAEs, it is desirable to be able to quantify how interpretable its latents are. A popular approach to quantifying interpretability at scale is to collect the examples that maximally activate a given latent, prompt an LLM to generate an explanation of the concept the examples have in common, and then prompt an LLM to predict whether a given prompt activates the SAE latent given the generated explanation. We can then score the accuracy of the predicted activations relative to the ground truth. There are several variants of this approach (e.g., Bills et al., 2023; Choi et al., 2024); in this paper, we use “fuzzing” where the scoring model classifies whether the highlighted tokens in prompts activate an SAE latent given an explanation of that latent (Paulo et al., 2024). 4. Methods The key idea with a Jacobian SAE is to train a pair of SAEs on the inputs and outputs of a neural network layer while additionally optimizing the sparsity of the Jacobian of the function that relates the input and output SAE latent acti- vations (Figure 1). In this paper, we apply Jacobian SAEs to multi-layer perceptrons (MLPs) of the kind commonly found in transformer language models (Radford et al., 2019; Biderman et al., 2023). 4.1. Setup Consider an MLP mapping fromx∈ Xtoy∈ Y, i.e., f:X → Yory=f(x). We can then train twok-sparse SAEs, one onxand the other ony. The resulting SAEs map from each ofxandyto corresponding sparse latents s x ∈ S x ands y ∈ S y , i.e.,s x =e x (x)ands y =e y (y), wheree x is the encoder of the first SAE ande y is the encoder of the second SAE. Each of these SAEs also has a decoder that maps from the sparse latents back to an approximation of the original vector:ˆx=d x (s x )andˆy=d y (s y ). We may now consider the functionf s :S X →S Y , which intuitively represents the function,f, but written in terms of the sparse bases learned by the SAE pair for the original vectorsxandy. Specifically, we definef s by f s =e y ◦f◦d x ◦τ k (3) where◦denotes function composition. Here,d x :S x →X maps the sparse latents given as input tof s to “dense” inputs. Then,f:X →Ymaps the dense inputs to dense outputs. Finally,e y :Y →S y maps the dense outputs to sparse out- puts. Note thatf s first applies the TopK activation function τ k to the sparse inputs,s x . Critically, withk-sparse SAEs, we produce the sparse inputs bys x =e x (x), implying that s x only hasknon-zero elements. In that setting, TopK does not change the inputs, i.e.s x =τ k (s x ), but it does affect the Jacobian and, in particular, allows us to compute it much more efficiently (Section 4.2). At a high level, we want the functionf s to be ‘sparse’, in the sense that each of its input dimensions (i.e. 
SAE latent activations) only affects a small number of its output dimensions, and each of its output dimensions only depends on a small number of its input dimensions. We quantify the sparsity off s in terms of its Jacobian matrix. The Jacobian off s is, in index notation: J f s ,i,j = ∂f s,i (s x ) ∂s x,j .(4) Intuitively, we can consider maximizing the sparsity of the Jacobian as minimizing the number of edges in the compu- tational graph connecting the input and output nodes (Fig- ure 1), i.e. maximizing the number of near-zero elements in the Jacobian matrix. We note that the Jacobian is not a perfect measure of the sparsity of the computational graph, but it is an accurate proxy (see Section 5.4 and Appendix B) while being computationally tractable. We simultaneously train two separate SAEs on the input and output of a transformer MLP with the objectives of low re- construction error and sparse relations between the separate SAE latents (via the Jacobian). We do not need to optimize for the sparsity of the latent activations via a penalty term in the loss function because we usek-sparse autoencoders, which keep only theklargest latent activations per token position. Hence, our loss function is L=MSE(x,ˆx) +MSE(y,ˆy) + λ k 2 n y X i=1 n x X j=1 |J f s ,i,j |(5) Here,kis the number of non-zero elements in the TopK activation function,n x ,n y are the dimensionalities of the latent spaces of the input and output SAEs, respectively, and λis the coefficient of the Jacobian loss term. We divide byk 2 because, as we will see later, there are at mostk 2 non-zero elements in the Jacobian. Finally, note that if we setλ= 0, then our objective effectively trains traditional SAEs for each ofxandyindependently. 4 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations 0.0050.0100.0150.020 Threshold 0% 25% 50% 75% Proportion of elements above threshold JSAEs Traditional SAEs Figure 2.JSAEs induce a much greater degree of sparsity in the elements of the Jacobian off s than traditional SAEs. The bars show the average proportion of Jacobian elements with absolute values above certain thresholds. At mostk×kelements can be nonzero, so we take 100% on the y-axis to meank×k. The average was taken across 10 million tokens. This example is from layer 15 of Pythia-410m. For layer 3 of Pythia-70m and layer 7 of Pythia-160m, see Figure 34, for more quantitative information on Jacobian sparsity across model sizes, layers, and hyperparameters see Figures 24, 25, and 26. We present further discussion of the sparsity of the Jacobian in Appendix F. 4.2. Making the Jacobian calculation tractable Computing the Jacobian naively (e.g., using an automatic differentiation package) is computationally intractable, as the full Jacobian has sizeB×n y ×n x whereBis the number of tokens in a training batchn x is the number of SAE latents for the input, andn y is the number of SAE latents for the output. Unfortunately, typical values are around1,000for Band around32,000forn x andn y (taking as an example a model dimension of1,000and an expansion factor of 32). Combined, this gives a Jacobian with around 1 trillion elements. This is obviously far too large to work with in practice, and our key technical contribution is to develop an efficient approach to working with this huge Jacobian. Our first insight is that for each element of the batch, we have an y ×n x Jacobian, wheren x andn y are around32,000. This is obviously far too large. 
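A quick back-of-the-envelope check of the sizes quoted above, using the illustrative values from the text; the restriction to the k active latents on each side, explained in the passage that follows, is what makes the computation tractable.

```python
# Back-of-the-envelope check of the Jacobian sizes discussed above (illustrative values).
B, n_x, n_y, k = 1_000, 32_000, 32_000, 32   # batch tokens, SAE widths, TopK sparsity

naive = B * n_y * n_x    # full Jacobian over a batch
active = B * k * k       # only rows/columns of active latents survive

print(f"naive:  {naive:.1e} elements")    # ~1.0e+12, i.e. "around 1 trillion"
print(f"active: {active:.1e} elements")   # ~1.0e+06, roughly six orders of magnitude smaller
```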
However, remember that we are interested in the Jacobian off s , so the input is the sparse SAE latent vector,s x and the output is the sparse SAE latent vector,s y . Importantly, as we are usingk-sparse SAEs, onlykelements of the input and output are “on” for any given token. As such, we really only care about thek×kelements of the Jacobian off s , corresponding to the inputs and outputs that are “on”. This reduces the size of the Jacobian by around six orders of magnitude, and renders the computation tractable. However, to make this work formally, we need all elements of the Jacobian corresponding to “off” elements of the input and output to be zero. This is where theτ k in the definition off s becomes important. Specifically, theτ k ensures that the gradient of Text is in German "von" "Berlin" Text about Nazi Germany "Austria" "Kle" "Pf" "sch" Common tokens in German text Place names in German-speaking countries Figure 3.JSAEs allow us to locate the "input features" of each feature computed by the MLP. For instance, in Pythia-410m, the MLP at layer 15 is computing the feature "this text is in German". JSAEs discover the inputs which the MLP uses to decide whether this feature should be on or off. These inputs correspond to tokens frequently found in German text, place names in German-speaking countries, and text about Nazi Germany. See Appendix C for details. f s wrt any of the inputs that are “off” is zero. Withoutτ k , the Jacobian could be non-zero for any of the inputs, even if changing those inputs would not make sense, as it would give more thankelements being “on” in the input, and thus could not be produced by the k-sparse SAE. Our second insight was that computing the Jacobian by automatic differentiation would still be relatively inefficient, e.g., requiringkbackward passes. Instead, for standard GPT-2-style MLPs, we noticed that an extremely efficient Jacobian formula can be derived by hand, requiring only three matrix multiplications and along with a few pointwise operations. We present this derivation in Appendix A. With these optimizations in place, training a pair of JSAEs takes about twice as long as training a single standard SAE. We measured this by training ten of each model on Pythia- 70m with an expansion factor of 32 for 100 million tokens on an RTX 3090. The average training durations were 72mins for a pair of JSAEs and 33 mins for a traditional SAE, with standard deviations below 30 seconds for both. 5. Results Our experiments were performed on LLMs from the Pythia suite (Biderman et al., 2023), the figures in the main text contain results from Pythia-410m unless otherwise specified. We trained on 300 million tokens withk= 32and an expan- sion factor of64for Pythia-410m and32for smaller models. We reproduced all our experiments on multiple models and found the same qualitative results (see Appendix E). 5 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations 0.4 0.6 0.8 Cosine Similarity 0 0.2 0.4 0.6 0.8 Explained Variance 0.5 1 ·10 −2 Mean Squared Error 10 −3 10 −2 10 −1 10 0 10 1 10 2 10 3 0 0.2 0.4 0.6 0.8 Jacobian Loss Coefficient Cross-Entropy Loss Score 10 −3 10 −2 10 −1 10 0 10 1 10 2 10 3 2 4 ·10 4 Jacobian Loss Coefficient Num. Dead Features Input SAEOutput SAE 10 −3 10 −2 10 −1 10 0 10 1 10 2 10 3 0 200 400 600 Jacobian Loss Coefficient Abs. Jacobian Values>0.01 Jacobian Figure 4.Reconstruction quality, model performance preservation, and sparsity metrics against the Jacobian loss coefficient. 
JSAEs trained on layer 7 of Pythia-160m with expansion factor64andk= 32; see Figure 26 for layer 3 of Pythia-70m. Recall that the maximum number of non-zero Jacobian values isk 2 = 1024. In accordance with Figure 5, all evaluation metrics degrade for values of the coefficient above 1. See Appendix E for details of the evaluation metrics. 5.1. Jacobian sparsity, reconstruction quality, and auto-interpretability scores First, we compared the Jacobian sparsity for standard SAEs and JSAEs. Note that, unlike with SAE latent activations, there is no mechanism for producing exact zeros in the Jacobian elements corresponding to active latents. Hence, we consider the number of near-zero elements rather than the number of exact zeros. To quantify the difference in sparsity between the two, we looked at the proportion of the elements of the Jacobian above a particular threshold when aggregating over 10 million tokens (Figure 2). Here, we found that JSAEs dramatically reduced the number of large elements of the Jacobian relative to traditional SAEs. We also note that the Jacobians are not only sparse on each individual token, but also when averaged across a large number of tokens (see Figure 36 in the appendix). Importantly, the degree of sparsity depends on our choice of the coefficientλof the Jacobian loss term. Therefore, we trained multiple JSAEs with different values of this parameter. As we might expect, for small values ofλ, i.e., little incentive to sparsify the Jacobian, the input and output SAEs perform similarly to standard SAEs (Figure 4 blue lines), including in terms of the variance explained by the reconstructed activation vectors and the increase in the cross- entropy loss when the input activations are replaced by their reconstructions. Unsurprisingly, asλgrows larger and the Jacobian loss term starts to dominate, our evaluation metrics degrade. Interestingly, this degradation happens almost entirely in the output SAE rather than the input SAE — we leave it to future work to investigate this phenomenon further. Critically, Figure 4 suggests there is a ‘sweet spot’ of the λhyperparameter where the SAE quality metrics remain reasonable, but the Jacobian is much sparser than for stan- dard SAEs. To further investigate this trade-off, we plotted a measure of Jacobian sparsity (the proportion of elements of the Jacobian above 0.01) against the average cross-entropy (Figures 4, 5, and 29). We found that there is indeed a sweet spot where the average cross-entropy is only slightly worse than a traditional SAE, while the Jacobian is far sparser. For Pythia 410m (Figure 5) this value is aroundλ= 0.5, whereas for Pythia-70m, it is aroundλ= 1(Figure 29). We choose this value of the Jacobian coefficient (i.e.λ= 0.5for Pythia-410m in the main text, andλ= 1for Pythia-160m in the Appendix) in other experiments. We also measure the interpretability of JSAE latents using the automatic interpretability pipeline developed by Paulo et al. (2024) and compare this to traditional SAEs. We find that JSAEs achieve similar interpretability scores (Figure 6). 6 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations 0.50.60.70.8 Average cross-entropy score (1 = perfect reconstruction) 0 250 500 Number of Jacobian elements above 0.01 1 0.5 0.1 3 0.3 10 0.2 0.01 0.7 0.05 0.001 300 Figure 5.The trade-off between reconstruction quality and Jaco- bian sparsity as we vary the Jacobian loss coefficient. Each dot represents a pair of JSAEs trained with a specific Jacobian coeffi- cient. 
The value ofλis included for some points. We can see that a coefficient of roughlyλ= 0.5is optimal for Pythia-410m with k= 32. Note that the CE loss score is the average of the CE loss scores of the pre-MLP JSAE and the post-mlp JSAE. Measured on layer 15 of Pythia-410m, similar charts with a wider range of models and metrics can be found in Figures 27, 28, and 29. 05 101520 Layer 0% 50% 100% Auto-interp score JSAEs (input SAE) Traditional SAEs (input SAE) JSAEs (output SAE) Traditional SAEs (output SAE) Figure 6.Automatic interpretability scores of JSAEs are very simi- lar to traditional SAEs. Measured on all odd-numbered layers of Pythia-410m using the “fuzzing” scorer from Paulo et al. (2024). For all layers of Pythia-70m see Figure 37. 5.2. Max-activating examples of JSAEs Next, we interpreted the "max-activating" examples of JSAEs in order to verify that JSAEs can locate semanti- cally meaningful computational units. Namely, we took the latents of the output SAEiwhich have large Jacobian values when averaging across a wide distribution of text. Then for each output SAE latenti, we found the 10 input SAE latentsjwhich have the largest average Jacobian el- ementsJ f s ,i,j . We find that these combinations are often highly interpretable. For example, as shown in Figure 3, the very first output latent of layer 15 of Pythia-410m as sorted by average Jacobian value corresponds to "this text is in German". We find that it is computed as a function of input latents corresponding to: •Tokens which frequently appear in German text, such 0.0050.0100.0150.020 Threshold 0% 20% 40% 60% 80% Proportion of elements above threshold Traditional SAEs (Randomized LLM) Traditional SAEs JSAEs (Randomized LLM) JSAEs Figure 7.Jacobians are substantially more sparse in pre-trained LLMs than in randomly initialzied transformers. This holds both when you actively optimize for Jacobian sparsity with JSAEs, and when you don’t optimize for it and use traditional SAEs. The figure shows the proportion of Jacobian elements with absolute values above certain thresholds. At mostk 2 elements can be nonzero, we therefore takek 2 to be 100% on the y-axis. Jacobians are signifi- cantly more sparse in pre-trained transformers than in randomly re-initialized transformers. This shows that Jacobian sparsity is, at least to some extent, connected to the structures that LLMs learn during training. This stands in contrast to recent work by Heap et al. (2025) showing that traditional SAEs achieve roughly equal auto-interpretability scores on randomly initialized transformers as they do on pre-trained LLMs. Measured on layer 15 of Pythia- 410m, for layer 3 of Pythia-70m see Figure 38. Averaged across 10 million tokens. as "Pf", "sch", "Kle", and "von" • Names of places where people speak German, such as "Berlin" or "Austria" •Words and phrases related to the Third Reich, such as "Nazi", "concentration camp", "Hitler", and "Holo- caust" For a few of handpicked examples, see Appendix C. A large number of examples which are not handpicked is available at tinyurl.com/jsaes-qualitative. 5.3. Performance on re-initialized transformers To confirm that JSAEs are extracting information about the complex learned computation, we considered a form of control analysis inspired by Heap et al. (2025). Specifically, we would expect that trained transformers have carefully learned specific, structured computations while randomly initialized transformers do not. 
Thus, a possible desider- atum for tools in mechanistic interpretability is that they 7 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations 0 2 0 1 s y, j 05 s x, i 0 1 Linear JumpReLU Other 0% 20% 40% 60% 80% Proportion of scalar functions in f s Traditional SAEs JSAEs 0.00.2 Jacobian element (abs. value) 0.0 0.2 0.4 0.6 Change in downstream latent (abs. value) Linear JumpReLU Other (a)(b)(c) Figure 8.The functionf s , which combines the decoder of the first SAE, the MLP, and the encoder of the second SAE, is mostly linear. Specifically, the vast majority of scalar functions going froms x,j tos y,i are linear. (a) Examples of linear, JumpReLU, and other functions relating individual input SAE latents and output SAE latents. See Figure 9 for more examples. (b) For the empirically observeds x and randomly selectedi,j(of those corresponding to active SAE latents), the vast majority of scalar functions froms x,j tos y,i are linear. For details see Appendix B. The proportion of linear function also noticeably increases with JSAEs compared to traditional SAEs, meaning that JSAEs induce additional linearity inf s . (c) Because the vast majority of functions are linear, the Jacobian usually precisely predicts the change observed in the output SAE latent when we make a large change to the input SAE latent’s value (namely subtracting 1, note that the empirical median value ofs x,j is2.5). Each dot corresponds to an(s x,j ,s y,i )pair. For 97.7% of pairs (across a sample size of 10 million) their Jacobian value nearly exactly predicts the change we see in the output SAE latent when making large changes to the input SAE latent’s activation, i.e.|∆s y,i |≈|J f s ,ij |. The scatter plot shows a randomly selected subset of 1,000(s x,j ,s y,i )pairs. For further details see Appendix B. Measured on layer 15 of Pythia-410m, for layer 3 of Pythia-70m see Figure 39, for the linearity results on other models and hyperparameters see Figures 15, 16, and 17. ought to work substantially better when analyzing the com- plex computations in trained LLMs than when applied to LLMs with randomly re-initialized weights. This is pre- cisely what we find. Specifically, we find that the Jacobians for trained networks are always substantially sparser than the corresponding random trained network, and this holds for both traditional SAEs and JSAEs (Figure 7). Further, the relative improvement in sparsity from the traditional SAE to the JSAE is much larger for trained than random LLMs, again indicating that JSAEs are extracting structure that only exists in the trained network. Note that we also see that for traditional SAEs, there is a somewhat more sparse Jaco- bian for the trained than randomly initialized transformer. This makes sense: we would hope that the traditional SAE basis is somewhat more aligned with the computation (as expressed by a sparse Jacobian) than we would expect by chance. However, it turns out that without a “helping hand” from the Jacobian sparsity term, the alignment in a tradi- tional SAE is relatively small. Thus, Jacobian sparsity is a property related to the complex computations LLMs learn during training, which should make it substantially useful for discovering the learned structures of LLMs. 5.4.f s is mostly linear Importantly, the Jacobian is a local measure. Thus, strictly speaking, a near-zero element of the Jacobian matrix implies only that a small change to the input SAE latent does not affect the corresponding output SAE latent. 
It may, however, still be the case that a large change to the input SAE latent would change the output SAE latent. We investigated this question and found thatf s is usually approximately linear in a wide range and is often close to linear. Specifically, of the scalar functions relating individual input SAE latentss x,j to individual output SAE latentss y,i , the vast majority are linear (Figure 8b). This is important because, for any linear function, its local slope is completely predictive of its global shape, and therefore, a near-zero Jacobian element implies a near-zero causal relationship. For the scalar functions which are not linear, we frequently observed they have a JumpReLU structure 1 (Erichson et al., 2019). Notably, a JumpReLU is linear in a subset of its input space, so even for these scalar functions the first derivative is still an accurate measure within some range ofs x,j values. It is also worth 1 By JumpReLU, we mean any function of the formf(x) = aJumpReLU(bx+c) . Recall thatJumpReLU(x) =xifx > d and0otherwise.a,b,c,d∈Rare constants. 8 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations noting that with JSAEs, the proportion of linear functions is noticeably higher than with traditional SAEs, so at least to a certain extent, JSAEs induce additional linearity in the MLP. To confirm these results, we plotted the Jacobian against the change of output SAE latents y,i as we change the input SAE latents x,j by subtracting1(Figure 8c) 2 . We found that 97.7% of the time,|∆s y,i | ≈ |J f s ,ij |. For details see Appendix B. While these results are strongly suggestive, we would caution that it is difficult to interpret them definitively as we are not evaluating the reconstruction error for a linear model fitted to the input-output relationship for the MLP latents. 6. Discussion We believe JSAEs are a promising approach for discover- ing computational sparsity and understanding the reasoning of LLMs. We would also argue that an approach like the one we introduced is in some sense necessary if we want to ‘reverse-engineer’ or ‘decompile’ LLMs into readable source code. It is not enough that our variables (e.g., SAE features) are interpretable; they must also be connected in a relatively sparse way. To illustrate this point, imagine a Python function that takes as input 5 arguments and returns a single variable, and compare this to a Python function that takes 32,000 arguments. Naturally, the latter would be nearly impossible to reason about. Discovering computa- tional sparsity thus appears to be a prerequisite for solving interpretability. It is also important that the mechanisms for discovering computational sparsity be fully unsupervised rather than requiring the user to manually specify the task being analyzed. There are existing methods for taking a specific task and finding the circuit responsible for imple- menting it, but these require the user to specify the task first (e.g. as a small dataset of task-relevant prompts and a metric of success). They are thus ‘supervised’ in the sense that they need a clear direction from the user. Naturally, it is not feasible to manually iterate over all tasks an LLM may be performing, so a fully unsupervised approach is needed. JSAEs are the first step in this direction. Naturally, JSAEs in their current form still have important limitations. They currently only work on MLPs, and for now, they only operate on a single layer at a time rather than discovering circuits throughout the entire model. 
Our initial implementation also works on GPT-2-style MLPs, while most LLMs from the last few years tend to use GLUs (Dauphin et al., 2017; Shazeer, 2020), though we expect it to be fairly easy to extend our setup to GLUs. Additionally, our current implementation relies on the TopK activation function for efficient batching; TopK SAEs can sometimes encourage high-density features, so it may be desirable to 2 For reference, the median value ofs x,j without any interven- tions is2.5. generalize our implementation to work with other activation functions. These are, however, problems that can be ad- dressed relatively straightforwardly in future work, and we would welcome correspondence from researchers interested in addressing them. A pessimist may argue that partial derivatives (and, there- fore, Jacobians) are merely local measures. A small partial derivative tells you that if you slightly tweak the input la- tent’s activation, you will see no change to the output latent’s activation, but it may well be the case that a large change to the input latent’s activation will lead to a large change in the output latent. Thankfully, at least in MLPs, this is not quite the case. As we show in Section 5.4,f s is approximately linear, and the size of the elements of the Jacobian nearly perfectly predicts the change you see in the output latent when you make a large change to the input latent. For a lin- ear function, a first-order derivative at any point is perfectly predictive of the relationship between the input and the out- put, and thus, at least for the fraction off s that is linear, Jacobians perfectly measure the computational relationship between input and output variables. We further discuss this in Appendix B. Additionally, as we showed in Section 5.3, Jacobian sparsity is much more present in trained LLMs than in randomly initialized ones, which indicates that it does correspond in some way to structures that were learned during training. At a high level, a sparse computational graph necessarily implies a sparse Jacobian, but a sparse Jacobian does not in and of itself imply a sparse computa- tional graph. But all of these results make it seem likely that Jacobian sparsity is a good approximation of computational sparsity, and when combined with the fact that we have now developed efficient ways of computing them at scale, this leads us to believe that JSAEs are a highly useful approach. We would, however, still invite future work to further in- vestigate the degree to which Jacobians, and by extension JSAEs, capture the structure we care about when analyzing LLMs. 7. Conclusion We introduced Jacobian sparse autoencoders (JSAEs), a new approach for discovering sparse computation in LLMs in a fully unsupervised way. We found that JSAEs induce spar- sity in the Jacobian matrix of the function that represents an MLP layer in the sparse basis found by JSAEs, with minimal degradation in the reconstruction quality and downstream performance of the underlying model and no degradation in the interpretability of latents. We demonstrated that the computation found by JSAEs is often highly interpretable, allowing us to see not only the concepts computed by MLPs, but also the "input concepts" which are used to compute each "output concept". 
We also found that Jacobian sparsity is substantially greater in pre-trained LLMs than in ran- 9 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations domly initialized ones suggesting that Jacobian sparsity is indeed a proxy for learned computational structure. Lastly, we found that Jacobians are a highly accurate measure of computational sparsity due to the fact that the MLP in the JSAE basis consists mostly of linear functions relating input to output JSAE latents. Acknowledgements The authors wish to thank Callum McDougall and Euan Ong for helpful discussions. We also thank the contribu- tors to the open-source mechanistic interpretability tooling ecosystem, in particular the authors of SAELens (Bloom et al., 2024), which formed the backbone of our codebase. The authors wish to acknowledge and thank the financial support of the UK Research and Innovation (UKRI) [Grant ref EP/S022937/1] and the University of Bristol. This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bris- tol - http://w.bristol.ac.uk/acrc/. We would like to thank Dr. Stewart for funding for GPU resources. Impact Statement The work presented in this paper advances the field of mech- anistic interpretability. Our hope is that interpretability will prove beneficial in making LLMs safer and more robust in ways ranging from better detection of model misuse to editing LLMs to remove dangerous capabilities. Author contribution statement Conceptualization was done by LF and LA. Derivation of an efficient way to compute the Jacobian was done by LF and LA. Implementation of the training codebase was done by LF. The experiments in Jacobian sparsity, auto- interpretability, reconstruction quality, and approximate lin- earity off s were done by LF. Qualitative examples of the computations found by JSAEs were done by TL. LA and CH provided supervision and guidance throughout the project. The text was written by LF, LA, TL, and CH. Figures were created by LF and TL with advice from LA and CH. References Balasubramanian, S., Basu, S., and Feizi, S. Decomposing and interpreting image representations via text in vits be- yond clip, 2024. URLhttps://arxiv.org/abs/ 2406.01583. Balcells, D., Lerner, B., Oesterle, M., Ucar, E., and Heimersheim, S. Evolution of sae features across lay- ers in llms, 2024. URLhttps://arxiv.org/abs/ 2410.08869. Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and Wal, O. V. D. Pythia: A Suite for Analyz- ing Large Language Models Across Training and Scal- ing. InProceedings of the 40th International Confer- ence on Machine Learning, p. 2397–2430. PMLR, July 2023. URLhttps://proceedings.mlr.press/ v202/biderman23a.html. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W.Language models can explain neurons in language models, May 2023. URLhttps: //openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html. Bloom, J., Tigges, C., and Chanin, D. SAELens.https: //github.com/jbloomAus/SAELens, 2024. Braun, D., Taylor, J., Goldowsky-Dill, N., and Sharkey, L. Identifying Functionally Important Features with End-to- End Sparse Dictionary Learning, May 2024. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., and Askell, A. 
Towards Monosemanticity: Decom- posing Language Models With Dictionary Learning, 2023. URLhttps://transformer-circuits. pub/2023/monosemantic-features. Brinkmann, J., Wendler, C., Bartelt, C., and Mueller, A. Large language models share representations of latent grammatical concepts across typologically diverse lan- guages, 2025.URLhttps://arxiv.org/abs/ 2501.06346. Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. Thread: Circuits.Distill, 5(3), March 2020. ISSN 2476-0757. doi: 10.23915/distill.00024. Choi, D., Huang, V., Meng, K., Johnson, D. D., Steinhardt, J., and Schwettmann, S. Scaling Automatic Neuron De- scription, October 2024. URLhttps://transluce. org/neuron-descriptions. Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards Automated Circuit Discovery for Mechanistic Interpretability.Advances in Neural Information Processing Systems, 36:16318– 16352, December 2023. Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse Autoencoders Find Highly Inter- pretable Features in Language Models, October 2023. 10 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Lan- guage modeling with gated convolutional networks, 2017. URLhttps://arxiv.org/abs/1612.08083. Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders Find Interpretable LLM Feature Circuits, June 2024. Erichson, N. B., Yao, Z., and Mahoney, M. W. Jumprelu: A retrofit defense strategy for adversarial attacks, 2019. URLhttps://arxiv.org/abs/1904.03750. Farrell, E., Lau, Y.-T., and Conmy, A. Applying sparse autoencoders to unlearn knowledge in language mod- els, 2024. URLhttps://arxiv.org/abs/2410. 19278. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, December 2020. Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders, June 2024. Hanna, M., Liu, O., and Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abili- ties in a pre-trained language model.Advances in Neural Information Processing Systems, 36:76033–76060, De- cember 2023. Heap, T., Lawson, T., Farnik, L., and Aitchison, L. Sparse autoencoders can interpret randomly initialized transform- ers, 2025. URLhttps://arxiv.org/abs/2501. 17727. Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization, January 2017. URLhttp://arxiv. org/abs/1412.6980. arXiv:1412.6980 [cs]. Kissane, C., Krzyzanowski, R., Bloom, J. I., Conmy, A., and Nanda, N. Interpreting attention layer outputs with sparse autoencoders, 2024. URLhttps://arxiv. org/abs/2406.17759. Kramár, J., Lieberum, T., Shah, R., and Nanda, N. Atp*: An efficient and scalable method for localizing llm behaviour to components, 2024. URLhttps://arxiv.org/ abs/2403.00745. Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., and Barez, F. Sparse autoencoders reveal universal feature spaces across large language models, 2024. URLhttps: //arxiv.org/abs/2410.06981. Lawson, T., Farnik, L., Houghton, C., and Aitchison, L. Residual Stream Analysis with Multi-Layer SAEs, Octo- ber 2024. Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. 
Gemma Scope: Open Sparse Autoen- coders Everywhere All At Once on Gemma 2, August 2024. Makelov, A. Sparse Autoencoders Match Supervised Fea- tures for Model Steering on the IOI Task. InICML 2024 Workshop on Mechanistic Interpretability, June 2024. URLhttps://openreview.net/forum? id=JdrVuEQih5. Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, March 2024. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locat- ing and Editing Factual Associations in GPT.Advances in Neural Information Processing Systems, 35:17359– 17372, December 2022. Nanda,N.AttributionPatching:Activa- tionPatchingAtIndustrialScale,February 2023.URLhttps://w.neelnanda. io/mechanistic-interpretability/ attribution-patching. O’Brien, K., Majercak, D., Fernandes, X., Edgar, R., Chen, J., Nori, H., Carignan, D., Horvitz, E., and Poursabzi- Sangde, F. Steering Language Model Refusal with Sparse Autoencoders, November 2024. Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom In: An Introduction to Circuits. Distill, 5(3), March 2020. ISSN 2476-0757. doi: 10. 23915/distill.00024.001. Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context Learning and Induction Heads, September 2022. Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automati- cally Interpreting Millions of Features in Large Language Models, October 2024. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multi- task Learners, 2019. URLhttps://cdn.openai. com/better-language-models/language_ models_are_unsupervised_multitask_ learners.pdf. 11 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramar, J., Shah, R., and Nanda, N. Im- proving Sparse Decomposition of Language Model Ac- tivations with Gated Sparse Autoencoders. InICML 2024 Workshop on Mechanistic Interpretability, June 2024a. URLhttps://openreview.net/forum? id=Ppj5KvzU8Q. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. Jumping Ahead: Im- proving Reconstruction Fidelity with JumpReLU Sparse Autoencoders, July 2024b.URLhttp://arxiv. org/abs/2407.14435. arXiv:2407.14435 [cs]. Shah, H., Ilyas, A., and Madry, A. Decomposing and editing predictions by modeling model computation, 2024. URL https://arxiv.org/abs/2404.11534. Sharkey, L., Braun, D., and Millidge, B. Taking features out of superposition with sparse autoencoders, December 2022. Shazeer, N. Glu variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202. Spies, A. F., Edwards, W., Ivanitskiy, M. I., Skapars, A., Räuker, T., Inoue, K., Russo, A., and Shanahan, M. Transformers use causal world models in maze- solving tasks, 2024. URLhttps://arxiv.org/ abs/2412.11867. Syed, A., Rager, C., and Conmy, A. Attribution Patching Outperforms Automated Circuit Discovery. In Belinkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., and Chen, H. (eds.),Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, p. 407–416, Miami, Florida, US, November 2024. 
Association for Computational Linguistics. doi: 10.18653/v1/2024.blackboxnlp-1.25. Templeton, A., Batson, J., Jermyn, A., and Olah, C. Pre- dicting Future Activations, January 2024a. URLhttps: //transformer-circuits.pub/2024/ jan-update/index.html#predict-future. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDi- armid, M., Tamkin, A., Durmus, E., Hume, T., Mosconi, F., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, May 2024b.URLhttps: //transformer-circuits.pub/2024/ scaling-monosemanticity/index.html. Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, November 2022. Yun, Z., Chen, Y., Olshausen, B., and LeCun, Y. Trans- former visualization via dictionary learning: contextual- ized embedding as a linear superposition of transformer factors. In Agirre, E., Apidianaki, M., and Vuli ́ c, I. (eds.), Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, p. 1–10, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.deelio-1.1. Zhang, F. and Nanda, N. Towards Best Practices of Ac- tivation Patching in Language Models: Metrics and Methods. InThe Twelfth International Conference on Learning Representations, October 2023. URLhttps: //openreview.net/forum?id=Hf17y6u9BC. 12 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations A. Efficiently computing the Jacobian A simple form for the Jacobian of the functionf s =e y ◦f◦d x ◦τ k , which describes the action of an MLP layerfin the sparse input and output bases, follows from applying the chain rule. Note that here, the subscriptsf s ,e y , etc. denote the function in question rather than vector or matrix indices. For the GPT-2-style MLPs that we study, the components off s are: 1.TopK. This function takes sparse latentss x and outputs sparse latents ̄ s x . Importantly,s x = ̄ s x . This step makes the backward pass of the Jacobian computation more efficient but does not affect the forward pass. ̄ s x =τ k (s x )(6) 2.Input SAE Decoder. This function takes sparse latents ̄ s x and outputs dense MLP inputsˆx: ˆx=d x ( ̄ s x ) =W dec x ̄ s x +b dec x (7) 3.MLP. This function takes dense inputsˆxand outputs dense outputsy: z=W 1 ˆx+b 1 ,y=W 2 φ MLP (z) +b 2 (8) whereφ MLP is the activation function of the MLP (e.g., GeLU in the case of Pythia models). 4.Output SAE Encoder. This function takes dense outputsyand outputs sparse latentss y : s y =e y (y) =τ k W enc y y+b enc y (9) The JacobianJ f s ∈R n y ×n x for a single input activation vector has the following elements, in index notation: J f s ,ij = ∂s y,i ∂s x,j = X kℓmn ∂s y,i ∂y k ∂y k ∂z ℓ ∂z ℓ ∂ˆx m ∂ˆx m ∂ ̄s x,n ∂ ̄s x,n ∂s x,j (10) We compute each term like so: 1.Output SAE Encoder derivative: ∂s y,i ∂y k =τ ′ k X j W enc ij y j +b enc,i W enc y,ik = ( W enc y,ik ifi∈K 2 0otherwise (11) whereK 2 is the set of indices selected by the TopK activation functionτ k of the second (output) SAE. Importantly, the subscriptkdoes notindicate thek-th element ofτ k , whereas itdoesindicate thek-th column ofW enc y,ik . 
2. MLP derivatives:
$$\frac{\partial y_k}{\partial z_\ell} = W_{2,k\ell}\, \phi'_{\mathrm{MLP}}(z_\ell), \qquad \frac{\partial z_\ell}{\partial \hat{x}_m} = W_{1,\ell m} \tag{12}$$

3. Input SAE Decoder derivative:
$$\frac{\partial \hat{x}_m}{\partial \bar{s}_{x,n}} = W^{\mathrm{dec}}_{x,mn} \tag{13}$$

4. TopK derivative:
$$\frac{\partial \bar{s}_{x,n}}{\partial s_{x,j}} = \begin{cases} 1 & \text{if } j \in K_1 \\ 0 & \text{otherwise} \end{cases} \tag{14}$$
where $K_1$ is the set of indices (corresponding to SAE latents) that were selected by the TopK activation function $\tau_k$ of the first (input) SAE, which we explicitly included in the definition of $f_s$ above.

When we combine all the terms:
$$J_{f_s,ij} = \begin{cases} \sum_{k\ell m} W^{\mathrm{enc}}_{y,ik}\, W_{2,k\ell}\, \phi'_{\mathrm{MLP}}(z_\ell)\, W_{1,\ell m}\, W^{\mathrm{dec}}_{x,mj} & \text{if } i \in K_2 \wedge j \in K_1 \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

Let $W^{\mathrm{enc(active)}}_y \in \mathbb{R}^{k \times m_y}$ and $W^{\mathrm{dec(active)}}_x \in \mathbb{R}^{m_x \times k}$ contain the active rows and columns, i.e., the rows and columns corresponding to the $K_2$ or $K_1$ indices respectively. The Jacobian then simplifies to:
$$J^{(\mathrm{active})}_{f_s} = \underbrace{W^{\mathrm{enc(active)}}_y W_2}_{\mathbb{R}^{k \times d_{\mathrm{MLP}}}} \cdot \underbrace{\phi'_{\mathrm{MLP}}(z)}_{\mathbb{R}^{d_{\mathrm{MLP}} \times d_{\mathrm{MLP}}}} \cdot \underbrace{W_1 W^{\mathrm{dec(active)}}_x}_{\mathbb{R}^{d_{\mathrm{MLP}} \times k}} \tag{16}$$
where $d_{\mathrm{MLP}}$ is the hidden size of the MLP. Note that $J^{(\mathrm{active})}_{f_s}$ is of size $k \times k$, while the full Jacobian matrix $J_{f_s}$ is of size $n_y \times n_x$. However, $J^{(\mathrm{active})}_{f_s}$ contains all the nonzero elements of $J_{f_s}$, so it is all we need to compute the loss function to train Jacobian SAEs (Section 4.1).

A.1. JSAEs with GLUs

The equations above can be easily adapted to work with gated linear units (GLUs), which are significantly more common in modern LLMs than GPT-2-style MLPs. To do this, we modify the MLP equations like so:
$$g = W_g \hat{x} + b_g \tag{17}$$
$$s = \phi_{\mathrm{MLP}}(g) \tag{18}$$
$$h = W_1 \hat{x} + b_1 \tag{19}$$
$$z = h \odot s \tag{20}$$
$$y = W_2 z + b_2 \tag{21}$$
where $\odot$ is elementwise multiplication. We then modify the derivatives accordingly:
$$\frac{\partial y_k}{\partial z_\ell} = W_{2,k\ell} \tag{22}$$
$$\frac{\partial z_\ell}{\partial \hat{x}_m} = h_\ell \frac{\partial s_\ell}{\partial \hat{x}_m} + s_\ell \frac{\partial h_\ell}{\partial \hat{x}_m} \tag{23}$$
$$\frac{\partial h_\ell}{\partial \hat{x}_m} = W_{1,\ell m} \tag{24}$$
$$\frac{\partial s_\ell}{\partial g_\ell} = \phi'_{\mathrm{MLP}}(g_\ell) \tag{25}$$
$$\frac{\partial g_\ell}{\partial \hat{x}_m} = W_{g,\ell m} \tag{26}$$

Combining the terms again:
$$J_{f_s,ij} = \begin{cases} \sum_{k\ell m} W^{\mathrm{enc}}_{y,ik}\, W_{2,k\ell} \left(h_\ell\, \phi'_{\mathrm{MLP}}(g_\ell)\, W_{g,\ell m} + s_\ell\, W_{1,\ell m}\right) W^{\mathrm{dec}}_{x,mj} & \text{if } i \in K_2 \wedge j \in K_1 \\ 0 & \text{otherwise} \end{cases} \tag{28}$$

The Jacobian is then:
$$J^{(\mathrm{active})}_{f_s} = \underbrace{W^{\mathrm{enc(active)}}_y W_2}_{\mathbb{R}^{k \times d_{\mathrm{MLP}}}} \cdot \underbrace{\left(\mathrm{diag}\!\left(h \odot \phi'_{\mathrm{MLP}}(g)\right) W_g + \mathrm{diag}(s)\, W_1\right)}_{\mathbb{R}^{d_{\mathrm{MLP}} \times m_x}} \cdot \underbrace{W^{\mathrm{dec(active)}}_x}_{\mathbb{R}^{m_x \times k}} \tag{29}$$

B. $f_s$ is approximately linear

Consider the scalar function $f_{s,(i,j)}\big|_{s_x} : \mathbb{R} \to \mathbb{R}$ which takes as input the $j$-th latent activation of the first SAE (i.e., $s_{x,j}$) and returns as output the $i$-th latent activation of the second SAE (i.e., $s_{y,i}$), while keeping the other elements of the input vector fixed at the same values as $s_x$. In other words, this function captures the relationship between the $j$-th input SAE latent and the $i$-th output SAE latent in the context of $s_x$. Geometrically, we start off at the point $s_x$, move from it through the input space parallel to the $j$-th basis vector, and observe how the output of $f_s$ projects onto the $i$-th basis vector. Formally,
$$f_{s,(i,j)}\big|_{s_x}(x) = f_s\!\left(\psi(s_x, j, x)\right)_i \tag{30}$$
$$\psi(s_x, j, x)_k = \begin{cases} x & \text{if } k = j \\ s_{x,k} & \text{otherwise} \end{cases} \tag{31}$$

These are the functions shown in Figure 8a, of which the vast majority are linear (Figure 8b). As we showed in Figure 8c, the absolute value of a Jacobian element nearly perfectly predicts the change we see in the output SAE latent activation value when we make a large intervention on the input SAE latent activation. However, in the same figure, there is a small cluster of approximately 2.5% of samples where the Jacobian element is near zero but the change observed in the downstream feature is quite large. We proceed by exploring the cause behind this phenomenon.
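To make the two objects used in this appendix concrete, here is a minimal PyTorch sketch, not the authors' implementation. It assumes GPT-2-style MLP weights `W_1`, `b_1`, `W_2`, an output-SAE encoder matrix `W_enc_y`, an input-SAE decoder matrix `W_dec_x`, precomputed TopK index sets `K_1` and `K_2`, and a callable `f_s` implementing $e_y \circ f \circ d_x \circ \tau_k$; all of these names and shapes are assumptions for illustration. The first function evaluates the active $k \times k$ Jacobian of Equation 16 for a single token position; the second traces one scalar function $f_{s,(i,j)}\big|_{s_x}$ as defined in Equations 30–31.

```python
import torch

def active_jacobian(W_enc_y, W_dec_x, W_1, b_1, W_2, x_hat, K_1, K_2, phi_prime):
    # Equation 16: J_active = (W_enc_y[K_2] @ W_2) . diag(phi'(z)) . (W_1 @ W_dec_x[:, K_1])
    # Assumed shapes: W_enc_y (n_y, d_model), W_dec_x (d_model, n_x),
    # W_1 (d_mlp, d_model), W_2 (d_model, d_mlp), x_hat (d_model,).
    z = W_1 @ x_hat + b_1                            # MLP pre-activation, shape (d_mlp,)
    left = W_enc_y[K_2] @ W_2                        # (k, d_mlp): active encoder rows times W_2
    right = W_1 @ W_dec_x[:, K_1]                    # (d_mlp, k): W_1 times active decoder columns
    return left @ (phi_prime(z)[:, None] * right)    # (k, k): all nonzero Jacobian entries

def scalar_response(f_s, s_x, i, j, grid):
    # Equations 30-31: vary input latent j over `grid`, read off output latent i.
    outs = []
    for v in grid:
        s = s_x.clone()
        s[j] = v
        outs.append(f_s(s)[i].item())
    return torch.tensor(outs)
```

Only the $k$ active rows of $W^{\mathrm{enc}}_y$ and the $k$ active columns of $W^{\mathrm{dec}}_x$ enter the product, so the result is the small $k \times k$ matrix discussed above rather than the full $n_y \times n_x$ Jacobian.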
Note that each point in Figure 8 corresponds to a single scalar function $f_{s,(i,j)}\big|_{s_x}$ (a pair of latent indices). An expanded version of Figure 8 is presented in Figure 10. Importantly, we show the 'line', the top-left cluster, and outliers visible in Figure 8 in different colors, which we re-use in the following charts (Figures 11 and 12). It also includes 10K samples, compared to 1K in Figure 8c: as above, most samples remain on the line, but the greater number of samples makes the behavior of the top-left cluster and outliers clearer.

Figure 11 illustrates some examples of functions $f_{s,(i,j)}\big|_{s_x}$ taken from each category shown in Figure 8, i.e., the line, cluster, and outliers. The vast majority of functions belong to the line category and are typically either linear or akin to JumpReLU activation functions (which include step functions as a special case). By contrast, the minority of functions belonging to the cluster or outliers are typically also JumpReLU-like, except where the unmodified input latent activation is close to the point where the function 'jumps', so when we subtract an activation value of 1 from the input (as in Figures 8c and 10), this moves to the flat region where the output latent activation value is zero.

As we can see, the vast majority of these functions are either linear or JumpReLUs. Indeed, we verify this across the sample size of 10,000 functions and find that 88% are linear, 10% are JumpReLU (excluding linear, which is arguably a special case of JumpReLU), and only 2% are neither.³ This result is encouraging: for a linear function, the first-order derivative is constant, so its value (i.e., the corresponding element of the Jacobian) completely expresses the relationship between the input and output values (up to a constant intercept). For the 88% of these scalar functions that are linear, the Jacobian thus accurately captures the notion of computational sparsity that interests us, rather than serving only as a proxy. And for the 10% of JumpReLUs, the Jacobians still perfectly measure the computational change we observe when changing the input latent within some subset of the input space. While we expect the remaining 2% of scalar functions (Jacobian elements) to contribute only a small fraction of the computational structure of the underlying model, we preliminarily investigated their behavior. Figure 12 shows 12 randomly selected non-linear, non-JumpReLU $f_{s,(i,j)}\big|_{s_x}$ functions. Even though these functions are nonlinear, they are still reasonably close to being linear, i.e., their first derivative is still predictive of the change we see throughout the input space. Indeed, most of them are on the diagonal line in Figure 10.

Figure 9. Additional examples of scalar functions between $s_{x,j}$ and $s_{y,i}$. The top row shows linear functions, the middle row shows JumpReLU functions, and the bottom row shows other functions. Recall that linear functions constitute a majority of the functions we observe empirically and that using JSAEs instead of traditional SAEs further increases the proportion of linear functions.
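The paper reports these fractions, but the exact numerical criteria for labelling a sampled scalar function as linear, JumpReLU-like, or other are not spelled out in this appendix. The sketch below is one straightforward way such a categorization could be done with finite differences over the domain described in footnote 3; the thresholds, grid size, and function names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def categorize(f, lo=0.0, hi=5.0, n=200, tol=1e-3):
    """Label a scalar function f: R -> R as 'linear', 'jumprelu', or 'other'.

    Assumed criteria: 'linear' if discrete second differences vanish everywhere;
    'jumprelu' if the function is near zero before a single jump and affine after it.
    """
    xs = np.linspace(lo, hi, n)
    ys = np.array([f(x) for x in xs])
    d2 = np.diff(ys, 2)                        # discrete second differences
    if np.all(np.abs(d2) < tol):
        return "linear"
    k = np.argmax(np.abs(d2))                  # candidate jump location
    mask = np.ones_like(d2, dtype=bool)
    mask[max(k - 1, 0):k + 2] = False          # ignore the points around the jump
    flat_before = np.all(np.abs(ys[xs < xs[k]]) < tol)
    if np.all(np.abs(d2[mask]) < tol) and flat_before:
        return "jumprelu"
    return "other"
```

Applied to samples of $f_{s,(i,j)}\big|_{s_x}$ traced with a sweep like the `scalar_response` sketch above, a rule of this kind yields the three-way labelling used in Figure 9 and Figures 15–17, though the exact fractions would of course depend on the chosen tolerances.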
Figure 10. An expanded version of Figure 8c, measured on layer 3 of Pythia-70m. A scatter plot (x-axis: Jacobian element, absolute value; y-axis: change in downstream feature, absolute value) showing that values of Jacobian elements tend to be approximately equal to the change we see in the downstream feature when we modify the value of the upstream feature, namely when we subtract 1 from it. Each dot corresponds to an (input SAE latent, output SAE latent) pair. Unlike Figure 8c, this figure colors the dots depending on which cluster they belong to: blue for "on the line", green for "in the cluster", red for "outlier". Additionally, this figure contains 10,000 samples (rather than 1,000 as in Figure 8c), which allows us to see more of the outliers and edge cases, though at the cost of visually obfuscating the fact that 97.5% of the samples are on the diagonal line, 2.1% are in the cluster, and 0.4% are outliers.

Figure 11. A handful of $f_{s,(i,j)}\big|_{s_x}$ functions (downstream activation plotted against upstream activation) corresponding to the points in Figure 10. The color matches the group (and therefore the color) they were assigned in Figure 10. The red dashed vertical line denotes $s^{(l)}_{x,i}$, i.e., the activation value of the SAE latent before we intervened on it. Note that the functions are not selected randomly but rather hand-selected to demonstrate the range of functions. We will quantitatively explore what proportion of $f_{s,(i,j)}\big|_{s_x}$ functions have which structure in other figures.

Figure 12. A random selection of the non-linear, non-JumpReLU $f_{s,(i,j)}\big|_{s_x}$ functions. Note that non-linear, non-JumpReLU functions only constitute about 2% of $f_{s,(i,j)}\big|_{s_x}$ functions. Even though these functions are clearly somewhat non-linear, their slope does still change quite slowly for the most part, which means that a first-order derivative at any point in the function is still reasonably predictive of the function's behavior in at least some portion of the input space (though there are some rare exceptions). The color again matches the group (and therefore the color) they were assigned in Figure 10; the red dashed vertical line denotes $s^{(l)}_{x,i}$, i.e., the activation value of the SAE latent before we intervened on it.

Figure 13. Distribution of second-order derivatives of functions $f_{s,(i,j)}\big|_{s_x}$. Includes all functions, regardless of whether they are linear, JumpReLU, or neither.
For a version that only includes non-linear, non-JumpReLU functions, see Figure 14. (a) The mean of the second-order derivative over the region of the input space. (b) The mean of the absolute value of the second-order derivative over the region of the input space. (c) The maximum value the second-order derivative takes in the region of the input space. Note that we are approximating the second derivative by looking at changes over a very small region (specifically 0.005), i.e., we do not take the limit as the size of this small region goes to zero; this is important because derivatives which would otherwise be undefined or infinite become finite with this approximation and therefore can be shown on the histograms. Also, we note that the means and maxima are taken over the region of the input space in which SAE features exist; see footnote 3.

Figure 14. Distribution of second-order derivatives of functions $f_{s,(i,j)}\big|_{s_x}$. Unlike Figure 13, this figure only includes the subset of the functions that are neither linear nor JumpReLU-like. (a) The mean of the second-order derivative over the region of the input space. (b) The mean of the absolute value of the second-order derivative over the region of the input space. (c) The maximum value the second-order derivative takes in the region of the input space. Note that we are approximating the second derivative by looking at changes over a very small region (specifically 0.005), i.e., we do not take the limit as the size of this small region goes to zero; this is important because derivatives which would otherwise be undefined or infinite become finite with this approximation and therefore can be shown on the histograms. Also, we note that the means and maxima are taken over the region of the input space in which SAE features exist; see footnote 3.

Figure 15. The fractions of Jacobian elements that exhibit a linear relationship between the input and output SAE latent activations, a JumpReLU-like relationship, and an uncategorized relationship, as described in Section 5.4. Here, we consider Jacobian SAEs trained on the feed-forward network at different layers of Pythia-70m, 160m, and 410m with fixed expansion factors $R = 64$ and $k = 32$. We computed the fractions over 1 million samples.

Figure 16. The fractions of Jacobian elements that exhibit a linear relationship between the input and output SAE latent activations, a JumpReLU-like relationship, and an uncategorized relationship, as described in Section 5.4. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 3 of Pythia-70m (left) and layer 7 of Pythia-160m (right), with fixed expansion factors $R = 64$ and $k = 32$ and varying Jacobian loss coefficient (Section 4). We computed the fractions over 1 million samples.
Figure 17. The fractions of Jacobian elements that exhibit a linear relationship between the input and output SAE latent activations, a JumpReLU-like relationship, and an uncategorized relationship, as described in Section 5.4. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 3 of Pythia-70m with varying expansion factors (and hence numbers of latents; left) but fixed sparsities $k = 32$, and varying sparsities but fixed expansion factors $R = 64$ (Section 4). We computed the fractions over 1 million samples.

We can measure this more precisely by looking at the second-order derivative of $f_{s,(i,j)}\big|_{s_x}$. A zero second-order derivative across the whole domain would imply a linear function and, therefore, perfect predictive power of the Jacobian, while the larger the absolute value of the second-order derivative, the less predictive the Jacobian will be. This distribution is shown in Figure 13. The same distribution, which only includes the non-linear, non-JumpReLU functions, is shown in Figure 14. On average, the second derivative is extremely small for all features and effectively zero for the vast majority.

C. Qualitative examples of the computations discovered by JSAEs

A common approach to interpreting LLM components like neurons or SAE latents is to collect token sequences and the corresponding activations over a text dataset (e.g., Yun et al., 2021; Bills et al., 2023). For example, the greatest latent activations may be retained, or activations from different quantiles of the distribution over the dataset (Bricken et al., 2023; Choi et al., 2024; Paulo et al., 2024).

We determined the set of 'top' output SAE latent indices by collecting the mean absolute values of non-zero Jacobian elements over a text dataset and sorting the output latents in descending order. Then, for each output latent, we found the input SAE latents that were most strongly connected to the output latent, again by sorting the input latents in descending order of the mean absolute value of non-zero Jacobian elements over the dataset. Finally, for both the output and input latents, we collected the individual latent activations over text samples with a context length of 16 tokens, retaining samples where at least one token produced a non-zero activation for the SAE latent. We chose a short context length to conveniently display the examples in a table format, and display here the top eight examples for each latent index, sorting the examples in descending order of the maximum latent activation over its tokens.

Each of the following figures comprises a table for a single output SAE latent (in pink), and a series of tables for the input latents with the greatest influences on the output latent, as determined by the mean absolute value of non-zero Jacobian elements. Conceptually, one may consider each figure as describing a single 'function', where the output and input latents represent the function output and inputs, respectively. Each table within the figure of examples displays a list of at most 12 examples, each comprising 16 tokens; we exclude the end-of-sentence token for brevity. The values of non-zero Jacobian elements and the activations of the corresponding input and output SAE latent indices are indicated by the opacity of the background color for each token. We take the opacity to be the element or activation divided by the maximum value over the dataset, i.e., all the examples with a non-zero Jacobian element for a given pair of input and output SAE latent indices. For clarity, we report the maximum element or activation alongside the colored tokens.
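To make the ranking procedure above concrete, here is a minimal Python sketch, not the authors' code. It assumes an iterable `jacobian_stream` that yields, for each token position, the active $k \times k$ Jacobian together with the corresponding active input and output latent indices (for example, produced with the `active_jacobian` sketch in Appendix A); the names and shapes are assumptions for illustration.

```python
from collections import defaultdict

def rank_latent_pairs(jacobian_stream):
    """Rank output latents, and per output latent their input latents, by the mean
    absolute value of non-zero Jacobian elements over a dataset."""
    pair_sum, pair_cnt = defaultdict(float), defaultdict(int)
    out_sum, out_cnt = defaultdict(float), defaultdict(int)
    for J, in_idx, out_idx in jacobian_stream:       # J: (k, k); in_idx/out_idx: k indices
        for a, i in enumerate(out_idx):
            for b, j in enumerate(in_idx):
                v = abs(float(J[a][b]))
                if v > 0.0:                          # only non-zero elements enter the means
                    pair_sum[(i, j)] += v
                    pair_cnt[(i, j)] += 1
                    out_sum[i] += v
                    out_cnt[i] += 1
    top_outputs = sorted(out_sum, key=lambda i: out_sum[i] / out_cnt[i], reverse=True)
    top_inputs = {
        i: sorted((j for (o, j) in pair_sum if o == i),
                  key=lambda j: pair_sum[(i, j)] / pair_cnt[(i, j)],
                  reverse=True)
        for i in top_outputs
    }
    return top_outputs, top_inputs
```

The per-pair means computed this way correspond to the 'mean over its non-zero values' quoted in the panel captions below, and sorting the input latents by those means gives the ranks quoted there.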
³ Note that we are testing whether functions are linear or JumpReLUs only in the region of input space within which SAE activations exist. In particular, this means that we are excluding negative numbers. More specifically, the domain within which we test the function's structure is $[0, \max(5, s^{(l)}_{x,i} + 1)]$. In 92% of cases, $s^{(l)}_{x,i} + 1 < 5$; the median $s^{(l)}_{x,i}$ is 2.5.

(a) The top 12 examples that produce the maximum latent activations for the output SAE latent with index 34455.

(b) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 39503. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 15130 tokens, and has a mean of 3.966×10⁻¹ (rank 0 for the output SAE latent) and a standard deviation of 7.743×10⁻² over its non-zero values.

(c) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 3387. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 10355 tokens, and has a mean of 3.437×10⁻¹ (rank 1 for the output SAE latent) and a standard deviation of 2.873×10⁻² over its non-zero values.

(d) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 41811. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 764 tokens, and has a mean of 1.619×10⁻¹ (rank 2 for the output SAE latent) and a standard deviation of 8.654×10⁻³ over its non-zero values.

(e) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 32619. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 316 tokens, and has a mean of 1.518×10⁻¹ (rank 3 for the output SAE latent) and a standard deviation of 1.193×10⁻² over its non-zero values.

(f) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 63157. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 84 tokens, and has a mean of 1.479×10⁻¹ (rank 4 for the output SAE latent) and a standard deviation of 8.717×10⁻³ over its non-zero values.

(g) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 63657. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 195 tokens, and has a mean of 1.387×10⁻¹ (rank 5 for the output SAE latent) and a standard deviation of 6.091×10⁻³ over its non-zero values.

(h) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 7969. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 331 tokens, and has a mean of 1.322×10⁻¹ (rank 6 for the output SAE latent) and a standard deviation of 1.176×10⁻² over its non-zero values.

(i) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 18964. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 3502 tokens, and has a mean of 1.209×10⁻¹ (rank 7 for the output SAE latent) and a standard deviation of 2.073×10⁻² over its non-zero values.

(j) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 28112. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 200 tokens, and has a mean of 1.156×10⁻¹ (rank 8 for the output SAE latent) and a standard deviation of 8.429×10⁻³ over its non-zero values.

(k) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 4287. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 1820 tokens, and has a mean of 1.131×10⁻¹ (rank 9 for the output SAE latent) and a standard deviation of 2.500×10⁻² over its non-zero values.

(l) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 62769. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 48 tokens, and has a mean of 1.121×10⁻¹ (rank 10 for the output SAE latent) and a standard deviation of 7.225×10⁻³ over its non-zero values.

(m) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 14871. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 30 tokens, and has a mean of 1.095×10⁻¹ (rank 11 for the output SAE latent) and a standard deviation of 9.772×10⁻³ over its non-zero values.

(n) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 32693. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 24 tokens, and has a mean of 1.092×10⁻¹ (rank 12 for the output SAE latent) and a standard deviation of 9.177×10⁻³ over its non-zero values.

(o) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 30568. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 449 tokens, and has a mean of 1.049×10⁻¹ (rank 13 for the output SAE latent) and a standard deviation of 1.109×10⁻² over its non-zero values.

(p) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 47756. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 451 tokens, and has a mean of 1.040×10⁻¹ (rank 14 for the output SAE latent) and a standard deviation of 1.167×10⁻² over its non-zero values.

(q) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 59459. Paired with the output SAE latent with index 34455, the Jacobian element is non-zero for 1602 tokens, and has a mean of 1.036×10⁻¹ (rank 15 for the output SAE latent) and a standard deviation of 1.771×10⁻² over its non-zero values.

Figure 18. The top 12 examples that produce the maximum latent activations for the output SAE latent with index 34455, and the input SAE latents with which the mean values of the corresponding Jacobian elements are greatest. The Jacobian SAE pair was trained on layer 15 of Pythia-410m with an expansion factor of $R = 64$ and sparsity $k = 32$.
The examples were collected over the first 10K records of the English subset of the C4 text dataset with a context length of 16 tokens.

(a) The top 12 examples that produce the maximum latent activations for the output SAE latent with index 64386.

(b) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 23581. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 2755 tokens, and has a mean of 2.578×10⁻¹ (rank 0 for the output SAE latent) and a standard deviation of 2.428×10⁻² over its non-zero values.

(c) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 48028. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 3567 tokens, and has a mean of 2.417×10⁻¹ (rank 1 for the output SAE latent) and a standard deviation of 1.944×10⁻² over its non-zero values.

(d) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 11698. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 295 tokens, and has a mean of 1.344×10⁻¹ (rank 2 for the output SAE latent) and a standard deviation of 6.924×10⁻³ over its non-zero values.

(e) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 22804. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 895 tokens, and has a mean of 1.342×10⁻¹ (rank 3 for the output SAE latent) and a standard deviation of 9.149×10⁻³ over its non-zero values.

(f) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 12754. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 7 tokens, and has a mean of 1.168×10⁻¹ (rank 4 for the output SAE latent) and a standard deviation of 5.760×10⁻³ over its non-zero values.

(g) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 30912. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 175 tokens, and has a mean of 1.133×10⁻¹ (rank 5 for the output SAE latent) and a standard deviation of 6.893×10⁻³ over its non-zero values.

(h) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 57769. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 321 tokens, and has a mean of 8.286×10⁻² (rank 6 for the output SAE latent) and a standard deviation of 1.343×10⁻² over its non-zero values.

(i) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 42113. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 3 tokens, and has a mean of 7.971×10⁻² (rank 7 for the output SAE latent) and a standard deviation of 6.880×10⁻³ over its non-zero values.

(j) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 8827. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 21 tokens, and has a mean of 7.330×10⁻² (rank 8 for the output SAE latent) and a standard deviation of 5.439×10⁻³ over its non-zero values.

(k) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 21697. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 87 tokens, and has a mean of 6.570×10⁻² (rank 10 for the output SAE latent) and a standard deviation of 6.127×10⁻³ over its non-zero values.

(l) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 13110. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 61 tokens, and has a mean of 6.486×10⁻² (rank 11 for the output SAE latent) and a standard deviation of 6.027×10⁻³ over its non-zero values.

(m) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 26452. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 4 tokens, and has a mean of 6.316×10⁻² (rank 12 for the output SAE latent) and a standard deviation of 6.164×10⁻³ over its non-zero values.

(n) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 32153. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 100 tokens, and has a mean of 6.202×10⁻² (rank 13 for the output SAE latent) and a standard deviation of 5.347×10⁻³ over its non-zero values.

(o) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 56394.
Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 8 tokens, and has a mean of5.939×10 −2 (rank 14 for the output SAE latent) and a standard deviation of3.459×10 −3 over its non-zero values. 40 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Example tokensMax. activation NorwegianfjordsaroundStord,Norway.\ nReadmoreaboutour1.093×10 1 250mcrossingunderafjordbetweenKristiansundandAver1.089×10 1 coastinTurkey,cruisingtheFjordsofNorway,etc,1.033×10 1 zonesandscrambleuptothetopofthefjord.There1.001×10 1 60m1610mSognogFjordane,Norway.9.988 landsandislandsandthroughtheNorwegianfjordsandwillcontinuetooperate9.885 YouwillalsoexperienceafjordcruiseonthemightySogne9.601 fj80studisjustslightlylarger,soallIhadtodo9.451 steeringboxanyways,soIdecidedtoconverttofj80tie9.137 is57.5kg,Sagnefjorden,Norway,in9.027 \ n426AlksfjordjA~kelen1204m1138.791 kjvNoah:November29,Ethicaldilemmasoccurwhenasituation8.114 (p) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 36481. Paired with the output SAE latent with index 64386, the Jacobian element is non-zero for 36 tokens, and has a mean of5.820×10 −2 (rank 15 for the output SAE latent) and a standard deviation of4.891×10 −3 over its non-zero values. Figure 19.The top 12 examples that produce the maximum latent activations for the output SAE latent with index 64386, and the input SAE latents with which the mean values of the corresponding Jacobian elements are greatest. The Jacobian SAE pair was trained on layer 15 of Pythia-410m with an expansion factor ofR= 64and sparsityk= 32. The examples were collected over the first 10K records of the English subset of the C4 text dataset with a context length of 16 tokens. 41 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Example tokensMax. activation surgeintothecathode,producingacurrentthatdoestheall−importantwork6.720 sothatheavoidsdistributionofdesignatedproductthatapparentlydoesnotmeetlegalrequirements6.606 ,butalsoexpansivealienenvironmentsthatdotheirparttomaketheaudiencefeel6.584 of2016(88percent)wereopportunisticattacksthatdidnottargeta6.467 tomaketherapeuticclaims,werefoundtodoso.\ nUnfortunatelytheproposed6.406 .Childrenarethemostseverelyaffectedbypovertybecausetheydonothavethe6.331 hospitaladmissions;emergencyroomvisitsthatdonotresultinadmissionareexcluded.6.282 acourtorder.Proposalsrelatingtochildrenoftendonotneedto6.264 .Suitableexercisesforpregnantwomenarethosethatdonotstrainthelower6.255 ?\ nUnfortunately,thetraditionalChineseapproachtotraininginthemodernworlddoes6.240 alsorequestfatshamingtobemadeillegalbecauseitdoesnothaveany6.232 selectthe1dBdegradationtonoiseastheinterferencestandard,sinceitdoes6.221 (a) The top 12 examples that produce the maximum latent activations for the output SAE latent with index 60542. Output latent 60542 responds to a very specific use of the word “do”: its use as a pro-verb. In a pro-verb a simple verb stands in for another more complex one and here “do” is a shorthand for an action that can only be understood from the context, for example, in “were found to do so” the “to do so” stands in for “to make therapeutic claims”. Some of the inputs include very different uses of “do”, one for example deals with the “Done” in “Donegal”, an Irish county. However, another input includes a subtly different use of “do”: cases where “do” is used as an auxiliary, modifying another verb, as in “[t]his does not meet the requirements”. 
Clearly this circuit is creating a very fine distinction between different ways the word “do” can be used, a distinction we make in language comprehension, but one we would have trouble identifying or describing. Example tokensMax. activation YourinformationwillnotbestoredonHSBC’ssystemsifyoudo2.098×10 1 .Childrenarethemostseverelyaffectedbypovertybecausetheydonothavethe2.077×10 1 surgeintothecathode,producingacurrentthatdoestheall−importantwork2.056×10 1 .001)andfewerovertriagedchildrenwhodidnotrequireinpatientmanagement2.043×10 1 ,butalsoexpansivealienenvironmentsthatdotheirparttomaketheaudiencefeel2.032×10 1 usatyourowncostwithindaysofdelivery.Ifyoudonot2.027×10 1 thempostmenopausal.Theywerecomparedwithmorethan600womenwhodidnot2.014×10 1 despitetherisk.Whenpersecutioncame,theydidnotscatter.Theyremained2.001×10 1 levelsprotectpublichealthandwelfareiftheydonotexceed45dB.\ n1.997×10 1 sothatheavoidsdistributionofdesignatedproductthatapparentlydoesnotmeetlegalrequirements1.996×10 1 hospitaladmissions;emergencyroomvisitsthatdonotresultinadmissionareexcluded.1.984×10 1 .Suitableexercisesforpregnantwomenarethosethatdonotstrainthelower1.976×10 1 (b) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 21465. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 13107 tokens, and has a mean of2.529×10 −1 (rank 0 for the output SAE latent) and a standard deviation of5.392×10 −2 over its non-zero values. 42 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Example tokensMax. activation .Weexpectthemarkettodotheoppositeofwhattheindicatorsaresaying1.438×10 1 withthe2018electionslooming.\ nTheybelievethatTrumphasdoneagood1.313×10 1 onlythingthatshewantedintheworld,butshedidabadthing1.292×10 1 ledbyitsPresidentJimCorcoran,havedoneawonderfuljobof1.292×10 1 MBTAdoessoaswell.Andmanystatelawsdothesame.1.264×10 1 a"tote;"Ijustbuytheboxes.)Thesedoa1.261×10 1 can’ twingames,you’ reintrouble.\ nSaintsdida1.258×10 1 beexorbitant.Youneeddoyourownresearchtofindoutthe1.257×10 1 upforgrabs,it’salwaysJewsdoingthegrabbing!\ nHere1.251×10 1 yourmusclesdothework,goingatasnail’ spacedoesn’ t1.246×10 1 toworryaboutwhethertheBlackhawksaredoingtherightthingwithdefense1.244×10 1 You’redoingthesamethingonAmazonthroughsponsoredadsandprovingyourself1.239×10 1 (c) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 61756. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 2683 tokens, and has a mean of1.076×10 −1 (rank 1 for the output SAE latent) and a standard deviation of2.253×10 −2 over its non-zero values. Example tokensMax. activation Adugrandluxe,ilfaitcequ’onluidit,1.044×10 1 .ParamAsinformaciA3n,hagaclicenellazoab9.095 clasedeinglAsyporhacerlapresentaciA3n.AE8.587 quAhaces?\ nifyouwanttogobacktoaprevious6.694 sivousfaitedescodes!\ nlNGNh121NI15.713 co,dondeserealizalasegundaescenayfinalmente,5.181 intothesoftware.TheKartrainterfaceisfairelywelldesignedfor4.361 whichtheydrawdrinkingwater.\ nLegislatorsmustfindthefairest4.314 endoen.\ n’LivingwithFran’,vanafzondag4.312 .Dekachelsmakenveellawaai.Wever4.253 organisierenoftmalsihreeigenenGames.Estutunsle4.225 eenponerloshuevosenelinteriordelhuAsp4.202 (d) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 51331. 
Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 90 tokens, and has a mean of 8.908 × 10⁻² (rank 2 for the output SAE latent) and a standard deviation of 9.883 × 10⁻³ over its non-zero values.

(e) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 11694. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 540 tokens, and has a mean of 7.039 × 10⁻² (rank 3 for the output SAE latent) and a standard deviation of 1.967 × 10⁻² over its non-zero values.

(f) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 48418. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 91 tokens, and has a mean of 5.105 × 10⁻² (rank 4 for the output SAE latent) and a standard deviation of 9.043 × 10⁻³ over its non-zero values.
(g) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 32517. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 1 token, and has a mean of 5.015 × 10⁻² (rank 5 for the output SAE latent) and a standard deviation of 0.000 over its non-zero values.

(h) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 23968. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 2561 tokens, and has a mean of 4.760 × 10⁻² (rank 6 for the output SAE latent) and a standard deviation of 1.204 × 10⁻² over its non-zero values.

(i) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 19973.
Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 2691 tokens, and has a mean of 4.681 × 10⁻² (rank 7 for the output SAE latent) and a standard deviation of 9.146 × 10⁻³ over its non-zero values.

(j) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 13058. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 431 tokens, and has a mean of 4.528 × 10⁻² (rank 8 for the output SAE latent) and a standard deviation of 1.961 × 10⁻² over its non-zero values.

(k) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 56700. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 574 tokens, and has a mean of 4.375 × 10⁻² (rank 9 for the output SAE latent) and a standard deviation of 1.275 × 10⁻² over its non-zero values.
(l) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 46097. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 3589 tokens, and has a mean of 3.995 × 10⁻² (rank 10 for the output SAE latent) and a standard deviation of 2.127 × 10⁻² over its non-zero values.

(m) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 4510. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 133 tokens, and has a mean of 3.978 × 10⁻² (rank 11 for the output SAE latent) and a standard deviation of 1.006 × 10⁻² over its non-zero values.

(n) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 28695. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 53 tokens, and has a mean of 3.952 × 10⁻² (rank 12 for the output SAE latent) and a standard deviation of 7.511 × 10⁻³ over its non-zero values.
(o) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 26469. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 29 tokens, and has a mean of 3.831 × 10⁻² (rank 13 for the output SAE latent) and a standard deviation of 1.040 × 10⁻² over its non-zero values.

(p) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 14813. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 2 tokens, and has a mean of 3.811 × 10⁻² (rank 14 for the output SAE latent) and a standard deviation of 8.688 × 10⁻³ over its non-zero values.

(q) The top 12 examples that produce the maximum latent activations for the input SAE latent with index 41425. Paired with the output SAE latent with index 60542, the Jacobian element is non-zero for 17 tokens, and has a mean of 3.734 × 10⁻² (rank 15 for the output SAE latent) and a standard deviation of 8.725 × 10⁻³ over its non-zero values.

Figure 20. The top 12 examples that produce the maximum latent activations for the output SAE latent with index 60542, and the input SAE latents with which the mean values of the corresponding Jacobian elements are greatest. The Jacobian SAE pair was trained on layer 15 of Pythia-410m with an expansion factor of R = 64 and sparsity k = 32. The examples were collected over the first 10K records of the English subset of the C4 text dataset with a context length of 16 tokens.
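The example tables summarized above were produced by recording, for each latent of interest, the dataset contexts on which that latent's activation is largest. The following is a minimal sketch of such a collection loop; the `encode` callable, the iterator of (context, activation) pairs, and the heap bookkeeping are illustrative assumptions rather than the authors' pipeline.

```python
import heapq
import torch

def top_activating_examples(encode, activation_batches, latent_idx, top_n=12):
    """Collect the top_n (activation, context) pairs for one SAE latent.

    encode: maps activations (batch, seq, d_model) -> latents (batch, seq, n_latents).
    activation_batches: yields (contexts, activations), where contexts is a list of
    decoded strings, one per sequence. Both interfaces are assumptions for this sketch.
    """
    best = []  # min-heap of (activation_value, context_string)
    for contexts, acts in activation_batches:
        with torch.no_grad():
            latents = encode(acts)
        values = latents[..., latent_idx]  # (batch, seq) activations of the chosen latent
        flat = values.flatten()
        k = min(top_n, flat.numel())
        for idx in torch.topk(flat, k=k).indices.tolist():
            b, t = divmod(idx, values.shape[1])
            item = (float(values[b, t]), contexts[b])
            if len(best) < top_n:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
    return sorted(best, reverse=True)
```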
D. Training

Our training implementation is based on the open-source SAELens library (Bloom et al., 2024). We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch. Except where noted, we use a batch size of 4096 sequences, each with a context size of 2048. At a given time, we maintain 32 such batches of activation vectors in a buffer that is shuffled before training, which reduces variance in the training signal. We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters and a constant learning-rate schedule with 1% warm-up steps, 20% decay steps, and a maximum value of 5 × 10⁻⁴. Additionally, we use 5% warm-up steps for the coefficient of the Jacobian term in the training loss. We initialize the decoder weight matrix to the transpose of the encoder, and we scale the decoder weight vectors to unit norm at initialization and after each training step (Gao et al., 2024). Except where noted, we choose an expansion factor R = 32, keep the k = 32 largest latents in the TopK activation function of each of the input and output SAEs, and choose a coefficient of λ = 1 for the Jacobian term in the training loss.

D.1. Training signal stability

We initially considered the following setup:

$s_x = e_x(x), \quad \hat{x} = d_x(s_x), \quad y = f(\hat{x}), \quad s_y = e_y(y), \quad \hat{y} = d_y(s_y) \qquad (32)$

The problem with this arrangement is that the second SAE depends on an output from the first SAE. Since both SAEs are trained simultaneously, we found that this compromised training-signal stability: whenever the first SAE changed, the training distribution of the second SAE changed with it. Additionally, at the start of training, when the first SAE was not yet capable of outputting anything meaningful, the second SAE had no meaningful training data at all, which not only made it impossible for the second SAE to learn but also made the first SAE less stable via the Jacobian sparsity loss term. To address this problem, we instead used the following setup:

$s_x = e_x(x), \quad \hat{x} = d_x(s_x), \quad y = f(x), \quad s_y = e_y(y), \quad \hat{y} = d_y(s_y) \qquad (33)$

Importantly, we pass the actual pre-MLP activations x rather than the reconstructed activations x̂ into the MLP f. In addition to improving training stability, we believe this setup to be more faithful to the underlying model because both SAEs are trained on the unmodified activations that pass through the MLP.

E. Evaluation

We evaluated each of the input and output SAEs during training on ten batches of eight sequences, where each sequence has a context size of 2048, i.e., approximately 160K tokens. We computed the sparsity of the Jacobian, measured by the mean number of absolute values above 0.01 for a single token, separately after training. In this case, we collected statistics over 10 million tokens from the validation subset of the C4 text dataset. For reconstruction quality, we report the mean cosine similarity between input activation vectors and their autoencoder reconstructions, the explained variance (the MSE reconstruction error divided by the variance of the input activation vectors), and the MSE reconstruction error. For model performance preservation, we report the cross-entropy loss score, which is the increase in the cross-entropy loss when the input activations are replaced by their autoencoder reconstruction, divided by the increase in the loss when the input activations are ablated (set to zero).
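To make these definitions concrete, the following is a minimal sketch of the reconstruction metrics and the Jacobian sparsity count described above, assuming activation tensors of shape (n_tokens, d_model) and per-token k × k Jacobian blocks; it is an illustration of the definitions, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

def reconstruction_metrics(x, x_hat):
    """Reconstruction-quality metrics as defined in Appendix E.

    x, x_hat: (n_tokens, d_model) activation vectors and their reconstructions
    (the shapes are an assumption for this sketch).
    """
    cosine = F.cosine_similarity(x, x_hat, dim=-1).mean().item()
    mse = (x - x_hat).pow(2).mean().item()
    # Explained variance as defined here: MSE reconstruction error divided by
    # the variance of the input activation vectors.
    explained_variance = mse / x.var().item()
    return {"cosine": cosine, "mse": mse, "explained_variance": explained_variance}

def jacobian_sparsity_count(jacobian_blocks, threshold=0.01):
    """Mean number of Jacobian elements per token with absolute value above
    the threshold; jacobian_blocks is assumed to be shaped (n_tokens, k, k)."""
    return (jacobian_blocks.abs() > threshold).flatten(1).sum(-1).float().mean().item()
```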
For sparsity, we report the number of 'dead' latents that have not been activated (i.e., appeared in the k largest latents of the TopK activation function) within the preceding 10 million tokens during training, and the number of latents that have activated, on average, fewer than once per 1 million tokens during training.

Given an expansion factor of 64, k = 32, and a Jacobian loss coefficient of 1, i.e., fixed hyperparameters, we find that the reconstruction error and cross-entropy loss score are consistently better for the input SAE than for the output SAE. Additionally, we find that performance is generally poorer for the intermediate layers than for the early and late layers.

Figure 21. Reconstruction quality metrics for Jacobian SAEs trained on the feed-forward networks at every layer (residual block) of Pythia transformers (Pythia-70m, Pythia-160m, and Pythia-410m). The cosine similarity is taken between the input and reconstructed activation vectors, and the explained variance is the MSE reconstruction error divided by the variance of the input activations. For each SAE, the expansion factor is R = 64 and k = 32; the Jacobian loss coefficient is 1.

Figure 22. Model performance preservation metrics for Jacobian SAEs trained on the feed-forward networks at every layer (residual block) of Pythia transformers. The cross-entropy loss score is the increase in the cross-entropy loss when the input activations are replaced by their autoencoder reconstruction divided by the increase when the input activations are ablated (set to zero). For each SAE, the expansion factor is R = 64 and k = 32; the Jacobian loss coefficient is 1.

Figure 23. Sparsity metrics per layer for Jacobian SAEs trained on the feed-forward networks at every layer (residual block) of Pythia transformers. Recall that the L0 norm per token for each of the input and output SAEs is fixed at k by the TopK activation function. For each SAE, the expansion factor is R = 64 and k = 32; the Jacobian loss coefficient is 1.

Figure 24. Reconstruction quality, model performance preservation, and sparsity metrics against the number of latents. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 3 of Pythia-70m (model dimension 512) with k = 32. Recall that the maximum number of non-zero Jacobian values is k² = 1024. The reconstruction quality and cross-entropy loss score improve as the number of latents increases, and the number of dead features grows more quickly for the output SAE than the input SAE. See Appendix E for details of the evaluation metrics.
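As a concrete illustration of the dead-latent statistic reported in these figures, the sketch below tracks which latents have not appeared among the TopK active latents within a trailing window of tokens; the (n_tokens, k) layout of the TopK indices and the tracker interface are assumptions for this sketch, not the authors' implementation.

```python
import torch

class DeadLatentTracker:
    """Approximate count of 'dead' latents: those that have not appeared among
    the TopK active latents within the preceding window of tokens."""

    def __init__(self, n_latents, window=10_000_000):
        self.window = window
        self.tokens_seen = 0
        # Treat every latent as having fired at step 0, so nothing counts as
        # dead until at least `window` tokens have been processed.
        self.last_fired = torch.zeros(n_latents, dtype=torch.long)

    def update(self, topk_indices):
        """topk_indices: (n_tokens, k) indices of the active latents per token."""
        self.tokens_seen += topk_indices.shape[0]
        fired = topk_indices.flatten().unique()
        self.last_fired[fired] = self.tokens_seen

    def num_dead(self):
        return int((self.tokens_seen - self.last_fired > self.window).sum())
```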
Figure 25. Reconstruction quality, model performance preservation, and sparsity metrics against the number k of largest latents kept by the TopK activation function. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 3 of Pythia-70m with expansion factor R = 64. Recall that the maximum number of non-zero Jacobian values is k². The reconstruction quality and cross-entropy loss score improve as k increases, and the number of dead features decreases. See Appendix E for details of the evaluation metrics.

Figure 26. Reconstruction quality, model performance preservation, and sparsity metrics against the Jacobian loss coefficient. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 3 of Pythia-70m with expansion factor R = 64 and k = 32. Recall that the maximum number of non-zero Jacobian values is k² = 1024. In accordance with Figure 5, all evaluation metrics degrade for values of the coefficient above 1. See Appendix E for details of the evaluation metrics.

Figure 27. Pareto frontiers of the explained variance and cross-entropy loss score against different sparsity measures (the Jacobian L1 norm and the number of absolute Jacobian values above 0.005 and above 0.01) when varying the Jacobian loss coefficient. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 3 of Pythia-70m with expansion factor R = 64 and k = 32. Recall that the maximum number of (dead) latents is 32768 (64 times the model dimension 512), and the maximum number of non-zero Jacobian values is k² = 1024. See Appendix E for details of the evaluation metrics.

Figure 28. Pareto frontiers of the explained variance and cross-entropy loss score against different sparsity measures when varying the Jacobian loss coefficient. The coefficient has a relatively small impact on the reconstruction quality and sparsity of the input SAE, whereas it has a large effect on the sparsity of the output SAE and the elements of the Jacobian matrix. Here, we consider Jacobian SAEs trained on the feed-forward network at layer 7 of Pythia-160m with expansion factor R = 64 and k = 32. Recall that the maximum number of (dead) latents is 49152 (64 times the model dimension 768), and the maximum number of non-zero Jacobian values is k² = 1024. See Appendix E for details of the evaluation metrics.
Figure 29. The trade-off between reconstruction quality (average cross-entropy loss score; 1 = perfect reconstruction) and Jacobian sparsity (number of Jacobian elements above 0.01) as we vary the Jacobian loss coefficient. Each dot represents a pair of JSAEs trained with a specific Jacobian coefficient. Measured on layer 3 of Pythia-70m with k = 32.

We speculate that it is necessary to tune our hyperparameters for each layer individually to achieve improved performance; see, for example, Figures 26 and 4 for the variation of our evaluation metrics against the coefficient of the Jacobian loss term for individual layers of Pythia-70m and Pythia-160m.

F. More data on Jacobian sparsity

In Figure 23 we showed that Jacobians are much more sparse with JSAEs than with traditional SAEs, and we provided a representative example of what the Jacobians look like in each case. Some readers may object that this is not an apples-to-apples comparison: since JSAEs optimize for a lower L1 norm on the Jacobian, it may be that JSAEs merely induce Jacobians with smaller elements while the shape of their distribution stays the same. To address this criticism, the examples are L2 normalized; we provide un-normalized as well as L1-normalized versions of the example Jacobians in Figure 30. We also provide a histogram and a CDF of the distribution of absolute values of Jacobian elements in Figure 32, computed across 10 million tokens.

F.1. Jacobian norms

In this section, we address an objection we expect some readers will have to our measures of sparsity. Our main metric for sparsity is the percentage of elements with absolute values above certain small thresholds (e.g., Figure 2). However, one can imagine two distributions with the same degree of sparsity but vastly different results on this metric because of different standard deviations. For instance, imagine two Gaussian distributions, both with μ = 0 but with significantly different standard deviations, σ₁ ≫ σ₂. They would score very differently on our metric, yet their degrees of sparsity would not be meaningfully different (since sparsity requires a small handful of relatively large elements). Since our L1 penalty encourages the Jacobians to be smaller, it could be that they simply become more tightly clustered around 0.

However, this is not the case. We can measure this by looking at the "norms" of the Jacobian: we flatten the Jacobian, treat it as a vector, and compute its Lp norms. If the Jacobian were merely becoming smaller, we would expect all of its Lp norms to decrease at roughly the same rate. On the other hand, if the Jacobian is becoming sparser, we would expect its L1 and L2 norms to decrease while its L4, ..., L∞ norms, which depend more strongly on the presence or absence of a few large elements, stay roughly the same. We present these results in Figure 35; as we can see, the Jacobian does become slightly smaller, but most of the effect is the Jacobian becoming significantly sparser.
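A minimal sketch of this norm-based check, assuming the per-token Jacobian is available as a dense tensor, is as follows; the function name and the choice of p values are illustrative.

```python
import torch

def jacobian_lp_norms(jacobian, ps=(1, 2, 4, float("inf"))):
    """L_p norms of a flattened Jacobian, as used in Appendix F.1.

    A Jacobian that merely shrinks lowers all of these norms at roughly the
    same rate; one that becomes sparser lowers L1 and L2 while the high-order
    norms, dominated by the few largest elements, stay roughly the same.
    """
    flat = jacobian.flatten()
    return {p: torch.linalg.vector_norm(flat, ord=p).item() for p in ps}
```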
Figure 30. Comparison of Jacobians from traditional SAEs vs JSAEs, the same as Figure 2 but with different normalization: (a) not normalized; (b) L2 normalized. Measured on layer 15 of Pythia-410m.

Figure 31. Comparison of Jacobians from traditional SAEs vs JSAEs, the same as Figure 2 but with different normalization: (a) L1 normalized; (b) L2 normalized. Measured on layer 3 of Pythia-70m.

Figure 32. Further data showing that JSAEs induce much greater Jacobian sparsity than traditional SAEs. (a) A histogram of the absolute values of Jacobian elements in JSAEs versus traditional SAEs. JSAEs induce significantly more sparse Jacobians than standard SAEs, meaning that a relatively small number of input-output feature pairs explains a very large fraction of the computation being performed. Note that only the k × k elements corresponding to active latents are included in the histogram; the remaining elements are zero by definition both for JSAEs and for standard TopK SAEs. The histogram was collected over 10 million tokens from the validation subset of the C4 text dataset, which produced 10.24 billion feature pairs. (b) The cumulative distribution function of the absolute values of Jacobian elements, again demonstrating that JSAEs induce significantly more computational sparsity than traditional SAEs. Measured on layer 15 of Pythia-410m.

Figure 33. JSAEs induce a much greater degree of sparsity in the elements of the Jacobian than traditional SAEs. Identical to Figure 2 but measured on layer 3 of Pythia-70m.

Figure 34. Histograms showing the frequency of absolute values of non-zero Jacobian elements for different values of the coefficient of the Jacobian loss term, for (a) layer 3 of Pythia-70m and (b) layer 7 of Pythia-160m. As the coefficient increases, the frequency of larger values decreases, i.e., the Jacobian becomes sparser. We provide further details in Figure 32.
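The histograms in Figures 32 and 34 are taken over the absolute values of the active k × k Jacobian blocks; a minimal sketch of that computation, assuming the blocks are stacked into a (n_tokens, k, k) tensor and using illustrative bin settings, is below.

```python
import torch

def jacobian_abs_histogram(jacobian_blocks, bins=50, max_val=0.10):
    """Histogram of absolute Jacobian values over the active k-by-k blocks.

    jacobian_blocks: (n_tokens, k, k) tensor of active-latent Jacobian entries;
    entries outside these blocks are zero by construction and excluded.
    Returns bin edges and the frequency of each bin as a percentage.
    """
    values = jacobian_blocks.abs().flatten()
    counts = torch.histc(values, bins=bins, min=0.0, max=max_val)
    freq_percent = 100.0 * counts / values.numel()
    edges = torch.linspace(0.0, max_val, bins + 1)
    return edges, freq_percent
```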
Figure 35. Lp norms of the Jacobians (mean L2, L4, and L∞ norms) for traditional SAEs and JSAEs, on both the pre-trained and the randomized LLM. We measure these by flattening each Jacobian and treating it as a vector. These results imply that the Jacobians are in fact becoming more sparse, as opposed to merely becoming smaller (see Section F.1). Averaged across 1 million tokens, measured on layer 3 of Pythia-70m.

Figure 36. The Jacobians are not only sparse locally (i.e., on each token in each prompt) but also globally (i.e., when averaged across many tokens), much more so than with traditional SAEs. In particular, here we consider the full n_y × n_x Jacobian (i.e., not slicing based on the TopK), which we average across 10 million tokens, (1/N) Σ_{prompt, token} J, before computing its summary statistics. This is an important measure because it confirms that the connections found by JSAEs are sparse in a global sense, not just when conditioning on a specific model input. Measured on layer 15 of Pythia-410m. Note that the small numbers on the y-axis are due to the fact that, unlike in e.g. Figure 2, here we set 100% to be n_y × n_x rather than k × k. We also note that, for each element of the Jacobian, we average only over the tokens on which the corresponding output SAE latent is selected by the TopK activation function (i.e., when at least one element in that row of the Jacobian is non-zero); this is important because otherwise this measure would significantly conflate the sparsity of the Jacobian itself with the sparsity of the activations of each individual latent.

Figure 37. Automatic interpretability scores of JSAEs are very similar to those of traditional SAEs, for both the input and the output SAE. Measured on all layers of Pythia-70m using the "fuzzing" scorer from Paulo et al. (2024).

Figure 38. Jacobians are substantially more sparse in pre-trained LLMs than in randomly initialized transformers. This holds both when you actively optimize for Jacobian sparsity with JSAEs and when you do not optimize for it and use traditional SAEs. The figure shows the proportion of Jacobian elements with absolute values above certain thresholds. Identical to Figure 7 but measured on layer 3 of Pythia-70m.

Figure 39. The function f_s, which combines the decoder of the first SAE, the MLP, and the encoder of the second SAE, is mostly linear. (a) Example scalar functions from s_{x,i} to s_{y,j}. (b) The proportion of scalar functions in f_s classified as linear, JumpReLU, or other, for traditional SAEs and JSAEs. (c) The change in the downstream latent (absolute value) against the Jacobian element (absolute value). Identical to Figure 8 but measured on layer 3 of Pythia-70m.