Paper deep dive
Distribution-Aware Feature Selection for SAEs
Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah
Models: Pythia-160M
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/11/2026, 1:32:31 AM
Summary
The paper introduces Sampled-SAE, a distribution-aware feature selection method for Sparse Autoencoders (SAEs) that generalizes BatchTopK. By implementing a two-stage gating process—batch-level candidate selection followed by Top-K sparsification—Sampled-SAE mitigates the 'activation lottery' where rare, high-magnitude features dominate. Using scoring functions like L2-norm and Squared-L2, the method promotes consistent mid-frequency features, leading to improved performance in sparse probing and reduced feature absorption compared to standard BatchTopK.
Entities (5)
Relation Signals (3)
Sampled-SAE → evaluated on → Pythia-160M
confidence 95% · We train these on the sixth layer of Pythia-160M
Sampled-SAE → generalizes → BatchTopK
confidence 95% · we introduce Sampled-SAE, which generalizes BatchTopK through a two-stage gating process
Sampled-SAE → uses → L2-norm
confidence 90% · we score the columns (representing features) of the batch activation matrix (via L2 norm or entropy)
Cypher Suggestions (2)
Find all methods that generalize BatchTopK · confidence 90% · unvalidated
MATCH (m:Method)-[:GENERALIZES]->(b:Method {name: 'BatchTopK'}) RETURN m.name
List all scoring functions used by Sampled-SAE · confidence 90% · unvalidated
MATCH (s:Method {name: 'Sampled-SAE'})-[:USES]->(f:ScoringFunction) RETURN f.name
Abstract
Abstract: Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery," where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via $L_2$ norm or entropy), forming a candidate pool of size $K\ell$, and then apply Top-$K$ to select tokens across the batch from the restricted pool of features. Varying $\ell$ traces a spectrum between batch-level and token-specific selection. At $\ell=1$, tokens draw only from $K$ globally influential features, while larger $\ell$ expands the pool toward standard BatchTopK and more token-specific features across the batch. Small $\ell$ thus enforces global consistency; large $\ell$ favors fine-grained reconstruction. On Pythia-160M, no single value of $\ell$ is optimal across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.
Tags
Links
Full Text
47,038 characters extracted from source content.
Distribution-Aware Feature Selection for SAEs

Narmeen Oozeer (Martian, narmeen@withmartian.com), Nirmalendu Prakash (Singapore University of Technology and Design, email@gmail.com), Michael Lan (Martian, michael@withmartian.com), Alice Rigg (Independent, rigg.alice0@gmail.com), Amirali Abdullah (Thoughtworks, amir.abdullah@thoughtworks.com)

Abstract: Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery," where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via ℓ2 norm or entropy), forming a candidate pool of size Kℓ, and then apply Top-K to select tokens across the batch from the restricted pool of features. Varying ℓ traces a spectrum between batch-level and token-specific selection. At ℓ = 1, tokens draw only from K globally influential features, while larger ℓ expands the pool toward standard BatchTopK and more token-specific features across the batch. Small ℓ thus enforces global consistency; large ℓ favors fine-grained reconstruction. On Pythia-160M, no single value of ℓ is optimal across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.
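To make the selection schemes the abstract contrasts concrete, here is a minimal NumPy sketch (our own illustration, not the paper's implementation; function names are ours) of per-token TopK versus BatchTopK on a matrix of non-negative preactivations with tokens as rows:

```python
import numpy as np

def topk_per_token(Z, K):
    """Standard TopK: keep the K largest activations in each row (token)."""
    F = np.zeros_like(Z)
    idx = np.argsort(Z, axis=1)[:, -K:]  # indices of each row's top K
    np.put_along_axis(F, idx, np.take_along_axis(Z, idx, axis=1), axis=1)
    return F

def batch_topk(Z, K):
    """BatchTopK: keep the K*B largest activations anywhere in the batch,
    so per-token sparsity varies while averaging K active latents."""
    B, m = Z.shape
    F = np.zeros_like(Z)
    keep = np.argsort(Z, axis=None)[-K * B:]  # flat indices of top K*B
    F.flat[keep] = Z.flat[keep]
    return F
```

BatchTopK keeps the same total number of activations (K·B) but lets information-rich tokens claim more than K slots, which is exactly the flexibility that opens the door to the "activation lottery" described above.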
1 Introduction Sparse autoencoders (SAEs) have emerged as an essential tool for mechanistic interpretability, decomposing language model activations into sparse, interpretable features [Bricken et al., 2023, Cunningham et al., 2023, Gao et al., 2024, Marks et al., 2024a]. The recently proposed BatchTopK SAE [Bussmann et al., 2024] improves upon standard TopK by selecting features at the batch level rather than per-token, allowing variable sparsity across samples while maintaining average sparsity constraints. This modification achieves better reconstruction performance and feature activation density compared to standard TopK SAEs. However, BatchTopK’s batch-level selection still allows all features to compete equally, creating what we term an “activation lottery”—features with rare but extreme magnitudes dominate selection over consistent mid-frequency features. Recent work [Sun et al., 2024] shows that even high-frequency features (>10% activation) previously dismissed as uninterpretable represent meaningful concepts such as context position. This suggests that the middle frequency range—features that fire consistently but not strongly—may be particularly valuable for interpretability but are currently underutilized due to competition from rare, high-magnitude spikes. Figure 1: Comparison of BatchTopK to Sampled-SAE feature selection strategies: (a) BatchTopK selects the strongest activations across a batch, which can cause rare high-magnitude spikes to dominate, (b) Sampled-SAE introduces a candidate pool that filters out such rare spikes, ensuring more consistent feature usage across tokens. We introduce a new class of Sampled-SAE that specifically promotes frequently activating features per batch. We train these on the sixth layer of Pythia-160M [Biderman et al., 2023] using the first 25M tokens of The Pile [Gao et al., 2020]. Other training hyperparameters are in A.6. 
We also construct a controlled synthetic dataset to compare SAEs on learning disentangled features based on activation intensity and frequency. Features are partitioned based on activation frequency (low, high) and intensity (low, high), resulting in four buckets. Features are then generated by sparse superposition with small noise. Full details are in A.1. We evaluate the SAEs using SAEBench [Karvonen et al., 2025], including Automated Interpretability [Paulo et al., 2024], k-sparse probing [Gurnee et al., 2023], and Feature Absorption [Chanin et al., 2024]. In addition, we assess: (i) coverage of high-density features (active in ≥10% of tokens), (ii) uniqueness of features attributable to each scoring rule, (iii) cross-seed feature similarity, and (iv) cross-scoring function similarity.

2 Related Work

Sparse Autoencoders and Variants. SAEs decompose neural network activations into interpretable features [Bricken et al., 2023, Cunningham et al., 2023, Marks et al., 2024a]. Standard TopK SAEs enforce exactly k active features per token [Gao et al., 2024], while the recent BatchTopK [Bussmann et al., 2024] improves reconstruction by selecting features at the batch level, allowing variable per-token sparsity. However, BatchTopK still permits all features to compete equally, leading to what we term an "activation lottery" where rare, high-magnitude features dominate. Our work extends BatchTopK by introducing controlled feature pre-selection before batch-level sparsity.

Streaming and Online Column Selection. BatchTopK can be viewed as a streaming algorithm for feature selection, processing tokens in batches rather than requiring the full dataset. This connects to streaming matrix approximation and sketching algorithms Liberty [2013], Ghashami et al. [2016], Woodruff [2014]. In streaming column subset selection (CSS), one maintains important columns while processing rows online Cohen et al. [2016], Bhaskara et al. [2019].
BatchTopK essentially performs streaming column selection where each batch determines which features (columns) to keep active. Our work extends this by adding a pre-filtering step based on batch statistics, similar to importance sampling in streaming algorithms Cohen and Peng [2015]. While BatchTopK treats all features equally in the streaming selection, we bias the selection toward features with desirable properties by restricting the candidate pool. Column Subset Selection and Feature Importance. Our approach connects to the column subset selection (CSS) problem in numerical linear algebra Boutsidis et al. [2009], Mahoney and Drineas [2009]. Column L2 norms and squared L2 norms, which we employ as scoring functions, are fundamental in randomized matrix approximation Frieze et al. [2004], Drineas et al. [2006] and variance-based selection Mahoney and Drineas [2009]. While leverage scores provide stronger theoretical guarantees Mahoney [2011], Drineas et al. [2008], we focus on computationally efficient column norm methods. Future work could explore true leverage scores or ridge leverage scores Cohen et al. [2017] for SAE feature selection. Column Selection in Sparse Coding. While Krause and Cevher [2010] select columns from pre-designed dictionaries (wavelets, discrete cosine transform) for sparse coding, we perform analogous selection on learned SAE features. Both approaches use greedy selection based on column importance metrics—they use variance reduction, we use column norms—to identify which dictionary elements should be available for sparse reconstruction. Feature Frequency and Interpretability. Recent work reveals that feature frequency correlates with interpretability in unexpected ways. Sun et al. [2024] show that high-frequency features (>10% activation) previously dismissed as uninterpretable actually represent meaningful concepts like context position. Conversely, prior work on feature absorption Chanin et al. 
[2024] suggests that rare, high-magnitude features can dominate selection while being less interpretable. This motivates our focus on promoting consistent mid-frequency features through controlled candidate selection.

SAE Evaluation and Alignment. We evaluate our methods using established interpretability metrics: automated interpretability Paulo et al. [2024], k-sparse probing Gurnee et al. [2023], and absorption Chanin et al. [2024]. For comparing features across different SAE configurations, we adopt the Hungarian algorithm matching approach from Paulo and Belrose [2024], who showed that SAEs trained with different seeds learn partially overlapping feature sets.

Figure 2: Pareto frontier analyses of SAE architectures: (a) probing accuracy versus absorption; (b) AutoInterp scores versus FVU. While BatchTopK lies on the Pareto frontier of the autointerp versus FVU curve, it is outperformed by variants with lower ℓ values (especially L2-norm and Squared-L2) on the probing accuracy versus absorption fraction frontier.

Figure 3: Probing accuracy (top-5 SAE features) vs. fraction of variance unexplained (FVU) for SampledSAE under different candidate selection rules. Each panel shows a gating strategy with a distinct batch-level scoring method: (a) entropy, (b) Squared-L2, (c) L2-norm, and (d) uniform (random baseline). Points correspond to different candidate set expansion factors (ℓ). The black cross marks the BatchTopK baseline. Higher probing accuracy and lower FVU are both desirable: while BatchTopK performs best on reconstruction alone, distribution-aware gating strategies (especially Squared-L2 and L2-norm) often improve probing accuracy at modest increases in FVU.

Figure 4: Fraction of SAE features active on more than 10% of tokens vs. fraction of variance unexplained (FVU) under different candidate selection rules. Each panel shows a gating strategy with a distinct batch-level scoring method: (a) entropy, (b) Squared-L2, (c) L2-norm, and (d) uniform (random baseline).
Points correspond to different candidate set expansion factors (ℓ). The black cross marks the BatchTopK baseline. A higher fraction indicates that more features activate consistently across tokens rather than only on rare spikes. The Squared-L2 and L2-norm sampling strategies achieve a much higher number of frequently activating features compared to BatchTopK.

Figure 5: Automated Interpretability score vs. fraction of variance unexplained (FVU) for SampledSAE under different candidate selection rules. Each panel corresponds to a different batch-level scoring rule: (a) entropy, (b) Squared-L2, (c) L2-norm, and (d) uniform (random baseline). Points denote different candidate set expansion factors (ℓ); lower values on the x-axis indicate better performance (lower reconstruction error), and higher values on the y-axis represent more interpretable features. The black cross marks the BatchTopK baseline. Results show that distribution-aware scoring (especially Squared-L2 and L2-norm) achieves favorable trade-offs, attaining autointerp scores comparable to BatchTopK without excessively increasing reconstruction error. Along these two axes, we find that BatchTopK is optimal.

Figure 6: Absorption fraction vs. fraction of variance unexplained (FVU) for SampledSAE under different candidate selection rules. Each panel corresponds to a different batch-level scoring rule: (a) entropy, (b) Squared-L2, (c) L2-norm, and (d) uniform (random baseline). Points denote different candidate set expansion factors (ℓ), with lower values on both axes indicating better performance (lower reconstruction error and lower absorption, i.e., less feature collapse). The black cross marks the BatchTopK baseline. Results show that distribution-aware scoring (especially leverage and Squared-L2) achieves favorable trade-offs, reducing absorption without excessively increasing reconstruction error.
3 Sampled Sparse Autoencoders

We introduce Sampled-SAE, which generalizes BatchTopK through a two-stage gating process: (1) batch-level candidate selection based on feature scoring, followed by (2) Top-K sparsification within this restricted set. This decoupling enables control over which features compete for activation slots.

Architecture. Sampled-SAE augments BatchTopK with a batch-level sampling step. For a batch of B activations with hidden dimension n, arranged as X ∈ ℝ^{B×n}, compute encoder preactivations

Z = X W_enc + b_enc ∈ ℝ^{B×m}.  (1)

The m columns of Z represent features and the B rows represent input samples within the batch. Score features over the batch with a global scoring function s_φ (e.g., Squared-L2, entropy, or ℓ2), producing

q = s_φ(Z) ∈ ℝ^m,  (2)

and select a candidate pool of size Kℓ, using the feature scoring functions defined in Section 3.1, via

c = TopK(q, Kℓ) ∈ {0, 1}^m.  (3)

Mask the preactivations with the broadcasted pool 1_B c^⊤ and apply BatchTopK to obtain sparse codes

F = BatchTopK(Z ⊙ (1_B c^⊤), K) ∈ ℝ^{B×m},  (4)

then reconstruct and train with

X̂ = F W_dec + b_dec,  (5)

ℒ_Sampled(X) = ‖X − X̂‖_F² + α ℒ_aux(F).  (6)

Candidate Selection: We first compute batch-level scores s_j for each feature j and select the top ⌊ℓ·K⌋ features to form the candidate set S. The candidate pool expansion factor ℓ ∈ [1, m/K] controls the degree of filtering:
• ℓ = 1: only K features compete (most restrictive)
• ℓ = m/K: all m features compete (recovers BatchTopK)
• 1 < ℓ < m/K: partial filtering of rare features
Crucially, smaller ℓ values prevent rare, high-magnitude features from entering the candidate pool, promoting consistent mid-frequency features instead. BatchTopK is thus the special case ℓ = m/K, allowing all features, including rare spikes, to compete.
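The two-stage gating above can be sketched in a few lines of NumPy. This is a simplified illustration under our own naming and the assumption of non-negative preactivations with tokens as rows, not the paper's training code:

```python
import numpy as np

def sampled_sae_codes(Z, K, ell, score_fn=None):
    """Two-stage gating sketch: (1) restrict to a pool of K*ell features
    ranked by a batch-level score q = s_phi(Z); (2) apply batch-level
    Top-K, keeping the K*B largest masked activations (K active latents
    per token on average). Z has shape (B, m), features as columns."""
    B, m = Z.shape
    q = score_fn(Z) if score_fn else np.linalg.norm(Z, axis=0)  # default: L2
    pool = np.zeros(m, dtype=bool)
    pool[np.argsort(q)[-min(K * ell, m):]] = True  # candidate pool, eq. (3)
    Zp = np.where(pool[None, :], Z, 0.0)           # mask non-candidates
    F = np.zeros_like(Zp)
    keep = np.argsort(Zp, axis=None)[-K * B:]      # batch-level Top-K, eq. (4)
    F.flat[keep] = Zp.flat[keep]
    return F
```

With ell = 1 all tokens draw from the same K features; as ell grows toward m/K the pool stops filtering anything and the routine reduces to plain BatchTopK.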
3.1 Feature Scoring Functions

We evaluate four scoring strategies that capture different notions of feature importance:

ℓ2-norm: s_j = ‖Z_{:,j}‖_2 computes the column ℓ2 norm as a fast approximation to leverage scores from randomized matrix theory. This emphasizes features with consistent batch-level activation: a feature firing at magnitude 10 across 50% of samples scores higher than one firing at magnitude 100 on 5% of samples. This rewards stability over sporadic extreme activations.

Entropy: s_j = −Σ_{b=1}^{B} q_{bj} log(q_{bj} + ε), where q_{bj} = Z_{bj} / Σ_{b′} Z_{b′j}, prefers features with selective firing patterns. Low entropy indicates a feature concentrates its activation on specific inputs rather than firing uniformly. This rewards specialization: features that strongly activate for particular contexts while remaining quiet elsewhere.

Squared-L2: s_j = Σ_{b=1}^{B} Z_{bj}² + λ rewards total activation energy across the batch. Unlike the L2-norm, Squared-L2 uses squared magnitudes, making it sensitive to both frequency and high intensity. This favors features that either fire frequently at moderate strength or occasionally at very high strength. The ridge term λ provides numerical stability.

Uniform: s_j = const assigns equal scores to all features, resulting in random sampling. This baseline tests whether structured selection based on batch statistics improves over random feature subsampling. Even random sampling with small ℓ can improve some metrics by excluding rare features, though it lacks the systematic benefits of informed selection.

Each strategy creates different biases: L2-norm and Squared-L2 favor consistent mid-frequency patterns, entropy selects specialized features, and uniform provides an unbiased baseline. These differences become more pronounced at smaller ℓ values.
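A hedged sketch of the four scoring rules, written for an activation matrix Z of shape (B, m) with features as columns. The entropy rule implements the formula above exactly; whether low- or high-entropy features are preferred is left to the selection direction, and all names are ours:

```python
import numpy as np

def score_l2(Z):
    """Column L2 norm: rewards consistent batch-level activation."""
    return np.linalg.norm(Z, axis=0)

def score_squared_l2(Z, lam=1e-6):
    """Total activation energy per feature, plus a small ridge term lam."""
    return (Z ** 2).sum(axis=0) + lam

def score_entropy(Z, eps=1e-9):
    """Entropy of each feature's activation distribution over the batch:
    s_j = -sum_b q_bj * log(q_bj + eps), q_bj = Z_bj / sum_b' Z_b'j."""
    q = Z / (Z.sum(axis=0, keepdims=True) + eps)
    return -(q * np.log(q + eps)).sum(axis=0)

def score_uniform(Z, seed=0):
    """Random scores: the uniform (random subsampling) baseline."""
    return np.random.default_rng(seed).random(Z.shape[1])
```

A feature that fires identically on every sample attains the maximum entropy log(B), while a feature concentrated on one sample scores near zero, matching the "selective firing" intuition in the text.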
4 Synthetic Data Experiments

To understand how different sampling strategies affect feature recovery across the frequency-magnitude spectrum, we design controlled experiments with ground-truth features. We generate data X ∈ ℝ^{n×d} through sparse linear combinations of k > d dictionary features: X = S Aᵀ + ε, where A ∈ ℝ^{d×k} has minimized mutual coherence, S ∈ ℝ^{n×k} contains sparse codes, and ε is 20 dB SNR noise. We use n = 10,000 samples, d = 256, and k = 1024, forcing superposition since k > d. We partition features equally into four categories based on activation frequency (p) and magnitude (σ): LF+HA (p = 0.02, σ = 1.0), HF+HA (p = 0.20, σ = 1.0), LF+LA (p = 0.02, σ = 0.2), and HF+LA (p = 0.20, σ = 0.2). Each feature's support follows Bernoulli(p), with coefficients drawn from N(0, σ²). This creates an "activation lottery" scenario where rare high-magnitude features (LF+HA) would typically dominate selection in standard BatchTopK. The dataset statistics are covered in A.1.

We train both Sampled-SAEs and a BatchTopK-SAE on the generated dataset. The BatchTopK-SAE learns effectively, achieving a fraction of variance explained (FVE) of ≈ 0.99, while Sampled-SAEs perform substantially worse (see training plots in A.2). We are actively troubleshooting Sampled-SAE training instabilities. Once satisfactory reconstructions are obtained, we will evaluate:
1. Bucket recovery rate: the fraction of ground-truth features successfully recovered (match similarity ≥ 0.7), using Hungarian matching following [Paulo and Belrose, 2025].
2. Activation fidelity: the correlation between learned and true activation frequencies.

5 Real Data Experiments

We train a class of SAEs parametrized by ℓ, where BatchTopK is the special case in which the candidate pool contains all features, allowing us to evaluate the full spectrum from aggressive filtering (ℓ = 1) to unrestricted selection. Our implementation extends the dictionary learning codebase from Marks et al.
[2024b] (https://github.com/saprmarks/dictionary_learning), adding the candidate selection mechanism described in Section 3.1. We train on the sixth layer of Pythia-160M [Biderman et al., 2023] using the first 25M tokens of The Pile [Gao et al., 2020], with full training configurations detailed in Table 13. We evaluate each configuration along multiple interpretability axes using SAEBench Karvonen et al. [2025], as described below.

Feature Density (>10%): Following Sun et al. [2024], who show that high-frequency features (>10% activation) represent meaningful concepts like context position, we measure the proportion of SAE features that activate on more than 10% of tokens. SampledSAE variants, particularly the L2-norm and Squared-L2 scoring strategies, achieve substantially higher feature densities compared to BatchTopK (Figure 4). This suggests that distribution-aware candidate selection promotes consistent mid-frequency features over rare high-magnitude spikes. L2-norm and Squared-L2 can achieve 2-3× higher densities of frequently activating features while maintaining comparable reconstruction fidelity, indicating that the "activation lottery" problem in BatchTopK systematically underutilizes dense features.

Sparse Probing: Sparse probing evaluates whether SAEs isolate pre-specified concepts by identifying the k most relevant latents for each concept (e.g., sentiment) through comparing their mean activations on positive versus negative examples, then training linear probes on these top-k latents. We find that SampledSAEs with lower ℓ values (ℓ = 3-5) achieve better probing accuracy than BatchTopK while maintaining comparable reconstruction fidelity (Figure 3). L2-norm and Squared-L2 sampling strategies consistently outperform BatchTopK on this metric, likely because they promote consistent mid-frequency features over rare high-magnitude spikes that may represent less generalizable concepts.
Entropy sampling underperforms BatchTopK, while uniform sampling with high ℓ values trades reconstruction quality for improved concept detection.

Feature Absorption: Feature absorption is a phenomenon where sparsity incentivizes SAEs to learn undesirable feature representations for hierarchical concepts where A implies B: rather than learning separate latents for both concepts, the SAE learns a latent for A and a latent for "B except A" to improve sparsity. Our results show that distribution-aware sampling strategies, particularly L2-norm and Squared-L2, substantially reduce absorption compared to BatchTopK (Figure 6). This improvement suggests that preventing rare high-magnitude features from dominating selection helps maintain cleaner concept boundaries and reduces the gerrymandered latent patterns characteristic of feature absorption.

Automated Interpretability: Automated interpretability Paulo et al. [2024] uses an LLM-based judging framework in which a language model first proposes a "feature description" from activating examples, then constructs test sets by sampling sequences across different activation strengths along with control sequences; the LLM judge then uses the feature description to predict which sequences would activate the latent. While BatchTopK performs well on this metric, SampledSAE variants achieve comparable interpretability scores with only modest increases in reconstruction error (Figure 5). The similar performance across scoring strategies suggests that the automated interpretability metric may be less sensitive to the specific feature selection mechanism than task-specific metrics like probing and absorption.

Main Takeaway: The Pareto frontier analysis across different metric pairs reveals that no single SAE configuration dominates across all interpretability dimensions (Figure 2).
While BatchTopK lies on the Pareto frontier for automated interpretability versus reconstruction fidelity (FVU), it is dominated by SampledSAE variants on the probing accuracy versus absorption frontier. L2-norm and Squared-L2 scoring strategies achieve superior positions on the probing-absorption trade-off, simultaneously improving concept detection while reducing feature absorption, a combination that BatchTopK's unrestricted selection cannot achieve. This demonstrates that SAE selection requires careful consideration of the specific interpretability goals, as optimizing for reconstruction fidelity alone misses important advantages in concept isolation and feature disentanglement. We also compute additional metrics to compare our different scoring strategies:

Unique Features Discovered by SampledSAEs: To identify features uniquely captured by each architecture, we compute cross-architecture similarity between SampledSAE variants and BatchTopK features. For each feature pair (i, j), we measure similarity through two channels: decoder similarity s_dec(i, j), computed as the cosine similarity between L2-normalized decoder weight vectors, and semantic similarity s_text(i, j), based on the automated natural-language explanations generated for each feature by Automated Interpretability Paulo et al. [2024]. These explanations describe what concept each feature represents, which we embed using Sentence-Transformers (all-mpnet-base-v2) and compare via cosine similarity. These metrics are combined into an overall similarity score s_comb(i, j) = (s_dec(i, j) + s_text(i, j)) / 2, where s_comb ∈ [−1, 1]. For each feature i in one architecture, we identify its best match j* = argmax_j s_comb(i, j) in the BatchTopK SAE. Features with low best-match scores represent concepts uniquely captured by that SAE variant: they lack strong equivalents in the comparison architecture.
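The best-match computation can be sketched as follows, assuming decoder weight matrices and explanation embeddings with one row per feature (names are illustrative, not from the paper's code):

```python
import numpy as np

def best_matches(dec_a, dec_b, emb_a, emb_b):
    """For each feature in SAE A, find its best match in SAE B under the
    combined score s_comb = (s_dec + s_text) / 2, where both terms are
    cosine similarities (decoder directions, explanation embeddings)."""
    def cos_matrix(U, V):
        # Row-normalize, then all-pairs cosine similarity via a matmul.
        U = U / np.linalg.norm(U, axis=1, keepdims=True)
        V = V / np.linalg.norm(V, axis=1, keepdims=True)
        return U @ V.T
    s_comb = 0.5 * (cos_matrix(dec_a, dec_b) + cos_matrix(emb_a, emb_b))
    j_star = s_comb.argmax(axis=1)  # best match in B for each feature of A
    return j_star, s_comb[np.arange(len(j_star)), j_star]
```

Features whose returned best-match score is low (the paper reports typical values below 0.15 for its most unique features) have no strong counterpart in the comparison SAE.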
Tables 4, 6, 5, and 7 show the most unique features discovered by each SampledSAE variant (Entropy, L2-norm, Squared-L2, and Uniform) compared to BatchTopK, with best-match similarities typically below 0.15. These low similarity scores demonstrate that controlled candidate selection not only changes which features are selected but also discovers genuinely different interpretable structures in the activations. The selection strategies exhibit distinct biases in their highest-quality features (Tables 8–12): BatchTopK learns a mix of abstract domain-specific concepts and high-frequency tokens, L2-norm and Squared-L2 favor frequent compositional structures (variables, HTML tags, relational operators), while Entropy captures discriminative patterns across granularities and Uniform defaults to syntactic elements. These differences suggest each scoring function implicitly selects for different computational roles that features play in the network (see A.5 for detailed analysis).

Feature Similarity Across Seeds: SAEs are known to exhibit seed variance, learning different features under different random initializations [Paulo and Belrose, 2025]. We train Sampled-SAE with three seeds for each sampling strategy and quantify cross-seed agreement using mean max cosine similarity (MMCS; the mean, over features, of each feature's maximum cosine match [Braun et al., 2025]) across seeds. BatchTopK shows markedly higher cross-seed agreement than the other strategies (Table 1), indicating substantially more consistent features across seeds.

Feature Similarity Across Scoring Functions: Using BatchTopK as the reference, we obtain an MMCS of ≈ 0.72 when matching its learned features against each scoring-based SAE.

Table 1: Mean max cosine similarity (MMCS) for SAEs trained with different sampling methods and seeds (0, 1, and 2). Base = seed 0; compared against seeds 1 and 2. Reported as the mean over the two comparisons (n = 2); standard deviation is close to zero.
Sampling method   MMCS
Entropy           0.176
Squared-L2        0.176
ℓ2-norm           0.179
Uniform           0.157
BatchTopK         0.277

6 Discussion and Conclusion

SampledSAE provides a general framework for studying how features are selected in sparse autoencoders. By decoupling candidate selection from row-wise Top-K, we show that distribution-aware gating improves utilization and interpretability. Limitations include the added hyperparameter ℓ and the reliance on batch-level statistics. Future directions include adaptive scoring functions, learned gating mechanisms, and connections to mixture-of-experts routing.

Limitations
• Unproven interpretability of mid-activation, high-frequency features. While we hypothesize that such features might be interpretable compared to spiky features, we do not contrast their interpretability against that of high-activating, less dense features to fully validate this hypothesis.
• Lack of theoretical guarantees. We do not provide mathematical proofs that SampledSAE yields better feature approximations than BatchTopK, nor adversarial cases where BatchTopK performs poorly.
• Trade-off between density and interpretability. SampledSAE recovers denser features that improve probing accuracy and reduce feature absorption scores, but we find no evidence that this improves interpretability.
• Reliance on autointerp as a proxy. We evaluate interpretability primarily through autointerp, while noting its known limitations.
• Unproven choice of scoring function. We find no strong justification for any particular scoring function. Our experiments bias toward norm- and entropy-based measures, while we hypothesize that functions with favorable geometric properties could lead to more unique features and low-dimensional approximations of feature-space geometry. Developing theory for such sampling strategies remains future work.
• Restricted scope of training.
We train SAEs only on a single layer of a single model due to computational constraints, limiting the generality of our findings.
• Introduction of a new hyperparameter. The axis ℓ enables exploration of trade-offs relevant to interpretability, but adds another hyperparameter to optimize.

References

Bhaskara et al. [2019] Aditya Bhaskara, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Online matrix completion and online robust PCA. In Advances in Neural Information Processing Systems, 2019. Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023. Boutsidis et al. [2009] Christos Boutsidis, Michael W Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 968–977, 2009. Braun et al. [2025] Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, and Lee Sharkey. Interpretability in parameter space: Minimizing mechanistic description length with attribution-based parameter decomposition. arXiv preprint arXiv:2501.14926, 2025. Bricken et al. [2023] Trenton Bricken, Adly Adly, Kabir Ahuja, Eric Aschenbrenner, Ella Atkinson, Yuntao Bai, Anders Baum, Jordan Camron, Newton Chen, Tom Conerly, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html. Bussmann et al. [2024] Bart Bussmann, Joseph Jermyn, and Nix Robertson. BatchTopK: A simple improvement for TopK sparse autoencoders. GitHub repository, 2024. URL https://github.com/bartbussmann/BatchTopK. Chanin et al.
[2024] David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507, 2024.
Cohen and Peng [2015] Michael B. Cohen and Richard Peng. ℓ_p row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 183–192, 2015.
Cohen et al. [2016] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Online row sampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016), 2016.
Cohen et al. [2017] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1758–1777, 2017.
Cunningham et al. [2023] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
Drineas et al. [2006] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.
Drineas et al. [2008] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
Frieze et al. [2004] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51(6):1025–1041, 2004.
Gao et al. [2020] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Gao et al.
[2024] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024.
Ghashami et al. [2016] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
Gurnee et al. [2023] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
Karvonen et al. [2025] Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability. arXiv preprint arXiv:2503.09532, 2025.
Krause and Cevher [2010] Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In ICML, 2010.
Liberty [2013] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588, 2013.
Mahoney [2011] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
Mahoney and Drineas [2009] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
Marks et al. [2024a] Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, and Fazl Barez. Interpreting learned feedback patterns in large language models. Advances in Neural Information Processing Systems, 37:36541–36566, 2024a.
Marks et al. [2024b] Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning.
https://github.com/saprmarks/dictionary_learning, 2024b.
Paulo and Belrose [2025] Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. arXiv preprint arXiv:2501.16615, 2025.
Paulo et al. [2024] Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024.
Sun et al. [2024] Xiaoqing Sun, Joshua Engels, and Max Tegmark. High frequency latents are features, not bugs. In ICLR 2025 Workshop on Sparsity in LLMs, 2024.
Woodruff [2014] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

Appendix A

A.1 Synthetic Data

Table 2: Synthetic dataset summary. Overall: mutual coherence = 0.0833 (lower is better); expected L0 per sample = 61.52; observed L0 per sample = 61.51. Bucket-level mean active coefficient magnitude is reported with its standard deviation.

Bucket | #Features | Mean activation ± std
LF+HA | 265 | 0.801 ± 0.605
HF+HA | 237 | 0.798 ± 0.601
LF+LA | 288 | 0.159 ± 0.120
HF+LA | 234 | 0.159 ± 0.120

A.2 Synthetic SAE Training

See the training plots in Figure 7.

Figure 7: Training loss and FVE on synthetic data. (a) BatchTopK; (b) a typical Sampled-SAE run. We train on 10,000 synthetic samples with the same training parameters as the real-data experiments.

A.3 Detailed K-Sparse Probing Results

See Table 3.

Table 3: Comparison across datasets. Winners are selected by highest probe accuracy; ties are bolded. ℓ is shown inline with the architecture. FVU is lower-is-better.
Dataset | Architecture (with ℓ) | FVU ↓ | SAE acc ↑
1 | SampledSAE (Entropy, ℓ=3) | 0.308 | 0.6822
1 | SampledSAE (L2-norm, ℓ=5) | 0.047 | **0.7434**
1 | SampledSAE (Leverage, ℓ=5) | 0.047 | **0.7434**
1 | SampledSAE (Uniform, ℓ=4) | 0.138 | 0.6876
1 | batch_topk | 0.025 | 0.6758
2 | SampledSAE (Entropy, ℓ=3) | 0.308 | 0.7276
2 | SampledSAE (L2-norm, ℓ=5) | 0.047 | 0.7612
2 | SampledSAE (Leverage, ℓ=5) | 0.047 | 0.7612
2 | SampledSAE (Uniform, ℓ=4) | 0.138 | 0.7214
2 | batch_topk | 0.025 | **0.7622**
5 | SampledSAE (Entropy, ℓ=3) | 0.308 | 0.7872
5 | SampledSAE (L2-norm, ℓ=5) | 0.047 | **0.8508**
5 | SampledSAE (Leverage, ℓ=5) | 0.047 | **0.8508**
5 | SampledSAE (Uniform, ℓ=4) | 0.138 | 0.7804
5 | batch_topk | 0.025 | 0.8020
10 | SampledSAE (Entropy, ℓ=3) | 0.308 | 0.8184
10 | SampledSAE (L2-norm, ℓ=5) | 0.047 | 0.8564
10 | SampledSAE (Leverage, ℓ=5) | 0.047 | 0.8564
10 | SampledSAE (Uniform, ℓ=4) | 0.138 | 0.8294
10 | batch_topk | 0.025 | **0.8764**

A.4 Unique Features

See Table 4.

Table 4: Comparison between SampledSAE-Entropy latents and their best BatchTopK matches with explanations and similarity scores.
Entropy latent | Explanation | Best-match latent | Explanation | Overall | Decoder | Text
41150 | the word “Let” and the word “Suppose” in mathematical context | 34460 | the term “immunohistochemical” and its variants in the context of biological analysis | 0.092 | -0.063 | 0.247
27451 | pronouns and references related to female characters or subjects in various contexts | 65343 | the term “Visitor” in programming contexts related to code structure and traversal | 0.097 | -0.000 | 0.194
60055 | concepts related to low levels or minimum thresholds across various contexts | 8195 | the name “Robert” in various contexts | 0.112 | 0.067 | 0.158
52706 | specific acronyms or terms related to viruses and their associated RNAs | 33561 | the terms related to the concepts of outer and inner structures or surfaces | 0.128 | 0.041 | 0.214
56476 | the substring “St” in various contexts and technical terms | 8195 | the name “Robert” in various contexts | 0.136 | -0.022 | 0.294

Table 5: Comparison between Squared-ℓ₂ latents and their best BatchTopK matches with explanations and similarity scores.
Squared-ℓ₂ latent | Explanation | Best-match latent | Explanation | Overall | Decoder | Text
18072 | terms involving variables and their coefficients in algebraic expressions | 65343 | the term “Visitor” in programming contexts related to code structure and traversal | 0.039 | 0.027 | 0.051
10307 | variables indicated by << >> in mathematical expressions | 53916 | phrases introducing examples or methods preceded by the words “the following” | 0.064 | 0.007 | 0.120
13856 | references and citations related to MRI technology and findings in medical studies | 15922 | legal terminology and biological concepts related to prostaglandins, sociodemographic factors, and interleukins | 0.080 | 0.033 | 0.127
8118 | terms related to spectral analysis and data in scientific contexts | 5340 | text related to the spinal cord and spinal injuries | 0.094 | 0.099 | 0.090
27691 | legal case citations and references to court districts and decisions | 15922 | legal terminology and biological concepts related to prostaglandins, sociodemographic factors, and interleukins | 0.096 | 0.003 | 0.188

Table 6: Comparison between L2-norm latents and their best BatchTopK matches with explanations and similarity scores.
L2-norm latent | Explanation | Best-match latent | Explanation | Overall | Decoder | Text
21059 | numbers and mathematical operations within calculations | 53916 | phrases introducing examples or methods preceded by the words “the following” | 0.024 | 0.048 | 0.000
29801 | attributes related to Android layout and styling elements | 15922 | legal terminology and biological concepts related to prostaglandins, sociodemographic factors, and interleukins | 0.082 | 0.035 | 0.129
50476 | the substring “greater” related to comparisons or quantities | 33561 | the terms related to the concepts of outer and inner structures or surfaces | 0.091 | -0.010 | 0.191
59710 | references to college education and experiences related to being a college graduate | 5340 | text related to the spinal cord and spinal injuries | 0.092 | 0.079 | 0.106
47382 | the phrase “as” followed by verbs or clauses that indicate explanations or transitions | 53916 | phrases introducing examples or methods preceded by the words “the following” | 0.098 | 0.016 | 0.181

Table 7: Comparison between Uniform latents and their best BatchTopK matches with explanations and similarity scores.
Uniform latent | Explanation | Best-match latent | Explanation | Overall | Decoder | Text
1517 | expressions of gratitude and acknowledgment in responses or comments | 8145 | terms indicating lack of limitations or conditions related to permissions and modifications | 0.079 | 0.018 | 0.141
4574 | technical terms and URLs related to user guides and software documentation | 33561 | the terms related to the concepts of outer and inner structures or surfaces | 0.096 | 0.015 | 0.178
1688 | CSS properties and values related to layout and design elements | 15922 | legal terminology and biological concepts related to prostaglandins, sociodemographic factors, and interleukins | 0.099 | 0.037 | 0.160
5718 | chemical compounds and their structures or modifications in scientific contexts | 8145 | terms indicating lack of limitations or conditions related to permissions and modifications | 0.101 | 0.012 | 0.191
3943 | the phrase “in order to” indicating purpose or intention | 8145 | terms indicating lack of limitations or conditions related to permissions and modifications | 0.120 | 0.063 | 0.178

A.5 Analysis of Top Features Across Architectures

Examining the Top-10 features by AutoInterp score (Tables 8–12) reveals how different scoring strategies implicitly select for different types of interpretable structure. BatchTopK’s unrestricted selection yields the highest AutoInterp scores (0.887, Table 8) but might include semantically overloaded features that combine unrelated concepts (e.g., feature 15922: "legal terminology, prostaglandins, sociodemographic factors"), suggesting it optimizes for reconstruction even at the cost of feature monosemanticity. L2-norm (Table 11) selects compositional building blocks—variables, HTML tags, API patterns—that likely combine to form complex representations, while Squared-ℓ₂ (Table 10) similarly emphasizes relational operators ("more than", "greater") that structure information and activate across varied contexts.
Entropy (Table 9) captures discriminative patterns across granularities, from subword tokens ("St") to abstract concepts ("change/alteration"), suggesting it selects features that serve as semantic anchors. Uniform random selection (Table 12) defaults to high-frequency syntactic elements (brackets, connectives), confirming that without guided selection, simpler surface patterns dominate. This distribution of feature types indicates that the "activation lottery" in BatchTopK may be a consequence of allowing all features—regardless of their computational role—to compete purely based on reconstruction error.

Table 8: Top-10 Features — BatchTopK (autointerp_score = 0.887)

Rank | Score | Latent | Explanation (short)
1 | 1.0 | 15922 | legal terminology, prostaglandins, sociodemographic factors
2 | 1.0 | 5340 | spinal cord and spinal injuries
3 | 1.0 | 34460 | “immunohistochemical” in biological analysis
4 | 1.0 | 5865 | conditional compilation directives in code
5 | 1.0 | 53916 | phrases with “the following” introducing examples
6 | 1.0 | 33561 | outer/inner structures or surfaces
7 | 1.0 | 65343 | “Visitor” in programming contexts
8 | 1.0 | 8145 | absence of limitations/conditions (permissions)
9 | 1.0 | 8195 | the name “Robert”
10 | 1.0 | 8168 | legal evidence in judicial contexts

Table 9: Top-10 Features — Entropy (ℓ = 100.0, autointerp_score = 0.821)

Rank | Score | Latent | Explanation (short)
1 | 1.0 | 27451 | pronouns/references related to female subjects
2 | 1.0 | 27129 | access modifiers and data types in programming
3 | 1.0 | 52706 | acronyms/terms related to viruses and RNAs
4 | 1.0 | 28279 | “List” in programming/data contexts
5 | 1.0 | 56476 | substring “St” in various contexts
6 | 1.0 | 4740 | the word “first” and initial occurrences
7 | 1.0 | 41150 | “Let” and “Suppose” in mathematics
8 | 1.0 | 60055 | concepts of low/minimum thresholds
9 | 1.0 | 1826 | concept of change/alteration
10 | 1.0 | 65040 | abbreviation “U.S.” in legal/patent contexts

Table 10: Top-10 Features — Squared-ℓ₂-norm (ℓ = 100.0, autointerp_score = 0.864)
Rank | Score | Latent | Explanation (short)
1 | 1.0 | 65439 | “South” and directional/geographical terms
2 | 1.0 | 18072 | variables and coefficients in algebra
3 | 1.0 | 27949 | HTML tags and document structure
4 | 1.0 | 55597 | requests, errors, and responses in programming
5 | 1.0 | 10307 | variables with ≪≫ in mathematics
6 | 1.0 | 13856 | MRI technology references in medical studies
7 | 1.0 | 49697 | upload processes and object initialization in code
8 | 1.0 | 8118 | spectral analysis in scientific contexts
9 | 1.0 | 25824 | terms related to eye/vision (retina, eyelids)
10 | 1.0 | 27691 | legal case citations and court references

Table 11: Top-10 Features — L2-norm (ℓ = 100.0, autointerp_score = 0.868)

Rank | Score | Latent | Explanation (short)
1 | 1.0 | 48993 | circular shapes or domains in applications
2 | 1.0 | 35053 | thymus and histology (immune system)
3 | 1.0 | 23445 | “document” in programming/documentation
4 | 1.0 | 51553 | phrase “more than” (comparisons)
5 | 1.0 | 50476 | substring “greater” in comparisons
6 | 1.0 | 21059 | numbers and mathematical operations
7 | 1.0 | 29801 | Android layout and styling attributes
8 | 1.0 | 36472 | “btn” in HTML button elements
9 | 1.0 | 59710 | college education/graduate experiences
10 | 1.0 | 47382 | phrase “as” introducing explanations

Table 12: Top-10 Features — Uniform (ℓ = 100.0, autointerp_score = 0.839)

Rank | Score | Latent | Explanation (short)
1 | 1.0 | 160 | substring “Fab” in names/terms
2 | 1.0 | 3943 | phrase “in order to” (purpose)
3 | 1.0 | 1517 | expressions of gratitude/acknowledgment
4 | 1.0 | 851 | “” indicating start of code blocks
5 | 1.0 | 2021 | programming errors and data structure declarations
6 | 1.0 | 1688 | CSS properties/values for layout/design
7 | 1.0 | 4474 | phrase “due to” (causality)
8 | 1.0 | 2607 | “else” in conditional statements
9 | 1.0 | 4574 | technical terms/URLs in documentation
10 | 1.0 | 5718 | chemical compounds and modifications

A.6 SAE Training Configurations

See Table 13.
Table 13: Training Configuration and Evaluation Setup

Model and Data
Base Model: EleutherAI/pythia-160m-deduped
Training Layer: Residual stream after layer 6
Activation Dimension: 768
Context Length: 1024 tokens
Dataset: monology/pile-uncopyrighted (train split, streaming)
Training Tokens: 25M tokens (first 25M from The Pile)

SAE Architecture
Dictionary Size (m): 65,536
Effective L0 (k): 60
Candidate Pool Multiplier (ℓ): 1, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 100 (ℓ = 1 gives K candidates; ℓ = n/K recovers BatchTopK)

Training Hyperparameters
Training Steps: 50,000
Batch Size: 4,096 tokens
Learning Rate: 3×10⁻⁴
Optimizer: Adam (β₁ = 0.9, β₂ = 0.999)
Gradient Clipping: 1.0
Warmup Steps: 1,000
Auxiliary Loss Weight (α): 1/32
Threshold Start Step: 1,000
Threshold Beta (EMA): 0.999

Initialization
Decoder: Unit-norm columns (maintained each step)
Encoder: Tied to decoder transpose
Bias (b_dec): Geometric median of first batch

Sampling Configuration (SampledSAE)
Scoring Functions: Entropy, ℓ₂-norm, Squared-ℓ₂, Uniform
Ridge Parameter (λ): 0.01 (for Squared-ℓ₂)
Feature Selection Frequency: Every batch
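The sampling configuration above implements the two-stage gating described in the abstract: score feature columns of the batch activation matrix, keep a candidate pool of size K·ℓ, then apply BatchTopK within that pool. A minimal NumPy sketch of this selection step follows; it is an illustration under stated assumptions (non-negative pre-activations, a particular entropy normalization), and the function name `sampled_sae_select` and its signature are hypothetical, not the authors' released code.

```python
import numpy as np

def sampled_sae_select(acts, k, l, score="l2"):
    """Illustrative two-stage gating (hypothetical helper, not the paper's code).

    acts: (batch, n_features) matrix of non-negative SAE pre-activations.
    k:    target sparsity per token (effective L0).
    l:    candidate-pool multiplier; the pool holds k * l feature columns.
    """
    batch, n_feat = acts.shape
    pool_size = min(k * l, n_feat)

    # Stage 1: score feature columns and keep the top k*l as the candidate pool.
    if score == "l2":
        col_scores = np.linalg.norm(acts, axis=0)
    elif score == "entropy":
        # Column-wise activation entropy (one assumed normalization).
        p = acts / (acts.sum(axis=0, keepdims=True) + 1e-9)
        col_scores = -(p * np.log(p + 1e-9)).sum(axis=0)
    else:
        raise ValueError(f"unknown score: {score}")
    pool = np.argsort(col_scores)[-pool_size:]

    # Stage 2: BatchTopK restricted to the pool -- keep the batch*k largest
    # activations anywhere in the (batch, pool_size) submatrix.
    restricted = acts[:, pool]
    flat = restricted.ravel()
    keep = np.argsort(flat)[-batch * k:]
    mask_flat = np.zeros(flat.size, dtype=bool)
    mask_flat[keep] = True
    out = np.zeros_like(acts)
    out[:, pool] = np.where(mask_flat.reshape(batch, pool_size), restricted, 0.0)
    return out
```

With ℓ = 1, every token draws from the same K globally scored columns; raising ℓ toward n/K widens the pool until the procedure coincides with plain BatchTopK, matching the spectrum described in the abstract.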