
Paper deep dive

Measuring Sparse Autoencoder Feature Sensitivity

Claire Tian, Katherine Tian, Nathan Hu

Year: 2025 | Venue: NeurIPS 2025 Workshop on Mechanistic Interpretability | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 92

Models: Gemma-2-2B, Pythia-160M

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 12:41:59 AM

Summary

The paper introduces a scalable, explanation-free method to evaluate Sparse Autoencoder (SAE) feature sensitivity by using language models to generate semantically similar text to a feature's activating examples and testing for feature activation. The authors find that many interpretable features exhibit poor sensitivity and that average feature sensitivity declines as SAE width increases across various architectures.

Entities (5)

Feature Sensitivity · metric · 100%
Sparse Autoencoder · model-architecture · 100%
Mechanistic Interpretability · research-field · 95%
SAEBench · benchmark · 95%
GPT-4 · language-model · 90%

Relation Signals (3)

Feature Sensitivity declines with SAE Width

confidence 95% · we observe that average feature sensitivity declines with increasing SAE width across 7 SAE variants.

Sparse Autoencoder has metric Feature Sensitivity

confidence 95% · Our work establishes feature sensitivity as a new dimension for evaluating both individual features and SAE architectures.

Language Models used to evaluate Feature Sensitivity

confidence 95% · we use language models to generate text with the same semantic properties as a feature’s activating examples.

Cypher Suggestions (2)

Find the relationship between SAE width and feature sensitivity. · confidence 90% · unvalidated

MATCH (s:Architecture {name: 'Sparse Autoencoder'})-[r:HAS_METRIC]->(m:Metric {name: 'Feature Sensitivity'}) RETURN s, r, m

List all evaluation metrics used for SAEs. · confidence 85% · unvalidated

MATCH (e:Entity {entity_type: 'Metric'})-[:APPLIES_TO]->(s:Architecture {name: 'Sparse Autoencoder'}) RETURN e.name

Abstract

Sparse Autoencoder (SAE) features have become essential tools for mechanistic interpretability research. SAE features are typically characterized by examining their activating examples, which are often "monosemantic" and align with human interpretable concepts. However, these examples don't reveal feature sensitivity: how reliably a feature activates on texts similar to its activating examples. In this work, we develop a scalable method to evaluate feature sensitivity. Our approach avoids the need to generate natural language descriptions for features; instead we use language models to generate text with the same semantic properties as a feature's activating examples. We then test whether the feature activates on these generated texts. We demonstrate that sensitivity measures a new facet of feature quality and find that many interpretable features have poor sensitivity. Human evaluation confirms that when features fail to activate on our generated text, that text genuinely resembles the original activating examples. Lastly, we study feature sensitivity at the SAE level and observe that average feature sensitivity declines with increasing SAE width across 7 SAE variants. Our work establishes feature sensitivity as a new dimension for evaluating both individual features and SAE architectures.

Tags

ai-safety (imported, 100%)
empirical (suggested, 88%)
mechanistic-interp (suggested, 92%)
safety-evaluation (suggested, 80%)


Full Text

92,150 characters extracted from source content.


Measuring Sparse Autoencoder Feature Sensitivity

Claire Tian, The Harker School
Katherine Tian, Independent
Nathan Hu, Stanford University

Abstract

Sparse Autoencoder (SAE) features have become essential tools for mechanistic interpretability research. SAE features are typically characterized by examining their activating examples, which are often “monosemantic” and align with human interpretable concepts. However, these examples don’t reveal feature sensitivity: how reliably a feature activates on texts similar to its activating examples. In this work, we develop a scalable method to evaluate feature sensitivity. Our approach avoids the need to generate natural language descriptions for features; instead we use language models to generate text with the same semantic properties as a feature’s activating examples. We then test whether the feature activates on these generated texts. We demonstrate that sensitivity measures a new facet of feature quality and find that many interpretable features have poor sensitivity. Human evaluation confirms that when features fail to activate on our generated text, that text genuinely resembles the original activating examples. Lastly, we study feature sensitivity at the SAE level and observe that average feature sensitivity declines with increasing SAE width across 7 SAE variants. Our work establishes feature sensitivity as a new dimension for evaluating both individual features and SAE architectures.

1 Introduction

Sparse Autoencoders (SAEs) have emerged as a powerful technique to identify meaningful directions in language model activation spaces [8,37]. These learned directions, or SAE features, have proven to be valuable for mechanistic interpretability.
Use cases include: surfacing surprising information present in model activations [37,12], controlling model behavior via activation steering [10,31], identifying computational circuits within models [1,28,25], and more open-ended exploration of training data [29] or other datasets [30, 18].

A key step in almost all SAE applications is to first characterize each SAE feature. This is commonly done by examining example inputs that activate each feature. These activating examples are often cohesive and correspond to human-interpretable concepts [8,37], e.g., "harmful requests". However, only examining a feature’s activating examples tells us what a feature does but not what it fails to do. We might hope that a harmful request feature activates on all harmful requests, but we cannot determine this by just examining activating text. Additionally, we need to evaluate feature sensitivity: the probability that a feature activates on texts similar to its activating examples. Ideally, features would have high sensitivity, consistently activating on all relevant inputs rather than arbitrary subsets.

Understanding a feature’s sensitivity is crucial for scoping what we can learn from the feature. If a harmful request feature has high sensitivity and activates on all harmful requests, understanding its role can reveal how the model generally processes any harmful input. If, instead, the harmful request feature has poor sensitivity, we are mainly gaining narrower insights into how the model handles the specific input that activates the feature.

Contact: 27clairet@students.harker.org, kattian@alumni.harvard.edu, nathu@cs.stanford.edu
Code: https://github.com/nathanhu0/sae-sensitivity
39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Mechanistic Interpretability.
arXiv:2509.23717v1 [cs.AI] 28 Sep 2025

[Figure 1 panels] For each SAE feature...
Collect dataset examples that activate the feature:
    ingredients for this sorbet: lemon juice, watermelon, sugar
    during her pregnancy she frequently craved watermelon
    hosted a watermelon eating competition for his birthday
    ...
Generate similar text using a language model:
    each Densuke watermelon is expected to retail for $600
    declared Oklahoma's state vegetable the watermelon
    ate a record of 9.1 lbs of watermelon in one sitting
    ...
Test feature activation on generated text: ... Feature sensitivity 67%

Figure 1: Sensitivity evaluation methodology. We extract top activating texts for each SAE feature, use GPT-4.1 to generate similar texts based on these examples, and measure how often the feature activates on the generated texts. Features with high sensitivity reliably activate on semantically similar inputs.

In this work, we use a generation-based approach to evaluate feature sensitivity at scale. As illustrated in Figure 1, we use language models to generate text with the same semantic properties as a feature’s activating examples. We then test whether the feature activates on these generated texts. Our generation-based approach is more scalable and efficient than previous dataset filtering methods [37,38]. Additionally, our method avoids the need to first generate a description of the feature’s activating text, removing a potential source of error compared to common automated interpretability evaluations [33, 21].

Our main contributions are:

• We develop an explanation-free, scalable automated evaluation for SAE feature sensitivity, allowing efficient evaluation of thousands of SAE features.
• We demonstrate that sensitivity measures a new facet of feature quality by examining its relationship to standard SAE feature metrics. Notably, we find that many interpretable features have poor sensitivity.
• We validate our method through automated and human evaluations, finding that when a feature fails to activate on generated text, that text genuinely resembles activating text examples according to human assessment.
• We identify declining feature sensitivity as an additional challenge for SAE scaling. We find that wider SAEs have lower average feature sensitivity in large-scale SAEs (up to 1M features) and across 7 different SAE variants.

2 Related Work

2.1 Prior Investigations of Feature Sensitivity

Investigating feature sensitivity requires obtaining candidate input text and checking for feature activation. Most prior work approaches this by first generating natural language explanations for features, then using those explanations to identify candidate inputs. This includes using explanations to generate new text [16, 19] or to filter through existing datasets for relevant passages [37, 38].

Alternative approaches avoid natural language explanations entirely. Gao et al. [14] fit n-grams with wildcards to activating text examples to filter datasets for test inputs. Other work evaluates whether features or groups of features can serve as high-sensitivity classifiers for a set of predefined concepts [20,26,6]. Chanin et al. [6] study feature absorption, a special instance of poor feature sensitivity with a clear cause: when features form hierarchies, sparsity incentivizes parent features (e.g., "math") to fail to activate on inputs when a more specific child feature (e.g., "algebra") activates instead.

All these approaches evaluate sensitivity with respect to some intermediate description, whether explanations, n-grams, or concept lists. Our approach evaluates sensitivity without needing to first generate such descriptions.

2.2 SAE Evaluation

Earlier work primarily evaluated SAEs by their reconstruction error and the interpretability of individual features [4, 37].
Although increasing SAE width improves both reconstruction quality and feature interpretability [21], a growing body of research investigates problems that arise when scaling SAEs, including feature splitting [4], feature absorption [6], and feature composition [23]. These results highlight that only optimizing for sparsity and reconstruction may not yield natural features.

Another line of work evaluates SAE latents by their utility for downstream tasks: sparse probing [14], spurious correlation removal [28], disentangling model representations [17], and unlearning [11]. Karvonen et al. [21] introduce SAEBench, a benchmark that aggregates many of these evaluation approaches, along with standard automated interpretability and reconstruction metrics.

2.3 Automated Interpretability

The standard auto-interpretability pipeline involves collecting activating text examples for a feature, prompting an LLM to generate natural language descriptions from these examples, and validating these descriptions by testing whether they enable another LLM to predict activations on new text. Bills et al. [3] first proposed this approach for neurons, and it has since become standard for both neuron explanations [7] and SAE explanations [33, 37, 21].

A complementary approach evaluates explanation quality by testing whether explanations can generate new activating inputs. This approach has been used to evaluate both neuron explanations [16] and SAE feature explanations [19]. Other work uses input generation to help interpretability agents test hypotheses about component activation [34]. Similar generation-based evaluation approaches have been applied beyond language models to explanations of vision neurons and other components [35, 22].

3 Evaluating Feature Sensitivity

3.1 Evaluating Feature Sensitivity Independent of Explanation

Previous work on sensitivity typically relies on some (typically natural language) description to identify test inputs [38,19].
Such methods evaluate sensitivity as a function of both the model component and the corresponding explanation. When studying neurons, which are a part of the model itself, such approaches cleanly evaluate how well an explanation describes a neuron’s activating inputs. However, SAE features present a more complex challenge. Unlike neurons, SAE features are learned approximations of a model rather than intrinsic model components. Much prior work has identified and addressed limitations in feature quality arising from SAE training [6,23,27,5]. Because SAE features and generated feature descriptions are imperfect, evaluating feature sensitivity with explanations may struggle to distinguish between an inaccurate description of a feature and a feature failing to activate on relevant inputs.

We avoid this ambiguity by evaluating feature sensitivity without generating an explanation. As shown in Figure 1, we prompt language models with a feature’s activating text examples to generate similar text samples, then measure how often the feature activates on these new texts. For a feature to achieve high sensitivity, it must consistently activate on novel inputs that human judges find indistinguishable from the original activating examples. This approach effectively measures sensitivity as if we had a perfect explanation: one precise enough to generate indistinguishable examples but nothing broader.

3.2 Method Details

Our sensitivity evaluation approach consists of four steps: (1) collect activating text examples for each feature, (2) generate new texts similar to these examples using an LLM, (3) evaluate if the feature is active on these new texts, and (4) compute sensitivity score as the fraction of new generated texts which successfully cause the feature to activate. In the paragraphs below, we provide additional details for the first two steps. Figure 2 shows examples of text generated by our evaluation.
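Steps (3) and (4) above reduce to a simple fraction once activations on the generated texts are available. The following is a minimal sketch, not the authors' implementation: `max_activation` is a hypothetical callable standing in for a model and SAE forward pass that returns the feature's peak activation on a text, and counting any activation above zero as "active" is an assumption made here for illustration.

```python
from typing import Callable, Sequence


def feature_sensitivity(
    generated_texts: Sequence[str],
    max_activation: Callable[[str], float],
    threshold: float = 0.0,
) -> float:
    """Fraction of generated texts on which the feature activates."""
    if not generated_texts:
        raise ValueError("need at least one generated text")
    # A text counts as activating when its peak activation exceeds the threshold.
    hits = sum(1 for text in generated_texts if max_activation(text) > threshold)
    return hits / len(generated_texts)


# Hypothetical stand-in for a real model + SAE forward pass:
# this toy "feature" fires whenever the word "watermelon" appears.
def toy_watermelon_feature(text: str) -> float:
    return 5.0 if "watermelon" in text.lower() else 0.0


texts = [
    "each Densuke watermelon is expected to retail for $600",
    "declared Oklahoma's state vegetable the watermelon",
    "the fruit stand sold out of melons by noon",
]
score = feature_sensitivity(texts, toy_watermelon_feature)
print(f"sensitivity = {score:.2f}")  # two of three texts activate
```

With 10 generated texts per feature, as in the setup described here, the score can only take values in steps of 0.1.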
[Figure 2, moderate-sensitivity features. Each row gives the feature activation followed by the text.]

Feat ID: 563047 | Desc: the concept of relationships or comparisons between different entities or conditions | Freq: 2.74e-03 | Sensitivity: 60.00% | Interp Score: 92.86%
Activating Text Examples:
    7.19   between a somatic mode of presentation on the one hand and a psychological mode on the
    10.06  . This finding suggests an interdependence of heavy alcohol consumption and psychological
    10.75  Data Separation**]: Algorithmic code is separated from the data on which it operates.
    10.94  not found when comparing a mixed level of parental education to a high level of parental
Generated Text:
    7.31   Data Handling**]: Source files are segregated from the processing routines they drive. This
    0.00   a mixture of phospholipids blended to achieve a molar ratio of 100:20 cholesterol stabilized
    0.00   examination highlights the synergistic effect of dopamine and glutamate receptor activity,
    7.44   survival rate differences often show inverse correlations with tumor progression markers

Feat ID: 297594 | Desc: the word 'problems' and related concepts indicating issues or challenges | Freq: 3.33e-05 | Sensitivity: 60.00% | Interp Score: 100.00%
Activating Text Examples:
    8.75   of a trendy Melbourne art gallery, has her own problems – chasing down a delinquent
    7.34   to attempts at calibration.↵Of course, our problems are not likely to clear up so one may
    12.62  him, thatʼs only the start of their problems.↵In this third Alex Caine book, sequel
    6.78   mnir becomes queen of a land with as many problems as the one she fled. Her long-lived
Generated Text:
    6.97   facing constant delays, she explained her problems quietly but with visible frustration at the
    0.00   discussing legislative issues, where the problems often are complex and intertwined with
    8.12   Nina realized that understanding his problems required stepping into his perspective; that
    0.00   explaining the malfunction during the software demo, he hoped the technical problem would

Feat ID: 2870 | Desc: words related to teaching and education | Freq: 2.57e-05 | Sensitivity: 60.00% | Interp Score: 92.86%
Activating Text Examples:
    14.12  C.F.E. is an internationally respected teacher, trainer and clinician with an expertise in the
    11.00  you love about teaching?↵‘I love to teach,” Johnston said, “and the some of the
    10.62  dots and monkeys.”↵What do you love about teaching?↵‘I love to teach,” Johnston
    11.62  century. Architect, artist, furniture designer, and educator, Ralph Rapson has played a
Generated Text:
    11.75  I always felt passionate about teaching because it allows me to inspire others. “To teach is to
    0.00   Kids bring so much energy and curiosity to the classroom. When Iʼm teaching, I get to see
    12.88  His work as a dedicated educator and community advocate has influenced many in the arts
    0.00   Initially nervous about teaching, she grew into her role and now finds great joy in it. Her first

Feat ID: 662898 | Desc: theorems and corollaries referenced by their numbers in mathematical or academic contexts | Freq: 6.25e-05 | Sensitivity: 70.00% | Interp Score: 92.86%
Activating Text Examples:
    21.25  othe (2017).↵Theorem 3 includes existing DR moment functions as special cases where $
    23.00  on this section can be applied.↵Corollary 9.2 from [@H] states that any Or
    27.00  (β)$ and possibly additional functions. Proposition 2 of Newey (1994a)
    11.44  font-variant:small-caps;">Theorem 2:</span> *If the marginal distribution of*
Generated Text:
    2.06   Consider Lemma 7 from Smith (2003) which establishes conditions for convergence in
    0.00   Based on Proposition 5 in the appendix, the asymptotic variance can be expressed as
    0.00   From the proof of Lemma 11, we derive bounds on the estimator variance using
    2.44   The result of Corollary 4 follows immediately by applying the dominated convergence

Feat ID: 267258 | Desc: the variable placeholder 'i' in programming contexts | Freq: 1.27e-04 | Sensitivity: 50.00% | Interp Score: 92.86%
Activating Text Examples:
    8.50   CDATA_CTL);↵msg->buf[i] = (u8)((rxd
    15.50  rows <- createRow(sheet, rowIndex=i)↵ for (j in 1:
    18.88  ('model0.h5'.format(ix)) ↵ end = time.time()
    11.75  I2CDATA_CTL);↵msgs[i + 1].buf[0] = (
Generated Text:
    3.38   commandList[i].execute();↵status = commandList[i].getStatus();
    0.00   <td *ngFor="let item of items; let i=index">↵ item.name ↵</td>
    4.56   dataset['column_i'] = values[i]↵dataset['column_i'].mean()
    0.00   (i,'update')">↵ <span class="icon"></span>↵</button>

Feat ID: 898197 | Desc: terms related to the concept of survival and its implications in various contexts | Freq: 4.54e-05 | Sensitivity: 70.00% | Interp Score: 100.00%
Activating Text Examples:
    10.00  three main purposes. The first was to facilitate the survival of the sponges across the
    10.00  main purposes. The first was to facilitate the survival of the sponges across the battery of
    12.12  and size and the size of the grain affects its survivability in the archaeological
    15.50  Oxygen is a vital substrate to the continual function and survival of cerebral tissue. Rapid
Generated Text:
    9.50   Examining the role of autophagy in prolonging cellular survival, we used markers for
    0.00   Analysis of cohort data revealed a strong correlation between dietary intake and survival
    0.00   The clinical trial results showed higher median survival time among patients receiving
    15.38  Genetic diversity contributes significantly to the survival advantage seen in populations

Feat ID: 362816 | Desc: various expressions of the word "by" followed by different methods or approaches | Freq: 2.67e-04 | Sensitivity: 40.00% | Interp Score: 92.86%
Activating Text Examples:
    13.88  states, chose to↵achieve the same balance by alternate means. We have judges who
    13.50  i=1,·s, r$.↵By a straightforward argument one may notice that the condition
    12.25  vertices.↵For each face choose a triangulation by non-intersecting diagonals. Let $d$
    10.88  so they will not be repeated here.↵By prior Order, the Court authorized the mailing of
Generated Text:
    5.84   this result was obtained by innovative techniques involving machine learning and deep
    0.00   the report was compiled by an experienced team of analysts specializing in market trends
    0.00   the final decision was reached by mutual agreement among the stakeholders following
    0.00   marketing strategies improved by targeted campaigns using demographic and

Feat ID: 878839 | Desc: substrings containing specific sequences of letters within proper nouns and scientific terms | Freq: 3.44e-04 | Sensitivity: 50.00% | Interp Score: 92.86%
Activating Text Examples:
    7.66   --449--449--Sterry \[[@CR22]\] (IMP2
    7.72   been used unfairly, please contact us<eos>Kralingse Zoom metro station↵Kralingse
    8.62   has been used unfairly, please contact us<eos>Kralingse Zoom metro station↵Kralingse
    4.53   ↵ *Buplerum falcatum*(root)
Generated Text:
    0.00   font-style:italic;">Sergei V. Andreev and T.amara V.
    5.53   been interrupted, please call the office<eos>Yalum metro station↵Yalum
    0.00   <bos> ↵ *Cana
    2.16   *Pleurotus ostreatus* (mushroom cap)

[Figure 2, low-sensitivity features.]

Feat ID: 92700 | Desc: numerical values and monetary amounts in the text | Freq: 4.34e-05 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
    14.94  ↵Or, maybe at around 200 bucks, a lot of people end up with both.
    16.88  a conventional Google tablet. At 200 bucks, I couldn't resist. But, the
    15.06  is also WIC, which at 4.8 billion in FY 2012 gets us to
    8.62   , which at an annual total of 540 billion, would amount to funding close to half the
Generated Text:
    0.00   Prices of electric bikes have jumped to roughly $1500 dollars this year, making commuting
    0.00   The marathon registration fee this year peaked at $85 dollars, drawing thousands of
    0.00   Ticket prices for the concert rose to approximately $500 dollars, reflecting the artist's
    0.00   The annual subsidy for renewable energy projects surpassed $700 million, encouraging

Feat ID: 361399 | Desc: the substring "super" and variations of "base" in programming contexts | Freq: 2.12e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
    8.12   .Flush();↵base.OnTearDown();↵[Test
    13.75  autorelease]];↵ [buffer appendString:[super description]];↵ [buffer appendString:@"
    5.31   Head(IHeaderResponse response) ↵ super.renderHead(response);↵ response.
    3.56   true_type)↵ ↵ Base::construct(expr);↵ ↵ template
Generated Text:
    0.00   super.initializeComponent();↵this.componentDidMount();
    0.00   super.performAction();↵logState();
    0.00   super.OnStart();↵logger.info("Service started");
    0.00   super.updateSettings(newSettings);↵notifyObservers();

Feat ID: 388626 | Desc: numeric identifiers or placeholders represented with angle brackets | Freq: 3.97e-04 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
    12.81  ASSUME_NONNULL_END↵<eos>477F.2d 598↵
    17.50  ↵ ↵ ]↵15376157484
    12.06  ],↵ );↵;↵<eos>↵337F.Supp. 150 (
    10.81  ": "partial link text"↵154749992302
Generated Text:
    0.00   1729some random text854
    0.00   Reference number: 504732 logged successfully
    0.00   <script>var key = "13974";</script>
    0.00   Timestamp 8430 registered without error

Feat ID: 781627 | Desc: the word 'export' in programming contexts, particularly related to functions and types | Freq: 1.16e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
    32.25  "data" or "err"↵ */↵export default function request(url, options) ↵
    35.25  );↵ return newContainer;↵export const FadeTransition = ↵ start(container
    28.50  until(deadline));↵ ↵export class ClientModule ↵ constructor(name,
    33.25  ↵ return Authorized;↵;↵export CURRENT ;↵export default Authorized =>
Generated Text:
    0.00   export class ApiService ↵ constructor(baseUrl) ↵ this.baseUrl = baseUrl;↵ ↵
    0.00   export const constants = ↵ appName: 'MyApp',↵ version: '1.0.0',↵;
    0.00   export function validateInput(input) ↵ if (!input) throw new Error('Input required');↵
    0.00   export function calculateSum(a, b) ↵ return a + b;↵

Feat ID: 121516 | Desc: the substring '>' in various coding and programming contexts | Freq: 2.52e-05 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
    20.62  width: 100%;↵body.is-loading &
    17.00  ↵ Return(HPSD)↵ ↵ Return(SPSD)↵ ↵
    17.88  ↵ Return(SPSD)↵ ↵ // End of Scope(\_SB
    18.12  ->first, stp->second);↵ ↵ void synchronize()↵
Generated Text:
    0.00   ::process_queue;↵ queue(pg);↵struct queue_set_ready ;
    0.00   margin-bottom: 12pt;↵</style>
    0.00   #Major.#Minor.#Revision.#Patch↵InfoLabelText=#Label↵
    0.00   if (!is_loading) ↵ ↵ start_process();↵

Feat ID: 593453 | Desc: the substring 'math' in mathematical notation and equations | Freq: 2.22e-05 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
    14.31  n' k'\!)↵ tanh[(β/2) \
    13.06  -1)(N g)^2E_x_0:t-
    9.44   and set $ ↵M= R p_1 p_
    11.69  8mu]2, 1 -6mu
Generated Text:
    0.00   k↵\!+\!↵ 1 iν \!-\! _ ω
    0.00   n' k'\!↵ cosh [ β2 (E - μ) ]
    0.00   1,…,n_s = δ q_1,…,q_s · P( \
    0.00   -n ω) = 1 iν \!-\! θφ - χ

Feat ID: 1023262 | Desc: code syntax and function callbacks in programming contexts | Freq: 7.21e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
    22.00  ↵ return true;↵ ;↵// Setup callback first, so we don't
    21.00  ↵ return true;↵ ;↵this.resume = function() ↵ try
    13.81  ↵ headerValue = ′;↵ ;↵parser.onHeadersEnd = function() ↵
    21.25  .onPart(part);↵ ;↵parser.onEnd = function() ↵
Generated Text:
    0.00   ;↵self.emit('close');
    0.00   ;↵// If request error, destroy.
    0.00   ;↵body = Buffer.from(chunk);
    0.00   ;↵fileStream.pipe(writer);

Feat ID: 770421 | Desc: phrases indicating prominent locations or regions within a larger context | Freq: 5.75e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
    17.25  routes take you past some of the most spectacular scenery in the state, while friends old
    12.25  Pedro Bay has one of the most attractive settings in southwest Alaska. Pedro Bay is
    7.75   Heath and Lake District the largest economically utilized pond region in Europe.↵Part of
    5.50   the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful
Generated Text:
    0.00   the historic piazza in the heart of Florence bustles with tourists and locals alike, offering
    0.00   the ancient library in the old city houses manuscripts dating back to the medieval period,
    0.00   the extensive cave system in the Carpathian Mountains is famous for its unique rock
    0.00   the bustling harbor in the Mediterranean city offers stunning views of yachts and fishing

Figure 2: Interpretable features with moderate and low sensitivity. Feature activations are shown on top activating texts (left) and on LLM-generated texts from our evaluation (right). Generated text is formatted to indicate tokens expected to activate the feature. These are highlighted when the feature remains inactive.

Collecting Activating Text: We sample 2 million tokens of candidate texts from large text corpora. The corpus is OpenWebText [15] for SAEBench evaluations and the Pile-uncopyrighted subset [13] for GemmaScope evaluations. We evaluate feature activation on sequences of 128 tokens, following the example collection methodology used in [21]. When a feature activates, we extract the activating example by including 10 tokens preceding and 10 tokens following the activating token. For each feature, we collect 15 activating text examples: 10 top activating examples and 5 importance-weighted samples by activation magnitude.

Generating New Texts: We provide activating text examples when prompting an LLM. We do not use any natural language descriptions of the feature in the prompt.
In preliminary experiments, adding automated feature descriptions reduced the probability that generated text would activate the feature. From inspecting samples, we believe this is due to automated descriptions that are sometimes overly general and imprecise. For each feature, we use a single query to generate 10 new text samples. We found that a single query produced more diverse outputs than multiple independent queries. The full prompts are included in Appendix A. We use GPT-4.1-mini [32] for the generation step. We found that it produced text comparable to GPT-4.1, while GPT-4.1-nano struggled to complete the generation task.

Method Assumptions: Our method relies on several key assumptions. First, we require that our collected examples adequately capture each feature’s behavior, which we ensure by following standard approaches for collecting activating examples and filtering out features that fail to activate on truncated text. Details of filtering are described in Section 3.3. Second, we assume that generated texts share whatever semantic property triggers feature activation, which we validate through human evaluation in Section 5.1. Third, we assume generated samples are sufficiently novel and diverse to serve as valid tests of sensitivity, which we verify in Section 5.2.

3.3 Filtering SAE Features

We limit our study to SAE features that meet two criteria. First, we only evaluate features for which we can collect at least 15 activating text samples from 2 million tokens, which filters out rare features. Second, we found that many features fail to activate on their own truncated examples, so we filter for features where at least 90% of the shortened text snippets still activate the feature. This filtering may bias our analysis toward simpler features, but it ensures that features failing to activate on generated text genuinely reflect poor sensitivity, rather than an artifact of sample text truncation.
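The truncation window and the two filtering criteria can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names `truncate_around` and `passes_filter` are hypothetical, and whether each truncated snippet still activates the feature is assumed to have been computed elsewhere by re-running the model and SAE.

```python
from typing import List, Sequence

WINDOW = 10          # tokens kept on each side of the activating token
MIN_EXAMPLES = 15    # activating examples required per feature
MIN_RETAINED = 0.90  # fraction of truncated snippets that must still activate


def truncate_around(tokens: Sequence[str], idx: int, window: int = WINDOW) -> List[str]:
    """Keep up to `window` tokens before and after the activating token at `idx`."""
    start = max(0, idx - window)
    return list(tokens[start : idx + window + 1])


def passes_filter(still_active: Sequence[bool]) -> bool:
    """`still_active[k]` records whether truncated snippet k still activates the feature."""
    if len(still_active) < MIN_EXAMPLES:
        return False  # feature is too rare to collect enough examples
    retained = sum(still_active) / len(still_active)
    return retained >= MIN_RETAINED


tokens = [f"t{i}" for i in range(128)]
snippet = truncate_around(tokens, idx=50)
print(len(snippet), snippet[0], snippet[-1])  # 21 t40 t60

print(passes_filter([True] * 14 + [False]))      # 14/15 retained -> True
print(passes_filter([True] * 13 + [False] * 2))  # 13/15 retained -> False
print(passes_filter([True] * 10))                # fewer than 15 examples -> False
```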
The fraction of filtered features increases substantially with SAE width. For smaller SAEBench SAEs (width 4k to 65k), we exclude 35% of features on average. For GemmaScope SAEs, this ranges from 51% for 65K width SAEs to 79% for 1M width SAEs. Detailed filtering statistics and results with different cutoffs are shown in Appendix B.

[Figure 3 panels: (a) feature count (log scale) by sensitivity; (b) sensitivity vs. auto-interp score, ρ = 0.241 (p=1.0e-28); (c) sensitivity vs. frequency (log scale), |ρ| = 0.066 (p=2.6e-03); (d) sensitivity vs. max decoder cosine similarity, ρ = 0.058 (p=9.0e-03)]

Figure 3: GemmaScope SAE feature sensitivity distributions. The distribution of feature sensitivity and scatter plots showing joint distributions of sensitivity with auto-interpretability, frequency, and maximum decoder cosine similarity. Sensitivity scores in scatter plots are plotted with y-jitter for visualization. Correlation coefficients and p-values are shown at the top of each scatter plot.

4 Feature sensitivity captures novel aspects of feature behavior

We begin by examining the relationship between our feature sensitivity metric and standard SAE feature evaluation metrics. For this, we study the canonical (width 1M, sparsity 107) GemmaScope [24] SAE for the layer 12 residual stream of Gemma 2 2B [36]. We sampled 10,000 SAE features. After filtering per Section 3.3, 2,061 remained for analysis. We show the distribution of sensitivities across all features in Figure 3a. Most features score well on sensitivity, but the features span all sensitivity scores, showing meaningful variation in feature quality when measured via sensitivity.

Next, we examine three key feature properties for comparison. First, we look at feature interpretability, which we measure using the automated interpretability evaluation of [21].
Second, we examine feature frequency, which is how often features have nonzero activation. Third, we compute the maximum decoder cosine similarity between a feature’s decoder vector and all other feature decoder vectors. High similarity may reflect undesirable feature composition or entanglement [5].

The three scatter plots of feature sensitivity and each property (Figures 3b, 3c, and 3d) confirm that feature sensitivity is distinct from other existing metrics. We find weak correlations of sensitivity with frequency (ρ = −0.06) and decoder cosine similarity (ρ = 0.06), and a stronger correlation between sensitivity and interpretability score (ρ = 0.24). The overall weak correlations with existing metrics are encouraging: they suggest that sensitivity captures a novel and complementary dimension of feature quality rather than simply replicating existing evaluations.

Although feature interpretability and feature sensitivity are correlated, they often disagree. When examining features with high sensitivity but low auto-interpretability scores, we find this mainly reflects noise in the automated evaluation; these features appear qualitatively interpretable upon inspection. More importantly, we find many interpretable features exhibit poor sensitivity. Among 1347 features with auto-interpretability scores ≥ 0.9, 82 have sensitivity ≤ 0.5, and 23 have sensitivity ≤ 0.2. Figure 2 shows examples of interpretable features with moderate and low sensitivity, with additional examples in Appendix E. Spot checking these features shows that our evaluation-generated text resembles activating text but fails to activate the feature, suggesting that our method has indeed found interpretable features that have poor sensitivity. In the next section, we validate this rigorously via human evaluation.
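The maximum decoder cosine similarity used as the third comparison property can be illustrated with a small pure-Python sketch. The names `cosine` and `max_decoder_cos_sim` are hypothetical, the toy decoder matrix is made up for illustration, and the quadratic loop is only meant to show the definition; a real SAE with up to 1M features would need a batched matrix implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_decoder_cos_sim(decoder: Sequence[Sequence[float]]) -> List[float]:
    """For each decoder row, the largest cosine similarity to any other row.

    Assumes at least two rows and no all-zero rows.
    """
    out = []
    for i, u in enumerate(decoder):
        out.append(max(cosine(u, v) for j, v in enumerate(decoder) if j != i))
    return out


# Toy decoder with three feature directions in a 2-dimensional space.
decoder = [
    [1.0, 0.0],
    [2.0, 0.0],  # same direction as row 0
    [0.0, 3.0],  # orthogonal to rows 0 and 1
]
sims = max_decoder_cos_sim(decoder)
print([round(s, 3) for s in sims])  # [1.0, 1.0, 0.0]
```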
5 Verifying the Automated Sensitivity Evaluation

We validate that our automated sensitivity evaluation is reliable through two analyses: (1) human evaluation of sample similarity and (2) automated evaluations of sample novelty and diversity.

[Figure 4 plot. (a) Results: stacked bars showing the proportion of each annotator label (Indistinguishable, Closely Related, Weakly Related, Unrelated) per text sample type: Ground Truth (n=24), Generated (n=62), Random Generated (n=16). Bar count labels as extracted: 20, 49, 3, 8, 1, 4, 2, 14. (b) Evaluation interface.]

Figure 4: Human evaluation validates our method. (a) Human evaluation of 102 text samples across three conditions: true activating text examples (positive control), text generated for random features (negative control), and text generated by our evaluation that failed to activate features. (b) The interface shows feature activating examples alongside generated text for evaluation, with annotators rating similarity.

5.1 Blinded Human Evaluation

The goal of the human evaluation is to check whether human annotators agree that the LLM generations are indeed consistent with the feature concept, and therefore appropriate for scoring feature sensitivity.

Human annotators judged 102 examples in total. Each example consists of several activating text examples for a feature along with one new text sample. The new text falls into one of three categories: another activating text example for the feature (20%, positive control), a text generated for a random other feature (20%, negative control), or a text generated by our method that failed to activate the feature (60%). The category is not revealed to the human annotator. The annotator is then asked to classify whether the new text is “indistinguishable”, “closely related”, “weakly related”, or “unrelated” to the provided activating text examples. A sample dashboard for the human evaluation is shown in Figure 4b.

We only include features with high auto-interpretability (≥ 0.9).
This allows the study to focus on verifying cases where we might be most skeptical of low sensitivity results a priori. Additionally, interpretable features are easier for human annotators to assess.

Results are shown in Figure 4a. Generated text achieves relevance ratings nearly matching ground truth, confirming that low sensitivity evaluations reflect poor sensitivity rather than poor generation. Human annotators rate our method’s generated texts (n = 62) nearly as relevant to the feature as the ground truth texts: 79% of generated texts are rated “indistinguishable”, compared to 83% of ground truth activating texts. Only one out of 62 generated texts is rated “unrelated”. Additionally, annotators correctly scored controls: positive control texts (n = 24) are rated “indistinguishable” or “closely related” 96% of the time, while all negative control texts (n = 16) are rated “unrelated” or “weakly related”.

5.2 Sample Novelty and Diversity

The goal of this analysis is to check that (1) our generated texts do not copy the activating examples, i.e., the diversity between each generated text and the top-activating texts is sufficiently high, and (2) our generated texts cover a wide range of feature expression, proxied by checking that the diversity between generated texts is sufficiently high.

We assess text diversity by measuring the longest common substring length across three comparisons: (1) between generated texts and activating examples, to evaluate copying; (2) between pairs of activating examples, to establish baseline overlap levels; and (3) between pairs of generated texts, to assess diversity within our generations. Note that we check for the longest substring match ending on the activating tokens, since only tokens before the activating part contribute to the activation.

Figure 5: Text diversity validation. Probability that the longest common substring length is ≥ N tokens.
We compare: two activating text examples for the same feature (gray), one generated text and one activating text example for the same feature (orange), and two generated text samples for the same feature (blue).

Figure 5 shows the complementary cumulative distribution function (CCDF) for longest common substring lengths. Each bar shows the fraction of text pairs with overlap ≥ N tokens: gray bars show overlap between activating examples (baseline), orange bars show overlap between generated and activating texts (testing for copying), and blue bars show overlap between generated texts (testing for diversity).

The first reassuring observation is that a generated text and an activating text example are less likely to have a long overlap than two activating examples (3.1% vs. 3.7% at ≥ 5 tokens). On the other hand, a generated text and an activating text example are more likely to contain a short overlap than two activating examples (20.8% vs. 18.0% at ≥ 2 tokens). This indicates that our generated texts occasionally use short verbatim sequences from the examples but avoid copying long passages.

Two generated texts are slightly more likely to overlap than the baseline of two activating examples, with 27.9% probability of a ≥ 2 token overlap and 4.3% at ≥ 5 tokens. Pairs of generations therefore show somewhat lower diversity, though the difference is modest. This overlap pattern likely reflects LLM preferences for common word choices and short phrases rather than wholesale copying. While generation diversity can be improved, there are no pathological issues with extended substring duplication.

6 Evaluating Feature Sensitivity Across SAEs

Having explored the sensitivity of features within a single SAE and having confirmed that our evaluation method is reliable, we now turn to evaluating the average feature sensitivity across different SAE sizes and architectures.
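As an illustration of the overlap analysis in Section 5.2, here is a minimal sketch of a longest-common-substring measurement over token lists and the resulting CCDF. It is a simplified stand-in, not the authors' implementation: the function names are hypothetical, tokens are obtained by whitespace splitting rather than with the model tokenizer, and the match is not restricted to substrings ending on the activating tokens as the paper describes.

```python
def longest_common_token_run(a, b):
    """Length of the longest contiguous token sequence shared by token lists a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: run length ending at the previous token of a and b[j-1]
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr[j] = prev[j - 1] + 1  # extend the run that ended one token earlier
                best = max(best, curr[j])
        prev = curr
    return best


def ccdf(lengths, max_n):
    """Fraction of text pairs whose overlap is >= N tokens, for N = 1..max_n."""
    total = len(lengths)
    return {n: sum(length >= n for length in lengths) / total for n in range(1, max_n + 1)}


# Toy usage with made-up texts (whitespace tokenization is an assumption).
generated = "the resource folder is empty".split()
activating = "check the resource folder now".split()
overlap = longest_common_token_run(generated, activating)
curve = ccdf([0, 1, overlap, 5], max_n=5)
```

Computing this for the three pair types (activating/activating, generated/activating, generated/generated) and comparing the curves at a given N gives the kind of comparison reported in Figure 5.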
[Figure 6 plot. x-axis: L0 (Sparsity), log scale; y-axis: Sensitivity; one line per SAE width: 65k, 131k, 262k, 524k, 1m.]

Figure 6: Average Feature Sensitivity of GemmaScope SAEs. For each dictionary size, we plot the feature sensitivity of SAEs trained at that size at different sparsities. Wider SAEs have worse average feature sensitivity. Feature sensitivity also increases slightly with L0.

6.1 Results on Large GemmaScope SAEs

The GemmaScope suite of twenty-nine JumpReLU SAEs ranges in size from 65K to 1M features and in sparsity (L0) from 20 to 200 [24]. These SAEs are trained to reconstruct the layer 12 residual stream of Gemma-2-2B [36]. For each SAE in GemmaScope, we collect activating texts for 2,500 features, then apply the filtering criteria described in Section 3.2 and Appendix B before computing sensitivity.

Figure 6 shows the effect of dictionary width and sparsity on feature sensitivity. At a fixed dictionary size, sensitivity increases with L0: SAEs with more active features per token have higher average feature sensitivity. Strikingly, as SAE width increases, average feature sensitivity decreases. Concretely, 65K-width SAEs have average feature sensitivities ranging from 0.92 to 0.94, while 1M-width SAEs have average feature sensitivities ranging from 0.85 to 0.87. In Appendix C we show that these two trends hold after controlling for feature frequency.

6.2 Results on Diverse SAE Architectures

[Figure 7 plot. Panels: Gemma-2 2B and Pythia 160M. x-axis: L0 (Sparsity), log scale; y-axis: Sensitivity. Legend: BatchTopK, TopK, MatryoshkaBatchTopK, Gated, JumpReLu, Relu, PAnneal.]

Figure 7: Average Sensitivity vs. Sparsity for Gemma-2-2b and Pythia-160m SAEs. This plot shows the average sensitivity of different Sparse Autoencoder (SAE) types plotted against their sparsity. We use the widest 65k width SAEs for all architectures.
Each line represents a different SAE architecture.

Having found these scaling trends on GemmaScope JumpReLU models, we next test whether they generalize across different model families and SAE architectures. We evaluate SAEs from the SAEBench collection [21], which includes 7 different SAE architectures trained on both Pythia-160M [2] and Gemma-2-2B [36] models. While these SAEs are much smaller in scale than GemmaScope, they allow us to validate our findings across SAE variants and model architectures. For each SAE studied here, we collect activating text for 1,000 features, then filter as before.

We show the relationship between sparsity and sensitivity on the largest SAEs in this suite (65k width) in Figure 7. While the results are noisier due to smaller sample sizes, we see a general trend of sensitivity increasing with sparsity across model and SAE variants. While noise prevents us from making strong claims about sensitivity differences between each of the SAE architectures, vanilla ReLU SAEs consistently show low sensitivity, performing worst on Gemma-2-2B and among the worst variants on Pythia-160M.

[Figure 8 plot. Panels: Gemma-2 2B and Pythia 160M. x-axis: Dictionary Size (4k, 16k, 65k); y-axis: Sensitivity. Legend: BatchTopK, TopK, MatryoshkaBatchTopK, Gated, JumpReLu, Relu, PAnneal.]

Figure 8: Average Sensitivity vs. Dictionary Size for Gemma-2-2b and Pythia-160m SAEs. This plot shows the average sensitivity of different Sparse Autoencoder (SAE) types plotted against their dictionary size. We select SAEs with L0 closest to 80 (exactly 80 for top-K SAEs, closest available for other variants). Each line represents a different SAE architecture.

Next, we examine how dictionary size affects sensitivity across architectures. To control for sparsity, we select SAEs with L0 closest to 80 (exactly 80 for top-K SAEs, closest available for other variants).
The results in Figure 8 confirm that wider SAEs consistently show worse sensitivity across all tested architectures. Notably, sensitivity also declines with width for Matryoshka SAEs, despite their being specifically designed to address scaling challenges in SAEs [5].

7 Discussion and Conclusion

We developed a scalable pipeline that generates texts similar to SAE feature activating examples. We validate through human evaluation that these generated texts are genuinely similar: annotators rate 79% of them as indistinguishable from actual activating examples, compared to 83% for ground truth activating texts. We use this pipeline to evaluate individual features and the average sensitivity of features in an SAE. At the feature level, we found that many interpretable features have poor sensitivity, broadening our notion of what makes a high-quality SAE feature. At the SAE level, we found that average feature sensitivity consistently decreases as SAE width increases, identifying a new challenge for scaling SAEs. Taken together, our work helps develop feature sensitivity as a new axis for evaluating both individual features and SAE variants.

7.1 Limitations and Future Work

Beyond evaluation, our pipeline opens new directions for exploratory analysis. Studying feature activations on text generated by our pipeline could enable more fine-grained studies of the boundaries separating activating from non-activating inputs for a given feature. This approach could also enable the study of groups of features that may collectively represent specific concepts with high sensitivity. Additionally, our pipeline and sensitivity evaluation can be applied to any model component that activates on input text. Future research could examine sensitivity in thresholded neurons, transcoders [9], and cross-layer transcoders [1].

Our evaluation was limited to frequently occurring features (15+ times in 2M tokens), which biases our analysis toward common features and misses potentially important rare features.
We filter for features that remain active when truncated activating text is used, potentially biasing toward simpler features that do not depend on longer contexts. Future work can directly scale up this evaluation by studying less frequent features and using longer text snippets. Additionally, we do not meaningfully incorporate information about the magnitude of feature activation in each passage. We would be excited by future work that incorporates activation strength into studies of SAE features, either in the context of sensitivity or broader evaluation.

Acknowledgements

We thank Christopher Potts, Lee Sharkey, Thomas Icard, Alex Tamkin, and Y. Charlie Hu for feedback on earlier drafts. We are grateful to the members of #weekly-interp-meeting at Stanford for discussions throughout this project. We also thank Neuronpedia, whose API enabled our initial explorations and experiments.

Author Contributions

All authors contributed to the research design through regular discussions, provided feedback throughout the project, and contributed to the manuscript. CT conducted initial feature exploration, implemented the end-to-end sensitivity evaluation method and iterated on its design, optimized the generation approach, and conducted the main feature and SAE sensitivity evaluation study. KT analyzed generated text diversity and novelty, helped annotate the human evaluation data, and contributed significantly to manuscript revision. NH proposed the research question and approach, contributed to implementation and code cleanup, conducted the human evaluation study, and led paper writing.

References

[1] Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T.
Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/methods.html.

[2] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373.

[3] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

[4] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning, 2023. URL https://arxiv.org/abs/2312.11215.

[5] Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders, 2025. URL https://arxiv.org/abs/2503.17547.

[6] David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2024. URL https://arxiv.org/abs/2409.14507.

[7] Dami Choi, Vincent Huang, Kevin Meng, Daniel D Johnson, Jacob Steinhardt, and Sarah Schwettmann.
Scaling automatic neuron description. https://transluce.org/neuron-descriptions, October 2024.

[8] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600.

[9] Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits, 2024. URL https://arxiv.org/abs/2406.11944.

[10] Esin Durmus, Alex Tamkin, Jack Clark, Jerry Wei, Jonathan Marcus, Joshua Batson, Kunal Handa, Liane Lovitt, Meg Tong, Miles McCain, Oliver Rausch, Saffron Huang, Sam Bowman, Stuart Ritchie, Tom Henighan, and Deep Ganguli. Evaluating feature steering: A case study in mitigating social biases, 2024. URL https://anthropic.com/research/evaluating-feature-steering.

[11] Eoin Farrell, Yeu-Tong Lau, and Arthur Conmy. Applying sparse autoencoders to unlearn knowledge in language models, 2024. URL https://arxiv.org/abs/2410.19278.

[12] Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, and Neel Nanda. Do i know this entity? knowledge awareness and hallucinations in language models, 2025. URL https://arxiv.org/abs/2411.14257.

[13] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027.

[14] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders, 2024. URL https://arxiv.org/abs/2406.04093.

[15] Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

[16] Jing Huang, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, and Christopher Potts. Rigorously assessing natural language explanations of neurons, 2023.
URL https://arxiv.org/abs/2309.10312.

[17] Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. Ravel: Evaluating interpretability methods on disentangling language model representations, 2024. URL https://arxiv.org/abs/2402.17700.

[18] Nick Jiang, Lily Sun, Lewis Smith, and Neel Nanda. Towards data-centric interpretability with sparse autoencoders, August 2025. URL https://www.alignmentforum.org/posts/a4EDinzAYtRwpNmx9/towards-data-centric-interpretability-with-sparse. LessWrong/Alignment Forum post.

[19] Caden Juang, Gonçalo Paulo, Jacob Drori, and Nora Belrose. Open source automated interpretability for sparse autoencoder features, July 2024. URL https://blog.eleuther.ai/autointerp/.

[20] Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models, 2024. URL https://arxiv.org/abs/2408.00113.

[21] Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503.09532.

[22] Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M. C. Höhne, and Kirill Bykov. Cosy: Evaluating textual explanations of neurons, 2024. URL https://arxiv.org/abs/2405.20331.

[23] Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, and Neel Nanda. Sparse autoencoders do not find canonical units of analysis, 2025. URL https://arxiv.org/abs/2502.04878.

[24] Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda.
Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024. URL https://arxiv.org/abs/2408.05147.

[25] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. On the biology of a large language model. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html.

[26] Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control, 2024. URL https://arxiv.org/abs/2405.08366.

[27] Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. Enhancing neural network interpretability with feature-aligned sparse autoencoders, 2024. URL https://arxiv.org/abs/2411.01220.

[28] Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2025. URL https://arxiv.org/abs/2403.19647.

[29] Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, and Evan Hubinger. Auditing language models for hidden objectives, 2025. URL https://arxiv.org/abs/2503.10965.
[30] Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, and Emma Pierson. Sparse autoencoders for hypothesis generation, 2025. URL https://arxiv.org/abs/2502.04382. [31]Neel Nanda, Arthur Conmy, Lewis Smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, and Vikrant Varma.Progress update #1 from the gdm mech in- terp team, 2025. URLhttps://w.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/ progress-update-from-the-gdm-mech-interp-team-summary .LessWrong/Alignment Forum post. [32]OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, An- drew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, 12 Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, 
Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Man- ning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pan- tuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. 
Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Kata- rina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774. [33]Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models, 2024. URLhttps://arxiv.org/abs/2410. 13928. [34] Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. A multimodal automated interpretability agent, 2025. URL https://arxiv.org/abs/2404.14394. [35]Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, and Jianfeng Gao. Explaining black box text modules in natural language with language models, 2023. URL https://arxiv.org/abs/2305.09863. 
[36] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchi- son, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozi ́ nska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Pluci ́ nska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe 13 Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, 
Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Cogan, Sarah Perrin, Sébastien M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks, Anca Dragan, Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, Robert Dadashi, and Alek Andreev. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118. [37]Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 son- net. Transformer Circuits Thread, 2024. URLhttps://transformer-circuits.pub/2024/ scaling-monosemanticity/index.html. [38]Nicholas L Turner, Adam Jermyn, and Joshua Batson.Measuring feature sensitivity using dataset filtering, July 2024.URLhttps://transformer-circuits.pub/2024/ july-update/index.html. 
A Evaluation Prompts

System Prompt:

You are a meticulous AI researcher conducting an important investigation into a specific feature inside a language model that activates in response to text inputs. Your overall task is to generate additional text samples that cause the feature to strongly activate.

You will receive a list of text examples on which the feature activates. Specific tokens causing activation will appear between delimiters like this. Consecutive activating tokens will also be accordingly delimited just like this . If no tokens are highlighted with , then the feature does not activate on any tokens in the input.

Note: features activate on a word-by-word basis. Also, feature activations can only depend on words before the word it activates on.

User Prompt:

Consider the feature that activates when the given examples below are present. Your task is to generate text samples that strongly activate this feature. Study the examples carefully to identify both their shared and varying traits.

Your generated samples should:
- Preserve any consistent traits, patterns, or constraints present across all examples
- Match the diversity level shown in the examples---neither more diverse nor more uniform
- Vary along the same dimensions that the examples vary (e.g., if examples differ in tone but share a topic, maintain that pattern)
- Avoid introducing new types of variation not present in the example set
- Avoid collapsing into repetitive or overly similar outputs

Generate exactly 11 new samples separated by <SAMPLE_SEPARATOR/>.

Note that the feature may involve semantic content, grammatical structures, abstract concepts, specific named entities (e.g., people, organizations, locations), or formatting elements like newlines, punctuation, citations, or special characters, for example, , or ←- represent newlines, , represents commas, - represents hyphens, etc that are activating the feature.

Present each sample without numbering or bullets.
Important: place <SAMPLE_SEPARATOR/> between generated samples. See the following 15 examples that activate the feature, separated by <SAMPLE_SEPARATOR/>:
<SAMPLE_SEPARATOR/>
count(); static const char* resource_to_cstring(const char* resource
<SAMPLE_SEPARATOR/>
What is the Java equivalent of JavaScript’s resource folder?←-←-My Wicket web application contains
<SAMPLE_SEPARATOR/>
side-effect: since the check isn’t so resource←-intensive, you can set the time between checks
...

Assistant Response:
"void free resourceMemory(void* ptr); static const char* load_ resourcePath(const char* resource);"
<SAMPLE_SEPARATOR/>
"How to configure the resource directory in a Python Flask application? I’m trying to serve static files from the"
<SAMPLE_SEPARATOR/>
warning: avoid heavy computation in the resource allocation phase, it may slow down startup.
<SAMPLE_SEPARATOR/>
...

B Feature Filtering Details

We evaluated 112 SAEs from the SAEBench dataset and 29 from the GemmaScope dataset. The SAEBench set spans seven SAE families (BatchTopK, MatryoshkaBatchTopK, TopK, JumpReLU, ReLU, Gated, and PAnneal), whereas all GemmaScope SAEs are JumpReLU.

During the study, we observed that some activation texts distributed with SAEBench do not consistently activate their associated SAE features, likely due to truncation. To address this, we computed, for each feature, the activation rate in truncated example text, defined as the proportion of published activation texts that reliably elicit the feature. Features with an activation rate below 90% were excluded from our analysis. Table 1 and Table 2 report the impact of this filtering on our study. In Figure 9 we show our main GemmaScope results with different filtering thresholds. For all choices of threshold, our main results hold.
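The filtering rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that each feature comes with the peak activation measured on each of its published (truncated) example texts, and that "elicits the feature" means a peak activation above zero. The function names, feature IDs, and activation values are all hypothetical.

```python
from typing import Dict, List


def activation_rate(peak_activations: List[float], threshold: float = 0.0) -> float:
    """Fraction of a feature's published example texts whose peak activation exceeds `threshold`.

    Assumption: a text "elicits" the feature when its peak activation is > 0.
    """
    if not peak_activations:
        return 0.0
    active = sum(1 for a in peak_activations if a > threshold)
    return active / len(peak_activations)


def filter_features(peaks_by_feature: Dict[int, List[float]], cutoff: float = 0.9) -> List[int]:
    """Keep features whose activation rate is at least `cutoff` (90% in Appendix B)."""
    return sorted(
        fid for fid, peaks in peaks_by_feature.items()
        if activation_rate(peaks) >= cutoff
    )


# Hypothetical peak activations on 10 published example texts per feature.
example = {
    101: [3.2, 1.1, 0.7, 2.5, 4.0, 0.9, 1.3, 2.2, 0.6, 1.8],  # 10/10 texts activate
    202: [0.0, 1.4, 0.0, 2.1, 0.0, 0.0, 3.3, 0.0, 1.0, 0.0],  # 4/10 texts activate
    303: [5.0, 4.1, 0.0, 3.9, 2.8, 6.2, 1.7, 2.4, 3.0, 4.4],  # 9/10 texts activate
}

kept = filter_features(example)
excluded_pct = 100 * (1 - len(kept) / len(example))
print(kept, f"{excluded_pct:.1f}% excluded")
```

Sweeping `cutoff` over several values corresponds to the robustness check reported in Figure 9.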
[Figure 9 plots omitted: five panels titled Cutoff = 0, Cutoff = 0.5, Cutoff = 0.8, Cutoff = 0.9, and Cutoff = 1, each plotting Sensitivity against L0 (Sparsity) on a log axis, with one series per SAE Width (65k, 131k, 262k, 524k, 1m).]

Figure 9: Robustness to Feature Selection Cutoffs. GemmaScope scaling results shown with different shortened text activation filter cutoffs. Our main results are robust to the choice of cutoff threshold, demonstrating that the observed scaling trends are not artifacts of our feature selection criteria.

| Model | SAE Type | No. SAEs | Avg. No. Feat. | Avg. Sens. | Avg. No. Remain. (90% thr.) | Avg. Sens. (90% thr.) | % Feat. Excluded (90% thr.) | % Sens. Change (90% thr.) |
|---|---|---|---|---|---|---|---|---|
| Gemma-2-2B | All, 16k | 7 | 998 | 0.875 | 704 | 0.982 | 29.4% | 12.2% |
| Gemma-2-2B | All, 4k | 7 | 999 | 0.918 | 801 | 0.984 | 19.8% | 7.2% |
| Gemma-2-2B | BatchTopK, 65k | 6 | 969 | 0.814 | 562 | 0.980 | 42.1% | 20.6% |
| Gemma-2-2B | Gated, 65k | 6 | 981 | 0.826 | 562 | 0.978 | 42.7% | 18.4% |
| Gemma-2-2B | JumpReLu, 65k | 6 | 981 | 0.834 | 597 | 0.979 | 39.2% | 17.4% |
| Gemma-2-2B | MatryoshkaBatchTopK, 65k | 6 | 964 | 0.776 | 485 | 0.979 | 49.7% | 26.3% |
| Gemma-2-2B | PAnneal, 65k | 6 | 997 | 0.893 | 749 | 0.986 | 24.9% | 10.7% |
| Gemma-2-2B | Relu, 65k | 6 | 994 | 0.848 | 646 | 0.983 | 35.0% | 16.0% |
| Gemma-2-2B | TopK, 65k | 6 | 972 | 0.820 | 574 | 0.979 | 41.0% | 19.7% |
| Pythia-160M | All, 16k | 7 | 995 | 0.569 | 417 | 0.981 | 58.1% | 137.8% |
| Pythia-160M | All, 4k | 7 | 1299 | 0.908 | 1029 | 0.986 | 21.0% | 8.6% |
| Pythia-160M | BatchTopK, 65k | 6 | 978 | 0.850 | 674 | 0.987 | 30.9% | 16.1% |
| Pythia-160M | Gated, 65k | 6 | 994 | 0.786 | 522 | 0.978 | 47.5% | 24.6% |
| Pythia-160M | JumpReLu, 65k | 6 | 995 | 0.869 | 727 | 0.985 | 26.9% | 13.4% |
| Pythia-160M | MatryoshkaBatchTopK, 65k | 6 | 968 | 0.817 | 624 | 0.986 | 35.1% | 20.8% |
| Pythia-160M | PAnneal, 65k | 6 | 998 | 0.838 | 627 | 0.985 | 37.2% | 17.6% |
| Pythia-160M | Relu, 65k | 6 | 997 | 0.834 | 633 | 0.984 | 36.5% | 18.1% |
| Pythia-160M | TopK, 65k | 6 | 978 | 0.845 | 667 | 0.987 | 31.7% | 17.3% |
| ALL | | 112 | 1006 | 0.830 | 648 | 0.983 | 35.6% | 18.4% |

Table 1: SAE filtering statistics showing the impact of excluding features with activation rate below 90% in truncated example text. Columns show the model, SAE type, number of SAEs, average features per SAE before filtering, average sensitivity before filtering, and the effects after applying the 90% threshold: remaining features, new sensitivity, percentage excluded, and percentage sensitivity change.

| Width | No. SAEs | Features Evaluated | Features Remaining | % Excluded |
|---|---|---|---|---|
| 65k | 5 | 2336 | 1144 | 51.0% |
| 131k | 6 | 2438 | 990 | 59.4% |
| 262k | 6 | 2381 | 798 | 66.6% |
| 524k | 6 | 2339 | 629 | 73.2% |
| 1M | 6 | 2278 | 485 | 78.8% |

Table 2: GemmaScope filtering statistics with 90% activation rate cutoff. All SAEs are JumpReLU trained on Gemma-2-2B layer 12 residual stream. Wider SAEs show increased feature exclusion rates.

C Controlling for Feature Frequency

To ensure that our sensitivity results are not confounded by differences in feature frequency across SAE widths, we repeated our GemmaScope analysis with frequency-weighted sampling. Different width SAEs may have systematically different feature frequency distributions, which could potentially influence average sensitivity measurements.

C.1 Weighting Methodology

We re-weighted features so that each SAE has the same effective frequency distribution. Specifically, for each SAE, we:
1. Computed the frequency distribution of features across all SAEs in our study
2. Determined a target frequency distribution (the average distribution across all SAE widths)
3. Assigned weights to each feature inversely proportional to its frequency’s representation in the SAE relative to the target distribution
4. Re-computed average sensitivity using these weights

Figure 10 illustrates this re-weighting process, showing how features at different frequencies are weighted to achieve a uniform distribution across SAEs.

C.2 Results with Frequency Control

Figure 11 shows the results after applying frequency weighting. Explicitly controlling for feature frequency via reweighting does not change our main results. Wider SAEs show lower average feature sensitivity. At a given width, SAEs with more active latents have higher sensitivity. This confirms that our main results are not an artifact of frequency distribution differences across SAE widths or sparsities.
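The four re-weighting steps of C.1 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes features are grouped into frequency buckets (as the "Frequency Bucket" axis of Figure 10 suggests), that the target distribution is the unweighted mean of the per-SAE bucket proportions, and that each feature's weight is the target proportion of its bucket divided by that bucket's proportion within its own SAE. The bucket edges, SAE names, and numbers are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def bucket_index(freq: float, edges: Sequence[float]) -> int:
    """Bucket i covers [edges[i], edges[i+1]). Assumes edges[0] <= freq < edges[-1]."""
    return bisect_right(edges, freq) - 1


def bucket_proportions(freqs: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Step 1: share of an SAE's features that falls into each frequency bucket."""
    counts = [0] * (len(edges) - 1)
    for f in freqs:
        counts[bucket_index(f, edges)] += 1
    return [c / len(freqs) for c in counts]


def reweighted_sensitivity(
    saes: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    edges: Sequence[float],
) -> Dict[str, float]:
    """saes maps an SAE name to (feature frequencies, feature sensitivities)."""
    props = {name: bucket_proportions(freqs, edges) for name, (freqs, _) in saes.items()}
    n_buckets = len(edges) - 1
    # Step 2: target distribution = average of the per-SAE distributions.
    target = [sum(p[b] for p in props.values()) / len(props) for b in range(n_buckets)]
    result = {}
    for name, (freqs, sens) in saes.items():
        # Step 3: weight = target share / this SAE's share, for the feature's bucket.
        buckets = [bucket_index(f, edges) for f in freqs]
        weights = [target[b] / props[name][b] for b in buckets]
        # Step 4: weighted average sensitivity.
        result[name] = sum(w * s for w, s in zip(weights, sens)) / sum(weights)
    return result


# Hypothetical data: a narrow SAE with mostly frequent features, a wide one with mostly rare ones.
edges = [1e-5, 1e-4, 1e-3, 1e-2]
saes = {
    "narrow": ([2e-3, 3e-3, 5e-4, 5e-5], [1.0, 0.9, 0.8, 0.5]),
    "wide": ([2e-5, 3e-5, 5e-5, 5e-4], [0.6, 0.5, 0.4, 0.9]),
}
weighted = reweighted_sensitivity(saes, edges)
print(weighted)
```

A bucket that is empty in one SAE (the most frequent bucket for "wide" above) simply contributes nothing to that SAE's weighted average, so the match to the target distribution is only approximate when supports differ.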
The similarity between these frequency-controlled results and our main findings (Figure 6) demonstrates that the sensitivity-width tradeoff is a robust phenomenon independent of feature frequency distributions.

[Figure 10 plots omitted. Panel "Feature Distribution": Proportion of Features against Frequency (log axis) for the Global distribution and for width_65k_l0_72, width_131k_l0_129, width_262k_l0_121, width_524k_l0_115, and width_1m_l0_107. Panel "Reweighting Scheme": Relative Weight against Frequency Bucket for the same five SAEs, with buckets [6e-06, 1e-05), [1e-05, 2e-05), [2e-05, 3e-05), [3e-05, 6e-05), [6e-05, 1e-04), [1e-04, 2e-04), [2e-04, 3e-04), [3e-04, 6e-04), [6e-04, 1e-03), [1e-03, 2e-03), [2e-03, 3e-03), [3e-03, 6e-03), [6e-03, 1e-02).]

Figure 10: Frequency re-weighting methodology. Visualization of how features are re-weighted to control for frequency differences across SAE widths.

[Figure 11 plot omitted: Weighted Sensitivity against L0 (Sparsity) on a log axis, with one series per SAE Width (65k, 131k, 262k, 524k, 1m).]

Figure 11: Feature sensitivity with frequency weighting. Average sensitivity across GemmaScope SAEs after re-weighting to control for feature frequency. The declining sensitivity with width persists, confirming our main findings.

D Preceding Token Length Analysis

When we look through generated text that fails to activate the feature, we occasionally see cases where the text intended to activate the feature appears very early in the sequence. We wanted to check whether this early positioning was the cause of the feature failing to activate.

To investigate this, we collected all generated texts and, for each one, looked for the first token that the model annotated with curly braces. This annotation indicates where the model intended the feature to activate; we call these target tokens. In Figure 12, we show the distribution of where the target token appears in the generated text.
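The target-token lookup described above can be sketched as follows. This is a simplified illustration under stated assumptions: the annotation is taken to be a single opening `{` before the target token, and whitespace splitting stands in for the language model's tokenizer, which the actual analysis presumably used. The function names and sample strings are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def preceding_token_count(annotated_text: str) -> Optional[int]:
    """Number of tokens before the first curly-brace annotation, or None if there is none.

    Assumption: whitespace-separated words approximate model tokens.
    """
    start = annotated_text.find("{")
    if start == -1:
        return None
    return len(annotated_text[:start].split())


def position_histogram(samples: List[str], cap: int = 15) -> Counter:
    """Count samples per number of preceding tokens, pooling everything above `cap`."""
    hist: Counter = Counter()
    for sample in samples:
        n = preceding_token_count(sample)
        if n is None:
            continue  # the generator did not mark a target token
        hist[str(n) if n <= cap else f">{cap}"] += 1
    return hist


# Hypothetical annotated generations.
samples = [
    "{resource} allocation failed at startup",
    "Please free the {resource} before exiting",
    "a sample without any annotation",
]
hist = position_histogram(samples)
print(hist)
```

Splitting the resulting counts by whether the feature fired on each sample gives the per-bucket active and inactive fractions that Figure 12 plots.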
We found that generated texts indeed often have relatively short prefixes leading up to the target token. For example, in 1.5% of generations, the target token is actually the first token of the generation, and in around 30% of generations, the target token is preceded by 5 or fewer tokens. However, even among generated samples where the target token occurs early, most successfully activate the feature. We do note that the proportion of generated text which fails to activate the feature is higher in generations with shorter prefixes. This represents a slight limitation of our evaluation that could be improved with better prompting and instructions, though the high success rate of feature activation even with short or no prefixes suggests that the bias does not significantly compromise our evaluation.

[Figure 12 plot omitted: bar chart of Generated Text Fraction against Number of Preceding Tokens (0 to 15, and >15), with Inactive and Active series. Data labels per bucket: 0: 0.4%, 1.5%; 1: 0.7%, 5.0%; 2: 0.9%, 6.6%; 3: 0.8%, 7.1%; 4: 0.6%, 8.9%; 5: 0.8%, 9.4%; 6: 0.9%, 9.7%; 7: 0.8%, 9.6%; 8: 0.7%, 9.6%; 9: 0.6%, 8.2%; 10: 0.5%, 8.4%; 11: 0.3%, 4.5%; 12: 0.2%, 3.2%; 13: 0.2%, 2.0%; 14: 0.1%, 1.5%; 15: 0.1%, 1.1%; >15: 0.3%, 3.6%.]

Figure 12: Target Token Position and Feature Activation Success. For each generated text sample, we identify the target token expected to activate the feature. The chart shows the number of tokens which occur in the generated text before the target token. Bars are colored based on whether the generated text successfully activated the feature (green) or failed to activate (red).

E Additional Feature Examples

We present additional feature dashboards showing interpretable features with zero sensitivity (Figure 13), interpretable features with moderate sensitivity (Figure 14), and features with high sensitivity but low automated interpretability scores that appear qualitatively interpretable (Figure 15). Each dashboard displays 4 out of 15 activating text examples and 4 out of 10 generated text examples.
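To make the sensitivity percentages in the dashboards concrete, the sketch below shows one way such a score could be computed from a generation response. It assumes, as a reading of the dashboards rather than a quoted definition, that a feature's sensitivity is the fraction of generated samples on which it activates at least once (peak activation above zero). The response string and activation values are hypothetical; only the `<SAMPLE_SEPARATOR/>` token comes from the prompt in Appendix A.

```python
from typing import List

SEPARATOR = "<SAMPLE_SEPARATOR/>"  # separator token requested in the Appendix A prompt


def split_samples(response: str) -> List[str]:
    """Split a generation response into individual samples, dropping empty pieces."""
    return [piece.strip() for piece in response.split(SEPARATOR) if piece.strip()]


def sensitivity(peak_activations: List[float]) -> float:
    """Fraction of generated samples whose peak feature activation is above zero.

    Assumption: any nonzero activation counts as the feature firing on that sample.
    """
    if not peak_activations:
        raise ValueError("need at least one generated sample")
    return sum(1 for a in peak_activations if a > 0.0) / len(peak_activations)


# Hypothetical response and per-sample peak activations (10 generated samples).
response = "first sample <SAMPLE_SEPARATOR/> second sample <SAMPLE_SEPARATOR/> third sample"
samples = split_samples(response)

peaks = [7.31, 0.0, 0.0, 7.44, 5.2, 0.0, 6.1, 0.0, 8.0, 9.3]
score = sensitivity(peaks)
print(samples, f"{score:.0%}")
```

Under this reading, a dashboard showing 0.00 for every generated sample corresponds to a sensitivity of 0.00%, and a feature that fires on 6 of 10 generated samples scores 60.00%.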
Feat ID: 92700 | Desc: numerical values and monetary amounts in the text | Freq: 4.34e-05 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
14.94 | ↵Or, maybe at around 200bucks, a lot of people end up with both.
16.88 | a conventional Google tablet. At 200bucks, I couldn't resist. But, the
15.06 | is also WIC, which at4.8billionin FY 2012 gets us to
8.62 | , which at anannualtotalof540billion, would amount to funding close to half the
Generated Text:
0.00 | Prices of electric bikes have jumped to roughly $1500dollars this year, making commuting
0.00 | The marathon registration fee this year peaked at $85dollars, drawing thousands of
0.00 | Ticket prices for the concert rose to approximately $500dollars, reflecting the artist's
0.00 | The annual subsidy for renewable energy projects surpassed $700million, encouraging

Feat ID: 361399 | Desc: the substring "super" and variations of "base" in programming contexts | Freq: 2.12e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
8.12 | .Flush();↵base.OnTearDown();↵[Test
13.75 | autorelease]];↵ [buffer appendString:[super description]];↵ [buffer appendString:@"
5.31 | Head(IHeaderResponse response) ↵ super.renderHead(response);↵ response.
3.56 | true_type)↵ ↵ Base::construct(expr);↵ ↵ template
Generated Text:
0.00 | super.initializeComponent();↵this.componentDidMount();
0.00 | super.performAction();↵logState();
0.00 | super.OnStart();↵logger.info("Service started");
0.00 | super.updateSettings(newSettings);↵notifyObservers();

Feat ID: 388626 | Desc: numeric identifiers or placeholders represented with angle brackets | Freq: 3.97e-04 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
12.81 | ASSUME_NONNULL_END↵<eos>477F.2d 598↵
17.50 | ↵ ↵ ]↵15376157484
12.06 | ],↵ );↵;↵<eos>↵337F.Supp. 150 (
10.81 | ": "partial link text"↵154749992302
Generated Text:
0.00 | 1729some random text854
0.00 | Reference number: 504732 logged successfully
0.00 | <script>var key = "13974";</script>
0.00 | Timestamp 8430 registered without error

Feat ID: 781627 | Desc: the word 'export' in programming contexts, particularly related to functions and types | Freq: 1.16e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
32.25 | "data" or "err"↵ */↵exportdefault function request(url, options) ↵
35.25 | );↵ return newContainer;↵export const FadeTransition = ↵ start(container
28.50 | until(deadline));↵ ↵export class ClientModule ↵ constructor(name,
33.25 | ↵ return Authorized;↵;↵export CURRENT ;↵export default Authorized =>
Generated Text:
0.00 | export class ApiService ↵ constructor(baseUrl) ↵ this.baseUrl = baseUrl;↵ ↵
0.00 | export const constants = ↵ appName: 'MyApp',↵ version: '1.0.0',↵;
0.00 | export function validateInput(input) ↵ if (!input) throw new Error('Input required');↵
0.00 | export function calculateSum(a, b) ↵ return a + b;↵

Feat ID: 121516 | Desc: the substring '>' in various coding and programming contexts | Freq: 2.52e-05 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
20.62 | width: 100%;↵body.is-loading &
17.00 | ↵ Return(HPSD)↵ ↵ Return(SPSD)↵ ↵
17.88 | ↵ Return(SPSD)↵ ↵ // End of Scope(\_SB
18.12 | ->first, stp->second);↵ ↵ void synchronize()↵
Generated Text:
0.00 | ::process_queue;↵ queue(pg);↵struct queue_set_ready ;
0.00 | margin-bottom: 12pt;↵</style>
0.00 | #Major.#Minor.#Revision.#Patch↵InfoLabelText=#Label↵
0.00 | if (!is_loading) ↵ ↵ start_process();↵

Feat ID: 593453 | Desc: the substring 'math' in mathematical notation and equations | Freq: 2.22e-05 | Sensitivity: 0.00% | Interp Score: 92.86%
Activating Text Examples:
14.31 | n' k'\!)↵ tanh[(β/2) \
13.06 | -1)(N g)^2E_x_0:t-
9.44 | and set $ ↵M= R p_1 p_
11.69 | 8mu]2, 1 -6mu
Generated Text:
0.00 | k↵\!+\!↵ 1 iν \!-\! _ ω
0.00 | n' k'\!↵ cosh [ β2 (E - μ) ]
0.00 | 1,…,n_s = δ q_1,…,q_s · P( \
0.00 | -n ω) = 1 iν \!-\! θφ - χ

Feat ID: 1023262 | Desc: code syntax and function callbacks in programming contexts | Freq: 7.21e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
22.00 | ↵ return true;↵ ;↵// Setup callback first, so we don't
21.00 | ↵ return true;↵ ;↵this.resume = function() ↵ try
13.81 | ↵ headerValue = ′;↵ ;↵parser.onHeadersEnd = function() ↵
21.25 | .onPart(part);↵ ;↵parser.onEnd = function() ↵
Generated Text:
0.00 | ;↵self.emit('close');
0.00 | ;↵// If request error, destroy.
0.00 | ;↵body = Buffer.from(chunk);
0.00 | ;↵fileStream.pipe(writer);

Feat ID: 770421 | Desc: phrases indicating prominent locations or regions within a larger context | Freq: 5.75e-05 | Sensitivity: 0.00% | Interp Score: 100.00%
Activating Text Examples:
17.25 | routes take you past some of the most spectacular sceneryinthe state, while friends old
12.25 | Pedro Bay has one of the most attractive settingsinsouthwest Alaska. Pedro Bay is
7.75 | Heath and Lake District the largest economically utilized pond regionin Europe.↵Part of
5.50 | the Unesco-listed cathedral – the largest Gothic cathedralinthe world – to the beautiful
Generated Text:
0.00 | the historic piazzainthe heart of Florence bustles with tourists and locals alike, offering
0.00 | the ancient libraryinthe old city houses manuscripts dating back to the medieval period,
0.00 | the extensive cave systeminthe Carpathian Mountains is famous for its unique rock
0.00 | the bustling harborinthe Mediterranean city offers stunning views of yachts and fishing

Figure 13: All 8 SAE features studied in Figure 3 that have sensitivity score 0 and auto-interp score over 0.9. For 3 of these features, low sensitivity may be due to generated passages immediately starting with the text intended to activate the feature.
Feat ID: 563047 | Desc: the concept of relationships or comparisons between different entities or conditions | Freq: 2.74e-03 | Sensitivity: 60.00% | Interp Score: 92.86%
Activating Text Examples:
7.19 | between a somatic mode of presentation on the one handanda psychological mode on the
10.06 | . This finding suggests an interdependence of heavy alcoholconsumptionand psychological
10.75 | Data Separation**]: Algorithmic code is separatedfromthedataonwhich it operates.
10.94 | not found when comparing a mixed level of parental educationtoahighlevel of parental
Generated Text:
7.31 | Data Handling**]: Source files are segregatedfrom the processingroutinesthey drive. This
0.00 | a mixtureof phospholipids blended to achieve a molar ratio of100:20 cholesterol stabilized
0.00 | examination highlights the synergistic effectof dopamineand glutamate receptor activity,
7.44 | survival rate differences often show inverse correlationswith tumor progression markers

Feat ID: 297594 | Desc: the word 'problems' and related concepts indicating issues or challenges | Freq: 3.33e-05 | Sensitivity: 60.00% | Interp Score: 100.00%
Activating Text Examples:
8.75 | of a trendy Melbourne art gallery, has her ownproblems – chasing down a delinquent
7.34 | to attempts at calibration.↵Of course, ourproblemsarenot likely to clear up so one may
12.62 | him, thatʼs only the start of theirproblems.↵In this third Alex Caine book, sequel
6.78 | mnir becomes queen of a land with as manyproblems as the one she fled. Her long-lived
Generated Text:
6.97 | facing constant delays, she explained herproblems quietly but with visible frustration at the
0.00 | discussing legislative issues, where theproblems oftenare complex and intertwined with
8.12 | Nina realized that understanding hisproblems required stepping into his perspective; that
0.00 | explaining the malfunction during the software demo, he hoped the technicalproblem would

Feat ID: 2870 | Desc: words related to teaching and education | Freq: 2.57e-05 | Sensitivity: 60.00% | Interp Score: 92.86%
Activating Text Examples:
14.12 | C.F.E. is an internationally respectedteacher,trainerand clinician with an expertise in the
11.00 | you love aboutteaching?↵‘I love toteach,” Johnston said, “and the some of the
10.62 | dots and monkeys.”↵What do you love about teaching?↵‘I love toteach,” Johnston
11.62 | century. Architect, artist, furniture designer, andeducator, Ralph Rapson has played a
Generated Text:
11.75 | I always felt passionate aboutteaching because it allows me to inspire others. “Toteachis to
0.00 | Kids bring so much energy and curiosity to the classroom. When Iʼmteaching, I get to see
12.88 | His work as a dedicatededucatorand community advocate has influenced many in the arts
0.00 | Initially nervous aboutteaching, she grew into her role and now finds great joy in it. Her first

Feat ID: 662898 | Desc: theorems and corollaries referenced by their numbers in mathematical or academic contexts | Freq: 6.25e-05 | Sensitivity: 70.00% | Interp Score: 92.86%
Activating Text Examples:
21.25 | othe (2017).↵Theorem 3 includes existing DR moment functions as special cases where $
23.00 | on this section can be applied.↵Corollary 9.2 from [@H] states that any Or
27.00 | (β)$ and possibly additional functions. Proposition 2 of Newey (1994a)
11.44 | font-variant:small-caps;">Theorem 2:</span> *If the marginal distribution of*
Generated Text:
2.06 | Consider Lemma 7 from Smith (2003) which establishes conditions for convergence in
0.00 | Based on Proposition 5 in the appendix, the asymptotic variance can be expressed as
0.00 | From the proof of Lemma 11, we derive bounds on the estimator variance using
2.44 | The result of Corollary 4 follows immediately by applying the dominated convergence

Feat ID: 267258 | Desc: the variable placeholder 'i' in programming contexts | Freq: 1.27e-04 | Sensitivity: 50.00% | Interp Score: 92.86%
Activating Text Examples:
8.50 | CDATA_CTL);↵msg->buf[i] = (u8)((rxd
15.50 | rows <- createRow(sheet, rowIndex=i)↵ for (j in 1:
18.88 | ('model0.h5'.format(ix)) ↵ end = time.time()
11.75 | I2CDATA_CTL);↵msgs[i + 1].buf[0] = (
Generated Text:
3.38 | commandList[i].execute();↵status = commandList[i].getStatus();
0.00 | <td *ngFor="let item of items; let i=index">↵ item.name ↵</td>
4.56 | dataset['column_i'] = values[i]↵dataset['column_i'].mean()
0.00 | (i,'update')">↵ <span class="icon"></span>↵</button>

Feat ID: 898197 | Desc: terms related to the concept of survival and its implications in various contexts | Freq: 4.54e-05 | Sensitivity: 70.00% | Interp Score: 100.00%
Activating Text Examples:
10.00 | three main purposes. The first was to facilitate thesurvivalof the spongesacross the
10.00 | main purposes. The first was to facilitate thesurvivalof the spongesacross the battery of
12.12 | and size and the size of the grain affects itssurvivabilityin the archaeological
15.50 | Oxygen is a vital substrate to the continual function andsurvivalof cerebral tissue. Rapid
Generated Text:
9.50 | Examining the role of autophagy in prolonging cellularsurvival, we used markers for
0.00 | Analysis of cohort data revealed a strong correlation between dietary intake andsurvival
0.00 | The clinical trial results showed higher mediansurvival time among patients receiving
15.38 | Genetic diversity contributes significantly to thesurvivaladvantage seen in populations

Feat ID: 362816 | Desc: various expressions of the word "by" followed by different methods or approaches | Freq: 2.67e-04 | Sensitivity: 40.00% | Interp Score: 92.86%
Activating Text Examples:
13.88 | states, chose to↵achieve the same balance byalternate means. We have judges who
13.50 | i=1,·s, r$.↵Byastraightforward argument one may notice that the condition
12.25 | vertices.↵For each face choose a triangulation bynon-intersecting diagonals. Let $d$
10.88 | so they will not be repeated here.↵Byprior Order, the Court authorized the mailing of
Generated Text:
5.84 | this result was obtained byinnovative techniques involving machine learning and deep
0.00 | the report was compiled byanexperienced team of analysts specializing in market trends
0.00 | the final decision was reached bymutual agreement among the stakeholders following
0.00 | marketing strategies improved bytargeted campaigns using demographic and

Feat ID: 878839 | Desc: substrings containing specific sequences of letters within proper nouns and scientific terms | Freq: 3.44e-04 | Sensitivity: 50.00% | Interp Score: 92.86%
Activating Text Examples:
7.66 | --449--449--Sterry \[[@CR22]\] (IMP2
7.72 | been used unfairly, please contact us<eos>Kralingse Zoom metro station↵Kralingse
8.62 | has been used unfairly, please contact us<eos>Kralingse Zoom metro station↵Kralingse
4.53 | ↵ *Buplerum falcatum*(root)
Generated Text:
0.00 | font-style:italic;">Sergei V. Andreev and T.amara V.
5.53 | been interrupted, please call the office<eos>Yalum metro station↵Yalum
0.00 | <bos> ↵ *Cana
2.16 | *Pleurotus ostreatus* (mushroom cap)

Figure 14: 8 randomly sampled features from those studied in Figure 3 that have sensitivity score between 0.4 and 0.7 and high (≥ 0.9) auto-interp score.

Feat ID: 707468 | Desc: comma-separated lists or phrases related to various subjects or entities | Freq: 4.06e-03 | Sensitivity: 90.00% | Interp Score: 50.00%
Activating Text Examples:
45.25 | ↵It also issues police certificates, for a fee, needed to obtain immigration visas for
29.38 | the State filed a motion, naming all three defendants, to *570 release defense exhibits for
26.38 | command and/or information, via the domain manager, individually to each managed node
41.00 | was supplemented, in cases in which it was necessary, by administration of an intravenous
Generated Text:
19.38 | during the evaluation, samples were cooled rapidly, to preserve microstructural features that
22.75 | The software upgrade, though challenging, enhanced system stability, and improved
0.00 | in this section, we discuss the experimental setup, apparatus calibration, and data
26.38 | probabilities were computed for each state transition, using Markov chain methods,

Feat ID: 70780 | Desc: specific measurements or conditions in scientific and technical contexts | Freq: 2.98e-03 | Sensitivity: 100.00% | Interp Score: 50.00%
Activating Text Examples:
12.00 | Basin, Nova Scotia (at 7°C) and filtered on a 1-μm mesh
8.25 | the left hand side has the same rate of convergence).↵**2. The finite variation case:** $\
27.75 | a more cost-effective alternative (metalbased dentures) for patients with ridge resorption. In
37.50 | to collect a water sample (under negative pressure conditions) from inside the incubation
Generated Text:
20.75 | The flow rate was measured (at 5 liters per minute)) to ensure consistent operation
30.50 | samples were collected (using a sterile syringe)) immediately after centrifugation to prevent
27.75 | a thin layer of lubricant (containing molybdenum disulfide)) was applied to reduce friction
6.50 | temperature readings were stabilized (around minus 10°C). This low temperature helped

Feat ID: 935063 | Desc: text related to academic papers and specific sections or results within them | Freq: 1.08e-04 | Sensitivity: 100.00% | Interp Score: 57.14%
Activating Text Examples:
12.62 | .clear-objects.com/clearodb<p>The end users would be software developers who writes
27.50 | progress.↵------↵Toshio↵<p>You can download the first 25%
13.62 | and Inference,” working paper, UCLA.↵<span style="font-variant:small-caps;">
15.50 | Discrete Game,” working paper, Stanford.↵<span style="font-variant:small-caps;">
Generated Text:
9.88 | for the algorithms described in Section 4:↵<span style="font-variant:small-caps;">
32.00 | https://repository.university.edu/papers<p>Supplementary materials available
10.25 | final remarks in Chapter 5:↵<span style="font-variant:small-caps;">
11.88 | and proof of Theorem 3 is given in Appendix B.↵<span style="font-variant:small-caps;">

Feat ID: 444555 | Desc: scientific concepts related to genetics and cell functions | Freq: 2.01e-03 | Sensitivity: 100.00% | Interp Score: 57.14%
Activating Text Examples:
8.38 | colonization duringthespring isalsoinsufficienttoexplainthepatternsofallele frequency
6.84 | promoting genes andcell cycle exit.Recently,weshowedthat the chromatin regulatory
8.12 | promoting genesandcellcycleexit.Recently,weshowedthat the chromatin regulatory
5.66 | 14 byfiling↵suitinthe District Courtagainstthe JCC challenging the constitutionality
Generated Text:
6.59 | promoting genesandcellcycleexit.Recentstudieshave identified new factors involved in
5.66 | 1) “poorly drafted,” andrecommendedspecific provisionsthat clarify ambiguous contract
6.78 | executed.Otherswereconscriptedintolaborbattalions.They performed various
6.16 | *^(*τ*) generatinga unitary evolution,*V*, ofthe quantum system,asillustratedinFigure3

Feat ID: 756087 | Desc: prepositions indicating relationships or methods of integration in technical contexts | Freq: 1.78e-04 | Sensitivity: 100.00% | Interp Score: 57.14%
Activating Text Examples:
10.56 | ↵<eos>Q:↵Control RaspberryPI 3via Serial/GPIO↵So I'm working on
10.56 | .<eos>Q:↵How to bind web servicein jquery easy ui CRUD DataGrid↵I am developing
9.25 | That's really all there is to connecting VSwith Unity. Again, UnityVS will further enhance this
5.06 | Q should simply make using the full power of SQLin Javaas simple as possible. In that way,
Generated Text:
10.38 | Can you connect Arduino Unowith Bluetooth modules easilyfor remote control? I need
4.81 | Is there a way to upload filesto AWS S3using a simple CLI command? Any recommended
6.56 | Is it possible to use Pythonfor automating Excel taskswithin a corporate networkwith strict
5.69 | Could anyone explain the process of building and deploying Angular appsin Netlify? I have a

Feat ID: 718605 | Desc: numerical values and their related contexts in scientific and statistical discussions | Freq: 4.20e-04 | Sensitivity: 100.00% | Interp Score: 57.14%
Activating Text Examples:
20.75 | made playoff appearances in 1989,1991, 1992
12.19 | -of-month values of 29,30, and 31 are a problem
10.56 | chromosomes 11q13.5,11p15.1, 1
14.19 | and their relevant roots for lattices having $1$,$2$,$3$ and $4$ triangles
Generated Text:
16.75 | chromosomal gains at 5q13.3,12p12.1,7
15.06 | occurrences increased from 12 (20%),18 (30%), 22 (40%) over time
11.56 | reaction rates at temperatures 100,150, and 200 °C
10.06 | intervals of 5--10,15--20, 25--30 days were tested

Feat ID: 238747 | Desc: words related to training, formats, interventions, and various scientific and technical concepts | Freq: 3.03e-01 | Sensitivity: 100.00% | Interp Score: 50.00%
Activating Text Examples:
53.00 | trainingacademy,VETA when there were noBPOs and no one knew thatEnglishcoaching
46.50 | and I think itʼs pretty clear that thetemple ismajor part of that. It may well be
59.50 | Are there anyhuman-readablefileformats that containarmaturedata?↵Lastly, from a
63.50 | pre-renderedanimation, and when would you usearmaturedata in themodel tomanipulate
Generated Text:
42.00 | Thetempleceremonies symbolizeeternalcovenants and are deeply significant withinLDS
44.50 | Can anyone recommendhumanreadablefileformats compatible withskeletalanimationrigs
55.50 | Document 3,247,988 discloses agreaseformulation incorporatingadvancedpolyureathick
56.50 | Recentefforts involve employingaquaticanimals formutagenesis andteratogenesisstudies

Feat ID: 304338 | Desc: code-related constructs or selectors utilizing specific attributes and methods in programming | Freq: 2.93e-05 | Sensitivity: 90.00% | Interp Score: 50.00%
Activating Text Examples:
25.12 | of ways to do it,↵$("[title='cust 1']").remove();↵$("#cust_
15.62 | ul.numberlist li:not([id*="test"])')↵console.log(document.
18.12 | javascript?↵.variations_button[style*="display: none;"] + div↵This is
20.62 | ↵Because the selector you use, [style*="display: none;"], is looking for the presence
Generated Text:
0.00 | <span class="button" data-tooltip="Add new item">
7.00 | a.primary_nav[data-active="true"]
14.00 | figure.caption[data-label="fig:FlowDiagram"]
5.50 | These parameters are described in Table 3.[]data-label="tab:parameters"]

Figure 15: 8 randomly sampled features from those studied in Figure 3 that have high (≥0.8) sensitivity score and low auto-interp score (≤0.6). These features tend to be interpretable despite their low automated interpretability score.