
Paper deep dive

Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Moritz Miller, Florent Draye, Bernhard Schölkopf

Year: 2026 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 65

Models: Gemma-2-2B (fine-tuned around a fixed sparse autoencoder)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 5:21:11 PM

Summary

The paper introduces an orthogonality regularization technique for sparse autoencoders (SAEs) in language models to improve feature identifiability and intervenability. By enforcing orthogonality on the decoder matrix, the authors reduce feature interference and superposition, enabling more isolated causal interventions without degrading model performance on reasoning tasks.

Entities (5)

Orthogonality Regularization · technique · 99%
MetaMathQA · dataset · 98%
Sparse Autoencoder · model-architecture · 98%
Gemma-2-2B · language-model · 95%
Independent Causal Mechanisms · principle · 95%

Relation Signals (3)

Sparse Autoencoder integrated into Gemma-2-2B

confidence 98% · Fixing the SAE decoder, we then insert the SAE into a Gemma 2 2B Transformer.

Orthogonality Regularization enables Isolated Interventions

confidence 95% · We empirically show that these increasingly orthogonalized features allow for isolated interventions.

Orthogonality Regularization improves Feature Identifiability

confidence 95% · Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition.

Cypher Suggestions (2)

Find all techniques used to improve model interpretability. · confidence 90% · unvalidated

MATCH (t:Technique)-[:IMPROVES]->(p:Property {name: 'Interpretability'}) RETURN t

List all models integrated with Sparse Autoencoders. · confidence 90% · unvalidated

MATCH (m:Model)-[:INTEGRATED_WITH]->(s:SAE) RETURN m.name

Abstract

Abstract: With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under this https URL.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

64,765 characters extracted from source content.


Identifying Intervenable and Interpretable Features via Orthogonality Regularization Moritz Miller * 1 2 Florent Draye * 1 3 Bernhard Sch ̈ olkopf 1 2 3 Abstract With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available here. 1. Introduction Language models (LMs) demonstrate remarkable capability in learning various tasks through fine-tuning. Recent work shows that it is possible to low-rank adapt (Hu et al., 2021) an LM around a fixed sparse autoencoder (Wang et al., 2025; Chen et al., 2025). Popular in mechanistic interpretability, sparse autoencoders (SAEs) can discover millions of human- understandable features in the model’s representation space (Cunningham et al., 2023; Paulo et al., 2024). How- ever, recent research has raised concerns about the reliability of the features learned by SAEs. When trained on the same data but with different random seeds, SAEs generate largely different sets of features (Paulo & Belrose, 2025). Addi- tionally, features discovered by SAEs can be reconstructed from a smaller set of features with a meta-SAE (Leask et al., * The authors contributed equally to this work. Author order was determined by a 60–40 coin flip. 1 Max Planck Institute for Intelli- gent Systems, T ̈ ubingen, Germany 2 ETH Zurich, Zurich, Switzer- land 3 ELLIS Institute T ̈ ubingen, T ̈ ubingen, Germany. Correspon- dence to: Moritz Miller <moritz.miller@tuebingen.mpg.de>. Preprint. February 5, 2026. 2025). This suggests that these features are not atomic. Cur- rent progress in adapting an LM to a fixed SAE offers the opportunity to overcome this classical dictionary learning conundrum (Olshausen & Field, 1997; Lee et al., 2006). In this paper, we argue that these issues are related to the con- cept of identifiability in dictionary learning. A dictionary of features is identifiable if it can be recovered from data gen- erated as sparse combinations of those features (Dumitrescu & Irofti, 2018). In classical dictionary learning, successful recovery is guaranteed when the dictionary exhibits low self- coherence, i.e., features are sufficiently orthogonal (Donoho et al., 2005). In a highly self-coherent dictionary, multiple different sparse combinations of features generate the same data point (Schnass & Vandergheynst, 2008; Garfinkle & Hillar, 2016). Under this ambiguity, even recovering the sparse components given the known dictionary may fail. In the context of LM activations, we observe that high self-coherence naturally emerges due to the strong inherent structural dependencies within the model’s feature space (Li et al., 2025).This makes identifiability particularly challenging and contributes to the issues discussed above. Another benefit of identifiability relates to intervenabil- ity in the learned representation space. 
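To make the identifiability failure described above concrete, here is a toy NumPy sketch (an illustration for this write-up, not code from the paper): in a highly self-coherent overcomplete dictionary, two different sparse codes reconstruct the same data point, so sparse recovery is ambiguous even when the dictionary is known.

```python
import numpy as np

# Overcomplete dictionary in R^2 with three atoms; f3 is an exact mixture of f1 and f2,
# so the dictionary has high self-coherence (|<f1, f3>| = |<f2, f3>| = 1/sqrt(2)).
f1 = np.array([1.0, 0.0])
f2 = np.array([0.0, 1.0])
f3 = (f1 + f2) / np.linalg.norm(f1 + f2)
D = np.stack([f1, f2, f3], axis=1)              # shape (m=2, d=3)

x = f3                                          # data point to explain
z_a = np.array([0.0, 0.0, 1.0])                 # 1-sparse code using atom 3
z_b = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)  # 2-sparse code using atoms 1 and 2
print(np.allclose(D @ z_a, x), np.allclose(D @ z_b, x))  # True True: the code is not unique
```

With nearly orthogonal atoms this ambiguity disappears for sufficiently sparse codes, which is exactly the regime the orthogonality penalty targets.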
In causal model- ing, intervenability is formalized via interventional distribu- tions (Spirtes et al., 2000; Pearl, 2009). Causal representa- tion learning (Sch ̈ olkopf et al., 2021) seeks latent represen- tations whose components correspond to underlying causal factors of variation in the data, such that interventions on these factors admit a meaningful interpretation. A key requirement for such interventions is adherence to the Independent Causal Mechanisms (ICM) principle (Janzing & Sch ̈ olkopf, 2010). ICM posits that the data-generating process can be decomposed into autonomous mechanisms that do not inform or influence each another. In the context of learned representations, this implies that intervening on one latent mechanism should not systematically alter the functional form or parameters of the remaining mechanisms. When this condition is approximately satisfied, interven- tions applied in representation space can be interpreted as localized changes to specific causal factors, enabling controlled manipulation without inducing unintended side effects in other components. 1 arXiv:2602.04718v1 [cs.LG] 4 Feb 2026 Identifying Intervenable and Interpretable Features via Orthogonality Regularization Question: Jerry writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year? Standard Generation: Jerry writes a 3-page letter to 2 different friends twice a week, so he writes a total of 3 x 2 = 6 pages per week. Since he writes twice a week, he writes a total of 6 x 2 = 12 pages per week. In a year, there are 52 weeks, so Jerry writes a total of 12 x 52 = 624 pages in a year. #### 624 The answer is: 624 Post-Intervention Generation: If Aquaman writes a 3-page letter to 2 different friends twice a week, then he writes a total of 3 x 2 = 6 pages per week. Since he writes this letter twice a week, he writes a total of 6 x 2 = 12 pages per week. If he writes this letter for a year, which has 52 weeks, then he writes a total of 12 x 52 = 624 pages in a year. #### 624 The answer is: 624 Figure 1. Intervention on SAE Feature We query a LM fine-tuned on the orthogonality penalty10 −4 . At inference, we exchange the feature associated withJerrywith the feature corresponding to the prefixaqua. The model then substitutesJerryforAquamanwhile maintaining its reasoning capabilities. With this in mind, we study interventions in the SAE under the ICM principle. By enforcing almost orthogonality inside the dictionary, we make the SAE features themselves intervenable. In effect, changing one feature has little influence on any other feature. To do so, we aim to link latent features to human-understandable concepts. After identifying such concepts, we swap the associated feature in the representation space for another feature. Under this replacement strategy, we check if the model maintains its reasoning capabilities while correctly replacing the interpretable concept during generation. Figure 1 represents one example of an intervention in the LM under orthogonality penalty on the decoder matrix. We set the feature corresponding to the human-interpretable con- ceptJerryto zero and replace it by the feature that activates for various usages of the prefixaqua. First, we observe that the LM reasons correctly to arrive at624. Second, the model adapts theaquaconcept such that it is concordant with the preceding context. That is, the LM adequately replaces the male first nameJerrybyAquamanduring generation. 
While locally swapping that concept, the down- stream effect on other concepts remains limited. Building on this finding, our contributions are: •We fine-tune a LM around a fine-tuned SAE whose decoder matrix is regularized to be almost orthogonal. We demonstrate that this fine-tuning pipeline does not hurt performance on mathematical reasoning relative to a non-penalized SAE (Section 4.1). • We show that this setup obtains similar interpretability scores as non-penalized SAEs (Section 4.2). More importantly, Section 4.3 reflects that the embedded feature explanations are significantly less similar to each other under our orthogonality penalty. • We theoretically relate intervenability in the residual stream to the interference between features. Lever- aging finite frame theory, we argue that interference between features increases similarity between features (Section 3). •To the best of our knowledge, we are the first to perform isolated interventions on specific concepts inside the SAE (Section 4.4). We demonstrate that the increased orthogonality penalty leads to better intervenability in the model’s representation space. 2. Background Notation We use square brackets for an abbreviated set notation and write [N ] :=1,...,N for arbitrary N . 2.1. Identifiable Dictionary Learning The idea of recovering input data as a sparse linear combina- tion of atoms is rooted in sparse dictionary learning. Similar to the features in the SAE decoder, the atoms compose the dictionary. If the number of atoms in the dictionary exceeds the number of dimensions per atom, we call the dictionary overcomplete. Thus,D∈R m×d is overcomplete ifd > m. A recurring theme in dictionary learning constitutes self- coherence. This measure of similarity between a dictio- 2 Identifying Intervenable and Interpretable Features via Orthogonality Regularization nary’s atoms plays a critical role in determining how sparse the representations must be for identifiability to hold. We define self-coherence as μ(D) = max i̸=j |⟨f i ,f j ⟩|.(1) To illustrate the relevance of self-coherence for sparsity, we restate the below theorem (Donoho et al., 2005) and provide the proof in Appendix A. Theorem 2.1 (Self-coherence bound for uniqueness). Let D∈R m×d have unit-norm columns and denote byμself- coherence as defined in (1). If K < 1 2 1 + 1 μ , everyK-sparse representation ̃ x = Dzis unique. That is, for Dz = D ̃ z with∥z∥ 0 ,∥ ̃ z∥ 0 ≤ K, it holds that z = ̃ z. Theorem 2.1 reflects a clear relationship between sparsity and self-coherence. First, low self-coherence corresponds to stricter almost orthogonality between dictionary atoms. Second, this yields a bound on the number of atoms required for input recovery K to ensure identifiability. ForD ∈R m×d withd > m, the atoms form an overcom- plete spanning set. Welch bounds provide lower bounds on the maximum similarity between any two atoms (Welch, 1974). We restate the bounds in Appendix B. For now, we focus on an extension which relaxes the unit-norm assump- tion (Waldron, 2003). Letf j j∈[d] forf j ∈R m denote thedatoms composing the dictionary. ForN >0 non-zero natural numbers and s∈N >0 , we then have P d i=1 P d j=1 |⟨f i ,f j ⟩| 2s P d i=1 ∥f i ∥ 2s ≥ 1 m+s−1 s .(2) 2.2. Sparse Autoencoders SAEs are trained on reconstructing the activations in the residual stream under a sparsity penalty. The output is then a sparse linear combination of vectors. Letx ∈R m denote the Transformer state prior to the SAE. 
ATopK SAE (Makhzani & Frey, 2013; Gao et al., 2025a) then out- puts the reconstruction ˆ x∈R m , z = TopK (Ex + b E ) ˆ x = Dz + b D forz ∈R d the coefficients for the features andE ∈ R d×m ,D ∈R m×d encoder and decoder matrices with bi- asesb E ∈R d ,b D ∈R m , respectively. We callDour feature dictionary with d≫ m features. We train our autoencoder on the normalized reconstruction loss with penalty on the dictionary, ℓ(x) = ∥ ˆ x− x∥ 2 2 ∥x∥ 2 2 + λ∥tril(D ⊤ D)∥ 2 F (3) where∥tril(D ⊤ D)∥ 2 F enforces almost orthogonality between features withλ > 0. Bytril(D ⊤ D)we denote the off-diagonal lower-triangular elements ofD ⊤ D. During the forward pass,TopKonly activates the features associated with theKhighest coefficients with all other activations zeroed out. All K coefficients are non-negative. 2.3. Finite Frame Theory ForH m Hilbert space andf j j∈[d] a set of vectors withf j ∈R m ,j ∈ [d], we call the finite-dimensional setf j j∈[d] a frame forH m iffspan(f j j∈[d] ) = H m (Casazza et al., 2013). The analysis operatorT : H m −→ ℓ d 2 is then defined as Tx :=⟨x,f j ⟩ j∈[d] (4) forx ∈ H m . It follows that the adjoint operatorT ∗ : ℓ d 2 −→H m of T is given for z∈R d by T ∗ (z) = d X j=1 z j f j .(5) A frame is overcomplete if d > m. 3. Aligned Features Reduce Intervenability Throughout this section, we ignore the additive term,b D . As the decoder bias is added to the residual stream only after recovering the sparse combination of features, it does not interfere with our below intervenability arguments. To be explicit, we restrict our analysis to ̃ x := ˆ x− b D = Dz. We define an intervention on a feature as changing that feature’s influence on the reconstructed residual stream. We thus alter its associated coefficient in the latent representation of the SAE. To push on the feature, we add anR-valued scalarαto the coefficient. We show that the more geometrically aligned a pair of features is, the greater their interference in the residual stream. For this, it can be instructive to think of the framef j j∈[d] as the feature dictionary defined in 2.2. The reconstructed residual stream is ˆ x. For an intervention on the l th feature, we document the effect on ̃ x = Dz, ̃ x ′ = T ∗ (z + αe l ) = d X j=1 z j f j + αf l = Dz + αf l with the adjoint operator from(5)ande l ∈R d the l th d-dimensional standard basis vector. Interventional 3 Identifying Intervenable and Interpretable Features via Orthogonality Regularization quantities we represent by the prime symbol. Reanalyzing using (4), we obtain the new coefficients z ′ j =⟨ ̃ x ′ ,f j ⟩ + α⟨f j ,f l ⟩ = z j + α⟨f j ,f l ⟩ forj ∈ [d]. The interference between the intervened onf l and any f j grows in α. Theorem 3.1 (Post-Intervention Interference Between Fea- tures). Letα c c∈[C] denote a set of interventions onC latent featuresj(c) c∈[C] ⊆ [d]. ForTthe analysis opera- tor(4), the interference between the features in the dictio- nary amounts to eff interference = C X c=1 α c Tf j(c) 1 . In particular, the interference increases with the inner prod- uct between any pair of features. We defer the proof to Appendix C. 4. Experiments We fine-tune the model in two steps (Wang et al., 2025). 1.We optimize a pre-trainedTopKSAE (Gao et al., 2025a) for low self-reconstruction under an orthog- onality penalty (3) on the decoder matrix. 2.Fixing the SAE decoder, we then insert the SAE into a Gemma 2 2B Transformer and low-rank adapt (Hu et al., 2021) the LM on cross-entropy. 
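The two fine-tuning steps above center on the penalized SAE objective in Eq. (3). The following is a minimal PyTorch sketch of the TopK SAE forward pass and that loss; it is an illustration with assumed tensor shapes (E ∈ R^{d×m}, D ∈ R^{m×d}) and variable names, not the authors' released implementation, and the ReLU before TopK is an assumption consistent with the non-negative coefficients described in Section 2.2.

```python
import torch

def topk_sae(x, E, b_E, D, b_D, k=20):
    """TopK SAE: z = TopK(E x + b_E) keeps the K largest coefficients (non-negativity assumed
    via ReLU); reconstruction is x_hat = D z + b_D (Section 2.2)."""
    pre = torch.relu(x @ E.T + b_E)                     # (batch, d) pre-activations
    vals, idx = torch.topk(pre, k, dim=-1)
    z = torch.zeros_like(pre).scatter_(-1, idx, vals)   # sparse code with K active features
    return z @ D.T + b_D, z                             # x_hat in R^m, z in R^d

def orthogonality_penalty(D, n_sub=1024):
    """|| tril(D^T D) ||_F^2 over a random subset of decoder features, mirroring the paper's
    subsampled evaluation of the penalty (off-diagonal lower triangle only)."""
    cols = D[:, torch.randperm(D.shape[1])[:n_sub]]
    gram = cols.T @ cols
    return torch.tril(gram, diagonal=-1).pow(2).sum()

def sae_loss(x, E, b_E, D, b_D, lam=1e-4):
    """Normalized reconstruction error plus lambda * orthogonality penalty, as in Eq. (3)."""
    x_hat, _ = topk_sae(x, E, b_E, D, b_D)
    recon = (x_hat - x).pow(2).sum(-1) / x.pow(2).sum(-1).clamp_min(1e-8)
    return recon.mean() + lam * orthogonality_penalty(D)
```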
In total, we fine-tune four SAEs with orthogonality penalty λ ∈ 0, 10 −6 , 10 −5 , 10 −4 . In particular, we demonstrate all our claims in comparison to the non-penalized case of λ = 0. Throughout, we chooseK = 20. Both steps in our pipeline involve fine-tuning the model for one epoch on MetaMathQA(Yu et al., 2023). In our first step, we fully fine-tune the SAE. Later, we low-rank adapt all weights including the non-decoder SAE weights. Therefore, we restrict the decoder to contain almost orthogonal features, while allowing the rest of the architecture to adapt around this penalty. Adopting the SAEBench (Karvonen et al., 2025) module of size2 16 = 65 ′ 536, we input the SAE into the26-layer residual stream after layer12. Thus, all infor- mation flows through the SAE. To limit the computational overhead induced by the penalty in(3), we evaluate the de- coder regularization at every step on a subset of2 10 = 1 ′ 024 randomly drawn features. This makes fine-tuning feasible while updating decoder weights progressively. All details including hyperparameter choice and training duration are deferred to Appendix D. 0 10 6 10 5 10 4 0 200 400 600 800 1000 Orthogonality Evaluation Loss Figure 2. Orthogonality Evaluation Loss We plot the orthogonal- ity loss∥tril(D ⊤ D)∥ 2 F for all values ofλ. For eachλ, we evaluate on a subset of1 ′ 024active features in the decoder. Error bars rep- resent confidence intervals obtained from running100evaluations. 4.1. Almost Orthogonality While Keeping Performance Figure 2 displays the orthogonality loss∥tril(D ⊤ D)∥ 2 F for increasing orthogonality penalty. Error bars represent the basic bootstrap (Efron, 1979) confidence intervals at level 95%for100sampled datasets. For computational efficiency, we compute the evaluation loss on100draws of2 10 = 1 ′ 024randomly selected active features. That is, we only evaluate on features that activate at least once in the test set. First, we observe that no penalty yields the highest orthogonality loss. This confirms the intuition that features do not empirically tend toward orthogonality if not explicitly regularized to do so. Further, we observe monotonically decreasing loss asλincreases. Inserting the orthogonality penalty, therefore, has the desired effect. Next, we evaluate the fine-tuned LM on theGSM8K(Cobbe et al., 2021) test set. Inter alia,GSM8K’s training set is augmented inMetaMathQA. Yu et al. (2023) observe per- formance in the range of[0.665, 0.777]for7B models fine- tuned onMetaMathQA. In our two-step fine-tuned2B LM, we obtain competitive performance for increasing orthog- onality penaltyλ. Figure 3 represents the accuracy on the GSM8Ktest set. While just below the performance of the7B models, we do not observe any significant differences for increasingly strict penalty. Again, the evaluation accuracy on the model fine-tuned for all non-zero values ofλis com- parable to the no-penalty LM. Together with our findings from Figure 2, we show that stricter orthogonality does not come at the cost of performance. 4.2. The Penalty Does Not Impact Interpretability SAEs are primarily used for mechanistic interpretability. By uncovering human-understandable concepts, we aim to track the Transformer’s internal computations. We there- 4 Identifying Intervenable and Interpretable Features via Orthogonality Regularization 0 10 6 10 5 10 4 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy Figure 3. Evaluation onGSM8KWe evaluate on theGSM8Ktest set. 
Error bars represent the basic bootstrap confidence inter- vals (Efron, 1979) on 100 randomly drawn datasets. fore check the interpretability of our SAEs under different orthogonality penalties. For this, we create LM-generated explanations for2 ′ 000activated features in each of our fine- tuned SAEs. We do so providing20text snippets which activate the respective feature and asking Llama 3.1 8B- Instruct to describe the common theme across snippets. For details and prompt templates, see Appendix D. 0 10 6 10 5 10 4 0.0 0.1 0.2 0.3 0.4 0.5 Interpretability Score Figure 4. Interpretability Score We plot the interpretability score of correctly identifying one out of five examples, which relates closest to the provided explanation. Error bars are basic bootstrap confidence intervals (Efron, 1979) with 100 resamples. We then ask Llama 3.1 8B-Instruct to match one of five text snippets to the corresponding feature explanation. Figure 4 demonstrates our results. Without allowing the model to reason, we observe interpretability performance on this task of about40%. Given the basic bootstrap error bars (Efron, 1979), we observe non-significant difference between orthogonality parameters. We also notice significantly better interpretability than random guessing at 20%. 4.3.Almost Orthogonality Incentivizes Distinct Features 0 10 6 10 5 10 4 0.580 0.585 0.590 0.595 0.600 Average Cosine Similarity Figure 5. Embedding Explanations We plot the cosine similarity between embedded feature explanations against the orthogonality penaltyλ. The error bars are basic bootstrap intervals (Efron, 1979) on 100 sampled subsets. Previous work shows that alignment in feature space cor- relates with conceptual alignment (Li et al., 2025). Given the feature explanations for each SAE, we test the similarity between the feature explanations themselves. We hypothe- size that if features are more orthogonal, they should encode concepts which are semantically more distinct from each other. In accordance with our argumentation throughout, reducing the geometric alignment between features in the dictionary would similarly reduce the semantic alignment between the human-understandable features. To test this idea, we embed the feature explanations them- selves into the embedding space of the all-MiniLM-L6- v2 model fromsentence-transformers(Reimers & Gurevych, 2019). We then compare the average pairwise cosine similarity between1 ′ 000features for our different setups. Figure 5 depicts that the average pairwise cosine similarity between the feature explanations significantly in- creases with looser orthogonality penalty. Our penalty there- fore pushes features to be geometrically less aligned. As a di- rect consequence, the corresponding human-understandable concepts are semantically less aligned. We thus provide evi- dence that it is possible to limit superposition in the residual stream by forcing an orthogonality penalty onto the features. 4.4. Localized Interventions Work Having addressed identifiability theoretically and inter- pretability empirically, we now present a series of findings on intervenability in an almost orthogonal dictionary. In Section 3, we show that geometrically aligned features in the dictionary inhibit isolated intervenability on individual fea- tures. 
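To make the Section 3 argument concrete before the intervention experiments, here is a small PyTorch sketch (assumed shapes and names; not from the paper) of pushing on selected latent coefficients and measuring the resulting ℓ1 shift of the re-analyzed coefficients, which for a single intervention reduces to |α| · ‖T f_l‖₁ as in Theorem 3.1.

```python
import torch

def reanalysis_interference(D, alphas, idx):
    """Intervene on x~ = D z by adding alpha_c to coefficient j(c), i.e. z' = z + sum_c alpha_c e_{j(c)},
    then re-analyze with T x = <x, f_j>_j. Each coefficient shifts by sum_c alpha_c <f_j, f_{j(c)}>,
    so the l1 interference grows with feature alignment (Theorem 3.1)."""
    shift = (D.T @ D[:, idx]) @ alphas           # (d,) coefficient shifts
    return shift.abs().sum()                     # ||z' - z||_1 under re-analysis

# Toy check: a maximally aligned duplicate atom inflates the interference of pushing on feature 0.
m, d = 16, 64
D = torch.nn.functional.normalize(torch.randn(m, d), dim=0)   # unit-norm random features
alpha, idx = torch.tensor([5.0]), torch.tensor([0])
base = reanalysis_interference(D, alpha, idx)
D[:, 1] = D[:, 0]                                              # features 0 and 1 perfectly aligned
print(base.item(), "<", reanalysis_interference(D, alpha, idx).item())
```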
To confirm this idea, we locally intervene on known features in the SAE and record the downstream effect during 5 Identifying Intervenable and Interpretable Features via Orthogonality Regularization 0 10 6 10 5 10 4 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 Accuracy (a) Evaluation on Mathematical Reasoning We plot mathemati- cal reasoning performance on3960hand-designed examples after intervening on the SAE. Error bars are basic bootstrap confidence intervals (Efron, 1979) with 100 samples. 0 10 6 10 5 10 4 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 Correctly Included Indices (b) Evaluation on Inclusion of Correct Name We plot the fraction of correctly included first names after intervening on the correspond- ing feature in the SAE. Error bars are basic bootstrap confidence intervals (Efron, 1979) of 100 random draws. Figure 6. Intervenability We intervene on12features in the SAE corresponding to the concept of first names. Then, we evaluate the ability to maintain reasoning performance and the correct generation of the included first name. generation. In particular, every time a certain feature is acti- vated in the residual stream, we swap it for a different one. We then check if the model remains capable of reasoning about the overarching problem while adapting its generation output to the exchanged features. Under the ICM principle, isolated intervention implies that altering one feature does not inform or influence the be- havior of other features in the SAE. Translated to natural language modeling, we ideally expect the model to only swap the corresponding feature. To check this, we design the following experiment. Across the different SAEs we find12features which we interpret as encoding convention- ally male first names. We then choose30examples from GSM8Kinvolving male actors. In each example, we replace the original name with one of the12names above and run the standard evaluation task. Having access to the SAE fea- tures, however, we are able to swap the feature indices for another one of the11remaining first names. We then check if the model (a) keeps performance comparable to Figure 3, (b) correctly drops the name related to the feature we set to zero, and (c) includes the first name we inject into the residual stream through our intervention inside the SAE. Testing all combinations of names on each of the examples, we generate a dataset of 12∗ 11∗ 30 = 3 ′ 960 examples. A well-known issue in mechanistic interpretability, it is non- trivial which value is required to faithfully intervene in the representation space. As the coefficients associated with each feature are continuous, turning the feature on and off is a non-binary problem. In other words, pushing on a feature to be present in the output while keeping generation intact requires testing multiple candidate values (Ameisen et al., 2025). We therefore run a broad hyperparameter sweep on values from 10 to 750. We choose the hyperparameter that results in the best perfor- mance for each of the models. Then, we check whether the respective first names are adequately dropped and included. For all four configurations, we find that highest performance is achieved when setting the coefficient associated with the included feature to200. Further discussion as well as the30 selected examples are deferred to Appendix D. Figure 6 depicts our results. First, we observe in Figure 6a that performance is comparable across different choices forλ. No setup performs significantly differently from the zero-penalty configuration. 
Second, we note that the number of incorrect name inclusions the model should have dropped is0for three of the four tested models. Only the configuration withλ = 10 −5 incorrectly mentions the true concept in0.7%of the examples. Given that we test3 ′ 960 examples, this number appears negligibly small. Finally, for stricter orthogonality penalty, the model significantly more often includes the correct first name in the generation. Figure 6b demonstrates that the orthogonality penalty for λ≥ 10 −6 monotonically increases. Further, we recover the first name around70.9%of the time forλ = 10 −4 while only recovering it around60.1%of the time without penalty. 1. water of the aquarium every week, 2. calculate how many aquariums she can 3. 2 tablespoons of aquafaba is 4. the newly opened aquarium and 40 5. family pass to the aquarium for $120 Listing 1. The Aqua Feature 6 Identifying Intervenable and Interpretable Features via Orthogonality Regularization To conclude this section, we discuss the insertion of a feature that corresponds to concepts which are not conventionally male first names. In Listing 1, we show5input examples that activate the same feature. The corresponding explana- tion in the model with λ = 10 −4 reads Aquarium capacity or fish population size is the latent concept shared among these spans. Swapping this feature for the feature associated with the first nameJerry, we observe that the model adapts this feature to be concordant with the preceding context. In fact, the LM then talks aboutAquaman. We print the model outputs for this intervention in Figure 1. Moreover, Figure 7 schematically represents an intervention in the SAE. We note that the feature associated with the conceptAquaman has no incoming edge while the downward effect of the feature corresponding to Jerry is turned off. Figure 7. Interventions in the SAE We perform interventions in the SAE by turning off the intervened on feature and adding to the residual stream the feature we replace with. 5. Related Work IdentifiabilityIdentifiability in sparse dictionary learning is centered around the uniqueness bound in Theorem 2.1 and low self-coherence in general. Beyond conventional dictio- nary learning, self-coherence (Nejati et al., 2016) represents a relevant guiding principle in signal processing (Sigg et al., 2012) and computer vision (Hawe et al., 2013). In addition to the aforementioned literature, the unique disentanglement of hidden concepts reappears in causal representation learn- ing (Sch ̈ olkopf et al., 2021). Here, identifiability concerns finding a unique model which can explain the observed data. While non-identifiability in causal representation learning represents a well-studied problem (Hyv ̈ arinen & Pajunen, 1999; Locatello et al., 2019), additional information such as access to interventional data can yield identifiability re- sults (Buchholz et al., 2023). InterpretabilityThe linear superposition hypothesis (El- hage et al., 2022) posits that LMs represent features as directions in their activation space. In this way, the model can represent more features than it has dimensions. To disentangle the polysemantic residual stream into human- interpretable features, SAEs reconstruct the activations as sparse linear combinations of features (Cunningham et al., 2023). Further, the linear representation hypothesis (Park et al., 2024) hinges on the observation that semantically unrelated features are represented as almost orthogonal vec- tors in some representation space (Jiang et al., 2024). 
Addi- tionally, recent empirical evidence supports the emergence and test-time use of such linear structure in LMs (Chen et al., 2026). In SAEs specifically, however, features are geometrically closely aligned such that finding a canonical dictionary is impossible (Marks et al., 2024; Leask et al., 2025). Beyond language modeling, SAEs are employed in genomics (Pedrocchi et al., 2025; Maiwald et al., 2026) and computer vision (Pach et al., 2025). Finally, weight- sparse models (Gao et al., 2025b) and sparse attention fine- tuning (Draye et al., 2025) represent recent approaches with the similar objective of making architectures inherently in- terpretable. IntervenabilityThe ICM principle considers mechanisms as ”autonomous modules that do not inform or influence each other” (Janzing & Sch ̈ olkopf, 2010). Intervening on one module, therefore, does not interfere with another one. In applied causality, a set of interventions is commonly described as datasets from different environments (Peters et al., 2016). Under this notion, one main objective lies in identifying the causal mechanisms prevalent across environ- ments (Guo et al., 2024). Thus, traditional causality studies interventions as distributional properties of the underlying data-generating process (Spirtes et al., 2000; Pearl, 2009; Peters et al., 2017). In contrast to that, we study interven- ability on the modeling level. This is similar to uncovering the causal mechanisms inside the model’s representation space (Miller et al., 2025). A general causality perspec- tive on mechanistic interpretability is provided by Geiger et al. (2023). Combining those two facets of intervenabil- ity, recent work utilizes the SAE inside the model to infer information about the data (Mencattini et al., 2026). 6. Discussion We fine-tune the LM on mathematical reasoning. For this task, we notice that the orthogonality penalty works as in- tended with no significant impact on performance. A natural question pertains the extension to more general datasets. We thus run preliminary experiments onwikitext(Merity et al., 2016). Similar toMetaMathQA, we observe that cross-entropy on the test set is comparable to that of fine- tuning the LM without any SAE input into the residual stream. This works while enforcing almost orthogonality on the dictionary. However, when generating text from 7 Identifying Intervenable and Interpretable Features via Orthogonality Regularization that SAE, we obtain factually incorrect information. We do recover elements of concepts which are related to the input, but accurate responses are rarely generated. In di- rect comparison, the LM just fine-tuned onwikitext without an SAE does not suffer from this behavior. We therefore hypothesize that this is a fundamental artifact of the fine-tuning setup. Inwikitext, there likely exist more than2 16 human-interpretable concepts. Highly structured datasets such asMetaMathQA, however, might resort to a substantially smaller set of concepts. On that note, we do find it noteworthy that the configuration with orthogonality penaltyλ = 10 −3 uses less than half of the features of the non-regularized configuration. Our hy- pothesis is geometric. Enforcing orthogonality encourages features to form a better-conditioned basis of the representa- tion space. As in PCA, a more orthogonal basis spans the space more efficiently, leading to a larger number of non- zero principal directions and reducing the need for multiple overlapping features to represent the same subspace. 
Con- sequently, fewer features are required to explain the data, which naturally results in a higher number of unused (dead) features when the feature budget is fixed. We note, however, that the presence of dead features may also be influenced by the sensitivity of SAE training and could partially reflect optimization artifacts. Finally, it has recently been hypothesized that adversarial examples are not bugs but a consequence of feature superpo- sition (Gorton & Lewis, 2025). A natural follow-up question is whether enforcing greater feature orthogonality reduces superposition and, in turn, mitigates adversarial vulnerabili- ties. If so, orthogonality could act as a strong architectural bias toward more robust and safer models. Establishing this connection would motivate the development of more scalable versions of our approach. 0 10 6 10 5 10 4 0.30 0.35 0.40 0.45 0.50 0.55 0.60 Fraction of Dead Features Figure 8. Dead Features We plot the fraction of dead features with increasing orthogonality penalty. Error bars are basic bootstrap confidence intervals (Efron, 1979). Limitations Our study is focused on meaningful inter- ventions in the residual stream. Choosing the off-the-shelf SAEBench (Karvonen et al., 2025) module as a starting point, we insert the SAE only after layer12. In consequence, we are constrained to uncovering features that tend to arise in the middle layers of the Transformer. Empirically, these concepts often represent high-level concepts relative to the low-level concepts encoded in earlier layers (Dorszewski et al., 2025). Additionally, we position this paper as a proof of concept and restrict ourselves to fine-tuningTopKSAEs on one dataset only. Extending our setup toJumpReLU SAEs (Rajamanoharan et al., 2024) and to tasks different from mathematical reasoning represent potential avenues for future research. 7. Conclusion We show that it is possible to fine-tune a large language model around a sparse autoencoder while keeping perfor- mance on the target dataset essentially unchanged. We use this setup to regularize the learned features to be almost orthogonal, thereby reducing superposition and interference between features. This connects our approach to the exten- sive literature on the identifiability of dictionary learning, where lack of orthogonality is known to cause significant ambiguities and failures. Beyond identifiability, we show that enforcing orthogonality leads to more diverse expla- nations, which is a desirable property for interpretability. More importantly, it enables precise local interventions: we can intervene on a single concept without affecting others. Moreover, we are able to swap two concepts with greatly improved performance. These results shed light on the practical consequences of superposition in large language models and suggest that encouraging orthogonality may be a principled way to mitigate it. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgements M.M. acknowledges financial support from the Konrad- Adenauer-Stiftung. F.D. acknowledges support through a fellowship from the Hector Fellow Academy. References Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. 
L., Chen, B., Citro, C., Abrahams, D., Carter, S., Hos- mer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., 8 Identifying Intervenable and Interpretable Features via Orthogonality Regularization McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Ben Thompson, T., Zim- merman, S., Rivoire, K., Conerly, T., Olah, C., and Bat- son, J. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 2025. URLhttps://transformer-circuits.pub/ 2025/attribution-graphs/methods.html. Buchholz, S., Rajendran, G., Rosenfeld, E., Aragam, B., Sch ̈ olkopf, B., and Ravikumar, P. Learning linear causal representations from interventions under general nonlin- ear mixing. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc. Casazza, P. G., Kutyniok, G., and Philipp, F. Introduction to Finite Frame Theory, p. 1–53. Birkh ̈ auser Boston, Boston, 2013. ISBN 978-0-8176-8373-3. doi: 10.1007/ 978-0-8176-8373-31. URLhttps://doi.org/10. 1007/978-0-8176-8373-3_1. Chen, M., Engels, J., and Tegmark, M. Low-rank adapt- ing models for sparse autoencoders.In Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Maharaj, T., Wagstaff, K., and Zhu, J. (eds.), Pro- ceedings of the 42nd International Conference on Ma- chine Learning, volume 267 of Proceedings of Machine Learning Research, p. 8077–8092. PMLR, 13–19 Jul 2025. URLhttps://proceedings.mlr.press/ v267/chen25r.html. Chen, M. C., Miller, M., Sch ̈ olkopf, B., and Guo, S. On the emergence and test-time use of structural information in large language models. arXiv preprint arXiv:2601.17869, 2026. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. Cunningham, H., Ewart, A., Smith, L. R., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. ArXiv, abs/2309.08600, 2023.URLhttps://api.semanticscholar. org/CorpusID:261934663. Donoho, D. L., Elad, M., and Temlyakov, V. N. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory, 52(1):6–18, 2005. Dorszewski, T., T ˇ etkov ́ a, L., Jenssen, R., Hansen, L. K., and Wickstrøm, K. K. From colors to classes: Emergence of concepts in vision transformers, 2025. URLhttps: //arxiv.org/abs/2503.24071. Draye, F., Lei, A., Posner, I., and Sch ̈ olkopf, B. Sparse attention post-training for mechanistic interpretability. arXiv preprint arXiv:2512.05865, 2025. Dumitrescu, B. and Irofti, P. Dictionary learning algorithms and applications. Springer, 2018. Efron, B. Bootstrap methods: Another look at the jack- knife. The Annals of Statistics, 7(1):1–26, 1979. ISSN 00905364, 21688966.URLhttp://w.jstor. org/stable/2958830. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition. Transformer Circuits Thread, 2022. URLhttps://transformer-circuits. pub/2022/toy_model/index.html. Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. 
In The Thirteenth International Conference on Learning Representations, 2025a. URLhttps://openreview.net/forum? id=tcsZt9ZNKD. Gao, L., Rajaram, A., Coxon, J., Govande, S. V., Baker, B., and Mossing, D. Weight-sparse transformers have interpretable circuits. arXiv preprint arXiv:2511.13653, 2025b. Garfinkle, C. J. and Hillar, C. J.Robust identifia- bility in sparse dictionary learning.arXiv preprint arXiv:1606.06997, 2016. Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N. D., Potts, C., and Icard, T. F.Causal abstraction: A theoretical foundation for mechanistic interpretability. 2023.URLhttps://api.semanticscholar. org/CorpusID:255749463. Gershgorin, S. Uber die abgrenzung der eigenwerte einer matrix. Izvestija Akademii Nauk SSSR, Serija Matematika, 7(3):749–754, 1931. Gorton, L. and Lewis, O.Adversarial examples are not bugs, they are superposition.arXiv preprint arXiv:2508.17456, 2025.URLhttps://arxiv. org/abs/2508.17456. Guo, S., Zhang, C., Mohan, K., Husz ́ ar, F., and Sch ̈ olkopf, B. Do finetti: On causal effects for exchangeable data, 2024. URL https://arxiv.org/abs/2405.18836. 9 Identifying Intervenable and Interpretable Features via Orthogonality Regularization Hawe, S., Seibert, M., and Kleinsteuber, M. Separable dictionary learning. In Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition, p. 438–445, 2013. Hu, J. E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W.Lora: Low-rank adapta- tion of large language models. ArXiv, abs/2106.09685, 2021.URLhttps://api.semanticscholar. org/CorpusID:235458009. Hyv ̈ arinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(98)00140-3. URLhttps://w.sciencedirect.com/ science/article/pii/S0893608098001403. Janzing, D. and Sch ̈ olkopf, B. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10), 2010. Jiang, Y., Rajendran, G., Ravikumar, P., Aragam, B., and Veitch, V. On the origins of linear representations in large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J., Chanin, D., Lau, Y.-T., Farrell, E., McDougall, C., Ayon- rinde, K., Till, D., Wearden, M., Conmy, A., Marks, S., and Nanda, N. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretabil- ity, 2025. URLhttps://arxiv.org/abs/2503. 09532. Leask, P., Bussmann, B., Pearce, M., Bloom, J., Tigges, C., Moubayed, N. A., Sharkey, L., and Nanda, N. Sparse au- toencoders do not find canonical units of analysis. arXiv preprint arXiv:2502.04878, 2025. Lee, H., Battle, A., Raina, R., and Ng, A.Efficient sparse coding algorithms. In Sch ̈ olkopf, B., Platt, J., and Hoffman, T. (eds.), Advances in Neural Infor- mation Processing Systems, volume 19. MIT Press, 2006.URLhttps://proceedings.neurips. c/paper_files/paper/2006/file/ 2d71b2ae158c7c5912c0bbde2b9d95-Paper. pdf. Li, Y., Michaud, E. J., Baek, D. D., Engels, J., Sun, X., and Tegmark, M. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4):344, 2025. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Sch ̈ olkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In Chaudhuri, K. and Salakhutdinov, R. 
(eds.), Proceedings of the 36th International Confer- ence on Machine Learning, volume 97 of Proceedings of Machine Learning Research, p. 4114–4124. PMLR, 09– 15 Jun 2019. URLhttps://proceedings.mlr. press/v97/locatello19a.html. Maiwald, A., Jedryszek, P., Draye, F., Sch ̈ olkopf, B., Morris, G. M., and Crook, O. M. Decode-glm: Tools to interpret, audit, and steer genomic language models. bioRxiv, 2026.doi:10.1101/2025.10.31.685860. URLhttps://w.biorxiv.org/content/ early/2026/01/25/2025.10.31.685860. Makhzani, A. and Frey, B. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013. Marks, L., Paren, A., Krueger, D., and Barez, F. Enhanc- ing neural network interpretability with feature-aligned sparse autoencoders. arXiv preprint arXiv:2411.01220, 2024. Mencattini, T., Cadei, R., and Locatello, F. Exploratory causal inference in saence, 2026.URLhttps:// arxiv.org/abs/2510.14073. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016. Miller, M., Sch ̈ olkopf, B., and Guo, S. Counterfactual reasoning: an analysis of in-context emergence, 2025. URL https://arxiv.org/abs/2506.05188. Nejati, M., Samavi, S., Reza Soroushmehr, S. M., and Na- jar ́ ıan, K. Coherence regularized dictionary learning. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 4717–4721, 2016. doi: 10.1109/ICASSP.2016.7472572. Olshausen, B. A. and Field, D. J. Sparse coding with an over- complete basis set: A strategy employed by v1? Vision Research, 37(23):3311–3325, 1997. ISSN 0042-6989. doi: https://doi.org/10.1016/S0042-6989(97)00169-7. URLhttps://w.sciencedirect.com/ science/article/pii/S0042698997001697. Pach, M., Karthik, S., Bouniot, Q., Belongie, S., and Akata, Z. Sparse autoencoders learn monosemantic features in vision-language models, 2025. URLhttps://arxiv. org/abs/2504.02821. Park, K., Choe, Y. J., and Veitch, V. The linear representa- tion hypothesis and the geometry of large language mod- els. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. Paulo, G. and Belrose, N. Sparse autoencoders trained on the same data learn different features. arXiv preprint arXiv:2501.16615, 2025. 10 Identifying Intervenable and Interpretable Features via Orthogonality Regularization Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automati- cally interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024. Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009. URL https://bayes.cs.ucla.edu/BOOK-2K/. Pedrocchi, F., Barkmann, F., Joudaki, A., and Boeva, V.Sparse autoencoders reveal interpretable fea- tures in single-cell foundation models.bioRxiv, 2025.doi:10.1101/2025.10.22.681631.URL https://w.biorxiv.org/content/ early/2025/10/23/2025.10.22.681631. Peters, J., B ̈ uhlmann, P., and Meinshausen, N. Causal in- ference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical So- ciety Series B: Statistical Methodology, 78(5):947–1012, 10 2016. ISSN 1369-7412. doi: 10.1111/rssb.12167. URL https://doi.org/10.1111/rssb.12167. Peters, J., Janzing, D., and Sch ̈ olkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017. ISBN 0262037319. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kram ́ ar, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse au- toencoders, 2024. 
URLhttps://arxiv.org/abs/ 2407.14435. Reimers, N. and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019. URL https://arxiv.org/abs/1908.10084. Schnass, K. and Vandergheynst, P. Dictionary precondition- ing for greedy algorithms. IEEE Transactions on Signal Processing, 56(5):1994–2002, 2008. Sch ̈ olkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalch- brenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109 (5):612–634, 2021. doi: 10.1109/JPROC.2021.3058954. URLhttps://ieeexplore.ieee.org/stamp/ stamp.jsp?arnumber=9363924 . *equal contribu- tion. Sigg, C. D., Dikk, T., and Buhmann, J. M. Learning dic- tionaries with bounded self-coherence. IEEE Signal Pro- cessing Letters, 19(12):861–864, 2012. Spirtes, P., Glymour, C., and Scheines, R. Causation, Pre- diction, and Search. MIT press, 2nd edition, 2000. Waldron, S. Generalized welch bound equality sequences are tight frames. IEEE Transactions on Information The- ory, 49(9):2307–2309, 2003. doi: 10.1109/TIT.2003. 815788. Wang, S., Asilis, J., ̈ Omer Faruk Akg ̈ ul, Bilgin, E. B., Liu, O., Fu, D., and Neiswanger, W. Resa: Transpar- ent reasoning models via saes, 2025. URLhttps: //arxiv.org/abs/2506.09967. Welch, L. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information Theory, 20(3):397–399, 1974. doi: 10.1109/TIT.1974. 1055219. Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A., and Liu, W. Metamath: Boot- strap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023. 11 Identifying Intervenable and Interpretable Features via Orthogonality Regularization A. Proof on Self-Coherence Bound Proof. Assume ̃ x = Dz = D ̃ z withz̸= ̃ z, bothK-sparse. Thens = z− ̃ z̸= 0satisfiesDs = 0. Let T =supp(s), then |T| ≤ 2K. LetG = D ⊤ T D T . All diagonal elements ofGare equal to1with off-diagonal elements upper-bounded by μ. Formally,G j,j = 1,∀j ∈ [d]andG i,j ≤ μfori̸= jwithi,j ∈ [d]. By Gershgorin’s circle theorem (Gershgorin, 1931), λ min (G)≥ 1− (2K− 1)μ > 0, by assumption on K. Thus G is positive definite and the atoms in T are independent — contradiction. B. Welch Bounds We symbolize unit-norm vectors as v for arbitrary vector v. Forf j j∈[d] unit-norm vectors withf j ∈R m , define the maximum similarity byμ = max i̸=j |⟨f i ,f j ⟩|. Then, fors∈N >0 , μ 2s ≥ 1 d− 1 " d m+s−1 s − 1 # , withN >0 denoting the non-zero natural numbers. In addition, Welch obtains a bound on the mean similarity, 1 d 2 d X i=1 d X j=1 |⟨ f i ,f j ⟩| 2s ≥ 1 m+s−1 s for which Waldron (2003) relaxes the unit-norm assumption, P d i=1 P d j=1 |⟨f i ,f j ⟩| 2s P d i=1 ∥f i ∥ 2s ≥ 1 m+s−1 s . We provide the final statement in (2). C. Post-Intervention Interference Between Features Definition C.1 (Effect of Interference on Feature). Letz j ,j ∈ [d]denote the coefficient associated to thej th feature in the dictionaryD∈R m×d . We define the effect of interference on featurejas the absolute difference between the reanalyzed coefficient after intervention (4), z ′ j and the pre-intervention coefficient z j . That is, we have eff interference,j =|z j − z ′ j |. The total interference effect we define by eff interference =∥z− z ′ ∥ 1 for z∈R d . Proof. We document the intervention effect on ̃ x = Dz, ̃ x ′ = T ∗ z + C X c=1 α c e j(c) ! = d X j=1 z j f j + C X c=1 α c f j(c) for e j ∈R d the j th d-dimensional standard basis vector. 
Reanalyzing using (4) yields T ̃ x ′ j = z ′ j =⟨ ̃ x ′ ,f j ⟩ + C X c=1 α c ⟨f j ,f j(c) ⟩ = z j + C X c=1 α c ⟨f j ,f j(c) ⟩ 12 Identifying Intervenable and Interpretable Features via Orthogonality Regularization for all j ∈ [d]. Writing⟨f j ,f j(c) ⟩ =Tf j(c) j for c∈ [C], we obtain for the total interference effect eff interference = C X c=1 α c Tf j(c) 1 . This completes the proof. D. Experimental Details D.1. Setup All experiments were conducted using multi-GPU training. Fine-tuning the SAE on theMetaMathQAdataset required approximately8hours using4GPUs with80GB of memory each. Subsequent fine-tuning of the language model with the SAE integrated into the architecture required an additional 16 hours under the same hardware configuration. We employed parameter-efficient fine-tuning via LoRA with rank128and dropout rate0.05. The underlying model has a hidden dimension of2304. Training was performed with a per-device batch size of4. We used an initial learning rate of 5× 10 −5 , combined with200warm-up steps followed by a cosine decay schedule. All runs were carried out with gradient checkpointing enabled to reduce memory usage, and gradient norms were clipped to a maximum value of1to ensure training stability. During generation, we greedily decode all tokens 1 . D.2. Interpretability For2 ′ 000features eper orthogonality penalty, we generate a natural-language description and evaluate whether it correctly explains the feature’s activation behavior. Descriptions are constrained to a maximum length of20words. For each feature, we provide the evaluator with up to20text spans sampled from the top100most strongly activating snippets in the MetaMathQAtest set. The dataset follows a95-1-4train–validation–test split over395 ′ 000examples. Each span consists of a window of±10 tokens around the activation point. You are evaluating the interpretability of an explanation with respect to a set of text snippets. You will be given: - A list of 8 labeled text snippets (1, 2, 3, 4, 5, 6, 7, 8) - A single explanation describing a specific concept, feature, or pattern Your task is to identify which snippet the explanation applies to MOST STRONGLY. Exactly 1 of the provided snippets is correct. Guidelines: - Select ONLY the 1 snippet for which the explanation is clearly and directly applicable. - Do NOT select snippets where the explanation is only weakly, indirectly, or ambiguously related. - You must output exactly one label. - Do not explain your reasoning. Output format requirements (MANDATORY): - Output must be a Python-style list of snippet labels. - Example valid outputs: [1], [5] - Do NOT include any additional text, punctuation, or explanation. - Do NOT include quotes around labels. - Do NOT include reasoning or commentary. If the explanation best fits snippet 1, your entire output must be: [1] Listing 2. System Prompt for Interpretability Score 1 We plan to release the code publicly in a future version of this work. 13 Identifying Intervenable and Interpretable Features via Orthogonality Regularization We evaluate interpretability over500feature explanations using a multiple-choice setup, where the evaluator selects the span best matching the provided description from five candidates. All evaluations are performed using Llama 3.1 8B-Instruct . We include the full system prompt in Listing 2. A representative example of feature explanation is given in Listing 1. D.3. Feature Embedding Explanations For each SAE, we retrieve natural-language explanations for1 ′ 000features. 
These are selected from an initial pool of2 ′ 000 candidate features by restricting attention to explanations beginning with one of five predefined linguistic patterns. The five patterns are provided in Listing 3. "The spans describe" "The latent concept shared among these spans is" "The spans share" "The shared semantic role is" "The common latent concept is" Listing 3. Prefixes for Embedding Explanations To quantify redundancy among explanations, we compute cosine similarity between embedding representations of explana- tions corresponding to active features. We perform this analysis on the same set of models used in the orthogonality and interpretability experiments. For consistency, we restrict attention to explanations beginning with one of the five predefined linguistic patterns described above. Listing 4 states the Llama 3.1 8B-Instruct system prompt for the explanation of features. You are a technical analyst identifying latent semantic features in mathematical text. All examples come from mathematics. Do NOT describe features as "mathematical" or "math-related". Your task is to infer the most specific shared semantic role, operation, or function that is common to all examples. Constraints: - Describe the latent concept, not surface tokens or model behavior. - Your description must distinguish these spans from other mathematical text. - Do not mention tokens, models, or activations. - Prefer functional roles within proofs, derivations, or definitions. - Avoid generic abstractions like "reasoning", "formal logic", or "symbolic manipulation" unless unavoidable. - Output exactly one sentence (max 20 words). - If no specific shared role exists, reply exactly: "No coherent concept." - Only reply "No coherent concept." if the spans do not share a specific semantic function beyond being mathematical. Listing 4. System Prompt for Feature Explanations D.4. Intervenability To evaluate the extent to which learned features support localized and controllable interventions, we conduct targeted manipulation experiments. We identify12SAE features corresponding to male first names. We observe that the model assigns the same feature indices to different variations of a name. We thus list in Listing 5 all variations of names which we accept as correct. As our intervenability study hinges on string extraction, the spelling of the concept is relevant. Across all models considered, these concepts are consistently represented by the same feature indices. Before running the experiment, we check that all names are recoverable in at least some contexts. For each name, we intervene by swapping the corresponding feature activation at every token position where the name appears. We sweep over the following values for the associated coefficient z j ,j ∈ [d], 10, 20, 50, 100, 150, 200, 300, 500, 750. By doing so, we find that a value of200yields the best performance across all four models. Using this value, we evaluate both drop and include interventions. 14 Identifying Intervenable and Interpretable Features via Orthogonality Regularization While all models successfully suppress the original concept under drop interventions, including a new concept proves more challenging for models with weaker orthogonality. We hypothesize that this asymmetry arises from differences in the alignment of feature directions within the embedding space. 
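For the redundancy analysis in D.3, a minimal sketch (ours, not the released code) of the average pairwise cosine similarity between embedded explanations, using the all-MiniLM-L6-v2 sentence-transformers model named in Section 4.3; the explanation strings below are illustrative placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_pairwise_cosine(explanations):
    """Average pairwise cosine similarity between embedded feature explanations (Section 4.3 / D.3)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(explanations, normalize_embeddings=True)   # unit-norm rows
    sims = emb @ emb.T
    iu = np.triu_indices(len(explanations), k=1)                  # strict upper triangle (i < j)
    return float(sims[iu].mean())

# Illustrative explanations; the paper compares 1,000 filtered explanations per penalty setting.
explanations = [
    "The latent concept shared among these spans is aquarium capacity or fish population size.",
    "The spans describe weekly letter-writing frequency to multiple recipients.",
    "The spans share references to saving money toward a fixed purchase price.",
]
print(mean_pairwise_cosine(explanations))
```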
D.4. Intervenability

To evaluate the extent to which learned features support localized and controllable interventions, we conduct targeted manipulation experiments. We identify 12 SAE features corresponding to male first names. We observe that the model assigns the same feature indices to different variations of a name; we thus list in Listing 5 all name variations that we accept as correct. As our intervenability study hinges on string extraction, the spelling of the concept is relevant. Across all models considered, these concepts are consistently represented by the same feature indices. Before running the experiment, we check that all names are recoverable in at least some contexts.

For each name, we intervene by swapping the corresponding feature activation at every token position where the name appears. We sweep over the following values for the associated coefficient $z_j$, $j \in [d]$: 10, 20, 50, 100, 150, 200, 300, 500, 750. We find that a value of 200 yields the best performance across all four models. Using this value, we evaluate both drop and include interventions.

While all models successfully suppress the original concept under drop interventions, including a new concept proves more challenging for models with weaker orthogonality. We hypothesize that this asymmetry arises from differences in the alignment of feature directions within the embedding space. Beyond names, we also observe successful semantic interventions on other concepts; for example, activating an aqua feature causes the model to transform the character Jerry into Aquaman while leaving the surrounding context unchanged.

"Jason": ["Jason", "Jase"],
"Mike": ["Michael", "Mike", "Mikey"],
"Jacob": ["Jacob", "Jake", "Jakob"],
"Jerry": ["Jerry", "Jeremy", "Jermey", "Jerome"],
"James": ["James", "Jim", "Jimmy", "Jamie"],
"Robert": ["Robert", "Rob", "Robbie"],
"Jordan": ["Jordan", "Jordy"],
"Jackson": ["Jackson", "Jack", "Jax"],
"Paul": ["Paul", "Pauly"],
"David": ["David", "Dave", "Davey"],
"Andrew": ["Andrew", "Andy"],
"Gary": ["Gary", "Garre", "Garret", "Garrett", "Garry"]

Listing 5. First Names Used for Interventions
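The drop and include interventions described above amount to overwriting a single SAE latent at the relevant token positions before decoding back into the residual stream. A minimal sketch, assuming the SAE exposes separate encode and decode steps (the method names are placeholders, not the authors' API):

import torch

def intervene(sae, hidden: torch.Tensor, feature_idx: int,
              mode: str = "include", coefficient: float = 200.0) -> torch.Tensor:
    """Overwrite one SAE latent and decode the edited activations.

    hidden: activations of shape (num_tokens, d_model) at the token positions
            where the target name occurs.
    """
    z = sae.encode(hidden)               # sparse codes, shape (num_tokens, d_sae)
    if mode == "drop":
        z[:, feature_idx] = 0.0          # suppress the original concept
    elif mode == "include":
        z[:, feature_idx] = coefficient  # inject the new concept; 200 was best in the sweep
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sae.decode(z)                 # edited activations passed back to the model

The same routine covers both intervention types; only the value written into the selected latent differs.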
"0": {"sentence": [f"{first_name} uses 10 weight plates each weighing 30 pounds on an exercise machine. This exercise machine uses special technology to make the weights 20% heavier on the lowering portion. How heavy did the weights feel when being lowered?"], "ground_truth": 260},
"1": {"sentence": [f"{first_name} sold clips to 48 of his friends in April, and then he sold half as many clips in May. How many clips did {first_name} sell altogether in April and May?"], "ground_truth": 72},
"2": {"sentence": [f"{first_name} writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year?"], "ground_truth": 624},
"3": {"sentence": [f"{first_name} is wondering how much pizza he can eat in one day. He buys 2 large pizzas and 2 small pizzas. A large pizza has 16 slices and a small pizza has 8 slices. If he eats it all, how many pieces does he eat that day?"], "ground_truth": 48},
"4": {"sentence": [f"{first_name} created a care package to send to his brother, who was away at boarding school. {first_name} placed a box on a scale, and then he poured into the box enough jelly beans to bring the weight to 2 pounds. Then, he added enough brownies to cause the weight to triple. Next, he added another 2 pounds of jelly beans. And finally, he added enough gummy worms to double the weight once again. What was the final weight of the box of goodies, in pounds?"], "ground_truth": 16},
"5": {"sentence": [f"{first_name} can read 8 pages of a book in 20 minutes. How many hours will it take him to read 120 pages?"], "ground_truth": 5},
"6": {"sentence": [f"{first_name} creates a media empire. He creates a movie for $2000. Each DVD cost $6 to make. He sells it for 2.5 times that much. He sells 500 movies a day for 5 days a week. How much profit does he make in 20 weeks?"], "ground_truth": 448000},
"7": {"sentence": [f"{first_name} is buying a new pair of shoes that costs $95. He has been saving up his money each month for the past three months. He gets a $5 allowance a month. He also mows lawns and shovels driveways. He charges $15 to mow a lawn and $7 to shovel. After buying the shoes, he has $15 in change. If he mows 4 lawns, how many driveways did he shovel?"], "ground_truth": 5},
"8": {"sentence": [f"{first_name} buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?"], "ground_truth": 50},
"9": {"sentence": [f"{first_name} took 9 pills a day for 14 days. Of these 9 pills, 4 pills cost $1.50 each, and the other pills each cost $5.50 more. How much did he spend in total on the pills?"], "ground_truth": 41},
"10": {"sentence": [f"{first_name} went to a shop to buy some groceries. He bought some bread for $2, butter for $3, and juice for two times the price of the bread. He had $15 for his shopping. How much money did {first_name} have left?"], "ground_truth": 6},
"11": {"sentence": [f"{first_name} has 2 dogs, 3 cats and twice as many fish as cats and dogs combined. How many pets does {first_name} have in total?"], "ground_truth": 15},
"12": {"sentence": [f"{first_name} has five more roommates than twice as many as Bob. If Bob has 10 roommates, how many roommates does {first_name} have?"], "ground_truth": 25},
"13": {"sentence": [f"{first_name} owns an ice cream shop and every sixth customer gets a free ice cream cone. Cones cost $2 each. If he sold $100 worth of cones, how many free ones did he give away?"], "ground_truth": 10},
"14": {"sentence": [f"{first_name} eats 1 apple a day for two weeks. Over the next three weeks, he eats the same number of apples as the total of the first two weeks. Over the next two weeks, he eats 3 apples a day. Over these 7 weeks, how many apples does he average a week?"], "ground_truth": 10},
"15": {"sentence": [f"{first_name} bought 2 soft drinks for $4 each and 5 candy bars. He spent a total of 28 dollars. How much did each candy bar cost?"], "ground_truth": 4},
"16": {"sentence": [f"{first_name} has a stack of books that is 12 inches thick. He knows from experience that 80 pages is one inch thick. If he has 6 books, how many pages is each one on average?"], "ground_truth": 160},
"17": {"sentence": [f"{first_name} is throwing a huge Christmas party. He invites 30 people. Everyone attends the party, and half of the guests bring a plus one (one other person). He plans to serve a 3-course meal for the guests. If he uses a new plate for every course, how many plates does he need in total for his guests?"], "ground_truth": 135},
"18": {"sentence": [f"{first_name} volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?"], "ground_truth": 72},
"19": {"sentence": [f"{first_name} puts $25 in his piggy bank every month for 2 years to save up for a vacation. He had to spend $400 from his piggy bank savings last week to repair his car. How many dollars are left in his piggy bank?"], "ground_truth": 200},
"20": {"sentence": [f"{first_name} has 16 toy cars, and the number of cars he has increases by 50% every year. How many toy cars will {first_name} have in three years?"], "ground_truth": 54},
"21": {"sentence": [f"{first_name} just turned 12 and started playing the piano. His friend Sheila told him about the 10,000-hour rule which says, after 10,000 hours of practice, you become an expert or master in your field. If {first_name} wants to become a piano expert before he is 20, how many hours a day will he need to practice if he practices every day, Monday - Friday, and takes two weeks off for vacation each year?"], "ground_truth": 5},
"22": {"sentence": [f"In 6 months Bella and {first_name} will be celebrating their 4th anniversary. How many months ago did they celebrate their 2nd anniversary?"], "ground_truth": 18},
"23": {"sentence": [f"{first_name} has some coins. He has 2 more quarters than nickels and 4 more dimes than quarters. If he has 6 nickels, how much money does he have?"], "ground_truth": 350},
"24": {"sentence": [f"{first_name} starts exercising at home during quarantine. To start, he decides to do 3 sets of 15 push-ups each. Near the end of the third set, he gets tired and does 5 fewer push-ups. How many push-ups did he do in total?"], "ground_truth": 40},
"25": {"sentence": [f"{first_name} likes to collect model trains. He asks for one for every birthday of his, and asks for two each Christmas. {first_name} always gets the gifts he asks for, and asks for these same gifts every year for 5 years. At the end of the 5 years, his parents give him double the number of trains he already has. How many trains does {first_name} have now?"], "ground_truth": 45},
"26": {"sentence": [f"{first_name} has $5000. He spends $2800 on a new motorcycle, and then spends half of what's left on a concert ticket. {first_name} then loses a fourth of what he has left. How much money does he have left?"], "ground_truth": 825},
"27": {"sentence": [f"{first_name} has 7 one-dollar bills, 4 five-dollar bills, 2 ten-dollar bills, and 1 twenty-dollar bill. He goes to buy peanuts, which cost $3 a pound. He buys what he wants and has $4 in change. He plans to eat the peanuts all in one week. How many pounds does he eat on average per day?"], "ground_truth": 3},
"28": {"sentence": [f"{first_name} has a terrible toothache and decides to buy some painkillers from the store. He picks up a bottle of 50 pills and takes them home. He takes 2 pills each day three times a day for the first 2 days, before cutting this amount in half for the next 3 days. On the sixth day, he takes a final 2 pills in the morning and ends up feeling better. How many pills are left in the bottle?"], "ground_truth": 27},
"29": {"sentence": [f"{first_name} picked a handful of dandelion puffs. He gave 3 to his mom, another 3 to his sister, 5 to his grandmother, and 2 to his dog. Then, he divided the remaining dandelion puffs equally among his 3 friends. How many dandelion puffs did each friend receive if he originally picked 40 dandelion puffs?"], "ground_truth": 9}

Listing 6. Intervenability Dataset
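As a rough illustration of how the name-variant map in Listing 5 and the templates in Listing 6 could feed the string-extraction check described in D.4, the following sketch tests whether a generated answer verbalizes (include) or omits (drop) a target name. It is an assumption about the evaluation loop, not the authors' exact procedure; helper names are hypothetical.

# Hedged sketch of the string-extraction check implied by D.4: after generating
# from a template instantiated with a source name, we test whether any accepted
# variant of the relevant name appears in the model output.
NAME_VARIANTS = {
    "Jerry": ["Jerry", "Jeremy", "Jermey", "Jerome"],
    "James": ["James", "Jim", "Jimmy", "Jamie"],
    # ... remaining entries as given in Listing 5
}

def mentions(text: str, name: str) -> bool:
    """Return True if any accepted spelling variant of `name` occurs in `text`."""
    return any(variant in text for variant in NAME_VARIANTS[name])

def score_include(generated: str, target_name: str) -> bool:
    # An include intervention succeeds if the injected concept is verbalized.
    return mentions(generated, target_name)

def score_drop(generated: str, source_name: str) -> bool:
    # A drop intervention succeeds if the original concept no longer appears.
    return not mentions(generated, source_name)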