Paper deep dive

Towards Unifying Interpretability and Control: Evaluation via Intervention

Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju

Year: 2024 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 74

Models: GPT-2 Small, Gemma-2-2B, Llama-2-7B, Llama-3.1-8B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 7:45:18 PM

Summary

The paper introduces a unified encoder-decoder framework to evaluate and compare four popular mechanistic interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) based on their ability to perform causal interventions on large language models. The authors propose two new metrics—intervention success rate and coherence-intervention tradeoff—to measure the accuracy of explanations and the utility of these methods for controlling model behavior. Experimental results across multiple models (GPT-2, Gemma2, Llama2/3) indicate that lens-based methods generally outperform sparse autoencoders and probes in simple interventions, while highlighting that mechanistic interventions often degrade model coherence, suggesting a significant gap between current interpretability research and practical model control.

Entities (6)

Linear Probing · interpretability-method · 100%
Logit Lens · interpretability-method · 100%
Sparse Autoencoders · interpretability-method · 100%
Tuned Lens · interpretability-method · 100%
Coherence-Intervention Tradeoff · evaluation-metric · 95%
Intervention Success Rate · evaluation-metric · 95%

Relation Signals (3)

Mechanistic Interventions → compromises → Model Coherence

confidence 90% · mechanistic interventions often compromise model coherence

Sparse Autoencoders → evaluatedby → Intervention Success Rate

confidence 90% · We evaluate Logit Lens, Tuned Lens, sparse autoencoders, and linear probes for these metrics

Logit Lens → outperforms → Sparse Autoencoders

confidence 85% · lens-based methods outperform SAEs and probes in achieving simple, concrete interventions

Cypher Suggestions (2)

Find all interpretability methods evaluated in the paper. · confidence 90% · unvalidated

MATCH (m:Method)-[:EVALUATED_IN]->(p:Paper {id: 'b8d51598-845b-4d05-8b4f-48e565fa8a2e'}) RETURN m.name

Identify metrics used to evaluate interpretability methods. · confidence 85% · unvalidated

MATCH (m:Method)-[:MEASURED_BY]->(metric:Metric) RETURN m.name, metric.name

Abstract

Abstract: With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods' practical utility and efficacy. To address the aforementioned issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework, enabling interventions on interpretable features that can be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics: intervention success rate and coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) while current methods allow for intervention, their effectiveness is inconsistent across features and models, (2) lens-based methods outperform SAEs and probes in achieving simple, concrete interventions, and (3) mechanistic interventions often compromise model coherence, underperforming simpler alternatives, such as prompting, and highlighting a critical shortcoming of current interpretability approaches in applications requiring control.

Full Text

73,223 characters extracted from source content.

Towards Unifying Interpretability and Control: Evaluation via Intervention

Usha Bhalla¹ ², Suraj Srinivas³, Asma Ghandeharioun⁴, Himabindu Lakkaraju¹ ⁵

Abstract

With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods’ practical utility and efficacy. To address the aforementioned issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework, enabling interventions on interpretable features that can be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics: intervention success rate and coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) while current methods allow for intervention, their effectiveness is inconsistent across features and models, (2) lens-based methods outperform SAEs and probes in achieving simple, concrete interventions, and (3) mechanistic interventions often compromise model coherence, underperforming simpler alternatives, such as prompting, and highlighting a critical shortcoming of current interpretability approaches in applications requiring control.
¹ Harvard University, SEAS; ² Harvard University, Kempner Institute; ³ Robert Bosch LLC; ⁴ Google DeepMind; ⁵ Harvard Business School. Correspondence to: Usha Bhalla <ushabhalla@g.harvard.edu>.

Preprint.

1. Introduction

As large language models (LLMs) have become more capable and complex, there has emerged a need to better understand and control these models to ensure their outputs are safe and human-aligned. Many interpretability methods aim to address this problem by analyzing model representations, attempting to understand their underlying computational and reasoning processes in order to ultimately control model behaviour. While many of these methods, and interpretability as a field more broadly, claim control and intervention as abstract goals and present compelling qualitative results demonstrating that intervention may be possible in certain cases (for example, Anthropic’s Golden Gate Claude (Anthropic; Templeton et al., 2024)), the link between interpretation and intervention is tenuous in practice, and many methods are not explicitly tailored for both. Furthermore, even fewer are thoroughly and systematically evaluated for the ability to control model outputs beyond qualitative examples.

We believe the reason for this is threefold. First, interpretability methods produce explanations in disparate feature spaces, such as token vocabulary, probe predictions, or learned auto-interpreted features, hindering comparisons across methods. Second, there exists a “predict/control discrepancy” (Wattenberg & Viégas, 2024), where the features identified by interpretability methods for predicting behavior are not the same as those used for steering it. Third, there do not exist standard systematic benchmarks to measure intervention success.

In this work, we view intervention as a fundamental goal of interpretability, and propose to measure both the correctness and the utility of interpretability methods by their ability to successfully edit model behaviour.
In particular, we focus on sparse autoencoders (Cunningham et al., 2023), Logit Lens (nostalgebraist; Dar et al., 2023), Tuned Lens (Cunningham et al., 2023; Rajamanoharan et al., 2024; Templeton et al., 2024; Bricken et al., 2023; Gao et al., 2024), and linear probing (Alain, 2016; Belinkov & Glass, 2019; Belinkov, 2022), and benchmark them with steering vectors and prompting as baselines for intervention. In order to enable comparison across these various methods, we first unify and extend the methods as instances of an abstract encoder-decoder framework, where each method encodes uninterpretable latent representations of language models into human-interpretable features and the decoder of the framework inverts this mapping, allowing us to reconstruct a latent representation from the features. Under this abstract framework, we can intervene on the interpretable feature activations generated by each method and decode them into latent counterfactuals, which produce counterfactual outputs corresponding to the desired intervention.

arXiv:2411.04430v2 [cs.LG] 10 Feb 2025

[Figure 1 diagram. Input: “My favorite color is”. Explanation encoder: $D$ maps $x$ to $z$. Intervention: $z$ is edited to $z'$. Explanation decoder: $D^{-1}$ maps $z'$ to $x'$. Layers shown: $\ell$, $\ell+1$. Normal output: “blue”. Intervened output: “red”.]

Method            D                D^{-1}              Dictionary size m   Interpretable features z_i
Logit Lens        θ_unembed        θ_unembed^+         32k-256k            Token logit
Tuned Lens        A*θ_unembed      A*θ_unembed^+       32k-256k            Token logit
SAEs              θ_encoder        θ_decoder           16k-24k             SAE feature activation
Probing           [θ_probe I]      θ I-θ T             1                   Probe prediction
Steering vectors  –                –                   –                   –

Further columns in the figure: Data free?; Automatically interpretable features; Can reconstruct x?

Figure 1. Our proposed intervention framework, which encodes model latent representations, $x$, into human-interpretable features, $z = xD$, that can then be perturbed to $z'$ and decoded back into counterfactual latent representations, $\hat{x}'$.
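The encode, edit, decode loop in Figure 1 can be sketched with a toy dictionary. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrix `D` is random rather than a real unembedding matrix, the sizes are arbitrary, the decoder is NumPy's Moore-Penrose pseudo-inverse, and the edit value (`z.max() + 5.0`) and the names `encode`, `decode`, and `target` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (assumption, not the paper's values).
d_model, n_features = 8, 20

# Stand-in for an explanation dictionary D, e.g. an unembedding matrix.
D = rng.normal(size=(d_model, n_features))
# Pseudo-inverse used as the decoder D^{-1}; shape (n_features, d_model).
D_inv = np.linalg.pinv(D)


def encode(x):
    """Map a latent vector x to feature activations z = x D."""
    return x @ D


def decode(z):
    """Map feature activations back to a latent vector via z D^{-1}."""
    return z @ D_inv


x = rng.normal(size=d_model)
z = encode(x)

# D has more columns than rows and (almost surely) full row rank, so
# D @ pinv(D) is the identity and the round trip recovers x.
x_hat = decode(z)

# Edit one feature upward, decode to a counterfactual latent, and re-encode.
target = 3
z_edit = z.copy()
z_edit[target] = z.max() + 5.0  # hypothetical edit value
x_edit = decode(z_edit)
z_roundtrip = encode(x_edit)

print("reconstruction error:", np.linalg.norm(x_hat - x))
print("target activation before/after:", z[target], z_roundtrip[target])
```

Re-encoding the counterfactual latent does not reproduce `z_edit` exactly, because an edited feature vector generally lies outside the row space of `D`; the sketch only shows that the target activation moves in the intended direction.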
The unifying feature interpretation and intervention framework allows us to propose two standard metrics for evaluating mechanistic interpretability methods: (1) intervention success rate, which measures how well intervening on an interpretable feature causally results in the desired behavior in the model outputs, and (2) coherence-intervention tradeoff, which measures how well the causal interventions succeed without damaging the coherence of the model’s outputs. We evaluate Logit Lens, Tuned Lens, sparse autoencoders, and linear probes for these metrics on GPT2-small, Gemma2-2b, Llama2-7b, and Llama3-8b, comparing them to simpler but uninterpretable baselines of steering vectors and prompting.

Our results show that while existing methods allow for intervention, their effectiveness is inconsistent across features and models. Furthermore, lens-based methods outperform all other methods, including sparse autoencoders, for simple, concrete features, likely due to the spurious correlations learned by probes and steering vectors and the high error rate in SAE feature labeling pipelines. We further show that intervention often comes at the cost of model output coherence, underperforming simple prompting baselines, presenting a critical shortcoming of existing methods in real-world applications that require control and intervention.

We conclude this work with some case studies of intervention on complex and safety-relevant features, along with detailed takeaways about the strengths and weaknesses of each method, including discussion of which methods are optimal for specific intervention topics, which are best to use out of the box, and which hold the most promise for future development.

Our main contributions include:

• In Section 3.1, we present a unifying framework for four popular interpretability methods: sparse autoencoders, logit lens, tuned lens, and probing. To facilitate this, we extend logit lens and tuned lens methods with decoders to allow for intervention.
• In Section 3.2, we propose two evaluation metrics for encoder-decoder interpretability methods, namely (1) intervention success rate and (2) the coherence-intervention tradeoff, to evaluate the ability of interpretability methods to control model behavior, and design an open-ended prompt dataset for benchmarking interpretability methods.
• In Section 4, we perform experimental analysis on GPT-2, Gemma2-2b, Llama2-7b, and Llama3-8b, and present detailed takeaways comparing interpretability- and control-based methods.

Overall, this paper takes a key step in establishing systematic benchmarks for mechanistic interpretability methods, making progress towards a previously stated open problem for the field (Mueller et al., 2024).¹

2. Related Work

Mechanistic Interpretability. Existing work in mechanistic interpretability broadly falls into two categories: activation patching and interpreting hidden representations. Activation patching utilizes carefully constructed counterfactual representations to study which neurons or activations play key roles in model computation, ideally localizing specific information to individual layers, token positions, and paths in the model (Geiger et al., 2021; Vig et al., 2020). However, recent work points to key limitations of patching, particularly with respect to real-world utility in downstream applications such as model editing (Hase et al., 2024; Zhang & Nanda, 2023). As such, we focus primarily on methods for inspecting hidden representations, of which probes are the most commonly used (Alain, 2016; Belinkov & Glass, 2019; Belinkov, 2022). Other methods such as Logit Lens (nostalgebraist; Dar et al., 2023) project intermediate representations into the token vocabulary space, with Belrose et al. (2023); Din et al. (2023); Geva et al. (2022) building and improving upon these early-decoding strategies.
Ghandeharioun et al. (2024a) unifies most of these methods into an abstracted framework for inspecting model computation. More recently, sparse autoencoders and dictionary learning have been explored as a solution to the uninterpretability of model neurons, particularly due to issues with polysemanticity and superposition (Elhage et al.; Bricken et al., 2023; Cunningham et al., 2023; Bhalla et al., 2024; Gao et al., 2024; Rajamanoharan et al., 2024; Templeton et al., 2024; Karvonen et al., 2024; Dunefsky et al., 2024).

¹ Code and data are hosted at https://github.com/AI4LIFE-GROUP/interp_interv.git.

Evaluation. Due to the recency of the field, standard evaluation metrics across interpretability methods have not yet been established, and similar to the broader interpretability field, evaluation is frequently ad hoc and primarily qualitative in nature, with recent works pointing to the need for more causal evaluation (Mueller et al., 2024; Saphra & Wiegreffe, 2024). With regard to quantitative metrics, Gao et al. (2024), Templeton et al. (2024), and Makelov et al. (2024) evaluate sparse autoencoders for reconstruction error, recovery of supervised or known features, activation precision, and the effects of ablation; however, none of these metrics measure the correctness of explanations or usefulness for control. Independent of our work, Wu et al. (2025) also propose a benchmark for steering methods, AxBench, to assess whether steering is a viable alternative to existing model control techniques, finding similar results to ours. Different from them, we consider additional lens-based interpretability methods and explore the extent to which intervention is possible without output degradation, for both simple and safety-relevant interventions.
Causal Intervention. Previous literature on probing frequently evaluates learned probes and features through intervention to ensure causality and correctness, as done by Li et al. (2022); Chen et al. (2024); Hernandez et al. (2023b;a); Marks & Tegmark (2023). The interventions performed for measuring causality are similar to those used to perform model “steering” (Rimsky et al., 2023; Panickssery et al., 2024; Ghandeharioun et al., 2024b) and should ideally produce the same effect but with the added claim of interpretability. Geiger et al. (2024) unify many interpretability methods and steering through causal abstraction but do not extend or evaluate these methods for control. Mueller et al. (2024); Belrose et al. (2023); Chan et al. (2022); Olah et al. (2020) consider causal intervention as a tool for assessing explanation faithfulness; however, these works often do not compare between methods and do not consider intervention as a means for control, providing no exploration of the quality of the intervened outputs or their utility in application. Templeton et al. (2024), on the other hand, provide a qualitative demonstration of intervention via their ‘Golden Gate Claude’ but do not systematically measure or compare against other interpretability methods. Different from these works, our work aims to adapt and evaluate existing methods (notably, lens-based methods and SAEs), originally proposed as model inspection tools, for intervention.

3. Method

In this section, we first introduce a unifying framework for four common mechanistic interpretability methods: sparse autoencoders, Logit Lens, Tuned Lens, and probing, along with modifications to these methods that permit principled intervention on representations. We then propose evaluation metrics for (1) testing the correctness of explanations via intervention and (2) the usefulness of these methods for steering and editing representations and model outputs.

3.1. Unifying Intervention Framework

Latent vectors to interpretable features. The central aspect of most interpretability work is the ability to translate model computation into human-interpretable features, whether the computation be latent directions, neurons, components, reasoning processes, etc. Many works aiming to explain LLMs focus particularly on hidden representations, where the mapping between high-dimensional dense embeddings and human-interpretable features is modeled through a (mostly) linear dictionary projection or affine function:

$$z = f(x) = \sigma(x \cdot D)$$
$$\hat{x} = g(z) \approx f^{-1}(z) = z \cdot D^{-1}$$
$$z' = \mathrm{Edit}(z), \qquad \hat{x}' = g(z')$$

where each $z_i$ is a feature activation, each $i$ in $D$ corresponds to a human-interpretable feature, and $\sigma$ is an activation function that is frequently the identity. In the case of sparse autoencoders, $D$ is a learned, overcomplete dictionary, with 16k - 65k features for small models (up to 16M for large models), and $\sigma$ is a ReLU, JumpReLU, or ReLU + top-k activation function. Given that SAE features are learned, they are not immediately interpretable and must be labelled by humans or strong LLMs after training. For Logit Lens, $D$ is simply the language model’s unembedding matrix, meaning each feature corresponds to a single token in the vocabulary. For Tuned Lens, $D$ is the same as for Logit Lens but with a learned linear transformation applied. Linear probes can be thought of as a learned dictionary with $N = 1$, where $\sigma$ is a sigmoid or softmax activation and the data is labelled. Supervised dictionary learning essentially learns $N \gg 1$ probes with a loss that either enables latent reconstruction or language modeling, as in ReFT-r1 (Wu et al., 2025). Of all these methods, Logit Lens is the only method that does not require any training data, and sparse autoencoders are the only method that does not produce immediately interpretable features. For a visual summary of this framework, see Figure 1.

Interpretable features to counterfactual latent vectors.
While producing explanations is straightforward for each method, intervening on model representations using the information provided by explanations is not as simple. Doing so requires defining a reverse mapping from the explanations to the latent representations of the model, which is only explicitly done by sparse autoencoders. We extend lens-based methods and probing by defining inverse mappings for them as follows. To map Logit Lens’s explanations back into the model’s latent space, we would ideally apply the inverse of the unembedding matrix to $z$; however, in practice this is often ill-conditioned due to the dimensionality of $D$. As such, we instead use the low-rank pseudoinverse of the unembedding matrix and right-multiply it to the explanation logits. Similarly, for Tuned Lens, we model the decoding process through the pseudo-inverse of the Tuned Lens projection applied to the unembedding matrix. Notably, both of these methods only require a simple linear transformation to go back and forth between latent vectors and explanations. For probing, an inverse mapping $D^{-1}$ is not strictly necessary, as all interventions can be performed directly on $x$ instead of $z$, as done by Chen et al. (2024); however, an inverse mapping can be designed to maximally recover $x$ from $z$, as shown in Figure 1. Sparse autoencoders have a well-defined backwards mapping through the SAE decoder, which is frequently linear in practice and often the transpose of the encoder weights.

Intervening on interpretable features. Given the above framework, intervention is performed by directly altering the feature activation $z_i$ corresponding to the desired feature $i$ to be edited. While the edited activation $z'_i$ can naively be set to some constant value $\alpha$, the same constant may have drastically varying effects for different tokens and different prompts.
As such, to take into account the context of $z$, for Logit Lens, Tuned Lens, and SAEs we set $z'_i = \alpha \cdot \max(z)$. This ensures that the feature $i$ is the most dominant feature in the latent vector for $\alpha > 1$. Decoding $z'$ yields the altered latent representation $\hat{x}' = g(z')$, which accounts for both the error of the explanation method as well as the intervention performed. For probing and steering vectors, $\hat{x}' = x + \alpha \cdot v$, where $v$ is the steering vector or the weights of the linear probe. Note that $\alpha$ is a hyperparameter that must be tuned for each method and model, and thus cannot be used to compare the effects of interventions across methods. In order to do so, we can instead measure the normalized difference between the latent vectors $x$ and $\hat{x}'$, to characterize the strength of the intervention. We also note that $\hat{x}$ and $\hat{x}'$ are not necessarily in-distribution for the language model, but due to the additive nature of the residual stream and the linear representation hypothesis, we believe that such interventions may still be principled in practice (see Park et al. (2023) for more on the linear representation hypothesis and intervention).

3.2. Evaluation Across Methods and Models

Given the overall lack of standardized evaluation of mechanistic interpretability methods, we intend for this work to serve as a starting point for systematic evaluation by testing methods in simple, easy-to-measure contexts. In particular, we think of our evaluations as measuring a kind of upper bound for these methods: in the easiest of settings, how well do existing methods work?

Explanation Correctness. We first propose metrics to evaluate the correctness of explanations and interventions. More specifically, to test whether a single feature of an explanation $z_i$ is correct, we intervene on that feature to produce $z'_i$ and decode $z'$ to $\hat{x}'$, which should generate text that matches the intervention made to produce $z'$. For example, if feature $i$ encodes the concept “references to Paris,” increasing the value of $z_i$ should result in increases to references of Paris in the model’s output. From this, we propose a metric of Intervention Success Rate, which measures if increasing activation $z_i$ results in the appropriate increase of the feature $i$ in the model’s output. To evaluate a continuous relaxation of this, we can also similarly measure the probability assigned to tokens relating to feature $i$. As such, even if the model’s output does not directly reflect interventions made to $z'_i$ due to sampling, we can measure if increasing the activation of $i$ results in any change to the model’s output at all. We refer to this metric as Intervened Token Probability. Importantly, both of these metrics can be thought of as measuring the causal fidelity of the features highlighted by explanations.

Usefulness of Intervention Methods. While intervention is a useful method for evaluating the correctness of explanations, it is also a desideratum of its own and a frequent motivation for many explanation methods. For example, methods are often developed for the purpose of de-biasing model outputs or increasing model safety, either by localizing bad behavior or identifying it at inference time, thus allowing for targeted edits to be made. However, a lack of this direct connection between interpretation and model intervention has led to illusory results in prior literature (Hase et al., 2024; Wattenberg & Viégas, 2024). By directly and explicitly measuring how effective interpretability methods are at allowing for targeted intervention or steering, we can avoid such failure cases. Importantly, intervention is only useful if the language model retains its overall performance and still satisfies the purpose of the query as well as the intervention. Thus, we want to evaluate whether interpretability methods can steer model outputs towards feature $i$ without damaging the model.
We define Coherence as the grammatical correctness, consistency, and relevance to the prompt of the generated text, which can be measured by querying an appropriate oracle, such as a human or strong LLM. Similarly, we can also measure the Perplexity of the intervened outputs with respect to a strong language model. In practice, we use Llama3.1-8b for both of these metrics, as it is reasonably sized, high-performing, and open source, allowing for the measurement of perplexity. We compare coherence scores given by Llama3.1-8b to those generated by human raters as well as a rules-based grammar checker to ensure efficacy of our LLM-as-a-judge setup in Table 1.

[Figure 2 plot. Legend: Logit Lens, Tuned Lens, SAE, Probing, Steering, ReFT-r1.]

Figure 2. Evaluation of the Intervention Success Rate with respect to edit distance for each method on four models for the simple intervention topics. Note that normalized edit distance is a proxy for intervention strength that is comparable across methods. Logit Lens generally outperforms all other methods.

An Open-ended Evaluation Dataset. In order to evaluate these methods to the best of their capabilities, we are interested in assessing their ability to intervene when intervention is straightforward and possible given the prompt. Consider the question “What is $\int \sin(3x)\cos(y)\,dx\,dy$?”. Intervening on the model’s output for this prompt with a feature related to unicorns is not necessarily intuitive, as there is a correct answer to the prompt that is entirely unrelated to the intervention topic. As such, we want to evaluate these methods on prompts that allow for steering towards a variety of topics or features. To that end, we construct a dataset of 210 prompts related to poetry, travel, nature, journaling prompts, science, the arts, and miscellaneous questions that could plausibly be answered while satisfying a variety of intervention topics.
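The perplexity measure mentioned above has a standard definition that is easy to state in code. The sketch below assumes the per-token natural-log probabilities of a generated text have already been obtained from a scoring language model; how the authors extract them from Llama3.1-8b is not described here, and the function name and example values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# Invented example: every token assigned probability 0.25 by the scorer.
uniform_quarter = [math.log(0.25)] * 4
print(perplexity(uniform_quarter))

# Invented example: every token assigned probability 1 (log-prob 0).
print(perplexity([0.0, 0.0]))
```

Lower values mean the scoring model finds the intervened text more predictable, which is why perplexity can serve as a rough complement to the judged coherence score.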
All prompts are open-ended to allow for many potential answers. Some example prompts include “In ten years, I hope to have accomplished”, “Check out this haiku I wrote:”, and “What is your favorite dad joke?”.

4. Experiments

In this section, we evaluate the four interpretability methods on our metrics from Section 3.2. We also provide case studies of intervention on more complex and safety-relevant concepts in Section 4.4. Finally, we present an analysis of the empirical alignment between methods in Section 4.5 and a comparison of the strengths and weaknesses of each method in Section 4.6. Additional experiments relating to latent reconstruction error and intervention efficacy across model layers are in Appendix A.1 and A.4.

4.1. Implementation Details

Intervention Topics. We choose 10 intervention topics that all relate to references to specific words or phrases: ‘beauty,’ ‘chess,’ ‘coffee,’ ‘dogs,’ ‘football,’ ‘New York,’ ‘pink,’ ‘San Francisco,’ ‘snow,’ ‘yoga’, generalizing ‘Golden Gate Claude’-style interventions. These simple, low-level features are ideal for evaluation through intervention for four key reasons: first, measuring the presence of a word or phrase is much easier than measuring a high-level abstract concept such as sycophancy; second, these features were present in the pretrained and labelled sparse autoencoders we studied; third, the features necessarily exist in the Logit Lens unembedding dictionary; and finally, datasets that are labelled for the presence of these features are very straightforward to collect for generating steering vectors and probes. As such, we can easily compare interventions on these features across all interpretability methods and measure intervention success by checking if the given word/phrase exists in the model’s output.

Steering vectors and probing. We implement steering vectors with Contrastive Activation Addition (CAA) (Rimsky et al., 2023) with a few simple modifications.
Whereas CAA takes the difference between contrastive pairs only at the last token, we find that averaging across the token dimension and taking the difference between those averages yields much better results. This is due to the fact that in CAA, the only difference between representations occurred in the token position of the answer letter, or the last token; however, in our case the information related to the intervention feature could be present at any token. Example contrastive data pairs were hand-generated by the authors and then used to prompt ChatGPT to create a total of 200 pairs of sentences. All data was verified by the authors and is made available in the accompanying codebase. These contrastive pairs were also used to train the linear probes, using the implementation from Chen et al. (2024). All probes reached train and test accuracies of 100% across all models and intervention topics.

[Figure 3 plot. x-axis: Intervention Success Rate. Legend: Logit Lens, Tuned Lens, SAE, Probing, Steering, ReFT-r1, Prompting, Clean Baseline.]

Figure 3. Intervened output coherence measured with respect to intervention success rate. The solid horizontal line shows the mean of coherence scores for the clean model outputs, and the dashed lines show ±1 standard deviation around the mean.

Sparse autoencoders and supervised dictionaries. We focus specifically on sparse autoencoders trained to interpret the residual stream of transformer models. We use the SAELens library (Bloom, 2024) for GPT2-small and Llama3-8b and the Gemma Scope SAEs (Lieberum et al., 2024) for Gemma2-2b. SAE feature labels were found via Neuronpedia (Lin & Bloom, 2023), which allows users to search through fully trained SAEs and their auto-interpretation labelled features. We also evaluate the Rank-1 Representation Finetuning (ReFT-r1) supervised dictionaries released by Wu et al. (2025), which have features that directly correspond to the SAE features for Gemma2-2b. Note that dictionaries were only released for layer 20 of Gemma2-2b, so we cannot present evaluation for other layers or models.

4.2. Intervention Success Across Models

As described in Section 3.2, in order to evaluate the correctness of explanations, we measure the causal effects of intervening on specific features of each explanation. For a given feature or intervention topic $i$, we see if increasing the activation of that feature results in an increase of the feature in the model’s output for the ten simple intervention topics. In order to compare across methods, which all have different explanation feature spaces and scales, we measure the success of interventions as a function of the norm of the distance between the edited latent representation $\hat{x}' = g(\mathrm{Edit}(f(x)))$ and the original latent representation $x$: $\|\hat{x}' - x\| / \|x\|$. Results for intervention success rate are shown in Figure 2 and results for intervened token probability can be found in Appendix A.5.

Across methods and models, we find that by increasing intervention strength, or the magnitude of the edit to the latent representation, intervention success rate first improves and then levels out, as expected. However, we unexpectedly find that Logit Lens and Tuned Lens generally have the highest intervention success rate, regardless of the normalized edit distance, except when compared to ReFT-r1 on Gemma2-2b. Furthermore, we find that SAEs, probes, and steering vectors require significantly larger edits in order to achieve reasonable intervention success. Note that the minimal edit distance for SAEs is nonzero, as SAE reconstruction incurs a significant error, as explored in Appendix A.1. In general, we believe the lower performance of SAEs is due to heavy noise in the labels of features.
For example, a feature labelled ‘references to coffee’ is sometimes actually a feature that encodes references to ‘beans’ and ‘coffee beans’, and thus only sometimes increases mentions of ‘coffee’. Probes and steering vectors also have suboptimal performance, often because they learn spurious correlations in the training data rather than the true intervention feature.

4.3. Effects of Intervention on Output Quality

We next measure the coherence of the intervened output text produced by each method to ensure that intervention through interpretability methods is possible without damaging the utility of the model. We measure coherence as described in Section 3 as a function of the intervention success rate in Figure 3 to characterize the tradeoff between intervention success and output coherence. Results for coherence as a function of normalized latent edit distance, $\|\hat{x}' - x\| / \|x\|$, are in Appendix A.3. We visualize the mean of coherence scores for the clean model outputs with solid black horizontal lines, the same as those shown in Figure 7, with a buffer of ±1 around the mean in dashed lines.

We also consider a prompting baseline, where we simply prompt the language model to talk about the intervention topic, to better understand the optimal coherence possible while satisfying the intervention. This is shown by the teal stars in Figure 3. Prompting was infeasible for GPT2-small as it was not instruction tuned. Also, note that the intervention success rate approaches 100% with prompting as the number of generated tokens increases; however, since we only generate 30 tokens, the success rate may be lower than expected.

Our experiments reveal that while interpretability methods may seem to provide reasonable tradeoffs between intervention success and coherence at first glance, they all underperform the simplest baseline of just prompting the model. Furthermore, Logit Lens and Tuned Lens significantly outperform all other methods when intervening on these simple topics, with intervention success rates of around 0.5 and 0.6 respectively for outputs within one point of deviation from the mean coherence score of the clean model. All other methods exhibit much less desirable Pareto curves, regardless of model size or intervention feature.

Table 1. Correlation of the LLM rater’s (Llama3-8b) coherence scores with human raters (left) and with a rule-based grammar checker (right). All three raters are highly correlated with one another.

| Model | LLM rater vs. human rater: Pearson R | LLM rater vs. human rater: $r^2$ | LLM rater vs. error checker: Pearson R | LLM rater vs. error checker: $r^2$ |
|---|---|---|---|---|
| Llama3-8b | 0.94 | 0.75 | -0.96 | 0.92 |
| Llama2-7b | 0.80 | 0.68 | -0.85 | 0.73 |
| Gemma2-2b | 0.80 | 0.67 | -0.78 | 0.75 |
| GPT2-sm | 0.71 | 0.67 | -0.86 | 0.74 |

Verifying Coherence. In order to validate the coherence scores generated through our LLM-as-a-judge setup with Llama3-8b, we verify the coherence scores with human raters. Participants blindly rated 100 outputs for each model, and we measured the correlation between these human ratings and LLM ratings, as shown in Table 1. We find high consistency between both, with particularly high correlation coefficients and $r^2$ values for the larger models.

We also check the validity of the coherence ratings by comparing with an alternative metric that measures the number of grammatical errors in the intervened output via a rule-based grammar checker. In particular, we use LanguageTool, which has thousands of rules relating to grammar, typos, capitalization errors, and more, to determine the number of errors in each output. As expected, there is a high negative correlation between the two, indicating that outputs with more errors are less coherent. However, we note that the number of grammatical errors is not an ideal metric, as it does not assess whether the text generation pertains to the prompt, which an LLM rater can do.
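The agreement check described above can be sketched as follows. This is only a minimal illustration, assuming coherence scores and error counts are available as plain lists; the helper name `pearson_r` and all score values are hypothetical and are not data or code from the paper, and the sketch does not attempt to reproduce the $r^2$ columns of Table 1.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length 1-D score lists."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    if denom == 0:
        raise ValueError("correlation is undefined for constant scores")
    return float((a_centered * b_centered).sum() / denom)


# Hypothetical ratings for six outputs (illustrative only, not from the paper).
llm_scores = [7, 3, 8, 5, 2, 6]      # LLM-as-a-judge coherence scores
human_scores = [8, 3, 7, 5, 1, 6]    # human coherence ratings
error_counts = [0, 4, 0, 2, 6, 1]    # rule-based grammar error counts

r_human = pearson_r(llm_scores, human_scores)
r_errors = pearson_r(llm_scores, error_counts)
print(f"LLM vs. human rater:  {r_human:.2f}")
print(f"LLM vs. error count:  {r_errors:.2f}")
```

A positive coefficient for the human comparison and a negative one for the error-count comparison would match the pattern the text describes: more grammatical errors go with lower coherence scores.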
Qualitative Examples. We present examples of intervention outputs in Figure 5 for the feature ‘yoga,’ with more examples in Appendix A.7. We highlight outputs where intervention succeeded with minimal degradation in coherence in “Optimal Intervention Strength” (left column) as well as generations from the highest intervention strength tested in “Excessive Intervention” (right column). Note that intervention results in repetition at very high intervention strengths for all methods; however, only Logit Lens and Tuned Lens result in repetition of tokens related to ‘yoga.’

[Figure 4: coherence plotted against intervention success rate for three complex features. Legend: Logit Lens, Tuned Lens, SAE, Probing, Steering, ReFT-r1, Prompting, Clean Baseline.]

Figure 4. Relationship between intervention success rate and coherence for three complex features: religious references (top), gendered language (middle), and French language (bottom) for Gemma2-2b (left) and Llama3-8b (right).

4.4. Intervention on Complex Features

While the aforementioned simple features allow for rigorous evaluation across methods, in practice, users often want to control or steer much more complex concepts. To investigate the feasibility of interventions in more interesting and realistic settings, we present results for three more complex concepts: (1) religious speech, with direct or implicit references to a given religion, (2) gendered language, or the ability to preferentially generate text related to a specific gender, and (3) the French language, where the model should generate text in French even when given an English prompt. These concepts were chosen from features known to exist in the pretrained and labeled SAEs we evaluated, which had a female gendered-language concept and a Christianity- and Islam-related religion concept.² Intervention with Logit Lens was performed by selecting ten relevant tokens as the features for the gender and religion concepts, such as ‘she’, ‘her’, ‘herself’ and ‘Christ’, ‘Allah’, ‘holy’, but intervention could not be performed for the French concept.
Intervention success is measured with keyword detectors for the first two concepts and the NLP-based language detector package polyglot (Al-Rfou, 2022) for the French concept.

² ReFT-r1 did not have a feature that directly corresponded to French language, so we consider the closest successful feature available: “French connective and referential pronouns.”

| Method | Optimal Intervention Strength | Excessive Intervention |
|---|---|---|
| Logit | On my next vacation, I hope to visit the beautiful country of Costa Rica. I’ve been reading a lot about the country and I’m really excited about the yoga retreats I’ (Edit Distance: 0.6, Coherence: 7, Model: Gemma2) | My favorite song from the 21st century is “Yoga, Yoga, Yoga” by Yoga Yoga Yoga Yoga Yoga Yoga Yoga Yoga Yoga Yoga (Edit Distance: 0.7, Coherence: 3, Model: Gemma2) |
| Tuned | My favorite thing to do outside is to go to the gym. I'm not a gym rat, but I'm a yoga instructor. I'm a yoga teacher. I'm a yoga (Edit Distance: 0.4, Coherence: 4, Model: Llama2) | An artist whose work I admire is Yogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaogaoga (Edit Distance: 0.5, Coherence: 2, Model: Llama2) |
| ReFT-r1 | An interesting start-up idea that I heard about recently was the idea of a “virtual yoga studio”. The idea is that you can practice yoga in the comfort of your own home, with the help of a (Edit Distance: 0.3, Coherence: 8, Model: Gemma2) | The most incredible experience I've had in nature was in the midst of the forest, in the midst of the trees, in the midst of the forest, in the midst of the trees, in the midst of the forest, (Edit Distance: 0.5, Coherence: 5, Model: Gemma2) |
| SAE | The coolest wildlife I've ever spotted was a yoga class in a tree. I was in the jungle of Costa Rica, and I was doing a yoga class in a tree. I was in the jungle of Costa Rica, and I was doing a yoga (Edit Distance: 0.9, Coherence: 6, Model: Gemma2) | My favorite museum to visit is the Hat yoga practice sequence. It is a sequence that is a sequence that is a sequence that is a sequence that is a sequence that is a sequence that (Edit Distance: 1.2, Coherence: 5, Model: Gemma2) |
| Steering | My favorite scientist is Dr. Deepak Chopra. I have been practicing yoga for 7 years and I have been practicing meditation for 4 years. I have been practicing yoga (Edit Distance: 0.6, Coherence: 5, Model: Gemma2) | Last year, I went to a practice session for a beginner class I was initially initially initially initially initially initially initially initially initially initially (Edit Distance: 1.3, Coherence: 5, Model: Gemma2) |
| Probing | Think about a time you felt truly at peace. Maybe you were sitting on a beach, or maybe you were sitting in a yoga studio. Maybe you were sitting in your own home, or maybe you (Edit Distance: 0.7, Coherence: 5, Model: Gemma2) | My favorite fiction author is the one and only, the one and only, the one and only, the one and only, the one and only, the one and only, the one and only, the one and only, (Edit Distance: 1.1, Coherence: 4, Model: Gemma2) |

Figure 5. Examples of intervened model outputs for intervention feature ‘yoga’ at both the optimal intervention strength (left) and the maximum intervention strength tested (right). Outputs degrade into incoherent repetition at high intervention strength for all methods.

As shown in Figure 4, we find that even for these more complex concepts, prompting generally outperforms interpretability-based interventions, particularly for the French language concept. Furthermore, no other method performs consistently across models or features, highlighting the unreliability of existing interpretability methods for controlling complex and safety-relevant concepts. However, we generally found that interventions made by Logit Lens were most successful in incorporating the intervention into a plausible completion of the prompt.
For example, one intervened output was “Whenever I’m outdoors and in nature, I always have a camera with me. I love taking pictures of God’s beautiful creation. I’m a big fan of Jesus and I love spending...”, where we can see that the religion concept is integrated well with the response to the prompt. For most other methods, such as SAEs, interventions either immediately degraded model performance by inducing severe repetition (e.g. “I had a conversation recently about the last and final prophet of the last and final of the last and the last and the last and”) or produced outputs that were unrelated to the prompt or simply incorrect (e.g. “Tell me an interesting fact about a musical instrument. For example, did you know that the piano is actually a Christian Muslim?”).

4.5. Intervention Similarity Between Methods

[Figure 6: cosine-similarity heatmaps for GPT2-sm (Logit, Tuned, SAE, Steering, Probing) and Gemma2-2b (Logit, ReFT-r1, SAE, Steering, Probing).]

Figure 6. Cosine similarity between methods’ intervention directions in model latent space.

Given that these methods all result in linear edits that should correspond to the same feature, ideally their interventions should all point in the same direction in the model’s latent space. We evaluate the empirical similarity between methods by measuring the cosine similarity between edit directions, $\hat{x}' - x$, for each intervention topic. The average cosine similarity between these vectors for each intervention topic is shown in Figure 6.

We find that Logit Lens and Tuned Lens are highly similar, as expected. Similarly, steering vectors and probe weights tend to lie in similar directions, likely due to the same underlying data used to train both.
Most interestingly, we find that sparse autoencoders tend to intervene in somewhat similar directions to steering vectors and probes and have near orthogonal directions to Logit Lens and Tuned Lens, even when interventions succeed for all methods. We speculate that sparse autoencoders may be more similar to probes and steering vectors because the three methods may have a bias toward representing past information and tokens, due to their training and labelling algorithms, as also noted by Gur-Arieh et al. (2025). Logit Lens and Tuned Lens, on the other hand, are designed to reveal information about the next token specifically, given that they are early-decoding strategies, and thus may contain more information about model outputs rather than inputs.

4.6. High-Level Comparison of Methods

We present the following characterization of the strengths and weaknesses of the interpretability and steering methods. Logit Lens is easy to use, requires no training, and maps features directly to vocabulary tokens, making it highly interpretable, and we find that it generally has high causal fidelity as well. However, its predefined, static features are limited and rudimentary, preventing its utility in many real-world or safety-critical applications. Tuned Lens shares these traits but requires an additional learned affine transformation, although many are already open-sourced. Sparse autoencoders learn a wide variety of both low- and high-level features; however, the post-hoc labeling of features is often highly erroneous and leads to lower causal fidelity in practice. They are also extremely data- and compute-intensive, and there is no guarantee that any given desired feature will exist in the SAE’s dictionary. Supervised dictionaries can contain any arbitrary feature but also require significant data and compute to learn, along with an exhaustive list of relevant features.
Importantly, they can be trained with intervention in mind, increasing their utility for steering while maintaining interpretability. Probing and steering vectors also allow for feature specification but are prone to learning spurious correlations, leading to low causal fidelity, and both require labelled data.

As such, we believe that lens-based methods are most useful for providing high-fidelity explanations but are likely not effective solutions for steering in real applications, for which steering vectors and prompting are more promising but require careful oversight and refinement to ensure efficacy due to their uninterpretability. Finally, supervised dictionaries seem to strike a good balance between interpretation- and control-based methods if users are not compute- and data-bound, proving to be more promising than unsupervised methods such as SAEs.

5. Conclusion

While interpretability methods show great promise in understanding large language models, the correctness of their explanations is less clear. Do these explanations reveal truth about model computation or simply fool human researchers? We believe that systematic benchmarking of explanations is critical to answer this question. Our work makes progress towards this goal and answers this question somewhat negatively, showing that current explanations are less accurate than expected. Our work also raises questions regarding the utility of such methods, as we find that prompting outperforms current interpretability methods in its ability to steer models, without requiring any training, data, or access to model weights. We hope future work can address these shortcomings of current methods, paving the way toward interpretability methods that are faithful and provide actionable insights for improving and controlling models.

Acknowledgements

The authors would like to thank Fred Zhang and Lucas Dixon for their helpful discussion and comments on the manuscript.
This work is supported in part by the NSF awards IIS-2008461, IIS-2040989, IIS-2238714, AI2050 Early Career Fellowship by Schmidt Sciences, and faculty research awards from Google, OpenAI, Adobe, JPMorgan, Harvard Data Science Initiative, and the Digital, Data, and Design (D^3) Institute at Harvard. UB is funded by the Kempner Institute Graduate Research Fellowship. The views expressed here are those of the authors and do not reflect the official policy or position of the funding agencies.

Impact Statement

This paper aims to improve the evaluation of interpretability methods, ensuring that they are both faithful and useful in practice. We are particularly interested in facilitating discussions around how to make interpretability methods more reliable to improve model fairness and safety, both of which can have significant societal implications. Ultimately, we hope that clear evaluation of interpretability methods will ensure that methods produce desired results, decreasing unintended consequences due to spurious correlations or unfaithful explanations. However, we note that interpretability can just as easily be used to make models more harmful than helpful.

References

Al-Rfou, R. Welcome to polyglot’s documentation. URL: https://polyglot.readthedocs.io/en/latest/. (Date accessed: 06.11.2022), 2022.

Alain, G. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

Anthropic. Golden Gate Claude. URL https://www.anthropic.com/news/golden-gate-claude.

Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.

Belinkov, Y. and Glass, J. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019.

Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.

Bhalla, U., Oesterling, A., Srinivas, S., Calmon, F. P., and Lakkaraju, H. Interpreting clip with sparse linear concept embeddings (splice). arXiv preprint arXiv:2402.10376, 2024.

Bloom, J. Saelens training. https://github.com/jbloomAus/SAELens, 2024.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. Causal scrubbing: A method for rigorously testing interpretability hypotheses. In AI Alignment Forum, p. 10, 2022.

Chen, Y., Wu, A., DePodesta, T., Yeh, C., Li, K., Marin, N. C., Patel, O., Riecke, J., Raval, S., Seow, O., Wattenberg, M., and Viégas, F. Designing a Dashboard for Transparency and Control of Conversational AI, June 2024. URL http://arxiv.org/abs/2406.07882. arXiv:2406.07882 [cs].

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

Dar, G., Geva, M., Gupta, A., and Berant, J. Analyzing Transformers in Embedding Space, December 2023. URL http://arxiv.org/abs/2209.02535. arXiv:2209.02535 [cs].

Din, A. Y., Karidi, T., Choshen, L., and Geva, M. Jump to conclusions: Short-cutting transformers with linear transformations. arXiv preprint arXiv:2303.09435, 2023.

Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable llm feature circuits. arXiv preprint arXiv:2406.11944, 2024.

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy Models of Superposition.

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

Geiger, A., Lu, H., Icard, T., and Potts, C. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.

Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., et al. Causal abstraction: A theoretical foundation for mechanistic interpretability. Preprint, p. 9, 2024.

Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680, 2022.

Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M. Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models, January 2024a. URL http://arxiv.org/abs/2401.06102. arXiv:2401.06102 [cs].

Ghandeharioun, A., Yuan, A., Guerard, M., Reif, E., Lepori, M. A., and Dixon, L. Who’s asking? user personas and the mechanics of latent misalignment. arXiv preprint arXiv:2406.12094, 2024b.

Gur-Arieh, Y., Mayan, R., Agassy, C., Geiger, A., and Geva, M. Enhancing automated interpretability with output-centric feature descriptions. arXiv preprint arXiv:2501.08319, 2025.

Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. Advances in Neural Information Processing Systems, 36, 2024.

Hernandez, E., Li, B. Z., and Andreas, J. Inspecting and editing knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023a.

Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., and Bau, D. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023b.

Karvonen, A., Wright, B., Rager, C., Angell, R., Brinkmann, J., Smith, L., Verdun, C. M., Bau, D., and Marks, S. Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models, July 2024. URL http://arxiv.org/abs/2408.00113. arXiv:2408.00113 [cs].

Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147, 2024.

Lin, J. and Bloom, J. Neuronpedia: Interactive reference and tooling for analyzing neural networks, 2023. URL https://www.neuronpedia.org. Software available from neuronpedia.org.

Makelov, A., Lange, G., and Nanda, N. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv preprint arXiv:2405.08366, 2024.

Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.

Mueller, A., Brinkmann, J., Li, M., Marks, S., Pal, K., Prakash, N., Rager, C., Sankaranarayanan, A., Sharma, A. S., Sun, J., et al. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. arXiv preprint arXiv:2408.01416, 2024.

nostalgebraist. interpreting GPT: the logit lens. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.

Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via Contrastive Activation Addition, July 2024. URL http://arxiv.org/abs/2312.06681. arXiv:2312.06681 [cs].

Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.

Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., and Nanda, N. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.

Saphra, N. and Wiegreffe, S. Mechanistic? arXiv preprint arXiv:2410.09087, 2024.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401, 2020.

Wattenberg, M. and Viégas, F. B. Relational composition in neural networks: A survey and call to action. arXiv preprint arXiv:2407.14662, 2024.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. Axbench: Steering llms? even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025.

Zhang, F. and Nanda, N. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023.

[Figure 7: three histogram panels (GPT2-sm, Gemma2-2b, Llama2-7b Model Coherence); x-axis: Coherence, y-axis: Percent; methods: Clean, Logit, Tuned, SAE where available.]

Figure 7. Histogram of coherence scores for clean model outputs (Clean) and for the models where $x$ is replaced by $\hat{x}$ without any intervention for Logit Lens, Tuned Lens, and SAEs. Dashed lines show the mean for each distribution.

A. Appendix

A.1. Additional Evaluations: Sanity Checking Explanation Reconstructions

Before testing these methods for their ability to intervene, we first want to evaluate the completeness of the explanations and the effect of replacing $x$ with $\hat{x}$ without any intervention or editing. We do so by measuring the normalized latent reconstruction error: $\mathrm{Error} = \|\hat{x} - x\| / \|x\|$, where $\hat{x} = g(f(x)) = g(z)$. This error is a key part of the loss function that sparse autoencoders are trained on and measures the information loss incurred by mapping between the language model’s latent space and the interpretable feature space.
Given that steering vectors and linear probes do not output complete explanations, we only measure this error for the other three methods, as shown in Table 2. Errors vary considerably across models, but most methods are relatively consistent in their error, with the exception of the GPT2-small sparse autoencoders.

Table 2. Normalized latent reconstruction error without intervention.

| Method | Gemma2-2b | Llama2-7b | GPT2-small |
|---|---|---|---|
| Logit Lens | 0.52 | 5e-5 | 0.22 |
| Tuned Lens | – | 5e-3 | 0.32 |
| SAEs | 0.38 | – | 1.64 |

A.2. Additional Evaluations: Coherence of Method Outputs without Intervention

We measure the coherence of the outputs produced by replacing $x$ with $\hat{x}$, as shown in Appendix Figure 7, which we can compare to the baseline of the clean model outputs (labelled ‘Clean’ and shown in black). We find that the coherence of the outputs generated by the reconstructed latents generally matches the coherence of the clean model outputs. We use a deviation of ±1 around the mean of clean output coherence scores as a threshold for future evaluations, shown in the dashed lines.

A.3. Additional Evaluation: Coherence of Intervention with respect to Edit Distance

We measure the coherence of the intervened output text produced by each method to ensure that intervention through interpretability methods is possible without damaging the utility of the model. We measure coherence as described in Section 3 as a function of normalized latent edit distance, $\|\hat{x}' - x\| / \|x\|$, in Figure 8. We find that even the smallest interventions made with Logit Lens and Tuned Lens result in significant degradation of model outputs, with a less noticeable dropoff for the other methods. We also plot coherence as a function of the intervention success rate in Figure 3 to characterize the tradeoff between intervention success and output coherence.
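The two normalized distances used in A.1 and A.3 can be sketched numerically as follows. This is a toy illustration under stated assumptions: the norm is taken to be Euclidean, and the encoder `f`, decoder `g`, helper `edit`, and all dimensions are hypothetical stand-ins (a random linear dictionary and its pseudo-inverse), not any of the methods evaluated in the paper.

```python
import numpy as np


def normalized_distance(x_new, x):
    """Return ||x_new - x|| / ||x|| (Euclidean norm assumed)."""
    x_new = np.asarray(x_new, dtype=float)
    x = np.asarray(x, dtype=float)
    norm_x = np.linalg.norm(x)
    if norm_x == 0:
        raise ValueError("original latent has zero norm")
    return float(np.linalg.norm(x_new - x) / norm_x)


# Hypothetical linear encoder/decoder pair standing in for f and g.
rng = np.random.default_rng(0)
d_model, n_features = 16, 8
W_enc = rng.normal(size=(d_model, n_features))
W_dec = np.linalg.pinv(W_enc)  # shape (n_features, d_model)


def f(x):
    """Map a latent x to explanation features z."""
    return x @ W_enc


def g(z):
    """Map explanation features z back to the latent space."""
    return z @ W_dec


def edit(z, i, alpha):
    """Increase feature i by alpha, leaving the input array untouched."""
    z_edited = z.copy()
    z_edited[i] += alpha
    return z_edited


x = rng.normal(size=d_model)
z = f(x)

# A.1: reconstruction error with no intervention, ||g(f(x)) - x|| / ||x||.
reconstruction_error = normalized_distance(g(z), x)

# A.3: edit distance ||g(Edit(f(x))) - x|| / ||x|| for several strengths.
edit_distances = {
    alpha: normalized_distance(g(edit(z, 3, alpha)), x)
    for alpha in (0.0, 1.0, 10.0)
}
print("reconstruction error:", round(reconstruction_error, 3))
for alpha, dist in edit_distances.items():
    print(f"alpha={alpha}: edit distance={dist:.3f}")
```

With zero intervention strength the edit distance equals the reconstruction error, which is why a lossy reconstruction implies a nonzero minimum edit distance.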
[Figure 8: four panels of intervened output coherence versus normalized latent edit distance: GPT2-sm (Layer 9), Gemma2-2b (Layer 20), Llama2-7b (Layer 18), Llama3-8b (Layer 25). Methods: Logit, Tuned, SAE, Steering, Probing where available.]

Figure 8. Analysis of coherence of the intervened outputs, measured with Llama3.1-8b, as a function of the edit distance or magnitude of intervention made. Lens-based methods suffer drastic drops in coherence with only small edits.

A.4. Additional Evaluations: Intervention Efficacy Across Model Depth

In order to ensure the generalizability of the above results across layer depths, we repeat all experiments for each layer of GPT2-small, as shown in Figure 9. However, because some sparse autoencoder features only exist in some layers, we could only consider the intervention topics ‘beauty’, ‘coffee’, and ‘dogs’. We hold the hyperparameter $\alpha$ that controls intervention “strength” constant across all layers. Note that this is NOT equivalent to holding the normalized edit distance constant, as shown in the rightmost plot. We find that layer depth seems to have minimal effect for SAEs and probing, with the exception of the first and last layers. For steering vectors, we observe a modest increase in intervention success rate with increased layer depth and a much more drastic increase in the success rate at later layers for Logit Lens and Tuned Lens.
However, as we increase $\alpha$ significantly, we find that the curves for all three methods on intervention rate shift left until the pass rate is approximately 1 at all layers. Intuitively, this makes sense, as any edits to the residual stream at layer 0 will affect the residual stream at later layers. We note that these results highlight the need to tune the intervention strength for each method, each model, and each layer, limiting their ease of use.

[Figure 9: three panels for GPT2-sm plotted against layer index: layer-wise intervention pass rate, layer-wise intervened output coherence, and layer-wise normalized perturbation distance (log scale). Methods: SAE, Logit, Tuned, Steering, Probing.]

Figure 9. Analysis of intervention pass rate (left), coherence (middle) and edit distance (right) across all layers of GPT2-sm. We find that intervening at later layers is significantly more effective for Logit and Tuned Lens than earlier interventions, but probes, steering vectors, and SAEs are relatively invariant to the choice of layer.

A.5. Additional Metrics: Intervened Token Probability

Please see Section 3.2 for more details. We measure the probability assigned to tokens relating to feature $i$ when intervening on feature $i$. As such, even if a model’s output does not directly reflect interventions made to $z'_i$ due to sampling, we can measure whether increasing the activation of feature $i$ results in any change to the model’s output at all. We refer to this metric as Intervened Token Probability. Results for Intervened Token Probability are shown in Figure 10, where we see that intervention with all methods across all models increases the probability of intervention-related tokens, even if the intervention does not succeed.
We also note that there is a significant difference between the order of magnitude of the intervened token probability for sparse autoencoders, around 10e-5, and that of the rest of the methods, which range from 10e-4 to 0.5.

[Figure 10: three panels titled "Intervened Token Probability" for GPT2-sm (Layer 9), Gemma2-2b (Layer 20), and Llama2-7b (Layer 18). x-axis: Normalized Edit Distance; y-axis: Probability (log scale). Methods: Logit Lens, Tuned Lens, SAE, Probing, Steering.]

Figure 10. Evaluation of intervention success with respect to the probabilities of the tokens corresponding to the features intervened on for each method. Note that normalized edit distance is a proxy for intervention intensity that is comparable across methods.

A.6. Additional Metrics: Perplexity

As described in Section 3.2, we evaluate the perplexity of the intervened generated text to measure the utility of interpretability methods for targeted intervention in Figure 11. We measure this perplexity with respect to a stronger language model than the one studied, in this case Llama3.1-8b. We find that the results for perplexity are generally unintuitive and do not align with the results for coherence. We hypothesize that perplexity is not a useful measure when text is extremely out-of-distribution with respect to normal text, and in particular when the text is highly repetitive. For example, if the same token is repeated 20 times, we (and other language models) might assume that the next 20 tokens would also be the same, resulting in a low perplexity even if the quality of the text is poor. As such, we do not consider these results to be particularly meaningful or significant.
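The repetition effect described above can be made concrete with a small sketch. Perplexity is computed here as the exponential of the mean negative log-probability of the tokens. The per-token probabilities are invented toy values, not outputs of Llama3.1-8b or any other model; they only encode the assumption that an evaluator becomes very confident once a token has started repeating.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Toy, made-up probabilities an evaluator model might assign to each token.
# Repetitive text: the first token is surprising, every repeat is easy to predict.
repetitive_probs = [0.05] + [0.95] * 19

# Varied, coherent text: each token is only moderately predictable.
varied_probs = [0.2] * 20

ppl_repetitive = perplexity(repetitive_probs)
ppl_varied = perplexity(varied_probs)

print(f"repetitive: {ppl_repetitive:.2f}, varied: {ppl_varied:.2f}")
```

Under these assumed probabilities the repetitive sequence scores a lower perplexity than the varied one, even though it would receive a poor coherence rating.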
[Figure 11: three panels titled "Intervened Output Perplexity" for GPT2-sm (Layer 9), Gemma2 (Layer 20), and Llama2-7b (Layer 18). x-axis: Normalized Latent Vector Perturbation Distance; y-axis: Perplexity. Methods shown, where available for each model: Logit, Tuned, SAE, Steering, Probing.]

Figure 11. Analysis of perplexity of the intervened outputs, measured with Llama3.1-8b, as an alternative metric to Coherence. We find that perplexity does not align with Coherence, as highly repetitive sequences may have low perplexity despite being incoherent answers to prompts.

A.7. Additional Example Outputs

We present additional examples of the output text for all intervention methods in Figures 12 and 13 for qualitative evaluation of intervention on the features ‘coffee’ and ‘San Francisco’. Examples for “Optimal Intervention Strength” (left column) were randomly chosen from the outputs where intervention succeeded and coherence was still relatively high. Examples for “Excessive Intervention” (right column) were randomly chosen from the outputs of the highest intervention strength tested. Please see Section 4.3 for more.

| Method | Optimal Intervention Strength | Excessive Intervention |
| --- | --- | --- |
| Logit | In my daydreams, I often find myself in a coffee shop, sipping a cup of coffee and reading a book. I love the smell of coffee and the taste of coffee. (Edit Distance: 0.6, Coherence: 6, Model: Gemma2) | The joys and challenges of traveling alone include coffee shops, museums, coffee shops, coffee shops, coffee shops, coffee shops, coffee shops, coffee shops, coffee shops, coffee shops, coffee (Edit Distance: 0.7, Coherence: 2, Model: Gemma2) |
| Probing | I have travelled far and wide, and, I have seen many different types of coffee. I have had the pleasure of trying some of the best coffee in the world. I have also had the (Edit Distance: 0.8, Coherence: 5, Model: Gemma2) | Whenever I'm outdoors and in nature, I '41111111111111111111111111111 (Edit Distance: 1.1, Coherence: 1, Model: Gemma2) |
| SAE | I plan to write my Ph.D. dissertation on the topic of the relationship between the coffee and the coffee shop. I am interested in the history of the coffee shop, the history of the coffee shop (Edit Distance: 0.9, Coherence: 5, Model: Gemma2) | In my previous life, I'm sure that I was a coffee shop barista. I' coffee is always fresh, the coffee is always fresh, and the coffee is always fresh. I's a great (Edit Distance: 1.2, Coherence: 3, Model: Gemma2) |
| Steering | List some fun facts to share with kids. * The first coffee shop in the world was in 1977 in the city of Seattle, in the state of Washington. (Edit Distance: 0.6, Coherence: 7, Model: Gemma2) | My favorite national park to visit is: the National Park: The (Edit Distance: 1.3, Coherence: 1, Model: Gemma2) |
| Tuned | Check out this haiku I wrote: coffee, my dear friend warming my hands and heart in this cold world (Edit Distance: 0.4, Coherence: 6, Model: Llama2) | The coolest wildlife I've ever spotted was a black coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee coffee (Edit Distance: 0.5, Coherence: 2, Model: Llama2) |

Figure 12. Example outputs with intervention on “coffee” feature.

A.8. Implementation Details: Open-ended Generation

In order to generate open-ended text after intervening on the explanation, we edit the corresponding representations in place, as is common practice with prior steering methods. Formally, the representation $x_t$ at token position $t$ and layer $l$ is edited to
be $\hat{x}'_t$, ensuring a causal effect on all ensuing tokens $x_{t+1}, x_{t+2}, \ldots, x_T$.

A.9. Implementation Details: Intervention Hyperparameter $\alpha$

When intervening on $z$ to get $z'$ with Logit Lens, Tuned Lens, and SAEs, we set $z'_i = \alpha \cdot \max(z)$. For probing and steering vectors, $\hat{x}' = x + \alpha \cdot v$, where $v$ is the steering vector or the weights of the linear probe. Note that $\alpha$ is a hyperparameter that must be tuned for each method and model, and thus cannot be used to compare the effects of interventions across methods. We record the values of $\alpha$ used in our experiments in Table 3.

Table 3. Values for hyperparameter $\alpha$ used to control intervention edit distance for each method and model.

| Method | GPT2-small Layer 9 | Gemma2-2b Layer 20 | Llama2-7b Layer 18 |
| --- | --- | --- | --- |
| Logit Lens | [50, 70, 90, 110, 130] | [100, 130, 160, 200, 230] | [0.5, 3, 7, 11, 15, 19] |
| Tuned Lens | [20, 25, 30, 35, 40] | – | [1, 7, 11, 15, 19, 23] |
| SAEs | [3, 4, 5, 6] | [1, 2, 3, 4, 5] | – |
| Probing | [150, 200, 250, 300, 350] | [200, 250, 300, 350] | [10, 90, 110, 130, 150] |
| Steering Vectors | [2, 4, 6, 8, 10] | [3, 4, 5, 6] | [0.5, 3, 4, 5, 6] |

A.10. Implementation Details: SAE Features

As described in Section 4.1, we use the sparse autoencoders hosted on SAELens and find the relevant features with Neuronpedia’s exploration and search tools. We document all of the features we consider for each intervention topic in Table 4. Note that for some specified intervention topics, an exact feature match does not exist for the GPT or Gemma SAEs. As such, we either exclude that topic or consider the closest-related topic (such as “instructions related to yoga poses and their benefits” when what we would like is “references to yoga”). Many of these imperfect features still yield reasonable intervention success rates.

| Method | Optimal Intervention Strength | Excessive Intervention |
| --- | --- | --- |
| Logit | One dramatic impact of climate change on wildlife will be the spread of disease. The San Francisco Zoo is taking steps to prepare for the arrival of the first cases of disease in the San Francisco Bay Area. (Edit Distance: 0.6, Coherence: 8, Model: Gemma2) | A book on art history that I found fascinating is Francisco Goya, Francisco Goya Francisco Goya Francisco Goya Francisco Goya Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco (Edit Distance: 0.7, Coherence: 2, Model: Gemma2) |
| Probing | Next month, I plan to travel to the United States to attend the 2016 International Conference on the History of the Book. The conference will be held in San Francisco, California (Edit Distance: 0.7, Coherence: 8, Model: Gemma2) | A book on art history that I found fascinating is The Modern Art Show: The 1932/33 Art/Alfar/Alfar/Alfar/Alfar/ (Edit Distance: 1.1, Coherence: 1, Model: Gemma2) |
| SAE | During my last work trip, I was able to visit the San Francisco Museum of Modern, and I was able to see the new exhibit, “The Last Resistance.” The exhibit is a (Edit Distance: 0.9, Coherence: 8, Model: Gemma2) | Some examples of eco-friendly destinations and tips for minimizing your travel footprint are San Marin Island, Tib Travel, The Golden Emb Emb Emb (Edit Distance: 1.2, Coherence: 1, Model: Gemma2) |
| Steering | My favorite childhood memory is of my parents taking me to the San Francisco Exploratorium. I remember being amazed by the Exploratorium's Exploratory Playroom, which was a (Edit Distance: 0.6, Coherence: 8, Model: Gemma2) | In my daydreams, I like to imagine that I'm a 19th-sf Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco (Edit Distance: 1.3, Coherence: 2, Model: Gemma2) |
| Tuned | Check out this haiku I wrote: San Francisco's hills Steep and winding, a challenge To walkers, bikers too (Edit Distance: 0.4, Coherence: 7, Model: Gemma2) | My favorite song from the 21st century is "Ho Hey" by San Francisco-based indie rock band The San Francisco Francisco Francisco Francisco Francisco Francisco Francisco Francisco (Edit Distance: 0.5, Coherence: 3, Model: Llama2) |

Figure 13. Example outputs with intervention on “San Francisco” feature.

Table 4. Specific SAE features used for intervention on GPT2-sm and Gemma2-2b. The feature ids and their corresponding Neuronpedia labels are provided.
| Intervention Feature | GPT2-small Layer 9 Feature | GPT2-small SAE Layer 9 Feature Label | Gemma2-2b Layer 20 Feature | Gemma2-2b SAE Layer 20 Feature Label |
| --- | --- | --- | --- | --- |
| San Francisco | 11233 | “mentions of the city of San Francisco” | 3124 | “references to San Francisco and related locations” |
| New York | 5831 | “references to the city of New York” | 3761 | “specific place names and geographical locations in New York” |
| beauty | 1805 | “words related to beauty or aesthetic appreciation” | 485 | “instances of the word “beauty” in various contexts” |
| football | – | – | 11252 | “references to football and baseball contexts” |
| pink | 2415 | “mentions of the word “Pink.”” | 13703 | “references to the color pink and its various associations” |
| dogs | 12435 | “mentions of dogs or dog-related terms” | 12082 | “references to dog behavior and interactions” |
| yoga | – | – | 6310 | “instructions related to yoga poses and their benefits” |
| chess | 21685 | “mentions of the game of chess” | 13419 | “elements within the context of chess” |
| snow | 5053 | “references to snow-related terms” | 13267 | “references to snow and related terms” |
| coffee | 23472 | “references to coffee-related words” | 15907 | “references to coffee and related cafés or establishments” |
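For readers who want to experiment with the feature ids above, the two intervention rules stated in Appendix A.9 can be sketched as follows. This is a minimal illustration on plain Python lists, not the authors' implementation: the function names are hypothetical, leaving the other coordinates of $z$ unchanged is an assumption of this sketch, and the decoding of $z'$ back into the residual stream is omitted.

```python
def intervene_on_latent(z, i, alpha):
    """Rule for Logit Lens, Tuned Lens, and SAEs: z'_i = alpha * max(z).

    Assumption of this sketch: all other coordinates of z are left unchanged.
    """
    z_new = list(z)
    z_new[i] = alpha * max(z)
    return z_new


def intervene_with_direction(x, v, alpha):
    """Rule for probing and steering vectors: x_hat' = x + alpha * v."""
    if len(x) != len(v):
        raise ValueError("x and v must have the same dimension")
    return [xi + alpha * vi for xi, vi in zip(x, v)]


# Toy latent explanation with three features; feature 0 is the one to amplify.
z = [0.1, 2.0, 0.5]
z_prime = intervene_on_latent(z, i=0, alpha=3.0)

# Toy residual-stream vector and a hypothetical steering direction.
x = [1.0, 2.0]
v = [0.5, -1.0]
x_hat_prime = intervene_with_direction(x, v, alpha=2.0)

print(z_prime, x_hat_prime)
```

Because the first rule scales with $\max(z)$ while the second scales with the norm of $v$, the same value of $\alpha$ produces edits of very different magnitude, which is consistent with the per-method ranges listed in Table 3.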