
Paper deep dive

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

Year: 2025 · Venue: NeurIPS 2025 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 79

Models: CLIP-ViT L/14-336px, LLaVA

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 7:39:01 PM

Summary

The paper introduces a framework for evaluating the monosemanticity of neurons in Vision-Language Models (VLMs) using Sparse Autoencoders (SAEs). It proposes a 'MonoSemanticity score' (MS) based on activation-weighted pairwise image similarity, validated through a large-scale user study. The authors demonstrate that SAEs significantly improve neuron monosemanticity and enable effective steering of multimodal LLMs (e.g., LLaVA) by intervening on specific SAE neurons without modifying the underlying model.

Entities (6)

CLIP · model · 100%
LLaVA · model · 100%
MonoSemanticity score · metric · 100%
Sparse Autoencoders · method · 100%
Vision Language Models · model-architecture · 100%
Matryoshka SAE · method · 95%

Relation Signals (3)

MonoSemanticity score evaluates Sparse Autoencoders

confidence 95% · We propose the MonoSemanticity score (MS) for SAEs in vision tasks

Sparse Autoencoders improves Monosemanticity

confidence 95% · experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons

Sparse Autoencoders steers LLaVA

confidence 95% · applying SAE interventions on CLIP’s vision encoder directly steers multimodal LLM outputs (e.g., LLaVA)

Cypher Suggestions (2)

Find all models that utilize Sparse Autoencoders for interpretability. · confidence 90% · unvalidated

MATCH (m:Model)-[:UTILIZES]->(s:Method {name: 'Sparse Autoencoders'}) RETURN m.name

Identify the relationship between metrics and methods. · confidence 90% · unvalidated

MATCH (m:Metric)-[:EVALUATES]->(s:Method) RETURN m.name, s.name

Abstract

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in visual representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Further, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying language model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code and benchmark data are available at https://github.com/ExplainableML/sae-for-vlm.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · interpretability (suggested, 80%) · mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

78,601 characters extracted from source content.


Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models. Mateusz Pach 1,2,3,4, Shyamgopal Karthik 1,2,3,4,5, Quentin Bouniot 1,2,3,4, Serge Belongie 6, Zeynep Akata 1,2,3,4. 1 Technical University of Munich, 2 Helmholtz Munich, 3 Munich Center for Machine Learning, 4 Munich Data Science Institute, 5 University of Tübingen, 6 University of Copenhagen. mateusz.pach@tum.de

Abstract

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in visual representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Further, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying language model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code and benchmark data are available at https://github.com/ExplainableML/sae-for-vlm.

1 Introduction

In recent years, Vision-Language Models (VLMs) like CLIP [47] and SigLIP [61] have gained widespread adoption, owing to their capacity for simultaneous reasoning over visual and textual modalities. They have found a surge of applications in various domains, such as audio [14,57] and medicine [62], transferring to new tasks with minimal supervision.
Yet our current understanding of VLM internals remains limited, necessitating methods that can systematically probe their representations. Sparse Autoencoders (SAEs) [37] are an effective approach to probing the internal representations of such models. They efficiently discover concepts (abstract features shared between data points) through a simple architecture learned as a post-hoc reconstruction task. Although analysis with SAEs is popular for Large Language Models (LLMs) [7,23,48], for VLMs it has been limited to interpretable classification [35,49] or the discovery of concepts shared across models [53]. Intuitively, SAEs reconstruct activations via a higher-dimensional space to disentangle distinct concepts from their overlapping representations in neural activations [7]. Neurons at different layers within deep neural networks are known to be naturally polysemantic [41], meaning that they can be strongly activated for multiple unrelated concepts such as cellphones and rulers. One common explanation for this behavior is the superposition hypothesis [3,17], stating that concepts are encoded as linear combinations of neurons. SAEs explicitly attempt to solve this issue by separating the entangled concepts into distinct representations. Despite their widespread use in research, the absence of a metric to evaluate SAEs at the neuron level still hinders their practicality as an interpretation tool. 39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2504.02821v3 [cs.CV] 27 Nov 2025

Figure 1: Sparse Autoencoder (SAE) in VLM (e.g.
CLIP): Top activating images of a neuron in a pretrained VLM layer are polysemantic (left), and those of a neuron in a sparse latent of an SAE trained to reconstruct the same layer are monosemantic (right), according to the MonoSemanticity score (MS). Discovering neurons encoding human-interpretable concepts requires analyzing them individually, which is all the more tedious as layers of both the model and the SAE are wide. In this work, we quantitatively evaluate SAEs for VLMs through monosemanticity, defined as the similarity between inputs that strongly activate a neuron. We propose the MonoSemanticity score (MS) for vision tasks, which measures the pairwise similarity of images weighted by the activations for a given neuron. Unlike natural language, where individual words require surrounding context (such as complete sentences) to clearly identify their meanings, individual images can directly activate neurons without additional context. We define highly activating images as images that strongly fire a particular neuron. Greater similarity among these images suggests that the neuron is more narrowly focused on a single concept, reflecting higher neuron monosemanticity. Using our MS score, we observe and validate that neurons in the SAE are significantly more monosemantic (see Figure 1, right) than the original neurons (see Figure 1, left). A neuron of the original VLM typically has a low MS score: it fires for a wide range of objects, from cellphones to rulers. On the other hand, neurons within the SAE are more focused on a single concept, e.g. parrots, obtaining a higher score. This holds even for an SAE with the same width as the original layer, implying that the sparse reconstruction objective inherently improves the separability of concepts. We further conduct a large-scale study to quantitatively assess the alignment of our proposed MS score with human interpretation of monosemanticity.
The results confirm that the difference between the scores of two neurons strongly correlates with humans perceiving the higher-scoring neuron as more focused on a single concept. Finally, we illustrate the applicability of the monosemanticity of vision SAEs by transferring a CLIP-based SAE onto Multimodal LLMs (MLLMs), e.g. LLaVA [36]. Intervening on a single monosemantic SAE neuron in the vision encoder while keeping the LLM untouched allows steering the overall MLLM generated output to either insert or suppress the concept encoded in the selected SAE neuron. We summarize our contributions as follows:

• We propose the MonoSemanticity score (MS) for SAEs in vision tasks, which computes neuron-wise activation-weighted pairwise similarity of image embeddings. To validate our MS score against human judgment, we conduct a large-scale user study, the results of which can also serve as a benchmark for future research.
• We quantitatively compare MS between SAEs and across their neurons. We find that Matryoshka SAE [9,39] achieves overall superior MS, and that wider and sparser latents lead to better scores.
• We leverage the good separability of concepts in SAE layers to intervene on neuron activations and steer outputs of MLLMs to insert or suppress any discovered concept.

2 Related Work

Sparse Autoencoders. Recent studies have repurposed traditional dictionary learning to enhance LLM and VLM interpretability [6,46]. Specifically, there has been success in interpreting and steering LLMs with features learned by SAEs [16,52]. Several enhancements to SAE mechanisms have been introduced, including new activation functions such as Batch TopK [8] or JumpReLU [48], Figure 2: Computation of our MonoSemanticity score (MS).
(a) Embeddings and activations are extracted for a set of images (b) to compute the pairwise embedding similarities and pairwise neuron activations. (c) MS is the average of embedding similarities weighted by the neuron activations. and insights from Matryoshka representation learning [34] for SAEs [9,39]. We analyze and evaluate neuron-level monosemanticity of SAEs in VLMs and their downstream uses. Vision-Language Models. Since Contrastive Language–Image Pretraining (CLIP) [38,47,61], many models have emerged that align images and text in a shared embedding space [32,56] or generate text conditioned on image inputs [12,19,36]. They have achieved strong results on benchmarks [59] and found many use-cases [5,24,26,28]. As trust in these models has become a concern [4,27,60], understanding decision-making and ensuring safety, for example through steering, is increasingly important [2,29]. Consequently, prior work has examined their internal representations [21,22,40,45], uncovering interpretable neurons [15]. We demonstrate that SAEs enable more effective interpretation and control than directly operating on raw features. SAEs for VLMs. Building on the success of SAEs in interpreting LLMs, researchers have tried applying them to vision and vision-language models, typically on CLIP [20,35,49,50] and other vision encoders (e.g. DINOv2 [44]). There has also been interest in interpreting the denoising diffusion models [25] using SAEs [11,30,31,51], discovering common concepts across different vision encoders [53], and applying them on multimodal LLMs [63]. Concurrently, monosemanticity of multimodal models has been investigated [58], although focusing on inter-modality differences. Our MS score provides a grounded, quantitative measure of the monosemanticity of individual neurons which is empirically aligned with human perception. Overall, we create a rigorous evaluation framework for SAEs in VLMs, as well as demonstrating their use for steering multimodal LLMs. 
3 Sparse Autoencoders for VLMs

3.1 Background and Formulation of SAEs

SAEs implement a form of sparse dictionary learning, where the goal is to learn a sparse decomposition of a signal into an overcomplete dictionary of atoms [42]. More specifically, an SAE consists of linear layers W_enc ∈ R^{d×ω} and W_dec ∈ R^{ω×d} as encoder and decoder, with a non-linear activation function σ : R^ω → R^ω. Both layers share a bias term b ∈ R^d, subtracted from the encoder's input and later added to the decoder's output. The width ω of the latent SAE layer is chosen as a factor of the original dimension, such that ω := d × ε, where ε is called the expansion factor. In general, SAEs are applied on embeddings v ∈ R^d of a given layer l of the model to explain f : X → Y, such that f_l : X → R^d represents the composition of the first l layers and X ⊂ R^{d_i} is the space of input images. A given input image x ∈ X is first transformed into the corresponding embedding vector v := f_l(x), before being decomposed by the SAE into a vector of activations φ(v) ∈ R^ω; its reconstruction vector v̂ ∈ R^d is obtained by:

φ(v) := σ(W_enc^⊤ (v − b)),  ψ(v) := W_dec^⊤ v + b,  v̂ := ψ(φ(v)).  (1)

The linear layers W_enc and W_dec composing the SAE are learned through a reconstruction objective R and a sparsity regularization S, to minimize the following loss:

L(v) := R(v) + λ S(v),  (2)

where λ is a hyperparameter governing the overall sparsity of the decomposition. The simplest instantiation [7,49] uses a ReLU activation, an L2 reconstruction objective, and an L1 sparsity penalty, such that

σ(·) := ReLU(·),  R(v) := ‖v − v̂‖₂²,  S(v) := ‖φ(v)‖₁.  (3)

The (Batch) TopK SAEs [8,23,37] use a TopK activation function governing the sparsity directly through K.
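As a concrete illustration of Eqs. (1)-(3), the ReLU instantiation can be sketched in a few lines of NumPy. This is a minimal sketch with random, untrained weights, not the authors' released implementation; the names encode, decode, and loss are chosen here for clarity:

```python
import numpy as np

# Minimal ReLU SAE sketch (Eqs. 1-3): encoder phi(v) = ReLU(W_enc^T (v - b)),
# decoder psi(a) = W_dec^T a + b, loss = ||v - v_hat||_2^2 + lambda * ||phi(v)||_1.
d, omega = 8, 32                      # input dim d and SAE width omega = d * epsilon
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d, omega))   # learned in practice; random here
W_dec = rng.normal(size=(omega, d))
b = np.zeros(d)                       # shared bias term

def encode(v):
    return np.maximum(W_enc.T @ (v - b), 0.0)   # phi(v): sparse activations in R^omega

def decode(a):
    return W_dec.T @ a + b                      # psi(a): reconstruction in R^d

def loss(v, lam=1e-3):
    a = encode(v)
    v_hat = decode(a)
    return np.sum((v - v_hat) ** 2) + lam * np.sum(np.abs(a))   # R(v) + lambda * S(v)
```

A TopK variant would replace the ReLU with a function keeping only the K largest pre-activations; Matryoshka SAEs change only the reconstruction term, summing it over nested prefixes of the activation vector.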
Finally, Matryoshka SAEs [9,39] group neuron activations φ_i(v) into different levels of sizes M, to obtain a nested dictionary trained with multiple reconstruction objectives:

R(v) := Σ_{m∈M} ‖v − W_dec^⊤ φ_{1:m}(v)‖²,  (4)

where φ_{1:m} corresponds to keeping only the first m neuron activations and setting the others to zero. It is important to note that Matryoshka SAEs can be combined with any SAE variant, e.g. with BatchTopK [9] or ReLU [39], as only the reconstruction objective is modified.

3.2 Monosemanticity Score

A neuron's interpretability increases as its representation becomes disentangled into a single, clear concept. Therefore, quantifying the monosemanticity of individual neurons helps identify the most interpretable ones, while aggregating these scores across an entire layer allows assessing the overall semantic clarity and quality of the representations learned by the SAE. We propose measuring monosemanticity by computing pairwise similarities between images that strongly activate a given neuron, where high similarity indicates these images likely represent the same concept. These similarities can be efficiently approximated using deep embeddings from a pretrained image encoder E. Since selecting a fixed number of top-activating images is challenging due to varying levels of specialization across neurons, we instead evaluate monosemanticity over a large, diverse set of unseen images, weighting each image by its activation strength for the neuron. We formally describe our proposed MonoSemanticity score (MS) below, with an illustration given in Figure 2. This metric can be computed for each of the ω neurons extracted from the SAE. Given a diverse set of images I = {x_n ∈ X}_{n=1}^N and a pretrained image encoder E, we first extract embeddings to obtain a pairwise similarity matrix S = [s_nm]_{n,m} ∈ [−1, 1]^{N×N}, which captures the semantic similarity between each pair of images.
The similarity s_nm of the pair (x_n, x_m) is computed as the cosine similarity between the corresponding pair of embedding vectors:

s_nm := (E(x_n) · E(x_m)) / (|E(x_n)| |E(x_m)|).  (5)

We then collect activation vectors {a^k = [a^k_n]_n ∈ R^N}_{k=1}^ω across all ω neurons, for all images in the dataset I. Specifically, for each image x_n, the activation of the k-th neuron is:

v_n := f_l(x_n),  a^k_n := φ_k(v_n),  (6)

where l represents the layer at which the SAE is applied, f_l is the composition of the first l layers of the explained model, and φ_k is the k-th neuron of φ(v_n) (or of v_n when evaluating neurons of the original layer l of f). To ensure a consistent activation scale, we apply min-max normalization to each a^k, yielding ã^k := [ã^k_n]_n ∈ [0, 1]^N, where

ã^k_n := (a^k_n − min_{n′} a^k_{n′}) / (max_{n′} a^k_{n′} − min_{n′} a^k_{n′}).  (7)

Using these normalized activations, we compute a relevance matrix R^k = [r^k_nm]_{n,m} ∈ [0, 1]^{N×N} for each one of the ω neurons, which quantifies the shared neuron activation of each image pair:

r^k_nm := ã^k_n ã^k_m.  (8)

Finally, our proposed score MS^k ∈ [−1, 1] for the k-th neuron is computed as the average pairwise similarity weighted by the relevance, without considering same-image pairs (x_n, x_n):

MS^k := (Σ_{1≤n<m≤N} r^k_nm s_nm) / (Σ_{1≤n<m≤N} r^k_nm).  (9)

Figure 3: Top activating images of neurons with MonoSemanticity (MS) scores ranging from high (left) to low (right). Higher scores correlate with more similar images, reflecting monosemanticity.
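Putting Eqs. (5)-(9) together, the score for one neuron can be sketched in NumPy as follows. This is an illustrative sketch, assuming precomputed image embeddings E (one row per image) and the neuron's raw activations a; the function name and the eps guard are additions of this sketch:

```python
import numpy as np

# Sketch of the MonoSemanticity score (Eqs. 5-9) for a single neuron.
# E: (N, p) matrix of image embeddings; a: (N,) raw activations of the neuron.
def monosemanticity_score(E, a, eps=1e-12):
    # Pairwise cosine similarities s_nm between image embeddings (Eq. 5).
    En = E / (np.linalg.norm(E, axis=1, keepdims=True) + eps)
    S = En @ En.T
    # Min-max normalize activations to [0, 1] (Eq. 7).
    a_t = (a - a.min()) / (a.max() - a.min() + eps)
    # Relevance r_nm = a_n * a_m (Eq. 8), taken over pairs n < m only.
    R = np.outer(a_t, a_t)
    iu = np.triu_indices(len(a), k=1)
    # Activation-weighted average similarity (Eq. 9).
    return (R[iu] * S[iu]).sum() / (R[iu].sum() + eps)
```

If every strongly activating image has the same embedding, the score is 1; mixing unrelated images pulls it down toward the average cross-concept similarity.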
3.3 Steering MLLMs with Vision SAEs

Finding monosemantic neurons is not only useful for interpretability. SAEs let us induce controllable semantic biases in the responses of MLLMs without modifying the underlying model parameters, and without touching any textual component. In other words, we can steer the model into seeing, or not seeing, targeted concepts in the image. Our MS score becomes a strong tool for selecting the most monosemantic neurons for precise and efficient steering. We first describe LLaVA [36] as an example MLLM architecture. The LLaVA model g : X × T → T expects a pair of image and text (x, t) and outputs a text answer o, where T ⊂ R^{d_t} is the word embedding space. Internally, it converts the image x into t_x token embeddings {v_i}_{i=1}^{t_x}, obtained from the vision encoder f_l : X → R^{d×t_x} composed of the first l layers of CLIP [47]. These embeddings are then projected into visual tokens H_x ∈ R^{d_t×t_x} in the word embedding space, and are finally fed along with the tokenized text H_t ∈ R^{d_t×t_t} into the pretrained LLM (e.g. LLaMA [54] or Vicuna [10]) to obtain the output text o. We modify this architecture by injecting a pretrained SAE (φ, ψ) of width ω at the token level after the vision encoder f_l. For all token embeddings v_i ∈ R^d, i ∈ 1, ..., t_x, we first extract the SAE decomposition into activations a_i := φ(v_i) ∈ R^ω across all neurons. After identifying the neuron k ∈ 1, ..., ω representing the targeted concept, to steer the overall model g towards this concept, we manipulate the SAE activations of all token embeddings for the neuron k to obtain {â_i ∈ R^ω}_{i=1}^{t_x}:

â^j_i := α if j = k, and â^j_i := a^j_i otherwise, for all j ∈ 1, ..., ω,  (10)

where α ∈ R is the intervention value we want to apply to the activation of neuron k. Finally, we decode the manipulated activation vector for each token â_i back into a manipulated token embedding v̂_i = ψ(â_i) ∈ R^d with the SAE decoder. Token embeddings are then processed as usual to generate the steered LLaVA response.
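The intervention of Eq. (10) amounts to clamping one coordinate of every token's SAE activation vector before decoding. A minimal sketch (the helper name steer and the toy shapes are this sketch's choices, not the released code):

```python
import numpy as np

# Eq. (10): set neuron k of every token's SAE activations to the intervention
# value alpha, leaving all other neurons untouched.
def steer(acts, k, alpha):
    # acts: (t_x, omega) matrix whose rows are a_i = phi(v_i) for the image tokens
    steered = acts.copy()
    steered[:, k] = alpha     # a_hat_i^j = alpha if j == k, else a_i^j
    return steered
```

Each steered row is then decoded back with the SAE decoder ψ, and the resulting token embeddings are fed to the LLM unchanged; a positive α inserts the neuron's concept, a negative α suppresses it.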
We include an illustration of the overall process in the Appendix.

4 Experiments

4.1 Experimental Settings

We apply SAEs to explain fixed and pretrained CLIP ViT-L/14-336px [47], SigLIP SoViT-400m/14-384px [61], AIMv2 L/14-224px [19], and WebSSL MAE-300m/14-224px [18]. The SAEs are trained on activation vectors pre-extracted from the model's responses to ImageNet [13] images. For CLIP, activation vectors are extracted from the classification (CLS) tokens in the residual stream after layers l ∈ {11, 17, 22, 23}, or from the output of the final projection layer. For steering experiments, however, the SAEs are trained on activation vectors corresponding to two random token embeddings per image, taken from layer l = 22. For other encoders, we similarly use the CLS tokens from the final layers, or two random token embeddings if a CLS token is not available. In the following sections, we are interested in both the BatchTopK [8] and Matryoshka BatchTopK [9] SAE variants. If not stated otherwise, we set the groups of Matryoshka SAEs as M = {0.0625ω, 0.1875ω, 0.4375ω, ω}, which roughly corresponds to doubling the number of neurons added with each level. For the BatchTopK activation, we fix the maximum number of non-zero latent neurons to K = 20. Both SAE types are compared across a wide range of expansion factors ε ∈ {1, 2, 4, 8, 16, 64}. All SAEs are optimized for 10^5 steps with minibatches of size 4096 using the Adam optimizer [33], with the learning rate initialized at 16/(125√ω) following previous work [23]. To measure SAE performance, we use R² for reconstruction quality and the L0 norm for the activation sparsity of φ(v). Throughout the paper, we quantify the MS of neurons using DINOv2 ViT-B [44] as the image encoder E, and present more analysis with different encoders in the Appendix. Experiments are run on a single NVIDIA A100 GPU.
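For reference, the learning-rate initialization quoted above, lr = 16/(125√ω), can be computed as:

```python
import math

# SAE learning-rate initialization from the settings above: 16 / (125 * sqrt(omega)).
def init_lr(omega):
    return 16 / (125 * math.sqrt(omega))
```

For example, a width of ω = 1024 gives an initial learning rate of 16/4000 = 0.004.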
4.2 Evaluating Interpretability of VLM Neurons

4.2.1 Alignment of MS with human perception

Figure 4: Alignment Rate (AR, %) of humans with the MS score when judging which neuron in a pair is more monosemantic, grouped by MS difference between the neurons. Bars show AR per interval; dots show cumulative AR up to that interval.

We first illustrate the correlation between the MS score and the underlying monosemanticity of neurons, with examples in Figure 3 of the 16 highest activating images for neurons with decreasing MS from left to right. We observe that the highest scoring neurons (with MS = 0.9, on the far left) fire for images representing the same object, i.e. close-up pictures of a koala (top) and hockey (bottom). As the score decreases, the corresponding neurons fire for less similar or even completely different objects or scenes. To verify this observation in a more quantitative way, we conducted a large-scale user study on the Mechanical Turk platform. This study resulted in a total of 1000 questions across 71 unique users, with 3 answers per question aggregated through majority voting. Results of this study are presented in Figure 4, and we provide more details on the setup in the Appendix. When asked to select the set of more monosemantic images from a pair of sets (a, b), the users answered in accordance with the MS in 82.8% of the cases, assuming δ = |MS(a) − MS(b)| ∼ U(0, 0.9). This alignment rate rises monotonically from 56.6% for δ ∈ (0.0, 0.1) to 100.0% for δ ∈ (0.8, 0.9), highlighting that users especially agree with MS as the difference in monosemanticity between the two sets becomes more pronounced. This demonstrates that MS can be used as a reliable measure aligning well with human perception of similarity. Detailed results of the user study are released as a benchmark for future work on image similarity and monosemanticity.
4.2.2 Monosemanticity of SAEs

In Table 1a, we report the MS of the highest scoring neurons of two SAE types (BatchTopK [8] and Matryoshka BatchTopK [9]) trained at different layers with various expansion factors ε. We also include results for the original neurons of the corresponding layer decomposed by the SAEs ("No SAE"). We observe that SAE neurons consistently have significantly higher MS for their best neuron when compared to the original ones, implying that SAEs better separate and disentangle concepts between their neurons. Interestingly, while the highest MS score increases with higher expansion factor ε, i.e. with increased width ω of the SAE layer, this already holds for expansion factor ε = 1, meaning that the disentanglement of concepts is also linked to the sparse dictionary learning and not only to the increased dimensionality. Finally, comparing SAE variants, we observe that while the Matryoshka reconstruction objective improves the concept separation at the same expansion factor, it also achieves about 2 or 3 points lower R² for the same expansion factors (more details in the Appendix).

Table 1: Sparse Autoencoders (SAEs) decompose "No SAE" neurons into more monosemantic units, as shown by the MonoSemanticity (MS) score. Higher SAE expansion factors yield higher MS scores.

(a) Highest MS scores of neurons in various CLIP ViT-Large [47] layers.

SAE type           Layer  No SAE  ×1    ×2    ×4    ×8    ×16   ×64
BatchTopK [8]      11     0.01    0.61  0.73  0.71  0.87  0.90  1.00
                   17     0.01    0.65  0.79  0.86  0.86  0.93  1.00
                   22     0.01    0.66  0.79  0.80  0.88  0.92  1.00
                   23     0.01    0.73  0.72  0.83  0.89  0.93  1.00
                   last   0.01    0.57  0.78  0.78  0.81  0.85  1.00
Matryoshka [9,39]  11     0.01    0.84  0.90  0.95  1.00  0.89  1.00
                   17     0.01    0.86  0.84  0.93  0.94  0.96  1.00
                   22     0.01    0.83  0.83  0.87  0.94  1.00  1.00
                   23     0.01    0.82  0.84  0.89  0.93  0.96  1.00
                   last   0.01    0.82  0.91  0.89  0.93  0.91  1.00

(b) Top MS scores of neurons from the last layers of different vision encoders. Improvements in MS score from applying Matryoshka SAEs are consistent across all the models.
Vision Encoder  No SAE  ×1    ×4
WebSSL [18]     0.01    0.79  0.92
CLIP [47]       0.01    0.82  0.89
SigLIP [61]     0.01    0.83  0.88
AIMv2 [19]      0.01    0.59  0.85

Figure 5: MonoSemanticity scores in decreasing order across neurons, normalized by width. Results are shown for the last layer of the model, without SAE ("No SAE", black dashed line) and with SAE (solid lines), using either (a) different expansion factors (ε = 1, 2, 4, 16, 64) or (b) different sparsity levels (K = 1, 10, 20, 50).

Table 1b presents the highest observed MS scores among neurons in the last layers of various image encoders, both before ("No SAE") and after attaching Matryoshka SAEs. We find that the SAE latent neurons outperform the original neurons in every case. As before, increasing the expansion factor ε helps discover more monosemantic units. This suggests the universality of the SAE approach across vision representations derived from different training objectives. To analyze monosemanticity across all neurons, we plot in Figure 5a the scores of both the original neurons and the Matryoshka SAE neurons from the last layer of model f. We order neurons by decreasing scores and normalize neuron indices to the [0, 1] interval to better compare SAEs of different widths. These results confirm our analysis above and demonstrate that the vast majority of neurons within SAEs have improved MS compared to the original neurons. Even when comparing with ε = 1, i.e. with the same width between the SAE and the original layer, we can see that about 90% of the neurons within the SAE have better scores than the original neurons, proving once again that the sparse decomposition objective inherently induces a better separation of concepts between neurons.
Furthermore, MS scores increase overall with the expansion factor until a certain point (ε = 4), after which they decrease overall, reaching even lower values than ε = 1. Although the relative fraction of neurons at different values of MS is decreasing for very wide latents, the absolute number is still increasing. We refer the reader to the Appendix for MS with raw (unnormalized) neuron indices. The relationship between the sparsity level K used when training Matryoshka SAEs and the scores of the learned neurons is illustrated in Figure 5b. We observe that a stricter sparsity constraint decomposes the representation into more monosemantic features overall. However, this does not

Table 2: Percentage of generations meeting evaluation criteria for concept insertion and suppression. SAE-derived steering directions yield higher success rates than Difference-in-Means (DiffMean) [2].

(a) Concept insertion
                           Ours   DiffMean
Desired concept appeared   48.7   53.1
Base prompt followed       85.8   66.2
Both criteria satisfied    42.4   35.8

(b) Concept suppression
                           Ours   DiffMean
Desired concept removed    64.4   64.0
Unrelated concept kept     81.4   38.7
Both criteria satisfied    52.5   33.3
Figure 6: Steered LLaVA outputs after clamping the activation values of a chosen neuron, i.e.
Neuron #39 of the SAE in CLIP corresponding to pencil. Initially, the poem follows the given instruction (the prompt and a white image), but as the intervention weight α increases, it becomes increasingly influenced by the neuron's concept, first mentioning the pencil's attributes, then the pencil itself. This shows that our interventions enable new capabilities for the unsupervised steering of these models.

imply that the highest sparsity (K = 1) is always the best choice, as improvements in MS come at the cost of reduced reconstruction quality. In the same setup, the R² varies from 31.3% at the lowest K = 1 to 74.9% at the highest K = 50. To balance interpretability and reconstruction quality, we set K = 20, for which the R² remains at a reasonable 66.8%. Detailed results are in the Appendix.

4.3 Steering Multimodal LLMs

We train the Matryoshka BatchTopK SAE [9] with expansion factor ε = 64 on random token embeddings from layer l = 22 of the CLIP vision encoder, obtained for the ImageNet training data. The trained SAE is plugged in after the vision encoder of LLaVA-1.5-7b [36] (which uses the Vicuna [10] LLM).

Quantitative Results. We first compare the performance of SAE-based steering for VLMs against Difference-in-Means (DiffMean) [2], a popular approach based on activation steering. For each of 100 SAE neurons in LLaVA, we identify its top-activating image grid. To perform concept insertion, we boost the neuron's activation and prompt LLaVA with 10 diverse text queries, such as "Propose a math word problem" or "Invent a new holiday," then evaluate whether the output contains the concept and still responds to the prompt. For concept suppression, we apply a negative intervention, ask LLaVA to describe the images, and check whether the concept is removed and unrelated images are still described correctly. The evaluation is done with an LLM-as-a-judge setup using GPT-4.1-mini [43]. We provide the full details of the prompts and the evaluation procedure in Appendix B.
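Concretely, the intervention amounts to encoding a token embedding with the SAE, overwriting one latent with a fixed value α, and decoding. Below is a minimal NumPy sketch with toy, randomly initialized weights standing in for a trained BatchTopK SAE; all sizes, weights, and helper names here are illustrative stand-ins, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_latents, k = 8, 64, 4  # toy sizes; the paper uses expansion factor 64 and K = 20

# Hypothetical SAE weights (random stand-ins for trained parameters).
W_enc = rng.standard_normal((d, n_latents)) / np.sqrt(d)
W_dec = rng.standard_normal((n_latents, d)) / np.sqrt(n_latents)
b_enc = np.zeros(n_latents)

def encode_topk(v):
    """ReLU encode, then keep only the K largest activations (TopK-style sparsity)."""
    a = np.maximum(v @ W_enc + b_enc, 0.0)
    a[np.argsort(a)[:-k]] = 0.0  # zero out everything except the top-K latents
    return a

def steer(v, neuron, alpha):
    """Clamp one SAE neuron to a fixed value alpha before decoding:
    concept insertion for alpha > 0, suppression for alpha < 0."""
    a = encode_topk(v)
    a[neuron] = alpha
    return a @ W_dec  # steered token embedding, passed on to the LLM

v = rng.standard_normal(d)            # one vision-encoder token embedding
v_hat = steer(v, neuron=39, alpha=50.0)
print(v_hat.shape)
```

In the paper's setup this operation is applied to CLIP token embeddings before they reach LLaVA's projection layer, so the underlying model itself is never modified.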
The results in Table 2 demonstrate that SAE-based directions outperform DiffMean in both concept insertion and suppression. For insertion, SAE effectively introduces the intended concept (48.7% vs. 53.1%) while maintaining much stronger adherence to the base prompt (85.8% vs. 66.2%). For suppression, it performs on par with DiffMean in removing the target concept (64.4% vs. 64.0%), yet far surpasses it in preserving unrelated content (81.4% vs. 38.7%).
Figure 7: Overriding neuron activations with negative values enables concept suppression. Originally, LLaVA describes the images by mentioning all the objects visible in the input images. However, as we decrease the value α of neurons associated with knives and laptops, it first confuses them with wooden utensils and a computer monitor, then eventually ignores them completely. At the same time, it continues to faithfully describe other objects such as the wooden board, strawberries, and pencil holder.

Table 3: Mean similarity of neurons' activating images to the output word, with and without steering, on white or random ImageNet images. The upper bound with correct image-classname pairs is 0.283; the lower bound with random pairs is 0.185.

Steering    White Image       ImageNet
✓           0.259 ± 0.036     0.263 ± 0.037
✗           0.212 ± 0.021     0.211 ± 0.028

To further confirm the steering capabilities of SAE neurons, we perform a sanity check: we directly compare CLIP similarity scores between the top-16 images activating a given neuron and the corresponding text outputs, both before and after steering. We prompt the model with "What is shown on the image? Use exactly one word." and compare its original answer to the one generated after fixing the activation of a specific SAE neuron to α = 100, varying one neuron at a time. In the first setup, a white image is used while intervening on the first 1000 neurons to isolate the neuron manipulation's effect. In the second, 1000 random ImageNet images are used while steering only 10 neurons to test effects on natural inputs. The results in Table 3 clearly illustrate that steering increases image-text similarity scores.
For context, we compute reference similarities in CLIP space: the average similarity between each ImageNet class image and its class name sets an upper bound (0.283 ± 0.034), while random image–class pairs set a lower bound (0.185 ± 0.028). Neuron steering yields a relative gain of 22% within this range, highlighting the significance of the results.

Qualitative Examples. Figure 6 illustrates the effectiveness of concept insertion by manipulating a single neuron of the SAE. We prompt the model with the instruction "Write me a short love poem", along with a white image. By intervening on an SAE neuron associated with the pencil concept and increasing the corresponding activation value, we observe the impact on the generated output text. While the initial output mentions the "white" color and focuses on the textual instruction, i.e. a "love" poem, the output becomes more and more focused on pencil attributes as we manually increase the intervention value α (the most highly activating images for the selected neuron are in the Appendix), until it only mentions pencils. We provide more examples with a different input prompt ("Generate a scientific article title") in the Appendix, for which steered LLaVA exhibits similar behavior.

In Figure 7, we show that, by clamping a specific neuron to a negative value, we can suppress a concept. LLaVA is asked what is shown on the images, given a photo of a cutting board with knives and strawberries and a photo of a laptop and a pencil holder full of pencils. By default, the model generates correct descriptions containing all the objects. However, when we intervene by decreasing the activation value α of the neurons associated with knife and laptop, the resulting descriptions progressively omit these concepts. This provides a promising strategy for filtering out harmful or undesired content at an early stage, before it even reaches the language model.
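One plausible reading of the reported 22% relative gain, which can be checked against the white-image column of Table 3, is the steered-vs-unsteered ratio (this interpretation is an assumption on our part, not stated explicitly in the text):

```python
# Values from Table 3 (white-image column): mean CLIP image-text similarity
# of a neuron's top-activating images to the model's one-word answer.
with_steering = 0.259
without_steering = 0.212

# Relative gain of the steered similarity over the unsteered baseline.
relative_gain = (with_steering - without_steering) / without_steering
print(f"{relative_gain:.0%}")  # prints 22%
```

The ImageNet column gives a comparable figure, so the conclusion does not hinge on which column is used.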
5 Conclusion

We introduced the MonoSemanticity score (MS), a quantitative metric for evaluating monosemanticity at the neuron level in SAEs trained on VLMs. Our analysis revealed that SAEs primarily increase monosemanticity through sparsity and wider latents, and it highlighted the superior performance of Matryoshka SAEs. We further verified the alignment of MS with human perception through a large-scale human study. Leveraging the clear separation of concepts encoded in SAEs, we explored their effectiveness for unsupervised, concept-based steering of multimodal LLMs, highlighting a promising direction for future research. Potential extensions of this work include adapting our metric to text representations and investigating the interplay between specialized (low-level) and broad (high-level) concepts within learned representations.

Limitations. We focused our evaluation on various SAE architectures, the most common dictionary learning implementations that scale effectively to large VLMs. However, our MS metric is model-agnostic and could be applied to study other comparable approaches as well. Neurons with high MS scores do not always produce precise effects when used to steer MLLM outputs. For example, a golden retriever neuron from an SAE trained on ImageNet can trigger any dog-related output. This could happen because, while SAEs can disentangle detailed classes in the dataset, MLLMs may have limited fine-grained understanding and may not be perfectly aligned with the vision encoder. Moreover, a fraction of the SAE neurons that act as feature detectors do not exhibit any clear steering effect [1].

Acknowledgments and Disclosure of Funding

This work was partially funded by the ERC (853489 - DEXIM) and the Alfried Krupp von Bohlen und Halbach Foundation, which we thank for their generous support. We are also grateful for partial support from the Pioneer Centre for AI, DNRF grant number P1.
Shyamgopal Karthik thanks the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for support. Mateusz Pach would like to thank the European Laboratory for Learning and Intelligent Systems (ELLIS) PhD program for support.

References

[1] Dana Arad, Aaron Mueller, and Yonatan Belinkov. SAEs are good for steering – if you select the right features. arXiv preprint arXiv:2505.20063, 2025.
[2] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. NeurIPS, 2024.
[3] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.
[4] Jessica Bader, Leander Girrbach, Stephan Alaniz, and Zeynep Akata. SUB: Benchmarking CBM generalization via synthetic attribute substitutions. ICCV, 2025.
[5] Jessica Bader, Mateusz Pach, Maria A Bravo, Serge Belongie, and Zeynep Akata. Stitch: Training-free position control in multimodal diffusion transformers. arXiv preprint arXiv:2509.26644, 2025.
[6] Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P Calmon, and Himabindu Lakkaraju. Interpreting CLIP with sparse linear concept embeddings (SpLiCE). arXiv preprint arXiv:2402.10376, 2024.
[7] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.
[8] Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders. arXiv preprint arXiv:2412.06410, 2024.
[9] Bart Bussmann, Patrick Leask, and Neel Nanda. Learning multi-level features with Matryoshka SAEs. AI Alignment Forum, 2024.
[10] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
[11] Bartosz Cywiński and Kamil Deja. SAeUron: Interpretable concept unlearning in diffusion models with sparse autoencoders. arXiv preprint arXiv:2501.18052, 2025.
[12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[14] Satvik Dixit, Laurie M. Heller, and Chris Donahue. Vision language models are few-shot audio spectrogram classifiers, 2024.
[15] Amil Dravid, Yossi Gandelsman, Alexei A. Efros, and Assaf Shocher. Rosetta neurons: Mining the common units in a model zoo. In ICCV, October 2023.
[16] Esin Durmus, Alex Tamkin, Jack Clark, Jerry Wei, Jonathan Marcus, Joshua Batson, Kunal Handa, Liane Lovitt, Meg Tong, Miles McCain, Oliver Rausch, Saffron Huang, Sam Bowman, Stuart Ritchie, Tom Henighan, and Deep Ganguli. Evaluating feature steering: A case study in mitigating social biases, 2024.
[17] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
[18] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. ICCV, 2025.
[19] Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders. CVPR, 2025.
[20] Hugo Fry. Towards multimodal interpretability: Learning sparse interpretable features in vision transformers, April 2024.
[21] Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting CLIP's image representation via text-based decomposition. In ICLR, 2024.
[22] Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting the second-order effects of neurons in CLIP. In ICLR, 2025.
[23] Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2025.
[24] Iryna Hartsock and Ghulam Rasool. Vision-language models for medical report generation and visual question answering: A review. Frontiers in Artificial Intelligence, 7:1430984, 2024.
[25] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
[26] Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. RSGPT: A remote sensing vision language model and benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025.
[27] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.
[28] Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training-free compositional image retrieval. In ICLR, 2024.
[29] Pegah Khayatan, Mustafa Shukor, Jayneel Parekh, Arnaud Dapogny, and Matthieu Cord. Analyzing finetuning representation shift for multimodal LLMs steering. ICCV, 2025.
[30] Dahye Kim and Deepti Ghadiyaram. Concept steerers: Leveraging k-sparse autoencoders for controllable generations. arXiv preprint arXiv:2501.19066, 2025.
[31] Dahye Kim, Xavier Thomas, and Deepti Ghadiyaram. Revelio: Interpreting and leveraging semantic information in diffusion models. arXiv preprint arXiv:2411.16725, 2024.
[32] Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, and Zeynep Akata. Cosmos: Cross-modality self-distillation for vision language pre-training. arXiv preprint arXiv:2412.01814, 2024.
[33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[34] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.
[35] Hyesu Lim, Jinho Choi, Jaegul Choo, and Steffen Schneider. Sparse autoencoders reveal selective remapping of visual concepts during adaptation. arXiv preprint arXiv:2412.05276, 2024.
[36] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
[37] Alireza Makhzani and Brendan Frey. k-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.
[38] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets language-image pre-training. In ECCV, 2022.
[39] Noa Nabeshima. Matryoshka sparse autoencoders. AI Alignment Forum, 2024.
[40] Tuomas Oikarinen and Tsui-Wei Weng. CLIP-dissect: Automatic description of neuron representations in deep vision networks. In ICLR, 2023.
[41] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[42] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[43] OpenAI. Introducing GPT-4.1 in the API, 2025.
[44] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[45] Konstantinos Panousis and Sotirios Chatzis. DISCOVER: Making vision networks interpretable via competition and dissection. Advances in Neural Information Processing Systems, 36:27063–27078, 2023.
[46] Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, and Matthieu Cord. A concept-based explainability framework for large multimodal models. NeurIPS, 2024.
[47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[48] Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024.
[49] Sukrut Rao, Sweta Mahajan, Moritz Böhle, and Bernt Schiele. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery. In European Conference on Computer Vision, pages 444–461. Springer, 2024.
[50] Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, and Yu Su. Sparse autoencoders for scientifically rigorous interpretation of vision models. arXiv preprint arXiv:2502.06755, 2025.
[51] Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, and Caglar Gulcehre.
Unpacking SDXL Turbo: Interpreting text-to-image models with sparse autoencoders. arXiv preprint arXiv:2410.22366, 2024.
[52] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
[53] Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, and Konstantinos Derpanis. Universal sparse autoencoders: Interpretable cross-model concept alignment. arXiv preprint arXiv:2502.03714, 2025.
[54] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[55] Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, and Oisin Mac Aodha. Benchmarking representation learning for natural world image collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12884–12893, 2021.
[56] Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, and Stephan Alaniz. FLAIR: VLM with fine-grained language-informed image representations. arXiv preprint arXiv:2412.03561, 2024.
[57] Zhifeng Xie, Shengye Yu, Qile He, and Mengtian Li. SonicVisionLM: Playing sound with vision language models, 2024.
[58] Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, and Yifei Wang. The multi-faceted monosemanticity in multimodal representations. arXiv preprint arXiv:2502.14888, 2025.
[59] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al.
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[60] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations.
[61] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[62] Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D. Davison, Hui Ren, Jing Huang, Chen Chen, Yuyin Zhou, Sunyang Fu, Wei Liu, Tianming Liu, Xiang Li, Yong Chen, Lifang He, James Zou, Quanzheng Li, Hongfang Liu, and Lichao Sun. BiomedGPT: A generalist vision-language foundation model for diverse biomedical tasks. arXiv preprint arXiv:2305.17100, 2023.
[63] Kaichen Zhang, Yifei Shen, Bo Li, and Ziwei Liu. Large multi-modal models can interpret features in large multi-modal models. arXiv preprint arXiv:2411.14982, 2024.

Contents

A Broader Impact
B More details on steering
C User study
D Benchmark
E Additional results on monosemanticity
  E.1 Unnormalized plots
  E.2 Detailed statistics and more models
  E.3 Matryoshka hierarchies
F Reconstruction of SAEs
G Uniqueness of concepts
H Additional qualitative results

A Broader Impact

Our work contributes to the field of interpretability and alignment, which are essential components for building safe AI systems.
Our MonoSemanticity score provides a new way to evaluate the effectiveness of recently popular dictionary learning methods, such as sparse autoencoders (SAEs), by incorporating human judgment into the evaluation process. This makes it easier to assess and build trust in systems that use SAEs. In addition, we show that SAEs can be highly effective in steering applications. They can be used to encourage or discourage specific behaviors in models, or to help models recognize or ignore certain concepts, including potentially dangerous ones. This is especially useful for ensuring that models produce desired outputs and remain aligned with human values and goals.

B More details on steering

We illustrate in Figure A1 how we steer LLaVA-like models. We separately train SAEs on top of the pretrained CLIP vision encoder to reconstruct the token embeddings \(v_i\), and then attach them back after the vision encoder during inference. Intervening on a neuron within the SAE layer steers the reconstructed tokens \(\hat{v}_i\) towards the activated concept, which in turn steers the LLM's generated output. We present in Figure A2 additional examples of LLaVA prompted to generate scientific titles, and the outputs before and after intervening on SAE neurons. Increasing the activation of specific neurons modifies the outputs to include elements from the images that most highly activate the corresponding neuron.

Figure A1: LLaVA-like models can be steered towards seeing a concept (e.g. panda) not present in the input image \(x\). By attaching the SAE after the vision encoder and intervening on its neuron representing that concept, we effectively manipulate the LLM's response. Such flexible and precise steering is possible thanks to the extensive concept dictionary identified through the SAE.
Figure A2: Effects of neuron interventions on MLLM-generated scientific article titles. Steering magnitudes are categorized as "0", "medium", and "high" based on the intervention strength. The neurons are visualized with the highest activating images, from which we deduce their associated concepts: "polka dots", "shipwreck", and "rainbow".

The steering capabilities discussed in Section 4.3 are evaluated using an LLM-as-a-judge setup with the following prompts:
• "Write me a short love poem,"
• "Generate a scientific article title,"
• "Give me a four-item to-do list,"
• "Write me a two-verse rap song,"
• "Propose a math word problem,"
• "Write a paragraph from a Wikipedia page,"
• "Invent a new holiday,"
• "Write a dialogue,"
• "Write a newspaper headline and first paragraph,"
• "Give a conversation starter for a party."

C User study

To validate the alignment of our MonoSemanticity score (MS) with human judgment, we conducted a user study. Example questions are shown in Figures A3, A4, and A5.
Each question shows two grids of images, \((x_i)_{i=1}^{16}\) and \((y_i)_{i=1}^{16}\), where each grid contains the 16 images with the highest activations for two neurons \(k_x\) and \(k_y\), respectively. Formally, for each \(i = 1, \dots, 16\),
\[
x_i := x_n \in \mathcal{I} \ \text{with}\ \mathrm{rank}_n\!\left(a_n^{k_x}\right) = i,
\qquad
y_i := x_m \in \mathcal{I} \ \text{with}\ \mathrm{rank}_m\!\left(a_m^{k_y}\right) = i,
\]
where \(a_n^{k}\) is the activation of neuron \(k\) on image \(x_n\), and \(\mathrm{rank}_n(a_n^{k})\) is the rank of image \(x_n\) when sorting all images by their activation values in descending order. The images \(\mathcal{I}\) come from the ImageNet training set. For each neuron pair \((k_x, k_y)\), we asked three human annotators the question: "Which set of images looks more similar and focused on the same thing?" Each annotator gave an answer \(r_j \in \{k_x, k_y\}\) for \(j = 1, 2, 3\). The final human choice \(R_{\mathrm{user}}(k_x, k_y) \in \{k_x, k_y\}\) was decided by majority vote. At the same time, we answered the question using the MonoSemanticity score:
\[
R_{\mathrm{MS}}(k_x, k_y) :=
\begin{cases}
k_x & \text{if } \mathrm{MS}_{k_x} > \mathrm{MS}_{k_y} \\
k_y & \text{otherwise}
\end{cases}
\]
We say the MS and the users are aligned if their decisions agree:
\[
\delta(k_x, k_y) :=
\begin{cases}
1 & \text{if } R_{\mathrm{user}}(k_x, k_y) = R_{\mathrm{MS}}(k_x, k_y) \\
0 & \text{otherwise}
\end{cases}
\]
The overall alignment score is the fraction of all neuron pairs on which the MS and the humans are aligned:
\[
\text{Alignment Score} = \frac{1}{|Q|} \sum_{(k_x, k_y) \in Q} \delta(k_x, k_y),
\]
where \(Q\) is the set of all neuron pairs evaluated. In total, we collected 1,000 user pair rankings with the help of 71 annotators on the Mechanical Turk platform. The number of answers per annotator ranged from 1 to 205, with a median of 24. Annotators were compensated at a rate of $0.02 per answer. The neurons used in the study were randomly selected from: the last layer of CLIP ViT-L, a BatchTopK SAE (ε = 4, K = 20) trained on the last layer of CLIP ViT-L, a Matryoshka SAE (ε = 4, K = 20) trained on the last layer of CLIP ViT-L, and a BatchTopK SAE (ε = 4, K = 20) trained on the last layer of SigLIP SoViT-400m.
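The alignment computation above reduces to a few lines of code. The sketch below uses hypothetical MS values and majority votes, not the study's actual data:

```python
def r_ms(ms, kx, ky):
    """MS-based choice: pick the neuron with the higher MonoSemanticity score."""
    return kx if ms[kx] > ms[ky] else ky

def alignment_score(pairs, ms, r_user):
    """Fraction of neuron pairs on which MS agrees with the human majority vote."""
    agree = sum(1 for (kx, ky) in pairs if r_user[(kx, ky)] == r_ms(ms, kx, ky))
    return agree / len(pairs)

# Hypothetical inputs: MS per neuron, evaluated pairs, and per-pair majority votes.
ms = {0: 0.85, 1: 0.30, 2: 0.60, 3: 0.55}
pairs = [(0, 1), (2, 3), (1, 2)]
r_user = {(0, 1): 0, (2, 3): 3, (1, 2): 2}

print(alignment_score(pairs, ms, r_user))  # 2 of 3 pairs agree
```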
In addition to the plot presenting the user study results in the main paper, we also provide Table A1, which reports the exact values obtained, along with the sizes of each group, categorized by the MS distance between neuron pairs. When designing the questions, we balanced the number of pairs within each distance interval. Our goal is to evaluate MS computed using embeddings from two different image encoders E, namely DINOv2 ViT-B and CLIP ViT-B. As a result, the group sizes are not perfectly equal due to necessary trade-offs. Nevertheless, all groups are sufficiently large and of comparable size.

Table A1: Alignment Scores (AS) obtained from the user study. To compute the MS, we use embeddings of the image encoder E, either DINOv2 ViT-B or CLIP ViT-B. Results are grouped by the MS distance between the neurons in each question. We made sure that every group is represented by enough pairs.

(a) MS distances computed using DINOv2 embeddings.

MS Distance (DINOv2)    0.0-0.1  0.1-0.2  0.2-0.3  0.3-0.4  0.4-0.5  0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9
Number of pairs         177      139      126      138      121      90       105      63       41
AS (E = DINOv2 ViT-B)   0.56     0.66     0.71     0.81     0.85     0.93     0.94     0.96     1.00
AS (E = CLIP ViT-B)     0.60     0.66     0.74     0.82     0.87     0.93     0.94     0.96     1.00

(b) MS distances computed using CLIP embeddings.

MS Distance (CLIP)      0.0-0.1  0.1-0.2  0.2-0.3  0.3-0.4  0.4-0.5
Number of pairs         292      249      186      178      95
AS (E = CLIP ViT-B)     0.55     0.81     0.93     0.96     0.93
AS (E = DINOv2 ViT-B)   0.53     0.77     0.92     0.96     0.93

Figure A3: Example question used in the user study. Best viewed horizontally.
Figure A4: Example question used in the user study. Best viewed horizontally.
Figure A5: Example question used in the user study. Best viewed horizontally.

D Benchmark

While MS shows very good results in our user study, we anticipate the development of improved alternatives in the future. To facilitate such advancements, we will release our collected data as a benchmark for evaluating neuron monosemanticity.
The benchmark will include the following files:
• pairs.csv – Contains the 1,000 pairs of neurons $(k_x, k_y)$, along with the user preferences $R_{\mathrm{user}}(k_x, k_y)$ and MS values computed using two different image encoders: DINOv2 ViT-B and CLIP ViT-B. Each row includes the following columns: k_x, k_y, R_user, MS_x_dino, MS_y_dino, MS_x_clip, MS_y_clip.
• top16_images.csv – Lists the 16 most activating images from the ImageNet training set for each neuron used in the study. Columns: k, x_1, . . . , x_16.
• activations.csv – Provides activation values of all 50,000 ImageNet validation images for each neuron. Columns: k, a_1, . . . , a_50000.
With this data and by following our evaluation procedure, researchers will be able to compare their methods directly to MS under the same conditions: they will have access to the same underlying information, specifically the complete set of neuron activations on the ImageNet validation set.

E Additional results on monosemanticity

E.1 Unnormalized plots

Monosemanticity scores across all neurons, without index normalization, are shown in Figure A6. We observe that neurons cover a wider range of scores as the width of the SAE layer increases. Furthermore, for any given monosemanticity threshold, the number of neurons scoring above that threshold also grows with the width.

Figure A6: MS in decreasing order across neurons. Results are shown for a layer without SAE ("No SAE"), and with SAE using different expansion factors (×1, ×2, ×4, ×16 and ×64).

E.2 Detailed statistics and more models

We report in Tables A2 and A5 the average (± std), best, and worst monosemanticity scores across neurons for the two SAE variants, attached at different layers and for increasing expansion factors.
Although average scores remain similar as the expansion factor increases, we observe a large jump between the original layer and an SAE with expansion factor $\varepsilon = 1$. The best scores get consistently better as the expansion factor increases. Until now, our analysis has focused on SAEs trained on CLIP ViT-L activations, evaluated using the MS score computed from embeddings produced by the DINOv2 image encoder $E$. To broaden this investigation, we now consider SAEs trained on activations from SigLIP SoViT-400m, and adopt CLIP ViT-B as an alternative image encoder $E$ for evaluation. Tables A3 and A6 show the average, best, and worst MS computed using CLIP ViT-B as the image encoder $E$. Even though less distinctly than in the original setup, the neurons from SAEs still score better than the ones originally found in the model. In Tables A4 and A7, we report MS statistics for SAEs trained on the SigLIP SoViT-400m model, computed using CLIP ViT-B as the image encoder $E$; the results closely resemble those for the CLIP ViT-L model.

Table A2: The average MS of neurons in a CLIP ViT-L model. DINOv2 ViT-B is used as the image encoder E.
| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 0.0135±0.0003 | 0.03±0.06 | 0.04±0.06 | 0.04±0.06 | 0.03±0.05 | 0.03±0.05 | 0.03±0.05 |
| | 17 | 0.0135±0.0004 | 0.05±0.07 | 0.07±0.09 | 0.08±0.11 | 0.07±0.10 | 0.07±0.10 | 0.06±0.10 |
| | 22 | 0.0135±0.0003 | 0.14±0.12 | 0.18±0.15 | 0.20±0.17 | 0.21±0.17 | 0.21±0.18 | 0.17±0.18 |
| | 23 | 0.0135±0.0003 | 0.15±0.13 | 0.18±0.16 | 0.20±0.17 | 0.21±0.17 | 0.20±0.18 | 0.17±0.18 |
| | last | 0.0135±0.0002 | 0.12±0.11 | 0.17±0.15 | 0.19±0.17 | 0.19±0.16 | 0.16±0.16 | 0.13±0.15 |
| Matryoshka | 11 | 0.0135±0.0003 | 0.05±0.10 | 0.06±0.10 | 0.05±0.09 | 0.05±0.09 | 0.04±0.08 | 0.03±0.06 |
| | 17 | 0.0135±0.0004 | 0.09±0.14 | 0.10±0.15 | 0.11±0.16 | 0.11±0.15 | 0.10±0.15 | 0.06±0.10 |
| | 22 | 0.0135±0.0003 | 0.17±0.17 | 0.21±0.18 | 0.23±0.19 | 0.23±0.19 | 0.23±0.19 | 0.18±0.19 |
| | 23 | 0.0135±0.0003 | 0.17±0.16 | 0.21±0.19 | 0.22±0.18 | 0.22±0.18 | 0.20±0.18 | 0.12±0.16 |
| | last | 0.0135±0.0002 | 0.16±0.17 | 0.20±0.18 | 0.23±0.19 | 0.22±0.19 | 0.19±0.19 | 0.13±0.16 |

Table A3: The average MS of neurons in a CLIP ViT-L model. CLIP ViT-B is used as the image encoder E.

| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 0.4837±0.0067 | 0.52±0.05 | 0.53±0.06 | 0.53±0.05 | 0.53±0.05 | 0.53±0.05 | 0.53±0.06 |
| | 17 | 0.4840±0.0079 | 0.55±0.07 | 0.56±0.08 | 0.57±0.08 | 0.56±0.05 | 0.56±0.08 | 0.56±0.09 |
| | 22 | 0.4816±0.0053 | 0.60±0.09 | 0.61±0.09 | 0.62±0.09 | 0.63±0.09 | 0.62±0.10 | 0.60±0.11 |
| | 23 | 0.4814±0.0045 | 0.60±0.09 | 0.61±0.10 | 0.62±0.10 | 0.62±0.10 | 0.61±0.10 | 0.59±0.12 |
| | last | 0.4812±0.0042 | 0.59±0.08 | 0.60±0.10 | 0.61±0.10 | 0.61±0.10 | 0.59±0.10 | 0.56±0.10 |
| Matryoshka | 11 | 0.4837±0.0067 | 0.54±0.08 | 0.55±0.08 | 0.55±0.08 | 0.54±0.08 | 0.53±0.07 | 0.52±0.06 |
| | 17 | 0.4840±0.0079 | 0.57±0.09 | 0.58±0.09 | 0.58±0.10 | 0.58±0.10 | 0.57±0.10 | 0.54±0.09 |
| | 22 | 0.4816±0.0053 | 0.61±0.09 | 0.62±0.09 | 0.63±0.10 | 0.62±0.11 | 0.62±0.11 | 0.59±0.12 |
| | 23 | 0.4814±0.0045 | 0.60±0.09 | 0.62±0.10 | 0.62±0.10 | 0.61±0.11 | 0.60±0.11 | 0.54±0.11 |
| | last | 0.4812±0.0042 | 0.59±0.09 | 0.61±0.10 | 0.62±0.11 | 0.61±0.11 | 0.59±0.12 | 0.54±0.12 |

Table A4: The average MS of neurons in a SigLIP SoViT-400m model. CLIP ViT-B is used as the image encoder E.
| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 0.4805±0.0014 | 0.50±0.03 | 0.51±0.04 | 0.51±0.05 | 0.51±0.06 | 0.52±0.06 | 0.52±0.07 |
| | 16 | 0.4809±0.0024 | 0.51±0.04 | 0.52±0.05 | 0.52±0.06 | 0.53±0.07 | 0.53±0.07 | 0.53±0.08 |
| | 21 | 0.4810±0.0052 | 0.52±0.05 | 0.53±0.06 | 0.53±0.06 | 0.53±0.07 | 0.54±0.08 | 0.53±0.08 |
| | last | 0.4811±0.0048 | 0.61±0.09 | 0.61±0.09 | 0.62±0.09 | 0.62±0.09 | 0.62±0.10 | 0.60±0.11 |
| Matryoshka | 11 | 0.4805±0.0014 | 0.50±0.03 | 0.50±0.05 | 0.50±0.05 | 0.50±0.06 | 0.51±0.07 | 0.51±0.07 |
| | 16 | 0.4809±0.0024 | 0.51±0.05 | 0.52±0.06 | 0.52±0.07 | 0.52±0.07 | 0.52±0.07 | 0.51±0.07 |
| | 21 | 0.4810±0.0052 | 0.52±0.05 | 0.53±0.06 | 0.53±0.06 | 0.53±0.07 | 0.52±0.07 | 0.51±0.07 |
| | last | 0.4811±0.0048 | 0.61±0.09 | 0.62±0.10 | 0.62±0.10 | 0.62±0.10 | 0.60±0.11 | 0.58±0.11 |

Table A5: Comparison of the best / worst MS of neurons in a CLIP ViT-L model. DINOv2 ViT-B is used as the image encoder E.

| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 0.01 / 0.01 | 0.61 / -0.02 | 0.73 / -0.08 | 0.71 / -0.06 | 0.87 / -0.07 | 0.90 / -0.10 | 1.00 / -0.11 |
| | 17 | 0.01 / 0.01 | 0.65 / 0.01 | 0.79 / -0.02 | 0.86 / -0.07 | 0.86 / -0.08 | 0.93 / -0.08 | 1.00 / -0.12 |
| | 22 | 0.01 / 0.01 | 0.66 / 0.01 | 0.79 / 0.01 | 0.80 / 0.01 | 0.88 / -0.08 | 0.92 / -0.06 | 1.00 / -0.11 |
| | 23 | 0.01 / 0.01 | 0.73 / 0.01 | 0.72 / 0.01 | 0.83 / 0.01 | 0.89 / -0.02 | 0.93 / -0.06 | 1.00 / -0.10 |
| | last | 0.01 / 0.01 | 0.57 / 0.01 | 0.78 / 0.01 | 0.78 / 0.01 | 0.81 / -0.01 | 0.85 / -0.04 | 1.00 / -0.10 |
| Matryoshka | 11 | 0.01 / 0.01 | 0.84 / -0.06 | 0.90 / -0.07 | 0.95 / -0.08 | 1.00 / -0.11 | 0.89 / -0.10 | 1.00 / -0.10 |
| | 17 | 0.01 / 0.01 | 0.86 / -0.04 | 0.84 / -0.05 | 0.93 / -0.07 | 0.94 / -0.09 | 0.96 / -0.08 | 1.00 / -0.14 |
| | 22 | 0.01 / 0.01 | 0.83 / 0.01 | 0.83 / 0.01 | 0.87 / -0.02 | 0.94 / -0.06 | 1.00 / -0.11 | 1.00 / -0.11 |
| | 23 | 0.01 / 0.01 | 0.82 / 0.01 | 0.84 / 0.01 | 0.89 / -0.04 | 0.93 / -0.04 | 0.96 / -0.06 | 1.00 / -0.11 |
| | last | 0.01 / 0.01 | 0.82 / 0.01 | 0.91 / 0.01 | 0.89 / -0.03 | 0.93 / -0.05 | 0.91 / -0.07 | 1.00 / -0.12 |

Table A6: Comparison of the best / worst MS of neurons in a CLIP ViT-L model. CLIP ViT-B is used as the image encoder E.
| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 0.50 / 0.47 | 0.80 / 0.41 | 0.87 / 0.38 | 0.90 / 0.28 | 0.91 / 0.27 | 0.95 / 0.24 | 1.00 / 0.20 |
| | 17 | 0.50 / 0.47 | 0.84 / 0.37 | 0.87 / 0.33 | 0.94 / 0.35 | 0.94 / 0.28 | 0.96 / 0.24 | 1.00 / 0.14 |
| | 22 | 0.50 / 0.47 | 0.82 / 0.39 | 0.85 / 0.38 | 0.89 / 0.37 | 0.93 / 0.29 | 0.93 / 0.15 | 1.00 / 0.15 |
| | 23 | 0.50 / 0.47 | 0.81 / 0.41 | 0.84 / 0.40 | 0.89 / 0.35 | 0.91 / 0.27 | 0.93 / 0.24 | 1.00 / 0.08 |
| | last | 0.50 / 0.47 | 0.80 / 0.40 | 0.84 / 0.40 | 0.87 / 0.36 | 0.87 / 0.31 | 0.89 / 0.25 | 1.00 / 0.17 |
| Matryoshka | 11 | 0.50 / 0.47 | 0.90 / 0.39 | 0.95 / 0.31 | 0.97 / 0.23 | 1.00 / 0.22 | 0.94 / 0.18 | 1.00 / 0.19 |
| | 17 | 0.50 / 0.47 | 0.94 / 0.33 | 0.93 / 0.35 | 0.96 / 0.29 | 0.96 / 0.22 | 0.97 / 0.14 | 1.00 / 0.11 |
| | 22 | 0.50 / 0.47 | 0.88 / 0.40 | 0.87 / 0.33 | 0.89 / 0.29 | 0.94 / 0.23 | 1.00 / 0.15 | 1.00 / 0.06 |
| | 23 | 0.50 / 0.47 | 0.85 / 0.40 | 0.86 / 0.35 | 0.90 / 0.35 | 0.91 / 0.19 | 0.93 / 0.17 | 1.00 / 0.14 |
| | last | 0.50 / 0.47 | 0.85 / 0.41 | 0.88 / 0.40 | 0.89 / 0.31 | 0.91 / 0.26 | 0.92 / 0.17 | 1.00 / 0.09 |

Table A7: Comparison of the best / worst MS of neurons in a SigLIP SoViT-400m model. CLIP ViT-B is used as the image encoder E.

| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 0.49 / 0.48 | 0.61 / 0.41 | 0.83 / 0.29 | 0.88 / 0.27 | 0.90 / 0.23 | 1.00 / 0.12 | 1.00 / 0.15 |
| | 16 | 0.53 / 0.47 | 0.74 / 0.38 | 0.75 / 0.34 | 0.93 / 0.25 | 0.94 / 0.20 | 0.93 / 0.22 | 1.00 / 0.18 |
| | 21 | 0.54 / 0.47 | 0.76 / 0.38 | 0.77 / 0.35 | 0.83 / 0.25 | 0.89 / 0.17 | 0.95 / 0.20 | 1.00 / 0.11 |
| | last | 0.50 / 0.47 | 0.83 / 0.41 | 0.86 / 0.40 | 0.88 / 0.37 | 0.92 / 0.33 | 0.93 / 0.20 | 1.00 / 0.11 |
| Matryoshka | 11 | 0.49 / 0.48 | 0.70 / 0.40 | 0.93 / 0.29 | 0.77 / 0.27 | 0.93 / 0.18 | 0.91 / 0.22 | 1.00 / 0.16 |
| | 16 | 0.53 / 0.47 | 0.78 / 0.40 | 0.84 / 0.29 | 0.91 / 0.19 | 0.93 / 0.18 | 1.00 / 0.19 | 1.00 / 0.16 |
| | 21 | 0.54 / 0.47 | 0.85 / 0.39 | 0.81 / 0.37 | 0.83 / 0.25 | 0.93 / 0.24 | 0.94 / 0.21 | 1.00 / 0.15 |
| | last | 0.50 / 0.47 | 0.87 / 0.40 | 0.87 / 0.38 | 0.89 / 0.30 | 0.91 / 0.25 | 0.94 / 0.15 | 1.00 / 0.15 |

In Figure A7 we plot MS across single neurons.
We consider setups in which (a) neurons of CLIP ViT-L are evaluated with DINOv2 ViT-B as the image encoder E, (b) neurons of CLIP ViT-L are evaluated with CLIP ViT-B as E, and (c) neurons of SigLIP SoViT-400m are evaluated with CLIP ViT-B as E. In all three cases SAE neurons are more monosemantic than the original neurons of the models, showing that MS results are consistent across architectures, both those being explained and those used as E.

Figure A7: MS in decreasing order across neurons. Results are shown for the last layers of two different models, without SAE (black dashed line), and with SAE trained with expansion factor 1 (green solid line). MS is computed with distinct image encoders E: (a) neurons of CLIP ViT-L evaluated with DINOv2 ViT-B, (b) neurons of CLIP ViT-L evaluated with CLIP ViT-B, (c) neurons of SigLIP SoViT-400m evaluated with CLIP ViT-B.

In Figures A8 and A9, we again plot MS scores across neurons for SAEs trained with different expansion factors and sparsity levels, but using CLIP ViT-B as the image encoder E. We observe very similar patterns to the MS computed using DINOv2 ViT-B: both a higher expansion factor and a lower sparsity help find more monosemantic units.

Figure A8: Monosemanticity Scores (computed using CLIP ViT-B) in decreasing order across neurons, normalized by width.
Results are shown for the last layer of the model, without SAE ("No SAE", black dashed line), and with SAE using either (a) different expansion factors (ε = 1, ε = 4 and ε = 16) or (b) different sparsity levels (K = 1, K = 10, K = 20 and K = 50).

Figure A9: Monosemanticity Scores (computed using CLIP ViT-B) in decreasing order across neurons, without normalizing by width. Results are shown for a layer without SAE ("No SAE"), and with SAE using different expansion factors (×1, ×4 and ×16).

E.3 Matryoshka hierarchies

We train and evaluate the SAE on embeddings extracted from the iNaturalist [55] dataset, using an expansion factor ε = 2 and groups of size M = 3, 16, 69, 359, 1536. These group sizes correspond to the numbers of nodes in the first 5 levels of the dataset's species taxonomy tree, i.e. the respective numbers of "kingdoms", "phylums", "classes", "orders", and "families". To measure the granularity of the concepts, we map each neuron to the most fitting depth in the iNaturalist taxonomy tree, so as to compare the hierarchy of concepts within the Matryoshka SAE with the human-defined one. To obtain this neuron-to-depth mapping, we select the top-16 activating images per neuron and compute the average depth of the Lowest Common Ancestor (LCA) in the taxonomy tree over all pairs of these images. For instance, given a neuron with an average LCA depth of 2, we can assume that images activating this neuron are associated with species from multiple "classes" of the same "phylum". We report the average assigned LCA depth of neurons across the Matryoshka group levels in Table A8. We notice that the average LCA depths are correlated with the level, suggesting that the Matryoshka hierarchy can be aligned with the human-defined hierarchy. We additionally aggregate MS statistics of the neurons at each level.
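The neuron-to-depth mapping can be sketched as follows, assuming each image's taxon is given as a root-to-leaf path of labels (the paths and values below are illustrative, not from the paper's data):

```python
from itertools import combinations

def lca_depth(path_a: list[str], path_b: list[str]) -> int:
    """Depth of the lowest common ancestor of two root-to-leaf taxonomy
    paths (1 = shared kingdom, 2 = shared phylum, ...)."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

def neuron_depth(image_paths: list[list[str]]) -> float:
    """Average pairwise LCA depth over a neuron's top-activating images
    (the paper uses the top-16 per neuron)."""
    pairs = list(combinations(image_paths, 2))
    return sum(lca_depth(a, b) for a, b in pairs) / len(pairs)

# Hypothetical taxonomy paths for three top images of one neuron.
imgs = [
    ["Animalia", "Chordata", "Aves"],
    ["Animalia", "Chordata", "Mammalia"],
    ["Animalia", "Arthropoda", "Insecta"],
]
print(neuron_depth(imgs))  # (2 + 1 + 1) / 3 ≈ 1.33
```

A neuron with average depth near 2 is thus tied to a single phylum but multiple classes, matching the interpretation given above.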
Average and maximum MS also correlate with the level, confirming that the most specialized neurons are found at the lowest levels.

Table A8: Average LCA depth and monosemanticity (MS) scores across neurons at each level in the Matryoshka nested dictionary.

| Level | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Depth | 3.33 | 2.92 | 3.85 | 3.86 | 4.06 |
| MS Avg. | 0.06 | 0.08 | 0.09 | 0.16 | 0.24 |
| MS Max. | 0.11 | 0.30 | 0.29 | 0.69 | 0.76 |
| MS Min. | 0.04 | 0.03 | 0.03 | 0.03 | -0.05 |

F Reconstruction of SAEs

In Tables A9 and A10, we report respectively R² and sparsity (L0) for the two SAE variants compared in Section 4.2. As the BatchTopK activation enforces sparsity at the batch level, at test time it is replaced with ReLU(x − γ), where x is the input and γ is a vector of per-neuron thresholds, estimated as the average of the minimum positive activation values across a number of batches. For this reason the test-time sparsity may differ slightly from K, fixed at the value of 20 in our case. We report in Table A11 the detailed metrics (R², L0, and MS statistics) obtained for SAEs trained with the different K values considered in Section 4.2.

Table A9: Comparison of R² (in %) for different SAEs trained with K = 20 for a CLIP ViT-L model.

| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 100 | 74.7 | 75.0 | 75.1 | 75.0 | 74.7 | 73.5 |
| | 17 | 100 | 70.4 | 71.9 | 72.6 | 72.9 | 72.9 | 72.5 |
| | 22 | 100 | 68.7 | 72.6 | 74.9 | 76.0 | 76.8 | 77.4 |
| | 23 | 100 | 67.2 | 71.5 | 74.0 | 75.3 | 76.0 | 76.8 |
| | last | 100 | 70.1 | 74.6 | 77.1 | 78.2 | 78.6 | 79.1 |
| Matryoshka | 11 | 100 | 72.8 | 73.9 | 74.5 | 75.1 | 75.2 | 74.5 |
| | 17 | 100 | 67.3 | 69.5 | 70.7 | 71.8 | 72.6 | 72.7 |
| | 22 | 100 | 65.5 | 69.6 | 71.5 | 74.0 | 75.4 | 76.6 |
| | 23 | 100 | 63.9 | 68.5 | 71.0 | 73.1 | 74.8 | 74.6 |
| | last | 100 | 66.8 | 71.6 | 74.1 | 76.0 | 77.6 | 78.2 |

Table A10: Comparison of true sparsity, measured by the L0-norm, for different SAEs trained with K = 20 for a CLIP ViT-L model.
| SAE type | Layer | No SAE | ×1 | ×2 | ×4 | ×8 | ×16 | ×64 |
|---|---|---|---|---|---|---|---|---|
| BatchTopK | 11 | 1024 | 19.7 | 19.5 | 19.4 | 19.6 | 20.0 | 22.9 |
| | 17 | 1024 | 19.4 | 19.4 | 19.2 | 19.6 | 19.5 | 22.3 |
| | 22 | 1024 | 19.6 | 19.7 | 19.7 | 19.8 | 20.3 | 23.0 |
| | 23 | 1024 | 19.8 | 19.8 | 19.9 | 20.1 | 20.3 | 22.2 |
| | last | 768 | 19.9 | 19.9 | 19.9 | 20.1 | 20.2 | 22.2 |
| Matryoshka | 11 | 1024 | 19.4 | 19.5 | 19.4 | 19.6 | 19.8 | 21.3 |
| | 17 | 1024 | 19.3 | 19.3 | 19.3 | 19.4 | 19.5 | 20.5 |
| | 22 | 1024 | 19.7 | 19.7 | 19.6 | 19.8 | 19.9 | 22.0 |
| | 23 | 1024 | 19.7 | 19.8 | 19.8 | 19.9 | 20.6 | 25.1 |
| | last | 768 | 20.0 | 19.9 | 19.8 | 19.9 | 20.2 | 22.5 |

Table A11: Statistics for SAEs trained with different sparsity constraints K on activations of the last layer, with expansion factor 16. The "No SAE" row contains results for raw activations before attaching the SAE.

| K | L0 | R² (%) | MS Min | MS Max | MS Mean |
|---|---|---|---|---|---|
| 1 | 0.9 | 31.3 | -0.03 | 0.90 | 0.37±0.20 |
| 10 | 9.9 | 60.6 | 0.01 | 0.79 | 0.19±0.16 |
| 20 | 20.0 | 66.8 | 0.01 | 0.82 | 0.16±0.17 |
| 50 | 50.1 | 74.9 | 0.01 | 0.69 | 0.07±0.08 |
| No SAE | – | – | 0.01 | 0.01 | 0.01±0.00 |

Figure A10: Images highly activating the neuron we intervene on in Figure 6, which we manually labeled as the "Pencil Neuron".

G Uniqueness of concepts

The sparse reconstruction objective regularizes the SAE activations to focus on different concepts. To confirm this in practice, we collect the top-16 highest-activating images for each SAE neuron and compute the Jaccard index J between every pair of neurons. The images come from the ImageNet training set. We exclude 10 out of 12,288 neurons for which we found fewer than 16 activating images, and use the Matryoshka SAE trained on the last layer with an expansion factor of 16. We find that J > 0 for 16,000 out of 75,368,503 pairs (< 0.03%), and J > 0.5 for only 20 pairs, which shows very high uniqueness of the learned concepts.

H Additional qualitative results

We illustrate in Figure A10 the highly activating images for the "Pencil" neuron, which we used for steering in Figure 6. In Figures A11 and A12 we provide more randomly selected examples of neurons for which we computed MS using two different image encoders. In both cases we see a clear correlation between the score and the similarity of the images in a grid.
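The uniqueness check described in Section G can be sketched as follows, assuming the top-16 image sets have already been collected per neuron (the toy sets below stand in for the real 12,278-neuron run):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of top-activating image IDs."""
    return len(a & b) / len(a | b)

def overlap_counts(top_images: dict[int, set], thresholds=(0.0, 0.5)) -> dict:
    """Count neuron pairs whose top-image sets overlap beyond each threshold."""
    counts = {t: 0 for t in thresholds}
    for (_, sa), (_, sb) in combinations(top_images.items(), 2):
        j = jaccard(sa, sb)
        for t in thresholds:
            if j > t:
                counts[t] += 1
    return counts

# Toy example: neuron -> set of top image IDs (the paper uses 16 per neuron).
tops = {0: {1, 2, 3}, 1: {3, 4, 5}, 2: {6, 7, 8}}
print(overlap_counts(tops))  # {0.0: 1, 0.5: 0}
```

Low counts at both thresholds indicate that neurons rarely share their most activating images, i.e. the learned concepts are largely unique.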
Figure A11: Qualitative examples of the highest-activating images for different neurons, from high (left) to low (right) MS score. As the metric gets higher, the highest-activating images are more similar, illustrating the correlation with monosemanticity. DINOv2 ViT-B is used as the image encoder E.

Figure A12: Qualitative examples of the highest-activating images for different neurons, from high (left) to low (right) MS score. As the metric gets higher, the highest-activating images are more similar, illustrating the correlation with monosemanticity. CLIP ViT-B is used as the image encoder E.