Paper deep dive
SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders
Fabio Mercorio, Filippo Pallucchini, Daniele Potertì, Antonio Serino, Andrea Seveso
Models: Gemma-2-9B, Llama-3.1-8B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/11/2026, 12:44:17 AM
Summary
The paper introduces SFAL (Semantic-Functional Alignment Scores), a novel, efficient evaluation strategy for auto-interpretability in Sparse Autoencoders (SAEs). By measuring the alignment between a feature's semantic neighborhood (derived from auto-interpretation embeddings) and its functional neighborhood (derived from co-occurrence statistics), SFAL reduces reliance on expensive, noisy LLM-based scoring methods while maintaining high correlation with human judgments.
Entities (6)
Relation Signals (3)
SFAL → evaluates → Sparse Autoencoders
confidence 95% · SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders
Gemma-2-9b → utilizes → Sparse Autoencoders
confidence 95% · Our study focused on the 16k features version of the SAEs for gemma-2-9b
Sparse Autoencoders → implements → Auto-interpretability
confidence 90% · Sparse Autoencoders facilitate interpretability by decomposing polysemantic activation into a latent space of monosemantic features.
Cypher Suggestions (2)
Find all evaluation metrics related to Sparse Autoencoders · confidence 90% · unvalidated
MATCH (m:Metric)-[:EVALUATES]->(s:Architecture {name: 'Sparse Autoencoders'}) RETURN m.name
Identify models that use Sparse Autoencoders · confidence 90% · unvalidated
MATCH (m:Model)-[:UTILIZES]->(s:Architecture {name: 'Sparse Autoencoders'}) RETURN m.name
Abstract
Fabio Mercorio, Filippo Pallucchini, Daniele Potertì, Antonio Serino, Andrea Seveso. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025.
Tags
Links
Full Text
33,242 characters extracted from source content.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 576–583, November 4-9, 2025. ©2025 Association for Computational Linguistics.

SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders

Fabio Mercorio¹³, Filippo Pallucchini¹³, Daniele Potertì², Antonio Serino², Andrea Seveso¹³
¹Dept of Statistics and Quantitative Methods, University of Milano-Bicocca, Italy; ²Dept of Economics, Management and Statistics, University of Milano-Bicocca, Italy; ³CRISP Research Centre (crispresearch.eu), University of Milano-Bicocca, Italy

Abstract

Interpreting the internal representations of large language models (LLMs) is crucial for their deployment in real-world applications, impacting areas such as AI safety, debugging, and compliance. Sparse Autoencoders facilitate interpretability by decomposing polysemantic activations into a latent space of monosemantic features. However, evaluating the auto-interpretability of these features is difficult and computationally expensive, which limits scalability in practical settings. In this work, we propose SFAL, an alternative evaluation strategy that reduces reliance on LLM-based scoring by assessing the alignment between the semantic neighbourhoods of features (derived from auto-interpretation embeddings) and their functional neighbourhoods (derived from co-occurrence statistics). Our method enhances efficiency, enabling fast and cost-effective assessments. We validate our approach on large-scale models, demonstrating its potential to provide interpretability while reducing computational overhead, making it suitable for real-world deployment.

1 Introduction

Interpreting the internal representations of large language models (LLMs) is a key challenge in research and real-world applications (Sharkey et al., 2025). Sparse Autoencoders (SAEs) are neural networks designed to learn interpretable feature representations from high-dimensional activations in LLMs (Cunningham et al., 2023). They provide a structured latent feature space where semantically similar features are mapped closely, enabling potential improvements in model transparency (Räuker et al., 2023). In practical deployments, understanding what a given feature represents is crucial for debugging, safety, and compliance (Templeton et al., 2024). Auto-interpretability (autointerp) (Bills et al., 2023) methods attempt to generate human-readable descriptions of these features by analysing their activations and prompting LLMs to create explanations. However, current evaluation approaches for autointerp rely on scoring methods that compare a feature's activation examples with the generated interpretation using other LLMs (Paulo et al., 2024). This process is prone to noise and computationally expensive, requiring multiple queries per feature, making it costly for large-scale, real-world systems. This work explores Semantic-Functional Alignment Scores (SFAL), an alternative evaluation strategy that reduces dependence on LLM-based scoring, improving efficiency while maintaining scoring quality. By leveraging the structural properties of the SAE feature space, we propose a method that is more scalable and deployable in real-world settings, enabling more cost-effective interpretability assessments.
Unlike existing approaches, SFAL introduces a principled alignment metric between the latent structure of functional behaviour and the semantic space derived from auto-interpretations; a formulation that, to our knowledge, has not been previously applied to evaluating feature interpretability in sparse autoencoders.

Contribution. Our main contributions are as follows: (i) We propose SFAL, a novel approach to evaluating autointerp quality that reduces dependence on expensive LLM-based scoring. We aim for auto-interpretability to be more efficient, less noisy, and feasible for real-world deployments. (ii) We validate our approach in a user study, comparing its robustness with previous methods and considering practical constraints such as computational cost and resource limitations. (iii) To support reproducibility, we release all code, processed data, and scores produced in our experiments (https://github.com/Crisp-Unimib/SFAL).

2 Preliminaries and State of the Art

Sparse Autoencoders (SAEs). SAEs distil high-dimensional outputs of large language models into interpretable representations (Cunningham et al., 2023). They reconstruct input activations through a sparse bottleneck layer to promote monosemantic features, each representing a distinct, understandable concept (Bills et al., 2023). This architecture aims to mitigate the superposition phenomenon, where single neurons encode multiple unrelated concepts (Bricken et al., 2023). Monosemanticity is believed to promote better separation of feature representations, leading to clearer conceptual neighbourhoods and forming a basis for mechanistic interpretability efforts to identify computational circuits within LLMs. Recent work reveals that SAE feature spaces exhibit structured organisation at multiple scales, with functionally related features clustering together and forming meaningful geometric patterns (Li et al., 2025b). Features that frequently co-activate are likely functionally related, suggesting that co-occurrence statistics can reveal functional relationships. Beyond interpretability, one can also perform targeted interventions on features to steer the model toward specific behaviours (Potertì et al., 2025). Given the potential scale of SAEs, which can learn millions of features, there is a need for automated methods to generate human-understandable textual explanations for these features, known as auto-interpretations (Bills et al., 2023).

Auto-Interpretations. Auto-interpretability methods generate human-readable explanations of SAE features by analysing their activations (Bills et al., 2023). Current evaluation approaches rely heavily on LLM-based scoring methods that compare feature activations with generated interpretations. LLM-based methods include fuzzy scoring (Paulo et al., 2024), where LLMs classify whether highlighted tokens should activate features based on their explanations, showing a strong correlation with human judgments. Other methods include detection scoring (an LLM identifies whether a sequence activates a latent representation based on its explanation), surprisal scoring (improvement in predicting contexts given an interpretation), and embedding scoring (semantic relevance of an interpretation to the activating data).
However, these methods face significant limitations, including computational expense, potential noise in LLM judgments, scalability issues with millions of features, and the risk of "deceptive interpretability", where plausible explanations may mislead evaluators (Lermen et al., 2025). Alternative approaches have emerged to address these limitations. Intervention-based evaluation assesses an explanation's ability to predict the consequences of actively manipulating a feature's activation (e.g., ablation) (Bhalla et al., 2024). However, this approach faces challenges such as the complexity of designing meaningful interventions and the "predict/control discrepancy", where features good for prediction may not be effective for control, and vice versa. There is also a growing interest in non-LLM-centric metrics. Examples include classification-based metrics (Cesarini et al., 2024; Malandri et al., 2024), utilising SAE features for downstream tasks such as toxicity detection (Gallifant et al., 2025) and hallucination mitigation (Abdaljalil et al., 2025), and probing-based evaluation, where linear probes are trained on SAE features to predict known concepts (e.g., sentiment, specific n-grams) (Gao et al., 2024). While human evaluation remains a gold standard for nuance and correctness, its inherent subjectivity, cost, and slow pace make it impractical for the vast number of features in large-scale SAEs. Our work contributes by proposing an evaluation strategy that leverages structural properties of the SAE feature space itself, reducing reliance on expensive LLM-based scoring while maintaining evaluation quality.

Open Platforms. Neuronpedia (Lieberum et al., 2024a) is an open platform for mechanistic interpretability research. It serves as both a public database containing valuable data for researchers (including activations, SAE features, their auto-interpretations, metadata, and scores from various methods) and a suite of tools facilitating the storage and management of these interpretability artefacts.

3 Methods

Our core objective is to quantify the alignment between the semantic interpretation of an SAE feature and its functional interactions with other features. The core assumption is that meaningful auto-interpretations should be consistent with the feature's behaviour in the model (Olah et al., 2020). This reflects a principle of internal coherence also found in mechanistic interpretability: features with distinct and well-described semantic content should exhibit functionally cohesive patterns of co-activation. To achieve this, for each SAE feature, we define and compare its semantic neighbourhood and its functional neighbourhood. This comparison results in a Semantic-Functional Alignment Score (SFAL). An overview of our methodology is presented in Fig. 1.

[Figure 1: Pipeline for generating Semantic-Functional Alignment Scores (SFAL). SAE features are processed via a co-occurrence matrix to derive representations in a functional space. Auto-interpretations are passed through an encoder to generate representations in a semantic space. Top K-ranked lists of elements from these respective spaces are used to calculate Discounted Cumulative Gain (DCG) and Ideal Discounted Cumulative Gain (IDCG), yielding the final SFAL score that quantifies the alignment between the semantic and functional characteristics of the elements.]

3.1 Representations of SAE Features

Let $S = \{s_1, s_2, \dots, s_n\}$ denote a set of $n$ SAE features.
For each feature $s_i \in S$, we aim to capture both its semantic meaning and its functional behaviour. This involves defining appropriate representations.

Semantic Representations. Each SAE feature $s_i$ is associated with an auto-interpretation, a textual description of its learned function. The semantic representation of feature $s_i$ is the auto-interpretation vector $a_i \in \mathbb{R}^d$. These $d$-dimensional real-valued vectors are generated by encoding the textual auto-interpretations using an encoder language model. The set of all such vectors, $A = \{a_1, a_2, \dots, a_n\}$, constitutes the semantic space.

Functional Representations. The functional behaviour of feature $s_i$ is characterised by how often it co-activates with other features. We capture this through co-occurrence statistics between feature pairs $(s_i, s_j)$, following (Li et al., 2025b), resulting in a co-occurrence matrix. For each pair, we construct a $2 \times 2$ contingency table $m(i,j)$ with entries $m_{11}$, $m_{10}$, $m_{01}$, and $m_{00}$ representing the joint activation counts, along with their marginal totals $m_{1\bullet}$, $m_{0\bullet}$, $m_{\bullet 1}$, and $m_{\bullet 0}$. For example, $m_{11}$ is the number of instances where both $s_i$ and $s_j$ are active, $m_{00}$ is the number of cases where neither is active, and $m_{1\bullet}$ is the total number of instances where $s_i$ is active, regardless of whether $s_j$ is active.

3.2 Defining Semantic and Functional Neighbourhoods

Based on the representations above, we define semantic and functional neighbourhoods for each SAE feature $s_i$.

Semantic Neighbourhood ($N_S$). The semantic neighbourhood $N_S(i)$ of an SAE feature $s_i$ consists of other features $s_j$ ($j \neq i$) whose auto-interpretations are semantically similar to that of $s_i$. This similarity is measured using their auto-interpretation vectors $a_i$ and $a_j$ from the semantic space. We use cosine similarity to quantify the likeness between two auto-interpretation vectors. For a given feature $s_i$, its semantic neighbourhood $N_S(i)$ is formally defined as the set of $K_S$ features $s_j$ (for $j \neq i$) with the highest $\mathrm{sim}_{\cos}(a_i, a_j)$ scores. While we employ a fixed top-K neighbourhood for clarity and reproducibility, SFAL is not restricted to this setting; adaptive strategies (e.g., thresholds based on feature sparsity) are feasible and will be explored in future work.

Functional Neighbourhood ($N_F$). The functional neighbourhood $N_F(i)$ of an SAE feature $s_i$ comprises other features $s_j$ ($j \neq i$) that exhibit a strong functional association with $s_i$, based on the co-occurrence table $m(i,j)$. To measure the strength of association between a pair of features $s_i$ and $s_j$ from their $2 \times 2$ co-occurrence counts and associated marginals (previously defined as $m_{1\bullet}$, $m_{0\bullet}$, $m_{\bullet 1}$, $m_{\bullet 0}$), we employ the phi coefficient $\varphi_{ij}$ (Yule, 1912), also utilised in (Li et al., 2025b):

$$\varphi_{ij} = \frac{m_{11}(i,j)\, m_{00}(i,j) - m_{10}(i,j)\, m_{01}(i,j)}{\sqrt{m_{1\bullet}\, m_{0\bullet}\, m_{\bullet 1}\, m_{\bullet 0}}}$$

This coefficient $\varphi_{ij}$ ranges from -1 (perfect negative association) to +1 (perfect positive association), with 0 indicating no association, and it is well suited for measuring the association between binary variables (the active/inactive states of features). For a feature $s_i$, its functional neighbourhood $N_F(i)$ is formally defined as the set of $K_F$ features $s_j$ (for $j \neq i$) with the highest positive $\varphi_{ij}$ values.
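To ground Sections 3.1 and 3.2, here is a minimal Python sketch of one way the two neighbourhoods could be computed from an embedding matrix and per-chunk sets of active features. This is not the authors' released implementation: the function names, the `K_S`/`K_F` defaults, and the dense accumulators are illustrative assumptions, and the per-chunk update anticipates the sparse optimisation described in Section 3.4 below.

```python
import numpy as np

def semantic_neighbourhood(A, i, K_S=10):
    """N_S(i): the K_S features whose auto-interpretation embeddings are most
    cosine-similar to feature i. A is the (n, d) matrix of embeddings a_1..a_n."""
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    sims = A_norm @ A_norm[i]        # cosine similarity of every feature to i
    sims[i] = -np.inf                # exclude the feature itself (j != i)
    return np.argsort(-sims)[:K_S]

def build_cooccurrence(active_sets, n):
    """Accumulate joint activation counts m_11 from per-chunk sets of active
    feature indices; only O(K_chunk^2) updates per chunk (cf. Sec. 3.4).
    A dense (n, n) accumulator is workable at the 16k-feature scale used in
    the paper; a sparse structure would be preferable for much larger n."""
    m11 = np.zeros((n, n), dtype=np.int64)
    totals = np.zeros(n, dtype=np.int64)   # marginal m_1. per feature
    for active in active_sets:
        idx = np.fromiter(active, dtype=np.int64)
        m11[np.ix_(idx, idx)] += 1
        totals[idx] += 1
    return m11, totals

def phi_matrix(m11, totals, n_chunks):
    """Phi coefficient for every feature pair, filling in the remaining 2x2
    contingency cells (m_10, m_01, m_00) from m_11 and the marginals."""
    m1 = totals.astype(np.float64)         # instances where the feature is active
    m0 = n_chunks - m1                     # instances where it is inactive
    m10 = m1[:, None] - m11                # i active, j inactive
    m01 = m1[None, :] - m11                # i inactive, j active
    m00 = n_chunks - m11 - m10 - m01       # neither active
    denom = np.sqrt(m1[:, None] * m0[:, None] * m1[None, :] * m0[None, :])
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = (m11 * m00 - m10 * m01) / denom
    return np.nan_to_num(phi)              # 0 where a feature never/always fires

def functional_neighbourhood(phi, i, K_F=10):
    """N_F(i): the K_F features with the highest positive phi association to i."""
    row = phi[i].copy()
    row[i] = -np.inf
    top = np.argsort(-row)[:K_F]
    return top[row[top] > 0]               # keep only positive associations
```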
3.3 Computing SFAL

We introduce the Semantic-Functional Alignment Score (SFAL) to quantify, for each SAE feature $s_i$, how well its semantic neighbourhood $N_S(i)$ aligns with its functional neighbourhood $N_F(i)$. This score is calculated using Normalised Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002), a well-established measure for evaluating the consistency between two rankings (Malandri et al., 2025; Pallucchini et al., 2025). A score close to 1 indicates strong alignment between the feature's semantic interpretation and its functional co-occurrence behaviour, while a score near 0 suggests divergence.
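The paper does not spell out the exact gain assignment inside the NDCG computation, so the sketch below makes a simple assumption: a feature in the semantic ranking earns gain 1 if it also appears in the functional neighbourhood, and 0 otherwise. Treat it as an illustration of the DCG/IDCG mechanics from Fig. 1, not the authors' exact scorer.

```python
import numpy as np

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k of a ranked candidate list against a set of relevant items,
    using binary gains and the standard 1/log2(rank + 1) discount."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))          # ranks 1..k
    gains = np.array([1.0 if item in relevant else 0.0 for item in ranked[:k]])
    dcg = float(np.sum(gains * discounts[: len(gains)]))
    # Ideal DCG: all relevant items (up to k) placed at the top ranks.
    idcg = float(np.sum(discounts[: min(len(relevant), k)]))
    return dcg / idcg if idcg > 0 else 0.0

def sfal_score(semantic_ranked, functional_neighbours, k=10):
    """Alignment of a feature's semantic ranking with its functional
    neighbourhood: close to 1 = strong alignment, near 0 = divergence."""
    return ndcg_at_k(list(semantic_ranked), set(functional_neighbours), k)

# Illustrative usage with hypothetical neighbour indices:
print(sfal_score(semantic_ranked=[4, 9, 2, 7, 5],
                 functional_neighbours=[9, 4, 11], k=5))  # ~0.77
```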
3.4 Computational Efficiency

Our method is designed to scale efficiently with the number of SAE features $n$. For each feature $s_i \in S$, we compute the semantic and the functional neighbourhood.

Computing the cosine similarity between all pairs of $n$ auto-interpretation vectors (each of dimension $d$) requires $O(n^2 d)$ operations. Since $d$ is fixed (determined by the embedding model, e.g., 768 or 1024), this simplifies to $O(n^2)$. To compute functional neighbourhoods, we build a co-occurrence histogram from a corpus and then calculate the phi coefficient ($\varphi$) for every feature pair. The co-occurrence histogram is built by processing a corpus of $D$ documents with an average token length of $T$. The text is segmented into chunks of length $k$. Since only a small subset of features is active in any given chunk, we can compute the outer product over sparse binary vectors. For each chunk, we identify the set of $K_{\text{chunk}}$ active features, where $K_{\text{chunk}} \ll n$ (e.g., typically 20–50). The number of required updates per chunk is only $O(K_{\text{chunk}}^2)$. This optimisation makes the construction of the histogram significantly more scalable, with an effective complexity of:

$$O\left(\frac{D \cdot T}{k} \cdot \mathbb{E}\left[K_{\text{chunk}}^2\right]\right)$$

where $\mathbb{E}[K_{\text{chunk}}^2]$ is the average squared number of active features per chunk. After the histogram is populated, calculating the $\varphi_{ij}$ coefficient for all $\approx n^2/2$ pairs is an $O(n^2)$ operation.

With the semantic and functional matrices computed, we rank the neighbours for each feature with a complexity of $O(n \log K)$, leading to a total of $O(n^2 \log K)$. Given $K \ll n$, this term is effectively $O(n^2)$. The final step, computing the NDCG@K score for each feature, takes $O(K)$, for a negligible total cost of $O(nK)$.

Therefore, the overall computational bottleneck is the $O(n^2)$ cost of the pairwise matrix computations. This framework offers a substantial efficiency improvement over LLM-based evaluation pipelines, which, while scaling linearly with $n$, incur prohibitively high per-feature overhead due to the financial costs associated with using large models. For the millions of features in large-scale SAEs, these combined expenses become intractable. In contrast, our approach is far more scalable and cost-effective. In practice, our experiments required just 2 GPU hours on a single NVIDIA A100 GPU, underscoring the practical scalability and low resource requirements of our approach.

4 Results

Our study focused on the 16k-features version of the SAEs for gemma-2-9b (gemma-scope-9b-pt-res; Lieberum et al., 2024b) and the 32k-features version for llama-3.1-8b (Llama3_1-8B-Base-LXR-8x; He et al., 2024). To ensure a robust comparison, the encoder models used to compute the semantic neighbourhood were selected from the top performers on the MTEB leaderboard (Muennighoff et al., 2022) at the time of our experiments.

Different layers within a transformer architecture learn features at varying levels of abstraction, from simple, local patterns in the early layers to complex, semantic concepts in the deeper layers. The interpretability of these features is hypothesised to vary accordingly (Paulo et al., 2024). We select five layers from each model: the initial layer (0), three intermediate layers (8, 17, and 25), and the final layer (41 for Gemma-2-9b and 31 for Llama-3.1-8b). For the Gemma-2-9b model, we computed the fuzzy score ourselves, since it was not available on Neuronpedia, employing Gemini-2.0-flash, which at the time of execution offered the best performance-to-cost ratio among closed-source models. The process amounted to about $100 in API fees. We computed the co-occurrence matrices for both models by processing 50k documents from their respective SAE training datasets, using a chunk size of 256 tokens.

In Fig. 2, we show the distribution of fuzzing scores with Gemini-2.0-Flash against SFAL. Our scoring system generally assigns lower values overall, reflecting a more selective approach in recognising autointerpretations as the correct interpretations of features.

[Figure 2: Comparison of score overall distribution between SOTA methods and SFAL.]

User study design. To evaluate the practical efficacy of our proposed scoring method, we conducted a user study following the human evaluation methodology outlined by Paulo et al. (2024). A pool of four expert users participated in the assessment. We sampled 100 examples of auto-interpretations and their corresponding top activations, following Paulo et al. (2024). These examples were drawn from five distinct layers (20 examples for each layer) of the Gemma-2-9b and Llama-3.1-8b models. Stratification by SFAL scores was employed during sampling to deliberately include examples spanning the full range of potential scores, thus preventing bias towards predominantly positive or negative evaluations and ensuring raters encountered varied levels of interpretation quality. The expert users reviewed the auto-interpretations and the associated top activations for each of the 100 sampled features. Users rated the alignment between the feature's interpretation and activations on a 1-to-4 Likert scale for the soundness and completeness metrics proposed by Sokol and Flach (2020). Soundness refers to how truthful and aligned the generated auto-interpretation is with the actual behaviour and activations of the SAE feature it is meant to explain. Completeness describes how well that auto-interpretation covers and explains all or most significant top activations for that particular feature. To be complete, an auto-interpretation must be sufficiently broad to encompass the feature's diverse manifestations in the data, rather than being narrowly focused on just a few activation examples. Additionally, users reported a confidence score for each rating. The overall median confidence from users was 3, with an interquartile range of 1, for both the Gemma and Llama evaluation sets. To ensure the robustness of our human evaluations, the inter-rater agreement level was quantified using Krippendorff's ordinal $\alpha$. The calculated agreement was 0.64 for the gemma-2-9b set and 0.57 for the llama-3.1-8b set, indicating substantial agreement between the evaluators.
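For reference, the agreement statistic above can be computed with the third-party `krippendorff` package on PyPI (pip install krippendorff); the ratings matrix below is an invented placeholder, not the study's data.

```python
import numpy as np
import krippendorff

# Rows = the four expert raters, columns = rated features; values are
# 1-4 Likert ratings, with np.nan where a rater skipped an item.
ratings = np.array([
    [3, 4, 2, 1, 4, np.nan],
    [3, 4, 2, 2, 3, 4],
    [2, 4, 1, 2, 4, 4],
    [3, 3, 2, 1, np.nan, 4],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's ordinal alpha: {alpha:.2f}")
```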
User scores (soundness and completeness) for each feature were averaged to create a composite human rating. A visual comparison of the score distributions in Fig. 2 shows that fuzz scores appear more skewed and potentially over-optimistic in their assessment of feature interpretability compared to SFAL scores.

We analysed the correlation (Spearman, Pearson, Kendall) between the two sets of averaged human judgments and seven sets of automated scores: those from fuzz scoring and those generated by our proposed alignment-based scoring method, varying the embedding model to assess the consistency of our process. As Table 1 shows, SFAL demonstrated a stronger positive correlation compared to the fuzzing score for all the embedding models tested on the Gemma-2-9b SAEs human evaluation set. However, SFAL slightly underperforms the fuzzing score for Llama-3.1-8b.

[Figure 3: Comparison of score distribution against human judgement. On the left, the computed fuzz score; on the right, the SFAL results.]

| Metric | Gemma-2-9b Pearson | Gemma-2-9b Spearman | Gemma-2-9b Kendall | Llama-3.1-8b Pearson | Llama-3.1-8b Spearman | Llama-3.1-8b Kendall |
|---|---|---|---|---|---|---|
| Fuzz score (Paulo et al., 2024) | 0.47 (***) | 0.56 (***) | 0.40 (***) | 0.59 (***) | 0.60 (***) | 0.44 (***) |
| SFAL Bilingual Emb (Thakur et al., 2020) | 0.63 (***) | 0.62 (***) | 0.45 (***) | 0.53 (***) | 0.56 (***) | 0.41 (***) |
| SFAL gte-Qwen2-7B-instruct (Li et al., 2023) | 0.53 (***) | 0.50 (***) | 0.37 (***) | 0.48 (***) | 0.53 (***) | 0.39 (***) |
| SFAL Qwen3-Emb-8B (prompted) (Zhang et al., 2025) | 0.66 (***) | 0.63 (***) | 0.46 (***) | 0.56 (***) | 0.60 (***) | 0.43 (***) |
| SFAL Qwen3-Emb-8B (Zhang et al., 2025) | 0.66 (***) | 0.63 (***) | 0.47 (***) | 0.49 (***) | 0.55 (***) | 0.39 (***) |
| SFAL Qwen3-Emb-0.6B (Zhang et al., 2025) | 0.64 (***) | 0.62 (***) | 0.46 (***) | 0.45 (***) | 0.53 (***) | 0.37 (***) |
| SFAL Qwen3-Emb-4B (Zhang et al., 2025) | 0.64 (***) | 0.61 (***) | 0.44 (***) | 0.52 (***) | 0.58 (***) | 0.41 (***) |

Table 1: Correlation coefficients (Pearson, Spearman, Kendall) between fuzz and SFAL scores and the human evaluation conducted by expert raters on the Gemma-2-9b and Llama-3.1-8b SAEs. The prompted version of Qwen3 uses an instruction to specialise the embedding for retrieval queries, while the normal version is for general similarity. Significance markers: (*) p ≤ 0.05, (**) p ≤ 0.01, (***) p ≤ 0.001, (N.S.) = not significant (p > 0.05).
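The correlation analysis itself uses standard estimators; a minimal `scipy.stats` sketch follows, with synthetic placeholder arrays standing in for the real score and rating vectors.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(0)
human_ratings = rng.uniform(1, 4, size=100)                   # averaged soundness/completeness
sfal_scores = 0.2 * human_ratings + rng.normal(0, 0.2, 100)   # stand-in automated scores

for name, corr in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat, p = corr(sfal_scores, human_ratings)
    print(f"{name}: r = {stat:.2f}, p = {p:.3g}")
```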
5 Discussion

Autointerpretation quality. As shown in Fig. 2, the distributions of SFAL and the fuzz score differ substantially. SFAL tends to assign lower values overall, showing a more selective behaviour in identifying autointerpretations as the correct interpretation of features.

Bridging semantic and functional evaluation. Our method's correlation with human judgments validates our framework. Instead of relying on costly LLM "oracles", we enforce internal consistency by aligning a feature's semantic meaning with its functional behaviour, derived from co-occurrence statistics. This captures a functional signal that purely semantic checks, often focused on static human-understandability, can overlook (Li et al., 2025a). We note that the semantic-functional alignment assumption can fail in cases where features are functionally correlated yet semantically dissimilar (e.g., a transitive predicate and its direct object), which explains some of the observed noise and highlights the complementary role of SFAL alongside more precise causal methods.

Impact of embedding models employed. Table 1 shows the correlations of the human evaluation with both the fuzz score and SFAL, computed using several embedding models. We assess the consistency of SFAL, varying the encoder used to create the auto-interpretation embedding. Results show that scores are consistently significant across all tested embedding models for both SAE evaluations. Model size appears to be a minor factor in scoring performance, as indicated by the small differences between models within the Qwen family.

6 Conclusion

In this work, we introduced a novel, distributional approach for evaluating the auto-interpretations of SAE features by quantifying the alignment between a feature's semantic and functional neighbourhoods. Unlike traditional methods that rely heavily on expensive and often opaque LLM-based scoring, our approach grounds interpretability assessment in the model's internal structure by capturing functional relationships through co-activation patterns and semantic intent through auto-interpretation embeddings. We demonstrated that this alignment-based metric is not only computationally efficient and scalable but also correlates well with human judgment. By reducing evaluation costs and improving scalability, this work opens the door to more practical and widespread assessments of interpretability in large-scale language models. Future work will explore more expressive similarity metrics and investigate how our method generalises across architectures and domains.

Acknowledgements

Evaluation of the open-source models was conducted on the Leonardo supercomputer with the support of CINECA-Italian Super Computing Resource Allocation, class C project IsCc9_MI-PLE (HP10CIQUBQ).

Limitations

Co-occurrence is a powerful but imperfect proxy for true functional linkage. However, the core contribution of this work is a significant lowering of the cost-utility frontier for auto-interpretation evaluation. We demonstrate performance comparable to expensive, closed-source LLM-based metrics while operating at a fraction of the computational and financial cost. Ultimately, by making robust evaluation economically feasible, our method enables the field to systematically and comprehensively assess millions of features, a critical step toward genuinely understanding and trusting these complex systems. Beyond the reliance on co-activation as a proxy for function, SFAL has two additional limitations: (i) the fixed-K neighbourhoods may not fully adapt to varying sparsity across features, and (ii) our human evaluation involved only a small pool of expert raters, motivating future work on adaptive neighbourhood selection and larger-scale, more diverse user studies.

References

Samir Abdaljalil, Filippo Pallucchini, Andrea Seveso, Hasan Kurban, Fabio Mercorio, and Erchin Serpedin. 2025. SAFE: A sparse autoencoder-based framework for robust query enrichment and hallucination mitigation in LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025.

Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, and Himabindu Lakkaraju. 2024. Towards unifying interpretability and control: Evaluation via intervention. arXiv preprint arXiv:2411.04430.

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Mirko Cesarini, Lorenzo Malandri, Filippo Pallucchini, Andrea Seveso, and Frank Xing. 2024. Explainable AI for text classification: Lessons from a comprehensive evaluation of post hoc methods. Cognitive Computation, 16(6):3077–3095.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.

Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, and Danielle S Bitterman. 2025. Sparse autoencoder features for classifications and transferability. arXiv preprint arXiv:2502.11367.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. 2024. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders. Preprint, arXiv:2410.20526.

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.

Simon Lermen, Mateusz Dziemian, and Natalia Pérez-Campanero Antolín. 2025. Deceptive automated interpretability: Language models coordinating to fool oversight systems. arXiv preprint arXiv:2504.07831.

Aaron J Li, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju. 2025a. Interpretability illusions with sparse autoencoders: Evaluating robustness of concept representations. arXiv preprint arXiv:2505.16004.

Yuxiao Li, Eric J Michaud, David D Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. 2025b. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4):344.

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024a. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024b. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. Preprint, arXiv:2408.05147.

Lorenzo Malandri, Fabio Mercorio, Mario Mezzanzanica, and Filippo Pallucchini. 2025. SENSE: Embedding alignment via semantic anchors selection. International Journal of Data Science and Analytics, 20(1):167–181.

Lorenzo Malandri, Fabio Mercorio, Mario Mezzanzanica, and Andrea Seveso. 2024. Model-contrastive explanations through symbolic reasoning. Decision Support Systems, 176:114040.

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in.
Filippo Pallucchini, Lorenzo Malandri, Fabio Mercorio, and Mario Mezzanzanica. 2025. Lost in alignment: A survey on cross-lingual alignment methods for contextualized representation. ACM Computing Surveys.

Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. 2024. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928.

Daniele Potertì, Andrea Seveso, and Fabio Mercorio. 2025. Can role vectors affect LLM behaviour? In Findings of the Association for Computational Linguistics: EMNLP 2025.

Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. 2023. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 464–483. IEEE.

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. 2025. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.

Kacper Sokol and Peter Flach. 2020. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 56–67.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread.

Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2020. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. arXiv preprint arXiv:2010.08240.

G. Udny Yule. 1912. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579–652.

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.