Paper deep dive
Mechanistic understanding and validation of large AI models with SemanticLens
Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Models: CLIP, ResNet50v2, VGG-16
Abstract
Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens on https://github.com/jim-berend/semanticlens and a demo on https://semanticlens.hhi-research-insights.eu.
Tags
Links
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 7:22:50 PM
Summary
SemanticLens is a universal explanation method for neural networks that maps hidden knowledge encoded by individual components (neurons) into the semantically structured, multimodal space of a foundation model like CLIP. This approach enables automated labelling, systematic search, comparison of model representations, and auditing of decision-making processes to detect spurious correlations, effectively bridging the trust gap in AI systems.
Entities (5)
Relation Signals (3)
SemanticLens → uses → CLIP
confidence 100% · maps hidden knowledge encoded by components... into the semantically structured, multimodal space of a foundation model such as CLIP
SemanticLens → appliesto → Neural Networks
confidence 95% · universal explanation method for neural networks
SemanticLens → integrates → Concept Relevance Propagation
confidence 95% · we use Concept Relevance Propagation (CRP) to identify relevant components and circuits
Cypher Suggestions (2)
Find all methods that utilize foundation models for interpretability. · confidence 90% · unvalidated
MATCH (m:Method)-[:USES]->(f:FoundationModel) RETURN m.name, f.name
Identify applications where SemanticLens is used for validation. · confidence 85% · unvalidated
MATCH (m:Method {name: 'SemanticLens'})-[:APPLIES_TO]->(a:ApplicationDomain) RETURN a.name
Full Text
247,671 characters extracted from source content.
Mechanistic understanding and validation of large AI models with SemanticLens
Maximilian Dreyer 1∗ Jim Berend 1∗ Tobias Labarta 1 Johanna Vielhaben 1 Thomas Wiegand 1,2,3 Sebastian Lapuschkin 1 Wojciech Samek 1,2,3
1 Fraunhofer Heinrich Hertz Institute 2 Technische Universität Berlin 3 BIFOLD – Berlin Institute for the Foundations of Learning and Data
wojciech.samek,sebastian.lapuschkin@hhi.fraunhofer.de

ABSTRACT Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SEMANTICLENS, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SEMANTICLENS is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SEMANTICLENS on https://github.com/jim-berend/semanticlens and a demo on https://semanticlens.hhi-research-insights.eu.
Keywords: Explainable AI, Representations, AI Auditing, Interpretability, Foundation Models

1 Introduction

Technical systems designed by humans are constructed step by step, with each component serving a specific, well-understood function. For instance, an aeroplane's wings and wheels have clear roles, and an edge detection algorithm applies defined signal processing steps like high-pass filtering. Such a construction by synthesis not only helps to understand the system's overall behaviour, but also simplifies the validation of its safety. In contrast, neural networks are developed holistically through optimization, often using datasets of unprecedented scale. While this process yields models with impressive capabilities that increasingly outperform engineered systems, it has a major drawback: it does not provide semantic descriptions of each neuron's function. Especially in high-stakes applications such as medicine or autonomous driving, the sole reliance on the output of the black-box AI model is often unacceptable, as faulty or Clever Hans-type behaviours [1,2,3] may go unnoticed but have serious consequences. Recent regulations, such as the EU AI Act and the U.S. President's Executive Order on AI, underline the need for transparency and conformity assessment. What is urgently needed, therefore, is the ability to understand and validate the inner workings and individual components of AI models [4, 5], as we do for human-engineered systems. Despite progress in fields such as eXplainable Artificial Intelligence (XAI) [6,7] and mechanistic interpretability [8], the automated explanation and validation of model components at scale remains infeasible.

∗ The authors contributed equally.
arXiv:2501.05398v1 [cs.LG] 9 Jan 2025

[Figure 1 graphic: a) Embedding into a structured semantic space. b) What we can do in semantic space, listing the tasks Search, Describe, Compare, Audit alignment, and Evaluate interpretability.]

Figure 1: Embedding the model components in an understandable semantic space allows to systematically and more easily understand the inner workings of large neural networks. a) In order to turn the incomprehensible latent feature space (hidden knowledge) into an understandable representation, we leverage a foundation model F that serves as a semantic expert. Concretely, for each component of the analysed model M, (1) concept examples E are extracted from the dataset, representing samples that induce high stimuli (i.e., activate the component), and (2) embedded in the latent space of the foundation model, resulting in a semantic representation θ. Further, (3) relevance scores R for all components are collected, which illustrate their role in decision-making. b) This new understandable model representation (i.e., the set of θ's, potentially linked to E's and R's) enables to systematically search, describe, structure, and compare internal knowledge of AI models. It further allows to audit alignment to human expectation and opens up ways to evaluate and optimize human-interpretability.
Current approaches are limited in several ways: Firstly, they often strongly depend on human intervention [9], e.g., manual investigation of individual components [10,11] or predictions [12], preventing scaling to large modern architectures and datasets. Secondly, current explanatory methods focus mostly on isolated aspects of the model behaviour and lack a holistic perspective, i.e., they do not illuminate the relations between the data, representation and prediction. It is, for example, not enough to only measure that specific knowledge (e.g., a concept) has been learned [13,14], but also necessary to understand how it is actually used [15,16,17,18] and where in the training dataset it is coming from [19]. Further, available tools are suited to probe for expected concepts [20], but miss the part of a model that encodes other, unexpected concepts, which may interact with the former in non-trivial ways and thus influence model behaviour. Finally, methods that are applicable for ensuring compliance with legal/real-world requirements are scarce [21,22]. Holistic approaches are needed that quantify which parts of a model align with expectation and which do not, thereby revealing spurious and potentially harmful components along with related training data. To address these shortcomings, we propose SEMANTICLENS, a novel method that embeds hidden knowledge encoded by individual components of an AI model into the semantically structured, multimodal space of a foundation model such as CLIP [23]. Our approach not only enables understanding, but also allows measuring how knowledge is used for inference, which internal representations are encoding the knowledge, and which training data are relevant.
This embedding is realized by two mappings:

(1) components → concept examples: For each component (e.g., neuron) of model M, we collect a set of examples E (e.g., highly activating image patches) representing the concepts encoded by this component.

(2) concept examples → semantic space: We embed each set of examples E into the semantic space S of a multimodal foundation model F such as CLIP [23]. As a result, each single component of model M is represented by a vector θ ∈ S in the semantic space of model F.

In addition, we use Concept Relevance Propagation (CRP) [15] to identify relevant components and circuits for an individual model prediction, forming a third mapping:

(3) prediction → components: Relevance scores R quantify the contributions of model components to individual predictions y = M(x) on data points x.

By mutually connecting the model representation (M), the relevant (training) data (E), the semantic interpretation (F) and the model prediction (y), SEMANTICLENS offers a holistic approach, which enables one to systematically analyse the internals of AI models and their prediction behaviours in a scalable manner without the need of having a human-in-the-loop [24], as illustrated in Fig. 1. The multimodal foundation model F serves as a "semantic expert" for the data domain under consideration, effectively representing the model M as a comparable and searchable vector database, i.e., as a set of θ (one vector θ for every neuron), potentially linked to sets of E and R. This enables new capabilities to answer questions about M:

Search efficiently via text or other modalities for encoded knowledge, pinpointing corresponding components and data samples (see Section 4.1.1 and Supplementary Note C).

Describe at scale what concepts the model has learned, which are missing, and how it is using its knowledge for inference (see Section 4.1.2 and Supplementary Note D).

Compare learned concepts across models, varying architectures, or training procedures (see Section 4.1.3 and Supplementary Note E).
Audit alignment of the model's encoded knowledge with expected human-defined concepts (see Sections 4.2 and 4.3 and Supplementary Note F).

Evaluate human-interpretability of the hidden network components in terms of "clarity", "polysemanticity" and "redundancy" (see Section 4.4 and Supplementary Note G).

More details and specific example questions are summarized in Tab. 1. Ultimately, the transformation of the model into a semantic representation, which not only reveals what and where knowledge is encoded but also connects it to the (training) data and decision-making, enables systematic validation and opens up new possibilities for more robust and trustworthy AI.

Table 1: Overview of questions which can be answered by SEMANTICLENS. The workflow to answer each question is provided in Suppl. Table H.1.

Type | Question to the model M | Results
search | "Has my model learned to encode a specific concept?" — via convenient "search-engine"-like text or image descriptions | Fig. 2a and Suppl. Figs. C.1 to C.3
search | "Which components have encoded a concept, how is it used, and which data is related?" | Fig. 2d
describe | "What concepts has my model learned?" — in a structured, condensed and understandable manner via textual descriptions | Fig. 2b and Suppl. Figs. D.1 to D.5
describe | "What and how are concepts contributing to a decision?" — by visualizing concept interactions throughout the model | Fig. 2c and Suppl. Fig. D.7
describe | "What do I not yet understand of my model?" — offering to understand the unexpected concepts, their role for the model, and their origin in data | Fig. 2d and Suppl. Figs. F.4 to F.11
compare | "What concepts are shared between two models, and which are unique to each one?" — by comparing learned concepts qualitatively and quantitatively | Suppl. Figs. E.1 and E.2
compare | "How do my model's concepts change when changing the architecture or training?" — by comparing and tracking semantics of components | Suppl. Figs. E.1 and D.4
audit | "Is my model relying on valid information only?" — by separating learned concepts into valid, spurious or unexpected knowledge | Figs. 3 and 4 and Suppl. Fig. F.1
evaluate | "How interpretable is my model?" — with easy to compute measures | Fig. 5b
evaluate | "How can I improve interpretability of my model?" — by evaluating interpretability measures when changing model architecture or training procedure | Fig. 5c and Suppl. Tables G.1 to G.5

2 Related Work

SEMANTICLENS is a holistic framework that enables a systematic concept-level understanding of large AI models. Its core elements rely on previous research advances related to concept visualization, labelling, attribution, comparison, discovery, and audits, as well as human-interpretability measures.

Feature Visualization. To describe the role of individual components of a neural network, input images (referred to as "concept examples" in this work) are commonly sought that maximize their activation [25,26,27,14,16,28]. Concept examples can either be generated synthetically using gradient-based approaches [29,26,30,31,32] or diffusion models [33], or, alternatively, selected from a test dataset by collecting neuron activations during predictions [16,15,28,14]. As synthetic concept examples often result in data samples that are out of the training data distribution, we select examples from the original test dataset. Notably, as multiple distracting features can be present in test data samples, we further use CRP [15] to crop full samples to more meaningful image patches with less irrelevant features.

Concept Labelling. Various methods aim at labelling the concepts of individual neurons, which allows for easier interpretation of concepts and their corresponding examples, as well as quantitative evaluations. One group of approaches is purely based on activation patterns, such as Network Dissection [14] or INVERT [34], which require a large set of data annotations.
Notably, CLIP-Dissect [35] circumvents the requirement for costly concept annotation by using a multimodal foundation model to generate soft data labels. Other methods, such as ours, operate on the set of maximally activating images for a neuron, hereby relying on other vision-language models [36, 35, 18, 37, 38].

Concept Importance Scores. To not only understand what concepts have been learned, but also how concepts are used, importance scores wrt. predicted outputs (or upper-level component activations) can be computed. Here, various traditional feature attributions can be extended to compute importance scores of concepts [39,17]. We adhere to the CRP framework for computing relevance scores of singular components or groups thereof.

Concept Comparison. Various popular approaches exist that measure alignment between representation spaces of neural networks, including Centered Kernel Alignment [40], attention (map) patterns [41,42,43] or "concept embeddings" (i.e., weights for neuron activations to detect specific concepts in the data) as in Net2Vec [44]. The approaches above only provide a single scalar for the overall alignment between two representation spaces. In contrast, other works (including ours) also enable similarity analysis between single concepts, allowing, e.g., to identify which concepts models share and in which concepts they differ. Similarities between single concepts can be based on activation patterns [45,46,47], relevance patterns [15] or concept example embeddings [48] as in SEMANTICLENS.

Concept Discovery. Whereas early works show that neurons often encode human-understandable concepts [14,26], other works argue that linear directions (or subspaces) in latent feature space are more interpretable and disentangled [44,49,50]. In fact, neurons can be redundant and polysemantic (encoding multiple concepts), which directions might be less prone to [51,52].
Recent research focuses on Sparse Autoencoders (SAEs) [53] or activation factorization [17] to obtain more disentangled representations, for which, again, concept examples and concept relevance scores can be computed. Whereas we focus in this work on the neural basis, SEMANTICLENS is thus also applicable to SAEs or factorized activations.

Human-Interpretability Measures. The work of Network Dissection [14] evaluates interpretability indirectly by the degree to which neurons align to a large set of expected concepts. Later works leverage feature spaces of large models, where the concept examples of individual neurons are encoded. Specifically, [54,55] introduce measures related to polysemanticity, [38,56,54] related to clarity (or coherence), and [57] related to redundancy. Recently, measures to capture concept complexity have also been introduced [58]. The semantic embedding of concept examples forms an integral part of SEMANTICLENS and allows us to provide a set of interpretability measures related to concept clarity, polysemanticity and redundancy in Section 3.5.

Concept Audits. Established methods for evaluating and auditing latent feature spaces of neural networks are TCAV [20] or linear probes [59]. Both are based on trying to detect a signal (linear direction) in the latent activations that can be associated with a specific user-defined concept of interest. Whereas linear probes only detect that a certain concept is encoded by a model, TCAV also allows to quantify whether it is actually used [60]. However, the part of the model that is not covered by the (set of) expected concept(s) is not studied, which could also incorporate various other spurious concepts. SEMANTICLENS fills this gap and provides quantification of which concepts are valid, spurious, and not yet identified (unexpected).
Explanation Frameworks. Instead of focusing on individual aspects, explanation frameworks combine multiple interpretability aspects and enable a more holistic understanding of model and data. For example, CRP [15] or CRAFT [16] combine feature visualization and attribution, but do not include labelling. CLIP-Dissect [35], on the other hand, leverages foundation models such as CLIP [23] to label neurons, but does not investigate how concepts are actually used during inference. Based on the semantic embedding of model components, SEMANTICLENS represents a more comprehensive and holistic framework compared to previous works, enabling one to systematically search, label, compare, describe and evaluate the inner mechanics of large AI models. In Supplementary Note A we provide a detailed overview of other frameworks including NetDissect [14], Net2Vec [44], TCAV [20], Summit [61], CLIP-Dissect [35], CRP [15], CRAFT [16], PCX [39], FALCON [38], ConceptEvo [48], SpuFix [62], W [18] and MAIA [63].

3 Methods

SEMANTICLENS embeds each component of a neural network into a semantic space. This embedding is realized in two steps, as described in the following subsections.

3.1 Describing the Role of Neurons via Concept Examples

To describe the role of a neuron, highly activating data samples are retrieved from the (training) database. Since the concept represented by the neuron can only occur in a small part of a large input sample, we leverage the CRP framework [15] to identify the relevant part of the input and crop each data sample to exclude input features with less than 1 % of the highest attribution value, as illustrated in Supplementary Fig. H.1a. For Vision Transformers (ViTs) the CRP method is not available yet; therefore we approximate attributions by up-sampled spatial maps, as discussed in Supplementary Note H.
Thus, concept examples for neuron k in layer ℓ are retrieved as

$$E_{k,\ell} = \{\,\mathrm{CRP}(x) : x \in \mathrm{top}_m(M^\ell_k, \mathcal{D})\,\},\qquad(1)$$

where the latent activations at layer $\ell \in \{1,\dots,n\}$ with $k_\ell \in \mathbb{N}^+$ neurons are given by $M^\ell: \mathcal{X} \to \mathcal{Z}^\ell \in \mathbb{R}^{k_\ell}$, $\mathrm{CRP}$ denotes the cropping operation, and $\mathrm{top}_m$ selects the $m$ maximally activating samples of dataset $\mathcal{D} \subset \mathcal{X}$.

3.2 Transformation into a Semantic Space

In the second step, SEMANTICLENS generates a universal semantic representation for each model component based on the concept examples. To this end, we employ a foundation model F that serves as a semantic expert of the data domain, operating on the set of concept examples $E_k$. As illustrated in Fig. 1a for step (2), we obtain the semantic representation of the k-th neuron in layer ℓ as a single vector $\theta_k$ in the latent space S of foundation model F (index ℓ omitted for the sake of clarity):

$$\theta_k := \mathbb{E}_{x \sim E_k}[F(x)] \approx \frac{1}{|E_k|} \sum_{x \in E_k} F(x) \in \mathcal{S} \subseteq \mathbb{R}^d.\qquad(2)$$

Computing the mean over the individual feature vectors $\{F(x)\}_{x \in E_k}$ (as also proposed in [48]) is usually more meaningful than using individual vectors (e.g., for labelling as in Supplementary Note D). Averaging embeddings can be viewed as a smoothing operation, where noisy background signals are reduced, resulting in a better representation of the overall semantic meaning. Setting $|E_k| = 30$ results in converged $\theta_k$ throughout ImageNet experiments, as detailed in Supplementary Note D.

From Semantic Space to Model, Predictions and Data. The semantic space representation is inherently connected to the model components, which are themselves linked to model predictions and the data, as illustrated in Supplementary Fig. H.1c. We can thus identify all neurons that correspond to a concept (via search as detailed in Section 3.3), filter this selection to the ones relevant for a decision output using CRP (via neuron-specific relevance scores R per data point, see step (3) in Fig. 1a), and lastly, identify all data (i.e., E) which highly activate the corresponding group of model components.
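As a rough illustration of Eqs. (1) and (2), the selection of maximally activating examples and the mean-embedding step can be sketched in a few lines of NumPy. This is not the authors' released code: the array names and shapes are assumptions, and the CRP-based cropping of each sample is omitted.

```python
import numpy as np

def concept_embedding(activations, features, m=30):
    """Sketch of Eqs. (1)-(2): for each neuron k, select the m samples with the
    highest activation (a stand-in for top_m; the paper additionally crops each
    sample via CRP, which is omitted here) and average their foundation-model
    embeddings F(x) into a single semantic vector theta_k.

    activations: (n_samples, n_neurons) latent activations per neuron
    features:    (n_samples, d) foundation-model embeddings F(x)
    returns:     (n_neurons, d) semantic embeddings theta
    """
    n_samples, _ = activations.shape
    m = min(m, n_samples)
    # indices of the m maximally activating samples, per neuron (column)
    top_idx = np.argsort(-activations, axis=0)[:m]        # (m, n_neurons)
    # theta_k = mean over F(x) for x in E_k
    theta = features[top_idx].mean(axis=0)                # (n_neurons, d)
    return theta
```

In practice `features` would come from the image encoder of a foundation model such as Mobile-CLIP, evaluated on the same dataset as `activations`.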
3.3 Concept Search, Labelling and Comparison

As semantic embeddings θ are elements of a vector space, we measure similarity s directly via cosine similarity, as is also the design choice of CLIP [23]:

$$s_{\cos}: \mathbb{R}^d \times \mathbb{R}^d \to [-1,1],\quad (x,y) \mapsto \frac{\langle x,y\rangle}{\|x\|_2\,\|y\|_2}.\qquad(3)$$

Search: Given a set of semantic embeddings of model components $V_M = \{\theta_1,\dots,\theta_k\}$ and an additional probing embedding $\theta_{\text{probe}}$ representing a sought-after concept, we can now search for model components encoding the concept via

$$\theta^{*} = \operatorname*{argmax}_{\theta \in V_M}\left[s(\theta_{\text{probe}},\theta) - s(\theta_{\langle\rangle},\theta)\right],\qquad(4)$$

where we additionally subtract the similarity to a "null" embedding $\theta_{\langle\rangle}$ representing background (noise) present in the concept examples, if available. For text, e.g., it is common to subtract the embedding of the empty template to remove its influence [18], leading to more faithful labelling in Supplementary Note D.4.

Label: In order to label the model representation $V_M$, a set of predefined concepts is embedded, resulting in $V_{\text{probe}} := \{\theta^{\text{probe}}_1,\dots,\theta^{\text{probe}}_l\} \subset \mathbb{R}^d$. Analogously to Eq. (4), each neuron is assigned the most aligned label from the pre-defined set, or none if the similarity falls below a certain threshold.

Compare: Two models N and M may be quantitatively compared via the number of neurons that were assigned to concept labels, as introduced by NetDissect [14], or by measuring set similarity $S_{V_M \to V_N}$ based on average maximal pairwise similarity:

$$S_{V_M \to V_N} = \frac{1}{|V_M|}\sum_{\theta \in V_M}\max_{\theta' \in V_N} s(\theta,\theta'),\qquad(5)$$

which quantifies the degree to which the knowledge (semantics) encoded in model M is also encoded in model N. Notably, another means for comparison constitute the interpretability measures detailed in Section 3.5.

3.4 Auditing Concept Alignment

As outlined in Section 4.2, it is important to measure how well the used concepts of a model are aligned with expected behaviour.
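The search of Eq. (4) and the set-similarity comparison of Eq. (5) amount to simple cosine-similarity arithmetic over the embedding vectors. The following is an illustrative NumPy sketch, not the reference implementation; plain vectors stand in for CLIP embeddings, and the function names are assumptions.

```python
import numpy as np

def cosine(a, b):
    # s_cos(x, y) = <x, y> / (||x||_2 ||y||_2), cf. Eq. (3)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def search(thetas, theta_probe, theta_null=None):
    """Sketch of Eq. (4): rank neuron embeddings by similarity to a probe
    concept, optionally subtracting the similarity to a 'null' (background)
    embedding. Returns the best-matching index and all scores."""
    scores = np.array([cosine(theta_probe, t) for t in thetas])
    if theta_null is not None:
        scores -= np.array([cosine(theta_null, t) for t in thetas])
    return int(np.argmax(scores)), scores

def set_similarity(V_M, V_N):
    """Sketch of Eq. (5): average, over embeddings of model M, of the maximal
    similarity to any embedding of model N (S_{V_M -> V_N})."""
    return float(np.mean([max(cosine(t, u) for u in V_N) for t in V_M]))
```

Note that $S_{V_M \to V_N}$ is asymmetric: it measures how much of M's knowledge is covered by N, not the reverse.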
In order to compute concept alignment, we require a set of model embeddings $V_M$, and sets of expected valid and spurious semantic embeddings $V_{\text{valid}}$ and $V_{\text{spur}}$, respectively. For each model component k, we then compute the alignment scores

$$a^{\text{valid}}_k = \max_{\theta_v \in V_{\text{valid}}}\left[s(\theta_v,\theta_k) - s(\theta_{\langle\rangle},\theta_k)\right],\qquad a^{\text{spur}}_k = \max_{\theta_s \in V_{\text{spur}}}\left[s(\theta_s,\theta_k) - s(\theta_{\langle\rangle},\theta_k)\right].$$

Additionally, it is important to take into account how the components are used. We thus propose to retrieve the relevance of each model component during inference, e.g., the relevance for predictions of a specific class. Optimally, all relevant components are aligned to valid concepts only, i.e., $a^{\text{valid}} > 0$ and $a^{\text{spur}} < 0$. A high spurious alignment score $a^{\text{spur}} > 0$ indicates potentially harmful model behaviour. Neurons aligned to neither should be examined more closely, representing unexpected concepts.

3.5 Human-Interpretability Measures for Concepts

We now introduce measures to capture the human-interpretability of concepts.

3.5.1 Concept Clarity

The clarity measure aims to represent how easy it is to understand the role of a model component, i.e., how easy it is to grasp the common theme of the concept examples. Intuitively, clarity is low when there are many distracting (background) elements in the concept examples. Further, clarity is low when a concept is very abstract and many, at first glance, unrelated elements are shown throughout the examples. To measure clarity, we compute semantic similarities within the set of concept examples, inspired by [54,38,56]. Cosine similarity serves here as a measure of how semantically similar two samples are in the latent space of the used foundation model.
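The alignment scores, and the resulting triage of components into valid, spurious, and unexpected, can be illustrated as follows. This is a sketch under assumptions: the helper names are invented, and the hard zero thresholds follow the inequalities above but ignore the relevance weighting the paper also proposes.

```python
import numpy as np

def _cos(a, b):
    # cosine similarity of two embedding vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def alignment_scores(theta_k, V_valid, V_spur, theta_null):
    """a_k^valid and a_k^spur: maximal null-corrected similarity of component
    k's embedding to the expected valid / spurious concept embeddings."""
    base = _cos(theta_null, theta_k)
    a_valid = max(_cos(tv, theta_k) - base for tv in V_valid)
    a_spur = max(_cos(ts, theta_k) - base for ts in V_spur)
    return a_valid, a_spur

def triage(a_valid, a_spur):
    # Optimally a_valid > 0 and a_spur < 0; a_spur > 0 hints at harmful
    # behaviour; components aligned to neither are flagged as 'unexpected'.
    if a_spur > 0:
        return "spurious"
    if a_valid > 0:
        return "valid"
    return "unexpected"
```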
For the overall clarity score of neuron k, we compute the average pair-wise semantic similarity of the individual feature vectors $V_k = \{v_{k,i}\}_i$:

$$I_{\text{clarity}}(V_k) := \frac{1}{|V_k|(|V_k|-1)}\sum_{i=1}^{|V_k|}\sum_{j \neq i} s_{\cos}(v_{k,i}, v_{k,j})\qquad(6)$$

$$= \frac{|V_k|}{|V_k|-1}\left(\left\|\frac{1}{|V_k|}\sum_{i=1}^{|V_k|}\frac{v_{k,i}}{\|v_{k,i}\|_2}\right\|_2^2 - \frac{1}{|V_k|}\right) \in \left[-\frac{1}{|V_k|-1},\, 1\right],\qquad(7)$$

where the last expression is a formulation that is computationally less expensive and circumvents the need to compute large similarity matrices.

3.5.2 Concept Similarity and Redundancy

The semantic representation allows conducting comparisons across arbitrary sets of neurons without being restricted to neurons from identical layers or model architectures. In particular, it allows us to assess the degree of similarity between the concepts of two neurons k and j, which we define as

$$I_{\text{sim}}(\theta_k,\theta_j) := s_{\cos}(\theta_k,\theta_j) = \frac{\langle\theta_k,\theta_j\rangle}{\|\theta_k\|_2\,\|\theta_j\|_2} \in [-1,1]\qquad(8)$$

via cosine similarity. Based on similarity, we can further assess the degree of redundancy across the concepts of m neurons with the θ set $V = \{\theta_1,\dots,\theta_m\}$, which we define as

$$I_{\text{red}}(V) := \frac{1}{m}\sum_{k=1}^{m}\max_{j \neq k} I_{\text{sim}}(\theta_k,\theta_j) \in [-1,1].\qquad(9)$$

Notably, semantic redundancy might not imply functional redundancy. Even though two semantics are similar, they might correspond to different input features. For example, the stripes of a zebra or a striped boarfish are semantically similar, but might be functionally very different for a model that discerns both animals.

3.5.3 Concept Polysemanticity

A neuron is considered polysemantic if multiple semantic directions exist in the concept example set. Formally, we define a neuron as polysemantic if subsets of $E_k$ can be identified that provide diverging θs.
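Both formulations of the clarity score, Eqs. (6) and (7), and the redundancy measure of Eq. (9) are straightforward to sketch in NumPy. This is an illustrative version rather than the authors' code; it also serves as a sanity check that the closed form of Eq. (7) matches the pairwise average of Eq. (6).

```python
import numpy as np

def clarity_pairwise(V):
    """Eq. (6): average pairwise cosine similarity of the example
    embeddings v_{k,i} of one neuron (rows of V), excluding i == j."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    m = len(U)
    S = U @ U.T
    return (S.sum() - m) / (m * (m - 1))   # subtract the diagonal of ones

def clarity_fast(V):
    """Eq. (7): closed form |V|/(|V|-1) * (||mean(u_i)||^2 - 1/|V|),
    which avoids building the m x m similarity matrix."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    m = len(U)
    mu = U.mean(axis=0)
    return m / (m - 1) * (mu @ mu - 1.0 / m)

def redundancy(thetas):
    """Eq. (9): mean over neurons of the maximal cosine similarity to any
    other neuron's semantic embedding."""
    U = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)
    S = U @ U.T
    np.fill_diagonal(S, -np.inf)           # exclude self-similarity
    return float(S.max(axis=1).mean())
```

The closed form follows because, for unit vectors, the sum of all pairwise inner products equals the squared norm of their sum minus the m diagonal terms.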
The polysemanticity measure is defined as

$$I_{\text{poly}}(V^{(1)}_k,\dots,V^{(h)}_k) := 1 - I_{\text{clarity}}\left(\Big\{\textstyle\sum_{v \in V^{(i)}_k} v \;\Big|\; i = 1,\dots,h\Big\}\right),\qquad(10)$$

where $V^{(i)}_k \subseteq V_k$ for $i = 1,\dots,h$ is a subset of the embedded concept examples, generated by an off-the-shelf clustering method, and we use h = 2 throughout experiments. Alternatively, as proposed by [54], polysemanticity can be measured as an increase in the clarity of each set of concept examples, which, however, performs worse in the user study evaluation, as detailed in Supplementary Note G.1.

4 Results

We begin in Section 4.1 by demonstrating how to understand the internal knowledge of AI models by searching and describing the semantic space. These functionalities provide the basis for effectively auditing alignment of the model's reasoning wrt. human expectation in Section 4.2. We demonstrate how to spot flaws in medical models and improve robustness and safety in Section 4.3. Lastly, computable measures for human-interpretability of model components are introduced, enabling one to rate and improve interpretability at scale in Section 4.4. The different sets of experiments reported in this paper were conducted on a variety of models, including convolutional neural networks with ResNet and VGG architectures as well as different ViTs. Additionally, we used two large vision datasets, namely ImageNet [64] and ISIC 2019 [65], along with several foundation models, including Mobile-CLIP [66], DINOv2 [67] and WhyLesionCLIP [68]. Further details about the experimental setting can be found in Supplementary Note B. Additional analyses are reported in Supplementary Notes C to G.

4.1 Understanding the Inner Knowledge of AI Models

In the following, SEMANTICLENS is used to systematically analyse the knowledge encoded by neural network components of ResNet50v2 [69] trained on the ImageNet classification task [64].
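The polysemanticity measure for h = 2 can be sketched as follows. Several details here are assumptions: the paper leaves the clustering method open, so a minimal Lloyd-style 2-means (and its empty-cluster guard) stands in for it, and the function names are invented for illustration.

```python
import numpy as np

def _two_means(V, iters=20, seed=0):
    # minimal stand-in for an off-the-shelf clustering method (h = 2 clusters)
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(len(V), size=2, replace=False)].copy()
    labels = np.zeros(len(V), dtype=int)
    for _ in range(iters):
        d = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for i in (0, 1):
            if (labels == i).any():
                centers[i] = V[labels == i].mean(0)
    if (labels == 0).all() or (labels == 1).all():
        # guard: force two non-empty clusters by moving the farthest point out
        far = ((V - centers[labels[0]]) ** 2).sum(-1).argmax()
        labels[far] = 1 - labels[0]
    return labels

def polysemanticity(V_k, seed=0):
    """Sketch of Eq. (10) with h = 2: cluster the embedded concept examples,
    sum the embeddings within each cluster, and return 1 minus the cosine
    similarity of the two cluster-sum vectors (the clarity of a 2-element
    set). Values near 0 indicate a single coherent direction; large values
    indicate diverging semantic directions, i.e. polysemanticity."""
    V = np.asarray(V_k, dtype=float)
    labels = _two_means(V, seed=seed)
    s0 = V[labels == 0].sum(axis=0)
    s1 = V[labels == 1].sum(axis=0)
    cos = (s0 @ s1) / (np.linalg.norm(s0) * np.linalg.norm(s1))
    return 1.0 - cos
```

Summing within clusters (rather than averaging each example separately) means background noise shared by both clusters largely cancels out of the comparison, which is the same smoothing argument used for Eq. (2).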
The individual components of the model are embedded as vectors θ into the multimodal and semantically organized space of the Mobile-CLIP foundation model [66], as illustrated in Fig. 1 and described in Section 3.

4.1.1 Search: Finding the Needle in the Haystack

The first capability of SEMANTICLENS that we demonstrate is its search capability, allowing one to quickly browse through all neurons of the ResNet50v2 model and identify concepts that a user is interested in, such as potential biases (e.g., gender or racial), data artefacts (e.g., watermarks) or specific knowledge. The search is based on (cosine) similarity comparison between a probing vector θ_probe, representing the concept we are looking for (e.g., the concept person), and the set of embedded neurons (i.e., θ's) of the ResNet model. The shared vision-text embedding space of Mobile-CLIP allows us to query concepts described through images (an image of a person) as well as concepts described by text (the textual input "person"). More details about the creation of the probing vectors and the retrieval process can be found in Section 3. As illustrated in Fig. 2a, neurons of the ResNet50v2 model can be identified that encode person-related concepts. Two embedded neurons, which are most similar to the probing vector, represent different, non-obvious and potentially discriminative aspects of a person, such as "hijab" (neuron #1216) and "dark skin" (neuron #1454). It is in principle a valid strategy to represent different object subgroups sharing certain visual features by specialized neurons. However, if these "sensitive attribute"-encoding neurons are used for other purposes, e.g., the "dark skin" person-neuron is used for classification of "steel drum" (see Fig. 3b), then this may hint at potential fairness issues. We also query the model for the concept watermark. The retrieved neurons encode watermarks and other text superimposed on an image.
Such data artefacts may become part of the model's prediction strategy, known as shortcut learning [12, 70] or the Clever Hans phenomenon [15], and massively undermine its trustworthiness (i.e., the model predicts right but for the wrong reason [71]). While previous works have unmasked such watermark-encoding neurons more or less by chance [15, 72], SemanticLens allows one to intentionally query the model for the presence of such neurons. In addition to searching for bias- or artefact-related neurons, we can also query the model for specific knowledge, e.g., the concept bioluminescence. The results show that this concept has been learned by the ResNet50v2 model. Such specific knowledge queries can help ensure that the model has learned all the relevant concepts needed to solve a task, as demonstrated for the ABCDE-rule in melanoma detection in Section 4.2. Notably, SemanticLens not only allows one to query the model for specific concepts, but also to identify the output classes for which concepts are used and the respective (training) data, as later shown in Fig. 2d. Additional examples, comparisons between models, and details are provided in Supplementary Note C.

4.1.2 Describe: What Knowledge Exists and How Is It Used?

Another feature of SemanticLens is its ability to describe and systematically analyse what knowledge the model has learned and how it is used. Fig. 2b provides an overview of the ResNet50v2 model's internal knowledge (penultimate-layer components) as a UMAP projection of the semantic embeddings θ. Here, e.g., searching for animal results in aligned embeddings on the left (indicated by red colour), whereas transport-related embeddings are located in the centre (coloured blue). Even more insights can be gained when systematically searching and annotating semantic embeddings, as described in the following.
Labelling and Categorizing Knowledge. To structure the learned knowledge systematically, we assign a text-form concept label (from a user-defined set) to a neuron embedding if its alignment exceeds the alignment with a baseline, which is an empty text label. The labelled embeddings can then be grouped according to their annotation, e.g., all embeddings matching dog are grouped together, which reduces complexity, especially if many neurons with similar semantic embeddings exist. In fact, the ResNet yields over a hundred neurons related to dog, as illustrated in Fig. 2b, where the overall top-aligned label from the expected set is provided for clusters of semantic embeddings θ. Further details (including labels) and examples are provided in Supplementary Notes D.1 and D.2, respectively.

Figure 2: SemanticLens allows one to systematically understand the internal knowledge and inference of neural networks. a) Via search-engine-like queries, one can probe for knowledge referring to, e.g., (racial) biases, data artefacts, or specific knowledge of interest. b) A low-dimensional UMAP projection of the semantic embeddings provides a structured overview of the model's knowledge, where each point corresponds to the encoded concept of a model component. By searching for human-defined concepts, we can add descriptions to all parts of the semantic space. c) Having grouped the knowledge into concepts, attribution graphs reveal where concepts are encoded in the model and how they are utilized (and interconnected) for inference. For predicting Ox, we learn that ox-cart-related background concepts are used. Importantly, we can also identify relevant knowledge that could not be labelled and should be manually inspected by the user. d) The set of unexpected concepts includes Indian person, palm tree, and watermark concepts, which correlate in the dataset with Ox.
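The labelling rule above (assign the best-matching label only if its alignment beats the empty-text baseline) can be sketched as follows; the embeddings here are synthetic stand-ins for the CLIP text and neuron embeddings:

```python
import numpy as np

def _norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def label_neurons(thetas, label_embs, label_names, baseline_emb):
    """Assign each neuron its best-aligned text label, but only if that
    alignment beats the empty-text baseline; otherwise mark it 'unexpected'."""
    sims = _norm(thetas) @ _norm(label_embs).T    # neurons x labels
    base = _norm(thetas) @ _norm(baseline_emb)    # alignment with the baseline
    best = sims.argmax(axis=1)
    return [label_names[b] if sims[i, b] > base[i] else "unexpected"
            for i, b in enumerate(best)]

rng = np.random.default_rng(1)
label_names = ["dog", "wheel", "palm tree"]
label_embs = rng.normal(size=(3, 512))
baseline = rng.normal(size=512)   # stands in for the empty-prompt embedding
thetas = np.stack([
    label_embs[0] + 0.1 * rng.normal(size=512),  # a neuron encoding 'dog'
    baseline + 0.1 * rng.normal(size=512),       # aligns with no expected label
])
labels = label_neurons(thetas, label_embs, label_names, baseline)
```

Neurons that fall into the "unexpected" bucket are exactly the candidates for manual inspection discussed for Fig. 2d.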
We can further find other affected output classes, e.g., "butcher shop", "scale" and "rickshaw" for the Indian person concept. It is further possible to "dissect" [14] a model's knowledge at different levels of complexity, ranging from broad categories such as "objects" and "animals" to more fine-grained concepts such as "bicycle" or "elephant". For instance, in Supplementary Note D, we categorize the model components relevant to the "Ox" class into "breeds" like Water Buffalo, "work"-related concepts such as ploughing, and "physical attributes" such as horns. Importantly, labelling not only facilitates the assessment of what the model has learned but also identifies gaps in its knowledge, i.e., cases where no neuron aligns with a user-defined concept. In the studied ResNet model, for instance, no neuron encodes the Ox breeds Angus and Hereford, indicating areas where additional training data could enhance model performance. Notably, the faithfulness of labels is important [73], which is evaluated in Supplementary Note D.4.

Understanding How Knowledge is Used. Understanding how the model uses the learned knowledge is as crucial as knowing what knowledge exists. For example, while wheels can be a valid concept to detect sports cars, it should not be relevant for detecting an Ox, which is, however, measurable for the ResNet. Fig. 2c shows the attribution graph for the class Ox. The graph is constructed from the conditioned relevance scores computed with CRP [15] and reveals associations between neuron groups with the same concept label. For the class Ox, the attribution graph in Fig. 2c reveals, next to the wheel concept, another highly relevant long fur concept encoded by neuron #179 in layer 3, which in turn relies in the next lower-level layer on a grass concept, indicating that neuron #179 encodes long-furred animals on green grass.
Attribution graphs thus not only describe what concepts are used and how, but also enhance our understanding of sub-graphs ("circuits") within the model. A full attribution graph is detailed in Supplementary Note D.5.

The Link Between Knowledge, Data and Predictions. Notably, some components did not align with any of the pre-defined text-based concepts, yielding embedding similarities that were equal to or lower than those obtained using an empty text prompt as a baseline. As shown in Fig. 2d, manual inspection of these unexpected concepts reveals associations to Indian person, palm tree and watermark, traced to neurons #179, #1569 and #800 in layer 3, respectively. All three concepts correspond to spurious correlations in the dataset, e.g., farmers using an Ox to plough a field, palm trees in the background or a watermark overlaid on images, where the responsible training data can generally be identified by retrieving the highly activating samples E. The plot further shows other ImageNet classes for which the neurons are highly relevant. Affected classes include "butcher shop", "scale", and "rickshaw" for Indian person; "thatch", "bell cote", and "swim trunk" for palm tree; and "Lakeland Terrier", "bulletproof vest", and "safe" for watermark. By inherently connecting data, model components, and predictions, SemanticLens constitutes an effective and actionable tool for model debugging, further described in Section 4.3.

4.1.3 Compare: Identify Common and Unique Knowledge

So far, we have investigated a single model in semantic space. However, the semantic space serves as a unified space in which multiple models of different architectures, different layers or model parts can be embedded and compared. As such, the influence on learned concepts of changing the network architecture or training hyperparameters, such as the training duration, can be studied.
In Supplementary Note E, two ResNet50 models trained on ImageNet, where one (the ResNet50v2) is trained more extensively and achieves higher test accuracy, are compared using SemanticLens. As illustrated in Supplementary Fig. E.1, both models share common knowledge, e.g., bird-related concepts. However, whereas the better-trained ResNet50v2 has learned more specific concepts, e.g., specific fur textures of dogs, the other has learned more abstract concepts that are shared across classes. For the dog breed "Komondor", which has a white mop-like coat, for example, the ResNet50 has learned a mop-like concept that is used to detect "Komondor" as well as "mop", whereas the ResNet50v2 learned a class-specific concept. This is in line with works that study the generalization of neural networks over long training regimes, observing that latent model components become more structured and class-specific [74]. We further provide quantitative comparisons via network dissection in Supplementary Note D.3. Alternatively, SemanticLens also allows models to be compared quantitatively without access to concept labels, by evaluating the similarity between the models' knowledge. In Supplementary Note E, we discuss the alignment of various pre-trained neural networks across layers and architectures.
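A label-free comparison of two embedded models can be sketched as below: for each neuron of one model we take its best cosine match among the other model's neurons and average. This is an illustrative metric under our own assumptions, not necessarily the exact similarity used in Supplementary Note E:

```python
import numpy as np

def knowledge_overlap(thetas_a, thetas_b):
    """Mean best-match cosine similarity from model A's neuron embeddings
    to model B's: a label-free score of shared learned knowledge."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = norm(thetas_a) @ norm(thetas_b).T
    return sims.max(axis=1).mean()

rng = np.random.default_rng(2)
shared = rng.normal(size=(100, 512))                 # concepts both models learn
model_a = shared + 0.05 * rng.normal(size=(100, 512))
model_b = shared + 0.05 * rng.normal(size=(100, 512))
unrelated = rng.normal(size=(100, 512))              # a model with disjoint concepts
overlap_trained = knowledge_overlap(model_a, model_b)
overlap_random = knowledge_overlap(model_a, unrelated)
# models sharing concepts score far higher than unrelated ones
```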
Figure 3: Using SemanticLens to audit models and check whether their reasoning aligns with human expectation. a) 1. In a first step, a set of valid and spurious concepts is defined via text descriptions, e.g., curved horns or palm tree for "Ox" detection, respectively. 2. Afterwards, we check which model components encode either spurious or valid concepts, both, or neither. The size of each dot in the chart represents the importance of a component for "Ox" detections. We learn that the ResNet50v2 relies on Indian person, palm tree and cart concepts. 3. Lastly, we can test our model and try to distinguish the "Ox" output logits on "Ox" images (from the test dataset) from those on diffusion-based images with spurious features only.
When multiple spurious features are present, as for Indian person pulling a cart under palm trees, model outputs become more difficult to separate, indicated by a lower AUC score. b) When auditing the ResNet's alignment with valid concepts for 26 ImageNet classes, we find that in all cases, spurious or background concepts are used.

4.2 Audit Alignment: Do Models Reason as Expected?

The analyses introduced in Section 4.1 enable the quantification of a model's alignment with human expectations. Specifically, they allow assessment of a model's reliance on valid, spurious, or unexpected concepts. The steps of an alignment audit, outlined in Fig. 3a, include (1) defining concepts, (2) evaluating concept alignment, and (3) testing model behaviour.

(1) Defining a set of expected concepts: First, a set of valid and spurious concepts is defined, which is used for comparison against the concepts actually employed by the model. For illustration, we revisit the Ox example, where valid concepts include curved horns, wide muzzle and large muscular body, as shown in Fig. 3a (left). On the other hand, we are also aware of spurious correlations, such as palm tree, Indian person and watermark. Notably, all of these concepts can be defined within the modality of the model's data domain (i.e., via example images), or, as demonstrated here, simply via text prompts when utilizing a multimodal foundation model such as CLIP for concept encoding.

(2) Evaluating alignment to valid and spurious concepts: The alignment of the model's knowledge with user-defined spurious or valid concepts is visualized in the scatter plot in Fig. 3a (middle) for "Ox" detection. Concretely, we calculate the maximum alignment between an embedding θ and all probing embeddings θ_probe within a set (valid or spurious), with mathematical formulations detailed in Section 3.4.
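Step 2 of the audit, computing for every neuron its maximum alignment with the valid and with the spurious probe sets, might look as follows. The embeddings are synthetic, and the 0.5 threshold and verdict labels are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def _norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def audit_alignment(thetas, valid_probes, spurious_probes, tau=0.5):
    """Maximum cosine alignment of each neuron embedding with the valid and
    spurious concept sets, plus an illustrative per-neuron verdict."""
    a_valid = (_norm(thetas) @ _norm(valid_probes).T).max(axis=1)
    a_spur = (_norm(thetas) @ _norm(spurious_probes).T).max(axis=1)
    verdict = np.where(a_spur > np.maximum(a_valid, tau), "spurious",
              np.where(a_valid > tau, "valid", "not understood"))
    return a_valid, a_spur, verdict

rng = np.random.default_rng(2)
valid = rng.normal(size=(2, 512))     # e.g. 'curved horns', 'wide muzzle'
spurious = rng.normal(size=(2, 512))  # e.g. 'palm tree', 'Indian person'
thetas = np.stack([
    valid[0] + 0.1 * rng.normal(size=512),     # a valid-concept neuron
    spurious[1] + 0.1 * rng.normal(size=512),  # a spurious-concept neuron
    rng.normal(size=512),                      # aligned with neither set
])
a_valid, a_spur, verdict = audit_alignment(thetas, valid, spurious)
```

The "not understood" neurons correspond to the components that, as in Fig. 2d, warrant manual inspection before being added to either set.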
Each dot in the plot represents a neuron of the penultimate layer, with its size indicating its highest importance (shown in parentheses) during inference on the test set. Several spurious concepts such as palm tree, Indian person or cart are identified besides valid concepts such as short, rough fur or curved horns. Notably, neurons that do not align with any user-defined concept can be manually inspected, as done in Fig. 2d, and incorporated into the set of spurious or valid concepts. As discussed for a VGG [75] model in Supplementary Note F, lower overall alignment scores can also result for neurons that encode highly abstract concepts, or that exhibit "polysemantic" behaviour, encoding multiple concepts simultaneously.

(3) Testing models for spurious behaviour: While SemanticLens enables quantification of a model's reliance on valid or spurious features (e.g., via the share of spuriously aligned components), it is equally important to assess the actual impact of identified spurious features on inference. Here we use a model test [62] evaluating the separability of two sets of outputs: one generated from images containing valid features (associated with the "Ox" class) and the other from images with spurious features, as illustrated in Fig. 3a (right). When testing the model on images (generated with Stable Diffusion) for a single concept (Indian person, palm tree or cart), the model output logits for "Ox" are clearly distinguishable from those attained from "Ox" images, achieving AUC scores above 0.98. However, when multiple spurious features are presented simultaneously, and we test the model on images combining all three concepts, the "Ox" output logits are further amplified. Specifically, the "Ox" class ranks among the top-5 predictions in over half of the spurious samples, resulting in an AUC of 0.91, as further detailed in Supplementary Note F.
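The separability test in step 3 reduces to an AUC between two sets of "Ox" logits. A rank-based sketch; the logit values here are made up for illustration, not taken from the paper's experiments:

```python
import numpy as np

def auc(pos, neg):
    """Rank-based AUC: probability that a logit from the valid-image set
    exceeds one from the spurious-image set (ties count half)."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# "Ox" logits on real Ox test images vs. on generated images containing
# only spurious features (illustrative numbers)
ox_logits = [9.1, 8.7, 10.2, 9.5, 8.9]
spurious_logits = [6.0, 7.2, 8.8, 5.5, 9.3]
score = auc(ox_logits, spurious_logits)  # 1.0 would mean perfectly separable
```

An AUC near 0.5 would mean the spurious features alone drive the "Ox" logit as strongly as real Ox images do.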
Problematic concept reliance everywhere. The previous example highlights the presence of unexpected spurious correlations, such as the association of palm trees with "Ox". Expanding on this, we evaluate the alignment of model components with valid concepts across 26 additional ImageNet classes, including "shovel", "steel drum" and "screwdriver". Fig. 3b presents the resulting highest alignment scores with a valid concept for each model component, where size again indicates the relevance of each component. Remarkably, no class shows complete alignment of all relevant model components with valid concepts. In every case, spurious or background features are relevant, including snow for "shovel", Afro-American person for "steel drum", and child for "screwdriver". A comprehensive overview of the concepts utilized by the model is provided in Supplementary Note F.

Unaligned models are often challenging to interpret. When analysing popular pre-trained models on the ImageNet dataset, we observe strong variations in their alignment with valid concepts. The reason often lies in the share of knowledge that is aligned to neither valid nor spurious concepts, as demonstrated for the VGG-16 in Supplementary Note F. For instance, the VGG-16 contains several polysemantic components that perform multiple roles in decision-making, which generally reduces alignment. On the other hand, more performant and wider models tend to have more specialized (e.g., class-specific) and monosemantic model components, as quantified later in Section 4.4. Overall, higher-performing models with larger feature spaces (more neurons per layer) thus show greater alignment scores throughout the experiments detailed in Supplementary Note F. Interpretability and trustworthiness are closely tied, underscoring the importance of optimizing models for interpretability.

4.3 Towards Robust and Safe Medical Models

One of the most popular medical use cases for AI is melanoma detection in dermoscopic images, as shown in Fig. 4a.
In the following, we demonstrate how to use SemanticLens to debug a VGG-16 model that is trained to discern melanoma from other irregular or benign (referred to as "other") samples in a public benchmark dataset [65, 76, 77].

Figure 4: Using SemanticLens to find and correct bugs in medical models that detect melanoma skin cancer. a) The ABCDE-rule is a popular guide for visual melanoma clues. We expect models to learn several concepts corresponding to the ABCDE-rule, as well as other melanoma-unrelated indications (such as regular border) or spurious concepts, including hairs or band-aid. b) In the semantic space, visualized via a UMAP projection, we can identify valid concepts, such as blue-white veil for "melanoma", but also spurious ones such as red skin or ruler. c) When investigating the importance of concepts, we find that red skin and band-aid concepts are strongly used for the "other" (non-melanoma) class. Also, ruler concepts are used with slightly higher relevance for "melanoma". d) We can improve the safety and robustness of our model by either changing the model and removing spurious components, or retraining the model on augmented data. Whereas both approaches lead to improved clean performance, the influence of artefacts is only significantly reduced via retraining.

4.3.1 ABCDE-Rule for Melanoma Detection

Dermatologists have created guidelines for visual melanoma detection, such as the ABCDE-rule, short for Asymmetry, Border, Colour, Diameter and Evolving [78]. We will use SemanticLens to check whether the model has learned concepts corresponding to the ABCDE-rule, such as asymmetric lesion (A), ragged border (B), blue-white veil (C), large lesion (D), and crusty surface (E).
In addition, we also define concepts for benign and other skin diseases as well as several spurious concepts that have been reported in previous works [79, 80], corresponding to hairs, band-aids, red-hued skin, rulers, vignetting and skin markings. Please refer to Supplementary Note F.2.1 for a full list of concepts.

4.3.2 Finding Bugs in Medical Models

To embed the VGG's components into a semantic space, we leverage a recently introduced CLIP model trained on skin-lesion data [68]. As shown in Fig. 4b, the semantic embeddings are structured, with concepts aligning to irregular at the top (red colour), melanoma at the bottom left (blue colour), and regular at the bottom right (green colour). Here, we can identify several valid concepts such as blue-white veil and irregular streaks for detecting melanoma, and regular border for benign samples. On the other hand, spurious model components are also revealed, such as neuron #403 encoding a measurement scale bar, #508 a blue-coloured band-aid, and #272 red skin (visually red-coloured skin). To quantify how concepts are used by the model, we compute their highest importance for predicting the "melanoma" or "other" class using CRP on the test set, as shown in Fig. 4c. Alarmingly, we find the previously found spurious concepts to be highly relevant: red skin and blue-coloured band-aid are strongly used for "other", whereas measurement scale bar is used slightly more strongly for "melanoma".

4.3.3 Model Correction and Evaluation

In application, the background features of red-coloured skin, plasters and rulers should not influence a detection. SemanticLens helps identify the model components and data associated with spurious concepts. To debug the model [81], we apply two approaches, namely pruning without retraining and retraining on augmented data. For pruning, we label the corresponding neurons, resulting in overall 40 out of 512 neurons in the penultimate layer being pruned.
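Pruning without retraining can be realized by zeroing the classifier weights attached to the flagged penultimate neurons, so that they no longer contribute to any logit. A sketch with random placeholder weights; the three neuron indices are the spurious ones identified above:

```python
import numpy as np

def prune_neurons(w_head, spurious_ids):
    """Zero the outgoing classifier weights of spurious penultimate neurons
    (pruning without retraining)."""
    w = w_head.copy()            # shape: (num_classes, num_neurons)
    w[:, spurious_ids] = 0.0
    return w

rng = np.random.default_rng(3)
w_head = rng.normal(size=(2, 512))   # melanoma-vs-other head over 512 neurons
spurious = [403, 508, 272]           # ruler, band-aid and red-skin neurons
w_pruned = prune_neurons(w_head, spurious)

# after pruning, perturbing a spurious neuron's activation leaves the logits unchanged
acts = rng.normal(size=512)
acts_perturbed = acts.copy()
acts_perturbed[403] += 10.0
```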
For retraining, on the other hand, we remove data samples that incorporate the artefacts, identified by studying the highly activating samples of our labelled components. In order to become insensitive towards the artefacts, we randomly augment data samples during training by overlaying hand-crafted artefacts, as illustrated in Fig. 4d (left). The results in Fig. 4d (right) show that both strategies, pruning and retraining, lead to increased accuracy on a clean test set (without artefact samples), especially for melanoma (from 71.4% to 72.8%). We further "poison" data with artificially inserted artefacts by cropping out rulers and plasters from real test samples and inserting them as an overlay into clean test samples, as done in [72], or, for red skin, by adding a reddish hue, as detailed in Supplementary Note F.2.3. Interestingly, pruning decreases artefact sensitivity only slightly, with the pruned model remaining highly sensitive. When adding red colour, for example, test accuracy still drops by over 20% for non-melanoma samples for the pruned model. Although computationally more expensive, only retraining leads to a strong reduction in artefact sensitivity. Further details and discussions are provided in Supplementary Note F.2.3.

4.4 Evaluating Human-Interpretability of Model Components

Deciphering the meaning of concept examples E can be particularly challenging, especially when neurons are polysemantic and encode multiple concepts, as observed in Section 4.2. We introduce a set of easily computable measures that assess how "clear", "similar" and "polysemantic" concepts are perceived to be by humans, as inferred from their concept examples E. Additionally, we introduce a measure to quantify the "redundancies" present within a set of concepts. All measures are based on evaluating similarities of the concept examples E in the semantic space S, with mathematical definitions given in Section 3.5.
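Two of these measures can be sketched directly. Assuming I_clarity is the mean pairwise cosine similarity of the embedded concept examples (its exact form is given in Section 3.5, so this is our assumption), Eq. (10) yields polysemanticity by clustering the examples into h = 2 subsets and comparing the summed cluster vectors; a minimal cosine 2-means stands in for the off-the-shelf clusterer:

```python
import numpy as np

def _norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clarity(embs):
    """Assumed I_clarity: mean pairwise cosine similarity of the examples."""
    e = _norm(np.asarray(embs, float))
    n = len(e)
    sims = e @ e.T
    return (sims.sum() - n) / (n * (n - 1))

def polysemanticity(embs, iters=10):
    """Eq. (10): I_poly = 1 - I_clarity of the h = 2 cluster-sum vectors."""
    e = _norm(np.asarray(embs, float))
    # farthest-point initialization, then a few rounds of cosine 2-means
    centers = np.stack([e[0], e[(e @ e[0]).argmin()]])
    for _ in range(iters):
        assign = (e @ centers.T).argmax(axis=1)
        for c in range(2):
            if (assign == c).any():
                centers[c] = _norm(e[assign == c].mean(axis=0))
    if (assign == 0).all() or (assign == 1).all():
        return 0.0   # the examples form a single cluster
    sums = np.stack([e[assign == c].sum(axis=0) for c in range(2)])
    return 1.0 - clarity(sums)

rng = np.random.default_rng(4)
a, b = rng.normal(size=(2, 512))
mono = a + 0.1 * rng.normal(size=(20, 512))               # one coherent concept
poly = np.vstack([a + 0.1 * rng.normal(size=(10, 512)),
                  b + 0.1 * rng.normal(size=(10, 512))])  # two mixed concepts
```

A monosemantic neuron's examples yield near-parallel cluster sums (polysemanticity near 0), while a neuron mixing two concepts yields near-orthogonal sums (polysemanticity near 1).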
4.4.1 Alignment of Interpretability Measures with Human Perception

Aiming to assess human-interpretability, we first evaluate the alignment between human judgements and our proposed measures (similarity, clarity and polysemanticity) through user studies. Specifically, we recruited over 218 participants via Amazon Mechanical Turk to engage in 15-minute tasks. In these studies, participants were presented with concept examples drawn from the ImageNet object detection task, similar to those shown in Sections 4.1 and 4.2. For each interpretability measure, we designed an independent study consisting of both qualitative and quantitative experiments. Further details regarding the study design, the models used, and the data filtering procedures can be found in Supplementary Note G.1.

Figure 5: We introduce computable human-interpretability measures that are useful for rating and improving model interpretability: "clarity" for how clear and easy it is to understand the common theme of concept examples, "polysemanticity" for whether multiple distinct semantics are present in the concept examples, "similarity" for the similarity of concepts, and "redundancy" for the degree of redundancy in a set of concepts. a) Our computable measures align with human perception in user studies, resulting in correlation scores above 0.74. Generally, more recent and performant foundation models lead to higher correlation scores. b) Interpretability differs strongly across common pre-trained models. Usually, ViTs or smaller and less performant convolutional models show lower interpretability. c) We can optimize model interpretability with respect to hyperparameter choices, such as drop-out or activation-sparsity regularization during training. Whereas drop-out leads to more redundancies besides improved clarity of concepts, applying a sparsity loss improves interpretability overall.
All in all, we obtain high alignment between the computed measures and human perception, indicated by correlation scores above 0.74, as shown in Fig. 5a, which recent works using textual concept examples also reflect [56]. Regarding concept similarity, human alignment varies across foundation models, namely DINOv2 [67] (uni-modal), CLIP-OpenAI [23], CLIP-LAION [82], and the most recent CLIP-Mobile [66] (specific variants reported in Supplementary Note G.1). Our results indicate that more recent and more performant CLIP models are also more aligned with human perception. Other hyperparameter choices, such as the similarity measure used, are compared in Section 3. We further performed an odd-one-out task, where participants are asked to detect the outlier concept (out of three concepts). Interestingly, our measures often outperform the human participants, indicating that computational measures can be even more reliable than humans. Participants from Amazon Mechanical Turk are, however, often motivated to complete studies quickly to maximize their pay rate, which may not result in optimal performance.

4.4.2 Rating and Improving Interpretability

The difficulty of understanding the role of components in standard pre-trained models can vary strongly, as, e.g., previously observed in Sections 4.1 and 4.2. This is further confirmed by evaluating various popular neural networks trained on ImageNet using our newly introduced measures on penultimate-layer neurons, detailed in Fig. 5b. Larger and broader models, such as the ResNet101, show higher degrees of redundancy, which can be expected as more neurons per layer allow more redundancies to form, e.g., in order to increase robustness. For narrow models such as the ResNet18, on the other hand, the effective neural basis might be too small, leading to superimposed signals and higher polysemanticity (neurons are more likely to fulfil multiple tasks) [83].
The convolution-based ResNet architecture shows higher concept clarity compared to the more recent transformer-based ViT. Whereas the ResNet uses ReLU non-linearities that allow a high neuronal activation to be associated with a specific active input pattern, ViTs often refrain from ReLUs, which enables signals (concepts) to be superimposed across model components, ultimately leading to high polysemanticity and low interpretability [84]. Interestingly, recent efforts are being made in Large Language Model (LLM) interpretability to extend the transformer architecture post hoc with SAEs based on ReLUs in order to regain a more interpretable neuronal basis [53]. Moreover, our analysis shows that more extensively trained models have clearer and overall more interpretable components, as is the case for the ResNet50v2 compared to the ResNet50. This observation raises the question of how we can influence training parameters to gain higher latent interpretability, which we inspect in the following.

Drop-out: Drop-out regularization is effective for reducing overfitting, preventing high reliance on few features by randomly setting a fraction of component activations to zero during training. Our results, shown in Fig. 5c, indicate that VGG-13 model components become more redundant, but also clearer, when drop-out is applied during training on a subset of ImageNet (standard error given by grey error bars for eight runs each). It can be expected that more redundancies form, as redundancies make predictions more robust when components are randomly pruned. On the other hand, neurons are measured to become more class-specific and thus clearer. Notably, architectures might react differently in terms of interpretability, as indicated by the ResNet-34 and ResNet-50, which are not strongly affected by drop-out. Qualitative examples of concepts, detailed training procedures and results are provided in Supplementary Note G.2.
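The drop-out mechanism referenced here, randomly zeroing a fraction of component activations during training, corresponds to standard (inverted) drop-out; a minimal sketch:

```python
import numpy as np

def dropout(acts, p, rng, training=True):
    """Inverted drop-out: zero each activation with probability p during
    training and rescale the rest, so the expectation matches test time."""
    if not training or p == 0.0:
        return acts
    mask = rng.random(acts.shape) >= p
    return acts * mask / (1.0 - p)

rng = np.random.default_rng(5)
acts = np.ones(10000)                 # penultimate-layer activations
dropped = dropout(acts, p=0.5, rng=rng)
# about half the units are zeroed while the mean activation stays near 1
```

Because any individual neuron may be pruned on a given step, the network is pushed to spread a concept over several neurons, which matches the increased redundancy measured in Fig. 5c.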
Sparsity regularization: Secondly, we apply L1 sparsity regularization to the neuron activations during training, as is, e.g., common for SAEs. Our experiments indicate that sparsity regularization improves interpretability in all measured aspects, resulting in more specific, less polysemantic and less semantically redundant neurons. We further investigate the effect of task complexity, the number of training epochs and data augmentation on latent interpretability in Supplementary Note G.2.

5 Discussion

With SemanticLens, we propose to transfer components of large machine learning models into an understandable semantic representation that allows one to understand and evaluate their inner workings in a holistic manner. This transfer is made possible through recent foundation models that serve as domain experts, taking the human out of interpretation loops that would otherwise be cognitively infeasible due to the sheer number of components in modern deep neural networks. Especially useful are multimodal foundation models that allow network components to be searched, annotated and labelled via textual descriptions. Foundation models improve constantly, becoming more efficient and applicable in scarcer data domains such as medical data, or other data modalities including audio and video [85, 86]. These new capabilities offered by SemanticLens allow the internal components of AI models to be comprehensively audited. A multitude of spurious behaviours of popular pre-trained models are hereby revealed, stressing the need to understand every part of a model in order to ensure fairness, safety and robustness in application. To understand and audit models, we depend on the interpretability of the model components themselves. While some models demonstrate higher interpretability, progress is still needed to develop truly interpretable models, especially regarding recent transformer architectures.
However, post-hoc architecture modifications or training regularizations are promising ongoing endeavours to achieve high interpretability in modern architectures as well. Our newly introduced human-interpretability measures are an effective tool for optimizing and understanding model architecture choices without relying on expensive user studies for evaluation. There are still many other hyperparameters that we leave for future work, including training with pretrained models, adversarial training, weight-decay regularization, and SAEs. Trust and safety go hand in hand with a verification of the internal components, as is the case with traditional engineered systems such as aeroplanes. In order to close this "trust gap", holistic approaches such as SemanticLens are needed that allow one to understand and quantify the validity of latent components, as well as offer ways to increase their interpretability and reduce potential spurious behaviours. However, various avenues of future work remain for post-hoc component-level XAI approaches such as SemanticLens, including the need for further meaningful evaluation metrics [87], application to generative models [88], and potential limitations regarding "post-hoc" vs. "ante-hoc" interpretability [89], leaving enough room for innovation by the next generation of XAI researchers [90].

Code Availability

We provide an open-source toolbox for the scientific community, written in Python and based on PyTorch [91], Zennit-CRP [92] and Zennit [93]. The GitHub repository containing our implementation of SemanticLens is publicly available at https://github.com/jim-berend/semanticlens. All experiments were conducted with Python 3.10.12, zennit-crp 0.6, Zennit 0.4.6 and PyTorch 2.2.2.

Acknowledgements

We would like to express our gratitude to Oleg Hein for his work and fruitful discussions in developing a public demo of SemanticLens at https://semanticlens.hhi-research-insights.eu.
References

[1] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10:1096, 2019.
[2] Jacob Kauffmann, Jonas Dippel, Lukas Ruff, Wojciech Samek, Klaus-Robert Müller, and Grégoire Montavon. The Clever Hans effect in unsupervised learning. Nature Machine Intelligence, 2025.
[3] Katarzyna Borys, Yasmin Alyssa Schmitt, Meike Nauta, Christin Seifert, Nicole Krämer, Christoph M Friedrich, and Felix Nensa. Explainable AI in medical imaging: An overview for clinical practitioners – beyond saliency-based XAI approaches. European Journal of Radiology, 162:110786, 2023.
[4] Max Tegmark and Steve Omohundro. Provably safe systems: The only path to controllable AGI. arXiv preprint arXiv:2309.01933, 2023.
[5] José Hernández-Orallo. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, Cambridge, UK, 2017. ISBN 9781316594179.
[6] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of LNCS. Springer, Cham, Switzerland, 2019.
[7] David Gunning, Mark Stefik, Jaesik Choi, Timothy Miller, Simone Stumpf, and Guang-Zhong Yang. XAI—explainable artificial intelligence. Science Robotics, 4(37):eaay7120, 2019.
[8] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.
[9] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
[10] Leonard Bereska and Stratis Gavves.
Mechanistic interpretability for AI safety - a review. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.
[11] Vikram V Ramaswamy, Sunnie SY Kim, Ruth Fong, and Olga Russakovsky. Overlooked factors in concept-based explanations: Dataset choice, concept learnability, and human capability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10932–10941, 2023.
[12] Felix Friedrich, Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting. A typology for exploring the mitigation of shortcut behaviour. Nature Machine Intelligence, 5(3):319–330, 2023.
[13] Anh Nguyen, Jason Yosinski, and Jeff Clune. Understanding neural networks via feature visualization: A survey. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of LNCS, pages 55–76. Springer, Cham, Switzerland, 2019.
[14] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6541–6549, 2017.
[15] Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 5(9):1006–1019, 2023.
[16] Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Remi Cadene, and Thomas Serre. Craft: Concept recursive activation factorization for explainability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2711–2721, 2023.
[17] Thomas Fel, Victor Boutin, Louis Béthune, Rémi Cadène, Mazda Moayeri, Léo Andéol, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation.
In Advances in Neural Information Processing Systems, volume 37, 2024.
[18] Yong Hyun Ahn, Hyeon Bae Kim, and Seong Tae Kim. Www: A unified framework for explaining what, where and why of neural networks by interpretation of neuron concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10968–10977, 2024.
[19] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894. PMLR, 2017.
[20] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International Conference on Machine Learning, pages 2668–2677. PMLR, 2018.
[21] Yueqi Li and Sanjay Goel. Making it possible for the auditing of ai: A systematic review of ai audits and ai auditability. Information Systems Frontiers, pages 1–31, 2024.
[22] Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932, 2024.
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[24] Francesco Bombassei De Bona, Gabriele Dominici, Tim Miller, Marc Langheinrich, and Martin Gjoreski. Evaluating explanations through llms: Beyond traditional user studies. arXiv preprint arXiv:2410.17781, 2024.
[25] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs.
In 3rd International Conference on Learning Representations (ICLR), 2015.
[26] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[27] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
[28] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 20554–20565, 2020.
[29] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5188–5196, 2015.
[30] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 29, pages 3387–3395, 2016.
[31] Hongxu Yin, Pavlo Molchanov, Jose M Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via deepinversion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8715–8724, 2020.
[32] Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, Remi Cadene, Lore Goetschalckx, et al. Unlocking feature visualization for deep network with magnitude constrained optimization. In Advances in Neural Information Processing Systems, volume 37, 2024.
[33] Maximilian Augustin, Yannic Neuhaus, and Matthias Hein. Dig-in: Diffusion guidance for investigating networks - uncovering classifier differences, neuron visualisations and visual counterfactual explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11093–11103, 2024.
[34] Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, and Marina Höhne. Labeling neural representations with inverse recognition. In Advances in Neural Information Processing Systems, volume 37, 2024.
[35] Tuomas Oikarinen and Tsui-Wei Weng. Clip-dissect: Automatic description of neuron representations in deep vision networks. In International Conference on Learning Representations, 2022.
[36] Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2021.
[37] Nicholas Bai, Rahul A Iyer, Tuomas Oikarinen, and Tsui-Wei Weng. Describe-and-dissect: Interpreting neurons in vision networks with language models. arXiv preprint arXiv:2403.13771, 2024.
[38] Neha Kalibhat, Shweta Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, and Soheil Feizi. Identifying interpretable subspaces in image representations. In International Conference on Machine Learning, volume 202, pages 15623–15638, 2023.
[39] Maximilian Dreyer, Reduan Achtibat, Wojciech Samek, and Sebastian Lapuschkin. Understanding the (extra-)ordinary: Validating deep model decisions with prototypical concept-based explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3491–3501, June 2024.
[40] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.
[41] Matthew Walmer, Saksham Suri, Kamal Gupta, and Abhinav Shrivastava. Teaching matters: Investigating the role of supervision in vision transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7496, 2022. URL https://api.semanticscholar.org/CorpusID:254366577.
[42] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems, volume 34, pages 12116–12128, 2021.
[43] Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, and Sangdoo Yun. What do self-supervised vision transformers learn? In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=azCKuYyS74.
[44] Ruth Fong and Andrea Vedaldi. Net2vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8730–8738, 2018.
[45] Johanna Vielhaben, Dilyara Bareeva, Jim Berend, Wojciech Samek, and Nils Strodthoff. Beyond scalars: Concept-based alignment analysis in vision transformers. arXiv preprint arXiv:2412.06639, 2024.
[46] Kirill Bykov, Mayukh Deb, Dennis Grinwald, Klaus-Robert Müller, and Marina MC Höhne. DORA: Exploring outlier representations in deep neural networks. In ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML, 2023. URL https://openreview.net/forum?id=7rI75xfidk.
[47] M Li, S Jeong, S Liu, and M Berger. Can: Concept-aligned neurons for visual comparison of deep neural network models. In Computer Graphics Forum, page e15085. Wiley Online Library, 2024.
[48] Haekyu Park, Seongmin Lee, Benjamin Hoover, Austin P Wright, Omar Shaikh, Rahul Duggal, Nilaksh Das, Kevin Li, Judy Hoffman, and Duen Horng Chau. Concept evolution in deep learning training: A unified interpretation framework and discoveries. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2044–2054, 2023.
[49] Mara Graziani, Laura O'Mahony, An-phi Nguyen, Henning Müller, and Vincent Andrearczyk. Uncovering unique concept vectors through latent space decomposition. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
URL https://openreview.net/forum?id=LT4DXqUJTD.
[50] Johanna Vielhaben, Stefan Bluecher, and Nils Strodthoff. Multi-dimensional concept discovery (mcd): A unifying framework with completeness guarantees. Transactions on Machine Learning Research, 2023.
[51] Laura O'Mahony, Vincent Andrearczyk, Henning Müller, and Mara Graziani. Disentangling neuron representations with concept vectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3769–3774, 2023.
[52] Julien Colin, Lore Goetschalckx, Thomas Fel, Victor Boutin, Jay Gopal, Thomas Serre, and Nuria Oliver. Local vs distributed representations: What is the right basis for interpretability? arXiv preprint arXiv:2411.03993, 2024.
[53] Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations, 2023.
[54] Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, and Sebastian Lapuschkin. Pure: Turning polysemantic neurons into pure features by identifying relevant circuits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 8212–8217, June 2024.
[55] Alex Foote. Tackling polysemanticity with neuron embeddings. In ICML 2024 Workshop on Mechanistic Interpretability, 2024.
[56] Meng Li, Haoran Jin, Ruixuan Huang, Zhihao Xu, Defu Lian, Zijia Lin, Di Zhang, and Xiting Wang. Evaluating readability and faithfulness of concept-based explanations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 607–625, 2024.
[57] Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, and Matthieu Cord. A concept-based explainability framework for large multimodal models. In Advances in Neural Information Processing Systems, volume 37, 2024.
[58] Thomas Fel, Louis Béthune, Andrew Kyle Lampinen, Thomas Serre, and Katherine Hermann. Understanding visual feature reliance through the lens of complexity. In Advances in Neural Information Processing Systems, volume 37, 2024.
[59] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
[60] Jessica Schrouff, Sebastien Baur, Shaobo Hou, Diana Mincu, Eric Loreaux, Ralph Blanes, James Wexler, Alan Karthikesalingam, and Been Kim. Best of both worlds: local and global explanations with human-understandable concepts. arXiv preprint arXiv:2106.08641, 2021.
[61] Fred Hohman, Haekyu Park, Caleb Robinson, and Duen Horng Polo Chau. Summit: Scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE Transactions on Visualization and Computer Graphics, 26(1):1096–1106, 2019.
[62] Yannic Neuhaus, Maximilian Augustin, Valentyn Boreiko, and Matthias Hein. Spurious features everywhere - large-scale detection of harmful spurious features in imagenet. In Proceedings of the IEEE International Conference on Computer Vision, pages 20235–20246, 2023.
[63] Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. A multimodal automated interpretability agent. In International Conference on Machine Learning, 2024.
[64] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[65] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler.
The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1):1–9, 2018.
[66] Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel. Mobileclip: Fast image-text models through multi-modal reinforced training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15963–15974, 2024.
[67] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=a68SUt6zFt.
[68] Yue Yang, Mona Gandhi, Yufei Wang, Yifan Wu, Michael S Yao, Chris Callison-Burch, James Gee, and Mark Yatskar. A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond, 2024.
[69] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[70] Lukas Kuhn, Sari Sadiya, Jörg Schlötterer, Christin Seifert, and Gemma Roig. Efficient unsupervised shortcut learning detection and mitigation in transformers. arXiv preprint arXiv:2501.00942, 2025.
[71] Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting. Right for the right concept: Revising neuro-symbolic concepts by interacting with their explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3619–3629, 2021.
[72] Frederik Pahde, Maximilian Dreyer, Wojciech Samek, and Sebastian Lapuschkin. Reveal to revise: An explainable ai life cycle for iterative bias correction of deep models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 596–606. Springer, 2023.
[73] Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M-C Höhne, and Kirill Bykov. Cosy: Evaluating textual explanations of neurons. In Advances in Neural Information Processing Systems, volume 37, 2024.
[74] Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. In Advances in Neural Information Processing Systems, volume 35, pages 34651–34663, 2022.
[75] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[76] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection. In IEEE 15th International Symposium on Biomedical Imaging, pages 168–172, 2018.
[77] Carlos Hernández-Pérez, Marc Combalia, Sebastian Podlipnik, Noel CF Codella, Veronica Rotemberg, Allan C Halpern, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Brian Helba, et al. Bcn20000: Dermoscopic lesions in the wild. Scientific Data, 11(1):641, 2024.
[78] Ana F Duarte, Bernardo Sousa-Pinto, Luís F Azevedo, Ana M Barros, Susana Puig, Josep Malvehy, Eckart Haneke, and Osvaldo Correia. Clinical ABCDE rule for early melanoma detection. European Journal of Dermatology, 31(6):771–778, December 2021.
[79] Bill Cassidy, Connah Kendrick, Andrzej Brodzicki, Joanna Jaworek-Korjakowska, and Moi Hoon Yap. Analysis of the isic image datasets: Usage, benchmarks and recommendations. Medical Image Analysis, 75:102305, 2022.
[80] Chanwoo Kim, Soham U. Gadgil, Alex J. DeGrave, Jesutofunmi A. Omiye, Zhuo Ran Cai, Roxana Daneshjou, and Su-In Lee.
Transparent medical image ai via an image–text foundation model grounded in medical literature. Nature Medicine, 2024.
[81] Thilo Spinner, Daniel Fürst, and Mennatallah El-Assady. innspector: Visual, interactive deep model debugging. arXiv preprint arXiv:2407.17998, 2024.
[82] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, volume 35, pages 25278–25294, 2022.
[83] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html.
[84] Adam Scherlis, Kshitij Sachan, Adam S Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.
[85] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2023.
[86] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, 2021.
[87] Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice Van Keulen, and Christin Seifert.
From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Computing Surveys, 55(13s):1–42, 2023.
[88] Kenza Amara, Rita Sevastjanova, and Mennatallah El-Assady. Challenges and opportunities in text generation explainability. In World Conference on Explainable Artificial Intelligence, pages 244–264. Springer, 2024.
[89] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
[90] Luca Longo, Mario Brcic, Federico Cabitza, Jaesik Choi, Roberto Confalonieri, Javier Del Ser, Riccardo Guidotti, Yoichi Hayashi, Francisco Herrera, Andreas Holzinger, et al. Explainable artificial intelligence (xai) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion, 106:102301, 2024.
[91] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019.
[92] Reduan Achtibat, Maximilian Dreyer, and Sebastian Lapuschkin. rachtibat/zennit-crp: v0.6.0. Zenodo, 2023. URL https://doi.org/10.5281/zenodo.7962574.
[93] Christopher J Anders, David Neumann, Wojciech Samek, Klaus-Robert Müller, and Sebastian Lapuschkin. Software for dataset-wide xai: from local explanations to global insights with zennit, corelay, and virelay. arXiv preprint arXiv:2106.13200, 2021.
[94] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
[95] Mukund Sundararajan, Ankur Taly, and Qiqi Yan.
Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
[96] Sébastien Marcel and Yann Rodriguez. Torchvision: the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1485–1488, 2010.
[97] Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[98] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[99] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018.
[100] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3327, 2017. doi: 10.1109/CVPR.2017.354.
[101] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. Paco: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023. doi: 10.1109/cvpr52729.2023.00690.
[102] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[103] Ralph Peter Braun, Margaret Oliviero, Isabel Kolm, Lars E French, Ashfaq A Marghoob, and Harold Rabinovitz. Dermoscopy: what's new? Clinics in Dermatology, 27(1):26–34, 2009.
[104] MÁ Salafranca and P Zaballos.
Dermoscopy of squamous cell carcinoma: from actinic keratosis to invasive forms. Actas Dermo-Sifiliograficas, pages S0001–7310, 2024.
[105] Cristián Navarrete-Dechent, Shirin Bajaj, Michael A Marchetti, Harold Rabinovitz, Stephen W Dusza, and Ashfaq A Marghoob. Association of shiny white blotches and strands with nonpigmented basal cell carcinoma: evaluation of an additional dermoscopic diagnostic criterion. JAMA Dermatology, 152(5):546–552, 2016.
[106] Anna Eliza Verzi, Victor L Quan, Kara E Walton, Mary C Martini, Ashfaq A Marghoob, Erin M Garfield, Betty Y Kong, Maria Cristina Isales, Timothy VandenBoom, Bin Zhang, et al. The diagnostic value and histologic correlate of distinct patterns of shiny white streaks for the diagnosis of melanoma: A retrospective, case-control study. Journal of the American Academy of Dermatology, 78(5):913–919, 2018.
[107] Jason Thomson, Sarah Hogan, Jo Leonardi-Bee, Hywel C Williams, and Fiona J Bath-Hextall. Interventions for basal cell carcinoma of the skin. Cochrane Database of Systematic Reviews, (11), 2020.
[108] Yevgeniy Balagula, Ralph P Braun, Harold S Rabinovitz, Stephen W Dusza, Alon Scope, Tracey N Liebman, Ines Mordente, Katherine Siamas, and Ashfaq A Marghoob. The significance of crystalline/chrysalis structures in the diagnosis of melanocytic and nonmelanocytic lesions. Journal of the American Academy of Dermatology, 67(2):194–e1, 2012.
[109] Aimilios Lallas, Zoe Apalla, Dimitrios Ioannides, Giuseppe Argenziano, Fabio Castagnetti, Elvira Moscarella, Caterina Longo, Tamara Palmieri, Dafne Ramundo, and Iris Zalaudek. Dermoscopy in the diagnosis and management of basal cell carcinoma. Future Oncology, 11(22):2975–2984, 2015.
[110] Akane Minagawa. Dermoscopy–pathology relationship in seborrheic keratosis. The Journal of Dermatology, 44(5):518–524, 2017.
[111] Jong Hoon Kim, Mi Ri Kim, Si-Hyung Lee, Sang Eun Lee, and Seung Hun Lee.
Dermoscopy: a useful tool for the diagnosis of angiokeratoma. Annals of Dermatology, 24(4):468–471, 2012.
[112] Ramah I Nazer, Rahaf H Bashihab, Wedad H Al-Madani, Aamir A Omair, and Mohammed I AlJasser. Cherry angioma: A case–control study. Journal of Family and Community Medicine, 27(2):109–113, 2020.
[113] Sampa Choudhury and Ashish Mandal. Talon noir: A case report and literature review. Cureus, 15(3), 2023.
[114] Md Mehrab Tanjim, Krishna Kumar Singh, Kushal Kafle, Ritwik Sinha, and Garrison W Cottrell. Discovering and mitigating biases in clip-based image editing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2984–2993, 2024.

Supplementary Materials

This article has supplementary files providing additional details, descriptions, experiments and figures. Supplementary Note A offers a detailed survey of related work important to our contribution, contrasting several related explainability techniques with our proposed technical contribution along different criteria. Supplementary Note B describes the datasets and models used in our experiments. Supplementary Notes C, D and E include additional experiments and explanations for the search, describe and compare functionalities of SemanticLens. Subsequently, Supplementary Note F provides additional details on auditing and debugging models with SemanticLens. Supplementary Note G provides details on the user study of Section 4.4 and further experiments on optimizing latent interpretability. Supplementary Note H describes our technical contribution, SemanticLens, in greater detail and provides additional background. The proposed interpretability measures are detailed in Supplementary Note H.3. In Supplementary Note H.4 we summarize the computational steps involved in answering the questions presented in Tab. 1.
Current challenges and an outlook on future work are discussed in Supplementary Note I.

A Extended Related Work

SemanticLens is a holistic framework that enables a systematic concept-level understanding of large AI models. Its core elements build on previous research advances related to concept visualization, labelling, attribution, comparison, discovery, audits, and human-interpretability measures, as detailed in the following. In Supplementary Tab. A.1, we compare SemanticLens with other popular XAI frameworks.

Concept Examples (Feature Visualization)
Most feature visualization techniques rely on maximizing the activation values of single neurons or a linear combination thereof [25, 26, 27, 14, 16, 28]; in the simplest form, input images are sought that produce the highest activation value of a specific unit. In this work, this set of images is referred to as "concept examples". Concept examples can be generated synthetically using gradient ascent, or found in a sample dataset by collecting neuron activations during predictions. For synthetic examples, preventing the emergence of adversarial patterns has become a main research area, and several priors have been proposed to guide the optimization towards more realistic-looking images [29, 26, 30, 31, 32]. Recently, diffusion models have also been applied to generate more realistic concept examples [33]. Alternatively, natural concept examples can be collected from the training or test data, where it is favourable to collect patches of the input data [16, 15, 28, 14], as whole inputs can contain many distracting background features. We follow the CRP approach and crop full data samples to the actually relevant part using neuron-specific attributions [15]. Other approaches use upsampled spatial activation maps [14], which are only available for convolutional layers or transformers (through spatial token information).
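The gradient-ascent idea behind synthetic concept examples can be sketched in a few lines. The following is a minimal, self-contained illustration on a toy differentiable "neuron" (the toy model and all names are illustrative, not the paper's implementation); real feature visualization ascends the gradient of an actual network unit and adds image priors to suppress adversarial patterns:

```python
import numpy as np

# Toy "neuron": activation a(x) = tanh(w . x). Gradient ascent on the
# input x illustrates synthetic feature visualization in its simplest form.
rng = np.random.default_rng(0)
w = rng.normal(size=16)

def activation(x):
    return np.tanh(w @ x)

def grad(x):
    # d/dx tanh(w . x) = (1 - tanh(w . x)^2) * w
    return (1.0 - np.tanh(w @ x) ** 2) * w

x = 0.01 * rng.normal(size=16)  # start from a near-zero "image"
for _ in range(200):
    x += 0.1 * grad(x)          # ascend towards the maximal response

print(activation(x))  # saturates near 1.0, the neuron's maximal response
```

For real networks the same loop runs via automatic differentiation, and the priors cited above (e.g., image regularizers or diffusion guidance) constrain `x` to stay on the manifold of natural images.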
Encoding Concepts of Neurons: Activation Pattern or Feature Space
There are two approaches in the literature to encode the concept of a neuron: (1) via activation patterns [14, 46, 35] on data with concept annotations (e.g., binary labels or segmentation masks), or (2) by embedding concept examples into another feature space [37, 48, 18, 38]. Activation patterns are a very direct measure, but often correspond to only a single (pooled) activation score per data point. Data points usually incorporate multiple features, which can lead to wrong conclusions due to unexpected correlations when working with single activation scores. For example, two neurons that encode for nose and eyes will activate very similarly on data with human faces, but encode different concepts. It is thus important to have a high-quality and meaningful set of concept data. Alternatively, concept examples (cropped to the relevant part, see Section 3) aim to communicate the semantic role of neurons more directly. In order to encode a concept, the concept examples are then embedded in the feature space of a model: either the same model [48] or a foundation model [37, 18]. Notably, generating the concept examples and encodings is algorithmically and computationally more involved than computing activation patterns. Whereas using the same model for encoding is convenient, as it does not require a foundation model (which might need to be trained first), the latent space of the investigated model might not be as semantically structured. The work of [45] shows that self-supervised foundation models have a more semantically structured latent space than models trained on a classification task. Multimodal foundation models are especially interesting, as they allow interacting with and describing the embedding space more flexibly.
However, note that describing concepts through concept examples assumes that the concept is precisely and well defined by these examples, which might not always be the case [38].

Neuron Labelling
Various methods aim to label the concept that an individual neuron represents. Some are purely based on activation patterns, such as Network Dissection [14] or INVERT [34], which require a large set of data annotations. Notably, CLIP-Dissect [35] circumvents the requirement for costly concept annotations by using a multimodal foundation model for annotation. Other methods, such as ours, operate on the set of maximally activating images (concept examples) for a neuron, relying on vision-language models [36, 35, 18, 37, 38].

Concept Importance Scores
In order to understand how concepts or components are used, we need to compute their importance during inference wrt. the output or other components. Various traditional feature attribution methods can be used to compute importance scores of latent representations [39, 17]. We adhere to the CRP framework for computing relevance scores of single components (or groups thereof) wrt. the output prediction and/or specific model parts, further detailed in Supplementary Note H.

Concept Discovery
Whereas early works show that neurons often encode human-understandable concepts [14, 26], other works argue that linear directions (or subspaces) in the latent feature space are more interpretable and disentangled [44, 49, 50]. In fact, neurons can be redundant and polysemantic (encoding multiple concepts), issues to which directions might be less prone [51, 52]. Recent research focuses on SAEs [53] or activation factorization [17] to obtain more disentangled representations, for which, again, concept examples and concept relevance scores can be computed.
Whereas we focus in this work on the neural basis, SemanticLens is thus also applicable to SAEs or factorized activations.

Concept Comparison in Models
Various popular approaches exist that measure alignment between representation spaces of neural networks, including Centered Kernel Alignment [40], attention (map) patterns [41,42,43], or "concept embeddings" (i.e., weights over neuron activations to detect specific concepts in data) as in Net2Vec [44]. These approaches only provide a single scalar value for the overall alignment between two representation spaces. In contrast, other works (including ours) also enable similarity analysis between single concepts, allowing, e.g., to identify which concepts models share and in which concepts they differ. Similarities between concepts can be based on activation patterns [45,46,47], relevance patterns [15], or concept example embeddings [48], as in SemanticLens.

Concept-level Audits
Established methods for evaluating and auditing latent feature spaces of neural networks are TCAV [20] and linear probes [59]. Both try to detect a signal (a linear direction) in the latent activations that can be associated with a specific user-defined concept of interest. Contrary to SemanticLens, where a concept is described through a set of concept examples (in the form of images or text), TCAV, e.g., additionally requires a set of negative examples without the concept. Originally, linear probes only detect that a certain concept is encoded by a model, but not how it is used or how relevant it is. TCAV uses latent gradients collected on a dataset to estimate the sensitivity of the model w.r.t. a concept. However, sensitivity does not fully reflect the degree to which a concept contributes during inference, as the contribution also depends on the concept activation (magnitude). The work of [60] extends TCAV to also gain information in terms of concept importances for local predictions.
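The TCAV-style audit just described can be sketched in a few lines. This is an illustrative toy, not the reference implementation: the activations and gradients are synthetic, and the mean-difference direction is a lightweight stand-in for the linear classifier TCAV actually trains.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic latent activations for probe samples with / without the concept.
acts_with = rng.normal(loc=0.5, size=(100, 64))
acts_without = rng.normal(loc=0.0, size=(100, 64))

# Concept activation vector (CAV): TCAV fits a linear classifier between the
# two activation sets; the mean-difference direction is a lightweight stand-in.
cav = acts_with.mean(axis=0) - acts_without.mean(axis=0)
cav /= np.linalg.norm(cav)

def tcav_score(latent_grads, cav):
    """Fraction of samples whose output gradient has a positive
    directional derivative along the concept direction."""
    return float((latent_grads @ cav > 0).mean())

grads = rng.normal(size=(100, 64))  # stand-in for d(output)/d(activation)
score = tcav_score(grads, cav)
```

Note how the score only uses the sign of the directional derivative: this is exactly the sensitivity-vs-contribution limitation raised above, since the magnitude of the concept activation never enters.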
Further, the part of the model that is not covered by the (set of) expected concept(s) is not studied, and could also incorporate various other spurious concepts. Popular methods to evaluate model behaviour w.r.t. model outputs are, besides test set performance and worst-group accuracies on subsets of the test set [72,12], also other more direct measures evaluating concept sensitivity [62,72]. Concretely, whereas [62] evaluates the separability of model outputs on samples solely with and without a spurious feature, [72] directly inserts artefacts into clean samples and measures the effect on the model prediction.

Auto-Interpretability
The field of automated interpretability aims to combine the flexibility of human experimentation with the scalability of automated techniques (usually by using deep models themselves), e.g., for labelling of neurons [36,18,35,38]. Automated interpretability can be more lightweight, relying on the investigated models themselves or on foundation models [36,18,35,38], or more involved, solving complex interpretability tasks using LLM agents [63].

Human-Interpretability Measures
The work of Network Dissection [14] evaluates interpretability indirectly by the degree to which neurons align to a large set of expected concepts. Later works leverage feature spaces of large models, in which the concept examples of individual neurons are encoded. Specifically, [54,55] introduce measures related to polysemanticity, [38,56,54] measures related to clarity, and [57] measures related to redundancy. Recently, measures to capture concept complexity have also been introduced [58].

Explanation Frameworks
Instead of focusing on individual aspects, explanation frameworks combine multiple interpretability aspects and enable a more holistic understanding of model and data. For example, CRP [15] and CRAFT [16] combine feature visualization and attribution, but do not include labelling.
CLIP-Dissect [35], on the other hand, leverages foundation models such as CLIP [23] to label neurons, but does not investigate how concepts are actually used during inference. Based on the semantic embedding of model components, SemanticLens represents a more comprehensive and holistic framework than previous works, enabling systematic search, labelling, comparison, description, and evaluation of the inner mechanics of large AI models. In the following, several explanatory frameworks that allow to gain insights into deep neural networks are presented and compared to SemanticLens, as summarized in Supplementary Tab. A.1.

Supplementary Tab. A.1: A comparison of selected XAI frameworks at a glance, considering the explanatory insight they provide for model components. As explanatory capabilities, it is considered whether they include concept examples, concept labelling, concept relevances, concept comparison tools, concept audit capabilities, or interpretability evaluation metrics. The table indicates whether an explainer exhibits a specific explanatory capability partially (⃝) or fully (✓). [Columns: examples | labels | relevances | comparison | audit | interpretability]

CRP [15]: ✓ ⃝
CRAFT [16]: ✓
PCX [39]: ✓ ⃝
Summit [61]: ✓ ⃝
NetDissect [14]: ✓ ⃝
CLIP-Dissect [35,47]: ✓ ⃝
FALCON [38]: ✓ ⃝
TCAV + IG [60,20]: ✓ ⃝
W [18]: ✓
ConceptEvo [48]: ✓
SpuFix [62]: ✓ ⃝
MAIA [63]: ✓ ⃝ ✓
Ours: ✓

CRP: Concept Relevance Propagation [15] is a local concept-based explainability approach that combines feature visualization techniques with local feature attribution, thus enabling a much deeper understanding of the decision-making of neural networks than traditional local feature attribution alone. Concretely, the role of a neuron is described by collecting either the most activating samples or the samples for which a neuron is most relevant.
For all neurons and a single prediction outcome, the feature attribution method Layer-wise Relevance Propagation (LRP) [94] is extended to compute concept relevance scores and neuron-specific heatmaps. In their work, comparisons between neurons are computed based on the similarity of neuron relevance patterns.

CRAFT: Similarly to CRP, CRAFT [16] combines feature visualization and concept attributions to enable local concept-based explanations. However, CRAFT further proposes to perform activation factorization, which reduces the high dimensionality of the neural basis (used by CRP).

PCX: PCX [39] extends CRP by collecting local concept-based explanations and clustering them to extract "prototypes" that summarize the model behaviour on the whole (training) dataset. As such, PCX reduces the workload of a user debugging a model, as only a few prototypes need to be studied instead of thousands of explanations. However, PCX does not provide any neuron labels.

Summit: The Summit framework combines local relevance scores and feature visualization techniques into compact visualizations, aiming to guide and facilitate the manual inspection of convolutional neurons and their roles within a network. Class relevances are derived by aggregating neuron activations over data samples associated with specific classes, while conditional neuron-to-neuron relevances are computed by aggregating the product of peak activations and connecting weights. For visualization, the class relevance scores of each neuron are combined into a vector and visualized using a UMAP projection to provide an overview of class specificity within the layer. The conditional relevance scores, on the other hand, are combined into an "attribution graph" that illustrates the interactions and roles of individual neurons. Both the UMAP projections and attribution graphs are accompanied by sampled and generated concept examples for each neuron.
While not explored in the original paper, the stacked class-relevance vectors can also be used to compare components across layers or architectures, offering further insights into network behaviour.

NetDissect: Network Dissection [14] is one of the first explanatory frameworks that aims to quantitatively analyse the latent representations of deep neural networks. In a first step, channels of convolutional layers are labelled by matching their upsampled spatial activation maps with densely annotated labelled data. The labelled representations then allow models to be compared by what they have learned and how well they match certain labels. To compare, however, labels need to be available. In principle, Network Dissection also allows auditing models by checking whether they align with expected labels (but without indicating how these are used). Latent interpretability can also be evaluated, under the assumption that low alignment indicates low interpretability.

Net2Vec: Net2Vec [44] is a framework in which (user-defined) concepts are mapped to vector embeddings based on corresponding component activations. Concretely, for each concept, neuron activations are collected on a reference ("probe") dataset, and subsequently weights are estimated for each neuron that correspond to its usefulness for detecting the concept. These vector embeddings show that in most cases multiple filters are required to code for a concept, and that neurons are often not concept-specific and encode multiple concepts. In their work, the authors use NetDissect to visualize the function of single neurons. The work observes that, compared to activation patterns, the Net2Vec embeddings better characterize the meaning of a representation and its relationship to other concepts. Further, the Net2Vec embeddings allow whole feature spaces of different models to be compared.
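The weight estimation at the heart of Net2Vec can be illustrated with synthetic data. This sketch uses a least-squares fit as a lightweight stand-in for the logistic regression of the original work; activations and labels are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Neuron activations on a probe dataset and binary labels for one concept
# (synthetic: here the concept depends on the first three neurons).
acts = rng.normal(size=(200, 32))
labels = (acts[:, :3].sum(axis=1) > 0).astype(float)

# Net2Vec estimates per-neuron weights reflecting how useful each neuron is
# for detecting the concept; a least-squares fit on centred labels stands in
# for the original logistic regression here.
w, *_ = np.linalg.lstsq(acts, labels - labels.mean(), rcond=None)

# Ranking neurons by |weight| surfaces those that jointly code for the concept.
ranking = np.argsort(-np.abs(w))
```

The resulting weight vector is the concept's embedding; that several entries are typically non-negligible is exactly the observation that multiple filters code for one concept.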
Notably, Net2Vec aims to compare how concepts are represented in whole feature spaces, and is not meant for component-level analysis.

CLIP-Dissect: CLIP-Dissect [35] extends Network Dissection by integrating a multimodal foundation model such as CLIP into the labelling pipeline. In principle, CLIP is used to soft-label the dataset with expected concepts; the activation pattern of each neuron is then compared to CLIP's soft labels in order to assign a label. The activation patterns and labels can further be used to compare models, as in [47].

TCAV + IG: TCAV [20] is a popular framework for evaluating the existence of expected concepts and the model's concept sensitivity. Concretely, in a first step, a linear direction in the latent space of some layer is estimated using latent activations on data samples with and without the concept. To test the model, the latent gradient is used, which [60] extended to the more stable Integrated Gradients (IG) [95] method. As such, we can compare models and their sensitivity w.r.t. expected concepts. However, there is no indication of how much the model actually relies on unexpected concepts.

FALCON: Similarly to SemanticLens, FALCON [38] is based on using CLIP models to annotate representations. FALCON therefore uses not only the most activating samples, but also (visually) similar yet lowly activating samples to further improve labelling. The work of [38] further estimates the interpretability of representations based on the semantic similarities of concept examples, as proposed for the clarity measure in Section 3.5.

W: The W framework [18] combines feature visualization and attribution techniques with neuron labelling to explain networks' decision-making on a local and global level. Their proposed labelling pipeline is the most similar to the one we describe in Supplementary Note H.4.
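The core of this kind of CLIP-based labelling pipeline is a cosine comparison between image and text embeddings. The sketch below assumes random unit vectors as stand-ins for real CLIP embeddings, and the label names are purely illustrative; it shows the averaged-embedding variant, where the image embeddings are pooled before comparison.

```python
import numpy as np

rng = np.random.default_rng(3)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-ins for CLIP embeddings of 50 concept examples and of
# three candidate text labels (unit-norm random vectors for illustration).
image_embs = normalize(rng.normal(size=(50, 512)))
label_names = ["dog snout", "cat whiskers", "car wheel"]
text_embs = normalize(rng.normal(size=(len(label_names), 512)))

# Average the image embeddings first, then pick the label with the highest
# cosine similarity to the averaged (re-normalized) embedding.
theta = normalize(image_embs.mean(axis=0))
sims = text_embs @ theta
label = label_names[int(np.argmax(sims))]
```

Comparing each image to each label individually and thresholding per pair would instead yield the sample-wise variant; averaging first dampens per-sample noise.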
To label a neuron, they collect data samples that maximize the neuron's activation and embed them into the latent space of a CLIP model alongside a set of predefined text labels. Based on the pairwise cosine similarities between the image and text embeddings, a set of labels for the examined neuron is selected using an adaptive threshold. Our labelling procedure differs in two key ways: we crop the collected samples to the image portions containing the most relevant pixels with respect to the inspected neuron, reducing the influence of background features not relevant to the neuron. Additionally, we average the image embeddings before measuring the cosine similarity, which also helps reduce the influence of noise in the data.

ConceptEvo: ConceptEvo [48] focuses on interpreting and comparing deep neural networks during training. Similar to our approach, it employs a unified semantic space to embed model components for comparison and interpretation; however, ConceptEvo constructs this space from scratch. This is done by learning neuron embeddings based on co-activation relationships in a base model and aligning image embeddings by minimizing their distance to the neurons they strongly activate. Network components from other models are then represented by averaging the embeddings of their strongly activating images. Whereas ConceptEvo proposes a novel way to measure the importance of a concept's evolution, no relevance measures for individual concepts are presented.

SpuFix: In [62], Neuhaus et al. extensively analyse the representations of a robustly trained ResNet50 ImageNet classifier by applying activation factorization, similar to CRAFT. They manually inspect and label the resulting activation directions as encoding either spurious or valid features for the corresponding class.
Building on this analysis, they propose the SpuFix method to identify spurious directions in other ImageNet classifiers without requiring manual labelling. SpuFix aligns the spurious directions identified in the ResNet50 model with directions in the target classifier by maximizing co-activation, and prunes the matched spurious components to mitigate reliance on spurious features. Additionally, the manually labelled spurious directions are used to construct Spurious ImageNet, a dataset containing only spurious features for 100 ImageNet classes. This dataset enables the evaluation of a classifier's reliance on spurious features by assessing overall and per-class accuracy metrics.

MAIA: The Multimodal Automated Interpretability Agent (MAIA) utilizes a pre-trained vision-language model and a set of tools (e.g., collecting highly activating samples for a given neuron, cropping images) to derive an interpretability agent. Presented with a question like "What does neuron #42 in layer 5 encode?", the vision-language model autonomously queries a provided API of interpretability tools and runs multiple hypothesis-and-validation iterations before providing an answer.

B Experimental Settings

The following section outlines the experimental settings used throughout this work.

B.1 Architectures and Models

We evaluate multiple pre-trained models from the torchvision [96] and Hugging Face model zoos, as detailed in the following.

ResNet
The ResNet is a convolutional neural network architecture consisting of four layer blocks and one fully connected layer. For all experiments, we collect activation and relevance scores after each layer block. Concretely, we use the ResNet [69] architectures ResNet18, ResNet34, ResNet50, ResNet50v2, ResNet101, and ResNet101v2 provided by torchvision [96].
We further evaluate ResNet18 with identifier "resnet18.a1_in1k", ResNet34 with identifier "resnet34.a1_in1k", ResNet50s with identifiers "resnet50.a1_in1k", "resnet50d.a1_in1k", and "resnet50d.a2_in1k", and ResNet101 with identifier "resnet101.a1_in1k" from the timm model zoo [97].

VGG
The VGG [75] is a convolutional neural network architecture consisting of a set of convolutional layers and three fully connected layers. In our experiments, the VGG-13, VGG-16 and VGG-19 with and without Batch Normalization [98] (BN) layers are used from the torchvision model zoo.

Vision Transformer
The vision transformer applies attention and fully connected layers to the input image after it has been split into tiles and projected into a sequence of tokens. In our experiments, we use the models with identifiers
• "vit_small_patch16_224.augreg_in21k_ft_in1k",
• "vit_mediumd_patch16_reg4_gap_256.sbb_in12k_ft_in1k",
• "vit_large_patch16_224.augreg_in21k_ft_in1k",
from the timm model zoo [97]. Notably, the ViT's last linear layer does not operate on activations that are preprocessed by a ReLU non-linearity. As such, activations can be negative or positive. Thus, we apply both activation maximization and minimization to generate concept examples. Concretely, we duplicate the feature dimension (i.e., double the number of neurons), and collect the most activating samples for the first half and the least activating samples for the second half of the neurons.

Foundation Models
We use Mobile-CLIP [66] throughout all ImageNet experiments in Sections 4.1 and 4.2, DINOv2 [67] for the interpretability experiments in Section 4.4, and WhyLesionCLIP [68] for the medical experiments per default. Overall, we used five foundation models in this work: CLIP-OpenAI [23] (CLIP-ViT-base-patch32), CLIP-LAION [82] (CLIP-ViT-B-32-laion2B-s34B-b79K), Mobile-CLIP [66] (MobileCLIP-S2), DINOv2 [67] (DINOv2-base), and WhyLesionCLIP [68].
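The feature-dimension duplication described above for signed ViT activations can be sketched in a few numpy lines, with random activations standing in for the real pre-logit features:

```python
import numpy as np

rng = np.random.default_rng(4)

# Signed pre-logit activations (no ReLU) for 1000 samples and 16 neurons.
acts = rng.normal(size=(1000, 16))

# Duplicate the feature dimension: the first half ranks samples by maximal
# activation, the negated second half by minimal activation.
signed = np.concatenate([acts, -acts], axis=1)  # shape (1000, 32)

# Top-k sample indices per (doubled) neuron serve as concept examples.
k = 9
topk = np.argsort(-signed, axis=0)[:k]  # shape (9, 32)
```

Column j + 16 of `topk` thus holds the indices of the samples that minimize neuron j, so one top-k pass covers both directions.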
B.2 Datasets

We use two datasets throughout our experiments, which are presented in the following.

B.2.1 ImageNet

ImageNet [64] (specifically ImageNet-1k) is a visual object classification dataset with 1000 object classes and contains 1,281,167 training images and 50,000 validation images. ImageNet is used throughout Sections 4.1, 4.2 and 4.4. All data is, per default, loaded as follows: (1) resizing images to a resolution of 256×256 pixels, (2) centre-cropping to 224×224 pixels, (3) normalization to the range [0, 1] (RGB values are divided by 255), and (4) subtracting the mean values [0.485, 0.456, 0.406] followed by division by the standard deviation values [0.229, 0.224, 0.225] over the red, green and blue colour channels, respectively.

B.2.2 ISIC Challenge

The medical dataset used in Section 4.3 is taken from the ISIC challenge of 2019 [65,76,77], including 25,331 dermoscopic images among the diagnostic categories "Melanoma", "Melanocytic nevus", "Basal cell carcinoma", "Actinic keratosis", "Benign keratosis", "Dermatofibroma", "Vascular lesion", and "Squamous cell carcinoma". For simplicity, we group all categories other than melanoma into one group "other" to obtain a binary classification setting. All data is, per default, loaded as follows: (1) resizing images to a resolution of 224×224 pixels, (2) normalization to the range [0, 1] (RGB values are divided by 255), and (3) subtracting the mean values [0.5, 0.5, 0.5] followed by division by the standard deviation values [0.5, 0.5, 0.5] over the red, green and blue colour channels, respectively.

C Search in Semantic Space

In this section, we provide additional details and examples of the search tool within SemanticLens, as presented in Section 4.1.1. A quantitative analysis of SemanticLens' labelling capabilities (building on search) and the effects of selected hyperparameters are discussed in Supplementary Note D.4. The search tool of SemanticLens enables identifying neurons that have learned concepts similar to a given search prompt.
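At its core, this search is a nearest-neighbour lookup in the semantic space using the alignment score of Eq. (1). A minimal numpy sketch, with random unit vectors standing in for the neuron embeddings, the embedded search prompt, and the empty-template "null" embedding:

```python
import numpy as np

rng = np.random.default_rng(5)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random unit vectors stand in for the semantic embeddings of 2048 neurons,
# the embedded search prompt, and the empty-template "null" embedding.
V = normalize(rng.normal(size=(2048, 512)))
theta_probe = normalize(rng.normal(size=512))
theta_null = normalize(rng.normal(size=512))

# Eq. (1): alignment a = s(theta, theta_probe) - s(theta, theta_null) with
# cosine similarity s; the best-matching neuron maximizes a.
a = V @ theta_probe - V @ theta_null
top10 = np.argsort(-a)[:10]
best = int(top10[0])
```

Subtracting the similarity to the null embedding removes the component every prompt shares (template wording, generic "photo" content), so the ranking reflects the queried concept rather than the prompt format.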
As described in Section 3.3, the search process operates on the semantic embeddings V_M = {θ_1, ..., θ_k} of the components of a model M. The process involves the following steps:

1. Probe Selection: The searched concept can be specified in any modality supported by the foundation model within SemanticLens, or as combinations thereof. For instance, the search prompt may take the form of one or multiple samples from the model's data domain (e.g., images of green objects), or one or multiple text descriptions (e.g., "a photo of a green car", "a photo of a green wall", etc.).

2. Probe Embedding: The selected search prompt is embedded into the semantic space. If multiple prompts are provided, their embeddings are aggregated (e.g., via average pooling) to produce a single probe embedding θ_probe.

3. Similarity Search: The final step is a similarity search within the set of semantic embeddings V_M w.r.t. θ_probe:

θ* = argmax_{θ ∈ V_M} a(θ, θ_probe),  with  a(θ, θ_probe) := s(θ, θ_probe) − s(θ, θ_<>).   (1)

Supplementary Fig. C.1: Search results among all neurons in the final layers of each of the four ResNet blocks of the ResNet50v2 for the animal-related queries "animal snout", "dog snout", "cat snout", "leopard fur" and "animal fur" (from left to right). We show the top-10 most aligned neurons, sorted in descending order from top to bottom. For each neuron, we provide the specific alignment score a (as defined in Eq. (1)), along with the neuron index and layer name. The scores are derived using the CLIP-Mobile-S2 foundation model with the three prompt templates "<concept>", "a <concept>" and "an image of a close up of <concept>". The "null" embedding is obtained from the empty templates.
The results demonstrate SemanticLens' ability to identify neurons specific to a given query and to differentiate between similar concepts, such as the general term "animal snout" and more specific terms like "dog snout" or "cat snout".

Here, θ_<> is a "null" embedding used to mitigate the influence of background noise, such as background objects in image prompts, or irrelevant words in text prompts (e.g., "photo" in the prompt template "a photo of <concept>"). The similarity score s is defined in Eq. (3). In Supplementary Figs. C.1 and C.2, we present search results for different text prompts, carried out across the four ResNet blocks of the ResNet50v2 ImageNet classifier using the CLIP-Mobile-S2 foundation model. These examples highlight SemanticLens' ability to identify and differentiate between similar yet distinct representations based on the selected search probes (Supplementary Fig. C.1) and demonstrate the range of concept specificity that can be uncovered (Supplementary Fig. C.2). The ability to search for concepts naturally facilitates comparisons between models, providing insights into a model's general and relative expertise with respect to the queried concepts. In Supplementary Fig. C.3, we display the neurons most aligned with the text probe "human face" across six distinct ImageNet classifiers. SemanticLens' comparative capabilities extend beyond visual inspection and are examined in further detail in Supplementary Note E.

Supplementary Fig. C.2: Search results among all neurons in the final layers of each of the four ResNet blocks of the ResNet50v2 for the queries "watermarks", "Bokeh Effect", "Bioluminescence", "car wheels" and "Hamburg" (from left to right). We show the top-10 most aligned neurons, sorted in descending order from top to bottom. For each neuron, we provide the specific alignment score a (as defined in Eq. (1)), along with the neuron index and layer name.
The scores are derived using the CLIP-Mobile-S2 foundation model with the three prompt templates "<concept>", "a <concept>" and "an image of a close up of <concept>". The "null" embedding is obtained from the empty templates. The results showcase the diverse range of learned representations that can be discovered using SemanticLens. These include spurious concepts, such as watermarks, specialized neurons that activate for unique phenomena like bioluminescence, and even abstract representations, such as an entire city like Hamburg, encompassing maritime and harbour-like features.

Supplementary Fig. C.3: Comparison of search results among neurons in the final feature layer of VGG19, VGG16, VGG13, ResNet101v2, ResNet50v2 and ResNet18 (from left to right) for the search query "human face". We present the top-10 most aligned neurons, sorted in descending order from top to bottom. For each neuron, we provide the specific alignment score a (as defined in Eq. (1)), along with the neuron index and layer name. The scores are derived using the CLIP-Mobile-S2 model with the three prompt templates "<concept>", "a <concept>" and "an image of a close up of <concept>".

D Describe in Semantic Space

This section provides more details on Section 4.1.2, where it is demonstrated how SemanticLens can be utilized to describe the internal representations and encoded knowledge. Concretely, Supplementary Note D.1 provides an overview of the used concept labels, and Supplementary Notes D.2, D.3 and D.5 include details and additional experiments for UMAP embeddings, attribution graphs, and network dissection, respectively. Lastly, Supplementary Note D.4 provides faithfulness evaluations for labels.

D.1 Concept Set

The following concept labels are used for annotation if not stated otherwise:

Colors: red color, blue color, green color, yellow color, black color, white color, gray color, brown color, orange color, pink color, purple color, cyan color, magenta color, teal color, maroon color, navy color, olive color, lime color, aqua color, coral color, turquoise color, gold color, silver color, beige color, lavender color, peach color, mint color, rose color, violet color, charcoal color, salmon color, chocolate color, emerald color, crimson color, indigo color, tan color, ruby color, amber color, sapphire color, fuchsia color, bronze color, copper color, ivory color, plum color, mustard color, khaki color, periwinkle color, sand color, slate color, burgundy color,

Textures: smooth texture, rough texture, bumpy texture, grainy texture, glossy texture, matte texture, silky texture, velvety texture, fuzzy texture, prickly texture, crinkled texture, wrinkled texture, ribbed texture, knitted texture, woven texture, quilted texture, embossed texture, pebbled texture, brushed texture, etched texture, polished texture, satin texture, metallic texture, rubberized texture, plastic texture, woodgrain texture, stone texture, leather texture, suede texture, linen texture, canvas texture, corduroy texture, denim texture, fur texture, feathered texture, wool texture, lace texture, mesh texture, net texture, sheer texture, transparent texture, opaque texture, speckled texture, striped texture, plaid texture, checked texture, houndstooth texture, chevron texture, paisley texture, floral texture, geometric texture, blotchy texture, dotted texture, banded texture, smeared texture, porous texture, pitted texture, fibrous texture, veined texture, perforated texture, crosshatched texture, sprinkled texture, polka-dotted texture, marbled texture, stained texture, grid texture, gauzy texture, interlaced texture, frilly texture, zigzagged texture, spiralled texture, swirly texture, cracked texture, studded texture, matted texture, flecked texture, potholed texture, scaly texture, stratified texture, braided texture, lined texture, waffled texture, freckled
texture, honeycombed texture, lacelike texture, chequered texture, crystalline texture, bubbly texture, grooved texture, pleated texture, cobwebbed texture, waves texture, textile texture, text texture, font texture, lettering texture, noise texture,

Animals: animal, mammal, bird, reptile, amphibian, insect, dog, cat, cow, horse, sheep, pig, chicken, duck, goose, turkey, rabbit, deer, mouse, rat, elephant, tiger, lion, bear, wolf, fox, monkey, gorilla, chimpanzee, panda, kangaroo, koala, giraffe, zebra, hippopotamus, rhinoceros, leopard, cheetah, hyena, crocodile, alligator, snake, lizard, frog, toad, turtle, tortoise, fish, shark, whale, dolphin, octopus, crab, lobster, penguin, ostrich, human, man, woman, baby,

Objects: plate, fan, sponge, joystick, sunglasses, sandals, tissue, light, drum, iron, wrench, ironing board, modem, game console, projector, ship, cable, knife, perfume, pliers, tennis ball, chair, scarf, fork, keyboard, cup, ring, router, sweater, pillow, notebook, shorts, shirt, blouse, boots, dishwasher, garbage can, rug, heater, spoon, charger, backpack, cabinet, horn, car, pen, oven, soap, mouse, comb, television, gloves, hammer, paper, automobile, carpet, tablet, conditioner, fireplace, shampoo, lens, dress, table, sneakers, watch, truck, nail polish, belt, stamp, remote, clothing, shower, air conditioner, coat, sock, makeup, apparel, lipstick, cosmetics, chandelier, printer, bucket, jeans, razor, kettle, usb, stove, candle, headphones, sheet, bag, map, toaster, pants, mirror, scanner, mattress, sink, drawer, first aid kit, recycling bin, garbage, glue, plug, necklace, instrument, medicine, pill, lamp, tape, toothbrush, bookshelf, hat, phone, broom, shoes, toothpaste, scissors, boat, curtain, poster, jacket, suitcase, hairbrush, footwear, package, mousepad, van, juicer, bus, towel, scale, computer, speaker, laptop, vacuum cleaner, desk, mixer, glasses, bottle, mug, calendar, globe, screen, pan, fridge, tie, motorcycle, letter, coffee maker, flag, door, lotion, pencil, earring, painting, blender, ruler, glass, water, tripod, sculpture, guitar, bathtub, bed, box, underwear, microwave, brush, envelope, piano,
trumpet, monitor, drawing, stick, binoculars, bowl, pot, bicycle, can, mop, wallet, bandage, camera, microphone, airplane, memory card, golf ball, plane, violin, adapter, vacuum bag, window, skirt, teapot, bracelet, clock, toilet, statue, battery, remote control, tent, blanket, cream, train, laundry basket, purse, extension cord, book, tool, cap, screwdriver,

Animal parts: head, eye, ear, nose, mouth, tongue, teeth, fang, beak, hand, arm, finger, body, snout, horn, antler, neck, throat, shoulder, back, spine, tail, leg, paw, hoof, claw, foot, toe, wing, feather, fin, gill, scale, fur, skin, coat, whisker, mane, tusk, trunk, tentacle, gill, flipper, shell, plumage, beard, comb, wattle, crest, spur, hind leg, foreleg, udder, legs, feet,

Object parts: handle, knob, button, lever, switch, dial, screen, display, key, lock, lid, cap, cover, base, leg, foot, wheel, axle, blade, edge, tip, point, corner, side, face, panel, frame, bracket, hinge, joint, bolt, screw, nut, washer, spring, hook, clasp, clip, latch, rivet, pin, peg, plug, socket, cord, cable, chain, strap, belt, wheel, pulley, rope, string, chinese text,

Object scenes: kitchen, bathroom, living room, bedroom, dining room, office, garden, garage, classroom, library, restaurant, café, supermarket, mall, park, playground, beach, mountain, forest, desert, farm, barn, factory, warehouse, airport, train station, bus station, hospital, clinic, laboratory, gym, stadium, theater, cinema, museum, gallery, workshop, studio, hotel, church, temple, mosque, market, street, alley, highway, bridge, tunnel, harbor, dock, music studio, dirt track, sky, sea, snow,

Vegetations: vegetation, grass, tree, shrub, bush, fern, moss, vine, flower, weed, algae, cactus, palm, pine, oak, maple, birch, willow, cedar, spruce, fir, bamboo, ivy, lichen, herb, succulent, reed, sedge, lily, rose, tulip, daisy, sunflower, orchid, poppy, lavender, thistle, mint, basil, thyme, sage, parsley, rosemary, chives, oregano, cilantro, spinach, lettuce, cabbage, broccoli, kale, plant,

Food: food, fruit, apple, banana, orange, lemon, lime, grape, cherry, strawberry, blueberry, raspberry, blackberry, pineapple, watermelon, cantaloupe, honeydew, kiwi, peach, plum, pear, apricot, mango,
papaya, coconut, avocado, tomato, cucumber, bell pepper, carrot, celery, broccoli, cauliflower, cabbage, lettuce, spinach, kale, arugula, chard, beet, radish, turnip, potato, sweet potato, yam, cassava vegetable, taro vegetable, pumpkin, squash, zucchini, eggplant, pea, bean, lentil, chickpea, peanut, almond, cashew, pistachio, walnut, pecan, hazelnut, macadamia,

D.2 UMAP Embeddings

As shown in Fig. 2a, a UMAP [99] embedding can be effective to gain an overview of the concepts encoded by the model components. For ImageNet models, we use the 50 maximally activating images as concept examples per neuron, which are embedded by the CLIP-Mobile-S2 foundation model to obtain semantic embeddings θ. Further, we perform k-means clustering with k = 60 for Fig. 2a, and k = 160 for the additional examples shown in Supplementary Figs. D.1 to D.3 for the ResNet50v2, ResNet50, and ResNet50.a1, respectively. The points in semantic space are colour-coded according to their similarity to the concepts of vegetation, plant, tree (green), animal, living (red), vehicle (blue), and texture, pattern (yellow). In order to label the clusters, we retrieve the two most aligned concept labels for the average θ of each cluster.
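The clustering-and-labelling step above can be sketched with numpy alone. Random unit vectors stand in for the neuron embeddings θ and the embedded concept labels, and a few Lloyd iterations act as a minimal stand-in for a full k-means implementation:

```python
import numpy as np

rng = np.random.default_rng(6)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random unit vectors stand in for the semantic embeddings of 500 neurons
# and for 40 embedded concept labels.
thetas = normalize(rng.normal(size=(500, 64)))
label_embs = normalize(rng.normal(size=(40, 64)))

# A few Lloyd iterations as a minimal stand-in for k-means (k = 8),
# assigning each point to the most cosine-similar centre.
k = 8
centers = thetas[rng.choice(len(thetas), size=k, replace=False)]
for _ in range(10):
    assign = np.argmax(thetas @ centers.T, axis=1)
    centers = np.stack([
        normalize(thetas[assign == j].mean(axis=0)) if np.any(assign == j)
        else centers[j]
        for j in range(k)
    ])

# Label each cluster with its two most aligned concept labels, computed
# from the (re-normalized) average embedding of the cluster.
top2 = np.argsort(-(centers @ label_embs.T), axis=1)[:, :2]
```

In the actual pipeline the embeddings come from the foundation model and k is 60 or 160; the mechanics of averaging a cluster and ranking label similarities are the same.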
Mechanistic understanding and validation of large AI models

[Top-two cluster labels for the 160 clusters of the ResNet50v2 embedding, shown in Supplementary Fig. D.1.]
Supplementary Fig. D.1: UMAP embedding of a ResNet50v2's semantic representation, where points are colour-coded according to their similarity to the concepts vegetation, plant, tree (green), animal, living (red), vehicle (blue), and texture, pattern (yellow). We further provide the top two labels for each of the 160 clusters.

For all models, concepts related to vegetation, plant, tree, animal, living, vehicle, and texture, pattern can be seen. However, there are also differences in structure. For example, the ResNet50 embedding shows a clear outlier referring to Chinese text, which is known to be present as a spurious correlation in the ImageNet dataset.

D.3 Network Dissection

As described in Section 4.1.2, we can group labelled neurons into parent categories (where such parent categories are available), an approach referred to as Network Dissection [14]. The concept-labels and parent categories used in this work are listed in Supplementary Note D.1. For labelling neurons, we use the text embedding templates "<concept>", "a <concept>", "<concept>-like", and "an image of a close-up of <concept>", where <concept> is replaced by the actual concept-label. To compute alignment scores a, we further subtract the alignment obtained with an empty template. Finally, all neurons with an alignment of a < 0.025 are filtered out.
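The labelling recipe above (template-averaged text embeddings, empty-template baseline subtraction, alignment threshold) can be sketched as follows. This is a plain-NumPy illustration: `embed_text` stands in for the CLIP text encoder, and `toy_embed` is a purely hypothetical toy encoder used only to make the example runnable:

```python
import numpy as np

TEMPLATES = ["{c}", "a {c}", "{c}-like", "an image of a close-up of {c}"]

def label_neurons(neuron_embs, embed_text, concepts, threshold=0.025):
    """Assign each neuron its best-aligned concept-label, or None.

    The alignment score subtracts the cosine similarity obtained with an
    empty template, and neurons below `threshold` remain unlabelled.
    """
    # Average the text embedding over all prompt templates per concept.
    concept_embs = np.stack([
        np.mean([embed_text(t.format(c=c)) for t in TEMPLATES], axis=0)
        for c in concepts
    ])
    empty_emb = embed_text("")

    def normed(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    sims = normed(neuron_embs) @ normed(concept_embs).T   # (neurons, concepts)
    baseline = normed(neuron_embs) @ normed(empty_emb)    # (neurons,)
    align = sims - baseline[:, None]

    labels = []
    for row in align:
        k = int(row.argmax())
        labels.append(concepts[k] if row[k] >= threshold else None)
    return labels

def toy_embed(text):
    # Hypothetical stand-in for the CLIP text encoder (illustration only):
    # one axis per keyword, plus a small component shared with the empty prompt.
    return np.array([float("dog" in text), float("cat" in text), 0.1])

neuron_embs = np.array([[1.0, 0.0, 0.1],   # a "dog"-like neuron embedding
                        [0.0, 1.0, 0.1]])  # a "cat"-like neuron embedding
labels = label_neurons(neuron_embs, toy_embed, ["dog", "cat"])
```

With a real CLIP text encoder, `concepts` would be the full concept-label list from Supplementary Note D.1.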
[Top-two cluster labels for the 160 clusters of the ResNet50 embedding, shown in Supplementary Fig. D.2.]
Supplementary Fig. D.2: UMAP embedding of a ResNet50's semantic representation, where points are colour-coded according to their similarity to the concepts vegetation, plant, tree (green), animal, living (red), vehicle (blue), and texture, pattern (yellow). We further provide the top two labels for each of the 160 clusters. An outlier cluster corresponding to a well-known spurious correlation (Chinese text) in the ImageNet dataset is visible.

An example of the resulting number of neurons per parent category across all layers of a ResNet50v2 is shown in Supplementary Fig. D.4a. The example shows that relatively more texture- and colour-related concepts exist in lower-level layers, and more higher-level concepts such as "animals" or "objects" in higher-level layers. We further illustrate that one can inspect the most-matched concepts in Supplementary Fig. D.4a (middle), which correspond to dog, insect, food, and bird with 146, 34, 32, and 31 neurons in the last layer, respectively. The relative share of neurons in the parent categories can also be compared across models, as shown in Supplementary Fig. D.4b, where the relative shares of a ResNet50, ResNet50v2, VGG-16, and VGG-13 are depicted.
Comparing the ResNet50 and ResNet50v2, it is apparent that the ResNet50v2 has learned more higher-level concepts (e.g., "animal") and fewer lower-level concepts (e.g., "animal parts"). This is to be expected, as the ResNet50v2 is trained more extensively, leading to more class-specific concepts (corresponding to "object" and "animal") in later layers [74, 58]. A similar trend can be seen when comparing the smaller VGG-13 and the larger VGG-16 model, where the larger one shows more higher-level concepts.

[Top-two cluster labels for the 160 clusters of the ResNet50.a1 embedding, shown in Supplementary Fig. D.3.]
Supplementary Fig. D.3: UMAP embedding of a ResNet50.a1's semantic representation, where points are colour-coded according to their similarity to the concepts vegetation, plant, tree (green), animal, living (red), vehicle (blue), and texture, pattern (yellow). We further provide the top two labels for each of the 160 clusters.

Another example of network dissection is shown in Supplementary Fig. D.5. Here, we group concepts into parent categories for the "Ox" class of ImageNet using a ResNet50v2 model. Having multiple levels of parent categories available allows one to structure and simplify the knowledge encoded by a model. For example, we can identify three neurons (the number of neurons per concept/group is given in brackets) belonging to horns, which are hierarchically grouped into "Head", "Body Parts", and "Physical".
D.4 Evaluation of Concept Labelling

In this section, we first compare the labelling abilities of SemanticLens with two recent neuron-labelling methods, INVERT [34] and CLIP-Dissect [35], using an evaluation procedure adapted from [73], which analyses neuron activation patterns on synthetic concept examples. Secondly, we conduct an ablation study on key hyperparameters, including the use of prompt templates, the number of concept examples used for semantic embedding computation, and the activation pooling strategy applied during concept example sampling.

Supplementary Fig. D.4: Dissection of neural networks to quantitatively understand and compare representations. a) For all layers of a ResNet50v2, we group components (i.e., neurons) into parent categories to gain an overview of the learned concepts. Overall, higher-level concepts are more dominant in later layers, whereas lower-level concepts such as textures are more dominant in earlier layers. The most assigned concept is dog, with over 146 of 2048 neurons. b) We can also compare the relative shares of neurons in parent categories across models. More extensively trained models and deeper architectures show relatively more higher-level concepts in the last layer (e.g., ResNet50 vs. ResNet50v2, and VGG-13 vs. VGG-16).

D.4.1 Evaluation Procedure

The lack of ground truth information makes evaluating explanations a challenging endeavour. For our specific case of neuron labelling, we adapt the recently proposed procedure in [73] to assess the faithfulness of a label. Concretely, the evaluation assumes that a good label for a neuron, when represented as model input, induces high activation in that neuron.
To test this, we utilize a text-to-image model to generate a set of synthetic concept examples for each concept-label k ∈ T, which we denote by C_k = {x_1, ..., x_n}, where T is the set of labels used during the labelling process. With this synthetic test set, we can define the response of a neuron i (given by M_i) to the concept k as

$$a_i(k) = \frac{1}{|C_k|} \sum_{x \in C_k} M_i(x)$$

and score a neuron labelling (i, k) via

$$\phi(i,k) := \frac{a_i(k) - \min_t a_i(t)}{\max_t a_i(t) - \min_t a_i(t)} \in [0,1] \qquad (1)$$

where more faithful labels result in a score closer to 1. In the following study, we report the mean and standard error of this score across multiple neuron-label pairs for each method under investigation.

D.4.2 Benchmark

The benchmark study presented in this section aims to validate the faithfulness of the labels derived via SemanticLens across a variety of settings. Specifically, we evaluate the investigated methods (SemanticLens, INVERT, and CLIP-Dissect) on the first 200 neurons in the final feature layer of five distinct ImageNet classifiers: ResNet50v2, ResNet101v2, VGG13, VGG16, and VGG19. For this, we use concept-labels and activations extracted from three different datasets, namely ImageNet, the Broad and Densely Labeled Dataset (Broden) [100], and Parts and Attributes of Common Objects (Paco) [101]. For ImageNet, the 1k class names (e.g., "coffee mug", "strawberry") are used, while for Broden and Paco we extract 1,344 and 531 concept-labels, respectively, using the provided finer-grained segmentation annotations such as object parts (e.g., "mug handle"), colours (e.g., "red colour"), and textures (e.g., "striped texture"). The synthetic concept example sets C_k used in the evaluation are generated using the openly available Stable Diffusion v1.5 model [102] and the prompt template "photograph of a <concept>", with each set comprising 40 images. Prompt templates are also used for generating the concept-label embeddings for SemanticLens and CLIP-Dissect via the CLIP-Mobile-S2 model.
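The faithfulness score φ of Eq. (1) can be computed per neuron with a few lines of NumPy. This is a minimal sketch; the response values below are hypothetical:

```python
import numpy as np

def label_score(responses, label_idx):
    """Faithfulness score phi(i, k) of Eq. (1) for a single neuron i.

    responses: a_i(t), the mean activation of neuron i on the synthetic
               example set C_t of every candidate concept t, shape (|T|,).
    label_idx: index k of the concept-label assigned to neuron i.
    Returns a value in [0, 1]; 1 means no other candidate concept
    activates the neuron more strongly than its assigned label.
    """
    r = np.asarray(responses, dtype=float)
    return float((r[label_idx] - r.min()) / (r.max() - r.min()))

# Hypothetical responses of one neuron to three candidate concepts:
responses = [0.2, 1.0, 0.4]
```

A label pointing at the maximally activating concept scores 1, the minimally activating one scores 0, and everything else falls in between.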
Specifically, we use the average embedding over the templates "<concept>", "<concept>-like", "a <concept>", and "an image of a close-up of <concept>". For sampling activation-maximizing concept examples of neurons in SemanticLens and CLIP-Dissect, we follow the approach in [35] and apply mean pooling across the spatial dimensions of the neuron activations. The number of activation-maximizing samples in both methods is set to 30. The effect of hyperparameters on the labelling quality of SemanticLens is studied in Supplementary Note D.4.3.

The benchmarking results in Supplementary Tab. D.1 demonstrate that SemanticLens is a competitive neuron-labelling method, consistently outperforming the random baseline across the investigated datasets and networks. It performs particularly well on ResNet architectures, achieving high scores on ImageNet and Paco, such as 0.774 ± 0.022 on ResNet101 for ImageNet, closely matching the performance of CLIP-Dissect and INVERT. Notably, on Paco, SemanticLens outperforms the other methods on all investigated models except VGG19.

Supplementary Fig. D.5: Grouping of concepts into parent categories for the "Ox" class of ImageNet using a ResNet50v2 model. Having multiple levels of parent categories available allows one to structure and simplify the knowledge encoded by a model. For example, we can identify three neurons (the number of neurons per concept/group is given in brackets) belonging to horns, which are hierarchically grouped into "Head", "Body Parts", and "Physical".

This strong performance on Paco may be attributed to the practice of cropping concept examples to the relevant image regions, which gives an advantage for highly localized concept-labels, such as object parts that occupy only a small portion of the image and may be overlooked by methods that use the full activation-maximizing samples during labelling.
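The activation-maximizing sample selection with spatial pooling, as used above, can be sketched in NumPy. The array layout (images × neurons × height × width) is an assumption of this example:

```python
import numpy as np

def top_activating_indices(activations, neuron, k=30, pooling="mean"):
    """Pick the k inputs that maximally activate a given neuron.

    activations: (n_images, n_neurons, H, W) spatial feature maps.
    pooling: 'mean' or 'max' over the spatial dimensions, the two
             strategies compared in the ablation study.
    Returns image indices sorted from most to least activating.
    """
    maps = activations[:, neuron]  # (n_images, H, W)
    if pooling == "mean":
        pooled = maps.mean(axis=(1, 2))
    else:
        pooled = maps.max(axis=(1, 2))
    return np.argsort(pooled)[::-1][:k]

# Toy example: two images, one neuron, 2x2 feature maps.
acts = np.zeros((2, 1, 2, 2))
acts[0, 0, 0, 0] = 5.0   # image 0: one strong peak, low elsewhere
acts[1, 0] = 2.0         # image 1: uniformly moderate activation
```

Note that the two strategies can rank the same inputs differently: a single strong spatial peak wins under max pooling, whereas a uniformly activated map wins under mean pooling.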
However, on VGG networks, SemanticLens struggles with larger label spaces, such as Broden, often scoring below CLIP-Dissect and INVERT.

Supplementary Tab. D.2 shows the results across the investigated layers of the ResNet50v2 model, where neurons in the final layer of each of the four ResNet blocks were labelled. The results are consistent with those presented in Supplementary Tab. D.1. We observe that the random baseline score for Block 0 (first feature layer) and Block 1 (second feature layer) is close to 0.5, suggesting that the set of potential concept labels may not be well suited for these layers. This leads to a small gap between the highest and lowest possible neuron responses across the concept-labels, $\max_t a_i(t) - \min_t a_i(t)$, resulting in an artificially high score for random label assignment. This aligns with previous results suggesting that neurons in earlier layers are more difficult to label [73].

D.4.3 Ablation Study

In this section, we investigate how various hyperparameters influence the labelling performance of SemanticLens and CLIP-Dissect. Specifically, we analyse the choice of foundation models, the use of prompt templates, the pooling strategy applied during activation maximization (ActMax), and the number of concept examples used within SemanticLens to derive the semantic embeddings of neurons. Our goal is to identify configurations that yield robust and accurate neuron labels.

Setup. For all ablation experiments, we use the same evaluation metric φ described in Eq. (1). The results are averaged across the five neural network architectures considered in the main benchmark and are reported along with the standard error. We begin by examining the effects of the foundation model, prompt template, and pooling strategies, as summarized in

Supplementary Tab. D.1: Comparison of labelling methods. This table presents the average quality scores (with standard error) as defined in Eq.
(1) for the evaluated methods. The scores were averaged across 200 neurons from the final layers of the studied classifiers. For CLIP-Dissect and SemanticLens, the results also account for the employed foundation models. Both mean and max pooling strategies are reported for activation-maximizing sample selection. Higher scores indicate better performance. SemanticLens achieves results comparable to the other studied methods.

Score φ (↑)

Model         Method         Paco           Broden         ImageNet
ResNet101v2   Random         0.141±0.004    0.092±0.003    0.056±0.003
              INVERT         0.280±0.017    0.438±0.021    0.809±0.020
              CLIP-Dissect   0.485±0.022    0.588±0.023    0.736±0.024
              SemanticLens   0.508±0.022    0.551±0.023    0.774±0.022
ResNet50v2    Random         0.129±0.004    0.086±0.003    0.055±0.002
              INVERT         0.279±0.018    0.343±0.022    0.797±0.019
              CLIP-Dissect   0.448±0.023    0.490±0.026    0.713±0.024
              SemanticLens   0.488±0.023    0.448±0.026    0.754±0.022
VGG13         Random         0.251±0.007    0.209±0.006    0.200±0.006
              INVERT         0.437±0.019    0.474±0.020    0.682±0.017
              CLIP-Dissect   0.487±0.019    0.541±0.021    0.679±0.017
              SemanticLens   0.534±0.019    0.393±0.022    0.583±0.018
VGG16         Random         0.240±0.006    0.196±0.005    0.179±0.005
              INVERT         0.382±0.018    0.465±0.018    0.700±0.016
              CLIP-Dissect   0.476±0.020    0.493±0.022    0.694±0.017
              SemanticLens   0.508±0.019    0.394±0.021    0.635±0.019
VGG19         Random         0.216±0.006    0.175±0.005    0.157±0.004
              INVERT         0.399±0.018    0.526±0.018    0.701±0.017
              CLIP-Dissect   0.511±0.020    0.582±0.019    0.718±0.017
              SemanticLens   0.496±0.020    0.456±0.021    0.609±0.020

Supplementary Tab. D.3. After that, we present a detailed analysis of how the number of concept example samples affects labelling performance, shown in Supplementary Fig. D.6.

Effect of the Foundation Models. Supplementary Tab. D.3 compares the performance when using different foundation models (CLIP-OpenAI, CLIP-LAION, and CLIP-Mobile-S2). We find that CLIP-Dissect achieves its highest overall score (0.588) with CLIP-OpenAI, which was also used in the original paper [35].
In contrast, SemanticLens attains its best performance (0.542) with CLIP-Mobile-S2, indicating that the lighter-weight, mobile-tailored model provides embeddings that better complement SemanticLens's labelling pipeline. LAION embeddings consistently underperform compared to the other two models for both methods.

Effect of the Prompt Template. The use of a descriptive prompt template, "a photograph of a <concept label>", consistently improves labelling faithfulness. Both CLIP-Dissect and SemanticLens benefit from templates across all foundation models and pooling strategies. However, the improvement for CLIP-Dissect is modest: in its best-performing

Supplementary Tab. D.2: Comparison of labelling methods applied to four layers of varying depth. This table presents the average quality scores (with standard error), as defined in Eq. (1), for the investigated methods applied to the first 200 neurons across the four ResNet blocks of the ResNet50v2 model. For methods requiring embeddings, we use the CLIP-Mobile-S2 model with prompt templates. We observe higher scores of the random baseline in earlier layers (Block 0 and Block 1), likely because the concept labels are not well suited to neurons in these layers. This reduces the difference between the maximal and minimal neuron responses, inflating the random assignment scores (see Eq. (1)). INVERT dominates the lower layers on Broden and ImageNet, whereas on Paco, SemanticLens achieves the best performance in Block 3 and performs competitively with CLIP-Dissect, which leads in Block 1 and Block 0.
Score φ (↑)

Dataset    Method         Block 0        Block 1        Block 2        Block 3
Broden     Random         0.505±0.009    0.509±0.008    0.389±0.007    0.086±0.003
           INVERT         0.595±0.014    0.614±0.016    0.554±0.016    0.343±0.022
           CLIP-Dissect   0.565±0.012    0.508±0.014    0.524±0.016    0.490±0.026
           SemanticLens   0.534±0.012    0.536±0.012    0.457±0.015    0.448±0.026
ImageNet   Random         0.482±0.009    0.506±0.010    0.378±0.008    0.055±0.002
           INVERT         0.648±0.014    0.680±0.013    0.661±0.014    0.797±0.019
           CLIP-Dissect   0.583±0.014    0.624±0.015    0.609±0.015    0.713±0.024
           SemanticLens   0.577±0.011    0.583±0.014    0.516±0.015    0.754±0.022
Paco       Random         0.484±0.007    0.494±0.008    0.381±0.007    0.129±0.004
           INVERT         0.497±0.012    0.534±0.014    0.435±0.015    0.279±0.018
           CLIP-Dissect   0.534±0.013    0.579±0.014    0.518±0.017    0.448±0.023
           SemanticLens   0.509±0.013    0.556±0.014    0.542±0.015    0.488±0.023

configuration with the CLIP-OpenAI model, the score increases from 0.578 to 0.588, whereas SemanticLens shows more substantial gains, improving from 0.497 to 0.542. Further optimization of the prompt template could yield additional improvements in labelling accuracy.

Effect of the Pooling Strategy during Activation Maximization. We evaluate both max and mean pooling strategies during activation maximization for obtaining the neurons' concept examples. CLIP-Dissect's performance does not change significantly regardless of the pooling strategy. SemanticLens, on the other hand, exhibits a slight but consistent improvement when using mean pooling (e.g., from 0.537 to 0.542 with CLIP-Mobile-S2 and template). The mean strategy may provide a more representative snapshot of neuron activation patterns, enhancing SemanticLens's ability to infer accurate labels.

Effect of the Number of ActMax Samples. In addition to embedding choice, templates, and pooling strategies, the number of ActMax samples plays a critical role in labelling performance. Supplementary Fig. D.6 shows how the labelling score φ changes as we vary the number of ActMax samples from five to 50 across ResNet (ResNet50v2 and ResNet101v2) and VGG (VGG13 and VGG16) architectures.
Here again, distinct trends emerge between these two architecture families, with method-specific variations observed for SemanticLens and CLIP-Dissect. For ResNet models, performance generally improves as the number of samples increases. For SemanticLens, performance stabilizes around 20 samples, suggesting that sufficient activation coverage is achieved, with additional samples providing no further gains. In contrast, CLIP-Dissect exhibits varying behaviour depending on the foundation model. With CLIP-LAION and CLIP-Mobile-S2, performance peaks at approximately 15 samples before declining linearly, eventually falling below the initial scores observed at 5 samples. Interestingly, CLIP-Dissect with the CLIP-OpenAI model consistently improves as the number of samples increases, reaching its highest performance at 50 samples.

Supplementary Tab. D.3: Influence of hyperparameters. This table presents the labelling scores of SemanticLens and CLIP-Dissect averaged across the three investigated datasets and five neural networks under various hyperparameter selections. Higher scores indicate better performance. Both methods perform best with the prompt template "photograph of a <concept label>" across all foundation models. While the optimal activation aggregation strategy (ActMax in the table) varies by model, its overall impact is limited. SemanticLens achieves the highest performance with CLIP-Mobile-S2, while CLIP-Dissect performs best with CLIP-OpenAI.
Score φ (↑)

Method         ActMax   Template   CLIP-OpenAI    CLIP-LAION     CLIP-Mobile-S2
CLIP-Dissect   Max      False      0.578±0.038    0.573±0.039    0.576±0.034
                        True       0.588±0.039    0.582±0.036    0.581±0.034
               Mean     False      0.577±0.030    0.569±0.030    0.576±0.027
                        True       0.586±0.029    0.580±0.029    0.576±0.027
SemanticLens   Max      False      0.464±0.034    0.423±0.023    0.500±0.037
                        True       0.523±0.031    0.480±0.025    0.537±0.028
               Mean     False      0.454±0.036    0.435±0.024    0.497±0.039
                        True       0.518±0.031    0.491±0.024    0.542±0.029

For VGG models, SemanticLens shows performance degradation with the max pooling strategy as the number of ActMax samples increases. However, with mean pooling and the CLIP-Mobile-S2 or CLIP-OpenAI models, performance remains fairly stable across varying sample counts. One possible hypothesis for this behaviour is that VGG neurons are inherently less interpretable, leading to concept examples of lower clarity, as observed in Supplementary Note F and Fig. 5. As a consequence, the semantic embeddings may become increasingly noisy or diffuse as the number of samples grows, making it harder for SemanticLens to assign faithful concept labels. The mean pooling strategy seems to mitigate this effect to some extent.

In contrast, CLIP-Dissect's performance initially improves with an increased number of samples and then stabilizes. A plausible explanation is that, unlike SemanticLens, CLIP-Dissect determines concept assignments by maximizing the pointwise mutual information between the neuron's activations on the concept samples and candidate concept labels. This process inherently assigns probabilistic weights to individual samples, rather than treating their contributions as binary, as is the case in SemanticLens. This probabilistic weighting likely acts as a form of denoising, reducing the impact of ambiguous or noisy activations.

Based on these observations, a potential improvement for SemanticLens could involve incorporating weighted averaging strategies when computing semantic embeddings for neurons.
By adopting a mechanism akin to the SoftWPMI approach used in CLIP-Dissect, where ActMax samples influence the embedding process non-uniformly, SemanticLens could achieve greater robustness, particularly in architectures like VGG, where neuron interpretability is lower.

Overall, the ablation study highlights the importance of hyperparameter selection for SemanticLens, which achieves its best performance with CLIP-Mobile-S2 embeddings, prompt templates, and mean pooling. Notably, the effect of the number of concept examples on labelling faithfulness varies by architecture. However, for both investigated model families, performance stabilizes around the maximal value above 20 samples. As stated in Section 3.2, we use 30 concept samples within SemanticLens unless stated otherwise. A promising future direction is to introduce weighted averaging for SemanticLens, which could further improve its performance, particularly for architectures with less interpretable neurons.

D.5 Attribution Graph

In Section 4.1.2, we show an attribution graph (inspired by [15]) of a ResNet50v2, which illustrates where concepts are located in the architecture (in which layer) and how concepts depend on one another. An attribution graph is computed through backpropagation of relevance scores via CRP. The computational steps are outlined in the following:

[Panels of the figure: score vs. number of reference samples for ResNet50v2, ResNet101v2, VGG13, and VGG16; curves for CLIP-Dissect (CD) and SemanticLens (SL) with CLIP, LAION, and Mobile-S2 embeddings, using mean and max ActMax pooling.]

Supplementary Fig. D.6: Evaluation of the influence of the number of ActMax samples on the labelling abilities of SemanticLens. We evaluate the labels derived via SemanticLens and CLIP-Dissect for neurons in the final feature layers of four different neural networks, as defined in Eq. (1), using a varying number of reference samples.
The average score over all neurons in the inspected layer is plotted on the y-axis. Concept labels and ActMax samples were derived using ImageNet class names and test data. We see distinct effects of the number of used samples on the performance of SemanticLens between ResNet and VGG models. For ResNet, we see a steep ascent in performance up to 15 samples, stabilizing thereafter with a slight decline near the maximum of 50 samples. In contrast, for both VGG models we observe a monotone decrease of the overall performance, peaking at 5 reference samples and gradually declining as more samples are used.

1) Relevance filtering of components. We compute the highest relevance of a component on the test set for a specific output target (here "Ox"). This relevance score is used to filter out irrelevant neurons. Concretely, we filter out neurons with relevance below 1 % or 5/#neurons %.

2) Labelling of components. To label neurons, we first embed the following textual descriptions via the text model of CLIP: Horns, Eyes, Nose, Ears, Mouth, Neck, Back, Chest, Belly, Tail, Hooves, Legs, Brown, Black, Lined texture, Striped texture, Spotted texture, Brown colour, Short fur, Long fur, Thick skin, White, Spotted, Herding, Plowing, Pulling carts, Heavy labor, Grazing, Chewing cud, Grasslands, Savannahs, Farms, Pastures, Strength, Fertility, Festivals, Rituals, Hinduism, Mythology, Zebu, Hereford, Angus, Bison, Yak, Water Buffalo, Face, Grass, Arching, Angular, Branches, Mountains, Texture, Water, Sky, White color, Textures. We use the templates "<concept>", "<concept>-like", and "an image of <concept>". To compute an alignment score a, we subtract the cosine similarity resulting from an empty template and threshold at a ≥ 0.01. All remaining components are grouped according to their most aligned label.

3) Relevance propagation. We use CRP to attribute groups of components. Concretely, we start with layer 3 and attribute component groups using the LRP ε-z⁺-♭ rule with respect to the "Ox" class on the test set.
Subsequently, all lower-level (layer 2) component groups are attributed w.r.t. an upper-layer component group.

4) Investigate the attribution graph: The full attribution graph is shown in Supplementary Fig. D.7 for layer 2 ("block2") and layer 3 ("block3"). Notably, unaligned components are marked with "?", indicating concepts that have not been thought of, i.e., that are not included in the list of expected concepts. As discussed in Section 4.1.2, some concepts refer to copyright signs or background vegetation. Here, we additionally show the highest relevance scores on the test set for a group of components (with the same label) in parentheses.

E Compare in Semantic Space

This section provides more details on Section 4.1.3, where we describe how SemanticLens can be used to compare the knowledge of different models. SemanticLens allows one to qualitatively compare two models, as shown in Supplementary Fig. E.1 with two ResNet50 models trained on ImageNet, where one (ResNet50v2) is trained more extensively and achieves higher test accuracy. Both models share common knowledge, e.g., bird-related concepts. However, whereas the better-trained ResNet50v2 has learned more specific concepts, e.g., specific fur textures of dogs, the other has learned more abstract concepts that are shared across classes. For the dog breed "Komondor", which has a white mop-like coat, for example, the ResNet50 has learned a mop-like concept (neuron #93) that is used to detect "Komondor" as well as "mop", whereas the ResNet50v2 learned a class-specific concept. Further, the ResNet50 has learned a spotted burrito texture (neuron #485) that is used to detect burritos or German Shorthaired Pointer dogs. This is in line with works that study the generalization of neural networks under long training regimes, observing that latent model components become more structured and class-specific [74]. We further provide quantitative comparisons via network dissection in Supplementary Note D.3.
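The labelling step (2) above can be sketched in a few lines of numpy, assuming the CLIP embeddings of a component's concept examples and of the candidate labels have already been computed (all function and variable names here are illustrative, not part of the released code):

```python
import numpy as np

def alignment_scores(neuron_emb, label_embs, empty_emb, threshold=0.01):
    """Label one component as in step 2 of the attribution-graph pipeline.

    neuron_emb: (d,) semantic embedding of a component, e.g. the
        mean-pooled CLIP image embeddings of its ActMax concept examples.
    label_embs: (L, d) text embeddings, one per candidate concept,
        averaged over the templates "<c>", "<c>-like", "an image of <c>".
    empty_emb: (d,) text embedding of the empty template, subtracted
        as a baseline before thresholding at a >= 0.01.
    Returns (index of best-aligned label or None, all alignment scores a).
    """
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b)
        return a @ b

    # alignment a = cos-sim to label minus cos-sim to the empty template
    a = cos(label_embs, neuron_emb) - cos(empty_emb[None], neuron_emb)
    best = int(np.argmax(a))
    if a[best] < threshold:
        return None, a  # unaligned component, marked "?" in the graph
    return best, a
```

Components for which `None` is returned correspond to the "?" nodes in Supplementary Fig. D.7; all others are grouped by their most aligned label before relevance propagation.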
Supplementary Fig. D.7: An attribution graph of a ResNet50v2 for the "Ox" class with concepts in layer 2 ("block2") and layer 3 ("block3"). Unaligned components are marked with "?", indicating concepts that have not been thought of, i.e., that are not included in the list of expected concepts. The highest relevance scores on the test set for a group of components (with the same label) are shown in parentheses. The size of the edges between concepts indicates their importance.

Concretely, two models N and M may be quantitatively compared via the number of neurons that were assigned to concept labels, as introduced by NetDissect [14] and detailed in Supplementary Note D.3, or by measuring the set similarity S_{V_M \to V_N} based on the average maximal pairwise similarity:

S_{V_M \to V_N} = \frac{1}{|V_M|} \sum_{\theta \in V_M} \max_{\theta' \in V_N} s(\theta, \theta'), \quad (1)

which quantifies the degree to which the knowledge (semantics) encoded in model M is also encoded in model N. Eq. (1) allows us, e.g., to compare the same architecture with different training recipes, different model sizes and layers, or different model architectures, as shown in Supplementary Fig. E.2, where DINOv2 is used as the foundation model to generate semantic embeddings (using the top 30 most activating image patches).
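Eq. (1) reduces to a few numpy operations. A minimal sketch, assuming the semantic embeddings of both models are stacked row-wise and s(·,·) is cosine similarity (names illustrative):

```python
import numpy as np

def concept_set_similarity(V_M, V_N):
    """S_{V_M -> V_N} from Eq. (1): average maximal pairwise cosine
    similarity, i.e. how much of model M's encoded knowledge is also
    encoded in model N.

    V_M: (m, d) semantic embeddings of model M's components.
    V_N: (n, d) semantic embeddings of model N's components.
    """
    A = V_M / np.linalg.norm(V_M, axis=1, keepdims=True)
    B = V_N / np.linalg.norm(V_N, axis=1, keepdims=True)
    sim = A @ B.T                         # pairwise s(theta, theta')
    return float(sim.max(axis=1).mean())  # max over V_N, mean over V_M
```

Note that the measure is asymmetric: S_{V_M \to V_N} can differ from S_{V_N \to V_M}, e.g., when N encodes everything M does plus additional concepts.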
Training recipes matter: When comparing ResNet50 models trained with different training recipes in Supplementary Fig. E.2 (top left), it is apparent that differences in alignment exist. Interestingly, the ResNet50v2 aligns best overall, whereas the ResNet50 generally aligns poorly with the other models. Notably, there exists a correlation with test accuracy/training efficiency: the ResNet50v2 and the a1 version are trained for 600 epochs, whereas the a2 version and the ResNet50 are trained for only 300 and 90 epochs, respectively.

Supplementary Fig. E.1: Comparing concept alignment between models (ResNet50 and ResNet50v2) based on their semantic embeddings. Both models share common knowledge, e.g., bird-related concepts. However, whereas the better-trained ResNet50v2 has learned more specific concepts, e.g., specific fur textures of dogs, the other has learned more abstract concepts that are shared across classes.

Similar model sizes are more aligned: In Supplementary Fig. E.2 (top right), we compare ResNet models of different sizes (ResNet18/34/50/101). It can be observed that ResNets of similar size also align more closely (e.g., ResNet101 and ResNet50 compared to ResNet101 and ResNet34).

Similar layer depths are more aligned: Further, for all ResNet models, the representations of layers "block3" and "block2" are more aligned with one another.

Intra-family alignment can be higher than inter-family alignment: In Supplementary Fig. E.2 (bottom middle), we compare ResNet and VGG models w.r.t. their semantic embedding alignment. Interestingly, the last feature layers of the ResNet50 and the VGG-16 (with batch normalization) are more aligned than those of the ResNet50 and ResNet34, or the VGG-16 and VGG-13.

F Audit in Semantic Space

This section provides more details on Sections 4.2 and 4.3, where we demonstrate how to audit neural network representations using SemanticLens on ImageNet and ISIC, respectively.
F.1 ImageNet Visual Object Recognition

This section provides more details on Section 4.2, where we investigate neural network representations using SemanticLens on ImageNet for a concrete example ("Ox" predictions) and, more generally, for 26 classes. We present further details in the following.

F.1.1 Ox Example

In Section 4.2, we demonstrate an alignment audit for a ResNet50v2 and predictions for the class "Ox". To obtain alignment scores, we measure the alignment with textual embeddings of valid and spurious concepts using the templates "<concept>" and "an image of <concept>". Specifically, the following concepts are used to measure alignment:

Valid: large muscular body, curved horns, hooves, thick neck, short, rough fur, soft fur, long fur, gray skin, fur texture, gritted texture, brown coat, black coat, white coat, legs, long tail, wide muzzle

Spurious: grassland, savanna, sky, tree, water, grain, straw, cart, carriage, carriage, wheel, indian person, copyright watermark, mud, dirt, person, wooden, palm tree, people, grayish earth texture

To compute final alignment scores, we subtract the alignment to an empty textual embedding, as also detailed in Section 3. We further filter out all neurons of the last feature layer (layer 3 for the ResNet) that have a maximal relevance of below 2.8 % on the test set for the "Ox" class, using CRP for attribution. We show the valid and spurious alignment of all relevant components/neurons and their concepts in the form of a scatter plot in Fig. 3a, and in more detail in Supplementary Fig. F.1. In Supplementary Fig. F.1, the same analysis is also included for a VGG-16, with labels for all representations, the highest relevance score on the test set, as well as the neuron number/id.

Supplementary Fig.
E.2: Comparing concept alignment between models based on their semantic embeddings. The set of semantic embeddings is hereby compared by computing the average maximal pairwise similarity of semantic embeddings between models as given by Eq. (1). Top left: Comparing ResNet50 models (last feature layer) with different training recipes. Top right: Comparing ResNet models (penultimate and last feature layer). Bottom middle: Comparing ResNet and VGG models (last feature layer).

It is apparent that the ResNet often relies on fewer neurons, as its neurons carry up to 60.8 % of relevance compared to 8.0 % for the VGG. Whereas both models seem to rely on cart- and savanna-related concepts, the ResNet further utilizes concepts corresponding to copyright watermark, Indian person, and palm tree.

In a subsequent experiment, the models are evaluated on actual data. Concretely, we measure via an AUC score [62] how well the "Ox" logits can be separated on samples with "Ox" (test set) and samples with spurious features only. The spurious feature set is hereby generated using Stable Diffusion [102] and text prompts. The resulting AUC scores (a low AUC is problematic) are displayed in Supplementary Fig. F.2 for both models and different diffusion prompts. As expected, the ResNet shows slightly lower AUCs for Indian person and palm tree, whereas both show AUCs below 1.0 for savanna and cart carriage. Especially for the combination of spurious features (Indian pulling a cart in savanna), both models show AUCs below 0.9. Interestingly, when palm trees are shown instead of savanna (Indian pulling a cart under palm trees), the ResNet reacts much more strongly than the VGG, validating the finding in Supplementary Fig. F.1 that only the ResNet relies on palm tree features.

F.1.2 Spurious Concept Reliance Everywhere

This section provides details for Section 4.2, where we measure the alignment of ResNet50v2 neurons in the last feature layer with valid expected concepts. In Supplementary Figs.
F.3 to F.11, we provide for all 26 ImageNet classes the top-10 most relevant neurons (according to the highest relevance score on the test set of a class), the best-matching label, and the corresponding alignment score.

Supplementary Fig. F.1: Valid and spurious alignment scores of all relevant neurons for detecting the "Ox" class on the ImageNet test set for a ResNet50v2 (left) and VGG-16 model (right). We include labels for all representations (highest spurious and valid aligned concept), with the highest relevance score on the test set as well as their number/id.

Supplementary Fig. F.2: AUROC when separating "Ox" logits computed on the "Ox" test set and Stable Diffusion-generated samples of spurious concepts for a ResNet50v2 and VGG-16. The AUC is especially reduced when spurious concepts are combined.

The following expected concepts were defined for measuring alignment:

Class 70: Long, thin legs, Small, round body, Brownish-gray color, Legs outstretched, Crawling motion, Smooth exoskeleton, Tiny eyes, Jointed leg segments, Spider-like appearance, Delicate legs crossing, Resting on leaves, Thin, wiry legs, Clinging to surfaces, Contrasting body and leg colors, Huddled posture

Class 277: Rusty red fur, Pointed snout, Bushy tail with white tip, Black paws, Alert ears, Sharp eyes, Slim body, White underbelly, Dark nose, Leaping pose, Camouflaged in grass, Running through snow, Curled up sleeping, Sneaky posture, Fluffy tail swishing

Class 294: Dark brown fur, Thick fur, Powerful paws, Brown coat, Muscular build, Sharp claws, Round ears, Dark brown eyes, Broad shoulders, Coarse fur texture, Short tail, Strong jaw, Robust body form, Light brown muzzle, Heavy limbs, Curved claws
Supplementary Fig. F.3: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "harvestman", "red fox", and "brown bear". Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.
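The per-class summaries of Supplementary Figs. F.3 to F.11 combine the two scores defined earlier: the relevance r of a neuron on a class's test set and its alignment a to the expected concepts. A minimal sketch of assembling such a top-10 table, assuming both score arrays are precomputed (helper and variable names are hypothetical):

```python
import numpy as np

def top_k_aligned(relevance, alignment, labels, k=10):
    """Summarize the k most relevant neurons for one class.

    relevance: (n,) highest relevance r of each neuron on the class test set.
    alignment: (n, L) alignment scores a of each neuron to the L expected
        concepts of the class.
    labels: list of L expected-concept strings.
    Returns rows of (neuron id, best-matching label, a, r), sorted by r.
    """
    order = np.argsort(relevance)[::-1][:k]   # top-k neurons by relevance
    rows = []
    for i in order:
        j = int(np.argmax(alignment[i]))      # best-matching expected concept
        rows.append((int(i), labels[j], float(alignment[i, j]), float(relevance[i])))
    return rows
```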
Class 309: Transparent wings, Striped yellow and black body, Small antennae, Segmented abdomen, Hexagonal eye pattern, Fuzzy thorax, Pointed stinger, Tiny legs, Glossy wing texture, Round compound eyes, Oval-shaped body, Vibrant yellow stripes, Delicate wing veins, Thin legs with pollen sacs, Smooth exoskeleton

Class 336: Thick brown fur, Short legs, Stocky body, Round ears, Curved claws, Flat snout, Bushy tail, Light brown underbelly, Coarse fur texture, Dark eyes, Short, sturdy limbs, Rounded body form, Sharp teeth, Padded paws, Grayish-brown coat

Class 340: Striped coat, Mane of hair, Hoofed feet, Black and white stripes, Long tail, Rounded ears, Smooth fur, Slender legs, Muzzle with whiskers, Muscular body, Shiny coat, Narrow snout, Short mane, Contrasting color pattern, Coarse hair texture

Class 388: Black and white fur, Round ears, Large paws, Plump body, Circular black eye patches, Soft fur texture, Flat nose, Stubby tail, Strong jaws, Muscular limbs, Thick coat, Wide face, Contrasting black and white markings, Rounded belly, Coarse paw pads

Class 390: Slender body, Smooth skin, Elongated form, Dark green color, Pointed snout, Silky texture, Undulating movement, Pale underbelly, Slippery surface, Streamlined shape, Glossy sheen, Narrow fins, Curved tail, Sinous body form, Mottled brown pattern

Class 407: Flashing emergency lights, Red and white exterior, Siren, Large rear doors, Wheeled stretcher, Bold cross symbol, Metallic chassis, Square body form, Reflective decals, Clear
Supplementary Fig. F.4: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "bee", "marmot", and "zebra". Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.
windshield, Rubber tires, Grippy floor texture, Bright headlamps, Compact interior, Smooth metal panels

Class 441: Clear glass body, Rounded rim, Sturdy base, Smooth glass surface, Amber liquid color, Curved handle, Frosted glass texture, Tall cylindrical shape, Golden beer hue, Tapered glass form, Bubbly carbonation, Thick glass walls, Foamy white head, Translucent glass clarity, Cold glass touch

Class 483: Stone walls, Tall towers, Wooden drawbridge, Gray bricks, Iron gates, Rounded turrets, Sturdy battlements, Dark dungeons, Narrow arrow slits, Moat surrounding, Flag flying high, Cobblestone courtyard, Heavy wooden doors, Fortified ramparts, Cold stone texture

Class 517: Long metal boom, Yellow frame, Steel cables, Rotating cab, Hook block, Tall mast, Boom extension, Black counterweights, Lattice structure, Rubber tracks, Wire rope, Sleek hydraulic arms, Grippy control levers, Heavy-duty joints, Glossy metal surface

Class 594: Vertical Lines, Golden strings, Wooden soundboard, Curved neck form, Polished wooden column, Smooth brass pedals, Elegant frame, Carved wooden body, Shiny metallic tuning pins, Glossy black finish, Ornate pillar design, Textured grip on strings, Deep mahogany color, Slender body form, Delicate string tension, Velvety string texture

Class 599: Hexagonal cells, Golden amber color, Dripping honey, Wax texture, Symmetrical pattern, Sticky surface, Clustered structure, Thick honey-filled cells, Translucent wax, Crystalline
Supplementary Fig. F.5: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "giant panda", "eel", and "ambulance". Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.
form, Natural golden hue, Layered honeycomb slabs, Smooth honey flow, Bees crawling over it, Cut honeycomb with oozing honey

Class 648: Glass mirror door, Metallic handle, White shelves, Smooth interior surface, Compact compartments, Small drawers, Reflective mirror finish, Hinged door, Plastic containers, Frosted glass, Rectangular frame, Transparent shelves, Polished edges, Glossy cabinet surface, Flat wooden panel

Class 718: Wooden planks, Metal support beams, Sturdy pilings, Weathered surface, Rusty nails, Flat deck, Mooring cleats, Open water view, Railing along edges, Docking area, Rectangular shape, Salt-stained wood, Piling posts, Non-slip texture, Tide-worn edges

Class 754: Wooden exterior, Circular tuning dial, Metal antenna, Speaker grille, Vintage knobs, AM/FM frequency display, Rectangular form, Portable handle, Static-filled screen, Red tuning light, Compact transistor shape, Glossy buttons, Retro design, LCD frequency screen, Digital display

Class 774: Flat sole, Adjustable straps, Open-toe design, Leather material, Rubber outsole, Padded footbed, Colorful straps, Woven texture, Buckles or Velcro, Ankle support, Curved arch, Lightweight form, Breathable fabric, Thick heel, Smooth lining
Supplementary Fig. F.6: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "beer glass", "castle", and "crane". Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.

Class 784: Metallic shaft, Flathead tip, Phillips head tip, Ergonomic handle, Rubber grip, Chrome finish, Magnetic tip, Slender shaft, Red and black handle, Textured grip, Tapered tip, Polished steel, Stubby handle form, Ridged handle surface, Hexagonal shaft

Class 792: Metallic gray blade, Wooden tan handle, Curved metal edge, Smooth metal grip, Rust brown surface, Coarse wooden texture, Matte black finish, Polished silver shaft, Flat blade form, Rough wooden handle, Tapered metal point, Dark footstep, Ridged blade surface, Olive green shaft, Grainy wooden texture

Class 801: Breathing tube, Mouthpiece, Curved top, Flexible silicone, Clear plastic body, Rubber strap, Soft bite valve, Transparent mask attachment, Curved tube form, Anti-splash guard, Adjustable strap, Compact design

Class 804: Transparent bottle, Pump top, Liquid soap inside, Foaming nozzle, Matte plastic body, Chrome finish pump, Wall-mounted version, Curved bottle shape, Sleek, minimalist design, Refillable container, Dripping soap effect, Frosted glass body, Colored liquid
soap, Soft-touch material, Push-button mechanism

Class 817: Sleek body, Low profile, Glossy red paint, Aerodynamic spoiler, Leather seats, Shiny chrome rims, Wide tires, Curved hood, Dual exhaust pipes, Tinted windows, Carbon fiber accents, LED headlights, Smooth steering wheel, Compact form, Metallic finish

Supplementary Fig. F.7: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "harp", "honeycomb", and "medicine chest".
Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.

Class 822: Shiny metallic surface, Circular shape, Concave playing surface, Silvery gray color, Indented note sections, Polished steel texture, Raised center dome, Circular rim, Reflective metal finish, Suspended on a stand, Hammered metal texture, Brightly polished edges, Multiple circular note areas, Matte silver base, Curved steel body

Class 913: Twisted metal, Broken hull, Rusty surface, Shattered glass, Dark debris, Scattered fragments, Bent framework, Corroded iron, Cracked structure, Faded paint, Exposed wires, Jagged edges, Dilapidated form, Burnt-out interior, Rough, damaged texture

Class 993: Wrinkled cap, Brownish-red color, Curled edges, Short white stem, Brain-like cap shape, Irregular folds, Earthy brown tones, Hollow interior, Bulbous form, Textured surface, Wavy contours, Soft, spongy texture, Twisted cap structure, Cream-colored stalk

F.1.3 Aligned Models Are More Interpretable and Performant

This section provides details for Section 4.2, where we analyse popular pre-trained models on the ImageNet dataset and observe strong variations w.r.t. their valid alignment. The reason often lies in the share of knowledge that is aligned neither to valid nor to spurious concepts, as demonstrated for the VGG-16 in Supplementary Fig. F.12. For instance, the VGG-16 contains several polysemantic components that perform multiple roles in decision-making, which generally reduces alignment. We also find more abstract concepts, e.g., encoding a gray/silver-like texture. In contrast, more performant and wider models, such as the ResNet101v2, tend to have more specialized (e.g., class-specific) and monosemantic model components, as later quantified in Section 4.4.
Supplementary Fig. F.8: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "pier", "radio", and "sandal". Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.

Overall, higher-performing models with
larger feature spaces (more neurons per layer, as in, e.g., ResNet50 and ResNet101) thus show greater alignment scores throughout the experiments, as detailed in Supplementary Fig. F.12 (middle).

F.2 Medical Use Case

This section provides further details for Section 4.3, where we demonstrate SemanticLens in a medical use case and debug a VGG-16 with Batch Normalization layers on the ISIC 2019 challenge dataset, as detailed in Supplementary Note B. For all experiments, we leverage the CLIP model provided by [68], collecting concept examples E on the test set and using |E| = 20 for generating semantic embeddings θ. Notably, we reduce the set size to 20 compared to 30 for ImageNet, as the test set of ISIC is much smaller (2,533 compared to 50,000 samples). For ISIC, the activation value of the 20th most activating sample still corresponds, on average, to more than 40 % of the highest activation value of a neuron.

F.2.1 Concept Labels

We use the following labels to annotate concepts, where we mostly adhere to the ABCDE-rule [78]:

Asymmetry: asymmetric lesion, asymmetrical lesion, uneven form, irregular form, unsymmetrical lesion, irregular dark streaks, irregular pigment network, negative pigment network, irregular pseudonetwork, off-centered blotch, atypical dots or globules, regression structures, atypical pigment network, ulcer, irregular sws, unsymmetric lesion
Supplementary Fig. F.9: Alignment of the top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes "screwdriver", "shovel", and "snorkel". Besides the highest alignment a to an expected concept (label below concept examples), we provide the highest relevance score r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.
Border: irregular border, irregular fuzzy border, uneven border, ragged border, jagged border, fuzzy and slightly jagged border, fuzzy and strongly jagged border, notched border, indistinct border, obscure border, poorly defined border, ill-defined border, irregular blurred border, scalloped border

Colour: variegated color, multicolored, multicoloured, polychromatic, polychrome, blue-white veil, crystalline structures, varying color, changing color, different color, multiple color, many color, several color, various color, shades of tan, shades of brown, shades of black, shades of black and brown, shades of blue, red and blue color, red and black color, blue color, black and brown color, milky-red globules [103]

Diameter: large diameter, large size, larger than 6 mm, large mole, large growth, large patch, large lesion, large spot

Evolving: bleeding, changing, growing, crusty, crusty surface, warty surface, warty, warty lesion, scaly, scaly lesion

Other:

• vascular lesion is a sub-class of the ISIC dataset for "other",
• actinic keratosis is a sub-class of the ISIC dataset for "other",
• dermatofibroma is a sub-class of the ISIC dataset for "other",

Supplementary Fig. F.10: Alignment of top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes of "soap dispenser", "sports car" and "steel drum". Besides the highest alignment a to an expected concept (label below the concept examples), we provide the highest relevance scores r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.
• basal cell carcinoma is a sub-class of the ISIC dataset for "other",
• white and yellowish structures are an indication of the squamous cell carcinoma sub-class of the ISIC dataset [104],
• white streaks, crystalline structures, white scar-like areas, scab, scabbed, white scales and white dots (can be an indication of melanoma, but also of the basal cell carcinoma sub-class of the ISIC dataset [105, 106, 107, 108]),
• large blue-gray ovoid nests are highly suggestive of basal cell carcinoma [109],
• milia-like cysts, fingerprint-like structures and moth-eaten borders are an indication of the benign keratosis sub-class of ISIC [110],
• red, maroon, or black lacunae, and widespread red-blue lacunes [111],
• clear cell acanthoma is a benign epidermal tumor,
• reactive haemangioma is a common benign vascular tumor,
• vascular blush [103],
• cherry angioma: common asymptomatic vascular skin lesions [112],
• regular ABCDE-rule variants: smooth border, uniform color, regular border, regular fuzzy border, regular hazy border, regular blurred border, regular pigment network, small size, smaller than 6 mm, smaller than 4 mm, small diameter, small mole, small growth, small patch, small lesion, pink growths, regular dark streaks, red-bluish-black homogeneous areas,

Supplementary Fig. F.11: Alignment of top-10 most relevant neurons of a ResNet50v2's last feature layer for the classes of "wreck" and "gyromitra". Besides the highest alignment a to an expected concept (label below the concept examples), we provide the highest relevance scores r on the test set of the class. Further depicted are three examples where a neuron is most relevant for a class.

• talon noir (black heel) is a harmless coloration [113].

Spurious: redness of the skin, vascular skin, red skin, dark corner visible, ink marker, plaster, ruler visible, white spots, fingernail, white reflections, measuring scale, measurement scale bar, size marker, purple skin marker, strongly haired, hairs, hairy, band-aid, colorful patch, patch or band-aid, ruler, blue coloured band-aid or patch, orange coloured band-aid or patch

F.2.2 Training

We train models using a sparsity loss (see Supplementary Tab. G.2 for loss details) with a regularization strength of λ = 1.0 for 300 epochs with an initial learning rate of 0.01, which is decreased by a factor of 10 after 240 and 290 epochs, respectively. The optimization algorithm used is stochastic gradient descent.

F.2.3 Evaluation

Instead of Stable Diffusion (which is not available for skin lesion data), we manipulate the available test data to evaluate whether pruning or retraining had a positive effect on model reliability and robustness.
Supplementary Fig. F.12: Models with a high alignment to valid concepts are often more performant and interpretable. Left: A VGG-16 shows many polysemantic units, or units that encode abstract concepts, ultimately resulting in lower alignment scores. Right: A ResNet101v2, on the other hand, has more specialized and monosemantic units, which results in overall higher alignment scores.

Supplementary Fig. F.13: Testing model artefact sensitivity by adding artefacts to clean input samples, with examples given for band-aid, red skin and ruler (left). The bar chart (right) shows the change in test set accuracy for the vanilla, pruned and retrained model when artefacts (band-aid, red skin, ruler) are added to clean input samples for the medical use case. Compared to pruning, only retraining results in strongly reduced sensitivities for all artefacts.

"Red coloured skin" is artificially added to clean samples by adding a red hue over images. Concretely, we transform the RGB channels r, g and b of an image by

r → 0.7r + 0.3 × 231,   (1)
g → 0.7g + 0.3 × 128,   (2)
b → 0.7b + 0.3 × 151,   (3)

which corresponds to overlaying a uniform colour of [231, 128, 151] (the mean RGB value of test samples with red skin) with a transparency of 30 %. Further, ruler and band-aid are modelled according to the Reveal2Revise [72] framework. Specifically, we first generate artefact localizations on samples from the test set.
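The red-hue manipulation of Eqs. (1)-(3) amounts to alpha-blending each pixel with a constant colour. A minimal sketch (function name and array layout are our own, assuming an H × W × 3 image with values in [0, 255]):

```python
import numpy as np

# Mean RGB value of test samples with red skin, overlaid with 30 % opacity
# as in Eqs. (1)-(3).
RED_SKIN_RGB = np.array([231.0, 128.0, 151.0])

def add_red_hue(image: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend the constant red-skin colour into an H x W x 3 image in [0, 255]."""
    return (1.0 - alpha) * image + alpha * RED_SKIN_RGB
```

With alpha = 0.3, a black pixel maps to exactly 0.3 × [231, 128, 151], matching the equations above.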
Here, we use support vector machines to generate a Concept Activation Vector (CAV) on activations of the last feature layer. Using this CAV, we generate the artefact localizations by explaining, i.e., computing input heatmaps w.r.t., the dot product between latent activations and the CAV. Subsequently, the resulting "soft" masks are blurred using a Gaussian blur (kernel size of 41 pixels), and low values are amplified by taking values to the power of 0.6. After a final normalization to the maximum value, i.e., such that scores lie between 0 and 1, we obtain a localization M ∈ R^(224×224) with M_ij ∈ [0, 1]. Another image D ∈ R^(3×224×224) is then poisoned by overlaying the artifact of image A ∈ R^(3×224×224) via

D_poisoned = D ∘ (1 − M) + A ∘ M,   (4)

where ∘ corresponds to the Hadamard product, and M is applied to each colour channel. Exemplary data manipulations and the effect on test accuracy when data is poisoned are shown in Supplementary Fig. F.13, where we provide concrete values for the accuracy change that is also illustrated in Fig. 4 (bottom). Additionally, we present the output logits and test set accuracies for specific subsets and all models in Supplementary Fig. F.14.

Supplementary Fig. F.14: Output logits (vertical axis) and accuracy (sub-plot title) on subsets of the test set for the vanilla model (top left), pruned vanilla model (top right) and retrained model (bottom centre). The output logits (mean depicted in the centre of the distribution) are given for the "Melanoma" and "Other" class. For the artefacts red colour and band-aid, no samples with melanoma are available. Especially for red colour and band-aid, the logit magnitudes are reduced when pruning or retraining, indicating reduced over-fitting on these spurious artefacts.
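The mask post-processing and the overlay of Eq. (4) can be sketched as follows (a simplified sketch with hypothetical names; the Gaussian-blur step of the pipeline is omitted here):

```python
import numpy as np

def soft_mask(heatmap: np.ndarray, power: float = 0.6) -> np.ndarray:
    """Amplify low heatmap values (x -> x^0.6) and normalize to [0, 1].
    (The preceding Gaussian blur from the pipeline is omitted in this sketch.)"""
    m = np.power(np.clip(heatmap, 0.0, None), power)
    return m / m.max()

def poison(clean: np.ndarray, artifact: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Eq. (4): D_poisoned = D o (1 - M) + A o M, with the H x W mask M
    broadcast over the three colour channels of the 3 x H x W images."""
    m = mask[None, :, :]
    return clean * (1.0 - m) + artifact * m
```

A mask of all ones returns the artifact image, a mask of all zeros the clean image; soft values in between blend the two per pixel.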
G Human-Interpretability Measures

In Section 3.5 and as illustrated in Supplementary Fig. G.1, we propose four measures to quantify the human-interpretability of concepts. All measures are based on transforming the concept examples E_k for a neuron k into the latent feature space of a foundation model F. Concretely, we first introduce "clarity" for how clear and easy it is to understand the common theme of concept examples, evaluated via the similarity between each pair of concept examples from a set E_k. The clarity measure is closely related to the works of [38, 56]. Further, "polysemanticity" describes whether multiple distinct semantics are present in the concept examples, computed by clustering the concept examples of set E_k and evaluating the similarity across the clusters' semantic embeddings. Thirdly, "similarity" measures the similarity of two concepts described by sets E_k and E_l via the cosine similarity between the pooled feature vectors of each concept example set (the semantic embedding of each neuron). Lastly, "redundancy" describes the degree of redundancy in a set of concepts by measuring the average maximal similarity across concepts.

Experimental setting: Throughout the experiments on human-interpretability, the default foundation model used is DINOv2.

G.1 User Study

The following sections provide details for the user studies performed to evaluate the alignment of our interpretability measures (similarity, clarity and polysemanticity) with human perception.

Ethics Approval: The Ethics Commission of the Fraunhofer Heinrich Hertz Institute provided guidelines for the study procedure and determined that no protocol approval is required. Informed consent has been obtained from all participants.
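Assuming each concept example (or each neuron) has already been embedded by the foundation model, three of the measures above can be sketched as follows (function names are our own; this is a sketch, not the authors' implementation):

```python
import numpy as np

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clarity(examples: np.ndarray) -> float:
    """Mean pairwise cosine similarity over the n x d embeddings of one
    concept example set E_k (higher = clearer common theme)."""
    n = len(examples)
    sims = [_cos(examples[i], examples[j])
            for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(sims))

def similarity(theta_k: np.ndarray, theta_l: np.ndarray) -> float:
    """Cosine similarity between the pooled (mean) embeddings of two concepts."""
    return _cos(theta_k, theta_l)

def redundancy(embeddings: np.ndarray) -> float:
    """Average maximal similarity of each concept to any other concept."""
    n = len(embeddings)
    best = [max(_cos(embeddings[i], embeddings[j])
                for j in range(n) if j != i) for i in range(n)]
    return float(np.mean(best))
```

Polysemanticity additionally requires clustering the per-example embeddings and comparing the cluster means, which is omitted in this sketch.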
Supplementary Fig. G.1: We introduce easy-to-compute human-interpretability measures that are useful to rate and improve model interpretability. Overall, we propose four measures: "clarity" for how clear and easy it is to understand the common theme of concept examples, "polysemanticity" for whether multiple distinct semantics are present in the concept examples, "similarity" for the similarity of concepts, and "redundancy" for the degree of redundancy in a set of concepts. The measures are based on transforming concept examples into their semantic representation.

G.1.1 Similarity Measure

In the following, we provide details w.r.t. study design, data filtering and results.

Study design: We evaluate concept similarity (as perceived by humans) in an odd-one-out experiment, where participants are asked to detect the outlier among three sets of concept examples ("A", "B", "C"), as shown in Supplementary Fig. G.2. Concretely, we choose two neurons randomly and create the three sets as follows: "A" corresponds to 10 concept examples out of the top 20 most activating concept examples of the first neuron (every second example is taken). "B" corresponds to 10 concept examples out of the top 20 most activating concept examples of the second neuron (every second example is taken). "C" also corresponds to 10 concept examples out of the top 20 most activating concept examples of the second neuron (every second example is taken, but beginning with the second). Subsequently, the labels "A", "B", "C" are shuffled. For each odd-one-out task, we additionally ask participants for their confidence and the perceived similarity of the images.
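The construction of the three sets from two neurons' top-20 example lists can be sketched as (hypothetical helper; `top20_first` and `top20_second` stand for the ordered top-20 example lists of the two neurons):

```python
def build_odd_one_out_task(top20_first: list, top20_second: list):
    """Create sets A, B, C for one odd-one-out task: A holds every second
    top-20 example of the first neuron; B and C are the two interleaved
    halves of the second neuron's top-20 examples."""
    a = top20_first[0::2]    # examples 1, 3, 5, ... of neuron 1
    b = top20_second[0::2]   # examples 1, 3, 5, ... of neuron 2
    c = top20_second[1::2]   # examples 2, 4, 6, ... of neuron 2
    return a, b, c
```

B and C therefore depict the same concept through disjoint examples, while A is the outlier.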
All in all, we create six experiments with 50 such sets for the ResNet50v2 (four times), ResNet50 and ResNet34, using neurons of the penultimate layer.

Compensation: The compensation for participation is divided into two parts: a base payment and a bonus. Firstly, participants receive a base payment of 2.0 US dollars (estimated time per study is 15 minutes) if their performance is above 41 % (statistically significantly better than random performance according to a binomial test with 50 draws and a success probability of 33 %). Performance is measured by the number of correctly recognized outliers in the odd-one-out task. Participants are further warned that responses corresponding to random guessing are rejected, leading to no payout. In addition, participants are motivated to perform well by a bonus of 1.0 US dollars when performance is above 75 %, or 1.5 US dollars when performance is above 85 %.

Data filtering: We remove study results where the overall detection accuracy is below 55 %, which corresponds to a statistically significant result above random performance (p < 0.002) according to a binomial test. We further expect a negative correlation between confidence and the similarity values perceived by humans. Thus, we additionally filter out results where this correlation is above 0.15. Overall, we retain 59 of 108 experimental results after filtering.

Results: We have three semantic embeddings (given by a foundation model), one for each concept set E. Denoting θ_odd as the outlier semantic embedding, and θ_a and θ_b as the embeddings of the pair, we compute a similarity score s as

s = 1/2 (s_cos(θ_odd, θ_a) + s_cos(θ_odd, θ_b)),   (1)

where s_cos denotes cosine similarity. In Supplementary Fig. G.3, the similarity of concept example sets (as measured via foundation models) is shown against perceived confidence (left), perceived similarity (middle) and detection accuracy (right).
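The score of Eq. (1), together with an embedding-based odd-one-out prediction (taking the pair with the highest pairwise similarity as the matching pair), can be sketched as:

```python
import numpy as np

def s_cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two semantic embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def outlier_similarity(theta_odd, theta_a, theta_b) -> float:
    """Eq. (1): s = (s_cos(theta_odd, theta_a) + s_cos(theta_odd, theta_b)) / 2."""
    return 0.5 * (s_cos(theta_odd, theta_a) + s_cos(theta_odd, theta_b))

def predict_outlier(thetas) -> int:
    """Index of the predicted outlier among three embeddings: the embedding
    not belonging to the most similar pair."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = max(pairs, key=lambda p: s_cos(thetas[p[0]], thetas[p[1]]))
    return ({0, 1, 2} - {i, j}).pop()
```

A low s indicates an outlier that is semantically far from the matching pair, which should make the task easy for participants.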
For each case, Pearson correlation scores are given with a corresponding 95 % confidence interval. In Supplementary Fig. G.4, we compare the correlation between computed similarity and perceived similarity using either cosine similarity or Euclidean distance as a distance measure. Additionally, we either compute distances across the individual concept examples of a set, or, as per default, compare the mean embeddings (see Section 3.5). Whereas there is no significant difference between the distance measures, using the mean embedding instead of individual embeddings for each concept example set results in higher correlation scores.

Supplementary Fig. G.2: Study design to evaluate alignment of our similarity measure with human perception.

Overall, the participants found the correct outlier in 85.6 % of cases. This stands in contrast to using the similarities of semantic embeddings, where we achieve over 97 % accuracy when assuming that the correct pair corresponds to the one with the highest pairwise similarity, as shown in Supplementary Fig. G.5 (left). This indicates that foundation models might be able to perceive differences or similarities better than humans. However, notably, participants on Amazon Mechanical Turk do not represent ideal performance, as participants usually solve multiple surveys per day and aim to finish studies as fast as possible. We further investigate whether the similarity score s as in Eq. (1) is useful to detect whether participants have difficulty finding the correct outlier (accuracy below 0.65). Compared to using the perceived similarity and confidence values, the similarity score s is not as indicative. We hypothesize that humans not only have problems when s is high, but also when the similarity between the embeddings of the correct pair is small.

G.1.2 Clarity Measure

In the following, we provide details w.r.t. study design, data filtering and results.
Study design: We evaluate concept clarity (as perceived by humans) via a combination of qualitative and quantitative tests. Concretely, a study consists of 32 tasks, where each task is divided into three questions, as illustrated in Supplementary Fig. G.6. For the first and last question, participants are asked to choose the "clearer", easier-to-understand set of two sets of concept examples (top ten most activating image patches). The first question differs from the last in that one set of concept examples is sampled randomly from all neurons of a model, i.e., each concept example corresponds to a different neuron. As such, the first question represents a controlled setting where one concept group is artificially made unclear and difficult to understand. For the second question, we ask participants to rate the clarity of a single concept example set on a Likert scale. All in all, we create six experiments with 32 tasks each for the ResNet50v2 (four times), ResNet50 and ResNet50.a1, using neurons of the penultimate layer.

Supplementary Fig. G.3: Correlation analysis for the user study regarding "similarity", where an odd-one-out task is performed and participants are asked about their confidence and the similarity of the shown concepts (panels for DINOv2, CLIP-OpenAI, CLIP-LAION and CLIP-Mobile). As the setting is controlled, we can also compute an accuracy. Overall, we see high correlations between our concept similarity measure and perceived confidence, similarity and accuracy (given above each violin plot with confidence intervals in parentheses).

Compensation: The compensation for participation is divided into two parts: a base payment and a bonus. Firstly, participants receive a base payment of 2.0 US dollars (estimated time per study is 15 minutes) if their performance is above 66 % (statistically significantly better than random performance according to a binomial test with 32 draws and a success probability of 50 %). Performance is measured by the number of correctly recognized image grids with a clearer common theme. Participants are further warned that responses corresponding to random guessing are rejected, leading to no payout. In addition, participants are motivated to perform well by a bonus of 1.0 US dollars when performance is above 92 %.

Data filtering: We remove study results where the overall detection accuracy is below 70 % (for question one), which corresponds to a statistically significant result above random performance (p < 0.03) according to a binomial test. Overall, we retain 74 of 90 experimental results after filtering.

Results: For question two, we receive two clarity scores, one for each concept set E. Denoting I^a_clarity and I^b_clarity as the clarity of the two concept example sets according to Eq. (7), we compute the difference in clarity scores as

ΔI_clarity = I^a_clarity − I^b_clarity.   (2)

Supplementary Fig. G.4: Ablation study via correlation analysis for the user study regarding "similarity", where an odd-one-out task is performed and participants are asked about their confidence and the similarity of the shown concepts. In all violin plots, the correlation between perceived similarity and our concept similarity measures is computed (displayed above each violin plot with confidence intervals in parentheses). Here, we use either cosine similarity or Euclidean distance to measure similarities, where no large differences are seen. Averaging the semantic feature vectors (one for a whole set of concept examples E) shows higher alignment compared to computing similarities across individual feature vectors (one for each example in E).

In Supplementary Fig. G.7, the clarity of concept example sets (as measured via foundation models) is shown against perceived clarity (answer to the second question) (left), and the difference of clarity scores ΔI_clarity against the average choice (right). For each case, Pearson correlation scores are given with a corresponding 95 % confidence interval.

G.1.3 Polysemanticity Measure

In the following, we provide details w.r.t. study design, data filtering and results.

Study design: We evaluate neuron polysemanticity (as perceived by humans) via a combination of qualitative and quantitative tests. Concretely, a study consists of 28 tasks, where each task is divided into two questions, as illustrated in Supplementary Fig. G.8. For both questions, we show participants a single concept example set (top 20 most activating image patches) and ask whether they see "one", "two", "three or more", or "no common theme at all" in the concept examples. A choice of "one" is rated as monosemantic according to a human (polysemanticity p = 0), whereas the options "two" and "three or more" are rated as polysemantic (p = 1). "No common theme at all" is valued as neither, receiving a score in-between (p = 0.5).
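The mapping from participant answers to perceived polysemanticity scores can be sketched as (hypothetical helper names; the answer strings follow the study options above):

```python
# "one" -> p = 0 (monosemantic), "two"/"three or more" -> p = 1 (polysemantic),
# "no common theme at all" -> p = 0.5 (neither).
ANSWER_TO_P = {
    "one": 0.0,
    "two": 1.0,
    "three or more": 1.0,
    "no common theme at all": 0.5,
}

def perceived_polysemanticity(answers: list) -> float:
    """Average perceived polysemanticity of one concept example set
    across all participant answers."""
    scores = [ANSWER_TO_P[a] for a in answers]
    return sum(scores) / len(scores)
```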
Importantly, for question one, either reference samples of a single (rather) monosemantic neuron are shown (corresponding to neurons of a model where clarity is maximal), or, alternatively (chosen randomly with probability 0.5), of two (rather) monosemantic neurons (where the reference images are mixed). As such, question one corresponds to a controlled setting, which is used to filter the data. All in all, we create six experiments with 28 tasks each for the ResNet50v2 (two sets), ResNet101v2 (two sets) and ResNet50.a1 (two sets), using neurons of the penultimate layer.

Compensation: The compensation for participation is divided into two parts: a base payment and a bonus. Firstly, participants receive a base payment of 2.0 US dollars (estimated time per study is 15 minutes) if their performance is above 39 % (statistically significantly better than random performance according to a binomial test with 28 draws and a success probability of 25 %). Performance is measured by the number of correctly recognized outliers. Participants are further warned that responses corresponding to random guessing are rejected, leading to no payout. In addition, participants are motivated to perform well by a bonus of 1.0 US dollars when performance is above 92 %.

Supplementary Fig. G.5: Further results for the user study measuring alignment of the computed concept similarity with human perception via an odd-one-out experiment. Left: Foundation models can detect the odd one out considerably better than humans. Right: When aiming to detect cases where participants have an accuracy below 65 %, the perceived similarity or confidence is a better indicator than the computed similarity between the outlier and the correct concept example embeddings.
Data filtering: We remove study results where the overall detection accuracy (according to the first question) is below 72 %, which corresponds to a statistically significant result above random performance (p < 0.001) according to a binomial test. Overall, we retain 85 of 108 experimental results (at least 11 results per study) after filtering.

Results: In Supplementary Fig. G.9, the polysemanticity of concept example sets (as measured via foundation models) is shown against perceived polysemanticity (answer to the second question) (left). We further provide a correlation analysis for using the clarity measure instead, or for computing the increase in concept clarity after clustering, as motivated by [54]. For each case, Pearson correlation scores are given with a corresponding 95 % confidence interval. It is apparent that our polysemanticity measure as introduced in Section 3.5 generally performs significantly better than using the overall clarity or the increase in clarity as a measure.

Alternatively, we aim to detect polysemantic units (according to the participants), defined as those where the average perceived polysemanticity is larger than 50 %. The corresponding AUC scores when using the three different measures to detect polysemantic units are shown in Supplementary Fig. G.9 (bottom). The results mirror the correlation analysis, indicating that our polysemanticity measure is most indicative for detecting polysemantic units.

G.2 Optimizing Interpretability

This section provides more details and additional experiments for Section 4.4.2. The interpretability measures of clarity, polysemanticity and redundancy allow investigating the effect of various hyperparameter choices w.r.t. model architecture or training on latent interpretability. In the following, we present results for the training hyperparameters: dropout regularization, sparsity regularization, task complexity, data augmentation and number of epochs.
We therefore train a ResNet50, ResNet18, VGG-13 and VGG-13 (BN) on a subset of ImageNet for 100 epochs with an initial learning rate of 0.01, which is decreased by a factor of 10 after 80 epochs. The subset consists of animal classes ('Staffordshire bullterrier, Staffordshire bull terrier', 'goldfish, Carassius auratus', 'hen', 'bulbul', 'box turtle, box tortoise', 'water snake', 'bee eater', 'jellyfish', 'hermit crab', 'flamingo', 'pelican', 'beagle', 'soft-coated wheaten terrier', 'Rottweiler', 'Siberian husky', 'tabby, tabby cat', 'brown bear, bruin, Ursus arctos', 'mantis, mantid', 'hamster') and object classes ('castle', 'canoe', 'crane', 'digital clock', 'dome', 'harp', 'lawn mower, mower', 'moped', 'monastery', 'palace', 'pole', 'snorkel', 'snowmobile', 'wok', 'comic book', 'bubble', 'acorn', 'toilet tissue, toilet paper, bathroom tissue'). The semantic embeddings are computed using DINOv2 class token feature vectors of the 50 most activating samples as concept examples for each neuron in the penultimate layer. We further filter out dead neurons by requiring the maximal relevance on the test set to be above 0.005 (0.5 %).

Dropout regularization: Dropout regularization randomly removes neurons from the computational graph by setting their activation to zero. In our study, we randomly set 40 % of neurons to zero. The resulting interpretability scores are given in Supplementary Tab. G.1. It is apparent that the ResNet models are not strongly affected by dropout w.r.t. interpretability. The VGG models, however, show overall higher clarity and redundancy.

Supplementary Fig. G.6: Study design to evaluate alignment of our clarity measure with human perception. We ask participants to choose the clearer of two concept example sets (A and B), or for a qualitative clarity rating based on a Likert scale.

In Supplementary Fig. G.10, we present examples of how concepts change when dropout is applied.
Specifically, we show the concepts of neurons#0,#50,#100,#150,#200,#250,#300,#350,#400,#450and#500of the VGG-13 model without dropout, and the most similar concept of the VGG with dropout based on highest cosine similarity values of semantic embeddings. Here, it is observable, that most neurons become more class-specific and clear, as e.g., for#484that only shows flamingo feather instead of#0that encodes also for sunset besides feathers. Sparsity LossWith the aim to reduce redundancies and polysemanticity, we apply a sparsity lossL sparse on the activationsZ∈R k of the penultimate layer’s features (kneurons) during training. Concretely, we apply the additional sparsity loss with strengthλbesides the standard cross entropy lossL CE L=L CE +λL sparse ,(3) where the sparsity loss is given as the L1-norm over all (non-zero) activations: L sparse = 1 k + k X i=0 |Z i |,(4) wherek + is the number of non-zero activations. The resulting interpretability scores are given in Supplementary Tab. G.2. In almost all cases, redundancy is decreased, clarity increased and polysemanticity reduced, leading to an overall improved interpretability. 65 Mechanistic understanding and validation of large AI models 050100 perceived clarity 0.0 0.5 concept clarity DINOv2 65.7 (56.9,73.1) AB choice 0.5 0.0 0.5 clarity difference -56.9 (-65.7,-46.4) 050100 perceived clarity 0.4 0.6 0.8 concept clarity CLIP-OpenAI 77.0 (70.6,82.2) AB choice 0.2 0.0 0.2 clarity difference -59.5 (-67.9,-49.5) 050100 perceived clarity 0.25 0.50 0.75 concept clarity CLIP-LAION 80.6 (75.0,85.1) AB choice 0.25 0.00 0.25 clarity difference -64.2 (-71.8,-55.0) 050100 perceived clarity 0.4 0.6 0.8 concept clarity CLIP-Mobile 81.2 (75.7,85.5) AB choice 0.25 0.00 0.25 clarity difference -65.1 (-72.6,-56.1) Supplementary Fig. G.7: Correlation analysis for the user study regarding “clarity”, where we ask for a qualitative rating (left) or for the clearer concept example set of two choices A and B (right). 
Overall, we see high correlations between our concept clarity measure and perceived clarity as well as the choice (when computing the clarity difference between A and B (correlation scores given above each violin plot with 95 % confidence intervals in parentheses). Task complexityFurther, we investigate the effect of low or higher task complexity on interpretability. We would expect that very easy tasks lead to more redundancies and less specific neurons, which correspond to lower interpretability. In order to test the effect of task complexity, we train in two scenarios: a simple binary task (animal vs. objects) and more complex multi-class scenario (all 23 animal and object classes). The resulting interpretability scores are given in Supplementary Tab. G.3. Here, overall, higher complexity seems to result in higher interpretability, e.g., clarity increases, and redundancy decreases. However, polysemanticity is observed to increase. Data AugmentationThe next training hyperparameter that we vary is data augmentation. Here, we apply three sets of data augmentation: •none •default: random crop to224×224pixels, random horizontal flip (probability of 50 %) • strong: random rotation of up to 10 degrees, random crop to224×224pixels, random horizontal flip (probability of 50 %), random sharpness decrease by 0.2 (probability of 50 %), random sharpness increase by 0.2 (probability of 50 %). The resulting interpretability scores are given in Supplementary Tab. G.4. Although there are changes in interpretability such as the VGG-13 for which clarity increases, there are usually no clear trends in interpretability. Number of EpochsIn Section 4.4 we have observed that more extensively trained ImageNet models such as the ResNet50v2 result in higher interpretability compared to the ResNet50 for the last feature layers neurons. Thus, we lastly 66 Mechanistic understanding and validation of large AI models Supplementary Fig. 
G.8: Study design to evaluate the alignment of our polysemanticity measure with human perception. A study consists of 28 tasks, where each task is divided into two questions. For individual concept example sets (top 20 most activating image patches), we ask participants whether they see "one", "two", "three or more", or "no common theme at all" in the concept examples.

investigate the effect of the number of epochs on latent interpretability. Concretely, we train models for 400 epochs on the ImageNet subset and decrease the learning rate by a factor of 10 after 300 and 370 epochs, respectively. The resulting interpretability scores are given in Supplementary Tab. G.5. It is apparent that interpretability changes especially within the first 100 epochs; between 100 and 400 epochs, large changes are no longer visible. Generally, the longer the training, the fewer redundancies are observed. Further, clarity is reduced, indicating the formation of more complex concepts, which is contrary to the observation in Section 4.4, where better-trained models show higher clarity. This example shows that many effects can take place and influence interpretability. It might be that on the subset of ImageNet, models are more likely to memorize the data (overfit) than on the whole ImageNet dataset, where models are more likely to generalize.

H Method

In the following, we provide more details regarding our methodology and its hyperparameters. Supplementary Fig. H.1 illustrates the methodology discussed in Section 3.

H.1 Transformation into Semantic Space

To embed model components in the semantic space of a foundation model, we first collect example images for a component's concept and pool the feature vectors of the foundation model when operating on these examples.

Supplementary Fig.
G.9: Ablation study via correlation analysis for the user study regarding "polysemanticity", where participants are asked for their qualitative opinion w.r.t. polysemanticity in the shown concept example sets. In all violin plots, the correlation between perceived polysemanticity and our concept polysemanticity measures is depicted (displayed above each violin plot with confidence intervals in parentheses). Here, we use either our measure (left), the clarity measure (middle), or the clarity increase after clustering of the concept examples inside one set (right). For all cases, we further computed AUC scores for detecting polysemantic neurons (according to a perceived polysemanticity p > 0.5).

Concept Examples. The concept examples are collected by searching for the most activating samples of the test dataset. Here, we perform average pooling over spatial dimensions, as, e.g., given for transformer blocks or convolutional layers. Alternatively, max-pooling could be applied, which we also experiment with (and evaluate) in Supplementary Note D.4. Throughout the paper, we show concept examples that are cropped and masked. However, we refrain from masking concept examples when they are encoded by a foundation model, as masked images are potentially out of the data distribution. To generate cropped samples, we use the zennit-crp framework [92], which computes a neuron-specific heatmap (corresponding to a heatmap for the largest spatial activation value for convolutional channels) that is further smoothed with a Gaussian blur (kernel size 51). Subsequently, the images are cropped to include all values where the heatmap exceeds 1.0 % of the largest attribution value. The black masks (semi-transparent with 0.4 opacity) cover all values smaller than 2.0 % of the maximal attribution value. Notably, CRP is not yet applicable to ViTs. Thus, we use the upsampled spatial transformer tokens as heatmaps to localize concepts, as, e.g., inspired by [14].
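The cropping and masking thresholds described above can be sketched as follows. This is a simplified illustration only: the actual pipeline obtains the heatmap from zennit-crp and smooths it with a Gaussian blur, both of which are omitted here, and the function name is ours.

```python
import numpy as np

def crop_and_mask(image, heatmap, crop_frac=0.01, mask_frac=0.02, opacity=0.4):
    # crop to the bounding box where heatmap > 1.0 % of its maximum,
    # then darken pixels below 2.0 % of the maximum by blending with
    # black at 0.4 opacity (the semi-transparent mask)
    m = heatmap.max()
    ys, xs = np.where(heatmap > crop_frac * m)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].astype(float)
    hm = heatmap[y0:y1, x0:x1]
    crop[hm < mask_frac * m] *= (1.0 - opacity)
    return crop
```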
Pooling Semantic Feature Vectors. The pooling operation we perform on semantic vectors is a straightforward averaging of the latent feature vectors (corresponding to the class token) of ViT-based foundation models. We leave the comparison of different pooling operations, such as the median or an (activation-)weighted average, to future work.

Supplementary Tab. G.1: Studying the effect of hyperparameters on model interpretability (given by clarity, polysemanticity and redundancy scores) using dropout during training. All models are trained for 100 epochs.

model name  | dropout | clarity (%) | polysemanticity (%) | redundancy (%)
ResNet34    | False   | 45.0 | 91.9 | 84.3
ResNet34    | True    | 44.0 | 92.3 | 83.4
ResNet50    | False   | 47.1 | 89.6 | 88.9
ResNet50    | True    | 47.9 | 89.9 | 89.6
VGG-13      | False   | 41.8 | 83.3 | 81.6
VGG-13      | True    | 47.1 | 83.7 | 85.1
VGG-13 (BN) | False   | 35.5 | 84.7 | 79.1
VGG-13 (BN) | True    | 39.6 | 78.4 | 81.6

Supplementary Tab. G.2: Studying the effect of hyperparameters on model interpretability (given by clarity, polysemanticity and redundancy scores) using a sparsity loss with strength λ. All models are trained for 100 epochs.

regularization (λ) | model name  | clarity (%) | polysemanticity (%) | redundancy (%)
0  | ResNet34    | 45.0 | 91.5 | 84.4
0  | ResNet50    | 47.7 | 89.3 | 89.1
0  | VGG-13      | 41.8 | 83.3 | 81.6
0  | VGG-13 (BN) | 35.6 | 84.8 | 79.1
80 | ResNet34    | 60.2 | 67.0 | 44.8
80 | ResNet50    | 56.4 | 87.7 | 70.7
80 | VGG-13      | 52.1 | 83.2 | 52.9
80 | VGG-13 (BN) | 48.0 | 71.8 | 82.8

Templates. Our default templates are "<concept>", "<concept>-like", "a <concept>" and "an image of a close up of <concept>", which have been shown to lead to more faithful labels in Supplementary Note D.4.

H.2 Concept Relevance Scores

We use CRP [15] to compute latent importance scores of neurons and neuron-specific heatmaps via the zennit-crp package [92]. CRP is based on the feature attribution method LRP [94], which back-propagates relevance scores from the output through the model and its latent components to the input. We use the LRP composite of the CRP work, namely the $\varepsilon z^{+}\flat$-rule.
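The template expansion and the average pooling of semantic feature vectors can be sketched as follows (a minimal NumPy sketch; the feature matrix stands in for foundation-model class-token outputs, and the function names are ours):

```python
import numpy as np

# the four default text templates from the Templates paragraph
TEMPLATES = ["{c}", "{c}-like", "a {c}", "an image of a close up of {c}"]

def text_prompts(concept):
    # expand a concept name into the default prompt templates
    return [t.format(c=concept) for t in TEMPLATES]

def semantic_embedding(example_features):
    # average the per-example foundation-model features (e.g., CLIP
    # class tokens), one row per concept example, into a single
    # semantic embedding theta, normalized to unit length
    theta = np.asarray(example_features).mean(axis=0)
    return theta / np.linalg.norm(theta)
```

Unit-normalizing `theta` makes the later retrieval and comparison steps reduce to plain dot products.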
H.3 Human-Interpretability of Concepts

The compact form of our clarity score, presented in Eq. (7), is derived directly from the following equality:

$$\Big\|\sum_{i=1}^{n} v_i\Big\|_2^2 = \Big\langle \sum_{i=1}^{n} v_i,\ \sum_{j=1}^{n} v_j \Big\rangle = \sum_{i,j=1}^{n} \langle v_i, v_j\rangle, \qquad (1)$$

which holds for any selection of vectors $v_1,\dots,v_n \in \mathbb{R}^d$. For $v_1,\dots,v_n \in S^{d-1}$ of unit length, we obtain

$$\sum_{i=1}^{n} \sum_{j\neq i} \langle v_i, v_j\rangle = \sum_{i=1}^{n}\Big(\sum_{j=1}^{n} \langle v_i, v_j\rangle - 1\Big) = \Big\|\sum_{i=1}^{n} v_i\Big\|_2^2 - n. \qquad (2)$$

Applying this result to Eq. (6) yields the desired compact formula.

Supplementary Fig. G.10: Qualitative comparison of concepts in the penultimate layer of a VGG-13 model when dropout is either applied or not. For each concept of a neuron in the VGG without dropout, we retrieve the most similar concept of the VGG with dropout based on the highest cosine similarity of semantic embeddings.

H.4 Workflows for SemanticLens

This section details possible workflows (and the involved steps) for answering the questions presented in Tab. 1 of the main manuscript. An overview of questions and involved steps is given in Supplementary Tab. H.1.

1 Semantic embedding of model components: The first step consists in embedding model components into the semantic representation, as detailed in Sections 3.1 and 3.2. Here, for each component of model M, concept examples E are collected, which are subsequently transformed into a single semantic embedding θ through a multimodal foundation model F.

2 Search using a single probing embedding: As outlined in Section 3.3, to search the semantic embeddings θ of a model M, we require a set of concept examples E for the concept of interest. These concept examples can be of the same data domain as the model M, or another (e.g., text for CLIP). As in Sections 3.1 and 3.2, the concept examples are transformed into a semantic probing embedding θ_probe using the foundation model.
With this probing embedding, we can retrieve the most aligned semantic embeddings of the model via cosine similarity.

3 Annotate using a set of probing embeddings: As outlined in Section 3.3, to label the semantic embeddings θ of a model M, we require a set of expected concepts. The expected concept set can further contain parent categories (which allows further grouping of concepts) or a distinction between valid and spurious features (useful for auditing). For each concept, we further define a set of concept examples E. These concept examples can be of the same data domain as the model M, or another (e.g., text for CLIP). As in Sections 3.1 and 3.2, the concept example sets are transformed into semantic probing embeddings θ_probe using the foundation model. For each model component, we can now compute the most aligned expected concept (label).

4 Compute concept relevance scores: As detailed in Supplementary Note H.2, CRP allows computing relevance scores for individual neurons or neuron groups on the test set efficiently. These relevance scores can be used individually (for

Supplementary Tab. G.3: Studying the effect of hyperparameters on model interpretability (given by clarity, polysemanticity and redundancy scores) varying the task complexity (binary: two classes; class-specific: 26 classes). All models are trained for 100 epochs.

model name  | task           | clarity (%) | polysemanticity (%) | redundancy (%)
ResNet34    | binary         | 37.4 | 85.3 | 95.0
ResNet34    | class-specific | 44.3 | 91.9 | 83.6
ResNet50    | binary         | 29.8 | 87.1 | 93.8
ResNet50    | class-specific | 44.3 | 89.2 | 88.3
VGG-13      | binary         | 35.1 | 85.3 | 82.2
VGG-13      | class-specific | 42.2 | 83.1 | 81.5
VGG-13 (BN) | binary         | 33.5 | 82.7 | 83.8
VGG-13 (BN) | class-specific | 35.0 | 84.6 | 78.9

Supplementary Tab.
G.4: Studying the effect of hyperparameters on model interpretability (given by clarity, polysemanticity and redundancy scores) varying the data augmentations (Rot: random rotation of up to 10 degrees; Crop: random crop to 224 pixels height and width; xFlip: random horizontal flip; Sharpness: random sharpness variations). All models are trained for 100 epochs.

model name  | augmentation             | clarity (%) | polysemanticity (%) | redundancy (%)
ResNet34    | Rot,Crop,xFlip,Sharpness | 44.5 | 91.5 | 83.8
ResNet34    | Crop,xFlip               | 45.0 | 91.4 | 84.3
ResNet34    | none                     | 45.3 | 89.7 | 86.1
ResNet50    | Rot,Crop,xFlip,Sharpness | 46.7 | 88.9 | 90.3
ResNet50    | Crop,xFlip               | 44.9 | 90.3 | 88.8
ResNet50    | none                     | 46.9 | 85.5 | 89.7
VGG-13      | Rot,Crop,xFlip,Sharpness | 42.1 | 83.7 | 81.6
VGG-13      | Crop,xFlip               | 41.8 | 83.0 | 81.8
VGG-13      | none                     | 34.3 | 84.6 | 79.3
VGG-13 (BN) | Rot,Crop,xFlip,Sharpness | 35.2 | 85.0 | 78.9
VGG-13 (BN) | Crop,xFlip               | 35.7 | 84.9 | 79.3
VGG-13 (BN) | none                     | 35.6 | 84.2 | 80.8

each sample), or an average or maximum value can be computed to understand global importance, e.g., the relevance of a neuron (group)/concept for the prediction of a specific output class. Whereas individual relevance scores can be used to filter out data examples where a concept is present (and relevant), a global relevance score allows, e.g., filtering out all irrelevant components. Alternatively to computing relevance scores w.r.t. the prediction output, one can also compute relevance scores of components for the activation of an upper-level layer component, which ultimately allows computing an attribution graph that visualizes the relevance flow (the hierarchical dependencies of neurons) throughout the network, as depicted in Fig. 2c.

5 Audit alignment: Having defined a set of expected concepts (see 3) and filtered out all globally or class-wise irrelevant components (see 4), we can begin to inspect the alignment of the model with our expectations, as detailed in Section 3.4.

Supplementary Tab.
G.5: Studying the effect of hyperparameters on model interpretability (given by clarity, polysemanticity and redundancy scores) varying the number of epochs.

epoch | model name  | clarity (%) | polysemanticity (%) | redundancy (%)
39    | ResNet34    | 45.7 | 91.5 | 85.3
79    | ResNet34    | 44.8 | 91.9 | 84.3
119   | ResNet34    | 44.7 | 91.8 | 84.0
159   | ResNet34    | 44.0 | 92.1 | 83.5
199   | ResNet34    | 43.8 | 92.2 | 83.3
239   | ResNet34    | 44.1 | 92.1 | 83.1
279   | ResNet34    | 43.5 | 92.3 | 83.0
319   | ResNet34    | 43.4 | 92.2 | 82.5
359   | ResNet34    | 43.3 | 92.2 | 82.4
399   | ResNet34    | 43.3 | 92.2 | 82.4
39    | ResNet50    | 51.2 | 88.6 | 91.6
79    | ResNet50    | 43.5 | 90.1 | 88.0
119   | ResNet50    | 47.3 | 88.3 | 88.8
159   | ResNet50    | 46.8 | 89.2 | 88.5
199   | ResNet50    | 46.6 | 85.9 | 88.4
39    | VGG-13      | 42.2 | 83.1 | 82.8
79    | VGG-13      | 41.8 | 83.3 | 81.6
119   | VGG-13      | 41.9 | 83.1 | 81.3
159   | VGG-13      | 42.2 | 82.9 | 81.4
199   | VGG-13      | 42.2 | 83.2 | 81.2
239   | VGG-13      | 42.1 | 82.9 | 81.2
279   | VGG-13      | 42.0 | 83.1 | 81.1
319   | VGG-13      | 42.6 | 82.6 | 81.2
359   | VGG-13      | 42.6 | 82.7 | 81.2
399   | VGG-13      | 42.7 | 82.9 | 81.3
39    | VGG-13 (BN) | 36.4 | 84.1 | 82.0
79    | VGG-13 (BN) | 35.2 | 84.5 | 79.5
119   | VGG-13 (BN) | 34.3 | 84.7 | 78.2
159   | VGG-13 (BN) | 34.2 | 84.5 | 77.7
199   | VGG-13 (BN) | 34.0 | 84.8 | 77.4
239   | VGG-13 (BN) | 33.7 | 84.8 | 76.9
279   | VGG-13 (BN) | 33.8 | 84.8 | 77.1
319   | VGG-13 (BN) | 33.9 | 84.8 | 76.9
359   | VGG-13 (BN) | 33.8 | 85.0 | 76.8
399   | VGG-13 (BN) | 33.8 | 85.0 | 76.7

In a first step, one can study the relevant components (relevances from 4) that do not align with any expected concept. These components can correspond to unexpected valid or spurious concepts. To understand such components better, it is useful to retrieve the output classes or samples where the component is relevant. Secondly, the aligned components and their concepts can be studied more closely. For example, if parent categories for concepts are available (e.g., spurious or valid), one can quantify the number of components per group, or the total relevance of components per group.
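The per-group quantification of step 5 can be sketched as follows (our own minimal sketch; the input format and function name are illustrative assumptions, not the authors' code):

```python
from collections import defaultdict

def relevance_per_category(components, label_category):
    # components: iterable of (assigned_label, global_relevance) pairs,
    #   one per model component (labels from step 3, relevances from step 4)
    # label_category: maps expected concept labels to a parent category
    #   such as "valid" or "spurious"; unmapped labels count as "unexpected"
    # returns the relative relevance share per category
    totals = defaultdict(float)
    for label, relevance in components:
        totals[label_category.get(label, "unexpected")] += relevance
    total = sum(totals.values())
    return {cat: share / total for cat, share in totals.items()}
```

A large "spurious" or "unexpected" share then flags components that warrant closer inspection via their most relevant classes and samples.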
Supplementary Fig. H.1: Methodological overview over SemanticLens. a) In an automated preprocessing step, concept examples E_k for each model component k are computed by cropping highly activating samples to the relevant part using CRP [15] (step 1). Further, a multimodal foundation model such as CLIP is used to embed the concept examples in a semantic representation θ_k ∈ V_M (step 2). Optionally, CRP allows computing relevances of components w.r.t. specific output targets (step 3). b) The preprocessing phase results in a vector database (step 4), which can be utilized to compare models or evaluate human-interpretability. Alternatively, a user might search, describe and audit the model by defining a set of concepts via text or image data that is also embedded by the foundation model. c) Through the foundation model, a user can search representations (step 4), filter them according to output relevances (step 3), and find related (highly activating) data samples.

6 Compare embeddings: The learned knowledge of models can be compared by measuring similarities, as detailed in Section 3.3. Here, either pairs of neurons can be compared via cosine similarity, or groups of neurons, e.g., via the average maximum similarity.
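The average maximum similarity between two groups of neurons can be sketched as follows (a minimal NumPy sketch under the assumption that both embedding matrices are unit-normalized row-wise; the function name is ours):

```python
import numpy as np

def avg_max_similarity(theta_a, theta_b):
    # theta_a: (n_a, d) and theta_b: (n_b, d) unit-norm semantic
    # embeddings of two models' components; for each component of
    # model A, find its best match in model B (maximum cosine
    # similarity), then average over model A's components
    sims = theta_a @ theta_b.T          # (n_a, n_b) cosine similarities
    return float(sims.max(axis=1).mean())
```

Note that the score is asymmetric: `avg_max_similarity(A, B)` asks how well B covers A's concepts, not the reverse.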
Furthermore, two models can be compared with the labels available from 3, where in an optional additional step 4 we can filter components to those relevant for specific classes. In the manner of Network Dissection, one can, e.g., compare the number of neurons assigned to a specific concept.

7 Evaluate interpretability: To evaluate interpretability, we proposed three measures in Section 3.5, corresponding to "clarity", "polysemanticity" and "redundancy". To evaluate whole models, only step 1 is required, after which the measures of Section 3.5 are computed. As such, they provide the means to change hyperparameters of the model or of the training procedure and to investigate the effect on interpretability.

I Limitations and Future Work

One limitation of SemanticLens lies in its reliance on the expertise of the employed foundation model in the considered data domain. If the foundation model was not sufficiently trained on the specific data domain, or itself relies on spurious correlations and biases [114], SemanticLens's ability to correctly and faithfully analyse the network's knowledge is affected. Therefore, the foundation model employed by SemanticLens needs to be carefully selected with respect to the studied data domain, as done with WhyLesionCLIP in Section 4.3. However, in very niche or specific data domains, no foundation model might be available. In such cases, the investigated model itself could be used as an alternative, although this may not provide an optimal semantic space [48]. Secondly, if the components of the model are not describable via concept examples, or when concept examples are not meaningful, an investigation with SemanticLens is not effective. A potential solution lies in post-hoc architecture changes such as SAEs [53] or activation factorization techniques [17] that lead to more interpretable components.
Alternatively, instead of post-hoc adaptations, interpretability can be integrated into the hyperparameter selection process. Our proposed interpretability scores can be a helpful tool for this, as demonstrated with the examples of dropout and activation-sparsity regularization in Section 4.4 and Supplementary Note G.2. These scores facilitate the study of the effects of hyperparameter choices and post-hoc model augmentations on the interpretability of the final model, representing a promising direction for future research. Whereas we provide three latent interpretability measures, there is still potential for measures that describe other aspects of concept-based explanations. For example, it might be easier to understand concept use for singular instances if concepts are spatially localized. Further, explanations will be easier to understand if as few distinct concepts as possible are used by the model.

Supplementary Tab. H.1: Overview of questions which can be answered by SemanticLens and the workflow steps detailed in Supplementary Note H.4.
Type     | Question to the model M                                                                                                                          | Steps
search   | "Has my model learned to encode a specific concept?" (via convenient "search-engine"-type text or image descriptions)                            | 1, 2
search   | "Which components have encoded a concept, how is it used, and which data is responsible?"                                                        | 1, 2, 4
describe | "What concepts has my model learned?" (in a structured, condensed and understandable manner via textual descriptions)                            | 1, 3
describe | "What and how are concepts contributing to a decision?" (by visualizing concept interactions throughout the model)                               | 1, 3, 4
describe | "What do I not yet understand of my model?" (offering to understand the unexpected concepts, their role for the model, and their origin in data) | 1, 3, 4, 5
compare  | "What has one model learned but not the other?" (by comparing learned concepts qualitatively and quantitatively)                                 | 1, (3), 6
compare  | "How do my model's concepts change when changing the architecture or training?" (by comparing and tracking the semantics of components)          | 1, (3), 6
audit    | "Is my model relying on valid information only?" (by separating learned concepts into valid, spurious or unexpected knowledge)                   | 1, 3, 4, 5
evaluate | "How interpretable is my model?" (with easy-to-compute measures)                                                                                 | 1, 7
evaluate | "How can I improve the interpretability of my model?" (by evaluating interpretability measures when changing the model architecture or training procedure) | 1, 7

In this work, we demonstrate SemanticLens on image data. However, SemanticLens can be applied to any data domain where foundation models are available, such as text, audio or video; these applications are left to future work. Lastly, we demonstrated in Section 3.4 how to apply SemanticLens to audit models. Here, first steps have been taken to quantify alignment with expectations, e.g., by quantifying the relative share of neurons aligned to valid features. For future work, further quantification measures could be developed and compared, and their correlation with actual misbehaviour tested.