Paper deep dive
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
Tuomas Oikarinen, Tsui-Wei Weng
Models: CLIP (ViT-B/16), ResNet-18, ResNet-50, ViT-B/16
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 7:52:22 PM
Summary
CLIP-Dissect is an automated, training-free method for interpreting hidden neurons in deep vision networks by leveraging multimodal models like CLIP to map neuron activations to open-ended natural language concepts without requiring labeled datasets.
Entities (6)
Relation Signals (3)
CLIP-Dissect → leverages → CLIP
confidence 100% · CLIP-Dissect leverages recent advances in multimodal vision/language models to label internal neurons
CLIP-Dissect → outperforms → Network Dissection
confidence 95% · we show that CLIP-Dissect provides more accurate descriptions than existing methods
CLIP-Dissect → uses → SoftWPMI
confidence 95% · we found SoftWPMI give the best results among the four and thus we use it in all our experiments
Cypher Suggestions (2)
Identify dependencies between methods and models · confidence 95% · unvalidated
MATCH (m:Method)-[:LEVERAGES]->(model:Model) RETURN m.name, model.name
Find all methods that interpret neural networks · confidence 90% · unvalidated
MATCH (m:Method)-[:INTERPRETS]->(n:NeuralNetwork) RETURN m.name, n.name
Abstract
Abstract: In this paper, we propose CLIP-Dissect, a new technique to automatically describe the function of individual hidden neurons inside vision networks. CLIP-Dissect leverages recent advances in multimodal vision/language models to label internal neurons with open-ended concepts without the need for any labeled data or human examples. We show that CLIP-Dissect provides more accurate descriptions than existing methods for last layer neurons where the ground-truth is available, as well as qualitatively good descriptions for hidden layer neurons. In addition, our method is very flexible: it is model agnostic, can easily handle new concepts and can be extended to take advantage of better multimodal models in the future. Finally, CLIP-Dissect is computationally efficient and can label all neurons from five layers of ResNet-50 in just 4 minutes, which is more than 10 times faster than existing methods. Our code is available at https://github.com/Trustworthy-ML-Lab/CLIP-dissect. Finally, crowdsourced user study results are available in Appendix B to further support the effectiveness of our method.
Tags
Links
Full Text
61,437 characters extracted from source content.
Published as a conference paper at ICLR 2023

CLIP-DISSECT: AUTOMATIC DESCRIPTION OF NEURON REPRESENTATIONS IN DEEP VISION NETWORKS

Tuomas Oikarinen, UCSD CSE, toikarinen@ucsd.edu
Tsui-Wei Weng, UCSD HDSI, lweng@ucsd.edu

ABSTRACT

In this paper, we propose CLIP-Dissect, a new technique to automatically describe the function of individual hidden neurons inside vision networks. CLIP-Dissect leverages recent advances in multimodal vision/language models to label internal neurons with open-ended concepts without the need for any labeled data or human examples. We show that CLIP-Dissect provides more accurate descriptions than existing methods for last layer neurons where the ground-truth is available, as well as qualitatively good descriptions for hidden layer neurons. In addition, our method is very flexible: it is model agnostic, can easily handle new concepts and can be extended to take advantage of better multimodal models in the future. Finally, CLIP-Dissect is computationally efficient and can label all neurons from five layers of ResNet-50 in just 4 minutes, which is more than 10× faster than existing methods. Our code is available at https://github.com/Trustworthy-ML-Lab/CLIP-dissect. Finally, crowdsourced user study results are available in Appendix B to further support the effectiveness of our method.

1 INTRODUCTION

Deep neural networks (DNNs) have demonstrated unprecedented performance in various machine learning tasks spanning computer vision, natural language processing and application domains such as healthcare and autonomous driving. However, due to their complex structure, it has been challenging to understand why and how DNNs achieve such great success across numerous tasks. Understanding how the trained DNNs operate is essential to trust their deployment in safety-critical tasks and can help reveal important failure cases or biases of a given model.
One way towards understanding DNNs is to inspect the functionality of individual neurons, which is the focus of our work. This includes methods based on manual inspection (Erhan et al., 2009; Zeiler & Fergus, 2014; Zhou et al., 2015; Olah et al., 2017; 2020; Goh et al., 2021), which provide high quality explanations and understanding of the network but require large amounts of manual effort. To address this issue, researchers have developed automated methods to describe the functionality of individual neurons, such as Network Dissection (Bau et al., 2017) and Compositional Explanations (Mu & Andreas, 2020). In (Bau et al., 2017), the authors first created a new dataset named Broden with pixel labels associated with a pre-determined set of concepts, and then use Broden to find neurons whose activation pattern matches that of a pre-defined concept. In (Mu & Andreas, 2020), the authors further extend Network Dissection to detect more complex concepts that are logical compositions of the concepts in Broden. Although these methods based on Network Dissection can provide accurate labels in some cases, they have a few major limitations: (1) they require a densely annotated dataset, which is expensive and requires a significant amount of human labor to collect; (2) they can only detect concepts from the fixed concept set, which may not cover the important concepts for some networks, and it is difficult to expand this concept set as each concept requires corresponding pixel-wise labeled data. To address the above limitations, we propose CLIP-Dissect, a novel method to automatically dissect DNNs with unrestricted concepts without the need for any concept-labeled data. Our method is training-free and leverages pre-trained multimodal models such as CLIP (Radford et al., 2021) to efficiently identify the functionality of individual neuron units.
We show that CLIP-Dissect (i) provides high quality descriptions for internal neurons, (ii) is more accurate at labeling final layer neurons where we know the ground truth, and (iii) is 10×-200× more computationally efficient than existing methods. Finally, we show how one can use CLIP-Dissect to better understand neural networks, and discover that neurons connected by a high weight usually represent similar concepts. (arXiv:2204.10965v5 [cs.CV], 5 Jun 2023)

Figure 1: Labels generated by our method CLIP-Dissect, NetDissect (Bau et al., 2017) and MILAN (Hernandez et al., 2022) for random neurons of ResNet-50 trained on ImageNet, displayed together with the 5 most highly activating images for each neuron. We have subjectively colored the descriptions green if they match these 5 images, yellow if they match but are too generic, and red if they do not match. In this paper we follow the torchvision (Marcel & Rodriguez, 2010) naming scheme of ResNet: Layer 4 is the second to last layer and Layer 1 is the end of the first residual block. MILAN(b) is trained on both ImageNet and Places365 networks, while MILAN(p) is only trained on Places365.

2 BACKGROUND AND RELATED WORK

Network dissection. Network dissection (Bau et al., 2017) is the first work on understanding DNNs by automatically inspecting the functionality (described as concepts) of each individual neuron¹. They identify concepts of intermediate neurons by matching the pattern of neuron activations to the patterns of pre-defined concept label masks. In order to define the ground-truth concept label mask, the authors build an auxiliary densely-labeled dataset named Broden, denoted as D_Broden. The dataset contains a variety of pre-determined concepts c and images x_i with their associated pixel-level labels. Each pixel of image x_i is labeled with a set of relevant concepts c, which provides a ground-truth binary mask L_c(x_i) for a specific concept c.
Based on the ground-truth concept mask L_c(x_i), the authors propose to compute the intersection over union (IoU) score between L_c(x_i) and the binarized mask M_k(x_i) from the activations of the concerned k-th neuron unit over all the images x_i in D_Broden:

IoU_{k,c} = \frac{\sum_{x_i \in D_{Broden}} |M_k(x_i) \cap L_c(x_i)|}{\sum_{x_i \in D_{Broden}} |M_k(x_i) \cup L_c(x_i)|}.

If IoU_{k,c} > η, then neuron k is identified as detecting concept c. In (Bau et al., 2017), the authors set the threshold η to 0.04. Note that the binary masks M_k(x_i) are computed by thresholding the spatially scaled activations S_k(x_i) > ξ, where ξ is the top 0.5% largest activation for neuron k over all images x_i ∈ D_Broden, and S_k(x_i) has the same resolution as the pre-defined concept masks, obtained by interpolating the original neuron activations A_k(x_i).

¹ We follow previous work and use "neuron" to describe a channel in CNNs.

(Bau et al., 2020) propose another version of Network Dissection, which replaces the human-annotated labels with the outputs of a segmentation model. This gets rid of the need for dense annotations in D_probe, but still requires dense labels for training the segmentation model, and the concept set is restricted to the concepts the segmentation model was trained on. For simplicity, we focus on the original Network Dissection algorithm (with human labels) in this work unless otherwise mentioned.

MILAN. MILAN (Hernandez et al., 2022) is a contemporary automated neuron labeling method addressing the issue of being restricted to detecting predefined concepts. It can generate unrestricted descriptions of neuron function by training a generative image-to-text model. The approach of (Hernandez et al., 2022) is technically very different from ours, as they frame the problem as learning to caption the set of most highly activating images for a given neuron.
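Network Dissection's IoU matching described in Sec. 2 can be sketched with NumPy; this is a hedged illustration that takes the binarized masks as given (the activation scaling and thresholding steps are omitted):

```python
import numpy as np

def neuron_concept_iou(neuron_masks, concept_masks):
    """IoU between a neuron's binarized activation masks M_k(x_i) and a
    concept's ground-truth masks L_c(x_i), summed over all probe images."""
    inter = sum(np.logical_and(m, l).sum() for m, l in zip(neuron_masks, concept_masks))
    union = sum(np.logical_or(m, l).sum() for m, l in zip(neuron_masks, concept_masks))
    return inter / union if union > 0 else 0.0

# neuron k is said to detect concept c if the IoU exceeds eta = 0.04
ETA = 0.04
```

A neuron is then labeled with every concept whose IoU exceeds the threshold η.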
Their method works by collecting a dataset of human annotations for the set of highly activating images of a neuron, and then training a generative model to predict these human captions. Thus, MILAN requires collecting this curated labeled dataset, which limits its capabilities when applied to machine learning tasks outside this dataset. In contrast, our method does not require any labeled data for neuron concepts and is training-free.

CLIP. CLIP stands for Contrastive Language-Image Pre-training (Radford et al., 2021), an efficient method of learning deep visual representations from natural language supervision. CLIP is designed to address the limitation of static softmax classifiers with a new mechanism to handle dynamic output classes. The core idea of CLIP is to enable learning from practically unlimited amounts of raw (text, image) pairs by training an image feature extractor (encoder) E_I and a text encoder E_T simultaneously. Given a batch of N image and text training example pairs denoted as (x_i, t_i), i ∈ [N], with [N] defined as the set {1, 2, ..., N}, CLIP aims to increase the similarity of each (x_i, t_i) pair in the embedding space as follows. Let I_i = E_I(x_i) and T_i = E_T(t_i); CLIP maximizes the cosine similarity of (I_i, T_i) in the batch of N pairs while minimizing the cosine similarity of (I_i, T_j), j ≠ i, using a multi-class N-pair loss (Sohn, 2016; Radford et al., 2021). Once the image encoder E_I and the text encoder E_T are trained, CLIP can perform zero-shot classification for any set of labels: given a test image x_1, we can feed in the natural language names for a set of M labels {t_j}, j ∈ [M]. The predicted label of x_1 is the label t_k whose embedding T_k has the largest cosine similarity with I_1.

3 METHOD

In this section, we describe CLIP-Dissect, an automatic, flexible and generalizable neuron labeling method for vision networks, from popular convolutional neural networks (CNNs) to SOTA vision transformers (ViT).
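CLIP's zero-shot prediction step described in Sec. 2 amounts to a cosine-similarity argmax. Below is a minimal sketch with hand-made placeholder embeddings; real use would obtain them from CLIP's encoders E_I and E_T:

```python
import numpy as np

def zero_shot_predict(image_emb, text_embs, labels):
    """Return the label whose text embedding is most cosine-similar
    to the image embedding (the CLIP zero-shot rule)."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(t @ i))]

# toy example with hand-made 3-d "embeddings" (not real CLIP outputs)
labels = ["dog", "cat", "car"]
text_embs = np.eye(3)
image_emb = np.array([0.1, 0.9, 0.2])
print(zero_shot_predict(image_emb, text_embs, labels))  # cat
```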
An overview of the CLIP-Dissect algorithm is illustrated in Figure 2 and described in detail in Sec 3.1. We then introduce and discuss a few theoretically inspired choices for the similarity function in Sec 3.2. Finally, in Sec 3.3 we discuss how our method can benefit from more powerful models in the future.

3.1 CLIP-DISSECT OVERVIEW

Inputs & Outputs. The CLIP-Dissect algorithm has 3 inputs: (i) the DNN to be dissected/probed, denoted as f(x); (ii) a set of probing images, denoted as D_probe, where |D_probe| = N; (iii) a set of concepts, denoted as S, |S| = M. The output of CLIP-Dissect is the neuron labels, which identify the concept associated with each individual neuron. Compared with Network Dissection, our goals are the same – we both want to inspect and detect concepts associated with each neuron. Input (i) is also the same, as we both want to dissect the DNN f(x); however, inputs (ii) and (iii) differ. Specifically, in CLIP-Dissect, our D_probe does not require any concept labels and thus can be any publicly available dataset such as CIFAR-100, ImageNet, a combination of datasets, or even unlabeled images collected from the internet. On the other hand, Network Dissection (Bau et al., 2017) can only use a D_probe that has been densely labeled with the concepts from the concept set S. As a result, users of Network Dissection are limited to D_probe = D_Broden and the fixed concept set S of Broden unless they are willing to create their own densely labeled dataset. In contrast, the concept set S and probing dataset D_probe in our framework are decoupled – we can use any text corpus to form the concept set S and any image dataset D_probe independent of S in CLIP-Dissect, which significantly increases the flexibility and efficiency of detecting neuron concepts.

Figure 2: Overview of CLIP-Dissect: a 3-step algorithm to dissect the neural network of interest.

Algorithm. There are 3 key steps in CLIP-Dissect:

1. Compute the concept-activation matrix P.
Using the image encoder E_I and text encoder E_T of a CLIP model, we first compute the text embeddings T_i of the concepts t_i in the concept set S and the image embeddings I_i of the probing images x_i in the probing dataset D_probe. Next, we calculate the concept-activation matrix P ∈ R^{N×M} whose (i, j)-th element is the inner product I_i · T_j, i.e. P_{i,j} = I_i · T_j.

2. Record activations of target neurons. Given a neuron unit k, compute the activation map A_k(x_i) of the k-th neuron for every image x_i ∈ D_probe. Define a summary function g, which takes the activation map A_k(x_i) as input and returns a real number. Here we let g be the mean function that computes the mean of the activation map over spatial dimensions, but g can be any general scalar function. We record g(A_k(x_i)) for all i, k.

3. Determine the neuron labels. Given a neuron unit k, the concept label for k is determined by finding the concept t_m most similar to its activation vector q_k = [g(A_k(x_1)), ..., g(A_k(x_N))]^⊤, q_k ∈ R^N. The similarity function is denoted sim(t_m, q_k; P). In other words, the label of neuron k is t_l, where l = arg max_m sim(t_m, q_k; P). Below we discuss different ways to define sim.

3.2 SIMILARITY FUNCTION

There are many ways to design the similarity function sim, and this choice has a large effect on the performance of our method. In particular, simple functions like cosine similarity perform poorly, likely because they place too much weight on the inputs that don't activate the neuron highly. We focus on the following 4 similarity functions and compare their results in Table 3:

• Cos. Cosine similarity between the activation vector q_k of the target neuron k and the column P_{:,m} of the concept-activation matrix from CLIP corresponding to concept t_m:

sim(t_m, q_k; P) ≜ P_{:,m}^⊤ q_k / (‖P_{:,m}‖ · ‖q_k‖)   (1)

• Rank reorder. This function calculates the similarity between q_k and P_{:,m} by creating a vector q'_k, which has the values of q_k in the order of P_{:,m}.
That is, q'_k is generated by reordering the elements of q_k according to the ranks of the elements in P_{:,m}. The full similarity function is defined below, and is maximized when q_k and P_{:,m} have the same order:

sim(t_m, q_k; P) ≜ −‖q'_k − q_k‖_p   (2)

• WPMI (Weighted Pointwise Mutual Information). A mathematically grounded idea to derive sim based on mutual information, as used in (Wang et al., 2020), where the label of a neuron is defined as the concept that maximizes the mutual information between the set of most highly activated images for neuron k, denoted as B_k, and the concept t_m. Specifically:

sim(t_m, q_k; P) ≜ wpmi(t_m, q_k) = log p(t_m | B_k) − λ log p(t_m),   (3)

where p(t_m | B_k) = Π_{x_i ∈ B_k} p(t_m | x_i) and λ is a hyperparameter.

• SoftWPMI. Finally, we propose a generalization of WPMI where we use a probability p(x ∈ B_k) to denote the chance an image x belongs to the example set B_k. Standard WPMI corresponds to the case where p(x ∈ B_k) is either 0 or 1 for all x ∈ D_probe, while SoftWPMI relaxes the binary setting of p(x ∈ B_k) to real values between 0 and 1. This gives us the following function:

sim(t_m, q_k; P) ≜ soft_wpmi(t_m, q_k) = log E[p(t_m | B_k)] − λ log p(t_m),   (4)

where we compute log E[p(t_m | B_k)] = log(Π_{x ∈ D_probe} [1 + p(x ∈ B_k)(p(t_m | x) − 1)]).

As shown in our experiments (Table 3), we found SoftWPMI gives the best results among the four, and thus we use it in all our experiments unless otherwise mentioned. Due to the page constraint, we leave the derivation and details of how to calculate WPMI and SoftWPMI using only the CLIP product matrix P, as well as our hyperparameter choices, to Appendix A.1.

3.3 COMPATIBILITY WITH FUTURE MODELS

The current version of our algorithm relies on the CLIP (Radford et al., 2021) multimodal model.
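The SoftWPMI score of Eq. (4) can be computed from the matrix P alone. Below is a hedged sketch in which p(t_m | x) is modeled as a softmax over each row of P with an assumed temperature, and p(t_m) as its mean over probe images; the paper's exact derivation and hyperparameter choices are in its Appendix A.1:

```python
import numpy as np

def soft_wpmi(P, p_in_B, lam=1.0, temp=1.0):
    """Score every concept t_m for one neuron via SoftWPMI (Eq. 4).

    P      : (N, M) concept-activation matrix, P[i, j] = I_i . T_j
    p_in_B : (N,) probability that each probe image belongs to the
             highly-activating set B_k of the neuron
    Returns an (M,) array; the neuron's label is the argmax concept.
    """
    logits = P / temp
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    p_t_given_x = np.exp(logits)
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)    # p(t_m | x_i), rows sum to 1

    # log E[p(t_m | B_k)] = sum_x log(1 + p(x in B_k) * (p(t_m | x) - 1))
    log_e = np.log1p(p_in_B[:, None] * (p_t_given_x - 1.0)).sum(axis=0)
    log_p_t = np.log(p_t_given_x.mean(axis=0))               # marginal p(t_m)
    return log_e - lam * log_p_t
```

With p(x ∈ B_k) restricted to {0, 1} this reduces to standard WPMI (Eq. 3).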
However, this doesn’t have to be the case, and developing improved CLIP-like models has received a lot of attention recently, with many recent works reporting better results with an architecture similar to CLIP (Yu et al., 2022; Yuan et al., 2021; Zhai et al., 2022; Pham et al., 2021). If these models are released publicly, we can directly replace CLIP with a better model without any changes to our algorithm. As a result, our method will improve over time as general ML models get more powerful, while existing works (Bau et al., 2017; Hernandez et al., 2022) can’t really be improved without collecting a new dataset specifically for that purpose. Similar to ours, the segmentation version of Network Dissection (Bau et al., 2020) can also be improved by using better segmentation models, but each improved segmentation model will likely work well for only a few tasks. 4EXPERIMENTS In this section, we provide both qualitative and quantitative results ofCLIP-Dissectin Sec 4.1 and 4.2 respectively. We also provide an ablation study on the choice of similarity function in Sec 4.3 and compare computation efficiency in Sec 4.4. Finally, we show thatCLIP-Dissectcan detect concepts that do not appear in the probing images in Sec 4.5. We evaluate our method through analyzing two pre-trained networks: ResNet-50 (He et al., 2016) trained on ImageNet (Deng et al., 2009), and ResNet-18 trained on Places-365 (Zhou et al., 2017). Our method can also be applied to modern architectures such as Vision Transformers as discussed in Appendix A.5. Unless otherwise mentioned we use 20,000 most common English words 2 as the concept setS. Due to the page limit, we leave additional 9 experimental results in the Appendix. Specifically, Ap- pendix A.2 shows additional qualitiative results discussed in Section 4.1. Appendix A.3 showcases our ability to detect low-level concepts but also discusses some limitations, such as sometimes out- putting higher level concepts than warranted. 
Appendix A.4 shows how our method can be applied to generate compositional concepts, and Appendix A.5 shows that our method can be applied to the Vision Transformer architecture and provides qualitative results. In Appendix A.6 we experiment with another potential method to measure the quality of neuron explanations and show it also favors CLIP-Dissect. Appendix A.7 discusses the limitations of only displaying top-5 images for qualitative evaluations and showcases a wider range for some neurons. In Appendix A.8 we discuss how our method can be used to decide whether a neuron is interpretable or not. Appendix A.9 shows the qualitative effect of different similarity functions. Finally, in Appendix A.10 we evaluated our description quality for 500 randomly chosen neurons, and found descriptions generated by CLIP-Dissect to be a good match for 65.5% of the neurons on average.

² Source: https://github.com/first20hours/google-10000-english/blob/master/20k.txt

4.1 QUALITATIVE RESULTS

Figure 1 shows examples of descriptions for randomly chosen hidden neurons in different layers generated by CLIP-Dissect and the two baselines: Network Dissection (Bau et al., 2017) and MILAN (Hernandez et al., 2022). We do not compare against Compositional Explanations (Mu & Andreas, 2020) as it is much more computationally expensive (at least 200 times slower) and complementary to our approach, since their composition could also be applied to our explanations. We observe that not every neuron corresponds to a clear concept, and that our method can detect low-level concepts in early layers and provide more descriptive labels than existing methods in later layers, such as the 'graduating' and 'nursery' neurons. These results use the union of the ImageNet validation set and Broden as D_probe.
In general, we observe that MILAN sometimes gives very accurate descriptions but often produces descriptions that are too generic or even semantically incorrect (highlighted as red labels), while Network Dissection is good at detecting low-level concepts but fails on concepts missing from its dataset. We compared against two versions of MILAN: MILAN(b) was trained to describe neurons of networks trained on "both" ImageNet and Places365, and MILAN(p) was only trained on Places365 neurons to test its generalization ability. Additional qualitative comparisons for interpretable neurons are shown in Figures 6 and 7 in Appendix A.2.

4.2 QUANTITATIVE RESULTS

Besides the qualitative comparison, in this section we propose the first quantitative evaluation to compare our method's performance with baselines. The key idea is to compare the neuron labels generated for neurons where we have access to the ground truth descriptions – i.e. the final layer of a network, as the ground truth concept of an output layer neuron is the name of the corresponding class (class label). This allows us to objectively evaluate the quality of the generated neuron labels, which avoids the need for human evaluation and uses the real function of the target neurons, while human evaluations are usually limited to describing a few most highly activating images. We propose below two metrics for measuring the quality of explanations:

a) Cos similarity: We measure the cosine similarity in a sentence embedding space between the ground truth class name for the neuron (e.g. "sea lion" in Fig 3) and the explanation generated by the method. For embeddings, we use two different encoders: the CLIP ViT-B/16 text encoder

Table 1: The cosine similarity of predicted labels compared to ground truth labels on final layer neurons of ResNet-50 trained on ImageNet. Higher similarity is better. We can see that our method performs better when D_probe and the concept set are larger and/or more similar to the training data.
Method                          D_probe                Concept set S   CLIP cos   mpnet cos
Network Dissection (baseline)   Broden                 Broden          0.6929     0.2952
MILAN(b) (baseline)             ImageNet val           -               0.7080     0.2788
CLIP-Dissect (ours)             ImageNet val           Broden          0.7393     0.4201
CLIP-Dissect (ours)             ImageNet val           3k              0.7456     0.4161
CLIP-Dissect (ours)             ImageNet val           10k             0.7661     0.4696
CLIP-Dissect (ours)             ImageNet val           20k             0.7900     0.5257
CLIP-Dissect (ours)             ImageNet val           ImageNet        0.9902     0.9746
CLIP-Dissect (ours)             CIFAR100 train         20k             0.7300     0.3664
CLIP-Dissect (ours)             Broden                 20k             0.7407     0.3945
CLIP-Dissect (ours)             ImageNet val           20k             0.7900     0.5257
CLIP-Dissect (ours)             ImageNet val + Broden  20k             0.7900     0.5233

(denoted as CLIP cos) and the all-mpnet-base-v2 sentence encoder (denoted as mpnet cos). See Figure 3 for an example of the similarity scores for descriptions of a single neuron.

b) Accuracy: We compute accuracy for a method as the percentage of neurons to which the method assigns the exact correct label, i.e. the class name. Note that we only measure accuracy in situations where the method chooses from a concept set that includes the exact correct label, such as Network Dissection for models trained on Places365 (not for ImageNet models, since ImageNet labels are missing from Broden). We also did not measure the accuracy of MILAN, as MILAN generates explanations without a concept set and thus is unlikely to match the exact wording of the class name.

Table 2: Performance when labeling final layer neurons of a ResNet-18 trained on Places365. Accuracy measured on the 267/365 neurons whose label is directly included in the Broden labels.

Method                   D_probe   Concept set S   gt label annotation   Top1 Acc   CLIP cos   mpnet cos
Net-Dissect (baseline)   Broden    Broden          Yes                   43.82%     0.8887     0.6697
CLIP-Dissect (ours)      Broden    Broden          No                    58.05%     0.9106     0.7024

Figure 3: Example of a final layer neuron: we compare the descriptions generated by different methods and our metrics. Accuracy is only evaluated for CLIP-Dissect with ImageNet labels as the concept set, since it is the only method where the exact correct answer is a possible choice and therefore accuracy makes sense.
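The two metrics above can be sketched as follows; the sentence encoders (CLIP ViT-B/16 text encoder, all-mpnet-base-v2) are stubbed out here and the embeddings are taken as given:

```python
import numpy as np

def cos_score(gt_emb, pred_emb):
    """Cosine similarity between the sentence embedding of the ground-truth
    class name and that of a method's description (higher is better)."""
    return float(gt_emb @ pred_emb /
                 (np.linalg.norm(gt_emb) * np.linalg.norm(pred_emb)))

def accuracy(gt_labels, pred_labels):
    """Fraction of neurons assigned the exact correct class name."""
    return sum(g == p for g, p in zip(gt_labels, pred_labels)) / len(gt_labels)
```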
In Table 1, we can see that the labels generated by our method are closer to the ground truth in sentence embedding spaces than those of Network Dissection or MILAN, regardless of our choice of D_probe or concept set S. We can also see that using a larger concept set (e.g. 3k vs. 20k) improves the performance of our method. Table 2 shows that our method outperforms Network Dissection even though this task is favorable to their method (as the Places365 dataset has large overlaps with Broden). We highlight that CLIP-Dissect reaches higher accuracy even though Network Dissection has access to and relies on the ground truth labels in Broden while ours does not.

4.3 CHOICE OF SIMILARITY FUNCTION

Table 3 compares the performance of different similarity functions used in CLIP-Dissect. We use accuracy and cos similarity in embedding space, as defined in Sec 4.2, to measure the quality of descriptions. We observed that SoftWPMI performs the best and thus it is used in all other experiments unless otherwise mentioned. The effect of the similarity function is shown qualitatively in Appendix A.9. Table 3 also showcases how CLIP-Dissect can give final layer neurons the correct label with a very impressive 95% accuracy.

4.4 COMPUTATIONAL EFFICIENCY

Table 4 shows the runtime of different automated neuron labeling methods when tasked to label all the neurons of five layers in ResNet-50. We can see our method runs in just 4 minutes, more

Table 3: Comparison of the performance between similarity functions. We look at the final layer of ResNet-50 trained on ImageNet (same as Tab 1). We use S = 20k for cosine similarity and S = ImageNet classes for top-1 accuracy. We can see SoftWPMI performs best overall.
Metric: mpnet cos similarity
Similarity function   CIFAR100 train   Broden   ImageNet val   ImageNet val + Broden   Average
Cos                   0.2761           0.2153   0.2823         0.2584                  0.2580
Rank reorder          0.3250           0.3857   0.4901         0.5040                  0.4262
WPMI                  0.3460           0.3878   0.5302         0.5267                  0.4477
SoftWPMI              0.3664           0.3945   0.5257         0.5233                  0.4525

Metric: Top-1 accuracy
Similarity function   CIFAR100 train   Broden   ImageNet val   ImageNet val + Broden   Average
Cos                   8.50%            5.70%    15.90%         11.40%                  10.38%
Rank reorder          36.30%           57.50%   89.80%         89.90%                  68.38%
WPMI                  23.80%           47.10%   87.00%         86.90%                  61.20%
SoftWPMI              46.20%           70.50%   95.00%         95.40%                  76.78%

than 10, 60 and 200+ times faster than the baselines MILAN (Hernandez et al., 2022), Network Dissection (Bau et al., 2017) and Compositional Explanations (Mu & Andreas, 2020), respectively.

Table 4: The time it takes to describe the layers ['conv1', 'layer1', 'layer2', 'layer3', 'layer4'] of ResNet-50 via different methods using our hardware (Tesla P100 GPU). We can see CLIP-Dissect is much more computationally efficient than existing methods.

Method    CLIP-Dissect   Network Dissection   Compositional Explanations   MILAN
Runtime   3min 50s       >4 hrs               >>14 hours                   55min 30s

4.5 DETECTING CONCEPTS MISSING FROM THE PROBING DATASET

One surprising ability we found is that our method can assign the correct label to a neuron even if D_probe does not have any images corresponding to that concept. For example, CLIP-Dissect was able to assign the correct dog breed to 46 out of 118 neurons detecting dog breeds, and the correct bird species to 22 out of 59 final layer neurons of ResNet-50 trained on ImageNet, while using the CIFAR-100 training set as D_probe, which doesn't include any images of dogs or birds. This is impossible for any label-based method like NetDissect (Bau et al., 2017) or Compositional Explanations (Mu & Andreas, 2020) (as the IoU will be 0 for any concept not in D_probe), and unlikely for methods based on captioning highly activated images like MILAN (Hernandez et al., 2022) (as humans won't assign a caption missing from the activated images). Example labels and highest activating probe images can be seen in Figure 4.
5 USE CASE OF CLIP-DISSECT

In this section, we present a simple experiment to showcase how we can use CLIP-Dissect to gain new insights into neural networks. By inspecting the ResNet-50 network trained on ImageNet with CLIP-Dissect, we discover the following phenomenon and evidence for it: the higher the weight between two neurons, the more similar the concepts they encode, as shown in Figure 5. This makes sense, since a high positive weight causally makes the neurons activate more similarly, but the extent of this correlation is much larger than we expected, as each final layer neuron has 2048 incoming weights, so we would not expect any single weight to have that high an influence. A consequence of the similarity in concepts is that the second-to-last layer already encodes quite complete representations of certain final layer classes in individual neurons, instead of the representation for these classes being spread across multiple neurons. For example, Fig 5a shows that the 3 neurons with the highest outgoing weights already seem to be accurately detecting the final layer concept/class label they're connected to.

Figure 4: Example of CLIP-Dissect correctly labeling neurons that detect the little blue heron and the great white heron based on pictures of dolphins and dinosaurs in CIFAR. CIFAR-100 does not contain any bird images, but CLIP-Dissect can still recover the correct concept.

To make these results more quantitative, in Figure 5b we measure the similarity of concepts encoded by the neurons connected via the highest weights in the final layer of ResNet-50. For layer4 neurons, we used CLIP-Dissect to determine their concept, while for the final layer neurons we used the ground truth, i.e. the class label in text form. We can clearly see that higher weights connect more similar concepts together, and the average similarity decreases exponentially as a function of k when averaging the similarities of neurons connected via the top-k weights.
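The weight-versus-concept-similarity check above can be sketched as a single Pearson correlation over all weight/similarity pairs; the weight matrix and the pairwise mpnet similarities here are placeholders, not the actual ResNet-50 values:

```python
import numpy as np

def weight_concept_correlation(W, concept_sims):
    """Pearson r between final-layer weights W[j, k] and the sentence-embedding
    similarity between the concept of hidden neuron k and the class label of
    output neuron j, computed over all (j, k) pairs."""
    return float(np.corrcoef(W.ravel(), concept_sims.ravel())[0, 1])
```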
To further test this relationship, we found that the mpnet cos similarity between the concepts encoded by two neurons and the weight connecting them are correlated with r = 0.120 and p-value < 10^{-300} (the probability of no correlation is practically 0) when calculated over all 2 million weights in the final layer. If we only look at the highest 50,000 weights, the correlation is even higher, with r = 0.258, p-value < 10^{-300}.

Figure 5: a) The 3 highest weights of the final layer of ResNet-50 trained on ImageNet (visualization); we can see that neurons connected by the highest weights are detecting very much the same concept. b) Average cosine similarities between the concepts of neurons connected by the highest weights. The higher the weight between neurons, the more similar the concepts they represent.

6 LIMITATIONS AND CONCLUSIONS

Limitations: The main limitation of our method compared to previous work is that it does not take advantage of the spatial information of neuron activations. This causes some difficulties in detecting lower-level concepts, but we are still able to detect many low-level/localized patterns, as discussed in Section A.3. Secondly, our method currently works well only on concepts and images that CLIP works well on, and while this already covers a larger set of tasks than what existing neuron labeling methods perform well on, CLIP-Dissect may not work out of the box on networks trained on tasks that require highly specific knowledge, such as classifying astronomical images. However, our method is compatible with future large vision-language models as long as they share a similar structure to CLIP, as well as with CLIP-like models trained for a specific target domain. Finally, not all neurons can be described well by simple terms such as single-word explanations.
While we can augment the space of descriptions by using a different concept set S, or by creating compositional explanations as discussed in Appendix A.4, some neurons may have a very complicated function or perform different functions at different activation ranges. For the most part, current methods including ours will be unable to capture the full picture of such complicated neuron functions.

Conclusions: In this work, we have developed CLIP-Dissect, a novel, flexible and computationally efficient framework for automatically identifying the concepts of hidden layer neurons. We also proposed new methods to quantitatively compare neuron labeling methods, based on labeling final layer neurons. We have shown that CLIP-Dissect can match or outperform previous automated labeling methods both qualitatively and quantitatively, and can even detect concepts missing from the probing dataset. Finally, we used CLIP-Dissect to discover that neurons connected by a high weight often represent very similar concepts.

ACKNOWLEDGEMENT

The authors would like to thank the anonymous reviewers for valuable feedback to improve the manuscript. The authors also thank the MIT-IBM Watson AI Lab for computing support in this work. T. Oikarinen and T.-W. Weng are supported by the National Science Foundation under Grant No. 2107189.

REFERENCES

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2022.

Sébastien Marcel and Yann Rodriguez. Torchvision: The machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pp. 1485–1488, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605589336. doi: 10.1145/1873951.1874254.

Jesse Mu and Jacob Andreas. Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33:17153–17163, 2020.

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V. Le. Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.

Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.

Zeyu Wang, Berthy Feng, Karthik Narasimhan, and Olga Russakovsky. Towards unique and informative captioning of images. In European Conference on Computer Vision, pp. 629–644. Springer, 2020.

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022.

Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015.

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.
Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

A APPENDIX

A.1 SIMILARITY FUNCTION DETAILS AND DERIVATION

Rank reorder hyperparameters: The results in Table 3 use the top 5% of most highly activating images for each neuron and p = 3 for the l_p-norm.

WPMI: In this section, we show that one choice of similarity function sim(t_m, q_k; P) can be derived from the weighted pointwise mutual information (WPMI). Note that WPMI is also used in (Hernandez et al., 2022), but in a different way: our approach computes WPMI directly from the CLIP products P and does not require any training, while (Hernandez et al., 2022) train two models to estimate WPMI.

To start, by definition the WPMI between a concept t_m and the most highly activated images B_k of neuron k can be written as

wpmi(t_m, q_k) = log p(t_m | B_k) − λ log p(t_m)    (5)

Here B_k is the set of images that most highly activate neuron k, i.e. the top indices of q_k. First, we can compute p(t_m | x_i) = softmax(a P_{i,:})_m, where softmax(z)_n = e^{z_n} / Σ_{j=1}^{N} e^{z_j} with z ∈ R^N, P_{i,:} is the i-th row vector of the concept-activation matrix P, and a is a scalar temperature constant. This is the probability that CLIP assigns to concept t_m for image x_i when used as a classifier.

We then define p(t_m | B_k) as the probability that all images in B_k have the concept t_m, which gives us p(t_m | B_k) = Π_{x_i ∈ B_k} p(t_m | x_i). Thus, we have

log p(t_m | B_k) = Σ_{x_i ∈ B_k} log p(t_m | x_i)    (6)

which is the first term in Eq (5). Next, we can approximate the second term p(t_m) in Eq (5) as follows: p(t_m) is the probability that a random set of images B will be described by t_m. Since we do not know the true distribution over sets of images, an efficient way to approximate this is to average the probability of t_m over the different neurons we are probing.
This can be described by the following equation:

p(t_m) = E_B[p(t_m | B)] ≈ (Σ_{j∈C} p(t_m | B_j)) / |C| = (Σ_{j∈C} Π_{x_i ∈ B_j} p(t_m | x_i)) / |C|    (7)

where C is the set of neurons in the layer we are probing. Thus we can plug Eq (6) and Eq (7) into Eq (5) to compute WPMI through the CLIP model:

wpmi(t_m, q_k) = Σ_{x_i ∈ B_k} log p(t_m | x_i) − λ log(Σ_{j∈C} Π_{x_i ∈ B_j} p(t_m | x_i)) + λ log |C|    (8)

So we can use Eq (8) in CLIP-Dissect and set sim(t_m, q_k; P) = wpmi(t_m, q_k) in the algorithm. For our experiments we use a = 2, λ = 0.6 and the top 28 most highly activating images for neuron k as B_k, which were found to give the best quantitative results when describing final layer neurons of ResNet-50.

SoftWPMI: SoftWPMI extends WPMI as defined by Eq (8) to settings where we have uncertainty over which images should be included in the example set B_k. In WPMI the size of the example set is defined beforehand, but it is not clear how many images should be included, and this could vary from neuron to neuron. In this description, we assume that there exists a true B_k which includes images from D_probe if and only if they represent the concept of neuron k. We then define binary indicator random variables X_i^k = 1[x_i ∈ B_k], which take value 1 if the i-th image is in the set B_k, and we define X^k = (X_1^k, ..., X_M^k).
Our derivation begins from the observation that we can rewrite p(t_m | B_k) from above as:

p(t_m | B_k) = Π_{x_i ∈ B_k} p(t_m | x_i) = Π_{x_i ∈ D_probe} p(t_m | x_i)^{1[x_i ∈ B_k]} = Π_{x_i ∈ D_probe} p(t_m | x_i)^{X_i^k}    (9)

Now:

E_{X_i^k}[p(t_m | x_i)^{X_i^k}] = p(x_i ∈ B_k) p(t_m | x_i) + (1 − p(x_i ∈ B_k)) = 1 + p(x_i ∈ B_k)(p(t_m | x_i) − 1)    (10)

If we assume the X_i^k are statistically independent, we can write:

E_{X^k}[p(t_m | B_k)] = Π_{x_i ∈ D_probe} E_{X_i^k}[p(t_m | x_i)^{X_i^k}] = Π_{x_i ∈ D_probe} [1 + p(x_i ∈ B_k)(p(t_m | x_i) − 1)]    (11)

⇒ log E_{X^k}[p(t_m | B_k)] = Σ_{x_i ∈ D_probe} log(1 + p(x_i ∈ B_k)(p(t_m | x_i) − 1))    (12)

Note that Eq (10) goes to 1 if p(x_i ∈ B_k) = 0 (i.e. no effect in a product) and to p(t_m | x_i) if p(x_i ∈ B_k) = 1. So Eq (12) reduces to Eq (6) of standard WPMI if p(x_i ∈ B_k) is either 1 or 0 for all x_i ∈ D_probe. In other words, we are considering a "soft" membership in B_k instead of the "hard" membership of standard WPMI.

To get the second term of WPMI, p(t_m), i.e. the probability that text t_m describes a random example set B_k, we can approximate it as we did in Eq (7), using the example sets of the other neurons we are interested in:

p(t_m) = E_{B_j}[E_{X^j}[p(t_m | B_j)]] ≈ (Σ_{j∈C} E_{X^j}[p(t_m | B_j)]) / |C| = (Σ_{j∈C} Π_{x ∈ D_probe} [1 + p(x ∈ B_j)(p(t_m | x) − 1)]) / |C|    (13)

Finally, we can compute the full SoftWPMI with Eq (12) and Eq (13) and use it as the similarity function in CLIP-Dissect:

soft_wpmi(t_m, q_k) = Σ_{x_i ∈ D_probe} log(1 + p(x_i ∈ B_k)(p(t_m | x_i) − 1)) − λ log(Σ_{j∈C} Π_{x ∈ D_probe} [1 + p(x ∈ B_j)(p(t_m | x) − 1)]) + λ log |C|    (14)

One thing we have not yet discussed is the choice of p(x ∈ B_k). There is flexibility here: this probability could be derived from the activations of neuron k on image x, for example by taking a scaled sigmoid, or it could be based on the ranking of the image.
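Under the definitions above, Eq (14) can be sketched directly in NumPy. This is an illustrative reimplementation, not the authors' released code; the variable names (`P` for the image-concept product matrix, `p_mem` for the soft memberships p(x_i ∈ B_k)) are our own, and `p_mem` is assumed to come from whatever membership scheme is chosen (e.g. rankings).

```python
import numpy as np

def soft_wpmi(P, p_mem, a=10.0, lam=1.0):
    """SoftWPMI scores for every (neuron, concept) pair, following Eq (14).

    P:     (N, M) CLIP image-concept inner products (N images, M concepts)
    p_mem: (N, C) soft memberships p(x_i in B_k) for C neurons
    Returns a (C, M) score matrix; argmax over axis 1 picks each neuron's label.
    """
    # p(t_m | x_i): softmax over concepts with temperature a
    z = a * P - np.max(a * P, axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)            # (N, M)

    # first term: Σ_i log(1 + p(x_i in B_k)(p(t_m|x_i) - 1)), per (k, m)
    L = np.log1p(p_mem[:, :, None] * (p[:, None, :] - 1.0)).sum(0)  # (C, M)

    # second term: -λ log Σ_j Π_i [...] + λ log|C|, computed via log-sum-exp
    Lmax = L.max(axis=0, keepdims=True)
    log_sum = Lmax + np.log(np.exp(L - Lmax).sum(axis=0, keepdims=True))
    return L - lam * log_sum + lam * np.log(L.shape[0])
```

With hard 0/1 memberships this reduces to the WPMI score of Eq (8), since log1p(1·(p−1)) = log p and log1p(0) = 0.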
For our experiments we found a ranking-based probability to perform best, and used p(x ∈ B_k) decreasing linearly from 0.998 for the most highly activating image of neuron k to 0.97 for the 100th most highly activating image, and 0 for all other images. Thus, in practice we only have to use 100 images when calculating SoftWPMI instead of the full D_probe, which is much more computationally efficient. For the other hyperparameters we used a = 10 and λ = 1.

A.2 ADDITIONAL QUALITATIVE RESULTS

Additional visualizations on ResNet-18 and ResNet-50 are shown in Figs 6 and 7, continuing from Section 4.1.

Figure 6: Explanations of the most interpretable neurons in the second-to-last layer of ResNet-18 trained on Places365, displayed together with the 5 most highly activating images for each neuron. We have subjectively colored the descriptions green if they match these 5 images, yellow if they match but are too generic, and red if they do not match. Both Network Dissection and CLIP-Dissect do very well, while MILAN struggles to explain some neurons. MILAN(b) is trained on both ImageNet and Places365 networks, while MILAN(i) is trained only on ImageNet. Both MILAN networks perform similarly here, but the ImageNet version misses or is too generic for more neurons, such as labeling a bus neuron as "vehicles". The neurons on the left have the highest IoU according to MILAN, while the neurons on the right have the highest similarity to the concept according to our similarity function.

Figure 7: Explanations of the most interpretable neurons in the second-to-last layer of ResNet-50 trained on ImageNet, displayed together with the 5 most highly activating images for each neuron. We have subjectively colored the descriptions green if they match these 5 images, yellow if they match but are too generic, and red if they do not match.
Both CLIP-Dissect and Network Dissection perform well on these most interpretable neurons, except for a few failures by Network Dissection, while MILAN often gives concepts that are too generic. MILAN(b) is trained on both ImageNet and Places365 networks, while MILAN(p) is trained only on Places365. We can see that the Places-trained model struggles more with concepts like spiders, indicating issues with generalization.

A.3 LOW LEVEL CONCEPTS

In this section we show additional results on probing low-level concepts, using two networks trained on ImageNet: ResNet-50 and ResNet-152. For ResNet-152 we also compare against human annotations for these neurons from MILAnnotations (Hernandez et al., 2022). The results for ResNet-152 can be seen in Figure 8. We can see that CLIP-Dissect is able to accurately detect many lower level concepts, such as colors in Conv1 neurons 1, 3, 10 and Layer1 neuron 4, as well as detecting that neuron 3 of Layer1 activates specifically for the text/label, without having access to the activation pattern, while MILAN fails to detect this. However, CLIP-Dissect also has some failure modes on lower level patterns:

• Failure to differentiate between a concept and correlated objects: This leads to higher level outputs than desired. For example, CLIP-Dissect gives the concept "underwater" to conv1 neuron 16, while the true concept is probably closer to "blue" (human annotators also made this mistake), and for conv1 neuron 28 CLIP-Dissect outputs "zebra" while the neuron is likely just detecting "stripes". The worst example of this is conv1 neuron 24, which simply activates on white backgrounds, but this is entirely missed by CLIP-Dissect, as it is not good at focusing on the background.

• Uninterpretable neurons: Some neurons seem to be uninterpretable, e.g. Conv1 neuron 2.
CLIP-Dissect outputs "music", which seems incorrect, but neither the human annotators nor MILAN were able to assign a clear concept to this neuron either. See Appendix A.8 for more analysis of uninterpretable neurons.

Similar observations also hold for the ResNet-50 model, as can be seen in Figure 9. Note that we do not have human annotations to compare against for this network.

Figure 8: Descriptions for select neurons in early layers of ResNet-152, showcasing both the successes and failure modes of CLIP-Dissect. For this figure we used max as the summary function g for CLIP-Dissect to be comparable to MILAnnotations. For evaluation we used the MILAN model trained only on Places models to avoid overfitting.

Figure 9: Descriptions of the most interpretable neurons (highest similarity/IoU) of an early layer in ResNet-50.

A.4 COMPOSITIONAL CONCEPTS

In the sections above our method has focused on choosing the most fitting concept from the predefined concept set. While changing the concept set in CLIP-Dissect is as easy as editing a text file, we show it can also detect more complex compositional concepts. We experimented with generating explanations by searching over text concatenations of two concepts in our concept space. To reduce computational cost, we only looked at combinations of the 100 most accurate single-word labels for each neuron. Example results are shown in Fig 10. While the initial results are promising, some challenges remain in making these compositional explanations more computationally efficient and consistent, which is an important direction for future work.

Figure 10: An example of compositional explanations generated by our method for two neurons of ResNet-50 trained on ImageNet.
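The pairwise search described above can be sketched as follows. This is an illustrative outline rather than the paper's released code: `score_fn` stands in for whatever similarity function (e.g. SoftWPMI against the neuron's activations) scores a candidate description, and candidate generation simply concatenates ordered pairs of the neuron's top single-word labels.

```python
from itertools import permutations

def best_compositional_label(top_words, score_fn, n_best=100):
    """Search two-word concatenations of a neuron's best single-word labels.

    top_words: single-word labels for this neuron, ranked best first
    score_fn:  maps a candidate description string to a similarity score
    """
    # ordered pairs, since "blue sky" and "sky blue" may score differently
    candidates = [f"{w1} {w2}"
                  for w1, w2 in permutations(top_words[:n_best], 2)]
    return max(candidates, key=score_fn)
```

Restricting to the top n_best single-word labels keeps the search quadratic in a small constant rather than in the full concept-set size.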
A.5 VISION TRANSFORMER

Since our method does not rely on the specifics of CNNs in its operation, we can easily extend it to different architectures, such as the Vision Transformer, specifically the ViT-B/16 (Dosovitskiy et al., 2020) model trained on ImageNet. We have visualized the most interpretable neurons, their highly activating images, and their descriptions in Figure 11. Interestingly, we found the highly interpretable neurons to be very location focused, e.g. "kitchens" or "highways", despite the network being trained on object-level labels (ImageNet).

Figure 11: Most interpretable neurons after the 'encoder' module (one of the last layers) of a Vision Transformer.

A.6 PREDICTING INPUT CLASS FROM NEURON DESCRIPTIONS

In this section we follow a reviewer's suggestion and study whether the class of an image can be predicted from the descriptions of highly activating neurons. We perform an experiment on neurons in the second-to-last layer (the last hidden layer) and study whether the descriptions of highly activating neurons in this layer can be used to predict which class the input is from. As we show in the experiments below, the neuron descriptions generated by CLIP-Dissect give overall higher prediction accuracy than the neuron descriptions from prior work (Network Dissection and MILAN), which suggests our proposed method generates better neuron descriptions than prior work.

Below we outline the steps we used to predict the class of an image from the internal neuron descriptions of different neuron labeling methods:

1. Following our notation in Sec 3, for an image x_i, record the activations of the neurons in the second-to-last layer, g(A_j(x_i)) (where j denotes the neuron index), as well as the predicted class c.

2. Find the neuron with the highest positive contribution to the predicted class c, which can be computed as k = argmax_j W_{c,j} · g(A_j(x_i)).

3.
Obtain the description t_k for neuron k using an automated description method.

4. Find the class name that is most similar to the description of the highest contributing neuron, and predict this class as the image's class. Similarity is measured as an average of cosine similarities in the sentence embedding space of the CLIP text encoder and the mpnet sentence embedding space discussed in Section 4.2.

We performed this experiment on ResNet-50 trained on ImageNet, and found that with the above algorithm and CLIP-Dissect neuron descriptions we were able to correctly predict the class of 10.28% of the images. In contrast, with descriptions from Network Dissection we were only able to predict the class of 3.36% of the images, and with MILAN(base) only 2.31% of the time. This provides evidence that our CLIP-Dissect descriptions are of higher quality than those of Network Dissection and MILAN.

It is worth noting that we would not expect to reach a very high prediction accuracy overall with the above method, as the most contributing neuron often has a completely different role than the target class. However, if we study the same network with different neuron description methods, we would expect a better neuron description method to yield a higher prediction accuracy, assuming at least some of the important neurons indeed have a role similar to the class itself. This method has the benefit of being automated, going beyond visualizing a few most highly activating images, and being able to analyze description quality in hidden layers. We think methods like this are an interesting future direction for evaluating neuron description methods.

A.7 VISUALIZING A WIDER RANGE OF ACTIVATIONS

So far our qualitative evaluation has focused on whether the description matches the 5 or 10 images in D_probe that activate the neuron the most.
However, this does not give a full picture of the function of the neuron, and in this section we explore how neurons activate on a wider range of input images. In particular, in Figures 12 and 13 we visualize the most interpretable neurons of two layers of ResNet-18 (Places365) and ResNet-50 (ImageNet) by uniformly sampling images from the top 0.1%, 1% and 5% of most highly activating images. We can see that the descriptions tend to match quite well for the top 0.1% of most highly activating images, but the top 1% and top 5% images start to be of quite different concepts, only slightly related to the description. We also notice that low level concepts like colors tend to be more consistently represented in the top 1% and 5% images, while higher level concepts are not. This result is somewhat expected, as D_probe does not have many images for each higher level concept, but it highlights the need to explore neuron activations beyond just the few most highly activated images.

Figure 12: Most interpretable neurons of ResNet-18, showcasing randomly sampled images from a wide range of most highly activating images for each neuron.

Figure 13: Most interpretable neurons of ResNet-50, showcasing randomly sampled images from a wide range of most highly activating images for each neuron.

A.8 INTERPRETABILITY CUTOFF

Our method can also be used to quantify which neurons are 'interpretable'. Since each description for a neuron is associated with a similarity score, and the higher the similarity the more accurate the description, we consider a neuron k with description t interpretable if sim(t, q_k; P) > τ. To choose the best cutoff τ, we leverage our experiment in Section A.10. In particular, we choose the lowest τ such that interpretable neurons have an average description score of 0.75 or higher (compared to 0.655 for all neurons).
This gives a cutoff threshold of τ = 0.16 with the proposed SoftWPMI similarity function. In contrast, the neurons with SoftWPMI ≤ τ have an average description score of 0.5257, which is lower than that of interpretable neurons (0.75) and of all neurons (0.655). This suggests that the SoftWPMI similarity score is a useful indicator of description quality. Using this cutoff τ, we find 69.7% of neurons in ResNet-18 (Places365) and 77.8% of neurons in ResNet-50 (ImageNet) to be interpretable, indicating that around 20–30% of neurons do not have a simple explanation for their functionality, i.e. are 'uninterpretable'.

A.9 QUALITATIVE EFFECT OF SIMILARITY FUNCTION

Figure 14 shows the descriptions generated by CLIP-Dissect when using simple cosine similarity as the similarity function. As we can see, the performance is very poor, adequately describing only one of the 8 neurons displayed, which highlights the need for the more sophisticated similarity functions we introduced.

Figure 14: Explanations for the same neurons as Figure 1, showcasing the qualitative difference between using simple cosine similarity as the similarity function for CLIP-Dissect versus our best performing SoftWPMI similarity function.

A.10 LARGER SCALE EXPERIMENT ON DESCRIPTION QUALITY

In this section we perform a larger scale analysis of the neuron description quality provided by CLIP-Dissect. We evaluated the description quality of 50 randomly selected neurons for each of the 5 layers and 2 models studied, for a total of 1000 evaluations. Each evaluator was presented with the 10 most highly activating images, and answered the question: "Does the description: '<description>' match this set of images?" An example of the user interface is shown in Figure 15. Each evaluation had three options, which we used with the following guidelines:

• Yes - Most of the 10 images are well described by this description

• Maybe - Around half (i.e.
3–6) of the images are well described, or most images are described relatively well (accurate but too generic, or slightly inaccurate)

• No - Most images are poorly described by this caption

Figure 15: Example of the user interface for our evaluation of description quality.

These evaluations were turned into a numeric score as follows: yes: 1, maybe: 0.5, no: 0. Table 5 shows the average description score across neurons and evaluators for each of the layers evaluated. This average score can be thought of as the percentage of neurons well described. We observed that overall the descriptions are good for 55–80% of neurons depending on the layer, with the average score across all evaluations being 0.655. In addition, we notice that the very early and very late layers are the most interpretable, corresponding to clear low or high level concepts, while the middle layers seem to be harder to describe.

It is worth noting that we are evaluating random neurons here, i.e. the neurons are selected randomly, so a displayed neuron may not be interpretable in the first place; in many cases when the description does not match, it is because the neuron itself is not 'interpretable', i.e. there is no simple description that corresponds to the neuron's functionality. Even with the evaluation guidelines described above, these evaluations are subjective, and we found that our two evaluators agreed on 68.4% of the neurons, with the vast majority of disagreements being between yes/maybe or maybe/no; only 2.4% of neurons had a yes from one evaluator and a no from the other. For transparency, we have included all 50 neurons, their descriptions, and the two evaluators' evaluations (E1, E2) of these descriptions for ResNet-50 layer conv1 in Figures 16, 17 and 18, and for ResNet-18 layer4 in Figures 19, 20 and 21.

Table 5: The average description scores of the CLIP-Dissect descriptions for neurons in different layers. An average score of 1.0 indicates all descriptions match the neuron's highly activating images, 0.0 means none do, and 0.5 could mean anything from all neurons 'maybe' matching to half the neurons fully matching and half not matching at all.

Model                    conv1   layer1  layer2  layer3  layer4
ResNet-18 (Places365)    0.805   0.635   0.635   0.410   0.695
ResNet-50 (ImageNet)     0.815   0.670   0.550   0.640   0.695

Figure 16: Random neurons of 'conv1' in ResNet-50 (ImageNet), and E1, E2's evaluations of the CLIP-Dissect descriptions (PART 1).

Figure 17: Random neurons of 'conv1' in ResNet-50 (ImageNet), and E1, E2's evaluations of the CLIP-Dissect descriptions (PART 2).

Figure 18: Random neurons of 'conv1' in ResNet-50 (ImageNet), and E1, E2's evaluations of the CLIP-Dissect descriptions (PART 3).

Figure 19: Random neurons of 'layer4' in ResNet-18 (Places365), and E1, E2's evaluations of the CLIP-Dissect descriptions (PART 1).

Figure 20: Random neurons of 'layer4' in ResNet-18 (Places365), and E1, E2's evaluations of the CLIP-Dissect descriptions (PART 2).

Figure 21: Random neurons of 'layer4' in ResNet-18 (Places365), and E1, E2's evaluations of the CLIP-Dissect descriptions (PART 3).

B CROWDSOURCED EVALUATION OF DESCRIPTION QUALITY

Note: The results in this section were added after the ICLR camera ready to further study the effectiveness of our proposed method, and they are not included in the official ICLR paper.
To get a more objective quantitative measure of the quality of our neuron descriptions, we conducted a large scale study on Amazon Mechanical Turk. The results from this study were in line with our evaluation in Section A.10, showing that on average users agree with CLIP-Dissect's descriptions, and that neurons in very early and very late layers are the most explainable.

B.1 SETUP

We evaluated 200 random neurons per layer (or all neurons for layers with fewer than 200 neurons), in the same 5 layers per model as in Section A.10, for both ResNet-50 (ImageNet) and ResNet-18 (Places). Each neuron's description was evaluated by 3 raters. The neurons we evaluated were a superset of the neurons evaluated in Section A.10.

The task interface is similar to that used in Section A.10, where we evaluated how well the 10 most highly activating images match the description. However, we changed the rating scale to a five point scale (1 - Strongly Disagree with the description, 5 - Strongly Agree with the description). The study interface is shown in Figure 22. To generate the descriptions we used the union of the ImageNet validation dataset and Broden as D_probe, the SoftWPMI similarity function, and the 20k most common English words as the concept set.

We used raters in the United States with a greater than 98% approval rating and at least 10000 previously approved HITs. Users were paid $0.05 per task. Our experiment was deemed exempt from IRB approval by the UCSD IRB Board.

Figure 22: User interface for our Mechanical Turk study.

B.2 RESULTS

The results of our experiment are shown in Table 6. We can see that on average, users agreed with our explanations, with the average rating across both models and layers being 3.57 out of 5. In addition, the results agree well with our previous experiment in Section A.10, showing that the earliest and latest layers are the most interpretable.
Table 6: Results of our crowdsourced experiment. Each number is the average rating (1-5) across evaluators and neurons for that layer. We can see the average scores are between Neutral and Agree, with little difference between the two models and the highest agreement on the lowest and highest layers.

Model        conv1   layer1  layer2  layer3  layer4
ResNet-18    3.83    3.58    3.56    3.33    3.71
ResNet-50    3.90    3.48    3.41    3.38    3.50

As a baseline/sanity test, we evaluated 10 neurons per layer with their description randomly chosen from the concept set. These random descriptions reached an average score of 2.61, which is much lower than the CLIP-Dissect explanations (3.57), but still higher than we would expect, indicating that there is some noise in crowdsourced evaluations. In summary, we have shown in a large scale crowdsourced study that our descriptions are clearly better than random, and that the findings from our study in Section A.10 hold in a larger study with more neurons and more evaluators.