Paper deep dive
Disentangling Neuron Representations with Concept Vectors
Laura O'Mahony, Vincent Andrearczyk, Henning Müller, Mara Graziani
Models: Inception V3
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 6:49:51 PM
Summary
This paper introduces a method to disentangle polysemantic neurons in convolutional neural networks into monosemantic 'concept vectors' in activation space. By applying agglomerative and k-means clustering to the activations of top-activating images, the authors identify distinct, human-understandable features that remain entangled within individual neurons, thereby improving mechanistic interpretability.
Entities (5)
Relation Signals (3)
Concept Vector → disentangles → Polysemantic Neuron
confidence 95% · The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors
Inception V3 → evaluated with → Concept Vector
confidence 95% · We consider Inception V3 (IV3) [25] for our experiments... We demonstrate the results of the method... on a number of both polysemantic and monosemantic neurons.
Agglomerative Clustering → used in → Concept Vector
confidence 90% · The second step involves measuring the similarity... We then apply a clustering technique... agglomerative clustering
Cypher Suggestions (2)
Identify relationships between algorithms and interpretability techniques · confidence 85% · unvalidated
MATCH (a:Algorithm)-[:USED_IN]->(c:InterpretabilityMethod) RETURN a.name, c.name
Find all methods used to interpret neural networks · confidence 80% · unvalidated
MATCH (m:Method)-[:USED_IN]->(n:NeuralNetwork) RETURN m.name, n.name
Abstract
Mechanistic interpretability aims to understand how models store representations by breaking down neural networks into interpretable units. However, the occurrence of polysemantic neurons, or neurons that respond to multiple unrelated features, makes interpreting individual neurons challenging. This has led to the search for meaningful vectors, known as concept vectors, in activation space instead of individual neurons. The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features. Our method can search for fine-grained concepts according to the user's desired level of concept separation. The analysis shows that polysemantic neurons can be disentangled into directions consisting of linear combinations of neurons. Our evaluations show that the concept vectors found encode coherent, human-understandable features.
Tags
Links
Full Text
29,261 characters extracted from source content.
Disentangling Neuron Representations with Concept Vectors

Laura O'Mahony*¹,² (lauraa.omahony@ul.ie), Vincent Andrearczyk¹ (vincent.andrearczyk@hevs.ch), Henning Müller¹,³ (henning.muller@hevs.ch), Mara Graziani¹ (mara.graziani@hevs.ch)

¹ Haute école spécialisée de Suisse occidentale (HES-SO Valais), Sierre, Switzerland
² University of Limerick, Limerick, Ireland
³ The Sense Research and Innovation Center, Sion and Lausanne, Switzerland

* Corresponding author: lauraa.omahony@ul.ie. Source code available at https://github.com/lomahony/sw-interpretability

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2023, Vancouver, Canada. Copyright 2023 by the author(s).

Abstract

Mechanistic interpretability aims to understand how models store representations by breaking down neural networks into interpretable units. However, the occurrence of polysemantic neurons, or neurons that respond to multiple unrelated features, makes interpreting individual neurons challenging. This has led to the search for meaningful vectors, known as concept vectors, in activation space instead of individual neurons. The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features. Our method can search for fine-grained concepts according to the user's desired level of concept separation. The analysis shows that polysemantic neurons can be disentangled into directions consisting of linear combinations of neurons. Our evaluations show that the concept vectors found encode coherent, human-understandable features.

1. Introduction

Mechanistic interpretability is a fast-emerging research topic that aims at deciphering the internal representations held by a model by reverse engineering them into understandable computer programs [1, 20, 21]. Previous work in this field breaks down convolutional neural networks (CNNs) into the features learned by the fundamental units of a layer, which are considered as directions of a geometric space. Many previous works consider neurons as these units [20, 21]. Breaking down the model into such interpretable units allows us to better understand how models store representations in vision tasks [4, 5, 20, 21] and language models [8]. This can even allow us to predict and edit model behaviour, such as the work by Bau et al. that removes units important to a class [5], and has also been studied for other architectures such as GANs and GPT models [3, 16, 17].

A frequent issue is the occurrence of polysemantic neurons, namely neurons that respond to several unrelated features, or concepts [19-21]. They can be found by looking at the maximally activating dataset examples and observing that they consist of multiple groups that are conceptually very different [6, 20]. This makes the interpretation of individual neurons challenging, since they cannot be mapped to unique features. This is exemplified in Olah et al. [20, 21] by a neuron equally likely to respond to car shields and cat paws, at the same time and with the same intensity. There is evidence that the training of models pushes networks to represent many features within individual neurons [7, 24]. Models have a limited number of neurons, meaning a dedicated neuron is often not possible for every feature. This is related to the idea of superposition, which refers to neural networks representing more features than they have neurons [7].
These empirical observations in existing research indicate that neurons are not always the right fundamental unit encapsulating an individual feature represented by a model. If we define activation space as the space of all possible combinations of neuron activations, we can widen our lens to look for meaningful vectors in activation space instead of single-neuron basis vectors. There is evidence suggesting that monosemantic regions in activation space exist [4, 6, 7, 11, 21], but they are not always made obvious by studying individual neurons [6, 26]. A key issue resulting from this observation is the question of how directions in activation space representing distinct features can be found [21].

Previous work has shown the existence of high-level, human-interpretable concepts such as textures, shapes and parts of objects present as directions in activation space. Early work by Alain et al. [2] took the features of the layers in a model and fit a linear classifier to each layer to predict the class labels. The work by Kim et al. [13] on Concept Activation Vectors (CAVs) defines a concept as a vector in the direction of the activation values of a set of examples of that concept. The authors find a concept by training a linear classifier to distinguish between examples of that concept and random counterexamples; the concept vector is then taken as the vector orthogonal to the decision boundary. A limitation of this method is that it requires a handcrafted set of examples of a concept to find the concept direction in latent space.
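As a point of reference for the CAV construction just described, the following is a minimal sketch, not the authors' code: `concept_acts` and `random_acts` are hypothetical names for pooled activations of concept examples and random counterexamples, and logistic regression stands in for the unspecified linear classifier.

```python
# Sketch of a Concept Activation Vector (CAV), after Kim et al. [13].
# `concept_acts` and `random_acts` are assumed (n, d) arrays of pooled
# activations (hypothetical names, not from the paper's codebase).
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # The vector orthogonal to the decision boundary is the concept direction.
    w = clf.coef_[0]
    return w / np.linalg.norm(w)
```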
A small number of unsupervised methods for finding concepts have been developed [10, 14, 22]; this research direction is known as concept discovery. Concept discovery involves the search for unit vectors in the latent space of a model that encode learned representations of high-level concepts. However, none of the existing methods seek to disentangle polysemantic representations. Their concept vectors are linear combinations of units, and as such they are likely to inherit polysemanticity from polysemantic neurons [10]. Furthermore, none of these methods incorporate the notion of a privileged basis proposed by Elhage et al. [7]. A privileged basis is one in which some representations are encouraged to align with basis directions, meaning directions in space corresponding to individual neurons. Even though neuron directions are usually meaningful candidates for representing a feature [4, 7, 20, 21], they likely do not tell the whole story, due to the countervailing force of superposition [7].

The main contribution of this paper is a method to find and disentangle monosemantic directions starting from polysemantic neurons. Moreover, our method can search for concepts that are fine-grained according to the user's desired level of concept separation. Our analysis shows that polysemantic neurons can be disentangled into directions consisting of linear combinations of neurons.

2. Methods

We consider a CNN predicting a classification output (a p-dimensional output vector) from an input image. We note that the method can be generalised to other models, but we use a CNN for our analysis. We assume the model has already been trained and that we have access to the intermediate representations of an arbitrary layer inside the model. Fig. 1 summarises the steps discussed in more detail below.

We take a given intermediate layer $l$. We calculate the embeddings for the entire dataset and apply global average pooling to aggregate the spatial information of the convolutional feature maps. We select a neuron $n$ and apply the following steps iteratively. In step 1, for the neuron $n$, we take the top $N$ activating images $\{x_i\}_{i=1}^N$ and their activations $\{\phi_l(x_i)\}_{i=1}^N$, where $\phi_l(x_i) \in \mathbb{R}^d$.

The second step involves measuring the similarity of the pooled activations (of the top activating images) at the intermediate layer $l$. We use the Euclidean distance as the distance metric, which has been shown by previous work to be highly predictive of perceptual similarity [28]. We then apply a clustering technique to group these measurements of similarity into sets of close examples. For this, we use agglomerative clustering, a bottom-up type of hierarchical clustering [18], since it does not require us to pre-specify the number of clusters to be generated, as the k-means approach does. With the clustering settings described in Appendix A.1, we apply agglomerative clustering with a distance threshold $d_{max}$ that specifies the maximum linkage distance at which clusters will be merged. Its result can be visualised in a dendrogram, a tree-based representation of the elements, as depicted in Fig. 1, step 2. The distance threshold is a hyperparameter that is tuned to an appropriate range for the model layer. It can be tweaked to split concepts into coarser or finer buckets; please refer to Appendix A.1 for a demonstration of this.

The third step takes the number of clusters $C$ obtained in step 2 and performs k-means clustering on the same measurements of similarity. The benefit of using k-means clustering over agglomerative clustering alone is that the k-means centroids allow us to easily remove outliers from each cluster that have low similarity to the rest of the cluster, as employed in [9].

The final step finds the directions of the $\hat{C}$ concept vectors corresponding to the $\hat{C}$ clusters $\{\hat{c}_j\}_{j=1}^{\hat{C}}$ (note that $\hat{C}$ may differ from $C$, since clusters with fewer than 5 samples are removed) by taking the mean of the remaining embeddings in each cluster. This gives a set of vectors $\{\mathrm{avg}(\{\phi_l(x_i)\}_{i \in \hat{c}_j})\}_{j=1}^{\hat{C}}$, which are then normalised to give the disentangled concept vectors $\{v_j\}_{j=1}^{\hat{C}}$, where $v_j \in \mathbb{R}^d$.

Figure 1. Step 1: a set of images that maximally activate a neuron in a model layer is taken. Step 2: the Euclidean distance between images in activation space is used as the similarity space on which the clustering is performed; this returns the appropriate number of clusters for a given distance threshold. Step 3: k-means clustering computes the cluster membership. Step 4: from the images in each cluster, a concept vector is calculated, which points toward the non-neuron-aligned direction in activation space.
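The four steps can be condensed into a short sketch. This is a minimal illustration under stated assumptions, not the released implementation: `acts` is assumed to be an (n_images, d) array of globally average-pooled activations at layer $l$, and the outlier rule (dropping the points farthest from each centroid) is a stand-in for the procedure of [9].

```python
# Sketch of the disentangling pipeline (Sec. 2). Parameter names and the
# outlier rule are illustrative simplifications.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def concept_vectors(acts, neuron, top_n=100, d_max=15.0, min_size=5, keep_frac=0.8):
    # Step 1: top-N images by activation of the chosen neuron.
    top = np.argsort(acts[:, neuron])[::-1][:top_n]
    X = acts[top]

    # Step 2: agglomerative (Ward) clustering on Euclidean distances with a
    # distance threshold, so the cluster count C is inferred, not pre-specified.
    agg = AgglomerativeClustering(n_clusters=None, distance_threshold=d_max,
                                  linkage="ward")
    C = len(np.unique(agg.fit_predict(X)))

    # Step 3: k-means with C clusters; centroids let us drop outliers
    # (here: the points farthest from their centroid, a simplification).
    km = KMeans(n_clusters=C, n_init=10).fit(X)
    vectors = []
    for j in range(C):
        members = X[km.labels_ == j]
        if len(members) < min_size:   # small clusters are discarded
            continue
        dists = np.linalg.norm(members - km.cluster_centers_[j], axis=1)
        keep = np.argsort(dists)[: max(min_size, int(keep_frac * len(members)))]
        # Step 4: mean of the remaining embeddings, normalised to a unit vector.
        v = members[keep].mean(axis=0)
        vectors.append(v / np.linalg.norm(v))
    return vectors
```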
3. Results

We consider Inception V3 (IV3) [25] for our experiments, since it is a de facto standard convolutional neural network. Moreover, interpretability research has already provided multiple insights for this model [9, 10, 13], and weights pre-trained on the ImageNet ILSVRC2012 [23] dataset are available online. As this exploratory study only aims at a proof of concept, we focused on an undersampled version of ImageNet, retaining 130 random images for each class. This kept computation feasible and light on our infrastructure; the results can easily be scaled to the entire dataset and to larger dataset sizes. Where not stated otherwise, we consider the concatenation layer Mixed 7b, a convolutional layer with 2048 feature maps ($d = 2048$) near the end of the IV3 model. We pick this layer as we expect it to encode complex concepts [21, 27]. A similar analysis can be done on other layers and architectures.

We demonstrate the results of the method described in Sec. 2 on a number of both polysemantic and monosemantic neurons. We took $N = 100$ top activating dataset examples and set the distance threshold parameter $d_{max} = 15$. We select neuron 35 as an example of a polysemantic neuron, as it activates highly for images of apples and of sports, while three further top images are dominated by a net-like pattern. This results in 3 clusters. When the k-means clustering step was applied (step 3 in Fig. 1) and outliers were removed, the cluster containing the net-like images was discarded, as it held fewer than 5 images. Fig. 2a shows the embeddings of the remaining images plotted using UMAP [15] dimensionality reduction; the plot shows how UMAP separates the embeddings of apple images from those of sports images. Note that UMAP is used only to illustrate the clusters found using k-means, since we found it accurately reflects the cluster membership result.

We select neuron 16 as an example of a monosemantic neuron, which activates for many categories of elliptical shapes, as depicted in Fig. 2b. The same procedure applied to this neuron yields only one cluster, and hence one monosemantic concept vector, which has a much higher similarity to the original images than the neuron direction does.

The case of neuron 1 is interesting and demonstrates how our method can fine-grain concepts. This neuron activates highly for underwater images, as shown in Fig. 2c. Further inspection shows that it activates highly both for general underwater images and for images of scuba divers. Applying our method yields two clusters, one for general underwater scenes such as coral, and another for scuba divers, meaning two concept vectors can be found for this neuron. At a higher distance threshold, however, the two clusters are merged by the algorithm and one concept is found. Here the features are not entirely conceptually different, yet these two related directions can be differentiated, and concept discovery can separate them. We believe this hints at how it may be helpful to view concepts as varying continuously in the latent space instead of being encoded discretely by neurons. We suspect this phenomenon is related to the notion of 'feature facets' [19].

Figure 2. UMAP of the maximally activating images kept after k-means clustering and outlier removal in latent space: (a) separate clusters for polysemantic neuron 35, (b) a single cluster for monosemantic neuron 16, (c) neuron 1.

The same analysis was scaled to multiple neurons in the same layer. We provide additional examples, and a dendrogram for neuron 1 showing the result of altering $d_{max}$, in Appendix A.1 (Figs. A.7 and A.8). We note that the majority of the neurons we analysed in layer Mixed 7b showed some amount of polysemanticity. A possible explanation is that the number of features may be very high for a later layer in the model, as it encodes complex concepts.
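The UMAP views of Fig. 2 can be reproduced along the following lines; a minimal sketch, assuming `kept_acts` and `labels` are the retained embeddings and their k-means labels from the pipeline sketch above (hypothetical names).

```python
# 2-D UMAP view of the retained embeddings, coloured by k-means cluster,
# as in Fig. 2. UMAP is only for display; clustering happens in the full space.
import umap
import matplotlib.pyplot as plt

def plot_clusters(kept_acts, labels):
    xy = umap.UMAP(n_components=2, random_state=0).fit_transform(kept_acts)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=12)
    plt.title("Maximally activating images after outlier removal (UMAP)")
    plt.show()
```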
We performed a qualitative and quantitative assessment of the identified concepts. The semanticity of the concept vectors was evaluated first by finding the dataset examples with the largest projections along each vector, analogous to viewing the maximally activating dataset examples of individual neurons. Fig. 3 demonstrates how the two concept vectors found for polysemantic neuron 35 were confirmed to be monosemantic regions in the latent space, cleanly activated by the originally entangled concepts. A further qualitative assessment involved applying the technique of feature visualisation [21] using the Lucent [12] library. Fig. 3 also shows the result of applying this technique to polysemantic neuron 35 and to the concept vectors found with our method. The polysemantic neuron fails to give a human-interpretable representation, whereas the disentangled directions more closely resemble the distinct categories of images that excite this neuron. For instance, the concept pointing toward representations of apples is maximally activated by round shapes with the stem cavity typical of apples, whereas the second concept appears to be maximally activated by large green squares such as football, soccer or baseball pitches.

Figure 3. Disentanglement of the representations for neuron 35. The maximally activating inputs and feature visualisations are shown for the polysemantic neuron (left) and the disentangled concept directions (right).
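A hedged sketch of how such a feature visualisation could be produced with Lucent follows; the torchvision model, the layer name string "Mixed_7b", and the variable `concept_vector` (e.g. one element of the `concept_vectors(...)` output above) are assumptions about the setup, not the authors' exact code.

```python
# Sketch: feature visualisation of a concept vector with Lucent [12].
import torch
import torchvision
from lucent.optvis import render, objectives

model = torchvision.models.inception_v3(weights="IMAGENET1K_V1").eval()

# `concept_vector` is a hypothetical (2048,) unit vector from the pipeline.
v = torch.tensor(concept_vector, dtype=torch.float32)

# Optimise an input whose layer activation points along v ...
obj = objectives.direction("Mixed_7b", v)
imgs = render.render_vis(model, obj, show_image=False)

# ... compared with visualising the raw neuron direction:
imgs_neuron = render.render_vis(model, "Mixed_7b:35", show_image=False)
```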
As has been done in prior concept discovery work [9, 10], we evaluated our results with human experiments assessing the coherency and understandability of the concepts. To avoid any cherry-picking of results, concepts from the first 8 neurons were used for all questions on the form, and a random number generator was used to select images from each concept's maximally projecting images (the evaluation form can be accessed at https://forms.gle/62H6iUiXHdLYs9nr9). The first four questions evaluated the concepts' coherency by asking participants to identify an intruder among four other images that maximally activate another concept vector found from the same neuron or concept discovery starting point. The participants (n = 8, including 3 domain experts) selected the intruder image with an overall accuracy of 100%, showing that the images have a coherent theme. The other six questions were designed to evaluate the understandability of the concepts. Participants were asked to label two concepts and to assess whether they agreed with a given label for four sets of images. Agreement with the given labels was observed 97% of the time; further details are provided in Appendix A.3 for the only label change suggestion (from a domain expert) that occurred. This confirms that the images have a consistent semantic meaning across multiple individuals.

A number of quantitative metrics were used to analyse the images making up the clusters used for computing the concept vectors, as well as the maximally projecting images along concepts. A natural starting point was to check the distribution of Euclidean distances between maximally activating images, as shown in Fig. 4 (a). Fig. 4 (b) shows how the inter-cluster distance (the distance between images of apples and sports-type images) is considerably higher than the intra-cluster distance. Appendix A.2, and particularly Fig. A.9, illustrates the components of the 2048-dimensional concept vectors for these two concepts, which differ considerably across other dimensions.

Figure 4. Euclidean distance between activation pairs, for the top 100 maximally activating images as in step 1 (on the left) and for the 52 remaining images after step 3, ordered by cluster membership (on the right).

When the maximally projecting images along the concept vectors were calculated, our analysis confirmed that the projections and cosine similarities of their corresponding activations with the concept vector are remarkably higher than those with the neuron direction, as exhibited in Fig. 5. As shown by additional examples in Appendix A.1, our results are consistent for other monosemantic and polysemantic neurons.

Figure 5. The left plot shows the projection of the elements of a cluster along the corresponding concept vector and the projection along the neuron direction for neuron 35. The right plot shows the cosine similarities between elements of a cluster with the concept vector and the neuron direction.
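The comparison in Fig. 5 reduces to simple vector algebra; a minimal sketch, assuming `cluster_acts` holds the (m, d) pooled activations of one cluster and `v` is a unit-norm concept vector (hypothetical names).

```python
# Projections and cosine similarities of cluster members with the concept
# vector versus the raw neuron direction (cf. Fig. 5).
import numpy as np

def alignment(cluster_acts: np.ndarray, v: np.ndarray, neuron: int):
    e = np.zeros(cluster_acts.shape[1])
    e[neuron] = 1.0                      # neuron basis direction
    norms = np.linalg.norm(cluster_acts, axis=1)
    proj_concept = cluster_acts @ v      # projection along the concept vector
    proj_neuron = cluster_acts @ e       # projection along the neuron direction
    return (proj_concept, proj_concept / norms), (proj_neuron, proj_neuron / norms)
```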
4. Conclusions and Future Work

Finding meaningful directions in activation space that point to unique patterns, or concepts, is a non-trivial problem encountered in our journey of understanding neural networks. Our results suggest that exploring directions, instead of neurons, may lead us toward coherent fundamental units. We believe this work helps bridge the gap between understanding the fundamental units of models, an important goal of mechanistic interpretability, and concept discovery. We evaluated the coherency and understandability of the raw images whose embeddings have the maximum projections along a concept vector, and we found that the latent space representations have much higher similarities with the discovered concept vectors than with the neuron directions. This work goes in the direction of building interpretability in a human-controlled way, which is important for the field of AI safety and for applications of image models such as medical lesion analysis. We note a limitation of this work is its reliance on the data used to generate the clusters. Furthermore, all experiments were performed on image data, as images are easier to visualise than other data forms. Generalising the method to other data types, such as language and tabular data, is a direction we wish to pursue in future work, as is looking at other starting candidates for concepts besides neurons.

Acknowledgements

This work was supported by AI4Media under the European Union's Horizon 2020 research and innovation programme, grant agreement No 951911. LOM performed this work during an internship facilitated by the AI4Media Junior Fellows Exchange Program. LOM also acknowledges the support of Science Foundation Ireland under Grant Number 18/CRT/6049. For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

References

[1] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375, 2018.
[2] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
[3] David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, and Antonio Torralba. Rewriting a deep generative model. In Computer Vision - ECCV 2020, Part I, pages 351-369. Springer, 2020.
[4] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541-6549, 2017.
[5] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071-30078, 2020.
[6] Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, et al. Interpreting neural networks through the polytope lens. arXiv preprint arXiv:2211.12312, 2022.
[7] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
[8] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
[9] Amirata Ghorbani, James Wexler, James Y. Zou, and Been Kim. Towards automatic concept-based explanations. Advances in Neural Information Processing Systems, 32, 2019.
[10] Mara Graziani, An-phi Nguyen, Laura O'Mahony, Henning Müller, and Vincent Andrearczyk. Concept discovery and dataset exploration with singular value decomposition. In ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML, 2023.
[11] Adam S. Jermyn, Nicholas Schiefer, and Evan Hubinger. Engineering monosemanticity in toy models. arXiv preprint arXiv:2211.09169, 2022.
[12] Lim Swee Kiat. Lucent: Lucid library adapted for PyTorch, 2021.
[13] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668-2677. PMLR, 2018.
[14] Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022.
[15] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[16] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual knowledge in GPT. arXiv preprint arXiv:2202.05262, 2022.
[17] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.
[18] Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86-97, 2012.
[19] Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016.
[20] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024.001, 2020.
[21] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[22] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 30, 2017.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[24] Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.
[25] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[27] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, Part I, pages 818-833. Springer, 2014.
[28] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.

A. Appendix

A.1. Further details of clustering and demonstrations of clustering results

Ward linkage was used as the criterion to determine the merges between clusters in agglomerative clustering. This criterion minimises the variance within clusters by merging the pair of clusters with the minimum between-cluster sum of squared differences. Fig. A.6 depicts how many clusters agglomerative clustering selects for neuron 35. The height measures the distance between clusters, so the heights show the distances at which merges occur. The number of clusters $C$ can be inferred by placing a horizontal line at the desired distance or similarity threshold. For this neuron, the dendrogram shows there will be 2 clusters for thresholds around 16-28 and more for thresholds below 16; at $d_{max} = 15$ we obtain 3 clusters, while a higher threshold would yield 2. In step 3, Ward linkage is again used, so the k-means clustering works similarly to the agglomerative clustering.

Figure A.6. Agglomerative clustering dendrogram for neuron 35.

Fig. A.7a shows the clustering result for polysemantic neuron 5. This neuron activates highest for images of peacocks and vans; applying our method yields two disentangled vectors. Fig. A.7b shows the clustering result for monosemantic neuron 13, which activates most strongly for images with dark backgrounds; applying our method yields one concept vector. An additional example of a polysemantic neuron is shown in Fig. A.7c: neuron 27 activates highest for images of toilet paper rolls and the faces of a specific dog breed. Fig. A.8 depicts the agglomerative clustering dendrogram for neuron 1. Neuron 1 is cut into two clusters for thresholds around 14-17 and one for thresholds above 17. This demonstrates how $d_{max}$ can be used to control how fine-grained the found concepts are.

Figure A.7. UMAP visualisation of embeddings in latent space corresponding to the images kept after k-means clustering and outlier removal for (a) polysemantic neuron 5, (b) monosemantic neuron 13, (c) neuron 1 and (d) neuron 27.

Figure A.8. Agglomerative clustering dendrogram for neuron 1.
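Dendrograms like Figs. A.6 and A.8 can be produced with SciPy's hierarchical clustering utilities; a minimal sketch, assuming `X` holds the pooled activations of the top-activating images for one neuron.

```python
# Ward-linkage dendrogram for the top-activating images of a neuron
# (cf. Figs. A.6 and A.8); cutting it at d_max gives the cluster count C.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

def show_dendrogram(X, d_max=15.0):
    Z = linkage(X, method="ward")        # same merge criterion as step 2
    dendrogram(Z, color_threshold=d_max)
    plt.axhline(d_max, linestyle="--")   # clusters = branches crossing this line
    plt.ylabel("Ward linkage distance")
    plt.show()
```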
A.2. Detail on concept vectors

Fig. A.9 shows the two concept vectors found for neuron 35, representing the concepts of apples and sports-type images. The figure shows that, although both categories highly activate neuron 35, some of the other activation spikes are not common to these two concepts.

Figure A.9. Concept vectors found for neuron 35: (a) concept vector 1, (b) concept vector 2. The x axis represents each dimension in activation space, so each peak is the amount by which the concept vector points along each of the 2048 basis vectors of this particular layer.

A.3. User Evaluation

Fig. A.10 shows examples of questions from our user evaluation test. Question (a) asks whether the concept conveyed by the images is described well by the concept label 'curvy'. One user disagreed with this label and instead proposed the label 'sinusoidal'.

Figure A.10. User evaluation questions assessing the understandability of the semantic meaning of concepts. For (a), the user was asked whether a given label describes the concept well. Participants were asked to label the concept shown in the images in (b). The labels proposed by the users are shown in (c); the font size reflects the frequency of each suggested label.