Paper deep dive
Natural Language Descriptions of Deep Visual Features
Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas
Models: AlexNet, BigGAN, DINO, ResNet-152, Various vision models
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 8:11:15 PM
Summary
The paper introduces MILAN (Mutual-Information-guided Linguistic Annotation of Neurons), a procedure that automatically generates compositional, natural language descriptions for individual neurons in deep neural networks. By maximizing the pointwise mutual information between natural language strings and the image regions that activate a neuron, MILAN provides fine-grained, open-ended annotations that outperform existing fixed-labeling techniques. The authors demonstrate MILAN's utility in model analysis, auditing for sensitive features, and editing to improve robustness against spurious correlations.
Entities (5)
Relation Signals (3)
MILAN → usesdataset → MILANNOTATIONS
confidence 100% · We therefore collect a novel dataset of captions for image regions to train the models underlying MILAN.
MILAN → appliedto → Deep Neural Networks
confidence 98% · We aim to generate natural language descriptions of individual neurons in deep networks.
MILAN → outperforms → NetDissect
confidence 95% · MILAN obtains higher agreement with human annotations on held-out networks than baselines.
Cypher Suggestions (2)
Identify datasets used by specific methods · confidence 95% · unvalidated
MATCH (m:Method)-[:USES_DATASET]->(d:Dataset) RETURN m.name, d.name
Find all methods used for neuron annotation · confidence 90% · unvalidated
MATCH (m:Method)-[:ANNOTATES]->(n:Neuron) RETURN m.name, count(n) as neuron_count
Abstract
Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible? We introduce a procedure (called MILAN, for mutual-information-guided linguistic annotation of neurons) that automatically labels neurons with open-ended, compositional, natural language descriptions. Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models. We highlight three applications of natural language neuron descriptions. First, we use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models. Second, we use MILAN for auditing, surfacing neurons sensitive to human faces in datasets designed to obscure them. Finally, we use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
Tags
Links
- Source: https://arxiv.org/abs/2201.11114
- Canonical: https://arxiv.org/abs/2201.11114
- Code: http://milan.csail.mit.edu/
Full Text
74,307 characters extracted from source content.
Published as a conference paper at ICLR 2022

NATURAL LANGUAGE DESCRIPTIONS OF DEEP VISUAL FEATURES

Evan Hernandez 1, Sarah Schwettmann 1, David Bau 1,2, Teona Bagashvili 3, Antonio Torralba 1, Jacob Andreas 1
1 MIT CSAIL  2 Northeastern University  3 Allegheny College
dez,schwett,teona,torralba,jda@mit.edu  d.bau@northeastern.edu

ABSTRACT

Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible? We introduce a procedure (called MILAN, for mutual-information-guided linguistic annotation of neurons) that automatically labels neurons with open-ended, compositional, natural language descriptions. Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models. We highlight three applications of natural language neuron descriptions. First, we use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models. Second, we use MILAN for auditing, surfacing neurons sensitive to human faces in datasets designed to obscure them.
Finally, we use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.¹

¹ Code, data, and an interactive demonstration may be found at http://milan.csail.mit.edu/.

1 INTRODUCTION

A surprising amount can be learned about the behavior of a deep network by understanding the individual neurons that make it up. Previous studies aimed at visualizing or automatically categorizing neurons have identified a range of interpretable functions across models and application domains: low-level convolutional units in image classifiers implement color detectors and Gabor filters (Erhan et al., 2009), while some later units activate for specific parts and object categories (Zeiler & Fergus, 2014; Bau et al., 2017). Single neurons have also been found to encode sentiment in language data (Radford et al., 2017) and biological function in computational chemistry (Preuer et al., 2019). Given a new model trained to perform a new task, can we automatically catalog these behaviors? Techniques for characterizing the behavior of individual neurons are still quite limited. Approaches based on visualization (Zeiler & Fergus, 2014; Girshick et al., 2014; Karpathy et al., 2015; Mahendran & Vedaldi, 2015; Olah et al., 2017) leave much of the work of interpretation up to human users, and cannot be used for large-scale analysis. Existing automated labeling techniques (Bau et al., 2017; 2019; Mu & Andreas, 2020) require researchers to pre-define a fixed space of candidate neuron labels; they label only a subset of neurons in a given network and cannot be used to surface novel or unexpected behaviors. This paper develops an alternative paradigm for labeling neurons with expressive, compositional, and open-ended annotations in the form of natural language descriptions. We focus on the visual
arXiv:2201.11114v2 [cs.CV] 18 Apr 2022

domain: building on past work on information-theoretic approaches to model interpretability, we formulate neuron labeling as a problem of finding informative descriptions of a neuron's pattern of activation on input images. We describe a procedure (called MILAN, for mutual-information-guided linguistic annotation of neurons) that labels individual neurons with fine-grained natural language descriptions by searching for descriptions that maximize pointwise mutual information with the image regions in which neurons are active. To do so, we first collect a new dataset of fine-grained image annotations (MILANNOTATIONS, Figure 1c), then use these to construct learned approximations to the distributions over image regions (Figure 1b) and descriptions. In some cases, MILAN surfaces neuron descriptions that are more specific than the underlying training data (Figure 1d).

Figure 1: (a) We aim to generate natural language descriptions of individual neurons in deep networks. (b) We first represent each neuron via an exemplar set of input regions that activate it. (c) In parallel, we collect a dataset of fine-grained human descriptions of image regions, and use these to train a model of p(description | exemplars) and p(description). (d) Using these models, we search for a description that has high pointwise mutual information with the exemplars, ultimately generating highly specific neuron annotations.
MILAN is largely model-agnostic and can surface descriptions for different classes of neurons, ranging from convolutional units in CNNs to fully connected units in vision transformers, even when the target network is trained on data that differs systematically from MILANNOTATIONS' images. These descriptions can in turn serve a diverse set of practical goals in model interpretability and dataset design. Our experiments highlight three: using MILAN-generated descriptions to (1) analyze the role and importance of different neuron classes in convolutional image classifiers, (2) audit models for demographically sensitive features by comparing their features when trained on anonymized (blurred) and non-anonymized datasets, and (3) identify and mitigate the effects of spurious correlations with text features, improving classifier performance on adversarially distributed test sets. Taken together, these results show that fine-grained, automatic annotation of deep network models is both possible and practical: rich descriptions produced by automated annotation procedures can surface meaningful and actionable information about model behavior.

2 RELATED WORK

Interpreting deep networks. MILAN builds on a long line of recent approaches aimed at explaining the behavior of deep networks by characterizing the function of individual neurons, either by visualizing the inputs they select for (Zeiler & Fergus, 2014; Girshick et al., 2014; Karpathy et al., 2015; Mahendran & Vedaldi, 2015; Olah et al., 2017) or by automatically categorizing them according to the concepts they recognize (Bau et al., 2017; 2018; Mu & Andreas, 2020; Morcos et al., 2018; Dalvi et al., 2019). Past approaches to automatic neuron labeling require fixed, pre-defined label sets; in computer vision, this has limited exploration to pre-selected object classes, parts, materials, and simple logical combinations of these concepts.
While manual inspection of neurons has revealed that a wider range of features play an important role in visual recognition (e.g. orientation, illumination, and spatial relations; Cammarata et al. 2021), MILAN is the first automated approach that can identify such features at scale. Discrete categorization is also possible for directions in representation space (Kim et al., 2018; Andreas et al., 2017; Schwettmann et al., 2021) and for clusters of images induced by visual representations (Laina et al., 2020); in the latter, an off-the-shelf image captioning model is used to obtain language descriptions of the unifying visual concept for the cluster, although the descriptions miss low-level visual commonalities. As MILAN requires only a primitive procedure for generating model inputs maximally associated with the feature or direction of interest, future work might extend it to these settings as well.

Natural language explanations of decisions. Previous work aimed at explaining computer vision classifiers using natural language has focused on generating explanations for individual classification decisions (e.g., Hendricks et al., 2016; Park et al., 2018; Hendricks et al., 2018; Zellers et al., 2019). Outside of computer vision, several recent papers have proposed procedures for generating natural language explanations of decisions in text classification models (Zaidan & Eisner, 2008; Camburu et al., 2018; Rajani et al., 2019; Narang et al., 2020) and of representations in more general sequence modeling problems (Andreas & Klein, 2017). These approaches require task-specific datasets and often specialized training procedures, and do not assist with interpretability at the model level. To the best of our knowledge, MILAN is the first approach for generating compositional natural language descriptions for interpretability at the level of individual features rather than input-conditional decisions or representations.
More fundamentally, MILAN can do so independently of the model being described, making it (as shown in Section 4) modular, portable, and to a limited extent task-agnostic.

3 APPROACH

Neurons and exemplars. Consider the neuron depicted in Figure 1b, located in a convolutional network trained to classify scenes (Zhou et al., 2017). When the images in Figure 1 are provided as input to the network, the neuron activates in patches of grass near animals, but not in grass without animals nearby. How might we automate the process of generating such a description? While the image regions depicted in Fig. 1b do not completely characterize the neuron's function in the broader network, past work has found that actionable information can be gleaned from such regions alone. Bau et al. (2020; 2019) use them to identify neurons that can trigger class predictions or generative synthesis of specific objects; Andreas & Klein (2017) use them to predict sequence outputs on novel inputs; Olah et al. (2018) and Mu & Andreas (2020) use them to identify adversarial vulnerabilities. Thus, building on this past work, our approach to neuron labeling also begins by representing each neuron via the set of input regions on which its activity exceeds a fixed threshold.

Definition 1. Let f : X → Y be a neural network, and let f_i(x) denote the activation value of the i-th neuron in f given an input x.² Then, an exemplar representation of the neuron f_i is given by:

E_i = { x ∈ X : f_i(x) > η_i }    (1)

for some threshold parameter η_i (discussed in more detail below).

Exemplars and descriptions. Given this explicit representation of f_i's behavior, it remains to construct a description d_i of the neuron. Past work (Bau et al., 2017; Andreas et al., 2017) begins with a fixed inventory of candidate descriptions (e.g. object categories), defines an exemplar set E′_d for each such category (e.g.
via the output of a semantic segmentation procedure), then labels neurons by optimizing d_i := argmin_d δ(E_i, E′_d) for some measure of set distance (e.g. Jaccard, 1912). In this work, we instead adopt a probabilistic approach to neuron labeling. In computer vision applications, each E_i is a set of image patches. Humans are adept at describing such patches (Rashtchian et al., 2010), and one straightforward possibility might be to directly optimize d_i := argmax_d p(d | E_i). In practice, however, the distribution of human descriptions given images may not be well-aligned with the needs of model users. Fig. 2 includes examples of human-generated descriptions for exemplar sets. Many of them (e.g. "text" for AlexNet conv3-252) are accurate, but generic; in reality, the neuron responds specifically to text on screens. The generated description of a neuron should capture the specificity of its function, especially relative to other neurons in the same model. We thus adopt an information-theoretic criterion for selecting descriptions: our final neuron description procedure optimizes pointwise mutual information between descriptions and exemplar sets:

Definition 2. The max-mutual-information description of the neuron f_i is given by:

MILAN(f_i) := argmax_d pmi(d; E_i) = argmax_d [ log p(d | E_i) − log p(d) ]    (2)

To turn Eq. (2) into a practical procedure for annotating neurons, three additional steps are required: constructing a tractable approximation to the exemplar set E_i (Section 3.1), using human-generated image descriptions to model p(d | E) and p(d) (Section 3.2 and Section 3.3), and finding a high-quality description d in the infinite space of natural language strings (Section 3.4).

² In this paper, we will be primarily concerned with neurons in convolutional layers; for each neuron, we will thus take the input space X to be the space of all image patches equal in size to the neuron's receptive field.
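The selection rule in Definition 2 can be illustrated with a small sketch. The log-probabilities below are invented for illustration (in MILAN they come from the trained captioner and language model described in Section 3.2); the point is that PMI prefers the rare, specific description where plain likelihood prefers the generic one:

```python
import math

# Toy log-probabilities for three candidate descriptions of one exemplar set.
# In MILAN these come from the trained captioner p(d|E) and the LSTM language
# model p(d); the numbers here are invented for illustration only.
log_p_d_given_E = {
    "text": math.log(0.30),             # accurate but generic
    "text on screens": math.log(0.20),  # the neuron's actual specialty
    "blue sky": math.log(0.01),
}
log_p_d = {
    "text": math.log(0.10),             # a common description across neurons
    "text on screens": math.log(0.01),  # a rare, specific description
    "blue sky": math.log(0.05),
}

def pmi(d: str) -> float:
    """Pointwise mutual information pmi(d; E_i) = log p(d|E_i) - log p(d)."""
    return log_p_d_given_E[d] - log_p_d[d]

# Plain likelihood picks the generic caption; the PMI objective of Eq. (2)
# picks the more specific one because it is rare under the prior p(d).
best_by_likelihood = max(log_p_d_given_E, key=log_p_d_given_E.get)
best_by_pmi = max(log_p_d_given_E, key=pmi)
```

With these toy numbers, likelihood selects "text" while PMI selects "text on screens", mirroring the AlexNet conv3-252 example above.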
Figure 2: Examples of MILAN descriptions on the generalization tasks described in Section 4. Even highly specific labels (like "the top boundaries of horizontal objects") can be predicted for neurons in new networks. Failure modes include semantic errors, e.g. MILAN misses the cupcakes in the "dog faces and cupcakes" neuron.

3.1 APPROXIMATING THE EXEMPLAR SET

As written, the exemplar set in Equation (1) captures a neuron's behavior on all image patches. This set is large (limited only by the precision used to represent individual pixel values), so we follow past work (Bau et al., 2017) by restricting each E_i to the set of images that cause the greatest activation in the neuron f_i. For convolutional neurons in image processing tasks, sets E_i ultimately comprise k images with activation masks indicating the regions of those images in which f_i fired (Fig. 1a; see Bau et al. 2017 for details). Throughout this paper, we use exemplar sets with k = 15 images and choose η_i equal to the 99th percentile of activations for the neuron f_i.

3.2 MODELING p(d | E) AND p(d)

The term pmi(d; E_i) in Equation (2) can be expressed in terms of two distributions: the probability p(d | E_i) that a human would describe an image region with d, and the probability p(d) that a human would use the description d for any neuron. p(d | E_i) is, roughly speaking, a distribution over image captions (Donahue et al., 2015). Here, however, the input to the model is not a single image but a set of image regions (the masks in Fig. 1a); we seek natural language descriptions of the common features of those regions. We approximate p(d | E_i) with a learned model, specifically the Show-Attend-Tell image description model of Xu et al. (2015) trained on the MILANNOTATIONS dataset described below, with several modifications tailored to our use case. We approximate p(d) with a two-layer LSTM language model (Hochreiter & Schmidhuber, 1997) trained on the text of MILANNOTATIONS.
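The exemplar-set approximation of Section 3.1 (keep the k = 15 most-activating images, mask regions above the activation threshold η_i) can be sketched with NumPy. The activation array here is a hypothetical stand-in for a neuron's real feature maps, not the paper's pipeline:

```python
import numpy as np

def exemplar_set(activations: np.ndarray, k: int = 15, q: float = 0.99):
    """Approximate a neuron's exemplar set, in the spirit of Section 3.1.

    activations: hypothetical array of shape (num_images, H, W) holding this
    neuron's activation map on each image. Returns the indices of the k
    most-activating images together with binary masks marking where the
    activation exceeds the threshold eta_i (the q-quantile of all of the
    neuron's activations).
    """
    eta = np.quantile(activations, q)                 # threshold eta_i
    peak = activations.reshape(len(activations), -1).max(axis=1)
    top_k = np.argsort(peak)[::-1][:k]                # k most-activating images
    masks = activations[top_k] > eta                  # activation masks
    return top_k, masks
```

The returned masks play the role of the highlighted regions in Fig. 1a: they are what the human annotators (and later the captioner) see.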
Details about both models are provided in Appendix B.

3.3 COLLECTING HUMAN ANNOTATIONS

As p(d | E_i) and p(d) are both estimated using learned models, they require training data. In particular, modeling p(d | E_i) requires a dataset of captions that describe regions from multiple different images, such as the ones shown in Fig. 1. These descriptions must describe not only objects and actions, but all other details that individual neurons select for. Existing image captioning datasets, like MSCOCO (Lin et al., 2014) and Conceptual Captions (Sharma et al., 2018), focus only on scene-level details about a single image and do not provide suitable annotations for this task. We therefore collect a novel dataset of captions for image regions to train the models underlying MILAN.

Network   | Arch. | Task   | Datasets            | Annotated            | # Units
AlexNet   | CNN   | Class. | ImageNet, Places365 | conv. 1–5            | 1152, 1376
ResNet152 | CNN   | Class. | ImageNet, Places365 | conv. 1, res. 1–4    | 3904, 3904
BigGAN    | CNN   | Gen.   | ImageNet, Places365 | res. 0–5             | 3744, 4992
DINO      | ViT   | BYOL   | ImageNet            | MLP 1–12 (first 100) | 1200

Table 1: Summary of MILANNOTATIONS, which labels 20k units across 7 models with different network architectures, datasets, and tasks. Each unit is annotated by three human participants.

First, we must obtain a set of image regions to annotate. To ensure that these regions have a similar distribution to the target neurons themselves, we derive them directly from the exemplar sets of neurons in a set of seed models. We obtain the exemplar sets for a subset of the units in each seed model in Table 1 using the method from Section 3.1. We then present each set to a human annotator and ask them to describe what is common to the image regions. Table 1 summarizes the dataset, which we call MILANNOTATIONS. In total, we construct exemplar sets using neurons from seven vision models, totaling 20k neurons.
These models include two architectures for supervised image classification, AlexNet (Krizhevsky et al., 2012) and ResNet152 (He et al., 2015); one architecture for image generation, BigGAN (Brock et al., 2018); and one for unsupervised representation learning trained with a "Bootstrap Your Own Latent" (BYOL) objective (Chen & He, 2020; Grill et al., 2020), DINO (Caron et al., 2021). These models cover two datasets, specifically ImageNet (Deng et al., 2009) and Places365 (Zhou et al., 2017), as well as two completely different families of models, CNNs and Vision Transformers (ViT) (Dosovitskiy et al., 2021). Each exemplar set is shown to three distinct human participants, resulting in 60k total annotations. Examples are provided in Appendix A (Fig. 10). We recruit participants from Amazon Mechanical Turk. This data collection effort was approved by MIT's Committee on the Use of Humans as Experimental Subjects. To control for quality, workers were required to have a HIT acceptance rate of at least 95%, have at least 100 approved HITs, and pass a short qualification test. Full details about our data collection process and the collected data can be found in Appendix A.

3.4 SEARCHING IN THE SPACE OF DESCRIPTIONS

Directly decoding descriptions from pmi(d; E_i) tends to generate disfluent descriptions. This is because the p(d) term inherently discourages common function words like "the" from appearing in descriptions. Past work on language generation (Wang et al., 2020) has found that this can be remedied by first introducing a hyperparameter λ to modulate the importance of p(d) when computing PMI, giving a new weighted PMI objective:

wpmi(d) = log p(d | E_i) − λ log p(d)    (3)

Next, search is restricted to a set of captions that have high probability under p(d | E_i), which are reranked according to Eq. (3). Specifically, we run beam search on p(d | E_i), and use the full beam after the final search step as a set of candidate descriptions. For all experiments, we set λ = 0.2 and beam size to 50.
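The reranking step can be sketched in a few lines, assuming beam search has already produced candidate descriptions with their two log-probabilities (the triples below are invented for illustration):

```python
import math

def rerank_wpmi(candidates, lam=0.2):
    """Rerank beam-search candidates by weighted PMI, as in Eq. (3):
    wpmi(d) = log p(d|E_i) - lam * log p(d). Higher is better."""
    def wpmi(item):
        _description, log_cond, log_prior = item
        return log_cond - lam * log_prior
    return sorted(candidates, key=wpmi, reverse=True)

# Invented candidates: (description, log p(d|E_i), log p(d)).
beam = [
    ("text", math.log(0.30), math.log(0.10)),
    ("text on screens", math.log(0.20), math.log(0.01)),
    ("blue sky", math.log(0.05), math.log(0.05)),
]
ranked = rerank_wpmi(beam)  # with lam = 0.2, the specific caption wins
```

Setting lam to 0 recovers plain likelihood ranking; lam near 1 recovers unweighted PMI, which over-penalizes common function words.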
4 DOES MILAN GENERALIZE?

Because it is trained on a set of human-annotated exemplar sets obtained from a set of seed networks, MILAN is useful as an automated procedure only if it generalizes and correctly describes neurons in trained models with new architectures, new datasets, and new training objectives. Thus, before describing applications of MILAN to specific interpretability problems, we perform cross-validation experiments within the MILANNOTATIONS data to validate that MILAN can reliably label new neurons.

Model            | CE  | ND  | p(d|E) | pmi(d;E)
AlexNet-ImageNet | .01 | .24 | .34    | .38
AlexNet-Places   | .02 | .21 | .31    | .37
ResNet-ImageNet  | .01 | .25 | .27    | .35
ResNet-Places    | .03 | .22 | .30    | .31

Table 2: BERTScores for neuron labeling methods relative to human annotations. MILAN obtains higher agreement than Compositional Explanations (CE) or NetDissect (ND).

Generalization  | Train → Test         | BERTScore (f)
within network  | AlexNet–ImageNet     | .39
within network  | AlexNet–Places       | .47
within network  | ResNet152–ImageNet   | .35
within network  | ResNet152–Places     | .28
within network  | BigGAN–ImageNet      | .49
within network  | BigGAN–Places        | .52
across arch.    | AlexNet → ResNet152  | .28
across arch.    | ResNet152 → AlexNet  | .35
across arch.    | CNNs → ViT           | .34
across datasets | ImageNet → Places    | .30
across datasets | Places → ImageNet    | .33
across tasks    | Classifiers → BigGAN | .34
across tasks    | BigGAN → Classifiers | .27

Table 3: BERTScores on held-out neurons relative to the human annotations. Each train/test split evaluates a different kind of generalization, ultimately evaluating how well MILAN generalizes to networks with architectures, datasets, and tasks unseen in the training annotations.

We additionally verify that MILAN provides benefits over other neuron annotation techniques by comparing its descriptions to three baselines: NetDissect (Bau et al., 2017), which assigns a single concept label to each neuron by comparing the neuron's exemplars to semantic segmentations of the same images; Compositional
Explanations (Mu & Andreas, 2020), which follows a similar procedure to generate logical concept labels; and ordinary image captioning (selecting descriptions using p(d | E) instead of pmi(d; E)).

Method. In each experiment, we train MILAN on a subset of MILANNOTATIONS and evaluate its performance on a held-out subset. To compare MILAN to the baselines, we train on all data except a single held-out network; we obtain the baseline labels by running the publicly available code with the default settings on the held-out network. To test generalization within a network, we train on 90% of neurons from each network and test on the remaining 10%. To test generalization across architectures, we train on all AlexNet (ResNet) neurons and test on all ResNet (AlexNet) neurons; we also train on all CNN neurons and test on ViT neurons. To test generalization across datasets, we train on all neurons from models trained on ImageNet (Places) and test on neurons from models for the other dataset. To test generalization across tasks, we train on all classifier neurons (GAN neurons) and test on all GAN neurons (classifier neurons). We measure performance via BERTScore (Zhang et al., 2020) relative to the human annotations. Hyperparameters for each of these experiments are in Appendix C.

Results. Table 2 shows results for MILAN and all three baselines applied to four different networks. MILAN obtains higher agreement with human annotations on held-out networks than baselines. It is able to surface highly specific behaviors in its descriptions, like the "splashes of water" neuron shown in Figure 2 ("splashes" has no clear equivalent in the concept sets used by NetDissect (ND) or Compositional Explanations (CE)). MILAN also outperforms the ablated p(d | E) decoder, justifying the choice of pmi as an objective for obtaining specific and high-quality descriptions.
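The held-out splits in this method can be illustrated over a small table of the seed models from Table 1. The metadata encoding below is my own illustration, not the paper's code or data format:

```python
# Hypothetical metadata for the seed networks of Table 1; each entry stands
# for all annotated neurons of one (model, dataset) pair.
NEURON_SOURCES = [
    {"arch": "AlexNet",   "family": "CNN", "dataset": "ImageNet",  "task": "classifier"},
    {"arch": "AlexNet",   "family": "CNN", "dataset": "Places365", "task": "classifier"},
    {"arch": "ResNet152", "family": "CNN", "dataset": "ImageNet",  "task": "classifier"},
    {"arch": "ResNet152", "family": "CNN", "dataset": "Places365", "task": "classifier"},
    {"arch": "BigGAN",    "family": "CNN", "dataset": "ImageNet",  "task": "GAN"},
    {"arch": "BigGAN",    "family": "CNN", "dataset": "Places365", "task": "GAN"},
    {"arch": "DINO",      "family": "ViT", "dataset": "ImageNet",  "task": "BYOL"},
]

def holdout_split(key, held_out_value):
    """Train on every source whose `key` differs from the held-out value;
    test on the sources that match it (one generalization experiment)."""
    train = [s for s in NEURON_SOURCES if s[key] != held_out_value]
    test = [s for s in NEURON_SOURCES if s[key] == held_out_value]
    return train, test

# e.g. the CNN-to-ViT split: train on all CNN neurons, test on ViT neurons.
cnn_train, vit_test = holdout_split("family", "ViT")
```

Each row of Table 3 corresponds to one such split, with the captioner and language model retrained on the training side only.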
Table 3 shows that MILAN exhibits different degrees of generalization across models, with generalization to new GAN neurons in the same network easiest and GAN-to-classifier generalization hardest. MILAN can generalize to novel architectures. It correctly labels ViT neurons (in fully connected layers) as often as it correctly labels other convolutional units (e.g., in AlexNet). We observe that transferability across tasks is asymmetric: agreement scores are higher when transferring from classifier neurons to GAN neurons than the reverse. Finally, Figure 3 presents some of MILAN's failure cases: when faced with new visual concepts, MILAN sometimes mislabels the concept (e.g., by calling brass instruments "noodle dishes"), prefers a vague description (e.g., "similar color patterns"), or ignores the highlighted regions and describes the context instead. We emphasize that this section is primarily intended as a sanity check of the learned models underlying MILAN, and not as direct evidence of its usefulness or reliability as a tool for interpretability. We follow Vaughan & Wallach (2020) in arguing that the final test of any such tool must be its ability to produce actionable insights for human users, as in the three applications described below.

Figure 3: Examples of MILAN failures. Failure modes include incorrect generalization (top), vague descriptions for concepts not seen in the training set (middle), and mistaking the context for the highlighted regions (bottom).

3 It may seem surprising that ND outperforms CE, even though ND can only output one-word labels. One reason is that ND obtains image segmentations from multiple segmentation models, which support a large vocabulary of concepts. By contrast, CE uses a fixed dataset of segmentations and has a smaller base vocabulary. CE also tends to generate complex formulas (with up to two logical connectives), which lowers its precision.
5 ANALYZING FEATURE IMPORTANCE

The previous section shows that MILAN can generalize to new architectures, datasets, and tasks. The remainder of this paper focuses on applications that use generated labels to understand how neurons influence model behavior. As a first example: descriptions in Figure 2 reveal that neurons have different degrees of specificity. Some neurons detect objects with spatial constraints ("the area on top of the line"), while others fire for low-level but highly specific perceptual qualities ("long, thin objects"). Still others detect perceptually similar but fundamentally different objects ("dog faces and cupcakes"). How important are these different classes of neurons to model behavior?

Method. We use MILAN trained on all convolutional units in MILANNOTATIONS to annotate every neuron in ResNet18-ImageNet. We then score each neuron according to one of seven criteria that capture different syntactic or structural properties of the caption. Four syntactic criteria each count the number of times that a specific part of speech appears in a caption: nouns, verbs, prepositions, and adjectives. Three structural criteria measure properties of the entire caption: its length, the depth of its parse tree (a rough measure of its compositional complexity, obtained from the spaCy parser of Honnibal et al. 2020), and its maximum word difference (a measure of the semantic coherence of the description, measured as the maximum Euclidean distance between any two caption words, again obtained via spaCy). Finally, neurons are incrementally ablated in order of their score. The network is tested on the ImageNet validation set and its accuracy recorded. This procedure is then repeated, deleting 2% of neurons at each step. We also include five trials in which neurons are ordered randomly. Further details and examples of ablated neurons are provided in Appendix D.

Results. Figure 4 plots accuracy on the ImageNet validation set as a function of the number of ablated neurons.
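The ablation protocol (score each neuron by a property of its caption, sort, delete 2% per step) can be sketched as follows. The length-based score here is a crude stand-in for the spaCy part-of-speech and parse-depth criteria the paper actually uses, and the captions are hypothetical:

```python
def ablation_order(captions, score):
    """Order neuron indices by a property of their generated description,
    highest score first (the ordering used for incremental ablation)."""
    return sorted(range(len(captions)), key=lambda i: score(captions[i]), reverse=True)

def ablation_schedule(order, step_frac=0.02):
    """Yield cumulative sets of neurons to delete, step_frac of the
    network per step; accuracy is re-measured after each step."""
    step = max(1, int(len(order) * step_frac))
    for end in range(step, len(order) + 1, step):
        yield order[:end]

# Hypothetical captions; caption length stands in for the syntactic criteria.
captions = ["long thin objects", "dog faces and cupcakes", "grass"]
order = ablation_order(captions, score=lambda d: len(d.split()))
```

In the real experiment, each cumulative set is zeroed out in ResNet18 and validation accuracy is recorded, producing one curve per criterion in Figure 4.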
Linguistic features of neuron descriptions highlight several important differences between neurons. First, neurons captioned with many adjectives or prepositions (that is, neurons that capture attributes and relational features) are relatively important to model behavior. Ablating these neurons causes a rapid decline in performance compared to ablating random neurons or nouns. Second, neurons that detect dissimilar concepts appear to be less important. When the caption contains highly dissimilar words (max word diff.), ablation hurts performance substantially less than ablating random neurons. Such neurons sometimes detect non-semantic compositions of concepts like the "dog faces and cupcakes" neuron shown in Fig. 2; Mu & Andreas (2020) find that these units contribute to non-robust model behavior.

Figure 4: ResNet18 accuracy on the ImageNet validation set as units are ablated (left, middle), and distribution of neurons matching syntactic and structural criteria in each layer (right). In each configuration, neurons are scored according to a property of their generated description (e.g., number of nouns/words in description, etc.), sorted based on their score, and ablated in that order. Neurons described with adjectives appear crucial for good performance, while neurons described with very different words (measured by word embedding difference; max word diff.) appear less important for good performance. Adjective-selective neurons are most prevalent in early layers, while neurons with large semantic differences are more prevalent in late ones.
We reproduce their robustness experiments using these neurons in Section 5 (Figure 14) and reach similar conclusions. Finally, Figure 4 highlights that neurons satisfying each criterion are not evenly distributed across layers: for example, middle layers contain the largest fraction of relation-selective neurons measured via prepositions.

6 AUDITING ANONYMIZED MODELS

One recent line of work in computer vision aims to construct privacy-aware datasets, e.g. by detecting and blurring all faces to avoid leakage of information about specific individuals into trained models (Yang et al., 2021). But to what extent does this form of anonymization actually reduce models' reliance on images of humans? We wish to understand whether models trained on blurred data still construct features that can detect human faces, or even specific categories of faces. A core function of tools for interpretable machine learning is to enable auditing of trained models for such behavior; here, we apply MILAN to investigate the effect of blurring-based dataset privacy.

Figure 5: Change in # of face neurons found by MILAN (each pair of points is one model architecture). Blurring reduces, but does not eliminate, units selective for unblurred faces.

Method. We use MILAN to caption a subset of convolutional units in 12 different models pretrained for image classification on the blurred ImageNet images (blurred models). These models are distributed by the original authors of the blurred ImageNet dataset (Yang et al., 2021). We caption the same units in models pretrained on regular ImageNet (unblurred models) obtained from torchvision (Paszke et al., 2019). We then manually inspect all neurons in the blurred and unblurred models for which MILAN descriptions contain the words face, head, nose, eyes, and mouth (using exemplar sets containing only unblurred images).
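The keyword filter in this Method can be sketched as a simple substring search over MILAN's generated descriptions. This is a sketch, not the paper's code: substring matching also catches plurals like "faces", and can over-match, which is one reason the text pairs it with manual inspection of the flagged neurons.

```python
# Keywords follow the text of Section 6.
FACE_WORDS = ("face", "head", "nose", "eyes", "mouth")

def face_selective(descriptions):
    """descriptions: {neuron_id: caption string}.
    Returns the subset whose caption mentions a face-related word."""
    return {neuron: caption for neuron, caption in descriptions.items()
            if any(word in caption.lower() for word in FACE_WORDS)}
```

Counting the returned neurons in blurred vs. unblurred models gives the per-architecture comparison shown in Figure 5.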
Figure 6: (a) The blurred ImageNet dataset. (b–c) Exemplar sets and labels for two neurons in a blurred model that activate on unblurred faces, and appear to preferentially (but not exclusively) respond to faces in specific demographic categories. (Panels b and c are labeled "Faces of people" and "Human faces".)

Results. Across models trained on ordinary ImageNet, MILAN identified 213 neurons selective for human faces. Across models trained on blurred ImageNet, MILAN identified 142 neurons selective for human faces. MILAN can distinguish between models trained on blurred and unblurred data (Fig. 5). However, it also reveals that models trained on blurred data acquire neurons selective for unblurred faces. Indeed, it is possible to use MILAN's labels to extract these face-selective neurons directly. Doing so reveals that several of them are not simply face detectors, but appear to selectively identify female faces (Fig. 6b) and Asian faces (Fig. 6c).
Blurring does not prevent models from extracting highly specific features for these attributes. Our results in this section highlight the use of MILAN for both quantitative and qualitative, human-in-the-loop auditing of model behavior.

7 EDITING SPURIOUS FEATURES

Spurious correlations between features and labels are a persistent problem in machine learning applications, especially in the presence of mismatches between training and testing data (Storkey, 2009). In object recognition, one frequent example is correlation between backgrounds and objects (e.g. cows are more likely to appear with green grass in the background, while fish are more likely to appear with a blue background; Xiao et al. 2020). In a more recent example, models trained on joint text and image data are subject to "text-based adversarial attacks", in which e.g. an apple with the word iPod written on it is classified as an iPod (Goh et al., 2021). Our final experiment shows that MILAN can be used to reduce models' sensitivity to these spurious features.

Figure 7: Network editing. (a) We train an image classifier on a synthetic dataset in which half the images include the class label written in text in the corner. (b) We evaluate the classifier on an adversarial test set, in which every image has a random textual label. (c) Nearly a third of neurons in the trained model detect text, hurting its performance on the test set. (Panel c shows text neuron layer3-134, labeled "words and letters".)

Data. We create a controlled dataset imitating Goh et al. (2021)'s spurious text features. The dataset consists of 10 ImageNet classes. In the training split, there are 1000 images per class; 500 are annotated with (correct) text labels in the top-left corner. The test set contains 100 images per class (from the ImageNet validation set); in all these images, a random (usually incorrect) text label is included.
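The construction of the training and test splits can be sketched as follows. Only the label-assignment logic is shown; a hypothetical `stamp_text` helper (e.g. built on PIL) would then render the chosen text in the image corner. All names here are illustrative, not the paper's.

```python
import random

def make_split(images_by_class, train, rng):
    """Build (image, stamped_text, class) records.

    Training split: half the images per class carry the correct class
    label as text (text=None means no text is stamped).
    Test split: every image carries a randomly chosen, usually
    incorrect, class label.
    """
    classes = sorted(images_by_class)
    records = []
    for cls in classes:
        for i, img in enumerate(images_by_class[cls]):
            if train:
                text = cls if i % 2 == 0 else None  # half get correct text
            else:
                text = rng.choice(classes)          # random, usually wrong
            records.append((img, text, cls))
    return records
```

Training on the resulting split encourages the classifier to rely on the stamped text, which the adversarial test split then exploits.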
We train and evaluate a fresh ResNet18 model on this dataset, holding out 10% of the training data as a validation dataset for early stopping. Training details can be found in Appendix E.

Method. We use MILAN to obtain descriptions of every residual neuron in the model as well as the first convolutional layer. We identify all neurons whose description contains text, word, or letter. To identify spurious neurons, we first assign each text neuron an independent importance score by removing it from the network and measuring the resulting drop in validation accuracy (with non-adversarial images). We then sort neurons by importance score (with the least important first), and successively ablate them from the model.

Figure 8: ResNet18 accuracy on the adversarial test set as neurons are incrementally ablated. Neurons are sorted by the model's validation accuracy when that single neuron is ablated, then ablated in that order. When ablating neurons that select for the spurious text, accuracy improves by 4.9 points. When zeroing arbitrary neurons, accuracy still improves, but by much less. (Legend: sort text, sort all, initial performance, no text distractors.)

Results. The result of this procedure on adversarial test accuracy is shown in Fig. 8. Training on the spurious data substantially reduces ResNet18's performance on the adversarial test set: the model achieves 58.8% accuracy, as opposed to 69.9% when tested on non-spurious data. MILAN identifies 300 text-related convolutional units (out of 1024 examined) in the model, confirming that the model has indeed devoted substantial capacity to identifying text labels in the image. Figure 7c shows example neurons specifically selective for airline and truck text. By deleting only 13 such neurons, test accuracy improves by 4.9% (a 12% reduction in overall error rate).
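The importance-scoring and sorted-ablation procedure in this Method can be sketched as follows, with `val_acc` standing in for a real (non-adversarial) validation pass; the function names are illustrative.

```python
def edit_text_neurons(text_neurons, val_acc):
    """Score each text neuron by the validation-accuracy drop when it
    alone is ablated, then ablate cumulatively, least important first.

    text_neurons: iterable of neuron ids flagged by MILAN's descriptions.
    val_acc:      callable mapping a frozenset of ablated ids -> accuracy.
    Returns the ablation order and the (num_ablated, accuracy) curve.
    """
    base = val_acc(frozenset())
    importance = {n: base - val_acc(frozenset([n])) for n in text_neurons}
    order = sorted(text_neurons, key=importance.get)  # least important first
    curve, ablated = [], set()
    for n in order:
        ablated.add(n)
        curve.append((len(ablated), val_acc(frozenset(ablated))))
    return order, curve
```

A stopping point for the cumulative ablation (how many neurons to actually delete) still has to be chosen; the paper discusses stopping criteria in Appendix E.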
This increase cannot be explained by the sorting procedure described above: if instead we sort all neurons according to validation accuracy (orange line), accuracy improves by less than 1%. Thus, while this experiment does not completely eliminate the model's reliance on text features, it shows that MILAN's predictions enable direct editing of networks to partially mitigate sensitivity to spurious feature correlations. (Stopping criteria are discussed further in Appendix E; if no adversarial data is used to determine the number of neurons to prune, an improvement of 3.1% is still achievable.)

8 CONCLUSIONS

We have presented MILAN, an approach for automatically labeling neurons with natural language descriptions of their behavior. MILAN selects these descriptions by maximizing pointwise mutual information with image regions in which each neuron is active. These mutual information estimates are in turn produced by a pair of learned models trained on MILANNOTATIONS, a dataset of fine-grained image annotations released with this paper. Descriptions generated by MILAN surface diverse aspects of model behavior, and can serve as a foundation for numerous analysis, auditing, and editing workflows for users of deep network models.

IMPACT STATEMENT

In contrast to most past work on neuron labeling, MILAN generates neuron labels using another black-box learned model trained on human annotations of visual concepts. With this increase in expressive power come a number of potential limitations: exemplar-based explanations have known shortcomings (Bolukbasi et al., 2021), human annotations of exemplar sets may be noisy, and the captioning model may itself behave in unexpected ways far outside the training domain.
The MILANNOTATIONS dataset was collected with annotator tests to address potential data quality issues, and our evaluation in Section 4 characterizes prediction quality on new networks; we nevertheless emphasize that these descriptions are partial and potentially noisy characterizations of neuron function via their behavior on a fixed-size set of representative inputs. MILAN complements, rather than replaces, both formal verification (Dathathri et al., 2020) and careful review of predictions and datasets by expert humans (Gebru et al., 2018; Mitchell et al., 2019).

ACKNOWLEDGMENTS

We thank Ekin Akyürek and Tianxing He for helpful feedback on early drafts of the paper. We also thank IBM for the donation of the Satori supercomputer that enabled training BigGAN on MIT Places. This work was partially supported by the MIT-IBM Watson AI Lab, the SystemsThatLearn initiative at MIT, a Sony Faculty Innovation Award, DARPA SAIL-ON HR0011-20-C-0022, and a hardware gift from NVIDIA under the NVAIL grant program.

REFERENCES

Jacob Andreas and Dan Klein. Analogs of linguistic structure in deep representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2893–2897, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1311. URL https://www.aclweb.org/anthology/D17-1311.

Jacob Andreas, Anca D Dragan, and Dan Klein. Translating neuralese. In ACL (1), 2017.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, January 2015.

Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations, 2018.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations.
In Computer Vision and Pattern Recognition (CVPR), 2017.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B Tenenbaum, William T Freeman, and Antonio Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. In International Conference on Learning Representations (ICLR), 2019.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences (PNAS), 2020.

Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An interpretability illusion for BERT. arXiv preprint arXiv:2104.07143, 2021.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. e-SNLI: Natural language inference with natural language explanations. arXiv preprint arXiv:1812.01193, 2018.

Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 6(1):e00024–006, 2021.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning, 2020.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James Glass. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In Proceedings of AAAI, 2019.
Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow, Percy Liang, et al. Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. In Neural Information Processing Systems (NeurIPS), 2020.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. 2009.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pp. 580–587, 2014.

Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H.
Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In European Conference on Computer Vision, pp. 3–19. Springer, 2016.

Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. Grounding visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 264–279, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. In Neural Computation, 1997.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength Natural Language Processing in Python, 2020. URL https://doi.org/10.5281/zenodo.1212303.

Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning (ICML), 2018.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.

Iro Laina, Ruth C. Fong, and Andrea Vedaldi. Quantifying learnability and describability of visual concepts emerging in representation learning. Advances in Neural Information Processing Systems, 2020.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (eds.), Computer Vision – ECCV 2014, pp. 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Computer Vision and Pattern Recognition (CVPR), 2015.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, 2019.

Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. In International Conference on Learning Representations (ICLR), 2018.

Jesse Mu and Jacob Andreas. Compositional explanations of neurons. In Advances in Neural Information Processing Systems, 2020.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546, 2020.

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert.
Feature visualization. In Distill, 2017.

Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. In Distill, 2018.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8779–8788, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Kristina Preuer, Günter Klambauer, Friedrich Rippmann, Sepp Hochreiter, and Thomas Unterthiner. Interpretable deep learning in drug discovery. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 331–345. Springer, 2019.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself!
Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 139–147, 2010.

Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Klein, Jacob Andreas, and Antonio Torralba. Toward a visual concept vocabulary for GAN latent space. International Conference on Computer Vision, 2021.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset Shift in Machine Learning, 30:3–28, 2009.

Jennifer Wortman Vaughan and Hanna Wallach. A human-centered agenda for intelligible machine learning. Machines We Trust: Getting Along with Artificial Intelligence, 2020.

Zeyu Wang, Berthy Feng, Karthik Narasimhan, and Olga Russakovsky. Towards unique and informative captioning of images. In European Conference on Computer Vision (ECCV), 2020.

Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994, 2020.

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pp. 2048–2057. JMLR.org, 2015.
Kaiyu Yang, Jacqueline Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky. A study of face obfuscation in ImageNet. arXiv preprint arXiv:2103.06191, 2021.

Omar Zaidan and Jason Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 31–40, 2008.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. ECCV, 2014.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6720–6731, 2019.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.

A MILANNOTATIONS

Figure 9: Screenshots of the Amazon Mechanical Turk forms we used to collect the MILANNOTATIONS dataset. (a) The qualification test. Workers are asked to pick the best description for two hand-chosen neurons from a model not included in our corpus. (b) The annotation form. Workers are shown the top-15 highest-activating images for a neuron and asked to describe what is common to them in one sentence.

We recruited annotators from Amazon Mechanical Turk to describe one neuron at a time given its top-activating images.
A screenshot of the template is shown in Figure 9b. Participants were given the instructions:

Instructions: In one sentence, summarize everything shown inside the highlighted regions in the images. They might all show the same thing, or they might show several different things. In your answer, DO NOT mention that you are describing highlighted regions in images.

Workers were given up to an hour to complete each annotation, but early trials revealed they required about 30 seconds per HIT. We paid workers $0.08 per annotation, which at $9.60 per hour exceeds the United States federal minimum wage.

Figure 10: Example human annotations for neuron exemplars in MILANNOTATIONS, which contains annotations for neurons in seven networks. Each set of images is annotated by three distinct human participants.

Table 4: Average inter-annotator agreement among human annotations, measured in BERTScore. Some models have clearer neuron exemplars than others.

Model      Dataset    IAA
AlexNet    ImageNet   .25
           Places365  .27
ResNet152  ImageNet   .21
           Places365  .17
BigGAN     ImageNet   .26
           Places365  .24
DINO       ImageNet   .23

To control for quality, we required workers to pass a short qualification test in which they had to choose the most descriptive caption for two manually chosen neurons from VGG-16 (Simonyan & Zisserman, 2015) trained on ImageNet (not included as part of MILANNOTATIONS). A screenshot of this test is shown in Figure 9a. Table 4 shows the inter-annotator agreement of neuron annotations for each model, and Table 5 shows some corpus statistics broken down by model and layer. Layers closest to the image (early layers in CNNs and later layers in GANs) are generally described with more adjectives than other layers, while annotations for layers farther from the image include more nouns, perhaps highlighting the low-level perceptual role of the former and the scene- and object-centric behavior of the latter.
Layers farther from the image tend to have longer descriptions (e.g. in BigGAN-ImageNet, AlexNet-ImageNet), but this trend is not consistent across all models (e.g. in models trained on Places365, the middle layers have the longest average caption length).

B MILAN IMPLEMENTATION DETAILS

B.1 IMPLEMENTING p(d|E)

We build on the Show, Attend, and Tell (SAT) model for describing images (Xu et al., 2015). SAT is designed for describing the high-level content of a single image, so we must make several modifications to support our use case, where our goal is to describe sets of regions in images.

Model                 Layer    # Units  # Words  Len.  % Noun  % Adj  % Prep
AlexNet-ImageNet      conv1    64       185      4.8   37.5    24.3   12.2
                      conv2    192      384      5.5   37.8    19.4   13.2
                      conv3    384      661      5.3   41.0    16.4   13.0
                      conv4    256      608      5.5   43.1    11.9   12.5
                      conv5    256      693      5.5   46.0    10.2   10.4
AlexNet-Places365     conv1    96       153      4.3   38.4    26.8   12.7
                      conv2    256      297      4.8   37.8    26.0   12.7
                      conv3    384      412      4.7   40.2    24.8   10.5
                      conv4    384      483      4.4   43.7    19.9   10.3
                      conv5    256      486      4.1   45.8    17.6   10.6
ResNet152-ImageNet    conv1    64       285      4.7   43.8    11.8   10.3
                      layer1   256      653      5.5   43.1    10.5   12.5
                      layer2   512      936      5.1   44.0    12.7   12.6
                      layer3   1024     1222     4.2   49.6    10.9   11.3
                      layer4   2048     1728     4.6   47.8    8.6    7.8
ResNet152-Places365   conv1    64       283      5.2   47.3    11.1   14.6
                      layer1   256      633      5.3   46.3    9.4    13.3
                      layer2   512      986      5.8   46.0    8.3    13.8
                      layer3   1024     1389     4.8   48.2    6.7    12.7
                      layer4   2048     1970     5.3   46.3    5.5    11.9
BigGAN-ImageNet       layer0   1536     1147     3.9   52.4    7.8    8.2
                      layer1   768      853      3.5   53.0    9.4    8.9
                      layer2   768      618      3.2   52.6    12.3   9.5
                      layer3   384      495      3.7   49.9    14.3   10.9
                      layer4   192      269      3.3   47.9    18.0   13.4
                      layer5   96       69       2.6   53.6    22.8   14.6
BigGAN-Places365      layer0   2048     1062     4.2   53.3    5.4    8.3
                      layer1   1024     708      3.9   55.0    6.1    11.5
                      layer2   1024     410      4.6   52.7    8.1    16.3
                      layer3   512      273      5.2   50.4    7.6    15.0
                      layer4   256      192      4.6   47.5    9.3    14.9
                      layer5   128      123      4.2   46.7    13.5   13.0
DINO-ImageNet         layer0   100      320      4.4   45.7    12.7   4.8
                      layer1   100      321      4.2   49.8    9.1    6.8
                      layer2   100      285      3.9   53.3    6.2    7.5
                      layer3   100      312      3.9   54.4    6.2    7.1
                      layer4   100      304      3.9   53.5    4.4    7.0
                      layer5   100      287      3.5   55.1    5.5    5.2
                      layer6   100      377      3.9   51.3    8.2    5.4
                      layer7   100      374      3.8   52.0    6.4    6.2
                      layer8   100      330      3.4   53.0    7.0    8.8
                      layer9   100      350      3.1   56.1    6.3    9.6
                      layer10  100      369      3.9   50.3    9.3    8.2
                      layer11  100      294      3.3   52.4    7.5    9.4
Total                        20272    4597     4.5   48.7    9.4    10.9

Table 5: Corpus statistics for MILANNOTATIONS descriptions broken down by model and layer. The # Words column reports the number of unique words used across all layer annotations, the Len. column reports the average number of words in each caption for that layer, and the % columns report the percentage of all words across all captions for that layer that are a specific part of speech.

Figure 11: Neuron captioning model. Given the set of top-activating images for a neuron and masks for the regions of greatest activation, we extract feature maps from each convolutional layer of a pretrained image classifier. We then downsample the masks and use them to pool the features before concatenating them into a single feature vector per image. These feature vectors are used as input to the decoder attention mechanism.

In the original SAT architecture, a single input image x is first converted to visual features by passing it through an encoder network g, typically an image classifier pretrained on a large dataset. The output of the last convolutional layer is extracted as a matrix of visual features:

    v = [v_1; v_2; ...; v_k]

These visual features are passed to a decoder LSTM whose hidden state is initialized as a function of the mean of the visual features, v̄ = (1/k) Σ_i v_i. At each time step, the decoder attends over the features using an additive attention mechanism (Bahdanau et al., 2015), then consumes the attention-weighted visual features and previous token as input to predict the next token.

The SAT architecture makes few assumptions about the structure of the visual features. We take advantage of this generality and modify how v is constructed to support our task, leaving the decoder architecture intact. Now, instead of a single image x, the model inputs are the k top-activating images x_j for a neuron, as well as a mask m_j for each image that highlights the regions of greatest activation.
Our task is to describe what the neuron is detecting, based strictly on the highlighted regions of the x_j. In support of this, the visual features must (1) include information about all k images, (2) encode multiple resolutions of the images to capture both low-level perceptual and high-level scene details about the image, and (3) pay most (but not exclusive) attention to the regions of greatest activation in the image.

Describing sets of images.  The k features in SAT correspond to different spatial localities of a single image. In our architecture, each feature v_j corresponds to one input image x_j.

Encoding multiple resolutions.  Instead of encoding the image with just the last convolutional layer of g, we use pooled convolutional features from every layer. Formally, let g_ℓ(x) denote the output of layer ℓ in the pretrained image encoder with L layers, and let pool denote a pooling function that uses the mask to pool the features (described further below). The feature vector for the jth image x_j is:

    v_j = [pool(m_j, g_1(x_j)); ...; pool(m_j, g_L(x_j))]

Highlighting regions of greatest activation.  Each of the top-activating images x_j that we hand to our model comes with a mask m_j highlighting the image regions of greatest activation. We incorporate these masks into the pooling function pool from above. Specifically, we first downsample the mask m_j to the same spatial shape as g_ℓ(x_j) using bilinear interpolation, which we denote downsample(m_j). We then apply the mask to each channel c at layer ℓ, written g_{ℓ,c}(x_j), via element-wise multiplication (⊙) with downsample(m_j). Finally, we sum spatially along each channel, so that pooling layer ℓ yields a length-C_ℓ vector. Formally:

    pool_c(g_ℓ(x_j)) = 1^T vec(downsample(m_j) ⊙ g_{ℓ,c}(x_j))

Each v_j is thus a vector of length Σ_ℓ C_ℓ, where C_ℓ is the number of channels at layer ℓ of g.
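A minimal PyTorch sketch of this masked, multi-resolution pooling, assuming the encoder exposes one feature map per layer (function names and tensor shapes are our own, not the paper's code):

```python
import torch
import torch.nn.functional as F

def masked_pool(mask, features):
    """Pool one layer's feature map under an activation mask.

    mask:     (1, 1, H, W) activation mask m_j for one exemplar image.
    features: (1, C, h, w) feature map g_l(x_j) from one encoder layer.
    Returns a length-C vector: the spatial sum of mask-weighted features.
    """
    # Downsample the mask to the feature map's spatial shape
    # with bilinear interpolation.
    m = F.interpolate(mask, size=features.shape[-2:],
                      mode="bilinear", align_corners=False)
    # Element-wise multiply each channel by the mask, then sum spatially.
    return (m * features).sum(dim=(-2, -1)).squeeze(0)

def image_feature_vector(mask, layer_outputs):
    """Concatenate masked-pooled features from every encoder layer,
    giving the length-sum_l(C_l) vector v_j described above."""
    return torch.cat([masked_pool(mask, f) for f in layer_outputs])
```

With, say, a 64-channel and a 128-channel layer, `image_feature_vector` returns a single 192-dimensional vector per exemplar image.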
Gen.            Train → Test              # Units  # Words  Len.  % Noun  % Adj  % Prep
within network  AlexNet–ImageNet          115      100      3.5   45.7    16.4   11.9
                AlexNet–Places            137      46       2.5   49.3    28.7   9.6
                ResNet–ImageNet           390      121      2.8   52.2    23.8   11.7
                ResNet–Places             390      376      4.3   46.5    8.7    10.9
                BigGAN–ImageNet           374      112      2.2   59.8    17.5   10.4
                BigGAN–Places             499      245      3.8   54.2    6.0    9.0
across arch.    AlexNet → ResNet          7808     326      3.0   46.1    21.0   8.9
                ResNet → AlexNet          2528     275      2.7   48.0    27.1   6.4
                CNNs → ViT                1200     200      2.6   55.0    18.2   13.0
across dataset  ImageNet → Places         10272    271      2.2   58.8    14.0   13.8
                Places → ImageNet         8800     309      3.1   47.8    26.9   7.8
across task     Classifiers → BigGAN      8736     202      2.1   53.0    25.3   6.1
                BigGAN → Classifiers      10336    336      3.2   54.3    14.2   16.8
Total                                     51585    1002     2.7   51.9    19.8   11.1

Table 6: Statistics for MILAN-generated descriptions on the held-out neurons from the generalization experiments of Section 4. Columns are the same as in Table 5.

Throughout our experiments, g is a ResNet101 pretrained for image classification on ImageNet, provided by PyTorch (Paszke et al., 2019). We extract visual features from the first convolutional layer and all four residual layers. We do not fine-tune any parameters in the encoder. The decoder is a single LSTM cell with an input embedding size of 128 and a hidden size of 512. The attention mechanism linearly maps the current hidden state and all visual feature vectors to size-512 vectors before computing attention weights. We always decode for a maximum of 15 steps. The rest of the decoder is exactly the same as in Xu et al. (2015). The model is trained to minimize cross entropy on the training set using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-3 and minibatches of size 64. We include the double-stochasticity regularization term used by Xu et al. (2015) with λ = 1. We also apply dropout (p = .5) to the hidden state before predicting the next word.
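The attention step described above (hidden state and feature vectors both projected to size 512 before scoring) is additive attention in the style of Bahdanau et al. (2015). A minimal sketch under those assumptions; the layer names and exact parameterization are ours, not the paper's code:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention: project decoder state and
    each visual feature vector to a shared 512-dim space, score with a
    learned vector, and softmax over the k per-image features."""

    def __init__(self, feature_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feature_dim, attn_dim)  # maps each v_j
        self.proj_h = nn.Linear(hidden_dim, attn_dim)   # maps LSTM state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (k, feature_dim) -- one vector per top-activating image
        # hidden:   (hidden_dim,)    -- current decoder LSTM hidden state
        e = self.score(torch.tanh(self.proj_v(features)
                                  + self.proj_h(hidden))).squeeze(-1)
        alpha = torch.softmax(e, dim=0)      # attention weights, sum to 1
        context = (alpha.unsqueeze(-1) * features).sum(dim=0)
        return context, alpha
```

The returned context vector is what the decoder consumes alongside the previous token at each step.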
Across configurations, 10% of the training data is held out and used as a validation set, and training stops when the model's BLEU score (Papineni et al., 2002) does not improve on this set for 4 epochs, up to a maximum of 100 epochs.

B.2  IMPLEMENTING p(d)

We implement p(d) using a two-layer LSTM language model (Hochreiter & Schmidhuber, 1997). We use an input embedding size of 128 with a hidden state size and cell size of 512. We apply dropout (p = .5) to non-recurrent connections during training, hold out 10% of the training dataset as a validation set, and follow the same early-stopping procedure as in Appendix B.1, except we stop on validation loss instead of BLEU.

C  GENERALIZATION EXPERIMENT DETAILS

In each experiment, MILAN is trained with the hyperparameters described in Appendix B and Section 3.4, with the sole exception of the within-network splits: for these, we increase the early-stopping criterion to require 10 epochs of no improvement, to account for the training instability caused by the small training set size. To obtain NetDissect labels, we obtain image exemplars with the same settings as we do for MILAN, and we obtain segmentations using the full segmentation vocabulary minus the textures. To obtain Compositional Explanations labels, we search for formulas of up to length 3 (composed of not, and, and or operators) with a beam size of 5 and no length penalty. Image region exemplars and corresponding segmentations come from the ADE20k dataset (Zhou et al., 2019).

Finally, Table 6 shows statistics for MILAN descriptions generated on the held-out sets from each generalization experiment.

Figure 12: Randomly chosen examples of MILAN-generated descriptions from the generalization experiments of Section 4.

Compared to the human annotations (Table 5), MILAN descriptions are on average shorter (2.7 vs. 4.5 tokens), use fewer unique words (1k vs. 4.6k), and contain adjectives twice as often (19.8% vs. 9.4%).
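The early-stopping schedule used in both B.1 and B.2 (stop after 4 epochs with no improvement on the validation metric, up to 100 epochs) can be sketched as a simple patience loop. The callbacks `train_epoch` and `eval_metric` are placeholders for the actual training and validation code:

```python
def train_with_early_stopping(train_epoch, eval_metric,
                              patience=4, max_epochs=100):
    """Stop training when the validation metric (BLEU in B.1,
    negated loss in B.2) has not improved for `patience` epochs.
    Returns the best metric value observed."""
    best, stale = float("-inf"), 0
    for epoch in range(max_epochs):
        train_epoch(epoch)        # one pass over the training set
        metric = eval_metric()    # score on the held-out 10% split
        if metric > best:
            best, stale = metric, 0
        else:
            stale += 1
            if stale >= patience:
                break             # patience exhausted
    return best
```

For a loss-based criterion as in B.2, passing the negated validation loss as the metric gives the same "higher is better" convention.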
Figure 12 contains additional examples, chosen at random.

D  ANALYSIS EXPERIMENT DETAILS

We obtain the ResNet18 model pretrained on ImageNet from torchvision (Paszke et al., 2019). We obtain neuron descriptions for the same layers that we annotate in ResNet152 (Section 3.3), using the MILAN hyperparameters described in Sections 3.2 and 3.4. We obtain part-of-speech tags, parse trees, and word vectors for each description from spaCy (Honnibal et al., 2020). Figure 13 shows examples of neurons that scored high under each criterion (and consequently were among the first ablated in Fig. 5). Note that these examples include some failure cases of MILAN: for example, in the # verbs example, MILAN incorrectly categorizes all brass instruments as flutes; and

Figure 13: Examples of ablated neurons for each criterion of Section 5, chosen from among the first 10 ablated (ResNet18 units conv1-16, layer2-52, layer3-153, layer4-6, layer4-427, and layer4-450):
  max word difference: "Animals, vehicles, and vases" (layer4-427)
  parse depth / caption length: "The center part of a flower, or radial spikes that extend from a center" (layer3-153)
  # nouns: "Text on a sign, text on a web page, text on a menu"
  # adjectives: "Purple, red and blue fluorescent blue and purple objects"
  # prepositions: "Space on the right side of a object, space above a bird"
  # verbs: "A man holding a flute, a flute, a man playing a flute"
Figure 14: Cut-and-paste adversarial attacks highlighting non-robust behavior by a neuron that scored high on the max-word-diff criterion of Section 5 (unit: ResNet18-ImageNet layer4-427; MILAN: "animals, vehicles, and vases"). (a) MILAN finds this neuron automatically because the generated description mentions two or more dissimilar concepts: animals and vehicles. The neuron is directly connected to the final fully-connected output layer, and strongly influences amphibian, hermit crab, and jeep predictions according to the connection weights. (b) To construct adversarial inputs, we pick three images from the ImageNet validation set that do not include concepts detected by the neuron. (c) We then select a different set of images to act as distractors that do include the concepts detected by the neuron. (d) By cutting and pasting the central object from the distractor to the original image, the model is fooled into predicting a class label that is completely unrelated to the pasted object: e.g., it predicts amphibian when the military vehicle is pasted.

in the # adjectives example, the description is disfluent. Nevertheless, these examples confirm our intuitions about the kinds of neurons selected for by each scoring criterion, as described in Section 5.

We hypothesized in Section 5 that neurons scoring high on the max-word-diff criterion correspond to non-robust behavior by the model. Figure 14 provides some evidence for this hypothesis: we construct cut-and-paste adversarial inputs in the style of Mu & Andreas (2020).
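The paste operation at the heart of such an attack is just region copying. A minimal NumPy sketch, assuming images are arrays and the distractor object's mask is given (shapes and the function itself are illustrative, not the paper's code):

```python
import numpy as np

def cut_and_paste(original, distractor, mask):
    """Copy the masked object region from a distractor image into
    the original image, leaving the original array untouched.

    original, distractor: (H, W, 3) uint8 image arrays.
    mask: boolean (H, W) array marking the distractor object.
    Returns the adversarial image.
    """
    adversarial = original.copy()
    adversarial[mask] = distractor[mask]  # paste object pixels only
    return adversarial
```

Feeding the result back through the classifier then reveals whether the pasted concept flips the prediction to an unrelated class, as in Figure 14(d).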
Specifically, we look at the example max-word-diff neuron shown in Figure 13, crudely copy and paste one of the objects mentioned in its description (e.g., a vehicle-related object like a half track), and show that this can cause the model to predict one of the other concepts in the description (e.g., an animal-related class like amphibian).

E  EDITING EXPERIMENT DETAILS

Hyperparameters.  We train a randomly initialized ResNet18 on the spurious training dataset for a maximum of 100 epochs with a learning rate of 1e-4 and a minibatch size of 128. We annotate the same convolutional and residual units as we did for ResNet152 in Section 3.3. We stop training when validation loss does not improve for 4 epochs.

How many neurons should we remove?  In practice, we cannot incrementally test our model on an adversarial set. So how do we decide how many neurons to zero? One option is to look solely at validation accuracy. Figure 15 recreates Figure 8 with accuracy on the held-out validation set (which is distributed like the training dataset) instead of accuracy on the adversarial test set. The accuracy peaks and starts decreasing earlier than in Fig. 8, but if we were to choose the largest number of ablated units before validation accuracy permanently decreases, we would choose 8 neurons, which would still result in a 3.1% increase in adversarial accuracy.

Figure 15: Same as Fig. 8 (# units ablated out of 1024, for both the sort-text-units and sort-all-units conditions), but shows accuracy on the validation dataset, which is distributed identically to the training dataset. Dotted line denotes initial accuracy.
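One way to operationalize the selection heuristic above (choose the largest ablation count before validation accuracy permanently decreases) is to take the last point at which the accuracy curve attains its maximum; this reading of the heuristic is our assumption:

```python
def neurons_to_ablate(val_acc):
    """Pick an ablation count from a validation-accuracy curve.

    val_acc[i] is validation accuracy after ablating i units
    (a hypothetical input; the paper sweeps up to 25 of 1024 units).
    Returns the last index attaining the maximum accuracy, i.e. the
    final peak before accuracy permanently decreases.
    """
    best = max(val_acc)
    # Every point after this index has strictly lower accuracy.
    return max(i for i, acc in enumerate(val_acc) if acc == best)
```

On a curve that rises, plateaus, then falls, this returns the end of the plateau; applied to the validation curve in Figure 15 it would yield the 8-neuron choice discussed above.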