Paper deep dive
Natural Adversarial Examples
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn Song
Models: AlexNet, DPN-68, DPN-98, DeiT-Base, DeiT-Small, DeiT-Tiny, DenseNet-121, Res2Net-152, Res2Net-50, ResNeXt-101, ResNeXt-50, ResNet-101, ResNet-152, ResNet-18, ResNet-34, ResNet-50, SqueezeNet, VGG-16, VGG-19
Abstract
We introduce two challenging datasets that reliably cause machine learning model performance to substantially degrade. The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues. Our datasets' real-world, unmodified examples transfer to various unseen models reliably, demonstrating that computer vision models have shared weaknesses. The first dataset is called ImageNet-A and is like the ImageNet test set, but it is far more challenging for existing models. We also curate an adversarial out-of-distribution detection dataset called ImageNet-O, which is the first out-of-distribution detection dataset created for ImageNet models. On ImageNet-A a DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90%, and its out-of-distribution detection performance on ImageNet-O is near random chance levels. We find that existing data augmentation techniques hardly boost performance, and using other public training datasets provides improvements that are limited. However, we find that improvements to computer vision architectures provide a promising path towards robust models.
Tags
Links
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 7:58:19 PM
Summary
The paper introduces two challenging datasets, ImageNet-A and ImageNet-O, created using adversarial filtration to expose shared weaknesses and failure modes in computer vision models. ImageNet-A consists of hard, real-world examples that cause significant performance degradation in existing classifiers, while ImageNet-O is an out-of-distribution detection dataset designed to test model uncertainty. The authors demonstrate that these adversarially filtered examples transfer reliably across different model architectures, suggesting that current models rely on spurious cues and lack robust features.
Entities (5)
Relation Signals (3)
Adversarial Filtration → used to create → ImageNet-A
confidence 100% · We curate two hard ImageNet test sets of natural adversarial examples with adversarial filtration.
ImageNet-A → causes degradation in → Computer Vision Models
confidence 95% · The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues.
ImageNet-O → tests → Out-of-distribution Detection
confidence 95% · IMAGENET-O enables us to test out-of-distribution detection performance when the label distribution shifts.
Cypher Suggestions (2)
Find all datasets created using a specific technique · confidence 90% · unvalidated
MATCH (d:Dataset)-[:CREATED_USING]->(t:Technique {name: 'Adversarial Filtration'}) RETURN d.name
Identify models affected by a specific dataset · confidence 85% · unvalidated
MATCH (m:Model)-[:PERFORMANCE_DEGRADED_BY]->(d:Dataset {name: 'ImageNet-A'}) RETURN m.name
Full Text
67,662 characters extracted from source content.
Natural Adversarial Examples. Dan Hendrycks (UC Berkeley), Kevin Zhao* (University of Washington), Steven Basart* (UChicago), Jacob Steinhardt, Dawn Song (UC Berkeley).

Abstract. We introduce two challenging datasets that reliably cause machine learning model performance to substantially degrade. The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues. Our datasets' real-world, unmodified examples transfer to various unseen models reliably, demonstrating that computer vision models have shared weaknesses. The first dataset is called IMAGENET-A and is like the ImageNet test set, but it is far more challenging for existing models. We also curate an adversarial out-of-distribution detection dataset called IMAGENET-O, which is the first out-of-distribution detection dataset created for ImageNet models. On IMAGENET-A a DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90%, and its out-of-distribution detection performance on IMAGENET-O is near random chance levels. We find that existing data augmentation techniques hardly boost performance, and using other public training datasets provides improvements that are limited. However, we find that improvements to computer vision architectures provide a promising path towards robust models.

1. Introduction

Research on the ImageNet [11] benchmark has led to numerous advances in classification [40], object detection [38], and segmentation [23]. ImageNet classification improvements are broadly applicable and highly predictive of improvements on many tasks [39]. Improvements on ImageNet classification have been so great that some call ImageNet classifiers "superhuman" [25]. However, performance is decidedly subhuman when the test distribution does not match the training distribution [29]. The distribution seen at test-time can include inclement weather conditions and obscured objects, and it can also include objects that are anomalous.
Recht et al., 2019 [47] remind us that ImageNet test examples tend to be simple, clear, close-up images, so that the current test set may be too easy and may not represent harder images encountered in the real world. (* Equal Contribution.) Geirhos et al., 2020 argue that image classification datasets contain "spurious cues" or "shortcuts" [18, 2]. For instance, models may use an image's background to predict the foreground object's class; a cow tends to co-occur with a green pasture, and even though the background is inessential to the object's identity, models may predict "cow" primarily using the green pasture background cue. When datasets contain spurious cues, they can lead to performance estimates that are optimistic and inaccurate.

Figure 1: Natural adversarial examples from IMAGENET-A (a fox squirrel predicted as sea lion, 99%; a dragonfly as manhole cover, 99%) and IMAGENET-O (a photosphere predicted as jellyfish, 99%; verdigris as jigsaw puzzle, 99%). The black text is the actual class, and the red text is a ResNet-50 prediction and its confidence. IMAGENET-A contains images that classifiers should be able to classify, while IMAGENET-O contains anomalies of unforeseen classes which should result in low-confidence predictions. ImageNet-1K models do not train on examples from "Photosphere" nor "Verdigris" classes, so these images are anomalous. Most natural adversarial examples lead to wrong predictions despite occurring naturally.

arXiv:1907.07174v4 [cs.LG] 4 Mar 2021

Figure 2: Various ImageNet classifiers of different architectures (AlexNet, SqueezeNet, VGG-19, DenseNet-121, ResNet-50) fail to generalize well to IMAGENET-A and IMAGENET-O. Higher accuracy and higher AUPR are better. See Section 4 for a description of the AUPR out-of-distribution detection measure. These specific models were not used in the creation of IMAGENET-A and IMAGENET-O, so our adversarially filtered images transfer across models.

To counteract this, we curate two hard ImageNet test sets of natural adversarial examples with adversarial filtration. By using adversarial filtration, we can test how well models perform when simple-to-classify examples are removed, which includes examples that are solved with simple spurious cues. Some examples, which are simple for humans but hard for models, are depicted in Figure 1. Our examples demonstrate that it is possible to reliably fool many models with clean natural images, while previous attempts at exposing and measuring model fragility rely on synthetic distribution corruptions [20, 29], artistic renditions [27], and adversarial distortions. We demonstrate that clean examples can reliably degrade performance and transfer to other unseen classifiers using our first dataset. We call this dataset IMAGENET-A, which contains images from a distribution unlike the ImageNet training distribution. IMAGENET-A examples belong to ImageNet classes, but the examples are harder and can cause mistakes across various models. They cause consistent classification mistakes due to scene complications encountered in the long tail of scene configurations and by exploiting classifier blind spots (see Section 3.2). Since examples transfer reliably, this dataset shows models have unappreciated shared weaknesses. The second dataset allows us to test model uncertainty estimates when semantic factors of the data distribution shift. Our second dataset is IMAGENET-O, which contains image concepts from outside ImageNet-1K. These out-of-distribution images reliably cause models to mistake the examples as high-confidence in-distribution examples.
To our knowledge this is the first dataset of anomalies or out-of-distribution examples developed to test ImageNet models. While IMAGENET-A enables us to test image classification performance when the input data distribution shifts, IMAGENET-O enables us to test out-of-distribution detection performance when the label distribution shifts. We examine methods to improve performance on adversarially filtered examples. However, this is difficult because Figure 2 shows that examples successfully transfer to unseen or black-box models. To improve robustness, numerous techniques have been proposed. We find data augmentation techniques such as adversarial training decrease performance, while others can help by a few percent. We also find that a 10× increase in training data corresponds to less than a 10% increase in accuracy. Finally, we show that improving model architectures is a promising avenue toward increasing robustness. Even so, current models have substantial room for improvement. Code and our two datasets are available at github.com/hendrycks/natural-adv-examples.

2. Related Work

Adversarial Examples. Real-world images may be chosen adversarially to cause performance decline. Goodfellow et al. [21] define adversarial examples [54] as "inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake." Most adversarial examples research centers around artificial ℓp adversarial examples, which are examples perturbed by nearly worst-case distortions that are small in an ℓp sense. Su et al., 2018 [52] remind us that most ℓp adversarial examples crafted from one model can only be transferred within the same family of models. However, our adversarially filtered images transfer to all tested model families and move beyond the restrictive ℓp threat model.

Figure 3: IMAGENET-O examples are closer to ImageNet examples than previous out-of-distribution (OOD) detection datasets. For example, ImageNet has triceratops examples and IMAGENET-O has visually similar T-Rex examples, but they are still OOD. Previous OOD detection datasets use OOD examples from wholly different data generating processes. For instance, previous work uses the Describable Textures Dataset [10], Places365 scenes [63], and synthetic blobs to test ImageNet OOD detectors. To our knowledge we propose the first dataset of OOD examples collected for ImageNet models.

Out-of-Distribution Detection. For out-of-distribution (OOD) detection [30, 44, 31, 32], models learn a distribution, such as the ImageNet-1K distribution, and are tasked with producing quality anomaly scores that distinguish between usual test set examples and examples from held-out anomalous distributions. For instance, Hendrycks et al., 2017 [30] treat CIFAR-10 as the in-distribution and treat Gaussian noise and the SUN scene dataset [57] as out-of-distribution data. They show that the negative of the maximum softmax probability, or the negative of the classifier prediction probability, is a high-performing anomaly score that can separate in- and out-of-distribution examples, so much so that it remains competitive to this day. Since that time, other work on out-of-distribution detection has continued to use datasets from other research benchmarks as anomaly stand-ins, producing far-from-distribution anomalies. Using visually dissimilar research datasets as anomaly stand-ins is critiqued in Ahmed et al., 2019 [1]. Some previous OOD detection datasets are depicted in the bottom row of Figure 3 [31]. Many of these anomaly sources are unnatural and deviate in numerous ways from the distribution of usual examples. In fact, some of the distributions can be deemed anomalous from local image statistics alone.
Next, Meinke et al., 2019 [46] propose studying adversarial out-of-distribution detection by detecting adversarially optimized uniform noise. In contrast, we propose a dataset for more realistic adversarial anomaly detection; our dataset contains hard anomalies generated by shifting the distribution's labels and keeping non-semantic factors similar to the original training distribution.

Spurious Cues and Unintended Shortcuts. Models may learn spurious cues and obtain high accuracy, but for the wrong reasons [43, 18]. Spurious cues are a studied problem in natural language processing [9, 22]. Many recently introduced NLP datasets use adversarial filtration to create "adversarial datasets" by sieving examples solved with simple spurious cues [49, 5, 61, 15, 7, 28]. Like this recent concurrent research, we also use adversarial filtration [53], but the technique of adversarial filtration has not been applied to collecting image datasets until this paper. Additionally, adversarial filtration in NLP removes only the easiest examples, while we use filtration to select only the hardest examples and ignore examples of intermediate difficulty. Adversarially filtered examples for NLP also do not reliably transfer even to weaker models. In Bisk et al., 2019 [6], BERT errors do not reliably transfer to weaker GPT-1 models. This is one reason why it is not obvious a priori whether adversarially filtered images should transfer. In this work, we show that adversarial filtration algorithms can find examples that reliably transfer to both weaker and stronger models. Since adversarial filtration can remove examples that are solved by simple spurious cues, models must learn more robust features for our datasets.

Robustness to Shifted Input Distributions. Recht et al., 2019 [47] create a new ImageNet test set resembling the original test set as closely as possible.
They found evidence that matching the difficulty of the original test set required selecting images deemed the easiest and most obvious by Mechanical Turkers. However, Engstrom et al., 2020 [16] estimate that the accuracy drop from ImageNet to ImageNetV2 is less than 3.6%. In contrast, model accuracy can decrease by over 50% with IMAGENET-A. Brendel et al., 2018 [8] show that classifiers that do not know the spatial ordering of image regions can be competitive on the ImageNet test set, possibly due to the dataset's lack of difficulty. Judging classifiers by their performance on easier examples has potentially masked many of their shortcomings. For example, Geirhos et al., 2019 [19] artificially overwrite each ImageNet image's textures and conclude that classifiers learn to rely on textural cues and under-utilize information about object shape. Recent work shows that classifiers are highly susceptible to non-adversarial stochastic corruptions [29]. While they distort images with 75 different algorithmically generated corruptions, our sources of distribution shift tend to be more heterogeneous and varied, and our examples are naturally occurring.

3. IMAGENET-A and IMAGENET-O

3.1. Design

IMAGENET-A is a dataset of real-world adversarially filtered images that fool current ImageNet classifiers. To find adversarially filtered examples, we first download numerous images related to an ImageNet class. Thereafter we delete the images that fixed ResNet-50 [24] classifiers correctly predict. We chose ResNet-50 due to its widespread use. Later we show that examples which fool ResNet-50 reliably transfer to other unseen models. With the remaining incorrectly classified images, we manually select visually clear images. Next, IMAGENET-O is a dataset of adversarially filtered examples for ImageNet out-of-distribution detectors. To create this dataset, we download ImageNet-22K and delete examples from ImageNet-1K.
With the remaining ImageNet-22K examples that do not belong to ImageNet-1K classes, we keep examples that are classified by a ResNet-50 as an ImageNet-1K class with high confidence. Then we manually select visually clear images. Both datasets were manually constructed by graduate students over several months. This is because a large share of images contain multiple classes per image [51]. Therefore, producing a dataset without multilabel images can be challenging with usual annotation techniques. To ensure images do not fall into more than one of the several hundred classes, we had graduate students memorize the classes in order to build a high-quality test set.

IMAGENET-A Class Restrictions. We select a 200-class subset of ImageNet-1K's 1,000 classes so that errors among these 200 classes would be considered egregious [11]. For instance, wrongly classifying Norwich terriers as Norfolk terriers does less to demonstrate faults in current classifiers than mistaking a Persian cat for a candle. We additionally avoid rare classes such as "snow leopard," classes that have changed much since 2012 such as "iPod," coarse classes such as "spiral," classes that are often image backdrops such as "valley," and finally classes that tend to overlap such as "honeycomb," "bee," "bee house," and "bee eater"; "eraser," "pencil sharpener" and "pencil case"; "sink," "medicine cabinet," "pill bottle" and "band-aid"; and so on. The 200 IMAGENET-A classes cover most broad categories spanned by ImageNet-1K; see the Supplementary Materials for the full class list.

IMAGENET-A Data Aggregation. The first step is to download many weakly labeled images. Fortunately, the website iNaturalist has millions of user-labeled images of animals, and Flickr has even more user-tagged images of objects. We download images related to each of the 200 ImageNet classes by leveraging user-provided labels and tags.
After exporting or scraping data from sites including iNaturalist, Flickr, and DuckDuckGo, we adversarially select images by removing examples that fail to fool our ResNet-50 models. Of the remaining images, we select low-confidence images and then ensure each image is valid through human review. If we only used the original ImageNet test set as a source rather than iNaturalist, Flickr, and DuckDuckGo, some classes would have zero images after the first round of filtration, as the original ImageNet test set is too small to contain hard adversarially filtered images. We now describe this process in more detail. We use a small ensemble of ResNet-50s for filtering: one pre-trained on ImageNet-1K then fine-tuned on the 200-class subset, and one pre-trained on ImageNet-1K where 200 of its 1,000 logits are used in classification. Both classifiers have similar accuracy on the 200 clean test set classes from ImageNet-1K. The ResNet-50s perform 10-crop classification for each image, and should any crop be classified correctly by the ResNet-50s, the image is removed. If either ResNet-50 assigns greater than 15% confidence to the correct class, the image is also removed; this is done so that adversarially filtered examples yield misclassifications with low confidence in the correct class, like in untargeted adversarial attacks. Now, some classification confusions are greatly overrepresented, such as Persian cat and lynx. We would like IMAGENET-A to have great variability in its types of errors and cause classifiers to have a dense confusion matrix. Consequently, we perform a second round of filtering to create a shortlist where each confusion only appears at most 15 times. Finally, we manually select images from this shortlist in order to ensure IMAGENET-A images are simultaneously valid, single-class, and high-quality. In all, the IMAGENET-A dataset has 7,500 adversarially filtered images.
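The filtration rules above can be sketched in a few lines. This is a hypothetical, minimal sketch, not the authors' code: `adversarially_filter`, the per-crop `(predicted_class, confidence_in_true_class)` representation, and the toy inputs are all illustrative assumptions standing in for the real ResNet-50 ensemble.

```python
from collections import Counter

# Hypothetical sketch of the IMAGENET-A filtration rules. `predictions` maps
# an image id to per-crop (predicted_class, confidence_in_true_class) pairs
# that a real pipeline would obtain from the ResNet-50 ensemble's 10-crop pass.

def adversarially_filter(predictions, true_labels,
                         conf_threshold=0.15, max_per_confusion=15):
    confusion_counts = Counter()
    kept = []
    for image_id, crops in predictions.items():
        true = true_labels[image_id]
        # Rule 1: drop the image if any crop is classified correctly.
        if any(pred == true for pred, _ in crops):
            continue
        # Rule 2: drop if any crop assigns >15% confidence to the true class.
        if any(conf > conf_threshold for _, conf in crops):
            continue
        # Rule 3: cap each (true, predicted) confusion at 15 occurrences
        # so the induced confusion matrix stays dense.
        pred = crops[0][0]  # representative prediction
        if confusion_counts[(true, pred)] >= max_per_confusion:
            continue
        confusion_counts[(true, pred)] += 1
        kept.append(image_id)
    return kept

# Tiny worked example with three candidate images of a "cat".
preds = {
    "a": [("lynx", 0.05), ("lynx", 0.10)],  # fooled, low confidence -> kept
    "b": [("cat", 0.90), ("lynx", 0.05)],   # one crop correct -> removed
    "c": [("lynx", 0.40), ("lynx", 0.05)],  # too confident in truth -> removed
}
labels = {"a": "cat", "b": "cat", "c": "cat"}
print(adversarially_filter(preds, labels))  # -> ['a']
```

The confusion cap corresponds to the second, diversity-oriented round of filtering described above.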
As a specific example, we download 81,413 dragonfly images from iNaturalist, and after running the ResNet-50 filter we have 8,925 dragonfly images. In the algorithmically diversified shortlist, 1,452 images remain. From this shortlist, 80 dragonfly images are manually selected, but hundreds more could be selected if time allows. The resulting images represent a substantial distribution shift, but images are still possible for humans to classify.

Figure 4: Additional adversarially filtered examples from the IMAGENET-A dataset. Examples are adversarially selected to cause classifier accuracy to degrade. The black text is the actual class, and the red text is a ResNet-50 prediction.

Figure 5: Additional adversarially filtered examples from the IMAGENET-O dataset. Examples are adversarially selected to cause out-of-distribution detection performance to degrade. Examples do not belong to ImageNet classes, and they are wrongly assigned highly confident predictions. The black text is the actual class, and the red text is a ResNet-50 prediction and the prediction confidence.

The Fréchet Inception Distance (FID) [35] enables us to determine whether IMAGENET-A and ImageNet are not identically distributed. The FID between ImageNet's validation and test set is approximately 0.99, indicating that the distributions are highly similar. The FID between IMAGENET-A and ImageNet's validation set is 50.40, and the FID between IMAGENET-A and ImageNet's test set is approximately 50.25, indicating that the distribution shift is large. Despite the shift, we estimate that our graduate students' IMAGENET-A human accuracy rate is approximately 90%.

IMAGENET-O Class Restrictions. We again select a 200-class subset of ImageNet-1K's 1,000 classes. These 200 classes determine the in-distribution, or the distribution that is considered usual. As before, the 200 classes cover most broad categories spanned by ImageNet-1K; see the Supplementary Materials for the full class list.
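For reference, the FID cited above is the standard Fréchet distance between Gaussians fitted to Inception features of the two image sets (the usual definition from Heusel et al., not restated in this excerpt): with feature means and covariances (μ₁, Σ₁) and (μ₂, Σ₂),

```latex
\mathrm{FID}\big((\mu_1,\Sigma_1),(\mu_2,\Sigma_2)\big)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\Big(\Sigma_1 + \Sigma_2 - 2\,(\Sigma_1 \Sigma_2)^{1/2}\Big)
```

so identical feature distributions give an FID near 0, which is why 0.99 between ImageNet's validation and test sets signals similarity while 50+ signals a large shift.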
IMAGENET-O Data Aggregation. Our dataset for adversarial out-of-distribution detection is created by fooling ResNet-50 out-of-distribution detectors. The negative of the prediction confidence of a ResNet-50 ImageNet classifier serves as our anomaly score [30]. Usually in-distribution examples produce higher confidence predictions than OOD examples, but we curate OOD examples that have high confidence predictions. To gather candidate adversarially filtered examples, we use the ImageNet-22K dataset with ImageNet-1K classes deleted. We choose the ImageNet-22K dataset since it was collected in the same way as ImageNet-1K. ImageNet-22K allows us to have coverage of numerous visual concepts and vary the distribution's semantics without unnatural or unwanted non-semantic data shift. After excluding ImageNet-1K images, we process the remaining ImageNet-22K images and keep the images which cause the ResNet-50 to have high confidence, or a low anomaly score. We then manually select a high-quality subset of the remaining images to create IMAGENET-O. We suggest only training models with data from the 1,000 ImageNet-1K classes, since the dataset becomes trivial if models train on ImageNet-22K. To our knowledge, this dataset is the first anomalous dataset curated for ImageNet models and enables researchers to study adversarial out-of-distribution detection. The IMAGENET-O dataset has 2,000 adversarially filtered examples since anomalies are rarer; this has the same number of examples per class as ImageNetV2 [47].

Figure 6: Examples from IMAGENET-A demonstrating classifier failure modes (image labels include Grasshopper, Sundial, Ladybug, Harvestman, Dragonfly, Sea Lion, Banana, Alligator, Hummingbird, and Obelisk). Adjacent to each natural image is its heatmap [50]. Classifiers may use erroneous background cues for prediction. These failure modes are described in Section 3.2.

While we use adversarial filtration to select images that are difficult for a fixed ResNet-50, we will show
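The IMAGENET-O selection rule — score by the negative maximum softmax probability (MSP) and keep OOD candidates the classifier is wrongly confident about — can be sketched as follows. This is an illustrative sketch with toy logits; the 0.5 threshold and the file names are assumptions, not values from the paper.

```python
import math

# Sketch of the IMAGENET-O candidate rule: the anomaly score is the negative
# maximum softmax probability (MSP) of the classifier's logits, and we keep
# OOD candidates whose anomaly score is low (i.e., confidence is high).

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def anomaly_score(logits):
    # Negative of the maximum softmax probability; higher = "more anomalous".
    return -max(softmax(logits))

def select_candidates(logits_by_image, confidence_threshold=0.5):
    # Keep OOD images the classifier is (wrongly) confident about.
    return [img for img, logits in logits_by_image.items()
            if -anomaly_score(logits) > confidence_threshold]

ood_logits = {
    "photosphere.jpg": [8.0, 0.1, 0.2],  # sharply peaked -> confident, kept
    "verdigris.jpg":   [1.0, 0.9, 0.8],  # flat -> uncertain, discarded
}
print(select_candidates(ood_logits))  # -> ['photosphere.jpg']
```

A well-calibrated detector would give such anomalies low confidence; IMAGENET-O is built precisely from the cases where that fails.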
these examples straightforwardly transfer to unseen models.

3.2. Illustrative Failure Modes

Examples in IMAGENET-A uncover numerous failure modes of modern convolutional neural networks. We describe our findings after having viewed tens of thousands of candidate adversarially filtered examples. Some of these failure modes may also explain poor IMAGENET-O performance, but for simplicity we describe our observations with IMAGENET-A examples. Consider Figure 6. The first two images suggest models may overgeneralize visual concepts. A model may confuse metal with sundials, or thin radiating lines with harvestman bugs. We also observed that networks overgeneralize tricycles to bicycles and circles, digital clocks to keyboards and calculators, and more. We also observe that models may rely too heavily on color and texture, as shown with the dragonfly images. Since classifiers are taught to associate entire images with an object class, frequently appearing background elements may also become associated with a class, such as wood being associated with nails. Other examples include classifiers heavily associating hummingbird feeders with hummingbirds, leaf-covered tree branches being associated with the white-headed capuchin monkey class, snow being associated with shovels, and dumpsters with garbage trucks. Additionally, Figure 6 shows an American alligator swimming. With different frames, the classifier prediction varies erratically between classes that are semantically loose and separate. For other images of the swimming alligator, classifiers predict that the alligator is a cliff, lynx, and a fox squirrel. Assessing convolutional networks on IMAGENET-A reveals that even state-of-the-art models have diverse and systematic failure modes.

4. Experiments

We show that adversarially filtered examples collected to fool fixed ResNet-50 models reliably transfer to other models, indicating that current convolutional neural networks have shared weaknesses and failure modes. In the following sections, we analyze whether robustness can be improved by using data augmentation, using more real labeled data, and using different architectures. For the first two sections, we analyze performance with a fixed architecture for comparability, and in the final section we observe performance with different architectures. First we define our metrics.

Metrics. Our metric for assessing robustness to adversarially filtered examples for classifiers is the top-1 accuracy on IMAGENET-A. For reference, the top-1 accuracy on the 200 IMAGENET-A classes using usual ImageNet images is usually greater than or equal to 90% for ordinary classifiers. Our metric for assessing out-of-distribution detection performance on IMAGENET-O examples is the area under the precision-recall curve (AUPR). This metric requires anomaly scores. Our anomaly score is the negative of the maximum softmax probability [30] from a model that can classify the 200 IMAGENET-O classes. The maximum softmax probability detector is a long-standing baseline in OOD detection. We collect anomaly scores with the ImageNet validation examples for the said 200 classes. Then, we collect anomaly scores for the IMAGENET-O examples. Higher performing OOD detectors would assign IMAGENET-O examples lower confidences, or higher anomaly scores. With these anomaly scores, we can compute the area under the precision-recall curve [48]. Random chance level for the AUPR is approximately 16.67% with IMAGENET-O, and the maximum AUPR is 100%.

Figure 7: Some data augmentation techniques (Normal, Adversarial Training, Style Transfer, AugMix, Cutout, MoEx, Mixup, CutMix) hardly improve IMAGENET-A accuracy.
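The 16.67% chance level follows from the fact that a random detector's AUPR equals the fraction of positives (anomalies) in the evaluation pool. Assuming the usual 50 ImageNet validation images for each of the 200 classes — an assumption, since the text does not spell out the pool size — the arithmetic is:

```python
# Chance-level AUPR = fraction of positives for an uninformative detector.
ood_examples = 2_000          # IMAGENET-O anomalies (the positives)
in_dist_examples = 200 * 50   # assumed: 200 classes x 50 validation images
chance_aupr = ood_examples / (ood_examples + in_dist_examples)
print(f"{chance_aupr:.2%}")  # -> 16.67%
```

This matches the chance level quoted in the Metrics paragraph.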
This demonstrates that IMAGENET-A can expose previously unnoticed faults in proposed robustness methods which do well on synthetic distribution shifts [34].

Data Augmentation. We examine popular data augmentation techniques and note their effect on robustness. In this section we exclude IMAGENET-O results, as the data augmentation techniques hardly help with out-of-distribution detection as well. As a baseline, we train a new ResNet-50 from scratch and obtain 2.17% accuracy on IMAGENET-A. Now, one purported way to increase robustness is through adversarial training, which makes models less sensitive to ℓp perturbations. We use the adversarially trained model from Wong et al., 2020 [56], but accuracy decreases to 1.68%. Next, Geirhos et al., 2019 [19] propose making networks rely less on texture by training classifiers on images where textures are transferred from art pieces. They accomplish this by applying style transfer to ImageNet training images to create a stylized dataset, and models train on these images. While this technique is able to greatly increase robustness on synthetic corruptions [29], Style Transfer increases IMAGENET-A accuracy only 0.13% over the ResNet-50 baseline. A recent data augmentation technique is AugMix [34], which takes linear combinations of different data augmentations. This technique increases accuracy to 3.8%. Cutout augmentation [12] randomly occludes image regions and corresponds to 4.4% accuracy. Moment Exchange (MoEx) [45] exchanges feature map moments between images, and this increases accuracy to 5.5%. Mixup [62] trains networks on elementwise convex combinations of images and their interpolated labels; this technique increases accuracy to 6.6%. CutMix [60] superimposes image regions within other images and yields 7.3% accuracy. At best these data augmentation techniques improve accuracy by approximately 5% over the baseline. Results are summarized in Figure 7.
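Mixup, the strongest elementwise technique above, is simple enough to show directly. A minimal sketch on flat vectors (the paper applies it to images, and the mixing weight is normally sampled from a Beta distribution rather than fixed as here):

```python
# Mixup sketch: each training pair is an elementwise convex combination of
# two examples and of their (one-hot) labels, with mixing weight lam.

def mixup(x1, y1, x2, y2, lam):
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Mix two one-hot-labeled examples with lam = 0.75.
x, y = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1], lam=0.75)
print(x, y)  # -> [0.75, 0.25] [0.75, 0.25]
```

The interpolated label is what forces the network to produce soft, interpolated predictions rather than memorizing hard class boundaries.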
Although some data augmentation techniques are purported to greatly improve robustness to distribution shifts [34, 59], their lackluster results on IMAGENET-A show they do not improve robustness on some distribution shifts. Hence IMAGENET-A can be used to verify whether techniques actually improve real-world robustness to distribution shift.

More Labeled Data. One possible explanation for consistently low IMAGENET-A accuracy is that all models are trained only with ImageNet-1K, and using additional data may resolve the problem. Bau et al., 2017 [4] argue that Places365 classifiers learn qualitatively distinct filters (e.g., they have more object detectors and fewer texture detectors in conv3) compared to ImageNet classifiers, so one may expect an error distribution less correlated with errors on IMAGENET-A. To test this hypothesis we pre-train a ResNet-50 on Places365 [63], a large-scale scene recognition dataset. After fine-tuning the Places365 model on ImageNet-1K, we find that accuracy is 1.56%. Consequently, even though scene recognition models are purported to have qualitatively distinct features, this is not enough to improve IMAGENET-A performance. Likewise, Places365 pre-training does not improve IMAGENET-O detection, as its AUPR is 14.88%. Next, we see whether labeled data from IMAGENET-A itself can help. We take the baseline ResNet-50 with 2.17% IMAGENET-A accuracy and fine-tune it on 80% of IMAGENET-A. This leads to no clear improvement on the remaining 20% of IMAGENET-A, since the top-1 and top-5 accuracies are below 2% and 5%, respectively. Last, we pre-train using an order of magnitude more training data with ImageNet-21K. This dataset contains approximately 21,000 classes and approximately 14 million images. To our knowledge this is the largest publicly available database of labeled natural images. Using a ResNet-50 pretrained on ImageNet-21K, we fine-tune the model on ImageNet-1K and attain 11.41% accuracy on IMAGENET-A, a 9.24% increase.
Likewise, the AUPR for IMAGENET-O improves from 16.20% to 21.86%, although this improvement is less significant since IMAGENET-O images overlap with ImageNet-21K images. Academic researchers rarely use datasets larger than ImageNet due to computational costs, so using more data has limitations. An order of magnitude increase in labeled training data can provide some improvements in accuracy, though we now show that architecture changes provide greater improvements.

Figure 8: Increasing model size and other architecture changes can greatly improve performance. Note Res2Net and ResNet+SE have a ResNet backbone. Normal model sizes are ResNet-50 and ResNeXt-50 (32×4d), Large model sizes are ResNet-101 and ResNeXt-101 (32×4d), and XLarge model sizes are ResNet-152 and ResNeXt-101 (32×8d).

Architectural Changes. We find that model architecture can play a large role in IMAGENET-A accuracy and IMAGENET-O detection performance. Simply increasing the width and number of layers of a network is sufficient to automatically impart more IMAGENET-A accuracy and IMAGENET-O OOD detection performance. Increasing network capacity has been shown to improve performance on ℓp adversarial examples [42] and common corruptions [29], and it now also improves performance on adversarially filtered images. For example, a ResNet-50's top-1 accuracy and AUPR are 2.17% and 16.2%, respectively, while a ResNet-152 obtains 6.1% top-1 accuracy and 18.0% AUPR. Another architecture change that reliably helps is using the grouped convolutions found in ResNeXts [58]. A ResNeXt-50 (32×4d) obtains a 4.81% top-1 IMAGENET-A accuracy and a 17.60% IMAGENET-O AUPR. Another useful architecture change is self-attention.
Convolutional neural networks with self-attention [36] are designed to better capture long-range dependencies and interactions across an image. We consider the self-attention technique called Squeeze-and-Excitation (SE) [37], which won the final ImageNet competition in 2017. A ResNet-50 with Squeeze-and-Excitation attains 6.17% accuracy. However, for larger ResNets, self-attention does little to improve IMAGENET-O detection. We consider the ResNet-50 architecture with its residual blocks exchanged with recently introduced Res2Net v1b blocks [17]. This change increases accuracy to 14.59% and the AUPR to 19.5%. A ResNet-152 with Res2Net v1b blocks attains 22.4% accuracy and 23.9% AUPR. Compared to data augmentation or an order of magnitude more labeled training data, some architectural changes can provide far more robustness gains. Consequently, future improvements to model architectures are a promising path towards greater robustness.

We now assess performance on a completely different architecture which does not use convolutions: vision Transformers [14]. We evaluate with DeiT [55], a vision Transformer trained on ImageNet-1K with aggressive data augmentation such as Mixup. Even for vision Transformers, we find that ImageNet-A and ImageNet-O examples successfully transfer. In particular, a DeiT-small vision Transformer gets 19.0% on IMAGENET-A and has a similar number of parameters to a Res2Net-50, which has 14.6% accuracy. This might be explained by DeiT's use of Mixup, however, which provided a 4% ImageNet-A accuracy boost for ResNets. The IMAGENET-O AUPR for the Transformer is 20.9%, while the Res2Net gets 19.5%. Larger DeiT models do better, as a DeiT-base gets 28.2% accuracy on IMAGENET-A and 24.8% AUPR on IMAGENET-O. Consequently, our datasets transfer to vision Transformers and performance for both tasks remains far from the ceiling.

5.
Conclusion

We found it is possible to improve performance on our datasets with data augmentation, pretraining data, and architectural changes. We found that our examples transferred to all tested models, including vision Transformers which do not use convolution operations. Results indicate that improving performance on IMAGENET-A and IMAGENET-O is possible but difficult. Our challenging ImageNet test sets serve as measures of performance under distribution shift, an important research aim as models are deployed in increasingly precarious real-world environments.

References

[1] Faruk Ahmed and Aaron C. Courville. Detecting semantic anomalies. ArXiv, abs/1908.04388, 2019.
[2] Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ArXiv, abs/1907.02893, 2019.
[3] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. J. Mach. Learn. Res., 9:1823–1840, 2008.
[4] David Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3327, 2017.
[5] Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Yih, and Yejin Choi. Abductive commonsense reasoning. ArXiv, abs/1908.05739, 2019.
[6] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. ArXiv, abs/1911.11641, 2019.
[7] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. ArXiv, abs/1911.11641, 2020.
[8] Wieland Brendel and Matthias Bethge. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. CoRR, abs/1904.00760, 2018.
[9] Zheng Cai, Lifu Tu, and Kevin Gimpel.
Pay attention to the ending: Strong neural baselines for the roc story cloze task. In ACL, 2017.
[10] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. Computer Vision and Pattern Recognition, 2014.
[11] Jia Deng, Wei Dong, Richard Socher, Li jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[12] Terrance Devries and Graham W. Taylor. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552, 2017.
[13] Terrance Devries and Graham W. Taylor. Learning confidence for out-of-distribution detection in neural networks. ArXiv, abs/1802.04865, 2018.
[14] A. Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, M. Dehghani, Matthias Minderer, Georg Heigold, S. Gelly, Jakob Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[15] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT, 2019.
[16] L. Engstrom, Andrew Ilyas, Shibani Santurkar, D. Tsipras, J. Steinhardt, and A. Madry. Identifying statistical bias in dataset replication. ArXiv, abs/2005.09619, 2020.
[17] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xinyu Zhang, Ming-Hsuan Yang, and Philip H. S. Torr. Res2net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[18] Robert Geirhos, Jorn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. ArXiv, abs/2004.07780, 2020.
[19] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel.
Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR, 2019.
[20] Robert Geirhos, Carlos R. M. Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. NeurIPS, 2018.
[21] Ian Goodfellow, Nicolas Papernot, Sandy Huang, Yan Duan, and Peter Abbeel. Attacking machine learning with adversarial examples. OpenAI Blog, 2017.
[22] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. ArXiv, abs/1803.02324, 2018.
[23] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask r-cnn. In CVPR, 2018.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2015.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
[26] Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, J. Steinhardt, and D. Song. Scaling out-of-distribution detection for real-world settings. arXiv:1911.11132, 2020.
[27] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, F. Wang, Evan Dorundo, Rahul Desai, Tyler Lixuan Zhu, Samyak Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ArXiv, abs/2006.16241, 2020.
[28] Dan Hendrycks, C. Burns, Steven Basart, Andrew Critch, Jerry Li, D. Song, and J. Steinhardt. Aligning ai with shared human values. ArXiv, abs/2008.02275, 2020.
[29] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ICLR, 2019.
[30] Dan Hendrycks and Kevin Gimpel.
A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR, 2017.
[31] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. ICLR, 2019.
[32] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. Advances in Neural Information Processing Systems (NeurIPS), 2019.
[33] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and D. Song. Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, 2019.
[34] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. ICLR, 2020.
[35] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
[36] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In NeurIPS, 2018.
[37] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[38] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara Balan, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[39] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better? CoRR, abs/1805.08974, 2018.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
[41] A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration.
In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[42] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. ICLR, 2017.
[43] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking clever hans predictors and assessing what machines really learn. In Nature Communications, 2019.
[44] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. ICLR, 2018.
[45] Bo-Yi Li, Felix Wu, Ser-Nam Lim, Serge J. Belongie, and Kilian Q. Weinberger. On feature normalization and data augmentation. ArXiv, abs/2002.11102, 2020.
[46] Alexander Meinke and Matthias Hein. Towards neural networks that provably know when they don't know. ArXiv, abs/1909.12180, 2019.
[47] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? ArXiv, abs/1902.10811, 2019.
[48] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. In PLoS ONE, 2015.
[49] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. ArXiv, abs/1907.10641, 2019.
[50] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128:336–359, 2019.
[51] Pierre Stock and Moustapha Cissé. Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In ECCV, 2018.
[52] D. Su, Huan Zhang, H. Chen, Jinfeng Yi, P. Chen, and Yupeng Gao. Is robustness the cost of accuracy? - a comprehensive study on the robustness of 18 deep image classification models. In ECCV, 2018.
[53] Kah Kay Sung.
Learning and example selection for object and pattern detection. 1995.
[54] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks, 2014.
[55] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers and distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[56] Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.
[57] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
[58] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CVPR, 2016.
[59] Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, E. Cubuk, and J. Gilmer. A fourier perspective on model robustness in computer vision. ArXiv, abs/1906.08988, 2019.
[60] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6022–6031, 2019.
[61] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019.
[62] Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ArXiv, abs/1710.09412, 2018.
[63] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. PAMI, 2017.

6. Appendix

7. Expanded Results

7.1.
Full Architecture Results

Full results with various architectures are in Table 1.

7.2. More OOD Detection Results and Background

Works in out-of-distribution detection frequently use the maximum softmax probability baseline to detect out-of-distribution examples [30]. Before neural networks, using the reject option or a k+1st class was somewhat common [3], but with neural networks it requires auxiliary anomalous training data. New neural methods that utilize auxiliary anomalous training data, such as Outlier Exposure [31], do not use the reject option and still utilize the maximum softmax probability. We do not use Outlier Exposure since that paper's authors were unable to get their technique to work on ImageNet-1K with 224×224 images, though they were able to get it to work on Tiny ImageNet, which has 64×64 images. We do not use ODIN since it requires tuning hyperparameters directly using out-of-distribution data, a criticized practice [31]. We evaluate three additional out-of-distribution detection methods, though none substantially improve performance. We evaluate the method of [13], which trains an auxiliary branch to represent the model confidence. Using a ResNet trained from scratch, we find this gets a 14.3% AUPR, around 2% less than the MSP baseline. Next we use the recent Maximum Logit detector [26]. With DenseNet-121 the AUPR decreases from 16.1% (MSP) to 15.8% (Max Logit), while with ResNeXt-101 (32×8d) the AUPR of 20.5% increases to 20.6%. Across over 10 models we found the MaxLogit technique to be slightly worse. Finally, we evaluate the utility of self-supervised auxiliary objectives for OOD detection. The rotation prediction anomaly detector [33] was shown to help improve detection performance for near-distribution yet still out-of-class examples, and with this auxiliary objective the AUPR for ResNet-50 does not change; it is 16.2% with the rotation prediction and 16.2% with the MSP. Note this method requires training the network and does not work out-of-the-box.
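The MSP baseline [30] and MaxLogit detector [26] discussed above are both simple functions of a model's logits. A minimal NumPy sketch (the toy logits array is illustrative, not from any trained model):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (MSP): higher suggests in-distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def max_logit_score(logits):
    """MaxLogit detector: score with the unnormalized maximum logit instead."""
    return logits.max(axis=-1)

# Hypothetical logits for two images over a 5-class model.
logits = np.array([[4.0, 0.1, 0.2, 0.1, 0.0],   # peaked: likely in-distribution
                   [1.0, 0.9, 1.1, 0.8, 1.0]])  # flat: possibly out-of-distribution
msp = msp_score(logits)
ml = max_logit_score(logits)
```

Both scores are then thresholded (or summarized with AUPR, as in the paper) to separate in-distribution from out-of-distribution inputs; the peaked row receives a much higher score under either detector.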
7.3. Calibration

In this section we show IMAGENET-A calibration results.

Figure 9: A demonstration of color sensitivity. While the leftmost image is classified as "banana" with high confidence, the images with modified color are correctly classified. Not only would we like models to be more accurate, we would like them to be calibrated when they are wrong.

Figure 10: The Response Rate Accuracy curve for a ResNeXt-101 (32×4d) with and without Squeeze-and-Excitation (SE). The Response Rate is the percent classified. The accuracy at an n% response rate is the accuracy on the n% of examples where the classifier is most confident.

Uncertainty Metrics. The ℓ2 Calibration Error is how we measure miscalibration. We would like classifiers that can reliably forecast their accuracy. Concretely, we want classifiers which give examples 60% confidence to be correct 60% of the time. We judge a classifier's miscalibration with the ℓ2 Calibration Error [41]. Our second uncertainty estimation metric is the Area Under the Response Rate Accuracy Curve (AURRA). Responding only when confident is often preferable to predicting falsely. In these experiments, we allow classifiers to respond to a subset of the test set and abstain from predicting on the rest. Classifiers with quality uncertainty estimates should be capable of identifying examples they are likely to predict falsely and abstaining. If a classifier is required to abstain from predicting on 90% of the test set, or equivalently respond to the remaining 10% of the test set, then we should like the classifier's uncertainty estimates to separate correctly and falsely classified examples and yield high accuracy on the selected 10%. At a fixed response rate, we should like the accuracy to be as high as possible.
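The selective-response procedure just described can be sketched as follows: rank examples by confidence, compute accuracy on the most-confident fraction at each response rate, and average over rates to approximate the area under the curve. This is a minimal NumPy version with illustrative arrays; confidences here play the role of the maximum softmax probability:

```python
import numpy as np

def rra_curve(confidences, correct, response_rates):
    """Accuracy on the most-confident fraction of examples at each response rate."""
    order = np.argsort(-np.asarray(confidences))       # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    accs = []
    for p in response_rates:
        k = max(1, int(round(p * len(correct))))       # number of examples answered
        accs.append(correct[:k].mean())                # accuracy on the top-k
    return np.array(accs)

def aurra(confidences, correct, num_points=100):
    """Area under the response-rate accuracy curve, via a uniform grid of rates."""
    rates = np.linspace(1 / num_points, 1.0, num_points)
    return rra_curve(confidences, correct, rates).mean()

# Toy example: confidence separates correct from incorrect predictions well,
# so selective accuracy exceeds the full-test-set accuracy of 0.6.
conf = np.array([0.99, 0.95, 0.90, 0.60, 0.55])
corr = np.array([1, 1, 1, 0, 0])
score = aurra(conf, corr)
```

With perfectly uninformative confidences the curve is flat at the overall accuracy, so the AURRA equals the test set accuracy; better uncertainty estimates push it higher.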
At a 100% response rate, the classifier accuracy is the usual test set accuracy. We vary the response rates and compute the corresponding accuracies to obtain the Response Rate Accuracy (RRA) curve. The area under the Response Rate Accuracy curve is the AURRA. To compute the AURRA in this paper, we use the maximum softmax probability. For response rate p, we take the p fraction of examples with the highest maximum softmax probability. If the response rate is 10%, we select the top 10% of examples with the highest confidence and compute the accuracy on these examples. An example RRA curve is in Figure 10.

Table 1: Expanded IMAGENET-A and IMAGENET-O architecture results. Note IMAGENET-O performance is improving more slowly.

Model                                  ImageNet-A (Acc %)   ImageNet-O (AUPR %)
AlexNet                                1.77                 15.44
SqueezeNet1.1                          1.12                 15.31
VGG16                                  2.63                 16.58
VGG19                                  2.11                 16.80
VGG19+BN                               2.95                 16.57
DenseNet121                            2.16                 16.11
ResNet-18                              1.15                 15.23
ResNet-34                              1.87                 16.00
ResNet-50                              2.17                 16.20
ResNet-101                             4.72                 17.20
ResNet-152                             6.05                 18.00
ResNet-50+Squeeze-and-Excite           6.17                 17.52
ResNet-101+Squeeze-and-Excite          8.55                 17.91
ResNet-152+Squeeze-and-Excite          9.35                 18.65
ResNet-50+DeVries Confidence Branch    0.35                 14.34
ResNet-50+Rotation Prediction Branch   2.17                 16.20
Res2Net-50 (v1b)                       14.59                19.50
Res2Net-101 (v1b)                      21.84                22.69
Res2Net-152 (v1b)                      22.4                 23.90
ResNeXt-50 (32×4d)                     4.81                 17.60
ResNeXt-101 (32×4d)                    5.85                 19.60
ResNeXt-101 (32×8d)                    10.2                 20.51
DPN 68                                 3.53                 17.78
DPN 98                                 9.15                 21.10
DeiT-tiny                              7.25                 17.4
DeiT-small                             19.1                 20.9
DeiT-base                              28.2                 24.8

8. IMAGENET-A Classes

The 200 ImageNet classes that we selected for IMAGENET-A are as follows.
Figure 11: Self-attention's influence on IMAGENET-A ℓ2 calibration and error detection.

Figure 12: Model size's influence on IMAGENET-A ℓ2 calibration and error detection.

goldfish, great white shark, hammerhead, stingray, hen, ostrich, goldfinch, junco, bald eagle, vulture, newt, axolotl, tree frog, iguana, African chameleon, cobra, scorpion, tarantula, centipede, peacock, lorikeet, hummingbird, toucan, duck, goose, black swan, koala, jellyfish, snail, lobster, hermit crab, flamingo, american egret, pelican, king penguin, grey whale, killer whale, sea lion, chihuahua, shih tzu, afghan hound, basset hound, beagle, bloodhound, italian greyhound, whippet, weimaraner, yorkshire terrier, boston terrier, scottish terrier, west highland white terrier, golden retriever, labrador retriever, cocker spaniels, collie, border collie, rottweiler, german shepherd dog, boxer, french bulldog, saint bernard, husky, dalmatian, pug, pomeranian, chow chow, pembroke welsh corgi, toy poodle, standard poodle, timber wolf, hyena, red fox, tabby cat, leopard, snow leopard, lion, tiger, cheetah, polar bear, meerkat, ladybug, fly, bee, ant, grasshopper, cockroach, mantis, dragonfly, monarch butterfly, starfish, wood rabbit, porcupine, fox squirrel, beaver, guinea pig, zebra, pig, hippopotamus, bison, gazelle, llama, skunk, badger, orangutan, gorilla, chimpanzee, gibbon, baboon, panda, eel, clown fish, puffer fish, accordion, ambulance, assault rifle, backpack, barn, wheelbarrow, basketball, bathtub, lighthouse, beer glass, binoculars, birdhouse, bow tie, broom, bucket, cauldron, candle, cannon, canoe, carousel, castle, mobile phone, cowboy hat, electric guitar, fire engine, flute, gasmask, grand piano, guillotine, hammer, harmonica, harp, hatchet, jeep, joystick, lab coat, lawn mower, lipstick, mailbox, missile, mitten, parachute, pickup truck, pirate ship, revolver, rugby ball, sandal, saxophone, school bus, schooner, shield, soccer ball, space shuttle, spider web, steam locomotive, scarf, submarine, tank, tennis ball, tractor, trombone, vase, violin, military aircraft, wine bottle, ice cream, bagel, pretzel, cheeseburger, hotdog, cabbage, broccoli, cucumber, bell pepper, mushroom, Granny Smith, strawberry, lemon, pineapple, banana, pomegranate, pizza, burrito, espresso, volcano, baseball player, scuba diver, acorn,
n01443537,n01484850,n01494475, n01498041,n01514859,n01518878,n01531178, n01534433,n01614925,n01616318,n01630670, n01632777,n01644373,n01677366,n01694178, n01748264,n01770393,n01774750,n01784675, n01806143,n01820546,n01833805,n01843383, n01847000,n01855672,n01860187,n01882714, n01910747,n01944390,n01983481,n01986214, n02007558,n02009912,n02051845,n02056570, n02066245,n02071294,n02077923,n02085620, n02086240,n02088094,n02088238,n02088364, n02088466,n02091032,n02091134,n02092339,
n02094433,n02096585,n02097298,n02098286, n02099601,n02099712,n02102318,n02106030, n02106166,n02106550,n02106662,n02108089, n02108915,n02109525,n02110185,n02110341, n02110958,n02112018,n02112137,n02113023, n02113624,n02113799,n02114367,n02117135, n02119022,n02123045,n02128385,n02128757, n02129165,n02129604,n02130308,n02134084, n02138441,n02165456,n02190166,n02206856, n02219486,n02226429,n02233338,n02236044, n02268443,n02279972,n02317335,n02325366, n02346627,n02356798,n02363005,n02364673, n02391049,n02395406,n02398521,n02410509, n02423022,n02437616,n02445715,n02447366, n02480495,n02480855,n02481823,n02483362, n02486410,n02510455,n02526121,n02607072, n02655020,n02672831,n02701002,n02749479, n02769748,n02793495,n02797295,n02802426, n02808440,n02814860,n02823750,n02841315, n02843684,n02883205,n02906734,n02909870, n02939185,n02948072,n02950826,n02951358, n02966193,n02980441,n02992529,n03124170, n03272010,n03345487,n03372029,n03424325, n03452741,n03467068,n03481172,n03494278, n03495258,n03498962,n03594945,n03602883, n03630383,n03649909,n03676483,n03710193, n03773504,n03775071,n03888257,n03930630, n03947888,n04086273,n04118538,n04133789, n04141076,n04146614,n04147183,n04192698, n04254680,n04266014,n04275548,n04310018, n04325704,n04347754,n04389033,n04409515, n04465501,n04487394,n04522168,n04536866, n04552348,n04591713,n07614500,n07693725, n07695742,n07697313,n07697537,n07714571, n07714990,n07718472,n07720875,n07734744, n07742313,n07745940,n07749582,n07753275, n07753592,n07768694,n07873807,n07880968, n07920052,n09472597,n09835506,n10565667, n12267677, ‘Stingray;’ ‘goldfinch, Carduelis carduelis;’ ‘junco, snowbird;’ ‘robin, American robin, Turdus migratorius;’ ‘jay;’ ‘bald eagle, American eagle, Haliaeetus leucocephalus;’ ‘vulture;’ ‘eft;’ ‘bullfrog, Rana catesbeiana;’ ‘box turtle, box tortoise;’ ‘common iguana, iguana, Iguana iguana;’ ‘agama;’ ‘African chameleon, Chamaeleo chamaeleon;’ ‘American alligator, Alligator mississipiensis;’ ‘garter snake, grass snake;’
‘harvestman, daddy longlegs, Phalangium opilio;’ ‘scorpion;’ ‘tarantula;’ ‘centipede;’ ‘sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita;’ ‘lorikeet;’ ‘hummingbird;’ ‘toucan;’ ‘drake;’ ‘goose;’ ‘koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus;’ ‘jellyfish;’ ‘sea anemone, anemone;’ ‘flatworm, platyhelminth;’ ‘snail;’ ‘crayfish, crawfish, crawdad, crawdaddy;’ ‘hermit crab;’ ‘flamingo;’ ‘American egret, great white heron, Egretta albus;’ ‘oystercatcher, oyster catcher;’ ‘pelican;’ ‘sea lion;’ ‘Chihuahua;’ ‘golden retriever;’ ‘Rottweiler;’ ‘German shepherd, German shepherd dog, German police dog, alsatian;’ ‘pug, pug-dog;’ ‘red fox, Vulpes vulpes;’ ‘Persian cat;’ ‘lynx, catamount;’ ‘lion, king of beasts, Panthera leo;’ ‘American black bear, black bear, Ursus americanus, Euarctos americanus;’ ‘mongoose;’ ‘ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle;’ ‘rhinoceros beetle;’ ‘weevil;’ ‘fly;’ ‘bee;’ ‘ant, emmet, pismire;’ ‘grasshopper, hopper;’ ‘walking stick, walkingstick, stick insect;’ ‘cockroach, roach;’ ‘mantis, mantid;’ ‘leafhopper;’ ‘dragonfly, darning needle, devil’s darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk;’ ‘monarch, monarch butterfly, milkweed butterfly, Danaus plexippus;’ ‘cabbage butterfly;’ ‘lycaenid, lycaenid butterfly;’ ‘starfish, sea star;’ ‘wood rabbit, cottontail, cottontail rabbit;’ ‘porcupine, hedgehog;’ ‘fox squirrel, eastern fox squirrel, Sciurus niger;’ ‘marmot;’ ‘bison;’ ‘skunk, polecat, wood pussy;’ ‘armadillo;’ ‘baboon;’ ‘capuchin, ringtail, Cebus capucinus;’ ‘African elephant, Loxodonta africana;’ ‘puffer, pufferfish, blowfish, globefish;’ ‘academic gown, academic robe, judge’s robe;’ ‘accordion, piano accordion, squeeze box;’ ‘acoustic guitar;’ ‘airliner;’ ‘ambulance;’ ‘apron;’ ‘balance beam, beam;’ ‘balloon;’ ‘banjo;’ ‘barn;’ ‘barrow, garden cart, lawn cart, wheelbarrow;’ ‘basketball;’ ‘beacon, lighthouse, beacon light, pharos;’
‘beaker;’ ‘bikini, two-piece;’ ‘bow;’ ‘bow tie, bow-tie, bowtie;’ ‘breastplate, aegis, egis;’ ‘broom;’ ‘candle, taper, wax light;’ ‘canoe;’ ‘castle;’ ‘cello, violoncello;’ ‘chain;’ ‘chest;’ ‘Christmas stocking;’ ‘cowboy boot;’ ‘cradle;’ ‘dial telephone, dial phone;’ ‘digital clock;’ ‘doormat, welcome mat;’ ‘drumstick;’ ‘dumbbell;’ ‘envelope;’ ‘feather boa, boa;’ ‘flagpole, flagstaff;’ ‘forklift;’ ‘fountain;’ ‘garbage truck, dustcart;’ ‘goblet;’ ‘go-kart;’ ‘golfcart, golf cart;’ ‘grand piano, grand;’ ‘hand blower, blow dryer, blow drier, hair dryer, hair drier;’ ‘iron, smoothing iron;’ ‘jack-o’-lantern;’ ‘jeep, landrover;’ ‘kimono;’ ‘lighter, light, igniter, ignitor;’ ‘limousine, limo;’ ‘manhole cover;’ ‘maraca;’ ‘marimba, xylophone;’ ‘mask;’ ‘mitten;’ ‘mosque;’ ‘nail;’ ‘obelisk;’ ‘ocarina, sweet potato;’ ‘organ, pipe organ;’ ‘parachute, chute;’ ‘parking meter;’ ‘piggy bank, penny bank;’ ‘pool table, billiard table, snooker table;’ ‘puck, hockey puck;’ ‘quill, quill pen;’ ‘racket, racquet;’ ‘reel;’ ‘revolver, six-gun, six-shooter;’ ‘rocking chair, rocker;’ ‘rugby ball;’ ‘saltshaker, salt shaker;’ ‘sandal;’ ‘sax, saxophone;’ ‘school bus;’ ‘schooner;’ ‘sewing machine;’ ‘shovel;’ ‘sleeping bag;’ ‘snowmobile;’ ‘snowplow, snowplough;’ ‘soap dispenser;’ ‘spatula;’ ‘spider web, spider’s web;’ ‘steam locomotive;’ ‘stethoscope;’ ‘studio couch, day bed;’ ‘submarine, pigboat, sub, U-boat;’ ‘sundial;’ ‘suspension bridge;’ ‘syringe;’ ‘tank, army tank, armored combat vehicle, armoured combat vehicle;’ ‘teddy, teddy bear;’ ‘toaster;’ ‘torch;’ ‘tricycle, trike, velocipede;’ ‘umbrella;’ ‘unicycle, monocycle;’ ‘viaduct;’ ‘volleyball;’ ‘washer, automatic washer, washing machine;’ ‘water tower;’ ‘wine bottle;’ ‘wreck;’ ‘guacamole;’ ‘pretzel;’ ‘cheeseburger;’ ‘hotdog, hot dog, red hot;’ ‘broccoli;’ ‘cucumber, cuke;’ ‘bell pepper;’ ‘mushroom;’ ‘lemon;’ ‘banana;’ ‘custard apple;’ ‘pomegranate;’ ‘carbonara;’ ‘bubble;’ ‘cliff, drop, drop-off;’ ‘volcano;’
‘ballplayer, baseball player;’ ‘rapeseed;’ ‘yellow lady’s slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum;’ ‘corn;’ ‘acorn.’ Their WordNet IDs are as follows. n01498041,n01531178,n01534433,n01558993, n01580077,n01614925,n01616318,n01631663, n01641577,n01669191,n01677366,n01687978, n01694178,n01698640,n01735189,n01770081, n01770393,n01774750,n01784675,n01819313, n01820546,n01833805,n01843383,n01847000, n01855672,n01882714,n01910747,n01914609, n01924916,n01944390,n01985128,n01986214, n02007558,n02009912,n02037110,n02051845, n02077923,n02085620,n02099601,n02106550, n02106662,n02110958,n02119022,n02123394, n02127052,n02129165,n02133161,n02137549, n02165456,n02174001,n02177972,n02190166, n02206856,n02219486,n02226429,n02231487, n02233338,n02236044,n02259212,n02268443, n02279972,n02280649,n02281787,n02317335, n02325366,n02346627,n02356798,n02361337, n02410509,n02445715,n02454379,n02486410, n02492035,n02504458,n02655020,n02669723, n02672831,n02676566,n02690373,n02701002, n02730930,n02777292,n02782093,n02787622, n02793495,n02797295,n02802426,n02814860, n02815834,n02837789,n02879718,n02883205, n02895154,n02906734,n02948072,n02951358, n02980441,n02992211,n02999410,n03014705, n03026506,n03124043,n03125729,n03187595, n03196217,n03223299,n03250847,n03255030, n03291819,n03325584,n03355925,n03384352, n03388043,n03417042,n03443371,n03444034, n03445924,n03452741,n03483316,n03584829, n03590841,n03594945,n03617480,n03666591, n03670208,n03717622,n03720891,n03721384, n03724870,n03775071,n03788195,n03804744, n03837869,n03840681,n03854065,n03888257, n03891332,n03935335,n03982430,n04019541, n04033901,n04039381,n04067472,n04086273, n04099969,n04118538,n04131690,n04133789, n04141076,n04146614,n04147183,n04179913, n04208210,n04235860,n04252077,n04252225, n04254120,n04270147,n04275548,n04310018, n04317175,n04344873,n04347754,n04355338, n04366367,n04376876,n04389033,n04399382, n04442312,n04456115,n04482393,n04507155, n04509417,n04532670,n04540053,n04554684, 
n04562935,n04591713,n04606251,n07583066, n07695742,n07697313,n07697537,n07714990, n07718472,n07720875,n07734744,n07749582, n07753592,n07760859,n07768694,n07831146, n09229709,n09246464,n09472597,n09835506, n11879895,n12057211,n12144580,n12267677.

9. IMAGENET-O Classes

The 200 ImageNet classes that we selected for IMAGENET-O are as follows. ‘goldfish, Carassius auratus;’ ‘triceratops;’ ‘harvestman, daddy longlegs, Phalangium opilio;’ ‘centipede;’ ‘sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita;’ ‘lorikeet;’ ‘jellyfish;’ ‘brain coral;’ ‘chambered nautilus, pearly nautilus, nautilus;’ ‘dugong, Dugong dugon;’ ‘starfish, sea star;’ ‘sea urchin;’ ‘hog, pig, grunter, squealer, Sus scrofa;’ ‘armadillo;’ ‘rock beauty, Holocanthus tricolor;’ ‘puffer, pufferfish, blowfish, globefish;’ ‘abacus;’ ‘accordion, piano accordion, squeeze box;’ ‘apron;’ ‘balance beam, beam;’ ‘ballpoint, ballpoint pen, ballpen, Biro;’ ‘Band Aid;’ ‘banjo;’ ‘barbershop;’ ‘bath towel;’ ‘bearskin, busby, shako;’ ‘binoculars, field glasses, opera glasses;’ ‘bolo tie, bolo, bola tie, bola;’ ‘bottlecap;’ ‘brassiere, bra, bandeau;’ ‘broom;’ ‘buckle;’ ‘bulletproof vest;’ ‘candle, taper, wax light;’ ‘car mirror;’ ‘chainlink fence;’ ‘chain saw, chainsaw;’ ‘chime, bell, gong;’ ‘Christmas stocking;’ ‘cinema, movie theater, movie theatre, movie house, picture palace;’ ‘combination lock;’ ‘corkscrew, bottle screw;’ ‘crane;’ ‘croquet ball;’ ‘dam, dike, dyke;’ ‘digital clock;’ ‘dishrag, dishcloth;’ ‘dogsled, dog sled, dog sleigh;’ ‘doormat, welcome mat;’ ‘drilling platform, offshore rig;’ ‘electric fan, blower;’ ‘envelope;’ ‘espresso maker;’ ‘face powder;’ ‘feather boa, boa;’ ‘fireboat;’ ‘fire screen, fireguard;’ ‘flute, transverse flute;’ ‘folding chair;’ ‘fountain;’ ‘fountain pen;’ ‘frying pan, frypan, skillet;’ ‘golf ball;’ ‘greenhouse, nursery, glasshouse;’ ‘guillotine;’ ‘hamper;’ ‘hand blower, blow dryer, blow drier, hair dryer, hair drier;’ ‘harmonica, mouth organ, harp, mouth harp;’
‘honeycomb;’ ‘hourglass;’ ‘iron, smoothing iron;’ ‘jack-o’-lantern;’ ‘jigsaw puzzle;’ ‘joystick;’ ‘lawn mower, mower;’ ‘library;’ ‘lighter, light, igniter, ignitor;’ ‘lipstick, lip rouge;’ ‘loupe, jeweler’s loupe;’ ‘magnetic compass;’ ‘manhole cover;’ ‘maraca;’ ‘marimba, xylophone;’ ‘mask;’ ‘matchstick;’ ‘maypole;’ ‘maze, labyrinth;’ ‘medicine chest, medicine cabinet;’ ‘mortar;’ ‘mosquito net;’ ‘mousetrap;’ ‘nail;’ ‘neck brace;’ ‘necklace;’ ‘nipple;’ ‘ocarina, sweet potato;’ ‘oil filter;’ ‘organ, pipe organ;’ ‘oscilloscope, scope, cathode-ray oscilloscope, CRO;’ ‘oxygen mask;’ ‘paddlewheel, paddle wheel;’ ‘panpipe, pandean pipe, syrinx;’ ‘park bench;’ ‘pencil sharpener;’ ‘Petri dish;’ ‘pick, plectrum, plectron;’ ‘picket fence, paling;’ ‘pill bottle;’ ‘ping-pong ball;’ ‘pinwheel;’ ‘plate rack;’ ‘plunger, plumber’s helper;’ ‘pool table, billiard table, snooker table;’ ‘pot, flowerpot;’ ‘power drill;’ ‘prayer rug, prayer mat;’ ‘prison, prison house;’ ‘punching bag, punch bag, punching ball, punchball;’ ‘quill, quill pen;’ ‘radiator;’ ‘reel;’ ‘remote control, remote;’ ‘rubber eraser, rubber, pencil eraser;’ ‘rule, ruler;’ ‘safe;’ ‘safety pin;’ ‘saltshaker, salt shaker;’ ‘scale, weighing machine;’ ‘screw;’ ‘screwdriver;’ ‘shoji;’ ‘shopping cart;’ ‘shower cap;’ ‘shower curtain;’ ‘ski;’ ‘sleeping bag;’ ‘slot, one-armed bandit;’ ‘snowmobile;’ ‘soap dispenser;’ ‘solar dish, solar collector, solar furnace;’ ‘space heater;’ ‘spatula;’ ‘spider web, spider’s web;’ ‘stove;’ ‘strainer;’ ‘stretcher;’ ‘submarine, pigboat, sub, U-boat;’ ‘swimming trunks, bathing trunks;’ ‘swing;’ ‘switch, electric switch, electrical switch;’ ‘syringe;’ ‘tennis ball;’ ‘thatch, thatched roof;’ ‘theater curtain, theatre curtain;’ ‘thimble;’ ‘throne;’ ‘tile roof;’ ‘toaster;’ ‘tricycle, trike, velocipede;’ ‘turnstile;’ ‘umbrella;’ ‘vending machine;’ ‘waffle iron;’ ‘washer, automatic washer, washing machine;’ ‘water bottle;’ ‘water tower;’ ‘whistle;’ ‘Windsor tie;’ ‘wooden spoon;’
‘wool, woolen, woollen;’ ‘crossword puzzle, crossword;’ ‘traffic light, traffic signal, stoplight;’ ‘ice lolly, lolly, lollipop, popsicle;’ ‘bagel, beigel;’ ‘pretzel;’ ‘hotdog, hot dog, red hot;’ ‘mashed potato;’ ‘broccoli;’ ‘cauliflower;’ ‘zucchini, courgette;’ ‘acorn squash;’ ‘cucumber, cuke;’ ‘bell pepper;’ ‘Granny Smith;’ ‘strawberry;’ ‘orange;’ ‘lemon;’ ‘pineapple, ananas;’ ‘banana;’ ‘jackfruit, jak, jack;’ ‘pomegranate;’ ‘chocolate sauce, chocolate syrup;’ ‘meat loaf, meatloaf;’ ‘pizza, pizza pie;’ ‘burrito;’ ‘bubble;’ ‘volcano;’ ‘corn;’ ‘acorn;’ ‘hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa.’ Their WordNet IDs are as follows. n01443537,n01704323,n01770081, n01784675,n01819313,n01820546,n01910747, n01917289,n01968897,n02074367,n02317335, n02319095,n02395406,n02454379,n02606052, n02655020,n02666196,n02672831,n02730930, n02777292,n02783161,n02786058,n02787622, n02791270,n02808304,n02817516,n02841315, n02865351,n02877765,n02892767,n02906734, n02910353,n02916936,n02948072,n02965783, n03000134,n03000684,n03017168,n03026506, n03032252,n03075370,n03109150,n03126707, n03134739,n03160309,n03196217,n03207743, n03218198,n03223299,n03240683,n03271574, n03291819,n03297495,n03314780,n03325584, n03344393,n03347037,n03372029,n03376595, n03388043,n03388183,n03400231,n03445777, n03457902,n03467068,n03482405,n03483316, n03494278,n03530642,n03544143,n03584829, n03590841,n03598930,n03602883,n03649909, n03661043,n03666591,n03676483,n03692522, n03706229,n03717622,n03720891,n03721384, n03724870,n03729826,n03733131,n03733281, n03742115,n03786901,n03788365,n03794056, n03804744,n03814639,n03814906,n03825788, n03840681,n03843555,n03854065,n03857828, n03868863,n03874293,n03884397,n03891251, n03908714,n03920288,n03929660,n03930313, n03937543,n03942813,n03944341,n03961711, n03970156,n03982430,n03991062,n03995372, n03998194,n04005630,n04023962,n04033901, n04040759,n04067472,n04074963,n04116512, n04118776,n04125021,n04127249,n04131690,
n04141975,n04153751,n04154565,n04201297, n04204347,n04209133,n04209239,n04228054, n04235860,n04243546,n04252077,n04254120, n04258138,n04265275,n04270147,n04275548, n04330267,n04332243,n04336792,n04347754, n04371430,n04371774,n04372370,n04376876, n04409515,n04417672,n04418357,n04423845, n04429376,n04435653,n04442312,n04482393, n04501370,n04507155,n04525305,n04542943, n04554684,n04557648,n04562935,n04579432, n04591157,n04597913,n04599235,n06785654, n06874185,n07615774,n07693725,n07695742, n07697537,n07711569,n07714990,n07715103, n07716358,n07717410,n07718472,n07720875, n07742313,n07745940,n07747607,n07749582, n07753275,n07753592,n07754684,n07768694, n07836838,n07871810,n07873807,n07880968, n09229709,n09472597,n12144580,n12267677, n13052670.
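Since both class lists are given as WordNet synset IDs, a common way to use them is to map each listed ID to its index in the standard sorted 1000-class ImageNet ordering, so that a 1000-way classifier's logits can be restricted to the relevant 200-class subset. The sketch below illustrates this mapping on a tiny hypothetical stand-in for the full class list (the variable names and the truncated ID lists are illustrative, not from the paper):

```python
# Sketch: map a subset of the WordNet IDs listed above to their indices in a
# sorted list of all ImageNet synset IDs. In practice `all_wnids` would hold
# the full 1000 IDs in the usual class ordering; only a few are shown here.

# First few ImageNet-O WordNet IDs from the list above.
imagenet_o_wnids = ["n01443537", "n01704323", "n01770081", "n01784675"]

# Hypothetical stand-in for the full sorted ImageNet class ordering.
all_wnids = sorted([
    "n01440764", "n01443537", "n01704323",
    "n01770081", "n01774750", "n01784675",
])

# Build a lookup from synset ID to class index, then select the subset.
wnid_to_index = {wnid: i for i, wnid in enumerate(all_wnids)}
subset_indices = [wnid_to_index[w] for w in imagenet_o_wnids]
print(subset_indices)  # indices into the (toy) class ordering
```

With the real 1000-ID list in place of the toy `all_wnids`, `subset_indices` can be used to slice a classifier's output, e.g. `logits[:, subset_indices]`, when evaluating on the 200-class subset.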