DOES PROGRESS ON IMAGENET TRANSFER TO REAL-WORLD DATASETS?

Abstract

Does progress on ImageNet transfer to real-world datasets? We investigate this question by evaluating ImageNet pre-trained models with varying accuracy (57% to 83%) on six practical image classification datasets. In particular, we study datasets collected with the goal of solving real-world tasks (e.g., classifying images from camera traps or satellites), as opposed to web-scraped benchmarks collected for comparing models. On multiple datasets, models with higher ImageNet accuracy do not consistently yield performance improvements. For certain tasks, interventions such as data augmentation improve performance even when architectures do not. We hope that future benchmarks will include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.

1. INTRODUCTION

ImageNet is one of the most widely used datasets in machine learning. Initially, the ImageNet competition played a key role in re-popularizing neural networks with the success of AlexNet in 2012. Ten years later, the ImageNet dataset is still one of the main benchmarks for state-of-the-art computer vision models (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016; Liu et al., 2018; Howard et al., 2019; Touvron et al., 2021; Radford et al., 2021). As a result of ImageNet's prominence, the machine learning community has invested tremendous effort into developing model architectures, training algorithms, and other methodological innovations with the goal of increasing performance on ImageNet. Comparing methods on a common task has important benefits because it ensures controlled experimental conditions and results in rigorous evaluations. But the singular focus on ImageNet also raises the question of whether the community is over-optimizing for this specific dataset. As a first approximation, ImageNet has clearly encouraged effective methodological innovation beyond ImageNet itself. For instance, the key finding from the early years of ImageNet was that large convolutional neural networks (CNNs) can succeed on contemporary computer vision datasets by leveraging GPUs for training. This paradigm has led to large improvements in other computer vision tasks, and CNNs are now omnipresent in the field. Nevertheless, this clear example of transfer early in ImageNet's evolution does not necessarily justify the continued focus ImageNet still receives. For instance, it is possible that early methodological innovations transferred more broadly to other tasks, while later innovations have become less generalizable. The goal of our paper is to investigate this possibility specifically for neural network architectures and their transfer to real-world data not commonly found on the Internet.
When discussing the transfer of techniques developed for ImageNet to other datasets, a key question is which other datasets to consider. Currently there is no comprehensive characterization of the many machine learning datasets and transfer between them. Hence we restrict our attention to a limited but well-motivated family of datasets. In particular, we consider classification tasks derived from image data that were specifically collected with the goal of classification in mind. This is in contrast to many standard computer vision datasets, including ImageNet, where the constituent images were originally collected for a different purpose, posted to the web, and later re-purposed for benchmarking computer vision methods. Concretely, we study six datasets ranging from leaf disease classification and melanoma detection to categorizing animals in camera trap images. Since these datasets represent real-world applications, transfer of methods from ImageNet is particularly relevant.

Figure 1: Overview of transfer performance across models from ImageNet to each of the datasets we study. Although there seem to be strong linear trends between ImageNet accuracy and the target metrics (green), these trends become less certain when we restrict the models to those above 70% ImageNet accuracy (blue). Versions with error bars and spline interpolation can be found in Appendix B.

We find that on four out of our six real-world datasets, ImageNet-motivated architecture improvements after VGG resulted in little to no progress (see Figure 1). Specifically, when we fit a line to downstream model accuracies as a function of ImageNet accuracy, the resulting slope is less than 0.05. The two exceptions where post-VGG architectures yield larger gains are the Caltech Camera Traps-20 (CCT-20) (Beery et al., 2018) dataset (slope 0.11) and the Human Protein Atlas Image Classification (Ouyang et al., 2019) dataset (slope 0.29).
On multiple other datasets, we find that task-specific improvements such as data augmentations or extra training data lead to larger gains than using a more recent ImageNet architecture. We evaluate on a representative testbed of 19 ImageNet models, ranging from the seminal AlexNet (Krizhevsky et al., 2012) through VGG (Simonyan & Zisserman, 2015) and ResNets (He et al., 2016) to the more recent and higher-performing EfficientNets (Tan & Le, 2019) and ConvNeXts (Liu et al., 2022) (ImageNet top-1 accuracies 56.5% to 83.4%). Our testbed includes three Vision Transformer models to cover non-CNN architectures. Interestingly, our findings stand in contrast to earlier work on image classification benchmarks such as CIFAR-10 (Krizhevsky & Hinton, 2009), PASCAL VOC 2007 (Everingham et al., 2010), and Caltech-101 (Fei-Fei et al., 2004) that were scraped from the Internet. On these datasets, Kornblith et al. (2019) found consistent gains in downstream task accuracy for a similar range of architectures to those we study in our work. Taken together, these findings indicate that ImageNet accuracy may be a good predictor of performance on other web-scraped datasets, but less informative for real-world image classification datasets that are not sourced through the web. On the other hand, the CCT-20 data point shows that even very recent ImageNet models do help on some downstream tasks that do not rely on images from the web. Overall, our results highlight the need for a more comprehensive understanding of machine learning datasets to build and evaluate broadly useful data representations.
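As an illustration of the fit described above, the slope of downstream accuracy against ImageNet accuracy can be estimated with an ordinary least-squares fit, optionally restricted to models above a given ImageNet accuracy. The accuracy values below are hypothetical stand-ins, not the paper's measurements:

```python
import numpy as np

# Hypothetical (ImageNet top-1, downstream) accuracy pairs for illustration;
# the paper's actual per-model numbers are in its appendices.
imagenet = np.array([0.57, 0.63, 0.70, 0.76, 0.80, 0.83])
downstream = np.array([0.70, 0.74, 0.76, 0.765, 0.77, 0.77])

def fitted_slope(x, y, min_imagenet=0.0):
    """Least-squares slope of downstream accuracy vs. ImageNet accuracy,
    optionally restricted to models above an ImageNet-accuracy threshold
    (the paper repeats its fits on models above 70%)."""
    keep = x >= min_imagenet
    slope, _intercept = np.polyfit(x[keep], y[keep], deg=1)
    return float(slope)

print(round(fitted_slope(imagenet, downstream), 2))        # full model range
print(round(fitted_slope(imagenet, downstream, 0.70), 2))  # models >= 70% only
```

With saturating data like the above, the slope over the full model range is noticeably larger than the slope restricted to strong models, which is the pattern the paper reports on most of its real-world tasks.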

2. RELATED WORK

Transferability of ImageNet architectures. Although there is extensive previous work investigating the effect of architecture on the transferability of ImageNet-pretrained models to different datasets, most of this work focuses on performance on datasets collected for the purpose of benchmarking. Kornblith et al. (2019) previously showed that the ImageNet accuracy of different models is strongly correlated with downstream accuracy on a wide variety of web-scraped, object-centric computer vision benchmark tasks. Later studies have investigated the relationship between ImageNet and transfer accuracy for self-supervised networks (Ericsson et al., 2021; Kotar et al., 2021; Nayman et al., 2022), adversarially trained networks (Salman et al., 2020), and networks trained with different loss functions (Kornblith et al., 2021), but still evaluate primarily on web-scraped benchmark tasks. The Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019) comprises a more diverse set of tasks, including natural and non-natural classification tasks as well as non-classification tasks, but nearly all consist of web-scraped or synthetic images. In the medical imaging domain, models have been extensively evaluated on real-world data, with limited gains from newer models that perform better on ImageNet (Raghu et al., 2019; Bressem et al., 2020; Ke et al., 2021). Most closely related to our work, Tuggener et al. (2021) investigate the performance of 500 CNN architectures on yet another set of datasets, several of which are not web-scraped, and find that accuracy correlates poorly with ImageNet accuracy when training from scratch, but that correlations are higher when fine-tuning ImageNet-pretrained models. Our work differs from theirs in our focus solely on real-world datasets (e.g., from Kaggle competitions) and in that we perform extensive tuning in order to approach the best single-model performance obtainable on these datasets, whereas Tuggener et al. (2021) instead devote their compute budget to increasing the breadth of architectures investigated.

Transferability of networks trained on other datasets. Other work has evaluated the transferability of representations of networks trained on datasets beyond ImageNet. Most notably, Abnar et al. (2022) explore the relationship between upstream and downstream accuracy for models pretrained on JFT and ImageNet-21K and find that, on many tasks, downstream accuracy saturates with upstream accuracy. However, they evaluate representational quality using linear transfer rather than end-to-end fine-tuning. Other studies have investigated the impact of relationships between pretraining and fine-tuning tasks (Zamir et al., 2018; Mensink et al., 2021) or the impact of scaling the model and dataset (Goyal et al., 2019; Kolesnikov et al., 2020). A further direction of related work concerns the effect of pretraining data on transfer learning. Huh et al. (2016) look into the factors that make ImageNet good for transfer learning. They find that fine-grained classes are not needed for good transfer performance, and that reducing the dataset size and number of classes results in only slight drops in transfer learning performance. Though there is a common goal of exploring what makes transfer learning work well, our work differs from this line of work by focusing on the fine-tuning aspect of transfer learning.

Other studies of external validity of benchmarks. Our study fits into a broader literature investigating the external validity of image classification benchmarks. Early work in this area identified lack of diversity as a key shortcoming of the benchmarks of the time (Ponce et al., 2006; Torralba & Efros, 2011), a problem that was largely resolved with the introduction of the much more diverse ImageNet benchmark (Deng et al., 2009; Russakovsky et al., 2015).
More recent studies have investigated the extent to which ImageNet classification accuracy correlates with accuracy on out-of-distribution (OOD) data (Recht et al., 2019; Taori et al., 2020) or accuracy as measured using higher-quality human labels (Shankar et al., 2020; Tsipras et al., 2020; Beyer et al., 2020). As in previous studies of OOD generalization, transfer learning involves generalization to test sets that differ in distribution from the (pre-)training data. However, there are also key differences between transfer learning and OOD generalization. First, in transfer learning, additional training data from the target task is used to adapt the model, while OOD evaluations usually apply trained models to a new distribution without any adaptation. Second, OOD evaluations usually focus on settings with a shared class space so that evaluations without adaptation are possible. In contrast, transfer learning evaluation generally involves downstream tasks with classes different from those in the pretraining dataset. These differences between transfer learning and OOD generalization are not only conceptual but also lead to different empirical phenomena. Miller et al. (2021) have shown that in-distribution accuracy improvements often directly yield out-of-distribution accuracy improvements as well. This is the opposite of our main experimental finding that ImageNet improvements do not directly yield performance improvements on many real-world downstream tasks. Hence our work demonstrates an important difference between OOD generalization and transfer learning.

3. DATASET SELECTION

As mentioned in the introduction, a key choice in any transfer study is the set of target tasks on which to evaluate model performance. Before we introduce our suite of target tasks, we first describe three criteria that guided our dataset selection: (i) diverse data sources, (ii) relevance to an application, and (iii) availability of well-tuned baseline models for comparison.

3.1. SELECTION CRITERIA

Prior work has already investigated transfer of ImageNet architectures to many downstream datasets (Donahue et al., 2014; Sharif Razavian et al., 2014; Chatfield et al., 2014; Simonyan & Zisserman, 2015). The 12 datasets used by Kornblith et al. (2019) often serve as a standard evaluation suite (e.g., in Salman et al. (2020); Ericsson et al. (2021); Radford et al. (2021)). While these datasets are an informative starting point, they are all object-centric natural image datasets and do not represent the entire range of image classification problems. There are many applications of computer vision; the Kaggle website alone lists more than 1,500 datasets as of May 2022. To understand transfer from ImageNet more broadly, we selected six datasets guided by the following criteria.

Diverse data sources. Since collecting data is an expensive process, machine learning researchers often rely on web scraping to gather data when assembling a new benchmark. This practice has led to several image classification datasets with different label spaces such as food dishes, bird species, car models, or other everyday objects. However, the data sources underlying these seemingly different tasks are often similar. Specifically, we surveyed the 12 datasets from Kornblith et al. (2019) and found that all of them were harvested from the web, often via keyword searches on Flickr, Google image search, or other search engines (see Appendix K). This narrow range of data sources limits the external validity of existing transfer learning experiments. To get a broader understanding of transfer from ImageNet, we focus on scientific, commercial, and medical image classification datasets that were not originally scraped from the web.

Application relevance. In addition to the data source, the classification task posed on a given set of images also affects how relevant the resulting problem is for real-world applications.
For instance, it would be possible to start with real-world satellite imagery that shows multiple building types per image, but to label only one of the building types for the purpose of benchmarking (e.g., to avoid high annotation costs). The resulting task may then be of limited value for an actual application involving the satellite images that requires all buildings to be annotated. We aim to avoid such pitfalls by limiting our attention to classification tasks that were assembled by domain experts with a specific application in mind.

Availability of baselines. If methodological progress does not transfer from ImageNet to a given target task, we should expect that, as models perform better on ImageNet, accuracy on the target task saturates. However, observing such a trend in an experiment is not sufficient to reach a conclusion regarding transfer, because there is an alternative explanation for this empirical phenomenon. Besides a lack of transfer, the target task could also simply be easier than the source task, so that models with sub-optimal source task accuracy already approach the Bayes error rate. As an illustrative example, consider MNIST as a target task for ImageNet transfer. A model with mediocre ImageNet accuracy is already sufficient to get 99% accuracy on MNIST, but this saturation does not indicate a failure of transfer: the models have simply hit the MNIST performance ceiling. More interesting failures of transfer occur when ImageNet architectures plateau on the target task, but it is still possible to improve accuracy beyond what the best ImageNet architecture can achieve without target task-specific modifications. In order to make such comparisons, well-tuned baselines for the target task are essential. If improving ImageNet accuracy alone is insufficient to reach these well-tuned baselines, we can indeed conclude that architecture transfer to this target task is limited.
In our experiments, we use multiple datasets from Kaggle competitions since the resulting leaderboards offer well-tuned baselines arising from a competitive process.

3.2. DATASETS STUDIED

The datasets studied in this work are practical and cover a variety of applications. We choose four of the most popular image classification competitions on Kaggle, as measured by the number of competitors, teams, and submissions. Each of these competitions is funded by an organization with the goal of advancing performance on that real-world task. Additionally, we supplement these datasets with Caltech Camera Traps (Beery et al., 2018) and EuroSAT (Helber et al., 2019) to broaden the types of applications studied. Details for each dataset can be found in Table 1.

4. MAIN EXPERIMENTS

We run our experiments across 19 model architectures, including both CNNs and Vision Transformers (ViT and DeiT). They range from 57% to 83% ImageNet top-1 accuracy, allowing us to observe the relationship between ImageNet performance and target dataset performance. To get the best performance out of each architecture, we do extensive hyperparameter tuning over learning rate, weight decay, optimizer, and learning schedule. Details about our experimental setup can be found in Appendix C. We now present our results for each of the datasets we investigated. Figure 1 summarizes our results across all datasets, with additional statistics in Table 2. Appendix A contains complete results for all datasets across the hyperparameter grids.

4.1. CALTECH CAMERA TRAPS-20

We see in Figure 1 (top-left) an overall positive trend between ImageNet performance and CCT-20 performance. The overall trend is unsurprising given the number of animal classes present in ImageNet. But despite the drastic reduction in the number of classes compared to ImageNet, CCT-20 has its own set of challenges. Animals are often pictured at difficult angles, and are sometimes not visible in the image at all, because every frame in an activity-triggered sequence shares the same label. Despite these challenges, an even higher-performing model still does better on this task: we train a CLIP ViT-L/14-336px model (85.4% ImageNet top-1) with additional augmentation to achieve 83.4% accuracy on CCT-20.
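The per-architecture tuning described in this section amounts to fine-tuning each model once per hyperparameter combination and keeping the best validation result. A minimal sketch of the grid enumeration, with hypothetical values (the grid actually used is specified in the paper's Appendix C):

```python
from itertools import product

# Hypothetical hyperparameter grid; the paper's exact values differ.
grid = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "weight_decay": [0.0, 1e-4],
    "optimizer": ["sgd", "adamw"],
    "schedule": ["cosine", "step"],
}

def configs(grid):
    """Enumerate every hyperparameter combination; each fine-tuning run
    uses one combination, and the best is picked on validation data."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

runs = list(configs(grid))
print(len(runs))  # 4 * 2 * 2 * 2 = 32 runs per (model, dataset) pair
```

Each of the 19 architectures is tuned independently on each dataset, so reported numbers approximate the best accuracy each architecture can reach rather than its accuracy under a single shared configuration.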

4.2. APTOS 2019 BLINDNESS DETECTION

This dataset was created for a Kaggle competition run by the Asia Pacific Tele-Ophthalmology Society (APTOS) with the goal of advancing medical screening for diabetic retinopathy in rural areas (Asia Pacific Tele-Ophthalmology Society, 2019). Images are taken using fundus photography and vary in terms of clinic source, camera used, and time taken. Images are labeled by clinicians on a scale of 0 to 4 for the severity of diabetic retinopathy. Given the ordinal nature of the labels, the competition uses quadratic weighted kappa (QWK) as the evaluation metric. We create a local 80%/20% random class-balanced train/validation split, as the competition test labels are hidden. We find that models after VGG do not show significant improvement. As on CCT-20, DeiT models and EfficientNets perform slightly worse, while deeper models from the same architecture family slightly help performance. We also find that accuracy follows a trend similar to QWK, despite being an inferior metric in the context of this dataset. When performance stagnates, one might ask whether we have reached a performance limit for our class of models on the dataset. As a comparison, the ensemble from the top leaderboard entry included a single Inception-ResNet v2 model that, trained with additional interventions, achieves 0.927 QWK. We also submitted the models we trained to Kaggle, finding that models trained with these additional interventions score at least 0.03 QWK points higher than our original models. See Appendix F for additional experimental details. Both this result and the gap between our models and the top leaderboard models show that there exist interventions that do improve task performance.
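The QWK metric used for this competition is available in scikit-learn as Cohen's kappa with quadratic weights, which penalizes a prediction that is off by two severity grades four times as heavily as one that is off by one grade. A small sketch with made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical clinician labels vs. model predictions on the 0-4
# diabetic-retinopathy severity scale (illustration only).
y_true = [0, 1, 2, 3, 4, 2, 1, 0, 3, 4]
y_pred = [0, 1, 2, 2, 4, 3, 1, 0, 3, 4]

# weights="quadratic" makes the penalty grow with the squared distance
# between the predicted and true grade.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(round(qwk, 3))
```

Unlike plain accuracy, QWK also corrects for chance agreement, which matters on a dataset where most images fall into the mild-severity grades.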

4.3. HUMAN PROTEIN ATLAS IMAGE CLASSIFICATION

The Human Protein Atlas runs the Human Protein Atlas Image Classification competition on Kaggle to build an automated tool for identifying and locating proteins in high-throughput microscopy images (Ouyang et al., 2019). Images can contain several of the 28 different proteins, so the competition uses the macro F1 score; given the multi-label nature of the problem, this requires thresholding the predictions. We use a 73% / 18% / 9% train / validation / test-validation split created by a previous competitor (Park, 2019). We report results on the validation split, as we find that the thresholds selected for the larger validation split generalize well to the smaller test-validation split. We find a slightly positive trend between task performance and ImageNet performance, even when ignoring AlexNet and MobileNet. This is surprising because ImageNet is quite visually distinct from human protein slides. These results suggest that models with more parameters can help downstream performance, especially on tasks that have a lot of room for improvement. Specific challenges for this dataset are extreme class imbalance, multi-label thresholding, and generalization from the training data to the test set. Competitors were able to improve performance beyond the baselines we found by using external data as well as techniques such as data cleaning, additional training augmentation, test-time augmentation, ensembling, and oversampling (Dai, 2019; Park, 2019; Shugaev, 2019). Additionally, some competitors modified commonly-used architectures by substituting pooling layers or incorporating attention (Park, 2019; Zheng, 2019). Uniquely, the first-place solution used metric learning on top of a single DenseNet121 (Dai, 2019). These techniques may be useful when applied to other datasets, but are rarely used in a typical workflow.
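The multi-label thresholding step mentioned above can be sketched as picking, per class, the decision threshold that maximizes that class's F1 on validation data; macro F1 is then the mean of the per-class F1 scores. The labels and scores below are synthetic, and the toy class count stands in for HPA's 28 proteins:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, n_classes = 300, 4   # toy stand-in for HPA's 28 protein classes
y_true = rng.integers(0, 2, size=(n, n_classes))
# Hypothetical sigmoid outputs, loosely correlated with the labels.
scores = np.clip(0.55 * y_true + 0.45 * rng.random((n, n_classes)), 0.0, 1.0)

def per_class_thresholds(y, s, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, per class, the decision threshold that maximizes that class's
    F1; in practice the thresholds are chosen on a validation split and
    reused on the test split."""
    return np.array([
        max(grid, key=lambda t: f1_score(y[:, c], s[:, c] >= t, zero_division=0))
        for c in range(y.shape[1])
    ])

t = per_class_thresholds(y_true, scores)
tuned = f1_score(y_true, scores >= t, average="macro", zero_division=0)
fixed = f1_score(y_true, scores >= 0.5, average="macro", zero_division=0)
print(round(tuned, 3), round(fixed, 3))  # tuned thresholds never do worse
```

Because macro F1 weights all classes equally, tuning thresholds per class matters most for the rare classes that drive the extreme class imbalance noted above.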

4.4. SIIM-ISIC MELANOMA CLASSIFICATION

The Society for Imaging Informatics in Medicine (SIIM) and the International Skin Imaging Collaboration (ISIC) jointly ran this Kaggle competition for identifying melanoma, a serious type of skin cancer (SIIM & ISIC, 2020). Competitors use images of skin lesions to predict the probability that each observed image is malignant. Images come from the ISIC Archive, which is publicly available and contains images from a variety of countries. The competition provided 33,126 training images, plus an additional 25,331 images from previous competitions. We split the combined data into an 80%/20% class-balanced and year-balanced train/validation split. Given the imbalanced nature of the data (8.8% positive), the competition uses area under the ROC curve (AUROC) as the evaluation metric. We find only a weak positive correlation (0.44) between ImageNet performance and task performance, and the fitted regression line has a normalized slope close to zero (0.05). If we instead look at classification accuracy, however, Appendix H shows a stronger transfer trend than for AUROC, as task accuracy more closely follows the order of ImageNet performance. This difference shows that characterizing the relationship between better ImageNet models and better transfer performance depends on the evaluation metric as well. We use a relatively simple setup to measure the impact of ImageNet models on task performance, but better results are achievable with additional strategies: the top two Kaggle solutions used models with different input sizes, ensembling, cross-validation, and a significant variety of training augmentations to create a stable model that generalized to the hidden test set (Ha et al., 2020; Pan, 2020).
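AUROC is the natural metric here because, at 8.8% positives, a trivial classifier already achieves high accuracy. A short sketch with synthetic labels and scores (the positive rate mimics the competition's, everything else is made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical labels at roughly the competition's 8.8% positive rate.
y = (rng.random(n) < 0.088).astype(int)
# Hypothetical malignancy scores that separate the classes fairly well.
p = np.clip(0.5 * y + rng.normal(0.3, 0.15, n), 0.0, 1.0)

auroc = roc_auc_score(y, p)
# Accuracy is a poor headline metric here: predicting "benign" for every
# image already gets roughly 91% accuracy while being clinically useless.
trivial_acc = (y == 0).mean()
print(round(auroc, 3), round(trivial_acc, 3))
```

AUROC is insensitive to the class balance and to any monotone rescaling of the scores, which is also why accuracy and AUROC can rank the same set of models differently, as the Appendix H comparison above illustrates.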

4.5. CASSAVA LEAF DISEASE CLASSIFICATION

The Makerere Artificial Intelligence Lab is an academic research group focused on applications that benefit the developing world. Their goal in creating the Cassava Leaf Disease Classification Kaggle competition (Makerere University AI Lab, 2021) was to give farmers access to methods for diagnosing plant diseases, which could allow farmers to prevent these diseases from spreading, increasing crop yield. Images were taken with an inexpensive camera and labeled by agricultural experts. Each image was classified as healthy or as one of four different diseases. We report results using an 80%/20% random class-balanced train/validation split of the provided training data. Once we ignore models below 70% ImageNet accuracy, the relationship between performance on the two datasets has both a weak positive correlation (0.12) and a near-zero normalized slope (0.02).

4.6. EUROSAT

Prior work reports very high accuracy on EuroSAT (Helber et al., 2019), including approaches that fine-tune ImageNet pre-trained models (Naushad et al., 2021) and approaches that use all 13 spectral bands (Yassine et al., 2021). We use RGB images and keep our experimental setup consistent to compare across a range of models. Since there is no set train/test split, we create an 80%/20% class-balanced split. All models over 60% ImageNet accuracy achieve over 98.5% EuroSAT accuracy, and the majority of our models achieve over 99.0% EuroSAT accuracy. There are certain tasks where using better ImageNet models does not improve performance, and EuroSAT is the extreme case, where performance is essentially saturated. While it is outside the scope of this study, a natural next step would be to investigate the remaining errors and find other methods to eliminate this last bit of error.
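The class-balanced train/validation splits used for several of the datasets above can be produced with a stratified split, which preserves each class's share in both partitions. A minimal sketch with hypothetical labels (five classes standing in for "healthy" plus four diseases):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical labels: "healthy" plus four diseases, heavily imbalanced.
labels = rng.choice(5, size=1000, p=[0.1, 0.5, 0.2, 0.15, 0.05])
indices = np.arange(len(labels))

# stratify= keeps each class's share (nearly) equal across the two splits.
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=0)

for c in range(5):
    print(c, round((labels[train_idx] == c).mean(), 3),
             round((labels[val_idx] == c).mean(), 3))
```

Without stratification, a rare class (here the 5% one) can end up badly under-represented in a 20% validation split, which would make the per-class validation metrics noisy.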

5. ADDITIONAL EXPERIMENTS

5.1. AUGMENTATION ABLATIONS

In our main experiments, we keep augmentation simple to minimize confounding factors when comparing models. However, it is possible that pre-training and fine-tuning with different combinations of augmentations yield different results. This is an important point because different architectures may have different inductive biases and often use different augmentation strategies at pre-training time. To investigate these effects, we run additional experiments on CCT-20 and APTOS to explore the effect of data augmentation on transfer. Specifically, we take ResNet-50 models pre-trained with standard crop-and-flip augmentation, AugMix (Hendrycks et al., 2020), and RandAugment (Cubuk et al., 2020), and then fine-tune with our default augmentation, AugMix, and RandAugment. We also study DeiT-tiny and DeiT-small models by fine-tuning with the same three augmentations. We choose to examine DeiT models because they are pre-trained using RandAugment and RandErasing (Zhong et al., 2020). We increase the number of fine-tuning epochs from 30 to 50 to account for the stronger augmentation. Our experimental results are in Appendix G. In our ResNet-50 experiments, both AugMix and RandAugment improve performance on ImageNet, but while pre-training with RandAugment improves performance on downstream tasks, pre-training with AugMix does not. Furthermore, fine-tuning with RandAugment usually yields additional performance gains compared to our default fine-tuning augmentation, no matter which pre-trained model is used. For DeiT models, we found that additional augmentation did not significantly increase performance on the downstream tasks. Thus, as with architectures, augmentation strategies that improve accuracy on ImageNet do not always improve accuracy on real-world tasks.

5.2. CLIP MODELS

A natural follow-up to our experiments is to change the source of pre-training data. We examine CLIP models from Radford et al. (2021), which use diverse pre-training data and achieve high performance on a variety of downstream datasets. We fine-tune CLIP models on each of our downstream datasets by linear probing then fine-tuning (LP-FT) (Kumar et al., 2022). Our results are visualized by the purple stars in Appendix I, Figure 8. We see that by using a model that takes larger images we can do better than all previous models, and even without the larger images, ViT-L/14 does better on four out of the six datasets. While across all CLIP models the change in pre-training data increases performance on CCT-20, the effect on the other datasets is more complicated. When controlling for architecture changes by looking only at ResNet-50 and ViT-B/16, we see that the additional pre-training data helps on CCT-20, HPA, and Cassava; the former two are the datasets that empirically benefit most from better ImageNet models. Additional results can be found in Appendix I, and additional fine-tuning details in Appendix J.
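LP-FT first trains only a new linear head on frozen features, then unfreezes everything and fine-tunes at a lower learning rate, which helps preserve the pretrained features. A toy numpy sketch of the two phases, in which a fixed random linear "backbone" stands in for a pretrained network and all data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic binary classification data.
X = rng.normal(size=(400, 16))
true_w = rng.normal(size=16)
y = (X @ true_w + 0.5 * rng.normal(size=400) > 0).astype(float)

W_backbone = rng.normal(size=(16, 8)) / 4.0   # "pretrained", initially frozen
w_head = np.zeros(8)                          # fresh linear probe

def loss_and_grads(W, w):
    """Logistic loss of head-on-backbone, with gradients for both parts."""
    z = np.clip(X @ W @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    err = (p - y) / len(y)                    # dL/dz for the mean logistic loss
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    g_w = (X @ W).T @ err                     # head gradient
    g_W = X.T @ np.outer(err, w)              # backbone gradient
    return loss, g_W, g_w

# Phase 1: linear probe (train the head only, backbone frozen).
for _ in range(200):
    _, _, g_w = loss_and_grads(W_backbone, w_head)
    w_head -= 1.0 * g_w
lp_loss, _, _ = loss_and_grads(W_backbone, w_head)

# Phase 2: fine-tune everything with a smaller learning rate.
for _ in range(200):
    _, g_W, g_w = loss_and_grads(W_backbone, w_head)
    W_backbone -= 0.1 * g_W
    w_head -= 0.1 * g_w
ft_loss, _, _ = loss_and_grads(W_backbone, w_head)

print(round(lp_loss, 3), round(ft_loss, 3))  # fine-tuning reduces loss further
```

This is only a caricature of the procedure applied to CLIP in the paper, but it captures the key ordering: the head is fitted first so that the later full fine-tune starts from a sensible decision boundary.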

6. DISCUSSION

Alternative explanations for saturation. Whereas Kornblith et al. (2019) reported a high degree of correlation between ImageNet and transfer accuracy, we find that better ImageNet models do not consistently transfer better on our real-world tasks. We believe these differences are related to the tasks themselves. Here, we rule out alternative hypotheses for our findings. A comparison of dataset statistics suggests that the number of classes and dataset size do not explain the differences from Kornblith et al. (2019). The datasets we study range from two to 28 classes. Although most of the datasets studied in Kornblith et al. (2019) have more classes, CIFAR-10 has 10. In Appendix E, we replicate the CIFAR-10 results from Kornblith et al. (2019) using our experimental setup, finding a strong correlation between ImageNet accuracy and transfer accuracy. Thus, the number of classes is likely not the determining factor. Training set sizes are similar between our study and that of Kornblith et al. (2019) and thus also do not seem to play a major role. A third hypothesis is that it is parameter count, rather than ImageNet accuracy, that drives the trends. VGG BN models appear to outperform their ImageNet accuracy on multiple datasets, and they are among the largest models by parameter count. However, in Appendix L, we find that model size is also not a good indicator of improved transfer performance on real-world datasets.

Differences between web-scraped datasets and real-world images. We conjecture that it is possible to perform well on most, if not all, web-scraped target datasets simply by collecting a very large amount of data from the Internet and training a very large model on it. Web-scraped target datasets are by definition within the distribution of data collected from the web, and a sufficiently large model can learn that distribution.
In support of this conjecture, recent models such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), ViT-G (Zhai et al., 2022), BASIC (Pham et al., 2021), and CoCa (Yu et al., 2022) are trained on very large web-scraped datasets and achieve high accuracy on a variety of web-scraped benchmarks. However, this strategy may not be effective for non-web-scraped datasets, where there is no guarantee that we will train on data close in distribution to the target data, even if we train on the entire web. Thus, it makes sense to distinguish these two types of datasets. There are clear differences in image distribution between the non-web-scraped datasets we consider and the web-scraped datasets considered by previous work. In Figure 3 and Appendix M, we compute the Fréchet inception distance (FID) (Heusel et al., 2017) between ImageNet and each of the datasets we study in this work, as well as those found in Kornblith et al. (2019). The real-world datasets are further from ImageNet than those in Kornblith et al. (2019), implying a large distribution shift between web-scraped and real-world datasets. However, FID is only a proxy measure and may not capture all factors that lead to differences in transferability. Whereas web-scraped data is cheap to acquire, real-world data can be much more expensive. Ideally, progress in computer vision should improve performance not just on web-scraped data, but also on real-world tasks. Our results suggest that the latter has not happened. Gains in ImageNet accuracy over the last decade have primarily come from improving and scaling architectures, and past work has shown that these gains generally transfer to other web-scraped datasets, regardless of size (Sun et al., 2017; Kornblith et al., 2019; Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020). However, we find that improvements arising from architecture generally do not transfer to non-web-scraped tasks.
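For reference, FID reduces to the Fréchet distance between two Gaussians fitted to feature statistics (in the standard recipe, Inception-v3 pool3 activations). The sketch below shows only that distance computation on synthetic features, not the feature extraction itself:

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians summarizing feature
    statistics: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary numerical noise
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(500, 4))            # stand-in for pooled features
feats_b = rng.normal(loc=1.0, size=(500, 4))   # a shifted "dataset"

mu_a, sig_a = feats_a.mean(0), np.cov(feats_a, rowvar=False)
mu_b, sig_b = feats_b.mean(0), np.cov(feats_b, rowvar=False)
print(round(fid(mu_a, sig_a, mu_b, sig_b), 2))  # clearly positive
print(round(fid(mu_a, sig_a, mu_a, sig_a), 6))  # identical statistics: ~0
```

Because FID only compares the first two moments of the feature distributions, it is a coarse proxy for distribution shift, consistent with the caveat above.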
Nonetheless, data augmentation and other tweaks can provide further gains on these tasks.

Recommendations towards better benchmarking. While it is unclear whether researchers have over-optimized for ImageNet, our work suggests that researchers should explicitly search for methods that improve accuracy on real-world, non-web-scraped datasets, rather than assuming that methods that improve accuracy on ImageNet will provide meaningful improvements on real-world datasets as well. Just as there are methods that improve accuracy on ImageNet but not on the tasks we investigate, there may be methods that improve accuracy on our tasks but not on ImageNet. The Kaggle community provides some evidence for the existence of such methods; Kaggle submissions often explore architectural improvements that are less common in traditional ImageNet pre-trained models. To measure such improvements on real-world problems, we suggest simply using the average accuracy across our tasks as a benchmark for future representation learning research.

Further analysis of our results shows consistencies in the accuracies of different models across the non-web-scraped datasets, suggesting that accuracy improvements on these datasets may translate to other datasets. For each dataset, we use linear regression to predict model accuracies on the target dataset as a linear combination of ImageNet accuracy and accuracy averaged across the other real-world datasets. We perform an F-test to determine whether the average accuracy on the other real-world datasets explains significant variance beyond that explained by ImageNet accuracy. We find that this F-test is significant on all datasets except EuroSAT, where accuracy may be very close to the ceiling (see further analysis in Appendix N.1).
Additionally, in Appendix N.2 we compare the Spearman rank correlation (i.e., the Pearson correlation between ranks) between each dataset and the accuracy averaged across the other real-world datasets to the Spearman correlation between each dataset and ImageNet. We find that the correlation with the average over real-world datasets is higher than the correlation with ImageNet and statistically significant for CCT-20, APTOS, HPA, and Cassava. Thus, there is some signal in the average accuracy across the datasets we investigate that is not captured by ImageNet top-1 accuracy.

Where do our findings leave ImageNet? We suspect that most of the methodological innovations that help on ImageNet are useful for some real-world tasks, and in that sense it has been a successful benchmark. However, the innovations that improve performance on industrial web-scraped datasets such as JFT (Sun et al., 2017) or IG-3.5B-17k (Mahajan et al., 2018) (e.g., model scaling) may be almost entirely disjoint from the innovations that help with the non-web-scraped real-world tasks studied here (e.g., data augmentation strategies). We hope that future benchmarks will include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.

We examine 19 model architectures in this work that cover a diverse range of ImageNet accuracies in order to observe the relationship between ImageNet performance and target dataset performance. In addition to commonly used CNNs, we also include data-efficient image transformers (DeiT) due to the recent increase in usage of Vision Transformers. Additional model details are in Table 4.
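The rank comparison described in Appendix N.2 is simple to reproduce. The sketch below implements Spearman correlation from scratch (Pearson correlation between ranks, assuming no ties); the accuracy vectors are hypothetical stand-ins, not the paper's numbers, which live in Appendix A.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation between ranks.
    Uses argsort-of-argsort ranking, which assumes no ties."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical accuracies for a handful of models (illustrative only).
imagenet_acc = np.array([57.0, 63.2, 69.8, 73.3, 76.1, 79.3, 81.8])
target_acc = np.array([61.0, 60.5, 66.2, 64.0, 67.5, 66.9, 68.1])  # one real-world task
other_avg = np.array([60.2, 59.8, 66.8, 63.1, 68.0, 67.2, 68.5])   # avg of other five tasks

print(spearman(target_acc, imagenet_acc))  # correlation with ImageNet accuracy
print(spearman(target_acc, other_avg))     # correlation with other real-world tasks
```

With these stand-in numbers, the rank correlation with the other-datasets average exceeds the rank correlation with ImageNet, mirroring the pattern reported for CCT-20, APTOS, HPA, and Cassava.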

C.2 HYPERPARAMETER GRID

Hyperparameter tuning is a key part of neural network training, as suboptimal hyperparameters can lead to suboptimal performance. Furthermore, the best hyperparameters vary across both models and training data. To get the best performance out of each model, we train each model with AdamW with a cosine decay learning rate schedule, SGD with a cosine decay learning rate schedule, and SGD with a multi-step decay learning rate schedule. We also grid search over initial learning rate and weight decay combinations, searching logarithmically from 10^-1 to 10^-4 for the SGD learning rate, 10^-2 to 10^-5 for the AdamW learning rate, and 10^-3 to 10^-6 as well as 0 for weight decay. All models are pre-trained on ImageNet and then fine-tuned on the downstream task. Additional training details for each dataset can be found in Appendix D. We also run our hyperparameter grid on CIFAR-10 in Appendix E to verify that we find a strong relationship between ImageNet and CIFAR-10 accuracy, as previously reported by Kornblith et al. (2019).
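Concretely, this grid amounts to 60 configurations per model-dataset pair. A minimal sketch of its enumeration (the optimizer and scheduler names here are plain labels, not framework objects):

```python
from itertools import product

# Learning-rate grids from the text: SGD 10^-1..10^-4, AdamW 10^-2..10^-5.
sgd_lrs = [1e-1, 1e-2, 1e-3, 1e-4]
adamw_lrs = [1e-2, 1e-3, 1e-4, 1e-5]
weight_decays = [1e-3, 1e-4, 1e-5, 1e-6, 0.0]  # 10^-3..10^-6 and 0

# Three optimizer/schedule combinations, each crossed with lr x weight decay.
configs = []
for optimizer, schedule, lrs in [("sgd", "cosine", sgd_lrs),
                                 ("sgd", "multistep", sgd_lrs),
                                 ("adamw", "cosine", adamw_lrs)]:
    for lr, wd in product(lrs, weight_decays):
        configs.append({"optimizer": optimizer, "schedule": schedule,
                        "lr": lr, "weight_decay": wd})

print(len(configs))  # 3 combinations x 4 learning rates x 5 weight decays = 60
```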

D TRAINING DETAILS BY DATASET (IMAGENET MODELS)

Experiments on the Cassava Leaf Disease, SIIM-ISIC Melanoma, and EuroSAT datasets were run on TPU v2-8s, while all other datasets were run on NVIDIA A40s. All experiments were run with a mini-batch size of 128. For SGD experiments, we use Nesterov momentum, set momentum to 0.9, and try learning rates of 1e-1, 1e-2, 1e-3, and 1e-4. For AdamW experiments, we try learning rates of 1e-2, 1e-3, 1e-4, and 1e-5. For all experiments, we try weight decays of 1e-3, 1e-4, 1e-5, 1e-6, and 0. For all experiments, we use weights that are pre-trained on ImageNet. AlexNet, DenseNet, MobileNet, ResNet, ResNeXt, ShuffleNet, SqueezeNet, and VGG models are from torchvision, while ConvNeXt, DeiT, EfficientNet, InceptionResNet, and PNASNet models are from timm.

Additionally, we normalize images to ImageNet's mean and standard deviation. For EuroSAT, we random resize crop to 224 with area at least 0.65. For all other datasets, we random resize crop with area at least 0.65 to 224 for DeiT models and 256 for all other models. Additionally, we use horizontal flips. For Human Protein Atlas, Cassava Leaf Disease, and SIIM-ISIC Melanoma, we also use vertical flips. For SIIM-ISIC Melanoma, we train for 10 epochs, and for the step scheduler we decay with factor 0.1 at 5 epochs. For all other datasets, we train for 30 epochs, and for the step scheduler we decay with factor 0.1 at 15, 20, and 25 epochs.

E CIFAR-10 ON HYPERPARAMETER GRID

Models used here are trained using AdamW with a cosine scheduler. We random resize crop to 512, use random rotations, and use color jitter (brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1). We train on all the available training data, no longer using the local train/validation split mentioned in the main text. This includes both the training data from the 2019 competition and data from a prior 2015 diabetic retinopathy competition.

Figure 9: We compare model size with downstream transfer performance. Again we use separate trend lines for all models (green) and only those above 70% ImageNet accuracy (blue). We use 95% confidence intervals computed with Clopper-Pearson for accuracy metrics and bootstrap with 10,000 trials for other metrics.
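The per-dataset settings in Appendix D can be collected into a small configuration helper. This is an illustrative sketch: the function and dataset-name strings are ours, not from the released code.

```python
def finetune_config(dataset: str, model_family: str) -> dict:
    """Sketch of the per-dataset fine-tuning settings from Appendix D.
    Dataset names here are shorthand labels (e.g. "Melanoma" for SIIM-ISIC)."""
    # Vertical flips only for HPA, Cassava, and Melanoma; horizontal for all.
    vertical_flip = dataset in {"HPA", "Cassava", "Melanoma"}
    # EuroSAT crops to 224; otherwise 224 for DeiT models, 256 for the rest.
    if dataset == "EuroSAT":
        crop = 224
    else:
        crop = 224 if model_family == "DeiT" else 256
    # Melanoma trains 10 epochs (step decay at 5); others 30 (15, 20, 25).
    if dataset == "Melanoma":
        epochs, milestones = 10, [5]
    else:
        epochs, milestones = 30, [15, 20, 25]
    return {"crop": crop, "min_area": 0.65, "hflip": True,
            "vflip": vertical_flip, "epochs": epochs,
            "step_milestones": milestones, "step_gamma": 0.1}

print(finetune_config("Melanoma", "ResNet"))
```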

M FID SCORE DETAILS

Table 11: We calculate FID scores between the ImageNet validation set and each of the datasets we study, as well as between the ImageNet validation set and each of the datasets in Kornblith et al. (2019). We found that dataset size affects the FID score, so we take a 3,662-image subset of each downstream dataset; 3,662 is the size of APTOS, the smallest dataset.

We observe that, on many non-web-scraped datasets, accuracy correlates only weakly with ImageNet accuracy. It is thus worth asking whether other predictors might correlate better. In this section, we examine the extent to which accuracy on a given non-web-scraped target dataset can be predicted from accuracy on the other non-web-scraped target datasets.
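As a reference for how these scores are computed, FID is the Fréchet distance between two Gaussians fit to Inception features of each dataset. The sketch below computes that distance from given means and covariances; the Inception-v3 feature extraction is omitted, and the helper names are ours.

```python
import numpy as np

def sym_sqrt(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)
    (Heusel et al., 2017): ||mu1-mu2||^2 + Tr(s1 + s2 - 2 (s1 s2)^{1/2}).
    Tr((s1 s2)^{1/2}) is computed via the symmetric similar matrix
    s2^{1/2} s1 s2^{1/2}, which has the same eigenvalues."""
    s2h = sym_sqrt(sigma2)
    covmean = sym_sqrt(s2h @ sigma1 @ s2h)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Two unit-covariance Gaussians whose means differ by one unit: FID = 1.
print(fid(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), np.eye(2)))
```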

N.1 F-TEST

We can further measure the extent to which the average accuracy on the five other datasets provides predictive power beyond that provided by ImageNet by using F-tests. For each target task, we fit a linear regression model that predicts accuracy as a function of either ImageNet accuracy or the average accuracy on the other five non-web-scraped datasets, and a second linear regression model that predicts accuracy as a function of both ImageNet accuracy and the average accuracy on the other five datasets. Since the first model is nested within the second, the second model must explain at least as much variance as the first. The F-test measures whether the increase in explained variance is significant. For these experiments, we logit-transform accuracy values and standardize them to zero mean and unit variance before computing the averages, as in the middle column of Table 13. Results are shown in Table 12. The average accuracy across the other five datasets explains variance beyond that explained by ImageNet accuracy alone on five of the six datasets. The only exception is EuroSAT, where the range of accuracies is small (most models get ∼99%) and a significant fraction of the variance among models may correspond to noise. By contrast, ImageNet accuracy explains variance beyond the average accuracy on only two datasets (APTOS and Melanoma). These results indicate that there are patterns in how well different models transfer to non-web-scraped data that are not captured by ImageNet accuracy alone, but are captured by accuracy on other non-web-scraped datasets.
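The nested-model comparison can be sketched with ordinary least squares. The synthetic data and variable names below are illustrative; converting the F statistic to a p-value would use an F(df1, df2) distribution, e.g. scipy.stats.f.sf, which we omit to keep the sketch dependency-free.

```python
import numpy as np

def nested_f_stat(y, X_small, X_full):
    """F statistic comparing nested OLS models (an intercept is added to both).
    A large F means the extra predictors in X_full explain significant
    additional variance beyond X_small."""
    def fit_rss(X):
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(resid @ resid), A.shape[1]

    rss_small, p_small = fit_rss(X_small)
    rss_full, p_full = fit_rss(X_full)
    df1, df2 = p_full - p_small, len(y) - p_full
    return ((rss_small - rss_full) / df1) / (rss_full / df2)

# Illustrative: target accuracy depends on the other-datasets average (x2)
# beyond ImageNet accuracy (x1), so the F statistic should be large.
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)                        # ImageNet accuracy (standardized logits)
x2 = 0.5 * x1 + rng.normal(size=40)             # avg accuracy on other datasets
y = x1 + 2.0 * x2 + 0.1 * rng.normal(size=40)   # target-dataset accuracy
F = nested_f_stat(y, x1[:, None], np.column_stack([x1, x2]))
print(F)
```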



Dataset download links and PyTorch datasets and splits can be found at REDACTED. The empty class is removed for the classification experiments, as in Table 1 of Beery et al. (2018). We use LP-FT because, in past experiments, we have found that LP-FT makes hyperparameter tuning easier for CLIP models but does not significantly alter performance when using optimal hyperparameters.



Figure 2: Sample images from each of the datasets.

Figure 3: FID scores vs ImageNet for the datasets we study in this work (red), and the web-scraped datasets studied by Kornblith et al. (2019) (blue).

We examine a variety of real-world datasets that cover different types of tasks.

We summarize the blue regression lines from Figure 1, calculated on models above 70% ImageNet accuracy, with their correlation and slope. Slope is calculated so that all metrics have a range from 0 to 100. While one of the goals of the dataset is to study generalization to new environments, here we only study the sets from the same locations. Although CCT-20 is not a Kaggle competition, it is a subset of the iWildCam Challenge 2018, whose yearly editions have been hosted on Kaggle.

To answer this question, we compare with the Kaggle leaderboard's top submissions. The top Kaggle submission achieves 0.936 QWK on the private leaderboard (85% of the test set) (Xu, 2019). They do this by using additional augmentation, using external data, training with an L1 loss, replacing the final pooling layer with generalized mean pooling, and ensembling a variety of models trained with different input sizes. The external data consists of 88,702 images from the 2015 Diabetic Retinopathy Detection Kaggle competition.

Comparing various models with additional interventions by evaluating on the Kaggle leaderboard.

Comparing the effect of augmentation on Kaggle leaderboard scores. More augmentation is as described earlier in this section. Less augmentation only uses random resize crop with at least 0.65 area and horizontal flips.

We examine the effect of pre-training augmentation and fine-tuning augmentation on downstream transfer performance. The model specifies the architecture and pre-training augmentation, while each column specifies the downstream task and fine-tuning augmentation. We find that augmentation strategies that improve ImageNet accuracy do not always improve accuracy on downstream tasks. Pre-trained augmentation models are from Wightman et al. (2021).

We find that the 12 datasets studied in Kornblith et al. (2019) come from web scraping.

Appendix

A DETAILED EXPERIMENT RESULTS

Green is the linear trend of all models, while blue is the linear trend for models above 70% ImageNet accuracy. We use 95% confidence intervals computed with Clopper-Pearson for accuracy metrics and bootstrap with 10,000 trials for other metrics.

J CLIP FINE-TUNING DETAILS

We fine-tune by running a linear probe, followed by end-to-end fine-tuning of the best model from the first part. We keep the total epochs consistent with the previous models, with a third of the epochs going toward linear probing. We use AdamW with a cosine decay schedule. During the linear probe, we search over learning rates of 10^-1, 10^-2, and 10^-3, and during fine-tuning, we search over learning rates of 10^-4, 10^-5, and 10^-6. For both parts, we search over 10^-3 to 10^-6 and 0 for weight decay.
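Concretely, the LP-FT split and the search grids above can be written down as a small configuration helper (a sketch; the helper name is ours, not from the released code):

```python
def lpft_schedule(total_epochs: int) -> dict:
    """LP-FT: a linear probe for a third of the epochs, then end-to-end
    fine-tuning for the remainder, each stage with its own AdamW LR grid."""
    probe_epochs = total_epochs // 3
    return {
        "linear_probe": {"epochs": probe_epochs, "lrs": [1e-1, 1e-2, 1e-3]},
        "finetune": {"epochs": total_epochs - probe_epochs,
                     "lrs": [1e-4, 1e-5, 1e-6]},
        "weight_decays": [1e-3, 1e-4, 1e-5, 1e-6, 0.0],  # shared by both stages
    }

# For the 30-epoch datasets: 10 epochs of linear probing, 20 of fine-tuning.
print(lpft_schedule(30))
```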

