DOES PROGRESS ON IMAGENET TRANSFER TO REAL-WORLD DATASETS?

Abstract

Does progress on ImageNet transfer to real-world datasets? We investigate this question by evaluating ImageNet pre-trained models with varying ImageNet accuracy (57% to 83%) on six practical image classification datasets. In particular, we study datasets collected with the goal of solving real-world tasks (e.g., classifying images from camera traps or satellites), as opposed to web-scraped benchmarks collected for the purpose of comparing models. On multiple datasets, models with higher ImageNet accuracy do not consistently yield performance improvements. For certain tasks, interventions such as data augmentation improve performance even when newer architectures do not. We hope that future benchmarks will include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.

1. INTRODUCTION

ImageNet is one of the most widely used datasets in machine learning. Initially, the ImageNet competition played a key role in re-popularizing neural networks with the success of AlexNet in 2012. Ten years later, the ImageNet dataset is still one of the main benchmarks for state-of-the-art computer vision models (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016; Liu et al., 2018; Howard et al., 2019; Touvron et al., 2021; Radford et al., 2021). As a result of ImageNet's prominence, the machine learning community has invested tremendous effort into developing model architectures, training algorithms, and other methodological innovations with the goal of increasing performance on ImageNet. Comparing methods on a common task has important benefits because it ensures controlled experimental conditions and rigorous evaluations. But the singular focus on ImageNet also raises the question of whether the community is over-optimizing for this specific dataset.

As a first approximation, ImageNet has clearly encouraged effective methodological innovation beyond ImageNet itself. For instance, the key finding from the early years of ImageNet was that large convolutional neural networks (CNNs) can succeed on contemporary computer vision datasets by leveraging GPUs for training. This paradigm has led to large improvements in other computer vision tasks, and CNNs are now omnipresent in the field. Nevertheless, this clear example of transfer early in ImageNet's evolution does not necessarily justify the continued focus ImageNet still receives. It is possible that early methodological innovations transferred broadly to other tasks, while later innovations have become less generalizable. The goal of our paper is to investigate this possibility, specifically for neural network architectures and their transfer to real-world data not commonly found on the Internet.

When discussing the transfer of techniques developed for ImageNet to other datasets, a key question is which other datasets to consider. Currently there is no comprehensive characterization of the many machine learning datasets and the transfer between them, so we restrict our attention to a limited but well-motivated family of datasets. In particular, we consider classification tasks derived from image data that were specifically collected with the goal of classification in mind. This is in contrast to many standard computer vision datasets, including ImageNet, where the constituent images were originally collected for a different purpose, posted to the web, and later re-purposed for benchmarking computer vision methods. Concretely, we study six datasets ranging from leaf disease classification and melanoma detection to categorizing animals in camera trap images. Since these datasets represent real-world applications, transfer of methods from ImageNet is particularly relevant.

We find that on four out of our six real-world datasets, ImageNet-motivated architecture improvements after VGG resulted in little to no progress (see Figure 1). Specifically, when we fit a line to downstream model accuracies as a function of ImageNet accuracy, the resulting slope is less than 0.05.
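As a concrete illustration of this trend fitting, the snippet below estimates such a slope with an ordinary least-squares fit. It is a minimal sketch rather than our evaluation code, and the accuracy values are hypothetical placeholders, not measured results.

```python
import numpy as np

# Hypothetical (placeholder) accuracies for five pretrained models:
# the x-axis is ImageNet top-1 accuracy, the y-axis is the downstream metric.
imagenet_acc = np.array([0.565, 0.691, 0.763, 0.801, 0.834])
downstream_acc = np.array([0.712, 0.731, 0.738, 0.736, 0.742])

# Ordinary least-squares line: downstream ~ slope * imagenet + intercept.
slope, intercept = np.polyfit(imagenet_acc, downstream_acc, deg=1)

# A slope below 0.05 means that a one-point gain on ImageNet translates
# to less than 0.05 points of gain on the downstream task.
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```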
The two exceptions where post-VGG architectures yield larger gains are the Caltech Camera Traps-20 (CCT-20) dataset (Beery et al., 2018) (slope 0.11) and the Human Protein Atlas Image Classification dataset (Ouyang et al., 2019) (slope 0.29). On multiple other datasets, we find that task-specific improvements such as data augmentations or extra training data lead to larger gains than using a more recent ImageNet architecture. We evaluate on a representative testbed of 19 ImageNet models, ranging from the seminal AlexNet (Krizhevsky et al., 2012) through VGG (Simonyan & Zisserman, 2015) and ResNets (He et al., 2016) to the more recent and higher-performing EfficientNets (Tan & Le, 2019) and ConvNeXts (Liu et al., 2022), spanning ImageNet top-1 accuracies from 56.5% to 83.4%. Our testbed also includes three Vision Transformer models to cover non-CNN architectures.

Our results stand in contrast to earlier transfer studies of web-scraped benchmarks such as PASCAL VOC 2007 (Everingham et al., 2010) and Caltech-101 (Fei-Fei et al., 2004). On these datasets, Kornblith et al. (2019) found consistent gains in downstream task accuracy for a similar range of architectures as we study in our work. Taken together, these findings indicate that ImageNet accuracy may be a good predictor for other web-scraped datasets, but less informative for real-world image classification datasets that are not sourced through the web. On the other hand, the CCT-20 data point shows that even very recent ImageNet models do help on some downstream tasks that do not rely on images from the web. Overall, our results highlight the need for a more comprehensive understanding of machine learning datasets in order to build and evaluate broadly useful data representations.
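To make the notion of a task-specific intervention concrete, below is a minimal sketch of a standard augmentation pipeline using torchvision. The particular transforms and parameters are illustrative defaults, not the exact recipes tuned for the datasets in our experiments.

```python
from torchvision import transforms

# Illustrative training-time augmentations; the exact choices for any
# given dataset (e.g., camera traps vs. satellites) would differ.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Evaluation uses deterministic preprocessing so the reported metric is stable.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```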

2. RELATED WORK

Transferability of ImageNet architectures. There is extensive previous work investigating the effect of architecture on the transferability of ImageNet-pretrained models to different downstream tasks.



Figure 1: Overview of transfer performance across models from ImageNet to each of the datasets we study. Although there seem to be strong linear trends between ImageNet accuracy and the target metrics (green), these trends become less certain when we restrict the models to those above 70% ImageNet accuracy (blue). Versions with error bars and spline interpolation can be found in Appendix B.


