THE AUGMENTED IMAGE PRIOR: DISTILLING 1000 CLASSES BY EXTRAPOLATING FROM A SINGLE IMAGE

Abstract

Figure 1: Extrapolating from one image. Strongly augmented patches from a single image are used to train a student (S) to distinguish semantic classes, such as those in ImageNet. The student neural network is initialized randomly and learns from a pretrained teacher (T) via KL-divergence. Although almost none of the target categories are present in the image, we find that the student achieves accuracies of >69% when classifying ImageNet's 1000 classes. In this paper, we develop this single-datum learning framework and investigate it across datasets and domains.

1. INTRODUCTION

Deep learning has both relied on and benefited significantly from increases in dataset size. Accordingly, many works demonstrate the benefits of dataset scale in terms of the number of data points and modalities used. Within computer vision, models trained on ever larger datasets, such as Instagram-1B (Mahajan et al., 2018) or JFT-3B (Dosovitskiy et al., 2021), have been shown to successfully distinguish between semantic categories at high accuracies. In stark contrast, there is little research on understanding neural networks trained on very small datasets. Why would this be of interest? While smaller dataset sizes allow for better understanding and control of what the model is trained with, we are most interested in this setting's ability to provide insights into fundamental aspects of learning: for example, it is an open question what exactly is required for arriving at semantic visual representations from random weights, and how well neural networks can extrapolate beyond their training distribution. While it has been established that visual models require few or no real images to arrive at basic features, such as edges and color contrasts (Asano et al., 2020; Kataoka et al., 2020; Bruna & Mallat, 2013; Olshausen & Field, 1996), we go far beyond these and instead ask what the minimal data requirements are for neural networks to learn semantic categories, such as those of ImageNet. This question is also motivated by studies of early visual development in infants, which have shown how little visual diversity babies are exposed to in their first few months while nevertheless developing generalizable visual systems (Orhan et al., 2020; Bambach et al., 2018). In this paper, we study this question in its purest form by analyzing whether neural networks can learn to extrapolate from a single datum.
However, addressing this question naïvely runs into two difficulties: i) current deep learning methods, such as SGD and BatchNorm, are tailored to large datasets and do not work with a single datum, and ii) extrapolating to semantic categories requires information about the space of natural images beyond the single datum. In this paper, we address these issues by developing a simple framework that combines augmentations and knowledge distillation. First, augmentations can be used to generate a large number of variations from a single image. This effectively addresses issue i) and allows the research question to be evaluated on standard architectures and datasets. This use of data augmentations to generate variety is drastically different from their usual use-case, in which transformations are generated to implicitly encode desirable invariances during training. Second, to tackle the difficulty of providing information about semantic categories in the single-datum setting, we use the outputs of a supervised pretrained model in a knowledge distillation (KD) fashion. While KD (Hinton et al., 2015) was originally proposed for improving small models' performance by leveraging what larger models have learned, we re-purpose it as a simple way to inject a supervisory signal about semantic classes into the training process. We combine these two ideas by providing both student and teacher only with augmented versions of a single datum, and train the student to match the teacher's imagined predictions for classes, almost all of which are not contained in the single datum (see Fig. 1). While practical applications do result from our method, for example we provide results on single-image model compression in the Appendix, our goal in this paper is to analyze the fundamental question of how well neural networks trained from a single datum can extrapolate to semantic classes, like those of CIFAR, SpeechCommands, ImageNet, or even Kinetics.
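The training signal described above can be sketched in a few lines: the student is trained to match the teacher's temperature-softened class distribution on augmented patches of the single image via a KL-divergence loss. This is a minimal illustrative sketch, not the authors' exact implementation; the function name, temperature value, and model interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, patches, optimizer, tau=4.0):
    """One knowledge-distillation step on a batch of augmented patches
    drawn from the single source image.

    The student minimizes KL(teacher || student) between the two
    temperature-softened class distributions (Hinton et al., 2015).
    """
    with torch.no_grad():
        t_logits = teacher(patches)  # teacher's "imagined" class predictions
    s_logits = student(patches)
    # Temperature-softened distributions; the tau^2 factor keeps gradient
    # magnitudes comparable across temperatures, as in the original KD paper.
    loss = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the actual framework, `patches` would come from strong augmentations (random crops, flips, color jitter) of the one source image, and this step would be repeated over many such augmented batches.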
What we find is that, despite the resulting model having seen only a single datum plus augmentations, surprisingly high quantitative performances are achieved: e.g., 74% on CIFAR-100, 84% on SpeechCommands, and 69% top-1, single-crop accuracy on ImageNet-1k. We further make the novel observations that our method benefits from high-capacity student and low-capacity teacher models, and that the source datum's characteristics matter: random noise or less dense images yield much lower performances than dense pictures like the one shown in Figure 1. In summary, this paper makes four main contributions:
1. A minimal framework for training neural networks with a single datum using distillation.
2. Extensive ablations of the proposed method, such as its dependency on the source image, augmentations, and network architectures.
3. Large-scale empirical evidence of neural networks' ability to extrapolate on more than 12 vision and audio datasets.
4. Qualitative insights into what and how neural networks trained with a single image learn.

2. RELATED WORK

The work presented here builds on insights from knowledge distillation and from single- and no-image training of visual representations, and yields insights into neural networks' ability to extrapolate.

