EXPANDING SMALL-SCALE DATASETS WITH GUIDED IMAGINATION

Abstract

The power of Deep Neural Networks (DNNs) depends heavily on the quantity, quality, and diversity of their training data. However, in many real-world scenarios, collecting and annotating large-scale data is costly and time-consuming, which severely hinders the application of DNNs. To address this challenge, we explore a new task of dataset expansion, which seeks to automatically create new labeled samples to enlarge a small dataset. To this end, we present a Guided Imagination Framework (GIF) that leverages recently developed large generative models (e.g., DALL-E2) and reconstruction models (e.g., MAE) to "imagine" and create informative new data from seed data. Specifically, GIF conducts imagination by optimizing the latent features of seed data in a semantically meaningful space; the optimized features are then fed into the generative models to produce photo-realistic images with new content. To guide the imagination towards creating samples useful for model training, we exploit the zero-shot recognition ability of CLIP and introduce three criteria that encourage informative sample generation: prediction consistency, entropy maximization, and diversity promotion. With these criteria as guidance, GIF expands datasets effectively across domains, yielding a 29.9% average accuracy gain over six natural image datasets and a 10.4% average accuracy gain over three medical image datasets.

1. INTRODUCTION

Having a sufficient amount of training data is crucial for unleashing the power of deep neural networks (DNNs) (Deng et al., 2009; Qi & Luo, 2020). However, in many fields, collecting large-scale datasets is expensive and time-consuming (Qi & Luo, 2020; Zhang et al., 2020), resulting in limited dataset sizes that make it difficult to fully utilize DNNs. To address this data limitation issue and reduce the cost of manual data collection/annotation, we explore dataset expansion in this work, which seeks to build an automatic data generation pipeline for expanding a small dataset into a larger and more informative one, as illustrated in Figure 1 (left).

There are several research directions that could be applied to dataset expansion. Among them, data augmentation (DeVries & Taylor, 2017; Cubuk et al., 2020; Zhong et al., 2020) applies pre-defined transformations to each image to enrich a dataset. However, these transformations mostly alter the surface visual characteristics of an image and have minimal effect on its actual content. The new information they introduce is therefore limited and cannot sufficiently address the data scarcity of small datasets. Besides, some recent studies (Zhang et al., 2021c; Li et al., 2022) use generative adversarial networks (GANs) (Goodfellow et al., 2014; Brock et al., 2018) to synthesize images for model training. They, however, require a sufficiently large dataset for in-domain GAN training, which is not feasible in the small-data scenario. Moreover, the generated images are often not well annotated, limiting their utility for DNN training. Neither approach can therefore effectively solve the dataset expansion problem.

For an observed object, humans can easily imagine its different variants in various shapes, colors, or contexts, relying on their accumulated prior understanding of the world (Warnock & Sartre, 2013; Vyshedskiy, 2019). Such an imagination process is highly useful for dataset expansion, since it does not simply perturb the object's appearance but applies rich prior knowledge to create object variants carrying new information. Meanwhile, recent breakthroughs in large-scale generative models (e.g., DALL-E2 (Ramesh et al., 2022)) have demonstrated that generative models can effectively capture the sample distribution of extremely large datasets (Schuhmann et al., 2021; Byeon et al., 2022) and show encouraging abilities in generating photo-realistic images with a rich variety of content.
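Standard augmentations illustrate why the information they add is limited: each operation only reshuffles pixels or intensities while preserving the underlying object. The following is a minimal numpy sketch with illustrative helper names (not code from any of the cited augmentation libraries):

```python
import numpy as np

def horizontal_flip(img: np.ndarray) -> np.ndarray:
    """Mirror an H x W x C image along its width axis."""
    return img[:, ::-1]

def random_crop(img: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Crop a random size x size window from an H x W x C image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def brightness_jitter(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities, clipping to the valid [0, 255] range."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
# Chained augmentations change appearance, never content.
aug = brightness_jitter(random_crop(horizontal_flip(img), 24, rng), 1.2)
print(aug.shape)  # (24, 24, 3)
```

Every output of such a pipeline depicts the same object instance as the seed image, which is precisely the limitation dataset expansion aims to overcome.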
This motivates us to explore their capabilities as prior models in a computational data imagination pipeline for dataset expansion, which imagines different sample variants from seed data. However, deploying big generative models for dataset expansion is highly non-trivial and faces several key challenges, including how to generate samples with correct labels and how to ensure the created samples are useful for model training. To handle these challenges, we conduct a series of studies (cf. Section 3), from which we draw two important findings. First, the CLIP model (Radford et al., 2021), which offers excellent zero-shot classification ability, can map the latent features of category-agnostic generative models to the specific label space of the target small dataset. This helps generate samples with correct labels. Second, we empirically identify three informativeness criteria crucial for generating effective training data: 1) zero-shot prediction consistency, ensuring the imagined image is class-consistent with its seed image; 2) entropy maximization, encouraging the imagined images to carry more information; and 3) diversity promotion, encouraging the imagined images to have diversified content.

In light of the above findings, we propose the Guided Imagination Framework (GIF) for dataset expansion. Given a seed image, GIF first extracts its latent feature with the prior generative model. Unlike data augmentation, which imposes variation on the raw image, GIF optimizes the variation over the latent feature. Guided by the criteria above, the latent feature is optimized to provide more information while maintaining its class semantics. This enables GIF to create informative new samples, with class-consistent semantics yet higher content diversity, to expand small datasets for model training.
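The three criteria can be sketched as follows. This is a toy numpy illustration, not GIF's actual implementation: `clip_head` is a random linear stand-in for CLIP's zero-shot classifier over the target label space, and latents are plain vectors rather than generative-model features.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_classes, dim = 5, 16
clip_head = rng.normal(size=(dim, num_classes))  # stand-in zero-shot classifier

def zero_shot_probs(latent):
    return softmax(latent @ clip_head)

def consistency_loss(seed_latent, new_latent):
    """Prediction consistency: the perturbed latent should keep the seed's
    predicted class (cross-entropy against the seed's argmax prediction)."""
    target = np.argmax(zero_shot_probs(seed_latent))
    return -np.log(zero_shot_probs(new_latent)[target] + 1e-12)

def entropy(latent):
    """Entropy maximization: higher predictive entropy suggests the sample
    carries more information (maximized, hence subtracted in the loss)."""
    p = zero_shot_probs(latent)
    return -np.sum(p * np.log(p + 1e-12))

def diversity(latents):
    """Diversity promotion: spread out the K perturbed variants of one seed
    (mean pairwise Euclidean distance, also maximized)."""
    d = latents[:, None, :] - latents[None, :, :]
    return np.sqrt((d ** 2).sum(-1)).mean()

seed = rng.normal(size=dim)
variants = seed + 0.1 * rng.normal(size=(4, dim))
# Guidance to minimize over the latent perturbation:
total_guidance = (consistency_loss(seed, variants[0])
                  - entropy(variants[0]) - diversity(variants))
```

In GIF, `total_guidance` would be minimized over the latent perturbation by gradient descent, after which the optimized latents are decoded by the generative prior into new images.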
Considering that DALL-E2 has been shown to be powerful at generating images and MAE (He et al., 2022) excels at reconstructing them, we explore both as prior models for imagination in this work. We evaluate the proposed method on small-scale natural and medical image datasets. As shown in Figure 1 (right), compared to a ResNet50 trained on the original datasets, our method improves model performance by a large margin across a variety of visual tasks, including fine-grained object classification, texture classification, cancer pathology detection, and ultrasound image classification. More specifically, GIF obtains a 29.9% average accuracy gain over six natural image datasets and a 10.4% average accuracy gain over three medical image datasets. Moreover, our method expands datasets far more efficiently than existing augmentation methods: 5× expansion by our GIF-DALLE variant already outperforms 20× expansion by Cutout, GridMask, and RandAugment on the Cars and DTD datasets. In addition, the expanded datasets can be directly used to train different model architectures (e.g., ResNeXt, WideResNet, and MobileNet), leading to consistent performance improvements.
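The expansion loop itself is architecture-agnostic, which is why the expanded data transfers across models unchanged. A minimal sketch, with a hypothetical `imagine` stub standing in for GIF's guided generative step:

```python
def imagine(image, k):
    """Stand-in for guided imagination: returns k variants of a seed image.
    In GIF this would run latent optimization plus generative decoding."""
    return [f"{image}_variant{i}" for i in range(k)]

def expand_dataset(dataset, k):
    """Expand a list of (image, label) pairs (k + 1)x: each imagined
    variant inherits its seed's label, so the result is fully labeled."""
    expanded = list(dataset)  # keep the original samples
    for image, label in dataset:
        expanded.extend((variant, label) for variant in imagine(image, k))
    return expanded

seeds = [("cat.png", 0), ("dog.png", 1)]
big = expand_dataset(seeds, k=4)
print(len(big) / len(seeds))  # 5.0 -> a 5x expanded dataset
```

Because the output is an ordinary labeled dataset, any downstream architecture (ResNet50, ResNeXt, WideResNet, MobileNet) can train on it with no pipeline changes.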

2. RELATED WORK

Learning with synthetic images. Training models with synthetic images is a promising direction (Jahanian et al., 2022). DatasetGANs (Zhang et al., 2021c; Li et al., 2022) explore GAN models (Isola et al., 2017; Esser et al., 2021) to generate images for training segmentation models. However, since the generated images come without labels, these methods require manually annotating some generated images to train a label generator for annotating the remaining synthetic images. In contrast, our dataset expansion aims to expand a real small dataset into a larger labeled one in a fully automatic manner, without involving human annotation.



Figure 1: Dataset expansion aims to create data with new information to enrich small datasets for better DNN training (left). A ResNet50 trained on the datasets expanded by our proposed method performs much better than one trained on the original small datasets (right).

