IS SYNTHETIC DATA FROM GENERATIVE MODELS READY FOR IMAGE RECOGNITION?

Abstract

Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images. Although the results are astonishing to human eyes, how applicable these generated images are to recognition tasks remains under-explored. In this work, we extensively study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks, focusing on two perspectives: synthetic data for improving classification models in data-scarce settings (i.e. zero-shot and few-shot), and synthetic data for large-scale model pre-training for transfer learning. We demonstrate the power and the shortcomings of synthetic data from existing generative models, and propose strategies for applying synthetic data to recognition tasks more effectively.

1. INTRODUCTION

Over the past decade, deep learning powered by large-scale annotated data has revolutionized the field of image recognition. However, it is costly and time-consuming to manually collect a large-scale labeled dataset, and recent concerns about data privacy and usage rights further hinder this process. In parallel, generative models that aim to model real-data distributions can now produce high-fidelity photo-realistic images. In particular, recent text-to-image generation models (Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022b) have made major breakthroughs in synthesizing high-quality images from text descriptions. This prompts us to ask: is synthetic data from generative models ready for image recognition tasks?

There are a few early attempts at exploring synthetic data from generative models for image recognition tasks. Jahanian et al. (2021) manipulate the latent space of a GAN model to produce multi-view images for contrastive learning. Albeit promising, early works either address tasks on a small scale or only for a specific setting. Moreover, they all focus on GAN-based models, and none explore the revolutionary text-to-image generation models, which hold more promise for benefiting recognition tasks.

In this paper, we present the first study on state-of-the-art text-to-image generation models for image recognition. With the power of text-to-image generation, we can hopefully not only generate massive amounts of high-quality labeled data, but also achieve domain customization by generating synthetic data targeted at a specific label space, i.e. the label space of a downstream task. Our study is carried out on one open-sourced text-to-image generation model, GLIDE (Nichol et al., 2021). We attempt to uncover the benefits and pitfalls of synthetic data for image recognition through the lens of two questions: 1) is synthetic data from generative models ready for improving classification models, and 2) is synthetic data a feasible source for transfer learning (i.e. model pre-training)? It is worth noting that for 1), we only study the zero-shot and few-shot settings, because the positive impact of synthetic data diminishes as more shots become available. In addition, we build most of our investigations on the state-of-the-art method CLIP (Radford et al., 2021), with the feature extractor initialized from large-scale pre-trained weights and kept frozen.
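To make the first setting concrete, the sketch below illustrates one minimal zero-shot pipeline consistent with the description above: build a class-name prompt for each label in the downstream label space, synthesize labeled images from those prompts, and fit a simple nearest-centroid classifier on frozen features. Here `synthesize_image` and `encode_image` are hypothetical stand-ins, not the actual GLIDE or CLIP interfaces; they are simulated with toy deterministic vectors purely so the pipeline's structure can be shown end to end.

```python
import math
import random

CLASS_NAMES = ["dog", "cat", "car"]  # an example downstream label space

def build_prompt(label):
    # Basic class-name prompt template; richer language enhancement is
    # possible, but this is the simplest way to target a label space.
    return f"a photo of a {label}"

# --- Hypothetical stand-ins (NOT real GLIDE/CLIP APIs) ----------------
def synthesize_image(prompt, seed):
    # A real system would call a text-to-image model such as GLIDE here.
    # We just return a token identifying the prompt and seed.
    return (prompt, seed)

def encode_image(image):
    # A real system would use a frozen pre-trained encoder (e.g. CLIP).
    # This toy encoder maps each class to a distinct direction plus a
    # small seed-dependent perturbation, so the sketch is runnable.
    prompt, seed = image
    class_idx = next(i for i, c in enumerate(CLASS_NAMES) if c in prompt)
    rng = random.Random(seed)
    vec = [0.0] * len(CLASS_NAMES)
    vec[class_idx] = 1.0
    return [v + 0.1 * rng.uniform(-1, 1) for v in vec]
# ----------------------------------------------------------------------

def train_centroids(n_per_class=8):
    # "Train" on purely synthetic data: one frozen-feature centroid per
    # class, computed from images synthesized for that class's prompt.
    centroids = {}
    for label in CLASS_NAMES:
        feats = [encode_image(synthesize_image(build_prompt(label), s))
                 for s in range(n_per_class)]
        centroids[label] = [sum(dim) / len(feats) for dim in zip(*feats)]
    return centroids

def classify(feature, centroids):
    # Assign the label whose synthetic-data centroid is nearest.
    return min(centroids, key=lambda lbl: math.dist(feature, centroids[lbl]))

centroids = train_centroids()
query = encode_image(synthesize_image(build_prompt("cat"), seed=999))
print(classify(query, centroids))  # → cat
```

In the few-shot variant, the centroids (or a linear head) would additionally be fit on the handful of real labeled shots, with the synthetic images serving as augmentation; the frozen feature extractor is shared across both settings.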



Besnier et al. (2020) use a class-conditional GAN (BigGAN (Brock et al., 2018) trained for ImageNet-1000 classes) to generate images for training image classifiers. Zhang et al. (2021) leverage StyleGAN (Karras et al., 2019) to produce synthetic labeled data for object-part segmentation.


Code is available at https://github.com/CVMI-Lab/SyntheticData.

