AUGMENTATION-INTERPOLATIVE AUTOENCODERS FOR UNSUPERVISED FEW-SHOT IMAGE GENERATION

Anonymous

Abstract

We aim to build image generation models that generalize to new domains from few examples. To this end, we first investigate the generalization properties of classic image generators, and discover that autoencoders generalize extremely well to new domains, even when trained on highly constrained data. We leverage this insight to produce a robust, unsupervised few-shot image generation algorithm, and introduce a novel training procedure based on recovering an image from data augmentations. Our Augmentation-Interpolative AutoEncoders synthesize realistic images of novel objects from only a few reference images, and outperform both prior interpolative models and supervised few-shot image generators. Our procedure is simple and lightweight, generalizes broadly, and requires no category labels or other supervision during training.

1. INTRODUCTION

Modern generative models can synthesize high-quality (Karras et al., 2019; Razavi et al., 2019; Zhang et al., 2018a), diverse (Ghosh et al., 2018; Mao et al., 2019; Razavi et al., 2019), and high-resolution (Brock et al., 2018; Karras et al., 2017; 2019) images of any class, but only given a large training dataset for those classes (Creswell et al., 2017). This requirement is impractical in many scenarios. For example, an artist might want to use image generation to help create concept art of futuristic vehicles, smartphone users may wish to animate a collection of selfies, or researchers training an image classifier might wish to generate augmented data for rare classes. These and other applications require generative models capable of synthesizing images from a large, ever-growing set of object classes. We cannot rely on having hundreds of labeled images for all of them; furthermore, most of them will likely be unknown at training time. We therefore need generative models that can train on one set of image classes and then generalize to a new class using only a small number of new images: few-shot image generation.

Unfortunately, we find that even state-of-the-art generative models cannot represent novel classes in their latent space, let alone generate them on demand (Figure 1). Perhaps because of this generalization challenge, recent attempts at few-shot image generation rely on undesirable assumptions and compromises: they need impractically large labeled datasets of hundreds of classes (Edwards & Storkey, 2016), involve substantial computation at test time (Clouâtre & Demers, 2019), or are highly domain-specific, generalizing only across very similar classes (Jitkrittum et al., 2019). In this paper, we introduce a strong, efficient, unsupervised baseline for few-shot image generation that avoids all of these compromises.
We leverage the finding that although the latent spaces of powerful generative models, such as VAEs and GANs, do not generalize to new classes, the representations learned by autoencoders (AEs) generalize extremely well. AEs can be converted into generative models by training them to interpolate between seed images (Sainburg et al., 2018; Berthelot et al., 2018; Beckham et al., 2019). These Interpolative AutoEncoders (IntAEs) would seem a natural fit for few-shot image generation. Unfortunately, we also find that although IntAEs can reproduce images from novel classes, their ability to interpolate between them breaks down upon leaving the training domain. To remedy this, we introduce a new training method based on data augmentation, which produces smooth, meaningful interpolations in novel domains. We demonstrate on three different settings (handwritten characters, faces, and general objects) that our Augmentation-Interpolative AutoEncoder (AugIntAE) achieves simple, robust, highly general, and completely unsupervised few-shot image generation.
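To make the augmentation-based training signal concrete, the following is a minimal NumPy sketch of one training step's loss, not the paper's implementation: a toy linear encoder/decoder stands in for the actual networks, additive noise stands in for real image augmentations, and all names, shapes, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear encoder/decoder standing in for the real networks.
D, Z = 64, 8                          # "image" and latent dimensionality
W_enc = rng.normal(0, 0.1, (Z, D))
W_dec = rng.normal(0, 0.1, (D, Z))

encode = lambda x: W_enc @ x
decode = lambda z: W_dec @ z

def augment(x, rng):
    """Stand-in augmentation: additive noise; a real model would use
    crops, warps, color jitter, etc."""
    return x + rng.normal(0, 0.05, x.shape)

def aug_interp_loss(x, rng):
    # Encode two independently augmented views of the same image.
    z1 = encode(augment(x, rng))
    z2 = encode(augment(x, rng))
    lam = rng.uniform()               # random interpolation coefficient
    z = lam * z1 + (1 - lam) * z2     # interpolate in latent space
    x_hat = decode(z)
    # The target is the ORIGINAL image: the decoder must undo both the
    # augmentations and the latent interpolation.
    return np.mean((x_hat - x) ** 2)

x = rng.normal(size=D)
loss = aug_interp_loss(x, rng)
```

The key design choice sketched here is that interpolation endpoints are augmented views of a single image rather than two unrelated images, so the interpolated latent always has a well-defined reconstruction target.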

2. RELATED WORK

2.1. GENERATIVE MODELING

AEs were originally intended for learned non-linear data compression, which could then be used for downstream tasks; the generator network was discarded (Kramer, 1991; Hinton & Salakhutdinov, 2006; Masci et al., 2011). VAEs do the opposite: by training the latent space toward a prior distribution, the encoder network can be discarded at test time instead, and new images are sampled directly from the prior (Kingma & Welling, 2013). Subsequent models discard the encoder network entirely: GANs sample from a noise distribution and learn to generate images which fool a concurrently trained real/fake image discriminator (Goodfellow et al., 2014), while Bojanowski et al. (2017) and Hoshen et al. (2019) treat latent codes as directly learnable parameters and train separate sampling procedures for synthesizing novel images. Recent work has seen a return to AEs as conditional generators, trained to interpolate smoothly between pairs or sets of seed images. This is accomplished by combining a reconstruction loss on seed images with an adversarial loss on the seed and interpolated images; different forms of adversarial loss (Sainburg et al., 2018; Berthelot et al., 2018) and interpolation (Beckham et al., 2019) have been proposed. While all of these approaches generate new images, it is unclear whether any of them can generalize to novel domains. Some results suggest the opposite: a VAE sufficiently powerful to model the training data becomes incapable of producing anything else (Bozkurt et al., 2018).
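The interpolation step these IntAE variants share can be sketched in a few lines. Below is an illustrative NumPy implementation of linear and spherical latent interpolation between two seed codes; the decoder and the adversarial loss applied to the decoded frames are omitted, and all names are assumptions for illustration rather than any specific paper's API.

```python
import numpy as np

def lerp(z1, z2, t):
    """Linear interpolation between two latent codes."""
    return (1 - t) * z1 + t * z2

def slerp(z1, z2, t):
    """Spherical interpolation, often preferred for Gaussian-like latents,
    since it preserves the typical norm of samples along the path."""
    cos_omega = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        return lerp(z1, z2, t)  # nearly collinear: fall back to lerp
    return (np.sin((1 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)

rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=16), rng.normal(size=16)

# An IntAE decodes each interpolated code into an image; the adversarial
# loss then pushes those decoded frames toward the real-image manifold.
path = [slerp(z1, z2, t) for t in np.linspace(0.0, 1.0, 7)]
```

At t = 0 and t = 1 the path recovers the seed codes exactly, so the reconstruction loss on seed images and the adversarial loss on intermediate frames apply to the two ends of the same latent trajectory.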

2.2. FEW-SHOT IMAGE GENERATION

Current attempts at few-shot image generation span a wide range of approaches and models. Neural Statistician, an early attempt, is similar to the AE in that it is built for few-shot classification and largely discards the generative capability (Edwards & Storkey, 2016). Generation-oriented iterations exist, but likewise depend on a large, varied, labeled dataset for training (Hewitt et al., 2018). Other approaches based on few-shot classification include generative matching networks (Bartunov & Vetrov, 2018) and adversarial meta-learning (Clouâtre & Demers, 2019). These models also depend on heavy supervision and are fairly complicated, involving multiple networks and training procedures working in tandem, making them potentially difficult to train reliably in practice.



Figure 1: A basic image reconstruction task. One can either use a state-of-the-art PGAN (Karras et al., 2017), optimizing the latent code to match the generation to the input, or use our simpler AugIntAE. Both succeed on the training set of adult faces (left), but on the novel domain of baby faces (right), the best PGAN reconstruction ages the baby. Ours is far more faithful.

Separate work has approached few-shot image generation from the side of generative modeling. Wang et al. (2018), Noguchi & Harada (2019), and Wu et al. (2018) investigate the ability of GANs to handle domain adaptation via fine-tuning, which requires substantial computation and more novel-class examples than are available in the few-shot setting. Zhao et al. (2020) train GANs directly from few examples, though still more than are at hand for few-shot learning; this can be considered orthogonal work, as AugIntAE can serve as a useful pre-trained initialization. Antoniou et al. (2017) and Liu et al. (2019) use adversarial training to produce feed-forward few-shot generators. However,

