MOMENTUM CONTRASTIVE AUTOENCODER

Abstract

Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution. This latent space distribution matching is a core component in WAE, and is in itself a challenging task. In this paper, we propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning, as a means to resolve this problem. We do so by exploiting the fact that contrastive learning objectives optimize the latent space distribution to be uniform over the unit hypersphere, which can be easily sampled from. This results in a simple and scalable algorithm that avoids many of the optimization challenges of existing generative models, while retaining the advantage of efficient sampling. Quantitatively, we show that our algorithm achieves a new state-of-the-art FID of 54.36 on CIFAR-10, and performs competitively with existing models on CelebA in terms of FID score. We also show qualitative results on CelebA-HQ in addition to these datasets, confirming that our algorithm can generate realistic images at multiple resolutions.

1. INTRODUCTION

The main goal of generative modeling is to learn a given data distribution while facilitating an efficient way to draw samples from it. Popular algorithms such as variational autoencoders (VAE; Kingma & Welling, 2013) and generative adversarial networks (GAN; Goodfellow et al., 2014) are theoretically-grounded models designed to meet this goal. However, they come with some challenges. For instance, VAEs suffer from the posterior collapse problem (Chen et al., 2016; Zhao et al., 2017; Van Den Oord et al., 2017) and from a mismatch between the posterior and prior distributions (Kingma et al., 2016; Tomczak & Welling, 2018; Dai & Wipf, 2019; Bauer & Mnih, 2019). GANs are known to suffer from the mode collapse problem (Che et al., 2016; Dumoulin et al., 2016; Donahue et al., 2016) and optimization instability (Arjovsky & Bottou, 2017) due to their saddle-point problem formulation.

With the Wasserstein autoencoder (WAE), Tolstikhin et al. (2017) propose a general theoretical framework that can potentially avoid these challenges. They show that the divergence between two distributions is equivalent to the minimum reconstruction error, under the constraint that the marginal distribution of the latent space is identical to a prior distribution. The core challenge of this framework is to match the latent space distribution to a prior distribution that is easy to sample from. If this challenge is addressed appropriately, WAE can avoid many of the aforementioned challenges of VAEs and GANs. Tolstikhin et al. (2017) investigate GANs and maximum mean discrepancy (MMD; Gretton et al., 2012) for this task and empirically find that the GAN-based approach yields better performance despite its instability. Others have proposed solutions to overcome this challenge (Kolouri et al., 2018; Knop et al., 2018), but they come with their own pitfalls (see Section 2). This paper aims to design a generative model that avoids the aforementioned challenges of existing approaches.
To do so, we build on the WAE framework. In order to tackle the latent space distribution matching problem, we make a simple observation that allows us to use the contrastive learning framework to solve this problem. Contrastive learning achieves state-of-the-art results in self-supervised representation learning tasks (He et al., 2020; Chen et al., 2020) by forcing the latent representations to be (1) invariant to data augmentations and (2) distinct for different data samples. It has been shown that the contrastive learning objective corresponding to the latter goal pushes the learned representations to achieve maximum entropy over the unit hypersphere (Wang & Isola, 2020). We observe that applying this contrastive loss term to the latent representation of an AE therefore matches it to the uniform distribution over the unit hypersphere. This approach avoids the aforementioned
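To illustrate the mechanism described above, the uniformity-promoting part of the contrastive objective (the term Wang & Isola (2020) show drives representations toward the uniform distribution on the unit hypersphere) can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation; the function names, the batch size, and the temperature value `t` are our own illustrative choices.

```python
import numpy as np

def normalize(z):
    # Project latent codes onto the unit hypersphere, as contrastive methods do.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def uniformity_loss(z, t=2.0):
    # Uniformity loss of Wang & Isola (2020): log of the mean pairwise
    # Gaussian potential. Minimizing it spreads the codes z (assumed
    # L2-normalized, shape (n, d)) uniformly over the unit hypersphere.
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise ||z_i - z_j||^2
    n = z.shape[0]
    mask = ~np.eye(n, dtype=bool)  # exclude self-pairs (distance 0)
    return np.log(np.exp(-t * sq_dists[mask]).mean())

# Toy usage: random 16-d latents for a batch of 128 samples.
rng = np.random.default_rng(0)
z = normalize(rng.normal(size=(128, 16)))
loss = uniformity_loss(z)
```

In the setting described above, this term would be added to the AE reconstruction loss and minimized jointly, so that the latent codes simultaneously reconstruct the data and match the easy-to-sample uniform prior on the hypersphere.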

