MOMENTUM CONTRASTIVE AUTOENCODER

Abstract

Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution. This latent space distribution matching is a core component in WAE, and is in itself a challenging task. In this paper, we propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning, as a means to resolve this problem. We do so by exploiting the fact that contrastive learning objectives optimize the latent space distribution to be uniform over the unit hypersphere, which can be easily sampled from. This results in a simple and scalable algorithm that avoids many of the optimization challenges of existing generative models, while retaining the advantage of efficient sampling. Quantitatively, we show that our algorithm achieves a new state-of-the-art FID of 54.36 on CIFAR-10, and performs competitively with existing models on CelebA in terms of FID score. We also show qualitative results on CelebA-HQ in addition to these datasets, confirming that our algorithm can generate realistic images at multiple resolutions.

1. INTRODUCTION

The main goal of generative modeling is to learn a given data distribution while facilitating an efficient way to draw samples from it. Popular algorithms such as variational autoencoders (VAE; Kingma & Welling, 2013) and generative adversarial networks (GAN; Goodfellow et al., 2014) are theoretically grounded models designed to meet this goal. However, they come with some challenges. For instance, VAEs suffer from the posterior collapse problem (Chen et al., 2016; Zhao et al., 2017; Van Den Oord et al., 2017) and from a mismatch between the posterior and prior distributions (Kingma et al., 2016; Tomczak & Welling, 2018; Dai & Wipf, 2019; Bauer & Mnih, 2019). GANs are known to suffer from mode collapse (Che et al., 2016; Dumoulin et al., 2016; Donahue et al., 2016) and optimization instability (Arjovsky & Bottou, 2017) due to their saddle-point problem formulation. With the Wasserstein autoencoder (WAE), Tolstikhin et al. (2017) propose a general theoretical framework that can potentially avoid these challenges. They show that the divergence between two distributions is equivalent to the minimum reconstruction error of an autoencoder, under the constraint that the marginal distribution of its latent space is identical to a prior distribution. The core challenge of this framework is to match the latent space distribution to a prior distribution that is easy to sample from. If this challenge is addressed appropriately, WAE can avoid many of the aforementioned problems of VAEs and GANs. Tolstikhin et al. (2017) investigate GANs and maximum mean discrepancy (MMD; Gretton et al., 2012) for this task and empirically find that the GAN-based approach yields better performance despite its instability. Others have proposed solutions to overcome this challenge (Kolouri et al., 2018; Knop et al., 2018), but they come with their own pitfalls (see Section 2). This paper aims to design a generative model that avoids the aforementioned challenges of existing approaches.
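In practice, the constrained WAE objective is relaxed into a reconstruction term plus a penalty that matches the encoded latents to the prior. The sketch below is a minimal NumPy illustration of this structure, using an RBF-kernel MMD penalty (one of the options Tolstikhin et al. (2017) consider); the function names, the penalty weight, and the kernel bandwidth are illustrative choices, not the implementation proposed in this paper:

```python
import numpy as np

def mmd_rbf(z, z_prior, sigma=1.0):
    """Biased estimate of squared MMD between encoded latents and prior
    samples, using an RBF kernel (illustrative bandwidth choice)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(z, z).mean() + k(z_prior, z_prior).mean() - 2 * k(z, z_prior).mean()

def wae_loss(x, x_recon, z, z_prior, lam=10.0):
    """Relaxed WAE objective: reconstruction error plus a weighted
    latent distribution-matching penalty (lam is a placeholder weight)."""
    recon = ((x - x_recon) ** 2).sum(axis=1).mean()
    return recon + lam * mmd_rbf(z, z_prior)
```

When encoder outputs and prior samples are drawn from the same distribution the penalty vanishes in expectation, leaving only the reconstruction term.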
To do so, we build on the WAE framework. To tackle the latent space distribution matching problem, we make a simple observation that allows us to use the contrastive learning framework. Contrastive learning achieves state-of-the-art results in self-supervised representation learning tasks (He et al., 2020; Chen et al., 2020) by forcing the latent representations to be 1) augmentation invariant, and 2) distinct for different data samples. It has been shown that the contrastive learning objective corresponding to the latter goal pushes the learned representations to achieve maximum entropy over the unit hypersphere (Wang & Isola, 2020). We observe that applying this contrastive loss term to the latent representation of an AE therefore matches it to the uniform distribution over the unit hypersphere, which can be easily sampled from. This approach avoids the aforementioned challenges of existing generative models.

2. RELATED WORK

There are many autoencoder-based generative models in the existing literature. One of the earliest models in this category is the denoising autoencoder (Vincent et al., 2008). Bengio et al. (2013b) show that training an autoencoder to de-noise a corrupted input leads to the learning of a Markov chain whose stationary distribution is the original data distribution it is trained on. However, this results in inefficient sampling and mode mixing problems (Bengio et al., 2013b; Alain & Bengio, 2014). Variational autoencoders (VAE; Kingma & Welling, 2013) overcome these challenges by maximizing a variational lower bound of the data likelihood, which involves a KL term minimizing the divergence between the latent's posterior distribution and a prior distribution. This allows for efficient approximate likelihood estimation as well as posterior inference through ancestral sampling once the model is trained. Despite these advantages, follow-up works have identified a few important drawbacks of VAEs.
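The uniformity-promoting term discussed above can be written, following Wang & Isola (2020), as the log of the average pairwise Gaussian potential between l2-normalized latents; minimizing it spreads the latents uniformly over the unit hypersphere. The sketch below is a small NumPy illustration of that term; the function name and the temperature value are our choices, not taken from the cited work:

```python
import numpy as np

def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of
    l2-normalized latents; lower values mean latents are more spread
    out over the unit hypersphere (Wang & Isola, 2020 style term)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # project onto sphere
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    off_diag = ~np.eye(len(z), dtype=bool)  # exclude self-pairs
    return np.log(np.exp(-t * d2[off_diag]).mean())
```

A fully collapsed batch (all latents identical) attains the maximum value of this loss, while well-separated latents drive it down, which is exactly the repulsive behavior exploited in this paper.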
The poor sample quality of VAEs has been attributed to a mismatch between the prior (which is used for drawing samples) and the posterior (Kingma et al., 2016; Tomczak & Welling, 2018; Dai & Wipf, 2019; Bauer & Mnih, 2019). The VAE objective also risks posterior collapse: learning a latent space distribution that is independent of the input distribution when the KL term dominates the reconstruction term (Chen et al., 2016; Zhao et al., 2017; Van Den Oord et al., 2017). Dai & Wipf (2019) argue that this prior-posterior mismatch arises from the latent space dimension of the autoencoder differing from the intrinsic dimensionality of the data manifold (which is typically unknown). To overcome the mismatch, they propose a two-stage VAE in which the second stage learns a VAE on the latent space samples of the first. They show that this two-stage training and sampling significantly improves the quality of generated samples. However, training a second VAE is computationally expensive and introduces some of the same challenges mentioned above. Ghosh et al. (2019) observe that VAEs can be interpreted as deterministic autoencoders with noise injected in the latent space as a form of regularization. Based on this observation, they introduce deterministic autoencoders and empirically investigate various other regularizations. They further introduce a post-hoc density estimation step for the latent space, since the autoencoding step does not match it to a prior. In this context, one can view our proposed algorithm as a way to regularize deterministic autoencoders while simultaneously learning a latent space distribution that can be easily sampled from. Tolstikhin et al. (2017) observe that the optimal transport problem can be equivalently framed as an autoencoder objective (WAE) under the constraint that the latent space distribution matches a prior distribution.
They experiment with two alternatives for satisfying this constraint in the form of a penalty, an MMD (Gretton et al., 2012) loss and a GAN (Goodfellow et al., 2014) loss, and find that the latter works better in practice. Training an autoencoder with an adversarial loss was also proposed earlier in adversarial autoencoders (Makhzani et al., 2015). Our algorithm builds on the WAE framework due to its theoretical advantages. Another line of research aims to avoid the latent space distribution matching problem altogether by making use of sliced distances. For instance, Kolouri et al. (2018) observe that the Wasserstein distance between one-dimensional distributions has a closed-form solution. Motivated by this, they propose the sliced-Wasserstein distance, which projects the high-dimensional distributions onto a large number of one-dimensional spaces and approximates the original Wasserstein distance by the average of the resulting one-dimensional Wasserstein distances. A similar idea using the sliced-Cramer distance is introduced in Knop et al. (2018). However, the number of required random projections becomes prohibitively high when the data lives on a low-dimensional manifold in a high-dimensional space, making this approach computationally inefficient or otherwise inaccurate (Liutkus et al., 2019).
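To make the sliced approach concrete: in one dimension the closed-form 2-Wasserstein distance between two equal-size empirical measures reduces to matching sorted samples, so averaging this over random projection directions gives the sliced estimator. The following is a schematic NumPy sketch under that equal-sample-size assumption, with an illustrative function name:

```python
import numpy as np

def sliced_wasserstein2(x, y, n_proj=100, rng=None):
    """Approximate squared 2-Wasserstein distance between two empirical
    distributions (equal sample sizes assumed) by averaging the 1-D
    closed-form distance over random projection directions."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(x.shape[1])
        theta /= np.linalg.norm(theta)  # random unit direction
        # 1-D W2^2 between empirical measures: pair up sorted projections
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += ((px - py) ** 2).mean()
    return total / n_proj
```

The cited pitfall is visible here: each projection only probes one direction, so when the data concentrates on a low-dimensional manifold, most random directions carry little signal and `n_proj` must grow accordingly.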

3. MOMENTUM CONTRASTIVE AUTOENCODER

We present the proposed algorithm in this section. We begin by restating the WAE theorem that connects the autoencoder loss with the Wasserstein distance between two distributions. Let X ∼ P_X

