SELF-SUPERVISED VARIATIONAL AUTO-ENCODERS

Abstract

Density estimation, compression, and data generation are crucial tasks in artificial intelligence. Variational Auto-Encoders (VAEs) constitute a single framework to achieve these goals. Here, we present a novel class of generative models, called self-supervised Variational Auto-Encoders (selfVAE), that utilizes deterministic and discrete transformations of data. This class of models allows performing both conditional and unconditional sampling while simplifying the objective function. First, we use a single self-supervised transformation as a latent variable, where the transformation is either downscaling or edge detection. Next, we consider a hierarchical architecture, i.e., multiple transformations, and we show its benefits compared to the VAE. The flexibility of the selfVAE in data reconstruction finds a particularly interesting use case in data compression tasks, where we can trade off memory for better data quality, and vice versa. We present the performance of our approach on three benchmark image datasets (Cifar10, Imagenette64, and CelebA).

1. INTRODUCTION

The framework of variational auto-encoders (VAEs) provides a principled approach to learning latent-variable models. As it utilizes a meaningful low-dimensional latent space with density estimation capabilities, it forms an attractive solution for generative modelling tasks. However, its performance in terms of test log-likelihood and quality of generated samples is often disappointing, and thus many modifications have been proposed. In general, one can obtain a tighter lower bound, and thus a more powerful and flexible model, by improving any of the following three components: the encoder (Rezende et al., 2014; van den Berg et al., 2018; Hoogeboom et al., 2020; Maaløe et al., 2016), the prior (or marginal over latents) (Chen et al., 2016; Habibian et al., 2019; Lavda et al., 2020; Lin & Clark, 2020; Tomczak & Welling, 2017), and the decoder (Gulrajani et al., 2016). Recent studies have shown that by employing deep hierarchical architectures and by carefully designing the building blocks of the neural networks, VAEs can successfully model high-dimensional data and reach state-of-the-art test likelihoods (Zhao et al., 2017; Maaløe et al., 2019; Vahdat & Kautz, 2020). In this work, we present a novel class of VAEs, called self-supervised Variational Auto-Encoders, in which we introduce additional variables that result from discrete and deterministic transformations of observed images. Since the transformations are deterministic and capture a specific aspect of the images (e.g., contextual information through detecting edges or downscaling), we refer to them as self-supervised representations. Introducing these discrete and deterministic variables makes it possible to train deep hierarchical models efficiently by decomposing the task of learning a highly complex distribution into training smaller, conditional distributions. In this way, the model integrates prior knowledge about the data while still being able to synthesize unconditional samples.
Furthermore, the discrete and deterministic variables can be used to conditionally reconstruct data, which is of great use in data compression and super-resolution tasks. We make the following contributions: i) We propose an extension of the VAE framework that incorporates self-supervised representations of the data. ii) We analyze the impact of modelling natural images with different data transformations as self-supervised representations. iii) We show that this new type of generative model (the self-supervised Variational Auto-Encoder), which is able to perform both conditional and unconditional sampling, demonstrates improved quantitative performance in terms of density estimation and generative capabilities on image benchmarks.

2.1. VARIATIONAL AUTO-ENCODERS

Let $\mathbf{x} \in \mathcal{X}^D$ be a vector of observable variables, where $\mathcal{X} \subseteq \mathbb{R}$ or $\mathcal{X} \subseteq \mathbb{Z}$, and let $\mathbf{z} \in \mathbb{R}^M$ denote a vector of latent variables. Since calculating $p_\vartheta(\mathbf{x}) = \int p_\vartheta(\mathbf{x}, \mathbf{z})\,\mathrm{d}\mathbf{z}$ is computationally intractable for non-linear stochastic dependencies, a variational family of distributions could be used for approximate inference. Then, the following objective function could be derived, namely, the evidence lower bound (ELBO) (Jordan et al., 1999):
$$\ln p_\vartheta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln p_\theta(\mathbf{x}|\mathbf{z}) + \ln p_\lambda(\mathbf{z}) - \ln q_\phi(\mathbf{z}|\mathbf{x})\right],$$
where $q_\phi(\mathbf{z}|\mathbf{x})$ is the variational posterior (or the encoder), $p_\theta(\mathbf{x}|\mathbf{z})$ is the conditional likelihood function (or the decoder), and $p_\lambda(\mathbf{z})$ is the prior (or marginal); $\phi$, $\theta$ and $\lambda$ denote parameters. The expectation is approximated by Monte Carlo sampling while exploiting the reparameterization trick in order to obtain unbiased gradient estimators. The models are parameterized by neural networks. This generative framework is known as the Variational Auto-Encoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014).
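The single-sample Monte Carlo estimate of the ELBO with the reparameterization trick can be sketched as follows. This is a minimal NumPy illustration, not our implementation: the linear `encode`/`decode` maps are toy stand-ins for the encoder and decoder networks, the prior is a standard Gaussian, and the decoder is taken to be a unit-variance Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_log_prob(x, mean, log_var):
    """log N(x; mean, diag(exp(log_var))), summed over the last axis."""
    return -0.5 * np.sum(log_var + np.log(2 * np.pi)
                         + (x - mean) ** 2 / np.exp(log_var), axis=-1)

def elbo(x, encode, decode, n_samples=1):
    """Monte Carlo estimate of E_q[ln p(x|z) + ln p(z) - ln q(z|x)]."""
    mu, log_var = encode(x)                        # variational posterior q(z|x)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps           # reparameterization trick
    log_q = gaussian_log_prob(z, mu, log_var)      # ln q(z|x)
    log_prior = gaussian_log_prob(z, 0.0, 0.0)     # ln p(z), standard normal
    x_mean = decode(z)                             # decoder mean; unit variance here
    log_lik = gaussian_log_prob(x, x_mean, np.zeros_like(x))
    return np.mean(log_lik + log_prior - log_q)

# Toy linear "networks" standing in for the encoder and decoder:
D, M = 4, 2
W = rng.standard_normal((D, M))
encode = lambda x: (x @ W, np.full(M, -1.0))       # returns (mean, log-variance)
decode = lambda z: z @ W.T

x = rng.standard_normal(D)
val = elbo(x, encode, decode, n_samples=64)
print(val)
```

In practice the gradient of this estimate with respect to the network parameters is unbiased precisely because the sampling noise `eps` does not depend on them.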

2.2. VAES WITH BIJECTIVE PRIORS

Even though the lower bound suggests that the prior plays a crucial role in improving the variational bound, usually a fixed distribution is used, e.g., a standard multivariate Gaussian. While being relatively simple and computationally cheap, the fixed prior is known to result in over-regularized models that tend to ignore most of the latent dimensions (Burda et al., 2015; Hoffman & Johnson, 2016; Tomczak & Welling, 2017). Moreover, even with powerful encoders, VAEs may still fail to match the variational posterior to a unit Gaussian prior (Rosca et al., 2018). However, it is possible to obtain a rich, multi-modal prior distribution $p(\mathbf{z})$ by using a bijective (or flow-based) model (Dinh et al., 2016). Formally, given a latent code $\mathbf{z}$, a base distribution $p_V(\mathbf{v})$ over latent variables $\mathbf{v} \in \mathbb{R}^M$, and $f: \mathbb{R}^M \to \mathbb{R}^M$ consisting of a sequence of $L$ diffeomorphic (i.e., invertible and differentiable) transformations, where $f_i(\mathbf{v}_{i-1}) = \mathbf{v}_i$, $\mathbf{v}_0 = \mathbf{v}$ and $\mathbf{v}_L = \mathbf{z}$, the change-of-variables formula can be applied sequentially to express the distribution of $\mathbf{z}$ as a function of $\mathbf{v}$ as follows:
$$\log p(\mathbf{z}) = \log p_V(\mathbf{v}) - \sum_{i=1}^{L} \log \left| \det \frac{\partial f_i(\mathbf{v}_{i-1})}{\partial \mathbf{v}_{i-1}} \right|,$$
where $\left| \det \frac{\partial f_i(\mathbf{v}_{i-1})}{\partial \mathbf{v}_{i-1}} \right|$ is the Jacobian determinant of the $i$-th transformation. Thus, using the bijective prior yields the following lower bound:
$$\ln p(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x}) + \log p_V(\mathbf{v}_0) + \sum_{i=1}^{L} \log \left| \det \frac{\partial f_i^{-1}(\mathbf{v}_i)}{\partial \mathbf{v}_i} \right| \right].$$
In this work, we utilize RealNVP (Dinh et al., 2016) as the prior; however, any other flow-based model could be used (Kingma & Dhariwal, 2018; Hoogeboom et al., 2020). For experiments and an ablation study showing the impact of the bijective prior on VAEs, we refer to Appendix A.1.
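The change-of-variables computation above can be illustrated with a minimal RealNVP-style affine coupling flow in NumPy. This is a sketch, not our trained prior: the weights are random stand-ins for the learned scale and translation networks, and the final assertion checks that evaluating $\log p(\mathbf{z})$ via the inverse pass agrees with the base log-density minus the forward log-determinants.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4                      # latent dimensionality; couplings split it in half
H = M // 2

# Random weights standing in for the learned scale/translation networks s(.), t(.).
params = [(0.1 * rng.standard_normal((H, H)),
           0.1 * rng.standard_normal((H, H))) for _ in range(2)]

def forward(v, s_w, t_w):
    """Affine coupling f_i: keep v_a, map v_b -> v_b * exp(s(v_a)) + t(v_a).
    Returns the output and log|det Jacobian| (simply the sum of the scales)."""
    v_a, v_b = v[:H], v[H:]
    s, t = np.tanh(v_a @ s_w), v_a @ t_w
    return np.concatenate([v_a, v_b * np.exp(s) + t]), np.sum(s)

def inverse(z, s_w, t_w):
    """f_i^{-1}; its log|det Jacobian| is -sum(s)."""
    z_a, z_b = z[:H], z[H:]
    s, t = np.tanh(z_a @ s_w), z_a @ t_w
    return np.concatenate([z_a, (z_b - t) * np.exp(-s)]), -np.sum(s)

def log_prob(z):
    """log p(z) = log p_V(v_0) + sum_i log|det d f_i^{-1}(v_i) / d v_i|."""
    v, log_det = z, 0.0
    for s_w, t_w in reversed(params):
        v, ld = inverse(v, s_w, t_w)
        log_det += ld
    log_base = -0.5 * np.sum(v ** 2 + np.log(2 * np.pi))  # p_V = N(0, I)
    return log_base + log_det

# Sanity check: both directions of the change-of-variables formula agree.
v0 = rng.standard_normal(M)
z, fwd_log_det = v0, 0.0
for s_w, t_w in params:
    z, ld = forward(z, s_w, t_w)
    fwd_log_det += ld
base_log_prob = -0.5 * np.sum(v0 ** 2 + np.log(2 * np.pi))
assert np.isclose(log_prob(z), base_log_prob - fwd_log_det)
```

The coupling structure is what makes the Jacobian triangular, so its log-determinant reduces to a sum of scales rather than an $O(M^3)$ determinant computation.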

3.1. MOTIVATION

The idea of self-supervised learning is to utilize the original, unlabeled data to create additional context information. This can be achieved in multiple ways, e.g., by adding noise to the data (Vincent et al., 2008) or by masking data during training (Zhang et al., 2017). Self-supervised learning can also be seen as turning an unsupervised model into a supervised one, e.g., by treating the prediction of next pixels as a classification task (Hénaff et al., 2019; Oord et al., 2018). These are only a few examples of a quickly growing line of research (Liu et al., 2020).
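As a concrete illustration of deterministic self-supervised transformations, the two operations named in the abstract (downscaling and edge detection) can be sketched in NumPy as follows. These are illustrative stand-ins only; the exact operators used in the experiments may differ.

```python
import numpy as np

def downscale(img, factor=2):
    """Deterministic downscaling by average pooling over factor x factor blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def edges(img):
    """Crude edge map from absolute finite differences along both axes."""
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    return np.clip(gx + gy, 0.0, 1.0)

# Toy 8x8 image containing a bright square.
x = np.zeros((8, 8))
x[2:6, 2:6] = 1.0
y_down, y_edge = downscale(x), edges(x)
```

Both maps are many-to-one and non-stochastic, which is exactly the property that lets such representations act as discrete, deterministic variables in the model.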




