SELF-SUPERVISED VARIATIONAL AUTO-ENCODERS

Abstract

Density estimation, compression, and data generation are crucial tasks in artificial intelligence. Variational Auto-Encoders (VAEs) constitute a single framework to achieve these goals. Here, we present a novel class of generative models, called self-supervised Variational Auto-Encoders (selfVAEs), that utilizes deterministic and discrete transformations of data. This class of models allows performing both conditional and unconditional sampling while simplifying the objective function. First, we use a single self-supervised transformation as a latent variable, where the transformation is either downscaling or edge detection. Next, we consider a hierarchical architecture, i.e., multiple transformations, and we show its benefits compared to the VAE. The flexibility of selfVAE in data reconstruction finds a particularly interesting use case in data compression tasks, where we can trade off memory for better data quality, and vice versa. We present the performance of our approach on three benchmark image datasets (Cifar10, Imagenette64, and CelebA).

1. INTRODUCTION

The framework of variational autoencoders (VAEs) provides a principled approach for learning latent-variable models. As it utilizes a meaningful low-dimensional latent space with density estimation capabilities, it forms an attractive solution for generative modelling tasks. However, its performance in terms of test log-likelihood and quality of generated samples is often disappointing; thus, many modifications have been proposed. In general, one can obtain a tighter lower bound, and thus a more powerful and flexible model, by advancing over the following three components: the encoder (Rezende et al., 2014; van den Berg et al., 2018; Hoogeboom et al., 2020; Maaløe et al., 2016), the prior (or marginal over latents) (Chen et al., 2016; Habibian et al., 2019; Lavda et al., 2020; Lin & Clark, 2020; Tomczak & Welling, 2017), and the decoder (Gulrajani et al., 2016). Recent studies have shown that by employing deep hierarchical architectures and by carefully designing the building blocks of the neural networks, VAEs can successfully model high-dimensional data and reach state-of-the-art test likelihoods (Zhao et al., 2017; Maaløe et al., 2019; Vahdat & Kautz, 2020). In this work, we present a novel class of VAEs, called self-supervised Variational Auto-Encoders, in which we introduce additional variables that result from discrete and deterministic transformations of observed images. Since the transformations are deterministic and capture a specific aspect of the images (e.g., contextual information through detecting edges or downscaling), we refer to them as self-supervised representations. The introduction of the discrete and deterministic variables makes it possible to train deep hierarchical models efficiently by decomposing the task of learning a highly complex distribution into training smaller, conditional distributions. In this way, the model can integrate prior knowledge about the data while still supporting unconditional sampling.
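The decomposition described above can be sketched as follows (a schematic illustration; the notation here is assumed rather than taken from a specific equation in the paper). If $y = f(x)$ is a deterministic transformation of an image $x$ (e.g., a downscaled version or an edge map), then the pair $(x, y)$ carries no more information than $x$ alone, and the log-density splits into two smaller modelling problems:

\begin{align}
\ln p(x) &= \ln p\big(x, f(x)\big) \\
         &= \ln p\big(x \mid y\big) + \ln p(y), \qquad y = f(x),
\end{align}

where $p(y)$ models the simpler self-supervised representation and $p(x \mid y)$ reconstructs the full image conditioned on it. Each factor can then be lower-bounded by its own variational objective with separate latent variables, which is what allows the overall task to be split into smaller conditional models.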
Furthermore, the discrete and deterministic variables can be used to conditionally reconstruct data, which could be of great use in data compression and super-resolution tasks. We make the following contributions: i) We propose an extension of the VAE framework by incorporating self-supervised representations of the data. ii) We analyze the impact of modelling natural images with different data transformations as self-supervised representations. iii) This new type of generative model (the self-supervised Variational Auto-Encoder), which is able to perform both conditional and unconditional sampling, demonstrates improved quantitative performance in terms of density estimation and generative capabilities on image benchmarks.
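To make the notion of a self-supervised representation concrete, the sketch below implements two deterministic transformations of the kind mentioned above, downscaling and edge detection, in plain NumPy. This is a minimal illustration, not the paper's implementation: the function names, the average-pooling choice for downscaling, and the finite-difference edge detector with its threshold are all assumptions made for the example.

```python
import numpy as np

def downscale(x, factor=2):
    """Deterministic downscaling by average pooling.

    x: (H, W) grayscale image; H and W must be divisible by factor.
    Returns an (H // factor, W // factor) image.
    """
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def edge_detect(x, threshold=0.5):
    """Deterministic binary edge map from finite-difference gradients.

    (Illustrative stand-in for an edge detector; threshold is arbitrary.)
    """
    gx = np.zeros_like(x)
    gy = np.zeros_like(x)
    gx[:, :-1] = x[:, 1:] - x[:, :-1]   # horizontal gradient
    gy[:-1, :] = x[1:, :] - x[:-1, :]   # vertical gradient
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return (magnitude > threshold).astype(np.float32)

# Example: an 8x8 image with a dark left half and a bright right half.
img = np.zeros((8, 8), dtype=np.float32)
img[:, 4:] = 1.0

y_down = downscale(img)    # 4x4 self-supervised representation
y_edge = edge_detect(img)  # binary edge map marking the boundary
```

Because both maps are deterministic functions of the image, `y_down` and `y_edge` require no inference network of their own; they can directly serve as observed conditioning variables for the conditional model.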

