DIRECT EVOLUTIONARY OPTIMIZATION OF VARIATIONAL AUTOENCODERS WITH BINARY LATENTS

Abstract

Discrete latent variables are considered important for modeling the generation process of real-world data, which has motivated research on Variational Autoencoders (VAEs) with discrete latents. However, standard VAE training is not possible in this case, which has motivated different strategies for manipulating discrete distributions such that discrete VAEs can be trained similarly to conventional ones. Here we ask whether it is also possible to keep the discrete nature of the latents fully intact by applying a direct discrete optimization of the encoding model. The studied approach consequently diverges strongly from standard VAE training by sidestepping mechanisms as fundamental as sampling approximation, the reparameterization trick and amortization. Discrete optimization is realized in a variational setting using truncated posteriors in conjunction with evolutionary algorithms (following a recently suggested approach). For VAEs with binary latents, we first show how such a discrete variational method (A) ties into gradient ascent for network weights and (B) uses the decoder network to select latent states for training. More conventional amortized training is, as may be expected, more efficient than direct discrete optimization, and it is applicable to large neural networks. However, we find that direct optimization scales efficiently to hundreds of latent variables when smaller networks are used. More importantly, we find the effectiveness of direct optimization to be highly competitive in 'zero-shot' learning, where high effectiveness for small networks is required. In contrast to large supervised neural networks, the VAEs investigated here can, e.g., denoise a single image without previous training on clean data and/or training on large image datasets. More generally, the studied approach shows that training of VAEs is indeed possible without sampling-based approximation and reparameterization, which may be of interest for the analysis of VAE training in general. In the regime of few data, direct optimization furthermore makes VAEs competitive for denoising, where they have previously been outperformed by non-generative approaches.

1. INTRODUCTION AND RELATED WORK

Variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014) are prominent and very actively researched models for unsupervised learning. VAEs, in their many different variations, have successfully been applied to a large number of tasks including semi-supervised learning (e.g., Maaløe et al., 2016), anomaly detection (e.g., An & Cho, 2015; Kiran et al., 2018), sentence interpolation (Bowman et al., 2016), music interpolation (Roberts et al., 2018) and drug response prediction (Rampasek et al., 2017). The success of VAEs rests on a series of methods that enable the derivation of scalable training algorithms for their model parameters (discussed further below). A desired feature when applying VAEs to a given problem is that their latent variables (i.e., the encoder output variables) correspond to meaningful properties of the data, ideally to those latent causes that originally generated the data. However, many real-world datasets suggest the use of discrete latents, as these often describe the data generation process more naturally. For instance, the presence or absence of objects in images is best described by binary latents (e.g., Jojic & Frey, 2001). Discrete latents are also a popular choice for modeling sounds; describing piano sounds, for instance, may naturally involve binary latents: keys are either pressed or not (e.g., Titsias & Lázaro-Gredilla, 2011; Goodfellow et al., 2013; Sheikh et al., 2014). The success of standard forms of VAEs has consequently spurred research on novel formulations that feature discrete latents (e.g., Rolfe, 2016; Khoshaman & Amin, 2018; Roy et al., 2018; Sadeghi et al., 2019; Vahdat et al., 2019). The objective of VAE training is the optimization of a generative data model which parameterizes a given data distribution.
Typically we seek model parameters $\Theta$ of a VAE that maximize the data log-likelihood, $\mathcal{L}(\Theta) = \sum_{n} \log p_\Theta(\vec{x}^{(n)})$, where we denote by $\vec{x}^{(1:N)}$ a set of $N$ observed data points, and where $p_\Theta(\vec{x})$ denotes the modeled data distribution. Like conventional autoencoders (e.g., Bengio et al., 2007), VAEs use a deep neural network (DNN) to generate (or decode) observables $\vec{x}$ from a latent code $\vec{z}$. Unlike conventional autoencoders, however, the generation of data $\vec{x}$ is not deterministic but takes the form of a probabilistic generative model. For VAEs with binary latent variables, as will be of interest here, we consider the following VAE generative model:

$$p_\Theta(\vec{z}) = \mathrm{Bern}(\vec{z}; \vec{\pi}) = \prod_h \pi_h^{z_h} (1 - \pi_h)^{(1 - z_h)}, \qquad p_\Theta(\vec{x} \mid \vec{z}) = \mathcal{N}\!\left(\vec{x};\, \vec{\mu}(\vec{z}; W),\, \sigma^2 I\right), \quad (1)$$

where $\vec{z} \in \{0,1\}^H$ is a binary code and the non-linear function $\vec{\mu}(\vec{z}; W)$ is a DNN that outputs the mean of the Gaussian distribution. $p_\Theta(\vec{x} \mid \vec{z})$ is commonly referred to as the decoder. The set of model parameters is $\Theta = \{\vec{\pi}, W, \sigma^2\}$, where $W$ incorporates the DNN weights and biases. We assume homoscedasticity of the Gaussian distribution, but note that there is no obstacle to generalizing the model with a DNN non-linearity that outputs a covariance matrix. Similarly, the algorithm could easily be generalized to other noise distributions should the task at hand call for it. For the purpose of this work, however, we will focus on VAEs that are as elementary as possible, of the form shown in Eqn. (1). Given standard or binary-latent VAEs, essentially all learning algorithms seek to approximately maximize the log-likelihood using the following series of methods (we elaborate in the appendix): (A) Instead of the log-likelihood, a variational lower bound (a.k.a. ELBO) is optimized. (B) VAE posteriors are approximated by an encoding model, i.e., a specific distribution (often Gaussian) parameterized by one or more DNNs.
(C) The variational parameters of the encoder are optimized using gradient ascent on the lower bound, where the gradient is evaluated using sampling and the reparameterization trick to obtain sufficiently low-variance yet efficiently computable estimates. (D) Using samples from the encoder, the parameters of the decoder are optimized using gradient ascent on the variational lower bound.

Optimization procedures for VAEs with discrete latents follow the same steps (Points A to D). However, discrete or binary latents pose substantial further obstacles in learning, mainly because backpropagation through discrete variables is generally not possible (Rolfe, 2016; Bengio et al., 2013). In order to maintain the general VAE framework for encoder optimization, different groups have therefore suggested different possible solutions: work by Rolfe (2016), for instance, extends VAEs with discrete latents by auxiliary continuous latents such that gradients can still be computed. Work on the concrete distribution (Maddison et al., 2016) or Gumbel-softmax distribution (Jang et al., 2016) proposes newly defined continuous distributions that contain discrete distributions as limit cases. Work by Lorberbom et al. (2019) merges the Gumbel-Max reparameterization with the use of direct loss minimization for gradient estimation, enabling efficient training on structured latent spaces. Finally, work by van den Oord et al. (2017) and Roy et al. (2018) combines VAEs with a vector quantization (VQ) stage in the latent layer. Latents become discrete through quantization, but gradients for learning are adapted from latent values before they are processed by the VQ stage. All these methods share the goal of treating discrete distributions such that standard VAE training, as developed for continuous latents, can still be applied. The techniques interact during training with the standard methods (Points A-D) already in place for VAE optimization. Furthermore, they add further types of design decisions and hyper-parameters, for example parameters for annealing from softened discrete distributions to the (hard) original distributions for discrete latents.

For discrete VAEs, it may consequently be a desirable goal to investigate alternative, more direct optimization procedures that do not require a softening of discrete distributions or the use of other indirect solutions. Such a direct approach is challenging, however, because once DNNs are used to define the encoding model (Point B), standard tricks to estimate gradients (Point C) seem unavoidable. A direct optimization procedure, as is investigated here, consequently has to substantially change VAE training. For the data model (1) we will maintain the variational setting and a decoding
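To make the relaxation idea concrete, the following minimal NumPy sketch draws one Gumbel-softmax sample; the temperature `tau` controls the annealing from a softened distribution towards a (hard) discrete one. The function name and all numerical values are illustrative assumptions and not taken from the cited works.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """One relaxed categorical sample via the Gumbel-softmax trick
    (Jang et al., 2016; Maddison et al., 2016). High tau gives smooth
    samples; as tau -> 0 the samples approach one-hot vectors."""
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()                    # shift for numerical stability
    e = np.exp(y)
    return e / e.sum()                 # softmax: differentiable w.r.t. logits

rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))                # illustrative probabilities
smooth = gumbel_softmax_sample(logits, tau=5.0, rng=rng)  # high temperature: smooth
hard = gumbel_softmax_sample(logits, tau=0.01, rng=rng)   # low temperature: near one-hot
```

During training, `tau` is typically annealed towards zero so that the relaxed samples approach genuinely discrete ones, which introduces exactly the kind of additional hyper-parameters discussed above.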
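For contrast, sampling from the binary-latent generative model of Eqn. (1) itself requires no relaxation at all. The sketch below uses a small, purely hypothetical one-hidden-layer decoder standing in for the DNN $\vec{\mu}(\vec{z}; W)$; all sizes, weights and parameter values are illustrative assumptions, not the networks used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 8, 16                   # number of binary latents / observables
pi = np.full(H, 0.2)           # Bernoulli prior parameters pi_h
sigma2 = 0.1                   # homoscedastic noise variance sigma^2

# hypothetical one-hidden-layer decoder in place of mu(z; W)
W1 = rng.normal(size=(H, 32))
W2 = rng.normal(size=(32, D))

def mu(z):
    """Mean of the Gaussian p(x | z), computed by the decoder DNN."""
    return np.tanh(z @ W1) @ W2

# ancestral sampling: z ~ Bern(pi), then x ~ N(mu(z; W), sigma^2 * I)
z = (rng.uniform(size=H) < pi).astype(float)
x = mu(z) + np.sqrt(sigma2) * rng.normal(size=D)
```

The latent code `z` stays strictly binary throughout; it is only the optimization of the encoding model, not the generative model, that makes discrete latents difficult.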

