CAPTURING LABEL CHARACTERISTICS IN VAEs

Abstract

We present a principled approach to incorporating labels in variational autoencoders (VAEs) that captures the rich characteristic information associated with those labels. While prior work has typically conflated labels and characteristics by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs: capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the characteristic capturing VAE (CCVAE), a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints.

1. INTRODUCTION

Learning the characteristic factors of perceptual observations has long been desired for effective machine intelligence (Brooks, 1991; Bengio et al., 2013; Hinton & Salakhutdinov, 2006; Tenenbaum, 1998). In particular, the ability to learn meaningful factors-capturing human-understandable characteristics from data-has been of interest from the perspective of human-like learning (Tenenbaum & Freeman, 2000; Lake et al., 2015) and improving decision making and generalization across tasks (Bengio et al., 2013; Tenenbaum & Freeman, 2000). At its heart, learning meaningful representations of data allows one to not only make predictions, but critically also to manipulate factors of a datapoint. For example, we might want to manipulate the age of a person in an image. Such manipulations allow for the expression of causal effects between the meaning of factors and their corresponding realizations in the data. They can be categorized into conditional generation-the ability to construct whole exemplar data instances with characteristics dictated by constraining relevant factors-and intervention-the ability to manipulate just particular factors for a given data point, and subsequently affect only the associated characteristics. A particularly flexible framework within which to explore the learning of meaningful representations is that of variational autoencoders (VAEs), a class of deep generative models where representations of data are captured in the underlying latent variables. A variety of methods have been proposed for inducing meaningful factors in this framework (Kim & Mnih, 2018; Mathieu et al., 2019; Mao et al., 2019; Kingma et al., 2014; Siddharth et al., 2017; Vedantam et al., 2018), and it has been argued that the most effective generally exploit available labels to (partially) supervise the training process (Locatello et al., 2019).
Such approaches aim to associate certain factors of the representation (or equivalently factors of the generative model) with the labels, such that the former encapsulate the latter-providing a mechanism for manipulation via targeted adjustments of relevant factors. Prior approaches have looked to achieve this by directly associating certain latent variables with labels (Kingma et al., 2014; Siddharth et al., 2017; Maaløe et al., 2016). Originally motivated by the desiderata of semi-supervised classification, these approaches give each label a corresponding latent variable of the same type (e.g. categorical), whose value is fixed to that of the label when the label is observed and imputed by the encoder when it is not. Though natural, we argue that this assumption is not just unnecessary but actively harmful from a representation-learning perspective, particularly in the context of performing manipulations. To allow manipulations, we want to learn latent factors that capture the characteristic information associated with a label, which is typically much richer than just the label value itself. For example, there are various visual characteristics of people's faces associated with the label "young," but simply knowing the label is insufficient to reconstruct these characteristics for any particular instance. Learning a meaningful representation that captures these characteristics, and isolates them from others, requires encoding more than just the label value itself, as illustrated in Figure 1. The key idea of our work is to use labels to help capture and isolate this related characteristic information in a VAE's representation. We do this by exploiting the interplay between the labels and inputs to capture more information than the labels alone convey; information that will be lost (or at least entangled) if we directly encode the label itself.
Specifically, we introduce the characteristic capturing VAE (CCVAE) framework, a novel VAE formulation which captures label characteristics explicitly in the latent space. For each label, we introduce a set of characteristic latents that are induced into capturing the characteristic information associated with that label. By coupling this with a principled variational objective and carefully structuring the characteristic-latent and label variables, we show that CCVAEs successfully capture meaningful representations, enabling better performance on manipulation tasks, while matching previous approaches for prediction tasks. In particular, they permit certain manipulation tasks that cannot be performed with conventional approaches, such as manipulating characteristics without changing the labels themselves and producing multiple distinct samples consistent with the desired intervention. We summarize our contributions as follows: i) showing how labels can be used to capture and isolate rich characteristic information; ii) formulating CCVAEs, a novel model class and objective for supervised and semi-supervised learning in VAEs that allows this information to be captured effectively; iii) demonstrating CCVAEs' ability to successfully learn meaningful representations in practice.
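To make the structural idea concrete, the following is a minimal sketch (not the paper's actual architecture) of the kind of latent partitioning described above: each label is assigned its own block of characteristic latents, with a small per-label classifier that sees only its own block, so that label-relevant information is isolated from the remaining "style" dimensions. All sizes and the linear classifiers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 3 binary labels, each with its
# own 2-d block of characteristic latents, plus 4 leftover style dims.
NUM_LABELS, DIMS_PER_LABEL, STYLE_DIMS = 3, 2, 4
LATENT_DIM = NUM_LABELS * DIMS_PER_LABEL + STYLE_DIMS


def split_latent(z):
    """Partition z into per-label characteristic blocks and a style remainder."""
    char = z[: NUM_LABELS * DIMS_PER_LABEL].reshape(NUM_LABELS, DIMS_PER_LABEL)
    style = z[NUM_LABELS * DIMS_PER_LABEL:]
    return char, style


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# One tiny linear classifier per label: q(y_i | z_{c,i}) sees ONLY its own
# characteristic block, so no label can leak into the style dimensions
# through the classification path.
W = rng.normal(size=(NUM_LABELS, DIMS_PER_LABEL))
b = np.zeros(NUM_LABELS)


def label_probs(z):
    char, _ = split_latent(z)
    # Row-wise dot product: each label reads only its own 2-d block.
    return sigmoid(np.einsum("ij,ij->i", W, char) + b)


z = rng.normal(size=LATENT_DIM)
probs = label_probs(z)
```

Because each characteristic block has more capacity than its binary label, traversing within a block while holding the predicted label fixed corresponds to the kind of intra-label manipulation motivated above.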

2. BACKGROUND

VAEs (Kingma & Welling, 2013; Rezende et al., 2014) are a powerful and flexible class of models that combine the unsupervised representation-learning capabilities of deep autoencoders (Hinton & Zemel, 1994) with generative latent-variable models-a popular tool to capture factored low-dimensional representations of higher-dimensional observations. In contrast to deep autoencoders, generative models capture representations of data not as distinct values corresponding to observations, but rather as distributions of values. A generative model defines a joint distribution over observed data x and latent variables z as p θ (x, z) = p(z)p θ (x | z). Given a model, learning representations of data can be viewed as performing inference-learning the posterior distribution p θ (z | x) that constructs the distribution of latent values for a given observation. VAEs employ amortized variational inference (VI) (Wainwright & Jordan, 2008; Kingma & Welling, 2013) using the encoder and decoder of an autoencoder to transform this setup by i) taking the model likelihood p θ (x | z) to be parameterized by a neural network using the decoder, and ii) constructing an amortized variational approximation q φ (z | x) to the (intractable) posterior p θ (z | x) using the encoder. The variational approximation of the posterior enables effective estimation of the objective-maximizing the marginal likelihood-through importance sampling. The objective is obtained through invoking Jensen's inequality to derive the evidence lower bound (ELBO) of the
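The setup above can be illustrated with a toy numeric sketch: a linear Gaussian "decoder" standing in for the neural network p θ (x | z), a fixed Gaussian q φ (z | x) standing in for the encoder output, and a Monte Carlo estimate of the ELBO combining the expected reconstruction term with the closed-form Gaussian KL. All dimensions and the linear decoder are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_Z, N_SAMPLES = 5, 2, 1000  # toy sizes (assumptions for illustration)

# Toy linear "decoder": p(x|z) = N(W z, I). A real VAE uses a neural network.
W = rng.normal(size=(D_X, D_Z))
x = rng.normal(size=D_X)

# Amortized posterior q(z|x) = N(mu, diag(sigma^2)); fixed here, but normally
# produced by the encoder network from x.
mu = 0.1 * rng.normal(size=D_Z)
log_sigma = np.full(D_Z, -0.5)
sigma = np.exp(log_sigma)


def log_gauss(v, mean, var):
    """Log-density of a diagonal Gaussian at v."""
    return -0.5 * np.sum((v - mean) ** 2 / var + np.log(2 * np.pi * var))


# Monte Carlo estimate of E_q[log p(x|z)] via the reparameterization trick:
# z = mu + sigma * eps with eps ~ N(0, I).
eps = rng.normal(size=(N_SAMPLES, D_Z))
z_samples = mu + sigma * eps
recon = np.mean([log_gauss(x, W @ z, np.ones(D_X)) for z in z_samples])

# KL(q(z|x) || p(z)) in closed form for diagonal Gaussians against N(0, I).
kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)

# ELBO = reconstruction term minus KL; it lower-bounds log p(x).
elbo = recon - kl
```

Maximizing this quantity with respect to the decoder parameters (here W) and encoder parameters (here mu, log_sigma) is exactly the amortized VI training procedure the section describes.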



Figure 1: Manipulating label characteristics for "hair color" and "smile".

