CAPTURING LABEL CHARACTERISTICS IN VAEs

Abstract

We present a principled approach to incorporating labels in variational autoencoders (VAEs) that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs: capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the characteristic capturing VAE (CCVAE), a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints.

1. INTRODUCTION

Learning the characteristic factors of perceptual observations has long been desired for effective machine intelligence (Brooks, 1991; Bengio et al., 2013; Hinton & Salakhutdinov, 2006; Tenenbaum, 1998). In particular, the ability to learn meaningful factors, capturing human-understandable characteristics from data, has been of interest from the perspective of human-like learning (Tenenbaum & Freeman, 2000; Lake et al., 2015) and improving decision making and generalization across tasks (Bengio et al., 2013; Tenenbaum & Freeman, 2000). At its heart, learning meaningful representations of data allows one not only to make predictions, but critically also to manipulate factors of a datapoint. For example, we might want to manipulate the age of a person in an image. Such manipulations allow for the expression of causal effects between the meaning of factors and their corresponding realizations in the data. They can be categorized into conditional generation, the ability to construct whole exemplar data instances with characteristics dictated by constraining relevant factors, and intervention, the ability to manipulate just particular factors for a given data point and subsequently affect only the associated characteristics. A particularly flexible framework within which to explore the learning of meaningful representations is that of variational autoencoders (VAEs), a class of deep generative models where representations of data are captured in the underlying latent variables. A variety of methods have been proposed for inducing meaningful factors in this framework (Kim & Mnih, 2018; Mathieu et al., 2019; Mao et al., 2019; Kingma et al., 2014; Siddharth et al., 2017; Vedantam et al., 2018), and it has been argued that the most effective generally exploit available labels to (partially) supervise the training process (Locatello et al., 2019).
Such approaches aim to associate certain factors of the representation (or equivalently, factors of the generative model) with the labels, such that the former encapsulate the latter, providing a mechanism for manipulation via targeted adjustments of relevant factors. Prior approaches have looked to achieve this by directly associating certain latent variables with labels (Kingma et al., 2014; Siddharth et al., 2017; Maaløe et al., 2016). Originally motivated by the desiderata of semi-supervised classification, each label is given a corresponding latent variable of the same type (e.g. categorical), whose value is fixed to that of the label when the label is observed and imputed by the encoder when it is not. Though natural, we argue that this assumption is not just unnecessary but actively harmful from a representation-learning perspective, particularly in the context of performing manipulations. To allow manipulations, we want to learn latent factors that capture the characteristic information associated with a label, which is typically much richer than just the label value itself. For example, there are various visual characteristics of people's faces associated with the label "young," but simply knowing the label is insufficient to reconstruct these characteristics for any particular instance. Learning a meaningful representation that captures these characteristics, and isolates them from others, requires encoding more than just the label value itself, as illustrated in Figure 1.

Figure 1: Manipulating label characteristics for "hair color" and "smile".

The key idea of our work is to use labels to help capture and isolate this related characteristic information in a VAE's representation. We do this by exploiting the interplay between the labels and inputs to capture more information than the labels alone convey; information that will be lost (or at least entangled) if we directly encode the label itself.
Specifically, we introduce the characteristic capturing VAE (CCVAE) framework, which employs a novel VAE formulation that captures label characteristics explicitly in the latent space. For each label, we introduce a set of characteristic latents that are induced into capturing the characteristic information associated with that label. By coupling this with a principled variational objective and carefully structuring the characteristic-latent and label variables, we show that CCVAEs successfully capture meaningful representations, enabling better performance on manipulation tasks, while matching previous approaches for prediction tasks. In particular, they permit certain manipulation tasks that cannot be performed with conventional approaches, such as manipulating characteristics without changing the labels themselves and producing multiple distinct samples consistent with the desired intervention. We summarize our contributions as follows: i) showing how labels can be used to capture and isolate rich characteristic information; ii) formulating CCVAEs, a novel model class and objective for supervised and semi-supervised learning in VAEs that allows this information to be captured effectively; iii) demonstrating CCVAEs' ability to successfully learn meaningful representations in practice.

2. BACKGROUND

VAEs (Kingma & Welling, 2013; Rezende et al., 2014) are a powerful and flexible class of model that combine the unsupervised representation-learning capabilities of deep autoencoders (Hinton & Zemel, 1994) with generative latent-variable models, a popular tool to capture factored low-dimensional representations of higher-dimensional observations. In contrast to deep autoencoders, generative models capture representations of data not as distinct values corresponding to observations, but rather as distributions of values. A generative model defines a joint distribution over observed data x and latent variables z as p_θ(x, z) = p(z) p_θ(x | z). Given a model, learning representations of data can be viewed as performing inference: learning the posterior distribution p_θ(z | x) that constructs the distribution of latent values for a given observation. VAEs employ amortized variational inference (VI) (Wainwright & Jordan, 2008; Kingma & Welling, 2013) using the encoder and decoder of an autoencoder to transform this setup by i) taking the model likelihood p_θ(x | z) to be parameterized by a neural network using the decoder, and ii) constructing an amortized variational approximation q_φ(z | x) to the (intractable) posterior p_θ(z | x) using the encoder. The variational approximation of the posterior enables effective estimation of the objective, maximizing the marginal likelihood, through importance sampling. The objective is obtained by invoking Jensen's inequality to derive the evidence lower bound (ELBO) of the model, given as

log p_θ(x) = log E_{q_φ(z|x)}[ p_θ(x, z) / q_φ(z | x) ] ≥ E_{q_φ(z|x)}[ log ( p_θ(x, z) / q_φ(z | x) ) ] ≡ L(x; φ, θ).   (1)

Given observations D = {x_1, ..., x_N}, taken to be realizations of random variables generated from an unknown distribution p_D(x), the overall objective is (1/N) Σ_n L(x_n; φ, θ). Hierarchical VAEs (Sønderby et al., 2016) impose a hierarchy of latent variables, improving the flexibility of the approximate posterior; however, we do not consider these models in this work. Semi-supervised VAEs (SSVAEs) (Kingma et al., 2014; Maaløe et al., 2016; Siddharth et al., 2017) consider the setting where a subset of data S ⊂ D is assumed to also have corresponding labels y. Denoting the (unlabeled) data as U = D \ S, the log-marginal likelihood decomposes as

log p_θ(D) = Σ_{(x,y)∈S} log p_θ(x, y) + Σ_{x∈U} log p_θ(x),

where the individual log-likelihoods are lower bounded by their ELBOs. Standard practice is then to treat y as a latent variable to marginalize over whenever the label is not provided. More specifically, most approaches consider splitting the latent space into z = {z_y, z_{\y}} and then directly fixing z_y = y whenever the label is provided, such that each dimension of z_y explicitly represents a predicted value of a label, with this value known exactly only for the labeled datapoints. Much of the original motivation for this (Kingma et al., 2014) was based around performing semi-supervised classification of the labels, with the encoder being used to impute the values of z_y for the unlabeled datapoints. However, the framework is also regularly used as a basis for learning meaningful representations and performing manipulations, exploiting the presence of the decoder to generate new datapoints after intervening on the labels via changes to z_y. Our focus lies on the latter, for which we show this standard formulation leads to serious pathologies. Our primary goal is not to improve the fidelity of generations, but instead to demonstrate how label information can be used to structure the latent space such that it encapsulates and disentangles the characteristics associated with the labels.
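As a concrete illustration of the bound in (1), the following toy sketch (ours, not the paper's code) estimates the ELBO by Monte Carlo for a one-dimensional linear-Gaussian model, where log p_θ(x) is available in closed form so the inequality can be checked directly; the encoder parameters a, s and likelihood parameter b are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    # log density of N(v; mean, std^2)
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo(x, a=0.2, s=1.5, b=1.0, n_samples=1000):
    # Sample z ~ q(z|x) = N(a*x, s^2) via the reparameterization z = a*x + s*eps,
    # then average log p(x, z) - log q(z|x) over the samples.
    eps = rng.standard_normal(n_samples)
    z = a * x + s * eps
    log_joint = log_normal(x, b * z, 1.0) + log_normal(z, 0.0, 1.0)  # log p(x|z) + log p(z)
    log_q = log_normal(z, a * x, s)                                  # log q(z|x)
    return np.mean(log_joint - log_q)
```

Because this toy model is linear-Gaussian with b = 1, the marginal is p(x) = N(0, 2), so the gap between log p(x) and the ELBO (the KL from q to the true posterior) can be inspected directly; with a deliberately mismatched encoder the bound is loose but always valid.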

3. RETHINKING SUPERVISION

As we explained in the last section, the de facto assumption for most approaches to supervision in VAEs is that the labels correspond to a partially observed augmentation of the latent space, z_y. However, this can cause a number of issues if we want the latent space to encapsulate not just the labels themselves, but also the characteristics associated with these labels; for example, encapsulating the youthful characteristics of a face, not just the fact that it is a "young" face. At an abstract level, such an approach fails to capture the relationship between the inputs and labels: it fails to isolate the characteristic information associated with each label from the other information required to reconstruct data. More specifically, it fails to deal with the following issues. First, the information in a datapoint associated with a label is richer than that stored by the (typically categorical) label itself. That is not to say such information is absent when we impose z_y = y, but here it is entangled with the other latent variables z_{\y}, which simultaneously contain the associated information for all the labels. Moreover, when y is categorical, it can be difficult to ensure that the VAE actually uses z_y, rather than just capturing the information relevant to reconstruction in the higher-capacity, continuous, z_{\y}. Overcoming this is challenging and generally requires additional heuristics and hyperparameters. Second, we may wish to manipulate characteristics without fully changing the categorical label itself, for example, making a CelebA image depict more or less 'smiling' without fully changing its "smile" label. Here we do not know how to manipulate the latents to achieve this desired effect: we can only perform the binary operation of changing the relevant variable in z_y. Also, we often wish to keep a level of diversity when carrying out conditional generation and, in particular, interventions.
For example, if we want to add a smile, there is no single correct answer for how the smile should look, but taking z_y = "smile" only allows for a single point estimate for the change. Finally, taking the labels to be explicit latent variables can cause a mismatch between the VAE prior p(z) and the pushforward distribution of the data to the latent space, q(z) = E_{p_D(x)}[q_φ(z | x)]. During training, latents are effectively generated according to q(z), but once learned, p(z) is used to make generations; variation between the two effectively corresponds to a train-test mismatch. As there is a ground-truth data distribution over the labels (which are typically not independent), taking the latents as the labels themselves implies that there will be a ground-truth q(z_y). However, as this is not generally known a priori, we will inevitably end up with a mismatch. What do we want from supervision? Given these issues, it is natural to ask whether having latents directly correspond to labels is actually necessary. To answer this, we need to think about exactly what it is we are hoping to achieve through the supervision itself. Along with uses of VAEs more generally, the three most prevalent tasks are: a) Classification, predicting the labels of inputs where these are not known a priori; b) Conditional Generation, generating new examples conditioned on those examples conforming to certain desired labels; and c) Intervention, manipulating certain desired characteristics of a data point before reconstructing it. Inspecting these tasks, we see that for classification we need a classifier from z to y, for conditional generation we need a mechanism for sampling z given y, and for interventions we need to know how to manipulate z to bring about a desired change. None of these require us to have the labels directly correspond to latent variables.
Moreover, as we previously explained, this assumption can be actively harmful, such as restricting the range of interventions that can be performed.

4. CHARACTERISTIC CAPTURING VARIATIONAL AUTOENCODERS

To correct the issues discussed in the last section, we suggest eschewing the treatment of labels as direct components of the latent space, and instead employing them to condition latent variables which are designed to capture the characteristics. To this end, we similarly split the latent space into two components, z = {z_c, z_{\c}}, but where z_c, the characteristic latent, is now designed to capture the characteristics associated with labels, rather than directly encode the labels themselves. In this breakdown, z_{\c} is intended only to capture information not directly associated with any of the labels, unlike z_{\y}, which was still tasked with capturing the characteristic information. For the purposes of exposition, and purely to demonstrate how one might apply this schema, we first consider a standard VAE with a latent space z = {z_c, z_{\c}}. The latent representation of the VAE will implicitly contain the characteristic information required to perform classification; however, the structure of the latent space will be arranged to optimize for reconstruction, and characteristic information may be entangled between z_c and z_{\c}. If we were now to jointly learn a classifier, from z_c to y, with the VAE, resulting in the following objective

J = Σ_{x∈U} L_VAE(x) + Σ_{(x,y)∈S} ( L_VAE(x) + α E_{q_φ(z|x)}[log q_ϕ(y | z_c)] ),   (2)

where α is a hyperparameter, there would be pressure on the encoder to place characteristic information in z_c, which can be interpreted as a stochastic layer containing the information needed for classification and reconstruction. The classifier thus acts as a tool allowing y to influence the structure of z; it is this high-level concept, i.e. using y to structure z, that we utilize in this work. However, in general, the characteristics of different labels will be entangled within z_c.
Though it will contain the required information, the latents will typically be uninterpretable, and it is unclear how we could perform conditional generation or interventions. To disentangle the characteristics of different labels, we further partition the latent space, such that the classification of a particular label y_i only has access to particular latents z_c^i, and thus log q_ϕ(y | z_c) = Σ_i log q_{ϕ_i}(y_i | z_c^i). This has the critical effect of forcing the characteristic information needed to classify y_i to be stored only in the corresponding z_c^i, providing a means to encapsulate such information for each label separately. We further see that it addresses many of the prior issues: there are no measure-theoretic issues, as z_c^i is not discrete; diversity in interventions is achieved by sampling different z_c^i for a given label; z_c^i can be manipulated while remaining within class decision boundaries; and a mismatch between p(z_c) and q(z_c) does not manifest, as there is no ground truth for q(z_c). How to conditionally generate or intervene when training with (2) is not immediately obvious though. However, the classifier implicitly contains the requisite information to do this via inference in an implied Bayesian model. For example, conditional generation needs samples from p(z_c) that classify to the desired labels, obtainable e.g. through rejection sampling. See Appendix A for further details.
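The factorisation log q_ϕ(y | z_c) = Σ_i log q_{ϕ_i}(y_i | z_c^i) described above can be sketched as follows; the equal-sized chunks and fixed logistic-head weights are illustrative assumptions, not the learned networks of the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def factorised_log_prob(z_c, y, weights):
    """log q(y | z_c) = sum_i log q_i(y_i | z_c^i), with z_c split into equal
    chunks, one chunk per binary label, and a logistic head per label."""
    chunks = np.split(z_c, len(y))        # z_c^1, ..., z_c^L
    total = 0.0
    for chunk, y_i, w_i in zip(chunks, y, weights):
        p_i = sigmoid(w_i @ chunk)        # q_i(y_i = 1 | z_c^i)
        total += np.log(p_i if y_i == 1 else 1.0 - p_i)
    return total
```

The point of the structure is visible directly in the code: each label's term depends only on its own chunk, so the information needed to classify y_i has nowhere to live but z_c^i.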

4.1. THE CHARACTERISTIC CAPTURING VAE

One way to address the need for inference is to introduce a conditional generative model p_ψ(z_c | y), simultaneously learned alongside the classifier introduced in (2), along with a prior p(y). This approach, which we term the CCVAE, allows the required sampling for conditional generations and interventions directly. Further, by persisting with the latent partitioning above, we can introduce a factorized set of generative models p(z_c | y) = Π_i p(z_c^i | y_i), enabling easy generation and manipulation of each z_c^i individually. CCVAE ensures that labels remain a part of the model for unlabeled datapoints, which transpires to be important for effective learning in practice. To address the issue of learning, we perform variational inference, treating y as a partially observed auxiliary variable. The final graphical model is illustrated in Figure 2.

Figure 2: CCVAE graphical model.

The CCVAE can be seen as a way of combining top-down and bottom-up information to obtain a structured latent representation. However, it is important to highlight that CCVAE does not contain a hierarchy of latent variables. Unlike a hierarchical VAE, reconstruction is performed only from z ∼ q_φ(z | x) without going through the "deeper" y, as doing so would lead to a loss of information due to the bottleneck of y. By enforcing each label variable to link to different characteristic-latent dimensions, we are able to isolate the generative factors corresponding to different label characteristics.
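A minimal sketch of the factorised conditional prior p_ψ(z_c | y) = Π_i p_ψ(z_c^i | y_i), and of a label intervention that resamples only one chunk, is given below; the per-label-value means and standard deviations stand in for the MLP outputs µ_ψ(y), σ_ψ(y), and their values are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-label-value prior parameters: illustrative stand-ins for the MLP
# outputs mu_psi(y) and sigma_psi(y) in the paper.
MU = {0: -1.0, 1: 1.0}
SIGMA = {0: 0.5, 1: 0.5}

def sample_z_c(y, n=1):
    """Draw n samples of z_c ~ prod_i N(mu(y_i), sigma(y_i)^2), one dim per label."""
    mu = np.array([MU[y_i] for y_i in y])
    sd = np.array([SIGMA[y_i] for y_i in y])
    return mu + sd * rng.standard_normal((n, len(y)))

def intervene(z_c, i, new_y_i):
    """Resample only the i-th characteristic latent under a new label value,
    leaving every other chunk (and z_{\\c}) untouched."""
    z_new = z_c.copy()
    z_new[i] = MU[new_y_i] + SIGMA[new_y_i] * rng.standard_normal()
    return z_new
```

Because the prior factorises over labels, intervening on one label touches one coordinate only, which is exactly what permits the targeted manipulations discussed above.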

4.2. MODEL OBJECTIVE

We now construct an objective function that encapsulates the model described above, by deriving a lower bound on the full model log-likelihood, which factors over the supervised and unsupervised subsets as discussed in § 2. The supervised objective can be defined as

log p_{θ,ψ}(x, y) ≥ E_{q_{ϕ,φ}(z|x,y)}[ log ( p_θ(x | z) p_ψ(z | y) p(y) / q_{ϕ,φ}(z | x, y) ) ] ≡ L_CCVAE(x, y),   (3)

with p_ψ(z | y) = p(z_{\c}) p_ψ(z_c | y). Here, we avoid directly modeling q_{ϕ,φ}(z | x, y); instead we leverage the conditional independence x ⊥⊥ y | z, along with Bayes' rule, to give

q_{ϕ,φ}(z | x, y) = q_ϕ(y | z_c) q_φ(z | x) / q_{ϕ,φ}(y | x),  where  q_{ϕ,φ}(y | x) = ∫ q_ϕ(y | z_c) q_φ(z | x) dz.

Using this equivalence in (3) yields (see Appendix B.1 for a derivation and numerical details)

L_CCVAE(x, y) = E_{q_φ(z|x)}[ ( q_ϕ(y | z_c) / q_{ϕ,φ}(y | x) ) log ( p_θ(x | z) p_ψ(z | y) / ( q_ϕ(y | z_c) q_φ(z | x) ) ) ] + log q_{ϕ,φ}(y | x) + log p(y).   (4)

Note that a classifier term log q_{ϕ,φ}(y | x) falls out naturally from the derivation, unlike in previous models (e.g. Kingma et al. (2014); Siddharth et al. (2017)). Not placing the labels directly in the latent space is crucial for this feature. When defining latents to directly correspond to labels, observing both x and y detaches the mapping q_{ϕ,φ}(y | x) between them, resulting in the parameters (ϕ, φ) not being learned, which motivates the addition of an explicit (weighted) classifier. Here, however, observing both x and y does not detach any mapping, since they are always connected via an unobserved random variable z_c, and hence no additional terms are needed. From an implementation perspective, the strength of this classifier term can be increased; we experimented with this, but found that adjusting the strength had little effect on the overall classification accuracies.
We consider this insensitivity to be a significant strength of this approach, as the model is able to apply enough pressure to the latent space to obtain high classification accuracies without having to hand-tune parameter values. We find that the gradient norm of the classifier parameters suffers from high variance during training; not reparameterizing through z_c in q_ϕ(y | z_c) reduces this effect and aids training, see Appendix C.3.1 for details. For the datapoints without labels, we can again perform variational inference, treating the labels as random variables. Specifically, the unsupervised objective, L_CCVAE(x), derives as the standard (unsupervised) ELBO. However, it requires marginalising over labels as p(z) = p(z_c) p(z_{\c}) = p(z_{\c}) Σ_y p(z_c | y) p(y). This can be computed exactly, but doing so can be prohibitively expensive if the number of possible label combinations is large. In such cases, we apply Jensen's inequality a second time to the expectation over y (see Appendix B.2) to produce a looser, but cheaper to calculate, ELBO given as

L_CCVAE(x) = E_{q_φ(z|x) q_ϕ(y|z_c)}[ log ( p_θ(x | z) p_ψ(z | y) p(y) / ( q_ϕ(y | z_c) q_φ(z | x) ) ) ].   (5)

Combining (4) and (5), we get the following lower bound on the log probability of the data

log p(D) ≥ Σ_{(x,y)∈S} L_CCVAE(x, y) + Σ_{x∈U} L_CCVAE(x),

which, unlike prior approaches, faithfully captures the variational free energy of the model. As shown in § 6, this enables a range of new capabilities and behaviors to encapsulate label characteristics.
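The marginal classifier q_{ϕ,φ}(y | x) = E_{q_φ(z|x)}[q_ϕ(y | z_c)] appearing in (4) can be estimated with simple Monte Carlo; in the sketch below, the fixed Gaussian "encoder" and logistic "classifier" are illustrative stand-ins for the learned networks, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def classifier(z_c, w=2.0):
    # q(y = 1 | z_c): a toy one-dimensional logistic head
    return sigmoid(w * z_c)

def q_y_given_x(mu, sigma, n=20000):
    """Monte Carlo estimate of q(y = 1 | x) = E_{q(z_c|x)}[ q(y = 1 | z_c) ]
    for an encoder q(z_c | x) = N(mu, sigma^2)."""
    z_c = mu + sigma * rng.standard_normal(n)
    return classifier(z_c).mean()
```

Averaging the classifier over encoder samples is all that is needed: the classification probability is marginal over the latent, so no extra classifier network or loss term has to be bolted on.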

5. RELATED WORK

The seminal work of Kingma et al. (2014) was the first to consider supervision in the VAE setting, introducing the M2 model for semi-supervised classification, which was also the first approach to place labels directly in the latent space. The related approach of Maaløe et al. (2016) augments the encoding distribution with an additional, unobserved latent variable, enabling better semi-supervised classification accuracies. Siddharth et al. (2017) extended the above work to automatically derive the regularised objective for models with arbitrary (pre-defined) latent dependency structures. The approach of placing labels directly in the latent space was also adopted by Li et al. (2019). Regarding the disparity between continuous and discrete latent variables in typical semi-supervised VAEs, Dupont (2018) provides an approach to enable effective unsupervised learning in this setting. From a purely modeling perspective, there also exists prior work on VAEs involving hierarchies of latent variables, exploring richer higher-order inference and issues with redundancy among latent variables, both in unsupervised (Ranganath et al., 2016; Zhao et al., 2017) and semi-supervised (Maaløe et al., 2017; 2019) settings. In the unsupervised case, these hierarchical variables do not have a direct interpretation, but exist merely to improve the flexibility of the encoder. The semi-supervised approaches extend the basic M2 model to hierarchical VAEs by incorporating the labels as an additional latent (see Appendix F in Maaløe et al., 2019, for example), and hence must incorporate additional regularisers in the form of classifiers, as in the case of M2. Moreover, by virtue of the typical dependencies assumed between labels and latents, it is difficult to disentangle the characteristics associated just with the label from the characteristics associated with the rest of the data, something we capture using our simpler split latents (z_c, z_{\c}). From a more conceptual standpoint, Mueller et al. (2017) introduce interventions (called revisions) on VAEs for text data, regressing to auxiliary sentiment scores as a means of influencing the latent variables. This formulation is similar to (2) in spirit, although in practice they employ a range of additional factorings and regularizations particular to their domain of interest, in addition to training models in stages involving different objective terms. Nonetheless, they share our desire to enforce meaningfulness in the latent representations through auxiliary supervision. Another related approach involves explicitly treating labels as another data modality (Vedantam et al., 2018; Suzuki et al., 2017; Wu & Goodman, 2018; Shi et al., 2019). This work is motivated by the need to learn latent representations that jointly encode data from different modalities. Looking back to (3), by refactoring p(z | y) p(y) as p(y | z) p(z), and taking q(z | x, y) = G(q(z | x), q(z | y)), one derives multi-modal VAEs, where G can construct a product (Wu & Goodman, 2018) or mixture (Shi et al., 2019) of experts. Of these, the MVAE (Wu & Goodman, 2018) is more closely related to our setup here, as it explicitly targets cases where the alternate data modality is labels. However, it differs in that the latent representations are not structured explicitly to map to distinct classifiers, and it does not explore the question of explicitly capturing the label characteristics. The JLVM model of Adel et al. (2018) is similar to the MVAE, but is motivated from an interpretability perspective, with labels providing 'side-channel' information to constrain latents. They adopt a flexible normalising-flow posterior from data x, along with a multi-component objective that is additionally regularised with the information bottleneck between data x, latent z, and label y. DIVA (Ilse et al., 2019) introduces a similar graphical model to ours, but is motivated by learning a classifier that generalizes across different domains. The objective is formed of a classifier which is regularized by a variational term, requiring additional hyperparameters and preventing the ability to disentangle the representations. In Appendix C.4 we propose some modifications to DIVA that allow it to be applied in our problem domain. In terms of interpretability, the work of Ainsworth et al. (2018) is closely related to ours, but they focus primarily on grouped data and do not introduce labels. Here, the authors employ sparsity in the multiple linear transforms for each decoder (one for each group) to encourage certain latent dimensions to encapsulate certain factors in the sample, thus introducing interpretability into the model. Tangentially to VAEs, similar objectives of structuring the latent space using GANs also exist (Xiao et al., 2017; 2018), although they focus purely on interventions and cannot perform conditional generation, classification, or likelihood estimation.

6. EXPERIMENTS

Following our reasoning in § 3, we now showcase the efficacy of CCVAE for the three broad aims of (a) intervention, (b) conditional generation, and (c) classification, for a variety of supervision rates, denoted by f. Specifically, we demonstrate that CCVAE is able to: encapsulate characteristics for each label in an isolated manner; introduce diversity in the conditional generations; permit finer control over interventions; and match traditional metrics of baseline models. Furthermore, we demonstrate that no existing method is able to perform all of the above, highlighting its sophistication over existing methods. We compare against: M2 (Kingma et al., 2014); MVAE (Wu & Goodman, 2018); and our modified version of DIVA (Ilse et al., 2019). See Appendix C.4 for details. To demonstrate the capture of label characteristics, we consider the multi-label setting and utilise the Chexpert (Irvin et al., 2019) and CelebA (Liu et al., 2015) datasets. For CelebA, we restrict ourselves to the 18 labels which are distinguishable in reconstructions; see Appendix C.1 for details. We use the architectures from Higgins et al. (2016) for the encoder and decoder. The label-predictive distribution q_ϕ(y | z_c) is defined as Ber(y | π_ϕ(z_c)), with a diagonal transformation π_ϕ(·) enforcing q_ϕ(y | z_c) = Π_i q_{ϕ_i}(y_i | z_c^i). The conditional prior p_ψ(z_c | y) is then defined as N(z_c | µ_ψ(y), diag(σ²_ψ(y))), with appropriate factorization, and has its parameters also derived through MLPs. See Appendix C.3 for further details.

6.1. INTERVENTIONS

If CCVAE encapsulates the characteristics of a label in a single latent (or small set of latents), then it should be able to smoothly manipulate these characteristics without severely affecting others. This allows for finer control during interventions, which is not possible when the latent variables directly correspond to labels. To demonstrate this, we traverse two dimensions of the latent space and display the reconstructions in Figure 3. These examples indicate that CCVAE is indeed able to smoothly manipulate characteristics. For example, in b) we are able to induce varying skin tones, rather than have this be a binary intervention on pale skin, unlike DIVA in a). In c), the z_c^i associated with the necktie label has also managed to encapsulate information about whether someone is wearing a shirt or is bare-necked. No such traversals are possible for M2, and it is not clear how one would perform them for MVAE; additional results, including traversals for DIVA, are given in Appendix D.2.
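A traversal of this kind can be sketched as follows: hold all latents fixed and sweep one characteristic dimension over a grid, decoding at each point. The linear decoder below is a stand-in for p_θ(x | z); with the real model, each grid point would be rendered as an image.

```python
import numpy as np

def decode(z, W):
    # toy linear decoder standing in for p_theta(x | z)
    return W @ z

def traverse(z, dim, grid, W):
    """Decode while sweeping a single characteristic dimension over a grid,
    holding every other latent fixed."""
    outs = []
    for v in grid:
        z_t = z.copy()
        z_t[dim] = v           # intervene on one characteristic latent only
        outs.append(decode(z_t, W))
    return np.stack(outs)
```

With the toy identity decoder, the swept coordinate follows the grid while the remaining outputs stay constant, which is the behavior a well-disentangled z_c^i should exhibit.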

6.2. DIVERSITY OF GENERATIONS

Label characteristics naturally encapsulate diversity (e.g. there are many ways to smile), which should be present in the learned representations. By virtue of the structured mappings between labels and characteristic latents, and since z_c is parameterized by continuous distributions, CCVAE is able to capture diversity in representations, allowing exploration of an attribute (e.g. smile) while preserving other characteristics. This is not possible with labels directly defined as latents, since only discrete choices can be made; diversity can only be introduced there by sampling from the unlabeled latent space, which necessarily affects all other characteristics. To demonstrate this, we reconstruct multiple times with z = {z_c ∼ p_ψ(z_c | y), z_{\c}} for a fixed z_{\c}. We provide qualitative results in Figure 4. If several samples are taken from z_c ∼ p_ψ(z_c | y) when intervening on only a single characteristic, the resulting variations in pixel values should be focused around the locations relevant to that characteristic; e.g., pixel variations should be focused around the neck when intervening on necktie. To demonstrate this, we perform single interventions on each class and take multiple samples of z_c ∼ p_ψ(z_c | y). We then display the variance of each pixel in the reconstruction in green in Figure 5, where it can be seen that generally there is only variance in the spatial locations expected. Interestingly, for the class smile (2nd from right), there is variance in the jaw line, suggesting that the model is able to capture more subtle components of variation than just the mouth.

6.3. CLASSIFICATION

To demonstrate that reparameterizing the labels in the latent space does not hinder classification accuracy, we inspect the predictive ability of CCVAE across a range of supervision rates f, given in Table 1. It can be observed that CCVAE generally obtains prediction accuracies slightly superior to those of other models. We emphasize here that CCVAE's primary purpose is not to achieve better classification accuracies; we are simply checking that it does not harm them, which it most clearly does not. If a model can correctly disentangle the label characteristics from the other generative factors, then manipulating the unlabeled contextual attributes z_{\c} should not change the label characteristics of the reconstruction. To demonstrate this, we perform "characteristic swaps," where we first obtain z = {z_c, z_{\c}} for a given image, then swap in the characteristics z_c to another image before reconstructing. This should apply the exact characteristics, not just the label, to the scene/background of the other image (cf. Figure 6). Comparing CCVAE to our baselines in Figure 7, we see that CCVAE is able to transfer the exact characteristics to a greater extent than the other models. Particular attention is drawn to the preservation of labeled characteristics in each row, where CCVAE is able to preserve characteristics like the precise skin tone and hair color of the pictures on the left. We see that M2 is only able to preserve the label and not the exact characteristic, while MVAE performs very poorly, effectively ignoring the attributes entirely. Our modified DIVA variant performs reasonably well, but less reliably and at the cost of reconstruction fidelity compared to CCVAE. An ideal characteristic swap should not change the probability assigned by a pre-trained classifier between the original image and a swapped one.
We employ this as a quantitative measure, reporting the average difference in log-probabilities over multiple swaps in Table 2. CCVAE preserves the characteristics to a greater extent than the other models. DIVA's performance is largely due to its heavier weighting on the classifier, which adversely affects reconstructions, as seen earlier.

7. DISCUSSION

We have presented a novel mechanism for faithfully capturing label characteristics in VAEs, the characteristic capturing VAE (CCVAE), which captures label characteristics explicitly in the latent space while eschewing direct correspondences between label values and latents. This has allowed us to encapsulate and disentangle the characteristics associated with labels, rather than just the label values. We are able to do so without affecting the ability to perform the tasks one typically does in the (semi-)supervised setting-namely classification, conditional generation, and intervention. In particular, we have shown that, not only does this lead to more effective conventional label-switch interventions, it also allows for more fine-grained interventions to be performed, such as producing diverse sets of samples consistent with an intervened label value, or performing characteristic swaps between datapoints that retain relevant features.

A CONDITIONAL GENERATION AND INTERVENTION FOR EQUATION (2)

For the model trained using (2) as the objective to be usable, we must consider whether it can carry out the classification, conditional generation, and intervention tasks outlined previously. Of these, classification is straightforward, but it is less apparent how the others could be performed. The key is to realize that the classifier itself implicitly contains the information required to perform these tasks. Consider first conditional generation, and note that we still have access to the prior p(z) as in a standard VAE. One simple way of performing conditional generation would be rejection sampling: draw samples ẑ ∼ p(z) and accept them if and only if they lead to the classifier predicting the desired labels up to a desired level of confidence, i.e. q_ϕ(y | ẑ_c) > λ, where 0 < λ < 1 is some chosen confidence threshold. Though such an approach is likely to be highly inefficient for a general p(z) due to the curse of dimensionality, in the standard setting where each dimension of z is independent, this rejection sampling can be performed separately for each z_c^i, making it relatively efficient. More generally, conditional generation becomes an inference problem where we wish to draw samples from p(z | {q_ϕ(y | z_c) > λ}) ∝ p(z) I(q_ϕ(y | z_c) > λ). Interventions can be performed in an analogous manner. Namely, for a conventional intervention where we change one or more labels, we can simply resample the z_c^i associated with those labels, thereby sampling new characteristics to match the new labels. Further, unlike prior approaches, we can also perform alternative interventions. For example, we might attempt to find the closest z_c^i to the original that leads to the class label changing; this can be done in a manner akin to how adversarial attacks are performed.
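The per-dimension rejection sampling described above can be sketched as follows; `classifier_prob` is a hypothetical stand-in for the learned per-dimension classifier q_ϕ(y_i | z_c^i), here a fixed sigmoid, and `lam` plays the role of the confidence threshold λ:

```python
import numpy as np

rng = np.random.default_rng(1)

def classifier_prob(zc_i, target):
    """Hypothetical per-dimension classifier q(y_i = target | z_c^i):
    a fixed sigmoid stand-in for the learned classifier."""
    p = 1.0 / (1.0 + np.exp(-4.0 * zc_i))
    return p if target == 1 else 1.0 - p

def sample_zc_dim(target, lam=0.9, max_tries=10000):
    """Rejection-sample z_c^i from the prior until the classifier assigns
    the desired label with confidence above lam."""
    for _ in range(max_tries):
        z = rng.normal()                      # prior p(z_c^i) = N(0, 1)
        if classifier_prob(z, target) > lam:  # accept iff confident enough
            return z
    raise RuntimeError("rejection sampling failed")
```

Because each dimension is sampled independently, the acceptance rate stays reasonable even when many labels are conditioned on simultaneously.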
Alternatively, we might look to manipulate z_c^i without actually changing the class itself, to see what other characteristics are consistent with the labels. To summarize, (2) yields an objective which provides a way of learning a semi-supervised VAE that avoids the pitfalls of directly fixing the latents to correspond to labels. It still allows us to perform all the tasks usually associated with semi-supervised VAEs, and in fact allows a more general form of intervention to be performed. However, this comes at the cost of requiring inference to perform conditional generation or interventions. Further, as the label variables y are absent when the labels are unobserved, there may be empirical complications with forcing all the denotational information to be encoded into the appropriate characteristic latent z_c^i. In particular, we still have a hyperparameter α that must be carefully tuned to ensure the appropriate balance between classification and reconstruction.

B MODEL FORMULATION

B.1 VARIATIONAL LOWER BOUND

In this section we provide the mathematical details of our objective function: we derive it as a lower bound on the marginal model likelihood and show how we estimate the model components. The variational lower bound for the generative model in Figure 2 is given as
$$\mathcal{L}_{\text{CCVAE}} = \sum_{x\in\mathcal{U}} \mathcal{L}_{\text{CCVAE}}(x) + \sum_{(x,y)\in\mathcal{S}} \mathcal{L}_{\text{CCVAE}}(x,y),$$
with
$$\mathcal{L}_{\text{CCVAE}}(x,y) = \mathbb{E}_{q_\phi(z|x)}\left[\frac{q_\varphi(y\mid z_c)}{q_{\varphi,\phi}(y\mid x)} \log\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)}{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}\right] + \log q_{\varphi,\phi}(y\mid x) + \log p(y),$$
$$\mathcal{L}_{\text{CCVAE}}(x) = \mathbb{E}_{q_\phi(z|x)\,q_\varphi(y|z_c)}\left[\log\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)\,p(y)}{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}\right].$$
The overall likelihood in the semi-supervised case is given as
$$p_\theta(\mathcal{D}) = \prod_{(x,y)\in\mathcal{S}} p_\theta(x,y) \prod_{x\in\mathcal{U}} p_\theta(x).$$
To derive a lower bound on the overall objective, we need lower bounds on log p_θ(x) and log p_θ(x, y). When the labels are unobserved, the latent state consists of z and y. Using the factorization of the graph in Figure 2 yields
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)\,q_\varphi(y|z_c)}\left[\log\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)\,p(y)}{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}\right],$$
where p_ψ(z | y) = p(z_\c) p_ψ(z_c | y). For supervised data points we consider a lower bound on the likelihood p_θ(x, y),
$$\log p_\theta(x,y) \geq \int \log\left[\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)\,p(y)}{q_{\varphi,\phi}(z\mid x,y)}\right] q_{\varphi,\phi}(z\mid x,y)\,\mathrm{d}z.$$
To make sense of the term q_{ϕ,φ}(z | x, y), which is in general different from q_φ(z | x), we consider the inference model
$$q_{\varphi,\phi}(z\mid x,y) = \frac{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}{q_{\varphi,\phi}(y\mid x)}, \quad\text{where}\quad q_{\varphi,\phi}(y\mid x) = \int q_\varphi(y\mid z_c)\,q_\phi(z\mid x)\,\mathrm{d}z.$$
Returning to the lower bound on log p_θ(x, y), we obtain
$$\begin{aligned}
\log p_\theta(x,y) &\geq \int \log\left[\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)\,p(y)}{q_{\varphi,\phi}(z\mid x,y)}\right] q_{\varphi,\phi}(z\mid x,y)\,\mathrm{d}z \\
&= \int \log\left[\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)\,p(y)\,q_{\varphi,\phi}(y\mid x)}{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}\right] \frac{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}{q_{\varphi,\phi}(y\mid x)}\,\mathrm{d}z \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\frac{q_\varphi(y\mid z_c)}{q_{\varphi,\phi}(y\mid x)} \log\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)}{q_\varphi(y\mid z_c)\,q_\phi(z\mid x)}\right] + \log q_{\varphi,\phi}(y\mid x) + \log p(y),
\end{aligned}$$
where q_ϕ(y | z_c)/q_{ϕ,φ}(y | x) denotes the Radon–Nikodym derivative of q_{ϕ,φ}(z | x, y) with respect to q_φ(z | x).
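The marginal classifier q_{ϕ,φ}(y | x) = ∫ q_ϕ(y | z_c) q_φ(z | x) dz appearing in the bound has no closed form in general, but is straightforward to estimate by Monte Carlo over the encoder. A minimal sketch, with a hypothetical sigmoid classifier standing in for q_ϕ(y | z_c) and a Gaussian standing in for q_φ(z | x):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_y_given_zc(z_c):
    """Hypothetical classifier q_phi(y=1 | z_c): a sigmoid of the sum."""
    return 1.0 / (1.0 + np.exp(-z_c.sum()))

def estimate_q_y_given_x(mu, sigma, n_samples=5000):
    """Monte Carlo estimate of q(y=1 | x) = E_{q(z|x)}[q(y=1 | z_c)],
    drawing z_c from the Gaussian posterior N(mu, sigma^2 I)."""
    z = rng.normal(mu, sigma, size=(n_samples, len(mu)))
    return np.mean([q_y_given_zc(zi) for zi in z])

# A posterior concentrated in the "positive" region of latent space should
# yield a marginal label probability close to 1.
p_hat = estimate_q_y_given_x(np.array([2.0, 2.0]), 0.1)
```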

B.2 ALTERNATIVE DERIVATION OF UNSUPERVISED BOUND

The bound for the unsupervised case can alternatively be derived by applying Jensen's inequality twice. First, use the standard (unsupervised) ELBO
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log\frac{p_\theta(x\mid z)\,p(z)}{q_\phi(z\mid x)}\right].$$
Now, since calculating p(z) = p(z_c) p(z_\c) = p(z_\c) Σ_y p(z_c | y) p(y) can be expensive, we can apply Jensen's inequality a second time to the expectation over z_c to obtain
$$\log p(z_c) \geq \mathbb{E}_{q_\varphi(y|z_c)}\left[\log\frac{p_\psi(z_c\mid y)\,p(y)}{q_\varphi(y\mid z_c)}\right].$$
Substituting this bound into the unsupervised ELBO again yields our bound
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)\,q_\varphi(y|z_c)}\left[\log\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)}{q_\phi(z\mid x)\,q_\varphi(y\mid z_c)} + \log p(y)\right]. \tag{7}$$

C IMPLEMENTATION

C.1 CELEBA

We chose to use only a subset of the labels present in CelebA, since not all attributes are visually distinguishable in the reconstructions (e.g. earrings). As such, we limited ourselves to the following labels: arched eyebrows, bags under eyes, bangs, black hair, blond hair, brown hair, bushy eyebrows, chubby, eyeglasses, heavy makeup, male, no beard, pale skin, receding hairline, smiling, wavy hair, wearing necktie, young. No images were omitted or cropped; the only modifications were keeping the aforementioned labels and resizing the images to 64 × 64.

C.2 CHEXPERT

The Chexpert dataset comprises chest X-rays taken from a variety of patients. We down-sampled each image to 64 × 64 and used the same networks as in the CelebA experiments. The five main attributes for Chexpert are: cardiomegaly, edema, consolidation, atelectasis, and pleural effusion. For non-medical experts, these can be interpreted as: enlargement of the heart; fluid in the alveoli; fluid in the lungs; collapsed lung; and fluid in the corners of the lungs.

C.3 IMPLEMENTATION DETAILS

For our experiments we define the generative and inference networks as follows. The approximate posterior is represented as q_φ(z | x) = N(z_c, z_\c | μ_φ(x), diag(σ²_φ(x))), with μ_φ(x) and σ²_φ(x) given by the architecture from Higgins et al. (2016). The generative model p_θ(x | z) is represented by a Laplace distribution, again parameterized using the architecture from Higgins et al. (2016). The label-predictive distribution q_ϕ(y | z_c) is represented as Ber(y | π_ϕ(z_c)), with π_ϕ(z_c) being a diagonal transformation forcing the factorization q_ϕ(y | z_c) = ∏_i q_ϕᵢ(y_i | z_c^i). The conditional prior is given as p_ψ(z_c | y) = N(z_c | μ_ψ(y), diag(σ²_ψ(y))), with the appropriate factorization, where the parameters are represented by an MLP. Finally, the prior placed on the portion of the latent space reserved for unlabeled latent variables is p(z_\c) = N(z_\c | 0, I). For the latent space, z_c ∈ R^{m_c} and z_\c ∈ R^{m_\c}, where m = m_c + m_\c, with m_c = 18 and m_\c = 27 for CelebA. The architectures are given in Table 3.

Optimization. We trained the models on a GeForce GTX Titan GPU. Training consumed ∼2 GB of memory for both CelebA and Chexpert, and took around 2 hours to complete 100 epochs. Both models were optimized using Adam with a learning rate of 2 × 10⁻⁴.
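The "diagonal transformation" π_ϕ can be sketched as a per-dimension affine map followed by a sigmoid; with no cross-dimension mixing, each y_i depends only on z_c^i, which is exactly the factorization above. A hypothetical minimal version (the class name and initialization are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

class DiagonalLayer:
    """Elementwise (per-dimension) affine map followed by a sigmoid.
    Because there is no mixing across dimensions, it enforces the
    factorization q(y | z_c) = prod_i q(y_i | z_c^i)."""
    def __init__(self, m_c, rng):
        self.w = rng.normal(size=m_c)  # one weight per latent dimension
        self.b = np.zeros(m_c)

    def __call__(self, z_c):
        logits = self.w * z_c + self.b
        return 1.0 / (1.0 + np.exp(-logits))

layer = DiagonalLayer(4, rng)
z = np.zeros(4)
z_perturbed = z.copy()
z_perturbed[2] = 5.0
# Perturbing one latent dimension changes only its own label probability.
delta = layer(z_perturbed) - layer(z)
```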

C.3.1 HIGH VARIANCE OF CLASSIFIER GRADIENTS

The gradients of the classifier parameters ϕ suffer from high variance during training. We find that not reparameterizing z_c for q_ϕ(y | z_c) reduces this issue:
$$\mathcal{L}_{\text{CCVAE}}(x,y) = \mathbb{E}_{q_\phi(z|x)}\left[\frac{q_\varphi(y\mid \hat{z}_c)}{q_{\varphi,\phi}(y\mid x)} \log\frac{p_\theta(x\mid z)\,p_\psi(z\mid y)}{q_\varphi(y\mid \hat{z}_c)\,q_\phi(z\mid x)}\right] + \log q_{\varphi,\phi}(y\mid x) + \log p(y), \tag{8}$$
where ẑ_c indicates that we do not reparameterize the sample. This significantly reduces the variance of the gradient norm ∇_ϕ, allowing the classifier to learn appropriate weights and structure the latent space. This can be seen in Figure 8, where we plot the gradient norm of ϕ when we do reparameterize z_c (blue) and when we do not (orange). Clearly, not reparameterizing leads to lower variance in the gradient norm of the classifier, which aids learning. To a certain extent these gradients can be viewed as redundant, as there are already gradients updating the predictive distribution through the log q_{ϕ,φ}(y | x) term anyway.

C.4 MODIFIED DIVA

The primary goal of DIVA is domain-invariant classification, not obtaining representations of individual characteristics as we do here. Its objective is essentially a classifier regularized by a variational objective. However, to achieve domain generalization, the authors aim to disentangle the domain, class, and other generative factors. This motivation leads to a graphical model that is similar in spirit to ours (Figure 9), in that latent variables are used to predict labels, and an inductive bias partitions the latent space. As such, DIVA can be modified to suit our problem of encapsulating characteristics. The first modification is the removal of z_d, as we are not considering multi-domain problems. Secondly, we introduce the factorization present in CCVAE, namely q_ϕ(y | z_c) = ∏_i q_ϕᵢ(y_i | z_c^i).
With these two modifications, an alternative objective can now be constructed, with the supervised term given as
$$\mathcal{L}_{\text{SDIVA}}(x,y) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x\mid z)\right] - \beta\,\mathrm{KL}\left(q_\phi(z_{\setminus c}\mid x)\,\|\,p(z_{\setminus c})\right) - \beta\,\mathrm{KL}\left(q_\phi(z_c\mid x)\,\|\,p_\psi(z_c\mid y)\right),$$
and the unsupervised term as
$$\begin{aligned}
\mathcal{L}_{\text{UDIVA}}(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x\mid z)\right] &- \beta\,\mathrm{KL}\left(q_\phi(z_{\setminus c}\mid x)\,\|\,p(z_{\setminus c})\right) \\
&+ \beta\,\mathbb{E}_{q_\phi(z_c|x)\,q_\varphi(y|z_c)}\left[\log p_\psi(z_c\mid y) - \log q_\phi(z_c\mid x)\right] \\
&+ \beta\,\mathbb{E}_{q_\phi(z_c|x)\,q_\varphi(y|z_c)}\left[\log p(y) - \log q_\varphi(y\mid z_c)\right],
\end{aligned}$$
where y has to be imputed. The final objective for DIVA is then given as
$$\log p_\theta(\mathcal{D}) \geq \sum_{(x,y)\in\mathcal{S}} \mathcal{L}_{\text{SDIVA}}(x,y) + \sum_{x\in\mathcal{U}} \mathcal{L}_{\text{UDIVA}}(x) + \alpha\,\mathbb{E}_{q_\phi(z_c|x)}\left[\log q_\varphi(y\mid z_c)\right].$$
It is interesting to note the differences from the objective of CCVAE: no natural classifier emerges in the supervised case, and y has to be imputed in the unsupervised case instead of being handled through variational inference as in CCVAE. These differences have a significant impact on performance, as demonstrated by the main results of this paper.

D.1 SINGLE INTERVENTIONS

Here we demonstrate single interventions where we change the binary value of the desired attributes. To quantitatively evaluate the single interventions, we intervene on a single label and report the changes in log-probabilities assigned by a pre-trained classifier. If the single intervention only affects the characteristics of the chosen label, then there should be a change only on the chosen label and no change in the other classes. Intervening on all possible labels yields a confusion matrix, with the optimal result being a diagonal matrix with zero off-diagonal elements. We also report the condition number for the confusion matrices, given in the titles. It is interesting to note that the interventions for CCVAE are subtle; this is because the latent z_c^i ∼ p(z_c^i | y_i) will be centered around the mean. More striking interventions can be achieved by traversing along z_c^i.
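A conventional single intervention, as used here, just resamples one characteristic latent from its conditional prior while leaving the rest of z untouched. A minimal sketch (the particular `mu_psi`/`sigma_psi` values are illustrative assumptions, not learned parameters):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical conditional-prior parameters mu_psi(y_i), sigma_psi(y_i)
# for one binary attribute, one entry per label value.
mu_psi = {0: -2.0, 1: 2.0}
sigma_psi = {0: 0.5, 1: 0.5}

def intervene(z_c, i, new_label):
    """Flip label i: resample only z_c[i] from p_psi(z_c^i | y_i),
    leaving every other latent dimension unchanged."""
    z_new = z_c.copy()
    z_new[i] = rng.normal(mu_psi[new_label], sigma_psi[new_label])
    return z_new

z_before = np.zeros(3)
z_after = intervene(z_before, 1, 1)  # switch attribute 1 "on"
```

Because only z_c[i] is redrawn, all other characteristics (and the unlabeled context z_\c) are preserved, which is what the confusion matrices measure.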

D.2 LATENT TRAVERSALS

Here we provide more latent traversals for CCVAE in Figure 18 and for DIVA in Figure 19. CCVAE is able to smoothly alter characteristics, indicating that it encapsulates each characteristic in a single dimension, unlike DIVA, which is unable to alter the characteristics effectively.

D.3 GENERATION

We provide results for the fidelity of image generation on CelebA, using the FID metric (Heusel et al., 2017); we omitted results for Chexpert as the Inception model used in FID has not been trained on the typical features associated with X-rays. The results are given in Table 4. Interestingly, for low supervision rates MVAE obtains the best performance, but for higher supervision rates M2 outperforms MVAE. We posit that this is because MVAE imposes little structure on the latent space, so the product of experts can structure the representation purely for reconstruction without considering the labels, something which is no longer possible as the supervision rate increases. CCVAE obtains competitive results with respect to M2. It is important to note that generative fidelity is not the focus of this work, as we focus purely on how to structure the latent space using labels. It is then no surprise that the generation quality is limited, as structuring the latent space will potentially be at odds with the reconstruction term in the loss.

D.4 CONDITIONAL GENERATION

To assess conditional generation, we first train an independent classifier for both datasets. We then conditionally generate samples given labels and evaluate them using this pre-trained classifier. Results are provided in Table 5. CCVAE and M2 are comparable in generative ability, but DIVA and MVAE perform poorly, scoring close to random guessing.

D.5 DIVERSITY OF CONDITIONAL GENERATIONS

We provide further examples of the single-intervention variance experiment of Figure 5 in Figure 20.

D.6 MULTI-CLASS SETTING

Here we provide results for the multi-class setting of MNIST and FashionMNIST. The multi-class setting is somewhat tangential to our work, but we include it for completeness. For CCVAE, we have some flexibility over the size of the latent space. Trying to encapsulate representations for each label is not well suited to this setting, as it is not clear how one could alter the representation of an image being a 6 whilst preserving the representation of it being an 8. In fact, there is really only one label in this setting, but it takes multiple values. With this in mind, we can now make an explicit choice about how the latent space is structured: we can set z_c ∈ R or z_c ∈ R^N, or conversely store all of the representation in z_c, i.e. z_\c = ∅. Furthermore, we do not need to enforce the factorization q_ϕ(y | z_c) = ∏_i q(y_i | z_c^i); instead q_ϕ(y | z_c) can be parameterized by a function F : R^N → R^M, where M is the number of possible classes.

Classification. We provide the classification results in Table 6.

Conditional Generation. We provide classification accuracies of a pre-trained classifier using the conditionally generated samples as input and the condition as the label. We also report the mutual information to give an indication of how out-of-distribution the samples are. In order to estimate the uncertainty, we transform a fixed pre-trained classifier into a Bayesian predictive classifier that integrates over the posterior distribution of the parameters ω as p(y | x, D) = ∫ p(y | x, ω) p(ω | D) dω.
The utility of classifier uncertainties for out-of-distribution detection has previously been explored by Smith & Gal (2018), where dropout is also used at test time to estimate the mutual information (MI) between the predicted label y and the parameters ω (Gal, 2016; Smith & Gal, 2018) as
$$\mathrm{I}(y, \omega \mid x, \mathcal{D}) = \mathrm{H}\left[p(y \mid x, \mathcal{D})\right] - \mathbb{E}_{p(\omega \mid \mathcal{D})}\left[\mathrm{H}\left[p(y \mid x, \omega)\right]\right].$$
However, the Monte Carlo (MC) dropout approach has the disadvantage of requiring ensembling over multiple instances of the classifier for a robust estimate, and repeated forward passes through the classifier to estimate the MI. To mitigate this, we instead employ a sparse variational GP (with 200 inducing points) as a replacement for the last linear layer of the classifier, fitting just the GP to the data and labels while holding the rest of the classifier fixed. This, in our experience, provides a more robust and cheaper alternative to MC dropout for estimating the MI. Results are provided in Table 7.

Latent Traversals. We can also perform latent traversals for the multi-class setting. Here, we perform linear interpolation on the polytope whose corners are obtained from the network μ_ψ(y) for four different classes. We provide the reconstructions in Figure 21.
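The MI measure used above can be computed from any ensemble of predictive distributions, whether obtained via MC dropout or by sampling the GP last layer; a minimal, model-agnostic sketch:

```python
import numpy as np

def predictive_mi(prob_samples, eps=1e-12):
    """MI between the prediction y and parameters omega, estimated from an
    ensemble of class-probability vectors (rows: parameter samples,
    columns: classes):
        I = H[ mean_w p(y|x,w) ] - mean_w H[ p(y|x,w) ].
    High MI means the ensemble members disagree, a signal that the input
    is out of distribution."""
    p = np.asarray(prob_samples, dtype=float)
    mean_p = p.mean(axis=0)
    h_of_mean = -np.sum(mean_p * np.log(mean_p + eps))          # H of averaged prediction
    mean_of_h = -np.sum(p * np.log(p + eps), axis=1).mean()     # average per-member entropy
    return h_of_mean - mean_of_h

agree = [[0.99, 0.01], [0.99, 0.01]]       # confident, consistent ensemble
disagree = [[0.99, 0.01], [0.01, 0.99]]    # confident but contradictory
```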



Footnotes. (i) Though, for convenience, we implicitly assume here, and throughout the rest of the paper, that the labels are categorical, such that the mapping z_c → y is a classifier, we note that the ideas apply equally well if some labels are actually continuous, in which case this mapping becomes a probabilistic regression. (ii) DIVA can perform the same tasks as CCVAE, but only with the modifications we ourselves suggest, and still not to a comparable quality. (iii) CCVAE is well suited to multi-label problems, but also works on multi-class problems; see Appendix D.6 for results and analyses on MNIST and FashionMNIST.



Figure 3: Continuous interventions through traversal of zc. From left to right, a) DIVA pale skin and young; b) CCVAE pale skin and young; c) CCVAE smiling and necktie; d) CCVAE Pleural Effusion and Cardiomegaly.

Figure 4: Diverse conditional generations for CCVAE, y is held constant along each row and each column represents a different sample for zc ∼ p(zc|y). z \c is held constant over the entire figure.

Figure 5: Variance in reconstructions when intervening on a single label. [Top two] CelebA, from left to right: reconstruction, bangs, eyeglasses, pale skin, smiling, necktie. [Bottom] Chexpert: reconstruction, cardiomegaly, edema, consolidation, atelectasis, pleural effusion.

Figure 6: Characteristic swap, where the characteristics of the first image (blond hair, smiling, heavy makeup, female, no necktie, no glasses etc.) are transferred to the unlabeled characteristics of the second (red background etc.).

Figure 7: Characteristic swaps. Characteristics (smiling, brown hair, skin tone, etc) of the left image should be preserved along the row while background information should be preserved along the column.

Figure 8: Gradient norms of classifier.

Figure 9: Left: Generative model for DIVA, Right: Inference model where dashed line indicates auxiliary classifier.

Figure 10: Confusion matrices for CCVAE for (from top left clockwise) f = 0.004, 0.06, 0.2, 1.0

Figure 12: Confusion matrices for M2 for (from top left clockwise) f = 0.004, 0.06, 0.2, 1.0

Figure 14: Confusion matrices for DIVA for (from top left clockwise) f = 0.004, 0.06, 0.2, 1.0

Figure 16: Confusion matrices for MVAE for (from top left clockwise) f = 0.004, 0.06, 0.2, 1.0

Figure 18: Various latent traversals for CCVAE.

Figure 19: Various latent traversals for DIVA.

Figure 20: CCVAE, variance in reconstructions when intervening on a single label. From left to right: reconstruction, then interventions from switching on the following labels: arched eyebrows, bags under eyes, bangs, black hair, blond hair, brown hair, bushy eyebrows, chubby, eyeglasses, heavy makeup, male, no beard, pale skin, receding hairline, smiling, wavy hair, wearing necktie, young.



Table 1: Classification accuracies.

Table 2: Difference in log-probabilities of a pre-trained classifier under characteristic swaps; lower is better.

Table 3: Architectures for CelebA and Chexpert.

Table 4: CelebA FID scores.

Table 5: Generation accuracies.

8. ACKNOWLEDGMENTS

TJ, PHST, and NS were supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. Toshiba Research Europe also supports TJ. TJ would also like to thank Dr. M. Stoddart. PHST would also like to acknowledge the Royal Academy of Engineering and FiveAI. SMS was partially supported by the Engineering and Physical Sciences Research Council (EPSRC) grant EP/K503113/1. TR's research leading to these results has received funding from a Christ Church Oxford Junior Research Fellowship and from Tencent AI Labs.


Published as a conference paper at ICLR 2021 

