

Abstract

Autoencoders, or nonlinear factor models parameterized by neural networks, have become an indispensable tool for generative modeling and representation learning in high dimensions. Imposing structural constraints, such as conditional independence, on the latent variables (representation, or factors) in order to capture invariance or fairness with autoencoders has mostly been attempted by adding ad hoc penalties to the loss function in the variational autoencoder (VAE) context, often based on heuristic arguments. In this paper, we demonstrate that Wasserstein autoencoders (WAEs) are highly flexible in embracing such structural constraints. Well-known extensions of VAEs for this purpose are gracefully handled within the framework of the seminal result of Tolstikhin et al. (2018). In particular, given a conditional independence structure of the generative model (decoder), the corresponding encoder structure and penalties are induced from the functional constraints that define the WAE. This property of WAEs opens up a principled way of penalizing autoencoders to impose structural constraints. Utilizing this generative model structure, we present results on fair representation and conditional generation tasks, and compare them with preceding methods.

1. INTRODUCTION

The ability to learn informative representations of data with minimal supervision is a key challenge in machine learning (Tschannen et al., 2018), and autoencoders have become an indispensable toolkit toward this goal. An autoencoder consists of an encoder, which maps the input to a low-dimensional representation, and a decoder, which maps a representation back to a reconstruction of the input. An autoencoder can thus be considered a nonlinear factor analysis model: the latent variable provided by the encoder carries the meaning of a "representation," and the decoder can be used for generative modeling of the input data distribution. Most autoencoders can be formulated as minimizing some "distance" between the distribution $P_X$ of the input random variable $X$ and the distribution $g_\sharp P_Z$ of the reconstruction $G = g(Z)$, where $Z$ is the latent variable (representation) with distribution $P_Z$, and $g$ is either a deterministic or a probabilistic decoder (in the latter case $g$ is read as the conditional distribution of $G$ given $Z$); the distance is described variationally in terms of an encoder $Q_{Z|X}$. For instance, the variational autoencoder (VAE, Kingma & Welling, 2014) minimizes
$$D_{\mathrm{VAE}}(P_X, g_\sharp P_Z) = \inf_{Q_{Z|X} \in \mathcal{Q}} \mathbb{E}_{P_X}\big[ D_{\mathrm{KL}}(Q_{Z|X} \,\|\, P_Z) - \mathbb{E}_{Q_{Z|X}} \log g(Z) \big], \tag{1}$$
whereas the Wasserstein autoencoder (WAE, Tolstikhin et al., 2018) minimizes
$$D_{\mathrm{WAE}}(P_X, g_\sharp P_Z) = \inf_{Q_{Z|X} \in \mathcal{Q}} \mathbb{E}_{P_X} \mathbb{E}_{Q_{Z|X}}\, d^p(X, g(Z)) \tag{2}$$
over the set of deterministic decoders $g$, where $d$ is the metric on the space of the input $X$ and $p \ge 1$. The set $\mathcal{Q}$ restricts the search space for the encoder. In VAEs, a popular choice is a class of normal distributions
$$\mathcal{Q} = \{Q_{Z|X} \text{ regular conditional distribution} : Z \mid \{X = x\} \sim N(\mu(x), \Sigma(x)),\ (\mu, \Sigma) \in \mathcal{NN}\},$$
where $\mathcal{NN}$ is a class of functions parametrized by neural networks. In WAEs, the choice
$$\mathcal{Q} = \{Q_{Z|X} \text{ regular conditional distribution} : Q_Z \triangleq \mathbb{E}_{P_X} Q_{Z|X} = P_Z\} \tag{3}$$
makes the left-hand side of Eq. (2) equal to the ($p$-th power of the) $p$-Wasserstein distance between $P_X$ and $g_\sharp P_Z$ (Tolstikhin et al., 2018, Theorem 1). $Q_Z$ is called the aggregate posterior of $Z$. If $\mathcal{Q}$ is a set of Dirac measures, i.e., $\mathcal{Q} = \{Q_{Z|X} : Q_{Z|X=x} = \delta_{f(x)},\ f \in \mathcal{NN}\}$, then minimizing Eq. (2) reduces to the learning problem of a deterministic, unregularized autoencoder.

Of course, the notion of "informativeness" depends on the downstream task. The variation in the observations that is not relevant to the particular task is often called "nuisance," and it is desirable to suppress it from the representation. For example, in finding "fair representations" (Zemel et al., 2013), sensitive information such as gender or socioeconomic status should be removed from the latent representation; in obtaining representations of facial images, those invariant to lighting conditions, poses, or the wearing of eyeglasses are often sought. A popular approach to this goal is to explicitly separate informative and nuisance variables in the generative model by factorization. This approach imposes a structure on the decoder. Additionally, the encoder is further factorized, and a penalty promoting independence between the encoded representation and the nuisance variable can be added; well-known examples are reviewed below. These examples illustrate that, while the generative model (decoder structure) can be chosen to suit the downstream task, there is no principled way of imposing the corresponding encoder structure.

In this paper, we show that the WAE framework allows us to automatically determine the encoder structure corresponding to an imposed decoder structure. Specifically, when the deterministic decoder $g$ in Eq. (2) is modified to handle the conditional independence structure of the imposed generative model, the constraint set (the counterpart of $\mathcal{Q}$ in Eq. (3)) that makes the left-hand side of Eq. (2) a proper (power of a) Wasserstein distance determines the factorization of the (deterministic) encoder. In practice, the hard constraints in $\mathcal{Q}$ are relaxed and Eq. (2) is solved in a penalized form. Following the approach of Tolstikhin et al. (2018), this constraint set can be systematically translated into penalties. Therefore, in addition to the theoretical advantage that the penalized form equals a genuine distributional distance for a sufficiently large penalty parameter, whereas the objective of Eq. (1) is merely an upper bound of the negative log-likelihood of the model, the ad hoc manner of designing penalties prevalent in the VAE literature can be avoided in the WAE framework. Further, allowing deterministic encoders and decoders promotes better generation performance in many downstream tasks. We explain how the WAE framework leads to structured encoders for a given generative model through examples reflecting downstream tasks in Sect. 3, after providing the necessary background in Sect. 2. We call these structured uses of WAEs the Wasserstein Fair Autoencoders (WFAEs). After reviewing related ideas in Sect. 4, WFAEs are evaluated in Sect. 5 on datasets including VGGFace2 (Cao et al., 2018). We conclude the paper in Sect. 6.
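The penalized relaxation of Eq. (2) can be sketched concretely. Below is a minimal NumPy sketch, not the paper's implementation, of a deterministic WAE loss: reconstruction error plus a Gaussian-kernel MMD penalty pushing the aggregate posterior toward the prior (function names, the kernel choice, and the bandwidth are our illustrative assumptions).

```python
import numpy as np

def gaussian_mmd2(z_q, z_p, sigma=1.0):
    """Biased estimate of squared MMD between samples z_q ~ Q_Z and z_p ~ P_Z."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    # Equals the squared norm of the difference of kernel mean embeddings.
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

def wae_objective(x, encode, decode, prior_sample, lam=10.0, p=2):
    """Penalized WAE loss: reconstruction + lam * MMD^2(Q_Z, P_Z)."""
    z = encode(x)        # deterministic encoder f(x)
    x_hat = decode(z)    # deterministic decoder g(z)
    # d^p(x, g(f(x))) averaged over the batch (Euclidean-type metric).
    recon = (np.abs(x - x_hat) ** p).sum(axis=1).mean()
    penalty = gaussian_mmd2(z, prior_sample(len(z)))
    return recon + lam * penalty
```

With an identity encoder/decoder the reconstruction term vanishes and only the aggregate-posterior penalty remains, which mirrors how the hard constraint $Q_Z = P_Z$ is softened in practice.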

2. PRELIMINARIES

In fitting a given probability distribution $P_X$ of a random variable $X$ on a measurable space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$, where $\mathcal{X} \subset \mathbb{R}^D$ is equipped with a metric $d$, by a generative model $P_G$ of a sample $G$ on the same measurable space, one may consider minimizing the ($p$-th power of the) $p$-Wasserstein distance between the two distributions, i.e.,
$$\min_{P_G \in \mathcal{M}} W_p^p(P_X, P_G) := \inf_{\pi \in \mathcal{P}(P_X, P_G)} \mathbb{E}_{\pi}\, d^p(X, G).$$
Here, $\mathcal{M}$ is the model space of probability distributions, and $\mathcal{P}(P_X, P_G)$ is the set of couplings, i.e., all joint distributions on $(\mathcal{X} \times \mathcal{X}, \mathcal{B}(\mathcal{X} \times \mathcal{X}))$ with marginals $P_X$ and $P_G$. Often the sample $G$ is generated by transforming a variable in a latent space. When $G = g(Z)$ a.s. for a latent variable $Z$ in a probability space $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}), P_Z)$, $\mathcal{Z} \subset \mathbb{R}^l$, and a measurable function $g$, then $P_G$ is denoted by $g_\sharp P_Z$, where $\sharp$ is the pushforward operator. In this setting, as discussed in Sect. 1, Tolstikhin et al. (2018, Theorem 1) showed that this distance admits the variational representation of Eq. (2) under the constraint set of Eq. (3).
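The infimum over couplings is hard in general, but in one dimension the optimal coupling between two empirical measures with equal sample sizes simply pairs their order statistics, which gives a quick numerical check of the definition. A minimal sketch (the function name is ours):

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """W_p between two empirical measures on R with equal sample sizes.

    In 1-D the optimal coupling matches sorted samples, so
    W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
    """
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    assert len(x) == len(y), "equal sample sizes assumed for simplicity"
    return (np.abs(x - y) ** p).mean() ** (1 / p)
```

For instance, translating a sample by a constant $c$ shifts every order statistic by $c$, so the distance is exactly $c$ for any $p \ge 1$.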



A well-known example is the variational fair autoencoder (VFAE, Louizos et al., 2016), in which a variant of the "M1+M2" graphical model (Kingma et al., 2014) is used to factorize the decoder, and a resembling factorization of the encoder (variational posterior) is assumed. Independence of the representation from the nuisance variable is encouraged by adding a maximum mean discrepancy (MMD, Gretton et al., 2007) penalty between conditional variational posteriors; in Lopez et al. (2018), the MMD is replaced by the Hilbert-Schmidt Independence Criterion (HSIC, Gretton et al., 2007). Other authors employ penalties derived from the mutual information (MI) (Moyer et al., 2018; Song et al., 2019; Creager et al., 2019). Another example is the Fader Networks (Lample et al., 2018), in which the deterministic decoder takes the attribute (such as whether or not eyeglasses are present in a portrait) as an additional input, and an adversarial penalty is added that hinders accurate prediction of the attribute from the output of the deterministic, unfactorized encoder.
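The HSIC penalty mentioned above measures dependence between the representation and the nuisance variable through kernel Gram matrices. A minimal NumPy sketch of the standard biased estimator with Gaussian kernels (function names and the fixed bandwidth are our illustrative choices):

```python
import numpy as np

def rbf_gram(a, sigma=1.0):
    """Gaussian-kernel Gram matrix of a sample a of shape (n, d)."""
    d2 = ((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def hsic(z, s, sigma=1.0):
    """Biased HSIC estimate between representation z and nuisance s.

    HSIC(Z, S) ~= tr(K H L H) / (n - 1)^2, with H the centering matrix;
    in population it vanishes iff Z and S are independent, for
    characteristic kernels such as the Gaussian kernel used here.
    """
    n = len(z)
    K, L = rbf_gram(z, sigma), rbf_gram(s, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Used as a penalty, the estimator is added to the reconstruction loss with a multiplier, so the encoder is pushed toward representations whose Gram matrix is uncorrelated (after centering) with that of the nuisance variable.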

