FAILURE MODES OF VARIATIONAL AUTOENCODERS AND THEIR EFFECTS ON DOWNSTREAM TASKS

Anonymous

Abstract

Variational Auto-encoders (VAEs) are deep generative latent variable models that are widely used for a number of downstream tasks. While it has been demonstrated that VAE training can suffer from a number of pathologies, existing literature lacks characterizations of exactly when these pathologies occur and how they impact downstream task performance. In this paper, we concretely characterize conditions under which VAE training exhibits pathologies and connect these failure modes to undesirable effects on specific downstream tasks, such as learning compressed and disentangled representations, adversarial robustness, and semi-supervised learning.

1. INTRODUCTION

Variational Auto-encoders (VAEs) are deep generative latent variable models that transform simple distributions over a latent space to model complex data distributions (Kingma & Welling, 2013). They have been used for a wide range of downstream tasks, including: generating realistic-looking synthetic data (e.g. Pu et al. (2016)), learning compressed representations (e.g. Miao & Blunsom (2016); Gregor et al. (2016); Alemi et al. (2017)), adversarial defense using de-noising (Luo & Pfister, 2018; Ghosh et al., 2018), and, when expert knowledge is available, generating counter-factual data using weak or semi-supervision (e.g. Kingma et al. (2014); Siddharth et al. (2017); Klys et al. (2018)). Variational auto-encoders are widely used by practitioners due to the ease of their implementation and the simplicity of their training. In particular, the common choice of mean-field Gaussian (MFG) approximate posteriors for VAEs (MFG-VAEs) results in an inference procedure that is straightforward to implement and stable in training.

Unfortunately, a growing body of work has demonstrated that MFG-VAEs suffer from a variety of pathologies, including learning uninformative latent codes (e.g. van den Oord et al. (2017); Kim et al. (2018)) and unrealistic data distributions (e.g. Tomczak & Welling (2017)). When the data consists of images or text, rather than evaluating the model based on metrics alone, we often rely on "gut checks" to ensure that the latent representations the model learns and the synthetic data (as well as counterfactual data) generated by the model are of high quality (e.g. by reading generated text or visually inspecting generated images (Chen et al., 2018; Klys et al., 2018)). However, as VAEs are increasingly being used in applications where the data is numeric, e.g. in medical or financial domains (Pfohl et al., 2019; Joshi et al., 2019; Way & Greene, 2017), these intuitive qualitative checks no longer apply.
For example, in many medical applications, the original data features themselves (e.g. biometric readings) are difficult for human experts to analyze in raw form. In these cases, where the application touches human lives and potential model errors/pathologies are particularly consequential, we need a clear theoretical understanding of the failure modes of our models as well as the potential negative consequences for downstream tasks.

Recent work (Yacoby et al., 2020) attributes a number of the pathologies of MFG-VAEs to properties of the training objective; in particular, the objective may compromise learning a good generative model in order to learn a good inference model; in other words, the inference model over-regularizes the generative model. While this pathology has been noted in the literature (Burda et al., 2016; Zhao et al., 2017; Cremer et al., 2018), no prior work characterizes the conditions under which the MFG-VAE objective compromises learning a good generative model in order to learn a good inference model; moreover, no prior work relates MFG-VAE pathologies to performance on downstream tasks. Rather, existing literature focuses on mitigating the regularizing effect of the inference model on the VAE generative model by using richer variational families or de-biased training objectives (e.g. Kingma et al. (2016); Nowozin (2018); Luo et al. (2020)). While promising, these methods introduce potentially significant additional computational costs to training, as well as new training issues, e.g. noisy gradients (Roeder et al. (2017); Tucker et al. (2018); Rainforth et al. (2018)). As such, it is important to understand precisely when MFG-VAEs exhibit pathologies and when alternative training methods are worth the computational trade-off.

In this paper, we characterize the conditions under which MFG-VAEs perform poorly and link these failures to effects on a range of downstream tasks. While we might expect that methods designed to mitigate VAE training pathologies (e.g. methods with richer variational families (Kingma et al., 2016)) will also alleviate the negative downstream effects, we find that this is not always so. Our observations point to reasons for further studying the performance of VAE alternatives in these applications.
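To make the training objective concrete, the following is a minimal NumPy sketch of the per-example ELBO for a VAE with a mean-field Gaussian posterior: a reparameterized Monte Carlo estimate of the reconstruction term, plus the closed-form KL divergence to a standard normal prior. The toy `decode` function and all parameter values are illustrative assumptions only, not the models studied in this paper.

```python
import numpy as np

def elbo_estimate(x, enc_mu, enc_logvar, decode, sigma_x=1.0, n_samples=100, rng=None):
    """Monte Carlo estimate of the per-example ELBO for a VAE with a
    mean-field Gaussian posterior q(z|x) = N(enc_mu, diag(exp(enc_logvar)))
    and a standard normal prior p(z) = N(0, I):

        ELBO(x) = E_q[log p(x|z)] - KL(q(z|x) || p(z)) <= log p(x).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    enc_mu, enc_logvar = np.atleast_1d(enc_mu), np.atleast_1d(enc_logvar)
    std = np.exp(0.5 * enc_logvar)
    # Reparameterized samples z = mu + std * eps, with eps ~ N(0, I).
    eps = rng.standard_normal((n_samples, enc_mu.size))
    z = enc_mu + std * eps
    # Gaussian likelihood p(x|z) = N(decode(z), sigma_x^2 I).
    recon = np.array([decode(zi) for zi in z])
    log_lik = -0.5 * (((x - recon) ** 2) / sigma_x**2
                      + np.log(2 * np.pi * sigma_x**2)).sum(axis=-1)
    # KL between a diagonal Gaussian and the standard normal prior, in closed form.
    kl = 0.5 * np.sum(np.exp(enc_logvar) + enc_mu**2 - 1.0 - enc_logvar)
    return float(log_lik.mean() - kl)
```

The over-regularization discussed above enters through the KL term: the objective can trade reconstruction quality (the first term) for a posterior that is easy to match to the prior.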
Our contributions are both theoretical and empirical. I. When VAE pathologies occur: (1) We characterize concrete conditions under which learning the inference model compromises learning the generative model for MFG-VAEs. More problematically, we show that these bad solutions are globally optimal for the training objective, the ELBO. (2) We demonstrate that using the ELBO to select the output noise variance and the latent dimension results in biased estimates. (3) We propose synthetic datasets that trigger these two pathologies and can be used to test future proposed inference methods. II. Effects on tasks: (4) We demonstrate ways in which these pathologies affect key downstream tasks, including learning compressed, disentangled representations, adversarial robustness, and semi-supervised learning. In semi-supervised learning, we are the first to document "functional collapse", in which the data conditionals problematically collapse to the same distribution. (5) Lastly, we show that while the use of richer variational families alleviates VAE pathologies in unsupervised learning tasks, it introduces new ones in semi-supervised tasks. These contributions help identify when MFG-VAEs suffice, and when advanced methods are needed.

2. RELATED WORK

Existing works that characterize MFG-VAE pathologies largely focus on relating local optima of the training objective to a single pathology: the uninformativeness of the learned latent codes (posterior collapse) (He et al., 2019; Lucas et al., 2019; Dai et al., 2019). In contrast, there has been little work characterizing pathologies at the global optima of the MFG-VAE training objective. Yacoby et al. (2020) show that, when the decoder's capacity is restricted, posterior collapse and the mismatch between the aggregated posterior and the prior can occur at global optima of the training objective. In contrast, we focus on global optima of the MFG-VAE objective in fully general settings: with fully flexible generative and inference models, both with and without learned observation noise. Previous works (e.g. Yacoby et al. (2020)) have connected VAE pathologies like posterior collapse to the over-regularizing effect of the variational family on the generative model. However, while there are many works that mitigate the over-regularization issue (e.g. Burda et al. (2016); Zhao et al. (2017); Cremer et al. (2018); Shu et al. (2018)), none has given a full characterization of when the learned generative model is over-regularized, nor related the quality of the learned model to its performance on downstream tasks. In particular, these works show that their proposed methods achieve higher test log-likelihood than MFG-VAEs, but as we show in this paper, high test log-likelihood is not the only property needed for good performance on downstream tasks. Lastly, these works all propose fixes that incur a potentially significant computational overhead. For instance, works that use complex variational families, such as normalizing flows (Kingma et al., 2016), require a significant number of parameters to scale (Kingma & Dhariwal, 2018).
In the case of the Importance Weighted Autoencoder (IWAE) objective (Burda et al., 2016), which can be interpreted as using a more complex variational family (Cremer et al., 2017), the complexity of the posterior scales with the number of importance samples used. Lastly, works that de-bias existing bounds (Nowozin, 2018; Luo et al., 2020) all require several evaluations of the objective. Given that MFG-VAEs remain popular today due to the ease of their implementation, the speed of their training, and their theoretical connections to other dimensionality reduction approaches like probabilistic PCA (Rolinek et al., 2019; Dai et al.; Lucas et al., 2019), it is important to characterize the training pathologies of MFG-VAEs, as well as the concrete connections between these pathologies and downstream tasks. More importantly, this characterization will help clarify for which tasks and datasets a MFG-VAE suffices and for which the computational trade-offs are worth it.
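For reference, the k-sample IWAE bound replaces the ELBO's single-sample reconstruction term with the log of an average of k importance weights w_i = p(x, z_i)/q(z_i|x), z_i ~ q(z|x); it is nondecreasing in k and recovers the ELBO at k = 1. A minimal NumPy sketch under a toy mean-field Gaussian setup (the decoder and all parameter values are illustrative assumptions, not any cited model):

```python
import numpy as np

def iwae_bound(x, enc_mu, enc_logvar, decode, sigma_x=1.0, k=5, n_outer=500, rng=None):
    """Monte Carlo estimate of the k-sample IWAE bound (Burda et al., 2016):

        L_k(x) = E[log (1/k) sum_i p(x, z_i) / q(z_i|x)],  z_i ~ q(z|x),

    for q(z|x) = N(enc_mu, diag(exp(enc_logvar))) and p(z) = N(0, I).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    enc_mu, enc_logvar = np.atleast_1d(enc_mu), np.atleast_1d(enc_logvar)
    std = np.exp(0.5 * enc_logvar)
    estimates = []
    for _ in range(n_outer):
        eps = rng.standard_normal((k, enc_mu.size))
        z = enc_mu + std * eps
        recon = np.array([decode(zi) for zi in z])
        # Log importance weights: log p(x|z) + log p(z) - log q(z|x).
        log_px_z = -0.5 * (((x - recon) ** 2) / sigma_x**2
                           + np.log(2 * np.pi * sigma_x**2)).sum(axis=-1)
        log_pz = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=-1)
        log_qz = -0.5 * (eps**2 + np.log(2 * np.pi) + enc_logvar).sum(axis=-1)
        log_w = log_px_z + log_pz - log_qz
        # log-mean-exp, stabilized by subtracting the max weight.
        m = log_w.max()
        estimates.append(m + np.log(np.mean(np.exp(log_w - m))))
    return float(np.mean(estimates))
```

Note how the per-sample cost grows linearly in k, which is the computational trade-off discussed above.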




