COMPLETE LIKELIHOOD OBJECTIVE FOR LATENT VARIABLE MODELS

Abstract

In this work, we propose an alternative to the Marginal Likelihood (MaL) objective for learning representations with latent variable models: Complete Latent Likelihood (CoLLike). We analyze both objectives from the perspective of matching joint distributions. We show that MaL corresponds to a particular KL divergence between a certain target joint distribution and the model joint. Furthermore, the properties of this target joint explain such major malfunctions of MaL (from the representation learning perspective) as uninformative latents (posterior collapse) and a large deviation of the aggregated posterior from the prior. In the CoLLike approach, we instead use a sample from the prior to construct a family of target joint distributions whose properties prevent these drawbacks. We utilize the complete likelihood both to choose the target from this family and to learn the model. We confirm our analysis with experiments on low-dimensional latents, which also indicate that high-accuracy unsupervised classification is achievable with the CoLLike objective.

1. INTRODUCTION

In the latent variable setting, the model defines a joint distribution over both observed variables x and latent variables z, while the training data contain only the observed variables. The problem can be treated as one of an unknown target conditional distribution z|x. There are at least two possible solutions: either come up with a meaningful target z|x distribution and train the model as in a supervised setting, or give up and focus on matching only the marginals in the x domain. The latter is the choice of the MaL objective. In this work, we follow the former approach. However, instead of picking a single target conditional, we construct an entire family of possible distributions and use the model likelihood to decide which conditional to use as the target. To construct this family, we draw a sample from the prior of the same size as the dataset in the observed domain. All possible assignments of observed samples to latent samples span a family of empirical joint distributions; each assignment can be represented as a permutation of the latent samples. Although the set of permutations is tremendous, growing as a factorial of the dataset size, the search for the permutation with the best likelihood can be performed efficiently using combinatorial optimization. The resulting optimization procedure resembles the expectation-maximization algorithm (Dempster et al., 1977), with the expectation step replaced by a combinatorial assignment problem. Furthermore, since the proposed algorithm uses gradient-free optimization to obtain the target distribution, the objective can be applied seamlessly to both continuous and discrete latent variables, whereas the discrete case is challenging for approaches based on MaL (Mnih & Gregor, 2014; Mnih & Rezende, 2016; Tucker et al., 2017). We analyze the objectives from the perspective of matching joint distributions.
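The assignment step described above can be sketched concretely. The following is a minimal, hypothetical illustration (function and variable names are ours, not from the paper): given a matrix of per-pair log-likelihoods log p(x_i | z_j) under the current model, the best-likelihood permutation is found with the Hungarian-type solver from SciPy rather than by enumerating all n! permutations.

```python
# Illustrative sketch of the combinatorial assignment step; names are
# hypothetical and the cost structure is an assumption for demonstration.
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_latents(log_lik):
    """log_lik[i, j] = log p(x_i | z_j) under the current model.

    Returns a permutation pi with pi[i] = index of the prior sample
    assigned to observation x_i, maximizing the total (complete)
    log-likelihood over all permutations.
    """
    # linear_sum_assignment minimizes total cost, so negate the
    # log-likelihoods to maximize them instead.
    rows, cols = linear_sum_assignment(-log_lik)
    return cols  # for a square matrix, rows == arange(n)


# Toy example: 3 observations paired with 3 prior draws.
rng = np.random.default_rng(0)
log_lik = rng.normal(size=(3, 3))
pi = assign_latents(log_lik)
```

In an EM-like loop, this assignment (the "E-like" step) would alternate with gradient updates of the model parameters on the completed pairs (x_i, z_{pi[i]}); because the step is gradient-free, it works the same way whether z is continuous or discrete.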
We show that MaL corresponds to a specific choice of the target z|x conditional, while our approach considers a family of possible conditionals. The choice of target conditional is responsible for two major failures that arise during training with the MaL objective: the inability to learn informative latents, also known as "posterior collapse" (Bowman et al., 2016; Razavi et al., 2019; He et al., 2019), and the divergence between the prior and the aggregated posterior (Hoffman & Johnson, 2016; Makhzani et al., 2015; Zhao et al., 2019; Kim & Mnih, 2018). These characteristics are vital for latent variable models: posterior collapse prevents learning meaningful representations, and samples drawn from regions where the latent marginals deviate strongly are subject to severe quality degradation.

