COMPLETE LIKELIHOOD OBJECTIVE FOR LATENT VARIABLE MODELS

Abstract

In this work, we propose Complete Latent Likelihood (CoLLike), an alternative to the Marginal Likelihood (MaL) objective for learning representations with latent variable models. We analyze both objectives from the perspective of matching joint distributions. We show that MaL corresponds to a particular KL divergence between a target joint distribution and the model joint. Furthermore, the properties of this target joint explain the major failures of MaL from the representation-learning perspective: uninformative latents (posterior collapse) and high deviation of the aggregated posterior from the prior. In the CoLLike approach, we use a sample from the prior to construct a family of target joint distributions whose properties prevent these drawbacks. We utilize the complete likelihood both to choose the target from this family and to learn the model. We confirm our analysis with experiments on low-dimensional latents, which also indicate that high-accuracy unsupervised classification is achievable with the CoLLike objective.

1. INTRODUCTION

In the latent variable setting, the model defines a joint distribution over both observed variables x and latent variables z, while the training data contains only observed variables. The problem can be framed as that of an unknown target conditional distribution z|x. There are at least two possible solutions: try to come up with a meaningful target z|x distribution and train the model similarly to a supervised setting, or give up and focus on matching only the marginals in the x domain. The latter is the choice of the MaL objective. In this work, we follow the former approach. However, instead of picking a single target conditional, we construct an entire family of possible distributions and use the model likelihood to decide which conditional to use as a target. To construct this family, we use a sample from the prior of the same size as the dataset in the observed domain. All possible assignments of observed samples to latent ones span a family of empirical joint distributions; each assignment can be represented as a permutation of the latent samples. Although the set of permutations is tremendous and grows factorially with the dataset size, the search for the permutation with the best likelihood can be performed efficiently using combinatorial optimization. The resulting optimization procedure resembles the expectation maximization algorithm (Dempster et al., 1977), with the expectation step replaced by a combinatorial assignment problem. Furthermore, since the proposed algorithm uses gradient-free optimization to obtain the target distribution, the objective can be seamlessly applied to both continuous and discrete latent variables, whereas the discrete-latent case is challenging for approaches based on the MaL (Mnih & Gregor, 2014; Mnih & Rezende, 2016; Tucker et al., 2017). We analyze the objectives from the perspective of matching joint distributions.
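One step of the EM-like procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the decoder f and all variable names are hypothetical, and we assume a fixed unit-variance Gaussian decoder so that the negative complete log-likelihood of a pairing is (up to a constant) a sum of squared distances. The assignment step uses the Hungarian algorithm via SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

N, d = 8, 2
x = rng.normal(size=(N, d))   # observed dataset (x_1, ..., x_N)
z = rng.normal(size=(N, d))   # one sample of size N from the prior p(z)

def f(z):
    # hypothetical decoder mean; in practice a learned network in p_theta(x|z)
    return 0.5 * z

# cost[i, j] = -log p_theta(x_i | z_j) up to a constant,
# for a unit-variance Gaussian decoder
cost = 0.5 * ((x[:, None, :] - f(z)[None, :, :]) ** 2).sum(-1)

# "E-step": the expectation is replaced by a combinatorial assignment problem;
# pi is the permutation with the highest complete likelihood
rows, pi = linear_sum_assignment(cost)

best_ll = -cost[rows, pi].sum()   # best complete log-likelihood (up to const)
ident_ll = -np.trace(cost)        # likelihood of the arbitrary identity pairing
assert best_ll >= ident_ll
```

An "M-step" would then take gradient steps on the complete likelihood of the chosen pairing. Because the assignment step is gradient-free, nothing changes when z is discrete.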
We show that MaL corresponds to a specific choice of the target z|x conditional, while our approach takes into consideration a family of possible conditionals. The choice of target conditional is responsible for two major failures that arise during training with the MaL objective: the inability to learn informative latents, also known as "posterior collapse" (Bowman et al., 2016; Razavi et al., 2019; He et al., 2019), and divergence between the prior and the aggregated posterior (Hoffman & Johnson, 2016; Makhzani et al., 2015; Zhao et al., 2019; Kim & Mnih, 2018). These characteristics are vital for latent variable models: posterior collapse prevents learning a meaningful representation, and samples drawn from regions where the aggregated posterior deviates strongly from the prior suffer severe quality degradation (Rosca et al., 2018). The form of the target joint also explains the success of the complete likelihood on these challenges: the target distribution for CoLLike has high mutual information and matches the prior. We verify our analysis with experiments. In this work, we focus on low-dimensional latent variables to perform a direct comparison with the exact MaL. Models trained with CoLLike stably maintain high mutual information and low divergence from the prior. In turn, MaL inevitably leads either to posterior collapse or to a highly divergent aggregated posterior. Previously, for simple linear models, it was shown that posterior collapse occurs during the optimization of the exact likelihood (Lucas et al., 2019). Our experiments demonstrate that it can happen with expressive models trained with the exact likelihood as well. Along with informativeness and latent distribution matching, CoLLike shows no degradation of likelihood compared to MaL. Furthermore, we show that the CoLLike objective alone can achieve high accuracy in unsupervised classification. Finally, we show that CoLLike unifies a range of existing approaches that lack probabilistic justification.
Constrained K-means (Bennett et al., 2000), Permutation Invariant Training (Yu et al., 2017; Luo & Mesgarani, 2019), and Noise as Target (Bojanowski & Joulin, 2017) are among these approaches. This allows us to extend them to different factorizations of the joint and to analyze them from a probabilistic perspective. Furthermore, CoLLike bridges the likelihood and optimal transport (OT) frameworks. From this perspective, the negative log-likelihood plays the role of both the mapping from the latent to the visible domain and the distance function.
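The OT connection can be made concrete in the Gaussian case. Under a unit-variance Gaussian decoder (a hypothetical sketch, with an assumed decoder output fz), the negative log-likelihood -log p(x_i|z_j) equals half the squared Euclidean distance between x_i and f(z_j) plus a constant, so the best-likelihood assignment coincides with the optimal transport plan for the squared-Euclidean cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
N, d = 6, 3
x = rng.normal(size=(N, d))
fz = rng.normal(size=(N, d))   # decoded prior samples f(z_j); f is hypothetical

# Negative Gaussian log-likelihood: 0.5 * ||x_i - f(z_j)||^2 + const
nll = 0.5 * ((x[:, None] - fz[None, :]) ** 2).sum(-1) + 0.5 * d * np.log(2 * np.pi)
# Optimal-transport cost: squared Euclidean distance
ot = ((x[:, None] - fz[None, :]) ** 2).sum(-1)

# A positive scaling and a constant offset shift every pairing's total cost
# identically, so both problems share the same optimal assignment
_, pi_nll = linear_sum_assignment(nll)
_, pi_ot = linear_sum_assignment(ot)
assert (pi_nll == pi_ot).all()
```

For non-Gaussian decoders the same construction goes through, with the negative log-likelihood serving directly as the transport cost.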

2. COMPLETE LIKELIHOOD OBJECTIVE

In the regular latent variable setting, we are given a dataset {x_1, ..., x_N} and the model p_θ(x, z) = p_θ(x|z)p(z). The missing z can be treated as the missing p_δ(z|x) part of the target joint. If we cannot come up with a reasonable z|x target, we can at least match the marginals in the observed domain with KL(p_δ(x) || p_θ(x)), in the hope that the model will learn an informative relation between x and z. This is equivalent to the maximization of MaL:

L_{MaL}(θ) = Σ_{i=1}^{N} log p_θ(x_i) = Σ_{i=1}^{N} log ∫ p_θ(x_i, z) dz    (1)

The justification of MaL comes from the equivalence of maximizing (1) and minimizing the Kullback-Leibler divergence KL(p_δ(x) || p_θ(x)), which measures the discrepancy between the target empirical data distribution p_δ(x)¹ and the model distribution p_θ(x) (Murphy, 2022, 4.2.2). Note that this justifies MaL only for learning distributions of observed variables, not for learning representations. The fact that MaL does not promote informativeness (Alemi et al., 2018) clearly shows the lack of justification of MaL for representation learning, because informativeness is undoubtedly a fundamental requirement for any useful representation. Although the family of all possible target p(z|x) distributions is tremendous, we do not need to consider it entirely. First, the target distribution must be informative. Second, the marginal of the target joint distribution in the latent domain should match the prior p(z). The fixed prior implies that the desired marginal distribution of z is known. These requirements can be interpreted (Huszár, 2017) as the Infomax principle (Linsker, 1988). It is not hard to obtain a rich family of distributions with these properties. We can draw a collection (z_1, ..., z_N) by sampling from the prior and pair this collection with the dataset (x_1, ..., x_N). Each pairing produces an empirical joint distribution.
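The equivalence between maximizing (1) and minimizing KL(p_δ(x) || p_θ(x)) follows because KL(p_δ || p_θ) = -H(p_δ) - (1/N) Σ_i log p_θ(x_i), and the empirical entropy H(p_δ) does not depend on θ. A minimal numeric sketch with a one-dimensional Gaussian model (all names hypothetical) shows the shared minimizer:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.3, size=100)   # empirical data sample defining p_delta

mus = np.linspace(-2, 4, 601)       # candidate model means, unit variance
# Average negative log-likelihood per data point for each candidate theta = mu.
# KL(p_delta || p_theta) equals this plus the theta-independent term -H(p_delta),
# so minimizing the NLL and minimizing the KL select the same theta.
nll = 0.5 * np.log(2 * np.pi) + 0.5 * ((x[None, :] - mus[:, None]) ** 2).mean(1)

best_mu = mus[np.argmin(nll)]
assert abs(best_mu - x.mean()) < 0.01   # MLE of the mean is the sample mean
```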
The empirical distribution attains the highest possible Mutual Information (MI) under the assumption that there are no repeated values of x in the dataset (see Appendix B for the derivation). This ensures the first requirement. Sampling from the prior addresses the second requirement, since the collection of z samples converges to p(z) (Cover & Thomas, 2006, Theorem 11.2.1). However, sampling effects can be a significant problem for high-dimensional latents. We express each pairing as a permutation π, which produces a complete collection ((x_1, z_{π(1)}), ..., (x_N, z_{π(N)})) and an empirical joint p_{δπ}(x, z) = p_δ(x)p_π(z|x). Given this family of distributions, we need to decide which member is our target. We propose to pick the one with the highest complete likelihood, relying on the model's inductive biases. For this target we then once again optimize the complete likelihood of
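The MI claim above can be checked directly: with no repeated x values, the empirical joint puts mass 1/N on each of the N pairs (x_i, z_{π(i)}), so its mutual information equals H(X) = log N for every permutation π. A quick numeric check (an illustrative sketch, not the Appendix B derivation):

```python
import numpy as np

N = 10
rng = np.random.default_rng(3)
pi = rng.permutation(N)           # any pairing of the x's with prior samples

# Empirical joint over index pairs (i, pi(i)): mass 1/N on N cells of an N x N grid
joint = np.zeros((N, N))
joint[np.arange(N), pi] = 1.0 / N

px = joint.sum(1)                 # marginal over x indices: uniform 1/N
pz = joint.sum(0)                 # marginal over z indices: uniform 1/N

mask = joint > 0
mi = (joint[mask] * np.log(joint[mask] / (px[:, None] * pz)[mask])).sum()
assert np.isclose(mi, np.log(N))  # maximal MI given N distinct x values
```

Since every member of the family attains the same (maximal) MI and the same latent marginal, the families differ only in which x is matched with which z, which is exactly what the complete likelihood is used to decide.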



¹ We find the Greek letter δ especially suitable for the data distribution because it is consonant with "data" and reflects the delta-function-like form of the empirical distribution.

