GENERALIZED MULTIMODAL ELBO

Abstract

Multiple data types naturally co-occur when describing real-world phenomena, and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between the semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.

1. INTRODUCTION

The availability of multiple data types provides a rich source of information and holds promise for learning representations that generalize well across multiple modalities (Baltrušaitis et al., 2018). Multimodal data naturally grants additional self-supervision in the form of shared information connecting the different data types. Further, the understanding of different modalities and the interplay between data types are non-trivial research questions and long-standing goals in machine learning research. While fully-supervised approaches have been applied successfully (Karpathy & Fei-Fei, 2015; Tsai et al., 2019; Pham et al., 2019; Schoenauer-Sebag et al., 2019), the labeling of multiple data types remains time-consuming and expensive. Therefore, models are required that efficiently learn from multiple data types in a self-supervised fashion. Self-supervised, generative models are suitable for learning the joint distribution of multiple data types without supervision. We focus on VAEs (Kingma & Welling, 2014; Rezende et al., 2014), which are able to jointly infer representations and generate new observations. Despite their success on unimodal datasets, there are additional challenges associated with multimodal data (Suzuki et al., 2016; Vedantam et al., 2018). In particular, multimodal generative models need to represent both modality-specific and shared factors and generate semantically coherent samples across modalities. Semantically coherent samples are connected by the information which is shared between data types (Shi et al., 2019). These requirements are not inherent to the objective of unimodal VAEs, the evidence lower bound (ELBO). Hence, adaptations to the original formulation are required to cater to and benefit from multiple data types. Furthermore, to handle missing modalities, there is a scalability issue in terms of the number of modalities: naively, 2^M different encoders are required to handle all combinations of M data types.
Thus, we restrict our search for an improved multimodal ELBO to the class of scalable multimodal VAEs. Among the class of scalable multimodal VAEs, there are two dominant strains of models, based on either the multimodal variational autoencoder (MVAE, Wu & Goodman, 2018) or the Mixture-of-Experts multimodal variational autoencoder (MMVAE, Shi et al., 2019) . However, we show that these approaches differ merely in their choice of joint posterior approximation functions. We draw a theoretical connection between these models, showing that they can be subsumed under the class of abstract mean functions for modeling the joint posterior. This insight has practical implications, because the choice of mean function directly influences the properties of a model (Nielsen, 2019) . The MVAE uses a geometric mean, which enables learning a sharp posterior, resulting in a good approximation of the joint distribution. On the other hand, the MMVAE applies an arithmetic mean which allows better learning of the unimodal and pairwise conditional distributions. We generalize these approaches and introduce the Mixture-of-Products-of-Experts-VAE that combines the benefits of both methods without considerable trade-offs. In summary, we derive a generalized multimodal ELBO formulation that connects and generalizes two previous approaches. The proposed method, termed MoPoE-VAE, models the joint posterior approximation as a Mixture-of-Products-of-Experts, which encompasses the MVAE (Product-of-Experts) and MMVAE (Mixture-of-Experts) as special cases (Section 3). In contrast to previous models, the proposed model approximates the joint posterior for all subsets of modalities, an advantage that we validate empirically in Section 4, where our model achieves state-of-the-art results.

2. RELATED WORK

This work extends and generalizes existing work on self-supervised multimodal generative models that are scalable in the number of modalities. Scalable means here that a single model approximates the joint distribution over all modalities (including all marginal and conditional distributions), instead of requiring individual models for every subset of modalities (e.g., Huang et al., 2018; Tian & Engel, 2019; Hsu & Glass, 2018). The latter approach requires a prohibitive number of models, exponential in the number of modalities.

Multimodal VAEs. Among multimodal generative models, multimodal VAEs (Suzuki et al., 2016; Vedantam et al., 2018; Kurle et al., 2019; Tsai et al., 2019; Wu & Goodman, 2018; Shi et al., 2019; 2020; Sutter et al., 2020) have recently been the dominant approach. Multimodal VAEs are not only suitable for learning a joint distribution over multiple modalities, but also enable joint inference given a subset of modalities. However, to approximate the joint posterior for all subsets of modalities efficiently, additional assumptions on the form of the joint posterior are required. To overcome the issue of scalability, previous work relies on either the product (Kurle et al., 2019; Wu & Goodman, 2018) or the mixture (Shi et al., 2019; 2020) of unimodal posteriors. While both approaches have their merits, there are also disadvantages associated with them. We unite these approaches in a generalized formulation, a mixture-of-products joint posterior, that encapsulates both approaches and combines their benefits without significant trade-offs.

Multimodal posteriors

The MVAE (Wu & Goodman, 2018) assumes that the joint posterior is a product of unimodal posteriors, a Product-of-Experts (PoE, Hinton, 2002). The PoE has the benefit of aggregating information across any subset of unimodal posteriors and therefore provides an efficient way of dealing with missing modalities for specific types of unimodal posteriors (e.g., Gaussians). However, to handle missing modalities, the MVAE relies on an additional sub-sampling of unimodal log-likelihoods, which no longer guarantees a valid lower bound on the joint log-likelihood (Wu & Goodman, 2019). Previous work provides empirical results that exhibit the shortcomings of the MVAE, attributing them to a precision miscalibration of experts (Shi et al., 2019) or to the averaging over inseparable individual beliefs (Kurle et al., 2019). Our results suggest that the PoE works well in practice if it is also applied to all subsets of modalities, which naturally leads to the proposed Mixture-of-Products-of-Experts (MoPoE) generalization and yields a valid lower bound on the joint log-likelihood. On the other hand, the MMVAE (Shi et al., 2019) assumes that the joint posterior is a mixture of unimodal posteriors, a Mixture-of-Experts (MoE). The MMVAE is suitable for the approximation of unimodal posteriors and for translation between pairs of modalities; however, it cannot take advantage of multiple modalities being present, because it only takes the unimodal posteriors into account during training. In contrast, the proposed MoPoE-VAE computes the joint posterior for all subsets of modalities and therefore enables efficient many-to-many translations. Extensions of the MVAE and MMVAE (Kurle et al., 2019; Daunhawer et al., 2020; Shi et al., 2020; Sutter et al., 2020) have introduced additional loss terms; these are, however, also applicable to and can be added on top of the proposed model.
Table 1 summarizes the properties of previous multimodal VAEs and highlights the benefits of the proposed model: the ability to aggregate multiple modalities, to learn a multi-modal posterior (in the statistical sense), and to efficiently handle missing modalities at test time.

3.1. PRELIMINARIES

We consider a dataset {X^(i)}_{i=1}^N of N i.i.d. samples, each of which is a set of M modalities X^(i) = {x_j^(i)}_{j=1}^M. We assume that the data is generated by some random process involving a joint hidden random variable z such that inter-modality dependencies are unknown. The marginal log-likelihood decomposes into a sum over the marginal log-likelihoods of individual sets, log p_θ({X^(i)}_{i=1}^N) = Σ_{i=1}^N log p_θ(X^(i)), each of which can be written as

log p_θ(X^(i)) = D_KL(q_φ(z|X^(i)) || p_θ(z|X^(i))) + L(θ, φ; X^(i)),  (1)

with

L(θ, φ; X^(i)) := E_{q_φ(z|X^(i))}[log p_θ(X^(i)|z)] − D_KL(q_φ(z|X^(i)) || p_θ(z)).  (2)

L(θ, φ; X^(i)) is called the evidence lower bound (ELBO) on the marginal log-likelihood of the i-th set. It forms a tractable objective for approximating the joint data distribution p_θ(X^(i)). q_φ(z|X^(i)) is the approximate posterior distribution with learnable parameters φ. From the non-negativity of the KL-divergence, it follows that log p_θ(X^(i)) ≥ L(θ, φ; X^(i)), where the bound holds with equality if the posterior approximation q_φ(z|X^(i)) is identical to the true posterior p_θ(z|X^(i)). Hence, maximizing the ELBO in Equation (2) minimizes the otherwise intractable KL-divergence between the approximate and the true posterior:

arg min_φ D_KL(q_φ(z|X^(i)) || p_θ(z|X^(i))).  (3)

Adaptations to the ELBO formulation in Equation (2) include an additional hyperparameter β, which weights the KL-divergence relative to the log-likelihood (Higgins et al., 2017). To improve readability, we omit the superscript (i) in the remainder of this work.
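As a concrete illustration of Equation (2), the sketch below computes the two ELBO terms for a diagonal-Gaussian posterior and a standard-normal prior, where the KL-divergence has a closed form. The function names are ours, and passing the reconstruction term in as a pre-computed log-likelihood is a simplification; the β-weighting follows Higgins et al. (2017).

```python
import math

def kl_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def elbo(log_likelihood, mu, logvar, beta=1.0):
    """ELBO = E_q[log p(X|z)] - beta * KL(q || p), cf. Equation (2)."""
    return log_likelihood - beta * kl_standard_normal(mu, logvar)

# A posterior equal to the prior incurs zero KL penalty:
print(kl_standard_normal([0.0, 0.0], [0.0, 0.0]))  # → 0.0
```

With μ = 1 and unit variance in one dimension, the closed form gives 0.5 · (1 + 1 − 1 − 0) = 0.5, which the function reproduces.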

3.2. APPROXIMATING p θ (z|X) IN CASE OF MISSING DATA TYPES

For a dataset of M modalities, there are 2^M different subsets in the powerset P(X). If, for a particular observation, we only have access to a subset of data types X_k ∈ P(X), approximating p_θ(X_k) would result in a different ELBO formulation L(θ, φ_k; X_k), in which the true posterior p_θ(z|X_k) of the subset X_k is approximated. Instead, we are interested in the true posterior p_θ(z|X) of all data types X, even when only a subset X_k, i.e. the approximation q̃_{φ_k}(z|X_k), is available. The desired ELBO for the available subset X_k is given by

L_k(θ, φ_k; X) := E_{q̃_{φ_k}(z|X_k)}[log p_θ(X|z)] − D_KL(q̃_{φ_k}(z|X_k) || p_θ(z)).  (4)

The subtle but important difference between L_k(θ, φ_k; X) and L(θ, φ_k; X_k) is that the former still yields a valid lower bound on log p_θ(X), whereas the latter forms a lower bound on log p_θ(X_k), which is no longer a valid bound on the desired log p_θ(X). Different from previous work, we argue for an optimization over the powerset P(X), i.e., the joint optimization of all ELBOs L_k(θ, φ_k; X) defined by the subset posterior approximations q̃_{φ_k}(z|X_k). Since maximizing the ELBO in Equation (2) is equivalent to minimizing the KL-divergence in Equation (3), the joint optimization over the powerset P(X) is equal to the minimization of the following convex combination of KL-divergences:

arg min_φ Σ_{X_k ∈ P(X)} D_KL(q̃_φ(z|X_k) || p_θ(z|X)).  (5)

Hence, we propose to optimize Equation (4) for all subsets X_k.

Lemma 1. The sum of KL-divergences in Equation (5) describes the joint probability log p_θ(X) as follows:

log p_θ(X) = (1/2^M) Σ_{X_k ∈ P(X)} D_KL(q̃_φ(z|X_k) || p_θ(z|X)) + (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / q̃_φ(z|X_k))].  (6)

Following Lemma 1 (see Appendix A.1 for the proof) and the non-negativity of the KL-divergence, we see that the convex combination of expectations over the powerset P(X) is an ELBO on the joint probability log p_θ(X).
Since a naive implementation would require 2^M different inference networks, we use a more efficient approach utilizing abstract mean functions.
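To make the 2^M subsets concrete, the powerset can be enumerated with the standard library. The modality names below are illustrative, and how the empty subset is handled (e.g., mapping it to the prior) is an implementation choice not fixed by the text.

```python
from itertools import chain, combinations

def powerset(modalities):
    """Enumerate all 2^M subsets X_k ∈ P(X) of the M modalities,
    including the empty set."""
    s = list(modalities)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

subsets = powerset(["mnist", "svhn", "text"])
print(len(subsets))  # → 8, i.e. 2^3 subsets for M = 3 modalities
```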

3.3. SCALABLE INFERENCE USING ABSTRACT MEAN FUNCTIONS

To create a model that is scalable in the number of modalities, i.e. a model that breaks the need for 2^M different networks, previous works define the joint posterior approximation q_φ(z|X) as a mean function of the unimodal variational posteriors. The PoE and MoE can be subsumed under the concept of abstract means (Nielsen, 2019). Abstract means unify multiple mean functions M_f for a given function f (Niculescu & Persson, 2005):

M_f(p) = f^{-1}( (1/P) Σ_{k=1}^P f(p_k) ),

where P is the number of elements and the function f needs to be injective in order for f^{-1} to exist. f(p) = ap + b results in the arithmetic mean, f(p) = log p in the geometric mean. The choice of mean function directly influences the properties of the learned model, as we recapitulate with regard to multimodal VAEs in the following. The MVAE (Wu & Goodman, 2018) employs the PoE, which is a geometric mean of unimodal posteriors. Aggregation through the PoE results in a sharp posterior approximation (Hinton, 2002), but struggles in optimizing the individual experts, as mentioned by the authors (Wu & Goodman, 2018, p. 3). In contrast, the MMVAE (Shi et al., 2019) uses the MoE, which is an arithmetic mean of unimodal posteriors. As such, the MMVAE optimizes the individual experts well, but is not able to learn a distribution that is sharper than any of its experts. Thus, the choice of mean function directly influences the properties of the resulting model: the MoE optimizes for conditional distributions based on the unimodal posterior approximations, while the PoE optimizes for the approximation of the joint probability distribution. For scalable, abstract-mean based models, the set of parameters φ_k for the posterior approximation of a subset q̃_φ(z|X_k) is determined by the unimodal posterior approximations q_{φ_j}(z|x_j) as φ_k = {φ_j ∀ j ∈ {1, . . . , M} : x_j ∈ X_k}.
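For Gaussian experts, the geometric mean (PoE) has a closed form via precision weighting, while the arithmetic mean (MoE) is a mixture. The following minimal sketch (1-D latents for brevity; function names are ours) makes the sharpening effect of the PoE explicit:

```python
import math

def poe_gaussian(mus, logvars):
    """Product of Gaussian experts (geometric mean): precision-weighted mean
    and summed precisions yield a posterior sharper than any single expert."""
    precisions = [math.exp(-lv) for lv in logvars]
    total = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return mu, math.log(1.0 / total)

def moe_mean(mus):
    """Mean of a uniform Mixture-of-Experts (arithmetic mean); the mixture
    is never sharper than its components."""
    return sum(mus) / len(mus)

# Two unit-variance experts: the PoE halves the variance, the MoE does not.
mu, logvar = poe_gaussian([0.0, 2.0], [0.0, 0.0])
print(mu, math.exp(logvar))  # → 1.0 0.5
print(moe_mean([0.0, 2.0]))  # → 1.0
```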

3.4. GENERALIZED MULTIMODAL ELBO

In the following, we first introduce the new ELBO L_MoPoE(θ, φ; X) and then prove that this objective minimizes the convex combination of KL-divergences in Equation (5).

Definition 1.
1. Let the posterior approximation of subset X_k be q̃_φ(z|X_k) = PoE({q_{φ_j}(z|x_j) ∀ x_j ∈ X_k}) ∝ Π_{x_j ∈ X_k} q_{φ_j}(z|x_j).
2. Let the joint posterior be q_φ(z|X) = (1/2^M) Σ_{X_k ∈ P(X)} q̃_φ(z|X_k).

The objective L_MoPoE(θ, φ; X) for learning a joint distribution of multiple data types X is defined as

L_MoPoE(θ, φ; X) := E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/2^M) Σ_{X_k ∈ P(X)} q̃_φ(z|X_k) || p_θ(z) ).  (6)

From Definition 1, Lemma 2 directly follows.

Lemma 2. L_MoPoE(θ, φ; X) is a multimodal ELBO, that is, log p_θ(X) ≥ L_MoPoE(θ, φ; X). Since q_φ(z|X) is defined as a mixture distribution (i.e., a valid probability distribution), it directly follows that L_MoPoE(θ, φ; X) is a valid ELBO on log p_θ(X), because a variational distribution can be chosen arbitrarily as long as it is a valid probability distribution. For a proof of Lemma 2, see Appendix A.2.

Lemma 3. Maximizing L_MoPoE(θ, φ; X) minimizes the convex combination of KL-divergences of the powerset P(X) given in Equation (5). For a proof of Lemma 3, see Appendix A.3.

Definition 1 does not put any restrictions on the choice of posterior approximations q̃_φ(z|X_k). As we are interested in scalable, multimodal models, we choose the PoE for the posterior approximations of the subsets X_k ∈ P(X). Other, non-scalable posterior fusion methods are possible within this framework.
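Definition 1 suggests a simple two-stage sampling scheme for the joint posterior: draw a subset uniformly (mixture step), then draw z from the PoE of that subset's unimodal posteriors (product step). A sketch under our own simplifications (1-D Gaussian latents, non-empty subsets only, dictionary-based parameters):

```python
import math
import random

def poe_gaussian(mus, logvars):
    """Closed-form product of Gaussian experts (precision weighting)."""
    precisions = [math.exp(-lv) for lv in logvars]
    total = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return mu, math.log(1.0 / total)

def mopoe_sample(unimodal_params, subsets, rng):
    """Draw z ~ q_φ(z|X): pick a subset X_k uniformly, form the PoE of its
    unimodal posteriors, then sample from the resulting Gaussian."""
    subset = rng.choice(subsets)
    mus = [unimodal_params[m][0] for m in subset]
    logvars = [unimodal_params[m][1] for m in subset]
    mu, logvar = poe_gaussian(mus, logvars)
    return mu + math.exp(0.5 * logvar) * rng.gauss(0.0, 1.0)

params = {"mnist": (0.0, 0.0), "svhn": (1.0, 0.0), "text": (2.0, 0.0)}
subsets = [("mnist",), ("svhn",), ("text",),
           ("mnist", "svhn"), ("mnist", "text"), ("svhn", "text"),
           ("mnist", "svhn", "text")]
z = mopoe_sample(params, subsets, random.Random(0))
```

Note that the PoE of a single expert is the expert itself, which is exactly the property used in Section 3.5 to recover the MMVAE as a special case.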

3.5. THE GENERAL FRAMEWORK

Definition 1 can be interpreted as a hierarchical distribution: first, the unimodal posterior approximations q_{φ_j}(z|x_j) ∀ x_j ∈ X_k of a subset are combined using a PoE; second, the subset approximations q̃_φ(z|X_k) ∀ X_k ∈ P(X) are combined using a MoE. This allows us to combine the strengths of both the MoE and the PoE while circumventing their weaknesses (see Section 2). For Gaussian posterior approximations, as is common in VAEs, the PoE can be calculated in closed form, which makes it a computationally efficient solution. In the following, we derive the objectives optimized by the MVAE and MMVAE as special cases of L_MoPoE(θ, φ; X). The MVAE only takes into account the full subset, i.e., the PoE of all data types. Trivially, this is a MoE with only a single component:

L_PoE(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL(q_φ(z|X) || p_θ(z))  (7)

with

q_φ(z|X) ∝ Π_{j=1}^M q_{φ_j}(z|x_j) = PoE({q_{φ_j}(z|x_j)}_{j=1}^M).  (8)

This is equivalent to the MoPoE-VAE of a single subset X_k, which is the full set X. As the PoE of a single expert is just the expert itself, the MMVAE (Shi et al., 2019) is the special case of L_MoPoE(θ, φ; X) which only takes into account the M unimodal subsets:

L_MoE(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) || p_θ(z) )  (9)

with

q_φ(z|X) = (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) = (1/M) Σ_{j=1}^M PoE({q_{φ_j}(z|x_j)}).  (10)

L_MoE(θ, φ; X) is equivalent to a MoPoE-VAE of the M unimodal posterior approximations q_{φ_j}(z|x_j) for j = 1, . . . , M. Therefore, the proposed MoPoE-VAE is a generalized formulation of the MVAE and MMVAE which accounts for all subsets of modalities. The identified special cases offer a new perspective on the strengths and weaknesses of prior work: previous models focus on a specific subset of posteriors, which might lead to decreased performance on the remaining subsets.
In particular, the MVAE should perform best when all modalities are present, whereas the MMVAE should be most suitable when only a single modality is observed. We validate this observation empirically in Section 4.

4. EXPERIMENTS & RESULTS

We evaluate the proposed method on three different datasets and compare it to state-of-the-art methods. We introduce a new dataset called PolyMNIST with 5 simplified modalities. Additionally, we evaluate all models on the trimodal matching digits dataset MNIST-SVHN-Text and the challenging bimodal CelebA dataset with images and text. The latter two were introduced in Sutter et al. (2020). We evaluate the models according to three different metrics. We assess the quality of the learned latent representation using a linear classifier. The coherence of generated samples is evaluated using pre-trained classifiers. The approximation of the joint data distribution is measured using test set log-likelihoods. The datasets and the evaluation of experiments are described in detail in Appendix B. Tables 2 and 3 show the superior performance of the proposed method compared to state-of-the-art methods regarding the ability to learn meaningful latent representations and generate coherent samples. MVAE reaches superior performance for the generation of the SVHN modality, while MoPoE-VAE achieves the best overall coherence results. Table 4 shows the results for the test log-likelihoods. The proposed MoPoE-VAE is the only method that is able to reach state-of-the-art coherence, latent classification accuracies, and test log-likelihoods for all combinations of inputs. This can be seen in Figure 1, which illustrates the trade-off between test log-likelihoods and joint coherence for every model. Every point encodes the joint coherence and joint log-likelihood for a different β-value. The goal is to have high coherence and log-likelihoods (i.e., the top right corner). Note that lower β-values typically correspond to models with a higher log-likelihood but lower coherence. Overall, the MoPoE-VAE achieves a superior trade-off compared to the baselines.
As expected from our theoretical analysis (Section 3.5), the MVAE achieves good joint log-likelihoods, whereas the MMVAE reaches high joint coherence. The PolyMNIST dataset allows us to investigate how well different methods perform given more than two modalities. Since individual images can be difficult to classify correctly (even for a human observer), one would expect multimodal models to aggregate information across multiple modalities. Further, this dataset facilitates the comparison of different models, because it removes the need for modality-specific architectures and hyperparameters. As such, for a fair comparison, we use the same architectures and hyperparameter values across all methods. We expect to see that both the MMVAE and our proposed method are able to aggregate the redundant digit information across different modalities, whereas the MVAE should not be able to benefit from an increasing number of modalities, because it does not aggregate unimodal posteriors. Further, we hypothesize that the MVAE will achieve the best generative performance when all modalities are present, but that it will struggle with an increasing number of missing modalities. The proposed MoPoE-VAE should perform well given any subset of modalities.

PolyMNIST results. Figure 3 compares the results across different methods. The performance in terms of three different metrics is shown as a function of the number of input modalities; for instance, the log-likelihood of all generated modalities given one input modality (averaged over all possible single input modalities). As expected, both the MMVAE and MoPoE-VAE benefit from more input modalities, whereas the performance of the MVAE stays flat across all metrics. In the limit of all 5 input modalities, the log-likelihood of MoPoE-VAE is on par with MVAE, but the proposed method is clearly superior in terms of both latent classification and conditional coherence across any subset of modalities. Analogously, in the limit of a single input modality, MoPoE-VAE matches the performance of MMVAE. Only in terms of the joint coherence (see figure legend) does the MMVAE perform better, suggesting that a more flexible prior might be needed for the MoPoE-VAE. Therefore, the PolyMNIST experiment illustrates that the proposed method does not only theoretically encompass the other two methods, but that it is superior for most subsets of modalities and even matches their performance in the special cases that favor previous methods.

Table 4: Test set log-likelihoods on MNIST-SVHN-Text. We report the test set log-likelihoods of the joint generative model conditioned on the variational posterior of subsets of modalities q̃_φ(z|X_k). (x_M: MNIST; x_S: SVHN; x_T: Text; X = (x_M, x_S, x_T).)

4.3. BIMODAL CELEBA

In this dataset, images displaying faces (Liu et al., 2015) are equipped with additional text describing the faces using the labeled attributes. Any negatively labeled attribute is completely missing in the string, which makes the text modality more challenging. Compared to previous experiments, we additionally use modality-specific latent spaces, which were found to improve the generative quality of a model (Hsu & Glass, 2018; Sutter et al., 2020; Daunhawer et al., 2020). Figure 4 displays qualitative results of images which are generated given text. Table 5 shows the classification results for the coherence of generated samples as well as for the classification of latent representations. We see that the proposed model is able to match the baselines on this challenging dataset, which favors the baselines because it consists of only two modalities. Figure 4 shows that attributes like "gender" or "smiling" are learned well, as they manifest in generated samples and can be identified from the latent representation. Subtle and rare attributes are more difficult to generate consistently; evaluations specific to the different labels are provided in Appendix E.3.

5. CONCLUSION

In this work, we propose a new multimodal ELBO formulation. Our contribution is threefold: First, the proposed MoPoE-VAE generalizes prior works (MVAE, MMVAE) and combines their benefits. Second, we analyze the strengths and weaknesses of previous works and relate them directly to their objectives and choices of posterior approximation function. Finally, in extensive experiments, we empirically show the advantages compared to state-of-the-art models and even match their performance on tasks that favor previous work. In future work, we would like to evaluate previous extensions to multimodal VAEs. Additionally, we will explore different types and combinations of abstract mean functions and investigate their effects on the model and its performance, as well as their theoretical properties (e.g., tightness) compared to existing methods.

A PROOFS

A.1 PROOF OF LEMMA 1

Lemma 1 states that the sum of KL-divergences in Equation (5), Σ_{X_k ∈ P(X)} D_KL(q̃_φ(z|X_k) || p_θ(z|X)), can be used to describe the joint probability log p_θ(X).

Proof. We show that the convex combination of KL-divergences can be directly related to the joint probability log p_θ(X):

Σ_{X_k ∈ P(X)} D_KL(q̃_φ(z|X_k) || p_θ(z|X))
= Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (q̃_φ(z|X_k) / p_θ(z|X))]
= Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (q̃_φ(z|X_k) / p_θ(z, X))] + 2^M log p_θ(X),  (11)

which can be reformulated as an expression of the joint probability log p_θ(X):

log p_θ(X) = (1/2^M) Σ_{X_k ∈ P(X)} D_KL(q̃_φ(z|X_k) || p_θ(z|X)) + (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / q̃_φ(z|X_k))],

where the first term is the convex combination of KL-divergences in Equation (5). From the non-negativity of the KL-divergence, we derive the lower bound on the joint probability log p_θ(X):

log p_θ(X) ≥ (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / q̃_φ(z|X_k))].  (12)

A.2 PROOF OF LEMMA 2

Lemma 2 states that L_MoPoE(θ, φ; X) is a multimodal ELBO, that is, log p_θ(X) ≥ L_MoPoE(θ, φ; X).

Proof. The sums using index k also sum over all 2^M subsets in the powerset P(X); we use k only for better readability.

log p_θ(X) = D_KL(q_φ(z|X) || p_θ(z|X)) + E_{q_φ(z|X)}[log (p_θ(z, X) / q_φ(z|X))]  (13)
= D_KL( (1/2^M) Σ_k q̃_φ(z|X_k) || p_θ(z|X) ) + E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(z, X) / ((1/2^M) Σ_k q̃_φ(z|X_k)))]  (14)
≥ E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(z, X) / ((1/2^M) Σ_k q̃_φ(z|X_k)))]  (15)
= E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/2^M) Σ_{X_k ∈ P(X)} q̃_φ(z|X_k) || p_θ(z) )  (16)
= L_MoPoE(θ, φ; X).  (17)

A.3 PROOF OF LEMMA 3

Lemma 3 states that maximizing L_MoPoE(θ, φ; X) minimizes the convex combination of KL-divergences of the powerset P(X) given in Equation (5).

Proof. Lemma 1 shows that the sum of KL-divergences in Equation (5) is able to describe the joint probability and can be used to form a valid ELBO. Equation (12) is the convex combination of ELBOs given a subset's posterior approximation q̃_φ(z|X_k).
Utilizing Jensen's inequality, it follows that

(1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / q̃_φ(z|X_k))] ≤ E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / ((1/2^M) Σ_k q̃_φ(z|X_k)))],  (18)

where the sums on the right-hand side also iterate over all X_k ∈ P(X). From Equation (18), we see that the proposed L_MoPoE(θ, φ; X) is not only a valid lower bound on the joint log-probability log p_θ(X), but also a tighter one than the convex combination of ELBOs:

log p_θ(X) ≥ E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / ((1/2^M) Σ_k q̃_φ(z|X_k)))]  (19)
= L_MoPoE(θ, φ; X)  (20)
≥ (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / q̃_φ(z|X_k))].  (21)

As Equation (19) can be directly derived from Equation (5), maximizing the proposed objective results in minimizing the convex combination of KL-divergences. In the following, we derive the inequality in Equation (18) in more detail:

(1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / q̃_φ(z|X_k))]  (22)
= (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z)) − log q̃_φ(z|X_k)]  (23)
= (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z))] − (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log q̃_φ(z|X_k)]  (24)
≤ E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z))] − E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log ((1/2^M) Σ_k q̃_φ(z|X_k))]  (25)
= E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z)) − log ((1/2^M) Σ_k q̃_φ(z|X_k))]  (26)
= E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log (p_θ(X|z) p_θ(z) / ((1/2^M) Σ_k q̃_φ(z|X_k)))].  (27)

In the minuend of Equation (24), the ordering of expectation and sum can be exchanged due to the linearity of the expectation, so that the minuend equals the first term of Equation (25). In the subtrahend of Equation (24), the average of expectations of the subset posterior approximations can be bounded using Jensen's inequality: due to the convexity of the function f(t) = t log t (Cover & Thomas, 2006, p. 29), the expectation of the mixture distribution is a lower bound on the average over the expectations of the subset posterior approximations, i.e., E_{(1/2^M) Σ_k q̃_φ(z|X_k)}[log ((1/2^M) Σ_k q̃_φ(z|X_k))] ≤ (1/2^M) Σ_{X_k ∈ P(X)} E_{q̃_φ(z|X_k)}[log q̃_φ(z|X_k)]. Hence, the inequality from Equation (24) to Equation (25) follows, as we decrease the subtrahend.
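The key inequality above, that the mixture's self-expectation is bounded by the average of the components' self-expectations, can be checked numerically for discrete toy distributions. This is our own illustration, not part of the proof:

```python
import math

def expect_log(p, q):
    """E_p[log q] for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

q1 = [0.9, 0.1]
q2 = [0.2, 0.8]
mix = [(a + b) / 2 for a, b in zip(q1, q2)]  # uniform two-component mixture

avg_component = 0.5 * expect_log(q1, q1) + 0.5 * expect_log(q2, q2)
mixture_term = expect_log(mix, mix)
# Convexity of t*log(t): E_mix[log mix] <= average of E_{q_k}[log q_k]
print(mixture_term <= avg_component)  # → True
```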

B EVALUATION OF EXPERIMENTS

For the experiments, we evaluate all models regarding three different metrics: the classification accuracy (or average precision for CelebA) on the latent representation, the coherence of generated samples, and the test set log-likelihoods. The latent representations are evaluated using a logistic regression classifier from scikit-learn (Pedregosa et al., 2011). The classifier is trained using 500 samples from the training set, which are encoded using the trained models. The evaluation is done on the full test set, and the reported numbers are the average performances over all batches in the test set. The generation coherence is evaluated using classifiers with the same architecture as the unimodal encoders, which were trained beforehand: for every data type, we train a neural network classifier in a supervised way. The architecture of the classifier is identical to the encoder except for the last layer. For joint coherence, all generated samples are evaluated by the classifiers, and if all modalities are classified as having the same label, they are considered coherent. The coherence accuracy is the ratio of coherent samples divided by the number of generated samples. For conditional generation, the conditionally generated samples have to be coherent with the input samples. The test set log-likelihoods are evaluated using 15 importance samples for all models, and the reported numbers are the averages over all test set batches. If not stated differently, the reported numbers in Section 4 are the means and standard deviations of 5 runs with different random seeds. All models evaluated use the same architectures and numbers of parameters. For all experiments, the likelihoods of the different modalities are weighted relative to each other according to the size of the modality. The most dominant modality is set to 1.0. The remaining ones are scaled up by the ratio of their data dimensions.
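The joint-coherence computation described above reduces to a simple agreement check over per-modality classifier predictions; a sketch with made-up predictions:

```python
def joint_coherence(predictions):
    """Fraction of generated sample tuples whose per-modality classifier
    predictions all agree (one predicted label per modality and sample)."""
    coherent = sum(1 for labels in predictions if len(set(labels)) == 1)
    return coherent / len(predictions)

# Four generated MNIST/SVHN/Text triples, classified per modality:
preds = [(3, 3, 3), (3, 3, 5), (7, 7, 7), (1, 2, 3)]
print(joint_coherence(preds))  # → 0.5
```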
For example, in the MNIST-SVHN-Text experiment, SVHN is set to 1.0 and MNIST to 3.92, which is the ratio of their data dimensions. For all unimodal posterior approximations, we assume Gaussian distributions N(z; μ, σ² I_n), where n is the number of latent space dimensions. In all experiments, the mixture components are equally weighted with 1/#components.

B.1 COMPARISON TO PREVIOUS WORKS

Shi et al. (2019) ultimately use a different ELBO objective including importance samples, L_IWAE (Burda et al., 2016). We compare all models without the use of importance samples, as these could easily be introduced to all objectives and are not directly related to the focus of this work, which is the choice of joint posterior approximation. Sutter et al. (2020) utilize the Jensen-Shannon divergence as a regularizer instead of the KL-divergence. This results in the use of a dynamic prior and shows promising results. Besides the dynamic prior, they also model the joint posterior approximation using a MoE. Again, we do not include models utilizing a dynamic prior, as this could be introduced to all formulations and is not the focus of this work.
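The likelihood weight of 3.92 for MNIST quoted in the section above can be reproduced from the data dimensions, assuming SVHN images of shape 32×32×3 and grayscale MNIST images of shape 28×28 (the shapes are our assumption; the text only states the resulting ratio):

```python
# Reconstruction-likelihood weights from data dimensionality:
svhn_dim = 32 * 32 * 3   # 3072 values per SVHN image (assumed shape)
mnist_dim = 28 * 28      # 784 values per MNIST image (assumed shape)

svhn_weight = 1.0                      # most dominant modality
mnist_weight = svhn_dim / mnist_dim    # smaller modality is scaled up
print(round(mnist_weight, 2))  # → 3.92
```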

B.1.1 EQUIVALENCE TO ELBO FORMULATION IN SHI ET AL. (2019)

For clarity, we show here the equivalence of the formulation in Equation (9) to the formulation in (Shi et al., 2019, p. 5):

L_MoE(θ, φ; X) = E_{q_φ(z|X)}[log p_θ(X|z)] − D_KL( (1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) || p_θ(z) )  (28)
= E_{q_φ(z|X)}[log p_θ(X|z)] − E_{(1/M) Σ_j q_{φ_j}(z|x_j)}[log ((1/M) Σ_{j=1}^M q_{φ_j}(z|x_j) / p_θ(z))]  (29)
= E_{q_φ(z|X)}[log p_θ(X|z)] − E_{q_φ(z|X)}[log (q_φ(z|X) / p_θ(z))]  (30)
= E_{q_φ(z|X)}[log p_θ(X|z)] + E_{q_φ(z|X)}[log (p_θ(z) / q_φ(z|X))]  (31)
= E_{q_φ(z|X)}[log (p_θ(X|z) p_θ(z) / q_φ(z|X))]  (32)
= (1/M) Σ_{j=1}^M E_{q_{φ_j}(z|x_j)}[log (p_θ(X|z) p_θ(z) / q_φ(z|X))],  (33)

where Equations (32) and (33) are equivalent to the first equation on page 5 in Shi et al. (2019). The different formulation on the second line of their equation stems from their use of importance samples.

C MNIST-SVHN-TEXT

C.1 DATASET

Following Shi et al. (2019) in their bimodal experiment, we create 20 triples per set, resulting in a many-to-many mapping.

C.2 EXPERIMENTAL SETUP

The latent space dimension is set to 20 for all modalities, models and runs. The results in Tables 2 to 4 are generated with β = 5.0. We train all models for 150 epochs. We use the same architectures as in Sutter et al. (2020): for the MNIST encoder and decoder, we use fully-connected layers; for the SVHN and text encoders and decoders, feed-forward convolutional layers. For all layers, we use ReLU activation functions (Nair & Hinton, 2010). The detailed architectures can also be looked up in the released code. We use an Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001.

C.3 ADDITIONAL RESULTS

In Table 6, we show the coherence results including the standard deviations over the 5 runs, which were omitted from the main part due to space restrictions. Additionally, we perform the analysis of coherence in relation to log-likelihood for conditional generation as well, similar to the example using random generation in Section 4.1. The combination of coherence and log-likelihoods shows the ability of a model to learn the data distribution as well as to generate coherent samples. Every point refers to a different β value. We evaluated the models for β ∈ {0.5, 1.0, 2.5, 5.0, 10.0, 20.0}. The points in the figures are the mean values of 5 different runs, with the lines being the standard deviations in both directions, coherence and log-likelihoods. Figure 7 displays a qualitative comparison between the three models using 100 randomly generated samples. The generated samples correspond to the numbers in Section 4.1. MVAE best approximates the joint distribution in terms of sample quality at the price of limited coherence, while MMVAE shows higher coherence but limited sample quality. MoPoE approximates MVAE's sample quality with state-of-the-art coherence.

D POLYMNIST D.1 DATASET

For the creation of the PolyMNIST dataset, we fuse each MNIST image with a random crop of size 28x28 from the background image of the respective modality. In particular, we binarize the MNIST image and invert the colors of the random crop at those locations where the binarized MNIST digit is visible. We use the following background images:
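A minimal sketch of this fusion step (NumPy; the function name and binarization threshold are assumptions for illustration, the exact values are in the released code):

```python
import numpy as np

def fuse(mnist_img, background, threshold=0.3, rng=None):
    """Overlay one MNIST digit (28x28, values in [0, 1]) on a random
    28x28 crop of a modality-specific background (HxWx3 in [0, 1])."""
    rng = rng or np.random.default_rng()
    h, w, _ = background.shape
    top = rng.integers(h - 28 + 1)
    left = rng.integers(w - 28 + 1)
    crop = background[top:top + 28, left:left + 28].copy()
    mask = mnist_img > threshold       # binarize the MNIST digit
    crop[mask] = 1.0 - crop[mask]      # invert colors where the digit is visible
    return crop
```

Applying `fuse` to every (digit, background) pair yields one PolyMNIST "modality" per background image.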

D.2 EXPERIMENTAL SETUP

The latent space dimension is set to 512 for all modalities, models and runs. All results are based on β = 2.5, which was found to be a reasonable setting for all models. We use the same architectures for all methods and train all models for 300 epochs. We use an Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001. The architecture is based on straightforward convolutional neural networks (without bells and whistles); for details, we refer to the released code.

D.3 QUALITATIVE RESULTS

In Figures 8 to 11, we show qualitative results comparing the different methods.



We omit the subscript k for the parameterization of the posterior approximations when it is clear from context that only X_k is available, and write φ instead. The hyperparameter β controls the weight of the KL-divergence in Equation (6). We evaluate the models using β ∈ {0.5, 1.0, 2.5, 5.0, 10.0, 20.0}. Details on how the dataset was generated are included in Appendix D. For more details, see Appendix E.



Figure 1: Joint Coherence vs. Log-Likelihoods for MNIST-SVHN-Text.

Figure 2: Ten samples from the PolyMNIST dataset. Each column depicts one tuple that consists of five different "modalities". The PolyMNIST dataset consists of sets of MNIST digits where each set {x_j}_{j=1}^{M} consists of 5 images with the same digit label but different backgrounds and different styles of handwriting. An example of one such tuple is shown in Figure 2. Thus, each "modality" represents a shuffled set of MNIST digits overlaid on top of (random crops from) 5 different background images, which are modality-specific. In total, there are 60,000 tuples of training examples and 10,000 tuples of test examples, and we make sure that no MNIST digit image is used in both the training and test set.

Figure 3: Performance on PolyMNIST as a function of the number of input modalities, averaged over all subsets of the respective size. Performance is measured in terms of three different metrics (larger is better) and markers denote the means (error bands denote standard deviations) over five runs. Left: Linear classification accuracy of digits given the latent representation computed from the respective subset. Center: Coherence of conditionally generated samples (excluding the input modality). Right: Log-likelihood of all generated modalities. Not shown: The joint coherence is 3.6 (±1.5), 20.0 (±1.9), and 12.1 (±1.6) percent for MVAE, MMVAE, and MoPoE respectively.

Figure 4: Qualitative results for bimodal CelebA. The images are conditionally generated by MoPoE-VAE using the text on top of each column.

Figure 5: Coherence and Log-Likelihoods for MNIST-SVHN-Text. The three figures show the evaluation for the conditional generation of a single modality given the other two in relation to the joint log-likelihood given these two modalities, e.g. in the first row we generate SVHN samples conditioned on MNIST and Text. The points in the figures are the mean values of 5 different runs, with the lines being the standard deviations in both directions, coherence and log-likelihoods.


Figure 8: Reconstructions across all modalities for all models. In every pair of rows, we show one row of test images followed by one row of respective reconstructions.

Figure 9: Ten unconditionally generated images from the respective five modalities for each model. Column-wise, we use the same latent codes, sampled from the prior. Note that, row-wise, the digits should not be ordered.

Figure 10: Conditionally generated images of the first modality given the respective test example from the second modality shown in the first row. Column-wise, we take different samples from the approximate posterior, which should result in stylistic variations for generated outputs, but which should ideally not change the digit labels.

Figure 11: Conditionally generated images of the first modality given the four test examples from the remaining modalities shown in the first four rows. Column-wise, we take different samples from the approximate posterior, which should result in stylistic variations for generated outputs, but which should ideally not change the digit labels. Compared to the results from Figure 10, the MoPoE-VAE generates more coherent samples when conditioned on four instead of one input modality.

Figure 12: Coherence of generated bimodal CelebA samples. For every subplot, image and text are generated conditionally by the modality or subset of modalities in the caption. We see that different attributes are not learned equally well.

Figure 13: Learned Latent Representations for the bimodal CelebA dataset.

Figure 14: Qualitative Results of randomly generated CelebA samples.

Properties of previous scalable multimodal VAEs and our proposed model. Note that to deal with missing modalities, the MVAE requires sub-sampling of unimodal ELBOs, which yields an invalid bound on the joint log-likelihood (Wu & Goodman, 2019).

Linear classification accuracy of latent representations for MNIST-SVHN-Text. We evaluate all subsets of modalities X k where the abbreviations of subsets are as follows: M: MNIST; S: SVHN; T: Text; M,S: MNIST and SVHN; M,T: MNIST and Text; S,T: SVHN and Text; M,S,T: all. We report the means and standard deviations over 5 runs.

Generation coherence for MNIST-SVHN-Text. For conditional generation, the letter above the horizontal line indicates the modality which is generated based on the subsets X k below. We report the mean values over 5 runs. Standard deviations are included in Appendix C.3.

Classification and coherence results on the bimodal CelebA experiment. For latent representations and conditionally generated samples, we report the mean average precision over all attributes (I: Image; T: Text; Joint: I and T).

Generation Coherence for MNIST-SVHN-Text. For every subtable, the modality above the wide horizontal line is generated based on the subsets below the same line, except for joint coherence. The abbreviations of the different modalities are as follows: M: MNIST; S: SVHN; T: Text. Combinations thereof, separated by commas, denote the subsets consisting of those modalities. We report the mean value and standard deviation of 5 runs.

ACKNOWLEDGMENTS

We would like to thank Ričards Marcinkevičs for helpful discussions and proposing the name "PolyMNIST". ID is supported by the SNSF grant #200021 188466.


In addition to the theoretical proof of Lemma 3 that Definition 1 minimizes the convex combination of ELBOs, we compare the performance of a model trained using the objective in Equation (5) to the proposed MoPoE-VAE. MoPoE-VAE achieves results competitive with the model optimizing Equation (5). This shows empirically that the proposed method is indeed minimizing the convex combination of ELBOs in Equation (5). Note that Equation (5) explicitly minimizes the ELBO of every possible subset and is hence computationally much more expensive to optimize.

Table 7: Comparison of objectives: Equation (5) and Definition 1. We report the test set log-likelihoods of the joint generative model conditioned on the variational posterior of subsets of modalities q_φ(z|X_k) (x_M: MNIST; x_S: SVHN; x_T: Text; X = (x_M, x_S, x_T)). For both objectives, we use β = 2.5.

E.1 DATASET

The bimodal version of CelebA was introduced by Sutter et al. (2020). The text modality consists of strings that concatenate the attributes present in a face. If an attribute is not present, it is simply omitted from the string, which makes text the more difficult modality. Example strings can be seen at the top of Figure 4.
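To make the computational cost of Equation (5) concrete: the subsets X_k it enumerates form the non-empty powerset of the modality set, i.e. 2^M - 1 ELBO terms. A small sketch (hypothetical helper name, Python stdlib only):

```python
from itertools import combinations

def nonempty_subsets(modalities):
    """All non-empty subsets X_k of the modality set, as enumerated by
    the convex combination of ELBOs in Equation (5)."""
    s = list(modalities)
    return [frozenset(c) for r in range(1, len(s) + 1)
            for c in combinations(s, r)]

subsets = nonempty_subsets(["MNIST", "SVHN", "Text"])
# 2^3 - 1 = 7 subset ELBOs for three modalities; this count roughly
# doubles with every additional modality, which is why optimizing
# Equation (5) directly becomes expensive for larger M.
```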

E.1.1 MODALITY-SPECIFIC LATENT SPACES

Modality-specific latent spaces have empirically been shown to be useful (Bouchacourt et al., 2018; Hsu & Glass, 2018; Daunhawer et al., 2020; Sutter et al., 2020), especially for the generative quality of samples. As CelebA is a visually challenging dataset, we adopt this idea and the ELBO formulation changes accordingly. For details, we refer the reader to the aforementioned papers. The latent space is divided into a shared space q_φc(c|X) and modality-specific spaces q_φs_j(s_j|x_j) for every modality x_j. This allows every modality to encode information which is specific to this modality in a separate latent space. Here, q_φc(c|X) = (1/2^M) Σ_{X_k ∈ X} q̃_φc(c|X_k) models the shared information and q_φs_j(s_j|x_j) the modality-specific information for every modality. All posterior approximations, shared and modality-specific, are again assumed to be Gaussian distributed; see Appendix B.
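As a sketch of how the two spaces combine at decoding time (hypothetical function and variable names; the 32-dimensional sizes follow the CelebA setup), each decoder receives the concatenation of the shared code c and its modality-specific code s_j:

```python
import numpy as np

def decoder_input(c, s_j):
    """Concatenate the shared code c with the modality-specific code s_j;
    the decoder p(x_j | c, s_j) conditions on this joint vector."""
    return np.concatenate([c, s_j], axis=-1)

c = np.zeros(32)      # shared latent code, sampled from q(c|X)
s_img = np.ones(32)   # image-specific latent code, sampled from q(s_img|x_img)
z_img = decoder_input(c, s_img)  # 64-dim input for the image decoder
```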

E.2 EXPERIMENTAL SETUP

The latent spaces are set to 32 dimensions for the shared space as well as the modality-specific spaces, resulting in 64 dimensions per modality in total. We set β = 2.5 for all runs and models. All models are trained for 200 epochs. Again, we use the same architectures as in Sutter et al. (2020): the encoders and decoders of both image and text use residual blocks (He et al., 2016). We use an Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.0005. The architectures can also be looked up in the released code. The classification of samples and representations is evaluated using average precision due to the imbalanced distribution of labels.
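For reference, average precision over a single binary attribute can be computed as follows (a minimal NumPy sketch for illustration, not the authors' evaluation code, which may use a library implementation):

```python
import numpy as np

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example;
    unlike accuracy, this is robust to imbalanced attribute labels."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by score, descending
    y = np.asarray(labels)[order]
    precision_at_k = np.cumsum(y) / (np.arange(len(y)) + 1)
    return float(precision_at_k[y == 1].mean())
```

For example, `average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])` returns 5/6, the mean of the precisions 1 and 2/3 at the two positive ranks.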

E.3 ADDITIONAL RESULTS

We show the attribute-specific evaluations in Figures 12 and 13, where the representations and generated samples are evaluated with respect to individual attributes. The evaluations are performed for all subsets of modalities. We see the differences in average precision between attributes, both in the coherence of samples and in the latent representations. The correlation between learned representations and sample coherence gives further evidence of the importance of a good representation, also for the multimodal setting and its task of conditional generation. Figure 14 displays qualitative results of randomly generated samples. The proposed model generates high-quality samples covering a wide variety of attributes. Minor artefacts are still visible in the images, which suggests room for improvement through a more rigorous hyperparameter search.

