GENERALIZED MULTIMODAL ELBO

Abstract

Multiple data types naturally co-occur when describing real-world phenomena, and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.

1. INTRODUCTION

The availability of multiple data types provides a rich source of information and holds promise for learning representations that generalize well across multiple modalities (Baltrušaitis et al., 2018). Multimodal data naturally grants additional self-supervision in the form of shared information connecting the different data types. Further, the understanding of different modalities and the interplay between data types are non-trivial research questions and long-standing goals in machine learning research. While fully-supervised approaches have been applied successfully (Karpathy & Fei-Fei, 2015; Tsai et al., 2019; Pham et al., 2019; Schoenauer-Sebag et al., 2019), the labeling of multiple data types remains time-consuming and expensive. This calls for models that efficiently learn from multiple data types in a self-supervised fashion.

Self-supervised generative models are suitable for learning the joint distribution of multiple data types without supervision. We focus on VAEs (Kingma & Welling, 2014; Rezende et al., 2014), which are able to jointly infer representations and generate new observations. Despite their success on unimodal datasets, there are additional challenges associated with multimodal data (Suzuki et al., 2016; Vedantam et al., 2018). In particular, multimodal generative models need to represent both modality-specific and shared factors and generate semantically coherent samples across modalities. Semantically coherent samples are connected by the information that is shared between data types (Shi et al., 2019). These requirements are not inherent to the objective of unimodal VAEs, the evidence lower bound (ELBO). Hence, adaptations to the original formulation are required to cater to and benefit from multiple data types. Furthermore, to handle missing modalities, there is a scalability issue in terms of the number of modalities: naively, 2^M different encoders are required to handle all combinations of M data types.
Thus, we restrict our search for an improved multimodal ELBO to the class of scalable multimodal VAEs. Within this class, there are two dominant lines of work, based on either the multimodal variational autoencoder (MVAE, Wu & Goodman, 2018) or the Mixture-of-Experts multimodal variational autoencoder (MMVAE, Shi et al., 2019). However, we show that these approaches differ merely in their choice of joint posterior approximation function. We draw a theoretical connection between these models, showing that they can be subsumed under the class of abstract mean functions for modeling the joint posterior. This insight has practical implications, because the choice of mean function directly influences the properties of a model (Nielsen, 2019). The MVAE uses a geometric mean, which enables learning a sharp posterior, resulting in a good approximation of the joint distribution. The MMVAE, on the other hand, applies an arithmetic mean, which allows better learning of the unimodal and pairwise conditional distributions. We generalize these approaches and introduce the Mixture-of-Products-of-Experts-VAE, which combines the benefits of both methods without considerable trade-offs.

In summary, we derive a generalized multimodal ELBO formulation that connects and generalizes two previous approaches. The proposed method, termed MoPoE-VAE, models the joint posterior approximation as a Mixture-of-Products-of-Experts, which encompasses the MVAE (Product-of-Experts) and the MMVAE (Mixture-of-Experts) as special cases (Section 3). In contrast to previous models, the proposed model approximates the joint posterior for all subsets of modalities, an advantage that we validate empirically in Section 4, where our model achieves state-of-the-art results.
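The relation between the three joint-posterior choices can be made concrete for one-dimensional Gaussian experts, a common choice in practice. The sketch below is illustrative only and not the paper's implementation; the helper names `poe` and `mopoe` are our own:

```python
import itertools

def poe(mus, vars_):
    """Product-of-Experts of 1-D Gaussians (closed form):
    precisions add, and the mean is the precision-weighted average."""
    precisions = [1.0 / v for v in vars_]
    joint_var = 1.0 / sum(precisions)
    joint_mu = joint_var * sum(m * p for m, p in zip(mus, precisions))
    return joint_mu, joint_var

def mopoe(mus, vars_):
    """Mixture-of-Products-of-Experts: a uniform mixture over the PoE
    posteriors of all non-empty subsets of the M modalities."""
    M = len(mus)
    subsets = [s for r in range(1, M + 1)
               for s in itertools.combinations(range(M), r)]
    components = [poe([mus[i] for i in s], [vars_[i] for i in s])
                  for s in subsets]
    weight = 1.0 / len(subsets)  # uniform weight over the 2^M - 1 subsets
    return components, weight
```

For M = 2, the mixture contains the two unimodal posteriors (the MoE components of the MMVAE) and their product (the PoE posterior of the MVAE), which is how both models arise as special cases.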

2. RELATED WORK

This work extends and generalizes existing work on self-supervised multimodal generative models that are scalable in the number of modalities: a single model approximates the joint distribution over all modalities (including all marginal and conditional distributions) instead of requiring individual models for every subset of modalities (e.g., Huang et al., 2018; Tian & Engel, 2019; Hsu & Glass, 2018). The latter approach requires a prohibitive number of models, exponential in the number of modalities.

Multimodal VAEs

Among multimodal generative models, multimodal VAEs (Suzuki et al., 2016; Vedantam et al., 2018; Kurle et al., 2019; Tsai et al., 2019; Wu & Goodman, 2018; Shi et al., 2019; 2020; Sutter et al., 2020) have recently been the dominant approach. Multimodal VAEs are not only suitable for learning a joint distribution over multiple modalities, but also enable joint inference given a subset of modalities. However, to approximate the joint posterior for all subsets of modalities efficiently, additional assumptions on the form of the joint posterior are required. To overcome the issue of scalability, previous work relies on either the product (Kurle et al., 2019; Wu & Goodman, 2018) or the mixture (Shi et al., 2019; 2020) of unimodal posteriors. While both approaches have their merits, each also has disadvantages. We unite these approaches in a generalized formulation, a mixture-of-products joint posterior, that encapsulates both and combines their benefits without significant trade-offs.

Multimodal posteriors

The MVAE (Wu & Goodman, 2018) assumes that the joint posterior is a product of unimodal posteriors, i.e., a Product-of-Experts (PoE; Hinton, 2002). The PoE has the benefit of aggregating information across any subset of unimodal posteriors and therefore provides an efficient way of dealing with missing modalities for specific types of unimodal posteriors (e.g., Gaussians). However, to handle missing modalities, the MVAE relies on an additional sub-sampling of unimodal log-likelihoods, which no longer guarantees a valid lower bound on the joint log-likelihood (Wu & Goodman, 2019). Previous work provides empirical results that exhibit the shortcomings of the MVAE, attributing them to a precision miscalibration of experts (Shi et al., 2019) or to the averaging over inseparable individual beliefs (Kurle et al., 2019). Our results suggest that the PoE works well in practice when it is applied to all subsets of modalities, which naturally leads to the proposed Mixture-of-Products-of-Experts (MoPoE) generalization and yields a valid lower bound on the joint log-likelihood.

The MMVAE (Shi et al., 2019), on the other hand, assumes that the joint posterior is a mixture of unimodal posteriors, i.e., a Mixture-of-Experts (MoE). The MMVAE is suitable for the approximation of unimodal posteriors and for translation between pairs of modalities; however, it cannot take advantage of multiple modalities being present, because it only takes the unimodal posteriors into account during training. In contrast, the proposed MoPoE-VAE computes the joint posterior for all subsets of modalities and therefore enables efficient many-to-many translations. Extensions of the MVAE and MMVAE (Kurle et al., 2019; Daunhawer et al., 2020; Shi et al., 2020; Sutter et al., 2020) introduce additional loss terms; these are also applicable to, and can be added on top of, the proposed model.
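To make the aggregation step concrete: for Gaussian experts the PoE has a closed form, and missing modalities can simply be omitted from the product. A minimal one-dimensional sketch, assuming the MVAE's convention of including a standard-normal prior expert (function and variable names are our own):

```python
def poe_with_prior(mus, vars_):
    """Gaussian Product-of-Experts including a standard-normal prior
    expert; experts for missing modalities are simply left out of the
    input lists, so any subset of modalities can be aggregated."""
    # Start from the prior expert N(0, 1): precision 1, mean 0.
    total_precision = 1.0
    weighted_mean = 0.0
    for mu, var in zip(mus, vars_):
        p = 1.0 / var            # each expert contributes its precision
        total_precision += p
        weighted_mean += mu * p
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * weighted_mean
    return joint_mu, joint_var

# With no observed modalities, the joint posterior falls back to the prior.
assert poe_with_prior([], []) == (0.0, 1.0)
```

Note that dropping experts is exact only because the Gaussian product is closed under marginalizing out experts; it is this property that makes the PoE efficient for handling missing modalities.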

