GENERALIZED MULTIMODAL ELBO

Abstract

Multiple data types naturally co-occur when describing real-world phenomena, and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.

1. INTRODUCTION

The availability of multiple data types provides a rich source of information and holds promise for learning representations that generalize well across multiple modalities (Baltrušaitis et al., 2018). Multimodal data naturally grants additional self-supervision in the form of shared information connecting the different data types. Further, the understanding of different modalities and the interplay between data types are non-trivial research questions and long-standing goals in machine learning research. While fully-supervised approaches have been applied successfully (Karpathy & Fei-Fei, 2015; Tsai et al., 2019; Pham et al., 2019; Schoenauer-Sebag et al., 2019), the labeling of multiple data types remains time-consuming and expensive. Therefore, models are needed that learn efficiently from multiple data types in a self-supervised fashion.

Self-supervised, generative models are suitable for learning the joint distribution of multiple data types without supervision. We focus on VAEs (Kingma & Welling, 2014; Rezende et al., 2014), which are able to jointly infer representations and generate new observations. Despite their success on unimodal datasets, there are additional challenges associated with multimodal data (Suzuki et al., 2016; Vedantam et al., 2018). In particular, multimodal generative models need to represent both modality-specific and shared factors and generate semantically coherent samples across modalities. Semantically coherent samples are connected by the information which is shared between data types (Shi et al., 2019). These requirements are not inherent to the objective of unimodal VAEs, the evidence lower bound (ELBO). Hence, adaptations to the original formulation are required to cater to and benefit from multiple data types. Furthermore, to handle missing modalities, there is a scalability issue in terms of the number of modalities: naively, it requires 2^M different encoders to handle all combinations of M data types.
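For concreteness, the unimodal objective referred to above is the standard ELBO; a textbook formulation (the notation here is ours, not necessarily that of the paper) is:

```latex
\log p_\theta(x) \;\geq\;
\mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]
\;-\;
D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big)
\;=:\; \mathcal{L}(x;\theta,\phi)
```

In the multimodal setting with M modalities, a naive extension would train a separate inference network q_φ(z | x_A) for every subset A of observed modalities, which is where the 2^M encoders mentioned above come from.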
Thus, we restrict our search for an improved multimodal ELBO to the class of scalable multimodal VAEs. Within this class, there are two dominant strains of models, based on either the multimodal variational autoencoder (MVAE, Wu & Goodman, 2018) or the Mixture-of-Experts multimodal variational autoencoder (MMVAE, Shi et al., 2019). However, we show that these approaches differ merely in their choice of joint posterior approximation functions. We draw a theoretical connection between these models, showing that they can be subsumed under the class

