MMVAE+: ENHANCING THE GENERATIVE QUALITY OF MULTIMODAL VAES WITHOUT COMPROMISES

Abstract

Multimodal VAEs have recently gained attention as efficient models for weakly-supervised generative learning with multiple modalities. However, all existing variants of multimodal VAEs are affected by a non-trivial trade-off between generative quality and generative coherence. In particular, mixture-based models achieve good coherence only at the expense of sample diversity and a resulting lack of generative quality. We present a novel variant of the mixture-of-experts multimodal variational autoencoder that improves its generative quality while maintaining high semantic coherence. We model shared and modality-specific information in separate latent subspaces, proposing an objective that overcomes certain dependencies on hyperparameters that arise for existing approaches with the same latent space structure. Compared to these existing approaches, we show increased robustness with respect to changes in the design of the latent space, in terms of the capacity allocated to modality-specific subspaces. We show that our model achieves both good generative coherence and high generative quality in challenging experiments, including more complex multimodal datasets than those used in previous works.

1. INTRODUCTION

Multimodal VAEs are a promising class of models for weakly-supervised generative learning. In contrast to initially proposed models in this class (Suzuki et al., 2016; Vedantam et al., 2018), more recent approaches (Wu & Goodman, 2018; Shi et al., 2019; 2021; Sutter et al., 2020; 2021) can efficiently scale to a large number of modalities. These methodological advances enabled applications in multi-omics data integration (Lee & van der Schaar, 2021; Minoura et al., 2021) and tumor segmentation from multiple image modalities (Dorent et al., 2019). Several variants of scalable multimodal VAEs have been proposed (Wu & Goodman, 2018; Shi et al., 2019; 2021; Sutter et al., 2021), and their performance is measured in terms of generative quality and generative coherence. While generative quality measures how well a model approximates the data distribution, generative coherence measures the semantic coherence of generated samples across modalities (e.g., see Shi et al., 2019). High generative quality requires generated samples to be similar to the test data, while high generative coherence requires generated samples to agree in their semantic content across modalities. For instance, in a dataset of image/caption pairs, conditional generation from the text modality should produce images in which the depicted object matches the description in the given caption (e.g., matching color). Ideally, an effective multimodal generative model should fulfill both of these performance criteria. Still, recent work (Daunhawer et al., 2022) shows that the predominant approaches exhibit a non-trivial trade-off between the two criteria, which limits their utility for complex real-world applications. In this work, we focus on mixture-based multimodal VAEs, which achieve high generative coherence only at the expense of generative quality, a fact that undermines their performance in realistic settings. Models in this class show promising results for capturing shared information, i.e., information that is common across modalities about the underlying concept being described, but they exhibit a lack of modelling of private variation, i.e., modality-specific information of single modalities (see Shi et al., 2019; Sutter et al., 2021; Daunhawer et al., 2022). In an attempt to enhance the modelling of private information, recent work (Sutter et al., 2020) has suggested introducing modality-specific latent subspaces in addition to a shared subspace for mixture-based multimodal VAEs. However, we revisit this proposed extension and find that the improvement in generative quality again comes at the expense of reduced generative coherence. Most importantly, we uncover a relevant shortcoming of existing approaches with separate subspaces: we find generative coherence to be overly sensitive to hyperparameters controlling the capacity of private latent subspaces, which in practice calls for expensive model selection procedures to achieve adequate performance. Incorporating the idea of modelling the latent space as a combination of shared and modality-specific encodings, we propose MMVAE+, a variant of the mixture-of-experts multimodal VAE (MMVAE, Shi et al., 2019) with a novel ELBO that significantly improves the diversity of generated samples without sacrificing semantic coherence. Compared to previously proposed models, our method achieves both convincing generative quality and generative coherence (Section 4.1). Notably, its performance in terms of both criteria is robust with respect to hyperparameters controlling latent dimensionality, compared to previous methods with separate shared and private latent subspaces (Section 4.2). Finally, we show that our proposed model can successfully tackle a challenging multimodal dataset of image and text pairs (Section 4.3), which was shown to be too complex for existing multimodal VAEs (Daunhawer et al., 2022).
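The latent-space structure described above, a shared subspace plus one private subspace per modality, with cross-modal generation reusing the shared code and drawing the private code from its prior, can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions: the toy encoder, the subspace sizes `D_SHARED`/`D_PRIVATE`, and all function names are hypothetical stand-ins, not the architecture or objective of any of the cited models.

```python
# Minimal sketch of a partitioned latent space: each unimodal encoder
# infers a shared code w together with a modality-specific code u_m.
# For cross-modal generation, the shared code comes from the source
# modality's encoder while the target's private code is sampled from
# its prior N(0, I). All components here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
D_SHARED, D_PRIVATE = 8, 4  # illustrative subspace dimensionalities


def encode(x, modality):
    """Toy unimodal encoder: Gaussian parameters for z = [w; u_m]."""
    h = np.tanh(x.mean() + modality)  # stand-in feature extractor
    mu = np.full(D_SHARED + D_PRIVATE, h)
    logvar = np.zeros(D_SHARED + D_PRIVATE)
    return mu, logvar


def reparameterize(mu, logvar):
    """Standard reparameterization trick: z = mu + sigma * eps."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)


def cross_generate_latent(x_src, src_mod):
    """Latent code for conditional generation from a source modality:
    shared part inferred from the source, private part from the prior.
    (In a full model, this code would be fed to the target decoder.)"""
    mu, logvar = encode(x_src, src_mod)
    z = reparameterize(mu, logvar)
    w = z[:D_SHARED]                        # shared (semantic) part
    u_tgt = rng.standard_normal(D_PRIVATE)  # target's private part, from prior
    return np.concatenate([w, u_tgt])


z = cross_generate_latent(np.ones(16), src_mod=0)
assert z.shape == (D_SHARED + D_PRIVATE,)
```

The sketch only illustrates how the subspaces are composed at generation time; the coherence/quality trade-off discussed above concerns how such codes are regularized during training.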

2. RELATED WORK

Multimodal generative models are promising approaches for learning from co-occurring data sources without explicit supervision, by exploiting the pairing between modalities as a form of weak supervision. Previous work has achieved outstanding results for multimodal generative tasks such as image-to-image translation (Zhu et al., 2017; Choi et al., 2018) or text-to-image synthesis (Reed et al., 2016). While these models are designed for specialized tasks limited to a fixed number of modalities, many real-world settings require general methods that can leverage large datasets of many heterogeneous modalities. A prominent example is the field of healthcare, where personalised medicine requires learning from large-scale multimodal datasets comprising medical images, genomic tests, and clinical measurements. Multimodal VAEs (Suzuki et al., 2016; Wu & Goodman, 2018; Shi et al., 2019; 2021; Sutter et al., 2020; 2021) are a promising model class for such applications, showing encouraging results towards efficient learning from datasets with many heterogeneous modalities. Multimodal VAEs extend the popular variational autoencoder (VAE, Kingma & Welling, 2014) to multiple data modalities. Initially proposed approaches (Suzuki et al., 2016; Vedantam et al., 2018) lack scalability in the number of modalities, since they require an additional encoder network per possible subset of modalities to enable inference from that subset. Other proposed methods require explicit supervision (Tsai et al., 2019), which demands expensive prior data labelling. In contrast, recent methodological advances enable learning from a large number of modalities efficiently and without explicit supervision, by using a joint encoder that decomposes in terms of unimodal encoders. Previous work proposed three different formulations for the joint encoder: the product-of-experts (MVAE, Wu & Goodman 2018), the mixture-of-experts (MMVAE, Shi et al. 2019), and the mixture-of-products-of-experts (MoPoE-VAE, Sutter et al. 2021). Recent work (Daunhawer et al., 2022) shows that existing multimodal VAEs exhibit a trade-off between two desired performance criteria for multimodal generation, namely generative quality and generative coherence. While generative quality assesses the generative performance of the model for each modality, generative coherence (Shi et al., 2019) examines the learning of shared information by estimating the consistency of semantic content between modalities in both conditional and unconditional generation. Daunhawer et al. (2022) show that existing product-based models exhibit low generative coherence, while mixture-based models exhibit a lack of sample diversity, which negatively affects generative quality (cf. Wolff et al., 2021). Based on the three aforementioned formulations of multimodal VAEs, subsequent work introduced additional regularization terms (Sutter et al., 2020; Hwang et al., 2021) and hierarchical latent spaces (Sutter & Vogt, 2021; Vasco et al., 2022; Wolff et al., 2022). Previous work has also explored the possibility of assuming separate modality-specific latent subspaces in addition to a shared subspace (Sutter et al., 2020; Lee & Pavlovic, 2021; Wang et al., 2016), or leveraging mutual supervision (Joy et al., 2022). Yet, it is not clear whether these extensions overcome the fundamental trade-off be-

