MMVAE+: ENHANCING THE GENERATIVE QUALITY OF MULTIMODAL VAES WITHOUT COMPROMISES

Abstract

Multimodal VAEs have recently gained attention as efficient models for weakly-supervised generative learning with multiple modalities. However, all existing variants of multimodal VAEs are affected by a non-trivial trade-off between generative quality and generative coherence. In particular, mixture-based models achieve good coherence only at the expense of sample diversity and a resulting lack of generative quality. We present a novel variant of the mixture-of-experts multimodal variational autoencoder that improves its generative quality, while maintaining high semantic coherence. We model shared and modality-specific information in separate latent subspaces, proposing an objective that overcomes certain dependencies on hyperparameters that arise for existing approaches with the same latent space structure. Compared to these existing approaches, we show increased robustness with respect to changes in the design of the latent space, in terms of the capacity allocated to modality-specific subspaces. We show that our model achieves both good generative coherence and high generative quality in challenging experiments, including more complex multimodal datasets than those used in previous works.

1. INTRODUCTION

Multimodal VAEs are a promising class of models for weakly-supervised generative learning. Different from initially proposed models in this class (Suzuki et al., 2016; Vedantam et al., 2018), more recent approaches (Wu & Goodman, 2018; Shi et al., 2019; 2021; Sutter et al., 2020; 2021) can efficiently scale to a large number of modalities. These methodological advances enabled applications in multi-omics data integration (Lee & van der Schaar, 2021; Minoura et al., 2021) and tumor segmentation from multiple image modalities (Dorent et al., 2019). Several variants of scalable multimodal VAEs have been proposed (Wu & Goodman, 2018; Shi et al., 2019; 2021; Sutter et al., 2021), and their performance is measured in terms of generative quality and generative coherence. While generative quality measures how well a model approximates the data distribution, generative coherence measures the semantic coherence of generated samples across modalities (e.g., see Shi et al., 2019). High generative quality requires generated samples to be similar to the test data, while high generative coherence requires generated samples to agree in their semantic content across modalities. For instance, in a dataset of image/caption pairs, conditional generation from the text modality should produce images where the depicted object matches the description in the given caption (e.g., matching color). Ideally, an effective multimodal generative model should fulfill both of these performance aspects. Still, recent work (Daunhawer et al., 2022) shows that the predominant approaches exhibit a non-trivial trade-off between the two criteria, which limits their utility for complex real-world applications. In this work, we focus on mixture-based multimodal VAEs, which achieve high generative coherence only at the expense of generative quality, a fact that undermines their performance in realistic settings. Models in this class show promising results for capturing shared information, i.e., information about the underlying concept that is common across modalities, while exhibiting a lack of modelling of private variation, i.e., modality-specific information of single modalities (see Shi et al., 2019; Sutter et al., 2021; Daunhawer et al., 2022). In an attempt to enhance modelling

