MITIGATING THE LIMITATIONS OF MULTIMODAL VAES WITH COORDINATION-BASED APPROACH

Anonymous authors
Paper under double-blind review

Abstract

One of the key challenges in multimodal variational autoencoders (VAEs) is inferring a joint representation from arbitrary subsets of modalities. The state-of-the-art approach to achieving this is to sub-sample the modality subsets and learn to generate all modalities from them. However, this sub-sampling in the mixture-based approach has been shown to degrade other important features of multimodal VAEs, such as generation quality; moreover, this degradation is theoretically unavoidable. In this study, we focus on another approach to learning the joint representation: bringing unimodal inferences closer to the joint inference from all modalities, which does not suffer from the above limitation. Although models that fit this approach have been proposed, they were derived from different backgrounds; therefore, their relationships and relative merits were unclear. To take a unified view, we first categorize them as coordination-based multimodal VAEs and show that they can all be derived from the same multimodal evidence lower bound (ELBO) and that the differences in their performance are related to how tightly they lower-bound it. Next, we point out that these existing coordination-based models perform poorly on cross-modal generation (cross-coherence) because they do not learn to reconstruct modalities from unimodal inferences. We therefore propose a novel coordination-based model that incorporates these unimodal reconstructions, avoiding the limitations of both mixture- and coordination-based models. Experiments on diverse and challenging datasets show that the proposed model mitigates the limitations of multimodal VAEs and performs well in both cross-coherence and generation quality.

1. INTRODUCTION

Deep generative models have recently shown high performance on multimodal data, including images and captions. In particular, multimodal learning with variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) has attracted much attention as a means of obtaining a joint representation from multiple modalities (Wu & Goodman, 2018; Shi et al., 2019; Sutter et al., 2021). Such representations can be used for predicting common concepts from these modalities or for generating other modalities. An essential challenge of multimodal VAEs is inferring a joint representation from subsets of modalities. State-of-the-art models attempt to accomplish this by training to generate all modalities from a joint representation inferred from missing (or sub-sampled) modalities. These are called mixture-based multimodal VAEs (Daunhawer et al., 2021a); examples include MMVAE (Shi et al., 2019) and MoPoE-VAE (Sutter et al., 2021) foot_0. However, it has been pointed out that the quality of the modalities they generate is lower than that of unimodal VAEs, and furthermore, cross-generation between modalities, or cross-coherence, can be degraded. This is an inherent limitation of mixture-based VAEs due to modality sub-sampling during training, and theoretical and empirical evidence shows that this limitation cannot be avoided (Daunhawer et al., 2021a). To alleviate this issue, we focus on another approach in multimodal VAEs that brings the representation inferred from each modality closer to that inferred from all modalities. This approach avoids the limitation of degraded generation quality because its objective does not include generation from sub-sampled modalities. Models that fit this approach have been proposed, but their objectives are derived from different backgrounds.
For example, MVTCAE (Hwang et al., 2021), which can be regarded as one of them, derives its objective from total correlation; therefore, its relationship to the standard objective of multimodal VAEs, the multimodal evidence lower bound (ELBO), is not clear. Moreover, another model of this approach, MMJSD (Sutter et al., 2020), has been shown experimentally to perform worse than MVTCAE (Hwang et al., 2021), but the theoretical reason for this is unclear. We first categorize them as coordination-based multimodal VAEs foot_2 and show that these models can be viewed in a unified way, i.e., all can be derived from the multimodal ELBO, and that the differences between models correspond to how tightly they lower-bound it. We also prove that MVTCAE encourages the unimodal posteriors to approach the average of the joint posterior, which might be one of the reasons it learns good inference. Next, we point out that the existing coordination-based VAEs still have a shortcoming in cross-coherence, since no generation from unimodal inferences is included in their training. We then propose a novel objective that introduces unimodal reconstruction terms. Since this objective does not include cross-generation, it avoids the generation-degradation issue of the mixture-based models while mitigating the issue of coordination-based models. Note that our model alleviates these issues by changing only the objective, unlike other methods that improve multimodal VAEs by changing the architecture, such as by introducing additional latent variables (Tsai et al., 2018; Hsu & Glass, 2018; Sutter et al., 2020; Palumbo et al., 2022). We conducted experiments on five diverse multimodal datasets, including challenging ones, e.g., datasets that cannot be adequately trained with existing models or that require additional architecture for adequate training. We confirmed that the proposed method outperforms existing mixture- and coordination-based models in terms of cross-coherence and generation quality.
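To illustrate the coordination idea concretely (this is a generic sketch, not the exact objective of MVTCAE, MMJSD, or the proposed model), the term that pulls a unimodal posterior $q(z|x_m)$ toward the joint posterior $q(z|X)$ can be written as a closed-form KL divergence when both are diagonal Gaussians:

```python
import numpy as np

def gaussian_kl(mu0, logvar0, mu1, logvar1):
    """Closed-form D_KL( N(mu0, diag(e^logvar0)) || N(mu1, diag(e^logvar1)) ).

    In a coordination-style objective, one distribution is a unimodal
    posterior q(z|x_m) and the other is the joint posterior q(z|X);
    minimizing this term pulls the two inferences together.
    """
    var0, var1 = np.exp(logvar0), np.exp(logvar1)
    return 0.5 * np.sum(logvar1 - logvar0 + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)
```

Which distribution appears in which argument slot differs between models (the KL is asymmetric), and that choice is part of what distinguishes the coordination-based objectives compared in this paper.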

2. MULTIMODAL VAES

Suppose that we have a set of multimodal examples, where each example is a set of $M$ modalities $X = \{x_m\}_{m=1}^M$. We assume that these examples are drawn from a data distribution $p_d(X)$ and that each example $X$ has a corresponding common latent concept $z$, i.e., a joint representation. Given a training set $\{X^{(i)}\}_{i=1}^N$, our goals are to infer a joint representation from a subset of modalities $X_S \subseteq X$ foot_3 and to generate another subset $X_{S'} \subseteq X$ via that representation. In other words, we aim to obtain an inference $p(z|X)$ and a joint distribution $p(X, z) = \prod_{m: x_m \in X} p(x_m|z)\, p(z)$. To achieve this, it is a natural choice to use VAEs (Kingma & Welling, 2013), deep generative models that, in addition to generation, can learn inference to acquire representations from data. The generative model $p_\theta(X|z) = \prod_{m: x_m \in X} p_\theta(x_m|z)$ is parameterized by deep neural networks, and the prior over the latent variable is set to a standard Gaussian $p(z) = \mathcal{N}(0, I)$. The objective of VAEs is to maximize the expected log-likelihood over the data distribution, $\mathbb{E}_{p_d(X)}[\log p_\theta(X)]$. Since it is intractable to optimize this likelihood directly, we introduce an approximate posterior $q_\phi(z|X)$ and instead optimize the following evidence lower bound (ELBO) on the expected log-likelihood:

$$\mathcal{L}(\theta, \phi) \equiv \mathbb{E}_{p_d(X)}\big[\mathbb{E}_{q_\phi(z|X)}[\log p_\theta(X|z)] - D_{KL}(q_\phi(z|X)\,\|\,p(z))\big] \le \mathbb{E}_{p_d(X)}[\log p_\theta(X)]. \quad (1)$$

In this paper, the group of models that maximize this multimodal ELBO are collectively called multimodal VAEs. We aim to optimize Eq. 1 to obtain a joint representation and generate modalities from it given a modality subset $X_S$. However, Eq. 1 only includes the inference given all modalities; therefore, it does not learn to infer representations from subsets.
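To make the ELBO in Eq. 1 concrete, the following is a minimal sketch of a single-example Monte Carlo estimate, assuming a diagonal-Gaussian posterior and the reparameterization trick; `encode` and `decode_logprob` are hypothetical stand-ins for the networks, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, logvar):
    # Closed-form D_KL( N(mu, diag(e^logvar)) || N(0, I) ).
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(X, encode, decode_logprob, n_samples=16):
    # Monte Carlo estimate of
    #   E_{q(z|X)}[ sum_m log p_theta(x_m | z) ] - D_KL(q(z|X) || p(z)).
    mu, logvar = encode(X)             # joint posterior q(z|X)
    std = np.exp(0.5 * logvar)
    recon = 0.0
    for _ in range(n_samples):
        z = mu + std * rng.standard_normal(mu.shape)   # reparameterization trick
        recon += sum(decode_logprob(x_m, z) for x_m in X)
    return recon / n_samples - kl_to_standard_normal(mu, logvar)
```

Note that `encode` sees the full modality set $X$ here, which is exactly the limitation noted above: nothing in this objective trains inference from a strict subset $X_S$.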

2.1. MIXTURE-BASED MULTIMODAL VAES

An inference from subsets is commonly expressed as a combination of the unimodal posteriors of each modality: the product of experts (PoE) (Hinton, 2002; Wu & Goodman, 2018), $q^{PoE}_\phi(z|X_S) \propto p(z) \prod_{m: x_m \in X_S} q_{\phi_m}(z|x_m)$ foot_4, and the mixture of experts (MoE) (Shi et al., 2019), $q^{MoE}_\phi(z|X_S) \equiv \frac{1}{|S|} \sum_{m: x_m \in X_S} q_{\phi_m}(z|x_m)$.
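Both combination rules have simple forms when the unimodal posteriors are diagonal Gaussians. The sketch below illustrates them under that assumption (the function names are ours, not from the paper): the PoE posterior is again Gaussian with summed precisions, while the MoE posterior is sampled by ancestral sampling from a uniformly chosen expert.

```python
import numpy as np

def poe_gaussian(mus, logvars):
    # Product of Gaussian experts including the standard-normal prior p(z):
    # precisions (1/sigma^2) add, and the prior contributes precision 1, mean 0.
    mus, precisions = np.asarray(mus), np.exp(-np.asarray(logvars))
    var = 1.0 / (1.0 + precisions.sum(axis=0))
    mu = var * (precisions * mus).sum(axis=0)
    return mu, var

def moe_sample(mus, logvars, rng):
    # Mixture of Gaussian experts with uniform weights 1/|S|:
    # pick one expert uniformly, then sample from that expert.
    m = rng.integers(len(mus))
    std = np.exp(0.5 * np.asarray(logvars[m]))
    return np.asarray(mus[m]) + std * rng.standard_normal(np.shape(mus[m]))
```

For example, two unit-variance experts with means $+1$ and $-1$ yield a PoE posterior with mean $0$ and variance $1/3$ (three unit precisions: two experts plus the prior), whereas the MoE remains bimodal around $\pm 1$.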



Footnotes:

foot_0: Daunhawer et al. (2021a) also include a special case of MVAE (Wu & Goodman, 2018) in the mixture-based models. However, we focus on the mixture-based models' cross-generation of modalities from subsets as their shortcoming, so we exclude MVAE, which does not perform cross-generation.

foot_2: The term coordination is taken from coordinated representations, a category of multimodal learning (Baltrušaitis et al., 2018) in which representations inferred from different modalities are learned to be close together.

foot_3: We denote a subset of the multimodal set $X$ by $X_S$, where $S$ represents a subset of $\{1, \dots, M\}$.

foot_4: $q_{\phi_m}$ is the unimodal inference of modality $x_m$.

