MULTIMODAL VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED LEARNING: IN DEFENSE OF PRODUCT-OF-EXPERTS

Anonymous

Abstract

Multimodal generative models should learn a meaningful latent representation that enables coherent joint generation of all modalities (e.g., images and text). Many applications also require the ability to accurately sample modalities conditioned on observations of a subset of the modalities. Often not all modalities are observed for all training data points, so semi-supervised learning should be possible. In this study, we evaluate a family of product-of-experts (PoE) based variational autoencoders that have these desired properties. We include a novel PoE-based architecture and training procedure. An empirical evaluation shows that the PoE-based models can outperform an additive mixture-of-experts (MoE) approach. Our experiments support the intuition that PoE models are better suited for a conjunctive combination of modalities, while MoEs are better suited for a disjunctive fusion.

1. INTRODUCTION

Multimodal generative modelling is important because information about real-world objects typically comes in different representations, or modalities. The information provided by each modality may be erroneous and/or incomplete, and a complete reconstruction of the full information can often only be achieved by combining several modalities. For example, in image- and video-guided translation (Caglayan et al., 2019), additional visual context can potentially resolve ambiguities (e.g., noun genders) when translating written text. In many applications, modalities may be missing for a subset of the observed samples during training and deployment. Often the description of an object in one modality is easy to obtain, while annotating it with another modality is slow and expensive. Given two modalities, we call samples paired when both modalities are present, and unpaired if one is missing. The simplest way to deal with paired and unpaired training examples is to discard the unpaired observations for learning. The smaller the share of paired samples, the more important the ability to additionally learn from the unpaired data becomes, referred to as semi-supervised learning in this context (following the terminology of Wu & Goodman, 2019; typically one would associate semi-supervised learning with learning from labelled and unlabelled data to solve a classification or regression task). Our goal is to provide a model that can leverage the information contained in unpaired samples and to investigate the capabilities of the model in situations of low levels of supervision, that is, when only a few paired samples are available. While a modality can be as low dimensional as a label, which can be handled by a variety of discriminative models (van Engelen & Hoos, 2020), we are interested in high dimensional modalities, for example an image and a text caption.
Learning a representation of multimodal data that allows generating high-quality samples requires the following: 1) deriving a meaningful representation in a joint latent space for each high dimensional modality and 2) bridging the representations of different modalities such that the relations between them are preserved. The latter means that we do not want the modalities to be represented orthogonally in the latent space; ideally, the latent space should encode the object's properties independently of the input modality. Variational autoencoders (Kingma & Welling, 2014) using a product-of-experts (PoE, Hinton, 2002; Welling, 2007) approach for combining input modalities are a promising direction for multimodal generative modelling with the desired properties, in particular the VAEVAE model developed by Wu & Goodman (2018) and a novel model termed SVAE, which we present in this study. Both models can handle multiple high dimensional modalities, which may not all be observed at training time. It has been argued that a PoE approach is not well suited for multimodal generative modelling using variational autoencoders (VAEs) in comparison to an additive mixture-of-experts (MoE). It has been shown empirically that the PoE-based MVAE (Wu & Goodman, 2018) fails to properly model two high-dimensional modalities in contrast to an (additive) MoE approach referred to as MMVAE, leading to the conclusion that "PoE factorisation does not appear to be practically suited for multi-modal learning" (Shi et al., 2019). This study sets out to test this conjecture for state-of-the-art multimodal VAEs. The next section summarizes related work. Section 3 introduces SVAE as an alternative PoE-based VAE approach derived from axiomatic principles. Then we present our experimental evaluation of multimodal VAEs before we conclude.
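The conjunctive-versus-disjunctive intuition has a concrete form for Gaussian experts: a product of Gaussians is again Gaussian, with precisions adding and the mean given by a precision-weighted average, whereas an MoE draws from one expert at a time. The following NumPy sketch illustrates both fusion rules (function names are ours, not from any of the cited implementations):

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Product-of-experts fusion of diagonal Gaussians:
    precisions add; the mean is a precision-weighted average."""
    precisions = [np.exp(-lv) for lv in logvars]
    total_prec = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total_prec
    return mu, np.log(1.0 / total_prec)

def moe_sample(mus, logvars, rng):
    """Mixture-of-experts fusion: sample from one expert,
    chosen uniformly at random."""
    k = rng.integers(len(mus))
    std = np.exp(0.5 * logvars[k])
    return mus[k] + std * rng.standard_normal(mus[k].shape)

# Two unimodal experts q_1(z|x_1), q_2(z|x_2) over a 2-d latent space
mu1, lv1 = np.array([1.0, 0.0]), np.array([0.0, 0.0])            # variance 1
mu2, lv2 = np.array([0.0, 1.0]), np.array([np.log(0.25)] * 2)    # variance 0.25
mu, lv = poe_fuse([mu1, mu2], [lv1, lv2])  # sharper expert dominates
```

Note how the PoE posterior is narrower than either expert (a conjunction of the evidence), while an MoE sample reflects only one modality at a time (a disjunction).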

2. BACKGROUND AND RELATED WORK

We consider multimodal generative modelling. We mainly restrict our considerations to two modalities x_1 ∈ X_1, x_2 ∈ X_2, where one modality may be missing at a time. Extensions to more modalities are discussed in the Experiments section and Appendix D. To address the problem of generative cross-modal modelling, one modality x_1 can be generated from another modality x_2 by simply using independently trained generative models (x_1 → x_2 and x_2 → x_1) or a composed but non-interchangeable representation (Wang et al., 2016; Sohn et al., 2015). However, the ultimate goal of multimodal representation learning is to find a meaningful joint latent code distribution bridging the two individual embeddings learned from x_1 and x_2 alone. This can be done by a two-step procedure that models the individual representations first and then applies an additional learning step to link them (Tian & Engel, 2019; Silberer & Lapata, 2014; Ngiam et al., 2011). In contrast, we focus on approaches that learn individual and joint representations simultaneously. Furthermore, our model should be able to learn in a semi-supervised setting. Kingma et al. (2014) introduced two models suitable for the case when one modality is high dimensional (e.g., an image) and the other is low dimensional (e.g., a label), while our main interest is modalities of high complexity. We consider models based on variational autoencoders (VAEs, Kingma & Welling, 2014; Rezende et al., 2014). Standard VAEs learn a latent representation z ∈ Z for a set of observed variables x ∈ X by modelling a joint distribution p(x, z) = p(z)p(x|z). In the original VAE, the intractable posterior is approximated by q(z|x); both q(z|x) and the conditional distribution p(x|z) are modelled by neural networks trained by maximising the evidence lower bound (ELBO)

$$\mathcal{L} = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - D_{\mathrm{KL}}\left(q(z|x) \,\|\, \mathcal{N}(0, I)\right)$$

with respect to the parameters of the networks modelling q(z|x) and p(x|z). Here D_KL(·‖·) denotes the Kullback-Leibler divergence.
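For a diagonal-Gaussian encoder and standard-normal prior, the KL term of the ELBO has a well-known closed form, D_KL = ½ Σ(σ² + μ² − log σ² − 1). A minimal NumPy sketch of a single-sample ELBO estimate (function names are ours and purely illustrative):

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - logvar - 1.0)

def elbo(log_px_given_z, mu, logvar):
    """Single-sample ELBO estimate: reconstruction log-likelihood
    log p(x|z) for z ~ q(z|x), minus the KL regulariser."""
    return log_px_given_z - gaussian_kl_to_standard_normal(mu, logvar)

# When q(z|x) already equals the prior, the KL term vanishes
kl_zero = gaussian_kl_to_standard_normal(np.zeros(4), np.zeros(4))
# Shifting the mean to 1 in each of 4 dimensions gives KL = 0.5 * 4 = 2
kl_shift = gaussian_kl_to_standard_normal(np.ones(4), np.zeros(4))
```

In practice the reconstruction term log p(x|z) is computed from the decoder network's output distribution; here it is passed in as a scalar to keep the sketch self-contained.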
Bi-modal VAEs that can handle a missing modality extend this approach by modelling q(z|x_1, x_2) as well as q_1(z|x_1) and q_2(z|x_2), which replace the single q(z|x). Multimodal VAEs may differ in 1) the way they approximate q(z|x_1, x_2), q_1(z|x_1) and q_2(z|x_2) by neural networks and/or 2) the structure of the loss function, see Figure 1. Typically, there are no conceptual differences in the decoding, and we model the decoding distributions in the same way for all methods considered in this study.

Suzuki et al. (2017) introduced a model termed JMVAE (Joint Multimodal VAE), which belongs to the class of approaches that can only learn from paired training samples (what we refer to as the (fully) supervised setting). It approximates q(z|x_1, x_2), q_1(z|x_1) and q_2(z|x_2) with three corresponding neural networks and optimizes an ELBO-type loss of the form

$$\mathcal{L} = \mathbb{E}_{q(z|x_1, x_2)}\left[\log p_1(x_1|z) + \log p_2(x_2|z)\right] - D_{\mathrm{KL}}\left(q(z|x_1, x_2) \,\|\, \mathcal{N}(0, I)\right) - D_{\mathrm{KL}}\left(q(z|x_1, x_2) \,\|\, q_1(z|x_1)\right) - D_{\mathrm{KL}}\left(q(z|x_1, x_2) \,\|\, q_2(z|x_2)\right). \quad (2)$$

The last two terms require evaluating the output of the joint encoder during learning, which in turn requires paired samples. The MVAE (Multimodal VAE) model (Wu & Goodman, 2018) is the first multimodal VAE-based model allowing for missing modalities that does not require any additional network structures for

