MULTIMODAL VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED LEARNING: IN DEFENSE OF PRODUCT-OF-EXPERTS

Anonymous

Abstract

Multimodal generative models should learn a meaningful latent representation that enables coherent joint generation of all modalities (e.g., images and text). Many applications also require the ability to accurately sample modalities conditioned on observations of a subset of the modalities. Since not all modalities may be observed for every training data point, the models should also support semi-supervised learning. In this study, we evaluate a family of product-of-experts (PoE) based variational autoencoders that have these desired properties, including a novel PoE-based architecture and training procedure. An empirical evaluation shows that the PoE-based models can outperform an additive mixture-of-experts (MoE) approach. Our experiments support the intuition that PoE models are better suited for a conjunctive combination of modalities, while MoEs are better suited for a disjunctive fusion.

1. INTRODUCTION

Multimodal generative modelling is important because information about real-world objects typically comes in different representations, or modalities. The information provided by each modality may be erroneous and/or incomplete, and a complete reconstruction of the full information can often only be achieved by combining several modalities. For example, in image- and video-guided translation (Caglayan et al., 2019), additional visual context can potentially resolve ambiguities (e.g., noun genders) when translating written text. In many applications, modalities may be missing for a subset of the observed samples during training and deployment. Often the description of an object in one modality is easy to obtain, while annotating it with another modality is slow and expensive. Given two modalities, we call samples paired when both modalities are present, and unpaired if one is missing. The simplest way to deal with paired and unpaired training examples is to discard the unpaired observations for learning. The smaller the share of paired samples, the more important the ability to additionally learn from the unpaired data becomes; we refer to this as semi-supervised learning (following the terminology of Wu & Goodman, 2019; typically, one would associate semi-supervised learning with learning from labelled and unlabelled data to solve a classification or regression task). Our goal is to provide a model that can leverage the information contained in unpaired samples and to investigate the capabilities of the model in situations of low levels of supervision, that is, when only a few paired samples are available. While a modality can be as low-dimensional as a label, which can be handled by a variety of discriminative models (van Engelen & Hoos, 2020), we are interested in high-dimensional modalities, for example an image and a text caption.
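To make the idea of combining modalities concrete: in PoE-based multimodal VAEs (the family evaluated in this paper), per-modality Gaussian posteriors are typically fused via a product of Gaussian experts, which has a closed form as a precision-weighted average. The sketch below illustrates only this standard fusion rule; the function name and toy numbers are illustrative, not taken from the paper.

```python
import numpy as np

def poe_gaussian(mus, logvars):
    """Fuse per-modality Gaussian experts N(mu_i, sigma_i^2) into one Gaussian.

    Standard product-of-Gaussians closed form: precisions add, and the fused
    mean is the precision-weighted average of the expert means.
    """
    precisions = [np.exp(-lv) for lv in logvars]  # 1 / sigma_i^2
    total_prec = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total_prec
    var = 1.0 / total_prec
    return mu, var

# Two experts: one confident (variance 0.1) and one uncertain (variance 1.0).
mu, var = poe_gaussian(
    mus=[np.array([0.0]), np.array([2.0])],
    logvars=[np.array([np.log(0.1)]), np.array([np.log(1.0)])],
)
print(mu, var)
```

Note the conjunctive behaviour this entails: the fused mean is pulled toward the more confident expert, and the fused variance is smaller than that of any individual expert, since every expert must "agree" for the product to place mass in a region.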
Learning a representation of multimodal data that allows high-quality samples to be generated requires the following: 1) deriving a meaningful representation in a joint latent space for each high-dimensional modality and 2) bridging the representations of different modalities in a way that preserves the relations between them. The latter means that we do not want the modalities to be represented orthogonally in the latent space; ideally, the latent space should encode the object's properties independently of the input modality. Variational autoencoders (Kingma & Welling, 2014) using a product-of-experts (PoE; Hinton, 2002; Welling, 2007) approach for combining input modalities are a promising class of models for multimodal generative modelling with the desired properties, in partic-

