RELATING BY CONTRASTING: A DATA-EFFICIENT FRAMEWORK FOR MULTIMODAL DGMs

Abstract

Multimodal learning for generative models often refers to the learning of abstract concepts from the commonality of information in multiple modalities, such as vision and language. While it has proven effective for learning generalisable representations, training such models often requires a large amount of "related" multimodal data that shares commonality, which can be expensive to come by. To mitigate this, we develop a novel contrastive framework for generative-model learning, allowing us to train the model not just on the commonality between modalities, but on the distinction between "related" and "unrelated" multimodal data. We show in experiments that our method enables data-efficient multimodal learning on challenging datasets for various multimodal variational autoencoder (VAE) models. We also show that under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit the plentiful unlabelled, unpaired multimodal data.

1. INTRODUCTION

To comprehensively describe concepts in the real world, humans collect multiple perceptual signals of the same object, such as image, sound, text and video. We refer to each of these media as a modality, and a collection of different media featuring the same underlying concept is characterised as multimodal data. Learning from multiple modalities has been shown to yield more generalisable representations (Zhang et al., 2020; Guo et al., 2019; Yildirim, 2014), as different modalities are often complementary in content while overlapping in common abstract concepts. Despite this motivation, it is worth noting that the multimodal framework is not exactly data-efficient: constructing a suitable dataset requires a lot of "annotated" unimodal data, as we need to ensure that each multimodal pair is related in a meaningful way. The situation is worse when we consider more complicated multimodal settings such as language-vision, where one-to-one or one-to-many correspondence between instances of the two datasets is required, due to the difficulty of categorising data such that commonality amongst samples is preserved within categories. See Figure 1 for an example from the CUB dataset (Welinder et al., a); although the same species of bird is featured in both image-caption pairs, their content differs considerably. It would be unreasonable to apply the caption from one to describe the bird depicted in the other, necessitating one-to-one correspondence between images and captions. However, the scope of multimodal learning has so far been limited to leveraging the commonality between "related" pairs, while largely ignoring "unrelated" samples potentially available in any multimodal dataset, constructed through random pairing between modalities (Figure 3). We posit that if a distinction can be established between the "related" and "unrelated" observations within a multimodal dataset, we could greatly reduce the amount of related data required for effective learning.
Figure 2 formalises this proposal. Multimodal generative models in previous work (Figure 2a) typically assume a single latent variable z that always generates a related multimodal pair (x, y). In this work (Figure 2b), we introduce an additional Bernoulli random variable r that dictates the "relatedness" between x and y through z, where x and y are related when r = 1 and unrelated when r = 0. While r can encode different dependencies, here we make the simplifying assumption that the pointwise mutual information (PMI) between x and y should be high when r = 1, and low when r = 0. Intuitively, this can be achieved by adopting a max-margin metric. We therefore propose to train the generative models with a novel contrastive-style loss (Hadsell et al., 2006; Weinberger et al., 2005), and demonstrate the effectiveness of our proposed method from a few different perspectives. Improved multimodal learning: showing improved multimodal learning for various state-of-the-art multimodal generative models on two challenging multimodal datasets, evaluated on four different metrics (Shi et al., 2019) summarised in § 4.2. Data efficiency: learning generative models under the contrastive framework requires only 20% of the data needed by baseline methods to achieve similar performance, holding true across different models, datasets and metrics. Label propagation: the contrastive loss encourages a larger discrepancy between related and unrelated data, making it possible to directly identify related samples using the PMI between observations; we show that these data pairs can be used to further improve the learning of the generative model.
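The max-margin idea above can be sketched as a hinge loss on model-based PMI estimates; the following is a minimal illustrative sketch, not the paper's exact training objective, and the function name and margin value are assumptions for illustration:

```python
import numpy as np

def contrastive_pmi_loss(pmi_related, pmi_unrelated, margin=1.0):
    """Max-margin contrastive loss on pointwise mutual information (PMI).

    Encourages the PMI of related pairs (r = 1) to exceed that of
    unrelated pairs (r = 0) by at least `margin`. Inputs are arrays of
    scalar PMI estimates, e.g. log p(x, y) - log p(x) - log p(y) under
    the generative model.
    """
    # all related/unrelated gaps at once, via broadcasting
    gap = pmi_related[:, None] - pmi_unrelated[None, :]
    # hinge: penalise any gap that falls below the margin
    return np.maximum(0.0, margin - gap).mean()
```

In practice, the PMI terms would come from estimates of the joint and marginal likelihoods (e.g. ELBO-style bounds), and the margin is a hyperparameter.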

2. RELATED WORK

Contrastive loss Our work aims to encourage data-efficient multimodal generative-model learning using a popular representation-learning metric: the contrastive loss (Hadsell et al., 2006; Weinberger et al., 2005). There have been many successful applications of contrastive losses to a range of different tasks, such as contrastive predictive coding for time-series data (van den Oord et al., 2018), image classification (Hénaff, 2020) and noise contrastive estimation for vector embeddings of words (Mnih and Kavukcuoglu, 2013), as well as a range of frameworks such as DIM (Hjelm et al., 2019), MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) for more general visual-representation learning. The features learned with contrastive losses also perform well when applied to different downstream tasks; their ability to generalise is further analysed by Wang and Isola (2020) using quantifiable metrics for alignment and uniformity. Contrastive methods have also been employed in generative-model settings, but typically on generative adversarial networks (GANs), to either preserve or identify factors of variation in their inputs. For instance, SiGAN (Hsu et al., 2019) uses a contrastive loss to preserve identity in face-image hallucination from low-resolution photos, while Yildirim et al. (2018) use a contrastive loss to disentangle the factors of variation in the latent code of GANs. We here employ a contrastive loss in the distinct setting of multimodal generative-model learning, which, as we will show with our experiments and analyses, promotes better, more robust representation learning. Multimodal VAEs We also demonstrate that our approach is applicable across different approaches to learning multimodal generative models. To do so, we first summarise past work on multimodal VAEs into two categories based on the modelling choice for the approximate posterior q_Φ(z|x, y): a) Explicit joint models: q_Φ modelled as a single joint encoder q_Φ(z|x, y).
Example work in this area includes JMVAE (Suzuki et al.), triple ELBO (Vedantam et al., 2018) and MFM (Tsai et al., 2019). Since the joint encoder requires the multimodal pair (x, y) as input, these approaches typically require additional modelling components and/or inference steps to deal with missing modalities at test time; in fact, all three approaches propose to train unimodal VAEs on top of the joint model to handle data from each modality independently. b) Factorised joint models: q_Φ modelled via factored encoders, q_Φ(z|x, y) = f(q_{φx}(z|x), q_{φy}(z|y)). This was first seen in Wu and Goodman (2018), as the MVAE model with f defined as a product of experts (POE), i.e. q_Φ(z|x, y) = q_{φx}(z|x) q_{φy}(z|y) p(z), allowing for cross-modality generation without extra modelling components. In particular, the MVAE caters to settings where data is not guaranteed to be always related, and where additional modalities are, in terms of information content, subsets of a primary data source, such as images and their class labels. Alternatively, Shi et al. (2019) explored an approach that explicitly leverages the availability of related/paired data, motivated by arguments from embodied cognition of the world. They propose the MMVAE model, which additionally differs from the MVAE in its choice of posterior approximation, where f is modelled as a mixture of experts (MOE) over the unimodal posteriors, to ameliorate shortcomings to do with precision miscalibration of the POE. Furthermore, Shi et al.
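The two factorisations of f can be sketched concretely for Gaussian experts; this is an illustrative one-dimensional sketch, not the original implementations, and the function names are assumptions:

```python
import numpy as np

def poe_gaussian(mus, vars_):
    """Product-of-experts (POE) posterior, as in MVAE (Wu & Goodman, 2018):
    multiply the unimodal Gaussian experts with a standard-normal prior
    expert. Precisions add; the mean is precision-weighted.
    One-dimensional latent for clarity."""
    mus = np.concatenate([[0.0], np.asarray(mus)])      # prepend prior mean
    vars_ = np.concatenate([[1.0], np.asarray(vars_)])  # prepend prior variance
    prec = 1.0 / vars_
    var = 1.0 / prec.sum()
    mu = var * (prec * mus).sum()
    return mu, var

def moe_sample(mus, vars_, rng):
    """Mixture-of-experts (MOE) posterior, as in MMVAE (Shi et al., 2019):
    sample by first picking one unimodal expert uniformly at random,
    then sampling from that expert alone."""
    k = rng.integers(len(mus))
    return mus[k] + np.sqrt(vars_[k]) * rng.standard_normal()
```

The POE sharpens the posterior as experts are multiplied (precisions add), which is where the precision-miscalibration concern arises; the MOE instead keeps each unimodal posterior's spread intact.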



Figure 1: Multimodal data from the CUB dataset

Figure 2: Graphical models for the multimodal generative process.

