RELATING BY CONTRASTING: A DATA-EFFICIENT FRAMEWORK FOR MULTIMODAL DGMs

Abstract

Multimodal learning for generative models often refers to learning abstract concepts from the commonality of information across multiple modalities, such as vision and language. While it has proven effective for learning generalisable representations, training such models often requires a large amount of "related" multimodal data that shares this commonality, which can be expensive to come by. To mitigate this, we develop a novel contrastive framework for generative-model learning, allowing us to train the model not just on the commonality between modalities, but on the distinction between "related" and "unrelated" multimodal data. We show in experiments that our method enables data-efficient multimodal learning on challenging datasets for various multimodal variational autoencoder (VAE) models. We also show that under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit the plentiful unlabelled, unpaired multimodal data.

1. INTRODUCTION

To comprehensively describe concepts in the real world, humans collect multiple perceptual signals of the same object, such as image, sound, text and video. We refer to each of these media as a modality, and a collection of different media featuring the same underlying concept is characterised as multimodal data. Learning from multiple modalities has been shown to yield more generalisable representations (Zhang et al., 2020; Guo et al., 2019; Yildirim, 2014), as different modalities are often complementary in content while overlapping on common abstract concepts. Despite this motivation, it is worth noting that the multimodal framework is not exactly data-efficient: constructing a suitable dataset requires a lot of "annotated" unimodal data, as we need to ensure that each multimodal pair is related in a meaningful way. The situation is worse in more complicated multimodal settings such as language-vision, where one-to-one or one-to-many correspondence between instances of the two datasets is required, owing to the difficulty of categorising data such that commonality amongst samples is preserved within categories. See Figure 1 for an example from the CUB dataset (Welinder et al., a); although the same species of bird is featured in both image-caption pairs, their content differs considerably. It would be unreasonable to apply the caption from one to describe the bird depicted in the other, necessitating one-to-one correspondence between images and captions. However, the scope of multimodal learning has so far been limited to leveraging the commonality between "related" pairs, while largely ignoring the "unrelated" samples potentially available in any multimodal dataset, constructed through random pairing between modalities (Figure 3). We posit that if a distinction can be established between the "related" and "unrelated" observations within a multimodal dataset, we could greatly reduce the amount of related data required for effective learning.
Figure 2 formalises this proposal. Multimodal generative models in previous work (Figure 2a) typically assume one latent variable z that always generates a related multimodal pair (x, y). In this work (Figure 2b), we introduce an additional Bernoulli random variable r that dictates the "relatedness" between x and y through z, where x and y are related when r = 1 and unrelated when r = 0. While r can encode different dependencies, here we make the simplifying assumption that the pointwise mutual information (PMI) between x and y should be high when r = 1 and low when r = 0. Intuitively, this can be achieved by adopting a max-margin metric. We therefore propose to train the generative models with a novel contrastive-style loss (Hadsell et al., 2006; Weinberger et al., 2005), and demonstrate the effectiveness of our proposed method from a few different perspectives. Improved multimodal learning: we show improved multimodal learning for various state-of-the-art multimodal generative models on two challenging multimodal datasets, evaluated on four different metrics (Shi et al., 2019) summarised in § 4.2. Data efficiency: learning generative models under the contrastive framework requires only 20% of the data needed in baseline methods to achieve similar performance, holding true across different models, datasets and metrics. Label propagation: the contrastive loss encourages a larger discrepancy between related and unrelated data, making it possible to directly identify related samples using the PMI between observations. We show that these data pairs can be used to further improve the learning of the generative model.

2. RELATED WORK

Contrastive loss Our work aims to encourage data-efficient multimodal generative-model learning using a popular representation-learning metric: the contrastive loss (Hadsell et al., 2006; Weinberger et al., 2005). There have been many successful applications of contrastive loss to a range of different tasks, such as contrastive predictive coding for time-series data (van den Oord et al., 2018), image classification (Hénaff, 2020), noise contrastive estimation for vector embeddings of words (Mnih and Kavukcuoglu, 2013), as well as a range of frameworks such as DIM (Hjelm et al., 2019), MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) for more general visual-representation learning. The features learned by contrastive loss also perform well when applied to different downstream tasks; their ability to generalise is further analysed in (Wang and Isola, 2020) using quantifiable metrics for alignment and uniformity. Contrastive methods have also been employed in a generative-model setting, but typically on generative adversarial networks (GANs) to either preserve or identify factors of variation in their inputs. For instance, SiGAN (Hsu et al., 2019) uses a contrastive loss to preserve identity for face-image hallucination from low-resolution photos, while (Yildirim et al., 2018) uses a contrastive loss to disentangle the factors of variation in the latent code of GANs. We here employ a contrastive loss in the distinct setting of multimodal generative-model learning, which, as we will show with our experiments and analyses, promotes better, more robust representation learning. Multimodal VAEs We also demonstrate that our approach is applicable across different approaches to learning multimodal generative models. To do so, we first summarise past work on multimodal VAEs into two categories based on the modelling choice for the approximate posterior q_Φ(z|x, y): a) Explicit joint models: q_Φ as a single joint encoder q_Φ(z|x, y).
Examples of work in this area include JMVAE (Suzuki et al.), triple ELBO (Vedantam et al., 2018) and MFM (Tsai et al., 2019). Since the joint encoder requires the multimodal pair (x, y) as input, these approaches typically require additional modelling components and/or inference steps to deal with missing modalities at test time; in fact, all three approaches propose to train unimodal VAEs on top of the joint model to handle data from each modality independently. b) Factorised joint models: q_Φ as factored encoders q_Φ(z|x, y) = f(q_{φx}(z|x), q_{φy}(z|y)). This was first seen in Wu and Goodman (2018), as the MVAE model with f defined as a product of experts (POE), i.e. q_Φ(z|x, y) ∝ q_{φx}(z|x) q_{φy}(z|y) p(z), allowing for cross-modality generation without extra modelling components. In particular, the MVAE caters to settings where data is not guaranteed to always be related, and where additional modalities are, in terms of information content, subsets of a primary data source, such as images and their class labels. Alternately, Shi et al. (2019) explored an approach that explicitly leverages the availability of related/paired data, motivated by arguments from embodied cognition. They propose the MMVAE model, which additionally differs from the MVAE model in its choice of posterior approximation: f is modelled as a mixture of experts (MOE) over the unimodal posteriors, to ameliorate shortcomings to do with precision miscalibration of the POE. Furthermore, Shi et al. (2019) also posit four criteria that a multimodal VAE should satisfy, which we adopt in this work to evaluate the performance of our models. Weakly-supervised learning Using generative models for label propagation (see § 4.5) is a form of weak supervision.
Commonly seen approaches for weakly-supervised training with incomplete data include (1) graph-based methods (such as minimum cut) (Zhou et al., 2003; Zhu et al., 2003; Blum and Chawla, 2001), (2) low-density separation methods (Joachims, 1999; Burkhart and Shan, 2020) and (3) disagreement-based models (Blum and Mitchell, 1998; Zhou and Li, 2005; 2010). However, (1) and (2) suffer from scalability issues due to computational inefficiency and optimisation complexity, while (3) works well for many different tasks but requires training an ensemble of learners. The use of generative models for weakly-supervised learning has also been explored in Nigam et al. (2000); Miller and Uyar (1996), where labels for unlabelled instances are estimated using expectation-maximisation (EM) (Dempster et al., 1977). Notably, models trained with our contrastive objective do not need EM to leverage unlabelled data (see § 4.5): to determine the relatedness of two examples, we only need to compute a threshold on the PMI estimated by the trained model.
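As a concrete reference for the two factorised-posterior choices above, the sketch below (a toy illustration of our own, not taken from any cited codebase) fuses unimodal Gaussian posteriors N(μ_m, σ_m²) by product of experts and by mixture of experts:

```python
import numpy as np

def poe_gaussian(mus, variances, prior_mu=0.0, prior_var=1.0):
    """Product of Gaussian experts (MVAE-style): precision-weighted fusion,
    with the prior N(prior_mu, prior_var) included as an extra expert."""
    precisions = [1.0 / prior_var] + [1.0 / v for v in variances]
    weighted_mus = [prior_mu / prior_var] + [m / v for m, v in zip(mus, variances)]
    joint_var = 1.0 / sum(precisions)
    joint_mu = joint_var * sum(weighted_mus)
    return joint_mu, joint_var

def moe_sample(mus, variances, rng):
    """Mixture of Gaussian experts (MMVAE-style, equal weights): pick one
    unimodal posterior uniformly at random, then sample from it."""
    k = rng.integers(len(mus))
    return mus[k] + np.sqrt(variances[k]) * rng.standard_normal()
```

For example, fusing two unit-variance experts N(0, 1) and N(2, 1) with a standard-normal prior under POE gives N(2/3, 1/3): each unit-precision expert pulls the joint mean equally, and precisions add.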

3. METHODOLOGY

Given data over observations from two modalities (x, y), one can learn a multimodal VAE targeting p_Θ(x, y, z) = p(z) p_{θx}(x|z) p_{θy}(y|z), where the p_θ(·|z) are deep neural networks (decoders) parametrised by Θ = {θ_x, θ_y}. To maximise the joint marginal likelihood log p_Θ(x, y), one approximates the intractable model posterior p_Θ(z|x, y) with a variational posterior q_Φ(z|x, y), allowing us to optimise a variational evidence lower bound (ELBO), defined as

log p_Θ(x, y) ≥ E_{z ∼ q_Φ(z|x,y)} [ log ( p_Θ(z, x, y) / q_Φ(z | x, y) ) ] = ELBO(x, y).   (1)

This leaves open the question of how to model the approximate posterior q_Φ(z|x, y). As mentioned in § 2, there are two schools of thought, namely explicit joint models such as JMVAE (Suzuki et al.) and factorised joint models including MVAE (Wu and Goodman, 2018) and MMVAE (Shi et al., 2019). In this work we demonstrate the effectiveness of our approach across all these models.
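For intuition, the bound in (1) can be estimated by simple Monte Carlo. The toy sketch below (our own illustration, with a scalar z and fixed-variance Gaussian likelihoods standing in for the decoders) estimates the ELBO of a joint model p(z)p(x|z)p(y|z):

```python
import numpy as np

def log_normal(v, mu, var):
    """Log density of N(mu, var) evaluated at v."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

def elbo_estimate(x, y, q_mu, q_var, decode_x, decode_y, n_samples=1000, seed=0):
    """Monte Carlo ELBO for a toy joint model p(z)p(x|z)p(y|z) with scalar z,
    a standard-normal prior and unit-variance Gaussian decoders."""
    rng = np.random.default_rng(seed)
    z = q_mu + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = (log_normal(z, 0.0, 1.0)            # prior p(z)
                 + log_normal(x, decode_x(z), 1.0)  # likelihood p(x|z)
                 + log_normal(y, decode_y(z), 1.0)) # likelihood p(y|z)
    log_q = log_normal(z, q_mu, q_var)              # variational posterior
    return float(np.mean(log_joint - log_q))
```

With identity decoders and x = y = 0, the exact posterior is N(0, 1/3); plugging it in as q makes the bound tight, recovering log p(x, y) = −log 2π − ½ log 3 exactly.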

3.1. CONTRASTIVE LOSS FOR "RELATEDNESS" LEARNING

Where prior work always assumes (x, y) to be related, we introduce a relatedness variable r explicitly capturing this aspect. Our approach is motivated by the characteristics of the pointwise mutual information (PMI) between related and unrelated observations across modalities. Hypothesis 3.1. Let (x, y) ∼ p_Θ(x, y) be a related data pair from two modalities, and let y′ denote a data point unrelated to x. Then, for the pointwise mutual information, I(x, y) > I(x, y′). Note that the PMI measures the statistical dependence between values (x, y), which for a joint distribution p(x, y) is defined as I(x, y) = log [ p(x, y) / (p(x) p(y)) ]. Hypothesis 3.1 should be a fairly uncontroversial assumption for true generative models: we say simply that under the joint distribution for related data p_Θ(x, y), the PMI between related points (x, y) is stronger than that between unrelated points (x, y′). In fact, we demonstrate in § 4.5 that for a well-trained generative model, the PMI is a good indicator of relatedness, enabling pairing of random multimodal data by thresholding the PMI estimated by the trained model. It is therefore possible to leverage relatedness while training a parametrised generative model by maximising the difference in PMI between related and unrelated pairs, i.e. I(x, y) − I(x, y′). We show in Appendix A that this is equivalent to maximising the difference between the joint marginal likelihoods of related and unrelated pairs (Figures 3a and 3b). This casts multimodal learning as max-margin optimisation, with the contrastive (triplet) loss as a natural choice of objective: L_C(x, y, y′) = d(x, y) − d(x, y′) + m. Intuitively, L_C attempts to make the distance d between a positive pair (x, y) smaller than the distance between a negative pair (x, y′) by margin m. We adapt this loss to our objective by omitting the margin m and replacing d with the (negative) joint marginal likelihood −log p_Θ. Following Song et al.
(2016), with N negative samples {y′_i}_{i=1}^N, we have L_C(x, Y) = −log p_Θ(x, y) + log Σ_{i=1}^N p_Θ(x, y′_i). We choose to put the sum over N within the log, following conventions in previous work on contrastive loss (van den Oord et al., 2018; Hjelm et al., 2019; Chen et al., 2020). As the loss is asymmetric, one can average L_C(x, Y) and L_C(y, X) to ensure negative samples in both modalities are accounted for, giving our contrastive objective:

L_C(x, y) = ½ {L_C(x, Y) + L_C(y, X)} = −log p_Θ(x, y) [term (1)] + ½ [ log Σ_{i=1}^N p_Θ(x′_i, y) + log Σ_{i=1}^N p_Θ(x, y′_i) ] [term (2)]   (3)

Note that since only joint marginal likelihood terms are needed in (3), the contrastive learning framework can be directly applied to any multimodal generative model without needing extra components. We also show that our contrastive framework applies when the number of modalities is more than two (cf. Appendix C). Dissecting L_C Although similar to the VAE objective (1), our objective (3) directly maximises p_Θ(x, y) in term (1). L_C by itself is not an effective objective, as term (2) of L_C minimises p_Θ(x, y), which can overpower the effect of term (1) during training. We intuit this phenomenon using a simple example in Figure 4 with natural images, showing the log likelihood log p(x, y) in columns 2-4 (green) under a Gaussian distribution N((m_x, m_y); c), with the images in the first column of Figure 4 as means and with constant variance c. While achieving high log p(x, y) requires matching (m_x, m_y) (col 2), we see that both unrelated digits (col 3) and noise (col 4) lead to (comparatively) poor joint log likelihoods. This indicates that the generative model need not generate valid, unrelated images to minimise term (2): generating noise has roughly the same effect on the log likelihood. As a result, the model can trivially minimise L_C by generating noise that minimises term (2) instead of accurate reconstructions that maximise term (1).
This learning dynamic is verified empirically, as we show in Figure 9 in Appendix D: optimising L_C by itself results in the loss approaching 0 within the first 100 iterations, and both term (1) and term (2) take on extremely low values, resulting in a model that generates random noise. Final objective To mitigate this issue, we need to ensure that minimising term (2) does not overpower maximising term (1). We hence introduce a hyperparameter γ on term (1) to upweight the maximisation of the marginal likelihood, with the final objective to minimise

L_C(x, y) = −γ log p_Θ(x, y) [term (1)] + ½ [ log Σ_{i=1}^N p_Θ(x′_i, y) + log Σ_{i=1}^N p_Θ(x, y′_i) ] [term (2)],  γ > 1.   (4)

We conduct ablation studies on the effect of γ in Appendix E, noting that larger γ encourages better generation quality and more stable training in some cases, while models trained with smaller γ are better at predicting "relatedness" between multimodal samples. We also note that optimising (4) maximises the pointwise mutual information I(x, y); see Appendix B for a proof.
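Given per-pair estimates of log p_Θ(x, y), the objective in (4) reduces to a few log-sum-exps. A minimal sketch (the function names and the numerically stable helper are ours, not from the paper's code):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return float(m + np.log(np.sum(np.exp(a - m))))

def contrastive_loss(log_p_pos, log_p_neg_x, log_p_neg_y, gamma=2.0):
    """Final objective of Eq. (4): term (1) is -gamma * log p(x, y) for the
    related pair; term (2) averages the log-sum of joint likelihoods over
    negative pairs formed in each modality. gamma = 1 recovers Eq. (3)."""
    term1 = -gamma * log_p_pos
    term2 = 0.5 * (logsumexp(log_p_neg_x) + logsumexp(log_p_neg_y))
    return term1 + term2
```

For instance, with log p_Θ(x, y) = 0 for the related pair, two x-modality negatives at log-likelihood 0 and one y-modality negative at 0, the loss is ½(log 2 + 0) ≈ 0.347 regardless of γ.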

3.2. OPTIMISING THE OBJECTIVE

Since in VAEs we do not have access to the exact marginal likelihood, we have to optimise an approximate version of the true contrastive objective in (4). In (5) we list a few candidate estimators and their relationships to the (log) joint marginal likelihood:

ELBO(x, y) ≤ E_{{z_k}_{k=1}^K ∼ q_Φ} [ log (1/K) Σ_{k=1}^K p_Θ(z_k, x, y) / q_Φ(z_k | x, y) ] (IWAE) ≤ log p_Θ(x, y) ≤ E_{{z_k}_{k=1}^K ∼ q_Φ} [ ½ log (1/K) Σ_{k=1}^K ( p_Θ(z_k, x, y) / q_Φ(z_k | x, y) )² ] (CUBO)   (5)

with the ELBO from Eq. (1); the importance-weighted autoencoder (IWAE), a K-sample lower-bound estimator that can compute an arbitrarily tight bound with increasing K (Burda et al., 2016); and the χ² upper bound (CUBO), an upper-bound estimator (Dieng et al., 2017). We now discuss our choice of approximation of log p_Θ(x, y) for each term in Eq. (4). Term (1): minimising this term maximises the joint marginal likelihood, which can hence be approximated with a valid lower-bound estimator from (5), the IWAE estimator being the preferred choice. Term (2): for this term, we consider two choices of estimator. 1) CUBO: since term (2) of (4) minimises the joint marginal likelihoods, to ensure an upper bound on L_C in (4), we need to employ an upper-bound estimator; here we propose to use the CUBO in (5). While such a bounded approximation is indeed desirable, existing upper-bound estimators tend to have rather large bias/variance and can therefore yield poor-quality approximations. 2) IWAE: we therefore also propose to estimate term (2) with the IWAE estimator, as it provides an arbitrarily tight, low-variance lower bound (Burda et al., 2016) to log p_Θ(x, y). Although this no longer ensures a valid bound on the objective, we hope that having a more accurate approximation to the marginal likelihood (and, by extension, the contrastive loss) affects the performance of the model positively. We report results using both the IWAE and CUBO estimators in § 4, denoted cI and cC respectively.
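Both estimators act on the same K importance log-weights log w_k = log p_Θ(z_k, x, y) − log q_Φ(z_k | x, y). A sketch of the two (our own numerically stable implementations, following the formulas in (5)):

```python
import numpy as np

def iwae(log_w):
    """K-sample IWAE lower bound: log((1/K) * sum_k w_k), computed stably
    by shifting by the maximum log-weight."""
    log_w = np.asarray(log_w, dtype=float)
    m = np.max(log_w)
    return float(m + np.log(np.mean(np.exp(log_w - m))))

def cubo(log_w):
    """K-sample chi^2 upper-bound (CUBO_2) estimate:
    (1/2) * log((1/K) * sum_k w_k^2), computed stably."""
    log_w = np.asarray(log_w, dtype=float)
    m = np.max(log_w)
    return float(m + 0.5 * np.log(np.mean(np.exp(2 * (log_w - m)))))
```

With log-weights [0, log 3], for example, ELBO = mean ≈ 0.549 ≤ IWAE = log 2 ≈ 0.693 ≤ CUBO = ½ log 5 ≈ 0.805, illustrating the ordering in (5) (the per-draw ordering IWAE ≤ CUBO follows from Jensen's inequality).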

4. EXPERIMENTS

As stated in § 1, we analyse the suitability of contrastive learning for multimodal generative models from three perspectives: improved multimodal learning (§ 4.3), data efficiency (§ 4.4) and label propagation (§ 4.5). Throughout the experiments, we take N = 5 negative samples for the contrastive objective, set γ = 2 based on the ablation analyses in Appendix E, and take K = 30 samples for our IWAE estimators. We now introduce the datasets and metrics used for our experiments.

4.1. DATASETS

MNIST-SVHN This dataset is designed to separate conceptual complexity, i.e. digit identity, from perceptual complexity, i.e. colour, style and size. Each data pair contains two samples of the same digit, one from each dataset (see examples in Figure 3a). We construct the dataset such that each instance from one dataset is paired with 30 instances of the same digit from the other dataset. Although both datasets are simple and well-studied, the many-to-many pairing between samples matches different writing styles against different backgrounds and colours, making it a challenging multimodal dataset. CUB Image-Captions We also consider a more challenging language-vision multimodal dataset, Caltech-UCSD Birds (CUB) (Welinder et al., b; Reed et al., 2016). The dataset contains 11,788 photos of birds, each paired with 10 captions describing the bird's physical characteristics, collected through Amazon Mechanical Turk (AMT). See Figure 1 for an example CUB image-caption pair. Shi et al. (2019) proposed four criteria for multimodal generative models (Figure 5, left), which we summarise and unify as metrics to evaluate these criteria for different generative models (Figure 5, right). We now introduce each criterion and its corresponding metric in detail.
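The many-to-many pairing described above can be sketched as follows (a simplified construction of our own; `n_pairings=30` matches the setting in the text, and we sample partners with replacement for brevity):

```python
import random
from collections import defaultdict

def pair_by_label(labels_a, labels_b, n_pairings=30, seed=0):
    """Pair each index of dataset A with n_pairings randomly chosen
    indices of dataset B that carry the same digit label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for j, label in enumerate(labels_b):
        by_label[label].append(j)
    pairs = []
    for i, label in enumerate(labels_a):
        candidates = by_label[label]
        # guard against labels with fewer candidates than n_pairings
        for _ in range(min(n_pairings, len(candidates))):
            pairs.append((i, rng.choice(candidates)))
    return pairs
```

Every resulting pair is "related" by construction; pairing indices uniformly at random across labels instead would yield the "unrelated" pairs of Figure 3b.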

4.2. METRICS

(a) Latent accuracy (Figure 5a) Criterion: the latent space factors into "shared" and "private" subspaces across modalities. We fit a linear classifier on samples z ∼ q_Φ(z|x, y) to predict the information shared between the two modalities; for MNIST-SVHN, this is the digit label, as shown in Figure 5a (right). We check whether the predicted label l_z matches the digit label of the original inputs x and y, the intuition being that if the commonality between x and y can be extracted from the latent representation by a linear transform, the latent space has factored as desired. (b) Joint coherence (Figure 5b) Criterion: the model generates paired samples that preserve the commonality observed in data. Again taking MNIST-SVHN as an example, this can be verified by taking pre-trained MNIST and SVHN digit classifiers and applying them to the multimodal observations generated from the same prior sample z. Coherence is computed as how often the generations x̂ and ŷ classify to the same digit, i.e. whether l_x̂ = l_ŷ. (c) Cross coherence (Figure 5c) Criterion: the model generates data in one modality conditioned on the other while preserving the shared commonality. To compute cross coherence, we generate observations x̂ using latents from the unimodal posterior z ∼ q_φ(z|y), and ŷ from z ∼ q_φ(z|x). As with joint coherence, the criterion is evaluated by predicting the labels of the cross-generated samples x̂, ŷ using off-the-shelf MNIST and SVHN classifiers. In Figure 5c (right), the cross coherence is the frequency with which l_x̂ = l_y and l_ŷ = l_x. (d) Synergy coherence (Figure 5d) Criterion: models learnt across multiple modalities should be no worse than those learnt from just one. For consistency, we evaluate this criterion from a coherence perspective. Given generations x̂ and ŷ from z ∼ q_Φ(z|x, y), we again examine whether generated and original labels match, i.e. whether l_x̂ = l_y = l_ŷ = l_x. See Appendix F for details on architectures.
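Given classifier outputs, the coherence metrics above reduce to simple agreement rates. A sketch (function names are ours), where each argument is a list of predicted digit labels:

```python
def agreement(labels_p, labels_q):
    """Fraction of positions where two label sequences agree."""
    assert len(labels_p) == len(labels_q)
    return sum(p == q for p, q in zip(labels_p, labels_q)) / len(labels_p)

def joint_coherence(gen_x_labels, gen_y_labels):
    """How often the two generations from the same prior sample share a digit."""
    return agreement(gen_x_labels, gen_y_labels)

def cross_coherence(cross_gen_labels, source_labels):
    """How often a cross-generated sample keeps the source modality's digit."""
    return agreement(cross_gen_labels, source_labels)
```

Synergy coherence composes the same primitive: labels of both generations from the joint posterior are each compared against the original labels.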
All quantitative results are reported over 5 runs. In addition to these quantitative metrics, we also showcase the qualitative results on both datasets in Appendix G and Appendix H.

4.3. IMPROVED MULTIMODAL LEARNING

Finding: Contrastive learning improves multimodal learning across all models and datasets. MNIST-SVHN See Table 1 (top, 100% data used) for results on the full MNIST-SVHN dataset. Note that for MMVAE, since the joint posterior q_Φ is factorised as a mixture of the unimodal posteriors q_{φx} and q_{φy}, the model never directly samples from the explicit form of the joint posterior. Instead, it takes an equal number of samples from each unimodal posterior, reflecting the equal weighting of the mixture. As a result, it is not meaningful to compute synergy coherence for MMVAE, as it is exactly the same as the coherence of any single-way generation. From Table 1 (top, 100% data used), we see that models trained with our contrastive objective (cI-<MODEL> and cC-<MODEL>) improve multimodal learning performance significantly for all three generative models across the evaluated metrics. The results showcase the robustness of our approach with respect to both modelling choice and metric of interest. In particular, note that for the best-performing model, MMVAE, using the IWAE estimator for term (2) of Eq. (4) (cI-MMVAE) yields slightly better results than CUBO (cC-MMVAE), while for the two other models the performance of the different estimators is similar. We also include qualitative results for MNIST-SVHN, including generative results, a marginal-likelihood table and a diversity analysis, in Appendix G. CUB Following Shi et al. (2019), for the images in CUB we observe and generate in feature space instead of pixel space, preprocessing the images with a pre-trained ResNet-101 (He et al., 2016). A nearest-neighbour lookup among all the features in the test set is used to project the model's feature generations back to image space. This helps circumvent the complexities of CUB images to some extent, as the primary goal here is to learn good models and representations of multimodal data rather than to focus on pixel-level image quality.
The metrics listed in § 4.2 can also be applied to CUB with some modifications. Since bird-species classes are disjoint between the train and test sets, and, as we show in Figure 1, exhibit substantial in-class variance, it is not constructive to evaluate these metrics using bird categories as labels. Shi et al. (2019) propose instead to use Canonical Correlation Analysis (CCA), used by Massiceti et al. as a reliable vision-language correlation baseline, to compute coherence scores between generated image-caption pairs; we employ this for metrics (b), (c) and (d) in Figure 5 on CUB. We show the results in Table 2 (top, 100% data used). We see that our contrastive approach (both cI-<MODEL> and cC-<MODEL>) is even more effective on this challenging vision-language dataset, with significant improvements to the correlation of generated image-caption pairs. It is also on this more complicated dataset that the advantages of the stable, low-variance IWAE estimator are highlighted: for both MVAE and JMVAE, the contrastive objective with the CUBO estimator suffers from numerical stability issues, yielding close-to-zero correlations for all metrics in Table 2; results for these models are therefore omitted. We also show qualitative results for CUB in Appendix H.

4.4. DATA EFFICIENCY

Finding: Contrastive learning on 20% of the data matches baseline models on the full data. We plot the quantitative performance of MMVAE with and without contrastive learning against the percentage of the original dataset used, as seen in Figure 6. We observe that the performance of contrastive MMVAE with the IWAE (cI-MMVAE, red) and CUBO (cC-MMVAE, yellow) estimators is consistently better than the baseline (MMVAE, blue), and that the baseline performance using all related data is matched by contrastive MMVAE using just 10-20% of the data. The partial datasets used here are constructed by first taking n% of each unimodal dataset, then pairing to create multimodal datasets (§ 4.1), ensuring they contain the requisite amount of "related" samples. In addition, we reproduce the results obtained using 100% of the data on MNIST-SVHN and CUB (Tables 1 and 2) using only 20% of the original multimodal datasets (Tables 1 and 2, bottom). Comparing the top vs. bottom results in Tables 1 and 2 shows that this finding holds across models on both the MNIST-SVHN and CUB datasets. The efficiency gains from the contrastive approach are thus invariant to VAE type, data and metrics used, underscoring its effectiveness.

4.5. LABEL PROPAGATION

Finding: Generative models learned contrastively are good predictors of "relatedness", enabling label propagation that matches baseline performance on full datasets using only 10% of the data. Here, we show that our contrastive framework encourages a larger discrepancy between the PMI of related vs. unrelated data, as set out in Hypothesis 3.1, allowing one to first train the model on a small subset of related data, and subsequently construct a classifier using the PMI that identifies related samples in the remaining data. We now introduce our pipeline for label propagation in detail. Pipeline As shown in Figure 7, we first construct a full dataset by randomly matching instances in MNIST and SVHN, and denote the related pairs by F_r (full, related). We further assume access to only n% of F_r, denoted S_r (small, related), and denote the rest by F_m, containing a mix of related and unrelated pairs. Next, we train a generative model g on S_r. To find a relatedness threshold, we construct a small, mixed dataset S_m by randomly matching samples across modalities in S_r. Given relatedness ground truth for S_m, we can compute the PMI I(x, y) = log p_Θ(x, y) − log p_{θx}(x) p_{θy}(y) for all pairs (x, y) in S_m and estimate an optimal threshold. This threshold can then be applied to the full, mixed dataset F_m to identify related pairs, giving us a new related dataset which can be used to further improve the performance of the generative model g.
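The threshold-fitting step can be sketched as follows (our own simplification: `best_threshold` scans the observed PMI values themselves as candidates and keeps the one maximising accuracy on the small labelled set S_m):

```python
import numpy as np

def pmi(log_joint, log_px, log_py):
    """Pointwise mutual information I(x, y) = log p(x, y) - log p(x)p(y)."""
    return log_joint - (log_px + log_py)

def best_threshold(pmi_values, is_related):
    """Return the candidate threshold t for which the rule
    'PMI >= t => related' is most accurate on the labelled set."""
    pmi_values = np.asarray(pmi_values, dtype=float)
    is_related = np.asarray(is_related, dtype=bool)
    best_t, best_acc = pmi_values[0], -1.0
    for t in pmi_values:
        acc = np.mean((pmi_values >= t) == is_related)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return float(best_t)
```

The fitted threshold is then applied to F_m: pairs whose estimated PMI clears it are propagated as related and added back to the training set.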

Results

In Figure 8 (a-e), we plot the performance of the baseline MMVAE (blue) and contrastive MMVAE with the IWAE (cI-MMVAE, red) and CUBO (cC-MMVAE, yellow) estimators for term (2) of Eq. (4), trained with (solid curves) and without (dotted curves) label propagation. Here, the x-axis is the size of S_r as a proportion of F_r, i.e. the percentage of related data used to pretrain the generative model before label propagation. We compare these results to MMVAE trained on all related data F_r (cyan horizontal line) as a "best case scenario" for these training regimes. Clearly, label propagation using a contrastive model with the IWAE estimator is helpful, and in general the improvement is greater when less data is available; Figure 8 also shows that when S_r is 10% of F_r, cI-MMVAE is competitive with the performance of the baseline MMVAE trained on F_r. For the baseline MMVAE and cC-MMVAE, however, label propagation hurts performance regardless of the size of S_r, as shown by the blue and yellow curves in Figure 8 (a-e). The advantages of cI-MMVAE are further demonstrated in Figure 8f, where we compute the precision, recall and F_1 score of relatedness prediction on F_m for models trained on 10% of all related data. We also compare to a simple label-propagation baseline, where the relatedness of F_m is predicted using a Siamese network (Hadsell et al., 2006) trained on the same 10% dataset. Notably, while the Siamese baseline in Figure 8f is a competitive predictor of relatedness, cI-MMVAE has the highest F_1 score amongst all four, making it the most reliable indicator of relatedness. Beyond that, note that with the contrastive MMVAE, relatedness can be predicted without additional training: it only requires a simple threshold computation directly on the generative model.
The fact that cI-MMVAE's relatedness-prediction performance is the only one that matches the Siamese baseline strongly supports the view that the contrastive loss encourages generative models to utilise, and better learn, the relatedness between multimodal pairs. In addition, the poor performance of cC-MMVAE shows that the advantage of having a bounded estimate of the contrastive objective, obtained by using an upper bound for term (2), is overshadowed by the high bias of CUBO, and that one may benefit more from choosing a low-variance lower bound like IWAE.

D THE INEFFECTIVENESS OF TRAINING WITH L_C ONLY

We demonstrate why training with the contrastive loss proposed in (3) alone is ineffective, and why the upweighted likelihood term is needed in the final objective. As we show in Figure 9, when training with L_C only, while the contrastive loss (green) quickly drops to zero, both term (1) and term (2) in (3) also decrease drastically. This means the joint marginal likelihood log p_Θ(x, y) of any generation is small regardless of the relatedness of (x, y). For comparison, we also plot the training curve for a model trained on the final objective in (4), which upweights term (1) of (3) by γ. We see in Figure 10 that by setting γ = 2, the joint marginal likelihood (yellow and blue curves) improves during training, while L_C (green curve) is minimised.

E ABLATION STUDY OF γ

In § 3, we specified that γ needs to be greater than 1 to offset the negative effect of minimising the ELBO through term (2) in (4). Here, we study the effect of γ in detail. Figure 11 compares the latent accuracy, cross coherence and joint coherence of MMVAE on the MNIST-SVHN dataset trained with different values of γ. Note that we only consider γ ≥ 1; at the minimum value γ = 1, the loss reduces to the original contrastive objective in (3). A few interesting observations from the plot are as follows. First, when γ = 1, the model is trained using the contrastive loss only, which, as we showed, is an ineffective objective for generative-model learning. This is verified again in Figure 11: when γ = 1, both coherences and latent accuracies take on extremely low values. Interestingly, there is a significant boost in performance across all metrics from simply increasing γ from 1 to 1.1; after that, as γ increases further, performance on most metrics decreases monotonically (joint coherence being the only exception) and eventually converges to the baseline MMVAE (dotted lines in Figure 11). This is unsurprising, since the final objective in (4) reduces to the original joint ELBO as γ approaches infinity. To verify this observation, we also compute the marginal log likelihood log p_Θ(x, y) to quantify the quality of generations. We compute this for all values of γ considered in Figure 11, averaged over the entire test set. From the results in Figure 13, we can see a significant increase in log likelihood from γ = 1 to γ = 1.1. This gain in generation quality then slows as γ further increases, while all other metrics converge to those of the original MMVAE model.

F ARCHITECTURE

We use the architectures listed in Table 3 for the unimodal encoders and decoders of MMVAE, MVAE and JMVAE. JMVAE additionally uses a joint encoder, whose architecture is described in Table 4 . 



Figure 1: Multimodal data from the CUB dataset

Figure 3: Constructing related & unrelated samples

Figure 4: log p(x, y) of imitation, unrelated digits and random noise, where p = N (mx, my).

Figure 5: Left of each pair: Four criteria for multi-modal generative models; image adapted from Shi et al. (2019). Right of each pair: Four metrics to evaluate the model's performance on criterion in corresponding row.

Figure 6: Performance of MMVAE, cI-MMVAE and cC-MMVAE using n% of MNIST-SVHN.

Figure 7: Pipeline of label propagation

Figure 8: Models with and without label propagation using MMVAE, cI-MMVAE and cC-MMVAE.

Figure 9: First 300 iterations of training using contrastive loss LC only.

Figure 10: First 300 iterations of training with final loss L, where γ = 2.

Figure 11: Performance on different metrics for different values of γ. Dotted lines represent the performance of baseline MMVAE.

Figure 11 seems to suggest that 1.1 is the optimal value for the hyperparameter γ; however, close inspection of the qualitative generative results shows that this might not be the case. See Figure 12 for a comparison of generations from MMVAE models trained with (from left to right) γ = 1.1, γ = 2 and γ = +∞ (i.e. the original MMVAE). Although γ = 1.1 yields a model with high coherence scores, the left-most column of Figure 12 clearly shows that its generations are degraded, especially in the SVHN modality, where the backgrounds appear unnaturally spotty and deformed. This problem is mitigated by increasing γ: as shown in Figure 12, the image generation quality at γ = 2 (middle column) is not visibly different from that at γ = +∞ (right column).

Figure 12: Generations of MMVAE model trained using the final contrastive objective, with (from left to right) γ = 1.1, 2 and +∞. Note in (c), (d), (e), (f), the top rows are the inputs and the bottom rows are their corresponding reconstruction/cross generation.

Figure 13: Marginal log likelihood log p Θ (x, y) for different values of γ, averaged over the test set. Dotted lines represent the performance of baseline MMVAE.

Figure 14: Generations of MMVAE model, from left to right are original model (MMVAE), contrastive loss with IWAE estimator (cI-MMVAE) and contrastive loss with CUBO estimator (cC-MMVAE).

Figure 17: Number of examples generated for each class during joint generation.

Figure 20: Qualitative results of MMVAE trained with contrastive loss with CUBO estimator on CUB Image-Caption dataset, including reconstruction (vision → vision, language → language), cross generation (vision → language, language → vision) and joint generation from prior samples.

Evaluation of the baselines MMVAE, MVAE, JMVAE and their contrastive variants (cI-<MODEL>, cC-<MODEL>, for the IWAE and CUBO estimators used in Eq (4), respectively), on the MNIST(M)-SVHN(S) dataset, using 100% (top) and 20% (bottom) of the data.



Unimodal encoder and decoder architectures.

5. CONCLUSION

We introduced a contrastive-style objective for multimodal VAEs, aimed at reducing the amount of multimodal data needed by exploiting the distinction between "related" and "unrelated" multimodal pairs. We showed that this objective improves multimodal training, drastically reduces the amount of multimodal data needed, and establishes a strong sense of "relatedness" in the generative model. These findings hold across a multitude of datasets, models and metrics. The positive results of our method indicate that it is beneficial to utilise the relatedness information in multimodal data, which has been largely ignored in previous work. While we propose to utilise it implicitly through a contrastive loss, one may also consider relatedness as a random variable in the graphical model and examine whether an explicit dependency on relatedness can be useful. It is also possible to extend this idea to generative adversarial networks (GANs) by employing an additional discriminator that evaluates relatedness between generations across modalities. We leave these interesting directions to future work.

ACKNOWLEDGEMENTS

YS and PHST were supported by the Royal Academy of Engineering under the Research Chair and Senior Research Fellowships scheme, EPSRC/MURI grant EP/N019474/1 and FiveAI. YS was additionally supported by Remarkdip through their PhD Scholarship Programme. BP is supported by the Alan Turing Institute under EPSRC grant EP/N510129/1. Special thanks to Elise van der Pol for helpful discussions on contrastive learning.

Appendix:

A FROM POINTWISE MUTUAL INFORMATION TO JOINT MARGINAL LIKELIHOOD

In this section, we show that for the purpose of utilising the relatedness between multimodal pairs, maximising the difference between the pointwise mutual information (PMI) of related points (x, y) and unrelated points (x, y') is equivalent to maximising the difference between their log joint marginal likelihoods. To see this, we can expand I(x, y) - I(x, y') as

I(x, y) - I(x, y') = [log p Θ (x, y) - log p Θ (x) - log p Θ (y)] - [log p Θ (x, y') - log p Θ (x) - log p Θ (y')]
                   = [log p Θ (x, y) - log p Θ (x, y')] - [log p Θ (y) - log p Θ (y')]    (6)

In (6) the PMI difference decomposes into two terms: term 1, the difference between joint marginal likelihoods, and term 2, the difference between the marginal likelihoods of the different instances of y. Since term 2 involves only one modality and only accounts for the difference in values between y and y', it is not relevant to the relatedness of (x, y). Therefore, for the purpose of utilising relatedness information, we only need to maximise I(x, y) - I(x, y') through maximising term 1, i.e. the difference between joint marginal likelihoods.
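The decomposition above is a purely algebraic identity and can be checked numerically on any discrete joint distribution. The sketch below uses an arbitrary toy joint distribution (the numbers are illustrative only).

```python
import math

# Toy joint distribution over x in {0, 1} and y in {0, 1, 2}.
p_xy = {(0, 0): .20, (0, 1): .10, (0, 2): .05,
        (1, 0): .05, (1, 1): .25, (1, 2): .35}
p_x = {x: sum(v for (a, b), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (a, b), v in p_xy.items() if b == y) for y in (0, 1, 2)}

def pmi(x, y):
    """Pointwise mutual information: log p(x, y) - log p(x) - log p(y)."""
    return math.log(p_xy[(x, y)]) - math.log(p_x[x]) - math.log(p_y[y])

def pmi_difference_decomposed(x, y, y2):
    """Right-hand side of (6): the joint-likelihood difference (term 1)
    minus the y-marginal difference (term 2)."""
    term1 = math.log(p_xy[(x, y)]) - math.log(p_xy[(x, y2)])
    term2 = math.log(p_y[y]) - math.log(p_y[y2])
    return term1 - term2
```

For any choice of (x, y, y'), `pmi(x, y) - pmi(x, y2)` and the decomposed form agree, and term 2 depends on y and y' only, confirming that only term 1 carries relatedness information.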

B CONNECTION OF FINAL OBJECTIVE TO POINTWISE MUTUAL INFORMATION

Here, we show that minimising the objective in (4) maximises the PMI between x and y. We see in (7) that minimising L can be decomposed into maximising both the joint marginal likelihood p Θ (x, y) and an approximation of the PMI I(x, y). Note that since γ > 1, the joint marginal likelihood weighting (γ - 1)/2 is guaranteed to be non-negative.

C GENERALISATION TO M > 2 MODALITIES

In this section we show how the contrastive loss generalises to cases where the number of modalities M is greater than 2. Given observations from M modalities, similar to (2), we can write the asymmetric contrastive loss for any observation x (i) m from modality m, where negative samples are taken from all (M - 1) other modalities. We can therefore rewrite (3) accordingly, where N is the number of negative samples and each log p Θ (x 1:M ) is approximated by the joint ELBO for M modalities. While this gives us the exact generalisation of (3), the ELBO needs to be evaluated O(M 2 N) times in (10), making this objective difficult to implement in practice, especially on more complicated datasets. We therefore propose a simplified version of the objective, where we estimate the second term of (10) with N sets of random samples drawn from all modalities. Specifically, we can precompute an M × N random index matrix J, where each entry of row m is a random integer taken from the range [1, N m ]. We can then replace the second term of (10) with the random samples selected by the indices in J. The number of ELBO evaluations is now O(N), and no longer depends on the number of modalities M. We can now also generalise the final objective in (4) to M modalities.
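A minimal sketch of the precomputed index matrix J described above, using 0-based indices rather than the [1, N m ] range in the text. The function and variable names are illustrative, not the paper's implementation.

```python
import random

def build_negative_index_matrix(M, N, dataset_sizes, seed=0):
    """Precompute an M x N index matrix J of random negatives: column n
    picks one random sample index per modality, so the second term of the
    M-modality contrastive loss is estimated with N joint ELBO
    evaluations instead of O(M^2 N)."""
    rng = random.Random(seed)
    # J[m][n] indexes a random sample from modality m's dataset.
    return [[rng.randrange(dataset_sizes[m]) for _ in range(N)]
            for m in range(M)]

def gather_negatives(J, datasets):
    """Assemble the N randomly paired 'unrelated' M-tuples selected by J;
    each tuple would be fed to one joint ELBO evaluation."""
    M, N = len(J), len(J[0])
    return [tuple(datasets[m][J[m][n]] for m in range(M)) for n in range(N)]
```

Because J is sampled once per batch (or precomputed), the negative set costs N joint ELBO evaluations regardless of M.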

H QUALITATIVE RESULTS ON CUB

The generative results of MMVAE, cI-MMVAE and cC-MMVAE on the CUB Image-Caption dataset are shown in Figure 18 , Figure 19 and Figure 20 . Note that for generation in the vision modality, we reconstruct and generate ResNet101 features and perform a nearest-neighbour search over all features in the training set to showcase the generation results.

Figure 18: Qualitative results of MMVAE on CUB Image-Caption dataset, including reconstruction (vision → vision, language → language), cross generation (vision → language, language → vision) and joint generation from prior samples.

Figure 19: Qualitative results of MMVAE trained with contrastive loss with IWAE estimator on CUB Image-Caption dataset, including reconstruction (vision → vision, language → language), cross generation (vision → language, language → vision) and joint generation from prior samples.
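The nearest-neighbour visualisation step described above can be sketched as follows. This is a plain-Python Euclidean search for illustration; a real pipeline over ResNet101 features would use a vectorised library.

```python
import math

def nearest_neighbour(query, train_features):
    """Return the index of the training feature vector closest (in
    Euclidean distance) to a generated feature vector, so the generated
    feature can be displayed as its nearest training image."""
    best_i, best_d = -1, float("inf")
    for i, feat in enumerate(train_features):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, feat)))
        if d < best_d:
            best_i, best_d = i, d
    return best_i
```

Given a generated feature vector, the returned index selects which training-set image to show in Figures 18 to 20.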

