RELATING BY CONTRASTING: A DATA-EFFICIENT FRAMEWORK FOR MULTIMODAL DGMs

Abstract

Multimodal learning for generative models often refers to learning abstract concepts from the commonality of information across multiple modalities, such as vision and language. While it has proven effective for learning generalisable representations, training such models often requires a large amount of "related" multimodal data that shares this commonality, which can be expensive to come by. To mitigate this, we develop a novel contrastive framework for generative model learning, allowing us to train the model not only on the commonality between modalities, but also on the distinction between "related" and "unrelated" multimodal data. We show in experiments that our method enables data-efficient multimodal learning on challenging datasets for various multimodal variational autoencoder (VAE) models. We also show that under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to leverage plentiful unlabelled, unpaired multimodal data.

1. INTRODUCTION

To comprehensively describe concepts in the real world, humans collect multiple perceptual signals of the same object, such as image, sound, text and video. We refer to each of these media as a modality, and a collection of different media featuring the same underlying concept is characterised as multimodal data. Learning from multiple modalities has been shown to yield more generalisable representations (Zhang et al., 2020; Guo et al., 2019; Yildirim, 2014), as different modalities are often complementary in content while overlapping in the common abstract concepts they convey. Despite this motivation, it is worth noting that the multimodal framework is not exactly data-efficient: constructing a suitable dataset requires large amounts of "annotated" unimodal data, as we need to ensure that each multimodal pair is related in a meaningful way. The situation is worse in more complicated multimodal settings such as language-vision, where one-to-one or one-to-many correspondence between instances of the two datasets is required, due to the difficulty of categorising data such that commonality amongst samples is preserved within categories. See Figure 1 for an example from the CUB dataset (Welinder et al., a); although the same species of bird is featured in both image-caption pairs, their content differs considerably. It would be unreasonable to apply the caption from one to describe the bird depicted in the other, necessitating one-to-one correspondence between images and captions. However, the scope of multimodal learning has been limited to leveraging the commonality between "related" pairs, while largely ignoring the "unrelated" samples potentially available in any multimodal dataset, constructed through random pairing between modalities (Figure 3). We posit that if a distinction can be established between the "related" and "unrelated" observations within a multimodal dataset, we could greatly reduce the amount of related data required for effective learning.
Figure 2 formalises this proposal. Multimodal generative models in previous work (Figure 2a) typically assume a single latent variable z that always generates a related multimodal pair (x, y). In this work (Figure 2b), we introduce an additional Bernoulli random variable r that dictates the "relatedness" between x and y through z: x and y are related when r = 1, and unrelated when r = 0.

* Work done while at Oxford.
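The generative process of Figure 2b can be sketched as follows. This is a minimal illustration, not the model's actual parameterisation: the Gaussian priors, the linear "decoders" W_x and W_y, the noise scale and the latent dimensionality are all assumptions made purely to show how r controls whether x and y share a latent z.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8

# Hypothetical linear decoders, one per modality (stand-ins for neural decoders).
W_x = np.eye(LATENT_DIM)
W_y = 2.0 * np.eye(LATENT_DIM)

def sample_pair(p_related=0.5):
    """Sample (x, y, r) from the assumed generative model.

    r ~ Bernoulli(p_related) dictates relatedness:
    if r = 1, both modalities are decoded from the *same* latent z;
    if r = 0, each modality is decoded from an independent latent.
    """
    r = rng.random() < p_related
    z_x = rng.standard_normal(LATENT_DIM)       # z ~ N(0, I)
    z_y = z_x if r else rng.standard_normal(LATENT_DIM)
    x = W_x @ z_x + 0.1 * rng.standard_normal(LATENT_DIM)
    y = W_y @ z_y + 0.1 * rng.standard_normal(LATENT_DIM)
    return x, y, int(r)

x, y, r = sample_pair()
```

Under this sketch, a randomly re-paired (x, y) corresponds exactly to a draw with r = 0, which is what allows "unrelated" data to be manufactured for free from any paired dataset.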



Figure 1: Multimodal data from the CUB dataset

