LEARNING MULTIMODAL DATA AUGMENTATION IN FEATURE SPACE

Abstract

The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.

1. INTRODUCTION

Imagine watching a film with no sound or subtitles. Our ability to learn is greatly enhanced through jointly processing multiple data modalities, such as visual stimuli, language, and audio. These information sources are often so entangled that it would be nearly impossible to learn from only one modality in isolation, a significant constraint on traditional machine learning approaches. Accordingly, there have been substantial research efforts in recent years on developing multimodal deep learning to jointly process and interpret information from different modalities (Baltrušaitis et al., 2017). Researchers have studied multimodal deep learning from various perspectives, such as model architectures (Kim et al., 2021b; Pérez-Rúa et al., 2019; Nagrani et al., 2021; Choi & Lee, 2019), training techniques (Li et al., 2021; Chen et al., 2019a), and theoretical analysis (Huang et al., 2021; Sun et al., 2020b). However, data augmentation for multimodal learning remains relatively unexplored (Kim et al., 2021a), despite its enormous practical impact in single-modality settings.

Indeed, data augmentation has particularly proven its value for data efficiency, regularization, and improved performance in computer vision (Ho et al., 2019; Cubuk et al., 2020; Müller & Hutter, 2021; Zhang et al., 2017; Yun et al., 2019) and natural language processing (Wei & Zou, 2019; Karimi et al., 2021; Fadaee et al., 2017; Sennrich et al., 2015; Wang & Yang, 2015; Andreas, 2020; Kobayashi, 2018). These augmentation methods are largely tailored to a particular modality in isolation. For example, for object classification in vision, we know certain transformations such as translations or rotations should leave the class label unchanged. Similarly, in language, certain sentence manipulations like synonym replacement will leave the meaning unchanged.

The most immediate way of leveraging data augmentation in multimodal deep learning is to separately apply well-developed unimodal augmentation strategies to each corresponding modality. However, this approach can be problematic because transforming one modality in isolation may lead to disharmony with the others. Consider Figure 1, which provides four training examples from SNLI-VE (Xie et al., 2019a), a vision-language benchmark dataset. Each description is paired with the image on the top left, and the label refers to the relationship between the image and description. The bottom row provides four augmented images generated by state-of-the-art image augmentation methods (Cubuk et al., 2019; Müller & Hutter, 2021). In the image generated by AutoAugment-Cifar10 and AutoAugment-SVHN, the plane is entirely cropped out, which leads to mislabeling for data (a), (b), (c), and (d).
In the image generated by AutoAugment-ImageNet, due to the change in smoke color, this plane could be on fire and falling down, which leads to mislabeling for data (a) and (d). In the image generated by TrivialAugment (Müller & Hutter, 2021), a recent image augmentation method that randomly chooses one transformation with a random magnitude, the loop is cropped out, which leads to mislabeling for data (a) and (c). Mislabeling can be especially problematic for over-parameterized neural networks, which tend to confidently fit mislabeled data, leading to poor performance (Pleiss et al., 2020).

There are two key challenges in designing a general approach to multimodal data augmentation. First, multimodal deep learning takes input from a diverse set of modalities. Augmentation transformations can be obvious for some modalities, such as vision and language, but not for others, such as sensory data, which are often numeric or categorical. Second, multimodal deep learning includes a diverse set of tasks with different cross-modal relationships. Some datasets have redundant or totally correlated modalities, while others have complementary modalities. There is no reasonable assumption that would generally preserve labels when augmenting modalities in isolation.

In this work, we propose LeMDA (Learning Multimodal Data Augmentation) as a general multimodal data augmentation method. LeMDA augments the latent representation and thus can be applied to any modalities. We design the augmentation transformation as a learnable module such that
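To make the feature-space idea concrete, the following is a minimal, modality-agnostic sketch, not the authors' implementation: each modality is first mapped by its encoder to a latent vector, and an augmentation step perturbs those latents rather than the raw inputs, so the same mechanism applies equally to image, text, or tabular data. Here the learned augmentation network is stood in for by hypothetical per-dimension noise scales (the function name and parameters below are illustrative assumptions, not LeMDA's API).

```python
import random


def augment_latents(latents, noise_scales, rng=None):
    """Perturb per-modality latent vectors in feature space.

    latents: dict mapping modality name -> list of floats (encoder outputs).
    noise_scales: dict of per-dimension scales, a stand-in for a learned
        augmentation module. Raw images/text/tables are never touched, so
        no modality-specific transformation needs to be specified.
    """
    rng = rng or random.Random(0)
    out = {}
    for name, vec in latents.items():
        scales = noise_scales[name]
        # Additive Gaussian perturbation per latent dimension.
        out[name] = [x + rng.gauss(0.0, s) for x, s in zip(vec, scales)]
    return out


# Example: latents from hypothetical image and text encoders.
latents = {"image": [0.5, -1.2, 0.3], "text": [1.1, 0.0]}
scales = {"image": [0.1, 0.1, 0.1], "text": [0.05, 0.05]}
aug = augment_latents(latents, scales)
```

Because the perturbation lives in the shared latent space, downstream fusion and task networks consume the augmented features exactly as they would the originals; LeMDA replaces the fixed noise scales above with a trained augmentation network.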



Figure 1: The top row shows four training samples drawn from SNLI-VE (Xie et al., 2019a), a visual entailment dataset. Each text description is paired with the image on the top left. The task is to predict the relationship between the image and the text description, which can be "Entailment", "Neutral", or "Contradiction". The bottom row shows four augmented images generated by different image-only augmentation methods. If we pair the text description with the augmented images, we observe mislabeled data. For example, the smoke loop is cropped out in the image augmented via TrivialAugment. The new image does not match the description "The pilot is aware that the plane is doing a loop", as in data (c). However, the label of the augmented pair will still be "Entailment".

