LEARNING MULTIMODAL DATA AUGMENTATION IN FEATURE SPACE

Abstract

The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprised of image, text, and tabular data.

1. INTRODUCTION

Imagine watching a film with no sound, or subtitles. Our ability to learn is greatly enhanced through jointly processing multiple data modalities, such as visual stimuli, language, and audio. These information sources are often so entangled that it would be near impossible to learn from only one modality in isolation, a significant constraint on traditional machine learning approaches. Accordingly, there have been substantial research efforts in recent years on developing multimodal deep learning to jointly process and interpret information from different modalities at once (Baltrušaitis et al., 2017). Researchers have studied multimodal deep learning from various perspectives such as model architectures (Kim et al., 2021b; Pérez-Rúa et al., 2019; Nagrani et al., 2021; Choi & Lee, 2019), training techniques (Li et al., 2021; Chen et al., 2019a), and theoretical analysis (Huang et al., 2021; Sun et al., 2020b). However, data augmentation for multimodal learning remains relatively unexplored (Kim et al., 2021a), despite its enormous practical impact in single modality settings.

Indeed, data augmentation has particularly proven its value for data efficiency, regularization, and improved performance in computer vision (Ho et al., 2019; Cubuk et al., 2020; Müller & Hutter, 2021; Zhang et al., 2017; Yun et al., 2019) and natural language processing (Wei & Zou, 2019; Karimi et al., 2021; Fadaee et al., 2017; Sennrich et al., 2015; Wang & Yang, 2015; Andreas, 2020; Kobayashi, 2018). These augmentation methods are largely tailored to a particular modality in isolation. For example, for object classification in vision, we know certain transformations such as translations or rotations should leave the class label unchanged. Similarly, in language, certain sentence manipulations like synonym replacement will leave the meaning unchanged. The most immediate way of leveraging data augmentation in multimodal deep learning is to separately apply well-developed unimodal augmentation strategies to each corresponding modality. However, this approach can be problematic because transforming one modality in isolation may lead to disharmony with the others. Consider Figure 1, which provides four training examples from SNLI-VE (Xie et al., 2019a), a vision-language benchmark dataset. Each description is paired with the image on the top left, and the label refers to the relationship between the image and description.
The bottom row provides four augmented images generated by state-of-the-art image augmentation methods (Cubuk et al., 2019; Müller & Hutter, 2021). In the images generated by AutoAugment-Cifar10 and AutoAugment-SVHN, the plane is entirely cropped out, which mislabels data (a), (b), (c), and (d). In the image generated by AutoAugment-ImageNet, the change in smoke color suggests the plane could be on fire and falling down, which mislabels data (a) and (d). In the image generated by TrivialAugment (Müller & Hutter, 2021), a recent image augmentation method that randomly chooses one transformation with a random magnitude, the loop is cropped out, which mislabels data (a) and (c). Mislabeling can be especially problematic for over-parameterized neural networks, which tend to confidently fit mislabeled data, leading to poor performance (Pleiss et al., 2020).

There are two key challenges in designing a general approach to multimodal data augmentation. First, multimodal deep learning takes input from a diverse set of modalities. Augmentation transformations are obvious for some modalities, such as vision and language, but not for others, such as sensory data, which are often numeric or categorical. Second, multimodal deep learning includes a diverse set of tasks with different cross-modal relationships. Some datasets have redundant or totally correlated modalities, while others have complementary modalities. There is no reasonable assumption that would generally preserve labels when augmenting modalities in isolation.

In this work, we propose LeMDA (Learning Multimodal Data Augmentation), a general multimodal data augmentation method. LeMDA augments the latent representations and thus can be applied to any modalities. We design the augmentation transformation as a learnable module so that it is adaptive to various multimodal tasks and cross-modal relationships.
Our augmentation module is learned together with the multimodal network to produce informative data through adversarial training, while preserving semantic structure through consistency regularization. With no constraints on the modalities and tasks, one can simply plug-and-play LeMDA with different multimodal architectures. We summarize our contributions as follows.

• In Section 3, we introduce LeMDA, a novel approach to multimodal data augmentation. Section 3.1 shows how to use LeMDA with multimodal networks, and Section 3.2 describes how to train the augmentation module to produce informative, label-preserving data. The method is notable for several reasons: (1) it can be applied to any combination of modalities; (2) it is attractively simple and easy to use; (3) it is the first augmentation method applied jointly to text, image, and tabular data, which is essentially uncharted territory.

• In Section 4, we show that LeMDA consistently boosts accuracy for multimodal deep learning architectures compared to a variety of baselines, including state-of-the-art input augmentation and feature augmentation methods.

• In Section 4.4, we provide an ablation study validating the design choices behind LeMDA. In particular, we study the architecture of the augmentation module and the effects of the consistency regularizer, demonstrating that the consistency regularizer clearly outperforms L2 regularization (Tang et al., 2020b).

2. BACKGROUND AND RELATED WORK

Multimodal network architectures. Multimodal deep learning architectures are categorized as performing early or late fusion, depending on the stage at which information from each modality is combined. In early fusion, the network combines the raw inputs or token embeddings from all modalities. Early fusion architectures can be designed to exploit interactions between low-level features, making them a good choice for multimodal tasks with strong cross-modal correlations (Barnum et al., 2020; Gao et al., 2020). For example, low-level correspondences exist in image captioning because different words in the caption may relate to different objects in the image. We note that feature-space augmentation procedures are typically computationally intractable on early-fusion architectures, because early fusion would require combining a large number of latent features, such as a long sequence of token embeddings. In late fusion, the focus of our work, input from each modality is independently processed by a different backbone. The representations produced by the backbones are fused in later layers, often just before the classifier layer (Shi et al., 2021a; Wang et al., 2017; Schönfeld et al., 2019; Mahajan et al., 2020). This design is straightforward to apply to any new modality and any multimodal task, and late fusion often uses pre-trained networks as backbones for each modality, making it more computationally tractable. In both early and late fusion, there are a variety of ways to fuse information. Standard approaches include (1) feeding all modalities as token embeddings into the network, (2) performing cross-attention between modalities, (3) concatenating the representations from all modalities, and (4) combining the predictions from each modality in an ensemble (Baltrušaitis et al., 2017).
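As a concrete illustration of late fusion with concatenation (option 3 above), the following minimal sketch uses small MLP encoders as stand-ins for pre-trained backbones; the class name, dimensions, and module choices are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

# A minimal late-fusion classifier: each modality has its own backbone,
# and the resulting feature vectors are concatenated before a fusion MLP.
class LateFusionNet(nn.Module):
    def __init__(self, dims, num_classes, hidden=64):
        super().__init__()
        # one small encoder per modality (stand-ins for pre-trained backbones)
        self.backbones = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims]
        )
        self.fusion = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, inputs):
        # encode each modality separately, then fuse by concatenation
        feats = [b(x) for b, x in zip(self.backbones, inputs)]
        return self.fusion(torch.cat(feats, dim=-1))

# two toy modalities with feature dimensions 32 and 16
net = LateFusionNet(dims=[32, 16], num_classes=3)
logits = net([torch.randn(4, 32), torch.randn(4, 16)])
```

Because the backbones run independently until the concatenation, a feature-space augmentation module can be inserted between the encoders and the fusion MLP without touching either side.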
Researchers usually design the multimodal network by considering the task objective, the amount of data available, and the computation budget (Shi et al., 2021b; Chen et al., 2019b; Li et al., 2022; Tsai et al., 2019; Mahajan & Roth, 2020). Baltrušaitis et al. (2017) provides further reading.

Data augmentation for single modality tasks. Data augmentation is widely adopted in vision and natural language tasks. In vision, we can manually intervene on a per-task basis to apply transformations that should leave the label invariant, e.g., translations, rotations, flipping, cropping, and color adjustments. A transformation on one task may not be suitable for another: for example, flipping may be reasonable on CIFAR-10, but would lose semantic information on MNIST, because a flipped '6' becomes a '9'. Accordingly, there are a variety of works on automatic augmentation in vision, including neural architecture search (Ho et al., 2019; Cubuk et al., 2020), reinforcement learning (Cubuk et al., 2019), generative modelling (Ratner et al., 2017), mixing aspects of the existing data (Zhang et al., 2017; Yun et al., 2019), and adversarial training for informative examples (Fawzi et al., 2016; Goodfellow et al., 2015; Zhang et al., 2019; Suzuki, 2022; Tang et al., 2020b; Tsai et al., 2017).
Similarly, in natural language processing there are a variety of standard interventions (replacement, deletion, swapping) (Wei & Zou, 2019; Karimi et al., 2021; Fadaee et al., 2017), and more automatic approaches such as back-translation (Sennrich et al., 2015), context augmentation (Wang & Yang, 2015; Andreas, 2020; Kobayashi, 2018), and linear interpolation of training data (Sun et al., 2020a). Data augmentation is less explored for tabular data, but techniques from vision, such as mixup (Zhang et al., 2017) and adversarial training (Goodfellow et al., 2015), have recently been adapted to the tabular setting with promising results (Kadra et al., 2021).

Algorithm 1: LeMDA training
    while F not converged do
        Sample a mini-batch x from X
        Compute z ← F_before(x)
        Generate augmented features G(z)
        ŷ ← F_after(z), ŷ_G ← F_after(G(z))
        Update the augmentation network G by the stochastic gradient −∇L(ŷ_G) + ∇L_consist(ŷ, ŷ_G)
        Update the task network F by the stochastic gradient ∇L(ŷ) + ∇L(ŷ_G)
    end while

Latent space augmentation is much less explored than input augmentation, as it is less obvious what transformations to apply. To augment latent vectors produced by passing data inputs through a neural network (feature space augmentation), researchers have considered interpolation, extrapolation, noise addition, and generative models (Verma et al., 2019; Liu et al., 2018; Kumar et al., 2019).

Multimodal data augmentation. There are a small number of works considering multimodal data augmentation, primarily focusing on vision-text tasks. In visual question answering, Tang et al. (2020a) propose to generate semantically similar data by applying back-translation on the text and adversarial noise on the image. Wang et al. (2021) generate text based on images using a variational autoencoder. In cross-modal retrieval, Gur et al. (2021) query similar data from external knowledge sources.
The state-of-the-art augmentation procedure for visual-language representation learning, MixGen (Hao et al., 2022), generates new image-text pairs by interpolating between images and concatenating texts. All prior work on multimodal data augmentation relies on tailored modality-specific transformations. By contrast, our proposed approach is fully automatic and can be applied to arbitrary modalities. Indeed, for the first time, we consider augmentation jointly over the tabular, image, and language modalities. Moreover, even on image-text problems, we show that our approach outperforms MixGen, the state-of-the-art tailored approach.

3. LEMDA: LEARNING MULTIMODAL DATA AUGMENTATION

We now introduce LeMDA, a simple and automatic approach to multimodal data augmentation. LeMDA learns an augmentation network G along with the multimodal task network F to generate informative data that preserves semantic structure. In Sections 3.1 and 3.2 we describe how we train the task network and the augmentation network, respectively. We summarize the training procedure for LeMDA in Figure 2 and Algorithm 1. In Section 3.3 we describe the design of the augmentation network, and in Section 3.4 we provide intuition for the consistency loss.

3.1. TRAINING THE TASK NETWORK

The task network can be divided into two parts at the fusion layer: F(x) = F_after(F_before(x)), where F_before denotes the layers before fusion and F_after the layers after fusion. Given a training sample x, we pass x through F_before to obtain the latent features for each modality, {z_i}_{i=1}^N = F_before(x), where N is the number of modalities. Taking {z_i}_{i=1}^N as input, the augmentation network G generates additional latent vectors G({z_i}_{i=1}^N). Both {z_i}_{i=1}^N and G({z_i}_{i=1}^N) are fed through the rest of the task network F_after as distinct training data. The task network is then trained in the standard way, applying the task loss to both the original and the augmented data, to find

    min E_{x∼X} [L(ŷ) + L(ŷ_G)],

where ŷ = F_after(F_before(x)) and ŷ_G = F_after(G(F_before(x))).
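The decomposition above can be sketched in a few lines, with toy linear layers standing in for F_before, F_after, and G; all module choices and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the task-network objective: features at the fusion layer are
# augmented by G, and both original and augmented features contribute to
# the task loss.
f_before = nn.Linear(10, 8)   # stands in for the per-modality backbones
f_after = nn.Linear(8, 3)     # stands in for the fusion layers + classifier
g = nn.Linear(8, 8)           # stands in for the augmentation network

x = torch.randn(4, 10)
y = torch.randint(0, 3, (4,))

z = f_before(x)               # latent features at the fusion layer
y_hat = f_after(z)            # prediction from original features
y_hat_g = f_after(g(z))       # prediction from augmented features

# the task network minimizes the loss on both original and augmented data
task_loss = F.cross_entropy(y_hat, y) + F.cross_entropy(y_hat_g, y)
```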


3.2. TRAINING THE AUGMENTATION NETWORK

Inspired by adversarial data augmentation, we optimize the parameters of the augmentation network to maximize the task loss, so that the augmented data encourages the task network's representation to be updated. At the same time, we introduce a consistency regularizer that encourages similar output distributions on the original and augmented data, in order to preserve the semantic structure. Formally, we train the augmentation network to find

    max E_{x∼X} [L(ŷ_G)]  while minimizing  E_{x∼X} [L_consist(ŷ, ŷ_G)],

where L_consist(ŷ, ŷ_G) denotes a divergence metric, such as the Kullback-Leibler divergence, between the logit outputs ŷ on original data and ŷ_G on augmented data.

Confidence masking. For classification problems, we apply the consistency term only to samples whose highest predicted probability exceeds a threshold α. If the task network cannot make a confident prediction, that prediction is unlikely to be a good reference for the ground-truth label.

Design decisions. The simplicity and generality of this approach, combined with its strong empirical performance in Section 4, are LeMDA's most appealing features. The few design decisions for training involve how the consistency regularizer should be defined and to what extent it should be applied. For example, as an alternative to a KL-based consistency regularizer, we could minimize the L2 distance between the augmented and original feature vectors as a proxy for preserving the label of the augmentation. We provide ablations of these factors in Section 4.4.
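A minimal sketch of the consistency regularizer with confidence masking, assuming KL divergence as the divergence metric; the function name and the toy logits below are illustrative, not from the paper's code.

```python
import torch
import torch.nn.functional as F

# KL consistency between predictions on original and augmented features,
# applied only to samples whose max predicted probability exceeds alpha.
def consistency_loss(y_hat, y_hat_g, alpha=0.5):
    p = F.softmax(y_hat, dim=-1).detach()          # original prediction as reference
    log_q = F.log_softmax(y_hat_g, dim=-1)         # prediction on augmented features
    kl = F.kl_div(log_q, p, reduction="none").sum(dim=-1)
    mask = (p.max(dim=-1).values > alpha).float()  # keep only confident samples
    return (kl * mask).sum() / mask.sum().clamp(min=1)

# one confident sample and one unconfident sample
y_hat = torch.tensor([[4.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
y_hat_g = torch.zeros(2, 3)
loss = consistency_loss(y_hat, y_hat_g, alpha=0.5)
```

With α = 0.5, only the first (confident) sample contributes to the loss; raising α high enough masks out every sample and the regularizer vanishes.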

3.3. THE DESIGN OF AUGMENTATION NETWORK

The augmentation network can take various forms depending on the multimodal learning task and the fusion strategy. In our experiments, we use a variational autoencoder (VAE) as the augmentation network, since VAEs have generally been found effective for augmentation purposes (Tang et al., 2020b). We consider two architectural choices:

MLP-VAE: The encoder and decoder of the VAE are MLPs. {z_i}_{i=1}^N are concatenated as the input.

Attention-VAE: The encoder and decoder are made of self-attention and feedforward networks. {z_i}_{i=1}^N are treated as N tokens, where each token has embedding z_i.

The standard VAE has two loss terms, the reconstruction loss and the KL divergence regularizer; we adopt only the KL regularizer on the encoder distribution. The update step for the augmentation network is −∇L(ŷ_G) + ∇L_consist(ŷ, ŷ_G) + ∇L_VAE, where L_VAE refers to the KL divergence regularizer on the latent encoding distribution.

The major factor in deciding between MLP-VAE and Attention-VAE is the architecture of the multimodal task network. With late fusion architectures, the primary focus of this paper, z_i is the representation from a single-modality backbone (e.g., the CLS embedding from a BERT model), and N is the number of modalities or backbone models. We can concatenate {z_i}_{i=1}^N into one vector as input to MLP-VAE, or treat {z_i}_{i=1}^N as a sequence of N tokens for Attention-VAE. Attention-VAE may be less intuitive here because N is usually small in late fusion architectures (2 or 3 in our experiments). We provide a performance comparison between the two architectures in Section 4.4. For early fusion architectures, on the other hand, z_i could be a sequence of token embeddings for a text or a sequence of patch embeddings for an image; concatenation would produce a very high-dimensional input, which makes MLP-VAE less favorable.
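A minimal MLP-VAE in the spirit described above: concatenated per-modality features go in, an augmented feature vector and the KL term come out, and the reconstruction loss is omitted. Layer sizes and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

# MLP-VAE augmentation network: encode concatenated features into a small
# latent Gaussian, sample with the reparameterization trick, and decode a
# sample back to feature space. Only the KL term on the encoder
# distribution is kept as a loss.
class MLPVAE(nn.Module):
    def __init__(self, feat_dim, latent_dim=8, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )

    def forward(self, z_cat):
        h = self.encoder(z_cat)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)
        sample = mu + torch.exp(0.5 * logvar) * eps  # reparameterization trick
        # KL(q(latent | z) || N(0, I)) -- the only VAE loss term that is kept
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.decoder(sample), kl

vae = MLPVAE(feat_dim=24)
aug, kl = vae(torch.randn(4, 24))
```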
3.4. INTUITION FOR THE CONSISTENCY REGULARIZER

In Figure 3 we provide intuition for the consistency regularizer using a simple illustrative binary classification problem. The darker background corresponds to higher task training loss, the solid green line is the ground-truth decision boundary, and the dashed green line is the model's decision boundary. Starting from a point in feature space, moving to D1 or D2 provides a similar increase in task loss, so both are equally favored by the adversarial loss term. However, D2 crosses the model's decision boundary and would therefore be heavily penalized by the consistency regularizer, as we would hope, since such a point is likely to have a different class label. An L2 regularizer between the original and augmented points in feature space, by contrast, has no preference between D1 and D2, as they are an equal distance from the starting point. Empirically, in Section 4, we see the consistency loss confers accuracy improvements over both pure adversarial training and the L2 regularizer. Similar intuition appears in Suzuki (2022), which uses the logit distribution from a teacher model (an exponential moving average of the model's weights) as a soft target so that the augmented data remains recognizable to the teacher, and in Xie et al. (2019b), which designs an unsupervised training objective that encourages similar logits for augmented data.

4. EXPERIMENTS

We evaluate LeMDA on a diverse set of real-world multimodal datasets, curating a list of public datasets covering image, text, numerical, and categorical inputs. Table 1 provides a summary of the sources, statistics, and modality identities. We introduce baselines in Section 4.1 and describe experimental settings in Section 4.2. We provide the main evaluation results in Section 4.3. Finally, we investigate the effects of the consistency regularizer and the choice of augmentation model architecture in Section 4.4.

4.1. BASELINES

To the best of our knowledge, there exist no general-purpose multimodal augmentation methods. We compare against a diverse set of state-of-the-art data augmentation methods from vision, language, and vision-text tasks. We additionally consider baselines for feature augmentation, since LeMDA augments in feature space. Finally, we compare with state-of-the-art multimodal augmentation methods from vision-text tasks, although we note that, unlike LeMDA, these methods are not general purpose and cannot be directly applied to our datasets with tabular inputs.

• Input Augmentation. We apply state-of-the-art input augmentation independently to the data from each modality. For images, we use TrivialAugment (Müller & Hutter, 2021), a simple and effective method for image classification tasks. For text, we apply EDA (Wei & Zou, 2019) and AEDA (Karimi et al., 2021), randomly sampling one transformation from all those proposed in EDA and AEDA with a randomly generated magnitude.

• Mixup. Mixup was originally proposed to interpolate between two training images for image classification. We adopt the original Mixup for images and numerical features and extend it to text and categorical features. Specifically, given a pair of data points, we construct the mixed data as follows: we generate a random number j uniformly between 0.0 and 1.0; if j < α, we use the first data point, else we use the second.

• Manifold Mixup. Manifold Mixup (Verma et al., 2019) interpolates between hidden representations and thus can be applied to all modalities. We apply Manifold Mixup to the same features in the multimodal network as LeMDA.

• MixGen. MixGen (Hao et al., 2022) is a state-of-the-art data augmentation method designed specifically for vision-text tasks; it generates new data by interpolating images and concatenating text. We apply MixGen to datasets consisting only of images and text.
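The Mixup extension for discrete modalities described above amounts to a random selection between the two samples; the sketch below illustrates this under that reading, with an illustrative function name and toy category labels.

```python
import random

# Mixup for discrete modalities (text, categorical): interpolation is
# replaced by randomly choosing one of the two samples, with the choice
# probability controlled by alpha.
def mix_discrete(a, b, alpha, rng=random):
    # draw j ~ U(0, 1); take the first sample if j < alpha, else the second
    return a if rng.random() < alpha else b

random.seed(0)
mixed = [mix_discrete("cat_A", "cat_B", alpha=0.8) for _ in range(1000)]
frac_a = mixed.count("cat_A") / len(mixed)  # fraction choosing the first sample
```

Over many draws, the first sample is chosen with probability close to α, mirroring how continuous Mixup weights the first input by α.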

4.2. EXPERIMENT SETUP

We use Multimodal-Net (Shi et al., 2021a) for all datasets except SNLI-VE. Multimodal-Net passes the input from each modality through a separate backbone, concatenates the representations (e.g., the CLS embeddings) from all backbones, and passes them through a fusion MLP. We use the default hyperparameters provided by Multimodal-Net and plug LeMDA in before the fusion layer. We use ConvNet as the image backbone and ELECTRA as the text backbone. To further demonstrate LeMDA's generalizability, we evaluate LeMDA with the early-fusion architecture ALBEF (Li et al., 2021), which fuses image patch embeddings and text token embeddings. We keep all original configurations except the batch size, which we set to half of the default due to limitations in computation memory, and we load the 4M pre-trained checkpoint. In this setting, we apply LeMDA before the cross-attention layer; the augmentation network augments every image patch embedding and every text token embedding. For LeMDA, we set the confidence threshold α for the consistency regularizer to 0.5, and we study this choice in Section 4.4. For the baselines, we follow the recommended hyperparameters: for Mixup and Manifold Mixup, we set α to 0.8, and for MixGen, we set λ to 0.5.

4.3. MAIN RESULTS

We summarize the performance comparison in Table 2. Plugging LeMDA into both Multimodal-Net and ALBEF leads to consistent accuracy improvements, with some particularly notable gains, such as a 6% increase in accuracy on both Hateful Memes and Petfinder. Table 2 also illustrates how LeMDA performs compared to the baselines. We see that single modality input augmentation methods can hurt accuracy, for example on News Channel, in accordance with the intuition from our introductory example in Figure 1. Mixup can also hurt accuracy, for example on Wine Reviews. Similarly, in the latent space, Manifold Mixup fails to improve accuracy across datasets, and results in accuracy drops on Melbourne Airbnb and Wine Reviews. In contrast, LeMDA consistently improves upon the original architectures and provides clearly better performance than a wide range of baselines.

4.4. ABLATION STUDY

We now perform three ablation studies to support the design choices of LeMDA.

Regularizer. We argue in Section 3.4 that the consistency regularizer helps preserve the semantic structure of augmentations. In Table 3, we see that the consistency regularizer significantly improves performance and also outperforms L2 regularization in feature space. While L2 regularization attempts to keep augmented features close in distance to the original features as a proxy for semantic similarity, the consistency regularizer has access to the softmax outputs of the task network on the original and augmented data, providing direct information about labels.

Architecture Difference. We consider the two augmentation architectures introduced in Section 3.3, MLP-VAE and Attention-VAE. In Table 4 we see that both architectures increase performance over no augmentation, and that MLP-VAE generally outperforms Attention-VAE. We suspect the reason is that Multimodal-Net passes the concatenation of N latent vectors into the fusion layers, where N is the number of modalities (2 or 3 in our experiments); for Attention-VAE, this means that the input is only 2 or 3 tokens. However, we note that MLP-VAE is not reasonable for ALBEF, since it would require concatenating thousands of tokens.

Confidence Masking. We investigate the effect of confidence masking, as well as the choice of α, in Table 5. Here α = 0 means no masking, so all training data are used to calculate the consistency loss. We see that confidence masking generally leads to higher accuracy, and that performance is not particularly sensitive to the precise value of α.

4.5. THE RELATIONSHIP BETWEEN MODALITIES

We can categorize the relationship between the available modalities by examining P(y|x), where y ∼ Y and Y is the target domain. Let x = {x_1, x_2, . . . , x_N} consist of N modalities.

Perfect correlation: P(y|x) = P(y|x_n) for some modality n. Essentially, one modality alone provides enough information to make the right prediction. Nevertheless, the data still comprises multiple modalities, for reasons such as easier training (Huang et al., 2021). One example is Food101, where the task is to predict the food from the text of a recipe and a photo of the food.

Complementary: P(y|x) = P(y|{x_1, x_2, . . . , x_N}). Information aggregated from all modalities is necessary to make the right prediction; each modality complements the others, and missing one modality would lead to information loss. One example is Hateful Memes, where only the combined meaning of text and image indicates harmful content.

The design of LeMDA does not exploit any assumption about the cross-modal relationship, and we observe in Table 2 that LeMDA consistently improves performance regardless of the relationship.

5. CONCLUSION

Jointly learning from multiple different modalities will be crucial in our quest to build autonomous intelligent agents. We introduce LeMDA, the first method for jointly learning data augmentation across arbitrary modalities. LeMDA is simple, automatic, and achieves promising results over a wide range of experiments. Moreover, our results provide several significant conceptual findings about multimodal data augmentation in general: (1) separately augmenting each modality performs much worse than joint augmentation; (2) although feature augmentation is less popular than input augmentation for single-modality tasks because it is less interpretable, feature augmentation is particularly promising for modality-agnostic settings; (3) a learning-based multimodal augmentation policy can outperform even tailored augmentations, and significantly improve accuracy when augmentation transformations are not obvious, such as for categorical data. Our investigation has primarily focused on late-fusion architectures, showing strong results over a wide range of settings. In general, applying feature augmentation strategies to early-fusion architectures remains an open question: early fusion combines a large number of latent features (e.g., a long sequence of token embeddings), making it typically intractable to augment every latent feature. Our experiment with an early-fusion architecture suggests, however, that developing more efficient augmentation networks, or selectively generating only a few important latent vectors, is a promising direction for future work.

A MORE DETAILS ON THE DESIGN A.1 DETAILED ARCHITECTURES OF THE AUGMENTATION NETWORK

We consider two VAE architectures in LeMDA, depending on the architecture of the task network. The latent dimension of the VAE is set to 8. We adopt the KL divergence regularizer on the encoder distribution; note that we do not use a reconstruction loss between the input and the output. In MLP-VAE, the encoder and decoder are standard fully connected layers with ReLU activations, and dropout is used with p = 0.5. In Attention-VAE, the encoder is implemented with torch.nn.TransformerEncoder, setting num_layers to 4 and nhead to 8. One fully connected layer maps the encoder output to the latent distribution, and the decoder is symmetric to the encoder. Features from all modalities are treated as token embeddings with no cross-attention. We use a VAE for its simplicity; the main focus of this paper is to demonstrate the effectiveness of a learnable augmentation network for multimodal learning. Other generative models, such as diffusion models and GANs, are also valid architectures. The main concern may lie in efficiency, and we leave this direction as future work.
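The Attention-VAE encoder described above can be instantiated directly with torch.nn.TransformerEncoder using the stated num_layers=4 and nhead=8; the embedding dimension and batch size below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Self-attention encoder for Attention-VAE: per-modality features are
# treated as a short token sequence with no cross-attention.
d_model = 64  # illustrative embedding dimension
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

# N = 3 modality features as a sequence of 3 tokens, batch of 4
tokens = torch.randn(4, 3, d_model)
out = encoder(tokens)
```

With late fusion, the sequence length equals the number of modalities (2 or 3 here), which is why the paper finds this architecture less compelling than MLP-VAE in that setting.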

A.2 IMPLEMENTATION DETAILS OVER THE TRAINING PROCEDURE

In practice, we iteratively train the task and augmentation networks using the same batch of training data. Specifically, we perform two separate forward passes through F_after for easy implementation with PyTorch Autograd. We use two optimizers, one for the task network and one for the augmentation network.
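One alternating training step with two optimizers and two forward passes can be sketched as follows; the toy modules, learning rates, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One alternating step: G ascends the task loss while descending the
# consistency loss; F descends the task loss on original and augmented data.
torch.manual_seed(0)
f_before, f_after = nn.Linear(10, 8), nn.Linear(8, 3)
g = nn.Linear(8, 8)  # stand-in for the augmentation network
opt_task = torch.optim.SGD(
    list(f_before.parameters()) + list(f_after.parameters()), lr=0.1
)
opt_aug = torch.optim.SGD(g.parameters(), lr=0.1)

x, y = torch.randn(16, 10), torch.randint(0, 3, (16,))

# --- update the augmentation network ---
z = f_before(x).detach()  # backbone is not updated by the augmentation step
y_hat, y_hat_g = f_after(z), f_after(g(z))
consist = F.kl_div(
    F.log_softmax(y_hat_g, -1), F.softmax(y_hat, -1).detach(),
    reduction="batchmean",
)
# negate the task loss so gradient descent maximizes it
aug_loss = -F.cross_entropy(y_hat_g, y) + consist
opt_aug.zero_grad(); aug_loss.backward(); opt_aug.step()

# --- update the task network on a second forward pass ---
z = f_before(x)
task_loss = (
    F.cross_entropy(f_after(z), y)
    + F.cross_entropy(f_after(g(z).detach()), y)
)
opt_task.zero_grad(); task_loss.backward(); opt_task.step()
```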

B EXPERIMENT DETAILS B.1 ADDITIONAL STUDIES ON THE TRAINING COST

One limitation of a learning-based approach is the extra training cost. LeMDA optimizes the augmentation network along with the task network and therefore incurs extra training cost. Here, we investigate the training throughput to provide a more complete understanding of the method. We summarize the training throughput (iterations/second) in Table 6. As expected, we observe lower throughput for LeMDA compared to the other baselines. However, efficiency can be improved. The most straightforward direction is to reduce the frequency of updating the augmentation network: currently the augmentation network is updated every iteration, yet the parameters of the task network change slowly, especially in the later stages of training. We leave this as a future direction.



Code is available at https://github.com/lzcemma/LeMDA/



Figure 1: The top row shows four training samples drawn from SNLI-VE (Xie et al., 2019a), a visual entailment dataset. Each text description is paired with the image on the top left. The task is to predict the relationship between the image and the text description, which can be "Entailment", "Neutral", or "Contradiction". The bottom row shows four augmented images generated by different image-only augmentation methods. If we pair the text description with the augmented images, we observe mislabeled data. For example, the smoke loop is cropped out in the image augmented via TrivialAugment. The new image does not match the description "The pilot is aware that the plane is doing a loop", as in data (c). However, the label of the augmented pair will still be "Entailment".

Figure 2: LeMDA training as described in Algorithm 1. Top: the training process for the task network. Latent representations for each modality z_i are passed into the augmentation network, which generates a new latent vector for each modality. Both original features and augmented features are passed into the rest of the task network. Bottom: the training process for the augmentation network. The augmentation network is trained to maximize task loss while minimizing consistency loss. We describe our standard choices for fusion in Section 2, and the design of our augmentation network in Section 3.3.

Figure 3: Motivation for the consistency regularizer. The solid and dashed green lines are the ground truth and model decision boundaries, respectively. Darker background corresponds to a higher loss for the task network. We intuitively prefer D1, because the augmented point should be informative but preserve the same label. The consistency loss will prefer D1 over D2, because D2 crosses the model's decision boundary, even though both points incur the same training loss.

Table 1: A summary of the source, statistics, and modality identity of each dataset.

Table 2: LeMDA not only significantly increases accuracy over the original architectures but also outperforms all baselines.

Table 6: Training throughput, measured in iterations/second. Experiments were conducted on a server with 8 V100 GPUs. As expected, learning-based approaches incur higher training costs.

B.2 ADDITIONAL STUDIES ON THE HYPER-PARAMETERS

The optimization for the augmentation network is a min-max game, which requires hyperparameters to balance the competing losses. Specifically, the update is −w_1 ∇L(ŷ_G) + w_2 ∇L_consist(ŷ, ŷ_G) + w_3 ∇L_VAE, where L_VAE refers to the KL divergence regularizer on the latent encoding distribution. In our main experiments, we use w_1 = 0.0001, w_2 = 0.1, w_3 = 0.1 on all datasets except Melbourne Airbnb and SNLI-VE, where we use w_1 = 0.001, w_2 = 0.1, w_3 = 0.1. Note that the hyperparameters are relatively consistent across datasets. Further, we investigate the influence of different combinations of w_1, w_2, and w_3; we summarize the results on Petfinder in Table 7. We observe consistent improvements over the original multimodal network across the various combinations.

