RETHINKING MISSING MODALITY LEARNING: FROM A DECODING VIEW

Abstract

The conventional multimodal learning pipeline consists of three stages: encoding, fusion, and decoding. Most existing methods for the missing modality condition focus on the first stage and aim to learn modality-invariant representations or to reconstruct missing features. However, these methods rely on strong assumptions (i.e., all pre-defined modalities are available for every input sample during training and the number of modalities is fixed). To address this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions occur in both the training and inference phases, including modality combinations unseen during training. Unlike previous methods, we improve the decoding stage. Concretely, IPD jointly learns common and modality-specific task prototypes. Since the number of missing modality conditions scales exponentially with the number of modalities, O(2^n), and different conditions may interact implicitly, a low-rank partial prototype decomposition, supported by theoretical analysis, is employed for the modality-specific components to reduce this complexity. The decomposition can also promote generalization to unseen conditions by reusing the modality factors of existing conditions. To encourage the low-rank structure, we further constrain the explicit interaction of specific modality conditions with disentangled contrastive constraints. Extensive results on newly created benchmarks for multiple tasks demonstrate the effectiveness of the proposed model.

1. INTRODUCTION

Multimodal learning is an increasingly popular yet challenging task spanning both computer vision and natural language processing Ben-Younes et al. (2017); Do et al. (2019); Gabeur et al. (2020); Liu et al. (2018). Its goal is to exploit the complementary information contained in multimodal data to improve performance on various tasks. Many strong multimodal approaches have been developed under an ideal situation. However, a common assumption underlying these approaches is modality completeness (i.e., all modalities are available in both training and testing data). In practice, this assumption often fails in the real world. For example, some uploaded YouTube videos have no audio track, and a black screen may occur during a broadcast, causing the loss of the visual modality. Although many efforts have been devoted to developing effective methods for missing modality conditions in the training and inference stages, there is no unified paradigm that accounts for all possible scenarios. For instance, Pham et al. (2019); Zhao et al. (2021) only consider the incompleteness of testing data, yet collecting large amounts of complete training data is extremely labor-intensive. Ma et al. (2021) formulates a new setting that considers the incompleteness of training data, but uses only two modalities. With the rapid development of feature-extraction techniques, a single data sample may carry more than two kinds of modality representation. In this paper, we focus on a more general setting (as shown in Fig. 1), where the number of modalities is arbitrary and various incomplete modality conditions may occur, to systematically study the missing modality problem.
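To make the scale of this general setting concrete, the following sketch enumerates every possible modality condition for a given set of modalities; the function name and modality labels are illustrative, not from the paper.

```python
from itertools import chain, combinations

def modality_conditions(modalities):
    """Enumerate every non-empty subset of the modality set; each subset is
    one possible (complete or incomplete) modality condition."""
    return list(chain.from_iterable(
        combinations(modalities, k) for k in range(1, len(modalities) + 1)))

conds = modality_conditions(["M1", "M2", "M3"])
print(len(conds))  # 2^3 - 1 = 7 conditions for n = 3 modalities
```

The count grows as O(2^n), which is exactly why training a separate model per condition quickly becomes infeasible.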
We propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) to capture both the universality and the particularity of different modality conditions (complete and incomplete). To the best of our knowledge, we are the first to study generalizable missing modality learning from the decoding perspective. Following Zhou et al. (2021; 2022), we treat the weights of the standard linear classifier as task prototypes. Since different modality conditions correspond to different task prototypes, and training a separate model for every condition is time-consuming, IPD jointly learns common and modality-specific prototypes. Because the number of missing modality conditions scales exponentially with the number n of modalities, O(2^n), and different conditions may interact implicitly, a low-rank partial prototype decomposition with theoretical analysis is employed for the modality-specific components to reduce this complexity. To encourage the low-rank structure, we further constrain the explicit interaction of specific modality conditions with disentangled contrastive constraints. We conduct extensive experiments on newly created benchmarks for multiple tasks; the results show that IPD achieves competitive performance compared with state-of-the-art methods for conventional multimodal learning. The main contributions can be summarized as follows:
• We propose a novel method called Interaction Augmented Prototype Decomposition (IPD) for generalizable missing modality learning, which jointly learns common and modality-specific prototypes. To the best of our knowledge, this is the first attempt to improve the decoding stage.
• Considering that the number of missing modality conditions scales exponentially with the number n of modalities, O(2^n), and that different conditions may interact implicitly, a low-rank partial prototype decomposition is employed to reduce this complexity. The decomposition can also promote generalization to unseen conditions by reusing the modality factors of existing conditions.
• We constrain the explicit interaction of specific modality conditions by employing disentangled contrastive constraints.
• We conduct low-rank ensemble learning to enhance performance under conditions with relatively few modalities.
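The decoding-side idea above can be sketched as follows: keep one shared prototype matrix plus per-modality low-rank factors, so only n factor tensors (rather than 2^n condition-specific classifiers) are stored, and unseen combinations reuse the factors of seen modalities. All dimensions and the element-wise composition rule here are illustrative assumptions, not necessarily the paper's exact decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, r = 16, 4, 3                      # feature dim, classes, rank (illustrative)
modalities = ["M1", "M2", "M3"]

# Decoding-side parameters: one common prototype matrix, plus per-*modality*
# low-rank factors -- n factor tensors instead of 2^n prototype matrices.
common = rng.normal(size=(d, c))
factors = {m: rng.normal(size=(r, d, c)) for m in modalities}

def prototypes(condition):
    """Compose the classifier prototype for a modality condition: the common
    component plus a specific part built from the factors of the present
    modalities (element-wise product over modalities, summed over the rank
    dimension -- one plausible low-rank composition)."""
    spec = np.ones((r, d, c))
    for m in condition:
        spec = spec * factors[m]
    return common + spec.sum(axis=0)

x = rng.normal(size=(2, d))             # a batch of fused features
logits = x @ prototypes(["M1", "M3"])   # composable even for unseen combinations
print(logits.shape)  # (2, 4)
```

Because each condition's specific prototype is assembled from shared per-modality factors, conditions that overlap in modalities interact implicitly through those factors, which is the structure the disentangled contrastive constraints are meant to regularize.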



Figure 1: An example of generalizable missing modality learning, where M1, M2, M3 denote three modalities. There are unseen modality combinations in the testing stage.

2. RELATED WORK

Multimodal learning utilizes the complementary information contained in multimodal data to improve performance on various tasks. The key to this area is multimodal fusion. Early fusion methods are mainstream and integrate features of different modalities before feeding them to the task modules. For example, concatenating different features Zadeh et al. (2016b) is a simple approach. Zadeh et al. (2017) proposes a product operation to allow more interaction among modalities during fusion. Liu et al. (2018) addresses the large complexity of Zadeh et al. (2017) and utilizes modality-specific factors to achieve efficient low-rank fusion. However, intra-modal dynamics cannot be effectively captured by the above methods. Liang et al. (2019) employs low-rank fusion at each time step of multi-view sequential input. Tsai et al. (2019) replaces the LSTM with a Transformer due to its powerful encoding capacity. Rahman et al. (2020) employs large-scale pre-trained BERT embeddings for textual modeling. Liang et al. (2021) proposes modality-invariant crossmodal attention that learns crossmodal interactions over a modality-invariant space, in which the distribution mismatch between modalities is well bridged. However, none of these methods consider the missing modality problem. To this end, we propose IPD to solve it.
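The low-rank fusion idea attributed to Liu et al. (2018) can be illustrated with a two-modality sketch: each modality gets its own rank-r factors, and the fused vector is obtained by an element-wise product of the projections, summed over the rank dimension, so the full outer-product tensor is never materialized. Dimensions and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d_text, d_audio, d_out, rank = 8, 6, 4, 3   # illustrative dimensions

# Modality-specific low-rank factors; a constant 1 is appended to each input
# so that lower-order (unimodal) interactions are preserved as well.
W_t = rng.normal(size=(rank, d_text + 1, d_out))
W_a = rng.normal(size=(rank, d_audio + 1, d_out))

def low_rank_fusion(z_t, z_a):
    """Fuse two modality features without building their outer product:
    project each with its own rank-r factors, multiply element-wise,
    then sum over the rank dimension."""
    z_t = np.concatenate([z_t, [1.0]])
    z_a = np.concatenate([z_a, [1.0]])
    return ((z_t @ W_t) * (z_a @ W_a)).sum(axis=0)  # shape (d_out,)

h = low_rank_fusion(rng.normal(size=d_text), rng.normal(size=d_audio))
print(h.shape)  # (4,)
```

The parameter count is linear in the number of modalities and in the rank, which is what makes tensor-product-style fusion tractable as modalities are added.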

