RETHINKING MISSING MODALITY LEARNING: FROM A DECODING VIEW

Abstract

The conventional multimodal learning pipeline consists of three stages: encoding, fusion, and decoding. Most existing methods for the missing modality condition focus on the first stage, aiming to learn modality-invariant representations or to reconstruct missing features. However, these methods rely on strong assumptions (i.e., all pre-defined modalities are available for every training sample and the number of modalities is fixed). To address this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions occur in both the training and inference phases, including unseen testing conditions. Unlike previous methods, we improve the decoding stage. Concretely, IPD jointly learns common and modality-specific task prototypes. Since the number of missing modality conditions scales exponentially with the number of modalities, O(2^n), and different conditions may interact implicitly, we employ a low-rank partial prototype decomposition, supported by theoretical analysis, for the modality-specific components to reduce this complexity. The decomposition also promotes generalization to unseen conditions by reusing the modality factors of existing conditions. To realize the low-rank setup, we further constrain the explicit interaction of specific modality conditions with disentangled contrastive constraints. Extensive results on newly created benchmarks covering multiple tasks demonstrate the effectiveness of our proposed model.
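The core idea above can be made concrete with a minimal sketch: instead of storing one free prototype for each of the 2^n − 1 missing-modality conditions, each condition's prototype is composed from a common part plus per-modality low-rank factors, so unseen conditions can reuse factors learned from seen ones. This is an illustrative sketch under our own assumptions, not the authors' implementation; all names (`d`, `rank`, `modality_factors`, `condition_prototype`) are hypothetical.

```python
# Illustrative sketch (assumed names, not the authors' code): compose each
# missing-modality condition's prototype from a shared component plus the
# low-rank factors of the modalities that are present.
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_modalities = 3             # e.g. vision, audio, text
d, rank = 16, 4              # prototype dimension and decomposition rank

common_prototype = rng.normal(size=d)                        # shared across conditions
modality_factors = rng.normal(size=(n_modalities, rank, d))  # one factor per modality


def condition_prototype(available):
    """Prototype of a condition = common part + composition of the factors of
    the present modalities; unseen conditions reuse the same factors, which is
    what allows generalization beyond the conditions seen in training."""
    specific = sum(modality_factors[m].mean(axis=0) for m in available)
    return common_prototype + specific


# Every non-empty subset of modalities is a possible condition: 2^n - 1 total.
conditions = [c for r in range(1, n_modalities + 1)
              for c in itertools.combinations(range(n_modalities), r)]
prototypes = {c: condition_prototype(c) for c in conditions}
print(len(conditions))  # 7 for n = 3
```

Note the parameter count grows as O(n · rank · d) in the number of modalities, rather than O(2^n · d) for independent per-condition prototypes.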

1. INTRODUCTION

Multimodal learning is one of the increasingly popular yet challenging tasks spanning both computer vision and natural language processing (Ben-Younes et al., 2017; Do et al., 2019; Gabeur et al., 2020; Liu et al., 2018). The goal of multimodal learning is to exploit the complementary information contained in multimodal data to improve performance on various tasks. Many strong multimodal approaches have been developed under an ideal situation. However, a common assumption underlying these approaches is the completeness of modality (i.e., the full set of modalities is available in both training and testing data). In practice, this assumption may not always hold in the real world. For example, some uploaded YouTube videos have no audio track; likewise, a black screen may occur during a broadcast, leaving the visual modality missing. Although many endeavors have been devoted to developing effective methods to cope with missing modality conditions in the training and inference stages, there is no unified paradigm that accounts for all possible scenarios. For instance, Pham et al. (2019); Zhao et al. (2021) only consider the incompleteness of testing data; however, obtaining a large amount of complete data for training is extremely labor-intensive. Ma et al. (2021) formulates a new setting that considers the incompleteness of training data, but uses only two modalities. Due to the rapid development of feature-extraction techniques, a data sample may have more than two kinds of modality representations.

In this paper, we focus on a more general setting (as shown in Fig. 1), where the number of modalities is arbitrary and various incomplete modality conditions occur, to systematically study the missing modality problem. We propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) to capture the universality and particularity of different modality conditions.
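The general setting described above, where incomplete modality conditions vary across both training and testing samples, can be simulated by dropping modalities at random per sample. The sketch below is an assumption for illustration only (the modality names and `p_missing` parameter are hypothetical), not a procedure taken from the paper.

```python
# Illustrative sketch (assumed setup): simulate arbitrary incomplete-modality
# conditions by independently dropping each modality per sample, keeping at
# least one modality so the sample remains usable.
import numpy as np

rng = np.random.default_rng(1)
MODALITIES = ("vision", "audio", "text")


def sample_condition(p_missing=0.4):
    """Return the tuple of modalities available for one simulated sample."""
    keep = rng.random(len(MODALITIES)) > p_missing
    if not keep.any():                       # never drop everything
        keep[rng.integers(len(MODALITIES))] = True
    return tuple(m for m, k in zip(MODALITIES, keep) if k)


# A small batch exhibits several distinct incomplete conditions.
batch_conditions = [sample_condition() for _ in range(8)]
```

Under such sampling, some of the 2^n − 1 possible conditions may never appear in training, which is exactly the unseen-condition scenario the general setting must handle.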

