RETHINKING MISSING MODALITY LEARNING: FROM A DECODING VIEW

Abstract

The conventional pipeline of multimodal learning consists of three stages: encoding, fusion, and decoding. Most existing methods for the missing-modality condition focus on the first stage and aim to learn modality-invariant representations or to reconstruct missing features. However, these methods rely on strong assumptions (i.e., all the pre-defined modalities are available for each input sample during training, and the number of modalities is fixed). To solve this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions occur in both the training and inference phases, including testing conditions unseen during training. Unlike previous methods, we improve the decoding stage. Concretely, IPD jointly learns the common and modality-specific task prototypes. Since the number of missing modality conditions scales exponentially with the number of modalities $n$, i.e., $O(2^n)$, and different conditions may interact implicitly, we employ a low-rank partial prototype decomposition, supported by theoretical analysis, for the modality-specific components to reduce the complexity. The decomposition also promotes generalization to unseen conditions through the modality factors of existing conditions. To simulate the low-rank setup, we further constrain the explicit interaction of specific modality conditions with disentangled contrastive constraints. Extensive results on newly created benchmarks for multiple tasks illustrate the effectiveness of the proposed model.

1. INTRODUCTION

Multimodal learning has recently become an increasingly popular yet challenging task in both computer vision and natural language processing Ben-Younes et al. (2017); Do et al. (2019); Gabeur et al. (2020); Liu et al. (2018). The goal of multimodal learning is to exploit the complementary information contained in multimodal data to improve performance on various tasks. Many strong approaches to multimodal learning have been developed for an ideal situation. However, a common assumption underlying these approaches is the completeness of modality (i.e., all modalities are available in both training and testing data). In practice, this assumption does not always hold. For example, some uploaded YouTube videos have no audio track, and a black screen may occur during a broadcast, leading to the lack of the visual modality. Although many efforts have been devoted to coping with missing modality conditions in the training and inference stages, there is no unified paradigm that takes all possible scenarios into account. For instance, Pham et al. (2019); Zhao et al. (2021) only consider the incompleteness of testing data; however, obtaining a large amount of complete data for training is extremely labor-intensive. Ma et al. (2021) formulate a new setting that considers the incompleteness of training data, but only two modalities are used. With the rapid development of feature-extraction techniques, one data sample may have more than two kinds of modality representation. In this paper, we focus on a more general setting (as shown in Fig. 1), where the number of modalities is arbitrary and there are various incomplete modality conditions, to systematically study the missing modality problem.
We propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) to capture the universality and particularity of different modality conditions (including both complete and incomplete conditions). To the best of our knowledge, this is the first study of generalizable missing modality learning from the decoding perspective. Following Zhou et al. (2021; 2022), we treat the weights of the standard linear classifier as task prototypes. Different modality conditions correspond to different task prototypes, and training multiple models separately for all the conditions is time-consuming, so IPD jointly learns the common and modality-specific prototypes. Since the number of missing modality conditions scales exponentially with the number $n$ of modalities, i.e., $O(2^n)$, and different conditions may interact implicitly, a low-rank partial prototype decomposition, supported by theoretical analysis, is employed for the modality-specific component to reduce the complexity. To simulate the low-rank setup, we further constrain the explicit interaction of specific modality conditions with disentangled contrastive constraints. We conduct extensive experiments on newly created benchmarks for multiple tasks; the experimental results show that IPD achieves competitive results compared with state-of-the-art methods of conventional multimodal learning. The main contributions can be summarized as follows:

• We propose a novel method called Interaction Augmented Prototype Decomposition (IPD) for generalizable missing modality learning, which jointly learns the common and modality-specific prototypes. To the best of our knowledge, this is the first work to improve the decoding stage.
• Since the number of missing modality conditions scales exponentially with the number $n$ of modalities, i.e., $O(2^n)$, and different conditions may interact implicitly, a low-rank partial prototype decomposition is employed to reduce the complexity. The decomposition also promotes generalization to unseen conditions through the modality factors of existing conditions.
• We constrain the explicit interaction of specific modality conditions by employing disentangled contrastive constraints.
• We conduct low-rank ensemble learning to enhance the performance of the conditions with relatively few modalities.

Figure 2: "0" and "1" denote the nonexistence and existence of a modality. The left part covers the process of Eq. 1 to Eq. 6. "Orth." denotes the orthogonal constraints (Eq. 4). Yellow bars denote $A_s$, which consists of the $a_{sj}$. Gray bars $e_c$ and $e^s_l$ denote the common and modality-specific prototypes. $e^s_l$ can be calculated from $A_s$ and $\beta$ through Eqs. 2 and 3. The decoding process (Eq. 1) correlates $x$ and $e_l$. The right part mainly presents the contrastive constraints of Eq. 8. "Cons" denotes contrastive learning (Eq. 8). The gray bars $e_{ta}$, $e_{tav}$, $e_v$ denote the modality-specific prototypes of ta, tav, v.

2.2. MULTIMODAL LEARNING WITH INCOMPLETENESS

Multimodal learning with incompleteness is another topic of concern. There are two lines of work on incomplete multimodal learning. One deals with representation noise; for example, the visual features of several time steps are unavailable while all the remaining time steps are normal. Liang et al. (2019) propose rank constraints to solve this problem. The other deals with missing modality conditions, which are more likely to be encountered in real-world applications and have received more attention. Parthasarathy & Sundaram (2020) propose a strategy that randomly discards the visual input during training at the clip or frame level to mimic real-world missing modality scenarios for audio-visual multimodal emotion recognition.

3.1. PROBLEM FORMULATION

We formulate the problem setting for generalizable missing modality learning in this section. Suppose the number of modalities is $n$; there are various modality scenarios in both the training and inference stages. Concretely, about $2^n - 1$ scenarios are available, since each modality has two states (i.e., 0 for nonexistence, 1 for existence). Under such real-world conditions, there are two challenging problems. One is to obtain good performance on existing modality combinations. The other is to generalize to unseen conditions when the training set does not cover all the modality scenarios.
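The scenario count can be illustrated with a minimal sketch. It assumes that the excluded pattern is the one in which every modality is missing; the function names `enumerate_scenarios` and `scenario_name` are hypothetical and only serve this illustration.

```python
from itertools import product

def enumerate_scenarios(modalities):
    """Return all non-empty presence patterns as tuples of 0/1 flags.

    Flag 1 means the modality exists, 0 means it is missing.
    Assumption: the all-zero pattern (no modality at all) is the one
    that is excluded, which leaves 2**n - 1 scenarios.
    """
    n = len(modalities)
    return [flags for flags in product((0, 1), repeat=n) if any(flags)]

def scenario_name(modalities, flags):
    """Readable name such as 'ta' for flags (1, 1, 0) over ('t', 'a', 'v')."""
    return "".join(m for m, f in zip(modalities, flags) if f)

modalities = ("t", "a", "v")
scenarios = enumerate_scenarios(modalities)
names = [scenario_name(modalities, s) for s in scenarios]
print(len(scenarios), names)
```

With three modalities this lists the seven combinations used later for CMU-MOSI and POM; with seven modalities the same function yields 127 patterns.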

3.2. COMMON-SPECIFIC JOINT LEARNING

Following previous work, we employ different encoding blocks for the corresponding tasks. For multimodal sentiment analysis and speaker traits recognition, we adopt the same structure as LMF Liu et al. (2018), where LSTMs are utilized to process the multimodal features. For multimodal video retrieval, we borrow the idea from MMT Gabeur et al. (2020), where the multimodal features are encoded by a Transformer. To fuse the multimodal features, we use the mainstream tensor-based method, LMF, which considers the fine-grained interaction among different modalities. The training paradigms of multimodal sentiment analysis and speaker traits recognition are similar to the settings of LMF. As for the training objective of video retrieval, we employ the contrastive cross-entropy loss and transform retrieval into a classification task (i.e., the matched pairs are positive samples).

We mainly focus on classification and regression. Similar to Zhou et al. (2021); Zhou et al. (2022), we treat the linear weights of the final layer as the task prototypes of different modality scenarios. There are $2^n$ scenarios, and the examples $(x, y)$ ($x \in \mathbb{R}^m$ denotes the fusion result of multiple modalities, $y \in \mathbb{R}$ denotes the label or regression score) from a specific scenario $l \in [2^n]$ are generated as:

$$y = x^\top e_l, \qquad e_l = e_c + e^s_l \tag{1}$$

where $e_c \in \mathbb{R}^m$ denotes the common prototype of different modality combinations, $e^s_l \in \mathbb{R}^m$ with $e^s_l \perp e_c$ denotes the modality-specific prototype of the $l$-th modality scenario, and $e_l$ denotes the complete task prototype for the $l$-th modality combination. Intuitively, the common component contains more high-level semantic information (i.e., the reasoning of performance), and the modality-specific part contains more low-level detailed information (i.e., the loudness of voice, the movement range). $y = \pm 1$ for classification tasks and $y \in [-p, p]$ for regression tasks.

3.3. LOW-RANK PARTIAL PROTOTYPE DECOMPOSITION

Based on Eq. 1, there exist superior scenario-specific classifiers $\{e_l \mid l \in [2^n]\}$, one for each modality scenario, such that $e_l = e_c + e^s_l$. Note that all these classifiers share the common component $e_c$. If we are able to find multiple specific classifiers of this form, $e_c$ and $e^s_l$ can be extracted from them. Although the prototypes cover all the modality combinations, the complexity ($O(2^n)$) becomes large as $n$ increases. Further, training the modality-specific prototypes separately is inefficient and neglects the interaction among different modality scenarios. Therefore, we reformulate $E^s \in \mathbb{R}^{m \times 2^n}$, which consists of the $e^s_i$, as $A_s \beta^\top$, where $A_s \in \mathbb{R}^{m \times k}$, $\beta \in \mathbb{R}^{2^n \times k}$, and $k$ denotes the low-rank value. Our idea can be extended from Eq. 1:

$$y = x^\top \Big(e_c + \sum_{j=1}^{k} \beta_{l,j}\, a_{sj}\Big) = x^\top (e_c + A_s \beta_l) \tag{2}$$

where $a_{sj} \in \mathbb{R}^m$ with $a_{sj} \perp e_c$, and $a_{sj} \perp a_{sq}$ for $j, q \in [k]$, $j \neq q$, are modality-specific features. The correlation between the $a_{sj}$ and the task is given by the coefficients $\beta_l \in \mathbb{R}^k$, which vary among modality combinations. Under this setting, there also exists a superior modality-specific classifier $e_l$ such that $e_l = e_c + A_s \beta_l \in \mathbb{R}^m$, where $e_c \in \mathbb{R}^m$, $A_s \in \mathbb{R}^{m \times k}$, $\beta_l \in \mathbb{R}^k$ are trainable variables. More concretely, the trainable task prototypes can be written as:

$$E = e_c \mathbf{1}^\top + A_s \beta^\top = [e_c, A_s]\,[\mathbf{1}, \beta]^\top \tag{3}$$

where $E = [e_1, e_2, \ldots, e_{2^n}] \in \mathbb{R}^{m \times 2^n}$, $\mathbf{1} \in \mathbb{R}^{2^n}$ denotes the all-ones vector, and $\beta^\top = [\beta_1, \beta_2, \ldots, \beta_{2^n}] \in \mathbb{R}^{k \times 2^n}$. However, in practice, a general matrix $E$ that can be written as $E = e_c \mathbf{1}^\top + A_s \beta^\top$ admits multiple decompositions of this form, so $e_c$ and $A_s \beta^\top$ cannot be uniquely determined by the decomposition alone. Therefore, we constrain $[e_c, A_s]$ with an orthogonal regularization:

$$L_o = \big\| I_{k+1} - [e_c, A_s]^\top [e_c, A_s] \big\|^2 \tag{4}$$

where $L_o$ denotes the regularization loss and $I_{k+1} \in \mathbb{R}^{(k+1) \times (k+1)}$ denotes the identity matrix.

This regularization term is motivated by the following propositions.

Proposition 1: When $e_c \perp \mathrm{Span}(A_s)$, the decomposition $E = e_c \mathbf{1}^\top + A_s \beta^\top$ has a unique solution. Suppose $E = e_c \mathbf{1}^\top + A_s \beta^\top = w_c \mathbf{1}^\top + W_s \Gamma^\top$ is a rank-$(k+1)$ matrix, where $A_s \in \mathbb{R}^{m \times k}$, $\beta \in \mathbb{R}^{2^n \times k}$, $W_s \in \mathbb{R}^{m \times k}$, and $\Gamma \in \mathbb{R}^{2^n \times k}$ are all rank-$k$ matrices with $k \le \min(m, 2^n)$. When $e_c \perp \mathrm{Span}(A_s)$, $w_c = e_c$ is equivalent to $w_c \perp \mathrm{Span}(W_s)$.

Proposition 2: If the orthogonal regularization is not satisfied ($e_c$ is not orthogonal to $\mathrm{Span}(A_s)$), the performance of some modality combinations will be affected.
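The decomposition of Eqs. 2 to 4 can be sketched numerically as follows. This is an illustrative NumPy sketch, not the authors' implementation: the sizes are arbitrary, random values stand in for trained parameters, the norm in Eq. 4 is assumed to be the Frobenius norm, and the helper names (`prototypes`, `orth_penalty`, `predict`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 16, 5, 3          # hidden size, rank, number of modalities (arbitrary)
L = 2 ** n                  # number of modality scenarios

# Build an orthonormal [e_c, A_s] via QR, so the penalty of Eq. 4 is ~0 by construction.
Q, _ = np.linalg.qr(rng.standard_normal((m, k + 1)))
e_c, A_s = Q[:, 0], Q[:, 1:]
beta = rng.standard_normal((L, k))   # one coefficient row beta_l per scenario

def prototypes(e_c, A_s, beta):
    """E = e_c 1^T + A_s beta^T, one column e_l per scenario (Eq. 3)."""
    return e_c[:, None] + A_s @ beta.T

def orth_penalty(e_c, A_s):
    """Squared norm of I - [e_c, A_s]^T [e_c, A_s] (Eq. 4).

    Assumption: the norm is taken to be the Frobenius norm.
    """
    B = np.column_stack([e_c, A_s])
    G = B.T @ B
    return float(np.sum((np.eye(G.shape[0]) - G) ** 2))

def predict(x, l, e_c, A_s, beta):
    """y = x^T (e_c + A_s beta_l) for scenario index l (Eq. 2)."""
    return float(x @ (e_c + A_s @ beta[l]))

E = prototypes(e_c, A_s, beta)
x = rng.standard_normal(m)
print(E.shape, orth_penalty(e_c, A_s), predict(x, 3, e_c, A_s, beta))
```

Predicting through `predict` gives the same value as multiplying `x` with the corresponding column of `E`, while only `e_c`, `A_s`, and `beta` need to be stored.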

3.4. HIERARCHICAL DECOMPOSITION OF $A_s \beta^\top$ AND UNSEEN GENERALIZATION

Although the decomposition $A_s \beta^\top$ changes the complexity (from $m \cdot 2^n$ to $k \cdot 2^n$), the exponential term $2^n$ still exists. Furthermore, the decomposition $A_s \beta^\top$ does not consider the fine-grained interaction among different modality combinations. To solve this problem, we employ a hierarchical decomposition of the high-order tensor $\beta \in \mathbb{R}^{2^n \times k}$. Concretely, we divide $\beta$ into a series of $n$-order tensors:

$$\beta = [C^1, C^2, \ldots, C^k] \tag{5}$$

where $C^j \in \mathbb{R}^{2^n}$. We apply CP decomposition Harshman et al. (1970) to each $C^j$:

$$C^j = \sum_{r=1}^{R} \bigotimes_{i=1}^{n} C^j_{i,r}, \qquad C^j_{i,r} \in \mathbb{R}^2 \tag{6}$$

where $\otimes$ denotes the tensor outer product over a set of vectors and $R$ denotes the rank of the CP decomposition. These operations change the complexity of $\beta$ from exponential to linear in $n$. With this approximation, the weights for unseen modality combinations can also be obtained by tensor multiplication. We employ these weights for the evaluation of missing modality generalization.
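A minimal NumPy sketch of Eq. 6 is given below, under stated assumptions: random factors stand in for trained ones, index 0 of each length-2 factor is taken to mean "modality missing" and index 1 "modality present" (following the 0/1 convention of Section 3.1), and all function names are hypothetical. It shows that the coefficient vector $\beta_l$ of any presence pattern, seen or unseen, can be read from the per-modality factors without materializing the $2^n$ rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, R = 3, 4, 2   # modalities, low rank k, CP rank R (arbitrary sizes)

# factors[i, r] is the (2, k) matrix C_{i,r}; column j of it is the vector C^j_{i,r}.
# Assumed convention: row 0 = modality i missing, row 1 = modality i present.
factors = rng.standard_normal((n, R, 2, k))

def beta_row(flags, factors):
    """beta_l for one presence pattern: sum_r prod_i C_{i,r}[flag_i], element-wise over k."""
    _, R, _, k = factors.shape
    out = np.zeros(k)
    for r in range(R):
        term = np.ones(k)
        for i, f in enumerate(flags):
            term *= factors[i, r, f]
        out += term
    return out

def full_beta(factors):
    """Materialize beta as a (2, ..., 2, k) tensor via outer products (for checking only)."""
    n, R, _, k = factors.shape
    beta = np.zeros((2,) * n + (k,))
    for j in range(k):
        for r in range(R):
            t = np.ones(())
            for i in range(n):
                t = np.multiply.outer(t, factors[i, r, :, j])
            beta[..., j] += t
    return beta

def dense_entries(n, k):
    """Number of entries when beta is stored directly."""
    return (2 ** n) * k

def cp_entries(n, k, R):
    """Number of entries of all C_{i,r} factors."""
    return n * R * 2 * k

beta = full_beta(factors)
row = beta_row((1, 0, 1), factors)   # any pattern works, including ones never trained on
print(np.allclose(beta[1, 0, 1], row), dense_entries(12, k), cp_entries(12, k, R))
```

The factor count grows linearly in $n$ while the dense count grows as $2^n$, so the saving becomes visible as the number of modalities increases.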

3.5. DISENTANGLED CONTRASTIVE CONSTRAINTS FOR DECOMPOSITION

The rank value is related, to some extent, to the interaction among different modality combinations. To further enhance the explicit interaction of specific modality combinations and control the rank value, we employ disentangled contrastive constraints for the decomposition. Specifically, we apply them to the task prototypes (i.e., $e^s_i$) of the modality-specific part. Unlike the obscure modality gaps, which cannot be obtained, the interaction between different modality combinations is obvious. For example, suppose that there are three modalities in total (t, a, v) and the possible combinations are {t, a, v, ta, tv, av, tav}. We can easily see that ta is more related to t, a, and tav than to v, since ta and v have no intersection. We express this as:

$$D(ta, \eta) \ge D(ta, v), \qquad \eta \in \{t, a, tav\} \tag{7}$$

where $D(\cdot)$ denotes the correlation function. Following this natural ordering, we apply disentangled contrastive constraints to the modality-specific part. Naturally similar pairs such as ta and tav should have higher matching scores than negative pairs such as ta and v:

$$L_c = \max\big(0,\; \Delta - h(e_{ta}, e_{tav}) + h(e_{ta}, e_v)\big) \tag{8}$$

where $e_{ta}$, $e_{tav}$, $e_v$ denote the modality-specific prototypes, $\Delta$ is the margin value, and $h(\cdot)$ denotes the normalized inner product (cosine similarity). In practice, the number of existing positive and negative pairs also scales exponentially with the number of modalities; therefore, we utilize a maximum suppression sampling scheme to reduce it. Concretely, for a specific input (e.g., $e_{ta}$), we take the combination with the most modalities that contains the processed modalities (e.g., $e_{tav}$) as the positive object, and the complementary set (e.g., $e_v$) as the negative object.
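The hinge loss of Eq. 8 and the sampling rule can be sketched as follows. This is an illustration under assumptions, not the authors' code: modality combinations are represented as `frozenset`s, the complement is taken relative to the selected positive set, and the function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Normalized inner product h(a, b)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hinge_contrastive(anchor, positive, negative, margin=0.1):
    """L_c = max(0, margin - h(anchor, positive) + h(anchor, negative)) (Eq. 8)."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def sample_pair(anchor, seen):
    """Pick (positive, negative) combinations for an anchor combination.

    positive: the seen proper superset of the anchor with the most modalities
    negative: the complement of the anchor inside that positive set
              (assumption: the complement is taken relative to the positive)
    Returns None when no proper superset or no seen complement exists.
    """
    supersets = [s for s in seen if anchor.issubset(s) and anchor != s]
    if not supersets:
        return None
    positive = max(supersets, key=len)
    negative = positive - anchor
    return (positive, negative) if negative in seen else None

seen = {frozenset(s) for s in ("t", "a", "v", "ta", "tv", "av", "tav")}
pos, neg = sample_pair(frozenset("ta"), seen)
print(sorted(pos), sorted(neg))

# Toy prototypes: the loss is zero once the positive is closer than the negative by the margin.
e_ta, e_tav, e_v = np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(hinge_contrastive(e_ta, e_tav, e_v))
```

For the anchor ta this selects tav as the positive and v as the negative, matching the example in the text.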

3.6. LOW-RANK ENSEMBLE LEARNING

With the approximation of $\beta$, we employ low-rank ensemble learning as an auxiliary, since samples with more modalities can enhance the training of more of the modality factors produced by Eq. 6 (e.g., a sample with modalities ta can be used for training $e_{ta}$, $e_t$, $e_a$ simultaneously). Concretely, a data point with $n^*$ modalities can be augmented to $2^{n^*}$ modality combinations. We calculate the average result over all the augmented modality scenarios of an input sample. Suppose that the encoded features are represented as $u_1, u_2, \ldots, u_n \in \mathbb{R}^m$, where $u_i$ denotes the feature of the $i$-th modality. When the original input of a modality is replaced with an all-zeros vector, the encoded features are represented as $o_1, o_2, \ldots, o_n \in \mathbb{R}^m$. Thus, if the $i$-th modality is missing, $u_i = o_i$. We concatenate each $W^b_i u_i$ and the corresponding $W^b_i o_i$ into $v_i = [W^b_i u_i, W^b_i o_i] \in \mathbb{R}^{m \times 2}$, where $W^b_i \in \mathbb{R}^{m \times m}$ denotes the linear mapping weights following LMF. The fusion results for all the modality combinations can be denoted as $V \in \mathbb{R}^{m \times 2^n}$, and the ensemble outputs are:

$$O_c = \mathrm{sum}\Big(\Big[\bigodot_{i=1}^{n} v_i \cdot \mathbf{1}\Big] \odot e_c\Big) / M, \qquad O_s = \mathrm{sum}\Big(\Big[\sum_{r=1}^{R} \bigodot_{i=1}^{n} v_i C_{i,r}\Big] \odot A_s\Big) / M \tag{9}$$

where $\odot$ denotes element-wise multiplication ($\bigodot_i$ being the element-wise product over modalities), $\mathrm{sum}(\cdot)$ denotes the summation over all the elements of its argument, $C_{i,r} \in \mathbb{R}^{2 \times k}$, and $M$ denotes the number of ensembled combinations (i.e., $2^{n^*}$), which is related to $n^*$. The detailed derivation is given in the appendix (Section A).
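The identity behind Eq. 9 can be checked numerically with a small sketch. The assumptions are stated explicitly: random arrays stand in for the mapped features and trained parameters, every modality is treated as present so that $M = 2^n$, fusion of one combination is taken to be the element-wise product of the selected columns of the $v_i$ (LMF-style), and the rows of each $C_{i,r}$ are aligned with the two columns of $v_i$. The function names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, m, k, R = 3, 6, 4, 2   # arbitrary sizes

# v[i] has shape (m, 2): column 0 stands for W_i u_i, column 1 for W_i o_i.
v = rng.standard_normal((n, m, 2))
C = rng.standard_normal((n, R, 2, k))   # C[i, r]: rows aligned with the columns of v[i]
e_c = rng.standard_normal(m)
A_s = rng.standard_normal((m, k))
M = 2 ** n

def ensemble_fast(v, C, e_c, A_s, M):
    """O_c and O_s of Eq. 9 without enumerating the 2**n combinations."""
    o_c = np.sum(np.prod(v.sum(axis=2), axis=0) * e_c) / M
    # v[i] @ C[i, r] is (m, k); multiply element-wise over i, then sum over r.
    per_rank = np.prod(np.einsum("imc,irck->rimk", v, C), axis=1)
    o_s = np.sum(per_rank.sum(axis=0) * A_s) / M
    return o_c, o_s

def ensemble_brute(v, C, e_c, A_s, M):
    """Average x_l^T e_c and x_l^T A_s beta_l over every combination l explicitly."""
    n = v.shape[0]
    idx = np.arange(n)
    o_c = o_s = 0.0
    for choice in itertools.product((0, 1), repeat=n):
        c = np.array(choice)
        x = np.prod(v[idx, :, c], axis=0)                    # fused feature, shape (m,)
        beta_l = np.prod(C[idx, :, c, :], axis=0).sum(0)     # CP reconstruction, shape (k,)
        o_c += x @ e_c
        o_s += x @ (A_s @ beta_l)
    return o_c / M, o_s / M

print(ensemble_fast(v, C, e_c, A_s, M))
print(ensemble_brute(v, C, e_c, A_s, M))
```

Both functions return the same pair of values up to floating-point error, while `ensemble_fast` costs a number of operations linear in $n$ rather than proportional to $2^n$.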

4. TRAINING AND INFERENCE

The overall framework is shown in Fig. 2. We employ $L_t = f(y^*, y)$ to denote the prediction loss, where $y^*$ denotes the prediction result with LMF and $y$ is the ground-truth label; $f(\cdot)$ can be MAE for sentiment analysis or the cross-entropy loss for retrieval. $L_e = f(O_e, y)$ denotes the loss when using ensemble learning. The final optimization objective is $L_t + \lambda_1 L_e + \lambda_2 L_o + \lambda_3 L_c$. During inference, we employ $E = [e_1, e_2, \ldots, e_{2^n}]$ for both existing modality combinations and unseen ones. The unseen task prototypes can be calculated from the trainable modality factors as introduced in Eq. 6. We provide the training and inference details in the appendix (Algs. 1 and 2).

5.1. DATASET AND METRICS

We evaluate our method on three challenging tasks: multimodal sentiment analysis, multimodal speaker traits recognition, and multimodal video retrieval. In this section, we provide a brief introduction to the datasets and metrics.

CMU-MOSI Zadeh et al. (2016a): It is a collection of 93 opinion videos from YouTube movie reviews. Each video consists of multiple opinion segments (2199 segments in total), and each segment is annotated with a score in the range [-3, 3], where -3 and 3 indicate highly negative and highly positive. We report the metrics BA (binary accuracy), F1, Corr (correlation coefficient), MA (multi-class accuracy, higher is better), and MAE (mean absolute error, lower is better).

POM Pérez-Rosas et al. (2013): POM is a speaker traits recognition dataset made up of 903 movie review videos. Each video has 16 speaker traits. We report the multi-class accuracy of different traits.

MSR-VTT Xu et al. (2016): MSR-VTT is composed of 10K YouTube videos, collected using 257 queries from a commercial video search engine. Each video is 10 to 30s long and is paired with 20 natural sentences describing it. We report the common metrics R@K and MdR.
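For reference, R@K and MdR can be computed from a query-by-candidate similarity matrix as in the sketch below. It assumes that the correct candidate for query $i$ sits at index $i$ (a common evaluation layout, not something stated in the text), and the function name `retrieval_metrics` is hypothetical.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and the median rank MdR.

    sim[i, j] is the score of query i against candidate j.
    Assumption: the matching candidate of query i is candidate i.
    """
    order = np.argsort(-sim, axis=1)                      # best candidate first
    # 1-based position of the correct candidate in each query's ranking
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    out = {f"R@{k}": float(np.mean(ranks <= k) * 100) for k in ks}
    out["MdR"] = float(np.median(ranks))
    return out

sim = np.array([
    [0.9, 0.1, 0.0],   # correct candidate ranked first
    [0.2, 0.1, 0.7],   # correct candidate ranked third
    [0.3, 0.2, 0.5],   # correct candidate ranked first
])
metrics = retrieval_metrics(sim)
print(metrics)
```

Higher R@K and lower MdR indicate better retrieval, which is the direction indicated by the arrows in Table 4.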

5.2. DATA PREPROCESSING

CMU-MOSI and POM. Each dataset (CMU-MOSI, POM) consists of three modalities, including textual, visual, and audio modalities. For textual features, we employ the pre-trained 300-dimensional Glove embeddings Pennington et al. (2014) . For visual features, we utilize Facet iMotions (2017) 2014) acoustic analysis framework. To align the different modalities along the temporal dimension, we perform word alignment with P2FA Yuan & Liberman (2008) .

MSR-VTT.

The videos contain abundant multimodal information. Thus, we use multiple pre-trained models for extracting features. Concretely, we utilize seven modality experts: motion, scene, OCR, audio, speech, face, appearance following Gabeur et al. (2020) .

5.3. MISSING MODALITY SETTING

We evaluate IPD from two views. One is the performance on existing modality combinations; the other is the performance on unseen combinations. Concretely, we divide CMU-MOSI and POM into $2^3 - 1 = 7$ pieces (scenarios). One piece is kept for the evaluation of unseen generalization (since the combinations with fewer modalities can be augmented from those with more modalities, we employ the complete modality combination for unseen evaluation). For the remaining 6 pieces, we employ 70%, 10%, 20% of the samples for training, validation, and testing on existing modality combinations. To be realistic, we randomly generate the ratio of the 7 pieces (by giving each piece a number from 0 to 1 and normalizing) 5 times and report the average results. To simulate a large number of modality combinations, we conduct video retrieval experiments on MSR-VTT, which contains rich multimodal information. Since the number of modalities is 7, there are $2^7 - 1 = 127$ combinations. For the evaluation of unseen modality combinations, we employ 8 pieces (the combinations with more modalities). The remaining pieces are divided in the proportion of previous work Gabeur et al. (2020) (90% for training, 10% for testing). We follow CMU-MOSI to randomly generate the ratio and report the average results.

Table 1 presents the overall comparison of IPD and existing methods on CMU-MOSI (both existing and unseen combinations). Note that, for fairness, we mainly compare IPD with methods that use the same features. For the evaluation of existing modality combinations, we observe that MFN, TFN, LMF, and MulT perform worse than IPD, as they pay more attention to the usage of complete modalities, which leads to poor adaptation to missing modality scenarios. This holds even for MulT, which utilizes the Transformer Vaswani et al. (2017). Besides, IPD achieves the best performance on all the metrics among the existing missing-modality learning methods (including MVAE, MCTN, MMIN).
Particularly, IPD increases "MA" from 25.6 to 27.9 compared to the best counterpart. We analyze these observations as follows: (1) The mainstream reconstruction-based methods (i.e., MCTN, MMIN) depend on the existence of all the modalities to obtain supervision in the training stage; therefore, when the training samples are incomplete, there is a large drop in performance. (2) Methods based on modality-invariant representation (i.e., MVAE) are also widely studied; however, since the noise covariance matrix in Eq. 1 varies across different modality combinations and samples, none of the features have the same distribution. Further, the training assumptions of MVAE also include the availability of all the pre-defined modalities. For the evaluation of unseen modality combinations, IPD performs better than all the baseline methods, which demonstrates that IPD has competitive generalization ability. In general, the best performance of IPD is attributable to the prototype decomposition, which captures modality-shared knowledge and modality-specific knowledge respectively, as well as to the disentangled contrastive constraints, which enhance the interaction of different modality conditions.

Ablation Study: We set up control experiments on CMU-MOSI to verify the effectiveness of IPD; the results are shown in Table 2. "w/o. All" denotes the model without any of the contributions and is equal to LMF. "w/o. LR" denotes that all the orthogonal task prototypes are trained separately without the other contributions (i.e., low-rank approximation, contrastive constraints, augmented ensemble). "w/o. Orth" denotes that we only remove the orthogonal constraints on $e_c$ and $A_s$ from the complete model. "w/o. Ens" denotes that we only remove the low-rank ensemble learning from the complete model. "w/o. Cs" denotes the model without contrastive constraints on the modality-specific prototypes. From Table 2, we find that "w/o. All" achieves the worst results on all the metrics, which is expected, since all the contributions for missing-modality learning are removed. The results of "w/o. LR" are similar to those of "IPD (Full)"; our analysis is that the model without low-rank constraints can also learn limited multimodal interaction from scratch. "w/o. Orth" achieves limited improvement over the baseline method, which fits Proposition 2: the performance of some modality combinations is affected. The poor performance of "w/o. Ens" reveals the effectiveness of data augmentation. "w/o. Cs" achieves relatively poor results compared with IPD (Full), since the interaction enhancement among different modality combinations is important for simulating the low-rank condition.

Since we utilize low-rank approximations for $E^s$ and $\beta$, the rank values $k$, $R$ should be considered. We examine the performance of IPD on CMU-MOSI (Existing Combs.) with different values of $k$, $R$, as shown in Fig. 3. IPD with contrastive constraints achieves competitive performance with relatively small rank values (assuming $k = R$). When the contrastive constraints, which promote the multimodal association, are removed, a large rank value is needed for satisfactory results. We also analyze the low-rank ensemble learning. As shown in Fig. 4, IPD performs poorly on single modalities without this mechanism, which again reveals its effectiveness.

Table 4: Retrieval performances (text-to-video) on the MSR-VTT dataset.
Method | Existing Combs.: R@1↑, R@5↑, R@10↑, MdR↓ | Unseen Combs.: R@1↑, R@5↑, R@10↑, MdR↓
CE Liu et al. (2019) | 14.9

5.5. EXPERIMENTS FOR SPEAKER RECOGNITION

Table 3 shows the experimental results of different methods on the speaker traits recognition dataset POM, where the top half corresponds to existing combinations and the bottom half corresponds to unseen combinations. We report the multi-class accuracy of multiple traits. A similar observation can be made from the table: IPD achieves competitive performance on both existing and unseen modality combinations compared with the baseline methods. Particularly, IPD increases the average multi-class accuracy from 26.6 to 28.8 (existing combs.) and from 30.5 to 33.8 (unseen combs.) compared to the best counterparts.

5.6. EXPERIMENTS FOR VIDEO RETRIEVAL

To evaluate the generalization to more modalities, we report the evaluation results of IPD and the competing text-video retrieval methods on MSR-VTT (Table 4). Note that MMIN, MCTN, and most of the existing methods (i.e., Ji et al. (2022); Lei et al. (2021)) cannot be used with 7 modalities; thus, we do not compare IPD with them. We provide the metrics from two points of view (i.e., existing and unseen modality combinations). For both settings, IPD outperforms the baseline methods on all the metrics. Benefiting from the low-rank prototype decomposition and the disentangled contrastive constraints, the modality-shared knowledge and modality-specific knowledge are efficiently combined for the corresponding tasks. Further, the complexity reduction is more obvious for video retrieval, due to the larger number of modalities and hidden size. To give an intuitive sense of the complexity reduction, we report the number of parameters of the task prototypes (Table 5).

6. CONCLUSION

We propose a novel method called Interaction Augmented Prototype Decomposition (IPD) to solve the generalizable missing modality problem. Concretely, IPD disentangles the task prototypes into a modality-shared part and a low-rank modality-specific part. We present a principled analysis to justify the low-rank approximation. Further, to control the rank value, we facilitate the interaction of different modality conditions by employing disentangled contrastive constraints, which complement the decomposition. Extensive results on newly created benchmarks for multiple tasks illustrate the superiority of the proposed method.

Jinming Zhao, Ruichen Li, and Qin Jin. Missing modality imagination network for emotion recognition with uncertain missing modalities. In ACL, 2021.
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Kaiyang Zhou, Chen Change Loy, and Ziwei Liu. Semi-supervised domain generalization with stochastic stylematch. arXiv preprint arXiv:2106.00592, 2021.

A PROOF AND DERIVATION

A.1 THE PROOF OF PROPOSITIONS

Proposition 1: When $e_c \perp \mathrm{Span}(A_s)$, the decomposition $E = e_c \mathbf{1}^\top + A_s \beta^\top$ has a unique solution.

Proof: Suppose $E = e_c \mathbf{1}^\top + A_s \beta^\top = w_c \mathbf{1}^\top + W_s \Gamma^\top$ is a rank-$(k+1)$ matrix, where $A_s \in \mathbb{R}^{m \times k}$, $\beta \in \mathbb{R}^{2^n \times k}$, $W_s \in \mathbb{R}^{m \times k}$, and $\Gamma \in \mathbb{R}^{2^n \times k}$ are all rank-$k$ matrices with $k \le \min(m, 2^n)$. Given $e_c \perp \mathrm{Span}(A_s)$, we prove that $w_c = e_c$ is equivalent to $w_c \perp \mathrm{Span}(W_s)$ by establishing the sufficient and necessary conditions.

Sufficient Condition: Suppose that $w_c \perp \mathrm{Span}(W_s)$. Then $E^\top w_c = \langle e_c, w_c \rangle \mathbf{1} + \beta A_s^\top w_c = \|w_c\|^2 \mathbf{1}$. Since $E$ is a rank-$(k+1)$ matrix, we know that $\mathbf{1} \notin \mathrm{Span}(\beta)$, so it must be the case that $\langle e_c, w_c \rangle = \|w_c\|^2$ and $A_s^\top w_c = 0$. Together these imply that $w_c$ is the projection of $e_c$ onto the space orthogonal to $A_s$, i.e., $w_c = e_c - P_{A_s} e_c$, where $P_{A_s}$ is the projection matrix onto the span of the vectors in $A_s$. Since $e_c \perp A_s$, we obtain $w_c = e_c$.

Necessary Condition: Let $w_c = e_c$. Since $E$ is fixed, we obtain $A_s \beta^\top = W_s \Gamma^\top$, i.e., $\mathrm{Span}(W_s) = \mathrm{Span}(A_s)$. Since $e_c \perp \mathrm{Span}(A_s)$, $w_c \perp \mathrm{Span}(W_s)$.

Proposition 2: If the orthogonal regularization is not satisfied ($e_c$ is not orthogonal to $\mathrm{Span}(A_s)$), the performance of some modality combinations will be affected.

Proof: Suppose the modality-specific prototypes of two single modalities are $e_1$ and $e_2$ ($e_1 \perp e_2$), and the fused representation of an input sample (corresponding to $e_2$) is $u$. If the orthogonal regularization is not satisfied (i.e., $e_1$ is not orthogonal to $e_c$), the prediction can be expressed as $u \cdot e_c + u \cdot e_2$. In the extreme situation (i.e., $e_1 = e_c$), the result is $u \cdot e_2$. The value of $u \cdot e_c$ is thus influenced by the non-orthogonality to some extent.

A.2 THE DERIVATION OF EQ. 9

Note that we omit $M$ for convenience. We first prove the calculation of $O_s$:

$$O_s = \mathrm{sum}\big(V \odot A_s \beta^\top\big) = \mathrm{sum}\Big(\Big[\sum_{r=1}^{R} \bigodot_{i=1}^{n} v_i C_{i,r}\Big] \odot A_s\Big) \tag{10}$$

For convenience, we adopt element-wise multiplication. Concretely, $V \in \mathbb{R}^{m \times 2^n}$ and $A_s \beta^\top \in \mathbb{R}^{m \times 2^n}$ cannot be combined by matrix (vector) multiplication. Therefore, we work with the $d$-th rows $(V)_d \in \mathbb{R}^{2^n}$ and $(A_s \beta^\top)_d \in \mathbb{R}^{2^n}$, $d \in [m]$, and rewrite the above equation as

$$o_s = A_{s,d}^\top \beta^\top \bigotimes_{i=1}^{n} (v_i)_d = \Big(\bigotimes_{i=1}^{n} (v_i)_d\Big)^\top \beta A_{s,d} = \Big(\bigotimes_{i=1}^{n} (v_i)_d\Big)^\top [C^1, \ldots, C^k]\, A_{s,d} \tag{11}$$

where $(v_i)_d \in \mathbb{R}^2$ and $A_{s,d} \in \mathbb{R}^k$. We then calculate $\big(\bigotimes_{i=1}^{n} (v_i)_d\big)^\top C^j$ as follows:

$$\Big(\bigotimes_{i=1}^{n} (v_i)_d\Big)^\top C^j = \Big(\bigotimes_{i=1}^{n} (v_i)_d\Big)^\top \sum_{r=1}^{R} \bigotimes_{i=1}^{n} C^j_{i,r} = \sum_{r=1}^{R} \prod_{i=1}^{n} (v_i)_d\, C^j_{i,r} \tag{12}$$

Substituting the result of Eq. 12 into Eq. 11, we obtain:

$$o_s = \Big[\sum_{r=1}^{R} \bigodot_{i=1}^{n} (v_i)_d\, C_{i,r}\Big] A_{s,d} = \mathrm{sum}\Big(\Big[\sum_{r=1}^{R} \bigodot_{i=1}^{n} (v_i)_d\, C_{i,r}\Big] \odot A_{s,d}\Big) \tag{13}$$

According to the conclusion of Eq. 13, Eq. 10 is proved. We then prove the calculation of $O_c$:

$$O_c = \mathrm{sum}\big(V \odot e_c \mathbf{1}^\top\big) = \mathrm{sum}\Big(\Big[\bigodot_{i=1}^{n} v_i \cdot \mathbf{1}\Big] \odot e_c\Big) \tag{14}$$

We again adopt element-wise multiplication:

$$o_c = \Big(\bigotimes_{i=1}^{n} (v_i)_d\Big)^\top \mathbf{1}_{2^n}\, e_{c,d} = \mathrm{sum}\Big(\Big[\prod_{i=1}^{n} (v_i)_d \cdot \mathbf{1}_2\Big] e_{c,d}\Big) \tag{15}$$

According to the conclusion of Eq. 15, Eq. 14 is proved.

Algorithm 1 The Training Process of IPD
Given: $n$, $m$, $k$, $\lambda_1$, $\lambda_2$, $\lambda_3$, train-data
Initialize params $e_c \in \mathbb{R}^m$, $A_s \in \mathbb{R}^{m \times k}$
Initialize $C^j_{i,r} \in \mathbb{R}^2$: $i \in [n]$, $r \in [R]$, $j \in [k]$ in Eq. 6
$L_o \leftarrow \| I_{k+1} - [e_c, A_s]^\top [e_c, A_s] \|^2$   (orthonormality constraint)
for $(x, y, l) \in$ train-data do

In the testing process, we first obtain the modality factor $e_l$ according to the modality combination index; then we send the input $x$ to the network and obtain $y^*(x) = \mathrm{Net}(x; e_l)$.

B.2 THE TRAINING OBJECTIVE OF VIDEO RETRIEVAL

As shown in MMT Gabeur et al. (2020) , the training objective of video retrieval is triplet contrastive loss, which compares the matching scores between video representation and text representation. To implement our method, we employ a new contrastive cross-entropy loss. Concretely, we treat the query text as one of the modalities of the video and text modality always exists. If the query text and the video are matched, the classification output is 1. In a batch, we treat the matched text and video as positive sample and change one side to construct negative samples. Then, the cross-entropy loss can be used to train the network.
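One way to read this objective is sketched below. It is an assumption-laden illustration rather than the loss actually used: every text-video pair in a batch is scored by a single logit, matched pairs get label 1 and all other pairs label 0, and a sigmoid (binary) cross-entropy is applied. The function names `pair_labels` and `binary_cross_entropy` are hypothetical.

```python
import numpy as np

def pair_labels(batch_size):
    """Label 1 for matched (text_i, video_i) pairs, 0 for every mismatched pair in the batch."""
    return np.eye(batch_size)

def binary_cross_entropy(logits, labels):
    """Mean sigmoid cross-entropy over all text-video pairs, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z * y + log(1 + exp(-|z|)).
    Assumption: the classification output is a single logit per pair.
    """
    loss = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

labels = pair_labels(4)
good_logits = 10.0 * (2.0 * labels - 1.0)    # confident and correct on every pair
print(binary_cross_entropy(good_logits, labels))
print(binary_cross_entropy(-good_logits, labels))
```

Confident correct scores give a loss near zero, and confident wrong scores give a loss near the logit magnitude, so minimizing it pushes matched pairs above mismatched ones.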

B.3 THE FEATURE EXTRACTION FOR VIDEO RETRIEVAL

Motion embeddings are extracted from S3D Xie et al. (2018) trained on the Kinetics dataset. Scene embeddings are extracted with DenseNet-161 Huang et al. (2017) trained on the Places365 dataset Zhou et al. (2017). OCR embeddings are extracted in three stages. First, the pixel link text detection model is used to detect the overlaid text. Then, the detected boxes are passed through the text recognition model. Finally, each character sequence is encoded using a word2vec embedding. Audio embeddings are obtained with a VGGish model trained on the YouTube-8M dataset. Speech features are extracted using the Google Cloud Speech API to obtain word tokens from the audio stream, which are then encoded via pre-trained word2vec embeddings Mikolov et al. (2013).

We follow LMF to implement IPD. All the components of IPD are the same as those of LMF, except for the final classification layer. Specifically, the hidden size of the task prototype is 16, and the ranks $k$ and $R$ are both set to 5. $\Delta$ is set to 0.1. The tuning of the other parameters is similar to LMF. After grid search, the batch size is set to 16 and the learning rate to 0.001. The hidden size of the common feature space is 16. $\lambda_1$, $\lambda_2$, $\lambda_3$ are set to 0.1. We run all the experiments on 5 RTX-3080Ti GPUs (10GB).

B.5 EXPERIMENTAL DETAILS FOR VIDEO RETRIEVAL

We follow MMT to implement IPD. All the components of IPD are the same as those of MMT. We add the new contrastive cross-entropy loss introduced above to the module and employ it as an auxiliary loss. Specifically, the hidden size of the task prototype is 512, and the ranks $k$ and $R$ are both set to 32. $\Delta$ is set to 0.1. The tuning of the other parameters is similar to MMT. Concretely, the batch size is 32 and the initial learning rate is 5e-5, which decays by a multiplicative factor of 0.95 every 1k optimization steps. We train for 50k steps. The hidden size is 512 for all the Transformer structures, the number of heads is 8, and there are 6 stacked attention blocks. $\lambda_1$, $\lambda_2$, $\lambda_3$ are set to 0.1. We run all the experiments on 5 RTX-3080Ti GPUs (10GB).
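The learning-rate schedule stated above can be written down in a few lines. The sketch assumes a staircase decay (the rate changes only at multiples of 1k steps); the text does not say whether the decay is staircase or continuous, and the function name is illustrative.

```python
def learning_rate(step, base_lr=5e-5, decay=0.95, decay_every=1000):
    """Staircase schedule: multiply by `decay` once per `decay_every` steps."""
    return base_lr * decay ** (step // decay_every)

# Learning rate at the start, after the first decay, and at the end of training.
lr_start = learning_rate(0)
lr_after_1k = learning_rate(1000)
lr_final = learning_rate(50_000)
```

Under this assumption the rate is reduced 50 times over the 50k training steps.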

B.6 EXPERIMENTAL DETAILS OF BASELINE METHODS

We reproduce the baseline methods by replacing the features of missing modalities with all-zero vectors and otherwise following the original procedure of each method.
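A minimal sketch of this zero-filling step is given below. The dictionary layout, the modality names, and the feature dimensions are assumptions made for illustration; the baselines' actual data structures may differ.

```python
def fill_missing(features, dims):
    """Replace each missing modality with an all-zero vector of its dimension.

    features: maps modality name to a list of floats, or None if missing.
    dims:     maps every pre-defined modality name to its feature dimension.
    """
    filled = {}
    for name, dim in dims.items():
        value = features.get(name)
        # A modality that is absent or None is treated as missing.
        filled[name] = list(value) if value is not None else [0.0] * dim
    return filled

# Usage with assumed modality names and sizes.
sample = {"text": [1.0, 2.0], "audio": None}
dims = {"text": 2, "audio": 3, "visual": 2}
filled = fill_missing(sample, dims)
```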

C.1 ADDITIONAL RESULTS OF POM

The results for all 16 traits are shown in Table 6 and Table 7. Similar conclusions can be drawn from them.

C.2 ADDITIONAL RESULTS OF THE CASE THAT ALL THE MODALITIES ARE AVAILABLE WHEN TRAINING

We also conduct experiments in which all the modalities are available during training. In this case, we adopt the same train/val/test splits as LMF. The validation and test sets are each divided equally among the 7 modality combinations. We report the corresponding results of different methods. As shown in the figure, the metric gap between IPD and the baseline methods narrows, since MVAE, MCTN, and MMIN were originally designed for this case.
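The count of 7 combinations matches the number of non-empty subsets of three modalities (2^3 - 1). The sketch below enumerates them and spreads samples evenly across them. The three modality names and the round-robin assignment are assumptions; the text does not specify how samples are assigned to combinations.

```python
from itertools import combinations

def nonempty_combinations(modalities):
    """All non-empty subsets of the given modalities, smallest first."""
    return [combo
            for size in range(1, len(modalities) + 1)
            for combo in combinations(modalities, size)]

def assign_combinations(sample_ids, modalities):
    """Round-robin assignment so each combination gets an equal share."""
    combos = nonempty_combinations(modalities)
    return {sid: combos[i % len(combos)] for i, sid in enumerate(sample_ids)}

# Usage with assumed modality names and 14 toy sample ids.
modalities = ("text", "audio", "visual")
combos = nonempty_combinations(modalities)
assignment = assign_combinations(range(14), modalities)
```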

C.3 ADDITIONAL RESULTS OF CMU-MOSEI

The results on CMU-MOSEI Zadeh et al. (2018b) are shown in Table 9. CMU-MOSEI was also proposed for the evaluation of multimodal sentiment analysis. It is similar to CMU-MOSI and contains 23454 movie review videos. Similar conclusions can be drawn from Table 9.

We also conduct experiments on XRMB and RGB-D. XRMB has two modalities (273D acoustic inputs and 112D articulatory inputs), following Wang et al. (2015); thus, the advantage of IPD is not obvious, and it achieves results similar to the baseline methods. For RGB-D, due to the imbalanced importance of the different modalities (3D point cloud, RGB color, and height, following (2021)), we keep the 3D point cloud as a fixed available modality and apply the missing modality setting to RGB color and height. Since it is hard to model multimodal interaction with two modalities, IPD also performs similarly to the baseline methods.



For convenience, we utilize n in the following sections. The experimental details of all the tasks are shown in the appendix. Due to space limits, we report the results for only a subset of traits here; the complete results are shown in the appendix.



Figure 1: An example of generalizable missing modality learning, where $M_1$, $M_2$, $M_3$ denote three modalities. There are unseen modality combinations in the testing stage.

Multimodal learning utilizes complementary information contained in multimodal data to improve the performance of various tasks. The key point of this area is multimodal fusion. Early fusion methods are the mainstream and integrate the features of different modalities before feeding them to the task modules. For example, concatenating different features Zadeh et al. (2016b) is a simple approach. Zadeh et al. (2017) propose a product operation to allow more interaction among different modalities during the fusion process. Liu et al. (2018) address the high complexity of Zadeh et al. (2017) and utilize modality-specific factors to achieve efficient low-rank fusion. However, the intra-modal dynamics cannot be effectively captured by the above methods. Liang et al. (2019) employ low-rank fusion at each time step of multi-view sequential input. Tsai et al. (2019) replace the LSTM with a Transformer because of its powerful encoding capacity. Rahman et al. (2020) employ large-scale pre-trained BERT embeddings for textual modeling. Liang et al. (2021) propose modality-invariant crossmodal attention, which learns crossmodal interactions over a modality-invariant space in which the distribution mismatch between different modalities is well bridged. However, these methods do not consider the missing modality problem at all. To this end, we propose IPD.

$m \times 2^n$, where $V_{t_1, t_2, \ldots, t_n} = \bigodot_{i=1}^{n} v_{i,t_i} \in \mathbb{R}^{m}$. The result $O_e = O_c + O_s$, consisting of the common part $O_c$ and the specific part $O_s$, can be calculated as follows:

Figure 3: Ablation study of the rank value, assuming that the two rank values ($k$ in Eq. 5 and $R$ in Eq. 6) are equal. The left and right parts denote IPD without and with contrastive constraints, respectively.

Figure 4: Ablation study of low-rank ensemble learning, where we report the multi-class accuracy of different modality conditions. Left and right parts denote IPD without and with ensemble learning.

6:   $E \leftarrow e_c \mathbf{1}^{\top} + A_s \beta^{\top}$   ▷ $\beta_l$ and $\beta$ are calculated based on Eqs. 5 and 6
7:   $e_l \leftarrow e_c + A_s \beta_l$
8:   loss += $L_t(y^*(x), y; e_l) + \lambda_1 L_e(O_e(x), y; E) + \lambda_2 L_o + \lambda_3 L_c$
9: end for
10: Optimize loss w.r.t. $e_c$, $A_s$, $C^j_{i,r}$
11: Return $e_c$, $A_s$, $C^j_{i,r}$

Algorithm 2 The Inference Process of IPD
1: Given: $n$, $m$, $k$, test-data
2: Trained params $e_c \in \mathbb{R}^{m}$, $A_s \in \mathbb{R}^{m \times k}$, $\beta \in \mathbb{R}^{2^n \times k}$   ▷ $\beta$ is calculated based on Eqs. 5 and 6
3: for $(x, l) \in$ test-data do
4:   $e_l \leftarrow e_c + A_s \beta_l$
5:   $y^*(x) = \mathrm{Net}(x; e_l)$
6: end for

B MORE EXPERIMENTAL DETAILS

B.1 THE DETAILED PROCESS OF IPD

Due to the space limitation, we put the algorithmic process (both training and testing) in this section. The mathematical symbols are the same as in the main paper. In the training process, $O_e(x)$ denotes the output $O_e$ for input $x$, $L_e(O_e(x), y; E)$ denotes the loss calculated based on the modality factors $E$, and $L_t(y^*(x), y; e_l)$ denotes the loss calculated based on the modality factor $e_l$.
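The step $e_l \leftarrow e_c + A_s \beta_l$ shared by both algorithms can be illustrated with a small sketch. It assumes that the entry of $\beta$ for a modality condition is obtained from the rank-R factors $C^j_{i,r}$ by summing, over r, the product of the selected factor entries, and that a condition is encoded as one 0/1 flag per modality. The nested-list layout `C[j][i][r]`, the flag convention, and the toy numbers are assumptions for illustration.

```python
from math import prod

def beta_row(C, condition):
    """Coefficients beta_l in R^k for one modality condition.

    C[j][i][r] is a length-2 factor; condition[i] in {0, 1} selects its entry
    for modality i (the meaning of 0 and 1 is an assumed convention).
    """
    k, n, R = len(C), len(condition), len(C[0][0])
    return [
        sum(prod(C[j][i][r][condition[i]] for i in range(n)) for r in range(R))
        for j in range(k)
    ]

def modality_factor(e_c, A_s, C, condition):
    """e_l = e_c + A_s beta_l, with A_s stored as m rows of length k."""
    b = beta_row(C, condition)
    return [e_c[d] + sum(A_s[d][j] * b[j] for j in range(len(b)))
            for d in range(len(e_c))]

# Toy setup (assumed values): n = 2 modalities, k = 1, R = 1, m = 2.
C = [[[[1.0, 2.0]], [[3.0, 4.0]]]]
e_c = [0.5, -0.5]
A_s = [[1.0], [0.5]]
e_l = modality_factor(e_c, A_s, C, (1, 0))
```

Because the factors are shared across conditions, the same function also yields a factor for a condition that never appeared during training.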

Face features are extracted by ResNet-50 He et al. (2016) trained for face classification on the VGGFace2 dataset. Appearance features are extracted from the final global average pooling layer of SENet-154 Hu et al. (2018) trained on ImageNet.

B.4 EXPERIMENTAL DETAILS FOR MULTIMODAL SENTIMENT ANALYSIS AND SPEAKER TRAITS RECOGNITION

The results on CMU-MOSI.

Ablation study on CMU-MOSI.
Full) 67.7 68.0 1.142 0.486 27.9 76.6 77.4 0.984 0.625 34.0

to indicate 35 facial action units, which record facial muscle movements for representing the basic and advanced emotions. For audio features, we use COVAREP Degottex et al. (

The performances on POM, where we report the multi-class accuracy of four traits, including Credible (Cre), Vivid (Viv), Expertise (Exp), and Entertaining (Ent).

Parameters of the implementation of task prototypes.


The performances on POM for existing modality combinations. MA(5,7) denotes multi-class accuracy for (5,7) classes.

The performances on POM for unseen combination. MA(5,7) denotes multi-class accuracy for (5,7) classes.

The results on CMU-MOSI when all the modalities are available during training.

The results on CMU-MOSEI. We report BA (binary accuracy), F1, Corr (correlation coefficient), MA (multi-class accuracy, higher is better), and MAE (mean absolute error, lower is better).
.5 1.050 0.409 36.7 76.1 75.7 0.725 0.498 43.5
MCTN Pham et al. (2019) 66.9 67.3 1.007 0.440 36.4 74.9 75.0 0.735 0.497 43.5
MMIN Zhao et al. (2021) 67.5 67.8 0.975 0.429 37.3 75.7 75.0 0.782 0.508 42.9
IPD (Ours) 69.8 70.2 0.909 0.470 39.6 77.8 78.4 0.712 0.515 45.0
Liu et al. (

The results on XRMB. We report PER (phone error rate).

The results on RGB-D. We report mAP@0.25 for 3D object detection.

