ON UNI-MODAL FEATURE LEARNING IN SUPERVISED MULTI-MODAL LEARNING

Abstract

We abstract the features of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interaction. Multi-modal joint training is expected to benefit from cross-modal interaction on the basis of ensuring uni-modal feature learning. However, recent late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality, and we prove that this phenomenon does hurt the model's generalization ability. Given a multi-modal task, we propose to choose targeted late-fusion learning method from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.

uni-modal features paired features uni-modal features paired features learned features unlearned features

Modality #1

Modality #2

Modality #1 Modality #2 Figure 1: Overview of Modality Laziness. Although multi-modal joint training provides the opportunity for cross-modal interaction to learn paired features, the model easily saturates and ignores the uni-modal features that are hard to learn but also important to generalization.

1. INTRODUCTION

Multi-modal signals, e.g., vision, sound, text, are ubiquitous in our daily life, allowing us to perceive the world through multiple sensory systems. Inspired by the crucial role that multi-modal interactions play in human perception and decision (Smith & Gasser, 2005) , substantial efforts have been made to build effective and reliable computational multi-modal systems in fields like multimedia computing (Wang et al., 2020; Xiao et al., 2020) , representation learning (Arandjelovic & Zisserman, 2017; Radford et al., 2021) and robotics (Chen et al., 2020a) . According to how the features of multi-modal data can be learned, we abstract them into two categories: (1) uni-modal features, which can be learned from uni-modal training, and (2) paired features, which can only be learned from cross-modal interaction. In this paper, we focus on multimodal tasks where uni-modal priors are meaningfulfoot_0 (Kay et al., 2017; Chen et al., 2020b) . Ideally, we hope that multi-modal joint training can learn paired features through cross-modal interactions on the basis of ensuring that enough uni-modal features are learned. However, recent late-fusion methods still suffer from learning insufficient uni-modal representations of each modality (Peng et al., 2022) . We term this phenomenon as Modality Laziness and illustrate that in Figure 1 . We theoretically characterize Modality Laziness and prove that it does hurt the generalization ability of the model, especially when uni-modal features are dominant in the given task. Besides the laziness problem, another shortcoming of recent late-fusion approaches is that they are complex to implement. For example, G-Blending (Wang et al., 2020) needs an extra split of data to estimate the overfitting-to-generalization ratio to re-weight the losses and then re-train the model again and again. Peng et al. (2022) proposes OGM-GE, which dynamically adjusts the gradients of different modalities during training. However, it needs to tune too many hyper-parameters 2 , including the start and end epoch of the gradient modulation, an "alpha" used to calculate the coefficients for the modulation and whether adaptive Gaussian noise Enhancement (GE) is needed. The more complicated thing is that these hyper-parameters need to be re-tuned on new datasets. To this end, more simple and effective methods are urgently needed. We pay attention to the learning of uni-modal features and propose to choose targeted late-fusion training method from Uni-Modal Ensemble (UME) and proposed Uni-Modal Teacher (UMT) according to the distribution of unimodal and paired features. If both uni-modal and paired features are essential, UMT is effective, which helps multi-modal models better learn uni-modal features via uni-modal distillation; if both modalities have strong uni-modal features and paired features are not important enough, UME is properer, which combines predictions of uni-modal models and completely avoids insufficient learning of uni-modal features. We also provide an empirical trick to decide which one to use. Under this guidance, we achieve comparable results to other complex late-fusion or intermediate-fusion methods on multiple multi-modal datasets, including VGG-Sound (Chen et al., 2020b) , Kinetics-400 (Kay et al., 2017) , UCF101 (Soomro et al., 2012) and ModelNet40 (Wu et al., 2022) .

2. RELATED WORK

Multi-modal training approaches aim to train a multi-modal model by using all modalities simultaneously (Liang et al., 2021) , including audio-visual classification (Peng et al., 2022; Xiao et al., 2020; Panda et al., 2021) , action recognition (Wang et al., 2020; Panda et al., 2021) , visual question answering (Agrawal et al., 2018) and RGB-D segmentation (Park et al., 2017; Hu et al., 2019; Seichter et al., 2020) . There are several different fusion methods, including early/middle fusion (Seichter et al., 2020; Nagrani et al., 2021; Wu et al., 2022) and late fusion (Wang et al., 2020; Peng et al., 2022; Fayek & Kumar, 2020) . In this paper, we mainly consider the late-fusion methods following Wang et al. (2020) , which is convenient and straightforward to evaluate the learning of uni-modal features. We demonstrate that simple late-fusion approaches can outperform approaches with more complex model architecture (Wu et al., 2022; Xiao et al., 2020) . Multi-modal learning theory. The research on multi-modal learning theory is still at an early age. A line of work focuses on understanding multi-view tasks (Amini et al., 2009; Xu et al., 2013; Arora et al., 2016; Allen-Zhu & Li, 2020) , and our assumption on the data structure partially stems from allows interactions between modalities to learn representations beyond uni-modal features, namely paired features. Based on this, we offer guidance on multi-modal late-fusion learning. Finally, we provide a theoretical analysis of Modality Laziness and justification for our solution. Discussion. The importance of uni-modal prior varies across different multi-modal tasks. In tasks like video classification (Chen et al., 2020b) and action recognition (Feichtenhofer et al., 2016; Wang et al., 2020) , uni-modal models can achieve good performance alone, suggesting that unimodal priors in these settings are essential. Visual question and answering (VQA) (Agrawal et al., 2018) is a counter example. Specifically, the same image with different text questions may have totally different labels, making it pointless to check its uni-modal accuracy. In this paper, we focus on the tasks where uni-modal priors are essential, following Wu et al. (2022) ; Peng et al. (2022) .

3.1. INSUFFICIENT LEARNING OF UNI-MODAL FEATURES IN MULTI-MODAL TRAINING

This subsection illustrates that existing multi-modal late-fusion training methods suffer from Modality Laziness. Even recent methods, G-Blending (Wang et al., 2020) and OGM-GE (Peng et al., 2022) , are no exception. In multi-modal late-fusion learning, each modality is encoded by its corresponding encoder and then a fusion module is applied on top of them to produce outputs. By building a classifier on frozen encoder (Chen et al., 2020c) , we assess the learned representations of the encoder: Compute uni-modal pre-trained features f m1 pre , f m2 pre of the data by F m1 pretrain , F m2 pretrain . Compute the prediction and features Ŷ , f m1 , f m2 from multi-modal model. Compute the losses between Ŷ , f m1 , f m2 and Y, f m1 pre , f m2 pre and multiply by the λ task , λ distill , λ distill , respectively. Update the multi-modal model by SGD or its variant. end for Return: A multi-modal model trained by UMT. • As Figure 2 shows, throughout the training process, the two encoders mentioned above not only cannot achieve comparable performance to their uni-modal counterparts but are far worse than them.

3.2. HOW DOES A MODEL BENEFIT FROM MULTI-MODAL TRAINING?

In Sec 3.1, we empirically show that recent late-fusion methods suffer from insufficient learning of uni-modal features. Combining predictions from uni-modal trained models avoids laziness by nature, which raises another question: How does a multi-modal model benefit from multi-modal training? We answer this question by investigating different models on VGG-Sound and find that the model learns some representations beyond uni-modal features. As Table 2 shows, in certain classes of VGG-Sound, the accuracy of naive fusion (navie fusion or naive multi-modal late-fusion learning means no carefully designed tricks are used) exceeds the sum of the accuracy of the two uni-modal models. Besides, we evaluate two other methods. One is to directly average the uni-modal models' predictions, which has little cross-modal interaction. The other one is to train a multi-modal linear classifier on top of uni-modal pre-trained encoders, where modalities can interact with each other through the linear layer. We find that naive fusion training, which owns maximum freedom of cross-modal interaction among these models, gets the best mean accuracy across these classes, suggesting that joint training enables the model to learn representations beyond uni-modal features, which we term as paired features. They are a type of feature that uni-modal training cannot learn. We offer more explanations on paired features in Appendix A.10. We also analyze more datasets and find that different datasets have different characteristics, we put more experimental analysis and interpretation in Appendix A.7.

3.3. GUIDANCE ON MULTI-MODAL LEARNING

Given a multi-modal task, if both uni-modal features and paired features are essential, Uni-Modal Teacher (UMT) is effective; if both modalities have strong uni-modal features and paired features are not important enough, simply combining the predictions of uni-modal models works well, which is named as Uni-Modal Ensemble (UME). UMT. Uni-Modal Teacher (UMT) is proposed for late-fusion joint training. It distills the pre-trained uni-modal features to the corresponding parts of multi-modal late-fusion models. Distilling knowledge from uni-modal models can help multi-modal models learn uni-modal features better, which happens in feature-level. The framework of UMT is shown in Algorithm 1 and Figure 4 . Noting that we use the same backbone in uni-modal and multi-modal model for a specified modality. The backbones and loss function used can be found in Sec 4. More details can be found in Appendix A.4.

UME.

If both modalities have strong uni-modal features, joint training does more harm than good. Combining predictions of uni-modal models avoids insufficient learning of uni-modal features by nature. Firstly, we can train uni-modal models independently. Then, we can give final output by weighting the predictions of uni-modal models. The simple ensemble method is named as Uni-Modal Ensemble (UME). We demonstrate that UME can show competitive performance on certain multi-modal datasets. An empirical trick to decide which method to use. We can train a multi-modal linear classifier on uni-modal pre-trained encoders and compare that with averaging predictions of uni-modal models. If the performance of the classifier is better, it means we can benefit from cross-modal interaction in this task and we can choose UMT, where cross-modal interactions are preserved while guaranteeing improved learning of uni-modal features; otherwise, the simple cross-modal interaction does more harm than good because of the strong uni-modal features of each modality, and we can choose UME, which avoids Modality Laziness completely.

3.4. THEORETICAL CHARACTERIZATION AND JUSTIFICATION

In this subsection, we characterize Modality Laziness of Sec 3.1 from a feature learning perspective and prove it does hurt the generalization of the model. And then, we give justification for the learning guidance proposed in Sec 3.3. Before diving into the technical details, we first provide some intuition behind the proof. Our goal is to show that how Modality Laziness happens in multi-modal joint training, and we refer to Figure 3 as an illustration. Here we omit the effect of paired features for easier to understand the intuition. During the naive multi-modal training process, learning those easy-to-learn features suffices to reach zero training error (point A in Figure 3 ). However, the model is under-trained at point A, and the zero-training-error region stops us from further training. As a comparison, uni-modal models can learn more features and achieve point B, outperforming point A. We next give Modality Laziness a theoretical explanation under a simple but effective regime. We mainly consider cases with two modalities x m1 and x m2 , and similar techniques can be directly generalized to the cases with more modalities. Data distribution. We formalize the distribution of the multi-modal features. Specifically, we abstract the features into uni-modal features (Definition 3.1) and paired features (Definition 3.2) to describe the core differences between uni-modal training and multi-modal joint training. We consider the binary classification regime where the label y has a uniform distribution over {-1, 1} without loss of generality. Such simplification is self-contained to describe the differences between uni-modal features and paired features. Definition 3.1 (Uni-modal features, which can be learned from uni-modal training). The i-th unimodal feature (f i (x m1 )) in modality x m1 is generated asfoot_2 : w.p. p(f i ), yf i (x m1 ) > 0; w.p. 1 -p(f i ) -ϵ(f i ), yf i (x m1 ) = 0; w.p. ϵ(f i ), yf i (x m1 ) < 0. The i-th uni-modal feature (g i (x m2 )) in modality x m2 is similarly generated with parameters p(g i ) and ϵ(g i ). Definition 3.2 (Paired features, which can only be learned from cross-modal interaction). The j-th paired featurefoot_3 h j is generated as: ℱ ! ′ ℱ ! 𝑥 " ! ℱ # ′ ℱ # 𝑥 " " ℱ # 𝑥 " " ℱ ! 𝑥 " ! 𝑨 𝑨 𝑩 𝑩 w.p. p(h j ), yh j (x m1 )h j (x m2 ) > 0; w.p. 1 -p(h j ) -ϵ(h j ), yh j (x m1 )h j (x m2 ) = 0; w.p. ϵ(h j ), yh j (x m1 )h j (x m2 ) < 0. When the context is clear, we abuse the notation r i to represent either f i (uni-modal feature in modality x m1 ), g i (uni-modal feature in modality x m2 ), or h i (paired feature). We name p(r i ) as the predicting probability of feature r i . When r i is present (meaning that r i ̸ = 0), we use I(r i > 0) -I(r i < 0) to predict y. Otherwise (r i = 0), we random guess y uniformly over {-1, 1}. To simplify the discussion, we always assume ϵ(f i ) = p(f i )/c, where c > 1 is a fixed constant. For the ease of notations, we define the empty feature in Definition 3.3. Definition 3.3 (Empty Feature). Empty feature e i is a kind of uni-modal feature (or paired feature) with p(e i ) = ϵ(e i ) = 0. Evaluation procedure. When the context is clear, we abuse r i to denote the learned features. For each data point, we random guess ŷ on {-1, 1} uniformly when i I(r i > 0) = i I(r i < 0). Otherwise, we predict the label by ŷ = 2I( i I(r i > 0) > i I(r i < 0)) -1. We define the error as i I(yr i < 0)i I(yr i > 0). Training procedure. (a.) multi-modal joint training, which directly train the model using both modality x m1 and modality x m2 ; (b.) uni-modal ensemble, which firstly train the features via independent training (x m1 and x m2 separately), and then combine the x m1 -learned features and x m2 -learned features. During the training process, we first initialize all the features with empty features e i to imitate random initialization. The models then learn the features in descending order of predicting probability, meaning that the powerful features (with large predicting probability) are learned firstfoot_4 . Our goal is to minimize the training error to zerofoot_5 . We now state our main theorem in Theorem 3.4, demonstrating naive joint training learn fewer uni-modal features compared to uni-modal training, which hurts the model's generalization. Theorem 3.4. In uni-modal ensemble, assume that the training procedure learns b m1 features in modality x m1 and learns b m2 features in modality x m2 . We order the probability of uni-modal fea-tures (both x m1 and x m2 ) in decreasing order of predicting probability p, namely, p [1] , p [2] , . . . . In multi-modal training approaches, assume that the training procedure learns k m1 uni-modal features in modality x m1 , learns k m2 uni-modal features in modality x m2 , and learns k pa paired features with predicting probability p(h 1 ), . . . , p(h kpa ). We provide three types of laziness: (a. ) Quantity Laziness: k m1 + k m2 + k pa ≤ min{b m1 , b m2 }. (b. ) Uni-modal Laziness: Each modality in multi-modal training approaches performs worse than uni-modal training. (c. ) Performance Laziness: Consider a new testing point, then for every δ > 0, if the following inequality holds: i∈[kpa] p(h i ) ≤ i∈[bm1+1,bm1+bm2] p [i] + ∆(δ), where ∆(δ) = 8(k pa + b m1 -k m1 + b m2 -k m2 ) log(1/δ) , then with probabilityfoot_6 at least 1 -δ, uni-modal ensemble outperform multi-modal training approaches concerning the loss on the testing point with probability. In theorem 3.4, we describe three notations of laziness problem: Quantity Laziness indicates that the number of features learned in naive multi-modal training is less than uni-modal training. Unimodal Laziness shows encoders from multi-modal training perform worse than from uni-modal training because of Quantity Laziness, which fits the experimental results in sec3.1. Performance Laziness compares the performance of multi-modal joint training approaches with Uni-Modal Ensemble, demonstrating that when uni-modal features dominate, combining uni-modal predictions is more effective. We defer the complete proof to Appendix B.1 and generalize that to more modalities (Appendix B.2). We give a concrete example in Appendix B.3 to better illustrate Theorem 3.4. We next prove that UMT proposed in Sec 3.3 indeed helps uni-modal feature learning and can also learn some easy-to-learn paired features in Theorem 3.5 and Appendix B.3. Theorem 3.5. Denote the paired features by h 1 , . . . h L with corresponding predicting probability p(h 1 ), . . . , p(h L ). Assume that distillation can boost the training priority by p 0 > 0. If there exists paired features whose predicting probability exceeds the boosting probability p 0 , namely, the set S is not empty: S = {h i : p(h i ) > p 0 } ̸ = ϕ. Then UMT helps uni-modal feature learning and can also learn easy-to-learn paired features.

4. EXPERIMENTS

In Sec 3.4, we justify our method theoretically. In this section, we firstly introduce the experimental setup and then demonstrate that choosing a suitable learning method from UMT and UME can outperforming other complex late-fusion or intermediate-fusion methods in various multi-modal tasks.

4.1. EXPERIMENTAL SETUP

Dataset. We run experiments on four datasets. Kinetics-400 (Kay et al., 2017) is a video recognition dataset with 240k videos for training and 19k for validation. We treat the two modalities, RGB and audio, as the inputs. VGG-Sound (Chen et al., 2020b ) is an audio-visual classification dataset which contains over 200k video clips for 309 different sound classes. UCF101 (Soomro et al., 2012) is an action recognition dataset with 101 action categories, including 7k videos for training and 3k for testing. ModelNet40 is a 3D object classification task with 9,483 training samples and 2,468 test samples. Following Wu et al. (2022) , we treat the front and rear view as two modalities. Training Settings. In VGG-Sound, UCF101 and ModelNet40, we use ResNet as our backbone, all with 18 layers. As for Kinetics-400, we use 50 or 101 layers' ResNet to encode the inputs. Noting that 3D CNN is used for visual data of VGG-Sound and Kinetics-400. The data preprocessing, hyper-parameters, optimizer can be found in the Appendix A.1 and A.2. We train a multi-modal linear classifier on frozen uni-modal pre-trained encoders and compare that with averaging uni-modal predictions. As Table 3 shows, in VGG-Sound and Kinetics-400, the classifier is better, meaning cross-modal interaction can benefit the classifier in the two datasets. However, in UCF101 and ModelNet40, averaging uni-modal predictions performs well. To explore why the classifier fails in UCF101, we check the uni-modal classifiers of the newly trained multimodal linear layer (details in Appendix A.7.1). As Table 4 shows, they are far worse than the unimodal models. The result shows the simple linear classifier suffers from serious Modality Laziness in UCF101, which negatively impact the performance. Both modalities in ModelNet40 also have strong uni-modal features and can achieve 89% accuracy individually. Averaging uni-modal predictions avoids laziness problem and achieves competitive performance. Based on the above analysis, we perform UMT on VGG-Sound and Kinetics-400, and UME on UCF101 and ModelNet40.

4.3. UMT IS AN EFFECTIVE REGULARIZER FOR UNI-MODAL FEATURE LEARNING

In this subsection, we demonstrate that Uni-Modal Teacher outperforms other multi-modal training methods in VGG-Sound and Kinetics-400. We use MSELoss as the distillation loss and set the weight of that as 50. Cross Entropy is used as classification loss and its weight is set as 1.

4.3.1. UMT IS EFFECTIVE ON VGG-SOUND AND KINETICS-400.

UMT vs Other Late-Fusion Methods. The late-fusion architecture is commonly used for multimodal classification tasks (Wang et al., 2020; Peng et al., 2022) . In late-fusion architecture, the features are extracted from different modalities by the corresponding encoders, and then the head layer is applied to output predictions. We compare different heads, including linear layer, MLP, and attention layer. In UMT, we use a simple linear layer as the multi-modal head. We also conduct another experiment, which adds extra uni-modal linear heads to receive the uni-modal features and generating additional losses to joint optimize the model, namely Auxiliary-CEloss. Auxiliary-CEloss gives all losses equal weights, while G-Blending reweights the losses according to the overfitting-togeneralization-ratio (OGR) (Wang et al., 2020) . OGM-GE (Peng et al., 2022) controls the optimization of each modality by online gradient modulation. However, both OGM-GE and G-Blending are complex to implement. As shown in Table 5 , UMT outperforms other methods. We compare UMT with AVSlowFast in Kinetics-400. As Table 6 shows, under different RGB encoders, UMT consistently exceeds AVSlowFast, although we cannot reproduce their results due to the dynamics of Kinetics-400 (Appendix A.11). Ablation Study of UMT. We first evaluate the encoders of UMT by training linear classifiers on them to verify that UMT does improve the uni-modal feature learning. As Table 7 shows, UMT makes its encoders stand out. Benefiting from uni-modal distillation, some encoders even outperform their uni-modal counterparts. We then compare UMT with classic self-distillation methods (distillation on soft label (Hinton et al., 2015) and feature (Romero et al., 2014) ). As Table 8 shows, naive self-distillation can only bring limited improvement, showing that UMT improves overall performance by improving the uni-modal feature learning instead of knowledge distillation.

4.3.2. UNI-MODAL ENSEMBLE IN MULTI-MODAL LEARNING

In this subsection, we demonstrate that Uni-Modal Ensemble is effective on multi-modal datasets where modalities have strong uni-modal features, outperforming other complex methods. Even though we don't combine these uni-modal predictions in any special way, but simply average. In UCF101, we compare Uni-Modal Ensemble with various multi-modal late-fusion methods. As Table 9 shows, although Gradient Blending (Wang et al., 2020) and OGM-GE (Peng et al., 2022) outperforms baseline methods, they are far worse than Uni-Modal Ensemble. In ModelNet40, the main comparing methods come from Wu et al. (2022) , which uses a multimodal DNN with intermediate fusion. It proposes a balanced multi-modal algorithm which balances conditional utilization of each modality by re-balancing the optimization step. UME surpass their balanced multi-modal algorithm, as Table 10 shows.

5. CONCLUSION

This paper analyzes the phenomenon of insufficient uni-modal feature learning in multi-modal training and proves that it does hurt the overall performance. We propose to choose proper learning method from Uni-Modal Ensemble and proposed Uni-Modal Teacher according to the distribution of uni-modal and paired features and demonstrate the effectiveness of the guiding principle.

A EXPERIMENTAL DETAILS AND ADDITIONAL EXPERIMENTS

A.1 DATASETS Here, we describe the preprocessing of Kinetics-400, VGG-Sound, UCF101 and ModelNet40 in detail. Kinetics-400 dataset (Kay et al., 2017) contains over 240k videos for training and 19k for validation, which we download from cvdfoundation 8 . Kinetics-400 is a commonly used dataset with 400 classes, and we mainly follow the open source preprocessing methods to process that. For RGB modality, we follow the procedure of PySlowFast 9 , which resizes the video to the short edge size of 256. and for audio modality, we follow mmaction2 10 to extract specgram features. When performing joint training, we take consecutive 64 frames from a video with fps of 30 and random crop the video to 224*224, and for audio inputs, we take the specgram that can be aligned in time with the clip extracted from the video. When testing, we ensemble the predictions from uniformly sampled clips with RGB and audio from a video and give the final outputs, following PySlowfast. VGG-Sound dataset (Chen et al., 2020b) , which contains over 200k video clips for 309 different sound classes, is also used for evaluating our method. It is an audio-visual dataset in the wild where each object that emits sound is also visible in the corresponding video clip, making it suitable for scene classification tasks. Please note that some clips in the dataset are no longer available on YouTube, and we actually use about 175k videos for training and 15k for testing, but the number of classes remains the same. We design a preprocessing paradigm to improve training efficiency as follows: (1) each video is interpolated to 256×256 and saved as stacked images; (2) each audio is first converted to 16 kHz and 32-bit precision in the floating-point PCM format, then randomly cropped or tiled to a fixed duration of 10s. For video input, 32 frames are uniformly sampled from each clip before feeding to the video encoder. While for the audio input, a 1024-point discrete Fourier transform is performed using nnAudio (Cheuk et al., 2020) , with 64 ms frame length and 32 ms frame-shift. And we only feed the magnitude spectrogram to the audio encoder. UCF101 dataset (Soomro et al., 2012) ModalNet40 is a 3D object classification dataset with 9,483 training samples and 2,468 test samples. We base on the front view and the rear view of the 3D object to classify that, following Wu et al. (2022) .

A.2 TRAINING HYPERPARAMETERS

In this subsection, we show the hyperparameters of our experiments in UCF101 and VGG-Sound in Table 11 . As for Kinetics-400's RGB modality, we totally follow the hyperparameters and settings of PySlow-Fast 11 . As for audio modality, we modify the hyperparameters 12 to be as consistent as possible with the RGB training for further joint training. Specifically, we use the same learning rate and batch size as RGB training used. As for ModelNet40, we totally follow the experimental settings of Wu et al. (2022) While the results in Table 1 show that different multi-modal methods suffer from learning insufficient uni-modal features, how about changing the optimizer? To answer this question, we try different optimizers for multi-modal late-fusion training (with a linear multi-modal head), including SGD, RMSprop, Adagrad, Adadelta, Adamw and Adam. As Table 12 shows, Modality Laziness exists no matter which optimizer is used.

A.4 DETAILS ON UNI-MODAL TEACHER (UMT)

In this subsection, we describe how Uni-Molda Teacher (UMT) applies on multi-modal late-fusion tasks. The overall architecture can be found in Figure 4 . UMT in late-fusion classification. In multi-modal late-fusion architecture, modalities are first encoded by the corresponding encoders and then mapped to the output space by a multi-modal fusion head (Figure 4 left). Uni-Modal Teacher distills the pre-trained uni-modal features to the corresponding parts in multi-modal networks in multi-modal training (Figure 4 right). Uni-modal distillation happens before fusion, so it's suitable for late-fusion multi-modal architecture. The pre-trained unimodal features are generated by inputting the data to the pre-trained uni-modal models. UMT's weights. For VGG-Sound and Kinetics, we use 50 (both audio feature distillation and RGB feature distillation) as the distillation loss's weight. We test different distillation weights on VGG-Sound and Kinetics-400. As shown in Table 13 , on both datasets, UMT performs well with an distillation weight of 50. is the encoder which is supervised pre-trained on uni-modal data. φ mi is a random initialed encoder without pre-training. L multi is the loss between multi-modal predictions and labels. L distill is the uni-modal distillation loss. Here we consider the common regularizer, dropout (Srivastava et al., 2014) , and a variant of it, namely modality-wise dropout, which randomly drops (with probability 1/3) the feature from one modality in each iteration. Modality dropout is akin to the ModDrop in Neverova et al. (2015) . As Table 14 shows, modality-wise dropout is significantly better than dropout, which implies that modality-wise laziness is serious and modality-wise dropout is also effective. 𝜑 𝑚 1 𝜑 𝑚 2 𝜑 𝑚 2 𝜑 𝑚 1 ' ' ℒ distill ℒ distill ℒ multi 𝜑 𝑚 1 𝜑 𝑚 2 ℒ multi

A.6 FINETUNING THE UNI-MODAL PRE-TRAINED ENCODERS

In this subsection, we use the uni-modal pre-trained encoders' parameters as the initialized weights in multi-modal training and randomly initialize a multi-modal linear classifier on the encoders. We set the classifier's learning rate as 1e -3 and try different learning rates on the encoders. As Table 15 shows, using the uni-modal supervised pre-trained encoder's weights in multi-modal training and then fine-tuning the whole multi-modal model can bring some improvement compared to naive fusion (49.46) but is worse than UMT, which gets 53.46 accuracy. When the learning rate of encoders is large, the encoders forget some abilities to extract uni-modal features. Ngiam et al. (2011) proposes to use Bimodal Deep Autoencoder to pre-train the encoders with multiple modalities. It is a direction worth exploring to address Modality Laziness of deep multimodal models.

A.7 THE ROLE OF CROSS-MODAL INTERACTION ON DIFFERENT DATASETS

In this subsection, we conduct various experiments to further investigate the effect of cross-modal interaction and explore the benefits and harms that cross-modal interaction brings in different multi- In Sec 4.2, we train a multi-modal linear classifier on frozen uni-modal pre-trained encoders and compare this classifier with directly averaging uni-modal models' predictions. As Table 3 shows, this classifier does not consistently outperform simply averaging the uni-modal predictions on all datasets. It shows better performance on VGG-Sound and Kinetics-400, but worse performance on UCF101 and ModelNet40. To further explain this phenomenon, we check and disassemble this new trained multi-modal classifier on UCF101. In the late-fusion multi-modal training, the features of different modalities are concatenated first and then the multi-modal classifier receives them and output predictions. Different modalities in the classifier do not share the parameters. So we split the new trained multi-modal linear classifier into uni-modal classifiers. We use the uni-modal pre-trained encoders to extract features and then the uni-modal classifiers receive the corresponding features and output predictions. Noting that OGM-GE (Peng et al., 2022) uses similar technique to check how well different modalities are trained. As Table 4 shows, the uni-modal classifiers from new trained multi-modal classifiers are significantly worse than uni-modal models, implying that the multi-modal classifier trained on uni-modal pre-trained encoders suffers from serious Modality Laziness on UCF101, although it is just a simple linear layer, resulting in worse performance than directly averaging the uni-modal predictions.

A.7.2 CLASS-LEVEL EVALUATION ON DIFFERENT MULTI-MODAL DATASETS

In this subsection, we compare naive late-fusion learning with averaging predictions of uni-modal models in class level. It's obvious that there are more cross-modal interactions in naive fusion. Although naive fusion suffers from learning insufficient uni-modal features, we find in some classes in Kinetics and VGG-Sound, the accuracy of naive fusion model outperforms averaging the unimodal models' predictions, and even outperforms the sum of the accuracy of the two uni-modal models in VGG-Sound and Kinetics-400, as shown in Table 2 and 16 . However, We cannot find any class that naive fusion can exceed the sum of the accuracy of the uni-RGB model and uni-flow model in UCF101. We select classes in UCF101 by sorting the differences of accuracy between naive fusion and the best uni-modal model in class level and the top ten with the largest difference are selected. In these classes where naive fusion has advantages, averaging the predictions can outperform naive fusion in some classes (ID:29, 67, 71) , and this phenomenon is not found in VGG-Sound and Kinetics. And as Tabla 4 and For Kinetics-400, we sort the classes alphabetically from smallest to largest according to the class name, and then we can get the mapping between class name and id. In UCF101, the mapping can be found in classInd.txt, a given file of UCF101. In order to verify whether the multimodal loss in UMT makes sense, we train uni-modal models by knowledge distillation to get better performance than encoders trained by UMT and then combine them by introducing a new multi-modal classifier on these encoders. As Table 18 shows, UMT works better in multi-modal performance, although the encoders of UMT in unimodal evaluation are worse, showing that UMT indeed benefits from cross-modal interaction.

A.8 CROSS-MODAL INTERACTION IN UNI-MODAL TEACHER (UMT)

A.9 EXPLORING UMT FOR MULTI-MODAL SEGMENTATION NYU Depth V2 dataset (Silberman et al., 2012) contains 1449 indoor RGB-Depth data totally and we use 40-class label setting. The number of training set and testing set is 795 and 654 respectively. All perprocessing operations are following (Seichter et al., 2020) . In contrast to the late fusion classification task, the RGB-Depth semantic segmentation belongs to middle fusion. The main encoder receives RGB inputs, and the depth inputs are fed into the depth encoder. At each intermediate layer, the main encoder fuses its own intermediate outputs and the depth features obtained from the depth encoder, which makes it a mid-fusion task (Seichter et al., 2020) . Since features generated by each layer matter, we distill multi-scale depth feature maps using the MSE loss. For feature maps from the RGB encoder, however, since they are generated by fusing RGB and depth modalities, we cannot distill RGB feature maps directly like depth feature maps. To mitigate this effect, we curate predictors, namely 2 layers CNNs, aiming to facilitate the fused feature maps to predict the RGB feature maps trained by the RGB modality before distillation. The full schematic diagram is presented in Figure 5 . As shown in Table 19 , UMT can also improve multi-modal segmentation whether the encoder is pre-trained on ImageNet or not.

A.10 EXPLANATIONS ON PAIRED FEATURES

We revisit the definitions of uni-modal features and paired features: uni-modal features, which can be learned by uni-modal training; paired features, which can only be learned by cross-modal interaction in joint training. Different datasets contain different proportions of these features. In this subsection, we use synthetic datasets to explain the uni-modal features and paired features in multi-modal tasks. Understanding different types of features in multi-modal tasks by synthetic datasets. Three different multi-modal datasets are generated to help us understand the uni-modal features and paired features in multi-modal tasks. The process of data generation mainly refers to Hessel & Lee (2020) . Firstly, we generate a dataset where each modality can extract the features to give correct predictions. We name this dataset as Dataset α. The data generation process is as follows: 1. Sample random projection P 1 ∈ R d1×d and P 2 ∈ R d2×d from U (-0.5, 0.5).

2.. Sample z

∈ R d ∼ N (0, 1). Normalize z to unit length 3. Sample x ∈ R d ∼ N (0, 1). Normalize x to unit length 4. if |x • z| ≤ 0.1, return to the Step 3. 5. If x • z > 0.1, then y = 1; else y = 0. 6. Get the data point (P 1 x, P 2 x, y). 7. If the amount of data generated is less than N , return to the Step 3; else break The P 1 x, P 2 x represents two modalities of the multi-modal dataset and we set d 1 , d 2 , d, N as 200, 100, 50, 5000, respectively. And we randomly split 80% of the generated data as train set, and the rest serves as a test set. In this dataset, uni-modal models can extract useful features to give correct predictions and multi-modal joint training is not necessary, as Table 20 shows (Dataset α). We name the features, that uni-modal models can learns to give correct predictions in the given task, as uni-modal features. Secondly, we generate another dataset where the model must rely on both the two modalities to make correct predictions. We name this dataset as Dataset β. The data generation process is as follows:  . Sample x 1 , x 2 ∈ R d ∼ N (0, 1). Normalize x 1 , x 2 to unit length 3. if |x 1 • x 2 | ≤ 0.25, return to the Step 2. 4. If x 1 • x 2 > 0.25, then y = 1; else y = 0. 5. Get the data point (P 1 x 1 , P 2 x 2 , y). 6. If the amount of data generated is less than N , return to the Step 2; else break This multi-modal dataset is different from the first dataset, because the labels in this dataset are highly dependent on the relationship between the two modalities. As we can see in Table 20 (Dataset β), the uni-modal models can only gives about 50 percent accuracy, while the multi-modal models can give about 90 percent accuracy. In binary classification tasks, 50 percent accuracy is no different from guessing. In this dataset, because labels are heavily relied on the relationships of the two modalities, we must train both modalities simultaneously to extract the joint representations to learn the relationship of the two modalities, which are beyond uni-modal features. In order to better carry out theoretical analysis, we abstract these representations into paired features, which can only be learned from multi-modal joint training in multi-modal tasks. Finally, we generate a dataset that contains both uni-modal features and paired features. We name this dataset as Dataset γ. The data generation process is as follows: 1. Sample random projection P 1 ∈ R d1×d and P 2 ∈ R d2×d from U (-0.5, 0.5). ∈ R d ∼ N (0, 1). Normalize z to unit length 3. Sample x ∈ R d ∼ N (0, 1). Normalize x to unit length 4. if x • z ≥ 0. 1, get the data point (P 1 x, P 2 x, y = 0). else, return to the step 3 until collecting 2500 data points.

5.. Sample x

1 , x 2 ∈ R d ∼ N (0, 1). Normalize x 1 , x 2 to unit length. If x 1 • z > -0.1 or x 2 • z > -0.1, resample. 6. if |x 1 • x 2 | ≤ 0. 25, return to the Step 5. 7. If x 1 • x 2 > 0.25, then y = 2; else y = 1. 8. Get the data point (P 1 x 1 , P 2 x 2 , y). 9. If the total amount of data generated is less than 7500, return to the Step 5; else break In the data labeled 0, each modality contains features that can give a correct prediction, while in the data labeled 1 or 2, we need both modalities together to make the right predictions. To further understand how uni-modal models give predictions in dataset that containing both uni-modal and paired features, we give the confusion matrix of the uni-modal model. As Table 21 shows, the unimodal model can give correct predictions for data labeled 0, while for data labeled 1 or 2, it fails and gives a random predictions. Because for data labeled 1 or 2, we need to learn the relationship of the two modalities to give the correct predictions. In this subsection, we mainly discuss the synthetic multi-modal datasets and in Appendix A.7, we conduct various experiments on real-world multi-modal datasets to help us understand the uni-modal features, paired features and cross-modal interaction in multi-modal training better. Training settings on synthetics datasets. We use a two layer MLP with ReLU as activation function. As for hidden layer, we use 200 dimensions for multi-modal training and 100 dimensions for uni-modal training. We use SGD as the optimizer and the learning rate is 0.2. In each iteration, we use the whole training set to compute the gradients. And we provide the code in supplement materials.

A.11 UNI-MODAL PERFORMANCE IN KINETICS-400

Kinetics-400 is a dynamic dataset, because videos may be removed from YouTube. In this subsection, we report the uni-modal performance of ours and Xiao et al. ( 2020)'s on Kinetics-400. As Table 22 shows, we cannot reproduce their uni-modal performance and ours are lower than theirs. But we demonstrate that UMT outperforms AVSlowFast in Sec 4.3.1, which shows UMT's effectiveness. Theorem 3.4. In uni-modal ensemble, assume that the training procedure learns b m1 features in modality x m1 and learns b m2 features in modality x m2 . We order the probability of uni-modal features (both x m1 and x m2 ) in decreasing order of predicting probability p, namely, p [1] , p [2] , . . . . In multi-modal training approaches, assume that the training procedure learns k m1 uni-modal features in modality x m1 , learns k m2 uni-modal features in modality x m2 , and learns k pa paired features with predicting probability p(h 1 ), . . . , p(h kpa ). We provide three types of laziness: (a. ) Quantity Laziness: k m1 + k m2 + k pa ≤ min{b m1 , b m2 }. (b. ) Uni-modal Laziness: Each modality in multi-modal training approaches performs worse than uni-modal training. (c. ) Performance Laziness: Consider a new testing point, then for every δ > 0, if the following inequality holds: i∈[kpa] p(h i ) ≤ i∈[bm1+1,bm1+bm2] p [i] + ∆(δ), where ∆(δ) = 8(k pa + b m1 -k m1 + b m2 -k m2 ) log(1/δ), then with probabilityfoot_8 at least 1 -δ, uni-modal ensemble outperform multi-modal training approaches concerning the loss on the testing point with probability. We prove the theorem, which shows that naive joint training indeed suffers from overfitting issues, meaning that it learns less features compared to uni-modal ensemble. Proof. We first introduce some additional notations used in the proof. We define the features trained in x m1 -uni-modal training as f 1 (x m1 ), . . . , f bm1 (x m1 ), define the features trained in x m1uni-modal training as g 1 (x m2 ), . . . , g bm2 (x m2 ). Therefore, there are in total b m1 + b m2 features learned in uni-modal ensemble, namely, f 1 (x m1 ), . . . , f bm1 (x m1 ), g 1 (x m2 ), . . . , g bm2 (x m2 ). Besides, We define the features trained in multi-modal training approaches as f 1 (x m1 ), . . . , f km1 (x m1 ), g 1 (x m2 ), . . . , g km2 (x m2 ), h 1 (x m1 , x m2 ), . . . , h kpa (x m1 , x m2 ). When the context is clear, we omit the dependency of x m1 , x m2 and denote them as f i , g i , h i for simplicity. When the context is clear, we abuse the notation r to represent arbitrary f , g or h. The corresponding predicting probability of feature r i is denoted as p(r i ). To summary, there are b m1 + b m2 features in uni-modal ensemble, k m1 + k m2 + k pa features in multi-modal training approaches. We first prove statement (a.), which claims that the number of features learned in multi-modal training approaches are provably less than any of the number of features learned in uni-modal training. The proof depends on the following Lemma B.1. Lemma B.1. Assume there exists T features r i , i = 1, . . . , T . If we replace one of the T features (without loss of generality, r T ) with a more powerful feature r ′ , where p(r ′ ) > p(r T ), then the predicting probability for each data point increases (where the probability is taken over the randomness of the training data). We next provide the proof of statements (a.): based on Lemma B.1. We shall prove k m1 + k m2 + k pa < b m1 without loss of generality. Start from the features f 1 (x m1 ), . . . , f km1 (x m1 ) which are common features in both multi-modal training approaches and Uni-modal training. Next step, we add feature f km1+1 in uni-modal approachesand g 1 in multi-modal training approaches. Obviously, p(g 1 ) > p(f km1+1 ) due to the training priority (or multi-modal training approaches should learn f km1+1 instead of g 1 ). Therefore, the predicting probability of multi-modal training approaches is larger than uni-modal approaches. Repeating the procedure by comparing g i with f km1+i and comparing h j with f km1+km2+j , the predicting probability of multi-modal training approaches is always larger than uni-modal approaches. Note that b m1 should be always larger than k m1 + k m2 , or the predicting probability of uni-modal approaches would be smaller than multi-modal training approaches. At the end of the comparison, the predicting probability of multi-modal training approaches is still larger than uni-modal approaches. This requires that uni-modal approaches should learn more features, which can be regarded as uni-modal approacheslearns a features while multi-modal training approaches learns an empty feature. In conclusion, uni-modal approaches learns more features compared to multi-modal training approaches, leading to b m1 > k m1 + k m2 + k pa . We next prove the statement (b.). The proof of (b.) is based on (a.). We next only consider modality x m1 , the proof for modality x m2 is similar. Note that the since the number of features learned in multi-modal training approaches is less than b m1 , the number of features learned in x m1 must be less than b m1 (Note that those features can be either paired feature or uni-modal feature, namely, f 1 , . . . , f km1 and h 1 , . . . , h kpa ). Therefore, multi-modal training approaches learns less features compared to uni-modal approaches in modality x m1 . On the other hand, the predicting probability of features learned in multi-modal training approaches (f 1 , . . . , f km1 and h 1 , . . . , h kpa , considering only modality x m1 for the paired feature) is less than that learned in uni-modal approaches (f 1 , . . . , f bm1 ), because otherwise, uni-modal approaches will learn the features in h instead of f . In conclusion, when considering only modality x m1 , the number of features learned in multi-modal training approaches is less and its corresponding predicting probability is small. Therefore, each modality in multi-modal training approaches performs worse than uni-modal approaches. We finally prove the statement (c.). Recall that the loss isi u(r i ) where u(r i ) = I(yr i > 0) -I(yr i < 0). Note that E(u(r i )) = 1 2 p(r i ) and |u(r i )| ≤ 1. We derive that: P   - i∈[km1] u(f i ) - i∈[km2] u(g i ) - i∈[kpa] u(h i ) ≤ - i∈[bm1] u(f i ) - i∈[bm2] u(g i )   =P   km1<i≤bm1 u(f i ) + km2<i≤bm2 u(g i ) - i∈[kpa] u(h i ) ≤ 0   =P   km1<i≤bm1 u(f i ) + km2<i≤bm2 u(g i ) - i∈[kpa] u(h i ) + 1 2 E ≤ 1 2 E   , where E = -E( km1<i≤bm1 u(f i ) + km2<i≤bm2 u(g i ) -i∈[kpa] u(h i )) = i∈[kpa] p(h i ) - km1<i≤bm1 p(f i ) -km2<i≤bm2 p(g i ). Due to the training priority and the conclusion in (a.), i∈[bm1+1,bm1+bm2] p [i] ≤ km1<i≤bm1 p(f i ) + km2<i≤bm2 p(g i ). Therefore, E ≤ i∈[kpa] p(h i ) - i∈[bm1+1,bm1+bm2] p [i] ≤ 8(k pa + b m1 -k m1 + b m2 -k m2 ) log(1/δ). We next apply Hoeffding inequality on Equation B.1 and derive that P   - i∈[km1] u(f i ) - i∈[km2] u(g i ) - i∈[kpa] u(h i ) < - i∈[bm1] u(f i ) - i∈[bm2] u(g i )   ≤ exp(-E 2 /8(k pa + b m1 -k m1 + b m2 -k m2 )) ≤δ To conclude, multi-modal training approaches outperform uni-modal ensemble concerning the testing loss with probability at least 1 -δ. Compared to uni-modal ensemble, denote the additional paired feature are indexed by c, and the additional uni-modal feature in uni-modal ensemble are indexed by v. We have that:  P( i∈[c] (I(f i (x) > 0) -I(f i (x) < 0)) - j∈[v] (I(f j (x) > 0) -I(f j (x) < 0)) > 0) =P( i∈[c] I(f i (x) > 0) - j∈[v] I(f j (x) > 0) - 1 2 [ i∈[c] p i - j∈[v] p j ] > 1 2 [ j∈[v] p j - i∈[c] p i ]) ≤ exp(-( j∈[v] p j - i∈[c] p i ) 2 /8|c + v|) (1) Therefore, if j∈[v] p j -i∈[c] p i ≥ 8(c + v) log(1/δ),   rT + i∈[T -1] yri   > 0 i∈[T -1] yri ≥ 2   + 1 2 P   y   rT + i∈[T -1] yri   = 0 i∈[T -1] yri ≥ 2   =P   y   r ′ + i∈[T -1] yri   > 0 i∈[T -1] yri ≥ 2   + 1 2 P   y   r ′ + i∈[T -1] yri   = 0 i∈[T -1] yri ≥ 2   . (b. ) When the number of effective features in r [-T ] is even, | i∈[T -1] yr i | ̸ = 1. (c. ) When | i∈[T -1] yr i | = 0 , due to the assumption that p(r ′ ) > p(r T ) and ϵ(r) = p(r)/c, adding r ′ helps increase the predicting probability compared to r T , namely P   y   rT + i∈[T -1] yri   > 0 i∈[T -1] yri = 0   + 1 2 P   y   rT + i∈[T -1] yri   = 0 i∈[T -1] yri = 0   >P   y   r ′ + i∈[T -1] yri   > 0 i∈[T -1] yri = 0   + 1 2 P   y   r ′ + i∈[T -1] yri   = 0 i∈[T -1] yri = 0   . The above inequality is derived based on the following equation: P   y   rT + i∈[T -1] yri   > 0 i∈[T -1] yri = 0   + 1 2 P   y   rT + i∈[T -1] yri   = 0 i∈[T -1] yri = 0   =P   yrT > 0 i∈[T -1] yri = 0   + 1 2 P   yrT = 0 i∈[T -1] yri = 0   =p(rT ) + 1 2 [1 -p(rT ) -ϵ(rT )] = 1 2 [1 + (1 -1/c)p(rT )] . Since we assume c > 1, the probability increases with probability p(r T ). Therefore, under the three conditions, adding r ′ increase the predicting probability more compared to r T . In summary, under case 1 (a-c), adding r ′ increase the predicting probability compared to r T . Case 2: When the number of features in r [-T ] is odd. The discussion in (b.) can be a little bit more complex compared to case 1. (a. ) If | i∈[T -1] yr i | ≥ 2, similar to case 1, adding r T or r ′ does not alter the predicting probability, namely P   y   rT + i∈[T -1] yri   > 0 i∈[T -1] yri ≥ 2   + 1 2 P   y   rT + i∈[T -1] yri   = 0 i∈[T -1] yri ≥ 2   =P   y   r ′ + i∈[T -1] yri   > 0 i∈[T -1] yri ≥ 2   + 1 2 P   y   r ′ + i∈[T -1] yri   = 0 i∈[T -1] yri ≥ 2   . (b. ) If | i∈[T -1] yr i | = 1: (b.1 ) If i∈[T -1] yr i = -1: P   y   rT + i∈[T -1] yri   > 0 i∈[T -1] yri = -1   + 1 2 P   y   rT + i∈[T -1] yri   = 0 i∈[T -1] yri = -1   =P   yrT -1 > 0 i∈[T -1] yri = -1   + 1 2 P   yrT -1 = 0 i∈[T -1] yri = -1   = 1 2 P   yrT -1 = 0 i∈[T -1] yri = -1   = 1 2 p(rT ). (b.2 ) If i∈[T -1] yr i = +1: P   y  rT + i∈[T -1] yri   > 0 i∈[T -1] yri = 1   + 1 2 P   y   rT + i∈[T -1] yri   = 0 i∈[T -1] yri = 1   =P   yrT + 1 > 0 i∈[T -1] yri = 1   + 1 2 P   yrT + 1 = 0 i∈[T -1] yri = 1   =(1 -ϵ(rT )) + 1 2 ϵ(rT ) =1 - 1 2c p(rT ). Note that the probability of event (b.1) and the probability of event (b.2) satisfy the following equation by Lemma B.2: P   i∈[T -1] yr i = 1   = cP   i∈[T -1] yr i = -1   . Therefore, the total probability under case (b) is 1 2 p(r T )P   i∈[T -1] yr i = -1   + (1 - 1 2c p(r T ))P   i∈[T -1] yr i = 1   =P   i∈[T -1] yr i = 1   which is independent of p(r T ). Therefore, adding r T or r ′ share the same predicting probability. In summary, under case 2 (a-c), adding r ′ do not decrease the predicting probability compared to r T . The following lemmas are used during the proof.  yr i = 1 in B if yr i = -1 in A yr i = -1 in B if yr i = 1 in A yr i = 0 in B if yr i = 0 in A Comparing each event in A with its complementary event in B leads to the conclusion. Combining case 1 and case 2 together leads to the final conclusion.

B.2 GENERALIZE THEOREM 3.4 TO MORE MODALITIES

We next show that the results in Theorem 3.4 can be generalized to the regime of more modals. (c. ) Performance Laziness: Consider a new testing point, then for every δ > 0, if the following inequality holds: i∈[kpa] p(h i ) ≤ i∈[mini{bm i }+1, i bm i ] p [i] + ∆(δ), where ∆(δ) = 8(k pa + j [b mj -k mj ]) log(1/δ), then with probabilityfoot_9 at least 1 -δ, uni-modal ensemble outperform multi-modal training approaches concerning the loss on the testing point with probability. B.3 PROOF OF THEOREM 3.5 Theorem 3.5. Denote the paired features by h 1 , . . . h L with corresponding predicting probability p(h 1 ), . . . , p(h L ). Assume that distillation can boost the training priority by p 0 > 0. If there exists paired features whose predicting probability exceeds the boosting probability p 0 , namely, the set S is not empty: S = {h i : p(h i ) > p 0 } ̸ = ϕ. Then UMT helps uni-modal feature learning and can also learn easy-to-learn paired features. Proof. The core of Theorem 3.5 is to clarify the training priority. We revisit the notations of Theorem 3.4 as follows without further clarification. At the end of the training, uni-modal ensemble learn b m1 + b m2 useful features, namely, f 1 , . . . , f bm1 , g 1 , . . . , g bm2 . And multi-modal training approaches learn k m1 + k m2 + k pa features: f 1 , . . . , f km1 , g 1 , . . . , g km2 , h 1 , . . . , h kpa . We note that there are still many empty features e i in the model due to the initialization. By distillation, the model learns the features according to the new priority. Since the set S is not empty, there exists paired features that is learned before the empty features. By distillation, the model would learn all the useful features that appear in uni-modal approaches, as well as those features in set S. Therefore, UMT outperforms uni-modal ensemblewhen there exists useful paired features. B.4 A CONCRETE EXAMPLE TO ILLUSTRATE THEOREM 3.4 We next provide a concrete example to better illustrate the Modality Laziness issues. For Example B.4, we aim to show the Modality Laziness issues. For Example B.5, we aim to show the role of the pushing force. Example B.4. Consider modality x m1 with features f 1 , f 2 , f 3 (corresponding prediction probability p = 0.2, 0.1, 0.05), and modality x m2 with features g 1 , g 2 , g 3 (corresponding prediction probability p = 0.15, 0.08, 0.02). We show the dataset in Table 23 and aim to minimize the training loss to zero. In uni-modal approaches, we learn features f 1 , f 2 and f 3 on modality x m1 (similarly, g 1 , g 2 , and g 3 on modality x m2 ). Therefore, we learn features f 1 , f 2 , f 3 , g 1 , g 2 , g 3 in uni-modal ensemble. In multi-modal training approaches without paired feature, we can only learn three features f 1 , f 2 , g 2 due to the training priority f 1 > g 1 > f 2 > g 2 > f 3 > g 3 (decreasing order in p). This phenomenon is caused by modality laziness. We next consider another paired feature h with probability p = 0.28. Under the case, multi-modal training approaches only learn two features h and f 1 . Therefore, when h is not powerful enough, uni-modal ensemble outperforms multi-modal training approaches. Example B.5. We follow the notations and dataset in Example B.4. By applying the pushing force, assume that each probability of uni-modal feature boosts 0.15, which changes the training priority to f 1 > g 1 > h > f 2 > g 2 > f 3 > g 3 (decreasing order in p ′ ). Therefore, multi-modal training approaches (with pushing force) learns f 1 , f 2 , h. As a comparison, multi-modal training approaches (without pushing force) can only learn f 1 , h. Therefore, pushing force helps learn more features. We additionally remark that we only consider the training error in this example, and there might be other penalties in practice (e.g., distillation loss).



Uni-modal prior here means that we get predictions only according to one modality in multi-modal tasks. Allen-Zhu & Li (2020).Huang et al. (2021) explains multi-modal learning is potentially better than uni-modal learning andHuang et al. (2022) explains why failure exists in multi-modal learning. Our paper investigates the different types of features in multi-modal data and provides solutions for the weakness of multi-modal learning.Knowledge distillation was introduced to compress the knowledge from an ensemble into a smaller and faster model but still preserve competitive generalization power(Buciluǎ et al., 2006;Hinton et al., 2015;Tian et al., 2019;Gou et al., 2021;Allen-Zhu & Li, 2020). In this paper, we propose Uni-Modal Teacher to leverage uni-modal distillation for joint training to help the learning of unimodal features, without involving cross-modal knowledge distillation(Pham et al., 2019;Gupta et al., 2016;Tan & Bansal, 2020;Garcia et al., 2018;Luo et al., 2018).3 ANALYSIS, LEARNING GUIDANCE AND THEORYIn this section, we show the drawbacks and advantages of joint training. On one hand, joint training results in insufficient learning of uni-modal features (Modality Laziness). On the other hand, it 2 https://github.com/GeWu-Lab/OGM-GE_CVPR2022 We simplify "with probability" as "w.p." We abuse the notation h to simplify the notations where h(x m 1 ) and h(x m 2 ) can have different forms. Recent works have demonstrated that neural networks indeed prefer easy-to-learn features(Shah et al., 2020;Pezeshki et al., 2020). We always assume that the training error can be minimized to zero. The probability is taken over the randomness of the testing point Sample random projection P 1 ∈ R d1×d and P 2 ∈ R d2×d from U (-0.5, 0.5). The probability is taken over the randomness of the testing point The probability is taken over the randomness of the testing point



(a) RGB encoder evaluation on VGG-Sound. (b) Optical flow encoder evaluation on UCF101.

Figure 2: By building a linear classifier on encoders and checking the top-1 accuracy, we evaluate the RGB encoder in VGG-Sound and the optical flow encoder in UCF101 from different multi-modal late-fusion methods.

is an action recognition dataset with 101 action categories, including 7k videos for training and 3k for testing. And we use the rgb and flow provided by(Feichtenhofer et al., 2016). For RGB, we use one image of (3 * 224 * 224) as the input; while for flow, we use a stack of optical flow images which contained 10 x-channel and 10 y-channel images, So its input shape is (20 * 224 * 224). During training, we perform random crop and random horizontal flip as the data augmentation; while testing, we resize the image to 224 and do not perform data augmentation operations.

Figure 4: Model architecture of naive late fusion (left) and Uni-Modal Teacher (UMT) (right). φ ′ mi

Figure 5: Distillation details of UMT for RGB (left) and depth (right) modalities in multi-modal semantic segmentation (based on ESANet).

the probability is done. Therefore, for a new data point, uni-modal ensemble can outperforms multi-modal training approaches with high probability. Proof of Lemma B.1. We define r [-T ] as the features r 1 , . . . , r T -1 . The proof is divided into two parts, depending on whether i∈[T -1] I(r i ̸ = 0) is even or odd. We regard the term i∈[T -1] I(r i ̸ = 0) as the number of effective features in r [-T ] . To simplify the discussion, we rescale r such that |yr| = 1 (when r ̸ = 0) or |yr| = 0 (when r = 0). Case 1: When the number of effective features in r [-T ] is even. (a. ) If | i∈[T -1] yr i | ≥ 2, adding r T or r ′ does not alter the predicting probability, namely P   y

(c. ) When the number of effective features in r[-T ] is odd, | i∈[T -1] yr i | ̸ = 0.

Consider T -1 features r 1 , . . . , r T -1 , the following equation holds: It can be proved to compare the events A = { i∈[T -1] yr i = 1} and B = { i∈[T -1] yr i = -1}. Every event in A has a complementary event in B, namely,

Specifically, we assume a T -modal regimes, and denote the modals as x mi , i ∈ [T ]. In uni-modal pre-training approaches, let b mi denote the number of returned features in modal i. In multi-modal joint training, let k mi denote the number of uni-modal features for modal i, and k pa denote the number of returned paired features. We derive the following Theorem B.3 for the multi-modal regimes. Theorem B.3. Based on the above notations, we provide three types of laziness from three perspectives:(a. ) Quantity Laziness: i k mi + k pa ≤ min i {b mi }.(b. ) Uni-modal Laziness: Each modality in multi-modal training approaches performs worse than uni-modal training.

Top 1 test accuracy (in %) of linear evaluation on encoders from various multi-modal latefusion training methods and uni-modal training on VGG-Sound and UCF101.

Top-1 test accuracy of multi-modal models and uni-modal models on certain classes of VGG-Sound. Avg Pred: average the two uni-modal models' predictions directly. Linear Clf: train a multi-modal classifier on top of uni-modal trained encoders. Naive Fusion: train a multi-modal late-fusion model from scratch. modal model F mm , iteration number N , loss weight λ task , λ distill .



Comparison between averaging unimodal predictions and multi-modal classifier trained on uni-modal pre-trained encoders.

Evaluation of uni-modal classifiers from the multi-modal linear classifier trained on uni-modal pre-trained encoders in UCF101. This evaluation method borrows fromPeng et al. (2022).

Results of different late-fusion methods. * means the result comes from its original paper

Comparison between UMT and Audio-Visual SlowFast(Xiao et al., 2020) on Kinetics-400. AVSlowFast is an representative intermediate fusion method.

Evaluation on the encoders trained by naive multi-modal training and UMT.

Self-Distillation vs UMT on VGG-Sound.

https://github.com/open-mmlab/mmaction2/blob/master/tools/data/build audio features.py/ 11 https://github.com/facebookresearch/SlowFast/configs/Kinetics/SLOWFAST 8x8 R50.yaml 12 openmmlab/mmaction2/blob/master/configs/recognition audio/resnet/tsn r18 64x1x1 100e kinetics400 audio feature.py 13 https://github.com/nyukat/greedy multimodal learning

The Hyperparameters used in our experiments for VGG-Sound and UCF101.

Top-1 test accuracy (in %) of linear classifiers trained on frozen encoders from multimodal late-fusion training under different optimizers and uni-modal training on VGG-Sound.

Different distillation weights of UMT on VGG-Sound and Kinetics-400

Dropout in multi-modal training on VGG-Sound.

The top-1 test accuracy of finetuning the uni-modal pre-trained encoders and linear evaluation on finetuned encoders on VGG-Sound.

Table17show, both RGB and optical flow in UCF101 can get strong performance individually. All the evidence shows that in UCF101, the uni-modal features are totally dominate and any joint training can lead to serious Modality Laziness. Top-1 test accuracy of different models on some classes of Kinetics. The accuracy of naive fusion model outperforms averaging the uni-modal models' predictions, and even outperforms the sum of the accuracy of the uni-audio model and uni-video model.

Top-1 test accuracy of different models on selected classes of UCF101. We select the top-10 classes according to the gap of accuracy between the multi-modal and uni-modal models. As we can see, uni-modal model's performance is high, meaning paired features in UCF101 are rare.



Model performance comparison under UMT and ESANet on NYU-DepthV2 RGB-Depth semantic segmentation task.

Test Accuracy of uni-modal models and multi-modal model on different synthetic datasets. Synthetic Dataset α mainly contains uni-modal features which can be learned in uni-modal training; Synthetic Dataset β mainly contains paired features which need joint training to learn; Synthetic Dataset γ contains uni-modal features and paired features. Dataset Synthetic Dataset α Synthetic Datset β Synthetic Dataset γ

The confusion matrix of uni-modal model in Dataset γ. In the data labeled 0, each modality contains features that can give a correct prediction, while in the data labeled 1 or 2, we need both modalities together to make the right predictions.

Uni-Modal Performance of ours and Xiao et al. (2020)'s on Kinetics-400

Dataset used in Example B.4. + means the feature is larger than zero andmeans the feature is less than zero. We denote the predicting probability by p and the rectified probability (due to pushing force) by p ′ .

6. REPRODUCIBILITY STATEMENT

In our paper, we detail the specific implementation steps of the methods in Sec 3.3 and also give the values of the hyper-parameters we use in detail in Appendix A.2. We also provide the codes in supplementary material, which can reproduce the results of the experiments.

