THE MODALITY FOCUSING HYPOTHESIS: TOWARDS UNDERSTANDING CROSSMODAL KNOWLEDGE DISTILLATION

Abstract

Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications. To achieve knowledge transfer across modalities, a pretrained network from one modality is adopted as the teacher to provide supervision signals to a student network learning from another modality. In contrast to the empirical success reported in prior works, the working mechanism of crossmodal KD remains a mystery. In this paper, we present a thorough understanding of crossmodal KD. We begin with two case studies and demonstrate that KD is not a universal cure in crossmodal knowledge transfer. We then present the modality Venn diagram (MVD) to understand modality relationships and the modality focusing hypothesis (MFH) revealing the decisive factor in the efficacy of crossmodal KD. Experimental results on 6 multimodal datasets help justify our hypothesis, diagnose failure cases, and point directions to improve crossmodal knowledge transfer in the future.

1. INTRODUCTION

Knowledge distillation (KD) is an effective technique to transfer knowledge from one neural network to another (Wang & Yoon, 2021; Gou et al., 2021). Its core mechanism is a teacher-student learning framework, where the student network is trained to mimic the teacher through a loss. The loss function, initially proposed by Hinton et al. (2015) as the KL divergence between teacher and student soft labels, has been extended in many ways (Zagoruyko & Komodakis, 2016; Tung & Mori, 2019; Park et al., 2019; Peng et al., 2019; Tian et al., 2019). KD has been successfully applied to various fields and demonstrates its high practical value. The wide applicability of KD stems from its generality: any student can learn from any teacher. To be more precise, the student and teacher network may differ in several ways. Three common scenarios are: (1) model capacity difference: many works (Zagoruyko & Komodakis, 2016; Tung & Mori, 2019; Park et al., 2019; Peng et al., 2019) on model compression aim to learn a lightweight student matching the performance of its cumbersome teacher for deployment benefits. (2) architecture (inductive bias) difference: as an example, recent works (Touvron et al., 2021; Ren et al., 2022; Xianing et al., 2022) propose to utilize a CNN teacher to distill its inductive bias to a transformer student for data efficiency. (3) modality difference: KD has been extended to transfer knowledge across modalities (Gupta et al., 2016; Aytar et al., 2016; Zhao et al., 2018; Garcia et al., 2018; Thoker & Gall, 2019; Ren et al., 2021; Afouras et al., 2020; Valverde et al., 2021; Xue et al., 2021), where the teacher and student network come from different modalities. Examples include using an RGB teacher to provide supervision signals to a student network taking depth images as input, and adopting an audio teacher to train a visual student.
Despite the great empirical success reported in prior works, the working mechanism of KD is still poorly understood (Gou et al., 2021). This puts the efficacy of KD into question: Is KD always efficient? If not, what is a good indicator of KD performance? A few works (Cho & Hariharan, 2019; Tang et al., 2020; Ren et al., 2022) search for the answer in the context of model capacity difference and architecture difference. However, the analysis for the third scenario, KD under modality difference or formally crossmodal KD, remains an open problem. This work aims to fill this gap and for the first time provides a comprehensive analysis of crossmodal KD. Our major contributions are the following:

• We evaluate crossmodal KD on a few multimodal tasks and find, surprisingly, that teacher performance does not always positively correlate with student performance.
• To explore the cause of performance mismatch in crossmodal KD, we adopt the modality Venn diagram (MVD) to understand modality relationships and formally define modality-general decisive features and modality-specific decisive features.
• We present the modality focusing hypothesis (MFH) that provides an explanation of when crossmodal KD is effective. We hypothesize that modality-general decisive features are the crucial factor that determines the efficacy of crossmodal KD.
• We conduct experiments on 6 multimodal datasets (i.e., synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB). The results validate the proposed MFH and provide insights on how to improve crossmodal KD.

2. RELATED WORK

2.1. UNIMODAL KD

KD represents a general technique that transfers information learned by a teacher network to a student network, with applications to many vision tasks (Tung & Mori, 2019; Peng et al., 2019; He et al., 2019; Liu et al., 2019). Despite the development towards better distillation techniques or new application fields, there is limited literature (Phuong & Lampert, 2019; Cho & Hariharan, 2019; Tang et al., 2020; Ren et al., 2022; 2023) on understanding the working mechanism of KD. Specifically, Cho & Hariharan (2019) and Mirzadeh et al. (2020) investigate KD for model compression, i.e., when the student and teacher differ in model size. They point out that mismatched capacity between the student and teacher networks can lead to failure of KD. Ren et al. (2022) analyze KD for vision transformers and demonstrate that the teacher's inductive bias matters more than its accuracy in improving the performance of the transformer student. These works provide good insight into understanding KD, yet their discussions are limited to the unimodal setting and have not touched on KD for multimodal learning.

2.2. CROSSMODAL KD

With the accessibility of the Internet and the growing availability of multimodal sensors, multimodal learning has received increasing research attention (Baltrušaitis et al., 2018). Following this trend, KD has also been extended to achieve knowledge transfer from multimodal data and enjoys diverse applications, such as action recognition (Garcia et al., 2018; Luo et al., 2018; Thoker & Gall, 2019), lip reading (Ren et al., 2021; Afouras et al., 2020), and medical image segmentation (Hu et al., 2020; Li et al., 2020). Vision models are often adopted as teachers to provide supervision to student models of other modalities, e.g., sound (Aytar et al., 2016; Xue et al., 2021), depth (Gupta et al., 2016; Xue et al., 2021), optical flow (Garcia et al., 2018), thermal (Kruthiventi et al., 2017), and wireless signals (Zhao et al., 2018). Although these works demonstrate the potential of crossmodal KD, they are often tied to a specific multimodal task. An in-depth analysis of crossmodal KD is notably lacking, which is the main focus of this paper.

2.3. MULTIMODAL DATA RELATIONS

There is continuous discussion on how to characterize multimodal (or multi-view) data relations. Many works (Tsai et al., 2020; Lin et al., 2021; 2022) utilize the multi-view assumption (Sridharan & Kakade, 2008), which states that either view alone is sufficient for the downstream tasks. However, as suggested by Tsai et al. (2020), when the two views of input lie in different modalities, the multi-view assumption is likely to fail. In the meantime, a few works on multimodal learning (Wang et al., 2016; Zhang et al., 2018; Hazarika et al., 2020; Ma et al., 2020) indicate that multimodal features can be decomposed into modality-general features and features specific to each modality. Building upon these ideas, in this work, we present the MVD to formally characterize modality relations.

In addition, the importance of modality-general information has been identified in these works, yet in different contexts. In multi-view learning, Lin et al. (2021; 2022) consider shared information between two views as the key to enforcing cross-view consistency. To boost multimodal network performance and enhance its generalization ability, Wang et al. (2016), Zhang et al. (2018), Hazarika et al. (2020), and Ma et al. (2020) propose different ways to separate modality-general and modality-specific information. For semi-supervised multimodal learning, Sun et al. (2020) aim at maximizing the mutual information shared by all modalities. To the best of our knowledge, our work is the first to reveal the importance of modality-general information in crossmodal KD.

3. ON THE EFFICACY OF CROSSMODAL KD

First, we revisit the basics of KD and introduce notations used throughout the paper. Consider a supervised $K$-class classification problem. Let $f_{\theta_s}(x) \in \mathbb{R}^K$ and $f_{\theta_t}(x) \in \mathbb{R}^K$ represent the output (i.e., class probabilities) of the student and teacher networks respectively, where $\{\theta_s, \theta_t\}$ are learnable parameters. Without loss of generality, we limit our discussion to input data of two modalities, denoted by $x^a$ and $x^b$ for modality $a$ and $b$, respectively. Assume that we aim to learn a student network that takes $x^b$ as input. In conventional unimodal KD, the teacher network takes input from the same modality as the student network (i.e., $x^b$). The objective for training the student is:

$$\mathcal{L} = \rho \mathcal{L}_{task} + (1 - \rho) \mathcal{L}_{kd} \tag{1}$$

where $\mathcal{L}_{task}$ represents the cross entropy loss between the ground truth label $y \in \{0, 1, \cdots, K-1\}$ and the student prediction $f_{\theta_s}(x^b)$, $\mathcal{L}_{kd}$ represents the KL divergence between the student prediction $f_{\theta_s}(x^b)$ and the teacher prediction $f_{\theta_t}(x^b)$, and $\rho \in [0, 1]$ weighs the importance of the two terms $\mathcal{L}_{task}$ and $\mathcal{L}_{kd}$ (i.e., driving the student to true labels or to the teacher's soft predictions).

Crossmodal KD resorts to a teacher from the other modality (i.e., $x^a$) to transfer knowledge to the student. Eq. (1) is still valid with a slight correction: the KL divergence term is now calculated using $f_{\theta_s}(x^b)$ and $f_{\theta_t}(x^a)$. In addition, there is one variant (or special case) of crossmodal KD, where a multimodal teacher taking input from both modality $a$ and $b$ is adopted for distillation, and $\mathcal{L}_{kd}$ is now a KL divergence term between $f_{\theta_s}(x^b)$ and $f_{\theta_t}(x^a, x^b)$.

We first present a case study on the comparison of crossmodal KD with unimodal KD. Consider the special case of crossmodal KD where a multimodal teacher is adopted.
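As a concrete reference for the notation, the objective in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the training code used in the experiments: the function names are hypothetical, the KL term is taken in the direction KL(teacher || student) to match the risk used later in Theorem 1, and no temperature scaling is applied since the text does not mention one. For crossmodal KD, the teacher logits simply come from modality a (or from both modalities) while the student logits come from modality b.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def crossmodal_kd_loss(student_logits, teacher_logits, labels, rho=0.5, eps=1e-12):
    """L = rho * L_task + (1 - rho) * L_kd, averaged over the batch.

    student_logits: student outputs on modality b, shape (N, K)
    teacher_logits: teacher outputs on modality a (or on a and b), shape (N, K)
    labels: integer class labels in {0, ..., K-1}, shape (N,)
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # Cross entropy between ground-truth labels and student predictions.
    l_task = -np.log(p_s[np.arange(len(labels)), labels] + eps).mean()
    # KL(teacher || student): pulls the student toward the teacher's soft labels.
    # The direction is an assumption; the text only says "KL divergence".
    l_kd = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1).mean()
    return rho * l_task + (1.0 - rho) * l_kd
```

Setting `rho=0` recovers pure distillation, which is the setting analyzed in the theoretical part of the paper.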
Intuitively, adopting a multimodal teacher, which takes both modality a and b as input, can be beneficial for distillation since: (1) a multimodal network usually enjoys a higher accuracy than its unimodal counterpart (Baltrušaitis et al., 2018), and a more accurate teacher ought to result in a better student; (2) the complementary modality-dependent information brought by a multimodal teacher can enrich the student with additional knowledge. This idea motivates many research works (Luo et al., 2018; Hu et al., 2020; Valverde et al., 2021) to replace a unimodal teacher with a multimodal one, in an attempt to improve student performance. Despite the empirical evidence reported in prior works, in this paper, we reflect on this assumption and ask the question: Is crossmodal KD always effective?

To study crossmodal KD, it is critical to first establish an understanding of multimodal data. Before touching multimodal data, let us step back and consider unimodal data. Following a causal perspective (Schölkopf et al., 2012) (i.e., features cause labels), we assume that the label y is determined by a subset of features in x a (or x b ); this subset of features is referred to as the decisive features for modality a (or modality b) throughout the paper. For instance, colors of an image help identify some classes (e.g., distinguish between a zebra and a horse) and can be considered as decisive features. When considering multimodal data, input features of the two modalities will have logical relations such as intersection and union. We describe the modality Venn diagram (MVD) below to characterize this relationship. Stemming from the common perception that multimodal data possess shared information and preserve information specific to each modality, MVD states that any multimodal features are composed of modality-general features and modality-specific features.
Decisive features of the two modalities are thus composed of two parts: (1) modality-general decisive features and (2) modality-specific decisive features; these two parts of decisive features work together and contribute to the final label $y$.

Next, we propose a formal description of MVD to capture the generating dynamics of multimodal data. Let $\mathcal{X}^a$, $\mathcal{X}^b$, and $\mathcal{Y}$ be the feature space of modality $a$, the feature space of modality $b$, and the label space, respectively, and let $(x^a, x^b, y)$ be a sample drawn from an unknown distribution $P$ over the space $\mathcal{X}^a \times \mathcal{X}^b \times \mathcal{Y}$. MVD assumes that $(x^a, x^b, y)$ is generated by a quadruple $(z^{sa}, z^{sb}, z^0, y) \in \mathcal{Z}^{sa} \times \mathcal{Z}^{sb} \times \mathcal{Z}^0 \times \mathcal{Y}$, following the generation rule:

MVD GENERATION RULE:
$$\begin{aligned}
x^a &= g^a(z^a), \quad z^a = [z^{sa}, z^0]^T \in \mathcal{Z}^a = \mathcal{Z}^{sa} \times \mathcal{Z}^0 \\
x^b &= g^b(z^b), \quad z^b = [z^{sb}, z^0]^T \in \mathcal{Z}^b = \mathcal{Z}^{sb} \times \mathcal{Z}^0
\end{aligned} \tag{2}$$

where, with the notation $u \in \{a, b\}$, $g^u(\cdot): \mathcal{Z}^u \to \mathcal{X}^u$ denotes an unknown generating function. To complete the MVD, a linear decision rule should also be included. Specifically, the following equation:

MVD DECISION RULE:
$$\exists\, W^u, \quad \arg\max\left[\mathrm{Softmax}(W^u z^u)\right] = \arg\max\left[W^u z^u\right] = y \tag{3}$$

is assumed to hold for any $(z^{sa}, z^{sb}, z^0, y)$, where we slightly abuse notation and let $\arg\max[\cdot]$ denote the index of the largest element of its argument. In essence, MVD specifies that $x^u$ is generated from a modality-specific decisive feature vector $z^{su}$ and a modality-general decisive feature vector $z^0$ (generation rule), and that $z^u$ is sufficient to linearly determine the label $y$ (decision rule).

We proceed to quantify the relative amounts of modality-general decisive features and modality-specific decisive features. Let $\mathcal{Z}^{su} \subseteq \mathbb{R}^{d_{su}}$ and $\mathcal{Z}^0 \subseteq \mathbb{R}^{d_0}$, so that $\mathcal{Z}^u \subseteq \mathbb{R}^{d_u}$, where $d_u = d_{su} + d_0$. We define the ratio $\gamma = d_0 / (d_0 + d_{sa} + d_{sb}) \in [0, 1]$, which characterizes the proportion of modality-general decisive features among all decisive features. Similarly, $\alpha = d_{sa} / (d_0 + d_{sa} + d_{sb})$ and $\beta = d_{sb} / (d_0 + d_{sa} + d_{sb})$ denote the proportions of modality-specific decisive features for modality $a$ and $b$ among all decisive features, respectively, and we have $\alpha + \beta + \gamma = 1$.
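To make the generation rule and the three ratios concrete, the following NumPy sketch samples latent features that share a modality-general block. It is only an illustration under stated assumptions: the latents are standard Gaussian, the generating functions are taken to be identities (so each modality's input equals its latent vector), labels are not generated, and the function names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def mvd_ratios(d_sa, d_sb, d_0):
    """Return (alpha, beta, gamma) for the given decisive-feature dimensions."""
    total = d_sa + d_sb + d_0
    return d_sa / total, d_sb / total, d_0 / total

def sample_mvd_features(n, d_sa, d_sb, d_0, seed=0):
    """Sample n pairs (z^a, z^b) that share the modality-general block z^0.

    Assumptions for illustration only: standard Gaussian latents and
    identity generating functions g^a and g^b, so x^u = z^u.
    """
    rng = np.random.default_rng(seed)
    z_sa = rng.standard_normal((n, d_sa))  # specific to modality a
    z_sb = rng.standard_normal((n, d_sb))  # specific to modality b
    z_0 = rng.standard_normal((n, d_0))    # shared by both modalities
    z_a = np.concatenate([z_sa, z_0], axis=1)  # z^a = [z^sa, z^0]
    z_b = np.concatenate([z_sb, z_0], axis=1)  # z^b = [z^sb, z^0]
    return z_a, z_b
```

For example, `mvd_ratios(3, 2, 5)` gives α = 0.3, β = 0.2, and γ = 0.5, and the last five columns of `z_a` and `z_b` coincide because they hold the same modality-general features.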

4.2. THE MODALITY FOCUSING HYPOTHESIS

Based on MVD, we now revisit our observation in Sec. 3 (i.e., teacher accuracy is not a key indicator of student performance) and provide explanations. First, teacher performance is decided by both modality-general decisive and modality-specific decisive features in modality a. In terms of student performance, although modality-specific decisive features in modality a are meaningful for the teacher, they cannot instruct the student since the student only sees modality b. On the other hand, modality-general decisive features are not specific to modality b and could be transferred to the student. Coming back to the example in Fig. 1, if an audio teacher provides modality-specific information (i.e., the sound colored in red), the visual student will get confused as this information (i.e., playing violin) is not available in the visual modality. On the contrary, modality-general information can be well transferred across modalities and facilitates distillation, as the audio teacher and visual student can both perceive the information about the left person playing guitar. This motivates the following modality focusing hypothesis (MFH).

The Modality Focusing Hypothesis (MFH). For crossmodal KD, distillation performance is dependent on the proportion of modality-general decisive features preserved in the teacher network: with larger γ, the student network is expected to perform better.

The hypothesis states that in crossmodal knowledge transfer, the student learns to "focus on" modality-general decisive features. Crossmodal KD is thus beneficial for the case where γ is large (i.e., multimodal data share much label-relevant information). Moreover, it accounts for our observation that teacher performance fails to correlate with student performance in some scenarios: when α is large and γ is small, the teacher network attains high accuracy primarily based on modality-specific information, which is not beneficial for the student's learning process.
To give a quick, intuitive understanding of our hypothesis, we present two experiments with synthetic Gaussian data. More details can be found in Sec. 5.2. As shown in Fig. 2, we start from the extreme case where the two modalities do not overlap, and gradually increase the proportion of modality-general decisive features until all decisive features are shared by the two modalities. We observe that crossmodal KD fails to work when x a and x b share few decisive features (i.e., γ is small), since modality-specific decisive features in modality a are not perceived by the student. As γ gradually increases, crossmodal KD becomes more effective. For the case where all decisive features are present in both modalities, the student gains from the teacher's knowledge and outperforms its baseline by 2.1%. Note that the teacher accuracy does not vary much during this process, yet student performance differs greatly.

Figure 2: An illustration of MFH with synthetic Gaussian data. Teacher modality is x a and student modality is x b . We plot the confidence interval of one standard deviation for student accuracy. With increasing γ, crossmodal KD becomes more effective.

Fig. 3 illustrates the reverse process, where modality-specific decisive features in modality a gradually dominate. With increasing α, the teacher gradually improves since it receives more modality-specific decisive features for prediction. However, the student network fails to benefit from the improved teacher and performs slightly worse instead. Clearly, teacher performance is not reflective of student performance in this case.

Figure 3: With increasing α (i.e., decreasing γ), the teacher improves its prediction accuracy but the student network fails to benefit from KD. See the caption of Fig. 2 for more explanations.
These two sets of experiments help demonstrate that teacher accuracy does not faithfully reflect the effectiveness of crossmodal KD and lend support to our proposed hypothesis. Apart from the two intuitive examples, below we provide a theoretical guarantee of MFH in an analytically tractable case, linear binary classification. Formally, we consider an infinitesimal learning rate, which turns the training into a continuous gradient flow defined on a time parameter $t \in [0, +\infty)$ (Phuong & Lampert, 2019). If $n$ data points are available, collectively denoted as $Z^u \in \mathbb{R}^{d_u \times n}$, we have the following theorem, which bounds the training distillation loss in terms of $\gamma$.

Theorem 1. (Crossmodal KD in linear binary classification). Without loss of generality, we assume $f_{\theta_t}(\cdot): \mathcal{X}^a \to \mathcal{Y}$ and $f_{\theta_s}(\cdot): \mathcal{X}^b \to \mathcal{Y}$. Suppose $\max\{\|Z^u Z^{u,T}\|, \|(Z^u Z^{u,T})^{-1}\|\} \le \lambda$ always holds for both $u = a$ and $u = b$, and $g^u(\cdot)$ are identity functions. If there exists $(\epsilon, \delta)$ such that

$$\Pr\left[\|Z^{a,T} Z^a - Z^{b,T} Z^b\| \le (1 - \gamma)\epsilon\right] \ge 1 - \delta \tag{4}$$

then, with an initialization at $t = 0$ satisfying $R^{dis}_n(\theta_s(0)) \le q$, we have, with probability at least $1 - \delta$:

$$R^{dis}_n(\theta_s(t = +\infty)) \le n\left(\frac{\epsilon^\star}{1 - e^{-\epsilon^\star}} - 1 - \ln\frac{\epsilon^\star}{1 - e^{-\epsilon^\star}}\right)$$

where $\epsilon^\star = \lambda^{1.5}(\lambda^2 + 1)(1 - \gamma)\epsilon$ and $R^{dis}_n(\theta_s)$ is the empirical risk defined by the KL divergence (corresponding to Eq. (1) when $\rho = 0$):

$$R^{dis}_n(\theta_s(t)) = \sum_{i=1}^{n} \left\{ -\sigma(\theta_t^T x^a_i) \cdot \ln\frac{\sigma(\theta_s^T x^b_i)}{\sigma(\theta_t^T x^a_i)} - \left[1 - \sigma(\theta_t^T x^a_i)\right] \cdot \ln\frac{1 - \sigma(\theta_s^T x^b_i)}{1 - \sigma(\theta_t^T x^a_i)} \right\}$$

See Appendix A for the omitted proof, several important remarks, and future improvements.
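To see how the bound in Theorem 1 behaves as γ varies, the right-hand side can be evaluated numerically. The sketch below is only a direct transcription of the stated formula; the function names and the example values of ε, λ, and n are illustrative assumptions, not quantities reported in the paper.

```python
import numpy as np

def kl_bound_term(eps):
    """h(eps) = eps/(1 - e^{-eps}) - 1 - ln(eps/(1 - e^{-eps})), with h(0) = 0."""
    if eps == 0.0:
        return 0.0  # limit as eps -> 0, since eps/(1 - e^{-eps}) -> 1
    x = eps / -np.expm1(-eps)  # -expm1(-eps) equals 1 - e^{-eps}
    return x - 1.0 - np.log(x)

def distillation_bound(gamma, eps, lam, n):
    """Right-hand side of the Theorem 1 bound on the distillation risk."""
    eps_star = lam ** 1.5 * (lam ** 2 + 1.0) * (1.0 - gamma) * eps
    return n * kl_bound_term(eps_star)

# Illustrative values only: sweep gamma from 0 to 1.
for gamma in np.linspace(0.0, 1.0, 5):
    print(f"gamma={gamma:.2f}  bound={distillation_bound(gamma, 0.1, 1.5, 100):.4f}")
```

Because ε⋆ shrinks linearly in (1 - γ) and the bound term is increasing in ε⋆, the printed bound decreases as γ grows and vanishes at γ = 1, which is the monotonic behavior the hypothesis relies on.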

4.3. IMPLICATIONS

We have presented MVD and MFH. Equipped with this new perspective on crossmodal KD, we discuss their implications and practical applications in this section.

Implication. For crossmodal KD, consider two teachers with identical architectures and similar performance: Teacher (a) makes predictions primarily based on modality-general decisive features while teacher (b) relies more on modality-specific decisive features. We expect that the student taught by teacher (a) yields better performance than that taught by teacher (b).

The implication above provides us with ways to validate MFH. It also points to directions for improving crossmodal KD: we can train a teacher network that focuses more on modality-general decisive features for prediction. Compared with a regularly-trained teacher, the new teacher is more modality-general (i.e., has a larger γ) and thus tailored for crossmodal knowledge transfer.

Note that: Firstly, identical architectures and similar performance of the two teachers are stated here for a fair comparison. Similar performance of the two teachers translates to similar ability to extract decisive features for prediction, and thus the only difference lies in the amount of modality-general decisive features. This design excludes other factors and helps justify that the performance difference stems from γ. In fact, we observe in experiments that even with lower accuracy than teacher (b), teacher (a) still demonstrates better crossmodal KD performance. Secondly, the main focus of this paper is to present the MFH and to validate it with theoretical analysis, synthetic experiments, and evidence from experiments conducted on real-world multimodal data. Contrary to the common belief, we detach the influence of teacher performance in crossmodal KD and point out that modality-general decisive features are the key. Developing methods to separate modality-general/specific decisive features from real-world multimodal data is beyond the scope of this paper and left as future work.

5.1. EXPERIMENTAL SETUP

To justify our MFH, we conduct experiments on 6 multimodal datasets (synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB) that cover a diverse combination of modalities including images, video, audio and text. In essence, we design approaches to obtain teachers of different γ and perform crossmodal KD to validate the implication presented in Sec. 4.3. We consider four different ways to derive a teacher network that attends to more or fewer modality-general decisive features than a regularly-trained teacher: (1) For synthetic Gaussian data, since the multimodal data generation mechanism is known, we train a modality-general teacher on data with only modality-general decisive features preserved and other channels removed; (2) For NYU Depth V2, we notice that RGB images and depth images share inherent similarities and possess identical dimensions, allowing them to be processed using a single network. Therefore, we design a modality-general teacher (i.e., one with a larger γ than a regularly-trained teacher) by following the training approach in (Girdhar et al., 2022); (3) For MM-IMDB data, we follow the approach in (Xue et al., 2021) to obtain a multimodal teacher which is more modality-specific (i.e., has a smaller γ) than the regularly-trained teacher; (4) For the other datasets, we design an approach based on feature importance (Breiman, 2001; Wojtas & Chen, 2020) to rank all feature channels according to the amount of modality-general decisive information. With a sorted list of all features, we train a modality-general teacher by only keeping feature channels with large salience values and a modality-specific teacher by keeping features with small values. In summary, a wide range of approaches and tasks are considered to justify MFH. See Appendix B for detailed setups and some results.
We observe considerable performance degradation (more than 10% accuracy loss) of the modality-general teacher compared with the regular teacher, as it only relies on modality-general decisive features and discards modality-specific decisive features for prediction. However, the modality-general teacher still facilitates crossmodal KD and leads to an improved student (roughly +2% accuracy compared with regular crossmodal KD). The results align well with MFH, which states that a teacher with more emphasis on modality-general decisive features (i.e., a larger γ) yields a better student.

5.3. NYU DEPTH V2

We revisit the example of NYU Depth V2 in Sec. 3. We adopt a teacher network that takes depth images as input to transfer knowledge to an RGB student. Both student and teacher network architectures are implemented as DeepLab V3+ (Chen et al., 2018). As described in Sec. 5.1, besides training a regular teacher, we follow (Girdhar et al., 2022) and train a teacher that learns to predict labels for the two modalities with identical parameters. To be specific, a training batch contains both RGB and depth images, and the teacher network is trained to output predictions given either RGB or depth images as input. As such, the resulting teacher is assumed to extract more modality-general features for decision (i.e., has a larger γ than a regular teacher) since it needs to process both modalities in an identical way during training.

As shown in Table 3, regular crossmodal KD does not bring many advantages: the student achieves a similar mIoU compared with the No-KD baseline. Therefore, one might easily blame the failure of crossmodal KD on teacher accuracy and assume that crossmodal KD is not effective because the depth teacher itself yields poor performance (i.e., has an mIoU of 37.33%). By nature of its training approach, the modality-general teacher is forced to extract more modality-general decisive features for prediction rather than rely on depth-specific features, as it also takes RGB images as input. While we do not observe a difference in teacher performance, the modality-general teacher turns out to be a better choice for crossmodal KD: its student mIoU improves from 46.36% to 47.93%. The results indicate that our MFH has the potential to diagnose crossmodal KD failures and lead to improvement.

Figure 4: The class activation maps of a regular teacher (middle) and a modality-general teacher (right) on VGGSound. A regular teacher attends to all decisive features (i.e., the visual objects) while a modality-general one focuses on modality-general decisive features (i.e., the area of vocalization).

Figure 5: As more feature channels are nullified, student performance first increases and then decreases; the process aligns well with MVD.

5.4. RAVDESS AND VGGSOUND

Besides the approaches presented above, we design a permutation-based method to sort the features according to the salience of modality-general decisive features, which allows us to obtain teacher networks with different values of γ. See Appendix B.4 for the detailed algorithm flow. We apply this approach to RAVDESS (Livingstone & Russo, 2018) and VGGSound (Chen et al., 2020a). RAVDESS is an audio-visual dataset containing 1,440 emotional utterances with 8 different emotion classes. Teacher modality is audio and student modality is images. The student and teacher networks follow the unimodal network design in (Xue et al., 2021). VGGSound is a large-scale audio-visual event classification dataset including over 200,000 videos and 310 classes. We consider two setups: (1) adopting an audio teacher and a video student; (2) using a video teacher for distillation to an audio student. We also experiment with two architectures, ResNet-18 and ResNet-50 (He et al., 2016), as the teacher and student network backbone. In our algorithm, we sort feature channels according to the salience of modality-general decisive features.

Moreover, to provide an intuitive understanding of how a modality-general teacher differs from the regular teacher, we present visualization results on VGGSound data in Fig. 4. We see that a regular video teacher utilizes all decisive features for classification, and attends to the visual objects (i.e., the saxophone and the baby). On the contrary, a modality-general teacher focuses more on information available in both the visual and audio modality, so the area of vocalization is the most activated.

Finally, we vary the feature nullifying ratio r% for VGGSound data and plot the student performance curve against r%. From Fig. 5, we observe that there exists a sweet spot for modality-general KD. As r% increases, the student performance improves in the beginning. The improvement indicates that non-modality-general decisive features in the teacher are gradually discarded, which in turn results in a better student. Later, after all non-modality-general decisive information is discarded, the feature nullifying process starts to hinder student performance as modality-general decisive features get nullified as well. The modality Venn diagram corresponding to this process is depicted in the upper part of Fig. 5. The observed performance curve aligns well with our understanding of MVD.
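The nullifying step described above can be illustrated with a short NumPy sketch. It assumes that per-channel salience scores are already available (how they are computed is described in Appendix B.4 and is not reproduced here) and that "nullifying" a channel means setting it to zero; both the function name and these choices are assumptions made for illustration.

```python
import numpy as np

def nullify_channels(features, salience, ratio):
    """Zero out the `ratio` fraction of channels with the lowest salience.

    features: array of shape (N, C)
    salience: hypothetical per-channel scores of shape (C,); a higher score
              means more modality-general decisive information
    ratio: nullifying ratio r in [0, 1]
    """
    num_channels = features.shape[1]
    num_nullified = int(round(ratio * num_channels))
    order = np.argsort(salience)  # ascending: least modality-general first
    masked = features.copy()      # leave the input untouched
    masked[:, order[:num_nullified]] = 0.0
    return masked
```

Sweeping `ratio` from 0 to 1 and retraining the teacher on the masked features at each step would trace out a curve of the kind discussed for Fig. 5, under the assumptions stated above.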

6. CONCLUSION AND FUTURE WORK

In this work, we present a thorough investigation of crossmodal KD. The proposed MVD and MFH characterize multimodal data relationships and reveal that modality-general decisive features are the key in crossmodal KD. We present theoretical analysis and conduct various experiments to justify MFH. We hope that MFH sheds light on applications of crossmodal KD and raises interest in a general understanding of multimodal learning as well. Future work includes: (i) deriving a more profound theoretical analysis of crossmodal KD, (ii) differentiating modality-general/specific decisive features for real-world data, and (iii) improving multimodal fusion robustness based on MVD.

A PROOF OF THEOREM 1

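Lemma 1. Consider a function $l(b, a)$:

$$l(b, a) = -\sigma(a)\left[\ln\sigma(b) - \ln\sigma(a)\right] - (1 - \sigma(a))\left[\ln(1 - \sigma(b)) - \ln(1 - \sigma(a))\right]$$

where $\sigma(\cdot)$ represents the sigmoid function. When $|a - b| \le \epsilon$,

$$\max l(b, a) = \frac{\epsilon}{1 - e^{-\epsilon}} - 1 - \ln\frac{\epsilon}{1 - e^{-\epsilon}}$$

Proof. We only need to prove the following two things: (i) $l(b, a) \le \max\{l(a + \epsilon, a), l(a - \epsilon, a)\}$, and (ii) $l(a \pm \epsilon, a) \le \frac{\epsilon}{1 - e^{-\epsilon}} - 1 - \ln\frac{\epsilon}{1 - e^{-\epsilon}}$.

For (i), we treat $a$ as fixed and calculate the derivative of $l(b, a)$ w.r.t. $b$, yielding:

$$\frac{\partial l}{\partial b} = \frac{e^b - e^a}{(1 + e^a)(1 + e^b)} = \sigma(b) - \sigma(a)$$

Thus, when varying $b$ in the range $[a - \epsilon, a + \epsilon]$, the value of $l(b, a)$ first decreases and then increases, so its maximum is attained at one of the two endpoints. This implies (i).

To prove (ii), we first consider the function:

$$\begin{aligned}
l^+(a) = l(a + \epsilon, a) &= -\sigma(a)\left[\ln\sigma(a + \epsilon) - \ln\sigma(a)\right] - (1 - \sigma(a))\left[\ln(1 - \sigma(a + \epsilon)) - \ln(1 - \sigma(a))\right] \\
&= \sigma(a)\left[\ln\frac{\sigma(a)}{1 - \sigma(a)} - \ln\frac{\sigma(a + \epsilon)}{1 - \sigma(a + \epsilon)}\right] - \ln\frac{1 - \sigma(a + \epsilon)}{1 - \sigma(a)} \\
&= -\sigma(a) \cdot \epsilon - \ln\frac{1 - \sigma(a + \epsilon)}{1 - \sigma(a)} \\
&= -\epsilon\frac{1}{1 + e^{-a}} - \ln\left[\frac{e^{-(a + \epsilon)}}{e^{-a}} \cdot \frac{1 + e^{-a}}{1 + e^{-(a + \epsilon)}}\right] \\
&= -\epsilon\frac{1}{1 + e^{-a}} + \epsilon - \ln\frac{1 + e^{-a}}{1 + e^{-(a + \epsilon)}} \\
&= -\epsilon\frac{1}{1 + t} + \epsilon - \ln\frac{1 + t}{1 + te^{-\epsilon}}
\end{aligned} \tag{10}$$

where in the last line, we denote $t = e^{-a}$. Now, let us calculate the derivative of $l^+$ w.r.t. $t$:

$$\frac{\partial l^+}{\partial t} = \frac{\epsilon}{(1 + t)^2} - \frac{1}{1 + t} + \frac{e^{-\epsilon}}{1 + te^{-\epsilon}} = \frac{\epsilon - 1 - t}{(1 + t)^2} + \frac{1}{e^\epsilon + t} = \frac{(1 + t)^2 - (t + e^\epsilon)(t - \epsilon + 1)}{(1 + t)^2(e^\epsilon + t)}$$

The denominator is always larger than zero. After expanding the numerator, we see that $l^+$ achieves its maximum when $t = \frac{e^\epsilon(1 - \epsilon) - 1}{1 + \epsilon - e^\epsilon}$. Therefore, according to $a = -\ln t$, we conclude that the maximum value of $l^+(a)$ is $\frac{\epsilon}{1 - e^{-\epsilon}} - 1 - \ln\frac{\epsilon}{1 - e^{-\epsilon}}$, achieved at $a = -\ln\frac{e^\epsilon(1 - \epsilon) - 1}{1 + \epsilon - e^\epsilon}$. Similarly, we could define a function $l^-(a) = l(a - \epsilon, a)$ and verify that the maximum value of $l^-(a)$ is $\frac{-\epsilon}{1 - e^{\epsilon}} - 1 - \ln\frac{-\epsilon}{1 - e^{\epsilon}}$. Furthermore, after some simplifications, these two expressions, i.e., $\max l^+(a)$ and $\max l^-(a)$, can be proved identical.

Corollary 1. For two vectors $a \in \mathbb{R}^n$ and $b \in \mathbb{R}^n$, if $\|a - b\| \le \epsilon$, then

$$L_n(b, a) = \sum_{i=1}^{n} l(b_i, a_i) \le n\left(\frac{\epsilon}{1 - e^{-\epsilon}} - 1 - \ln\frac{\epsilon}{1 - e^{-\epsilon}}\right)$$

The corollary is straightforward once we notice that $|a_i - b_i| \le \|a - b\| \le \epsilon$ holds for any $i = 1, 2, \cdots, n$.

Lemma 2. If $\max\{\|Z^u Z^{u,T}\|, \|(Z^u Z^{u,T})^{-1}\|\} \le \lambda$ always holds for both $u = a, b$, and $\|Z^{a,T} Z^a - Z^{b,T} Z^b\| \le (1 - \gamma)\epsilon$, then we have:

$$\|Z^{b,T}(Z^b Z^{b,T})^{-1} Z^b Z^{a,T} - Z^{a,T}\| \le \lambda^{1.5}(\lambda^2 + 1)(1 - \gamma)\epsilon \tag{13}$$

Proof. For conciseness, we temporarily denote $A = Z^a$ and $B = Z^b$. To begin with, we notice:

$$(B^T(BB^T)^{-1}BA^T - A^T)AA^T = (B^T(BB^T)^{-1}B - I)(A^TA - B^TB)A^T \tag{14}$$

Then, we have:

$$\begin{aligned}
\|B^T(BB^T)^{-1}BA^T - A^T\| &\le \|(AA^T)^{-1}\| \cdot \|B^T(BB^T)^{-1}B - I\| \cdot \|A^TA - B^TB\| \cdot \|A^T\| \\
&\le \lambda \cdot \left(\|B^T(BB^T)^{-1}B\| + 1\right) \cdot (1 - \gamma)\epsilon \cdot \|A^T\| \\
&\le \lambda \cdot \left(\|BB^T\| \cdot \|(BB^T)^{-1}\| + 1\right) \cdot (1 - \gamma)\epsilon \cdot \sqrt{\|AA^T\|} \\
&\le \lambda^{1.5}(\lambda^2 + 1)(1 - \gamma)\epsilon
\end{aligned} \tag{15}$$

Now, we are ready to prove Theorem 1. For the reader's convenience, we restate Theorem 1 below.

Theorem A.1. (Crossmodal KD in linear binary classification). Without loss of generality, we assume $f_{\theta_t}(\cdot): \mathcal{X}^a \to \mathcal{Y}$ and $f_{\theta_s}(\cdot): \mathcal{X}^b \to \mathcal{Y}$. Suppose $\max\{\|Z^u Z^{u,T}\|, \|(Z^u Z^{u,T})^{-1}\|\} \le \lambda$ always holds for both $u = a$ and $u = b$, and $g^u(\cdot)$ are identity functions. If there exists $(\epsilon, \delta)$ such that

$$\Pr\left[\|Z^{a,T} Z^a - Z^{b,T} Z^b\| \le (1 - \gamma)\epsilon\right] \ge 1 - \delta$$

then, with an initialization at $t = 0$ satisfying $R^{dis}_n(\theta_s(0)) \le q$, we have, with probability at least $1 - \delta$:

$$R^{dis}_n(\theta_s(t = +\infty)) \le n\left(\frac{\epsilon^\star}{1 - e^{-\epsilon^\star}} - 1 - \ln\frac{\epsilon^\star}{1 - e^{-\epsilon^\star}}\right)$$

where $\epsilon^\star = \lambda^{1.5}(\lambda^2 + 1)(1 - \gamma)\epsilon$ and $R^{dis}_n(\theta_s)$ is the empirical risk defined by the KL divergence:

$$R^{dis}_n(\theta_s(t)) = \sum_{i=1}^{n} \left\{ -\sigma(\theta_t^T x^a_i) \cdot \ln\frac{\sigma(\theta_s^T x^b_i)}{\sigma(\theta_t^T x^a_i)} - \left[1 - \sigma(\theta_t^T x^a_i)\right] \cdot \ln\frac{1 - \sigma(\theta_s^T x^b_i)}{1 - \sigma(\theta_t^T x^a_i)} \right\}$$

Proof.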
First, under the given conditions, suppose we can prove:

$$\|Z^{a,T} Z^a - Z^{b,T} Z^b\| \le (1 - \gamma)\epsilon \;\Longrightarrow\; R^{\mathrm{dis}}_n(\theta_s(t = +\infty)) \le n \left( \frac{\epsilon^\star}{1 - e^{-\epsilon^\star}} - 1 - \ln \frac{\epsilon^\star}{1 - e^{-\epsilon^\star}} \right). \tag{19}$$

Then the conclusion holds whenever the premise does, i.e., with probability at least $1 - \delta$, and the theorem is proved. Therefore, in the following, we focus on proving Eq. (19).

The proof consists of two parts. In the first part, we show that there exists a $\theta_s^\star$ such that $R^{\mathrm{dis}}_n(\theta_s^\star)$ is bounded. In the second part, we prove that if the training process is long enough (i.e., $t \to +\infty$), then the final distillation risk $R^{\mathrm{dis}}_n(\theta_s(t = +\infty))$ is in turn bounded by $R^{\mathrm{dis}}_n(\theta_s^\star)$.

Without loss of generality, we assume the trained teacher weight satisfies $\|\theta_t\| = 1$, since in linear binary classification scaling the weight does not affect the final prediction. Now, let us consider:

$$\theta_s^\star = (Z^b Z^{b,T})^{-1} Z^b Z^{a,T} \theta_t. \tag{20}$$

Then we have:

$$\begin{aligned}
\|Z^{b,T} \theta_s^\star - Z^{a,T} \theta_t\| &= \|Z^{b,T} (Z^b Z^{b,T})^{-1} Z^b Z^{a,T} \theta_t - Z^{a,T} \theta_t\| \\
&\le \|Z^{b,T} (Z^b Z^{b,T})^{-1} Z^b Z^{a,T} - Z^{a,T}\| \cdot \|\theta_t\| \\
&\le \lambda^{1.5} (\lambda^2 + 1)(1 - \gamma)\epsilon,
\end{aligned} \tag{21}$$

where in the last line we have used Lemma 2. Then, using Corollary 1, we obtain

$$R^{\mathrm{dis}}_n(\theta_s^\star) = L_n(Z^{b,T} \theta_s^\star, Z^{a,T} \theta_t) \le n \left( \frac{\epsilon^\star}{1 - e^{-\epsilon^\star}} - 1 - \ln \frac{\epsilon^\star}{1 - e^{-\epsilon^\star}} \right), \tag{22}$$

where $\epsilon^\star = \lambda^{1.5} (\lambda^2 + 1)(1 - \gamma)\epsilon$.

It remains to prove that if training time is sufficiently long (i.e., $t \to +\infty$), the training loss $R^{\mathrm{dis}}_n(\theta_s(t))$ is no larger than $R^{\mathrm{dis}}_n(\theta_s^\star)$. To begin with, we apply Theorem A.2 and Corollary A.1 from (Phuong & Lampert, 2019): for any sublevel set $\Theta = \{\theta_s : R^{\mathrm{dis}}_n(\theta_s) \le q\}$, there exists $c > 0$ such that:

$$c R^{\mathrm{dis}}_n(\theta_s) - c R^{\mathrm{dis}}_n(\theta_s^\star) \le \frac{1}{2} \|\nabla R^{\mathrm{dis}}_n(\theta_s)\|^2. \tag{23}$$

Compared to the original Corollary A.1 of (Phuong & Lampert, 2019), a slight modification is made to suit our case. The existence of $c$ follows by noticing that the left-hand side of Eq. (50) of (Phuong & Lampert, 2019) is relaxed to $R^{\mathrm{dis}}_n(\theta_s^\star)$ instead of 0 in our case.
Next, noticing Eq. (8) and (53) of (Phuong & Lampert, 2019), we have:

$$(R^{\mathrm{dis}}_n)' = \frac{d R^{\mathrm{dis}}_n(\theta_s(t))}{dt} = -\|\nabla R^{\mathrm{dis}}_n(\theta_s)\|^2 \le -2c R^{\mathrm{dis}}_n(\theta_s) + 2c R^{\mathrm{dis}}_n(\theta_s^\star),$$

which can be simplified as in (Phuong & Lampert, 2019):

$$\left[ R^{\mathrm{dis}}_n - R^{\mathrm{dis}}_n(\theta_s^\star) \right]' \le -2c \left[ R^{\mathrm{dis}}_n - R^{\mathrm{dis}}_n(\theta_s^\star) \right] \;\Longrightarrow\; \left[ \ln \left( R^{\mathrm{dis}}_n - R^{\mathrm{dis}}_n(\theta_s^\star) \right) \right]' \le -2c.$$

Integrating over $[0, \tau]$ gives $R^{\mathrm{dis}}_n(\theta_s(\tau)) - R^{\mathrm{dis}}_n(\theta_s^\star) \le \left[ R^{\mathrm{dis}}_n(\theta_s(0)) - R^{\mathrm{dis}}_n(\theta_s^\star) \right] e^{-2c\tau}$, and since $R^{\mathrm{dis}}_n(\theta_s^\star) \ge 0$ this yields:

$$R^{\mathrm{dis}}_n(\theta_s(\tau)) \le R^{\mathrm{dis}}_n(\theta_s(0)) \cdot e^{-2c\tau} + R^{\mathrm{dis}}_n(\theta_s^\star).$$

Thus, as $\tau \to +\infty$, $R^{\mathrm{dis}}_n(\theta_s(\tau))$ is upper-bounded by $R^{\mathrm{dis}}_n(\theta_s^\star)$. The above inequality also indicates the convergence speed.

Remarks. Firstly, the above theorem implies that the training distillation loss is upper-bounded by a monotonic function of γ. When γ increases from 0 to 1, the upper bound gradually decreases. By further combining the above theorem with Rademacher complexity, we could even obtain a bound on the generalization error. Nevertheless, the key component of such a generalization bound is already shown here. As such, readers should be alerted that our derived theorem is preliminary, that our main contributions in this paper are MVD and MFH, and that a complete theoretical analysis could be a standalone work in itself.

Secondly, we emphasize that the MVD decision rule shown in Eq. (3) is not used in proving our theorem. In the context of our theorem, Eq. (3) states that there exist θ_s and θ_t that make the corresponding student and teacher networks achieve zero generalization error (although such a θ_s might not be attainable when performing crossmodal KD with Eq. (1)). However, our theorem concerns only the training distillation loss; it involves neither the generalization error nor the cross-entropy loss between the student prediction and the true label, so this MVD decision rule is not used. The decision rule is nevertheless essential to fully characterize the MVD model, as it states that the modality-general and modality-specific decisive features are sufficient to determine the label, and we expect it to be useful when proving a generalization bound.
Finally, given the two limitations noted above, it would be of great interest to extend our theorem to the generalization error with a linear (or even non-linear) g^u(·) and ρ ≠ 0 in K-class classification. In a nutshell, we hope our proposed MVD can serve as a useful model for theoretically deriving generalization bounds for crossmodal KD.
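As a numerical sanity check of Lemma 1, the closed-form maximum can be compared against a brute-force grid search over a and b with |a - b| ≤ ϵ. The sketch below is illustrative only and is not part of the original proof; the function names, the grid range [-8, 8], and the grid resolution are assumptions chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l(b, a):
    """l(b, a) from Lemma 1: KL(Bernoulli(sigmoid(a)) || Bernoulli(sigmoid(b)))."""
    p, q = sigmoid(a), sigmoid(b)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lemma1_bound(eps):
    """Closed-form maximum of l(b, a) over |a - b| <= eps stated in Lemma 1."""
    r = eps / (1.0 - math.exp(-eps))
    return r - 1.0 - math.log(r)


def grid_max(eps, a_lo=-8.0, a_hi=8.0, n_a=3201, n_b=9):
    """Brute-force maximum of l(b, a) on a grid (grid settings are assumptions)."""
    best = 0.0
    for i in range(n_a):
        a = a_lo + (a_hi - a_lo) * i / (n_a - 1)
        for j in range(n_b):
            # b sweeps the interval [a - eps, a + eps], endpoints included.
            b = a - eps + 2.0 * eps * j / (n_b - 1)
            best = max(best, l(b, a))
    return best


if __name__ == "__main__":
    for eps in (0.1, 0.5, 2.0):
        print(eps, grid_max(eps), lemma1_bound(eps))
```

On such a grid the brute-force maximum should approach the closed-form value from below, up to the grid resolution.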

B.1 CASE STUDIES IN SEC. 3

In this section, we describe implementation details and provide more results for the two case studies in Sec. 3.

AV-MNIST (Vielzeuf et al., 2018) is an audio-visual dataset created by pairing audio and image features. The two modalities are MNIST images with 75% of their energy removed by principal component analysis and audio spectrograms with random natural noise injected. There are 50,000 pairs for training, 5,000 pairs for validation, and 10,000 pairs for testing. Following (Vielzeuf et al., 2018; Gao et al., 2022), we adopt a 6-layer CNN as the audio teacher network. The audio student network is implemented as a 3-layer CNN, and the multimodal teacher is a late fusion network. The multimodal teacher uses LeNet5 (LeCun et al., 1989) as the image backbone and a 5-layer CNN as the audio backbone; audio and image features are then concatenated and passed to fully-connected layers for the final prediction. Recall that ρ in Eq. (1) in the main text controls the relative importance of the two loss terms when training the student network. We experiment with both ρ = 0 and ρ = 0.5, and repeat each experiment 10 times. The results for ρ = 0.5 are provided in Table 1 in the main text, and a more detailed version can be found in Table 5 below. From the table, we can see that crossmodal KD has no advantage over unimodal KD for either value of ρ. We hypothesize that the proportion of modality-general decisive information is small in this dataset, since a multimodal data pair is assembled by randomly pairing an image with an audio clip that belongs to the same class. Thus the two modalities are not naturally correlated and there may be little modality-general information. MFH provides a plausible explanation for this failure case of crossmodal KD.

NYU Depth V2 (Nathan Silberman & Fergus, 2012) contains 1,449 aligned RGB and depth images with 40-class labels, where 795 images are used for training and 654 images for testing. ρ is set to 0.5. We implement two model architectures for the multimodal teacher: (1) Channel Exchanging Networks (CEN) (Wang et al., 2020) and (2) Separation-and-Aggregation Gate (SA-Gate) (Chen et al., 2020b). The unimodal teacher and student adopt the RGB branch of the corresponding multimodal network. The results are shown in Table 6, part of which corresponds to the right section of Table 1 in the main text. Table 6 demonstrates that crossmodal KD is not effective in either case. The large advantage in teacher performance does not translate into better student performance. Adopting CEN as the multimodal teacher seems better than SA-Gate, but the improvement over unimodal KD is still marginal (i.e., from 46.23% to 46.70%). According to MFH, different teacher networks utilize different amounts of modality-general decisive features for prediction, which results in different distillation performance. We hypothesize that CEN has a larger γ than SA-Gate due to their model designs: CEN shares all parameters for the RGB and depth inputs except for Batch Normalization layers, while SA-Gate has separate encoders for the two modalities. This suggests that CEN is more modality-general than SA-Gate, which may further account for their performance difference. There may be other factors at play, and one future direction is to develop methods for comparing existing model architectures in order to find a teacher architecture that best suits crossmodal KD.

B.2 SYNTHETIC GAUSSIAN

Assume two vectors $x^a \in \mathbb{R}^{d_1}$ and $x^b \in \mathbb{R}^{d_2}$ compose one multimodal data pair $(x^a, x^b)$. We select a subset of input features as decisive features, denoted by $x^* \in \mathbb{R}^{d}$. We assume that $x^*$ exists in both $x^a$ and $x^b$, and denote the corresponding decisive feature index set of $x^a$ ($x^b$) as $J_1$ ($J_2$). The separating hyperplane is denoted by $\delta \in \mathbb{R}^{d}$. Formally, we generate one feature-label pair $(x^a, x^b, y)$ by:

$$x^* \sim \mathcal{N}(0, I_d), \quad y \leftarrow \mathbb{1}(\langle \delta, x^* \rangle > 0)$$
$$x^a \sim \mathcal{N}(0, I_{d_1}), \quad x^a_{J_1} \leftarrow x^*_{J_1}$$
$$x^b \sim \mathcal{N}(0, I_{d_2}), \quad x^b_{J_2} \leftarrow x^*_{J_2}$$

As depicted in MVD, modality-general decisive features are decisive features shared by the two modalities and are thus indexed by $J_1 \cap J_2$. $J_1 \cup J_2$ represents the index set of decisive features from both modalities. Therefore, $\alpha = 1 - \frac{|J_2|}{|J_1 \cup J_2|}$, $\beta = 1 - \frac{|J_1|}{|J_1 \cup J_2|}$ and $\gamma = \frac{|J_1 \cap J_2|}{|J_1 \cup J_2|}$. By changing $J_1$ and $J_2$, we can generate multimodal data with different inherent characteristics (i.e., different α, β, and γ).

We consider two settings: (1) varying γ (Fig. 2 in the main text). Let $d_1 = 25$, $d_2 = 50$ and $d = 20$; we gradually increase $|J_1 \cap J_2|$ from 0 to 10 with a step size of 2 and perform KD at every step. Consequently, γ takes the values [0, 0.11, 0.25, 0.43, 0.67, 1], and $\alpha = \beta = \frac{1 - \gamma}{2}$. (2) varying α (Fig. 3 in the main text). Let $d_1 = d_2 = 50$ and let $d = |J_1 \cup J_2|$ increase from 10 to 50 with a step size of 10. Thus, α takes the values [0, 0.5, 0.67, 0.75, 0.80, 0.83]. We set β to 0 throughout the process, so that γ = 1 - α.

Following (Lopez-Paz et al., 2015), the teacher and the student are both implemented as logistic regression models, and we use 200 samples for training and 1,000 samples for testing. δ is sampled from the standard normal distribution. ρ in Eq. (1) in the main text is set to 0.5. Results are averaged over 10 runs.
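The generation procedure above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions rather than the code used for the reported experiments: the function names are hypothetical, the decisive coordinates are placed at the leading positions of each modality vector, and d is taken to be |J_1 ∪ J_2| so that every decisive index appears in at least one modality. The helper `gamma_sweep_sets` assumes |J_1| = |J_2| = 10, which reproduces the γ values listed for setting (1).

```python
import random


def venn_ratios(J1, J2):
    """Return (alpha, beta, gamma) computed from the decisive index sets."""
    union = J1 | J2
    alpha = 1.0 - len(J2) / len(union)
    beta = 1.0 - len(J1) / len(union)
    gamma = len(J1 & J2) / len(union)
    return alpha, beta, gamma


def sample_pair(delta, J1, J2, d1, d2, rng):
    """Draw one (x_a, x_b, y) triple; index sets refer to coordinates of x_star."""
    x_star = [rng.gauss(0.0, 1.0) for _ in delta]
    y = int(sum(w * v for w, v in zip(delta, x_star)) > 0)
    x_a = [rng.gauss(0.0, 1.0) for _ in range(d1)]
    x_b = [rng.gauss(0.0, 1.0) for _ in range(d2)]
    # Assumption: decisive features overwrite the leading coordinates.
    for pos, j in enumerate(sorted(J1)):
        x_a[pos] = x_star[j]
    for pos, j in enumerate(sorted(J2)):
        x_b[pos] = x_star[j]
    return x_a, x_b, y


def gamma_sweep_sets(overlap, size=10):
    """Two index sets of equal size sharing `overlap` decisive features."""
    J1 = set(range(size))
    J2 = set(range(size - overlap, 2 * size - overlap))
    return J1, J2


if __name__ == "__main__":
    rng = random.Random(0)
    for k in range(0, 11, 2):
        J1, J2 = gamma_sweep_sets(k)
        delta = [rng.gauss(0.0, 1.0) for _ in range(len(J1 | J2))]
        x_a, x_b, y = sample_pair(delta, J1, J2, 25, 50, rng)
        print(k, [round(v, 2) for v in venn_ratios(J1, J2)], y)
```

Sweeping the overlap from 0 to 10 in steps of 2 moves γ from 0 to 1 while keeping α = β, mirroring setting (1).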

B.3 MM-IMDB

MM-IMDB (Arevalo et al., 2017) is the largest publicly available multimodal dataset for genre prediction on movies. It contains 25,959 movie titles and posters that belong to 27 movie genres. We pick two movie genres (i.e., drama and comedy) for multi-label classification. There are 15,552 samples for training, 2,608 for validation, and 7,799 for testing. We adopt the same pre-processing method as in (Arevalo et al., 2017) to extract image and text features. We consider the special case of crossmodal KD where the teacher is a multimodal network that takes both images and text as input and the student is a unimodal text network. The unimodal and multimodal architectures are identical to those in (Liang et al., 2021).

As described in Sec. 5.1, we experiment with two teacher networks that have the same architecture but differ in γ: (1) we train a multimodal network with labels in the regular way; (2) following (Xue et al., 2021), we train a multimodal network that receives pseudo labels from a unimodal image network. The second teacher only has access to pseudo labels from the image modality and leans towards the image modality when giving predictions. In other words, it is more modality-specific (i.e., has a smaller γ) than the regular teacher. We randomly split the training data with a ratio of 50%:50%, and use the first half to train the unimodal teacher and the general multimodal teacher. The other half is used for training the student network, and we set ρ = 0.

Table 7 shows the teacher and student performance for both unimodal KD and crossmodal KD. We select three teacher models that have similar performance on test data, and use them for distillation to factor out the influence of teacher performance. Clearly, the three teachers transfer different knowledge to the student. The unimodal teacher comes from the image modality, and the modality-specific multimodal teacher is also biased towards the image modality due to its training strategy. Finally, the regular multimodal teacher uses more modality-general information than the previous two teachers (i.e., has a larger γ). As can be seen from the table, it results in the best unimodal text student, which helps verify our proposed MFH.

B.4 RAVDESS AND VGGSOUND

We first present a permutation-based approach to rank given multimodal features according to the amount of modality-general decisive information available in each feature channel. This approach offers an alternative way to obtain teachers of different γ and helps validate the proposed MFH. As described in Sec. 5.4, once we have a sorted list of all feature channels, we can derive a modality-general (or modality-specific) teacher by nullifying the r% of channels with the smallest (or largest) salience values during the distillation process.

The major steps of our proposed feature ranking approach are shown in Algorithm 1. The inputs of Algorithm 1 are $X^a \in \mathbb{R}^{n \times d_1}$, $X^b \in \mathbb{R}^{n \times d_2}$, and $Y \in \mathbb{R}^{n}$, representing n paired features from modality a and b, and n target labels, respectively. The output is a salience vector $p \in \mathbb{R}^{d_1}$ for the feature channels of modality a, where its i-th entry $p_i \in [0, 1]$ reflects the salience of the i-th feature dimension. A larger salience value indicates a more modality-general decisive feature channel.

Algorithm 1 Modality-General Decisive Feature Ranking
Input: multimodal features ($X^a \in \mathbb{R}^{n \times d_1}$, $X^b \in \mathbb{R}^{n \times d_2}$, $Y \in \mathbb{R}^{n}$)
Output: salience vector $p \in \mathbb{R}^{d_1}$ for features of modality a
1: Jointly train two unimodal networks $f_{\theta_1^*}$ and $f_{\theta_2^*}$ using the following loss:
$$\min_{\theta_1, \theta_2} L = \mathrm{Dist}(f_{\theta_1}(X^a), f_{\theta_2}(X^b)) + \mathrm{CE}(Y, f_{\theta_1}(X^a)) + \mathrm{CE}(Y, f_{\theta_2}(X^b)) \tag{28}$$
   ▷ Dist(·, ·) denotes a distance loss (e.g., mean squared error) and CE denotes cross entropy
2: for i = 1 to $d_1$ do   ▷ Calculate the salience for the i-th feature dimension
3:     $p_i = 0$
4:     for m = 1 to M do   ▷ Repeat permutation M times for better stability
5:         permute the i-th column of $X^a$, yielding $\tilde{X}^a$
6:         $p_i = p_i + \frac{1}{M} \times \mathrm{Dist}(f_{\theta_1^*}(\tilde{X}^a), f_{\theta_2^*}(X^b))$
7:     end for
8: end for
9: Perform normalization: $p = \frac{p}{\max_i p_i} \in [0, 1]^{d_1}$

Since input-level features contain much label-irrelevant noise, Algorithm 1 is designed to trace back from the output level.
Namely, we drive two unimodal networks to a state of "feature alignment" at the output level using Eq. (28), and then use permutation to identify which input feature dimensions have a larger impact on this state. Those more influential to the state (i.e., yielding a large distance in step 6) are assigned a larger salience value. In step 1, we jointly train two unimodal networks f_{θ*_1} and f_{θ*_2} that take unimodal data X^a and X^b as input, respectively. The first loss term in Eq. (28) aims to align the feature spaces learned by the two networks, and the remaining loss terms ensure that the learned features are essential for a correct prediction. We believe that this training strategy aligns the three sources of decisive features at the output level. In step 2, we follow the idea of permutation feature importance (Breiman, 2001) to trace back modality-general decisive features at the input level. For the i-th dimension of X^a, we randomly permute X^a along this dimension and obtain a permuted X̃^a in step 5. Next, we calculate the distance between f_{θ*_1}(X̃^a) and f_{θ*_2}(X^b) in step 6. A large distance indicates that the i-th dimension strongly influences the state of "feature alignment". Consequently, we are able to quantify the proportion of modality-general decisive features in each input feature channel and use the salience vector p to represent it. We repeat the permutation process M times and average the distance values for better stability. Finally, p is normalized to [0, 1]^{d_1} in step 9.

Note that: (1) The focus of this paper is to propose and validate MFH. Algorithm 1 presents an approach to rank features and allows us to derive teachers of different γ to justify the implication of MFH. Developing methods that can separate modality-general decisive features from modality-specific decisive features is a challenging problem worth deep investigation and is left as future work. (2) Algorithm 1 is not limited to feature ranking at the input level. X^a and X^b can also be features extracted from middle layers of the neural network. In that case, the output p reflects the salience of each middle-layer feature channel. (3) Algorithm 1 can equally be applied to rank modality-general decisive features for modality b by permuting X^b instead.

Next, we present results from applying this feature ranking approach to two multimodal applications, i.e., RAVDESS emotion recognition and VGGSound event classification.

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone & Russo, 2018) contains video and audio recordings of 24 professional actors vocalizing two lexically-matched statements. For modality a (i.e., the teacher modality), we adopt Kaiser best sampling and extract mel-frequency cepstral coefficient (MFCC) features from the corresponding audio. For modality b (i.e., the student modality), we uniformly sample single-frame images every 0.5 seconds from each video. We randomly split the image-audio pairs, obtaining 7,943 samples for training, 2,364 for validation and 1,001 for testing. Similar to (Xue et al., 2021), the teacher and student architectures are 3-layer CNNs followed by 3 fully-connected layers. We set ρ in Eq. (1) in the main text to 0 (i.e., only L_kd is used for distillation) to fully observe the teacher's influence on student performance. We report results for three feature-nullifying ratios r% in Table 8, which is a detailed version of Table 4 in the main text. Results are averaged over 5 runs. As shown in the table, as more feature channels are nullified (i.e., as r% increases), both the random and the modality-specific versions suffer heavy performance degradation. In contrast, a modality-general teacher still attains satisfactory distillation performance and outperforms regular KD even when the feature-nullifying ratio reaches 75%. This demonstrates the efficacy of our proposed feature ranking method as well as the practical value of MFH.

VGGSound is a large-scale audio-visual correspondent dataset. We choose 100 classes from its 310 classes and obtain 56,614 audio-video pairs for training and 4,501 audio-video pairs for testing. Video clips and audio spectrograms are taken as input features, respectively. The audio network is implemented as a ResNet-18 / ResNet-50 backbone followed by linear layers, and the video network has the same architecture with 2D convolutions replaced by 3D convolutions. For Table 4 in the main text, we set ρ in Eq. (1) to 0 and experiment with both ResNet-18 and ResNet-50. In Table 9, we report results for both ρ = 0 and ρ = 0.5 with the ResNet-18 backbone. The conclusion is consistent: a modality-general teacher improves student performance while a modality-specific teacher results in performance degradation. These results help validate our proposed MFH.

In Algorithm 1, we repeat the permutation M times for a better estimate of each feature dimension's salience value. Fig. 6 provides an analysis of the number of permutations M. As M increases, the estimate of p becomes more accurate, so modality-general KD gradually improves and finally reaches a plateau.
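The permutation loop of Algorithm 1 (steps 2 to 9) can be illustrated with a small standard-library sketch. It is not the implementation used in the experiments: the jointly trained networks of step 1 are replaced here by two hand-written scalar functions `f1` and `f2` that are already aligned on two shared channels, Dist is taken to be mean squared error, and all names and the toy data are hypothetical.

```python
import random


def mse(u, v):
    """Mean squared error between two equal-length lists of scalars."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


def permutation_salience(f1, f2, X_a, X_b, M=5, seed=0):
    """Steps 2-9 of Algorithm 1 for already-trained f1 (modality a), f2 (modality b)."""
    rng = random.Random(seed)
    d1 = len(X_a[0])
    target = [f2(x) for x in X_b]
    p = []
    for i in range(d1):
        total = 0.0
        for _ in range(M):
            # Step 5: permute only the i-th column of X_a.
            col = [row[i] for row in X_a]
            rng.shuffle(col)
            permuted = [row[:i] + [c] + row[i + 1:] for row, c in zip(X_a, col)]
            # Step 6: accumulate the averaged misalignment caused by the permutation.
            total += mse([f1(x) for x in permuted], target) / M
        p.append(total)
    top = max(p)
    # Step 9: normalize to [0, 1] (left unchanged if every entry is zero).
    return [v / top for v in p] if top > 0 else p


def toy_data(n=200, seed=0):
    """Toy pairs: channels 0 and 1 are shared; the remaining channels are noise."""
    rng = random.Random(seed)
    X_a, X_b = [], []
    for _ in range(n):
        s0, s1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        X_a.append([s0, s1, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        X_b.append([s0, s1, rng.gauss(0.0, 1.0)])
    return X_a, X_b


def f1(x):
    # Stand-in for the trained network on modality a (an assumption for this toy).
    return x[0] + 2.0 * x[1]


def f2(x):
    # Stand-in for the trained network on modality b (an assumption for this toy).
    return x[0] + 2.0 * x[1]


if __name__ == "__main__":
    X_a, X_b = toy_data()
    print(permutation_salience(f1, f2, X_a, X_b))
```

In this toy setup the noise channels of modality a receive zero salience because permuting them does not change the output of `f1`, while the shared channel with the larger weight receives the largest salience.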

C COMPARISON OF MVD WITH THE MULTI-VIEW ASSUMPTION

Below we compare MVD with the multi-view assumption (Sridharan & Kakade, 2008) . We adopt the same notations used in the main text here for the multi-view learning paradigm: assume the input variable is partitioned into two different views x a and x b , and there is a target variable y of interest. The multi-view assumption states that either view alone is sufficient to predict the target label y accurately. As illustrated in Figure 7 (a), all task-relevant information is assumed to lie in the shared regions between x a and x b . Despite its wide use in unimodal self-supervised learning, the multi-view assumption is likely to fail when x a and x b are from different modalities (Tsai et al., 2020) . In fact, whether the multi-view assumption holds is largely dependent on downstream tasks and modalities involved. On one hand, for tasks such as image classification, either the image itself or its caption is sufficient to derive the target label. On the other hand, for tasks such as MM-IMDB movie genre classification, using the movie poster (image modality) alone may not convey which category this movie belongs to. Movie descriptions (text modality-specific information) are also needed to infer the target label y. Similarly, using the depth image alone is not sufficient for RGB-D semantic segmentation as there are regions in the image that have the same depth values but correspond to different semantic labels. The multi-view assumption thus does not hold for many challenging multimodal tasks. 



A detailed comparison of our proposed MVD with the multi-view assumption is presented in Appendix C. Note that we even give preferential treatment to crossmodal KD: we take a multimodal network as the teacher, and this teacher achieves higher accuracy than a teacher typically used in crossmodal KD. However, such a θ_s might not be attainable when performing crossmodal KD using Eq. (1).



Fig. 1 left shows an example of a video-audio data pair, where the camera only captures one person due to its position angle and the audio is mixed sounds of two instruments. Fig. 1 right illustrates how we interpret these three features (i.e., modality-general decisive, visual modality-specific decisive and audio modality-specific decisive) at the input level.

Figure 1: An input video-audio pair can be regarded as composed of modality-general features and modality-specific features in the visual and audio modality. For instance, the man playing violin on the right is not captured by the camera and hence its sound (marked in red) belongs to audio modality-specific information.

Figure 6: Crossmodal KD performance with varying permutation number M.

Figure 7: Illustration of (a) the multi-view assumption and (b) our modality Venn diagram. Our proposed MVD generalizes the multi-view assumption by taking modality-specific decisive features into consideration as well. It states that task-relevant information is composed of three parts: (1) modality-general decisive features; (2) decisive features specific to modality a; and (3) decisive features specific to modality b, as illustrated in Figure 7(b). Consequently, the multi-view assumption can be considered as a special case of MVD when α = β = 0 and γ = 1.

Evaluation of unimodal KD (UM-KD) and crossmodal KD (CM-KD) on AV-MNIST and NYU Depth V2. 'Mod.' is short for modality, 'mIoU' denotes mean Intersection over Union, and A, I, RGB, D represent audio, grayscale images, RGB images and depth images, respectively.



Results on synthetic Gaussian data. Compared with a regular teacher, the modality-general teacher has downgraded performance yet leads to a student with increasing accuracy.



Results on RAVDESS Emotion Recognition and VGGSound Event Classification. A, I and V denote audio, images and video, respectively. We report student test accuracy (%) for RAVDESS and mean Average Precision (%) for VGGSound. With identical model architecture, a modality-general teacher improves regular KD while a modality-specific teacher leads to downgraded performance.

Evaluation of unimodal KD (UM-KD) and crossmodal KD (CM-KD) on AV-MNIST.

Evaluation of unimodal KD (UM-KD) and crossmodal KD (CM-KD) on NYU Depth V2. KD RGB + D 51.14 RGB 46.70 RGB + D 51.00 RGB 47.78

Results on MM-IMDB movie genre classification. T and I represent text and images, respectively. With identical architecture and similar accuracy, a modality-specific teacher leads to worse crossmodal KD performance than a regular teacher since it utilizes fewer modality-general decisive features for prediction.

Test accuracy (%) on RAVDESS emotion recognition. The teacher modality is audio and the student modality is images. Modality-general crossmodal KD achieves the best performance for all feature-nullifying ratios r%.

Mean Average Precision (%) on VGGSound event classification. Teacher modality is audio and student modality is video.

ACKNOWLEDGEMENT

The authors would like to thank Hangyu Lin (HKUST) and Zikai Xiong (MIT) for providing crucial steps in proving Theorem 1, and the anonymous reviewers for their valuable feedback.

