THE MODALITY FOCUSING HYPOTHESIS: TOWARDS UNDERSTANDING CROSSMODAL KNOWLEDGE DISTILLATION

Abstract

Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications. To achieve knowledge transfer across modalities, a pretrained network from one modality is adopted as the teacher to provide supervision signals to a student network learning from another modality. In contrast to the empirical success reported in prior works, the working mechanism of crossmodal KD remains a mystery. In this paper, we present a thorough understanding of crossmodal KD. We begin with two case studies and demonstrate that KD is not a universal cure in crossmodal knowledge transfer. We then present the modality Venn diagram (MVD) to understand modality relationships and the modality focusing hypothesis (MFH), which reveals the decisive factor in the efficacy of crossmodal KD. Experimental results on 6 multimodal datasets help justify our hypothesis, diagnose failure cases, and point to directions for improving crossmodal knowledge transfer in the future.

1. INTRODUCTION

Knowledge distillation (KD) is an effective technique for transferring knowledge from one neural network to another (Wang & Yoon, 2021; Gou et al., 2021). Its core mechanism is a teacher-student learning framework, where the student network is trained to mimic the teacher through a distillation loss. The loss function, initially proposed by Hinton et al. (2015) as the KL divergence between teacher and student soft labels, has since been extended in many ways (Zagoruyko & Komodakis, 2016; Tung & Mori, 2019; Park et al., 2019; Peng et al., 2019; Tian et al., 2019). KD has been successfully applied to various fields and demonstrates high practical value. The wide applicability of KD stems from its generality: any student can learn from any teacher. To be more precise, the student and teacher networks may differ in several ways. Three common scenarios are: (1) model capacity difference: many works on model compression (Zagoruyko & Komodakis, 2016; Tung & Mori, 2019; Park et al., 2019; Peng et al., 2019) aim to learn a lightweight student that matches the performance of its cumbersome teacher for deployment benefits; (2) architecture (inductive bias) difference: as an example, recent works (Touvron et al., 2021; Ren et al., 2022; Xianing et al., 2022) propose to utilize a CNN teacher to distill its inductive bias into a transformer student for data efficiency; (3) modality difference: KD has been extended to transfer knowledge across modalities (Gupta et al., 2016; Aytar et al., 2016; Zhao et al., 2018; Garcia et al., 2018; Thoker & Gall, 2019; Ren et al., 2021; Afouras et al., 2020; Valverde et al., 2021; Xue et al., 2021), where the teacher and student networks come from different modalities. Examples include using an RGB teacher to provide supervision signals to a student network taking depth images as input, or adopting an audio teacher to guide a visual student.
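To make the core mechanism concrete, the snippet below is a minimal NumPy sketch of the temperature-scaled soft-label loss of Hinton et al. (2015); the function names, the temperature value, and the small epsilon for numerical stability are illustrative choices, not part of any cited implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, T)  # teacher soft labels
    q = softmax(student_logits, T)  # student soft predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return (T ** 2) * kl
```

The loss is zero when the student exactly reproduces the teacher's logits and grows as the two softened distributions diverge; in practice it is typically combined with a standard cross-entropy term on the ground-truth labels.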
Despite the great empirical success reported in prior works, the working mechanism of KD is still poorly understood (Gou et al., 2021). This puts the efficacy of KD into question: Is KD always effective? If not, what is a good indicator of KD performance? A few works (Cho & Hariharan, 2019; Tang et al., 2020; Ren et al., 2022) search for the answer in the context of model capacity difference and architecture difference. However, the analysis for the third scenario, KD under modality difference, or formally crossmodal KD, remains an open problem. This work aims to fill this gap and for the first time provides a comprehensive analysis of crossmodal KD. Our major contributions are the following:

• We evaluate crossmodal KD on a few multimodal tasks and find, surprisingly, that teacher performance does not always positively correlate with student performance.

• To explore the cause of the performance mismatch in crossmodal KD, we adopt the modality Venn diagram (MVD) to understand modality relationships and formally define modality-general decisive features and modality-specific decisive features.

• We present the modality focusing hypothesis (MFH), which provides an explanation of when crossmodal KD is effective. We hypothesize that modality-general decisive features are the crucial factor that determines the efficacy of crossmodal KD.

• We conduct experiments on 6 multimodal datasets (i.e., synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB). The results validate the proposed MFH and provide insights on how to improve crossmodal KD.

2. RELATED WORK

2.1. UNDERSTANDING KD

As noted above, several works study when KD succeeds under capacity and architecture differences; for instance, Ren et al. (2022) analyze KD for vision transformers and demonstrate that the teacher's inductive bias matters more than its accuracy in improving performance of the transformer student. These works provide good insight into understanding KD, yet their discussions are limited to unimodality and have not touched on KD for multimodal learning.
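The distinction between modality-general and modality-specific decisive features introduced above can be made concrete with a toy construction. The sketch below is a hypothetical illustration, not the paper's experimental protocol: all variable names, noise scales, and the binary task are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n).astype(float)  # binary task labels

# A decisive feature correlates with the label; a modality-general one
# is observed in both modalities, a modality-specific one in only one.
shared  = y + 0.6 * rng.normal(size=n)  # modality-general decisive feature
only_a  = y + 0.6 * rng.normal(size=n)  # decisive, but observed only in modality a
noise_b = rng.normal(size=n)            # modality-specific but NOT decisive

x_a = np.stack([shared, only_a], axis=1)   # teacher's modality (e.g., RGB)
x_b = np.stack([shared, noise_b], axis=1)  # student's modality (e.g., depth)

def corr(u, v):
    """Pearson correlation between a feature and the label."""
    return float(np.corrcoef(u, v)[0, 1])
```

Under the MFH, a teacher whose decisions rely on `shared` should transfer well to a student observing `x_b`, whereas a teacher relying on `only_a` should not, since the student's modality never observes that feature.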

2.2. CROSSMODAL KD

With the accessibility of the Internet and the growing availability of multimodal sensors, multimodal learning has received increasing research attention (Baltrušaitis et al., 2018). Following this trend, KD has also been extended to achieve knowledge transfer from multimodal data and enjoys diverse applications, such as action recognition (Garcia et al., 2018; Luo et al., 2018; Thoker & Gall, 2019), lip reading (Ren et al., 2021; Afouras et al., 2020), and medical image segmentation (Hu et al., 2020; Li et al., 2020). Vision models are often adopted as teachers to provide supervision to student models of other modalities, e.g., sound (Aytar et al., 2016; Xue et al., 2021), depth (Gupta et al., 2016; Xue et al., 2021), optical flow (Garcia et al., 2018), thermal (Kruthiventi et al., 2017), and wireless signals (Zhao et al., 2018). Although these works demonstrate the potential of crossmodal KD, they are often tied to a specific multimodal task. An in-depth analysis of crossmodal KD is notably lacking, which is the main focus of this paper.
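The teacher-student setup described above can be sketched end to end in a few lines. The snippet below is a toy stand-in, not any cited system: a frozen linear "teacher" on one modality produces soft labels, and a linear student on another modality is trained by gradient descent to mimic them; all weights, learning rate, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy two-modality data: both modalities observe a shared decisive feature.
n = 500
y = rng.integers(0, 2, size=n).astype(float)
shared = y + 0.5 * rng.normal(size=n)
x_a = np.stack([shared, y + 0.5 * rng.normal(size=n)], axis=1)  # teacher modality
x_b = np.stack([shared, rng.normal(size=n)], axis=1)            # student modality

# Frozen "pretrained" teacher on modality a (weights assumed given).
w_teacher = np.array([2.0, 2.0])
soft_labels = sigmoid(x_a @ w_teacher - 2.0)  # teacher's soft predictions

# Student on modality b, trained to match the teacher's soft labels.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    pred = sigmoid(x_b @ w + b)
    grad = pred - soft_labels            # cross-entropy-to-soft-targets gradient
    w -= 0.1 * (x_b.T @ grad) / n
    b -= 0.1 * grad.mean()

# Accuracy of the distilled student against the true labels.
student_acc = np.mean((sigmoid(x_b @ w + b) > 0.5) == (y > 0.5))
```

Because the teacher's predictions here are driven partly by a feature both modalities share, the student can recover a useful decision rule from the soft labels alone, without ever seeing the ground-truth labels.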

2.3. MULTIMODAL DATA RELATIONS

There is continuous discussion on how to characterize multimodal (or multi-view) data relations. Many works (Tsai et al., 2020; Lin et al., 2021; 2022) utilize the multi-view assumption (Sridharan & Kakade, 2008), which states that either view alone is sufficient for the downstream tasks. However, as suggested in (Tsai et al., 2020), when the two views of the input lie in different modalities, the multi-view assumption is likely to fail.¹ In the meantime, a few works on multimodal learning (Wang et al.,



¹ A detailed comparison of our proposed MVD with the multi-view assumption is presented in Appendix C.




