THE MODALITY FOCUSING HYPOTHESIS: TOWARDS UNDERSTANDING CROSSMODAL KNOWLEDGE DISTILLATION

Abstract

Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to multimodal learning and has achieved great success in various applications. To transfer knowledge across modalities, a network pretrained on one modality is adopted as the teacher and provides supervision signals to a student network learning from another modality. In contrast to the empirical success reported in prior works, the working mechanism of crossmodal KD remains a mystery. In this paper, we present a thorough understanding of crossmodal KD. We begin with two case studies and demonstrate that KD is not a universal cure in crossmodal knowledge transfer. We then present the modality Venn diagram (MVD) to understand modality relationships, and the modality focusing hypothesis (MFH), which reveals the decisive factor in the efficacy of crossmodal KD. Experimental results on six multimodal datasets help justify our hypothesis, diagnose failure cases, and point out directions for improving crossmodal knowledge transfer in the future.

1. INTRODUCTION

Knowledge distillation (KD) is an effective technique for transferring knowledge from one neural network to another (Wang & Yoon, 2021; Gou et al., 2021). Its core mechanism is a teacher-student learning framework, where the student network is trained to mimic the teacher through a distillation loss. The loss function, initially proposed by Hinton et al. (2015) as the KL divergence between teacher and student soft labels, has been extended in many ways (Zagoruyko & Komodakis, 2016; Tung & Mori, 2019; Park et al., 2019; Peng et al., 2019; Tian et al., 2019). KD has been successfully applied to various fields and has demonstrated high practical value.

The wide applicability of KD stems from its generality: any student can learn from any teacher. More precisely, the student and teacher networks may differ in several ways. Three common scenarios are: (1) model capacity difference: many works on model compression (Zagoruyko & Komodakis, 2016; Tung & Mori, 2019; Park et al., 2019; Peng et al., 2019) aim to learn a lightweight student that matches the performance of its cumbersome teacher for deployment benefits; (2) architecture (inductive bias) difference: for example, recent works (Touvron et al., 2021; Ren et al., 2022; Xianing et al., 2022) propose to utilize a CNN teacher to distill its inductive bias into a transformer student for data efficiency; (3) modality difference: KD has been extended to transfer knowledge across modalities (Gupta et al., 2016; Aytar et al., 2016; Zhao et al., 2018; Garcia et al., 2018; Thoker & Gall, 2019; Ren et al., 2021; Afouras et al., 2020; Valverde et al., 2021; Xue et al., 2021), where the teacher and student networks come from different modalities. Examples include using an RGB teacher to provide supervision signals to a student network that takes depth images as input, or adopting an audio teacher to train a visual student (a minimal sketch of this setup follows below).

Despite the great empirical success reported in prior works, the working mechanism of KD is still poorly understood (Gou et al., 2021). This puts the efficacy of KD into question: is KD always effective in crossmodal knowledge transfer?
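To make the crossmodal setting above concrete, the following is a minimal sketch of Hinton-style distillation in which a frozen teacher from one modality (e.g., RGB) supervises a student from another modality (e.g., depth). It is illustrative only, not the implementation used in this paper; names such as `rgb_teacher` and `depth_student`, as well as the temperature and weighting values, are assumptions.

```python
# Minimal sketch of crossmodal KD with the Hinton et al. (2015) loss.
import torch
import torch.nn.functional as F

def crossmodal_kd_loss(teacher_logits: torch.Tensor,
                       student_logits: torch.Tensor,
                       labels: torch.Tensor,
                       temperature: float = 4.0,
                       alpha: float = 0.5) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions,
    combined with the standard cross-entropy term on ground-truth labels."""
    # Soften both distributions with temperature T.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # KL(teacher || student); the T^2 factor keeps the gradient scale
    # comparable to the cross-entropy term.
    kd = F.kl_div(log_p_student, p_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Crossmodal usage (hypothetical models): a frozen, pretrained RGB teacher
# supervises a trainable depth student on paired views of the same scene.
# with torch.no_grad():
#     teacher_logits = rgb_teacher(rgb_batch)       # frozen teacher
# student_logits = depth_student(depth_batch)       # trainable student
# loss = crossmodal_kd_loss(teacher_logits, student_logits, labels)
```

Note that only the student receives gradients; the teacher serves purely as a source of soft targets. The transfer is "crossmodal" because the two networks consume different inputs that describe the same underlying samples.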

