FUNCTION-CONSISTENT FEATURE DISTILLATION

Abstract

Feature distillation makes the student mimic the intermediate features of the teacher. Nearly all existing feature-distillation methods use L2 distance or its slight variants as the distance metric between teacher and student features. However, while L2 distance is isotropic w.r.t. all dimensions, the neural network's operation on different dimensions is usually anisotropic, i.e., perturbations with the same 2-norm but in different dimensions of intermediate features lead to changes in the final output with largely different magnitudes. Considering this, we argue that the similarity between teacher and student features should not be measured merely by their appearance (i.e., L2 distance), but should, more importantly, be measured by their difference in function, namely how later layers of the network will read, decode, and process them. Therefore, we propose Function-Consistent Feature Distillation (FCFD), which explicitly optimizes the functional similarity between teacher and student features. The core idea of FCFD is to make teacher and student features not only numerically similar, but, more importantly, produce similar outputs when fed to the later part of the same network. With FCFD, the student mimics the teacher more faithfully and learns more from the teacher. Extensive experiments on image classification and object detection demonstrate the superiority of FCFD over existing methods. Furthermore, FCFD can be combined with many existing methods to obtain even higher accuracy.
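To make the core idea concrete, the following is a minimal sketch (not the paper's actual implementation) of a function-consistent loss. It assumes a hypothetical one-layer linear-plus-ReLU "tail" standing in for the later part of the teacher network; the key point is that the functional term compares what the same later layers produce when reading the student feature versus the teacher feature, in addition to the plain L2 (appearance) term:

```python
import numpy as np

def l2(a, b):
    # mean squared error between two feature tensors
    return float(np.mean((a - b) ** 2))

def teacher_tail(f, W, b):
    # hypothetical "later part" of the teacher: one linear layer + ReLU
    # (a real tail would be the teacher's remaining stages and classifier)
    return np.maximum(0.0, f @ W + b)

def fcfd_loss(f_student, f_teacher, W, b, alpha=1.0, beta=1.0):
    # appearance term: isotropic L2 distance between intermediate features
    appearance = l2(f_student, f_teacher)
    # function term: L2 distance between the outputs the SAME later
    # layers produce when fed the student vs. the teacher feature
    functional = l2(teacher_tail(f_student, W, b),
                    teacher_tail(f_teacher, W, b))
    # alpha and beta are assumed weighting hyperparameters
    return alpha * appearance + beta * functional
```

Note that two features with the same L2 distance to the teacher feature can yield very different functional terms, which is exactly the anisotropy the appearance term ignores.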

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated great power on a variety of tasks. However, the high performance of DNN models is often accompanied by large storage and computational costs, which barricade their deployment on edge devices. A common solution to this problem is Knowledge Distillation (KD), whose central idea is to transfer the knowledge from a strong teacher model to a compact student model, in hopes that the additional guidance could raise the performance of the student (Gou et al., 2021; Wang & Yoon, 2021) . According to the type of carrier for knowledge transfer, existing KD methods can be roughly divided into the following two categories: 

1.1 KNOWLEDGE DISTILLATION BASED ON FINAL OUTPUT

Hinton et al. (2015) first formalized the concept of knowledge distillation. In their method, the softened probability distribution output by the teacher is used as guidance for the student. Follow-up works (Ding et al., 2019; Wen et al., 2019) further explore the trade-off between soft logits and the hard task label. Zhao et al. (2022) propose to decouple target-class and non-target-class knowledge transfer, and Chen et al. (2022) make the student reuse the teacher's classifier. Park et al. (2019) claim that the relationships among representations of different samples carry important information, and they use these relationships as the knowledge carrier. Recently, some works have introduced contrastive learning to knowledge distillation. Specifically, Tian et al. (2020) propose to identify whether a pair of teacher and student representations is congruent (i.e., produced from the same input), and SSKD (Xu et al., 2020) adds an additional self-supervision task to the training process. Whereas all the aforementioned methods employ a fixed teacher model during distillation, Zhang et al. (2018b) propose online knowledge distillation, where both the large and the small models are randomly initialized and learn mutually from each other during training.
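The softened-probability guidance of Hinton et al. (2015) can be sketched as follows. This is an illustrative NumPy version of the standard temperature-scaled KL objective, not code from any of the cited works; the temperature value is an assumed hyperparameter:

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T yields a softer distribution
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_soft_loss(student_logits, teacher_logits, T=4.0):
    # KL divergence from the softened teacher distribution to the
    # softened student distribution, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's soft predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(T * T * kl.mean())
```

In practice this soft-target term is combined with the ordinary cross-entropy on the hard task label, which is the trade-off the follow-up works above explore.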

Availability

Our code is available at https://github.com/LiuDongyang6

