FUNCTION-CONSISTENT FEATURE DISTILLATION

Abstract

Feature distillation makes the student mimic the intermediate features of the teacher. Nearly all existing feature-distillation methods use L2 distance or its slight variants as the distance metric between teacher and student features. However, while L2 distance is isotropic w.r.t. all dimensions, a neural network's operation on different dimensions is usually anisotropic, i.e., perturbations with the same 2-norm but along different dimensions of an intermediate feature lead to changes in the final output of largely different magnitudes. Considering this, we argue that the similarity between teacher and student features should not be measured merely by their appearance (i.e., L2 distance), but should, more importantly, be measured by their difference in function, namely how later layers of the network read, decode, and process them. Therefore, we propose Function-Consistent Feature Distillation (FCFD), which explicitly optimizes the functional similarity between teacher and student features. The core idea of FCFD is to make teacher and student features not only numerically similar but, more importantly, produce similar outputs when fed to the later part of the same network. With FCFD, the student mimics the teacher more faithfully and learns more from the teacher. Extensive experiments on image classification and object detection demonstrate the superiority of FCFD over existing methods. Furthermore, FCFD can be combined with many existing methods to obtain even higher accuracy.

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated great power on a variety of tasks. However, the high performance of DNN models is often accompanied by large storage and computational costs, which barricade their deployment on edge devices. A common solution to this problem is Knowledge Distillation (KD), whose central idea is to transfer knowledge from a strong teacher model to a compact student model, in the hope that the additional guidance raises the performance of the student (Gou et al., 2021; Wang & Yoon, 2021). According to the type of carrier used for knowledge transfer, existing KD methods can be roughly divided into the following two categories.

1.1 KNOWLEDGE DISTILLATION BASED ON FINAL OUTPUT

Hinton et al. (2015) first clarified the concept of knowledge distillation. In their method, the softened probability distribution output by the teacher is used as guidance for the student. Follow-up works (Ding et al., 2019; Wen et al., 2019) go a step further and explore the trade-off between soft logits and hard task labels. Zhao et al. (2022) propose to decouple target-class and non-target-class knowledge transfer, and Chen et al. (2022) make the student reuse the teacher's classifier. Park et al. (2019) claim that the relationships among representations of different samples carry important information, and use these relationships as the knowledge carrier. Recently, some works have introduced contrastive learning into knowledge distillation: Tian et al. (2020) propose to identify whether a pair of teacher and student representations is congruent (i.e., produced from the same input), and SSKD (Xu et al., 2020) adds an auxiliary self-supervision task to the training process. Whereas all the aforementioned methods employ a fixed teacher model during distillation, Zhang et al. (2018b) propose online knowledge distillation, where both the large and the small models are randomly initialized and learn mutually from each other during training.
1.2 KNOWLEDGE DISTILLATION BASED ON INTERMEDIATE FEATURES

Compared with the final output, intermediate features contain richer information, and many methods exploit them for distillation. FitNet (Romero et al., 2015) pioneers this line of work: at certain positions, intermediate features of the student are transformed to mimic the corresponding teacher features w.r.t. L2 distance. FSP (Yim et al., 2017) and ICKD (Liu et al., 2021) utilize the Gramian matrix to transfer knowledge, and AT (Zagoruyko & Komodakis, 2017) uses the attention map. Generally, one student feature learns from only one teacher feature. However, some works have extended this paradigm: Review (Chen et al., 2021b) makes a student layer learn not only from the corresponding teacher layer but also from teacher layers prior to it, while AFD (Ji et al., 2021) and SemCKD (Chen et al., 2021a) make a student layer learn from all candidate teacher layers, with an attention mechanism allocating a weight to each pair of features. Considering that negative values are filtered out by the ReLU activation, Heo et al. (2019) propose the marginal ReLU activation and the partial L2 loss function to suppress the transfer of useless information.

1.3 OUR MOTIVATION

Extensive methods have been proposed to enhance feature distillation, from perspectives such as distillation position (Chen et al., 2021b; Ji et al., 2021; Chen et al., 2021a) and feature transformation (Yue et al., 2020; Zhang et al., 2020; Lin et al., 2022). However, all these methods use L2 distance or its slight variants (Heo et al., 2019) as the distance function to optimize. In this paper, we argue that the simple L2 distance (as well as L1, smooth L1, etc.) has a significant weakness that prevents it from being an ideal feature distance metric: it measures feature similarity independently of context. In other words, L2 distance merely measures the numerical distance between two features, which can be regarded as the appearance of the features. Such appearance-based distances treat all dimensions isotropically, without considering feature similarity in function. The function of an intermediate feature is to serve as the input to the later part of the network; features playing similar functions express similar semantics and lead to similar results. Since a neural network's operations on different dimensions are usually anisotropic, changing a feature by the same distance along different dimensions can change the final output by completely different magnitudes.

As a result, with only L2-based feature matching, student features with widely varying functional differences from the teacher feature are considered equally good imitations, which inevitably impedes knowledge transfer. Conversely, suppose that among these imitations, an additional function-based supervision could identify the better ones, i.e., those that cause less change in the final output; by enforcing the student to prioritize them, the semantic consistency between teacher and student features could be better guaranteed. We therefore speculate that an extra matching mechanism, one that drives the student toward the solution with the highest function consistency, would make knowledge transfer more effective.
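The function-consistency idea sketched above can be illustrated numerically. The following minimal example (all names, shapes, and the weight values are hypothetical; the teacher's later layers are reduced to a single ReLU-plus-linear map for illustration) contrasts an appearance-based L2 loss between two features with a function-based loss computed after both features pass through the same later part of the network:

```python
import numpy as np

rng = np.random.default_rng(0)

def later_part(h, W):
    """Hypothetical stand-in for the network's later layers: ReLU then linear."""
    return np.maximum(h, 0.0) @ W

# Hypothetical frozen weights of the teacher's later part, and two features
W = rng.normal(size=(8, 4))
f_teacher = rng.normal(size=(8,))
f_student = f_teacher + 0.3 * rng.normal(size=(8,))  # imperfect imitation

# Appearance-based loss: plain L2 distance between the features themselves
loss_appearance = np.sum((f_student - f_teacher) ** 2)

# Function-based loss: both features are decoded by the SAME later layers,
# and the resulting outputs are compared instead of the raw features
out_t = later_part(f_teacher, W)
out_s = later_part(f_student, W)
loss_function = np.sum((out_s - out_t) ** 2)

print(loss_appearance, loss_function)
```

The two losses generally disagree: features that are equally close in appearance can be arbitrarily close or far in function, which is exactly the gap that a function-consistent objective is meant to close.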

Figure 1: A toy example illustrating the relationship between appearance and function. Points on the red ring in (b) have the same L2 distance to the target (4, 4), but they cause largely different deviations in the final output.
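To make the toy example concrete, the following sketch (the map W is a hypothetical stand-in for the network's later layers, chosen to be strongly anisotropic) places two student features on the same L2 ring around the teacher feature (4, 4) and shows that their output deviations differ by two orders of magnitude:

```python
import numpy as np

# Hypothetical anisotropic later-layer map: dimension 0 is amplified
# 100x more than dimension 1 (illustrative weights only)
W = np.array([[10.0, 0.1]])

target = np.array([4.0, 4.0])   # teacher feature from the toy example
radius = 1.0                    # both perturbations share this 2-norm

# Two student features on the same L2 ring around the target
student_a = target + radius * np.array([1.0, 0.0])  # perturbed along dim 0
student_b = target + radius * np.array([0.0, 1.0])  # perturbed along dim 1

dev_a = np.abs(W @ student_a - W @ target)[0]  # deviation in final output
dev_b = np.abs(W @ student_b - W @ target)[0]

print(dev_a, dev_b)  # -> 10.0 and 0.1: equal L2 distance, 100x different effect
```

Under a pure L2 criterion both students are equally good imitations, yet feeding them through the same later layers reveals that one distorts the output a hundred times more than the other.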

Code availability

Our code is available at https://github.com/LiuDongyang6

