MULTI-LABEL KNOWLEDGE DISTILLATION

Abstract

Existing knowledge distillation methods typically work by enforcing the consistency of output logits or intermediate feature maps between the teacher network and the student network. Unfortunately, these methods can hardly be extended to the multi-label learning scenario. Because each instance is associated with multiple semantic labels, neither the prediction logits nor the feature maps obtained from the whole example can accurately transfer knowledge for each label. In this paper, we propose a novel multi-label knowledge distillation method. On the one hand, it exploits the informative semantic knowledge from the logits by decoupling the labels with the one-versus-all reduction strategy; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method avoids knowledge counteraction among labels and achieves superior performance over a diverse set of comparison methods.

1. INTRODUCTION

Despite the remarkable success of deep neural networks (DNNs) (Krizhevsky et al., 2012), it is hard to deploy these large networks on lightweight terminals, e.g., mobile phones, under constraints on computational resources or inference time. To mitigate this issue, knowledge distillation (Hinton et al., 2015) aims to improve the performance of a small network (also known as the student) by using the knowledge of a large network (also known as the teacher) to guide the training of the student network. Existing knowledge distillation methods can be roughly divided into two categories: logits-based methods and feature-based methods. The former minimize the difference between the logits of the teacher model and the student model (Hinton et al., 2015; Zhao et al., 2022), while the latter distill knowledge from the feature maps of intermediate layers (Park et al., 2019; Tian et al., 2019; Chen et al., 2021). Typical knowledge distillation methods focus on the multi-class classification task, where each instance is associated with only one class label. However, in many real-world scenarios, an instance inherently contains complex semantics and can be simultaneously assigned multiple class labels. For example, an image of a street scene may be annotated with the labels building, car, and person. Learning this complex object-label mapping usually requires training large models to obtain desirable performance in multi-label classification. Unfortunately, due to computational resource constraints, large neural networks cannot be adopted in many practical applications, leading to a noticeable decrease in model performance (Gou et al., 2021). To alleviate this performance degradation, it is necessary to design knowledge distillation methods specifically for multi-label learning. We formalize this problem as a new learning framework called Multi-Label Knowledge Distillation (MLKD).
Although knowledge distillation has been proven effective for improving the performance of the student network in single-label classification, directly extending existing KD methods to MLKD problems remains challenging. Specifically, logits-based methods often obtain the predicted probabilities with the softmax function, which is not applicable to MLKD, since the predicted probabilities need not sum to one in Multi-Label Learning (MLL). Feature-based methods often perform knowledge distillation on feature maps of the whole image, which contains multiple semantics; this makes the model focus on the major objects while neglecting the minor ones. As a result, the model obtains sub-optimal or even undesirable distillation performance. Figure 1 provides empirical validation of these observations.
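The softmax issue described above can be made concrete with a small sketch. It is illustrative only and may differ from the formulation the paper goes on to define: the logit values, the three-label setup (building, car, person), and the helper names `binary_kl` and `one_vs_all_kd_loss` are assumptions introduced here. The sketch contrasts softmax probabilities, which are forced to sum to one, with per-label sigmoid probabilities, and then computes one plausible one-versus-all distillation loss as a sum of per-label binary KL divergences.

```python
import math


def softmax(logits):
    """Multi-class probabilities: labels compete and the result sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent per-label probability, as used in multi-label learning."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def one_vs_all_kd_loss(teacher_logits, student_logits):
    """Hypothetical one-versus-all loss: each label is treated as its own
    binary problem, and the per-label binary KL terms are summed."""
    return sum(
        binary_kl(sigmoid(t), sigmoid(s))
        for t, s in zip(teacher_logits, student_logits)
    )


# Hypothetical logits for the labels (building, car, person).
teacher_logits = [3.0, 2.0, -1.0]
student_logits = [1.0, 2.5, 0.5]

# Softmax sums to 1, so two confidently present labels must share the mass.
print("softmax sum:", sum(softmax(teacher_logits)))

# Sigmoid scores are independent, so their sum can exceed 1.
print("sigmoid sum:", sum(sigmoid(z) for z in teacher_logits))

print("one-vs-all KD loss:", one_vs_all_kd_loss(teacher_logits, student_logits))
```

Because every label contributes its own binary term, a confident teacher prediction for one label does not reduce the probability mass available to the others, which is the property that softmax-based distillation lacks in the multi-label setting.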

