MULTI-LABEL KNOWLEDGE DISTILLATION

Abstract

Existing knowledge distillation methods typically work by enforcing the consistency of output logits or intermediate feature maps between the teacher network and the student network. Unfortunately, these methods can hardly be extended to the multi-label learning scenario: because each instance is associated with multiple semantic labels, neither the prediction logits nor the feature maps obtained from the whole example can accurately transfer knowledge for each label. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge in the logits by label decoupling with the one-versus-all reduction strategy; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels and achieves superior performance against diverse competing methods.

1. INTRODUCTION

Despite the remarkable success in training deep neural networks (DNNs) (Krizhevsky et al., 2012), it is hard to deploy these large networks on lightweight terminals, e.g., mobile phones, under constraints on computational resources or requirements for short inference time. To mitigate this issue, knowledge distillation (Hinton et al., 2015) aims to improve the performance of a small network (also known as the student) by using the knowledge of a large network (also known as the teacher) to guide the training of the student network. Existing knowledge distillation methods can be roughly divided into two categories: logits-based methods and feature-based methods. The former minimize the difference between the logits of the teacher model and the student model (Hinton et al., 2015; Zhao et al., 2022), while the latter distill knowledge from the feature maps of intermediate layers (Park et al., 2019; Tian et al., 2019; Chen et al., 2021).

Typical knowledge distillation methods focus on the multi-class classification task, where each instance is associated with only one class label. However, in many real-world scenarios, an instance inherently contains complex semantics and can be simultaneously assigned multiple class labels. For example, an image of a street scene may be annotated with the labels building, car, and person. To learn such complex object-label mappings, large models are usually necessary to obtain desirable performance in multi-label classification. Unfortunately, due to computational resource constraints, large neural networks cannot be adopted in many practical applications, leading to a noticeable decrease in model performance (Gou et al., 2021). To alleviate this performance degradation, it is necessary to design knowledge distillation methods specific to multi-label learning. We formalize this problem as a new learning framework called Multi-Label Knowledge Distillation (MLKD).
Although knowledge distillation has been proven effective for improving the performance of the student network in single-label classification, it remains challenging to directly extend existing KD methods to MLKD problems. Specifically, logits-based methods often obtain the predicted probabilities with the softmax function, which is unavailable for MLKD, since the predicted probabilities may not sum to one in Multi-Label Learning (MLL). Feature-based methods often perform knowledge distillation based on feature maps of the whole image with multiple semantics, which makes the model focus on the major objects while neglecting the minor ones. This can lead to sub-optimal or even undesirable distillation performance. Figure 1 provides empirical validation for these observations: conventional KD methods achieve unfavorable performance similar to the student, while our method significantly outperforms them and achieves performance comparable to the teacher. Details about the implementation of this experiment can be found in Appendix A.

In this paper, to perform multi-label knowledge distillation, we propose a new method consisting of multi-label logits distillation and label-wise embedding distillation (L2D for short). Specifically, to exploit the informative semantic knowledge compressed in the logits, L2D employs the one-versus-all reduction strategy to obtain a series of binary classification problems and performs logits distillation on each one. To enhance the distinctiveness of the learned feature representations, L2D encourages the student model to keep the structures of intra-class and intra-instance (inter-class) label-wise embeddings consistent with those of the teacher model. By leveraging the structural information of the teacher model, these two structural consistencies respectively enhance the compactness of intra-class embeddings and the dispersion of inter-class embeddings for the student model.
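The one-versus-all logits distillation described above can be sketched as follows. This is a minimal illustration rather than the paper's exact objective: it assumes temperature-smoothed sigmoid probabilities per label and a binary (Bernoulli) KL divergence between teacher and student; all function names are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_kl(p_t, p_s, eps=1e-7):
    # KL divergence between two Bernoulli distributions with
    # success probabilities p_t (teacher) and p_s (student).
    p_t = min(max(p_t, eps), 1.0 - eps)
    p_s = min(max(p_s, eps), 1.0 - eps)
    return p_t * math.log(p_t / p_s) + (1.0 - p_t) * math.log((1.0 - p_t) / (1.0 - p_s))

def ml_logits_distill_loss(teacher_logits, student_logits, temperature=4.0):
    # One-versus-all reduction: each label defines its own binary
    # classification problem, distilled independently and averaged.
    loss = 0.0
    for z_t, z_s in zip(teacher_logits, student_logits):
        p_t = sigmoid(z_t / temperature)
        p_s = sigmoid(z_s / temperature)
        loss += binary_kl(p_t, p_s)
    return loss / len(teacher_logits)
```

Because each label is treated as an independent Bernoulli variable, the probabilities need not sum to one, which sidesteps the softmax limitation discussed above.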
Our main contributions can be summarized as follows:
• A general learning framework called MLKD is proposed. To the best of our knowledge, it is the first study specifically designed for knowledge distillation in the multi-label learning scenario.
• A new approach called L2D is proposed. It performs multi-label logits distillation and label-wise embedding distillation simultaneously: the former provides informative semantic knowledge, while the latter encourages the student model to learn more distinctive feature representations.
• Extensive experimental results on benchmark datasets demonstrate the effectiveness of the proposed method.
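To illustrate the idea of structure-level embedding distillation, the following sketch matches the pairwise cosine-similarity structure of teacher and student label-wise embeddings instead of matching the embeddings themselves. It is a simplified stand-in for the intra-class and intra-instance consistency terms described above; the squared-error formulation and all names are illustrative assumptions.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def structure_matrix(embeddings):
    # Pairwise cosine similarities among label-wise embeddings;
    # this relational matrix is the "structure" to be transferred.
    n = len(embeddings)
    return [[cosine_sim(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def embedding_structure_loss(teacher_emb, student_emb):
    # Penalize the discrepancy between the teacher's and student's
    # relational structures, so the student inherits the teacher's
    # compactness/dispersion pattern without copying raw features.
    s_t = structure_matrix(teacher_emb)
    s_s = structure_matrix(student_emb)
    n = len(s_t)
    return sum((s_t[i][j] - s_s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

A structural loss of this kind is invariant to rotations and rescalings of the student's embedding space, which is why it can transfer relational knowledge even when teacher and student have different feature dimensions.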

2. RELATED WORK

The concept of knowledge distillation (KD), proposed by Hinton et al. (2015), defines a learning framework that transfers knowledge from a large teacher network to a small student network. Existing works can be roughly divided into two groups, i.e., logits-based methods and feature-based methods. Logits-based methods mainly focus on designing effective distillation losses over the logits and the softmax scores derived from them. DML (Zhang et al., 2018) introduces a mutual learning method that trains teachers and students simultaneously. TAKD (Mirzadeh et al., 2020) proposes a new architecture called "teacher assistant", an intermediate-sized network that bridges the gap between teachers and students. In addition, a recent study (Zhao et al., 2022) reformulates the classical KD loss into two parts and achieves state-of-the-art performance by adjusting the weights of these two parts. Other methods focus on distilling knowledge from intermediate feature layers. FitNet (Romero et al., 2014) is the first approach to distill knowledge from intermediate features by measuring the distance between feature maps. Attention transfer (Zagoruyko & Komodakis, 2016a) achieves better performance than FitNet.
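For reference, the classical logits-based objective of Hinton et al. (2015) can be sketched as the KL divergence between temperature-softened softmax distributions. This minimal version (function names are ours) also makes explicit why the formulation presumes a single label: the softmax forces the class probabilities to sum to one.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened softmax; subtracting the max is a
    # standard numerical-stability trick and cancels in the ratio.
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    # KL(teacher || student) on softened class distributions,
    # scaled by T^2 as suggested by Hinton et al. (2015).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return (temperature ** 2) * sum(pi * math.log(pi / qi)
                                    for pi, qi in zip(p, q))
```

When teacher and student logits agree, the loss is zero; any divergence in the softened distributions yields a positive penalty.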



Figure 1: Comparison between our proposed L2D method and conventional KD methods on MS-COCO. We compare our method with the following three baselines: 1) Vanilla: the student trained without distillation; 2) Softmax (Hinton et al., 2015): a representative logits-based method that measures the KL divergence between softmax scores; 3) ReviewKD (Chen et al., 2021): a feature-based method that achieves state-of-the-art performance. The red dashed lines mark the performance of the teachers. Conventional KD methods achieve unfavorable performance similar to the student, while our method significantly outperforms these methods and achieves performance comparable to the teacher. Details about the implementation of this experiment can be found in Appendix A.

