MULTI-LABEL KNOWLEDGE DISTILLATION

Abstract

Existing knowledge distillation methods typically work by enforcing consistency of output logits or intermediate feature maps between the teacher network and the student network. Unfortunately, these methods can hardly be extended to the multi-label learning scenario: because each instance is associated with multiple semantic labels, neither the prediction logits nor the feature maps obtained from the whole example can accurately transfer knowledge for each individual label. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge in the logits by label decoupling with the one-versus-all reduction strategy; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels and achieves superior performance over diverse competing methods.

1. INTRODUCTION

Despite the remarkable success in training deep neural networks (DNNs) (Krizhevsky et al., 2012), it is hard to deploy these large networks on lightweight terminals, e.g., mobile phones, under constraints on computational resources or requirements for short inference time. To mitigate this issue, knowledge distillation (Hinton et al., 2015) aims to improve the performance of a small network (the student) by using the knowledge of a large network (the teacher) to guide its training. Existing knowledge distillation methods can be roughly divided into two categories: logits-based methods and feature-based methods. The former minimize the difference between the logits of the teacher and student models (Hinton et al., 2015; Zhao et al., 2022), while the latter distill knowledge from the feature maps of intermediate layers (Park et al., 2019; Tian et al., 2019; Chen et al., 2021). Typical knowledge distillation methods focus on the multi-class classification task, where each instance is associated with only one class label. However, in many real-world scenarios, an instance inherently contains complex semantics and can be assigned multiple class labels simultaneously. For example, an image of a street scene may be annotated with the labels building, car and person. To learn such complex object-label mappings, large models are usually needed to obtain desirable performance in multi-label classification. Unfortunately, computational resource constraints often rule out large neural networks in practical applications, leading to a noticeable decrease in model performance (Gou et al., 2021). To alleviate this performance degradation, it is necessary to design knowledge distillation methods specific to multi-label learning. We formalize this problem as a new learning framework called Multi-Label Knowledge Distillation (MLKD).
Although knowledge distillation has proven effective for improving the student network in single-label classification, directly extending existing KD methods to MLKD remains challenging. Specifically, logits-based methods often obtain the predicted probabilities through the softmax function, which is unsuitable for MLKD, since the predicted probabilities need not sum to one in Multi-Label Learning (MLL). Feature-based methods often perform knowledge distillation on feature maps of the whole image, which carry multiple semantics; this makes the model focus on the major objects while neglecting the minor ones, leading to sub-optimal or even undesirable distillation performance. Figure 1 provides empirical validation of these observations.

Figure 1: Comparison between our proposed L2D method and conventional KD methods on MS-COCO. We compare our method with three baselines: 1) Vanilla: the student trained without distillation; 2) Softmax (Hinton et al., 2015): a representative logits-based method that measures the KL divergence of the softmax scores over the logits; 3) ReviewKD (Chen et al., 2021): a feature-based method with state-of-the-art performance. The red dashed lines mark the performance of the teachers. Conventional KD methods perform barely better than the vanilla student, while our method significantly outperforms them and achieves performance comparable to the teacher. Implementation details for this experiment can be found in Appendix A.

In this paper, to perform multi-label knowledge distillation, we propose a new method consisting of multi-label logits distillation and label-wise embedding distillation (L2D for short).
Specifically, to exploit the informative semantic knowledge compressed in the logits, L2D employs the one-versus-all reduction strategy to obtain a series of binary classification problems and performs logits distillation on each of them. To enhance the distinctiveness of the learned feature representations, L2D encourages the student model to maintain the same structure of intra-class and intra-instance (inter-class) label-wise embeddings as the teacher model. By leveraging the structural information of the teacher, these two structural consistencies respectively enhance the compactness of intra-class embeddings and the dispersion of inter-class embeddings in the student model. Our main contributions can be summarized as follows:

• A general learning framework called MLKD is proposed. To the best of our knowledge, it is the first study specifically designed for knowledge distillation in the multi-label learning scenario.

• A new approach called L2D is proposed. It performs multi-label logits distillation and label-wise embedding distillation simultaneously; the former provides informative semantic knowledge, while the latter encourages the student model to learn more distinctive feature representations.

• Extensive experimental results on benchmark datasets demonstrate the effectiveness of the proposed method.

2. RELATED WORK

The concept of knowledge distillation (KD), proposed by Hinton et al. (2015), defines a learning framework that transfers knowledge from a large teacher network to a small student network. Existing works can be roughly divided into two groups, i.e., logits-based methods and feature-based methods. Logits-based methods mainly focus on designing effective distillation losses for the logits and the softmax scores computed from them. DML (Zhang et al., 2018) introduces a mutual learning method that trains teachers and students simultaneously. TAKD (Mirzadeh et al., 2020) proposes a new architecture called the "teacher assistant", an intermediate-sized network that bridges the gap between teachers and students. Besides, a recent study (Zhao et al., 2022) reformulates the classical KD loss into two parts and achieves state-of-the-art performance by adjusting the weights of these two parts. Other methods focus on distilling knowledge from intermediate feature layers. FitNet (Romero et al., 2014) is the first approach to distill knowledge from intermediate features by measuring the distance between feature maps. Attention transfer (Zagoruyko & Komodakis, 2016a) achieves better performance than FitNet by distilling knowledge from attention maps. PKT (Passalis & Tefas, 2018) measures the KL divergence between features by treating them as probability distributions. RKD (Park et al., 2019) utilizes the relations among instances to guide the training of the student model. CRD (Tian et al., 2019) incorporates contrastive learning into knowledge distillation. ReviewKD (Chen et al., 2021) proposes a review mechanism that uses multiple layers in the teacher to supervise one layer in the student. ITRD (Miles et al., 2021) aims to maximize the correlation and mutual information between the student's and teacher's representations. Multi-label learning has attracted increasing interest in recent years.
Existing solutions for MLL problems fall into three directions. The first designs novel loss functions to tackle the intrinsic positive-negative imbalance in multi-label classification tasks. For example, ASL (Ridnik et al., 2021a) re-weights positive and negative examples with different weights for balanced training. The second focuses on modeling label correlations, which provide prior knowledge for multi-label classification. Among them, MLGCN (Chen et al., 2019b) is a representative method that employs a graph convolutional network to model the correlation matrix, and CADM (Chen et al., 2019a) constructs a similarity graph based on class-aware maps. To handle the multiple semantic objects contained in an image, the third direction locates areas of interest related to semantic labels using attention techniques. Among them, C-Tran (Lanchantin et al., 2021) first utilizes a transformer to retrieve per-label embeddings from visual features. Query2Label (Liu et al., 2021a) uses several stacked transformer decoders to identify areas of interest. ML-Decoder (Ridnik et al., 2021b) simplifies the transformer decoder of Query2Label. ADDS (Xu et al., 2022b) introduces encoders from CLIP (Radford et al., 2021) to obtain better textual and visual embedding inputs for the classification head; in addition, it adds a multi-head cross-attention layer and a skip connection from the query input to the query output on top of ML-Decoder. Several previous studies have applied KD techniques to improve the performance of MLL. For example, Liu et al. (2018) and Xu et al. (2022a) simply minimize the mean squared error (MSE) loss between teacher and student logits. Song et al. (2021) design a partial softmax function that combines each positive label with all the negative labels, so that the conventional KD loss can be computed for each positive label.
The main difference between our work and these previous works is that we aim to improve distillation performance rather than MLL performance per se. This is reflected in the experiments in Section 4 and Appendix B, where the proposed method is mainly compared with KD methods, whereas previous methods were mainly compared with MLL methods.

3. THE PROPOSED APPROACH

Let $x \in \mathcal{X}$ be an instance and $y \in \mathcal{Y}$ its corresponding label vector, where $\mathcal{X} \subset \mathbb{R}^d$ is the $d$-dimensional input space and $\mathcal{Y} \subset \{0, 1\}^q$ is the target space with $q$ class labels. We use $y_j$ to denote the $j$-th component of $y$: for a given instance $x$, $y_j = 1$ indicates that the $j$-th label is relevant to the instance, and $y_j = 0$ otherwise. In multi-label learning, each instance may be assigned more than one label, which means $\sum_{j=1}^{q} \mathbb{I}(y_j = 1) \ge 1$, where $\mathbb{I}(\cdot)$ is the indicator function. We also denote by $[q]$ the integer set $\{1, 2, \dots, q\}$. In this paper, we use a classification model consisting of three components: a visual backbone $f$, which extracts a feature map $f(x)$ for the input $x$; a label-wise embedding encoder $g$ (Lanchantin et al., 2021; Liu et al., 2021a), which produces a label-wise embedding $e_k = g_k(f(x))$ with respect to the $k$-th class based on the feature map $f(x)$; and a multi-label classifier $h$, which predicts the multi-label probabilities $\hat{y} = [\sigma(h_1(e_1)), \sigma(h_2(e_2)), \dots, \sigma(h_q(e_q))]$, where $\sigma(\cdot)$ denotes the sigmoid function. It is noteworthy that this model is very general and can be built by equipping commonly used backbones, e.g., ResNet (He et al., 2016), with a label-wise embedding encoder $g$. For the notations above, we use the superscript $T$ (or $S$) to denote the teacher (or student) model; for example, $e^T_k$ and $e^S_k$ denote the label-wise embeddings of the teacher and student models. In multi-label learning, a popular approach is to employ the one-versus-all reduction strategy to transform the original task into multiple binary problems. Among the various loss functions, the most commonly used is the binary cross-entropy (BCE) loss. Specifically, given a batch of examples $\{(x_i, y_i)\}_{i=1}^{b}$ and the predicted probabilities $\hat{y}$, the BCE loss is defined as
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{b} \sum_{i=1}^{b} \sum_{k=1}^{q} \big[ y_{ik} \log(\hat{y}_{ik}) + (1 - y_{ik}) \log(1 - \hat{y}_{ik}) \big].$$
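To make the setup concrete, here is a minimal NumPy sketch of the BCE loss above, averaged over the batch and summed over the $q$ labels. The array shapes and the `eps` clamp are our own assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, targets, eps=1e-12):
    """Binary cross-entropy: averaged over the batch, summed over labels.

    logits:  (b, q) raw scores h_k(e_k), one per class label
    targets: (b, q) binary label vectors y
    """
    p = sigmoid(logits)
    # elementwise BCE; eps avoids log(0) for saturated predictions
    per_label = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return per_label.sum(axis=1).mean()
```

As a sanity check, with all-zero logits every $\hat{y}_{ik}$ equals $0.5$, so the loss reduces to $q \log 2$ per example.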
Figure 2 illustrates the distillation process of the proposed L2D framework. For a batch of training examples, we feed them into the teacher/student model to obtain the label-wise embeddings and predicted probabilities. To train the student model, besides the BCE loss, we design the following two distillation losses: 1) a multi-label logits distillation loss $\mathcal{L}_{\mathrm{MLD}}$ that exploits the informative semantic knowledge compressed in the logits, and 2) a label-wise embedding distillation loss $\mathcal{L}_{\mathrm{LED}}$ that leverages structural information to enhance the distinctiveness of the learned feature representations. The overall objective function is
$$\mathcal{L}_{\mathrm{L2D}} = \mathcal{L}_{\mathrm{BCE}} + \lambda_{\mathrm{MLD}} \mathcal{L}_{\mathrm{MLD}} + \lambda_{\mathrm{LED}} \mathcal{L}_{\mathrm{LED}},$$
where $\lambda_{\mathrm{MLD}}$ and $\lambda_{\mathrm{LED}}$ are two balancing parameters.

3.1. MULTI-LABEL LOGITS DISTILLATION

Traditional logits-based distillation normally minimizes the Kullback-Leibler (KL) divergence between the predicted probabilities, i.e., the logits after the softmax function, of the teacher and student models. However, this method cannot be directly applied to the MLL scenario, since it relies on the assumption that the predicted probabilities of all classes sum to one, which rarely holds for MLL examples. To mitigate this issue, inspired by the idea of one-versus-all reduction, we propose a multi-label logits distillation (MLD) loss, which decomposes the original multi-label task into multiple binary classification problems and minimizes the divergence between the binary predicted probabilities of the two models. Formally, the MLD loss is
$$\mathcal{L}_{\mathrm{MLD}} = \frac{1}{b} \sum_{i=1}^{b} \sum_{k=1}^{q} D\big( [\hat{y}^T_{ik}, 1 - \hat{y}^T_{ik}] \,\big\|\, [\hat{y}^S_{ik}, 1 - \hat{y}^S_{ik}] \big),$$
where $[\cdot, \cdot]$ is an operator that concatenates two scalars into a vector and $D$ is a divergence function. The most common choice is the KL divergence $D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$, where $P$ and $Q$ are two probability distributions. The MLD loss aims to improve the performance of the student model by fully exploiting the informative knowledge in the logits.
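With the KL divergence as $D$, the MLD loss reduces to a per-label binary KL between teacher and student sigmoid outputs. A minimal NumPy sketch follows; the function names and shapes are our own, and any temperature scaling is omitted.

```python
import numpy as np

def binary_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli distributions [p, 1-p] and [q, 1-q]."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def mld_loss(p_teacher, p_student):
    """Multi-label logits distillation: per-label binary KL between teacher
    and student sigmoid probabilities, summed over labels, averaged over batch.

    p_teacher, p_student: (b, q) probabilities after the sigmoid.
    """
    return binary_kl(p_teacher, p_student).sum(axis=1).mean()
```

The loss is zero exactly when the student matches the teacher's per-label probabilities, and grows as the two Bernoulli distributions diverge.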

3.2. LABEL-WISE EMBEDDING DISTILLATION

The MLD loss performs distillation on the predicted probabilities, which can be regarded as a high-level representation, i.e., the final outputs of the model. The knowledge distilled from the teacher by the MLD loss alone may be insufficient to train a student model with desirable performance, due to the limited information carried by the logits. To further strengthen the distillation, we design the label-wise embedding distillation (LED) loss, which explores the structural knowledge in label-wise embeddings. The main idea is to capture two types of structural relations among label-wise embeddings: 1) a class-aware label-wise embedding distillation (CD) loss $\mathcal{L}_{\mathrm{LED\text{-}CD}}$, which captures the structural relation between any two intra-class label-wise embeddings from different examples; and 2) an instance-aware label-wise embedding distillation (ID) loss $\mathcal{L}_{\mathrm{LED\text{-}ID}}$, which models the structural relation between any two inter-class label-wise embeddings from the same example. We introduce these two distillation losses in detail below.

3.2.1. CLASS-AWARE LABEL-WISE EMBEDDING DISTILLATION

Class-aware label-wise embedding distillation aims to improve distillation performance by exploiting the structural relations among intra-class label-wise embeddings. Generally, the same semantic objects in two different images differ in their individual characteristics, such as two cars with different colors and styles (see the left side of Figure 3). Since our goal is to distinguish car from other semantic classes rather than to identify different cars, such individual distinctiveness is confusing information for the classification task. Owing to its powerful learning capacity, the large model is able to capture highly abstract semantic representations for each class label while neglecting the useless individual information. From the perspective of learned feature representations, as shown in the left side of Figure 3, the teacher model tends to obtain a more compact structure of intra-class label-wise embeddings, which often leads to better classification performance. By transferring this structural knowledge from the teacher model to the student model, CD encourages the student to enhance the compactness of its intra-class label-wise embeddings, which improves its classification performance. It is worth noting that we only consider the structural relation between two valid label-wise embeddings, i.e., the embeddings with respect to positive labels. Similar to Eq. (4), for any two intra-class label-wise embeddings $e^S_{ik}$ and $e^S_{jk}$, we can obtain the structural relation $\phi_{\mathrm{CD}}(e^S_{ik}, e^S_{jk})$ for the student model. By enforcing consistency between the teacher and student structural relations for each pair of intra-class label-wise embeddings, we achieve the class-aware structural consistency:
$$\mathcal{L}_{\mathrm{LED\text{-}CD}} = \sum_{k=1}^{q} \sum_{i,j \in [b]} \ell\big( \phi_{\mathrm{CD}}(e^T_{ik}, e^T_{jk}),\, \phi_{\mathrm{CD}}(e^S_{ik}, e^S_{jk}) \big),$$
where $\ell$ is a function measuring the consistency between the teacher and student structural relations.
In experiments, we use the following Huber loss as the measurement:
$$\ell(a, b) = \begin{cases} \frac{1}{2}(a - b)^2 & \text{if } |a - b| \le 1, \\ |a - b| - \frac{1}{2} & \text{otherwise}, \end{cases} \qquad (6)$$
where $a$ and $b$ are two structural relations.
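Putting the pairwise relations and the Huber consistency together, a naive $O(b^2 q)$ NumPy sketch of the CD loss could look as follows. The embedding shapes are assumptions, and skipping pairs with a non-positive label is equivalent to setting both $\phi_{\mathrm{CD}}$ values to zero, since $\ell(0, 0) = 0$.

```python
import numpy as np

def huber(a, b):
    """Huber-style consistency between two scalar structural relations."""
    d = abs(a - b)
    return 0.5 * d * d if d <= 1.0 else d - 0.5

def cd_loss(e_teacher, e_student, y):
    """Class-aware label-wise embedding distillation (naive sketch).

    e_teacher, e_student: (b, q, m) label-wise embeddings e_ik
    y: (b, q) binary labels; only positive labels form valid pairs
    """
    b, q, _ = e_teacher.shape
    total = 0.0
    for k in range(q):                      # every class label
        for i in range(b):                  # every pair of examples
            for j in range(b):
                if y[i, k] == 1 and y[j, k] == 1:
                    phi_t = np.linalg.norm(e_teacher[i, k] - e_teacher[j, k])
                    phi_s = np.linalg.norm(e_student[i, k] - e_student[j, k])
                    total += huber(phi_t, phi_s)
    return total
```

In practice the triple loop would be vectorized, but the scalar form mirrors the summation in the loss definition directly.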

3.2.2. INSTANCE-AWARE LABEL-WISE EMBEDDING DISTILLATION

Instance-aware label-wise embedding distillation (ID) aims to improve distillation performance by exploring the structural relations among inter-class label-wise embeddings from the same image. Generally, two different semantic objects occurring in one image can be hard to distinguish due to the high similarities they share. For example, as shown in the right side of Figure 3, for an image annotated with sky, sea and beach, one can hardly distinguish between the semantic objects sky and sea, since they share the same color and similar textures. A feasible solution is to exploit other useful information, such as their spatial relation, i.e., sky always lies above sea. Owing to its powerful learning capacity, the large model is able to distinguish between similar semantic objects by exploiting such implicit supervision. From the perspective of learned feature representations, as shown in the right side of Figure 3, the teacher model tends to learn a dispersed structure of inter-class label-wise embeddings, which is beneficial for its discrimination ability. By distilling this structural knowledge from the teacher model, ID enforces the student model to enhance the dispersion of its inter-class label-wise embeddings, which improves its discrimination ability. For a given instance $x_i$, let $\{e^T_{ik}\}_{k=1}^{q}$ and $\{e^S_{ik}\}_{k=1}^{q}$ respectively denote the label-wise embeddings generated by the teacher and student models. Then, we capture the structural relation between any two inter-class label-wise embeddings $e^T_{ik}$ and $e^T_{il}$ by measuring their distance in the embedding space:
$$\phi_{\mathrm{ID}}(e^T_{ik}, e^T_{il}) = \begin{cases} \| e^T_{ik} - e^T_{il} \|_2 & \text{if } y_{ik} = 1 \text{ and } y_{il} = 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (7)$$
Note that in Eq. (7), we only consider the structural relation between two valid label-wise embeddings, i.e., the embeddings with respect to positive labels.
Similar to Eq. (7), for any two inter-class label-wise embeddings $e^S_{ik}$ and $e^S_{il}$, we can obtain the structural relation $\phi_{\mathrm{ID}}(e^S_{ik}, e^S_{il})$ for the student model. By encouraging the teacher and student models to maintain a consistent structure of intra-instance label-wise embeddings, we minimize the following loss to achieve the instance-aware structural consistency, where $\ell(\cdot, \cdot)$ is the Huber loss defined in Eq. (6):
$$\mathcal{L}_{\mathrm{LED\text{-}ID}} = \sum_{i=1}^{b} \sum_{k,l \in [q]} \ell\big( \phi_{\mathrm{ID}}(e^T_{ik}, e^T_{il}),\, \phi_{\mathrm{ID}}(e^S_{ik}, e^S_{il}) \big). \qquad (8)$$
Finally, the overall objective function of L2D (Eq. (2)) can be rewritten as
$$\mathcal{L}_{\mathrm{L2D}} = \mathcal{L}_{\mathrm{BCE}} + \lambda_{\mathrm{MLD}} \mathcal{L}_{\mathrm{MLD}} + \lambda_{\mathrm{LED\text{-}CD}} \mathcal{L}_{\mathrm{LED\text{-}CD}} + \lambda_{\mathrm{LED\text{-}ID}} \mathcal{L}_{\mathrm{LED\text{-}ID}},$$
where $\lambda_{\mathrm{MLD}}$, $\lambda_{\mathrm{LED\text{-}CD}}$ and $\lambda_{\mathrm{LED\text{-}ID}}$ are all balancing parameters.
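Analogously to the class-aware case, the ID loss and the final weighted objective can be sketched in NumPy as follows; the helper names, shapes, and default weights are our own assumptions.

```python
import numpy as np

def huber(a, b):
    """Huber-style consistency between two scalar structural relations."""
    d = abs(a - b)
    return 0.5 * d * d if d <= 1.0 else d - 0.5

def id_loss(e_teacher, e_student, y):
    """Instance-aware label-wise embedding distillation (naive sketch).

    For each example, compare teacher/student distances between the
    embeddings of every pair of positive labels.
    e_teacher, e_student: (b, q, m); y: (b, q) binary labels.
    """
    b, q, _ = e_teacher.shape
    total = 0.0
    for i in range(b):                      # every example
        for k in range(q):                  # every pair of labels
            for l in range(q):
                if y[i, k] == 1 and y[i, l] == 1:
                    phi_t = np.linalg.norm(e_teacher[i, k] - e_teacher[i, l])
                    phi_s = np.linalg.norm(e_student[i, k] - e_student[i, l])
                    total += huber(phi_t, phi_s)
    return total

def l2d_objective(l_bce, l_mld, l_cd, l_id,
                  lam_mld=1.0, lam_cd=1.0, lam_id=1.0):
    """Overall objective: BCE plus the three weighted distillation terms."""
    return l_bce + lam_mld * l_mld + lam_cd * l_cd + lam_id * l_id
```

The ID loss vanishes when the student reproduces the teacher's inter-class distances exactly, and otherwise penalizes each mismatched pair through the Huber consistency.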

4. EXPERIMENTS

Datasets. We perform experiments on three benchmark datasets: Pascal VOC2007 (Everingham et al., 2015) (VOC for short), MS-COCO2014 (Lin et al., 2014) (MS-COCO for short) and NUS-WIDE (Chua et al., 2009). VOC contains 5,011 images in the train-val set and 4,952 images in the test set. It covers 20 common objects, with an average of 1.6 labels per image. MS-COCO contains 82,081 training images and 40,137 test images. It covers 80 common objects, with an average of 2.9 labels per image. NUS-WIDE contains 161,789 training images and 107,859 test images. It covers 81 visual concepts, with an average of 1.9 labels per image. Metrics. Following existing works (Zhang & Zhou, 2013; Liu et al., 2021a; Sun et al., 2022), we adopt the mean average precision (mAP) over all classes, the overall F1-score (OF1) and the average per-class F1-score (CF1) to evaluate performance. We choose OF1 and CF1 since they consider both recall and precision and are thus more comprehensive. Methods. To validate the proposed method, we compare it with the following KD methods: RKD (Park et al., 2019), which captures the relations among instances to guide the training of the student model; PKT (Passalis & Tefas, 2018), which measures the KL divergence between features by treating them as probability distributions; ReviewKD (Chen et al., 2021), which transfers knowledge across different stages instead of only between features at the same level; MSE (Xu et al., 2022a), which minimizes the MSE loss between the logits of the teacher and student models; and PS (Song et al., 2021), which minimizes the KL divergence of logits after a partial softmax function. Implementation Details. We use models pretrained on ImageNet (Deng et al., 2009) as the backbones. We resize all images to 224 × 224 and set the batch size to 64.
For each training image, we adopt a weak augmentation consisting of random horizontal flipping and a strong augmentation consisting of Cutout (Devries & Taylor, 2017) and RandAugment (Cubuk et al., 2020). We use the Adam optimizer (Kingma & Ba, 2015) to train the model for 80 epochs. The one-cycle policy is used with a maximal learning rate of 0.0001, together with decoupled weight decay (Loshchilov & Hutter, 2018). As shown in Appendix D, L2D is insensitive to all of the balancing parameters. For the competing methods, we set their parameters as suggested in the original papers. In particular, all feature-based methods are applied to the feature maps output by the visual backbone f. All experiments are conducted on GeForce RTX 2080 GPUs. More details about the models and the implementation of the label-wise embedding encoder are given in Appendix A.

4.1. COMPARISON RESULTS

Table 1 and Table 2 report comparison results on MS-COCO with the same and different architectures for the student and teacher models. From Table 1, it can be observed that: 1) Conventional feature-based distillation methods achieve only minor performance improvements over the student model (without distillation), which indicates that these methods do not work well in multi-label scenarios due to their inability to capture the multiple semantics occurring in feature maps. 2) MLD outperforms conventional feature-based distillation methods in most cases, which indicates that with the one-versus-all reduction, logits-based distillation can be adapted to multi-label knowledge distillation. 3) The proposed L2D significantly outperforms all other methods and achieves performance comparable to the teacher model, which convincingly validates the effectiveness of the proposed label-wise embedding distillation. From Table 2, we can see: 1) Compared with the same-architecture setting, the performance gap between the teacher and student models is larger for different architectures, which indicates that the corresponding distillation task is harder. 2) Our method outperforms all competing methods by a significant margin in all cases. 3) L2D achieves the best performance in all cases and significantly outperforms MLD. These results provide strong empirical validation of the effectiveness of the proposed method. Figure 4 illustrates the performance of the proposed methods and the competing methods on VOC in terms of AP and mAP. Note that the classes are ranked in descending order according to the performance of the student model. From the figure, it can be observed that: 1) Our proposed L2D achieves the best performance and significantly outperforms the competing methods in terms of mAP.
2) L2D consistently achieves performance superior to the competing methods in most classes, and the performance gap is especially large for classes on which the student model performs poorly. This observation discloses that L2D improves the performance on hard classes by enhancing the distinctiveness of feature representations. These experimental results demonstrate the practical usefulness of the proposed method. More results on VOC and NUS-WIDE can be found in Appendix B and Appendix C.

4.2. ABLATION STUDY

In this section, to further analyze how the proposed method improves distillation performance, we conduct ablation studies on MS-COCO and VOC. It can be observed that by performing CD and ID, the mAP performance achieves 1.6% and 1.48% increments, respectively. Finally, we examine the combination of these techniques: by incorporating all components together, the full method achieves the best performance and significantly outperforms each partial variant. These results demonstrate that all three components are of great importance to the performance of the proposed L2D.

Figure 5: The differences between the correlation matrices of student and teacher predicted probabilities on MS-COCO. The student distilled by our method matches the teacher's correlations significantly better than the others.

4.3. DISTILLING INTER-CLASS CORRELATIONS

As discussed in previous works (Hinton et al., 2015), conventional supervised learning losses, e.g., the BCE loss, often neglect the correlations among class predictions. However, label correlation is a foundational element of multi-label classification. Distillation losses utilize "soft targets", which can effectively transfer such correlation information from the teacher model, leading to desirable distillation performance. To validate whether these correlations are captured effectively by L2D, Figure 5 illustrates the differences between the correlation matrices of the student's and teacher's predicted probabilities on MS-COCO. From the figure, it can be observed that: 1) Without knowledge distillation, the teacher and student correlations are very different, which indicates that the teacher model captures more precise correlations than the student and thus achieves better performance. 2) The representative competing method PKT shows a large difference, which discloses that conventional KD methods are ineffective at capturing correlations in multi-label scenarios. 3) MLD shows a reduced difference, which indicates that the rich information in the logits is beneficial for capturing precise correlations. 4) L2D shows a close match between the teacher and student correlations. By enhancing the distinctiveness of label-wise embeddings, L2D obtains more accurate predicted probabilities, leading to more precise correlation estimates.
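A comparison of this kind can be computed in a few lines. The paper does not specify the exact correlation measure, so this sketch assumes Pearson correlation over the predicted probabilities and summarizes the gap by a mean absolute difference.

```python
import numpy as np

def correlation_gap(p_student, p_teacher):
    """Mean absolute difference between the q x q Pearson correlation
    matrices of student and teacher predicted probabilities.

    p_student, p_teacher: (n, q) sigmoid probabilities on a test set.
    """
    c_s = np.corrcoef(p_student, rowvar=False)  # correlate columns (classes)
    c_t = np.corrcoef(p_teacher, rowvar=False)
    return np.abs(c_s - c_t).mean()
```

A smaller gap indicates that the student has inherited the teacher's inter-class correlation structure, which is the behaviour Figure 5 visualizes.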

5. CONCLUSION

The paper studies the problem of multi-label knowledge distillation. In the proposed method, multi-label logits distillation explores the informative semantic knowledge compressed in the teacher logits to obtain more semantic supervision, while label-wise embedding distillation exploits the structural knowledge in label-wise embeddings to learn more distinctive feature representations. Experimental results on benchmark datasets validate the effectiveness of the proposed method. In the future, we plan to further improve MLKD by exploiting other abundant structural information.

We can observe that on the attention heads for the classes car and person, Vanilla pays some of its attention to unrelated objects, and on the attention head for class umbrella, PKT pays some of its attention to the house. Only L2D concentrates on all three targets precisely.



Figure 2: An illustration of the L2D framework. The framework simultaneously performs multi-label logits distillation and label-wise embedding distillation to improve the performance of the student model.

Figure 3: An illustration of class/instance-aware label-wise embedding distillation. Class-aware label-wise embedding distillation (CD) captures structural relations among intra-class label-wise embeddings from different examples, while instance-aware label-wise embedding distillation (ID) explores structural relations among intra-instance (inter-class) label-wise embeddings.

For a batch of examples, let $\{e^T_{ik}\}_{i=1}^{b}$ and $\{e^S_{ik}\}_{i=1}^{b}$ respectively denote the intra-class label-wise embeddings with respect to class $k \in [q]$ generated by the teacher and student models. Then, we can capture the structural relation between any two intra-class label-wise embeddings $e^T_{ik}$ and $e^T_{jk}$ by measuring their distance in the embedding space:
$$\phi_{\mathrm{CD}}(e^T_{ik}, e^T_{jk}) = \begin{cases} \| e^T_{ik} - e^T_{jk} \|_2 & \text{if } y_{ik} = 1 \text{ and } y_{jk} = 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (4)$$

Figure 4: Comparison results of the comparing methods on VOC in terms of AP and mAP (%), where the backbones of teacher and student model are respectively ResNet-50 and ResNet-18.

Figure 8: An example visualization of attention maps. We can find that on the attention head for class airplane, neither Vanilla nor PKT pays full attention to the airplane: Vanilla is interfered with by the plant and PKT by the boy, but our L2D successfully resists such interference. On the attention head for class person, both Vanilla and PKT are interfered with by the shades on the cloud, but L2D is not.

Results on MS-COCO where the teacher and student models have the same architecture.

Results on MS-COCO where the teacher and student models have different architectures.



Ablation studies on MS-COCO and VOC.

6. REPRODUCIBILITY

We list the parameters for running the proposed algorithm in Section 4. The neural networks used and the implementation details of the label-wise embedding encoder can be found in Appendix A. The source code and a detailed description of the code are attached in the supplementary material.

A MORE DETAILS OF IMPLEMENTATION

In order to validate the proposed method with diverse architectures, we employ several commonly used models, including ResNet (He et al., 2016), Wide ResNet (WRN) (Zagoruyko & Komodakis, 2016b), RepVGG (Ding et al., 2021), Swin Transformer (Liu et al., 2021b), and MobileNet v2 (Sandler et al., 2018). For all backbones, we use their versions pre-trained on ImageNet (Deng et al., 2009) as our base models. For all experiments, similar to previous work (Ridnik et al., 2021b), we employ a label-wise embedding encoder consisting of a cross-attention module and a feed-forward fully-connected layer (Vaswani et al., 2017). The cross-attention module takes all queries and the feature maps as input. We assign one query per class to ensure that each query corresponds to a single semantic. The multi-label classifier h(·) is a per-class fully-connected layer, which outputs a predicted logit for a class label based on the input label-wise embedding. In the experiments shown in Figure 1, for distillation across the same architecture, we use Swin-S (Liu et al., 2021b) as the teacher and Swin-T as the student; for distillation across different architectures, we use ResNet-101 (He et al., 2016) as the teacher and MobileNet v2 (Sandler et al., 2018) as the student.
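The per-class query mechanism can be sketched as a single cross-attention step in NumPy. The real encoder also has learned projections, multiple heads, and a feed-forward layer; everything below (single head, no projections, the shapes) is our simplification, not the paper's implementation.

```python
import numpy as np

def label_wise_encoder(feature_map, queries):
    """Single-head cross-attention from per-class queries to spatial features.

    feature_map: (hw, m) flattened backbone feature map f(x)
    queries:     (q, m) one learnable query per class
    returns:     (q, m) label-wise embeddings, one per class
    """
    m = feature_map.shape[1]
    scores = queries @ feature_map.T / np.sqrt(m)   # (q, hw) scaled dot products
    scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # each row sums to 1
    return attn @ feature_map                       # attention-weighted pooling
```

Each output row is a convex combination of spatial features, so each class query pools the regions of the image most relevant to its label.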

B MORE RESULTS ON PASCAL VOC 2007

Table 4 and Table 5 report comparison results on Pascal VOC 2007 with the same and different architectures for the student and teacher models. From the tables, it can be observed that the proposed L2D significantly outperforms all competing methods, which convincingly validates the effectiveness of the proposed label-wise embedding distillation. Compared with the results on MS-COCO, the performance gap between L2D and the competing methods becomes smaller. One possible reason is that VOC contains only about 1.6 labels per image, which allows conventional KD methods to obtain better performance.

C MORE RESULTS ON NUS-WIDE

Table 6 reports comparison results on NUS-WIDE with the same and different architectures for the student and teacher models. For distillation between the same architecture, we choose ResNet-101 (He et al., 2016) as the teacher and ResNet-34 as the student. For distillation between different architectures, we choose Swin-T (Liu et al., 2021b) as the teacher and MobileNet v2 (Sandler et al., 2018) as the student. From the table, it can be observed that the proposed L2D significantly outperforms all competing methods, which convincingly validates the effectiveness of the proposed label-wise embedding distillation.

D PARAMETER SENSITIVITY ANALYSIS

In this section, we study the influence of the balancing parameters λ MLD, λ LED-CD and λ LED-ID on the performance of L2D. A commonly used setting of the hyperparameter in vanilla KD that balances the KL divergence against the CE loss is 0.9 (Tian et al., 2019), i.e., the balancing parameter for CE is 0.1 and the one for the KL divergence is 0.9. We therefore choose 10 as the balancing parameter for MLD, which is closest to the vanilla KD setting. We set the balancing parameter for LED-ID larger than that for LED-CD, considering that LED-ID may carry less information because there are fewer than 3 labels per instance on average, though this seems unnecessary since the parameter sensitivity experiments in Figure 6 show that the performance of L2D is insensitive to all of the balancing parameters.

E VISUALIZATION OF ATTENTION MAPS

To further show the effectiveness of our proposed method L2D, we visualize attention maps of the penultimate layer in the visual backbones using LayerCAM (Jiang et al., 2021), as implemented by Fernandez (2020). We compare attention maps of the student model trained by L2D with those of other methods in Figures 7, 8 and 9. We compare L2D with: 1) Vanilla: the student trained without distillation; 2) PKT: a classical feature-based method. In each figure, the first column shows the raw picture and the other columns show class activation maps overlaid on the raw picture. Each row represents a certain class. From these figures, we can find that L2D locates the specified objects more precisely than the other methods: it not only pays attention to the target objects but also resists interference from similar yet unrelated objects. All these comparisons show that L2D outperforms the competing methods, which validates the effectiveness of the proposed label-wise embedding distillation and shows great potential in MLKD. We can find that on the attention head for class handbag, both Vanilla and PKT are interfered with by other objects and do not pay full attention to the handbag, but our L2D successfully resists such interference.

