CROSS-LEVEL DISTILLATION AND FEATURE DENOISING FOR CROSS-DOMAIN FEW-SHOT CLASSIFICATION

Abstract

Conventional few-shot classification aims to learn a model on a large labeled base dataset and rapidly adapt it to a target dataset drawn from the same distribution as the base dataset. In practice, however, the base and target datasets of few-shot classification usually come from different domains, which is the problem of cross-domain few-shot classification. We tackle this problem by making a small proportion of unlabeled images in the target domain accessible in the training stage. In this setup, even though the base data are sufficient and labeled, the large domain shift still makes transferring knowledge from the base dataset difficult. We meticulously design a cross-level knowledge distillation method, which strengthens the model's ability to extract more discriminative features in the target dataset by guiding the network's shallow layers to learn higher-level information. Furthermore, to alleviate overfitting in the evaluation stage, we propose a feature denoising operation which reduces feature redundancy and mitigates overfitting. Our approach surpasses the previous state-of-the-art method, Dynamic-Distillation, by 5.44% on 1-shot and 1.37% on 5-shot classification tasks on average on the BSCD-FSL benchmark. The implementation code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/CLDFD.

1. INTRODUCTION

Deep learning has achieved great success in image recognition with the help of large numbers of labeled images. However, this stands in sharp contrast to human perception, which can recognize a new category from only a few samples. Besides, large amounts of annotation are costly and unavailable in some scenarios. It is therefore valuable to study few-shot classification, which trains a classification model on a base dataset and rapidly adapts it to a target dataset. However, due to the constraint that the base data and the target data need to be consistent in their distributions, conventional few-shot classification may not meet the demands of some practical scenarios. For example, it may fail when the training domain consists of natural images but the evaluation domain consists of satellite images. Considering this domain shift in practical applications, we focus on cross-domain few-shot classification (CD-FSC) in this paper. Previous methods, such as (Mangla et al., 2020; Adler et al., 2020; Tseng et al., 2020), can handle this problem with small domain gaps. However, the CD-FSC problem with a large domain gap is still a challenge. BSCD-FSL (Guo et al., 2020) is a suitable benchmark for studying this problem, where the base dataset contains natural images and the target datasets contain satellite images, crop disease images, skin disease images, and X-ray images of sundry lung diseases. On this benchmark, previous methods following the traditional CD-FSC protocol train their models on the base dataset and evaluate them on the target datasets, but their performances are far from satisfactory. STARTUP (Phoo & Hariharan, 2021) and Dynamic-Distillation (Islam et al., 2021a) introduce a more realistic setup that makes a small portion of unlabeled target images accessible during the training phase. These target images give the model a prior and dramatically promote its performance on the target datasets.
Inspired by that, we follow their setup to explore the CD-FSC problem with a large domain shift. In this work, we propose cross-level distillation (CLD), which can effectively transfer knowledge from the base dataset and improve the performance of the student network on the target domain. Besides, we propose feature denoising (FD) to remove the noise in the features during the fine-tuning stage. Our CD-FSC framework is given in Figure 1. The detail of CLD is shown in Figure 2: it distills a teacher's deeper layers to a student's shallower layers, where the student and the teacher share the same structure. Unlike the distillation methods in STARTUP and Dynamic-Distillation, which only distill the teacher's last layer to the student's last layer, our CLD leads the shallow layers of the student to mimic the features generated by deeper levels of the teacher, so that the student can learn deeper semantic information and extract more discriminative features on the target dataset. Additionally, since the teacher networks in STARTUP and Dynamic-Distillation are pre-trained on the base dataset only, the teacher's observation of the target data is biased. To calibrate this bias, we design an iterative process by building another network, named the old student network, which shares the same structure and parameters with a historical student network. In each training iteration, the features from the teacher and the old student at the same layers are dynamically fused to guide the corresponding layers of the student. The later the training iteration, the less the fused features draw from the teacher network and the more from the old student network. Since the target data used in training are unlabeled, a self-supervised loss is introduced to further excavate the target-domain information.
The self-supervised loss not only supports the network in mining valuable information in the target domain, but also brings a phenomenon where the final feature vector for classification has a small number of dominant (strongly activated) elements while the others are close to zero (Hua et al., 2021; Kalibhat et al., 2022). We find that during the fine-tuning phase in Figure 1, these weakly activated elements are redundant and can be considered noise. Our FD operation keeps the top h largest elements and sets the others to zero. It is experimentally verified that FD can greatly improve the model's performance. Our main contributions are summarized below:

• We propose a cross-level distillation (CLD) framework, which can well transfer the knowledge of the teacher trained on the base dataset to the student. We also introduce an old student network mechanism to calibrate the teacher's bias learned from the base data.

• Considering the noisy feature activations, we design a feature denoising (FD) operation that can significantly improve the performance of our model.

• Extensive experiments verify that our proposed CLD and FD achieve state-of-the-art results on the BSCD-FSL benchmark with large domain gaps.

2. RELATED WORK

Cross-domain few-shot classification. Cross-domain few-shot classification was first defined by (Chen et al., 2018): a model is trained on the base dataset and evaluated on a target dataset from a different domain. LFT (Tseng et al., 2020) simulates the domain shift from the base dataset to the target dataset by meta-learning and inserts linear layers into the network to align the features from the different domains. Meta-FDMixup (Fu et al., 2021) uses several labeled images from the target dataset for domain-shared feature disentanglement and feeds the domain-shared features to the classifier. FLUTE (Triantafillou et al., 2021) learns universal templates of features across multi-source domains to improve the transferability of the model. However, all these methods concentrate on the CD-FSC problem with small domain shifts. Some methods handle CD-FSC with large domain gaps, in which the target datasets differ markedly from the base dataset in perspective distortion, semantics, and/or color depth. For example, ConFeSS (Das et al., 2021) extracts useful feature components from labeled target images. ATA (Wang & Deng, 2021) does not require any prior of the target dataset and proposes a plug-and-play inductive bias-adaptive task augmentation module. CI (Luo et al., 2022) trains an encoder on the base dataset and converts the features of the target data with a transformation function in the evaluation stage. UniSiam (Lu et al., 2022) adopts a self-supervised approach to address the CD-FSC problem. Among the methods dealing with large domain shifts, STARTUP (Phoo & Hariharan, 2021) is a strong baseline, which uses a few unlabeled target images in the training stage. It first trains a teacher network on the base dataset in a supervised fashion and transfers the teacher's knowledge to the student by knowledge distillation (KD).
It jointly optimizes the cross-entropy loss on the base dataset, the contrastive loss on the unlabeled target images, and the KD loss to update the student network. Dynamic-Distillation (Islam et al., 2021a) also uses a small number of unlabeled images and KD. The main difference between Dynamic-Distillation and STARTUP is that the former updates the pre-trained teacher dynamically by exponential moving averages, while the latter fixes the teacher. In our work, we follow their data setup, allowing a small proportion of unlabeled target images to be seen during the training phase. Different from these two methods, which perform KD only at the last layers of the teacher and the student, our KD is carried out across levels. Besides, our denoising operation further improves the performance.

Self-supervised learning. Self-supervised learning is widely used in scenarios where labels are not available for training. It defines a "pretext task" to pre-train a network. For example, (Gidaris et al., 2018) pre-trains the model by predicting the rotation angle of the image. One popular approach is contrastive learning, such as SimCLR (Chen et al., 2020), which pulls different augmented versions of the same image closer and pushes versions from different images away. Beyond contrastive learning, BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021) rely on positive pairs only.

Knowledge distillation. (Hinton et al., 2015) first propose knowledge distillation, guiding a compact network (student) to mimic the output of a large network (teacher). Since features in the intermediate layers are informative, some previous methods distill the teacher's intermediate features (Romero et al., 2014) or attention maps of the features (Zagoruyko & Komodakis, 2017). Besides, self-distillation methods, such as BYOT (Zhang et al., 2019), distill the last layer of the network to its own shallower layers; BYOT's teacher and student are the same network.
In our framework (Figure 2 ), in addition to the KD in the intermediate layers, we design an old student that shares the same structure as the student but with different parameters. The introduction of this old student not only alleviates the teacher's bias learned from the base dataset, but also has the effect of assembling multiple historic students during training. 

3. METHODOLOGY

Figure 2: Framework of CLD for knowledge distillation (KD). The teacher network is pre-trained on the labeled base dataset with the cross-entropy loss and is fixed during KD. When training the student network, the target image $x$ is augmented into $x_1$, $x_2$ and $x_3$. Then $x_1$ and $x_2$ are fed to both the student and the teacher, and $x_3$ is fed to the old student. At the $i$-th iteration, the parameters of the old student are a copy of those of the student at the $(i-\tau)$-th iteration. The feature $s_l^i$ of the student is first projected by $\omega_l$ for dimensionality alignment, where $l$ is the block index. Then we fuse the features $t_{l+1}^i$ and $o_{l+1}^i$, which are from the $(l+1)$-th block of the teacher and the old student, respectively, obtaining $u_{l+1}^i$. The KD is conducted by forcing $\omega_l(s_l^i)$ to mimic $u_{l+1}^i$. Additionally, the self-supervised loss $L_{ss}$ is applied to the student network.

3.2. CROSS-LEVEL DISTILLATION

The proposed cross-level distillation (CLD) framework is shown in Figure 2. The teacher network $f_t$ is pre-trained on $D_B$ with the cross-entropy loss. The student network $f_s$ is expected to inherit the knowledge of the teacher and extract discriminative features on $D_T$. However, the teacher's observation of the target data is biased since it is pre-trained on the base dataset only. In the $i$-th training iteration, if the features extracted by the student $f_s^i$ directly mimic the features of the teacher, the teacher's bias will be transferred to the student. To reduce the bias, we introduce an old student network $f_o^i$, which is a copy of $f_s^{i-\tau}$, where the hyper-parameter $\tau$ denotes the training iteration interval between $f_s^{i-\tau}$ and $f_s^i$. To simplify the KD complexity, we divide each backbone of $f_s$, $f_o$, and $f_t$ into $L$ residual blocks. Let $s_l^i$, $o_l^i$, and $t_l^i$ be the features obtained by the student, the old student, and the teacher in the $l$-th block at the $i$-th iteration. The fusion between $t_l^i$ and $o_l^i$ is defined as

$$u_l^i = \alpha^i o_l^i + (1 - \alpha^i) t_l^i, \quad (1)$$

where $\alpha^i = i/T$ is a dynamic weight with $T$ being the total number of training iterations. The KD loss $L_{dis}^i$ in the $i$-th iteration is defined as

$$L_{dis}^i = \begin{cases} \sum_{l=1}^{L-1} \|\omega_l(s_l^i) - t_{l+1}^i\|_2^2 & \text{if } i \le \tau, \\ \sum_{l=1}^{L-1} \|\omega_l(s_l^i) - u_{l+1}^i\|_2^2 & \text{otherwise}, \end{cases} \quad (2)$$

where $\omega_l(\cdot)$ is a projector of the $l$-th student block for feature dimensionality alignment, comprising a convolutional layer, batch normalization, and a ReLU activation. Note that the KD in Equation 2 is from the $(l+1)$-th block of the teacher to the $l$-th block of the student; in our experiments, we find that this style is better than others (see Section 4.3). The total loss $L$ for training the student network is

$$L = L_{ss} + \lambda L_{dis}, \quad (3)$$

where $L_{ss}$ is a self-supervised loss and $\lambda$ is a weight coefficient balancing the two losses.
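The cross-level KD loss above can be sketched in a few lines of NumPy. This is an illustrative toy version, not the paper's MindSpore implementation: the projectors ω_l are passed in as plain callables (the paper uses a convolution + batch normalization + ReLU projector), and features are flat arrays rather than feature maps.

```python
import numpy as np

def fuse_features(t_next, o_next, i, T):
    # Dynamic fusion of Equation 1 with alpha_i = i / T: early iterations
    # trust the pre-trained teacher, later ones trust the old student,
    # which has adapted to the target domain.
    alpha = i / T
    return alpha * o_next + (1.0 - alpha) * t_next

def cld_loss(student_feats, teacher_feats, old_feats, projectors, i, T, tau):
    # Cross-level KD loss of Equation 2. Block l of the student (after
    # the projector omega_l) mimics block l+1 of the united teacher.
    # For the first tau iterations the old-student copy does not exist
    # yet, so the teacher's feature alone is the target.
    loss = 0.0
    for l, (s_l, proj) in enumerate(zip(student_feats, projectors)):
        t_next, o_next = teacher_feats[l + 1], old_feats[l + 1]
        target = t_next if i <= tau else fuse_features(t_next, o_next, i, T)
        loss += np.sum((proj(s_l) - target) ** 2)
    return loss
```

In a real training loop, `projectors` would be learnable modules updated together with the student, and the loss would be backpropagated through the student only.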
For $L_{ss}$, off-the-shelf self-supervised losses like SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) can be used. The contrastive loss in SimCLR is

$$L_{simclr} = -\frac{1}{|B|} \sum_{m,n \in B} \log \frac{\exp(\mathrm{sim}(z_m, z_n)/\gamma)}{\sum_{q=1, q \neq m}^{2|B|} \exp(\mathrm{sim}(z_m, z_q)/\gamma)}, \quad (4)$$

where $z_m$, $z_n$, and $z_q$ are the projected embeddings of different augmentations, $(m, n)$ is a positive pair from the same image, $\mathrm{sim}(\cdot)$ is a similarity function, $B$ is a mini-batch of unlabeled target images, and $\gamma$ is a temperature coefficient. The self-supervised loss in BYOL is

$$L_{byol} = \sum_{m \in B} \left( 2 - 2 \cdot \frac{\mathrm{sim}(p(z_m), z'_m)}{\|p(z_m)\|_2 \cdot \|z'_m\|_2} \right), \quad (5)$$

where $z_m$ and $z'_m$ are embeddings of the online network and the target network, respectively, and $p(\cdot)$ is a linear predictor. Note that the last convolution block of the student network is not involved in KD and is trained by minimizing $L_{ss}$ only. The reason is that the last block mainly discovers semantic information that is highly domain-specific. Therefore, we constrain it on the target data rather than letting it learn from the teacher, which is pre-trained on the base data.
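As a concrete reference for Equation 4, a minimal NumPy version of the SimCLR-style contrastive loss can be sketched as follows. The interleaved pair layout (rows 2k and 2k+1 are the two views of image k) and the use of cosine similarity are assumptions of this sketch, not details fixed by the paper.

```python
import numpy as np

def nt_xent(z, gamma=0.5):
    # z: (2B, D) projected embeddings; rows 2k and 2k+1 are the two
    # augmented views of image k. gamma is the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize rows
    sim = z @ z.T / gamma                             # pairwise cosine sims
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # exclude q == m terms
    # log-softmax over each row: log p(positive | anchor)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.arange(n) ^ 1                            # index of each row's positive
    return -log_prob[np.arange(n), pos].mean()
```

The loss decreases when the two views of the same image are pulled together and views of different images are pushed apart, which is exactly the behavior Equation 4 encodes.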

3.3. FEATURE DENOISING

The self-supervised loss brings a phenomenon where the final feature vector for classification has a small number of strongly activated elements while the others are close to zero (Hua et al., 2021; Kalibhat et al., 2022). These elements of small magnitudes are regarded as noise, which may cause overfitting. We propose a feature denoising (FD) operation to remove their contribution during the fine-tuning phase (see Figure 1). FD keeps the largest $h$ elements of the feature from the student network and zeros the other elements. The FD operation is performed only on the feature $s_L$ in the last layer of the student network. Specifically, let $s_L = [s_{L,1}, s_{L,2}, \ldots, s_{L,D_L}]$ and $\tilde{s}_L = [\tilde{s}_{L,1}, \tilde{s}_{L,2}, \ldots, \tilde{s}_{L,D_L}]$ be the features before and after the FD operation, respectively, where $D_L$ is the feature's dimensionality. Then the FD operation is defined as

$$\tilde{s}_{L,d} = \begin{cases} (s_{L,d})^\beta & \text{if } s_{L,d} \in \mathrm{top}_h(s_L), \quad d = 1, 2, \ldots, D_L, \\ 0 & \text{otherwise}, \end{cases} \quad (6)$$

where $\beta$ is a hyper-parameter which makes the non-zero elements more distinguishable, and $\mathrm{top}_h(\cdot)$ is the operator selecting the largest $h$ elements of the feature. Finally, $\tilde{s}_L$ is fed to the classifier for fine-tuning.
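The FD operation of Equation 6 is straightforward to implement. Below is a NumPy sketch, assuming a non-negative (post-ReLU) feature vector so that the power β is well defined; the function name is illustrative.

```python
import numpy as np

def feature_denoise(s, h=64, beta=0.4):
    # Keep the h largest activations, raise them to the power beta to
    # make them more distinguishable, and zero out the rest (Equation 6).
    out = np.zeros_like(s, dtype=float)
    idx = np.argsort(s)[-h:]        # indices of the top-h elements
    out[idx] = s[idx] ** beta
    return out
```

The defaults h = 64 and β = 0.4 are the values selected on EuroSAT in Section 4.4.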

4. EXPERIMENTS

4.1. DATASETS AND IMPLEMENTATION

Datasets. We evaluate the proposed CLD and FD for the CD-FSC problem on the BSCD-FSL benchmark (Guo et al., 2020) with large domain gaps. The miniImageNet dataset (Vinyals et al., 2016) serves as the base dataset $D_B$, which has sufficient labeled images. EuroSAT (Helber et al., 2019), CropDisease (Mohanty et al., 2016), ISIC (Codella et al., 2019), and ChestX (Wang et al., 2017) in BSCD-FSL are the unlabeled target datasets $D_T$. We follow the training protocol in STARTUP (Phoo & Hariharan, 2021) and Dynamic-Distillation (Islam et al., 2021a), which makes the whole labeled training set of miniImageNet and a small proportion (20%) of unlabeled target images available during the training period. The remaining 80% of target images are used for fine-tuning and evaluation by building 5-way K-shot tasks, K ∈ {1, 5}.

Implementation details. We implement our model using the MindSpore Lite tool (Mindspore). For a fair comparison, all the methods use ResNet-10 (Guo et al., 2020) as the backbone. Our model is optimized by SGD with momentum 0.9, weight decay 1e-4, and batch size 32 for 600 epochs. The learning rate is 0.1 at the beginning and decays by a factor of 0.1 after the 300th and 500th epochs. The hyper-parameters λ in Equation 3, and h and β in Equation 6, are set to 2, 64, and 0.4, respectively. For fine-tuning and evaluating the student network, we randomly sample 600 episodes of 5-way K-shot tasks.

Table 1: The 5-way 1-shot and 5-shot average accuracy and 95% confidence interval over 600 episodes. The reported results of SimCLR (Base) and the previous state-of-the-art Dynamic-Distillation are from (Islam et al., 2021a). The results of CI are from (Luo et al., 2022). The results of ATA are from (Wang & Deng, 2021). The performance of ConFeSS refers to (Das et al., 2021), which does not give confidence intervals. The champion results are marked in bold.
The performance is reported as the average classification accuracy over the 600 episodes with a 95% confidence interval. In each episode, the parameters of the student are frozen, and an additional linear classifier is fine-tuned by minimizing the cross-entropy loss. The above-mentioned hyper-parameters are determined on the EuroSAT dataset only and then used for evaluation on all the datasets, showing the generalization ability of our method.
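To make the evaluation protocol concrete, the following is a minimal sketch of sampling one N-way K-shot episode from frozen features. The helper name and the default of 15 queries per class are illustrative assumptions of this sketch, not values specified in the paper.

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, q_queries=15, rng=None):
    # Sample an N-way K-shot episode: pick n_way classes, then for each
    # class take k_shot support samples and q_queries disjoint query
    # samples. Class labels are remapped to 0..n_way-1.
    if rng is None:
        rng = np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sup_x, sup_y, qry_x, qry_y = [], [], [], []
    for new_y, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))
        sup_x.append(features[idx[:k_shot]])
        sup_y += [new_y] * k_shot
        qry_x.append(features[idx[k_shot:k_shot + q_queries]])
        qry_y += [new_y] * q_queries
    return (np.concatenate(sup_x), np.array(sup_y),
            np.concatenate(qry_x), np.array(qry_y))
```

In the paper's protocol, the support split would be used to fine-tune the linear classifier on top of the frozen student encoder, and the query split to measure episode accuracy.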

4.2. MAIN RESULTS

We select several basic and competitive methods for comparison in Table 1. The basic model is Transfer, which trains the encoder on the labeled base dataset with the cross-entropy loss. SimCLR (Base) (Chen et al., 2020) trains the model on the base dataset (miniImageNet) with the contrastive loss of SimCLR. CI (Luo et al., 2022) trains the encoder on the base dataset and converts the features of the target dataset with a transformation function in the evaluation stage. Transfer+SimCLR, proposed in (Islam et al., 2021b), exhibits good transferability by simultaneously optimizing the cross-entropy loss on the base dataset and the contrastive loss of SimCLR on the target dataset. STARTUP (Phoo & Hariharan, 2021) trains the model with three loss functions: the cross-entropy loss on the labeled base dataset, and the contrastive loss and KD loss on the unlabeled target domain. ATA (Wang & Deng, 2021) designs a plug-and-play inductive bias-adaptive task augmentation module. BYOL (Grill et al., 2020) is trained on the target dataset. ConFeSS (Das et al., 2021) uses labeled target images to find the useful components of the target image features. Dynamic-Distillation (Islam et al., 2021a) designs a distillation loss supported by a dynamically updated teacher network. SimCLR trains the encoder on the unlabeled target data only and shows good generalization ability. For our method, we build two models, BYOL+CLD+FD and SimCLR+CLD+FD, which use Equations 5 and 4 as self-supervised losses, respectively. Table 1 gives the comparisons among our models and the baselines. All the best results on the four datasets are obtained by one of our two models. In particular, on EuroSAT, our SimCLR+CLD+FD brings 9.38% and 7.42% gains over Dynamic-Distillation and SimCLR on the 5-way 1-shot task, respectively. On average over the four datasets, SimCLR+CLD+FD outperforms Dynamic-Distillation significantly (58.77% vs. 53.33% for 1-shot; 66.94% vs. 65.57% for 5-shot).
Besides, although SimCLR+CLD+FD and BYOL+CLD+FD are built upon SimCLR and BYOL, respectively, they greatly improve the performance of both baselines. Finally, we visualize the feature distributions of STARTUP and our SimCLR+CLD+FD in Figure 3. SimCLR+CLD+FD exhibits better clustering at all the blocks, especially at the last one. For simplicity, the combination (according to Equation 1) of the teacher and the old student in Figure 2 is denoted as the united teacher in Figure 4 and Figure 5. In these experiments, L_simclr is used as L_ss and FD is not employed. Table 2 gives the results of SimCLR, different single-block distillation structures, and CLD on EuroSAT (5-way 1-shot). All the single-block distillation structures perform better than SimCLR, and CLD outperforms all the single-block distillation structures.

4.3. ABLATION STUDY ON CROSS-LEVEL DISTILLATION

Table 3 gives the results of SimCLR, different multi-block distillation structures, and CLD. First, compared with SimCLR without KD, all the multi-block KD structures improve the model's overall performance. Second, the one-to-all KD outperforms the same-level KD in most cases. Finally, our CLD performs best on average among the three multi-block KD structures. We further explore why CLD exceeds the other methods. The Rand Index (Rand, 1971) is a metric reflecting the quality of feature clustering: the larger the Rand Index, the better the clustering. Figure 6 shows the Rand Index values of the features in each residual block on the EuroSAT test dataset. Our CLD gains more than all the other methods in each block, meaning that CLD picks up more discriminative information at each level, so that the model gradually gathers useful features as the network deepens. Next, we examine the usefulness of the old student and the effect of the training iteration interval τ (Section 3.2). Table 4 shows the experimental results of different settings. First, CLD without the old student outperforms SimCLR. Second, using the old student is better than not using it. Considering the performance and the memory requirement, we choose τ = 1 in all the other experiments on the four datasets.
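For reference, the Rand Index used in Figure 6 admits a short plain-Python implementation. This is a sketch of the metric itself, not the paper's evaluation code.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Rand Index (Rand, 1971): the fraction of sample pairs on which the
    # two clusterings agree, i.e. both place the pair in the same cluster
    # or both place it in different clusters.
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[a] == labels_true[b]) == (labels_pred[a] == labels_pred[b])
        for a, b in pairs)
    return agree / len(pairs)
```

Note that the Rand Index is invariant to relabeling of clusters, which is why it suits comparing learned feature clusters against ground-truth classes.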

4.4. ABLATION STUDY ON FEATURE DENOISING

On EuroSAT, the model SimCLR+CLD+FD is used to find the optimal h and β on the 5-way 1-shot tasks. As shown in Figure 7, h = 64 and β = 0.4 are chosen for the best accuracy. With h = 64 and β = 0.4 in FD, we compare using FD or not in Table 5. We can see that the first five models are improved with FD, while the last two are not. The different positions, which the

4.5. ABLATION STUDY ON WEIGHT COEFFICIENT OF LOSS FUNCTION

We give the results of SimCLR+CLD+FD with different λ values in Equation 3 on the EuroSAT dataset (5-way 1-shot), as shown in Figure 9. The best result is obtained with λ = 2, so we set λ = 2 in all related experiments.

5. CONCLUSION

In this work, we handle CD-FSC problems with large domain gaps between the base dataset and the target datasets. We propose cross-level distillation (CLD) for better transferring the knowledge of the base dataset to the student. We also present a feature denoising (FD) operation for mitigating overfitting. Our method improves over the strong baseline Dynamic-Distillation by 5.44% on 1-shot and 1.37% on 5-shot classification tasks on average on the BSCD-FSL benchmark, establishing new state-of-the-art results.

A APPENDIX

A.1 RESULTS OF THE DIFFERENT BATCH SIZES

We present the results of the SimCLR+CLD+FD trained with different batch sizes on the EuroSAT dataset (5-way 1-shot) in Figure 10 . We can see that the optimal batch size is 32, so we utilize this value in all the experiments of our method. 



Figure 1: Our CD-FSC framework. The first phase is pre-training, which trains the teacher network on the labeled base dataset by optimizing the cross-entropy loss. The second phase trains the student network using our proposed cross-level distillation (CLD). The third phase fine-tunes a linear classifier on a few labeled images in the target domain, and feature denoising (FD) is conducted to remove the noise in the final feature vectors. The final phase classifies images in the target domain.

3.1. PRELIMINARY

During the training period, we follow the setting in STARTUP (Phoo & Hariharan, 2021) and Dynamic-Distillation (Islam et al., 2021a), where a labeled base dataset $D_B$ and a few unlabeled target images sampled from the target dataset $D_T$ are available. In the testing stage, the support set $D_S$ comprises $N$ classes with $K$ samples randomly selected from each class in $D_T$, which is the so-called N-way K-shot task. The support set $D_S$ is for fine-tuning a new classifier with the frozen encoder (the student network in this work). The images in the query set $D_Q$ are randomly picked from the selected $N$ classes for evaluating the classification accuracy. The support set $D_S$ and the query set $D_Q$ have no overlap.

Figure 4: Different single-block distillation structures. B2-to-B1 means that the second block of the teacher is distilled to the first block of the student. B3-to-B2 and B4-to-B3 are similar.

Figure 5: Different multi-blocks distillation structures. (a) and (b) are two common KD methods. (c) is our cross-level KD.

Figure 6: Rand Index values in the residual blocks of the methods in Table 3.

Figure 7: (a) Accuracy vs. h when β = 0.4. (b) Accuracy vs. β when h = 64.

Figure 8: L ss is the self-supervised loss and L aux includes other losses such as the crossentropy and KD losses. (a) L aux is applied to intermediate layers. (b) L aux is applied to the final layer.

Figure 10: Results of the SimCLR+CLD+FD with the different batch sizes on the EuroSAT dataset (5-way 1-shot).

Figure 11: Results of the SimCLR+CLD+FD with the different percentages on the EuroSAT dataset (5-way 1-shot).

Table 2: Results of different single-block distillation structures on EuroSAT (5-way 1-shot). B2-to-B1, B3-to-B2, and B4-to-B3 are defined in Figure 4.

Table 3: Results of different multi-block KD structures.

Table 4: Effects of the old student and the training iteration interval τ. All the methods in the table use L_simclr as L_ss.

Table 5: Comparison of using FD or not on EuroSAT (5-way 1-shot).

Table 6: Results with self-supervised & supervised pre-trained teacher.

Method | EuroSAT 1-shot | EuroSAT 5-shot | CropDisease 1-shot | CropDisease 5-shot | ISIC 1-shot | ISIC 5-shot | ChestX 1-shot | ChestX 5-shot
SimCLR+CLD+FD (self-supervised) | 80.30±0.72 | 91.63±0.36 | 89.94±0.72 | 96.50±0.35 | 37.42±0.46 | 49.36±0.64 | 22.58±0.43 | 25.96±0.43
SimCLR+CLD+FD (supervised) | 82.52±0.76 | 92.89±0.34 | 90.48±0.72 | 96.58±0.39 | 39.70±0.69 | 52.29±0.62 | 22.39±0.44 | 25.98±0.43

ACKNOWLEDGEMENTS

We gratefully acknowledge the support of MindSpore (Mindspore), CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research. We would also like to express our sincere gratitude to Mr. Haichen ZHENG, Mrs. Jing LIN, and Ms. Xiao WANG for their unwavering support and confidence in our work.


A.2 RESULTS WITH A SELF-SUPERVISED PRE-TRAINED TEACHER

On the EuroSAT dataset (5-way 1-shot), we give the results of SimCLR+CLD+FD where the teacher is pre-trained in a self-supervised scheme. The performances of SimCLR+CLD+FD are shown in Table 6.

