CROSS-LEVEL DISTILLATION AND FEATURE DENOISING FOR CROSS-DOMAIN FEW-SHOT CLASSIFICATION

Abstract

Conventional few-shot classification aims to learn a model on a large labeled base dataset and rapidly adapt it to a target dataset drawn from the same distribution as the base dataset. In practice, however, the base and target datasets are usually from different domains, which is the problem of cross-domain few-shot classification. We tackle this problem by making a small proportion of unlabeled images in the target domain accessible in the training stage. In this setup, even though the base data are sufficient and labeled, the large domain shift still makes transferring knowledge from the base dataset difficult. We meticulously design a cross-level knowledge distillation method, which strengthens the model's ability to extract discriminative features in the target domain by guiding the network's shallow layers to learn higher-level information. Furthermore, to alleviate overfitting in the evaluation stage, we propose a feature denoising operation which reduces feature redundancy and mitigates overfitting. Our approach surpasses the previous state-of-the-art method, Dynamic-Distillation, by 5.44% on 1-shot and 1.37% on 5-shot classification tasks on average on the BSCD-FSL benchmark. The implementation code will be made available.

1. INTRODUCTION

Deep learning has achieved great success on image recognition tasks with the help of large numbers of labeled images. This is in stark contrast to human perception, which can recognize a new category from only a few samples. Moreover, large-scale annotation is costly and unavailable in some scenarios. It is therefore valuable to study few-shot classification, which trains a classification model on a base dataset and rapidly adapts it to a target dataset. However, due to the constraint that the base data and the target data need to come from the same distribution, conventional few-shot classification may not meet the demands of some practical scenarios. For example, it may fail when the training domain consists of natural images but the evaluation domain consists of satellite images. Considering this domain shift in practical applications, we focus on cross-domain few-shot classification (CD-FSC) in this paper. Previous methods, such as (Mangla et al., 2020; Adler et al., 2020; Tseng et al., 2020), can handle this problem when the domain gap is small. However, the CD-FSC problem with a large domain gap remains a challenge. BSCD-FSL (Guo et al., 2020) is a suitable benchmark for studying this problem, where the base dataset contains natural images and the target datasets contain satellite images, crop disease images, skin disease images, and X-ray images of various lung diseases. On this benchmark, previous methods following the traditional CD-FSC protocol train their models on the base dataset and evaluate them on the target datasets. More recent methods, such as STARTUP and Dynamic-Distillation, additionally make a small proportion of unlabeled target-domain images accessible during training. Inspired by that, we follow their setup to explore the CD-FSC problem with a large domain shift. In this work, we propose cross-level distillation (CLD), which effectively transfers knowledge from the base dataset and improves the performance of the student network in the target domain. In addition, we propose feature denoising (FD) to remove the noise in the features during the fine-tuning stage.
Our CD-FSC framework is given in Figure 1. The details of CLD are shown in Figure 2: it distills a teacher's deeper layers to a student's shallower layers, where the student and the teacher share the same structure. Unlike the distillation methods in STARTUP and Dynamic-Distillation, which only distill the teacher's last layer to the student's last layer, our CLD leads the shallow layers of the student to mimic the features generated by deeper levels of the teacher, so that the student can learn deeper semantic information and extract more discriminative features on the target dataset. Additionally, since the teacher networks in STARTUP and Dynamic-Distillation are pre-trained on the base dataset only, the teacher's observation of the target data is biased. To calibrate this bias, we design an iterative process that builds another network, named the old student network, which shares its structure and parameters with a historical checkpoint of the student. In each training iteration, the features from the teacher and the old student at the same layers are dynamically fused to guide the corresponding layers of the student. The later the training iteration, the smaller the contribution of the teacher network to the fused features, and the larger the contribution of the old student network. Since the target data used in training are unlabeled, a self-supervised loss is introduced to further mine target-domain information. The self-supervised loss not only supports the network in extracting valuable information from the target domain, but also induces a phenomenon where the final feature vector for classification has a small number of dominant (strongly activated) elements while the others are close to zero (Hua et al., 2021; Kalibhat et al., 2022). We find that during the fine-tuning phase in Figure 1, these weakly activated elements are redundant and can be regarded as noise. Our FD operation keeps the top h largest elements and sets the others to zero.
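The teacher/old-student fusion and the cross-level guidance described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the linear decay schedule, the one-level offset between student and teacher layers, and the assumption that features at all levels are projected to a common shape are all illustrative choices.

```python
import numpy as np

def fuse_targets(teacher_feat, old_student_feat, step, total_steps):
    """Blend teacher and old-student features into a distillation target.

    Early in training the teacher dominates; later the old student
    (a frozen historical checkpoint of the student) takes over, which
    calibrates the teacher's bias toward the base domain.
    The linear schedule is an illustrative assumption.
    """
    w_teacher = 1.0 - step / total_steps  # decays from 1 to 0
    return w_teacher * teacher_feat + (1.0 - w_teacher) * old_student_feat

def cross_level_distill_loss(student_feats, fused_feats, level_offset=1):
    """Mean-squared error guiding student layer l toward the fused
    target from a deeper layer l + level_offset.

    Both arguments are lists of per-layer feature arrays, assumed
    already projected to matching shapes.
    """
    loss, n = 0.0, 0
    for l in range(len(student_feats) - level_offset):
        diff = student_feats[l] - fused_feats[l + level_offset]
        loss += np.mean(diff ** 2)
        n += 1
    return loss / max(n, 1)
```

At `step = 0` the fused target equals the teacher's feature; at `step = total_steps` it equals the old student's, matching the iterative hand-over described in the text.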
It is experimentally verified that FD greatly improves the model's performance. Our main contributions are summarized below: • We propose a cross-level distillation (CLD) framework, which effectively transfers the knowledge of a teacher trained on the base dataset to the student. We also introduce an old student network mechanism to calibrate the teacher's bias learned from the base data. • Considering the noisy feature activations, we design a feature denoising (FD) operation that significantly improves the performance of our model. • Extensive experiments verify that our proposed CLD and FD achieve state-of-the-art results on the BSCD-FSL benchmark with large domain gaps.
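The FD operation itself reduces to a top-h selection per feature vector. The sketch below assumes the elements are ranked by activation value (reasonable for non-negative, e.g. post-ReLU, features); whether the paper ranks by raw value or magnitude is an assumption here.

```python
import numpy as np

def feature_denoise(features, h):
    """Keep the h largest activations in each feature vector; zero the rest.

    features: (N, D) array of final feature vectors.
    h: number of elements to keep per vector.
    """
    out = np.zeros_like(features)
    # column indices of the h largest entries in each row
    idx = np.argsort(features, axis=1)[:, -h:]
    rows = np.arange(features.shape[0])[:, None]
    out[rows, idx] = features[rows, idx]
    return out
```

During fine-tuning, the linear classifier would then be trained on `feature_denoise(features, h)` instead of the raw features, discarding the weakly activated, redundant dimensions.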



Figure 1: Our CD-FSC framework. The first phase is pre-training, which trains the teacher network on the labeled base dataset by optimizing the cross-entropy loss. The second phase trains the student network using our proposed cross-level distillation (CLD). The third phase fine-tunes a linear classifier on a few labeled images in the target domain, and feature denoising (FD) is conducted to remove the noise in the final feature vectors. The final phase classifies images in the target domain.

Code availability: https://gitee.com/mindspore/models/tree/master/research/cv/CLDFD