REVISIT FINETUNING STRATEGY FOR FEW-SHOT LEARNING TO TRANSFER THE EMBEDDINGS
Anonymous authors
Paper under double-blind review

Abstract

Few-Shot Learning (FSL) aims to learn a simple and effective bias from limited novel samples. Recently, many methods have focused on re-training a randomly initialized linear classifier to adapt it to the novel features extracted by a pre-trained feature extractor (called Linear-Probing-based methods). These methods typically assume that the pre-trained feature extractor is robust enough that finetuning is not needed, and hence the pre-trained feature extractor is not adapted to the novel samples. However, the unadapted pre-trained feature extractor distorts the features of novel samples because the robustness assumption may not hold, especially on out-of-distribution samples. To extract undistorted features, we designed Linear-Probing-Finetuning with Firth-Bias (LP-FT-FB), which yields an accurate bias on the limited samples for better finetuning of the pre-trained feature extractor and provides stronger transferring ability. In LP-FT-FB, we further proposed inverse Firth Bias Reduction (i-FBR) to regularize the over-parameterized feature extractor, on which FBR does not work well. The proposed i-FBR effectively alleviates the over-fitting problem of the feature extractor during finetuning and helps extract undistorted novel features. To show the effectiveness of LP-FT-FB, we conducted comprehensive experiments on the commonly used FSL datasets under different backbones for in-domain and cross-domain FSL tasks. The experimental results show that the proposed LP-FT-FB outperforms the SOTA FSL methods. The code is available at https://github.com/whzyf951620/LinearProbingFinetuningFirthBias.

1. INTRODUCTION

Few-shot Learning (FSL) has recently developed quickly in the limited-data regime. FSL aims to learn a suitable inductive bias from the given limited samples of novel classes. In the earliest approaches, the whole model, consisting of the feature extractor and the classifier, is pre-trained on the samples of base classes and then finetuned on the limited novel samples to obtain an inductive bias. The performance of the finetuned model drops significantly due to the over-fitting problem of the pre-trained model. To address the over-fitting problem, meta-learning-based methods such as Prototypical Networks Snell et al. (2017) and MAML Finn et al. (2017) were proposed to learn the learning strategies for a suitable inductive bias. Then, Chen et al. proposed Baseline++ Chen et al. (2019) to show that a simple Linear Probing (LP) strategy can also achieve performance comparable to meta-learning-based methods. LP re-trains a linear classifier to adapt to the novel samples without updating the whole model. Following Baseline++, many LP-based FSL methods, such as S2M2 Mangla et al. (2020), RFS Tian et al. (2020), and EMD Zhang et al. (2020), were proposed to obtain a more powerful fully-trained feature extractor. LP-based FSL methods assume that a pre-trained feature extractor is robust enough to novel samples Yang et al. (2021); Tian et al. (2020) and hence does not need to be finetuned. However, the robustness assumption may not hold, especially on out-of-distribution (OOD) novel samples, because the existing regularization methods only provide effective robustness to in-distribution (ID) samples. This leads to "wrong" features of novel samples Kumar et al. (2022). This problem cannot be well addressed by only strengthening the robustness of the pre-trained feature extractor without transferring the extracted features.
Finetuning is a typical and simple technique in transfer learning on large-scale datasets for quickly transferring the features Kornblith et al. (2019); He et al. (2020); Chen et al. (2020). However, the existing FSL finetuning strategies cannot appropriately finetune the over-fitted feature extractor Shen et al. (2021). Consequently, designing an appropriate finetuning strategy for the fully-trained feature extractor on the limited samples is critical for transferring the features and hence making the finetuned feature extractor yield "good" novel features. To design an appropriate finetuning strategy, we started by analyzing the existing finetuning strategies Kumar et al. (2022); Kanavati & Tsuneki (2021); Levine et al. (2016); Radford et al. (2021), especially on OOD samples Kumar et al. (2022). Linear-Probing-Finetuning (LP-FT) Kumar et al. (2022) theoretically proved that the existing finetuning strategies underperform due to distorted novel features. This is because, when trying to fit the novel samples with a randomly initialized linear classifier, the extracted ID features and OOD features change inconsistently. The inconsistency is caused by the unchanged features of the samples in the space orthogonal to the space spanned by the base samples, i.e., the out-of-distribution features remain unchanged (details in Section 2.2). Kumar et al. proposed using a linear classifier that is already well trained on the novel samples to address this distortion problem. Although LP-FT is effective, it addresses the finetuning problem on large-scale out-of-distribution datasets, which contain enough samples to provide an unbiased estimation. In FSL tasks, limited novel samples lead to a biased estimation of the updated parameters. Seeing this, we reduce the Firth Bias of the model Ghaffari et al. (2022); Firth (1993) to obtain an unbiased estimation (called Firth Bias Reduction).
However, Firth Bias Reduction (FBR) gives an unbiased estimation only for the linear classifier, not for the feature extractor. Experimental results also show that FBR decreases the performance of the finetuned feature extractor. A deeper analysis (Section 2.3) shows that when the scaling factor λ (see Eq. 2) is positive, FBR encourages the distribution of the linear classifier output to be far away from the uniform distribution (see Fig. 1); that is, FBR strengthens the influence of novel samples for linear probing. This accelerates the convergence of the low-parameterized linear classifier but hampers the generalization of the over-parameterized feature extractor. According to the analysis, we proposed inverse FBR (i-FBR, negative λ in Eq. 2) to encourage the distribution of the linear classifier output to be close to the uniform distribution (see Fig. 1) to address the over-fitting problem Müller et al. (2019). Combining LP-FT, FBR, and the proposed i-FBR, we proposed Linear-Probing-Finetuning with Firth Bias (LP-FT-FB) to appropriately finetune fully-trained feature extractors and adapt them to the target domain on limited novel samples. First, we use linear probing to address the distorted feature problem. The linear probing is regularized with FBR for unbiased estimation. Then the whole pre-trained model is finetuned on the limited novel samples, regularized with the proposed i-FBR, to strengthen the transferring ability of the feature extractor for more separable features. The proposed LP-FT-FB makes the extracted features of novel samples undistorted and separable. The whole LP-FT-FB finetuning strategy is visualized in Fig. 1. Furthermore, the unbiased finetuned feature extractor transfers the novel features close to the novel domain. This is verified in the cross-domain FSL tasks (Section 3.5).
Our main contributions include (1) studying and finding the gaps between finetuning-based and LP-based FSL methods: distorted features and a biased estimation; (2) designing a novel and simple finetuning strategy, LP-FT-FB, to quickly transfer the extracted features to the novel domain to bridge the gaps; (3) proposing inverse Firth Bias Reduction (i-FBR) to address the biased estimation problem of finetuning the feature extractor due to limited samples.

2. METHODS

LP-FT is first described in Section 2.2. Then FBR is formulated and analyzed in Section 2.3 to address the unique limited-sample problem in FSL, i.e., the biased estimation problem. However, FBR is not suitable for the over-parameterized feature extractor. Based on the analysis of FBR, i-FBR is proposed to finetune the feature extractor. In Section 2.4, LP-FT-FB is stated.

The whole inference process is formulated as $\hat{y}_i = v(z_i, \beta) = v(B(x_i, \theta), \beta)$, where $z_i = B(x_i, \theta)$. In the multinomial logistic regression model, the assignment probability of $z_i$ to class $c$ is formulated as $p_i^c := \Pr(y_i = c \mid z_i) = \frac{e^{\beta_c^T z_i}}{1 + \sum_{c'=1}^{C} e^{\beta_{c'}^T z_i}}$. Then the logistic log-likelihood function is formulated as $L_{logistic} := \sum_{i=1}^{M} \sum_{c=1}^{C} \mathbb{1}[y_i = c] \cdot \log p_i^c$, where $\mathbb{1}[\cdot]$ denotes $y_i$ in the one-hot manner. For the linear probing and finetuning, we used the Cross Entropy Loss $L_{CE} = -\frac{1}{M} \sum_{i=1}^{M} [y_i \cdot \log(P_i)]$, where $P_i = \{p_i^c\}_{c=1}^{C}$. $P_i = \hat{y}_i$ denotes the predicted probability vector.
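As a small illustration of the two formulas above, the following sketch evaluates the class probabilities and the cross-entropy loss on toy inputs. The function names `class_probs` and `cross_entropy` and all numbers are hypothetical and not taken from the released code. The sketch follows the printed formula literally, including the `1 +` term in the denominator, so the returned probabilities sum to less than one (the remainder can be read as an implicit reference class).

```python
import math

def class_probs(z, betas):
    """p_c = exp(beta_c . z) / (1 + sum_c' exp(beta_c' . z)), as printed above."""
    scores = [sum(b * x for b, x in zip(beta_c, z)) for beta_c in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores]

def cross_entropy(prob_rows, labels):
    """L_CE = -(1/M) * sum_i log p_i^{y_i}, with integer class labels y_i."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)

# Toy usage: two samples with 2-D features and C = 3 classes (illustrative values).
betas = [[0.5, -0.1], [0.0, 0.3], [-0.2, 0.2]]
features = [[1.0, 2.0], [0.5, -1.0]]
probs = [class_probs(z, betas) for z in features]
loss = cross_entropy(probs, labels=[1, 0])
```

With one-hot labels, the product $y_i \cdot \log(P_i)$ reduces to the log-probability of the true class, which is why the sketch indexes `p[y]` directly.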

2.2. LINEAR-PROBING-FINETUNING

LP-FT explored why LP underperforms FT on in-distribution samples but outperforms FT on out-of-distribution samples on large-scale datasets. LP-FT theoretically proved that when a pre-trained feature extractor is finetuned to fit the ID samples, the update does not change the features of out-of-distribution samples in the space orthogonal to the space spanned by the pre-training samples. To show this, let $S$ denote the subspace spanned by the training samples $X$, and let the training loss be $L(v(B(X, \theta), \beta), Y) = \|X\theta^T\beta - Y\|_2^2$. $B$ is assumed to be a linear model; $Y$ denotes the one-hot labels of $X$. The gradient of the training loss with respect to the parameter $\theta$ of the feature extractor $B$ is computed as:

$\nabla_\theta L(v(B(X, \theta), \beta), Y) = 2\beta(Y - X\theta^T\beta)^T X$ (1)

With Eq. 1, if $u$ is a sample in the subspace orthogonal to $S$, the features of $u$ do not change with the finetuned $B(\cdot, \theta)$: $\Delta z = \nabla_\theta L(v(\cdot, \beta), B(\cdot, \theta))u = 2\beta(Y - X\theta^T\beta)^T (X \cdot u) = 0$. This leads to distorted features extracted by the finetuned $B(\cdot, \theta)$ because the in-distribution and out-of-distribution features change inconsistently. However, LP-FT explores the feature extractor finetuning problem on large-scale datasets. Because the novel samples used to finetune the pre-trained feature extractor in FSL are out-of-distribution, we wish to use LP-FT to address the feature extractor finetuning problem in FSL, provided that we can address the limited-sample problem. The limited-sample problem is actually a biased estimation problem.
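The orthogonality argument can be checked numerically. The sketch below is a simplified, hypothetical setting rather than the exact model of Eq. 1: it assumes a linear extractor $z = \theta x$, a scalar regression target instead of one-hot labels, and one plain gradient-descent step. All names and numbers are illustrative. Every term of the gradient is a multiple of some training sample $x_i$, so a direction orthogonal to all $x_i$ sees no change in its feature.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def grad_theta(theta, beta, xs, ys):
    """Gradient of sum_i (beta . (theta x_i) - y_i)^2 with respect to theta."""
    k, d = len(theta), len(theta[0])
    grad = [[0.0] * d for _ in range(k)]
    for x, y in zip(xs, ys):
        z = matvec(theta, x)
        residual = sum(b * zi for b, zi in zip(beta, z)) - y
        for a in range(k):
            for j in range(d):
                # Each contribution is proportional to the training sample x.
                grad[a][j] += 2.0 * residual * beta[a] * x[j]
    return grad

theta = [[0.5, -0.2, 0.3], [0.1, 0.4, -0.6]]   # toy linear feature extractor
beta = [1.0, -1.0]                              # toy linear head
xs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]         # training samples span axes 0 and 1
ys = [1.0, 0.0]
u_ood = [0.0, 0.0, 1.0]                         # orthogonal to every training sample
u_id = [1.0, 0.0, 0.0]                          # lies inside the training subspace

g = grad_theta(theta, beta, xs, ys)
lr = 0.1
theta_new = [[t - lr * gv for t, gv in zip(tr, gr)] for tr, gr in zip(theta, g)]

# Feature change after one finetuning step, for the OOD and the ID direction.
delta_ood = [a - b for a, b in zip(matvec(theta_new, u_ood), matvec(theta, u_ood))]
delta_id = [a - b for a, b in zip(matvec(theta_new, u_id), matvec(theta, u_id))]
```

Here `delta_ood` stays at zero while `delta_id` does not, which is the inconsistent change of ID and OOD features described above.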

2.3. INVERSE FIRTH BIAS REDUCTION

For linear probing, we use Firth Bias Reduction (FBR) to obtain an unbiased estimation. PMLE Firth (1993) added a log-determinant penalty (footnote 1) to remove the $O(n^{-1})$ term of the asymptotic bias of the maximum likelihood parameter estimates. FBR Ghaffari et al. (2022) addressed the situation when $\det(F) = 0$ and proposed an approximate form for the linear classifier and the cosine classifier. In FBR Ghaffari et al. (2022), the FBR loss is formulated as

$L = L_{CE} + L_{Firth} = L_{CE} - \lambda \cdot \frac{1}{M} \sum_{i=1}^{M} D_{KL}(U_{[0,C]} \,\|\, P_i)$ (2)

where $U_{[0,C]}$ denotes the uniform distribution between 0 and $C$, and $\lambda$ denotes the scaling factor of FBR. From Eq. 2, we concluded that when $\lambda > 0$, $L_{Firth}$ encourages the distribution of the logits $z_i$ to be far away from $U_{[0,C]}$, which is the opposite of Label Smoothing Müller et al. (2019). When $\lambda < 0$, $L_{Firth}$ encourages the distribution of the logits $z_i$ to be close to $U_{[0,C]}$ and hence avoids over-confident $z_i$. This coincides with the effect of Label Smoothing Müller et al. (2019); Hein et al. (2019). However, the vanilla FBR is not suitable for finetuning the feature extractor because when $\lambda > 0$ (used in FBR Ghaffari et al. (2022)), the logits $z$ are encouraged to be far away from $U_{[0,C]}$. This distribution drag makes the distribution of the logits $z$ too similar to $y$ and hence over-fits the over-parameterized feature extractor Hein et al. (2019). Seeing this, according to the analysis of Eq. 2, we proposed inverse FBR (i-FBR) to encourage the distribution of $z$ to be close to $U_{[0,C]}$ with $\lambda < 0$ (we use $\lambda_{inv}$ to denote the scaling factor of i-FBR) to address the over-fitting problem. The proposed i-FBR works similarly to Label Smoothing to a certain extent because both of them encourage the output distribution of the model to be close to the uniform distribution.
Hence they can reduce the influence of the training novel samples and alleviate the over-confidence problem. In addition to the above distribution drag, we computed the gradients of $L_{CE}$ and $L$ with respect to $r_i = \beta \cdot B(x_i, \theta)$ as follows to show the gradient variation brought by $\lambda$:

$\frac{\partial L}{\partial r_i} = \frac{\partial L_{CE}}{\partial r_i} + \frac{\lambda}{(C + 1)} (C P_i - E)$ (3)

where $E$ is an all-ones vector with the same size as $P_i$, and $P_i = \hat{y}_i$ is the prediction of the whole model. The detailed derivation is given in the Appendix. According to Eq. 3, we concluded that when $\lambda > 0$, the gradient $\frac{\partial L}{\partial r_i^c}$ is increased for the low-parameterized linear classifier to quickly reach the optimal point, and hence the linear probing model performs better Prabhu et al. (2021). The over-fitting problem does not need to be considered because of the low-parameterized property. However, this is not suitable for finetuning the over-parameterized feature extractor because it makes $z_i$ and $y_i$ too similar. From Eq. 3, we found that when $\lambda < 0$ and the class of $x_i$ is $c$, the gradient $\frac{\partial L}{\partial r_i^c}$ is decreased and $\frac{\partial L}{\partial r_i^k}$ ($k \neq c$) is increased. This reduces the influence of $x_i$, which is used to train the over-parameterized feature extractor, and hence alleviates the over-fitting problem. Furthermore, different from Label Smoothing, the proposed i-FBR decreases the gradients, which is suitable for finetuning the over-parameterized model, and hence outperforms Label Smoothing (see Table 6).
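To make the role of the sign of $\lambda$ concrete, the following sketch evaluates the regularized loss of Eq. 2 on toy logits. It is a hedged illustration, not the authors' implementation: the helper names are hypothetical, the toy logits are made up, and a standard softmax is assumed in place of the `1 +` form used in the set-up. The sign convention follows Eq. 2 as printed, so `lam > 0` corresponds to FBR and `lam < 0` to i-FBR.

```python
import math

def softmax(logits):
    """Numerically stable softmax (an assumption; the paper's form has a 1 + term)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_uniform_to(p):
    """D_KL(U || P), with U uniform over the len(p) classes."""
    c = len(p)
    return sum((1.0 / c) * math.log((1.0 / c) / pc) for pc in p)

def firth_regularized_loss(logit_rows, labels, lam):
    """L = L_CE - lam * mean_i D_KL(U || P_i), following the sign of Eq. 2."""
    probs = [softmax(r) for r in logit_rows]
    ce = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    kl = sum(kl_uniform_to(p) for p in probs) / len(probs)
    return ce - lam * kl

# A confident prediction: the KL term is large, so a negative lam adds a penalty.
peaked = [[5.0, 0.0, 0.0, 0.0]]
loss_fbr = firth_regularized_loss(peaked, [0], lam=1.0)
loss_plain = firth_regularized_loss(peaked, [0], lam=0.0)
loss_ifbr = firth_regularized_loss(peaked, [0], lam=-1.0)
```

Since $D_{KL}(U \| P) \ge 0$ and vanishes only for a uniform prediction, a negative `lam` raises the loss of peaked predictions, which is the smoothing behavior attributed to i-FBR above, while a positive `lam` rewards them.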

2.4. LP-FT-FB

With FBR, LP-FT, and the proposed i-FBR, we proposed LP-FT-FB to quickly transfer the extracted features, rather than rely on the robustness assumption, and to address the over-fitting problem. First, a randomly initialized linear classifier $v(\cdot, \beta)$ is re-trained on the novel samples and regularized with FBR. For the $C_{few}$-way-$K$-shot (footnote 2) FSL tasks, the re-trained classifier $v'(\cdot, \beta')$ is updated as

$\beta' = \beta + \frac{\alpha_1}{K \cdot C_{few}} \cdot \sum_{i=1}^{K \cdot C_{few}} \frac{\partial L(\hat{y}_i, y_i)}{\partial r_i} \cdot \frac{\partial r_i}{\partial \beta}$ (4)

where $r_i = \beta \cdot B_0(x_i, \theta_0)$; $\hat{y} = \{\hat{y}_i\}_{i=1}^{C_{few} \cdot K}$; $y = \{y_i\}_{i=1}^{C_{few} \cdot K}$; $\alpha_1$ is the LP learning rate. Then the pre-trained $B_0(\cdot, \theta_0)$ and $v'(\cdot, \beta')$ are finetuned together on the same novel samples and regularized with i-FBR. FT is stated as follows. The forward process of FT is formulated as

$\hat{y}' = v'(B_0(x, \theta_0), \beta')$ (5)

The loss of FT is computed as:

$L(\hat{y}', y) = \frac{1}{C_{few} \cdot K} \sum_{i=1}^{K \cdot C_{few}} L(\hat{y}'_i, y_i)$ (6)

The feature extractor $B_0(\cdot, \theta_0)$ is updated as

$\theta = \theta_0 + \frac{\alpha_2}{K \cdot C_{few}} \cdot \sum_{i=1}^{K \cdot C_{few}} \frac{\partial L(\hat{y}'_i, y_i)}{\partial r'_i} \cdot \frac{\partial r'_i}{\partial \theta_0}$ (7)

where $r' = \{r'_i\}_{i=1}^{K \cdot C_{few}}$ and $r'_i = \beta' \cdot B_0(x_i, \theta_0)$. The linear classifier $v'(\cdot, \beta')$ is updated as

$\beta = \beta' + \frac{\alpha_2}{K \cdot C_{few}} \cdot \sum_{i=1}^{K \cdot C_{few}} \frac{\partial L(\hat{y}'_i, y_i)}{\partial r'_i} \cdot \frac{\partial r'_i}{\partial \beta'}$ (8)

where $\alpha_2$ is the learning rate of FT. The analytic formulas of Eq. 4, Eq. 8, and Eq. 7 are given in the Appendix. With $B(\cdot, \theta)$ and $v(\cdot, \beta)$, the performance of LP-FT-FB is evaluated on the novel query samples. The whole flow of LP-FT-FB is given in the Appendix.
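The two-stage procedure can be sketched on a toy problem. The code below is a minimal illustration under several assumptions that are not in the paper: the feature extractor is a small linear map standing in for a deep backbone, a standard softmax replaces the `1 +` form, updates are full-batch gradient-descent steps (so the gradient is subtracted), and the learning rates and step counts are toy values. The function `train` and all variable names are hypothetical.

```python
import math
import random

def softmax(r):
    m = max(r)
    e = [math.exp(v - m) for v in r]
    s = sum(e)
    return [v / s for v in e]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def logit_grad(p, y, lam):
    """d/dr of CE - lam * D_KL(U || P), assuming a standard softmax P."""
    c = len(p)
    return [(p[k] - (1.0 if k == y else 0.0)) - lam * (p[k] - 1.0 / c) for k in range(c)]

def train(theta, beta, xs, ys, lr, lam, steps, update_theta):
    """Full-batch gradient descent on the support set; returns new (theta, beta)."""
    theta = [row[:] for row in theta]
    beta = [row[:] for row in beta]
    n = len(xs)
    for _ in range(steps):
        g_beta = [[0.0] * len(beta[0]) for _ in beta]
        g_theta = [[0.0] * len(theta[0]) for _ in theta]
        for x, y in zip(xs, ys):
            z = matvec(theta, x)                      # features z = B(x, theta)
            g_r = logit_grad(softmax(matvec(beta, z)), y, lam)
            for c, gc in enumerate(g_r):
                for a, za in enumerate(z):
                    g_beta[c][a] += gc * za / n       # dL/dbeta = dL/dr * z
            for a in range(len(z)):
                g_z = sum(g_r[c] * beta[c][a] for c in range(len(beta)))
                for j, xj in enumerate(x):
                    g_theta[a][j] += g_z * xj / n     # dL/dtheta = dL/dz * x
        beta = [[b - lr * g for b, g in zip(br, gr)] for br, gr in zip(beta, g_beta)]
        if update_theta:
            theta = [[t - lr * g for t, g in zip(tr, gr)] for tr, gr in zip(theta, g_theta)]
    return theta, beta

rng = random.Random(0)
theta0 = [[1.0, 0.0], [0.0, 1.0]]                     # stands in for the pre-trained extractor
beta0 = [[rng.uniform(-0.01, 0.01) for _ in range(2)] for _ in range(2)]  # random init
support_x, support_y = [[1.0, 0.0], [0.0, 1.0]], [0, 1]   # toy 2-way-1-shot task

# Stage 1: linear probing with FBR (lam = 1); the extractor stays frozen.
theta_lp, beta_lp = train(theta0, beta0, support_x, support_y,
                          lr=0.1, lam=1.0, steps=100, update_theta=False)
# Stage 2: finetune extractor and classifier together with i-FBR (lam = -1e-3).
theta_ft, beta_ft = train(theta_lp, beta_lp, support_x, support_y,
                          lr=0.01, lam=-1e-3, steps=50, update_theta=True)

def predict(theta, beta, x):
    r = matvec(beta, matvec(theta, x))
    return max(range(len(r)), key=lambda k: r[k])
```

The design point the sketch mirrors is the ordering: the classifier is fitted first so that finetuning starts from a head that already matches the novel samples, and only then is the extractor allowed to move, with the smoothing regularizer limiting how strongly each support sample pulls on it.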

3. EXPERIMENTS

To comprehensively show the effectiveness of the proposed i-FBR and LP-FT-FB, we conducted many in-domain experiments: 1) 5-way-1\5-shot tasks on mini-Imagenet, tiered-Imagenet, and CUB datasets under the typical FSL backbone, WideResNet-28-10; 2) few-shot tasks under multiple scale backbones including ResNet-18 and ResNet-34; 3) N-way-K-shot (N > 5, K > 5) tasks on mini-Imagenet and tiered-Imagenet; 4) few-shot tasks on the extracted features augmented by Distribution Calibration (DC) Yang et al. (2021). The cross-domain few-shot tasks, mini-Imagenet → CUB and tiered-Imagenet → CUB, are also evaluated.

3) Finetuning hyper-parameters. The pre-training hyper-parameters are the same as Mangla et al. (2020). We only give the hyper-parameters of the LP-FT process. For LP, we used the linear classifier proposed in Baseline++ Chen et al. (2019). For the optimizer, SGD is used with learning rate $\alpha_1 = 0.01$, momentum 0.9, dampening 0.9, and weight decay 1e-3. For the FBR of the classifier, the factor $\lambda$ in Eq. 2 is set to 1. For FT, the feature extractor and the classifier are together manually finetuned with learning rate $\alpha_2$ = 1e-3, and the i-FBR factor $\lambda_{inv}$ in Eq. 7 is set to -1e-3.

4) Evaluation. The reported results are averaged in percent with a 95% confidence interval over 10000 tasks randomly selected by the Episodic Sampler. Following the settings in Yang et al. (2021) and Ghaffari et al. (2022), the query sample number is 15 in each task for evaluating the performance.
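The evaluation protocol above can be sketched as follows. This is an illustrative outline and not the Episodic Sampler used in the experiments: the function names are hypothetical, the query count is assumed to be per class, and the confidence interval uses the common normal approximation (1.96 standard errors), which the text does not specify.

```python
import math
import random

def sample_episode(class_to_items, n_way, k_shot, n_query, rng):
    """Pick n_way classes, then k_shot support and n_query query items per class."""
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(class_to_items[cls], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query

def mean_ci95(accuracies):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    return mean, 1.96 * math.sqrt(var / n)

# Toy usage: 6 classes with 20 item ids each; one 5-way-1-shot episode.
toy_data = {name: list(range(20)) for name in "abcdef"}
support, query = sample_episode(toy_data, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

In a full evaluation, one accuracy value would be computed on the query set of each sampled episode, and `mean_ci95` would be applied to the list of per-episode accuracies.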

3.2. 5-WAY-1\5-SHOT TASKS

In-domain 5-way-1\5-shot tasks are the most typical FSL tasks. We directly used the pre-trained WideResNet-28-10 provided by S2M2_R Mangla et al. (2020). The evaluation results are reported in Table 1. On mini-Imagenet, LP-FT-FB outperforms FBR by 1.45% and 0.86% for 1\5-shot tasks, respectively. With the same pre-trained model, LP-FT-FB outperforms S2M2_R by 2.56% and 1.23% for 1\5-shot tasks. Similarly, on tiered-Imagenet, LP-FT-FB outperforms FBR by 3.04% and 3.75% for 1\5-shot tasks. On CUB, LP-FT-FB outperforms S2M2_R by 1.48% and 1.03% for 1\5-shot tasks.

Table 1: The evaluation experiments for 5-way FSL tasks are conducted under WideResNet-28-10 on three typical FSL datasets.
Methods | mini-Imagenet 1-shot | mini-Imagenet 5-shot | tiered-Imagenet 1-shot | tiered-Imagenet 5-shot | CUB 1-shot | CUB 5-shot
MAML Finn et al. (2017) | 48

3.3. EVALUATION UNDER DIFFERENT BACKBONES

In addition to WideResNet-28-10, we also evaluated LP-FT-FB under different backbones, ResNet-18 and ResNet-34. Because S2M2_R did not provide the pre-trained models of ResNet-18 and ResNet-34, we used the code (footnote 3) of S2M2_R to reproduce the pre-trained models. LP-FT-FB outperforms S2M2_R by 3.56% and 3.47% for 1\5-shot tasks on CUB.

3.4. EVALUATION ON AUGMENTED SAMPLES

Because the feature extractor is finetuned, the novel features are equivariant even though the novel samples differ substantially from the base samples. The equivariant features are therefore closer to the distribution of the novel samples than the features of the unchanged feature extractor used in DC. To show this, we used DC to augment the yielded features before evaluation. The evaluation results are given in Table 3. The data augmentation experiments are conducted on tiered-Imagenet. For the 1-shot-10-way task, LP-FT-FB outperforms DC and FBR by 0.74% and 0.79%, respectively. For the 5-shot-10-way task, LP-FT-FB outperforms DC and FBR by 1.13% and 1.54%, respectively. For the 15-way tasks, LP-FT-FB also outperforms DC and FBR.

3.5. CROSS-DOMAIN TASKS

To show the transferring ability brought by LP-FT-FB, we conducted cross-domain FSL experiments. LP-FT-FB is evaluated on the CUB dataset, but the feature extractor is pre-trained on mini-Imagenet and tiered-Imagenet, respectively. In this experiment, we used DC to augment the novel samples, as in the evaluation experiments of FBR Ghaffari et al. (2022). LP-FT-FB + DC denotes the proposed method in Table 4. As shown in Table 4, for mini-Imagenet → CUB tasks, LP-FT-FB outperforms DC by 2.07% and 2.29% for 5-shot-10-way and 5-shot-15-way tasks. For tiered-Imagenet → CUB tasks, LP-FT-FB outperforms DC by 1.65% and 2.09% for 5-shot-10-way and 5-shot-15-way tasks.

i-FBR factor $\lambda_{inv}$ tuning. We give the tuning of $\lambda_{inv}$ as follows to show the pattern of i-FBR. As given in Fig. 2, we show the tuning trend of $\lambda_{inv}$ (axis values of $\lambda_{inv}$ in Fig. 2: -1, -1e-1, -1e-2, -7e-3, -4e-3, -1e-3, -1e-4, 0, 1e-3, 1e-2, 5e-2, 1e-1). The ablation experiments are evaluated on mini-Imagenet for the 5-way-1-shot task with $\lambda = 1$. We obtained the best performance with $\lambda_{inv}$ = -7e-4. When $\lambda_{inv} \geq 0$, i.e., i-FBR is turned into FBR, the performance decreases.

i-FBR v.s. smoothing regularizers. The proposed i-FBR is a regularizer that smooths the output distribution of the model. We compared the performance of different smoothing regularizers. As given in Table 6, the proposed i-FBR outperforms the other smoothing regularizers. The detailed experiment settings are given in the Appendix.

(2015) can make the feature extractor more robust. These methods are called Linear Probing methods because they only re-train the linear classifier and keep the pre-trained feature extractor fixed, and they are based on a strong robustness assumption. However, the robustness assumption may not hold. To address the problem, Shen et al. (2021) proposed a search algorithm that finds the layers suitable for finetuning.
The search algorithm is time-consuming and performs poorly on a single dataset because it does not address the distorted feature problem. Additionally, the related meta-learning FSL methods are left to the Appendix.

5. CONCLUSION

In this paper, we proposed LP-FT-FB, which finetunes the pre-trained feature extractor to give it strong transferring ability instead of relying on a strong robustness assumption. The transferring ability addresses the distorted feature problem caused by the OOD novel samples. However, the strong transferring ability comes at the cost of finetuning the over-parameterized feature extractor. The proposed LP-FT-FB is also time-consuming, just like the first-order gradient computation of Reptile Nichol et al. (2018). In future work, we will try to design a filter-wise or layer-wise finetuning method instead of the unit-wise one.



1. $\log(\det(F))$, where $F := -\mathrm{Hess}_\beta(L_{logistic}) = \mathbb{E}_y[\nabla_\beta L_{logistic} \cdot \nabla_\beta L_{logistic}^T]$.
2. Here, we used $C_{few}$-way-K-Shot to correspond with $C$ in Sec. 2.1, 2.2, and 2.3 instead of N-way-K-Shot.
3. https://github.com/nupurkmr9/S2M2_fewshot



Figure 1: The proposed LP-FT-FB. The flow is divided into Linear Probing and FineTuning by two gray dashed lines. (Left) At the linear probing stage, the parameters of the pre-trained feature extractor are frozen; at the finetuning stage, the parameters of the feature extractor are trainable. In the Linear Probing box, RI means that the linear classifier is randomly initialized. In the FineTuning box, LP means that the linear classifier has been fully re-trained on the novel samples. (Right) FBR is used to get an unbiased estimation for linear probing by encouraging the distribution of logits to be far away from the uniform distribution (U(a, b) in the figure). The proposed i-FBR addresses the over-fitting problem of the finetuned feature extractor by encouraging the distribution of logits to be close to the uniform distribution.

2.1. SET UP

Following FBR Ghaffari et al. (2022), we assume a multinomial logistic regression model for the classifier $v(\cdot, \beta)$. $\beta = \{\beta_c\}_{1 \leq c \leq C}$, and $\beta_c$ denotes the logistic regression weights for class $c$. $B(\cdot, \theta)$ denotes the feature extractor. The dataset $D = \{(x_i, y_i)\}_{1 \leq i \leq M}$ contains $C$ classes and $M$ samples in total.

1) Datasets. The experiments are evaluated on three typical FSL datasets, mini-Imagenet Vinyals et al. (2016), tiered-Imagenet Ren et al. (2018), and CUB Wah et al. (2011). mini-Imagenet consists of 100 classes from ImageNet, which are split randomly into 64 base, 16 validation, and 20 novel classes. Each class has 600 samples of size 84 × 84. tiered-Imagenet consists of 608 classes from ImageNet, which are split randomly into 351 base, 97 validation, and 160 novel classes. CUB contains 200 classes with a total of 11,788 images of size 84 × 84. The base, validation, and novel splits contain 100, 50, and 50 classes.

2) Backbones. To show the effectiveness of LP-FT-FB, we used several backbones. WideResNet-28-10 Zagoruyko & Komodakis (2016) with Dropout Hinton et al. (2012) is a typical backbone which is used in the SOTA methods, such as S2M2 Mangla et al. (2020), FBR Ghaffari et al. (2022), and DC Yang et al. (2021). Also, multiple scale backbones, including ResNet-18 He et al. (2016) and ResNet-34, are used. The ResNets are the same as in Mangla et al. (2020).

Figure 2: i-FBR factor λ inv tuning.

-learning-based Methods. In 2019, Chen et al. proposed Baseline++ Chen et al. (2019), which showed that a re-trained linear classifier could achieve performance comparable to the meta-learning-based methods when the feature extractor is pre-trained. The pre-trained feature extractors are easy to obtain without training from scratch (as is done in meta-learning-based methods). Following Baseline++, Tian et al. proposed RFS Tian et al. (2020) to show that learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, performs better than most meta-learning-based methods. Similarly, S2M2 Mangla et al. (2020) and SKD Rajasegaran et al. (2020) proposed that the feature extractor pre-trained with Rotation Augmentation, Manifold Mixup Verma et al. (2019), and Knowledge Distillation Hinton et al.

Table 2: The evaluation results for 5-way FSL tasks under different backbones.

As given in Table 2, for ResNet-18, LP-FT-FB outperforms S2M2_R by 2.14% and 2.48% for 1\5-shot tasks on mini-Imagenet. LP-FT-FB outperforms S2M2_R by 1.93% and 1.33% for 1\5-shot tasks on CUB. For ResNet-34, LP-FT-FB outperforms S2M2_R by 2.42% and 3.51% for 1\5-shot tasks on mini-Imagenet.

Table 3: The results of FSL tasks with DC data augmentation.

Table 4: The results of cross-domain FSL tasks. K & N denotes N-way-K-shot tasks.

FBR v.s. i-FBR. We give an ablation study to explore the effectiveness of each part. The study is evaluated on mini-Imagenet and tiered-Imagenet for 5-way-1\5-shot tasks. Table 5 shows the effectiveness of each part of LP-FT-FB.

Table 5: The ablation study on LP-FT v.s. FBR v.s. i-FBR.

Table 6: The ablation study on different smoothing regularizers. LP-FT-None denotes that no smoothing regularizer is used.

FSL methods are divided into meta-learning-based methods and finetuning-based methods. The finetuning-based methods are divided into Linear-Probing-based and finetuning-the-whole methods.

6. DECLARATIONS

This work was supported by the National Key R&D Program of China under Grant 2019YFF0302601, National Natural Science Foundation of China (No. 62071060), and the Beijing Key Laboratory of Work Safety and Intelligent Monitoring Foundation.

