REVISIT FINETUNING STRATEGY FOR FEW-SHOT LEARNING TO TRANSFER THE EMBEDDINGS
Anonymous authors
Paper under double-blind review

Abstract

Few-Shot Learning (FSL) aims to learn a simple and effective inductive bias from limited novel samples. Recently, many methods have focused on re-training a randomly initialized linear classifier to adapt it to the novel features extracted by a pre-trained feature extractor (called Linear-Probing-based methods). These methods typically assume the pre-trained feature extractor is robust enough that finetuning is unnecessary, and hence the pre-trained feature extractor is not adapted to the novel samples. However, the unadapted pre-trained feature extractor distorts the features of novel samples because the robustness assumption may not hold, especially on out-of-distribution samples. To extract undistorted features, we design Linear-Probing-Finetuning with Firth-Bias (LP-FT-FB), which yields an accurate bias on the limited samples for better finetuning of the pre-trained feature extractor, providing stronger transferability. In LP-FT-FB, we further propose inverse Firth Bias Reduction (i-FBR) to regularize the over-parameterized feature extractor, on which standard FBR does not work well. The proposed i-FBR effectively alleviates the over-fitting problem of the feature extractor during finetuning and helps extract undistorted novel features. To show the effectiveness of the designed LP-FT-FB, we conduct comprehensive experiments on commonly used FSL datasets under different backbones for in-domain and cross-domain FSL tasks. The experimental results show that the proposed LP-FT-FB outperforms the state-of-the-art FSL methods. The code is available at https://github.com/whzyf951620/LinearProbingFinetuningFirthBias.

1. INTRODUCTION

Few-shot Learning (FSL) has recently developed quickly in the limited-data regime. FSL aims to learn a suitable inductive bias from the given limited samples of novel classes. Initially, the whole model, consisting of the feature extractor and the classifier, is pre-trained on the samples of the base classes and then finetuned on the limited novel samples to obtain an inductive bias. The performance of the finetuned model drops significantly due to the over-fitting problem of the pre-trained model. To address the over-fitting problem, meta-learning-based methods such as Prototypical Networks Snell et al. (2017) were proposed, while Linear-Probing-based methods instead re-train only a randomly initialized linear classifier on top of the frozen pre-trained feature extractor, assuming the extractor is robust enough. However, the robustness assumption may not hold, especially on out-of-distribution (OOD) novel samples, because the existing regularization methods only provide effective robustness for in-distribution (ID) samples. This leads to "wrong" features of novel samples Kumar et al. (2022). This problem cannot be well addressed by only strengthening the robustness of the pre-trained feature extractor without transferring the extracted features. Linear-Probing-Finetuning (LP-FT) Kumar et al. (2022) theoretically proved that the existing finetuning strategies underperform due to distorted novel features: when fitting the novel samples with a randomly initialized linear classifier, the extracted ID features and OOD features change inconsistently. The inconsistency is caused by the unchanged features of the samples in the space orthogonal to the space spanned by the base samples, i.e., the out-of-distribution features remain unchanged (details in Section 2.2). Kumar et al. proposed first training a linear classifier well on the novel samples before finetuning to address this distortion problem. Although LP-FT is effective, it was designed for finetuning on large-scale out-of-distribution datasets, which contain enough samples to provide an unbiased estimation. In FSL tasks, the limited novel samples lead to a biased estimation of the updated parameters.
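The two-stage LP-FT recipe of Kumar et al. can be sketched in a few lines. The following is a minimal PyTorch-style illustration; the function name, optimizer choice, and learning rates are assumptions for exposition, not details taken from either paper:

```python
import torch
import torch.nn as nn

def linear_probe_then_finetune(backbone, head, loader, lp_epochs=1, ft_epochs=1,
                               lp_lr=1e-2, ft_lr=1e-4):
    """Sketch of LP-FT: (1) train only the linear head on frozen features,
    (2) unfreeze everything and finetune starting from the probed head."""
    ce = nn.CrossEntropyLoss()

    # Stage 1: linear probing -- backbone frozen, head trained from random init.
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(head.parameters(), lr=lp_lr)
    for _ in range(lp_epochs):
        for x, y in loader:
            loss = ce(head(backbone(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: finetuning -- whole network trainable, typically a smaller lr,
    # so the well-trained head no longer drags the features toward distortion.
    for p in backbone.parameters():
        p.requires_grad_(True)
    opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()),
                          lr=ft_lr)
    for _ in range(ft_epochs):
        for x, y in loader:
            loss = ce(head(backbone(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return backbone, head
```

The key design point is that the head entering stage 2 is already fitted to the novel classes, so the early finetuning gradients do not distort the backbone features the way a randomly initialized head would.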
Seeing this, we reduce the Firth bias of the model Ghaffari et al. (2022); Firth (1993) to obtain an unbiased estimation (called Firth Bias Reduction, FBR). However, FBR gives an unbiased estimation only for the linear classifier, not for the feature extractor. Experimental results also show that FBR decreases the performance of the finetuned feature extractor. A deeper analysis (Section 2.3) shows that when the scaling factor λ (see Eq. 2) is positive, FBR encourages the distribution of the linear classifier output to be far away from the uniform distribution (see Fig. 1); that is, FBR strengthens the influence of the novel samples for linear probing. This accelerates the convergence
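For a softmax classifier, the Firth-style regularizer used in few-shot work such as Ghaffari et al. (2022) reduces to the average negative log-probability over all classes, i.e., the cross-entropy between the predictions and a uniform target up to an additive constant. The sketch below is illustrative only; the sign convention attached to λ is the assumption here, and FBR versus i-FBR differ precisely in the direction of this regularization:

```python
import torch
import torch.nn.functional as F

def firth_regularized_loss(logits, targets, lam):
    """Cross-entropy plus a Firth-style term (illustrative sketch).

    `firth_term` is the mean negative log-probability over *all* classes,
    equivalent (up to a constant) to cross-entropy against a uniform target.
    One sign of `lam` pulls the softmax output toward the uniform
    distribution; the opposite sign pushes it away from uniform.
    """
    ce = F.cross_entropy(logits, targets)
    firth_term = -F.log_softmax(logits, dim=-1).mean()
    return ce + lam * firth_term
```

With all-zero logits over C classes, both the cross-entropy and the Firth term equal log C, which gives a quick sanity check of the implementation.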



Figure 1: The proposed LP-FT-FB. The flow is divided into the Linear Probing and FineTuning stages by the two gray dashed lines. (Left) At the linear probing stage, the parameters of the pre-trained feature extractor are frozen; at the finetuning stage, the parameters of the feature extractor are trainable. In the Linear Probing box, RI indicates that the linear classifier is randomly initialized. In the FineTuning box, LP indicates that the linear classifier has been fully re-trained on the novel samples. (Right) FBR is used to obtain an unbiased estimation for linear probing by encouraging the distribution of logits to be far away from the uniform distribution (U(a, b) in the figure). The proposed i-FBR addresses the over-fitting problem of the finetuned feature extractor by encouraging the distribution of logits to be close to the uniform distribution.
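The two regularization directions in the figure can be demonstrated numerically. In this toy experiment (not the paper's training loop), gradient descent is run on a Firth-style term alone, starting from fixed logits; one sign of λ drives the softmax output toward the uniform distribution and the other away from it, as measured by entropy. The function name and the sign convention are assumptions for this demo:

```python
import torch
import torch.nn.functional as F

def regularize_logits(lam, steps=100, lr=0.5):
    """Descend on the Firth-style term alone from fixed starting logits;
    return the entropy of the softmax output before and after."""
    logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)

    def entropy(z):
        p = F.softmax(z, dim=-1)
        return -(p * p.log()).sum().item()

    h_before = entropy(logits)
    for _ in range(steps):
        term = lam * (-F.log_softmax(logits, dim=-1).mean())
        grad, = torch.autograd.grad(term, logits)
        with torch.no_grad():
            logits -= lr * grad  # gradient of the term is lam * (p - 1/C)
    return h_before, entropy(logits)
```

With positive λ the entropy rises toward log C (logits approach uniform, the i-FBR direction in the figure); with negative λ the entropy falls (logits move away from uniform, the FBR direction described in the text, where the paper folds this sign into its own definition so that positive λ pushes away from uniform).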

Finetuning is a typical and simple technique in transfer learning on large-scale datasets to quickly transfer the features Kornblith et al. (2019); He et al. (2020); Chen et al. (2020). However, the existing FSL finetuning strategies cannot appropriately finetune the over-fitted feature extractor Shen et al. (2021). Consequently, designing an appropriate finetuning strategy for the fully-trained feature extractor on the limited samples is critical for transferring the features and hence making the finetuned feature extractor yield "good" novel features. To design an appropriate finetuning strategy, we start by analyzing the existing finetuning strategies Kumar et al. (2022); Kanavati & Tsuneki (2021); Levine et al. (2016); Radford et al. (2021), especially on the OOD samples Kumar et al. (2022), with a particular focus on Linear-Probing-Finetuning (LP-FT) Kumar et al. (2022).

