TOWARDS UNDERSTANDING THE CAUSE OF ERROR IN FEW-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Few-Shot Learning (FSL) is the challenging task of recognizing novel classes from scarce labeled samples. Much existing research focuses on learning good representations that generalize well to new categories. However, in the low-data regime, the factors that restrict performance on novel classes have not been well studied. In this paper, our objective is to understand the cause of error in few-shot classification and to explore the upper limit of the error rate. We first introduce and derive a theoretical upper bound on the error rate, which is determined by 1) linear separability in the learned embedding space and 2) the discrepancy between task-specific and task-independent classifiers. A quantitative experiment is conducted, and the results show that the error in FSL is dominantly caused by classifier discrepancy. We further propose a simple method to confirm our theoretical analysis and observation: a constraint that reduces classifier discrepancy and thereby lowers the upper bound on the error rate. Experiments on three benchmarks with different base learners verify the effectiveness of our method, showing that decreasing classifier discrepancy consistently achieves improvements in most cases.

1. INTRODUCTION

Learning novel concepts from few samples is one of the most important abilities of the human cognitive system (Chen et al. (2018); Dhillon et al. (2019); Wang et al. (2020)). By contrast, most achievements of modern artificial intelligence systems depend on large amounts of data and annotation, which are hard to acquire in many scenarios. Blocked by the difficulty of obtaining large labeled datasets, the community has shown growing interest in developing algorithms with high data efficiency. Few-shot learning is the task of learning to generalize well to new categories with scarce labeled samples (Sung et al. (2018); Vinyals et al. (2016)). Existing methods address few-shot learning in the general framework of meta-learning, where a base learner is developed and optimized across different episodes (or tasks). Episodes are formed in an N-way K-shot fashion where K support samples per class are available for training. The overall objective is to enable the base learner to exploit the base classes and transfer the learnt knowledge to recognize novel classes with few support data. Since training and evaluation are performed on different tasks, the base learner holds different task-specific classifiers that depend on data sampling. In general, a classification model has two components: a feature extractor and a classifier (Simonyan & Zisserman (2015); He et al. (2016); Zagoruyko & Komodakis (2016)). Most approaches to few-shot learning work from the corresponding perspectives: learning a good embedding and finding the right base learner. Rethinking-FSC (Tian et al. (2020)) demonstrates that a well-learned embedding space can be more effective than many sophisticated meta-learning algorithms, attributing performance on the meta set to embeddings learnt in a supervised or self-supervised way. Goldblum et al. (2020) reveal the importance of feature clustering in few-shot learning.
Since classifier performance is sample-dependent, especially in the one-shot scenario, the variance of features is expected to be small so as to retain good performance; classifier performance is not stable across different tasks. MetaOptNet (Lee et al. (2019)) and R2-D2 (Bertinetto et al. (2018)) explore training and optimization routines for linear classifiers, enabling good few-shot performance through simple base learners. These works develop specific algorithms from the perspectives of learning a good representation or optimizing the base learner. Most recent methods use a linear classifier as the base learner, so we also consider linear models in this paper. To the best of our knowledge, there has been little research focusing on how the two components (i.e., feature representation and classifier) respectively influence the performance on novel classes in FSL. In this paper, we introduce an upper bound on the error rate in few-shot learning, indicating that the error comes from two aspects: 1) linear separability in the embedding space and 2) classifier discrepancy between task-specific and task-independent classifiers. The ideal classifier is viewed as task-independent since its performance is not sample-dependent (Goldblum et al. (2020)). To quantitatively estimate each term, we perform an experiment in which the error rate of supervised classification tasks on novel classes measures feature separability, and the disagreement of results obtained from different classifiers measures discrepancy. This leads to an interesting observation: features learned through simple methods are sufficiently discriminative, and the error mainly comes from classifier discrepancy. Based on our observation and theoretical analysis, we propose a simple method for reducing classifier discrepancy so as to boost few-shot performance. Experiments on three benchmarks are conducted to empirically verify our theory.
Results on different datasets with various base learners show consistent improvements, supporting our finding and theory in few-shot learning. The main contributions of this paper are:

1. The upper bound of the error rate on novel classes is theoretically analyzed. From the derived equations we show that the error in FSL is caused by linear separability in the feature space and the discrepancy between task-specific and task-independent classifiers.

2. Quantitative experiments are conducted to verify the theoretical analysis. Results show that the error is dominantly caused by classifier discrepancy.

3. Based on the theoretical analysis and the experimental results, a constraint is proposed to reduce classifier discrepancy so as to decrease the upper bound of the error rate in FSL.

4. Further experiments on mini-ImageNet, tiered-ImageNet and CIFAR-FS confirm the effectiveness of the proposed method, showing that decreasing classifier discrepancy consistently achieves improvements in most cases.

2. RELATED WORK

Algorithms of Few-Shot Learning Prototypical Network (Snell et al. (2017)) is a classical algorithm, valued for its simplicity and effectiveness, which performs few-shot classification by nearest-prototype matching. Since the class prototype is the mean of features, the linear separability of the feature space has a direct impact on classification results. The performance of the subsequent series of prototype-based methods (Allen et al. (2019); Liu et al. (2020)) is also limited by feature separability. Different from these methods using nearest-neighbor classifiers, Bertinetto et al. (2018) adopt ridge regression and logistic regression as base learners. Similarly, Lee et al. (2019) use the classical linear SVM classifier to learn representations. Another line of work demonstrates that the intrinsic dimension of the embedding function's output space varies with the number of shots, and proposes a method to overcome the negative impact of mismatched shots between the meta-train and meta-test stages. Liu et al. (2020) give a lower bound on the accuracy of a cosine-similarity-based prototypical network, theoretically formulating two key factors: intra-class bias and cross-class bias. We also analyze theoretical bounds in few-shot learning, but the theory in this paper does not focus on a specific algorithm such as Prototypical Network; it addresses general scenarios from the perspective of feature separability and classifier discrepancy.

Theoretical Analysis of Domain Adaptation Methods of domain adaptation (Ben-David et al. (2007; 2010); Ganin & Lempitsky (2015)) solve the problem of how to train a classifier on a source domain while guaranteeing that the classifier performs well on a target domain. In (Ben-David et al. (2010)), a classifier's target error is bounded by its source error and the divergence between the two domains. They utilize the H-divergence and H∆H-divergence to measure the discrepancy between two domains. The H∆H-divergence can be computed from finite unlabeled data, allowing direct estimation of the error of a source-trained classifier on the target domain.
Inspired by their work, we also use the H∆H-divergence to measure the discrepancy between sets of novel classes and base classes.
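In practice, the H-divergence between two feature sets is often approximated by the "proxy A-distance": train a small domain discriminator to tell the two sets apart and convert its error into a divergence estimate, following Ben-David et al. The sketch below is illustrative, not code from this paper; the discriminator, its hyperparameters, and the function name are assumptions.

```python
import numpy as np

def proxy_a_distance(feats_base, feats_novel, epochs=200, lr=0.1):
    """Rough proxy for the H-divergence between two feature sets.

    Trains a tiny logistic-regression domain discriminator to tell
    base-class features (label 0) from novel-class features (label 1),
    then returns 2 * (1 - 2 * error), as in Ben-David et al.
    """
    X = np.vstack([feats_base, feats_novel])
    y = np.concatenate([np.zeros(len(feats_base)), np.ones(len(feats_novel))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
        grad_w = X.T @ (p - y) / len(y)          # logistic-loss gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # final predictions
    err = np.mean((p > 0.5).astype(float) != y)  # discriminator error
    return 2.0 * (1.0 - 2.0 * err)
```

If the two feature sets are indistinguishable, the discriminator error approaches 0.5 and the estimate approaches 0; for well-separated sets it approaches 2.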

3.1. PROBLEM SETUP

The common setup of few-shot learning used in this paper is described below. A space of classes is divided into two parts: base classes $C_{base}$ and novel classes $C_{novel}$, where $C_{base} \cap C_{novel} = \emptyset$. The dataset $D_{base}$ of base classes is used for model training, and the model is evaluated on the dataset $D_{novel}$, whose samples belong to classes unseen during training. The model is composed of a feature extractor $F$ and a classifier $h$. In few-shot learning, we usually consider $N$-way $K$-shot $Q$-query tasks $T$. In a task $\tau_i = (D^s_i, D^q_i, h)$, the support set $D^s_i$ includes $K$ data points $x \in \mathbb{R}^d$ per class together with their true labels $y \in \{c_1, ..., c_N\}$. The goal is to predict labels for the query data in $D^q_i$ given $D^s_i$. In this paper, we use the error rate on novel classes to evaluate the few-shot performance of a trained model. The error rate is formulated as:

$$\epsilon_{novel} = \mathbb{E}[\epsilon_\tau] = \frac{1}{M \times Q} \sum_i^M \sum_j^Q \mathbb{1}\big(h(F(x_{i,j})) \neq y_{i,j}\big) \qquad (1)$$

where $M$ is the number of sampled tasks $\tau_i \sim T_{novel}$ and $\mathbb{1}(\cdot)$ is the indicator function.
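As a minimal numerical sketch (function and variable names are illustrative, not from the paper), the error rate in Eqn. 1 is simply the average disagreement between predicted and true query labels across the sampled episodes:

```python
import numpy as np

def few_shot_error_rate(model_preds, true_labels):
    """Error rate over M sampled episodes (Eqn. 1).

    model_preds, true_labels: arrays of shape (M, Q) holding the
    predicted and true labels of the Q query samples in each of the
    M tasks sampled from T_novel.
    """
    model_preds = np.asarray(model_preds)
    true_labels = np.asarray(true_labels)
    M, Q = model_preds.shape
    # indicator 1(h(F(x)) != y), averaged over all M * Q query samples
    return np.sum(model_preds != true_labels) / (M * Q)
```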

3.2. DISTRIBUTION DIVERGENCE

We adopt the following concepts to explore the cause of error in few-shot scenarios.

Definition 1 Given a set $D = \{(x_1, y_1), ..., (x_m, y_m)\}$ where $x_i \in X$ and $y_i \in Y$, for any two hypotheses $h_1, h_2$ on $X$, disagreement is defined in Eqn. 2 to measure the difference between the two mappings:

$$dis(h_1, h_2) = P_{x \sim D_X}\big(h_1(x) \neq h_2(x)\big) \qquad (2)$$

Definition 2 Given a domain $X$ with probability distributions $D_1$ and $D_2$ over $X$, let $H$ be a hypothesis class on $X$ and denote by $I(h)$ the set for which $h \in H$ is the characteristic function; that is, $x \in I(h) \Leftrightarrow h(x) = 1$. The H-divergence between $D_1$ and $D_2$ is

$$d_H(D_1, D_2) = 2 \sup_{h \in H} \big| Pr_{D_1}[I(h)] - Pr_{D_2}[I(h)] \big| \qquad (3)$$

Definition 3 For hypotheses $h, h' \in H$, the symmetric difference hypothesis space $H\Delta H$ is the set of hypotheses $g$ such that $g \in H\Delta H \Leftrightarrow g(x) = h(x) \oplus h'(x)$, where $\oplus$ is the XOR function. The H∆H-divergence over distributions is defined as follows:

$$d_{H\Delta H}(D_1, D_2) = 2 \sup_{h, h' \in H} \big| Pr_{x \sim D_1}[h(x) \neq h'(x)] - Pr_{x \sim D_2}[h(x) \neq h'(x)] \big| \qquad (4)$$

To measure linear separability and classifier discrepancy on novel classes, we use:

$$\epsilon(h^*) = \frac{1}{N \times Q^*} \sum_i^{N \times Q^*} \mathbb{1}(\hat{y}_i \neq y_i) \qquad (5)$$

$$dis(h, h^*) = \frac{1}{N \times Q} \sum_i^{N \times Q} \mathbb{1}(\hat{y}_i \neq \hat{y}^*_i) \qquad (6)$$

where $Q$ is the number of query samples. $h$ is the task-specific classifier that differs among tasks, being decided by the sampled support data; $h^*$ is task-independent with respect to these $N$ classes. Thus, $dis(h, h^*)$ indicates the discrepancy between the task-specific classifiers and the ideal classifier. Table 1 shows results on two benchmarks: mini-ImageNet and tiered-ImageNet. From Table 1, we can see that $\epsilon_{novel}(h^*)$ is generally lower than $dis(h, h^*)$ by a large margin. For example, 1-shot $dis(h, h^*)$ on mini-ImageNet with RR is up to 38.34% while $\epsilon_{novel}(h^*)$ is 6.72%, about five times lower. Furthermore, the obvious drop of $dis(h, h^*)$ from 1-shot to 5-shot indicates that classifier discrepancy shrinks markedly as more support samples become available. An interesting conclusion can be drawn from this experiment: in low-data regimes, the error on novel classes is dominantly caused by classifier discrepancy rather than by linear separability.
More details about this experiment are presented in the appendix.
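The two measurements used in this experiment (Eqn. 5 and Eqn. 6) can be sketched in a few lines; the function names are illustrative:

```python
import numpy as np

def separability_error(ideal_preds, true_labels):
    """Eqn. 5: error of the task-independent classifier h*,
    used as a proxy for linear separability of the features."""
    ideal_preds = np.asarray(ideal_preds)
    true_labels = np.asarray(true_labels)
    return np.mean(ideal_preds != true_labels)

def classifier_discrepancy(task_preds, ideal_preds):
    """Eqn. 6: disagreement between the task-specific classifier h
    (fit on the sampled support set) and the ideal classifier h*."""
    task_preds = np.asarray(task_preds)
    ideal_preds = np.asarray(ideal_preds)
    return np.mean(task_preds != ideal_preds)
```

Both quantities are averages of indicator functions, so they lie in [0, 1] and are directly comparable, which is what allows the experiment to attribute the error to one term or the other.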

4.2. UPPER BOUND OF ERROR RATE

In this section, we introduce an upper bound on the error rate on novel classes in few-shot learning.

Proposition 1 Consider a feature extractor $F$ and a hypothesis space $H$. Based on the triangle inequality, for $h, h^* \in H$, it follows that:

$$\epsilon(h; F) \leq \epsilon(h^*; F) + dis(h, h^*; F) \qquad (7)$$

where $h^*$ is the ideal hypothesis in $H$, i.e., $h^* = \arg\min_{h \in H} \mathbb{E}[\epsilon_\tau(h; F)]$. The proof is in the appendix. In few-shot learning, the error rate on novel classes is usually denoted by $\epsilon_{novel} = \mathbb{E}[\epsilon_\tau(h; F)]$. Hence, the upper bound is:

$$\epsilon_{novel} \leq \mathbb{E}[\epsilon_\tau(h^*; F)] + \mathbb{E}[dis_\tau(h, h^*; F)] \qquad (8)$$

where $\tau \sim T_{novel}$. From Eqn. 8, few-shot performance can be boosted by minimizing the two terms on the right side of the inequality. However, $h^*$ is unavailable when testing on novel classes. In order to connect the performance on the novel set to that on the base set, we consider the following setting. Consider $N$-way $K$-shot tasks where $N$ is assumed to be the same in the meta-train and meta-test stages. $h$ and $h'$ are linear classifiers of the novel classes and base classes respectively. For classification weights $W_b, W_n \in \mathbb{R}^{N \times d}$ of base and novel classes, there exists a linear transformation matrix $W \in \mathbb{R}^{d \times d}$ such that $W_b = W_n W$. We define the linear transformation between the ideal hypothesis on the novel set and that on the base set as $\Lambda$, with $h^{*\prime} = \Lambda(h^*)$ and $h^* = \Lambda^{-1}(h^{*\prime})$. For query samples $X = \{x_i \in \mathbb{R}^d\}$, the predicted results are:

$$h(X; W_n) = X W_n^T = X (W_b W^{-1})^T \qquad (9)$$

From the above analysis we know that performing a transformation on the classifier is equivalent to performing the transformation on the data: $\Lambda(h)(X) = h(\Lambda(X))$.

Lemma 1 Let $H$ be a hypothesis space of VC dimension $d$. Let $h^*$ and $h^{*\prime}$ be the ideal hypotheses on $D_{novel}$ and $D_{base}$, and let $\hat{h} = \arg\min_{h \in H} \epsilon_{novel}(h) + \epsilon_{base}(\Lambda^{-1}(h))$. Then for $h^*$ and $h^{*\prime}$:

$$\epsilon_{novel}(h^*) \leq \epsilon_{base}(h^{*\prime}) + \frac{1}{2} d_{H\Delta H}(D_{novel}, D_{base}) + \lambda \qquad (10)$$

where $\lambda = \epsilon_{novel}(\hat{h}) + \epsilon_{base}(\Lambda^{-1}(\hat{h}))$ is the combined error of the ideal hypothesis $\hat{h}$.
Lemma 2 For linear hypotheses $h, h' \in H$ and ideal hypotheses $h^*$ on $D_{novel}$ and $h^{*\prime}$ on $D_{base}$, we have:

$$dis(h, h^*; D_{novel}) \leq dis\big(h, \Lambda(h^*); D_{base}\big) + \frac{1}{2} d_{H\Delta H}\big(D_{novel}, \Lambda(D_{base})\big)$$
$$\leq dis(h', h^{*\prime}; D_{base}) + dis\big(h', \Lambda(h); D_{base}\big) + \frac{1}{2} d_{H\Delta H}\big(D_{novel}, \Lambda(D_{base})\big) \qquad (11)$$

Proofs of Lemma 1 and Lemma 2 are provided in the appendix. We obtain the core theorem of this paper by plugging Lemma 1 and Lemma 2 into Eqn. 8.
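The triangle-inequality step behind Proposition 1, on which both lemmas also rely, can be written out as follows (a standard argument, sketched here rather than quoted from the appendix):

```latex
% The event {h(x) != y} is contained in {h*(x) != y} ∪ {h(x) != h*(x)},
% so taking probabilities over the query distribution gives Eqn. 7:
\begin{aligned}
\epsilon(h; F)
  &= \Pr_{(x,y)}\big[\,h(F(x)) \neq y\,\big] \\
  &\leq \Pr_{(x,y)}\big[\,h^{*}(F(x)) \neq y\,\big]
      + \Pr_{x}\big[\,h(F(x)) \neq h^{*}(F(x))\,\big] \\
  &= \epsilon(h^{*}; F) + dis(h, h^{*}; F).
\end{aligned}
```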

Theorem 1

The upper bound of the error rate on novel classes in few-shot learning is:

$$\epsilon_{novel} \leq \epsilon_{base}(h^{*\prime}) + dis(h', h^{*\prime}; D_{base}) + dis\big(\Lambda^{-1}(h), h'; D_{base}\big) + \lambda + \frac{1}{2} d_{H\Delta H}(D_{novel}, D_{base}) + \frac{1}{2} d_{H\Delta H}\big(D_{novel}, \Lambda(D_{base})\big) \qquad (12)$$

Based on the theoretical analysis and experiments, we come to several conclusions:

1. In theory, the error rate of few-shot classification is influenced by the linear separability of the feature representation and the classifier discrepancy between task-specific and task-independent classifiers. Experimental results indicate that the main cause of error in few-shot learning is classifier discrepancy.

2. From Theorem 1, we can see that the upper bound of the error rate on novel classes is positively related to 1) linear separability on $D_{base}$, 2) classifier discrepancy on $D_{base}$, 3) the combined error $\lambda$, and 4) the H∆H-divergences between $D_{novel}$ and $D_{base}$, which measure the discrepancy between the two domains.

4.3. REDUCING CLASSIFIER DISCREPANCY FOR FEW-SHOT LEARNING

Based on our theoretical analysis, we propose a simple method to reduce the upper bound of the error rate, boosting few-shot performance by reducing classifier discrepancy.

Measuring Classifier Discrepancy It is proved in Sec. 4.2 that reducing the error on novel classes can be achieved by improving linear separability and reducing classifier discrepancy. Furthermore, our experiment reveals that the main cause of error is the discrepancy between the task-specific classifier and the ideal classifier. For these reasons, we aim to reduce the upper bound by decreasing classifier discrepancy. The discrepancy defined in Eqn. 6 is non-differentiable, so we propose two measurements of classifier discrepancy to ease gradient propagation in the training stage. Since we consider linear classifiers in this paper, an intuitive way to reduce classifier discrepancy is to reduce the distance between classification weights. The squared Euclidean distance can be used as a measurement:

$$dis_{MSE}(h, h^*; W, W^*) = \mathbb{E}\big[\|W - W^*\|_2^2\big] \qquad (13)$$

where $W$ is the weight of the task-specific classifier $h$ and $W^*$ is the weight of the task-independent classifier $h^*$. $dis_{MSE}$ measures the variance of classification weights. Since the task-specific classifier is decided by data sampling, taking the data distribution into consideration, we suggest calculating the difference of the logits predicted by the various classifiers. We use the commonly adopted KL divergence to measure the difference of logits:

$$dis_{KL}(h, h^*) = \mathbb{E}\big[KLD\big(h(D^q; D^s),\; h^*(D^q)\big)\big] \qquad (14)$$

Training Policy The training process of the proposed method, Reducing Classifier Discrepancy (RCD), consists of two phases. In the first phase, the model is trained in the conventional supervised way on the base classes. The loss function in phase 1 is:

$$L_{sup} = L_{ce}\big(h(F(x)), y\big), \quad (x, y) \sim D_{base} \qquad (15)$$

where $L_{ce}$ is the standard cross-entropy loss. The classifier $h^* = \arg\min_h L_{sup}$ obtained in this stage is treated as the ideal, task-independent classifier on $D_{base}$. With $h^*$ fixed, we train the model on meta tasks $T_{base}$ with the loss in Eqn. 16:

$$L_{meta} = L_{ce}\big(h(F(x)), y\big) + \beta\, dis(h, h^*), \quad (x, y) \sim T_{base} \qquad (16)$$

The second training phase aims to reduce classifier discrepancy on the base classes. Consequently, the upper bound of $\epsilon_{novel}$ can be decreased, as shown in Theorem 1. Our method is flexible: training in the first phase is free of carefully designed base learners, and the policy in the second phase generalizes to different base learners. The algorithm is shown in the appendix.

Base Learner To illustrate the effectiveness of our proposed method, we use three base learners: PN, Ridge Regression (RR) and Logistic Regression (LR) (Bertinetto et al. (2018)). 1) PN is derived from (Snell et al. (2017)), which finds the nearest prototype based on cosine similarity. The prototype is computed from the support samples:

5. EXPERIMENTS

The prototype of class $c$ is $P_c = norm\big(\frac{1}{K} \sum_i^K F(x_i)\big)$, and the predicted labels of query samples are given by $\hat{Y} = \arg\max_c Cos(P_c, X)$. 2) RR: the classification weight is estimated in closed form as $W = (X^T X + \gamma I)^{-1} X^T Y$, where $Y$ holds the one-hot labels of the support samples and $I$ is the identity matrix; query samples are predicted by $\hat{Y} = X \cdot W$. 3) LR: the classification weight in logistic regression is $W = \arg\min_W L_{ce}(D^s, W)$; query samples are predicted by $\hat{Y} = X \cdot W$. Detailed descriptions are in the appendix.
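A rough, self-contained sketch of two of these base learners and of the $dis_{MSE}$ constraint (Eqn. 13) used in phase 2 is given below. All names are illustrative assumptions; the paper's actual implementation details are in its appendix.

```python
import numpy as np

def ridge_weights(X_support, Y_onehot, gamma=0.1):
    """RR base learner: closed-form W = (X^T X + gamma*I)^{-1} X^T Y."""
    d = X_support.shape[1]
    return np.linalg.solve(X_support.T @ X_support + gamma * np.eye(d),
                           X_support.T @ Y_onehot)

def prototype_weights(X_support, labels, n_way):
    """PN base learner: L2-normalized class means of the support features,
    so that cosine matching reduces to a dot product with the prototypes."""
    P = np.stack([X_support[labels == c].mean(axis=0) for c in range(n_way)])
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def dis_mse(W_task, W_ideal):
    """Eqn. 13: squared Euclidean distance between the task-specific
    weights (fit on a sampled support set) and the fixed ideal weights."""
    return np.sum((W_task - W_ideal) ** 2)
```

In phase 2, `dis_mse(W_task, W_ideal)` would be scaled by β and added to the cross-entropy loss of Eqn. 16; since the ridge weights are a differentiable function of the support features, the penalty back-propagates into the feature extractor.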

5.2. RESULTS OF RCD

We evaluate the proposed Reducing Classifier Discrepancy (RCD) on three benchmarks, with three base learners and two backbones. Table 2 and Table 3 summarize few-shot results with dis_MSE and dis_KL as the auxiliary loss respectively. Overall, RCD achieves consistent improvements in most cases.

Auxiliary Loss As shown in Eqn. 16, the two measurements of classifier discrepancy can be added as auxiliary losses in the meta-train stage. We compare the results without and with auxiliary constraints in Table 2 and Table 3. The second column of each of the three base learners shows few-shot accuracies obtained by reducing classifier discrepancy constrained by dis_MSE or dis_KL. Generally, dis_MSE is a useful constraint, and the changes in Table 4 are consistent with the accuracy increments in Table 2 and Table 3, further supporting the rationality of our proposed theory.

5.4. VISUALIZATION

t-SNE visualization (Maaten & Hinton (2008)) is provided in Fig. 1 to give an intuitive understanding of our method. In Fig. 1, the figures in the first column display the distribution of features trained in the conventional way. The figures in the latter columns visualize features trained with the proposed constraints dis_MSE and dis_KL. We can see that features within the same class cluster more tightly and the boundaries among different classes become clearer. Our method enables greater separability in the feature space.

6. CONCLUSION

In this paper, we theoretically analyze the upper bound of the error rate on novel classes in few-shot learning. We derive that the upper bound is decided by feature separability and classifier discrepancy. Furthermore, our observations show that classification error is mainly caused by classifier discrepancy in few-shot scenarios. Based on this observation and theory, we propose a simple method to lower the upper bound of classification error by reducing classifier discrepancy. Two differentiable discrepancy measurements are proposed as auxiliary constraints in our method RCD, which is applicable to different base learners. To verify the feasibility and generalization of RCD, comprehensive experiments are conducted on three few-shot benchmarks with three base learners. The experimental results strongly support that RCD is effective in reducing classifier discrepancy and consequently lowers the upper bound of the error rate in few-shot learning.



IMPLEMENTATION DETAILS

Dataset and Backbone We conduct experiments on three benchmarks: mini-ImageNet (Vinyals et al. (2016)), tiered-ImageNet (Ren et al. (2018)) and CIFAR-FS (Bertinetto et al. (2018)). ResNet-12 (Lee et al. (2019)) and ConvNet-64 (Snell et al. (2017)) are employed as backbones in this paper. Details about the data settings and architectures are shown in the appendix.

Lee et al. (2019) use the classical linear SVM classifier in few-shot learning to learn representations. Simple linear classifiers show competitive performance, and in this paper we use linear classifiers to measure linear separability and classifier discrepancy.

Table 1: Experiment on linear separability and classifier discrepancy for 5-way (N = 5) classification tasks on novel classes. Q* is the number of all samples of each class c ∈ {c_1, ..., c_N}. ŷ_i is the predicted label and y_i is the true label. h* is trained in a supervised way on a large set of samples and is used to approximate the expected ideal N-way classifier. The disagreement defined in Eqn. 6 is used to measure classifier discrepancy.

Table 2: Performance of RCD. 5-way classification accuracies (%) without/with the dis_MSE constraint. Results in bold indicate that performance is improved by reducing classifier discrepancy.

Table 3: Performance of RCD. 5-way classification accuracies (%) without/with the dis_KL constraint. Results in bold indicate that performance is improved by reducing classifier discrepancy.

Table 4: Changes of classifier discrepancy on novel classes. The Stage 1 columns show discrepancy on novel classes after the first, conventional training stage. The Stage 2 columns show discrepancy on novel classes after the second training stage with the auxiliary constraint dis_MSE or dis_KL. Differences are highlighted in green and red respectively: green/red means reduced/enlarged discrepancy. Backbone in this experiment: ResNet-12.

dis_MSE improves accuracy in most cases, with an exception on CIFAR-FS with RR. By contrast, dis_KL shows superiority in reducing classifier discrepancy: by adding the dis_KL constraint, accuracy is raised in all cases, by up to 3.45% on 1-shot CIFAR-FS with PN. Under the same conditions, reducing classifier discrepancy brings larger improvements in 1-shot scenarios. For example, on mini-ImageNet with PN, reducing classifier discrepancy by diminishing the KL divergence achieves improvements of 2.91% in 1-shot and 2.07% in 5-shot. These results are consistent with our theory: in 1-shot settings, discrepancy is larger due to data scarcity, so RCD yields a more obvious increase on 1-shot tasks.

Base Learner We argue for the flexibility and generalization of the proposed RCD. For verification, we adopt three commonly used linear classifiers as base learners in the meta-train stage. Results in Table 2 and Table 3 demonstrate the feasibility of our method in improving few-shot performance regardless of the specific classifier, indicating that reducing classifier discrepancy on the base set is effective in lowering the upper bound of ε_novel.

Figure 1: t-SNE visualization of features of novel classes. First column: feature space learnt with conventional training. Second and third columns: feature space learnt with dis_MSE and dis_KL. For each dataset, 5 novel classes are randomly sampled. Backbone: ResNet-12. Base learner: PN. Best viewed in color.

5.3. CLASSIFIER DISCREPANCY

We compare the changes of discrepancy in Table 4 to clearly illustrate that the proposed method is effective in reducing classifier discrepancy. Changes are denoted by colors. It can be clearly seen that after the second training stage with the constraints dis_MSE and dis_KL, classifier discrepancy is reduced in nearly all settings, especially on mini-ImageNet and CIFAR-FS. The downward trend of classifier discrepancy is positively correlated with the decrease of the upper bound of the error rate.

