TOWARDS UNDERSTANDING THE CAUSE OF ERROR IN FEW-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Few-Shot Learning (FSL) is the challenging task of recognizing novel classes from scarce labeled samples. Much existing research focuses on learning good representations that generalize well to new categories. However, in the low-data regime, the factors that limit performance on novel classes have not been well studied. In this paper, our objective is to understand the cause of error in few-shot classification and to explore the upper limit of the error rate. We first introduce and derive a theoretical upper bound on the error rate that is determined by 1) linear separability in the learned embedding space and 2) the discrepancy between task-specific and task-independent classifiers. Quantitative experiments show that the error in FSL is dominantly caused by classifier discrepancy. We further propose a simple method to confirm our theoretical analysis and observation: it adds a constraint that reduces classifier discrepancy and thereby lowers the upper bound on the error rate. Experiments on three benchmarks with different base learners verify the effectiveness of our method, showing that decreasing classifier discrepancy consistently achieves improvements in most cases.

1. INTRODUCTION

Learning novel concepts from few samples is one of the most important abilities of the human cognitive system (Chen et al. (2018); Dhillon et al. (2019); Wang et al. (2020)). By contrast, many achievements of modern artificial intelligence systems depend on large amounts of data and annotation that are hard to acquire in many scenarios. Blocked by the difficulty of obtaining large labeled datasets, the community has shown growing interest in developing algorithms with high data efficiency. Few-shot learning is the problem of learning to generalize well to new categories with scarce labeled samples (Sung et al. (2018); Vinyals et al. (2016)). Existing methods address few-shot learning in the general framework of meta-learning, where a base learner is developed and optimized across different episodes (or tasks). Episodes are formed in an N-way K-shot fashion, where K support samples per class are available for training. The overall objective is to enable the base learner to exploit the base classes and transfer the learnt knowledge to recognize novel classes from few support samples. Since training and evaluation are performed on different tasks, the base learner holds different task-specific classifiers that depend on data sampling. In general, a classification model has two components: a feature extractor and a classifier (Simonyan & Zisserman (2015); He et al. (2016); Zagoruyko & Komodakis (2016)). Most few-shot learning approaches work from the corresponding perspectives: learning a good embedding or finding the right base learner. Rethinking-FSC (Tian et al. (2020)) demonstrates that a good learned embedding space can be more effective than many sophisticated meta-learning algorithms, attributing performance on the meta set to embeddings learnt in a supervised or self-supervised way. Goldblum et al. (2020) reveal the importance of feature clustering in few-shot learning.
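The N-way K-shot episode construction described above can be sketched as follows. This is a minimal illustration, not the paper's code; the dataset layout and function names are assumptions.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    """Sample one N-way K-shot episode from a {label: [samples]} dataset.

    Returns a support set (K labeled samples per class), used to fit the
    task-specific classifier, and a query set used to evaluate it.
    """
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        picks = random.sample(dataset[label], k_shot + q_queries)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 30 samples each (ids stand in for images).
data = {c: list(range(30)) for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, q_queries=15)
```

Because the support set is redrawn for every episode, the fitted task-specific classifier varies from task to task, which is exactly the sampling dependence discussed below.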
Since classifier performance is sample-dependent, especially in the one-shot scenario, the variance of features is expected to be small in order to retain good performance; in other words, classifier performance is not stable across different tasks. MetaOptNet (Lee et al. (2019)) and R2-D2 (Bertinetto et al. (2018)) explore training and optimization routines for linear classifiers, achieving good few-shot performance with a simple base learner. These works develop specific algorithms from the perspective of either learning a good representation or optimizing the base learner. Most recent methods use a linear classifier as the base learner, so we also consider linear models in this paper. To the best of our knowledge, there has been little research on how the two components (i.e., the feature representation and the classifier) respectively influence performance on novel classes in FSL. In this paper, we introduce an upper bound on the error rate in few-shot learning, indicating that the error comes from two sources: 1) linear separability in the embedding space and 2) the discrepancy between task-specific and task-independent classifiers. The ideal classifier is viewed as task-independent since its performance is not sample-dependent (Goldblum et al. (2020)). To quantitatively estimate each term, we perform an experiment in which the error rate of supervised classification tasks on novel classes measures feature separability, and the disagreement between the predictions of different classifiers measures discrepancy. This leads to an interesting observation: features learned through simple methods are sufficiently discriminative, and the error mainly comes from classifier discrepancy. Based on this observation and our theoretical analysis, we propose a simple method that reduces classifier discrepancy so as to boost few-shot performance. Experiments on three benchmarks empirically verify our theory.
Results on different datasets with various base learners show consistent improvements, supporting our finding and theory in few-shot learning. The main contributions of this paper are:
1. The upper bound on the error rate for novel classes is theoretically analyzed. From the derived equations we show that the error in FSL is caused by linear separability in the feature space and by the discrepancy between task-specific and task-independent classifiers.
2. Quantitative experiments verify the theoretical analysis. Results show that the error is dominantly caused by classifier discrepancy.
3. Based on the theoretical analysis and the experimental results, a constraint is proposed that reduces classifier discrepancy so as to decrease the upper bound on the error rate in FSL.
4. Further experiments on mini-ImageNet, tiered-ImageNet and CIFAR-FS confirm the effectiveness of the proposed method, showing that decreasing classifier discrepancy consistently achieves improvements in most cases.
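The two quantities above can be estimated in the way the text describes: feature separability as the error of a classifier trained on all labeled novel data (a proxy for the task-independent classifier), and discrepancy as the fraction of points on which that classifier and a support-set-only, task-specific classifier disagree. A minimal sketch on synthetic features (all data and choices here are illustrative assumptions, not the paper's protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "novel class" features: 5 classes as Gaussian clusters in 2-D.
centers = rng.normal(scale=5.0, size=(5, 2))
X = np.concatenate([c + rng.normal(size=(100, 2)) for c in centers])
y = np.repeat(np.arange(5), 100)

# Task-independent classifier: trained on all labeled novel data.
ind = LogisticRegression(max_iter=1000).fit(X, y)

# Task-specific classifier: trained on a 1-shot support set only.
support_idx = np.array([np.flatnonzero(y == c)[0] for c in range(5)])
spec = LogisticRegression(max_iter=1000).fit(X[support_idx], y[support_idx])

# Separability ~ error of the task-independent linear classifier.
separability_err = 1.0 - ind.score(X, y)
# Discrepancy ~ disagreement between the two classifiers' predictions.
discrepancy = float(np.mean(ind.predict(X) != spec.predict(X)))
```

With well-separated clusters, `separability_err` stays small while `discrepancy` can remain substantial, mirroring the observation that classifier discrepancy dominates the error.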

2. RELATED WORK

Algorithms of Few-Shot Learning: Prototypical Network (Snell et al. (2017)) is a classical algorithm, valued for its simplicity and effectiveness, which performs few-shot classification by nearest-prototype matching. Since a class prototype is the mean of its features, linear separability in the feature space has a direct impact on classification results. The performance of the subsequent series of prototype-based methods (Allen et al. (2019); Liu et al. (2020)) is likewise limited by feature separability. Unlike these methods, which use nearest-neighbor classifiers, Bertinetto et al. (2018) adopt ridge regression and logistic regression as base learners. Similarly, Lee et al. (2019) use the classical linear classifier SVM in few-shot learning to learn representations. Simple linear classifiers show competitive performance, and in this paper we use linear classifiers to measure linear separability and classifier discrepancy.

Theoretical Analysis of Few-Shot Learning: Cao et al. (2019) introduce a bound on the accuracy of Prototypical Network (Snell et al. (2017)), demonstrating that the intrinsic dimension of the embedding function's output space varies with the number of shots. They further propose a method to overcome the negative impact of mismatched shots between the meta-train and meta-test stages. Liu et al. (2020) give a lower bound on the accuracy of the cosine-similarity-based prototypical network, theoretically formulating two key factors: intra-class bias and cross-class bias. We also analyze theoretical bounds in few-shot learning; however, the theory in this paper does not focus on a specific algorithm such as Prototypical Network but on general scenarios, from the perspective of feature separability and classifier discrepancy.

Theoretical Analysis of Domain Adaptation: Methods of domain adaptation (Ben-David et al. (2007; 2010); Ganin & Lempitsky (2015)) address the problem of training a classifier on a source domain while guaranteeing that it performs well on a target domain. A classifier's target error is bounded by its source error and the divergence between the two domains (Ben-David et al. (2010)), where H-divergence and HΔH-divergence measure the discrepancy between domains. The HΔH-divergence can be computed from finite unlabeled data, allowing us to directly estimate the


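The nearest-prototype matching used by Prototypical Network, discussed above, can be sketched in a few lines. This is an illustrative reimplementation of the matching rule only (no learned embedding); the function name and toy data are assumptions.

```python
import numpy as np

def prototype_predict(support_x, support_y, query_x):
    """Nearest-prototype matching: each class prototype is the mean of its
    support embeddings, and a query is assigned to the class of the closest
    prototype under Euclidean distance."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    # Pairwise distances, shape (n_query, n_class).
    d = np.linalg.norm(query_x[:, None, :] - protos[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]

# Toy 2-way 2-shot task in a 2-D embedding space.
sx = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
sy = np.array([0, 0, 1, 1])
qx = np.array([[0.1, 0.1], [5.1, 4.9]])
# prototype_predict(sx, sy, qx) → array([0, 1])
```

Because the prototypes are simple support-set means, poorly separated features directly translate into misclassified queries, which is why feature separability bounds these methods.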