CLASS IMBALANCE IN FEW-SHOT LEARNING

Abstract

Few-shot learning aims to train models on a limited number of labeled samples from a support set in order to generalize to unseen samples from a query set. In the standard setup, the support set contains an equal number of data points for each class. This assumption overlooks many practical considerations arising from the dynamic nature of the real world, such as class imbalance. In this paper, we present a detailed study of few-shot class imbalance along three axes: dataset vs. support set imbalance, effect of different imbalance distributions (linear, step, random), and effect of rebalancing techniques. We extensively compare over 10 state-of-the-art few-shot learning methods using backbones of different depths on multiple datasets. Our analysis reveals that 1) compared to the balanced task, performance on the class-imbalanced counterparts always drops, by up to 18.0% for optimization-based methods, although feature-transfer and metric-based methods generally suffer less; 2) strategies used to mitigate imbalance in supervised learning can be adapted to the few-shot case, resulting in better performance; and 3) the effects of imbalance at the dataset level are less significant than the effects at the support set level. The code to reproduce the experiments is released under an open-source license.

1. INTRODUCTION

Deep learning methods are well known for their state-of-the-art performances on a variety of tasks (LeCun et al., 2015; Russakovsky et al., 2015; Schmidhuber, 2015). However, they often need to be trained on large labeled datasets to acquire robust and generalizable features. Few-Shot Learning (FSL) (Chen et al., 2019; Wang et al., 2019b; Bendre et al., 2020) aims at reducing this burden by defining a distribution over tasks, with each task containing a few labeled data points (support set) and a set of target data (query set) belonging to the same set of classes. A common way to train FSL methods is through episodic meta-training (Vinyals et al., 2017), with the model repeatedly exposed to batches of tasks sampled from a task-distribution and then tested on a different but similar distribution in the meta-testing phase. The prefix "meta" is commonly used to distinguish the high-level training and evaluation routines of meta-learning (outer loop) from the training and evaluation routines at the single-task level (inner loop). Limitations. Standard meta-training overlooks many challenges stemming from real-world dynamics, such as class-imbalance (CI). The standard setting assumes that all classes in the support set contain the same number of data points, whereas in many practical applications, the number of samples for each class may vary (Buda et al., 2018; Leevy et al., 2018). Given the limited amount of data used in FSL, a small difference in the number of samples between classes can already introduce significant levels of imbalance. Most FSL methods are not designed to cope with these more challenging settings. Figure 1 exemplifies these considerations by showing that several state-of-the-art FSL methods underperform when tested under three CI regimes (linear, step, random). Previous work.
Previous work mainly focuses on the single imbalance case or groups several settings into one task, offering limited insights into the effects of CI on FSL and making it challenging to quantify its effects (Guan et al., 2020; Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020). A common approach to mitigate imbalance is Random-Shot meta-training (Triantafillou et al., 2020), which exposes the model to imbalanced tasks during meta-training. However, previous work provides little insight into the effectiveness of this procedure on the imbalanced FSL evaluation task. Furthermore, minimal work exists that investigates meta-training outcomes under an imbalanced distribution of classes at the (meta-)dataset level, while this case is common in recent FSL applications (Ochal et al., 2020; Guan et al., 2020) and meta-learning benchmarks (Triantafillou et al., 2020). The CI problem is well known within the supervised learning community, which has systematically produced strategies to deal with it, such as the popular Random Over-Sampling (Japkowicz & Stephen, 2002), which aims at rebalancing minority classes by uniform sampling. While such strategies have been extensively studied on many supervised learning problems, there is little understanding of how they behave with the recently proposed FSL methods in the low-data regime. Our work and main contributions. In this paper, we provide, for the first time, a detailed analysis of the CI problem within the FSL framework. Our results show that even small CI levels can introduce a significant performance drop for all the methods considered. Moreover, we find that only a few models benefit from Random-Shot meta-training (Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020) over the classical (balanced) episodic meta-training (Vinyals et al., 2017), while pairing the meta-training procedures with Random Over-Sampling offers a substantial advantage.
The experimental results show that imbalance severity at the dataset level depends on the size of the dataset. Our contributions can be summarized as follows:

1. A systematic, comprehensive and in-depth study of the effects of CI within the FSL framework along three axes: (i) dataset vs. support set imbalance, (ii) effect of different imbalance distributions (linear, step, random), and (iii) effect of rebalancing techniques, such as random over-sampling and the recently proposed Random-Shot meta-training (Triantafillou et al., 2020).

2. We reveal novel insights into the meta-learning and support set adaptation capabilities under the CI regime, supported by extensive results on over 10 FSL methods with different imbalance settings, backbones, support set sizes, and datasets.

3. We provide insight into the previously unaddressed CI problem in the (meta-)training dataset, showing that the effects of imbalance at the dataset level are less significant than the effects at the support set level.

2. RELATED WORK

2.1 CLASS IMBALANCE

In classification, imbalance occurs when at least one class (the majority class) contains a higher number of samples than the others. The classes with the lowest number of samples are called minority classes. If uncorrected, conventional supervised loss functions, such as (multi-class) cross-entropy, skew the learning process in favor of the majority class, introducing bias and poor generalization toward the minority class samples (Buda et al., 2018; Leevy et al., 2018). Imbalance approaches are categorized into three groups: data-level, algorithm-level, and hybrid. Data-level strategies manipulate and create new data points to equalize data sampling. Popular data-level methods include Random Over-Sampling (ROS) and Random Under-Sampling (RUS) (Japkowicz & Stephen, 2002). ROS randomly resamples data points from the minority classes, while RUS discards a randomly selected portion of the majority classes to decrease imbalance levels. Algorithm-level strategies adjust the learning process itself, through regularization or modified loss/cost functions. A weighted loss is a common approach, where each sample's loss is weighted by the inverse frequency of that sample's class. Focal loss (Lin et al., 2017) is another type of cost function that has seen wide success. Hybrid methods combine one or more types of strategies (e.g. Two-Phase Training, Havaei et al. (2017)). Modeling Imbalance. The object recognition community studies class imbalance using real-world datasets or distributions that approximate real-world imbalance (Buda et al., 2018; Johnson & Khoshgoftaar, 2019; Liu et al., 2019). Buda et al. (2018) note that two distributions can be used: linear and step imbalance (defined in our methodology, Section 3). At large scale, datasets with many samples and classes tend to follow a long-tail distribution (Liu et al., 2019; Salakhutdinov et al., 2011; Reed, 2001), with most of the classes occurring with small frequency and a few classes occurring with high frequency.
Our work primarily focuses on the tail-end of the distribution and does not consider the case of large sample size. Therefore, we do not examine the long-tail mechanisms.
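As a concrete illustration of the algorithm-level strategies above, the sketch below (our own illustrative code operating on predicted class probabilities, not taken from any cited implementation) computes an inverse-frequency weighted cross-entropy and the focal loss of Lin et al. (2017):

```python
import numpy as np

def weighted_ce(probs, targets, n_classes):
    """Cross-entropy with each sample's loss weighted by the inverse
    frequency of its class in the batch (a common 'weighted loss')."""
    counts = np.bincount(targets, minlength=n_classes).astype(float)
    weights = counts.sum() / np.maximum(counts, 1.0)  # inverse frequency
    p_true = probs[np.arange(len(targets)), targets]  # prob. of true class
    return float(np.mean(weights[targets] * -np.log(p_true)))

def focal_loss(probs, targets, gamma=2.0):
    """Focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    gamma = 0 recovers the standard (unweighted) cross-entropy."""
    p_true = probs[np.arange(len(targets)), targets]
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))
```

Well-classified samples (p_t close to 1) contribute almost nothing to the focal loss, so training focuses on hard examples, which often belong to the minority classes.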

2.2. FEW-SHOT LEARNING

FSL methods can be broadly categorized into metric-learning, optimization-based, hallucination, data-adaptation, and probabilistic approaches (Chen et al., 2019). Metric-learning approaches, such as Prototypical Networks (Snell et al., 2017), Relation Networks (Sung et al., 2017), the Neural Statistician (Edwards & Storkey, 2017), and Matching Networks (Vinyals et al., 2017), learn a feature extractor capable of parameterizing images into embeddings, and then use distance metrics to classify mapped query samples based on their distance to support points. Optimization-based approaches, such as MAML (Finn et al., 2017) and Meta-Learner LSTM (Ravi & Larochelle, 2016), are meta-trained to use guided optimization steps on the support set for quick adaptation. Hallucination or data augmentation techniques perform affine and color transformations on the support set to create additional data points (Zhang et al., 2018). Probabilistic methods use Bayesian inference to learn and classify samples, for example, the recently proposed Deep Kernel Transfer (DKT) (Patacchiola et al., 2020), which uses Gaussian Processes at inference time. We use the term domain adaptation to represent those approaches using standard transfer-learning, with a pre-training stage on a large set of classes and a fine-tuning stage on the support set; examples are Baseline and Baseline++ from Chen et al. (2019), and the recently proposed Transductive Fine-Tuning from Dhillon et al. (2020). The details of the methods used in our experiments are reported in Appendix A. For completeness, it is worth mentioning Incremental Few-Shot Learning (Ren et al., 2018; Gidaris & Komodakis, 2018; Hariharan & Girshick, 2017), which is an extension of FSL. It considers maintaining performance on base classes (the meta-training dataset) while incrementally learning about novel classes using limited data, typically without re-training from scratch on all data.
Here, we focus on studying how imbalance affects the learning of novel classes only; therefore, we will not consider incremental FSL further.

2.3. IMBALANCE IN FEW-SHOT AND META LEARNING

Class Imbalance in the low-data regime has received some attention, although the current work is not comprehensive (Guan et al., 2020; Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020). We identify that in FSL, class-imbalance occurs at two levels: the task level and the meta-dataset level. At the task level, class-imbalance occurs in the support set or the query set, directly affecting the learning and evaluation procedures. Class-imbalance at the meta-dataset level is caused by imbalanced dataset classes in one (or more) of the three data splits: meta-training, meta-validation, and meta-testing. This disproportion affects the distribution of tasks that a model is exposed to during meta-training, affecting its ability to generalize to new tasks. In Figure 2, we highlight the differences between an imbalanced task and an imbalanced meta-dataset. Related to, but distinct from, these two class-imbalance types is task-distribution imbalance (Lee et al., 2019); a skewed task-distribution can occur as a result of meta-dataset-level class-imbalance or as a result of the task-sampling procedure. In extreme cases, task-distribution imbalance can lead to out-of-distribution tasks during meta-evaluation. Task-distribution imbalance has already received some attention (Lee et al., 2019; Cao et al., 2020); therefore, it will not be considered in this work. Class Imbalance in Tasks. In Lee et al. (2019), imbalanced tasks are considered, but the analysis is limited to just two methods (their proposal and MAML). In Guan et al. (2020), meta-learning is applied to aerial imagery, exploring step imbalance ranging from 5 to 140 samples per class (shot); however, only two FSL methods are compared (Prototypical Networks and their RF-MML method). Previous work provides limited insight into class-imbalance at the task level.
Class Imbalance in the Meta-Dataset. Standard meta-datasets (e.g. Mini-ImageNet) can be swapped for other domain-specific datasets, such as CUB (Wah et al., 2011), VGG Flowers (Nilsback & Zisserman, 2008), and others (Triantafillou et al., 2020). These datasets sometimes contain an unequal number of samples per class, but previous work has never reported the effects of class-imbalance in the meta-training dataset (Guan et al., 2020; Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020). We emphasize that studying the impact of imbalance at this level is important, since imbalanced domain-specific meta-datasets are common in real-world applications (Guan et al., 2020; Ochal et al., 2020) and recent benchmarks (Triantafillou et al., 2020). Our work is the first to provide quantitative insights into this setting.

3. METHODOLOGY

3.1 STANDARD FSL

A standard K-shot N-way FSL classification task is defined by a small support set, S = {(x_1, y_1), ..., (x_s, y_s)} ~ D, containing N × K image-label pairs drawn from N unique classes with K samples per class (|S| = K × N). The goal is to correctly predict labels for a query set, Q = {(x_1, y_1), ..., (x_t, y_t)} ~ D, containing a different set of M samples drawn from the same N classes (i.e. Q^(x) ∩ S^(x) = ∅ and Q^(y) ≡ S^(y), where the superscripts (x) and (y) denote the sets of images and labels, respectively). The support set is also referred to as the sample set, and the query set as the target set.
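Concretely, the standard episode construction above can be sketched as follows (an illustrative snippet of ours, with the dataset represented as a class-to-samples mapping; the names are not from the released code):

```python
import random

def sample_task(dataset, n_way=5, k_shot=5, m_query=16):
    """Sample a K-shot N-way task: a balanced support set S and a
    disjoint query set Q drawn from the same N classes."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # draw K + M distinct samples so that S and Q share no images
        drawn = random.sample(dataset[label], k_shot + m_query)
        support += [(x, label) for x in drawn[:k_shot]]
        query += [(x, label) for x in drawn[k_shot:]]
    return support, query
```

The per-class draw of K + M distinct samples enforces the constraint Q^(x) ∩ S^(x) = ∅ while keeping the label sets identical.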

3.2. CLASS-IMBALANCED FSL

We define a class-imbalanced FSL task as a K_min-K_max-shot N-way I-distribution task. Similarly to the standard FSL task, a model is given a small support set, S ~ D, and a query set, Q ~ D, containing a different set of samples drawn from the same N classes. However, in the imbalanced case, the support set contains between K_min and K_max (inclusive) samples per class, distributed according to the imbalance I-distribution, where I ∈ {linear, step, random} (Buda et al., 2018). Similarly, the query set can contain M_min to M_max samples per class distributed according to the I-distribution. In our experiments, we keep a balanced query set (M = M_min = M_max) for fair evaluation. For brevity, but without loss of generality, we define the imbalance I-distribution in relation to the support set (see Figure 2) as:

• Linear imbalance. The number of class samples, K_i, for classes i ∈ {1..N}, is defined by: K_i = round(K_min − c + (i − 1) × (K_max + 2c − K_min)/(N − 1)), where c = 0.499 for rounding purposes. For example, for a linear 1-9-shot 5-way task, K_i ∈ {1, 3, 5, 7, 9}, and for a linear 4-6-shot 5-way task, K_i ∈ {4, 4, 5, 6, 6}.

• Step imbalance. The number of class samples, K_i, is determined by an additional variable N_min specifying the number of minority classes. Specifically, for classes i ∈ {1..N}: K_i = K_min if i ≤ N_min, and K_i = K_max otherwise. For example, in a step 1-9-shot 5-way task with 1 minority class, K_i ∈ {1, 9, 9, 9, 9}.

• Random imbalance. The number of class samples, K_i, is sampled from a uniform distribution, i.e. K_i ~ Unif(K_min, K_max), with K_min and K_max inclusive. This is appropriate for the problem at hand (small number of classes), but it could be replaced by a Zipf/Power Law (Reed, 2001) for a more appropriate imbalance in problems with a large number of classes.
We also report the imbalance ratio ρ, a scalar identifying the level of class-imbalance, which is often reported in the CI literature for the supervised case (Buda et al., 2018). We define ρ as the ratio between the number of samples in the majority and minority classes in the support set: ρ = K_max / K_min.
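The three distributions above can be generated in a few lines. The sketch below is our own illustration of the definitions in this section (function and argument names are ours, not the authors'):

```python
import random

def shots_per_class(k_min, k_max, n_way, dist="linear", n_min=1, c=0.499):
    """Return the number of support samples K_i for each of the N classes
    under linear, step, or random imbalance (Section 3.2)."""
    if dist == "linear":
        step = (k_max + 2 * c - k_min) / (n_way - 1)
        return [round(k_min - c + i * step) for i in range(n_way)]
    if dist == "step":
        # n_min minority classes with k_min shots, the rest with k_max
        return [k_min if i < n_min else k_max for i in range(n_way)]
    if dist == "random":
        return [random.randint(k_min, k_max) for _ in range(n_way)]
    raise ValueError(f"unknown imbalance distribution: {dist}")
```

For example, `shots_per_class(1, 9, 5, "linear")` reproduces {1, 3, 5, 7, 9}, and the imbalance ratio follows as `max(ks) / min(ks)`.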

3.3. CLASS-IMBALANCED META-DATASET

Training FSL methods involves three phases: meta-training, meta-validation, and meta-testing. Each phase samples tasks from a different dataset, D_train, D_val, and D_test, respectively. A balanced dataset contains D_N^* classes with D_K^* samples per class, where * ∈ {train, val, test}. However, in the real world, datasets can contain any number of samples per class, with imbalance. For fair evaluation, we control dataset imbalance according to the I-distribution described in Section 3.2, but with K_min, K_max, N, and N_min exchanged for their dataset-level counterparts (e.g. D_Kmin). Similarly, we report the imbalance ratio ρ. In our experiments, we apply imbalance only at the meta-training stage to limit the factors of interest, but a similar procedure could be used at the meta-testing and meta-validation stages.

3.4. REBALANCING TECHNIQUES AND STRATEGIES

Random Over-Sampling. We apply Random Over-Sampling (ROS): for each class-imbalanced task, we match the number of support samples in the non-majority classes to the number of support samples in the majority class, K_i = max_i(K_i). This means that for I ∈ {linear, step}, the number of samples in each class is equal to K_max. We match K_i to max_i(K_i) by resampling uniformly at random the remaining max_i(K_i) − K_i support samples belonging to class i, and then appending them to the support set. When applying ROS with augmentation (ROS+), we perform further data transformations on the resampled supports. A visual representation of a class-imbalanced task after applying ROS and ROS+ is presented in Appendix A (Figure 7). Random-Shot Meta-Training. We apply Random-Shot meta-training similarly to standard episodic (meta-)training (Vinyals et al., 2017), but with the balanced tasks exchanged for K_min-K_max-shot random-distribution tasks, as defined above. We use the random distribution following previous work (Triantafillou et al., 2020; Lee et al., 2019), since in real-world applications the actual imbalance distribution is likely to be unknown at (meta-)evaluation time. Rebalancing Loss Functions. We apply two rebalancing loss functions: Weighted Loss (Buda et al., 2018) and Focal Loss (Lin et al., 2017). Both are applied to the inner loop of optimization-based methods. Full details are reported in the supplementary material (Appendix A).
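The ROS step described above amounts to a few lines of code. The following is a minimal sketch of ours (not the released implementation); the `augment` hook corresponds to the ROS+ variant:

```python
import random

def random_oversample(support, augment=None):
    """Rebalance an imbalanced support set by resampling minority-class
    samples uniformly at random until every class has as many samples as
    the majority class. If `augment` is given (ROS+), resampled copies
    are additionally transformed."""
    by_class = {}
    for x, y in support:
        by_class.setdefault(y, []).append(x)
    k_max = max(len(xs) for xs in by_class.values())
    rebalanced = list(support)
    for y, xs in by_class.items():
        for _ in range(k_max - len(xs)):
            x = random.choice(xs)  # uniform resampling within class y
            rebalanced.append((augment(x) if augment else x, y))
    return rebalanced
```

After this call, every class holds max_i(K_i) samples, so any downstream method sees a balanced support set without modification.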

4. EXPERIMENTS

4.1 SETUP

Class Imbalance Scenarios and Tasks. We address two class-imbalance scenarios within the FSL framework: 1) imbalanced support set, and 2) imbalanced meta-training dataset. For the imbalanced support set scenario, we first focus on the very low-data range with an average support set size of 25 samples (5 avr. shots). We train FSL models using Standard (episodic) meta-training (Vinyals et al., 2017) with 5-shot 5-way tasks, as well as Random-Shot meta-training (Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020) with 1-9-shot 5-way random-distribution tasks (as described in Section 3.2). We pre-train baselines (i.e., Fine-Tune, 1-NN, Baseline++) using mini-batch gradient descent, and then fine-tune on the support set or perform 1-NN classification. We evaluate all baselines and models using a wide range of imbalanced meta-testing tasks. In contrast to previous work, we evaluate models using two additional imbalance distributions, linear and step; this allows us to control the imbalance level deterministically and provide insights from multiple angles. For the imbalanced meta-dataset scenario, we vary the class distributions of the meta-training datasets. We isolate this level of imbalance by meta-training and meta-evaluating on balanced FSL tasks. All main experiments are repeated three times with different initialization seeds. Each data point represents the average performance over 600 meta-testing tasks per run. Additional details. We adapted a range of 11 unique baselines and FSL methods: Fine-Tune baseline (Pan & Yang, 2010), 1-NN baseline, Baseline++ (Chen et al., 2019), SimpleShot (Wang et al., 2019a), Prototypical Networks (Snell et al., 2017), Matching Networks (Vinyals et al., 2017), Relation Networks (Sung et al., 2017), MAML (Finn et al., 2017), ProtoMAML (Triantafillou et al., 2020), DKT (Patacchiola et al., 2020), and Bayesian MAML (BMAML) (Yoon et al., 2018).
Implementation details of these algorithms are supplied in Appendix A. We used a 4-layer convolutional network as the backbone for each model, following common practice (Chen et al., 2019). We train and evaluate all methods on Mini-ImageNet (Ravi & Larochelle, 2016; Vinyals et al., 2017), containing 64 classes with 600 image samples each. In the imbalanced meta-dataset setting, we halve the Mini-ImageNet dataset to contain 300 samples per class on average, and control imbalance as described in Section 3.3. For full implementation details, see Appendix A.

4.2. CLASS IMBALANCED SUPPORT SET

Effect of Class Imbalance with Standard Meta-Training. Figure 1 highlights the crux of the class-imbalance problem at the support set level. Specifically, the figure shows standard meta-trained FSL models (Vinyals et al., 2017) and pre-trained baselines, evaluated on the balanced 5-shot 5-way task and three imbalanced tasks. We observe that introducing even a small level of imbalance (linear 4-6-shot 5-way, ρ = 1.5) produces a significant performance difference for 6 out of 13 algorithms, compared with the balanced 5-shot task. The average accuracy drop is -1.5% for metric-based models and -8.2% for optimization-based models. On tasks with a larger imbalance (1-9-shot random, ρ = 9.0), the performance drops by an average of -8.4% for metric-based models and -18.0% for optimization-based models compared to the balanced task. Interestingly, despite the additional 12 samples in the support set in 1-9-shot step tasks with 1 minority class (ρ = 9.0), the performance drops by -5.0% relative to the balanced task with 25 support samples in total.

[Figure 3: standard episodic training (Vinyals et al., 2017) vs. random-shot episodic training (Triantafillou et al., 2020). We explore pairing methods with Random Over-Sampling without (ROS) and with augmentation (ROS+).]

Standard vs. Random-Shot Meta-Training. In Figure 3, we show the accuracy for increasing imbalance levels (ρ) using evaluation tasks with a linear distribution and a fixed support set size (average K_i ≈ 5) for a fair comparison. Comparing Standard and Random-Shot meta-training (solid black and solid red lines) reveals that only a few methods benefit from Random-Shot meta-training. On the balanced 5-shot task, we observe a -6.0% decrease in accuracy caused by Random-Shot over Standard meta-training. On the imbalanced 1-9-shot random task, Random-Shot offers a limited improvement over Standard, with a significant increase in performance for only 3 out of 10 models. Those improvements include +2.5% for Relation Net and +6.6% for BMAML.
These results suggest that exposing FSL methods to imbalanced tasks during meta-training does not automatically lead to improved performance at meta-test time. Interestingly, in an extreme imbalance case (1-21-shot step with 4 minority classes, Appendix E), only ProtoNet and RelationNet obtained a significantly higher performance with Random-Shot (+18% compared to Standard). This suggests that the advantage may emerge from coupling Random-Shot with the prototype calculation mechanism unique to those methods. The results also suggest that some models have a natural robustness to imbalance: Relation Net, MatchingNet, and DKT only drop slightly compared to other methods. Random-Shot with Random Over-Sampling. In Figure 3, we observe that the performance of optimization-based methods such as MAML and BMAML significantly improves by applying random over-sampling with augmentation (ROS+) and without augmentation (ROS). In the largest imbalance case in the graph (ρ = 9), we observe that models using ROS+ at inference (dotted and yellow lines) improve over Standard (solid black) by +6.7%; in particular, optimization-based methods improve by +12.2%, fine-tune baselines by +7.4%, and metric-based methods by +2.8%. In the imbalanced task, the least affected model is MatchingNet, dropping only -1.9% compared to the balanced task; we provide a list of the top-50 performing models in Table 3 (Appendix C.1). Standard (ROS+ at inference) achieves the highest average performance gains (+8.5%); tying for second best are Random-Shot with ROS (ROS+ at inference) at +6.9% and Random-Shot with ROS+ at +6.4%. We break down the results by type in Appendix C.2 (Figure 9). Imbalance with More Shots. We explored additional settings with a higher number of shots, see Figure 4. Specifically, we train models using Random-Shot meta-training with 1-29-shot and 1-49-shot random episodes. We then evaluate those models on imbalanced tasks with an average number of 15 shots and 25 shots, respectively.
The bottom row of Figure 4 shows the difference in performance between the imbalanced and balanced tasks. We observe that in the high-shot condition (right column), overall model performance increases while the models are less affected by imbalance; however, the gap with respect to the balanced condition remains significant. Models achieve 55-60% of their performance on the balanced task within the first 5 avr. shots; increasing the number of shots to 15 only boosts their performance by +7%. This may explain why imbalance has an inevitable impact on small classification tasks: the better performance achieved via a higher number of support samples in the majority classes does not offset the performance lost due to the lack of samples in the minority classes. In Appendix C.2, we break down the results for each model type. Backbones. In Figure 5, we report the combined average accuracy of all models and imbalance strategies against different backbones (Conv4, Conv6, ResNet10, ResNet34). Overall, deeper backbones seem to perform slightly better on the imbalanced tasks, suggesting a higher tolerance for imbalance. For instance, using Conv4 gave a -8.6% difference between the balanced and the 1-9-shot random task, while with ResNet10 the gap is smaller (-6.8%). The performance degradation observed with ResNet34 is similar to that reported by Chen et al. (2019), and is most likely caused by the intrinsic instability of meta-training routines on larger backbones. In Appendix C.3, we break down the results across different models and training strategies. Precision and Recall. Looking at the precision and recall tables in Appendix C.4 provides additional insights about each algorithm. For instance, DKT (Patacchiola et al., 2020) shows very strong performance in classes with a small number of shots and well-balanced performance for higher shots.
This may be due to the partitioned Bayesian one-vs-rest scheme used for classification by DKT, with a separate Gaussian Process for each class, which could be more robust to imbalance. BMAML, on the other hand, fails to correctly classify samples with K = 1 and K = 3 samples, showing that the method has a strong bias towards the majority classes.

4.3. CLASS IMBALANCED META-DATASET

In this setting, models are meta-trained with standard episodic training (Vinyals et al., 2017) on (balanced) 5-shot 5-way tasks. In this particular scenario, we use significantly higher imbalance levels (ρ = 19) compared to those in the previous section; despite this, we observe small, insignificant performance differences between the balanced and imbalanced conditions. In additional experiments, we further reduced the dataset size to contain a total of 4800 images and 32 randomly selected classes. In Figure 10 (Appendix C.5), we observe a more significant performance drop as we increase the number of minority classes. Meta-evaluating on CUB showed a similar trend, with an average drop of -1.6% in the most extreme imbalance setting: 30-510 step with 24 minority classes and ρ = 17.0 (Appendix C.5). When we break down the results by model in Appendix C.5, we observe that optimization-based approaches and fine-tune baselines have a slight advantage over the metric-based ones, most likely due to their ability to adapt during inference. Interestingly, in this setting RelationNet performs the worst, with a drop of -4.3% w.r.t. the balanced task in the most extreme setting (24 minority classes, Mini-ImageNet). Additional results. To evaluate performance under a strong dataset shift, we evaluated Mini-ImageNet-trained models on tasks sampled from CUB-200-2011. In Table 1 (right), we observe that models are not affected at all by the imbalanced setting, despite the harder scenario. In Appendix D, we provide additional experiments with BTAML (Lee et al., 2019), and an analysis of the correlation between meta-dataset size and performance.

5. DISCUSSION

FSL robustness to class imbalance. All examined FSL methods are susceptible to class imbalance, although some show more robustness (e.g., Matching Net, Relation Net, and DKT). Optimization-based methods and fine-tune baselines suffer more as they use conventional supervised loss functions in the inner loop, which are known to be particularly susceptible to imbalance (Buda et al., 2018; Johnson & Khoshgoftaar, 2019; Japkowicz & Stephen, 2002). Moreover, the problem of class imbalance persists as the backbone complexity and support set size increase. These results suggest that current solutions will offer sub-optimal performance in real-world few-shot problems. Effectiveness of Random-Shot meta-training. Our experiments test a simple solution to class imbalance that has been popular in the meta-learning community: Random-Shot meta-training (Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020). Contrary to popular belief, our findings reveal that this method is scarcely effective when applied by itself. Extensive analysis and validation performance through epochs (see Appendix B) suggest that these results are genuine and unlikely to be the result of inappropriate parameter tuning. This finding has an important consequence, suggesting that robustness to imbalance cannot be obtained by simple exposure to imbalanced tasks. Effectiveness of re-balancing procedures. The results suggest that a simple procedure, Random Over-Sampling (ROS), is quite effective in tackling class imbalance issues. Therefore, we encourage the community to include it in their evaluations, as ROS is simple to implement and can be applied to almost any algorithm. However, ROS does not provide any particular advantage to methods in the highest performance ranking levels, like MatchingNet and DKT. This could be due to diminishing returns and should be investigated on a case-by-case basis. Effect of imbalance at the meta-dataset level.
Our results suggest that imbalance in the meta-dataset has minimal effect on the meta-learning procedure. This could result from standard episodic (meta-)training, which samples classes with equal probability and causes natural re-sampling. Likely, datasets with lower intra-class variation and larger imbalance (Liu et al., 2019; Wang et al., 2017; Salakhutdinov et al., 2011) could produce more dramatic performance changes.

A IMPLEMENTATION DETAILS

Thus, the baselines were trained on the same number of training samples as the meta-learning methods, albeit with more classes and fewer samples per class. All models were evaluated on FSL tasks sampled from D_test. For the imbalanced meta-dataset experiments, we used two variants of Mini-ImageNet. In the first, referring to Table 13, we halved the average number of samples per class in the meta-training dataset D_train to allow us to introduce imbalance artificially into the dataset. In the second scenario, referring to Figure 10, we reduced the size of the meta-training dataset by a more considerable degree. Specifically, the meta-training dataset was controlled to contain a total of 4800 images distributed among 32 randomly selected classes from the meta-training split of Mini-ImageNet. For meta-learning methods, we kept the original 16 and 20 classes for meta-validation and meta-testing, with 600 samples each. To allow as fair a comparison as possible, we used the same meta-training datasets for the baselines and the meta-learning models. However, the baselines used a balanced validation set created from the leftover samples of the original meta-training dataset. For the imbalanced meta-dataset experiments, we also evaluated on tasks sampled from 50 randomly selected classes of CUB-200-2011 (Wah et al., 2011), following the same line of work as Chen et al. (2019).

A.2 TRAINING PROCEDURE

All methods follow a similar three-phase learning procedure: meta-training, meta-validation, and meta-testing. During meta-training, an FSL model was exposed to 100k tasks sampled from D train. After every 500 tasks, the model was validated on tasks from D val and the best performing model was updated. At the end of the meta-training phase, the best model was evaluated on tasks sampled from D test. The baselines (i.e., fine-tune, 1-NN, Baseline++) follow a similar three-phase procedure, but with the meta-training / meta-validation phases exchanged for conventional pre-training / validation on mini-batches (of size 128) sampled from D train and D val, as outlined above. In the tasks of all three meta-phases, we used 16 query samples per class, except for the 20-way Prototypical Network, where we used 5 query samples per class during meta-training to allow for a higher number of samples in the support set.

Meta-/Pre-Training Details. In the imbalanced support set setting, we meta-train FSL methods with standard episodic meta-training (Vinyals et al., 2017) using 5-shot 5-way tasks. We also explore random-shot episodic training (Lee et al., 2019) using 1-9shot 5-way random-distribution tasks (as described in Section 3). We meta-/pre-trained on 100k tasks/mini-batches, using a learning rate of 10^-3 for the first 50k episodes/mini-batches and 10^-4 for the second half. The baselines and SimpleShot are trained using 100k balanced mini-batches with a batch size of 128. All methods were meta-validated on 200 tasks/mini-batches every 500 meta-training tasks/mini-batches to select the best performing model.

Meta-Testing. The final test performances were measured on a random sample of 600 tasks. We report 95% confidence intervals in brackets/error bars. In the imbalanced support set experiments, we evaluate tasks with various imbalance levels and distributions, as specified in the figures and tables.
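The episodic task sampling described above can be sketched as follows. This is a minimal illustration assuming the dataset is a dict mapping each class to its list of samples; the function names are ours.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=16):
    """Sample a balanced N-way K-shot episode: support and query sets."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        samples = random.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in samples[:k_shot]]
        query += [(x, label) for x in samples[k_shot:]]
    return support, query

def sample_random_shot_episode(dataset, n_way=5, min_k=1, max_k=9, n_query=16):
    """Random-shot variant: each class draws its own shot count in [min_k, max_k],
    so the support set is imbalanced while the query set stays balanced."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        k = random.randint(min_k, max_k)
        samples = random.sample(dataset[cls], k + n_query)
        support += [(x, label) for x in samples[:k]]
        query += [(x, label) for x in samples[k:]]
    return support, query
```

Note that the query set is kept balanced in both samplers, following the convention used throughout the paper.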
In the imbalanced meta-dataset experiments, we evaluate using regular, balanced 5-shot 5-way tasks.

Data Augmentation. During the meta-/pre-training phases, we apply standard data augmentation techniques, following a similar setup to Chen et al. (2019): a random rotation of 10 degrees, scaling, and random color/contrast/brightness jitter. Meta-validation and meta-testing used no augmentation, except in the Random-Shot (ROS+) setting, where the same augmentations were applied to the oversampled support images. All images are resized to 84 by 84 pixels.

A.3 BACKBONE ARCHITECTURES

All methods shared the same backbone architecture. For the core contribution of our work, we used a Conv4 architecture consisting of 4 convolutional layers with 64 channels (padding 1), interleaved with batch normalization (Ioffe & Szegedy, 2015), a ReLU activation function, and max-pooling (kernel size 2, stride 2) (Chen et al., 2019). Relation Network used max-pooling only for the last 2 layers of the backbone to account for the Relation Module; the Relation Module consisted of two additional convolutional layers, each followed by batch norm, ReLU, and max-pooling. For experiments with different backbones, we used Conv6, ResNet10, and ResNet34 (Chen et al., 2019). Conv6 extends the Conv4 backbone to 6 convolutional layers, with max-pooling applied after each of the first 4 layers. ResNet models (He et al., 2016) followed the same setup as Chen et al. (2019). For the imbalanced meta-dataset and imbalanced reduced meta-dataset experiments, we used the Conv4 model with 32 channels instead of 64, due to the smaller amount of training data.
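The Conv4 backbone described above can be sketched in PyTorch as follows. This assumes 3x3 kernels (the standard choice in Chen et al. (2019); the kernel size is not stated explicitly above), so it is an illustrative reconstruction rather than the released implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # conv (padding 1) -> batch norm -> ReLU -> 2x2 max-pool (stride 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class Conv4(nn.Module):
    """Four conv blocks; channels=32 for the imbalanced meta-dataset setting."""
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(3, channels),
            *[conv_block(channels, channels) for _ in range(3)],
        )

    def forward(self, x):
        return self.encoder(x).flatten(start_dim=1)
```

For 84 by 84 inputs, each of the four pooling layers halves the spatial resolution (84 -> 42 -> 21 -> 10 -> 5), giving a 64 * 5 * 5 = 1600-dimensional feature vector.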

A.4 FSL METHODS AND BASELINES

In our experiments, we used a wide range of FSL methods (full details can be found in our source code): 1. Baseline (fine-tune) (Pan & Yang, 2010) represents a classical way of applying transfer learning, where a neural network is pre-trained on a large dataset and then fine-tuned on a smaller domain-specific dataset. The backbone was followed by a single linear classification layer with one output for each meta-training dataset class. The whole network was trained during pre-training. During meta-testing, the pre-trained linear layer was exchanged for a randomly initialized classification layer with outputs matching the task's number of classes (N-way). Fine-tuning was performed on the new randomly initialized classification layer using the support set S.

2. Baseline (1-NN) is another classical method of applying transfer learning, but using a k-nearest neighbor classifier instead of the classification layer during meta-validation. Pre-training was performed in the same way as for the fine-tune baseline. 7. DKT (Patacchiola et al., 2020) is a probabilistic approach that utilizes Gaussian Processes with a deep neural network as a kernel function. We used the Batch Norm Cosine distance for the kernel type. 8. SimpleShot (Wang et al., 2019a) augments the 1-NN baseline model by normalizing and centering the feature vectors using the dataset's mean feature vector. Query samples are assigned to the nearest prototype's class according to Euclidean distance. In contrast to the baseline models, pre-training is performed on the meta-training dataset like the other meta-learning algorithms, and meta-validation is used to select the best model based on performance on tasks sampled from D val. 9. MAML (Finn et al., 2017) is a meta-learning technique that learns a common initialization of weights that can be quickly adapted to a new task by fine-tuning on the support set. The task adaptation process uses standard gradient descent to minimize the cross-entropy loss on the support set. The original method uses second-order derivatives; however, for more efficient computation, we use first-order MAML, which has been shown to work just as well. We set the inner-loop learning rate to 0.1 with 10 iteration steps. We optimize the meta-learner on batches of 4 meta-training tasks. These hyperparameters were selected based on our hyperparameter fine-tuning. 10. ProtoMAML (Triantafillou et al., 2020) augments traditional first-order MAML by re-initializing the last classification layer between tasks. Specifically, the weights of each class's output are initialized with that class's prototype. This extra step combines the fine-tuning ability of MAML with the class regularisation ability of Prototypical Networks.
We set the inner-loop learning rate to 0.1 with 10 iterations. Unlike for MAML, we found that updating the meta-learner after a single meta-training task gave the best performance. 11. Bayesian-MAML (BMAML) (Yoon et al., 2018) augments MAML by replacing the inner loop's standard stochastic gradient descent with Bayesian gradient-based updates. BMAML uses Stein Variational Gradient Descent (SVGD), a non-parametric variational inference method that combines the strengths of Monte Carlo approximation and variational inference. The algorithm learns to approximate a posterior over the initialization parameters conditioned on the task's support set. Yoon et al. (2018) also add a chaser loss, which utilizes the samples in the query set during meta-training to approximate the true task posterior; minimizing the KL divergence between the true task posterior and the estimated task-parameter posterior can then drive the meta-training process. We set the inner-loop learning rate to 0.1 with 1 inner-loop step; we found that a higher number of inner-loop steps destabilized performance. We used 20 particles. The chaser loss variation used a learning rate of 0.5. Again, we found these combinations of hyperparameters to give the best results. 12. Bayesian-TAML (BTAML) (Lee et al., 2019) [left out of the main paper body] augments MAML with four main changes: 1) a task-dependent parameter initialization z, 2) task-dependent per-layer learning rate multipliers γ, 3) task-dependent per-class learning rate multipliers ω, and 4) a meta-learned per-parameter learning rate α (similar to Meta-SGD, Li et al. (2017)). However, our results for BTAML were unstable, suggesting a fault in our implementation. Unfortunately, we did not identify it in time for the submission, and we decided to move the method's results to the appendix.
In our experiments, we explored several variations of this method, with the various parameters (z, ω, γ, α) turned on and off, as well as different hyperparameters. We found that, in contrast to the other models, a meta-learning rate of 10^-4 performed better than 10^-3. We set the inner-loop learning rate to 0.1 with 10 iteration steps, and optimized the meta-learner on batches of 4 meta-training tasks. Again, we found this setup to work best in our experiments.
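The first-order inner loop shared by the optimization-based methods above (inner learning rate 0.1, 10 steps, cross-entropy on the support set) can be sketched as follows. For brevity, this sketch adapts only a linear head on top of fixed features, whereas MAML adapts all backbone parameters; the helper is ours and purely illustrative.

```python
import torch
import torch.nn.functional as F

def inner_adapt(features, labels, weight, bias, lr=0.1, steps=10):
    """First-order adaptation: plain gradient descent on the support set,
    detaching after each step so no second-order terms are retained."""
    w = weight.clone().detach().requires_grad_(True)
    b = bias.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(F.linear(features, w, b), labels)
        gw, gb = torch.autograd.grad(loss, (w, b))
        w = (w - lr * gw).detach().requires_grad_(True)
        b = (b - lr * gb).detach().requires_grad_(True)
    return w, b
```

For ProtoMAML, the only change to this sketch would be initializing each row of `weight` with the corresponding class prototype instead of random values before adaptation.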

A.5 CLASS IMBALANCE TECHNIQUES AND STRATEGIES

We pair FSL methods with popular data-level class-imbalance strategies: 1. Random Over-Sampling (ROS) (Japkowicz & Stephen, 2002), without and with data augmentation (ROS and ROS+). For the augmentations, we used random resized cropping between 0.15 and 1.1 of the image scale, random horizontal flips, and random color/brightness/contrast jitter. A visualization of ROS and ROS+ is presented in Figure 7. During meta-training, ROS+ augmentations were applied twice: once when sampling from the meta-training dataset, and a second time during the support set resampling. This may have slightly destabilized meta-training, which would explain why Random-Shot with ROS (ROS+ at inference only) sometimes achieved better performance than Random-Shot with ROS+ in Figures 3 and 4. 2. Random-Shot Meta-Training (Triantafillou et al., 2020; Lee et al., 2019; Chen et al., 2020) was applied as specified in the main body of the paper (Section 3.4). 3. Focal Loss (Lin et al., 2017). Focal Loss has been found to be very effective in combating the class-imbalance problem in one-stage object detectors. We exchanged the inner-loop cross-entropy loss of the optimization-based algorithms and fine-tune baselines for the focal loss with γ = 2 and α = 1. Results are presented in Figure 6 in the main paper body and in Appendix D.1. 4. Weighted Loss. A weighted loss is also commonly used to counteract the effects of class imbalance (Buda et al., 2018; Leevy et al., 2018). We weight the inner-loop cross-entropy loss of the optimization-based algorithms and fine-tune baselines by the inverse class frequency of the support set samples. Results are presented in Figure 6 in the main paper body and in Appendix D.1.
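Two of the strategies above are simple enough to sketch directly: ROS re-samples minority-class support samples (with replacement) until every class matches the majority class size, and the weighted loss scales the inner-loop cross-entropy by inverse class frequency. The helpers below are illustrative, not the released implementation.

```python
import random
from collections import Counter

import torch
import torch.nn.functional as F

def random_over_sample(support):
    """ROS: duplicate minority-class samples until all classes match
    the majority class size. `support` is a list of (sample, label) pairs."""
    by_class = {}
    for x, y in support:
        by_class.setdefault(y, []).append(x)
    k_max = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced += [(x, y) for x in xs]
        balanced += [(random.choice(xs), y) for _ in range(k_max - len(xs))]
    return balanced

def inverse_frequency_weights(labels, n_classes):
    """Per-class weights proportional to inverse support-set frequency."""
    counts = Counter(labels.tolist())
    return torch.tensor([1.0 / counts[c] for c in range(n_classes)])

# Weighted inner-loop cross-entropy on a linear 1-9shot 5-way support set
labels = torch.tensor([0] * 1 + [1] * 3 + [2] * 5 + [3] * 7 + [4] * 9)
w = inverse_frequency_weights(labels, n_classes=5)
loss = F.cross_entropy(torch.randn(25, 5), labels, weight=w)
```

For ROS+, the duplicated samples would additionally be passed through the augmentation pipeline instead of being reused verbatim.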

B VERIFICATION OF IMPLEMENTATION

We implement the FSL methods in PyTorch, adapting the implementation of Chen et al. (2019) but also borrowing from other implementations available online (see the individual method files in the source code for attribution). We have heavily modified these implementations to fit our imbalanced FSL framework, which also offers standard and continual FSL compatibility (Antoniou et al., 2020). We provide our own implementations of ProtoMAML and BTAML, for which no open-source implementation in PyTorch existed as of writing. To verify our implementations, we compare the methods on the standard balanced 5-shot 5-way task against the reported accuracies. Results are presented in Table 2. The algorithms achieve very similar performance, within 3 accuracy points of the reported results. The discrepancies can be accounted for by the smaller training batch for SimpleShot, different augmentation strategies for the other methods, and natural variance stemming from random initialization. We show the validation performance over epochs for each method in Figure 8 on the next page.

C BREAKDOWN OF RESULTS

In this section, we break down the results from the main body of the paper. Specifically, we provide the top-50 performing models on the imbalanced 1-9shot linear task in subsection C.1. The breakdown of the higher-shots experiment (Figure 4) is provided in subsection C.2. In subsection C.3, we include the breakdown of the backbone experiment (Figure 5), showing the performance by algorithm type and training procedure. We provide precision and recall tables for linear 1-9shot 5-way tasks for each meta-training procedure in subsection C.4. In subsection C.5, we provide the results for the imbalanced reduced meta-dataset (Figure 10).

C.1 TOP-50 PERFORMING MODELS ON 1-9SHOT LINEAR

[Per-method precision, recall, and average F1 scores (with 95% CIs) for shots K1 = 1 to K5 = 9; table data omitted.]

Table 7: Precision and recall for linear 1-9shot 5-way tasks after Random Shot meta-training.

Table 9: Precision and recall for linear 1-9shot 5-way tasks after Random Shot (ROS+) meta-training.

Figure 11: Standard episodic meta-training (Vinyals et al., 2017) and random-shot episodic meta-training (Triantafillou et al., 2020) with Weighted Loss (Buda et al., 2018) and Focal Loss (Lin et al., 2017), applied to the inner loop of optimization-based methods and fine-tune baselines.

D.2 BTAML

We implemented and trained Bayesian TAML (Lee et al., 2019); however, the training performance graphs suggest a mistake in our implementation, which we did not manage to identify in time for the submission. For this reason, we have left these results out of the main paper. BTAML (α, ω, γ, z) corresponds to the full BTAML, while the other entries indicate variants with the corresponding components turned off. We provide the full performance of these models in Appendix E.



This is a standard procedure used in the class-imbalance literature (Buda et al., 2018), which reduces the number of variables and allows isolating the effect of imbalance. Note that an imbalanced query set would influence methods such as SCA (Antoniou & Storkey, 2019), which use the query set as an additional unlabeled set during the inner loop. We do not consider such methods in our experiments since they assume immediate access to the query set, which limits their practical application. Non-overlapping 95% confidence intervals indicate a 'significant' performance difference.

CONCLUSION

In this work, we have provided a detailed analysis of class imbalance in FSL, showing that class imbalance at the support set level is problematic for many methods. We found that most metric-based models present a built-in robustness to support-set imbalance, while in optimization-based models imbalance issues can be alleviated using oversampling. In our experiments, Random-Shot meta-training provided minimal benefits, suggesting that meta-learning methods do not learn to balance from random-shot episodes alone. Results on meta-dataset imbalance showed only a small negative effect, far less dramatic than task-level imbalance. In future work, the insights gained from our investigation could be used to design novel few-shot methods that guarantee stable performance under imbalanced conditions.

Figure 13: Accuracy.



Figure 1: Accuracy (mean percentage over 3 runs) and 95% confidence intervals for FSL methods on balanced tasks (red bars) vs. 3 imbalanced tasks (blue bars). Most methods perform significantly worse on the imbalanced tasks, as shown by the lower accuracy of the blue bars.

Figure 2: The two types of imbalance settings investigated in this work. Left: Imbalanced support set. Classes are balanced at the dataset level, but tasks are imbalanced by one of the I-distributions: linear (task 1), step (task 2), and random (task 3). Right: Imbalanced meta-training dataset. Classes are imbalanced at the dataset level, but all support sets are balanced at the task level. Following standard practice in the literature, query sets are kept balanced in both settings.

Figure 3: Standard episodic training (Vinyals et al., 2017) vs. random-shot episodic training (Triantafillou et al., 2020). We explore pairing methods with Random Over-Sampling (ROS), without and with augmentation (ROS+).

Figure 4: Comparing imbalance levels via support sets of different sizes. Each line represents the average across all models in each training and imbalance setting.

Figure 5: Combined average model performance against different backbones and imbalanced tasks. Left: combined performance of all models and training scenarios. Right: relative performance w.r.t. the balanced task.

Figure 7: Visualisation of linear 1-5shot support sets. Left: no ROS. Middle: ROS. Right: ROS+.

Figure 8: Validation performance through epochs on Standard 5-shot 5-way meta-training, and Random Shot (1-9shot random). The shaded areas show ± 1 standard deviation over three repeats on different seeds.

Figure 10: Combined model average performance with increasing minority classes. Left: Combined accuracy of all models and training scenarios. Right: Relative performance difference to the balanced dataset.

Figure 12: Validation performance of BTAML through epochs using Standard 5-shot 5-way meta-training, and Random Shot (1-9shot random). The shaded areas show ± 1 standard deviation over three repeats on different seeds.

Figure 14: F1 Scores.

Training on the imbalanced meta-training dataset. Imbalanced distributions use ρ = 19 (D Kmin = 30, D Kmax = 570), with step imbalance containing D Nmin = 32 minority classes (out of the 64 available in the dataset). The small differences in accuracy between the balanced and I-distributions suggest an insignificant effect of imbalance at the dataset level. Left: Evaluation on the meta-testing dataset of Mini-ImageNet. Right: Evaluation on the meta-testing dataset of CUB.

A.1 DATASETS

For the imbalanced support set experiments, we used MiniImageNet (Vinyals et al., 2017; Ravi & Larochelle, 2016), following the same popular version as (Ravi & Larochelle, 2016; Cheng et al., 2018). All meta-learning models used 64, 16, and 20 classes for the meta-training D train, meta-validation D val, and meta-testing D test datasets, respectively, with each class containing 600 samples. All images are resized to 84 by 84 px. For the feature-transfer baselines (1-NN, fine-tune, and Baseline++), we used conventionally partitioned training and validation datasets. Specifically, we combined the D train and D val classes (i.e., 64 + 16 = 80 classes), then partitioned the samples of each class into an 80% / 20% split for pre-training and validation, forming D̄ train and D̄ val.
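The baseline split described above can be sketched as follows, assuming class-indexed dicts of samples; the function name and seeding are ours.

```python
import random

def baseline_split(meta_train, meta_val, val_fraction=0.2, seed=0):
    """Merge the meta-train and meta-val classes (64 + 16 = 80), then split
    each class's samples 80% / 20% into conventional pre-training and
    validation sets for the feature-transfer baselines."""
    rng = random.Random(seed)
    merged = {**meta_train, **meta_val}
    train, val = {}, {}
    for cls, samples in merged.items():
        samples = samples[:]          # copy before shuffling in place
        rng.shuffle(samples)
        n_val = int(len(samples) * val_fraction)
        val[cls] = samples[:n_val]
        train[cls] = samples[n_val:]
    return train, val
```

With 600 samples per class, each class contributes 480 pre-training and 120 validation samples, so the baselines see the same total amount of training data as the meta-learning methods.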

Results of standard 5-shot 5-way experiments on Mini-ImageNet achieved with our implementation, compared to the originally reported accuracy and other work. Other sources' accuracies were taken from: * (Chen et al., 2019), † (Snell & Zemel, 2020), ‡ (Vogelbaum et al., 2020).

Top-50 models using different meta-training strategies, showing absolute and relative difference between the balanced and the imbalanced task. Results sorted by relative difference.

Precision and recall for linear 1-9shot 5-way tasks after Standard meta-training.

Precision and recall for linear 1-9shot 5-way after Random Shot (ROS) meta-training.

Table showing full results for the reduced meta-training MiniImageNet dataset (with 32 classes), evaluated on the (meta-test split of) MiniImageNet. The last two columns show each model's average and maximum (absolute) performance difference relative to the balanced task. RelationNet suffers the most from the imbalanced meta-training.

Table showing full results for the reduced meta-training MiniImageNet dataset (with 32 classes), evaluated on the (meta-test split of) CUB. The last two columns show each model's average and maximum (absolute) performance difference relative to the balanced task. RelationNet suffers the most from the imbalanced meta-training dataset.

D.1 ADDITIONAL IMBALANCE STRATEGIES

In Figure 11, we compare Standard meta-training and Random-Shot meta-training with focal loss. We observe no significant advantage of using Focal Loss with Random-Shot meta-training over the Standard meta-training experiments.

Performance of our implementation of BTAML on the 5-shot 5-way task. BTAML (α, ω, γ, z) indicates the full version of the proposed model.

Meta-/Pre-training with a reduced number of samples in the meta-training dataset of Mini-ImageNet (all 64 classes are balanced). The setting with 600* samples uses 64 channels for each of the 4 convolutional layers, instead of the 32 channels used for the other '#Class Samples' settings. In addition, all three Baselines were trained using the conventional split on D̄ train instead of D train. The table suggests that, beyond a certain point, the number of samples per class in the meta-training dataset does not significantly affect the performance of FSL algorithms. All settings were trained on 50k tasks, apart from 600*, which was trained on 100k tasks.

Figures 13 and 14 show accuracy and F1 scores, respectively, on imbalanced tasks for Standard (5-shot) and Random Shot (1-9shot) meta-training. Their corresponding result tables are Tables 14 to 21. Table 22 provides the full results for the main imbalanced meta-dataset experiments.

Standard (accuracy)

Standard (F1)

Full results for Table 1.

C.3 BACKBONE EXPERIMENTS

We ran additional experiments with different backbone models. In Tables 4 and 5, we show the random 1-9shot 5-way performance of Conv6, ResNet10, and ResNet34. We observe that ROS still benefits many of the methods. However, the reader should exercise caution, since we used the same hyper-parameterization as for the four-layer CNN. We did not perform any hyperparameter fine-tuning on these backbones; the results would likely be higher if we allowed for longer meta-training. Some results are missing due to meta-training destabilization caused by the deeper backbones.

