HARD-META-DATASET++: TOWARDS UNDERSTANDING FEW-SHOT PERFORMANCE ON DIFFICULT TASKS

Abstract

Few-shot classification is the ability to adapt to any new classification task from only a few training examples. The performance of current top-performing few-shot classifiers varies widely across tasks: they often fail on a subset of 'difficult' tasks. This phenomenon has real-world consequences for deployed few-shot systems where safety and reliability are paramount, yet little has been done to understand these failure cases. In this paper, we study these difficult tasks to gain a more nuanced understanding of the limitations of current methods. To this end, we develop a general and computationally efficient algorithm called FASTDIFFSEL to extract difficult tasks from any large-scale vision dataset. Notably, our algorithm extracts tasks at least 20x faster than existing methods, enabling its use on large-scale datasets. We use FASTDIFFSEL to extract difficult tasks from META-DATASET, a widely-used few-shot classification benchmark, and from other challenging large-scale vision datasets including ORBIT, CURE-OR and OBJECTNET. These tasks are curated into HARD-META-DATASET++, a new few-shot testing benchmark designed to promote the development of methods that are robust to even the most difficult tasks. We use HARD-META-DATASET++ to stress-test an extensive suite of few-shot classification methods and show that state-of-the-art approaches fail catastrophically on difficult tasks. We believe that our extraction algorithm FASTDIFFSEL and HARD-META-DATASET++ will aid researchers in further understanding the failure modes of few-shot classification models.

1. INTRODUCTION

Few-shot classification is the ability to distinguish between a set of novel classes when given only a few labelled training examples of each class (Lake et al., 2011; Fei-Fei et al., 2006). This holds potential across many real-world applications, from robots that can identify new objects (Ren et al., 2020) to drug discovery pipelines that can predict the properties of new molecules (Stanley et al., 2021). A few-shot image classifier is given a few labelled training images of the new object classes, called the support set. Once the classifier has adapted to this support set, it is evaluated on novel test images of those classes, called the query set. Together, the support and query sets are called a task. Recent years have seen rapid progress in few-shot image classification (Snell et al., 2017; Finn et al., 2017b; Ye et al., 2020; Li et al., 2021; Radford et al., 2021; Kolesnikov et al., 2019); however, current top-performing methods display a wide range of performance over different tasks at test time (Fu et al., 2022; Agarwal et al., 2021). On META-DATASET (MD), a widely-used few-shot classification benchmark (Triantafillou et al., 2019), state-of-the-art classifiers obtain accuracies as low as 22% on some individual tasks even though their average task accuracy is >55% (see Fig 1). Few works have undertaken a detailed examination of these 'difficult' tasks, yet they remain critical to interrogate for both future algorithmic development and the safety and reliability of deployed systems.

This paper aims to gain a more nuanced understanding of these 'difficult' tasks and the limitations of current methods. We define a difficult task as one on which a few-shot classifier performs poorly on the task's query set after being adapted to its support set. Current methods for finding support sets that lead to poor query performance rely on greedy search-based algorithms (Agarwal et al., 2021).
These approaches, however, incur a high computational cost when sampling for a large number of

* Work done partly during a research internship at Microsoft Research, Cambridge (UK)
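To make the task structure above concrete, the following is a minimal sketch; it is not the paper's FASTDIFFSEL nor any specific published method. It builds a toy 2-way, 5-shot task, adapts a simple nearest-centroid classifier to the support set, measures query-set accuracy, and flags the task as 'difficult' when that accuracy falls below a threshold. All function names, the toy data, and the 0.5 threshold are illustrative assumptions.

```python
# Illustrative sketch of a few-shot "task" (support set + query set) and of
# labelling a task as 'difficult' by its post-adaptation query accuracy.
# Not the paper's method; names and threshold are assumptions.
import numpy as np

def nearest_centroid_accuracy(support_x, support_y, query_x, query_y):
    """Adapt by computing one centroid per class from the support set,
    then classify each query example by its nearest centroid."""
    classes = np.unique(support_y)
    centroids = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    # Euclidean distance from every query point to every class centroid
    dists = np.linalg.norm(query_x[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == query_y).mean())

def is_difficult(accuracy, threshold=0.5):
    # Illustrative criterion: low query accuracy after adaptation.
    return accuracy < threshold

rng = np.random.default_rng(0)
# Toy 2-way, 5-shot task in a 2-D feature space with well-separated classes
support_x = np.concatenate([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
support_y = np.array([0] * 5 + [1] * 5)
query_x = np.concatenate([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
query_y = np.array([0] * 10 + [1] * 10)

acc = nearest_centroid_accuracy(support_x, support_y, query_x, query_y)
```

On this easy toy task the classes are trivially separable, so the task is not difficult; a difficult task would be one whose sampled support set drives this query accuracy down.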

