HARD-META-DATASET++: TOWARDS UNDERSTANDING FEW-SHOT PERFORMANCE ON DIFFICULT TASKS

Abstract

Few-shot classification is the ability to adapt to a new classification task from only a few training examples. The performance of current top-performing few-shot classifiers varies widely across tasks: they often fail on a subset of 'difficult' tasks. This phenomenon has real-world consequences for deployed few-shot systems where safety and reliability are paramount, yet little has been done to understand these failure cases. In this paper, we study these difficult tasks to gain a more nuanced understanding of the limitations of current methods. To this end, we develop a general and computationally efficient algorithm called FASTDIFFSEL to extract difficult tasks from any large-scale vision dataset. Notably, our algorithm extracts tasks at least 20x faster than existing methods, enabling its use on large-scale datasets. We use FASTDIFFSEL to extract difficult tasks from META-DATASET, a widely-used few-shot classification benchmark, and from other challenging large-scale vision datasets including ORBIT, CURE-OR and OBJECTNET. These tasks are curated into HARD-META-DATASET++, a new few-shot testing benchmark to promote the development of methods that are robust to even the most difficult tasks. We use HARD-META-DATASET++ to stress-test an extensive suite of few-shot classification methods and show that state-of-the-art approaches fail catastrophically on difficult tasks. We believe that our extraction algorithm FASTDIFFSEL and HARD-META-DATASET++ will aid researchers in further understanding the failure modes of few-shot classification models.

1. INTRODUCTION

Few-shot classification is the ability to distinguish between a set of novel classes when given only a few labelled training examples of each class (Lake et al., 2011; Fei-Fei et al., 2006). This holds potential across many real-world applications, from robots that can identify new objects (Ren et al., 2020) to drug discovery pipelines that can predict the properties of new molecules (Stanley et al., 2021). A few-shot image classifier is given a few labelled training images of the new object classes, called the support set. Once the classifier has adapted to this support set, it is evaluated on novel test images of those classes, called the query set. Together, the support and query sets are called a task. Recent years have seen rapid progress in few-shot image classification (Snell et al., 2017; Finn et al., 2017b; Ye et al., 2020; Li et al., 2021; Radford et al., 2021; Kolesnikov et al., 2019); however, current top-performing methods display a wide range in performance over different tasks at test time (Fu et al., 2022; Agarwal et al., 2021). On META-DATASET (MD), a widely-used few-shot classification benchmark (Triantafillou et al., 2019), state-of-the-art classifiers obtain accuracies as low as 22% on some individual tasks even though their average task accuracy is >55% (see Fig 1). Few works have undertaken a detailed examination of these 'difficult' tasks, yet they remain critical to interrogate for both future algorithmic development and the safety and reliability of deployed systems.

This paper aims to gain a more nuanced understanding of these 'difficult' tasks and the limitations of current methods. We define a difficult task as one on which a few-shot classifier performs poorly on the task's query set, after being adapted to its support set. Current methods for finding support sets that lead to poor query performance rely on greedy search-based algorithms (Agarwal et al., 2021).
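The adapt-then-evaluate protocol above can be sketched with a simple nearest-prototype classifier, in the spirit of ProtoNets. This is a minimal numpy illustration operating on pre-extracted feature embeddings; the function names `prototype_classifier` and `task_accuracy` are illustrative, not the paper's code:

```python
import numpy as np

def prototype_classifier(support_x, support_y, query_x, n_classes):
    """Adapt to the support set by computing one prototype (mean embedding)
    per class, then label each query image by its nearest prototype."""
    protos = np.stack([support_x[support_y == j].mean(0)
                       for j in range(n_classes)])
    dists = ((query_x[:, None, :] - protos[None]) ** 2).sum(-1)
    return dists.argmin(1)

def task_accuracy(support_x, support_y, query_x, query_y, n_classes):
    """Query-set accuracy of the adapted classifier; a task is 'difficult'
    for the classifier when this value is low."""
    preds = prototype_classifier(support_x, support_y, query_x, n_classes)
    return (preds == query_y).mean()
```

Under this view, the same query set can yield very different accuracies depending on which support images it is paired with, which is exactly the degree of freedom the extraction algorithm below exploits.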
These approaches, however, incur a high computational cost when sampling a large number of tasks, and for tasks with large support sets, both of which are common (and best) practice in few-shot evaluation protocols. As a result, the study of difficult tasks has been limited to small-scale datasets, which lack the challenging examples and the setup of large benchmarks such as META-DATASET. To address this, we develop a general and computationally efficient algorithm called FASTDIFFSEL to extract difficult tasks from any large-scale dataset. Given a (meta-)trained few-shot classifier, a query set, and a search pool of support images, we formulate a constrained combinatorial optimization problem which learns a selection weight for each image in the pool such that the loss on the query set is maximised. The top-k (i.e. most difficult) images per class are then extracted into a support set and paired with the query set to form a difficult task. This optimization can be repeated to obtain any number of difficult tasks. In practice, we find that FASTDIFFSEL is at least 20x faster than existing greedy search-based algorithms (Agarwal et al., 2021), with greater gains as the support pools and support set sizes increase.

We leverage the scalability of FASTDIFFSEL to extract difficult tasks from a wide range of large-scale vision datasets including META-DATASET (Triantafillou et al., 2019), OBJECTNET (Barbu et al., 2019), CURE-OR (Temel et al., 2018) and ORBIT (Massiceti et al., 2021), and collect these tasks into a new testing set called HARD-META-DATASET++ (HARD-MD++). The addition of datasets beyond META-DATASET is motivated by their real-world nature and by their image annotations of quality variations (e.g. object occluded, blurred, poorly framed), which enable future research into why a task is difficult. We provide early insights into this question in our analyses.
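The extraction idea can be illustrated with a toy sketch: learn a selection weight per pool image by gradient ascent on the query loss of a prototype classifier, then keep the top-k weights per class as the difficult support set. This is a minimal numpy illustration on pre-computed embeddings; it uses finite differences in place of autodiff and simple clipping in place of the paper's constrained solver, and all function names here are assumptions, not the paper's implementation:

```python
import numpy as np

def proto_query_loss(w, pool_x, pool_y, query_x, query_y, n_classes):
    """Query cross-entropy of a prototype classifier whose prototypes are
    weighted means of the support pool (selection weights w in [0, 1])."""
    protos = np.stack([
        (w[pool_y == j, None] * pool_x[pool_y == j]).sum(0)
        / (w[pool_y == j].sum() + 1e-8)
        for j in range(n_classes)
    ])
    logits = -((query_x[:, None, :] - protos[None]) ** 2).sum(-1)
    logits = logits - logits.max(1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -log_p[np.arange(len(query_y)), query_y].mean()

def extract_difficult_support(pool_x, pool_y, query_x, query_y,
                              n_classes, k, steps=30, lr=1.0, eps=1e-4):
    """Learn per-image selection weights that MAXIMISE the query loss,
    then return the k highest-weight pool indices per class."""
    w = np.full(len(pool_x), 0.5)
    for _ in range(steps):
        base = proto_query_loss(w, pool_x, pool_y, query_x, query_y, n_classes)
        grad = np.zeros_like(w)
        for i in range(len(w)):  # finite-difference gradient (toy scale only)
            w_i = w.copy()
            w_i[i] += eps
            grad[i] = (proto_query_loss(w_i, pool_x, pool_y,
                                        query_x, query_y, n_classes) - base) / eps
        w = np.clip(w + lr * grad, 0.0, 1.0)  # ascent step, weights kept in [0, 1]
    chosen = []
    for j in range(n_classes):
        cls = np.where(pool_y == j)[0]
        chosen += list(cls[np.argsort(-w[cls])[:k]])
    return sorted(chosen)
```

The key design point is that a continuous relaxation (weights in [0, 1]) replaces the combinatorial search over support subsets, which is what makes the approach scale to large pools where greedy search becomes prohibitive.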
We stress-test an extensive suite of state-of-the-art few-shot classification methods on HARD-MD++, cross-validating the difficulty of our extracted tasks across these top-performing methods. In Fig 1, we show one such method (Hu et al., 2022) performing consistently worse on the META-DATASET test split in HARD-MD++ than on the original MD test split across all 10 sub-datasets. In Section 5, we find that this trend holds true across a wide range of few-shot classification methods. We release HARD-MD++ along with a broad set of baselines to drive future research in methods that are robust to even the most difficult tasks. In summary, our contributions are the following:

1. FASTDIFFSEL, an efficient algorithm to extract difficult tasks from any large-scale vision dataset.
2. HARD-META-DATASET++, a new test-only few-shot classification benchmark composed of difficult tasks extracted from the widely-used few-shot classification benchmark META-DATASET and other large-scale real-world datasets: OBJECTNET, CURE-OR and ORBIT.
3. Extensive stress testing and novel empirical insights for a wide range of few-shot classification methods, including transfer- and meta-learning based approaches, on HARD-MD++.

2. FEW-SHOT CLASSIFICATION: PRELIMINARIES AND NOTATIONS

A few-shot classification task is typically composed of (i) a support set S which contains a few labelled examples from a set of N classes (e.g., k_j examples for each class index j ∈ [1, N]), and (ii) a query set Q which contains held-out test examples from those same N classes. A classifier is adapted to S and then evaluated on Q.
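Concretely, sampling a task under this notation might look like the following. This is an illustrative helper, not part of the benchmark code; `sample_task` and its parameters are assumptions:

```python
import numpy as np

def sample_task(images, labels, n_way, shots, n_query, rng):
    """Sample an N-way task: for each of n_way sampled classes, put
    shots[j] examples in the support set and n_query in the query set.
    Class labels are remapped to the task-local indices 0..n_way-1."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for j, c in enumerate(classes):
        idx = rng.permutation(np.where(labels == c)[0])
        k_j = shots[j]
        support += [(images[i], j) for i in idx[:k_j]]
        query += [(images[i], j) for i in idx[k_j:k_j + n_query]]
    return support, query
```

Note that shots may vary per class (k_j rather than a single k), matching the variable-shot tasks used in benchmarks such as META-DATASET.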



Figure 1: A state-of-the-art method (Hu et al., 2022) performs consistently worse on difficult tasks in the MD split (HARD-MD) of HARD-META-DATASET++ compared to tasks in META-DATASET (MD) across all 10 MD sub-datasets. The method uses ViT-S initialized with self-supervised DINO weights and is further meta-trained with ProtoNets on MD's ilsvrc 2012 split.


* Work done partly during a research internship at Microsoft Research, Cambridge (UK)

