WORST-CASE FEW-SHOT EVALUATION: ARE NEURAL NETWORKS ROBUST FEW-SHOT LEARNERS?

Abstract

Neural networks have achieved remarkable performance on various few-shot tasks. However, recent studies reveal that existing few-shot models often exploit spurious correlations between training and test sets, achieving high performance that does not generalize. Motivated by the fact that a robust few-shot learner should accurately classify data given any valid training set, we consider a worst-case few-shot evaluation that computes worst-case generalization errors by constructing a challenging few-shot set. Specifically, we search for the label-balanced subset of a full-size training set that results in the largest expected risk. Since the search space is enormous, we propose an efficient method, NMMD-attack, that optimizes this objective by maximizing the NMMD distance (maximum mean discrepancy based on the neural tangent kernel). Experiments show that NMMD-attack can successfully attack various architectures. The large gap between average and worst-case performance shows that neural networks still suffer from poor robustness. We appeal for more worst-case benchmarks for better robust few-shot evaluation.

1. INTRODUCTION

Given a limited number of supervised samples, few-shot learning aims to achieve high generalization performance on unseen test data (Wang et al., 2020; Yue et al., 2020). Recent years have witnessed rapid advancement of few-shot learning, particularly with neural networks pre-trained on self-supervised data (Brown et al., 2020; Dosovitskiy et al., 2021; Tan & Le, 2021). Neural networks have gradually become the dominant solution to few-shot learning; some (He et al., 2021; Xu et al., 2022) even surpass humans on standard benchmarks (Wang et al., 2019; Mukherjee et al., 2021). Despite promising results on fixed sets, recent work (Sagawa et al., 2020; Taori et al., 2020a; Koh et al., 2021; Tang et al., 2022) reveals that neural networks as few-shot learners exacerbate spurious correlations and easily fail under distribution shifts. Spurious correlation is a typical representation learning problem in which models learn to classify based on superficial features. For example, when learning the class label "pig" from a few pig images, neural network models sometimes learn to guess based on superficial features (e.g., a background with farm fences) rather than to generalize based on essential features (e.g., the facial characteristics of pigs), as shown in Figure 1. Over-fitting to spurious attributes inflates apparent performance but does not guarantee better robustness, which explains the over-optimistic performance on existing benchmarks (Mutton et al., 2007; Vinyals et al., 2016; Oreshkin et al., 2018; Schick & Schütze, 2021; Alayrac et al., 2022a). In those benchmarks, the performance of models is assessed by the averaged test accuracy given a fixed training set (1-fold evaluation) or several random subsets of the training set (k-fold evaluation).
In that procedure, it is easy for superficial features to be carried by the few-shot sets and eventually exploited by the neural networks, since training and test data usually come from the same data distribution in the construction of the benchmark (Sagawa et al., 2020). Motivated by the fact that a robust few-shot learner should accurately classify data given any valid training set, we propose a worst-case evaluation for few-shot learners in this work. Worst-case evaluation aims to estimate generalization error bounds. Instead of randomly sampling few-shot sets, we search for the worst-case few-shot set from a full-size training set with balanced labels. An illustration of worst-case few-shot evaluation is shown in Figure 1. Inspired by the observation that spurious correlations often arise from common statistical features (Sagawa et al., 2019; Tang et al., 2022), so that the distribution of unbiased samples generally diverges substantially from the full-size training set, we adopt a distribution-divergence-maximization approach, NMMD-attack, to find the most challenging few-shot set. This approach is also theoretically grounded, as the generalization error of a few-shot set is bounded by the maximum mean discrepancy (MMD) distance (Gretton et al., 2012) between the few-shot distribution and the original training distribution. The goal of maximizing generalization error can then be reduced to maximizing the MMD distance between the few-shot set and the full-size training set.
Following the MMD maximization principle, we use the MMD distance in the hypothesis space to define the group (or set) distance. Since the MMD distance is still intractable, we borrow the neural tangent kernel, an approximation to over-parameterized neural networks, to estimate the MMD distance without optimization. Given the searched subset, we train models and report test accuracy. Experiments show that NMMD-attack challenges the high few-shot performance observed in randomly-sampled cases: it successfully attacks various model architectures with large performance drops. For example, the performance of DenseNet-121 drops by 10.02% on the generated few-shot set. Our case studies on ImageNet-1K and CIFAR-10 also demonstrate that the generated few-shot sets exhibit far fewer spurious attributes than randomly-sampled few-shot training sets. This work re-examines the actual ability of representative neural networks in few-shot cases. The large performance drop indicates substantial room for improvement in future work. Indeed, the bias toward spurious attributes can be a severe system bug and loophole for any few-shot learner: an attacker can construct sets with unseen correlations to destroy a model, which is hard to detect. Avoiding spurious features is thus a promising direction for improving few-shot robustness. Furthermore, compared with existing few-shot benchmarks, worst-case evaluation provides a new view of how badly a model can fail, so that backup plans can be prepared for accidents in real-world applications. In this work, we provide a feasible solution for estimating generalization error bounds in few-shot cases, and we appeal for more worst-case benchmarks for better few-shot evaluation in the future.
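As a concrete, much-simplified stand-in for the NMMD-attack search, the sketch below greedily grows a label-balanced subset that maximizes the empirical MMD to the full training set. The RBF kernel, the greedy strategy, and all names here are illustrative assumptions: the actual method uses the neural tangent kernel and a more efficient optimization over the enormous search space.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel; illustrative stand-in for a neural tangent kernel.
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    # Biased empirical estimate of the squared MMD between X and Y.
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

def greedy_worst_subset(X, y, shots, gamma=1.0):
    # Greedily pick `shots` examples per class so that the chosen subset's
    # MMD to the full training set X is as large as possible. This is a
    # simplified sketch of the worst-case search, not the paper's algorithm.
    chosen = []
    for c in np.unique(y):
        class_idx = np.flatnonzero(y == c)
        for _ in range(shots):
            best_i, best_val = None, -np.inf
            for i in class_idx:
                if i in chosen:
                    continue
                val = mmd2(X[chosen + [i]], X, gamma)
                if val > best_val:
                    best_i, best_val = i, val
            chosen.append(best_i)
    return np.array(chosen)
```

The returned indices form a label-balanced few-shot set on which a model is then trained and its test accuracy reported as the worst-case score.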

2. RELATED WORK

In this section, we review related topics, including adversarial attacks, distribution shift, and distributionally robust optimization. Our work takes the form of an attack inspired by the adversarial attack literature (Szegedy et al., 2013; Goodfellow et al., 2014). Adversarial attacks aim to fool neural networks while remaining imperceptible to humans, mainly by making slight variations such as adding small noise to a sample. This form of attack, though effective, alters each sample independently and ignores group correlations. In contrast, since few-shot cases usually suffer serious spurious correlations at the group level, worst-case few-shot evaluation targets robustness to distribution shifts.
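To make the contrast concrete, a per-sample attack such as FGSM (Goodfellow et al., 2014) perturbs each input independently along the sign of the loss gradient, without any notion of set-level statistics. The sketch below applies a single FGSM step to a logistic-regression model purely for illustration; the model and all names are our own assumptions.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps=0.1):
    # One-step FGSM on a logistic-regression model: move x by eps in the
    # sign of the cross-entropy loss gradient w.r.t. the input.
    # (Illustrative sketch; real attacks target deep networks.)
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # sigmoid probability of class 1
    grad_x = (p - y) * w                    # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)
```

Each input is perturbed in isolation, which is exactly what makes this kind of attack blind to the group-level spurious correlations that worst-case few-shot evaluation targets.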



Figure 1: An illustration of worst-case evaluation. Spurious correlations affect few-shot evaluation. Since training and test data usually come from the same distribution, over-fitting to spurious features brings a performance increase but not better robustness. For example, models learn to classify pigs based on superficial features (e.g., being inside a fence) rather than shape. In this work, we consider a worst-case evaluation of few-shot robustness by extracting a challenging, label-balanced subset of a full-size training set with the largest expected risk.

