IMPROVING FEW-SHOT VISUAL CLASSIFICATION WITH UNLABELLED EXAMPLES

Abstract

We propose a transductive meta-learning method that uses unlabelled instances to improve few-shot image classification performance. Our approach combines a regularized Mahalanobis-distance-based soft k-means clustering procedure with a modified state-of-the-art neural adaptive feature extractor to achieve improved test-time classification accuracy using unlabelled data. We evaluate our method on transductive few-shot learning tasks, in which the goal is to jointly predict labels for query (test) examples given a set of support (training) examples. We achieve new state-of-the-art performance on the Meta-Dataset, mini-ImageNet, and tiered-ImageNet benchmarks.

1. INTRODUCTION

Deep learning has revolutionized visual classification, enabled in part by the development of large and diverse sets of curated training data (Szegedy et al., 2014; He et al., 2015; Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; Sornam et al., 2017). However, in many image classification settings, millions of labelled examples are not available; techniques that achieve sufficient classification performance with few labels are therefore required. This need has motivated research on few-shot learning (Feyjie et al., 2020; Wang & Yao, 2019; Wang et al., 2019; Bellet et al., 2013), which seeks methods for training classifiers from much smaller datasets. Given a few labelled "support" images per class, a few-shot image classifier is expected to produce labels for a given set of unlabelled "query" images. Typical approaches to few-shot learning adapt a base classifier network to a new support set through various means, such as learning new class embeddings (Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018), amortized (Requeima et al., 2019; Oreshkin et al., 2018) or iterative (Yosinski et al., 2014) partial adaptation of the feature extractor, and complete fine-tuning of the entire network end-to-end (Ravi & Larochelle, 2017; Finn et al., 2017). In addition to the standard fully supervised setting, techniques have been developed to exploit additional unlabelled support data (semi-supervision) (Ren et al., 2018) as well as information present in the query set (transduction) (Liu et al., 2018; Kim et al., 2019). In our work, we focus on the transductive paradigm, in which the entire query set is labelled jointly. This allows us to exploit the additional unlabelled data, in the hope of improving classification performance.
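To make the support/query protocol concrete, a minimal sketch of the class-embedding approach (a prototypical-style classifier in the spirit of Snell et al., 2017) is shown below. It assumes a feature extractor has already mapped images to vectors; the function name and the use of raw Euclidean distance are illustrative choices, not the method proposed in this paper.

```python
import numpy as np

def prototype_classify(support_feats, support_labels, query_feats):
    """Label each query feature by Euclidean distance to per-class
    prototypes, i.e. the mean support embedding of each class."""
    classes = np.unique(support_labels)
    # One prototype per class: the mean of that class's support features.
    prototypes = np.stack([support_feats[support_labels == c].mean(axis=0)
                           for c in classes])
    # Squared Euclidean distance from every query point to every prototype.
    dists = ((query_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    # Assign each query to its nearest prototype's class.
    return classes[dists.argmin(axis=1)]
```

Transductive and semi-supervised methods extend this idea by letting the unlabelled query points themselves refine the prototypes, e.g. via soft k-means updates.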
Existing transductive few-shot classifiers rely on label propagation from labelled to unlabelled examples in the feature space, either through k-means clustering with Euclidean distance (Ren et al., 2018) or through message passing in graph convolutional networks (Liu et al., 2018; Kim et al., 2019). Since few-shot learning requires handling a varying number of classes, an important architectural choice is the final feature-to-class mapping. Previous methods have used the Euclidean distance (Ren et al., 2018), the absolute difference (Koch et al., 2015), cosine similarity (Vinyals et al., 2016), linear classification (Finn et al., 2017; Requeima et al., 2019), or additional neural network layers (Kim et al., 2019; Sung et al., 2018). Bateni et al. (2020) improved on these approaches by using a class-adaptive Mahalanobis metric. Their method, Simple CNAPS, uses a conditional neural-adaptive feature extractor along with a regularized Mahalanobis-distance-based classifier. This modification to CNAPS (Requeima et al., 2019) achieves improved performance on the Meta-Dataset benchmark (Triantafillou et al., 2019), only recently surpassed by SUR (Dvornik et al., 2020) and URT (Liu et al., 2020). However, its performance suffers in the regime where five or fewer support examples are available per class.
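A regularized, class-adaptive Mahalanobis classifier in the spirit of Simple CNAPS (Bateni et al., 2020) can be sketched as below. The sketch blends each class covariance with a task-level covariance and adds a ridge term; the particular blend weight and identity regularizer used here are illustrative assumptions, not the exact parameterization of the cited method.

```python
import numpy as np

def mahalanobis_classify(support_feats, support_labels, query_feats):
    """Assign each query to the class with smallest Mahalanobis distance,
    using a regularized blend of class and task covariances."""
    classes = np.unique(support_labels)
    d = support_feats.shape[1]
    task_cov = np.cov(support_feats, rowvar=False)  # task-level covariance
    scores = []
    for c in classes:
        feats_c = support_feats[support_labels == c]
        mu = feats_c.mean(axis=0)
        n_k = len(feats_c)
        # Blend weight grows with shot count: trust the class covariance
        # more when more support examples are available (assumed scheme).
        lam = n_k / (n_k + 1.0)
        class_cov = (np.cov(feats_c, rowvar=False) if n_k > 1
                     else np.zeros((d, d)))
        # Regularized covariance: class + task blend plus a ridge term,
        # which keeps the matrix invertible in the low-shot regime.
        cov = lam * class_cov + (1.0 - lam) * task_cov + np.eye(d)
        inv = np.linalg.inv(cov)
        diff = query_feats - mu
        # Squared Mahalanobis distance of every query to this class mean.
        scores.append(np.einsum('nd,de,ne->n', diff, inv, diff))
    return classes[np.argmin(np.stack(scores, axis=1), axis=1)]
```

The ridge term is what keeps this estimator usable with five or fewer support examples per class, where the raw class covariance is rank-deficient.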

