IMPROVING FEW-SHOT VISUAL CLASSIFICATION WITH UNLABELLED EXAMPLES

Abstract

We propose a transductive meta-learning method that uses unlabelled instances to improve few-shot image classification performance. Our approach combines a regularized Mahalanobis-distance-based soft k-means clustering procedure with a modified state-of-the-art neural adaptive feature extractor to achieve improved test-time classification accuracy using unlabelled data. We evaluate our method on transductive few-shot learning tasks, in which the goal is to jointly predict labels for query (test) examples given a set of support (training) examples. We achieve new state-of-the-art performance on the Meta-Dataset, mini-ImageNet, and tiered-ImageNet benchmarks.

1. INTRODUCTION

Deep learning has revolutionized visual classification, enabled in part by the development of large and diverse sets of curated training data (Szegedy et al., 2014; He et al., 2015; Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; Sornam et al., 2017). However, in many image classification settings, millions of labelled examples are not available; therefore, techniques that can achieve sufficient classification performance with few labels are required. This has motivated research on few-shot learning (Feyjie et al., 2020; Wang & Yao, 2019; Wang et al., 2019; Bellet et al., 2013), which seeks to develop classifiers from much smaller datasets. Given a few labelled "support" images per class, a few-shot image classifier is expected to produce labels for a given set of unlabelled "query" images. Typical approaches to few-shot learning adapt a base classifier network to a new support set through various means, such as learning new class embeddings (Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018), amortized (Requeima et al., 2019; Oreshkin et al., 2018) or iterative (Yosinski et al., 2014) partial adaptation of the feature extractor, and complete fine-tuning of the entire network end-to-end (Ravi & Larochelle, 2017; Finn et al., 2017). In addition to the standard fully supervised setting, techniques have been developed to exploit additional unlabelled support data (semi-supervision) (Ren et al., 2018) as well as information present in the query set (transduction) (Liu et al., 2018; Kim et al., 2019). In our work, we focus on the transductive paradigm, where the entire query set is labelled at the same time. This allows us to exploit the additional unlabelled data, in the hope of improving classification performance.
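To make the episodic setup above concrete, the following toy sketch (our own illustration, not any paper's implementation) builds a small 2-way, 2-shot episode and labels query points by their nearest class mean in feature space, in the spirit of class-embedding methods such as Prototypical Networks (Snell et al., 2017):

```python
import numpy as np

def nearest_mean_classify(support, support_labels, query, n_classes):
    """Label each query embedding by its nearest class mean (prototype).

    support: (n_support, d) embeddings of labelled support examples
    query:   (n_query, d) embeddings to be labelled
    """
    means = np.stack([support[support_labels == c].mean(axis=0)
                      for c in range(n_classes)])          # (n_classes, d)
    # Squared Euclidean distance from each query to each class mean.
    d2 = ((query[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                               # predicted labels

# Toy 2-way, 2-shot episode in a 2-d feature space.
support = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0., 0.5], [5., 5.5]])
print(nearest_mean_classify(support, labels, query, n_classes=2))  # → [0 1]
```

In a real few-shot learner the embeddings would come from a (possibly task-adapted) feature extractor rather than raw pixel coordinates; only the final feature-to-class mapping is shown here.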
Existing transductive few-shot classifiers rely on label propagation from labelled to unlabelled examples in the feature space through either k-means clustering with Euclidean distance (Ren et al., 2018) or message passing in graph convolutional networks (Liu et al., 2018; Kim et al., 2019). Since few-shot learning requires handling a varying number of classes, an important architectural choice is the final feature-to-class mapping. Previous methods have used the Euclidean distance (Ren et al., 2018), the absolute difference (Koch et al., 2015), cosine similarity (Vinyals et al., 2016), linear classification (Finn et al., 2017; Requeima et al., 2019), or additional neural network layers (Kim et al., 2019; Sung et al., 2018). Bateni et al. (2020) improved these results by using a class-adaptive Mahalanobis metric. Their method, Simple CNAPS, uses a conditional neural-adaptive feature extractor, along with a regularized Mahalanobis-distance-based classifier. This modification to CNAPS (Requeima et al., 2019) achieves improved performance on the Meta-Dataset benchmark (Triantafillou et al., 2019), only recently surpassed by SUR (Dvornik et al., 2020) and URT (Liu et al., 2020). However, performance suffers in the regime where there are five or fewer support examples available per class.
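A minimal sketch of such a regularized class-adaptive Mahalanobis classifier is given below. This is our simplified reading of the idea, not the authors' code: each class covariance is blended with an all-support "task" covariance using an illustrative weight lambda_k = n_k / (n_k + 1), and ridge-regularized so it stays invertible with very few shots.

```python
import numpy as np

def mahalanobis_logits(query, support, labels, n_classes, eps=1.0):
    """Score queries by negative squared Mahalanobis distance per class.

    Each class covariance is blended with the all-support ("task")
    covariance with weight lam = n_k / (n_k + 1), then ridge-regularized
    by eps * I so it is invertible even in the low-shot regime.
    """
    d = support.shape[1]
    task_cov = np.cov(support, rowvar=False)
    logits = np.empty((query.shape[0], n_classes))
    for c in range(n_classes):
        xc = support[labels == c]
        mu = xc.mean(axis=0)
        lam = len(xc) / (len(xc) + 1.0)
        class_cov = np.cov(xc, rowvar=False) if len(xc) > 1 else np.zeros((d, d))
        cov = lam * class_cov + (1 - lam) * task_cov + eps * np.eye(d)
        diff = query - mu
        # Negative squared Mahalanobis distance used as the class logit.
        logits[:, c] = -np.einsum('nd,dk,nk->n', diff, np.linalg.inv(cov), diff)
    return logits

support = np.array([[0., 0.], [1., 0.], [4., 4.], [5., 4.]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.4, 0.1], [4.6, 3.9]])
print(mahalanobis_logits(query, support, labels, 2).argmax(axis=1))  # → [0 1]
```

With the Euclidean distance, cov would reduce to the identity; the class-adaptive covariance instead reshapes the metric per class from the support statistics.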

Motivated by these observations, we explore the use of unlabelled examples through transductive learning within the same framework as Simple CNAPS. Our contributions are as follows. (1) We propose a transductive few-shot learner, namely Transductive CNAPS, that extends Simple CNAPS with a transductive two-step task encoder, as well as an iterative soft k-means procedure for refining class parameter estimates (mean and covariance) using both labelled and unlabelled examples. (2) We demonstrate the efficacy of our approach by achieving new state-of-the-art performance on Meta-Dataset (Triantafillou et al., 2019). (3) When deployed with a feature extractor trained on their respective training sets, Transductive CNAPS achieves state-of-the-art performance on 4 out of 8 settings on mini-ImageNet (Snell et al., 2017) and tiered-ImageNet (Ren et al., 2018), while matching state of the art on another 2. (4) When additional non-overlapping classes from ImageNet (Russakovsky et al., 2015) are used to train the feature extractor, Transductive CNAPS is able to leverage this example-rich feature extractor to achieve state of the art across the board on mini-ImageNet and tiered-ImageNet.

Figure 1: Soft k-means Mahalanobis-distance-based clustering method used in Transductive CNAPS. First, cluster parameters are initialized using the support examples ("Soft-K Assignment Initialization"). Then, during cluster update iterations ("Cluster Updates"), query examples are assigned class probabilities as soft labels and subsequently, both soft-labelled query examples and labelled support examples are used to estimate new cluster parameters.

2. RELATED WORK

2.1 FEW-SHOT LEARNING

Matching networks (Vinyals et al., 2016) use cosine similarities over feature vectors produced by independently learned feature extractors. Siamese networks (Koch et al., 2015) classify query images based on the nearest support example in feature space, under the L1 metric. Relation networks (Sung et al., 2018) and variants (Kim et al., 2019; Satorras & Estrach, 2018) learn their own similarity metric, parameterised through a multi-layer perceptron. More recently, Prototypical Networks (Snell et al., 2017) learn a shared feature extractor that is used to produce class means in a feature space where the Euclidean distance is used for classification.

Other work has focused on adapting the feature extractor for new tasks. Transfer learning by fine-tuning pretrained visual classifiers (Yosinski et al., 2014) was an early approach that proved limited in success due to issues arising from over-fitting. MAML (Finn et al., 2017) and its variants (Mishra et al., 2017; Nichol et al., 2018; Ravi & Larochelle, 2017) learn meta-parameters that allow fast task adaptation with only a few gradient updates. Work has also been done on partial adaptation of feature extractors using conditional neural adaptive processes (Oreshkin et al., 2018; Garnelo et al., 2018; Requeima et al., 2019; Bateni et al., 2020). These methods rely on channel-wise adaptation of pretrained convolutional layers by adjusting parameters of FiLM layers (Perez et al., 2018) inserted throughout the network. Our work builds on the most recent of these neural adaptive approaches, specifically Simple CNAPS (Bateni et al., 2020). SUR (Dvornik et al., 2020) and URT (Liu et al., 2020) are two very recent methods that employ universal representations stemming from multiple domain-specific feature extraction heads. URT (Liu et al., 2020), which was developed and released publicly in parallel to this work, achieves state-of-the-art performance by using a universal transformation layer.

2.2 FEW-SHOT LEARNING USING UNLABELLED DATA

Several approaches (Kim et al., 2019; Liu et al., 2018; Ren et al., 2018) have also explored the use of unlabelled instances for few-shot visual classification. EGNN (Kim et al., 2019) employs a


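The soft k-means refinement described in the Figure 1 caption can be illustrated with the following toy sketch. This is a simplified Euclidean-distance version under our own assumptions (temperature tau, fixed iteration count), not the paper's Mahalanobis-based procedure or code: class means are initialized from the labelled support set, queries receive soft responsibilities, and means are re-estimated from hard-labelled support plus soft-labelled query examples.

```python
import numpy as np

def soft_kmeans_refine(support, labels, query, n_classes, n_iters=5, tau=1.0):
    """Refine class means using soft-labelled query examples.

    Means are initialized from the labelled support set. Each iteration
    turns negative squared distances into soft query responsibilities,
    then re-estimates the means from support (hard weights) and query
    (soft weights) examples together.
    """
    means = np.stack([support[labels == c].mean(axis=0)
                      for c in range(n_classes)])
    onehot = np.eye(n_classes)[labels]                    # hard support weights
    for _ in range(n_iters):
        d2 = ((query[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-d2 / tau)
        resp /= resp.sum(axis=1, keepdims=True)           # soft query weights
        w = np.concatenate([onehot, resp])                # (n_sup + n_query, K)
        x = np.concatenate([support, query])
        means = (w.T @ x) / w.sum(axis=0)[:, None]        # weighted class means
    return means, resp

support = np.array([[0., 0.], [4., 4.]])
labels = np.array([0, 1])
query = np.array([[0.5, 0.], [3.5, 4.]])
means, resp = soft_kmeans_refine(support, labels, query, n_classes=2)
print(resp.argmax(axis=1))  # → [0 1]
```

The full method additionally re-estimates per-class covariances from the same weighted statistics and measures distances in the resulting Mahalanobis metric rather than the Euclidean one used here.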