IMPROVING FEW-SHOT VISUAL CLASSIFICATION WITH UNLABELLED EXAMPLES

Abstract

We propose a transductive meta-learning method that uses unlabelled instances to improve few-shot image classification performance. Our approach combines a regularized Mahalanobis-distance-based soft k-means clustering procedure with a modified state-of-the-art neural-adaptive feature extractor to achieve improved test-time classification accuracy using unlabelled data. We evaluate our method on transductive few-shot learning tasks, in which the goal is to jointly predict labels for query (test) examples given a set of support (training) examples. We achieve new state-of-the-art performance on the Meta-Dataset, mini-ImageNet and tiered-ImageNet benchmarks.

1. INTRODUCTION

Deep learning has revolutionized visual classification, enabled in part by the development of large and diverse sets of curated training data (Szegedy et al., 2014; He et al., 2015; Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; Sornam et al., 2017). However, in many image classification settings, millions of labelled examples are not available; therefore, techniques that can achieve sufficient classification performance with few labels are required. This has motivated research on few-shot learning (Feyjie et al., 2020; Wang & Yao, 2019; Wang et al., 2019; Bellet et al., 2013), which seeks to develop methods for learning classifiers from much smaller datasets. Given a few labelled "support" images per class, a few-shot image classifier is expected to produce labels for a given set of unlabelled "query" images. Typical approaches to few-shot learning adapt a base classifier network to a new support set through various means, such as learning new class embeddings (Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018), amortized (Requeima et al., 2019; Oreshkin et al., 2018) or iterative (Yosinski et al., 2014) partial adaptation of the feature extractor, and complete fine-tuning of the entire network end-to-end (Ravi & Larochelle, 2017; Finn et al., 2017). In addition to the standard fully supervised setting, techniques have been developed to exploit additional unlabelled support data (semi-supervision) (Ren et al., 2018) as well as information present in the query set (transduction) (Liu et al., 2018; Kim et al., 2019). In our work, we focus on the transductive paradigm, where the entire query set is labelled at the same time. This allows us to exploit the additional unlabelled data, with the hope of improving classification performance.
Existing transductive few-shot classifiers rely on label propagation from labelled to unlabelled examples in the feature space through either k-means clustering with Euclidean distance (Ren et al., 2018) or message passing in graph convolutional networks (Liu et al., 2018; Kim et al., 2019). Since few-shot learning requires handling a varying number of classes, an important architectural choice is the final feature-to-class mapping. Previous methods have used the Euclidean distance (Ren et al., 2018), the absolute difference (Koch et al., 2015), cosine similarity (Vinyals et al., 2016), linear classification (Finn et al., 2017; Requeima et al., 2019) or additional neural network layers (Kim et al., 2019; Sung et al., 2018). Bateni et al. (2020) improved on these approaches by using a class-adaptive Mahalanobis metric. Their method, Simple CNAPS, uses a conditional neural-adaptive feature extractor, along with a regularized Mahalanobis-distance-based classifier. This modification to CNAPS (Requeima et al., 2019) achieves improved performance on the Meta-Dataset benchmark (Triantafillou et al., 2019), only recently surpassed by SUR (Dvornik et al., 2020) and URT (Liu et al., 2020). However, performance suffers in the regime where there are five or fewer support examples available per class.

[Figure 1: Soft k-means assignment initialization and cluster updates.]

We demonstrate the efficacy of our approach by achieving new state of the art performance on Meta-Dataset (Triantafillou et al., 2019). (3) When deployed with a feature extractor trained on their respective training sets, Transductive CNAPS achieves state of the art performance on 4 out of 8 settings on mini-ImageNet (Snell et al., 2017) and tiered-ImageNet (Ren et al., 2018), while matching state of the art on another 2. (4) When additional non-overlapping classes from ImageNet (Russakovsky et al., 2015) are used to train the feature extractor, Transductive CNAPS is able to leverage this example-rich feature extractor to achieve state of the art across the board on mini-ImageNet and tiered-ImageNet.

2. RELATED WORK
2.1. FEW-SHOT LEARNING USING LABELLED DATA

Early work on few-shot visual classification has focused on improving classification accuracy through the use of better classification metrics with a meta-learned non-adaptive feature extractor. Matching networks (Vinyals et al., 2016) use cosine similarities over feature vectors produced by independently learned feature extractors. Siamese networks (Koch et al., 2015) classify query images based on the nearest support example in feature space, under the L1 metric. Relation networks (Sung et al., 2018) and variants (Kim et al., 2019; Satorras & Estrach, 2018) learn their own similarity metric, parameterized through a Multi-Layer Perceptron. More recently, Prototypical Networks (Snell et al., 2017) learn a shared feature extractor that is used to produce class means in a feature space where the Euclidean distance is used for classification. Other work has focused on adapting the feature extractor for new tasks. Transfer learning by finetuning pretrained visual classifiers (Yosinski et al., 2014) was an early approach that had limited success due to over-fitting. MAML (Finn et al., 2017) and its variants (Mishra et al., 2017; Nichol et al., 2018; Ravi & Larochelle, 2017) learn meta-parameters that allow fast task-adaptation with only a few gradient updates. Work has also been done on partial adaptation of feature extractors using conditional neural adaptive processes (Oreshkin et al., 2018; Garnelo et al., 2018; Requeima et al., 2019; Bateni et al., 2020). These methods rely on channel-wise adaptation of pretrained convolutional layers by adjusting parameters of FiLM layers (Perez et al., 2018) inserted throughout the network. Our work builds on the most recent of these neural adaptive approaches, specifically Simple CNAPS (Bateni et al., 2020). SUR (Dvornik et al., 2020) and URT (Liu et al., 2020) are two very recent methods that employ universal representations stemming from multiple domain-specific feature extraction heads.
URT (Liu et al., 2020) , which was developed and released publicly in parallel to this work, achieves state of the art performance by using a universal transformation layer.

2.2. FEW-SHOT LEARNING USING UNLABELLED DATA

Several approaches (Kim et al., 2019; Liu et al., 2018; Ren et al., 2018) have also explored the use of unlabelled instances for few-shot visual classification. EGNN (Kim et al., 2019) employs a graph convolutional edge-labelling network for iterative propagation of labels from support to query instances. Similarly, TPN (Liu et al., 2018) learns a graph construction module for neural propagation of soft labels between elements of the query set. These methods rely on a neural parameterization of distance within the feature space. TEAM (Qiao et al., 2019) uses an episode-wise transductive adaptable metric for performing inference on query examples using a task-specific metric. Song et al. (2020) use a cross attention network combined with a transductive iterative approach for augmenting the support set using the query examples.
The closest method to our work is Ren et al. (2018) . Their approach extends prototypical networks by performing a single additional soft-label weighted estimation of class prototypes. Our work, on the other hand, is different in three major ways. First, we produce soft-labelled estimates of both class mean and covariance. Second, we use an iterative algorithm with a data-driven convergence criterion allowing for a dynamic number of soft-label updates, depending on the task at hand. Lastly, we employ a neural adaptive procedure for feature extraction that is conditioned on a two-step learned transductive task representation, as opposed to a fixed feature-extractor. As we discuss in Section 4.2, this novel task-representation encoder is responsible for substantial performance gains on out-of-domain tasks.

3. METHOD
3.1. PROBLEM DEFINITION

Following (Snell et al., 2017; Bateni et al., 2020; Requeima et al., 2019; Finn et al., 2017), we focus on a few-shot classification setting where a distribution D over image classification tasks (S, Q) is provided for training. Each task (S, Q) ∼ D consists of a support set S = {(x_i, y_i)}_{i=1..n} of labelled images and a query set Q = {x*_i}_{i=1..m} of unlabelled images; the goal is to predict labels for these query examples, given the (typically small) support set. Each query image x*_i ∈ Q has a corresponding ground-truth label y*_i available at training time. A model is trained by minimizing, over some parameters θ (which are shared across tasks), the expected query set classification loss over tasks, E_{(S,Q)∼D}[ −Σ_{x*_i ∈ Q} log p_θ(y*_i | x*_i, S, Q) ]; the inclusion of the dependence on all of Q here allows the model to be transductive. At test time, a separate distribution of tasks generated from previously unseen images and classes is used to evaluate performance. We also define shot as the number of support examples per class, and way as the number of classes within the task.
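To make the episodic objective concrete, the following is a minimal numpy sketch of the per-task query loss, with a toy nearest-class-mean predictor standing in for the full transductive model p_θ(y* | x*, S, Q); all function names here are ours, not part of the method.

```python
import numpy as np

def query_log_probs(support_x, support_y, query_x, n_classes):
    """Toy predictor: class log-probabilities for each query example from a
    softmax over negative squared distances to class means. (A stand-in for
    p_theta(y* | x*, S, Q); the real model adapts features to the task.)"""
    means = np.stack([support_x[support_y == k].mean(axis=0)
                      for k in range(n_classes)])
    d2 = ((query_x[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    logits = -d2
    return logits - np.log(np.exp(logits).sum(-1, keepdims=True))

def episode_loss(support_x, support_y, query_x, query_y, n_classes):
    """Negative log-likelihood of the query labels for one task (S, Q)."""
    lp = query_log_probs(support_x, support_y, query_x, n_classes)
    return -lp[np.arange(len(query_y)), query_y].sum()
```

Training then amounts to averaging this loss over tasks sampled from D and taking gradient steps with respect to the shared parameters.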

3.2. SIMPLE CNAPS

Our method extends the Simple CNAPS (Bateni et al., 2020) architecture for few-shot visual classification. Simple CNAPS performs few-shot classification in two steps. First, it computes task-adapted features for every support and query example. This part of the architecture is the same as that in CNAPS (Requeima et al., 2019), and is based on the FiLM meta-learning framework (Perez et al., 2018). Second, it uses the support set to estimate a per-class Mahalanobis metric, which is used to assign query examples to classes. The architecture uses a ResNet18 (He et al., 2015) feature extractor. Within each residual block, Feature-wise Linear Modulation (FiLM) layers compute a scale factor and shift for each output channel, using block-specific adaptation networks that are conditioned on a task encoding. The task encoding g_θ(S) consists of the mean-pooled feature vectors of support examples produced by d_θ, a separate but end-to-end learned Convolutional Neural Network (CNN). This produces an adapted feature extractor f_θ (which implicitly depends on the support set S) that maps support/query images onto the corresponding adapted feature space. We will denote by S_θ, Q_θ versions of the support/query sets where each image is mapped into its feature representation z = f_θ(x).
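As a minimal illustration of the FiLM mechanism described above (a sketch only: in the actual architecture the scale γ and shift β are produced by block-specific adaptation networks conditioned on the task encoding, which is not modelled here):

```python
import numpy as np

def film(feature_map, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale and shift of a
    (C, H, W) feature map. gamma and beta are (C,) vectors that, in the
    full model, come from an adaptation network applied to g(S)."""
    return gamma[:, None, None] * feature_map + beta[:, None, None]
```

With γ = 1 and β = 0, the layer is the identity, so the adaptation network can learn to deviate from the pretrained features only where the task demands it.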
Simple CNAPS then computes a Mahalanobis distance relative to each class k by estimating a mean μ_k and a regularized covariance Q_k in the adapted feature space, using the support instances:

μ_k = (1/n_k) Σ_i I[y_i = k] z_i,   Q_k = λ_k Σ_k + (1 − λ_k) Σ + βI,   λ_k = n_k / (n_k + 1).

Here I[y_i = k] is the indicator function and n_k = Σ_i I[y_i = k] is the number of examples with class k in the support set S. The ratio λ_k balances a task-conditional sample covariance Σ and a class-conditional sample covariance Σ_k:

Σ = (1/n) Σ_i (z_i − μ)(z_i − μ)^T,   Σ_k = (1/n_k) Σ_i I[y_i = k] (z_i − μ_k)(z_i − μ_k)^T,

where μ = (1/n) Σ_i z_i is the task-level mean. When few support examples are available for a particular class, λ_k is small, and the estimate is regularized towards the task-level covariance Σ. As the number of support examples for the class increases, the estimate tends towards the class-conditional covariance Σ_k. Additionally, a regularizer βI (we set β = 1 in our experiments) is added to ensure invertibility.
Given the class means and covariances, Simple CNAPS computes class probabilities for each query feature vector z*_i through a softmax over the squared Mahalanobis distances with respect to each class:

p(y* = k | z*) ∝ exp( −(1/2)(z* − μ_k)^T Q_k^{−1} (z* − μ_k) ).   (3)
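The estimators and classifier above can be sketched in numpy as follows. This is a simplified stand-in: the real Simple CNAPS operates on task-adapted ResNet18 features, and the function names are ours.

```python
import numpy as np

def class_parameters(z, y, n_classes, beta=1.0):
    """Class means and regularized covariances from support features z (n, d)
    with labels y. Each Q_k blends the class covariance with the task
    covariance via lambda_k = n_k / (n_k + 1), plus beta*I for invertibility."""
    n, d = z.shape
    task_mu = z.mean(axis=0)
    task_cov = (z - task_mu).T @ (z - task_mu) / n
    mus, Qs = [], []
    for k in range(n_classes):
        zk = z[y == k]
        n_k = len(zk)
        mu_k = zk.mean(axis=0)
        cov_k = (zk - mu_k).T @ (zk - mu_k) / n_k
        lam = n_k / (n_k + 1.0)
        mus.append(mu_k)
        Qs.append(lam * cov_k + (1 - lam) * task_cov + beta * np.eye(d))
    return np.stack(mus), np.stack(Qs)

def predict(z_query, mus, Qs):
    """Softmax over (half) squared Mahalanobis distances to each class."""
    logits = []
    for mu, Q in zip(mus, Qs):
        diff = z_query - mu
        logits.append(-0.5 * np.einsum('nd,dn->n', diff,
                                       np.linalg.solve(Q, diff.T)))
    logits = np.stack(logits, axis=1)
    p = np.exp(logits - logits.max(1, keepdims=True))
    return p / p.sum(1, keepdims=True)
```

Note how, with a single support example per class, λ_k = 1/2 and the metric leans heavily on the pooled task covariance, which is exactly the low-shot regime the regularization targets.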

3.3. TRANSDUCTIVE CNAPS

Transductive CNAPS extends Simple CNAPS by taking advantage of the query set, both in the feature adaptation step and the classification step. First, the task encoder g_θ is extended to incorporate both a support-set embedding e_s and a query-set embedding e_q such that

e_s = (1/K) Σ_k (1/n_k) Σ_i I[y_i = k] d_θ(x_i),   e_q = (1/n_q) Σ_{i*} d_θ(x*_{i*}),

where d_θ is a learned CNN. The support embedding e_s is formed by an average of (encoded) support examples, with weighting inversely proportional to their class counts to prevent bias from class imbalance. The query embedding e_q uses simple mean-pooling; both e_s and e_q are invariant to permutations of the respective support/query instances. We then process e_s and e_q, in that order, through two steps of a Long Short-Term Memory (LSTM) network to generate the final transductive task-embedding g_θ(S, Q) used for adaptation. This process is visualized in Figure 3-a.

Algorithm 1 Iterative Refinement in Transductive CNAPS
1: procedure COMPUTE_QUERY_LABELS(S_θ, Q_θ, N_iter)
2:   For j ranging over support and query sets, initialize w_jk = 1 if (z'_j, y'_j) ∈ S_θ and y'_j = k, and w_jk = 0 otherwise
3:   for iter = 0 ... N_iter do
4:     update class parameters μ'_k, Q'_k from the responsibilities w_jk   ⊲ the first iteration is equivalent to Simple CNAPS
5:     update responsibilities w_jk for the query examples using equation 3
6:     break if the most probable class for each query example hasn't changed
7:   end for
8:   return class probabilities w_jk for j corresponding to Q_θ
9: end procedure

Second, we can interpret Simple CNAPS as a form of supervised clustering in feature space; each cluster (corresponding to a class k) is parameterized with a centroid μ_k and a metric Q_k^{−1}, and we interpret equation 3 as class assignment probabilities based on the distance to each centroid. With this viewpoint in mind, a natural extension to consider is to use the estimates of the class assignment probabilities on unlabelled data to refine the class parameters μ_k, Q_k in a soft k-means framework based on per-cluster Mahalanobis distances (Melnykov & Melnykov, 2014).
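The set embeddings e_s and e_q above can be sketched as follows (a simplified stand-in: d_θ's encoded features are taken as plain arrays, and the subsequent two-step LSTM processing is omitted; the function name is ours):

```python
import numpy as np

def task_embeddings(enc_support, y_support, enc_query, n_classes):
    """Permutation-invariant set embeddings: e_s averages the per-class means
    of encoded support examples (so large classes do not dominate), while
    e_q mean-pools the encoded query examples."""
    class_means = np.stack([enc_support[y_support == k].mean(axis=0)
                            for k in range(n_classes)])
    e_s = class_means.mean(axis=0)   # each class contributes weight 1/K
    e_q = enc_query.mean(axis=0)
    return e_s, e_q
```

Because both embeddings are built from sums over set elements, shuffling the support or query set leaves them unchanged, which is the permutation invariance the text notes.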
In this framework, as shown in Figure 1, we alternate between computing updated assignment probabilities using equation 3 on the query set and using those assignment probabilities to compute updated class parameters. We define R_θ = S_θ ⊔ Q_θ as the disjoint union of the support set and the query set. For each element of R_θ, which we index by j, we define responsibilities w_jk in terms of the class predictions when it is part of the query set and in terms of the label when it is part of the support set:

w_jk = p(y'_j = k | z'_j) if z'_j ∈ Q_θ,   w_jk = I[y'_j = k] if (z'_j, y'_j) ∈ S_θ.

Using these responsibilities we can incorporate the unlabelled examples by defining weighted estimates μ'_k and Q'_k:

μ'_k = (1/n'_k) Σ_j w_jk z'_j,   Q'_k = λ'_k Σ'_k + (1 − λ'_k) Σ' + βI,

where n'_k = Σ_j w_jk defines λ'_k = n'_k / (n'_k + 1), and the covariance estimates Σ' and Σ'_k are

Σ' = (1 / Σ_k n'_k) Σ_{jk} w_jk (z'_j − μ')(z'_j − μ')^T,   Σ'_k = (1/n'_k) Σ_j w_jk (z'_j − μ'_k)(z'_j − μ'_k)^T,   (6)

with μ' = (Σ_k n'_k)^{−1} Σ_{jk} w_jk z'_j being the task-level mean. These update equations are simply weighted versions of the original Simple CNAPS estimators from Section 3.2, and reduce to them exactly in the case of an empty query set. Algorithm 1 summarizes the soft k-means procedure based on these updates. We initialize our weights using only the labelled support set. We use those weights to compute class parameters, then compute updated weights using both the support and query sets. At this point, the weights associated with the query set Q are the same class probabilities as estimated by Simple CNAPS. However, we continue this procedure iteratively until we either reach a maximum number of iterations or class assignments argmax_k w_jk stop changing. Unlike the transductive task-encoder, this second extension, namely the soft k-means iterative estimation of class parameters, is used at test time only.
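The refinement loop can be sketched end-to-end in numpy as below. This is a simplified sketch: β = 1 is fixed, the minimum-step requirement used in the experiments is omitted, and the function names are ours.

```python
import numpy as np

def weighted_params(z, w, beta=1.0):
    """Weighted class means and regularized covariances from features
    z (n, d) and responsibilities w (n, K); support rows carry one-hot
    weights, query rows carry soft assignment probabilities."""
    n, d = z.shape
    K = w.shape[1]
    n_k = w.sum(axis=0)                      # soft class counts n'_k
    total = n_k.sum()
    row = w.sum(axis=1)                      # per-example total weight
    mu = (row @ z) / total                   # task-level mean
    diff = z - mu
    task_cov = (row[:, None] * diff).T @ diff / total
    mus = (w.T @ z) / n_k[:, None]           # class means mu'_k
    Qs = np.empty((K, d, d))
    for k in range(K):
        dk = z - mus[k]
        cov_k = (w[:, k, None] * dk).T @ dk / n_k[k]
        lam = n_k[k] / (n_k[k] + 1.0)        # lambda'_k = n'_k / (n'_k + 1)
        Qs[k] = lam * cov_k + (1 - lam) * task_cov + beta * np.eye(d)
    return mus, Qs

def refine(z_support, y_support, z_query, K, max_iter=4):
    """Soft k-means refinement in the spirit of Algorithm 1: support rows keep
    fixed one-hot labels, query rows get Mahalanobis-softmax responsibilities,
    and iteration stops early once hard assignments stabilize."""
    n_s = len(y_support)
    z = np.vstack([z_support, z_query])
    w = np.zeros((len(z), K))
    w[np.arange(n_s), y_support] = 1.0       # initialize from support only
    prev = None
    for _ in range(max_iter):
        mus, Qs = weighted_params(z, w)
        logits = np.stack([
            -0.5 * np.einsum('nd,dn->n', z_query - mu,
                             np.linalg.solve(Q, (z_query - mu).T))
            for mu, Q in zip(mus, Qs)], axis=1)
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        w[n_s:] = p
        hard = p.argmax(1)
        if prev is not None and np.array_equal(hard, prev):
            break                            # assignments stopped changing
        prev = hard
    return w[n_s:]
```

Because the query rows start with zero weight, the first parameter estimate uses only the support set, so the first round of responsibilities matches what Simple CNAPS would predict, as the text describes.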
During training, a single estimation is produced for both mean and covariance using only the support examples; as we discuss further in Section 4.2, this empirically performs better. See Figure 3-b for a high-level visual comparison of classification in Simple CNAPS vs. Transductive CNAPS.

4. EXPERIMENTS

4.1. BENCHMARKS

Meta-Dataset (Triantafillou et al., 2019) is a few-shot image classification benchmark that consists of 10 widely used datasets: ILSVRC-2012 (ImageNet) (Russakovsky et al., 2015), Omniglot (Lake et al., 2015), FGVC-Aircraft (Aircraft) (Maji et al., 2013), CUB-200-2011 (Birds) (Wah et al., 2011), Describable Textures (DTD) (Cimpoi et al., 2014), QuickDraw (Jongejan et al., 2016), FGVCx Fungi (Fungi) (Schroeder & Cui, 2018), VGG Flower (Flower) (Nilsback & Zisserman, 2008), Traffic Signs (Signs) (Houben et al., 2013) and MSCOCO (Lin et al., 2014). Consistent with past work (Requeima et al., 2019; Bateni et al., 2020), we train our model on the official training splits of the first 8 datasets and use the test splits to evaluate in-domain performance. We use the remaining two datasets as well as three external benchmarks, namely MNIST (LeCun & Cortes, 2010), CIFAR10 (Krizhevsky, 2009) and CIFAR100 (Krizhevsky, 2009), for out-of-domain evaluation. Task generation in Meta-Dataset follows a complex procedure where tasks can be of different ways and individual classes can be of varying shots even within the same task. Specifically, for each task, the task way is first sampled uniformly between 5 and 50, and way classes are selected at random from the corresponding class/dataset split. Then, for each class, 10 instances are sampled at random and used as query examples for the class, while of the remaining images for the class, a shot is sampled uniformly from [1, 100] and shot images are selected at random as support examples, subject to a maximum total support set size of 500.
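The way/shot sampling just described can be sketched as follows. This is heavily simplified: the dataset-specific constraints and the cap on total support set size mentioned in the text are omitted, and the function name is ours.

```python
import random

def sample_task_spec(n_classes_available, images_per_class, seed=0):
    """Sketch of Meta-Dataset-style task sampling: draw the task way
    uniformly in [5, 50], then a per-class shot uniformly in [1, 100],
    keeping 10 images per class in reserve for the query set."""
    rng = random.Random(seed)
    way = rng.randint(5, min(50, n_classes_available))
    shots = []
    for _ in range(way):
        max_shot = min(100, images_per_class - 10)  # reserve 10 query images
        shots.append(rng.randint(1, max_shot))
    return way, shots
```

Running this repeatedly makes the skew noted in the text plausible: datasets with few classes or few images per class clip the way and shot draws from above, pulling most tasks toward small ways and small shots.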
Additional dataset-specific constraints are enforced, as discussed in Section 3.2 of Triantafillou et al. (2019), and since some datasets have fewer than 50 classes and fewer than 100 images per class, the overall way and shot distributions resemble Poisson distributions where most tasks have fewer than 10 classes and most classes have fewer than 10 support examples (see Appendix-A.1). Following Bateni et al. (2020) and Requeima et al. (2019), we first train our ResNet18 feature extractor on the Meta-Dataset-defined training split of ImageNet following the procedure in Appendix-A.3. The ResNet18 parameters are then kept fixed while we train the adaptation network on a total of 110K sampled tasks using episodic training (Snell et al., 2017; Finn et al., 2017)

(see Appendix-A.3 for details).

mini/tiered-ImageNet (Vinyals et al., 2016; Ren et al., 2018) are two benchmarks for few-shot learning. Both datasets employ subsets of ImageNet (Russakovsky et al., 2015), with a total of 100 classes and 60K images in mini-ImageNet and 608 classes and 779K images in tiered-ImageNet. Unlike Meta-Dataset, tasks across these datasets have pre-defined shots/ways that are uniform across every task generated in the specified setting. Following (Nichol et al., 2018; Liu et al., 2018; Snell et al., 2017), we report performance on the 1/5-shot 5/10-way settings across both datasets with 10 query examples per class. We first train the ResNet18 on the training set of the corresponding benchmark at hand following the procedure noted in Appendix-A.4. We also consider a more feature-rich ResNet18 trained on the larger ImageNet dataset. However, we exclude classes and examples from test sets of mini/tiered-ImageNet to address potential class/example overlap issues, resulting in 825 classes and 1,055,494 images remaining. Then, with the ResNet18 parameters fixed, we train episodically for 20K tasks (see Appendix-A.2 for details).

4.2. RESULTS

Evaluation on Meta-Dataset: In-domain, out-of-domain and overall rankings on Meta-Dataset are shown in Table 1, following the evaluation protocol of Bateni et al. (2020) and Requeima et al. (2019).

Table 2: Few-shot visual classification results on 1/5-shot 5/10-way tasks on mini/tiered-ImageNet. For CNAPS-based models, "FETI" indicates that the feature extractor was trained on ImageNet (Russakovsky et al., 2015) excluding classes within the test splits of mini/tiered-ImageNet (for more details see Appendix-[TBD]). "BN" indicates implicit transductive conditioning on the query set through the use of batch normalization. Error intervals indicate 95% confidence intervals.

Transductive CNAPS sets new state of the art accuracy on 2 out of the 8 in-domain datasets, while matching other methods on 2 of the remaining domains. On out-of-domain tasks, it performs even better, achieving new state of the art performance on 4 out of the 5 out-of-domain datasets. Overall, it produces an average rank of 1.9 across all datasets, the best among all methods; its average rank of 2.1 on in-domain tasks is second only to URT, which was developed in parallel to Transductive CNAPS, and its average rank of 1.6 on out-of-domain tasks is the best among even the most recent methods.

Evaluation on mini/tiered-ImageNet: We consider two feature extractor training settings on these benchmarks. First, we employ the feature extractor trained on the corresponding training split of mini/tiered-ImageNet. As shown in Table 2, on tiered-ImageNet, Transductive CNAPS achieves state of the art performance on both 10-way settings while matching the state of the art accuracy of LEO (Rusu et al., 2018) on the 5-way settings. On mini-ImageNet, Transductive CNAPS outperforms other methods on 10-way settings while coming second to LEO (Rusu et al., 2018) and TADAM (Oreshkin et al., 2018) on 5-way.
We attribute this difference in performance between mini-ImageNet and tiered-ImageNet to the fact that mini-[...]

We also evaluate a variant that performs the iterative soft k-means updates during training as well as at test time; results are shown in Table 3, with "Transductive+ CNAPS" denoting this variation. Iterative updates during training result in an average accuracy decrease of 2.5%, which we conjecture to be due to training instabilities caused by applying this iterative algorithm early in training on noisy features.

Transductive Feature Extraction vs. Classification: Our approach extends Simple CNAPS in two ways: improved adaptation of the feature extractor using a transductive task-encoding, and the soft k-means iterative estimation of class means and covariances. We perform two ablations, "Feature Extraction Only Transductive" (FEOT) and "Classification Only Transductive" (COT), to independently assess the impact of these extensions. The results are presented in Table 3. Both extensions outperform Simple CNAPS. The transductive task-encoding is especially effective on out-of-domain tasks, whereas the soft k-means estimation of class parameters boosts accuracy on in-domain tasks. Transductive CNAPS leverages the best of both worlds, allowing it to achieve statistically significant gains over Simple CNAPS overall.

Comparison to Gaussian Mixture Models: The Mahalanobis-distance-based class probabilities produced by equation 3 closely resemble the cluster posterior probabilities (responsibilities) inferred by a Gaussian Mixture Model (GMM). The only changes required to make this correspondence exact are to introduce a class prior distribution π, and to change the class probability model in equation 3 to the Gaussian likelihood:

p(y* = k | z*) ∝ π(y* = k) exp( −(1/2)(z* − μ_k)^T Q_k^{−1} (z* − μ_k) − (1/2) log |Q_k| )   (8)

With these modifications, Transductive CNAPS would exactly correspond to inference in a GMM, with cluster parameters learned through semi-supervised expectation maximization (EM).
Given this observation, we consider five GMM-based ablations of our method in which the log-determinant term is introduced (a uniform class prior is used). These ablations are presented in Table 3 and correspond to their soft k-means counterparts in the same order. The GMM-based variations of our method and of Simple CNAPS result in a notable 4-8% loss in overall accuracy. It is also surprising that the FEOT variation matches the performance of the full GMM-EM model. Performance is, however, degraded when a larger minimum number of refinement steps is required, as that may lead to over-fitting.
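For reference, the GMM-style posterior of equation 8 differs from the equation 3 classifier only by the prior and log-determinant terms; a numpy sketch (the function name is ours):

```python
import numpy as np

def gmm_posterior(z_query, mus, Qs, prior=None):
    """Cluster posteriors for the GMM ablation: the Mahalanobis softmax of
    equation 3 augmented with a -1/2 log|Q_k| term and a class prior
    (uniform by default, as in the ablations)."""
    K = len(mus)
    if prior is None:
        prior = np.full(K, 1.0 / K)
    logits = []
    for k in range(K):
        diff = z_query - mus[k]
        maha = np.einsum('nd,dn->n', diff, np.linalg.solve(Qs[k], diff.T))
        _, logdet = np.linalg.slogdet(Qs[k])
        logits.append(np.log(prior[k]) - 0.5 * maha - 0.5 * logdet)
    logits = np.stack(logits, axis=1)
    p = np.exp(logits - logits.max(1, keepdims=True))
    return p / p.sum(1, keepdims=True)
```

With identity covariances and a uniform prior, the log-determinant and prior terms cancel and this reduces to the plain Mahalanobis softmax, which makes the observed 4-8% accuracy gap attributable to the log-determinant's effect when covariances differ across classes.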

Maximum and Minimum Number of Refinements:

In our experiments, we use a minimum of 2 refinement steps of class parameters, with the maximum set to 4 on Meta-Dataset and 10 on the mini/tiered-ImageNet benchmarks. We explore the impact of these hyperparameters on the performance of Transductive CNAPS on Meta-Dataset in Figure 5. As shown, requiring the same number of refinement steps for every task results in suboptimal performance: peak performance for each minimum number of steps is achieved with a larger maximum number of steps, showing the importance of allowing different numbers of refinement steps depending on the task. In addition, we observe that as the minimum number of refinement steps increases, performance improves up to two steps and declines thereafter. This suggests that, unlike Ren et al. (2018), where only a single refinement step leads to the best performance, our Mahalanobis-based approach can leverage extra steps to further refine the class parameters. We do see a decline in performance with a higher number of steps; this suggests that while our refinement criterion can be effective at performing different numbers of steps depending on the task, it can potentially lead to over-fitting, justifying the need for a carefully chosen maximum number of steps.

5. DISCUSSION

In this paper, we have presented a few-shot visual classification method that achieves new state of the art performance via a transductive clustering procedure for refining class parameters, derived from a previous neural adaptive Mahalanobis-distance-based approach. The resulting architecture, Transductive CNAPS, produces more useful test-time estimates of class means and covariances, especially in low-shot settings. Although we demonstrate the efficacy of our approach in the transductive setting, where the query examples themselves serve as unlabelled data, our soft k-means clustering procedure extends naturally to other sources of unlabelled examples in a semi-supervised fashion. Transductive CNAPS superficially resembles a transductive GMM stacked on top of a learned feature representation; however, when we make this connection exact (by including the log-determinant of the class covariances), we suffer substantial performance hits. Explaining why this happens will be the subject of future work.




Figure 2: Overview of the neural adaptive feature extraction process used in Transductive/Simple CNAPS. Figure was adapted from Bateni et al. (2020).

Figure 3: a) Overview of the transductive task-encoding procedure, g_θ(S, Q), used in Transductive CNAPS. b) Transductive CNAPS (right) extends the Mahalanobis-distance-based classifier of Simple CNAPS (left) through transductive soft k-means clustering of the visual space.

Cluster parameters µ_k, Q_k are updated according to Equation 6.

Figure 4: Class recall (i.e., in-class query accuracy) averaged across classes and tasks on (a) in-domain, (b) out-of-domain, and (c) all Meta-Dataset datasets. In each panel, class recalls are grouped, averaged, and plotted according to class shot.

Even when compared to Simple CNAPS using the same example-rich feature extractor, Transductive CNAPS outperforms the baseline by strong margins; this comparison demonstrates the gains obtained from leveraging the additional query-set information.

Performance vs. Class Shot: In Figure 4, we examine the relationship between class recall (i.e., accuracy among query examples belonging to the class itself) and the number of support examples in the class (the shot). As shown, Transductive CNAPS is very effective when the class shot is below 10, showing large average recall improvements, especially at the 1-shot level. However, as the class shot increases beyond 10, performance drops relative to Simple CNAPS. This suggests that soft k-means learning of cluster parameters is effective when very few support examples are available; conversely, in high-shot classes, transductive updates can act as distractors.

Training with Classification-Time Soft K-means Clustering: In our work, we use soft k-means iterative updates of means and covariances at test time only. It is natural to consider training the feature adaptation network end-to-end through the soft k-means transduction procedure. We provide this comparison in the bottom half of Table 3.

Figure 5: Evaluating Transductive CNAPS on Meta-Dataset with different minimum and maximum numbers of refinement steps. As shown, performance improves when a minimum of 2 refinement steps is required, with the best results observed at a maximum of 4 refinement steps. Performance degrades, however, when a larger minimum number of refinement steps is required, as this may lead to over-fitting.

Figure 1: The soft k-means Mahalanobis-distance-based clustering method used in Transductive CNAPS. First, cluster parameters are initialized using the support examples. Then, during cluster-update iterations, query examples are assigned class probabilities as soft labels; subsequently, both soft-labelled query examples and labelled support examples are used to estimate new cluster parameters.

Motivated by these observations, we explore the use of unlabelled examples through transductive learning within the same framework as Simple CNAPS. Our contributions are as follows. (1) We propose a transductive few-shot learner, Transductive CNAPS, that extends Simple CNAPS with a transductive two-step task encoder, as well as an iterative soft k-means procedure for refining class parameter estimates (mean and covariance) using both labelled and unlabelled examples. (2)
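The procedure described in Figure 1 can be sketched as follows. This is a simplified NumPy sketch under our own assumptions: a single regularized covariance shared across classes and a fixed number of iterations, rather than the per-class estimates and adaptive step count of the full method.

```python
import numpy as np

def soft_assign(z, mus, Q_inv):
    """Soft labels from negative squared Mahalanobis distances."""
    d2 = np.array([[(x - m) @ Q_inv @ (x - m) for m in mus] for x in z])
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def soft_kmeans(support_z, support_y, query_z, num_steps=2, reg=1.0):
    """Initialize class means from the support set, then alternate
    (i) soft-labelling queries and (ii) re-estimating means from hard
    support labels plus soft query labels."""
    n, d = support_z.shape
    K = support_y.max() + 1
    onehot = np.eye(K)[support_y]                 # (n, K) hard labels
    # Shared covariance with identity regularization (simplification
    # of the per-class regularized estimates used by the method).
    Q = np.cov(support_z, rowvar=False) + reg * np.eye(d)
    Q_inv = np.linalg.inv(Q)
    mus = (onehot.T @ support_z) / onehot.sum(0)[:, None]
    for _ in range(num_steps):
        resp = soft_assign(query_z, mus, Q_inv)   # soft-label queries
        w = np.vstack([onehot, resp])             # support + soft queries
        feats = np.vstack([support_z, query_z])
        mus = (w.T @ feats) / w.sum(0)[:, None]   # re-estimate means
    return soft_assign(query_z, mus, Q_inv)
```

The query responsibilities returned by the final `soft_assign` call serve directly as the predicted class probabilities.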

We pretrain our ResNet feature extractor on the training split of the ImageNet subset of Meta-Dataset. As demonstrated,

Few-shot classification results on Meta-Dataset, MNIST, and CIFAR-10/100. Error intervals show the 95% confidence interval, and bold values indicate statistically significant state of the art performance. Average rank is obtained by ranking methods on each dataset and averaging the ranks.

mini-ImageNet provides only 38,400 training examples, compared to the 448,695 provided by tiered-ImageNet. This results in a lower-performing ResNet-18 feature extractor (which is trained in a traditional supervised manner). This hypothesis is further supported by the results of our second model (denoted "FETI", for "Feature Extractor Trained with ImageNet", in Table 2). In this model, we train the feature extractor with a much larger subset of ImageNet, carefully selected to prevent any possible overlap (in examples or classes) with the test sets of mini/tiered-ImageNet. In this case, Transductive CNAPS is able to take advantage of the more example-rich feature extractor, establishing state of the art performance across the board.

Performance of various ablations of Transductive and Simple CNAPS on Meta-Dataset. Error intervals show the 95% confidence interval, and bold values indicate statistically significant state of the art performance.

