MELR: META-LEARNING VIA MODELING EPISODE-LEVEL RELATIONSHIPS FOR FEW-SHOT LEARNING

Abstract

Most recent few-shot learning (FSL) approaches are based on episodic training, whereby each episode samples a few training instances (shots) per class to imitate the test condition. However, this strict adherence to the test condition has a negative side effect: the trained model is susceptible to poorly sampled shots. In this work, this problem is addressed for the first time by exploiting inter-episode relationships. Specifically, a novel meta-learning via modeling episode-level relationships (MELR) framework is proposed. By sampling two episodes containing the same set of classes for meta-training, MELR is designed to ensure that the meta-learned model is robust against the presence of poorly-sampled shots in the meta-test stage. This is achieved through two key components: (1) a Cross-Episode Attention Module (CEAM) that alleviates the effects of poorly-sampled shots, and (2) a Cross-Episode Consistency Regularization (CECR) that enforces consistency between the two classifiers learned from the two episodes even when there are unrepresentative instances. Extensive experiments for non-transductive standard FSL on two benchmarks show that our MELR achieves 1.0%-5.0% improvements over the baseline (i.e., ProtoNet) and outperforms the latest competitors under the same settings.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) have achieved tremendous success in a wide range of computer vision tasks, including object recognition (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Russakovsky et al., 2015; He et al., 2016a), semantic segmentation (Long et al., 2015; Chen et al., 2018), and object detection (Ren et al., 2015; Redmon et al., 2016). For most visual recognition tasks, at least hundreds of labeled training images are required from each class to train a CNN model. However, collecting a large number of labeled training samples is costly and may even be impossible in real-life application scenarios (Antonie et al., 2001; Yang et al., 2012). To reduce the reliance of deep neural networks on large amounts of annotated training data, few-shot learning (FSL) has been studied (Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017; Sung et al., 2018), which aims to recognize a set of novel classes with only a few labeled samples by transferring knowledge from a set of base classes with abundant samples.

2. RELATED WORK

… limited number of gradient update steps. (3) Optimization-based methods (Ravi & Larochelle, 2017; Munkhdalai & Yu, 2017; Li et al., 2017) aim to learn to optimize, that is, to meta-learn optimization algorithms suitable for quick finetuning from base to novel classes. (4) Hallucination-based methods (Hariharan & Girshick, 2017; Wang et al., 2018; Schwartz et al., 2018; Li et al., 2020) learn generators on base classes and then hallucinate new novel class data to augment the few shots. Additionally, there are also other methods that learn to predict network parameters given few novel class samples (Qiao et al., 2018; Gidaris & Komodakis, 2019; Guo & Cheung, 2020). Although the metric-based ProtoNet (Snell et al., 2017) is used as our baseline in this paper, our proposed MELR framework can be easily integrated with other episodic-training based methods.

Modeling Episode-Level Relationships.
In the FSL area, relatively little effort has been made to explicitly model the relationships across episodes. Two recent works do model such episode-level relationships: (1) LGM-Net (Li et al., 2019b) proposes an inter-task normalization strategy, which applies batch normalization to all support samples across a batch of episodes in each training iteration. (2) Among a batch of episodes, Meta-Transfer Learning (Sun et al., 2019) records the class with the lowest accuracy in each episode and then re-samples 'hard' meta-tasks from the set of recorded classes. In this work, instead of utilizing the relationships implicitly, we propose modeling episode-level relationships (MELR) explicitly by focusing on episodes with the same set of classes. Furthermore, our MELR is specifically designed to cope with the poor sampling of the few shots, an objective very different from those in (Li et al., 2019b; Sun et al., 2019).

Attention Mechanism. The attention mechanism was first proposed by Bahdanau et al. (2015) for machine translation and has since achieved great success in natural language processing (Vaswani et al., 2017) and computer vision (Xu et al., 2015). An attention module typically takes a triplet (queries, keys, values) as input and learns interactions between queries and key-value pairs according to certain task objectives. It is referred to as self-attention or cross-attention depending on whether the queries and keys come from the same set. Several recent works (Hou et al., 2019; Guo & Cheung, 2020; Ye et al., 2020) have utilized the attention mechanism for meta-learning based FSL. CAN (Hou et al., 2019) employs cross-attention between support and query samples to learn better feature representations. AWGIM (Guo & Cheung, 2020) adopts both self- and cross-attention for generating classification weights. FEAT (Ye et al., 2020) only uses self-attention on the class prototypes of the support set.
The biggest difference between these methods and our MELR lies in whether attention is modeled within each episode or across episodes. Only MELR allows modeling cross-episode instance attention explicitly so that the meta-learned model can be insensitive to badly-sampled support set instances. In addition, in our MELR, query set instances are also updated using cross-attention whilst existing models such as FEAT only apply attention to prototypes obtained using support set instances. They thus cannot directly handle instance-level anomalies. 

3. METHODOLOGY

3.1 PROBLEM DEFINITION

Let D_b = {(x_i, y_i) | y_i ∈ C_b, i = 1, 2, ..., N_b} denote an abundant meta-training set from base classes C_b, where x_i is the i-th image, y_i denotes the class label of x_i, and N_b is the number of images in D_b. Similarly, let D_n = {(x_i, y_i) | y_i ∈ C_n, i = 1, 2, ..., N_n} denote a few-shot sample set from a set of novel classes C_n, where C_b ∩ C_n = ∅.

3.2. META-LEARNING BASED FSL

Most FSL methods are based on meta-learning (Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017; Sung et al., 2018; Lee et al., 2019; Ye et al., 2020), which adopt episodic training on the base class sample set D_b and test their models over few-shot classification tasks sampled from the novel classes C_n.

Figure 1: The schematic illustration of the proposed MELR model. It consists of two main components for modeling episode-level relationships: Cross-Episode Attention Module (CEAM) and Cross-Episode Consistency Regularization (CECR). For clarity, only the 5-way 1-shot setting is presented here. Each red/blue cuboid denotes a single instance.

Concretely, an N-way K-shot Q-query episode e = (S_e, Q_e) consists of the support set S_e = {(x_i, y_i) | y_i ∈ C_e, i = 1, 2, ..., N×K} and the query set Q_e = {(x_i, y_i) | y_i ∈ C_e, i = 1, 2, ..., N×Q} (S_e ∩ Q_e = ∅), where C_e is a sampled subset of N classes. A meta-learning based FSL approach typically defines a few-shot classification loss over the query set Q_e for each meta-training episode e:

L_fsc(e) = E_{(x_i, y_i) ∈ Q_e} L(y_i, f(ψ(x_i); S_e)),   (1)

where ψ denotes a feature extractor with output dimension d, f(·; S_e): R^d → R^N can be any scoring function constructed from the support set S_e, and L(·, ·) is the classification loss (e.g., the widely used cross-entropy). By minimizing this loss via back-propagation to update the part of the model to be meta-learned (e.g., ψ in ProtoNet), the model is trained over many meta-training episodes and then evaluated on the meta-test episodes.
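As a concrete illustration of episode construction, here is a minimal NumPy sketch; `sample_episode` is a hypothetical helper written for this purpose, not code from the paper:

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, q_query, rng):
    """Sample an N-way K-shot Q-query episode from a labeled pool.

    Returns index arrays for the support set S_e and query set Q_e,
    plus the sampled class subset C_e.
    """
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        support.extend(idx[:k_shot])                 # K shots per class
        query.extend(idx[k_shot:k_shot + q_query])   # Q queries per class
    return np.array(support), np.array(query), classes

# Toy base set: 10 classes with 30 images each (labels only, for illustration).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 30)
s_idx, q_idx, c_e = sample_episode(labels, n_way=5, k_shot=1, q_query=15, rng=rng)
```

Note that the support and query indices are disjoint by construction, mirroring S_e ∩ Q_e = ∅ in the text.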

3.3. MODELING EPISODE-LEVEL RELATIONSHIPS (MELR)

In the FSL area, relatively little effort has been made to explicitly model the relationships across episodes, and many FSL methods define their loss functions within each episode independently. In contrast, with our Modeling Episode-Level Relationships (MELR) framework, two N-way episodes are sampled in each training iteration from exactly the same set of N base classes. Cross-Episode Attention Module (CEAM) and Cross-Episode Consistency Regularization (CECR) are then devised to exploit this type of episode-level relationship explicitly (see Figure 1).

Cross-Episode Attention Module (CEAM). We are given two N-way K-shot Q-query episodes e^(1) = (S^(1)_e, Q^(1)_e) and e^(2) = (S^(2)_e, Q^(2)_e) sampled from the same subset C_e of C_b, where C_e is re-indexed as C_e = {1, 2, ..., N} and e^(1) ∩ e^(2) = ∅. For both episodes, to minimize the negative impact of badly-sampled few shots for a given query instance, we propose CEAM for cross-episode attention modeling, which is detailed below.

Concretely, let S^(1) = [ψ(x_i)^T; x_i ∈ S^(1)_e] ∈ R^{NK×d} (or S^(2)) and Q^(1) = [ψ(x_i)^T; x_i ∈ Q^(1)_e] ∈ R^{NQ×d} (or Q^(2)) denote the feature matrices of support and query samples in e^(1) (or e^(2)), respectively, and let F^(1) = [S^(1); Q^(1)] ∈ R^{N(K+Q)×d} (or F^(2) = [S^(2); Q^(2)]) be the feature matrix of all samples in e^(1) (or e^(2)). For episode e^(1), CEAM takes the triplet (F^(1), S^(2), S^(2)) as input, which corresponds to the input (queries, keys, values) of a typical attention module:

F̂^(1) = CEAM(F^(1), S^(2), S^(2)) = F^(1) + softmax(F^(1)_Q (S^(2)_K)^T / √d) S^(2)_V,   (2)

where the inputs are first linearly mapped into a latent space with the same dimension as the feature space (using projection matrices W_Q, W_K, W_V ∈ R^{d×d}):

F^(1)_Q = F^(1) W_Q ∈ R^{N(K+Q)×d},   (3)
S^(2)_K = S^(2) W_K ∈ R^{NK×d},   (4)
S^(2)_V = S^(2) W_V ∈ R^{NK×d}.   (5)

Similarly, for episode e^(2), we have (analogous to Eq. (2)):

F̂^(2) = CEAM(F^(2), S^(1), S^(1)) = F^(2) + softmax(F^(2)_Q (S^(1)_K)^T / √d) S^(1)_V,   (6)

where the learnable parameters of the fully connected layers (i.e., W_Q, W_K and W_V) are shared across Eq. (2) and Eq. (6). We can then obtain the transformed support and query embedding matrices in e^(1) (or e^(2)) from F̂^(1) = [Ŝ^(1); Q̂^(1)] (or F̂^(2) = [Ŝ^(2); Q̂^(2)]).

Cross-Episode Consistency Regularization (CECR). In our MELR model, CEAM utilizes instance-level attention to alleviate the negative effects of poor support set instance sampling so that each query set instance can be assigned to the right class with minimal loss. Our CECR is designed to further reduce the model's sensitivity to badly-sampled shots in different episodes by forcing the two classifiers learned over the two episodes to produce consistent predictions. There are various options for how to enforce such consistency; CECR adopts a knowledge distillation based strategy since empirically it is the most effective one (see Section 4.3).

Let f(·; Ŝ^(1)): R^d → R^N and f(·; Ŝ^(2)): R^d → R^N be the scoring functions of the two classifiers constructed from Ŝ^(1) and Ŝ^(2), respectively. To determine which classifier/scoring function is stronger, we compute the few-shot classification accuracies of the two classifiers on the merged query samples from both episodes. Concretely, let Q̂^(1)_e = {(q̂^(1)_i, y_i) | q̂^(1)_i ∈ R^d, i = 1, 2, ..., NQ} (or Q̂^(2)_e) denote the set of transformed embedding vectors of query samples in e^(1) (or e^(2)), where q̂^(1)_i (or q̂^(2)_i) denotes the i-th row of the transformed embedding matrix Q̂^(1) (or Q̂^(2)). With Q̂^(1,2)_e = Q̂^(1)_e ∪ Q̂^(2)_e = {(q̂^(1,2)_i, y^(1,2)_i), i = 1, 2, ..., 2NQ}, we are able to compute the few-shot classification accuracies of the two classifiers w.r.t. the ground-truth labels and the corresponding predictions, i.e., argmax_j σ_j(f(q̂^(1,2)_i; Ŝ^(1))) and argmax_j σ_j(f(q̂^(1,2)_i; Ŝ^(2))) (j = 1, 2, ..., N) for (q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e, where σ_j(v) = exp(v_j) / Σ_{j'=1}^{N} exp(v_{j'}) (v ∈ R^N) is the softmax function. The classifier with the higher accuracy is thus considered to be the stronger one and subsequently used as the teacher, with the student forced to behave consistently with it. Without loss of generality, we assume that f(·; Ŝ^(1)) is stronger than f(·; Ŝ^(2)). We choose the knowledge distillation loss (Hinton et al., 2015) for CECR:

L_cecr(e^(1), e^(2); T) = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} L'(f(q̂^(1,2)_i; Ŝ^(1)), f(q̂^(1,2)_i; Ŝ^(2)); T),   (7)

where T is the temperature parameter as used in (Hinton et al., 2015). More specifically, when the tempered softmax σ_j(v; T) = exp(v_j/T) / Σ_{j'=1}^{N} exp(v_{j'}/T) (v ∈ R^N, j = 1, 2, ..., N) is used, we define L'(·, ·; T) in Eq. (7) with the cross-entropy loss:

L'(f(q̂^(1,2)_i; Ŝ^(1)), f(q̂^(1,2)_i; Ŝ^(2)); T) = −Σ_{j=1}^{N} σ_j(f(q̂^(1,2)_i; Ŝ^(1)); T) log σ_j(f(q̂^(1,2)_i; Ŝ^(2)); T).   (8)

Note that we cut off the gradients through f(·; Ŝ^(1)) when back-propagating, since the output of the teacher scoring function is treated as the soft target for the student.
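To make the CEAM computation of Eqs. (2)-(6) concrete, here is a minimal NumPy sketch with randomly initialized (untrained) projection matrices; the shapes follow the text, but the class and variable names are our own:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class CEAM:
    """Sketch of the Cross-Episode Attention Module.

    W_q, W_k, W_v are the shared d x d projection matrices; they are
    random here purely for illustration (in practice they are learned).
    """
    def __init__(self, d, rng):
        self.W_q = rng.standard_normal((d, d)) / np.sqrt(d)
        self.W_k = rng.standard_normal((d, d)) / np.sqrt(d)
        self.W_v = rng.standard_normal((d, d)) / np.sqrt(d)

    def __call__(self, F, S):
        # F: all N(K+Q) embeddings of one episode (the attention queries);
        # S: the NK support embeddings of the *other* episode (keys and values).
        d = F.shape[1]
        attn = softmax((F @ self.W_q) @ (S @ self.W_k).T / np.sqrt(d))
        return F + attn @ (S @ self.W_v)   # residual connection as in Eq. (2)

# 5-way 1-shot, 15-query toy episode pair with feature dimension d = 64.
rng = np.random.default_rng(0)
N, K, Q, d = 5, 1, 15, 64
ceam = CEAM(d, rng)
F1 = rng.standard_normal((N * (K + Q), d))  # all samples of e(1)
S2 = rng.standard_normal((N * K, d))        # support samples of e(2)
F1_hat = ceam(F1, S2)                       # transformed embeddings of e(1)
```

Applying the same `ceam` object to (F2, S1) gives the transformed embeddings of the second episode, reflecting the weight sharing across Eq. (2) and Eq. (6).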

Algorithm 1: MELR-based FSL.
Input: our MELR model with the set of all parameters Θ; the base class sample set D_b; hyper-parameters λ, T.
Output: the learned model.
1: for iteration = 1, 2, ..., MaxIteration do
2:    Randomly sample e^(1) and e^(2) from D_b, satisfying that C^(1)_e = C^(2)_e and e^(1) ∩ e^(2) = ∅;
3:    Compute F̂^(1) for e^(1) using CEAM with Eq. (2), and obtain F̂^(2) with Eq. (6) similarly;
4:    Compute L_fsc(e^(1)) and L_fsc(e^(2)) with Eq. (9), respectively;
5:    Construct Q̂^(1,2)_e = Q̂^(1)_e ∪ Q̂^(2)_e based on the two episodes;
6:    Determine the teacher episode e^(t) and the student episode e^(s) by computing the few-shot classification accuracies of the two classifiers within e^(1) and e^(2), respectively;
7:    Compute the CECR loss L_cecr(e^(t), e^(s); T) with Eq. (7);
8:    Compute the total loss L_total with Eq. (10);
9:    Compute the gradients ∇_Θ L_total;
10:   Update Θ using stochastic gradient descent;
11: end for
12: return the learned model.
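The teacher selection and the knowledge-distillation form of CECR (Eqs. (7)-(8)) can be sketched in NumPy as follows; the helper names are our own, and in a real implementation the teacher logits would additionally be detached from the computation graph:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Tempered softmax sigma(v; T), applied row-wise and numerically stable."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pick_teacher(logits1, logits2, labels):
    """Return (teacher, student) logits over the merged query set:
    the classifier with the higher accuracy teaches the other one."""
    acc1 = (logits1.argmax(axis=-1) == labels).mean()
    acc2 = (logits2.argmax(axis=-1) == labels).mean()
    return (logits1, logits2) if acc1 >= acc2 else (logits2, logits1)

def cecr_loss(teacher_logits, student_logits, T):
    """Cross-entropy between tempered teacher and student distributions (Eq. (8)).
    The teacher output is treated as a fixed soft target (no gradient)."""
    p_t = softmax_T(teacher_logits, T)
    log_p_s = np.log(softmax_T(student_logits, T))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

# Toy merged query set of 2NQ = 30 samples over N = 5 classes.
rng = np.random.default_rng(0)
logits1 = rng.standard_normal((30, 5))
logits2 = rng.standard_normal((30, 5))
labels = rng.integers(0, 5, size=30)
teacher, student = pick_teacher(logits1, logits2, labels)
loss = cecr_loss(teacher, student, T=16)
```

Since the loss is a cross-entropy, it is minimized exactly when the student's tempered distribution matches the teacher's, which is the consistency that CECR enforces.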

3.4. MELR-BASED FSL ALGORITHM

As we have mentioned above, in each training iteration we randomly sample two N-way K-shot Q-query episodes e^(1) = (S^(1)_e, Q^(1)_e) and e^(2) = (S^(2)_e, Q^(2)_e), which must have exactly the same set of classes but different instances. We first transform the feature embeddings with the Cross-Episode Attention Module (CEAM) and then compute the few-shot classification loss adopting ProtoNet (Snell et al., 2017) for both episodes (e ∈ {e^(1), e^(2)}):

L_fsc(e) = E_{(q̂_i, y_i) ∈ Q̂_e} L(y_i, f_ProtoNet(q̂_i; Ŝ)) = E_{(q̂_i, y_i) ∈ Q̂_e} [−log σ_{y_i}(f_ProtoNet(q̂_i; Ŝ))].   (9)

Next we determine the stronger/teacher episode to compute the Cross-Episode Consistency Regularization (CECR) loss between the two episodes. The total loss for MELR is finally given by:

L_total = (1/2)(L_fsc(e^(1)) + L_fsc(e^(2))) + λ L_cecr(e^(t), e^(s); T),   (10)

where e^(t) ∈ {e^(1), e^(2)} denotes the teacher episode and e^(s) ∈ {e^(1), e^(2)} is the student.

4. EXPERIMENTS

4.1 EXPERIMENTAL SETUP

Each meta-test episode has 5 classes randomly sampled from the test split, with 5 or 1 shots and 15 queries per class. We thus have N = 5, K = 5 or 1, Q = 15 as in previous works. Although we meta-train our MELR with two episodes in each training iteration, we still evaluate it over test episodes one by one, strictly following the standard setting. Moreover, since no cross-episode relationships can be used at test time, we take (F^(test), S^(test), S^(test)) as the input of CEAM. Note that the meta-test process is non-transductive since the embedding of each query sample in e^(test) is independently updated using the keys and values coming from the support set. We report the average 5-way classification accuracy (%, top-1) over 2,000 test episodes as well as the 95% confidence interval.

Table 1: Comparative results of standard FSL on two benchmark datasets. The average 5-way few-shot classification accuracies (%, top-1) along with the 95% confidence intervals are reported.

Implementation Details.
Our MELR algorithm adopts Conv4-64 (Vinyals et al., 2016) , Conv4-512 and ResNet-12 (He et al., 2016b) as the feature extractors ψ for fair comparison with published results. The output feature dimensions of Conv4-64, Conv4-512 and ResNet-12 are 64, 512, and 640, respectively. To accelerate the entire training process, we pre-train all three backbones on the training split of each dataset as in many previous works (Zhang et al., 2020; Ye et al., 2020; Simon et al., 2020) . We use data augmentation during pre-training (as well as meta-training with ResNet-12 on miniImageNet). For ResNet-12, the stochastic gradient descent (SGD) optimizer is employed with the initial learning rate of 1e-4, the weight decay of 5e-4, and the Nesterov momentum of 0.9. For Conv4-64 and Conv4-512, the Adam optimizer (Kingma & Ba, 2015) is adopted with the initial learning rate of 1e-4. The hyper-parameters λ and T are respectively selected from {0.02, 0.05, 0.1, 0.2} and {16, 32, 64, 128} according to the validation performances of our MELR algorithm (see Appendix A.5 for more details). The code and models will be released soon.

4.2. MAIN RESULTS

We compare our MELR with representative/state-of-the-art methods for standard FSL on the two benchmark datasets in Table 1. Note that we re-implement our baseline (i.e., ProtoNet, denoted with †) by sampling two episodes in each training iteration, since it is still considered a strong FSL approach, especially when the backbone is deep. We can observe from Table 1 that our MELR outperforms the latest competitors under the same settings.

It can be seen that ProtoNet†+KD performs slightly better than the other implementations. We thus choose KD as our CECR loss.

Visualizations of Data Distributions and Attention Maps. MELR is designed to alleviate the negative effects of poorly-sampled few shots. To validate this, we further provide some visualization results in Figure 3. (1) We sample one episode from the test split of miniImageNet under the 5-way 5-shot setting and obtain the embeddings of all images using the trained models of ProtoNet†, ProtoNet†+CEAM, and our MELR (i.e., ProtoNet†+CEAM+CECR), respectively. We then apply t-SNE to project these embeddings into a 2-dimensional space in Figure 3(a)-(c). (2) We also visualize the attention maps over the meta-test episode using our trained MELR. Since we take all samples in the episode as 'queries' and support samples as 'keys' and 'values' for CEAM at meta-test, each of the 100 samples has a 25-dimensional weight vector under the 5-way 5-shot setting. For each weight vector, we average the weights of the same class and obtain a 5-dimensional vector. For the 75 query samples, we average the vectors of samples with the same class, resulting in a 5 × 5 instance attention map (see Figure 3(d)). Similarly, we obtain the attention map for the 25 support samples (see Figure 3(e)). It can be seen that the two attention maps are very much alike, indicating that support and query sample embeddings are transformed by our CEAM in a similar way, which thus brings performance improvements for FSL.
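The class-wise averaging that produces the 5 × 5 attention maps can be sketched as follows; `class_averaged_attention` is a hypothetical helper that assumes the raw CEAM attention weights are available as a matrix:

```python
import numpy as np

def class_averaged_attention(attn, sample_labels, support_labels, n_way=5):
    """Collapse raw attention weights into an n_way x n_way map.

    attn: (num_samples, num_support) CEAM attention weights at meta-test;
    rows of the result are indexed by sample class, columns by support class.
    """
    # First average the weights over same-class support shots ...
    per_class = np.stack(
        [attn[:, support_labels == c].mean(axis=1) for c in range(n_way)], axis=1
    )
    # ... then average over samples belonging to the same class.
    return np.stack(
        [per_class[sample_labels == c].mean(axis=0) for c in range(n_way)], axis=0
    )

# 5-way 5-shot episode: 100 samples attend over 25 support shots.
attn = np.full((100, 25), 1.0 / 25)          # uniform weights for illustration
sample_labels = np.repeat(np.arange(5), 20)  # 20 samples per class
support_labels = np.repeat(np.arange(5), 5)  # 5 shots per class
amap = class_averaged_attention(attn, sample_labels, support_labels)
```

Running it once with the 75 query samples and once with the 25 support samples yields the two maps compared in Figure 3(d)-(e).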

5. CONCLUSION

We have investigated the challenging problem of how to counter the negative effects of badly-sampled few shots in FSL. For the first time, we propose to explicitly exploit the underlying relationships between training episodes with identical sets of classes for meta-learning. This is achieved by two key components: CEAM is designed to neutralize unrepresentative support set instances, and CECR enforces the prediction consistency of the few-shot classifiers obtained in the two episodes. Extensive experiments for non-transductive standard FSL on two benchmarks show that our MELR achieves 1.0%-5.0% improvements over the baseline (i.e., ProtoNet) and outperforms the latest competitors under the same settings.

A APPENDIX

A.1 DETAILS ABOUT CECR ALTERNATIVES

For the cross-episode consistency regularization (CECR) loss L_cecr in Eq. (7), we compare the knowledge distillation (KD) loss to the negative cosine similarity (NegCos), the L2 distance, and the symmetric Kullback-Leibler (KL) divergence (SymKL) in the main paper. Here we give the details of the three CECR alternatives. Concretely, let σ^(1)(q̂^(1,2)_i) (or σ^(2)(q̂^(1,2)_i)) denote the softmax-normalized vector of f(q̂^(1,2)_i; Ŝ^(1)) (or f(q̂^(1,2)_i; Ŝ^(2))); we then have:

L^(NegCos)_cecr = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} NegCos(f(q̂^(1,2)_i; Ŝ^(1)), f(q̂^(1,2)_i; Ŝ^(2)))
              = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} [ −⟨σ^(1)(q̂^(1,2)_i), σ^(2)(q̂^(1,2)_i)⟩ / (‖σ^(1)(q̂^(1,2)_i)‖_2 · ‖σ^(2)(q̂^(1,2)_i)‖_2) ],

L^(L2)_cecr = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} L2(f(q̂^(1,2)_i; Ŝ^(1)), f(q̂^(1,2)_i; Ŝ^(2)))
           = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} ‖σ^(1)(q̂^(1,2)_i) − σ^(2)(q̂^(1,2)_i)‖_2,

L^(SymKL)_cecr = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} SymKL(f(q̂^(1,2)_i; Ŝ^(1)), f(q̂^(1,2)_i; Ŝ^(2)); T)
             = E_{(q̂^(1,2)_i, y^(1,2)_i) ∈ Q̂^(1,2)_e} [ KL(f(q̂^(1,2)_i; Ŝ^(1)), f(q̂^(1,2)_i; Ŝ^(2))/T) + KL(f(q̂^(1,2)_i; Ŝ^(2)), f(q̂^(1,2)_i; Ŝ^(1))/T) ],

where ⟨·, ·⟩ is the inner product of two vectors, T is the temperature parameter, and KL(u, v) = Σ_{j=1}^{N} σ_j(u) log(σ_j(u)/σ_j(v)), with u, v ∈ R^N being two unnormalized scoring vectors, σ the softmax function, and σ_j(u) the j-th element of σ(u).
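A NumPy sketch of the three alternatives (the helper names are our own; the logit arrays play the role of the scoring functions f(·; Ŝ^(1)) and f(·; Ŝ^(2))):

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def negcos(f1, f2):
    """Negative cosine similarity between the two softmax-normalized scores."""
    p1, p2 = softmax(f1), softmax(f2)
    num = (p1 * p2).sum(axis=-1)
    den = np.linalg.norm(p1, axis=-1) * np.linalg.norm(p2, axis=-1)
    return -num / den

def l2(f1, f2):
    """L2 distance between the two softmax-normalized scores."""
    return np.linalg.norm(softmax(f1) - softmax(f2), axis=-1)

def kl(u, v):
    """KL(sigma(u) || sigma(v)) for unnormalized scoring vectors u, v."""
    pu, pv = softmax(u), softmax(v)
    return (pu * (np.log(pu) - np.log(pv))).sum(axis=-1)

def symkl(f1, f2, T):
    """Symmetric KL with the second argument tempered, as in the text."""
    return kl(f1, f2 / T) + kl(f2, f1 / T)

# Toy scoring vectors for a quick sanity check.
x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[3.0, 2.0, 1.0]])
```

Each alternative attains its minimum when the two classifiers produce identical distributions (NegCos reaches −1, L2 and SymKL reach 0), so all three enforce the same kind of consistency as KD.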

A.2 MORE ABLATIVE RESULTS

In Table 2, we show more ablative results when the attention module is applied within each episode independently (named the Intra-Episode Attention Module (IEAM)). Concretely, IEAM is used for ProtoNet†+IEAM († means that ProtoNet is trained with two episodes in each training iteration for fair comparison) as follows (i = 1, 2):

F̂^(i) = IEAM(F^(i), S^(i), S^(i)) = F^(i) + softmax(F^(i)_Q (S^(i)_K)^T / √d) S^(i)_V.

We also add our Cross-Episode Consistency Regularization (CECR) on top, giving ProtoNet†+IEAM+CECR, to see the performance when IEAM instead of our CEAM is adopted. We can see from Table 2 that adding IEAM to the baseline ProtoNet† also improves its performance, but IEAM is not as beneficial as our CEAM (see ProtoNet†+IEAM vs. ProtoNet†+CEAM). When our CECR is applied on top of ProtoNet†+IEAM (i.e., ProtoNet†+IEAM+CECR), the improvement is rather minor under the 5-way 1-shot setting and the result even degrades under the 5-shot setting. However, our MELR can still benefit from CECR (see MELR vs. ProtoNet†+CEAM), indicating that CECR is not suitable for IEAM and that our CEAM is necessary.

Figure 4: The first three subfigures in each row are the visualizations of data distributions obtained by ProtoNet†, ProtoNet†+CEAM, and ProtoNet†+CEAM+CECR (i.e., our MELR) for the same meta-test episode, respectively. The last two subfigures in each row are the visualizations of attention maps for the query and support sets, respectively. All results are obtained under the 5-way 5-shot setting on miniImageNet with Conv4-64 as the feature extractor.

A.3 MORE VISUALIZATION RESULTS

Similar to Section 4.3, we provide more visualization results in Figure 4. (1) We sample five episodes (corresponding to the five rows in Figure 4) from the test split of miniImageNet under the 5-way 5-shot setting and visualize the data distributions in the first three columns using the trained models of ProtoNet†, ProtoNet†+CEAM, and our MELR (i.e., ProtoNet†+CEAM+CECR), respectively. We can observe that adding CEAM makes the distributions of different classes more separable in the first four rows (see the second column vs. the first column), validating the effectiveness of our CEAM. Moreover, the embeddings obtained by our MELR are clearly much more evenly distributed, with the prototypes generally right in the center, indicating fewer poorly-sampled instances (see the third column vs. the second column). Specifically, ProtoNet†+CEAM produces an obvious outlying instance in the last row, but adding CECR stabilizes the training of CEAM. In a word, when CECR is combined with CEAM, the badly-sampled shots can be pulled back to the center of the class distributions. (2) We also visualize the attention maps over each meta-test episode using our trained MELR model. For each of the five episodes, we obtain two 5 × 5 attention maps for the query and support sets, respectively.

A.6 VARYING THE NUMBER OF EPISODES IN EACH TRAINING ITERATION

Given N_e episodes {e^(i)} (i = 1, ..., N_e) in each training iteration, the output of CEAM can be defined in two ways:

Implement. (1): F̂^(i) = (1/(N_e − 1)) Σ_{j=1,...,N_e, j≠i} CEAM(F^(i), S^(j), S^(j));   (15)

Implement. (2): F̂^(i) = CEAM(F^(i), [S^(j)]^{N_e}_{j=1, j≠i}, [S^(j)]^{N_e}_{j=1, j≠i}),

where [S^(j)]^{N_e}_{j=1, j≠i} ∈ R^{NK(N_e−1)×d} is the concatenation of the S^(j) ∈ R^{NK×d} (j = 1, ..., i−1, i+1, ..., N_e). As for CECR, we determine the episode with the best accuracy as the teacher and distill knowledge to the remaining N_e − 1 student episodes. The results on miniImageNet using Conv4-64 in Table 4 show that the performance drops slightly as the number of episodes in each training iteration increases, for both implementations.
One possible explanation is that more training data per iteration makes the model fit the training set better but fails to improve its generalization ability on novel classes.
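The two multi-episode implementations can be sketched as below, using a simplified CEAM helper with shared projections (the names are ours). For N_e = 2, both variants reduce to the pairwise CEAM of the main paper:

```python
import numpy as np

def ceam(F, S, W, d):
    """Simplified CEAM: F attends over S with shared projections W = (Wq, Wk, Wv)."""
    Wq, Wk, Wv = W
    z = (F @ Wq) @ (S @ Wk).T / np.sqrt(d)
    z = z - z.max(axis=-1, keepdims=True)
    A = np.exp(z)
    A = A / A.sum(axis=-1, keepdims=True)
    return F + A @ (S @ Wv)

def multi_episode_ceam_v1(Fs, Ss, W, d):
    """Implement. (1): average the CEAM outputs over the other N_e - 1 episodes."""
    out = []
    for i, F in enumerate(Fs):
        others = [ceam(F, S, W, d) for j, S in enumerate(Ss) if j != i]
        out.append(sum(others) / len(others))
    return out

def multi_episode_ceam_v2(Fs, Ss, W, d):
    """Implement. (2): concatenate the other episodes' supports as keys/values."""
    out = []
    for i, F in enumerate(Fs):
        S_cat = np.concatenate([S for j, S in enumerate(Ss) if j != i], axis=0)
        out.append(ceam(F, S_cat, W, d))
    return out

# N_e = 2 toy episodes: 3-way 1-shot with 2 queries, feature dimension d = 8.
rng = np.random.default_rng(0)
d = 8
W = tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Fs = [rng.standard_normal((9, d)) for _ in range(2)]  # N(K+Q) = 9 samples each
Ss = [rng.standard_normal((3, d)) for _ in range(2)]  # NK = 3 supports each
v1 = multi_episode_ceam_v1(Fs, Ss, W, d)
v2 = multi_episode_ceam_v2(Fs, Ss, W, d)
```

The two variants differ for N_e > 2 because averaging separate softmax outputs is not the same as one softmax over the concatenated keys.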

A.7 COMPARISON REGARDING THE NUMBER OF PARAMETERS

We select several representative/latest FSL models from Table 1 of the main paper and list their numbers of parameters in Table 5. We can observe that: (1) With an extra CEAM in addition to the backbone (Conv4-64 or ResNet-12), our MELR has about 13%-15% (relatively) more parameters than the baseline ProtoNet. Note that CECR (in our MELR) leads to no extra parameters. Considering the (statistically) significant improvements achieved by our MELR over ProtoNet, we regard MELR as cost-effective because it requires few additional parameters. (2) The number of MELR's parameters is almost the same as FEAT's and much less than PARN's, but our MELR achieves better results than FEAT and PARN, indicating that our MELR is the most cost-effective among these three methods.

Figure 6: Schematic illustration of our CEAM with a pair of episodes as its input. For easy understanding (but without loss of generality), a toy visual example is considered: only one outlying instance x exists in the support set S^(1) of the first episode, but the support set S^(2) of the second episode is properly sampled. Since S̃^(1) = S^(1) \ {x} and S^(2) have similar data distributions (from the same classes), the outlying instance x is pulled back to S̃^(1) by attending on it with S^(2) (which cannot be done by attending on it with S^(1)), i.e., its negative effect is mitigated by our CEAM.

To demonstrate how our proposed CEAM can alleviate the negative effect of poorly sampled shots, we present a schematic illustration of CEAM with a pair of episodes as its input in Figure 6. For easy understanding (but without loss of generality), a toy visual example is considered: only one outlying instance x exists in the support set S^(1) of the first episode, but the support set S^(2) of the second episode is properly sampled.
Since the two episodes are sampled from the same set of classes, the data distributions of S̃^(1) = S^(1) \ {x} and S^(2) are similar. On the one hand, when S^(1) (including the outlier x) is used as keys and values to update S^(2), the distribution of S^(2) will not be influenced much by the outlier x, since all the shots in S^(2) are far away from x and the attention weights on x will be very small. That is, our CEAM is insensitive to a few outliers in the keys and values. On the other hand, when S^(1) is transformed based on S^(2), the distribution of S̃^(1) will change little (the data distributions of S̃^(1) and S^(2) are similar). In particular, for the outlier x, its updated embedding x̂ will be pulled back towards S̃^(1) (i.e., its negative effect is mitigated), since

x̂ = CEAM(x, S^(2), S^(2)) ≈ x + W S^(2) ≈ x + W̃ S̃^(1),

where x here denotes the original embedding of the instance, S^(2) and S̃^(1) are respectively the feature matrices of S^(2) and S̃^(1), and W and W̃ are two normalized weight matrices. Note that this cannot be achieved by attending on x with S^(1) itself.

Additionally, we provide the results obtained by an extra CEAM alternative in Table 6: only the embeddings of support samples are transformed (denoted as 'Support → Support'), instead of transforming all samples as in our choice. We can see from Table 6 and also Figure 2(b) that our choice (i.e., Support → All) achieves the best results among all CEAM alternatives. One possible explanation for why we update all samples is that transforming support and query samples into the same embedding space is beneficial to model learning.

Figure 7: Visualization results of 10 meta-test episodes on miniImageNet under the 5-way 5-shot setting (Conv4-64 is used as the backbone). For each meta-test episode, we visualize three data distributions (from left to right) obtained by ProtoNet†, ProtoNet†+CEAM, and our MELR, respectively. That is, each meta-test episode is denoted by three subfigures and each row has two episodes.
In each subfigure, we compute the test accuracy over the query samples and present it as the title.

A.9 SIGNIFICANCE ANALYSIS OF CECR

Since CECR (in our MELR) requires no extra learnable parameters and only computes a loss for the consistency constraint, it brings very limited computational cost. Empirically, the meta-training time of MELR is almost the same as that of ProtoNet†+CEAM, indicating that the performance improvement over ProtoNet†+CEAM is obtained by CECR at an extremely low cost. To study when CECR has a significant impact on the final FSL performance, we select 10 meta-test episodes from the 2,000 used in the evaluation stage and visualize them in Figure 7. Concretely, for each of the 10 selected meta-test episodes, we visualize three data distributions (from left to right) obtained by ProtoNet†, ProtoNet†+CEAM, and our MELR, respectively. That is, each meta-test episode is denoted by a group of three subfigures. In each subfigure, we compute the test accuracy over the query samples and present it as the title. We can see that: (1) When the few-shot classification task is hard (i.e., ProtoNet† obtains relatively low accuracy), CEAM leads to significant improvements (about 3%-9%). (2) In the same hard situations, CECR further achieves significant improvements (about 5%-8%) on top of CEAM, showing its great effect on the final FSL performance. This indicates that CECR and CEAM are complementary in hard situations, and thus both are crucial for solving the poor sampling problem in meta-learning based FSL.

A.10 RESULTS OF TRANSDUCTIVE FSL

The main difference between standard and transductive FSL is whether query samples are tested one at a time or all simultaneously. As we have mentioned in Section 4.1, we evaluate our MELR model strictly following the non-transductive setting for standard FSL, since the embedding of each query sample in the test episode is independently transformed using the keys and values coming from the support set.



The set D_n = {(x_i, y_i) | y_i ∈ C_n, i = 1, 2, ..., N_n} denotes a few-shot sample set from a set of novel classes C_n (e.g., K-shot means that each novel class has K labeled images and N_n = K|C_n|), where C_b ∩ C_n = ∅. We are also given a test set T from C_n, where D_n ∩ T = ∅. By exploiting D_b and D_n for training, the objective of few-shot learning (FSL) is to predict the class labels of the test images in T.

These methods adopt episodic training on the base class sample set D_b and test their models over few-shot classification tasks sampled from the novel classes C_n. Concretely, an N-way K-shot Q-query episode e = (S_e, Q_e) is generated as follows: (1) We first randomly sample a subset C_e from the base classes C_b during meta-training (or from the novel classes C_n during meta-test) and re-index it as C_e = {1, 2, ..., N}. (2) For each class in C_e, K support and Q query images are then randomly sampled to form the support set S_e = {(x_i, y_i) | y_i ∈ C_e, i = 1, 2, ..., N×K} and the query set Q_e = {(x_i, y_i) | y_i ∈ C_e, i = 1, 2, ..., N×Q}.


Figure 3: (a) -(c) Visualizations of data distributions obtained by ProtoNet † , ProtoNet † +CEAM, and ProtoNet † +CEAM+CECR (i.e., our MELR) for the same meta-test episode, respectively. (d) -(e) Visualizations of attention maps for query and support sets, respectively. All results are obtained under the 5-way 5-shot setting on miniImageNet with Conv4-64 as the feature extractor.

The visualizations are shown in Figures 3(a)-3(c). Across the three subfigures, samples with the same color belong to the same class and diamonds are class prototypes/centers. We can observe that adding CEAM makes the distributions of different classes more separable (see Figure 3(b) vs. 3(a)), validating the effectiveness of our CEAM. Moreover, the embeddings obtained by our MELR are clearly much more evenly distributed with the prototypes right in the center, indicating fewer outlying instances (see Figure 3(c) vs. 3(b)). This shows that when CECR is combined with CEAM, the badly-sampled shots in Figure 3(a) are pulled back to the center of the class distributions.

.54 ± 0.44 74.37 ± 0.34 60.26 ± 0.51 77.25 ± 0.40
ProtoNet† (Snell et al., 2017) ResNet-12 62.41 ± 0.44 80.49 ± 0.29 69.63 ± 0.53 84.82 ± 0.36
TADAM (Oreshkin et al., 2018) ResNet-12 58.50 ± 0.30 76.70 ± 0.38 - -
MetaOptNet (Lee et al., 2019) ResNet-12 62.64 ± 0.61 78.63 ± 0.46 65.99 ± 0.72 81.56 ± 0.63
MTL (Sun et al., 2019) ResNet-12 61.20 ± 1.80 75.50 ± 0.80 65.62 ± 1.80 80.61 ± 0.90
AM3 (Xing et al., 2019) ResNet-12 65.21 ± 0.49 75.20 ± 0.36 67.23 ± 0.34 78.95 ± 0.22
Shot-Free (Ravichandran et al., 2019) ResNet-12 59.04 ± 0.43 77.64 ± 0.39 66.87 ± 0.43 82.64 ± 0.43
Neg-Cosine (Liu et al., 2020) ResNet-12 63.85 ± 0.81 81.57 ± 0.56 - -
Distill (Tian et al., 2020) ResNet-12 64.82 ± 0.60 82.14 ± 0.43 71.52 ± 0.69 86.03 ± 0.49
DSN-MR (Simon et al., 2020) ResNet-12 64.60 ± 0.72 79.51 ± 0.50 67.39 ± 0.82 82.85 ± 0.56

Comparative results when IEAM is used on miniImageNet. The average 5-way few-shot classification accuracies (%, top-1) along with the 95% confidence intervals are reported.

Results obtained by varying the number of episodes in each training iteration on miniImageNet (with Conv4-64 as the backbone). The average 5-way classification accuracies (%, top-1) along with the 95% confidence intervals are reported.

Comparison regarding the number of parameters ('K' denotes '×10^3' and 'M' denotes '×10^6') among various FSL methods.

Results obtained by inputting only support samples as 'queries' into CEAM (denoted as 'Support → Support') on miniImageNet.

ACKNOWLEDGMENTS

This work was supported in part by National Natural Science Foundation of China (61976220 and 61832017), Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098), Open Project Program Foundation of Key Laboratory of Opto-Electronics Information Processing, Chinese Academy of Sciences (OEIP-O-202006), and Alibaba Innovative Research (AIR) Program.


Published as a conference paper at ICLR 2021

(1) Methods trained with ResNet-12 generally perform better than those employing shallower backbones. Also, methods trained with Conv4-512 generally perform better than those employing Conv4-64 even though the two backbones are of the same depth. This is expected because deeper and wider backbones have better representation learning abilities. (2) Our MELR achieves the new state-of-the-art performance on both benchmarks under all settings. Particularly, the improvements over the baseline (i.e., ProtoNet†) range from 1.0% to 5.0%, which clearly validates the effectiveness and the strong generalization ability of our MELR. (3) On both benchmarks, the improvements obtained by our MELR over ProtoNet† under the 1-shot setting are significantly larger than those under the 5-shot setting. This demonstrates the superior performance of our MELR for FSL with fewer shots. Again this is expected: FSL with fewer shots is more likely to suffer from the poor sampling of the few support instances; such a challenging problem is exactly what our MELR is designed for.

4.3. FURTHER EVALUATION

Ablation Study. To demonstrate the contribution of each cross-episode learning objective in our MELR, we conduct experiments on miniImageNet by adding these learning objectives to the baseline (one at a time) under the 5-way 1-shot and 5-shot settings. Note that ProtoNet without † is trained with one episode in each training iteration, while ProtoNet† is trained with two episodes per iteration. The ablation study results in Figure 2(a) show that: (1) Increasing the mini-batch size helps little for ProtoNet, indicating that our MELR benefits from the two cross-episode objectives rather than from doubling the mini-batch size. (2) CEAM or CECR alone clearly improves the performance of the baseline model, and CEAM appears to be more beneficial to FSL than CECR. (3) The combination of the two cross-episode learning objectives in our full model (i.e., MELR) achieves further improvements, suggesting that these two learning objectives are complementary to each other. Moreover, in Appendix A.2, we conduct more ablative experiments where the attention module is applied within each episode, validating the necessity of cross-episode attention.

Comparison to CEAM Alternatives. As we have described, our CEAM takes the support samples as 'keys' and 'values', and all samples in one episode as 'queries' for the attention module (denoted as Support→All). We can also input prototypes (mean representations of support samples from the same class) as 'keys' and 'values', or input only query samples as 'queries' for attention. This results in three other alternatives to CEAM: Prototype→Query, Support→Query, and Prototype→All. Note that under the 5-way 1-shot setting, Prototype→Query is equal to Support→Query and Prototype→All is the same as Support→All. Additionally, we compare to All→All: inputting all samples from the other episode as 'keys' and 'values' for CEAM when training but still testing as Support→All (not violating the non-transductive setting).
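As an illustration of the Support→All choice, the core computation can be sketched as single-head scaled dot-product attention (an assumption-laden sketch: the actual CEAM may include learned projections, multiple heads, and residual connections, which are omitted here):

```python
import numpy as np


def ceam_transform(queries, keys, values, scale=None):
    """Cross-episode attention in the Support→All style.

    `queries`: embeddings of ALL samples (support + query) of one episode.
    `keys`/`values`: SUPPORT embeddings of the paired episode.
    Returns the attended (transformed) embeddings.
    """
    d = queries.shape[-1]
    scale = scale or np.sqrt(d)
    logits = queries @ keys.T / scale              # (n_inputs, n_support)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over support keys
    return attn @ values                           # weighted sum of values
```

Swapping what is fed as `queries` (all samples, only query samples) or as `keys`/`values` (support samples, prototypes) yields the Support→Query, Prototype→All, and Prototype→Query variants discussed above.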
From the comparative results of different choices in Figure 2(b), we can see that Support→All is the best for CEAM. Moreover, All→All works worse than both Prototype→All and Support→All. One possible explanation is that All→All exploits all query set instances during meta-training but only has access to the support set during meta-test to conform to the inductive learning setting; this mismatch reduces its effectiveness.

Comparison to CECR Alternatives. Our consistency regularization loss L_cecr in Eq. (7) is defined with the knowledge distillation (KD) loss, which can be easily replaced by the negative cosine similarity (NegCos), the symmetric Kullback-Leibler divergence (symKL), or the L2 distance (see Appendix).

We visualize the attention maps for the query and support sets in the last two subfigures of each row, respectively. It can be seen that the two attention maps in each row are very much alike, indicating that support and query sample embeddings are transformed by our CEAM in a similar way, which thus brings performance improvements for FSL.
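For concreteness, one common KD-style instantiation of such a consistency term is a temperature-scaled cross-entropy between the two episodes' classifier predictions on the same inputs (a sketch only; the exact form of Eq. (7) and the role of the temperature T should be taken from the main text):

```python
import numpy as np


def softmax(z, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    z = np.asarray(z, dtype=float) / tau
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kd_consistency(logits_a, logits_b, tau=64.0):
    """KD-style consistency: cross-entropy between the soft predictions of
    classifier A (teacher-style targets) and classifier B on the same inputs.
    """
    p = softmax(logits_a, tau)
    log_q = np.log(softmax(logits_b, tau) + 1e-12)
    return float(-(p * log_q).sum(axis=-1).mean())
```

Replacing the cross-entropy here by negative cosine similarity, symmetric KL, or L2 distance between the two probability (or logit) vectors yields the NegCos, symKL, and L2 alternatives compared above.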

A.4 RESULTS FOR FINE-GRAINED FSL

To evaluate our MELR under the fine-grained setting, where poorly-sampled shots may have a greater negative impact since the classes are much more similar to each other, we conduct experiments on CUB-200-2011 Birds (CUB) (Wah et al., 2011) with Conv4-64 as the feature extractor. CUB has 200 fine-grained classes of birds and 11,788 images in total. We follow (Ye et al., 2020) and split the dataset into 100, 50, and 50 classes for training, validation, and test, respectively. For direct comparison, we also use the backbone model released by (Ye et al., 2020), which is pre-trained on the training set. The comparative results in Table 3 show that: (1) Our MELR achieves the best results and improves over the second-best FEAT by 1.4%-2.1%, validating the effectiveness of MELR under the fine-grained setting. (2) Our ProtoNet†+CEAM alone outperforms all the competitors, and adding CECR to ProtoNet†+CEAM (i.e., our MELR) further brings noticeable improvements (0.5%-1.3%), indicating that both CEAM and CECR are crucial for fine-grained FSL.

A.5 ANALYSIS OF HYPER-PARAMETER SENSITIVITY

As we have mentioned in Section 4.1, the hyper-parameters λ and T are respectively selected from {0.02, 0.05, 0.1, 0.2} and {16, 32, 64, 128} according to the validation performance of our MELR algorithm. Concretely, on miniImageNet (with Conv4-64 as the feature extractor), we choose λ = 0.1 and T = 128 under the 5-way 1-shot setting, and λ = 0.05 and T = 64 under the 5-way 5-shot setting. In Figure 5, we further present our hyper-parameter analysis on miniImageNet. The results show that our algorithm is quite insensitive to these hyper-parameters.

Table 7: Comparative results of transductive FSL on miniImageNet. The average 5-way few-shot classification accuracies (%, top-1) along with the 95% confidence intervals are reported. We cite the results of the competitors from (Ye et al., 2020).

However, in this section, we further conduct experiments under the transductive FSL setting to study how well our MELR can make use of the unlabeled query samples. Concretely, for each meta-test episode e^(test), we input all samples (both support and query ones) as keys and values into the trained CEAM, such that the relationships among all unlabeled query samples can be taken into consideration. With the transformed embeddings, we make predictions for all query samples based on Semi-ProtoNet (Ren et al., 2018), which utilizes the unlabeled query samples to help construct better class prototypes and then makes predictions similarly to ProtoNet. To match the meta-test process, we also make changes to meta-training accordingly. Specifically, for one episode (out of the two) in each training iteration, we use all samples from the other episode as keys and values for CEAM to update all of its own embeddings. This is followed by obtaining the prototypes based on Semi-ProtoNet as well as computing the FSL loss and the CECR loss. The results of transductive FSL on miniImageNet are shown in Table 7.
It can be seen that: (1) By utilizing the unlabeled query samples under transductive FSL, our MELR achieves further improvements compared to MELR under standard FSL. Particularly, the performance improvement under 1-shot (i.e., 6.3%) is more significant than that under 5-shot (i.e., 2.6%), indicating that exploiting unlabeled query samples brings more benefits to FSL with fewer labeled support samples. (2) Our MELR achieves the best results among all the transductive FSL methods. Specifically, MELR outperforms FEAT by a large margin (2.0%-4.6%). Since FEAT also makes predictions based on Semi-ProtoNet, this clearly validates the effectiveness of our MELR under the transductive setting. (3) Our ProtoNet†+CEAM alone outperforms all the competitors, and adding CECR to ProtoNet†+CEAM (i.e., our MELR) further brings noticeable improvements (0.6%-1.4%), indicating that both CEAM and CECR play important roles under transductive FSL.
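The Semi-ProtoNet-style prototype refinement used in the transductive variant can be sketched as a soft-assignment step in the spirit of Ren et al. (2018) (a minimal version assuming plain squared-Euclidean soft k-means; learned distance scaling and distractor handling are omitted):

```python
import numpy as np


def refine_prototypes(protos, query_emb, n_iter=1):
    """Refine class prototypes with unlabeled query embeddings.

    protos:    (n_class, dim) initial prototypes from the support set.
    query_emb: (n_query, dim) unlabeled query embeddings (ndarray).
    Each query is softly assigned to prototypes by negative squared
    distance; prototypes are re-estimated as weighted means (each
    initial prototype counts as one labeled point per class).
    """
    protos = np.asarray(protos, dtype=float)
    for _ in range(n_iter):
        d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2)
        w /= w.sum(axis=1, keepdims=True)   # soft assignment per query
        num = protos + (w[:, :, None] * query_emb[:, None, :]).sum(0)
        den = 1.0 + w.sum(0)[:, None]
        protos = num / den
    return protos
```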

A.11 VISUALIZATIONS OF THE GENERALIZATION ABILITY OF MELR

We further provide a visualization of the generalization ability of our MELR during meta-test in Figure 8. Concretely, we randomly sample 1,000 episode pairs from the test split of miniImageNet under the 5-way 1-shot setting, where the two episodes in each pair have identical sets of classes. We then compute the average 5-way classification accuracy over all 2,000 episodes (from the 1,000 episode pairs) and the average L_cecr in Eq. (7) over all 1,000 episode pairs at each training epoch. We present the visualization results w.r.t. accuracy and the CECR loss in Figure 8. As expected, the accuracy of our MELR is consistently higher than that of our baseline ProtoNet†. Moreover, compared with ProtoNet†, the CECR loss of our MELR is also lower across the whole training process, indicating that MELR has better performance consistency between the two episodes of each pair. This provides direct evidence that our CEAM and CECR can boost the generalization ability of the learned model on novel classes.

