IEPT: INSTANCE-LEVEL AND EPISODE-LEVEL PRE-TEXT TASKS FOR FEW-SHOT LEARNING

Abstract

The need to collect large quantities of labeled training data for each new task has limited the usefulness of deep neural networks. Given data from a set of source tasks, this limitation can be overcome using two transfer learning approaches: few-shot learning (FSL) and self-supervised learning (SSL). The former aims to learn 'how to learn' by designing learning episodes using source tasks to simulate the challenge of solving the target new task with few labeled samples. In contrast, the latter exploits an annotation-free pretext task across all source tasks in order to learn generalizable feature representations. In this work, we propose a novel Instance-level and Episode-level Pretext Task (IEPT) framework that seamlessly integrates SSL into FSL. Specifically, given an FSL episode, we first apply geometric transformations to each instance to generate extended episodes. At the instance level, transformation recognition is performed as per standard SSL. Importantly, at the episode level, two SSL-FSL hybrid learning objectives are devised: (1) The consistency across the predictions of an FSL classifier from different extended episodes is maximized as an episode-level pretext task. (2) The features extracted from each instance across different episodes are integrated to construct a single FSL classifier for meta-learning. Extensive experiments show that our proposed model (i.e., FSL with IEPT) achieves the new state-of-the-art.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016b; Huang et al., 2017) have seen tremendous successes in a wide range of application fields, especially in visual recognition. However, the powerful learning ability of CNNs depends on a large amount of manually labeled training data. In practice, for many visual recognition tasks, sufficient manual annotation is either too costly to collect or not feasible (e.g., for rare object classes). This has severely limited the usefulness of CNNs for real-world application scenarios. Attempts have been made recently to mitigate such a limitation from two distinct perspectives, resulting in two popular research lines, both of which aim to transfer knowledge learned from the data of a set of source tasks to a new target one: few-shot learning (FSL) and self-supervised learning (SSL). FSL (Fei-Fei et al., 2006; Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017; Sung et al., 2018) typically takes a 'learning to learn' or meta-learning paradigm. That is, it aims to learn an algorithm for learning from few labeled samples, which generalizes well across any tasks. To that end, it adopts
an episodic training strategy: the source tasks are arranged into learning episodes, each of which contains n classes and k labeled samples per class to simulate the setting for the target task. Part of the CNN model (e.g., feature extraction subnet, classification layers, or parameter initialization) is then meta-learned for rapid adaptation to new tasks. In contrast, SSL (Doersch et al., 2015; Noroozi & Favaro, 2016; Iizuka et al., 2016; Doersch & Zisserman, 2017; Noroozi et al., 2018) does not require the source data to be annotated. Instead, it exploits an annotation-free pretext task on the source task data in the hope that a task-generalizable feature representation can be learned from the source tasks for easy adoption or adaptation in a target task. Such a pretext task gets its self-supervised signal at the per-instance level. Examples include rotation and context prediction (Gidaris et al., 2018; Doersch et al., 2015), jigsaw solving (Noroozi & Favaro, 2016), and colorization (Iizuka et al., 2016; Larsson et al., 2016). Since these pretext tasks are class-agnostic, solving them leads to the learning of transferable knowledge.

Since both FSL and SSL aim to reduce the need to collect a large amount of labeled training data for a target task by transferring knowledge from a set of source tasks, it is natural to consider combining them in a single framework. Indeed, two recent works (Gidaris et al., 2019; Su et al., 2020) proposed to integrate SSL into FSL by adding an auxiliary SSL pretext task in an FSL model. They showed that the SSL learning objective is complementary to that of FSL and that combining them leads to improved FSL performance. However, in (Gidaris et al., 2019; Su et al., 2020), SSL is combined with FSL in a superficial way: it is only taken as a separate auxiliary task for each single training instance and has no effect on the episodic training pipeline of the FSL model.
Importantly, by ignoring the class labels of samples, the instance-level SSL learning objective is weak on its own. Since meta-learning across episodes is the essence of most contemporary FSL models, we argue that adding instance-level SSL pretext tasks alone fails to exploit fully the complementarity of the aforementioned FSL and SSL, for which a closer and deeper integration is needed.

To that end, in this paper we propose a novel Instance-level and Episode-level Pretext Task (IEPT) framework for few-shot recognition. Apart from adding an instance-level pretext SSL task as in (Gidaris et al., 2019; Su et al., 2020), we introduce two episode-level SSL-FSL hybrid learning objectives for seamless SSL-FSL integration. Concretely, as illustrated in Figure 1, our full model has three additional learning objectives (besides the standard FSL one): (1) Different rotation transformations are applied to each original few-shot episode to generate a set of extended episodes, where each image has a rotation label for the instance-level pretext task (i.e., to predict the rotation label). (2) The consistency across the predictions of an FSL classifier from different extended episodes is maximized as an episode-level pretext task. For each training image, the rotation transformation does not change its semantic content and hence its class label; the FSL classifier predictions across different extended episodes thus should be consistent, hence the consistency regularization objective. (3) The correlation of features across instances from these extended episodes is modeled by a transformer-based attention module, optimizing the fusion of the features of each instance/image and its various rotation-transformed versions mainly for task adaptation during meta-testing. Importantly, with these three new learning objectives introduced in IEPT, any meta-learning based FSL model can now benefit more from SSL by fully exploiting their complementarity.
Our main contributions are: (1) For the first time, we propose both instance-level and episode-level pretext tasks (IEPT) for integrating SSL into FSL. The episode-level pretext task enables episodic training of SSL and hence closer integration of SSL with FSL. (2) In addition to these pretext tasks, FSL further benefits from SSL by integrating features extracted from various rotation-transformed versions of the original training instances. The optimal way of feature integration is learned by a transformer-based attention module, which is mainly designed for task adaptation during meta-testing. (3) Extensive experiments show that the proposed model achieves the new state-of-the-art.

2. RELATED WORK

Few-Shot Learning. The recent FSL studies are dominated by meta-learning based methods. They can be divided into three groups: (1) Metric-based methods (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Allen et al., 2019; Xing et al., 2019; Li et al., 2019a; b; Wu et al., 2019; Ye et al., 2020; Afrasiyabi et al., 2020; Liu et al., 2020; Zhang et al., 2020) aim to learn the distance metric between feature embeddings. The focus of these methods is often on meta-learning of a feature-extraction CNN, whilst the classifiers used are of simple form such as a nearest-neighbor classifier. (2) Optimization-based methods (Finn et al., 2017; Ravi & Larochelle, 2017; Rusu et al., 2019; Lee et al., 2019) learn to optimize the model rapidly given a few labeled samples per class in the new task. (3) Model-based methods (Santoro et al., 2016; Munkhdalai & Yu, 2017; Mishra et al., 2018) focus on designing either specific model structures or parameters capable of rapid updating. Apart from these three groups of methods, other FSL methods have attempted feature hallucination (Schwartz et al., 2018; Hariharan & Girshick, 2017; Gao et al., 2018; Wang et al., 2018; Zhang et al., 2019; Tsutsui et al., 2019), which generates additional samples from the given few shots for network finetuning, and parameter predicting (Qiao et al., 2018; Qi et al., 2018; Gidaris & Komodakis, 2019; 2018), which learns to predict part of the parameters of a network given few samples of new classes for quick adaptation. In this work, we adopt the metric-based Prototypical Network (ProtoNet) (Snell et al., 2017) as the basic FSL classifier for the main instantiation of our IEPT framework due to its simplicity and popularity. However, we show that any meta-learning based FSL method can be combined with our IEPT (see results in Figure 2(c)).

Self-Supervised Learning.
In SSL, it is assumed that the source task data is label-free and a pretext task is designed to provide self-supervision signals at the instance level. Existing SSL approaches differ mainly in the pretext task design. These include predicting the rotation angle (Gidaris et al., 2018) and the context of image patches (Doersch et al., 2015; Nathan Mundhenk et al., 2018), jigsaw solving (Noroozi & Favaro, 2016; Noroozi et al., 2018) (i.e., shuffling and then reordering image patches), and image restoration such as colorization and inpainting (Iizuka et al., 2016; Pathak et al., 2016; Larsson et al., 2016). SSL has been shown to be beneficial to various downstream tasks such as semantic object matching (Novotny et al., 2018), object segmentation (Ji et al., 2019) and object detection (Doersch & Zisserman, 2017) by learning transferable feature representations for these tasks.

Integrating Self-Supervised Learning into Few-Shot Learning. To the best of our knowledge, only two recent works (Gidaris et al., 2019; Su et al., 2020) have attempted combining SSL with FSL. However, the integration of SSL into FSL in these works is shallow: the original FSL training pipeline is left intact, while an additional loss on each image w.r.t. a self-supervised signal like the rotation angle or relative patch location is introduced. With pretext tasks solely at the instance level, combining the two approaches (i.e., SSL and FSL) can only be superficial without fully exploiting the episodic training pipeline unique to FSL. Different from (Gidaris et al., 2019; Su et al., 2020), we introduce an episode-level pretext task to fully integrate SSL into the episodic training in FSL. Specifically, the consistency across the predictions of an FSL classifier from different extended episodes is maximized to reflect the fact that various rotation transformations should not alter the class-label prediction.
Moreover, features of each instance and its various rotation-transformed versions are now fused for FSL classification, to integrate SSL with FSL for the supervised classification task. Our experimental results show that thanks to the closer integration of SSL and FSL, our IEPT clearly outperforms (Gidaris et al., 2019; Su et al., 2020) (see Table 1).

3. METHODOLOGY

The support and query sets are disjoint: $S_e \cap Q_e = \emptyset$. For simplicity, we denote $l_k = n \times k$ and $l_q = n \times q$. In the meta-training stage, the training process has an inner and an outer loop in each episode: in the inner loop, the model is updated using $S_e$; its performance is then evaluated on the query set $Q_e$ in the outer loop to update the model parameters or algorithm that one wants to meta-learn.

Basic FSL Classifier. We employ ProtoNet (Snell et al., 2017) as the basic FSL model. This model has a feature-extraction CNN and a simple non-parametric classifier. The parameters of the feature extractor are to be meta-learned. Concretely, in the inner loop of an episode, ProtoNet fixes the feature extractor and computes the mean feature embedding for each class as follows:

$$h_c = \frac{1}{k} \sum_{(x_i, y_i) \in S_e} f_\phi(x_i) \cdot I(y_i = c), \tag{1}$$

where class $c \in C_e$, $f_\phi$ is a feature extractor with learnable parameters $\phi$, and $I$ is the indicator function. By computing the distance between the feature embedding of each query sample and that of the corresponding class, the loss function used to meta-learn $\phi$ in the outer loop is defined as:

$$L_{fsl}(S_e, Q_e) = \frac{1}{|Q_e|} \sum_{(x_i, y_i) \in Q_e} -\log \frac{\exp(-d(f_\phi(x_i), h_{y_i}))}{\sum_{c \in C_e} \exp(-d(f_\phi(x_i), h_c))}, \tag{2}$$

where $d(\cdot, \cdot)$ denotes a distance function (e.g., the $l_2$ distance).
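To make the class-mean embedding and the loss $L_{fsl}$ defined above concrete, the following Python sketch evaluates both on a toy episode. It is only an illustration under stated assumptions, not the authors' implementation: embeddings are taken as precomputed vectors (the feature extractor is omitted), the distance is assumed to be the squared Euclidean distance, and all function names and toy values are hypothetical.

```python
import math

def sq_l2(u, v):
    """Squared Euclidean distance (an assumed choice for d)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototypes(support):
    """Mean embedding h_c per class from (embedding, label) pairs."""
    sums, counts = {}, {}
    for emb, y in support:
        acc = sums.setdefault(y, [0.0] * len(emb))
        for j, value in enumerate(emb):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def class_probs(emb, protos):
    """Softmax over negative distances to every class prototype."""
    logits = {c: -sq_l2(emb, h) for c, h in protos.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def fsl_loss(support, query):
    """Average negative log-probability of the true class over the query set."""
    protos = prototypes(support)
    return sum(-math.log(class_probs(emb, protos)[y]) for emb, y in query) / len(query)

# Toy 2-way 2-shot episode with 2-dimensional embeddings (hypothetical values).
support = [([0.0, 0.0], "cat"), ([0.0, 2.0], "cat"),
           ([4.0, 0.0], "dog"), ([4.0, 2.0], "dog")]
query = [([0.5, 1.0], "cat"), ([3.5, 1.0], "dog")]
protos = prototypes(support)
loss = fsl_loss(support, query)
```

Because each query embedding lies close to its own class mean, the loss in this toy episode is close to zero; moving a query toward the other class mean increases it.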

3.2. PRETEXT TASKS IN IEPT

The schematic of our IEPT is illustrated in Figure 1. We first define a set of 2D-rotation operators $G = \{g_r \mid r = 0, ..., R-1\}$, where $g_r$ denotes the operator that rotates the image by $r \times 90$ degrees and $R$ is the total number of rotations ($R = 4$ in our implementation). Given an original episode $E_e = \{S_e, Q_e\}$ as described in Sec. 3.1, we utilize the 2D-rotation operators from $G$ in turn to transform each image in $E_e$. This results in a set of $R$ extended episodes (including the original one) $E = \{\{S_e^r, Q_e^r\} \mid r = 0, ..., R-1\}$, where $S_e^r = \{(x_i, y_i, r) \mid y_i \in C_e, i = 1, ..., l_k\}$ and $Q_e^r = \{(x_i, y_i, r) \mid y_i \in C_e, i = 1, ..., l_q\}$. Now each episode is denoted as $E_e^r = \{(x_i, y_i, r) \mid y_i \in C_e, i = 1, ..., l_k, l_k + 1, ..., l_k + l_q\}$, where the first $l_k$ samples are from $S_e^r$ and the rest from $Q_e^r$. Note that $\{S_e^0, Q_e^0\}$ is the original episode $\{S_e, Q_e\}$. With the rotation transformations, each sample $(x_i, y_i, r_i)$ in $E$ carries a class label $y_i$ for supervised learning (from the inherent class) and a label $r_i$ from the rotation operator for self-supervised learning. After generating the set of extended episodes $E$, the feature extractor $f_\phi$ is applied to each image $x_i$ in $E$. On these episodes, we design two self-supervised pretext tasks, one at the instance level and the other at the episode level.

Instance-Level Pretext Task. The instance-level task is to recognize different rotation transformations. The idea is that if the model to be meta-learned here (i.e., $f_\phi$) can be used to distinguish different transformations, it must understand the canonical poses of objects (e.g., animals have legs touching the ground and trees have leaves on top), a vital part of class-agnostic and thus transferable knowledge. With the self-supervised rotation label $r_i$, we consider the mapping $f_{\theta_{rot}}: x_i \rightarrow r_i$ for each instance $(x_i, y_i, r_i) \in E$, where $f_{\theta_{rot}}$ is a rotation classifier with learnable parameters $\theta_{rot}$.
Given the input pair $(x_i, r_i)$, the total instance-level rotation loss is a cross-entropy loss:

$$L_{inst} = \frac{1}{R(l_k + l_q)} \sum_{r=0}^{R-1} \sum_{(x_i, y_i, r_i) \in E_e^r} -\log \frac{\exp([f_{\theta_{rot}}(f_\phi(x_i))]_{r_i})}{\sum_{r'=0}^{R-1} \exp([f_{\theta_{rot}}(f_\phi(x_i))]_{r'})}, \tag{3}$$

where $[f_{\theta_{rot}}(f_\phi(x_i))] \in \mathbb{R}^R$ is the rotation scoring vector and $[\cdot]_r$ means taking the $r$-th element.

Episode-Level Pretext Task. We design the episode-level task based on a simple principle: although different extended episodes contain images with different rotation transformations, these transformations do not change their class labels. Consequently, the FSL classifier should produce consistent probability distributions for each instance across different extended episodes. Such consistency can be measured using the Kullback-Leibler (KL) divergence. Formally, for each extended episode $\{S_e^r, Q_e^r\}$ in $E$, we first define the probability distribution of FSL classification over the query set $Q_e^r$ as $P_e^r = [p_1^r; \cdots; p_{l_q}^r] \in \mathbb{R}^{l_q \times n}$, where $p_i^r \in \mathbb{R}^n$ is the probability distribution for $x_i$ in $Q_e^r$ with its $c$-th element $[p_i^r]_c$ ($c = 1, ..., n$) being:

$$[p_i^r]_c = \frac{\exp(-d(f_\phi(x_i), h_c^r))}{\sum_{c'} \exp(-d(f_\phi(x_i), h_{c'}^r))}. \tag{4}$$

The above probability is computed as in Sec. 3.1 and the class embedding $h_c^r$ is obtained from $S_e^r$. The mean probability distribution of the $R$ extended episodes is thus given by:

$$\bar{p}_i = \frac{1}{R} \sum_{r=0}^{R-1} p_i^r. \tag{5}$$

The total episode-level consistency regularization loss is computed with the KL divergence loss:

$$L_{epis} = \frac{1}{R\, l_q} \sum_{r=0}^{R-1} \sum_{i=1}^{l_q} \mathrm{mean}\big(p_i^r (\log p_i^r - \log \bar{p}_i)\big), \tag{6}$$

where $\mathrm{mean}(\cdot)$ is an element-wise averaging function.
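The two self-supervised losses above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the rotation scoring vectors and the per-episode class probabilities are assumed to be given as nested lists, $\mathrm{mean}(\cdot)$ is interpreted as the average over the $n$ class entries, all probabilities are assumed strictly positive (as a softmax output would be), and the function names are hypothetical.

```python
import math

def rotation_loss(scores, rot_labels):
    """Cross-entropy between rotation scoring vectors and rotation labels."""
    total = 0.0
    for s, r in zip(scores, rot_labels):
        top = max(s)  # log-sum-exp with max subtraction for stability
        log_norm = top + math.log(sum(math.exp(v - top) for v in s))
        total += log_norm - s[r]
    return total / len(scores)

def consistency_loss(probs):
    """Episode-level loss for probs[r][i][c]: gap between p_i^r and the mean over r."""
    R, l_q, n = len(probs), len(probs[0]), len(probs[0][0])
    total = 0.0
    for i in range(l_q):
        mean_p = [sum(probs[r][i][c] for r in range(R)) / R for c in range(n)]
        for r in range(R):
            p = probs[r][i]
            # mean(.) is taken here as the average over the n class entries
            total += sum(p[c] * (math.log(p[c]) - math.log(mean_p[c]))
                         for c in range(n)) / n
    return total / (R * l_q)

# One query sample, two classes, two extended episodes (hypothetical values).
agree = consistency_loss([[[0.7, 0.3]], [[0.7, 0.3]]])
disagree = consistency_loss([[[0.9, 0.1]], [[0.1, 0.9]]])
uniform = rotation_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```

When the predictions from all extended episodes agree, the consistency term vanishes; it grows as the predictions diverge. A rotation classifier that outputs uniform scores over four rotations incurs a loss of $\log 4$.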

3.3. INTEGRATED FSL TASK

The two tasks introduced so far are self-supervised tasks that do not use the class labels in the query set. We now describe how the extended episodes can be used in the supervised classification task. Given the set of extended episodes $E$, we denote the feature set of $E$ as $E_{emb}$, where $E_{emb} = \{f_\phi(x_i) \mid (x_i, y_i, r) \in E_e^r, r = 0, \cdots, R-1, i = 1, ..., l_k + l_q\}$. Note that each extended episode in $E$ corresponds to one specific rotation transformation of the same set of images from the original episode $E_e$. Therefore, in order to capture the correlation among instances with different transformations and learn how best to combine them to form the class mean for meta-learning, an instance attention module is deployed w.r.t. each image in $E_e$ (i.e., all images are assumed to be independent). Specifically, based on $E_{emb}$, we construct the feature tensor $F \in \mathbb{R}^{(l_k + l_q) \times R \times d}$, where $d$ is the feature dimension. We then adopt a transformer to obtain the integrated representation for FSL classification. The transformer architecture is based on a self-attention mechanism, as in (Vaswani et al., 2017). It receives the triplet input $(F, F, F)$ as $(Q, K, V)$ (Query, Key, and Value, respectively). With $F^{(i)}$ being the $i$-th row of $F$ (w.r.t. the $i$-th image in $E_e$), the attentive module is defined as:

$$(F_Q^{(i)}, F_K^{(i)}, F_V^{(i)}) = (F^{(i)} W_Q, F^{(i)} W_K, F^{(i)} W_V), \tag{7}$$

$$F_{att}^{(i)} = F^{(i)} + \mathrm{softmax}\left(\frac{F_Q^{(i)} (F_K^{(i)})^T}{\sqrt{d_K}}\right) F_V^{(i)}, \tag{8}$$

where $d_K = d$, and $W_Q$, $W_K$, $W_V$ represent the parameters of three fully-connected layers respectively (the parameters of the integration transformer are collected as $\theta_{int}$). Note that the key and value are computed from each image and its augmented versions, i.e., they are computed independently without using inter-image correlation.
With the attentive feature $F_{att} \in \mathbb{R}^{(l_k + l_q) \times R \times d}$, the integrated representation $F_{integ} = [F^S; F^Q] \in \mathbb{R}^{(l_k + l_q) \times Rd}$ ($F^S$ and $F^Q$ are respectively for the support set and query set) is given by:

$$F_{integ} = \mathrm{flatten}(F_{att}), \tag{9}$$

where $\mathrm{flatten}(\cdot)$ denotes flattening $F_{att}$ along the last two dimensions, i.e., concatenating the attentive features from different extended episodes for the corresponding images. The integrated representation is then inputted to the FSL classifier to define the FSL classification loss:

$$L_{integ} = \frac{1}{l_q} \sum_{i=1}^{l_q} -\log \frac{\exp(-d(F_i^Q, h_{y_i}^f))}{\sum_{c \in C_e} \exp(-d(F_i^Q, h_c^f))}, \tag{10}$$

where the class embedding $h_c^f = \frac{1}{k} \sum_{i=1}^{l_k} F_i^S \cdot I(y_i = c)$ is computed on the support set. Note that the integrated FSL task actually acts as an alternative to prediction averaging.
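As an illustration of the integration step for a single image, the sketch below applies residual self-attention over its $R$ rotated views and then flattens the result into one vector of length $Rd$. It is a minimal sketch under assumptions, not the authors' module: matrices are plain nested lists, the three projections are bias-free $d \times d$ matrices, a single attention head is used, and the helper names and toy values are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def integrate(F_i, W_Q, W_K, W_V):
    """Fuse the R views of one image (R x d) into a flat vector of length R * d."""
    d = len(F_i[0])  # d_K = d, as stated in the text
    Q, K, V = matmul(F_i, W_Q), matmul(F_i, W_K), matmul(F_i, W_V)
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, K_T)]
    attn = [softmax(row) for row in scores]  # R x R attention weights
    mixed = matmul(attn, V)                  # R x d attended values
    F_att = [[f + m for f, m in zip(f_row, m_row)]  # residual connection
             for f_row, m_row in zip(F_i, mixed)]
    return [v for row in F_att for v in row]        # flatten the last two dims

# Toy example with R = 2 views and d = 2 (hypothetical values).
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
F = [[1.0, 0.0], [0.0, 1.0]]
fused = integrate(F, I2, I2, I2)
passthrough = integrate(F, I2, I2, Z2)
```

With a zero value projection the attention branch contributes nothing and the output reduces to the concatenated input views, which makes the role of the residual term easy to see.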

3.4. TOTAL LOSS

The total training loss for our full model consists of the self-supervised losses from the pretext tasks and the supervised losses from the FSL tasks. In this work, in addition to $L_{integ}$ in Eq. (10), another supervised FSL loss $L_{aux}$ is also used (see Figure 1). $L_{aux}$ is the average FSL classification loss over the extended episodes. Formally, it can be written as:

$$L_{aux} = \frac{1}{R} \sum_{r=0}^{R-1} L_{fsl}(S_e^r, Q_e^r). \tag{11}$$

Therefore, the total loss $L_{total}$ for training our full model is given as follows:

$$L_{total} = \underbrace{\underbrace{w_1 L_{inst}}_{\text{instance-level}} + \underbrace{w_2 L_{epis}}_{\text{episode-level}}}_{\text{self-supervised loss}} + \underbrace{w_3 L_{aux} + L_{integ}}_{\text{supervised loss}}, \tag{12}$$

where $w_1$, $w_2$, $w_3$ are the loss weight hyperparameters.

3.5. INFERENCE

During the test stage, we only exploit the integrated representation $F_{integ}$ for the final FSL prediction. The predicted class label for $x_i \in Q_e$ can be computed with Eq. (10) as:

$$y_i^{pred} = \arg\max_{y \in C_e} \frac{\exp(-d(F_i^Q, h_y^f))}{\sum_{c \in C_e} \exp(-d(F_i^Q, h_c^f))}. \tag{13}$$

3.6. FULL IEPT ALGORITHM

For easy reproduction, we present the full algorithm for FSL with IEPT in Algorithm 1. Once ψ is learned, we can perform the inference over the test episodes with Eq. (13).

Algorithm 1 FSL with IEPT
4: Generate the set of extended episodes E from E_e using G
5: Compute the SSL loss L_inst for the instance-level pretext task with Eq. (3)
6: Compute the SSL loss L_epis for the episode-level pretext task with Eq. (6)
7: Compute the supervised FSL loss L_aux over the extended episodes with Eq. (11)
8: Compute the supervised FSL loss L_integ for the integrated episode with Eq. (10)
9: L_total = w_1 * L_inst + w_2 * L_epis + w_3 * L_aux + L_integ
10: Update ψ based on ∇_ψ L_total
11: end for
12: return ψ.

4.1. EXPERIMENTAL SETUP

Datasets. Two widely-used FSL datasets are selected: miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018). The first dataset consists of 100 classes in total (600 images per class) and the train/validation/test split is set to 64/16/20 classes as in (Ravi & Larochelle, 2017). The second dataset is larger, with 608 classes in total (nearly 1,200 images per class), and is split into 351/97/160 classes for train/validation/test. Both datasets are subsets sampled from ImageNet (Russakovsky et al., 2015).

The comparative results for 5-way 1/5-shot FSL are shown in Table 1. We have the following observations: (1) When compared with the representative/latest FSL methods (w/o SSL), our IEPT achieves the best performance on all datasets and under all settings, validating the effectiveness of SSL with IEPT for FSL. (2) Our IEPT also clearly outperforms the two SSL-based FSL methods (Gidaris et al., 2019; Su et al., 2020) which only use instance-level pretext tasks, demonstrating the importance of closer/episode-level integration of SSL into FSL. (3) The improvements achieved by our IEPT over ProtoNet range from 2% to 5%. Since our IEPT takes ProtoNet as the baseline, the obtained margins provide direct evidence that SSL brings significant benefits to FSL. Note that our IEPT is also shown to be effective under both the fine-grained FSL and cross-domain FSL settings in Sec. 4.3 (see Table 3).

Ablation Study. Our full IEPT model is trained with four losses (see Eq. (12)), including two self-supervised losses and two supervised losses: the episode-level SSL loss $L_{epis}$, the instance-level SSL loss $L_{inst}$, the auxiliary FSL loss $L_{aux}$ and the integrated FSL loss $L_{integ}$. To demonstrate the contribution of each loss, we present the ablation study results for our full IEPT model in Table 2, where Conv4-64 is used as the backbone. We start with $L_{integ}$ and then add the additional three losses successively.
It can be observed that the performance of our model continuously increases when more losses are used, indicating that each loss contributes to the final performance.

4.3. FURTHER EVALUATIONS

Different Combination Methods over Episodes. We have introduced a transformer-based attention module to fuse the features of each instance from all extended episodes (so that an integrated episode can be obtained) for the supervised classification task (see Sec. 3.3). In this experiment, we compare it with two alternative ways of across-episode integration: (1) Averaging extended episodes: the extended episodes are directly fused for FSL classification; (2) Averaging all episodes: the extended episodes as well as the integrated episode are fused for FSL classification. We present the comparative results on miniImageNet in Figure 2(a). For comprehensive comparison, the results of FSL with each single extended episode are also reported. We can observe that: (1) The performance of 'Episode 0°' is the highest among the four baselines (i.e., FSL with single extended episode), perhaps because the feature extractor is pretrained on the original images without rotation transformations. (2) FSL by averaging extended episodes (i.e., 'Averaging extended episodes') indeed improves each of the four

It can be seen that the performance of our model consistently grows when R increases from 1 to 4. Additionally, the study on exploiting other pretext tasks for our IEPT is presented in Appendix A.1.

Different Basic FSL Classifiers. As mentioned in Sec. 3.1, we adopt ProtoNet as the basic FSL classifier due to its scalability and simplicity. To further show the effectiveness of our IEPT when other basic FSL classifiers are used, we provide the results obtained by our IEPT using ProtoNet, FEAT, and IMP for FSL in Figure 2(c). It can be clearly observed that our IEPT leads to an improvement of about 1-4% over each basic FSL method (ProtoNet, FEAT, or IMP), indicating that our IEPT can be applied to improve a variety of popular FSL methods.

Comparative Results for Fine-Grained FSL and Cross-Domain FSL.
To evaluate our IEPT algorithm under the fine-grained FSL and cross-domain FSL settings, we conduct experiments on CUB (Wah et al., 2011) and miniImageNet → CUB, respectively. For fine-grained FSL on CUB, following (Ye et al., 2020), we randomly split the dataset into 100 training classes, 50 validation classes, and 50 test classes. For cross-domain FSL on miniImageNet → CUB, the 100 training classes are from miniImageNet; the 50 validation classes and 50 test classes (using the aforementioned split for fine-grained FSL) are from CUB. Under both settings, we use Conv4-64 as the feature extractor. The 5-way 1/5-shot FSL results are shown in Table 3. Our IEPT clearly achieves the best results, yielding 1-3% improvements over the second-best FEAT. This shows the effectiveness of our IEPT under both fine-grained and cross-domain settings.

5. CONCLUSION

We have proposed a novel Instance-level and Episode-level Pretext Task (IEPT) framework for integrating SSL into FSL. For the first time, we have introduced an episode-level pretext task for FSL with self-supervision, in addition to the conventional instance-level pretext task. Moreover, we have also developed an episode extension-integration framework by introducing an integration transformer module to fully exploit the extended episodes for FSL. Extensive experiments on two benchmarks demonstrate that the proposed model (i.e., FSL with IEPT) achieves the new state-of-the-art. Our ongoing research directions include: exploring other episode-level pretext tasks for FSL with self-supervision, and applying FSL with self-supervision to other vision problems.



Figure 1: Schematic of our approach to FSL. Given a training episode, we apply 2D rotations by 0, 90, 180, and 270 degrees to each instance to generate four extended episodes. After going through a feature extraction CNN, four losses over three branches are designed: (1) In the top branch, we employ a self-supervised rotation classifier with the instance-level SSL loss $L_{inst}$. (2) In the middle branch, an FSL classifier is exploited to predict the FSL classification probabilities for each episode. We maximize the classification consistency among the extended episodes by forcing the four probability distributions to be consistent using $L_{epis}$. The average supervised FSL loss $L_{aux}$ is also computed. (3) In the bottom branch, we utilize an integration transformer module to fuse the features extracted from each instance with different rotation transformations; they are then used to compute an integrated FSL loss $L_{integ}$. Among the four losses, $L_{inst}$ and $L_{epis}$ are the self-supervised losses, and $L_{aux}$ and $L_{integ}$ are the supervised losses.
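The episode-extension step described in the caption above can be sketched as follows. This is an illustrative sketch only: images are represented as nested lists of pixel values, rotation is assumed to be counter-clockwise (the direction is not stated in the text), and the function names are hypothetical.

```python
def rot90(img):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def extend_episode(episode, R=4):
    """Return R extended episodes; episode r holds (image rotated r*90 deg, y, r)."""
    extended = []
    current = [(img, y) for img, y in episode]
    for r in range(R):
        # Each sample keeps its class label y and gains the rotation label r.
        extended.append([(img, y, r) for img, y in current])
        current = [(rot90(img), y) for img, y in current]
    return extended

# A toy episode with a single 2x2 "image" of class "a" (hypothetical values).
episodes = extend_episode([([[1, 2], [3, 4]], "a")])
```

The first extended episode (r = 0) is the original episode, and every sample carries both its class label and its rotation label, which is what the instance-level and episode-level tasks consume.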

Input: The training set D_s, the rotation operator set G
The loss weight hyperparameters w_1, w_2, w_3
Output: The learned ψ
1: Randomly initialize all learnable parameters ψ = {φ, θ_rot, θ_int}
2: for iteration = 1, ..., MaxIteration do
3: Randomly sample episode E_e from D_s

Figure 2: (a) Comparison among different combination methods over episodes for FSL with self-supervision. (b) Illustration of the effect of different choices of R on the performance of our model (R denotes the number of extended episodes used for SSL). (c) Comparative results obtained by our IEPT using different basic FSL classifiers (i.e., ProtoNet, FEAT, and IMP). It can be seen clearly that integrated episode-based fusion leads to more separation between classes. All figures present 5-way 1-shot/5-shot results on miniImageNet, using Conv4-64 as the feature extractor.

Given an n-way k-shot FSL task sampled from a test set $D_t$, to imitate the test setting, an FSL model is typically trained in an episodic way. That is, n-way k-shot episodes are randomly sampled from a training set $D_s$, where the class label space of $D_s$ has no overlap with that of $D_t$. Each episode $E_e$ contains a support set $S_e$ and a query set $Q_e$. Concretely, we first randomly sample a set of $n$ classes $C_e$ from the training set, and then generate $S_e$ and $Q_e$ by sampling $k$ support samples and $q$ query samples from each class in $C_e$, respectively. Formally, we have $S_e = \{(x_i, y_i) \mid y_i \in C_e, i = 1, ..., n \times k\}$ and $Q_e = \{(x_i, y_i) \mid y_i \in C_e, i = 1, ..., n \times q\}$.
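The episodic sampling described above can be sketched in Python as follows. The sketch is an assumption-laden illustration, not the authors' data loader: the training set is assumed to be a dictionary mapping each class to a list of samples, support and query samples are drawn without replacement so that the two sets do not overlap, and the function and variable names are hypothetical.

```python
import random

def sample_episode(data_by_class, n, k, q, rng=random):
    """Sample an n-way k-shot episode with q query samples per class."""
    classes = rng.sample(sorted(data_by_class), n)  # the class set C_e
    support, query = [], []
    for c in classes:
        picked = rng.sample(data_by_class[c], k + q)  # without replacement
        support += [(x, c) for x in picked[:k]]
        query += [(x, c) for x in picked[k:]]
    return support, query

# Toy training set: 6 classes with 10 sample identifiers each (hypothetical).
data = {c: [f"{c}_{i}" for i in range(10)] for c in "abcdef"}
S_e, Q_e = sample_episode(data, n=5, k=1, q=3, rng=random.Random(0))
```

For a 5-way 1-shot episode with 3 queries per class, this yields 5 support samples and 15 query samples drawn from the same 5 classes.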

Table 2: Ablation study results for our full IEPT model over miniImageNet and tieredImageNet. Our full model includes two self-supervised losses (i.e., $L_{epis}$ and $L_{inst}$) and two supervised losses (i.e., $L_{aux}$ and $L_{integ}$). Conv4-64 is used as the feature extractor.

Table 3: Comparative results for the fine-grained FSL on CUB (Wah et al., 2011) and the cross-domain FSL on miniImageNet → CUB.

'Integrated episode' with 'Averaging all episodes', the performance of FSL with integrated episode is more stable across different settings, further validating the usefulness of our across-episode integration. Overall, the episode-integration module is indeed effective in FSL with self-supervision. This is also supported by the visualization results presented in Appendices A.3 & A.4.

Different Number of Extended Episodes. In all the above experiments, the number of the extended episodes R is set to 4 (rotation by 0°, 90°, 180°, 270°). Figure 2(b) shows the impact of the value of R. Note that when R = 1, our IEPT model is equivalent to ProtoNet, which is without self-supervision.

ACKNOWLEDGMENTS

This work was supported in part by National Natural Science Foundation of China (61976220 and 61832017), Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098), Open Project Program Foundation of Key Laboratory of Opto-Electronics Information Processing, Chinese Academy of Sciences (OEIP-O-202006), and Alibaba Innovative Research (AIR) Program.


Table 1: Comparative results for 5-way 1/5-shot FSL. The mean classification accuracies (top-1, %) with the 95% confidence intervals are reported. † indicates the result is reproduced by ourselves.

| Method | Backbone | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
|---|---|---|---|---|---|
| MatchingNet (Vinyals et al., 2016) | Conv4-64 | 43.56 ± 0.84 | 55.31 ± 0.73 | - | - |
| ProtoNet † (Snell et al., 2017) | Conv4-64 | 52.61 ± 0.52 | 71.33 ± 0.41 | 53.33 ± 0.50 | 72.10 ± 0.41 |
| MAML (Finn et al., 2017) | Conv4-64 | 48.70 ± 1.84 | 63.10 ± 0.92 | 51.67 ± 1.81 | 70.30 ± 0.08 |
| Relation Net (Sung et al., 2018) | Conv4-64 | 50.40 ± 0.80 | 65.30 ± 0.70 | 54.48 ± 0.93 | 71.32 ± 0.78 |
| IMP † (Allen et al., 2019) | Conv4-64 | 52.91 ± 0.49 | 71.57 ± 0.42 | 53.63 ± 0.51 | 71.89 ± 0.44 |
| DN4 (Li et al., 2019b) | Conv4-64 | 51.24 ± 0.74 | 71.02 ± 0.64 | - | - |
| DN PARN (Wu et al., 2019) | Conv4-64 | 55.22 ± 0.84 | 71.55 ± 0.66 | - | - |
| PN+rot (Gidaris et al., 2019) | Conv4-64 | 53.63 ± 0.43 | 71.70 ± 0.36 | - | - |
| CC+rot (Gidaris et al., 2019) | Conv4-64 | 54.83 ± 0.43 | 71.86 ± 0.33 | - | - |
| DSN-MR (Simon et al., 2020) | Conv4-64 | 55.88 ± 0.90 | 70.50 ± 0.68 | - | - |
| Centroid (Afrasiyabi et al., 2020) | Conv4-64 | 53.14 ± 1.06 | 71.45 ± 0.72 | - | - |
| Neg-Cosine (Liu et al., 2020) | Conv4-64 | 52. | | | |

Feature Extractors. For fair comparison with published results, our IEPT adopts three widely-used feature extractors: Conv4-64 (Vinyals et al., 2016), Conv4-512, and ResNet-12 (He et al., 2016a). Particularly, Conv4-512 is almost the same as Conv4-64 except for a different channel size in the last convolution layer. To speed up the training process, as in many previous works (Ye et al., 2020; Zhang et al., 2020; Simon et al., 2020), we pretrain all the feature extractors on the training split of each dataset for our IEPT. Following (He et al., 2016a), we use the temperature scaling technique during the training phase. On both datasets, the input image size is 84 × 84. The output feature dimensions of Conv4-64, Conv4-512, and ResNet-12 are 64, 512, and 640, respectively.

Evaluation Metrics.
We take the 5-way 5-shot (or 1-shot) FSL evaluation setting, as in previous works. We randomly sample 2,000 episodes from the test split and report the mean classification accuracy (top-1, %) as well as the 95% confidence interval. Since the integration transformer copes with each sample independently, we take a strict non-transductive setting during evaluation.Implementation Details. PyTorch is used for our implementation. We utilize the Adam optimizer (Kingma & Ba, 2015) for Conv4-64 & Conv4-512 and the SGD optimizer for ResNet-12 to train our IEPT model. The hyperparameters of our IEPT model are selected according to the performance on the validation split.We will release the code soon.
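As an illustration of the evaluation protocol described above, the following minimal sketch computes a mean accuracy and a 95% confidence interval over per-episode accuracies. It assumes the common normal approximation (half-width 1.96 · s / sqrt(n)); the text does not state how the interval is computed, so this choice is an assumption. The function name `mean_and_ci95` is hypothetical, and the episode accuracies are synthetic placeholders rather than results.

```python
import math
import random
import statistics


def mean_and_ci95(accuracies):
    """Return (mean, half-width of the 95% confidence interval).

    Assumption: normal approximation with z = 1.96 and the sample
    standard deviation; the paper does not specify the exact formula.
    """
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation
    half_width = 1.96 * std / math.sqrt(n)
    return mean, half_width


# Synthetic per-episode accuracies for 2,000 test episodes (placeholders only).
random.seed(0)
episode_acc = [min(1.0, max(0.0, random.gauss(0.55, 0.10))) for _ in range(2000)]

mean, ci = mean_and_ci95(episode_acc)
print(f"{100 * mean:.2f} ± {100 * ci:.2f}")
```

With a fixed per-episode spread, the half-width shrinks as 1/sqrt(n), which is why a large number of test episodes gives a tight interval.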

4.2. MAIN RESULTS

Comparison to State-of-the-Art Methods. We compare our IEPT with two groups of baselines: (1) recent SSL-based FSL methods (Gidaris et al., 2019; Su et al., 2020); (2) representative/latest FSL methods (w/o SSL) (Snell et al., 2017; Finn et al., 2017; Lee et al., 2019; Ravichandran et al., 2019; Simon et al., 2020; Zhang et al., 2020; Ye et al., 2020; Liu et al., 2020). The comparative results for 5-way

A APPENDIX

A.1 COMPARISON AMONG DIFFERENT SSL STRATEGIES

To generate the extended episodes in IEPT, we apply four rotation transformations (i.e., rotation by 0°, 90°, 180°, and 270°) to each image. It is natural to ask whether other self-supervised strategies are also effective for our IEPT. To this end, we exploit shuffling image patches (see Figure 3) for self-supervised learning (SSL). Specifically, we divide each image into 2 × 2 patches and rearrange the patch order to obtain a shuffling label. For a fair comparison with the rotation strategy, we choose only four shuffling orders: (1, 2, 3, 4), (2, 3, 4, 1), (3, 4, 1, 2), and (4, 1, 2, 3). Note that the (1, 2, 3, 4) order corresponds to the original image. As with the rotation strategy, a fully-connected layer is used to recognize the shuffling order. The comparative results are shown in Table 4. We can see that both IEPT with shuffling and IEPT with rotation achieve better performance than the original ProtoNet. In particular, IEPT with shuffling yields 1-3% and 3-4% improvements under 5-shot and 1-shot, respectively. This clearly shows the effectiveness of our IEPT for FSL even when different SSL strategies are used to define the pretext tasks.
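The patch-shuffling pretext task above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that patches are numbered in raster order (1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right), that an order such as (2, 3, 4, 1) lists which original patch is placed at each position, and that the image height and width are even. The helper names are hypothetical.

```python
# The four shuffling orders from the text; the index into this list is the SSL label.
ORDERS = [(1, 2, 3, 4), (2, 3, 4, 1), (3, 4, 1, 2), (4, 1, 2, 3)]


def split_into_patches(image):
    """Split an image (list of rows) into 2 x 2 patches in raster order."""
    half_h, half_w = len(image) // 2, len(image[0]) // 2
    top, bottom = image[:half_h], image[half_h:]
    return [
        [row[:half_w] for row in top],     # patch 1: top-left
        [row[half_w:] for row in top],     # patch 2: top-right
        [row[:half_w] for row in bottom],  # patch 3: bottom-left
        [row[half_w:] for row in bottom],  # patch 4: bottom-right
    ]


def assemble(patches):
    """Put four patches back together in raster order."""
    top = [left + right for left, right in zip(patches[0], patches[1])]
    bottom = [left + right for left, right in zip(patches[2], patches[3])]
    return top + bottom


def shuffle_patches(image, label):
    """Return the image whose patches are rearranged according to ORDERS[label]."""
    patches = split_into_patches(image)
    return assemble([patches[i - 1] for i in ORDERS[label]])


# Each image yields four (shuffled image, shuffling label) pairs,
# mirroring the four rotated copies used in the rotation strategy.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
extended = [(shuffle_patches(image, k), k) for k in range(len(ORDERS))]
```

Label 0 reproduces the original image, so the four shuffled copies play the same role as the four rotated copies when building extended episodes.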

A.2 COMPARISON AMONG DIFFERENT INTEGRATION APPROACHES

We employ the integration transformer to find the intrinsic correlation among the rotation-transformed versions of each instance. The transformer architecture is based on a self-attention mechanism. Concretely, it receives the feature sets of the extended episodes as the inputs Q, K, and V. It then matches each query in Q against the keys in K and returns the weighted sum of the corresponding values. To show the importance of the transformer module, we compare it with two other integration approaches (i.e., concatenation and averaging) for integrating the features of the extended episodes. The comparative results in Table 5 demonstrate that the integration transformer consistently performs better than the simple concatenation/averaging approaches. This suggests that the attention-based integration transformer is a better choice for designing the integration module.
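To make the query-key-value matching concrete, the sketch below applies single-head scaled dot-product self-attention to the features of one instance under its four transformations, next to the concatenation and averaging alternatives. It is an assumption-laden simplification: learned projection matrices, multiple heads, residual connections, and normalization are omitted, and none of these details are specified in this section. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with Q = K = V = features.

    features: list of n vectors of length d (e.g. n = 4 transformed copies).
    Returns (attended features, attention weights); each weight row sums to 1.
    """
    d = len(features[0])
    outputs, weights = [], []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in features
        ]
        w = softmax(scores)
        # Weighted sum of the values for this query.
        outputs.append(
            [sum(w[j] * features[j][t] for j in range(len(features))) for t in range(d)]
        )
        weights.append(w)
    return outputs, weights


def concatenate(features):
    """Baseline: concatenate the n feature vectors into one vector of length n * d."""
    return [x for f in features for x in f]


def average(features):
    """Baseline: element-wise mean of the n feature vectors."""
    n = len(features)
    return [sum(f[t] for f in features) / n for t in range(len(features[0]))]


# Toy features of one instance under four transformations (illustrative values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
attended, attn = self_attention(feats)
```

Unlike averaging, which weights all four copies equally, the attention weights depend on the features themselves, so each copy can draw more heavily on the copies it agrees with.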

A.3 FEATURE VISUALIZATIONS OF TEST EPISODES

We provide the feature visualizations of test episodes in Figure 4. It can be seen that an integrated episode (the last one in each row) has a clearly better cluster structure than the corresponding four extended episodes (the first four in each row). This indicates that our transformer-based across-episode integration is indeed effective for few-shot classification with self-supervision.

A.4 ATTENTION VISUALIZATION OF TEST EPISODES

We present attention map visualizations of two test episodes (left and right) in Figure 5. Each average attention map is computed by averaging the attention maps of all instances of a certain class. We can observe that: (1) The average attention maps from different classes vary significantly, showing that the diverse semantics of different classes can be reflected by our attention-based integration transformer. (2) When the classes of two episodes overlap (e.g., 'trifle' and 'dalmatian'), the average attention maps of an overlapped class from the two episodes are similar, illustrating that our attention-based integration transformer can capture the semantics of classes well across episodes.

We compare our IEPT with a simple baseline that trains the model with L_aux + L_inst and then makes inference by just averaging the outputs of different extended episodes. The results are shown in Table 6. It can be observed that our IEPT performs much better than this simple integration, due to the extra use of L_integ + L_epis for FSL.
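The simple-integration baseline mentioned above can be sketched as follows: the class probabilities predicted for a query instance in each extended episode are averaged, and the class with the highest average is returned. Whether the averaged "outputs" are probabilities or logits is not stated in the text, so averaging probabilities is an assumption here; the function names and numbers are illustrative only.

```python
def average_predictions(per_episode_probs):
    """Average class-probability vectors predicted in the extended episodes."""
    n_episodes = len(per_episode_probs)
    n_classes = len(per_episode_probs[0])
    return [
        sum(probs[c] for probs in per_episode_probs) / n_episodes
        for c in range(n_classes)
    ]


def predict(per_episode_probs):
    """Return the index of the class with the highest averaged probability."""
    avg = average_predictions(per_episode_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Toy 2-way predictions for one query under four transformations (made-up values).
probs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
print(average_predictions(probs), predict(probs))
```

This baseline combines the extended episodes only at inference time, whereas L_integ and L_epis couple them during training.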

A.6 DIFFERENT ALTERNATIVES OF SELF-SUPERVISED LOSSES

In Table 7, we provide a further ablation study on alternatives for L_epis and L_inst. For the episode-level self-supervised loss L_epis, we compare our implementation (using the KL loss between each distribution and the mean distribution) with one using a pairwise KL loss (i.e., the KL loss between each pair of distributions). For the instance-level self-supervised loss L_inst, we compare our implementation (using the rotation prediction loss) with the recent contrastive self-supervised learning technique (Chen et al., 2020). We observe that our implementation achieves slight performance improvements over those using the pairwise KL loss or the contrastive learning loss.
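The two episode-level alternatives compared above can be sketched as follows, for the class distributions predicted for one query instance in the extended episodes. This is a minimal illustration under stated assumptions: the KL direction (each distribution against the mean, KL(p || m)), the averaging over terms, and the small `eps` added for numerical safety are not specified in the text. All names are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_distribution(dists):
    """Element-wise mean of several distributions."""
    n = len(dists)
    return [sum(d[c] for d in dists) / n for c in range(len(dists[0]))]


def kl_to_mean(dists):
    """Consistency loss: average KL between each distribution and their mean."""
    m = mean_distribution(dists)
    return sum(kl(p, m) for p in dists) / len(dists)


def pairwise_kl(dists):
    """Alternative: average KL over all ordered pairs of distinct distributions."""
    n = len(dists)
    total = sum(kl(dists[i], dists[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


# Toy 3-way predictions from four extended episodes (made-up values).
dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.8, 0.1, 0.1]]
print(kl_to_mean(dists), pairwise_kl(dists))
```

Both losses are zero exactly when all distributions agree. The mean-based form needs one KL term per extended episode, while the pairwise form needs one per ordered pair.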

A.7 APPLICATION OF IEPT TO OPTIMIZATION-BASED METHOD MAML

In Table 8, we show the results obtained by applying our IEPT to the optimization-based model MAML (Finn et al., 2017). We use Conv4-64 as the feature extractor. We can see that our IEPT brings 0.7%-2.1% improvements to MAML. This further shows the flexibility (as well as effectiveness) of our IEPT for FSL.

Published as a conference paper at ICLR 2021

A.8 HYPER-PARAMETER SENSITIVITY TEST

We select the hyper-parameters w_1, w_2, and w_3 from the candidate set {0.1, 0.5, 1.0, 5.0, 10.0} and show the hyper-parameter analysis results in Figure 6. We find that the performance of our IEPT is relatively stable. Concretely, the performance of our IEPT is not sensitive to w_1 and w_2 with proper values, but too large w_1 (i.e., w_1 = 10.0) tends to cause obvious degradation, perhaps because the FSL task is biased by the rotation prediction loss.
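The sensitivity analysis can be illustrated with a weighted sum of the four losses. The sketch below assumes that w_1 weights the rotation prediction loss L_inst (as the remark on large w_1 suggests), and further assumes that w_2 weights L_epis, w_3 weights L_aux, and L_integ is unweighted; this assignment is not given in this section. The loss values are placeholders, and whether Figure 6 varies one weight at a time or uses a full grid is likewise not stated.

```python
from itertools import product

CANDIDATES = [0.1, 0.5, 1.0, 5.0, 10.0]


def total_loss(l_inst, l_epis, l_aux, l_integ, w1, w2, w3):
    """Weighted training objective (the weight-to-loss assignment is an assumption)."""
    return w1 * l_inst + w2 * l_epis + w3 * l_aux + l_integ


def rotation_share(l_inst, l_epis, l_aux, l_integ, w1, w2, w3):
    """Fraction of the total objective contributed by the rotation term."""
    return w1 * l_inst / total_loss(l_inst, l_epis, l_aux, l_integ, w1, w2, w3)


# All combinations of the candidate values for (w1, w2, w3).
grid = list(product(CANDIDATES, repeat=3))

# With equal placeholder loss values, a large w1 dominates the objective.
for w1 in CANDIDATES:
    share = rotation_share(1.0, 1.0, 1.0, 1.0, w1, 1.0, 1.0)
    print(f"w1 = {w1:>4}: rotation term share = {share:.2f}")
```

Under these placeholder values, the rotation term accounts for a quarter of the objective at w_1 = 1.0 but most of it at w_1 = 10.0, which is consistent with the explanation that a large w_1 biases training toward rotation prediction.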

