IEPT: INSTANCE-LEVEL AND EPISODE-LEVEL PRE-TEXT TASKS FOR FEW-SHOT LEARNING

Abstract

The need to collect large quantities of labeled training data for each new task has limited the usefulness of deep neural networks. Given data from a set of source tasks, this limitation can be overcome using two transfer learning approaches: few-shot learning (FSL) and self-supervised learning (SSL). The former aims to learn 'how to learn' by designing learning episodes from the source tasks that simulate the challenge of solving a new target task with few labeled samples. In contrast, the latter exploits an annotation-free pretext task across all source tasks in order to learn generalizable feature representations. In this work, we propose a novel Instance-level and Episode-level Pretext Task (IEPT) framework that seamlessly integrates SSL into FSL. Specifically, given an FSL episode, we first apply geometric transformations to each instance to generate extended episodes. At the instance level, transformation recognition is performed as in standard SSL. Importantly, at the episode level, two SSL-FSL hybrid learning objectives are devised: (1) the consistency across the predictions of an FSL classifier on different extended episodes is maximized as an episode-level pretext task; (2) the features extracted from each instance across different episodes are integrated to construct a single FSL classifier for meta-learning. Extensive experiments show that our proposed model (i.e., FSL with IEPT) achieves a new state of the art.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016b; Huang et al., 2017) have seen tremendous success in a wide range of application fields, especially in visual recognition. However, the powerful learning ability of CNNs depends on a large amount of manually labeled training data. In practice, for many visual recognition tasks, sufficient manual annotation is either too costly to collect or infeasible (e.g., for rare object classes). This has severely limited the usefulness of CNNs in real-world application scenarios. Attempts have been made recently to mitigate this limitation from two distinct perspectives, resulting in two popular research lines, both of which aim to transfer knowledge learned from the data of a set of source tasks to a new target one: few-shot learning (FSL) and self-supervised learning (SSL). FSL (Fei-Fei et al., 2006; Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017; Sung et al., 2018) typically takes a 'learning to learn' or meta-learning paradigm. That is, it aims to learn an algorithm for learning from few labeled samples that generalizes well across tasks. To that end, it adopts an episodic training strategy: the source tasks are arranged into learning episodes, each of which contains n classes and k labeled samples per class to simulate the setting of the target task. Part of the CNN model (e.g., the feature extraction subnet, classification layers, or parameter initialization) is then meta-learned for rapid adaptation to new tasks. In contrast, SSL (Doersch et al., 2015; Noroozi & Favaro, 2016; Iizuka et al., 2016; Doersch & Zisserman, 2017; Noroozi et al., 2018) does not require the source data to be annotated. Instead, it exploits an annotation-free pretext task on the source task data in the hope that a task-generalizable feature representation can be learned from the source tasks for easy adoption or adaptation to a target task. Such a pretext task derives its self-supervised signal at the per-instance level. Examples include rotation and context prediction (Gidaris et al., 2018; Doersch et al., 2015), jigsaw solving (Noroozi & Favaro, 2016), and colorization (Iizuka et al., 2016; Larsson et al., 2016). Since these pretext tasks are class-agnostic, solving them leads to the learning of transferable knowledge.
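The rotation-prediction pretext task mentioned above can be made concrete with a minimal sketch: each unlabeled image is rotated by 0, 90, 180, and 270 degrees, and the rotation index itself serves as a free supervisory signal. The helper name `rotation_pretext_batch` is illustrative, not from the paper.

```python
import numpy as np

def rotation_pretext_batch(images):
    """Create a self-supervised rotation-prediction batch.

    Each image of shape (H, W, C) is rotated by 0/90/180/270 degrees;
    the rotation index (0-3) serves as the annotation-free pseudo-label
    for the instance-level pretext task.
    """
    rotated, labels = [], []
    for img in images:
        for r in range(4):  # r quarter-turns counter-clockwise
            rotated.append(np.rot90(img, k=r, axes=(0, 1)))
            labels.append(r)  # self-supervised label: which rotation?
    return np.stack(rotated), np.array(labels)

# Toy usage: two 8x8 single-channel "images".
imgs = np.random.rand(2, 8, 8, 1)
x, y = rotation_pretext_batch(imgs)
```

A rotation classifier trained on `(x, y)` pairs never sees class labels, which is why the resulting features transfer across tasks.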
Since both FSL and SSL aim to reduce the need to collect a large amount of labeled training data for a target task by transferring knowledge from a set of source tasks, it is natural to consider combining them in a single framework. Indeed, two recent works (Gidaris et al., 2019; Su et al., 2020) proposed to integrate SSL into FSL by adding an auxiliary SSL pretext task to an FSL model. They showed that the SSL learning objective is complementary to that of FSL and that combining them leads to improved FSL performance. However, in (Gidaris et al., 2019; Su et al., 2020), SSL is combined with FSL in a superficial way: it is only taken as a separate auxiliary task for each single training instance and has no effect on the episodic training pipeline of the FSL model. Importantly, by ignoring the class labels of samples, the instance-level SSL learning objective is weak on its own. Since meta-learning across episodes is the essence of most contemporary FSL models, we argue that adding instance-level SSL pretext tasks alone fails to fully exploit the complementarity of FSL and SSL, for which a closer and deeper integration is needed. To that end, in this paper we propose a novel Instance-level and Episode-level Pretext Task (IEPT) framework for few-shot recognition. Apart from adding an instance-level SSL pretext task as in (Gidaris et al., 2019; Su et al., 2020), we introduce two episode-level SSL-FSL hybrid learning objectives for seamless SSL-FSL integration. Concretely, as illustrated in Figure 1, our full model has three additional learning objectives (besides the standard FSL one): (1) Different rotation transformations are applied to each original few-shot episode to generate a set of extended episodes, where each image has a rotation label for the instance-level pretext task (i.e., to predict the rotation label).
(2) The consistency across the predictions of an FSL classifier on different extended episodes is maximized as an episode-level pretext task. For each training image, the rotation transformation does not change its semantic content and hence its class label; the FSL classifier predictions across different extended episodes should thus be consistent, hence the consistency regularization objective. (3) The correlation of features across instances from these extended episodes is modeled by an integration transformer module: the features extracted from each instance across the different episodes are fused to construct a single FSL classifier for meta-learning.
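One simple way to instantiate the episode-level consistency objective in (2) is to penalize the divergence of each rotated view's class-probability predictions from their mean. This is a hedged sketch of the idea, not the paper's exact loss; the function name `consistency_loss` and the KL-to-mean formulation are assumptions.

```python
import numpy as np

def consistency_loss(probs):
    """Episode-level consistency over R transformed views.

    probs: array of shape (R, Q, N) -- class probabilities predicted by
    an FSL classifier for Q query samples under R extended episodes.
    Returns the mean KL divergence of each view's prediction from the
    mean prediction, which is zero iff all views agree exactly.
    """
    eps = 1e-8  # numerical floor to avoid log(0)
    mean = probs.mean(axis=0, keepdims=True)  # (1, Q, N) consensus
    kl = (probs * (np.log(probs + eps) - np.log(mean + eps))).sum(-1)
    return kl.mean()

# Identical predictions across 4 views -> (near-)zero loss.
p = np.tile(np.array([[0.7, 0.2, 0.1]]), (4, 1, 1))  # shape (4, 1, 3)
```

Because rotation preserves an image's class label, minimizing such a loss pushes the classifier to give the same answer regardless of which extended episode an instance came from.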



Figure 1: Schematic of our approach to FSL. Given a training episode, we apply 2D rotations of 0, 90, 180, and 270 degrees to each instance to generate four extended episodes. After passing these through a feature extraction CNN, four losses over three branches are computed: (1) in the top branch, a self-supervised rotation classifier is trained with the instance-level SSL loss L_inst; (2) in the middle branch, an FSL classifier predicts the FSL classification probabilities for each episode; we maximize the classification consistency among the extended episodes by forcing the four probability distributions to agree via L_epis, and also compute the average supervised FSL loss L_aux; (3) in the bottom branch, an integration transformer module fuses the features extracted from each instance under the different rotation transformations, which are then used to compute an integrated FSL loss L_integ. Among the four losses, L_inst and L_epis are self-supervised, while L_aux and L_integ are supervised.
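The episodic training setup underlying Figure 1 (n classes, k labeled support samples per class, plus query samples) can be sketched as a simple sampler over a labeled source pool. The helper name `sample_episode` and its signature are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=1, q_query=3, rng=None):
    """Sample one N-way K-shot episode from a labeled source pool.

    labels: 1-D array of class ids for the whole source dataset.
    Returns (support_idx, query_idx): disjoint dataset indices forming
    the support set (n_way * k_shot) and query set (n_way * q_query).
    """
    rng = rng or np.random.default_rng(0)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                    # k shots per class
        query.extend(idx[k_shot:k_shot + q_query])      # held-out queries
    return np.array(support), np.array(query)

# Toy pool: 10 classes, 20 samples each -> one 5-way 1-shot episode.
labels = np.repeat(np.arange(10), 20)
s, q = sample_episode(labels, n_way=5, k_shot=1, q_query=3)
```

In IEPT, each such episode would then be expanded into four rotated copies before the three branches of Figure 1 are applied.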

