MELR: META-LEARNING VIA MODELING EPISODE-LEVEL RELATIONSHIPS FOR FEW-SHOT LEARNING

Abstract

Most recent few-shot learning (FSL) approaches are based on episodic training, whereby each episode samples a few training instances (shots) per class to imitate the test condition. However, this strict adherence to the test condition has a negative side effect: the trained model is susceptible to the poor sampling of the few shots. In this work, for the first time, this problem is addressed by exploiting inter-episode relationships. Specifically, a novel meta-learning via modeling episode-level relationships (MELR) framework is proposed. By sampling two episodes containing the same set of classes for meta-training, MELR is designed to ensure that the meta-learned model is robust against the presence of poorly-sampled shots in the meta-test stage. This is achieved through two key components: (1) a Cross-Episode Attention Module (CEAM) to improve the model's ability to alleviate the effects of poorly-sampled shots, and (2) a Cross-Episode Consistency Regularization (CECR) to enforce that the two classifiers learned from the two episodes are consistent even when there are unrepresentative instances. Extensive experiments for non-transductive standard FSL on two benchmarks show that our MELR achieves 1.0%-5.0% improvements over the baseline (i.e., ProtoNet) used in our model and outperforms the latest competitors under the same settings.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) have achieved tremendous successes in a wide range of computer vision tasks including object recognition (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Russakovsky et al., 2015; He et al., 2016a), semantic segmentation (Long et al., 2015; Chen et al., 2018), and object detection (Ren et al., 2015; Redmon et al., 2016). For most visual recognition tasks, at least hundreds of labeled training images are required from each class to train a CNN model. However, collecting a large number of labeled training samples is costly and may even be impossible in real-life application scenarios (Antonie et al., 2001; Yang et al., 2012). To reduce the reliance of deep neural networks on large amounts of annotated training data, few-shot learning (FSL) has been studied (Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017; Sung et al., 2018), which aims to recognize a set of novel classes with only a few labeled samples by knowledge transfer from a set of base classes with abundant samples. Recently, FSL has been dominated by meta-learning based approaches (Finn et al., 2017; Snell et al., 2017; Sung et al., 2018; Lee et al., 2019; Ye et al., 2020), which exploit the ample samples from base classes via episodic training. During meta-training, to imitate an N-way K-shot novel class recognition task, an N-way K-shot episode/meta-task is sampled in each iteration from the base classes, consisting of a support set and a query set. By setting up the meta-training episodes exactly the same way as the meta-test ones (i.e., N-way K-shot in the support set), the objective is to ensure that the meta-learned model can generalize to novel tasks. However, this also leads to an unwanted side effect: the model will be susceptible to the poor sampling of the few shots.
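The episodic sampling described above can be sketched as follows. This is a minimal illustration, not code from the paper; the `class_to_indices` mapping and function names are assumptions.

```python
import random

def sample_episode(class_to_indices, n_way=5, k_shot=1, q_query=15):
    """Sample one N-way K-shot episode (support + query) from base classes.

    `class_to_indices` maps each base-class label to a list of sample
    indices (an illustrative data layout, not from the paper).
    """
    classes = random.sample(sorted(class_to_indices), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K + Q distinct samples per class, then split them.
        picked = random.sample(class_to_indices[cls], k_shot + q_query)
        support += [(idx, label) for idx in picked[:k_shot]]
        query += [(idx, label) for idx in picked[k_shot:]]
    return support, query
```

In each meta-training iteration one such episode is drawn, a classifier is built from the support set, and the loss is computed on the query set.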
Outlying training instances are prevalent in vision benchmarks and can be caused by various factors such as occlusions or unusual pose/lighting conditions. When trained with ample samples, modern CNN-based recognition models are typically robust against abnormal instances as long as they are not dominant. However, when as few as one shot per class is used to build a classifier for FSL, the poorly-sampled few shots can be catastrophic: e.g., when the cat class is represented in the support set by a single image of a half-occluded cat viewed from behind, it would be extremely hard to build a classifier that recognizes cats in the query set that are mostly full-body visible and frontal. Existing episodic-training based FSL models do not offer any solution to this problem. The main reason is that different episodes are sampled randomly and independently. When the cat class is sampled in two episodes, these models are not aware that it is the same class, and thus cannot enforce the independently learned classifiers to be consistent with each other, regardless of whether there exist poorly-sampled shots in one of the two episodes.

In this paper, a novel meta-learning via modeling episode-level relationships (MELR) framework is proposed to address the poor sampling problem of the support set instances in FSL. In contrast to the existing episodic training strategy, MELR conducts meta-learning over two episodes deliberately sampled to contain the same set of base classes but different instances. In this way, cross-episode model consistency can be enforced so that the meta-learned model is robust against poorly-sampled shots in the meta-test stage. Concretely, MELR consists of two key components: Cross-Episode Attention Module (CEAM) and Cross-Episode Consistency Regularization (CECR).
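MELR's paired-episode sampling (same classes, different instances) can be sketched as below. This is an illustrative reading of the sampling strategy described above; the helper names and the choice to draw disjoint instance chunks per class are assumptions.

```python
import random

def sample_episode_pair(class_to_indices, n_way=5, k_shot=1, q_query=15):
    """Sample two episodes over the SAME class set but disjoint instances,
    as required for cross-episode training (sketch; layout assumed)."""
    classes = random.sample(sorted(class_to_indices), n_way)
    size = k_shot + q_query
    # Draw 2*(K+Q) distinct samples per class, split across the two episodes.
    per_class = {c: random.sample(class_to_indices[c], 2 * size)
                 for c in classes}
    episodes = []
    for e in range(2):
        support, query = [], []
        for label, cls in enumerate(classes):
            chunk = per_class[cls][e * size:(e + 1) * size]
            support += [(i, label) for i in chunk[:k_shot]]
            query += [(i, label) for i in chunk[k_shot:]]
        episodes.append((support, query))
    return episodes
```

Because both episodes share the class set, the model can be told that class label `l` refers to the same class in either episode, which is what makes cross-episode consistency enforceable.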
CEAM is composed of a cross-episode transformer which allows the support set instances to be examined through attention, so that unrepresentative support samples can be identified and their negative effects alleviated (especially for computing class prototypes/centers). CECR, on the other hand, exploits the fact that since the two episodes contain the same set of classes, the obtained classifiers (class prototypes) should produce consistent predictions regardless of whether there are any poorly-sampled instances in the support set and/or which episode a query instance comes from. This consistency is enforced via cross-episode knowledge distillation.

Our main contributions are three-fold: (1) For the first time, the poor sampling problem of the few shots is formally tackled by modeling the episode-level relationships in meta-learning based FSL. (2) We propose a novel MELR model with two cross-episode components (i.e., CEAM and CECR) to explicitly enforce that the classifiers of the same classes learned from different episodes are consistent regardless of whether there exist poorly-sampled shots. (3) Extensive experiments for non-transductive standard FSL on two benchmarks show that our MELR achieves significant improvements over the baseline ProtoNet (Snell et al., 2017) and even outperforms the latest competitors under the same settings. We will release the code and models soon.
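The cross-episode consistency idea behind CECR can be sketched as a distillation-style loss between the two episodes' predictions on the same queries. The symmetric-KL form and the temperature value below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax, computed stably.
    z = np.exp((logits - logits.max(axis=-1, keepdims=True)) / tau)
    return z / z.sum(axis=-1, keepdims=True)

def consistency_loss(logits_a, logits_b, tau=4.0):
    """Cross-episode consistency as a KL distillation term (sketch).

    `logits_a` / `logits_b`: logits for the same query instances scored by
    the class prototypes of episodes A and B respectively. `tau` is an
    assumed distillation temperature.
    """
    p, q = softmax(logits_a, tau), softmax(logits_b, tau)
    eps = 1e-12
    # Symmetric KL so neither episode's classifier is privileged.
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(axis=-1)
    return float(0.5 * (kl_pq + kl_qp).mean())
```

When the two episodes' prototypes agree on every query, this term vanishes; when a poorly-sampled shot distorts one episode's prototypes, the term penalizes the resulting disagreement.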

2. RELATED WORK

Few-Shot Learning. Few-shot learning (FSL) has become topical recently. Existing methods can be generally divided into four groups: (1) Metric-based methods either learn a suitable embedding space for their chosen/proposed distance metrics (e.g., cosine similarity (Vinyals et al., 2016), Euclidean distance (Snell et al., 2017), and a novel measure SEN (Nguyen et al., 2020)) or directly learn a suitable distance metric (e.g., CNN-based relation module (Sung et al., 2018; Wu et al., 2019), ridge regression (Bertinetto et al., 2019), and graph neural networks (Satorras & Estrach, 2018; Kim et al., 2019; Yang et al., 2020)). Moreover, several approaches (Yoon et al., 2019; Li et al., 2019a; Qiao et al., 2019; Ye et al., 2020; Simon et al., 2020) learn task-specific metrics which are adaptive to each episode instead of learning a shared task-agnostic metric space. (2) Model-based methods (Finn et al., 2017; Nichol et al., 2018; Rusu et al., 2019) learn good model initializations on base classes and then quickly adapt (i.e., finetune) them on novel classes with few shots and a

