REPRESENTATIVE PROTOTYPE WITH CONTRASTIVE LEARNING FOR SEMI-SUPERVISED FEW-SHOT CLASSIFICATION

Abstract

Few-shot learning aims to learn novel classes from only a few samples per class, which is a very challenging task. To mitigate this issue, prior work obtains representative prototypes with semantic embedding based on prototypical networks. However, these methods require abundant labeled samples, which conflicts with the few-shot setting. We therefore propose a new model framework that constructs representative prototypes with semi-supervised learning. Specifically, we introduce a dataset containing unlabeled samples to assist model training. More importantly, to fully utilize these unlabeled samples, we adopt a conditional variational autoencoder (CVAE) to construct more representative prototypes. Simultaneously, we develop a novel contrastive loss to improve the generalization ability of the model. We evaluate our method on the miniImageNet and tieredImageNet benchmarks in both 1-shot and 5-shot settings and achieve better performance than the state-of-the-art semi-supervised few-shot methods.

1. INTRODUCTION

In real life, humans are able to quickly form concepts from just one or a few examples. Conventional machine learning, however, usually requires abundant labeled samples to ensure generalization. In practice, obtaining plentiful labeled samples is exceedingly hard due to security concerns and the high cost in time and money. Motivated by this, many researchers have turned to few-shot learning (FSL). In image classification, FSL aims to achieve good classification accuracy on a small dataset. Generally, prior knowledge is obtained from the base classes and then applied to the novel classes, which contain only a few labeled samples (Fei-Fei et al., 2006) (Wang et al., 2020). Existing studies on FSL roughly fall into four types. (1) Metric-based methods (Koch et al., 2015) (Vinyals et al., 2016) (Zhang et al., 2019b). These methods learn a feature space in which all samples are converted into feature vectors such that vectors of similar samples are close and vectors of dissimilar samples are far apart, so that samples can be distinguished; the distance is usually Euclidean distance (Snell et al., 2017) or cosine distance (Chen et al., 2019a). (2) Optimization-based methods. Within the meta-learning framework, these methods first learn a good set of initial parameters for the network from a large number of similar tasks, and then use this initialization when training on a specific task, so that new tasks converge with only fine-tuning, e.g., (Finn et al., 2017) (Lee et al., 2019) (Fallah et al., 2020). (3) Data-augmentation-based methods (Alfassy et al., 2019) (Schwartz et al., 2018). The fundamental problem of FSL is that samples are few, so it can be alleviated by increasing the diversity of samples.
For example, (Zhang et al., 2019a) proposed to segment an image into foreground and background and then recombine the foregrounds and backgrounds of different images, thereby expanding the dataset. (4) Semantics-based methods (Chen et al., 2019b) (Xing et al., 2019) (Li et al., 2020) (Zhang et al., 2021) (Xu & Le, 2022). These methods, a recent research hotspot mainly inspired by zero-shot learning (ZSL), use semantic information as auxiliary information to enhance classification performance. In some cases visual information is richer, while in others semantic information is richer (Xing et al., 2019), which explains why fusing cross-modal information plays an important role in constructing representative class prototypes. Generally, most of these methods are not used alone but in combination, and almost all are based on the meta-learning framework. However, class prototypes constructed under the meta-learning framework are not representative enough because the support set contains few samples. Therefore, we propose a new model to construct representative class prototypes. For FSL, the prototype-based approach is typical. Simply put, it constructs a class prototype for each class from the support set, and then keeps test samples (from the query set) close to the prototype of their own class and away from the other class prototypes. Prototypical networks (ProtoNet) (Snell et al., 2017) first applied this idea to FSL; the basic idea is that the samples of each class are mapped to a feature space through a neural network, and the mean feature of all samples of a class in this space serves as the class prototype. Much subsequent work builds on ProtoNet. A notable extension of ProtoNet (Ren et al., 2018) exploits unlabeled samples when constructing prototypes and, moreover, provides a precise analysis of distractors among the unlabeled samples.
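The ProtoNet idea just described — the class prototype as the mean support feature, with classification by nearest prototype — can be sketched in a few lines. This is a minimal NumPy illustration; the function names and toy features are ours, not taken from any cited implementation:

```python
import numpy as np

def compute_prototypes(support_feats, support_labels, n_classes):
    """Class prototype = mean of the support features of each class (ProtoNet)."""
    dim = support_feats.shape[1]
    prototypes = np.zeros((n_classes, dim))
    for c in range(n_classes):
        prototypes[c] = support_feats[support_labels == c].mean(axis=0)
    return prototypes

def classify(query_feats, prototypes):
    """Assign each query to its nearest prototype under Euclidean distance."""
    dists = np.linalg.norm(query_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy 2-way, 2-shot episode: two support samples per class.
support = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
labels = np.array([0, 0, 1, 1])
protos = compute_prototypes(support, labels, 2)   # [[0, 1], [10, 11]]
preds = classify(np.array([[0., 1.], [9., 11.]]), protos)
```

In a real episode the features would come from a learned feature extractor rather than raw coordinates, but the prototype and distance computations are exactly this simple.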
Furthermore, (Liu et al., 2020) proposed a cosine-similarity-based prototypical network that selects neighboring samples to augment the support set, and (Xue & Wang, 2020) trained a regression model to restore the biased prototype using the Euclidean distance between the biased prototype and the real prototype. The setting of our work is similar to (Ren et al., 2018); the difference is that we first cluster the unlabeled samples and then infer their labels from the cluster centers and the class prototypes. In addition, we adopt a generative model to make full use of the unlabeled samples. We observe that the key to the prototype-based method is how to use a few samples to construct a representative class prototype, which is challenging because of few or noisy samples. The approaches mentioned above rely solely on visual features for few-shot classification. Recently, inspired by ZSL, some work has combined semantic embedding with prototype-based methods to enhance performance. (Xing et al., 2019) utilize cross-modal information (visual features and semantic embedding) to generate visual and semantic prototypes and fuse the two prototypes with different weights. (Zhang et al., 2021) takes attribute features as prior knowledge to complete the biased prototype. (Xu & Le, 2022) first selects representative samples in the base classes by assuming that the features of each class follow a multivariate Gaussian distribution; a conditional variational autoencoder (CVAE) is then used to generate representative features from these samples, and representative class prototypes are constructed from the generated features together with the support features of the novel classes. This work opens up two ideas for constructing representative class prototypes: data preprocessing, and using generative models to augment data.
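One way to realize the clustering step described above — cluster the unlabeled features, match each cluster center to its nearest class prototype, then recompute the prototypes — is sketched below. This is our own minimal NumPy sketch under simplifying assumptions (plain k-means initialized at the prototypes, hard pseudo-labels); it is not the exact procedure of this or any cited paper:

```python
import numpy as np

def refine_prototypes(prototypes, support_feats, support_labels,
                      unlabeled_feats, n_iters=10):
    """Sketch: k-means on unlabeled features (initialized at the prototypes),
    pseudo-label each cluster by its nearest class prototype, then recompute
    each prototype from its support samples plus its pseudo-labeled samples."""
    centers = prototypes.copy()
    for _ in range(n_iters):  # plain k-means on the unlabeled features
        d = np.linalg.norm(unlabeled_feats[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(len(centers)):
            members = unlabeled_feats[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    # Map each cluster center to the class of its nearest prototype.
    cluster_to_class = np.linalg.norm(
        centers[:, None] - prototypes[None], axis=-1).argmin(axis=1)
    pseudo = cluster_to_class[assign]
    refined = np.zeros_like(prototypes)
    for c in range(len(prototypes)):
        pool = np.vstack([support_feats[support_labels == c],
                          unlabeled_feats[pseudo == c]])
        refined[c] = pool.mean(axis=0)
    return refined
```

Distractor classes among the unlabeled samples (Ren et al., 2018) would require extra clusters or a rejection threshold, which this sketch omits.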
Many previous methods require a large number of labeled samples at the training stage, which is inconsistent with common real-world scenarios, so semi-supervised learning can be used in our work. To make full use of unlabeled samples, these samples and the semantic information are fed into the CVAE to generate more features for constructing representative class prototypes. Concurrently, a novel contrastive loss is introduced, which improves the generalization ability of the model. Please note that this contrastive loss is computed in the feature space produced by the feature extractor. Based on the above, we propose a novel model framework via meta-learning. The main contributions of this paper can be summarized as follows:

• We propose a new prototype recovery framework based on meta-learning, which can effectively use unlabeled samples to construct representative class prototypes.

• We develop a novel contrastive loss that uses class prototypes as anchors, which yields better inter-class discriminability and mitigates the generalization problem.

• We evaluate our approach on two benchmark datasets for few-shot learning, namely miniImageNet and tieredImageNet. The experimental results show that our method achieves higher performance, outperforming semi-supervised few-shot learning baselines.

We summarize related work in Section 2. Section 3 provides a rundown of our approach. Section 4 reports the main results obtained with our method. Section 5 analyzes our method from different aspects.
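As an illustration of a prototype-anchored contrastive loss of the kind named in the contributions, an InfoNCE-style objective over similarities to the class prototypes can be sketched as follows. The cosine-similarity choice, the temperature `tau`, and all names here are illustrative assumptions, not necessarily this paper's exact formulation:

```python
import numpy as np

def prototype_contrastive_loss(query_feats, query_labels, prototypes, tau=0.1):
    """InfoNCE-style sketch with prototypes as anchors: each query feature is
    pulled toward its own class prototype (positive) and pushed away from all
    other prototypes (negatives)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = q @ p.T / tau                       # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(q)), query_labels].mean()
```

Minimizing this loss pushes each query's similarity to its own prototype above its similarity to the others, which is one concrete way to obtain the inter-class discriminability discussed above.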

