REPRESENTATIVE PROTOTYPE WITH CONTRASTIVE LEARNING FOR SEMI-SUPERVISED FEW-SHOT CLASSIFICATION

Abstract

Few-shot learning aims to learn novel classes from only a few samples per class, which is a very challenging task. To mitigate this issue, prior works obtain representative prototypes with semantic embeddings based on prototypical networks. However, these methods rely on abundant labeled samples, which conflicts with the premise of few-shot learning. We therefore propose a new framework that obtains representative prototypes through semi-supervised learning. Specifically, we introduce a dataset containing unlabeled samples to assist in training the model. More importantly, to fully utilize these unlabeled samples, we adopt a conditional variational autoencoder to construct more representative prototypes. Simultaneously, we develop a novel contrastive loss to improve the generalization ability of the model. We evaluate our method on the miniImageNet and tieredImageNet benchmarks in both 1-shot and 5-shot settings and achieve better performance than the state-of-the-art semi-supervised few-shot methods.

1. INTRODUCTION

In real life, humans are able to quickly form an awareness of new concepts from just one or a few examples. Conventional machine learning, however, usually requires abundant labeled samples to ensure its generalization ability, and obtaining plentiful labeled samples is exceedingly hard on account of security concerns and the high cost in time and money. Motivated by this, many researchers have turned to few-shot learning (FSL). In the field of image classification, FSL aims to achieve good classification accuracy on a small dataset. Generally, prior knowledge is obtained from the base classes and then applied to the novel classes, which contain only a few labeled samples (Fei-Fei et al., 2006) (Wang et al., 2020). Existing studies on FSL roughly fall into four types. (1) Metric-based methods (Koch et al., 2015) (Vinyals et al., 2016) (Zhang et al., 2019b). These methods learn a mapping into a good feature space, in which all samples are converted into feature vectors: the vectors of similar samples lie close together while those of dissimilar samples lie far apart, so samples can be distinguished by distance, typically Euclidean distance (Snell et al., 2017) or cosine distance (Chen et al., 2019a). (2) Optimization-based methods. Within the meta-learning framework, these methods first learn a group of good initial parameters for the network from a large number of similar tasks, and then use these parameters as the starting point for training on a specific task, so that only fine-tuning is needed to converge on new tasks, e.g., (Finn et al., 2017) (Lee et al., 2019) (Fallah et al., 2020). (3) Data-augmentation-based methods (Alfassy et al., 2019) (Schwartz et al., 2018). The fundamental problem of FSL is that samples are few, so it can be alleviated by increasing the diversity of samples.
For example, Zhang et al. (2019a) proposed segmenting images into foreground and background, and then recombining the foregrounds and backgrounds of different images to expand the dataset. (4) Semantics-based methods (Chen et al., 2019b) (Xing et al., 2019) (Li et al., 2020) (Zhang et al., 2021) (Xu & Le, 2022). These methods, a recent research hotspot, are mainly inspired by zero-shot learning (ZSL) and use semantic information as auxiliary information to enhance classification performance. In some cases visual information is richer, while in others semantic information is richer (Xing et al., 2019), which explains why fusing cross-modal information plays an important role in constructing representative class prototypes.
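The metric-based idea described above can be sketched concretely. The following is a minimal NumPy illustration, not the exact formulation of any cited paper: class prototypes are taken as the mean of each class's support embeddings (as in prototypical networks), and queries are assigned to the nearest prototype under either Euclidean or cosine distance. The function names and the assumption that embeddings are already extracted are illustrative.

```python
import numpy as np

def prototypes(support, labels, n_classes):
    """Compute one prototype per class as the mean of that class's support embeddings."""
    return np.stack([support[labels == c].mean(axis=0) for c in range(n_classes)])

def classify(query, protos, metric="euclidean"):
    """Assign each query embedding to the index of its nearest prototype."""
    if metric == "euclidean":
        # Squared Euclidean distance from every query to every prototype.
        d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)
    # Cosine similarity: normalize, then pick the prototype with the largest similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return (q @ p.T).argmax(axis=1)
```

In an episodic FSL setting, `support` and `query` would be the feature vectors produced by a shared embedding network for the labeled support set and the unlabeled query set of one episode.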

