MULTI-REPRESENTATION ENSEMBLE IN FEW-SHOT LEARNING

Abstract

Deep neural networks (DNNs) compute representations in a layer-by-layer fashion, producing a final representation at the top of the pipeline, and classification or regression is performed using this final representation. A number of DNNs (e.g., ResNet, DenseNet) have shown that representations from earlier layers can be beneficial: they improved performance by aggregating representations from different layers. In this work, we ask whether, besides forming an aggregation, these representations can be used directly by the classification layer(s) to obtain better performance. We began by investigating classifiers based on the representations from different layers and observed that these classifiers were diverse and that many of their decisions were complementary, and hence had the potential to yield a better overall decision when combined. Following this observation, we propose an ensemble method that builds an ensemble of classifiers, each taking a representation from a different depth of a base DNN as its input. We tested this ensemble method in the setting of few-shot learning. Experiments were conducted on the mini-ImageNet and tiered-ImageNet datasets, which are commonly used to evaluate few-shot learning methods. Our ensemble achieves new state-of-the-art results on both datasets compared to previous regular and ensemble approaches.

1. INTRODUCTION

The depth of a deep neural network is a main factor contributing to the high capacity of the network. In deep neural networks, information is typically processed layer by layer through many layers before it is fed to the final classification (or regression) layer(s). From a representation-learning point of view, a representation is computed sequentially through the layers, and a final representation is used to perform the target task. There have been deep neural networks that exploit the lower layers in the sequence to achieve better learning results. GoogLeNet (Szegedy et al., 2015) added auxiliary losses at the lower layers to facilitate training. Skip links (such as those used in ResNet (He et al., 2016) and DenseNet (Huang et al., 2017)) may be added to connect the lower layers to the higher ones in a deep architecture. Even though the main purpose of these approaches is to assist the training process or to help gradient back-propagation, their success suggests that representations from the lower layers may be beneficial to many learning tasks. It is therefore worth rethinking the standard sequential structure in which a single final representation makes the prediction. In this work, we ask whether the representations from the lower layers can be used directly (instead of being auxiliary or being aggregated into a final representation) for decision making. If so, how can we take advantage of these lower-level representations, and what are good practices for doing so? We first investigated the problem by performing classification with the representations from different layers. We took the convolutional layers of a trained network as an encoder and tested the representations (feature maps) from different layers of the encoder for their classification performance.
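The probing procedure above can be sketched as follows. This is a minimal illustration, not the paper's actual setup: the toy three-block encoder, the chosen tap points, and global average pooling as the map-to-vector step are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

# Toy encoder standing in for a trained CNN backbone (hypothetical architecture).
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),              # shallow block
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # middle block
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # deep block
)

# Collect feature maps at chosen depths with forward hooks.
feats = {}
def save(name):
    def hook(module, inp, out):
        feats[name] = out
    return hook

for name, idx in [("shallow", 1), ("mid", 3), ("deep", 5)]:
    encoder[idx].register_forward_hook(save(name))

x = torch.randn(4, 3, 32, 32)  # a batch of 4 images
with torch.no_grad():
    encoder(x)

# Global-average-pool each map into a vector that a per-depth classifier
# (e.g., a linear probe) can take as input.
vectors = {k: v.mean(dim=(2, 3)) for k, v in feats.items()}
for k, v in vectors.items():
    print(k, tuple(v.shape))  # shallow (4, 16), mid (4, 32), deep (4, 64)
```

Each pooled vector would then be fed to its own classifier to measure how well that depth's representation supports the task.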
We observed that although the feature maps from the higher layers led to better performance overall, there were many cases in which correct predictions could be made from the lower-level feature maps while the higher-level feature maps failed to do so. This suggests that the lower-level representations have the potential to help classification directly (detailed analysis in Section 3). Inspired by the prior models (i.e., GoogLeNet, ResNet, and DenseNet) and by our own observations, we propose an ensemble approach that directly takes advantage of the lower-level representations. By integrating multiple models, an ensemble is likely to compensate for the errors of any single classifier, so its overall performance tends to exceed that of a single classifier; this makes ensembling a suitable technique for our purpose. A variety of methods exist for ensemble construction: some use sampling to obtain individual models from different subsets of the training data; others construct models with different structures or initializations. However, these common ensemble methods cannot achieve our goal of exploiting the lower-level representations. Instead, we propose a special type of ensemble, different from the existing ones. In particular, each classifier in our ensemble takes a feature map from a different depth of a CNN encoder as input, and the whole ensemble utilizes the feature maps from multiple convolutional layers. We call this approach the multi-representation ensemble. Figure 1 illustrates our ensemble approach and compares it to the common ensemble method. We evaluate our ensemble method on the few-shot learning (FSL) problem (Snell et al., 2017). FSL aims to learn a network capable of recognizing instances (query images) from novel classes with only a few labeled examples (support images) available in each class.
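A minimal sketch of such a multi-representation ensemble follows. The block widths, the one-linear-head-per-depth design, and the combination rule (averaging the per-depth softmax outputs) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiReprEnsemble(nn.Module):
    """One classifier per encoder depth; predictions are averaged."""

    def __init__(self, n_classes=5):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
        ])
        # One linear head per depth, each fed a pooled feature map.
        self.heads = nn.ModuleList([nn.Linear(c, n_classes) for c in (16, 32, 64)])

    def forward(self, x):
        probs = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            pooled = x.mean(dim=(2, 3))             # global average pooling
            probs.append(head(pooled).softmax(-1))  # per-depth prediction
        return torch.stack(probs).mean(0)           # combine the ensemble

model = MultiReprEnsemble()
out = model(torch.randn(2, 3, 32, 32))
print(tuple(out.shape))  # (2, 5)
```

Note the contrast with a common ensemble: the member classifiers here share one forward pass through the encoder and differ only in which depth they read from.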
Given the demanding nature of the problem (learning from only a few examples), many FSL approaches first train an encoder following a regular training paradigm and then further train the encoder and the classifier using an FSL paradigm. Because the encoder plays an important role in few-shot learning, FSL is a good task on which to apply and test our ensemble method, which takes advantage of multiple representations from the encoder. Note that in recent years, many FSL works have employed extra data from the test (novel) classes for better performance. The extra data can be unlabeled and given at test time (transductive learning) (Kye et al., 2020; Yang et al., 2020) or during the training phase (semi-supervised learning) (Rodríguez et al., 2020; Lichtenstein et al., 2020). Our problem scope is the traditional FSL setting, where only a few (one or five) support images per novel class are available at test time. Experiments with our ensemble model were conducted on two FSL benchmark datasets, and we obtained new state-of-the-art results on both. Besides evaluating our ensemble and comparing it to existing FSL methods, we also conducted experiments demonstrating that the use of multiple representations in the ensemble is crucial to the success of our method. Our main contributions are as follows: 1) We propose a novel ensemble method that creates a collection of models by employing multiple representations from different depths of a deep neural network. 2) We demonstrate the advantage of our ensemble model on FSL problems and achieve new state-of-the-art results on two benchmark datasets. Our experiments also show that multi-representation is necessary for the improved performance of the ensemble.
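The traditional FSL evaluation protocol described above samples N-way K-shot episodes, each with a small labeled support set and a held-out query set. The sketch below illustrates this episode construction; the function name, the toy string-valued dataset, and the split sizes (5-way 1-shot with 15 queries per class) are illustrative assumptions.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_query=15):
    """Sample one few-shot episode from `dataset`, a mapping from
    class label to a list of examples (a hypothetical interface)."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + q_query)
        support += [(x, label) for x in examples[:k_shot]]   # labeled few-shot set
        query += [(x, label) for x in examples[k_shot:]]     # to be classified
    return support, query

# Toy dataset: 10 novel classes with 20 examples each.
toy = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
s, q = sample_episode(toy)
print(len(s), len(q))  # 5 75  (5-way 1-shot support; 15 queries per class)
```

A model is scored by its accuracy on the query set, averaged over many such episodes.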

2. RELATED WORK

Ensemble methods. Ensemble methods are commonly used to improve prediction quality. Example ensemble strategies include: (1) manipulating the data, such as applying data augmentation or dividing the original dataset into smaller subsets and training a different model on each subset; (2) applying different models or learning algorithms, for example, training neural networks with varied hyperparameter values such as different learning rates or different structures; (3) hybridizing multiple ensemble strategies, e.g., random forest. Ensembles have also been applied to FSL problems. Liu et al. (2019b) proposed to learn an ensemble of temporal base-learners, which are generated along



Figure 1: Multi-representation classifier ensemble.

