ON THE EFFECTIVENESS OF DEEP ENSEMBLES FOR SMALL DATA TASKS

Anonymous authors
Paper under double-blind review

Abstract

Deep neural networks represent the gold standard for image classification. However, they usually need large amounts of data to achieve high performance. In this work, we focus on image classification problems with only a few labeled examples per class and improve sample efficiency in the low data regime by using an ensemble of relatively small deep networks. To the best of our knowledge, our work is the first to broadly study neural ensembling in small data domains through an extensive validation on popular datasets and architectures. We show that deep ensembling is a simple yet effective technique that outperforms current state-of-the-art approaches for learning from small datasets. We compare different ensemble configurations to their deeper and wider competitors given a fixed total computational budget and provide empirical evidence of their advantage. Furthermore, we investigate the effectiveness of different losses and show that their choice should take several factors into account.

1. INTRODUCTION

The computer vision field has been revolutionized by the advent of deep learning (DL) [Y. LeCun & Hinton, 2015]. The convolutional neural network (CNN) is the most popular DL model for visual learning tasks thanks to its ability to automatically learn general features via gradient-based optimization. However, reaching high recognition performance requires collecting and labeling large quantities of images. This requirement cannot always be fulfilled, since collecting images may be extremely expensive or not possible at all. For instance, in the medical field, high-quality annotations by radiology experts are often costly and do not scale [Litjens et al., 2017]. The research community has proposed different approaches to mitigate the need for training data, tackling the problem from different perspectives. Transfer learning aims at learning representations in one domain and transferring the learned knowledge (e.g. a pre-trained network) to a target domain [Bengio, 2012], [Tan et al., 2018]. Similarly, few-shot learning uses a base set of labeled pairs to generalize from a small support set of target classes [Vanschoren, 2018]. Both approaches suffer from the need to collect a pool of annotated images, and the source and target domains must be somewhat related. Self-supervised learning is another approach that tries to reduce the demand for annotations: a large set of images is typically used to train a CNN on a pretext task [Jing & Tian, 2020] before tackling a downstream task. In this manner, costly human annotations are not needed, but the challenge of collecting many images remains. The research directions cited above, in one way or another, still rely on many samples or annotations. Our grand goal is to develop learning algorithms that are as sample-efficient as the human visual system.
In other words, we aim to solve a classification problem with only a limited number of labeled examples. Owing to its great difficulty, this problem is still largely unsolved and rarely studied experimentally. In this work, we propose the use of neural ensembles composed of smaller networks to tackle the problem of learning from a small sample and show the superiority of this methodology. Similarly to recent works [Arora et al., 2020], [Barz & Denzler, 2020], we benchmark the approaches by varying the number of data points in the training sample while keeping it small with respect to the current standards of computer vision datasets. It has been shown that large CNNs can handle overfitting and generalize well even when severely over-parametrized [Kawaguchi et al., 2017], [Neyshabur et al., 2018]. A recent study has also provided empirical evidence that this behaviour may hold even for tiny datasets, making large nets a viable choice when the training sample is limited [Bornschein et al., 2020]. A well-known technique to reduce model variance is to average the predictions of a set of weak learners (e.g. random forests [Breiman, 2001]). An ensemble of low-bias, decorrelated learners, combined with randomized inputs and prediction averaging, generally mitigates overfitting. Neural network ensembles have been widely used in the past for both regression and classification problems [Hansen & Salamon, 1990], [Giacinto & Roli, 2001], [Li et al., 2018]. However, state-of-the-art classification pipelines rarely employ ensembling because of the large size of modern CNNs, which would make such an approach computationally prohibitive. On the other hand, in domains where training data is not abundant, ensembles of relatively small neural networks become highly attractive thanks to the factors above.
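As a concrete illustration of the prediction-averaging scheme discussed above, the following sketch averages the softmax outputs of K ensemble members. The random logits stand in for the outputs of K independently trained small networks; all names and shapes are illustrative, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical logits from K = 5 small networks on a batch of
# 4 samples with 3 classes (random stand-ins for real model outputs).
K, batch, classes = 5, 4, 3
logits = rng.normal(size=(K, batch, classes))

member_probs = softmax(logits)               # (K, batch, classes)
ensemble_probs = member_probs.mean(axis=0)   # average the predictive distributions
predictions = ensemble_probs.argmax(axis=1)  # final class per sample
```

Averaging the members' predictive distributions (soft voting) rather than their hard labels preserves each member's confidence, which is also what makes such ensembles usable as uncertainty estimators.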
Furthermore, ensemble approaches have also been shown to be good estimators of predictive uncertainty, an important tool for any predictive system [Lakshminarayanan et al., 2017]. Motivated by the favorable variance-reduction properties of ensembles, we systematically study their effectiveness in small data tasks by a) fixing a computational budget and b) comparing them to corresponding deeper and wider single-network variants. We also study how performance varies with depth, width, and ensemble size. According to our empirical study, ensembles are preferable to wider networks, which are in turn better than deeper ones. The obtained results confirm our intuition and show that running a large ensemble is advantageous in terms of accuracy in domains with small data. Finally, we study the effectiveness of two losses: 1) the widely used cross-entropy and 2) the recently proposed cosine loss [Barz & Denzler, 2020]. Although the latter was specifically proposed for small data problems, we observed cases in which cross-entropy still yields higher accuracy; the outcome appears to depend on the combination of model complexity and the amount of available data. In summary, the contributions of our work are the following: i) we systematically study the use of neural ensembles in the small sample domain and show that they improve the state of the art; ii) we show that ensembles of smaller-scale networks outperform their computationally equivalent single competitors with increased depth and width; iii) we compare state-of-the-art losses, showing that their performance depends on diverse factors, and we provide a way of choosing the right configuration depending on the situation.
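For reference, a minimal NumPy sketch of the two losses on dummy data follows. The cosine loss here uses one-hot targets, one of the variants discussed by Barz & Denzler (2020); the data and function names are illustrative stand-ins for the actual training pipeline.

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-12):
    """Mean categorical cross-entropy over a batch of probability vectors."""
    return -(onehot * np.log(probs + eps)).sum(axis=1).mean()

def cosine_loss(outputs, onehot):
    """1 - cosine similarity between the L2-normalized network output and the
    one-hot class embedding (one variant of the Barz & Denzler cosine loss)."""
    normed = outputs / np.linalg.norm(outputs, axis=1, keepdims=True)
    return (1.0 - (normed * onehot).sum(axis=1)).mean()

onehot = np.eye(3)[[0, 2]]               # targets for a batch of 2 samples
outputs = np.array([[0.9, 0.05, 0.05],
                    [0.1, 0.1, 0.8]])
ce = cross_entropy(outputs, onehot)      # treats outputs as probabilities
cos = cosine_loss(outputs, onehot)       # treats outputs as an embedding
```

The key difference is that the cosine loss is invariant to the norm of the output vector, penalizing only its direction relative to the target, which is one reason it has been argued to behave differently from cross-entropy in low data regimes.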

2. RELATED WORK

In this section, we first summarize the main techniques proposed in the literature to learn from a small sample; we then give an overview of neural ensembling techniques. Learning from a small sample is extremely challenging and, for this reason, largely unsolved: few works have tried to train DL architectures with a small number of samples. We start with a series of works that focused on the classification of vector data and mainly used the UCI Machine Learning Repository as a benchmark. In [Fernández-Delgado et al., 2014] the authors showed the superiority of random forests over a large set of classifiers, including feed-forward networks. Later, [Olson et al., 2018] used a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. They showed that these sub-networks were relatively uncorrelated, which led to an internal regularization process similar to that of random forests and yielded comparable results. More recently, [Arora et al., 2020] proposed the use of neural tangent kernel (NTK) architectures in low data tasks and obtained significant improvements over all previously mentioned classifiers. None of these works tested CNNs, since the inputs were not images. In the computer vision domain, a straightforward way to improve generalization is to synthesize new images through different transformations (e.g. data augmentation [Shorten & Khoshgoftaar, 2019]). Prior knowledge about the problem at hand can turn out to be useful in some cases [Hu et al., 2017]; however, this makes data augmentation techniques not always generalizable to all possible image classification domains. It has also been proposed to train generative models (e.g. GANs) to increase the dataset size and, consequently, performance [Liu et al., 2019].
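As a concrete illustration of such label-preserving transformations, the sketch below applies a random horizontal flip followed by pad-and-crop, a generic augmentation recipe for natural images rather than the pipeline of any specific work cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, pad=4):
    """Random horizontal flip plus random crop after zero-padding:
    two standard label-preserving transformations for natural images."""
    h, w = image.shape[:2]
    if rng.random() < 0.5:
        image = image[:, ::-1]                    # flip along the width axis
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)            # crop offset in [0, 2*pad]
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

img = rng.random((32, 32, 3))  # dummy 32x32 RGB image
aug = augment(img)             # same shape, shifted/flipped content
```

Because the transformed image keeps its class label, each training example effectively yields many slightly different views, which is what makes augmentation a cheap substitute for additional data.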
Generating new images to improve performance is extremely attractive and effective. Yet, training a generative model

