ON THE EFFECTIVENESS OF DEEP ENSEMBLES FOR SMALL DATA TASKS

Anonymous authors
Paper under double-blind review

Abstract

Deep neural networks represent the gold standard for image classification. However, they usually need large amounts of data to reach superior performance. In this work, we focus on image classification problems with few labeled examples per class and improve sample efficiency in the low-data regime by using an ensemble of relatively small deep networks. For the first time, our work broadly studies the existing concept of neural ensembling in small-data domains through an extensive validation on popular datasets and architectures. We show that deep ensembling is a simple yet effective technique that outperforms current state-of-the-art approaches for learning from small datasets. We compare different ensemble configurations to their deeper and wider competitors under a fixed total computational budget and provide empirical evidence of their advantage. Furthermore, we investigate the effectiveness of different losses and show that the choice among them should account for several factors.

1. INTRODUCTION

The computer vision field has been revolutionized by the advent of deep learning (DL) [LeCun et al., 2015]. The convolutional neural network (CNN) is the most popular DL model for visual learning tasks thanks to its ability to automatically learn general features via gradient-based optimization algorithms. However, reaching high recognition performance requires collecting and labeling large quantities of images. This requirement cannot always be fulfilled, since collecting images may be extremely expensive or not possible at all. For instance, in the medical field, high-quality annotations by radiology experts are often costly and not manageable at large scales [Litjens et al., 2017]. Different approaches have been proposed by the research community to mitigate the need for training data, tackling the problem from different perspectives. Transfer learning aims at learning representations from one domain and transferring the learned knowledge (e.g., a pre-trained network) to a target domain [Bengio, 2012], [Tan et al., 2018]. Similarly, few-shot learning uses a base set of labeled pairs to generalize from a small support set of target classes [Vanschoren, 2018]. Both approaches suffer from the need to collect a pool of annotated images, and the source and target domains must be somewhat related. Self-supervised learning is another approach that tries to reduce the demand for annotations. Usually, a large set of images is used to teach a CNN to solve a pretext task [Jing & Tian, 2020] in preparation for a later downstream task. In this manner, costly human annotations are not needed, but the challenge of collecting many images remains. The previously cited research directions, in one way or another, still rely on many samples or annotations. Our grand goal is to develop learning algorithms that are as sample-efficient as the human visual system.
In other words, we aim to solve a classification problem given only a limited number of labeled examples. Owing to its great difficulty, this problem is still largely unsolved and rarely studied experimentally. In this work, we propose the use of neural ensembles composed of smaller networks to tackle the problem of learning from a small sample and show the superiority of this methodology. Similarly to recent works [Arora et al., 2020], [Barz & Denzler, 2020], we benchmark the approaches by varying the number of data points in the training sample while keeping it low relative to the current standards of computer vision datasets. It has been shown that large CNNs can handle overfitting and generalize well even when severely over-parametrized [Kawaguchi et al.,
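The ensembling scheme underlying this line of work can be sketched in a few lines. The following is a minimal illustration, not the paper's actual implementation: it assumes the standard deep-ensemble combination rule of averaging the softmax outputs of K independently trained members (here the per-member probabilities are supplied as a toy array; in practice each member would be a small CNN trained from a different random initialization).

```python
import numpy as np

def ensemble_predict(member_probs):
    """Combine K ensemble members by averaging their class probabilities.

    member_probs: array of shape (K, N, C) -- softmax outputs of K members
    for N samples over C classes.
    Returns the averaged distribution (N, C) and the predicted labels (N,).
    """
    avg = np.mean(member_probs, axis=0)       # average over the K members
    return avg, np.argmax(avg, axis=1)        # predict the most likely class

# Toy illustration: 3 members, 2 samples, 3 classes.
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]],
    [[0.4, 0.2, 0.4], [0.3, 0.3, 0.4]],
])
avg, preds = ensemble_predict(probs)          # preds: class 0, then class 1
```

Averaging probabilities (rather than, e.g., majority voting) keeps the combined output a valid distribution and lets members that disagree contribute proportionally to their confidence.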

