OPTIMAL HYPERPARAMETER OPTIMIZATION MODELS TO GENERATE EFFICIENT ENSEMBLE DEEP LEARNING

Abstract

Ensemble deep learning improves accuracy over a single model by combining the predictions of multiple models. It has established itself as a core strategy for tackling the most difficult problems, such as winning Kaggle challenges. Because there is no consensus on how to design a successful deep learning ensemble, we introduce Hyperband-Dijkstra, a new workflow that automatically explores neural network designs with Hyperband and efficiently combines them with Dijkstra's algorithm. This workflow has the same training cost as a standard Hyperband run, except that sub-optimal solutions are stored and become candidates in the ensemble selection step (recycling). Then, to predict on new data, the user gives Dijkstra the maximum number of models wanted in the ensemble, which controls the trade-off between accuracy and inference time. Hyperband is a very efficient algorithm that allocates exponentially more resources to the most promising configurations. Due to its pure-exploration nature, it also proposes diverse models, which allows Dijkstra's algorithm to achieve a strong variance and bias reduction through a smart combination of those diverse models. The exploding number of possible combinations generated by Hyperband increases the probability that Dijkstra finds an accurate combination which fits the dataset and generalizes to new data. Two experiments, on CIFAR100 and on our unbalanced microfossil dataset, show that our new workflow generates an ensemble far more accurate than any other ensemble of ResNet models from ResNet18 to ResNet152.

1. INTRODUCTION

Ensemble machine learning is a popular method that combines the predictions of several models into a single, more accurate classification. As evidence of its success in Kaggle competitions, all top-5 solutions published in the last seven image recognition challenges use at least one ensemble method; the average and median number of individual models per ensemble is between 7 and 8. Appendix A summarizes these 17 solutions. Despite this recent popularity among practitioners, there is no consensus on how to apply ensembles in the context of deep neural networks. Most of the work on (non-deep) ensemble machine learning was carried out in the 1990s and 2000s, while the implementation of deep learning on GPUs appeared less than 10 years ago. The spread of multi-GPU servers makes it possible to train and evaluate many neural networks simultaneously, but also to deploy ensembles of deep architectures. Another recent trend for improving accuracy is transfer learning, i.e. the use of an external similar data source Kolesnikov et al. (2019). Instead, we look for a new model-oriented method that can be applied to new kinds of problems where no similar dataset exists. Hyperband-Dijkstra is an innovative way to benefit from this increasing computing power. It unifies two approaches that have each proven efficient but are contradictory: hyperparameter optimization (HPO) and ensembling. The first explores and trains models until it finds the optimal solution, wasting the sub-optimal ones, while the second uses a population of trained models to predict more accurately. Hyperband-Dijkstra creates an ensemble based on Hyperband, which generates a huge number of trained deep models; Dijkstra's algorithm then yields efficient combinations of them. As far as we know, using Dijkstra's algorithm to find a subset of k previously trained models in a larger population has never been proposed.
After that, we describe and discuss interesting properties and experimental results on two datasets:
• Hyperband-Dijkstra generates better ensembles than any ensemble of ResNet models.
• Dijkstra's algorithm aggregates k trained models better than the naive strategy of taking the top-k models ranked by validation accuracy.
• Our workflow (with ensembles of size ≥ 2) keeps benefiting from the Hyperband run after many days, while the standard use of Hyperband (keeping only the best model) stops improving much earlier.
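To make the selection problem concrete, here is a toy sketch of why searching for the best subset of models can beat both any single model and the naive top-k rule. The pool, labels and the `best_subset` helper are purely illustrative (hand-made probability outputs, not results from the paper), and the exhaustive search shown here is exactly the combinatorial cost that a smarter search such as Dijkstra's algorithm is meant to avoid on realistic pool sizes.

```python
import itertools

def ensemble_accuracy(member_preds, labels):
    """Accuracy when averaging the members' per-class probabilities."""
    n_classes = len(member_preds[0][0])
    correct = 0
    for i, y in enumerate(labels):
        avg = [sum(p[i][c] for p in member_preds) / len(member_preds)
               for c in range(n_classes)]
        correct += int(max(range(n_classes), key=avg.__getitem__) == y)
    return correct / len(labels)

def best_subset(pool, k, labels):
    """Exhaustively score every subset of at most k models (toy scale only)."""
    best_combo, best_acc = None, -1.0
    for size in range(1, k + 1):
        for combo in itertools.combinations(range(len(pool)), size):
            acc = ensemble_accuracy([pool[i] for i in combo], labels)
            if acc > best_acc:
                best_combo, best_acc = combo, acc
    return best_combo, best_acc

# Hypothetical pool: three binary classifiers on four validation examples.
labels = [0, 1, 0, 1]
pool = [
    [[0.9, 0.1], [0.2, 0.8], [0.45, 0.55], [0.1, 0.9]],  # wrong on example 2, barely
    [[0.8, 0.2], [0.55, 0.45], [0.9, 0.1], [0.3, 0.7]],  # wrong on example 1, barely
    [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.55, 0.45]],  # weak model
]
combo, acc = best_subset(pool, k=3, labels=labels)
```

Each of the two strong models is only 75% accurate alone, but each is confident exactly where the other hesitates, so averaging the first two reaches 100% on this toy validation set.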

2. RELATED WORKS

In this section we briefly review the main ideas from prior work that are relevant to our method. Ensemble. Sollich & Krogh (1995) laid the foundation stone for the idea that over-fitted machine learning algorithms can be averaged out to obtain more accurate results. This phenomenon is explained by the Law of Large Numbers, which states that the average of the results obtained from a large number of trials should be close to the expected value. These results are especially interesting for deep learning models, because their huge number of parameters makes them the machine learning models most affected by random effects (over-fitting). Many ensemble algorithms have been invented, such as Wolpert (1992), Breiman (1996) (2017). There is today no consensus on how to build ensembles, as shown in appendix A. In case the architecture of the models in the ensemble is biased (for example, none of the models is deep enough or wide enough to capture the relevant features in the data), exploiting parametric diversity will not efficiently improve the results.



That is why authors such as Liao & Moody (1999) and Gashler et al. (2008) promote more and more diversity, not only through random weight initialisation but also by mixing different machine learning algorithms, such as neural networks and decision trees, in the same ensemble to maximize diversity and therefore accuracy.
Knapsack problem. A combinatorial optimization problem consists in searching a discrete set for a solution that optimizes a function. In many such problems exhaustive search is not tractable, which is why approximate methods are used. Dijkstra's algorithm Dijkstra (1959) is a path-finding algorithm which locally selects the next best node until it reaches the final node. A* Hart et al. (1972) is an informed algorithm which expands the most promising node first in order to converge faster than Dijkstra; this knowledge is used only if an appropriate heuristic function is available. In the absence of such knowledge, Dijkstra and A* are equivalent. More recently, SP-MCTS Schadd et al. (2008) is a probabilistic approach which runs many tree explorations guided by the Upper Confidence bounds applied to Trees (UCT) Kocsis & Szepesvári (2006) formula to balance exploration and exploitation, gathering a maximum of information on a node before selecting it.
Hyperparameter Optimization. The empirical nature of research in deep learning leads us to try many models, optimization settings and pre-processing settings to find the one best suited to the data. The No Free Lunch theorem Wolpert & Macready (1997) proves that no hyperparameter optimization method can show superior performance in all cases. Nevertheless, methods have been developed that show stable performance on supervised deep learning datasets. Discrete-space search finds the best model description for a given neural network; under this umbrella we find the number of units per layer, regularization parameters, batch size, type of initialization, optimizer strategy and learning rate.
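For reference, a minimal implementation of Dijkstra's greedy frontier expansion on a tiny hand-made graph (the graph and edge weights are illustrative only; the workflow described in this paper reuses the same algorithm to search combinations of trained models):

```python
import heapq

def dijkstra(graph, source, target):
    """Dijkstra's algorithm: repeatedly expand the cheapest frontier node."""
    dist = {source: 0}
    heap = [(0, source)]          # priority queue of (cost so far, node)
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:        # cheapest path to target found
            return d
        for neighbour, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return float("inf")           # target unreachable

# Toy weighted graph as an adjacency list.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 6)],
    "C": [("D", 3)],
}
cost = dijkstra(graph, "A", "D")  # A -> B -> C -> D costs 1 + 2 + 3 = 6
```

The locally greedy choice of the cheapest frontier node is what makes Dijkstra exact without the heuristic function that A* requires.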
Plenty of approaches exist, with different theoretical backgrounds: pure-exploration approaches Bergstra & Bengio (2012), Li et al. (2017), smart computing-resource allocation strategies Li et al. (2017), Falkner et al. (2018), a priori based Hoffman et al. (2011), a posteriori based Bergstra et al. (2011), or genetically inspired Jaderberg et al.
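Among these, the resource-allocation idea behind Hyperband Li et al. (2017) builds on successive halving: train all candidates briefly, discard the worst, and spend the freed budget on the survivors. A minimal sketch follows; the `evaluate` function is a deterministic stand-in for validation accuracy after `budget` epochs, and the configuration pool is hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Hyperband's inner routine: score every survivor at the current budget,
    keep the best 1/eta of them, then repeat with eta times the budget.
    (In Hyperband-Dijkstra, the discarded configurations are not thrown
    away but recycled as candidates for the ensemble selection step.)"""
    budget, survivors = min_budget, list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget),
                        reverse=True)
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

def evaluate(config, budget):
    # Stand-in for validation accuracy after `budget` epochs of training:
    # deterministic here so the example is reproducible; real runs are noisy.
    return config["quality"] * budget / (budget + 1.0)

# Nine hypothetical configurations with increasing true quality.
configs = [{"id": i, "quality": i / 10.0} for i in range(9)]
best = successive_halving(configs, evaluate)
```

With nine configurations and eta = 3, the loop runs two rounds (budgets 1 and 3) and ends with the single strongest configuration, having spent only a small budget on the weak ones.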

