OPTIMAL HYPERPARAMETER OPTIMIZATION MODELS TO GENERATE EFFICIENT ENSEMBLE DEEP LEARNING

Abstract

Ensemble deep learning improves accuracy over a single model by combining the predictions of multiple models. It has established itself as the core strategy for tackling the most difficult problems, such as winning Kaggle challenges. Because there is no consensus on how to design a successful deep learning ensemble, we introduce Hyperband-Dijkstra, a new workflow that automatically explores neural network designs with Hyperband and efficiently combines them with Dijkstra's algorithm. This workflow has the same training cost as a standard Hyperband run, except that sub-optimal solutions are stored and become candidates for the ensemble selection step (recycling). Next, to predict on new data, the user gives Dijkstra the maximum number of models wanted in the ensemble, controlling the trade-off between accuracy and inference time. Hyperband is a very efficient algorithm, allocating exponentially more resources to the most promising configurations. Thanks to its pure-exploration nature, it also proposes diverse models, which allows Dijkstra's algorithm, through a smart combination of diverse models, to achieve a strong reduction of both variance and bias. The exploding number of possible combinations generated by Hyperband increases the probability that Dijkstra finds an accurate combination which fits the dataset and generalizes to new data. The two experiments, on CIFAR100 and on our unbalanced microfossils dataset, show that our new workflow generates an ensemble far more accurate than any ensemble of ResNet models from ResNet18 to ResNet152.

1. INTRODUCTION

Ensemble machine learning is a popular method that combines the predictions of several models into a single, more accurate classification. In light of its success in Kaggle competitions, all top-5 solutions published in the last seven image recognition challenges use at least one ensemble method, and the average and median number of individual models per ensemble is between 7 and 8. Appendix A summarizes these 17 solutions. Despite this recent popularity among practitioners, there is no consensus on how to apply ensembles in the context of deep neural networks. Most of the work on (non-deep) ensemble machine learning was carried out in the 1990s and 2000s, while the implementation of deep learning on GPUs appeared less than 10 years ago. The emergence of multi-GPU servers makes it possible to train and evaluate many neural networks simultaneously, but also to deploy ensembles of deep architectures. Another recent trend for improving accuracy is transfer learning, i.e., the use of an external, similar data source Kolesnikov et al. (2019). Instead, we search for a new model-oriented method that can be applied to new kinds of problems where no similar dataset exists. Hyperband-Dijkstra is an innovative way to benefit from this increasing computing power. It unifies two approaches that are both proven efficient yet contradictory: hyperparameter optimization (HPO) and ensembling. The first explores and trains models until it finds the optimal solution, wasting the sub-optimal ones, while the second uses a population of trained models to predict more accurately. Hyperband-Dijkstra creates an ensemble based on Hyperband, which generates a huge number of trained deep models; Dijkstra's algorithm then yields efficient combinations of them. As far as we know, using Dijkstra's algorithm to find a subset of k previously trained models within a larger population has never been proposed.
We then describe and discuss interesting properties and experimental results on two datasets:
• Hyperband-Dijkstra generates better ensembles than any ensemble of ResNet models.
• Dijkstra's algorithm aggregates k trained models better than a naive strategy consisting in taking the top-k models ranked by validation accuracy.
• Our workflow (with ensembles of size ≥ 2) keeps benefiting from the Hyperband run after many days, while a standard use of Hyperband (taking only the best model) stops improving much earlier.

2. RELATED WORKS

In this section we briefly review the main ideas from prior work that are relevant to our method. Ensemble. Sollich & Krogh (1995) laid the foundation stone for the idea that over-fitted machine learning algorithms can be averaged out to obtain more accurate results. This phenomenon is explained by the Law of Large Numbers, which states that the average of the results obtained from a large number of trials should be close to the expected value. These results are especially interesting for deep learning models because, due to their huge number of parameters, they are the machine learning models most affected by random effects (over-fitting). Many ensemble algorithms have been invented, such as stacking Wolpert (1992), bagging Breiman (1996) or boosting Schwenk & Bengio (2000). Other methods are specific to neural networks, like negative correlation learning Liu & Yao (1999), dropout Srivastava et al. (2014) or snapshot learning Huang et al. (2017). There is today no consensus on the way to build ensembles, as shown in appendix A. In case the architecture of the models in the ensemble is biased (for example, all models are not deep enough or not wide enough to capture relevant features in the data), exploiting parametric diversity will not efficiently improve the results. That is why authors Liao & Moody (1999) advocate combining architecturally diverse models. Hyperparameter optimization. Several HPO strategies exist, among them Hyperband Li et al. (2017). Those methods are not exclusive; for example, BOHB Falkner et al. (2018) mixes a Bayesian optimization strategy with Hyperband. Another automatic approach is graph-space search Pham et al. (2018), which consists in finding the best architecture (graph) of neural networks. It provides a maximum degree of freedom in the construction of the neural network architecture. Due to the infinity of combinations, scientists implement several constraints to limit the possibilities of graph generation, save computation cost and preserve the correctness of the generated graphs.
All hyper-parameters, such as optimization settings and data pre-processing, are given by the user to drive the search in this graph-space. Due to this complexity, and because only model architectures are explored, we decided not to follow this path. Parallel hyperparameter optimization. All HPO strategies presented in this paper are asynchronous, so their deployment is ideal on multi-GPU or multi-node GPU HPC systems. Distributed client-server software Matthew Rocklin (2015), Moritz et al. (2018) makes it possible to spread the training and evaluation of candidate models across workers, and also to serve them in parallel. Multi-objective goal. Johnston et al. (2017) discovered that many neural networks have a comparable accuracy. The literature suggests that the hyper-parameter function topology has two plateaus: one where the optimizer algorithm converges and one where it does not. This flatness can be used to optimize a secondary goal such as model size, time-to-prediction, power consumption and so on. Patton et al. (2019) propose a multi-objective optimization to search not only for accurate models but also for faster ones. Early stopping. Common practices exist to speed up an HPO run, like early stopping. They consist in resource reallocation strategies that consider the learning dynamics of DNNs Prechelt (1998) Li et al. (2017). Early stopping is also known as a regularization method: a plateauing validation accuracy is symptomatic of a model that will not generalize well (over-fitting), so the training is stopped.
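The plateau-based early stopping rule described above can be sketched in a few lines. This is a generic sketch, not the paper's implementation; the `patience` and `min_delta` values are illustrative defaults:

```python
def should_stop(val_accuracies, patience=5, min_delta=1e-3):
    """Plateau detection: stop training when the best validation
    accuracy of the last `patience` epochs does not beat the best
    accuracy seen before them by at least `min_delta`."""
    if len(val_accuracies) <= patience:
        return False                      # not enough history yet
    best_before = max(val_accuracies[:-patience])
    best_recent = max(val_accuracies[-patience:])
    return best_recent < best_before + min_delta
```

A run whose validation accuracy is still climbing is kept; a run stuck at the same value for `patience` epochs is cut, freeing the GPU for another configuration.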

3. PROPOSED WORKFLOW

In this section we first present the proposed workflow before explaining it step by step.

3.1. DETAIL OF THE WORKFLOW

As shown in figure 1, the proposed workflow consists in running Hyperband and saving to disk not only the best model but the sub-optimal ones too. Second, a combinatorial optimization algorithm (Dijkstra's) finds the best ensemble given the maximum number of models desired by the user (noted K). Dijkstra's algorithm computes the validation loss of candidate ensembles to estimate how well a solution will generalize on the test database. The workflow we introduce is simple: we use the Hyperband algorithm and the distributed framework Ray Moritz et al. (2018), and then Dijkstra's algorithm, a natural choice for our combinatorial optimization, ensembles the models. The simplicity of the chosen algorithms and the re-use of existing frameworks reinforce our claim that this work is easy to test on a new dataset.

3.2. STEP 1 -HYPERBAND TO GENERATE MANY MODELS

Hyperband relies on an iterative selection of the most promising models to which it allocates resources, allowing it to evaluate exponentially more configurations than strategies which do not use intermediate results during training. Hyperband makes minimal assumptions, unlike prior configuration evaluation approaches. Its pure-exploration nature, combined with conservative resource allocation strategies, sweeps the hyperparameter space better than other strategies like black-box Bayesian optimization. The diversity of the sampled models is ideal for Dijkstra's algorithm to combine them into better ensembles Liao & Moody (1999) Gashler et al. (2008). We only store models trained for at least half the maximum number of epochs. This reduces the number of saved models, and thus the number of possible combinations, by focusing on the most promising models explored by Hyperband.
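The recycling rule can be illustrated with a single successive-halving bracket, the building block of Hyperband. This is a simplified sketch, not the paper's implementation; `train` is a stand-in for actually fitting a model for a given budget and returning its validation loss:

```python
def successive_halving(configs, train, max_epochs, eta=2):
    """One successive-halving bracket: train every config on a small
    budget, keep the best 1/eta fraction, multiply the budget by eta,
    and repeat. Configs that reach at least max_epochs/2 are recycled
    into the model zoo instead of being discarded."""
    zoo = {}  # config -> last validation loss (the recycled models)
    rounds = 0
    while eta ** (rounds + 1) <= len(configs):
        rounds += 1                      # integer log_eta(len(configs))
    epochs = max(1, max_epochs // eta ** rounds)
    survivors = list(configs)
    while True:
        scored = sorted((train(cfg, epochs), cfg) for cfg in survivors)
        if epochs >= max_epochs / 2:     # the paper's recycling rule
            zoo.update({cfg: loss for loss, cfg in scored})
        if epochs >= max_epochs:
            break
        survivors = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]
        epochs = min(max_epochs, epochs * eta)
    return zoo
```

With 8 configurations and a budget of 8 epochs, the bracket trains at 1, 2, 4 and 8 epochs; only the configurations reaching 4 or more epochs enter the zoo and become ensemble candidates.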

3.3. STEP 2 -DIJKSTRA'S ALGORITHM TO COMBINE MODELS

We first model the problem of finding the best combination of K models among a larger population as a graph. We then show that no exact method is tractable because of the computational complexity of the problem. That is why we propose Dijkstra's algorithm, a simple and popular approach.

3.3.1. PROBLEM MODELING AND INTUITION

The solution space can be modeled as a tree with the empty ensemble as the root. Every node represents the addition of a deep learning model, and any path an ensemble. Any node can be terminal, not only the leaves. To evaluate and compare ensembles, we use formula 1: the cross entropy between the validation labels y and the averaged predictions of the current ensemble I of size k:

score_I = CE(y, (1/k) Σ_{i∈I} ỹ_i)    (1)

Figure 2 is a simplified example of three models and their combinations in error-distribution space; the corresponding tree is shown in figure 3. The problem is to find the combination as close as possible to the center (0; 0). We observe that the best individual model c (d_{c} = 0.22) is not always the best one to combine with the others (d_{a,b} = 0.07). That is why smart algorithms are needed. We can also eliminate the simple idea that combining all models systematically leads to the best solution (d_{a,b,c} = 0.12).
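Formula 1 is direct to write with NumPy. In this sketch, each ỹ_i is a per-model matrix of softmax probabilities of shape (n_samples, n_classes); the toy values below are invented to show averaging beating both individual members, mirroring the a/b/c example:

```python
import numpy as np

def ensemble_score(y_true, member_probs):
    """score_I = CE(y, (1/k) * sum_{i in I} y_tilde_i): cross entropy
    of the class probabilities averaged over the k ensemble members."""
    avg = np.mean(member_probs, axis=0)           # (n_samples, n_classes)
    picked = avg[np.arange(len(y_true)), y_true]  # prob. of the true class
    return float(-np.mean(np.log(picked + 1e-12)))

y = np.array([0, 0])                        # both samples are class 0
a = np.array([[0.9, 0.1], [0.4, 0.6]])      # model a: wrong on sample 2
b = np.array([[0.4, 0.6], [0.9, 0.1]])      # model b: wrong on sample 1
```

Here `ensemble_score(y, [a, b])` is lower (better) than the score of either model alone: the two models make complementary errors that averaging cancels out.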

3.3.2. PROBLEM COMPLEXITY

Finding the best ensemble of maximum size K among n models with 1 ≤ K ≤ n is a case of the 'knapsack problem', which belongs to the NP (non-deterministic polynomial) family. No exact method is known except brute-forcing all possibilities. The number of possible subsets of size k among n items is given by the binomial coefficient, which is known Das (2020) to grow polynomially when n increases and k is fixed. When the user sets the maximum ensemble size K to explore, all sizes k with 1 ≤ k ≤ K are also candidates, so the number of candidate ensembles is Σ_{k=1}^{K} C(n, k). This sum also behaves polynomially when K is fixed and n increases: for K = 4 and a population of n = 100, adding only one new model to the population increases the number of combinations from 4.09 million to 4.25 million. This polynomial behavior has two exceptions: K = 1 (linear) and K = n (exponential). K = n allows the search for big ensembles for maximum precision regardless of inference time. The number of ways to build a non-empty ensemble from a catalogue of N models is Σ_{k=1}^{N} C(N, k) = 2^N - 1: we have two options for each model, using it or not (2^N possibilities), and we exclude the empty ensemble (-1). It means that each new model found by Hyperband multiplies the number of combinations by 2. With a catalogue of 100 models, the number of ensembles is ≈ 1.27 × 10^30. Due to this combinatorial explosion, we understand the need for an approximate search algorithm.
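The counts above are easy to reproduce; Python's `math.comb` gives the binomial coefficient:

```python
from math import comb

def n_candidate_ensembles(n, k_max):
    """Number of non-empty ensembles of size at most k_max that can be
    drawn from a population of n models: sum_{k=1..k_max} C(n, k)."""
    return sum(comb(n, k) for k in range(1, k_max + 1))

# K = 4: adding one model to a population of 100 adds ~167k candidates
assert n_candidate_ensembles(100, 4) == 4_087_975   # ~4.09 million
assert n_candidate_ensembles(101, 4) == 4_254_726   # ~4.25 million
# K = n: all non-empty subsets, i.e. 2^n - 1
assert n_candidate_ensembles(100, 100) == 2**100 - 1
```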

3.3.3. APPROXIMATE SOLUTION WITH DIJKSTRA'S ALGORITHM

This huge number of ensembles, combined with the fact that the relationships between model predictions and labels are complex (figure 2 and formula 1) and that no heuristic is known, makes Dijkstra's algorithm a natural choice for this class of problems Voloch (2017). Dijkstra's algorithm is a dynamic programming procedure, meaning it makes and memorizes successive approximation choices. While the number of possibilities requires approximate solutions, this huge number of candidate ensembles has the advantage of ensuring that better combinations should be found compared to a naive aggregation approach; this is confirmed by the results in section 4. Once Dijkstra's algorithm has found an ensemble, we can combine the predictions of its models on new data or evaluate it on the test dataset. As with training and evaluation, prediction can be distributed over different GPUs, but the averaging requires that all models have finished predicting.
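The selection step can be sketched as a best-first search over the ensemble tree of section 3.3.1. This is our illustrative reading of the method, not the authors' code: nodes are ensembles, edges add one model, and the candidate with the lowest validation score is expanded first. The toy scores are loosely based on the figure 3 example (d_{c} = 0.22, d_{a,b} = 0.07, d_{a,b,c} = 0.12); the remaining values are made up. On a pool this tiny the search visits every ensemble, but it orders the exploration by score:

```python
import heapq
from itertools import count

def best_first_ensemble(models, score_fn, k_max):
    """Dijkstra-style search: repeatedly pop the candidate ensemble
    with the lowest validation score and grow it by one model, up to
    ensembles of size k_max. Returns the best ensemble found."""
    tie = count()                       # tiebreaker for equal scores
    heap = [(score_fn(frozenset({m})), next(tie), frozenset({m}))
            for m in models]
    heapq.heapify(heap)
    seen, best, best_score = set(), None, float("inf")
    while heap:
        score, _, ens = heapq.heappop(heap)
        if ens in seen:
            continue
        seen.add(ens)
        if score < best_score:
            best, best_score = ens, score
        if len(ens) < k_max:
            for m in models:
                grown = ens | {m}
                if m not in ens and grown not in seen:
                    heapq.heappush(heap, (score_fn(grown), next(tie), grown))
    return best, best_score

# hypothetical validation scores for every subset of models a, b, c
d = {frozenset("a"): 0.50, frozenset("b"): 0.40, frozenset("c"): 0.22,
     frozenset("ab"): 0.07, frozenset("ac"): 0.30, frozenset("bc"): 0.25,
     frozenset("abc"): 0.12}
```

On these scores the search recovers the ensemble {a, b}, not the best single model c and not the full ensemble, matching the argument of section 3.3.1.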

4. EXPERIMENTS AND RESULTS

We evaluate our workflow on the CIFAR100 Krizhevsky (2009) and microfossils datasets, both presented in appendix B.3. On these two datasets, there are 16 hyper-parameters to explore. The experimental settings are detailed in appendix B for reproducibility purposes, but this level of detail is not required to understand our work or results. It is possible that larger hyperparameter value ranges would further improve the results obtained. In this section, we evaluate different workflows by comparing various HPO strategies and different combinatorial optimizations. We also test different settings, such as the number of models in the produced ensemble and the effect of the HPO running time on the results.

4.1. VISUALIZATION OF ARCHITECTURE SAMPLING RESULTS

As others have highlighted, no correlation exists between the accuracy and the computing cost of a model on image recognition; figure 4 displays this for randomly sampled hyperparameter configurations on the CIFAR100 dataset. This is the reason why we propose a target function which measures the efficiency of a model based on both its accuracy and its inference time. The implemented formula is CE(y, ỹ) + W·I, with I the inference time on 2,000 images expressed in seconds and W a scaling factor arbitrarily chosen as W = 0.001. By minimizing this target function, Hyperband concentrates computing resources on more efficient models. The natural efficiency of Hyperband, combined with this multi-criteria target function, increases the number of models explored in 6 days by a factor of 3.2 compared to random sampling + early stopping (plateau detection). Another early stopping method we use consists in detecting, after one epoch, whether a model performs better than random predictions on the validation dataset. It proves very effective at detecting early the models which are unable to learn, freeing GPUs for other models. Experiments in figure 5 show that about 14% of models diverge, so we can save quasi-entirely their running time. We evaluate the accuracy of our workflow on CIFAR100 by replacing Hyperband with various HPO strategies in table 1. Retraining the same deep learning architecture from scratch can yield significantly different results from run to run; that is why we compare different HPO strategies as well as different popular ResNet architectures in table 2. We observe that Hyperband generally performs well both at finding the single best model and at providing models to aggregate into ensembles, compared to all other methods. This confirms our claims of the previous section on Hyperband's computing efficiency and its ability to generate good ensembles. We also observe that most HPO strategies discovered better models than the best ResNet model found.
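The multi-criteria target is a single line; the comparison below, with invented values, shows how it can prefer a slightly less accurate but much faster model:

```python
def target(val_cross_entropy, inference_seconds, w=0.001):
    """The objective minimized by Hyperband: CE(y, y_tilde) + W * I,
    with I the inference time on 2,000 images in seconds and
    W = 0.001 the arbitrarily chosen scaling factor."""
    return val_cross_entropy + w * inference_seconds

# model A: lower loss but slow (40 s); model B: slightly worse loss, 4x faster
assert target(0.80, 40.0) > target(0.82, 10.0)   # B wins the trade-off
```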
For both benchmarks, we observe that ResNet18, compared to other ResNet architectures, leads to better individual models, but when we combine them, ResNet34 is a better choice. We conjecture that ResNet18 yields a lower parametric diversity than ResNet34 because of its lower number of layers. Another remark is that 14% of the randomly sampled models diverge while 100% of the ResNet models converge. It shows that ResNets are robust handcrafted models, but that a random process can find more accurate ones on a new dataset. The same results and conclusions are reached on the microfossils dataset in tables 3 and 4. Different combination algorithms are tested in figures 6 and 7 by varying the number of models from 1 to 16. The total population only contains models trained during at least 50 epochs by Hyperband. We tested two naive strategies: the first one draws random ensembles of K models and the second one takes the top-K. Dijkstra's algorithm generally finds better solutions than the naive strategies. We also evaluate SP-MCTS, a tree search algorithm based on Monte-Carlo. To test SP-MCTS, the solution space was modeled as an unfolded tree representation leading to node redundancy, so equivalent nodes were implemented to index the same score. Based on preliminary experiments, SP-MCTS is set to run 1000 × K iterations, with K the maximum desired number of models, to favor accuracy over SP-MCTS computing cost. With a single-threaded implementation, Dijkstra's algorithm takes only 25 seconds to find an ensemble of K = 16 among 160 models, while SP-MCTS is 580× slower. On the microfossils dataset, Dijkstra's algorithm falls into a local minimum and keeps re-using the same 10 models when K > 10. SP-MCTS does not fall into this trap and keeps benefiting from the increasing population.

B.3 THE MICROFOSSILS DATASET

Microfossils are extremely useful in age dating, correlation and paleoenvironmental reconstruction to refine our knowledge of geology.
Microfossil species are identified and counted on large microscope images, and their frequencies allow us to date sedimentary rocks. To do reliable statistics, a large number of objects needs to be identified; that is why we need deep learning to automate this work. Today, between 400 and 800 fields of view (microscopy imagery) need to be shot for one rock sample. In each field of view, there are between 300 and 400 objects to identify. Among these objects, some are non-fossils (crystals, rock grains, etc.) and others are the fossils we are looking for to study the rocks. Our dataset contains 91 classes of 224x224 RGB images (after homemade preprocessing). Microfossils are calcareous objects imaged with polarized light microscopy. The classes are unbalanced: we have from 50 to 2,500 images per class, with a total of 32K images in the whole dataset. The F1 score was used and is labeled as 'accuracy' on all benchmarks.

B.4 HYPERPARAMETER CONFIGURATION SPACE

Table 5 shows all hyperparameter properties of this workflow. We use ResNet-based architectures Zagoruyko & Komodakis (2016) due to their simplicity in yielding promising and robust models on many datasets. We explore different residual block versions: "V1", "V2" He et al. (2015) and "next" Xie et al. (2016). Regarding the optimization method, we use the Adam optimizer Kingma & Ba (2014) due to its well-known performance and its low learning-rate tuning requirement. Hyperparameters labeled as "mutable" can be updated during training; for example, the learning rate can change but the architecture cannot. PBT is the only tested algorithm that discovers a schedule for the mutable hyperparameters. We are aware that our research may have limitations. The hyper-parameter ranges may be too short compared to good results found in the literature Zagoruyko & Komodakis (2016), for example the batch size, width and depth of the convolutional neural network. Moreover, we could explore other optimization strategies such as SGD with momentum, and also learning rate decay. Dropout is another promising method we could explore. Finally, on CIFAR100 our maximum number of epochs is 100, while scientists before us usually use 160 epochs.
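To make the search space concrete, here is a hypothetical sketch of what sampling one configuration could look like. The real 16 hyperparameters and their ranges are listed in table 5; every name and range below is an illustrative stand-in, not the paper's actual space:

```python
import random

def sample_config(rng):
    """Draw one illustrative configuration from a ResNet-style space.
    `rng` is a random.Random instance, so sampling is reproducible."""
    return {
        "block_version": rng.choice(["v1", "v2", "next"]),
        "depth": rng.choice([18, 34, 50]),
        "width_multiplier": rng.choice([1, 2, 4]),
        "batch_size": rng.choice([64, 128, 256]),
        "adam_lr": 10 ** rng.uniform(-4, -2),   # log-uniform; mutable
    }

cfg = sample_config(random.Random(0))
```

Passing a seeded `random.Random` keeps every HPO worker's draws reproducible, which matters when sub-optimal models are stored and must be traceable back to their configuration.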

B.5 ADAPTATION TO APPLY ON CIFAR100

The CIFAR100 dataset contains 32x32 images, while ResNets are usually designed for ImageNet (224x224 images). These different resolutions require some adaptation: on CIFAR100, the first convolutional layer is replaced, going from a 7x7 kernel with a stride of 2 to a 3x3 kernel with a stride of 1.
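The effect of this swap on spatial resolution can be checked with the standard convolution output-size formula; the padding values below assume the usual ResNet choices (3 for the 7x7 stem, 1 for a 3x3 convolution), which the paper does not state explicitly:

```python
def conv_out_size(size, kernel, stride, padding):
    """Output side length of a square convolution:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# ImageNet-style stem on a 32x32 CIFAR image: 7x7, stride 2 halves it
assert conv_out_size(32, 7, 2, 3) == 16
# CIFAR-adapted stem: 3x3, stride 1 keeps the full 32x32 resolution
assert conv_out_size(32, 3, 1, 1) == 32
```

The adapted stem avoids throwing away three quarters of the already tiny input before the first residual block.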



Figure 1: The algorithmic workflow to train, combine and predict on new data. In this example K=2. The final accuracy depends on the running time of Hyperband and the number of models chosen in the ensemble. Experimental results are shown in section 4.

Figure 2: The two axes represent the error on two different classes; the goal is to get closer to the (0;0) point. Models a, b and c each have a different error distribution, and averaging them leads to ensembles with other error distributions.

Figure 3: Available solutions as a graph. Nodes are ensembles. Edges are the decision to add a new model to the ensemble. The cost function is the euclidean distance and is displayed on the top left of each node. The source node, corresponding to the empty ensemble, is modeled with an arbitrary large distance. The optimal ensemble is made of a and b.

Figure 4: Correlation between computing cost and accuracy of randomly sampled models on CIFAR100

Figure 5: Accuracy histogram of randomly sampled models on CIFAR100

Figure 6: The CIFAR100 dataset

Figure 7: The microfossils dataset

Figure 8: Different combinatorial optimization algorithms tested

Table 1: Various HPO strategies and various ensemble sizes computed on the CIFAR100 dataset

Table 2: Comparison of different ResNet populations and ensemble sizes on the CIFAR100 dataset

Table 3: Various HPO strategies and ensemble sizes on the microfossils dataset

ANNEX

Under review as a conference paper at ICLR 2021

Our workflow benefits more from computing intensity than standard Hyperband, as shown in figures 9 and 10. After 24 hours, standard Hyperband (consisting in taking only the best model) converges, while our workflow with K ≥ 2 keeps benefiting from the generated models. On the CIFAR100 dataset, we identify that ensembles of 12 and 16 models benefit linearly from the computing time: their accuracy starts at 77% and increases by +0.4% every 24 hours. Moreover, we observe that adding more models systematically increases accuracy, but this trend declines: the benefit is obvious from 1 to 2 models (+3.9%) but the improvement is small from 6 to 16 (+1%). We show that Hyperband efficiently generates diverse architectures, coupled with a significant number of possible combinations between them. That is the reason why a smart selection of models with Dijkstra allows building accurate ensembles. This workflow benefits from the increasing computational power and proposes a general approach to unify hyper-parameter search and model ensembling. On two datasets, we have also shown that our ensembles are more accurate than naively built ensembles. Our workflow also yields an ensemble more accurate than any ensemble of ResNet models.

