DEEP ENSEMBLES WITH HIERARCHICAL DIVERSITY PRUNING

Anonymous authors
Paper under double-blind review

Abstract

Diverse deep ensembles hold the potential to improve the accuracy and robustness of deep learning models. Both pairwise and non-pairwise ensemble diversity metrics have been proposed over the past two decades. However, it remains challenging to find metrics that can effectively prune deep ensembles with insufficient diversity, which fail to deliver effective ensemble accuracy. In this paper, we first compare six popular diversity metrics in the literature, coined as Q metrics, including both pairwise and non-pairwise representatives. We analyze their inherent limitations in capturing the negative correlation among ensemble member models, which make them inefficient at identifying and pruning low quality ensembles. We then present six HQ ensemble diversity metrics that extend the existing Q-metrics with three novel optimizations: (1) We introduce the concept of the focal model and measure ensemble diversity separately among deep ensembles of the same team size anchored on a focal model, aiming to better capture the negative correlations among member models of an ensemble. (2) We introduce six HQ-diversity metrics that optimize the corresponding Q-metrics respectively in terms of measuring negative correlation among member models of an ensemble using its ensemble diversity score. (3) We introduce a two-phase hierarchical pruning method to effectively identify and prune deep ensembles with low HQ diversity scores, aiming to increase the lower and upper bounds on ensemble accuracy of the selected ensembles. By combining these three optimizations, deep ensembles selected by our hierarchical diversity pruning approach significantly outperform those selected by the corresponding Q-metrics.
Comprehensive experimental evaluation over several benchmark datasets shows that our HQ-metrics can effectively select high diversity deep ensembles by pruning out those ensembles with insufficient diversity, and successfully increase the lower bound (worst case) accuracy of the selected deep ensembles, compared to those selected using the state-of-the-art Q-metrics.

1. INTRODUCTION

Deep ensembles with sufficient ensemble diversity hold the potential to improve both the accuracy and robustness of ensembles through their combined wisdom. The improvement can be measured by three criteria: (i) the average ensemble accuracy of the selected ensemble teams; (ii) the percentage of selected ensembles that exceed the highest accuracy among the individual member models; and (iii) the lower bound (worst case) and upper bound (best case) accuracy of the selected ensembles. The higher these three measures, the higher the quality of the ensemble teams. Ensemble learning can be broadly classified into two categories: (1) learning an ensemble of diverse models via diversity-optimized joint training, coined as the ensemble training approach, such as boosting (Schapire, 1999); and (2) learning to compose an ensemble of base models from a pool of existing pre-trained models through ensemble teaming based on ensemble diversity metrics (Partridge & Krzanowski, 1997; Liu et al., 2019; McHugh, 2012; Skalak, 1996), coined as the ensemble consensus approach. This paper focuses on improving the state-of-the-art results in the second category.

Related Work and Problem Statement.

Ensemble diversity metrics are designed to capture the degree of negative correlation among the member models of an ensemble team (Brown et al., 2005; Liu et al., 2019; Kuncheva & Whitaker, 2003), such that high diversity indicates high negative correlation among the member models of an ensemble. Three orthogonal and yet complementary threads of effort have been pursued in ensemble learning: (1) developing mechanisms to produce diverse base neural network models, (2) developing diversity metrics to select ensembles with high ensemble diversity from the candidate ensembles over the base model pool, and (3) developing consensus voting methods.
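The three evaluation criteria above can be computed directly from the per-ensemble accuracies of a set of selected teams. The sketch below is a minimal illustration; the accuracy values are invented for the example and are not results from the paper.

```python
def ensemble_selection_criteria(ensemble_accs, best_member_acc):
    """Summarize a set of selected ensemble teams by the three criteria:
    (i) average ensemble accuracy, (ii) percentage of ensembles exceeding
    the best individual member model, (iii) lower/upper bound accuracy."""
    n = len(ensemble_accs)
    return {
        "average_accuracy": sum(ensemble_accs) / n,
        "pct_above_best_member": 100.0 * sum(a > best_member_acc for a in ensemble_accs) / n,
        "lower_bound": min(ensemble_accs),
        "upper_bound": max(ensemble_accs),
    }

# Illustrative accuracies for five selected ensembles and a best single model.
stats = ensemble_selection_criteria([0.962, 0.955, 0.971, 0.948, 0.969],
                                    best_member_acc=0.958)
```

A selection method is better when all four summary numbers are higher; in particular, raising the lower bound means even the worst selected team still beats a poor ensemble choice.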
The most popular consensus voting methods include simple averaging, weighted averaging, majority voting, plurality voting (Ju et al., 2017), and learning to rank (Burges et al., 2005). For base model selection, early efforts were devoted to training diverse weak models to form a strong ensemble on a learning task, such as bagging (Breiman, 1996), boosting (Schapire, 1999), or different ways of selecting features, e.g., random forests (Tin Kam Ho, 1995). Several recent studies also produce diverse base models by varying the training hyperparameters, such as snapshot ensembles (Huang et al., 2017), which utilize cyclic learning rates (Smith, 2015; Wu et al., 2019) to converge a single DNN model at different epochs and take the snapshots as the ensemble member models. An alternative method is to construct the pool of base models from pre-trained models with different neural network backbones (Wu et al., 2020; Liu et al., 2019; Wei et al., 2020; Chow et al., 2019a). Research efforts on diversity metrics have proposed both pairwise and non-pairwise ensemble diversity measures (Fort et al., 2019; Wu et al., 2020; Liu et al., 2019). The three representative pairwise metrics are Cohen's Kappa (CK) (McHugh, 2012), Q Statistics (QS) (Yule, 1900), and Binary Disagreement (BD) (Skalak, 1996); the three representative non-pairwise diversity metrics are Fleiss' Kappa (FK) (Fleiss et al., 2013), Kohavi-Wolpert Variance (KW) (Kohavi & Wolpert, 1996; Kuncheva & Whitaker, 2003), and Generalized Diversity (GD) (Partridge & Krzanowski, 1997). These diversity metrics are widely used in several recent studies (Fort et al., 2019; Liu et al., 2019; Wu et al., 2020). Early work has shown that these diversity metrics are correlated with respect to ensemble accuracy and diversity in the context of traditional machine learning models (Kuncheva & Whitaker, 2003).
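As a concrete illustration of the pairwise family, the sketch below computes Yule's Q statistic and binary disagreement from the 0/1 correctness vectors of two models on a shared validation set. The correctness vectors in the usage example are invented for illustration.

```python
def pairwise_counts(c1, c2):
    """Contingency counts over 0/1 correctness vectors of two models:
    n11 = both correct, n00 = both wrong, n10/n01 = exactly one correct."""
    n11 = sum(1 for a, b in zip(c1, c2) if a and b)
    n00 = sum(1 for a, b in zip(c1, c2) if not a and not b)
    n10 = sum(1 for a, b in zip(c1, c2) if a and not b)
    n01 = sum(1 for a, b in zip(c1, c2) if not a and b)
    return n11, n00, n10, n01

def q_statistic(c1, c2):
    """Yule's Q statistic: +1 when the models err together, -1 when their
    errors are perfectly negatively correlated."""
    n11, n00, n10, n01 = pairwise_counts(c1, c2)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def binary_disagreement(c1, c2):
    """Fraction of samples on which exactly one of the two models is correct."""
    n11, n00, n10, n01 = pairwise_counts(c1, c2)
    return (n10 + n01) / (n11 + n00 + n10 + n01)

# Illustrative correctness vectors of two models on six validation samples.
c1 = [1, 1, 0, 0, 1, 0]
c2 = [0, 1, 1, 0, 1, 1]
```

For an ensemble of more than two models, these pairwise scores are typically averaged over all member pairs, which is one source of the limitations discussed below.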
However, few studies to date have provided an in-depth comparative critique of how effectively these diversity metrics prune low quality deep ensembles, i.e., those whose member models lack sufficient negative correlation, from the candidate ensembles.

Scope and Contributions.

In this paper, we focus on the problem of defining ensemble diversity metrics that can select diverse ensemble teams with high ensemble accuracy. We first investigate six representative ensemble diversity metrics, coined as Q metrics. We identify and analyze their inherent limitations in capturing the negative correlation among the member models of an ensemble, and explain why pruning out deep ensembles with low Q-diversity does not always guarantee improved ensemble accuracy. To address the inherent problems of the Q metrics, we extend them with three optimizations: (1) We introduce the concept of the focal model and argue that one way to better capture the negative correlations among member models of an ensemble is to compute diversity scores for ensembles of fixed size anchored on the focal model. (2) We introduce six HQ diversity metrics that optimize the six Q-diversity metrics respectively. (3) We develop an HQ-based hierarchical pruning method consisting of two pruning stages: the α filter and the K-Means filter. By combining these optimizations, the deep ensembles selected by our HQ-metrics significantly outperform those selected by the corresponding Q metrics, showing that HQ-metrics based hierarchical pruning is effective at identifying and removing low diversity deep ensembles. Comprehensive experiments are conducted on three benchmark datasets: CIFAR-10 (Krizhevsky & Hinton, 2009), ImageNet (Russakovsky et al., 2015), and Cora (Lu & Getoor, 2003). The results show that our hierarchical diversity pruning approach outperforms the corresponding Q-metrics in terms of the lower bound and upper bound of ensemble accuracy over the selected deep ensembles, demonstrating the effectiveness of our HQ approach in pruning low diversity deep ensembles.
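The two-stage flow of the hierarchical pruning method can be sketched schematically as follows. The precise definitions of the HQ diversity scores and the filter parameters are given later in the paper; here, the scores and the α threshold are illustrative placeholders, and a simple one-dimensional two-means clustering stands in for the K-Means filter.

```python
def two_means_1d(xs, iters=50):
    """Plain 1-D k-means with k=2; returns (low_center, high_center, labels)."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        # Assign each score to the nearer of the two centers.
        labels = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in xs]
        lo_pts = [x for x, l in zip(xs, labels) if l == 0]
        hi_pts = [x for x, l in zip(xs, labels) if l == 1]
        if lo_pts:
            lo = sum(lo_pts) / len(lo_pts)
        if hi_pts:
            hi = sum(hi_pts) / len(hi_pts)
    return lo, hi, labels

def hierarchical_prune(scored_ensembles, alpha):
    """Two-stage pruning sketch: (1) the alpha filter drops candidate
    ensembles whose diversity score falls below alpha; (2) a k=2 clustering
    of the surviving scores keeps only the high-diversity cluster."""
    survivors = [(team, s) for team, s in scored_ensembles if s >= alpha]
    if len(survivors) < 2:
        return survivors
    _, _, labels = two_means_1d([s for _, s in survivors])
    return [ts for ts, l in zip(survivors, labels) if l == 1]

# Illustrative (ensemble, diversity score) pairs and threshold.
scored = [("e1", 0.20), ("e2", 0.55), ("e3", 0.60),
          ("e4", 0.92), ("e5", 0.95), ("e6", 0.10)]
kept = hierarchical_prune(scored, alpha=0.5)
```

The first stage cheaply discards clearly low-diversity teams, so the clustering stage only has to separate the remaining candidates into a high-diversity group to keep and a borderline group to prune.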

2. HIERARCHICAL PRUNING WITH DIVERSITY METRICS

Existing studies on consensus based ensemble learning (Huang et al., 2017; Krizhevsky et al., 2012; Zoph & Le, 2016) generate the base model pool through two channels: (i) training deep neural networks with different network structures or different hyperparameter configurations (Breiman, 1996; Schapire, 1999; Zoph & Le, 2016; Hinton et al., 2015; Wu et al., 2018; 2019), and (ii) selecting top-performing pre-trained models from open-source projects (e.g., GitHub) and public model zoos (Jia et al., 2014; ONNX Developers, 2020; GTModelZoo Developers, 2020). Hence, an important technical challenge for deep ensemble learning is to define diversity metrics that produce high quality ensemble teaming strategies, aiming to boost ensemble accuracy. Given that the number of possible ensemble teams grows exponentially even for a small pool of base models, de-

