DEEP ENSEMBLES WITH HIERARCHICAL DIVERSITY PRUNING

Anonymous authors
Paper under double-blind review

Abstract

Diverse deep ensembles hold the potential to improve the accuracy and robustness of deep learning models. Both pairwise and non-pairwise ensemble diversity metrics have been proposed over the past two decades. However, it remains challenging to find metrics that can effectively prune deep ensembles with insufficient ensemble diversity, which fail to deliver effective ensemble accuracy. In this paper, we first compare six popular diversity metrics in the literature, coined as Q-metrics, including both pairwise and non-pairwise representatives. We analyze their inherent limitations in capturing the negative correlation among the member models of an ensemble, which make them inefficient at identifying and pruning low quality ensembles. We next present six HQ ensemble diversity metrics that extend the existing Q-metrics with three novel optimizations: (1) We introduce the concept of the focal model and measure ensemble diversity separately among deep ensembles of the same team size, aiming to better capture the negative correlations among the member models of an ensemble. (2) We introduce six HQ-diversity metrics that optimize the corresponding Q-metrics respectively in measuring the negative correlation among member models of an ensemble through its ensemble diversity score. (3) We introduce a two-phase hierarchical pruning method to effectively identify and prune deep ensembles with high HQ diversity scores, aiming to increase the lower and upper bounds on ensemble accuracy for the selected ensembles. By combining these three optimizations, the deep ensembles selected by our hierarchical diversity pruning approach significantly outperform those selected by the corresponding Q-metrics.
Comprehensive experimental evaluation over several benchmark datasets shows that our HQ-metrics can effectively select high diversity deep ensembles by pruning out those ensembles with insufficient diversity, and successfully increase the lower bound (worst case) accuracy of the selected deep ensembles, compared to those selected using the state-of-the-art Q-metrics.

1. INTRODUCTION

Deep ensembles with sufficient ensemble diversity hold the potential to improve both the accuracy and robustness of ensembles through their combined wisdom. The improvement can be measured by three criteria: (i) the average ensemble accuracy of the selected ensemble teams; (ii) the percentage of selected ensembles that exceed the highest accuracy of the individual member models; and (iii) the lower bound (worst case) and the upper bound (best case) accuracy of the selected ensembles. The higher these three measures, the higher the quality of the ensemble teams. Ensemble learning can be broadly classified into two categories: (1) learning an ensemble of diverse models via diversity-optimized joint training, coined as the ensemble training approach, such as boosting (Schapire, 1999); and (2) learning to compose an ensemble of base models from a pool of existing pre-trained models through ensemble teaming based on ensemble diversity metrics (Partridge & Krzanowski, 1997; Liu et al., 2019; McHugh, 2012; Skalak, 1996), coined as the ensemble consensus approach. This paper focuses on improving the state-of-the-art results in the second category. Related Work and Problem Statement. Ensemble diversity metrics are designed to capture the degree of negative correlation among the member models of an ensemble team (Brown et al., 2005; Liu et al., 2019; Kuncheva & Whitaker, 2003), such that high diversity indicates high negative correlation among the member models of an ensemble. Three orthogonal and yet complementary threads of effort have been engaged in ensemble learning: (1) developing mechanisms to produce diverse base neural network models, (2) developing diversity metrics to select ensembles with high ensemble diversity from the candidate ensembles over the base model pool, and (3) developing consensus voting methods.
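The three selection-quality criteria above can be computed directly from the accuracies of the selected ensembles and of the base models. The following is a minimal sketch; the helper name `ensemble_quality` and the toy accuracy values are our own illustration, not part of the paper's pipeline.

```python
import numpy as np

def ensemble_quality(selected_accs, base_model_accs):
    """Three selection-quality measures for a set of chosen ensembles
    (hypothetical helper; accuracies are fractions in [0, 1])."""
    accs = np.asarray(selected_accs)
    best_single = max(base_model_accs)
    return {
        "avg_accuracy": float(accs.mean()),            # criterion (i)
        "pct_above_best_single":                       # criterion (ii)
            float((accs > best_single).mean() * 100),
        "lower_bound": float(accs.min()),              # criterion (iii), worst case
        "upper_bound": float(accs.max()),              # criterion (iii), best case
    }

base = [0.9668, 0.9425, 0.9510]            # toy base model accuracies
chosen = [0.9655, 0.9672, 0.9701]          # toy selected-ensemble accuracies
q = ensemble_quality(chosen, base)
```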
The most popular consensus voting methods include simple averaging, weighted averaging, majority voting, plurality voting (Ju et al., 2017), and learning to rank (Burges et al., 2005). For base model selection, early efforts were devoted to training diverse weak models to form a strong ensemble on a learning task, such as bagging (Breiman, 1996), boosting (Schapire, 1999), or different ways of selecting features, e.g., random forests (Tin Kam Ho, 1995). Several recent studies also produce diverse base models by varying the training hyper-parameters, such as snapshot ensemble (Huang et al., 2017), which utilizes cyclic learning rates (Smith, 2015; Wu et al., 2019) to converge a single DNN model at different epochs and collect the snapshots as ensemble member models. An alternative method is to construct the pool of base models by using pre-trained models with different neural network backbones (Wu et al., 2020; Liu et al., 2019; Wei et al., 2020; Chow et al., 2019a). The research efforts on diversity metrics have proposed both pairwise and non-pairwise ensemble diversity measures (Fort et al., 2019; Wu et al., 2020; Liu et al., 2019), among which the three representative pairwise metrics are Cohen's Kappa (CK) (McHugh, 2012), Q Statistics (QS) (Yule, 1900) and Binary Disagreement (BD) (Skalak, 1996), and the three representative non-pairwise diversity metrics are Fleiss' Kappa (FK) (Fleiss et al., 2013), Kohavi-Wolpert Variance (KW) (Kohavi & Wolpert, 1996; Kuncheva & Whitaker, 2003) and Generalized Diversity (GD) (Partridge & Krzanowski, 1997). These diversity metrics are widely used in several recent studies (Fort et al., 2019; Liu et al., 2019; Wu et al., 2020). Early studies have shown that these diversity metrics are correlated with respect to ensemble accuracy and diversity in the context of traditional machine learning models (Kuncheva & Whitaker, 2003).
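Two of the consensus voting methods above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the function names and toy probability vectors are our own.

```python
import numpy as np
from collections import Counter

def soft_voting(prob_vectors):
    """Simple averaging: average the per-model probability vectors,
    then predict the class with the highest averaged probability."""
    return int(np.mean(prob_vectors, axis=0).argmax())

def plurality_voting(labels):
    """Plurality voting: predict the most common label among members."""
    return Counter(labels).most_common(1)[0][0]

# three member models' probability vectors over three classes
probs = [
    np.array([0.60, 0.30, 0.10]),
    np.array([0.20, 0.50, 0.30]),
    np.array([0.45, 0.10, 0.45]),
]
soft = soft_voting(probs)              # averaged vector peaks at class 0
plural = plurality_voting([0, 1, 0])   # two of three members vote class 0
```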
However, few studies to date have provided an in-depth comparative critique of the effectiveness of these diversity metrics in pruning low quality deep ensembles, i.e., those whose member models lack sufficient negative correlation, from the candidate ensembles. Scope and Contributions. In this paper, we focus on the problem of defining ensemble diversity metrics that can select diverse ensemble teams with high ensemble accuracy. We first investigate the six representative ensemble diversity metrics, coined as Q-metrics. We identify and analyze their inherent limitations in capturing the negative correlation among the member models of an ensemble, and why pruning out deep ensembles with low Q-diversity may not always guarantee improved ensemble accuracy. To address the inherent problems of Q-metrics, we extend the existing six Q-metrics with three optimizations: (1) We introduce the concept of the focal model and argue that one way to better capture the negative correlations among member models of an ensemble is to compute diversity scores for ensembles of fixed size based on the focal model. (2) We introduce six HQ diversity metrics that optimize the six Q-diversity metrics respectively. (3) We develop an HQ-based hierarchical pruning method, consisting of two-stage pruning: the α filter and the K-Means filter. By combining these optimizations, the deep ensembles selected by our HQ-metrics can significantly outperform those selected by the corresponding Q-metrics, showing that the HQ-metrics based hierarchical pruning approach is efficient in identifying and removing low diversity deep ensembles. Comprehensive experiments are conducted on three benchmark datasets: CIFAR-10 (Krizhevsky & Hinton, 2009), ImageNet (Russakovsky et al., 2015), and Cora (Lu & Getoor, 2003).
The results show that our hierarchical diversity pruning approach outperforms the corresponding Q-metrics in terms of the lower bound and the upper bound of ensemble accuracy over the selected deep ensembles, exhibiting the effectiveness of our HQ approach in pruning low diversity deep ensembles.

2. HIERARCHICAL PRUNING WITH DIVERSITY METRICS

Existing studies on consensus based ensemble learning (Huang et al., 2017; Krizhevsky et al., 2012; Zoph & Le, 2016) generate the base model pool through two channels: (i) deep neural network training using different network structures or different configurations of hyperparameters (Breiman, 1996; Schapire, 1999; Zoph & Le, 2016; Hinton et al., 2015; Wu et al., 2018; 2019) and (ii) selecting the top performing pre-trained models from open-source projects (e.g., GitHub) and public model zoos (Jia et al., 2014; ONNX Developers, 2020; GTModelZoo Developers, 2020). Hence, an important technical challenge for deep ensemble learning is to define diversity metrics for producing high quality ensemble teaming strategies, aiming to boost the ensemble accuracy. Given that the number of possible ensemble teams increases exponentially even with a small pool of base models, developing proper ensemble diversity metrics is critical for effective pruning of deep ensembles with insufficient diversity. Consider a pool of $M$ base models for a learning task on a given dataset $D$, denoted by $BMSet(D) = \{F_1, \dots, F_M\}$. Let $EnsSet$ denote the set of all possible ensemble teams composed from $BMSet(D)$, with the ensemble team size $S$ varying from 2 to $M$. We have a total of $\sum_{S=2}^{M}\binom{M}{S}$ ensembles, i.e., $|EnsSet| = \binom{M}{2} + \binom{M}{3} + \cdots + \binom{M}{M} = 2^M - (1 + M)$. The cardinality of the set of possible ensembles $EnsSet$ grows exponentially with $M$, the number of base models. For example, with $M = 3$, we have $|EnsSet| = 4$. When $M$ becomes larger, such as $M = 5, 10, 20$, we have $|EnsSet| = 26, 1013, 1048555$ respectively. Hence, as $M$ increases, it is non-trivial to construct a set of high-accuracy ensemble teams ($GEnsSet$) from the candidate set ($EnsSet$) of all possible ensembles composed from $BMSet(D)$.
Consider a pool of M = 10 base models for ImageNet, in which the highest performing base model has 78.25% accuracy, the lowest performing base model has 56.63% accuracy, and the average accuracy of these 10 base models is 71.60% (see Table 5 in Appendix Section F). For a pool of 10 base models, there will be a total of 1013 ($2^{10} - (10 + 1)$) different ensembles with team size ranging from 2 to 10. The performance of these ensembles varies sharply, from 61.39% (lower bound) to 80.77% (upper bound). Randomly selecting an ensemble team from these 1013 teams in EnsSet(ImageNet) may lead to a non-trivial probability of selecting a team with ensemble accuracy lower than the average member model accuracy of 71.60% over the 10 base models. Clearly, an efficient ensemble diversity metric should be able to prune out ensemble teams with insufficient ensemble diversity and thus low ensemble accuracy, increasing (i) the average ensemble accuracy of the selected ensemble teams, (ii) the percentage of selected ensembles that exceed the highest accuracy of the individual member models (i.e., 78.25% for the 10 base DNN models on ImageNet), and (iii) the lower bound (worst case) and the upper bound (best case) accuracy of the selected ensembles. A number of ensemble diversity metrics have been proposed to address this challenging problem. In this section, we first provide a comparative study of the six state-of-the-art Q-diversity metrics and analyze their inherent limitations in identifying and pruning low diversity ensembles. Then we introduce our proposed HQ-diversity metrics and analyze the effectiveness of our HQ based hierarchical diversity approach in pruning low quality ensembles.
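The candidate set and its exponential growth can be enumerated directly; this small sketch (our own illustration, using `itertools.combinations`) checks the count $|EnsSet| = 2^M - (1 + M)$ for the pool sizes cited above.

```python
from itertools import combinations

def enumerate_ensembles(model_ids):
    """All candidate ensemble teams of size 2..M from a base model pool."""
    M = len(model_ids)
    return [team for S in range(2, M + 1)
            for team in combinations(model_ids, S)]

# |EnsSet| = 2^M - (1 + M): 4 teams for M=3, 26 for M=5, 1013 for M=10
for M in (3, 5, 10):
    assert len(enumerate_ensembles(range(M))) == 2 ** M - (M + 1)
# M = 20 would already give 2**20 - 21 = 1,048,555 candidate teams
```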

2.1. Q-DIVERSITY METRICS AND THEIR LIMITATIONS

We outline the key notations for the six Q-diversity metrics in Table 1: three pairwise diversity metrics, Cohen's Kappa (CK) (McHugh, 2012), Q Statistics (QS) (Yule, 1900) and Binary Disagreement (BD) (Skalak, 1996), and three non-pairwise diversity metrics, Fleiss' Kappa (FK) (Fleiss et al., 2013), Kohavi-Wolpert Variance (KW) (Kohavi & Wolpert, 1996; Kuncheva & Whitaker, 2003) and Generalized Diversity (GD) (Partridge & Krzanowski, 1997). The arrow column ↑ | ↓ specifies the relationship between the Q-value and the ensemble diversity. The ↑ represents a positive relationship, i.e., a high Q-value indicates high ensemble diversity. The ↓ indicates a negative relationship, i.e., a low Q-value corresponds to high ensemble diversity. To facilitate the comparison of the six Q-diversity metrics such that a low Q-value indicates high ensemble diversity for all six Q-metrics, we apply (1 - Q-value) when calculating the Q-diversity score with BD, KW and GD. We refer readers to Appendix (Section C) for the formal definitions of the six Q-diversity metrics. Given a Q-diversity metric, we calculate the diversity score for each ensemble team in the ensemble set (EnsSet) using a set of negative samples (NegSampSet) on which one or more models in the ensemble make prediction errors. A low Q-score indicates sufficient diversity among the member models of an ensemble. Upon completion of the Q-diversity score computation for all ensembles in EnsSet, diversity threshold based pruning is employed to remove those ensembles with insufficient diversity among their member models. The threshold can be either a pre-defined Q-diversity threshold or a mean threshold obtained by averaging the Q-diversity scores of all candidate deep ensembles in EnsSet. The mean threshold tends to work better in general than a manually defined threshold.
Once a mean threshold is obtained, the ensembles in EnsSet with Q diversity scores below the threshold are selected and placed into the diverse ensemble set GEnsSet, and the remaining ensembles, whose Q scores are higher than the threshold, are pruned out. The pseudo code of the algorithm is included in Appendix (Algorithm 1). The last three columns of Table 1 show the mean threshold for all six Q-diversity metrics calculated on the set of 1013 candidate deep ensembles for the three benchmark datasets used in this study. We make two observations. First, different Q-diversity metrics capture the ensemble diversity from different perspectives with different diversity measurement principles, resulting in different Q-scores. Second, each Q-metric, say CK, is used to compare ensembles based on their Q-CK scores. Hence, even though the Q-KW metric has relatively high KW-specific Q scores for all ensemble teams, it can select the diverse ensembles based on the mean KW-threshold, in a similar manner as any of the other five Q-metrics. In Figure 1, the horizontal red and black dashed lines represent the maximum single model accuracy of 96.68% and the average accuracy of 94.25% of the 10 base models respectively. We use these two accuracy bounds as important references to quantify the quality of the deep ensembles selected using a Q-metric with its mean threshold. The deep ensembles on the left of the red vertical dashed line are selected and added into GEnsSet given that their Q-scores are below the mean threshold (e.g., Q-KW or Q-GD). The ensembles on the right of this red vertical dashed line are pruned out because their Q diversity scores exceed the mean threshold.
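The Q-statistic scoring and mean-threshold pruning described above can be sketched as follows. This is a minimal illustration under our own toy data; the standard Yule Q formula over the joint correct/incorrect counts of two models is used, and the function and team names are assumptions, not the paper's code.

```python
import numpy as np
from itertools import combinations

def q_statistic(a, b):
    """Yule's Q statistic for the boolean correctness vectors of two models.
    Q is near +1 when the models err on the same samples (low diversity)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    n11 = np.sum(a & b); n00 = np.sum(~a & ~b)   # both correct / both wrong
    n10 = np.sum(a & ~b); n01 = np.sum(~a & b)   # exactly one correct
    return float((n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10))

def ensemble_q_score(correct):
    """Ensemble QS score: average pairwise Q over all member pairs."""
    return float(np.mean([q_statistic(correct[i], correct[j])
                          for i, j in combinations(range(len(correct)), 2)]))

def mean_threshold_select(scores):
    """Keep ensembles whose diversity score falls below the mean threshold."""
    thr = float(np.mean(list(scores.values())))
    return {team for team, s in scores.items() if s < thr}, thr

c1m = np.array([1, 1, 1, 0, 0], bool)   # model correctness vectors
c2m = np.array([1, 1, 1, 0, 0], bool)   # duplicate of the first model
c3m = np.array([0, 0, 1, 1, 1], bool)   # errs on different samples
scores = {("F1", "F2"): q_statistic(c1m, c2m),   # 1.0: no diversity
          ("F1", "F3"): q_statistic(c1m, c3m),   # -1.0: high diversity
          ("F2", "F3"): q_statistic(c2m, c3m)}
selected, thr = mean_threshold_select(scores)    # keeps the diverse pairs
team_score = ensemble_q_score([c1m, c2m, c3m])   # non-pairwise-style average
```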
Comparing Figures 1a and 1b, it is visually clear that both Q-metrics can select a sufficient number of good quality ensemble teams, while at the same time both Q-metrics with mean threshold pruning miss a large portion of teams with high ensemble accuracy, indicating the inherent limitations of both Q-metrics and the mean threshold pruning with respect to capturing concrete ensemble diversity in terms of the negative correlation among the member models of an ensemble. To better understand the inherent problems with the Q-diversity metrics, we performed another set of experiments by measuring the Q-GD metric over ensemble teams of fixed size S on CIFAR-10. Figure 1c visualizes the Q-GD scores computed over ensembles of size S = 4 with the mean threshold indicated by the vertical red dashed line, showing a visually sharper trend in the relationship between ensemble diversity and ensemble accuracy when comparing the selected ensemble teams (red dots) with the ensembles (black dots) on the right of the red vertical threshold line. However, relying on separating the diversity computation and comparison over ensemble teams of the same size alone may not be sufficient, because Figure 1c shows that (i) some selected ensemble teams have low accuracy, affecting all three ensemble quality measures (recall Section 2, page 3), and (ii) a fair number of ensemble teams with high ensemble accuracy (black dots on the top right side) are still missed. Similar observations hold for the other five Q-diversity metrics. We conclude our analysis with three arguments: (1) The Q-diversity metrics may not accurately capture the degree of negative correlation among the member models of an ensemble even when its ensemble Q-diversity score is below the mean threshold.
(2) Comparing ensembles of different team sizes S using their Q scores may not be a fair measure of their true ensemble diversity in terms of the degree of negative correlation among the member models of an ensemble. However, relying on ensembles of the same team size S alone is still insufficient. (3) The mean threshold is not a good Q-diversity pruning method in terms of capturing the intrinsic relationship between ensemble diversity and ensemble accuracy. This motivates us to propose the HQ diversity metrics with two-phase pruning using learning algorithms.

2.2. HQ-DIVERSITY METRICS AND THEIR TWO-PHASE PRUNING

The design of the six HQ metrics enhances the six existing popular Q-metrics with three optimizations. First, we argue that comparing ensembles of the same team size in terms of their diversity scores can better capture the intrinsic relationship between ensemble diversity and ensemble accuracy. With the focal model concept, each base model in turn serves as the focal model and is paired with each of the remaining M - 1 base models, giving M(M - 1) candidate ensembles of size 2; for M = 10 we will have 90 teams of size 2. Given an HQ metric, we first sort the ensembles of small size S, say S = 2, by their HQ scores in decreasing order, and then choose the top β (percentage) of ensembles of size S with large HQ values as our pruning targets at team size S. We recommend a conservative approach by using a small β (e.g., β = 5%, 10%). We first preemptively prune out the β(%) of the ensembles with large HQ scores and then prune all those ensembles that are supersets of these β(%) of ensembles. Imagine a hierarchical structure with all teams of size 2 on the top, where each layer adds one additional model to the teams such that all teams of size S + 1 are placed in the next tier. The bottom tier consists of the one ensemble team of size M. For each of the β(%) of size-2 ensembles that are pruned out, the α filter algorithm cuts off the whole branch of ensemble teams that are supersets of this removed ensemble team. Due to space constraints, we include Algorithm 2 to compute HQ metrics and the α filter algorithm in Appendix Sections D and E respectively. Figure 2 shows the visualization of applying the α filter on two HQ metrics: HQ-KW and HQ-GD. The black dots denote the ensemble teams pruned out by the α filter and the red dots are the ensembles selected after HQ metric pruning with the α filter. We highlight two interesting observations. First, the α filter can effectively prune those ensembles with large HQ values (representing insufficient ensemble diversity).
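The α filter's prune-pairs-then-prune-supersets logic can be sketched as below. This is our own sketch under assumed interfaces (the function name `alpha_filter`, the dictionary layout, and the toy HQ scores are illustrative, not the paper's Algorithm in Appendix Section E).

```python
from itertools import combinations

def alpha_filter(hq_scores_size2, all_teams, beta=0.2):
    """Alpha-filter sketch: hq_scores_size2 maps each frozenset team of
    size 2 to its HQ score (a large score = insufficient diversity).
    The top beta fraction of size-2 teams by HQ score are pruned,
    together with every team that is a superset of a pruned pair."""
    ranked = sorted(hq_scores_size2, key=hq_scores_size2.get, reverse=True)
    n_prune = max(1, int(beta * len(ranked)))
    pruned_pairs = ranked[:n_prune]
    # t >= p tests frozenset superset: cut the whole branch under each pair
    return [t for t in all_teams
            if not any(t >= p for p in pruned_pairs)]

models = range(4)
all_teams = [frozenset(t) for S in range(2, 5)
             for t in combinations(models, S)]       # 11 candidate teams
pair_scores = {frozenset({0, 1}): 0.9, frozenset({0, 2}): 0.2,
               frozenset({0, 3}): 0.3, frozenset({1, 2}): 0.4,
               frozenset({1, 3}): 0.1, frozenset({2, 3}): 0.5}
kept = alpha_filter(pair_scores, all_teams, beta=0.2)
# prunes {0,1} and its three supersets: 4 of the 11 teams are removed
```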
Comparing Q-GD in Figure 1c with HQ-GD (α) in Figure 2e (both with S = 4), HQ-GD (α) can significantly improve the quality of the selected ensemble teams while effectively pruning out most of the low accuracy ensembles. Second, both the HQ-GD (α) and HQ-KW (α) diversity metrics display high correlation between the measured ensemble diversity and the ensemble accuracy: low HQ scores correspond to high ensemble accuracy. Similar observations are found consistently for all HQ diversity metrics. HQ (α + K) metrics: HQ metrics with the α filter followed by the K-means filter. In our two-phase HQ diversity pruning approach, we introduce the K-means filter to correct as much as possible the remaining errors in high quality ensemble team selection. Recalling Figures 2a and 2d for ensemble teams of size S = 3, it is visually clear that the α filter is less effective in pruning out some ensemble teams of low accuracy, compared to teams of larger sizes, S = 4, 5 in Figure 2(b)(c)(e)(f). We introduce the second phase of filtering by using a customized K-means clustering algorithm with K = 2 and two strategically chosen initial centroids: top left and bottom right (marked by the red and black unfilled circles respectively), aiming to learn two clusters of ensembles: (1) the cluster of ensembles with low HQ scores and high ensemble accuracy, and (2) the cluster of ensembles with low accuracy and relatively larger HQ scores. The clustering results are indicated by the two solid circles: the pink one for cluster (1) and the light grey one for cluster (2). The two-phase filtering powered HQ (α + K) metrics can effectively remove those ensembles with low accuracy and insufficient diversity (i.e., higher HQ values), further improving the three ensemble accuracy measures (recall Section 2, page 3) compared to the HQ (α) metrics, increasing the lower bound accuracy and improving the worst-case ensemble selection quality.
Figure 3 provides a visualization for ensemble teams of size S = 3 using three HQ (α + K) metrics: HQ-CK, HQ-KW and HQ-GD. The red dots and black dots show the two clusters produced by K-means, and the red vertical dashed line indicates the filtering threshold produced by the K-means filter, which chooses the smallest HQ value from the cluster of low accuracy ensembles as the HQ-specific pruning threshold. By using HQ with the two-phase α + K filters, we can further fine tune the quality of ensemble selection by removing those ensembles with relatively low ensemble accuracy, effectively boosting the lower bound of ensemble accuracy for all ensemble teams selected by HQ (α + K) metrics, compared to either the HQ (α) or the Q metrics.
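The K-means filter with K = 2 and strategically seeded centroids can be sketched with plain NumPy. This is our own minimal 2-means sketch (the function name, iteration count, and toy scores are assumptions); it returns the threshold rule described above: the smallest HQ value in the low-accuracy cluster.

```python
import numpy as np

def kmeans_filter(hq, acc, iters=20):
    """Second-phase K-means filter sketch (K = 2) on (HQ score, accuracy)
    points. Initial centroids are seeded strategically: top-left
    (min HQ, max acc) and bottom-right (max HQ, min acc). Returns the
    HQ pruning threshold (smallest HQ value in the low-accuracy cluster)
    and a boolean mask of the ensembles kept."""
    X = np.column_stack([hq, acc]).astype(float)
    c = np.array([[X[:, 0].min(), X[:, 1].max()],    # good-cluster seed
                  [X[:, 0].max(), X[:, 1].min()]])   # bad-cluster seed
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        labels = d.argmin(axis=1)                    # assign to nearest centroid
        for k in (0, 1):
            if np.any(labels == k):
                c[k] = X[labels == k].mean(axis=0)   # recompute centroids
    bad = X[labels == 1]
    thr = float(bad[:, 0].min()) if len(bad) else float("inf")
    keep = (labels == 0) & (X[:, 0] < thr)
    return thr, keep

hq = [0.10, 0.15, 0.20, 0.80, 0.90]     # toy HQ scores (low = diverse)
acc = [0.96, 0.95, 0.96, 0.70, 0.65]    # toy ensemble accuracies
thr, keep = kmeans_filter(hq, acc)       # low-accuracy cluster sets thr = 0.8
```

Note that with Euclidean distance the two coordinates should be on comparable scales, as in this toy example; otherwise one axis dominates the clustering.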

CIFAR-10

Table 2 shows the experimental comparison of the ensemble teams selected by Q metrics with the mean threshold, HQ (α) metrics and HQ (α + K) metrics for CIFAR-10. For the selected ensembles, we show their ensemble accuracy range (%) in the 4th column. The 5th column, #(%)(Acc>96.68% (max)), shows the number and percentage of the selected ensembles with ensemble accuracy higher than the highest (max) single model accuracy of 96.68% over the M = 10 CIFAR-10 base models. The last column shows the number of selected ensembles with ensemble accuracy over 96.70%, exceeding the best single base model accuracy of 96.68%. We highlight three interesting observations. First, compared to Q metrics, our HQ (α) metrics significantly reduce the number of candidate ensembles in #EnsSet (from 1013 to 230∼281) and improve the quality of the selected ensembles. For example, with the α filter, HQ-BD, HQ-KW and HQ-GD improve the ensemble accuracy lower bound from 93.56% to 93.88%, while HQ-CK, HQ-QS, HQ-BD, HQ-FK and HQ-KW all improve the accuracy upper bound from 96.72% or 96.74% to 97.01% or 97.15%. Second, the two-phase filtering HQ (α + K) metrics further improve the quality of the selected ensembles compared to both Q-metrics and HQ (α) metrics, e.g., increasing the lower bound of ensemble accuracy from 93.56%∼94.27% to 94.46%∼95.45%. Furthermore, 42.22% (38 out of 90) of the ensembles selected by HQ-GD (α + K) have ensemble accuracy above 96.70%, showing that with random picking of an ensemble from the selected set (GEnsSet), HQ-GD has a higher than 42% probability of choosing an ensemble team with accuracy better than the max accuracy of the 10 single base models for CIFAR-10, compared to 17.93% by HQ-GD (α) and 7.26% by Q-GD. This further demonstrates the effectiveness of our HQ (α + K) metrics. ImageNet. Table 3 shows the same set of experiments on ImageNet. We make three observations.
(1) For ImageNet, many ensembles generated by HQ metrics can achieve ensemble accuracy higher than the max single base model accuracy of 78.25% by the member model F5 (Table 5 in Appendix Section F), even without having F5 as a member model of the ensemble teams. For example, with α + K, HQ-BD and HQ-GD both have 19 ensemble teams that offer ensemble accuracy higher than the max single model accuracy of 78.25% by F5 and yet do not include F5 as a member model. (2) Similar to CIFAR-10, many ensembles with low accuracy and insufficient HQ diversity are effectively pruned out by our HQ (α) metrics. Compared to Q-metrics, our HQ (α) metrics effectively increase the accuracy lower bound of all selected ensembles from 61.39% to 68.99%, a significant improvement over Q metrics. (3) The HQ (α + K) metrics further boost the lower bound ensemble accuracy over the corresponding HQ (α) metrics, with the lower bound (worst case) accuracy of 76.16%∼78.35%, significantly higher than that of Q metrics (61.39%∼70.79%). For three HQ (α + K) metrics (HQ-CK, HQ-QS, HQ-FK), 100% of the selected ensembles exceed 78.25% accuracy (the max single base model accuracy on ImageNet), while HQ-BD has over 90.91%, and HQ-KW and HQ-GD have over 87.10%, of their selected ensembles exceeding the best single base model accuracy (78.25%). Clearly, the average accuracy of the ensembles selected by HQ (α + K) metrics is much higher than that of those selected by Q-diversity metrics. Ensemble Accuracy Distribution. We further investigate the ensemble accuracy distribution for the ensemble teams selected by Q, HQ (α) and HQ (α + K) metrics. Figure 4 visualizes the results. For CIFAR-10, we compare the ensemble teams selected by Q-GD (yellow triangles), HQ-GD (α) (blue dots), and HQ-GD (α + K) (red circles).
It is visually clear that the HQ-GD (α + K) diversity metric can effectively prune out more low accuracy ensembles with insufficient diversity (high HQ scores) compared to Q-GD and HQ-GD (α), although it still admits a few low accuracy ensembles, which limit the improvement of the ensemble accuracy lower bound to 94.72% on CIFAR-10. For ImageNet, Figure 4b

4. CONCLUSION

We have presented a two-phase hierarchical ensemble diversity pruning approach for high quality ensemble selection. This paper makes three original contributions. First, we identify and analyze the inherent limitations of the six existing ensemble diversity metrics, coined as Q-metrics. Second, we address the limitations of the Q-metrics by introducing six HQ diversity metrics respectively. Third, we develop a two-phase HQ-based hierarchical pruning method with the α filter followed by the K-means filter. By combining these optimizations, the deep ensembles selected by our HQ (α + K) metrics can significantly outperform the deep ensembles selected by the corresponding Q metrics, showing that the HQ-metrics based hierarchical pruning approach is efficient in identifying and removing low quality deep ensembles. Comprehensive experiments conducted on the benchmark datasets CIFAR-10 and ImageNet show that our hierarchical diversity pruning approach outperforms the corresponding Q-metrics in terms of the lower bound (worst case) and the upper bound (best case) of ensemble accuracy over the selected deep ensembles, in addition to the average ensemble accuracy of the selected ensemble teams and the percentage of selected ensembles that exceed the highest accuracy of the member models in the base model pool.

A DIVERSITY BY UNCORRELATED ERROR

Deep neural network ensembles use multiple (say $M > 1$) deep neural networks to form a committee (team) that collaborates and combines the predictions of individual member models to make the final prediction. A consensus method, such as majority voting, plurality voting, or model averaging (the average of prediction vectors), is used to combine the individual predictions. A deep neural network classifier is typically trained to minimize a cross-entropy loss and outputs a probability vector that approximates the a posteriori probability densities of the corresponding classes. For a given input $x$, the $i$th element in the output probability vector of model $F_k$ can be modeled as
$$f_i^k(x) = p(c_i|x) + \epsilon_i^k(x),$$
where $p(c_i|x)$ is the a posteriori probability of the $i$th class ($c_i$) for the input $x$, and $\epsilon_i^k(x)$ is the error associated with this output. For making the Bayes optimum decision, $x$ will be predicted as class $c_i$ if $p(c_i|x) > p(c_j|x)\ \forall j \neq i$. Therefore, the Bayes optimum boundary is located at all points $x^*$ such that $p(c_i|x^*) = p(c_j|x^*)$, where $p(c_j|x^*) = \max_{l \neq i} p(c_l|x^*)$. Given that each neural network model outputs $f_i^k(x)$ instead of $p(c_i|x)$, the model-averaging ensemble outputs
$$\bar{f}_i(x) = \frac{1}{S}\sum_{k=1}^{S} f_i^k(x) = p(c_i|x) + \bar{\epsilon}_i(x), \quad \text{where}\ \bar{\epsilon}_i(x) = \frac{1}{S}\sum_{k=1}^{S}\epsilon_i^k(x).$$
We can calculate the variance of $\bar{\epsilon}_i$ as
$$\sigma_{\bar{\epsilon}_i}^2 = \frac{1}{S^2}\sum_{k=1}^{S}\sum_{l=1}^{S}\mathrm{cov}\big(\epsilon_i^k(x), \epsilon_i^l(x)\big) = \frac{1}{S^2}\sum_{k=1}^{S}\sigma_{\epsilon_i^k}^2 + \frac{1}{S^2}\sum_{k=1}^{S}\sum_{l\neq k}\mathrm{cov}\big(\epsilon_i^k(x), \epsilon_i^l(x)\big),$$
where $\mathrm{cov}(\cdot)$ represents the covariance.
With $\mathrm{cov}(a, b) = \mathrm{corr}(a, b)\,\sigma_a \sigma_b$, we can replace the covariance with the correlation $\mathrm{corr}(\cdot)$ and derive
$$\sigma_{\bar{\epsilon}_i}^2 = \frac{1}{S^2}\sum_{k=1}^{S}\sigma_{\epsilon_i^k}^2 + \frac{1}{S^2}\sum_{k=1}^{S}\sum_{l\neq k}\mathrm{corr}\big(\epsilon_i^k(x), \epsilon_i^l(x)\big)\,\sigma_{\epsilon_i^k}\sigma_{\epsilon_i^l}.$$
Let $\delta_i$ denote the average correlation factor among these models:
$$\delta_i = \frac{1}{S(S-1)}\sum_{k=1}^{S}\sum_{l\neq k}\mathrm{corr}\big(\epsilon_i^k(x), \epsilon_i^l(x)\big).$$
Assuming the common variance $\sigma_{\epsilon_i}^2 = \sigma_{\epsilon_i^k}^2$ holds for every model $F_k$, with $\delta_i$ we have
$$\sigma_{\bar{\epsilon}_i}^2 = \frac{1}{S}\sigma_{\epsilon_i}^2 + \frac{S-1}{S}\delta_i\sigma_{\epsilon_i}^2.$$
With the variance of the ensemble decision boundary offset $\sigma_{o_{avg}}^2 = \frac{\sigma_{\bar{\epsilon}_i}^2 + \sigma_{\bar{\epsilon}_j}^2}{d^2}$ given in (Tumer & Ghosh, 1996), we have
$$\sigma_{o_{avg}}^2 = \frac{1}{d^2 S}\Big(\sigma_{\epsilon_i}^2 + (S-1)\delta_i\sigma_{\epsilon_i}^2 + \sigma_{\epsilon_j}^2 + (S-1)\delta_j\sigma_{\epsilon_j}^2\Big).$$
Assume the errors between classes are i.i.d., that is $\sigma_{\epsilon_i}^2 = \sigma_{\epsilon_j}^2$. With $\sigma_{\epsilon_i}^2 = \sigma_{\epsilon_i^k}^2$ (the previous assumption) and the single-model boundary variance $\sigma_o^2 = \frac{2\sigma_{\epsilon_i^k}^2}{d^2}$ given in (Tumer & Ghosh, 1996), we have
$$\sigma_{o_{avg}}^2 = \frac{1}{d^2 S}\Big(2\sigma_{\epsilon_i}^2 + (S-1)\sigma_{\epsilon_i}^2(\delta_i + \delta_j)\Big) = \frac{2\sigma_{\epsilon_i^k}^2}{d^2 S}\Big(1 + (S-1)\frac{\delta_i + \delta_j}{2}\Big) = \frac{\sigma_o^2}{S}\Big(1 + (S-1)\frac{\delta_i + \delta_j}{2}\Big).$$
To extend the above formula to include all classes, let $\delta = \sum_{i=1}^{C} P_i\delta_i$, where $P_i$ is the prior probability of class $c_i$ and $C$ is the total number of classes. Assuming the prior probability $P_i$ of class $c_i$ is uniformly distributed, we have
$$\sigma_{o_{avg}}^2 = \frac{\sigma_o^2}{S}\big(1 + (S-1)\delta\big).$$
So we can derive the added error for the ensemble prediction $E_{add}^{avg}$ as
$$E_{add}^{avg} = \frac{d\,\sigma_{o_{avg}}^2}{2} = \frac{d\,\sigma_o^2}{2}\cdot\frac{1 + (S-1)\delta}{S} = E_{add}\,\frac{1 + (S-1)\delta}{S}.$$
Therefore, the ideal scenario is when all members in an ensemble team of size $S$ are diverse and predict with uncorrelated errors, i.e., $\delta = 0$; then a simple model averaging method reduces the added prediction error by a factor of $S$.
Meanwhile, the worst scenario happens when the errors of individual models are perfectly correlated with $\delta = 1$, e.g., when all $S$ models are exact duplicates; then the error of the ensemble is identical to the individual error without any improvement. In general, the correlation $\delta$ lies between 0 and 1, and therefore it is always beneficial to use an ensemble to reduce the prediction errors.
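The variance formula above can be verified numerically. The sketch below (our own illustration) builds the equicorrelated covariance matrix for S models with common error variance $\sigma_{\epsilon_i}^2$ and pairwise correlation $\delta$, and checks that the variance of the averaged error matches $\frac{\sigma_{\epsilon_i}^2}{S}(1 + (S-1)\delta)$.

```python
import numpy as np

def ensemble_error_variance(sigma2, delta, S):
    """Variance of the model-averaged error for S models with common error
    variance sigma2 and pairwise correlation delta (equicorrelated case)."""
    # equicorrelated covariance: Sigma = sigma2 * ((1 - delta) I + delta J)
    Sigma = sigma2 * ((1 - delta) * np.eye(S) + delta * np.ones((S, S)))
    w = np.full(S, 1.0 / S)          # model-averaging weights
    return float(w @ Sigma @ w)      # Var(mean error) = w^T Sigma w

S, sigma2 = 5, 1.0
for delta in (0.0, 0.3, 1.0):
    # matches the closed form sigma2 / S * (1 + (S - 1) * delta):
    # delta = 0 -> sigma2 / S (full factor-S reduction);
    # delta = 1 -> sigma2 (duplicated models, no reduction)
    assert np.isclose(ensemble_error_variance(sigma2, delta, S),
                      sigma2 / S * (1 + (S - 1) * delta))
```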

B ENSEMBLE ROBUSTNESS

Let g(x) = f c (x) -f j (x), where c = argmax 1≤i≤C f i (x) is the predicted class label and j = c. Assume g(x) is Lipschitz continuous with Lipschitz constant L j q , according to (Paulavičius & Žilinskas, 2006) , we have |g(x) -g(y)| ≤ L j q ||x -y|| p where L j q = max x ||∇g(x)|| q , 1 p + 1 q = 1 and 1 ≤ p, q ≤ ∞. Let x = x 0 + δ and y = x 0 , we have |g(x 0 + δ) -g(x 0 )| ≤ L j q ||δ|| p which can be rearranged as g(x 0 ) -L j q ||δ|| p ≤ g(x 0 + δ) ≤ g(x 0 ) + L j q ||δ|| p When g(x 0 + δ) = 0, the predicted class label will change. However, g(x 0 + δ) is lower bounded by g(x 0 ) -L j q ||δ|| p ≤ g(x 0 + δ). If 0 ≤ g(x 0 ) -L j q ||δ|| p , we have g(x 0 + δ) ≥ 0 to ensure that the prediction result will not change with the small change δ on the input x 0 . This leads to g(x 0 ) -L j q ||δ|| p ≥ 0 ⇒ ||δ|| p ≤ g(x 0 ) L j q That is ||δ|| p ≤ f c (x 0 ) -f j (x 0 ) L j q To ensure the classification result will not change, that is argmax 1≤i≤C f i (x 0 + δ) = c, we use the minimum of the bound on δ over j = c, that is ||δ|| p ≤ min j =c f c (x 0 ) -f j (x 0 ) L j q which indicates that as long as ||δ|| p is small enough to fulfill the above bound, the classifier decision will never be changed, which marks the robustness of this classifier. 
The robustness bound $R$ can be denoted as
$$R = \min_{j\neq c}\frac{f_c(x_0) - f_j(x_0)}{L^j_q} = \min_{j\neq c}\frac{f_c(x_0) - f_j(x_0)}{\max_x\|\nabla(f_c(x) - f_j(x))\|_q}.$$
For a model $F_k$, we have its bound
$$R^k = \min_{j\neq c}\frac{f^k_c(x_0) - f^k_j(x_0)}{\max_x\|\nabla(f^k_c(x) - f^k_j(x))\|_q}.$$
Let $g^k_j(x) = f^k_c(x) - f^k_j(x)$; we have
$$R^k = \min_{j\neq c}\frac{g^k_j(x_0)}{\max_x\|\nabla g^k_j(x)\|_q}.$$
Given $S$ models, combining their predictions with model averaging (avg), the $i$th element in the combined probability vector is $f^{avg}_i(x) = \frac{1}{S}\sum_{k=1}^{S} f^k_i(x)$, corresponding to the robustness bound
$$R^{avg} = \min_{j\neq c}\frac{f^{avg}_c(x_0) - f^{avg}_j(x_0)}{\max_x\|\nabla(f^{avg}_c(x) - f^{avg}_j(x))\|_q} = \min_{j\neq c}\frac{g^{avg}_j(x_0)}{\max_x\|\nabla g^{avg}_j(x)\|_q}.$$
Assume the minimum of the robustness bound is achieved at the same pair of classes $c$ and $j$ for each model, including the ensemble $F^{avg}$, that is,
$$R^k = \frac{g^k_j(x_0)}{\max_x\|\nabla g^k_j(x)\|_q}\quad\text{and}\quad R^{avg} = \frac{g^{avg}_j(x_0)}{\max_x\|\nabla g^{avg}_j(x)\|_q},$$
where $g^{avg}_j(x) = \frac{1}{S}\sum_{k=1}^{S} g^k_j(x)$. The following property always holds: $\exists\, 1\le k\le S$ such that $R^k \le R^{avg}$, indicating that the ensemble can improve the robustness bound. We prove the property by contradiction. First, we assume $\forall\, 1\le k\le S,\; R^k > R^{avg}$, that is,
$$\frac{g^k_j(x_0)}{\max_x\|\nabla g^k_j(x)\|_q} > \frac{g^{avg}_j(x_0)}{\max_x\|\nabla g^{avg}_j(x)\|_q}.$$
So we have
$$g^k_j(x_0)\left(\max_x\|\nabla g^{avg}_j(x)\|_q\right) > g^{avg}_j(x_0)\left(\max_x\|\nabla g^k_j(x)\|_q\right).$$
The above inequality holds for each $k \in \{1, \ldots, S\}$.
Summing over all $k$, we have
$$\sum_{k=1}^{S} g^k_j(x_0)\left(\max_x\|\nabla g^{avg}_j(x)\|_q\right) > \sum_{k=1}^{S} g^{avg}_j(x_0)\left(\max_x\|\nabla g^k_j(x)\|_q\right),$$
that is,
$$\left(\max_x\|\nabla g^{avg}_j(x)\|_q\right)\sum_{k=1}^{S} g^k_j(x_0) > g^{avg}_j(x_0)\sum_{k=1}^{S}\left(\max_x\|\nabla g^k_j(x)\|_q\right).$$
Given $g^{avg}_j(x) = \frac{1}{S}\sum_{k=1}^{S} g^k_j(x)$, we have
$$\left(\max_x\Big\|\nabla\Big(\sum_{k=1}^{S} g^k_j(x)\Big)\Big\|_q\right)\frac{1}{S}\sum_{k=1}^{S} g^k_j(x_0) > \frac{1}{S}\sum_{k=1}^{S} g^k_j(x_0)\sum_{k=1}^{S}\left(\max_x\|\nabla g^k_j(x)\|_q\right).$$
Therefore, we have
$$\max_x\Big\|\nabla\Big(\sum_{k=1}^{S} g^k_j(x)\Big)\Big\|_q > \sum_{k=1}^{S}\left(\max_x\|\nabla g^k_j(x)\|_q\right).$$
According to the triangle inequality, we have
$$\max_x\Big\|\nabla\Big(\sum_{k=1}^{S} g^k_j(x)\Big)\Big\|_q \le \max_x\sum_{k=1}^{S}\|\nabla g^k_j(x)\|_q \le \sum_{k=1}^{S}\left(\max_x\|\nabla g^k_j(x)\|_q\right),$$
which contradicts the derived inequality. Therefore, the previous assumption does not hold, and we have shown that $\exists\, 1\le k\le S,\; R^k \le R^{avg}$, demonstrating that the robustness of a member model can be further improved with the ensemble. Furthermore, for a model $F_k$ whose robustness bound $R^k$ was not obtained at class $j$, we have $\exists\, i\neq j$, $i, j \neq c$, such that $R^k = \frac{g^k_i(x_0)}{\max_x\|\nabla g^k_i(x)\|_q} \le \frac{g^k_j(x_0)}{\max_x\|\nabla g^k_j(x)\|_q}$. The above claim still holds as long as each model makes the same prediction $c$.
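The existence claim $\exists k,\; R^k \le R^{avg}$ can be sanity-checked numerically. The sketch below is our own construction (not the paper's code): it uses linear margin functions $g^k_j(x) = w_k\cdot x + b_k$, whose $\ell_2$ Lipschitz constant is exactly $\|w_k\|_2$, so every robustness bound is computed in closed form:

```python
import numpy as np

# Toy numeric check of the property min_k R^k <= R^avg for linear
# margin functions g_j^k(x) = w_k . x + b_k, where the Lipschitz
# constant w.r.t. the l2 norm is exactly ||w_k||_2.
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)

ws, bs, bounds = [], [], []
for _ in range(5):                    # S = 5 hypothetical member models
    w = rng.normal(size=4)
    b = abs(w @ x0) + 1.0             # ensures g(x0) > 0: all predict class c
    ws.append(w)
    bs.append(b)
    bounds.append((w @ x0 + b) / np.linalg.norm(w))  # R^k = g(x0) / ||grad g||_2

# Model averaging: g_avg(x) = mean_k g_k(x), which is still linear.
w_avg, b_avg = np.mean(ws, axis=0), np.mean(bs)
r_avg = (w_avg @ x0 + b_avg) / np.linalg.norm(w_avg)
assert min(bounds) <= r_avg           # at least one member is no more robust
```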

C ALGORITHMS FOR COMPUTING Q-DIVERSITY METRICS

We have covered six state-of-the-art diversity metrics (coined in this paper as Q-diversity metrics). In the literature, different studies use one of these diversity metrics to select models and analyze the prediction results. However, few studies provide guidelines for choosing among them, or compare and evaluate these diversity metrics in terms of pruning out low-diversity ensembles. In general, diversity metrics can be classified into two major categories based on how fault independence and uncorrelated errors are computed using a set of negative samples: pairwise metrics and non-pairwise metrics. We describe below the six representative diversity metrics considered in our study: Cohen's Kappa, Q Statistics and Binary Disagreement as pairwise representatives, and Fleiss' Kappa, Kohavi-Wolpert Variance and Generalized Diversity as non-pairwise representatives. Given a pool of $M$ base models, all trained on the same dataset, one approach to creating negative samples is to collect the negative samples from the validation set of each model and then randomly select a subset from the union of all $M$ subsets of negative examples. Let $X = \{x_1, x_2, \ldots, x_N\}$ be the $N$ randomly selected labeled negative examples. Given a base model $F_i$ and the negative sample set $X$, the output of $F_i$ on $X$ is a vector of binary values, denoted as $\omega_i = [\omega_{i,1}, \omega_{i,2}, \ldots, \omega_{i,N}]^T$, where $\omega_{i,k} = 1$ if $F_i$ predicts $x_k$ correctly and $\omega_{i,k} = 0$ otherwise.

Pairwise Diversity Metrics. Pairwise diversity metrics are calculated on a pair of classifiers. Table 4 shows the relationship between a pair of classifiers $F_i$, $F_j$. For a labeled sample $x_k$, four different types of prediction results emerge: both $F_i$ and $F_j$ make correct predictions, both make wrong predictions, or exactly one of $F_i$ and $F_j$ makes a correct prediction.
Correspondingly, we can count the number of samples of each of the four types: $N^{ab}$ denotes the number of elements $x_k \in X$ such that $\omega_{i,k} = a$ and $\omega_{j,k} = b$, with $N = N^{00} + N^{01} + N^{10} + N^{11}$.

Table 4: The relationship between a pair of classifiers
                        $F_j$ correct (1)    $F_j$ wrong (0)
$F_i$ correct (1)          $N^{11}$             $N^{10}$
$F_i$ wrong (0)            $N^{01}$             $N^{00}$

i. Cohen's Kappa (CK): Cohen's Kappa measures the diversity between the two classifiers $F_i$, $F_j$ from the perspective of agreement (McHugh, 2012; Kuncheva & Whitaker, 2003). A lower Cohen's Kappa value implies lower agreement and higher diversity. Formula 1 shows the definition of Cohen's Kappa ($\kappa_{ij}$) between the two classifiers $F_i$, $F_j$. Its value ranges from -1 to 1, with 0 representing the amount of agreement expected by random chance (McHugh, 2012).
$$\kappa_{ij} = \frac{2(N^{11}N^{00} - N^{01}N^{10})}{(N^{11}+N^{10})(N^{01}+N^{00}) + (N^{11}+N^{01})(N^{10}+N^{00})} \quad (1)$$

ii. Q Statistics (QS): The Q statistics (Yule, 1900) is defined as $QS_{ij}$ in Formula 2 between two models $F_i$, $F_j$. $QS_{ij}$ varies between -1 and 1. When the models $F_i$, $F_j$ are statistically independent, the expected $QS_{ij}$ is 0. If the two models tend to recognize the same objects similarly, $QS_{ij}$ will be positive, while two diverse models, recognizing the same objects differently, will render a small or negative $QS_{ij}$ value.
$$QS_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}} \quad (2)$$

iii. Binary Disagreement (BD): The binary disagreement (Skalak, 1996; Kuncheva & Whitaker, 2003) is the ratio of (i) the number of samples on which one model is correct while the other is wrong to (ii) the total number of samples predicted by the two models $F_i$, $F_j$, as Formula 3 shows.
$$\theta_{ij} = \frac{N^{01} + N^{10}}{N^{11} + N^{10} + N^{01} + N^{00}} \quad (3)$$

For an ensemble team of $S$ models, as recommended by (Kuncheva & Whitaker, 2003), we calculate the averaged metric value over all pairs of classifiers as Formula 4 shows, where $Q$ represents a pairwise diversity metric.
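The pairwise metrics and their team average (Formulas 1 through 4) can be sketched in a few lines of Python. This is our own illustration, not the authors' released code; `omega` follows the $\omega_{i,k}$ notation above:

```python
import itertools
import numpy as np

# omega[i, k] = 1 iff model F_i predicts negative sample x_k correctly.
def pair_counts(wi, wj):
    """Return (N11, N10, N01, N00) for two 0/1 prediction vectors."""
    n11 = int(np.sum((wi == 1) & (wj == 1)))
    n10 = int(np.sum((wi == 1) & (wj == 0)))
    n01 = int(np.sum((wi == 0) & (wj == 1)))
    n00 = int(np.sum((wi == 0) & (wj == 0)))
    return n11, n10, n01, n00

def cohens_kappa(wi, wj):                      # Formula 1
    n11, n10, n01, n00 = pair_counts(wi, wj)
    num = 2 * (n11 * n00 - n01 * n10)
    return num / ((n11 + n10) * (n01 + n00) + (n11 + n01) * (n10 + n00))

def q_statistics(wi, wj):                      # Formula 2
    n11, n10, n01, n00 = pair_counts(wi, wj)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def binary_disagreement(wi, wj):               # Formula 3
    n11, n10, n01, n00 = pair_counts(wi, wj)
    return (n01 + n10) / (n11 + n10 + n01 + n00)

def team_average(metric, omega):               # Formula 4
    """Average a pairwise metric over all pairs of team members."""
    pairs = list(itertools.combinations(range(omega.shape[0]), 2))
    return sum(metric(omega[i], omega[j]) for i, j in pairs) / len(pairs)

# Hypothetical prediction matrix for a team of S = 3 models on N = 6 samples.
omega = np.array([[1, 1, 0, 1, 0, 1],
                  [1, 0, 1, 1, 0, 0],
                  [0, 1, 1, 0, 1, 1]])
avg_bd = team_average(binary_disagreement, omega)
```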
$$\bar{Q} = \frac{2}{S(S-1)}\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} Q_{ij} \quad (4)$$

Non-pairwise Diversity Metrics. Numerous non-pairwise diversity metrics are widely used for teams of more than 2 models. To compare with the pairwise diversity metrics, we focus on three representative non-pairwise diversity metrics. For an ensemble team of $S$ classifiers, we use $l(x_k)$ to denote the number of classifiers that correctly recognize $x_k$, i.e., $l(x_k) = \sum_{i=1}^{S}\omega_{i,k}$.

iv. Fleiss' Kappa (FK): Similar to Cohen's Kappa, Fleiss' Kappa (Fleiss et al., 2013) also measures diversity from the perspective of agreement, but it is calculated directly from a team of more than 2 models as Formula 5 shows:
$$\bar\kappa = 1 - \frac{\frac{1}{S}\sum_{k=1}^{N} l(x_k)\,(S - l(x_k))}{N(S-1)\,\bar{p}\,(1-\bar{p})} \quad (5)$$
where $\bar{p}$ is the average classification accuracy for the

Algorithm 1 Threshold-based Q-diversity Pruning
1: procedure THRESHOLD-BASED-PRUNING(NegSampSet, Q, Θ, EnsSet)
2:   Input: NegSampSet: negative samples; Q: the diversity metric; Θ: the diversity threshold calculation function; EnsSet: the set of ensemble teams to be considered
3:   Output: GEnsSet: the set of good ensemble teams
4:   Initialize GEnsSet = {}, D = {}
5:   for i = 1 to |EnsSet| do
6:     ▷ calculate the diversity metric Q for T_i ∈ EnsSet
7:     q_i = DiversityMetric(Q, T_i, NegSampSet)
8:     D.append(q_i)  ▷ store q_i in the diversity measures D
9:   end for
10:  θ(Q) = Θ(D)  ▷ calculate the diversity threshold
11:  for i = 1 to |EnsSet| do
12:    if q_i < θ(Q) then
13:      GEnsSet.add(T_i)  ▷ add qualified T_i
14:    end if
15:  end for
16:  return GEnsSet
17: end procedure

Algorithm 2 HQ Diversity Metric Calculation
1: procedure GETHQ(NegSampSet, Q, EnsSet)
2:   Input: NegSampSet: negative samples for each model; Q: the diversity metric; EnsSet: the set of ensemble teams to be considered
3:   Output: HQ: the set of HQ diversity measurements
4:   Initialize D(Q) = {}, D̄(Q) = {}
5:   Initialize HQ = {}  ▷ a map of HQ diversity metrics and teams
6:   for S = 2 to M − 1 do
7:     for focal = 0 to M − 1 do
8:       Obtain EnsSet(F_focal, S) with candidate teams of size S and containing F_focal
9:       Initialize D(Q, S, F_focal) = [ ]
         ⋮
16:      ▷ scale the diversity metrics for ensemble teams of the same size
17:      D̄(Q, S, F_focal, T_i) = (q_i − min(D(Q, S, F_focal))) / (max(D(Q, S, F_focal)) − min(D(Q, S, F_focal)))  ▷ scale to [0, 1]
18:    end for
19:   end for
20:   Obtain EnsSet(S) with candidate teams of size S
21:   for i = 1 to |EnsSet(S)| do
22:     Initialize tmpD = {}
23:     for j = 0 to |T_i| − 1 do
24:       tmpD.append(D̄(Q, S, F_focal = T_i[j], T_i))
25:     end for
26:     w = MemberModelAccuracyRank(T_i)  ▷ obtain the weights for combining tmpD
27:     HQ(T_i) = WeightedAverage(w, tmpD)
28:   end for
29: end for
30: return HQ
31: end procedure

combine the per-focal scaled diversity scores into the HQ score. The weight is calculated with the corresponding rank of the accuracy of the member model ($T_i[j]$) in the ensemble ($T_i$), i.e., a member model with higher accuracy receives a higher weight (Lines 21∼28).
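A condensed sketch of this calculation for a single team size follows. The function and variable names (`hq_scores`, `q_score`, `member_acc`) are ours, not the paper's implementation: per-focal-model min-max scaling of the raw Q scores, followed by an accuracy-rank weighted average over the team's focal models.

```python
import numpy as np

def hq_scores(teams, q_score, member_acc):
    """HQ diversity for candidate teams of one fixed size.

    q_score[(focal, team)]: raw Q diversity of `team` computed with the
    negative samples of `focal`; member_acc[m]: accuracy of model m."""
    scaled, hq = {}, {}
    # Step 1: min-max scale scores separately per focal model.
    for focal in {m for t in teams for m in t}:
        cand = [t for t in teams if focal in t]
        vals = [q_score[(focal, t)] for t in cand]
        lo, hi = min(vals), max(vals)
        for t, v in zip(cand, vals):
            scaled[(focal, t)] = (v - lo) / (hi - lo) if hi > lo else 0.0
    # Step 2: combine a team's per-focal scores, weighting
    # higher-accuracy members more via their accuracy rank.
    for t in teams:
        ranks = np.argsort(np.argsort([member_acc[m] for m in t])) + 1
        w = ranks / ranks.sum()
        hq[t] = float(sum(wi * scaled[(m, t)] for wi, m in zip(w, t)))
    return hq
```

Note that the scaled score is computed separately for each focal model, so the same raw Q value can map to different positions in [0, 1] depending on the focal model's candidate set.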

E THE ALGORITHM FOR THE α FILTER

To construct deep ensemble teams of diverse models, we start by building the ensembles of a smaller size, such as $S = 2$ with $\binom{M}{2} = \frac{M(M-1)}{2}$ candidates. For a larger size, such as $S^* = S + 1$, we then extend these candidate ensembles of size $S$ by adding another member model from the base model pool. This way of constructing deep ensembles enables us to efficiently form high-quality deep ensembles step by step and strategically prune out low-diversity ensembles. Intuitively, if an ensemble team of a larger size $S = 3$, such as $[F_5, F_6, F_7]$, contains a subset of models with low ensemble diversity (i.e., high correlation), such as $[F_5, F_6]$, then other teams of size $S = 3$ that avoid this subset, such as $[F_5, F_7, F_9]$, may have higher diversity than $[F_5, F_6, F_7]$. We can therefore preemptively prune out $[F_5, F_6]$ at $S = 2$ to avoid calculating the diversity scores for any ensemble with $S > 2$ containing $[F_5, F_6]$, as Figure 5 shows. With this property, we can effectively prune out low-diversity deep ensembles. Algorithm 3 presents a skeleton of the pseudo code describing this pruning process. NegSampSet contains the set of negative samples for calculating the diversity metric Q. β marks the percentage of the teams to be further pruned out for a fixed team size; by default, we set β = 10%. EnsSet contains the set of ensemble teams to be considered. For each team size, we omit all the teams that contain any group of models in pruneSet. For the remaining teams, we measure their diversity scores $p_i$ and order them by diversity score. We then remove the β fraction of the remaining teams with the lowest diversity and add them into pruneSet for further pruning. This algorithm largely avoids exploring unpromising branches in the search for high-quality ensembles.
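Our reading of this pruning process can be sketched as follows (the names and details are ours, not the paper's implementation); `hq_score` stands in for an HQ diversity metric where a higher score means lower diversity:

```python
from itertools import combinations

def alpha_filter(models, hq_score, max_size, beta=0.10):
    """Hierarchical diversity pruning: grow team sizes bottom-up,
    skip any team containing an already-pruned subset, and push the
    worst `beta` fraction of each size into the prune set."""
    pruned = set()   # subsets known to have insufficient diversity
    kept = []
    for size in range(2, max_size + 1):
        survivors = []
        for team in combinations(models, size):
            members = set(team)
            if any(p <= members for p in pruned):  # contains a pruned subset
                continue
            survivors.append((hq_score(team), team))
        survivors.sort(key=lambda st: st[0])       # most diverse (lowest) first
        cut = int(len(survivors) * (1 - beta))
        kept.extend(t for _, t in survivors[:cut])
        pruned.update(frozenset(t) for _, t in survivors[cut:])
    return kept
```

Because a pruned pair disqualifies every superset team, the diversity metric is never evaluated on those larger teams, which is where the computational savings come from.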

F THE BASE MODEL POOLS FOR THREE BENCHMARK DATASETS

We evaluate the proposed hierarchical diversity pruning methods using three benchmark datasets: CIFAR-10, ImageNet, and Cora. The specification of these datasets and the base model pools for each dataset are included in this section, as Table 5 shows. We use 10 base models for each dataset, primarily collected from GTModelZoo (GTModelZoo Developers, 2020).

G THE α FILTER ON Q DIVERSITY METRICS

We also applied the α filter with the six Q-diversity metrics for pruning out low-diversity ensembles. Figure 6 shows the experimental results for the Q-GD metric on CIFAR-10, where Figures 6a, 6b and 6c present all the candidate ensembles of size 3, 4 and 5 respectively, and the relationship between the Q-diversity metric GD and ensemble accuracy. The black dots mark the ensembles that are pruned out by the α filter, while the red ones represent the remaining ensembles. Although the α filter can filter out many low-diversity ensembles, we still miss a fair number of ensembles with high ensemble accuracy. There are two primary reasons behind this observation: (1) The Q-diversity metrics fail to precisely capture the diversity of ensembles; therefore, when pruning out a low Q-diversity branch, such as in Figure 6a with S = 3, some ensembles of a larger size with high diversity (with low Q-GD values) may also be pruned out in Figure 6b with S = 4. (2) A few ensembles with high ensemble accuracy have low diversity, demonstrating that Q-diversity metrics may not be effectively correlated with ensemble accuracy. We further perform a comprehensive evaluation on the three datasets, as Table 6 shows. Due to the above inherent problems with the Q metrics, the α filter on our HQ metrics achieves much better performance than on the Q metrics, as can be seen by comparing Table 6 with Tables 2, 3 and 7.

H EXPERIMENTAL EVALUATION ON THE CORA DATASET

We also evaluate our methods on a popular graph dataset, Cora. The same set of experimental results is shown in Table 7.
We make similar observations as on CIFAR-10 and ImageNet. First, the α filter with HQ metrics works much better than with Q metrics: HQ metrics capture more high-accuracy (≥ 89%) ensembles (14∼18) than the 6∼17 captured by Q metrics with the mean threshold. Second, the combined hierarchical pruning method of the α filter and the K-Means filter on HQ metrics (α + K) significantly improves the ensemble accuracy lower bound from 82.10% to 86.70%∼87.80%, as well as the probability of high-accuracy ensembles among the selected ones.



Figure 1: Pruning with Q diversity with mean threshold (CIFAR-10). Limitations of Q Metrics. Figures 1a and 1b show the Q-KW and Q-GD metrics and their relationship with ensemble accuracy for all 1013 deep ensembles on CIFAR-10 respectively. Each dot represents a deep ensemble team, with team sizes color-coded by the color diagram on the right. The vertical red dashed lines represent the Q-KW and Q-GD mean thresholds of 0.868 and 0.476 respectively. The horizontal red and black dashed lines represent the maximum single-model accuracy of 96.68% and the average accuracy of 94.25% of the 10 base models respectively. We use these two accuracy bounds as important references to quantify the quality of the deep ensembles selected using a Q metric with its mean threshold. The deep ensembles on the left of the red vertical dashed line are selected and added into GEnsSet, given that their Q-scores are below the mean threshold (e.g., Q-KW or Q-GD). The ensembles on the right of this line are pruned out because their Q diversity scores exceed the mean threshold. Comparing Figures 1a and 1b, it is visually clear that both Q metrics can select a sufficient number of good-quality ensemble teams, while at the same time both Q metrics with mean-threshold pruning miss a large portion of teams with high ensemble accuracy, indicating the inherent limitations of both the Q metrics and the mean-threshold pruning with respect to capturing concrete ensemble diversity in terms of the negative correlation among the member models of an ensemble.


Figure 2: Ensemble teams of size S = 3, 4, 5 on CIFAR-10: top three figures for HQ-KW (α) and bottom three figures for HQ-GD (α). HQ (α): HQ metrics with α filter. We observe that if an ensemble team of size $S$ has a large HQ score (say $[F_5, F_6]$), indicating insufficient ensemble diversity, then all the ensemble teams of larger size that contain all the member models of this team (e.g., $[F_5, F_6, F_7]$, $[F_0, F_5, F_6]$, $[F_0, F_5, F_6, F_7]$, $[F_5, F_6, F_7, F_8]$) tend to have insufficient ensemble diversity (i.e., a larger HQ score) as well. This motivates us to design a hierarchical pruning algorithm, coined as the α filter. Concretely, we start with the set of ensembles of the smallest team size, say $S = 2$ with $|EnsSet| = \binom{M}{2}$.

(a) HQ-CK (α + K), θ = 0.139; (b) HQ-KW (α + K), θ = 0.261; (c) HQ-GD (α + K), θ = 0.375

Figure 3: Three HQ metrics with two-phase (α + K) filters for the team size S = 3 (CIFAR-10)

3 EXPERIMENTAL EVALUATION

Extensive experiments on three benchmark datasets (CIFAR-10, ImageNet, and Cora), with a total of 10 base models for each dataset, are conducted to evaluate our hierarchical diversity pruning methods. All experiments were conducted on an Intel Xeon E5-1620 server with an Nvidia GeForce GTX 1080Ti GPU on Ubuntu 16.04. Readers may refer to the Appendix (Section F) for further details on the base models used in this study and their accuracy results.

Figure 4: Ensemble Accuracy Distribution on CIFAR-10 and ImageNet

the decision boundary of the model, $x$, may vary from the optimum boundary $x^*$ by an offset $o = x - x^*$. (Tumer & Ghosh, 1996) shows that the added error beyond the Bayes error is $E_{add} = \frac{d\,\sigma^2_o}{2}$, where $d$ is the difference between the derivatives of the two posteriors and $\sigma^2_o$ is the variance of the boundary offset $o$, with $\sigma^2_o = 2\sigma^2_{\epsilon_i^k}/d^2$. Combining the predictions of $S$ models with model averaging (avg), the $i$th element in the combined probability vector gives an approximation to $p(c_i|x)$ as $f^{avg}_i(x) = \frac{1}{S}\sum_{k=1}^{S} f^k_i(x)$.

Algorithm 2 (Lines 9–16, continued):
9:  ... D(Q, S, F_focal) = [ ]
10: for i = 1 to |EnsSet(F_focal, S)| do
11:   ▷ calculate the diversity metric Q for T_i ∈ EnsSet(F_focal, S)
12:   q_i = DiversityMetric(Q, T_i, NegSampSet(F_focal))
13:   D(Q, S, F_focal).append(q_i)  ▷ add q_i into D(Q, S, F_focal)
14: end for
15: for i = 1 to |EnsSet(F_focal, S)| do
16:

Figure 5: α Filter

Figure 6: α filter on Q diversity with different team sizes S (CIFAR-10, Q-GD)

The six Q-diversity metrics

Comparing Q, HQ (α), HQ (α + K) metrics on CIFAR-10

Comparing Q, HQ (α), and HQ (α + K) metrics on ImageNet

Yanzhao Wu, Ka-Ho Chow, Wenqi Wei, Zhongwei Xie, and Ling Liu. Boosting ensemble accuracy by revisiting ensemble diversity metrics. Technical report, Georgia Institute of Technology, 2020.
G. U. Yule. On the association of attributes in statistics. Philosophical Transactions of the Royal Society of London, Series A, 1900.

Base Model Pools for Three Benchmark Datasets

Q-Metrics with α filter

Comparing Q, HQ (α), and HQ (α + K) metrics on Cora


ensemble team and $\bar\kappa$ is not obtained by simply averaging the pairwise Cohen's Kappa values ($\kappa_{ij}$).

v. Kohavi-Wolpert Variance (KW): The Kohavi-Wolpert variance is derived by (Kuncheva & Whitaker, 2003) to measure the variability of the predicted class label for a sample $x$ over the team of models $F_1, F_2, \ldots, F_S$, as Formula 6 shows. A higher KW variance indicates higher model diversity of the team.
$$KW = \frac{1}{NS^2}\sum_{k=1}^{N} l(x_k)\,(S - l(x_k)) \quad (6)$$

vi. Generalized Diversity (GD): The generalized diversity was proposed by (Partridge & Krzanowski, 1997) as Formula 7 shows. $Y$ is a random variable representing the proportion of classifiers (out of $S$) that fail to recognize a random sample $x$. The probability that $Y = \frac{i}{S}$ is denoted as $p_i$, i.e., the probability that $i$ (out of $S$) classifiers recognize a randomly chosen sample $x$ incorrectly. $p(1)$ represents the expected probability of one randomly picked model failing, while $p(2)$ denotes the expected probability of two randomly picked models both failing:
$$GD = 1 - \frac{p(2)}{p(1)}, \quad\text{where}\quad p(1) = \sum_{i=1}^{S}\frac{i}{S}\,p_i, \qquad p(2) = \sum_{i=1}^{S}\frac{i(i-1)}{S(S-1)}\,p_i. \quad (7)$$
GD varies between 0 and 1. The maximum diversity (1) occurs when the failure of one model is always accompanied by the correct recognition by the other randomly picked model, that is, $p(2) = 0$. When two randomly picked models always fail together, we have $p(1) = p(2)$, corresponding to the minimum diversity, 0.

Algorithm 1 shows the sketch of the process of using a threshold-based filter. The diversity threshold calculation function is denoted as $\Theta$, e.g., the mean function. First, we calculate the diversity measurements for all ensemble teams. Then, based on the diversity threshold $\theta(Q)$ (Line 10), we prune out the teams with low diversity ($q_i \ge \theta(Q)$) and place the remaining high-diversity ensembles into GEnsSet (Lines 11∼15). With a proper threshold $\theta$, the threshold-based pruning can efficiently prune out low-diversity deep ensembles.
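The two non-pairwise metrics above and the threshold-based filter of Algorithm 1 can be sketched as follows. This is our own illustration (not the authors' code), with `omega` an $S \times N$ binary matrix where `omega[i, k] = 1` iff model $F_i$ predicts $x_k$ correctly:

```python
import numpy as np

def kohavi_wolpert_variance(omega):            # Formula 6
    S, N = omega.shape
    l = omega.sum(axis=0)                      # l(x_k): # of correct models
    return float(np.sum(l * (S - l)) / (N * S ** 2))

def generalized_diversity(omega):              # Formula 7
    S, N = omega.shape
    fails = S - omega.sum(axis=0)              # # of models failing on x_k
    p = np.bincount(fails, minlength=S + 1) / N  # p_i = P(i models fail)
    i = np.arange(S + 1)
    p1 = np.sum(i / S * p)                       # one random model fails
    p2 = np.sum(i * (i - 1) / (S * (S - 1)) * p) # two random models fail
    return float(1 - p2 / p1)

def threshold_based_pruning(teams, diversity_metric, theta_fn):
    """Algorithm 1 sketch: keep teams whose score is below the threshold.

    `teams` maps a team id to its omega matrix; `theta_fn` computes the
    threshold from all scores (e.g. the mean)."""
    scores = {t: diversity_metric(om) for t, om in teams.items()}
    theta = theta_fn(list(scores.values()))
    return [t for t, q in scores.items() if q < theta]
```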

D THE ALGORITHM FOR COMPUTING HQ-DIVERSITY METRICS

Unlike the Q-diversity metrics, the HQ-diversity metrics calculate the diversity among the ensembles of the same team size with respect to a focal model. Algorithm 2 shows the skeleton of calculating the HQ diversity metrics for all the candidate ensembles in EnsSet. For each team size S (Lines 6∼29), we follow two general steps to calculate the HQ diversity score of each ensemble.

