DEEP ENSEMBLES WITH HIERARCHICAL DIVERSITY PRUNING

Anonymous authors
Paper under double-blind review

Abstract

Diverse deep ensembles hold the potential to improve the accuracy and robustness of deep learning models. Both pairwise and non-pairwise ensemble diversity metrics have been proposed over the past two decades. However, it remains challenging to find metrics that can effectively identify and prune those deep ensembles whose diversity is insufficient to deliver high ensemble accuracy. In this paper, we first compare six popular diversity metrics in the literature, coined as Q-metrics, including both pairwise and non-pairwise representatives. We analyze their inherent limitations in capturing the negative correlation among ensemble member models, which makes them inefficient at identifying and pruning low-quality ensembles. We then present six HQ ensemble diversity metrics that extend the existing Q-metrics with three novel optimizations: (1) We introduce the concept of a focal model and separately measure the ensemble diversity among deep ensembles of the same team size with respect to each focal model, aiming to better capture the negative correlations among member models of an ensemble. (2) We introduce six HQ-diversity metrics, each optimizing the corresponding Q-metric in terms of measuring the negative correlation among member models of an ensemble using its ensemble diversity score. (3) We introduce a two-phase hierarchical pruning method to effectively identify and prune those deep ensembles with high HQ diversity scores, aiming to increase the lower and upper bounds on the ensemble accuracy of the selected ensembles. By combining these three optimizations, deep ensembles selected by our hierarchical diversity pruning approach significantly outperform those selected by the corresponding Q-metrics.
Comprehensive experimental evaluation on several benchmark datasets shows that our HQ-metrics can effectively select high-diversity deep ensembles by pruning out those with insufficient diversity, and can successfully increase the lower-bound (worst-case) accuracy of the selected deep ensembles, compared to those selected using the state-of-the-art Q-metrics.

1. INTRODUCTION

Deep ensembles with sufficient ensemble diversity hold the potential to improve both the accuracy and the robustness of ensembles through their combined wisdom. The improvement can be measured by three criteria: (i) the average ensemble accuracy of the selected ensemble teams; (ii) the percentage of selected ensembles that exceed the highest accuracy of their individual member models; and (iii) the lower-bound (worst-case) and upper-bound (best-case) accuracy of the selected ensembles. The higher these three measures, the higher the quality of the ensemble teams. Ensemble learning can be broadly classified into two categories: (1) learning an ensemble of diverse models via diversity-optimized joint training, coined as the ensemble training approach, such as boosting (Schapire, 1999); and (2) learning to compose an ensemble of base models from a pool of existing pre-trained models through ensemble teaming based on ensemble diversity metrics (Partridge & Krzanowski, 1997; Liu et al., 2019; McHugh, 2012; Skalak, 1996), coined as the ensemble consensus approach. This paper focuses on improving the state-of-the-art results in the second category.

Related Work and Problem Statement. Ensemble diversity metrics are designed to capture the degree of negative correlation among the member models of an ensemble team (Brown et al., 2005; Liu et al., 2019; Kuncheva & Whitaker, 2003), such that high diversity indicates high negative correlation among member models of an ensemble. Three orthogonal and yet complementary
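As a concrete illustration of the pairwise family of diversity metrics discussed above, the sketch below computes Yule's Q-statistic (Kuncheva & Whitaker, 2003) from the per-sample correctness of two models, and averages it over all model pairs of an ensemble. This is a minimal sketch: the function names and the pair-averaging convention are our own illustrative choices, not the implementation proposed in this paper.

```python
import numpy as np

def pairwise_q_statistic(correct_i, correct_j):
    """Yule's Q-statistic between two classifiers, given boolean
    vectors marking which test samples each classifier predicts
    correctly. Q approaches +1 when the two models err on the same
    samples (low diversity) and is negative when their errors are
    anti-correlated (high diversity)."""
    n11 = np.sum(correct_i & correct_j)    # both correct
    n00 = np.sum(~correct_i & ~correct_j)  # both wrong
    n10 = np.sum(correct_i & ~correct_j)   # only model i correct
    n01 = np.sum(~correct_i & correct_j)   # only model j correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return 0.0  # degenerate contingency table
    return (n11 * n00 - n01 * n10) / denom

def ensemble_avg_q(correct_matrix):
    """Average pairwise Q over all model pairs of an ensemble.
    correct_matrix: boolean array of shape (n_models, n_samples)."""
    m = correct_matrix.shape[0]
    qs = [pairwise_q_statistic(correct_matrix[i], correct_matrix[j])
          for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(qs))
```

Under this convention, lower (more negative) average Q corresponds to higher ensemble diversity, which is why metrics in this family are typically used to prune candidate ensembles whose scores are close to +1.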

