ELODI: ENSEMBLE LOGIT DIFFERENCE INHIBITION FOR POSITIVE-CONGRUENT TRAINING

Anonymous

Abstract

Negative flips are errors introduced in a classification system when a legacy model is updated. Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy by forcing the new model to imitate the old one, or use ensembles, which multiply inference cost prohibitively. We analyze the role of ensembles in reducing NFR and observe that they remove negative flips that are typically not close to the decision boundary, but instead exhibit large variation in their logits across independently trained models. Based on this observation, we present a method, called Ensemble Logit Difference Inhibition (ELODI), to train a classification system that achieves paragon performance in both error rate and NFR, at the inference cost of a single model. The method distills a homogeneous ensemble into a single student model, which is then used to update the classification system. ELODI also introduces a generalized distillation objective, Logit Difference Inhibition (LDI), which penalizes changes in the logits between the reference ensemble and the single student model. On multiple image classification benchmarks, model updates with ELODI demonstrate superior accuracy retention and NFR reduction.

1. INTRODUCTION

The rapid development of visual recognition in recent years has led to the need to frequently update existing models in production-scale systems. However, when replacing a legacy classification model, one has to weigh the benefit of decreased error rate against the risk of introducing new errors that may disrupt post-processing pipelines (Yan et al., 2021) or cause friction with human users (Bansal et al., 2019). Positive-Congruent Training (PC-Training) refers to any training procedure that minimizes the negative flip rate (NFR) along with the error rate (ER). Negative flips are instances that are misclassified by the new model but correctly classified by the old one. They manifest in both visual and natural language tasks (Yan et al., 2021; Xie et al., 2021). They typically include not only samples close to the decision boundary, but also high-confidence mistakes that lead to perceived "regression" in performance compared to the old model. They are present even between identical architectures trained from different initial conditions, with different data augmentations, or with different sampling of mini-batches. Yan et al. (2021) have shown that in state-of-the-art image classification models, where a 1% improvement is considered significant, NFR can be on the order of 4∼5% even across models with identical ER. These intriguing properties motivate us to investigate the causes of negative flips and the mechanisms by which they can be reduced, in order to establish a model-update method that achieves cross-model compatibility (thus lower NFR) together with a lower error rate, i.e., better PC-Training.

Two questions. A naive approach to cross-model compatibility is to bias one model to mimic the other, as done in model distillation (Hinton et al., 2015). In this case, however, compatibility comes at the expense of accuracy (Yan et al., 2021; Bansal et al., 2019).
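To make the quantity being minimized concrete, the negative flip rate can be computed from the label predictions of the old and new models on a held-out set. The following is a minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def negative_flip_rate(y_true, pred_old, pred_new):
    """Fraction of samples the old model classifies correctly
    but the new model misclassifies."""
    y_true, pred_old, pred_new = map(np.asarray, (y_true, pred_old, pred_new))
    negative_flips = (pred_old == y_true) & (pred_new != y_true)
    return negative_flips.mean()

# Toy example: 10 samples; both models have the same error rate (20%),
# yet the update still introduces negative flips on samples 5 and 6.
y_true   = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
pred_old = np.array([0, 1, 2, 0, 1, 2, 0, 1, 1, 1])  # wrong on samples 8, 9
pred_new = np.array([0, 1, 2, 0, 1, 0, 2, 1, 2, 0])  # wrong on samples 5, 6
print(negative_flip_rate(y_true, pred_old, pred_new))  # 0.2
```

This toy case mirrors the observation above: two models with identical ER can still disagree on which samples they get wrong, producing a nonzero NFR.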
On the other hand, averaging a number of models in a deep ensemble (Lakshminarayanan et al., 2017) can reduce NFR without hurting accuracy (Yan et al., 2021), even though it does not explicitly optimize NFR or its surrogates. The role of ensembles in improving accuracy is widely known, but our first question arises: what is the role of ensembles in reducing NFR? However, ensembles multiply the cost of inference by an integer factor. Therefore, a second key question arises: is it possible to achieve the PC-Training performance of ensembles at the inference cost of a single model?

Key ideas. To address the first question, we analyze the pattern of negative flip reduction in deep ensembles. We observe that deep ensembles reduce NFR by remedying potential flip samples that exhibit relatively large variation in logit space across different single models. When a deep ensemble is composed of member models with the same architecture but trained with independent initializations on the same dataset, which we denote as a homogeneous ensemble, this behavior can be theoretically predicted and empirically validated. To address the second question, we propose to train a single model by penalizing the deviation of its sample logits from the mean logits of a deep homogeneous ensemble, and to use this single model to perform the model update. As illustrated in Figure 1 (Left), we independently train replicas of a single model with different random seeds to form the deep ensemble. We introduce a generalized distillation objective, Logit Difference Inhibition (LDI), which penalizes only significant changes in the logits between the reference ensemble and the single student model, to realize ensemble-to-single-model distillation. The result is what we call Ensemble Logit Difference Inhibition (ELODI).

Contributions. ELODI improves the state of the art in reducing perceived regression in model updates in three ways: (1) Generality, by not targeting distillation to a specific legacy model, yet reducing NFR; (2) Absence of collateral damage, by retaining or even improving the accuracy of the new model while ensuring a reduction of NFR; (3) Efficiency, as ELODI does not require evaluating ensembles of models at inference time.¹ These improvements are made possible by two main contributions: (1) an analysis of deep ensembles that sheds light on their role in reducing NFR and on how their PC-Training performance can be obtained with single models; (2) ELODI, which combines the NFR reduction of deep ensembles with the running cost of a single model by first training deep networks using the LDI loss with respect to an ensemble and then deploying only the resulting single model at inference time. This yields a significant reduction of NFR (a 29% relative reduction on ImageNet for a ResNet-18 → ResNet-50 update) over previous methods. As a side benefit, ELODI increases top-1 accuracy in several cases and is comparable in the others.
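The section above does not give the exact form of the LDI objective, but its description (penalize only significant deviations of the student's logits from the ensemble-mean logits) suggests a hinged logit-matching loss. The following PyTorch sketch is our own illustration under that assumption; the function name, margin parameter, and squared penalty are ours, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def ldi_loss(student_logits, ensemble_logits, margin=0.0):
    """Logit Difference Inhibition (sketch): penalize the student's logits
    only where they deviate from the ensemble-mean logits by more than a
    margin. With margin=0 this reduces to plain L2 logit matching.

    student_logits:  (batch, num_classes) logits of the single student model.
    ensemble_logits: (m, batch, num_classes) logits of the m ensemble members.
    """
    # Reference: mean logits of the homogeneous ensemble.
    ref = ensemble_logits.mean(dim=0)
    diff = (student_logits - ref).abs()
    # Hinge: differences within the margin are not penalized.
    excess = F.relu(diff - margin)
    return (excess ** 2).mean()

# Toy usage: 2 ensemble members, batch of 3, 5 classes.
ens = torch.randn(2, 3, 5)
stu = torch.randn(3, 5, requires_grad=True)
loss = ldi_loss(stu, ens, margin=0.5)
loss.backward()  # gradients flow only through differences above the margin
```

In practice this term would be added to the usual cross-entropy loss of the student, so the student fits the labels while staying close to the ensemble in logit space.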

2. RELATED WORK

Cross-model compatibility is becoming increasingly important as real-world systems incorporate trained components that, if replaced, can wreak havoc with post-processing pipelines. Toneva et al. (2019) empirically study prediction flips on training samples between epochs, termed "forgetting events", while Yan et al. (2021) address perceived regression on held-out sets between different models. Both are particular instances of cross-model compatibility (Shen et al., 2020; Bansal et al., 2019; Srivastava et al., 2020). Focal Distillation (Yan et al., 2021) minimizes the distance between the old and new predictions, with increased weights on samples correctly classified by the old model. Träuble et al. (2021) use a probabilistic approach to determine whether a prediction should be updated when a new model arrives. While this improves cumulative NFR, it requires multiple models to be available at inference time, which is prohibitive in practice. Ensemble learning methods (Breiman, 1996; Freund & Schapire, 1997; Breiman, 2001) are widely adopted in machine learning. The success of these methods is sometimes explained as en-



¹ Note that ELODI is able to deal with existing models trained without any special treatment, as shown in Appendix B.2.



Figure 1: Left: In ELODI, one model is trained using the Logit Difference Inhibition (LDI) loss w.r.t. an ensemble of m models of the same architecture. The result is a single model that achieves a significantly reduced negative flip rate (NFR) with respect to other models trained in the same way. Right: Scatter plot of a ResNet-50's ER vs. its NFR w.r.t. a ResNet-18; lower left is better. ELODI improves both ER and NFR over baseline methods. Notably, ELODI is close to the ensemble paragon, without the prohibitive computational cost of ensembles.

