ANTI-DISTILLATION: IMPROVING REPRODUCIBILITY OF DEEP NETWORKS

Abstract

Deep networks have been revolutionary in improving the performance of machine learning and artificial intelligence systems. Their high prediction accuracy, however, comes at the price of model irreproducibility, at levels that do not occur with classical linear models. Two supposedly identical models, with identical architectures and identical parameter sets, trained on the same set of training examples, may provide identical average prediction accuracies yet predict very differently on individual, previously unseen examples. Prediction differences may be as large as the order of magnitude of the predictions themselves. Ensembles have been shown to somewhat mitigate this behavior but, without an extra push, may not realize their full potential. In this work, a novel approach, Anti-Distillation, is proposed to address irreproducibility in deep networks where ensemble models are used to generate predictions. Anti-Distillation pushes ensemble components away from one another, with techniques such as de-correlating their outputs over mini-batches of examples, making the components more different and more diverse. Doing so enhances the benefit of ensembles, making the final predictions more reproducible. Empirical results demonstrate substantial reductions in prediction difference achieved by Anti-Distillation on benchmark and real datasets.

1. INTRODUCTION

In the last decade, deep networks provided revolutionary breakthroughs in machine learning, achieving capabilities that were not even imagined ten years ago, and penetrating every domain of our lives. They have been shown to be substantially superior to classical techniques that optimized linear models on convex objectives. With this success, however, comes a price: irreproducibility, at levels unseen before with classical models (Dusenberry et al., 2020). Training even the exact same model, with identical parameters and architecture, on the same set of training examples can produce very different models if trained more than once. Two such models can have equal average prediction accuracy on validation data, but predict very differently on individual examples. The problem can be as extreme as having a Prediction Difference (PD) of the same order of magnitude as the predictions themselves. Perhaps some applications can tolerate such differences, but consider medical applications, where for the same symptoms one model would predict one disease and the other model would predict another. While in some way this mimics real life, where individuals draw conclusions based on what they have learned and what they know, or depending on the order in which they learn different topics (see, e.g., Achille et al. (2017); Bengio et al. (2009)), this is definitely not a desired behavior. Overall, perhaps, predictions are better, but for the individual cases in which predictions differ, the consequences can be irreversible. Deep models are usually trained on highly parallelized distributed systems. They are normally initialized randomly and are expected to find a nonlinear solution that fits the data best, minimizing a non-convex loss objective. Applying determinism (see, e.g., Nagarajan et al. (2018)) to the order in which data is seen and/or updated may not be an option, especially in extremely large scale systems.
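To make the notion of Prediction Difference concrete, a minimal sketch follows. It uses one plausible formulation, the mean absolute per-example difference between two models normalized by their mean prediction level; the function name and the exact normalization are illustrative assumptions, not necessarily the metric used later in the paper.

```python
import numpy as np

def prediction_difference(preds_a, preds_b):
    """Relative prediction difference between two models over the same
    examples: mean absolute per-example difference, normalized by the
    mean prediction level. (One plausible formulation; illustrative only.)"""
    preds_a = np.asarray(preds_a, dtype=float)
    preds_b = np.asarray(preds_b, dtype=float)
    per_example_diff = np.mean(np.abs(preds_a - preds_b))
    mean_prediction = np.mean((preds_a + preds_b) / 2.0)
    return per_example_diff / mean_prediction

# Two "identical" models with the same average prediction (0.4) that
# nevertheless disagree on individual examples:
pd = prediction_difference([0.2, 0.4, 0.6], [0.4, 0.2, 0.6])
```

Note that the two toy prediction vectors above have identical means, so aggregate accuracy metrics would not expose the disagreement that this per-example measure captures.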
In such systems, even if the models are initialized identically (to some identical pseudorandom set of initialization values), the trained model sees the training examples in some random order, and the updates are also applied with some randomness. Due to the non-convex nature of the objective, different training instances of the same model on the same dataset may still converge to different optima, which may all be equal in average accuracy but very different on individual examples. This problem becomes even more critical in reinforcement and online (or mini-batch) learning, where the model must make predictions and decisions while it continues to train on new examples. The decisions the model makes also dictate what new data is seen. If two models diverge from one another on the same data, they can then make different future decisions, affecting what additional data each model sees. This can amplify the divergence of two supposedly identical models even further. Online Click-Through-Rate (CTR) prediction (see, e.g., McMahan et al. (2013)), where ads are shown based on the current state of the model and the model then updates based on user reactions to the shown ads, is an example where irreproducibility can lead to two models that diverge very substantially from one another over time. Using ensembles of models (see, e.g., Dietterich (2000); Koren (2009)) has become a very popular technique in machine learning, and also in deep networks. Ensembles can reduce prediction uncertainty, as shown in Lakshminarayanan et al. (2017). Averaging the predictions of multiple deep networks, referred to as ensemble components, turns out to also help reduce prediction differences between two (or more) such ensembles. Because of irreproducibility, and especially if each component is initialized differently, each of the components diverges to focus on a different slice of the solution space.
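The variance-reduction effect of averaging components can be illustrated with a toy simulation, not taken from the paper: each "model" is a shared true signal plus independent noise, a deliberately simplified stand-in for independently trained components (which in reality are not fully independent). The function name and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pd(num_components, num_examples=10_000, noise=0.1):
    """Toy model of ensemble reproducibility: two independently 'trained'
    ensembles predict a shared signal plus per-component noise. Returns
    the mean absolute disagreement between the two ensemble averages.
    (Illustrative assumption: component errors are independent.)"""
    truth = rng.uniform(0.2, 0.8, size=num_examples)
    ens_a = truth + rng.normal(0, noise, (num_components, num_examples)).mean(axis=0)
    ens_b = truth + rng.normal(0, noise, (num_components, num_examples)).mean(axis=0)
    return np.mean(np.abs(ens_a - ens_b))

pd_single = simulate_pd(1)     # two single models
pd_ensemble = simulate_pd(8)   # two 8-component ensembles
```

Under these independence assumptions the disagreement shrinks roughly as the square root of the number of components, which is the intuition behind the next observation that the averaged prediction is more reproducible.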
Thus the averaged prediction, which has reduced variance, is more reproducible. One point to note, however, is that this approach can trade prediction accuracy for improved reproducibility. If the number of operations (or the number of learned parameters) is constrained, then the components of the ensemble must have narrower layers than a single deep network applied to the same task, so that complexity and resources are kept equal between the two systems. While the ensemble of narrower components has better accuracy than a single narrow component, the ensemble as a whole may have inferior accuracy to a comparable single network of the same complexity (or number of parameters). This happens in very large scale systems, where the individual components of the model are not significantly over-parameterized relative to the training data. (At smaller scales, it is possible that narrower models are still expressive enough for the specific problem, and adding more parameters is no longer helpful in improving model accuracy.)

Distillation (see Hinton et al. (2015), and also Lan et al. (2018)) is a technique that is gaining popularity in deep networks for transferring the knowledge of a complex model, or a complex ensemble of models (Mosca & Magoulas, 2018), to a simple model in a way that allows the simple model to exhibit performance similar or close to that of the complex model. This is particularly important for models deployed on small devices that do not have the capacity of larger systems. The simple student model uses the prediction of the complex teacher model as a label, to try to become more similar to the teacher, which is a stronger model with better predictions. Various methods, as in Crowley et al. (2018); Gou et al. (2020); Kim et al. (2018); Muller et al. (2020); Mun et al. (2018); Tang et al. (2020), have been developed to extract the most from distillation.

Contribution: Unlike distillation methods, in this paper we apply a technique of the opposite nature, Anti-Distillation (AD). Leveraging the benefit of ensembles, we try to make the components as different as possible, so that together, as an ensemble, they capture a larger subset of the solution space. This is done by adding a (regularization) loss that forces different components to diverge from one another, increasing the diversification of the ensemble. Various regularization losses can achieve this diversifying effect; we focus on correlation and covariance losses. A de-correlation loss can be obtained by minimizing the square of the Frobenius norm of the off-diagonal terms of the correlation matrix of the predictions of the ensemble components, where the correlations are estimated from the predictions over a mini-batch of training examples. De-correlation has been applied internally to neurons of hidden layers in neural networks for decades (see, e.g., Shamir et al. (1993)). It was used in attempts to simplify the representations of the networks, by pruning neurons whose contributions are correlated with others, or to reduce overfitting (Cogswell et al., 2015). Unlike these works, here we apply de-correlation to the full model predictions of the components of the ensemble, for the sole purpose of diversifying the predictions of these components, so that the overall averaged prediction of the ensemble captures a wider portion of the optimization parameter space and, by doing so, is more reproducible.

Related Work: Over-parameterization of deep networks has been studied in many references (see, e.g., Denil et al. (2013); Han et al. (2015), and many others). Deep networks can thus find multiple explanations for the same datasets (even when examples are randomized (Zhang et al., 2016)). As described above, such different explanations may yield non-identical individual predictions, despite even consistent identical average performance over validation sets.
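The de-correlation loss just described can be sketched as follows: a minimal NumPy illustration of the loss term itself, not the paper's actual training code. In practice this term would be added, suitably weighted, to the main training objective and differentiated through the components' mini-batch outputs; the function name is an assumption for illustration.

```python
import numpy as np

def decorrelation_loss(component_preds):
    """Squared Frobenius norm of the off-diagonal entries of the
    correlation matrix of ensemble-component predictions.

    component_preds: array of shape (num_components, batch_size), each
    row holding one component's predictions over a mini-batch.
    (Minimal sketch; a real implementation would use a differentiable
    framework so gradients flow back into the components.)"""
    # Correlation matrix across components, estimated from the mini-batch.
    corr = np.corrcoef(component_preds)              # shape (K, K)
    off_diag = corr - np.diag(np.diag(corr))         # zero out the diagonal
    return np.sum(off_diag ** 2)                     # squared Frobenius norm

# Two perfectly correlated components: the loss is maximal for K = 2,
# pushing training to diversify them.
loss_corr = decorrelation_loss(np.array([[1., 2., 3., 4.],
                                         [2., 4., 6., 8.]]))
```

Minimizing this term drives the off-diagonal correlations toward zero, which is exactly the sense in which the components are forced away from one another while each still fits the primary training loss.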

