DEEP k-NN LABEL SMOOTHING IMPROVES STABILITY OF NEURAL NETWORK PREDICTIONS

Abstract

Training modern neural networks is an inherently noisy process that can lead to high prediction churn (disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batch ordering), even when the trained models all attain high accuracy. Such prediction churn can be very undesirable in practice. In this paper, we present several baselines for reducing churn and show that using the k-NN predictions to smooth the labels yields a new and principled method that often outperforms the baselines on churn while improving accuracy on a variety of benchmark classification tasks and model architectures.
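To make the abstract's idea concrete, the following is a minimal, hypothetical sketch of k-NN label smoothing: each training label is blended with the empirical class distribution of the point's k nearest neighbors. The feature space used for neighbor search, the Euclidean metric, and the mixing weight `alpha` are all illustrative assumptions here, not the paper's exact formulation or hyperparameters.

```python
import numpy as np

def knn_smoothed_labels(X, y, num_classes, k=5, alpha=0.5):
    """Blend one-hot labels with each point's k-NN class distribution.

    alpha is the weight placed on the k-NN distribution; alpha=0 recovers
    the original one-hot labels. Neighbors are found by Euclidean distance
    in the given feature space (an assumption for illustration).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    smoothed = np.zeros((n, num_classes))
    for i in range(n):
        # Distances to all points; exclude the point itself from its neighbors.
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        # Empirical class distribution over the k nearest neighbors.
        knn_dist = np.zeros(num_classes)
        for j in nbrs:
            knn_dist[y[j]] += 1.0 / k
        one_hot = np.zeros(num_classes)
        one_hot[y[i]] = 1.0
        smoothed[i] = (1 - alpha) * one_hot + alpha * knn_dist
    return smoothed
```

The smoothed vectors remain valid probability distributions (rows sum to one), so they can replace one-hot targets directly in a cross-entropy loss.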

1. INTRODUCTION

Deep neural networks (DNNs) have proved to be immensely successful at solving complex classification tasks across a range of problems. Much of the effort has been spent towards improving their predictive performance (i.e. accuracy), while comparatively little has been done towards improving the stability of training these models. Modern DNN training is inherently noisy due to factors such as the random initialization of network parameters, the mini-batch ordering, and the effects of various data augmentation or pre-processing tricks, all of which are exacerbated by the non-convexity of the loss surface. This results in local optima corresponding to models that have very different predictions on the same data points. This may seem counter-intuitive, but even when the different runs all produce very high accuracies for the classification task, their predictions can still differ quite drastically, as we will show later in the experiments. Thus, even an optimized training procedure can lead to high prediction churn, which refers to the proportion of sample-level disagreements between classifiers caused by different runs of the same training procedure¹.

In practice, reducing such predictive churn can be critical. For example, in a production system, models are often continuously improved upon by being trained or retrained with new data or better model architectures and training procedures. In such scenarios, a candidate model for release must be compared to the current model serving in production. Oftentimes, this decision is conditioned on more than just overall offline test accuracy; in fact, the offline metrics are often not completely aligned with the actual goal, especially if these models are used as part of a larger system (e.g. maximizing offline click-through rate vs. maximizing revenue or user satisfaction).
As a result, these comparisons oftentimes require extensive and costly live experiments, including human evaluation in situations where the candidate and the production model disagree (i.e. in many situations, the true labels are not available without a manual labeler). In these cases, it can be highly desirable to lower prediction churn. Despite the practical relevance of lowering predictive churn, there has been surprisingly little work done in this area, which we highlight in the related work section. In this work, we focus on predictive churn reduction when retraining the same model architecture on an identical train and test set. Our main contributions are as follows:

• We provide one of the first comprehensive analyses of baselines to lower prediction churn, showing that popular approaches designed for other goals are effective baselines for churn reduction, even compared to methods designed for this goal.



¹ Concretely, given two classifiers applied to the same test samples, the prediction churn between them is the fraction of test samples with different predicted labels.
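The footnote's definition of churn can be computed directly; a minimal sketch (function name and example predictions are illustrative):

```python
import numpy as np

def prediction_churn(preds_a, preds_b):
    """Fraction of test samples on which two classifiers disagree."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Predicted labels from two runs of the same training procedure
# on the same test set (toy data).
run1 = [0, 1, 2, 1, 0, 2, 1, 0]
run2 = [0, 1, 1, 1, 0, 2, 2, 0]
print(prediction_churn(run1, run2))  # 2 of 8 samples differ -> 0.25
```

Note that churn is defined over predicted labels only, so it can be measured without ground-truth labels on the disagreeing samples.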

