DEEP k-NN LABEL SMOOTHING IMPROVES STABILITY OF NEURAL NETWORK PREDICTIONS

Abstract

Training modern neural networks is an inherently noisy process that can lead to high prediction churn (disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches) even when the trained models all attain high accuracies. Such prediction churn can be very undesirable in practice. In this paper, we present several baselines for reducing churn and show that utilizing the k-NN predictions to smooth the labels results in a new and principled method that often outperforms the baselines on churn while improving accuracy on a variety of benchmark classification tasks and model architectures.

Algorithm 1 Deep k-NN label smoothing
Inputs: 0 ≤ a, b ≤ 1, training data (x_1, y_1), ..., (x_n, y_n), model training procedure M.
Train model M_0 on (x_1, y_1), ..., (x_n, y_n) with M.
Let z_1, ..., z_n ∈ R^L be the logits of x_1, ..., x_n, respectively, w.r.t. M_0.
Let ỹ_i be the k-NN smoothed label of (z_i, y_i) computed w.r.t. the dataset (z_1, y_1), ..., (z_n, y_n).
Train model M on (x_1, ỹ_1), ..., (x_n, ỹ_n) with M.
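The procedure above can be sketched in a few lines of NumPy. This is an illustrative brute-force sketch, not the authors' implementation; the `train_fn`/`logits_fn` interface and all function names are our assumptions, made only so the example is self-contained.

```python
import numpy as np

def knn_smooth_labels(z, y, k, a, b):
    """k-NN label smoothing (a weights true label vs. smoothing,
    b weights global uniform smoothing vs. local k-NN smoothing).
    z: (n, d) embeddings (here, logits); y: (n, L) one-hot labels."""
    n, L = y.shape
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    nn = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbors (self included)
    eta = y[nn].mean(axis=1)                             # k-NN label eta_k at each point
    return (1 - a) * y + a * (b * np.ones(L) / L + (1 - b) * eta)

def deep_knn_label_smoothing(x, y, train_fn, logits_fn, k=5, a=0.5, b=0.5):
    """Algorithm 1: train a preliminary model, smooth the labels in its
    logit space, then retrain on the smoothed labels."""
    m0 = train_fn(x, y)                    # preliminary model M_0
    z = logits_fn(m0, x)                   # logits of the training points w.r.t. M_0
    y_smooth = knn_smooth_labels(z, y, k, a, b)
    return train_fn(x, y_smooth)           # final model trained on smoothed labels
```

Here `train_fn` stands in for the training procedure M; in practice it would train the neural network, and the logits of M_0 serve as the embedding in which neighbors are found.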

1. INTRODUCTION

Deep neural networks (DNNs) have proved immensely successful at solving complex classification tasks across a range of problems. Much effort has been spent on improving their predictive performance (i.e. accuracy), while comparatively little has been done towards improving the stability of training these models. Modern DNN training is inherently noisy due to factors such as the random initialization of network parameters, the mini-batch ordering, and the effects of various data augmentation or pre-processing tricks, all of which are exacerbated by the non-convexity of the loss surface. This results in local optima corresponding to models that make very different predictions on the same data points. This may seem counter-intuitive, but even when the different runs all produce very high accuracies for the classification task, their predictions can still differ quite drastically, as we show later in the experiments. Thus, even an optimized training procedure can lead to high prediction churn, which refers to the proportion of sample-level disagreements between classifiers produced by different runs of the same training procedure (concretely, given two classifiers applied to the same test samples, the prediction churn between them is the fraction of test samples with different predicted labels). In practice, reducing such predictive churn can be critical. For example, in a production system, models are continuously improved by being retrained with new data or with better model architectures and training procedures. In such scenarios, a candidate model for release must be compared to the model currently serving in production. Oftentimes, this decision is conditioned on more than just overall offline test accuracy; in fact, the offline metrics are often not completely aligned with the actual goal, especially if these models are used as part of a larger system (e.g. maximizing offline click-through rate vs. maximizing revenue or user satisfaction).
As a result, these comparisons oftentimes require extensive and costly live experiments, including human evaluation in situations where the candidate and the production model disagree (in many situations, the true labels are not available without a manual labeler). In these cases, it can be highly desirable to lower prediction churn. Despite the practical relevance of lowering predictive churn, there has been surprisingly little work in this area, as we highlight in the related-work section. In this work, we focus on predictive-churn reduction under retraining of the same model architecture on an identical train and test set. Our main contributions are as follows:
• We provide one of the first comprehensive analyses of baselines for lowering prediction churn, showing that popular approaches designed for other goals are effective baselines for churn reduction, even compared to methods designed specifically for this goal.
• We improve label smoothing, a global smoothing method popular for improving model confidence scores, by utilizing the local information leveraged by the k-NN labels, thus introducing k-NN label smoothing, which we show often outperforms the baselines on a wide range of benchmark datasets and model architectures.
• We show new theoretical results suggesting the usefulness of the k-NN label. Under mild nonparametric assumptions, we show that for a wide range of k the k-NN labels uniformly approximate the Bayes-optimal label and, when k is tuned optimally, achieve the minimax-optimal rate. We also show that when k is linear in n, the distribution implied by the k-NN label approximates the original distribution smoothed with an adaptive kernel.

2. RELATED WORKS

Our work spans multiple sub-areas of machine learning. The main problem this paper tackles is reducing prediction churn. In the process, we show that label smoothing is an effective baseline and improve upon it in a principled manner using deep k-NN label smoothing.

Prediction churn. There are only a few works that explicitly address prediction churn. Fard et al. (2016) proposed training a model so that it has small prediction instability with future versions of the model by modifying the data that the future versions are trained on. They further propose turning the classification problem into a regression towards the corrected predictions of an older model, as well as regularizing the new model towards the older model using example weights. Cotter et al. (2019) and Goh et al. (2016) use constrained optimization to directly lower prediction churn across model versions. Simultaneously training multiple identical models (apart from initialization) while tethering their predictions together via regularization has been proposed in the context of distillation (Anil et al., 2018; Zhang et al., 2018; Zhu et al., 2018; Song & Chai, 2018) and robustness to label noise (Malach & Shalev-Shwartz, 2017; Han et al., 2018). This family of methods was termed "co-distillation" by Anil et al. (2018), who also noted that it can be used to reduce churn in addition to improving accuracy. In this paper, we show much more extensively that co-distillation is indeed a reasonable baseline for churn reduction.

Label smoothing. Label smoothing (Szegedy et al., 2016) is a simple technique that trains the model on soft labels obtained as a convex combination of the hard true label and the uniform distribution over all labels. It has been shown to prevent the network from becoming over-confident and to lead to better confidence calibration (Müller et al., 2019).
Here we show that label smoothing is a reasonable baseline for reducing prediction churn, and we moreover enhance it for this task by smoothing the labels locally via k-NN rather than using the purely global approach of mixing with the uniform distribution.

k-NN theory. The theory of k-NN classification has a long history (e.g. Fix & Hodges Jr (1951); Cover (1968); Stone (1977); Devroye et al. (1994); Chaudhuri & Dasgupta (2014)). To our knowledge, the most relevant k-NN classification result is by Chaudhuri & Dasgupta (2014), who show statistical risk bounds under assumptions similar to those used in our work. Our analysis shows finite-sample L_∞ bounds on the k-NN labels, which is a stronger notion of consistency: it provides a uniform guarantee rather than the average guarantee shown in previous works under standard risk measures such as L_2 error. We do this by leveraging recent techniques developed in Jiang (2019) for k-NN regression, which assumes an additive noise model instead of classification. Moreover, we provide, to our knowledge, the first consistency guarantee for the case where k grows linearly with n.

Deep k-NN. k-NN is a classical method in machine learning which has recently been shown to be useful when applied to the intermediate embeddings of a deep neural network (Papernot & McDaniel, 2018) to obtain more calibrated and adversarially robust networks. This is because standard distance measures are often better behaved in these representations, leading to better k-NN performance on the embeddings than on the raw inputs.

3. ALGORITHM

Suppose that the task is multi-class classification with L classes and the training datapoints are (x_1, y_1), ..., (x_n, y_n), where x_i ∈ X, X is a compact subset of R^D, and y_i ∈ R^L is the one-hot encoding of the label; that is, if the i-th example has label j, then y_i has a 1 in the j-th entry and 0 everywhere else. We first give the formal definition of the smoothed labels:

Definition 1 (Label smoothing). Given label smoothing parameter 0 ≤ a ≤ 1, the smoothed label is
y_a^LS := (1 − a) · y + (a/L) · 1_L,
where 1_L denotes the vector of all 1's in R^L.

We next formally define the k-NN label, which is the average label of the example's k nearest neighbors in the training set. Let us use the shorthand X := {x_1, ..., x_n}.

Definition 2 (k-NN label). Let the k-NN radius of x ∈ X be r_k(x) := inf{r : |B(x, r) ∩ X| ≥ k}, where B(x, r) := {x' ∈ X : |x − x'| ≤ r}, and let the k-NN set of x ∈ X be N_k(x) := B(x, r_k(x)) ∩ X. Then for all x ∈ X, the k-NN label is defined as
η_k(x) := (1 / |N_k(x)|) Σ_{i=1}^n y_i · 1[x_i ∈ N_k(x)].

The label smoothing method can be seen as performing a global smoothing: every label is equally transformed towards the uniform distribution over all labels. While it seems almost deceptively simple, it has only recently been shown to be effective in practice, specifically for better-calibrated networks. However, since this smoothing is applied equally to all datapoints, it fails to incorporate local information about each datapoint. To this end, we propose using the k-NN label, which smooths the label across its nearest neighbors. We show theoretically that the k-NN label can be a strong proxy for the Bayes-optimal label, that is, the best possible prediction one can make given the uncertainty.
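As a concrete illustration, the two definitions can be implemented directly in NumPy. This is a brute-force sketch for small n; the function names are ours, not the paper's.

```python
import numpy as np

def smooth_labels_global(y, a):
    """Definition 1: y_LS = (1 - a) * y + (a / L) * 1_L."""
    L = y.shape[-1]
    return (1 - a) * y + a / L

def knn_label(X, y, query, k):
    """Definition 2: eta_k(query) is the mean one-hot label over the
    k training points nearest to `query` (brute-force distances)."""
    d = np.linalg.norm(X - query, axis=1)
    nn = np.argsort(d)[:k]
    return y[nn].mean(axis=0)
```

Since each k-NN label is a mean of one-hot vectors, it is itself a probability distribution over the L classes, which is what makes it usable as a soft training target.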
In other words, compared to the true label (or even the label-smoothed label), the k-NN label is robust to variability in the data distribution and provides a more stable estimate than the original hard label, which may be noisy. Training on noisy labels has been shown to hurt model performance (Bahri et al., 2020), and using the smoothed labels can help mitigate these effects. To this end, we define k-NN label smoothing as follows:

Definition 3 (k-NN label smoothing). Let 0 ≤ a, b ≤ 1 be k-NN label smoothing parameters. Then the k-NN smoothed label of datapoint (x, y) is defined as:
y_{a,b}^kNN := (1 − a) · y + a · [b · (1/L) · 1_L + (1 − b) · η_k(x)].

We see that a weights between using the true labels and using smoothing, and b weights between global and local smoothing. Algorithm 1 shows how k-NN label smoothing is applied to deep learning models. Like Bahri et al. (2020), we perform k-NN on the network's logits layer.

4. THEORY

In this section, we provide theoretical justification for why the k-NN labels may be useful. In particular, we show results for two settings, where n is the number of datapoints:
• When k ≪ n, we show that with an appropriate setting of k, the k-NN smoothed labels approximate the predictions of the Bayes-optimal classifier at a minimax-optimal rate.
• When k = O(n), we show that the distribution implied by the k-NN smoothed labels is equivalent to the original distribution convolved with an adaptive smoothing kernel.

Our results may also reveal insights into why distillation methods (training a model on another model's predictions instead of the true labels) can work. Another way of viewing the result is that the k-NN smoothed label is equivalent to the soft prediction of the k-NN classifier. Thus, training on the k-NN labels is essentially distillation from the k-NN classifier, and our theoretical results show that the labels implied by k-NN approximate the predictions of the optimal classifier (in the k ≪ n setting). Learning the optimal classifier may indeed be a better goal than learning from the true labels, because the latter may lead to overfitting to the sampling noise rather than the true signal implied by the optimal classifier. While distillation is not the topic of this work, our results in this section may be of independent interest to that area.

For the analysis, we assume the binary classification setting, but it is understood that our results can be straightforwardly generalized to the multi-class setting. The feature vectors are defined on compact support X ⊆ R^D and datapoints are drawn as follows: the feature vector is drawn from density p_X on X and the labels are drawn according to the label function η : X → [0, 1], i.e. η(x) = P(Y = 1 | X = x).

4.1. k ≪ n

We make a few mild regularity assumptions for our analysis to hold, which are standard in works analyzing non-parametric methods, e.g. Singh et al. (2009); Chaudhuri & Dasgupta (2014); Reeve & Kaban (2019); Jiang (2019); Bahri et al. (2020). The first ensures that the support X does not become arbitrarily thin anywhere, the second ensures that the density does not vanish anywhere on the support, and the third ensures that the label function η is smooth w.r.t. its input.

Assumption 1. The following three conditions hold:
• Support regularity: there exist ω > 0 and r_0 > 0 such that Vol(X ∩ B(x, r)) ≥ ω · Vol(B(x, r)) for all x ∈ X and 0 < r < r_0, where B(x, r) := {x' ∈ X : |x − x'| ≤ r}.
• Non-vanishing density: p_{X,0} := inf_{x∈X} p_X(x) > 0.
• Smoothness of η: there exist 0 < α ≤ 1 and C_α > 0 such that |η(x) − η(x')| ≤ C_α · |x − x'|^α for all x, x' ∈ X.

We have the following result, which provides a uniform bound between the smoothed k-NN label η_k and the Bayes-optimal label η.

Theorem 1. Let 0 < δ < 1, suppose that Assumption 1 holds, and suppose that k satisfies
2^8 · D · log^2(4/δ) · log n ≤ k ≤ (1/2) · ω · p_{X,0} · v_D · r_0^D · n,
where v_D := π^{D/2} / Γ(D/2 + 1) is the volume of the D-dimensional unit ball. Then with probability at least 1 − δ, we have
sup_{x∈X} |η_k(x) − η(x)| ≤ C_α · (2k / (ω · v_D · n · p_{X,0}))^{α/D} + √((2 log(4D/δ) + 2D log n) / k).

In other words, there exist constants C_1, C_2, C depending on η and δ such that if k satisfies C_1 · log n ≤ k ≤ C_2 · n, then with probability at least 1 − δ, ignoring logarithmic factors in n and 1/δ:
sup_{x∈X} |η_k(x) − η(x)| ≤ C · ((k/n)^{α/D} + 1/√k).

Choosing k ≈ n^{2α/(2α+D)} gives a bound of sup_{x∈X} |η_k(x) − η(x)| ≤ O(n^{−α/(2α+D)}), which is the minimax-optimal rate as established by Tsybakov et al. (1997). Therefore, the advantage of using the smoothed labels η_k(x_1), ..., η_k(x_n) instead of the true labels y_1, ..., y_n is that the smoothed labels approximate the Bayes-optimal classifier.
Moreover, as shown above, with an appropriate setting of k, the smoothed labels are a minimax-optimal estimator of the true label function η. Thus, the smoothed labels provide as good a proxy for η as any estimator possibly can. As suggested earlier, another way of viewing this result is that the original labels may contain considerable noise, so no single label can be guaranteed reliable. Using the smoothed label instead mitigates this effect and allows us to train the model to match the label function η.
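The following small simulation (ours, not from the paper) illustrates the point of Theorem 1 in one dimension: even in the worst case over the sample, the k-NN smoothed labels track η far more closely than the raw hard labels do.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 50
x = np.sort(rng.uniform(size=n))                 # 1-D features; eta(x) = P(Y=1|x) = x
eta = x
y = (rng.uniform(size=n) < eta).astype(float)    # noisy hard labels

# k-NN smoothed label at each training point (1-D, brute force)
d = np.abs(x[:, None] - x[None, :])
nn = np.argsort(d, axis=1)[:, :k]
eta_k = y[nn].mean(axis=1)

sup_raw = np.abs(y - eta).max()       # worst-case error of the hard labels (near 1)
sup_knn = np.abs(eta_k - eta).max()   # worst-case error of the k-NN labels (much smaller)
```

Here the hard labels are always at sup-norm distance roughly 1 from η, while the smoothed labels concentrate around η at the rate the theorem describes.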

4.2. k LINEAR IN n

In the previous subsection, we showed the utility of k-NN label smoothing as a theoretically sound proxy for the Bayes-optimal labels, attaining statistical consistency guarantees as long as k grows faster than log n and k/n → 0. We now analyze the case where k grows linearly with n. In this case, the k-NN smoothed labels no longer recover the Bayes-optimal label function η, but instead an adaptively kernel-smoothed version of η. We make this relationship precise here. Suppose that k = β · n for some 0 < β < 1. We define the β-smoothed label function:

Definition 4 (β-smoothed label function). Let r_β(x) := inf{r > 0 : P(B(x, r)) ≥ β}, that is, the radius of the smallest ball centered at x with probability mass β w.r.t. P_X. Then let η_β(x) be the expectation of η on B(x, r_β(x)) w.r.t. P_X:
η_β(x) := (1/β) ∫_{B(x, r_β(x))} η(x') · p_X(x') dx'.

We can view η_β as an adaptively kernel-smoothed version of η, where the adaptivity arises from the density at the point (the denser the region, the smaller the bandwidth we smooth over) and the kernel is based on the density. We now prove the following result, which shows that in this setting η_k estimates η_β. It is worth noting that we need very few assumptions on η compared to the previous result, because the β-smoothing of η provides a more regular label function; moreover, the rates are fast, i.e. O(√(D/n)) up to logarithmic factors.

Theorem 2. Let 0 < δ < 1 and k = β · n. Then with probability at least 1 − δ, we have for n sufficiently large depending on β and δ:
sup_{x∈X} |η_k(x) − η_β(x)| ≤ 3 · √((2 log(4D/δ) + 2D log n) / (β · n)).
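A quick simulation (ours, not from the paper) illustrates Theorem 2: with p_X uniform on [0, 1] and η(x) = x, the smoothing window at interior points is symmetric, so the β-smoothed label function equals η there, and the k-NN label with k = βn tracks it closely.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 2000, 0.1
k = int(beta * n)
x = rng.uniform(size=n)                          # p_X uniform on [0, 1]
y = (rng.uniform(size=n) < x).astype(float)      # eta(x) = x

# eta_k at each sample point (1-D, brute force)
d = np.abs(x[:, None] - x[None, :])
eta_k = y[np.argsort(d, axis=1)[:, :k]].mean(axis=1)

# For uniform density and linear eta, eta_beta(x) = x at interior points,
# so the deviation |eta_k - x| there should be small, of order sqrt(1/(beta*n)).
interior = (x > 0.2) & (x < 0.8)
err = np.abs(eta_k[interior] - x[interior]).max()
```

Near the boundary the smoothing window becomes asymmetric and η_β differs from η, which is why the check above is restricted to interior points.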

5. EXPERIMENTS

We now describe the experimental methodology and results for validating our proposed method.

5.1. BASELINES

We next detail the suite of baselines we compare against. We tune baseline hyperparameters extensively, with the precise sweeps and setups available in the Appendix.
• Control: Baseline where we train for accuracy without regard to lowering churn.
• ℓ_p Regularization: We control the stability of a model's predictions by simply regularizing them (independently of the ground-truth label) using classical ℓ_p regularization. The loss function is
L_{ℓ_p}(x_i, y_i) = L(x_i, y_i) + a · ||f(x_i)||_p^p.
We experiment with both ℓ_1 and ℓ_2 regularization.
• Bi-tempered: This is a baseline by Amid et al. (2019), originally designed for robustness to label noise. It modifies the standard logistic loss by introducing two temperature-scaling parameters t_1 and t_2. We apply their "bi-tempered" loss here, suspecting that methods which make model training more robust to noisy labels may also be effective at reducing prediction churn.
• Anchor: This is based on a method proposed by Fard et al. (2016) specifically for churn reduction. It uses the predicted probabilities of a preliminary model to smooth the training labels of a second model. We first train a preliminary model f_prelim using regular cross-entropy loss. We then retrain the model using the smoothed labels (1 − a) · y_i + a · f_prelim(x_i), thus "anchoring" on the preliminary model's predictions. In our experiments, we train one preliminary model and fix it across the runs for this baseline to reduce prediction churn.
• Co-distillation: We use the co-distillation approach presented by Anil et al. (2018), who touched upon its utility for churn reduction. We train two identical models M_1 and M_2 (subject to different random initializations) in tandem while penalizing divergence between their predictions. The overall loss is
L_codistill(x_i, y_i) = L(f_1(x_i), y_i) + L(f_2(x_i), y_i) + a · Ψ(f_1(x_i), f_2(x_i)).
In their paper, the authors set Ψ to be cross-entropy, Ψ(p^(1), p^(2)) = −Σ_{i∈[K]} p^(1)_i log(p^(2)_i), but note that KL divergence can also be used. We experiment with both cross-entropy and KL divergence. We also tune w_codistill, the number of burn-in training steps before the regularizer is turned on.
• Label Smoothing: This is the method of Szegedy et al. (2016) defined earlier in the paper. Our proposed method augments global label smoothing by leveraging the local k-NN estimates, so comparing against global smoothing alone serves as a key ablation isolating the added benefit of the k-NN labels.
• Mixup: This method, proposed by Zhang et al. (2017), generates synthetic training examples on the fly by convexly combining random pairs of training inputs and their associated labels, where the combination weights are random draws from a Beta(a, a) distribution. Mixup improves generalization, increases robustness to adversarial examples as well as label noise, and also improves model calibration (Thulasidasan et al., 2019).
• Ensemble: Ensembling deep neural networks can improve the quality of their uncertainty estimation (Lakshminarayanan et al., 2017; Fort et al., 2019). We consider the simple case where m identical deep neural networks are trained independently on the same training data and, at inference time, their predictions are uniformly averaged together.
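For concreteness, the co-distillation and anchor objectives described above can be sketched as follows. This is an illustrative NumPy sketch; the function names and the per-batch framing are our assumptions, not the authors' code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Mean cross-entropy -sum(p_true * log p_pred) over a batch."""
    return -(p_true * np.log(p_pred + eps)).sum(-1).mean()

def codistill_loss(logits1, logits2, y, a):
    """L(f1, y) + L(f2, y) + a * Psi(f1, f2), with Psi the cross-entropy
    between the two models' predictions (KL divergence is an alternative)."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return cross_entropy(y, p1) + cross_entropy(y, p2) + a * cross_entropy(p1, p2)

def anchor_labels(y, prelim_probs, a):
    """Fard et al. (2016)-style smoothed targets: (1 - a) * y + a * f_prelim(x)."""
    return (1 - a) * y + a * prelim_probs
```

In practice the co-distillation penalty would only be enabled after the burn-in period, and gradients would flow through both models.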

5.2. DATASETS AND MODELS.

For all datasets, we do not use any data augmentation, in order to guarantee that the training data is held fixed across different trainings. For all datasets we use the Adam optimizer with the default learning rate of 0.001 and a minibatch size of 128 throughout.
• MNIST: We train a two-layer MLP with 256 hidden units per layer and ReLU activations for 20 epochs.
• Fashion MNIST: We use the same architecture as the one used for MNIST.
• SVHN: We train the LeNet5 CNN (LeCun et al., 1998) for 30 epochs on the Google Street View House Numbers (SVHN) dataset, where each image is cropped to 32 × 32 pixels.
• CelebA: CelebA (Liu et al., 2018) is a large-scale face-attributes dataset with more than 200k celebrity images, each with 40 attribute annotations. We use the standard train and test splits, which consist of 162770 and 19962 images respectively. Images were resized to 28 × 28 × 3. We select the "smiling" and "high cheekbone" attributes and perform binary classification, training LeNet5 for 20 epochs.

Table 1: Results across all datasets and baselines under optimal hyperparameter tuning (settings shown). Note that we report the standard deviation of the runs rather than the standard deviation of the mean (i.e. the standard error), which is often reported instead; the former is higher than the latter by a factor of the square root of the number of trials (10).

5.3. EVALUATION METRICS AND HYPERPARAMETER TUNING

For each dataset, baseline, and hyperparameter setting, we run each method on the same train and test split exactly 5 times. We then report the average test accuracy as well as the test-set churn averaged across every possible pair (i, j) of runs (10 total pairs). To give a more complete picture of the sources of churn, we also slice the churn by whether or not the test predictions of the first run in the pair were correct. Lowering the churn on correct predictions is clearly desirable: if the base model is correct, we do not want the predictions to change. Churn reduction on incorrect predictions is less relevant: if the base model was incorrect, higher churn may actually be preferable, although some examples may be inherently difficult to classify, or the label may be such an outlier that we do not expect even an optimal model to classify it correctly, in which case lower churn may still be desirable. This is why, in the results of Table 1, we bold the best-performing baseline for churn on correct examples, but not for churn on incorrect examples. In the results (Table 1), for each dataset and baseline, we chose the optimal hyperparameter setting by first sorting by accuracy and choosing the setting with the highest accuracy; if multiple settings came very close to the top accuracy (within 0.1% test accuracy), we chose the one with the lowest churn among them. There is often no principled way to trade off the two sometimes-competing objectives of accuracy and churn (e.g. Cotter et al. (2019) offer a heuristic to trade off the two objectives in a more balanced manner on the Pareto frontier). However, in our case, biasing towards higher accuracy is most realistic because in practice, when given a choice between two models, it is usually best to go with the more accurate one.
Fortunately, we will see that accuracy and churn are not necessarily competing objectives and our proposed method usually gives the best result for both simultaneously.
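The churn metrics described above are straightforward to compute from saved per-run predictions. The following sketch (ours; helper names are illustrative) mirrors the evaluation protocol:

```python
import numpy as np

def churn(preds_a, preds_b):
    """Prediction churn: fraction of test points where two runs disagree."""
    return float(np.mean(preds_a != preds_b))

def churn_by_correctness(preds_a, preds_b, labels):
    """Slice churn by whether the first run was correct on each point."""
    correct = preds_a == labels
    return {
        "churn": churn(preds_a, preds_b),
        "churn_on_correct": float(np.mean(preds_a[correct] != preds_b[correct])),
        "churn_on_incorrect": float(np.mean(preds_a[~correct] != preds_b[~correct])),
    }

def mean_pairwise_churn(all_preds):
    """Average churn over every pair of runs (e.g. 5 runs -> 10 pairs)."""
    pairs = [(i, j) for i in range(len(all_preds))
             for j in range(i + 1, len(all_preds))]
    return float(np.mean([churn(all_preds[i], all_preds[j]) for i, j in pairs]))
```

With 5 runs per setting, `mean_pairwise_churn` averages over the 10 unordered pairs, matching the protocol above.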

5.4. RESULTS

We see from Table 1 that mixup and our method, k-NN label smoothing, are consistently the most competitive; mixup outperforms on SVHN and Fashion MNIST, while k-NN label smoothing outperforms on all the remaining datasets. Notably, both methods do well on the accuracy and churn metrics simultaneously, suggesting that there is no inherent trade-off between predictive performance and churn reduction. Due to space constraints, ablations on SVHN for our method's hyperparameters (a, b, and k), along with results for the ensemble baseline, can be found in the Appendix. While we found ensembling to be remarkably effective, it does come with higher cost (more trainable parameters and higher inference cost), and so we discourage a direct comparison with the other methods.

6. CONCLUSION

Modern DNN training is a noisy process: randomization arising from stochastic minibatches, weight initialization, and data preprocessing can lead to models with drastically different predictions on the same datapoints under the same training procedure, and this phenomenon occurs even when all the models attain similarly high accuracies. Reducing such prediction churn is important in practice, as production ML models are constantly updated and improved upon. Since offline metrics usually serve only as proxies for the live metrics, comparing models in A/B tests and live experiments oftentimes requires manual labeling of the disagreements between the models, making it a costly procedure. Thus, controlling the amount of predictive churn can be crucial for more efficiently iterating on and improving models in a production setting. Despite the practical importance of this problem, there has been little work on this topic in the literature. We provide one of the first comprehensive analyses of reducing the predictive churn that arises from retraining on the same dataset and model architecture. We show that numerous methods designed for other goals, such as learning with noisy labels and improving model calibration, serve as reasonable baselines for lowering prediction churn. Moreover, we propose a new technique, k-NN label smoothing, a principled approach that leverages local smoothing from the deep k-NN labels to enhance the global smoothing of the vanilla label smoothing procedure. We further show that it often outperforms the baselines across a range of datasets and model architectures.

A PROOFS

For the proofs, we make use of the following result from Jiang (2019), which bounds the number of distinct k-NN sets over the sample, uniformly over all k:

Lemma 1 (Lemma 3 of Jiang (2019)). Let M be the number of distinct k-NN sets over X, that is, M := |{N_k(x) : x ∈ X}|. Then M ≤ D · n^D.

Proof of Theorem 1. By the triangle inequality and the smoothness condition in Assumption 1:
|η_k(x) − η(x)| ≤ |Σ_{i=1}^n (η(x_i) − η(x)) · 1[x_i ∈ N_k(x)]| / |N_k(x)| + |Σ_{i=1}^n (y_i − η(x_i)) · 1[x_i ∈ N_k(x)]| / |N_k(x)|
≤ C_α · r_k(x)^α + |Σ_{i=1}^n (y_i − η(x_i)) · 1[x_i ∈ N_k(x)]| / |N_k(x)|.
We bound the two terms separately. To bound r_k(x), let r := (2k / (ω · v_D · n · p_{X,0}))^{1/D}. We have
P(B(x, r)) ≥ ω · inf_{x' ∈ B(x,r) ∩ X} p_X(x') · v_D · r^D ≥ ω · p_{X,0} · v_D · r^D = 2k/n,
where P is the distribution function w.r.t. p_X. By Lemma 7 of Chaudhuri & Dasgupta (2010) and the condition on k, it follows that with probability at least 1 − δ/2, uniformly in x ∈ X, |B(x, r) ∩ X| ≥ k, where X is the sample of feature vectors. Hence r_k(x) < r for all x ∈ X uniformly with probability at least 1 − δ/2.
For the second term, define ξ_i := y_i − η(x_i), so that −1 ≤ ξ_i ≤ 1, and let A_x := Σ_{i=1}^n ξ_i · 1[x_i ∈ N_k(x)] / |N_k(x)|. By Hoeffding's inequality, P(|A_x| > t/k) ≤ 2 exp(−t²/(2k)). Setting t = √(2k) · √(log(4D/δ) + D log n) gives
P(|A_x| ≥ √((2 log(4D/δ) + 2D log n) / k)) ≤ δ / (2D · n^D).
By Lemma 1, the number of distinct random variables A_x across all x ∈ X is at most D · n^D. Thus, by a union bound,
P(sup_{x∈X} |A_x| ≥ √((2 log(4D/δ) + 2D log n) / k)) ≤ δ/2.
The result follows.
Proof of Theorem 2. Let X be the n sampled feature vectors and let x ∈ X. Define k'(x) := |X ∩ B(x, r_β(x))|. We have
|η_k(x) − η_β(x)| ≤ |η_{k'(x)}(x) − η_k(x)| + |η_{k'(x)}(x) − η_β(x)|.
We bound the two terms separately. For the first, we have
|k'(x) − k| = |Σ_{i=1}^n 1[x_i ∈ B(x, r_β(x))] − β · n|,
so by Hoeffding's inequality, P(|k'(x) − k| ≥ t · n) ≤ 2 exp(−2t²n). Choosing t = √((log(4D/δ) + D log n) / (2n)) gives
P(|k'(x) − k| ≥ √((n/2) · (log(4D/δ) + D log n))) ≤ δ / (2D · n^D).
By Lemma 1, the number of distinct sets X ∩ B(x, r_β(x)) across all x ∈ X is at most D · n^D, so by a union bound, with probability at least 1 − δ/2,
sup_{x∈X} |k'(x) − k| ≤ √((n/2) · (log(4D/δ) + D log n)).
We then have
|η_{k'(x)}(x) − η_k(x)| ≤ |1/k − 1/k'(x)| · min{k, k'(x)} + min{1/k, 1/k'(x)} · |k − k'(x)|
≤ (2/k) · |k − k'(x)| ≤ (1/β) · √((2 log(4D/δ) + 2D log n) / n),
where the first inequality follows by comparing the contribution of the neighbors shared between the k-NN and k'(x)-NN sets (first term on the RHS) with that of the neighbors that are not shared (second term on the RHS).
For the second term, define A_x := X ∩ B(x, r_β(x)). For any point sampled from B(x, r_β(x)), the expected label is η_β(x). Since η_{k'(x)}(x) is the mean label among the datapoints in A_x, Hoeffding's inequality gives P(|η_{k'(x)}(x) − η_β(x)| ≥ t / k'(x)) ≤ 2 exp(−t² / (2k'(x))). Setting t = √(2k'(x)) · √(log(4D/δ) + D log n) gives
P(|η_{k'(x)}(x) − η_β(x)| ≥ √((2 log(4D/δ) + 2D log n) / k'(x))) ≤ δ / (2D · n^D).
By Lemma 1, the number of distinct sets A_x across all x ∈ X is at most D · n^D; thus, by a union bound, with probability at least 1 − δ/2,
sup_{x∈X} |η_{k'(x)}(x) − η_β(x)| ≤ √((2 log(4D/δ) + 2D log n) / k'(x)).
Since k'(x) ≥ β · n − √((n/2) · (log(4D/δ) + D log n)), the result follows immediately for n sufficiently large depending on β and δ.

B ENSEMBLE RESULTS

In Table 2 we present the experimental results for the ensemble baseline. The method performs remarkably well, beating the proposed method and the other baselines on both accuracy and churn reduction across datasets. We note, however, that ensembling comes at a cost which may prove prohibitive in many practical applications: with m times the number of trainable parameters, training (if done sequentially) takes m times as long, as does inference, since each subnetwork must be evaluated before aggregation.

C ABLATION STUDY

In Table 3 , we report SVHN results ablating k-NN label smoothing's hyperparameters: k, a, and b. We observe the following trends: with a fixed to 1, both accuracy and churn improve with increasing b, and a similar relationship holds as a increases with b fixed to 0.9. Lastly, both key metrics are stable with respect to k.

D HYPERPARAMETER SEARCH

Our experiments involved performing a grid search over hyperparameters. We detail the search ranges per method below.
k-NN label smoothing.
• k ∈ [5, 10, 100, 500]
• a ∈ [0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 0.8, 0.9, 1.0]
• b ∈ [0, 0.05, 0.1, 0.5, 0.9]
Anchor.
• a ∈ [0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 0.8, 0.9, 1.0]







Jiang et al. (2018) use nearest neighbors on the intermediate representations to obtain better uncertainty scores than softmax probabilities, and Bahri et al. (2020) use the k-NN label disagreement to filter noisy labels for better training. Like these works, we also leverage k-NN on the intermediate representations, but we show that utilizing the k-NN labels leads to lower prediction churn.

Table 2: Ensemble results for all datasets. In all settings, the optimal m (number of subnetworks) is 5. We see that, compared to the other methods presented, ensembling does well both in predictive performance and in reducing churn. It does come at a cost, however: the model is effectively 5 times larger, making both training and inference more expensive.

Table 3: Ablation on k-NN label smoothing's hyperparameters a, b, and k for the SVHN dataset.


• a ∈ [0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
• a ∈ [0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
• n_warm ∈ [1000, 2000]
Bi-tempered.
• t_1, t_2 ∈ [1., 2., 3., 4.]
• n_iters always set to 5.
• m ∈ [3, 5]

