ESTIMATING EXAMPLE DIFFICULTY USING VARIANCE OF GRADIENTS

Anonymous

Abstract

In machine learning, a question of great interest is understanding what examples are challenging for a model to classify. Identifying atypical examples helps inform safe deployment of models, isolates examples that require further human inspection, and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VoG) as a valuable and efficient proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VoG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VoG scores are far more difficult for the model to learn and over-index on corrupted or memorized examples.

1. INTRODUCTION

Reasoning about model behavior is often easier when presented with a subset of data points that are relatively more difficult for a trained model to learn. This not only aids interpretability through case-based reasoning (Kim et al., 2016; Caruana, 2000; Hooker et al., 2019), but can also be used as a mechanism to surface a tractable subset of atypical examples for further human auditing (Leibig et al., 2017; Zhang, 1992; Hooker et al., 2019), for active learning to inform model improvements, or to abstain from classifying certain examples when the model is uncertain (Bartlett & Wegkamp, 2008; Cortes et al., 2016).

One of the biggest bottlenecks for human auditing is the sheer size of modern datasets and the cost of annotating each feature (Veale & Binns, 2017). Methods which automatically surface a subset of relatively more challenging examples for human inspection help prioritize limited human annotation and auditing time. Despite the urgency of this use case, ranking examples by difficulty has had limited treatment in the context of deep neural networks due to the computational cost of ranking a high-dimensional feature space. Recent work in this direction has either been limited to small-scale datasets or carries a computational cost that is infeasible for most practitioners (Hooker et al., 2019; Carlini et al., 2019; Koh & Liang, 2017).

In this work, we start with a simple hypothesis: examples that a model has difficulty learning will exhibit higher variance in gradient updates over the course of training. Conversely, we expect the backpropagated gradients of samples that are relatively easier to learn to have lower variance, because performance on those examples does not consistently dominate the loss over the course of training. The gradient updates for the relatively easier examples are expected to stabilize early in training and converge to a narrow range of values.
We term this class-normalized ranking mechanism Variance of Gradients (VoG), and demonstrate across a variety of large-scale datasets that it efficiently ranks the difficulty of both training and test examples. VoG can be computed using either the predicted or the true label, making it a valuable unsupervised auditing tool at test time when the true label is unknown.

Validating the behavior of VoG on artificial data. To begin, we illustrate the principle and effectiveness of VoG in a contrived toy setting. The data was generated using two separate isotropic Gaussian clusters with a total of 500 data points. In such a simple low-dimensional problem, the most challenging examples for the model to classify are those closest to the decision boundary. In Fig. 1a we visualize the trained decision boundary of a multilayer perceptron (MLP) with a single hidden layer trained for 15 epochs. VoG is computed at regular intervals for each training data point.

Implications of this work. It is becoming increasingly important for deep neural networks (DNNs) to make decisions that are interpretable to both researchers and end-users. In sensitive domains such as health care diagnostics (Xie et al., 2019; Gruetzemacher et al., 2018; Badgeley et al., 2019; Oakden-Rayner et al., 2019), self-driving cars (NHTSA, 2017) and hiring (Dastin, 2018; Harwell, 2019), providing tools for domain experts to audit models is of utmost importance. Our work offers an efficient method to rank the global difficulty of examples and automatically surface a tractable subset to aid human interpretability.
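To make the toy setup concrete, below is a minimal NumPy sketch of the idea (ours, not the paper's code): it trains a one-hidden-layer MLP on two Gaussian clusters, snapshots the per-example gradient of the loss with respect to the input at every epoch, and scores each point by the class-normalized mean per-feature standard deviation of those snapshots. The gradient quantity (loss gradient), hidden width 16, and learning rate 0.5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two isotropic Gaussian clusters, 500 points total, as in the toy setup
n = 250
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(+1.0, 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# One-hidden-layer MLP trained with full-batch gradient descent
W1 = rng.normal(0.0, 0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, size=(16, 1)); b2 = np.zeros(1)

input_grads = []                                   # one snapshot per epoch
lr = 0.5
for epoch in range(15):
    h = np.tanh(X @ W1 + b1)                       # (500, 16)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))       # (500, 1)
    err = p - y[:, None]                           # dLoss/dlogit for sigmoid + BCE
    dh = (err @ W2.T) * (1.0 - h ** 2)             # backprop through tanh
    input_grads.append(dh @ W1.T)                  # dLoss/dx, shape (500, 2)
    # Parameter updates
    W2 -= lr * h.T @ err / len(X); b2 -= lr * err.mean(0)
    W1 -= lr * X.T @ dh / len(X);  b1 -= lr * dh.mean(0)

G = np.stack(input_grads)                          # (checkpoints, examples, features)
# Per-feature standard deviation across checkpoints, averaged over features
vog = np.sqrt(((G - G.mean(axis=0)) ** 2).mean(axis=0)).mean(axis=1)

# Class-normalize so scores are comparable across classes
for c in (0, 1):
    m = y == c
    vog[m] = (vog[m] - vog[m].mean()) / (vog[m].std() + 1e-12)
```

Points near the decision boundary keep contributing to the loss late into training, so their input gradients continue to change across checkpoints and receive the highest scores.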
VoG can be computed from checkpoints stored over the course of training and is model-agnostic. Critically, VoG can be computed using the predicted label, which makes it an unsupervised auditing tool at test time.
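Because the score depends only on per-checkpoint gradient snapshots and one label per example, the same helper serves both the supervised (true-label) and unsupervised (predicted-label) settings. A sketch of such a helper (the function name `vog_scores` is ours, for illustration):

```python
import numpy as np

def vog_scores(grad_snapshots, labels):
    """Class-normalized Variance-of-Gradients scores.

    grad_snapshots: array of shape (K, N, P) holding per-example input
        gradients from K stored checkpoints, flattened to P dimensions.
    labels: array of shape (N,) with class ids -- true labels during
        training, or predicted labels for unsupervised auditing at test time.
    """
    G = np.asarray(grad_snapshots, dtype=float)
    # Per-dimension std over checkpoints, averaged over dimensions
    raw = np.sqrt(((G - G.mean(axis=0)) ** 2).mean(axis=0)).mean(axis=1)
    # Normalize within each (true or predicted) class
    out = np.empty_like(raw)
    for c in np.unique(labels):
        m = labels == c
        out[m] = (raw[m] - raw[m].mean()) / (raw[m].std() + 1e-12)
    return out
```

Swapping true labels for the model's argmax predictions is the only change needed to audit an unlabeled test set.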

2. METHODOLOGY

We consider a supervised classification problem where a DNN is trained to approximate the function F that maps an input variable X to an output variable Y, formally F : X → Y. Each input x has an associated discrete label y ∈ Y, where each label y corresponds to one of C categories or classes.
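Under this setup, the score sketched in the introduction can be summarized as follows. Let S_t denote the gradient snapshot for an input x with P dimensions at checkpoint t, with K checkpoints in total (the precise gradient quantity is defined later in this section; the notation below is our summary):

```latex
\mu = \frac{1}{K} \sum_{t=1}^{K} S_t, \qquad
\mathrm{VoG}_p(x) = \sqrt{\frac{1}{K} \sum_{t=1}^{K} \left( S_{t,p} - \mu_p \right)^2}, \qquad
\mathrm{VoG}(x) = \frac{1}{P} \sum_{p=1}^{P} \mathrm{VoG}_p(x).
```

The raw scores are then normalized within each class, subtracting the class mean and dividing by the class standard deviation, which yields the class-normalized ranking used throughout.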



Figure 1: We compute the variance of gradients (VoG) for each training data point in this two-dimensional toy problem. On the right, we show that VoG accords higher scores to the most challenging examples, those closest to the decision boundary (as measured by the perpendicular distance).

Contributions. We scale this toy experiment and demonstrate consistent results across two different architectures and three datasets: Cifar-10, Cifar-100 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015). Our contributions can be enumerated as follows:

1. We propose Variance of Gradients (VoG), a class-normalized variance-of-gradients score for determining the relative ease of learning data samples within a given class (Sec. 2).

2. We show that VoG is an effective auditing tool for ranking high-dimensional datasets by difficulty. VoG assigns higher scores to test-set examples that are more challenging for the model to classify, and restricting evaluation to the test-set examples with the lowest VoG scores greatly improves generalization performance (Sec. 3).

3. VoG identifies clusters of images with clearly distinct semantic properties. As seen in Fig. 4, low-scoring images feature far less cluttered backgrounds and more prototypical vantage points of the object. In contrast, high-scoring images over-index on cluttered backgrounds and atypical vantage points of the object of interest (zoomed in on part of the object, side profile of the object, shot from above).

4. VoG effectively surfaces OOD and memorized examples. We empirically show that VoG allocates higher scores to examples that require memorization (Sec. 5) and to out-of-distribution examples from curated benchmarks like ImageNet-O (Hendrycks et al., 2019).

Finally, we use VoG to explore how learning differs at different stages of training, and show that VoG rankings are sensitive to the stage of training, providing insight into the learning process in deep neural networks.

