DEFUSE: DEBUGGING CLASSIFIERS THROUGH DISTILLING UNRESTRICTED ADVERSARIAL EXAMPLES

Abstract

With the growing proliferation of machine learning models, the need to diagnose and correct bugs in models has become increasingly clear. As a route to better discovering and fixing model bugs, we propose failure scenarios: regions on the data manifold that a model classifies incorrectly. We propose Defuse, an end-to-end debugging framework that uses these regions to fix faulty classifier predictions. Defuse works in three steps. First, it identifies many unrestricted adversarial examples (naturally occurring instances that are misclassified) using a generative model. Next, it distills the misclassified data into failure scenarios using clustering. Last, it corrects model behavior on the distilled scenarios through an optimization-based approach. We illustrate the utility of our framework on a variety of image data sets and find that Defuse identifies and resolves concerning predictions while maintaining model generalization.

1. INTRODUCTION

Debugging machine learning (ML) models is a critical part of the ML development life cycle. Uncovering bugs helps ML developers make important decisions about both development and deployment. In practice, much debugging relies on aggregate test statistics (like those in leaderboard-style challenges [Rajpurkar et al. (2016)]) and on continuous evaluation and monitoring after deployment [Liberty et al. (2020); Simon (2019)]. However, over-reliance on test statistics raises additional issues. For instance, aggregate statistics like held-out test accuracy are known to overestimate generalization performance [Recht et al. (2019)]. Further, such statistics offer little insight into, nor remedy for, specific model failures [Ribeiro et al. (2020); Wu et al. (2019)]. Last, reactively debugging failures as they occur in production does little to mitigate harmful user experiences [La Fors et al. (2019)]. Several techniques exist for identifying undesirable behavior in machine learning models, including explanations [Ribeiro et al. (2016); Slack et al. (2020b); Lakkaraju et al. (2019); Lundberg & Lee (2017)], fairness metrics [Feldman et al. (2015); Slack et al. (2020a)], data set replication [Recht et al. (2019); Engstrom et al. (2020)], and behavioral testing tools [Ribeiro et al. (2020)]. However, these techniques either provide no method to remedy model bugs or require a high level of human supervision. To enable model designers to discover and correct model bugs beyond aggregate test statistics, we analyze unrestricted adversarial examples: instances on the data manifold that are misclassified [Song et al. (2018)]. We identify model bugs by diagnosing common patterns in unrestricted adversarial examples.

In this work, we propose Defuse: a technique for debugging classifiers through distilling¹ unrestricted adversarial examples. Defuse works in three steps. First, Defuse identifies unrestricted adversarial examples by making small, semantically meaningful changes to input data using a variational autoencoder (VAE). If the classifier's prediction deviates from the ground truth label on an altered instance, Defuse returns the instance as a potential model failure. This step employs techniques similar to those of [Zhao et al. (2018)]: small perturbations in the latent space of a generative model can produce images that are misclassified. Second, Defuse distills the failures by clustering the unrestricted adversarial examples' latent codes. In this way, Defuse diagnoses regions in the latent space that are problematic for the classifier, producing a set of failure scenarios. Last, Defuse corrects model behavior on the distilled failure scenarios through an optimization-based approach.

To illustrate the usefulness of failure scenarios, we run Defuse on a classifier trained on MNIST and provide an overview in figure 1. In the identification step (first pane in figure 1), Defuse generates unrestricted adversarial examples for the model. The red number in the upper right hand corner of each image is the classifier's prediction. Although the classifier achieves high test set performance, we find naturally occurring examples that are classified incorrectly. Next, the method performs the distillation step (second pane in figure 1). The clustering model groups together similar failures for annotator labeling. For instance, Defuse groups together a similar style of incorrectly classified eights in the first row of the second pane in figure 1. Next, Defuse receives annotator labels for each of the clusters.² Last, we run the correction step using both the annotator-labeled data and the original training data. The model then correctly classifies the images (third pane in figure 1) while maintaining its predictive performance, scoring 99.1% accuracy after tuning. Thus, Defuse enables model designers to both discover and correct naturally occurring model failures. We provide the necessary background for Defuse ( §2). Next, we detail the three steps in Defuse: identification, distillation, and correction ( §3). We then demonstrate the usefulness of Defuse on three image data sets: MNIST [LeCun et al. (2010)], the German traffic signs data set [Stallkamp et al. (2011)], and the Street View house numbers data set [Netzer et al. (2011)], and find that Defuse discovers and resolves critical bugs in high performance classifiers trained on these data sets ( §4).

¹ We mean distilling in the sense of "to extract the most important aspects of" and do not intend to invoke the knowledge distillation literature [Hinton et al. (2014)].
² We assign label 8 to the first row in the second pane of figure 1, label 0 to the second row, and label 6 to the third row.

2. NOTATION AND BACKGROUND

In this section, we establish notation and background on unrestricted adversarial examples. Though unrestricted adversarial examples can be found in many domains, we focus on Defuse applied to image classification.
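As an illustration only, the identification step described above (perturb an instance's latent code, decode, and keep instances the classifier gets wrong) can be sketched in a few lines. The encoder, decoder, and classifier here are hypothetical random linear stand-ins, not the paper's trained deep models; they merely keep the sketch self-contained and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a trained VAE and classifier: random linear
# maps in place of the deep networks the paper actually uses.
W_enc = rng.normal(size=(4, 16))   # encoder: image (16,) -> latent (4,)
W_dec = rng.normal(size=(16, 4))   # decoder: latent (4,) -> image (16,)
W_clf = rng.normal(size=(10, 16))  # classifier logits over 10 classes

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

def predict(x):
    return int(np.argmax(W_clf @ x))

def identify_failures(x, label, n_samples=200, radius=0.5):
    """Identification step (sketch): sample small perturbations of x's
    latent code and keep decoded instances the classifier misclassifies."""
    z = encode(x)
    failures = []
    for _ in range(n_samples):
        z_pert = z + rng.normal(scale=radius, size=z.shape)
        x_pert = decode(z_pert)
        if predict(x_pert) != label:
            failures.append((z_pert, x_pert))
    return failures

x = rng.normal(size=16)
label = predict(decode(encode(x)))
candidates = identify_failures(x, label)
```

In practice the perturbation scale plays the role of the "small, semantically meaningful change" budget: too small and no failures surface, too large and the decoded instances leave the data manifold.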

Figure 1: Running Defuse on an MNIST classifier. The (handpicked) images are examples from three failure scenarios identified by running Defuse. The red digit in the upper right hand corner of each image is the classifier's prediction. Defuse initially identifies many model failures. Next, it aggregates these failures in the distillation step for annotator labeling. Last, Defuse tunes the classifier so that it correctly classifies the images, with minimal change in overall classifier performance. Defuse serves as an end-to-end framework for diagnosing and debugging errors in classifiers.
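The distillation step groups the misclassified instances by clustering their latent codes, so that each cluster approximates one failure scenario an annotator can label in bulk. A minimal sketch, using plain k-means as a stand-in for whatever clustering model is used, and synthetic latent codes in place of those produced by the identification step:

```python
import numpy as np

rng = np.random.default_rng(1)

def distill(latents, k, iters=50):
    """Distillation step (sketch): cluster latent codes of misclassified
    instances; each cluster approximates one failure scenario."""
    # Initialize centers from k distinct data points.
    centers = latents[rng.choice(len(latents), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each latent code to its nearest center.
        dists = np.linalg.norm(latents[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its assigned codes.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = latents[assign == j].mean(axis=0)
    return centers, assign

# Toy latent codes from three well-separated blobs, standing in for the
# codes of unrestricted adversarial examples found during identification.
latents = np.concatenate([rng.normal(loc=m, scale=0.1, size=(30, 4))
                          for m in (-3.0, 0.0, 3.0)])
centers, assign = distill(latents, k=3)
```

Annotators then assign one corrected label per cluster (as in the second pane of figure 1), and the relabeled instances, together with the original training data, feed the correction step.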

