DEFUSE: DEBUGGING CLASSIFIERS THROUGH DISTILLING UNRESTRICTED ADVERSARIAL EXAMPLES

Abstract

As machine learning models proliferate, the need to diagnose and correct bugs in models has become increasingly clear. As a route to better discover and fix model bugs, we propose failure scenarios: regions on the data manifold that are incorrectly classified by a model. We propose Defuse, an end-to-end debugging framework that uses these regions to fix faulty classifier predictions. The Defuse framework works in three steps. First, Defuse identifies many unrestricted adversarial examples (naturally occurring instances that are misclassified) using a generative model. Next, the procedure distills the misclassified data into failure scenarios via clustering. Last, the method corrects model behavior on the distilled scenarios through an optimization-based approach. We illustrate the utility of our framework on a variety of image data sets. We find that Defuse identifies and resolves concerning predictions while maintaining model generalization.

1. INTRODUCTION

Debugging machine learning (ML) models is a critical part of the ML development life cycle. Uncovering bugs helps ML developers make important decisions about both development and deployment. In practice, much of debugging relies on aggregate test statistics (like those in leaderboard-style challenges [Rajpurkar et al. (2016)]) and on continuous evaluation and monitoring post deployment [Liberty et al. (2020); Simon (2019)]. However, over-reliance on test statistics raises additional issues. For instance, aggregate statistics like held-out test accuracy are known to overestimate generalization performance [Recht et al. (2019)]. Further, such statistics offer neither insight into nor a remedy for specific model failures [Ribeiro et al. (2020); Wu et al. (2019)]. Last, reactively debugging failures as they occur in production does little to mitigate harmful user experiences [La Fors et al. (2019)].

Several techniques exist for identifying undesirable behavior in machine learning models, including explanations [Ribeiro et al. (2016); Slack et al. (2020b); Lakkaraju et al. (2019); Lundberg & Lee (2017)], fairness metrics [Feldman et al. (2015); Slack et al. (2020a)], data set replication [Recht et al. (2019); Engstrom et al. (2020)], and behavioral testing tools [Ribeiro et al. (2020)]. However, these techniques either provide no way to remedy model bugs or require a high level of human supervision. To enable model designers to discover and correct model bugs beyond aggregate test statistics, we analyze unrestricted adversarial examples: instances on the data manifold that are misclassified [Song et al. (2018)]. We identify model bugs by diagnosing common patterns in these unrestricted adversarial examples.

In this work, we propose Defuse: a technique for debugging classifiers through distilling[1] unrestricted adversarial examples. Defuse works in three steps. First, Defuse identifies unrestricted adversarial examples by making small, semantically meaningful changes to input data using a variational autoencoder (VAE). If the classifier's prediction on the altered instance deviates from the ground-truth label, Defuse returns the instance as a potential model failure. This step builds on techniques from [Zhao et al. (2018)]: small perturbations in the latent space of generative models can produce images that are misclassified. Second, Defuse distills the changes by clustering the unrestricted adversarial examples' latent codes. In this way, Defuse diagnoses regions in the latent space that are problematic for the classifier, producing a set of failure scenarios.

[1] We mean distilling in the sense of "to extract the most important aspects of" and do not intend to invoke the knowledge distillation literature [Hinton et al. (2014)].
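To make the first step concrete, the following is a minimal sketch of the latent-space search, assuming hypothetical encoder, decoder, and classifier callables: the encoder maps images to latent codes (e.g., the VAE posterior mean), the decoder maps codes back to images, and the classifier returns per-class logits. The perturbation scale sigma and the sample budget n_samples are illustrative choices, not the paper's actual procedure.

import torch

def find_unrestricted_adversarial_examples(encoder, decoder, classifier,
                                           x, y, n_samples=100, sigma=0.1):
    # encoder, decoder, and classifier are assumed interfaces for this sketch:
    # encoder(x) -> latent codes, decoder(z) -> images, classifier(x) -> logits.
    with torch.no_grad():
        z = encoder(x)
        adversarial_codes = []
        for _ in range(n_samples):
            # Small, semantically meaningful change: jitter the latent code.
            z_perturbed = z + sigma * torch.randn_like(z)
            x_decoded = decoder(z_perturbed)
            preds = classifier(x_decoded).argmax(dim=-1)
            # Keep instances whose prediction deviates from the ground truth.
            mask = preds != y
            if mask.any():
                adversarial_codes.append(z_perturbed[mask])
    if adversarial_codes:
        return torch.cat(adversarial_codes)
    return torch.empty(0, z.shape[-1])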

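The second step only requires some clustering over the collected latent codes; the text does not fix a particular algorithm. A Gaussian mixture model is one reasonable instantiation, sketched below with scikit-learn, where the number of components n_scenarios is an assumed hyperparameter.

from sklearn.mixture import GaussianMixture

def distill_failure_scenarios(latent_codes, n_scenarios=10, seed=0):
    # latent_codes: (n, d) array of codes collected in the identification step.
    # The clustering algorithm and n_scenarios are assumptions; any standard
    # clustering over the latent space fits the description above.
    gmm = GaussianMixture(n_components=n_scenarios, covariance_type="diag",
                          random_state=seed)
    labels = gmm.fit_predict(latent_codes)
    # Each cluster of misclassified latent codes is one candidate failure
    # scenario, i.e., a problematic region of the latent space.
    scenarios = [latent_codes[labels == k] for k in range(n_scenarios)]
    return scenarios, gmm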

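The abstract's final step corrects model behavior on the distilled scenarios through an optimization-based approach, but the exact objective is not given in this excerpt. The sketch below is one plausible instantiation, not the paper's method: it fine-tunes on failure-scenario data with corrected labels while mixing in the original training data to maintain generalization, with the mixing weight lam as an assumed hyperparameter.

import torch
import torch.nn.functional as F

def correct_classifier(classifier, failure_loader, train_loader,
                       epochs=3, lr=1e-4, lam=0.5):
    # failure_loader yields failure-scenario images with corrected labels;
    # train_loader yields the original training data. Jointly minimizing both
    # losses is an assumption standing in for the paper's optimization step.
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    classifier.train()
    for _ in range(epochs):
        for (x_f, y_f), (x_t, y_t) in zip(failure_loader, train_loader):
            # The failure-scenario term fixes the faulty predictions ...
            loss = F.cross_entropy(classifier(x_f), y_f)
            # ... while the original-data term guards generalization.
            loss = loss + lam * F.cross_entropy(classifier(x_t), y_t)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier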