UNDERSTANDING CLASSIFIER MISTAKES WITH GENERATIVE MODELS

Abstract

Although deep neural networks are effective on supervised learning tasks, they have been shown to be brittle. They are prone to overfitting on their training distribution and are easily fooled by small adversarial perturbations. In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize. We propose a generative model of the features extracted by a classifier, and show using rigorous hypothesis testing that errors tend to occur when features are assigned low probability by our model. From this observation, we develop a detection criterion for samples on which a classifier is likely to fail at test time. In particular, we test against three different sources of classification failures: mistakes made on the test set due to poor model generalization, adversarial samples, and out-of-distribution samples. Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.

1. INTRODUCTION

Machine learning algorithms have shown remarkable success on challenging supervised learning tasks such as object classification (He et al., 2016) and speech recognition (Graves et al., 2013). Deep neural networks in particular have gained traction because of their ability to learn a hierarchical feature representation of their inputs. Neural networks, however, are also known to be brittle. Because they have a large number of parameters relative to the available data, deep neural networks tend to latch onto spurious statistical dependencies to make their predictions. As a result, they are prone to overfitting and can be fooled by imperceptible adversarial perturbations of their inputs (Szegedy et al., 2013; Kurakin et al., 2016; Madry et al., 2017). Additionally, modern neural networks are poorly calibrated and do not capture model uncertainty well (Gal & Ghahramani, 2016; Kuleshov & Ermon, 2017; Guo et al., 2017). They produce confidence scores that do not represent true probabilities and, consequently, often output over-confident predictions even on out-of-distribution inputs (Liang et al., 2017). These limitations are problematic as neural networks become ubiquitous in applications where safety and reliability are a priority (Levinson et al., 2011; Sun et al., 2015). Fully probabilistic, generative models could mitigate these issues by improving uncertainty quantification and incorporating prior knowledge (e.g., physical properties (Wu et al., 2015)) into the classification process. While great progress has been made towards designing generative models that can capture high-dimensional objects such as images (Oord et al., 2016a; Salimans et al., 2017), accurate probabilistic modeling of complex, high-dimensional data remains challenging. Our work aims to provide an understanding of these failure modes through the lens of probabilistic modelling.
Instead of directly modeling the inputs, we rely on the ability of neural networks to extract features from high-dimensional data and build a generative model of these low-dimensional features. Because deep neural networks are trained to extract features from which they output classification predictions, we assume that failure cases can be detected from the learned representations. Given a neural network trained for image classification, we capture the distribution of the learned feature space with a Gaussian Mixture Model (GMM) and use the predicted likelihoods to detect inputs on which the model cannot produce reliable classification results. We show that we are able not only to detect adversarial and out-of-distribution samples, but, surprisingly, also to identify inputs from the test set on which the model is likely to make a mistake. We experiment on state-of-the-art neural networks trained on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) and show, through statistical hypothesis testing, that samples leading to classification failures tend to correspond to features that lie in a low-probability region of the feature space.

Contributions Our contributions are as follows:
• We provide a probabilistic explanation for the brittleness of deep neural networks and show that classifiers tend to make mistakes on inputs with low-probability features.
• We demonstrate that fitting a simple GMM to the feature space learned by a deep neural network suffices to model its probability distribution, whereas other state-of-the-art methods for probabilistic modelling, such as VAEs (Kingma & Welling, 2013) and auto-regressive flow models (Papamakarios et al., 2017), fail in this regard.
• We show that generative models trained on the feature space can be used as a single tool to reliably detect different sources of classification failures: test set errors due to poor generalization, adversarial samples, and out-of-distribution samples.
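To make the approach concrete, the sketch below (our own illustration, not the paper's implementation) fits a GMM to stand-in penultimate-layer features and flags test inputs whose features receive low likelihood; in practice the feature matrices would come from a trained classifier, and the component count and threshold percentile are hypothetical choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for penultimate-layer features of training/test images.
# In practice these would be extracted from the trained classifier.
train_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 16))
test_feats = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 16)),   # in-distribution-like features
    rng.normal(8.0, 1.0, size=(5, 16)),   # atypical (low-probability) features
])

# Class-agnostic density model of the feature space.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(train_feats)

log_probs = gmm.score_samples(test_feats)  # per-sample log-likelihoods

# Flag samples whose features fall below, e.g., the 5th percentile of
# training log-likelihoods as candidates for classification failure.
threshold = np.percentile(gmm.score_samples(train_feats), 5)
flagged = log_probs < threshold
```

Because the score depends only on the feature density and never on class labels, the same thresholding applies unchanged to semi-supervised models.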

2. RELATED WORK

An extensive body of work has focused on understanding the behaviour of deep neural networks when they are faced with inputs on which they fail. We provide a brief overview below.

Uncertainty quantification Uncertainty quantification for neural networks is crucial in order to detect when a model's prediction cannot be trusted. Bayesian approaches (MacKay, 1992; Neal, 2012; Blundell et al., 2015), for example, seek to capture the uncertainty of a network by considering a prior distribution over the model's weights. Training these networks is challenging because the exact posterior is intractable and is usually approximated using a variety of methods for posterior inference. Closely related, Deep Ensembles (Lakshminarayanan et al., 2017) and Monte-Carlo Dropout (Gal & Ghahramani, 2016) consider the outputs of multiple models as an alternative way to approximate the posterior distribution. Model calibration (Platt, 1999; Guo et al., 2017) aims at producing confidence scores that are representative of the likelihood of correctness. Uncertainty estimates may also be obtained by training the network to provide them: Prior Networks (Malinin & Gales, 2018) model the implicit posterior distribution of the Bayesian approach, while DeVries & Taylor (2018) and Lee et al. (2017) have the network produce an additional confidence output. These methods require a proxy dataset representing out-of-distribution samples to train their confidence scores. Our method differs from the above as it provides an uncertainty estimate for a model trained with the usual cross-entropy loss. It requires neither additional modelling assumptions nor modifications to the model's architecture or training procedure. As such, it relates closely to threshold-based methods.
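As a toy illustration of the ensemble-averaging idea mentioned above, the following numpy sketch (random logits stand in for the outputs of trained ensemble members or stochastic dropout passes; the setup is ours, not from any cited work) averages member predictive distributions and uses predictive entropy as an uncertainty score:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Logits from 5 ensemble members (or 5 stochastic dropout passes)
# for a batch of 3 inputs over 10 classes; random stand-ins here.
member_logits = rng.normal(size=(5, 3, 10))

# Average the member predictive distributions.
mean_probs = softmax(member_logits).mean(axis=0)

# Predictive entropy as an uncertainty score: higher = less certain.
entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=-1)
```

Members that disagree spread the averaged distribution over more classes, which raises the entropy of the corresponding input.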
For example, Hendrycks & Gimpel (2016) use the logit outputs as a measure of the network's confidence, which can be improved with Temperature Scaling (Guo et al., 2017; Liang et al., 2017), a post-processing method that calibrates the model. Our work derives a confidence score by learning the probability distribution of the feature space and generalizes to adversarial samples (Szegedy et al., 2013), another source of neural networks' brittleness.

Adversarial samples Methods to defend against adversarial examples include explicitly training networks to be more robust to adversarial attacks (Tramèr et al., 2017; Madry et al., 2017; Papernot et al., 2015). Another line of defense comes from the ability to detect adversarial samples at test time. Song et al. (2017), for example, use a generative model trained on the input images to detect and purify adversarial examples at test time, relying on the observation that adversarial samples have lower predicted likelihood under the trained model. Closer to our work, Zheng & Hong (2018) and Lee et al. (2018) train a conditional generative model on the feature space learned by the classifier and derive a confidence score based on the Mahalanobis distance between a test sample and its predicted class representation. Our method makes the GMM class-agnostic, making it applicable to settings where labels are not available at inference time. We further show that the unsupervised GMM improves on the Mahalanobis score on the OOD detection task.
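To make the contrast between a class-conditional Mahalanobis score and a class-agnostic density score concrete, here is a minimal numpy sketch on synthetic 2-D features; both scores are simplified stand-ins of ours, not the exact formulations of the cited works:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D features for two classes.
feats = {0: rng.normal([-2, 0], 0.5, size=(200, 2)),
         1: rng.normal([2, 0], 0.5, size=(200, 2))}

# Class-conditional score (Lee et al.-style, simplified): squared
# Mahalanobis distance to the nearest class mean, shared covariance.
means = {c: x.mean(axis=0) for c, x in feats.items()}
pooled = np.vstack([x - means[c] for c, x in feats.items()])
cov_inv = np.linalg.inv(pooled.T @ pooled / len(pooled))

def mahalanobis_score(z):
    # Lower = more confident; requires the per-class means (labels).
    return min(float((z - m) @ cov_inv @ (z - m)) for m in means.values())

def gmm_neg_log_density(z, var=0.25):
    # Class-agnostic alternative: negative log-density under a single
    # equal-weight, fixed-variance Gaussian mixture (a hand-rolled
    # stand-in for a fitted GMM; no labels needed at inference time).
    comps = [np.exp(-np.sum((z - m) ** 2) / (2 * var)) / (2 * np.pi * var)
             for m in means.values()]
    return -np.log(sum(comps) / len(comps) + 1e-300)

in_dist = np.array([-2.0, 0.0])
outlier = np.array([10.0, 10.0])
```

Both scores rank the outlier as less trustworthy, but only the mixture density can be evaluated without knowing (or predicting) a class for the test sample.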

