UNDERSTANDING CLASSIFIER MISTAKES WITH GENERATIVE MODELS

Abstract

Although deep neural networks are effective on supervised learning tasks, they have been shown to be brittle: they are prone to overfitting on their training distribution and are easily fooled by small adversarial perturbations. In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize. We propose a generative model of the features extracted by a classifier and show, using rigorous hypothesis testing, that errors tend to occur when features are assigned low probability by our model. From this observation, we develop a detection criterion for samples on which a classifier is likely to fail at test time. In particular, we test against three different sources of classification failure: mistakes made on the test set due to poor model generalization, adversarial samples, and out-of-distribution samples. Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.

1. INTRODUCTION

Machine learning algorithms have shown remarkable success in challenging supervised learning tasks such as object classification (He et al., 2016) and speech recognition (Graves et al., 2013). Deep neural networks, in particular, have gained traction because of their ability to learn a hierarchical feature representation of their inputs. Neural networks, however, are also known to be brittle. Because they require a large number of parameters relative to the available data, deep neural networks have a tendency to latch onto spurious statistical dependencies to make their predictions. As a result, they are prone to overfitting and can be fooled by imperceptible adversarial perturbations of their inputs (Szegedy et al., 2013; Kurakin et al., 2016; Madry et al., 2017). Additionally, modern neural networks are poorly calibrated and do not capture model uncertainty well (Gal & Ghahramani, 2016; Kuleshov & Ermon, 2017; Guo et al., 2017). They produce confidence scores that do not represent true probabilities and consequently often output over-confident predictions, even on out-of-distribution inputs (Liang et al., 2017). These limitations are problematic as neural networks become ubiquitous in applications where safety and reliability are a priority (Levinson et al., 2011; Sun et al., 2015). Fully probabilistic, generative models could mitigate these issues by improving uncertainty quantification and incorporating prior knowledge (e.g., physical properties (Wu et al., 2015)) into the classification process. While great progress has been made towards designing generative models that can capture high-dimensional objects such as images (Oord et al., 2016a; Salimans et al., 2017), accurate probabilistic modeling of complex, high-dimensional data remains challenging. Our work aims at providing an understanding of these failure modes under the lens of probabilistic modeling.
Instead of directly modeling the inputs, we rely on the ability of neural networks to extract features from high-dimensional data and build a generative model of these low-dimensional features. Because deep neural networks are trained to extract the features from which they output classification predictions, we make the assumption that failure cases can be detected from the learned representations. Given a neural network trained for image classification, we capture the distribution of the learned feature space with a Gaussian Mixture Model (GMM) and use the predicted likelihoods to detect inputs on which the model cannot produce reliable classification results. We show that we are able not only to detect adversarial and out-of-distribution samples, but, surprisingly, also to identify inputs from the test set on which the model is likely to make a mistake. We experiment on state-of-the-art neural networks trained on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) and show, through statistical hypothesis testing, that samples leading to classification failures tend to correspond to features that lie in a low-probability region of the feature space.

Contributions Our contributions are as follows:

• We provide a probabilistic explanation for the brittleness of deep neural networks and show that classifiers tend to make mistakes on inputs with low-probability features.

• We demonstrate that simply modeling the feature space learned by a deep neural network with a GMM is enough to capture its probability distribution. Other state-of-the-art methods for probabilistic modeling, such as VAEs (Kingma & Welling, 2013) and autoregressive flow models (Papamakarios et al., 2017), fail in that regard.

• We show that generative models trained on the feature space can be used as a single tool to reliably detect different sources of classification failure: test-set errors due to poor generalization, adversarial samples, and out-of-distribution samples.

2. RELATED WORK

An extensive body of work has focused on understanding the behaviour of deep neural networks when they are faced with inputs on which they fail. We provide a brief overview below.

Uncertainty quantification Uncertainty quantification for neural networks is crucial in order to detect when a model's prediction cannot be trusted. Bayesian approaches (MacKay, 1992; Neal, 2012; Blundell et al., 2015), for example, seek to capture the uncertainty of a network by considering a prior distribution over the model's weights. Training these networks is challenging because the exact posterior is intractable and is usually approximated using a variety of posterior-inference methods. Closely related, Deep Ensembles (Lakshminarayanan et al., 2017) and Monte-Carlo Dropout (Gal & Ghahramani, 2016) consider the outputs of multiple models as an alternative way to approximate the posterior distribution. Model calibration (Platt, 1999; Guo et al., 2017) aims at producing confidence scores that are representative of the likelihood of correctness. Uncertainty estimates may also be obtained by training the network to produce them directly: Prior Networks (Malinin & Gales, 2018) model the implicit posterior distribution of the Bayesian approach, while DeVries & Taylor (2018) and Lee et al. (2017) have the network produce an additional confidence output. These methods require a proxy dataset representing the out-of-distribution samples to train their confidence scores. Our method differs from the above as it provides an uncertainty estimate for a model trained with the usual cross-entropy loss. It requires no additional modeling assumptions, nor modifications to the model's architecture or training procedure. As such, it relates closely to threshold-based methods.
For example, Hendrycks & Gimpel (2016) use the logit outputs as a measure of the network's confidence, which can be improved with Temperature Scaling (Guo et al., 2017; Liang et al., 2017), a post-processing method that calibrates the model. Our work derives a confidence score by learning the probability distribution of the feature space and generalizes to adversarial samples (Szegedy et al., 2013), another source of neural networks' brittleness.

Adversarial samples Defenses against adversarial examples include explicitly training networks to be more robust to adversarial attacks (Tramèr et al., 2017; Madry et al., 2017; Papernot et al., 2015). Another line of defense comes from the ability to detect adversarial samples at test time. Song et al. (2017), for example, use a generative model trained on the input images to detect and purify adversarial examples at test time, based on the observation that adversarial samples have lower predicted likelihood under the trained model. Closer to our work, Zheng & Hong (2018) and Lee et al. (2018) train a conditional generative model on the feature space learned by the classifier and derive a confidence score based on the Mahalanobis distance between a test sample and its predicted class representation. Our method makes the GMM class-agnostic, which makes it applicable to settings where labels are not available at inference time. We further show that the unsupervised GMM improves on the Mahalanobis score on the OOD detection task.

Detecting samples on which a trained classifier is likely to make a mistake is crucial given the range of applications in which these models are deployed. However, predicting in advance whether a sample will fail seems challenging, especially when the sample is drawn from the same distribution as the training set. To illustrate this, Fig. 1 shows samples from the CIFAR-100 training dataset alongside test samples and adversarial examples that our DenseNet model fails to classify properly. In both cases, it is not obvious to the human eye what fundamentally differs between correctly and incorrectly classified samples. Our main intuition is that a generative model trained on the feature space could capture these subtle differences.

3.1. BACKGROUND

We consider the problem of classification where we have access to a (possibly partially) labeled dataset $\mathcal{D} = \{(X_i, y_i)\}_{i=1}^N$ where $(X_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. Samples are assumed to be sampled independently from a distribution $p_{data}(X, y)$, and we denote the marginal over $X$ as $p_{data}(X)$. We denote by $f_\theta : \mathcal{X} \to \mathcal{F} = \mathbb{R}^D$ the feature-extractor part of our neural network, where $\theta$ represents the parameters of the network and $\mathcal{F}$ is the feature space of dimension $D$. Given an input $X$, the predicted probabilities over the label space $\mathcal{Y}$ are then typically obtained by multivariate logistic regression on the extracted features:

$$p(y \mid X, \theta, W, b) = \mathrm{softmax}(W f_\theta(X) + b)$$

where $(W, b)$ are the weights and bias of the last fully-connected layer of the neural network. The model prediction is the class with the highest predicted probability: $\hat{y}(X) = \arg\max_{y \in \mathcal{Y}} p(y \mid X, \theta, W, b)$. The parameters $(\theta, W, b)$ are trained to minimize a cross-entropy loss on the training set, and performance is evaluated on the test set.

Learning the data structure with generative models Understanding the data structure can greatly improve the ability of neural models to generalize. Recently, great progress has been made in designing powerful generative models that can capture high-dimensional, complex data such as images. PixelCNN (Salimans et al., 2017; Oord et al., 2016b;a), in particular, is a state-of-the-art deep generative model with tractable likelihood that represents the probability density of an image as a fully factorized product of conditionals over its individual pixels:

$$p_{CNN}(X) = \prod_{i=1}^{n} p_\phi(X_i \mid X_{1:i-1})$$

Flow models such as the Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) provide similar tractability by parameterizing distributions with invertible functions, which make the likelihood tractable through the change-of-variables formula. Another widely used class of generative models assumes the existence of unobserved latent variables. Gaussian Mixture Models, for example, assume discrete latents (corresponding to the mixture component). Variational autoencoders (Kingma & Welling, 2013) use continuous latent variables and parameterize the (conditional) distributions using neural networks.
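The classification pipeline above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the real $f_\theta$ is a DenseNet or Wide ResNet body, whereas here it is a stand-in random projection followed by tanh (the paper also bounds its features with a tanh non-linearity); all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D-dimensional features, |Y| classes, flattened
# 32x32x3 CIFAR-like inputs.
D, num_classes, input_dim = 16, 10, 32 * 32 * 3
P = rng.standard_normal((input_dim, D)) * 0.01  # stand-in for theta

def f_theta(X):
    """Stand-in feature extractor f_theta : X -> F = R^D."""
    return np.tanh(X @ P)

W = rng.standard_normal((num_classes, D)) * 0.1  # last-layer weights
b = np.zeros(num_classes)                        # last-layer bias

def predict_proba(X):
    """p(y | X, theta, W, b) = softmax(W f_theta(X) + b)."""
    logits = f_theta(X) @ W.T + b
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

X = rng.standard_normal((8, input_dim))  # a batch of flattened images
probs = predict_proba(X)
y_hat = probs.argmax(axis=1)             # model prediction  hat{y}(X)
```

The features `f_theta(X)`, rather than the raw inputs `X`, are what the generative model of Section 3.2 is fit on.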

3.2. MODELING THE FEATURE SPACE

We identify two main reasons why characterizing examples on which a classifier is likely to make a mistake is difficult. First, modeling the input data distribution $p_{data}(X)$, as done by Song et al. (2017) to detect adversarial examples, is challenging because of the high-dimensional, complex nature of the image space $\mathcal{X}$. This approach also fails at detecting out-of-distribution samples, with state-of-the-art models assigning higher likelihoods to samples that completely differ from their training set (Nalisnick et al., 2018). Second, a model of $p_{data}(X)$ does not capture any information about the classifier itself. To overcome these difficulties, we propose to model the underlying distribution of the learned features $F = f_\theta(X)$, where $X \sim p_{data}(X)$. Extracted features have lower dimension, which makes them easier to model, and they carry information about the classifier. Specifically, we are interested in comparing the features $F_c$ of samples that are correctly classified with the features $F_w$ of samples that are incorrectly classified by a trained neural network. $F_c$ and $F_w$ are elements of the following sets:

$$F_c \in \mathcal{C} = \{f_\theta(X) \mid \hat{y}(X) = y,\ (X, y) \in \mathcal{X} \times \mathcal{Y}\} \quad (3)$$
$$F_w \in \mathcal{W} = \{f_\theta(X) \mid \hat{y}(X) \neq y,\ (X, y) \in \mathcal{X} \times \mathcal{Y}\} \quad (4)$$

The distribution of the extracted features is modeled by

$$p(F) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(F; \mu_k, \Sigma_k) \quad (5)$$

where $K$ is the number of Gaussians in the mixture and $\pi_k, \mu_k, \Sigma_k$ are the model parameters. We choose $\Sigma_k$ to be diagonal in all our experiments. After training a neural network to convergence, we learn the parameters of the GMM using the EM algorithm; its training set consists of the features extracted from the training images by the trained classifier.
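The mixture of Eq. (5) with diagonal covariances, fit by EM, maps directly onto an off-the-shelf implementation. A minimal sketch with scikit-learn, using synthetic stand-ins for the extracted features (the dimensions and component count are hypothetical; the paper uses models such as GMM-100 and GMM-1000):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for features f_theta(X) extracted from the training images.
D, K = 16, 5
train_features = rng.standard_normal((2000, D))

# Eq. (5) with diagonal Sigma_k, parameters learned by EM.
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(train_features)

# Per-sample log p(F): the score later used to flag likely mistakes.
test_features = rng.standard_normal((10, D))
log_lik = gmm.score_samples(test_features)
```

`score_samples` returns the per-sample log-likelihood under the fitted mixture, which is exactly the quantity thresholded in Section 3.3.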

3.3. DETECTING CLASSIFICATION MISTAKES

We posit that classification mistakes are linked to extracted features that are unusual under the training distribution. By modeling the feature space learned by the classifier, our generative model can flag inputs that are likely to lead to a classification mistake. We found that a simple generative model is surprisingly good at capturing the distribution of the feature space and can detect, from the predicted feature log-likelihood, when an input will lead to a classification mistake.

Statistical Hypothesis Testing

We consider $p_C(F_c)$, the distribution of features $F_c = f_\theta(X)$ where $(X, y) \sim p_{data}(X, y)$ and $\hat{y}(X) = y$, and $p_W(F_w)$, the distribution of features $F_w = f_\theta(X)$ where $(X, y) \sim p_{data}(X, y)$ and $\hat{y}(X) \neq y$. These correspond to features extracted from correctly vs. incorrectly classified examples. Note that these distributions depend not only on the underlying data distribution but also on the classifier's parameters $(\theta, W, b)$. Assuming we have access to samples $F_{c,1}, \ldots, F_{c,n} \sim p_C$ and $F_{w,1}, \ldots, F_{w,m} \sim p_W$, our null hypothesis $H_0$ and alternative hypothesis $H_1$ are:

$$H_0 : p_C = p_W \qquad H_1 : p_C \neq p_W \quad (6)$$

We use the Mann-Whitney U-test, which assumes only that samples can be ranked. The test statistic is defined by ranking all samples of the two groups together and using the sum of their ranks:

$$U_C = R_C - \frac{n(n+1)}{2} \qquad U_W = R_W - \frac{m(m+1)}{2} \quad (7)$$

where $R_C$ and $R_W$ are the rank sums of the samples $F_c$ and $F_w$ respectively. The test statistic is $U = \min(U_C, U_W)$, whose distribution under the null hypothesis can be approximated by a normal distribution. In our approach, samples are ranked based on their predicted likelihood. Since our test statistic directly uses the predicted likelihood of a feature, we derive from it a simple per-sample test to determine whether an input is likely to be misclassified. Given a threshold $T$, a test sample $X$ is flagged as likely misclassified if $p(f_\theta(X)) < T$. The value of the threshold is chosen by cross-validation on the validation set to obtain a good trade-off between precision and recall.
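The test in Eq. (6)-(7) and the per-sample threshold rule can be sketched with SciPy. The log-likelihood values below are synthetic stand-ins (in the paper they come from the GMM evaluated on correctly vs. incorrectly classified samples), and the 5th-percentile threshold is illustrative, standing in for the cross-validated choice of $T$:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic feature log-likelihoods for the two groups.
loglik_correct = rng.normal(loc=-10.0, scale=2.0, size=500)  # ~ p_C
loglik_wrong = rng.normal(loc=-14.0, scale=2.0, size=120)    # ~ p_W

# Two-sided Mann-Whitney U-test of H0: p_C = p_W, ranking by likelihood.
stat, p_value = mannwhitneyu(loglik_correct, loglik_wrong,
                             alternative="two-sided")

# Per-sample rule: flag X as a likely mistake when p(f_theta(X)) < T.
T = np.quantile(loglik_correct, 0.05)  # stand-in for cross-validated T
flagged = loglik_wrong < T
```

A small `p_value` rejects $H_0$, i.e. the two groups of features are distributed differently, which is what licenses the simple likelihood threshold.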

4. EXPERIMENTS

We run experiments on the CIFAR-100 dataset, which contains 32 × 32 color images for image classification with 100 classes. All reported results give the mean and standard deviation over 5 independent runs. Additional experiments on a model trained on the smaller CIFAR-10 dataset are available in the appendix. We examine two state-of-the-art deep neural networks, DenseNet-100 (Huang et al., 2016) and Wide ResNet-28 (Zagoruyko & Komodakis, 2016), trained with the usual cross-entropy loss. In the setting where only a small number of labels is available, we train a WRN-28 model with 100 labeled samples per class using Temporal Ensembling (Laine & Aila, 2016). This self-ensembling training method takes advantage of the stochasticity provided by dropout and random augmentation techniques (e.g., random flipping and cropping).

Mistake detection Using statistical testing, we verify that the trained model learns a distribution that differentiates correct and incorrect samples. We summarize the performance of our method by reporting the AUC-ROC and AUC-PR obtained on the test set. To motivate the use of high-level features, we adapt the detection method of Song et al. (2017) to the mistake-detection problem and compare its performance with our proposed method. We train a PixelCNN on the image dataset and use the predicted likelihood values to detect classification mistakes. We evaluate mistake detection on the test set and first compare the distribution predicted by PixelCNN on the images with the distribution predicted by a GMM-100 model on extracted features (Figure 2). Using the Mann-Whitney U-test, we verify that the distribution learned by the GMM-100 differentiates correct and incorrect samples (p = 1.9e-13). In contrast, because PixelCNN is trained without knowledge of the classifier's internal representations, the distributions of correct and incorrect samples predicted under PixelCNN are far less clearly separated (p = 8.58e-5).
Additionally, we experimented with more flexible likelihood models for the feature space, such as the Variational Autoencoder (Kingma & Welling, 2013) and the Masked Autoregressive Flow (Papamakarios et al., 2017). Surprisingly, we found that a simple Gaussian Mixture Model is better at detecting classification mistakes than these more flexible models. Finally, we also compare with other threshold-based methods: using the predicted logits and the calibrated scores obtained after Temperature Scaling. Detection performance is summarized in Figure 3 for the DenseNet and Wide ResNet models trained on CIFAR-100. GMM models trained on the features outperform all other generative models trained either on images or on the feature space. We find that the GMM has performance similar to calibrated scores on the Wide ResNet but not on the DenseNet. This is explained by the fact that our DenseNet model has much lower accuracy than the Wide ResNet (72.76% vs. 80.22%) and therefore does not produce overly confident predictions. Additional results are available in the appendix. In the next experiments, we show that although the predicted logits provide reliable detection of test-set mistakes, this metric does not generalize to adversarial or out-of-distribution samples. In contrast, our approach of training a generative model on the feature space applies to these other sources of classification errors.

Adversarial samples We craft adversarial samples from test samples using the Fast Gradient Sign Method (FGSM) proposed by Goodfellow et al. (2014) and the Basic Iterative Method (BIM) (Kurakin et al., 2016). Both methods move the input image in the direction of the gradient of the loss while constraining the adversarial sample to lie in an $\ell_\infty$ ball of radius $\epsilon_{attack}$ around the original input. This ensures that the generated adversarial sample is visually indistinguishable from the original.
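FGSM can be sketched without an autodiff framework by using a plain linear softmax classifier, for which the input gradient of the cross-entropy loss has the closed form $(p - \mathrm{onehot}(y)) W$. This is a hedged stand-in: the paper attacks DenseNet and Wide ResNet models, and all sizes and the value of `eps` below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D, num_classes, eps = 32, 10, 0.1
W = rng.standard_normal((num_classes, D)) * 0.1  # toy classifier weights
b = np.zeros(num_classes)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.standard_normal((4, D))           # clean inputs
y = rng.integers(0, num_classes, size=4)  # true labels
p = softmax(X @ W.T + b)
grad = (p - np.eye(num_classes)[y]) @ W   # dL/dX for cross-entropy

# FGSM: one step along the gradient sign; X_adv stays inside an
# l_inf ball of radius eps around X, keeping the perturbation small.
X_adv = X + eps * np.sign(grad)
```

BIM simply repeats this signed step several times with a smaller step size, projecting back into the $\ell_\infty$ ball after each iteration.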
Figure 4 shows that the GMM is sensitive to features extracted from adversarial samples, which are assigned higher BPDs (i.e., lower likelihoods) than clean samples. We also plot the ROC curves and corresponding AUC metrics obtained by using the predicted BPD to detect adversarial samples. We compare our approach with other possible detection metrics. In particular, the method proposed by Zheng & Hong (2018) and the Mahalanobis score from Lee et al. (2018) also leverage the feature space to detect adversarial inputs. These approaches use a different model per class and therefore require labels to train, whereas we train a single GMM in an unsupervised manner. ROC curves are shown in Figure 5, and a full comparison table with higher attack values for both attacks is given in the appendix. Our method, using a GMM-1000, provides better detection of adversarial samples than calibrated and non-calibrated logit scores. It achieves detection results comparable to the Mahalanobis and Zheng metrics and, most notably, surpasses all other methods in the semi-supervised setting on attacks with low attack values (Figure 5c).

Figure 4: Top: Distribution of log-likelihoods predicted on clean (blue) and adversarial samples (green and orange) by a GMM-1000. The log-likelihood of features extracted from adversarial samples is lower. The histograms are separated, which means it is possible to detect adversarial samples using the log-likelihood of their features. Bottom: ROC curve for detecting adversarial samples using the predicted log-likelihood. Our method achieves a good trade-off between true-positive and false-positive rates, significantly improves over chance, and achieves between 76% and 100% AUC depending on attack methods and models.

Out-of-Distribution Detection

We also test the use of feature log-likelihood values on the task of detecting out-of-distribution samples. As out-of-distribution samples we use random Gaussian noise, SVHN (Netzer et al., 2011), Tiny ImageNet (Russakovsky et al., 2015), and Fashion-MNIST (Xiao et al., 2017). OOD detection results are reported in Table 1 for each model we trained. Our experiments show that calibrated probability scores cannot be relied upon for OOD detection, and that our method yields better detection results than the Mahalanobis score in some cases. We also highlight that a PixelCNN trained on CIFAR has very poor detection results on image datasets that look very different from its original training set (Fashion-MNIST and SVHN). This is a result of the generative model assigning higher likelihood to these OOD samples. Table 3 in the appendix also shows that calibrated scores are the only method failing to detect random Gaussian noise as OOD.

Table 1: OOD detection results. Calibrated scores do not detect OOD samples well, especially samples from SVHN. In contrast, our detection method using a GMM on the feature space reliably detects out-of-distribution samples and performs better than the class-dependent Mahalanobis score even on fully supervised models. PixelCNN assigns higher likelihoods to OOD samples that diverge clearly from the original training set and fails to detect them.
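The OOD evaluation above reduces to scoring each sample by its feature log-likelihood and computing AUC-ROC. A minimal sketch with scikit-learn, using synthetic stand-ins for the log-likelihoods (the real values come from the GMM fit on CIFAR features; the numbers below are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic feature log-likelihoods: in-distribution features score
# higher under the GMM than out-of-distribution ones.
loglik_in = rng.normal(-10.0, 2.0, size=1000)
loglik_ood = rng.normal(-18.0, 3.0, size=1000)

# Score by negated log-likelihood so that OOD is the positive class,
# then report AUC-ROC as in the detection tables.
scores = np.concatenate([-loglik_in, -loglik_ood])
labels = np.concatenate([np.zeros(1000), np.ones(1000)])
auc = roc_auc_score(labels, scores)
```

An AUC near 1.0 means the log-likelihood separates the two populations almost perfectly; an AUC near 0.5 means the score carries no OOD signal.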



Figure 1: Predicting whether an image will be correctly classified is challenging. Left: Images from the train set. Middle: Adversarial images computed on the same images with FGSM-0.1 are indistinguishable from the clean images, yet, they fool our classifier into making incorrect predictions. Right: Images from the test set. The images look similar to images from the training set, yet, they are incorrectly classified by our DenseNet model. Each row represents a different class.

Figure 2: Comparing the predicted log likelihood distribution of correct (blue) and incorrect (orange) samples from the test set for models trained on CIFAR-100. Top row: Log likelihoods obtained from training PixelCNN on images from the train set. Bottom row: Log likelihoods obtained from training GMM-100 on features extracted from the train set.

Figure 3: Comparison of ROC and PR curves for detecting classification mistakes using different generative models. Using a PixelCNN to model the image space fails at reliably detecting classification mistakes. GMMs trained on the feature space achieve better detection than other more flexible models like VAE and MAF and are comparable to temperature scaling on the WRN model.

(a) DenseNet (C100) (b) WRN (C100) (c) TE-WRN (C100)

Figure 5: Comparison of ROC curves for adversarial sample detection using different metrics. Logit-based scores (Logits and Calibrated) cannot reliably detect adversarial samples, while methods that model the probability distribution of the feature space can (GMM-1000, Mahalanobis, Zheng). Our method achieves detection results comparable to the Mahalanobis and Zheng metrics and surpasses both of them in a semi-supervised setting (Figure 5c).

Figure 6: Additional results for the model trained on CIFAR-10. Left: comparison of log-likelihood distributions. PixelCNN is not able to distinguish correct samples from incorrect ones, while a GMM trained on the feature space can. Middle: a GMM trained on the feature space can reliably detect adversarial samples.

Figure 7: Comparison of ROC curves for FGSM and BIM adversarial sample detection for CIFAR-10 model.

Figure 9: Additional comparison of ROC curves for FGSM adversarial sample detection for CIFAR-100 models

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhihao Zheng and Pengyu Hong. Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks. In Advances in Neural Information Processing Systems, pp. 7924-7933, 2018.

OOD detection results for the CIFAR-10 model. Calibrated scores do not detect OOD samples well, especially Gaussian-noise samples. PixelCNN assigns higher likelihoods to OOD samples that diverge clearly from the original training set and fails to detect them.

Out-of-distribution detection results for CIFAR-100 models on Gaussian noise. Calibrated scores are the only method failing to detect Gaussian-noise inputs.

B PURIFICATION B.1 METHOD

The purification process aims at moving the feature $F$ extracted by the classifier to a low-BPD region. This can be formulated as a joint optimization problem in which we seek features $F$ with minimal BPD while remaining close to the initial extracted features $F_{ref}$; $\nu$ is a hyperparameter that controls how close the new features should stay to the initial ones. As the objective is not convex and there is no closed-form solution for its stationary points, we use gradient descent with respect to $F$ to minimize it. Purification is performed with 100 gradient-descent steps. We test the performance of purification for both classification and semi-supervised classification on CIFAR-100. We report the accuracy on the validation and test sets obtained after purification with different GMMs and for different values of the learning rate and regularization strength $\nu$ in Table 4. For classification, our networks are DenseNet (DN-100) and Wide ResNet (WRN-28). For semi-supervised classification, we apply Temporal Ensembling to the Wide ResNet (TE-WRN-28). Our results show that this purification procedure can correct classification mistakes on previously unseen samples and yields an accuracy gain for the model without retraining. However, purification also introduces new classification mistakes, so the net improvement in accuracy reaches at most 0.6%, on the DenseNet model.
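A sketch of the purification loop, under an assumed objective: the paper states the problem only informally, so the form $\log p(F) - \nu \lVert F - F_{ref} \rVert^2$ maximized by gradient ascent below is our reconstruction, not the paper's exact equation. For a diagonal GMM the likelihood gradient has a closed form in terms of the responsibilities, so no autodiff is needed; all sizes are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Fit a diagonal GMM on stand-in training features.
D, K = 8, 3
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(rng.standard_normal((1000, D)))

def grad_log_lik(F):
    """Gradient of log p(F) for a diagonal GMM:
    sum_k gamma_k(F) (mu_k - F) / sigma_k^2, gamma_k = responsibilities."""
    resp = gmm.predict_proba(F)  # (n, K) responsibilities
    g = np.zeros_like(F)
    for k in range(K):
        g += resp[:, [k]] * (gmm.means_[k] - F) / gmm.covariances_[k]
    return g

# 100 gradient steps on the assumed objective
#   log p(F) - nu * ||F - F_ref||^2.
F_ref = rng.standard_normal((5, D)) * 3.0  # low-likelihood features
F = F_ref.copy()
nu, lr = 0.1, 0.05
for _ in range(100):
    F += lr * (grad_log_lik(F) - 2.0 * nu * (F - F_ref))
```

After the loop, `gmm.score_samples(F)` should on average exceed `gmm.score_samples(F_ref)`: the features have been pulled toward a higher-likelihood region while $\nu$ keeps them anchored near the originals.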

C EXPERIMENTAL SETUP

Dataset and preprocessing We trained on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) with 5,000 held-out validation images. Inputs were preprocessed with per-channel standardization before training.

DenseNet We use bottleneck layers, compression rate θ = 0.5, growth rate k = 12, and depth L = 100. The model is trained with batch size 64 for 300 epochs with learning rate 0.1, dropout rate 0.2, and L2 regularization weight 1e-4. We use ReLU non-linearities except for the last layer, where we use a tanh non-linearity to ensure the extracted features are bounded. For optimization, we use Stochastic Gradient Descent with Nesterov momentum 0.9. The learning rate is divided by 10 at epochs 150 and 175.

Wide ResNet The Wide ResNet (Zagoruyko & Komodakis, 2016) is trained with widening factor k = 10, depth L = 28, and batch size 100 for 200 epochs, with learning rate 0.1, dropout rate 0.3, and L2 regularization weight 5e-4. Data augmentation is applied during training with random translations of up to 2 pixels and random horizontal flips.

Temporal Ensembling For the semi-supervised setting, we keep only 100 samples per label in the train set. We train a Wide ResNet using Temporal Ensembling with a maximum unsupervised weight of 100.

PixelCNN The PixelCNN model is trained with the PixelCNN++ improvements from Salimans et al. (2017). The model is trained for 5000 epochs with dropout rate 0.5 and learning rate 1e-4.

VAE The VAE is trained for 1000 epochs with a learning rate of 0.001 and a decay rate of 0.9995. The encoder and decoder architectures are fully connected with ReLU non-linearities, one hidden layer of size 512, and latent dimension 128. The model was trained with Adam.

MAF The Masked Autoregressive Flow model is trained for 1000 epochs with a learning rate of 0.01 and batch size 32 using the Adam optimizer. We used a 5-layer MADE model with hidden-layer size 128.

Temperature Scaling The temperature for Temperature Scaling is optimized using the L-BFGS-B algorithm with a maximum of 100 iterations. We use ECE with B = 10 bins to evaluate the success of the calibration.
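The ECE metric with B equal-width bins used above can be sketched directly; the synthetic "well-calibrated predictor" check at the end is illustrative, not from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with B equal-width confidence bins: the bin-size-weighted
    average of |accuracy - mean confidence| over the bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in the bin
            conf = confidences[mask].mean()  # mean confidence in the bin
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# Toy check: a well-calibrated predictor (correctness probability equal
# to the reported confidence) should have ECE near zero.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=20000)
correct = (rng.uniform(size=20000) < conf).astype(float)
ece = expected_calibration_error(conf, correct, n_bins=10)
```

Temperature Scaling picks the temperature that minimizes the negative log-likelihood on the validation set; ECE as computed here is then used to judge whether the resulting scores are calibrated.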

