AN ALGORITHM FOR OUT-OF-DISTRIBUTION ATTACK TO NEURAL NETWORK ENCODER

Anonymous

Abstract

Deep neural networks (DNNs), especially convolutional neural networks, have achieved superior performance on image classification tasks. However, such performance is only guaranteed if the input to a trained model is similar to the training samples, i.e., the input follows the probability distribution of the training set. Out-Of-Distribution (OOD) samples do not follow the distribution of the training set, and therefore the predicted class labels on OOD samples are meaningless. Classification-based methods have been proposed for OOD detection; however, in this study we show that this type of method has no theoretical guarantee and is practically breakable by our OOD Attack algorithm because of dimensionality reduction in the DNN models. We also show that Glow likelihood-based OOD detection is breakable as well.

1. INTRODUCTION

Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have become the method of choice for image classification. Under the i.i.d. (independent and identically distributed) assumption, a high-performance DNN model can correctly classify an input sample as long as the sample is "generated" from the distribution of the training data. If an input sample is not from this distribution, it is called Out-Of-Distribution (OOD), and the predicted class label from the model is meaningless. It would therefore be desirable for a model to be able to distinguish OOD samples from in-distribution samples. OOD detection is needed especially when applying DNN models in life-critical applications, e.g., vision-based self-driving or image-based medical diagnosis. It was shown by Nguyen et al. (2015) that DNN classifiers can be easily fooled by OOD data, and an evolutionary algorithm was used to generate OOD samples on which DNN classifiers had high output confidence. Since then, many methods have been proposed for OOD detection using classifiers or encoders (Hendrycks & Gimpel, 2017; Hendrycks et al., 2019; Liang et al., 2018; Lee et al., 2018b;a; Alemi et al., 2018). For instance, Hendrycks & Gimpel (2017) showed that a classifier's prediction probabilities on OOD examples tend to be more uniform, and therefore the maximum predicted class probability from the softmax layer can be used for OOD detection. Regardless of the details of these methods, every method needs a classifier or an encoder, which takes an image x as input and compresses it into a vector z in the latent space; after some further transform, z is converted to an OOD detection score τ. This computing process can be expressed as: z = f(x) and τ = d(z). To perform OOD detection, a detection threshold needs to be specified, and then x is reported as OOD if τ is smaller/larger than the threshold.
For the evaluation of OOD detection methods (Hendrycks & Gimpel, 2017), an OOD detector is usually trained on one dataset (e.g., Fashion-MNIST as in-distribution) and then tested on another dataset (e.g., MNIST as OOD).
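The pipeline z = f(x), τ = d(z) can be illustrated with the maximum-softmax-probability baseline of Hendrycks & Gimpel (2017). The sketch below operates directly on classifier logits; the threshold value is an arbitrary choice for illustration, not one used in this study:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def msp_score(logits):
    """Maximum softmax probability: higher means more in-distribution-like."""
    return float(np.max(softmax(logits)))

def is_ood(logits, threshold=0.5):
    """Flag a sample as OOD when tau = max softmax probability falls below the threshold."""
    return msp_score(logits) < threshold

# A confident prediction vs. a near-uniform one (OOD inputs tend toward the latter).
confident_logits = np.array([8.0, 0.5, 0.2])
uniform_logits = np.array([1.1, 1.0, 0.9])
print(is_ood(confident_logits))  # False: high max probability
print(is_ood(uniform_logits))    # True: probabilities are close to uniform
```

This is exactly the kind of detector that the OOD Attack algorithm breaks: if z_out is pushed to match z_in, the softmax output and hence the score τ become indistinguishable from those of an in-distribution sample.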


As will be shown in this study, the above-mentioned classification-based OOD detection methods are practically breakable. As an example (more details in Section 3), we used the Resnet-18 model (He et al., 2016) pre-trained on the ImageNet dataset. Let x_in denote a 224×224×3 image (in-distribution sample) in ImageNet and x_out denote an OOD sample, which could be any kind of image (even random noise) not belonging to any category in ImageNet. Let z denote the 512-dimensional feature vector in Resnet-18, which is the input to the last fully-connected linear layer before the softmax operation. Thus, we have z_in = f(x_in) and z_out = f(x_out). In Fig. 1, x_in is the image of Santa Claus, and x_out could be a chest x-ray image or a random-noise image, and "surprisingly", z_out ≈ z_in, which renders the OOD detection score useless: d(z_out) ≈ d(z_in). In Section 2, we will introduce an algorithm to generate OOD samples such that z_out ≈ z_in. In Section 3, we will show the evaluation results on publicly available datasets, including an ImageNet subset, GTSRB, OCT, and COVID-19 CT.

Figure 1: The 1st column shows the image of Santa Claus x_in and the scatter plot of z_in (blue dots). The 2nd column shows a chest x-ray image x_out and the scatter plot of z_out (red circles) and z_in (blue). The 3rd column shows a random image x_out and the scatter plot of z_out (red) and z_in (blue).

Since some generative models (e.g., Glow (Kingma & Dhariwal, 2018)) can approximate the distribution of training samples (i.e., p(x_in)), likelihood-based generative models have been utilized for OOD detection (Nalisnick et al., 2019). It has been shown that likelihoods derived from generative models may not distinguish between OOD and training samples (Nalisnick et al., 2019; Ren et al., 2019; Choi et al., 2018), and a fix to the problem could be using a likelihood ratio instead of the raw likelihood score (Serrà et al., 2019).
Although not the main focus of this study, we will show that an OOD sample's likelihood score from the Glow model (Kingma & Dhariwal, 2018; Serrà et al., 2019) can be arbitrarily manipulated by our algorithm (Section 2.1) such that p(x_out) ≈ p(x_in), which further diminishes the effectiveness of any Glow likelihood-based detection method.
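The likelihood manipulation can be sketched with a toy stand-in for the flow model: here log p(x) is an isotropic Gaussian log-density with an analytic gradient, whereas a real attack would backpropagate through the Glow network; eps, alpha, and the step count are illustrative assumptions, not values from this study:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

def log_p(x):
    """Toy log-density (isotropic Gaussian, up to an additive constant).
    A real attack would call a Glow model's log-likelihood here."""
    return -0.5 * float(np.sum(x ** 2))

def grad_log_p(x):
    # Analytic gradient of the toy density; backprop would supply this for Glow.
    return -x

def match_likelihood(x_out, target, eps=0.5, alpha=0.01, steps=400):
    """Nudge x_out within an eps-ball so that log p(x_out + delta) approaches target."""
    x = x_out.copy()
    for _ in range(steps):
        # Gradient of the squared gap (log p(x) - target)^2.
        g = 2.0 * (log_p(x) - target) * grad_log_p(x)
        x = x - alpha * np.sign(g)
        x = np.clip(x, x_out - eps, x_out + eps)  # stay close to the OOD sample
    return x

x_in = 0.5 * rng.normal(size=DIM)        # high-likelihood "in-distribution" point
x_out = rng.uniform(1.5, 2.5, size=DIM)  # low-likelihood "OOD" point
target = log_p(x_in)
x_adv = match_likelihood(x_out, target)
# The likelihood gap shrinks: log p(x_adv) is closer to log p(x_in) than log p(x_out) was.
print(abs(log_p(x_adv) - target) < abs(log_p(x_out) - target))
```

The key point carried over from the text: the same clipped gradient search that matches latent codes can equally well steer a likelihood score toward any target value.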

2.1. OOD ATTACK ON DNN ENCODER

We introduce an algorithm to perform an OOD attack on a DNN encoder z = f(x), which takes an image x as input and transforms it into a feature vector z in a latent space. Preprocessing on x can be considered the very first layer inside the model f(x). The algorithm needs only a weak assumption: that f(x) is sub-differentiable. A CNN classifier can be considered a composition of a feature encoder z = f(x) and a feature classifier p = g(z), where p is the softmax probability distribution over the classes. Let us consider an in-distribution sample x_in and an OOD sample x_out, and apply the model: z_in = f(x_in) and z_out = f(x_out). Usually, z_out ≠ z_in. However, if we add a relatively small perturbation δ to x_out, then it is possible that f(x_out + δ) ≈ z_in while x_out + δ is still OOD. This idea is realized in Algorithm 1, OOD Attack on DNN Encoder. The clip operation in Algorithm 1 is very important: it limits the difference between x_out + δ and x_out so that the perturbed sample remains OOD. The algorithm is inspired by the method called projected gradient descent (PGD) (Kurakin et al., 2016; Madry et al., 2018), which is used for adversarial attacks. We note that the term "adversarial attack" usually refers to adding a small perturbation to a clean sample x in a dataset such that a classifier will incorrectly classify the noisy sample while being


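The idea behind Algorithm 1 can be sketched as follows. The encoder here is a toy random linear map with an analytic gradient, standing in for a real DNN where the gradient would come from backpropagation; eps, alpha, and the iteration count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the encoder z = f(x): a fixed random linear map.
W = rng.normal(size=(8, 32))
def f(x):
    return W @ x

def ood_attack(x_out, z_in, eps=0.3, alpha=0.01, steps=500):
    """PGD-style OOD attack: find x_out + delta whose latent code matches z_in.
    The clip step keeps the perturbed sample within an eps-ball of x_out,
    so it stays OOD; the second clip keeps it a valid image in [0, 1]."""
    x = x_out.copy()
    for _ in range(steps):
        g = 2.0 * W.T @ (f(x) - z_in)             # gradient of ||f(x) - z_in||^2
        x = x - alpha * np.sign(g)                # signed gradient step, as in PGD
        x = np.clip(x, x_out - eps, x_out + eps)  # stay close to the OOD sample
        x = np.clip(x, 0.0, 1.0)                  # stay a valid image
    return x

x_in = rng.uniform(size=32)    # "in-distribution" sample
x_out = rng.uniform(size=32)   # "OOD" sample (e.g., a noise image)
z_in = f(x_in)
x_adv = ood_attack(x_out, z_in)
# The latent code of the attacked OOD sample moves toward z_in.
print(np.linalg.norm(f(x_adv) - z_in) < np.linalg.norm(f(x_out) - z_in))
```

The design mirrors PGD but inverts the goal: instead of perturbing an in-distribution sample to change its predicted label, it perturbs an OOD sample to reproduce an in-distribution latent code, so any detector d(z) downstream of the encoder is defeated.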