MULTISCALE SCORE MATCHING FOR OUT-OF-DISTRIBUTION DETECTION

Abstract

We present a new methodology for detecting out-of-distribution (OOD) images by utilizing norms of the score estimates at multiple noise scales. A score is defined to be the gradient of the log density with respect to the input data. Our methodology is completely unsupervised and follows a straightforward training scheme. First, we train a deep network to estimate scores for L levels of noise. Once trained, we calculate the noisy score estimates for N in-distribution samples and take the L2-norms across the input dimensions (resulting in an N × L matrix). Then we train an auxiliary model (such as a Gaussian Mixture Model) to learn the in-distribution spatial regions in this L-dimensional space. This auxiliary model can now be used to identify points that reside outside the learned space. Despite its simplicity, our experiments show that this methodology significantly outperforms the state-of-the-art in detecting out-of-distribution images. For example, our method can effectively separate CIFAR-10 (inlier) and SVHN (OOD) images, a setting which has been previously shown to be difficult for deep likelihood models. We make our code and results publicly available on GitHub.¹

1. INTRODUCTION

Modern neural networks do not tend to generalize well to out-of-distribution samples. This phenomenon has been observed in both classifier networks (Hendrycks & Gimpel (2017); Nguyen et al. (2015); Szegedy et al. (2013)) and deep likelihood models (Nalisnick et al. (2018); Hendrycks et al. (2018); Ren et al. (2019)). This certainly has implications for AI safety (Amodei et al. (2016)), as models need to be aware of uncertainty when presented with unseen examples. Moreover, an out-of-distribution detector can be applied as an anomaly detector. Ultimately, our research is motivated by the need for a sensitive outlier detector that can be used in a medical setting. Particularly, we want to identify atypical morphometry in early brain development. This requires a method that generalizes to highly variable, high-resolution, unlabeled real-world data while being sensitive enough to detect an unspecified, heterogeneous set of atypicalities. To that end, we propose multiscale score matching to effectively detect out-of-distribution samples.

Hyvärinen (2005) introduced score matching as a method to learn the parameters of a non-normalized probability density model, where a score is defined as the gradient of the log density with respect to the data. Conceptually, a score is a vector field that points in the direction where the log density grows the most. The authors mention the possibility of matching scores via a non-parametric model but circumvent this by using gradients of the score estimate itself. However, Vincent (2011) later showed that the objective function of a denoising autoencoder (DAE) is equivalent to matching the score of a non-parametric Parzen density estimator of the data. Thus, DAEs provide a methodology for learning score estimates via the objective:

$$\frac{1}{2}\,\mathbb{E}_{\tilde{x}\sim q_\sigma(\tilde{x}|x)\,p_{\text{data}}(x)}\left[\left\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}}\log q_\sigma(\tilde{x}|x)\right\|^2\right] \qquad (1)$$

Here $s_\theta(\tilde{x})$ is the score network being trained to estimate the true score $\nabla_x \log p_{\text{data}}(x)$, and $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x}|x)\,p_{\text{data}}(x)\,dx$.
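To make the denoising objective concrete, the sketch below evaluates it for one-dimensional Gaussian data, where the perturbed marginal $q_\sigma$ is also Gaussian and its score is available in closed form. This is an illustrative NumPy example, not the paper's network; `dsm_loss`, `true_score`, and `zero_score` are names introduced here.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Denoising score matching loss (Vincent, 2011).

    Perturb x with N(0, sigma^2 I) noise; the score of the Gaussian
    perturbation kernel is grad_xt log q(xt|x) = (x - xt) / sigma^2.
    """
    x_tilde = x + sigma * rng.standard_normal(x.shape)
    target = (x - x_tilde) / sigma**2
    diff = score_fn(x_tilde) - target
    return 0.5 * np.mean(np.sum(diff**2, axis=-1))

rng = np.random.default_rng(0)
x = rng.standard_normal((50_000, 1))   # data ~ N(0, 1)
sigma = 0.5

# Analytic score of the perturbed marginal q_sigma = N(0, 1 + sigma^2):
true_score = lambda xt: -xt / (1.0 + sigma**2)
zero_score = lambda xt: np.zeros_like(xt)

loss_true = dsm_loss(true_score, x, sigma, rng)
loss_zero = dsm_loss(zero_score, x, sigma, rng)
print(loss_true, loss_zero)
```

Since the objective is minimized by the true score of $q_\sigma$, the analytic score attains a strictly smaller loss than the degenerate zero score.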
It should be noted that the score of the estimator only matches the true score when the noise perturbation is minimal, i.e., when $q_\sigma(\tilde{x}) \approx p_{\text{data}}(x)$.

Recently, Song & Ermon (2019) employed multiple noise levels to develop a deep generative model based on score matching, called the Noise Conditioned Score Network (NCSN). Let $\{\sigma_i\}_{i=1}^{L}$ be a positive geometric sequence that satisfies $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$. NCSN is a conditional network, $s_\theta(x, \sigma)$, trained to jointly estimate scores for the various noise levels $\sigma_i$, such that $\forall \sigma \in \{\sigma_i\}_{i=1}^{L}: s_\theta(x, \sigma) \approx \nabla_x \log q_\sigma(x)$. In practice, the network is explicitly provided a one-hot vector denoting the noise level used to perturb the data. The network is then trained via a denoising score matching loss. They choose their noise distribution to be $\mathcal{N}(\tilde{x} \mid x, \sigma^2 I)$; therefore $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2}$. Thus the objective function is:

$$\frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\,\frac{1}{2}\,\mathbb{E}_{\tilde{x}\sim q_{\sigma_i}(\tilde{x}|x)\,p_{\text{data}}(x)}\left\|s_\theta(\tilde{x}, \sigma_i) + \frac{\tilde{x} - x}{\sigma_i^2}\right\|_2^2 \qquad (2)$$

Song & Ermon (2019) set $\lambda(\sigma_i) = \sigma_i^2$ after empirically observing that $\|\sigma s_\theta(x, \sigma)\|_2 \propto 1$. We similarly scaled our score norms for all our experiments.

Our work directly utilizes the training objective proposed by Song & Ermon (2019), i.e., we use an NCSN as our score estimator. However, we use the score outputs for out-of-distribution (OOD) detection rather than for generative modeling. We demonstrate how the space of multiscale score estimates can separate in-distribution samples from outliers, outperforming state-of-the-art methods. We also apply our method to real-world medical imaging data of brain MRI scans.
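A minimal sketch of the multiscale objective in Equation 2, assuming a geometric noise sequence and $\lambda(\sigma_i) = \sigma_i^2$. The names `geometric_sigmas`, `ncsn_loss`, and the closed-form `toy_score` are hypothetical stand-ins introduced here, not the NCSN implementation.

```python
import numpy as np

def geometric_sigmas(sigma_max, sigma_min, L):
    """Positive geometric sequence with sigma_1/sigma_2 = ... = sigma_{L-1}/sigma_L > 1."""
    return np.geomspace(sigma_max, sigma_min, L)

def ncsn_loss(score_fn, x, sigmas, rng):
    """Average the per-level denoising losses, weighted by lambda(sigma_i) = sigma_i^2."""
    total = 0.0
    for sig in sigmas:
        x_tilde = x + sig * rng.standard_normal(x.shape)
        # residual between the estimated score and -(x_tilde - x)/sigma^2
        diff = score_fn(x_tilde, sig) + (x_tilde - x) / sig**2
        total += sig**2 * 0.5 * np.mean(np.sum(diff**2, axis=-1))
    return total / len(sigmas)

sigmas = geometric_sigmas(1.0, 0.01, 10)
ratios = sigmas[:-1] / sigmas[1:]          # constant ratio > 1

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 2))
# toy stand-in for s_theta: analytic score of N(0, (1 + sigma^2) I)
toy_score = lambda xt, sig: -xt / (1.0 + sig**2)
loss = ncsn_loss(toy_score, x, sigmas, rng)
```

Note how the $\sigma_i^2$ weighting balances the per-level terms: it is equivalent to penalizing $\|\sigma_i s_\theta(\tilde{x}, \sigma_i) + (\tilde{x}-x)/\sigma_i\|_2^2$, whose magnitude is roughly independent of $\sigma_i$.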

2. MULTISCALE SCORE ANALYSIS

Consider taking the L2-norm of the score function: $\|s(x)\| = \|\nabla_x \log p(x)\| = \frac{\|\nabla_x p(x)\|}{p(x)}$. Since the data density term appears in the denominator, a high likelihood will correspond to a low norm. Since out-of-distribution samples should have a low likelihood with respect to the in-distribution density (i.e., $p(x)$ is small), we can expect them to have high score norms. However, if these outlier points reside in "flat" regions with very small gradients (e.g., in a small local mode), then their score norms can be low despite the points belonging to a low-density region. This is our first indicator that a true score norm alone may not be sufficient for detecting outliers.

We empirically validate our intuition by considering score estimates for a relatively simple toy dataset: FashionMNIST. Following the denoising score matching objective (Equation 2), we can obtain multiple estimates of the true score by using different noise distributions $q_\sigma(\tilde{x}|x)$. Like Song & Ermon (2019), we choose the noise distributions to be zero-centered Gaussians scaled according to $\sigma_i$. Recall that the scores for samples perturbed by the lowest-$\sigma$ noise should be closest to the true score. Our analyses show that this alone was inadequate for separating inliers from OOD samples.

We trained a score network $s_{\text{FM}}(x, \sigma)$ on FashionMNIST and used it to estimate scores of the FashionMNIST ($x \sim \mathcal{D}_{FM}$), MNIST ($x \sim \mathcal{D}_M$), and CIFAR-10 ($x \sim \mathcal{D}_C$) test sets. Figure 1a shows the distribution of the score norms corresponding to the lowest noise level used. Note that CIFAR-10 samples are appropriately assigned high score norms by the model. However, the model is unable to separate the MNIST samples from the in-distribution FashionMNIST samples at this scale alone.
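The multiscale analysis treats each sample as an L-dimensional vector of (scaled) score norms, to which an auxiliary density model is fit on in-distribution data. The sketch below uses synthetic norm vectors and a single full-covariance Gaussian with a Mahalanobis-distance score as a minimal stand-in for the Gaussian Mixture Model; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 500, 10

# Stand-ins for the ||sigma_i * s_theta(x, sigma_i)||_2 norms: inliers
# cluster in one region of the L-dimensional space, OOD points drift away.
inlier_norms = rng.normal(1.0, 0.1, size=(N, L))
ood_norms    = rng.normal(1.6, 0.3, size=(N, L))

# Auxiliary density model: a single full-covariance Gaussian fit to the
# inlier norms (a minimal stand-in for the GMM used in the paper).
mu   = inlier_norms.mean(axis=0)
cov  = np.cov(inlier_norms, rowvar=False) + 1e-6 * np.eye(L)
prec = np.linalg.inv(cov)

def mahalanobis(z):
    """Distance of each row of z from the learned in-distribution region."""
    d = z - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, prec, d))

print(mahalanobis(inlier_norms).mean(), mahalanobis(ood_norms).mean())
```

Points far from the learned in-distribution region receive large distances (equivalently, low likelihood under the auxiliary model) and can be flagged by thresholding.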



¹ https://github.com/ahsanMah/msma



Figure 1: Visualizing the need for a multiscale analysis. In (a), we plot the scores corresponding to the lowest sigma estimate. In (b), we plot the UMAP embedding of the L = 10 dimensional vectors of score norms. Here we see a better separation between FashionMNIST and MNIST when using estimates from multiple scales rather than the one that corresponds to the true score only.


