EXPLAINABLE DEEP ONE-CLASS CLASSIFICATION

Abstract

Deep one-class classification variants for anomaly detection learn a mapping that concentrates nominal samples in feature space, causing anomalies to be mapped away. Because this transformation is highly non-linear, finding interpretations poses a significant challenge. In this paper we present an explainable deep one-class classification method, Fully Convolutional Data Description (FCDD), where the mapped samples are themselves also an explanation heatmap. FCDD yields competitive detection performance and provides reasonable explanations on common anomaly detection benchmarks with CIFAR-10 and ImageNet. On MVTec-AD, a recent manufacturing dataset offering ground-truth anomaly maps, FCDD sets a new state of the art in the unsupervised setting. Our method can incorporate ground-truth anomaly explanations during training, and using even a few of these (∼5) improves performance significantly. Finally, using FCDD's explanations, we demonstrate the vulnerability of deep one-class classification models to spurious image features such as image watermarks.

1. INTRODUCTION

Anomaly detection (AD) is the task of identifying anomalies in a corpus of data (Edgeworth, 1887; Barnett and Lewis, 1994; Chandola et al., 2009; Ruff et al., 2021). Powerful new anomaly detectors based on deep learning have made AD more effective and scalable to large, complex datasets such as high-resolution images (Ruff et al., 2018; Bergmann et al., 2019). While there exists much recent work on deep AD, there is limited work on making such techniques explainable. Explanations are needed in industrial applications to meet safety and security requirements (Berkenkamp et al., 2017; Katz et al., 2017; Samek et al., 2020), avoid unfair social biases (Gupta et al., 2018), and support human experts in decision making (Jarrahi, 2018; Montavon et al., 2018; Samek et al., 2020). One typically makes anomaly detection explainable by annotating pixels with an anomaly score and, in some applications, such as finding tumors in cancer detection (Quellec et al., 2016), these annotations are the primary goal of the detector.

One approach to deep AD, known as Deep Support Vector Data Description (DSVDD) (Ruff et al., 2018), is based on finding a neural network that transforms data such that nominal data is concentrated to a predetermined center and anomalous data lies elsewhere. In this paper we present Fully Convolutional Data Description (FCDD), a modification of DSVDD so that the transformed samples are themselves an image corresponding to a downsampled anomaly heatmap. The pixels in this heatmap that are far from the center correspond to anomalous regions in the input image. FCDD does this by only using convolutional and pooling layers, thereby limiting the receptive field of each output pixel. Our method is based on the one-class classification paradigm (Moya et al., 1993; Tax, 2001; Tax and Duin, 2004; Ruff et al., 2018), which is able to naturally incorporate known anomalies (Ruff et al., 2021), but is also effective when simply using synthetic anomalies.

We show that FCDD's anomaly detection performance is close to the state of the art on the standard AD benchmarks with CIFAR-10 and ImageNet while providing transparent explanations. On MVTec-AD, an AD dataset containing ground-truth anomaly maps, we demonstrate the accuracy of FCDD's explanations (see Figure 1), where FCDD sets a new state of the art. In further experiments we find that deep one-class classification models (e.g. DSVDD) are prone to the "Clever Hans" effect (Lapuschkin et al., 2019) where a detector fixates on spurious features such as image watermarks. In general, we find that the generated anomaly heatmaps are less noisy and provide more structure than the baselines, including gradient-based methods (Simonyan et al., 2013; Sundararajan et al., 2017) and autoencoders (Sakurada and Yairi, 2014; Bergmann et al., 2019).

Figure 1: FCDD explanation heatmaps for MVTec-AD (Bergmann et al., 2019). Rows from top to bottom show: (1) nominal samples, (2) anomalous samples, (3) FCDD anomaly heatmaps, (4) ground-truth anomaly maps.

2. RELATED WORK

Here we outline related works on deep AD focusing on explanation approaches. Classically, deep AD used autoencoders (Hawkins et al., 2002; Sakurada and Yairi, 2014; Zhou and Paffenroth, 2017; Zhao et al., 2017). Trained on a nominal dataset, autoencoders are assumed to reconstruct anomalous samples poorly. Thus, the reconstruction error can be used as an anomaly score and the pixel-wise difference as an explanation (Bergmann et al., 2019), thereby naturally providing an anomaly heatmap. Recent works have incorporated attention into reconstruction models that can be used as explanations (Venkataramanan et al., 2019; Liu et al., 2020). In the domain of videos, Sabokrou et al. (2018) used a pre-trained fully convolutional architecture in combination with a sparse autoencoder to extract 2D features and provide bounding boxes for anomaly localization. One drawback of reconstruction methods is that they offer no natural way to incorporate known anomalies during training.

More recently, one-class classification methods for deep AD have been proposed. These methods attempt to separate nominal samples from anomalies in an unsupervised manner by concentrating nominal data in feature space while mapping anomalies to distant locations (Ruff et al., 2018; Chalapathy et al., 2018; Goyal et al., 2020). In the domain of NLP, DSVDD has been successfully applied to text, which yields a form of interpretation using attention mechanisms (Ruff et al., 2019). For images, Kauffmann et al. (2020) have used a deep Taylor decomposition (Montavon et al., 2017) to derive relevance scores.

Some of the best performing deep AD methods are based on self-supervision. These methods transform nominal samples, train a network to predict which transformation was used on the input, and provide an anomaly score via the confidence of the prediction (Golan and El-Yaniv, 2018; Hendrycks et al., 2019b). Hendrycks et al. (2019a) have extended this to incorporate known anomalies as well. No explanation approaches have been considered for these methods so far.

Finally, there exists a great variety of explanation methods in general, for example model-agnostic methods (e.g. LIME (Ribeiro et al., 2016)) or gradient-based techniques (Simonyan et al., 2013; Sundararajan et al., 2017). Relating to our work, we note that fully convolutional architectures have been used for supervised segmentation tasks where target segmentation maps are required during training (Long et al., 2015; Noh et al., 2015).

3. EXPLAINING DEEP ONE-CLASS CLASSIFICATION

We review one-class classification and fully convolutional architectures before presenting our method.

Deep One-Class Classification Deep one-class classification (Ruff et al., 2018; 2020b) performs anomaly detection by learning a neural network to map nominal samples near a center $c$ in output space, causing anomalies to be mapped away. For our method we use a Hypersphere Classifier (HSC) (Ruff et al., 2020a), a recently proposed modification of Deep SAD (Ruff et al., 2020b), a semi-supervised version of DSVDD (Ruff et al., 2018). Let $X_1, \dots, X_n$ denote a collection of samples and $y_1, \dots, y_n$ be labels where $y_i = 1$ denotes an anomaly and $y_i = 0$ denotes a nominal sample. Then the HSC objective is

$$\min_{W, c} \; \frac{1}{n} \sum_{i=1}^{n} (1 - y_i)\, h(\phi(X_i; W) - c) - y_i \log\left(1 - \exp\left(-h(\phi(X_i; W) - c)\right)\right), \qquad (1)$$

where $c \in \mathbb{R}^d$ is the center, and $\phi : \mathbb{R}^{c \times h \times w} \to \mathbb{R}^d$ a neural network with weights $W$. Here $h$ is the pseudo-Huber loss (Huber et al., 1964), $h(a) = \sqrt{\|a\|_2^2 + 1} - 1$, which is a robust loss that interpolates from quadratic to linear penalization. The HSC loss encourages $\phi$ to map nominal samples near $c$ and anomalous samples away from the center $c$. In our implementation, the center $c$ corresponds to the bias term in the last layer of our networks, i.e. is included in the network $\phi$, which is why we omit $c$ in the FCDD objective below.

Figure 2: Visualization of a 3×3 convolution followed by a 3×3 transposed convolution with a Gaussian kernel, both using a stride of 2.

Fully Convolutional Architecture Our method uses a fully convolutional network (FCN) (Long et al., 2015; Noh et al., 2015) that maps an image to a matrix of features, i.e. $\phi : \mathbb{R}^{c \times h \times w} \to \mathbb{R}^{1 \times u \times v}$, by using alternating convolutional and pooling layers only, and does not contain any fully connected layers. In this context, pooling can be seen as a special kind of convolution with fixed parameters.
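To make the HSC objective above concrete, the following is a minimal sketch in plain Python that evaluates the pseudo-Huber distance and the per-batch loss on already computed network outputs. The function names, the list-based data layout, and the small constant `eps` that guards the logarithm are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math

def pseudo_huber(a):
    """Pseudo-Huber loss h(a) = sqrt(||a||_2^2 + 1) - 1 for a vector a."""
    return math.sqrt(sum(v * v for v in a) + 1.0) - 1.0

def hsc_loss(outputs, labels, center, eps=1e-12):
    """Mean HSC loss over a batch.

    outputs: list of network output vectors phi(X_i; W)
    labels:  list of y_i (0 = nominal, 1 = anomaly)
    center:  the center c
    eps:     hypothetical numerical guard, not part of the objective
    """
    total = 0.0
    for z, y in zip(outputs, labels):
        d = pseudo_huber([zi - ci for zi, ci in zip(z, center)])
        if y == 0:
            total += d                                  # pull nominal samples toward c
        else:
            total += -math.log(1.0 - math.exp(-d) + eps)  # push anomalies away from c
    return total / len(outputs)
```

Under this sketch a nominal sample that sits exactly on the center contributes zero loss, while an anomaly contributes less loss the farther it lies from the center.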
A core property of a convolutional layer is that each pixel of its output only depends on a small region of its input, known as the output pixel's receptive field. Since the output of a convolution is produced by moving a filter over the input image, each output pixel has the same relative position as its associated receptive field in the input. For instance, the lower-left corner of the output representation has a corresponding receptive field in the lower-left corner of the input image, etc. (see Figure 2 left side). The outcome of several stacked convolutions also has receptive fields of limited size and consistent relative position, though their size grows with the number of layers. Because of this an FCN preserves spatial information.

Fully Convolutional Data Description Here we introduce our novel explainable AD method Fully Convolutional Data Description (FCDD). By taking advantage of FCNs along with the HSC above, we propose a deep one-class method where the output features preserve spatial information and also serve as a downsampled anomaly heatmap. For situations where one would like to have a full-resolution heatmap, we include a methodology for upsampling the low-resolution heatmap based on properties of receptive fields.

FCDD is trained using samples that are labeled as nominal or anomalous. As before, let $X_1, \dots, X_n$ denote a collection of samples with labels $y_1, \dots, y_n$ where $y_i = 1$ denotes an anomaly and $y_i = 0$ denotes a nominal sample. Anomalous samples can simply be a collection of random images which are not from the nominal collection, e.g. one of the many large collections of images which are freely available like 80 Million Tiny Images (Torralba et al., 2008) or ImageNet (Deng et al., 2009). The use of such an auxiliary corpus has been recommended in recent works on deep AD, where it is termed Outlier Exposure (OE) (Hendrycks et al., 2019a; b). When one has access to "true" examples of the anomalous dataset, i.e. something that is likely to be representative of what will be seen at test time, we find that even using a few examples as the corpus of labeled anomalies performs exceptionally well. Furthermore, in the absence of any sort of known anomalies, one can generate synthetic anomalies, which we find is also very effective.

With an FCN $\phi : \mathbb{R}^{c \times h \times w} \to \mathbb{R}^{u \times v}$ the FCDD objective utilizes a pseudo-Huber loss on the FCN output matrix, $A(X) = \sqrt{\phi(X; W)^2 + 1} - 1$, where all operations are applied element-wise. The FCDD objective is then defined as (cf. (1)):

$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} (1 - y_i)\, \frac{1}{u \cdot v} \|A(X_i)\|_1 - y_i \log\left(1 - \exp\left(-\frac{1}{u \cdot v} \|A(X_i)\|_1\right)\right). \qquad (2)$$

Here $\|A(X)\|_1$ is the sum of all entries in $A(X)$, which are all non-negative. FCDD is the utilization of an FCN in conjunction with the novel adaptation of the HSC loss we propose in (2). The objective maximizes $\|A(X)\|_1$ for anomalies and minimizes it for nominal samples, thus we use $\|A(X)\|_1$ as the anomaly score. Entries of $A(X)$ that contribute to $\|A(X)\|_1$ correspond to regions of the input image that add to the anomaly score. The shape of these regions depends on the receptive field of the FCN. We include a sensitivity analysis on the size of the receptive field in Appendix A, where we find that performance is not strongly affected by the receptive field size.

Note that $A(X)$ has spatial dimensions $u \times v$ and is smaller than the original image dimensions $h \times w$. One could use $A(X)$ directly as a low-resolution heatmap of the image; however, it is often desirable to have full-resolution heatmaps. Because we generally lack ground-truth anomaly maps in an AD setting during training, it is not possible to train an FCN in a supervised way to upsample the low-resolution heatmap $A(X)$ (e.g. as in Noh et al., 2015). For this reason we introduce an upsampling scheme based on the properties of receptive fields.
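As an illustration of the FCDD objective, the sketch below evaluates the element-wise pseudo-Huber heatmap, the anomaly score, and the batch loss for small FCN outputs given as nested lists. It is a plain-Python sketch under stated assumptions: the function names and the guard constant `eps` are hypothetical, and a real implementation would operate on framework tensors with automatic differentiation.

```python
import math

def fcdd_heatmap(phi_out):
    """Element-wise A(X) = sqrt(phi(X)^2 + 1) - 1 for a u x v output matrix."""
    return [[math.sqrt(v * v + 1.0) - 1.0 for v in row] for row in phi_out]

def fcdd_score(phi_out):
    """Anomaly score ||A(X)||_1: the sum of all (non-negative) heatmap entries."""
    return sum(sum(row) for row in fcdd_heatmap(phi_out))

def fcdd_loss(batch_outputs, labels, eps=1e-12):
    """Mean FCDD loss over a batch; eps is an assumed guard for the logarithm."""
    total = 0.0
    for phi_out, y in zip(batch_outputs, labels):
        u, v = len(phi_out), len(phi_out[0])
        s = fcdd_score(phi_out) / (u * v)          # normalized by heatmap size
        if y == 0:
            total += s                             # nominal: shrink the heatmap
        else:
            total += -math.log(1.0 - math.exp(-s) + eps)  # anomaly: grow the heatmap
    return total / len(batch_outputs)
```

For example, `fcdd_score([[0.0, 0.0], [0.0, 3.0]])` is driven entirely by the lower-right entry, which is the sense in which individual heatmap entries explain the score.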

Algorithm 1 Receptive Field Upsampling

Input: $A \in \mathbb{R}^{u \times v}$ (low-res anomaly heatmap)
Output: $A' \in \mathbb{R}^{h \times w}$ (full-res anomaly heatmap)
Define: $[G_2(\mu, \sigma)]_{x,y} \triangleq \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(x - \mu_1)^2 + (y - \mu_2)^2}{2\sigma^2}\right)$
A′ ← 0
for all output pixels a in A do
    f ← receptive field of a
    c ← center of field f
    A′ ← A′ + a · G2(c, σ)
end for
return A′

Heatmap Upsampling Since we generally do not have access to ground-truth pixel annotations in anomaly detection during training, we cannot learn how to upsample using a deconvolutional type of structure. We derive a principled way to upsample our lower resolution anomaly heatmap instead. For every output pixel in $A(X)$ there is a unique input pixel which lies at the center of its receptive field. It has been observed before that the effect of the receptive field for an output pixel decays in a Gaussian manner as one moves away from the center of the receptive field (Luo et al., 2016). We use this fact to upsample $A(X)$ by using a strided transposed convolution with a fixed Gaussian kernel (see Figure 2 right side). We describe this operation and procedure in Algorithm 1, which simply corresponds to a strided transposed convolution. The kernel size is set to the receptive field range of FCDD and the stride to the cumulative stride of FCDD. The variance of the distribution can be picked empirically (see Appendix B for details). Figure 3 shows a complete overview of our FCDD method and the process of generating full-resolution anomaly heatmaps.
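The loop in Algorithm 1 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name, the placement of each receptive-field center at `index * stride + stride // 2`, and the truncation of the Gaussian to a square window of side `kernel` are all assumptions. In practice the same operation would be expressed as a strided transposed convolution with a fixed kernel.

```python
import math

def gaussian_upsample(a, out_h, out_w, stride, kernel, sigma):
    """Upsample a low-res heatmap `a` (nested lists) to out_h x out_w.

    Each low-res pixel adds a Gaussian bump, scaled by its value, around the
    assumed center of its receptive field in the full-resolution image.
    """
    up = [[0.0] * out_w for _ in range(out_h)]
    half = kernel // 2
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    for i, row in enumerate(a):
        for j, val in enumerate(row):
            cy = i * stride + stride // 2   # assumed receptive-field center (row)
            cx = j * stride + stride // 2   # assumed receptive-field center (column)
            for y in range(max(0, cy - half), min(out_h, cy + half + 1)):
                for x in range(max(0, cx - half), min(out_w, cx + half + 1)):
                    d2 = (y - cy) ** 2 + (x - cx) ** 2
                    up[y][x] += val * norm * math.exp(-d2 / (2.0 * sigma ** 2))
    return up
```

A usage example: `gaussian_upsample([[0.0, 0.0], [0.0, 1.0]], 8, 8, stride=4, kernel=5, sigma=1.5)` yields an 8×8 map whose mass is concentrated in the lower-right quadrant, mirroring the position of the only active low-resolution pixel.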

4. EXPERIMENTS

In this section, we experimentally evaluate the performance of FCDD both quantitatively and qualitatively. For a quantitative evaluation, we use the Area Under the ROC Curve (AUC) (Spackman, 1989), which is the commonly used measure in AD. For a qualitative evaluation, we compare the heatmaps produced by FCDD to existing deep AD explanation methods. As baselines, we consider gradient-based methods (Simonyan et al., 2013) applied to hypersphere classifier (HSC) models (Ruff et al., 2020a) with unrestricted network architectures (i.e. networks that also have fully connected layers) and autoencoders (Bergmann et al., 2019) where we directly use the pixel-wise reconstruction error as an explanation heatmap. We slightly blur the heatmaps of the baselines with the same Gaussian kernel we use for FCDD, which we found results in less noisy, more interpretable heatmaps. We include heatmaps without blurring in Appendix G. We adjust the contrast of the heatmaps per method to highlight interesting features; see Appendix C for details. For our experiments we do not consider model-agnostic explanations, such as LIME (Ribeiro et al., 2016) or anchors (Ribeiro et al., 2018), because they are not tailored to the AD task and performed poorly.
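For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen nominal sample. The sketch below computes it directly from that definition in plain Python; the function names are hypothetical and the quadratic-time pairwise loop is chosen for clarity rather than speed. The second helper shows how the same quantity could be computed per pixel when a binary ground-truth mask is available.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (anomaly, nominal) pairs ranked correctly.

    scores: anomaly scores, higher means more anomalous
    labels: 1 for anomaly, 0 for nominal; ties count as half a correct pair
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one anomaly and one nominal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def pixelwise_auc(heatmap, mask):
    """AUC over pixels, treating heatmap entries as scores and mask entries as labels."""
    scores = [v for row in heatmap for v in row]
    labels = [m for row in mask for m in row]
    return roc_auc(scores, labels)
```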

4.1. STANDARD ANOMALY DETECTION BENCHMARKS

We first evaluate FCDD on the Fashion-MNIST, CIFAR-10, and ImageNet datasets. The common AD benchmark is to utilize these classification datasets in a one-vs-rest setup where the "one" class is used as the nominal class and the rest of the classes are used as anomalies at test time. For training, we only use nominal samples as well as random samples from some auxiliary Outlier Exposure (OE) (Hendrycks et al., 2019a) dataset, which is separate from the ground-truth anomaly classes following Hendrycks et al. (2019a; b). We report the mean AUC over all classes for each dataset.

Fashion-MNIST We consider each of the ten Fashion-MNIST (Xiao et al., 2017) classes in a one-vs-rest setup. We train Fashion-MNIST using EMNIST (Cohen et al., 2017) or grayscaled CIFAR-100 (Krizhevsky et al., 2009) as OE. We found that the latter slightly outperforms the former (∼3 AUC percentage points). On Fashion-MNIST, we use a network that consists of three convolutional layers with batch normalization, separated by two downsampling pooling layers.
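The one-vs-rest protocol described above can be summarized in a short sketch. The helper names are hypothetical and the samples are treated as opaque objects; the point is only how labels are assigned at training and test time and how the per-class AUCs are aggregated.

```python
def one_vs_rest_test_labels(class_labels, nominal_class):
    """Test-time labels: 0 for the nominal class, 1 for every other class."""
    return [0 if c == nominal_class else 1 for c in class_labels]

def build_training_set(nominal_samples, oe_samples):
    """Training pairs: nominal samples labeled 0, Outlier Exposure samples labeled 1."""
    return [(x, 0) for x in nominal_samples] + [(x, 1) for x in oe_samples]

def mean_auc(per_class_aucs):
    """Benchmark score: the mean of the AUCs obtained with each class as nominal."""
    return sum(per_class_aucs) / len(per_class_aucs)
```

Note that in this setup the classes used as anomalies at test time never appear in training; only the auxiliary OE samples carry the anomalous label.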

CIFAR-10

We consider each of the ten CIFAR-10 (Krizhevsky et al., 2009) classes in a one-vs-rest setup. As OE we use CIFAR-100, which does not share any classes with CIFAR-10. We use a model similar to LeNet-5 (LeCun et al., 1998), but decrease the kernel size to three, add batch normalization, and replace the fully connected layers and last max-pool layer with two further convolutions.

ImageNet We consider 30 classes from ImageNet1k (Deng et al., 2009) for the one-vs-rest setup following Hendrycks et al. (2019a). For OE we use ImageNet22k with ImageNet1k classes removed (Hendrycks et al., 2019a). We use an adaptation of VGG11 (Simonyan and Zisserman, 2015) with batch normalization, suitable for inputs resized to 224×224 (see Appendix D for model details).

State-of-the-art Methods We report results from state-of-the-art deep anomaly detection methods. Methods that do not incorporate known anomalies are the autoencoder (AE), DSVDD (Ruff et al., 2018), Geometric Transformation based AD (GEO) (Golan and El-Yaniv, 2018), and a variant of GEO by Hendrycks et al. (2019b) (GEO+). Methods that use OE are a Focal loss classifier (Hendrycks et al., 2019b), also GEO+, Deep SAD (Ruff et al., 2020b), and HSC (Ruff et al., 2020a).

Table 1: Mean AUC (over all classes and 5 seeds per class) for Fashion-MNIST, CIFAR-10, and ImageNet. Results from existing literature are marked with an asterisk (Bergman and Hoshen, 2020; Golan and El-Yaniv, 2018; Hendrycks et al., 2019b; Ruff et al., 2020a 

Quantitative Results

The mean AUC detection performance on the three AD benchmarks is reported in Table 1. We can see that FCDD, despite using a restricted FCN architecture to improve explainability, achieves a performance that is close to state-of-the-art methods and outperforms autoencoders, which yield a detection performance close to random on more complex datasets. We provide detailed results for all individual classes in Appendix F.

Qualitative Results Figures 4 and 5 show the heatmaps for Fashion-MNIST and ImageNet respectively. For a Fashion-MNIST model trained on the nominal class "trousers," the heatmaps show that FCDD correctly highlights horizontal elements as being anomalous, which makes sense since trousers are vertically aligned. For an ImageNet model trained on the nominal class "acorns," we observe that colors seem to be fairly relevant features, with green and brown areas tending to be seen as more nominal and other colors being deemed anomalous, for example the red barn or the white snow. Nonetheless, the method also seems capable of using more semantic features: for example, it recognizes the green caterpillar as being anomalous and it recognizes the acorn as nominal despite being against a red background. Figure 6 shows heatmaps for CIFAR-10 models with varying amounts of OE, all trained on the nominal class "airplane." We can see that, as the number of OE samples increases, FCDD tends to concentrate

Baseline Explanations We found the gradient-based heatmaps to mostly produce centered blobs which lack spatial context (see Figure 6) and thus are not useful for explaining. The AE heatmaps, being directly tied to the reconstruction error anomaly score, look reasonable. We again note, however, that it is not straightforward to include auxiliary OE samples or labeled anomalies in an AE approach, which leaves them with a poorer detection performance (see Table 1). Overall we find that the proposed FCDD anomaly heatmaps yield a good and consistent visual interpretation.

4.2. EXPLAINING DEFECTS IN MANUFACTURING

Here we compare the performance of FCDD on the MVTec-AD dataset of defects in manufacturing (Bergmann et al., 2019). This dataset offers annotated ground-truth anomaly segmentation maps for testing, thus allowing a quantitative evaluation of model explanations. MVTec-AD contains 15 object classes of high-resolution RGB images with up to 1024×1024 pixels, where anomalous test samples are further categorized in up to 8 defect types, depending on the class. We follow Bergmann et al. (2019) and compute an AUC from the heatmap pixel scores, using the given (binary) anomaly segmentation maps as ground-truth pixel labels. We then report the mean over all samples of this "explanation" AUC for a quantitative evaluation. For FCDD, we use a network that is based on a VGG11 network pre-trained on ImageNet, where we freeze the first ten layers, followed by additional fully convolutional layers that we train.

Synthetic Anomalies OE with a natural image dataset like ImageNet is not informative for MVTec-AD since anomalies here are subtle defects of the nominal class, rather than being out of class (see Figure 1). For this reason, we generate synthetic anomalies using a sort of "confetti noise," a simple noise model that inserts colored blobs into images and reflects the local nature of anomalies. See Figure 7 for an example.

Semi-Supervised FCDD A major advantage of FCDD in comparison to reconstruction-based methods is that it can be readily used in a semi-supervised AD setting (Ruff et al., 2020b). To see the effect of having even only a few labeled anomalies and their corresponding ground-truth anomaly maps available for training, we pick for each MVTec-AD class just one true anomalous sample per defect type at random and add it to the training set. This results in only 3-8 anomalous training samples. To also take advantage of the ground-truth heatmaps, we train a model on a pixel level. Let $X_1, \dots, X_n$ again denote a batch of inputs with corresponding ground-truth heatmaps $Y_1, \dots, Y_n$, each having $m = h \cdot w$ pixels. Let $A(X)$ again denote the corresponding output anomaly heatmap of $X$ and $A'(X)$ its upsampled full-resolution version. Then, we can formulate a pixel-wise objective by the following:

$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} (1 - (Y_i)_j)\, A'(X_i)_j \right) - \log\left(1 - \exp\left(-\frac{1}{m} \sum_{j=1}^{m} (Y_i)_j\, A'(X_i)_j\right)\right). \qquad (3)$$

Results Figure 1 in the introduction shows heatmaps of FCDD trained on MVTec-AD. The results of the quantitative explanation are shown in Table 2. We can see that FCDD outperforms its competitors in the unsupervised setting and sets a new state of the art of 0.92 pixel-wise mean AUC. In the semi-supervised setting, using only one anomalous sample with corresponding anomaly map per defect class, the explanation performance improves further to 0.96 pixel-wise mean AUC. FCDD also has the most consistent performance across classes.

Table 2: Pixel-wise mean AUC scores for all classes of the MVTec-AD dataset (Bergmann et al., 2019). For competitors we include the baselines presented in the original MVTec-AD paper and previously published works from peer-reviewed venues that include the MVTec-AD benchmark. The competitors are Self-Similarity and L2 Autoencoder (Bergmann et al., 2019), AnoGAN (Schlegl et al., 2017; Bergmann et al., 2019), CNN Feature Dictionaries (Napoletano et al., 2018; Bergmann et al., 2019), Visually Explained Variational Autoencoder (Liu et al., 2020), Superpixel Masking and Inpainting (Li et al., 2020), Gradient Descent Reconstruction with VAEs (Dehaene et al., 2020), and Encoding Structure-Texture Relation with P-Net for AD (Zhou et al., 2020).

Lapuschkin et al. (2016; 2019) revealed that roughly one fifth of all horse images in PASCAL VOC (Everingham et al., 2010) contain a watermark in the lower left corner. They showed that a classifier recognizes this as the relevant class pattern and fails if the watermark is removed.
They call this the "Clever Hans" effect in memory of the horse Hans, who could correctly answer math problems by reading its master (https://en.wikipedia.org/wiki/Clever_Hans). We adapt this experiment to one-class classification by swapping our standard setup and training FCDD so that the "horse" class is anomalous, using ImageNet as nominal samples. We choose this setup so that one would expect FCDD to highlight horses in its heatmaps and so that any other highlighting makes FCDD reveal a Clever Hans effect.

5. CONCLUSION

In conclusion we find that FCDD, in comparison to previous methods, performs well and is adaptable to both semantic detection tasks (Section 4.1) and more subtle defect detection tasks (Section 4.2). Finally, directly tying an explanation to the anomaly score should make FCDD less vulnerable to attacks (Anders et al., 2020) in contrast to a posteriori explanation methods. We leave an analysis of this phenomenon for future work.

A RECEPTIVE FIELD SENSITIVITY ANALYSIS

The receptive field has an impact on both detection performance and explanation quality. Here we provide some heatmaps and AUC scores for networks with different receptive field sizes. We observe that the detection performance is only minimally affected, but larger receptive fields cause the explanation heatmap to become less concentrated and more "blobby." For MVTec-AD we see that this can also negatively affect pixel-wise AUC scores, see Table 4 .

CIFAR-10

For CIFAR-10 we create eight different network architectures to study the impact of the receptive field size. Each architecture has four convolutional layers and two max-pool layers. To change the receptive field we vary the kernel size of the first convolutional layer between 3 and 17. When this kernel size is 3 then the receptive field contains approximately one quarter of the image; for a kernel size of 17 the receptive field is the entire image. Table 3 shows the detection performance of the networks. Figure 9 contains example heatmaps. MVTec-AD We create six different network architectures for MVTec-AD. They have six convolutional layers and three max-pool layers. We vary the kernel size for all of the convolutional layers between 3 and 13, which corresponds to a receptive field containing 1/16 of the image to the full image respectively. Table 4 shows the explanation performance of the networks in terms of pixel-wise mean AUC. Figure 10 contains some example heatmaps. We observe that a smaller receptive field yields better explanation performance.
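The receptive field sizes discussed in this appendix follow from a standard recurrence over the layers of a convolutional network: each layer enlarges the receptive field by (kernel − 1) times the cumulative stride of the layers before it. The sketch below implements that recurrence; the function name and the example layer lists are illustrative assumptions and do not reproduce the exact architectures used in the experiments.

```python
def receptive_field(layers):
    """Receptive field size and cumulative stride of a stack of layers.

    layers: list of (kernel_size, stride) tuples for conv and pooling layers,
            ordered from input to output (dilation is assumed to be 1).
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth scaled by the stride accumulated so far
        jump *= stride              # cumulative stride after this layer
    return rf, jump

# Hypothetical stack: conv3, pool2 (stride 2), conv3, pool2 (stride 2)
small = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)])
# Same stack with the first kernel enlarged to 17
large = receptive_field([(17, 1), (2, 2), (3, 1), (2, 2)])
```

The two returned values are the quantities that the heatmap upsampling in Section 3 uses as kernel size and stride of the transposed Gaussian convolution.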

B IMPACT OF THE GAUSSIAN VARIANCE

Using the heatmap upsampling proposed in Section 3, FCDD provides full-resolution anomaly heatmaps. However, this upsampling involves the choice of σ for the Gaussian kernel. In this section, we demonstrate the effect of this hyperparameter on the explanation performance of FCDD on MVTec-AD. Table 5 shows the pixel-wise mean AUC, and Figure 11 shows the corresponding heatmaps.

C ANOMALY HEATMAP VISUALIZATION

For anomaly heatmap visualization, the FCDD anomaly scores A (X) need to be rescaled to values in [0, 1]. Instead of applying standard min-max scaling that would divide all heatmap entries by max A (X), we use anomaly score quantiles to adjust the contrast in the heatmaps. For a collection of inputs X = {X 1 , . . . , X n } with corresponding full-resolution anomaly heatmaps learning rate per epoch by a factor of 0.98. The pre-processing pipeline is: (1) Random crop to size 28 with beforehand zero-padding of 2 pixels on all sides (2) random horizontal flipping with a chance of 50% (3) data normalization.

CIFAR-10

We train for 600 epochs using a batch size of 200 samples. We optimize the network using Adam (Kingma and Ba, 2015) (β = (0.9, 0.999)) with weight decay 10^-6 and an initial learning rate of 0.001, which is decreased by a factor of 10 at epochs 400 and 500. The pre-processing pipeline is: (1) random color jitter with all parameters (https://pytorch.org/docs/1.4.0/torchvision/transforms.html#torchvision.transforms.ColorJitter) set to 0.01, (2) random crop to size 32 with beforehand zero-padding of 4 pixels on all sides, (3) random horizontal flipping with a chance of 50%, (4) additive Gaussian noise with σ = 0.001, (5) data normalization.

ImageNet We use the same setup as in CIFAR-10, but resize all images to size 256×256 before forwarding them through the pipeline and change the random crop to size 224 with no padding. Test samples are center cropped to a size of 224 before being normalized.

Pascal VOC We use the same setup as in CIFAR-10, but resize all images to size 224×224 before forwarding them through the pipeline and remove the random crop step.

MVTec-AD For MVTec-AD we redefine an epoch to be ten times an iteration of the full dataset because this improves the computational performance of the data pipeline. We train for 200 epochs using SGD with Nesterov momentum (µ = 0.9), weight decay 10^-4, and an initial learning rate of 0.001, which decreases per epoch by a factor of 0.985. The pre-processing pipeline is: (1) resize to 240×240 pixels, (2) random crop to size 224 with no padding, (3) random color jitter with either all parameters set to 0.04 or 0.0005, randomly chosen, (4) 50% chance to apply additive Gaussian noise, (5) data normalization.
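The two learning rate schedules mentioned above can be written as simple functions of the epoch index. This is a sketch under assumptions: the function names are hypothetical, epochs are counted from zero, and the step decay is assumed to take effect from the milestone epoch onward; the actual training code may use a framework scheduler with slightly different conventions.

```python
def step_lr(epoch, base_lr=1e-3, milestones=(400, 500), factor=0.1):
    """Step decay: multiply the rate by `factor` once each milestone is reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

def exponential_lr(epoch, base_lr=1e-3, gamma=0.985):
    """Per-epoch exponential decay: base_lr * gamma^epoch."""
    return base_lr * gamma ** epoch
```

With the MVTec-AD settings quoted above (200 epochs, factor 0.985), the exponential schedule ends at roughly 5% of the initial rate, since 0.985^200 ≈ 0.049.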

F QUANTITATIVE DETECTION RESULTS FOR INDIVIDUAL CLASSES

Table 6 shows the class-wise results on Fashion-MNIST for AE, Deep Support Vector Data Description (DSVDD) (Ruff et al., 2018; Bergman and Hoshen, 2020) and Geometric Transformation based AD (GEO) (Golan and El-Yaniv, 2018). et al., 2018), DSVDD (Ruff et al., 2018), GEO (Golan and El-Yaniv, 2018) and an adaptation of GEO (GEO+) (Hendrycks et al., 2019b). Competitors with OE are the focal loss classifier (Hendrycks et al., 2019b), again GEO+ (Hendrycks et al., 2019b), Deep Semi-supervised Anomaly Detection (Deep SAD) (Ruff et al., 2020b; a) and the Hypersphere Classifier (HSC) (Ruff et al., 2020a). In Table 8 the class-wise results for ImageNet are shown, where competitors are the AE, the focal loss classifier (Hendrycks et al., 2019b), GEO+ (Hendrycks et al., 2019b), Deep SAD (Ruff et al., 2020b) and HSC (Ruff et al., 2020a). Results from the literature are marked with an asterisk.

G FURTHER QUALITATIVE ANOMALY HEATMAP RESULTS

In this section we report some further anomaly heatmaps, unblurred baseline heatmaps, as well as class-wise heatmaps for all datasets.

Unblurred Anomaly Heatmap Baselines Here we show unblurred baseline heatmaps for the figures in Section 4.1. Figures 12, 13, and 14 show the unblurred heatmaps for Fashion-MNIST, ImageNet, and CIFAR-10 respectively.

Class-wise heatmap panels (figures not reproduced here) are labeled by nominal class:
Fashion-MNIST: "t-shirt/top", "trouser", "pullover", "dress", "coat", "sandal", "shirt", "sneaker", "bag", "ankle boot".
MVTec-AD: "bottle", "cable", "capsule", "carpet", "grid", "hazelnut", "leather", "metal nut", "pill", "screw", "tile", "toothbrush", "transistor", "wood", "zipper".
ImageNet: "acorn", "airliner", "ambulance", "American alligator", "banjo", "barn", "bikini", "digital clock", "dragonfly", "dumbbell", "forklift", "goblet", "grand piano", "hotdog", "hourglass", "manhole cover", "mosque", "nail", "parking meter", "pillow", "revolver", "dial telephone", "schooner", "snowmobile", "soccer ball", "stingray", "strawberry", "tank", "toaster", "volcano".



https://en.wikipedia.org/wiki/Clever_Hans https://pytorch.org/docs/1.4.0/torchvision/transforms.html#torchvision.transforms.ColorJitter



Figure 3: Visualization of the overall procedure to produce full-resolution anomaly heatmaps with FCDD. X denotes the input, φ the network, A the produced anomaly heatmap and A′ the upsampled version of A using a transposed Gaussian convolution.

Figure 4: Anomaly heatmaps for anomalous test samples of a Fashion-MNIST model trained on nominal class "trousers" (nominal samples are shown in (a)). In (b) CIFAR-100 was used for OE and in (c) EMNIST. Columns are ordered by increasing anomaly score from left to right, i.e., from the anomaly that FCDD finds most nominal-looking (left) to the one it finds most anomalous-looking (right).

Figure 5: Anomaly heatmaps of an ImageNet model trained on nominal class "acorns." Here (a) are nominal samples and (b) are anomalous samples. In both (a) and (b), columns are ordered by increasing anomaly score from left to right, i.e., from the sample that FCDD finds most nominal-looking (left) to the one it finds most anomalous-looking (right).

Figure 7: Confetti noise.

Figure 8: Heatmaps for horses on PASCAL VOC. Here (a) shows anomalous samples ordered from most nominal to most anomalous from left to right, and (b) shows examples that indicate that the model is a "Clever Hans," i.e. has learned a characterization based on spurious features (watermarks).

Figure 8(b) shows that a one-class model is indeed also vulnerable to learning a characterization based on spurious features: the watermarks in the lower left corner receive high scores, whereas other regions receive low scores. We also observe that the model yields high scores for bars, grids, and fences in Figure 8(a). This is due to many images in the dataset containing horses jumping over bars or standing in fenced areas. In both cases, the horse features themselves do not attain the highest scores because the model has no way of knowing that the spurious features, while providing good discriminative power at training time, would not be desirable at deployment or test time. In contrast to traditional black-box models, however, transparent detectors like FCDD enable a practitioner to recognize and remedy (e.g., by cleaning or extending the training data) such behavior or other undesirable phenomena (e.g., to avoid unfair social bias).

Figure 9: Anomaly heatmaps for three anomalous test samples on CIFAR-10 models trained on nominal class "airplane." We grow the receptive field size from 18 (left) to 32 (right).

Figure 10: Anomaly heatmaps for seven anomalous test samples of MVTec-AD. We grow the receptive field size from 53 (left) to 243 (right).
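The receptive field sizes varied in Figures 9 and 10 follow from the kernel sizes and strides of the stacked convolutional and pooling layers. The following is a minimal sketch of the standard recurrence for undilated layers; the function name and the example layer list are hypothetical and do not correspond to the FCDD architectures.

```python
def receptive_field(layers):
    """Compute (receptive field size, cumulative stride) of one output pixel.

    `layers` is a sequence of (kernel_size, stride) pairs ordered from input
    to output. Dilation is ignored in this sketch.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) input-space steps,
        # where one step spans `jump` input pixels.
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf, jump


# Hypothetical stack: 3x3 conv, 2x2 max-pool with stride 2, 3x3 conv.
example = [(3, 1), (2, 2), (3, 1)]
rf, total_stride = receptive_field(example)  # rf == 8, total_stride == 2
```

Adding layers or increasing kernel sizes grows the receptive field, so each heatmap pixel depends on a larger input region.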

Figure 11: Anomaly heatmaps for seven anomalous test samples of MVTec-AD. We grow σ from 4 (left) to 16 (right).

Figure 12: Anomaly heatmaps for anomalous test samples of a Fashion-MNIST model trained on nominal class "trousers." In (a) CIFAR-100 was used for OE and in (b) EMNIST.

Figure 16: Anomaly heatmaps for anomalous test samples in Fashion-MNIST using EMNIST OE. Columns are ordered by increasing anomaly score from left to right. The subcaptions refer to the nominal class that each model is trained on, for which some examples are also displayed as a separate column on the left.

Figure 18: Anomaly heatmaps for anomalous test samples in MVTec-AD. Columns are ordered by increasing anomaly score from left to right. The subcaptions refer to the nominal class that each model is trained on.

Figure 19: Anomaly heatmaps for anomalous test samples in ImageNet, where classes 1-21 are shown. Columns are ordered by increasing anomaly score from left to right. The subcaptions refer to the nominal class that each model is trained on, for which some examples are also displayed as a separate column on the left.

Figure 20: Anomaly heatmaps for anomalous test samples in ImageNet, where classes 22-30 are shown. Columns are ordered by increasing anomaly score from left to right. The subcaptions refer to the nominal class that each model is trained on, for which some examples are also displayed as a separate column on the left.


Mean AUC (over all classes and 5 seeds per class) for CIFAR-10 and neural networks with varying receptive field size.

Pixel-wise mean AUC (over all classes and 5 seeds per class) for MVTec-AD and neural networks with varying receptive field size.

Pixel-wise mean AUC (over all classes and 5 seeds per class) for MVTec-AD and different σ.

AUC scores for all classes of Fashion-MNIST (Xiao et al., 2017).

AUC scores for all classes of CIFAR-10 (Krizhevsky et al., 2009).

AUC scores for 30 classes of ImageNet (Deng et al., 2009).

ACKNOWLEDGEMENTS

MK, PL, and BJF acknowledge support by the German Research Foundation (DFG) award KL 2698/2-1 and by the German Federal Ministry of Science and Education (BMBF) awards 01IS18051A, 031B0770E, and 01MK20014U. LR acknowledges support by the German Federal Ministry of Education and Research (BMBF) in the project ALICE III (01IS18049B). RV acknowledges support by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) sponsored by the German Federal Ministry of Education and Research (BMBF). KRM was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea Government (No. 2017-0-00451 and 2019-0-00079) and was partly supported by the German Federal Ministry of Education and Research (BMBF) for the Berlin Center for Machine Learning (01IS18037A-I) and under the Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A, and 031L0207A-D; the German Research Foundation (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689. Finally, we thank all reviewers for their constructive feedback, which helped to improve this work.

availability

://github.

annex

For a set of samples $\mathcal{X} = \{X_1, \dots, X_n\}$ with corresponding heatmaps $\mathcal{A} = \{A'(X_1), \dots, A'(X_n)\}$, the normalized heatmap $I(X)$ for some $A'(X)$ is computed as

$$I(X)_j = \min\left\{ \frac{A'(X)_j - \min(\mathcal{A})}{q_\eta\left(\{A' - \min(\mathcal{A}) \mid A' \in \mathcal{A}\}\right)},\; 1 \right\},$$

where $j$ denotes the $j$-th pixel and $q_\eta$ the $\eta$-th percentile over all pixels and examples in $\mathcal{A}$. The subtraction and min operation are applied on a pixel level, i.e., the minimum is extracted over all pixels and all samples of $\mathcal{A}$, and the subtraction is then applied elementwise. Using the $\eta$-th percentile might leave some of the values above 1, which is why we finally clamp the pixels at 1.

The specific choice of $\eta$ and the set of samples $\mathcal{X}$ differs per figure. We select them to highlight different properties of the heatmaps. In general, the lower $\eta$, the more red (anomalous) regions appear in the heatmaps, because more values are left above 1 (before clamping to 1), and vice versa. The choice of $\mathcal{X}$ ranges from just one sample $X$, such that $A'(X)$ is normalized only w.r.t. its own scores (highlighting the most anomalous regions within the image), to the complete dataset (highlighting which regions look anomalous compared to the whole dataset). For the latter visualization we rebalance the dataset so that $\mathcal{X}$ contains an equal number of nominal and anomalous images to maintain consistent scaling. The choice of $\eta$ and $\mathcal{X}$ is consistent per figure. In the following we list the choices made for the individual figures.

Heatmap Upsampling. For the Gaussian kernel heatmap upsampling described in Algorithm 1, we set σ to 1.2 for CIFAR-10 and Fashion-MNIST, to 8 for ImageNet and Pascal VOC, and to 12 for MVTec-AD.
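The percentile-based heatmap normalization described in this section can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the function name, the array layout (n, H, W), and the handling of a constant input are assumptions made here, and η is passed as a fraction in [0, 1].

```python
import numpy as np


def normalize_heatmaps(heatmaps, eta=0.97):
    """Normalize a stack of heatmaps of shape (n, H, W) to the range [0, 1].

    Hypothetical helper: subtract the global minimum over all pixels and
    samples, divide by the eta-quantile of the shifted scores, clamp at 1.
    """
    heatmaps = np.asarray(heatmaps, dtype=np.float64)
    shifted = heatmaps - heatmaps.min()   # minimum over all pixels and samples
    scale = np.quantile(shifted, eta)     # eta-quantile over all pixels and samples
    if scale <= 0:
        # Assumption: a degenerate (e.g., constant) input maps to all zeros.
        return np.zeros_like(shifted)
    return np.minimum(shifted / scale, 1.0)
```

Passing a single heatmap (n = 1) normalizes it only with respect to its own scores, while passing a balanced collection of nominal and anomalous heatmaps gives a shared scale across samples. A lower `eta` clamps more pixels to 1 and therefore marks more regions as strongly anomalous.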

D DETAILS ON THE NETWORK ARCHITECTURES

Here we provide the complete FCDD network architectures we used on the different datasets.

Class-wise Anomaly Heatmaps. Due to space restrictions, we have only shown heatmaps for some of the classes in the main paper. Here we also report a collection of heatmaps for all classes.


We show heatmaps with adjusted contrast curves by setting X to the balanced set of all samples for all datasets in this section. Further, we set η = 0.85 for Fashion-MNIST and CIFAR-10, η = 0.99 for MVTec-AD, and η = 0.97 for ImageNet. Note that, to keep the heatmaps for different classes comparable, we use a unified normalization for all heatmaps in one figure. However, since a separate anomaly detector is trained for each class, this yields suboptimal visualizations for some of the classes (for example, the "toothbrush" images for MVTec-AD in Figure 18, where the heatmaps just show a huge red blob). Tweaking the normalization for such classes reveals that the heatmaps actually tend to mark the correct anomalous regions, which in the case of "toothbrush" can be seen in the explanation performance evaluation in Table 2.

The rows in all heatmaps show the following: (1) input samples, (2) FCDD heatmaps, (3) gradient heatmaps with HSC, and (4) autoencoder reconstruction heatmaps. Heatmaps for MVTec-AD add a fifth row containing the ground-truth anomaly map.

Heatmaps for Fashion-MNIST using auxiliary anomalies from CIFAR-100 are in Figure 15; those using EMNIST for OE instead are in Figure 16. CIFAR-10 heatmaps are in Figure 17, and heatmaps for all classes of MVTec-AD are in Figure 18. Finally, we present ImageNet heatmaps in Figures 19 and 20.

