ITERATIVE IMAGE INPAINTING WITH STRUCTURAL SIMILARITY MASK FOR ANOMALY DETECTION

Anonymous

Abstract

Autoencoders have emerged as popular methods for unsupervised anomaly detection. Autoencoders trained on normal data are expected to reconstruct only normal features, allowing anomaly detection by thresholding reconstruction errors. In practice, however, autoencoders fail to model small details and yield blurry reconstructions, which makes anomaly detection challenging. Moreover, there is an objective mismatch: models are trained to minimize the total reconstruction error, while at test time we expect a small deviation on normal pixels and a large deviation on anomalous pixels. To tackle these two issues, we propose an iterative image inpainting method that reconstructs partial regions under an adaptive inpainting mask matrix. The method constructs inpainting masks from the anomaly score of structural similarity. By overlaying the inpainting mask on an image, each pixel is either bypassed or reconstructed depending on its anomaly score, enhancing reconstruction quality. Iteratively updating the inpainted images and masks in turn purifies the anomaly score directly and follows the expected objective at test time. We evaluated the proposed method on the MVTec Anomaly Detection dataset. Our method outperformed the previous state of the art in several categories and showed remarkable improvement on high-frequency textures.

1. INTRODUCTION

Anomaly detection (AD) is the task of identifying rare events or items that differ from the majority of the data. It has many real-world applications, such as medical diagnosis (Baur et al., 2018; Zimmerer et al., 2019a), defect detection in factories (Matsubara et al., 2018; Bergmann et al., 2019), early detection of plant disease (Wang et al., 2019), and X-ray security screening in public spaces (Griffin et al., 2018). Because manual inspection by humans is slow, expensive, and error-prone, automating visual inspection is a popular application of artificial intelligence. In transferring knowledge from humans to machines, there is a shortage of anomalous samples due to their low event rate, and it is difficult to annotate and categorize the various anomalous defects beforehand. Therefore, AD methods typically take unsupervised approaches that learn compact features from normal samples and detect anomalies by thresholding an anomaly score that measures the deviation from the learned features. To handle high-dimensional images and learn their features, deep neural networks are commonly used (Goodfellow et al., 2016). In this work, we focus on reconstruction-based unsupervised AD, which attempts to reconstruct only the normal data and classifies data as normal or anomalous by thresholding reconstruction errors (An & Cho, 2015). The architectures are based on deep neural networks such as deep autoencoders (Hinton & Salakhutdinov, 2006), variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), or autoencoders with generative adversarial networks (GANs) (Goodfellow et al., 2014). These models compress high-dimensional information onto a data manifold in a lower-dimensional latent space by reconstructing input data under certain constraints on the latent space, such as a prior distribution or an information bottleneck (Alemi et al., 2016).
One issue with the reconstruction-based AD approach is that autoencoders fail to model small details and yield blurry image reconstructions. This is especially the case for high-frequency textures, such as carpet, leather, and tile (Bergmann et al., 2019). Dehaene et al. (2020) also pointed out that there is no guarantee that their behavior generalizes to out-of-distribution samples, and that local defects added to normal images can deteriorate the reconstruction of whole images. From the viewpoint of the signal-to-noise ratio (SNR), blurry reconstruction makes anomaly signals (reconstruction errors on anomalous pixels) unclear and increases normal noise (reconstruction errors on normal pixels). Since the SNR governs the feasibility of AD by thresholding a sample-wise reconstruction error, a low SNR makes AD challenging. We point out an additional issue: the gap between the function optimized at training and the function evaluated at testing. Rethinking our goal in unsupervised AD, it is not merely to minimize reconstruction errors but to maximize the SNR. Although models are trained to minimize a sample-wise reconstruction error, they are expected to produce a large deviation on anomalous pixels and a small deviation on normal pixels at test time. In this paper, we propose I3AD (Iterative Image Inpainting for Anomaly Detection). As shown in Figure 1, our method utilizes an inpainting model that only encodes unmasked regions and reconstructs masked regions, instead of a vanilla autoencoder. Once reconstruction errors are computed, they are recycled as the inpainting mask for the next iteration. We show that this iterative update enhances the reconstruction quality and satisfies the expected objective of maximizing the SNR at test time. Experiments and analysis on the MVTecAD dataset show that I3AD outperforms existing methods on nine categories, with an average improvement of +11.6% on the texture categories.

2.1. HIGH-LEVEL IDEA

We consider unsupervised AD using autoencoders. We implicitly assume that anomalies appear in partial regions and that pixels in the surrounding regions follow the distribution of the normal dataset. Therefore, the sample-wise anomaly score based on reconstruction errors is a summation over two types of pixels: (1) pixels of normal regions (normal noise) and (2) pixels of anomalous regions (anomaly signals). An ideal model has zero error on normal regions and distinguishable per-pixel scores on anomalous regions, leading to a high SNR. Inheriting the vanilla autoencoder architecture does not resolve the low-SNR issue noted by Bergmann et al. (2019) and Dehaene et al. (2020). Indeed, autoencoders are forced to encode whole images including the anomalous pixels of local defects, and attempt to decode whole images including the normal background pixels of fine structures. They never learn how to encode unseen anomalous pixels, and that anomalous information can affect the decoding of the whole image. One approach to resolve the issue is a combination of a per-pixel identity function and a conditional autoencoder. Compared to vanilla autoencoders, a conditional autoencoder can encode only normal regions and decode only anomalous regions, while the per-pixel identity function copies the remaining unreconstructed regions. This architecture is exactly an image inpainting model: deep inpainting models are conditional autoencoders that encode unmasked regions and fill in masked regions given a mask matrix (Yu et al., 2019). However, using image inpainting for AD falls into a tautology trap: we do not know the perfect inpainting mask matrix in advance, yet detecting anomalous regions is the main goal. The key ideas to disentangle this tautology are mask generation from the anomaly score and iterative update of the mask matrix.
Updating the inpainting mask matrix dynamically controls the balance between encoded and decoded information according to a pixel-wise confidence level derived from the anomaly score. The generator gradually receives more information on potentially normal pixels and focuses on the suspected pixels during the iterations. This process not only reduces background noise but also improves the SNR directly.
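To make the loop above concrete, the following is a minimal runnable sketch of the test-time iteration. The callables `generator(img, mask)` (copies pixels where mask is 1, fills pixels where mask is 0) and `ssim_map(a, b)` (per-pixel similarity) are our placeholder assumptions, not the paper's SN-PatchGAN network, and the two-checkerboard initialization is a simplification of the four-mask scheme described later:

```python
import numpy as np

def iterative_inpaint(x0, generator, ssim_map, u=0.5, n_iters=10):
    """Sketch of the I3AD test-time loop.

    generator(img, mask): copies pixels where mask == 1, fills mask == 0.
    ssim_map(a, b): per-pixel similarity score in [-1, 1].
    """
    # Initialization: reconstruct every pixel using two complementary
    # checkerboard masks (the paper uses four; two keep the sketch short).
    r, c = np.indices(x0.shape)
    board = ((r + c) % 2).astype(x0.dtype)
    x = (generator(x0 * board, board) * (1 - board)
         + generator(x0 * (1 - board), 1 - board) * board)
    mask = np.ones_like(x0)
    for _ in range(n_iters):
        score = ssim_map(x0, x)                  # per-pixel anomaly score
        mask = (score >= u).astype(x0.dtype)     # 1 = trusted normal pixel
        # per-pixel identity on trusted pixels, inpainting on the rest
        x = x0 * mask + generator(x0 * mask, mask) * (1 - mask)
    return x, mask
```

With a well-trained generator, normal pixels quickly leave the mask while defective pixels stay masked and are replaced by plausible normal content.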

2.2. ITERATIVE IMAGE INPAINTING FOR ANOMALY DETECTION

Following the above discussion, we construct our I3AD method from an inpainting generator and a mask generation module; the mask generation module is detailed in the next subsection. Our model overview is depicted in Figure 2. We construct an inpainting generator using conditional generative adversarial networks (cGANs) (Isola et al., 2017) and train the networks on a general image inpainting task over a normal dataset. We feed normal images partially hidden by randomly generated masks into the generator network and train it to decode the masked pixels from the unmasked pixels and the corresponding Boolean mask matrix. A discriminator network distinguishes generated images from normal images. The generator is rewarded for fooling the discriminator, while the discriminator is rewarded for detecting generated images. This training can be viewed as a two-player min-max game between the generator and the discriminator. As a result, the inpainting model seeks the optimal point of the loss function

$$\min_G \max_D \; \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{x \sim P_{\mathrm{data}}(x),\, M \sim P(M)}[\log(1 - D(G(\tilde{x}, M)))], \qquad \tilde{x} = M \odot x,$$

where $x$ denotes real samples from the data distribution $P_{\mathrm{data}}(x)$ and $\tilde{x}$ their masked versions, $\odot$ is the element-wise product, $M$ is the corresponding Boolean mask matrix generated from the random distribution $P(M)$, $G(\tilde{x}, M)$ is an image inpainting network that takes an incomplete image and its mask matrix, and $D(x)$ is a binary classifier deciding whether an image is generated or real. We borrow and customize the Spectral-Normalized Markovian GAN (SN-PatchGAN) architecture following Yu et al. (2019). It consists of two networks: a coarse-to-fine generator with an attention module and gated convolutions, and a spectral-normalized Markovian (patch) discriminator. Our I3AD is expected to handle masks with finer structure than the usual free-form masks.
Therefore, to better handle irregular masks, we apply the self-attention module (Zhang et al., 2018) instead of the contextual attention module originally designed for large rectangular masks as described in Yu et al. (2018; 2019). To stabilize GAN training, we adopt spectral normalization (Miyato et al., 2018) for the discriminator's layers. As an approximation of the min-max objective (Miyato et al., 2018), we derive the loss functions for the generator $L_G$ and the discriminator $L_D$:

$$L_G = -\mathbb{E}_{x \sim P_{\mathrm{data}}(x),\, M \sim P(M)}[D_{\mathrm{sn}}(G(M \odot x, M))]$$
$$L_D = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\mathrm{ReLU}(1 - D_{\mathrm{sn}}(x))] + \mathbb{E}_{x \sim P_{\mathrm{data}}(x),\, M \sim P(M)}[\mathrm{ReLU}(1 + D_{\mathrm{sn}}(G(M \odot x, M)))],$$

where $D_{\mathrm{sn}}(x)$ denotes the spectral-normalized discriminator and ReLU is the Rectified Linear Unit activation function, defined by $\mathrm{ReLU}(x) = \max(0, x)$. For the generator network, we also use a spatially discounted $\ell_1$ reconstruction loss (Yu et al., 2018). At the test step, we fix all trainable parameters of the I3AD generator. The generator receives test images, which may be normal or anomalous, together with adaptive mask matrices. The mask matrices are constructed from the pixel-wise reconstruction errors between the original images and the images generated at the previous iteration step. Since the mask matrices are dynamically updated and shrunk during the iterations, the I3AD generator inpaints the masked regions intensively, leveraging the gradually increasing information from the surrounding unmasked regions.
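The two hinge losses above can be written in a few lines. This is a minimal NumPy sketch over batches of discriminator outputs; the function names are ours and the real training loop would of course operate on network tensors:

```python
import numpy as np

def relu(z):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, z)

def generator_loss(d_fake):
    """L_G = -E[D_sn(G(M ⊙ x, M))]: the generator pushes fake scores up."""
    return -np.mean(d_fake)

def discriminator_loss(d_real, d_fake):
    """L_D = E[ReLU(1 - D_sn(x))] + E[ReLU(1 + D_sn(G(M ⊙ x, M)))].

    Hinge loss: real samples are only penalized below margin +1,
    fake samples only above margin -1.
    """
    return np.mean(relu(1.0 - d_real)) + np.mean(relu(1.0 + d_fake))
```

Note how the hinge form stops penalizing a real sample once its score exceeds +1, which is what distinguishes it from the original log-loss objective.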

2.3. INPAINTING MASK OF STRUCTURAL SIMILARITY (SSIM MASK)

In anomaly segmentation tasks, the structural similarity (SSIM) index (Wang et al., 2004) sharply measures small anomalous changes (Bergmann et al., 2018); details of the SSIM calculation are given in Appendix A. We propose the structural similarity mask (SSIM-Mask) to mask anomalous pixels during the test iterations. The SSIM-Mask $M_i$ is a binary mask obtained by thresholding the pixel-wise SSIM index between the input image $x_0$ and the reconstructed image $\hat{x}_i$ at the $i$-th iteration step:

$$M_i = \begin{cases} 1 & \text{if } a_i(x) \ge u \\ 0 & \text{otherwise} \end{cases}, \qquad a_i(x) = \mathrm{SSIM}(x_0, \hat{x}_i), \qquad \hat{x}_i = G(\hat{x}_{i-1}, M_{i-1}),$$

where $u$ denotes the threshold level for the binary decision. After $N$ iterations, we use the $N$-th SSIM anomaly score $a_N(x)$ for AD evaluation.

2.4. MASK INITIALIZATION

Since we have no mask information at the first test iteration, we initialize with four checkerboard matrices; Figure 6 shows examples. The generator encodes pixels of test images in the white regions and decodes pixels in the black boxes. The black regions are mutually exclusive across the four masks and jointly cover the target image, so we combine the four generated images into a single whole reconstruction.
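The four-mask initialization can be generated mechanically. Below is a small sketch, assuming the convention that each mask hides one of the four patch positions in every 2x2 super-cell of patches (the function name and the exact tiling convention are our assumptions; only the "four mutually exclusive checkerboards" property comes from the text):

```python
import numpy as np

def checkerboard_masks(h, w, patch=6):
    """Four mutually exclusive initialization masks (1 = encode, 0 = inpaint).

    Each mask hides one of the four patch positions in every 2x2 patch
    super-cell, so the hidden (black) regions are disjoint across the
    four masks and jointly cover the whole image.
    """
    r, c = np.indices((h, w))
    pi, pj = (r // patch) % 2, (c // patch) % 2  # patch-grid parity
    masks = []
    for k in range(4):
        black = (pi == k // 2) & (pj == k % 2)   # region to inpaint
        masks.append((~black).astype(np.float32))
    return masks
```

Because the black regions partition the image, stitching the four generated outputs yields a full reconstruction in which no pixel was copied from the input.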

2.5. ITERATION STEPS AND STOP CRITERIA

We expect no masked region for normal images and some locally masked regions for anomalous images. Therefore, applying iterative inpainting to samples from the training dataset lets us estimate how many iteration steps suffice to remove masks on normal pixels. Since I3AD decodes only masked regions, $M_{i+1}$ is almost always a subset of $M_i$. We can therefore set an early-stopping criterion on whether the difference between $M_i$ and $M_{i+1}$ is small relative to the masked region of $M_i$.
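The stop criterion above can be sketched as a relative-change test between consecutive masks. The tolerance value and the function name are our assumptions:

```python
import numpy as np

def should_stop(mask_prev, mask_next, tol=0.05):
    """Stop when the masked (0) region has nearly converged.

    Masks are binary arrays with 0 = masked / to-inpaint, 1 = trusted.
    Returns True when the change in masked pixels is at most `tol`
    relative to the previously masked area.
    """
    prev_masked = (mask_prev == 0)
    next_masked = (mask_next == 0)
    if prev_masked.sum() == 0:
        return True                       # nothing left to inpaint
    changed = np.logical_xor(prev_masked, next_masked).sum()
    return changed / prev_masked.sum() <= tol
```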

3. RESULTS

Table 1: Results for anomaly detection on the MVTecAD dataset, expressed as the AUROC on sample-wise reconstruction errors for different autoencoders and datasets. We compare the vanilla L2 and SSIM autoencoders (Bergmann et al., 2018) and their iterative projection method (Dehaene et al., 2020) as baselines. Bold font is the best AUROC in each category, and a light-blue background indicates the best method measured by the average AUROC on the L2 and SSIM anomaly scores.

I3AD

We train the I3AD generator for 500,000 iterations with 10 images per batch. We use two Adam optimizers (Kingma & Ba, 2014) with learning rates of 0.0001 for the generator and 0.0004 for the discriminator. Random brush-stroke masks are generated by the algorithm proposed by Yu et al. (2019). For mask generation, we apply four checkerboard matrices with 6-by-6 patches as mask initialization. We choose the threshold hyperparameter u so that the generator achieves low L1 reconstruction errors on normal training images: we apply the I3AD method to samples of the training dataset with u varied from 0.10 to 0.50 in steps of 0.05, and select the minimal threshold that achieves small L1 reconstruction errors less than the order of 10 after 10 iteration steps. We apply 30 iterations for test images in all categories.

Baselines. As baseline models, we compare two autoencoders trained to minimize different reconstruction losses, L2 and SSIM. Similar to Bergmann et al. (2018), both autoencoder models share the same architecture, with the latent space dimensionality set to 100.

Evaluation. We compute the Area Under the Receiver Operating Characteristic curve (AUROC) to obtain a performance measure independent of any particular threshold. AUROC scores on the sample-wise L2 and SSIM reconstruction errors are computed for anomaly detection, and on pixel-wise errors for anomaly localization. For the SSIM index, the hyperparameters are a window size of 11 and α = β = γ = 1, as in Bergmann et al. (2018). All experiments were performed on an NVIDIA Quadro RTX 8000 GPU with Python 3.7, PyTorch 1.5, and CUDA 10.1.
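For reference, threshold-free AUROC evaluation can be computed from sample-wise anomaly scores via the rank-based (Mann-Whitney U) formulation. This NumPy sketch is our own illustration, not the paper's evaluation code (which could equally use an off-the-shelf routine such as scikit-learn's `roc_auc_score`):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen anomalous sample scores higher than a normal one.

    scores: anomaly scores (higher = more anomalous)
    labels: 1 = anomalous, 0 = normal
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks over ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos = labels.sum()
    n_neg = (labels == 0).sum()
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```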

3.2. QUALITATIVE AND QUANTITATIVE RESULTS

Table 1 shows the AUROC results for anomaly detection. The approach of Dehaene et al. (2020) works and improves the AUROC of both localization and detection on many datasets. However, there remains room for improvement in the categories where the base autoencoders were weak, especially textures. In anomaly detection, our I3AD outperforms the baselines on nine categories, with average improvements over the vanilla autoencoders of +11.1% on textures, +8.85% on objects, and +13.01% in total. Figure 3 shows qualitative results of the baseline L2 AE and our I3AD method; Figure 7 in Appendix G shows the remaining categories. The SSIM anomaly score is used for highlighting. I3AD clearly shows a large improvement in high-frequency categories such as tile, carpet, and zipper.

3.3. HYPERPARAMETER SENSITIVITY

In this section, we investigate the sensitivity of the AD results to the hyperparameters. Compared to typical inpainting tasks that must fill in various types of images, the inpainting task for AD is relatively easy, since the model only has to learn a single category. We therefore skip the hyperparameters of the inpainting networks and focus on those of the mask generation.

Mask initialization

We changed the patch size of the four checkerboard matrices. Figure 4 shows that different patch-size initializations differ in performance over the first few steps but converge to similar results after enough iterations. We verified that, without mask initialization, the generator reproduces anomalous pixels directly and yields a meaningless result. Table 2 shows that I3AD runs at a speed comparable to the iterative projection method.

Nalisnick et al. (2018) provide evidence that deep generative models may fail to assign a low likelihood to out-of-distribution data. Several approaches have been proposed to mitigate this issue, such as training on an auxiliary dataset of outliers (Hendrycks et al., 2018), training a background model (Ren et al., 2019), and the Likelihood Regret score obtained by fine-tuning on test images (Xiao et al., 2020). Since VAEs tend to generate blurry images, vanilla autoencoders are popular for local defect detection. Indeed, Matsubara et al. (2018) showed that the KL divergence term deteriorates anomaly sensitivity and proposed using only the reconstruction term for the anomaly map. Baur et al. (2018) used VAEs for unsupervised anomaly segmentation of brain MR scans, and the improvement from autoencoders to VAEs was limited.

4. RELATED WORKS

Adversarial training. Adversarial training is one approach to generate high-resolution images. Schlegl et al. (2017) proposed AnoGAN, which utilizes GANs for unsupervised AD. The random noise vector that generates an anomaly-free image close to a target image is not known in advance, so AnoGAN finds a latent noise sample by iterative updates minimizing the reconstruction errors and the semantic similarity of the discriminator's outputs. Since this calculation has the drawback of long runtimes, several studies have sought to reduce it. ADGAN (Deecke et al., 2018) updates both the input images and the generator's parameters. Several studies propose training an encoder network to find the expected random noise from an input image; their architectures resemble autoencoders or VAEs with an adversarial term (here referred to as "AEGAN" and "VAEGAN") (Larsen et al., 2016; Dumoulin et al., 2017; Donahue et al., 2017), differing mainly in training processes and loss terms. Efficient-GAN (Zenati et al., 2018) adopted ALI training (Dumoulin et al., 2017), in which the discriminator receives the joint pair of latent features and images. GANomaly (Akcay et al., 2018) combines three losses: an L1 loss between images, an L2 encoder loss between latent features, and an adversarial loss. f-AnoGAN (Schlegl et al., 2019) trains an encoder and a generator separately; they examined three architectures with respect to loss terms and showed that the "izif" architecture performed best. Skip-GANomaly (Akçay et al., 2019) inherits GANomaly and uses the same loss terms and anomaly scores. AVID (Sabokrou et al., 2018) is trained in the same way as vanilla AEGANs; it applies fully convolutional networks (FCNs) as the discriminator to capture regional information and define each region's regularity likelihood. Oktay et al. (2018) added attention modules to U-Net to focus on local regions and tested on 3D abdominal CT scans.
We tested the U-Net architecture on the MVTecAD dataset and verified several models by switching U-Net's four skip-connections on or off. As a result, we faced the learning of a trivial identity function: anomalous pixels are reconstructed whenever the model has the first or second skip-connection close to the input. I3AD can be regarded as an application of per-pixel skip-connections that transport only the regions the anomaly score assigns high confidence as normal pixels.

Iterative methods. In addition to the aforementioned iterative methods such as Likelihood Regret, AnoGAN, and ADGAN, Dehaene et al. (2020) proposed an iterative projection method on trained autoencoders to sharpen blurred images at test time. It iteratively updates an input sample's pixels to minimize reconstruction errors under a constraint on the distance from the original image. The method of Dehaene et al. (2020) relies on underlying autoencoders that encode and decode the whole image with back-propagation updates, whereas I3AD relies on conditional autoencoders with forward iterations. Thus, our approach efficiently updates high-resolution pixels and avoids reconstructing complex patterns in the normal background pixels. We summarize what differentiates our proposed method from previous AD methods below.

• While the aforementioned autoencoder-based methods encode and decode whole images, I3AD encodes unmasked regions and decodes masked regions.
• Skip-connections passing whole images can lead to a trivial identity function. I3AD learns per-pixel skip-connections that connect only the local regions the model assigns a high probability of being normal pixels.
• Previous iterative methods minimize the total reconstruction error. I3AD aims to minimize the reconstruction errors on normal pixels and maximize those on anomalous pixels, filling the gap between the training and testing objectives of unsupervised AD.
• Previous iterative methods require back-propagation over many iteration steps. Ours uses only forward iterations and efficiently updates high-resolution pixels.

5. DISCUSSIONS AND FUTURE WORKS

As future work, our I3AD could further leverage GAN techniques for unsupervised AD. For example, as in AVID, a PatchGAN discriminator score could be used to generate the iterative masks. I3AD could also be tested with different inpainting models: the coarse-to-fine network with an attention module performs remarkably well but requires a large network. To reduce the model parameters, speeding up inference and saving hardware costs, Sagong et al. (2019) propose weight sharing between the coarse and fine generator networks while maintaining performance.

6. CONCLUSION

In high-resolution images, autoencoders fail to model small details, which yields blurry image reconstructions. This is especially the case for high-frequency textures, such as carpet, leather, and tile. To tackle this issue, we propose an iterative image inpainting method that reconstructs partial regions adaptively. Our method utilizes the structural similarity measure to select inpainting regions, which corrects their structural differences and enhances reconstruction quality over the iteration steps. Our method outperforms state-of-the-art results on several categories of the MVTecAD dataset and shows an especially large improvement in the texture categories.

A ANOMALY SCORE OF STRUCTURAL SIMILARITY (SSIM INDEX)

For AD, a per-pixel error measure such as an $L_p$ distance is typically used as the anomaly score. The structural similarity (SSIM) metric (Wang et al., 2004) can also be employed to capture perceptual similarity (Bergmann et al., 2018; 2019). The SSIM index defines a structural similarity measure between two $K \times K$ image patches $p$ and $q$, taking into account their similarity in luminance $l(p, q)$, contrast $c(p, q)$, and structure $s(p, q)$:

$$\mathrm{SSIM}(p, q) = l(p, q)^{\alpha}\, c(p, q)^{\beta}\, s(p, q)^{\gamma},$$

where $\alpha, \beta, \gamma \in \mathbb{R}$ are hyperparameters. From the patches' mean intensities $\mu_p, \mu_q$, standard deviations $\sigma_p, \sigma_q$, and covariance $\sigma_{pq}$, the three measures are defined as

$$l(p, q) = \frac{2\mu_p \mu_q + c_1}{\mu_p^2 + \mu_q^2 + c_1}, \qquad c(p, q) = \frac{2\sigma_p \sigma_q + c_2}{\sigma_p^2 + \sigma_q^2 + c_2}, \qquad s(p, q) = \frac{2\sigma_{pq} + c_2}{2\sigma_p \sigma_q + c_2}.$$

The constants $c_1$ and $c_2$ ensure numerical stability and are typically set to $c_1 = 0.01$ and $c_2 = 0.03$. Substituting the three measures with equal weighting $\alpha = \beta = \gamma = 1$, the SSIM index reduces to

$$\mathrm{SSIM}(p, q) = \frac{(2\mu_p \mu_q + c_1)(2\sigma_{pq} + c_2)}{(\mu_p^2 + \mu_q^2 + c_1)(\sigma_p^2 + \sigma_q^2 + c_2)}.$$

We have $\mathrm{SSIM}(p, q) \in [-1, 1]$, with $\mathrm{SSIM}(p, q) = 1$ if and only if $p$ and $q$ are identical.
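The simplified SSIM formula above translates directly into code. This sketch computes the index for a single pair of patches with the equal-weighting constants stated in the text (a practical implementation would slide a window over the image and may scale the constants by the dynamic range, as library implementations such as scikit-image do):

```python
import numpy as np

def ssim_index(p, q, c1=0.01, c2=0.03):
    """SSIM between two patches under alpha = beta = gamma = 1:

    SSIM(p, q) = (2*mu_p*mu_q + c1)(2*sigma_pq + c2)
                 / ((mu_p^2 + mu_q^2 + c1)(sigma_p^2 + sigma_q^2 + c2))
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mu_p, mu_q = p.mean(), q.mean()
    var_p, var_q = p.var(), q.var()            # sigma^2 terms
    cov = ((p - mu_p) * (q - mu_q)).mean()     # sigma_pq
    return ((2 * mu_p * mu_q + c1) * (2 * cov + c2)) / \
           ((mu_p**2 + mu_q**2 + c1) * (var_p + var_q + c2))
```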

B MVTEC ANOMALY DETECTION DATASET

We performed experiments with the industrial MVTec Anomaly Detection (MVTecAD) dataset (Bergmann et al., 2019).

C ANOMALY LOCALIZATION

Table 3: Results for anomaly localization on the MVTecAD dataset, expressed as the AUROC on pixel-wise reconstruction errors for different autoencoders and datasets. As for anomaly detection, we compare the vanilla L2 and SSIM autoencoders (Bergmann et al., 2018) and their iterative projection method (Dehaene et al., 2020), together with AnoGAN (Deecke et al., 2018), f-AnoGAN (Schlegl et al., 2019), and AEGAN, the base architecture of several models (Zenati et al., 2018; Akcay et al., 2018). Bold font is the best AUROC in each category, and a light-blue background indicates the best method measured by the average AUROC on the L2 and SSIM anomaly scores. We also test the L2 + L_D anomaly score, which combines reconstruction errors and discriminator features as mentioned in Deecke et al. (2018). [Table header: Model — AnoGAN, f-AnoGAN, AEGAN, I3AD; Score — L2, SSIM, L2 + L_D.]

E ADDITIONAL COMPARISON BETWEEN GAN MODELS (2)

Table 5: Results for anomaly localization on the MVTecAD dataset, expressed as the AUROC on pixel-wise reconstruction errors. We compared AnoGAN (Deecke et al., 2018), f-AnoGAN (Schlegl et al., 2019), and AEGAN, the base architecture of several models (Zenati et al., 2018; Akcay et al., 2018). Bold font is the best AUROC in each category, and a light-blue background indicates the best method measured by the average AUROC on the L2 and SSIM anomaly scores.



Figure 1: Autoencoder vs. I3AD method (ours). The autoencoder fails to reconstruct the high-frequency texture of "carpet" and yields a large residual map. Ours utilizes the residual map as an inpainting mask and iteratively updates the reconstructed image and the residual map.

Figure 2: Model overview of iterative image inpainting method for AD (I3AD). At the training step, we solve a general image inpainting task in normal datasets. At the testing step, a generator receives test images with an adaptive mask and only reconstructs masked regions. This adaptive mask is updated from previous reconstruction results during iterations.

Figure 3: First row: normal samples of tile, carpet, transistor, screw, wood, and zipper in the MVTecAD dataset. Second row: anomalous samples from the same categories. Third row: anomaly map by the L2 autoencoder (Bergmann et al., 2019). Fourth row: anomaly map by our proposed I3AD method. Ground truth is outlined in red, and each estimated anomaly score is highlighted in green.

Figure 4: AUC change during iteration for different patch-size mask initializations. Both L2 and SSIM anomaly scores are computed. (a) Average AUC over the two anomaly scores; (b) AUC measured by L2; (c) AUC measured by SSIM.


Table 5 in Appendix E shows the AUROC results for anomaly localization. In both detection and localization, the baseline vanilla L2 and SSIM autoencoders perform well on the object datasets but struggle with high-frequency textures such as carpet, leather, tile, and zipper. Despite a different resolution from the original experiments, the approach of Dehaene et al. (2020) remains effective.

Table 2: Comparison of iteration speed.

Anomaly detection overview. Bergmann et al. (2019) introduced the MVTecAD dataset and conducted a thorough evaluation of traditional shallow models and recent state-of-the-art deep neural networks for unsupervised AD and segmentation tasks. They showed that the evaluated methods do not perform equally across data categories and that there is still room for improvement.

Autoencoders. Convolutional autoencoders are commonly used as a base architecture in unsupervised AD and are customized with various reconstruction and regularization losses. Bergmann et al. (2018) apply the SSIM index as both reconstruction loss and anomaly map. Zimmerer et al. (2019b) show that loss gradients are useful anomaly scores that improve the AUC on unsupervised pixel-wise tumor detection. Zimmerer et al. (2019a) propose the context-encoding VAE (ceVAE) for unsupervised AD. Context encoding (Pathak et al., 2016) is a special class of denoising autoencoders (Vincent et al., 2010) that learns to inpaint random masks instead of denoising additive Gaussian noise.

Venkataramanan et al. added an attention expansion loss for unsupervised and weakly-supervised AD.

Skip-connections. Skip-connections are another approach to model small details. U-Net (Ronneberger et al., 2015) is widely applied to cardiac MR, brain tumor, and abdominal CT images in supervised segmentation tasks. Akçay et al. (2019) and Sabokrou et al. (2018) replace the autoencoder with a U-Net inside an AEGAN for unsupervised anomaly detection; Skip-GANomaly (Akçay et al., 2019) follows this design.

The MVTecAD dataset contains 5,354 high-resolution color images across ten object and five texture categories. All image resolutions lie between 700 × 700 and 1024 × 1024 pixels. The dataset provides 3,629 images for training and validation and 1,725 images for testing; at test time, there are 467 defect-free images and 1,258 defect images. The dataset includes 73 different defect types, such as defects on an object's surface, structural defects, or the absence of certain parts. These anomalies are manually generated to resemble realistic anomalies as they occur in real-world industrial inspection scenarios. Bergmann et al. (2019) use several resolutions for objects and textures depending on the model. For AnoGAN, both training and testing images are resized to 128 × 128 pixels for the object categories; for textures, 128 × 128 patches are extracted from images resized to 512 × 512 pixels. The L2 and SSIM autoencoders reconstruct patches of 128 × 128 pixels for textures and 256 × 256 pixels for objects.


Results for anomaly detection on the MVTecAD dataset, expressed as the AUROC on sample-wise reconstruction errors for different models. We compared AnoGAN (Deecke et al., 2018), f-AnoGAN (Schlegl et al., 2019), and AEGAN.

We compared GAN-based anomaly detection and localization models. We use the same network architecture as our I3AD model; the only difference is whether the input images are masked. AnoGAN takes 500 iterations at test time to find the latent random noise. f-AnoGAN has two training steps: the encoder is optimized after training the generator.

Partial table rows recovered from extraction (category followed by per-model AUROC values; the column headers and the first row's category label were lost):

… 0.754 0.684 0.723 0.741 0.739 0.819 0.795
capsule 0.778 0.859 0.740 0.869 0.857 0.913 0.733 0.854
hazelnut 0.885 0.946 0.849 0.955 0.884 0.955 0.664 0.756
metal nut 0.747 0.733 0.738 0.749 0.849 0.854 0.515 0.526
pill 0.805 0.883 0.692 0.752 0.872 0.922 0.649 0.725
screw 0.789 0.925 0.786 0.898 0.857 0.947 0.852 0.959
toothbrush 0.873 0.931 0.881 0.925 0.925 0.956 0.946 0.969
transistor 0.704 0.768 0.814 0.843 0.814 0.843 0.596 0.651
zipper 0.710 0.734 0.702 0.727 0.734 0.743 0.854 0.962


We summarize the base model architectures used in recent reconstruction-based approaches. This is not an all-inclusive list; several studies propose additional losses and modules, with different anomaly score measures, on top of these base architectures.

I RELATED WORKS OF IMAGE INPAINTING

In computer vision, there has recently been much research on feed-forward generative models with convolutional networks, especially conditional generative adversarial networks (cGANs) (Isola et al., 2017). Iizuka et al. (2017) proposed a model with global and local consistency to handle high-resolution images. To handle irregular masks, Liu et al. (2018) proposed a partial convolution layer in which the convolutional weights are re-normalized by the number of valid pixels from the mask matrix. To produce higher-quality inpainting, Yu et al. (2018) proposed a two-stage architecture with coarse and refinement networks and a contextual attention module in the refinement network. The contextual attention module captures long-range spatial dependencies between masked regions and their surroundings, allowing models to fill large rectangular masks. To handle free-form masks, Yu et al. (2019) proposed gated convolution, which extends partial convolution and performs pixel normalization and mask updates with trainable weights in the network layers.

J ANALOGY OF SSIM-MASK WITH PARTIAL CONVOLUTION LAYER

In the inpainting task, a convolutional layer in the generator network weights both valid pixels in the unmasked region and invalid pixels in the masked region. Liu et al. (2018) proposed the partial convolution layer to adapt to irregularly shaped masks and re-normalize the dependence on valid pixels. It is computed by

$$O_{x,y} = \begin{cases} W^{\top}(X \odot M)\, \dfrac{\mathrm{sum}(\mathbf{1})}{\mathrm{sum}(M)} + b & \text{if } \mathrm{sum}(M) > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $W$ denotes the convolution filter weights for the layer, $X$ the feature values in the current sliding window, $M$ the corresponding binary mask, $b$ the bias, and $O_{x,y}$ the output feature. The binary mask $M$ is also updated by a rule in each layer:

$$M^{l+1} = \begin{cases} 1 & \text{if } \mathrm{sum}(M^{l}) > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Our I3AD passes through the forward model iteratively, which can be interpreted as a very deep generative model. Our SSIM-Mask then corresponds to a kind of partial convolution whose mask update is extended to a non-linear structural-similarity rule.
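A single sliding-window step of the partial convolution and its mask update can be sketched as follows. This is our illustration of the rule from Liu et al. (2018), for one window only; a real layer would slide this over the whole feature map:

```python
import numpy as np

def partial_conv_step(X, M, W, b=0.0):
    """One partial-convolution output for a single sliding window.

    X: feature values in the window; M: binary mask (1 = valid);
    W: filter weights; b: bias.
    Re-normalizes by sum(1)/sum(M) over valid pixels and returns
    (output value, updated mask bit).
    """
    valid = M.sum()
    if valid == 0:
        return 0.0, 0                       # no valid pixels: output 0, mask stays 0
    out = (W * (X * M)).sum() * (M.size / valid) + b
    return out, 1                           # any valid pixel: mask becomes 1
```

The re-normalization makes the response invariant to how many window pixels happen to be masked, which is exactly the property the SSIM-Mask analogy relies on.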

