ITERATIVE IMAGE INPAINTING WITH STRUCTURAL SIMILARITY MASK FOR ANOMALY DETECTION

Anonymous

Abstract

Autoencoders have emerged as popular methods for unsupervised anomaly detection. Autoencoders trained on normal data are expected to reconstruct only normal features, allowing anomaly detection by thresholding reconstruction errors. In practice, however, autoencoders fail to model small details and yield blurry reconstructions, which makes anomaly detection challenging. Moreover, there is an objective mismatch: models are trained to minimize the total reconstruction error, yet they are expected to show a small deviation on normal pixels and a large deviation on anomalous pixels. To tackle these two issues, we propose an iterative image inpainting method that reconstructs the partial regions specified by an adaptive inpainting mask matrix. The method constructs inpainting masks from an anomaly score based on structural similarity. With the inpainting mask overlaid on the image, each pixel is either bypassed or reconstructed according to its anomaly score, which enhances reconstruction quality. Alternately updating the inpainted images and the masks refines the anomaly score directly and follows the expected objective at test time. We evaluated the proposed method on the MVTec Anomaly Detection dataset. Our method outperformed the previous state of the art in several categories and showed remarkable improvement on high-frequency textures.

1. INTRODUCTION

Anomaly detection (AD) is the task of identifying rare events or items that differ from the majority of the data. It has many real-world applications, such as medical diagnosis (Baur et al., 2018; Zimmerer et al., 2019a), defect detection in factories (Matsubara et al., 2018; Bergmann et al., 2019), early detection of plant disease (Wang et al., 2019), and X-ray security detection in public spaces (Griffin et al., 2018). Because manual inspection by humans is slow, expensive, and error-prone, automating visual inspection is a popular application of artificial intelligence. When transferring knowledge from humans to machines, anomalous samples are scarce because anomalies occur rarely and because it is difficult to annotate and categorize the various defects beforehand. Therefore, AD methods typically take an unsupervised approach: they learn compact features of the data from normal samples and detect anomalies by thresholding an anomaly score that measures the deviation from the learned features. Deep neural networks (Goodfellow et al., 2016) are a popular choice for handling high-dimensional images and learning their features.
In this work, we focus on reconstruction-based unsupervised AD. This approach trains a model to reconstruct only normal data and classifies samples as normal or anomalous by thresholding the reconstruction errors (An & Cho, 2015). The architectures are based on deep neural networks such as deep autoencoders (Hinton & Salakhutdinov, 2006), variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), or autoencoders with generative adversarial networks (GANs) (Goodfellow et al., 2014). These models compress high-dimensional information onto a data manifold in a lower-dimensional latent space by reconstructing the input data under certain constraints on the latent space, such as a prior distribution or an information bottleneck (Alemi et al., 2016).
One issue with the reconstruction-based AD approach is that autoencoders fail to model small details and yield blurry reconstructions. This is especially the case for high-frequency textures, such as carpet, leather, and tile (Bergmann et al., 2019). Dehaene et al. (2020) also pointed out that there is no guarantee that the behavior of autoencoders generalizes to out-of-sample inputs, and that local defects added to normal images can deteriorate the reconstruction of whole images. From the viewpoint of the signal-to-noise ratio (SNR), we point out an additional issue: the gap between the function optimized at training and the function evaluated at testing. Rethinking the goal of unsupervised AD, we conclude that it is not merely to minimize reconstruction errors but to maximize the SNR. Although models are trained to minimize a sample-wise reconstruction error, they are expected to show a large deviation on anomalous pixels and a small deviation on normal pixels at testing.
In this paper, we propose I3AD (Iterative Image Inpainting for Anomaly Detection). As shown in Figure 1, instead of a vanilla autoencoder, our method uses an inpainting model that encodes only unmasked regions and reconstructs masked regions. Once the reconstruction errors are computed, they are reused as the inpainting mask for the next iteration. We show that the iterative update enhances the reconstruction quality and satisfies the expected objective of maximizing the SNR at testing. Experiments and analysis on the MVTecAD dataset show that I3AD outperforms existing methods on nine categories and achieves an average improvement of 11.6% on texture categories.
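The alternating update of reconstruction and mask can be illustrated with a minimal sketch. This is not the authors' implementation: `inpaint_fn` is a hypothetical callable standing in for a trained inpainting model, the all-ones initial mask and the fixed `threshold` are assumptions, and a squared per-pixel error is swapped in for the structural-similarity score used in the paper.

```python
import numpy as np

def iterative_inpainting(image, inpaint_fn, n_iters=3, threshold=0.1):
    """Alternate between inpainting masked pixels and rebuilding the mask.

    image:      float array of shape (H, W)
    inpaint_fn: hypothetical callable (image, mask) -> reconstruction of the
                same shape; a stand-in for a trained inpainting model
    """
    mask = np.ones_like(image)   # assumption: first pass reconstructs every pixel
    score = np.zeros_like(image)
    for _ in range(n_iters):
        recon = inpaint_fn(image, mask)
        # Masked pixels (mask == 1) are reconstructed; the rest are bypassed.
        blended = mask * recon + (1.0 - mask) * image
        # Stand-in anomaly score: squared error instead of structural similarity.
        score = (image - blended) ** 2
        # The score is reused as the inpainting mask for the next iteration.
        mask = (score > threshold).astype(image.dtype)
    return score, mask

def toy_inpaint(image, mask):
    # Toy stand-in for a model trained on normal data: always predicts 0.5.
    return np.full_like(image, 0.5)

image = np.full((8, 8), 0.5)
image[2:4, 2:4] = 1.0            # synthetic 2x2 defect
score, mask = iterative_inpainting(image, toy_inpaint)
```

In this toy setting the mask shrinks to the defect region after the first pass, so later iterations reconstruct only the suspicious pixels and leave the normal pixels untouched.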

2. METHODOLOGY

2.1 HIGH-LEVEL IDEA

We consider unsupervised AD using autoencoders. Here, we implicitly assume that anomalies appear in partial regions and that pixels in the surrounding regions obey the distribution of the normal dataset. Therefore, the sample-wise anomaly score based on reconstruction errors is the sum over two types of pixels: (1) pixels in normal regions (normal noise) and (2) pixels in anomalous regions (anomaly signal). An ideal model would have zero error on normal regions and distinguishable per-pixel scores on anomalous regions, leading to a high SNR. Inheriting the vanilla autoencoder architecture does not help resolve the low-SNR issue mentioned by Bergmann et al. (2019) and Dehaene et al. (2020). Indeed, autoencoders are forced to encode whole images, including the anomalous pixels of local defects, and to decode whole images, including the fine structures of normal background pixels. They never learn how to encode unseen anomalous pixels, so that anomalous information can affect the decoding of the whole image.
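The decomposition of the sample-wise score described above can be written out explicitly. The notation here is an illustrative assumption rather than the paper's own: e_i denotes the reconstruction error of pixel i, and Ω_N and Ω_A denote the sets of normal and anomalous pixels. The ratio form of the SNR is one plausible reading of the text, not a definition given in this section.

```latex
A(x) \;=\; \underbrace{\sum_{i \in \Omega_N} e_i}_{\text{normal noise}}
      \;+\; \underbrace{\sum_{i \in \Omega_A} e_i}_{\text{anomaly signal}},
\qquad
\mathrm{SNR} \;\approx\; \frac{\sum_{i \in \Omega_A} e_i}{\sum_{i \in \Omega_N} e_i}
```

Under this reading, an ideal model drives the first sum to zero while keeping the second sum large, which is what a high SNR expresses. Minimizing A(x) as a whole during training does not distinguish between the two terms.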



Figure 1: Autoencoder vs. I3AD (ours). The autoencoder fails to reconstruct the high-frequency texture of "carpet" and yields a large residual map. Our method uses the residual map as an inpainting mask and iteratively updates the reconstructed image and the residual map.

