AE-FLOW: AUTOENCODERS WITH NORMALIZING FLOWS FOR MEDICAL IMAGES ANOMALY DETECTION

Abstract

Anomaly detection from medical images is an important task for clinical screening and diagnosis. In clinical practice, a large dataset of normal images is generally available, while only a few abnormal images can be collected. By mimicking the diagnosis process of radiologists, we attempt to tackle this problem by learning a tractable distribution of normal images and identifying anomalies by comparing the original image with its reconstructed normal counterpart. More specifically, we propose a normalizing flow-based autoencoder for an efficient and tractable representation of normal medical images. The anomaly score combines the likelihood derived from the normalizing flow with the reconstruction error of the autoencoder, which allows abnormality to be identified and provides interpretability at both the image and pixel levels. Experimental evaluation on four medical image datasets and one non-medical image dataset showed that the proposed model outperformed the other approaches by a large margin, validating the effectiveness and robustness of the proposed method.

1. INTRODUCTION

Medical anomaly detection (Taboada-Crispi et al., 2009; Fernando et al., 2021) is an important task in clinical screening and diagnosis that captures distinctive features in collected biomedical data, such as medical images, electrical biomedical signals, or other laboratory results. Anomaly detection aims to detect data that deviate significantly from the majority of data instances; it arises in clinical applications because of the imbalance between normal and abnormal data and the variability of anomalies in real-world scenarios. Unlike the usual classification models used for computer-aided diagnosis, anomaly detection is typically considered in an unsupervised or semi-supervised paradigm. In this paper, we mainly focus on anomaly detection from medical images, mimicking the diagnosis process of radiologists. Traditional techniques for finding anomalies fall into several categories, such as statistics-based methods (Hido et al., 2011; Rousseeuw & Hubert, 2011), distance-based methods (Knorr et al., 2000; Angiulli et al., 2005), density-based methods (Breunig et al., 2000), and clustering-based methods (Yang et al., 2009; Al-Zoubi, 2009). Deep learning for anomaly detection (Wang et al., 2019; Chalapathy & Chawla, 2019; Pang et al., 2021), also known as deep anomaly detection, typically learns a feature representation model of normal images and constructs an anomaly score function for abnormal images with neural networks. In the literature, there are two main types of approaches for image anomaly detection: reconstruction-based models and likelihood-based models. The standard procedure of reconstruction-based methods is to first learn an autoencoder (AE) (Kramer, 1991) or a generative model (Goodfellow et al., 2014) for normal images; the difference between the test image and the image reconstructed (generated) by the representation network can then be used to characterize the level of anomaly.
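The reconstruction-based procedure above can be sketched as follows. This is a minimal illustration, not the paper's model: the hypothetical `reconstruct` callable stands in for a trained autoencoder, and images are flattened lists of pixel values.

```python
# Minimal sketch of reconstruction-based anomaly scoring.
# `reconstruct` is a placeholder for a trained autoencoder.

def mse(original, reconstructed):
    """Pixel-wise mean squared error between two flattened images."""
    assert len(original) == len(reconstructed)
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def anomaly_score(image, reconstruct):
    """High reconstruction error suggests the image is anomalous."""
    return mse(image, reconstruct(image))

# Toy usage: an autoencoder trained on normal data reproduces normal
# inputs closely, so the score stays small for them.
near_faithful = lambda img: [0.9 * p for p in img]
print(anomaly_score([1.0, 1.0, 1.0, 1.0], near_faithful))  # small error
```

In practice the score is thresholded, or ranked across a test set, to separate normal from abnormal images.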
For example, AnoGAN (Schlegl et al., 2017) is a generative adversarial network (GAN) based model that uses a generator for image reconstruction and an anomaly score defined as a weighted sum of a residual score and a discrimination score. In Akcay et al. (2018), GANomaly considers the distance in the latent feature space to distinguish anomalous data. F-AnoGAN (Schlegl et al., 2019) is an improved version of AnoGAN, which simultaneously guides encoder training in both the image and latent spaces. Although reconstruction-based methods can provide pixel-level dissimilarity, they have many limitations for anomaly detection tasks. The limited capability of AEs in modelling high-dimensional data distributions often leads to inaccurate approximations and erroneous reconstructions: for example, features are smoothed out in the reconstructed images (Kingma & Welling, 2013; Schlegl et al., 2019; Ravanbakhsh et al., 2017), and image boundaries are predicted as anomalous pixels that appear in the difference map (Schlegl et al., 2017; 2019). The second type of anomaly detection method constructs a likelihood function of extracted image features. For example, normalizing flow-based anomaly detection models have been proposed for industrial anomaly datasets (Rudolph et al., 2021; Gudovskiy et al., 2022; Yu et al., 2021). Normalizing flow (Dinh et al., 2014; Rezende & Mohamed, 2015; Dinh et al., 2016; Kingma & Dhariwal, 2018) is a popular method that transforms the observed data to a tractable distribution. The idea behind normalizing flows is to deploy a sequence of invertible and differentiable mappings that transform a complex distribution into a simple probability distribution (e.g., a standard normal distribution). NICE (Dinh et al., 2014) was introduced for modeling complex high-dimensional densities. In Dinh et al. (2016), RealNVP was proposed to improve the coupling layer, and a multi-scale architecture was applied to enhance the representative ability of the framework.
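The tractability of flow-based densities comes from the standard change-of-variables formula: for an invertible, differentiable mapping $f$ taking data $x$ to latent $z = f(x)$ with base density $p_Z$ (e.g., a standard normal), the data log-density is

```latex
\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Maximizing this log-likelihood over normal training data is what yields an exact, tractable density for normal images.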
Recently, Glow, proposed in Kingma & Dhariwal (2018), utilizes actnorm and invertible 1 × 1 convolutions to generate more realistic images. Normalizing flows have been applied in many fields such as image generation, noise modeling, and video generation (Papamakarios et al., 2021). Recently, anomaly detection methods based on normalizing flows have also emerged and have achieved very high performance on industrial datasets. DifferNet (Rudolph et al., 2021) uses a backbone network to extract multiscale features of the input and a normalizing flow network to maximize the likelihood; the anomaly score is defined as the average of negative log-likelihoods. CFLOW-AD (Gudovskiy et al., 2022) adds positional embedding layers and utilizes conditional flows to address anomaly localization. FastFlow (Yu et al., 2021) proposes a lightweight network that further improves the accuracy. However, these NF-based anomaly detection methods make decisions by estimating the likelihood of the extracted features, so the structural information of the images is lost. From this perspective, an autoencoder architecture can construct a mapping between the hidden feature space and the data space and force the model to learn structural information of the original data. In this paper, we propose to construct a loss function and an anomaly score function with an autoencoder whose bottleneck is a normalizing flow, namely AE-FLOW. This model combines the benefit of normalizing flow methods, which compute the anomaly likelihood of extracted features at the image level, with the pixel-level interpretability of reconstruction-based methods. The proposed score function takes into consideration both the computable probability density in the feature space and the visual structural consistency in the image domain. The model follows a self-supervised paradigm that uses only normal data, which makes it better suited to real-world applications.
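A score of the kind just described, combining a flow likelihood with a reconstruction error, can be sketched as below. This is a hedged illustration only: the scalar 1-D standard-normal latent and the weighting parameter `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def gaussian_nll(z):
    """Negative log-density of a standard normal evaluated at scalar z."""
    return 0.5 * (z ** 2 + math.log(2 * math.pi))

def combined_score(z, recon_error, beta=0.5):
    """Anomaly score = beta * flow NLL + (1 - beta) * reconstruction error.

    A latent far from the mode (large |z|) or a poor reconstruction
    both push the score up.
    """
    return beta * gaussian_nll(z) + (1 - beta) * recon_error
```

With such a score, the feature-level term flags images whose features fall in the tails of the learned density, while the pixel-level term flags images the decoder fails to reproduce.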
The experiments on four public medical datasets and one non-medical dataset demonstrated the effectiveness and robustness of the proposed method, with a detailed comparison to other existing methods.

2. METHODS

We propose a method that integrates the structural difference between the original and reconstructed images with the likelihood of the image-feature distribution. The proposed pipeline consists of three components, i.e., encoder-flow-decoder. Intuitively, since the normalizing flow is trained with normal data, normal data will be mapped to the high-density region of the standard Gaussian distribution. In contrast, abnormal data will be mapped to the tails of the distribution, making it difficult for the decoder network to effectively reconstruct the original image. The loss function of the three-block pipeline takes into account both the pixel-level dissimilarity between the reconstructed image and the original image and the feature-level likelihood of the data distribution. At the inference stage, the anomaly score is composed of the reconstruction error and the flow likelihood. The pipeline is illustrated in Fig. 1, compared to the usual autoencoder neural network. The detailed structure of the overall network is shown in Fig. 2. In the following, we explain the three blocks in detail.

Encoder. The model starts with a pretrained encoder network that extracts features of the input image. Each input image is downsampled four times, and the extracted feature is low-dimensional
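The encoder-flow-decoder pipeline can be sketched end to end in one dimension. This is a toy sketch under stated assumptions, not the paper's networks: the encoder and decoder are placeholder linear maps, and the "flow" is a single affine map z = a·h + b, whose Jacobian log-determinant is log|a|.

```python
import math

def encoder(x):
    """Placeholder encoder: extracts a (toy) feature from the input."""
    return 2.0 * x

def flow_forward(h, a=0.5, b=0.0):
    """Affine flow step; returns latent and Jacobian log-determinant."""
    z = a * h + b
    log_det = math.log(abs(a))
    return z, log_det

def flow_inverse(z, a=0.5, b=0.0):
    """Exact inverse of the affine flow step."""
    return (z - b) / a

def decoder(h):
    """Placeholder decoder: maps the feature back to image space."""
    return h / 2.0

def forward_pass(x):
    """Encoder -> flow -> decoder; returns (log-likelihood, recon error)."""
    h = encoder(x)
    z, log_det = flow_forward(h)
    # Log-likelihood of the feature under a standard-normal base density,
    # via the change-of-variables formula.
    log_likelihood = -0.5 * (z ** 2 + math.log(2 * math.pi)) + log_det
    # Invert the flow and decode to reconstruct the input.
    x_hat = decoder(flow_inverse(z))
    recon_error = (x - x_hat) ** 2
    return log_likelihood, recon_error
```

Because this toy flow is exactly invertible and the decoder inverts the encoder, the reconstruction error here is zero; in the actual model, low likelihood and poor reconstruction jointly signal an anomaly.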

