DISENTANGLED CONDITIONAL VARIATIONAL AUTOENCODER FOR UNSUPERVISED ANOMALY DETECTION

Abstract

Recently, generative models have shown promising performance in anomaly detection tasks. Specifically, autoencoders learn representations of high-dimensional data, and their reconstruction ability can be used to assess whether a new instance is likely to be anomalous. However, the primary challenge of unsupervised anomaly detection (UAD) is in learning appropriate disentangled features and avoiding information loss, while incorporating known sources of variation to improve the reconstruction. In this paper, we propose a novel architecture of generative autoencoder by combining the frameworks of β-VAE, conditional variational autoencoder (CVAE), and the principle of total correlation (TC). We show that our architecture improves the disentanglement of latent features, optimizes TC loss more efficiently, and improves the ability to detect anomalies in an unsupervised manner with respect to high-dimensional instances, such as in imaging datasets. Through both qualitative and quantitative experiments on several benchmark datasets, we demonstrate that our proposed method excels in terms of both anomaly detection and capturing disentangled features. Our analysis underlines the importance of learning disentangled features for UAD tasks.

1. INTRODUCTION

Unsupervised anomaly detection (UAD) has been a fertile ground for methodological research for several decades. Recently, generative models, such as Variational Autoencoders (VAEs) (Kingma & Welling, 2014) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2020; Arjovsky et al., 2017), have shown exceptional performance on UAD tasks. By learning the distribution of normal data, generative models can naturally score new data as anomalous based on how well they can be reconstructed. For a recent review of deep learning for anomaly detection, see Pang et al. (2021). In a complex task like UAD, disentanglement as a meta-prior encourages latent factors to be captured by different independent variables in the low-dimensional representation. This phenomenon has been on display in recent work that has used representation learning as a backbone for developing new VAE architectures. Proposed approaches include new objective functions (Higgins et al., 2017; Mathieu et al., 2019), efficient decomposition of the evidence lower bound (ELBO) (Chen et al., 2018), partitioning of the latent space by adding a regularization term to the mutual information function (Zhao et al., 2017), new disentanglement metrics (Kim & Mnih, 2018), and penalizing the total correlation (TC) loss (Gao et al., 2019). Penalizing TC efficiently learns disentangled features and minimizes dependence across the dimensions of the latent space. However, it often leads to a loss of information, and hence lower reconstruction quality. For example, methods such as β-VAE, Disentangling by Factorising (FactorVAE) (Kim & Mnih, 2018), and Relevance FactorVAE (RFVAE) (Kim et al., 2019) encourage more factorized representations at the cost of either reduced reconstruction quality or the loss of a considerable amount of information about the data, which in turn degrades disentanglement performance. To draw clear boundaries between anomalous and normal samples, we must minimize information loss.
In this paper, we propose a generative modeling architecture which learns disentangled representations of the data while minimizing the loss of information, thus maintaining good reconstruction capabilities. We achieve this by modeling known sources of variation, in a similar fashion to the Conditional VAE (Pol et al., 2019). Our paper is structured as follows. We first briefly discuss related methods (Section 2), draw connections between them, and present our proposed method, dCVAE (Section 3). In Section 4, we describe our experimental design, including competing methods, datasets, and model configuration. Finally, experimental results are presented in Section 5, and Section 6 concludes the paper.

2. RELATED WORK

In this section, we discuss related work on autoencoders. We focus on two types of architecture: extensions of VAE enforcing disentanglement, and architectures based on mutual information theory.

2.1. β-VAE

β-VAE and its extensions (Higgins et al., 2017; Mathieu et al., 2019; Chen et al., 2018) augment the original VAE with a learning constraint β applied to the objective function. The idea of including such a hyper-parameter is to balance the latent channel capacity against reconstruction accuracy. As a result, β-VAE is capable of discovering disentangled latent factors and generating more realistic samples while keeping the distance between the actual and estimated distributions small. Recall the objective function of the VAE proposed by Kingma & Welling (2014):

L_VAE(θ, ϕ) = −E_{z∼q_ϕ(z|x)}[log p_θ(x | z)] + D_KL(q_ϕ(z | x) ∥ p_θ(z)).   (1)

Here, p_θ(x | z) is the probabilistic decoder, q_ϕ(z | x) is the recognition model, and D_KL(q_ϕ(z | x) ∥ p_θ(z)) denotes the Kullback–Leibler divergence, where θ and ϕ parameterize the generative and inference models, respectively. Since the goal of β-VAE is to introduce the disentangling property while maximizing the probability of generating the original data and minimizing the distance between the distributions, a constant δ is introduced to constrain the approximate posterior:

max_{ϕ,θ} E_{x∼X} E_{q_ϕ(z|x)}[log p_θ(x | z)]   subject to   D_KL(q_ϕ(z | x) ∥ p(z)) < δ.

Rewriting this in Lagrangian form and applying the KKT conditions, Higgins et al. (2017) derive the following objective function:

L_βVAE(θ, ϕ) = E_{q_ϕ(z|x)}[log p_θ(x | z)] − β D_KL(q_ϕ(z | x) ∥ p(z)),

where β is the regularization coefficient that enforces the constraint and limits the capacity of the latent code z. When β = 1, we recover the original VAE; increasing β beyond 1 strengthens the constraint and encourages disentanglement. However, Hoffman et al. (2017) argue that with an implicit prior, optimizing the regularized ELBO is equivalent to performing variational expectation maximization (EM).
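To make the objective concrete, the following is a minimal numpy sketch of the β-VAE loss for a diagonal-Gaussian posterior and a Bernoulli decoder; all function and variable names here are illustrative, not from the original paper, and the KL term uses the standard closed form against a standard-normal prior.

```python
import numpy as np

def kl_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative beta-VAE ELBO: Bernoulli reconstruction error + beta * KL, averaged over the batch."""
    eps = 1e-7  # numerical guard for log(0)
    recon = -np.sum(x * np.log(x_recon + eps)
                    + (1.0 - x) * np.log(1.0 - x_recon + eps), axis=-1)
    return np.mean(recon + beta * kl_gaussian(mu, logvar))
```

Setting `beta=1.0` recovers the standard VAE loss of Equation (1); larger values of β weight the KL term more heavily, trading reconstruction quality for a more factorized posterior.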



2.2. FACTORVAE

Disentangling by Factorising (FactorVAE) is another modification of β-VAE, proposed by Kim & Mnih (2018). FactorVAE emphasizes the trade-off between disentanglement and reconstruction quality. The authors focus on the objective functions of the VAE and β-VAE, and propose a new loss function to mitigate the loss of information that arises when penalizing both the mutual information and the KLD to enforce disentangled latent factors. According to Hoffman & Johnson (2016) and Makhzani & Frey (2017), the KL term in the β-VAE objective can be decomposed as

E_{p_data(x)}[KL(q(z | x) ∥ p(z))] = I(x; z) + KL(q(z) ∥ p(z)),

where I(x; z) is the mutual information between x and z, and q(z) = E_{p_data(x)}[q(z | x)] is the aggregate posterior.
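This decomposition is an exact algebraic identity, which can be checked numerically on a toy discrete model. The sketch below uses hypothetical probability tables (two values each for x and z, chosen only for illustration) and verifies that the averaged per-sample KL equals the mutual information plus the KL between the aggregate posterior and the prior.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) for probability vectors p, q."""
    return float(np.sum(p * np.log(p / q)))

# Toy discrete model (hypothetical numbers): rows index x, columns index z.
p_x = np.array([0.5, 0.5])                     # data distribution p_data(x)
q_z_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])           # encoder q(z | x)
p_z = np.array([0.5, 0.5])                     # prior p(z)

# Aggregate posterior q(z) = E_{p_data(x)}[q(z | x)].
q_z = p_x @ q_z_given_x

# Left-hand side: E_{p_data(x)}[ KL(q(z|x) || p(z)) ].
lhs = sum(p_x[i] * kl(q_z_given_x[i], p_z) for i in range(2))

# Right-hand side: I(x; z) + KL(q(z) || p(z)).
mutual_info = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(2))
rhs = mutual_info + kl(q_z, p_z)
```

Because the identity holds exactly, `lhs` and `rhs` agree up to floating-point error; FactorVAE's contribution is to penalize the terms of this decomposition selectively rather than uniformly, as β-VAE does.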

