A MIXTURE OF VARIATIONAL AUTOENCODERS FOR DEEP CLUSTERING Anonymous

Abstract

In this study, we propose a deep clustering algorithm that utilizes a variational autoencoder (VAE) framework with a multi-encoder-decoder neural architecture. This setup enforces a complementary structure that guides the learned latent representations toward a more meaningful arrangement of the latent space. It differs from previous VAE-based clustering algorithms by employing a new generative model built on multiple encoder-decoder pairs. We show that this modeling yields both better clustering capabilities and improved data generation. The proposed method is evaluated on standard datasets and is shown to significantly outperform state-of-the-art deep clustering methods.

1. INTRODUCTION

Clustering is one of the most fundamental techniques in unsupervised machine learning. It is the process of partitioning data into several classes without using any label information. In the past decades, a plethora of clustering methods have been developed and successfully employed in various fields, including computer vision (Jolion et al., 1991), natural language processing (Ngomo & Schumacher, 2009), social networks (Handcock et al., 2007) and medical informatics (Gotz et al., 2011). The most well-known clustering approaches include the traditional k-means algorithm and the generative model that assumes the data points are generated from a Mixture-of-Gaussians (MoG), whose parameters are learned via the Expectation-Maximization (EM) algorithm. However, applying these methods to high-dimensional data is problematic since, in such vector spaces, the inter-point distances become less informative. Deep learning methods have therefore provided new opportunities for clustering (Min et al., 2018): they learn a (non-linear) mapping of the raw features into a low-dimensional vector space in which clustering methods can hopefully be applied more effectively. Deep learning methods are expected to automatically discover the non-linear representations most suitable for a specified task. However, a straightforward implementation of a "deep" k-means algorithm, which jointly learns the embedding and clusters the embedded data, leads to a trivial solution: the data feature vectors collapse into a single point in the embedded space, and thus the k centroids collapse into a single spurious entity. For this reason, the objective function of many deep clustering methods is composed of both a clustering term computed in the embedded space and a regularization term, in the form of a reconstruction error, to avoid data collapse.
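To make the collapse argument concrete, the following is a minimal sketch (not the paper's method) of the composite objective described above: a k-means-style clustering term in the embedded space plus a reconstruction term whose weight keeps the embedding from degenerating. The linear encoder/decoder, the dimensions, and the weight `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # raw data: 100 points in 20-D
W_enc = rng.normal(size=(20, 5)) * 0.1    # toy linear encoder
W_dec = rng.normal(size=(5, 20)) * 0.1    # toy linear decoder
centroids = rng.normal(size=(3, 5))       # k = 3 centroids in the embedded space

def composite_loss(X, W_enc, W_dec, centroids, lam=1.0):
    Z = X @ W_enc                                    # embed the data
    X_rec = Z @ W_dec                                # reconstruct from the embedding
    rec = np.mean((X - X_rec) ** 2)                  # reconstruction (regularization) term
    d = ((Z[:, None, :] - centroids[None]) ** 2).sum(-1)
    clu = np.mean(d.min(axis=1))                     # clustering term: distance to nearest centroid
    return clu + lam * rec                           # lam > 0 penalizes collapsing Z to a point

loss = composite_loss(X, W_enc, W_dec, centroids)
```

With `lam = 0`, the clustering term alone is minimized by shrinking `W_enc` (and the centroids) toward zero, which is exactly the trivial solution the reconstruction term is there to prevent.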
One broad family of successful deep clustering algorithms, which has been shown to yield state-of-the-art results, is the generative model-based methods. Most of these methods are based on the Variational Autoencoder framework (Kingma & Welling, 2014), e.g., Gaussian Mixture Variational Autoencoders (GMVAE) (Dilokthanakul et al., 2016) and Variational Deep Embedding (VaDE) (Jiang et al., 2016). Instead of using an arbitrary prior on the latent variable, these algorithms employ specific distributions that allow clustering at the bottleneck, such as MoG distributions. This design results in a VAE-based training objective composed of a significant reconstruction term and a second, parameter-regularization term, as discussed above. However, this objective seems to miss the clustering target: the reconstruction term is not related to the clustering, and the actual clustering is associated only with optimizing the regularization term. This can result in inferior clustering performance, a degenerate generative model, and stability issues during training. We propose a solution that alleviates the issues introduced by previous deep clustering generative models. To that end, we propose the k-Deep Variational Autoencoder (dubbed k-DVAE). Our k-DVAE improves upon current state-of-the-art clustering methods in several respects: (1) a novel model that outperforms current methods in terms of clustering accuracy; (2) a novel variational Bayesian framework that balances data reconstruction and actual clustering differently from previous methods; (3) a network architecture that allows better generative modeling and thus more accurate data generation. Importantly, this architecture uses fewer parameters than previous models. We applied the k-DVAE algorithm to various standard document and image corpora and obtained improved results, compared to state-of-the-art clustering methods, on all the datasets we experimented with.
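For concreteness, the two-term objective criticized above can be written in its standard evidence-lower-bound form (following the published VaDE formulation of Jiang et al., 2016):

```latex
\log p(x) \;\ge\;
\underbrace{\mathbb{E}_{q(z, y \mid x)}\!\left[\log p(x \mid z)\right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q(z, y \mid x)\,\middle\|\,p(z, y)\right)}_{\text{regularization}}
```

Note that the decoder likelihood p(x | z) does not depend on the cluster variable y, so y influences only the KL regularization term; this is precisely the imbalance between reconstruction and clustering described above.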

2. RELATED WORK

Deep clustering has been studied extensively in the literature. The most common deep clustering methods aim to project the data into a non-linear, low-dimensional feature space in which the task of clustering appears feasible; traditional clustering methods are then applied to perform the actual clustering. Previous works have employed autoencoders (Yang et al., 2016; Ghasedi Dizaji et al., 2017; Yang et al., 2017; Fogel et al., 2019; Opochinsky et al., 2020), Variational Autoencoders (VAEs) (Jiang et al., 2016; Dilokthanakul et al., 2016; Yang et al., 2019; Li et al., 2019) and Generative Adversarial Networks (GANs) (Springenberg, 2015; Chen et al., 2016). IMSAT (Hu et al., 2017) is another recent method, one that augments the training data. Our method makes no use of augmented data during training, and therefore we do not consider IMSAT an appropriate or fair baseline for comparison. Additionally, the GMVAE method has been shown to yield inferior performance compared to the other VAE-based deep clustering methods, hence we do not include it in our evaluations. Among the aforementioned works, VaDE (Jiang et al., 2016) and k-DAE (Opochinsky et al., 2020) are the most relevant to ours. Both VaDE and our work utilize the variational Bayes framework and use a probabilistic generative process to determine the data generation model. The difference lies in both the generative process and the use of several autoencoders: our network consists of a set of k autoencoders, each of which specializes in encoding and reconstructing a different cluster. The k-DAE architecture also consists of a set of k autoencoders, but it does not consider generative modeling, which, as we show, proves to be more powerful and yields significantly better clustering performance. The recent, state-of-the-art DGG method (Yang et al., 2019) was built on the foundations of VaDE and integrates graph embeddings that serve as a regularization over the VaDE objective.
Under the revised DGG objective, each pair of samples that are connected in the learned graph are encouraged to have similar posterior distributions, as measured by the Jensen-Shannon (JS) divergence. The other baselines used in this study are described in Section 4.2.
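As a small illustration of the similarity measure just mentioned (assumed numerics, not DGG's implementation), the Jensen-Shannon divergence between two discrete cluster posteriors can be computed as follows; the example posteriors `p` and `q` are made up.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (epsilon-smoothed for safety)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    eps = 1e-12
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])   # posterior over k = 3 clusters for sample i
q = np.array([0.6, 0.3, 0.1])   # posterior for a graph-connected neighbour j
```

Because JS is symmetric and vanishes only when the two posteriors coincide, penalizing it pulls the posteriors of graph-connected pairs toward each other.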

3. THE k-DVAE CLUSTERING ALGORITHM

In this section, we describe our k-Deep Variational Autoencoder (dubbed k-DVAE). First, we formulate the generative model that our algorithm is based on. Next, we derive the optimization objective. Then we discuss the differences between our model and previous VAE-based algorithms such as VaDE (Jiang et al., 2016) and illustrate the advantages of our approach.

3.1. GENERATIVE MODEL

In our generative modeling, we assume that the data are drawn from a mixture of VAEs, each with a standard Gaussian latent r.v., as follows:
1. Draw a cluster y by sampling from p(y = i) = α_i, i = 1, ..., k.
2. Sample a latent r.v. z from the unit normal distribution, z ∼ N(0, I).
3. Sample an observed r.v. x:
   (a) If x is a real-valued vector: sample a data vector using the conditional distribution x|(z, y = i) ∼ N(μ_θi(z), Σ_θi(z)).
   (b) If x is a binary vector: sample a data vector using the conditional distribution x|(z, y = i) ∼ Ber(μ_θi(z)).
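The three-step generative process above can be sketched as a runnable simulation for the real-valued case. The mixture weights, the per-cluster decoder (a toy linear map standing in for μ_θi), the isotropic noise standing in for Σ_θi(z), and all dimensions are illustrative assumptions, not the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d_z, d_x = 3, 4, 8                      # clusters, latent dim, observed dim (toy)
alpha = np.array([0.5, 0.3, 0.2])          # mixture weights: p(y = i) = alpha_i
W = [rng.normal(size=(d_z, d_x)) for _ in range(k)]  # one toy linear decoder per cluster
sigma = 0.1                                # isotropic stand-in for Sigma_theta_i(z)

def sample_x():
    y = int(rng.choice(k, p=alpha))        # step 1: draw a cluster y
    z = rng.normal(size=d_z)               # step 2: z ~ N(0, I)
    mu = z @ W[y]                          # step 3: decode z with cluster y's decoder
    x = mu + sigma * rng.normal(size=d_x)  # x | (z, y) ~ N(mu, sigma^2 I)
    return y, z, x

y, z, x = sample_x()
```

Each cluster thus owns its own decoder, which is the structural difference from a single-decoder mixture prior: the cluster identity selects the generator, not merely a component of the latent prior.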

