A MIXTURE OF VARIATIONAL AUTOENCODERS FOR DEEP CLUSTERING Anonymous

Abstract

In this study, we propose a deep clustering algorithm that utilizes a variational autoencoder (VAE) framework with a multi-encoder-decoder neural architecture. This setup enforces a complementary structure that guides the learned latent representations toward a more meaningful arrangement of the latent space. It differs from previous VAE-based clustering algorithms by employing a new generative model that uses multiple encoder-decoder pairs. We show that this modeling results in both better clustering capabilities and improved data generation. The proposed method is evaluated on standard datasets and is shown to significantly outperform state-of-the-art deep clustering methods.

1. INTRODUCTION

Clustering is one of the most fundamental techniques in unsupervised machine learning: the process of partitioning data into several classes without using any label information. In the past decades, a plethora of clustering methods have been developed and successfully employed in various fields, including computer vision (Jolion et al., 1991), natural language processing (Ngomo & Schumacher, 2009), social networks (Handcock et al., 2007) and medical informatics (Gotz et al., 2011). The most well-known clustering approaches include the traditional k-means algorithm and the generative model that assumes the data points are generated from a Mixture-of-Gaussians (MoG), whose parameters are learned via the Expectation-Maximization (EM) algorithm. However, applying these methods to high-dimensional datasets is problematic, since in such vector spaces the inter-point distances become less informative. Deep learning methods have therefore provided new opportunities for clustering (Min et al., 2018): they can learn a (non-linear) mapping of the raw features into a low-dimensional vector space in which clustering methods are more readily applicable, and they are expected to automatically discover the most suitable non-linear representations for a specified task. However, a straightforward implementation of a "deep" k-means algorithm, which jointly learns the embedding space and clusters the embedded data, leads to a trivial solution in which the data feature vectors collapse to a single point in the embedded space, and thus the k centroids collapse into a single spurious entity. For this reason, the objective function of many deep clustering methods is composed of both a clustering term computed in the embedded space and a regularization term, in the form of a reconstruction error, to avoid data collapsing.
One broad family of successful deep clustering algorithms, which has been shown to yield state-of-the-art results, is the generative model-based methods. Most of these methods are based on the Variational Autoencoder framework (Kingma & Welling, 2014), e.g., Gaussian Mixture Variational Autoencoders (GMVAE) (Dilokthanakul et al., 2016) and Variational Deep Embedding (VaDE) (Jiang et al., 2016). Instead of using an arbitrary prior for the latent variable, these algorithms use specific distributions, such as MoG distributions, that allow clustering at the bottleneck. This design results in a VAE training objective composed of a dominant reconstruction term and a second, parameter regularization term, as discussed above. However, this objective seems to miss the clustering target: the reconstruction term is not related to the clustering, and the actual clustering is only associated with the optimization of the regularization term. This can result in inferior clustering performance, a degenerate generative model, and stability issues during training. We propose a solution that alleviates the issues introduced by previous deep clustering generative models. To that end, we propose the k-Deep Variational Auto Encoders (dubbed k-DVAE). Our k-DVAE improves upon the current state-of-the-art clustering methods in several facets: (1) A novel model that outperforms the current methods in terms of clustering accuracy. (2) A novel Variational Bayesian framework that balances the data reconstruction and the actual clustering, and differs from previous methods. (3) A network architecture that allows better generative modeling and thus more accurate data generation. Importantly, this architecture uses fewer parameters than previous models. We applied the k-DVAE algorithm to various standard document and image corpora and obtained improved results, compared to state-of-the-art clustering methods, on all the datasets we experimented with.

2. RELATED WORK

Deep clustering has been studied extensively in the literature. The most common deep clustering methods aim to project the data into a non-linear, low-dimensional feature space in which the clustering task appears to be feasible; traditional clustering methods are then applied to perform the actual clustering. Previous works have employed autoencoders (Yang et al., 2016; Ghasedi Dizaji et al., 2017; Yang et al., 2017; Fogel et al., 2019; Opochinsky et al., 2020), Variational Autoencoders (VAEs) (Jiang et al., 2016; Dilokthanakul et al., 2016; Yang et al., 2019; Li et al., 2019) and Generative Adversarial Networks (GANs) (Springenberg, 2015; Chen et al., 2016). IMSAT (Hu et al., 2017) is another recent method, which augments the training data. Our method does not make any use of augmented data during training, and therefore we do not consider IMSAT an appropriate or fair baseline for comparison. Additionally, the GMVAE method has been shown to yield inferior performance compared to the other VAE-based deep clustering methods, hence we do not include it in our evaluations. Among the aforementioned works, VaDE (Jiang et al., 2016) and k-DAE (Opochinsky et al., 2020) are the most relevant to our work. Both VaDE and our work utilize the Variational Bayes framework and use a probabilistic generative process to define the data generation model. The differences lie in both the generative process and the use of several autoencoders: our network consists of a set of k autoencoders, each of which specializes in encoding and reconstructing a different cluster. The k-DAE architecture also consists of a set of k autoencoders, but does not employ generative modeling, which, as we show, is more powerful and yields significantly better clustering performance. The recent state-of-the-art DGG method (Yang et al., 2019) was built on the foundations of VaDE and integrates graph embeddings that serve as a regularization of the VaDE objective.
Using the revised DGG objective, each pair of samples that are connected in the learned graph is encouraged to have similar posterior distributions, as measured by the Jensen-Shannon (JS) divergence. The other baselines used in this study are described in Section 4.2.

3. THE k-DVAE CLUSTERING ALGORITHM

In this section, we describe our k-Deep Variational Auto Encoders (dubbed k-DVAE). First, we formulate the generative model that our algorithm is based on. Next, we derive the optimization objective. Then we discuss the differences between our model and previous VAE-based algorithms, such as VaDE (Jiang et al., 2016), and illustrate the advantages of our approach.

3.1. GENERATIVE MODEL

In our generative modeling, we assume that the data are drawn from a mixture of VAEs, each with a standard Gaussian latent r.v., as follows:

1. Draw a cluster y by sampling from p(y = i) = α_i, i = 1, ..., k.

2. Sample a latent r.v. z from the unit normal distribution, z ∼ N(0, I).

3. Sample an observed r.v. x:
   (a) If x is a real-valued vector, sample a data vector using the conditional distribution x|(z, y = i) ∼ N(µ_{θ_i}(z), Σ_{θ_i}(z)).
   (b) If x is a binary vector, sample a data vector using the conditional distribution x|(z, y = i) ∼ Ber(µ_{θ_i}(z)).

θ_i is the stacked parameter vector of the i-th neural network (NN). It defines the decoder NN that corresponds to the i-th cluster, 1 ≤ i ≤ k, where k is the total number of clusters. µ_{θ_i}(z) and Σ_{θ_i}(z) are computed by a decoder NN with input z and parameters θ_i. We denote the parameter set of all the decoders by θ = {θ_1, ..., θ_k}. Note that the latent representation z is drawn independently of the selected cluster y; the cluster only affects the sampling of x.
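The three-step generative process above can be sketched as follows (a minimal NumPy sketch; the toy `decoders` list of affine maps is a hypothetical stand-in for the k decoder networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_mixture(alphas, decoders, latent_dim):
    """Sample one observation from the mixture-of-VAEs generative model.

    `decoders` is a hypothetical list of k functions, each mapping a latent
    vector z to (mu, sigma) of the conditional Gaussian p(x | z, y = i).
    """
    # Step 1: draw a cluster y ~ Categorical(alpha_1, ..., alpha_k).
    y = rng.choice(len(alphas), p=alphas)
    # Step 2: draw z ~ N(0, I), independently of y.
    z = rng.standard_normal(latent_dim)
    # Step 3: draw x | (z, y) from the y-th decoder's Gaussian.
    mu, sigma = decoders[y](z)
    x = rng.normal(mu, sigma)
    return y, z, x

# Toy decoders: one affine map per cluster, standing in for the decoder NNs.
decoders = [lambda z, c=c: (c + 0.1 * z.sum(), 0.05) for c in (-1.0, 1.0)]
y, z, x = sample_from_mixture([0.5, 0.5], decoders, latent_dim=4)
```

Note that, exactly as in the model, the cluster index y is used only in step 3: the latent draw in step 2 is the same unit Gaussian for every cluster.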

3.2. LEARNING THE MODEL PARAMETERS BY OPTIMIZING A VARIATIONAL LOWER BOUND

The model parameters are learned by maximizing the Evidence Lower Bound (ELBO) of the data log-likelihood:

ELBO(θ, λ) = Σ_y ∫_z q(y, z|x; λ) log p(x|y, z; θ) dz − D_KL(q(y, z|x; λ) || p(y, z; θ)),   (1)

where D_KL is the Kullback-Leibler (KL) divergence between two density functions, and q(y, z|x; λ) is a conditional density function parametrized by λ. We use an approximate conditional density q(y, z|x) that mirrors the structure of the generative model. For each cluster we define an encoder that transforms the input x into the latent space of that cluster:

q(y = i, z|x; λ) = q(y = i|x) q(z|x, y = i; λ_i),   q(z|x, y = i; λ_i) = N(z; µ_{λ_i}(x), Σ_{λ_i}(x)),

where µ_{λ_i}(x), Σ_{λ_i}(x) are computed by an encoder NN with input x and parameter set λ_i, and we use the notation λ = {λ_1, ..., λ_k}. The first term of the ELBO expression (1) can be written as:

Σ_y ∫_z q(y, z|x; λ) log p(x|y, z; θ) dz = Σ_i q(y = i|x) E_{q(z|x,y=i;λ_i)} [log N(x; µ_{θ_i}(z), Σ_{θ_i}(z))].   (2)

We next use Monte-Carlo sampling to approximate the expectation in Eq. (2):

E_{q(z|x,y=i;λ_i)} [log N(x; µ_{θ_i}(z), Σ_{θ_i}(z))] ≈ log N(x; µ_{θ_i}(z), Σ_{θ_i}(z)),   (3)

such that z|(x, y = i) is sampled from N(µ_{λ_i}(x), Σ_{λ_i}(x)). Applying the chain rule for the KL divergence to the second term of the ELBO expression (1), we get:

D_KL(q(y, z|x; λ) || p(y, z; θ)) = D_KL(q(y|x) || p(y; θ)) + Σ_i q(y = i|x) D_KL(N(µ_{λ_i}(x), Σ_{λ_i}(x)) || N(0, I)).   (4)

We next replace the soft clustering in Eq. (3) and Eq. (4) by a hard clustering:

Σ_{i=1}^k q(y = i|x) (log N(x; µ_{θ_i}(z_i), Σ_{θ_i}(z_i)) − D_KL(N(µ_{λ_i}(x), Σ_{λ_i}(x)) || N(0, I)))
≈ max_i (log N(x; µ_{θ_i}(z_i), Σ_{θ_i}(z_i)) − D_KL(N(µ_{λ_i}(x), Σ_{λ_i}(x)) || N(0, I))).   (5)

Finally, by neglecting the term D_KL(q(y|x) || p(y; θ)) in Eq. (4) (or, equivalently, setting q(y|x) = p(y; θ)), we obtain the following objective for optimization:

ELBO(θ, λ) ≈ max_i { log N(x; µ_{θ_i}(z_i), Σ_{θ_i}(z_i)) − D_KL(N(µ_{λ_i}(x), Σ_{λ_i}(x)) || N(0, I)) }   s.t.   z_i ∼ N(µ_{λ_i}(x), Σ_{λ_i}(x)).   (6)

Algorithm 1 ELBO computation
Input: Data sample x
Output: ELBO score of x
for i = 1 to k do
    Compute µ_{λ_i}(x) and Σ_{λ_i}(x) using the i-th encoder.
    Draw z_i ∼ N(µ_{λ_i}(x), Σ_{λ_i}(x)).
    Compute µ_{θ_i}(z_i) and Σ_{θ_i}(z_i) using the i-th decoder.
end for
Compute the ELBO score using Eq. (6).
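For diagonal-covariance Gaussians, the objective of Eq. (6) can be sketched as follows (a minimal sketch under that assumption; `encoders` and `decoders` are hypothetical stand-ins for the k networks, and the KL to N(0, I) is computed in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_diag(x, mu, var):
    """Log-density of a diagonal Gaussian N(x; mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def kl_to_std_normal(mu, var):
    """Closed-form D_KL(N(mu, diag(var)) || N(0, I))."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

def elbo_hard(x, encoders, decoders):
    """Approximate ELBO of Eq. (6): for each of the k encoder-decoder
    pairs, sample z_i with the reparameterization trick, score the
    reconstruction minus the KL regularizer, and take the max over i."""
    scores = []
    for enc, dec in zip(encoders, decoders):
        mu_z, var_z = enc(x)
        eps = rng.standard_normal(mu_z.shape)      # reparameterization trick
        z_i = mu_z + np.sqrt(var_z) * eps          # z_i ~ N(mu_z, diag(var_z))
        mu_x, var_x = dec(z_i)
        scores.append(log_normal_diag(x, mu_x, var_x)
                      - kl_to_std_normal(mu_z, var_z))
    return max(scores)

# Toy check with trivial encoder-decoder pairs (posterior = prior, so KL = 0).
enc = lambda x: (np.zeros(2), np.ones(2))
dec = lambda z: (np.zeros(3), np.ones(3))
score = elbo_hard(np.zeros(3), [enc, enc], [dec, dec])
```

The `max` over the k scores is the hard-clustering replacement of the soft mixture weights, matching the derivation from Eq. (5) to Eq. (6).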

Algorithm 2 Hard clustering

Input: Data sample x
Output: Estimated cluster ŷ(x) of x
for i = 1 to k do
    Compute z̄_i ← µ_{λ_i}(x) using the i-th encoder.
    Compute µ_{θ_i}(z̄_i) and Σ_{θ_i}(z̄_i) using the i-th decoder.
end for
Compute the cluster ŷ(x) using Eq. (8).

When optimizing the ELBO expression, we sample the Gaussian r.v. z_i|(x, y = i) using the reparameterization trick. Note that the ELBO objective function (6) consists of a reconstruction term and a regularization term, and both are involved in the clustering decision. In the derivation of the objective function above we assumed that x is a real-valued vector; the derivation of the ELBO objective for the discrete case is similar. The score computation procedure is depicted in Algorithm 1, and the overall architecture of the autoencoder used during training is depicted in Fig. 1.

3.3. HARD CLUSTERING OF DATA POINTS

After the model parameters have been learned, we can extract the data clustering. We chose a deterministic version of the clustering procedure (6) that avoids sampling of z and was empirically shown to yield more stable results in our simulations: we use the expectation vector z̄_i = µ_{λ_i}(x) instead of a sampled z_i. The hard clustering is thus defined as:

ŷ(x) = argmax_i ( log N(x; µ_{θ_i}(z̄_i), Σ_{θ_i}(z̄_i)) − log [ q(z̄_i|x; λ_i) / p(z̄_i) ] ).   (7)

According to our generative model, p(z̄_i) = N(z̄_i; 0, I), and

log q(z̄_i|x; λ_i) = log N(z̄_i; z̄_i, Σ_{λ_i}(x)) = Σ_{s=1}^d log N(z̄_{is}; z̄_{is}, σ_{is}^2) = −Σ_{s=1}^d log σ_{is} − (d/2) log(2π),

where d is the dimensionality of the latent r.v. and Σ_{λ_i}(x) = Var(z|x, y = i) = diag(σ_{i1}^2, ..., σ_{id}^2). The (d/2) log(2π) constants of the two log-densities in Eq. (7) cancel, which finally implies:

ŷ(x) = argmax_i ( log N(x; µ_{θ_i}(z̄_i), Σ_{θ_i}(z̄_i)) − (1/2) ||z̄_i||^2 + Σ_{s=1}^d log σ_{is} ).   (8)

The hard clustering procedure is depicted in Algorithm 2.
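The decision rule of Eq. (8) can be sketched as follows (a minimal diagonal-Gaussian sketch; the `encoders`/`decoders` callables are hypothetical stand-ins for the trained networks):

```python
import numpy as np

def log_normal_diag(x, mu, var):
    """Log-density of a diagonal Gaussian N(x; mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def hard_cluster(x, encoders, decoders):
    """Deterministic cluster assignment of Eq. (8): use the encoder mean
    z_i = mu_{lambda_i}(x) instead of a sample, and pick the cluster with
    the largest reconstruction-minus-regularization score."""
    best_i, best_score = None, -np.inf
    for i, (enc, dec) in enumerate(zip(encoders, decoders)):
        mu_z, var_z = enc(x)              # z_i set to the posterior mean
        mu_x, var_x = dec(mu_z)
        score = (log_normal_diag(x, mu_x, var_x)
                 - 0.5 * np.sum(mu_z ** 2)
                 + 0.5 * np.sum(np.log(var_z)))   # = sum_s log sigma_is
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy check: two clusters with identical encoders; decoder 0 reconstructs x
# well, decoder 1 does not, so cluster 0 must win.
x = np.array([1.0, 1.0])
enc = lambda v: (np.zeros(2), np.ones(2))
dec_good = lambda z: (np.array([1.0, 1.0]), 0.1 * np.ones(2))
dec_bad = lambda z: (np.array([-1.0, -1.0]), 0.1 * np.ones(2))
cluster = hard_cluster(x, [enc, enc], [dec_good, dec_bad])
```

Because both encoders are identical here, the regularization terms tie and the reconstruction term alone decides, which illustrates how both parts of Eq. (8) participate in the assignment.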

3.4. COMPARISON TO THE VADE METHOD

Our method and the VaDE algorithm (Jiang et al., 2016) are both based on generative models learned by variational autoencoders. We now briefly describe VaDE and focus on the differences from our model. The VaDE generative process is based on a MoG model combined with a non-linear function (decoder) and is given by:

1. Draw a cluster y by sampling from p(y = i) = α_i, i = 1, ..., k.

2. Sample a latent r.v. z using the conditional distribution z|y = i ∼ N(µ_i, Σ_i).

3. If x is real-valued, sample it using the conditional distribution x|z ∼ N(µ_θ(z), Σ_θ(z)); if x is binary-valued, sample it using the conditional distribution x|z ∼ Ber(µ_θ(z)).

µ_θ(z), Σ_θ(z) are computed by a decoder NN with input z and parameters θ. Note that, unlike our method, this modeling uses the same decoder (parametrized by θ) to construct the observed data for all the clusters; hence the decoder is likely to be very complex. When comparing the performance of the two methods in the next section, we show that our model needs far fewer parameters than VaDE, and that its reconstruction quality is much better. The VaDE ELBO(θ, λ) term can be approximated as follows:

ELBO(θ, λ) ≈ log N(x; µ_θ(z), Σ_θ(z)) − Σ_{i=1}^k p_θ(y = i|z) ( D_KL(N(µ_λ(x), Σ_λ(x)) || N(µ_i, Σ_i)) + log [p_θ(y = i|z) / α_i] ),   (9)

where z is sampled from N(µ_λ(x), Σ_λ(x)), and we denote the KL term in Eq. (9) by C. After the VaDE parameters are learned, the soft clustering of x is p_θ(y|z), where z is sampled from N(µ_λ(x), Σ_λ(x)). For the full derivation, we refer the reader to Jiang et al. (2016). Note that the term C in Eq. (9) corresponds to the actual MoG-based soft clustering performed by VaDE during the learning phase; the clustering is thus performed only within the ELBO regularization term. In our method, both the reconstruction and regularization parts of the ELBO are involved in the clustering decision. Another variant of our algorithm is a non-generative approach that has no regularization term and only minimizes the reconstruction error (Opochinsky et al., 2020). We show in the next section that this results in a significant degradation of clustering performance. Hence, both the reconstruction term and the regularization term of the ELBO should be involved in the clustering process.
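For contrast with the sampling sketch of Section 3.1, the VaDE generative process can be sketched as follows (a minimal NumPy sketch; the identity-like `decoder` is a hypothetical stand-in for VaDE's single shared decoder network):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_vade(alphas, mog_mus, mog_vars, decoder):
    """Sample from the VaDE generative process: a MoG prior in latent
    space followed by ONE shared decoder for all clusters (the key
    contrast with k-DVAE's per-cluster decoders)."""
    # Step 1: draw a cluster y ~ Categorical(alphas).
    y = rng.choice(len(alphas), p=alphas)
    # Step 2: z | y = i ~ N(mu_i, Sigma_i) -- the cluster shapes the latent.
    z = rng.normal(mog_mus[y], np.sqrt(mog_vars[y]))
    # Step 3: the shared decoder maps z to the observation model,
    # with no dependence on y.
    mu_x, var_x = decoder(z)
    x = rng.normal(mu_x, np.sqrt(var_x))
    return y, z, x

# Toy setting: two well-separated latent components, identity-like decoder.
mog_mus = [np.full(2, -5.0), np.full(2, 5.0)]
mog_vars = [np.ones(2), np.ones(2)]
decoder = lambda z: (z, 0.01 * np.ones_like(z))
y, z, x = sample_vade([0.5, 0.5], mog_mus, mog_vars, decoder)
```

Here all cluster information must be carried by the latent MoG, since step 3 is cluster-agnostic; in k-DVAE the latent prior is cluster-agnostic and the decoders carry the cluster information instead.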

4. EXPERIMENTS AND RESULTS

In this section, we present the datasets, hyperparameters, and experiments conducted to evaluate our approach's clustering results and compare it to other clustering methods.

4.1. DATASETS

We used the following datasets in our experiments: MNIST: The MNIST dataset consists of 70,000 handwritten (ten) digit images of size 28 × 28 pixels. Preprocessing includes centering the pixel values and flattening each image into a 784-dimensional vector.

STL-10: The STL-10 dataset consists of RGB color images of size 96 × 96 pixels, from a total of 10 classes. Since clustering directly from the raw pixels of high-resolution images is rather difficult, preprocessing includes feature extraction by passing the images through a pre-trained ResNet-50 (He et al., 2016) and then applying an average pooling operation to reduce the dimensionality to 2048.

REUTERS: The REUTERS dataset consists of 10,000 English news stories from a total of 4 categories. Preprocessing includes computing 2000-dimensional TF-IDF feature vectors over the most frequent words in the articles.

HHAR: The Heterogeneity Human Activity Recognition (HHAR) dataset consists of 10,200 sample records, each belonging to one of 6 categories. Each sample is a 561-dimensional vector. Note that in our simulations we set k to the actual number of classes of each dataset. The overall dataset statistics are summarized in Table 1.

4.2. EVALUATED MODELS

We compared our method to the following state-of-the-art deep clustering algorithms:

Autoencoder followed by Gaussian Mixture Model (AE+GMM): Trains a single AE using the reconstruction objective and then applies GMM-based clustering in the embedding space.

Variational Deep Embedding (VaDE): Introduces a VAE-based generative model that assumes the latent variables follow a mixture of Gaussians, where the means and variances of the Gaussian components are trainable (Jiang et al., 2016).

Latent Tree Variational Autoencoder (LTVAE): A VAE-based model that assumes a tree structure of the latent variables (Li et al., 2019).

Deep clustering via a Gaussian mixture VAE with Graph embedding (DGG): A recent VAE-based model that augments VaDE with a graph-embedding regularization (Yang et al., 2019).

k-Deep-AutoEncoder (k-DAE): Uses k AEs for deep clustering, where k is assumed to be the number of clusters (Opochinsky et al., 2020). This method serves as the ablation study for our method, since it induces the same (reconstruction) objective without the KL (regularization) term.

k-Deep-Variational AutoEncoder (k-DVAE): Our clustering method.

The encoder-decoder structure used for the first four methods is the same (for a fair comparison) and is composed as follows. Each encoder network uses dense layers of sizes D-500-500-2000-10, and each decoder network uses dense layers of sizes 10-2000-500-500-D. All these methods use additional mid-layers (to perform clustering). This setting and the remaining hyperparameters were taken from Jiang et al. (2016) and Yang et al. (2019). For both our method and k-DAE, the autoencoders use dense layers of sizes D-500-100-10 for the encoder and 10-100-500-D for the decoder. Note that although we allocate one encoder-decoder network to each cluster, the number of parameters is still drastically lower than in the compared methods. We tried increasing the number of parameters for each method, but it did not result in any performance gains. Each encoder network outputs mean and variance vectors that form a multivariate normal distribution. The decoder outputs a single mean vector if the input x is discrete; otherwise, it also outputs a variance vector to form a normal distribution. In our implementation of k-DVAE, similar to the DGG method (Yang et al., 2019), we first pretrain the VaDE network as initialization; we then set the initial clusters by applying k-means clustering over the VaDE embedded space. In our case, this architecture is used only in this initialization step.

4.3. CLUSTERING RESULTS

Clustering performance of all the compared methods was evaluated with the unsupervised clustering accuracy (ACC) measure, given by

ACC = max_{m∈S_k} (1/N) Σ_{i=1}^N 1{y_i = m(ŷ(x_i))},

where N is the total number of data samples, y_i is the ground-truth label of sample x_i, ŷ(x_i) is the cluster assignment obtained by the model, and m ranges over the set S_k of all one-to-one mappings between cluster assignments and labels. This measures the proportion of data points for which the obtained clusters can be correctly mapped to the ground-truth classes, where the matching is computed with the Hungarian algorithm (Kuhn, 1955). ACC lies in the range 0 to 1, where one is a perfect clustering result and zero is the worst. In Table 2 we depict the quantitative clustering results over the tested benchmarks compared to the baseline clustering methods. We report the mean ACC over ten training sessions with different random parameter initializations. The table shows that our method outperformed the other methods in terms of accuracy. In addition, the non-variational k-DAE variant yields inferior results compared to our method, which emphasizes the advantage of the variational generative framework in this setup.
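The ACC measure can be computed as follows (a minimal stdlib sketch; for small k it is enough to brute-force all permutations, while the Hungarian algorithm (Kuhn, 1955) solves the same assignment in polynomial time):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred, k):
    """Unsupervised clustering accuracy: the best fraction of agreement
    over all one-to-one mappings m between cluster indices and labels."""
    n = len(y_true)
    best = 0.0
    for m in permutations(range(k)):  # all one-to-one mappings in S_k
        acc = sum(1 for yt, yp in zip(y_true, y_pred) if yt == m[yp]) / n
        best = max(best, acc)
    return best

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # clusters correct up to relabeling
acc = clustering_accuracy(y_true, y_pred, 3)  # -> 1.0
```

Because ACC maximizes over relabelings, a clustering that is a pure permutation of the ground-truth labels scores a perfect 1.0.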

4.4. QUALITATIVE ANALYSIS

A key modeling difference between our k-DVAE and recent state-of-the-art models is that our model allocates a different decoder to each cluster. We saw that this yields improved clustering results; we show below that we also gain improved generation capability. In classification tasks, it is known that discriminative methods outperform generative ones, since the classes are known and we only need to find the discriminative features. However, in clustering tasks, where we need to learn the clusters, there is a tight relationship between a model's generation capabilities and its clustering performance. To gain insight into our model's data generation capabilities, we present examples of images generated by the model's generator network. To generate an example from the i-th cluster, we first sample a random vector z from the unit normal distribution and then feed it to the i-th decoder network, parametrized by θ_i. The VaDE/DGG algorithm, in contrast, uses a single decoder for all the clusters. Fig. 2 illustrates the generated samples for digits 0 to 9 of MNIST by our method compared to DGG.¹



¹ VaDE has a generative model similar to DGG's; we thus depict the results of DGG, which is the state of the art.



Figure 1: A block diagram of the autoencoder that computes the ELBO of the k-DVAE clustering method during the training phase.


Table 1: Dataset statistics.

Table 2: Clustering accuracy results on the clustering benchmarks. Best performance is bolded.

5. CONCLUSION

In this work, we proposed the k-Deep Variational AutoEncoder (k-DVAE), a neural generative model for deep clustering. This framework employs k encoder-decoder models designed to learn insightful low-dimensional representations for better clustering. The model is optimized by maximizing the evidence lower bound (ELBO) of the data log-likelihood. Using a distinct set of k parametrized models combined with the variational probabilistic framework results in a much richer representation of each cluster than in previous methods. Extensive experimental results on four different datasets demonstrate our method's effectiveness over different state-of-the-art baselines, which require more parameters for training than our proposed architecture. Our qualitative analysis showcases the high quality of the generative model induced by our k-DVAE. Future research can extend our work by utilizing a graph-embedding similarity objective or adding a discriminator network to further regularize the posterior.

APPENDIX

Unlike the results shown in Jiang et al. (2016), we performed the digit generation process without restricting the posterior to high values; we note in passing that in Jiang et al. (2016) the authors presented generation results only for favorable cases, in which the posterior probability of the correct clustering was at least 0.999. While both k-DVAE and DGG were able to generate smooth and diverse digits, the images generated by DGG are prone to errors. In contrast, each decoder network of the k-DVAE successfully reconstructed its corresponding digit using only random normal noise as input.

