AC-VAE: LEARNING SEMANTIC REPRESENTATION WITH VAE FOR ADAPTIVE CLUSTERING

Abstract

Unsupervised representation learning is essential in machine learning, and accurate neighbor clusters of representations show great potential to support unsupervised image classification. This paper proposes a VAE (Variational Autoencoder) based network and a clustering method that together achieve adaptive neighbor clustering in support of self-supervised classification. The proposed network encodes each image into a representation with boundary information, and the proposed clustering method uses that boundary information to deliver adaptive neighbor clusters. Experimental evaluations show that the proposed method outperforms state-of-the-art representation learning methods in terms of neighbor clustering accuracy. In particular, AC-VAE achieves 95% and 82% accuracy on the CIFAR10 dataset when the average neighbor cluster sizes are 10 and 100, respectively. Furthermore, the neighbor cluster results are found to converge within the clustering range (α ≤ 2), and the converged neighbor clusters are used to support self-supervised classification. The proposed method delivers classification results that are competitive with the state of the art and removes the need for the hyperparameter k in KNN (k-nearest neighbor), which is often used in self-supervised classification.

1. INTRODUCTION

Unsupervised representation learning is a long-standing interest in the field of machine learning (Peng et al., 2016a; Chen et al., 2016; 2018; Deng et al., 2019; Peng et al., 2016b). By leveraging the vast amount of unlabeled data, it offers a promising way to scale up the amount of data usable by current artificial intelligence methods without requiring human annotation (Chen et al., 2020b; a). Recent works (Chen et al., 2020b; a; He et al., 2020) advocate performing unsupervised representation learning at the pre-training stage and then applying semi-supervised or self-supervised techniques to the learned representations in the fine-tuning stage. The representation learning thus acts as a feature extractor that extracts semantic features from the image, and well-extracted features should lead to excellent classification performance (He et al., 2020). Moreover, representation learning assigns close vectors to images with similar semantic meanings, making it possible to cluster images with the same meaning together (Xie et al., 2016; Van Gansbeke et al., 2020). When no label is available, unsupervised or self-supervised classification methods rely on neighbor clustering to provide the supervisory signal that guides the self-supervised fine-tuning process (Van Gansbeke et al., 2020; Xie et al., 2016). In this scenario, accurately clustering neighbors among representations is crucial for the subsequent classification fine-tuning. In many prior unsupervised methods (Van Gansbeke et al., 2020; Xie et al., 2016), the neighbor clustering process is performed by KNN (k-nearest neighbor) based methods. However, KNN based methods introduce k as a hyperparameter, which needs to be tuned for each dataset. In an unsupervised setup, selecting a suitable k without any annotation or prior knowledge is not straightforward.
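As a point of reference, the fixed-k neighbor mining discussed above can be sketched as follows. This is only an illustrative sketch, not the implementation used by the cited works: the function names `knn_neighbors` and `neighbor_accuracy`, the Euclidean metric, and the toy data are all assumptions introduced here. Labels are used solely to score the mined neighbors, never to select them.

```python
import numpy as np

def knn_neighbors(features, k):
    """Return, for each row, the indices of its k nearest rows (Euclidean), excluding itself.

    Hypothetical helper for illustration; k must be chosen by hand for each dataset.
    """
    features = np.asarray(features, dtype=float)
    # Pairwise distance matrix via broadcasting: dists[i, j] = ||f_i - f_j||
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a sample is never its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def neighbor_accuracy(neighbors, labels):
    """Fraction of mined neighbors that share the anchor's label (evaluation only)."""
    labels = np.asarray(labels)
    return float(np.mean(labels[neighbors] == labels[:, None]))

# Toy data (assumed): two well-separated classes with two samples each.
features = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

acc_k1 = neighbor_accuracy(knn_neighbors(features, 1), labels)
acc_k2 = neighbor_accuracy(knn_neighbors(features, 2), labels)
```

In this toy setting each class has only one other member, so k = 1 yields clean neighbors while k = 2 forces every anchor to take one neighbor from the wrong class. The same sensitivity to k is what makes the choice difficult when no labels are available to check it.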
Therefore, it is desirable to have a neighbor clustering process that automatically adapts to different datasets, thus eliminating the need to pre-select the hyperparameter k. To achieve adaptive neighbor clustering, the proposed method encodes the image representation into a multivariate normal distribution, because such a distribution provides distance information, such as the z-score, that naturally adapts to different datasets without any additional mechanism. Prior works (Kingma & Welling, 2013; Higgins et al., 2016; Burgess et al., 2018) showed the VAE's ability to encode images into multivariate normal distributions; nonetheless, these works struggled to extract high-level semantic features, as most of them were trained on image recovery tasks, which encourage the network to focus on low-level imagery features. Consequently, the extracted low-level features cannot be used by unsupervised classification methods, which need semantic features to function.

• This work proposes a loss function based on consistency regularization to train the VAE based network to extract high-level semantic features from the image. Experiments demonstrate that the proposed method assigns close vectors to images with similar semantic meanings.

• This work proposes a clustering method that takes advantage of the adaptive boundary of each representation. The proposed method delivers high-accuracy neighbor clusters. In addition, the neighbor clusters are found to converge within the clustering range (α ≤ 2), and the self-supervised learning framework that uses the converged clusters delivers competitive results without the need to pre-select the parameter k.

2. RELATED WORKS

Many frameworks cluster the dataset directly into semantic classes and train the network in an end-to-end manner (Asano et al., 2019; Caron et al., 2019; Haeusser et al., 2018; Yang et al., 2016; Xie et al., 2016). Although end-to-end training is easy to apply, the performance of these frameworks depends heavily on the network's initialization. Therefore, complex mechanisms (such as cluster



Figure 1: The proposed clustering method comprises a VAE based network and a z-score based clustering method. The VAE based network encodes the image into a multivariate normal distribution, and the z-score based clustering method takes advantage of the distribution's boundary information.

To give the VAE the ability to extract high-level semantic features, and to use its strength in producing adaptive clusters, this paper proposes a framework, AC-VAE, comprising a VAE based network and a z-score based clustering method, as shown in Figure 1. The VAE based network encodes the image into the multivariate normal distribution N(µ, Σ). The distribution's mean µ is taken as the representation; meanwhile, its z-score provides boundary information that naturally adapts to different datasets. The proposed clustering method takes advantage of this boundary information to achieve adaptive neighbor clustering. The proposed framework's efficacy is evaluated on the CIFAR10, CIFAR100-20, and STL10 datasets, and it surpasses the current state-of-the-art methods in neighbor clustering on these datasets. In particular, AC-VAE achieves 95% and 82% accuracy on the CIFAR10 dataset when the average neighbor cluster sizes are 10 and 100, respectively, surpassing the current state-of-the-art method by a margin of 10%. Our main innovations and contributions can be summarized as follows:

• This work proposes a VAE based network that encodes the image into a representation together with its boundary information. The representation and boundary information are retrieved from the multivariate normal distribution into which the image is encoded. The efficacy of the adaptive boundary is demonstrated by the neighbor clustering results.
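To make the idea of a z-score boundary concrete, the following is a minimal sketch under stated assumptions rather than the paper's exact procedure. It assumes a diagonal covariance, so each sample i is described by a mean vector `mu[i]` and a per-dimension standard deviation `sigma[i]`, and it assumes a hypothetical membership rule: sample j is a neighbor of anchor i when the z-score of `mu[j]` under anchor i's distribution is at most α in every dimension. The function name `zscore_neighbors` and the toy values are illustrative only.

```python
import numpy as np

def zscore_neighbors(mu, sigma, alpha=2.0):
    """Return one index array of neighbors per anchor, using a z-score boundary.

    Assumed rule (illustrative, not taken from the paper): j is a neighbor of i
    if |mu[j] - mu[i]| / sigma[i] <= alpha holds in every dimension.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # z[i, j, d]: z-score of sample j's mean under anchor i's distribution, dimension d
    z = np.abs(mu[None, :, :] - mu[:, None, :]) / sigma[:, None, :]
    inside = np.all(z <= alpha, axis=-1)
    np.fill_diagonal(inside, False)  # exclude the anchor itself
    return [np.flatnonzero(row) for row in inside]

# Toy encodings (assumed): three samples with different spreads.
mu = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
sigma = np.array([[1.0, 1.0], [0.2, 0.2], [3.0, 3.0]])

clusters = zscore_neighbors(mu, sigma, alpha=2.0)
sizes = [len(c) for c in clusters]
```

Unlike a fixed-k rule, the number of neighbors here differs per anchor: a narrow distribution admits few or no neighbors, while a broad one admits more. That is the sense in which a boundary derived from the distribution can adapt to the data without a preset k.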

