AC-VAE: LEARNING SEMANTIC REPRESENTATION WITH VAE FOR ADAPTIVE CLUSTERING

Abstract

Unsupervised representation learning is essential in the field of machine learning, and accurate neighbor clusters of representations show great potential to support unsupervised image classification. This paper proposes a VAE (Variational Autoencoder) based network and a clustering method to achieve adaptive neighbor clustering that supports self-supervised classification. The proposed network encodes images into representations with boundary information, and the proposed clustering method takes advantage of this boundary information to deliver adaptive neighbor clusters. Experimental evaluations show that the proposed method outperforms state-of-the-art representation learning methods in terms of neighbor clustering accuracy. In particular, AC-VAE achieves 95% and 82% accuracy on the CIFAR10 dataset when the average neighbor cluster sizes are 10 and 100, respectively. Furthermore, the neighbor cluster results are found to converge within the clustering range (α ≤ 2), and the converged neighbor clusters are used to support self-supervised classification. The proposed method delivers classification results that are competitive with the state of the art and eliminates the hyperparameter k in KNN (K-nearest neighbor), which is often used in self-supervised classification.

1. INTRODUCTION

Unsupervised representation learning is a long-standing interest in the field of machine learning (Peng et al., 2016a; Chen et al., 2016; 2018; Deng et al., 2019; Peng et al., 2016b), as it offers a promising way to scale up the amount of usable data for current artificial intelligence methods without requiring human annotation, by leveraging the vast amount of unlabeled data (Chen et al., 2020b; a). Recent works (Chen et al., 2020b; a; He et al., 2020) advocate structuring unsupervised representation learning as a pre-training stage and then applying semi-supervised or self-supervised techniques to the learned representations in a fine-tuning stage. The representation learning thus acts as a feature extractor, which extracts semantic features from the image, and well-extracted features should lead to excellent classification performance (He et al., 2020). Moreover, representation learning assigns close vectors to images with similar semantic meanings, making it possible to cluster images with the same meaning together (Xie et al., 2016; Van Gansbeke et al., 2020). When no labels are available, unsupervised or self-supervised classification methods rely on neighbor clustering to provide the supervisory signal that guides the self-supervised fine-tuning process (Van Gansbeke et al., 2020; Xie et al., 2016). In this scenario, accurately clustering neighbors among representations is crucial for the subsequent classification fine-tuning. In many prior unsupervised methods (Van Gansbeke et al., 2020; Xie et al., 2016), the neighbor clustering process is performed by KNN (k-nearest neighbor) based methods. However, KNN-based methods introduce k as a hyperparameter, which must be tuned for each dataset. In an unsupervised setup, selecting a suitable k without any annotation or prior knowledge is not straightforward.
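For reference, the KNN-based neighbor mining used in these prior methods can be sketched as follows. This is a minimal NumPy illustration, not the cited papers' exact implementation; the function name and the cosine-similarity metric are assumptions. Note that k must be fixed in advance, which is precisely the hyperparameter this paper aims to remove:

```python
import numpy as np

def knn_neighbors(embeddings, k):
    """Return, for every embedding, the indices of its k nearest
    neighbors under cosine similarity; k is fixed per dataset."""
    # L2-normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point itself
    # Sort by descending similarity and keep the top-k indices.
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))        # 6 toy representations of dimension 4
nbrs = knn_neighbors(emb, k=2)       # every row gets exactly k=2 neighbors
```

Every point receives exactly k neighbors regardless of how the data is actually distributed, which is why a poorly chosen k either merges distinct semantic clusters or fragments a single one.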
Therefore, it is desirable to have a neighbor clustering process that automatically adapts to different datasets, thus eliminating the need to pre-select the hyperparameter k. To achieve adaptive neighbor clustering, the proposed method encodes each image representation as a multivariate normal distribution, since a multivariate normal distribution provides distance information, such as the z-score, that can naturally adapt to different datasets without the help of any additional mechanism. Prior works (Kingma & Welling, 2013; Higgins et al., 2016; Burgess et al., 2018) showed VAE's ability to encode images into multivariate normal distributions; nonethe-


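The adaptive clustering idea can be illustrated with a hypothetical per-dimension z-score criterion. This is only a sketch under assumed simplifications, not the paper's exact algorithm: the function name, the diagonal-covariance form, and the all-dimensions acceptance rule are assumptions. Each image is encoded as a Gaussian N(mu, diag(sigma^2)), and a candidate joins an anchor's neighbor cluster when it lies within α standard deviations of the anchor, so cluster sizes follow the data rather than a fixed k:

```python
import numpy as np

def adaptive_neighbors(mu, sigma, alpha=2.0):
    """Collect, for each anchor i, every j whose mean lies within alpha
    standard deviations of mu[i] in all dimensions (per-dimension z-score).
    Cluster sizes adapt to each anchor's boundary; no neighbor count k."""
    neighbors = []
    for i in range(mu.shape[0]):
        z = np.abs(mu - mu[i]) / sigma[i]    # z-scores w.r.t. anchor i
        inside = np.all(z <= alpha, axis=1)  # inside the alpha boundary
        inside[i] = False                    # exclude the anchor itself
        neighbors.append(np.flatnonzero(inside))
    return neighbors

# Two nearby points and one distant outlier: the nearby pair find each
# other, while the outlier correctly ends up with an empty cluster.
mu = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
sigma = np.full_like(mu, 0.5)
clusters = adaptive_neighbors(mu, sigma, alpha=2.0)
```

Unlike KNN, which forces every point to have exactly k neighbors, this boundary test lets isolated points keep empty or small clusters and dense regions form large ones, matching the adaptive behavior the paper targets with α ≤ 2.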