SEMI-SUPERVISED LEARNING VIA CLUSTERING REPRESENTATION SPACE

Abstract

We propose a novel loss function that combines supervised learning with clustering in deep neural networks. Taking advantage of the data distribution and the availability of some labeled data, we construct a meaningful latent space. Our loss function consists of three parts: the quality of the clustering result, the margin between clusters, and the classification error on labeled instances. Our proposed model is trained to minimize this loss directly, avoiding the need for pre-training or additional networks. This guides the network to classify labeled samples correctly while finding good clusters at the same time. We applied the proposed method to MNIST, USPS, ETH-80, and COIL-100; the comparison results confirm our model's strong performance on semi-supervised learning.

1. INTRODUCTION

Labeling data is expensive, so it is often hard to obtain enough labeled samples. Semi-supervised learning (Chapelle et al., 2009) has therefore become an important problem: achieving good performance with limited labeled data and a large amount of unlabeled data. When labeled samples are scarce, extracting information from unlabeled data plays a central role in semi-supervised learning. Unlabeled information is commonly used as an auxiliary tool, for example through pre-training (Hinton and Salakhutdinov, 2006) or by recursively picking high-confidence predictions on unlabeled samples during supervised learning (Zhu, 2005). However, on problems such as the two half-moons, double circles, or other more complex distributions, these methods fail to exploit the spatial distribution information provided by unlabeled samples. In this paper, we aim to guide our model to extract this spatial distribution information from unlabeled data. We propose a new approach for semi-supervised learning by adding loss terms defined on a target embedding latent space. With our proposed model, the neural network learns correctness and spatial distribution information from labeled and unlabeled samples simultaneously. This gives the feed-forward neural network more opportunity to place its decision boundary in the sparse margin between clusters and elevates the performance of the classifier; see Sec. 3 for details. Moreover, it is worth noting that our proposed model does not rely on any additional neural networks, which makes it suitable for any task and highly compatible with different semi-supervised learning algorithms. In short, the characteristics of our proposed model are as follows: Intuitive: combining correctness and spatial distribution follows directly from the characteristics of supervised and unsupervised learning.
Compatible: our method does not rely on any additional neural networks but only adds new loss terms, so it is easy to plug into any existing feed-forward neural network. Extensible: our approach is built on the notion of an evaluation measure for spatial distribution, which can be replaced by other measures in future research.

2. RELATED WORK

In recent years, neural networks have played an essential role in various tasks, in particular image classification. Semi-supervised learning for image classification (Weston et al., 2012; Lee, 2013) has therefore become a vital issue. Some early works succeeded by proposing regularization methods for neural networks (Bishop, 1995; Srivastava et al., 2014). They regularize the input and hidden layers of their models by applying random perturbations, which smooths the input-output relation and yields improvements in semi-supervised learning. Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a popular line of neural-network research, and several models (Salimans et al., 2016; Dumoulin et al., 2016; Dai et al., 2017) have studied GANs in depth for semi-supervised learning, showing remarkable results, especially on image classification problems. In 2014, Kingma et al. proposed deep generative models (Kingma et al., 2014), applying a variational-autoencoder-based generative model to semi-supervised learning; their good image-classification results became the benchmark on several datasets. In practice, however, these models require careful parameter tuning and usually demand additional network structures and computational resources. Moreover, Rasmus et al. (2015) proposed semi-supervised learning with Ladder Networks, an autoencoder-based structure similar to denoising autoencoders and applied at every layer; impressively, they obtained a vast improvement over deep generative models (Kingma et al., 2014). Shortly after, Miyato et al. (2018) also achieved competitive results on the benchmark datasets with a regularization term.
They guide their model to minimize the change between the input and output of the network, which does not require label information and can therefore use unlabeled samples in the regularization term. Label propagation, proposed by Zhu and Ghahramani (2002), is another family of semi-supervised methods: by smoothing the model around the input data points, it extrapolates labels to unlabeled samples. Following a similar idea, several works (Laine and Aila, 2016; Sajjadi et al., 2016) have succeeded by using random image augmentation, aiming to improve the generalization performance of image classifiers under semi-supervised learning.

3. PROPOSED MODEL

For semi-supervised learning, we assume that in some suitable space, samples of the same category lie in the same cluster. Under this assumption, once we can distinguish the clusters properly, we can guide our network to place its decision boundary in the sparser region between clusters. In this section, we introduce an end-to-end learning method built on our additional loss functions. First, we guide the network to learn a good mapping from the original input space to an embedding latent space; see Sec. 3.1. We then define loss functions that pull together, in the embedding latent space, samples that should belong to the same category; see Sec. 3.2. Moreover, similar to some supervised learning works (Xu et al., 2005; Rennie and Srebro, 2005; Srebro et al., 2005), we aim to maximize the margin between different clusters to separate them well. Since labeled samples are scarce, we maximize the margin over the temporary clustering results on the data instead of the ground truth of labeled samples; see Sec. 3.3. We name our proposed model the Maximum Cluster Margin Classifier, referred to as MCMC.
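To make the three-term objective concrete, the sketch below combines a within-cluster scatter term, a negated cluster-margin term (maximizing the margin equals minimizing its negative), and a cross-entropy term on the labeled subset. The function names and the weights alpha, beta, gamma are our own illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def clustering_quality(embeddings, assignments, centroids):
    """Within-cluster scatter: mean distance of each point to its
    assigned centroid (lower means tighter clusters)."""
    return np.mean(np.linalg.norm(embeddings - centroids[assignments], axis=1))

def cluster_margin(centroids):
    """Smallest pairwise distance between centroids (to be maximized)."""
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(centroids)
             for b in centroids[i + 1:]]
    return min(dists)

def cross_entropy(probs, labels):
    """Classification error on the labeled instances only."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def total_loss(embeddings, assignments, centroids, probs, labels,
               alpha=1.0, beta=1.0, gamma=1.0):
    # Minimizing -margin pushes clusters apart while the other two
    # terms tighten clusters and fit the labeled data.
    return (alpha * clustering_quality(embeddings, assignments, centroids)
            - beta * cluster_margin(centroids)
            + gamma * cross_entropy(probs, labels))
```

In a real model the three terms would be computed on minibatches and backpropagated through the embedding network; the hyperparameters trade off cluster tightness, separation, and label fit.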

3.1. EMBEDDING LATENT SPACE

In general, measuring a clustering result is subjective. To handle various kinds of distributions, we avoid evaluating clusters directly in the original input space. Instead, we take a layer of the neural network and designate it as the embedding latent space, and then evaluate the quality of that space. In our proposed model, we define a measurement on the embedding latent space that reflects our assumptions; this guides the preceding layers to learn a good mapping from the original input space to a well-distributed embedding latent space. To strengthen the effectiveness of the embedding latent space, we add a simple classifier that is fully connected to it.
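One way to realize this layout (a toy sketch under our own assumptions, not the authors' exact architecture) is a feed-forward network whose penultimate layer is exposed as the embedding, with a single fully connected softmax classifier on top:

```python
import numpy as np

rng = np.random.default_rng(0)

class EmbeddingNet:
    """Toy feed-forward net: input -> hidden -> embedding -> classifier.
    The embedding layer is exposed so clustering/margin losses can be
    computed on it, while the classifier head handles labeled samples."""

    def __init__(self, in_dim, emb_dim, n_classes, hidden=32):
        self.W1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.W2 = rng.normal(0, 0.1, (hidden, emb_dim))
        self.Wc = rng.normal(0, 0.1, (emb_dim, n_classes))

    def embed(self, x):
        h = np.tanh(x @ self.W1)
        return np.tanh(h @ self.W2)      # the embedding latent space

    def forward(self, x):
        z = self.embed(x)
        logits = z @ self.Wc             # simple fully connected classifier
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return z, e / e.sum(axis=1, keepdims=True)
```

Because both the clustering losses (on `z`) and the classification loss (on the softmax output) depend on the same early layers, minimizing them jointly shapes the embedding with labeled and unlabeled samples at once.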

3.2.1. DAVIES-BOULDIN INDEX

Following the Davies-Bouldin index (Davies and Bouldin, 1979), given a partition of the data into N clusters, for every cluster C_i we compute S_i, a measure of scatter within the cluster, defined as

S_i = (1/|C_i|) Σ_{x ∈ C_i} ||x − A_i||,

where A_i is the centroid of cluster C_i.
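The scatter term, together with the full index from the cited paper (which also uses the distance M_ij between centroids A_i and A_j), can be sketched as follows; this is the standard definition, not code from the paper:

```python
import numpy as np

def scatter(cluster):
    """S_i: mean distance of a cluster's points to its centroid A_i."""
    centroid = cluster.mean(axis=0)
    return np.mean(np.linalg.norm(cluster - centroid, axis=1))

def davies_bouldin(clusters):
    """Standard Davies-Bouldin index: average over clusters of the worst
    ratio (S_i + S_j) / M_ij, where M_ij = ||A_i - A_j|| is the distance
    between centroids. Lower values mean tighter, better-separated clusters."""
    S = [scatter(c) for c in clusters]
    A = [c.mean(axis=0) for c in clusters]
    N = len(clusters)
    worst = []
    for i in range(N):
        ratios = [(S[i] + S[j]) / np.linalg.norm(A[i] - A[j])
                  for j in range(N) if j != i]
        worst.append(max(ratios))
    return float(np.mean(worst))
```

Since the index decreases as clusters tighten and separate, it is a natural candidate for the clustering-quality term of the loss.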