A DISCRIMINATIVE GAUSSIAN MIXTURE MODEL WITH SPARSITY

Abstract

In probabilistic classification, a discriminative model based on the softmax function has a potential limitation in that it assumes unimodality for each class in the feature space. A mixture model can address this issue, although it leads to an increase in the number of parameters. We propose a sparse classifier based on a discriminative Gaussian mixture model (GMM), referred to as a sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained via sparse Bayesian learning. Using this sparse learning framework, we can simultaneously remove redundant Gaussian components and reduce the number of parameters used in the remaining components during learning; this reduces the model complexity, thereby improving the generalization capability. Furthermore, the SDGM can be embedded into neural networks (NNs), such as convolutional NNs, and trained in an end-to-end manner. Experimental results demonstrated that the proposed method outperformed existing softmax-based discriminative models.

1. INTRODUCTION

In probabilistic classification, a discriminative model is an approach that assigns a class label c to an input sample x by estimating the posterior probability P(c | x). The posterior probability P(c | x) should be modeled correctly because it determines not only the classification accuracy but also the confidence of decision making in real-world applications such as medical diagnosis support. In general, the model calculates the class posterior probability using the softmax function after nonlinear feature extraction. Classically, a combination of the kernel method and the softmax function has been used; the recent mainstream approach is to use a deep neural network for representation learning and softmax for the calculation of the posterior probability.

This general procedure for building a discriminative model contains a potential limitation due to unimodality. A softmax-based model, such as the fully connected (FC) layer with a softmax function that is often used in deep neural networks (NNs), assumes a unimodal Gaussian distribution for each class (details are given in Appendix A). Therefore, even if the feature extraction part transforms the input into a discriminative feature space, P(c | x) cannot be modeled correctly if multimodality remains, which leads to a decrease in accuracy.

Mixture models can address this issue. Mixture models are widely used as generative models, with the Gaussian mixture model (GMM) as a typical example, and they are also effective in discriminative models; for example, discriminative GMMs have been applied successfully in various fields, e.g., speech recognition (Tüske et al. 2015; Wang 2007). However, the number of parameters grows as the number of mixture components increases, which may lead to over-fitting and an increase in memory usage; it is therefore desirable to remove redundant parameters while maintaining multimodality.

In this paper, we propose a discriminative model with two important properties: multimodality and sparsity. The proposed model is referred to as the sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is formulated and trained via sparse Bayesian learning. This learning algorithm reduces memory usage without losing generalization capability by obtaining sparse weights while maintaining the multimodality of the mixture model. The technical highlight of this study is twofold: the SDGM finds the multimodal structure in the feature space, and redundant Gaussian components are removed owing to sparse learning.

Figure 1 shows a comparison of the decision boundaries with other discriminative models. The two-class data are from Ripley's synthetic dataset (Ripley 2006), in which two Gaussian components are used to generate the data of each class. The FC layer with the softmax function, which is often used in the last layer of deep NNs, assumes a unimodal Gaussian for each class, resulting in an inappropriate decision boundary. Kernel Bayesian methods, such as the Gaussian process (GP) classifier (Wenzel et al. 2019) and the relevance vector machine (RVM) (Tipping 2001), estimate nonlinear decision boundaries using nonlinear kernels, but these methods cannot find multimodal structures. Although the discriminative GMM finds the multimodal structure, it retains redundant Gaussian components. In contrast, the proposed SDGM finds the multimodal structure of the data while removing redundant components, which leads to an accurate decision boundary.

Furthermore, the SDGM can be embedded into NNs, such as convolutional NNs (CNNs), and trained in an end-to-end manner with an NN. The SDGM can also be viewed as a mixture-based, nonlinear, and sparse expansion of logistic regression; it can therefore be used as the last layer of an NN for classification by replacing the FC layer with a softmax activation function.
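To illustrate what such a replacement could look like, the following is a minimal PyTorch sketch of a GMM-based classification head. The module name `SDGMHead`, the diagonal-covariance parameterization, and all hyperparameter values are our illustrative assumptions rather than the paper's reference implementation: each class owns several Gaussian components, and the class score is obtained by log-sum-exp over the component log-densities before the final softmax.

```python
import torch
import torch.nn as nn

class SDGMHead(nn.Module):
    """Illustrative GMM-based classification head with diagonal covariances.

    Drop-in replacement for an FC layer + softmax: each class c owns K
    Gaussian components, and the class score is
    log sum_k pi_{c,k} N(x | mu_{c,k}, diag(sigma_{c,k}^2)).
    """

    def __init__(self, in_dim: int, n_classes: int, n_components: int):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_classes, n_components, in_dim))
        # log-variances guarantee positivity and improve numerical stability
        self.log_var = nn.Parameter(torch.zeros(n_classes, n_components, in_dim))
        # unnormalized mixture weights per class
        self.logit_pi = nn.Parameter(torch.zeros(n_classes, n_components))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        diff = x[:, None, None, :] - self.mu                    # (B, C, K, D)
        var = self.log_var.exp()
        # log N(x | mu, diag(var)) up to an additive constant shared by all classes
        log_gauss = -0.5 * (diff.pow(2) / var + self.log_var).sum(dim=-1)  # (B, C, K)
        log_pi = torch.log_softmax(self.logit_pi, dim=-1)       # (C, K) mixture weights
        # combine components within each class by log-sum-exp
        return torch.logsumexp(log_pi + log_gauss, dim=-1)      # (B, C) class logits

head = SDGMHead(in_dim=64, n_classes=10, n_components=3)
logits = head(torch.randn(32, 64))  # (32, 10); feed to nn.CrossEntropyLoss
```

The sparse Bayesian learning that prunes redundant components and weights is the part this sketch deliberately omits; it only shows how a mixture head can stand in for an FC-softmax layer in end-to-end training.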
The contributions of this study are as follows:

• We propose a novel sparse classifier based on a discriminative GMM. The proposed SDGM has both multimodality and sparsity, thereby flexibly estimating the posterior distribution of classes while removing redundant parameters. Moreover, the SDGM automatically determines the number of components by removing redundant components during learning.

• From the perspective of Bayesian kernel methods, the SDGM is an expansion of the GP and the RVM. Owing to its multimodality, the SDGM can estimate the posterior probabilities more flexibly than the GP and RVM (a toy numerical illustration follows this list). An experimental comparison using benchmark data demonstrated superior performance to existing Bayesian kernel methods.

• This study connects the fields of probabilistic models and NNs. Based on the equivalence of a discriminative model based on a Gaussian distribution to an FC layer, we demonstrate that the SDGM can be used as a module of a deep NN. We also demonstrate that the SDGM exhibits superior performance to the FC layer with a softmax function via end-to-end learning with an NN on an image recognition task.
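As a toy numerical illustration of the multimodality argument (our own example, not taken from the paper), the following NumPy/SciPy sketch compares the Bayes posterior obtained with a two-component mixture class-conditional against a single moment-matched Gaussian for the same class; the unimodal model leaks a substantial posterior for the bimodal class into the middle of the other class's region.

```python
import numpy as np
from scipy.stats import norm

# Class 1 is bimodal in 1-D (components at -2 and +2, sd 0.5); class 2 sits at 0.
def posterior_mixture(x):
    p1 = 0.5 * norm.pdf(x, -2, 0.5) + 0.5 * norm.pdf(x, 2, 0.5)  # p(x | c=1)
    p2 = norm.pdf(x, 0, 0.5)                                      # p(x | c=2)
    return p1 / (p1 + p2)  # Bayes rule with equal class priors

def posterior_unimodal(x):
    # Single Gaussian moment-matched to class 1: mean 0, variance 0.5**2 + 2**2 = 4.25
    p1 = norm.pdf(x, 0, np.sqrt(4.25))
    p2 = norm.pdf(x, 0, 0.5)
    return p1 / (p1 + p2)

x = np.array([0.0, 2.0])
print(posterior_mixture(x))   # approx [0.0003, 0.9993]: clean separation
print(posterior_unimodal(x))  # approx [0.195, 0.998]: ~20% of class 1 leaks to x = 0
```

This is the same failure mode exhibited by the unimodal FC-softmax boundary in Figure 1, and the reason a mixture-based posterior model is needed when multimodality remains in the feature space.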

2. RELATED WORK AND POSITION OF THIS STUDY

The position of the proposed SDGM among the related methods is summarized in Figure 2. Interestingly, summarizing these relationships confirms that three separately developed fields, namely generative models, discriminative models, and kernel Bayesian methods, are related to each other. Starting from the Gaussian distribution, all the models shown in Figure 2 are connected.



Figure 1: Comparison of decision boundaries. Black and green circles represent training samples from classes 1 and 2, respectively. The dashed black line indicates the decision boundary between classes 1 and 2 and thus satisfies P(c = 1 | x) = P(c = 2 | x) = 0.5. The dashed blue and red lines represent the boundaries between the posterior probabilities of the mixture components.


