GRADIENT-BASED TRAINING OF GAUSSIAN MIXTURE MODELS FOR HIGH-DIMENSIONAL STREAMING DATA

Anonymous

Abstract

We present an approach for efficiently training Gaussian Mixture Models (GMMs) by Stochastic Gradient Descent (SGD) on non-stationary, high-dimensional streaming data. Our training scheme does not require data-driven parameter initialization (e.g., k-means) and can process high-dimensional samples without numerical problems. Furthermore, the approach allows mini-batch sizes as low as 1, which are typical for streaming-data settings, and it can react and adapt to changes in data statistics (concept drift/shift) without catastrophic forgetting. Major problems in such streaming-data settings are undesirable local optima during early training phases and numerical instabilities due to high data dimensionalities. We introduce an adaptive annealing procedure to address the first problem, whereas numerical instabilities are eliminated by using an exponential-free approximation to the standard GMM log-likelihood. Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained completely without, for instance, k-means based centroid initialization, and that it compares favorably to an online variant of Expectation-Maximization (EM), stochastic EM (sEM), which it outperforms by a large margin for very high-dimensional data.

1. INTRODUCTION

This contribution focuses on Gaussian Mixture Models (GMMs), a probabilistic unsupervised model for clustering and density estimation that also allows sampling and outlier detection. GMMs have been used in a wide range of scenarios, see, e.g., Melnykov & Maitra (2010). Commonly, the free parameters of a GMM are estimated by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), as it does not require learning rates and automatically enforces all GMM constraints. A popular online variant is stochastic Expectation-Maximization (sEM) (Cappé & Moulines, 2009), which can be trained mini-batch wise and is thus better suited for large datasets or streaming data.
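As a point of reference for the EM algorithm discussed throughout this article, the following is a minimal, illustrative batch EM implementation for a GMM with diagonal covariances (a simplified sketch, not the method proposed here; the function name `em_gmm` and all defaults are our own choices for illustration):

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal batch EM for a GMM with diagonal covariances (illustrative only).

    X: (N, D) data matrix, K: number of mixture components.
    Returns mixture weights pi (K,), means mu (K, D), variances var (K, D).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                      # uniform initial weights
    mu = X[rng.choice(N, K, replace=False)]       # random-sample centroid init
    var = np.tile(X.var(axis=0) + 1e-6, (K, 1))   # per-dimension variances
    for _ in range(n_iter):
        # E-step: responsibilities p(k|x_n), computed in log space for stability
        log_p = (np.log(pi)[None, :]
                 - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)[None, :]
                 - 0.5 * np.sum((X[:, None, :] - mu[None, :, :]) ** 2
                                / var[None, :, :], axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)  # shift-by-max before exp
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from responsibility-weighted statistics
        Nk = resp.sum(axis=0) + 1e-12
        pi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        var = (resp.T @ X ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return pi, mu, var
```

Note that EM needs no learning rate: both the E-step and the M-step are closed-form, which is the key advantage over SGD mentioned above.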

1.1. MOTIVATION

Intrinsically, EM is a batch-type algorithm. Memory requirements can therefore become excessive for large datasets. In addition, streaming-data scenarios require data samples to be processed one by one, which is impossible for a batch-type algorithm. Moreover, data statistics may be subject to changes over time (concept drift/shift), to which the GMM should adapt. In such scenarios, an online, mini-batch type of optimization such as SGD is attractive, as it can process samples one by one, has modest, fixed memory requirements and can adapt to changing data statistics.

1.2. RELATED WORK

Online EM is a technique for performing EM mini-batch wise, allowing the processing of large datasets. One branch of previous research (Newton et al., 1986; Lange, 1995; Chen et al., 2018) has been devoted to the development of stochastic Expectation-Maximization (sEM) algorithms that reduce to the original EM method in the limit of large batch sizes. The variant of Cappé & Moulines (2009) is widely used due to its simplicity and efficiency for large datasets. These approaches come at the price of additional hyper-parameters (e.g., learning rate or mini-batch size), thus removing a key advantage of EM over SGD. Another common approach is to modify the EM algorithm itself by, e.g., including heuristics for adding, splitting and merging centroids (Vlassis & Likas, 2002; Engel & Heinen, 2010; Pinto & Engel, 2015; Cederborg et al., 2010; Song & Wang, 2005; Kristan et al., 2008; Vijayakumar et al., 2005). This allows GMM-like models to be trained by presenting one sample after another. These models work well in several application scenarios, but their learning dynamics are impossible to analyze mathematically, and they introduce a large number of parameters. Apart from these works, some authors avoid the issue of extensive datasets by determining smaller "core sets" of representative samples and performing vanilla EM on them (Feldman et al., 2011).

SGD for training GMMs has, as far as we know, been treated only recently by Hosseini & Sra (2015; 2019). In this body of work, GMM constraint enforcement is ensured by using manifold optimization techniques and re-parameterization/regularization, which introduces additional hyper-parameters. The issue of local optima is sidestepped by a k-means type centroid initialization, and the image datasets used are low-dimensional (36 dimensions). Additionally, enforcing positive-definiteness constraints by Cholesky decomposition is discussed.

Annealing and approximation approaches for GMMs were proposed by Verbeek et al. (2005); Pinheiro & Bates (1995); Ormoneit & Tresp (1998); Dognin et al. (2009). However, the regularizers proposed by Verbeek et al. (2005) and Ormoneit & Tresp (1998) differ significantly from our scheme. GMM log-likelihood approximations similar to the one used here are discussed in, e.g., Pinheiro & Bates (1995) and Dognin et al. (2009), but only in combination with EM training.

GMM training in high-dimensional spaces is discussed in several publications. A conceptually very interesting procedure is proposed by Ge et al. (2015), which exploits the properties of high-dimensional spaces in order to achieve learning with a number of samples that is polynomial in the number of Gaussian components. This is difficult to apply in streaming settings, since higher-order moments need to be estimated beforehand, and because the number of samples usually cannot be controlled in practice. Training GMM-like lower-dimensional factor analysis models by SGD on high-dimensional image data is successfully demonstrated by Richardson & Weiss (2018), avoiding numerical issues but, again, sidestepping the local-optima issue by using k-means initialization. The numerical issues associated with log-likelihood computation in high-dimensional spaces are generally mitigated by the "log-sum-exp" trick (Nielsen & Sun, 2016), which is, however, insufficient for ensuring numerical stability for particularly high-dimensional data, such as images.
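The "log-sum-exp" trick mentioned above can be illustrated in a few lines. In the sketch below, the example values (per-component log-likelihoods around -5000, as might plausibly arise for image data with thousands of dimensions) are our own illustration:

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a))) via the standard shift-by-max trick."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

# Per-component log-likelihoods of one high-dimensional sample: for D in the
# thousands, these are typically on the order of -O(D).
log_probs = np.array([-5000.0, -5100.0, -5200.0])

naive = np.log(np.sum(np.exp(log_probs)))   # exp underflows to 0 -> log gives -inf
stable = logsumexp(log_probs)               # ~ -5000.0, as intended
```

The trick stabilizes the summation over components, but only if the per-component log-likelihoods themselves are already computed in log space; any intermediate exponentiation of such values underflows regardless of the final summation.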

1.3. GOALS AND CONTRIBUTIONS

The goals of this article are to establish GMM training by SGD as a simple and scalable alternative to sEM in streaming scenarios with potentially high-dimensional data. The main novel contributions are:

• a proposal for numerically stable GMM training by SGD that outperforms sEM for high data dimensionalities,

• an automatic annealing procedure that ensures SGD convergence from a wide range of initial conditions without prior knowledge of the data (e.g., no k-means initialization), which is especially beneficial for streaming data, and

• a computationally efficient method for enforcing all GMM constraints in SGD.

In addition, we provide a publicly available TensorFlow implementation (https://github.com/gmm-iclr21/sgd-gmm).
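To make the constraint-enforcement point concrete: the GMM constraints are that mixture weights must be positive and sum to one, and variances must stay strictly positive. A standard way to satisfy both under unconstrained SGD is to re-parameterize; the sketch below shows this generic re-parameterization (softmax weights, exponential variances) and is an illustration of the general idea, not necessarily the exact scheme proposed in this article:

```python
import numpy as np

def constrained_params(free_weights, free_log_var):
    """Map unconstrained SGD variables to valid GMM parameters.

    free_weights: (K,) unconstrained logits for the mixture weights.
    free_log_var: (K, D) unconstrained log-variances.
    """
    shifted = free_weights - free_weights.max()  # shift-by-max for stability
    pi = np.exp(shifted)
    pi = pi / pi.sum()            # softmax: weights > 0 and sum to 1
    var = np.exp(free_log_var)    # exp: variances strictly positive
    return pi, var
```

Because the mapping is differentiable, gradients can be back-propagated to the free variables, and the constraints hold automatically at every SGD step without projections or manifold machinery.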

2. DATASETS

We use a variety of different image-based datasets as well as a non-image dataset for evaluation purposes. All datasets are normalized to the [0, 1] range. MNIST (LeCun et al., 1998) contains gray-scale images depicting handwritten digits from 0 to 9 at a resolution of 28 × 28 pixels; it is a common benchmark for computer vision systems. SVHN (Wang et al., 2012) contains color images of house numbers (0-9, resolution 32 × 32). FashionMNIST (Xiao et al., 2017) contains gray-scale images of 10 clothing categories and is considered a more challenging classification task than MNIST. Fruits 360 (Mureșan & Oltean, 2018) consists of color pictures showing different types of fruits (100 × 100 × 3 pixels); the ten best-represented classes are selected from this dataset. Devanagari (Acharya et al., 2016) includes gray-scale images of handwritten Devanagari letters at a resolution of 32 × 32 pixels; the first 10 classes are selected. NotMNIST (Yaroslav Bulatov, 2011) is a gray-scale image dataset (resolution 28 × 28 pixels) of the letters A to J rendered in different fonts.
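For 8-bit image data, the [0, 1] normalization applied to all benchmarks amounts to a division by 255, typically followed by flattening each image into a D-dimensional vector so it can be fed to the GMM; a minimal sketch (the helper name `preprocess` is our own):

```python
import numpy as np

def preprocess(images_uint8):
    """Scale 8-bit images to [0, 1] and flatten to (N, D) sample vectors."""
    X = images_uint8.astype(np.float32) / 255.0
    return X.reshape(len(X), -1)   # (N, H, W[, C]) -> (N, D)
```

For example, a batch of MNIST images of shape (N, 28, 28) becomes an (N, 784)-matrix of floats in [0, 1].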


