GRADIENT-BASED TRAINING OF GAUSSIAN MIXTURE MODELS FOR HIGH-DIMENSIONAL STREAMING DATA

Anonymous

Abstract

We present an approach for efficiently training Gaussian Mixture Models (GMMs) by Stochastic Gradient Descent (SGD) on non-stationary, high-dimensional streaming data. Our training scheme does not require data-driven parameter initialization (e.g., by k-means) and can process high-dimensional samples without numerical problems. Furthermore, the approach allows mini-batch sizes as low as 1, which are typical for streaming-data settings, and can react and adapt to changes in data statistics (concept drift/shift) without catastrophic forgetting. Major problems in such streaming-data settings are undesirable local optima during early training phases and numerical instabilities caused by high data dimensionalities. We introduce an adaptive annealing procedure to address the first problem, whereas numerical instabilities are eliminated by an exponential-free approximation to the standard GMM log-likelihood. Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained entirely without, for instance, k-means-based centroid initialization, and that it compares favorably to an online variant of Expectation-Maximization (EM), stochastic EM (sEM), which it outperforms by a large margin for very high-dimensional data.

1. INTRODUCTION

This contribution focuses on Gaussian Mixture Models (GMMs), a probabilistic unsupervised model for clustering and density estimation that also allows sampling and outlier detection. GMMs have been used in a wide range of scenarios, e.g., Melnykov & Maitra (2010). Commonly, the free parameters of a GMM are estimated by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), as it requires no learning rates and automatically enforces all GMM constraints. A popular online variant is stochastic Expectation-Maximization (sEM) (Cappé & Moulines, 2009), which can be trained mini-batch-wise and is thus better suited to large datasets or streaming data.
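To make the batch-EM baseline concrete, a minimal pure-Python sketch of EM for a two-component, one-dimensional GMM might look as follows. All names and the toy data here are illustrative, not from the paper; the point is that the M-step updates are closed-form, so no learning rate is needed, and the mixture weights automatically satisfy the GMM constraints (non-negative, summing to 1).

```python
import math
import random

def gaussian_pdf(x, mu, var):
    # Univariate normal density N(x; mu, var).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, n_iter=50):
    # Two-component 1-D GMM with a crude data-driven initialization
    # (the kind of initialization the paper's SGD scheme avoids).
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    n = len(data)
    for _ in range(n_iter):
        # E-step: responsibilities r[i][k] = p(component k | x_i).
        r = []
        for x in data:
            p = [pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            r.append([pk / s for pk in p])
        # M-step: closed-form updates, no learning rate required.
        for k in range(2):
            nk = sum(r[i][k] for i in range(n))
            mu[k] = sum(r[i][k] * data[i] for i in range(n)) / nk
            var[k] = sum(r[i][k] * (data[i] - mu[k]) ** 2 for i in range(n)) / nk
            pi[k] = nk / n  # weights stay non-negative and sum to 1
    return pi, mu, var

random.seed(0)
data = ([random.gauss(-2.0, 0.5) for _ in range(200)]
        + [random.gauss(3.0, 0.5) for _ in range(200)])
pi, mu, var = em_gmm_1d(data)
```

Note that the E-step requires all samples before the M-step can run, which is exactly the batch property that makes plain EM unsuited to streaming data.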

1.1. MOTIVATION

Intrinsically, EM is a batch-type algorithm. Memory requirements can therefore become excessive for large datasets. In addition, streaming-data scenarios require data samples to be processed one by one, which is impossible for a batch-type algorithm. Moreover, data statistics may be subject to changes over time (concept drift/shift), to which the GMM should adapt. In such scenarios, an online, mini-batch type of optimization such as SGD is attractive, as it can process samples one by one, has modest, fixed memory requirements and can adapt to changing data statistics.

1.2. RELATED WORK

Online EM is a technique for performing EM mini-batch-wise, making it possible to process large datasets. One branch of previous research (Newton et al., 1986; Lange, 1995; Chen et al., 2018) has been devoted to developing stochastic Expectation-Maximization (sEM) algorithms that reduce to the original EM method in the limit of large batch sizes. The variant of Cappé & Moulines (2009) is widely used due to its simplicity and efficiency for large datasets. These approaches come at the price of additional hyper-parameters (e.g., learning rate or mini-batch size), thus removing a key advantage of EM over SGD. Another common approach is to modify the EM algorithm itself by, e.g., including heuristics for adding, splitting and merging centroids (Vlassis & Likas, 2002; Engel

