GRADIENT-BASED TRAINING OF GAUSSIAN MIXTURE MODELS FOR HIGH-DIMENSIONAL STREAMING DATA

Anonymous

Abstract

We present an approach for efficiently training Gaussian Mixture Models (GMMs) by Stochastic Gradient Descent (SGD) on non-stationary, high-dimensional streaming data. Our training scheme does not require data-driven parameter initialization (e.g., k-means) and can process high-dimensional samples without numerical problems. Furthermore, the approach allows mini-batch sizes as low as 1, which are typical for streaming-data settings, and can react and adapt to changes in data statistics (concept drift/shift) without catastrophic forgetting. Major problems in such streaming-data settings are undesirable local optima during early training phases and numerical instabilities due to high data dimensionalities. We introduce an adaptive annealing procedure to address the first problem, whereas numerical instabilities are eliminated by using an exponential-free approximation to the standard GMM log-likelihood. Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained completely without, for instance, k-means based centroid initialization, and that it compares favorably to an online variant of Expectation-Maximization (EM), stochastic EM (sEM), which it outperforms by a large margin for very high-dimensional data.

1. INTRODUCTION

This contribution focuses on Gaussian Mixture Models (GMMs), a probabilistic unsupervised model for clustering and density estimation that also allows sampling and outlier detection. GMMs have been used in a wide range of scenarios, e.g., Melnykov & Maitra (2010). Commonly, the free parameters of a GMM are estimated by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), as it requires no learning rates and automatically enforces all GMM constraints. A popular online variant is stochastic Expectation Maximization (sEM) (Cappé & Moulines, 2009), which can be trained mini-batch wise and is thus better suited for large datasets or streaming data.

1.1. MOTIVATION

Intrinsically, EM is a batch-type algorithm. Memory requirements can therefore become excessive for large datasets. In addition, streaming-data scenarios require data samples to be processed one by one, which is impossible for a batch-type algorithm. Moreover, data statistics may be subject to changes over time (concept drift/shift), to which the GMM should adapt. In such scenarios, an online, mini-batch type of optimization such as SGD is attractive, as it can process samples one by one, has modest, fixed memory requirements and can adapt to changing data statistics.

1.2. RELATED WORK

Online EM is a technique for performing EM mini-batch wise, allowing large datasets to be processed. One branch of previous research (Newton et al., 1986; Lange, 1995; Chen et al., 2018) has been devoted to the development of stochastic Expectation Maximization (sEM) algorithms that reduce to the original EM method in the limit of large batch sizes. The variant of Cappé & Moulines (2009) is widely used due to its simplicity and efficiency for large datasets. These approaches come at the price of additional hyper-parameters (e.g., learning rate or mini-batch size), thus removing a key advantage of EM over SGD. Another common approach is to modify the EM algorithm itself, e.g., by including heuristics for adding, splitting and merging centroids (Vlassis & Likas, 2002; Engel & Heinen, 2010; Pinto & Engel, 2015; Cederborg et al., 2010; Song & Wang, 2005; Kristan et al., 2008; Vijayakumar et al., 2005). This allows GMM-like models to be trained by presenting one sample after another. Such models work well in several application scenarios, but their learning dynamics are impossible to analyze mathematically, and they introduce a large number of parameters. Apart from these works, some authors avoid the issue of extensive datasets by determining smaller "core sets" of representative samples and performing vanilla EM on them (Feldman et al., 2011).

SGD for training GMMs has, as far as we know, been treated only recently by Hosseini & Sra (2015; 2019). In this body of work, GMM constraint enforcement is ensured by manifold optimization techniques and re-parameterization/regularization, which introduces additional hyper-parameters. The issue of local optima is sidestepped by a k-means type centroid initialization, and the image datasets used are low-dimensional (36 dimensions). Additionally, enforcing positive-definiteness constraints by Cholesky decomposition is discussed.

Annealing and approximation approaches for GMMs were proposed by Verbeek et al. (2005); Pinheiro & Bates (1995); Ormoneit & Tresp (1998); Dognin et al. (2009). However, the regularizers proposed by Verbeek et al. (2005) and Ormoneit & Tresp (1998) differ significantly from our scheme. GMM log-likelihood approximations, similar to the one used here, are discussed in, e.g., Pinheiro & Bates (1995) and Dognin et al. (2009), but only in combination with EM training.

GMM training in high-dimensional spaces is discussed in several publications. A conceptually very interesting procedure is proposed by Ge et al. (2015): it exploits the properties of high-dimensional spaces in order to achieve learning with a number of samples that is polynomial in the number of Gaussian components. This is difficult to apply in streaming settings, since higher-order moments need to be estimated beforehand, and the number of samples usually cannot be controlled in practice. Training GMM-like lower-dimensional factor analysis models by SGD on high-dimensional image data is successfully demonstrated in Richardson & Weiss (2018), avoiding numerical issues but, again, sidestepping the local optima issue by k-means initialization. The numerical issues associated with log-likelihood computation in high-dimensional spaces are generally mitigated by the "logsumexp" trick (Nielsen & Sun, 2016), which is, however, insufficient for ensuring numerical stability for particularly high-dimensional data such as images.

1.3. GOALS AND CONTRIBUTIONS

The goals of this article are to establish GMM training by SGD as a simple and scalable alternative to sEM in streaming scenarios with potentially high-dimensional data. The main novel contributions are:

• a proposal for numerically stable GMM training by SGD that outperforms sEM for high data dimensionalities,
• an automatic annealing procedure that ensures SGD convergence from a wide range of initial conditions without prior knowledge of the data (e.g., no k-means initialization), which is especially beneficial for streaming data,
• a computationally efficient method for enforcing all GMM constraints in SGD.

Apart from these contributions, we provide a publicly available TensorFlow implementation.¹

2. DATASETS

We use a variety of image datasets, as well as one non-image dataset, for evaluation. All datasets are normalized to the [0, 1] range. MNIST (LeCun et al., 1998) contains gray-scale images depicting handwritten digits from 0 to 9 at a resolution of 28×28 pixels; it is the common benchmark for computer vision systems. SVHN (Wang et al., 2012) contains color images of house numbers (0-9, resolution 32×32). FashionMNIST (Xiao et al., 2017) contains gray-scale images of 10 clothing categories and is considered a more challenging classification task than MNIST. Fruits 360 (Mureșan & Oltean, 2018) consists of color pictures (100×100×3 pixels) showing different types of fruits; the ten best-represented classes are selected from this dataset. Devanagari (Acharya et al., 2016) includes gray-scale images of handwritten Devanagari letters at a resolution of 32×32 pixels; the first 10 classes are selected. NotMNIST (Yaroslav Bulatov, 2011) is a gray-scale image dataset (resolution 28×28 pixels) of the letters A to J, extracted from publicly available fonts. ISOLET (Cole & Fanty, 1990) is a non-image dataset containing 7 797 samples of spoken letters recorded from 150 subjects; each sample is encoded as 617 float values.

3. GAUSSIAN MIXTURE MODELS

GMMs are probabilistic models that explain the observed data X = {x_n} by expressing their density as a weighted mixture of K Gaussian component densities N(x; µ_k, P_k) ≡ N_k(x):

p(x) = Σ_{k=1..K} π_k N_k(x).

We work with precision matrices P_k = Σ_k^{-1} instead of covariances Σ_k. Training is realized by optimizing the (incomplete) log-likelihood

L = E_n [ log Σ_k π_k N_k(x_n) ].   (1)
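For concreteness, the mixture density and the log-likelihood of Eq. (1) can be sketched in NumPy for the diagonal-precision case used later in the experiments. This is our own minimal sketch, not the paper's released TensorFlow code; function names are our choice.

```python
import numpy as np

def log_component_densities(X, mu, prec_diag):
    """log N_k(x_n) for diagonal precision matrices P_k = diag(prec_diag[k])."""
    N, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                    # shape (N, K, D)
    log_det = 0.5 * np.sum(np.log(prec_diag), axis=1)        # 0.5 * log det P_k
    quad = 0.5 * np.einsum('nkd,kd->nk', diff ** 2, prec_diag)
    return log_det[None, :] - 0.5 * D * np.log(2 * np.pi) - quad

def log_likelihood(X, pi, mu, prec_diag):
    """Incomplete-data log-likelihood L = E_n[log sum_k pi_k N_k(x_n)], Eq. (1)."""
    log_w = np.log(pi)[None, :] + log_component_densities(X, mu, prec_diag)
    m = log_w.max(axis=1, keepdims=True)                     # logsumexp trick
    return float(np.mean(m[:, 0] + np.log(np.exp(log_w - m).sum(axis=1))))
```

Note that even this version already needs the "logsumexp" trick; Sec. 3.2 explains why that is still not enough for very high-dimensional inputs.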

3.1. GMMS AND SGD

GMMs require the mixture weights to be normalized, Σ_k π_k = 1, and the precision matrices to be positive definite: x^T P_k x > 0 ∀ x ≠ 0. These constraints must be explicitly enforced after each SGD step. Weights π_k are adapted following Hosseini & Sra (2015), which replaces them by free parameters ξ_k from which the π_k are computed so that normalization is ensured: π_k = exp(ξ_k) / Σ_j exp(ξ_j). Precision matrices need to be positive definite, so we re-parameterize them as P_k = D_k^T D_k, where the upper-triangular matrices D_k result from a Cholesky decomposition. Consequently, det Σ_k = (det P_k)^{-1} = (det D_k^T D_k)^{-1} = (Π_i (D_k)_ii)^{-2} can be computed efficiently. To avoid recomputing the costly Cholesky decomposition of the P_k at every iteration, we perform it once on the initial precision matrices and simply erase the elements below the diagonal in the D_k after each gradient step.
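Both constraint-enforcement steps are cheap projections that can be applied after every gradient update. The following NumPy sketch illustrates them under our own naming; the clipping bound `D_max = 20` follows the hyper-parameter used in the experiments.

```python
import numpy as np

def mixture_weights(xi):
    """pi_k = exp(xi_k) / sum_j exp(xi_j): normalization holds by construction."""
    e = np.exp(xi - np.max(xi))          # shift for numerical stability
    return e / e.sum()

def project_cholesky(D, D_max=20.0):
    """Restore the Cholesky structure of D_k after an unconstrained SGD step:
    erase entries below the diagonal, clip the diagonal into (0, D_max]."""
    D = np.triu(D)                       # upper-triangular part only
    np.fill_diagonal(D, np.clip(np.diag(D), 1e-8, D_max))
    return D
```

With a strictly positive diagonal, P_k = D_k^T D_k is guaranteed positive definite, so no eigenvalue checks are needed during training.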

3.2. MAX-COMPONENT APPROXIMATION FOR GMMS

The log-likelihood Eq. (1) is difficult to optimize by SGD (see Sec. 3.3), which is why we construct a lower bound that we optimize instead. A simple scheme is given by

L = E_n [ log Σ_k π_k N_k(x_n) ] ≥ E_n [ max_k log(π_k N_k(x_n)) ] = E_n [ log(π_{k*} N_{k*}(x_n)) ] ≡ L̂,   (3)

where k* = argmax_k π_k N_k(x_n). We call this the max-component approximation. In contrast to the lower bound constructed for EM-type algorithms, this bound is usually not tight. The advantages of L̂ are the avoidance of undesirable local optima in SGD and the elimination of exponentials, which cause numerical instabilities for high data dimensions. The "logsumexp" trick is normally employed with GMMs to rectify the latter by factoring out the largest component probability N_{k*}. This mitigates, but does not avoid, numerical problems when distances are large. To give an example: normalizing the component probability N_k = e^{-101} (using 32-bit floats) by the highest probability N_{k*} = e^{3} yields N_k / N_{k*} = e^{-104}, which is numerically problematic.
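The failure mode can be made concrete with a small NumPy illustration of our own (the log-probability values are invented for the demonstration, but differences of this magnitude are routine for inputs with thousands of dimensions): the naive sum of exponentials underflows to zero and yields −inf, whereas the max-component approximation involves no exponential at all.

```python
import numpy as np

# invented log-probabilities of three components for one high-dimensional sample
log_probs = np.array([-50000.0, -50010.0, -50100.0], dtype=np.float32)

with np.errstate(divide='ignore'):
    # naive evaluation: exp() underflows to 0.0, so the log becomes -inf
    naive = np.log(np.sum(np.exp(log_probs)))

# max-component approximation: a plain maximum, exponential-free
max_component = np.max(log_probs)        # -50000.0
```

The logsumexp trick would rescue the forward value here, but, as discussed in Sec. 5, automatic differentiation still propagates gradients through the underflowing exponentials, which is where NaN values appear in practice.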

3.3. UNDESIRABLE LOCAL OPTIMA IN SGD TRAINING

A key issue when performing SGD without k-means initialization concerns undesirable local optima. Degenerate solutions occur when naively optimizing L by SGD (see Fig. 1a). All components have the same weight, centroid and covariance matrix: π_k ≈ 1/K, µ_k = E[X], Σ_k = Cov(X) ∀ k, in which case all gradients vanish (see App. A.3 for a proof). These solutions are avoided by L̂, since only a subset of components is updated by SGD, thereby breaking the symmetry between components. Single/sparse-component solutions occur when optimizing L̂ by SGD (see Fig. 1b). They are characterized by one or several components {k_i} that have large weights, with centroids and precision matrices given by the mean and covariance of a significant subset X_{k_i} ⊂ X of the data: π_{k_i} ≫ 0, µ_{k_i} = E[X_{k_i}], Σ_{k_i} = Cov(X_{k_i}), whereas the remaining components k are characterized by π_k ≈ 0, µ_k = µ_k(t=0), P_k = P_k(t=0). These unconverged components are thus almost never best-matching components k*. The max-operation in L̂ causes gradients such as ∂L̂/∂µ_k to contain a factor δ_{kk*} (see App. A.3), i.e., they are non-zero only for the best-matching component k*. The gradients of unconverged components therefore vanish, implying that they remain in their unconverged state.

3.4. ANNEALING PROCEDURE FOR AVOIDING LOCAL OPTIMA

Our approach for avoiding undesirable solutions is to punish their characteristic response patterns by a modification of L̂, the smoothed max-component log-likelihood L̂_σ:

L̂_σ = E_n [ max_k Σ_j g_kj(σ) log(π_j N_j(x_n)) ] = E_n [ Σ_j g_{k*j}(σ) log(π_j N_j(x_n)) ].   (4)

The entries of g_k are computed by a Gaussian function centered on component k with common spatial standard deviation σ, where we assume that the K components are arranged on a √K × √K grid with 2D Euclidean metric (see App. A.4). Eq. (4) essentially represents a smoothing of the log(π_k N_k(x)) with a 2D convolution filter (we use periodic boundary conditions). Thus, Eq. (4) is maximized if the log-probabilities follow a uni-modal Gaussian profile of spatial variance ∼ σ², which heavily punishes single-component solutions with their locally delta-like response. Annealing starts with a large value σ(t) = σ_0 and reduces it over time to an asymptotically small value σ_∞, thus smoothly transitioning from L̂_σ in Eq. (4) into L̂ in Eq. (3). Annealing control is ensured by adjusting σ, which defines an effective upper bound on L̂_σ (see App. A.2 for a proof). This implies that the loss will be stationary once this bound is reached, which we consider a suitable indicator for reducing σ. We implement an annealing control that sets σ ← 0.9σ whenever the loss is considered sufficiently stationary. Stationarity is detected by maintaining an exponentially smoothed average ℓ(t) = (1 − α) ℓ(t−1) + α L̂_σ(t) on time scale α. Every α⁻¹ iterations, we compute the fractional increase of L̂_σ as

∆ = [ℓ(t) − ℓ(t − α⁻¹)] / [ℓ(t − α⁻¹) − L̂_σ(t=0)]   (5)

and consider the loss stationary iff ∆ < δ (the latter being a free parameter). The choice of the time constant α for smoothing L̂_σ is outlined in the following section.
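The annealing control loop described above can be sketched as a small stateful helper. This is our own reading of Eq. (5), with class and attribute names of our choosing; the decay factor 0.9 and the stationarity test ∆ < δ follow the text.

```python
import numpy as np

class AnnealingControl:
    """Shrinks sigma by 0.9 whenever the smoothed loss is stationary (Eq. 5)."""

    def __init__(self, sigma0, sigma_inf, alpha, delta, loss0):
        self.sigma, self.sigma_inf = sigma0, sigma_inf
        self.alpha, self.delta = alpha, delta
        self.ell = loss0        # exponentially smoothed loss l(t)
        self.ell_prev = loss0   # l(t - 1/alpha), from the previous check
        self.loss0 = loss0      # \hat L_sigma at t = 0
        self.steps = 0

    def update(self, loss):
        self.ell = (1 - self.alpha) * self.ell + self.alpha * loss
        self.steps += 1
        if self.steps % round(1 / self.alpha) == 0:   # check every 1/alpha iters
            denom = self.ell_prev - self.loss0
            frac = (self.ell - self.ell_prev) / denom if denom != 0 else 1.0
            if frac < self.delta:                     # stationary -> anneal
                self.sigma = max(0.9 * self.sigma, self.sigma_inf)
            self.ell_prev = self.ell
        return self.sigma
```

Feeding a loss that plateaus causes σ to decay toward σ_∞, exactly the transition from L̂_σ to L̂ described above.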

3.5. TRAINING PROCEDURE FOR SGD

Training GMMs with SGD is performed by maximizing the smoothed max-component log-likelihood L̂_σ from Eq. (4). At the same time, we enforce the constraints on the component weights and covariances as described in Sec. 3.1 and transition from L̂_σ into L̂ by annealing (see Sec. 3.4). SGD requires a learning rate ε to be set, which in turn determines the parameter α (see Sec. 3.4) as α = ε, since stationarity detection should operate on a time scale similar to that of SGD. Cholesky matrices D_k are initialized to D_max I and are clipped after each iteration so that their diagonal entries lie in the range [0, D_max]. This is necessary to avoid excessive growth of precisions for data entries with vanishing variance, e.g., pixels that are always black. Weights are uniformly initialized to π_k = 1/K, and centroids are initialized uniformly in the range [−µ_i, +µ_i] (see Alg. 1 for a summary). Please note that our SGD approach requires no centroid initialization by k-means, which is usually recommended when training GMMs with EM. We discuss and summarize good practices for choosing hyper-parameters in Sec. 5.

Algorithm 1: Steps of SGD-GMM training.
Data: initializer values µ_i, K, ε_0/ε_∞, σ_0/σ_∞, δ, and data X
Result: trained GMM model
1: µ ← U(−µ_i, +µ_i), π ← 1/K, D ← D_max I, σ ← σ_0, ℓ ← 0
2: for each mini-batch: take an SGD step on L̂_σ, enforce the constraints of Sec. 3.1, update ℓ
3: if ∆ < δ then σ(t) ← 0.9 σ(t−1), ε(t) ← 0.9 ε(t−1)   // ∆ see Eq. (5)
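A single training iteration of the procedure above can be sketched with the closed-form gradients of App. A.3 for the diagonal-precision case. This is a simplified sketch of our own (fixed uniform weights, no annealing, a learning rate larger than the paper's ε = 0.001 so the demonstration converges in few steps); the clip bound 400 corresponds to precisions of D_max² with D_max = 20.

```python
import numpy as np

def sgd_gmm_step(x, pi, mu, prec, eps=0.05, prec_max=400.0):
    """One max-component SGD step on a single sample x (mini-batch size 1).

    Diagonal precisions prec[k]; the delta_{kk*} factor in the gradients
    (App. A.3) means only the best-matching component k* is updated.
    """
    D = x.shape[0]
    log_N = (0.5 * np.sum(np.log(prec), axis=1)
             - 0.5 * D * np.log(2.0 * np.pi)
             - 0.5 * np.sum(prec * (x - mu) ** 2, axis=1))
    k = int(np.argmax(np.log(pi) + log_N))          # best-matching component k*
    mu[k] += eps * prec[k] * (x - mu[k])            # ascent on dL/dmu_k
    prec[k] += eps * (1.0 / prec[k] - (x - mu[k]) ** 2)  # ascent on dL/dP_k
    np.clip(prec[k], 1e-6, prec_max, out=prec[k])   # precision clipping
    return k
```

Repeated application on a stream of samples pulls the best-matching centroid toward the data while its precision grows until the clip bound or the data variance limits it.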

3.6. TRAINING PROCEDURE FOR STOCHASTIC EXPECTATION MAXIMIZATION

We use sEM as proposed by Cappé & Moulines (2009). We choose a step size of the form ρ_t = ρ_0 (t + 1)^{−0.5+α}, with α ∈ [0, 0.5], ρ_0 < 1, and enforce ρ(t) ≥ ρ_∞. Values for these parameters are determined via a grid search over ρ_0 ∈ {0.01, 0.05, 0.1}, α ∈ {0.01, 0.25, 0.5} and ρ_∞ ∈ {0.01, 0.001, 0.0001}. Each sEM iteration uses a batch size B. Initial accumulation of sufficient statistics is conducted for 10% of an epoch, but not when re-training with new data statistics. Parameter initialization and clipping of precisions are performed just as for SGD, see Sec. 3.5.
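The step-size schedule is a one-liner; the default arguments below are one cell of the grid search quoted above, picked by us purely for illustration.

```python
def sem_step_size(t, rho0=0.05, alpha=0.25, rho_inf=0.001):
    """rho_t = rho0 * (t + 1)^(-0.5 + alpha), bounded below by rho_inf."""
    return max(rho0 * (t + 1) ** (-0.5 + alpha), rho_inf)
```

With α < 0.5 the exponent is negative, so ρ_t decays polynomially until it hits the floor ρ_∞.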

3.7. COMPARING SGD AND SEM

Since sEM optimizes the log-likelihood L, whereas SGD optimizes the annealed approximation L̂_σ, a comparison of these measures must be made carefully. We claim that the comparison is fair assuming that (i) SGD annealing has converged and (ii) GMM responsibilities are sharply peaked, so that a single component has responsibility ≈ 1. From (i) it follows that L̂_σ ≈ L̂, and (ii) implies that L̂ ≈ L. Condition (ii) is usually satisfied to high precision, especially for high-dimensional inputs; if it is not, the comparison is biased in favor of sEM, since L ≥ L̂ by definition.

4. EXPERIMENTS

Unless stated otherwise, the experiments in this section are conducted with the following parameter values for sEM and SGD (where applicable): mini-batch size B = 1, K = 8 × 8, µ_i = 0.1, σ_0 = 2, σ_∞ = 0.01, ε = 0.001, D_max = 20. Each experiment is repeated 10 times with identical parameters but different random seeds for parameter initialization. See Sec. 5 for a justification of these choices. Due to the input dimensionality, all precision matrices are taken to be diagonal. Training/test data are taken from the datasets described in Sec. 2.

4.1. ROBUSTNESS OF SGD TO INITIAL CONDITIONS

Here, we train GMMs for three epochs on classes 1 to 9 of each dataset. We use different random and non-random initializations of the centroids and compare the final log-likelihood values. Random centroid initializations are parameterized by µ_i ∈ {0.1, 0.3, 0.5}, whereas non-random initializations are defined by centroids from a previous training run on class 0 (one epoch). The latter is done to obtain a non-random centroid initialization that is as dissimilar as possible from the training data. The initialization of the precisions cannot be varied, because empirical data shows that training converges to undesirable solutions if the precisions are not initialized to large values. While this will have to be investigated further, we find convergence to near-identical levels, regardless of centroid initialization, for all datasets (see Tab. 1 for details).

4.2. ADDED VALUE OF ANNEALING

To demonstrate the beneficial effects of annealing, we perform experiments on all datasets with annealing turned off, which is achieved by setting σ_0 = σ_∞. This invariably produces sparse-component solutions with strongly inferior log-likelihoods after training; please refer to Tab. 1.

4.3. CLUSTERING PERFORMANCE EVALUATION

To compare the clustering performance of sEM- and SGD-trained GMMs, the Davies-Bouldin score (Davies & Bouldin, 1979) and the Dunn index (Dunn, 1973) are determined. We evaluate the grid-search results to find the best parameter setup for each metric. sEM is initialized by k-means to show that our approach does not depend on parameter initialization. Tab. 2 indicates that SGD can match sEM performance (see also App. A.5).

We train GMMs for three epochs (enough for convergence in all cases) using SGD and sEM on all datasets, as described in Secs. 3.5 and 3.6. The resulting centroids of our SGD-based approach are shown in Fig. 2, whereas the final loss values for SGD and sEM are compared in Tab. 3. The centroids of both approaches are visually similar, except for the topological organization due to annealing for SGD, and the fact that in most sEM experiments some components do not converge while the others do. Tab. 3 indicates that SGD achieves performance superior to sEM in the majority of cases, in particular for the highest-dimensional datasets (SVHN: 3 072 dimensions; Fruits 360: 30 000 dimensions).

Figure 2: Exemplary results for centroids learned by SGD, trained on full images.

Visualization of high-dimensional sEM outcomes: Fig. 3 was obtained after training GMMs by sEM on the Fruits 360 and SVHN datasets. It should be compared to Fig. 2, where an identical procedure was used to visualize the centroids of SGD-trained GMMs. Notably, the effect of unconverged components does not occur at all for our SGD approach, due to the annealing mechanism that "drags" unconverged components along.

Figure 3: Visualization of centroids after exemplary training runs (3 epochs) on high-dimensional datasets for sEM: Fruits 360 (left, 30 000 dimensions) and SVHN (right, 3 072 dimensions). Component entries are displayed "as is", meaning that low brightness means low RGB values. Visibly, many GMM components remain unconverged, which is analogous to a sparse-component solution and explains the low log-likelihood values, especially for these high-dimensional datasets.

5. DISCUSSION AND CONCLUSION

The Relevance of this Article is outlined by the fact that training GMMs by SGD was recently investigated in the community by Hosseini & Sra (2015; 2019). We go beyond this work, since our approach requires no off-line, data-driven model initialization and works for high-dimensional streaming data. The presented SGD scheme is simple and very robust to initial conditions due to the proposed annealing procedure, see Secs. 4.1 and 4.2. In addition, our SGD approach compares favorably to the reference model for online EM (Cappé & Moulines, 2009) in terms of achieved log-likelihoods, which was verified on multiple real-world datasets; superior SGD performance is observed for the high-dimensional datasets. Analysis of Results suggests that SGD performs better than sEM on average, see Sec. 4.4, although the differences are very modest. It should be stated clearly that it cannot be expected, and is not the goal of this article, to outperform sEM by SGD in the general case, only to achieve similar performance. However, if sEM is used without, e.g., k-means initialization, components may not converge (see Fig. 3 for a visual impression) for very high-dimensional data like the Fruits 360 and SVHN datasets, which is why SGD outperforms sEM in this case. Another important advantage of SGD over sEM is that no grid search for hyper-parameter values is necessary, whereas sEM has a complex and unintuitive dependency on ρ_0, ρ_∞ and α. Small Batch Sizes and Streaming Data are possible with the SGD-based approach. Throughout the experiments, we used a batch size of 1, which allows streaming-data processing without the need to store any samples at all. Larger batch sizes are, of course, possible and increase execution speed. In the experiments conducted here, SGD (and sEM) usually converged within the first two epochs, which is a substantial advantage whenever huge sets of data have to be processed.
No Assumptions About Data Generation are made by SGD, in contrast to the EM and sEM algorithms. The latter guarantee that the loss will not decrease due to an M-step; this, however, assumes a non-trivial dependency of the data on an unobservable latent variable (see App. A.1 for a proof). In contrast, SGD makes no such hard-to-verify assumptions. This is a rather philosophical point, but it may be an advantage in situations where the data are strongly non-Gaussian. Numerical Stability is assured by our SGD training approach: it does not optimize the log-likelihood but its max-component approximation, which contains no exponentials at all and is very well justified by the results of Tab. 3, which show that component probabilities are very strongly peaked. In fact, it is the gradient computations where numerical problems (e.g., NaN values) occurred. The "logsumexp" trick mitigates the problem but does not eliminate it (see Sec. 3.2), and it cannot be applied when gradients are computed automatically, as most machine learning frameworks do. Hyper-Parameter Selection Guidelines are as follows: the learning rate ε must be set by cross-validation (a good value is 0.001). We empirically found that initializing precisions to the cut-off value D_max and a uniform initialization of the π_k are beneficial, and that centroids are best initialized to small random values. A value of D_max = 20 always worked in our experiments. Generally, the cut-off must be much larger than the inverse of the data variance; in many cases, it should be possible to estimate this roughly, even in streaming settings, especially when samples are normalized. For density estimation, choosing higher values of K leads to higher final log-likelihoods (validated in App. A.6). For clustering, K should be selected using standard techniques for GMMs. The parameter δ controls loss-stationarity detection for the annealing procedure and was shown to perform well at δ = 0.05.
Larger values of δ will lead to a faster decrease of σ(t), which may impair convergence; smaller values are always admissible but lead to longer convergence times. The annealing time constant α should be set to the GMM learning rate or lower; smaller values of α lead to longer convergence times, since σ(t) is updated less often. The initial value σ_0 needs to be large in order to enforce convergence of all components; a typical value is 0.25√K. The lower bound σ_∞ should be as small as possible (e.g., 0.01) in order to achieve high log-likelihoods (see the proof in App. A.2).

6. OUTLOOK

The presented work can be extended in several ways. First, annealing control could be simplified further by inferring good δ values from α. Likewise, increases of σ might be performed automatically when the loss rises sharply, indicating a task boundary. Since we found that GMM convergence times grow linearly with the number of components, we will investigate hierarchical GMM models that operate like a Convolutional Neural Network (CNN), in which individual GMMs only see a local patch of the input and can therefore have low K. Lastly, we will investigate replay by SGD-trained GMMs for continual learning architectures. GMMs could compare favorably to Generative Adversarial Networks (Goodfellow et al., 2014) due to faster training and the fact that sample generation capacity can be monitored via the log-likelihood.

where H represents the Shannon entropy of p(z). The highest value this can have is log K, for a uniform distribution of the z_n, finally leading to a lower bound for L of

L ≥ Σ_n log p(x_n) = L,   (11)

which is indeed tight, but trivial, and thus does not simplify the problem at all. In particular, no closed-form solutions to the associated extreme-value problem can be computed for this case. This shows that optimizing GMMs by Expectation-Maximization assumes that each sample has been drawn from a single element in a set of K uni-modal Gaussian distributions, where the selected distribution depends on a latent random variable. On the other hand, optimization by SGD uses the incomplete-data log-likelihood L as the basis for optimization, without probabilistic interpretation. This may be advantageous for problems where the assumption of Gaussianity is badly violated, although empirical studies indicate that optimization by EM works very well in a wide range of scenarios.

A.2 PROOF THAT σ DEFINES AN UPPER BOUND ON L̂_σ

Let us assume that SGD optimization has reached a stationary point where the derivative w.r.t. all GMM parameters is 0.
In this situation, we claim that the only way to increase the loss is by manipulating σ. We show here that ∂L̂_σ/∂σ < 0 for all σ > 0, and that ∂L̂_σ/∂σ = 0 for σ = 0. This means that the loss can be increased by decreasing σ, up to the point where σ = 0. For each sample, the 2D profile of log(π_k N_k) ≡ f_k is assumed to be radially symmetric, centered on the best-matching component k*, and decreasing with distance as a function of ||k − k*||. We thus have f_k = f(r) with r ≡ ||k − k*||. Passing to the continuous domain, the indices in the Gaussian "smoothing filter" g_{k*k} become continuous variables: g_{k*k} → g(||k − k*||, σ) ≡ g(r, σ), and similarly f_k → f(r). Using 2D polar coordinates, the smoothed max-component likelihood L̂_σ becomes a polar integral around the position of the best-matching component: L̂_σ ∼ ∫_{R²} g(r, σ) f(r) dr dφ.

We are interested in the change of L̂_σ when σ undergoes an infinitesimal change. It is trivial to show that, for the special case of a constant log-probability profile f(r) = L, L̂_σ does not depend on σ because Gaussians are normalized, and the derivative w.r.t. σ vanishes:

dL̂_σ/dσ ∼ L ∫_0^∞ dr (r²/σ² − 1) exp(−r²/(2σ²))
        = −L ∫_0^σ dr (1 − r²/σ²) exp(−r²/(2σ²)) + L ∫_σ^∞ dr (r²/σ² − 1) exp(−r²/(2σ²))
        ≡ −L N + L P,   (12)

where we have split the integral into the part where the derivative w.r.t. σ is negative (N) and the part where it is positive (P). We know that N = P, since the derivative must be zero for a constant profile f(r) = L, due to the fact that Gaussians are normalized to the same value regardless of σ. Furthermore, this derivative is zero at σ = 0, because L̂_σ no longer depends on σ in that case. Taking everything into consideration, in a situation where the log-likelihood L̂_σ has reached a stationary point for a given value of σ, we have shown that:

• The value of L̂_σ depends on σ.
• Without changing the log-probabilities, we can increase L̂_σ by reducing σ, assuming that the log-probabilities are mildly decreasing around the BMU.
• Increasing L̂_σ in this way works as long as σ > 0; at σ = 0 the derivative becomes 0.

Thus, σ indeed defines an upper bound on L̂_σ, which can be raised by decreasing σ. The assumption of log-probabilities that decrease around the best-matching unit (BMU) is reasonable, since such a profile maximizes L̂_σ. All functions f(r) that, e.g., decrease monotonically around the BMU fulfill this criterion; the precise form of the decrease is irrelevant. The proof works identically when using discrete sums instead of integrals.

A.3 LOG-LIKELIHOOD GRADIENTS

The gradients of L̂ read:

∂L̂/∂µ_k = E_n [ P_k (x_n − µ_k) δ_{kk*} ]
∂L̂/∂P_k = E_n [ ( P_k^{−1} − (x_n − µ_k)(x_n − µ_k)^T ) δ_{kk*} ]
∂L̂/∂π_k = π_k^{−1} E_n [ δ_{kk*} ].

The gradients of L are obtained by replacing δ_{kk*} with the standard GMM responsibilities γ_{nk}. For the case of a degenerate solution when optimizing L, only a single component k* has a weight close to 1, with centroid and covariance matrix given by the mean and covariance of the data: π_{k*} ≈ 1, µ_{k*} = E[X], P_{k*}^{−1} = Cov(X). In this case, the gradients w.r.t. µ and P vanish. The gradient w.r.t. π_k does not vanish, but is δ_{kk*}, which vanishes after enforcing the normalization constraint.
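The centroid gradient above can be verified against central finite differences of L̂. The following NumPy check is ours (diagonal precisions, small random problem); it confirms that the δ_{kk*} selection in the analytic formula matches the numerical derivative of the max-based loss.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, 4, 10
X = rng.normal(size=(N, D))
pi = np.full(K, 1.0 / K)
mu0 = rng.normal(size=(K, D))
P = np.full((K, D), 2.0)                  # diagonal precisions

def log_joint(mu):
    """log(pi_k N_k(x_n)) for diagonal precisions, shape (N, K)."""
    log_N = (0.5 * np.sum(np.log(P), axis=1) - 0.5 * D * np.log(2 * np.pi)
             - 0.5 * ((X[:, None, :] - mu[None]) ** 2 * P[None]).sum(-1))
    return np.log(pi)[None] + log_N

def l_hat(mu):
    """Max-component log-likelihood, Eq. (3)."""
    return np.max(log_joint(mu), axis=1).mean()

def analytic_grad_mu(mu):
    """E_n[P_k (x_n - mu_k) delta_{kk*}], from App. A.3."""
    k_star = np.argmax(log_joint(mu), axis=1)
    g = np.zeros_like(mu)
    for n, k in enumerate(k_star):
        g[k] += P[k] * (X[n] - mu[k]) / N
    return g

# central finite differences over every centroid coordinate
h, num = 1e-6, np.zeros_like(mu0)
for k in range(K):
    for d in range(D):
        up, dn = mu0.copy(), mu0.copy()
        up[k, d] += h
        dn[k, d] -= h
        num[k, d] = (l_hat(up) - l_hat(dn)) / (2 * h)
```

Note that the check is valid only away from points where the argmax switches components, where L̂ is not differentiable; for generic random data this is not an issue.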



¹ https://github.com/gmm-iclr21/sgd-gmm



Figure 1: Undesirable solutions during SGD, visualized for MNIST with component weights π k .

For a profile f(r) that satisfies f(r) > L ∀ r ∈ [0, σ) and f(r) < L ∀ r ∈ (σ, ∞), the inner and outer parts of the integral behave as follows: f(r) is minorized/majorized by L by assumption, and the contributions in each integral have the same sign over the whole domain of integration, so that Ñ > L N and P̃ < L P. Thus it is shown that, for σ > 0,

dL̂_σ/dσ = −Ñ + P̃ < −L N + L P = 0.   (14)

Figure 5: Trend of clustering capabilities for sEM- and SGD-trained GMMs. Comparison by Davies-Bouldin score (lower is better) and Dunn index (higher is better) for the datasets MNIST, FashionMNIST, NotMNIST, Devanagari and SVHN. The lines visualize the average metric score/index values of 10 repetitions with their standard deviations.

Table 1: Effect of different random and non-random centroid initializations on SGD training. Given are the means and standard deviations of final log-likelihoods (10 repetitions per experiment). To show the added value of annealing, the right-most column indicates the final log-likelihood when annealing is turned off; this value should be compared to the left-most entry in each row, where annealing is turned on. Standard deviations in this case were very small and are therefore omitted.

Table 2: Clustering performance comparison of SGD and sEM training using the Davies-Bouldin score (lower is better) and the Dunn index (higher is better). Results are in bold face whenever they are better by more than half a standard deviation.

Table 3: Comparison of SGD and sEM training on all datasets in a streaming-data scenario. Shown are mean log-likelihoods (10 repetitions) at the end of training and their standard deviations. Results are in bold face whenever they are higher by more than half a standard deviation. Additionally, the averaged maximum responsibilities (p_k*) on test data are given, justifying the max-component approximation.

A SUPPLEMENTARY MATERIAL

A.1 ASSUMPTIONS MADE BY EM AND SGD

The EM algorithm assumes that the observed data samples {x_n} depend on unobserved latent variables z_n in a non-trivial fashion. This assumption is formalized for a GMM with K components by formulating the complete-data likelihood, in which z_n ∈ {0, . . . , K − 1} is a scalar:

p(x_n, z_n) = π_{z_n} N_{z_n}(x_n),   (6)

where we have defined N_k(x_n) = N(x_n; µ_k, P_k) for brevity. It is assumed that the z_n are unobservable random variables whose distribution is unknown. Marginalizing them out gives the incomplete-data likelihood p(x_n) = Σ_{z_n} p(x_n, z_n). The derivation of the EM algorithm starts out with the total incomplete-data log-likelihood L = Σ_n log p(x_n). Due to the assumption that L is obtained by marginalizing out the latent variables, an explicit dependency on z_n can be re-introduced, L = Σ_n log Σ_{z_n} p(x_n, z_n), and for this last expression, Jensen's inequality can be used to construct a lower bound. Since the realizations of the latent variables are unknown, we can assume any form for their distribution p(z_n); in particular, for the choice p(z_n) ∼ p(x_n, z_n), the lower bound becomes tight. Simple algebra and the fact that the distribution p(z_n) must be normalized give p(z_n = k | x_n) = π_k N_k(x_n) / Σ_j π_j N_j(x_n), where we have used Eq. (6) in the last step. p(z_n = k | x_n) is a quantity that can be computed from the data with no reference to the latent variables; for GMMs it is usually termed the responsibility, and we write it as p(z_n = k | x_n) ≡ γ_{nk}. However, the construction of a tight lower bound that is actually different from L only works when p(x_n, z_n) depends non-trivially on the latent variable z_n. If this is not the case, we have p(x_n, z_n) = K^{−1} p(x_n), and the derivation of Eq. (8) goes down very differently.

A.4 VISUALIZATION OF THE 2D ANNEALING GRID

In Figure 4, three different states of the g_k are visualized, depending on σ(t). Darker pixels indicate larger values. Each g_k is assigned to a single GMM component k, which is why the g_k are arranged on the same √K × √K grid on which we place the components themselves. An intuitive interpretation of a particular g_k is that it encodes the contribution of neighbouring component log-probabilities to the log-probability of component k entering into the max-computation. Over time, σ(t) is reduced (middle and right pictures), so that eventually only component k itself contributes. Please note that the grid on which we place the components is periodic for simplicity, so the g_k are themselves periodic.
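The periodic grid of Gaussian weights g_kj(σ) can be built explicitly as follows. This is our own sketch; in particular, the per-row normalization is our choice (the paper leaves the normalization of the g_k implicit).

```python
import numpy as np

def smoothing_weights(K, sigma):
    """Periodic Gaussian weights g_{kj}(sigma) on a sqrt(K) x sqrt(K) grid."""
    n = int(np.sqrt(K))
    assert n * n == K, "K must be a perfect square"
    # 2D grid coordinates of each of the K components
    coords = np.array([(i // n, i % n) for i in range(K)], dtype=float)
    d = np.abs(coords[:, None, :] - coords[None, :, :])   # (K, K, 2)
    d = np.minimum(d, n - d)                              # periodic boundaries
    sq_dist = (d ** 2).sum(-1)
    g = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return g / g.sum(axis=1, keepdims=True)               # rows sum to 1
```

For large σ, each row spreads mass over many neighbours; as σ → 0 the matrix approaches the identity, recovering the plain max-component loss L̂.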

A.6 EFFECT OF THE GMM COMPONENT NUMBER K

The number of Gaussian components K is a key parameter for any GMM and has a huge impact on performance. It should be stated clearly that this discussion is not specific to any particular way of performing GMM training, be it SGD, sEM or EM, since all of these methods optimize the log-likelihood. This is why we do not propose a particular way of choosing K for SGD-trained GMMs. For clustering, K can be chosen using standard techniques such as BIC or AIC, in addition to priors depending on the data and the concrete application in mind. For density estimation, it is generally assumed that K should be set as high as possible, since the true distribution of the data can then be approximated in more detail. This set of experiments aims at showing empirically, for completeness, that this "bigger is better" relation holds when training GMMs by SGD. We use the hyper-parameter settings stated in Sec. 4, vary K, and record the final log-likelihoods. The results, shown in Tab. 4, suggest a very clear relationship between K and the (test) log-likelihood obtained at the end of training. We ran the experiments for a larger number of epochs to exclude the possibility that (non-)convergence effects play a role here.

