IMAGE MODELING WITH DEEP CONVOLUTIONAL GAUSSIAN MIXTURE MODELS

Abstract

In this conceptual work, we present Deep Convolutional Gaussian Mixture Models (DCGMMs), a deep hierarchical Gaussian Mixture Model (GMM) that is particularly suited for describing and generating images. Vanilla (i.e., flat) GMMs require a very large number of components to describe images well, leading to long training times and memory issues. DCGMMs avoid this by a stacked architecture of multiple GMM layers, linked by convolution and pooling operations. This allows them to exploit the compositionality of images in a similar way as deep CNNs do. DCGMMs can be trained end-to-end by Stochastic Gradient Descent. This sets them apart from vanilla GMMs, which are trained by Expectation-Maximization and require a prior k-means initialization that is infeasible in a layered structure. For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling. Based on the MNIST and FashionMNIST datasets, we validate the DCGMM model by demonstrating its superiority over flat GMMs for clustering, sampling and outlier detection.

1. INTRODUCTION

This conceptual work is in the context of probabilistic image modeling, whose main objectives are density estimation and image generation (sampling). Since images usually do not precisely follow a Gaussian mixture distribution, such a treatment is inherently approximate in nature. This implies that clustering, even though it is possible and has a long history in the context of Gaussian Mixture Models (GMMs), is not a main objective. Sampling is an active research topic mainly relying on Generative Adversarial Networks (GANs), discussed in Section 1.2. Similar techniques are being investigated for generating videos (Ghazvinian Zanjani et al., 2018; Piergiovanni & Ryoo, 2019). An issue with GANs is that their probabilistic interpretation remains unclear. This is outlined by the fact that there is no easy-to-compute probabilistic measure of the current fit-to-data that is optimized by GAN training. Recent evidence seems to indicate that GANs may not model the full image distribution as given by the training data (Richardson & Weiss, 2018). Nevertheless, images generated by GANs appear extremely realistic and diverse, and the GAN model has been adapted to perform a wide range of visually impressive functionalities. In contrast, GMMs explicitly describe the distribution p(x), given by a set of training data X = {x_n}, as a weighted mixture of K Gaussian component densities N(x; μ_k, Σ_k) ≡ N_k(x): p(x) = Σ_{k=1}^K π_k N_k(x). GMMs require the mixture weights to be normalized, Σ_k π_k = 1, and the covariance matrices to be positive definite, x^T Σ_k x > 0 ∀x ≠ 0. The quality of the current fit-to-data is expressed by the log-likelihood L(X) = E_n[ log Σ_k π_k N_k(x_n) ], which is what GMM training optimizes, usually by variants of Expectation-Maximization (EM) (Dempster et al., 1977). It can be shown that arbitrary distributions can, given enough components, be approximated by mixtures of Gaussians (Goodfellow et al., 2016).
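To make the quantities above concrete, the following minimal sketch (an illustration, not the paper's implementation; diagonal covariances and all function names are our own) evaluates the mixture density p(x) = Σ_k π_k N_k(x) and the log-likelihood L(X) = E_n[ log Σ_k π_k N_k(x_n) ], using the log-sum-exp trick for numerical stability:

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian N(x; mu, diag(var))."""
    d = x.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((x - mu) ** 2 / var, axis=-1))

def gmm_log_likelihood(X, pi, mus, vars_):
    """L(X) = E_n[ log sum_k pi_k N_k(x_n) ], computed with log-sum-exp."""
    # log pi_k + log N_k(x_n) for every sample n and component k -> (N, K)
    comp = np.stack([np.log(pi[k]) + gaussian_logpdf(X, mus[k], vars_[k])
                     for k in range(len(pi))], axis=1)
    m = comp.max(axis=1, keepdims=True)          # log-sum-exp stabilization
    return float(np.mean(m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))))
```

Note that the mixing weights must sum to one and all variances must be positive, mirroring the constraints stated above.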
Thus, GMMs are guaranteed to model the complete data distribution, but only to the extent allowed by the number of components K. In this respect, GMMs are similar to flat neural networks with a single hidden layer: although, by the universal approximation theorems of Pinkus (1999) and Hornik et al. (1989), they can approximate arbitrary functions (from certain rather broad function classes), they fail to do so in practice. The reason for this is that the number of required hidden-layer elements is unknown, and usually beyond the reach of any reasonable computational capacity. For images, this problem was largely solved by the introduction of deep Convolutional Neural Networks (CNNs). CNNs model the statistical structure of images (hierarchical organization and translation invariance) by chaining multiple convolution and pooling layers. Thus, the number of parameters can be reduced without compromising accuracy.

1.1. OBJECTIVE, CONTRIBUTION AND NOVELTY

The objective of this article is to introduce a GMM architecture which exploits the same principles that led to the performance explosion of CNNs. In particular, the genuinely novel characteristics are:
• formulation of GMMs as a deep hierarchy, including convolution and pooling layers,
• end-to-end training by SGD from random initial conditions (no k-means initialization),
• generation of realistic samples by a new sharpening procedure,
• better empirical performance than vanilla GMMs for sampling, clustering and outlier detection.
In addition, we provide a publicly available TensorFlow implementation which supports a Keras-like flexible construction of Deep Convolutional Gaussian Mixture Model instances.

1.2. RELATED WORK

Generative Adversarial Networks  The currently most widely used models for image modeling and generation are Generative Adversarial Networks (Arjovsky et al., 2017; Mirza & Osindero, 2014; Goodfellow et al., 2014). GANs are trained adversarially, mapping Gaussian noise to image instances while trying to fool an additional discriminator network, which in turn aims to distinguish real from generated samples. Variational Autoencoders (VAEs) follow the classic autoencoder principle (Kingma & Welling, 2013), trying to reconstruct their inputs through a bottleneck layer whose activities are additionally constrained to have a Gaussian distribution. GANs are capable of generating photo-realistic images (Richardson & Weiss, 2018), although their probabilistic interpretation remains unclear since they do not possess a differentiable loss function that is minimized by training. They may suffer from what is termed mode collapse, which is hard to detect automatically due to the absence of a loss function. Due to their ability to generate realistic images, they are prominently used in models of continual learning (Shin et al., 2018).

Hierarchical GMMs  Mixture of Factor Analyzers (MFA) models (McLachlan & Peel, 2005; Ghahramani & Hinton, 1997) can be considered hierarchical GMMs because they are formulated in terms of a lower-dimensional latent-variable representation, which is mapped to a higher-dimensional space. The use of MFAs for describing natural images is discussed in detail in Richardson & Weiss (2018), showing that the MFA model alone, without further hierarchical structure, compares quite favorably to GANs when considering image generation. A straightforward hierarchical extension of GMMs is presented by Liu et al. (2002) with the goal of unsupervised clustering: responsibilities of one GMM are treated as inputs to a subsequent GMM, together with an adaptive mechanism that determines the depth of the hierarchy. Garcia et al.
(2010) present a comparable, more information-theoretic approach. A hierarchy of MFA layers with sampling in mind is presented by Viroli & McLachlan (2019), where each layer samples values for the latent variables of the previous one, although transformations between layers are exclusively linear. Van Den Oord & Schrauwen (2014) and Tang et al. (2012) pursue a similar approach. All described approaches train hierarchical GMMs by (quite complex) extensions of the EM algorithm initialized by k-means, except for Richardson & Weiss (2018), who use Stochastic Gradient Descent (SGD), although with a k-means initialization. None of these models consider convolution or max-pooling operations, which have proven important for modeling the statistical structure of images.

Convolutional GMMs

The only work we could identify that proposes to estimate hierarchical convolutional GMMs is Ghazvinian Zanjani et al. (2018), although that article describes a hybrid model in which a CNN and a GMM are combined.

SGD and End-to-End GMM Training

Training GMMs by SGD is challenging due to local optima and the need to enforce model constraints, most notably the constraint of positive-definite covariance matrices. This has recently been discussed by Hosseini & Sra (2020), although the proposed solution requires parameter initialization by k-means and introduces several new hyper-parameters; it is thus unlikely to work as-is in a hierarchical structure. An SGD approach that achieves robust convergence even without k-means-based parameter initialization is presented by Gepperth & Pfülb (2020): undesirable local optima caused by random parameter initialization are circumvented by an adaptive annealing strategy.

2. DATA

For the evaluation we use the following image datasets: MNIST (LeCun et al., 1998) is a common benchmark for computer vision systems and classification problems. It consists of 60 000 28 × 28 grayscale images of handwritten digits (0-9). FashionMNIST (Xiao et al., 2017) consists of images of clothes in 10 categories and is structured like the MNIST dataset. Although these datasets are not particularly challenging for classification, their dimensionality of 784 is at least one order of magnitude higher than that of the datasets used for validating other hierarchical GMM approaches in the literature.

3. DCGMM: MODEL OVERVIEW

The Deep Convolutional Gaussian Mixture Model is a hierarchical model consisting of layers, in analogy to CNNs.¹ Each layer with index L expects an input tensor A^(L-1) ∈ R^4 of dimensions (N, H^(L-1), W^(L-1), C^(L-1)) and produces an output tensor A^(L) ∈ R^4 of dimensions (N, H^(L), W^(L), C^(L)). Layers can have internal variables θ^(L) that are adapted during SGD training. A DCGMM layer L has two basic operating modes (see Figure 1): for (density) estimation, an input tensor A^(L-1) from layer L-1 is transformed into an output tensor A^(L). For sampling, the direction is reversed: each layer receives a control signal T^(L+1) from layer L+1 (of the same dimensions as A^(L)), which is transformed into a control signal T^(L) passed to layer L-1 (of the same dimensions as A^(L-1)).

3.1. LAYER TYPES

We define three layer types: Folding (F), Pooling (P) and convolutional GMM (G). Each layer implements distinct operations for the two modes, i.e., estimation and sampling.

Folding Layer  For density estimation, this layer performs a part of the well-known convolution operation from CNNs. Based on the filter sizes f^(L)_X, f^(L)_Y as well as the filter strides Δ^(L)_X, Δ^(L)_Y, all entries of the input tensor inside the range of the sliding filter window are stacked into the channel dimension of the output tensor. We thus obtain an output tensor of dimensions H^(L) = 1 + (H^(L-1) - f^(L)_Y)/Δ^(L)_Y, W^(L) = 1 + (W^(L-1) - f^(L)_X)/Δ^(L)_X and C^(L) = C^(L-1) f^(L)_X f^(L)_Y, whose entries are copied from the input as A^(L)_{nhwc} = A^(L-1)_{nh'w'c'}, where the offsets δ_y = h' - hΔ^(L)_Y ∈ [0, f^(L)_Y) and δ_x = w' - wΔ^(L)_X ∈ [0, f^(L)_X) within the filter window and the input channel c' are encoded into the output channel as c = (δ_y f^(L)_X + δ_x) C^(L-1) + c'. When sampling, the layer performs the inverse mapping, which is not a one-to-one correspondence: input tensor elements that receive several contributions are simply averaged over all contributions.

Pooling Layer  For density estimation, pooling layers perform the same operations as standard (max-)pooling layers in CNNs, based on the kernel sizes k^(L)_Y, k^(L)_X and strides Δ^(L)_X, Δ^(L)_Y. When sampling, pooling layers perform a simple nearest-neighbor up-sampling by a factor indicated by the kernel sizes and strides.

GMM Layer  This layer type contains K GMM components, each of which is associated with trainable parameters π_k, μ_k and Σ_k, k = 1 ... K, representing the GMM weights, centroids and covariances. What makes GMM layers convolutional is that they do not model single input vectors, but the channel content A^(L-1)_{nhw:} at all positions h, w of the input, using a shared set of parameters. This is analogous to the way a CNN layer models image content at all sliding-window positions using the same filters.
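The estimation-mode Folding operation can be sketched as follows (a minimal numpy illustration of the window-stacking described above; the function name and the explicit loops are ours, not the reference implementation):

```python
import numpy as np

def folding(A, fY, fX, dY, dX):
    """Folding layer, estimation mode: stack each fY x fX sliding window
    of the input tensor A (N, H, W, C) into the channel dimension of the
    output, giving shape (N, H_out, W_out, fY * fX * C)."""
    N, H, W, C = A.shape
    H_out = 1 + (H - fY) // dY
    W_out = 1 + (W - fX) // dX
    out = np.empty((N, H_out, W_out, fY * fX * C), dtype=A.dtype)
    for h in range(H_out):
        for w in range(W_out):
            # window starting at (h*dY, w*dX); reshape encodes
            # (delta_y, delta_x, c') into the output channel index
            window = A[:, h * dY:h * dY + fY, w * dX:w * dX + fX, :]
            out[:, h, w, :] = window.reshape(N, -1)
    return out
```

For example, F(28, 28, 1, 1) applied to a 28 × 28 single-channel image yields a single position with 784 channels, i.e., the flattened image.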
A GMM layer thus maps the input tensor A^(L-1) ∈ R^{N, H^(L-1), W^(L-1), C^(L-1)} to A^(L) ∈ R^{N, H^(L-1), W^(L-1), K}, with each GMM component k ∈ {1, ..., K} contributing the likelihood A^(L)_{nhwk} of having generated the channel content at position h, w (for sample n in the mini-batch). This quantity is often referred to as responsibility and is computed as

p_{nhwk}(A^(L-1)) = N_k(A^(L-1)_{nhw:}; μ_k, Σ_k),    A^(L)_{nhwk} ≡ p_{nhwk} / Σ_c p_{nhwc}.    (2)

For training the GMM layer, we optimize the GMM log-likelihood L^(L) for each layer L:

L^(L)_{hw} = E_n [ log Σ_k π_k p_{nhwk}(A^(L-1)) ],    L^(L) = Σ_{hw} L^(L)_{hw} / (H^(L-1) W^(L-1)).

Training is performed by SGD according to the technique, and with the recommended parameters, presented by Gepperth & Pfülb (2020), which uses a max-component approximation to L^(L). In sampling mode, a control signal T^(L) is produced by standard GMM sampling, performed separately for all positions h, w. GMM sampling at position h, w first selects a component by drawing from a multinomial distribution. If the GMM layer is the last layer of a DCGMM instance, the multinomial's parameters are the mixing weights π_k for each position h, w. Otherwise, the control signal T^(L+1)_{nhw:} received from layer L+1 is used. It is consistent to use the control signal for component selection in layer L, since it was sampled by layer L+1, which was in turn trained on the component responsibilities of layer L. The selected component (still at position h, w) then samples T^(L)_{nhw:}. It is often beneficial for sampling to restrict component selection to the S components with the highest control signal (top-S sampling).
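Equation 2 can be sketched in a few lines (an illustration under simplifying assumptions: diagonal covariances and a numpy function of our own naming, with one parameter set shared across all positions h, w as described above):

```python
import numpy as np

def gmm_layer_responsibilities(A_in, mus, vars_):
    """Map A_in (N, H, W, C) to responsibilities (N, H, W, K), applying one
    shared set of K diagonal Gaussians (mus, vars_: shape (K, C)) at every
    position h, w, and normalizing over components as in Equation 2."""
    N, H, W, C = A_in.shape
    x = A_in.reshape(N, H, W, 1, C)                       # broadcast over K
    log_pdf = -0.5 * (C * np.log(2 * np.pi)
                      + np.sum(np.log(vars_), axis=-1)     # shape (K,)
                      + np.sum((x - mus) ** 2 / vars_, axis=-1))  # (N,H,W,K)
    # normalize per position: responsibilities sum to one over k
    m = log_pdf.max(axis=-1, keepdims=True)
    p = np.exp(log_pdf - m)
    return p / p.sum(axis=-1, keepdims=True)
```

The broadcasting over positions is what makes the layer convolutional: the same (μ_k, Σ_k) are evaluated at every h, w, analogous to weight sharing in a CNN.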

3.2. ARCHITECTURE-LEVEL FUNCTIONALITIES

The DCGMM approach proposes several functionalities on different architectural levels.

3.2.1. END-TO-END TRAINING

To train a DCGMM instance, we optimize L^(L) for each GMM layer L by vanilla SGD¹, using per-layer learning rates ε^(L). This is different from a standard CNN classifier, where only a single loss function is minimized, usually a cross-entropy loss computed from the last layer's outputs. Learning is not conducted layer-wise but end-to-end. Parameter initialization for GMM layers sets the mixing weights to π_k = K^{-1}, samples centroid elements from μ_kl ∼ U[-0.01, 0.01], and initializes diagonal covariances to unit entries. To ensure good convergence, training is conducted in two phases, in the first of which only centroids are adapted.

For outlier detection (see Section 3.2.2), inliers at every layer L and position h, w are characterized by

L^(L)_{hw} ≥ B^(L)_{hw} ≡ E_n[L^(L)_{nhw}] - c Var_n(L^(L)_{nhw}),    (4)

where the mean and variance are computed over the training set. Larger values of c imply a less restrictive characterization of inliers. Assuming that the topmost GMM layer is global (H^(L) = W^(L) = 1), Equation 4 reduces to a single condition that determines whether the sample, as a whole, is an inlier. However, we can also localize inlier/outlier image parts by evaluating Equation 4 in lower GMM layers.
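The inlier criterion of Equation 4 can be sketched as follows (a minimal numpy illustration with a hypothetical function name; it assumes per-sample, per-position layer log-likelihoods have been collected during a stable phase of training):

```python
import numpy as np

def inlier_mask(ll_train, ll_test, c=1.0):
    """Equation 4: a test sample (per position h, w) is an inlier if its
    log-likelihood exceeds B = E_n[L_nhw] - c * Var_n(L_nhw), where the
    mean and variance are taken over the training set (axis 0)."""
    B = ll_train.mean(axis=0) - c * ll_train.var(axis=0)
    return ll_test >= B
```

With a global topmost GMM layer there is a single position, so the mask reduces to one boolean per test sample; in lower layers it localizes outlier image parts.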

3.2.3. SAMPLING AND SHARPENING

Sampling starts in the highest layer, assumed to be a GMM layer, and propagates control signals downwards (see Figure 1 and Section 3.1), with the control signal T^(1) constituting the sampling result. Sampling suffers from information loss due to the non-invertible mappings effected by Pooling and Folding layers. To counteract this, each Folding layer L performs sharpening on the generated control signal T^(L). This involves computing the log-likelihood L^(L*)(T^(L)) of the next-highest GMM layer L* > L, and performing G gradient ascent steps T^(L)_{nhwc} → T^(L)_{nhwc} + s ∂L^(L*)/∂T^(L)_{nhwc}. The reason for sharpening is that filters in Folding layers usually overlap, so neighboring filter results are correlated. This correlation is captured by all higher GMM layers, and most prominently by the next-highest one. Therefore, modifying T^(L) by gradient ascent will recover some of the information lost by pooling and folding. After sharpening, the tensor T^(L) is passed as a control signal to layer L-1.
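The gradient-ascent loop at the core of sharpening can be illustrated in isolation. In the paper, the gradient is obtained by backpropagating through the next-highest GMM layer; for a self-contained sketch we assume a single diagonal Gaussian as the "higher layer", for which the gradient of the log-density with respect to the signal is available in closed form, ∂ log N(T; μ, σ²I)/∂T = (μ - T)/σ². Function name and this simplification are ours:

```python
import numpy as np

def sharpen(T, mu, var, steps=100, s=0.1):
    """Gradient-ascent sharpening sketch: push the control signal T
    towards higher log-likelihood under a higher-layer density model.
    Here the model is one diagonal Gaussian, whose log-density gradient
    with respect to T is (mu - T) / var."""
    T = T.copy()
    for _ in range(steps):
        T += s * (mu - T) / var    # one gradient ascent step
    return T
```

In the full model, this loop runs over the Folding layer's control signal with the gradient supplied by automatic differentiation of L^(L*); the effect is the same: the signal drifts towards configurations the higher GMM layer considers likely.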

4. EXPERIMENTS

We define various DCGMM instances (with 2 or 3 GMM layers) for evaluation, see Table 1, plus a single-layer DCGMM baseline which is nothing but a vanilla GMM. A DCGMM instance is defined by the parameters of its layers: Folding(f_Y, f_X, Δ_Y, Δ_X), (Max-)Pooling(k_Y, k_X, Δ_Y, Δ_X) and GMM(K). Unless stated otherwise, training is always conducted for 25 epochs, using the recommended parameters from Gepperth & Pfülb (2020). Sharpening is always performed for G = 1 000 iterations with a step size of 0.1.

4.1. SAMPLING, SPARSITY AND INTERPRETABILITY

We show that trained DCGMM parameters are sparse and have an intuitive interpretation in terms of sampling. To this effect, we train DCGMM instance 2L-a (see Table 1). After training (see Section 3.2.1), we plot and interpret the centroids of the GMM layers 2 (G2) and 4 (G4). The centroids of layer 2 (left of Figure 2) are easily interpretable and reflect the patterns that can occur in any of the 2 × 2 input patches of size 20 × 20 to layer G2. The 36 = 6 × 6 centroids of G4 (right of Figure 2) express typical responsibility patterns computed from each of the 2 × 2 input patches to G2, and can be observed to be very sparsely populated. Another interpretation of G4 can be found in terms of sampling, as discussed below: in this case, it is easy to see that sampling produces a particular representation of the digit zero (see Figure 2).

4.2. OUTLIER DETECTION

For outlier detection, we compare the DCGMM architectures from Table 1, using the log-likelihood of the highest layer as a criterion, as detailed in Section 3.2.2. We first train a DCGMM instance on classes 0-4, and subsequently use samples from the trained classes 0-4 as inliers and from classes 5-9 as outliers. We vary c in the range [-2, 2], resulting in different outlier and inlier percentages. The ROC-like curves in Figure 3 clearly indicate that the deep convolutional DCGMM instances perform best, whereas deep but non-convolutional instances like 2L-d and 3L-b consistently perform badly.
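The sweep over c that produces such a ROC-like curve can be sketched as follows (an illustration with hypothetical names, assuming a global topmost layer so that each sample has a single log-likelihood, and using the threshold of Equation 4):

```python
import numpy as np

def roc_points(ll_in, ll_out, cs):
    """For each threshold parameter c, compute the fraction of true
    inliers (tpr) and of true outliers (fpr) accepted by the criterion
    L >= mean(ll_in) - c * var(ll_in), cf. Equation 4."""
    B = ll_in.mean() - np.asarray(cs) * ll_in.var()
    tpr = [float((ll_in >= b).mean()) for b in B]
    fpr = [float((ll_out >= b).mean()) for b in B]
    return fpr, tpr
```

Increasing c lowers the threshold B, so both acceptance rates grow monotonically; plotting the resulting (fpr, tpr) pairs yields the curves of Figure 3.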

4.3. CLUSTERING

We compare DCGMMs to vanilla GMMs using established clustering metrics, namely the Dunn index (Dunn, 1973) and the Davies-Bouldin score (Davies & Bouldin, 1979). The DCGMM instances from Table 1 are tested on both image datasets. We observe that mainly the deep but non-convolutional DCGMM instances perform well in clustering, whereas convolutional instances, even if they are deep, are compromised. Please note that these metrics do not measure the classification accuracy obtained by clustering, but intrinsic clustering-related properties. The results presented here were obtained by training on classes 0-4 of both datasets, and have to be confirmed by visual inspection of generated samples.

Generating Sharp Images  Figure 5 shows the effect of sharpening for DCGMM instance 2L-c using top-1 sampling. We observe that the overall shape of a sample is not changed but that the outlines are crisper, an effect visible especially for FashionMNIST. Thus, sharpening does no harm and rather improves the visual quality of generated samples. See Appendix A.3 for FashionMNIST results.
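For reference, the Dunn index used above can be computed directly (a minimal numpy sketch of the standard definition, with a function name of our own; higher values indicate better-separated, more compact clusters):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: minimum distance between points of different clusters
    divided by the maximum within-cluster diameter (higher is better)."""
    # full pairwise distance matrix (fine for small illustrative inputs)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    inter = D[~same].min()              # closest cross-cluster pair
    intra = max(D[same].max(), 1e-12)   # largest within-cluster diameter
    return inter / intra
```

The Davies-Bouldin score follows a complementary logic (within-cluster scatter relative to between-centroid separation, lower is better) and is available in common libraries.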

Effects of Convolution on Sample Diversity  Figure 4 compares top-1 sampling for convolutional and non-convolutional DCGMM instances: the non-convolutional architectures (1L, 2L-d) produce noticeably duplicated samples (marked in red in Figure 4), whereas the convolutional instances (2L-c, 2L-e) generate more diverse ones.

Controlling Diversity by Top-S Sampling  Using instance 2L-c, Figure 6 demonstrates how sample diversity is related to S: a higher value yields more diverse samples, but increases the risk of generating corrupted samples or outliers. As the corresponding FashionMNIST results in Appendix A.1 show, a good value of S is clearly problem-dependent.

Our results illustrate important functionalities such as outlier detection, clustering and sampling, which no other work on hierarchical GMMs has demonstrated on such high-dimensional image datasets. We also propose a method to generate sharp images with GMMs, which has been a problem in the past (Richardson & Weiss, 2018). An interesting facet of our experimental results is that non-convolutional DCGMMs seem to perform better at clustering, whereas convolutional ones are better at outlier detection and sampling.

A key point of the article is the compositionality of natural images. This property is at the root of DCGMM's ability to produce realistic samples with relatively few parameters. When considering top-S sampling in a layer L with H^(L) W^(L) = P^(L) positions, the number of distinct control signals generated by layer L is S^{P^(L)}. A DCGMM instance with multiple GMM layers {L_i} can thus sample Π_i S^{P^(L_i)} different patterns, a number which grows with the depth of the hierarchy and the number of distinct positions per layer, making a strong argument in favor of deep convolutional hierarchies such as DCGMM. This is an argument similar to the one about different paths through a hierarchical MFA model in Viroli & McLachlan (2019), although the number of patterns grows more strongly for DCGMM. Differences to other hierarchical models (Viroli & McLachlan, 2019; Van Den Oord & Schrauwen, 2014; Tang et al., 2012) are most notably the introduction of convolution and pooling layers.
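The per-position component selection underlying top-S sampling can be sketched as follows (an illustration with hypothetical names; it assumes a non-negative control signal and draws, at each position, from the S components with the highest signal, with probabilities proportional to that signal):

```python
import numpy as np

def top_s_select(control, S, rng):
    """Top-S sampling: at each position (h, w), draw a component index
    from the S components with the highest control signal, with
    probabilities proportional to the signal values."""
    H, W, K = control.shape
    out = np.empty((H, W), dtype=int)
    for h in range(H):
        for w in range(W):
            top = np.argsort(control[h, w])[-S:]   # S best components
            p = control[h, w, top]
            out[h, w] = rng.choice(top, p=p / p.sum())
    return out
```

With S independent choices at each of P positions, a single layer can emit up to S^P distinct control signals, which is the combinatorial argument made above.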
Our experimental validation can therefore be performed on high-dimensional data, such as images, with moderate computational cost, instead of low-dimensional problems such as the artificial Smiley task or the Ecoli and related problems. Our experimental validation does not exclusively focus on clustering performance (problematic with images) but on demonstrating the capacity for realistic sampling and outlier detection. Lastly, training DCGMMs by SGD facilitates efficient parallelizable implementations, as demonstrated with the provided TensorFlow implementation. Next steps will consist of exploring the layered DCGMM architecture, mainly top-S-sampling, for generating natural images.

A PROBABILISTIC INTERPRETATION OF DCGMMS

A probabilistic interpretation of the DCGMM model is possible despite its complex structure. The simple reason is that DCGMM instances produce outputs which are inherently normalizable, meaning that the integral over an infinite domain (e.g., data space) remains finite. Thus, DCGMM outputs can be interpreted as probabilities, which is not the case for DNNs/CNNs due to their use of scalar products. Here, we prove that GMMs are normalizable in the sense that the probability whose logarithm defines L(x) = log Σ_k π_k p_k(x) has a finite integral. This holds for any GMM layer in a hierarchy regardless of its input, provided that the input is finite (which is assured because Pooling and Folding layers cannot introduce infinities). For simplicity, we integrate over the whole d-dimensional space R^d. Since the component probabilities are Gaussian and thus strictly positive, and since furthermore the mixing weights are normalized and ≥ 0, the sum is strictly positive. Thus, it is sufficient to show that the integral over the inner sum (the argument of the logarithm) is finite. We thus have

∫_{R^d} Σ_k π_k p_k(x) dx = Σ_k π_k ∫_{R^d} p_k(x) dx = Σ_k π_k = 1 < ∞,

since each component density p_k integrates to one.



TensorFlow code is available under https://github.com/iclr2021-dcgmm/dcgmm
Advanced SGD strategies like RMSProp (Hinton et al., 2012) or Adam (Kingma & Ba, 2015) seem incompatible with GMM optimization.



Figure 1: Illustration of a sample DCGMM instance containing all four layer types, with exemplary dimensionalities and parameters for each layer.

3.2.2 DENSITY ESTIMATION AND HIERARCHICAL OUTLIER DETECTION

Outlier detection requires the computation of long-term averages E_n[L^(L)_{nhw}] and variances Var_n(L^(L)_{nhw}) in all layers and positions over the training set, preferably during a later, stable part of training. Thus, for every layer and position h, w, inliers are characterized by Equation 4.

Another interpretation of G4 is found in terms of sampling (see Section 3.2.3), which would first select a random G4 component, generate a sample of dimensions H, W, C = 1, 1, 2 × 2 × 5 × 5 from it, and pass it on as a control signal to G2. Traversing Folding layer 3 only reshapes the control signal to dimensions H, W, C = 2, 2, 5 × 5, depicted in the middle of Figure 2. This signal controls component selection in each of the 2 × 2 positions in G2: due to their sparsity, we can directly read off the components likely to be selected for sampling at each position. G2 thus generates a control signal whose 2 × 2 positions of dimensions H, W, C = 20, 20, 1 overlap in the input plane (this is resolved by sharpening in Folding layer 1).

Figure 2: Sampling from DCGMM instance 2L-a, see Table1. Shown are learned GMM centroids (left: G2, right: G4, see text) and an illustration of sampling, having initially selected the layer 4 component highlighted in red. In the middle, the selected G4 centroid is shown in more detail.

Figure 3: Visualization of different DCGMM architectures and their outlier detection capabilities for MNIST (left) and FashionMNIST (right).

Figure 4: Sampling diversity: top-1 sampling shown on MNIST, from left to right, for DCGMM architectures 1L (vanilla GMM), 2L-d (non-convolutional 2-layer), 2L-c and 2L-e (convolutional 2-layer). Please observe duplicated samples in the non-convolutional architectures, marked in red.

Figure 5: Impact of sharpening on top-S-sampling with S=1, see Section 3.2.3, shown for DCGMM instance 2L-c on MNIST. Shown are unsharpened samples (left), sharpened samples (middle) and differences (right). Samples at the same position were generated by the same top-level prototype.

Figure 6: Impact of higher values of S in top-S sampling, shown for DCGMM instance 2L-c. From left to right: S=2,5,10.

Table 1: Configurations and parameters of different DCGMM architectures.

Two metrics, Dunn index (higher is better) and Davies-Bouldin (DB) score (smaller is better), evaluated for all tested DCGMM architectures on MNIST and FashionMNIST. Best results are marked in bold. The given numbers are worst cases over 10 independent runs.

