IMAGE MODELING WITH DEEP CONVOLUTIONAL GAUSSIAN MIXTURE MODELS

Abstract

In this conceptual work, we present Deep Convolutional Gaussian Mixture Models (DCGMMs), a deep hierarchical Gaussian Mixture Model (GMM) that is particularly suited to describing and generating images. Vanilla (i.e., flat) GMMs require a very large number of components to describe images well, leading to long training times and memory issues. DCGMMs avoid this through a stacked architecture of multiple GMM layers, linked by convolution and pooling operations, which allows them to exploit the compositionality of images in a similar way as deep CNNs do. DCGMMs can be trained end-to-end by Stochastic Gradient Descent (SGD). This sets them apart from vanilla GMMs, which are trained by Expectation-Maximization and require a prior k-means initialization that is infeasible in a layered structure. For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling. On the MNIST and FashionMNIST datasets, we validate the DCGMM model by demonstrating its superiority over flat GMMs for clustering, sampling and outlier detection.

1. INTRODUCTION

This conceptual work is in the context of probabilistic image modeling, whose main objectives are density estimation and image generation (sampling). Since images usually do not precisely follow a Gaussian mixture distribution, such a treatment is inherently approximative in nature. This implies that clustering, even though it is possible and has a long history in the context of Gaussian Mixture Models (GMMs), is not a main objective. Sampling is an active research topic mainly relying on Generative Adversarial Networks (GANs), discussed in Section 1.2. Similar techniques are being investigated for generating videos (Ghazvinian Zanjani et al., 2018; Piergiovanni & Ryoo, 2019). An issue with GANs is that their probabilistic interpretation remains unclear, as underlined by the fact that there is no easy-to-compute probabilistic measure of the current fit-to-data that is optimized by GAN training. Recent evidence indicates that GANs may not model the full image distribution given by the training data (Richardson & Weiss, 2018). Nevertheless, images generated by GANs appear extremely realistic and diverse, and the GAN model has been adapted to perform a wide range of visually impressive functionalities.

In contrast, GMMs explicitly describe the distribution p(X) of a set of training data X = {x_n} as a weighted mixture of K Gaussian component densities N(x; µ_k, Σ_k) ≡ N_k(x): p(x) = ∑_{k=1}^{K} π_k N_k(x). GMMs require the mixture weights to be normalized, ∑_k π_k = 1, and the covariance matrices to be positive definite: x^T Σ_k x > 0 ∀x ≠ 0. The quality of the current fit-to-data is expressed by the log-likelihood L(X) = E_n [ log ∑_k π_k N_k(x_n) ], which is what GMM training optimizes, usually by variants of Expectation-Maximization (EM) (Dempster et al., 1977). It can be shown that, given enough components, mixtures of Gaussians can approximate arbitrary distributions (Goodfellow et al., 2016).
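The quantities just defined can be made concrete in a short NumPy sketch (illustrative only; the function name and the restriction to diagonal covariances are our own choices, not taken from the paper) that evaluates the log-likelihood L(X):

```python
import numpy as np

def gmm_log_likelihood(X, pi, mu, var):
    """Mean per-sample log-likelihood E_n[log sum_k pi_k N_k(x_n)]
    for a GMM with diagonal covariances.
    X: (N, D) data; pi: (K,) mixture weights summing to 1;
    mu: (K, D) component means; var: (K, D) diagonal variances."""
    diff = X[:, None, :] - mu[None, :, :]                      # (N, K, D)
    # log N_k(x_n) for every pair (n, k), shape (N, K)
    log_comp = (-0.5 * np.log(2.0 * np.pi * var).sum(axis=1)
                - 0.5 * (diff ** 2 / var[None]).sum(axis=2))
    # log-sum-exp over components for numerical stability
    a = log_comp + np.log(pi)[None, :]
    m = a.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).mean()
```

For a single standard-normal component in one dimension, evaluating at x = 0 recovers −0.5 log 2π ≈ −0.919; the log-sum-exp trick avoids underflow when component densities are tiny, which is the usual situation for high-dimensional image data.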
Thus, GMMs are guaranteed to model the complete data distribution, but only to the extent allowed by the number of components K. In this respect, GMMs are similar to flat neural networks with a single hidden layer: although, by the universal approximation theorems of Pinkus (1999) and Hornik et al. (1989), they can approximate arbitrary functions (from certain rather broad function classes), they fail to do so in practice. The reason for this is that the number of required hidden layer elements is unknown, and usually beyond the reach of any reasonable computational capacity. For images, this problem was largely solved by introducing deep Convolutional Neural Networks (CNNs). CNNs model the statistical structure of images (hierarchical organization and translation invariance) by chaining multiple convolution and pooling layers. Thus, the number of parameters can be reduced without compromising accuracy.

1.1. OBJECTIVE, CONTRIBUTION AND NOVELTY

The objective of this article is to introduce a GMM architecture which exploits the same principles that led to the performance explosion of CNNs. In particular, the genuinely novel characteristics are:

• formulation of GMMs as a deep hierarchy, including convolution and pooling layers,
• end-to-end training by SGD from random initial conditions (no k-means initialization),
• generation of realistic samples by a new sharpening procedure,
• better empirical performance than vanilla GMMs for sampling, clustering and outlier detection.

In addition, we provide a publicly available TensorFlow implementation which supports a Keras-like, flexible construction of Deep Convolutional Gaussian Mixture Model instances.

1.2. RELATED WORK

Hierarchical GMMs. Mixture of Factor Analyzers (MFA) models (McLachlan & Peel, 2005; Ghahramani & Hinton, 1997) can be considered hierarchical GMMs because they are formulated in terms of a lower-dimensional latent-variable representation, which is mapped to a higher-dimensional space. The use of MFAs for describing natural images is discussed in detail in Richardson & Weiss (2018), showing that the MFA model alone, without further hierarchical structure, compares quite favorably to GANs for image generation. A straightforward hierarchical extension of GMMs is presented by Liu et al. (2002) with the goal of unsupervised clustering: the responsibilities of one GMM are treated as inputs to a subsequent GMM, together with an adaptive mechanism that determines the depth of the hierarchy. Garcia et al. (2010) present a comparable, more information-theoretic approach. A hierarchy of MFA layers designed with sampling in mind is presented by Viroli & McLachlan (2019), where each layer samples values for the latent variables of the previous one, although the transformations between layers are exclusively linear. Van Den Oord & Schrauwen (2014) and Tang et al. (2012) pursue similar approaches. All of the described approaches train hierarchical GMMs by (quite complex) extensions of the EM algorithm initialized by k-means, except for Richardson & Weiss (2018), who use Stochastic Gradient Descent (SGD), albeit with a k-means initialization. None of these models considers convolutional or max-pooling operations, which have proven important for modeling the statistical structure of images.

Convolutional GMMs. The only work we could identify that proposes hierarchical convolutional GMMs is Ghazvinian Zanjani et al. (2018), although that article describes a hybrid model in which a CNN and a GMM are combined.
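To illustrate the training mode claimed above (pure SGD from random initial conditions, with no k-means and no EM), the following NumPy sketch fits the means of a small diagonal-covariance GMM by mini-batch gradient ascent on the log-likelihood. It is a minimal toy under our own assumptions, not the paper's implementation: the data, the fixed uniform weights, the fixed unit variances, and the learning rate are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_log_likelihood(X, pi, mu, var):
    # E_n[log sum_k pi_k N_k(x_n)], diagonal covariances, log-sum-exp for stability
    diff = X[:, None, :] - mu[None, :, :]
    log_comp = (-0.5 * np.log(2.0 * np.pi * var).sum(axis=1)
                - 0.5 * (diff ** 2 / var[None]).sum(axis=2))
    a = log_comp + np.log(pi)[None, :]
    m = a.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).mean()

def responsibilities(X, pi, mu, var):
    # gamma_nk proportional to pi_k N_k(x_n), normalized over k
    diff = X[:, None, :] - mu[None, :, :]
    log_comp = (-0.5 * np.log(2.0 * np.pi * var).sum(axis=1)
                - 0.5 * (diff ** 2 / var[None]).sum(axis=2))
    a = log_comp + np.log(pi)[None, :]
    a -= a.max(axis=1, keepdims=True)
    g = np.exp(a)
    return g / g.sum(axis=1, keepdims=True)

# toy data: two well-separated 2-D blobs
X = np.concatenate([rng.normal(-3.0, 0.3, (200, 2)),
                    rng.normal(+3.0, 0.3, (200, 2))])

K, lr = 2, 0.5
pi = np.full(K, 1.0 / K)            # mixture weights held fixed and uniform here
var = np.ones((K, 2))               # unit variances held fixed here
mu = rng.normal(0.0, 1.0, (K, 2))   # random initialization -- no k-means

ll_init = mean_log_likelihood(X, pi, mu, var)
for step in range(200):
    batch = X[rng.choice(len(X), 64, replace=False)]    # stochastic mini-batch
    g = responsibilities(batch, pi, mu, var)            # (B, K)
    # analytic gradient of the mean log-likelihood w.r.t. mu_k:
    # E_n[gamma_nk (x_n - mu_k) / var_k]
    grad = (g[:, :, None] * (batch[:, None, :] - mu[None])).mean(axis=0) / var
    mu += lr * grad                                     # gradient *ascent* step
ll_final = mean_log_likelihood(X, pi, mu, var)
```

After a few hundred updates the component means settle near the two blob centers and the log-likelihood rises accordingly; the DCGMM idea is to apply this kind of end-to-end gradient-based optimization through a full hierarchy of GMM, convolution and pooling layers, where a per-layer k-means initialization would not be available.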

