IMAGE MODELING WITH DEEP CONVOLUTIONAL GAUSSIAN MIXTURE MODELS

Abstract

In this conceptual work, we present Deep Convolutional Gaussian Mixture Models (DCGMMs), a deep hierarchical Gaussian Mixture Model (GMM) that is particularly suited to describing and generating images. Vanilla (i.e., flat) GMMs require a very large number of components to describe images well, leading to long training times and memory issues. DCGMMs avoid this through a stacked architecture of multiple GMM layers, linked by convolution and pooling operations, which exploits the compositionality of images in a manner similar to deep CNNs. DCGMMs can be trained end-to-end by Stochastic Gradient Descent. This sets them apart from vanilla GMMs, which are trained by Expectation-Maximization and require a prior k-means initialization that is infeasible in a layered structure. For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations such as convolution and pooling. Using the MNIST and FashionMNIST datasets, we validate the DCGMM model by demonstrating its superiority over flat GMMs for clustering, sampling, and outlier detection.

1. INTRODUCTION

This conceptual work is set in the context of probabilistic image modeling, whose main objectives are density estimation and image generation (sampling). Since images usually do not precisely follow a Gaussian mixture distribution, such a treatment is inherently approximative. This implies that clustering, even though it is possible and has a long history in the context of Gaussian Mixture Models (GMMs), is not a main objective. Sampling is an active research topic relying mainly on Generative Adversarial Networks (GANs), discussed in Section 1.2. Similar techniques are being investigated for generating videos (Ghazvinian Zanjani et al., 2018; Piergiovanni & Ryoo, 2019). An issue with GANs is that their probabilistic interpretation remains unclear, as underlined by the fact that there is no easy-to-compute probabilistic measure of the current fit-to-data that is optimized by GAN training. Recent evidence indicates that GANs may not model the full image distribution given by the training data (Richardson & Weiss, 2018). Nevertheless, images generated by GANs appear extremely realistic and diverse, and the GAN model has been adapted to perform a wide range of visually impressive functionalities.

In contrast, GMMs explicitly describe the distribution p(X) of a set of training data X = {x_n} as a weighted mixture of K Gaussian component densities N(x; µ_k, Σ_k) ≡ N_k(x): p(x) = Σ_{k=1}^{K} π_k N_k(x). GMMs require the mixture weights to be normalized, Σ_k π_k = 1, and the covariance matrices to be positive definite, x^T Σ_k x > 0 ∀x ≠ 0. The quality of the current fit-to-data is expressed by the log-likelihood L(X) = E_n[log Σ_k π_k N_k(x_n)], which is what GMM training optimizes, usually by variants of Expectation-Maximization (EM) (Dempster et al., 1977). It can be shown that, given enough components, mixtures of Gaussians can approximate arbitrary distributions (Goodfellow et al., 2016).
Thus, GMMs are guaranteed to model the complete data distribution, but only to the extent allowed by the number of components K. 1
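To make the log-likelihood objective concrete, the following is a minimal NumPy sketch of evaluating L(X) for a GMM with diagonal covariances, using the log-sum-exp trick for numerical stability. This is an illustrative assumption, not the paper's implementation (the function name and the diagonal-covariance restriction are choices made here for brevity):

```python
import numpy as np

def gmm_log_likelihood(X, pi, mu, var):
    """Mean log-likelihood E_n[log sum_k pi_k N(x_n; mu_k, var_k)].

    Illustrative sketch (not the paper's code), diagonal covariances only.
    X:   (N, D) data points
    pi:  (K,)   mixture weights, summing to 1
    mu:  (K, D) component means
    var: (K, D) per-dimension variances (diagonal of Sigma_k)
    """
    N, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                 # (N, K, D)
    # log of the Gaussian normalization constant per component -> (K,)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var), axis=1))
    # log N(x_n; mu_k, diag(var_k)) for all n, k -> (N, K)
    log_prob = log_norm[None, :] - 0.5 * np.sum(diff**2 / var[None, :, :], axis=2)
    # log-sum-exp over components, weighted by pi_k (numerically stable)
    weighted = np.log(pi)[None, :] + log_prob             # (N, K)
    m = weighted.max(axis=1, keepdims=True)
    log_like = m.squeeze(1) + np.log(np.exp(weighted - m).sum(axis=1))
    return log_like.mean()
```

In an SGD setting such as the one used for DCGMMs, this quantity (negated) can serve directly as the loss, with the constraints on π_k and Σ_k enforced by reparameterization.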

