SPATIAL DEPENDENCY NETWORKS: NEURAL LAYERS FOR IMPROVED GENERATIVE IMAGE MODELING

Abstract

How can generative models better exploit spatial regularities and coherence in images? We introduce a novel neural network for building image generators (decoders) and apply it to variational autoencoders (VAEs). In our spatial dependency networks (SDNs), feature maps at each level of a deep neural net are computed in a spatially coherent way, using a sequential gating-based mechanism that distributes contextual information across 2-D space. We show that augmenting the decoder of a hierarchical VAE with spatial dependency layers considerably improves density estimation over baseline convolutional architectures and over the state of the art among models of the same class. Furthermore, we demonstrate that SDN scales to large images by synthesizing samples of high quality and coherence. In a vanilla VAE setting, we find that a powerful SDN decoder also improves the learning of disentangled representations, indicating that neural architectures play an important role in this task. Our results suggest favoring spatial dependency layers over convolutional layers in various VAE settings.

1. INTRODUCTION

The abundance of data and computation is often identified as a core facilitator of the deep learning revolution. Beyond this technological leap, historically speaking, most major algorithmic advances critically hinged on inductive biases that incorporate prior knowledge in different ways. The main breakthroughs in image recognition (Cireşan et al., 2012; Krizhevsky et al., 2012) were preceded by the long-standing pursuit of shift-invariant pattern recognition (Fukushima & Miyake, 1982), which catalyzed the ideas of weight sharing and convolutions (Waibel, 1987; LeCun et al., 1989). Recurrent networks (exploiting temporal recurrence) and transformers (modeling the "attention" bias) revolutionized natural language processing (Mikolov et al., 2011; Vaswani et al., 2017). Visual representation learning is also often based on priors, e.g., independence of latent factors (Schmidhuber, 1992; Bengio et al., 2013) or invariance to input transformations (Becker & Hinton, 1992; Chen et al., 2020). Clearly, one promising strategy to move forward is to introduce more structure into learning algorithms, along with more knowledge about the problems and data. Following this line of thought, we explore a way to improve the architecture of deep neural networks that generate images, here referred to as (deep) image generators, by incorporating prior assumptions based on topological image structure. More specifically, we aim to integrate priors on spatial dependencies in images. We would like to enforce these priors on all intermediate image representations produced by an image generator, including the last one, from which the final image is synthesized. To that end, we introduce a class of neural networks designed specifically for building image generators: the spatial dependency network (SDN).
Concretely, the spatial dependency layers of SDN (SDN layers) incorporate two priors: (i) spatial coherence reflects our assumption that feature vectors should depend on each other in a spatially consistent, smooth way. Thus, in SDN, neighboring feature vectors tend to be more similar than non-neighboring ones, with similarity correlating with 2-D distance. The graphical model (Figure 1a) captures this assumption. Note also that, due to their unbounded receptive field, SDN layers model long-range dependencies; (ii) spatial dependencies between feature vectors should not depend on their 2-D coordinates. Mathematically speaking, SDN should be equivariant to spatial translation, in the same way convolutional networks (CNNs) are. This is achieved through parameter (weight) sharing, both in SDN and in CNN.

The particular focus of this paper is the application of SDN to variational autoencoders (VAEs) (Kingma & Welling, 2013). The main motivation is to improve the performance of VAE generative models by endowing their image decoders with spatial dependency layers (Figure 1b). While out of the scope of this work, SDN could also be applied to other generative models, e.g., generative adversarial networks (Goodfellow et al., 2014). More generally, SDN could potentially be used in any task that requires image generation, such as image-to-image translation, super-resolution, image inpainting, image segmentation, or scene labeling.

SDN is examined experimentally in two different settings. In the context of real-life-image density modeling, an SDN-empowered hierarchical VAE is shown to reach considerably higher test log-likelihoods than baseline CNN-based architectures, and it can synthesize perceptually appealing and coherent images even at high sampling temperatures.
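To make the first prior concrete, the gated, sequential distribution of contextual information can be pictured with a toy sketch. Everything below is illustrative rather than the paper's actual parameterization: features are scalars, there is a single left-to-right sweep, and the gate weights `w_g`, `u_g`, `b_g` are hypothetical hand-set scalars standing in for learned, shared parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_sweep(row, w_g=1.0, u_g=1.0, b_g=0.0):
    """Left-to-right gated sweep over a 1-D row of scalar features.

    Each position mixes its own input with a propagated state, so
    context from any earlier position can influence every later one
    (unbounded receptive field). The same (w_g, u_g, b_g) are applied
    at every position: translation equivariance via weight sharing.
    """
    h = 0.0    # propagated contextual state
    out = []
    for x in row:
        g = sigmoid(w_g * x + u_g * h + b_g)  # data-dependent gate
        h = g * h + (1.0 - g) * x             # convex gated update
        out.append(h)
    return out
```

Because each output is a convex combination of the current input and the accumulated context, nearby positions end up with similar values, which is exactly the smoothness prior described above.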
In a synthetic-data setting, we observe that enhancing a non-hierarchical VAE with an SDN facilitates the learning of factorial latent codes, suggesting that unsupervised 'disentanglement' of representations can be improved by using more powerful neural architectures, with SDN standing out as a good candidate model.

CONTRIBUTIONS AND CONTENTS

The contributions and contents of this paper can be summarized as follows:

• The architecture of SDN is introduced and then contrasted with related architectures such as convolutional networks and self-attention.



Figure 1: (a) DAG of a spatial dependency layer. The input feature vector (red node) is gradually refined (green nodes) as the computation progresses through the four sub-layers, until the output feature vector is produced (blue node). In each sub-layer, the feature vector is corrected based on contextual information. Conditional maps within sub-layers are implemented as learnable deterministic functions with shared parameters; (b) VAE with an SDN decoder reconstructing a 'celebrity'.
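The four sub-layers in the caption can be pictured as four directional sweeps over the feature map, one per 2-D direction. The sketch below is a deliberately simplified stand-in: a fixed blend `alpha` replaces the learned gating, and features are scalars; only the sequential, direction-by-direction correction scheme matches the description.

```python
def directional_sweeps(fmap, alpha=0.5):
    """Refine a 2-D feature map with four sequential sweeps
    (left-to-right, right-to-left, top-to-bottom, bottom-to-top),
    each correcting a cell using its predecessor along the sweep.
    `alpha` is a toy stand-in for the learned gating functions."""
    h = [row[:] for row in fmap]
    rows, cols = len(h), len(h[0])
    # sub-layer 1: left -> right
    for i in range(rows):
        for j in range(1, cols):
            h[i][j] = alpha * h[i][j - 1] + (1 - alpha) * h[i][j]
    # sub-layer 2: right -> left
    for i in range(rows):
        for j in range(cols - 2, -1, -1):
            h[i][j] = alpha * h[i][j + 1] + (1 - alpha) * h[i][j]
    # sub-layer 3: top -> bottom
    for j in range(cols):
        for i in range(1, rows):
            h[i][j] = alpha * h[i - 1][j] + (1 - alpha) * h[i][j]
    # sub-layer 4: bottom -> top
    for j in range(cols):
        for i in range(rows - 2, -1, -1):
            h[i][j] = alpha * h[i + 1][j] + (1 - alpha) * h[i][j]
    return h
```

After the four sweeps, information from a single activated cell has reached every position of the map, which is one way to see why the receptive field of such a layer is unbounded.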

• The architecture of SDN-VAE is introduced: the result of applying SDN to IAF-VAE (Kingma et al., 2016), with modest modifications to the original architecture. SDN-VAE is evaluated in terms of: (a) density estimation, where the performance is considerably improved both upon the baseline and upon related approaches to non-autoregressive modeling, on the CIFAR10 and ImageNet32 data sets; (b) image synthesis, where images of competitively high perceptual quality are generated based on the CelebAHQ256 data set.

• By integrating SDN into a relatively simple, non-hierarchical VAE trained on the synthetic 3D-Shapes data set, we demonstrate in another comparative analysis substantial improvements upon convolutional networks in terms of: (a) optimization of the evidence lower bound (ELBO); (b) learning of disentangled representations with respect to two popular metrics, when β-VAE (Higgins et al., 2016) is used as an objective function.
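For reference, the β-VAE objective mentioned above is the standard ELBO with the KL term scaled by β. A minimal per-sample sketch, assuming a diagonal-Gaussian posterior and a standard-normal prior so the KL has its usual closed form (the function name and calling convention are ours, not the paper's):

```python
import math

def beta_vae_objective(recon_log_lik, mu, log_var, beta=1.0):
    """Per-sample beta-VAE objective: reconstruction log-likelihood
    minus beta times KL(q || p), where q = N(mu, exp(log_var)) is a
    diagonal-Gaussian posterior and p is a standard-normal prior.
    beta = 1 recovers the standard ELBO."""
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                   for m, lv in zip(mu, log_var))
    return recon_log_lik - beta * kl
```

Setting β > 1 penalizes posterior capacity more heavily, which is the mechanism β-VAE uses to encourage disentangled latent factors.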

CODE AVAILABILITY

https://github.com/djordjemila

