UNCERTAINTY IN NEURAL PROCESSES

Abstract

We explore the effects of architecture and training-objective choices on amortized posterior predictive inference in probabilistic conditional generative models. We intend this work as a counterpoint to a recent trend in the literature that stresses achieving good samples when the amount of conditioning data is large. We instead focus our attention on the case where the amount of conditioning data is small. We highlight specific architecture and objective choices that we find lead to qualitative and quantitative improvements in posterior inference in this low-data regime. Specifically, we explore the effects of the choice of pooling operator and variational family on posterior quality in neural processes. Superior posterior predictive samples drawn from our novel neural process architectures are demonstrated via image completion/in-painting experiments.

1. INTRODUCTION

What makes a probabilistic conditional generative model good? The belief that a generative model is good if it produces samples that are indistinguishable from those that it was trained on (Hinton, 2007) is widely accepted, and understandably so. This belief also applies when the generator is conditional, though the standard becomes higher: conditional samples must be indistinguishable from training samples for each value of the condition. Consider an amortized image in-painting task in which the objective is to fill in missing pixel values given a subset of observed pixel values. If the number and location of observed pixels is fixed, then a good conditional generative model should produce sharp-looking sample images, all of which should be compatible with the observed pixel values. If the number and location of observed pixels is allowed to vary, the same should remain true for each set of observed pixels. Recent work on this problem has focused on reconstructing an entire image from as small a conditioning set as possible. As shown in Fig. 1, state-of-the-art methods (Kim et al., 2018) achieve high-quality reconstruction from as few as 30 conditioning pixels in a 1024-pixel image. Our work starts by questioning whether reconstructing an image from a small subset of pixels is always the right objective. To illustrate, consider the image completion task on handwritten digits. A small set of pixels might, depending on their locations, rule out the possibility that the full image depicts, say, a 1, 5, or 6. Human-like performance in this case would generate sharp-looking sample images for all digits that are consistent with the observed pixels (i.e., 0, 2-4, and 7-9). Observing additional pixels will rule out successively more digits until the only remaining uncertainty pertains to stylistic details. The bottom-right panel of Fig. 1 demonstrates this type of "calibrated" uncertainty.
We argue that in addition to high-quality reconstruction based on large conditioning sets, amortized conditional inference methods should aim for meaningful, calibrated uncertainty, particularly for small conditioning sets. For different problems, this may mean different things (see discussion in Section 3). In this work, we focus on the image in-painting problem, and define well-calibrated uncertainty to be a combination of two qualities: high sample diversity for small conditioning sets; and sharp-looking, realistic images for any size of conditioning set. As the size of the conditioning set grows, we expect the sample diversity to decrease and the quality of the images to increase. We note that this emphasis is different from the current trend in the literature, which has focused primarily on making sharp and accurate image completions when the size of the conditioning context is large (Kim et al., 2018). To better understand and make progress toward our aim, we employ posterior predictive inference in a conditional generative latent-variable model, with a particular focus on neural processes (NPs) (Garnelo et al., 2018a;b). We find that particular architecture choices can result in markedly different performance.
In order to understand this, we investigate posterior uncertainty in NP models (Section 4), and we use our findings to establish new best practices for NP amortized inference artifacts with well-calibrated uncertainty. In particular, we demonstrate improvements arising from a combination of max pooling, a mixture variational distribution, and a "normal" amortized variational inference objective. The rest of this paper is organized as follows. Section 2 and Section 3 present background material on amortized inference for generative models and calibrated uncertainty, respectively. Section 4 discusses and presents empirical evidence for how NP models handle uncertainty. Section 5 introduces our proposed network architecture and objective. Section 6 reports our results on the MNIST, FashionMNIST and CelebA datasets. Finally, Section 7 presents our conclusions.
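To make the pooling choice concrete, the following is a minimal NumPy sketch (not the paper's implementation; the `embed` network is a stand-in for a learned per-point encoder) of how an NP aggregates per-context-point embeddings into a single representation, with either mean or max pooling. Both operators are permutation invariant over the context set, which is the property the architecture relies on:

```python
import numpy as np

def embed(xy):
    # Hypothetical per-point embedding network; a fixed random
    # projection plus tanh stands in for a learned MLP.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((xy.shape[-1], 8))
    return np.tanh(xy @ W)

def aggregate(context, pool="max"):
    """Pool per-point embeddings r_i into one representation r.

    Both mean and max pooling are permutation invariant, so r
    does not depend on the order of the context points."""
    r_i = embed(context)  # shape (n, d_r)
    return r_i.max(axis=0) if pool == "max" else r_i.mean(axis=0)

# A toy context set of 30 points; shuffling it leaves r unchanged.
context = np.random.default_rng(1).standard_normal((30, 3))
shuffled = context[np.random.default_rng(2).permutation(30)]

assert np.allclose(aggregate(context, "max"), aggregate(shuffled, "max"))
assert np.allclose(aggregate(context, "mean"), aggregate(shuffled, "mean"))
```

Mean pooling averages information across the context, while max pooling keeps the strongest activation per dimension; the experiments in later sections compare the effect of this choice on posterior quality.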

2. AMORTIZED INFERENCE FOR CONDITIONAL GENERATIVE MODELS

Our work builds on amortized inference (Gershman & Goodman, 2014; Kingma & Welling, 2014), probabilistic meta-learning (Gordon et al., 2019), and conditional generative models in the form of neural processes (Garnelo et al., 2018b; Kim et al., 2018). This section provides background. Let (x_C, y_C) = {(x_i, y_i)}_{i=1}^{n} and (x_T, y_T) = {(x_j, y_j)}_{j=1}^{m} be a context set and a target set, respectively. In image in-painting, the context-set input x_C is a subset of an image's pixel coordinates, the context-set output y_C is the corresponding set of pixel values (greyscale intensities or colors), the target-set input x_T is a set of pixel coordinates requiring in-painting, and the target-set output y_T is the corresponding set of target pixel values. The corresponding graphical model is shown in Fig. 2. The goal of amortized conditional inference is to rapidly approximate, at "test time," the posterior predictive distribution

p_θ(y_T | x_T, x_C, y_C) = ∫ p_θ(y_T | x_T, z) p_θ(z | x_C, y_C) dz.    (1)
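In practice the integral in Eq. (1) is approximated by Monte Carlo: draw latent samples z from an amortized approximate posterior conditioned on the context, then decode each draw at the target inputs. The sketch below illustrates this with toy stand-ins; `encoder` and `decoder` are hypothetical placeholders for the learned networks, not the architecture used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x_c, y_c):
    # Hypothetical amortized encoder q(z | x_C, y_C): pools the
    # context into the mean and std of a diagonal Gaussian over z.
    r = np.tanh(np.concatenate([x_c, y_c], axis=-1)).mean(axis=0)
    return r, np.full_like(r, 0.1)  # (mu_z, sigma_z)

def decoder(x_t, z):
    # Hypothetical decoder for p(y_T | x_T, z); a toy deterministic map
    # from target inputs and a latent draw to predicted outputs.
    return np.tanh(x_t @ np.ones((x_t.shape[-1], z.shape[-1]))) @ z

def posterior_predictive_samples(x_c, y_c, x_t, n_samples=100):
    """Monte Carlo view of Eq. (1): sample z ~ q(z | x_C, y_C),
    then decode each draw at the target inputs x_T."""
    mu, sigma = encoder(x_c, y_c)
    zs = mu + sigma * rng.standard_normal((n_samples, mu.shape[-1]))
    return np.stack([decoder(x_t, z) for z in zs])  # (n_samples, m)

# Toy context of 30 points and 5 target locations.
x_c, y_c = rng.standard_normal((30, 2)), rng.standard_normal((30, 2))
x_t = rng.standard_normal((5, 2))
samples = posterior_predictive_samples(x_c, y_c, x_t)
assert samples.shape == (100, 5)
```

The spread across the `n_samples` decoded draws is what carries the predictive uncertainty: with a small context set the encoder should yield a broad q(z | x_C, y_C), and hence diverse completions.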



Figure 1: Representative image in-painting results for CelebA and MNIST. From left to right: neural process (NP) (Garnelo et al., 2018b), attentive neural process (ANP) (Kim et al., 2018), and ours. Top rows show context sets of given pixels, ranging from very few pixels to all pixels. In each panel the ground-truth image (all pixels) is in the upper right corner. The rows correspond to i.i.d. samples from the corresponding image completion model given only the pixels shown in the top row of the same column. Our neural process with semi-implicit variational inference and max pooling produces results with the following characteristics: 1) the images generated with a small amount of contextual information are "sharper" and more face- and digit-like than NP results, and 2) there is greater sample diversity across the i.i.d. samples than those from the ANP. This kind of "calibrated uncertainty" is what we target throughout.

