BAYESIAN CONTEXT AGGREGATION FOR NEURAL PROCESSES

Abstract

Formulating scalable probabilistic regression models with reliable uncertainty estimates has been a long-standing challenge in machine learning research. Recently, casting probabilistic regression as a multi-task learning problem in terms of conditional latent variable (CLV) models such as the Neural Process (NP) has shown promising results. In this paper, we focus on context aggregation, a central component of such architectures, which fuses information from multiple context data points. So far, this aggregation operation has been treated separately from the inference of a latent representation of the target function in CLV models. Our key contribution is to combine these steps into one holistic mechanism by phrasing context aggregation as a Bayesian inference problem. The resulting Bayesian Aggregation (BA) mechanism enables principled handling of task ambiguity, which is key for efficiently processing context information. We demonstrate on a range of challenging experiments that BA consistently improves upon the performance of traditional mean aggregation while remaining computationally efficient and fully compatible with existing NP-based models.

1. INTRODUCTION

Estimating statistical relationships between physical quantities from measured data is of central importance in all branches of science and engineering, and devising powerful regression models for this purpose forms a major field of study in statistics and machine learning. In terms of representational power, neural networks (NNs) are arguably the most prominent member of the regression toolbox. NNs cope well with large amounts of training data and are computationally efficient at test time. On the downside, standard NN variants do not provide uncertainty estimates over their predictions and tend to overfit on small datasets. Gaussian processes (GPs) may be viewed as complementary to NNs, as they provide reliable uncertainty estimates. However, in their basic formulation, they scale cubically with the number of context data points at training time and quadratically at test time, which hinders their application to tasks with large amounts of data or to high-dimensional problems. Recently, combinations of aspects of NNs and GPs have attracted considerable interest in the scientific community. Indeed, a prominent formulation of probabilistic regression is as a multi-task learning problem formalized in terms of amortized inference in conditional latent variable (CLV) models, which results in NN-based architectures that learn a distribution over target functions. Notable variants are given by the Neural Process (NP) (Garnelo et al., 2018b) and the work of Gordon et al. (2019), which presents a unifying view on a range of related approaches in the language of CLV models. Inspired by this research, we study context aggregation, a central component of such models, and propose a new, fully Bayesian, aggregation mechanism for CLV-based probabilistic regression models.
To transform the information contained in the context data into a latent representation of the target function, current approaches typically employ a mean aggregator and feed the output of this aggregator into a NN to predict a distribution over global latent parameters of the function. Hence, aggregation and latent parameter inference have so far been treated as separate parts of the learning pipeline. Moreover, when using a mean aggregator, every context sample is assumed to carry the same amount of information. Yet, in practice, different input locations have different task ambiguity and, therefore, samples should be assigned different importance in the aggregation process. In contrast, our Bayesian aggregation mechanism treats context aggregation and latent parameter inference as one holistic mechanism, i.e., the aggregation directly yields the distribution over the latent parameters of the target function. Indeed, we formulate context aggregation as Bayesian inference of latent parameters using Gaussian conditioning in the latent space. Compared to existing methods, the resulting aggregator improves the handling of task ambiguity, as it can assign different variance levels to the context samples. This mechanism improves predictive performance, while it remains conceptually simple and introduces only negligible computational overhead. Moreover, our Bayesian aggregator can also be applied to deterministic model variants like the Conditional NP (CNP) (Garnelo et al., 2018a). In summary, our contributions are (i) a novel Bayesian Aggregation (BA) mechanism for context aggregation in NP-based models for probabilistic regression, (ii) its application to existing CLV architectures as well as to deterministic variants like the CNP, and (iii) an exhaustive experimental evaluation, demonstrating BA's superiority over traditional mean aggregation.
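The contrast between mean aggregation and aggregation by Gaussian conditioning can be illustrated with a minimal sketch. It assumes a factorized Gaussian prior over the latent variable and treats each context point as a latent observation with its own per-dimension variance; in a full model these quantities would be produced by an encoder network, whereas here they are plain lists. The function and argument names (`mean_aggregate`, `bayesian_aggregate`, `r`, `var_r`, `mu_0`, `var_0`) are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def mean_aggregate(r: Sequence[Sequence[float]]) -> List[float]:
    """Mean aggregation: every latent observation r_n gets the same weight."""
    n = len(r)
    return [sum(column) / n for column in zip(*r)]


def bayesian_aggregate(
    r: Sequence[Sequence[float]],      # latent observations r_n (assumed encoder outputs)
    var_r: Sequence[Sequence[float]],  # per-observation variances (assumed encoder outputs)
    mu_0: Sequence[float],             # prior mean (assumption: factorized Gaussian prior)
    var_0: Sequence[float],            # prior variance
) -> Tuple[List[float], List[float]]:
    """Gaussian conditioning in latent space, independently per dimension.

    Returns the posterior mean and variance of the latent variable.
    Observations with small variance dominate the posterior mean.
    """
    mu_z: List[float] = []
    var_z: List[float] = []
    for d in range(len(mu_0)):
        # Precisions add up: prior precision plus one term per context point.
        precision = 1.0 / var_0[d] + sum(1.0 / v_n[d] for v_n in var_r)
        var = 1.0 / precision
        # Precision-weighted deviation of the observations from the prior mean.
        weighted = sum((r_n[d] - mu_0[d]) / v_n[d] for r_n, v_n in zip(r, var_r))
        mu_z.append(mu_0[d] + var * weighted)
        var_z.append(var)
    return mu_z, var_z


if __name__ == "__main__":
    # Two context points: the first is far more certain than the second.
    r = [[1.0], [3.0]]
    var_r = [[0.1], [10.0]]
    print("mean aggregation:    ", mean_aggregate(r))
    print("Bayesian aggregation:", bayesian_aggregate(r, var_r, [0.0], [1.0]))
```

Under these assumptions, the aggregation itself already yields a distribution over the latent variable, and the variances act as importance weights: a context point with low variance pulls the posterior mean towards its latent observation, whereas the mean aggregator weights both points equally. With an empty context set, the sketch simply returns the prior.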

2. RELATED WORK

Prominent approaches to probabilistic regression are Bayesian linear regression and its kernelized counterpart, the Gaussian process (GP) (Rasmussen and Williams, 2005). The formal correspondence of GPs with infinite-width Bayesian NNs (BNNs) has been established in Neal (1996) and Williams (1996). A broad range of research aims to overcome the cubic scaling behaviour of GPs with the number of context points, e.g., through sparse GP approximations (Smola and Bartlett, 2001; Lawrence et al., 2002; Snelson and Ghahramani, 2005; Quiñonero-Candela and Rasmussen, 2005), by deep kernel learning (Wilson et al., 2016), by approximating the posterior distribution of BNNs (MacKay, 1992; Hinton and van Camp, 1993; Gal and Ghahramani, 2016; Louizos and Welling, 2017), or by adaptive Bayesian linear regression, i.e., by performing inference over the last layer of a NN, which introduces sparsity through linear combinations of finitely many learned basis functions (Lazaro-Gredilla and Figueiras-Vidal, 2010; Hinton and Salakhutdinov, 2008; Snoek et al., 2012; Calandra et al., 2016). A complementary approach, in a sense, aims to increase the data-efficiency of deep architectures by a fully Bayesian treatment of hierarchical latent variable models ("DeepGPs") (Damianou and Lawrence, 2013). A parallel line of research studies probabilistic regression in the multi-task setting. Here, the goal is to formulate models which are data-efficient on an unseen target task by training them on data from a set of related source tasks. Bardenet et al. (2013), Yogatama and Mann (2014), and Golovin et al. (2017) study multi-task formulations of GP-based models.
More general approaches of this kind employ the meta-learning framework (Schmidhuber, 1987; Thrun and Pratt, 1998; Vilalta and Drissi, 2005), where a model's training procedure is formulated in a way which incentivizes it to learn how to solve unseen tasks rapidly with only a few context examples ("learning to learn", "few-shot learning" (Fei-Fei et al., 2006; Lake et al., 2011)). A range of such methods trains a meta-learner to learn how to adjust the parameters of the learner's model (Bengio et al., 1991; Schmidhuber, 1992), an approach which has recently been applied to few-shot image classification (Ravi and Larochelle, 2017), or to learning data-efficient optimization algorithms (Hochreiter et al., 2001; Li and Malik, 2016; Andrychowicz et al., 2016; Chen et al., 2017; Perrone et al., 2018; Volpp et al., 2019). Other branches of meta-learning research aim to learn similarity metrics to determine the relevance of context samples for the target task (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2017), or explore the application of memory-augmented neural networks for meta-learning (Santoro et al., 2016). Finn et al. (2017) propose model-agnostic meta-learning (MAML), a general framework for fast parameter adaptation in gradient-based learning methods. A successful formulation of probabilistic regression as a few-shot learning problem in a multi-task setting is enabled by recent advances in the area of probabilistic meta-learning methods which allow a quantitative treatment of the uncertainty arising due to task ambiguity, a feature particularly

