PROBABILISTIC NUMERIC CONVOLUTIONAL NEURAL NETWORKS

Abstract

Continuous input signals like images and time series that are irregularly sampled or have missing values are challenging for existing deep learning methods. Coherently defined feature representations must depend on the values in unobserved regions of the input. Drawing from the work in probabilistic numerics, we propose Probabilistic Numeric Convolutional Neural Networks which represent features as Gaussian processes (GPs), providing a probabilistic description of discretization error. We then define a convolutional layer as the evolution of a PDE defined on this GP, followed by a nonlinearity. This approach also naturally admits steerable equivariant convolutions under e.g. the rotation group. In experiments we show that our approach yields a 3× reduction of error from the previous state of the art on the SuperPixel-MNIST dataset and competitive performance on the medical time series dataset PhysioNet2012.

1. INTRODUCTION

Standard convolutional neural networks are defined on a regular input grid. For continuous signals like time series and images, these elements correspond to regular samples of an underlying function f defined on a continuous domain. In this case, the standard convolutional layer of a neural network is a numerical approximation of a continuous convolution operator A. Coherently defined networks on continuous functions should depend only on the input function f, and not on spurious shortcut features (Geirhos et al., 2020) such as the sampling locations or sampling density, which enable overfitting and reduce robustness to changes in the sampling procedure. Each application of A in a standard neural network incurs some discretization error which is determined by the sampling resolution. In some sense, this error is unavoidable because the features f^(ℓ) at layer ℓ depend on the values of the input function f at regions that have not been observed. For input signals which are sampled at a low resolution, or even sampled irregularly such as with the sporadic measurements of patient vitals in ICUs or dispersed sensors for measuring ocean currents, this discretization error cannot be neglected. Simply filling in the missing data with zeros or imputing the values is not sufficient, since many different imputations are possible, each of which can affect the outcome of the network.

Probabilistic numerics is an emergent field that studies discretization errors in numerical algorithms using probability theory (Cockayne et al., 2019). Here we build upon these ideas to quantify the dependence of the network on the regions of the input which are unknown, and integrate this uncertainty into the computation of the network. To do so, we replace the discretely evaluated feature maps {f^(ℓ)(x_i)}_{i=1}^N with Gaussian processes: distributions over the continuous function f^(ℓ) that track the most likely values as well as the uncertainty.
On this Gaussian process feature representation, we need not resort to discretizing the convolution operator A as in a standard convnet, but instead we can apply the continuous convolution operator directly. If a given feature is a Gaussian process, then applying linear operators yields a new Gaussian process with transformed mean and covariance functions. The dependence of Af on regions of f which are not known translates into the uncertainty represented in the transformed covariance function, the analogue of the discretization error in a CNN, which is now tracked explicitly. We call the resulting model Probabilistic Numeric Convolutional Neural Network (PNCNN).

2. RELATED WORK

Over the years there have been many successful convolutional approaches for ungridded data such as GCN (Kipf and Welling, 2016), PointNet (Qi et al., 2017), Transformer (Vaswani et al., 2017), Deep Sets (Zaheer et al., 2017), SplineCNN (Fey et al., 2018), PCNN (Atzmon et al., 2018), PointConv (Wu et al., 2019), KPConv (Thomas et al., 2019) and many others (de Haan et al., 2020; Finzi et al., 2020; Schütt et al., 2017; Wang et al., 2018). However, the target domains of sets, graphs, and point clouds are intrinsically discrete, and for continuous data each of these methods fails to take full advantage of the assumption that the underlying signal is continuous. Furthermore, none of these approaches reason about the underlying signal probabilistically. In a separate line of work there are several approaches tackling irregularly spaced time series with RNNs (Che et al., 2018), Neural ODEs (Rubanova et al., 2019), imputation to a regular grid (Li and Marlin, 2016; Futoma et al., 2017; Shukla and Marlin, 2019; Fortuin et al., 2020), set functions (Horn et al., 2019) and attention (Shukla and Marlin, 2020). Additionally there are several works exploring reconstruction of images from incomplete observations for downstream classification (Huijben et al., 2019; Li and Marlin, 2020). Most similar to our method are the end-to-end Gaussian process adapter (Li and Marlin, 2016) and the multi-task Gaussian process RNN classifier (Futoma et al., 2017). In these two works, a Gaussian process is fit to an irregularly spaced time series and sampled imputations from this process are fed into a separate RNN classifier. Unlike our approach, where the classifier operates directly on a continuous and probabilistic signal, in these works the classifier operates on a deterministic signal on a regular grid and cannot reason probabilistically about discretization errors.
Finally, while superficially similar to Deep GPs (Damianou and Lawrence, 2013) or Deep Differential Gaussian Process Flows (Hegde et al., 2018), our PNCNNs tackle fundamentally different kinds of problems like image classification, and our GPs represent epistemic uncertainty over the values of the feature maps rather than the parameters of the network.

Probabilistic Numerics:

We draw inspiration for our approach from the community of probabilistic numerics, where the errors in numerical algorithms are modeled probabilistically, typically with a Gaussian process. In this framework, only a finite number of input function calls can be made, and therefore the numerical algorithm can be viewed as an autonomous agent which has epistemic uncertainty over the values of the input. A well-known example is Bayesian Monte Carlo, where a Gaussian process is used to model the error in the numerical estimation of an integral and to optimally select a rule for its computation (Minka, 2000; Rasmussen and Ghahramani, 2003). Probabilistic numerics has been applied widely to numerical problems such as the inversion of a matrix (Hennig, 2015), the solution of an ODE (Schober et al., 2019), a meshless solution to boundary value PDEs (Cockayne et al., 2016), and other numerical problems (Cockayne et al., 2019). To our knowledge, we are the first to construct a probabilistic numeric method for convolutional neural networks.

Gaussian Processes:

We are interested in operating on the continuous function f(x) underlying the input, but in practice we have access only to the values of that function sampled at a finite number of points {x_i}_{i=1}^N. Classical interpolation theory reconstructs f deterministically by assuming a certain structure of the signal in the frequency domain. Gaussian processes instead give a way of modeling our beliefs about values that have not been observed (Rasmussen et al., 2006), as reviewed in appendix A. These beliefs are encoded into a prior covariance k of the GP f ∼ GP(0, k) and updated upon seeing data with Bayesian inference. Explicitly, given a set of sampling locations x = {x_i}_{i=1}^N and noisy observations y = {y_i}_{i=1}^N with y_i ∼ N(f(x_i), σ_i²), Bayes' rule yields the posterior distribution f | y, x ∼ GP(µ_p, k_p), which captures our epistemic uncertainty about the values between observations. The posterior mean and covariance are

µ_p(x) = k(x)ᵀ[K + S]⁻¹ y,   k_p(x, x′) = k(x, x′) − k(x)ᵀ[K + S]⁻¹ k(x′),   (1)

where K_ij = k(x_i, x_j), k(x)_i = k(x, x_i), and S = diag(σ_i²). Below we choose the RBF kernel as prior covariance, due to its convenient analytical properties:

k_RBF(x, x′) = a N(x; x′, l²I) = a (2πl²)^(−d/2) exp(−‖x − x′‖² / 2l²).

In typical applications of GPs to machine learning tasks such as regression, the function f that we want to predict is already the regression model. In contrast, here we use GPs as a way of representing our beliefs and epistemic uncertainty about the values of both the input function and the intermediate feature maps of a neural network.
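As a concrete illustration, the posterior formulas of equation 1 can be sketched in a few lines of NumPy. This is a hypothetical 1d example with assumed hyperparameters a and l, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(x1, x2, a=1.0, l=0.3):
    """1d RBF kernel in the paper's normalization: k(x, x') = a * N(x; x', l^2)."""
    d = x1[:, None] - x2[None, :]
    return a * np.exp(-0.5 * d**2 / l**2) / np.sqrt(2 * np.pi * l**2)

def gp_posterior(x_train, y, sigma2, x_test, a=1.0, l=0.3):
    """Posterior mean and covariance of equation 1 with heteroscedastic
    observation noise S = diag(sigma_i^2)."""
    K = rbf_kernel(x_train, x_train, a, l) + np.diag(sigma2)   # K + S
    k_star = rbf_kernel(x_test, x_train, a, l)                 # rows are k(x)^T
    mu = k_star @ np.linalg.solve(K, y)
    cov = rbf_kernel(x_test, x_test, a, l) - k_star @ np.linalg.solve(K, k_star.T)
    return mu, cov
```

With near-zero noise the posterior mean interpolates the observations and the posterior variance collapses at the observed points, which is the behavior the network relies on between layers.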

4.1. OVERVIEW

Given an input signal f : X → R^c, we define a network with layers that act directly on this continuous input signal. We define our neural network recursively from the input f^(0) = f, as a series of L continuous convolutions A^(ℓ) with pointwise ReLU nonlinearities and weight matrices M^(ℓ) which mix only channels (known as 1 × 1 convolutions):

f^(ℓ+1) = M^(ℓ) ReLU[A^(ℓ) f^(ℓ)],   (2)

and a final global average pooling layer P which acts channel-wise as the natural generalization of the discrete case: P(f^(L))_α = ∫ f^(L)_α(x) dx for each α = 1, 2, ..., c. Denoting the space of functions on X with c channels by H_c, the convolution operators A^(ℓ) are linear operators from H_{c_ℓ} to H_{c_{ℓ+1}}. Like in ordinary convolutional neural networks, the layers build up increasingly expressive spatial features and depend on the parameters in A^(ℓ) and M^(ℓ). Unlike ordinary convolutional networks, these layers are well-defined operations on the underlying continuous signal. While it is clear that such a network can be defined abstractly, the exact values of the function f^(L) cannot be computed, as the operators depend on unknown values of the input. However, by adopting a probabilistic description we can formulate our ignorance of f^(0) with a Gaussian process and see how the uncertainties propagate through the layers of the network, yielding a probabilistic output. Before delving into details, we outline the key components of equation 2 that make this possible.

Continuous Convolutional Layers: Crucially, we consider continuous convolution operators A that can be applied to an input Gaussian process f ∼ GP(µ_p, k_p) in closed form.
The output is another Gaussian process with a transformed mean and covariance, Af ∼ GP(Aµ_p, A k_p A†), where A† acts to the left on the primed argument of k_p(x, x′). In section 4.2 we show how to parametrize these continuous convolutions in terms of the flow of a PDE and show how they can be applied to the RBF kernel exactly in closed form.

Probabilistic ReLUs: Applying the ReLU nonlinearity to the GP yields a new non-Gaussian stochastic process h^(ℓ) = ReLU[A^(ℓ) f^(ℓ)], and we show in section 4.5 that the mean and covariance of this process have a closed form which can be computed.

Channel Mixing and Central Limit Theorem: The activations h^(ℓ) are not Gaussian; however, for a large number of weakly dependent channels we argue that f^(ℓ+1) = M^(ℓ) h^(ℓ) is approximately distributed as a Gaussian process in section 4.5.

Measurement and Projection to RBF Gaussian Process: While f^(ℓ+1) is approximately a Gaussian process, its mean and covariance functions have a complicated form. Instead of using these functions directly, we take measurements of the mean and variance of this process and feed them in as noisy observations to a fresh RBF kernel GP, allowing us to repeat the process and build up multiple layers without increasing complexity. The Gaussian process feature maps in the final layer f^(L) are aggregated spatially by the integral pooling P, which can also be applied in closed form (see appendix D), to yield a Gaussian output. Assembling these components, we implement the end-to-end trainable Probabilistic Numeric Convolutional Neural Network, which integrates a probabilistic description of missing data and discretization error inherent to continuous signals. The layers of the network are shown in figure 1.

4.2. CONTINUOUS CONVOLUTIONAL LAYERS

On a discrete domain such as the lattice X = Z^d, all translation equivariant linear operators A are convolutions, a fact which we review in appendix B. In general, these convolutions can be written in terms of a linear combination of powers of the generators of the translation group: the shift operators τ_i, i = 1, ..., d, which shift all elements by one unit along the i-th axis of the grid. For a one dimensional grid, one can always write A = Σ_k W_k τ^k, where the weight matrices W_k ∈ R^{c×c} act only on the channels and the shift operator τ acts on functions on the lattice. In d dimensions, A = Σ_{k_1,...,k_d} W_{k_1,...,k_d} τ_1^{k_1} ··· τ_d^{k_d} for some set of integer coefficients k_1, ..., k_d. For example when d = 2, we can take k_1, k_2 ∈ {−1, 0, 1} to fill out a 3 × 3 neighborhood.

On the continuous domain X = R^d we similarly parametrize convolutions with A = Σ_k W_k e^{D_k}, where D_k is given by powers of the partial derivatives ∂_i, i = 1, ..., d, which generate infinitesimal translations along the i-th axes. Setting d = 1 for simplicity, we can indeed verify that the operator exponential τ_a = e^{a∂_x} applied to a function g(x) is a translation:

e^{a∂_x} g(x) = g(x) + a g′(x) + (a²/2) g″(x) + ··· = g(x + a),

which is the Taylor series expansion of g(x + a) around x. Exponentials of operators can be defined similarly in terms of the formal Taylor series e^D = Σ_{k=0}^∞ D^k/k!, or more broadly as the solution to the PDE

∂_t g(t, x) = (D g)(t, x),   g(0, x) = g(x)   (3)

at time t = 1: e^D g(x) = g(t = 1, x). Following the discussion in the discrete case, translation invariance of D_k imposes that it is expressed in terms of powers of the generators. Collecting the derivatives into the gradient ∇, we can write the general form of D_k as α_k + β_kᵀ∇ + (1/2)∇ᵀΣ_k∇ + ... for any constants α_k, vectors β_k, matrices Σ_k, etc.
For simplicity, we truncate the series at second order to get D_k = β_kᵀ∇ + (1/2)∇ᵀΣ_k∇, where we omit the constants α_k, which can be absorbed into the definition of W_k. For this choice of D, the PDE in equation 3 is nothing but the diffusion equation with drift β_k and diffusion Σ_k. When discussing rotational equivariance in section 4.4, we also consider a more general form of D.

The diffusion layer can also be viewed in another way, as the infinitesimal generator of an Ito diffusion (a stochastic process). Given an Ito process with constant drift and diffusion, dX_t = β dt + Σ^{1/2} dB_t, where B_t is a d dimensional Brownian motion, the time evolution operator can be written via the Feynman-Kac formula as e^{tD} f(x) = E[f(X_t)] with X_0 = x. In other words, the operator layer A = e^{tD} is the expectation under a parametrized Neural Stochastic Differential Equation (Li et al., 2020; Tzen and Raginsky, 2019) that is homogeneous and therefore shift invariant. The flow of this SDE depends on the drift and diffusion parameters β and Σ.

To recap, we define our convolution operator through the general form A = Σ_k W_k e^{D_k}, where the weight matrices W_k ∈ R^{c×c} mix only channels and e^{D_k} is the forward evolution by one unit of time of the diffusion equation with drift β_k and diffusion Σ_k, with learnable parameters {(W_k, β_k, Σ_k)}_{k=1}^K. The translation equivariance of A follows directly from the fact that the generators commute, ∀k, i : [D_k, ∇_i] = 0, and therefore [A, τ_i] = 0 (the bracket [a, b] = ab − ba is the commutator of the two operators). In appendix B we show that our definition of A reduces to the usual one in the discrete case and is thus a principled generalization to the continuous domain.
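The identity e^{a∂_x} g(x) = g(x + a) can be checked directly on polynomials, for which the Taylor series truncates exactly. A small illustrative sketch (the polynomial g and shift a are arbitrary choices, not from the paper):

```python
import numpy as np
from math import factorial

def shift_by_exponential(p, a, x):
    """Apply the operator exponential e^{a d/dx} to a polynomial p at point x
    by summing the (finite) Taylor series sum_k a^k p^{(k)}(x) / k!."""
    total = p(x)
    q = p
    for k in range(1, p.degree() + 1):
        q = q.deriv()                       # k-th derivative of p
        total += a**k * q(x) / factorial(k)
    return total

# g(x) = 1 - 2x + 0.5x^2 + 3x^3; the series reproduces the translated value g(x + a)
g = np.polynomial.Polynomial([1.0, -2.0, 0.5, 3.0])
```

For a degree-n polynomial the series has n + 1 terms, so the equality is exact rather than approximate.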

4.3. EXACT APPLICATION ON RBF GPS

Although the application of the linear operator A = Σ_k W_k e^{D_k} involves the time evolution of a PDE, owing to properties of the RBF kernel we can fortuitously apply the operator to an input GP in closed form. Gaussian processes are closed under linear transformations: given f ∼ GP(µ_p, k_p), we need only compute the action of A on the mean and covariance, Af ∼ GP(Aµ_p, A k_p A†), where A† is the adjoint w.r.t. the L²(X) inner product. The application of the time evolution e^{D_k} is a convolution with a Green's function G_k, so Af = Σ_k W_k e^{D_k} f = Σ_k W_k G_k * f. As we derive in appendix C, the Green's function for D_k = β_kᵀ∇ + (1/2)∇ᵀΣ_k∇ is nothing but the multivariate Gaussian density G_k(x) = N(x; −β_k, Σ_k):

Af = Σ_k W_k e^{D_k} f = Σ_k W_k G_k * f = Σ_k W_k N(−β_k, Σ_k) * f.

In order to apply e^{tD} to the posterior GP, we need only be able to apply the operator to the posterior mean and covariance. The posterior mean and covariance in equation 1 are expressed in terms of k_RBF = a N(x; x′, l²I), and the computation boils down to a convolution of two Gaussians:

e^{tD} k_RBF(x, x′) = N(·; −tβ, tΣ) * a N(x; x′, l²I) = a N(x; x′ − tβ, l²I + tΣ),   (6)
e^{tD_1} k_RBF(x, x′) e^{tD_2}† = a N(x; x′ − t(β_1 − β_2), l²I + tΣ_1 + tΣ_2).   (7)

The application of the channel mixing matrices W_k and the summation is also straightforward through matrix multiplication for the mean and covariance. To summarize, because of the closed form action on the RBF kernel, the layer can be implemented efficiently and exactly, with no discretization or approximations. We note that with the Green's function above, the action of A encompasses the ordinary convolution operator on the 2d lattice as a special case. Given drifts β_k ∈ {−1, 0, 1}², k = 1, ..., 9, filling out the 9 elements of a 3 × 3 grid, and taking the diffusion Σ_k → 0, the Green's function becomes a Dirac delta, so that

Af(x) = Σ_k W_k δ(· − β_k) * f(x) = Σ_{i,j ∈ {−1,0,1}} W_{ij} f(x_1 − i, x_2 − j) = (W *_{Z²} f)(x).
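The Gaussian convolution identity in equation 6 is easy to verify numerically in 1d. A sketch with arbitrary assumed values of t, β, Σ, the lengthscale l, and the point x′:

```python
import numpy as np
from scipy.stats import norm

# assumed illustrative scalar parameters
t, beta, Sigma, l, xp = 1.0, 0.4, 0.2, 0.3, 0.5

xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def lhs(x):
    """Brute-force quadrature of (N(.; -t*beta, t*Sigma) * a N(.; x', l^2))(x), a = 1."""
    return np.sum(norm.pdf(x - xs, loc=-t * beta, scale=np.sqrt(t * Sigma)) *
                  norm.pdf(xs, loc=xp, scale=l)) * dx

def rhs(x):
    """Closed form of equation 6: N(x; x' - t*beta, l^2 + t*Sigma)."""
    return norm.pdf(x, loc=xp - t * beta, scale=np.sqrt(l**2 + t * Sigma))
```

Convolving two Gaussian densities adds their means and variances, which is exactly why the diffusion layer maps an RBF-kernel GP to another Gaussian-shaped covariance.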

4.4. GENERAL EQUIVARIANCE

The convolutional layers discussed so far are translation equivariant. We discuss how to extend the continuous linear operator layers to more general symmetries such as rotations. Feature fields in this more general case are described by tensor fields, where the symmetry group acts not only on the input space X but also on the vector space attached to each point x ∈ X . A linear layer A is equivariant if its action commutes with that of the symmetry. In appendix E we derive constraints for general linear operators and symmetries, which generalize those appearing in the steerable-CNN literature (Weiler and Cesa, 2019; Cohen et al., 2019) . Then we show how equivariance under continuous roto-translations in 2d constrains the form of a convolutional layer by solving the equivariance constraint. Non-trivial solutions require that the operator D in the PDE of equation 3 has a non-trivial matrix structure.

4.5. PROBABILISTIC NONLINEARITIES AND RECTIFIED GAUSSIAN PROCESSES

Gast and Roth (2018) derive the mean and variance of a univariate rectified Gaussian distribution for use in a neural network. We generalize these results to the full covariance function (and higher moments) of a rectified Gaussian process in appendix J and present the results here. For the input GP A^(ℓ) f^(ℓ)(x) ∼ GP(µ(x), k(x, x′)), we denote σ(x)² = k(x, x), Σ the matrix with components Σ_ij = k(x_i, x_j) for i, j = 1, 2, and µ = [µ(x_1), µ(x_2)]. We use the notation Φ(z) for the univariate standard normal CDF, φ(z) = Φ′(z) for its PDF, and Φ(z; Σ) for the (two dimensional) multivariate CDF of N(0, Σ) at z. Σ_1 and Σ_2 are the column vectors of Σ. The first and second moments of h = ReLU[Af] are:

E[h(x)] = µ(x) Φ(µ(x)/σ(x)) + σ(x) φ(µ(x)/σ(x)),   (8)
E[h(x_1) h(x_2)] = (k(x_1, x_2) + µ(x_1)µ(x_2)) Φ(µ; Σ) + (µ(x_1)Σ_2 + µ(x_2)Σ_1)ᵀ ∇Φ(µ; Σ) + Σ_1ᵀ ∇∇ᵀΦ(µ; Σ) Σ_2,   (9)

where ∇∇ᵀΦ denotes the Hessian of Φ with respect to the first argument. The first and higher order derivatives of the normal CDF are just the PDF and products of the PDF with Hermite polynomials. Note that the mean and covariance interact through the nonlinearity.
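A quick Monte Carlo sanity check of the univariate moments (equation 8 and the diagonal of equation 9, where the two-point formula reduces to E[h²] = (µ² + σ²)Φ(z) + µσφ(z)); the parameter values here are arbitrary:

```python
import numpy as np
from scipy.stats import norm

def relu_moments(mu, sigma):
    """First and second moments of h = max(X, 0) for X ~ N(mu, sigma^2):
    E[h]   = mu Φ(z) + sigma φ(z),                 z = mu / sigma
    E[h^2] = (mu^2 + sigma^2) Φ(z) + mu sigma φ(z)."""
    z = mu / sigma
    m1 = mu * norm.cdf(z) + sigma * norm.pdf(z)
    m2 = (mu**2 + sigma**2) * norm.cdf(z) + mu * sigma * norm.pdf(z)
    return m1, m2
```

Both moments are smooth in (µ, σ), so they can be backpropagated through when training the network end to end.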

4.6. CHANNEL MIXING AND CENTRAL LIMIT THEOREM

After the nonlinearity the process is no longer Gaussian. To overcome this issue we introduce a channel mixing matrix M^(ℓ) ∈ R^{c_{ℓ+1}×c_ℓ} and define the feature map in the following layer by f^(ℓ+1) = M^(ℓ) h^(ℓ), where h^(ℓ) = ReLU[A^(ℓ) f^(ℓ)]. So long as the channels of h^(ℓ) are only weakly dependent, we can apply the central limit theorem (CLT) to each function f^(ℓ+1)_α = Σ_{β=1}^{c_ℓ} M^(ℓ)_{αβ} h^(ℓ)_β, so that in the limit of large c_ℓ the statistics of the f^(ℓ+1)_α converge to those of a GP with first and second moments given by:

E[f^(ℓ+1)(x)] = M E[h^(ℓ)(x)],   E[f^(ℓ+1)(x) f^(ℓ+1)(x′)ᵀ] = M E[h^(ℓ)(x) h^(ℓ)(x′)ᵀ] Mᵀ.   (10)

The convergence to a Gaussian process here is reminiscent of the well-known infinite width limit of Bayesian neural networks (Neal, 1996; de G. Matthews et al., 2018; Yang, 2019). However, the setting here is fundamentally different. Unlike the Bayesian case, where the distribution of M is given by a prior or posterior, in the PNCNN M is a deterministic quantity and instead the uncertainty is about the input. The PNCNN is not a Bayesian method in the sense of representing uncertainty about the parameters of the model; it is Bayesian in representing and propagating the uncertainty in the value of the inputs. We go through the central limit theorem argument in appendix I, and go over some potential caveats, such as whether the weights converge to 0 or the independence assumptions are violated. In the following section, we evaluate the Gaussianity empirically.
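This Gaussianization under mixing can be illustrated empirically. A toy sketch with i.i.d. rectified-Gaussian channels and randomly drawn mixing weights with an assumed 1/√c scaling (both simplifying assumptions for illustration; in the network the channels are only weakly dependent, not independent):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 256                                    # number of channels being mixed
M = rng.normal(size=c) / np.sqrt(c)        # one row of a mixing matrix (assumed scaling)
h = np.maximum(rng.normal(size=(20_000, c)), 0)   # rectified-Gaussian channels (skewed)
f = h @ M                                  # mixed output channel

# Each channel is strongly right-skewed, but the mixture is close to Gaussian:
skew = np.mean((f - f.mean())**3) / f.std()**3

# and the first moment propagates linearly, E[f] = M E[h] with E[h_i] = φ(0):
mean_pred = M.sum() / np.sqrt(2 * np.pi)
```

The residual skewness shrinks as c grows, consistent with the CLT argument of appendix I.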

4.7. MEASUREMENT AND PROJECTION TO RBF GAUSSIAN PROCESS

As a last step we simplify the mean and covariance functions of the approximate GP f^(ℓ+1). While we can readily compute the values of these functions, unlike in the RBF kernel case we cannot apply the convolution operator e^{tD} to them in closed form. To circumvent this challenge, we model the (approximately) Gaussian process f^(ℓ+1) with an RBF Gaussian process as follows: we evaluate the mean y_i = E[f^(ℓ+1)(x_i)] and variance σ_i² = Var[f^(ℓ+1)(x_i)] of the approximate Gaussian process f^(ℓ+1) at a collection of points {x_i}_{i=1}^N using equations 8, 9 and 10. These values y_i are treated as measurements of the underlying signal with a heteroscedastic noise σ_i² that varies from point to point. We can then compute the RBF-based posterior GP of this signal, f^(ℓ+1) | {(x_i, y_i, σ_i)}_{i=1}^N ∼ GP(µ_p, k_p), with posterior mean and covariance given by equation 1 for the heteroscedastic noise model. The uncertainty in the input f^(ℓ) is propagated through to the RBF posterior via the measurement noise σ_i. Crucially, the mean and covariance functions of this Gaussian process are written in terms of the RBF kernel, and we can therefore continue applying convolutions in closed form in future layers. As we describe in the following section, the RBF kernel in each layer is trained to maximize the marginal likelihood of the data that it sees, and thereby minimize the discrepancy with the underlying generating distribution of f^(ℓ+1). While this measurement/projection approach is effective in many scenarios, in networks with many layers or a very large number of observations uncertainty information can get attenuated as it passes through the layers, a phenomenon which we investigate in appendix H. With a network that is trained on a version of MNIST that is randomly subsampled to 75 pixels, in figure 2 (left) we evaluate the mean and uncertainty of the internal feature maps as we vary the number of pixels of the inputs at test time.
As expected, the mean functions for the feature maps slowly converge and the predicted uncertainties decrease in magnitude as the input resolution is increased. In figure 2 (middle) we show that in early layers the uncertainties decrease at a rate similar to the O(1/√N) discretization error that we would expect from a standard convolutional layer discretized to a square grid (the same rate as would be achieved through Monte Carlo sampling). Despite the fact that these resolutions differ substantially from those seen at training time, and that there are no explicit uncertainty targets for these internal layers, the predictions are reasonably well calibrated, as demonstrated in figure 2 (right). While the prediction residuals have fatter tails than a standard Gaussian (as evidenced by the deviations from the line in the QQ plot), their mean and standard deviation are close to the theoretically optimal values of 0 and 1 across a range of resolutions.

4.8. TRAINING PROCEDURE

Our neural network has two sets of parameters: the channel mixing and diffusion parameters {(M^(ℓ), W^(ℓ), β^(ℓ), Σ^(ℓ))}_{ℓ=1}^L, and the kernel hyperparameters of the Gaussian processes {(l^(ℓ), a^(ℓ))}_{ℓ=1}^L. We train all parameters jointly on the loss L_task + λ L_GP, where L_task is the cross entropy with logits given by the mean µ_P of the pooled features P(f^(L)) ∼ N(µ_P, Σ_P), and L_GP is the sum of the (negative) marginal log likelihoods of the GP feature maps:

L_GP(f) = (1/2) Σ_{ℓ=1}^L Σ_{α=1}^{c_ℓ} [ f_αᵀ (K_XX + S_α)⁻¹ f_α + log det(K_XX + S_α) + N log 2π ]^(ℓ),

where for each layer ℓ, f_α = [f_α(x_1), ..., f_α(x_N)] ∈ R^N are the observed values for channel α at locations X = [x_1, ..., x_N], K_XX is the covariance of the RBF kernel, and S_α = diag(σ_α²) is the measurement noise for each channel α and spatial location. Notably, the GP marginal likelihood is independent of the class labels.
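One channel's term of L_GP above is the standard Gaussian negative marginal log likelihood, which can be sketched with a numerically stable Cholesky factorization (an illustrative helper, not the paper's implementation):

```python
import numpy as np

def gp_nll(f, K, sigma2):
    """One channel's term of L_GP:
    0.5 * (f^T (K+S)^{-1} f + log det(K+S) + N log 2pi),
    computed via a Cholesky factorization of K + S."""
    N = len(f)
    A = K + np.diag(sigma2)                            # K_XX + S_alpha
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # (K+S)^{-1} f
    logdet = 2.0 * np.sum(np.log(np.diag(L)))            # log det from Cholesky diag
    return 0.5 * (f @ alpha + logdet + N * np.log(2 * np.pi))
```

Minimizing this quantity with respect to the kernel hyperparameters (l, a) is what fits each layer's RBF GP to the measurements it receives.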

5. EXPERIMENTAL RESULTS

We evaluate the Probabilistic Numeric CNN on two different problems which have incomplete and irregular observations.

Superpixel MNIST is an adaptation of the MNIST dataset where the 784 pixels of the original images are replaced by 75 salient superpixels that are non-uniformly spread throughout the domain and are different for each image (Monti et al., 2017). Despite the simplicity of the underlying images, the lack of grid structure and the high fraction of missing values make this a challenging task. Example inputs are visualized at the left of figure 1. We compare to MoNet (Monti et al., 2017) and SplineCNN (Fey et al., 2018). We also conduct an ablation study where uncertainty propagation is removed: the probabilistic ReLU is replaced with the deterministic one applied to the mean, and the uncertainties in each layer are set to 0. This form of the network still makes use of the fact that the input function is continuous through the GP interpolation of the means, but crucially it does not integrate the uncertainty resulting from the missing data into the computation. While this variant (PNCNN w/o σ), with a 3.03% error rate, outperforms existing methods from the literature, it is substantially worse than the PNCNN that integrates the uncertainty, at 1.24% error. This validates both that the underlying architecture (using the continuous convolution operators) has good inductive biases and that reasoning about discretization errors probabilistically can directly improve performance. In figure 3 we evaluate zero shot generalization to a different test resolution for a variety of training resolutions. In order to compare to an ordinary CNN we sample MNIST on a regular square grid: the PNCNN with uncertainty is the most robust to this train-test sampling distribution shift, followed by the PNCNN w/o uncertainty, and finally the ordinary CNN, which is quite sensitive to these changes.
Irregularly Spaced Time Series: For the second task, we evaluate our model on the irregularly spaced time series dataset PhysioNet2012 (Silva et al., 2012) for predicting mortality from ICU vital signs. This dataset is particularly challenging because different vital sign channels are observed at different times, even within a single patient record. This means that we cannot compute the GP inference formula of equation 1 efficiently for all channels simultaneously, because the observation points {x_i}, and hence the matrices K in that formula, differ between the channels, increasing computational complexity. To circumvent this difficulty, we employ a stochastic diagonal estimator to compute the variances, as described in appendix G. We compare against IP-Nets (Shukla and Marlin, 2019), SEFT-ATTN (Horn et al., 2019), and GRU-D (Che et al., 2018) as reported in Horn et al. (2019). The PNCNN performs competitively, although without the breakout performance seen on the image dataset, which we attribute to the use of stochastic variance estimates in place of an exact calculation.

6. CONCLUSION

Based on the ideas of probabilistic numerics, we have introduced a new class of neural networks which model missing values and discretization errors probabilistically within the layers of a CNN. The layers of our network are a series of operators defined on continuous functions, removing dependence on shortcut features like the sampling locations and distribution. On irregularly sampled and incomplete spatial data we show improved generalization and robustness. We believe that generative modeling of the internal feature maps of a neural network for representing uncertainty from incomplete observations is an underexplored research direction and has significant promise for other modalities of data, which we leave to future work. 

A REVIEW OF GAUSSIAN PROCESSES

We briefly review here the main ideas of Gaussian processes for machine learning; see Stein (2012) and Rasmussen et al. (2006) for more details. We start by explaining how to use stochastic processes for Bayesian inference. We view the stochastic process as a prior over functions p(f), and as we are given samples x = (x_1, ..., x_N), y = (y_1, ..., y_N), y_i ≡ f(x_i), we update our beliefs about the function by constructing the posterior via Bayes' rule: p(f | y, x) = p(y | f, x) p(f) / p(y | x). Here we need to specify the likelihood of the data under our model, p(y | f, x) = ∏_{i=1}^N p(y_i | f, x_i), and the denominator, called the evidence or marginal likelihood, follows: p(y | x) = E_{f∼p(f)}[p(y | f, x)]. The power of this approach is that the value of the signal y* at an unseen point x* has an uncertainty which depends on our knowledge of its neighbourhood, p(y* | x*, y, x) = E_{f∼p(f|y,x)}[p(y* | f, x*)], and this allows us to reason probabilistically about the underlying signal.

A particularly convenient class of random functions is Gaussian processes (GPs), for which inference can be done exactly. A stochastic process can be presented in terms of the finite dimensional distributions of the random variables {f(x_i)}_{i=1}^M at points {x_i}_{i=1}^M. For a GP these distributions are Gaussian and can be defined uniquely by specifying means and covariances, and so a GP is specified entirely by its mean function µ(x) and covariance kernel k(x, x′). We write f ∼ GP(µ, k). Let us assume a Gaussian likelihood model as well, i.e. p(y_i | f, x_i) = N(f(x_i), σ_i²), where σ_i represents aleatoric uncertainty on the measurement. (For simplicity we take the function to be scalar valued, but the reasoning is easily generalized.) Then properties of the Gaussian distribution (see Rasmussen et al. (2006, Chap. 2) for a detailed derivation of the formulas) lead to the following posterior distribution after seeing data y, x: p(f | y, x) = GP(µ_p, k_p), with

µ_p(x) = k(x)ᵀ[K + S]⁻¹ y,   k_p(x, x′) = k(x, x′) − k(x)ᵀ[K + S]⁻¹ k(x′),

where K_ij = k(x_i, x_j), k(x)_i = k(x, x_i), and S = diag(σ_i²).

B FROM DISCRETE TO CONTINUOUS CONVOLUTIONAL LAYERS

We here show that the general formula A = k W k e D k with D k a function of spatial derivatives, reduces in the case of discrete input space X to the usual convolution we encounter in deep learning. For simplicity we shall assume a 1d grid as input space X = {1, . . . , N }. Let us start by recalling the form of the classical discrete convolution when C = C +1 = 1. We define a convolutional layer as a linear map that commutes with the translation operator. To make the symmetry exact, we need assume periodic boundaries. Then in the standard basis of R N , {e i } N i=1 of vectors localized at site i, the translation operator τ acts as τ e i = e i+1 mod N . An N × N matrix B is translation invariant iff τ B = Bτ . Since τ is diagonal in Fourier space, the most general solution is B = F -1 diag( b)F , where F jk = e -2πi N jk is the discrete Fourier transform. Such matrices are called circulant and can be written alternatively as B = N -1 i=0 b N -i τ i , b = F -1b . Explicitly: B =        b 0 b 1 . . . b N -2 b N -1 b N -1 b 0 b 1 b N -2 . . . b N -1 b 0 . . . . . . b 2 . . . . . . b 1 b 1 b 2 . . . b N -1 b 0        . ( ) This shows that the most general convolutional layer is a circulant matrix. E.g. if b i = 0 unless i = 0, 1, N -1, B coincides with the matrix representing a periodic convolution of filter size 3. The matrix B is invertible as long as bk = 0 for all k. In a convolutional network the parameters b i are random variables and the measure of the set where B is not invertible is zero. Thus the role of e D is replaced in the discrete case by the B. The discrete analog of A is then: A = i W i ⊗ B i . Introducing the unit matrices E α,β which have 1 at the row α and column β and 0 otherwise, we can rewrite it as: A = j,α,β E α,β ⊗ τ j W α,β j , W α,β j = i W α,β i b i,N -j . 
Since the $E^{\alpha\beta} \otimes \tau^j$ form a linear basis of the space of convolutional layers, we see that equation 13 indeed reduces to the usual convolution when the input domain is discretized, and is thus a principled generalization to the continuous domain.
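As a quick numerical check of the claims above, namely that a circulant matrix commutes with the cyclic translation $\tau$ and that applying it is a periodic convolution diagonalized by the DFT, consider the following NumPy sketch (our own illustration, not code from the paper):

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
b = rng.normal(size=N)

# Circulant matrix with cyclically shifted rows: B_{jk} = b_{(k - j) mod N},
# so row 0 is (b_0, b_1, ..., b_{N-1}) as in the displayed matrix
B = np.array([[b[(kk - j) % N] for kk in range(N)] for j in range(N)])

# 1) B commutes with the cyclic translation tau (tau e_i = e_{i+1 mod N})
tau = np.roll(np.eye(N), 1, axis=0)
assert np.allclose(tau @ B, B @ tau)

# 2) Applying B is a periodic convolution, diagonal in Fourier space
x = rng.normal(size=N)
y_fft = np.fft.ifft(np.fft.fft(B[:, 0]) * np.fft.fft(x)).real
assert np.allclose(B @ x, y_fft)
```

Both assertions pass: translation equivariance and the Fourier diagonalization are exactly the two properties used in the derivation.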

C GREENS FUNCTION

Given the operator $D = \beta^\top\nabla + \frac{1}{2}\nabla^\top\Sigma\nabla$ we can compute the action of $e^{tD}$ in terms of convolutions. Using the $d$-dimensional Fourier transform $\mathcal{F}[h](k) = (2\pi)^{-d/2}\int h(x)e^{-ik^\top x}\,dx$ and $\mathcal{F}^{-1} = \mathcal{F}^\dagger$, we can rewrite the derivative operator $D$ in terms of elementwise multiplication in the Fourier domain, which diagonalizes $D$. Since $\nabla = \mathcal{F}^{-1}(ik)\mathcal{F}$,
$$D = \mathcal{F}^{-1}\big(i\beta^\top k - \tfrac{1}{2}k^\top\Sigma k\big)\mathcal{F}.$$
Using the series definition $e^{tD} = \sum_{n=0}^\infty (tD)^n/n!$, we have $e^{tD} = \mathcal{F}^{-1}e^{t(i\beta^\top k - \frac{1}{2}k^\top\Sigma k)}\mathcal{F}$. Applying this operator to a test function $h(x)$ yields
$$e^{tD}h = \mathcal{F}^{-1}\big[e^{t(i\beta^\top k - \frac{1}{2}k^\top\Sigma k)}\,\mathcal{F}[h](k)\big] = \mathcal{F}^{-1}\big[\mathcal{F}[G_t]\cdot\mathcal{F}[h]\big] = G_t * h,$$
where the final step follows from the Fourier convolution theorem, and we define the function $G_t = \mathcal{F}^{-1}[e^{t(i\beta^\top k - \frac{1}{2}k^\top\Sigma k)}]$. Directly applying the Fourier integral yields a Gaussian integral
$$G_t(x) = (2\pi)^{-d/2}\int e^{ik^\top(x + t\beta) - \frac{1}{2}k^\top t\Sigma k}\,dk = e^{-\frac{1}{2}(x + t\beta)^\top(t\Sigma)^{-1}(x + t\beta)}\det(2\pi t\Sigma)^{-1/2}.$$
This function $G_t(x) = \mathcal{N}(x; -t\beta, t\Sigma)$ is nothing but a multivariate heat kernel, the Green's function (also known as the fundamental solution or time propagator) for the diffusion equation $\partial_t G_t(x - x') = DG_t(x - x')$, and indeed $\lim_{t\to 0} G_t(x - x') = \delta(x - x')$.
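The identity $e^{tD}h = G_t * h$ can be verified numerically. The sketch below (our own, in 1d) applies $e^{tD}$ spectrally to a Gaussian test function and compares against the closed-form answer obtained by convolving with the heat kernel $G_t = \mathcal{N}(-t\beta, t\sigma^2)$:

```python
import numpy as np

L, n = 40.0, 2048
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
beta, sigma2, t = 1.5, 0.8, 2.0                   # drift, diffusion, time

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

h = gauss(x, 0.0, 1.0)                            # Gaussian test function

# Apply e^{tD} in the Fourier domain: multiply by exp(t(i beta k - sigma2 k^2 / 2))
H = np.fft.fft(h) * np.exp(t * (1j * beta * k - 0.5 * sigma2 * k**2))
h_t = np.fft.ifft(H).real

# G_t = N(x; -t beta, t sigma2), so G_t * N(0, 1) = N(-t beta, 1 + t sigma2)
h_exact = gauss(x, -t * beta, 1.0 + t * sigma2)
assert np.max(np.abs(h_t - h_exact)) < 1e-6
```

The spectral application and the analytic Gaussian convolution agree to near machine precision, illustrating that $e^{tD}$ translates by $-t\beta$ and broadens the variance by $t\sigma^2$.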

D INTEGRAL POOLING

The integral pooling operator $P[f] = \int_{\mathbb{R}^d} f(x)\,dx$ can be applied to the Gaussian process just like any other linear operator. Given $f^{(L)} \sim \mathcal{GP}(\mu, k)$, we have that $Pf^{(L)} \sim \mathcal{GP}(P\mu, PkP^\top) = \mathcal{N}(P\mu, PkP^\top)$. Again, to compute the mean $\mu_P = P\mu$ and covariance matrix $\Sigma_P = PkP^\top$ we just need to be able to apply $P$ to the RBF kernel:
$$Pk_{\mathrm{RBF}}(x') = \int_{\mathbb{R}^d} k_{\mathrm{RBF}}(x, x')\,dx = a, \qquad Pk_{\mathrm{RBF}}P^\top = \int_{\mathbb{R}^d\times\mathbb{R}^d} k_{\mathrm{RBF}}(x, x')\,dx\,dx' = \infty. \quad (23)$$
For many applications such as image classification using the mean logit value, we require only the predictive mean, so an unbounded covariance matrix $\Sigma_P$ is acceptable. We use this form for all of our experiments. However, for some applications an output uncertainty can be useful, so we also provide a variant that integrates over a finite region $[0,1]^d$, $Pf = \int_{[0,1]^d} f(x)\,dx$:
$$Pk_{\mathrm{RBF}}(x') = \int_{[0,1]^d} k_{\mathrm{RBF}}(x, x')\,dx = a\prod_{i=1}^d\Big[\Phi\big(\tfrac{x'_i}{\ell}\big) - \Phi\big(\tfrac{x'_i - 1}{\ell}\big)\Big] \quad (24)$$
$$Pk_{\mathrm{RBF}}P^\top = \int_{[0,1]^d\times[0,1]^d} k_{\mathrm{RBF}}(x, x')\,dx\,dx' = a\Big[\ell\sqrt{\tfrac{2}{\pi}}\big(e^{-1/2\ell^2} - 1\big) + 2\Phi\big(\tfrac{1}{\ell}\big) - 1\Big]^d \quad (25)$$
where $\Phi$ is again the univariate standard normal CDF.
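The closed-form pooled kernel can be checked against numerical quadrature. Below is a 1d sketch with lengthscale $\ell = 1$ and scale $a = \sqrt{2\pi}$ (our own illustration, not code from the paper):

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(z):  # univariate standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pooled_kernel(xp):
    """P k_RBF(x') = int_0^1 exp(-(x - x')^2 / 2) dx
    = sqrt(2 pi) * (Phi(x') - Phi(x' - 1)),  lengthscale 1, a = sqrt(2 pi)."""
    return sqrt(2.0 * pi) * (Phi(xp) - Phi(xp - 1.0))

# Compare against midpoint-rule quadrature over [0, 1]
xs = (np.arange(20000) + 0.5) / 20000
for xp in [0.3, 0.9, 2.0, -1.0]:
    quad = np.mean(np.exp(-0.5 * (xs - xp) ** 2))
    assert abs(quad - pooled_kernel(xp)) < 1e-6
```

The quadrature and the CDF-difference expression agree for query points inside and outside the pooling region.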

E EQUIVARIANCE E.1 RELATED WORK

We note that there has been considerable research effort in the development of equivariant CNNs, which we build upon. The group equivariant CNN was introduced by Cohen and Welling (2016a) for discrete groups on lattices. This work has been extended to continuous groups (Worrall et al., 2017; Zhou et al., 2017) and to steerable equivariance (Cohen and Welling, 2016b; Weiler and Cesa, 2019), where other group representations are used. There have also been group equivariant networks designed for point clouds and other irregularly spaced data (Thomas et al., 2018; Finzi et al., 2020; Fuchs et al., 2020; de Haan et al., 2020). In Shen et al. (2020), layers using finite difference estimates of derivative operators are used to define equivariant layers in an equivariant CNN, effectively a change of basis. Most closely related to our PDE operator approach to equivariance is the work by Smets et al. (2020). In this work, the authors define the layers of their convolutional network through the time evolution of a PDE which is a nonlinear generalization of the diffusion equation and which additionally includes pooling-like behaviour. Smets et al. (2020) parametrize their equivariant PDEs in terms of left invariant vector fields. These PDEs are applied to scalar functions on the group, unlike our work where feature fields (of various representations) live on the base space $\mathbb{R}^n$. These two approaches reflect the two major approaches to equivariance in the literature: using regular representations and using irreducible/tensor representations. Both have advantages and disadvantages, and we focus on the second approach in this work. There may be a deeper relationship between the solutions we find and the left invariant vector fields in Smets et al. (2020), but this is beyond the scope of our work.

E.2 TRANSLATION EQUIVARIANCE

A key factor in the generalization of convolutional neural networks is their translation equivariance. Patterns in different parts of an input signal can be seen in the same way because convolution is translation equivariant. Our learnable linear operators $A$ are equivariant to continuous translations. Two linear operators $e^C$ and $e^B$ commute, $[e^C, e^B] = 0$, if their generators commute: $[C, B] = 0$. Since the generator of the diffusions $D_k$ is a sum of derivative operators, and the generator of translations is just $\nabla$ as mentioned in section 4.2, the two commute: $[D_k, \nabla] = 0$. Therefore
$$[A, \tau_a] = \Big[\sum_k W_k e^{D_k}, \tau_a\Big] = \sum_k W_k\big[e^{D_k}, e^{a^\top\nabla}\big] = 0,$$
and $A$ is translation equivariant.
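This commutation can be checked directly on a discretized periodic domain, where translation is a cyclic shift and the diffusion operator is applied spectrally. A minimal NumPy sketch (our own, not code from the paper):

```python
import numpy as np

n = 256
x = np.arange(n) / n                              # periodic domain [0, 1)
k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)      # wavenumbers

def apply_diffusion(f, beta=0.2, sigma2=1e-3, t=1.0):
    """Spectral application of e^{tD}, D = beta d/dx + (sigma2/2) d^2/dx^2."""
    mult = np.exp(t * (1j * beta * k - 0.5 * sigma2 * k**2))
    return np.fft.ifft(mult * np.fft.fft(f)).real

f = np.exp(np.sin(2 * np.pi * x))                 # arbitrary smooth periodic signal
shift = 17
lhs = apply_diffusion(np.roll(f, shift))          # translate, then diffuse
rhs = np.roll(apply_diffusion(f), shift)          # diffuse, then translate
assert np.max(np.abs(lhs - rhs)) < 1e-10
```

Translating then diffusing agrees with diffusing then translating to machine precision, as guaranteed by $[D_k, \nabla] = 0$.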

E.3 STEERABLE EQUIVARIANCE FOR LINEAR OPERATORS

For some tasks like medical segmentation, aerial imaging, and chemical property prediction, there are additional symmetries in the data that it makes sense to exploit beyond mere translation equivariance. Below we show how to enforce equivariance of the linear operator $A$ to other symmetry groups $G$, such as the group of continuous rotations $SO(d)$ in $\mathbb{R}^d$. Applying equivariance constraints separately on each of the components of $e^{D_i}$, on top of translation equivariance, yields a very restricted set of operators. For example, enforcing equivariance to continuous rotations $G = SO(d)$, the operator must be an isotropic heat kernel: $D_k = c_k\nabla^\top\nabla$. The reason for this apparent restriction is that the different channels are considered independently, as scalar fields. The alternative is to use feature fields which transform under more general representations of the symmetry group, introduced in steerable CNNs (Cohen and Welling, 2016b) and used in Worrall et al. (2017); Thomas et al. (2018); Weiler et al. (2018); Weiler and Cesa (2019), among others. In this way, the symmetry transformation acts not only on the spatial domain $\mathcal{X}$, but also transforms the channels. The way the group acts on $\mathbb{R}^c$ (i.e. the channels) is formalized by a representation matrix $\rho(g) \in \mathbb{R}^{c\times c}$ for each element $g \in G$ of the transformation group, satisfying $\forall g, h \in G: \rho(gh) = \rho(g)\rho(h)$. Choosing the type of each intermediate feature map is equivalent to choosing their representations, and we describe a simple way of doing this with tensor representations in a later section. Operator Equivariance Constraint: Returning to linear operators, we derive the equivariance constraint and show how to use the constructs from the previous sections to implement steerable rotation equivariance.
Equivariance of a linear operator $A: (\mathbb{R}^d \to \mathbb{R}^{c_{\mathrm{in}}}) \to (\mathbb{R}^d \to \mathbb{R}^{c_{\mathrm{out}}})$ requires that, for any input function, transforming the input function first (both argument and channels) and then applying $A$ is equivalent to first applying $A$ and then transforming the output: $A\rho_{\mathrm{in}}(g)L_g f = \rho_{\mathrm{out}}(g)L_g Af$, where $L_g f(x) = f(g^{-1}x)$. Rearranging the terms, one sees that the equivariance constraint on the linear operator $A$ is
$$\rho_{\mathrm{out}}(g)\,L_g\,A\,L_{g^{-1}}\,\rho_{\mathrm{in}}(g^{-1}) = A, \quad (26)$$
where the operators $L_g$ and $L_{g^{-1}}$ are understood not to act on the representation matrices $\rho$ (although these are implicitly functions of $g$). As shown in Appendix E.4, eq. 26 is a direct generalization of the equivariance constraint for convolutions, $\forall x: \rho_{\mathrm{out}}(g)K(g^{-1}x)\rho_{\mathrm{in}}(g^{-1}) = K(x)$, described in the literature (Weiler and Cesa, 2019; Cohen et al., 2019). As shown in Appendix E.6, the equivariance constraint for continuous rotations applied to the diffusion operators $A = \sum_k W_k e^{D_k}$ has only the trivial solutions of isotropic diffusion without any drift. For this reason we instead consider a more general form of diffusion operator in which the PDE itself couples the different channels. For the coupled PDE $\frac{\partial f}{\partial t} = \sum_k W_k D_k f$, the time evolution contains the matrices $W_k$ inside the exponential: $A = e^{\sum_k W_k D_k}$. As with the example of translation above, this operator is equivariant if and only if the infinitesimal generator $\sum_k W_k D_k$ is equivariant. Because equation 26 applies generally to linear operators and not just convolutions, we can compute the equivariance constraint for these derivative operators. We can simplify the summation $\sum_k W_k D_k = \sum_k W_k\big(\beta_k^\top\nabla + \frac{1}{2}\nabla^\top\Sigma_k\nabla\big)$ by writing it in terms of the collections of matrices $B_i = \sum_k W_k\beta_{ki}$ and $S_{ij} = \frac{1}{2}\sum_k W_k\Sigma_{kij}$, to express
$$A_{\mathrm{deriv}} = \sum_i B_i\partial_i + \sum_{i,j} S_{ij}\partial_i\partial_j,$$
where the indices $i, j = 1, 2, \dots, d$ enumerate the spatial dimensions of each vector $\beta_k$ and each matrix $\Sigma_k$.
As we derive in appendix E.5, the necessary and sufficient conditions for the equivariance of $\sum_k W_k D_k$, and therefore of $A$, are
$$\forall g \in G:\ [\rho_{\mathrm{out}}\otimes\rho^*_{\mathrm{in}}\otimes\rho^{(1,0)}](g)\,\mathrm{vec}(B) = \mathrm{vec}(B) \quad\text{and}\quad \forall g \in G:\ [\rho_{\mathrm{out}}\otimes\rho^*_{\mathrm{in}}\otimes\rho^{(2,0)}](g)\,\mathrm{vec}(S) = \mathrm{vec}(S),$$
where $\mathrm{vec}(\cdot)$ denotes flattening the elements into a single vector and $\rho^{(r,s)}$ is the tensor representation with $r$ covariant and $s$ contravariant indices.
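For intuition, the constraints on $B$ and $S$ can be probed numerically in the simplest case of scalar input and output channels ($\rho_{\mathrm{in}} = \rho_{\mathrm{out}} = 1$) and $G = SO(2)$, where they reduce to $gB = B$ and $gSg^\top = S$ for all rotations $g$. Projecting arbitrary $B$ and $S$ onto the invariant subspace by group averaging recovers the claim of appendix E.6: no drift survives and only isotropic diffusion remains. This is our own sketch, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)

def R(t):  # rotation matrix in SO(2)
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

B = rng.normal(size=2)                              # drift vector
S = rng.normal(size=(2, 2)); S = 0.5 * (S + S.T)    # symmetric diffusion tensor

# Group-average to project onto the equivariant subspace
B_bar = np.mean([R(t) @ B for t in thetas], axis=0)
S_bar = np.mean([R(t) @ S @ R(t).T for t in thetas], axis=0)

assert np.allclose(B_bar, 0, atol=1e-10)                               # no drift
assert np.allclose(S_bar, 0.5 * np.trace(S) * np.eye(2), atol=1e-10)   # isotropic
```

With nontrivial representations acting on the channels, the invariant subspaces are larger and non-isotropic diffusions with drift become admissible, which is exactly the motivation for the coupled-channel construction above.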

E.4 GENERALIZATION OF EQUIVARIANCE CONSTRAINT FOR CONVOLUTIONS

This equivariance constraint is a direct generalization of the equivariance constraint for convolution kernels as described in Weiler and Cesa (2019); Cohen et al. (2019). In fact, when $A$ is a convolution operator, $Af = K * f$, the action of $L_g$ by conjugation on $A$ is equivalent to transforming the argument of the kernel $K$:
$$L_g(K*)L_{g^{-1}}f(x) = \int K(g^{-1}x - x')f(gx')\,d\mu(x') = \int K(g^{-1}(x - x'))f(x')\,d\mu(x') = (L_g[K]) * f.$$
Letting both sides of eq. 26 act on the product of a constant unit vector $e_i$ and a delta function, $f = e_i\delta$, the expression $\forall e_i: \rho_{\mathrm{out}}(g)L_g[K]\rho_{\mathrm{in}}(g^{-1}) * e_i\delta = K * e_i\delta$ can be rewritten as $\forall x: \rho_{\mathrm{out}}(g)K(g^{-1}x)\rho_{\mathrm{in}}(g^{-1}) = K(x)$, which is precisely the constraint for steerable equivariance for convolutions described in the literature.

E.5 EQUIVARIANT DIFFUSIONS WITH MATRIX EXPONENTIAL

Below we solve for the necessary and sufficient conditions for the equivariance of the operator $A_{\mathrm{deriv}}$. We use tensor representations for their convenience, but the approach is general and allows other kinds of representations. A rank $(p, q)$ tensor $t$ is an element of the vector space $T_{(p,q)} := V^{\otimes p}\otimes(V^*)^{\otimes q}$, where $V$ is some underlying vector space, $V^*$ is its dual, and $(\cdot)^{\otimes p}$ is the tensor product iterated $p$ times. In common language, $T_{(0,0)}$ are scalars, $T_{(1,0)}$ are vectors, and $T_{(1,1)}$ are matrices. Given the action of a group $G$ on the vector space $V$, the representation on $T_{(p,q)}$ is $\rho^{(p,q)}(g) = g^{\otimes p}\otimes(g^{-\top})^{\otimes q}$, where $^{-\top}$ is the inverse transpose and $\otimes$ on matrices is the tensor (Kronecker) product. Composite representations can be formed by stacking different tensor ranks together, such as a representation of 50 scalars, 25 vectors, 10 matrices and 5 higher order tensors: $T^{50}_{(0,0)}\oplus T^{25}_{(1,0)}\oplus T^{10}_{(1,1)}\oplus T^{5}_{(1,2)}$, where $\oplus$ in this context is the same as the Cartesian product. For a composite representation $U = \bigoplus_i T_{(p_i,q_i)}$ the group representation is similarly $\rho_U(g) = \bigoplus_i\rho^{(p_i,q_i)}(g)$, where $\oplus$ concatenates matrices as blocks on the diagonal. Noting that the operator $L_g$ acts only on the argument and the matrix $\rho_{\mathrm{in}}(g)$ acts only on the components, the two commute and we can rewrite the constraint for $A_{\mathrm{deriv}}$ as
$$\sum_i\rho_{\mathrm{out}}(g)B_i\rho_{\mathrm{in}}(g^{-1})\,L_g\partial_i L_{g^{-1}} + \sum_{ij}\rho_{\mathrm{out}}(g)S_{ij}\rho_{\mathrm{in}}(g^{-1})\,L_g\partial_i\partial_j L_{g^{-1}} = A_{\mathrm{deriv}}. \quad (28)$$
We can simplify the expression $L_g\partial_i L_{g^{-1}}$ by seeing how it acts on a function. For any differentiable function,
$$\partial_i L_{g^{-1}}f(x) = \frac{\partial}{\partial x_i}[f(gx)] = \sum_j g_{ji}[\partial_j f](gx) = L_{g^{-1}}\sum_j g_{ji}\partial_j f(x),$$
where $g_{ij}$ are the components of the matrix $g$. Since this holds for any $f$, we find that $L_g\nabla L_{g^{-1}} = g^\top\nabla$ and therefore $L_g\nabla\nabla^\top L_{g^{-1}} = L_g\nabla L_{g^{-1}}L_g\nabla^\top L_{g^{-1}} = g^\top\nabla\nabla^\top g$. Since equation 28 holds as an operator equation, it must hold separately for each component $\partial_i$ and $\partial_i\partial_j$.
This means that the constraint separates into a constraint for $B$ and a constraint for $S$:
1. $\forall g, i:\ \sum_j g_{ij}\,\rho_{\mathrm{out}}(g)B_j\rho_{\mathrm{in}}(g^{-1}) = B_i$
2. $\forall g, i, j:\ \sum_{kl} g_{ik}g_{jl}\,\rho_{\mathrm{out}}(g)S_{kl}\rho_{\mathrm{in}}(g^{-1}) = S_{ij}$.
These relationships can be expressed more succinctly by flattening the elements of $B$ and $S$ into vectors: $[\rho_{\mathrm{out}}(g)\otimes\rho_{\mathrm{in}}(g^{-\top})\otimes\rho^{(1,0)}(g)]\mathrm{vec}(B) = \mathrm{vec}(B)$ and $[\rho_{\mathrm{out}}(g)\otimes\rho_{\mathrm{in}}(g^{-\top})\otimes\rho^{(2,0)}(g)]\mathrm{vec}(S) = \mathrm{vec}(S)$.

E.6 ROTATION EQUIVARIANCE CONSTRAINT FOR SCALAR DIFFUSIONS HAS ONLY TRIVIAL SOLUTIONS

The diffusion operator $A = \sum_k W_k e^{D_k}$ admits only the trivial solutions $\beta_k = 0$ and $\Sigma_k \propto I$ if it satisfies the continuous rotation equivariance constraint.

Proof:

The application of $e^{D_k}$ is just a convolution with the Green's function:
$$\sum_k W_k e^{D_k}f = \sum_k W_k\big[e^{-\frac{1}{2}(x + \beta_k)^\top\Sigma_k^{-1}(x + \beta_k)}\det(2\pi\Sigma_k)^{-1/2}\big] * f = \sum_k W_k\,G_k * f, \quad (29)$$
where the Green's function is the multivariate Gaussian density $G_k(x) = \mathcal{N}(x; -\beta_k, \Sigma_k)$. As shown in appendix E.4, for convolutions the operator constraint is equivalent to the kernel equivariance constraint $\rho_{\mathrm{out}}(g)K(g^{-1}x)\rho_{\mathrm{in}}(g^{-1}) = K(x)$ from Weiler and Cesa (2019). With $K(x) = \sum_k W_k G_k(x)$ this reads
$$\forall x \in \mathbb{R}^d, g \in G:\ \sum_k\rho_{\mathrm{out}}(g)W_k\,\mathcal{N}(g^{-1}x; -\beta_k, \Sigma_k)\,\rho_{\mathrm{in}}(g^{-1}) = \sum_k W_k\,\mathcal{N}(x; -\beta_k, \Sigma_k).$$
For rotations $g \in SO(2)$, where we can parametrize $g_\theta = e^{\theta J}$ in terms of the antisymmetric matrix $J = \begin{pmatrix} 0 & 1 \\ -1 & 0\end{pmatrix} \in \mathbb{R}^{2\times 2}$ and the operator $L_g$ can be written $L_{g_\theta} = e^{-\theta x^\top J^\top\nabla}$, we can take derivatives with respect to $\theta$ to get (now with sums over $k$ implicit)
$$\forall x \in \mathbb{R}^d:\ d\rho_{\mathrm{out}}W_k\,\mathcal{N}(x; -\beta_k, \Sigma_k) - W_k\,\mathcal{N}(x; -\beta_k, \Sigma_k)\,d\rho_{\mathrm{in}} - W_k\,(x^\top J^\top\nabla)\mathcal{N}(x; -\beta_k, \Sigma_k) = 0.$$
Here the Lie algebra representation of $J$ is $d\rho := \frac{\partial}{\partial\theta}\rho(g_\theta)\big|_{\theta=0}$. Factoring out the normal density:
$$\forall x \in \mathbb{R}^d:\ \big[d\rho_{\mathrm{out}}W_k - W_k\,d\rho_{\mathrm{in}} - W_k\big(x^\top J^\top\Sigma_k^{-1}(x + \beta_k)\big)\big]\,\mathcal{N}(x; -\beta_k, \Sigma_k) = 0.$$
Without loss of generality we may assume that the pairs $(\beta_k, \Sigma_k)$ are distinct, since otherwise we could merge the corresponding terms into a single element. Since a (finite) sum of distinct Gaussian densities is never a Gaussian density, monomials of order $> 0$ multiplied by a Gaussian density cannot be formed from sums of Gaussian densities or from sums multiplied by monomials of a different order, and Gaussian densities are never 0, this constraint separates into several independent constraints:
1. $\forall k:\ d\rho_{\mathrm{out}}W_k = W_k\,d\rho_{\mathrm{in}}$
2. $\forall k, x:\ W_k\,(x^\top J^\top\Sigma_k^{-1}\beta_k) = 0$
3. $\forall k, x:\ W_k\,(x^\top J^\top\Sigma_k^{-1}x) = 0$
We may assume w.l.o.g. that $W_k$ is not zero in all its components (otherwise we could delete this element of the sum and continue).
Therefore some component of $W_k$ is nonzero, and the expressions in parentheses in constraints 2 and 3 must vanish. Given that constraint 3 holds for all $x$, only the symmetric part of $J^\top\Sigma_k^{-1}$ contributes to the quadratic form, so $J^\top\Sigma_k^{-1} + \Sigma_k^{-1}J = 0$ because $\Sigma_k$ is symmetric; since $J = -J^\top$ this can be expressed concisely as $[\Sigma_k^{-1}, J] = 0$, for which the only symmetric solutions are proportional to the identity: $\Sigma_k = c_k I$. Since both $\Sigma_k$ and $J$ are invertible, constraint 2 yields $\beta_k = 0$. Therefore there are no nontrivial solutions for $\beta, \Sigma$ in $A = \sum_k W_k e^{D_k}$ under continuous rotation equivariance.

F DATASET AND TRAINING DETAILS

In this section we elaborate on details regarding hyperparameters, network architecture, and the datasets. As described in the main text, the PNCNN is composed of a chain of convolutional blocks, each containing a convolution layer, a probabilistic ReLU, and a linear channel mixing layer (the analogue of the usual $1\times 1$ convolution). In each of these convolutional blocks, the input is a collection of points together with the feature mean and elementwise standard deviation at those points: $\{(x_i, \mu(x_i), \sigma(x_i))\}_{i=1}^N$. These observations seed the GP layer, and the block is evaluated at the same collection of points for the output (although it can be evaluated elsewhere since it is a continuous process, and we make use of this fact to visualize the features in figures 1 and 2). Hyperparameters: For the PNCNN on the SuperPixel MNIST dataset, we use 4 PNCNN convolution blocks with $c = 128$ channels and $K = 9$ basis elements for the different drift and diffusion parameters in $\sum_{k=1}^K W_k e^{D_k}$. We train for 20 epochs using the Adam optimizer (Kingma and Ba, 2014) with learning rate $3\times 10^{-3}$ and batch size 50. For the PNCNN on the PhysioNet2012 dataset, we use the variant of the PNCNN convolution layer with the stochastic diagonal estimator described in appendix G with $P = 20$ probes. In the convolution blocks we use $c = 96$ channels and $K = 5$ basis elements, and we train for 10 epochs using the same optimizer settings as above. For both datasets we tuned hyperparameters on a validation set of size 10% before folding the validation set back into the training set for the final runs. Both models take about 2 hours to train. SuperPixel-MNIST: We source the SuperPixel MNIST dataset (Monti et al., 2017) from Fey and Lenssen (2019)

G STOCHASTIC DIAGONAL ESTIMATION FOR PHYSIONET2012

In order to compute the mean and variance of the rectified Gaussian process (the activations of the probabilistic ReLU), we need to compute the diagonal of $Ak_pA^\top(x_n, x_n)$ for the relevant points $\{x_n\}_{n=1}^N$. In the usual case where each of the channels $\alpha = 1, 2, \dots, c$ is observed at the same locations, this can be done efficiently. First one computes the application of $e^{D_i}$ on the left and $e^{D_j}$ on the right onto the posterior $k_p$:
$$N_{ij} = (e^{D_i}k_p\,e^{D_j})(x_n, x_n) = (e^{D_i}k\,e^{D_j})(x_n, x_n) - (e^{D_i}k)(x_n)[K + S]^{-1}(k\,e^{D_j})(x_n),$$
where $k$ is the RBF kernel and we have reused the notation from appendix A. Notably, this quantity is the same for each of the channels, and the elementwise variance is just
$$v_\alpha(x_n) = (Ak_pA^\top)_{\alpha\alpha}(x_n, x_n) = \sum_{i,j,\beta} W_i^{\alpha\beta}\,N_{ij}\,W_j^{\alpha\beta},$$
where $\alpha, \beta$ index the channels of each of the matrices $W_i$. Because $N$ is the same for all channels, we can compute this quantity efficiently with reasonable memory and compute cost. For the PhysioNet2012 dataset, where the observation points differ between channels, we must consider a different observation set $\{x_n^\beta\}_{n=1}^N$ for each channel $\beta$. This means that the evaluated kernel depends on the channel, and we have the objects $k^\beta$, $K^\beta$ and $S^\beta$. As a result, $N^\beta_{ij}$ carries an additional index and the desired computation is
$$v_\alpha(x_n^\alpha) = (Ak_pA^\top)_{\alpha\alpha}(x_n^\alpha, x_n^\alpha) = \sum_{i,j,\beta} W_i^{\alpha\beta}\,N^\beta_{ij}\,W_j^{\alpha\beta}. \quad (31)$$
While each of the terms in the computation can be computed without much difficulty, performing the summation explicitly requires an unreasonably large memory cost and compute. However, by the same approach we can consider the full covariance matrix $B_{(\alpha n)(\beta m)} = (Ak_pA^\top)_{\alpha\beta}(x_n^\alpha, x_m^\beta)$, and while it would not be feasible to compute this matrix directly, we can define its matrix-vector multiplies onto vectors in $\mathbb{R}^{cN}$ implicitly using the sequence of operations that defines it.
Crucially, this sequence of operations has a much more modest memory consumption (and compute cost) than the direct expression in equation 31. These implicit matrix-vector multiplies can then be used to compute a stochastic diagonal estimator (Bekas et al., 2007) given by
$$\hat v_\alpha(x_n^\alpha) = \frac{1}{P}\sum_{p=1}^P\big(z_p \odot Bz_p\big)_{(\alpha n)}$$
with Gaussian probe vectors $z_p \sim \mathcal{N}(0, I)$, where $\odot$ is elementwise multiplication (see Bekas et al. (2007) for more details on this stochastic diagonal estimator). We use this estimator with $P = 20$ probes for computing the variances for PhysioNet. We note that with $P = 20$ the variance estimates are still quite noisy; however, without the estimator we cannot readily apply the PNCNN to PhysioNet. We leave a better approach for handling this kind of data to future work.
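The estimator itself is a few lines given an implicit matrix-vector multiply. A minimal NumPy sketch (our own; the function name `stochastic_diag` and the explicit test matrix are illustrative, since in the PNCNN $B$ is only available implicitly):

```python
import numpy as np

def stochastic_diag(matvec, dim, n_probes=20, rng=None):
    """Bekas et al. (2007): diag(B) ~ (1/P) sum_p z_p * (B z_p), z_p ~ N(0, I).
    Only needs matrix-vector products with B, never B itself."""
    rng = np.random.default_rng(rng)
    acc = np.zeros(dim)
    for _ in range(n_probes):
        z = rng.normal(size=dim)
        acc += z * matvec(z)   # elementwise product z .* (B z)
    return acc / n_probes

# Sanity check on an explicit PSD matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 50))
Bmat = A @ A.T
est = stochastic_diag(lambda v: Bmat @ v, 50, n_probes=5000, rng=2)
rel_err = np.abs(est - np.diag(Bmat)) / np.diag(Bmat)
# with many probes the relative error is small; with P = 20 it remains noisy
```

The estimator is unbiased, and its variance per entry scales with the off-diagonal mass of $B$ divided by $P$, consistent with the noisiness observed at $P = 20$.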

H PATHOLOGIES IN PROJECTION TO RBF GAUSSIAN PROCESS

In section 4.7 we describe an approach by which a Gaussian process with a complex mean and covariance function is projected down to the posterior of a (simpler) RBF-kernel GP from a set of observations. Given the representational capacity of the RBF kernel, we know that with the right set of observations a complex function can in principle be approximated well. The relationship for uncertainty, however, is less straightforward. The properties of the input Gaussian process must be conveyed to the output Gaussian process only through the (uncorrelated) noisy observations $\{(x_i, \mu(x_i), \sigma(x_i))\}_{i=1}^N$. As the uncertainty in the original GP increases, so do the measurement uncertainties in the transmission, and therefore the output GP also has higher uncertainty. However, the uncertainty in the input GP takes the form of a full covariance kernel $k(x, x')$, and individual observations cannot easily communicate the covariance of the GP function values at different spatial locations, despite the heterogeneous noise model. Fundamentally, the problem is that the observation values are treated as independent, an incorrect assumption which has knock-on effects when the number of observations is large. With some fixed measurement error, no matter how high, a large enough set of independent observations can pin down the mean value precisely. If in contrast the observations are not independent, then the mean value may not be knowable more precisely than some limiting uncertainty. This effect leads the output GP to have less uncertainty than it should given the input GP. If the observations are sparse, then the effective sample size of the estimator for the mean of the GP at any given location is small, and the amount by which uncertainty is underestimated is small.
However, if there are very many observations, then this transmission of information through observations under the independence assumption will attenuate the uncertainty. We would also expect that over the course of many layers this attenuation can accumulate. We believe this is what causes the poorer uncertainty calibration in layers 3 and 4 of the PNCNN shown in figure 2. We hope that this problem can be resolved in future work, perhaps by removing the independence assumption or providing an alternative projection method.
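The underestimation mechanism can be made concrete with a toy calculation: under the independence assumption the posterior variance of a mean estimated from $N$ observations decays as $\sigma^2/N$, while with a shared correlated error of variance $\rho$ the true variance of the sample mean floors at $\rho$. This is our own sketch with illustrative numbers, not an experiment from the paper:

```python
import numpy as np

def mean_variance(N, sigma2=1.0, rho=0.3):
    """Variance of the sample mean of N observations whose noise has an
    independent part (variance sigma2) and a shared part (variance rho)."""
    C = sigma2 * np.eye(N) + rho * np.ones((N, N))  # true covariance
    w = np.ones(N) / N
    var_true = w @ C @ w         # = sigma2 / N + rho: floors at rho
    var_assumed = sigma2 / N     # under the independence assumption: -> 0
    return var_assumed, var_true

for N in [10, 100, 1000]:
    va, vt = mean_variance(N)
    print(N, va, vt)   # the assumed variance vanishes; the true one does not
```

As $N$ grows, the assumed-independent variance is driven to zero while the true variance stays above $\rho$, which is exactly the sense in which dense observations attenuate the uncertainty.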

I CENTRAL LIMIT THEOREM FOR STOCHASTIC PROCESSES

In this section we show that each of the output features $f_\alpha = \sum_\beta M_{\alpha\beta}h_\beta$ is marginally distributed as a Gaussian process in the limit $c_{\mathrm{in}} \to \infty$, provided that the $h_\beta$ are independent with bounded moments, and that the fraction of zeros in each row of $M$ does not approach 1. The result follows from an application of the multivariate Lyapunov CLT to the joint distribution of the functions evaluated at a finite collection of points. Since we only make the claim about the marginal distribution of each output channel, we can separate the channels and apply a CLT argument to each one. For any given output channel $g$, the output is a sum $g = \sum_\beta m_\beta h_\beta$ for values $m_\beta$ given by the particular row of $M$. For each coefficient $m_i$ that is nonzero, we define $\tilde h_i = m_i h_i$, so that the output is the sum $g = \sum_i\tilde h_i$. Now choose any finite collection of points $x_1, x_2, \dots, x_N$. The random vector $[g(x_1), g(x_2), \dots, g(x_N)] = \sum_i[\tilde h_i(x_1), \tilde h_i(x_2), \dots, \tilde h_i(x_N)]$. In the limit as the number of channels $c_{\mathrm{in}}$ goes to infinity, we can apply the Lyapunov central limit theorem to this sum of random vectors, so long as the number of nonzero elements in the sum also goes to infinity. This means that the vector $[g(x_1), \dots, g(x_N)]$ converges in distribution to a multivariate Gaussian with mean $\mu$ having entries $\mu_n = \sum_i\mathbb{E}[\tilde h_i(x_n)]$ and covariance $\Sigma$ with entries $\Sigma_{nm} = \sum_i\mathbb{E}[\tilde h_i(x_n)\tilde h_i(x_m)]$. Since we can repeat the same argument for any collection of points $x_1, x_2, \dots, x_N$, we deduce that $g$ is a Gaussian process according to the definition, $g \sim \mathcal{GP}(\mu, k)$, with
$$\mu(x) = \sum_i\mathbb{E}[\tilde h_i(x)] = \sum_i m_i\,\mathbb{E}[h_i(x)], \qquad k(x, x') = \sum_i\mathbb{E}[\tilde h_i(x)\tilde h_i(x')] = \sum_i m_i^2\,\mathbb{E}[h_i(x)h_i(x')].$$
Conceivably the CLT assumptions could be violated during training, if a large enough fraction of the $M_{\alpha\beta}$ converged to 0 or if the separate channels became strongly correlated. We investigate the non-Gaussianity of the features in section 4.7 and find that the features are mostly Gaussian but have somewhat longer tails, indicating deviation from the theoretical behavior.
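The CLT statement can be illustrated empirically: a single non-Gaussian channel is visibly skewed, while the normalized sum over many independent channels is close to Gaussian. This is our own sketch with exponential channel values; the paper's channels are rectified GPs, but the mechanism is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, c_in = 20_000, 256

def skew(x):
    x = x - x.mean()
    return np.mean(x**3) / np.mean(x**2) ** 1.5

# A single skewed "channel": centered exponential values (skewness 2)
g1 = rng.exponential(size=n_samples) - 1.0

# Normalized sum over c_in independent channels: skewness ~ 2 / sqrt(c_in)
g = (rng.exponential(size=(n_samples, c_in)) - 1.0).sum(axis=1) / np.sqrt(c_in)

assert abs(skew(g1)) > 1.0   # clearly non-Gaussian
assert abs(skew(g)) < 0.2    # close to Gaussian
```

The residual skewness of the sum decays as $1/\sqrt{c_{\mathrm{in}}}$, consistent with the mostly-Gaussian-but-long-tailed features observed in section 4.7.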



Footnotes:
1. While GPs can be applied directly to image classification, they are not well suited to this task even with convolutional structure baked in, as shown in Kumar et al. (2018).
2. For convenience, we include the additional scale factor $(2\pi\ell^2)^{d/2}$ relative to the usual definition.
3. More generally, neural networks have affine layers including both convolutions and biases. An affine transformation $Af + b$ of a Gaussian process is also a Gaussian process, $f \sim \mathcal{GP}(A\mu_p + b, Ak_pA^\top)$, and we include biases in our network but omit them from the derivations for simplicity.
4. A 2D discrete convolution layer using $N = m^2$ points can be interpreted as a Riemann sum approximation of the continuous integral and will therefore have an error of $O(1/m) = O(1/\sqrt{N})$.
5. This assumes, as is typically done, that the measure $\mu$ over which the convolution is performed is left invariant. For the more general case, see the discussion in Bekkers (2019).



Figure 1: The PNCNN operating on SuperPixel-MNIST images shown on the left. The mean and elementwise uncertainty of the Gaussian process feature maps are shown as they are transformed through the network by the convolution layers. Observation points shown as green dots in σ(x).

Figure 2: Left: Qualitative convergence of the mean and uncertainty of the first 3 channels of the feature maps is shown in RGB color as the input test resolution is increased. Middle: Median predicted uncertainties over spatial locations as a function of the test resolution. Right: Using the predictions of the highest resolution model as ground truth, the distribution of prediction residuals is shown in a Q-Q plot for each layer (shifted horizontally for clarity) with the black lines showing the theoretical relationship, and the overall distribution histogram is shown on the right.

Figure 3: Zero shot generalization to other resolutions: Having trained on MNIST at a given training resolution shown by the color, we evaluate the performance of a PNCNN, a PNCNN without uncertainty, and an ordinary CNN on varying test resolutions. Notably, the PNCNN with uncertainty is the most robust.

consisting of 60k training examples and 10k test examples, represented as collections of positions and grayscale values $\{(x_i, f(x_i))\}_{i=1}^{75}$ at the $N = 75$ superpixel centroids. PhysioNet2012: We follow the data preprocessing from Horn et al. (2019) and the 10k-2k train-test split. The individual data points consist of 42 irregularly spaced vital-sign time series as well as 5 static variables: gender, ICU type, age, height, and weight. We use one-hot embeddings for the first two categorical variables, and we treat each of these static signals as fully observed constant time series. As the binary classification task exhibits a strong label imbalance, with 14% positive labels, we apply an inverse frequency weighting of $1/0.14$ to the binary cross entropy loss.
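The inverse-frequency weighting can be sketched as follows; this is our own reading of the loss (weighting only the positive-class term by $1/0.14$), not the paper's code:

```python
import numpy as np

def weighted_bce(y_true, p_pred, pos_frac=0.14, eps=1e-7):
    """Binary cross entropy with an inverse-frequency weight 1/pos_frac
    applied to the positive class (our reading of the paper's weighting)."""
    p = np.clip(p_pred, eps, 1 - eps)
    w_pos = 1.0 / pos_frac
    return -np.mean(w_pos * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# The single positive example carries ~7x the weight of each negative one
y = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.2, 0.1])
loss = weighted_bce(y, p)
```

Upweighting the rare positive class by the reciprocal of its frequency makes the expected gradient contributions of the two classes comparable despite the 14% imbalance.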

, Graph Convolutional Gaussian Processes (GCGP) (Walker and Glocker, 2019), and Graph Attention Networks (GAT) (Avelar et al., 2020). As shown in table 5, the probabilistic numeric CNN greatly outperforms the competing methods, reducing the classification error rate by more than 3× over the previous state of the art. Left: classification error on the 75-SuperPixel MNIST problem. Right: average precision and area under ROC curve metrics for PhysioNet2012. Mean and standard deviation are computed over 3 trials.


J MOMENTS OF RECTIFIED GAUSSIAN RANDOM VARIABLES

Let $f \sim \mathcal{N}(\mu, \Sigma)$ be a $d$-dimensional Gaussian random vector. We compute here the moments of the rectified variable $\mathrm{ReLU}(f)$ using the generating function technique. Define
$$Z(b) = \mathbb{E}\Big[e^{b^\top f}\prod_i\theta(f_i)\Big],$$
with $\theta$ the Heaviside step function; then the moments are obtained by differentiating $Z(b)$ at $b = 0$. To compute $Z(b)$ we proceed as in the Gaussian case. We change variables to $f = \Sigma b + g$ (38) and define $z = \mu + \Sigma b$ to get
$$Z(b) = e^{b^\top\mu + \frac{1}{2}b^\top\Sigma b}\,\Phi(z; 0, \Sigma), \quad (40)$$
$\Phi$ being the multivariate standard normal CDF. Now we compute the first two derivatives, denoting $\partial_i = \partial/\partial b_i$. In $d = 1$, denoting $\sigma^2 = \Sigma$, differentiating once gives the first moment
$$\mathbb{E}[\mathrm{ReLU}(f)] = \partial Z(b)\big|_{b=0} = \mu\,\Phi(\mu; 0, \sigma^2) + \sigma^2\,\mathcal{N}(\mu; 0, \sigma^2),$$
which coincides with equation 8. Note that $\mathcal{N}(\mu; 0, \sigma^2) = \frac{1}{\sigma}\mathcal{N}(\mu/\sigma; 0, 1)$ because of the normalization factor. In $d = 2$, we can use the conditional probability decomposition to obtain the required derivatives of $\Phi(z; 0, \Sigma)$. In particular, for the second moments ($d = 2$) we have
$$\mathbb{E}[\mathrm{ReLU}(f_1)\mathrm{ReLU}(f_2)] = \Sigma_{12}\,\Phi^{(2)}(\mu; 0, \Sigma) + \mu_1\mu_2\,\Phi^{(2)}(\mu; 0, \Sigma) + \mu_1 m_2(0) + \mu_2 m_1(0), \quad (55)$$
where $\Phi^{(2)}$ is the bivariate normal CDF and the $m_i(b)$ collect the first-derivative terms arising from $\partial_i$ acting on $\Phi(z; 0, \Sigma)$; this can be rewritten in the form of equation 9.
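The $d = 1$ first moment can be verified by Monte Carlo. A small NumPy sketch (our own, not code from the paper):

```python
import numpy as np
from math import erf, exp, sqrt, pi

def relu_mean(mu, sigma):
    """First moment of a rectified Gaussian:
    E[ReLU(f)] = mu * Phi(mu/sigma) + sigma * phi(mu/sigma)."""
    z = mu / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))    # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)  # standard normal pdf
    return mu * Phi + sigma * phi

# Monte Carlo check over a few parameter settings
rng = np.random.default_rng(0)
for mu, sigma in [(0.0, 1.0), (1.5, 0.5), (-2.0, 1.0)]:
    samples = np.maximum(rng.normal(mu, sigma, 1_000_000), 0.0)
    assert abs(samples.mean() - relu_mean(mu, sigma)) < 8e-3
```

For $\mu = 0$, $\sigma = 1$ the formula reduces to $\varphi(0) = 1/\sqrt{2\pi} \approx 0.399$, the well-known mean of a standard rectified Gaussian.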

