BRIDGE THE INFERENCE GAPS OF NEURAL PRO-CESSES VIA EXPECTATION MAXIMIZATION

Abstract

The neural process (NP) is a family of computationally efficient models for learning distributions over functions. However, it suffers from under-fitting and shows suboptimal performance in practice. Researchers have primarily focused on incorporating diverse structural inductive biases, e.g. attention or convolution, in modeling. The topic of inference suboptimality and an analysis of the NP from the optimization objective perspective has hardly been studied in earlier work. To fix this issue, we propose a surrogate objective of the target log-likelihood of the meta dataset within the expectation maximization framework. The resulting model, referred to as the Self-normalized Importance weighted Neural Process (SI-NP), can learn a more accurate functional prior and has an improvement guarantee concerning the target log-likelihood. Experimental results show the competitive performance of SI-NP over other NPs objectives and illustrate that structural inductive biases, such as attention modules, can also augment our method to achieve SOTA performance.

1. INTRODUCTION

Figure 1 : Deep Latent Variable Models for Neural Processes. Here D C and D T respectively denote the context points for the functional prior inference and the target points for the function prediction. The global latent variable z is to summarize function properties. The model involves a functional prior distribution p(z|D C ; ϑ) and a functional generative distribution p(D T |z; ϑ). Please refer to Section (2) for detailed notation descriptions. The combination of deep neural networks and stochastic processes provides a promising framework for modeling data points with correlations (Ghahramani, 2015) . It exploits the high capacity of deep neural networks and enables uncertainty quantification for distributions over functions. As an example, we can look at the deep Gaussian process (Damianou & Lawrence, 2013) . However, the run-time complexity of predictive distributions in Gaussian processes is cubic w.r.t. the number of predicted data points. To circumvent this, Garnelo et al. (2018a; b) developed the family of neural processes (NPs) as the alternative, which can model more flexible function distributions and capture predictive uncertainty at a lower computational cost. In this paper, we study the vanilla NP as a deep latent variable model and show the generative process in Fig. (1) . In particular, let us recap the inference methods used in vanilla NPs: It learns to approximate the functional posterior q ϕ (z) ≈ p(z|D T ; ϑ) and a functional prior q ϕ (z|D C ) ≈ p(z|D C ; ϑ), which are permutation invariant to the order of data points. Then the predictive distribution for a data point [x * , y * ] can be formulated in the form E q ϕ (z|D C ) [p(y * |[x * , z]; ϑ)]. While the NP provides a computationally efficient framework for modeling exchangeable stochastic processes, it exhibits underfitting and fails to capture accurate uncertainty (Garnelo et al., 2018b; Kim et al., 2019) in practice. To improve its generalization capability, researchers have focused much attention on finding appropriate inductive biases, e.g. attention (Kim et al., 2019) and convolutional modules (Gordon et al., 2019; Kawano et al., 2020) , Bayesian mixture structures (Wang & van Hoof, 2022) or Bayesian hierarchical structures (Naderiparizi et al., 2020) , to incorporate in modeling. Research Motivations. Most previous work (Garnelo et al., 2018a; b; Kim et al., 2019; Gordon et al., 2019; Wang & van Hoof, 2022) ignores the reason why the vanilla NP suffers the performance bottleneck and what kind of functional priors the vanilla NPs can represent. In particular, we point out the remaining crucial issues that have not been sufficiently investigated in this domain, respectively: (i) understanding the inference suboptimality of vanilla NPs (ii) quantifying statistical traits of learned functional priors. To this end, we try to diagnose the vanilla NP from its optimization objective. Our primary interest is to find a tractable way to optimize NPs and examine the statistics of learned functional priors from diverse optimization objectives. Developed Methods. To understand the inference suboptimality of vanilla NPs, we establish connections among a collection of optimization objectives, e.g. approximate evidence lower bounds (ELBOs) and Monte Carlo estimates of log-likelihoods, in Section (3). Then we formulate a tractable optimization objective within the variational expectation maximization framework and obtain the Self-normalized Importance weighted neural process (SI-NP) in Section (4). Contributions. To summarize, our primary contributions are three-fold: (i) we analyze the inherent inference sub-optimality of NPs from an optimization objective optimization perspective; (ii) we demonstrate the equivalence of conditional NPs (Garnelo et al., 2018a) and SI-NPs with one Monte Carlo sample estimate, which closely relates to the prior collapse in Definition (3.1); (iii) our developed SI-NPs have an improvement guarantee to the likelihood of meta dataset in optimization and show a significant advantage over baselines with other objectives.

2. PRELIMINARIES

General Notations. We study NPs in a meta learning setup. T defines a set of tasks with τ a sampled task. Let D C τ = {(x i , y i )} n i=1 and D T τ = {(x i , y i )} n+m i=1 denote the context points for the functional prior inference and the target points for the function prediction. The latent variable z is a functional representation of a task τ with observed data points. We refer to ϑ ∈ Θ as the parameters of the deep latent variable model for NPs. In detail, ϑ consists of encoder parameters in a functional prior p(z|D C τ ; ϑ) and decoder parameters in a generative distribution p(D T τ |z; ϑ). ϕ refer to the parameters of a variational posterior distribution q ϕ (z) = q ϕ (z|D T τ ), while η refer to the parameters of a proposal distribution q η (z) in the following self-normalized importance sampling. Gaussian distributions with diagonal covariance matrices are the default choice for these distributions, e.g. p(z|D C τ ; ϑ) = N (z; µ ϑ (D C τ ), Σ ϑ (D C τ )), q ϕ (z) = N (z; µ ϕ (D T τ ), Σ ϕ (D T τ )) and q η (z|D T τ ) = N (z; µ η (D T τ ), Σ η (D T τ )). NPs as Exchangeable Stochastic Processes. In vanilla NPs, the element-wise generative process can be translated into Eq. ( 1). Here the mean and variance functions are respectively denoted by µ and Σ. ρ x1:n+m (y 1:n+m ) = p(z) n+m i=1 N (y i ; µ(x i , z), Σ(x i , z))dz Based on the Kolmogorov extension theorem (Klenke, 2013) and de Finneti's theorem (Kerns & Székely, 2006) , the above equation ρ x1:n+m (y 1:n+m ) is verified to be a well-defined exchangeable stochastic process. NPs in Meta Learning Tasks. Given a collection of tasks T , we can decompose the marginal distribution p(D T T |D C T ; ϑ) with a global latent variable z in Eq. ( 2). Here the conditional distribution p(z|D C τ ; ϑ) with τ ∈ T is permutation invariant w.r.t. the order of data points and encodes the functional prior in the generative process. Throughout the paper, the optimization objective of our interest is the marginal log-likelihood of a meta learning dataset in Eq. (3). Furthermore, this applies to all NP variants. For the sake of simplicity, we consider one task τ to derive equations in the following section, which corresponds to maximizing the following objectivefoot_0 . L(ϑ) = ln p(D T τ |z; ϑ)p(z|D C τ ; ϑ)dz (4) In the NP family (Garnelo et al., 2018a; b) , the target data points are conditionally independent given the global latent variable p(D T τ |z; ϑ) = n+m i=1 p(y i |[x i , z]; ϑ). The marginal distribution can be interpreted as the infinite mixture of distributions when z is defined on a continuous domain. Now learning distributions over functions is reduced to a probabilistic inference problem.

3.1. INFERENCE SUBOPTIMALITY IN VANILLA NPS

Previously, variational auto-encoder (VAE) models (Kingma & Welling, 2013; Rezende et al., 2014) mostly set a prior distribution fixed, e.g. N (0, I), as the default to approximate the posterior. This differs significantly from NPs family settings. On the one hand, the functional prior is learned in NPs. On the other hand, the functional prior participates in the performance evaluation. Exact ELBO for NPs. Following the essential variational inference operation, we can establish connections between the exact ELBO and the log-likelihood in Eq. ( 5). Given the functional prior p(z|D C τ ; ϑ) and the generative distribution p(D T τ |z; ϑ), the exact functional posterior can be obtained by the Bayes rule p(z|D T τ ; ϑ) = p(D T τ |z; ϑ)p(z|D C τ ; ϑ) p(D T τ |z; ϑ)p(z|D C τ ; ϑ)dz . The denominator p(D T τ |D C τ ) makes exact inference infeasible, and the family of variational posteriors is introduced to approximate p(z|D T τ ; ϑ). Here the variational posterior family is defined in a parameterized set Q Φ = {q ϕ (z)|ϕ ∈ Φ}. L(ϑ) = ln p(D T τ |D C τ ; ϑ) = E q ϕ (z) ln p(D T τ , z|D C τ ; ϑ) q ϕ (z) Exact ELBO + D KL q ϕ (z) ∥ p(z|D T τ ; ϑ) Posterior Approximation Gap (5) When the variational posterior family is flexible enough, e.g. p(z|D T τ ; ϑ) ∈ Q Φ , the posterior approximation gap can be reduced to an arbitrarily small quantity. In this case, maximizing the exact ELBO in Eq. ( 6) increases the likelihood in Eq. ( 5) accordingly. L ELBO (ϑ, ϕ) = E q ϕ (z) ln p(D T τ |z; ϑ) -D KL q ϕ (z) ∥ p(z|D C τ ; ϑ) Approximate ELBO for NPs. As previously mentioned, the inference is complicated since the functional prior and the posterior in the exact ELBO are unknown. To this end, Garnelo et al. (2018b) proposes a surrogate objective as an approximate ELBO for NPs. This is defined as Eq. ( 7) to maximize. L NP (ϑ, ϕ) = E q ϕ (z)   ln p(D T τ |z; ϑ) Generative Likelihood   -D KL q ϕ (z) ∥ q ϕ (z|D C τ ) Consistent Regularizer (7) The Kullback-Leibler divergence between the approximate posterior q ϕ (z) and the approximate prior q ϕ (z|D C τ ) is referred to as the consistent regularizer in this paper. We claim that the consistent regularizer is the source of the inference suboptimality of vanilla NPs, and this is shown in Appendix (D.2) as the proof of Remark (1). Remark 1 Eq. ( 7) is an invalid variational inference objective, and optimizing it cannot guarantee to find optimal or locally optimal solutions for the maximization over τ ∈T ln p(D T τ |D C τ ; ϑ). Other Available Objectives in NPs Family. Now we turn to other tractable optimization objectives in NPs. These include conditional neural processes (CNPs) (Garnelo et al., 2018a) and convolutional neural processes (ConvNPs) (Foong et al., 2020) (or VERSA (Gordon et al., 2018) ). The CNP is also a typical model of the NPs family. The objective L CNP (ϑ) can be obtained when the functional prior collapses into a Dirac delta distribution with ẑ a fixed real-value vector. L CNP (ϑ) = E p(z|D C τ ;ϑ) ln p(D T τ |z; ϑ) with p(z|D C τ ; ϑ) = δ(|z -ẑ|) For the ConvNP, we do not focus on the convolutional structural inductive bias and concentrate more on the optimization objective itself. Its objective in Eq. ( 9) is a biased Monte Carlo estimate of Eq. (3), so we can maximize the log-likelihood of marginal distributions straightforwardly. L ML-NP (ϑ) = ln 1 B B b=1 exp ln p(D T τ |z (b) ; ϑ) with z (b) ∼ p(z|D C τ ; ϑ) This is termed as Monte Carlo Maximum Likelihood L ML-NP (ϑ)foot_1 , and B is the number of used Monte Carlo samples. Without the involvement of the consistent regularizer, both L CNP (ϑ) and L ML-NP (ϑ) will not encounter the approximation gap in practice.

3.2. EVALUATION CRITERIA & ASYMPTOTIC PERFORMANCE

As in (Le et al., 2018; Foong et al., 2020) , we take a multi-sample Monte Carlo method to evaluate the performance. This applies to both meta training and meta testing processes. In detail, NP models need to run B times stochastic forward pass by sampling z (b) ∼ p(z|D C τ ; ϑ) and then compute the log-likelihoods as ln 1 B B b=1 p(D T τ |z (b) ; ϑ) . After describing the evaluation criteria, we turn to a phenomenon of our interest. In Gaussian processes (Ghahramani, 2015) , with more observed context points, the epistemic uncertainty can be decreased, and the predictive mean function is closer to the ground truth. Similarly, this trait is also reflected in NPs family and can be quantitatively described as follows. Given a measure of average predictive errors β, the number of context points n and the evaluated dataset, D T τ , the introduced metrics β(D T τ ; n) are decreased when increasing n in prediction. In our paper, we refer to this trait as the asymptotic behavior. Definition 3.1 (Prior Collapse) The functional prior p(z|D C τ ; ϑ) = N (z; µ ϑ (D C τ ), Σ ϑ (D C τ ) ) is said to collapse in learning when the trace of the covariance matrix satisfies Tr [Σ ϑ (D C τ )] = d i=1 σ 2 i ≈ 0 with Σ ϑ (D C τ ) = diag[σ 2 1 , . . . , σ 2 d ]. As previously mentioned, we can more precisely keep track of measures β(D T τ ; n), such as predictive log-likelihoods or mean square errors of data points, to assess the asymptotic behavior. The role of latent variables z is to propagate the uncertainty about the partial observations in functions. And Definition (3.1) provides a quantitative way to examine the extent of prior collapse in ML-NPs and SI-NPs.

4. TRACTABLE OPTIMIZATION VIA EXPECTATION MAXIMIZATION

In this section, we propose alleviating the inference suboptimality of NPs with the help of the variational expectation maximization algorithm. The strategy is to formulate a surrogate optimization objective and then execute the EM-steps in optimization. The benefit of our method is to guarantee performance improvement w.r.t. the likelihood of meta dataset in iterations and finally result in at least a local optimum. Green lines indicate the results after the E-steps while the red lines are for the M-steps in Algorithm (1). In the convergence iteration, the performance gap L(ϑ H ) -L(ϑ H-1 ) is close to zero and the algorithm results in at least a local optimal solution. Values of these quantities are increased from the left to the right.

4.1. VARIATIONAL EXPECTATION MAXIMIZATION FOR NPS

This part is to avoid the inference suboptimality of vanilla NPs as mentioned earlier. We retain the neural architectures used in NPs. The basic idea is illustrated in Fig. (2). In detail, we iteratively construct the lower bound L(ϑ K ) and maximize the surrogate function L(ϑ; ϑ K ). The referred optimization gap is due to the complexity of objectives or the choice of optimizers and measures the difference between converged (local) optimal functional prior and the theoretical optimal functional prior. The general pseudo code is Algorithm (1). To make the optimization of meta dataset log-likelihood feasible, we construct surrogate functions as the proxy in each iteration step. These meta learning surrogate functions with special properties are closely connected with the original objective, and we leave this discussion in Appendix (E). Here ϑ k denotes the parameter of the latent variable model for NPs in the k-th iteration of variational expectation maximization. Following the Algorithm (1), we take the E-step #1 by replacing the approximate posterior in Eq. ( 6) with the last time updated p(z|D T τ ; ϑ k ). And this results in the following equation, L(ϑ; ϑ k ) = E p(z|D T τ ;ϑ k ) ln p(D T τ , z|D C τ ; ϑ) -ln p(z|D T τ ; ϑ k ) where p(z|D T τ ; ϑ k ) is the posterior distribution. Proposition 1 The proposed meta learning function L(ϑ; ϑ k ) in Eq. ( 10) is a surrogate function w.r.t. the log-likelihood of the meta learning dataset. The above proposition is examined based on the definition in Appendix (E.1).

4.1.2. TRACTABLE OPTIMIZATION WITH SELF-NORMALIZED IMPORTANCE SAMPLING

Since the second term in Eq. ( 10) is constant in the iteration, we can drop it to simplify the surrogate objective as the right side of the following equation. max ϑ L(ϑ; ϑ k ) ⇔ max ϑ L EM (ϑ; ϑ k ) = E p(z|D T τ ;ϑ k ) ln p(D T τ , z|D C τ ; ϑ) Proposition 2 Optimizing this surrogate function of a batch of tasks via the variational expectation maximization leads to an improvement guarantee w.r.t. the log-likelihood τ ∈T ln p(D T τ |D C τ ; ϑ). Still we cannot optimize L EM (ϑ; ϑ k ) since the expectation has no analytical solution and it is intractable to sample from p(z|D T τ ; ϑ k ) for Monte Carlo estimatesfoot_2 . Remember that the marginal distribution p(D T τ |D C τ ; ϑ k ) is task dependent and can not be ignored in computing the posterior p(z|D T τ ; ϑ k ). To circumvent this, we introduce a proposal distribution q η (z|D T τ ) and optimize the objective via self-normalized importance sampling (Tokdar & Kass, 2010) . The resulting meta learning surrogate function is as follows: L EM (ϑ; ϑ k ) = E qη p(z|D T τ ; ϑ k ) q η (z|D T τ ) ln p(D T τ , z|D C τ ; ϑ) ≈ B b=1 ω(b) ln p(D T τ , z (b) |D C τ ; ϑ) = B b=1 ω(b) Importance Weight   ln p(D T τ |z (b) ; ϑ) Generative Likelihood + ln p(z (b) |D C τ ; ϑ) Functional Prior Likelihood   = L SI-NP (ϑ; η k , ϑ k ) where z (b) ∼ q η k (z|D T τ ), ω (b) = exp ln p(D T τ |z (b) ; ϑ k ) + ln p(z (b) |D C τ ; ϑ k ) -ln q η k (z (b) |D T τ ) and ω(b) = ω (b) B b ′ =1 ω (b ′ ) . In terms of the first conditional term in Eq. ( 12), all the data points are conditional independent and this is further expressed as ln p(D T τ |z (b) ; ϑ) z (b) ]; ϑ). In practice, the selection of proposal distributions is empirically tricky, so we make the update of proposal distributions optional in implementations. In our experimental settings, we simply use the functional prior p(z|D C τ ; ϑ) as the default proposal distribution, which is competitive enough in performance. = n+m i=1 ln p(y i |[x i , Proposition 3 With one Monte Carlo sample used in Eq. ( 12), the presumed diagonal Gaussian prior p(z|D C τ ; ϑ) will collapse into a Dirac delta distribution in convergence. In this case, SI-NP with the one sample Monte Carlo estimate in Eq. ( 13) is equivalent with CNP in Eq. ( 8). L SI-NP (ϑ; η k , ϑ k ) ≈ E p(z|D C τ ;ϑ k )   ln p(D T τ |z; ϑ) Generative Likelihood   + E p(z|D C τ ;ϑ k ) ln p(z|D C τ ; ϑ) Prior Collapse Term The Proposition (3) establishes connections between SI-NPs and CNPs in optimization and we attribute this to the collapse term in Eq. ( 13). Hence, CNP can be viewed as a particular example in SI-NPs. The complete proof is based on limit analysis and can be found in Appendix (F.1).

4.2. SCALABLE TRAINING AND TESTING

As shown in Algorithm (1), the meta training process consists of two steps. We skip E-step #2 to avoid unstable optimization observed in empirical results. By repeating E-step #1/M-step iterations until convergence, the method can theoretically find at least a local optimal w.r.t. the log-likelihood of meta learning dataset based on the Proposition (2). Once the learning progress reaches the final convergence, we can make use of the learned functional prior to obtain the predictive distribution. With B particles in prediction, the distribution can be expressed as follows. p(y|x, D C τ ; ϑ) = E p(z|D C τ ;ϑ) p(y|[x, z]; ϑ) ≈ 1 B B b=1 p(y|[x, z (b) ]; ϑ) with z (b) ∼ p(z|D C τ ; ϑ) (14)

5. EXPERIMENTS AND ANALYSIS

In this section, two central questions are answered: (i) can variational EM based models SI-NPs, achieve a better local optimum than vanilla NPs? (ii) what is the role of randomness in functional priors? Specifically, we examine the influence of NPs optimization objectives on typical downstream tasks and understand the functional prior quantitatively. Baselines & Evaluations. Since our concentration is on optimization objectives in NPs family, we compare to NP (Garnelo et al., 2018b) , and CNP (Garnelo et al., 2018a) , ML-NP (Foong et al., 2020) in experiments. Note that our developed SI-NP and ML-NP (Foong et al., 2020) We create synthetic regression tasks by sampling functions from Gaussian processes. Three types of kernels, respectively Matern-5 2 , RBF, and Periodic, are used to generate diverse function distributions. In each iteration, a batch of data points from functions is randomly processed into the context and the target for training models. In meta testing, the same strategy is used to generate functions and data points. We report the average log-likelihood results in Table (1) . It shows SI-NPs outperform the vanilla NPs in all kernel cases and are at least competitive with other baselines. ML-NPs are also superior to NPs, but they cannot beat CNPs in RBF and Periodic cases. An illustration of fitted curves is given in Fig. (3): compared to vanilla NPs, we notice that SI-NPs and ML-NPs can better match the ground truth and capture the uncertainty between context data points. Table 1 : Test average log-likelihoods of target data points for 1-dimensional Gaussian process dataset with various kernels (reported standard deviations in 5 runs). For each run, we randomly sample 1000 functions as tasks to evaluate. 

5.2. IMAGE COMPLETION

Similar to experiments in (Garnelo et al., 2018b; Kim et al., 2019) , we perform image completion experiments in this section. We implement NPs baselines in four commonly used datasets, namely MNIST (Bottou et al., 1994) , FMNIST (Xiao et al., 2017) , CIFAR10 (Krizhevsky et al., 2009) and SVHN (Sermanet et al., 2012) . During the meta training, the goal is to complete images in which some pixels have been randomly masked out. More precisely, we learn a latent variable model that maps all pixel locations x ∈ [0, 1] 2 to Gaussian distribution parameters of pixel values based on the context pixel locations and values. Fig. ( 4) is an example of completion results from sampled images. For implementation details, please refer to Appendix (G). 1.06±0.004 0.99±0.006 0.94±0.008 0.89±0.007 3.51±0.008 3.43±0.006 2.60±0.005 2.41±0.005 LSI-NP (ours) 1.09±0.006 1.02±0.004 0.98±0.004 0.94±0.005 3.57±0.003 3.50±0.003 2.75±0.004 2.60±0.005 In the evaluation, the number of context pixels is randomly selected for each image in the dataset. We examine the performance in a testing set of the above image datasets and report the average log-likelihoods in Table (2). We can see that SI-NP achieves the best performance in all image datasets. The performance gaps between SI-NPs and NPs are remarkable. Furthermore, the asymptotic behaviors of all baselines are illustrated in Fig. ( 5). All models exhibit asymptotic behaviors by varying the number of context points, and NPs mainly result in clearly lower log-likelihoods with 10 context points. The observations with different number of context pixels are consistent with the conclusion in Table (2) . Figure 5 : Asymptotic Performance in Image Completion. We meta test pixel average loglikelihoods with varying number of context points in image datasets. Context points are randomly selected for each image in testing processes. For MNIST/FMNIST datasets, the numbers of context pixels in testing are {10, 100, 300, 500, 700}. For CIFAR10/SVHN datasets, the numbers of context pixels in testing are {10, 100, 300, 500, 800, 1000}. Another critical finding is reported in Fig. ( 6), which examines whether the learned functional priors collapse into the deterministic ones. This is based on the average computed trace of covariance matrices in learned functional priors. In MNIST dataset, SI-NPs encounter the prior collapse, and ML-NPs also obtain extremely lower trace values of covariance matrices. We attribute the prior collapse in MNIST to its most superficial structures and limited semantic diversity, which results in lower prior uncertainty. For image datasets with more complicated semantics, the computed trace values in SI-NPs retain a reasonable interval. The scale of SI-NPs' trace values coincides with the semantics complexity of image datasets: CIFAR10>SVHN>FMNIST>MNIST. Another common observation is that with context points increasing, both models' trace values of functional priors in most datasets decrease to a certain level. The above reveals the meaning of the learned functional priors in SI-NPs: The functional prior exhibits higher uncertainty with more complicated semantics. Hence, it has the potential of generating functional samples with more diversity. For datasets with less rich semantics, the randomness of the learned functional priors plays a less critical role, and the deterministic functional representation is capable of function generation. Also, with more observations, the randomness of functional priors can be reduced.  [Σ ϑ (D C τ )] is computed based on p(z|D C τ ; ϑ) = N (z; µ ϑ (D C τ ), Σ ϑ (D C τ )).

6. RELATED WORK

NPs Family. NPs are simple and flexible in model formulations, but they may suffer from underfitting. Previous research focuses more on the use of structural inductive biases in NPs. (Kim et al., 2019; 2021) improves the predictive performance by adding attention based local variables. Translation equivariance and invariance are incorporated in modeling NPs with help of convolutions (Gordon et al., 2019; Foong et al., 2020; Kawano et al., 2020) . To find more compact functional representations, the contrastive loss is used to regularize (C)NPs' training (Gondal et al., 2021) . Hierarchical and mixture structures are also explored to induce diverse latent variables in NPs (Wang & Van Hoof, 2020; Wang & van Hoof, 2022) . Lee et al. (2020) combines boostrapping tricks with neural processes to improve expressiveness of latent variables. Besides, NPs are applied to address sequential decision-making problems (Galashov et al., 2019; Wang & Van Hoof, 2022) . A summary of these models is given in Appendix (D.3). Expectation Maximization & Wake-Sleep Algorithms. For log-likelihood maximization problems with incomplete observations, expectation maximization (Bishop & Nasrabadi, 2006; Balestriero et al., 2020 ) is a tractable approach in optimization. It consists of an E-step to optimize the distribution of latent variables and a M-step to maximize the likelihood parameters. The optimization of VAEs and variants is also built upon an EM framework (Ghojogh et al., 2021; Gao et al., 2021) . Our developed algorithm can be interpreted as EM for NPs. Another family of algorithms related to our method is wake-sleep algorithms (Bornschein & Bengio, 2014; Eslami et al., 2016; Dieng & Paisley, 2019; Le et al., 2020) , where self-normalized importance sampling is used in Monte Carlo estimates. In this case, our method can be also viewed as the extension of the reweighted wake-sleep (RWS) algorithm to NPs with an improvement guarantee. Another difference lies in that optimizing a learnable functional prior is of most importance in NPs, while RWS algorithms are mostly used in scenarios with a fixed prior.

7. CONCLUSION

Technical Discussions. In this paper, we study NPs family from an optimization objective perspective and analyze the inference suboptimality of vanilla NPs. Within the variational expectation maximization framework, our developed SI-NP improves the target likelihood step by step to obtain a more optimal functional prior. Besides, experimental results tell us the learned functional prior has close connections with the number of context points and complexity of function families. Existing Limitations. Compared to vanilla NPs, SI-NP requires more than one Monte Carlo sample from the functional prior in training, which consumes more computations. Though intuitively, the uncertainty of the functional prior can be forward propagated to the output distribution, the influence of such uncertainty has not been mathematically studied. Future Extensions. The SI-NP objective is regardless of neural architectures so that any structural inductive bias can be incorporated in modeling. 

I Computation Tools

A FREQUENTLY ASKED QUESTIONS In this section, we list frequently asked questions from researchers who help proofread this manuscript. These raised questions are crucial, so we include more detailed discussions here. Meanwhile, we make extra efforts for readers to quickly understand our work and make slides as follows https://anonymous.4open.science/r/SI_NPs_Slides-7C94/iclr_ submission_slides.pdf. Learnable Proposal Distributions. The reason why we set the update of learnable proposal distribution an optional choice is that we find it difficult to balance the optimization of Eq. ( 12) and Eq. ( 31). This is non-trivial and assigning equal weights in optimization results in unstable performance in experiments (Please refer to Section (H.1) for more discussion.). So, we leave the improvement of the proposal distribution future research. Suppose a reasonable objective or at least a weight scheduling mechanism for two objectives in two E-steps can be developed to stabilize the training. In that case, SI-NPs are likely to achieve even better performance. Influence of Functional Prior Collapse. Generally, the generated functions are more diverse when the functional prior does not collapse: More randomness can be reflected from the functional prior, and the uncertainty of partial observation can be propagated to the output space. However, in experiments, we observe that for functions with more superficial structures, e.g. MNIST dataset, the deterministic functional representation in CNPs is sufficient to generate functions of high quality. So there is no consensus on the influence of the functional prior collapse on the generation performance. Influence of the Number of Inference Particles in SI-NPs. With more inference particles (latent variables), SI-NPs can avoid the prior collapse in some cases, but the cost is expensive for computations. To trade-off performance and computations, we set the required number of particles for meta training in a small quantity, e.g. 8 particles for image completion tasks. Connections between Different Optimization Objectives. In Table (3), we summarize the connections of different objectives with the meta dataset log-likelihood L(ϑ). Note that L NP (ϑ, ϕ) is not the exact ELBO of L(ϑ). L ML-NP (ϑ) can also be viewed as the importance weighted method with equal weights for all particles. Since the estimates of importance weights in SI-NPs exploit the target observations and consider the difference in particles, this helps improve the model's generalization capability. Though the self-normalized importance sampling makes SI-NPs' estimates over L(ϑ) biased, there is a probabilistic convergence guarantee towards the true integral quantity of our interest (Please refer to Chapter 9 in the Book "Monte Carlo Theory, Methods and Examples" (Owen, 2013) ). Combination with Structural Inductive Biases. The primary research focus is on the optimality of the different optimization objectives, and this study applies to more general stochastic processes (stationary or nonstationary cases). Theoretically, the objective of SI-NPs can be combined with most structural inductive biases in Table ( 4), especially when the function family is complicated. For example, the scale parameter of Matern kernels also varies with kernel values, bringing more variations in realizations. In this case, we guess a global latent variable is difficult to handle: SI-NPs and ML-NPs might encounter a similar performance bottleneck from the empirical observations in the main paper. To further improve the performance, we take the attention module (Kim et al., 2019) as an example to conduct extensive experiments in Section (H.5). Since the meta learning experiment is computationally expensive and time-consuming in training processes, we do not examine combinations with other inductive biases in this paper. GP Oracles in Synthetic Regression. Note that the GP oracle is computed in the form ln p(y 1:n+m |y 1:n , x 1:n+m ), where the predictive distribution is mainly with a non-diagonal covariance matrix (suggesting y n+m not independent in statistics). In comparison, the log-likelihood of NPs family is computed in the form n+m i=1 ln p(y i |[x i , z]; ϑ), where y 1:n+m is conditional independent w.r.t. the global latent variable z. Such a difference in computations causes the mentioned gap. However, with a large neural model, such as the integration of attention networks in Section (H.5), the quantified performance gap of GP Oracles and the log-likelihood of NPs family is well reduced.  X ×Ω → R d such that ν x1,...,x N (F 1 ×• • •× F N ) = P(S x1 ∈ F 1 , . . . , S x N ∈ F N ) if it satisfies exchangeable consistency and marginalization consistency. Here we can translate the generative process of NPs in the following mathematical way. τ ∼ p(T ), z ∼ N (z; µ ϑ (D C τ ), Σ ϑ (D C τ )) x i ∼ p(x), y i ∼ p(y|[x i , z]; ϑ) ∀i ∈ {1, 2, . . . , n + m}

C PREDICTIVE DISTRIBUTIONS IN GPS & NPS

Here we take one-dimensional deep Gaussian processes (Dai et al., 2016) as an example. With context points D C = {(x i , y i )} n i=1 and target points D T = {(x i , y i )} n+m i=1 = [x T , y T ], the key to applications is the predictive distribution p(f (x T )|D C , x T ) = N (y T ; µ T , Σ T ). The conditional mean µ T and covariance Σ T functions in Eq. ( 16) are permutation invariant to the order of context points. µ T = m θ (x T ) + Σ T,C Σ -1 C,C y C -m θ (x C ) Σ T = Σ T,T -Σ T,C Σ -1 C,C Σ C,T Here the covariance matrix denoted by Σ is computed with the context input x C = (x 1 , . . . , x n ) ∈ R n×d , the target input x T = (x 1 , . . . , x n+m ) ∈ R (n+m)×d , and a kernel function ψ, e.g. [Σ C,C ] i,j = ψ(x i , x j ), m θ is the mean function m θ , and the context output is y C = (y 1 , . . . , y n ) ∈ R n . Nevertheless, the computation of matrix inversion in Eq. ( 16) makes the runtime complexity as expensive as O((n + m) 3 ).

D NPS FORMULATION & STRUCTURAL INDUCTIVE BIASES D.1 PRIOR, POSTERIOR & PROPOSAL DISTRIBUTIONS

Since fast adaptation is achieved in an amortized way, which reduces the gradient updates w.r.t. model parameters to learning function specific latent variables with meta trained neural networks. Definition D.1 (Permutation Invariant Functions) Let S n be an n-element permutation group. For any permutation operator g ∈ S n , the function is a bijective mapping from the order set {1, 2, . . . , n} to itself: g : [1, 2, . . . , n] → [g 1 , g 2 , . . . , g n ]. Then given a set of data points {x 1 , x 2 , . . . , x n }, the function Φ is said to be a permutation invariant function if the following equation is satisfied ∀g ∈ S n Φ(g • [x 1 , x 2 , . . . , x n ]) = Φ([x g1 , x g2 , . . . , x gn ]) = Φ([x 1 , x 2 , . . . , x n ]). The context points and the target points are treated as sets, so the amortized network should be permutation invariant w.r.t. the order of data points. Approximate Posterior Distribution. This is denoted by q ϕ (z|D T τ ) in this paper. Usually, the approximate posterior is used in NPs (Garnelo et al., 2018b; Kim et al., 2019) and works as a proxy for the non-analytical exact posterior p(z|D T τ ; ϑ). Prior Distribution. This is denoted by p(z|D C τ ; ϑ) in this paper. Unlike the approximate prior q ϕ (z|D C τ ) used in NPs, we use an exact functional prior in SI-NPs. Proposal Distribution. This is denoted by q η (z|D T τ ) in this paper. The role of the proposal distribution resembles that of the approximate posterior in NPs. It is used to sample latent variables and enables the computation of the importance weights in NPs. The blue one and the pink one are, respectively, the encoder and the decoder structures in vanilla NPs. The approximate posterior, the approximate prior, the learned prior, and the proposal distributions are in the same neural structure as the encoder.

Mean

For the generative distribution in the form of the decoder, the diagram shows one instance (x * , y * ) ∈ D T τ in a generation and the sampled global latent variable z ∼ p(z|D C τ ; ϑ) or z ∼ q ϕ (z|D C τ ) is shared across all data points in one realization. In vanilla NPs, the latent variable z is inferred from a set of data points, and the distributions q ϕ (z) and q ϕ (z|D C τ ) are learned via functions permutation invariant to the order of data points. An example module to parameterize q ϕ (z|D C τ ) = N (z; µ C , Σ C ) can be Eq. ( 17) with a mean pooling operator and {h ϕ , g ϕ } encoder networks. r i = h ϕ ([x i , y i ]) ∀(x i , y i ) ∈ D C τ , r C = N i=1 r i , [µ C , Σ C ] = g ϕ (r C ) The same with that in traditional stochastic processes, a realisation corresponds to a sampled function f (X) generated in a sequential way: z ∼ q ϕ (z|D C τ ), f (X) ∼ p(Y |X, z; ϑ). Since this paper aims to study the influence of different optimization objectives concerning the NP models, we retain the basic model set-up for all baselines. The Fig. ( 7) exhibits the neural architecture of the vanilla NPs, and such a structure is also shared with ML-NPs and SI-NPs in our paper. As previously mentioned, these methods can be straightforwardly incorporated with other structural inductive biases (Kim et al., 2019; Gordon et al., 2019; Gondal et al., 2021; Wang & van Hoof, 2022) . For example, in the following section, we augment all methods with attention networks (Kim et al., 2019) and examine the resulting performance.

D.2 APPROXIMATE ELBOS

IN NPS & PROOF OF REMARK 1 ln p(D T τ |D C τ ; ϑ) = ln p(D T τ |z; ϑ)p(z|D C τ ; ϑ)dz (18a) ≥ E q ϕ (z) ln p(D T τ |z; ϑ) -D KL    q ϕ (z) Approximate Posterior ∥ p(z|D C τ ; ϑ) Functional Prior    = L ELBO (ϑ, ϕ) (18b) ≈ E q ϕ (z) ln p(D T τ |z; ϑ) -D KL    q ϕ (z) Approximate Posterior ∥ q ϕ (z|D C τ ) Approximate Prior    = L NP (ϑ, ϕ) □ (18c) For Eq. (18.b), remember that w.r.t. these VAE-like methods, it is hard to guarantee the improvement of the evidence in each iteration when optimizing ELBO due to the existence of a posterior approximation gap. Meanwhile, the form of the functional prior is unknown. Proof D.1 (Remark 1) Vanilla NPs directly replace the real functional prior by the approximate one q ϕ (z|D C τ ) and introduce the consistent regularizer in Eq. ( 7). We further introduce the prior approximation gap in Eq. ( 19), in which the sign is undetermined. L NP (ϑ, ϕ) = L ELBO (ϑ, ϕ) + E q ϕ (z) ln q ϕ (z|D C τ ) p(z|D C τ ; ϑ) Prior Approximation Gap (19) Based on decomposition in Eq. ( 6)/(19), we can find that there exists no consistent monotonic relationship between L(ϑ) and L NP (ϑ, ϕ). The invalid ELBO makes the optimization w.r.t. L NP (ϑ, ϕ) not always improve log-likelihood L(ϑ). L(ϑ) ≥ L ELBO (ϑ, ϕ), L(ϑ) ≱ L NP (ϑ, ϕ) ∀ϑ ∈ Θ and ϕ ∈ Φ (20) On the left side of Eq. ( 20), the ELBO of NPs can be bounded by the log-likelihood of meta learning dataset. Optimizing the ELBO can guarantee the optimal or local optimal solution when the approximation gap and the amortization gap are well closed. However, the right side of Eq. ( 20) implies the previous commonly-used strategies, e.g. normalizing flows for richer variational posterior distributions (Rezende & Mohamed, 2015) , auxiliary variables for augmented variational posterior distributions (Maaløe et al., 2016) and more flexible prior distributions (Tomczak & Welling, 2018) , to close VAEs inference gaps (Cremer et al., 2018) will not guarantee the performance improvement in a theoretical sense. In other words, the consistent regularizer in NPs is ill-posed for optimization.

D.3 SUMMARY OF NPS FAMILY

In this part, we summarize the encoder and decoders for typical NP variants and point out the inductive biases behind them. The information is summarized in Table (4). We can see a variety of inductive biases can be used to improve NPs, most of them from the model structure perspective rather than the optimization objective. 

D.4 INFERENCE GAPS

In this section, we apply the trick of inference gap decomposition (Cremer et al., 2018) to understand vanilla NPs. Here we denote the approximate inference gap by D AI KL and the posterior approximation gap by D PA KL in Table ( 5). Since it is infeasible to obtain the exact form for the functional posterior, we cannot directly close the mentioned approximate gap, and approximate methods are used to learn them. In vanilla NPs, the consistent regularizer in Eq. ( 7) works as the surrogate for the ELBO. But based on our analysis, closing this surrogate gap cannot lead to a theoretically optimal solution of functional priors. Table 5 : Inference Gaps in vanilla NPs. The ↓ indicates the minimization to obtain the optimal inference solution. The sign -means not applicable in deriving equivalent KL divergence form. The sign * indicates the optimal posterior approximation in the family of variational distributions ϕ * = arg min ϕ∈Φ D KL q ϕ (z) ∥ p(z|D T τ ; ϑ * ) . Here ϑ * consists of the optimal parameters in priors p(z|D C τ ; ϑ * ) and the conditional distribution p(D T τ |z; ϑ * ).

Terms

Optimization Objective KL Divergence or Gaps

Approximate Inference

L(ϑ * ) -L ELBO (ϑ * , ϕ) ↓ D KL q ϕ (z) ∥ p(z * |D T τ ; ϑ) Posterior Approximation L(ϑ * ) -L ELBO (ϑ * , ϕ * ) ↓ D KL q ϕ * (z) ∥ p(z|D T τ ; ϑ * ) Amortization L ELBO (ϑ * , ϕ * ) -L ELBO (ϑ * , ϕ) ↓ D AI KL -D PA KL NP Prior Approximation |L ELBO (ϑ * , ϕ * ) -L NP (ϑ * , ϕ * )| - Surrogate Likelihood L(ϑ * ) -L(ϑ; ϑ k ) ↓ L(ϑ * ) -L(ϑ H )

E FORMULATION OF VARIATIONAL EXPECTATION MAXIMIZATION METHOD

In this section, we detail the progress of optimizing the NP model with the help of variational expectation maximization algorithms. The concept of meta learning surrogate functions is introduced, and NPs are verified. Meanwhile, the improvement guarantee as well as other concerning technical points are included to better understand our method. • Both f (ϑ) and g(ϑ; ϑ k ) are C 2 -functions, which means f (ϑ) has first order and second order derivatives of those functions exist and these are smooth in the domain of the function. • The following two formulas hold: g(ϑ; ϑ k ) ≤ f (ϑ) ∀ϑ; g(ϑ k ; ϑ k ) = f (ϑ k ). ( ) When the objective function f (ϑ) to maximize is complicated, e.g. multi-modal likelihood functions, a surrogate function g(ϑ; ϑ k ) enables an easier proxy implementation with convergence guarantee to at least the local optimal. To see this point, recall the properties of the surrogate function g(ϑ k ; ϑ k ) = f (ϑ k ). The update rule for the surrogate function follows that ϑ k+1 = arg max ϑ g(ϑ; ϑ k ). And this results in f (ϑ k+1 ) ≥ g(ϑ k+1 ; ϑ k ) ≥ f (ϑ k ). Recall the following function L(ϑ; ϑ k ) in the main paper. L(ϑ; ϑ k ) = τ ∈T E p(z|D T τ ;ϑ k ) ln p(D T τ , z|D C τ ; ϑ) -ln p(z|D T τ ; ϑ k ) The expectation operation in the k-th iteration step corresponds to E-step : q ϕ (z|D C τ ) = p ϑ k (z|D C τ ) , while the maximization operation in the k-th iteration step updates the parameter as M-step : ϑ k+1 = arg max ϑ L(ϑ; ϑ k ). ln p(D T τ |D C τ ; ϑ k ) Model Evidence = E-step L(ϑ k ; ϑ k ) ≤ M-step L(ϑ k+1 ; ϑ k ) (23a) ≤ L(ϑ k+1 ; ϑ k ) + D KL [p(z|D T τ ; ϑ k ) ∥ p(z|D T τ ; ϑ k+1 )] = ln p(D T τ |D C τ ; ϑ k+1 ) Model Evidence □ (23b) E.1.2 IMPROVEMENT GUARANTEE As illustrated in Eq.s ( 23), the surrogate function L(ϑ; ϑ k ) is bounded by two log-likelihoods. Over the process of iterations, the log-likelihood is gradually increased to the final convergence. ln p(D T τ |D C τ ; ϑ 1 ) ≤ L(ϑ 2 ; ϑ 1 ) ≤ ln p(D T τ |D C τ ; ϑ 2 ) ≤ • • • ≤ L(ϑ H ; ϑ H-1 ) ≤ ln p(D T τ |D C τ ; ϑ H ) This indicates that the finally updated parameters of the surrogate function are exactly (sub-)optimal ones for the evidence. Based on these rules, directly optimizing the surrogate function step by step can theoretically guarantee the finding of optimal or at least a local optimal parameters.

E.2 IMPORTANCE SAMPLING IN A VARIATIONAL EM ALGORITHM

Though the conditional marginal distribution p(D T τ |D C τ ; ϑ k ) is not analytical, the importance sampling trick can be used to estimate the result with the help of a proposal distribution q η (z|D T τ ). p(D T τ |D C τ ; ϑ k ) = p(D T τ , z|D C τ ; ϑ k )dz = q η (z|D T τ ) p(D T τ , z|D C τ ; ϑ k ) q η (z|D T τ ) dz ≈ 1 B B b=1 p(D T τ , z (b) |D C τ ; ϑ k ) q η (z (b) |D T τ ) = 1 B B b=1 ω (b) , with z (b) ∼ q η (z|D T τ ) and ω (b) = p(D T τ , z (b) |D C τ ; ϑ k ) q η (z (b) |D T τ ) Especially, the joint distribution is computed via the decomposition that p( b) ]; ϑ k ). With the above equation, the intractable optimization objective is transformed into a feasible one. D T τ , z (b) |D C τ ; ϑ k ) = p(z (b) |D C τ ; ϑ k )p(D T τ |z (b) ; ϑ k ) with p(D T τ |z (b) ; ϑ k ) = n+m i=1 p(y i |[x i , z L EM (ϑ; ϑ k ) = E p(z|D T τ ;ϑ k ) ln p(D T τ , z|D C τ ; ϑ) = p(D T τ , z|D C τ ; ϑ k ) p(D T τ |D C τ ; ϑ k ) ln p(D T τ , z|D C τ ; ϑ)dz (26a) = q η (z|D T τ ) p(D T τ , z|D C τ ; ϑ k ) q η (z|D T τ )p(D T τ |D C τ ; ϑ k ) ln p(D T τ , z|D C τ ; ϑ)dz (26b) ≈ 1 B B b=1 ω (b) p(D T τ |D C τ ; ϑ k ) ln p(D T τ , z (b) |D C τ ; ϑ) (26c) = B b=1 ω (b) B b ′ =1 ω (b ′ ) ln p(D T τ , z (b) |D C τ ; ϑ) = B b=1 ω(b) ln p(D T τ , z (b) |D C τ ; ϑ) = L SI-NP (ϑ, η; ϑ k ) (26d) We can expand the term inside the expectation in Eq (26.a) as follows. p(D T τ , z (b) |D C τ ; ϑ) = p(z (b) |D C τ ; ϑ)p(D T τ |z (b) ; ϑ) As for the exact posterior p(z|D T τ ; ϑ k ), we can get the following expansion. p(z|D T τ ; ϑ k ) = p(z, D T τ ; ϑ k ) p(z, D T τ ; ϑ k )dz = p(z|D C τ ; ϑ k )p(D T τ |z; ϑ k ) p(z|D C τ ; ϑ k )p(D T τ |z; ϑ k )dz The distribution has the complicated denominator and this makes it infeasible to directly sample from the conditional distribution.

E.3 OPTIMIZATION OBJECTIVE WITH PROPOSAL DISTRIBUTIONS

This subsection shows the mentioned optional optimization step E-step #2 in Algorithm (1).

E.3.1 UPDATE OF PROPOSAL DISTRIBUTIONS (OPTIONAL)

The use of proposal distribution q η enables sampling z for Monte Carlo estimates. Another role of the proposal distribution is to work as a proxy for the posterior p(z|D T τ ; ϑ k ), and the variance V qη p(z|D T τ ;ϑ k ) qη(z|D T τ ) ln p(D T τ , z|D C τ ; ϑ ) is expected to be lower for stable training. So a reasonable optimization objective is to get two distributions as close as each other, e.g. minimizing the Kullback-Leibler divergence D KL [p(z|D T τ ; ϑ k ) ∥ q η (z|D T τ )]. This treatment is equivalent to wake phase updates in (Bornschein & Bengio, 2014) . We can obtain the optimization objective on the right side of Eq. ( 29) with the help of the self-normalized importance sampling. min η D KL [p(z|D T τ ; ϑ k ) ∥ q η (z|D T τ )] ⇔ min η L KL (η; η k-1 , ϑ k ) = - B b=1 ω(b) ln q η (z (b) |D T τ ) Here the self-normalized importance weights {ω (b) } B b=1 inside the above equation are the same as that in Eq. ( 12). Also note that the the denominator 1 B B b ′ =1 ω (b ′ ) of the weight relates to the estimate of p(D T τ |D C τ ) , which has biases with limited samples at the beginning of training. However, with the improvement of approximation, the bias can be decreased accordingly (Zimmermann et al., 2021) .

E.3.2 DERIVATION OF THE PROPOSAL UPDATE OBJECTIVE

It is trivial to see the following equation since the term E p(z|D T τ ;ϑ k ) p(z|D T τ ; ϑ k ) is a constant. min η D KL [p(z|D T τ ; ϑ k ) ∥ q η (z|D T τ )] ⇔ min η -E p(z|D T τ ;ϑ k ) [ln q η (z|D T τ )] Once again, we apply self-normalized importance sampling to the right side of Eq. ( 30). With the same set of sampled latent variables, the reweighted objective w.r.t. the proposal distribution can be derived. min η -E p(z|D T τ ;ϑ k ) [ln q η (z|D T τ )] ≈ - B b=1 ω(b) ln q η (z (b) |D T τ ) = L KL (η; η k-1 , ϑ k )

E.4 GRADIENT ESTIMATES IN VARIATIONAL EM

In the E-step #2, note that the model parameter is fixed as ϑ k , we can estimate the gradient of η w.r.t. L KL (η; η k-1 , ϑ k ) in the following way. This operation is to close the divergence between the proposal distribution and the exact posterior distribution. ∂L KL (η; η k-1 , ϑ k ) ∂η = B b=1 ω(b) ∂ ln q η (z (b) |D T τ ) ∂η In the M-step, note that the normalized importance weights are constant, the proposal distribution is fixed, and the gradient w.r.t. L SI-NP (ϑ; η k , ϑ k ) can be estimated in a straightforward way as follows. ∂L SI-NP (ϑ; η k , ϑ k ) ∂ϑ = B b=1 ω(b) ∂ ln p(z (b) |D C τ ; ϑ) ∂ϑ + ∂ ln p(D T τ |z (b) ; ϑ) ∂ϑ We can see the role of normalized importance weights in gradient estimates from Eq. ( 33): remember that ω(b) ∝ p(D T τ |z (b) ; ϑ k ) in the k-th iteration when the functional prior works as the proposal distribution, so the more gradients will be allocated to those particles with higher generative likelihoods. Meanwhile, the normalized importance weights influence the optimization of the functional prior distribution so that the functional prior can match the best set of particles well.

F CONNECTION WITH CNPS F.1 PRIOR COLLAPSE IN SI-NPS WITH ONE MONTE CARLO SAMPLE

Theorem 1 (L'Hôpital's Rule (Hospital, 1696) ) Let f (x) and g(x) be two functions differentiable on an open interval I except possibly at a point c contained in I. If lim x→c f (x) = lim x→c g(x) = ∞, and 1 g(x)f (x) ′ ̸ = 0 with ∀x ∈ I and x ̸ = c, we can have the following limit equation. lim x→c f (x) -g(x) = lim x→c 1 g(x) -1 f (x) 1 g(x)f (x) = lim x→c 1 g(x) -1 f (x) ′ 1 g(x)f (x) ′ (34) Proof F.1 (Proposition 3) Note that in our default setup of SI-NPs, the functional prior works as the proposal distribution to sample the latent variable. Let z ∈ R d be the latent variable for a diagonal Gaussian conditional prior p(z|D C τ ; ϑ) = N (z; µ ϑ (D C τ ), Σ ϑ (D C τ )). Here the learned mean and the covariance matrix are simply denoted by µ ϑ = [µ 1 , . . . , µ d ] T ∈ R d and Σ ϑ = diag σ 2 1 , . . . , σ 2 d . With one Monte Carlo sample ẑ = [ẑ 1 , . . . , ẑd ] T from the conditional prior, it can be written as ẑ = µ ϑ + εΣ 1 2 ϑ , ε ∼ N (0, I d ) with help of a reparameterization trick (Kingma & Welling, 2013). E p(z|D C τ ;ϑ k ) ln p(z|D C τ ; ϑ) ≈ ln p(z|D C τ ; ϑ) = - 1 2 ln(2π) + d i=1 -ln σ i - (µ i -ẑi ) 2 2σ 2 i Eq. ( 35) is the result of one Monte Carlo estimate, termed as the collapse term in the main paper. Built up on these, we rewrite the SI-NP optimization objective to maximize as Eq. ( 36). L SI-NP = E p(z|D C τ ;ϑ) ln p(D T τ |z; ϑ) + E p(z|D C τ ;ϑ k ) ln p(z|D C τ ; ϑ) ≈ n+m i=1 ln p(y i |[x i , µ ϑ + εΣ 1 2 ϑ ; ϑ]) - 1 2 ln(2π) + d i=1 ln σ i + (µ i -ẑi ) 2 2σ 2 i ( ) Now we prove that when the learned variance parameter {σ i } d i=1 collapse into the value zero, the optimization objective in Eq. ( 35) and Eq. ( 36) can be maximized. To simplify the notation, we put the mean variable µ ϑ aside, focus more on the variance variable Σ ϑ and let the value µi-ẑi 2 denoted by κ i . We can directly apply L'Hôpital's Rule in Theorem (1) to these quantities and obtain the following equations. lim σi→0 -ln σ i - κ i σ 2 i = lim σi→0 σ 2 i κi + 1 ln σi - σ 2 i κi ln σi = lim σi→0 σ 2 i ln σ i + κ i -σ 2 i = lim σi→0 σ 2 i ln σ i + κ i ′ (-σ 2 i ) ′ = lim σi→0 2σ i ln σ i + σ i -2σ i = lim σi→0 -ln σ i - 1 2 = +∞ Putting them together, the Gaussian latent variable will finally collapse into a Dirac delta distribution and this demonstrates the equivalence between SI-NP with one Monte Carlo sample and CNP.

F.2 EFFECT OF ADJUSTING COEFFICIENTS OF THE PRIOR REGULARIZER

Remark 2 Either increasing the number of Monte Carlo samples or lower the weight of the collapse term in SI-NPs can effectively avoid the prior collapse. With increase of Monte Carlo samples, we can see the scale of the generative term outweights that of the collapse term, which indicates more weights are put in the gradient w.r.t. the variance parameters to maximize the generative log-likelihood. In this way, it can naturally avoid the prior collapse caused by the second term. Next, we analyze the influence of the coefficients α with α ∈ (0, 1) for the generative and functional prior likelihoods. To this end, we go back to the original meta learning objective and introduce the α-dependent one as L(ϑ; α) in Eq. ( 38). max ϑ E p(z|D T τ ;ϑ k ) ln p(D T τ |z; ϑ) + (1 -α)E p(z|D T τ ;ϑ k ) ln p(z|D C τ ; ϑ) = L(ϑ; α) ⇔ max ϑ E p(z|D T τ ;ϑ k ) ln p(D T τ |z; ϑ) + (1 -α)E p(z|D T τ ;ϑ k ) ln p(z|D C τ ; ϑ) p(z|D T τ ; ϑ k ) + ln p(z|D T τ ; ϑ k ) ⇔ max ϑ E p(z|D T τ ;ϑ k ) ln p(D T τ |z; ϑ) + (1 -α)D KL p(z|D T τ ; ϑ k ) ∥ p(z|D C τ ; ϑ) +(1 -α) E p(z|D T τ ;ϑ k ) p(z|D T τ ; ϑ k ) Constant ⇔ max ϑ E p(z|D T τ ;ϑ k ) ln p(D T τ |z; ϑ) + (1 -α)D KL p(z|D T τ ; ϑ k ) ∥ p(z|D C τ ; ϑ) As noticed in L(ϑ; α), the KL divergence constraint enforces the functional prior p(z|D C τ ; ϑ) to be close to the learned functional posterior p(z|D T τ ; ϑ k ). Since the functional posterior is derived from the full observation of a function, this suggests that the relation of conditional entropy H(z|D T τ ) ≤ H(z|D C τ ). This trait is also empirically supported by Fig. ( 6), where the functional priors with fewer context points exhibit higher values in the trace of the covariance matrices. As consequence, larger α values impose less constraint on the KL divergence term and results in higher entropy of p(z|D C τ ; ϑ).

G EXPERIMENTAL SETUP & IMPLEMENTATION DETAILS

In all experiments, we use Adam (Kingma & Welling, 2013) as the default optimizer for all experiments. Pytorchfoot_4 works as the toolkit to program and run experiments. The implementation of vanilla NPs/CNPs/ANPs follows the repositoryfoot_5 and related work (Garnelo et al., 2018a; b; Kim et al., 2019) in DeepMind. The implementation of ML-NPs is based on that in (Foong et al., 2020) except that the convolution modules are removed for the fair comparison since the inference objective is our research focus. To enable researchers to implement our developed method in studies, we leave the anonymous Github link here: https://anonymous.4open. science/r/SI_NPs-C832, where we provide an example of SI-NPs. The full code implementations of our method will be released in the final version. G.1 META LEARNING DATASETS Synthetic Regression. We use the Gaussian process simulator to generate different stochastic functions. Three types of kernels are used to formulate diverse Gaussian processes. This is the same as that in (Lee et al., 2020; Kawano et al., 2020) . For each task, we generate x from the uniform distribution U [-2.0, 2.0] and the mean function is zero. The number of context points is drawn from n ∼ U [3, 47] and the number of target points is n + m with m ∼ U [3, 50 -n]. The following kernel functions specify different Gaussian processes in this paper. • Matern -5 2 kernel: k(x, x ′ ) = s 2 1 + √ 5d l + 5d 2 3l 2 exp - √ 5d l with d = 4|x -x ′ |, s ∼ U [0.1, 1.0] and l ∼ U [0.1, 0.6]; • RBF kernel: k(x, x ′ ) = s 2 exp - (x -x ′ ) 2 2l 2 with s ∼ U [0.1, 1.0] and l ∼ U [0.1, 0.6] • Periodic kernel: k(x, x ′ ) = s 2 exp   -2 sin 2 ( π||x-x ′ || 2 p ) l 2   with s ∼ U [0.1, 1.0], l ∼ U [0.1, 0.6] and p ∼ U [0.1, 0.5] Image Datasets. Benchmark image datasets include MNIST (Bottou et al., 1994) , FMNIST (Xiao et al., 2017) , CIFAR10 (Krizhevsky et al., 2009) and SVHN (Sermanet et al., 2012) . In meta training and testing, we randomly select the number of context pixels n for each sampled batch of images (n ∼ U [1, 784] in MNIST/FMNIST and n ∼ U [1, 1023] in CIFAR10/SVHN). For pixel values, they are transformed to normalized Tensors via pytorch package. All other set-ups are the same as in (Garnelo et al., 2018a; b; Kim et al., 2019) . Sim2Real Dataset. This is a part of additional experiments. We retain the preprocessing and meta training set-up in (Gordon et al., 2019) . The meta training datasets include the Lotka-Volterra simulation samples and Predator-Prey's real-world dataset. All datasets are normalized before the meta training process. Please refer to (Gordon et al., 2019) for more details.

G.2 NEURAL ARCHITECTURES & OPTIMIZATIONS & EVALUATION SET-UP

Synthetic Regression. In terms of neural architectures, we use the same setup as that in (Gordon et al., 2019; Lee et al., 2020) for all baselines. The dimension of latent variables is 128. The Encoder is a two hidden layer neural network with 128 neuron units for each layer. The Decoder is a one hidden layer neural network with 128 neuron units. The optimizer's learning rate is 5e -4. For all methods, we sample 100 tasks as one batch to train in each iteration, and the number of iteration steps in meta training is 100000. In meta training, the numbers of Monte Carlo samples for ML-NPs and SI-NPs are 16, while those in meta testing are 32. Image Completion. The setup is the same with that in (Garnelo et al., 2018a) and works for all NPs variants. As default, we set the dimension of latent variables z as 128 for all baselines. For all baselines, the Encoder is constituted with three hidden layers (128 neuron units each). The Decoder has five hidden layers (128 neuron units each) as well. The learning rate for the optimizer is 5e -4. The training batch size for all images is 4 and we meta train the model until convergence (the maximum epoch number for MNIST/FMNIST is 100, and that for CIFAR10/SVHN is 200, and early stop is used when it reaches convergence). In meta training, the numbers of Monte Carlo samples are 16 for ML-NPs and 8 for SI-NPs (We find that 8 Monte Carlo samples are enough for SI-NPs to obtain competitive performance and increasing the number of particles will require more computations but the performance improvement is minor). In meta testing, the Monte Carlo sample numbers are 32 for latent variable models. Sim2Real. We use the same setup as that in (Gordon et al., 2019; Lee et al., 2020) for all baselines. The neural architectures are the same as the Synthetic regression settings, except the dimension of output variables changes to 2. For other settings, please refer to (Gordon et al., 2019) for more details. As in (Garnelo et al., 2018a) , for the Decoder in all models, we use the modified standard deviation variable σi = 0.1 + 0.9 * σ i for the output distribution p(y i |x i , z; ϑ) in all benchmarks, where σ i = MLP ϑ (x i , z i ). Evaluation Set-up of Datasets. For Table (1)/(2) in meta testing, we keep the same set-up as (Foong et al., 2020; Gordon et al., 2019; Lee et al., 2020; Kawano et al., 2020) , which randomly select the number of context points in each batch and average the testing results of all batches as the final result. The range of the number of context points can be found in the above subsection.

H MORE EXPERIMENTAL RESULTS

H.1 SI-NPS WITH A LEARNABLE PROPOSAL DISTRIBUTION We have investigated the use of a learnable proposal distribution in SI-NPs. In this case, an additional proposal distribution q η (z) = N (z; µ η (D T τ ), Σ η (D T τ )) is introduced to sample the latent variables. The neural architecture of such a proposal distribution is the same as the approximate posterior in vanilla NPs, but the role is entirely distinguished. Unfortunately, we find that it is difficult to stabilize performance even though coefficients of the generative log-likelihood ln p(D T τ |z (b) ; ϑ), the functional prior log-likelihood ln p(z (b) |D C τ ; ϑ) and the proposal likelihood ln q η (z (b) |D T τ ) are tuned in a lot of trials. Since there is no convergence solution when assigning the same weights to minimize Eq. ( 31) and the negative of Eq. ( 12) with a shared optimizer, we do not report the result here.

H.2 EVALUATION WITH MORE MONTE CARLO PARTICLES

In the main paper, evaluating the exact likelihood L(ϑ) = ln p(D T τ |z; ϑ)p(z|D C τ ; ϑ)dz is intractable for the studied latent variable models, so we report the evaluation results by setting B the number of the Monte Carlo particles in Eq. ( 39) fixed. 8), SI-NPs show performance improvement with more particles in FMNIST/CIFAR10/SVHN, and B = 32 is sufficient enough in evaluationfoot_6 . For CNPs, the performance does not change with more particles since the functional prior is collapsed. While for NPs, the performance goes far worse than importance-weighted ones and does not significantly improve with more particles. So we only show the ML-NPs and SI-NPs for better visual comparison. Theoretical Explanation. Note that ML-NPs and SI-NPs are importance weighted methods, the number of Monte Carlo samples matters. It is theoretically proved in Theorem 1 in the paper importance weighted autoencoders (Burda et al., 2016)  that L MC (ϑ; B 1 ) ≥ L MC (ϑ; B 2 ) with B 1 ≥ B 2 . Our developed SI-NPs can be viewed as the conditional version of importance weighted autoencoders, which explains the empirical observations in Fig. ( 8). However, when the prior is collapsed to a deterministic embedding, we do not expect this effect.

H.3 INFLUENCE OF DIMENSIONS OF LATENT VARIABLES

It is unrealistic to explore all hyper-parameters' influence: e.g., learning rates, numbers of layers, dimensions of each layer, types of activation function, batch size in training, dimensions of latent variables, the Cartesian of these hyper-parameters causes dimension explosion, and the required time of running experiments can be more than one year with limited GPUs. So we keep most of the set-up of the above hyper-parameters the same as that in original papers of CNP/NP/ANPs. 4, 8, 16, 32, 64, 96, 128} . The first row is to report the log-likelihood of the target points, while the second row is for the log-likelihood of the context points. In this subsection, we study the influence of dimensions of latent variables on SI-NPs. Considering that the meta training process is time expensive, we report the result on FMNIST image completion in Table ( 6). It can be observed that this factor rarely influences test performance. 

H.4 SIM2REAL EXPERIMENTAL RESULTS

Following that in (Gordon et al., 2019; Lee et al., 2020) , We conduct experiments in Lotka-Volterra dynamical systems. The transition dataset is collected in the mentioned simulator for meta training. The testing scenarios include Lotka-Volterra simulators and a Predator-Prey's real-world dataset Hudson's Baye hare-lynx. The meta testing results are reported in Table (7) . It can be seen that SI-NPs can achieve the best performance in Lotka-Volterra and Predator-Prey datasets. Since the Predator-Prey dataset is out of the meta training distribution, the log-likelihoods are significantly lower than the Lotka-Volterra ones. Theoretically, we can combine different optimization objectives of NPs with various structural inductive biases. Take the attention inductive bias as an example; we augment the vanilla SI-NP with attention networks the same as in (Kim et al., 2019) and compare the augmented one with other augmented baselines. To enable fair comparison, we also augment other baselines with attention networks. We apply the modification to all methods, and this operation results in ANP (Kim et al., 2019) , ML-ANP and SI-ANPs. Neural Architectures. We retain all neural architectures the same as in Appendix (G.2), except for adding the deterministic path to obtain a local deterministic variable (Kim et al., 2019) . The deterministic path is built with a cross attention encoder. Due to memory restriction, one head is used to obtain a 128 dimensional local variable, and the neural network set-up for query/key/value is the same as (Kim et al., 2019)  Optimization. We retain the optimization step the same as those in Appendix (G.2).

H.5.1 SYNTHETIC REGRESSION

In Table (8) , it can be seen SI-ANPs outperform ANPs in all kernel cases. SI-ANPs show a slight advantage over ML-ANPs in Marten and RBF kernels. We also observe anomaly results of ANPs in the Periodic kernel case, which illustrates that the use of attention might deteriorate the performance. Table 8 : Test average log-likelihoods of target data points with reported standard deviations for 1-dimensional Gaussian process dataset with various kernels (5 runs). For each run, we randomly sample 1000 functions as tasks to evaluate. All NP models are augmented with attention networks. Settings are same as in the main paper. # Matern -5 2 RBF Periodic L ANP (Kim et al., 2019) 0.98±0.021 1.06±0.020 -1.346±0.102 L ML-ANP (Foong et al., 2020) 0.985±0.019 1.062±0.019 0.574±0.023 L SI-ANP (ours) 0.995±0.017 1.071±0.017 0.56±0.024

H.5.2 IMAGE COMPLETION

In Table (9), we report the results in image completion. We notice that the SI-ANP significantly beats other models in FMNIST/SVHN/CIFAR10 and is comparable with the ML-ANP in MNIST. Meanwhile, the performance gap between vanilla NPs and SI-NPs in Table (2) is quite huge, while this situation is alleviated after adding attention modules. All of these indicate that incorporating structural inductive biases in complicated tasks is also necessary. Combining SI-NPs with more powerful structural inductive biases is the top choice for boosting performance. (Kim et al., 2019) 1.173±0.008 1.101±0.01 4.011±0.005 3.605±0.016 L ML-ANP (Foong et al., 2020) 1 In this case, we guess the local deterministic embedding from the attention network plays a more critical role than the global latent variable in fitting these two datasets. The visualization of fitting Predator-Prey samples with SI-ANPs is given in Fig. (10) . The real-world samples are well fitted with well quantified standard deviations. We notice that the measured predictive mean square errors and the negative log-likelihoods of the target data points in the Predator-Prey testing dataset are comparable using SI-NPs and SI-ANPs. However, the fitting result of the context data points are distinguished a lot: the mean square error of SI-NPs in 4 runs is 0.285±0.007 and that of SI-ANPs is 0.029±0.004. This implies that the attention network in the real-world dataset focuses more on the context point fitting while the global latent variable is for the general trend characterization. 

I COMPUTATION TOOLS

In this project, we use NVIDIA 1080-TiGPUs to finish all experiments.



Meta training and testing phases are implemented in a batch of tasks consistent with Eq. (2)/(3). The Monte Carlo maximum likelihood corresponds to the optimization objective that in(Foong et al., 2020) except that the convolutional inductive bias is removed. Though the exact posterior distribution p(z|D T ; ϑ) can be inferred by Bayes rule, the denominator p(D T τ |z; ϑ)p(z|D C τ ; ϑ)dz is not available. Note that the surrogate function above is defined in the Minorize-Maximization (MM) algorithm sense(Hunter & Lange, 2004). https://pytorch.org/ https://github.com/deepmind/neural-processes Even though in FMNIST, more performance gain is observed using SI-NPs with more particles, B = 32 is sufficient and fair enough for all benchmarks. https://github.com/deepmind/neural-processes



Figure 2: Illustration of Expectation Maximization for NPs.Green lines indicate the results after the E-steps while the red lines are for the M-steps in Algorithm (1). In the convergence iteration, the performance gap L(ϑ H ) -L(ϑ H-1 ) is close to zero and the algorithm results in at least a local optimal solution. Values of these quantities are increased from the left to the right.

Variational Expectation Maximization for NPs. Input : Task distribution p(T ); Task batch size, Number of particles, Initialized ϑ and η. Output: Meta-trained parameters ϑ and η. for k = 1 to K do E-step #1: k ← k + 1 and reset the variational posterior q ϕ (z) = p(z|D T τ ; ϑ k ) in Eq. (5); if Use the Functional Prior as the Proposal then Reset q η (z|D T τ ) = p(z|D C τ ; ϑ k ); else E-step #2: update the proposal η k = arg min η L KL (η; η k-1 , ϑ k ) in Eq. (29) according to operations in Appendix (E.3.1); end M-step: optimize surrogate functions ϑ k+1 = arg max ϑ L SI-NP (ϑ; η k , ϑ k ) in Eq. (12); end 4.1.1 SURROGATE FUNCTION FOR EXACT NPS

Figure 3: Examples of Curve Fitting in RBF Kernel Cases. The plots report predictive mean functions with ±3 standard deviations.

Figure 4: SI-NP Completed Images. From top to bottom in rows are original images, context points, learned predictive means and variances of sampled images.

Figure 6: Statistics of Learned Functional Priors in ML-NPs/SI-NPs. In meta testing, we still vary the number of context points in image datasets. The trace of learned functional priors' covariance matrices Tr[Σ ϑ (D C τ )] is computed based on p(z|D C τ ; ϑ) = N (z; µ ϑ (D C τ ), Σ ϑ (D C τ )).

Figure7: Computational Diagram of Vanilla NPs. The blue one and the pink one are, respectively, the encoder and the decoder structures in vanilla NPs. The approximate posterior, the approximate prior, the learned prior, and the proposal distributions are in the same neural structure as the encoder. For the generative distribution in the form of the decoder, the diagram shows one instance (x * , y * ) ∈ D T τ in a generation and the sampled global latent variable z ∼ p(z|D C τ ; ϑ) or z ∼ q ϕ (z|D C τ ) is shared across all data points in one realization.

MC (ϑ; B) = ln 1 B B b=1 exp ln p(D T τ |z (b) ; ϑ) with z (b) ∼ p(z|D C τ ; ϑ) (39) Empirical Explanation. We set B = 32 in evaluation based on the empirical observations because the evaluated likelihood for all experiments nearly reaches the convergence or does not significantly increase. To see this point, we give the illustration of the evaluation results by varying the number of Monte Carlo samples and analyzing the test log-likelihoods in image completion tasks. As displayed in Fig. (

Figure 8: Evaluation of Image Completion with Varying Number of Monte Carlo Particles. The xaxis records the number of Monte Carlo particles B in the set {1,4, 8, 16, 32, 64, 96, 128}. The first row is to report the log-likelihood of the target points, while the second row is for the log-likelihood of the context points.

Figure 9: From the Left to the Right are population fitting results with ±1 standard deviation in the predator and prey datasets with the meta trained SI-NPs. The x-axis corresponds to normalized years from 1845 to 1935 in order.

Figure 10: From the Left to the Right are population fitting results with ±1 standard deviation in the predator and prey datasets with the meta trained SI-ANPs. The x-axis corresponds to normalized years from 1845 to 1935 in order.

Figure 14: Examples of Curve Fitting in Periodic Kernel Cases.

Figure 16: Examples of Image Completion Results using SI-NPs. From top to bottom in rows are original images, context points, means, and variances of completed images.

Figure 17: Examples of CIFAR10 Image Completion Results using SI-ANPs. From left to right and top to bottom are cases with 10, 100, 300, 500, and 800 randomly selected pixels as the context points.

Figure 18: Examples of FMNIST Image Completion Results using SI-ANPs. From left to right and top to bottom are cases with 10, 100, 300, 500, and 700 randomly selected pixels as the context points.



Inference Suboptimality in vanilla NPs . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Evaluation Criteria & Asymptotic Performance . . . . . . . . . . . . . . . . . . . Variational Expectation Maximization for NPs . . . . . . . . . . . . . . . . . . . . 4.1.1 Surrogate Function for Exact NPs . . . . . . . . . . . . . . . . . . . . . . Meta Learning Surrogate Function . . . . . . . . . . . . . . . . . . . . . . .3.2 Derivation of the Proposal Update Objective . . . . . . . . . . . . . . . . E.4 Gradient Estimates in Variational EM . . . . . . . . . . . . . . . . . . . . . . . . Prior Collapse in SI-NPs with One Monte Carlo Sample . . . . . . . . . . . . . . . F.2 Effect of Adjusting Coefficients of the Prior Regularizer . . . . . . . . . . . . . . Meta Learning Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G.2 Neural Architectures & Optimizations & Evaluation Set-up . . . . . . . . . . . . .

A Summary of Optimization Objectives in NPs Family. We list the available optimization objectives in Section (3)/(4). For Importance Weighted Estimates, multiple Monte Carlo samples are required in meta training.Optimization Objective Connection with L(ϑ) in Eq. (4) Importance Weighted Estimates L NP (ϑ, ϕ) We denote a probability space of functions by (Ω, F, P). Let ν x1,...,x N be a probability measure on R d with {x 1 , . . . , x N } a finite index set. The process is called an exchangeable stochastic process S :

Summary of Typical Neural Process Related Models (Meta-Testing Scenarios). The recognition model and the generative model respectively correspond to the encoder and the decoder in the family of neural processes. Here [x C , y C ] are context data points and we consider (x * , y * ) a data point in target dataset.

Given the objective function f (ϑ) to maximize, surrogate functions g(ϑ; ϑ k ) w.r.t. f (ϑ) are a family of functions with the following properties.

Test average log-likelihoods with reported standard deviations for image completion in FMNIST (5 runs). We test the performance of different optimization objectives in both context data points and target data points. We use 32 Monte Carlo samples from the functional prior to evaluate the average log-likelihoods.

Test average log-likelihoods of target data points with reported standard deviations for Sim2Real (4 runs). We test the performance of different optimization objectives in both context data points and target data points. For each run, we randomly sample 1000 functions as tasks to evaluate.

Test average log-likelihoods of target data points with reported standard deviations for image completion in MNIST/FMNIST/SVHN/CIFAR10 (4 runs). All NP models are augmented with attention networks. Same as in the main paper, we test the performance of different optimization objectives in target data points. We use 32 Monte Carlo samples from the functional prior to evaluate the average log-likelihoods.

In Table(10), we report the results with the attention module augmentation. As observed, ANPs, ML-ANPs, and SI-ANPs exhibit comparable performance in Lotka-Volterra simulation. As for the Predatory-Prey testing results, the conclusion is similar to that in Table (7) without attention augmentations. Test average log-likelihoods of target data points with reported standard deviations for Sim2Real (4 runs). We test the performance of different optimization objectives augmented by attention inductive bias(Kim et al., 2019) in both context data points and target data points. For each run, we randomly sample 1000 functions as tasks to evaluate.

ACKNOWLEDGEMENT

We thank Dharmesh Tailor for the helpful discussions. We thank Eric Nalisnick and Tim Bakker for the helpful feedback in the AMLab Seminar talks.

annex

 13)/( 14), we respectively plot the curve fitting results in all kernel cases when all latent variable models are augmented by attention networks. In Marten and RBF cases, all attention augmented models well fit data points. Nevertheless, in RBF cases, SI-NPs can better capture data point-dependent deviations. In Periodic cases, it can be observed that SI-ANPs can precisely capture fluctuations and quantify more convincing uncertainty. Sometimes, ANPs underestimate the uncertainty while ML-ANPs fail to show data point distinguished uncertainty.

H.6.2 MORE IMAGE COMPLETION RELATED RESULTS

In Fig. ( 15), the scale of computed KL divergence values in vanilla NPs is positively correlated with the semantic complexity. Also, note that the approximate functional prior in vanilla NPs seldom collapses, but it has a theoretical bias away from the optimal functional prior according to Remark (1). We can also find with more context points, the approximate prior is closer to the approximate posterior, so the value decreases accordingly.In Fig. ( 16), we sample a collection of images from MNIST/FMNIST/SVHN/CIFAR10 to visualize the completed results. We can find that without structural inductive biases, SI-NPs can reasonably complete the images in MNIST/FMNIST/SVHN and exhibit the uncertainty from partial observations. The learned functional prior, such as that in SVHN, can generate images that are different from 

