FREQUENCY DECOMPOSITION IN NEURAL PROCESSES

Abstract

Neural Processes are a powerful tool for learning representations of function spaces purely from examples, in a way that allows them to make predictions at test time conditioned on so-called context observations. The learned representations are finite-dimensional, while function spaces are infinite-dimensional, and so far it has been unclear how these representations are learned and what kinds of functions can be represented. We show that deterministic Neural Processes implicitly perform a decomposition of the training signals into different frequency components, similar to a Fourier transform. Building on this, we derive a theoretical upper bound on the maximum frequency Neural Processes can reproduce, depending on their representation size, and we confirm this bound empirically. Finally, we show that Neural Processes can be trained to represent only a subset of possible frequencies and suppress others, which makes them programmable band-pass or band-stop filters.

1. INTRODUCTION

Neural Processes (Garnelo et al., 2018a; b) are a class of models that can learn a distribution over functions, or more generally a function space. In contrast to many other approaches that do the same, for example Bayesian Neural Networks, Neural Processes learn an explicit representation of such a function space, which allows them to condition their predictions on an arbitrary number of observations that are only available at test time. This representation is finite-dimensional, while function spaces are infinite-dimensional, and so far it has not been understood how they are able to bridge this gap and under what conditions they can successfully do so.

Our work reveals how Neural Processes learn to represent infinite-dimensional function spaces in a finite-dimensional space, and in the process describes constraints and conditions that decide what kinds of function spaces can be represented. We begin with the observation that prior art in the context of learning on sets can be reinterpreted from a signal-processing perspective, which allows us to derive a theoretical upper bound on the frequencies, i.e. Fourier components, of functions that can be represented. We subsequently confirm this bound empirically, which suggests that the learned representations should contain a notion of frequency. To investigate this hypothesis further, we continue with a visualization of the learned representations, which reveals that Neural Processes can decompose a function space into different frequency components, essentially finding a representation in Fourier space without any explicit supervision on the representations to elicit such behaviour. As further evidence of this, we train Neural Processes to represent only certain frequencies, which results in them suppressing those frequencies that were not observed in the training data.

Our contributions can be summarized as follows[1]:

• We derive a theoretical upper bound on the signal frequency that Neural Processes of a given representation size can reconstruct. As we show, the bound is observed either in the expected way, by suppressing high frequencies, or by implicitly limiting the signal interval.

• We investigate the learned representations qualitatively, presenting evidence that Neural Processes perform a frequency decomposition of the function space, akin to a Fourier transform. This behaviour is not incentivized externally but rather emerges naturally.

• We show that by choosing the training distribution appropriately, Neural Processes can be made to represent certain frequencies and suppress others, which turns them into programmable band-pass or band-stop filters.

2. BACKGROUND

Neural Processes (Garnelo et al., 2018a; b) are maps $P : C \times X \to Y$, where $C$ is a set of tuples $\{(x_c, f(x_c))\}_{c=1}^{N} =: (\mathbf{x}_c, f(\mathbf{x}_c))$[2] with arbitrary but positive cardinality $N$, and $f \in F : X \to Y$. $C$ is often called the context, because Neural Processes perform predictions for target values $x_t \in X$ ($t$ for target) conditioned on these points. $F$ is the function space we would like to find a representation of. Note that some sources define function spaces as any set of functions with a shared domain and co-domain, while others require them to be vector spaces as well. We do not concern ourselves with this distinction, and we further restrict our work to $X = Y = \mathbb{R}$, because this allows us to visualize the learned representations. We only consider the original Neural Processes, namely the deterministic Conditional Neural Processes (CNP) (Garnelo et al., 2018a) and the variational Neural Processes (NP) (Garnelo et al., 2018b), because newer contributions in the field work in ways that preclude them from being analyzed in the same way. We discuss this further in Section 5.

In CNPs and NPs, the map $P$ is separated into two parts: a so-called encoding $E : C \to Z$ and a decoding or generating part $G : Z \times X \to Y$. $Z$ is referred to as the representation or latent space. To allow Neural Processes to approximate arbitrary[3] function spaces $F$, $E$ and $G$ are typically chosen to be powerful approximators, specifically neural networks, as the name suggests. The defining characteristic of CNPs and NPs is that $E$ encodes individual pairs $(x, f(x))$ from the context separately, and the resulting representations are averaged to form a global representation, meaning one that is independent of the target points $x_t$ at which we then evaluate the Neural Process. This is often not the case in later work, for example in Attentive Neural Processes (Kim et al., 2019), where the individual representations are instead aggregated using an attention mechanism that depends on $x_t$. In CNPs the representations are deterministic, while in NPs they parametrize the mean and (log-)variance of a Gaussian distribution, so the latter are trained using variational inference. For details on implementation and training we refer to Appendix A.1. Our work investigates how these global representations, which are finite-dimensional, represent infinite-dimensional function spaces.

As stated above, $E$, and by extension the Neural Process $P$, acts on set-valued inputs. This is contrary to the vast majority of machine learning work, where inputs are vectors of fixed dimension and ordering. Recall that sets are permutation-invariant, so we must ensure that the same is true for the output of $E$. It is easy to see that this holds when we average individual encodings, but Zaheer et al. (2017) show that it is in fact the only way to ensure it: $E$ is permutation-invariant if and only if it has a so-called sum-decomposition, i.e. it can be represented in the form $E(\mathbf{x}) = \rho\left(\sum_{i=1}^{N} \phi(x_i)\right)$, where $\rho, \phi$ are appropriately chosen functions. Wagstaff et al. (2019) further show that to be able to represent all continuous permutation-invariant functions on sets with a cardinality of at most $N$, the dimension of the image $Z$ must be at least $N$. This will become relevant in the following section.
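To make this structure concrete, below is a minimal sketch of a deterministic CNP in PyTorch, assuming $X = Y = \mathbb{R}$. All names and layer sizes are illustrative choices of ours rather than the reference implementation, and for brevity the decoder predicts only a mean, whereas the original CNP also outputs a variance and is trained by maximizing a Gaussian log-likelihood.

# Minimal sketch of a deterministic Conditional Neural Process (CNP).
# Illustrative only: names, layer sizes, and the mean-only decoder head
# are our assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, repr_dim=128, hidden=128):
        super().__init__()
        # phi encodes each context pair (x, f(x)) separately.
        self.phi = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, repr_dim))
        # G decodes (global representation, target x) into a prediction.
        self.g = nn.Sequential(
            nn.Linear(repr_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_c, y_c, x_t):
        # x_c, y_c: (N, 1) context observations; x_t: (M, 1) target points.
        pairs = torch.cat([x_c, y_c], dim=-1)       # (N, 2)
        r = self.phi(pairs).mean(dim=0)             # average -> global representation, independent of x_t
        r = r.expand(x_t.shape[0], -1)              # repeat for every target point
        return self.g(torch.cat([r, x_t], dim=-1))  # (M, 1) predictions

# Usage: condition on N = 10 observations, predict at 50 new locations.
model = CNP()
x_c, y_c = torch.randn(10, 1), torch.randn(10, 1)
y_t = model(x_c, y_c, torch.randn(50, 1))

Note that the averaging step is what makes the encoding permutation-invariant and lets the model accept contexts of arbitrary cardinality, while the global representation $r$ has a fixed dimension (here 128) no matter how many context points are supplied.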

3. AN UPPER BOUND ON SIGNAL FREQUENCIES

We mentioned in the previous section that the encoder $E$ in a Neural Process should have a sum-decomposition, so that the global representations are permutation-invariant, as shown in Zaheer et al. (2017). Expanding on this, Wagstaff et al. (2019) show that we require a representation size of at least $N$ to be able to represent arbitrary continuous functions on sets of cardinality smaller than or equal to $N$. What these works do not consider are the implications for situations where the elements of the sets are observations of an underlying function, as is the case in Neural Processes.
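The connection between representation size and frequency can be previewed with a Nyquist-style counting argument. The following is a sketch under simplifying assumptions of ours (a bandlimited signal observed at uniformly spaced points on a bounded interval), not the formal derivation:

\begin{align*}
  N &\geq 2\nu T && \text{(Nyquist rate: samples determining a signal of max.\ frequency } \nu \text{ on an interval of length } T\text{)}\\
  D &\geq N && \text{(representation size needed for sets of up to } N \text{ points, per Wagstaff et al. (2019))}\\
  \Rightarrow\; \nu &\leq \frac{D}{2T} && \text{(upper bound on the representable frequency)}
\end{align*}

A model with representation size $D$ can thus observe the bound either by attenuating frequencies above $D/(2T)$ or by implicitly shrinking the interval $T$ it reproduces, which is the dichotomy noted in our contributions.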



[1] The complete source code to reproduce our experiments is available at https://github.com/ ***
[2] We use boldface as a shorthand for sets, not vectors.
[3] This will depend on the implementation of E and G; for neural networks, F is practically restricted to continuous and differentiable functions.