VERSATILE NEURAL PROCESSES FOR LEARNING IMPLICIT NEURAL REPRESENTATIONS

Abstract

Representing a signal as a continuous function parameterized by a neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer but informative context tokens, relieving the high computational cost while providing high modeling capability. On the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. In particular, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning. Code is available here.
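To make the computational saving behind a token bottleneck concrete: plain self-attention over N context points costs O(N^2), whereas pooling them into M ≪ N tokens via cross-attention costs O(NM), after which self-attention among the tokens costs only O(M^2). The following is a minimal, illustrative sketch of one common way to realize such a bottleneck (Perceiver-style cross-attention pooling with learned query tokens); the dimensions, names, and random projections are assumptions for illustration, not the paper's exact set tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d=32):
    """Single-head cross-attention with random projections (sketch only)."""
    dq, dk = queries.shape[-1], keys_values.shape[-1]
    Wq = rng.normal(0, dq ** -0.5, (dq, d))
    Wk = rng.normal(0, dk ** -0.5, (dk, d))
    Wv = rng.normal(0, dk ** -0.5, (dk, d))
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

N, M = 1024, 16                      # many context points -> few tokens
context = rng.normal(size=(N, 3))    # hypothetical context embeddings
tokens = rng.normal(size=(M, 32))    # hypothetical learned query tokens

# Bottleneck step: M tokens attend once to all N context points (O(N*M)).
# Downstream self-attention then operates on M tokens instead of N points.
bottleneck = cross_attention(tokens, context)
print(bottleneck.shape)  # (16, 32)
```

Whatever the exact tokenizer, the design intent is the same: the cost of attending to the raw context is paid once, and all subsequent processing scales with the small token count M.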

1. INTRODUCTION

A recent line of research on learning representations is to model a signal (e.g., an image or a 3D scene) as a continuous function that maps input coordinates to the corresponding signal values. By parameterizing a continuous function with a neural network, such implicitly defined representations, i.e., implicit neural representations (INRs), offer many benefits over conventional discrete (e.g., grid-based) representations, such as compactness and memory efficiency (Sitzmann et al., 2020b; Tancik et al., 2020; Mildenhall et al., 2020; Chen et al., 2021a). However, characterizing a signal by a corresponding set of network parameters generally requires re-training the neural network, which is computationally costly. In practice, it is desirable to have models that, at test time, adapt quickly to partial observations of a new signal without finetuning.

The Neural Processes (NPs) family (Jha et al., 2022) offers exactly this: it meta-learns the implicit neural representation of a probabilistic function conditioned on partial signal observations. At test time, it predicts the function values at target points within a single forward pass. Naturally, given partial observations of a signal, the underlying continuous function carries uncertainty, since there are many plausible ways to interpret these observations (i.e., the context set). NP methods (Garnelo et al., 2018a;b) therefore learn to map a context set of observed input-output pairs to a conditional distribution over functions, with uncertainty modeling. However, it has been observed that NPs are prone to underfitting the data distribution. Following the spirit of variational autoencoders (Kingma & Welling, 2014), the work of Garnelo et al. (2018b) introduces a global latent variable to better capture the uncertainty in the overall structure of the function, but it still lacks the capability to model complex signals.
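The latent-variable NP pipeline described above (encode the context set permutation-invariantly, infer a global latent, decode target coordinates into a predictive distribution) can be sketched as follows. This is an illustrative toy with randomly initialized weights and assumed dimensions, not the architecture of this paper or of Garnelo et al. (2018b).

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Randomly initialized MLP weights (illustration only)."""
    return [(rng.normal(0, 0.5, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, x):
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.tanh(x)
    return x

# Toy dimensions: 1D inputs/outputs, latent size 4 (all assumed).
enc = mlp([2, 16, 8])      # encodes each (x_c, y_c) context pair
lat = mlp([8, 8])          # maps aggregate r -> (mu_z, log_sigma_z)
dec = mlp([1 + 4, 16, 2])  # decodes (x_t, z) -> (mu_y, log_sigma_y)

def np_predict(xc, yc, xt):
    # 1. Encode context points independently, then mean-pool:
    #    the aggregation makes the model permutation-invariant.
    r = forward(enc, np.concatenate([xc, yc], axis=-1)).mean(axis=0)
    # 2. Global latent z ~ N(mu_z, sigma_z) captures uncertainty in the
    #    overall structure of the function.
    stats = forward(lat, r)
    mu_z, log_sig_z = stats[:4], stats[4:]
    z = mu_z + np.exp(log_sig_z) * rng.normal(size=4)
    # 3. Decode every target coordinate conditioned on the same z,
    #    yielding a predictive Gaussian per target point.
    zt = np.broadcast_to(z, (len(xt), 4))
    out = forward(dec, np.concatenate([xt, zt], axis=-1))
    return out[:, :1], np.exp(out[:, 1:])  # predictive mean, stddev

xc = np.linspace(-1, 1, 5)[:, None]
yc = np.sin(np.pi * xc)
xt = np.linspace(-1, 1, 50)[:, None]
mean, std = np_predict(xc, yc, xt)
print(mean.shape, std.shape)  # (50, 1) (50, 1)
```

The single mean-pooled vector r and the single Gaussian z are precisely the bottlenecks blamed for underfitting: every target prediction must be explained through one global summary of the context.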
Attentive Neural Processes (ANP) (Kim et al., 2019) further alleviate this issue by leveraging the permutation-invariant attention mechanism (Vaswani et al., 2017) to reweight the context points for the target predictions. However, since ANP treats each context point as a token, it has trouble processing complex signals that require abundant context points as conditioning (e.g., high-resolution images), where the computational cost becomes very expensive. Moreover, for complex signals, modeling the global structure and uncertainty of the function with a single latent Gaussian variable may be suboptimal. It is therefore worthwhile to explore an efficient framework that unlocks the potential of NPs for modeling complex signals.

In this paper, we propose Versatile Neural Processes (VNP), an efficient and flexible framework for meta-learning of implicit neural representations. Figure 1 shows the framework of VNP. Specifically, VNP consists of a bottleneck encoder and a hierarchical latent modulated decoder. The bottleneck encoder, powered by a set tokenizer and self-attention blocks, encodes the set of context points into fewer but informative context tokens, avoiding the high computational cost (especially on complex signals) while attaining higher modeling capability. In the decoder, we hierarchically learn multiple latent Gaussian variables that jointly model the global structure and uncertainty of the function distributions. In particular, we sample from the latent variables and use the samples to modulate the parameters of the MLP modules. Our VNP has high expressiveness for complex signals (e.g., 2D images and 3D scenes) and significantly outperforms existing NP approaches on 1D synthetic data. We summarize our main contributions as below:

• We propose Versatile Neural Processes (VNP), which is capable of learning accurate INRs for approximating the function of a complex signal.
• We introduce a bottleneck encoder that produces compact yet representative context tokens, facilitating the processing of complex signals at tolerable computational cost.

• We design a hierarchical latent modulated decoder that better captures and describes the structure and uncertainty of functions through joint modulation by multiple global latent variables.

We implement the VNP framework on 1D, 2D, and 3D signals respectively, demonstrating state-of-the-art performance on a variety of tasks. In particular, our method shows promise in learning accurate INRs of 3D scenes without further finetuning.

Figure 1: The proposed Versatile Neural Processes framework contains a bottleneck encoder and a hierarchical latent modulated decoder. The input context set is first encoded into fewer but informative context tokens by a set tokenizer followed by self-attention blocks, which provide powerful network capability with tolerable complexity. The decoder consists of cross-attention modules and multiple modulated MLP blocks, enhancing the model's expressiveness for complex signals.

2. RELATED WORK

In the seminal work CPPN (Stanley, 2007), a neural network is trained to learn the implicit function that fits a signal, e.g., an image: given any spatial position identified by a 2D coordinate, the model, acting as a function, outputs the color value at that position. Such continuous representations are a powerful paradigm with a wide range of applications, including image super-resolution (Chen et al., 2021b), modeling shapes (Chen & Zhang, 2019; Park et al., 2019) and textures (Oechsle et al., 2019), 3D scene reconstruction (Mildenhall et al., 2020; Martin-Brualla et al., 2021; Niemeyer & Geiger, 2021), and even lossy compression (Dupont et al., 2021; 2022; Schwarz & Teh, 2022). Most of these methods require re-training the neural network to model/overfit

