BAYESIAN META-LEARNING FOR FEW-SHOT 3D SHAPE COMPLETION

Abstract

Estimating the 3D shape of real-world objects is a key perceptual challenge. It requires going from partial observations, often too sparse to be interpretable even by the human eye, to detailed shape representations that vary significantly across categories and instances. We propose casting shape completion as a Bayesian meta-learning problem, so that knowledge learned from observing one object can be transferred to estimating the shape of another. To facilitate the learning of object shapes from sparse point clouds, we introduce an encoder that describes the posterior distribution of a latent shape representation conditioned on the sparse cloud. By isolating object-specific properties from object-agnostic ones, our meta-learning algorithm enables accurate shape completion of newly encountered objects from sparse observations. We demonstrate the efficacy of the proposed method with experimental results on the standard ShapeNet and ICL-NUIM benchmarks.

1. INTRODUCTION

The task of estimating 3D geometry from sparse observations, commonly referred to as shape completion, is a key perceptual challenge and an integral part of many mission-critical problems, including robotics (Varley et al., 2017) and autonomous driving (Giancola et al., 2019; Stutz & Geiger, 2018). Recently, a series of methods (Mescheder et al., 2019; Park et al., 2019) have achieved great success by using the observations to infer the parameters of an implicit 3D geometric representation of the target object. However, with some notable exceptions (Yuan et al., 2018), such methods require relatively dense observations to achieve high accuracy, which is usually impractical in real situations.

In this paper, we introduce a novel methodology that enables state-of-the-art shape completion of previously unseen objects from highly sparse observations. Our insight comes from the following simple question: "Can we leverage the geometric information available in one object to improve shape completion results on another target object?" Meta-learning is an emerging field of machine learning that serves this very purpose: by training a model on multiple inter-related tasks, it learns how to learn new tasks efficiently from a small number of observations. Recently proposed meta-learning methods often achieve this by parameterizing the input-output relationship with task-specific latent variables and training a separate, task-agnostic model/mechanism that can infer these task-specific variables from sparse observations of the target task (Chang et al., 2015; Finn et al., 2017; Garnelo et al., 2018). We can thus cast the shape completion problem as a Bayesian meta-learning problem by treating each object as a task and its sparse observations as the corresponding contextual dataset.
In popular Bayesian variants of meta-learning (Edwards & Storkey, 2017; Eslami et al., 2018; Garnelo et al., 2018), the task-specific latent variables are treated as random variables, and the aforementioned task-agnostic model (i.e., the encoder) is represented as a posterior distribution over the latent variables conditioned on sparse observations. In this study, we combine probabilistic meta-learning with recent shape completion methods that represent the geometry of a given object with implicit parameters, such as the parameters of a signed distance function (SDF). By training an encoder that computes the posterior distribution of these implicit parameters conditioned on sparse observations, we develop a framework that enables the few-shot learning of implicit geometric functions. Under appropriate regularity conditions, computing the correct posterior distribution leads to optimal prediction in the sense of Bayes risk (Maeda et al., 2020). Our proposed approach is a natural extension of many implicit approaches, in the sense that it introduces an additional encoder function representing the posterior distribution of the geometry-describing implicit parameters. More specifically, we build upon the Bayesian approach of Maeda et al. (2020), whose posterior estimate behaves asymptotically well with respect to the size of the contextual dataset, and combine their method with Implicit Geometric Regularization (IGR) (Gropp et al., 2020). We use IGR as the baseline, and demonstrate the efficacy of our method on two benchmark datasets (ShapeNet and ICL-NUIM), especially when the observations are very sparse.
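To make the role of such an encoder concrete, the sketch below shows one hypothetical way to map a sparse point cloud to a Gaussian posterior over a latent shape code. The architecture, layer sizes, and pooling choice are illustrative assumptions in the spirit of neural-process-style encoders, not the exact model used in this work:

```python
import torch
import torch.nn as nn

class PosteriorEncoder(nn.Module):
    """Maps a sparse point cloud to a Gaussian posterior q(z | points).

    A hypothetical sketch of an amortized encoder; the shared per-point
    MLP followed by max pooling makes the output invariant to the
    ordering of the input points.
    """
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Heads producing the posterior mean and log-variance of z.
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)

    def forward(self, points):
        # points: (batch, n_points, 3); n_points may be very small.
        h = self.point_mlp(points)       # (batch, n_points, hidden)
        h = h.max(dim=1).values          # symmetric aggregation over points
        return self.mu_head(h), self.logvar_head(h)

def sample_latent(mu, logvar):
    # Reparameterized sample z ~ N(mu, diag(exp(logvar))).
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```

A sampled z would then condition an implicit decoder such as an SDF network; training with a variational objective lets the posterior widen when the contextual point cloud is sparser.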

2. RELATED WORK

2.1 3D DECODING REPRESENTATIONS

Unlike images, which have a clear pixel-based structural pattern, there is no unified representation for 3D object reconstruction that is both computationally and memory efficient. In terms of the 3D representation used, existing methods can be broadly divided into the following categories.

Voxel-based methods generalize 2D pixels into 3D space, and thus constitute a natural extension of classical image-based methods. Early works focused on 3D convolutions operating on dense grids (Choy et al., 2016) to generate an occupancy function that determines whether each cell is inside an object; however, these were limited to relatively low resolutions. To address the high memory requirements of dense voxel grids, various works have proposed 3D reconstruction in a multi-resolution fashion (Häne et al., 2017), with the added complexity of requiring multiple passes to generate the final output. More recently, OccNet (Mescheder et al., 2019) proposed encoding a 3D description of the output at effectively infinite resolution, and showed that this representation can be learned from different sensor modalities.

Signed distance methods are an alternative to occupancy functions: instead of the occupancy state, they learn a function describing the signed distance to the object surface (Dai et al., 2017; Stutz & Geiger, 2018). This approach builds upon earlier fusion methods based on the truncated signed distance function (SDF) introduced in (Curless & Levoy, 1996). DeepSDF (Park et al., 2019) represents 3D space as a continuous volumetric field, but requires at training time the ground-truth SDF computed from dense input data using numerical methods. Implicit Geometric Regularization (IGR) (Gropp et al., 2020) is an SDF variant that uses Eikonal regularization to enforce that the output of the decoder is the SDF of "some" surface.
This is an effective way of mitigating the impact of outliers on the final generated surface, and it serves as the starting point for our proposed meta-learning approach to shape completion.

Point-based methods directly output points located on the object surface, eliminating the need for a dense representation of 3D space. Earlier works such as PointNet (Charles et al., 2017; Qi et al., 2017) combined fully connected networks with a symmetric aggregation function, achieving permutation invariance and robustness to perturbations. (Fan et al., 2017b) introduced point clouds as a viable output representation for 3D reconstruction, and (Yang et al., 2017) proposed a decoder design that approximates a 3D surface as the deformation of a 2D plane. The Point Completion Network (PCN) (Yuan et al., 2018) is a recent architecture that generates shapes in a coarse-to-fine fashion while maintaining a small number of parameters. However, a common limitation of all these methods is that they do not describe topology, and thus are not suitable for the generation of watertight surfaces. Moreover, to change the number of output points, methods like PCN must retrain their networks entirely, whereas SDF-based methods learn the geometry in implicit form and can therefore generate any number of points.

Mesh-based methods represent classes of similarly shaped objects in terms of a predetermined set of template meshes. Early attempts applied graph convolutions over the mesh's vertices and edges (Guo et al., 2015), and meshes have more recently been used as a direct output representation for 3D reconstruction (Kanazawa et al., 2018). These methods, however, are only able to generate meshes with simple topologies (Wang et al., 2018), require a reference template from the same object class (Kanazawa et al., 2018), and cannot guarantee watertight closed surfaces (Groueix et al., 2018).
A learnable extension to the Marching Cubes algorithm (Lorensen & Cline, 1987) has been proposed in (Liao et al., 2018); however, this approach is limited by the memory requirements of the underlying 3D voxel grid.
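To illustrate the Eikonal regularization that IGR introduces (a minimal sketch under assumed architecture and loss weighting, not the original implementation), the idea is to penalize the decoder wherever the norm of its spatial gradient deviates from 1, since any true signed distance function satisfies the Eikonal equation ||∇f|| = 1:

```python
import torch
import torch.nn as nn

def eikonal_loss(sdf_net, points):
    """Penalize deviation of ||grad f(x)|| from 1 at the given points.

    A true SDF satisfies the Eikonal equation ||grad f|| = 1 everywhere,
    so this term pushes the network toward being the SDF of some surface.
    """
    points = points.clone().requires_grad_(True)
    f = sdf_net(points)
    grad = torch.autograd.grad(
        outputs=f, inputs=points,
        grad_outputs=torch.ones_like(f),
        create_graph=True,  # keep the graph so the loss is differentiable
    )[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

# A small illustrative SDF decoder (layer sizes are arbitrary assumptions).
sdf_net = nn.Sequential(
    nn.Linear(3, 128), nn.Softplus(beta=100),
    nn.Linear(128, 128), nn.Softplus(beta=100),
    nn.Linear(128, 1),
)

# IGR-style training adds the Eikonal term to a data-fitting loss that
# drives f to zero on the observed surface points.
observed = torch.randn(256, 3)           # sparse surface observations
random_pts = torch.rand(512, 3) * 2 - 1  # points sampled in the volume
loss = sdf_net(observed).abs().mean() + 0.1 * eikonal_loss(sdf_net, random_pts)
loss.backward()
```

The regularizer is evaluated at random points in the volume rather than only at the observations, which is what makes it useful in the sparse-observation regime considered here.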

