DEEP LEARNING ON IMPLICIT NEURAL REPRESENTATIONS OF SHAPES

Abstract

Implicit Neural Representations (INRs) have emerged in the last few years as a powerful tool to continuously encode a variety of different signals, such as images, videos, audio and 3D shapes. When applied to 3D shapes, INRs make it possible to overcome the fragmentation and shortcomings of the popular discrete representations used so far. Yet, since INRs consist of neural networks, it is not clear whether and how they may be fed into deep learning pipelines aimed at solving a downstream task. In this paper, we put forward this research problem and propose inr2vec, a framework that can compute a compact latent representation for an input INR in a single inference pass. We verify that inr2vec can effectively embed the 3D shapes represented by the input INRs and show how the produced embeddings can be fed into deep learning pipelines to solve several tasks by processing exclusively INRs.

1. INTRODUCTION

Since the early days of computer vision, researchers have been processing images stored as two-dimensional grids of pixels carrying intensity or color measurements. But the world that surrounds us is three-dimensional, motivating researchers to process 3D data sensed from surfaces as well. Unfortunately, the representation of 3D surfaces in computers does not enjoy the same uniformity as digital images, with a variety of discrete representations, such as voxel grids, point clouds and meshes, coexisting today. Besides, when it comes to processing by deep neural networks, all these kinds of representations are affected by peculiar shortcomings, requiring complex ad-hoc machinery (Qi et al., 2017b; Wang et al., 2019b; Hu et al., 2022) and/or large memory resources (Maturana & Scherer, 2015). Hence, no standard way to store and process 3D surfaces has yet emerged. Recently, a new kind of representation has been proposed, which leverages the possibility of deploying a Multi-Layer Perceptron (MLP) to fit a continuous function that implicitly represents a signal of interest (Xie et al., 2021). These representations, usually referred to as Implicit Neural Representations (INRs), have proven capable of effectively encoding 3D shapes by fitting signed distance functions (sdf) (Park et al., 2019; Takikawa et al., 2021; Gropp et al., 2020), unsigned distance functions (udf) (Chibane et al., 2020) and occupancy fields (occ) (Mescheder et al., 2019; Peng et al., 2020). Encoding a 3D shape with a continuous function parameterized as an MLP decouples the memory cost of the representation from the actual spatial resolution, i.e., a surface with arbitrarily fine resolution can be reconstructed from a fixed number of parameters. Moreover, the same neural network architecture can be used to fit different implicit functions, holding the potential to provide a unified framework to represent 3D shapes.
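To make the decoupling between parameter count and resolution concrete, an sdf INR is simply an MLP mapping a 3D coordinate to a signed distance. The NumPy sketch below (layer sizes, random initialization and ReLU activations are illustrative choices, not taken from any of the cited works, which typically train the MLP to regress ground-truth distances) shows that the same fixed set of parameters can be queried at arbitrarily many points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP "INR": maps a 3D coordinate to a scalar signed distance.
# Layer sizes are illustrative only.
sizes = [3, 64, 64, 1]
params = [(rng.standard_normal((i, o)) / np.sqrt(i), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]

def inr_sdf(pts, params):
    """Evaluate the MLP at an (N, 3) array of query points."""
    h = pts
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layers
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)      # one sdf value per query point

# The memory cost is a fixed number of parameters...
n_params = sum(W.size + b.size for W, b in params)

# ...while the implicit surface can be sampled at any resolution.
coarse = rng.uniform(-1, 1, size=(8, 3))
fine = rng.uniform(-1, 1, size=(100_000, 3))
print(n_params, inr_sdf(coarse, params).shape, inr_sdf(fine, params).shape)
```

In practice the weights would be obtained by minimizing a regression loss against ground-truth distance samples; here they are random placeholders used only to show the interface.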
Due to their effectiveness and potential advantages over traditional representations, INRs are gathering ever-increasing attention from the scientific community, with novel and striking results published more and more frequently (Müller et al., 2022; Martel et al., 2021; Takikawa et al., 2021; Liu et al., 2022). This leads us to conjecture that, in the forthcoming future, INRs might emerge as a standard representation to store and communicate 3D shapes, with repositories hosting digital twins of 3D objects realized only as MLPs becoming commonly available. An intriguing research question arises from the above scenario: beyond storage and communication, would it be possible to process INRs of 3D shapes directly with deep learning pipelines to solve downstream tasks, as is routinely done today with discrete representations like point clouds or meshes? In other words, would it be possible to process an INR of a 3D shape to solve a downstream task, e.g., shape classification, without reconstructing a discrete representation of the surface? Since INRs are neural networks, there is no straightforward way to process them. Earlier works in the field, namely OccupancyNetworks (Mescheder et al., 2019) and DeepSDF (Park et al., 2019), fit the whole dataset with a shared network conditioned on a different embedding for each shape. In such a formulation, the natural solution to the above-mentioned research problem could be to use such embeddings as representations of the shapes in downstream tasks. This is indeed the approach followed by contemporary work (Dupont et al., 2022), which addresses the problem by using as embedding a latent modulation vector applied to a shared base network. However, representing a whole dataset by a shared network sets forth a difficult learning task, with the network struggling to fit the totality of the samples accurately (as we show in Appendix A).
Conversely, several recent works, like SIREN (Sitzmann et al., 2020b) and others (Sitzmann et al., 2020a; Dupont et al., 2021a; Strümpler et al., 2021; Zhang et al., 2021; Tancik et al., 2020), have shown that, by fitting an individual network to each input sample, one can obtain high-quality reconstructions even when dealing with very complex 3D shapes or images. Moreover, constructing an individual INR for each shape is easier to deploy in the wild, as availability of the whole dataset is not required to fit an individual shape. Such works are gaining ever-increasing popularity, and we are led to believe that fitting an individual network is likely to become the common practice in learning INRs. Thus, in this paper, we investigate how to perform downstream tasks with deep learning pipelines on shapes represented as individual INRs. However, a single INR can easily comprise hundreds of thousands of parameters, even though it is well known that the weights of a deep model provide a vastly redundant parametrization of the underlying function (Frankle & Carbin, 2018; Choudhary et al., 2020). Hence, we settle on investigating whether and how an answer to the above research question may be provided by a representation learning framework that learns to squeeze individual INRs into compact and meaningful embeddings amenable to pursuing a variety of downstream tasks. Our framework, dubbed inr2vec and shown in Fig. 1, has at its core an encoder designed to produce a task-agnostic embedding representing the input INR by processing only the INR weights. These embeddings can be seamlessly used in downstream deep learning pipelines, as we validate experimentally for a variety of tasks, like classification, retrieval, part segmentation, unconditioned generation, surface reconstruction and completion.
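The key interface is an encoder that consumes raw INR weights and emits a compact vector. The sketch below illustrates one simple way such an encoder could be wired up: stack the rows of every weight matrix, map each row through a shared linear layer, and pool. This is a hypothetical PointNet-style aggregation chosen for brevity, not the actual inr2vec architecture, and the projection matrix stands in for learned encoder parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_inr(weights, proj):
    """Hypothetical encoder: embed an INR from its weight matrices alone.

    Rows of each matrix are zero-padded to a common width, stacked,
    mapped through a shared linear layer (ReLU), and max-pooled into
    a single compact embedding.
    """
    max_w = max(W.shape[1] for W in weights)
    rows = np.concatenate(
        [np.pad(W, ((0, 0), (0, max_w - W.shape[1]))) for W in weights],
        axis=0)                              # (total_rows, max_w)
    feats = np.maximum(rows @ proj, 0.0)     # shared per-row features
    return feats.max(axis=0)                 # pool into one vector

# A toy INR with ~4k weights collapses to a 32-dim embedding.
inr_weights = [rng.standard_normal(s) for s in [(3, 64), (64, 64), (64, 1)]]
proj = rng.standard_normal((64, 32)) * 0.1   # stand-in for learned params
embedding = encode_inr(inr_weights, proj)
print(embedding.shape)
```

The point of the sketch is the input/output contract: nothing but the weights is read (no sampling of the underlying implicit function), and the output lives in a fixed low-dimensional space regardless of the INR's size.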
Interestingly, since embeddings obtained from INRs live in low-dimensional vector spaces regardless of the underlying implicit function, the last two tasks can be solved by learning a simple mapping between the embeddings produced with our framework, e.g., by transforming the INR of a udf into the INR of an sdf. Moreover, inr2vec can learn a smooth latent space conducive to interpolating INRs representing unseen 3D objects. Additional details and code can be found at https://cvlab-unibo.github.io/inr2vec. Our contributions can be summarised as follows:
• we propose and investigate the novel research problem of applying deep learning directly to individual INRs representing 3D shapes;
• to address the above problem, we introduce inr2vec, a framework that can be used to obtain a meaningful compact representation of an input INR by processing only its weights, without sampling the underlying implicit function;
• we show that a variety of tasks, usually addressed with representation-specific and complex frameworks, can indeed be performed by deploying simple deep learning machinery on INRs embedded by inr2vec, with the same machinery regardless of the INRs' underlying signal.
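Because all embeddings share one low-dimensional vector space, cross-representation transfer and interpolation become ordinary operations on vectors. The sketch below illustrates the idea with a hypothetical "latent transfer" MLP mapping the embedding of a udf INR to that of the corresponding sdf INR; the weights are random stand-ins for parameters that would be learned on paired (udf, sdf) embeddings, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32  # embedding size (illustrative)

# Hypothetical latent transfer network: small MLP from udf-embedding
# space to sdf-embedding space. Random weights stand in for learned ones.
W1, b1 = rng.standard_normal((d, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, d)) * 0.1, np.zeros(d)

def transfer(z_udf):
    h = np.maximum(z_udf @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                    # predicted sdf embedding

# Interpolation between two shapes is plain vector arithmetic.
z_a, z_b = rng.standard_normal(d), rng.standard_normal(d)
z_mid = 0.5 * z_a + 0.5 * z_b
z_sdf = transfer(z_a)
print(z_sdf.shape, z_mid.shape)
```

The design point is that both operations act only on embeddings: no discrete surface is ever reconstructed along the way.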

2. RELATED WORK

Deep learning on 3D shapes. Due to their regular structure, voxel grids have always been appealing representations for 3D shapes and several works proposed to use 3D convolutions to perform both discriminative (Maturana & Scherer, 2015; Qi et al., 2016; Song & Xiao, 2016) and generative tasks

