3D SCENE COMPRESSION THROUGH ENTROPY PENALIZED NEURAL REPRESENTATION FUNCTIONS

Abstract

Some forms of novel visual media enable the viewer to explore a 3D scene from essentially arbitrary viewpoints, by interpolating between a discrete set of original views. Compared to 2D imagery, these types of applications require much larger amounts of storage space, which we seek to reduce. Existing approaches for compressing 3D scenes are based on a separation of compression and rendering: each of the original views is compressed using traditional 2D image formats; the receiver decompresses the views and then performs the rendering. We unify these steps by directly compressing an implicit representation of the scene, a function that maps spatial coordinates to a radiance vector field, which can then be queried to render arbitrary viewpoints. The function is implemented as a neural network and jointly trained for reconstruction as well as compressibility, in an end-to-end manner, with the use of an entropy penalty on the parameters. Our method significantly outperforms a state-of-the-art conventional approach for scene compression, achieving simultaneously higher quality reconstructions and lower bitrates. Furthermore, we show that the performance at lower bitrates can be improved by jointly representing multiple scenes using a soft form of parameter sharing.

1. INTRODUCTION

The ability to render 3D scenes from arbitrary viewpoints can be seen as a major step in the evolution of digital multimedia, with applications in mixed reality media, graphics effects, design, and simulation. Such renderings are often based on a number of high-resolution images taken of an original scene, and it is clear that to enable many applications, the data will need to be stored and transmitted efficiently over low-bandwidth channels (e.g., to a mobile phone for augmented reality). Traditionally, compressing this data is treated as a problem separate from rendering. For example, light field images (LFI) consist of a set of images taken from multiple viewpoints. To compress the original views, standard video compression methods such as HEVC (Sullivan et al., 2012) are often repurposed (Jiang et al., 2017; Barina et al., 2019). Since the range of views is narrow, light field images can be effectively reconstructed by "blending" a smaller set of representative views (Astola & Tabus, 2018; Jiang et al., 2017; Zhao et al., 2018; Bakir et al., 2018; Jia et al., 2019). Blending-based approaches, however, may not be suitable for the more general case of arbitrary-viewpoint 3D scenes, where a far more diverse set of original views increases the severity of occlusions, and a prohibitively large number of views would need to be stored for blending to remain effective.

A promising avenue for representing more complete 3D scenes is through neural representation functions, which have shown a remarkable improvement in rendering quality (Mildenhall et al., 2020; Sitzmann et al., 2019; Liu et al., 2020; Schwarz et al., 2020). In such approaches, views of a scene are rendered by evaluating the representation function at sampled spatial coordinates and then applying a differentiable rendering process. Such methods are often referred to as implicit representations, since they do not explicitly specify the surface locations and properties within the scene, which would be required by conventional rendering techniques such as rasterization (Akenine-Möller et al., 2019). However, finding the representation function for a given scene requires training a neural network. This makes this class of methods difficult to use as the rendering step in the existing compress-then-render framework, since training is computationally infeasible on a low-powered end device such as a mobile phone, which is often on the receiving side. By the data processing inequality, it may also be inefficient to compress the original views (the training data) rather than the trained representation itself, because the training process may discard information that is ultimately not necessary for rendering (such as redundancy in the original views, noise, etc.).

In this work, we propose to apply neural representation functions to the scene compression problem by compressing the representation function itself. We use the NeRF model (Mildenhall et al., 2020), which has demonstrated the ability to produce high-quality renders of novel views, as our representation function. To reduce redundancy of information in the model, we build upon the model compression approach of Oktay et al. (2020), applying an entropy penalty to a set of discrete, reparameterized neural network weights. The compressed NeRF (cNeRF) describes a radiance field, which is used in conjunction with a differentiable neural renderer to obtain novel views (see Fig. 1). To verify the proposed method, we construct a strong baseline based on approaches from the field of light field image compression. cNeRF consistently outperforms this baseline, producing simultaneously superior renders and lower bitrates. We further show that cNeRF can be improved in the low-bitrate regime when compressing multiple scenes at once. To achieve this, we introduce a novel parameterization that shares parameters across models, and we optimize jointly across the scenes.
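To make the role of the entropy penalty concrete, the following is a minimal sketch of what such a rate-distortion objective could look like. It assumes a learned discrete prior assigning a probability to each quantized weight, in the spirit of Oktay et al. (2020); the function names and the simple additive form are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def entropy_penalized_loss(render_mse, weight_probs, lambda_rate=1e-3):
    """Illustrative rate-distortion objective for model compression.

    render_mse:   distortion term, e.g. mean squared error of rendered views
    weight_probs: probabilities assigned by a learned prior to each discrete
                  (quantized) weight actually used by the model
    lambda_rate:  trade-off between reconstruction quality and model size
    """
    # Expected code length in bits under the prior; an entropy coder can
    # achieve a rate close to this bound in practice.
    rate_bits = -np.sum(np.log2(weight_probs))
    return render_mse + lambda_rate * rate_bits
```

Minimizing this objective trades rendering quality against the size of the entropy-coded weights, with lambda_rate controlling where the model lands on the rate-distortion curve.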

2. BACKGROUND

We define a multi-view image dataset as a set of tuples $\{(V_n, X_n)\}_{n=1}^{N}$, where $V_n$ is the camera pose and $X_n$ is the corresponding image taken from this pose. We refer to the 3D ground truth that the views capture as the scene. In what follows, we briefly review the neural rendering and model compression approaches that we build upon, introducing the necessary notation as we go.

Neural Radiance Fields (NeRF). The neural rendering approach of Mildenhall et al. (2020) uses a neural network to model a radiance field. The radiance field itself is a learned function $g_\theta : \mathbb{R}^5 \to (\mathbb{R}^3, \mathbb{R}^+)$, mapping a 3D spatial coordinate and a 2D viewing direction to an RGB value and a corresponding density element. To render a view, RGB values are sampled along the relevant rays and accumulated according to their density elements. The learned radiance field mapping $g_\theta$ is parameterized by two multilayer perceptrons (MLPs), which Mildenhall et al. (2020) refer to as the "coarse" and "fine" networks, with parameters $\theta_c$ and $\theta_f$ respectively. The input locations to the coarse network are obtained by sampling regularly along the rays, whereas the input locations to the fine network are sampled conditioned on the radiance field of the coarse network. The networks are trained jointly by minimizing the squared error between the rendered pixel values of both networks and the corresponding pixels of the ground-truth views.
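As a concrete illustration of the accumulation step, below is a minimal NumPy sketch of the standard quadrature used by NeRF (Mildenhall et al., 2020) for a single ray; the function and variable names are ours.

```python
import numpy as np

def render_ray(rgb, sigma, t):
    """Composite sampled radiance values into a single pixel color.

    rgb:   (S, 3) RGB values returned by g_theta at the S sample points
    sigma: (S,)   non-negative density elements at the same points
    t:     (S,)   distances of the sample points along the ray
    """
    # Spacing between consecutive samples; the last interval is effectively
    # infinite, as in the NeRF reference implementation.
    delta = np.diff(t, append=t[-1] + 1e10)
    # Opacity contributed by each ray segment.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: the fraction of light reaching sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return np.sum(weights[:, None] * rgb, axis=0)
```

Rendering a full image applies this routine to the ray through every pixel, first with the regularly spaced samples of the coarse network and then with the conditionally sampled inputs of the fine network.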

