SEMI-SUPERVISED LEARNING OF MULTI-OBJECT 3D SCENE REPRESENTATIONS

Abstract

Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose a novel approach for learning multi-object 3D scene representations from images. A recurrent encoder regresses a latent representation of the 3D shape, pose and texture of each object from an input RGB image. The 3D shapes are represented continuously in function space as signed distance functions (SDFs), which we efficiently pre-train from example shapes in a supervised way. Using differentiable rendering, we then train our model in a self-supervised way to decompose scenes from RGB-D images. Our approach learns to decompose images into the constituent objects of the scene and to infer their shape, pose and texture from a single view. We evaluate the accuracy of our model in inferring the 3D scene layout and demonstrate its generative capabilities.

1. INTRODUCTION

Humans have the remarkable capability to decompose scenes into their constituent objects and to infer object properties such as 3D shape and texture from just a single view. Providing intelligent systems with similar capabilities is a long-standing goal in artificial intelligence. Such representations would facilitate object-level description, abstract reasoning and high-level decision making. Moreover, object-level scene representations could improve generalization for learning in downstream tasks such as robust object recognition or action planning. Previous work on learning-based scene representations focused on single-object scenes (Sitzmann et al., 2019) or neglected to model the 3D geometry of the scene and the objects explicitly (Burgess et al., 2019; Greff et al., 2019; Eslami et al., 2016). In our work, we propose a multi-object scene representation network which learns to decompose scenes into objects and represents the 3D shape and texture of the objects explicitly. Shape, pose and texture are embedded in a latent representation which our model decodes into textured 3D geometry using differentiable rendering. This allows our scene representation network to be trained in a semi-supervised way. Our approach jointly learns the tasks of object detection, instance segmentation, object pose estimation and inference of 3D shape and texture from single RGB images. Inspired by Park et al. (2019); Oechsle et al. (2019); Sitzmann et al. (2019), we represent 3D object shape and texture continuously in function space as signed distance and color values at continuous 3D locations. The scene representation network infers the object poses and their shape and texture encodings from the input RGB image. We propose a novel differentiable renderer which efficiently generates color and depth images as well as instance masks from the object-wise scene representation. This also enables generating new scenes by altering an interpretable latent representation (see Fig.
1). Our network is trained in two stages: In a first stage, we train an auto-decoder subnetwork of our full pipeline to embed a collection of meshes in a continuous SDF shape embedding as in DeepSDF (Park et al., 2019). With this pre-trained shape space, we train the remaining parts of our full multi-object network to decompose and describe the scene by multiple objects in a self-supervised way from RGB-D images. No ground truth of object pose, shape, texture, or instance segmentation is required for the training on multi-object scenes. We call our learning approach semi-supervised due to the supervised pre-training of the shape embedding and the self-supervised learning of the scene decomposition. We evaluate our approach on synthetic scene datasets with images composed of multiple objects, with shapes such as geometric primitives and vehicles, and demonstrate the properties of our geometric and semi-supervised learning approach for scene representation. In summary, we make the following contributions: (1) We propose a novel model to learn representations of scenes composed of multiple objects. Our model describes the scene by explicitly encoding object poses, 3D shapes and texture. To the best of our knowledge, our approach is the first to jointly learn the tasks of object instance detection, instance segmentation, object localization, and inference of 3D shape and texture in a single RGB image through self-supervised scene decomposition. (2) Our model is trained by using differentiable rendering for decoding the latent representation into images. For this, we propose a novel differentiable renderer using sampling-based raycasting for deep SDF shape embeddings which renders color and depth images as well as instance segmentation masks. (3) By representing 3D geometry explicitly, our approach naturally respects occlusions and collisions between objects and facilitates manipulation of the scene within the latent space.
We demonstrate properties of our geometric model for scene representation and augmentation, and discuss advantages over multi-object scene representation methods which model geometry implicitly. We plan to make source code and datasets of our approach publicly available upon paper acceptance.

2. RELATED WORK

Deep learning of single object geometry. Several recent 3D learning approaches represent single object geometry by implicit surfaces of occupancy or signed distance functions which are discretized in 3D voxel grids (Kar et al., 2017; Tulsiani et al., 2017; Wu et al., 2016; Gadelha et al., 2017; Qi et al., 2016; Jimenez Rezende et al., 2016; Choy et al., 2016; Shin et al., 2018; Xie et al., 2019). Voxel grid representations typically waste significant memory and computation resources on scene parts which are far away from the surface. This limits their resolution and their capability to represent fine details. Other methods represent shapes with point clouds (Qi et al., 2017; Achlioptas et al., 2018), meshes (Groueix et al., 2018), deformations of shape primitives (Henderson & Ferrari, 2019) or multiple views (Tatarchenko et al., 2016). In continuous function-space representations, deep neural networks are trained to directly predict signed distance (Park et al., 2019; Xu et al., 2019; Sitzmann et al., 2019), occupancy (Mescheder et al., 2019; Chen & Zhang, 2019), or texture (Oechsle et al., 2019) at continuous query points. We use such representations for individual objects.

Deep learning of multi-object scene representations. Self-supervised learning of multi-object scene representations from images recently gained significant attention in the machine learning community. MONet (Burgess et al., 2019) presents a multi-object network which decomposes the scene using a recurrent attention network and an object-wise autoencoder. It embeds images into object-wise latent representations and overlays them into images with a neural decoder. Yang et al. (2020) improve upon this work. Greff et al. (2019) use iterative variational inference to optimize object-wise latent representations using a recurrent neural network.
SPAIR (Crawford & Pineau, 2019) and SPACE (Lin et al., 2020) extend the attend-infer-repeat approach (Eslami et al., 2016) by laying a grid over the image and estimating the presence, relative position, and latent representation of objects in each cell. In GENESIS (Engelcke et al., 2020), the image is recurrently encoded into latent codes per object in a variational framework. In contrast to our method, the above methods do not represent the 3D geometry of the scene explicitly. Recently, Liao et al. (2020) introduced 3D controllable image synthesis to generate novel scenes instead of explaining input views like we do.

Supervised learning for object instance segmentation, pose and shape estimation. Loosely related to our approach are supervised deep learning methods that segment object instances (Hou et al., 2019; Prabhudesai et al., 2020), estimate their poses (Xiang et al., 2017) or recover their 3D shape (Gkioxari et al., 2019; Kniaz et al., 2020). In Mesh R-CNN (Gkioxari et al., 2019), objects are detected in bounding boxes and a 3D mesh is predicted for each object. The method is trained supervised on images with annotated object shape ground truth.

Scene Encoding. The network infers a latent z = (z_1, . . . , z_N, z_bg) which decomposes the scene into object latents z_i ∈ R^d, i ∈ {1, . . . , N}, and a background component z_bg ∈ R^{d_bg}, where d, d_bg are the dimensionalities of the object and background encodings and N is the number of objects. Objects are sequentially encoded by a deep neural network z_i = g_o(I, ΔI_{1:i−1}, M_{1:i−1}) (see Fig. 2). We share the same object encoder network and weights between all objects. To guide the encoder to regress the latent representation of one object after the other, we forward additional information about already reconstructed objects. Specifically, we decode the previous object latents into object composition images, depth images and occlusion masks (I_{1:i−1}, D_{1:i−1}, M_{1:i−1}) := F(z_bg, z_1, . . . , z_{i−1}). They are generated by F using differentiable rendering, which we detail in the subsequent paragraph. We concatenate the input image I with the difference image ΔI_{1:i−1} := I − I_{1:i−1} and occlusion masks M_{1:i−1}, and input this to the encoder for inferring the representation of object i. The object encoding z_i = (z_{i,sh}, z_{i,tex}, z_{i,ext}) decomposes into encodings for shape z_{i,sh}, textural appearance z_{i,tex}, and 3D extrinsics z_{i,ext} (see Fig. 3). The shape encoding z_{i,sh} ∈ R^{D_sh} parametrizes the 3D shape represented by a DeepSDF auto-decoder (Park et al., 2019). Similarly, the texture is encoded in a latent vector z_{i,tex} ∈ R^{D_tex} which is used by the decoder to obtain color values for each pixel that observes the object. Object position p_i = (x_i, y_i, z_i), orientation θ_i and scale s_i are regressed with the extrinsics encoding z_{i,ext} = (p_i, z_{cos,i}, z_{sin,i}, s_i). The object pose

T^o_w(z_{i,ext}) = [ s_i R_i   −R_i p_i ; 0   1 ]

is parametrized in a world coordinate frame with known transformation T^w_c from the camera frame. We assume the objects are placed upright and model rotations around the vertical axis with angle θ_i = arctan2(z_{sin,i}, z_{cos,i}) and corresponding rotation matrix R_i. We use a two-parameter representation for the angle as suggested by Zhou et al. (2019).
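The extrinsics decoding described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name is hypothetical, and the world-to-object layout with scaled rotation s_i R_i and translation −R_i p_i follows the pose matrix stated in the text.

```python
import numpy as np

def decode_extrinsics(p, z_cos, z_sin, s):
    """Decode object extrinsics (illustrative helper, names hypothetical).

    p: object position (x, y, z) in world coordinates.
    (z_cos, z_sin): two-parameter encoding of the rotation about the vertical axis.
    s: scale factor.
    Returns the 4x4 transform with scaled rotation s*R and translation -R p.
    """
    theta = np.arctan2(z_sin, z_cos)      # angle recovered from the two-vector
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si, 0.0],
                  [si,  c, 0.0],
                  [0.0, 0.0, 1.0]])       # rotation about the vertical (z) axis
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = -R @ np.asarray(p, dtype=float)
    return T
```

Note that arctan2 handles an unnormalized (z_cos, z_sin) pair gracefully, which is one motivation for the two-parameter angle representation.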
We scale the object shape by a factor s_i ∈ [s_min, s_max], which we limit to an appropriate range using a sigmoid squashing function. The background encoder g_bg regresses the uniform color of the background plane as z_bg ∈ R^{d_bg}, i.e. d_bg = 3. We assume the plane extrinsics, and hence its depth image, to be known in our experiments.

Scene Decoding. Given our object-wise scene representation, we use differentiable rendering to generate individual images of objects based on their geometry and appearance and compose them into scene images. An object-wise renderer (I_i, D_i, M_i) := f(z_i) determines a color image I_i, depth image D_i and occlusion mask M_i from each object encoding independently (see Fig. 3). The renderer determines the depth at each pixel u ∈ R^2 (in normalized image coordinates) through raycasting in the SDF shape representation. Inspired by Wang et al. (2020), we trace the SDF zero-crossing along the ray by sampling points x_j := (d_j u, d_j) at equal intervals d_j := d_0 + jΔd, j ∈ {0, . . . , N−1}, with start depth d_0. The points are transformed to the object coordinate system by T^o_c(z_{i,ext}) := T^o_w(z_{i,ext}) T^w_c. Subsequently, the signed distance φ_j to the shape at these transformed points is obtained by evaluating the SDF function network Φ(z_{i,sh}, T^o_c(z_{i,ext}) x_j). Note that the SDF network is also parametrized by the inferred shape latent of the object. The algorithm finds the zero-crossing at the first pair of samples with a sign change of the SDF Φ. The surface location x(u), accurate beyond the sampling discretization, is found through linear interpolation of the depth with respect to the corresponding SDF values of these points. The depth at a pixel, D_i(u), is given by the z coordinate of the raycasted point x(u) on the object surface in camera coordinates. If no zero-crossing is found, the depth is set to a large constant. The binary occlusion mask M_i(u) is set to 1 if a zero-crossing is found at the pixel and 0 otherwise.
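The equidistant raycasting with linear zero-crossing refinement can be sketched as below. This is a toy sketch, not the paper's renderer: an analytic unit-sphere SDF stands in for the learned network Φ, and the function names are illustrative.

```python
import numpy as np

def raycast_sdf(sdf, origin, direction, d0=0.5, dd=0.25, n_steps=12):
    """Find the first SDF zero-crossing along a ray by equidistant sampling.

    `sdf` stands in for the learned network Phi; any callable mapping a 3D
    point to a signed distance works. Returns (depth, hit), where depth is
    refined by linear interpolation between the bracketing samples and
    hit is the occlusion-mask value for this pixel.
    """
    direction = direction / np.linalg.norm(direction)
    ds = d0 + dd * np.arange(n_steps)                 # d_j = d_0 + j*dd
    phis = np.array([sdf(origin + d * direction) for d in ds])
    for j in range(n_steps - 1):
        if phis[j] > 0.0 and phis[j + 1] <= 0.0:      # sign change: surface crossed
            # linear interpolation of depth w.r.t. the two SDF values
            t = phis[j] / (phis[j] - phis[j + 1])
            return ds[j] + t * (ds[j + 1] - ds[j]), True
    return np.inf, False                              # no crossing: large constant

# Unit sphere at the origin as a stand-in shape; camera 3 units in front of it.
sphere = lambda x: np.linalg.norm(x) - 1.0
depth, hit = raycast_sdf(sphere, np.array([0.0, 0.0, -3.0]),
                         np.array([0.0, 0.0, 1.0]))
# hit is True; depth is 2.0 (front surface of the unit sphere)
```

In the actual model, `sdf` would additionally be conditioned on the shape latent and the points would first be mapped into object coordinates.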
The pixel color I_i(u) is determined using a decoder network Ψ which receives the texture latent z_{i,tex} of the object and the raycasted 3D point x(u) in object coordinates as inputs, i.e. I_i(u) = Ψ(z_{i,tex}, T^o_c(z_{i,ext}) x(u)). We speed up the raycasting process by only considering pixels that lie within the projected 3D bounding box of the object shape representation. This bounding box is known since the SDF function network is trained with meshes that are normalized to fit into a unit cube with a constant padding. Note that this rendering procedure can be implemented using differentiable operations, which makes it fully differentiable with respect to the shape, color and extrinsics encodings of the object. The scene images, depth images and occlusion masks (I_{1:n}, D_{1:n}, M_{1:n}) = F(z_bg, z_1, . . . , z_n) are composed from the individual objects 1, . . . , n with n ≤ N and the decoded background through z-buffering. We initialize them with the background color, the depth image of the empty plane and an empty mask. Recall that the background color is regressed by the encoder network. For each pixel u, we search for the occluding object i with the smallest depth at the pixel. If such an object exists, we set the pixel's values in I_{1:N}, D_{1:N}, M_{1:N} to the corresponding values in the object images and masks.

Training. We train our network architecture in two stages. In the first stage, we learn the SDF function network from a collection of meshes. The second stage uses the pre-trained SDF models to learn the remaining components for the object-wise scene decomposition and rendering network. We train the SDF networks according to Park et al. (2019) from a collection of meshes and sample points in a volume around the object and on the object surface. We normalize the size of the input meshes to fit into the unit cube with a constant padding of 0.1.
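The z-buffer composition of per-object renders can be sketched as follows. A minimal numpy sketch under the assumption that each object render is a (color, depth, mask) triple as produced by the per-object renderer f; returning an instance-id mask is a simplification of the per-object binary masks.

```python
import numpy as np

def compose_scene(bg_color, bg_depth, objects):
    """Compose per-object renders into scene images by z-buffering.

    bg_color: background RGB (length-3 array), regressed by the encoder.
    bg_depth: HxW depth image of the empty background plane.
    objects:  list of (color HxWx3, depth HxW, mask HxW) per-object renders.
    """
    color = np.broadcast_to(bg_color, bg_depth.shape + (3,)).copy()
    depth = bg_depth.copy()
    mask = np.zeros(bg_depth.shape, dtype=int)
    for i, (c_i, d_i, m_i) in enumerate(objects, start=1):
        closer = (m_i == 1) & (d_i < depth)   # object occludes current buffer
        color[closer] = c_i[closer]
        depth[closer] = d_i[closer]
        mask[closer] = i                      # instance id at this pixel
    return color, depth, mask
```

The per-pixel minimum over depths implements the "occluding object with the smallest depth" rule from the text.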
Our multi-object network architecture is trained self-supervised from RGB-D images containing example scenes composed of multiple objects. To this end, we minimize the loss function

L_total = λ_I L_I + λ_D L_D + λ_gr L_gr + λ_sh L_sh,

which is a weighted sum of multiple sub-loss functions defined by

L_I = (1/|Ω|) Σ_{u∈Ω} ‖G(I_{1:N})(u) − G(I_gt)(u)‖²,
L_D = (1/|Ω|) Σ_{u∈Ω} |G(D_{1:N})(u) − G(D_gt)(u)|,
L_gr = Σ_i [ max(0, −z_i) + max(0, −φ_i(z̄_i)) ],
L_sh = Σ_i ‖z_{i,sh}‖².

In particular, L_I is the mean squared error of the image reconstruction, with Ω being the set of image pixels and I_gt the ground-truth color image. The depth reconstruction loss L_D penalizes deviations from the ground-truth depth D_gt. We apply Gaussian smoothing G(·), for which we decrease the standard deviation over time. L_sh regularizes the shape encoding to stay within the training regime of the SDF network. Lastly, L_gr favors objects to reside above the ground plane, with z_i being the height coordinate of the object in the world frame, z̄_i the corresponding projection of the object position onto the ground plane, and φ_i(x_k) := Φ(z_{i,sh}, T^o_c(z_{i,ext}) x_k). The shape regularization loss is scheduled with a time-dependent weighting. This prevents the network from learning to generate unreasonable extrapolated shapes in the initial phase of training, but lets the network refine them over time. We use a CNN for both the object and the background encoder. Both consist of a number of convolutional layers with kernel size (3, 3) and stride (1, 1), each followed by a ReLU activation and (2, 2) max-pooling. Subsequent fully connected layers yield the encodings for objects and background. Similar to Park et al. (2019), we use multi-layer fully-connected neural networks for the shape decoder Φ and texture decoder Ψ. Further details are provided in the supplementary material.
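The objective can be sketched numerically as below. This is an illustrative sketch, not training code: the Gaussian smoothing G is abstracted as a callable (identity by default), the per-object ground/shape terms are passed in precomputed, and the default weights are illustrative.

```python
import numpy as np

def total_loss(I_pred, I_gt, D_pred, D_gt, z_heights, phi_ground, z_sh,
               weights=(1.0, 0.1, 0.01, 0.01), smooth=lambda x: x):
    """Sketch of the training objective; `smooth` stands in for the Gaussian G.

    z_heights:  per-object world-frame height coordinates z_i.
    phi_ground: per-object SDF values at the ground-plane projections.
    z_sh:       per-object shape latents.
    """
    lI, lD, lgr, lsh = weights
    L_I = np.mean((smooth(I_pred) - smooth(I_gt)) ** 2)       # image MSE
    L_D = np.mean(np.abs(smooth(D_pred) - smooth(D_gt)))      # depth L1
    L_gr = sum(max(0.0, -z) + max(0.0, -p)                    # ground loss
               for z, p in zip(z_heights, phi_ground))
    L_sh = sum(float(np.dot(z, z)) for z in z_sh)             # shape regularizer
    return lI * L_I + lD * L_D + lgr * L_gr + lsh * L_sh
```

In training, λ_sh would additionally be decayed over time as described in the text.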

4. EXPERIMENTS

We evaluate our approach on synthetic scenes based on the Clevr dataset (Johnson et al., 2017) and on scenes generated with ShapeNet models (Chang et al., 2015). The Clevr-based scenes contain images with a varying number of colored shape primitives (spheres, cylinders, cubes) on a planar single-colored background. We modify the data generation of Clevr in a number of aspects: (1) We remove shadows and additional light sources and only use the Lambertian rubber material for the objects' surfaces. (2) To further increase shape variety, we apply random scaling along the principal axes of the primitives. (3) An object may be completely hidden behind another one; hence, the network needs to learn to hide single objects. We generate several multi-object datasets. Each dataset contains scenes with a specific number of objects, which we choose from two to five. Each dataset consists of 12.5K images with a size of 64×64 pixels. Objects are randomly rotated and placed in a range of [−1.5, 1.5]² on the ground plane while ensuring that no two objects intersect. In addition to the RGB images, we also generate depth maps for training as well as instance masks for evaluation. The images are split into 9K training, 1K validation, and 2.5K testing examples. For the pre-training of the DeepSDF (Park et al., 2019) network, we generate a small set of nine shapes per category with different scaling along the axes, for which we generate ground-truth SDF samples. Different from Park et al. (2019), we sample a higher ratio of points randomly in the unit cube instead of close to the surface. We also evaluate on scenes depicting either cars or armchairs as well as a mixed set consisting of mugs, bottles and cans (tabletop) from the ShapeNet model set. Specifically, we select 25 models per setting which we use both for pre-training the DeepSDF network and for the generation of the multi-object datasets. We increase the size of these datasets to 18K/2K/5K.
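Non-intersecting random placement of this kind is commonly done by rejection sampling; a minimal sketch, where a fixed minimum center distance is an illustrative stand-in for a proper shape-dependent intersection test:

```python
import numpy as np

def place_objects(n, rng, extent=1.5, min_dist=0.7, max_tries=1000):
    """Randomly place n objects in [-extent, extent]^2 on the ground plane.

    Candidate positions closer than min_dist to an already placed object
    are rejected; min_dist is a hypothetical proxy for checking that the
    actual object shapes do not intersect.
    """
    placed = []
    for _ in range(max_tries):
        p = rng.uniform(-extent, extent, size=2)
        if all(np.linalg.norm(p - q) >= min_dist for q in placed):
            placed.append(p)
            if len(placed) == n:
                return np.array(placed)
    raise RuntimeError("could not place all objects")
```

A random in-plane rotation per object would be drawn independently of the position.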
The evaluation is performed on two different test sets: (1) with known shapes and (2) with new objects. The renderer evaluates at 12 steps along each ray. Gaussian smoothing is applied with kernel size 16 and a sigma decreasing from 16/3 to 1/2 over 250K steps. We use the ADAM optimizer (Kingma & Ba, 2014) with learning rate 0.0001 and batch size 8 to train for a dataset-specific number of epochs (see supplementary material for more details).

Evaluation Metrics. We evaluate the task of learning object-level 3D scene representations using measures for instance segmentation, image reconstruction, and pose estimation. To evaluate the capability of our model to recognize objects that best explain the input image, we consider established instance segmentation metrics. An object is considered correctly segmented if the intersection-over-union (IoU) score between ground-truth and predicted mask is higher than some threshold τ. To account for occlusions, only objects that occupy at least 25 pixels are taken into account. We report average precision (AP_0.5), average recall (AR_0.5) and F1_0.5 score for a fixed τ = 0.5 as well as the mean AP over thresholds in the range [0.5, 0.95] with step size 0.05 (Everingham et al., 2010). Furthermore, we list the ratio of scenes where all visible objects were found w.r.t. τ = 0.5 (allObj). Next, we evaluate the quality of both the RGB and depth reconstruction obtained from the generated objects. To assess the image reconstruction, we report Root Mean Squared Error (RMSE), Structural SIMilarity index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) scores. For the object geometry, we compute, similar to Eigen et al. (2014), the Absolute Relative Difference (AbsRD), the Squared Relative Difference (SqRD), as well as the RMSE of the predicted depth. Furthermore, we report the error on the estimated objects' position (mean) and rotation (median; sym.: up to symmetries) for objects with a valid match w.r.t. τ = 0.5.
More details on the metrics are provided in the supplementary material. We show results over five runs per configuration and report the mean.
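The IoU-based matching underlying these segmentation metrics can be sketched as follows. A simplified sketch of the protocol (greedy one-to-one matching, minimum-pixel filter), not the exact evaluation code:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two binary instance masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def precision_recall(pred_masks, gt_masks, tau=0.5, min_pixels=25):
    """Match predicted to ground-truth masks at IoU threshold tau.

    Ground-truth objects covering fewer than min_pixels are ignored, as in
    the evaluation protocol; each ground-truth mask can be matched at most
    once (greedy, in prediction order).
    """
    gt = [g for g in gt_masks if g.sum() >= min_pixels]
    matched, tp = set(), 0
    for p in pred_masks:
        best, best_iou = None, tau
        for j, g in enumerate(gt):
            iou = mask_iou(p, g)
            if j not in matched and iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched.add(best)
            tp += 1
    prec = tp / len(pred_masks) if pred_masks else 0.0
    rec = tp / len(gt) if gt else 1.0
    return prec, rec
```

AP/AR over a threshold range would repeat this matching for each τ in [0.5, 0.95].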

4.1. CLEVR DATASET

In Fig. 4, we show reconstructed images, depth and normal maps on the Clevr (Johnson et al., 2017) scenes. Our model provides a complete reconstruction of the individual objects although they might be partially hidden in the image. The network infers the color of the objects correctly and acquires a basic notion of shading (e.g. that spheres are darker on the lower half) and coarse texture. Shape characteristics such as extent, edges or curved surfaces are well recognized. Since our model needs to fill all object slots, we sometimes observed that it hallucinates additional objects and hides them behind others. Some reconstruction artifacts at object boundaries are due to rendering hard transitions between objects and background. More results and typical failure cases are shown in the supplementary material. Our 3D scene model naturally facilitates generation and manipulation of scenes by altering the latent representation. In Fig. 1, we show example operations like switching the positions of two objects, changing their shape, or removing an entire object. The explicit knowledge about 3D shape also allows us to reason about object penetrations when generating new scenes. Specifically, we evaluate an object intersection loss L_int on the newly sampled scenes to filter out those that turn out to be unrealistic due to an intersection between objects (see supplementary material for details).

Ablation Study. We evaluate various components of our model on the Clevr dataset with three objects. In Table 1, we evaluate training settings where we left out each of the loss functions, and we also demonstrate the benefit of Gaussian smoothing (denoted by G) on the image reconstructions. At the beginning of training, the shape regularization loss is crucial to keep the shape encoder close to the pre-trained DeepSDF shape space and to prevent it from diverging due to the inaccurate pose estimates of the objects.
Table 1: Results on the Clevr dataset (Johnson et al., 2017). The combination of our proposed loss with Gaussian blur is essential to guide the learning of scene decomposition and object-wise representations. We highlight the best (bold) and second best (underlined) result for each measure. Using different maximum numbers of objects in our network, we further train our model on scenes with 2, 4, or 5 objects. Despite the increased difficulty for larger numbers of objects, our model recognizes most objects in scenes with two to five objects. Models trained with fewer objects can successfully explain scenes with a larger number of objects (# obj = o_train/o_test).

Applying and decaying Gaussian blur distributes gradient information in the images beyond the object masks and allows the model to be trained in a coarse-to-fine manner. This helps the model to localize the various objects in the scene. Moreover, the depth loss is essential for learning the scene decomposition. Without this loss, the network can simply describe several objects using a single object with more complex texture. The ground loss prevents the model from fitting objects into the ground plane. The image reconstruction loss plays only a minor part in the scene decomposition task but is chiefly responsible for learning the texture of the objects. Using all our proposed loss functions yields the best results over all metrics. Remarkably, our model is able to find objects at high recall rates (0.942 AR at 50% IoU).

Object Count. We also report results when varying the maximum number of objects in our model in Tab. 1. We train the models with the corresponding number of objects in the dataset. Unsurprisingly, it is on average easier for our model to find and describe the objects in less crowded scenes, while it still performs with high accuracy for five objects. Due to the sequential architecture of our model, it can even be extended to scenes with more objects than it has been trained for.
As we use a shared encoder for all objects, we can simply reset the number of encoding rollouts to the number of objects in the test data. Again, we assume the number of objects to be known. Although our model would be able to hide redundant objects behind already reconstructed ones without this explicit change, it could not reconstruct additional objects. In these experiments, it performs less well than the trained models for the respective object counts. The achieved average recall and allObj measures still indicate that the model is able to detect the objects at good rates. For instance, for # obj=3/5, we find all objects in about 21% cases but overall 71% of the objects according to AR 0.5 . Extended quantitative evaluation as well as qualitative results can be viewed in the supplementary material.
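The decaying Gaussian blur used for coarse-to-fine training can be sketched as a simple schedule. The linear form is an assumption; the setup only states the endpoints (16/3 to 1/2) and the number of steps (250K):

```python
def sigma_schedule(step, sigma_start=16 / 3, sigma_end=0.5, decay_steps=250_000):
    """Decay the Gaussian-blur sigma linearly for coarse-to-fine training.

    Early in training a large sigma spreads gradients beyond object masks,
    helping localization; the blur is essentially removed by the end.
    The linear interpolation is an illustrative choice.
    """
    t = min(step / decay_steps, 1.0)   # clamp so sigma stays at sigma_end
    return sigma_start + t * (sigma_end - sigma_start)
```

The returned sigma would parametrize the smoothing G applied to both the rendered and the ground-truth images in the reconstruction losses.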

4.2. SHAPENET DATASET

Our composed multi-object variant of ShapeNet (Chang et al., 2015) models exhibits greater shape and texture variation than Clevr (Johnson et al., 2017). For some object categories such as cups or armchairs, training can converge to local minima. We report mean and best results over five training runs in Tab. 2, where the best run is chosen according to the F1 score on the validation set. Evaluation is performed on two different test sets: scenes containing (1) object instances with shapes and textures used for training and (2) unseen object instances. We show several scene reconstructions in Fig. 5. Further qualitative results are provided in the supplementary material. For the cars, our model yields consistent performance in all runs, with decomposition results comparable to our Clevr experiments. However, we found that cars exhibit a pseudo-180-degree shape symmetry which is difficult for our model to resolve. Especially for small objects in the background, it favors adapting the texture over rotating the object. For the armchair shapes, our model finds local minima at pseudo-90-degree symmetries. The median rotation error indicates better-than-chance prediction of the correct orientation. Rotation error histograms can be found in the supplementary material. For approximately correct rotation predictions, we found that our model was able to differentiate between basic shape types but often neglected finer details like thin armrests which are difficult to discern in the images. Our tabletop dataset provides another type of challenge: the network needs to distinguish different object categories with larger shape and scale variation. For this setting, we added further auxiliary losses to penalize object positions outside of the image view as well as object intersections (see supplementary material for details). Our model is able to predict the different shape types with coarse textures.
On scenes with instances that were not seen during training, our model often approximates the shapes with similar training instances. Due to the learned 3D structure, our model is able to render novel views of a scene given a single image (see Fig. 6). Although our model never saw multiple views of the same scene during training and is not tuned for this task, we obtain reasonable results for both scene geometry and appearance. We observe a lower reconstruction accuracy for invisible scene parts, especially for the texture. We further evaluated our model on real images of toy cars and building blocks (see Fig. 7), for which we adjusted brightness and contrast to visually match the background color of the synthetic data. Note that while the scene perspective, camera and image properties are different, our model is able to decompose the scene in these examples into the individual objects and obtain a coarse understanding of their shape and appearance without any further fine-tuning on the new data domain.

Limitations. We show typical failure cases of our approach in Fig. 8. Self-supervised learning without regularizing assumptions typically leads to ill-conditioned problems. We use a pre-trained 3D shape space to confine the possible shapes, impose a multi-object decomposition of the scene, and use a differentiable renderer of the latent representation. In our self-supervised approach, ambiguities can arise due to the decoupling of shape and texture. For instance, the network can choose to occlude the background partially with the shape but fix the image reconstruction by predicting the background color in these areas. Rotations can only be learned up to a pseudo-symmetry by self-supervision when object shapes are rotationally similar and the subtle differences in shape or texture are difficult to discern in the image.
In such cases, the network can favor adapting the texture over rotating the shape. Depending on the complexity of the scenes and the interplay of the loss terms, training can run into local minima in which objects are moved outside the image or fitted into the ground plane. Currently, the network is trained for a maximum number of objects. If all objects in the scene are explained, it hides the remaining objects, which could be alleviated by learning a stop criterion.

5. CONCLUSION

We propose a novel deep learning approach for multi-object scene representation learning and parsing. Our approach infers the 3D structure of a scene in RGB images by recursively parsing the image for the shapes, textures and poses of the objects. A differentiable renderer allows images to be generated from the latent scene representation and the network to be trained semi-supervised from RGB-D images. We represent object shapes by signed distance functions. To confine the search space of possible shapes, we employ pre-trained shape spaces in our network. The shape space is represented by a deep neural network using a continuous function representation. Our experiments demonstrate that our model achieves scene parsing for a variety of object counts and shapes. We provide an ablation study to motivate design choices and discuss assumptions and limitations of our approach. We further demonstrate the ability of our model to reason about the underlying 3D space of an observed scene by performing explicit manipulation of the individual objects or rendering novel views. To the best of our knowledge, our approach is the first to jointly learn the tasks of object instance detection, instance segmentation, object pose estimation, and inference of 3D shape and texture in a single RGB image in a semi-supervised way. We believe our approach provides an important step towards self-supervised learning of object-level 3D scene parsing and generative modeling of complex scenes from real images. Our work is currently limited to simple scenes with few objects on a uniformly colored background. The usage of such synthetic data allows us to evaluate the individual design choices of our model in a controlled setup. In future work, we plan to address the challenges of more complex scenes with more diverse backgrounds and objects as well as real imagery.



Neural and differentiable rendering. Eslami et al. (2018) encode images into latent representations which can be aggregated from multiple view points. Scene rendering is deferred to a neural network which needs to be trained to decode the latents into images from examples. Several differentiable rendering approaches have been proposed using voxel occupancy grids (Tulsiani et al., 2017; Gadelha et al., 2017; Jimenez Rezende et al., 2016; Yan et al., 2016; Gwak et al., 2017; Zhu et al., 2018; Wu et al., 2017; Nguyen-Phuoc et al., 2018), meshes (Kato et al., 2018; Loper & Black, 2014; Chen et al., 2019; Delaunoy & Prados, 2011; Ramamoorthi & Hanrahan, 2001; Meka et al., 2018; Athalye et al., 2018; Richardson et al., 2016; Liu et al., 2019; Henderson & Ferrari, 2019), signed distance functions (Sitzmann et al., 2019), or point clouds (Lin et al., 2018; Yifan et al., 2019). Recent literature overviews can be found in (Tewari et al., 2020; Kato et al., 2020). In our approach, we find depth and mask values through equidistant sampling along the ray.

3. METHOD

We propose an autoencoder architecture which embeds images into object-wise scene representations (see Fig. 2 for an overview). Each object is explicitly described by its 3D pose and latent embeddings for both its shape and textural appearance. Given the object-wise scene description, a decoder composes the images back from the latent representation through differentiable rendering. We train our autoencoder-like network in a self-supervised way from RGB-D images.



Figure 1: Example scenes with object manipulation. For each example, we input the left image and show the standard reconstruction in the middle. After manipulation in the latent space, we obtain the respective right image. Plausible new scene configurations are shown on the Clevr dataset (Johnson et al., 2017) (top) and on composed ShapeNet models (Chang et al., 2015) (bottom).

Figure 2: Multi-object 3D scene representation network. The image is sequentially encoded into object representations using an encoder network g_o. The object encoders additionally receive image and mask compositions (∆I, M) generated from the previous object encodings. A decoder F based on a differentiable renderer composes images and masks from the encodings of the previous steps. The background is encoded from the image in parallel and used in the final scene reconstruction.

Figure 3: Object-wise encoding and rendering. We feed the input image and the scene composition images and masks from the previously found objects to an object encoder network g_o, which regresses the encoding of the next object z_i. The object encoding decomposes into shape z_i,sh, extrinsics z_i,ext and texture z_i,tex latents. The shape latent parametrizes an SDF function network Φ, which we use in combination with the pose and scale of the object encoded in z_i,ext for raycasting the object depth and mask using our differentiable renderer f. Finally, the color of the pixels is found with a texture function network Ψ parametrized by the texture latent.
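The sequential encoding loop and the decomposition of each object latent can be sketched as below. The shape and texture latent sizes match the Clevr values given in the network parameters; the extrinsics dimension D_EXT is an assumption, as is the exact form of the encoder and renderer, which are passed in as callables.

```python
import numpy as np

D_SH, D_EXT, D_TEX = 8, 7, 7   # D_EXT is an assumed size; the text gives
                               # D_sh and D_tex but not the extrinsics dim

def split_object_latent(z_i):
    """Split one object encoding z_i into its shape, extrinsics and
    texture parts (z_i,sh, z_i,ext, z_i,tex)."""
    return (z_i[:D_SH],
            z_i[D_SH:D_SH + D_EXT],
            z_i[D_SH + D_EXT:])

def parse_scene(image, encoder_g_o, renderer_f, num_objects):
    """Sequential object-wise parsing: at each step the encoder g_o sees
    the input image together with the composition image delta_I and mask
    M rendered from the objects found so far, and regresses the encoding
    z_i of the next object."""
    encodings = []
    delta_I = np.zeros_like(image)       # rendered composition so far
    M = np.zeros(image.shape[:2])        # accumulated instance masks
    for _ in range(num_objects):
        z_i = encoder_g_o(image, delta_I, M)
        encodings.append(z_i)
        delta_I, M = renderer_f(encodings)
    return encodings
```

Feeding the composition of already-explained objects back into the encoder is what lets the network attend to the remaining, unexplained image regions at each step.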

Figure 4: Qualitative results on the Clevr dataset (Johnson et al., 2017) with three and five objects. Our object-wise scene representation decouples all objects from the background.

Network Parameters. For the Clevr / ShapeNet datasets, the object encoding dimensions are set to D_sh = 8/16 and D_tex = 7/15. The shape decoder is pre-trained for 10K epochs. We decrease the loss weight λ_sh from 0.025/0.1 to 0.0025/0.01 during the first 500K iterations. The remaining weights are fixed to λ_I = 1.0, λ_depth = 0.1/0.05, λ_gr = 0.01. We add Gaussian noise to the input RGB images. Depth images are clipped at a distance of 12. The renderer evaluates 12 steps along each ray. Gaussian smoothing is applied with kernel size 16 and a sigma decreasing from 16/3 to 1/2 over 250K steps. We use the ADAM optimizer (Kingma & Ba, 2014) with learning rate 0.0001 and batch size 8 to train for a dataset-specific number of epochs (see supplementary material for more details).
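The weighted training objective described above can be summarized in code. The constants are the Clevr values from the network parameters; the pairing of each λ with its loss term is our reading of the text, and the linear annealing schedule for λ_sh is an assumption, since the decay shape is not specified.

```python
# Loss weights (Clevr values from the "Network Parameters" paragraph)
LAM_I, LAM_DEPTH, LAM_GR = 1.0, 0.1, 0.01
LAM_SH_START, LAM_SH_END, ANNEAL_ITERS = 0.025, 0.0025, 500_000

def lam_sh(iteration):
    """lambda_sh is decreased from 0.025 to 0.0025 during the first 500K
    iterations; the linear schedule here is an assumption."""
    t = min(iteration / ANNEAL_ITERS, 1.0)
    return LAM_SH_START + t * (LAM_SH_END - LAM_SH_START)

def total_loss(l_rgb, l_depth, l_sh, l_gr, iteration):
    """Weighted sum of RGB reconstruction, depth, shape-prior and
    gradient regularization terms."""
    return (LAM_I * l_rgb + LAM_DEPTH * l_depth
            + lam_sh(iteration) * l_sh + LAM_GR * l_gr)
```

Annealing λ_sh lets the shape prior dominate early training (keeping decoded shapes on the learned shape manifold) and then relaxes it so reconstruction terms can fine-tune the fit.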

Figure 5: Qualitative results on ShapeNet (Chang et al., 2015). Our model obtains a good scene understanding even when confronted with more difficult objects (cars, armchairs) and also handles objects from different categories (tabletop scenes with mugs, bottles and cans). It is able to estimate plausible poses and shapes of the individual objects and learns to decode more complex textures.

Figure 6: Novel view renderings. Our model is able to generate new scene renderings for largely rotated camera views from just a single input RGB image. While we noticed reduced texture accuracy for unseen object parts compared to visible parts, the normal maps are generally good and demonstrate that our model obtains a good 3D structural understanding of the scene.

Table: Evaluation on scenes with ShapeNet objects (Chang et al., 2015). Results for scenes containing objects from different categories. We differentiate between scenes that consist of shapes seen during training and scenes with novel objects. We report the mean and best outcome over five runs. Metrics: F1_0.5 ↑, allObj ↑, RMSE ↓, PSNR ↑, SSIM ↑, RMSE ↓, AbsRD ↓, SqRD ↓, Err_pos ↓, Err_rot [sym.] ↓.
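For reference, the depth error columns can be computed as follows. We assume RMSE, AbsRD and SqRD follow the conventional depth-estimation definitions (root mean squared error, absolute relative difference, squared relative difference); the paper's exact formulas may differ.

```python
import numpy as np

def depth_errors(pred, gt):
    """Depth error metrics over predicted and ground-truth depth maps.

    Returns (rmse, abs_rd, sq_rd), assuming the standard definitions:
      rmse   = sqrt(mean((d - d*)^2))
      abs_rd = mean(|d - d*| / d*)
      sq_rd  = mean((d - d*)^2 / d*)
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    diff = pred - gt
    rmse = np.sqrt(np.mean(diff ** 2))
    abs_rd = np.mean(np.abs(diff) / gt)
    sq_rd = np.mean(diff ** 2 / gt)
    return rmse, abs_rd, sq_rd
```

In practice such metrics are evaluated only on valid pixels, e.g. after masking out the clipped background depth.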

