MESHMVS: MULTI-VIEW STEREO GUIDED MESH RECONSTRUCTION

Abstract

Deep-learning-based 3D shape generation methods generally use latent features extracted from color images to encode the objects' semantics and guide the shape generation process. These color image semantics encode 3D information only implicitly, potentially limiting the accuracy of the generated shapes. In this paper, we propose a multi-view mesh generation method that incorporates geometry information explicitly by using features from intermediate 2.5D depth representations of the input images and regularizing the 3D shapes against these depth images. Our system first predicts a coarse 3D volume from the color images by probabilistically merging voxel occupancy grids from individual views. It then predicts depth images corresponding to the multi-view color images; these, together with rendered depth images of the coarse shape, form a contrastive input whose features guide the refinement of the coarse shape through a series of graph convolution networks. We propose attention-based multi-view feature pooling to fuse the contrastive depth features from different viewpoints before they are fed to the graph convolution networks. We validate the proposed method on ShapeNet, where we obtain a significant improvement, with a 34% decrease in chamfer distance to ground truth and a 14% increase in F1-score compared with the state-of-the-art multi-view shape generation method.

1. INTRODUCTION

3D shape generation is a long-standing research problem in computer vision and computer graphics, with applications in autonomous driving, augmented reality, etc. Conventional approaches mainly leverage multi-view geometry based on stereo correspondences between images but are restricted by the coverage provided by the input views. With the availability of large-scale 3D shape datasets and the success of deep learning in several computer vision tasks, 3D representations such as voxel grids Choy et al. (2016); Tulsiani et al. (2017); Yan et al. (2016) and point clouds Yang et al. (2018); Fan et al. (2017) have been explored for single-view 3D reconstruction. Among 3D representations, the triangle mesh has received the most attention, as it has various desirable properties for a wide range of applications and is capable of modeling detailed geometry without high memory requirements. Single-view 3D reconstruction methods Wang et al. (2018); Huang et al. (2015); Kar et al. (2015); Su et al. (2014) generate the 3D shape from merely a single color image but suffer from occlusion and limited visibility, which leads to low-quality reconstructions in the unseen areas. Multi-view methods Wen et al. (2019); Choy et al. (2016); Kar et al. (2017); Gwak et al. (2017) extend the input to images from different viewpoints, which provides more visual information and improves the accuracy of the generated shapes. Recent work in multi-view mesh reconstruction Wen et al. (2019) introduces a multi-view deformation network that uses perceptual features from each color image to refine the meshes generated by Pixel2Mesh Wang et al. (2018). Although promising results were obtained, this method relies on perceptual features from color images, which do not explicitly encode the objects' geometry and could restrict the accuracy of the 3D models. In this work, we present a novel multi-view mesh generation method. We start by predicting coarse volumetric occupancy grid representations for the color image of each input viewpoint independently using a shared fully convolutional network; these are merged into a single voxel grid in a probabilistic fashion, followed by a cubify Gkioxari et al. (2019) operation to convert it to a triangle

