MESHMVS: MULTI-VIEW STEREO GUIDED MESH RECONSTRUCTION

Abstract

Deep learning based 3D shape generation methods generally use latent features extracted from color images to encode the objects' semantics and guide the shape generation process. These color image semantics encode 3D information only implicitly, potentially limiting the accuracy of the generated shapes. In this paper, we propose a multi-view mesh generation method that incorporates the geometry information in the color images explicitly by using features from intermediate 2.5D depth representations of the input images and regularizing the 3D shapes against these depth images. Our system first predicts a coarse 3D volume from the color images by probabilistically merging voxel occupancy grids from individual views. Depth images corresponding to the multi-view color images are then predicted and, together with depth images rendered from the coarse shape, form a contrastive input whose features guide the refinement of the coarse shape through a series of graph convolution networks. We propose attention-based multi-view feature pooling to fuse the contrastive depth features from different viewpoints before they are fed to the graph convolution networks. We validate the proposed multi-view mesh generation method on ShapeNet, where we obtain a 34% decrease in chamfer distance to ground truth and a 14% increase in F1-score compared with the state-of-the-art multi-view shape generation method.
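As an illustration of what probabilistically merging per-view voxel occupancy grids can look like, the sketch below fuses occupancy probabilities by summing log-odds. This is one common fusion rule, not necessarily the one used in this work. It assumes that the views are conditionally independent, that the prior occupancy is uniform, and that all grids have already been resampled into a shared coordinate frame. The function name `merge_occupancy` and the `eps` clamp are hypothetical choices made for this example.

```python
import math


def merge_occupancy(view_probs, eps=1e-6):
    """Fuse per-view occupancy probabilities, voxel by voxel.

    view_probs: list with one flat list of probabilities per view; every
    view must cover the same voxels in the same order.
    Assumption (not from the paper): independent views and a uniform
    prior, so the fused log-odds is the sum of the per-view log-odds.
    """
    if not view_probs:
        raise ValueError("at least one view is required")
    num_voxels = len(view_probs[0])
    if any(len(view) != num_voxels for view in view_probs):
        raise ValueError("all views must have the same number of voxels")

    fused = []
    for voxel in zip(*view_probs):
        log_odds = 0.0
        for p in voxel:
            # Clamp so that probabilities of exactly 0 or 1 stay finite.
            p = min(max(p, eps), 1.0 - eps)
            log_odds += math.log(p) - math.log(1.0 - p)
        # Numerically stable sigmoid maps log-odds back to a probability.
        if log_odds >= 0.0:
            fused.append(1.0 / (1.0 + math.exp(-log_odds)))
        else:
            z = math.exp(log_odds)
            fused.append(z / (1.0 + z))
    return fused


if __name__ == "__main__":
    # Two views that agree reinforce each other; 0.5 carries no evidence.
    print(merge_occupancy([[0.5, 0.9, 0.2], [0.5, 0.9, 0.2]]))
```

Under this rule, views that agree push the fused probability toward 0 or 1, while a view that reports 0.5 leaves the estimate unchanged.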

1. INTRODUCTION

3D shape generation is a long-standing research problem in computer vision and computer graphics with applications in autonomous driving, augmented reality, etc. Conventional approaches mainly leverage multi-view geometry based on stereo correspondences between images but are restricted by the coverage provided by the input views. With the availability of large-scale 3D shape datasets and the success of deep learning in several computer vision tasks, 3D representations such as voxel grid Choy et al. (2016); Tulsiani et al. (2017); Yan et al. (2016) and point cloud Yang et al. (2018); Fan et al. (2017) have been explored for single-view 3D reconstruction. Among them, triangle mesh representation has received the most attention as it has various desirable properties for a wide range of applications and is capable of modeling detailed geometry without high memory requirement. Single-view 3D reconstruction methods Wang et al. (2018); Huang et al. (2015); Kar et al. (2015); Su et al. (2014) generate the 3D shape from merely a single color image but suffer from occlusion and limited visibility, which leads to low-quality reconstructions in the unseen areas. Multi-view methods Wen et al. (2019); Choy et al. (2016); Kar et al. (2017); Gwak et al. (2017) extend the input to images from different viewpoints, which provides more visual information and improves the accuracy of the generated shapes. Recent work in multi-view mesh reconstruction Wen et al. (2019) introduces a multi-view deformation network using perceptual features from each color image for refining the meshes generated by Pixel2Mesh Wang et al. (2018). Although promising results were obtained, this method relies on perceptual features from color images, which do not explicitly encode the objects' geometry and could restrict the accuracy of the 3D models.

In this work, we present a novel multi-view mesh generation method. We start by predicting a coarse volumetric occupancy grid for the color image of each input viewpoint independently using a shared fully convolutional network. These grids are merged into a single voxel grid in a probabilistic fashion, followed by the cubify operation Gkioxari et al. (2019) to convert it to a triangle 2018) to fine-tune the cubified voxel grid in a coarse-to-fine manner. The GCN refines the coarse mesh using a feature vector for each graph node (mesh vertex), obtained by projecting the vertices onto the 2D contrastive depth features. The contrastive depth features are extracted from the rendered depth maps of the current mesh and the predicted depth maps from a multi-view stereo network. We also propose an attention-based method to fuse features from multiple views that can learn the importance of different views for each of the mesh vertices. Constraints between the intermediate refined meshes from the GCNs and the predicted depth maps of different viewpoints further improve the final mesh quality.

By employing multi-view voxel grid generation and refining it using geometry information from both the current mesh (through the rendered depth maps) and the predicted depth maps, we are able to generate high-quality meshes. We validate our method on the ShapeNet Chang et al. (2015) benchmark, where it achieves the best performance compared with previous multi-view and single-view mesh generation methods.

Figure 1: Architecture of the proposed method. The voxel grid prediction module predicts a coarse voxel grid representation, which is further refined by a series of GCNs. The GCNs use contrastive depth features from rendered depths of the current shape and the predicted depths from MVSNet. Multi-view features are pooled using a multi-head attention mechanism.
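To make the multi-view pooling step in Figure 1 more concrete, the following minimal sketch pools the per-view feature vectors of a single mesh vertex with one softmax attention head. It is an illustration under stated assumptions rather than the architecture itself. The scoring vector `score_weights` stands in for learned parameters, the names `softmax` and `attention_pool` are hypothetical, and only a single head is shown. A multi-head variant would run several such heads with separate parameters and concatenate their outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(view_features, score_weights):
    """Pool the per-view features of one vertex into a single vector.

    view_features: list of V feature vectors, each of length D.
    score_weights: length-D vector that scores each view; a hypothetical
    stand-in for learned attention parameters.
    Returns (pooled feature vector, per-view attention weights).
    """
    dim = len(score_weights)
    if any(len(feat) != dim for feat in view_features):
        raise ValueError("every view feature must have length D")
    scale = math.sqrt(dim)  # scaled dot-product, as in standard attention

    # One scalar score per view, normalized across the views.
    scores = [
        sum(f * w for f, w in zip(feat, score_weights)) / scale
        for feat in view_features
    ]
    weights = softmax(scores)

    # Attention-weighted sum of the view features.
    pooled = [
        sum(a * feat[d] for a, feat in zip(weights, view_features))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    features = [[1.0, 0.0], [3.0, 4.0]]  # two views, D = 2
    print(attention_pool(features, [0.0, 0.0]))   # uniform weights
    print(attention_pool(features, [10.0, 0.0]))  # second view dominates
```

With an all-zero scoring vector the weights are uniform and the result reduces to average pooling. A non-zero scoring vector lets individual views contribute more for a given vertex.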

TRADITIONAL SHAPE GENERATION METHODS

3D model generation has traditionally been tackled using multi-view geometry principles. Among them, structure-from-motion (SfM) Schonberger & Frahm (2016); Agarwal et al. (2011); Cui & Tan (2015); Cui et al. (2017) and simultaneous localization and mapping (SLAM) Cadena et al. (2016); Mur-Artal et al. (2015); Engel et al. (2014); Whelan et al. (2015) are popular techniques that perform 3D reconstruction and camera pose estimation at the same time. These methods extract local image features, match them across images and use the matches to estimate camera poses and 3D geometry. Closer to our problem setup, multi-view stereo methods infer 3D geometry from images with known camera parameters. Volumetric methods Kar et al. (2017); Kutulakos & Seitz (2000); Seitz & Dyer (1999) predict a voxel grid representation of objects by estimating the relationship between each voxel and the object surfaces. Point cloud based methods Furukawa & Ponce (2009); Lhuillier & Quan (2005) start with a sparse point cloud and gradually increase the density of points to obtain a final dense point cloud of the object. Other methods Durou et al. (2008); Zhang et al. (1999); Favaro & Soatto (2005) use shading, texture and defocus cues to reason about visible parts of the object and infer its 3D geometry. While the results of these works are impressive in terms of quality and completeness of reconstruction,

