MESHMVS: MULTI-VIEW STEREO GUIDED MESH RECONSTRUCTION

Abstract

Deep learning based 3D shape generation methods generally utilize latent features extracted from color images to encode the objects' semantics and guide the shape generation process. These color image semantics only implicitly encode 3D information, potentially limiting the accuracy of the generated shapes. In this paper we propose a multi-view mesh generation method which incorporates geometry information explicitly by using features from intermediate 2.5D depth representations of the input images and regularizing the 3D shapes against these depth images. Our system first predicts a coarse 3D volume from the color images by probabilistically merging voxel occupancy grids from individual views. Depth images corresponding to the multi-view color images are then predicted and, together with the rendered depth images of the coarse shape, are used as a contrastive input whose features guide the refinement of the coarse shape through a series of graph convolution networks. Attention-based multi-view feature pooling is proposed to fuse the contrastive depth features from different viewpoints before they are fed to the graph convolution networks. We validate the proposed multi-view mesh generation method on ShapeNet, where we obtain a significant improvement with a 34% decrease in chamfer distance to ground truth and a 14% increase in F1-score compared with the state-of-the-art multi-view shape generation method.

1. INTRODUCTION

3D shape generation is a long-standing research problem in computer vision and computer graphics with applications in autonomous driving, augmented reality, etc. Conventional approaches mainly leverage multi-view geometry based on stereo correspondences between images but are restricted by the coverage provided by the input views. With the availability of large-scale 3D shape datasets and the success of deep learning in several computer vision tasks, 3D representations such as voxel grid Choy et al. (2016) ; Tulsiani et al. (2017) ; Yan et al. (2016) and point cloud Yang et al. (2018) ; Fan et al. (2017) have been explored for single-view 3D reconstruction. Among them, triangle mesh representation has received the most attention as it has various desirable properties for a wide range of applications and is capable of modeling detailed geometry without high memory requirement. Single-view 3D reconstruction methods Wang et al. (2018) ; Huang et al. (2015) ; Kar et al. (2015) ; Su et al. (2014) generate the 3D shape from merely a single color image but suffer from occlusion and limited visibility which leads to low quality reconstructions in the unseen areas. Multi-view methods Wen et al. (2019) ; Choy et al. (2016) ; Kar et al. (2017) ; Gwak et al. (2017) extend the input to images from different viewpoints which provides more visual information and improves the accuracy of the generated shapes. Recent work in multi-view mesh reconstruction Wen et al. (2019) introduces a multi-view deformation network using perceptual feature from each color image for refining the meshes generated by Pixel2Mesh Wang et al. (2018) . Although promising results were obtained, this method relies on perceptual features from color images which do not explicitly encode the objects' geometry and could restrict the accuracy of the 3D models. 
In this work, we present a novel multi-view mesh generation method. We start by predicting coarse volumetric occupancy grid representations for the color images of each input viewpoint independently using a shared fully convolutional network. These grids are merged into a single voxel grid in a probabilistic fashion, followed by the cubify operation Gkioxari et al. (2019) to convert it to a triangle mesh. We then use Graph Convolutional Networks (GCN) Scarselli et al. (2008); Wang et al. (2018) to refine the cubified voxel grid in a coarse-to-fine manner. The GCN refines the coarse mesh by using the feature vector of each graph node (mesh vertex) obtained by projecting the vertices onto the 2D contrastive depth features. The contrastive depth features are extracted from the rendered depth maps of the current mesh and the predicted depth maps from a multi-view stereo network. We also propose an attention-based method to fuse features from multiple views that can learn the importance of different views for each of the mesh vertices. Constraints between the intermediate refined meshes from the GCNs and the predicted depth maps of different viewpoints further improve the final mesh quality. By employing multi-view voxel grid generation and refining it using geometry information from both the current mesh (through the rendered depth maps) and the predicted depth maps, we are able to generate high-quality meshes. We validate our method on the ShapeNet Chang et al. (2015) benchmark, where it achieves the best performance among all previous multi-view and single-view mesh generation methods.

Figure 1: Architecture of the proposed method. The voxel grid prediction module predicts a coarse voxel grid representation which is further refined by a series of GCNs. The GCNs use contrastive depth features from rendered depths of the current shape and the predicted depths from MVSNet. Multi-view features are pooled using a multi-head attention mechanism.

2.1. TRADITIONAL SHAPE GENERATION METHODS

3D model generation has traditionally been tackled using multi-view geometry principles. Among them, structure-from-motion (SfM) Schonberger & Frahm (2016); Agarwal et al. (2011); Cui & Tan (2015); Cui et al. (2017) and simultaneous localization and mapping (SLAM) Cadena et al. (2016); Mur-Artal et al. (2015); Engel et al. (2014); Whelan et al. (2015) are popular techniques that perform 3D reconstruction and camera pose estimation at the same time. These methods extract local image features, match them across images and use the matches to estimate camera poses and 3D geometry. Closer to our problem setup, multi-view stereo methods infer 3D geometry from images with known camera parameters. Volumetric methods Kar et al. (2017); Kutulakos & Seitz (2000); Seitz & Dyer (1999) predict a voxel grid representation of objects by estimating the relationship between each voxel and the object surfaces. Point cloud based methods Furukawa & Ponce (2009); Lhuillier & Quan (2005) start with a sparse point cloud and gradually increase the density of points to obtain a final dense point cloud of the object. Durou et al. (2008); Zhang et al. (1999); Favaro & Soatto (2005) use shading, texture and defocus cues to reason about visible parts of the object and infer its 3D geometry. While the results of these works are impressive in terms of quality and completeness of reconstruction, they still struggle with poorly textured and reflective surfaces and require carefully selected input views.

2.2. DEEP SHAPE GENERATION METHODS

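Deep learning based approaches can learn to infer 3D structure from training data and can be robust against poorly textured and reflective surfaces as well as limited and arbitrarily selected input views. These methods can be categorized into single-view and multi-view methods. Huang et al. (2015); Su et al. (2014) use shape component retrieval and deformation from a large dataset for single-view 3D shape generation. Kurenkov et al. (2018) extend this idea by introducing free-form deformation networks on object templates retrieved from a database. Some works learn shape deformation from ground truth foreground masks of 2D images Kar et al. (2015); Yan et al. (2016); Tulsiani et al. (2017). Recurrent Neural Network (RNN) based methods Choy et al. (2016); Kar et al. (2017); Gwak et al. (2017) are another popular solution to this problem. Gwak et al. (2017); Lin et al. (2019) introduce image silhouettes along with adversarial multi-view constraints and optimize object mesh models using multi-view photometric constraints. Predicting meshes directly from color images was proposed in Wang et al. (2018); Wickramasinghe et al. (2019); Pan et al. (2019); Wen et al. (2019); Gkioxari et al. (2019); Tang et al. (2019). DR-KFS Jin et al. (2019) introduces a differentiable visual similarity metric while SeqXY2SeqZ Han et al. (2020) represents 3D shapes using a set of 2D voxel tubes for shape reconstruction. Front2Back Yao et al. (2020) generates 3D shapes by fusing predicted depth and normal images and DV-Net Jia et al. (2020) predicts dense object point clouds using dual-view RGB images with a gated control network to fuse point clouds from the two views. FoldingNet Yang et al. (2018) learns to reconstruct arbitrary point clouds from a single 2D grid. AtlasNet Groueix et al. (2018) uses a learned parametric representation, while Mescheder et al. (2019); Park et al. (2019); Liu et al. (2019b;a); Murez et al. (2020) employ implicit surface representations to reconstruct 3D shapes.

2.3. DEPTH ESTIMATION

Compared to 3D shape generation, depth prediction is an easier problem formulation since it simplifies the task to per-view depth map estimation. Traditional methods Campbell et al. (2008); Galliani et al. (2015); Schönberger et al. (2016) use multi-view stereo principles for depth prediction. Deep learning based multi-view stereo depth estimation was first introduced in Hartmann et al. (2017), where a learned cost metric is used to estimate patch similarities. DeepMVS Huang et al. (2018) warps multi-view images to 3D space and then applies deep networks for regularization and aggregation to estimate depth images. Learned 3D cost volume based depth prediction was proposed in MVSNet Yao et al. (2018), where a three-dimensional cost volume is built using homographically warped 2D features from multi-view images and 3D CNNs are used for cost regularization and depth regression. This idea was further extended by Chen et al. (2019); Luo et al. (2019); Gu et al. (2019); Yao et al. (2019).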

3. METHODOLOGY

Figure 1 shows the architecture of the proposed system which takes as input multi-view color images of an object with known poses and outputs a triangle mesh representing the surface of the object.

3.1. MULTI-VIEW VOXEL GRID PREDICTION

Single-view Voxel Grid Prediction

The single-view voxel branch consists of a ResNet feature extractor and a fully convolutional voxel grid prediction network. It generates the coarse initial shape of an object from one viewpoint as a voxel occupancy grid using a color image. We set the resolution of the generated voxel occupancy grid to 32 × 32 × 32. The voxel prediction networks for all viewpoints share the same weights.

Probabilistic Occupancy Grid Merging

A voxel occupancy grid predicted from a single viewpoint suffers from occlusion and limited visibility. In order to fuse voxel grids from different viewpoints, we propose a probabilistic occupancy grid merging method which merges the voxel grids from each input viewpoint probabilistically to obtain the final voxel grid output. This allows occluded regions in one view to be estimated from other views where those regions are visible, and increases the confidence of the prediction in overlapping regions. The occupancy probability of each voxel is represented by $p(x)$, which is converted to log-odds (logit):

$$l(x) = \log \frac{p(x)}{1 - p(x)} \tag{1}$$

A Bayesian update on the probabilities reduces to a simple summation of log-odds Konolige (1997). Hence, the multi-view log-odds of a voxel is given by:

$$l(x) = l_1(x) + l_2(x) + \dots + l_n(x)$$

where $l_i$ is the voxel's log-odds in view $i$ and $n$ is the number of input views. The final voxel probability $p(x)$ is obtained by applying the inverse function of Equation (1), which is a sigmoid function.
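The merging rule above can be illustrated with a minimal sketch. It assumes the per-view grids have already been aligned to a common frame and flattened to lists of probabilities; the function names and the clamping constant `eps` are illustrative choices, not part of the described system.

```python
import math

def logit(p, eps=1e-6):
    """Occupancy probability -> log-odds. Clamping by eps is an assumption
    made here to avoid infinite log-odds at p = 0 or p = 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(l):
    """Inverse of the logit: log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-l))

def merge_occupancy(per_view_probs):
    """Merge one voxel's per-view probabilities by summing their log-odds."""
    return sigmoid(sum(logit(p) for p in per_view_probs))

def merge_grids(grids):
    """Merge several flattened, already-aligned voxel grids voxel by voxel."""
    return [merge_occupancy(probs) for probs in zip(*grids)]

# Two views that agree reinforce each other (odds 4 * 4 = 16, i.e. 16/17),
# while two views that disagree symmetrically cancel out to 0.5.
agree = merge_occupancy([0.8, 0.8])
disagree = merge_occupancy([0.8, 0.2])
```

Note that a view predicting 0.5 for a voxel contributes zero log-odds, so an uninformative view leaves the merged estimate unchanged.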

3.2. MESH REFINEMENT

The cubified mesh from the voxel branch only provides a coarse reconstruction of the object's surface. We apply graph convolutional networks which represent each mesh vertex as one graph node and deform the vertices to more accurate positions.

GCN-based Mesh Deformation

The features pooled from multi-view images along with the 3D coordinates of the vertices in the world frame are used as features of the graph nodes. A series of graph-based convolutional network (GCN) blocks is applied to deform the mesh at the current stage to the next stage, starting with the cubified voxel grid. A graph convolution deforms mesh vertices by propagating features from neighboring vertices by applying

$$f'_i = \mathrm{ReLU}\Big(W_0 f_i + \sum_{j \in \mathcal{N}(i)} W_1 f_j\Big)$$

where $\mathcal{N}(i)$ is the set of neighboring vertices of the $i$-th vertex in the mesh, $f_{\{\cdot\}}$ represents the feature vector of a vertex, and $W_0$ and $W_1$ are learnable parameters of the model. Each GCN block utilizes several graph convolutions to transform the vertex features, followed by a final vertex refinement operation where the features along with the vertex coordinates are further transformed as

$$v'_i = v_i + \tanh\big(W_{vert} [f_i; v_i]\big)$$

where the matrix $W_{vert}$ is another learnable parameter, to obtain the deformed mesh.

Contrastive Depth Feature Extraction

Yao et al. (2020) demonstrate that using intermediate, image-centric 2.5D representations instead of directly generating 3D shapes in the global frame from raw 2D images can improve 3D reconstruction quality. We therefore propose to formulate the features for graph nodes using 2.5D depth maps as additional inputs alongside the RGB features. Specifically, we render the meshes at different GCN stages to depth images at all the input views using Kato et al. (2018) and use them along with the predicted depths for depth feature extraction. We call this form of depth input contrastive depth, as it contrasts the rendered depths of the current mesh against the predicted depths and allows the network to reason about the deformation better than when using predicted depth or color images alone. Given the 2D features, the corresponding feature vectors of individual vertices can be found by projecting the 3D vertex coordinates to the feature planes using known camera parameters. We use VGG-16 Simonyan & Zisserman (2014) as our contrastive depth feature extraction network.
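As a rough illustration of the GCN-based mesh deformation described in this sub-section, the sketch below implements one graph convolution and the vertex refinement step in NumPy. It is not the authors' implementation: the dense adjacency matrix, the row-vector weight layout, the function names and the toy feature sizes are all assumptions made for readability.

```python
import numpy as np

def graph_conv(feats, edges, w0, w1):
    """One graph convolution: ReLU(W0 f_i + sum over neighbors j of W1 f_j).

    feats: (V, C_in) vertex features, one row per vertex.
    edges: (E, 2) integer array of undirected mesh edges.
    w0, w1: (C_in, C_out) weights, stored transposed for row-vector features.
    """
    num_verts = feats.shape[0]
    # Dense adjacency is used only for clarity; real meshes call for sparse ops.
    adj = np.zeros((num_verts, num_verts))
    adj[edges[:, 0], edges[:, 1]] = 1.0
    adj[edges[:, 1], edges[:, 0]] = 1.0
    return np.maximum(feats @ w0 + adj @ (feats @ w1), 0.0)

def refine_vertices(verts, feats, w_vert):
    """Vertex refinement: v_i + tanh(W_vert [f_i; v_i]).

    verts: (V, 3), feats: (V, C), w_vert: (C + 3, 3).
    """
    offsets = np.tanh(np.concatenate([feats, verts], axis=1) @ w_vert)
    return verts + offsets

# Toy example on a single triangle with hypothetical feature sizes.
rng = np.random.default_rng(0)
edges = np.array([[0, 1], [1, 2], [2, 0]])
verts = rng.normal(size=(3, 3))
# Node features: pooled image features (8-D here) plus 3D coordinates.
feats = np.concatenate([rng.normal(size=(3, 8)), verts], axis=1)
hidden = graph_conv(feats, edges, rng.normal(size=(11, 16)), rng.normal(size=(11, 16)))
new_verts = refine_vertices(verts, hidden, 0.1 * rng.normal(size=(19, 3)))
```

Because the offset passes through a tanh, each refinement step moves a vertex by less than one unit per axis, which bounds the deformation applied by a single GCN block.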
Multi-View Depth Estimation

We extend MVSNet Yao et al. (2018) to predict the depth maps of all views, since the original implementation predicts the depth of only one reference view. This is achieved by transforming the feature volumes to each view's coordinate frame using homography warping and applying identical cost volume regularization and depth regression on each view. A detailed network architecture diagram of this module is provided in the appendix.

Attention-based Multi-View Feature Pooling

In order to fuse multi-view contrastive depth features, we formulate an attention module by adapting the multi-head attention mechanism originally designed for sequence-to-sequence machine translation with the transformer (encoder-decoder) architecture Vaswani et al. (2017). In a transformer architecture, the encoder hidden state is mapped to lower-dimensional key-value pairs $(K, V)$ while the decoder hidden state is mapped to a query vector $Q$ using independent fully connected layers. The encoder hidden state in our case is the multi-view features, while the decoder hidden state is the mean of the multi-view features. The attention weights are computed using the scaled dot product:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^T}{\sqrt{N}}\Big) V$$

where $N$ is the number of input views. Multiple attention heads are used, which are concatenated and transformed to obtain the final output:

$$\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i), \qquad \mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \dots; \mathrm{head}_h] W^0 \tag{4}$$

where the $W$ matrices are parameters to be learned, $h$ is the number of attention heads and $i \in [1, h]$. We choose multi-head attention as our feature pooling method since it allows the model to attend to information from different representation subspaces of the features by training multiple attentions in parallel. This method is also invariant to the order and number of input views. We visualize the learned attention weights (averaged over the attention heads) in Figure 2, where we can observe that the attention weights roughly take into account the visibility/occlusion information from each view.
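The pooling described above can be sketched for a single mesh vertex as follows. This is a minimal NumPy illustration under stated assumptions: the query is the mean of the per-view features, keys and values are the per-view features themselves, and the scaling uses the square root of the number of views as in the formula above. The function names, weight shapes and toy dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(view_feats, wq, wk, wv, wo):
    """Pool per-view features of one vertex with multi-head attention.

    view_feats: (N, C) features of the vertex in each of N views.
    wq, wk, wv: (h, C, d) per-head projection weights.
    wo: (h * d, C_out) output projection.
    Returns the pooled (C_out,) feature and the (h, N) attention weights.
    """
    n_views = view_feats.shape[0]
    query = view_feats.mean(axis=0)              # "decoder" state: mean over views
    heads, weights = [], []
    for wq_i, wk_i, wv_i in zip(wq, wk, wv):
        q = query @ wq_i                         # (d,)
        k = view_feats @ wk_i                    # (N, d)
        v = view_feats @ wv_i                    # (N, d)
        a = softmax(k @ q / np.sqrt(n_views))    # one weight per view
        heads.append(a @ v)                      # weighted sum of values
        weights.append(a)
    return np.concatenate(heads) @ wo, np.stack(weights)

# Toy sizes: 5 heads as in the text, all other dimensions are hypothetical.
rng = np.random.default_rng(0)
h, C, d, C_out = 5, 8, 4, 6
wq, wk, wv = (rng.normal(size=(h, C, d)) for _ in range(3))
wo = rng.normal(size=(h * d, C_out))
feats = rng.normal(size=(3, C))
pooled, attn = attention_pool(feats, wq, wk, wv, wo)
```

Since both the mean query and the softmax-weighted sum are symmetric in the views, permuting the rows of `view_feats` leaves the pooled feature unchanged, and the same weights can be applied to any number of views.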

Mesh losses

The losses derived from Wang et al. (2018) that constrain the mesh predicted by each GCN block ($P$) to resemble the ground truth ($Q$) include the Chamfer distance

$$L_{chamfer}(P, Q) = |P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} \|p - q\|^2 + |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} \|q - p\|^2$$

and the surface normal loss

$$L_{normal}(P, Q) = -|P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} |u_p \cdot u_q| - |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} |u_q \cdot u_p|$$

with additional regularization in the form of an edge length loss

$$L_{edge}(V, E) = \frac{1}{|E|} \sum_{(v,v') \in E} \|v - v'\|^2$$

for visually appealing results. Here $\Lambda_{P,Q}$ denotes the set of pairs $(p, q)$ such that $q$ is the nearest neighbor of $p$ in $Q$, and $u_p$ is the unit normal at point $p$.

Depth loss

Our depth prediction network is supervised using the adaptive reversed Huber loss (also known as the BerHu criterion) Lambert-Lacroix & Zwald (2016):

$$L_{depth} = \begin{cases} |x|, & \text{if } |x| \le c \\ \dfrac{x^2 + c^2}{2c}, & \text{otherwise} \end{cases}$$

where $x$ is the depth error of a pixel and $c$ is a constant set to 0.2. Note that the original MVSNet uses an L1 loss, but we use the BerHu loss since it gave slightly higher accuracy. Intuitively, this is because BerHu provides a good balance between the L1 and L2 losses; a similar improvement was shown in Laina et al. (2016).

Contrastive depth loss

The BerHu loss is also applied between the rendered depth images at different GCN stages and the predicted depth images:

$$L_{contrastive} = \begin{cases} |x|, & \text{if } |x| \le c \\ \dfrac{x^2 + c^2}{2c}, & \text{otherwise} \end{cases}$$

Voxel loss

The binary cross-entropy loss between the predicted voxel occupancy probabilities and the ground truth occupancies is used as the voxel loss to supervise the voxel predictions:

$$L_{voxel} = -\big(\hat{p}(x) \log p(x) + (1 - \hat{p}(x)) \log(1 - p(x))\big)$$

where $p(x)$ is the predicted occupancy probability of a voxel and $\hat{p}(x)$ is its ground-truth occupancy.

Final loss

We use the weighted sum of the individual losses discussed above as the final loss to train our model in an end-to-end fashion:

$$L = \lambda_{chamfer} L_{chamfer} + \lambda_{normal} L_{normal} + \lambda_{edge} L_{edge} + \lambda_{depth} L_{depth} + \lambda_{contrastive} L_{contrastive} + \lambda_{voxel} L_{voxel}$$

where $L$ is the final loss term.

4. EXPERIMENTS

The hierarchical features obtained from the "Contrastive Depth Features Extractor" have a total of 4800 dimensions for each view. The aggregated multi-view features are compressed to 480 dimensions after applying attentive feature pooling. Five attention heads are used for merging multi-view features. The loss function weights are set as $\lambda_{chamfer} = 1$, $\lambda_{normal} = 1.6 \times 10^{-4}$, $\lambda_{depth} = 0.1$, $\lambda_{contrastive} = 0.001$ and $\lambda_{voxel} = 1$. Two settings of $\lambda_{edge}$ were used: $\lambda_{edge} = 0$ (referred to as Best), which gives better quantitative results, and $\lambda_{edge} = 0.2$ (referred to as Pretty), which gives better qualitative results.

Training and Runtime

The network is optimized using the Adam optimizer with a learning rate of $10^{-4}$. Training is done on 5 Nvidia RTX-2080 GPUs with an effective batch size of 5. The depth prediction network (MVSNet) is trained independently for 30 epochs. Then the whole system is trained for another 40 epochs with the weights of the MVSNet frozen. Our system is implemented in the PyTorch deep learning framework and takes around 60 hours to train.

Evaluation Metric

Following Wang et al. (2018); Wen et al. (2019), we use the F1-score as our evaluation metric. The F1-score is the harmonic mean of precision and recall, where precision and recall are calculated as the percentage of points in the predicted and ground-truth shapes, respectively, that can find a nearest neighbor in the other within a threshold. We provide evaluations with two threshold values: $\tau$ and $2\tau$, where $\tau = 10^{-4}\,m^2$.
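The evaluation metric can be sketched as below. This is an illustrative brute-force version, not the evaluation code used for the reported numbers. Since the threshold is given in squared metres, the sketch assumes it is applied to squared nearest-neighbor distances and that a distance equal to the threshold counts as a match; both are assumptions, as is the function name.

```python
import numpy as np

def f1_score(pred, gt, tau=1e-4):
    """F1-score between two point sets of shape (P, 3) and (G, 3).

    A point counts as matched when the squared distance to its nearest
    neighbor in the other set is at most tau (assumed convention).
    """
    # Pairwise squared distances, shape (P, G). Brute force, fine for small sets.
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(axis=-1)
    precision = float((d2.min(axis=1) <= tau).mean())  # predicted points near the ground truth
    recall = float((d2.min(axis=0) <= tau).mean())     # ground-truth points near the prediction
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy example: a prediction shifted by 12 mm fails at tau but passes at 2 * tau,
# because its squared error of about 1.44e-4 lies between the two thresholds.
gt_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shifted = gt_points + np.array([0.012, 0.0, 0.0])
score_tau = f1_score(shifted, gt_points, tau=1e-4)
score_2tau = f1_score(shifted, gt_points, tau=2e-4)
```

For point sets sampled densely from meshes, the pairwise distance matrix becomes large, so a spatial index such as a k-d tree would normally replace the brute-force search.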

4.2. COMPARISON WITH PREVIOUS MULTI-VIEW SHAPE GENERATION METHODS

We quantitatively compare our method against previous works for multi-view shape generation in Table 1 and show the effectiveness of our method in improving the shape quality. Our method outperforms the state-of-the-art method Pixel2Mesh++ Wen et al. (2019) with a 34% decrease in chamfer distance to ground truth and a 15% increase in F1-score at threshold τ. Note that in Table 1 the same model is trained for all the categories, but accuracy on individual categories as well as the average over the categories are evaluated. We provide the chamfer distances in the appendix. We also provide visual results for qualitative assessment of the shapes generated by our Pretty model in Figure 3, which shows that it is able to more accurately predict topologically diverse shapes.

Table 1: Quantitative comparison against state-of-the-art multi-view shape generation methods. We report F-score on each semantic category along with the mean over all categories using two thresholds τ and 2τ for nearest neighbor match, where $\tau = 10^{-4}\,m^2$.

4.3. ABLATION STUDIES

Contrastive Depth Feature Extraction

We evaluate several methods for contrastive feature extraction (Sub-section 3.2). These methods are 1) Input Concatenation: using the concatenated rendered and predicted depth maps as input to the VGG feature extractor, 2) Input Difference: using the difference of the two depth maps as input to VGG, 3) Feature Concatenation: concatenating features from rendered and predicted depths extracted by a shared VGG, 4) Feature Difference: using the difference of the features from the two depth maps extracted by a shared VGG, 5) Predicted depth only: using the VGG features from the predicted depths only, and 6) Rendered depth only: using the VGG features from the rendered depths only. The quantitative results are summarized in Table 2 and show that the Input Concatenation method produces better results than the other formulations.

Accuracy with different settings

Table 3 shows the contribution of different components towards the final accuracy. Naively extending the single-view Mesh R-CNN Gkioxari et al. (2019) to multiple views using statistical feature pooling Wen et al. (2019) for mesh refinement (row 1) gives an F1-score of 72.74% for threshold τ, which is a 6.26% improvement over Pixel2Mesh++. We further extend the above method with our probabilistic multi-view voxel grid prediction in row 2 and get a 4.23% improvement. In row 3 of Table 3 we use our contrastive depth features instead of RGB features for mesh refinement and get a 2.7% improvement. We then replace the statistical feature pooling with the proposed attention method and get a 0.19% improvement. The improvement is not significant on our final architecture, but we found multi-head attention to perform better on more light-weight architectures. We also evaluate the effect of using additional regularization from contrastive depth losses (rendered depth vs. predicted depth) in row 5, which improves the score by 0.98%. In row 6 we use ground-truth depths instead of predicted depths on our final model, which gives the upper bound on our mesh prediction accuracy in relation to the depth prediction accuracy as 84.58%.

Under review as a conference paper at ICLR 2021

Table 3: Comparison of shape generation accuracy with different settings of additional contrastive depth losses and multi-view feature pooling. The Baseline framework uses the multi-head attention mechanism without any contrastive depth losses.

Number of Views

We test the performance of our framework with respect to the number of views. Table 4 shows that the accuracy of our method increases as we increase the number of input views for training. These experiments also validate that the attention-based feature pooling can efficiently encode features from different views to take advantage of a larger number of views. Table 5 shows the results when using different numbers of views during testing on our model trained with 3 views. It indicates that increasing the number of views during testing does not improve the accuracy, while decreasing the number of views can cause a significant drop in accuracy.

5. CONCLUSION

We propose a neural network based solution to predict 3D triangle mesh models of objects from images taken from multiple views. First, we propose a multi-view voxel grid prediction module which probabilistically merges voxel grids predicted from individual input views. We then cubify the merged voxel grid to a triangle mesh and apply graph convolutional networks to further refine the mesh. The features for the mesh vertices are extracted from contrastive depth input consisting of rendered depths at each refinement stage along with the predicted depths. The proposed mesh reconstruction method outperforms existing methods by a large margin and is capable of reconstructing objects with more complex topologies.

PROBABILISTIC OCCUPANCY GRID MERGING

We use the single-view voxel prediction network from Gkioxari et al. (2019) to predict voxel grids for each of the input images in their respective local coordinate frames. The occupancy grids are transformed to the global frame (which is set to the coordinate frame of the first image) by finding the equivalent global grid values in the local grids after applying bilinear interpolation on the closest matches. The voxel grids in global coordinates are then probabilistically merged according to Sub-section 3.1 of the main submission.

EXPERIMENTS

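We quantitatively compare our method against previous works for multi-view shape generation in Table 6. Our method outperforms the state-of-the-art method Pixel2Mesh++ Wen et al. (2019) with a 34% decrease in chamfer distance to ground truth, which shows the effectiveness of our proposed method. Note that in Table 6 the same model is trained for all the categories, but accuracy on individual categories as well as the average over all the categories are evaluated.

ABLATION STUDIES

Coarse Shape Generation

We conduct comparisons of the voxel grids predicted by our proposed probabilistic merging against the single-view method Gkioxari et al. (2019). As shown in Table 7, the accuracy of the initial shape generated from the probabilistically merged voxel grid is higher than that from individual views.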

Accuracy at Different GCN Stages

We analyze the accuracy of meshes at different GCN stages in Table 8 . The results validate that our method produces the meshes in a coarse-to-fine manner and multiple GCN refinements improve the mesh quality.

Resolution of Depth Prediction

We conduct experiments using different numbers of depth hypotheses in our depth prediction network (Sub-section A), producing depth values at different resolutions. A higher number of depth hypotheses means a finer resolution of the predicted depths. The quantitative results with different hypothesis numbers are summarized in Table 9. We set the number of depth hypotheses to 48 for our final architecture, which is equivalent to a resolution of 25 mm. We observe that the mesh accuracy remains relatively unchanged if we predict depths at finer resolutions.
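The relation between the number of depth hypotheses and the depth resolution can be illustrated with a small sketch. It assumes uniformly spaced hypotheses that include both ends of a fixed depth range. The range used below is hypothetical, chosen only so that 48 hypotheses give a 25 mm spacing; the actual range is derived from the dataset and is not stated here.

```python
def depth_hypotheses(d_min, d_max, num_hypotheses):
    """Uniformly spaced depth hypotheses over [d_min, d_max] and their spacing.

    Assumes both end points are included, so the spacing is the range
    divided by (num_hypotheses - 1).
    """
    step = (d_max - d_min) / (num_hypotheses - 1)
    return [d_min + i * step for i in range(num_hypotheses)], step

# Hypothetical depth range in metres (not taken from the paper).
D_MIN, D_MAX = 0.1, 1.275

for n in (24, 48, 96):
    _, step = depth_hypotheses(D_MIN, D_MAX, n)
    print(f"{n} hypotheses -> spacing of about {step * 1000:.1f} mm")
```

With the range held fixed, doubling the number of hypotheses roughly halves the spacing, at the cost of a proportionally larger cost volume.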

Generalization Capability

We conduct experiments to evaluate the generalization capability of our system across the semantic categories. We train our model with only 12 out of the 13 categories and test on the category that was left out. Table 10 shows that the accuracy generally does not decrease significantly when compared with the model that was trained on all 13 categories when using 2τ threshold for the F-score. 

B APPENDIX

BEST VS PRETTY MODELS

We provide a qualitative comparison between our models trained with the best and pretty configurations in Figure 5. The best configuration refers to our model trained without edge regularization, while pretty refers to the model trained with the regularization (Sub-section 4.1). We observe that without the regularization we get a higher score on our evaluation metrics but get degenerate meshes with self-intersections and irregularly sized faces.

FAILURE CASES

Some failure cases of our model (with the pretty setting) are shown in Figure 6. We notice that the rough topology of the mesh is recovered while the fine topology is not reconstructed. Recovering from a wrong initial topology is a promising direction for future work.

Figure 6: Failure cases. Our system can struggle to roughly reconstruct shapes with very complex topology, while some fine topology of the mesh is missing.



; Wickramasinghe et al. (2019); Pan et al. (2019); Wen et al. (2019); Gkioxari et al. (2019); Tang et al. (2019). DR-KFS Jin et al. (2019) introduces a differentiable visual similarity metric while SeqXY2SeqZ Han et al. (2020) represents 3D shapes using a set of 2D voxel tubes for shape reconstruction. Front2Back Yao et al. (2020) generates 3D shapes by fusing predicted depth and normal images and DV-Net Jia et al. (2020) predicts dense object point clouds using dual-view RGB images with a gated control network to fuse point clouds from the two views. FoldingNet Yang et al. (2018) learns to reconstruct arbitrary point clouds from a single 2D grid. AtlasNet Groueix et al. (2018) use learned parametric representation while Mescheder et al. (2019); Park et al. (2019); Liu et al. (2019b;a); Murez et al. (2020) employ implicit surface representation to reconstruct 3D shapes. 2.3 DEPTH ESTIMATION Compared to 3D shape generation, depth prediction is an easier problem formulation since it simplifies the task to per-view depth map estimation. Traditional methods Campbell et al. (2008); Galliani et al. (2015); Schönberger et al. (2016) use multi-view stereo principles for depth prediction. Deep learning based multi-view stereo depth estimation was first introduced in Hartmann et al. (

Figure 2: Attention weights visualization. From left to right: input images from 3 viewpoints, corresponding ground truth point clouds color-coded by their view order, and the predicted mesh vertices color-coded by the attention weights of the views. Only the view with the maximum attention weight is visualized for each predicted point for clarity.

Figure 3: Qualitative evaluation on ShapeNet dataset. From top to bottom: one of the input images, ground truth mesh, multi-view extended Pixel2Mesh, Pixel2Mesh++, and ours. Our predictions are closer to the actual shape, especially for the objects with more complex topologies.

Figure 5: Qualitative evaluation: best vs pretty wireframe models. The best models, while being preferred by the evaluation metrics, lead to degenerate meshes with irregularly sized faces and self-intersections.

We use VGG-16 Simonyan & Zisserman (2014) as our contrastive depth feature extraction network. Multi-View Depth Estimation We extend MVSNet Yao et al. (

Comparisons of different contrastive depth formulations. In 1st and 2nd rows, concatenation and difference of the rendered and predicted depths are fed to VGG feature extractor while in 3rd and 4th rows, concatenation and difference of the VGG features from the depths is used for mesh refinement. 5 uses VGG features from predicted depths only while 6 uses VGG features from rendered depths only.instead of predicted depths on our final model which gives the upper bound on our mesh prediction accuracy in relation to the depth prediction accuracy as 84.58%.

Table 4: Accuracy w.r.t. the number of views during training. The evaluation was performed on the same number of views as training.

Table 5: Accuracy w.r.t. the number of views during testing. The same model trained with 3 views was used in all of the cases.

Table6same model is trained for all the categories but accuracy on individual categories as well as average over all the categories are evaluated. Qualitative comparison against state-of-the-art multi-view shape generation methods. FollowingWen et al. (2019), we report Chamfer Distance in m 2 × 1000 from ground truth for different methods. Note that same model is trained for all the categories but accuracy on individual categories as well as average over all the categories are evaluated.ABLATION STUDIESCoarse Shape Generation We conduct comparisons on voxel grid predicted from our proposed probabilistically merged voxel grids against single view methodGkioxari et al. (2019). As is shown in Table

Table 8: Accuracy of the refined meshes at different GCN stages. 1, 2 and 3 indicate the performance at the corresponding graph convolution blocks, while Cubified is for the cubified voxel grids used as input for the first GCN block. All the stages, including the voxel prediction, were trained jointly and hence the accuracy of voxel predictions varies from that in Table 7.

Table 9: Accuracy w.r.t. the number of depth hypotheses. A higher number of depth hypotheses increases the resolution of predicted depth values at the expense of a higher memory requirement. The range of depths for all the models is the same and is based on the minimum/maximum depth in the ShapeNet Chang et al. (2015) dataset.

Table 10: Accuracy when a category is excluded during training and evaluation is performed on that category, to verify how well training on other categories generalizes to the excluded category.

annex

Here, we extend MVSNet to predict the depth maps of all views instead of only the reference view. This is achieved by transforming the feature volumes to each view's coordinate frame using homography warping and applying identical cost volume regularization and depth regression on each view. This allows the reuse of pre-regularization feature volumes for efficient multi-view depth prediction invariant to the order of input images. Figure 4 shows the architecture of our depth estimation module.

