VA-DEPTHNET: A VARIATIONAL APPROACH TO SINGLE IMAGE DEPTH PREDICTION

Abstract

We introduce VA-DepthNet, a simple, effective, and accurate deep neural network approach to the single-image depth prediction (SIDP) problem. The proposed approach advocates using classical first-order variational constraints for this problem. While state-of-the-art deep neural network methods for SIDP learn scene depth from images in a supervised setting, they often overlook invaluable invariances and priors in the rigid scene space, such as the regularity of the scene. The paper's main contribution is to reveal the benefit of classical and well-founded variational constraints in neural network design for the SIDP task. We show that imposing first-order variational constraints in the scene space, together with a popular encoder-decoder network architecture, provides excellent results for the supervised SIDP task. The imposed first-order variational constraint makes the network aware of the depth gradient in the scene space, i.e., its regularity. The paper demonstrates the usefulness of the proposed approach via extensive evaluation and ablation analysis on several benchmark datasets, such as KITTI, NYU Depth V2, and SUN RGB-D. At test time, VA-DepthNet shows considerable improvements in depth-prediction accuracy compared to the prior art and is also accurate at high-frequency regions in the scene space. At the time of writing, our method, VA-DepthNet, when tested on the KITTI depth-prediction evaluation benchmark, shows state-of-the-art results and is the top-performing published approach.

1. INTRODUCTION

Over the last decade, neural networks have opened a new prospect for the 3D computer vision field, leading to significant progress on many long-standing problems such as multi-view stereo (Huang et al., 2018; Kaya et al., 2022), visual simultaneous localization and mapping (Teed & Deng, 2021), novel view synthesis (Mildenhall et al., 2021), etc. Among the several 3D vision problems, one of the most challenging, if not impossible, to solve is the single-image depth prediction (SIDP) problem. SIDP is ill-posed in a strict geometric sense, which makes it extraordinarily challenging to solve this inverse problem reliably. Moreover, since we do not have access to multi-view images, it is hard to constrain the problem via well-known geometric constraints (Longuet-Higgins, 1981; Nistér, 2004; Furukawa & Ponce, 2009; Kumar et al., 2019; 2017). Accordingly, the SIDP problem generally boils down to an ambitious fitting problem, to which deep learning provides a suitable way of predicting an acceptable solution (Yuan et al., 2022; Yin et al., 2019). Impressive early methods use Markov Random Fields (MRF) to model monocular cues and the relations between several over-segmented image parts (Saxena et al., 2007; 2008). Nevertheless, with the recent surge of neural network architectures (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016), which have an extraordinary capability to perform complex regression, many current works use deep learning to solve SIDP and have demonstrated high-quality results (Yuan et al., 2022; Aich et al., 2021; Bhat et al., 2021; Eigen et al., 2014; Fu et al., 2018; Lee et al., 2019; 2021). Popular recent methods for SIDP are mostly supervised; even so, they are used less in real-world applications than geometric multi-view methods (Labbé & Michaud, 2019; Müller et al., 2022).
Nonetheless, a good solution to SIDP is highly desirable in robotics (Yang et al., 2020), virtual reality (Hoiem et al., 2005), augmented reality (Du et al., 2020), view synthesis (Hoiem et al., 2005), and other related vision tasks (Liu et al., 2019). In this paper, we advocate that, despite the supervised approach being encouraging, SIDP advancement should not rely wholly on increasing dataset sizes. Instead, geometric cues and scene priors could help improve SIDP results. This is not to say that scene priors have not been studied to improve SIDP accuracy in the past. For instance, Yin et al. (2019) relies on good depth-map prediction from a deep network and the idea of virtual normals, the latter computed by randomly sampling three non-collinear points with large distances; this is rather complex and heuristic in nature. Qi et al. (2018) uses depth and normal consistency, which is good, yet it requires a good depth-map initialization. This brings us to the point that further generalization of the regression-based SIDP pipeline is required. As mentioned before, existing approaches in this direction have limitations and are complex. In this paper, we propose a simple approach that provides better depth accuracy and generalizes well across different scenes. To this end, we resort to the physics of variation (Mollenhoff et al., 2016; Chambolle et al., 2010) in the neural network design for better generalization of the SIDP network, which, incidentally, keeps the essence of affine invariance (Yin et al., 2019). An image of a general scene, indoor or outdoor, has a lot of spatial regularity, and therefore introducing a variational constraint provides a convenient way to ensure spatial regularity and to preserve information related to scene discontinuities (Chambolle et al., 2010).
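To make the idea of an edge-aware first-order variational constraint concrete, the sketch below implements a minimal smoothness term in NumPy: depth variation is penalized, but the penalty is down-weighted where the image itself changes sharply, so likely scene discontinuities are preserved. This is an illustrative sketch under our own assumptions (the exponential weighting `exp(-alpha * |∇I|)` and the name `first_order_regularizer` are ours), not the paper's exact formulation.

```python
import numpy as np

def first_order_regularizer(depth, image, alpha=10.0):
    """Edge-aware first-order smoothness term (illustrative sketch).

    Penalizes forward differences of `depth`, weighted so that the
    penalty decays where `image` has strong gradients, i.e. where a
    depth discontinuity is plausible.
    """
    # Forward differences of the depth map along x and y.
    dx = np.abs(depth[:, 1:] - depth[:, :-1])
    dy = np.abs(depth[1:, :] - depth[:-1, :])
    # Edge-aware weights: small where the image changes sharply.
    wx = np.exp(-alpha * np.abs(image[:, 1:] - image[:, :-1]))
    wy = np.exp(-alpha * np.abs(image[1:, :] - image[:-1, :]))
    return float((wx * dx).mean() + (wy * dy).mean())
```

A perfectly flat depth map incurs zero penalty regardless of the image, while a sloped one is penalized less wherever the image has an edge.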
Consequently, the proposed network is trained in a fully-supervised manner while encouraging the network to be mindful of the scene regularity where the variation in depth is large (cf. Sec. 3.1). In simple terms, depth regression must be more than parameter fitting: at some point, a mindful decision must be made, either from image features, from scene depth variation, or from both. As we demonstrate later in the paper, such an idea boosts the network's depth accuracy while preserving both high-frequency and low-frequency scene information (see Fig. 1). Our neural network for SIDP disentangles the absolute scale from the metric depth map. It models an unscaled depth map as the optimal solution to pixel-level variational constraints via weighted first-order differences, respecting the neighboring pixels' depth gradients. Compared to previous methods, the network's task is shifted from pixel-wise metric depth learning to learning the first-order differences of the scene, which alleviates the scale ambiguity and favors scene regularity. To realize this, we initially employ a neural network to predict the first-order differences of the depth map. Then, we construct the partial differential equations representing the variational constraints by reorganizing the differences into a large matrix, i.e., an over-determined system of equations. Further, the network learns a weight matrix to eliminate redundant equations that do not favor the introduced first-order difference constraint. Finally, the closed-form depth-map solution is recovered via simple matrix operations. When tested on the KITTI (Geiger et al., 2012) and NYU Depth V2 (Silberman et al., 2012) test sets, our method outperforms the prior art in depth-prediction accuracy by a large margin. Moreover, our model pre-trained on NYU Depth V2 generalizes better to the SUN RGB-D test set.
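The closed-form recovery step described above can be sketched as a weighted least-squares problem: stack the predicted first-order differences into an over-determined linear system, scale each equation by its confidence weight, and solve for the (unscaled) depth map. Below is a minimal dense NumPy version under our own assumptions (the function name, the dense matrix construction, and the single anchor equation fixing d[0,0] = 0 to remove the additive ambiguity are ours; in the paper the weights are learned by the network rather than passed in).

```python
import numpy as np

def solve_depth_from_differences(gx, gy, wx, wy):
    """Recover a depth map (up to an additive constant) from predicted
    first-order differences via weighted least squares.

    gx[i, j] ~ d[i, j+1] - d[i, j], shape (H, W-1)
    gy[i, j] ~ d[i+1, j] - d[i, j], shape (H-1, W)
    wx, wy: per-equation confidence weights, same shapes as gx, gy.
    """
    H, W = gx.shape[0], gx.shape[1] + 1
    N = H * W
    idx = lambda i, j: i * W + j
    rows, rhs = [], []
    for i in range(H):                      # horizontal difference equations
        for j in range(W - 1):
            r = np.zeros(N)
            r[idx(i, j)], r[idx(i, j + 1)] = -wx[i, j], wx[i, j]
            rows.append(r)
            rhs.append(wx[i, j] * gx[i, j])
    for i in range(H - 1):                  # vertical difference equations
        for j in range(W):
            r = np.zeros(N)
            r[idx(i, j)], r[idx(i + 1, j)] = -wy[i, j], wy[i, j]
            rows.append(r)
            rhs.append(wy[i, j] * gy[i, j])
    anchor = np.zeros(N)                    # differences determine depth only
    anchor[0] = 1.0                         # up to a shift, so pin d[0, 0] = 0
    rows.append(anchor)
    rhs.append(0.0)
    d, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return d.reshape(H, W)
```

For realistic image sizes the system has on the order of 2HW rows, so in practice a sparse solver such as `scipy.sparse.linalg.lsqr` would replace the dense `np.linalg.lstsq` call.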

2. PRIOR WORK

Depth estimation is a longstanding task in computer vision. In this work, we focus on a fully-supervised, single-image approach, and therefore we discuss prior art that directly relates to such an approach. Broadly, we divide the popular supervised SIDP methods into three sub-categories.



(i) Depth Learning using Ranking or Ordinal Relation Constraint. Zoran et al. (2015) and Chen et al. (2016) argue that the ordinal relation between points is easier to learn than the metric depth. To this end, Zoran et al. (2015) proposes constrained quadratic optimization, while Chen et al. (2016) uses pairwise ordinal relations between points to learn scene depth. Alternatively, Yin et al. (2019) uses surface normals as an auxiliary loss to improve performance. Other heuristic approaches, such as Qi et al. (2018), jointly exploit the depth-to-normal relation to recover scene depth and surface normals. Yet, such state-of-the-art SIDP methods have limitations: for example, the approach in Chen et al. (2016), which uses ordinal relations to learn depth, over-smooths the depth-prediction results, thereby failing to preserve high-frequency surface details. Conversely, Yin et al. (

