PSEUDO LABEL-GUIDED MULTI-TASK LEARNING FOR SCENE UNDERSTANDING

Anonymous

Abstract

Multi-task learning (MTL) for scene understanding has been actively studied by exploiting the correlation among multiple tasks. This work focuses on improving the performance of an MTL network that infers depth and semantic segmentation maps from a single image. Specifically, we propose a novel MTL architecture, called Pseudo-MTL, that introduces pseudo labels for the joint learning of monocular depth estimation and semantic segmentation. The pseudo ground truth depth maps, generated by pretrained stereo matching methods, are leveraged to supervise the monocular depth estimation. More importantly, the pseudo depth labels serve to impose a cross-view consistency on the monocular depth and segmentation maps estimated for the two views. This mitigates the mismatch problem incurred by inconsistent predictions across the two views. A thorough ablation study validates that the cross-view consistency leads to a substantial performance gain by ensuring inference-view invariance for the two tasks.

1. INTRODUCTION

Scene understanding has become increasingly popular in both academia and industry as an essential technology for realizing a variety of vision-based applications such as robotics and autonomous driving. 3D geometric and semantic information of a scene often serve as a basic building block for high-level scene understanding tasks. Numerous approaches have been proposed for inferring a depth map (Garg et al., 2016; Godard et al., 2019) or grouping semantically similar parts (Chen et al., 2017; Yuan et al., 2019) from a single image. In parallel with this rapid evolution of the individual tasks, several approaches (Chen et al., 2019; Zhang et al., 2018; Guizilini et al., 2020b; Liu et al., 2019) have focused on boosting performance through joint learning of the semantic segmentation and monocular depth estimation tasks, motivated by the observation that the two tasks are highly correlated. For instance, pixels with the same semantic segmentation label within an object are likely to have similar (or smoothly-varying) depth values, while an abrupt change of depth values often implies the boundary between two objects with different semantic labels. These properties have been applied to deep networks to enhance the semantic segmentation and monocular depth estimation tasks in a synergetic manner. Chen et al. (2019) proposed a joint learning model that learns a semantic-aware representation to advance monocular depth estimation with the aid of semantic segmentation; the depth map is further improved by loss functions designed to couple geometric and semantic understanding. The method in (Guizilini et al., 2020b) proposed a new architecture that improves the accuracy of monocular depth estimation through pixel-adaptive convolution (Su et al., 2019) applied to semantic feature maps computed by a pretrained semantic segmentation network.
Despite the improved monocular depth accuracy over a single monocular depth network, a performance gain for the semantic segmentation task from geometric representation has either not been verified (Chen et al., 2019), or the semantic segmentation network was kept fixed with pretrained parameters (Guizilini et al., 2020b). A generic computational approach to multi-task learning (MTL) was proposed in (Zamir et al., 2018), which models the structure across twenty-six tasks, including 2D, 2.5D, 3D, and semantic tasks, by finding first- and higher-order transfer learning dependencies among them in a latent space, so as to seamlessly reuse supervision among related tasks and/or solve them in a single network without significantly increasing complexity. This was further extended by imposing a cross-task consis-
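The cross-view consistency described in the abstract can be illustrated with a minimal sketch: given a pseudo disparity map for the left view (e.g. from a pretrained stereo matcher), a prediction made on the right image is warped into the left view and compared against the left prediction. The function names, the nearest-neighbor warp, and the L1 penalty below are illustrative assumptions, not the paper's actual (differentiable, bilinear) implementation.

```python
import numpy as np

def warp_to_left(right_map, left_disp):
    """Warp a right-view prediction map into the left view using the
    left-view pseudo disparity (nearest-neighbor, for illustration only).
    Standard rectified-stereo convention: x_right = x_left - disparity.
    Returns the warped map and a mask of in-bounds (valid) pixels."""
    h, w = left_disp.shape
    xs = np.tile(np.arange(w), (h, 1))            # left-view x coordinates
    src_x = np.round(xs - left_disp).astype(int)  # corresponding right-view x
    valid = (src_x >= 0) & (src_x < w)            # pixels that land in-bounds
    src_x = np.clip(src_x, 0, w - 1)
    rows = np.tile(np.arange(h)[:, None], (1, w))
    return right_map[rows, src_x], valid

def cross_view_consistency(left_pred, right_pred, left_disp):
    """Hypothetical consistency loss: mean absolute difference between the
    left prediction and the right prediction warped into the left view,
    averaged over valid pixels only."""
    warped, valid = warp_to_left(right_pred, left_disp)
    return np.abs(left_pred - warped)[valid].mean()
```

The same warping operator can be applied to either the estimated depth maps or the (soft) segmentation maps, so one mechanism enforces view-invariance for both tasks.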

