PSEUDO LABEL-GUIDED MULTI-TASK LEARNING FOR SCENE UNDERSTANDING

Anonymous

Abstract

Multi-task learning (MTL) for scene understanding has been actively studied by exploiting the correlation among multiple tasks. This work focuses on improving the performance of an MTL network that infers depth and semantic segmentation maps from a single image. Specifically, we propose a novel MTL architecture, called Pseudo-MTL, that introduces pseudo labels for the joint learning of monocular depth estimation and semantic segmentation. Pseudo ground truth depth maps, generated by pre-trained stereo matching methods, are leveraged to supervise monocular depth estimation. More importantly, the pseudo depth labels serve to impose a cross-view consistency on the estimated monocular depth and segmentation maps of the two views. This helps mitigate the mismatch problem incurred by inconsistent predictions across the two views. A thorough ablation study validates that the cross-view consistency leads to a substantial performance gain by ensuring inference-view invariance for both tasks.

1. INTRODUCTION

Scene understanding has become increasingly popular in both academia and industry as an essential technology for realizing a variety of vision-based applications such as robotics and autonomous driving. The 3D geometric and semantic information of a scene often serves as a basic building block for high-level scene understanding tasks. Numerous approaches have been proposed for inferring a depth map (Garg et al., 2016; Godard et al., 2019) or grouping semantically similar parts (Chen et al., 2017; Yuan et al., 2019) from a single image. In parallel with this rapid evolution of the individual tasks, several approaches (Chen et al., 2019; Zhang et al., 2018; Guizilini et al., 2020b; Liu et al., 2019) have focused on boosting performance through joint learning of the semantic segmentation and monocular depth estimation tasks, motivated by the observation that the two tasks are highly correlated. For instance, pixels with the same semantic segmentation label within an object are likely to have similar (or smoothly varying) depth values, and an abrupt change of depth values often implies a boundary between two objects with different semantic labels. These properties have been applied to deep networks to enhance semantic segmentation and monocular depth estimation in a synergetic manner. Chen et al. (2019) proposed a joint learning model that learns a semantic-aware representation to advance monocular depth estimation with the aid of semantic segmentation; the depth map is improved through loss functions designed to bond geometric and semantic understanding. The method in (Guizilini et al., 2020b) proposed a new architecture that improves the accuracy of monocular depth estimation through pixel-adaptive convolution (Su et al., 2019) applied to semantic feature maps computed from pre-trained semantic segmentation networks.
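The correlation described above can be made concrete with a small numerical sketch: on a toy scene where object membership determines depth, semantic boundaries and depth discontinuities coincide. All values below are illustrative placeholders, not data from the paper.

```python
import numpy as np

# Toy scene: a foreground object in front of a flat background.
depth = np.full((8, 8), 10.0)   # background at 10 m
depth[2:6, 2:6] = 3.0           # foreground object at 3 m
labels = np.zeros((8, 8), dtype=int)
labels[2:6, 2:6] = 1            # label 1 = foreground object

# Horizontal depth discontinuities: large jumps between neighbouring pixels.
depth_edges = np.abs(np.diff(depth, axis=1)) > 1.0
# Horizontal semantic boundaries: label changes between neighbouring pixels.
label_edges = np.diff(labels, axis=1) != 0

# On this toy scene the two edge maps coincide exactly.
assert np.array_equal(depth_edges, label_edges)
```

Real scenes are noisier (slanted surfaces, adjacent objects at similar depths), which is why the paper encodes this correlation through learned representations and losses rather than a hard rule.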
Despite the improved monocular depth accuracy over a single monocular depth network, the performance improvement of the semantic segmentation task through the aid of geometric representation has either not been verified (Chen et al., 2019), or the semantic segmentation network was fixed with pre-trained parameters (Guizilini et al., 2020b). A generic computational approach for multi-task learning (MTL) was proposed in (Zamir et al., 2018), which models the structure across twenty-six tasks, including 2D, 2.5D, 3D, and semantic tasks, by finding first- and higher-order transfer learning dependencies across them in a latent space, so as to seamlessly reuse supervision among related tasks and/or solve them in a single network without significantly increasing complexity. This was further extended by imposing a cross-task consistency based on inference-path invariance on a graph of multiple tasks (Zamir et al., 2020). Though these approaches provide a generic and principled way of leveraging redundancies across multiple tasks, they may be limited in improving the performance of individual tasks, in that it is difficult to incorporate task-specific architectures and loss functions into such unified frameworks. With the same objective yet a different methodology, the method in (Liu et al., 2019) proposes a novel MTL architecture consisting of task-shared and task-specific networks based on task-attention modules, aiming to learn both features generalizable across multiple tasks and features tailored to each task. They validated its performance on the joint learning of monocular depth and semantic segmentation.

In this paper, we propose a novel MTL architecture for the monocular depth estimation and semantic segmentation tasks, called pseudo label-guided multi-task learning (Pseudo-MTL). The proposed architecture leverages geometrically- and semantically-guided representations by introducing pseudo ground truth labels.
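The joint-learning setting discussed above commonly rests on hard parameter sharing: one shared encoder feeds a task-specific head per task, and the training objective is a weighted sum of per-task losses. The following is a minimal numpy sketch of that generic pattern only; the shapes, weights, targets, and loss weighting are illustrative and do not reproduce any cited architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard parameter sharing in miniature: one shared encoder, two task heads.
x = rng.standard_normal((1, 16))          # flattened input features
W_shared = rng.standard_normal((16, 32))  # shared encoder weights
W_depth = rng.standard_normal((32, 1))    # depth regression head
W_seg = rng.standard_normal((32, 5))      # 5-class segmentation head

h = np.maximum(x @ W_shared, 0.0)         # shared representation (ReLU)
depth_pred = h @ W_depth                  # predicted depth value
seg_logits = h @ W_seg                    # per-class scores

# A joint objective is typically a weighted sum of per-task losses.
depth_loss = float(np.mean((depth_pred - 2.0) ** 2))   # dummy 2.0 m target
log_probs = seg_logits - np.log(np.exp(seg_logits).sum(axis=1, keepdims=True))
seg_loss = float(-log_probs[0, 3])                     # dummy class-3 target
total_loss = depth_loss + 0.5 * seg_loss               # illustrative weighting
```

The cited methods differ precisely in how they go beyond this baseline, e.g., with task-attention modules (Liu et al., 2019) or cross-task consistency terms (Zamir et al., 2020).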
When a pair of stereo images is given as input, our method first generates pseudo ground truth left and right depth maps using existing pre-trained stereo matching networks (Pang et al., 2017; Chang & Chen, 2018). To prevent inaccurate depth values from being used, a stereo confidence map (Poggi & Mattoccia, 2016), which measures the reliability of the pseudo depth labels, is used as auxiliary data. These labels are leveraged to supervise the monocular depth network, yielding a substantial performance gain over recent self-supervised monocular depth estimation approaches (Godard et al., 2017; 2019). More importantly, the pseudo depth labels are particularly useful for imposing a cross-view consistency across left and right images. The estimated monocular depth and segmentation maps of the two views are tied from a geometric perspective by minimizing the cross-view consistency loss, significantly alleviating the mismatch problem incurred by inconsistent predictions across the two views. We verify through an intensive ablation study that the proposed cross-view consistency loss leads to a substantial improvement on both tasks. Experimental results also show that our approach outperforms state-of-the-art methods. In short, our novel contributions can be summarized as follows.

• We propose a novel MTL approach that jointly performs monocular depth estimation and semantic segmentation through pseudo depth labels.

• We propose a cross-view consistency loss based on the pseudo depth labels and associated confidence maps to enable consistent predictions across two views.

• We provide an intensive ablation study to quantify the contribution of the proposed components to the performance improvement.
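The confidence-weighted cross-view consistency described above can be sketched as follows. This is a minimal numpy illustration under simplifying assumptions (integer disparities, nearest-neighbour warping, an L1 penalty), not the paper's actual loss; the function name and toy values are hypothetical.

```python
import numpy as np

def cross_view_loss(pred_left, pred_right, disp_left, conf_left):
    """Confidence-weighted L1 cross-view consistency (illustrative sketch).

    A left-view pixel at column x corresponds to the right-view pixel at
    column x - d, so the right prediction is warped into the left view
    (nearest-neighbour gather over integer disparities) and compared with
    the left prediction. Low-confidence pixels are down-weighted.
    """
    h, w = pred_left.shape
    xs = np.clip(np.arange(w)[None, :] - disp_left.astype(int), 0, w - 1)
    warped = pred_right[np.arange(h)[:, None], xs]
    return np.sum(conf_left * np.abs(pred_left - warped)) / np.sum(conf_left)

# Toy check: a ramp shifted by a constant disparity is perfectly consistent.
h, w, d = 4, 8, 2
pred_right = np.tile(np.arange(w, dtype=float), (h, 1))
pred_left = pred_right - d               # consistent with disparity d
disp = np.full((h, w), float(d))
conf = np.ones((h, w))
conf[:, :d] = 0.0                        # mask pixels warped from out of view
loss = cross_view_loss(pred_left, pred_right, disp, conf)  # 0 for this pair
```

The same warping applies to segmentation maps (comparing per-pixel class predictions across views); the confidence map keeps occluded or unreliable stereo pixels from corrupting the loss.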

2. RELATED WORK

Monocular Depth Estimation. While early works on monocular depth estimation are based on supervised learning, self-supervised learning has attracted increasing interest in recent approaches (Godard et al., 2017; 2019; Watson et al., 2019) to overcome the lack of ground truth depth labels. Here, we review the works most relevant to our method. Godard et al. (2017; 2019) proposed deep networks that infer a disparity map using an image reconstruction loss and a left-right consistency loss from a pair of stereo images or a monocular video. Chen et al. (2019) infer both disparity and semantic segmentation maps by enforcing cross-view consistency across stereo images to address the mismatch problem of (Godard et al., 2017). Several approaches have focused on improving monocular depth estimation with the aid of segmentation networks, e.g., by stitching local depth segments from instance segmentation with respect to scale and shift (Wang et al., 2020) or by leveraging pre-trained semantic segmentation networks to guide monocular depth estimation (Guizilini et al., 2020b).

Semantic Segmentation. The deep convolutional encoder-decoder architecture for semantic segmentation proposed in (Badrinarayanan et al., 2017) has been widely used as a backbone. The pyramid pooling module was proposed for leveraging global context through the aggregation of different region-based contexts (Zhao et al., 2017). Some segmentation works have attempted to combine different tasks to improve segmentation performance. Gated-SCNN (Takikawa et al., 2019) refines segmentation results by fusing semantic-region features and boundary features. FuseNet (Hazirbas et al., 2016) fuses features from color and depth images to improve segmentation performance.

Multi-task Learning. The methods in (Chen et al., 2019; Takikawa et al., 2019; Zhang et al., 2018) leverage task-specific loss functions to tie up two (or more) tasks within an MTL architecture. For

