PSEUDO LABEL-GUIDED MULTI-TASK LEARNING FOR SCENE UNDERSTANDING

Anonymous

Abstract

Multi-task learning (MTL) for scene understanding has been actively studied by exploiting the correlation among multiple tasks. This work focuses on improving the performance of an MTL network that infers depth and semantic segmentation maps from a single image. Specifically, we propose a novel MTL architecture, called Pseudo-MTL, that introduces pseudo labels for the joint learning of monocular depth estimation and semantic segmentation. Pseudo ground truth depth maps, generated by pretrained stereo matching methods, are leveraged to supervise the monocular depth estimation. More importantly, the pseudo depth labels serve to impose a cross-view consistency on the estimated monocular depth and segmentation maps of the two views. This mitigates the mismatch problem incurred by inconsistent predictions across the two views. A thorough ablation study validates that the cross-view consistency leads to a substantial performance gain by ensuring inference-view invariance for the two tasks.

1. INTRODUCTION

Scene understanding has become increasingly popular in both academia and industry as an essential technology for realizing a variety of vision-based applications such as robotics and autonomous driving. 3D geometric and semantic information of a scene often serve as basic building blocks for high-level scene understanding tasks. Numerous approaches have been proposed for inferring a depth map (Garg et al., 2016; Godard et al., 2019) or grouping semantically similar parts (Chen et al., 2017; Yuan et al., 2019) from a single image. In parallel with this rapid evolution of the individual tasks, several approaches (Chen et al., 2019; Zhang et al., 2018; Guizilini et al., 2020b; Liu et al., 2019) have focused on boosting performance through joint learning of the semantic segmentation and monocular depth estimation tasks, motivated by the observation that the two tasks are highly correlated. For instance, pixels with the same semantic segmentation label within an object are likely to have similar (or smoothly-varying) depth values, and an abrupt change of depth values often indicates the boundary between two objects with different semantic labels. These properties have been exploited in deep networks to enhance semantic segmentation and monocular depth estimation in a synergetic manner. In (Chen et al., 2019), a joint learning model was proposed that learns semantic-aware representations to advance monocular depth estimation with the aid of semantic segmentation; the depth map is improved through loss functions designed to bind geometric and semantic understanding. The method in (Guizilini et al., 2020b) proposed a new architecture that improves the accuracy of monocular depth estimation through pixel-adaptive convolution (Su et al., 2019) over semantic feature maps computed from pretrained semantic segmentation networks.
Despite the improved monocular depth accuracy over a single monocular depth network, the performance improvement of the semantic segmentation task with the aid of geometric representations has not been verified (Chen et al., 2019), or the semantic segmentation network was even fixed with pretrained parameters (Guizilini et al., 2020b). A generic computational approach for multi-task learning (MTL) was proposed in (Zamir et al., 2018), which models the structure across twenty-six tasks, including 2D, 2.5D, 3D, and semantic tasks, by finding first- and higher-order transfer learning dependencies across them in a latent space, so as to seamlessly reuse supervision among related tasks and/or solve them in a single network without significantly increasing complexity. This was further extended by imposing a cross-task consistency based on inference-path invariance over a graph of multiple tasks (Zamir et al., 2020). Though these approaches provide a generic and principled way of leveraging redundancies across multiple tasks, they may be limited in improving the performance of individual tasks, since it is difficult to accommodate task-specific architectures and loss functions in such unified frameworks. With the same objective yet a different methodology, the method in (Liu et al., 2019) proposes a novel MTL architecture consisting of task-shared and task-specific networks based on task-attention modules, aiming to learn both features that generalize across tasks and features tailored to each task; its performance was validated on the joint learning of monocular depth and semantic segmentation.

In this paper, we propose a novel MTL architecture for the monocular depth estimation and semantic segmentation tasks, called pseudo label-guided multi-task learning (Pseudo-MTL). The proposed architecture leverages geometrically- and semantically-guided representations by introducing pseudo ground truth labels.
When a pair of stereo images is given as input, our method first generates pseudo ground truth left and right depth maps using existing pretrained stereo matching networks (Pang et al., 2017; Chang & Chen, 2018). To prevent inaccurate depth values from being used, a stereo confidence map (Poggi & Mattoccia, 2016), which measures the reliability of the pseudo depth labels, is used as auxiliary data. These labels supervise the monocular depth network, yielding a substantial performance gain over recent self-supervised monocular depth estimation approaches (Godard et al., 2017; 2019). More importantly, the pseudo depth labels are particularly useful for imposing a cross-view consistency across the left and right images. The estimated monocular depth and segmentation maps of the two views are tied together from a geometric perspective by minimizing the cross-view consistency loss, significantly alleviating the mismatch problem incurred by inconsistent predictions across the two views. We verify through an intensive ablation study that the proposed cross-view consistency loss leads to a substantial improvement on both tasks, and experimental results show that our approach achieves outstanding performance compared with the state of the art. In short, our contributions can be summarized as follows.

• We propose a novel MTL approach that jointly performs monocular depth estimation and semantic segmentation through pseudo depth labels.
• We propose a cross-view consistency loss based on the pseudo depth labels and associated confidence maps that enables consistent predictions across the two views.
• We provide an intensive ablation study quantifying the contribution of each proposed component to the performance improvement.

2. RELATED WORK

Monocular Depth Estimation  While early works on monocular depth estimation were based on supervised learning, self-supervised learning has attracted increasing interest in recent approaches (Godard et al., 2017; 2019; Watson et al., 2019) to overcome the lack of ground truth depth labels. Here, we review the works most relevant to our method. Godard et al. (2017; 2019) proposed deep networks that infer a disparity map from a pair of stereo images or monocular videos using an image reconstruction loss and a left-right consistency loss. Chen et al. (2019) infer both disparity and semantic segmentation maps by enforcing cross-view consistency across stereo images to address the mismatch problem of (Godard et al., 2017). Several approaches have focused on improving monocular depth estimation with the aid of segmentation networks, e.g., by stitching local depth segments from instance segmentation with respect to scale and shift (Wang et al., 2020), or by leveraging pretrained semantic segmentation networks to guide the monocular depth estimation (Guizilini et al., 2020b).

Semantic Segmentation  The deep convolutional encoder-decoder architecture for semantic segmentation proposed in (Badrinarayanan et al., 2017) has been widely used as a backbone. The pyramid pooling module was proposed to leverage global context through the aggregation of different region-based contexts (Zhao et al., 2017). Some segmentation works combine different tasks to improve segmentation performance: Gated-SCNN (Takikawa et al., 2019) refines segmentation results by fusing semantic-region features and boundary features, and FuseNet (Hazirbas et al., 2016) fuses features from color and depth images.

Multi-task Learning  In (Chen et al., 2019; Takikawa et al., 2019; Zhang et al., 2018), task-specific loss functions were proposed to tie two (or more) tasks together within an MTL architecture.
For instance, Chen et al. (2019) attempted to improve monocular depth accuracy using loss functions that measure the consistency between geometric and semantic predictions. A generic computational approach to MTL was proposed in (Zamir et al., 2018; 2020), leveraging redundancies across multiple tasks in a latent space. Task-attention modules were introduced to extract features for the individual tasks in (Misra et al., 2016; Liu et al., 2019; Jha et al., 2020). In this work, we focus on improving the performance of an MTL architecture for the monocular depth estimation and semantic segmentation tasks by using a cross-view consistency loss based on pseudo labels.

3.1. OVERVIEW AND ARCHITECTURE DESIGN

Our Pseudo-MTL approach focuses on improving the performance of the monocular depth estimation and semantic segmentation tasks through task-specific losses defined on pseudo depth labels generated by pretrained stereo matching networks (Pang et al., 2017). Stereo confidence maps are used as auxiliary data to compensate for estimation errors in the pseudo depth labels, which is effective in mitigating the undesired artifacts these errors may cause. In our work, we chose CCNN (Poggi & Mattoccia, 2016) for computing the confidence map, but more advanced confidence estimation approaches can also be used. As shown in Figure 1, the proposed Pseudo-MTL network is based on an encoder-decoder architecture in which a single encoder takes an image and two decoders predict the monocular depth and semantic segmentation maps. The encoder network E consists of the convolutional layers of the VGG network (Simonyan & Zisserman, 2015). The two decoders, D_d for monocular depth estimation and D_s for semantic segmentation, are designed symmetrically with the encoder. While the two tasks share the encoder, a task-specific decoder branch is used for each task. The pseudo depth labels and the segmentation label maps of the stereo images are used for supervising the proposed architecture. The monocular depth and segmentation maps of the left and right images are estimated by passing each image through the proposed architecture, as shown in Figure 1. The cross-view consistency loss is then imposed on the prediction results of the two views. To be specific, the estimated monocular depth maps of the left and right images are warped and tested against the pseudo depth labels to ensure inference-view invariance for monocular depth estimation, and a similar procedure is applied to semantic segmentation. Using the pseudo depth labels to train the proposed model is advantageous in several respects.
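The shared-encoder, two-decoder layout described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the channel widths, the three-stage downsampling, and the sigmoid depth activation are assumptions for illustration, whereas the paper's encoder uses the VGG convolutional layers (Simonyan & Zisserman, 2015).

```python
import torch
import torch.nn as nn

class PseudoMTL(nn.Module):
    """Sketch of the Pseudo-MTL layout: one shared encoder E,
    two symmetric task-specific decoders D_d (depth) and D_s (segmentation)."""

    def __init__(self, num_classes=7):
        super().__init__()
        # Shared encoder E: a small VGG-style convolutional stack (1/8 resolution).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

        def decoder(out_ch):
            # Task-specific decoder, designed symmetrically with the encoder.
            return nn.Sequential(
                nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
            )

        self.depth_decoder = decoder(1)          # D_d: one-channel depth map
        self.seg_decoder = decoder(num_classes)  # D_s: per-class logits

    def forward(self, image):
        feat = self.encoder(image)
        depth = torch.sigmoid(self.depth_decoder(feat))  # normalized depth in (0, 1)
        seg_logits = self.seg_decoder(feat)
        return depth, seg_logits
```

At training time each stereo view is passed through the same network independently, and the cross-view losses of Section 3.2 tie the two sets of predictions together.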
The pseudo depth labels of the stereo images, filtered by their confidence maps, provide better supervision (Choi et al., 2020) than recent self-supervised monocular depth estimation approaches. More importantly, the cross-view consistency based on the pseudo depth labels mitigates the mismatch problem caused by inconsistent predictions of the two views, leading to a substantial performance gain. Our method aims at advancing the two tasks via task-specific losses based on pseudo ground truth labels; existing MTL architectures, e.g., those based on task-specific attention modules and adaptive balancing (Liu et al., 2019; Jha et al., 2020), can be used complementarily with our loss functions.

3.2. LOSS FUNCTIONS

The loss functions are divided into two parts: 1) supervised losses for the depth and segmentation networks, and 2) a pseudo depth-guided reconstruction loss for cross-view consistency. Note that the supervised loss used for monocular depth estimation relies on the pseudo depth labels generated from a pair of stereo images.

3.2.1. LOSS FOR MONOCULAR DEPTH AND SEMANTIC SEGMENTATION

Depth maps $d_i$ for $i \in \{l, r\}$, predicted by the decoder $D_d$ for monocular depth estimation, are used to measure the depth regression loss $L_d$:

$$L_d = \sum_{i \in \{l,r\}} L_{reg}(c_i, d_i, d_i^{pgt}), \qquad L_{reg}(c_i, d_i, d_i^{pgt}) = \frac{1}{Z_i} \sum_{p \in \Phi} c_i(p) \cdot \big| d_i(p) - d_i^{pgt}(p) \big|_1, \quad (1)$$

where $c_i$ and $d_i^{pgt}$ denote the confidence map and the pseudo ground truth depth map of the left ($i = l$) or right ($i = r$) image, respectively, the loss is normalized by $Z_i = \sum_{p} c_i(p)$, and $\Phi$ denotes the set of all pixels. The confidence map serves to exclude inaccurate depth values of $d_i^{pgt}$ when computing the depth regression loss $L_d$. It can be used in various ways, including hard thresholding (Cho et al., 2019; Tonioni et al., 2020) and soft thresholding (Choi et al., 2020); since the soft-thresholded confidence map (Choi et al., 2020) has been shown to be effective for monocular depth estimation, we adopt it in this work. We found that the pretrained threshold network already provides satisfactory results, so it was kept fixed during our network training.

A supervised loss for semantic segmentation is defined with the standard cross-entropy $H$:

$$L_s = \sum_{i \in \{l,r\}} H(s_i, s_i^{gt}),$$

where $s_i$ and $s_i^{gt}$ denote the segmentation map predicted by the decoder $D_s$ for semantic segmentation and the ground truth segmentation map, respectively. The supervised loss for both tasks is defined as $L_S = \alpha_d L_d + \alpha_s L_s$ with loss weights $\alpha_d$ and $\alpha_s$.
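As a concrete reference, the confidence-weighted regression loss of equation (1) can be sketched in a few lines of NumPy. This is a non-differentiable illustration; in training it would be implemented with autograd tensors.

```python
import numpy as np

def depth_regression_loss(conf, depth_pred, depth_pgt):
    """Confidence-weighted L1 loss L_reg(c_i, d_i, d_i^pgt) of Eq. (1):
    pixels with low stereo confidence contribute less, and the sum is
    normalized by the confidence mass Z_i = sum_p c_i(p)."""
    z = conf.sum()
    if z == 0:
        return 0.0  # no reliable pseudo-label pixels in this image
    return float((conf * np.abs(depth_pred - depth_pgt)).sum() / z)
```

For a stereo pair, $L_d$ is simply the sum of this loss over the left and right views.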

3.2.2. CROSS-VIEW CONSISTENCY LOSS

Minimizing the supervised loss $L_S$ on each view independently often leads to mismatched predictions in the depth and segmentation maps due to the lack of consistency constraints across the two views. We address this issue by imposing a cross-view consistency between the left and right images with the pseudo depth labels. Figure 2 illustrates the procedure for computing the cross-view consistency losses. The cross-view consistency loss for monocular depth estimation is defined as:

$$L_{d,c} = \alpha_{d,lr} L_{d,lr} + \alpha_{d,l} L_{d,l} + \alpha_{d,r} L_{d,r},$$
$$L_{d,lr} = L_{reg}\big(c_l, d_l, G(d_r; d_l^{pgt})\big) + L_{reg}\big(c_r, G(d_l; d_r^{pgt}), d_r\big),$$
$$L_{d,l} = L_{reg}\big(c_l, d_l^{pgt}, G(d_r; d_l^{pgt})\big) + L_{reg}\big(c_l, d_l, G(d_r^{pgt}; d_l^{pgt})\big),$$
$$L_{d,r} = L_{reg}\big(c_r, G(d_l; d_r^{pgt}), d_r^{pgt}\big) + L_{reg}\big(c_r, G(d_l^{pgt}; d_r^{pgt}), d_r\big).$$

The cross-view consistency can also be applied to semantic segmentation:

$$L_{s,c} = \alpha_{s,lr} L_{s,lr} + \alpha_{s,l} L_{s,l} + \alpha_{s,r} L_{s,r},$$
$$L_{s,lr} = c_l \cdot H\big(s_l, G(s_r; d_l^{pgt})\big) + c_r \cdot H\big(G(s_l; d_r^{pgt}), s_r\big),$$
$$L_{s,l} = c_l \cdot H\big(s_l^{gt}, G(s_r; d_l^{pgt})\big) + c_l \cdot H\big(s_l, G(s_r^{gt}; d_l^{pgt})\big),$$
$$L_{s,r} = c_r \cdot H\big(G(s_l; d_r^{pgt}), s_r^{gt}\big) + c_r \cdot H\big(G(s_l^{gt}; d_r^{pgt}), s_r\big).$$

Note that in (Chen et al., 2019), consistency of the left and right segmentation maps is also considered, e.g., by minimizing $H(s_l, G(s_r; d_l))$, where the two segmentation maps $s_l$ and $s_r$ are aligned with the estimated monocular depth map $d_l$. However, $d_l$ is continuously updated during network training, which may yield inaccurate alignments at the early stages and often leads to divergence of the loss. For these reasons, minimizing $H$ with respect to both the monocular depth and segmentation maps becomes very challenging, and the performance gain from that consistency loss is relatively marginal. In contrast, our approach imposes the cross-view consistency more effectively because 1) more accurate pseudo depth labels, obtained from stereo matching networks, are used, and 2) the confidence map helps filter out inaccurate depth values in the pseudo ground truth depth maps. Furthermore, we extend the cross-view consistency to monocular depth estimation itself, which is infeasible in recent self-supervised monocular depth estimation approaches (Godard et al., 2017; 2019; Watson et al., 2019) that rely on the reconstruction loss only. A detailed ablation study will validate the effectiveness of the proposed cross-view consistency loss.

The total loss is defined as $L = L_S + L_{d,c} + L_{s,c}$. While the pseudo depth labels $d_l^{pgt}$ and $d_r^{pgt}$, generated using pretrained stereo matching networks, supervise the monocular depth estimation task, the semantic segmentation task requires ground truth segmentation maps. The Cityscapes dataset provides only the left ground truth segmentation maps $s_l^{gt}$, and the KITTI dataset provides none. We hence generated pseudo segmentation labels for these images using semantic segmentation methods (Cheng et al., 2020; Zhu et al., 2019). Table 1 summarizes the supervision used for the two tasks.
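The warping operator $G(a; b)$ underlying these losses can be sketched as below for a rectified stereo pair, where the pseudo depth label acts as a horizontal disparity. This is a simplified nearest-neighbour version for illustration only; a trainable implementation would use differentiable bilinear sampling (e.g., PyTorch's grid_sample), and the disparity sign convention here is an assumption.

```python
import numpy as np

def warp_horizontal(src, disp):
    """G(src; disp): sample the other-view map `src` at x - disp(y, x) to
    bring it into the reference view (nearest neighbour, zero padding)."""
    h, w = src.shape
    out = np.zeros_like(src)
    for y in range(h):
        for x in range(w):
            xs = int(round(x - disp[y, x]))
            if 0 <= xs < w:
                out[y, x] = src[y, xs]
    return out

def cross_view_depth_term(conf_l, d_l, d_r, disp_pgt_l):
    """One term of L_{d,lr}: compare the left prediction d_l against the
    right prediction d_r warped into the left view by the pseudo label,
    weighted by the left confidence map."""
    warped = warp_horizontal(d_r, disp_pgt_l)
    z = conf_l.sum()
    return float((conf_l * np.abs(d_l - warped)).sum() / max(z, 1e-8))
```

The segmentation terms follow the same pattern, with the cross-entropy $H$ in place of the L1 difference.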

3.3. TRAINING DETAILS

Table 2: Quantitative evaluation of monocular depth estimation on the Eigen split of the KITTI dataset. Bold and underlined numbers represent the first- and second-best results, respectively. The methods used in the evaluation are (Garg et al., 2016), (Zhou et al., 2017), Monodepth (Godard et al., 2017), (Zhan et al., 2018), (Chen et al., 2019), Monodepth2 (Godard et al., 2019), Uncertainty (Poggi et al., 2020), DepthHint (Watson et al., 2019), (Guizilini et al., 2020b), and (Choi et al., 2020).

4.1. DATASETS

We evaluated the performance on two popular datasets, KITTI (Geiger et al., 2012) and Cityscapes (Cordts et al., 2016). On KITTI, for a fair comparison, we followed the common setup of using 22,600 images for training and the rest for evaluation. The Eigen split (697 images) (Eigen et al., 2014) was used for evaluating monocular depth accuracy. Following existing MTL methods (Chen et al., 2019), the semantic segmentation accuracy was evaluated on the 200 annotated images provided by the KITTI benchmark. Cityscapes provides high-resolution images of urban street scenes used for segmentation and depth estimation; 2,975 and 500 images were used for training and evaluation, respectively.

4.2. IMPLEMENTATION DETAILS AND EVALUATION METRIC

We first pretrained the monocular depth network $E + D_d$ and the semantic segmentation network $E + D_s$ independently for 30 epochs using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $10^{-4}$ and a momentum of 0.9. We then finetuned the whole network $E + D_d + D_s$ for 20 epochs using the Adam optimizer with a learning rate of $10^{-5}$, reduced by a factor of 10 every 10 epochs, and a momentum of 0.9, after initializing it with the pretrained weights of the monocular depth network $E + D_d$ and the semantic segmentation network $D_s$. During training, we resized KITTI images to a resolution of 480×192, and cropped the 2048×768 Cityscapes images to exclude the front part of the car before resizing them to 256×96. The weights of the objective function are set to $\alpha_d = 850$, $\alpha_s = 2.5$, $\alpha_{d,lr} = 0.5$, $\alpha_{d,l} = 1$, $\alpha_{d,r} = 1$, $\alpha_{s,lr} = 0.5$, $\alpha_{s,l} = 1.5$, $\alpha_{s,r} = 1.5$. The performance evaluation follows common practice: 1) mean absolute relative error (Abs Rel), mean relative squared error (Sq Rel), root mean square error (RMSE), root mean square error of the log (RMSE log), and accuracy under threshold $\delta$ for monocular depth estimation; 2) intersection over union (IoU) and mean intersection over union (mIoU) for semantic segmentation. Due to page limits, some results are provided in the appendix. Our code will be made publicly available.

Figure 5: Qualitative results on the Cityscapes dataset: … (Liu et al., 2019), (e) Dense (Liu et al., 2019), and (f) Ours. Note that ground truth depth maps were obtained using SGM (Hirschmuller, 2008).
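The depth metrics listed above follow the standard Eigen et al. protocol and can be computed as in this sketch. It assumes `gt` and `pred` have already been masked to valid ground truth pixels (and median-scaled where applicable).

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics: Abs Rel, Sq Rel, RMSE, RMSE log,
    and accuracies under the thresholds delta < 1.25, 1.25^2, 1.25^3."""
    gt, pred = gt.ravel(), pred.ravel()
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }
```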

4.3. PERFORMANCE EVALUATION

KITTI In Table 2, we provide objective evaluation results on the KITTI Eigen split (Eigen et al., 2014). The proposed method produces results that are highly competitive with state-of-the-art monocular depth estimation approaches. The qualitative evaluation in Figure 3 verifies that our method yields results with sharper boundaries and better object delineation. These validate the effectiveness of the cross-view consistency based on the pseudo depth labels. In Figure 4, the proposed method produces satisfactory semantic segmentation results for the Cityscapes dataset, achieving mIoU = 59.93; note that the mIoU of the MTL approach of (Chen et al., 2019) is 39.13.

Cityscapes In Table 3, we compare results on the Cityscapes dataset with recent multi-task learning approaches for the monocular depth estimation and semantic segmentation tasks: 'Cross-stitch' (Misra et al., 2016) and 'MTAN' (Liu et al., 2019). 'Split (deep)', 'Split (wide)', and 'Dense' were reproduced using the author-provided code of 'MTAN' (Liu et al., 2019). Our method achieves improved quantitative results on both tasks. Figure 5 exhibits qualitative results on the Cityscapes dataset. As expected, the depth and segmentation maps generated by our method preserve object boundaries and recover details better than the latest MTL methods (Misra et al., 2016; Liu et al., 2019).
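For reference, the IoU and mIoU scores reported here follow the usual definition over predicted and ground truth label maps; a minimal sketch:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU over label maps: per-class intersection over union, averaged
    over the classes present in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```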

4.4. ABLATION STUDY

We conducted ablation experiments to validate the effectiveness of the confidence map and the cross-view consistency on the KITTI dataset in Table 4 and the Cityscapes dataset in Table 5. We first compared against a variant $(b = d_i)$ that imposes the cross-view consistency using the estimated monocular depth map, e.g., $H(s_l, G(s_r; d_l))$, similar to (Chen et al., 2019). Under the same setting, our method $(b = d_i^{pgt})$ tends to achieve a higher mIoU than the variant $(b = d_i)$. Additionally, while the variant $(b = d_i)$ often degrades the monocular depth accuracy, our method $(b = d_i^{pgt})$ does not suffer from this issue and instead improves the depth accuracy. The performance gain becomes even more apparent for both tasks when the confidence map is used. Note that it is infeasible to leverage a confidence map for the variant $(b = d_i)$, whose estimated monocular depth map is constantly updated during network training. When the cross-view consistency loss $L_{d,c}$ for monocular depth estimation is included, an additional performance gain is observed, validating its effectiveness for monocular depth estimation. Though the segmentation accuracy (mIoU) slightly worsened in some cases, the drop is relatively marginal. This may be due to our architecture, in which the two tasks share the encoder; a more advanced MTL architecture, e.g., one using task-attention modules (Liu et al., 2019), could lead to further improvement. We leave this for future work.

5. CONCLUSION

This paper has presented a new MTL architecture designed for the monocular depth estimation and semantic segmentation tasks. A cross-view consistency loss based on pseudo depth labels, generated using pretrained stereo matching methods, was imposed on the prediction results of the two views to resolve the mismatch problem. An intensive ablation study showed that it leads to a substantial performance gain on both tasks, in particular achieving the best accuracy in monocular depth estimation. Our task-specific losses can be used complementarily with existing MTL architectures, e.g., those based on task-specific attention modules (Liu et al., 2019); an intelligent combination with these approaches is expected to further improve the performance. Additionally, how to integrate recent architectures designed for semantic segmentation (Chen et al., 2018; Takikawa et al., 2019) into the MTL network would be an interesting research direction.

A APPENDIX

A.1 MORE COMPREHENSIVE EVALUATION RESULTS FOR KITTI

We provide more comprehensive results for the KITTI dataset. Figure 6 shows a qualitative comparison with existing monocular depth estimation methods on the Eigen split of the KITTI dataset. Figure 7 shows semantic segmentation prediction results on the KITTI dataset. We also evaluated the performance with the improved ground truth depth maps of (Uhrig et al., 2017) for the KITTI dataset in Table 6; our approach outperforms existing monocular depth estimation methods.

A.2 MORE COMPREHENSIVE EVALUATION RESULTS FOR CITYSCAPES

We provide more comprehensive results for the Cityscapes dataset. Figures 8 and 9 show qualitative comparisons with existing MTL methods (Misra et al., 2016; Liu et al., 2019) for monocular depth estimation and semantic segmentation on the Cityscapes dataset. These results further support the effectiveness of our method.



Figure 1: Network architecture: (a) Pseudo-MTL based on the encoder-decoder architecture; (b) to impose the cross-view consistency, the network is applied to the left and right images separately.

where $\alpha_{d,lr}$, $\alpha_{d,l}$, and $\alpha_{d,r}$ denote the weights for each loss term. $G(a; b)$ denotes the result of warping $a$ into the other view using a depth map $b$. For instance, $G(d_r; d_l^{pgt})$ returns the right depth map warped onto the left image using $d_l^{pgt}$. $L_{d,lr}$ measures the cross-view consistency between the two predicted depth maps $d_l$ and $d_r$; note that the warping function $G$ is applied to $d_r$ and $d_l$, respectively. As with the depth regression loss $L_d$, the confidence map is used to prevent inaccurate values in the pseudo depth labels from being used. $L_{d,l}$ imposes the cross-view consistency on $(d_l^{pgt}, d_r)$ and $(d_l, d_r^{pgt})$ using the left pseudo label $d_l^{pgt}$: when $d_r$ (or $d_r^{pgt}$) is warped into the left image, the warped result should be similar to $d_l^{pgt}$ (or $d_l$). $L_{d,r}$ is defined analogously.

where '$\cdot$' denotes element-wise multiplication. The confidence maps $c_l$ and $c_r$ are again used to compensate for errors in the pseudo depth labels $d_l^{pgt}$ and $d_r^{pgt}$. Note that for training datasets that provide no ground truth segmentation maps, we generate pseudo ground truth segmentation maps; more details are provided in Section 3.3.

Figure 2: Cross-view consistency loss. Monocular depth or semantic segmentation maps are warped using $d_l^{pgt}$ and $d_r^{pgt}$, and the consistency losses $L_{d,c}$ and $L_{s,c}$ are measured. Dotted lines denote the warping operator $G(a; b)$, and solid lines denote the cross-view consistency losses. As summarized in Table 1, either $s_i^{pgt}$ or $s_i^{gt}$ is used as supervision for semantic segmentation.

Figure 3: Qualitative evaluation of monocular depth estimation on Eigen split of KITTI dataset. (a) Input image, (b) Monodepth (Godard et al., 2017), (c) Monodepth2 (Godard et al., 2019), (d) DepthHints (Watson et al., 2019), and (e) Ours.

Figure 4: Qualitative results of semantic segmentation prediction on KITTI dataset.

Figure 6: Qualitative evaluation of monocular depth estimation on KITTI dataset. (a) Input image, (b) Monodepth (Godard et al., 2017), (c) Monodepth2 (Godard et al., 2019), (d) DepthHints (Watson et al., 2019), and (e) Ours.

Figure 7: Qualitative results of semantic segmentation prediction on the KITTI dataset.

Figure 8: Qualitative results of monocular depth estimation on the Cityscapes dataset: (a) Input image, (b) Ground truth depth map obtained using SGM (Hirschmüller, 2008), (c) Cross-stitch (Misra et al., 2016), (d) MTAN (Liu et al., 2019), (e) Dense (Liu et al., 2019), and (f) Ours.




Table 3: Multi-task validation results for 7-class semantic segmentation and depth estimation on the Cityscapes dataset.

Table 4: Ablation study of our model on the KITTI dataset. The 'Baseline' model is our network without the confidence maps and the cross-view consistency losses.

Table 5: Ablation study of our model on the Cityscapes dataset. The 'Baseline' model is our network without the confidence maps and the cross-view consistency losses.

