SBEVNET: END-TO-END DEEP STEREO LAYOUT ESTIMATION

Abstract

Accurate layout estimation is crucial for planning and navigation in robotics applications, such as self-driving. In this paper, we introduce the Stereo Bird's Eye View Network (SBEVNet), a novel supervised end-to-end framework for estimating the bird's eye view layout from a pair of stereo images. Although our network reuses some of the building blocks from state-of-the-art deep learning networks for disparity estimation, we show that explicit depth estimation is neither sufficient nor necessary. Instead, learning a good internal bird's eye view feature representation is effective for layout estimation. Specifically, we first generate a disparity feature volume using the features of the stereo images and then project it to the bird's eye view coordinates. This gives us coarse-grained information about the scene structure. We also apply inverse perspective mapping (IPM) to map the input images and their features to the bird's eye view. This gives us fine-grained texture information. Concatenating the IPM features with the projected feature volume creates a rich bird's eye view representation which is useful for spatial reasoning. We use this representation to estimate the BEV semantic map. Additionally, we show that using the IPM features as a supervisory signal for the stereo features further improves performance. We demonstrate our approach on two datasets: the KITTI (Geiger et al., 2013) dataset and a synthetically generated dataset from the CARLA (Dosovitskiy et al., 2017) simulator. On both datasets, we establish state-of-the-art performance compared to baseline techniques.1
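As a concrete illustration of the inverse perspective mapping step mentioned above, the sketch below warps an image (or a per-pixel feature map) onto a ground-plane bird's eye view grid under a pinhole camera model. The camera intrinsics, camera height, and grid extents are illustrative assumptions, not values from the paper, and nearest-neighbour sampling stands in for the differentiable sampling a network would use.

```python
import numpy as np

def ipm_warp(image, K, cam_height, bev_hw, x_range, z_range):
    """Warp a camera image onto a ground-plane BEV grid (illustrative sketch).

    Assumes a pinhole camera looking along +z with the ground plane at
    y = cam_height (camera coordinates: x right, y down, z forward).
    """
    H, W = bev_hw
    # Ground-plane coordinates for every BEV cell; far rows go at the top.
    xs = np.linspace(x_range[0], x_range[1], W)
    zs = np.linspace(z_range[1], z_range[0], H)
    zz, xx = np.meshgrid(zs, xs, indexing="ij")
    pts = np.stack([xx, np.full_like(xx, cam_height), zz], axis=-1)  # (H, W, 3)
    # Project into the image: u = fx * x / z + cx, v = fy * y / z + cy.
    uv = pts @ K.T
    u = uv[..., 0] / uv[..., 2]
    v = uv[..., 1] / uv[..., 2]
    # Nearest-neighbour sampling with an out-of-view validity mask.
    ui = np.round(u).astype(int)
    vi = np.round(v).astype(int)
    h, w = image.shape[:2]
    valid = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    bev = np.zeros((H, W) + image.shape[2:], dtype=image.dtype)
    bev[valid] = image[vi[valid], ui[valid]]
    return bev, valid
```

Because IPM assumes every pixel lies on the ground plane, objects with height are smeared along the viewing ray; this is why the paper combines these fine-grained texture features with the projected disparity feature volume rather than relying on IPM alone.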

1. INTRODUCTION

Layout estimation is an extremely important task for navigation and planning in numerous robotics applications, such as autonomous driving. The bird's eye view (BEV) layout is a semantic occupancy map containing per-pixel class information, e.g. road, sidewalk, cars, vegetation, etc. The BEV semantic map is important for planning the path of the robot so that it avoids hitting objects or entering impassable locations.

In order to generate a BEV layout, we need 3D information about the scene. Sensors such as LiDAR (Light Detection And Ranging) can provide accurate point clouds, but their biggest limitations are high cost, sparse resolution, and low scan rates. Moreover, as an active sensor, LiDAR is more power-hungry, more susceptible to interference from other radiation sources, and can affect the scene. Cameras, on the other hand, are much cheaper, passive, and capture much more information at a higher frame rate. However, it is both hard and computationally expensive to recover accurate depth and point clouds from cameras.

The classic approach to stereo layout estimation consists of two steps. The first step generates a BEV feature map by orthographically projecting the point cloud computed from the stereo images. The second step performs bird's eye view semantic segmentation using the projected point cloud from the first step. This approach is limited by the accuracy of the estimated point cloud, because any error in it propagates to the layout estimation step. In this paper, we show that explicit depth estimation is actually neither sufficient nor necessary for good layout estimation. Estimating accurate depth is not sufficient because many areas in the 3D space can be partially occluded, e.g. the region behind a tree trunk.
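The classic two-step pipeline described above can be sketched as follows: back-project a disparity map to a point cloud, then orthographically project it (dropping the height axis) onto a BEV occupancy grid. The calibration values and grid extents are illustrative assumptions; in a real system the disparity map would come from a stereo matcher, and the resulting grid would feed a BEV segmentation network.

```python
import numpy as np

def disparity_to_bev_occupancy(disparity, fx, baseline, cx,
                               x_range=(-10.0, 10.0), z_range=(0.0, 20.0),
                               grid_hw=(100, 100)):
    """Classic two-step baseline sketch: disparity -> point cloud -> BEV grid.

    Calibration values (fx, baseline, cx) and grid extents are illustrative.
    """
    h, w = disparity.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = disparity > 0
    z = fx * baseline / disparity[valid]          # depth from disparity
    x = (u[valid] - cx) * z / fx                  # lateral offset
    # The height axis (y) is dropped by the orthographic, top-down projection.
    H, W = grid_hw
    ix = ((x - x_range[0]) / (x_range[1] - x_range[0]) * W).astype(int)
    iz = ((z_range[1] - z) / (z_range[1] - z_range[0]) * H).astype(int)
    keep = (ix >= 0) & (ix < W) & (iz >= 0) & (iz < H)
    bev = np.zeros((H, W), dtype=np.int32)
    np.add.at(bev, (iz[keep], ix[keep]), 1)       # count points per BEV cell
    return bev
```

Any disparity error enters this pipeline before the projection step, which is exactly the error-propagation problem motivating an end-to-end approach.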



1 The code and the synthesized dataset will be made public upon the acceptance of this paper.

