SBEVNET: END-TO-END DEEP STEREO LAYOUT ESTIMATION

Abstract

Accurate layout estimation is crucial for planning and navigation in robotics applications such as self-driving. In this paper, we introduce the Stereo Bird's Eye View Network (SBEVNet), a novel supervised end-to-end framework for estimating the bird's eye view layout from a pair of stereo images. Although our network reuses some of the building blocks from state-of-the-art deep learning networks for disparity estimation, we show that explicit depth estimation is neither sufficient nor necessary. Instead, learning a good internal bird's eye view feature representation is effective for layout estimation. Specifically, we first generate a disparity feature volume using the features of the stereo images and then project it to the bird's eye view coordinates. This gives us coarse-grained information about the scene structure. We also apply inverse perspective mapping (IPM) to map the input images and their features to the bird's eye view. This gives us fine-grained texture information. Concatenating the IPM features with the projected feature volume creates a rich bird's eye view representation which is useful for spatial reasoning. We use this representation to estimate the BEV semantic map. Additionally, we show that using the IPM features as a supervisory signal for the stereo features can improve performance. We demonstrate our approach on two datasets: the KITTI (Geiger et al., 2013) dataset and a synthetically generated dataset from the CARLA (Dosovitskiy et al., 2017) simulator. For both of these datasets, we establish state-of-the-art performance compared to baseline techniques. 1

1. INTRODUCTION

Layout estimation is an extremely important task for navigation and planning in numerous robotics applications such as autonomous driving. The bird's eye view (BEV) layout is a semantic occupancy map containing per-pixel class information, e.g. road, sidewalk, cars, vegetation, etc. The BEV semantic map is important for planning the path of the robot in order to prevent it from hitting objects and entering impassable locations. In order to generate a BEV layout, we need 3D information about the scene. Sensors such as LiDAR (Light Detection And Ranging) can provide accurate point clouds. The biggest limitations of LiDAR are high cost, sparse resolution, and low scan rates. Also, as an active sensor, LiDAR is more power hungry, more susceptible to interference from other radiation sources, and can affect the scene. Cameras, on the other hand, are much cheaper, passive, and capture much more information at a higher frame rate. However, it is both hard and computationally expensive to obtain accurate depth and point clouds from cameras.

The classic approach to stereo layout estimation consists of two steps. The first step is to generate a BEV feature map by an orthographic projection of the point cloud generated from the stereo images. The second step is bird's eye view semantic segmentation using the projected point cloud from the first step. This approach is limited by the accuracy of the estimated point cloud, because its errors propagate to the layout estimation step. In this paper, we show that explicit depth estimation is actually neither sufficient nor necessary for good layout estimation. Estimating accurate depth is not sufficient because many areas in the 3D space can be partially occluded, e.g. behind a tree trunk. However, these areas can be estimated by combining spatial reasoning and geometric knowledge in a bird's eye view representation.
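The classic two-step pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not the method of this paper: the function name, the pinhole intrinsics fx, fy, cx, cy, and the grid extents are hypothetical, and the depth map is assumed to come from a prior stereo matcher (e.g. depth = fx * baseline / disparity).

```python
import numpy as np

def bev_occupancy_from_depth(depth, fx, fy, cx, cy,
                             x_range=(-20.0, 20.0), z_range=(0.0, 40.0),
                             resolution=0.25):
    """Project a metric depth map into a bird's-eye-view occupancy grid.

    depth: (H, W) depth in meters, e.g. fx * baseline / disparity.
    fx, fy, cx, cy: pinhole camera intrinsics (assumed known).
    Returns a (Z_bins, X_bins) boolean occupancy grid.
    """
    h, w = depth.shape
    us, _ = np.meshgrid(np.arange(w), np.arange(h))
    z = depth                                # forward distance
    x = (us - cx) * z / fx                   # lateral position from the pinhole model
    # Keep only points that fall inside the BEV extent.
    valid = (z > z_range[0]) & (z < z_range[1]) & \
            (x > x_range[0]) & (x < x_range[1])
    # Rasterize the surviving 3D points into grid cells.
    xi = ((x[valid] - x_range[0]) / resolution).astype(int)
    zi = ((z[valid] - z_range[0]) / resolution).astype(int)
    n_x = int((x_range[1] - x_range[0]) / resolution)
    n_z = int((z_range[1] - z_range[0]) / resolution)
    grid = np.zeros((n_z, n_x), dtype=bool)
    grid[zi, xi] = True
    return grid
```

The rasterization in the last lines illustrates the limitation noted above: any error in the estimated depth shifts points into the wrong BEV cell, and that error propagates directly to the downstream segmentation step.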
Explicitly estimating accurate depth is also not necessary because layout estimation can be done without estimating the point cloud. Point cloud coordinate accuracy is limited by the 3D to 2D BEV projection and rasterization. For these reasons, having an effective bird's eye view representation is very important. SBEVNet builds upon the recent deep stereo matching paradigm. These deep learning based methods have shown tremendous success in stereo disparity/depth estimation. Most of these models (Liang et al., 2018; Khamis et al., 2018; Wang et al., 2019b; Sun et al., 2018; Guo et al., 2019; Zhang et al., 2019; Chang & Chen, 2018; Kendall et al., 2017) generate a 3-dimensional disparity feature volume by concatenating the left and right image features shifted at different disparities, which is used to build a cost volume containing stereo matching costs for each disparity value. Given a location in the image and the disparity, we can recover the position of the corresponding 3D point in world space. Hence, every point in the feature volume and cost volume corresponds to a 3D location in world space. The innovation in our approach comes from the observation that the feature volume can be used directly for layout estimation, rather than through a two-step process that uses the point cloud generated by the network. We propose SBEVNet, an end-to-end neural architecture that takes a pair of stereo images and outputs the bird's eye view scene layout. We first project the disparity feature volume to the BEV, creating a 2D representation from the 3D volume. We then warp it by mapping the different disparities and image coordinates to the bird's eye view space. In order to overcome the loss of fine-grained information imposed by our choice of the stereo BEV feature map, we concatenate a projection of the original images and deep features to this feature map.
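The shift-and-concatenate construction of the disparity feature volume can be sketched in a few lines. This is a simplified NumPy illustration of the common design used by the cited stereo networks, not the exact SBEVNet implementation; the function name and the zero-padding of out-of-frame disparities are assumptions.

```python
import numpy as np

def build_disparity_volume(left_feat, right_feat, max_disp):
    """Build a (2C, D, H, W) disparity feature volume by concatenating
    left features with right features shifted by each candidate disparity.

    left_feat, right_feat: (C, H, W) feature maps from a shared encoder.
    Cell (d, y, x) pairs left pixel (y, x) with right pixel (y, x - d),
    so each cell corresponds to one 3D point under the stereo geometry.
    """
    c, h, w = left_feat.shape
    volume = np.zeros((2 * c, max_disp, h, w), dtype=left_feat.dtype)
    for d in range(max_disp):
        # Columns x < d have no valid right-image match and stay zero.
        volume[:c, d, :, d:] = left_feat[:, :, d:]
        volume[c:, d, :, d:] = right_feat[:, :, : w - d]
    return volume
```

Because depth is fx * baseline / d, each (d, x) slice of this volume corresponds to a fixed location in the BEV plane, which is what makes the direct projection of the volume to bird's eye view coordinates possible.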
We generate these projected features by applying inverse perspective mapping (IPM) (Mallot et al., 1991) to the input image and its features, choosing the ground as the target plane. We feed this representation to a U-Net in order to estimate the BEV semantic map of the scene. Performing inverse perspective mapping requires information about the ground plane in the 3D world space. Hence, we also consider the scenario where IPM is available during training but not at inference time. In this case, we use cross-modal distillation during training to transfer knowledge from the IPM features to the stereo features. SBEVNet is the first approach to use an end-to-end neural architecture for stereo layout estimation. We show that SBEVNet achieves better performance than existing approaches, outperforming all baseline algorithms on the KITTI (Geiger et al., 2013) dataset and on a synthetically generated dataset from the CARLA simulator (Dosovitskiy et al., 2017). In summary, our contributions are the following:
1. We propose SBEVNet, an end-to-end neural architecture for layout estimation from a stereo pair of images.
2. We learn a novel representation for BEV layout estimation by fusing the projected stereo feature volume with fine-grained inverse perspective mapping features.
3. We evaluate SBEVNet and demonstrate state-of-the-art performance over other methods by a large margin on two datasets: the KITTI dataset and our synthetically generated dataset using the CARLA simulator.
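Inverse perspective mapping onto the ground plane can be illustrated with a minimal NumPy sketch. The function name, the pinhole intrinsics, and the grid extents are hypothetical, and the sketch assumes a flat ground, a level camera at a known height, and nearest-neighbor sampling; a practical implementation would typically use a homography warp (e.g. cv2.warpPerspective) with interpolation instead.

```python
import numpy as np

def ipm_ground_plane(image, fx, fy, cx, cy, cam_height,
                     x_range=(-10.0, 10.0), z_range=(2.0, 30.0),
                     resolution=0.25):
    """Warp an image onto the ground plane (inverse perspective mapping).

    For every BEV cell (X, Z) on the ground (cam_height below the camera),
    project the 3D point into the image with the pinhole model and sample
    the nearest pixel. image: (H, W, C). Returns a (Z_bins, X_bins, C)
    bird's-eye-view image; cells projecting outside the frame stay zero.
    """
    h, w = image.shape[:2]
    xs = np.arange(x_range[0], x_range[1], resolution)
    zs = np.arange(z_range[0], z_range[1], resolution)
    X, Z = np.meshgrid(xs, zs)
    u = (fx * X / Z + cx).round().astype(int)            # image column
    v = (fy * cam_height / Z + cy).round().astype(int)   # image row, below the horizon
    bev = np.zeros(X.shape + image.shape[2:], dtype=image.dtype)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    bev[valid] = image[v[valid], u[valid]]
    return bev
```

The sketch also makes the texture trade-off visible: nearby ground is sampled densely while far cells reuse the same few pixel rows, which is why the IPM branch supplies fine-grained texture only where the flat-ground assumption holds.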

2. RELATED WORK

To the best of our knowledge, there is no published work on estimating the layout from a pair of stereo images. However, there are several works tackling layout estimation from a single image or performing object detection using stereo images. In this section, we review the most closely related approaches.



1 The code and the synthesized dataset will be made public upon the acceptance of this paper.



MonoLayout (Mani et al., 2020) uses an encoder-decoder model to estimate the bird's eye view layout from a monocular input image. They also leverage adversarial training to produce sharper estimates. MonoOccupancy (Lu et al., 2019) uses a variational encoder-decoder network to estimate the layout. Neither MonoLayout nor MonoOccupancy uses any camera geometry priors to perform the task. Schulter et al. (2018) use depth estimation to project the image semantics to the bird's eye view. They also use OpenStreetMap data to refine the BEV images via adversarial learning. Wang et al. (2019c) build on (Schulter et al., 2018) to estimate road parameters such as lanes, sidewalks, etc. Monocular methods learn a strong prior, which

