VECTORMAPNET: END-TO-END VECTORIZED HD MAP LEARNING

Abstract

Autonomous driving systems require a good understanding of surrounding environments, including moving obstacles and static High-Definition (HD) semantic map elements. Existing methods approach the semantic map problem by offline manual annotation, which suffers from serious scalability issues. Recent learning-based methods produce dense rasterized segmentation predictions to construct maps. However, these predictions do not include instance information of individual map elements and require heuristic post-processing to obtain vectorized maps. To tackle these challenges, we introduce an end-to-end vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes onboard sensor observations and predicts a sparse set of polylines in the bird's-eye view. This pipeline can explicitly model the spatial relation between map elements and generate vectorized maps that are friendly to downstream autonomous driving tasks. Extensive experiments show that VectorMapNet achieves strong map learning performance on both the nuScenes and Argoverse2 datasets, surpassing previous state-of-the-art methods by 14.2 mAP and 14.6 mAP, respectively. Qualitatively, we also show that VectorMapNet is capable of generating comprehensive maps and capturing more fine-grained details of road geometry. To the best of our knowledge, VectorMapNet is the first work designed towards end-to-end vectorized map learning from onboard observations.

1. INTRODUCTION

Autonomous driving systems require an understanding of map elements on the road, including lanes, pedestrian crossings, and traffic signs, to navigate the world. Such map elements are typically provided by pre-annotated High-Definition (HD) semantic maps in existing pipelines (Rong et al., 2020). These methods suffer from serious scalability issues, as human efforts are heavily involved in annotating HD maps. Recent works (Li et al., 2021; Philion & Fidler, 2020; Roddick & Cipolla, 2020) explore the problem of online HD semantic map learning, where the goal is to use onboard sensors (e.g. LiDARs and cameras) to estimate map elements on the fly. Most recent methods (Roddick & Cipolla, 2020; Yang et al., 2018; Philion & Fidler, 2020; Zhou & Krähenbühl, 2022) consider HD semantic map learning as a semantic segmentation problem in bird's-eye view (BEV), which rasterizes map elements into pixels and assigns each pixel a class label. This formulation makes it straightforward to leverage fully convolutional networks. However, rasterized maps are not an ideal map representation for autonomous driving, for three reasons. First, rasterized maps lack instance information, which is necessary to distinguish map elements with the same class label but different semantics, e.g. a left boundary versus a right boundary. Second, it is hard to enforce spatial consistency within the predicted rasterized maps, e.g. nearby pixels might have contradictory semantics or geometries. Third, 2D rasterized maps are incompatible with most autonomous driving systems, which consume instance-level 2D/3D vectorized maps for motion forecasting and planning. To alleviate these issues and produce vectorized outputs, HDMapNet (Li et al., 2021) generates semantic, instance, and directional maps and vectorizes these three maps with a hand-designed post-processing algorithm.
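The loss of instance information under rasterization can be made concrete with a small sketch. The code below is illustrative only (the grid size, resolution, and `rasterize` helper are our own assumptions, not part of any method in the paper): two lane dividers are stored as vectorized polylines with explicit identity, but once both are burned into a single class mask, nothing distinguishes one divider's pixels from the other's.

```python
import numpy as np

# Two lane dividers as vectorized polylines: instance identity is explicit,
# and each vertex is an exact (x, y) coordinate in the BEV frame.
left_divider = np.array([[0.0, 1.5], [10.0, 1.5], [20.0, 1.6]])
right_divider = np.array([[0.0, -1.5], [10.0, -1.5], [20.0, -1.4]])

def rasterize(polyline, resolution=0.5, grid=(40, 8), origin=(0.0, -2.0)):
    """Naive (hypothetical) rasterization: mark every cell a vertex falls into."""
    mask = np.zeros(grid, dtype=np.uint8)
    for x, y in polyline:
        i = int((x - origin[0]) / resolution)
        j = int((y - origin[1]) / resolution)
        if 0 <= i < grid[0] and 0 <= j < grid[1]:
            mask[i, j] = 1  # class label only: "divider"
    return mask

# Once both dividers land in one class mask, instance identity is gone:
# every set cell just says "divider", with no link back to its polyline.
bev_mask = rasterize(left_divider) | rasterize(right_divider)
```

Recovering the two instances from `bev_mask` requires heuristic grouping, which is exactly the post-processing burden the vectorized formulation avoids.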
However, HDMapNet still relies on rasterized map predictions, and its heuristic post-processing step complicates the pipeline and restricts the model's scalability and performance. In this paper, we propose an end-to-end vectorized HD map learning model named VectorMapNet, which does not involve a dense set of semantic pixels. Instead, it represents map elements as a set of polylines that are closely related to downstream tasks, e.g. motion forecasting (Gao et al., 2020). Therefore, the map learning problem boils down to predicting a sparse set of polylines from sensor observations. Specifically, we pose it as a detection problem and leverage set detection and sequence generation methods. First, VectorMapNet aggregates features generated from different modalities (e.g. camera images and LiDAR) into a common BEV feature space. Then, it detects map elements' locations based on learnable element queries and BEV features. Finally, we decode element queries to polylines for every map element. An overview of VectorMapNet is shown in Figure 1. Our experiments show that VectorMapNet achieves state-of-the-art performance on the public nuScenes dataset (Caesar et al., 2020) and Argoverse2 (Wilson et al., 2021), outperforming HDMapNet and another baseline by at least 14.2 mAP. Qualitatively, we find that VectorMapNet builds a more comprehensive map compared to previous works and is capable of capturing fine details, e.g. jagged boundaries. Furthermore, we feed our predicted vectorized HD map into a downstream motion forecasting module, and show the compatibility and effectiveness of the predicted map. To summarize, the contributions of the paper are as follows:

• VectorMapNet is an end-to-end HD semantic map learning method. Unlike previous works, we pose map learning as a set prediction problem and directly predict vectorized outputs from sensor observations without requiring map rasterization or post-processing.

• Jointly modeling the geometry and topological relations of map elements is challenging. We leverage polylines as primitives to model complex map shapes and decompose the model into two parts to mitigate this difficulty: a map element detector and a polyline generator.

• VectorMapNet achieves state-of-the-art HD semantic map learning performance on both the nuScenes and Argoverse2 datasets. Qualitative results and downstream evaluations also validate our design choices.
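The detect-then-generate decomposition above can be sketched in PyTorch. This is a minimal illustration under our own assumptions, not the authors' implementation: all module names, dimensions, and the fixed-length vertex head are hypothetical stand-ins (the paper's polyline generator is a sequence model, which we replace here with a simple regression head for brevity).

```python
import torch
import torch.nn as nn

class TwoStageMapDecoder(nn.Module):
    """Sketch of the decomposition: element queries attend to BEV features,
    a detector head predicts class/location per element, and a generator
    head emits polyline vertices for each detected element."""
    def __init__(self, d_model=256, num_queries=50, num_classes=3, num_pts=20):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)   # learnable element queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.detector = nn.TransformerDecoder(layer, num_layers=2)
        self.cls_head = nn.Linear(d_model, num_classes)     # element class logits
        self.loc_head = nn.Linear(d_model, 2)               # coarse BEV location
        # Hypothetical stand-in for the autoregressive polyline generator:
        # regress a fixed number of (x, y) vertices per detected element.
        self.generator = nn.Linear(d_model, num_pts * 2)

    def forward(self, bev_feats):            # bev_feats: (B, H*W, d_model)
        q = self.queries.weight.unsqueeze(0).expand(bev_feats.size(0), -1, -1)
        elems = self.detector(q, bev_feats)  # queries attend to BEV features
        cls_logits = self.cls_head(elems)    # (B, Q, num_classes)
        locs = self.loc_head(elems)          # (B, Q, 2)
        polylines = self.generator(elems).view(*elems.shape[:2], -1, 2)
        return cls_logits, locs, polylines

bev = torch.randn(1, 100 * 50, 256)          # flattened 100x50 BEV feature grid
cls_logits, locs, polylines = TwoStageMapDecoder()(bev)
```

The two-stage split keeps each sub-problem simple: the detector only needs coarse element identity and location, while the generator handles fine-grained vertex-level geometry.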

2. VECTORMAPNET

Problem formulation. Similar to HDMapNet (Li et al., 2021), our task is to model map elements in a vectorized form using data from onboard sensors, e.g. RGB cameras and/or LiDARs. These map elements include but are not limited to: Road boundaries, boundaries of roads that split roads and sidewalks. Typically, they are curves with irregular shapes and arbitrary lengths; Lane dividers, boundaries of the lanes in the road. Usually they are straight lines; Pedestrian crossings, regions with white markings where pedestrians can legally cross the road. Usually they are quadrilaterals. These elements are critical for autonomous driving, but their diverse geometries make them challenging to model with a single representation.
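Despite their diverse geometries, all three element classes fit one vectorized representation: an ordered polyline plus a class label. The sketch below shows one way to encode this; the `MapElement` structure, class constants, and toy coordinates are our own illustrative assumptions, not an interface from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative class IDs for the three element types described in the text.
ROAD_BOUNDARY, LANE_DIVIDER, PED_CROSSING = 0, 1, 2

@dataclass
class MapElement:
    """One vectorized map element: a class label plus an ordered polyline."""
    cls: int
    points: List[Tuple[float, float]]  # ordered (x, y) vertices in BEV metres

# A toy local map. Note the representation is uniform even though the
# geometries differ: irregular curve, straight line, and quadrilateral.
local_map = [
    MapElement(ROAD_BOUNDARY, [(0, 8), (12, 8.2), (25, 7.5), (30, 9)]),
    MapElement(LANE_DIVIDER, [(0, 0), (30, 0)]),                   # straight line
    MapElement(PED_CROSSING, [(5, -4), (8, -4), (8, 4), (5, 4)]),  # quadrilateral
]

num_dividers = sum(e.cls == LANE_DIVIDER for e in local_map)
```

Because every element is just a labeled vertex sequence, a single model head can emit all three classes, and polylines of arbitrary length accommodate boundaries of any shape.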



Figure 1: An overview of VectorMapNet. Sensor data is encoded to BEV features in the same coordinate frame as the map elements. VectorMapNet detects the locations of map elements from BEV features by leveraging element queries. The vectorized HD map is built upon a sparse set of polylines that are generated from the detection results. Since polylines encode direction information, we can infer semantic information (e.g. drivable area) from the polylines. It is worth noting that the drivable area is inferred from several disjoint boundaries and is non-trivial to model as one object.

