MAPTR: STRUCTURED MODELING AND LEARNING FOR ONLINE VECTORIZED HD MAP CONSTRUCTION

Abstract

High-definition (HD) maps provide abundant and precise environmental information about the driving scene, serving as a fundamental and indispensable component for planning in autonomous driving systems. We present MapTR, a structured end-to-end Transformer for efficient online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, i.e., modeling each map element as a point set with a group of equivalent permutations, which accurately describes the shape of the map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. Among existing vectorized map construction approaches, MapTR achieves the best performance and efficiency on the nuScenes dataset using only camera input. In particular, MapTR-nano runs at real-time inference speed (25.1 FPS) on an RTX 3090, 8× faster than the existing state-of-the-art camera-based method while achieving 5.0 higher mAP. Even compared with the existing state-of-the-art multi-modality method, MapTR-nano achieves 0.7 higher mAP, and MapTR-tiny achieves 13.5 higher mAP and 3× faster inference speed. Abundant qualitative results show that MapTR maintains stable and robust map construction quality in complex and varied driving scenes. MapTR is of great application value in autonomous driving. Code and more demos are available at https://github.com/hustvl/MapTR.

1. INTRODUCTION

A high-definition (HD) map is a high-precision map specifically designed for autonomous driving, composed of instance-level vectorized representations of map elements (pedestrian crossings, lane dividers, road boundaries, etc.). An HD map contains rich semantic information about road topology and traffic rules, which is vital for the navigation of self-driving vehicles. Conventionally, HD maps are constructed offline with SLAM-based methods (Zhang & Singh, 2014; Shan & Englot, 2018; Shan et al., 2020), which incurs a complicated pipeline and high maintenance cost. Recently, online HD map construction has attracted ever-increasing interest. It constructs the map around the ego-vehicle at runtime with vehicle-mounted sensors, eliminating offline human effort.

Early works (Chen et al., 2022a; Liu et al., 2021a; Can et al., 2021) leverage line-shape priors to perceive open-shape lanes based on the front-view image. They are restricted to single-view perception and cannot cope with other map elements of arbitrary shape. With the development of bird's eye view (BEV) representation learning, recent works (Chen et al., 2022b; Zhou & Krähenbühl, 2022; Hu et al., 2021; Li et al., 2022c) predict a rasterized map by performing BEV semantic segmentation. However, the rasterized map lacks vectorized instance-level information, such as the lane structure.

It is natural to ask a question: Can we design a DETR-like paradigm for efficient end-to-end vectorized HD map construction? We show that the answer is affirmative with our proposed Map TRansformer (MapTR). Unlike object detection, in which objects can be easily abstracted geometrically as bounding boxes, vectorized map elements have more dynamic shapes. To accurately describe map elements, we propose a novel unified modeling method. We model each map element as a point set with a group of equivalent permutations. The point set determines the position of the map element, and the permutation group includes all possible orderings of the point set that correspond to the same geometric shape, avoiding shape ambiguity.

Based on the permutation-equivalent modeling, we design a structured framework which takes images from vehicle-mounted cameras as input and outputs a vectorized HD map. We formulate online vectorized HD map construction as a parallel regression problem. Hierarchical query embed-



Figure 1. MapTR maintains stable and robust vectorized HD map construction quality in complex and varied driving scenes.
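
To make the permutation-equivalent modeling described above more concrete, the following minimal sketch enumerates the orderings of a point set that trace the same shape. It assumes that an open polyline has two equivalent orderings (forward and reverse) and that a closed polygon with N points has 2N (every starting point, in both directions). The function names `equivalent_permutations` and `min_permutation_distance`, and the use of Manhattan distance, are illustrative choices and are not taken from the released code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def equivalent_permutations(points: Sequence[Point], closed: bool) -> List[List[Point]]:
    """Enumerate orderings of `points` that describe the same geometric shape.

    Assumption for this sketch: an open polyline has 2 equivalent orderings,
    a closed polygon with N points has 2N.
    """
    pts = list(points)
    if not closed:
        # Open polyline: the same shape traversed forward or backward.
        return [pts, pts[::-1]]

    perms: List[List[Point]] = []
    for start in range(len(pts)):
        # Rotate so that the traversal begins at index `start`.
        rotated = pts[start:] + pts[:start]
        perms.append(rotated)
        # Same starting point, opposite traversal direction.
        perms.append([rotated[0]] + rotated[:0:-1])
    return perms


def min_permutation_distance(pred: Sequence[Point], gt: Sequence[Point], closed: bool) -> float:
    """Smallest point-wise Manhattan distance between `pred` and any
    equivalent ordering of `gt` (the distance choice is illustrative)."""

    def dist(a: Sequence[Point], b: Sequence[Point]) -> float:
        return sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in zip(a, b))

    return min(dist(pred, perm) for perm in equivalent_permutations(gt, closed))


if __name__ == "__main__":
    line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    print(len(equivalent_permutations(line, closed=False)))
    print(len(equivalent_permutations(square, closed=True)))
    # A prediction that lists the line's points in reverse order still matches exactly.
    print(min_permutation_distance(line[::-1], line, closed=False))
```

Under this sketch, a prediction that lists the points of a lane divider in the opposite direction, or starts a pedestrian crossing polygon at a different corner, is not penalized, which is the ambiguity the permutation group is meant to remove.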

