CROMA: CROSS-MODALITY DOMAIN ADAPTATION FOR MONOCULAR BEV PERCEPTION

Abstract

Incorporating multiple sensor modalities and closing the domain gap between training and deployment are two challenging yet critical topics for self-driving. Existing adaptation work focuses only on the visual-level domain gap, overlooking the sensor-type gap that exists in reality. A model trained with a collection of sensor modalities may need to run in a setting with fewer sensor types available. In this work, we propose a Cross-Modality Adaptation (CroMA) framework to facilitate the learning of a more robust monocular bird's-eye-view (BEV) perception model, which transfers point cloud knowledge from a LiDAR sensor during the training phase to the camera-only testing scenario. The absence of LiDAR during testing precludes its use as model input. Hence, our key idea lies in the design of (i) a LiDAR-teacher and camera-student knowledge distillation model, and (ii) a multi-level adversarial learning mechanism, which adapts and aligns the features learned from different sensors and domains. This work presents the first open analysis of cross-domain perception and cross-sensor adaptation for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under a wide range of domain shifts and show state-of-the-art results against various baselines.

1. INTRODUCTION

In recent years, multi-modality 3D perception has shown outstanding performance and robustness over its single-modality counterpart, achieving leading results on various 3D perception tasks (Vora et al., 2020; Qi et al., 2020; Jaritz et al., 2020; Park et al., 2021; Weng et al., 2020) and large-scale multi-sensor 3D datasets (Caesar et al., 2020; Kesten et al., 2019; Sun et al., 2020). Despite the superiority in information coverage, introducing more sensor modalities also poses additional challenges to the perception system. On one hand, generalizing the model across datasets becomes harder because each sensor has its own domain gap, such as field-of-view (FoV) for cameras and point density for LiDAR. On the other hand, the operation of the model is conditioned on the presence and function of more sensors, making it difficult to deploy on autonomous agents with fewer sensor types or under sensor-failure scenarios.

More specifically, transferring knowledge among different data domains is still an open problem for autonomous agents in the wild. In the self-driving scenario, it is common practice to train perception models offline on a source domain with annotations and deploy them in a target domain without annotations. As a result, the model has to cope with the domain gap between source and target environments or datasets, which usually involves different driving locations, sensor specifications, illumination, and weather conditions.

Meanwhile, the domain shift lies not only in the visual perspective but also in the sensor-modality perspective. Previous methods assume a less realistic setting where all sensors are available during training, validation, and deployment, which does not always hold in reality. Due to the cost-efficiency trade-off, or sensor absence and failure, in many scenarios fewer sensors are available in the target domain during testing than in the source domain during training. A typical case is having both camera and LiDAR sensors in the large-scale training phase while having only cameras at test time, as shown in Figure 1. It remains unclear how to facilitate camera-only 3D inference with the help of a LiDAR sensor that is available only in the source domain during training. These challenges raise an important question: can we achieve robust 3D perception under both the visual domain gap and the sensor-modality shift?
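To make the two ingredients behind this question concrete, the following is a minimal PyTorch sketch of the general recipe: a camera student is distilled toward a LiDAR teacher on the source domain, while a gradient-reversal domain discriminator encourages the student's BEV features to be invariant across source and target. All module names, feature sizes, and loss weights here (e.g., TinyBEVEncoder, DomainDiscriminator, the 0.1 weight) are hypothetical placeholders for illustration, not the actual CroMA implementation.

```python
# Illustrative sketch only: encoders, shapes, and loss weights are hypothetical,
# not the paper's CroMA architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer used for adversarial feature alignment."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class TinyBEVEncoder(nn.Module):
    """Stand-in encoder mapping an input tensor to a BEV feature map."""
    def __init__(self, in_ch, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)


class DomainDiscriminator(nn.Module):
    """Predicts source (0) vs. target (1) from BEV features."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, feat, lamb=1.0):
        return self.net(GradReverse.apply(feat, lamb))


# --- One hypothetical training step ---------------------------------------
camera_student = TinyBEVEncoder(in_ch=3)      # camera branch (kept at test time)
lidar_teacher = TinyBEVEncoder(in_ch=1)       # LiDAR branch (training only)
discriminator = DomainDiscriminator()

src_img = torch.randn(2, 3, 128, 128)         # source-domain camera input
src_bev_lidar = torch.randn(2, 1, 128, 128)   # source-domain rasterized LiDAR BEV
tgt_img = torch.randn(2, 3, 128, 128)         # target-domain camera input (no labels, no LiDAR)

stu_src = camera_student(src_img)
stu_tgt = camera_student(tgt_img)
with torch.no_grad():                         # teacher is not updated by the student loss
    tea_src = lidar_teacher(src_bev_lidar)

# (i) Teacher-student distillation: pull camera BEV features toward LiDAR BEV features.
loss_distill = F.mse_loss(stu_src, tea_src)

# (ii) Adversarial alignment: gradient reversal pushes the student toward
# domain-invariant BEV features across source and target.
dom_logits = torch.cat([discriminator(stu_src), discriminator(stu_tgt)])
dom_labels = torch.cat([torch.zeros(2, 1), torch.ones(2, 1)])
loss_adv = F.binary_cross_entropy_with_logits(dom_logits, dom_labels)

loss = loss_distill + 0.1 * loss_adv          # weighting chosen arbitrarily here
loss.backward()
```

Note that only the camera student and its discriminator-aligned features are needed at test time; the LiDAR teacher exists solely as a training-time source of supervision, which is the key constraint the paper's setting imposes.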

