CROMA: CROSS-MODALITY DOMAIN ADAPTATION FOR MONOCULAR BEV PERCEPTION

Abstract

Incorporating multiple sensor modalities and closing the domain gap between training and deployment are two challenging yet critical topics for self-driving. Existing adaptation work focuses only on the visual-level domain gap, overlooking the sensor-type gap that exists in reality: a model trained with a collection of sensor modalities may need to run in another setting with fewer types of sensors available. In this work, we propose a Cross-Modality Adaptation (CroMA) framework to facilitate the learning of a more robust monocular bird's-eye-view (BEV) perception model, which transfers point-cloud knowledge from a LiDAR sensor during the training phase to the camera-only testing scenario. The absence of LiDAR during testing precludes its use as a model input. Hence, our key idea lies in the design of (i) a LiDAR-teacher and Camera-student knowledge distillation model, and (ii) a multi-level adversarial learning mechanism, which adapts and aligns the features learned from different sensors and domains. This work results in the first open analysis of cross-domain perception and cross-sensor adaptation for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under a wide range of domain shifts and show state-of-the-art results against various baselines.

1. INTRODUCTION

In recent years, multi-modality 3D perception has shown outstanding performance and robustness over its single-modality counterpart, achieving leading results for various 3D perception tasks (Vora et al., 2020; Qi et al., 2020; Jaritz et al., 2020; Park et al., 2021; Weng et al., 2020) on large-scale multi-sensor 3D datasets (Caesar et al., 2020; Kesten et al., 2019; Sun et al., 2020). Despite the superiority in information coverage, introducing more sensor modalities also poses additional challenges to the perception system. On one hand, generalizing the model between datasets becomes harder because each sensor has its unique domain gap, such as field-of-view (FoV) for cameras, point density for LiDAR, etc. On the other hand, the operation of the model is conditioned on the presence and function of more sensors, making it hard to work on autonomous agents with fewer sensor types or under sensor-failure scenarios. More specifically, transferring knowledge among different data domains is still an open problem for autonomous agents in the wild. In the self-driving scenario, it is common practice to train the perception model offline in a source domain with annotations while deploying it in another target domain without annotations. As a result, the model has to cope with the domain gap between source and target environments or datasets, which usually involves different locations, sensor specifications, illumination, and weather conditions. Meanwhile, the domain shift lies not only in the visual perspective but also in the sensor-modality perspective. Previous methods assume a less realistic setting where all sensors are available during training, validation, and deployment, which is not always true in reality.
Due to the cost-efficiency trade-off, or to sensor missing and failure, in many scenarios fewer sensors are available in the target domain during testing than in the source domain during training. A typical scenario is having camera and LiDAR sensors in the large-scale training phase while only having cameras for testing, as shown in Figure 1. It is not clear how to facilitate camera-only 3D inference with the help of a LiDAR sensor that is available only in the source domain during training. The challenges above raise an important question and task: can we achieve robust 3D perception under both the visual domain gap and the sensor-modality shift?


3) Naive LiDAR supervision leads to worse performance. It is generally believed in the community that introducing additional sensors is bound to increase overall performance. Surprisingly, our experiments show a 0.3 IoU decrease when we naively introduce LiDAR to supervise the depth estimation. This is because the source-target domain gap becomes larger with the additional sensor-type shift. As we will discuss in Sec. 3.2, our new problem setting requires a novel methodology for using LiDAR without increasing the domain discrepancy.

To tackle the above challenges, we propose CroMA, a cross-modality domain adaptation framework for bird's-eye-view (BEV) perception. Our model addresses the monocular 3D perception task between different domains, and utilizes additional modalities in the source domain to improve evaluation performance. Motivated by the fact that image and BEV frames are bridged by a 3D representation, we first design an efficient backbone that performs depth estimation followed by a BEV projection. Then, to learn from point clouds without explicitly taking them as model inputs, we propose an implicit learning strategy, which distills 3D knowledge from a LiDAR-Teacher to help the Camera-Student learn a better 3D representation. Finally, to address the visual domain shift, we introduce adversarial learning on the student to align the features learned from the source and target domains. Supervision from the teacher and feature discriminators is applied at multiple layers to ensure effective knowledge transfer. By considering the domain gap and effectively leveraging LiDAR point clouds in the source domain, our proposed method works reliably in more complicated, uncommon, and even unseen environments. Our model achieves state-of-the-art performance in four very different domain-shift settings.
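The two mechanisms described above, teacher-to-student feature distillation and adversarial feature alignment, can be sketched in a few lines of PyTorch. This is a minimal illustration of the general techniques, not the paper's actual implementation; the names `GradReverse`, `DomainDiscriminator`, and `distill_loss` are hypothetical, and the real model applies such losses at multiple feature levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    negated (and scaled) gradient in the backward pass, so that
    minimizing the discriminator loss maximizes domain confusion
    in the feature extractor."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None


def distill_loss(student_feat, teacher_feat):
    """L2 feature-matching distillation; the LiDAR-teacher features
    are detached so only the camera-student receives gradients."""
    return F.mse_loss(student_feat, teacher_feat.detach())


class DomainDiscriminator(nn.Module):
    """Small MLP that predicts source vs. target from a feature vector,
    trained through the gradient reversal layer."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # domain logit
        )

    def forward(self, feat, lamb=1.0):
        return self.net(GradReverse.apply(feat, lamb))
```

In training, a total loss would combine the supervised task loss on the source domain, `distill_loss` between teacher and student features, and a binary cross-entropy domain loss on the discriminator outputs for source and target batches.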
Extensive ablation studies are conducted to investigate the contribution of each proposed component, the robustness under different changes, and other design choices.

The main contributions of this paper are as follows. (1) We introduce modality mismatch, an overlooked but realistic problem setting in 3D domain adaptation in the wild, leading to a robust camera-only 3D model that works in complicated and dynamic scenarios with minimal sensors available.

