EXPLOITING PLAYBACKS IN UNSUPERVISED DOMAIN ADAPTATION FOR 3D OBJECT DETECTION

Anonymous

Abstract

Self-driving cars must detect other vehicles and pedestrians in 3D to plan safe routes and avoid collisions. State-of-the-art 3D object detectors, based on deep learning, have shown promising accuracy but are prone to over-fitting to domain idiosyncrasies, causing them to fail in new environments, a serious problem if autonomous vehicles are meant to operate freely. In this paper, we propose a novel learning approach that drastically reduces this gap by fine-tuning the detector on pseudo-labels in the target domain, which our method generates while the vehicle is parked, based on replays of previously recorded driving sequences. In these replays, objects are tracked over time and detections are interpolated and extrapolated; crucially, this leverages future information to catch hard cases. We show, on five autonomous driving datasets, that fine-tuning the detector on these pseudo-labels substantially reduces the domain gap to new driving environments, yielding drastic improvements in accuracy and detection reliability.

1. INTRODUCTION

One of the fundamental learning problems in the context of self-driving cars is the detection and localization of other traffic participants, such as cars, cyclists, and pedestrians, in 3D. Typically, the input consists of LiDAR or pseudo-LiDAR (Wang et al., 2019b) point clouds (sometimes with accompanying images), and the outputs are sets of tight 3D bounding boxes that envelop the detected objects. The problem is particularly challenging because the predictions must be highly accurate, reliable, and, importantly, made in real time. The current state of the art in 3D object detection is based on deep learning approaches (Qi et al., 2018; Shi et al., 2019; Yang et al., 2018; Shi et al., 2020), trained on short driving segments with labeled bounding boxes (Geiger et al., 2012; 2013), which yield up to 80% average precision on held-out segments (Shi et al., 2020). However, as with all machine learning, these techniques succeed when the training data distribution matches the test data distribution. One possibility to ensure train/test consistency is to constrain self-driving cars to a small geo-fenced area. Here, a fleet of self-driving taxis might together collect accurate training data with exhaustive coverage, so that the accuracy of the system is guaranteed. This, however, is fundamentally limiting. Ultimately, one would like self-driving cars to be driven freely anywhere, similar to a human-driven car. This unconstrained scenario introduces an inherent adaptation problem: the car producer cannot foresee where the owner will ultimately operate the car. The perception system might be trained on urban roads in Germany (Geiger et al., 2013; 2012), but the car may be driven in mountainous regions of the USA, where cars may be larger and fewer, the roads may be snowy, and buildings may look different. Past work has shown that such differences can cause a drop of more than 35% in the accuracy of existing systems (Wang et al., 2020).
Closing this adaptation gap is one of the biggest remaining challenges for freely self-driving vehicles. Car owners, however, are likely to spend most of their driving time on similar routes (commuting to work, grocery stores, etc.) and to leave their cars parked (e.g., at night) for extended amounts of time. This raises an intriguing possibility: the car can collect training data on these frequent trips, then retrain itself while offline to adapt to the new environment for subsequent online driving. Unfortunately, the data the car collects are unlabeled. The challenge is thus one of unsupervised domain adaptation (Gong et al., 2012): the detection system, having been previously trained on labeled data from a source domain, must now adapt to a target domain where only unlabeled data are available.

Under review as a conference paper at ICLR 2021

In this paper, we present a novel and effective approach for unsupervised domain adaptation of 3D detectors that addresses this challenge. Our key insight is two-fold. First, our data is not simply a bag of independent images, but a video of the same scene over time. Second, the dynamics of our objects of interest (i.e., cars) can be modeled effectively. This allows us to take confident detections of nearby objects, estimate their states (e.g., locations, sizes, and speeds), and then extrapolate them forward and backward in time to frames where they were missed by the detector. We show that these extrapolations allow us to correctly label precisely those difficult (typically distant) objects that are easily missed in new environments. For example, an oncoming car may be detected too late, only once it is close enough. The playback allows us to go back in time and annotate its position in frames where it was previously missed. Although this process cannot be performed in real time (since it uses future information), we use it to generate a new training set with pseudo-labels for the target environment.
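To make the state-estimation-and-extrapolation step concrete, the following is a minimal Python sketch, not the paper's actual implementation: it fits a constant-velocity motion model to the confident detections of one tracked object and predicts the object's 3D position at timestamps where the detector produced no output. The `(t, x, y, z, score)` track format and the confidence threshold are assumptions for illustration.

```python
import numpy as np

def extrapolate_track(track, t_missing, conf_thresh=0.7):
    """Fit a constant-velocity model to the confident detections of one
    tracked object and predict its 3D center at missed timestamps.

    track: list of (t, x, y, z, score) detections (hypothetical format).
    t_missing: timestamps where the detector missed the object.
    Returns a dict mapping each missed timestamp to a predicted (x, y, z).
    """
    # Keep only confident observations.
    obs = np.array([d[:4] for d in track if d[4] >= conf_thresh])
    times, xyz = obs[:, 0], obs[:, 1:4]
    # Least-squares fit per coordinate: position(t) = p0 + v * t.
    A = np.stack([np.ones_like(times), times], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, xyz, rcond=None)  # shape (2, 3)
    p0, v = coeffs[0], coeffs[1]
    # Extrapolate (forward or backward in time) to the missed frames.
    return {t: p0 + v * t for t in t_missing}
```

In practice the paper's pipeline also interpolates sizes and handles track association; this sketch only shows why future observations help, since a track observed at times after a missed frame constrains the backward extrapolation.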
We then adapt the detector to the target domain through fine-tuning on this newly created dataset. The few, but likely accurate, labels allow the detector to generalize to more settings in this environment (e.g., picking up what distant cars or typical background scenes look like). We call our approach dreaming, as the car learns by replaying past driving sequences backwards and forwards while it is parked. We evaluate our dreaming algorithm on multiple autonomous driving datasets, including KITTI (Geiger et al., 2012; 2013), Argoverse (Chang et al., 2019), Lyft (Kesten et al., 2019), Waymo (Sun et al., 2019), and nuScenes (Caesar et al., 2019). We show, across all possible dataset combinations, that fine-tuning the detector with our dreaming procedure drastically and consistently reduces the source/target domain gap. In fact, the resulting detector after "dreaming" substantially exceeds the accuracy of the offline system used to generate the pseudo-labels, which, although able to look into the future, is limited to extrapolating confident detections before adaptation. Our dreaming procedure can easily be implemented on-device, and we believe that it constitutes a significant step towards safely operating autonomous vehicles without geo-restrictions.
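The adaptation stage described above can be viewed as self-training: keep confident target-domain predictions as pseudo-labels, then fine-tune on them. A hedged sketch under assumed interfaces follows; the `predict`/`fine_tune` methods and threshold are hypothetical placeholders, not an API from the paper.

```python
def self_train(detector, unlabeled_frames, rounds=1, conf_thresh=0.8):
    """Generic self-training loop on unlabeled target-domain frames.

    detector: hypothetical object with
        .predict(frame) -> list of (box, score) pairs
        .fine_tune(dataset) where dataset is [(frame, boxes), ...]
    Only predictions with score >= conf_thresh become pseudo-labels.
    """
    for _ in range(rounds):
        pseudo = []
        for frame in unlabeled_frames:
            boxes = [b for b, s in detector.predict(frame) if s >= conf_thresh]
            if boxes:  # skip frames with no confident detections
                pseudo.append((frame, boxes))
        detector.fine_tune(pseudo)
    return detector
```

The paper's offline replay step differs from plain self-training in that pseudo-labels come from tracked and extrapolated detections rather than raw per-frame outputs, but the fine-tuning loop has this overall shape.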

2. RELATED WORK

3D object detection. Prior work can be categorized based on the input sensors: 3D (time-of-flight) sensors such as Light Detection and Ranging (LiDAR), or 2D images from inexpensive commodity cameras (Wang et al., 2019b; You et al., 2019; Qian et al., 2020; Li et al., 2019; Chen et al., 2020). We focus on LiDAR-based methods due to their higher accuracy. LiDAR-based 3D object detectors view the LiDAR signal as a 3D point cloud. For example, Frustum PointNet (Qi et al., 2018) applies PointNet (Qi et al., 2017a;b), a neural network dedicated to point clouds, to the LiDAR points within each image-based frustum proposal to localize the object in 3D. PointRCNN (Shi et al., 2019) combines PointNet with Faster R-CNN (Ren et al., 2015) to generate proposals directly in 3D using LiDAR points alone. VoxelNet (Zhou & Tuzel, 2018) and PointPillars (Lang et al., 2019) encode 3D points into voxels and extract features with 3D convolutions and PointNet. For self-driving scenes, processing points from the top-down bird's-eye view (BEV) also proves sufficient for capturing object contours and locations (Ku et al., 2018; Yang et al., 2018; Liang et al., 2018). While all these algorithms have consistently improved detection accuracy, they are mainly evaluated on KITTI (Geiger et al., 2012) alone. Wang et al. (2020) recently revealed the poor generalization of 3D detectors when they are trained and tested on different datasets, especially on distant objects with sparse LiDAR points.

Unsupervised domain adaptation (UDA). UDA has been widely studied in the machine learning and computer vision communities, especially for image classification (Gopalan et al., 2011; Gong et al., 2012; Ganin et al., 2016; Tzeng et al., 2017; Saito et al., 2018). The common setup is to adapt a model trained on one labeled source domain (e.g., synthetic images) to another, unlabeled target domain (e.g., real images).
Recent work has extended UDA to driving scenes, but mainly for 2D semantic segmentation and 2D object detection (see Appendix A for a list of work). The mainstream approach is to match the feature distributions or image appearances between domains, for example via adversarial learning (Ganin et al., 2016; Hoffman et al., 2018; Tzeng et al., 2017) or image translation (Zhu et al., 2017). The approaches most similar to ours (RoyChowdhury et al., 2019; Tao et al., 2018; Liang et al., 2019; Zhang et al., 2018; Zou et al., 2018; Choi et al., 2019; Chitta et al., 2018; Yu et al., 2019; Kim et al., 2019a; French et al., 2018; Inoue et al., 2018; Rodriguez & Mikolajczyk, 2019; Khodabandeh et al., 2019; Chen et al., 2011) iteratively assign pseudo-labels to (some of) the unlabeled target-domain data and re-train the models. This procedure, usually called self-training, has proven to be effective in learning with unlabeled data,

