EXPLOITING PLAYBACKS IN UNSUPERVISED DOMAIN ADAPTATION FOR 3D OBJECT DETECTION

Anonymous authors

Abstract

Self-driving cars must detect other vehicles and pedestrians in 3D to plan safe routes and avoid collisions. State-of-the-art 3D object detectors, based on deep learning, have shown promising accuracy but are prone to over-fitting to domain idiosyncrasies, causing them to fail in new environments, a serious problem if autonomous vehicles are meant to operate freely. In this paper, we propose a novel learning approach that drastically reduces this gap by fine-tuning the detector on pseudo-labels in the target domain, which our method generates while the vehicle is parked, based on replays of previously recorded driving sequences. In these replays, objects are tracked over time, and detections are interpolated and extrapolated, crucially leveraging future information to catch hard cases. We show, on five autonomous driving datasets, that fine-tuning the detector on these pseudo-labels substantially reduces the domain gap to new driving environments, yielding drastic improvements in accuracy and detection reliability.

1. INTRODUCTION

One of the fundamental learning problems in the context of self-driving cars is the detection and localization of other traffic participants, such as cars, cyclists, and pedestrians, in 3D. Typically, the input consists of LiDAR or pseudo-LiDAR (Wang et al., 2019b) point clouds (sometimes with accompanying images), and the outputs are sets of tight 3D bounding boxes that envelop the detected objects. The problem is particularly challenging because the predictions must be highly accurate, reliable, and, importantly, made in real time. The current state of the art in 3D object detection is based on deep learning approaches (Qi et al., 2018; Shi et al., 2019; Yang et al., 2018; Shi et al., 2020), trained on short driving segments with labeled bounding boxes (Geiger et al., 2012; 2013), which yield up to 80% average precision on held-out segments (Shi et al., 2020). However, as with all machine learning, these techniques succeed when the training data distribution matches the test data distribution. One possibility to ensure train/test consistency is to constrain self-driving cars to a small geo-fenced area. Here, a fleet of self-driving taxis might together collect accurate training data with exhaustive coverage, so that the accuracy of the system is guaranteed. This, however, is fundamentally limiting. Ultimately, one would like to allow self-driving cars to be driven freely anywhere, similar to a human-driven car. This unconstrained scenario introduces an inherent adaptation problem: the car producer cannot foresee where the owner will ultimately operate the car. The perception system might be trained on urban roads in Germany (Geiger et al., 2013; 2012), but the car may be driven in the mountain regions of the USA, where other cars may be larger and fewer, the roads may be snowy, and buildings may look different. Past work has shown that such differences can cause a > 35% drop in the accuracy of extant systems (Wang et al., 2020).
Closing this adaptation gap is one of the biggest remaining challenges for freely self-driving vehicles. Car owners, however, are likely to spend most of their driving time on similar routes (commuting to work, grocery stores, etc.), and leave their cars parked (e.g., at night) for extended amounts of time. This raises an intriguing possibility: the car can collect training data on these frequent trips, then retrain itself while offline to adapt to this new environment for subsequent online driving. Unfortunately, the data the car collects are unlabeled. The challenge is thus one of unsupervised domain adaptation (Gong et al., 2012): the detection system, having been previously trained on labeled data from a source domain, must now adapt to a target domain where only unlabeled data are available. In this paper, we present a novel and effective approach for unsupervised domain adaptation of 3D detectors that addresses this challenge. Our key insight is two-fold. One, our data is not simply a bag of independent images, but a video of the same scene over time. Two, the dynamics of our objects of interest (i.e., cars) can be modeled effectively. This allows us to take confident detections of nearby objects, estimate their states (e.g., locations, sizes, and speeds), and then extrapolate them forward and backward in time to frames where they were missed by the detector. We show that these extrapolations allow us to correctly classify precisely the difficult (typically distant) objects that are easily missed in new environments. For example, an oncoming car may be detected too late, only once it is close enough. The playback allows us to go back in time and annotate its position in frames where it was previously missed. Although this process cannot be performed in real time (since it uses future information), we use it to generate a new training set with pseudo-labels for the target environment.
We then adapt the detector to the target domain through fine-tuning on this newly created data set. The few, but likely accurate, labels allow the detector to generalize to more settings in this environment (e.g., picking up what distant cars or typical background scenes look like). We call our approach dreaming, as the car learns by replaying past driving sequences backwards and forwards while it is parked. We evaluate our dreaming car algorithm on multiple autonomous driving datasets, including KITTI (Geiger et al., 2012; 2013), Argoverse (Chang et al., 2019), Lyft (Kesten et al., 2019), Waymo (Sun et al., 2019), and nuScenes (Caesar et al., 2019). We show, across all possible data set combinations, that fine-tuning the detector with our dreaming car procedure drastically reduces the source/target domain gap with high consistency. In fact, the resulting detector after "dreaming" substantially exceeds the accuracy of the offline system used to generate the pseudo-labels, which, although able to look into the future, is limited to extrapolating the confident detections of the detector before adaptation. Our dreaming procedure can easily be implemented on-device, and we believe that it constitutes a significant step towards safely operating autonomous vehicles without geo-restrictions.

2. RELATED WORK

Unsupervised domain adaptation (UDA). UDA has been widely studied in the machine learning and computer vision communities, especially for image classification (Gopalan et al., 2011; Gong et al., 2012; Ganin et al., 2016; Tzeng et al., 2017; Saito et al., 2018). The common setup is to adapt a model trained on one labeled source domain (e.g., synthetic images) to another, unlabeled target domain (e.g., real images). Recent work has extended UDA to driving scenes, but mainly for 2D semantic segmentation and 2D object detection. (See Appendix A for a list of work.) The mainstream approach is to match the feature distributions or image appearances between domains, for example via adversarial learning (Ganin et al., 2016; Hoffman et al., 2018; Tzeng et al., 2017) or image translation (Zhu et al., 2017). Another popular approach is self-training (Chen et al., 2011), which iteratively assigns pseudo-labels to (some of) the unlabeled target-domain data and re-trains the model. This procedure has proven effective for learning with unlabeled data, e.g., in semi-supervised and weakly-supervised learning (McClosky et al., 2006b; a; Kumar et al., 2020; Lee, 2013; Cinbis et al., 2016; Triguero et al., 2015). Specifically for UDA, self-training enables the model to adapt its features to the target domain in a supervised learning fashion. For UDA in 3D, the domain discrepancy is in the point clouds instead of the images. Qin et al. (2019) are the first to map and match point clouds between domains, via adversarial learning. However, the point clouds they considered correspond to single, isolated objects, as opposed to the cluttered, large scenes encountered in driving, which are considerably more challenging. Others project LiDAR points to a frontal or bird's-eye view (BEV) and apply UDA techniques to the resulting 2D images, for BEV object detection (Saleh et al., 2019; Wang et al., 2019c) or semantic segmentation (Wu et al., 2019).
However, this approach forces the downstream detector to operate on the 2D image, which can be sub-optimal. There is a need for UDA techniques that apply to large point clouds from unstructured, cluttered scenes. Our work is the first to apply self-training for UDA to realistic scenes for 3D object detection. We argue that this can be more effective than learning a mapping from the target to the source domain, to which the detector is (over-)specialized. First, a LiDAR point cloud easily contains more than 10,000 points, making full-scene transformations prohibitively slow. Second, in many practical cases, we may not have access to source domain training data (Chidlovskii et al., 2016) after the detector is deployed, making it impossible to learn a cross-domain mapping. While most self-training approaches assume access to source domain data, we make no such assumption and never use it in the adaptation process.

Leveraging videos for object detection. To ease the annotation effort for 2D object detection, several works (Liang et al., 2015; Ošep et al., 2019; Misra et al., 2015; Kumar Singh et al., 2016) proposed to mine additional bounding boxes from videos in an unsupervised or weakly supervised manner. The main idea is to leverage temporal information (e.g., tracks) to extend weakly-labeled instances or potential object proposals across frames, which are then used as pseudo-labels to re-train the detectors. Similar ideas have also been explored in the context of UDA (Tang et al., 2012). A key difference between these methods and ours is that we do not just interpolate tracks, but also extrapolate them to infer objects when they are too far away to be detected accurately. We are able to do this by operating in 3D and leveraging the dynamics of objects (via physics-based motion models). In contrast, most of the methods above may disregard faraway objects (that appear too small in the image) because their tracks are unreliable (Ošep et al., 2017).
Our approach also differs from this prior work in application domain and data modality: we focus on LiDAR-based 3D object detection, as opposed to 2D object detection. In particular, we exploit the fact that an object's physical size in 3D does not change with distance, which allows us to correct object sizes along tracks; this is not applicable in 2D, where apparent object size varies with scale.

3. EXPLOITING PLAYBACKS FOR UDA

Similar to most published work on 3D object detection for autonomous driving, we focus on frame-wise 3D detectors. A detector is first trained on a source domain and is then applied in a target domain (e.g., a new city). Wang et al. (2020) conducted a comprehensive analysis and revealed a drastic performance drop in such a scenario: many of the target objects are either misdetected or mislocalized, especially if they are far away. To aid adaptation, we assume access to an unlabeled dataset of video sequences in the target domain, which could simply be recordings of the vehicle's sensors while it was in operation. Our approach is to generate pseudo-labels for these recordings that can be used to adapt the detector to the new environment during periods in which the car is not in use. We do not assume access to the source data when performing adaptation; it is unlikely that the car producer will share its data with customers after the detector is deployed.

3.1. TRACKING FOR IMPROVED DETECTION

One approach to improve test accuracy based on the frame-wise detection outputs is on-line tracking-by-detection (Breitenstein et al., 2010; 2009; Hua et al., 2015). Here, detected objects are associated across current and past frames to derive trajectories, which are used to filter out false positives, impute false negatives, and adjust the initial detection bounding boxes in the current frame.

On-line 3D object tracking. We investigate this idea with a Kalman-filter-based tracker (Diaz-Ruiz et al., 2019; Chiu et al., 2020; Weng et al., 2020), which has shown promising results on benchmark tracking leaderboards (Caesar et al., 2019). We opt not to use a learning-based tracker (Yin et al., 2020b), as such trackers would also require adaptation before they can be applied to improve detection in the target domain. Specifically, we apply the tracker by Diaz-Ruiz et al. (2019). The algorithm estimates the joint probability p(a_k, x_k | z_k) at time k, where x_k is the set of tracked object states (e.g., car speeds and locations), z_k is the set of observed sensor measurements (here, each measurement is a frame-wise detection), and a_k is the assignment of measurements to tracks. The joint distribution can be decoupled into the continuous estimation problem p(x_k | a_k, z_k), which is solved recursively via an Extended Kalman Filter (EKF), and the discrete data assignment p(a_k | x_k, z_k), which is solved via Global Nearest Neighbor (GNN). The EKF parameterizes the state of a single (the i-th) object as a vehicle (position, velocity, and shape) relative to the ego-vehicle, x_k^i = [x, y, θ, s, l, w]^T, where x, y are the location of the tracked vehicle's back axle relative to a fixed point on the ego-vehicle, θ is the vehicle orientation relative to the ego-vehicle, s is the absolute ground speed, and l, w are the length and width. The EKF uses a dynamic model of the evolution of the state over time.
Here we assume that the tracked vehicle moves at a constant speed and heading in the global coordinate frame, with added noise to represent the uncertainty associated with vehicle maneuvers. This tracker has been shown to work well for tracking moving objects from a self-driving car (Miller et al., 2011; Diaz-Ruiz et al., 2019). More details of the tracker are given in the supplementary material. As will be seen in subsection 4.2, applying this tracker can indeed improve detection accuracy online, by imputing missing detections, correcting mislocalized detections, and rejecting wrong detections in z_k at the current time k based on x_k.

Off-line 3D object tracking. Online trackers are constrained to use only past information to improve current detections. Relaxing this constraint for off-line tracking (e.g., being able to look into the future and come back to the current time), we can obtain even more accurate estimates of vehicle states. While such an improvement is not applicable directly at test time, higher accuracy tracking on unlabeled driving sequences is highly valuable for adapting the source detector in a self-supervised fashion, as we explain in the following section.

3.2. SELF-TRAINING

The basic idea of self-training is to apply an existing model to an unlabeled data set and use the high-confidence predictions (here, detections), which are likely to be correct, as "pseudo-labels" for fine-tuning. One key to success for self-training is the quality of the pseudo-labels. In particular, we desire two qualities of the detections we use as pseudo-labels. First, they should be correct, i.e., they should not include false positives. Second, they should have high coverage, i.e., they should cover all cases of objects. Choosing high-confidence detections as pseudo-labels satisfies the first criterion but not the second. With 3D object detection, we find that most of the high-confidence examples are easy cases: unoccluded objects near the self-driving car. This is where offline tracking becomes a crucial component, to include the more challenging cases (far away or partially occluded objects) in the pseudo-label pool.
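To make the constant-velocity dynamic model of subsection 3.1 concrete, here is a minimal sketch of the EKF prediction step. For simplicity it is written in a global frame rather than relative to the ego-vehicle, and the function names and discretization are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def predict_state(x, dt):
    """Constant-speed, constant-heading prediction for one tracked vehicle.

    State layout (from subsection 3.1): [x, y, theta, s, l, w], where
    (x, y) is the back-axle position, theta the heading, s the absolute
    ground speed, and (l, w) the box length and width. Under this model
    the length and width have zero time derivative.
    """
    px, py, theta, s, l, w = x
    return np.array([
        px + s * np.cos(theta) * dt,  # move along the current heading
        py + s * np.sin(theta) * dt,
        theta,                        # heading assumed constant
        s,                            # speed assumed constant
        l, w,                         # shape does not evolve
    ])

def predict_covariance(P, F, Q):
    """A-priori covariance update: P_{k+1|k} = F P F^T + Q."""
    return F @ P @ F.T + Q
```

In the actual tracker, process noise on heading and speed (e_θ, e_s; see Appendix C) accounts for the fact that real vehicles maneuver rather than follow perfectly straight constant-speed paths.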

3.3. HIGH QUALITY PSEUDO-LABELS VIA 3D OFF-LINE TRACKING

How do we obtain pseudo-labels for far-away, hard-to-detect objects that the detector cannot reliably detect? We propose to exploit tracking, leveraging two facts of the autonomous driving scenario. First, the available unlabeled data come in the form of sequences (akin to videos) of point clouds over time. Second, the objects of interest and the self-driving car move in fairly constrained ways. We run the object detector on logged data, so that we can easily analyze both forwards and backwards in time. The object detector detects objects accurately only when they are close to the self-driving car. Once an object is detected over a few frames, we can estimate its motion either towards the car or away from it, and then both interpolate the object's positions in frames where it was missed, and extrapolate the object into frames where it is too far away for accurate detection. We show an example of this procedure in Figure 1. Through dynamic modeling, tracking, and smoothing over time, we can correct noisy detections; and with extrapolation and interpolation, we can recover far-away missed detections. Concretely, we propose to augment the on-line tracker introduced in subsection 3.1 with the following functionalities, which utilize future information, turning it into an off-line tracker specifically designed to improve detection, i.e., to generate higher-quality pseudo-labels for self-training.

State smoothing. Frame-wise 3D object detectors can generate inconsistent, noisy detections across time (e.g., frames 4, 7, and 9 in Figure 1). The model-based tracking approach in subsection 3.1 reduces this noise, but we can go further by smoothing tracks back and forth over time, since our data is off-line. In this work, we use a fixed-point Rauch-Tung-Striebel (RTS) smoother (Bar-Shalom et al., 2001) to smooth the tracked state estimates. See the supplementary material for details.

Adjusting object size. As shown in Wang et al.
(2020), the distribution of car sizes in different domains (e.g., different cities) can differ. As such, when tested on a novel domain, detectors often predict incorrect object sizes. This is especially true when the LiDAR signal is too sparse to carry correct size information. We can use our tracking to correct such systematic errors as well. Assuming that the most confident detections are more likely to be accurate, we estimate the size of the object by averaging the sizes of the three highest-scoring detections, and use this size for all objects in the track.

Interpolation and extrapolation. We use estimation (forward in time) and smoothing (backward in time) to recover missed detections, and in turn to increase the recall rate of pseudo-labels (e.g., frames 1-3 and 6 in Figure 1). If a detection is missed in the middle of a track, we restore it by taking the estimated state from smoothing. We also extrapolate the tracks both backward and forward in time, so that tracks that were prematurely terminated due to missing detections can be recovered. Most commonly, we are able to recover detections of vehicles that were lost as they moved away from the ego-vehicle because the measurement signals became sparser (or, in turn, which started far away and were only detected once they got close enough). Extrapolations are performed by first using the dynamic model predictions of the EKF to predict potential bounding boxes; measurements are then obtained by performing a search and detection in the vicinity of the prediction. We apply the detector in a 3 m² area around the extrapolated prediction, yielding several 3D bounding box candidates. After filtering out candidates with confidence lower than some threshold, we select the candidate with the highest BEV IoU with the prediction as a measurement. If the track loses such a measurement for three consecutive frames, extrapolation is stopped. With this targeted search, we are able to recover objects that were missed due to low confidence. After extrapolating and interpolating detections for all tracks, we perform Non-Maximum Suppression (NMS) over bounding boxes in BEV, where more recent extrapolations/interpolations are prioritized.

Discussion. The tracker we apply is standard and simple. We chose it to show the power of our dreaming approach for UDA: exploiting off-line, forward and backward information to derive high-quality pseudo-labels for adapting detectors. More sophisticated trackers will likely improve the results further. While we focus on frame-wise 3D detectors, our algorithm can be applied to adapt video-based 3D object detectors (Yin et al., 2020a) as well. One particular advantage of fine-tuning on the pseudo-labeled target data is that it allows the detector to adapt not only its predictions (e.g., the box regression) but also its features (e.g., the early layers of the neural network) to the target domain. The resulting detector can therefore yield more accurate detections than the pseudo-labels it was trained on.
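The size-adjustment and gap-filling steps above can be sketched as follows. This is a simplified illustration: the method imputes missing frames with RTS-smoothed EKF states, whereas here plain linear interpolation between neighboring frames stands in for the smoother, and the per-detection dictionary layout is an assumption.

```python
import numpy as np

def correct_sizes(track):
    """Replace per-frame size estimates with a single size per track,
    averaged over the three most confident detections (subsection 3.3)."""
    order = np.argsort([d["score"] for d in track])[::-1]
    top = [track[i] for i in order[:3]]
    l = float(np.mean([d["l"] for d in top]))
    w = float(np.mean([d["w"] for d in top]))
    for d in track:
        d["l"], d["w"] = l, w
    return track

def fill_gaps(track):
    """Impute detections missed in the middle of a track. Linear
    interpolation of position is a stand-in for the smoothed EKF state."""
    frames = {d["frame"]: d for d in track}
    lo, hi = min(frames), max(frames)
    for k in range(lo, hi + 1):
        if k in frames:
            continue
        prev = max(f for f in frames if f < k)
        nxt = min(f for f in frames if f > k)
        a = (k - prev) / (nxt - prev)
        p, n = frames[prev], frames[nxt]
        frames[k] = {
            "frame": k, "score": min(p["score"], n["score"]),
            "x": (1 - a) * p["x"] + a * n["x"],
            "y": (1 - a) * p["y"] + a * n["y"],
            "l": p["l"], "w": p["w"],
        }
    return [frames[k] for k in sorted(frames)]
```

Extrapolation beyond the ends of a track would additionally require the motion model and the targeted low-confidence re-detection described above, which is omitted here for brevity.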

4. EXPERIMENTS

Datasets. We experiment with five autonomous driving datasets: KITTI (Geiger et al., 2012; 2013), Argoverse (Chang et al., 2019), Lyft (Kesten et al., 2019), Waymo (Sun et al., 2019), and nuScenes (Caesar et al., 2019). We summarize them in Table 1.

UDA settings. We train models on the source domain using labeled frames. We use the train set (and its pseudo-labels) of the target domain to adapt the source detector, and evaluate the adapted detector on the test set. See the supplementary material for more details.

Metric. We follow KITTI to evaluate object detection with the 3D and BEV metrics. As it has been the main focus of existing work on 3D object detection, we focus on the Car category. We report average precision (AP) with intersection-over-union (IoU) thresholds at 0.5 or 0.7; i.e., a car is correctly detected if the IoU between it and the predicted box is larger than 0.5 or 0.7. We denote AP in 3D and BEV by AP_3D and AP_BEV, respectively. Because the other datasets have no official separation into difficulty levels like KITTI, we split AP by depth range.

3D object detection models. We use two LiDAR-based models, POINTRCNN (Shi et al., 2019) and PIXOR (Yang et al., 2018). We train PIXOR using RMSProp with momentum 0.9 and learning rate 5 × 10^-5 (decreased by a factor of 10 after 50 and 80 epochs) for 90 epochs. For self-training on the target domain, we initialize from the model pre-trained on the source domain. For POINTRCNN, we fine-tune the model with learning rate 2 × 10^-4, for 40 epochs in the RPN and 10 epochs in the RCNN. For PIXOR, we use RMSProp with momentum 0.9 and learning rate 5 × 10^-6 (decreased by a factor of 10 after 10 and 20 epochs) for 30 epochs. We developed and tuned our dreaming method with Argoverse as the source domain and KITTI as the target domain (in the target domain, we only use the training set). We then fixed all hyper-parameters for all subsequent experiments. We list other relevant hyper-parameters in the supplementary material.

4.1. BASELINES

We compare against two baselines under the UDA setting.

Self-Training (ST).

We apply a self-training scheme similar to that typically used in 2D problems (Kumar et al., 2020). When adapting the model from the source to the target, we apply the source model to the target training set. We then keep the detected cars with confidence scores > 0.8 as pseudo-labels and fine-tune the detector on them.

Statistical Normalization (SN). We also compare against SN (Wang et al., 2020), which corrects the sizes of the source-domain training boxes using the average object size of the target domain before training the detector.
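The pseudo-label selection of the ST baseline amounts to a simple confidence filter; the following sketch illustrates it, with the per-frame data layout and function name as illustrative assumptions.

```python
def make_st_pseudo_labels(detections_per_frame, threshold=0.8):
    """Plain self-training (ST): keep only high-confidence detections
    as pseudo-labels; no tracking, smoothing, or extrapolation."""
    return {
        frame_id: [d for d in dets if d["score"] > threshold]
        for frame_id, dets in detections_per_frame.items()
    }
```

This satisfies the correctness criterion of subsection 3.2 but not the coverage criterion, which is what the offline tracking of subsection 3.3 adds.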

4.2. EMPIRICAL RESULTS

Adaptation from Argoverse to KITTI. We compare the UDA methods under the Argoverse → KITTI setting in Table 2 and observe several trends: 1) models experience a smaller domain gap when objects are closer (0-30 m vs. 30-80 m); 2) though directly applying on-line tracking can improve detection performance, models improve more after just self-training; 3) off-line tracking provides extra pseudo-labels for fine-tuning, and, interestingly, models fine-tuned on pseudo-labels can outperform the pseudo-labels themselves; 4) DREAMING improves over ST and SN by a large margin, especially at IoU 0.5; 5) DREAMING introduces a large gain in AP for faraway objects, e.g., on the 50-80 m range, compared with ST, it boosts AP_BEV at IoU 0.5 from 28.2 to 30.1.

Adaptation among five datasets. We further applied our method to adaptation tasks among the five datasets. Due to limited space, we show the results of AP_BEV and AP_3D on the 50-80 m range at IoU = 0.5 in Table 3 (see the supplementary material for results at IoU = 0.7, on other ranges, and with SN). Our method consistently improves adaptation performance at faraway ranges, while having mostly equal or better performance at close-by ranges.

Adaptation between different locations inside the same dataset. Different datasets not only come from different locations but also use different sensor configurations. To isolate the effects of the former (which is our motivating application), in Table 4 we evaluate our method's performance for domain adaptation within the KITTI dataset, from city/campus scenes to residential/country scenes (details in the supplementary material). Our method consistently outperforms no fine-tuning and ST, especially on the 30-50 m and 50-80 m ranges.

Ablation study. We show ablation results in Table 5. Here we fine-tune models using ST while adding smoothing (S), resizing (R), interpolation (I), and extrapolation (E) to the pseudo-label generation.
It can be observed that ST alone already boosts performance considerably. By selecting high-confidence detections, smoothing, and adjusting object sizes, we ensure that the provided pseudo-labels are mostly correct. But these steps alone do not address the second criterion for desirable pseudo-labels: high coverage. We observe noticeable boosts when interpolation and extrapolation are added, especially for far-away objects. This is because extrapolation and interpolation recover pseudo-labels for low-confidence or missed detections of distant vehicles.

Others. We show more results, including adaptation results with the PIXOR detector, qualitative visualizations, and analysis of pseudo-labels, in the supplementary material.

5. CONCLUSION AND DISCUSSION

In this paper, we have introduced a novel method towards closing the gap between source and target domains in unsupervised domain adaptation for LiDAR-based 3D object detection. Our approach is based on self-training, while leveraging vehicle dynamics and offline analysis to generate pseudo-labels. Importantly, we can generate high-quality pseudo-labels even for difficult cases (i.e., far-away objects), which the detector tends to miss before adaptation. Fine-tuning on these pseudo-labels drastically improves detection performance in the target domain. It is hard to conceive of an autonomous vehicle manufacturer that could collect, label, and update data for every consumer environment, as would be required to allow self-driving cars to operate everywhere freely and safely. By significantly reducing the adaptation gap between domains, our approach nevertheless takes a significant step towards making this vision a reality.

C HYPER-PARAMETERS

In performing the forward pass of the Extended Kalman Filter (EKF) to calculate the a-posteriori state and state error covariance estimates (i.e., x_{k|k}, P_{k|k}) and the a-priori state and state error covariance (i.e., x_{k+1|k}, P_{k+1|k}) in subsection 3.3, a process noise covariance matrix (Q), a measurement noise covariance matrix (R), an initial state estimate (x_0), and an initial state error covariance matrix (P_0) are necessary. The measurement noise matrix is established through measurement variance, obtained by observing the variance of position (x, y), orientation (θ), length (l), and width (w) errors when testing the detector in the Argoverse → KITTI setting. A larger measurement noise matrix is used in the EKF for extrapolation, as larger variance was observed for far-away detections. The process noise matrix should represent the magnitudes of dynamical noise that the system might experience. In the model of Diaz-Ruiz et al. (2019), which assumes constant velocity and heading, there are seven noise parameters, e_θ, e_s, e_vx, e_vy, e_ωz, e_l, e_w, which correspond to the diagonal of the Q matrix. The variables e_θ, e_s, e_l, e_w are modeled as zero-mean, mutually uncorrelated, Gaussian, white process noise. Intuitively, e_θ and e_s represent the uncertainty associated with the orientation and speed of vehicles, especially given that the model assumes constant-speed straight-line motion. The noise associated with the orientation and speed of the target vehicles is the largest and of most importance. The time derivatives of an object's length and width are zero; e_l and e_w are small tuning parameters which control the response of the filter. The variables e_vx, e_vy, e_ωz are noises associated with the pose of the ego-vehicle; small values were chosen given that high accuracy in the ego-vehicle pose is expected across the datasets. Finally, the EKF requires an initialization of the state and state error covariance estimates.
The initial detection was used for state initialization, and relatively large, conservative uncertainties were used for the state error covariance.

• Extended Kalman Filter (EKF) and Global Nearest Neighbor (GNN) parameters for tracking and data association (cf. subsection 3.3):
1) Measurement noise covariance matrix: R = diag(0.1 m², 0.1 m², 0.015 rad², 0.07 m², 0.04 m²)
2) Process noise covariance matrix: Q = diag(0.
3) State error covariance matrix initialization: P_0 = diag(2 m², 2 m², 0.1 rad², 5 (m/s)², 0.5 m², 0.32 m²)
4) The initial state estimate x_0 is set to the first detection values.
5) The data association threshold (in BEV IoU) in GNN is set to 0.3.
6) The fraction of the distance between the vehicle center and the back axle is l/4, as in Diaz-Ruiz et al. (2019).

• EKF parameters for extrapolation:
1) Measurement noise covariance matrix: R = diag(0.5 m², 0.5 m², 0.06 rad², 0.07 m², 0.04 m²)

• Track management: we set c_min-hits = 3 and c_max-age = 3. When doing extrapolation, we use -25 and -3 as confidence thresholds for the POINTRCNN and PIXOR models, respectively.
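For concreteness, the fully specified diagonal covariance matrices above can be assembled as follows (units omitted; the process noise matrix Q is left out because its values are not fully listed):

```python
import numpy as np

# Measurement noise for tracking: variances of (x, y, theta, l, w)
R_track = np.diag([0.1, 0.1, 0.015, 0.07, 0.04])

# Larger measurement noise for extrapolation, reflecting the higher
# variance observed for far-away detections
R_extrap = np.diag([0.5, 0.5, 0.06, 0.07, 0.04])

# Initial state error covariance for the state (x, y, theta, s, l, w)
P0 = np.diag([2.0, 2.0, 0.1, 5.0, 0.5, 0.32])
```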

D DETAILS ON DATASETS

As summarized in Table 1 of the main paper, we split each dataset into two parts, a training set and a test set. When a dataset is used as the source, we train the detector using its labeled (ground truth) training frames. When a dataset is used as the target, we adapt the detector using its training sequences without revealing the ground truth labels. We evaluate the adapted models on the test set. We provide detailed properties of the five autonomous driving datasets and the way we split the data as follows. KITTI. The raw sequences of the KITTI object detection benchmark (Geiger et al., 2013; 2012) that we use for adaptation have no overlap with the sequences from which validation scenes are sampled. These training sequences are collected at 10 Hz, resulting in 13,596 frames. We extract the sequences from the raw KITTI data as our adaptation data. We use these data splits in all experiments related to KITTI, except for adaptation between different locations inside the same dataset (cf. Table 4).

E FURTHER ANALYSIS

Analysis on Pseudo-Labels. Figure 2 shows the (accumulated) false positive (FP) rates above a certain detector confidence threshold (setting: Argoverse → KITTI). We view a detection as a false positive if no ground-truth label has a BEV IoU > 0.7 with it. We see that FP rates smoothly decrease with increasing detector confidence, suggesting that higher-confidence detections are more reliable.

Multi-round fine-tuning. In theory, the process of pseudo-label generation followed by fine-tuning the model can be iterated for multiple rounds, yet we see a very small gain after the first round (which is what we report in all tables). We believe a curated training procedure with carefully chosen hyper-parameters is needed to make it work.

F ADAPTATION RESULTS USING PIXOR

We further apply our approach to the PIXOR model from Argoverse to KITTI in Table 6. Dreaming improves the accuracy at farther ranges (30-80 m) while maintaining the accuracy at close range (0-30 m). Interestingly, at IoU 0.5 in the 30-80 m range, we are able to surpass the in-domain performance, which uses models trained only in the target domain with the ground-truth labels. This result showcases the power of unsupervised domain adaptation (UDA): with a suitably designed algorithm, UDA that leverages both the source and target domains can outperform models trained in a single domain alone.
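The false-positive criterion used in the pseudo-label analysis of Appendix E can be sketched as follows. Note that the analysis matches rotated BEV boxes, whereas this illustration uses axis-aligned boxes for simplicity, and the data layouts are assumptions.

```python
def bev_iou(a, b):
    """Axis-aligned BEV IoU between boxes given as (x1, y1, x2, y2);
    a simplified stand-in for rotated-box IoU."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def fp_rate_above(dets, gts, threshold, iou_thresh=0.7):
    """Fraction of detections with score >= threshold that match no
    ground-truth box with IoU > iou_thresh (cf. the Figure 2 analysis)."""
    kept = [d for d in dets if d["score"] >= threshold]
    if not kept:
        return 0.0
    fp = sum(all(bev_iou(d["box"], g) <= iou_thresh for g in gts)
             for d in kept)
    return fp / len(kept)
```

Sweeping `threshold` over the detector's score range produces the accumulated FP-rate curve described above.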

G QUALITATIVE RESULTS

In Figure 3, Figure 4, and Figure 5, we qualitatively compare the detection results from models trained with different adaptation strategies. We select (Argoverse, KITTI), (nuScenes, KITTI), and (nuScenes, Argoverse) as (source, target) example pairs for the qualitative visualization. Models without re-training tend to miss faraway objects. Models with self-training detect some of these objects, and models with dreaming detect even more of them. Both self-training and dreaming also produce somewhat more false positive detections.

H ADAPTATION AMONG FIVE DATASETS

In Table 7, we present the adaptation evaluation results on the 0-80 m range at IoU = 0.5, analogous to Table 3. In Table 8, we present UDA results on all possible (source, target) pairs among the five autonomous driving datasets (20 pairs in total). For each pair, we show results with and without statistical normalization (Wang et al., 2020). As in Table 2, we report AP BEV / AP 3D of the car category at IoU = 0.7 and IoU = 0.5 across different depth ranges, using the POINTRCNN detector. Under these 20 UDA scenarios, our method consistently improves the adaptation performance at faraway ranges, while achieving mostly equal or better performance at close-by ranges.



3D object detection. Prior work can be categorized by input sensor: 3D time-of-flight sensors like Light Detection and Ranging (LiDAR), or 2D images from inexpensive commodity cameras (Wang et al., 2019b; You et al., 2019; Qian et al., 2020; Li et al., 2019; Chen et al., 2020). We focus on LiDAR-based methods due to their higher accuracy. LiDAR-based 3D object detectors view the LiDAR signal as a 3D point cloud. For example, Frustum PointNet (Qi et al., 2018) applies PointNet (Qi et al., 2017a;b), a neural network dedicated to point clouds, to the LiDAR points within each image-based frustum proposal to localize the object in 3D. PointRCNN (Shi et al., 2019) combines PointNet with Faster R-CNN (Ren et al., 2015) to directly generate proposals in 3D using LiDAR points alone. VoxelNet

(2012); RoyChowdhury et al. (2019); Ošep et al. (2017) also incorporate object tracks to discover high quality pseudo-labels for self-training.

SELF-TRAINING FOR UDA

Self-training is a simple yet fairly effective way to improve a model with unlabeled data (McClosky et al., 2006b;a; Kumar et al., 2020; Lee, 2013; Chen et al., 2011).
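The self-training recipe can be summarized in a few lines. This is a generic sketch of the confidence-thresholded variant with hypothetical names (`predict`, `fine_tune`, `_StubDetector`), not the exact training code:

```python
def self_train(detector, unlabeled_frames, conf_threshold=0.9, rounds=1):
    """Generic self-training loop: run the current detector on unlabeled
    frames, keep high-confidence detections as pseudo-labels, then
    fine-tune the detector on them."""
    for _ in range(rounds):
        pseudo = []
        for frame in unlabeled_frames:
            dets = detector.predict(frame)          # [(box, confidence), ...]
            kept = [box for box, conf in dets if conf >= conf_threshold]
            if kept:
                pseudo.append((frame, kept))        # frame with its pseudo-labels
        detector.fine_tune(pseudo)                  # supervised update on pseudo-labels
    return detector

class _StubDetector:
    """Toy stand-in for a 3D detector: a 'frame' here already carries
    (box, confidence) pairs, so predict() simply returns them."""
    def __init__(self):
        self.trained_on = []
    def predict(self, frame):
        return frame
    def fine_tune(self, pseudo):
        self.trained_on.append(pseudo)

frames = [[("boxA", 0.95), ("boxB", 0.4)], [("boxC", 0.2)]]
det = self_train(_StubDetector(), frames, conf_threshold=0.9)
# only the high-confidence detection "boxA" survives as a pseudo-label
```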

Figure 1: Example track point cloud of a vehicle moving towards the ego vehicle, where pseudo-labels are recovered through extrapolation, interpolation, and smoothing. The gained pseudo-labels are instances where the estimated bounding box (blue) is recovered while the original detection (orange) was missing or poorly aligned with the ground truth (green). Better viewed in color. Zoom in for details.
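The interpolation step named in the figure can be sketched as follows. This is a minimal stand-in, not the paper's implementation: boxes are reduced to their centers, and `interpolate_track` (a hypothetical helper) linearly interpolates the centers over the frames where a tracked object was missed:

```python
import numpy as np

def interpolate_track(frames, centers):
    """Fill in missed detections along a track by linearly interpolating
    box centers between observed frames.
    frames:  sorted frame indices at which the object was detected
    centers: (N, 3) array of box centers at those frames
    Returns every frame index in [frames[0], frames[-1]] with a center."""
    frames = np.asarray(frames)
    centers = np.asarray(centers, dtype=float)
    all_frames = np.arange(frames[0], frames[-1] + 1)
    filled = np.stack(
        [np.interp(all_frames, frames, centers[:, d])
         for d in range(centers.shape[1])],
        axis=1,
    )
    return all_frames, filled

# A car detected at frames 0 and 4 but missed in between:
f, c = interpolate_track([0, 4], [[0.0, 0.0, 0.0], [4.0, 2.0, 0.0]])
# c[2] is the interpolated center at frame 2: [2.0, 1.0, 0.0]
```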

Figure 2: False positive rates above detection confidence for a model trained on Argoverse and tested on KITTI.

Figure 3: Qualitative Results. We compare the detection results on several scenes from the KITTI validation set by POINTRCNN detectors that are trained on 1) the Argoverse dataset (No Re-training), 2) the Argoverse dataset and fine-tuned using self-training on KITTI (Self-Training), and 3) the Argoverse dataset and fine-tuned using dreaming on KITTI (Dreaming). We visualize them from both frontal-view images and bird's-eye-view point maps. Ground-truth boxes are in green and detected bounding boxes are in blue. The ego vehicle is on the left side of the BEV map, looking to the right. Each floor square is 10 m × 10 m. Best viewed in color. Zoom in for details.

Figure 4: Qualitative Results. The setups are the same as those in Figure 3, but the models are pre-trained in nuScenes dataset and tested on KITTI dataset.

Figure 5: Qualitative Results. The setups are the same as those in Figure 3, but the models are pre-trained in nuScenes dataset and tested on Argoverse dataset.

We summarize the properties of the five datasets (or subsets of them). *For nuScenes, we use the 10 Hz data to generate pseudo-labels, but subsample them at 2 Hz (12,562 frames) afterwards.

Unsupervised domain adaptation from Argoverse to KITTI. We report AP BEV / AP 3D of the car category at IoU = 0.7 and IoU = 0.5 across different depth ranges, using the POINTRCNN model. ST stands for Self-Training (Kumar et al., 2020); SN stands for Statistical Normalization (Wang et al., 2020). Our method, Dreaming, is marked in blue. We show the performance of the in-domain model, i.e., the model trained and evaluated on KITTI, in the first row in gray. We also show results obtained by directly applying online and offline (not feasible in real time) tracking. Best viewed in color.

Dreaming results on unsupervised adaptation among five auto-driving datasets. Here we report AP BEV and AP 3D of the Car category on the 50-80 m range at IoU = 0.5. In each entry (row, column), we report the AP of UDA from row to column, in the order no fine-tuning / ST / Dreaming. In the diagonal entries, we report the AP of the in-domain model. Our method is marked in blue.

Wang et al. (2020) showed that car sizes vary between domains: the popular car models in different areas differ. When the mean bounding box size in the target domain is accessible, either from a limited amount of labeled data or from statistical data, we can apply statistical normalization (SN) (Wang et al., 2020) to mitigate such a systematic difference in car sizes. SN adjusts the bounding box sizes and the corresponding point clouds in the source domain to match those in the target domain, and fine-tunes the model on such "normalized" source data, with no need to access target sensor data. We follow the exact setting in Wang et al. (2020) to apply SN.
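The resizing step of SN can be sketched roughly as follows. This is a simplified illustration under assumptions not stated in the text: boxes are axis-aligned in the object frame, and the points inside a box are rescaled proportionally with the box (the original method's exact point adjustment near the box surface may differ); `normalize_box` is a hypothetical name:

```python
import numpy as np

def normalize_box(points, box_size, target_mean, source_mean):
    """SN-style resize of one source box: grow/shrink the (l, w, h) box
    by the per-dimension gap between target and source mean sizes, and
    move the points inside it accordingly.
    points: (N, 3) object-frame coordinates centered on the box."""
    box_size = np.asarray(box_size, dtype=float)
    delta = np.asarray(target_mean, dtype=float) - np.asarray(source_mean, dtype=float)
    new_size = box_size + delta
    # Simplification: scale points so the surface geometry follows the box.
    scale = new_size / box_size
    return np.asarray(points, dtype=float) * scale, new_size

# Target-domain cars are on average 1 m longer than source-domain cars:
pts = [[1.0, 0.5, 0.75]]
new_pts, new_size = normalize_box(pts, [4.0, 2.0, 1.5],
                                  target_mean=[5.0, 2.0, 1.5],
                                  source_mean=[4.0, 2.0, 1.5])
# new_size == [5.0, 2.0, 1.5]; the point stretches along length to x = 1.25
```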

UDA from KITTI (city, campus) to KITTI (road, residential). Naming is as in Table 2.

Ablation study of UDA from Argoverse to KITTI. We report AP BEV / AP 3D of the car category at IoU = 0.5 and IoU = 0.7 across different depth ranges, using the POINTRCNN model. Naming is as in Table 2. S stands for smoothing, R for resizing, I for interpolation, and E for extrapolation. The last row is our full approach.

Lyft. The Lyft Level 5 dataset contains 18,634 scenes collected around Palo Alto, USA, in clear weather and during daytime. For each scene, Lyft provides ground-truth bounding box labels and point clouds captured by a 40- (or 64-)beam roof LiDAR and two 40-beam bumper LiDAR sensors. We follow Wang et al. (2020) and separate the dataset by sequences, resulting in 12,599 frames for training (100 sequences), 3,024 frames for validation (24 sequences), and 3,011 frames for testing (24 sequences). The sequences are recorded at 5 Hz, and we use the training sequences without labels as adaptation data.

nuScenes. The nuScenes dataset (Caesar et al., 2019) contains scenes collected around Boston, USA, and Singapore, in multiple weather conditions and at different times of the day. For each scene, nuScenes provides a point cloud captured by a 32-beam roof LiDAR. We use the data collected in Boston, and sample 312 sequences for training and 78 sequences for validation. We use the 10 Hz sensor data without labels as our adaptation data (61,121 frames in total), and evaluate the model on the 2 Hz labeled data in the validation set (3,133 frames). Note that after generating pseudo-labels, we sub-sample the adaptation data at 2 Hz into 12,562 frames.

Unsupervised domain adaptation from Argoverse to KITTI using PIXOR. We report AP BEV of the car category at IoU = 0.5 and IoU = 0.7. Naming is as in Table 2.

Dreaming results on unsupervised adaptation among five auto-driving datasets. Here we report AP BEV and AP 3D of the Car category on the 0-80 m range at IoU = 0.5. In each entry (row, column), we report the AP of UDA from row to column, in the order no re-training / ST / Dreaming. In the diagonal entries, we report the AP of the in-domain model. Our method is marked in blue.

Unsupervised domain adaptation among five autonomous driving datasets. Naming is as that in Table 2 of the main paper. [Table body: AP BEV / AP 3D entries for the 20 (source, target) pairs, with rows no re-training, ST, Dreaming, SN only, SN + ST, and SN + Dreaming per pair.]

SUPPLEMENTARY MATERIAL

We provide details omitted in the main text.

• Appendix A: additional related work (cf. section 2 of the main paper).
• Appendix B: additional details on 3D object tracking (cf. subsection 3.3 of the main paper).
• Appendix C: additional details on hyper-parameters (cf. section 4 of the main paper).
• Appendix D: additional details on the datasets (cf. section 4 and subsection 4.2 of the main paper).
• Appendix E: additional analysis (cf. subsection 4.2 of the main paper).
• Appendix F: adaptation results using the PIXOR detector.
• Appendix G: qualitative comparison of baselines and our method (cf. subsection 4.2 of the main paper).
• Appendix H: additional results on unsupervised domain adaptation among five autonomous driving datasets (cf. subsection 4.2 of the main paper).

A RELATED WORK

B 3D OBJECT TRACKING

Baseline on-line tracking. Given a set of detections (here, the 3D bounding box detections) and the predicted states of all tracks $\hat{x}_{k+1|k}$, we use a global nearest neighbor (GNN) scheme to estimate the detection-to-track association $p(a_{k+1} \mid \hat{x}_{k+1}, z_{k+1})$ at each time step. This is formulated as a linear assignment problem, where the cost to be minimized is the negative bird's-eye-view (BEV) IoU between the predicted box and the measurement. We initialize a track once $c_\text{min-hits}$ measurements have been associated to it, and terminate a track when it receives no measurement updates for $c_\text{max-age}$ frames or the object exits the field of view (FOV).

Given the measurement assignment, the EKF updates the state distribution $p(x_{k+1} \mid a_{k+1}, z_{k+1})$ with the measurement $z_{k+1}$. The measurement likelihood is computed via an uncertainty model derived in Diaz-Ruiz et al. (2019). With the dynamics model, associations, and measurement models, the EKF predicts and updates the state distributions in the form of a state estimate and an error covariance, $\hat{x}_{k|k}, P_{k|k}$.

State smoothing. Smoothing requires a backward iteration ($k = N, N-1, \dots, 1$) that is performed after the forward filtering, in which the a-posteriori state and state error covariance estimates $\hat{x}_{k|k}, P_{k|k}$ and the a-priori estimates $\hat{x}_{k+1|k}, P_{k+1|k}$ have already been calculated. The smoother gain $C_k$ is obtained from

$$C_k = P_{k|k} F_k^\top P_{k+1|k}^{-1},$$

where $F_k$ is the Jacobian of the dynamics model evaluated at $\hat{x}_{k|k}$. The smoothed state is then evaluated as

$$\hat{x}_{k|N} = \hat{x}_{k|k} + C_k \left( \hat{x}_{k+1|N} - \hat{x}_{k+1|k} \right),$$

while the covariance of the smoothed state is evaluated as

$$P_{k|N} = P_{k|k} + C_k \left( P_{k+1|N} - P_{k+1|k} \right) C_k^\top.$$
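In the linear case, the filter-plus-smoother pipeline above can be sketched numerically. The following is a minimal 1D constant-velocity example with hypothetical parameter values; with a linear model, the Jacobian $F_k$ reduces to the constant transition matrix $F$:

```python
import numpy as np

def rts_smooth(zs, dt=0.1, q=1.0, r=0.5):
    """Forward Kalman filter plus backward RTS smoother for a 1D
    constant-velocity model. State is (position, velocity)."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # dynamics (transition) matrix
    H = np.array([[1.0, 0.0]])              # we only measure position
    Q, R = q * np.eye(2), np.array([[r]])
    x, P = np.zeros(2), 10.0 * np.eye(2)
    xs_f, Ps_f, xs_p, Ps_p = [], [], [], []
    for z in zs:                            # forward filtering pass
        xp, Pp = F @ x, F @ P @ F.T + Q                 # a-priori x_{k+1|k}
        K = Pp @ H.T @ np.linalg.inv(H @ Pp @ H.T + R)  # Kalman gain
        x = xp + K @ (np.atleast_1d(z) - H @ xp)        # a-posteriori x_{k|k}
        P = (np.eye(2) - K @ H) @ Pp
        xs_p.append(xp); Ps_p.append(Pp); xs_f.append(x); Ps_f.append(P)
    xs_s, Ps_s = [xs_f[-1]], [Ps_f[-1]]
    for k in range(len(zs) - 2, -1, -1):    # backward smoothing pass
        C = Ps_f[k] @ F.T @ np.linalg.inv(Ps_p[k + 1])  # smoother gain C_k
        xs_s.insert(0, xs_f[k] + C @ (xs_s[0] - xs_p[k + 1]))
        Ps_s.insert(0, Ps_f[k] + C @ (Ps_s[0] - Ps_p[k + 1]) @ C.T)
    return np.array(xs_s), Ps_s

# Noiseless positions of an object approaching at constant velocity:
zs = [0.0, 0.1, 0.2, 0.3]
xs_s, Ps_s = rts_smooth(zs)
# smoothed positions xs_s[:, 0] closely follow the measurements
```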

