OOD-ODBENCH: AN OBJECT DETECTION BENCHMARK FOR OOD GENERALIZATION ALGORITHMS

Abstract

The consensus in machine learning tasks such as object detection is still that test data are drawn from the same distribution as the training data, an assumption known as IID (Independent and Identically Distributed). In real practice, however, encountering OOD (Out-of-Distribution) scenarios is unavoidable, and it is risky to deploy an object detection algorithm without knowing its OOD generalization performance. On the other hand, a plethora of OOD generalization algorithms have been proposed to close the gap between the in-house and open-world performance of machine learning systems. However, their effectiveness has only been demonstrated on image classification tasks, and how these algorithms perform on more complex tasks remains an open question. In this paper, we first specify the setting of OOD generalization object detection (OOD-OD). We then propose OOD-ODBench, which consists of four OOD-OD benchmark datasets for evaluating various object detection and OOD generalization algorithms. Extensive experiments on OOD-ODBench show that existing OOD generalization algorithms fail dramatically when applied to the more complex object detection task. This raises questions about the current progress of a large number of these algorithms and whether they can be effective in practice beyond simple toy examples. We hope that OOD-ODBench can serve as a foothold for future OOD generalization object detection research.



a pre-specified target domain. However, it is hard to ensure consistent performance when dealing with unseen and effectively infinite real-world domains. In this paper, we focus on OOD generalization object detection (OOD-OD), which aims at training detectors that generalize to test data drawn from an unseen distribution distinct from the training distribution. See Table 1 for more details; we provide the theoretical definition of OOD-OD in Appendix A.1. In this work, we propose OOD-ODBench, in which four OOD-OD benchmarks are constructed from existing datasets, including BDD100K (Yu et al., 2018), Cityscapes (Cordts et al., 2016) and Sim10K (Johnson-Roberson et al., 2016). As revealed by (Ye et al., 2021), data distribution shifts on classification datasets are dominated by correlation shift and diversity shift. We test whether a similar phenomenon also exists on detection datasets, and we construct a synthetic dataset named CtrlShift to quantitatively analyze generalization ability under each of the two kinds of distribution shift. With the above benchmark datasets, extensive experiments are conducted with detectors ranging from one- and two-stage to transformer-based, together with diverse OOD generalization algorithms carefully implemented on popular detectors (Ren et al., 2015; Lin et al., 2017b). Section 2 reviews the related work dispersed across different research areas. Section 3 clarifies techniques and tasks with similar names. Section 4 introduces the implementation details of OOD-ODBench, including the datasets, algorithms, and model selection methods. Finally, Section 5 discusses the experimental results on OOD-ODBench and offers recommendations for future work. Our main contributions can be summarized as follows:

1. We propose OOD-ODBench, the first OOD generalization benchmark for object detection algorithms. Based on the extensive experimental results, we arrive at a surprising conclusion: the enormous achievements in IID object detection are marginal on OOD generalization object detection, and the OOD generalization improvements observed on classification are hard to transfer to more complex tasks (i.e., object detection).

2. In OOD-ODBench, we propose a Sim2real benchmark for OOD generalization object detection, which measures how well models trained with low-cost simulated data generalize to real scenarios.

3. To further analyze generalization ability under the different types of shift, we construct a synthetic dataset with designed shifts, namely CtrlShift. The synthetic dataset can systematically measure OOD generalization algorithms' performance under different types of distribution shift.

4. Based on the benchmark results and analysis, we recommend that future research clearly investigate diversity shift and correlation shift in OOD generalization object detection before designing algorithms, and then evaluate them comprehensively on the two-dimensional shift using CtrlShift.

2. RELATED WORK

2.1. OBJECT DETECTION

The task of object detection aims at classifying and localizing the objects in an image, under the assumption that test data are drawn from the same distribution as training data.

Table 1: Comparison between OOD-OD and other tasks.

Task | Objective | Training data | Test input | Test output | Footnotes
Supervised learning | classify / detect | X_1, Y_1 | X_1 | Y_1 |
Semi-supervised learning | classify / detect | X_1, (Y_1)' | X_1 | Y_1 |
Transfer learning | classify / detect | X_{1,...,d}, X_{d+1}, Y_{d+1} | X_{d+1} | Y_{d+1} |
Domain generalization | classify | X_{1,...,d}, Y_{1,...,d} | X_{d+1} | Y_{d+1} | 1
Domain adaptation | detect | X_{1,...,d}, Y_{1,...,d}, X_{d+1} | X_{d+1} | Y_{d+1} | 1,2
OOD-OD | detect | X_{1,...,d}, Y_{1,...,d} | X_{d+1,...} | Y_{d+1,...} | 1,2

Modern deep learning-based object detection models can be divided into three categories: two-stage detectors (Girshick et al., 2014; Grauman & Darrell, 2005; Girshick, 2015; Ren et al., 2015; Lin et al., 2017a; Dai et al., 2016; He et al., 2017; Qiao et al., 2021; Cai & Vasconcelos, 2019; Huang et al., 2019; Pang et al., 2019; Wu et al., 2019; Sun et al., 2020), one-stage detectors (Redmon et al., 2016; Redmon & Farhadi, 2017; 2018; Bochkovskiy et al., 2020; Ge et al., 2021; Liu et al., 2016; Lin et al., 2017b; Zhou et al., 2019; Tan et al., 2020; Law & Deng, 2018; Tian et al., 2019; Zhang et al., 2020a; Zhu et al., 2021; Liu et al., 2021) and lightweight detectors with small components (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019; Zhang et al., 2018; Ma et al., 2018; Wang et al., 2018; Iandola et al., 2016). Compared to one-stage detectors, two-stage detectors are equipped with a separate differentiable module that generates region proposals likely to contain objects. Lightweight detectors are usually proposed to improve real-time performance with a small and efficient network. Recently, with the enormous success of applying the transformer (Vaswani et al., 2017) to computer vision, a branch of transformer-based detectors (Zhu et al., 2021; Liu et al., 2021) has emerged.

2.2. OOD GENERALIZATION

The task of OOD generalization is training on multiple datasets sampled from distinct domains and then generalizing to an unseen test domain. Models with OOD generalization ability typically have access to multiple training datasets for the same task obtained from various environments. The purpose of OOD generalization algorithms is to learn from these diverse but related training settings before being applied to unknown testing environments. Driven by this motivation, many algorithms have been proposed over the years. These algorithms can be divided into: empirical risk learning (Vapnik, 1998; Sagawa et al., 2019), invariant risk optimization (Arjovsky et al., 2019), domain adversarial learning (Ajakan et al., 2014; Li et al., 2018c; Ruan et al., 2021), meta-learning (Zhang et al., 2020b; Li et al., 2018a), kernel methods (Li et al., 2018b; Sun & Saenko, 2016), gradient-based approaches (Shi et al., 2021; Pezeshki et al., 2020; Bai et al., 2020; Parascandolo et al., 2021; Shahtalebi et al., 2021; Koyama & Yamaguchi, 2021; Rame et al., 2021), risk extrapolation (Krueger et al., 2021), data processing (Xu et al., 2020c; Yan et al., 2020), transfer learning (Blanchard et al., 2017; Xu & Jaakkola, 2021), information bottleneck (Ahuja et al., 2021) and self-supervised learning (Wang et al., 2020; Zhou et al., 2020). OOD generalization for object detection is currently underexplored. Region Aware Proposal reweighTing (RAPT) (Zhang et al., 2022) aims to eliminate dependence within RoI features for domain generalization. Cyclic-Disentangled Self-Distillation (Wu & Deng, 2022) aims at disentangling domain-invariant representations. 3D-VField (Lehner et al., 2022) improves generalization on 3D object detection.
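Many of these algorithm families share a common recipe: minimize the average per-domain risk plus a weighted penalty that encodes an invariance assumption. As a minimal illustration (not any paper's exact implementation), here is that pattern with the VREx variance penalty, assuming scalar per-domain risks:

```python
# Sketch of the shared pattern behind many OOD generalization algorithms:
# total objective = mean per-domain risk + lambda * penalty.
# Shown here with the VREx penalty (variance of per-domain risks).

def vrex_objective(domain_risks, lam=1.0):
    """domain_risks: list of scalar empirical risks, one per training domain."""
    n = len(domain_risks)
    mean_risk = sum(domain_risks) / n
    # VREx penalizes the variance of risks across domains, pushing the
    # model toward solutions that are equally good in every environment.
    variance = sum((r - mean_risk) ** 2 for r in domain_risks) / n
    return mean_risk + lam * variance

# Equal per-domain risks incur no penalty; unequal risks are penalized.
print(vrex_objective([0.5, 0.5, 0.5]))  # -> 0.5
```

Other families only swap the penalty term: IRM uses a gradient-based invariance penalty, MMD/CORAL use distribution-alignment distances, and GroupDRO replaces the mean with a worst-domain weighting.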

2.3. OOD BENCHMARK

Data from different domains (Zhou et al., 2021; Wang et al., 2022) can be viewed as data drawn from different distributions, and distinct train-test domains are therefore Out-of-Distribution. DomainBed (Gulrajani & Lopez-Paz, 2020) is a large-scale benchmark suite for reproducing domain generalization research and facilitating the implementation of new algorithms. From the experimental results of fourteen algorithms on seven datasets, the authors found that empirical risk minimization (Vapnik, 1998), when carefully implemented, matches or outperforms state-of-the-art domain generalization algorithms in terms of average performance.

3. CLARIFICATION OF TASKS

Domain Randomization techniques (Tobin et al., 2017; Tremblay et al., 2018; Zakharov et al., 2019; Yue et al., 2019; Huang et al., 2021) aim at providing enough simulated domains at training time so that models can generalize to real-world scenarios, based on the hypothesis that with enough variability in the data simulator, the real world may appear to be just another variation of the simulated data already present in the training set. OOD Detection for Object Detection (Joseph et al., 2021; Du et al., 2022a; Harakeh & Waslander, 2021; Riedlinger et al., 2021; Dhamija et al., 2020; Miller et al., 2019; Hall et al., 2020; Deepshikha et al., 2021) can be formulated as a binary classification problem that distinguishes whether incoming data are drawn from outside the training distribution. Open-World Object Detection (Joseph et al., 2021; Zhao et al., 2022) initially learns a model which can detect all the previously encountered categories, and incrementally updates the model when unseen classes arrive. Open-Vocabulary Object Detection (Gu et al., 2021; Zareian et al., 2021; Du et al., 2022b; Bravo et al., 2022) aims to train a detector which can detect objects of any novel category described by arbitrary text.

4. OOD-ODBENCH: IMPLEMENTATION DETAILS

4.1. BENCHMARKING DATASETS

In OOD-ODBench, we choose datasets to cover as many types of variation between training and test datasets as possible. Figure 2 shows samples from the four benchmark datasets. BDD100K (Yu et al., 2018) contains 80,000 labeled images (70,000 for training and 10,000 for validation) with ten annotated object categories: bike, bus, car, motor, person, rider, traffic light, traffic sign, train and truck. Each image has three attribute labels that indicate the collection conditions, namely the weather, scene and time of data collection; we remove images with an undefined attribute label. Using these attribute labels, we construct three OOD environments: Weather, Scene, and Time.
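Constructing such environments amounts to grouping images by one attribute label and discarding undefined values. A minimal sketch, with illustrative record fields loosely following the BDD100K label format:

```python
from collections import defaultdict

def split_by_attribute(records, attribute):
    """Group image records into domains by one attribute label, dropping
    records whose label is undefined. `records` is a list of dicts such as
    {"name": ..., "attributes": {...}}; the field names are illustrative."""
    domains = defaultdict(list)
    for rec in records:
        value = rec["attributes"].get(attribute)
        if value in (None, "undefined"):
            continue  # images with undefined attribute labels are removed
        domains[value].append(rec)
    return dict(domains)

records = [
    {"name": "a.jpg", "attributes": {"weather": "rainy", "timeofday": "night"}},
    {"name": "b.jpg", "attributes": {"weather": "clear", "timeofday": "daytime"}},
    {"name": "c.jpg", "attributes": {"weather": "undefined", "timeofday": "daytime"}},
]
domains = split_by_attribute(records, "weather")
print(sorted(domains))  # -> ['clear', 'rainy']
```

Calling the same function with "scene" or "timeofday" yields the other two environment splits.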


Figure 2: Some samples of datasets included in OOD-ODBench. Note that BDD100K has ten categories while Sim10K and Cityscapes only use the annotated cars.

Sim10K (Johnson-Roberson et al., 2016) is a synthetic dataset containing 10,000 images (8,000 for training, 1,000 for validation and 1,000 for testing) with bounding box annotations for cars, rendered with the Grand Theft Auto V (GTA5) game engine.


Cityscapes (Cordts et al., 2016) is a large-scale database focusing on urban street scenes. The dataset consists of around 5,000 finely annotated images (2,975 for training, 500 for validation and the rest for testing) with eight annotated instance categories. On OOD-ODBench, we consider the car recognition task, for simplicity and without loss of generality, to construct the Sim2real benchmark, which covers the "sim2real" scenario: the simulated images of Sim10K are used for training and the real images of Cityscapes are used for testing. CtrlShift is a synthetic dataset for analyzing the two-dimensional shift in OOD generalization object detection. The AirSim simulator (Shah et al., 2017), built on the high-fidelity rendering engine Unreal Engine 4, is used to generate samples in CtrlShift. In total, we sample over 2,000 simulated images from both rural and urban environments containing common objects, including buildings, traffic lights, and vehicles. Moreover, every image carries two attribute labels used to construct a controllable distribution shift. One is car color, which indicates the color of the car in the image.

Table 3: Experimental results of detector performance measured by AP (%) on MS COCO and the four OOD benchmarks of OOD-ODBench. All models are implemented with mmdetection (Chen et al., 2019) and loaded with pretrained weights provided by open-mmlab (Contributors, 2018). Note that @Faster in the Libra R-CNN row denotes applying Faster R-CNN as the architecture. X-101-64x4d denotes a modified ResNeXt-101 network architecture from (Xie et al., 2017); R-50 and R-101 denote ResNet backbones with 50 or 101 layers (He et al., 2016). OOD Avg is the average accuracy over the four OOD benchmarks.

OOD generalization algorithms.
We have adapted twelve algorithms from different OOD research areas to the classification branch of object detection: Empirical Risk Minimization (ERM) (Vapnik, 1998), which minimizes the loss function over all training domains; IB-ERM (Ahuja et al., 2021), which applies an information bottleneck constraint to address OOD generalization; Invariant Risk Minimization (IRM) (Arjovsky et al., 2019), which aims at estimating invariant correlations across different domains; adversarial feature learning (MMD) (Li et al., 2018b), which imposes the Maximum Mean Discrepancy (Gretton et al., 2012) to align the distributions of different domains; correlation alignment (CORAL) (Sun & Saenko, 2016), which matches the mean and covariance of feature distributions; Variance Risk Extrapolation (VREx) (Krueger et al., 2021), which performs a form of robust optimization over extrapolated domains; Gradient Starvation (GS) (Pezeshki et al., 2020), which derives a regularizer to overcome the gradient starvation phenomenon across different domains; IGA (Koyama & Yamaguchi, 2021), which uses a parametrization trick to conduct feature searching and predictor training; Group Distributionally Robust Optimization (GroupDRO) (Sagawa et al., 2019), which up-weights domains with larger losses via a penalty term; Representation Self-Challenging (RSC) (Huang et al., 2020), which iteratively mutes the dominant features to force the model to activate the remaining features; Optimal Representations (CAD) (Ruan et al., 2021), which designs self-supervised objectives to obtain representations whose risk minimizer transfers to any distribution; and Invariant Causal Mechanisms (CausIRL) (Chevalley et al., 2022), which learns invariant features by viewing the learning process as a causal process and introduces a unifying framework.
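As one concrete example of the feature-alignment penalties above, a CORAL-style distance between two domains' feature batches can be sketched as follows (a simplified stand-in, not the exact implementation used in the benchmark):

```python
import numpy as np

def coral_penalty(feat_a, feat_b):
    """CORAL-style penalty between two domains' feature batches
    (n_i x d arrays): squared distance between their means plus the squared
    Frobenius distance between their covariance matrices."""
    mean_a, mean_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    mean_term = np.sum((mean_a - mean_b) ** 2)
    cov_term = np.sum((cov_a - cov_b) ** 2)
    return mean_term + cov_term

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
print(coral_penalty(x, x))  # identical batches -> 0.0
```

In training, this penalty would be added to the detection loss over pairs of domain batches, weighted by the algorithm's hyper-parameter.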

4.3. MODEL SELECTION METHODS

Model selection methods can influence the final rankings of methods to a large extent, especially in OOD generalization tasks (Gulrajani & Lopez-Paz, 2020). However, there is no consensus on which model selection strategy should be used in OOD generalization research for object detection. In OOD-ODBench, we select the model trained at the last epoch. This is because the test data is inaccessible, and selecting models based on training-set performance may lead to excessive over-fitting for current methods, since there is a huge distribution gap between the training and test sets. For future research, we strongly recommend that researchers detail and report the model selection method used in OOD generalization object detection research.

5. EXPERIMENTS

In this section, we conduct numerical experiments on our benchmark to reveal the OOD generalization ability of existing algorithms; further discussion is provided in Appendix A.5. All experiments are conducted on a PyTorch platform with eight Tesla V100 GPUs. We evaluate each algorithm using the Average Precision (AP) from MS COCO (Lin et al., 2014). For object detection algorithms, our code is based on mmdetection (Chen et al., 2019), and for domain generalization algorithms, our code stems from DomainBed (Gulrajani & Lopez-Paz, 2020). We draw several conclusions from the results.
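For readers unfamiliar with the metric, a single-class, single-IoU-threshold version of average precision can be sketched as follows; the COCO AP used in our tables additionally averages over ten IoU thresholds and all classes:

```python
def average_precision(scored_matches, num_gt):
    """Area-under-PR-curve AP for one class, a simplified stand-in for
    COCO AP. `scored_matches`: list of (confidence, is_true_positive)
    per detection; `num_gt`: number of ground-truth boxes."""
    ranked = sorted(scored_matches, key=lambda m: -m[0])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in ranked:
        tp += is_tp
        fp += not is_tp
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # rectangle under PR curve
        prev_recall = recall
    return ap

# Two detections, both correct, two ground-truth boxes -> AP = 1.0
print(average_precision([(0.9, True), (0.8, True)], num_gt=2))
```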

5.1. BENCHMARK RESULTS

The enormous achievements of object detection on IID datasets are marginal under the OOD condition (Zhang et al., 2020a). Figure 4 intuitively displays the significant discrepancy between IID and OOD. What is responsible for these results? We suspect two factors. One is that current research simply builds on the ideal IID assumption, regardless of whether it can be satisfied in real scenarios. The other is that the improvement on IID datasets may be a symptom of over-fitting, since few works provide sufficient evidence that causal features have been learned during training without evaluating on OOD benchmarks. The tremendous success of domain generalization algorithms under OOD is inconsistent between classification and object detection. We draw this conclusion from Table 4 and the experimental results reported in OOD-Bench (Ye et al., 2021). The OOD results on the four benchmarks in Table 4 suggest that the domain generalization algorithms degenerate, or only slightly outperform ERM (Vapnik, 1998) in ways attributable to hyper-parameter bias. Moreover, for VREx (Krueger et al., 2021), the best model on Correlation-Bench (Ye et al., 2021), AP drops by 0.1 compared to ERM (Vapnik, 1998), whereas VREx outperforms ERM by 8.6 points on Correlation-Bench. RSC (Huang et al., 2020), the best model on Diversity-Bench (Ye et al., 2021), loses 11.2 AP on our benchmark while improving accuracy by 0.6 over ERM on Diversity-Bench. The generalization inconsistency between classification and object detection holds across different detectors. As shown in Table 4.2, we choose the popular two-stage detector Faster R-CNN (Ren et al., 2015), the one-stage detector RetinaNet (Lin et al., 2017b), and the transformer-based detector DETR (Carion et al., 2020) as base models for implementing domain generalization algorithms.
ERM clearly achieves the best average generalization ability across all three detectors, so we conclude that the degeneration of domain generalization algorithms has little to do with the choice of detector.

5.2. CONTROLLED DISTRIBUTION SHIFTS EXPERIMENTS

The previous experiments analyze performance in realistic scenarios for OOD generalization object detection, but they make it hard to see which kind of distribution shift causes the degeneration. To systematically analyze generalization under the two-dimensional distribution shift, we test Faster R-CNN trained with ERM (Vapnik, 1998) and with the top performers on the previous datasets, IRM (Arjovsky et al., 2019) and VREx (Krueger et al., 2021), on the CtrlShift dataset under different settings of correlation shift and diversity shift. The results are shown in Figure 5. All methods (Vapnik, 1998; Arjovsky et al., 2019; Krueger et al., 2021) achieve their best AP when both correlation shift and diversity shift are low. For ERM (Vapnik, 1998), performance degenerates evenly along the two dimensions. In Figure 5(b), as the two-dimensional shift increases, the performance of IRM (Arjovsky et al., 2019) degenerates faster in the horizontal direction than in the vertical direction, indicating that IRM handles correlation shift better than diversity shift. Figure 5(c) shows a similar phenomenon for VREx (Krueger et al., 2021). This demonstrates that existing OOD generalization algorithms may help mitigate performance degradation under correlation shift, whereas for diversity shift, key components for improving generalization are still missing, let alone for the complex mixture of both shifts found in real datasets. For future research, we recommend that both shifts be included in new benchmark datasets and that algorithms be evaluated on both types of distribution shift simultaneously.
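How a controllable correlation shift can be synthesized is easiest to see in code. The sketch below (illustrative names, not the exact CtrlShift generation procedure) makes an attribute spuriously predictive of the scene during training and breaks that correlation at test time:

```python
import random

def sample_biased_domain(n, colors=("red", "blue"), scenes=("rural", "urban"),
                         correlation=0.9, seed=0):
    """With probability `correlation`, a scene gets its 'preferred' car color,
    so color becomes a spurious predictor of scene. Attribute names are
    illustrative, not the exact CtrlShift generation procedure."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        scene = rng.choice(scenes)
        preferred = colors[scenes.index(scene) % len(colors)]
        if rng.random() < correlation:
            color = preferred              # spuriously correlated attribute
        else:
            color = rng.choice(colors)     # correlation broken
        samples.append({"scene": scene, "car_color": color})
    return samples

train = sample_biased_domain(1000, correlation=0.95)   # strong spurious cue
ood_test = sample_biased_domain(1000, correlation=0.0)  # cue removed at test
```

Sweeping `correlation` (and, analogously, the share of novel scene features) produces the grid of controlled shift settings used in Figure 5.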

6. CONCLUSION AND DISCUSSION

In this paper, we propose the first benchmark for OOD-OD tasks, named OOD-ODBench. The benchmark suite includes four benchmark datasets along with a synthetic dataset for generating controlled distribution shifts. The experimental results on OOD-ODBench suggest that the enormous achievements in classical IID object detection are marginal on OOD generalization object detection, and that OOD generalization methods tested mainly on classification do not transfer to object detection tasks. This raises questions about existing progress on object detection and OOD generalization algorithms. We call for more attention from the community to this problem, toward an OOD-OD method that is convincingly effective.

A APPENDIX

A.1 THEORETICAL ANALYSIS OF OOD GENERALIZATION OBJECT DETECTION

Figure 6: The causal influence among the concerned variables.

OOD generalization object detection has been studied only fragmentarily in previous research, and no rigorous definition has been given. We give a definition and taxonomy as follows.

OOD generalization object detection (OOD-OD): In object detection tasks, algorithms learn a mapping function f to predict the category (y) and location of the objects of interest in an image x. In OOD generalization object detection tasks, training and test data pairs (X, Y) are not necessarily drawn from the same distribution. This poses great challenges for existing machine learning methods, as most methods rely on exploiting the correlation between X and Y; under distribution change, this correlation may not generalize. More specifically, we depict the causal data generating process in Figure 6. Given the data X, the causal features Z_1 and the non-causal features Z_2 are determined. The causal features reliably determine the location and categories of the objects of interest in the input images; the non-causal features are irrelevant to the predictions. An intuitive example: to recognize a dog in an image, the causal features are the dog's shape, while the non-causal features are environment features such as the weather or capture time of the image. We improve on the definitions in (Ye et al., 2021) and propose the following characterizations of Z_1 and Z_2 within the overall semantic feature space Z:

∀ z ∈ Z_1: p(z) · q(z) ≠ 0 ∧ ∀ y_1 ∈ Y_1: p(y_1 | z) = q(y_1 | z) ∧ ∀ y_2 ∈ Y_2: p(y_2 | z) = q(y_2 | z)   (1)

∃ z ∈ Z_2: p(z) · q(z) = 0 ∨ ∃ y_1 ∈ Y_1: p(y_1 | z) ≠ q(y_1 | z) ∨ ∃ y_2 ∈ Y_2: p(y_2 | z) ≠ q(y_2 | z)   (2)

where p is the training distribution and q is the test distribution. Since Z_1 is the stable and reliable predictor of the category and location of objects, the discrepancy of Z_2 between the training and test distributions intuitively gives rise to two kinds of shift. Diversity shift stems from the first kind of features in Z_2, since the diversity of data is embodied by novel features not shared across environments, whereas correlation shift is caused by the second kind of features in Z_2, which are spuriously correlated with some Y_1 or Y_2. Based on this, we partition Z_2 into two subsets:

S := {z ∈ Z_2 | p(z) · q(z) = 0},   T := {z ∈ Z_2 | p(z) · q(z) ≠ 0}   (3)

Definition A.1. (Diversity shift and correlation shift for OOD-OD.) Given S and T defined in Equation 3, diversity shift and correlation shift are defined as:

D_diversity := (1/2) ∫_S | p(z) − q(z) | dz   (4)

D_correlation := (1/2) ∫_T √(p(z) · q(z)) Σ_{y_1 ∈ Y_1} Σ_{y_2 ∈ Y_2} | p(y_1, y_2 | z) − q(y_1, y_2 | z) | dz   (5)

Both D_diversity and D_correlation lie in the range [0, 1]. D_diversity measures the support difference of the non-causal features, while D_correlation gauges the variation of the conditional probabilities of the object categories Y_1 and locations Y_2 given the non-causal features, serving as an indicator of spurious correlations in datasets. To the best of our knowledge, the proposed definition is the first to provide a quantitative way of measuring distributional shifts for OOD-OD. We leave computing the numerical values of these shifts for a given object detection dataset to future work.
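For finite feature and label spaces, the two quantities in Definition A.1 reduce to sums and can be computed directly. A sketch with a single label variable for brevity (the definition uses the pair (y_1, y_2)):

```python
import math

def diversity_and_correlation_shift(p, q):
    """Discrete approximation of Definition A.1 with one label variable:
    p, q map (z, y) -> probability under the train/test distribution.
    z values seen under only one distribution contribute to diversity
    shift (set S); shared z values contribute to correlation shift (set T)."""
    pz, qz = {}, {}
    for (z, _), v in p.items():
        pz[z] = pz.get(z, 0.0) + v
    for (z, _), v in q.items():
        qz[z] = qz.get(z, 0.0) + v
    d_div = d_cor = 0.0
    for z in set(pz) | set(qz):
        pv, qv = pz.get(z, 0.0), qz.get(z, 0.0)
        if pv * qv == 0:                       # z in S: unshared support
            d_div += 0.5 * abs(pv - qv)
        else:                                  # z in T: shared support
            ys = {y for (zz, y) in list(p) + list(q) if zz == z}
            total = sum(abs(p.get((z, y), 0.0) / pv - q.get((z, y), 0.0) / qv)
                        for y in ys)
            d_cor += 0.5 * math.sqrt(pv * qv) * total
    return d_div, d_cor

# Identical distributions -> no shift of either kind.
p = {("sunny", "car"): 0.5, ("rainy", "car"): 0.5}
print(diversity_and_correlation_shift(p, dict(p)))  # -> (0.0, 0.0)
```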

A.2 PROOF

Proposition A.2. For any probability functions p and q of the training and testing distributions, D_diversity and D_correlation are bounded inclusively between 0 and 1.

Table: Experimental results of object detection algorithms on the all-sim-all-real split of Sim2real.

Detector | Backbone | Mem (GB) | Inf time (fps) | AP
RetinaNet (Lin et al., 2017b) | X-101 | 10.0 | 8.7 | 38.0
Mask R-CNN (He et al., 2017) | X-101 | 10.7 | 8.0 | 36.7
CornerNet (Law & Deng, 2018) | Hourglass104 | 13.9 | 4.2 | 21.6
YOLOv3 (Redmon & Farhadi, 2018) | DarkNet-53 | 7.4 | 48.1 | 28.2
FCOS (Tian et al., 2019) | X-101 | 10.0 | 9.7 | 37.9
Cascade R-CNN (Cai & Vasconcelos, 2019) | X-101 | 10.7 | - | 40.5
MS R-CNN (Huang et al., 2019) | R-X101 | 11.0 | 8.0 | 35.7
Libra R-CNN (Pang et al., 2019) | X-101 | 10.8 | 8.5 | 35.3
DH R-CNN (Wu et al., 2019) | R-50 | 6.8 | 9.5 | 33.8
VarifocalNet (Zhang et al., 2020a) | X-101 | - | - | 42.3
Sparse R-CNN (Sun et al., 2020) | R-101 | - | - | 40.3
Deformable DETR (Zhu et al., 2021) | R-50 | - | - | 37.4
YOLOX (Ge et al., 2021) | YOLOX-x | 28.1 | - | 36.4

Table: Selected hyper-parameters of domain generalization algorithms and the resulting AP.

Algorithm | Hyper-parameter | AP
IB-ERM (Ahuja et al., 2021) | λ_ib = 100 | 18.3
IRM (Arjovsky et al., 2019) | λ_irm = 1 | 32.7
MMD (Li et al., 2018b) | γ_mmd = 1 | 33.2
CORAL (Sun & Saenko, 2016) | γ_mmd = 1 | 32.5
VREx (Krueger et al., 2021) | λ_vrex = 1 | 32.4
GS (Pezeshki et al., 2020) | λ_reg = 0.1 | 31.4
IGA (Koyama & Yamaguchi, 2021) | λ_penalty = 1000 | 33.4
GroupDRO (Sagawa et al., 2019) | η_groupdro = 0.01 | 31.9



We assume no category label (Y_1) shift, for simplicity and without loss of generality.



Figure 1: The setting illustration of out-of-distribution generalization object detection (OOD-OD).

Figure 3: Some samples of CtrlShift and the illustration of the two-dimensional shift.

Figure 4: The improvements of recent object detection methods over the baseline on IID and OOD data, respectively. While the improvements on IID datasets (MS COCO) are prominent, they do not carry over to OOD scenarios. The baseline method is Faster R-CNN.

Figure 5: Controlled distribution shifts experiments results of ERM, IRM and VREx. X-axis is Diversity shift and Y-axis is Correlation shift. Each block indicates the AP(%).

Figure 7: X-axis is task complexity α. Each block indicates the AP(%).

Comparison between OOD generalization object detection and other tasks. X_i, Y_i indicate the data drawn from distribution i, and Y_1, Y_2 represent category labels and bounding box labels respectively.

Details of datasets used in benchmarks. Quantity indicates the number of images in each domain. Total counts the total number of training and testing domains respectively.

The comparisons of different detectors trained with ERM on OOD-ODBench can help answer whether the progress made by recently proposed detectors generalizes to OOD data. The benchmarks of detectors trained with the proposed OOD generalization algorithms can indicate whether recently proposed OOD generalization algorithms remain effective for object detection beyond toy image classification tasks.

Detectors. Object detection models (detectors) can generally be categorized into two genres: one-stage methods and two-stage methods. One-stage detectors predict the bounding boxes as well as the categories of the objects directly. Two-stage detectors first predict bounding boxes to indicate the possible locations of objects, and then classify the image regions within those boxes. Recently, with the tremendous success of the transformer (Vaswani et al., 2017), transformer-based detectors have become popular. We have selected one/two-stage and transformer-based algorithms ranging from 2015 to 2021 for our benchmarks.

Experimental results of domain generalization algorithms on OOD generalization object detection. All algorithms are implemented by ourselves based on Faster R-CNN (Ren et al., 2015) with a ResNet-50 (He et al., 2016) backbone. The hyper-parameter of each algorithm is chosen among 0.1, 1 and 10 according to the average AP over the four benchmarks.

Ablation study of OOD algorithms with different detectors on Sim2real. Faster R-CNN (Ren et al., 2015), RetinaNet (Lin et al., 2017b) and DETR (Carion et al., 2020) all use a ResNet-50 (He et al., 2016) backbone.

The experimental results of object detection algorithms on the all-sim-all-real of Sim2real. Mem (GB)† and Inf time (fps)† are from mmdetection (Chen et al., 2019).

The experimental results of domain generalization algorithms on the all-sim-all-real of Sim2real.

Experimental results of detectors on Sim2real.

Experimental results of detectors on Weather.

Experimental results of detectors on Scene.

Experimental results of detectors on Time. Columns: Detector | Backbone | AP | AP_50 | AP_75 | AP_s | AP_m | AP_l.

Generalization performance of detectors with OOD algorithms.


The sim splits are from the original Sim10K (Johnson-Roberson et al., 2016), while city train and city val are from the original Cityscapes (Cordts et al., 2016).

Proof. Both D_diversity and D_correlation are clearly non-negative. For the upper bound on D_diversity, the triangle inequality gives |p(z) − q(z)| ≤ p(z) + q(z), so

D_diversity ≤ (1/2) ∫_S (p(z) + q(z)) dz ≤ (1/2)(1 + 1) = 1.

Similarly, by the AM-GM inequality √(p(z) · q(z)) ≤ (p(z) + q(z))/2, and since by the triangle inequality Σ_{y_1} Σ_{y_2} | p(y_1, y_2 | z) − q(y_1, y_2 | z) | ≤ Σ_{y_1} Σ_{y_2} (p(y_1, y_2 | z) + q(y_1, y_2 | z)) = 2, we have

D_correlation ≤ (1/2) ∫_T ((p(z) + q(z))/2) · 2 dz = (1/2) ∫_T (p(z) + q(z)) dz ≤ 1.

A.3 IMPLEMENTATION DETAILS

To evaluate the object detection algorithms, we use the models and pre-trained weights provided by mmdetection (Chen et al., 2019). For domain generalization algorithms on OOD generalization object detection, we derive the implementations using Faster R-CNN (Ren et al., 2015) with a ResNet-50 FPN backbone (He et al., 2016) from torchvision. The whole network is optimized by Stochastic Gradient Descent with a learning rate of 0.02, momentum 0.9 and weight decay 0.0005.

A.4 FURTHER RESULTS

Task complexity. To analyze the IID condition on CtrlShift, where both D_correlation and D_diversity equal zero, we introduce a hyper-parameter, the task complexity α, to measure the difficulty of the task. The difficulty is adjusted by adding 1 − α percent novel data to the testing set on top of the original training data. The experimental results are shown in Figure 7: the generalization ability of each algorithm drops as task complexity increases.

Sim2real benchmark. The training set of the Sim2real results reported in the main manuscript comprises the training data from Sim10K (Johnson-Roberson et al., 2016) and the validation data from Cityscapes (Cordts et al., 2016), while the testing set comprises the training data from Cityscapes and the validation data from Sim10K (denoted part-sim-part-real; more details can be found in Table A.4). We report the experimental results on all-sim-all-real in Table 7 and Table 8.

Full results. The proposed varifocal loss (Zhang et al., 2020a) can be considered to improve OOD generalization ability.
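The task-complexity construction can be sketched as follows (pool names are illustrative):

```python
import random

def build_test_set(train_pool, novel_pool, alpha, seed=0):
    """Sketch of the task-complexity construction: the test set reuses the
    original training data plus (1 - alpha) percent novel data, so alpha = 1
    recovers the IID condition and smaller alpha makes the task harder."""
    rng = random.Random(seed)
    n_novel = round((1 - alpha) * len(novel_pool))
    return list(train_pool) + rng.sample(novel_pool, n_novel)

iid_test = build_test_set(list(range(100)), list(range(100, 200)), alpha=1.0)
hard_test = build_test_set(list(range(100)), list(range(100, 200)), alpha=0.5)
print(len(iid_test), len(hard_test))  # -> 100 150
```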

