OOD-ODBENCH: AN OBJECT DETECTION BENCHMARK FOR OOD GENERALIZATION ALGORITHMS

Abstract

The consensus in machine learning tasks, such as object detection, is still that the test data are drawn from the same distribution as the training data, an assumption known as IID (Independent and Identically Distributed). However, real-world applications cannot avoid being confronted with OOD (Out-of-Distribution) scenarios. It is risky to apply an object detection algorithm without first understanding its OOD generalization performance. On the other hand, a plethora of OOD generalization algorithms has been proposed to bridge the gap between the in-house and open-world performances of machine learning systems. However, their effectiveness has only been demonstrated on image classification tasks. It remains an open question how these algorithms perform on more complex tasks. In this paper, we first specify the setting of OOD-OD (OOD generalization object detection). Then, we propose OOD-ODBench, which consists of four OOD-OD benchmark datasets for evaluating various object detection and OOD generalization algorithms. From extensive experiments on OOD-ODBench, we find that existing OOD generalization algorithms fail dramatically when applied to the more complex object detection task. This raises questions about the current progress of a large number of these algorithms and whether they can be effective in practice beyond simple toy examples. We sincerely hope that OOD-ODBench can serve as a foothold for future research on OOD generalization object detection.

1. INTRODUCTION

Modern object detection methods (Liu et al., 2021; Huang et al., 2019; Pang et al., 2019; Wu et al., 2019; Zhang et al., 2020a; Sun et al., 2020; Zhu et al., 2021; Ge et al., 2021) have made great progress on various applications, such as autonomous driving and industrial defect detection. Tremendous efforts have been devoted to improving an object detector's performance on standard datasets, such as MS-COCO (Lin et al., 2014). While these efforts have had an impact on industry (Redmon et al., 2016; Redmon & Farhadi, 2017; 2018; Bochkovskiy et al., 2020; Ge et al., 2021), the improvements have recently become marginal, and most achievements rest on an inherent assumption, i.e., that the training data and the test data are IID (Independent and Identically Distributed). However, this assumption is unlikely to hold in real-world scenarios. For example, an autonomous system suffers from different environmental conditions (Dai & Gool, 2018; Volk et al., 2019); a medical system fails to work consistently across hospitals when data are collected from different equipment (de Castro et al., 2019; Albadawy et al., 2018; Perone et al., 2019). As a consequence, models trained on IID data are susceptible to subtle disturbances in the test data distribution (Out-of-Distribution) and fail to generalize to real scenarios (Torralba & Efros, 2011). Previous research devoted to addressing this train-test discrepancy can be summarized as either "less complex" or "complex but not general". From the first perspective, a plethora of Domain Generalization (DG) algorithms (Arjovsky et al., 2019; Ahuja et al., 2021; Li et al., 2018b; Sun & Saenko, 2016; Xu et al., 2020c; Yan et al., 2020; Krueger et al., 2021; Pezeshki et al., 2020; Parascandolo et al., 2021; Koyama & Yamaguchi, 2021; Huang et al., 2020; Sagawa et al., 2019) concentrate on improving OOD generalization ability, but they are only evaluated on image classification.
Their effectiveness is unknown when applied to the more complex task of object detection. From the second perspective, numerous Domain Adaptation (DA) algorithms (Chen et al., 2018; He & Zhang, 2020; Rodriguez & Mikolajczyk, 2019; Xu et al., 2020a; Su et al., 2020; Xu et al., 2020b; Soviany et al., 2019; Deng et al., 2020; Chen et al., 2021) aim to build an optimal object detector that generalizes to a pre-specified target domain. However, it is hard to ensure consistent performance when dealing with the unseen and unbounded domains of the real world. In this paper, we focus on OOD generalization object detection (OOD-OD), which aims at training detectors that generalize to test data drawn from an unseen distribution distinct from the training distribution. See Table 1 for more details; we provide a theoretical definition of OOD-OD in Appendix A.1. In this work, we propose OOD-ODBench, in which four OOD-OD benchmarks are constructed from existing datasets, including BDD100K (Yu et al., 2018), Cityscapes (Cordts et al., 2016), and Sim10K (Johnson-Roberson et al., 2016). As revealed by Ye et al. (2021), data distribution shifts on classification datasets are dominated by correlation shift and diversity shift. We test whether a similar phenomenon also exists on detection datasets, and we construct a synthetic dataset named CtrlShift to quantitatively analyze generalization ability under each of these two kinds of distribution shift in OOD-OD. With the above benchmark datasets, extensive experiments are conducted with detectors ranging from one- and two-stage to transformer-based, and with diverse OOD generalization algorithms carefully implemented on popular detectors (Ren et al., 2015; Lin et al., 2017b). Section 2 reviews the related work dispersed across different research areas. Section 3 clarifies different techniques and tasks with similar names.
Section 4 introduces the implementation details of OOD-ODBench, including the datasets, algorithms, and model selection methods. Finally, Section 5 discusses the experimental results on OOD-ODBench and offers recommendations for future work. Our main contributions can be summarized as follows:

1. We propose OOD-ODBench, the first OOD generalization benchmark for object detection algorithms. Based on the extensive experimental results, we arrive at a surprising conclusion: the enormous achievements in IID object detection are marginal on OOD generalization object detection, and the OOD generalization improvements on classification are hard to transfer to more complex tasks (i.e., object detection).

2. In OOD-ODBench, we propose a Sim2real benchmark for OOD generalization object detection, which measures how well models trained on low-cost simulated data generalize to real scenarios.

3. To further analyze generalization ability under different types of shift, we construct a synthetic dataset with designed shifts, namely CtrlShift. This synthetic dataset can systematically measure the performance of OOD generalization algorithms under different types of distribution shift.

4. From the benchmark results and analysis, we recommend that future research clearly investigate the diversity shift and the correlation shift in OOD generalization object detection before designing algorithms, and then evaluate them comprehensively along both shift dimensions using CtrlShift.
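To make the two shift types concrete, the following toy sketch (our own illustration for exposition, not the actual CtrlShift construction; all names and the class-background pairing are hypothetical) pairs each object class with a background attribute and builds splits exhibiting either correlation shift (same backgrounds, different class-background correlation) or diversity shift (test backgrounds unseen during training):

```python
import random

# Toy sketch: each sample pairs an object class with a background
# "environment" attribute, standing in for a detection image.
CLASSES = ["car", "person"]
PREFERRED = {"car": "highway", "person": "city"}  # hypothetical pairing

def make_split(n, corr, backgrounds, rng):
    """Sample (class, background) pairs; `corr` is the probability that
    a class co-occurs with its 'preferred' background."""
    data = []
    for _ in range(n):
        cls = rng.choice(CLASSES)
        if rng.random() < corr and PREFERRED[cls] in backgrounds:
            bg = PREFERRED[cls]
        else:
            bg = rng.choice(backgrounds)
        data.append((cls, bg))
    return data

rng = random.Random(0)
# Correlation shift: identical background pool, but class and background
# are strongly correlated at train time and decorrelated at test time.
train_corr = make_split(1000, corr=0.9, backgrounds=["city", "highway"], rng=rng)
test_corr = make_split(1000, corr=0.0, backgrounds=["city", "highway"], rng=rng)
# Diversity shift: the test background never appears in training.
train_div = make_split(1000, corr=0.0, backgrounds=["city", "highway"], rng=rng)
test_div = make_split(1000, corr=0.0, backgrounds=["snow"], rng=rng)
```

A detector fit to `train_corr` may latch onto the spurious class-background correlation, while one fit to `train_div` must cope with entirely novel backgrounds; CtrlShift controls these two axes on real synthetic images rather than tuples.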

2. RELATED WORK

2.1. OBJECT DETECTION

The task of object detection aims at classifying and localizing the objects in an image, under the assumption that test data are drawn from the same distribution as the training data. Modern deep



Figure 1: An illustration of the out-of-distribution generalization object detection (OOD-OD) setting.
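The setting in Figure 1 can be sketched in code. Below is a minimal illustration (all helper names are hypothetical, and concrete benchmarks may fix the train/test domains rather than rotate them): a detector is trained only on source domains and evaluated on a domain it never saw during training.

```python
def leave_one_domain_out(domains, train_fn, evaluate_fn):
    """For each domain, train on all the others and test on the held-out one."""
    results = {}
    for held_out in domains:
        sources = [d for d in domains if d != held_out]
        detector = train_fn(sources)  # training sees source domains only
        results[held_out] = evaluate_fn(detector, held_out)  # OOD test score
    return results

# Toy stand-ins for real training/evaluation: "training" records the
# source domains, "evaluation" checks the test domain was never seen.
toy_train = lambda sources: {"seen": set(sources)}
toy_eval = lambda det, domain: domain not in det["seen"]

scores = leave_one_domain_out(["BDD100K", "Cityscapes", "Sim10K"],
                              toy_train, toy_eval)
assert all(scores.values())  # every test domain is unseen at train time
```

In a real instantiation, `train_fn` would fit a detector (e.g., Faster R-CNN) on the source domains and `evaluate_fn` would report mAP on the held-out domain.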

