TOWARDS ROBUST OBJECT DETECTION INVARIANT TO REAL-WORLD DOMAIN SHIFTS

Abstract

Safety-critical applications such as autonomous driving require robust object detection invariant to real-world domain shifts. Such shifts can be regarded as different domain styles, which can vary substantially due to environment changes, while deep models only know the training domain style. This domain style gap impedes object detection generalization across diverse real-world domains. Existing classification domain generalization (DG) methods cannot effectively solve the robust object detection problem, because they either rely on multiple source domains with large style variance or destroy the content structures of the original images. In this paper, we analyze and investigate effective solutions to overcome domain style overfitting for robust object detection without the above shortcomings. Our method, dubbed Normalization Perturbation (NP), perturbs the channel statistics of source domain low-level features to synthesize various latent styles, so that the trained deep model can perceive diverse potential domains and generalize well even without observing target domain data in training. This approach is motivated by the observation that feature channel statistics of target domain images deviate around the source domain statistics. We further explore style-sensitive channels for effective style synthesis. Normalization Perturbation relies on only a single source domain and is surprisingly simple and effective, contributing a practical solution that effectively adapts or generalizes classification DG methods to robust object detection. Extensive experiments demonstrate the effectiveness of our method for generalizing object detectors under real-world domain shifts.

1. INTRODUCTION

Object detection, a fundamental computer vision task, plays an important role in various safety-critical applications, including autonomous driving (Grigorescu et al., 2020), video surveillance (Raghunandan et al., 2018), and healthcare (Dusenberry et al., 2020). Deep learning has made great progress on in-domain object detection (Ren et al., 2015; Bochkovskiy et al., 2020; Fan et al., 2020; 2022), but its performance usually degrades under domain shifts (Sakaridis et al., 2018; Michaelis et al., 2019), where the testing (target) data differ from the training (source) data. Real-world domain shifts are usually brought by environment changes, such as different weather and time conditions, and are characterized by diverse contrast, brightness, texture, etc. Trained models usually overfit to the source domain style and generalize poorly in other domains, posing serious problems in challenging real-world usage such as autonomous driving. Figure 1(b) shows a large gap in feature channel statistics between two distinct domains, Cityscapes (Cordts et al., 2016) and Foggy Cityscapes (Sakaridis et al., 2018), especially in shallow CNN layers, which preserve more style information (Zhou et al., 2020b; Pan et al., 2018). Deep models trained on the source domain cannot generalize well on the target domain, due to the discrepancy in feature channel statistics caused by domain style overfitting.

Domain generalization (DG) (Muandet et al., 2013; Ghifary et al., 2016; Mahajan et al., 2021; Li et al., 2020) aims to solve this hard and significant problem. Major effort has been devoted to improving the domain generalization of classification models, where multiple source domains with large inter-image style variance are available for model training. However, less attention has been paid to robust object detection (Wang et al., 2021), which is of equal importance, if not more, in many visual perception systems. The closely related unsupervised domain adaptation (UDA) object detection (Schneider et al., 2020; Nado et al., 2020) has been widely studied, but it requires target domain images for model training, which is often infeasible for online object detection systems.

This work was done when Qi was a visiting scholar at MPII. This research was supported by the Research Grants Council of the HKSAR under grant No. 16201420.

Synthesizing new domains has been demonstrated as an effective solution for domain generalization in the classification task (Nuriel et al., 2021; Zhou et al., 2020b). The rationale is that, by perceiving a large variety of synthesized domains during training, the model can learn domain-invariant representations and generalize well. However, existing domain synthesis methods are all specifically designed for image classification, and it is non-trivial to apply them directly to robust object detection because of the task gap between classification and detection. Specifically, the feature-level synthesis approach (Zhou et al., 2020b; Li et al., 2022) is effective for the classification DG problem, but it requires multiple source domains with large style variance. In robust object detection, however, there is usually only a single source domain due to the expensive annotation cost, which means that only relatively small style variance exists; in this situation, previous feature-level synthesis approaches cannot synthesize sufficiently diverse domains. Image-generation-based synthesis approaches (Jackson et al., 2019; Geirhos et al., 2018) can effectively address the single-source-domain problem by leveraging an extra large-scale style image dataset (Kaggle) to synthesize diverse domains, but the generation procedure may destroy image contents, which are essential for object detection where large context diversity may be present.

In this paper, we perform an in-depth analysis of the under-explored robust object detection problem and propose a novel domain style synthesis approach. Figure 1(a) shows our motivation: feature channel statistics of target domain images deviate around the source domain statistics. Thus, by perturbing the feature channel statistics of source domain images in shallow CNN layers, we can effectively synthesize new domains. The perturbed feature statistics correspond to various latent domain styles, so that the trained model perceives diverse potential domains accordingly. Such perturbation enables deep models to learn domain-invariant representations, in which distinct domains are effectively blended together in the learned feature space.

Figure 1: Visualizations of feature channel statistics on Cityscapes (source domain, red) and Foggy Cityscapes (target domain, blue). (a) Feature channel statistics (mean) difference at stage 1: for two domain images with the same content but different styles, we show their feature channel statistics and differences on the pretrained backbone at stage 1 ("stage" denotes the backbone block). The statistics of the Foggy Cityscapes image are negated for better visualization. The feature channel statistics of the target domain image deviate around the source domain statistics. (b) Feature channel statistics (mean) visualization: the t-SNE (Van der Maaten & Hinton, 2008) map of the feature channel statistics. The model is trained on the source domain and evaluated on both domains. The distance between the two domains is computed by Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006). After equipping Normalization Perturbation in shallow CNN layers, our model effectively blends distinct domain style distributions and thus generalizes much better on the target domain.
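To make the intuition above concrete, the perturbation idea can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not our exact formulation (given later in the paper): the choice of multiplicative Gaussian noise centered at 1 and the `noise_std` scale are assumptions here, chosen to mirror the observation that target domain statistics deviate around source domain statistics.

```python
import torch


def normalization_perturbation(feat: torch.Tensor, noise_std: float = 0.5) -> torch.Tensor:
    """Perturb the per-channel statistics of a feature map (B, C, H, W).

    Illustrative sketch: remove the source style (channel mean/std),
    then re-style the features with randomly perturbed statistics so
    the network perceives a latent domain during training.
    """
    B, C = feat.shape[:2]
    mu = feat.mean(dim=(2, 3), keepdim=True)            # per-channel mean (style)
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6   # per-channel std (style)
    # Multiplicative noise around 1, so perturbed statistics deviate
    # around the source statistics (noise distribution is an assumption).
    alpha = 1.0 + noise_std * torch.randn(B, C, 1, 1, device=feat.device)
    beta = 1.0 + noise_std * torch.randn(B, C, 1, 1, device=feat.device)
    normalized = (feat - mu) / sigma                    # strip source style
    return normalized * (sigma * beta) + mu * alpha     # apply perturbed style
```

In practice such a perturbation would be applied only during training and only on shallow backbone features, where style information dominates; at test time the features pass through unchanged.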

