WHAT'S WRONG WITH THE ROBUSTNESS OF OBJECT DETECTORS?

Abstract

Despite their tremendous successes, object detection models remain vulnerable to adversarial attacks. Even imperceptible adversarial perturbations in images can cause erroneous detection predictions, posing a threat to various realistic applications, e.g., medical diagnosis and autonomous driving. Although some existing methods can improve the adversarial robustness of detectors, they still suffer from a detection robustness bottleneck: significant performance degradation on clean images and limited robustness on adversarial images. In this paper, we conduct a comprehensive empirical investigation into what is wrong with the robustness of object detectors across four seminal architectures, i.e., two-stage, one-stage, anchor-free, and Transformer-based detectors, hoping to inspire more research interest in this task. We also devise a Detection Confusion Matrix (DCM) and Classification-Ablative Validation (ClsAVal) for further detection robustness analyses, and explore the underlying factors that account for this bottleneck. We empirically demonstrate that robust detectors exhibit reliable localization robustness but poor classification robustness: the classification module easily misclassifies foreground objects as background. Furthermore, robust Deformable-DETR suffers from poor robustness in both classification and localization. Our source code, trained models, and detailed experimental results will be made publicly available.



Figure 1: Robustness of image classification models and object detection models. "STD" denotes the standard (non-robust) model, i.e., ResNet for classification and SSD for detection. A_cls and A_loc denote attacks on the classification and localization branches of the detector.
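The attacks in the caption can be understood as projected gradient ascent (PGD-style) on either the classification term or the localization term of the detection loss. As a minimal sketch of this idea, the toy two-head linear "detector" below is purely illustrative (the function name, the model, and all parameters are our assumptions, not the paper's setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_detection_attack(x, y_cls, t_loc, w_cls, w_loc,
                         eps=0.3, alpha=0.05, steps=20,
                         attack_cls=True, attack_loc=True):
    """L_inf PGD against a toy 'detector' with a logistic classification
    head (w_cls . x) and a linear localization head (w_loc . x).
    attack_cls / attack_loc select which loss term is maximized,
    loosely mirroring the A_cls / A_loc attacks in Figure 1."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        if attack_cls:
            p = sigmoid(np.dot(w_cls, x_adv))
            grad += (p - y_cls) * w_cls          # grad of cross-entropy w.r.t. x
        if attack_loc:
            r = np.dot(w_loc, x_adv) - t_loc
            grad += 2.0 * r * w_loc              # grad of squared loc error
        x_adv = x_adv + alpha * np.sign(grad)    # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps) # project into the eps-ball
    return x_adv
```

Attacking only the classification term increases the classification loss while leaving the localization target comparatively intact, which is the axis along which the paper later finds robust detectors to be weakest.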

performance decreases by nearly 30% mAP on clean images (77.49% mAP for standard SSD vs. 48% mAP for MTD and 51.3% mAP for CWAT)! This is counter-intuitive, because robust classification models (e.g., TRADES Zhang et al. (2019), IRGD Gowal et al. (2021)) lose only a small amount of clean-image performance while gaining robustness, as shown in Fig. 1. Both the classification and detection methods rely on adversarial training to obtain robustness, yet the problem arises notably in object detection alone. It is therefore worth exploring what is wrong with the robustness of object detectors. Four robust object detectors trained with adversarial training present the robustness bottleneck described above, which is mainly attributed to inferior classification robustness, i.e., the misclassification of foreground as background in the classification module, with less confusion among foreground categories. Besides, in Deformable-DETR, both classification and localization robustness are poor, unlike SSD, Faster RCNN, and YOLOX, which have reliable localization robustness and poor classification robustness.

2 ADVERSARIAL ROBUSTNESS OF OBJECT DETECTORS

With the remarkable learning capability of deep neural networks, four representative series of deep-learning-based object detectors are prevalent and lead recent research on object detection: two-stage, one-stage, anchor-free, and Transformer-based detectors. In this work, we empirically investigate the adversarial robustness of these four types of detectors. Concretely, Faster RCNN Ren et al. (2015) (two-stage), SSD Liu et al. (2016) (one-stage), YOLOX Ge et al. (2021) (anchor-free), and Deformable-DETR Zhu et al. (2021) (Transformer-based) are selected. In this section, we first describe adversarial training for robust object detection and then elaborate our empirical analyses of detection robustness.
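The foreground-as-background confusion discussed above is exactly what a per-class detection confusion matrix exposes. The following is a minimal sketch of how such a matrix could be tallied, assuming greedy IoU-based matching between predictions and ground truth; the helper names and the matching rule are our illustrative assumptions, not the paper's released DCM implementation:

```python
import numpy as np

def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_confusion_matrix(gt_boxes, gt_labels, pred_boxes, pred_labels,
                               num_classes, iou_thr=0.5):
    """Rows: ground-truth class (last row = background, i.e., false positives).
    Cols: predicted class (last col = background, i.e., missed objects)."""
    bg = num_classes  # index of the background row/column
    dcm = np.zeros((num_classes + 1, num_classes + 1), dtype=int)
    matched_pred = set()
    for g_box, g_lab in zip(gt_boxes, gt_labels):
        # greedily take the best-overlapping unmatched prediction
        best_j, best_iou = -1, iou_thr
        for j, p_box in enumerate(pred_boxes):
            if j in matched_pred:
                continue
            v = iou(g_box, p_box)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched_pred.add(best_j)
            dcm[g_lab, pred_labels[best_j]] += 1
        else:
            dcm[g_lab, bg] += 1  # object missed: counted as background
    for j, p_lab in enumerate(pred_labels):
        if j not in matched_pred:
            dcm[bg, p_lab] += 1  # spurious detection on background
    return dcm
```

Under this bookkeeping, the bottleneck described above would show up as large counts in the last (background) column of the matrix for adversarial inputs, with comparatively small off-diagonal counts among the foreground rows.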
For SSD, Faster RCNN, and YOLOX, we use the PASCAL VOC Everingham et al. (2015) dataset for analysis. For Deformable-DETR, MS-COCO Lin et al. (2014) is adopted, since we were unable to successfully train Deformable-DETR on PASCAL VOC using the official code. (However, we still provide successful training results on the VOC dataset using MMDetection code in the supplementary material.)

2.1 PRELIMINARIES

Object detection can be regarded as multi-task learning for classification and localization. Formally, a detection model f is parameterized by θ and consists of a backbone f_b and two heads, for classification H_cls and localization H_loc. Given an input image x ∈ D, the two heads yield probabilistic confidences and predicted locations for each bounding box, respectively. After Non-Maximum

