DINO: DETR WITH IMPROVED DENOISING ANCHOR BOXES FOR END-TO-END OBJECT DETECTION

Anonymous authors

Abstract

We present DINO (DETR with Improved deNoising anchOr boxes), a strong end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way of denoising training, a look-forward-twice scheme for box prediction, and a mixed query selection method for anchor initialization. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding significant improvements of +6.0 AP and +2.7 AP, respectively, over DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP) among models with fewer than 1 billion parameters. Compared to other models on the leaderboard, DINO achieves better results with a smaller model size and less pre-training data. The code will be available.

1. INTRODUCTION

Object detection is a fundamental task in computer vision. Remarkable progress has been accomplished by classical convolution-based object detection algorithms (Ren et al., 2017; Tian et al., 2019; Lin et al., 2020; Bochkovskiy et al., 2020; Ge et al., 2021). Although such algorithms normally include hand-designed components like anchor generation and non-maximum suppression (NMS), they yield the best detection models, such as DyHead (Dai et al., 2021a), Swin (Liu et al., 2021b), and SwinV2 (Liu et al., 2021a) with HTC++ (Chen et al., 2019a), as evidenced on the COCO test-dev leaderboard (pap). In contrast to classical detection algorithms, DETR (Carion et al., 2020) is a novel Transformer-based detection algorithm. It eliminates the need for hand-designed components and achieves performance comparable to optimized classical detectors like Faster RCNN (Ren et al., 2017). Different from previous detectors, DETR models object detection as a set prediction task and assigns labels by bipartite graph matching. It leverages learnable queries to probe the existence of objects and combine features from an image feature map, which functions like soft ROI pooling (Liu et al., 2022). Despite its promising performance, DETR converges slowly and the meaning of its queries is unclear. To address these problems, many methods have been proposed, such as introducing deformable attention (Zhu et al., 2021), decoupling positional and content information (Meng et al., 2021), and providing spatial priors (Gao et al., 2021; Yao et al., 2021; Wang et al., 2021). Recently, DAB-DETR (Liu et al., 2022) proposes to formulate DETR queries as dynamic anchor boxes (DAB), which bridges the gap between classical anchor-based detectors and DETR-like ones. DN-DETR (Li et al., 2022) further accelerates convergence by introducing a denoising (DN) training technique. Despite these improvements, DETR-like models are still not among the first-choice detectors in the field.
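The dynamic anchor box formulation mentioned above amounts to iterative box refinement: each decoder layer predicts an offset in unnormalized (inverse-sigmoid) space, which is added to the current anchor and squashed back into [0, 1]. The following is a minimal PyTorch sketch of that update rule; function and variable names are illustrative assumptions, not taken from any released implementation.

```python
import torch

def refine_boxes(anchors, deltas_per_layer):
    """Iteratively refine normalized (cx, cy, w, h) anchor boxes.

    anchors: (N, 4) tensor with values in (0, 1).
    deltas_per_layer: list of (N, 4) offsets, one per decoder layer,
        predicted in inverse-sigmoid (unbounded) space.
    """
    def inv_sigmoid(x, eps=1e-5):
        # clamp away from {0, 1} so the logit stays finite
        x = x.clamp(eps, 1 - eps)
        return torch.log(x / (1 - x))

    boxes = anchors
    for delta in deltas_per_layer:
        # add the layer's offset in logit space, then squash back to (0, 1)
        boxes = torch.sigmoid(inv_sigmoid(boxes) + delta)
    return boxes
```

With zero offsets the boxes are returned unchanged (up to floating-point error), which makes the update easy to sanity-check; in a real decoder each `delta` would come from a prediction head on that layer's output.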
As a DETR-like model, DINO contains a backbone, a multi-layer Transformer encoder, a multi-layer Transformer decoder, and multiple prediction heads. Following DAB-DETR, we formulate the queries in the decoder as dynamic anchor boxes and refine them step by step across decoder layers. Following DN-DETR, we feed noised ground-truth labels and boxes into the Transformer decoder layers to help stabilize bipartite matching during training. We also adopt deformable attention (Zhu et al., 2021) for its computational efficiency. Moreover, we propose three new methods. First, to reduce duplicate predictions, we propose a contrastive denoising (CDN) training method that adds both positive and negative samples of the same ground truth at the same time. After adding two different noises to the same ground-truth box, we mark the box with the smaller noise as positive and the other as negative. Contrastive denoising training helps the model predict more precise boxes and avoid duplicate outputs for the same target. Second, refining boxes in each decoder layer, as proposed in Deformable DETR, is a greedy and shortsighted strategy; to overcome this while keeping the advantage of fast convergence, we propose a new look-forward-twice scheme that corrects the updated parameters with gradients from later layers. Third, the dynamic anchor box formulation of queries links DETR-like models with classical two-stage models. Hence we propose a mixed query selection method, which helps better initialize the queries: we select initial anchor boxes as positional queries from the output of the encoder, similar to (Zhu et al., 2021; Yao et al., 2021), but keep the content queries learnable, as in the CDN part, which encourages the first decoder layer to focus on the spatial prior.
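As a rough illustration of the contrastive denoising idea, the sketch below generates one positive and one negative noised query per ground-truth box: positives receive noise below a threshold lambda1 and are trained to reconstruct the ground truth, while negatives receive noise between lambda1 and lambda2 and are trained to predict "no object". All names and default values here are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch

def cdn_queries(gt_boxes, lambda1=0.5, lambda2=1.0):
    """Generate positive/negative noised boxes for contrastive denoising.

    gt_boxes: (N, 4) tensor of normalized (cx, cy, w, h) boxes.
    Returns (pos, neg), both (N, 4): pos has small noise (< lambda1),
    neg has larger noise (between lambda1 and lambda2).
    """
    # random sign per coordinate, in {-1, +1}
    sign = torch.randint(0, 2, gt_boxes.shape).float() * 2 - 1
    # positive queries: noise magnitude below lambda1
    pos_mag = torch.rand_like(gt_boxes) * lambda1
    # negative queries: noise magnitude in (lambda1, lambda2)
    neg_mag = lambda1 + torch.rand_like(gt_boxes) * (lambda2 - lambda1)
    # scale the noise by the box size so it is relative to each object
    wh = gt_boxes[:, 2:].repeat(1, 2)
    pos = (gt_boxes + sign * pos_mag * wh).clamp(0, 1)
    neg = (gt_boxes + sign * neg_mag * wh).clamp(0, 1)
    return pos, neg
```

In training, both groups are fed to the decoder alongside the matching queries; because the negatives sit close to, but not on, a real object, rejecting them teaches the model to suppress near-duplicate predictions of the same target.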



Figure 1: AP on COCO compared with other detection models. (a) Comparison to models with a ResNet-50 backbone w.r.t. training epochs. Models marked with DC5 use a dilated, larger-resolution feature map; the other models use multi-scale features. (b) Comparison to SOTA models w.r.t. pre-training data size and model size. SOTA models are taken from the COCO test-dev leaderboard. In the legend we list the backbone pre-training data size (first number) and detection pre-training data size (second number). * means the data size is not disclosed.

The best detection models nowadays are based on improved classical detectors like DyHead (Dai et al., 2021b) and HTC (Chen et al., 2019a). For example, the best result presented in SwinV2 (Liu et al., 2021a) was trained with the HTC++ (Chen et al., 2019a; Liu et al., 2021b) framework. Two main reasons contribute to this phenomenon: 1) previous DETR-like models are inferior to the improved classical detectors, which have been well studied and highly optimized; 2) the performance of DETR-like models has not been tested on large backbones with large-scale pre-training data. We aim to address both concerns in this paper. Specifically, by improving the denoising training, query initialization, and box prediction, we design a new DETR-like model based on DN-DETR, DAB-DETR, and Deformable DETR. We name our model DINO (DETR with Improved deNoising anchOr boxes). As shown in Fig. 1, the comparison on COCO shows the superior performance of DINO. In particular, DINO sets a new record of 63.3 AP among models with fewer than 1 billion parameters on the COCO test-dev leaderboard (pap).

We validate the effectiveness of DINO with extensive experiments on the COCO (Lin et al., 2014) detection benchmark. As shown in Fig. 1, DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs with ResNet-50 and multi-scale features, yielding significant improvements of +6.0 AP and +2.7 AP, respectively, over the previous best DETR-like model, DN-DETR. In addition, DINO scales well in both model size and data size. After pre-training on the Objects365 (Shao et al., 2019) dataset with a SwinL (Liu et al., 2021b) backbone, DINO achieves impressive results on both the COCO val2017 (63.2 AP) and test-dev (63.3 AP) benchmarks, as shown in Table 4. DINO has about 1/15 the model size of SwinV2-G (Liu et al., 2021a). Moreover, DINO outperforms Florence (Yuan et al., 2021) with only 1/60 of its backbone pre-training data and 1/5 of its detection pre-training data.

