DINO: DETR WITH IMPROVED DENOISING ANCHOR BOXES FOR END-TO-END OBJECT DETECTION Anonymous

Abstract

We present DINO (DETR with Improved deNoising anchOr boxes), a strong end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a look forward twice scheme for box prediction, and a mixed query selection method for anchor initialization. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding significant improvements of +6.0 AP and +2.7 AP, respectively, over DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP) among models with under 1 billion parameters. Compared to other models on the leaderboard, DINO achieves better results with a smaller model size and pre-training data size. The code will be available.

Figure 1: AP on COCO compared with other detection models. (a) Comparison to models with a ResNet-50 backbone w.r.t. training epochs. Models marked with DC5 use a dilated larger-resolution feature map; other models use multi-scale features. (b) Comparison to SOTA models w.r.t. pre-training data size and model size. SOTA models are from the COCO test-dev leaderboard. In the legend we list the backbone pre-training data size (first number) and detection pre-training data size (second number). * means the data size is not disclosed.

1. INTRODUCTION

Object detection is a fundamental task in computer vision. Remarkable progress has been accomplished by classical convolution-based object detection algorithms (Ren et al., 2017; Tian et al., 2019; Lin et al., 2020; Bochkovskiy et al., 2020; Ge et al., 2021). Although such algorithms normally include hand-designed components like anchor generation and non-maximum suppression (NMS), they yield the best detection models such as DyHead (Dai et al., 2021a), Swin (Liu et al., 2021b), and SwinV2 (Liu et al., 2021a) with HTC++ (Chen et al., 2019a), as evidenced on the COCO test-dev leaderboard (pap). In contrast to classical detection algorithms, DETR (Carion et al., 2020) is a novel Transformer-based detection algorithm. It eliminates the need for hand-designed components and achieves performance comparable to optimized classical detectors like Faster RCNN (Ren et al., 2017). Different from previous detectors, DETR models object detection as a set prediction task and assigns labels by bipartite graph matching. It leverages learnable queries to probe the existence of objects and combine features from an image feature map, which functions like soft ROI pooling (Liu et al., 2022). Despite its promising performance, DETR converges slowly and the meaning of its queries is unclear. To address such problems, many methods have been proposed, such as introducing deformable attention (Zhu et al., 2021), decoupling positional and content information (Meng et al., 2021), and providing spatial priors (Gao et al., 2021; Yao et al., 2021; Wang et al., 2021). Recently, DAB-DETR (Liu et al., 2022) proposes to formulate DETR queries as dynamic anchor boxes (DAB), which bridges the gap between classical anchor-based detectors and DETR-like ones. DN-DETR (Li et al., 2022) further accelerates convergence by introducing a denoising (DN) technique. These improvements promote the development of DETR-like models, yet such models are still not among the first-choice detectors in the field.
The best detection models nowadays are based on improved classical detectors like DyHead (Dai et al., 2021b) and HTC (Chen et al., 2019a). For example, the best result presented in SwinV2 (Liu et al., 2021a) was trained with the HTC++ (Chen et al., 2019a; Liu et al., 2021b) framework. Two main reasons contribute to this phenomenon: 1) Previous DETR-like models are inferior to the improved classical detectors. Most classical detectors have been well studied and highly optimized, leading to better performance compared with the newly developed DETR-like models. 2) The performance of DETR-like models has not been tested on large backbones with large-scale pre-training data. We aim to address both concerns in this paper. Specifically, by improving the denoising training, query initialization, and box prediction, we design a new DETR-like model based on DN-DETR, DAB-DETR, and Deformable DETR. We name our model DINO (DETR with Improved deNoising anchOr boxes). As shown in Fig. 1, the comparison on COCO demonstrates the superior performance of DINO. In particular, DINO sets a new record of 63.3 AP for models with fewer than 1 billion parameters on the COCO test-dev leaderboard (pap). As a DETR-like model, DINO contains a backbone, a multi-layer Transformer encoder, a multi-layer Transformer decoder, and multiple prediction heads. Following DAB-DETR, we formulate the queries in the decoder as dynamic anchor boxes and refine them step-by-step across decoder layers. Following DN-DETR, we add ground-truth labels and boxes with noise into the Transformer decoder layers to help stabilize bipartite matching during training. We also adopt deformable attention (Zhu et al., 2021) for its computational efficiency. Moreover, we propose three new methods as follows. First, to reduce duplicate predictions, we propose a contrastive denoising training method that adds both positive and negative samples of the same ground truth at the same time.
After adding two different noises to the same ground-truth box, we mark the box with the smaller noise as positive and the other as negative. The contrastive denoising training helps the model predict more precise boxes and avoid duplicate outputs for the same target. Second, to overcome the shortsightedness of refining boxes in each decoder layer, which is a greedy scheme proposed in Deformable DETR, while keeping the advantages of fast convergence, we propose a new look forward twice scheme that corrects the updated parameters with gradients from later layers. Third, the dynamic anchor box formulation of queries links DETR-like models with classical two-stage models. Hence we propose a mixed query selection method, which helps better initialize the queries. We select initial anchor boxes as positional queries from the output of the encoder, similar to (Zhu et al., 2021; Yao et al., 2021). However, we leave the content queries learnable, aligned with the CDN part where content queries are also learnable, which encourages the first decoder layer to focus on the spatial prior. We validate the effectiveness of DINO with extensive experiments on the COCO (Lin et al., 2014) detection benchmarks. As shown in Fig. 1, DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs with ResNet-50 and multi-scale features, yielding significant improvements of +6.0 AP and +2.7 AP, respectively, over the previous best DETR-like model DN-DETR. In addition, DINO scales well in both model size and data size. After pre-training on the Objects365 (Shao et al., 2019) dataset with a SwinL (Liu et al., 2021b) backbone, DINO achieves impressive results on both the COCO val2017 (63.2 AP) and test-dev (63.3 AP) benchmarks, as shown in Table 4. Our DINO reduces the model size to 1/15 of SwinV2-G (Liu et al., 2021a). Moreover, DINO outperforms Florence (Yuan et al., 2021) with only 1/60 of its backbone pre-training data and 1/5 of its detection pre-training data.
To summarize, our contributions are three-fold. 1) We design a new end-to-end DETR-like object detector with several novel techniques, including contrastive denoising training, look forward twice, and mixed query selection for different parts of the DINO model. 2) We conduct intensive ablation studies to validate the effectiveness of different design choices in DINO. As a result, DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs with ResNet-50 and multi-scale features, significantly outperforming the previous best DETR-like model DN-DETR. 3) We show that, without bells and whistles, DINO achieves the best performance on public benchmarks among models with under 1 billion parameters. After pre-training on the Objects365 (Shao et al., 2019) dataset with a SwinL (Liu et al., 2021b) backbone, DINO achieves 63.2 AP on the COCO val2017 and 63.3 AP on the COCO test-dev benchmarks.

2. RELATED WORK

Classical Object Detectors: Early convolution-based object detectors are either two-stage or one-stage models, based on hand-crafted anchors or reference points. Two-stage models (Ren et al., 2015; He et al., 2017) usually use a region proposal network (RPN) (Ren et al., 2015) to propose potential boxes, which are then refined in the second stage. One-stage models (Redmon & Farhadi, 2017; 2018) directly output offsets relative to predefined anchors. Recently, some convolution-based models such as HTC++ (Chen et al., 2019a) and DyHead (Dai et al., 2021a) have achieved top performance on the COCO 2017 benchmark (Lin et al., 2014). The performance of convolution-based models, however, relies on the way they generate anchors, and they need hand-designed components like NMS. DETR and Its Variants: Carion et al. (Carion et al., 2020) proposed a Transformer-based end-to-end object detector named DETR (DEtection TRansformer) without using hand-designed components like anchor design and NMS. Many follow-up papers have attempted to address the slow training convergence of DETR caused by decoder cross-attention. For instance, Dai et al. (Dai et al., 2021a) proposed a dynamic decoder to focus on important regions from multiple feature levels. Another line of work aims at a deeper understanding of decoder queries in DETR. Many papers associate queries with spatial positions from different perspectives. Deformable DETR (Zhu et al., 2021) predicts 2D anchor points and designs a deformable attention module that only attends to certain sampling points around a reference point. DAB-DETR (Liu et al., 2022) further extends 2D anchor points to 4D anchor box coordinates to represent queries and dynamically updates boxes in each decoder layer. Recently, DN-DETR (Li et al., 2022) introduced a denoising training method to speed up DETR training. It feeds noise-added ground-truth labels and boxes into the decoder and trains the model to reconstruct the original ones.
Our work is based on DAB-DETR and DN-DETR, and also adopts deformable attention for its computational efficiency. Large-scale Pre-training for Object Detection: The best-performing detectors nowadays are mostly built with large backbones pre-trained on large-scale data. For example, SwinV2 (Liu et al., 2021a) extends its backbone size to 3.0 billion parameters and pre-trains its models with 70M privately collected images. Florence (Yuan et al., 2021) first pre-trains its backbone with 900M privately curated image-text pairs and then pre-trains its detector with 9M images with annotated or pseudo boxes. In contrast, DINO achieves better results with only a publicly available SwinL (Liu et al., 2021b) backbone and the public Objects365 (Shao et al., 2019) dataset (1.7M annotated images).

3.1. PRELIMINARIES

As studied in Conditional DETR (Meng et al., 2021) and DAB-DETR (Liu et al., 2022), queries in DETR (Carion et al., 2020) are formed by two parts: a positional part and a content part, which are referred to as positional queries and content queries in this paper. DAB-DETR explicitly formulates each positional query in DETR as a 4D anchor box (x, y, w, h), where x and y are the center coordinates of the box and w and h correspond to its width and height. Such an explicit anchor box formulation makes it easy to dynamically refine anchor boxes layer by layer in the decoder. DN-DETR (Li et al., 2022) additionally feeds noised ground-truth (GT) labels and boxes into the Transformer decoder and trains the model to reconstruct the ground-truth ones. The noise (∆x, ∆y, ∆w, ∆h) is constrained by |∆x| < λw/2, |∆y| < λh/2, |∆w| < λw, and |∆h| < λh, where (x, y, w, h) denotes a GT box and λ is a hyper-parameter that controls the scale of the noise. Since DN-DETR views decoder queries as anchors, a noised GT box can be viewed as a special anchor with a GT box nearby, as λ is usually small. In addition to the original DETR queries, DN-DETR adds a DN part which feeds noised GT labels and boxes into the decoder to provide an auxiliary DN loss. The DN loss effectively stabilizes and speeds up DETR training and can be plugged into any DETR-like model. Deformable DETR (Zhu et al., 2021) is another early work to speed up the convergence of DETR. To compute deformable attention, it introduces the concept of a reference point so that deformable attention can attend to a small set of key sampling points around a reference. The reference point concept makes it possible to develop several techniques to further improve DETR performance. The first technique is query selection (or "two-stage"), which selects features and reference boxes from the encoder as direct inputs to the decoder.
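The noise constraint above can be sketched as follows. This is a minimal illustration, not DN-DETR's actual implementation: the function name and the choice of uniform sampling are our assumptions; the paper only specifies the bounds on (∆x, ∆y, ∆w, ∆h).

```python
import random

def add_box_noise(box, lam):
    """Add center and size noise to a GT box (x, y, w, h), bounded as in
    DN-DETR: |dx| < lam*w/2, |dy| < lam*h/2, |dw| < lam*w, |dh| < lam*h.
    (Uniform sampling within the bounds is an illustrative assumption.)"""
    x, y, w, h = box
    dx = random.uniform(-lam * w / 2, lam * w / 2)
    dy = random.uniform(-lam * h / 2, lam * h / 2)
    dw = random.uniform(-lam * w, lam * w)
    dh = random.uniform(-lam * h, lam * h)
    return (x + dx, y + dy, w + dw, h + dh)
```

For small λ, the noised box stays close to the GT box, which is why it can be treated as a "special anchor" with a GT box nearby.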
The second technique is iterative bounding box refinement with a careful gradient detachment design between two decoder layers. We call this gradient detachment technique "look forward once" in our paper. Following DAB-DETR and DN-DETR, DINO formulates the positional queries as dynamic anchor boxes and is trained with an extra DN loss. DINO additionally introduces three methods, which will be described in Sec. 3.3, Sec. 3.4, and Sec. 3.5, respectively. 

3.2. MODEL OVERVIEW

As a DETR-like model, DINO is an end-to-end architecture that contains a backbone, a multi-layer Transformer (Vaswani et al., 2017) encoder, a multi-layer Transformer decoder, and multiple prediction heads. The overall pipeline is shown in Fig. 2. Given an image, we extract multi-scale features with a backbone and then feed them into the Transformer encoder with corresponding positional embeddings. After feature enhancement with the encoder layers, we propose a new mixed query selection strategy to initialize anchors as positional queries for the decoder. Note that this strategy does not initialize content queries but leaves them learnable. More details of mixed query selection are given in Sec. 3.5. With the initialized anchors and the learnable content queries, we use deformable attention (Zhu et al., 2021) to combine the features of the encoder outputs and update the queries layer by layer. The final outputs are formed with refined anchor boxes and classification results predicted by refined content features. As in DN-DETR, we have an extra DN branch to perform denoising training. Beyond the standard DN method, we propose a new contrastive denoising training approach that takes hard negative samples into account, which will be presented in Sec. 3.3. To overcome the shortsightedness of the greedy box refinement in previous works, a novel look forward twice method is proposed to pass gradients between adjacent layers, which will be described in Sec. 3.4. DN-DETR is effective in stabilizing training and accelerating convergence. With the help of DeNoising (DN) queries, it learns to make predictions based on noised Ground-Truth (GT) boxes, which leads to fast convergence. However, each DN query in DN-DETR is matched with a GT box and lacks the ability to predict background for "no object".
Since predicting background is also important for DETR-like models to reduce duplicate predictions, we propose a Contrastive DeNoising (CDN) approach to reject hard negative examples. To maximize the utilization of denoising queries, we also propose to use an adaptive number of denoising groups.

3.3. CONTRASTIVE DENOISING TRAINING

Implementation: DN-DETR has a hyper-parameter λ to control the noise scale. The generated noises are no larger than λ, as DN-DETR wants the model to reconstruct the ground truth (GT) from moderately noised queries. In our method, we have two hyper-parameters λ1 and λ2, where λ1 < λ2. As shown by the concentric squares in Fig. 3, we generate two types of CDN queries: positive queries and negative queries. Positive queries within the inner square have a noise scale smaller than λ1 and are expected to reconstruct their corresponding ground-truth boxes. Negative queries between the inner and outer squares have a noise scale larger than λ1 and smaller than λ2. They are expected to predict "no object". We usually adopt a small λ2 because hard negative samples closer to GT boxes are more helpful for suppressing duplicate predictions. As shown in Fig. 3, each CDN group has a set of positive queries and a set of negative queries. If an image has n GT boxes, a CDN group will have 2 × n queries, with each GT box generating a positive and a negative query. Similar to DN-DETR, we also use multiple CDN groups to improve the effectiveness of our method. The reconstruction losses are the ℓ1 and GIOU losses for box regression and focal loss (Lin et al., 2020) for classification. The loss to classify negative samples as background is also focal loss. Furthermore, to better utilize DN queries, we replace DN-DETR's fixed number of denoising groups with an adaptive number of denoising groups. For each image, we fix the total number of denoising queries at N. For an image with n objects, the number of CDN groups is N/(2n). Analysis: CDN works because it explicitly introduces hard negative examples that are very similar to positive examples. Such negative examples encourage the model to learn the subtle differences between positive and negative boxes, leading to more precise box predictions.
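The CDN grouping described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is ours, noise scales are drawn uniformly (the paper only specifies the λ1/λ2 intervals), and the actual implementation operates on embeddings rather than tuples.

```python
import random

def make_cdn_queries(gt_boxes, total_queries, lam1, lam2):
    """Build contrastive denoising (CDN) groups for one image.

    Each GT box yields one positive query (noise scale < lam1) and one
    negative query (noise scale between lam1 and lam2).  The number of
    groups adapts as N / (2n) so the total query count stays near
    `total_queries` regardless of how many objects the image has."""
    n = len(gt_boxes)
    num_groups = max(1, total_queries // (2 * n))
    queries = []
    for _ in range(num_groups):
        for box in gt_boxes:
            pos_scale = random.uniform(0, lam1)     # inner square: reconstruct GT
            neg_scale = random.uniform(lam1, lam2)  # ring: predict "no object"
            queries.append((box, pos_scale, "positive"))
            queries.append((box, neg_scale, "negative"))
    return queries
```

With N = 100 and n = 5 objects, this produces 10 groups of 10 queries each, so denoising capacity is spent evenly across images with few or many objects.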
The ability to distinguish positive and negative examples also enables the model to further reduce duplicate predictions beyond plain DETR. DETR eliminates the need for NMS to suppress duplicate boxes. Instead, it relies on bipartite matching to pick only one query for each GT box and suppresses other queries by pushing them away or lowering their confidence. However, the suppressed queries are normally not hard negatives. As a result, DETR cannot completely avoid duplicate boxes, especially low-confidence ones. CDN addresses this issue by introducing explicitly designed negative queries, which further enhance the effect of bipartite matching in avoiding duplicate boxes. For example, on the COCO dataset, we compare CDN with its counterpart DN, both using 300 predictions. The numbers of duplicate predictions for each method are shown in Table 3.3. For all thresholds from 0 to 0.3, CDN consistently predicts fewer duplicate boxes than DN. For each model, we choose the top 300 predictions and filter them according to confidence scores with 7 thresholds from 0 to 0.3. For each threshold t_i, "total" and "duplicate" denote the numbers of total and duplicate predictions with scores greater than t_i, respectively. We view predictions with IoU > 0.8 as duplicate predictions.

Figure 4: Box update in (a) look forward once and (b) look forward twice. In (a), gradients are stopped between decoder layers; in (b), gradients from later layers also flow to the preceding layer.
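The duplicate-counting protocol above can be sketched as follows. The IoU helper and the greedy "count a box as duplicate if it overlaps an earlier, higher-scored box" pairing are our assumptions; the paper only states the IoU > 0.8 criterion.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def count_duplicates(preds, score_thresh, iou_thresh=0.8):
    """Count predictions that overlap an earlier, higher-scored prediction
    by more than iou_thresh.  `preds` is a list of (box, score) pairs."""
    kept = [(b, s) for b, s in preds if s > score_thresh]
    kept.sort(key=lambda p: -p[1])  # highest score first
    duplicates = 0
    for i, (box, _) in enumerate(kept):
        if any(iou(box, kept[j][0]) > iou_thresh for j in range(i)):
            duplicates += 1
    return len(kept), duplicates
```

Sweeping `score_thresh` from 0 to 0.3 over a model's top 300 predictions reproduces the "total"/"duplicate" counts per threshold described in the text.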

3.4. LOOK FORWARD TWICE

We propose a new approach to improving box prediction in this section. The iterative box refinement in Deformable DETR blocks gradient back-propagation to stabilize training. We name that method look forward once, since the parameters of the i-th decoder layer L_i are updated based only on the auxiliary loss of b'_i, the boxes predicted in L_i. Such a parameter update approach is greedy: each decoder layer approximates the ground-truth boxes individually while avoiding any influence on its previous layers by blocking gradients. This stabilizes training and helps convergence in the early training stages, but it may lead to a sub-optimal result. On the other hand, allowing gradients to propagate from all later layers makes the model hard to converge. To address this issue, we propose to allow L_{i-1} to be influenced by gradients from itself and L_i only, as shown in Fig. 4(b). Since the parameters in L_{i-1} are optimized to approximate the ground-truth boxes in both L_{i-1} and L_i, we name our method look forward twice, which is more comprehensive than the look forward once method. Implementation: We compare the implementations of look forward once (LFO) and look forward twice (LFT) as follows. Since LFO and LFT share the same process from b'_{i-1} to b'_i, we first show this process. Denote b'_{i-1} and b_{i-1} as the boxes before and after stopping gradients, i.e., b_{i-1} = sg[b'_{i-1}], where sg[·] denotes the stop-gradient operation. b_{i-1} is used as the input anchor box in L_i to obtain the offset

∆b_i = L_i(b_{i-1}; θ_i),

where L_i denotes the i-th decoder layer with parameters θ_i (we omit its other inputs for simplicity). b'_i is then obtained as

b'_i = σ(σ⁻¹(b_{i-1}) + ∆b_i),

where σ(·) and σ⁻¹(·) denote the sigmoid and inverse sigmoid functions. Such a box update guarantees that the updated boxes have normalized x, y, w, h values between 0 and 1. This update is marked with a green line in Fig. 4(a), where gradients are propagated from b_i^(pred) to θ_i through ∆b_i. In LFO, the prediction b_i^(pred) is simply b'_i. In LFT, by contrast, we compute the box prediction b_i^(pred) based on b'_{i-1} instead of b_{i-1}:

b_i^(pred) = σ(σ⁻¹(b'_{i-1}) + ∆b_i),

so that the gradients of b_i^(pred) are propagated both to θ_i through ∆b_i and to the previous layer through b'_{i-1}. Accordingly,

b_{i+1}^(pred) = σ(σ⁻¹(b'_i) + ∆b_{i+1}) = σ(σ⁻¹(b'_{i-1}) + ∆b_i + ∆b_{i+1}).

Fig. 4(c) shows a comparison of the performance of look forward once (LFO) and look forward twice (LFT) at different layers. For layers 0 to 2, LFO performs better than LFT, while LFT exceeds LFO in layers 3 to 6. This observation verifies our intuition that LFT sacrifices performance in early layers to achieve a better final performance.

3.5. MIXED QUERY SELECTION

In DETR (Carion et al., 2020) and DN-DETR (Li et al., 2022), decoder queries are static embeddings that do not take any encoder features from an individual image, as shown in Fig. 5(a). They learn anchors or positional queries from the training data and set the content queries to 0 vectors; a common implementation of these static queries is to make them learnable. Deformable DETR (Zhu et al., 2021) learns both the positional and content queries, which is another implementation of static query initialization. To further improve performance, Deformable DETR (Zhu et al., 2021) has a query selection variant (or "two-stage"). It selects the positions with the top-K classification scores as reference points, and the content queries are a linear transform of the positional embeddings of the reference points. In addition, the features at the selected positions go through a classification head and a box head to compute an auxiliary loss. We call the implementation in Deformable DETR vanilla query selection, as shown in Fig. 5(b), where the selected reference points go through a positional encoding and a linear transform to obtain the query embeddings.
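The box-update equations can be checked numerically. The sketch below (plain Python; all names are ours) verifies the key identity: since stop-gradient is the identity in the forward pass, LFO and LFT produce the same box values and differ only in where gradients flow, and chaining two updates in sigmoid space is the same as adding both offsets before one sigmoid.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inv_sigmoid(p):
    return math.log(p / (1.0 - p))

def update(box, delta):
    """Box update in sigmoid space: keeps the coordinate in (0, 1)."""
    return sigmoid(inv_sigmoid(box) + delta)

# One normalized box coordinate refined by two decoder layers.
b_prev = 0.40               # b'_{i-1}
d_i, d_next = 0.30, -0.10   # offsets Delta b_i and Delta b_{i+1}

# Forward pass: sg[.] is the identity in value, so LFO and LFT agree.
b_i = update(b_prev, d_i)            # b'_i (the LFO prediction for layer i)
lft_pred_i = update(b_prev, d_i)     # LFT prediction, built from b'_{i-1}
lft_pred_next = update(b_i, d_next)  # b_{i+1}^(pred)

# Chaining identity: sigma(inv_sigma(b'_{i-1}) + d_i + d_next).
chained = sigmoid(inv_sigmoid(b_prev) + d_i + d_next)
assert abs(lft_pred_next - chained) < 1e-9
```

In a real implementation (e.g., with an autodiff framework), LFO would apply the stop-gradient before the update so only ∆b_i carries gradient, while LFT would apply it one step later so b'_{i-1} also receives gradient from layer i's loss.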
Vanilla query selection helps the model converge, especially in early training epochs. However, its content queries are not aligned with those in the CDN part, whose content queries are learnable class embeddings. Therefore, we propose to use the selected positions only as anchors and to keep learnable query embeddings as the content queries. We call our method mixed query selection. We show in Table 5 that this simple and intuitive method achieves better results.
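The distinction can be sketched as follows. This is a simplified illustration (function name and list-based representation are ours): positional queries are image-dependent, taken from the top-K encoder proposals, while content queries stay as the same learnable embeddings for every image.

```python
def mixed_query_selection(enc_scores, enc_boxes, content_embed, k):
    """Sketch of mixed query selection for one image.

    Anchors (positional queries) come from the top-k encoder proposals,
    ranked by classification score; content queries are the same learnable
    embeddings for every image (here modeled as a list of k vectors)."""
    order = sorted(range(len(enc_scores)), key=lambda i: -enc_scores[i])
    top_k = order[:k]
    anchors = [enc_boxes[i] for i in top_k]  # image-dependent positions
    contents = content_embed[:k]             # image-independent, learnable
    return anchors, contents
```

Vanilla query selection would instead derive `contents` from the selected positions (a linear transform of their positional embeddings), coupling content initialization to possibly imperfect proposals; mixed query selection avoids that coupling.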

4.1. SETUP

Dataset and Backbone: We conduct evaluation on the COCO 2017 object detection dataset (Lin et al., 2014), which is split into train2017 and val2017 (also called minival). We report results with two different backbones: ResNet-50 (He et al., 2016) pre-trained on ImageNet-1k (Deng et al., 2009) and SwinL (Liu et al., 2021b) pre-trained on ImageNet-22k (Deng et al., 2009). DINO with ResNet-50 is trained on train2017 without extra data, while DINO with SwinL is first pre-trained on Objects365 (Shao et al., 2019) and then fine-tuned on train2017. We also report test-dev results for DINO with SwinL. Implementation Details: In Appendix F, we provide implementation details, including all the hyper-parameters and engineering techniques used in our models. We compare DINO with strong baselines, including both convolution-based methods (Ren et al., 2015; Chen et al., 2019a; Dai et al., 2021a) and DETR-like methods (Carion et al., 2020; Zhu et al., 2021; Dai et al., 2021b; Liu et al., 2022; Li et al., 2022). For a fair comparison, we report both GFLOPs and FPS tested on the same NVIDIA A100 GPU for all the models listed in Table 2. All methods except DETR and DAB-DETR use multi-scale features. For those without multi-scale features, we report their results with ResNet-DC5, which performs better thanks to its dilated larger-resolution feature map. Since some methods adopt 5 scales of feature maps and some adopt 4, we report our results with both 4 and 5 scales of feature maps.


As shown in Table 2, our method yields an improvement of +5.6 AP under the same setting using ResNet-50 with 4-scale feature maps and +6.0 AP with 5-scale feature maps. Our 4-scale model does not introduce much overhead in computation or parameter count. Moreover, our method performs especially well on small objects, gaining +7.2 AP with 4 scales and +7.5 AP with 5 scales. Comparison with the best models with a ResNet-50 backbone: To validate the effectiveness of our method in improving both convergence speed and performance, we compare it with several strong baselines using the same ResNet-50 backbone. Instead of the most common 50-epoch setting, we adopt the 24-epoch (2×) and 36-epoch (3×) settings, since our method converges fast and yields only a small additional gain from 50-epoch training. The results in Table 3 show that, using only 24 epochs, our method achieves improvements of +1.8 AP and +2.7 AP with 4 and 5 scales, respectively. Moreover, with 36 epochs in the 3× setting, the improvements increase to +2.3 and +2.6 AP with 4 and 5 scales, respectively. The convergence curve comparison is shown in Fig. 6. We also show our results using the SwinL backbone, without bells and whistles, in Appendix B.

4.3. COMPARISON WITH SOTA MODELS

To compare with SOTA results, we use the publicly available SwinL (Liu et al., 2021b) backbone pre-trained on ImageNet-22K. We first pre-train DINO on the Objects365 (Shao et al., 2019) dataset and then fine-tune it on COCO. DINO outperforms the previous best models on the leaderboard, which are built on improved classical detectors such as HTC++ (Chen et al., 2019a) and DyHead (Dai et al., 2021a). It is the first time that an end-to-end Transformer detector is established as a SOTA model on the leaderboard (pap). Effectiveness of New Algorithm Components: We validate the effectiveness of our proposed methods in Table 5. We build an optimized DN-Deformable-DETR as our strong baseline, which performs better than the one in Table 2. We include all the pipeline optimizations and engineering techniques (see Section 4.1 and Appendix F) in the strong baseline. The result of the strong baseline is shown in Table 5, Row 1. According to Table 5, our three new methods in DINO further improve performance significantly, even without considering any engineering techniques.

D TRAINING EFFICIENCY

We provide the GPU memory and training time for our base model in Table 7. All results are reported on 8 NVIDIA A100 GPUs with ResNet-50 (He et al., 2016). The results demonstrate that our models are not only effective but also efficient to train. Analysis of the Number of Encoder and Decoder Layers: We also investigate the influence of varying the numbers of encoder and decoder layers. As shown in Table 8, decreasing the number of decoder layers hurts performance more significantly. For example, keeping the same 6 encoder layers while decreasing the number of decoder layers from 6 to 2 leads to a 3.0 AP drop. This drop is expected, as the boxes are dynamically updated and refined through each decoder layer to obtain the final results. Moreover, compared with other DETR-like models such as Dynamic DETR (Dai et al., 2021a), whose performance drops by 13.8 AP (29.1 vs 42.9) when the number of decoder layers is decreased to 2, the performance drop of DINO is much smaller. This is because our mixed query selection approach feeds the selected boxes from the encoder into the decoder to enhance the decoder queries. The decoder queries are therefore well initialized and not deeply coupled with decoder-layer refinement. Analysis of Query Denoising: We further investigate the influence of query denoising by varying the number of denoising queries. We use the optimized dynamic denoising groups (detailed in Appendix F.1). As shown in Table 9, when we use fewer than 100 denoising queries, increasing the number leads to a significant performance improvement. However, increasing the DN number beyond 100 yields only a small additional improvement or even hurts performance. We also analyze the effect of the number of encoder and decoder layers in Appendix E. We pre-train DINO on Objects365 for 26 epochs using 64 NVIDIA A100 GPUs and fine-tune the model on COCO for 18 epochs using 16 NVIDIA A100 GPUs. Each GPU has a local batch size of only 1 image.
In the fine-tuning stage, we enlarge the image size to 1.5× (i.e., with a max size of 1200 × 2000). This adds around 0.5 AP to the final result. To reduce GPU memory usage, we leverage checkpointing (Chen et al., 2016) and mixed precision (Micikevicius et al., 2018). We use the L1 loss and GIOU (Rezatofighi et al., 2019) loss for box regression and focal loss (Lin et al., 2020) with α = 0.25, γ = 2 for classification. As in DETR (Carion et al., 2020), we add auxiliary losses after each decoder layer. Similar to Deformable DETR (Zhu et al., 2021), we add extra intermediate losses after the query selection module, with the same components as for each decoder layer. We use the same loss coefficients as in DAB-DETR (Liu et al., 2022) and DN-DETR (Li et al., 2022): 1.0 for the classification loss, 5.0 for the L1 loss, and 2.0 for the GIOU loss.
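The box-regression terms above can be sketched as follows. The GIoU computation follows the standard definition from Rezatofighi et al. (2019); the function names and the single-box `box_loss` wrapper are our own illustrative assumptions.

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format:
    GIoU = IoU - (area of enclosing box not covered by the union) / (enclosing area)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (enclose - union) / enclose

def box_loss(pred, gt, w_l1=5.0, w_giou=2.0):
    """Weighted box regression loss with the coefficients stated in the text
    (5.0 for L1, 2.0 for GIoU); classification loss (weight 1.0) omitted."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, gt))
```

Unlike plain IoU, GIoU is negative for disjoint boxes (approaching -1 as they move apart), so the GIoU loss still provides a useful gradient signal when a predicted box does not overlap its target at all.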

F.3.3 DETAILED MODEL COMPONENTS.

We also optimize the detection pipeline used in DAB-DETR (Liu et al., 2022) and DN-DETR (Li et al., 2022). Following DN-Deformable-DETR (Li et al., 2022), we use the same multi-scale approach as Deformable DETR (Zhu et al., 2021) and adopt deformable attention. DN-DETR uses different prediction heads with unshared parameters in different decoder layers. In addition, we introduce dynamic denoising groups to increase denoising training efficiency and alleviate memory overhead (see Appendix F.1). In this work, we find that using a shared prediction head yields an additional performance improvement. It also reduces the parameter count by about one million. Finally, we find that the conditional queries (Meng et al., 2021) used in DAB-DETR do not suit our model, so we do not include them in our final model.

F.3.4 TRAINING AUGMENTATION.

We use the same random crop and scale augmentation during training as DETR (Carion et al., 2020). For example, we randomly resize an input image so that its shorter side is between 480 and 800 pixels and its longer side is at most 1333. For DINO with SwinL, we pre-train the model using the default setting but fine-tune using a 1.5× larger scale (shorter side between 720 and 1200 pixels and longer side at most 2000 pixels) to compare with models on the leaderboard (pap). Without using any other tricks, we achieve a result of 63.1 on val2017 and 63. For our 4-scale models, we extract features from stages 2, 3, and 4 of the backbone and add an extra feature by down-sampling the output of stage 4. An additional feature map from backbone stage 1 is used for our 5-scale models. For the hyper-parameters, we set λ1 = 1.0 and λ2 = 2.0 and use 100 CDN pairs, which contain 100 positive queries and 100 negative queries.
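The shorter-side resize rule above can be sketched as follows. The scale-clamping logic is a common implementation choice (used, for example, in DETR-style codebases) and should be read as our assumption rather than DINO's exact code.

```python
def resize_dims(width, height, short_target, long_max):
    """Scale an image so its shorter side reaches `short_target`, then
    clamp the scale so the longer side does not exceed `long_max`."""
    short, long_side = min(width, height), max(width, height)
    scale = short_target / short
    if long_side * scale > long_max:
        scale = long_max / long_side
    return round(width * scale), round(height * scale)
```

For a wide 2000x500 image with `short_target=800`, the longer-side cap of 1333 dominates, so the image is scaled to 1333x333 rather than 3200x800.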

F.4 DETAILED HYPER-PARAMETERS

We list the hyper-parameters for those who want to reproduce our results in 

G INFERENCE SPEED AND GFLOPS

We list the inference cost of our 4-scale and 5-scale models with the Swin-L backbone in Table 11. Note that our model in Table 4 is a 5-scale model.

H WHY DINO IMPROVES AP ON SMALL OBJECTS BY LARGE

There are several reasons for the large AP improvement on small objects (APs).

1. In Table 5, our optimized DN-Deformable DETR has an APs of 28.2, which is +3.4 higher than that of the original DN-Deformable DETR in Table 2. The original one uses dense attention in the decoder, while the optimized one uses deformable attention, which is better suited to local attention and therefore improves APs. In addition, we fixed a problem in the original DN-Deformable DETR's Transformer encoder: its deformable attention was not properly initialized using the initialization method of Deformable DETR. The Transformer encoder is a critical component that processes multi-scale image features, and multi-scale image features are critical for small objects. Therefore, the optimized one has a higher APs.
2. In Table 5, we can see that CDN improves APs by 1.2. By introducing negative noised queries, CDN encourages the model to pick the anchor nearest to the center of a GT box to make predictions and explicitly suppresses farther anchors. Since small object detection is more sensitive to anchor quality, DINO with high-quality anchors achieves a better APs.
3. Query selection improves APs by 1.9. Query selection provides high-quality anchor initialization, which is especially beneficial for small objects. The reason is similar to reason 2: small object detection is more sensitive to the quality of anchors.

I MORE DETAILS ABOUT LOOK FORWARD TWICE (LFT)

We propose LFT because the original look forward once (LFO) scheme for box refinement is greedy, which leads to sub-optimal results; LFT makes the model more far-sighted. But there is a trade-off: as we increase the number of layers to "look forward", the model becomes harder to converge. We conduct experiments with look forward three and four times, as shown in Table 12. The results become worse as we continue to increase the number of layers to "look forward".

The following is a detailed explanation of why LFT is worse than LFO in layers 0 to 2 but outperforms LFO in layers 3 to 6, as shown in Fig. 4. There are three factors affecting the performance of layer i.

1. The performance of layer i - 1. Since the predictions of layer i are based on predictions of layer i - 1, better predictions in layer i - 1 lead to better predictions in layer i.
2. Whether gradients are allowed to backpropagate from layer i to layer i - 1. Allowing the gradient to propagate to layer i - 1 helps the performance of layer i.
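The difference between LFO and LFT can be sketched on plain numbers (all helper names are ours; `detach` stands in for `tensor.detach()` and `update` for the paper's box-update step). Note that the two schemes produce identical forward values; they differ only in where gradients flow, which `detach` would control for real tensors:

```python
def detach(x):
    """Placeholder for tensor.detach(): identity on plain numbers."""
    return x

def update(box, delta):
    """Placeholder for the box-update step; a plain sum here."""
    return box + delta

def decoder_box_refinement(layers, b0, look_forward_twice=True):
    """Each layer predicts an offset from the detached box of the previous
    layer. With look-forward-once, layer i's auxiliary prediction is built
    only from the detached box, so its loss refines layer i alone. With
    look-forward-twice, the prediction reuses the undetached box of layer
    i-1, so the loss at layer i also reaches layer i-1's parameters."""
    preds = []
    b_detached = b0   # detached box fed into each layer
    b_live = b0       # undetached box of the previous layer
    for layer in layers:
        delta = layer(b_detached)
        if look_forward_twice:
            pred = update(b_live, delta)      # gradient reaches layer i-1 too
        else:
            pred = update(b_detached, delta)  # gradient stays within layer i
        preds.append(pred)
        b_live = update(b_detached, delta)    # undetached box for the next layer
        b_detached = detach(b_live)           # next layer sees a detached box
    return preds
```

With two toy "layers" that predict constant offsets 1.0 and 2.0 from initial box 0.0, both schemes yield the predictions [1.0, 3.0].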



The DN-DETR paper (Li et al., 2022) uses λ1 and λ2 to denote the noise scales of center shifting and box scaling, but sets λ1 = λ2. In this paper, we use λ in place of λ1 and λ2 for simplicity.

CONCLUSION

In this paper, we have presented DINO, a strong end-to-end Transformer detector with contrastive denoising training, look forward twice, and mixed query selection, which significantly improve both the training efficiency and the final detection performance. As a result, DINO outperforms all previous ResNet-50-based models on COCO val2017 in both the 12-epoch and the 36-epoch settings using multi-scale features. Motivated by this improvement, we further explored training DINO with a stronger backbone on a larger dataset and achieved a strong result, 63.3 AP on COCO 2017 test-dev. This result establishes DETR-like models as a mainstream detection framework, not only for their novel end-to-end detection optimization but also for their superior performance.



Figure 2: The framework of our proposed DINO model. Our improvements are mainly in the Transformer encoder and decoder. The top-K encoder features in the last layer are selected to initialize the positional queries for the Transformer decoder. Our decoder also contains a Contrastive DeNoising (CDN) part with both positive and negative examples.

Figure 3: The structure of the CDN group and a demonstration of positive and negative examples. Although both positive and negative examples are 4D anchors that can be represented as points in 4D space, we illustrate them as points in 2D space on concentric squares for simplicity. Assuming the square center is a GT box, points inside the inner square are regarded as positive examples and points between the inner square and the outer square are regarded as negative examples.

Figure 4: (a)(b) Comparison of box update in Deformable DETR and our method. (c) APs of look forward once and look forward twice in each decoder layer. "LFO" and "LFT" denote look forward once and look forward twice, respectively.

Equation 4 is marked with a green line in Fig. 4(b). Similarly, b_{i+1}^{(pred)} is obtained by Equation 5, which is marked with a red line in Fig. 4(b); there, the gradients from b_{i+1}^{(pred)} are allowed to propagate back to layer i.

Figure 5: Comparison of three different query initialization methods. "Static" means that the queries remain the same for different images at inference; a common implementation of such static queries is to make them learnable. Note that in (b) the selected reference points go through positional encoding and a linear transform to obtain the query embeddings, as implemented in Deformable DETR.

Figure 6: Training convergence curves evaluated on COCO val2017 for DINO and two previous state-of-the-art models with ResNet-50 using multi-scale features.

DN-DETR (Li et al., 2022) introduces a denoising (DN) training method to accelerate the training convergence of DETR-like models. It shows that the slow convergence problem in DETR is caused by the instability of bipartite matching. To mitigate this problem, DN-DETR proposes to feed noised ground-truth labels and boxes into the decoder as additional denoising queries, whose reconstruction losses bypass bipartite matching.

For a fair comparison, we only change CDN to DN and keep other hyper-parameters unchanged.

Results for DINO and other detection models with the ResNet-50 backbone on COCO val2017 trained with 12 epochs (the so-called 1× setting). For models without multi-scale features, we test GFLOPS and FPS for their best model, ResNet-50-DC5. DINO uses 900 queries. † indicates models that use 900 queries, or 300 queries with 3 patterns, which has a similar effect to 900 queries. Other DETR-like models except DETR (100 queries) use 300 queries. * indicates models tested with the mmdetection (Chen et al., 2019b) framework.

Results for DINO and other detection models with the ResNet-50 backbone on COCO val2017 trained with more epochs (24, 36, or more).



Compared with these leaderboard models, we use a much smaller model size (1/15 the parameters of SwinV2-G (Liu et al., 2021a)), backbone pre-training data size (1/60 the images of Florence), and detection pre-training data size (1/5 the images of Florence), while achieving better results. In addition, our reported performance without test time augmentation (TTA) is a neat result without bells and whistles. These results effectively show the superior detection performance of DINO compared with traditional detectors.

Ablation comparison of the proposed algorithm components. We use the terms "QS", "CDN", and

Training efficiency for different models with the ResNet-50 backbone. All models are trained with 8 Nvidia A100 GPUs. All results are reported by us. * The results of Faster RCNN are tested with the mmdetection framework. ⋆ We use the vanilla Deformable DETR without two-stage and bbox refinement during testing.

Ablation on the numbers of encoder layers and decoder layers with the ResNet-50 backbone on COCO val2017. We use the 12-epoch setting and 100 DN queries without negative samples here.

Ablation on the number of denoising queries with the ResNet-50 backbone on COCO validation. Note that 100 CDN query pairs contain 200 queries: 100 positive and 100 negative.

In DN-DETR, all the GT objects (label + box) in one image are collected as one GT group for denoising. To improve DN training efficiency, multiple noised versions of the GT group in an image are used during training. In DN-DETR, the number of groups is set to five or ten according to different model sizes. As DETR-like models adopt mini-batch training, the total number of DN queries for each image in one batch is padded to the largest one in the batch. Considering that the number of objects in a COCO image ranges from 1 to 80, this design is inefficient and results in excessive memory consumption. To address this problem, we propose to fix the number of DN queries and dynamically adjust the number of groups for each image according to its number of objects.

Objects365 (Shao et al., 2019) is a large-scale detection dataset with over 1.7M annotated images for training and 80,000 annotated images for validation. To use the data more efficiently, we select the first 5,000 of the 80,000 validation images as our validation set and add the others to training.
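Our reading of the dynamic group scheme can be sketched as follows (a hypothetical helper of our own, not the paper's code: the DN query budget per image is fixed and the group count is derived from each image's object count):

```python
def dynamic_dn_groups(num_gt_per_image, total_dn_queries=100):
    """For each image, fit as many full noised GT groups as the fixed DN
    query budget allows, instead of fixing the group count and padding to
    the largest image in the batch."""
    groups = []
    for num_gt in num_gt_per_image:
        if num_gt == 0:
            groups.append(0)  # nothing to denoise
        else:
            groups.append(max(1, total_dn_queries // num_gt))
    return groups
```

With a budget of 100 queries, an image with 1 object gets 100 noised groups while an image with 80 objects gets a single group, so memory per image stays roughly constant.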

during training. Moreover, we use 1000 DN queries for this large model. We use 6 layers in the Transformer decoder and 256 as the hidden feature dimension. We set the initial learning rate (lr) to 1 × 10^-4 and adopt a simple lr scheduler that drops the lr at the 11-th, 20-th, and 30-th epoch by multiplying it by 0.1 for the 12-, 24-, and 36-epoch settings with ResNet-50, respectively. We use the AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) optimizer with a weight decay of 1 × 10^-4 and train our model on Nvidia A100 GPUs with batch size 16. Since DN-DETR (Li et al., 2022) adopts 300 decoder queries and 3 patterns (Wang et al., 2021), we use 300 × 3 = 900 decoder queries with the same computation cost. Learning schedules of our DINO with SwinL are available in the appendix.
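The step schedule described above can be sketched as follows (a standalone helper of our own, not the training code):

```python
def learning_rate(epoch, total_epochs, base_lr=1e-4):
    """Multiply the lr by 0.1 at epoch 11, 20, or 30 for the 12-, 24-, and
    36-epoch ResNet-50 settings, respectively."""
    drop_at = {12: 11, 24: 20, 36: 30}[total_epochs]
    return base_lr * (0.1 if epoch >= drop_at else 1.0)
```

For example, in the 12-epoch setting the lr stays at 1e-4 through epoch 10 and drops to 1e-5 at epoch 11.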

Hyper-parameters used in our models.

We achieve 63.2 on test-dev without test time augmentation (TTA) (see Appendix C), outperforming the previous state-of-the-art result of 63.1 achieved by SwinV2 (Liu et al., 2021a) with a much neater solution.



The inference speed and computation cost of our large model.

Experiments with look forward three times (LF3) and four times (LF4).

The results on LVIS val v1.0. * denotes zero-shot results. † denotes that DINO is trained for 12 epochs and is not fully converged.

A OPTIMIZED DN-DEFORMABLE DETR

The optimized DN-Deformable DETR differs from the original DN-Deformable DETR in three ways. First, the optimized DN-Deformable DETR adopts deformable attention in both the encoder and the decoder, while the original one adopts deformable attention only in the encoder. With deformable attention in the decoder, the optimized one can use more decoder queries; we use 900 here. Second, the optimized one uses different weights for the matcher and the loss, while the original one follows DETR in using the same weights for both. For example, we use a class weight of 1.0 for the loss and 2.0 for the matcher. Finally, we set the dropout rate to 0. We find that these three technical changes improve performance.
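As a concrete sketch, the decoupled weighting could be configured as below. The class weights come from the text; the L1 and GIOU matcher weights are an assumption of ours, reusing the loss coefficients reported earlier (5.0 for L1, 2.0 for GIOU):

```python
# Class weights from the text; the L1/GIOU matcher weights are assumed to
# reuse the loss coefficients (5.0 for L1, 2.0 for GIOU).
loss_weights = {"class": 1.0, "l1": 5.0, "giou": 2.0}
matcher_weights = {"class": 2.0, "l1": 5.0, "giou": 2.0}
```

The Hungarian matcher would score candidate assignments with `matcher_weights`, while the gradients are computed with `loss_weights`.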

B RESULTS USING SWINL BACKBONE WITHOUT PRE-TRAINING ON OBJECTS365

We also evaluate our method on COCO val2017 with SwinL as the backbone, without pre-training on Objects365. The results are without any bells and whistles. We compare with other methods using the Swin-L backbone.

C TEST TIME AUGMENTATIONS (TTA)

We aim to build an end-to-end detector that is free from hand-crafted components. However, to compare with traditional detection models, we also explore the use of TTA in DETR-like models. We only use it in our large model with the SwinL backbone. Our TTA does not obtain an inspiring gain compared with traditional detectors, but we hope our exploration may provide some insights for future studies.

We adopt multi-scale test and horizontal flip as TTA. However, our way of ensembling different augmentations differs from that of traditional methods, which usually output duplicate boxes. In traditional methods, ensembling is done by first gathering predictions from all augmentations and ranking them by a confidence score; duplicate boxes are then found and eliminated by NMS or box voting. Predictions from all augmentations are gathered first because duplicate boxes appear not only among different augmentations but also within a single augmentation. This ensembling method decreases the performance of our method: DETR-like methods are not prone to outputting duplicate boxes, since their set-based prediction loss inhibits duplicate predictions, and ensembling may incorrectly remove true positive predictions (Carion et al., 2020).

To address this issue, we designed a one-to-one ensembling method. Assume we have n augmentations Aug_0, Aug_1, ..., Aug_{n-1}, where Aug_i has predictions O_i and a pre-defined hyper-parameter weight w_i, and b_j^i, l_j^i, and s_j^i denote the j-th bounding box, label, and score in O_i, respectively. We let Aug_0 be the main augmentation, which is the most reliable one. For each prediction in O_0, we select the prediction with the highest IOU from the predictions of each of the other augmentations O_1, ..., O_{n-1}, requiring the IOU to be higher than a predefined threshold.
Finally, we ensemble the selected boxes through a weighted average as follows:

b = ( Σ_{i=0}^{n-1} I_i · w_i · b^i_{idx(i)} ) / ( Σ_{i=0}^{n-1} I_i · w_i ),

where I_i = 1 when there is at least one box in O_i with IOU higher than the threshold and I_i = 0 otherwise, and idx(i) denotes the index of the selected box in O_i.

3. Whether gradients are allowed to backpropagate from layer i + 1 to layer i. Allowing the gradient to propagate from layer i + 1 to layer i jeopardizes the performance of layer i.

For layer 0, there is no layer i - 1, so factors 1 and 2 do not affect the performance. According to factor 3, LFO is better than LFT. For layers 1 to 5, LFO has an advantage in factor 3 and LFT has an advantage in factor 2. Because factor 2 affects the performance more than factor 3, the gap between LFT and LFO narrows from layer 0 to 2, and LFT exceeds LFO at layer 3. In the last layer (i = 6), there is no layer i + 1 (factor 3 does not affect the result) and LFT has the advantage in both factors 1 and 2. Therefore, LFT exceeds LFO in the last layer.

J VISUALIZATIONS

We present a comparison of visualizations in Fig. 7. The results show that our DINO has better predictions than DN-DETR.
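The one-to-one ensembling procedure can be sketched as follows (a simplified sketch with hypothetical names; labels and scores are omitted for brevity, and boxes are (x1, y1, x2, y2) tuples):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def ensemble(main_preds, other_preds, weights, thresh=0.5):
    """One-to-one TTA ensembling: for each box of the main augmentation,
    take the highest-IOU box from each other augmentation (if above
    `thresh`) and average all matched boxes, weighted by the augmentation
    weights. `weights[0]` belongs to the main augmentation."""
    fused = []
    for b0 in main_preds:
        boxes, ws = [b0], [weights[0]]
        for preds, w in zip(other_preds, weights[1:]):
            if not preds:
                continue  # I_i = 0: this augmentation contributes nothing
            best = max(preds, key=lambda b: iou(b0, b))
            if iou(b0, best) >= thresh:
                boxes.append(best)
                ws.append(w)
        total = sum(ws)
        fused.append(tuple(sum(w * b[k] for b, w in zip(boxes, ws)) / total
                           for k in range(4)))
    return fused
```

Note that each main-augmentation box is fused with at most one box per other augmentation, so no NMS-style suppression is needed and true positives cannot be removed.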

K LVIS RESULTS

To evaluate DINO's performance on other detection datasets, we conducted experiments on the more challenging LVIS (Gupta et al., 2019) dataset.

