DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION

Abstract

DETR has recently been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.

1. INTRODUCTION

Modern object detectors employ many hand-crafted components (Liu et al., 2020), e.g., anchor generation, rule-based training target assignment, and non-maximum suppression (NMS) post-processing. They are not fully end-to-end. Recently, Carion et al. (2020) proposed DETR to eliminate the need for such hand-crafted components, and built the first fully end-to-end object detector, achieving very competitive performance. DETR utilizes a simple architecture, combining convolutional neural networks (CNNs) and Transformer (Vaswani et al., 2017) encoder-decoders. It exploits the versatile and powerful relation modeling capability of Transformers to replace the hand-crafted rules, under properly designed training signals.
Despite its interesting design and good performance, DETR has its own issues: (1) It requires many more training epochs to converge than existing object detectors. For example, on the COCO (Lin et al., 2014) benchmark, DETR needs 500 epochs to converge, which is around 10 to 20 times slower than Faster R-CNN (Ren et al., 2015). (2) DETR delivers relatively low performance at detecting small objects. Modern object detectors usually exploit multi-scale features, where small objects are detected from high-resolution feature maps. Meanwhile, high-resolution feature maps lead to unacceptable complexities for DETR.
The above-mentioned issues can be mainly attributed to the deficit of Transformer components in processing image feature maps. At initialization, the attention modules cast nearly uniform attention weights to all the pixels in the feature maps. Many training epochs are needed for the attention weights to learn to focus on sparse meaningful locations. On the other hand, the attention weight computation in the Transformer encoder is quadratic in the number of pixels. Thus, processing high-resolution feature maps incurs very high computational and memory complexity.
In the image domain, deformable convolution (Dai et al., 2017) is a powerful and efficient mechanism for attending to sparse spatial locations. It naturally avoids the above-mentioned issues. However, it lacks the element relation modeling mechanism, which is the key to the success of DETR.
In this paper, we propose Deformable DETR, which mitigates the slow convergence and high complexity issues of DETR. It combines the best of the sparse spatial sampling of deformable convolution and the relation modeling capability of Transformers. We propose the deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features, without the help of FPN (Lin et al., 2017a). In Deformable DETR, we utilize (multi-scale) deformable attention modules to replace the Transformer attention modules processing feature maps, as shown in Fig. 1.
Deformable DETR opens up possibilities for us to exploit variants of end-to-end object detectors, thanks to its fast convergence and its computational and memory efficiency. We explore a simple and effective iterative bounding box refinement mechanism to improve the detection performance. We also try a two-stage Deformable DETR, where the region proposals are also generated by a variant of Deformable DETR and are then fed into the decoder for iterative bounding box refinement.
Extensive experiments on the COCO (Lin et al., 2014) benchmark demonstrate the effectiveness of our approach. Compared with DETR, Deformable DETR can achieve better performance (especially on small objects) with 10× fewer training epochs. The proposed variant of two-stage Deformable DETR can further improve the performance. Code is released at https://github.com/fundamentalvision/Deformable-DETR.

2. RELATED WORK

Efficient Attention Mechanism. Transformers (Vaswani et al., 2017) involve both self-attention and cross-attention mechanisms. One of the most well-known concerns about Transformers is the high time and memory complexity when the number of key elements is large, which hinders model scalability in many cases. Recently, many efforts have been made to address this problem (Tay et al., 2020b), which can be roughly divided into three categories in practice.
The first category is to use pre-defined sparse attention patterns on keys. The most straightforward paradigm is restricting the attention pattern to fixed local windows. Most works (Liu et al., 2018a; Parmar et al., 2018; Child et al., 2019; Huang et al., 2019; Ho et al., 2019; Wang et al., 2020a; Hu et al., 2019; Ramachandran et al., 2019; Qiu et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020) follow this paradigm. Although restricting the attention pattern to a local neighborhood can decrease the complexity, it loses global information. To compensate, Child et al. (2019); Huang et al. (2019); Ho et al. (2019); Wang et al. (2020a) attend to key elements at fixed intervals to significantly increase the receptive field on keys. Beltagy et al. (2020); Ainslie et al. (2020); Zaheer et al. (2020) allow a small number of special tokens to access all key elements. Zaheer et al. (2020); Qiu et al. (2019) also add some pre-fixed sparse attention patterns to attend to distant key elements directly.
The second category is to learn data-dependent sparse attention. Kitaev et al. (2020) propose a locality sensitive hashing (LSH) based attention, which hashes both the query and key elements to different bins. A similar idea is proposed by Roy et al. (2020), where k-means finds the most related keys. Tay et al. (2020a) learn block permutation for block-wise sparse attention.
The third category is to explore the low-rank property in self-attention. Wang et al. (2020b) reduce the number of key elements through a linear projection on the size dimension instead of the channel dimension. Katharopoulos et al. (2020); Choromanski et al. (2020) rewrite the calculation of self-attention through kernelization approximation.
In the image domain, the designs of efficient attention mechanism (e.g., Parmar et al. (2018) 2019) admit such approaches are much slower in implementation than traditional convolution with the same FLOPs (at least 3× slower), due to the intrinsic limitation in memory access patterns.
On the other hand, as discussed in Zhu et al. (2019a), there are variants of convolution, such as deformable convolution (Dai et al., 2017; Zhu et al., 2019b) and dynamic convolution (Wu et al., 2019), that can also be viewed as self-attention mechanisms. In particular, deformable convolution operates much more effectively and efficiently on image recognition than Transformer self-attention. Meanwhile, it lacks the element relation modeling mechanism.
Our proposed deformable attention module is inspired by deformable convolution, and belongs to the second category. It only focuses on a small fixed set of sampling points predicted from the feature of query elements. Different from Ramachandran et al. (2019); Hu et al. (2019), deformable attention is just slightly slower than traditional convolution under the same FLOPs.
Multi-scale Feature Representation for Object Detection. One of the main difficulties in object detection is to effectively represent objects at vastly different scales. Modern object detectors usually exploit multi-scale features to accommodate this. As one of the pioneering works, FPN (Lin et al., 2017a) proposes a top-down path to combine multi-scale features. PANet (Liu et al., 2018b) further adds a bottom-up path on top of FPN. Kong et al. (2018) combine features from all scales by a global attention operation. Zhao et al. (2019) propose a U-shape module to fuse multi-scale features. Recently, NAS-FPN (Ghiasi et al., 2019) and Auto-FPN (Xu et al., 2019) are proposed to automatically design cross-scale connections via neural architecture search. Tan et al. (2020) propose the BiFPN, which is a repeated simplified version of PANet. Our proposed multi-scale deformable attention module can naturally aggregate multi-scale feature maps via the attention mechanism, without the help of these feature pyramid networks.

3. REVISITING TRANSFORMERS AND DETR

Multi-Head Attention in Transformers. Transformers (Vaswani et al., 2017) are a network architecture based on attention mechanisms for machine translation. Given a query element (e.g., a target word in the output sentence) and a set of key elements (e.g., source words in the input sentence), the multi-head attention module adaptively aggregates the key contents according to the attention weights that measure the compatibility of query-key pairs. To allow the model to focus on contents from different representation subspaces and different positions, the outputs of different attention heads are linearly aggregated with learnable weights. Let $q \in \Omega_q$ index a query element with representation feature $z_q \in \mathbb{R}^{C}$, and $k \in \Omega_k$ index a key element with representation feature $x_k \in \mathbb{R}^{C}$, where $C$ is the feature dimension, and $\Omega_q$ and $\Omega_k$ specify the sets of query and key elements, respectively. Then the multi-head attention feature is calculated by
$$\text{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W'_m x_k \Big], \tag{1}$$
where $m$ indexes the attention head, and $W'_m \in \mathbb{R}^{C_v \times C}$ and $W_m \in \mathbb{R}^{C \times C_v}$ are learnable weights ($C_v = C/M$ by default). The attention weights $A_{mqk} \propto \exp\{ z_q^T U_m^T V_m x_k / \sqrt{C_v} \}$ are normalized as $\sum_{k \in \Omega_k} A_{mqk} = 1$, in which $U_m, V_m \in \mathbb{R}^{C_v \times C}$ are also learnable weights. To disambiguate different spatial positions, the representation features $z_q$ and $x_k$ are usually the concatenation or summation of element contents and positional embeddings.
There are two known issues with Transformers. One is that Transformers need long training schedules before convergence. Suppose the numbers of query and key elements are $N_q$ and $N_k$, respectively. Typically, with proper parameter initialization, $U_m z_q$ and $V_m x_k$ follow a distribution with mean 0 and variance 1, which makes the attention weights $A_{mqk} \approx 1/N_k$ when $N_k$ is large. This leads to ambiguous gradients for input features. Thus, long training schedules are required so that the attention weights can focus on specific keys.
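As an illustration of Eq. 1, the following is a minimal NumPy sketch of multi-head attention for a single query. The function name `multi_head_attn`, the argument layout, and the random test tensors are assumptions made for this example and are not taken from the released code.

```python
import numpy as np

def multi_head_attn(z_q, x, U, V, W_prime, W):
    """Sketch of Eq. 1 for one query.

    z_q:      (C,)        query feature
    x:        (N_k, C)    key features
    U, V:     (M, C_v, C) query/key projections per head
    W_prime:  (M, C_v, C) value projection per head
    W:        (M, C, C_v) output projection per head
    Returns the aggregated feature (C,) and the weights A (M, N_k).
    """
    M, C_v, _ = U.shape
    out = np.zeros(W.shape[1])
    weights = []
    for m in range(M):
        # Compatibility of the query with every key, scaled by sqrt(C_v).
        logits = (x @ V[m].T) @ (U[m] @ z_q) / np.sqrt(C_v)
        a = np.exp(logits - logits.max())   # numerically stable softmax
        a /= a.sum()                        # sum over keys equals 1
        values = x @ W_prime[m].T           # (N_k, C_v)
        out += W[m] @ (a @ values)          # aggregate, then project to C
        weights.append(a)
    return out, np.stack(weights)

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
C, M, N_k = 16, 4, 10
C_v = C // M
out, A = multi_head_attn(
    rng.standard_normal(C), rng.standard_normal((N_k, C)),
    rng.standard_normal((M, C_v, C)), rng.standard_normal((M, C_v, C)),
    rng.standard_normal((M, C_v, C)), rng.standard_normal((M, C, C_v)),
)
```

Every key contributes to every query here, which is the source of the $N_q N_k C$ cost term.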
In the image domain, where the key elements are usually image pixels, $N_k$ can be very large and the convergence is tedious.
On the other hand, the computational and memory complexity of multi-head attention can be very high with numerous query and key elements. The computational complexity of Eq. 1 is $O(N_q C^2 + N_k C^2 + N_q N_k C)$. In the image domain, where the query and key elements are both pixels, $N_q = N_k \gg C$, and the complexity is dominated by the third term, $O(N_q N_k C)$. Thus, the multi-head attention module suffers from a quadratic complexity growth with the feature map size.
DETR. DETR (Carion et al., 2020) is built upon the Transformer encoder-decoder architecture, combined with a set-based Hungarian loss that forces unique predictions for each ground-truth bounding box via bipartite matching. We briefly review the network architecture as follows.
Given the input feature maps $x \in \mathbb{R}^{C \times H \times W}$ extracted by a CNN backbone (e.g., ResNet (He et al., 2016)), DETR exploits a standard Transformer encoder-decoder architecture to transform the input feature maps into features of a set of object queries. A 3-layer feed-forward neural network (FFN) and a linear projection are added on top of the object query features (produced by the decoder) as the detection head. The FFN acts as the regression branch to predict the bounding box coordinates $b \in [0, 1]^4$, where $b = \{b_x, b_y, b_w, b_h\}$ encodes the normalized box center coordinates, box height and width (relative to the image size). The linear projection acts as the classification branch to produce the classification results.
For the Transformer encoder in DETR, both query and key elements are pixels in the feature maps. The inputs are ResNet feature maps (with encoded positional embeddings). Let $H$ and $W$ denote the feature map height and width, respectively. The computational complexity of self-attention is $O(H^2 W^2 C)$, which grows quadratically with the spatial size.
For the Transformer decoder in DETR, the input includes both feature maps from the encoder and $N$ object queries represented by learnable positional embeddings (e.g., $N = 100$). There are two types of attention modules in the decoder, namely, cross-attention and self-attention modules. In the cross-attention modules, object queries extract features from the feature maps. The query elements are the object queries, and the key elements are the output feature maps from the encoder. Here $N_q = N$, $N_k = H \times W$, and the complexity of the cross-attention is $O(HWC^2 + NHWC)$. The complexity grows linearly with the spatial size of feature maps. In the self-attention modules, object queries interact with each other, so as to capture their relations. The query and key elements are both the object queries. Here $N_q = N_k = N$, and the complexity of the self-attention module is $O(2NC^2 + N^2 C)$. The complexity is acceptable with a moderate number of object queries.
DETR is an attractive design for object detection, which removes the need for many hand-designed components. However, it also has its own issues. These issues can be mainly attributed to the deficits of Transformer attention in handling image feature maps as key elements: (1) DETR has relatively low performance in detecting small objects. Modern object detectors use high-resolution feature maps to better detect small objects. However, high-resolution feature maps would lead to an unacceptable complexity for the self-attention module in the Transformer encoder of DETR, which has a quadratic complexity with the spatial size of input feature maps. (2) Compared with modern object detectors, DETR requires many more training epochs to converge. This is mainly because the attention modules processing image features are difficult to train. For example, at initialization, the cross-attention modules attend almost uniformly to the whole feature maps.
While, at the end of the training, the attention maps are learned to be very sparse, focusing only on the object 

4.1. DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION

Deformable Attention Module. The core issue of applying Transformer attention to image feature maps is that it looks over all possible spatial locations. To address this, we present a deformable attention module. Inspired by deformable convolution (Dai et al., 2017; Zhu et al., 2019b), the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps, as shown in Fig. 2. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated.
Given an input feature map $x \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $z_q$ and a 2-d reference point $p_q$. The deformable attention feature is calculated by
$$\text{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m x(p_q + \Delta p_{mqk}) \Big], \tag{2}$$
where $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total number of sampled keys ($K \ll HW$). $\Delta p_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k$-th sampling point in the $m$-th attention head, respectively. The scalar attention weight $A_{mqk}$ lies in the range $[0, 1]$, normalized by $\sum_{k=1}^{K} A_{mqk} = 1$. $\Delta p_{mqk} \in \mathbb{R}^2$ are 2-d real numbers with unconstrained range. As $p_q + \Delta p_{mqk}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing $x(p_q + \Delta p_{mqk})$. Both $\Delta p_{mqk}$ and $A_{mqk}$ are obtained via linear projection over the query feature $z_q$. In implementation, the query feature $z_q$ is fed to a linear projection operator of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta p_{mqk}$, and the remaining $MK$ channels are fed to a softmax operator to obtain the attention weights $A_{mqk}$.
The deformable attention module is designed for processing convolutional feature maps as key elements.
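To make Eq. 2 concrete, here is a minimal NumPy sketch of single-scale deformable attention for one query. It is written for readability rather than speed. The helper names (`bilinear_sample`, `deform_attn`), the zero-padding outside the feature map, and the convention that points are given as (x, y) in pixel units are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def bilinear_sample(feat, p):
    """Bilinearly sample feat (C, H, W) at fractional pixel p = (x, y).
    Neighbours outside the map contribute zero (an assumption)."""
    C, H, W = feat.shape
    x, y = p
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C)
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wy * wx * feat[:, yi, xi]
    return out

def deform_attn(z_q, p_q, x, W_proj, b_proj, W_prime, W):
    """Sketch of Eq. 2 for one query.

    z_q: (C,), p_q: (2,), x: (C, H, W)
    W_proj: (3*M*K, C), b_proj: (3*M*K,)  single linear projection;
        first 2*M*K channels -> offsets, last M*K channels -> softmax logits
    W_prime: (M, C_v, C), W: (M, C, C_v)
    """
    M, C_v, C = W_prime.shape
    K = W_proj.shape[0] // (3 * M)
    proj = W_proj @ z_q + b_proj
    offsets = proj[: 2 * M * K].reshape(M, K, 2)
    logits = proj[2 * M * K:].reshape(M, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)        # sum over K equals 1 per head
    out = np.zeros(C)
    for m in range(M):
        head = np.zeros(C_v)
        for k in range(K):
            sampled = bilinear_sample(x, p_q + offsets[m, k])
            head += A[m, k] * (W_prime[m] @ sampled)
        out += W[m] @ head
    return out, A, offsets
```

Only $MK$ locations are read per query, independent of $H$ and $W$, which is what decouples the cost from the feature map size.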
Let $N_q$ be the number of query elements. When $MK$ is relatively small, the complexity of the deformable attention module is $O(2N_q C^2 + \min(HWC^2, N_q K C^2))$ (see Appendix A.1 for details). When it is applied in the DETR encoder, where $N_q = HW$, the complexity becomes $O(HWC^2)$, which is linear in the spatial size. When it is applied as the cross-attention modules in the DETR decoder, where $N_q = N$ ($N$ is the number of object queries), the complexity becomes $O(NKC^2)$, which is independent of the spatial size $HW$.
Multi-scale Deformable Attention Module. Most modern object detection frameworks benefit from multi-scale feature maps (Liu et al., 2020). Our proposed deformable attention module can be naturally extended to multi-scale feature maps.
Let $\{x^l\}_{l=1}^{L}$ be the input multi-scale feature maps, where $x^l \in \mathbb{R}^{C \times H_l \times W_l}$. Let $\hat{p}_q \in [0, 1]^2$ be the normalized coordinates of the reference point for each query element $q$. Then the multi-scale deformable attention module is applied as
$$\text{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m x^l(\phi_l(\hat{p}_q) + \Delta p_{mlqk}) \Big], \tag{3}$$
where $m$ indexes the attention head, $l$ indexes the input feature level, and $k$ indexes the sampling point. $\Delta p_{mlqk}$ and $A_{mlqk}$ denote the sampling offset and attention weight of the $k$-th sampling point in the $l$-th feature level and the $m$-th attention head, respectively. The scalar attention weight $A_{mlqk}$ is normalized by $\sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} = 1$. Here, we use normalized coordinates $\hat{p}_q \in [0, 1]^2$ for the clarity of scale formulation, in which the normalized coordinates $(0, 0)$ and $(1, 1)$ indicate the top-left and the bottom-right image corners, respectively. The function $\phi_l(\hat{p}_q)$ in Equation 3 re-scales the normalized coordinates $\hat{p}_q$ to the input feature map of the $l$-th level.
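The two ingredients that distinguish Eq. 3 from Eq. 2 can be sketched in a few lines: the re-scaling function and the joint normalization of attention weights over all $LK$ points. The plain multiplication used for `phi` below is an assumed convention (the text does not specify how pixel centres are handled), and both function names are hypothetical.

```python
import numpy as np

def phi(p_hat, H_l, W_l):
    """Assumed re-scaling: normalized (x, y) in [0, 1]^2 -> level-l pixel units."""
    return np.array([p_hat[0] * W_l, p_hat[1] * H_l])

def ms_attention_weights(logits):
    """logits: (M, L, K). Softmax jointly over the L*K points of each head,
    so that the weights of one head sum to 1 across all levels."""
    M, L, K = logits.shape
    flat = logits.reshape(M, L * K)
    flat = flat - flat.max(axis=1, keepdims=True)
    w = np.exp(flat)
    w /= w.sum(axis=1, keepdims=True)
    return w.reshape(M, L, K)
```

Because the softmax runs over levels as well as points, a query can shift attention mass between resolutions, which is how information is exchanged across scales.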
The multi-scale deformable attention is very similar to the previous single-scale version, except that it samples $LK$ points from multi-scale feature maps instead of $K$ points from single-scale feature maps.
The proposed attention module degenerates to deformable convolution (Dai et al., 2017) when $L = 1$, $K = 1$, and $W'_m \in \mathbb{R}^{C_v \times C}$ is fixed as an identity matrix. Deformable convolution is designed for single-scale inputs, focusing only on one sampling point for each attention head. In contrast, our multi-scale deformable attention looks over multiple sampling points from multi-scale inputs. The proposed (multi-scale) deformable attention module can also be perceived as an efficient variant of Transformer attention, where a pre-filtering mechanism is introduced by the deformable sampling locations. When the sampling points traverse all possible locations, the proposed attention module is equivalent to Transformer attention.
Deformable Transformer Encoder. We replace the Transformer attention modules processing feature maps in DETR with the proposed multi-scale deformable attention module. Both the input and output of the encoder are multi-scale feature maps with the same resolutions. In the encoder, we extract multi-scale feature maps $\{x^l\}_{l=1}^{L-1}$ ($L = 4$) from the output feature maps of stages $C_3$ through $C_5$ in ResNet (He et al., 2016) (transformed by a 1 × 1 convolution), where $C_l$ has a resolution $2^l$ times lower than the input image. The lowest resolution feature map $x^L$ is obtained via a 3 × 3 stride 2 convolution on the final $C_5$ stage, denoted as $C_6$. All the multi-scale feature maps have $C = 256$ channels. Note that the top-down structure in FPN (Lin et al., 2017a) is not used, because our proposed multi-scale deformable attention in itself can exchange information among multi-scale feature maps. The construction of multi-scale feature maps is also illustrated in Appendix A.2. Experiments in Section 5.2 show that adding FPN does not improve the performance.
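The resolutions of the four encoder inputs follow directly from the strides described above. The sketch below computes them for a given image size. The rounding behaviour (ceiling division for the ResNet stages, padding 1 for the 3 × 3 stride 2 convolution) is an assumption of this example, as is the function name `multiscale_shapes`.

```python
import math

def multiscale_shapes(height, width, channels=256):
    """Return (C, H_l, W_l) for the four encoder input levels.

    Levels 1-3 come from ResNet stages C3, C4, C5 (strides 8, 16, 32);
    level 4 (C6) comes from a 3x3 stride-2 convolution on C5.
    """
    shapes = []
    for l in (3, 4, 5):
        stride = 2 ** l
        # Assumed rounding: ceiling division, as produced by padded convs.
        shapes.append((channels, math.ceil(height / stride), math.ceil(width / stride)))
    _, h5, w5 = shapes[-1]
    # 3x3 conv, stride 2, padding 1 (padding is an assumption): out = (in - 1) // 2 + 1
    shapes.append((channels, (h5 - 1) // 2 + 1, (w5 - 1) // 2 + 1))
    return shapes

# Example with a hypothetical 800 x 1216 input image.
shapes = multiscale_shapes(800, 1216)
```

Under these assumptions the finest level dominates the pixel count, so it also dominates the number of encoder queries.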
When the multi-scale deformable attention module is applied in the encoder, the outputs are multi-scale feature maps with the same resolutions as the input. Both the key and query elements are pixels from the multi-scale feature maps. For each query pixel, the reference point is itself. To identify which feature level each query pixel lies in, we add a scale-level embedding, denoted as $e_l$, to the feature representation, in addition to the positional embedding. Different from the positional embedding with fixed encodings, the scale-level embeddings $\{e_l\}_{l=1}^{L}$ are randomly initialized and jointly trained with the network.
Deformable Transformer Decoder. There are cross-attention and self-attention modules in the decoder. The query elements for both types of attention modules are object queries. In the cross-attention modules, object queries extract features from the feature maps, where the key elements are the output feature maps from the encoder. In the self-attention modules, object queries interact with each other, where the key elements are the object queries. Since our proposed deformable attention module is designed for processing convolutional feature maps as key elements, we only replace each cross-attention module with the multi-scale deformable attention module, while leaving the self-attention modules unchanged. For each object query, the 2-d normalized coordinate of the reference point $\hat{p}_q$ is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.
Because the multi-scale deformable attention module extracts image features around the reference point, we let the detection head predict the bounding box as relative offsets w.r.t. the reference point to further reduce the optimization difficulty. The reference point is used as the initial guess of the box center. See Appendix A.3 for the details.
In this way, the learned decoder attention will have strong correlation with the predicted bounding boxes, which also accelerates the training convergence. By replacing Transformer attention modules with deformable attention modules in DETR, we establish an efficient and fast converging detection system, dubbed as Deformable DETR (see Fig. 1 ).

4.2. ADDITIONAL IMPROVEMENTS AND VARIANTS FOR DEFORMABLE DETR

Deformable DETR opens up possibilities for us to exploit variants of end-to-end object detectors, thanks to its fast convergence and its computational and memory efficiency. Due to limited space, we only introduce the core ideas of these improvements and variants here. The implementation details are given in Appendix A.4.
Iterative Bounding Box Refinement. This is inspired by the iterative refinement developed in optical flow estimation (Teed & Deng, 2020). We establish a simple and effective iterative bounding box refinement mechanism to improve detection performance. Here, each decoder layer refines the bounding boxes based on the predictions from the previous layer.
Two-Stage Deformable DETR. In the original DETR, object queries in the decoder are independent of the current image. Inspired by two-stage object detectors, we explore a variant of Deformable DETR for generating region proposals as the first stage. The generated region proposals are fed into the decoder as object queries for further refinement, forming a two-stage Deformable DETR.
In the first stage, to achieve high-recall proposals, each pixel in the multi-scale feature maps would serve as an object query. However, directly setting object queries as pixels would bring unacceptable computational and memory cost for the self-attention modules in the decoder, whose complexity grows quadratically with the number of queries. To avoid this problem, we remove the decoder and form an encoder-only Deformable DETR for region proposal generation. In it, each pixel is assigned as an object query, which directly predicts a bounding box. Top scoring bounding boxes are picked as region proposals. No NMS is applied before feeding the region proposals to the second stage.
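The iterative refinement idea can be sketched with the sigmoid / inverse-sigmoid update stated in the appendix, where each layer adds a predicted offset in logit space to the previous box. The following NumPy example is a sketch under assumptions: the function names are hypothetical, the per-layer offsets are passed in as plain arrays instead of being predicted by a network, and the gradient blocking described in the appendix is not modelled.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def inverse_sigmoid(p, eps=1e-5):
    # Clipping avoids infinities at 0 and 1 (eps is an assumed safeguard).
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def refine_box(prev_box, delta):
    """One refinement step: new box = sigmoid(delta + inverse_sigmoid(previous box)).
    Boxes are (x, y, w, h), normalized to [0, 1]."""
    return sigmoid(delta + inverse_sigmoid(prev_box))

def iterative_refinement(reference_point, deltas, init_wh=0.1):
    """Start from the reference point with an initial width/height of 0.1
    and apply one offset per decoder layer."""
    box = np.array([reference_point[0], reference_point[1], init_wh, init_wh])
    history = []
    for delta in deltas:
        box = refine_box(box, np.asarray(delta, dtype=float))
        history.append(box)
    return history
```

A zero offset leaves the box unchanged, so each layer only has to learn a correction to the previous estimate, and the sigmoid keeps every coordinate inside $[0, 1]$.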

5. EXPERIMENT

Dataset. We conduct experiments on the COCO 2017 dataset (Lin et al., 2014). Our models are trained on the train set, and evaluated on the val set and test-dev set.
Implementation Details. ImageNet (Deng et al., 2009) pre-trained ResNet-50 (He et al., 2016) is utilized as the backbone for ablations. Multi-scale feature maps are extracted without FPN (Lin et al., 2017a). $M = 8$ and $K = 4$ are set for deformable attentions by default. Parameters of the deformable Transformer encoder are shared among different feature levels. Other hyper-parameter settings and the training strategy mainly follow DETR (Carion et al., 2020), except that Focal Loss (Lin et al., 2017b) with a loss weight of 2 is used for bounding box classification, and the number of object queries is increased from 100 to 300. We also report the performance of DETR-DC5 with these modifications for a fair comparison, denoted as DETR-DC5+. By default, models are trained for 50 epochs and the learning rate is decayed at the 40-th epoch by a factor of 0.1. Following DETR (Carion et al., 2020), we train our models using the Adam optimizer (Kingma & Ba, 2015) with a base learning rate of $2 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay of $10^{-4}$. Learning rates of the linear projections, used for predicting object query reference points and sampling offsets, are multiplied by a factor of 0.1. Run time is evaluated on an NVIDIA Tesla V100 GPU.
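The learning-rate schedule described above can be written as a small helper. This is a sketch, assuming 0-indexed epochs with the decay taking effect from epoch index 40 onward. The function name and the `slow_projection` flag are hypothetical names introduced for this example.

```python
def learning_rate(epoch, slow_projection=False,
                  base_lr=2e-4, decay_epoch=40, decay_factor=0.1):
    """Learning rate for a parameter group at a given epoch.

    slow_projection marks the linear projections that predict object query
    reference points and sampling offsets; their rate is scaled by 0.1.
    Epoch indexing (0-based, decay from index 40) is an assumption.
    """
    lr = base_lr * (decay_factor if epoch >= decay_epoch else 1.0)
    if slow_projection:
        lr *= 0.1
    return lr
```

In a training script this would typically correspond to two optimizer parameter groups sharing one step-decay schedule.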

5.1. COMPARISON WITH DETR

As shown in Table 1, compared with Faster R-CNN + FPN, DETR requires many more training epochs to converge, and delivers lower performance at detecting small objects. Compared with DETR, Deformable DETR achieves better performance (especially on small objects) with 10× fewer training epochs. Detailed convergence curves are shown in Fig. 3. With the aid of iterative bounding box refinement and the two-stage paradigm, our method can further improve the detection accuracy. Our proposed Deformable DETR has FLOPs on par with Faster R-CNN + FPN and DETR-DC5. However, its runtime speed is much faster (1.6×) than DETR-DC5, and is just 25% slower than Faster R-CNN + FPN. The speed issue of DETR-DC5 is mainly due to the large amount of memory access in Transformer attention. Our proposed deformable attention can mitigate this issue, at the cost of unordered memory access. Thus, it is still slightly slower than traditional convolution.

5.2. ABLATION STUDY ON DEFORMABLE ATTENTION

Table 2 presents ablations for various design choices of the proposed deformable attention module. Using multi-scale inputs instead of single-scale inputs effectively improves detection accuracy by 1.7% AP, and by 2.9% AP S on small objects in particular. Increasing the number of sampling points K brings a further 0.9% AP improvement. Using multi-scale deformable attention, which allows information exchange among different scale levels, brings an additional 1.5% improvement in AP. Because the cross-level feature exchange is already adopted, adding FPNs does not improve the performance. When multi-scale attention is not applied, and K = 1, our (multi-scale) deformable attention module degenerates to deformable convolution, delivering noticeably lower accuracy.

5.3. COMPARISON WITH STATE-OF-THE-ART METHODS

Table 3 compares the proposed method with other state-of-the-art methods. Iterative bounding box refinement and the two-stage mechanism are both utilized by our models in Table 3. With ResNet-101 and ResNeXt-101 (Xie et al., 2017), our method achieves 48.7 AP and 49.0 AP without bells and whistles, respectively. By using ResNeXt-101 with DCN (Zhu et al., 2019b), the accuracy rises to 50.1 AP. With additional test-time augmentations, the proposed method achieves 52.3 AP.

Table 2: Ablations for deformable attention on COCO 2017 val set. "MS inputs" indicates using multi-scale inputs. "MS attention" indicates using multi-scale deformable attention. K is the number of sampling points for each attention head on each feature level.
MS inputs   MS attention   K   FPNs   AP   AP50   AP75   APS   APM   APL
4   FPN (Lin et al., 2017a)   43.8   62.6   47.8   26.5   47.3   58.1
4   BiFPN (Tan et al., 2020)   43.9   62.5   47.7   25.6   47.4

6. CONCLUSION

Deformable DETR is an end-to-end object detector, which is efficient and fast-converging. It enables us to explore more interesting and practical variants of end-to-end object detectors. At the core of Deformable DETR are the (multi-scale) deformable attention modules, which are an efficient attention mechanism for processing image feature maps. We hope our work opens up new possibilities in exploring end-to-end object detection.

A APPENDIX

A.1 COMPLEXITY FOR DEFORMABLE ATTENTION

Suppose the number of query elements is $N_q$. In the deformable attention module (see Equation 2), the complexity of calculating the sampling coordinate offsets $\Delta p_{mqk}$ and attention weights $A_{mqk}$ is $O(3 N_q C M K)$. Given the sampling coordinate offsets and attention weights, the complexity of computing Equation 2 is $O(N_q C^2 + N_q K C^2 + 5 N_q K C)$, where the factor of 5 in $5 N_q K C$ comes from bilinear interpolation and the weighted sum in attention. On the other hand, we can also calculate $W'_m x$ before sampling, as it is independent of the query, and the complexity of computing Equation 2 then becomes $O(N_q C^2 + HWC^2 + 5 N_q K C)$. So the overall complexity of deformable attention is $O(N_q C^2 + \min(HWC^2, N_q K C^2) + 5 N_q K C + 3 N_q C M K)$. In our experiments, $M = 8$, $K \le 4$ and $C = 256$ by default, thus $5K + 3MK < C$ and the complexity is $O(2 N_q C^2 + \min(HWC^2, N_q K C^2))$.
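The term counts above can be checked numerically. The sketch below simply evaluates the expressions inside the $O(\cdot)$ bounds (constants included as written, so the numbers are indicative rather than exact FLOP counts). The function names and the 100 × 152 example feature map are assumptions made for illustration.

```python
def deform_attn_cost(N_q, H, W, C=256, M=8, K=4):
    """Operation count of deformable attention, following the terms in A.1."""
    projection = 3 * N_q * C * M * K                       # offsets + attention weights
    sample_then_project = N_q * C**2 + N_q * K * C**2 + 5 * N_q * K * C
    project_then_sample = N_q * C**2 + H * W * C**2 + 5 * N_q * K * C
    return projection + min(sample_then_project, project_then_sample)

def dense_attn_cost(N_q, N_k, C=256):
    """Operation count of standard multi-head attention (Eq. 1)."""
    return N_q * C**2 + N_k * C**2 + N_q * N_k * C

# Hypothetical single feature level of 100 x 152 pixels.
H, W = 100, 152
encoder_deform = deform_attn_cost(H * W, H, W)   # N_q = HW
encoder_dense = dense_attn_cost(H * W, H * W)
decoder_deform = deform_attn_cost(300, H, W)     # N_q = 300 object queries
```

With the default $M = 8$, $K = 4$, $C = 256$, the lower-order terms satisfy $5K + 3MK = 116 < 256$, which is why they are absorbed into the $2 N_q C^2$ term.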

A.2 CONSTRUCTING MULTI-SCALE FEATURE MAPS FOR DEFORMABLE DETR

As discussed in Section 4.1 and illustrated in Figure 4, the input multi-scale feature maps of the encoder $\{x^l\}_{l=1}^{L-1}$ ($L = 4$) are extracted from the output feature maps of stages $C_3$ through $C_5$ in ResNet (He et al., 2016) (transformed by a 1 × 1 convolution). The lowest resolution feature map $x^L$ is obtained via a 3 × 3 stride 2 convolution on the final $C_5$ stage. Note that FPN (Lin et al., 2017a) is not used, because our proposed multi-scale deformable attention in itself can exchange information among multi-scale feature maps.
[Figure 4: ResNet feature maps $C_3$, $C_4$, $C_5$, each passed through a 1 × 1 stride 1 convolution, together with a 3 × 3 stride 2 convolution, form the input multi-scale feature maps $\{x^l\}_{l=1}^{4}$.]
$$\hat{b}^d_q = \{\sigma(\Delta b^d_{qx} + \sigma^{-1}(\hat{b}^{d-1}_{qx})),\ \sigma(\Delta b^d_{qy} + \sigma^{-1}(\hat{b}^{d-1}_{qy})),\ \sigma(\Delta b^d_{qw} + \sigma^{-1}(\hat{b}^{d-1}_{qw})),\ \sigma(\Delta b^d_{qh} + \sigma^{-1}(\hat{b}^{d-1}_{qh}))\},$$
where $d \in \{1, 2, ..., D\}$, and $\Delta b^d_{q\{x,y,w,h\}} \in \mathbb{R}$ are predicted at the $d$-th decoder layer. Prediction heads for different decoder layers do not share parameters. The initial box is set as $\hat{b}^0_{qx} = \hat{p}_{qx}$, $\hat{b}^0_{qy} = \hat{p}_{qy}$, $\hat{b}^0_{qw} = 0.1$, and $\hat{b}^0_{qh} = 0.1$. The system is robust to the choice of $\hat{b}^0_{qw}$ and $\hat{b}^0_{qh}$. We tried setting them as 0.05, 0.1, 0.2, 0.5, and achieved similar performance. To stabilize training, similar to Teed & Deng (2020), the gradients only back-propagate through $\Delta b^d_{q\{x,y,w,h\}}$, and are blocked at $\sigma^{-1}(\hat{b}^{d-1}_{q\{x,y,w,h\}})$.
In iterative bounding box refinement, for the $d$-th decoder layer, we sample key elements with respect to the box $\hat{b}^{d-1}_q$ predicted from the $(d-1)$-th decoder layer. For Equation 3 in the cross-attention module of the $d$-th decoder layer, $(\hat{b}^{d-1}_{qx}, \hat{b}^{d-1}_{qy})$ serves as the new reference point. The sampling offset $\Delta p_{mlqk}$ is also modulated by the box size, as $(\Delta p_{mlqkx}\, \hat{b}^{d-1}_{qw},\ \Delta p_{mlqky}\, \hat{b}^{d-1}_{qh})$. Such modifications make the sampling locations related to the center and size of previously predicted boxes.
Two-Stage Deformable DETR.
In the first stage, given the output feature maps of the encoder, a detection head is applied to each pixel. The detection head is of a 3-layer FFN for bounding box regression, and a linear projection for bounding box binary classification (i.e., foreground and background), respectively. Let i index a pixel from feature level l i ∈ {1, 2, ..., L} with 2-d normalized coordinates pi = (p ix , piy ) ∈ [0, 1] 2 , its corresponding bounding box is predicted by bi = {σ(∆b ix +σ -1 (p ix )), σ(∆b iy +σ -1 (p iy )), σ(∆b iw +σ -1 (2 li-1 s)), σ(∆b ih +σ -1 (2 li-1 s))}, where the base object scale s is set as 0.05, ∆b i{x,y,w,h} ∈ R are predicted by the bounding box regression branch. The Hungarian loss in DETR is used for training the detection head. Given the predicted bounding boxes in the first stage, top scoring bounding boxes are picked as region proposals. In the second stage, these region proposals are fed into the decoder as initial boxes for the iterative bounding box refinement, where the positional embeddings of object queries are set as positional embeddings of region proposal coordinates. Initialization for Multi-scale Deformable Attention. In our experiments, the number of attention heads is set as M = 8. In multi-scale deformable attention modules, W m ∈ R Cv×C and W m ∈ R C×Cv are randomly initialized. Weight parameters of the linear projection for predicting A mlqk and ∆p mlqk are initialized to zero. Bias parameters of the linear projection are initialized to make A mlqk = 1 LK and {∆p 1lqk = (-k, -k), ∆p 2lqk = (-k, 0), ∆p 3lqk = (-k, k), ∆p 4lqk = (0, -k), ∆p 5lqk = (0, k), ∆p 6lqk = (k, -k), ∆p 7lqk = (k, 0), ∆p 8lqk = (k, k)} (k ∈ {1, 2, ...K}) at initialization. For iterative bounding box refinement, the initialized bias parameters for ∆p mlqk prediction in the decoder are further multiplied with 1 2K , so that all the sampling points at initialization are within the corresponding bounding boxes predicted from the previous decoder layer.
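
The box update used in iterative refinement, and the per-level prior size used for first-stage proposals, can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions: the function names and the clamping constant `eps` are hypothetical, boxes are plain `(x, y, w, h)` tuples in normalized coordinates, and gradient blocking is not modeled because no autograd framework is involved.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    # eps is an assumed safeguard that keeps the logarithm finite at 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def initial_box(ref_point, size=0.1):
    """Initial box: center at the reference point, width = height = 0.1."""
    return (ref_point[0], ref_point[1], size, size)


def refine_box(prev_box, delta):
    """One refinement step, applied per coordinate:
    new = sigmoid(delta + inverse_sigmoid(prev)).
    In an autograd framework, inverse_sigmoid(prev) would be detached so
    that gradients flow only through delta."""
    return tuple(sigmoid(d + inverse_sigmoid(b)) for b, d in zip(prev_box, delta))


def modulated_offset(offset, prev_box):
    """Scale a sampling offset (dx, dy) by the previous box width and height."""
    _, _, w, h = prev_box
    return (offset[0] * w, offset[1] * h)


def proposal_prior_size(level, s=0.05):
    """First-stage prior box size 2^(l_i - 1) * s for feature level l_i >= 1."""
    return (2 ** (level - 1)) * s


if __name__ == "__main__":
    box = initial_box((0.3, 0.7))
    print(refine_box(box, (0.0, 0.0, 0.0, 0.0)))   # zero deltas leave the box unchanged
    print(refine_box(box, (1.0, 0.0, 0.5, 0.0)))   # moves the center right, widens the box
    print([proposal_prior_size(l) for l in (1, 2, 3, 4)])
```

Working in inverse-sigmoid space keeps every refined coordinate inside $(0, 1)$ regardless of the predicted delta, which is the reason the update is written this way rather than as a direct addition.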

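The initialization of sampling offsets and attention weights described above can be written out explicitly. The sketch below follows the textual description only (eight heads, one direction per head, offset magnitude $k$ for the $k$-th sampling point) and may differ from the released code; the function names and the `refine` flag are hypothetical.

```python
# One direction per head, in the order given in the text (heads 1 to 8).
HEAD_DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_offsets(num_points, refine=False):
    """Return offsets[m][k-1] = (dx, dy) for M = 8 heads and K = num_points.

    The same offsets are used for every feature level l. With refine=True,
    the offsets are multiplied by 1 / (2K), as described for the decoder
    under iterative bounding box refinement.
    """
    scale = 1.0 / (2 * num_points) if refine else 1.0
    return [[(dx * k * scale, dy * k * scale) for k in range(1, num_points + 1)]
            for dx, dy in HEAD_DIRECTIONS]


def initial_attention_weight(num_levels, num_points):
    """Uniform initial attention weight 1 / (L * K)."""
    return 1.0 / (num_levels * num_points)


if __name__ == "__main__":
    offsets = initial_offsets(num_points=4)
    print(offsets[0])   # head 1: (-k, -k) for k = 1..4
    print(initial_attention_weight(num_levels=4, num_points=4))
```

With `refine=True`, the largest offset magnitude is $K \cdot \frac{1}{2K} = \frac{1}{2}$, so after modulation by the box width and height each initial sampling point lies within half a box size of the box center.
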
A.5 WHAT DEFORMABLE DETR LOOKS AT?

To study what Deformable DETR looks at to produce the final detection result, we draw the gradient norm of each item in the final prediction (i.e., x/y coordinate of the object center, width/height of the object bounding box, and category score of the object) with respect to each pixel in the image, as shown in Fig. 5. According to Taylor's theorem, the gradient norm reflects how much the output would change in response to a perturbation of the pixel, so it shows which pixels the model mainly relies on for predicting each item. The visualization indicates that Deformable DETR looks at extreme points of the object to determine its bounding box, which is similar to the observation in DETR (Carion et al., 2020). More concretely, Deformable DETR attends to the left/right boundary of the object for the x coordinate and width, and the top/bottom boundary for the y coordinate and height. Meanwhile, different from DETR (Carion et al., 2020), our Deformable DETR also looks at pixels inside the object for predicting its category.
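
The appeal to Taylor's theorem can be spelled out as a first-order expansion. The sketch below is illustrative and rests on assumptions not stated in the text: $o$ denotes any single item of the prediction (center coordinate, width, height, or category score), $I_{ij}$ is the channel vector of the pixel at position $(i, j)$, $\delta$ is a small perturbation of the image, and the norm is taken to be the $\ell_2$ norm.

```latex
o(I + \delta) \approx o(I) + \sum_{i,j} \left\langle \frac{\partial o}{\partial I_{ij}},\, \delta_{ij} \right\rangle
\quad\Longrightarrow\quad
\left| o(I + \delta) - o(I) \right| \lesssim \sum_{i,j} \left\| \frac{\partial o}{\partial I_{ij}} \right\|_2 \left\| \delta_{ij} \right\|_2
```

The second relation follows from the Cauchy-Schwarz inequality applied per pixel. A pixel with a large gradient norm can therefore change the output more for the same perturbation size, which is what the maps in Fig. 5 display.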

A.6 VISUALIZATION OF MULTI-SCALE DEFORMABLE ATTENTION

To better understand the learned multi-scale deformable attention modules, we visualize the sampling points and attention weights of the last layer in the encoder and decoder, as shown in Fig. 6. For readability, we combine the sampling points and attention weights from feature maps of different resolutions into one picture. Similar to DETR (Carion et al., 2020), the instances are already separated in the encoder of Deformable DETR. In the decoder, however, our model focuses on the whole foreground instance instead of only the extreme points as observed in DETR (Carion et al., 2020). Combined with the visualization of $\frac{\partial c}{\partial I}$ in Fig. 5, we conjecture that the reason is that our Deformable DETR needs not only extreme points but also interior points to determine the object category. The visualization also demonstrates that the proposed multi-scale deformable attention module can adapt its sampling points and attention weights according to the different scales and shapes of the foreground object.

$W_l$: width of input feature map of $l$-th feature level
$A_{mqk}$: attention weight of $q$-th query to $k$-th key at $m$-th head
$A_{mlqk}$: attention weight of $q$-th query to $k$-th key in $l$-th feature level at $m$-th head
$z_q$: input feature of $q$-th query
$p_q$: 2-d coordinate of reference point for $q$-th query
$\hat{p}_q$: normalized 2-d coordinate of reference point for $q$-th query
$x$: input feature map (input feature of key elements)
$x_k$: input feature of $k$-th key
$x^l$: input feature map of $l$-th feature level
$\Delta p_{mqk}$: sampling offset of $q$-th query to $k$-th key at $m$-th head
$\Delta p_{mlqk}$: sampling offset of $q$-th query to $k$-th key in $l$-th feature level at $m$-th head
$W_m$: output projection matrix at $m$-th head
$U_m$: input query projection matrix at $m$-th head
$V_m$: input key projection matrix at $m$-th head
$W'_m$: input value projection matrix at $m$-th head
$\phi_l(\hat{p})$: unnormalized 2-d coordinate of $\hat{p}$ in $l$-th feature level
$\exp$: exponential function
$\sigma$: sigmoid function
$\sigma^{-1}$: inverse sigmoid function



Figure 1: Illustration of the proposed Deformable DETR object detector.


Figure 2: Illustration of the proposed deformable attention module.

Figure 3: Convergence curves of Deformable DETR and DETR-DC5 on COCO 2017 val set. For Deformable DETR, we explore different training schedules by varying the epochs at which the learning rate is reduced (where the AP score leaps).

Figure 4: Constructing multi-scale feature maps for Deformable DETR.


Figure 5: The gradient norm of each item (coordinate of object center (x, y), width/height of object bounding box w/h, category score c of this object) in the final detection result with respect to each pixel in the input image I.

Comparison of Deformable DETR with DETR on COCO 2017 val set. DETR-DC5$^+$ denotes DETR-DC5 with Focal Loss and 300 object queries.

Comparison of Deformable DETR with state-of-the-art methods on COCO 2017 test-dev set. "TTA" indicates test-time augmentations including horizontal flip and multi-scale testing.

Lookup table for notations in the paper.

ACKNOWLEDGMENTS

The work is supported by the National Key R&D Program of China (2020AAA0105200), Beijing Academy of Artificial Intelligence, and the National Natural Science Foundation of China under grant No. U19B2044 and No. 61836011.


Figure 6: Visualization of multi-scale deformable attention. For readability, we draw the sampling points and attention weights from feature maps of different resolutions in one picture. Each sampling point is marked as a filled circle whose color indicates its corresponding attention weight. The reference point is shown as a green cross marker, which is also equivalent to the query point in the encoder. In the decoder, the predicted bounding box is shown as a green rectangle, and the category and confidence score are displayed just above it.

