GROUP DETR: FAST DETR TRAINING WITH GROUP-WISE ONE-TO-MANY ASSIGNMENT

Abstract

Detection Transformer (DETR) relies on one-to-one assignment for end-to-end object detection and lacks the capability of exploiting multiple positive object queries. We present a novel DETR training approach, named Group DETR, to support one-to-many assignment in a group-wise manner. To achieve this, we make simple modifications during training: (i) adopt K groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one assignment for each group, leading to K positive object queries for each ground-truth object. In inference, we only use one group of object queries, making no modifications to model architectures or inference processes. Group DETR is a versatile training method and is applicable to various DETR variants. Our experiments show that Group DETR significantly speeds up training convergence and improves the performance of various DETR-based methods.

1. INTRODUCTION

Detection Transformer (DETR) (Carion et al., 2020) achieves end-to-end detection without the need for non-maximum suppression (NMS) (Hosang et al., 2017). It has several key designs: (i) adopt an encoder-decoder architecture based on transformer layers (Vaswani et al., 2017), (ii) introduce object queries, and (iii) perform one-to-one assignment (foot_0) by conducting bipartite matching (Kuhn, 1955) between object predictions and ground-truth objects. The original DETR suffers from slow convergence and needs 500 training epochs to achieve good performance. Various solutions have been developed to accelerate training from different aspects. For example, sparse transformers (Zhu et al., 2020b; Gao et al., 2021; Chen et al., 2022c; Roh et al., 2022) are adopted to replace dense transformers. Additional spatial modulations are introduced into object queries (Zhu et al., 2020b; Meng et al., 2021; Wang et al., 2022b; Yao et al., 2021; Liu et al., 2022a; Gao et al., 2022). Denoising modules are presented to stabilize the matching between object queries and ground-truth objects in the assignment process (Li et al., 2022; Zhang et al., 2022b).

In this paper, we propose a novel training approach, Group DETR, to accelerate DETR training convergence. Group DETR introduces group-wise one-to-many assignment: it assigns each ground-truth object to many positive object queries (one-to-many assignment (foot_1)) and separates them into multiple independent groups, keeping only one positive object query per object (one-to-one assignment) in each group. To achieve this, we make simple modifications during training: (i) adopt K groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one assignment in each group, leading to K positive object queries for each ground-truth object. The design achieves fast training convergence while maintaining the key DETR property: end-to-end object detection without NMS.
We only use one group of object queries in inference, and we do not modify either the architecture or the inference process, bringing no extra cost compared with the original method. Group DETR is a versatile training method and can be applied to various DETR-based models. Extensive experiments show that our method is effective in achieving fast training convergence (convergence curves are shown in Figure 1). Group DETR obtains consistent improvements on various DETR-based methods (Meng et al., 2021; Liu et al., 2022a; Li et al., 2022; Zhang et al., 2022b). With a 12-epoch (1×) training schedule on MS COCO (Lin et al., 2014), Group DETR significantly improves Conditional DETR-C5 by 5.0 mAP. The non-trivial improvements hold when we adopt longer training schedules (e.g., 36 epochs or 50 epochs). Moreover, Group DETR can easily outperform baseline models when applied to multi-view 3D object detection (Liu et al., 2022b; c) and instance segmentation (Cheng et al., 2021).

2. RELATED WORKS

Accelerating DETR training convergence. The success of DETR (Carion et al., 2020) in object detection validates the potential of elegant transformer-based designs in computer vision. Since DETR (Carion et al., 2020) was proposed, its slow convergence has been a critical problem that many researchers (Bar et al., 2022; Wang et al., 2022a; Song et al., 2022; Roh et al., 2022) try to address. Many works provide solutions and achieve a 10× speed-up for DETR. They mainly focus on proposing better transformer layers (Zhu et al., 2020b; Gao et al., 2021; Meng et al., 2021; Dai et al., 2021; Roh et al., 2022; Cao et al., 2022; Zhang et al., 2022a; Chen et al., 2022d) and designing new types of object queries (Zhu et al., 2020b; Meng et al., 2021; Wang et al., 2022b; Yao et al., 2021; Liu et al., 2022a; Gao et al., 2022). DN-DETR (Li et al., 2022) and DINO (Zhang et al., 2022b) attribute the slow convergence to the instability of bipartite matching (Kuhn, 1955) and present auxiliary query denoising tasks to speed up DETR training convergence. Unlike previous approaches, we show that the assignment method is critical to fast DETR training convergence. We propose Group DETR to support one-to-many assignment in a group-wise manner, achieved by simple modifications during training compared with DETR.

One-to-many assignment and one-to-one assignment. One-to-many assignment is widely adopted in modern detectors (Redmon et al., 2016; Ren et al., 2015; Liu et al., 2016; He et al., 2017; Lin et al., 2017; Cai & Vasconcelos, 2018; Chen et al., 2019; Tian et al., 2019; Zhang et al., 2020; Zhu et al., 2020a; Kim & Lee, 2020; Bochkovskiy et al., 2020; Chen et al., 2021; Ge et al., 2021). It produces duplicate predictions and needs NMS (Hosang et al., 2017; Bodla et al., 2017) for post-processing. DETR (Carion et al., 2020) explores an alternative way (one-to-one assignment) and achieves end-to-end detection, removing the need for NMS.
Recent studies (Wang et al., 2021; Sun et al., 2021b; a) show that one-to-one assignment is a key factor in achieving end-to-end detection. In contrast, we find that one-to-one assignment also impacts the training convergence of DETR-based methods (Carion et al., 2020), and in this paper we focus on exploiting assignment methods to speed up DETR training.

3. GROUP DETR

Group DETR is a training approach for accelerating DETR training convergence. We make simple modifications during training and keep the same architecture and process for inference. Figure 2 illustrates the decoder parts of DETR (Carion et al., 2020) and our Group DETR during training.

3.1. DETR

DETR has three key designs: (i) adopt a transformer encoder-decoder architecture (Vaswani et al., 2017), (ii) introduce object queries, and (iii) perform one-to-one assignment by conducting bipartite matching between object predictions and ground-truth objects.

DETR architecture. The DETR architecture consists of a backbone (e.g., ResNet (He et al., 2016), Swin Transformer (Liu et al., 2021), or others (Dosovitskiy et al., 2021; Liu et al., 2022d; Chen et al., 2022a; He et al., 2022; Chen et al., 2022b)), a transformer encoder, a transformer decoder, and object class and box position predictors (Carion et al., 2020). Figure 2 (a) shows the architecture of the transformer decoder in DETR. The image features are extracted by the backbone and the transformer encoder layers. The transformer decoder takes N object queries $\{q_1, \ldots, q_N\}$ as input. It performs self-attention on the object queries, aggregates the image features to refine the query embeddings by conducting cross-attention, and applies an FFN to obtain the output query embeddings. The output query embeddings are fed into detection heads to produce N object predictions.

One-to-one assignment. In model training, DETR performs one-to-one assignment to find the learning target for each object prediction. It employs the Hungarian algorithm (Kuhn, 1955) to find an optimal bipartite matching $\hat{\sigma}$ between the predictions and the ground-truth objects:

$\hat{\sigma} = \arg\min_{\sigma \in \xi_N} \sum_{i=1}^{N} \mathcal{C}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)}),$ (1)

where $\xi_N$ is the set of permutations of N elements and $\mathcal{C}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$ is the matching cost (Carion et al., 2020) between the ground truth $y_i$ and the prediction with index $\sigma(i)$.
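The optimization above can be illustrated with a tiny sketch. This is not DETR's implementation (which uses the Hungarian algorithm on a cost combining classification and box terms); exhaustive search over permutations is a stand-in that makes the objective explicit and is only practical for small N:

```python
from itertools import permutations


def one_to_one_assignment(cost):
    """Return the permutation sigma minimizing sum_i cost[i][sigma(i)].

    cost[i][j] is the matching cost C_match between ground-truth object i
    and prediction j. Exhaustive search over permutations stands in for
    the Hungarian algorithm; both find the same optimal matching.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost
```

For example, with `cost = [[1, 10], [10, 1]]`, the optimal matching pairs ground truth 0 with prediction 0 and ground truth 1 with prediction 1, at total cost 2.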

3.2. GROUP DETR

Group DETR makes simple modifications during training compared with DETR (Figure 2 (b)): (i) adopt K groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one assignment in each group (group-wise one-to-many assignment), leading to K positive object queries for each ground-truth object.

K groups of object queries. There are K groups of queries in the proposed Group DETR:

$G_1 = \{q_1^1, \ldots, q_N^1\},$ (2)

$\cdots$

$G_K = \{q_1^K, \ldots, q_N^K\}.$ (3)

The total K×N object queries are concatenated and fed to the transformer decoder layers (as shown in Figure 2 (b)).

Group-wise decoder self-attention. We perform group-wise self-attention in the transformer decoder layers, so that object queries do not interact with object queries across groups. The pseudocode of group-wise self-attention is given in Algorithm 1.

Group-wise one-to-many assignment. We apply one-to-one assignment to each group independently and obtain K matching results:

$\hat{\sigma}^1 = \arg\min_{\sigma \in \xi_N} \sum_{i=1}^{N} \mathcal{C}_{\mathrm{match}}(y_i, \hat{y}^1_{\sigma(i)}),$

$\cdots$

$\hat{\sigma}^K = \arg\min_{\sigma \in \xi_N} \sum_{i=1}^{N} \mathcal{C}_{\mathrm{match}}(y_i, \hat{y}^K_{\sigma(i)}),$

where $\hat{\sigma}^K$ and $\hat{y}^K_{\sigma(i)}$ are the optimal matching result and the prediction of the K-th group, respectively. During training, each group computes its loss (Carion et al., 2020) independently, and the final training loss is the average over the K groups.

Model inference. We adopt the same architectures and processes as the baseline models in inference. According to our experiments, every group achieves similar results in Group DETR (Table 5). We simply use the first group of object queries.
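The group-wise self-attention of Algorithm 1 can be realized as standard self-attention with a block-diagonal mask. The following NumPy sketch is illustrative, not the paper's code: learned projections, multi-head splitting, cross-attention, and the FFN are omitted, and the function names are our own:

```python
import numpy as np


def group_attention_mask(num_groups: int, queries_per_group: int) -> np.ndarray:
    """Block-diagonal boolean mask: True where attention is allowed.

    With K groups of N queries each, the (K*N, K*N) mask has K blocks of
    size N x N on the diagonal, so queries attend only within their group.
    """
    total = num_groups * queries_per_group
    mask = np.zeros((total, total), dtype=bool)
    for g in range(num_groups):
        s = g * queries_per_group
        mask[s:s + queries_per_group, s:s + queries_per_group] = True
    return mask


def masked_self_attention(q: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention restricted by a boolean mask.

    q: (T, C) query embeddings, reused as keys and values for brevity
    (the learned projections of a real decoder layer are omitted).
    """
    scores = q @ q.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # disallow cross-group attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ q
```

With this mask, perturbing the queries of one group leaves the outputs of every other group unchanged, which is exactly the independence the group design relies on; at inference time, keeping only the first N queries (`Q[:N]` in Algorithm 1) recovers the baseline decoder.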

4. EXPERIMENTS

We demonstrate the effectiveness of our Group DETR on object detection, instance segmentation, and multi-view 3D object detection. We adopt the same training settings and hyper-parameters as the baseline models, including the learning rate, optimizer, pre-trained model, initialization method, and data augmentations (foot_2). The number of queries in each group is the same as in the baselines.

4.1. OBJECT DETECTION

We verify our approach on object detection on MS COCO (Lin et al., 2014) and compare with representative DETR-based methods, including Conditional DETR (Meng et al., 2021), DAB-DETR (Liu et al., 2022a), DN-DETR (Liu et al., 2022a; Zhu et al., 2020b), and DINO-Deformable-DETR (Zhang et al., 2022b; Zhu et al., 2020b). We report the results with 12-epoch (1×) and 50-epoch training schedules, as well as comparisons in terms of convergence curves.

Results with a standard 1× schedule. Table 1 reports the results. Group DETR gives consistent improvements over all baseline models. In comparison to query design methods, Group DETR significantly boosts detection performance when applied to Conditional DETR (+5.0 mAP for C5 and +4.8 mAP for DC5) (Meng et al., 2021), DAB-DETR (+3.9 mAP for C5 and +4.4 mAP for DC5) (Liu et al., 2022a), and DAB-Deformable-DETR (+1.5 mAP for multiple levels of feature maps) (Liu et al., 2022a; Zhu et al., 2020b). In comparison to matching stabilization methods, Group DETR also gives non-trivial gains over DN-DETR (+2.0 mAP for C5 and +2.6 mAP for DC5) (Li et al., 2022).

Results with a 50-epoch training schedule. For DINO-Deformable-DETR (Zhang et al., 2022b; Zhu et al., 2020b), we train 36 epochs by following the settings in the original paper (Zhang et al., 2022b). As shown in Table 2, Group DETR also outperforms baseline models by large margins. When we adopt a stronger backbone, Swin-Large (Liu et al., 2021), used in the last row of Table 2, we achieve 58.4 mAP (0.4 mAP higher than the baseline DINO-Deformable-DETR, 58.0 mAP with Swin-Large), which verifies the generalization ability of our Group DETR.

Convergence curves. We report the convergence curves in Figure 1, giving each method two curves: the baseline model (dashed) and the baseline with Group DETR (bold). The comparisons in Figure 1 support that Group DETR gives a further speed-up on DETR training convergence.

4.2. MULTI-VIEW 3D OBJECT DETECTION AND INSTANCE SEGMENTATION

Multi-view 3D object detection. We adopt PETR (Liu et al., 2022b) and PETR v2 (Liu et al., 2022c) as our baseline models. Table 3 shows that Group DETR brings significant gains to PETR and PETR v2. When we train PETR v2 with a longer training schedule (36 epochs), we obtain more improvements on both nuScenes Detection Score (NDS) and mAP on the nuScenes val set (Caesar et al., 2020).

Instance segmentation. We adopt Mask2Former (Cheng et al., 2021) as the baseline and apply Group DETR to it. In Table 4, we provide comparisons of different training schedules. Group DETR achieves non-trivial improvements on Mask2Former.

4.3. ABLATION STUDIES

We conduct ablation studies on object detection by using Conditional DETR-C5 (Meng et al., 2021) as the baseline model. The studies include: the influence of the group number, the performance of each group, the assignment scheme, and the group design.

Influence of the group number. Figure 3 shows the influence of the number of groups K in Group DETR. The detection performance improves as the number of groups increases and saturates when the group number K is greater than 11. Thus, we adopt K = 11 in our experiments.

Performance in each group. Table 5 gives the performance of all groups in Group DETR. Each group achieves similar results, which is consistent with the design of independent groups. In other experiments, we simply report the result of the first group.

Assignment. Figure 4 and Table 6 study the training convergence and performance of three cases with the same number (3300) of total object queries: single-group one-to-one assignment (K = 1), single-group one-to-many assignment (K = 1 with 11 positive object queries for each ground-truth object), and group-wise one-to-many assignment with K = 11 groups (Group DETR). Figure 4 shows that Group DETR and single-group one-to-many assignment converge significantly faster than single-group one-to-one assignment. Different from single-group one-to-one assignment in DETR (Carion et al., 2020), single-group one-to-many assignment highly depends on the post-processing step NMS (Table 6). Our Group DETR w/o NMS achieves almost the same performance as single-group one-to-many assignment w/ NMS, and the inference of Group DETR is more efficient. As expected, our Group DETR also performs better than single-group one-to-one assignment.

Group design. DN-DETR and our Group DETR both adopt a group design but focus on different aspects: stabilizing the matching between predictions and ground-truth objects (DN-DETR), and exploiting multiple positive predictions for one ground-truth object (ours).
The performance comparison between DN-DETR and Group DETR in Table 7 shows that Group DETR is superior to DN-DETR. Combining DN-DETR and Group DETR further improves the performance to 40.6 mAP, implying that the two approaches are complementary.
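For reference, the NMS step that single-group one-to-many assignment depends on (and that Group DETR removes) can be sketched as greedy suppression over IoU. This is a minimal illustrative version, not a detector's optimized implementation; boxes are in (x1, y1, x2, y2) format, and the default threshold of 0.7 follows DETR:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, threshold=0.7):
    """Greedy NMS: visit boxes in descending score order and suppress any
    box whose IoU with an already-kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

With one-to-many assignment, many near-duplicate boxes survive and this suppression step becomes load-bearing; with one-to-one assignment per group, the model itself learns to emit one prediction per object.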

4.4. ANALYSES

Figure 5: Distribution of all groups of object queries with different colors. The object queries are distributed similarly for all groups.

We study the distributions of object queries in each group with Conditional DETR as an example. Figure 5 depicts the 2D reference points (positions) corresponding to the object queries, with one color per group. We can see that the reference points are distributed similarly across all groups. This explains why each group of object queries gives similar detection performance in Table 5.

Visualization of positive object queries. We visualize the positions of positive object queries in all groups in Figure 6.



Footnotes.
foot_0: One-to-one assignment: one ground-truth object is only assigned to one object query.
foot_1: One-to-many assignment: each ground-truth object can be assigned to one or more positive object queries.
foot_2: We may adjust the batch size according to the GPU memory. Note that we retrain the baseline model with the same batch size to make fair comparisons when conducting experiments.

5. CONCLUSION

In this paper, we present a simple yet effective approach, Group DETR, to accelerate DETR training convergence. We study different assignment methods and propose a novel group-wise one-to-many assignment. It has shown positive results on a variety of DETR-based methods and vision tasks.



Figure 1: Comparisons of training convergence curves. We show the training convergence curves of various DETR-based models. All experiments are conducted on MS COCO (Lin et al., 2014) with ResNet-50 (He et al., 2016) as the backbone. More results and comparisons can be found in Table 1 and Table 2. Here, we use different colors to distinguish different models and apply dashed curves and bold curves to highlight the comparisons between baseline models and their Group DETR counterparts. Best viewed in color.

Figure 2: Decoder architectures of DETR and Group DETR for training. The key differences from DETR include: Group DETR feeds K groups of queries to the decoder, conducts self-attention on each group of object queries with shared parameters, and performs one-to-one assignment for each group.

Figure 6: Visualization of positive object queries. To give a neat visualization, we only show one object per image. Each ground-truth box (green bounding box) is assigned to multiple positive queries (red points). We train Group DETR based on Conditional DETR-C5 for 50 epochs. The number of groups is 11 here. In the figure, the positive queries (red points) may overlap. Best viewed in color and zoomed in.

The visualization shows that the positions of positive object queries are distributed in a certain region on the object instance. This region is learned by the model and differs from manually selecting a center region within the bounding box. This is reasonable, since the center of the bounding box may not carry enough information about the object instance. According to the visualization, the positions of all these positive object queries are considered good ones by the model for predicting the ground-truth objects. In one-to-one assignment, only one of these object queries can be set as positive for the object, while the others are negatives. The model needs to distinguish among these similar object queries during training, which impacts model learning and leads to slow training convergence. With Group DETR, all of these object queries are set as positives, which gives stronger supervision signals during training, thereby improving training efficiency and speeding up convergence.


Table 1: Results with a 12-epoch training schedule on MS COCO. All experiments adopt ResNet-50 (He et al., 2016) as the backbone. We highlight the improvements brought by Group DETR on various DETR-based methods. Note that we do not use multiple patterns (Wang et al., 2022b) in our experiments. For DN-DETR in the table, we use the improved version of DN (dynamic DN groups (Zhang et al., 2022b)) and set the DN number to 100 (more results about the DN number can be found in Appendix B, Table 8). Thus, the baseline results of DN-DETR are slightly different from the ones (with 3 patterns) reported in the original paper (Li et al., 2022). For DINO-Deformable-DETR, we adopt the 4-scale version (Zhang et al., 2022b).

Table 2: Results with a 50-epoch training schedule on MS COCO. We adopt the training schedule of 50 epochs in the table, while for DINO-Deformable-DETR (Zhang et al., 2022b; Zhu et al., 2020b) we train 36 epochs by following the settings in the original paper (Zhang et al., 2022b). We use Swin-Large (Liu et al., 2021) in the last row of the table.

Table 4: Results on instance segmentation. We adopt Mask2Former (Cheng et al., 2021) as the baseline and report the mask mAP (mAP_m) on the MS COCO val split. We provide comparisons of different training schedules; Group DETR achieves non-trivial improvements on Mask2Former.

Table 3: Results on multi-view 3D object detection. All experiments are conducted on the nuScenes val set (Caesar et al., 2020). We train with VoVNetV2 (Lee & Park, 2020) as the backbone and an image size of 800 × 320. We follow all the settings and hyper-parameters of PETR and PETR v2.

Figure 3: Influence of the group number (K) in Group DETR. As the number of groups increases, continuous improvements over the baseline model are achieved.



Table 5: Performance in each group. We show the mAP on the MS COCO val split for different groups in Group DETR. The results obtained by different groups are similar.

Table 6: Comparisons of different assignment methods with and without NMS. Results are obtained with a 50-epoch training schedule. We set a threshold of 0.7 in NMS, following DETR.

Table 7: Comparisons of DN-DETR and Group DETR.

APPENDIX

A DATASETS AND EVALUATION METRICS

We perform the object detection and instance segmentation experiments on the COCO 2017 (Lin et al., 2014) dataset, which contains about 118K training (train) images and 5K validation (val) images. Following common practice, we report the standard mean average precision (mAP) results (box mAP for object detection and mask mAP for instance segmentation) on the COCO validation set under different IoU thresholds and object scales.

We perform multi-view 3D object detection experiments on the nuScenes (Caesar et al., 2020) dataset, which contains 1000 driving sequences: 700 for the train set, 150 for the val set, and 150 for the test set. We report the standard nuScenes Detection Score (NDS) and mean Average Precision (mAP) on the nuScenes val set.

B ADDITIONAL RESULTS

Results of DN-DETR with different numbers of denoising queries. We conduct experiments with different numbers of denoising queries in DN-DETR (Li et al., 2022). The results in Table 8 suggest that increasing the number of denoising queries cannot achieve further improvements and shows unstable performance. The effect of denoising queries differs from that of the groups in Group DETR (Figure 3): the denoising queries mainly aim to solve the instability in the matching process, while our Group DETR aims to exploit multiple positive queries for one ground-truth object. We use 100 denoising queries in our experiments in Table 1 and Table 2 by following the setting in the original paper (Zhang et al., 2022b).

