GROUP DETR: FAST DETR TRAINING WITH GROUP-WISE ONE-TO-MANY ASSIGNMENT

Abstract

Detection Transformer (DETR) relies on one-to-one assignment for end-to-end object detection and lacks the capability of exploiting multiple positive object queries. We present a novel DETR training approach, named Group DETR, that supports one-to-many assignment in a group-wise manner. To achieve this, we make simple modifications during training: (i) adopt K groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one assignment for each group, leading to K positive object queries for each ground-truth object. In inference, we use only one group of object queries, making no modifications to the model architecture or the inference process. Group DETR is a versatile training method that applies to various DETR variants. Our experiments show that Group DETR significantly speeds up training convergence and improves the performance of various DETR-based methods.

1. INTRODUCTION

Detection Transformer (DETR) (Carion et al., 2020) achieves end-to-end detection without the need for non-maximum suppression (NMS) (Hosang et al., 2017). It has several key designs: (i) an encoder-decoder architecture based on transformer layers (Vaswani et al., 2017), (ii) object queries, and (iii) one-to-one assignment¹ performed by bipartite matching (Kuhn, 1955) between object predictions and ground-truth objects. The original DETR suffers from slow convergence and needs 500 training epochs to achieve good performance. Various solutions have been developed to accelerate training from different aspects. For example, sparse transformers (Zhu et al., 2020b; Gao et al., 2021; Chen et al., 2022c; Roh et al., 2022) are adopted to replace dense transformers. Additional spatial modulations are introduced into object queries (Zhu et al., 2020b; Meng et al., 2021; Wang et al., 2022b; Yao et al., 2021; Liu et al., 2022a; Gao et al., 2022). Denoising modules are presented for stabilizing the matching between object queries and ground-truth objects in the assignment process (Li et al., 2022; Zhang et al., 2022b).

In this paper, we propose a novel training approach, Group DETR, to accelerate DETR training convergence. Group DETR introduces group-wise one-to-many assignment: it assigns each ground-truth object to many positive object queries (one-to-many assignment²) and separates them into multiple independent groups, keeping only one positive object query per object (one-to-one assignment) in each group. To achieve this, we make simple modifications during training: (i) adopt K groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one assignment in each group, leading to K positive object queries for each ground-truth object. This design achieves fast training convergence while maintaining the key DETR property: end-to-end object detection without NMS.
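The two training-time modifications above can be sketched in code. The following is a minimal illustration, not the authors' implementation: group-wise decoder self-attention is realized with a block-diagonal attention mask that blocks cross-group pairs, and one-to-one assignment is run independently per group with Hungarian matching (`scipy.optimize.linear_sum_assignment`, as in standard DETR matchers). The function names and group sizes are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment


def group_attn_mask(num_groups: int, queries_per_group: int) -> torch.Tensor:
    """Boolean self-attention mask for grouped object queries.

    True marks pairs that must NOT attend to each other, i.e. queries
    belonging to different groups; queries within a group attend freely.
    """
    n = num_groups * queries_per_group
    group_id = torch.arange(n) // queries_per_group
    return group_id[:, None] != group_id[None, :]  # (n, n), True = blocked


def per_group_assignment(cost: torch.Tensor, num_groups: int,
                         queries_per_group: int):
    """Run bipartite (one-to-one) matching separately inside each group.

    cost: (num_groups * queries_per_group, num_gt) matching-cost matrix.
    Returns one (query_indices, gt_indices) pair per group, so each
    ground-truth object ends up with num_groups positive queries overall.
    """
    matches = []
    for g in range(num_groups):
        start = g * queries_per_group
        block = cost[start:start + queries_per_group]
        q_idx, gt_idx = linear_sum_assignment(block.numpy())
        matches.append((q_idx + start, gt_idx))
    return matches
```

The mask can be passed as `attn_mask` to `torch.nn.MultiheadAttention` in the decoder, so all groups share the same self-attention parameters while remaining mutually invisible.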
We use only one group of object queries in inference, and we modify neither the architecture nor the inference process, bringing no extra cost compared with the original method. Group DETR is a versatile training method and can be applied to various DETR-based models. Extensive experiments show that our method is effective in achieving fast training convergence (convergence curves are shown in Figure 1).
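A hypothetical sketch of this train/inference asymmetry, with illustrative sizes (the actual number of groups and queries is a training hyperparameter): all K groups of query embeddings exist during training, and inference simply slices off the first group, so the deployed model is identical in shape and cost to the baseline.

```python
import torch

# Illustrative sizes only: K groups of queries_per_group embeddings.
num_groups, queries_per_group, dim = 3, 100, 256
query_embed = torch.nn.Embedding(num_groups * queries_per_group, dim)


def select_queries(training: bool) -> torch.Tensor:
    """All K groups during training; only the first group at inference,
    so the deployed model matches the single-group baseline exactly."""
    w = query_embed.weight
    return w if training else w[:queries_per_group]
```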

2. RELATED WORKS

Accelerating DETR training convergence. The success of DETR (Carion et al., 2020) in object detection validates the potential of achieving elegant designs with transformers in computer vision. Since DETR was proposed, its slow convergence has been a critical problem that many researchers (Bar et al., 2022; Wang et al., 2022a; Song et al., 2022; Roh et al., 2022) try to address. Many works provide solutions and achieve a 10× speed-up for DETR. They mainly focus on proposing better transformer layers (Zhu et al., 2020b; Gao et al., 2021; Meng et al., 2021; Dai et al., 2021; Roh et al., 2022; Cao et al., 2022; Zhang et al., 2022a; Chen et al., 2022d) and designing new types of object queries (Zhu et al., 2020b; Meng et al., 2021; Wang et al., 2022b; Yao et al., 2021; Liu et al., 2022a; Gao et al., 2022). DN-DETR (Li et al., 2022) and DINO (Zhang et al., 2022b) attribute the slow convergence to the instability of bipartite matching (Kuhn, 1955). They present auxiliary query denoising tasks to speed up DETR training convergence. Unlike previous approaches, we show that the assignment method is critical to fast DETR training convergence. We propose Group DETR to support one-to-many assignment in a group-wise manner, achieved by simple modifications during training compared with DETR.

One-to-many assignment and one-to-one assignment. One-to-many assignment is widely adopted in modern detectors (Redmon et al., 2016; Ren et al., 2015; Liu et al., 2016; He et al., 2017; Lin et al., 2017; Cai & Vasconcelos, 2018; Chen et al., 2019; Tian et al., 2019; Zhang et al., 2020; Zhu et al., 2020a; Kim & Lee, 2020; Bochkovskiy et al., 2020; Chen et al., 2021; Ge et al., 2021). It produces duplicate predictions and needs NMS (Hosang et al., 2017; Bodla et al., 2017) for post-processing. DETR (Carion et al., 2020) explores an alternative way (one-to-one assign-



¹ One-to-one assignment: one ground-truth object is assigned to only one object query.
² One-to-many assignment: each ground-truth object can be assigned to one or more positive object queries.



Figure 1: Comparisons of training convergence curves. We show the training convergence curves of various DETR-based models. All experiments are conducted on MS COCO (Lin et al., 2014) with ResNet-50 (He et al., 2016) as the backbone. More results and comparisons can be found in Table 1 and Table 2. Here, we use different colors to distinguish different models and apply dashed and bold curves to highlight the comparisons between baseline models and their Group DETR counterparts. Best viewed in color.

