EXPLICIT BOX DETECTION UNIFIES END-TO-END MULTI-PERSON POSE ESTIMATION

Abstract

This paper presents ED-Pose, a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, which unifies contextual learning between human-level (global) and keypoint-level (local) information. Unlike previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder that extracts global features from encoded tokens. It provides a good initialization for the subsequent keypoint detection, making training converge quickly. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem that learns both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. Overall, ED-Pose is conceptually simple, requiring neither post-processing nor dense heatmap supervision, and it demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles.

Introduction

Where is the "person"? Multi-person human pose estimation has attracted much attention in the computer vision community for decades owing to its wide applications in augmented reality (AR), virtual reality (VR), and human-computer interaction (HCI). Given an image, the task aims to localize the 2D keypoint positions of every person in the image. Although many methods have been developed (Xiao et al., 2018; Sun et al., 2019; Cheng et al., 2020; Mao et al., 2022; Shi et al., 2022), the task remains challenging in situations with heavy occlusions, hard poses, and diverse body-part scales. Intuitively, as shown in Figure 1, this task needs to focus on both global (human-level) and local (keypoint-level) dependencies, which concentrate on different levels of semantic granularity. Mainstream solutions are typically two-stage methods that divide the problem into two separate subproblems (i.e., global person detection and local keypoint regression). They include Top-Down (TD) methods (Xiao et al., 2018; Sun et al., 2019; Li et al., 2021b; Mao et al., 2022), which achieve high accuracy at a high inference cost, and Bottom-Up (BU) methods (Cao et al., 2017; Newell et al., 2017; Cheng et al., 2020), which are fast at inference yet less precise. However, all of these methods are non-differentiable between their global and local stages due to hand-crafted operations such as Non-Maximum Suppression (NMS), Region-of-Interest (RoI) cropping, and keypoint-grouping post-processing. Recently, Poseur (Mao et al., 2022) attempted to apply top-down methods directly in an end-to-end framework and found a significant performance drop (about 8.7 AP on COCO), indicating optimization conflicts between the learning of global and local relations. Exploring a fully end-to-end trainable method that unifies the two disassembled subproblems is therefore both attractive and important.
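To make concrete why the hand-crafted operations above break end-to-end differentiability, the following is a minimal sketch of greedy NMS, the score-sorting and thresholding step used in two-stage pipelines; the box format (x1, y1, x2, y2) and the 0.5 IoU threshold are illustrative assumptions, not the exact settings of any method cited here. The hard keep/drop decision has no gradient, which is precisely what set-prediction methods avoid.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it above the threshold, and repeat. Returns kept box indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

The argmax-style selection inside the loop is non-differentiable, so detection and keypoint stages separated by such a step cannot be optimized jointly.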
Inspired by the success of recent end-to-end object detection methods such as DETR (Carion et al., 2020), a surge of related approaches regard human pose estimation as a direct set prediction problem. They utilize bipartite matching for one-to-one prediction with Transformers to avoid cumbersome post-processing (Li et al., 2021b; Mao et al., 2021a; 2022; Stoffl et al., 2021; Shi et al., 2022). Recently, PETR (Shi et al., 2022) proposed a fully end-to-end framework that predicts instance-aware poses without any post-processing and showed favorable potential. Nevertheless, it directly uses a pose decoder with randomly initialized pose queries to query local features from images. Relying only on local dependencies makes keypoint matching across persons ambiguous and thus leads to inferior performance, especially under occlusions, complex poses, and diverse human scales in crowded scenes. Moreover, both two-stage methods and DETR-based estimators suffer from slow training convergence and need many epochs (e.g., training a model for over a week) to achieve high precision; the convergence of DETR-based methods is even slower than that of bottom-up methods (Cheng et al., 2020). We discuss the details in Sec. 3.

Based on the above observations, this work re-considers multi-person pose estimation as two Explicit box Detection processes, named ED-Pose. We realize each box detection with a decoder and cascade them into an end-to-end framework, which makes the model fast in convergence, precise, and scalable. Specifically, to obtain global dependencies, the first process detects boxes for all persons via human box queries selected from the encoded image tokens. This simple step provides a good initialization for the subsequent keypoint detection and accelerates training convergence.
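The query-selection step above can be sketched as follows. This is a hedged, simplified illustration, not the paper's actual implementation: we assume each encoded token has already been given a person-classification score and a coarse box prediction by learned heads (stand-ins here), and the top-scoring tokens are picked to initialize the human box queries of the detection decoder.

```python
def select_human_queries(token_scores, token_boxes, num_queries):
    """Pick the indices and boxes of the top-scoring encoded tokens to
    serve as initial human box queries for the detection decoder.

    token_scores: per-token person scores from a (hypothetical) class head.
    token_boxes:  per-token coarse boxes from a (hypothetical) box head.
    """
    order = sorted(range(len(token_scores)),
                   key=lambda i: token_scores[i], reverse=True)
    picked = order[:num_queries]
    return picked, [token_boxes[i] for i in picked]
```

Initializing queries from content-aware tokens, rather than from random embeddings as in PETR, is what gives the keypoint stage a strong starting point and speeds up convergence.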
Then, to capture local contextual relations and reduce ambiguities in feature alignment, we regard the subsequent pose estimation task as a keypoint box detection problem that learns both box positions and local contents for each keypoint. This approach leverages contextual information near a keypoint by directly regressing the keypoint box position and learning local keypoint content queries without dense supervision (e.g., heatmaps). To further enhance global-local interactivity among human-human (external), human-keypoint (internal), and keypoint-keypoint (internal) relations, we design interactive learning between human detection and keypoint detection. With the two Explicit box Detection processes, we unify global and local feature learning under a consistent regression loss and the same box representation in an end-to-end framework. We summarize related methods in terms of their supervision and representation; compared with previous works, ED-Pose is conceptually simpler. Notably, we find that explicit global box detection gains 4.5 AP on COCO and 9.9 AP on CrowdPose over a solution without such a scheme. Compared with top-down methods, ED-Pose lets human and keypoint detection share the same encoder, avoiding the redundant cost of a separate human detector, and further boosts performance by 1.2 AP on COCO and 9.1 AP on CrowdPose under the same ResNet-50 backbone. Moreover, ED-Pose surpasses the previous end-to-end model PETR by a significant margin of 2.8 AP on COCO and 5.0 AP on CrowdPose. In crowded scenes, ED-Pose achieves the state-of-the-art with 76.6 AP (a 4.2 AP improvement over the previous SOTA (Yuan et al., 2021)) without any bells and whistles (e.g., multi-scale testing and flipping). We hope this simple attempt at explicit box detection, simplified losses, and a post-processing-free unified pipeline can bring new perspectives to future one-stage framework designs.
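The keypoint-as-box idea and the consistent regression supervision can be sketched in a few lines. This is a minimal illustration under our own assumptions (a (cx, cy, w, h) box format and a plain elementwise L1 loss); the actual model learns the box sizes and contents, which we do not reproduce here.

```python
def keypoint_to_box(x, y, w, h):
    """Represent a keypoint (x, y) as a small box (cx, cy, w, h) centered
    on it, so the context around the keypoint can be detected and attended
    to in the same way as a human box."""
    return (x, y, w, h)

def l1_loss(pred, target):
    """Plain L1 regression loss averaged over box coordinates; the same
    supervision form applies to human boxes and keypoint boxes alike,
    with no dense heatmap targets."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Because human boxes and keypoint boxes share one representation and one loss form, the global and local stages can be trained jointly without heatmap decoding or grouping heuristics.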



Figure 1: Illustration of (a) the perception of the pose estimation task, which usually captures global and local contexts concurrently; (b) a taxonomy of existing estimators. ED-Pose (Ours) is a novel one-stage method that learns both global and local relations in an end-to-end manner.

