EXPLICIT BOX DETECTION UNIFIES END-TO-END MULTI-PERSON POSE ESTIMATION

Abstract

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, which unifies the contextual learning between human-level (global) and keypoint-level (local) information. Unlike previous one-stage methods, ED-Pose reformulates this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder that extracts global features from the encoded tokens. It provides a good initialization for the subsequent keypoint detection, so that training converges quickly. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem and learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. Overall, ED-Pose is conceptually simple and requires neither postprocessing nor dense heatmap supervision. It is both effective and efficient compared with two-stage and one-stage methods. Notably, explicit box detection boosts pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone by 1.2 AP on COCO and achieves the state of the art with 76.6 AP on CrowdPose without bells and whistles.
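To make the idea of keypoint box detection with L1 regression concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes normalized (cx, cy, w, h) boxes, a fixed square box around each keypoint, and a plain mean absolute error; the helper names (`keypoints_to_boxes`, `l1_box_loss`, `box_centers`) and the `box_size` parameter are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Assumed box format: (cx, cy, w, h), all normalized to [0, 1].
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def keypoints_to_boxes(keypoints: List[Point], box_size: float = 0.05) -> List[Box]:
    """Wrap each (x, y) keypoint in a small square box centered on it.

    box_size is a hypothetical hyperparameter chosen for illustration.
    """
    return [(x, y, box_size, box_size) for x, y in keypoints]


def l1_box_loss(pred: List[Box], target: List[Box]) -> float:
    """Mean absolute error over all box coordinates (a plain L1 regression loss)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boxes")
    if not pred:
        return 0.0
    total = sum(
        abs(p - t)
        for pred_box, target_box in zip(pred, target)
        for p, t in zip(pred_box, target_box)
    )
    return total / (4 * len(pred))


def box_centers(boxes: List[Box]) -> List[Point]:
    """Read keypoint coordinates back from box centers; no heatmap decoding is needed."""
    return [(cx, cy) for cx, cy, _, _ in boxes]


if __name__ == "__main__":
    gt_keypoints = [(0.5, 0.5), (0.25, 0.75)]
    target_boxes = keypoints_to_boxes(gt_keypoints, box_size=0.125)
    # Hypothetical predictions: the second box center is off by 0.25 in x.
    pred_boxes = [(0.5, 0.5, 0.125, 0.125), (0.5, 0.75, 0.125, 0.125)]
    print(l1_box_loss(pred_boxes, target_boxes))
    print(box_centers(pred_boxes))
```

Because each keypoint is read directly from a regressed box center, a formulation of this kind needs no heatmap decoding step, which is consistent with the claim above that the framework requires no postprocessing.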


Multi-person human pose estimation has attracted much attention in the computer vision community for decades owing to its wide applications in augmented reality (AR), virtual reality (VR), and human-computer interaction (HCI). Given an image, the goal is to localize the 2D keypoint positions of every person in the image. Although many methods have been developed (Xiao et al., 2018;



Figure 1: Illustration of (a) the perception of the pose estimation task that usually captures global and local contexts concurrently; (b) a taxonomy of existing estimators. ED-Pose (Ours) is a novel one-stage method of learning both global and local relations in an end-to-end manner.

