PROACTIVE MULTI-CAMERA COLLABORATION FOR 3D HUMAN POSE ESTIMATION

Abstract

This paper presents a multi-agent reinforcement learning (MARL) scheme for proactive multi-camera collaboration in 3D human pose estimation in dynamic human crowds. Traditional fixed-viewpoint multi-camera solutions for human motion capture (MoCap) are limited in capture space and susceptible to dynamic occlusions. Active camera approaches proactively control camera poses to find optimal viewpoints for 3D reconstruction. However, current methods still face challenges with credit assignment and environment dynamics. To address these issues, our proposed method introduces a novel Collaborative Triangulation Contribution Reward (CTCR) that improves convergence and alleviates the multi-agent credit assignment issues that arise from using 3D reconstruction accuracy as the shared reward. Additionally, we jointly train our model with multiple world dynamics learning tasks to better capture environment dynamics and encourage anticipatory behaviors for occlusion avoidance. We evaluate our proposed method in four photo-realistic UE4 environments to ensure validity and generalizability. Empirical results show that our method outperforms fixed and active baselines in various scenarios with different numbers of cameras and humans.

1. INTRODUCTION

Marker-less motion capture (MoCap) has broad applications in areas such as cinematography, medical research, virtual reality (VR), and sports. These successes can be partly attributed to recent developments in 3D human pose estimation (HPE) techniques (Tu et al., 2020; Iskakov et al., 2019; Jafarian et al., 2019; Pavlakos et al., 2017b; Lin & Lee, 2021b). A straightforward approach to multi-view 3D HPE is to use fixed cameras. Although convenient, this solution is less effective against dynamic occlusions. Moreover, fixed-camera setups confine tracking targets within a constrained space and are therefore less applicable to outdoor MoCap. In contrast, active cameras (Luo et al., 2018; 2019; Zhong et al., 2018a; 2019), such as ones mounted on drones, can maneuver proactively against incoming occlusions. Owing to this remarkable flexibility, the active approach has attracted overwhelming interest (Tallamraju et al., 2020; Ho et al., 2021; Xu et al., 2017; Kiciroglu et al., 2019; Saini et al., 2022; Cheng et al., 2018; Zhang et al., 2021). Previous works have demonstrated the effectiveness of using active cameras for 3D HPE of a single target indoors (Kiciroglu et al., 2019; Cheng et al., 2018), in clean landscapes (Tallamraju et al., 2020; Nägeli et al., 2018; Zhou et al., 2018; Saini et al., 2022), or in landscapes with scattered static obstacles (Ho et al., 2021). However, to the best of our knowledge, no existing work has experimented with multiple (n > 3) active cameras for 3D HPE in a human crowd. There are two key challenges. First, frequent human-to-human interactions lead to random dynamic occlusions. Unlike previous works that only consider clean landscapes or static obstacles, dynamic scenes require frequent adjustments of the cameras' viewpoints for occlusion avoidance while maintaining a good overall team formation to ensure accurate multi-view reconstruction.
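For context, the multi-view reconstruction underlying this task is typically a linear (DLT) triangulation of each joint from the cameras' 2D detections. The minimal sketch below, with an illustrative two-camera setup rather than the paper's pipeline, shows how an occluded view can simply be dropped from the computation when other viewpoints remain:

```python
import numpy as np

def triangulate_joint(projections, points_2d):
    """DLT triangulation of one joint from two or more camera views.

    projections: list of 3x4 camera projection matrices.
    points_2d:   matching (u, v) pixel detections of the joint.
    An occluded camera is simply omitted from both lists, which is
    why extra viewpoints make the reconstruction more robust.
    """
    A = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the 3D point.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    # Homogeneous least-squares solution: last right singular vector of A.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean coordinates
```

With noise-free detections from well-separated viewpoints the recovered 3D point is exact; with fewer than two unoccluded views the system is under-determined, which is precisely the failure mode that proactive camera control aims to avoid.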
Therefore, achieving optimality in dynamic scenes with a fixed camera formation or a hand-crafted control policy is challenging. In addition, the complex behavioral patterns of a human crowd make occlusion patterns less comprehensible and predictable, further increasing the difficulty of control. Second, as the team grows larger, the multi-agent credit assignment issue becomes prominent, which hinders policy learning of the camera agents. Concretely, multi-view 3D HPE is a team effort that requires inputs from multiple cameras to generate an accurate reconstruction. Having more camera agents participate in a reconstruction introduces more redundancy, which reduces susceptibility to reconstruction failures caused by dynamic occlusions. However, it consequently weakens the association between individual performance and the team's reconstruction accuracy, which leads to the "lazy agent" problem (Sunehag et al., 2017).

In this work, we introduce a proactive multi-camera collaboration framework based on multi-agent reinforcement learning (MARL) for real-time, distributed adjustment of a multi-camera formation for 3D HPE in a human crowd. In our approach, multiple camera agents collaborate seamlessly to successfully reconstruct 3D human poses. The framework is decentralized, offering flexibility over the formation size and eliminating dependency on a control hierarchy or a centralized entity. Regarding the first challenge, we argue that the model's ability to predict human movements and environmental changes is crucial. We therefore incorporate World Dynamics Learning (WDL), training the state representation with five auxiliary tasks that predict the target's position, the pedestrians' positions, the agent's own state, its teammates' states, and the team reward.
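To make the WDL objective concrete, the sketch below shows five auxiliary heads sharing one state embedding, with a combined regression loss added to the usual RL objective. The head dimensions and the single-linear-layer heads are our illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 32                      # hypothetical embedding size
N_PEDESTRIANS, N_TEAMMATES = 5, 2   # hypothetical scene composition

def linear_head(out_dim):
    """One auxiliary prediction head (a single linear map, for brevity)."""
    W = rng.normal(scale=0.1, size=(EMBED_DIM, out_dim))
    return lambda z: z @ W

# One head per WDL auxiliary task named in the text.
heads = {
    "target_pos": linear_head(3),                     # target's 3D position
    "pedestrian_pos": linear_head(3 * N_PEDESTRIANS), # pedestrians' positions
    "self_state": linear_head(6),                     # agent's own state
    "teammate_state": linear_head(6 * N_TEAMMATES),   # teammates' states
    "team_reward": linear_head(1),                    # predicted team reward
}

def wdl_loss(z, targets):
    """Sum of per-task MSE losses on the shared state embedding z."""
    return sum(np.mean((head(z) - targets[name]) ** 2)
               for name, head in heads.items())
```

Because all five heads regress from the same embedding, gradients from each prediction task shape a representation that encodes the environment's dynamics, which is the property the main policy then exploits.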
To tackle the second challenge, we further introduce the Collaborative Triangulation Contribution Reward (CTCR), which incentivizes each agent according to its contribution to the 3D reconstruction. Inspired by the Shapley value (Rapoport, 1970), CTCR computes an agent's average weighted marginal contribution to the 3D reconstruction over all possible coalitions that contain it. This reward directly ties each agent's level of participation to its adjusted return, guiding policy learning when the team reward alone is insufficient to produce such a direct association. Moreover, CTCR penalizes occluded camera agents more efficiently than the shared reward does, encouraging emergent occlusion-avoidance behaviors. Empirical results show that CTCR accelerates convergence and increases reconstruction accuracy. Furthermore, CTCR is a general approach that can benefit policy learning in active 3D HPE and serve as a new assessment metric for view selection in other multi-view reconstruction tasks.

To evaluate the learned policies, we build photo-realistic environments (UnrealPose) using Unreal Engine 4 (UE4) and UnrealCV (Qiu et al., 2017). These environments simulate realistically behaving crowds with high fidelity and customizability. We train the agents in a Blank environment and validate their policies in three unseen scenarios with different landscapes, levels of illumination, human appearances, and various numbers of cameras and humans. The empirical results show that our method achieves more accurate and stable 3D pose estimates than off-the-shelf passive- and active-camera baselines. To facilitate further research on this topic, we release our environments with OpenAI Gym API (Brockman et al., 2016) integration, together with a dedicated visualization tool.
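As a concrete illustration, the Shapley-style computation behind CTCR can be sketched as follows. Here `recon_accuracy` is a hypothetical callable scoring the reconstruction obtained from a subset of views (e.g., negative reconstruction error); this is a minimal sketch of the coalition averaging, not the paper's exact implementation:

```python
import itertools
import math

def ctcr(agents, recon_accuracy):
    """Shapley-style CTCR: each agent's average weighted marginal
    contribution to reconstruction accuracy over all coalitions.

    agents: list of agent ids.
    recon_accuracy: maps a frozenset of agent ids to a scalar score
    for the reconstruction triangulated from only those views.
    """
    n = len(agents)
    rewards = {}
    for i in agents:
        others = [a for a in agents if a != i]
        total = 0.0
        for r in range(n):  # coalition sizes 0 .. n-1, excluding agent i
            for coalition in itertools.combinations(others, r):
                s = frozenset(coalition)
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                w = (math.factorial(r) * math.factorial(n - r - 1)
                     / math.factorial(n))
                total += w * (recon_accuracy(s | {i}) - recon_accuracy(s))
        rewards[i] = total
    return rewards
```

By the efficiency property of the Shapley value, the per-agent rewards sum to the accuracy of the full team, so an occluded agent that adds nothing to any coalition receives zero rather than an equal share of the team reward.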
Here we summarize the key contributions of our work:
• Formulating the active multi-camera 3D human pose estimation problem as a Dec-POMDP and proposing a novel multi-camera collaboration framework based on MARL (with n ≥ 3).
• Introducing five auxiliary tasks to enhance the model's ability to learn the dynamics of highly dynamic scenes.
• Proposing CTCR to address the credit assignment problem in MARL and demonstrating notable improvements in reconstruction accuracy compared to both passive and active baselines.
• Contributing high-fidelity environments for simulating realistic-looking human crowds with authentic behaviors, along with visualization software for frame-by-frame video analysis.



Figure 1: Left: Two critical challenges in fixed camera approaches. Right: Three active cameras collaborate to best reconstruct the 3D pose of the target (marked in ).

