STOCHASTIC MULTI-PERSON 3D MOTION FORECASTING

Abstract

This paper addresses real-world complexities ignored in prior work on human motion forecasting, emphasizing the social properties of multi-person motion, the diversity of motion and social interactions, and the complexity of articulated motion. To this end, we introduce a novel task of stochastic multi-person 3D motion forecasting. We propose a dual-level generative modeling framework that separately models independent individual motion at the local level and social interactions at the global level. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, by introducing learnable latent codes that represent the intents of future motion and switching the codes' modes of operation at different levels. Our framework is general; we instantiate it with different generative models, including generative adversarial networks and diffusion models, and with various multi-person forecasting models. Extensive experiments on the CMU-Mocap, MuPoTS-3D, and SoMoF benchmarks show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.

1. INTRODUCTION

One of the hallmarks of human intelligence is the ability to predict the evolution of the physical world over time given historical information. For example, humans naturally anticipate the flow of people in public areas, react, and plan their own behavior based on social rules, such as avoiding collisions. Effective forecasting of human motion has thus become a crucial task in computer vision and robotics, e.g., in autonomous driving (Paden et al., 2016) and robot navigation (Rudenko et al., 2018). This task, however, is challenging. First, human motion is structured with respect to both body physics and social norms, and is highly dependent on the surrounding environment and its changes. Second, human motion is inherently uncertain and multi-modal, especially over long time horizons.

Previous work on human motion forecasting often focuses on simplified scenarios. Perhaps the most widely adopted setting is stochastic local motion prediction for a single person (Mao et al., 2021; Yuan & Kitani, 2020), which ignores human interactions with the environment and with other people in it. Another related task is deterministic multi-person motion forecasting (Wang et al., 2021b; Adeli et al., 2020; 2021; Guo et al., 2022). However, it does not take into account the diversity of individual movements and social interactions. In addition, stochastic forecasting of human trajectories in crowds (Alahi et al., 2014) has shown progress in modeling social interactions, e.g., with the use of attention models (Kosaraju et al., 2019; Vemula et al., 2018; Zhang et al., 2019) and spatial-temporal graph models (Huang et al., 2019; Ivanovic & Pavone, 2019; Salzmann et al., 2020; Yu et al., 2020). Nevertheless, this line of work only considers motion and interactions at the trajectory level. Modeling articulated 3D poses involves richer human-like social interactions than trajectory forecasting, which only needs to account for trajectory collisions.
To overcome these limitations, we introduce a novel task of stochastic multi-person 3D motion forecasting, aiming to jointly tackle the aforementioned aspects ignored in previous work: the social properties of multi-person motion, the multi-modality of motion and social interactions, and the complexity of articulated motion.

Due to the substantially increased complexity of our task, it becomes challenging to optimize all three objectives simultaneously. We observe that simply extending existing work, such as deterministic motion forecasting methods, cannot address the proposed task. This difficulty motivates us to adopt a divide-and-conquer strategy, guided by the observation that single-person fidelity and multi-person fidelity can be viewed as relatively independent goals, while there is an inherent trade-off between fidelity and diversity. We therefore propose a Dual-level generative modeling framework for Multi-person Motion Forecasting (DuMMF). At the local level, we model motion for different people independently under relaxed conditions, thus satisfying single-person fidelity and diversity. Meanwhile, at the global level, we model social interactions by considering the correlations among all motions, thereby further improving multi-person fidelity. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, by simply switching the modes of operation of the motion intent codes (i.e., the latent codes of the generative model) at different levels. By optimizing these codes with level-specific objectives, we produce diverse and realistic multi-person predictions.

Our contributions can be summarized as follows. (a) To the best of our knowledge, we are the first to investigate the task of stochastic multi-person 3D motion forecasting. (b) We propose a simple yet effective dual-level learning framework to address this task. (c) We introduce discrete learnable social intents at dual levels to improve the realism and diversity of predictions. (d) Our framework is general and can be instantiated with various generative models, including generative adversarial networks and diffusion models, and with different types of multi-person motion forecasting models. Notably, it generalizes to challenging scenarios with more people (e.g., 18 persons) that are unseen during training.
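As a rough illustration of how a single set of latent intent codes can serve both levels, the toy sketch below switches between independent per-person draws (local level) and a single shared draw that all persons condition on (global level). The codebook sizes, the random-sampling scheme, and the function names here are illustrative assumptions, not DuMMF's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not fix these values here.
N_PERSONS, N_CODES, CODE_DIM = 3, 8, 16

# One codebook of discrete learnable intent codes, shared by both levels.
codebook = rng.standard_normal((N_CODES, CODE_DIM))

def sample_intents(level, n_persons=N_PERSONS, rng=rng):
    """Switch the intent codes' mode of operation between the two levels:
    'local'  -> each person draws an independent code (individual diversity);
    'global' -> all persons condition on one shared code (a joint social
                intent the generator can use to coordinate interactions)."""
    if level == "local":
        idx = rng.integers(0, N_CODES, size=n_persons)       # independent draws
    elif level == "global":
        idx = np.full(n_persons, rng.integers(0, N_CODES))   # one shared draw
    else:
        raise ValueError(f"unknown level: {level}")
    return codebook[idx]  # shape: (n_persons, CODE_DIM)
```

In the actual framework, these codes would be learned and optimized with level-specific objectives rather than drawn at random; the sketch only illustrates the mode switch within a shared codebook.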

2. RELATED WORK

Stochastic Human Motion Forecasting. There have been many advances in stochastic human motion forecasting, many of which (Walker et al., 2017; Yan et al., 2018; Barsoum et al., 2018) are based on the adaptation and improvement of deep generative models such as variational autoencoders



Figure 1: Illustration of the multifaceted challenges in the proposed task of stochastic multi-person 3D motion forecasting. (a) Single-person fidelity: for each person, the predicted pose and trajectory should be realistic and consistent with each other, e.g., to avoid foot floating and skating. (b) Multi-person fidelity: multi-person motion in a scene inherently involves social interactions, e.g., to avoid motion collisions. (c) Overall diversity: long-term human motion is uncertain and stochastic; we address this intrinsic multi-modality, while existing work (Wang et al., 2021b; Adeli et al., 2020; 2021; Guo et al., 2022) simplifies to deterministic prediction.

