STOCHASTIC MULTI-PERSON 3D MOTION FORECASTING

Abstract

This paper addresses real-world complexities largely ignored in prior work on human motion forecasting: the social properties of multi-person motion, the diversity of motion and social interactions, and the complexity of articulated motion. To this end, we introduce a novel task of stochastic multi-person 3D motion forecasting. We propose a dual-level generative modeling framework that separately models independent individual motion at the local level and social interactions at the global level. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, by introducing learnable latent codes that represent intents of future motion and switching the codes' modes of operation at different levels. Our framework is general; we instantiate it with different generative models, including generative adversarial networks and diffusion models, and various multi-person forecasting models. Extensive experiments on CMU-Mocap, MuPoTS-3D, and SoMoF benchmarks show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.
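As a rough illustration of the dual-level idea described above, the sketch below shows how a single shared generator could switch the role of its learnable latent codes: one independent code per person at the local level, and one code broadcast to all people at the global level. All class names, dimensions, and architectural choices here are hypothetical simplifications, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DualLevelGenerator(nn.Module):
    """Hypothetical sketch: a shared generator whose learnable latent codes
    ("intents" of future motion) operate in two modes -- per-person codes for
    independent local motion, one shared code for global social interaction."""

    def __init__(self, pose_dim=45, hidden=128, n_codes=8, horizon=10):
        super().__init__()
        self.horizon = horizon
        # Learnable latent codes representing intents of future motion.
        self.codes = nn.Parameter(torch.randn(n_codes, hidden))
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, past, code_idx, level="local"):
        # past: (n_persons, T_past, pose_dim) observed motion for each person.
        _, h = self.encoder(past)                    # h: (1, n_persons, hidden)
        if level == "local":
            # One independent code per person: code_idx is (n_persons,).
            z = self.codes[code_idx]
        else:
            # Global mode: a single shared code broadcast to every person.
            z = self.codes[code_idx].expand(past.size(0), -1)
        # Condition the decoder on history + intent for each future step.
        inp = (h.squeeze(0) + z).unsqueeze(1).repeat(1, self.horizon, 1)
        out, _ = self.decoder(inp)
        return self.head(out)                        # (n_persons, horizon, pose_dim)
```

Sampling different `code_idx` values yields diverse futures; switching `level` reuses the same parameters for both individual motion and interaction modeling, which is the point of the shared-generator design.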

1. INTRODUCTION

One of the hallmarks of human intelligence is the ability to predict the evolution of the physical world over time given historical information. For example, humans naturally anticipate the flow of people in public areas, react, and plan their own behavior based on social rules, such as avoiding collisions. Effective forecasting of human motion has thus become a crucial task in computer vision and robotics, e.g., in autonomous driving (Paden et al., 2016) and robot navigation (Rudenko et al., 2018). This task, however, is challenging. First, human motion is structured with respect to both body physics and social norms, and is highly dependent on the surrounding environment and its changes. Second, human motion is inherently uncertain and multi-modal, especially over long time horizons.

Previous work on human motion forecasting often focuses on simplified scenarios. Perhaps the most widely adopted setting is stochastic local motion prediction of a single person (Mao et al., 2021; Yuan & Kitani, 2020), which ignores human interactions with the environment and with other people in it. Another related task is deterministic multi-person motion forecasting (Wang et al., 2021b; Adeli et al., 2020; 2021; Guo et al., 2022), which, however, does not take into account the diversity of individual movements and social interactions. In addition, stochastic forecasting of human trajectories in crowds (Alahi et al., 2014) has shown progress in modeling social interactions, e.g., with the use of attention models (Kosaraju et al., 2019; Vemula et al., 2018; Zhang et al., 2019) and spatial-temporal graph models (Huang et al., 2019; Ivanovic & Pavone, 2019; Salzmann et al., 2020; Yu et al., 2020). Nevertheless, this task only considers motion and interactions at the trajectory level. Modeling articulated 3D poses involves richer, human-like social interactions than trajectory forecasting, which only needs to account for collisions between trajectories.
To overcome these limitations, we introduce a novel task of stochastic multi-person 3D motion forecasting, aiming to jointly tackle the aforementioned aspects ignored in previous work: the social properties of multi-person motion, the multi-modality of motion and social interactions, and the complexity of articulated motion.

* Yu-Xiong Wang and Liang-Yan Gui contributed equally to this work.

