STOCHASTIC MULTI-PERSON 3D MOTION FORECASTING

Abstract

This paper aims to deal with the ignored real-world complexities in prior work on human motion forecasting, emphasizing the social properties of multi-person motion, the diversity of motion and social interactions, and the complexity of articulated motion. To this end, we introduce a novel task of stochastic multi-person 3D motion forecasting. We propose a dual-level generative modeling framework that separately models independent individual motion at the local level and social interactions at the global level. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, through introducing learnable latent codes that represent intents of future motion and switching the codes' modes of operation at different levels. Our framework is general; we instantiate it with different generative models, including generative adversarial networks and diffusion models, and various multi-person forecasting models. Extensive experiments on CMU-Mocap, MuPoTS-3D, and SoMoF benchmarks show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.

1. INTRODUCTION

One of the hallmarks of human intelligence is the ability to predict the evolution of the physical world over time given historical information. For example, humans naturally anticipate the flow of people in public areas, react, and plan their own behavior based on social rules, such as avoiding collisions. Effective forecasting of human motion has thus become a crucial task in computer vision and robotics, e.g., in autonomous driving (Paden et al., 2016) and robot navigation (Rudenko et al., 2018) . This task, however, is challenging. First, human motion is structured with respect to both body physics and social norms, and is highly dependent on the surrounding environment and its changes. Second, human motion is inherently uncertain and multi-modal, especially over long time horizons. Previous work on human motion forecasting often focuses on simplified scenarios. Perhaps the most widely adopted setting is on stochastic local motion prediction of a single person (Mao et al., 2021; Yuan & Kitani, 2020) , which ignores human interactions with the environment and other people in the environment. Another related task is deterministic multi-person motion forecasting (Wang et al., 2021b; Adeli et al., 2020; 2021; Guo et al., 2022) . However, it does not take into account the diversity of individual movements and social interactions. In addition, stochastic forecasting of human trajectories in crowds (Alahi et al., 2014) has shown progress in modeling social interactions, e.g., with the use of attention models (Kosaraju et al., 2019; Vemula et al., 2018; Zhang et al., 2019) and spatial-temporal graph models (Huang et al., 2019; Ivanovic & Pavone, 2019; Salzmann et al., 2020; Yu et al., 2020) . Nevertheless, this task only considers motion and interactions at the trajectory level. Modeling articulated 3D poses involves richer human-like social interactions than trajectory forecasting which only needs to account for trajectory collisions. 
To overcome these limitations, we introduce a novel task of stochastic multi-person 3D motion forecasting, aiming to jointly tackle the aforementioned aspects ignored in previous work: the social properties of multi-person motion, the multi-modality of motion and social interactions, and the complexity of articulated motion. Due to the substantially increased complexity of our task, it becomes challenging to optimize all three objectives simultaneously. We observe that simply extending existing work, such as deterministic motion forecasting, cannot address the proposed task. This difficulty motivates us to adopt a divide-and-conquer strategy, together with the observation that single-person fidelity and multi-person fidelity can be viewed as relatively independent goals, while there is an inherent trade-off between fidelity and diversity. Therefore, we propose a Dual-level generative modeling framework for Multi-person Motion Forecasting (DuMMF). At the local level, we model motion for different people independently under relaxed conditions, thus satisfying single-person fidelity and diversity. Meanwhile, at the global level, we model social interactions by considering the correlation between all motions, thereby further improving multi-person fidelity. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, through simply switching the modes of operation of the motion intent codes (i.e., latent codes of the generative model) at different levels. By optimizing these codes with level-specific objectives, we produce diverse and realistic multi-person predictions.

Our contributions can be summarized as follows. (a) To the best of our knowledge, we are the first to investigate the task of stochastic multi-person 3D motion forecasting. (b) We propose a simple yet effective dual-level learning framework to address this task. (c) We introduce discrete learnable social intents at dual levels to improve the realism and diversity of predictions.
(d) Our framework is general and can be operationalized with various generative models, including generative adversarial networks and diffusion models, and different types of multi-person motion forecasting models. Notably, it can be generalized to challenging more-person (e.g., 18-person) scenarios that are unseen during training.

2. RELATED WORK

Stochastic Human Motion Forecasting. There have been many advances in stochastic human motion forecasting.

3. METHODOLOGY

In this section, we explain the proposed dual-level generative modeling framework (DuMMF) for our task. As illustrated in Figure 2, our key insight is to decouple the modeling of independent individual movements at the local level and social interactions at the global level (Sec. 3.1).

Problem Formulation. We denote the input motion sequence of length $T_h$ for $N$ persons in a scene as $\{X_n\}_{n=1}^N$, where $X_n[t]$ is the pose of the $n$-th person at time step $t$. We aim to predict $M$ future motion sequences of length $T_p$, denoted as $\{\{\hat{Y}^m_n\}_{n=1}^N\}_{m=1}^M$, where $\hat{Y}^m_n = [\hat{Y}^m_n[T_h+1], \ldots, \hat{Y}^m_n[T_h+T_p]]$ is the $m$-th predicted motion of the $n$-th person. We use 3D coordinates to represent the absolute positions of $V$ joints; hence $\forall n, m, t$: $X_n[t], \hat{Y}^m_n[t] \in \mathbb{R}^{V \times 3}$. We assume we are given the ground-truth motion of the $N$ persons as $\{Y_n\}_{n=1}^N$. Our goal is to forecast multiple realistic yet diverse future motion sequences, such that (a) all $M$ predictions represent human-like motion, simultaneously satisfying single-person fidelity and multi-person fidelity; (b) the predictions are diverse (overall diversity); and (c) one of the predicted sequences is as close to the ground truth as possible.
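To make the problem setup concrete, the tensor shapes can be sketched in NumPy (the sizes and the dummy forecaster below are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

# Hypothetical sizes for illustration: N persons, T_h history frames,
# T_p future frames, V joints, M stochastic samples.
N, T_h, T_p, V, M = 3, 15, 45, 15, 5

# Observed motion {X_n}: absolute 3D joint positions per person.
X = np.random.randn(N, T_h, V, 3)

def forecast(X, M, T_p):
    """Dummy stand-in for the generative model: repeat the last pose with noise."""
    last = X[:, -1:, :, :]                                        # (N, 1, V, 3)
    future = np.broadcast_to(last, (X.shape[0], T_p, X.shape[2], 3))
    return future + 0.01 * np.random.randn(M, *future.shape)     # (M, N, T_p, V, 3)

Y_hat = forecast(X, M, T_p)
assert Y_hat.shape == (M, N, T_p, V, 3)   # M futures for each of the N persons
```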

3.1. DUAL-LEVEL STOCHASTIC MULTI-PERSON MOTION FORECASTING: OVERVIEW

Basic Generative Modeling Framework. Stochastic multi-person future motion can be modeled as a joint distribution using deep generative models (Goodfellow et al., 2014). Accordingly, we denote this joint distribution of the future motion of $N$ persons as $p(Y_1, Y_2, \ldots, Y_N | X_1, X_2, \ldots, X_N)$, where all future movements $\{Y_n\}_{n=1}^N$ are conditioned on the past sequences $\{X_n\}_{n=1}^N$ of all persons. Typically, we can use a latent code $z \sim p(z)$ to reparameterize this joint distribution as $p(\{Y_n\}_{n=1}^N | \{X_n\}_{n=1}^N) = \int p(\{Y_n\}_{n=1}^N | z, \{X_n\}_{n=1}^N)\, p(z)\, dz$. Here, the latent code $z$ can be interpreted as the intent of future motion, and sampling from the generative model $G_\theta$ gives

$z \sim p(z), \quad \{\hat{Y}_n\}_{n=1}^N = G_\theta(z, \{X_n\}_{n=1}^N). \quad (1)$

Given the complexity of our task, it becomes challenging to simultaneously ensure all objectives (i.e., the fidelity of a single person, the fidelity of multiple persons, and the overall diversity). To overcome this difficulty, we introduce a dual-level modeling mechanism that explicitly decomposes the task objectives into local modeling of independent individual movements and global modeling of social interactions. Notably, we achieve this by simply switching the modes of operation of the latent codes $z$ w.r.t. different levels of modeling, without any change to the model architecture $G$.

Local-Level Modeling: Individual Motion. At this level, the generative model $G_\theta$ models all human bodies as independent of each other, and we aim to improve the overall diversity and the single-person fidelity, alleviating problems such as predicting unrealistic poses. Here, the joint distribution of future human motions can be rewritten as the product of all single-person marginal distributions, i.e., $p(\{Y_n\}_{n=1}^N | \{X_n\}_{n=1}^N) = \prod_{n=1}^N p(Y_n | X_n)$. To this end, as shown in Figure 2(a), we leverage $N$ different individual intents $z_1, z_2, \ldots, z_N$ independently drawn from $p(z)$ to generate independent future movements:

$z_1, \ldots, z_N \sim p(z), \quad \{\hat{Y}_n\}_{n=1}^N = G_\theta(\{z_n\}_{n=1}^N, \{X_n\}_{n=1}^N). \quad (2)$

Global-Level Modeling: Social Interactions. Going beyond the local individual level, the generative model $G_\theta$ at the global level takes into account the social behavior of multiple people to model their joint distribution. The goal is to further improve the multi-person fidelity, e.g., promoting the overall accuracy. As illustrated in Figure 2(b), to keep the network architecture $G$ unchanged, we still use $N$ individual intents as input. However, different from the local level, we constrain these $N$ individual intents to be the same, representing social intents that stand for correlations between the intents of multiple persons. Formally,

$z \sim p(z), \quad \{\hat{Y}_n\}_{n=1}^N = G_\theta(\{z\}_{n=1}^N, \{X_n\}_{n=1}^N). \quad (3)$

Note that, without additional constraints, this dual-level modeling scheme by itself is not guaranteed to enforce the latent codes to behave in the designed manner. To this end, we introduce learnable latent intent codes (Sec. 3.2) and jointly optimize the codes and the forecasting model $G$, guided by the level-specific training objectives (Sec. 3.3).
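The switch between the two modes of operation amounts to sampling either $N$ independent codes or one shared code; a minimal sketch (NumPy, with a hypothetical latent dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 3, 64   # persons, hypothetical latent dimension

def sample_intents(level, N, D, rng):
    """Local level: N independent intents; global level: one shared intent."""
    if level == "local":
        return rng.standard_normal((N, D))   # z_1, ..., z_N ~ p(z), Eq. (2)
    z = rng.standard_normal((1, D))          # z ~ p(z), Eq. (3)
    return np.repeat(z, N, axis=0)           # same code fed to every person

z_local = sample_intents("local", N, D, rng)
z_global = sample_intents("global", N, D, rng)
assert not np.allclose(z_local[0], z_local[1])   # independent codes differ
assert np.allclose(z_global[0], z_global[2])     # shared code is identical
```

Note that the model architecture is untouched: only the sampling of its latent inputs changes between levels.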

3.2. DISCRETE LEARNABLE HUMAN INTENT CODES

Intuitively, an arbitrary, albeit identical, individual intent in Eq. 3 may not adequately lead to a valid social intent. We thus hypothesize that a social intent is formed when all individual intents are the same and belong within some range of "options." This can typically be achieved through discrete choice models (Aguirregabiria & Mira, 2010; Ryan & Gerard, 2003; Bhat et al., 2008; Leonardi, 1984): an effective tool that predicts choices from a set of available options created by hand-crafted rules. Here, we formulate the correlation of multiple persons at the global level by using the same discrete code. However, the intent options for social interactions are more subtle and difficult to define manually than those in other applications such as trajectories (Kothari et al., 2021). Therefore, we use a set of learnable codes $z^d \in \{Z_m\}_{m=1}^M$ to represent social intents, inspired by Xu et al. (2022b). However, we introduce different training strategies that are tailored to this new task (Sec. 3.3). Our motivations are: (a) Subject to physical constraints and social laws, the intents of future movements should share some deterministic patterns. For example, all intents should avoid imminent collisions anticipated from the history, even if these intents refer to different motions. We assume that such deterministic properties shared by social intents can be represented by a set of shareable codes learned directly from the data. (b) It is easier for the predictor to identify and implement different levels of functionality when the discrete intents and the predictor are jointly optimized. To further enhance the expressiveness of the codes, we retain the original continuous Gaussian noise $z^c \sim p(z)$ of the generative model $G_\theta$ and bundle the discrete intent with the noise to represent the final intent, as shown in Figure 2(c). Now the global-level modeling of social interactions in Eq. 3 is reformulated as

$z^c_1, \ldots, z^c_N \sim p(z), \quad z^d \in \{Z_m\}_{m=1}^M, \quad \{\hat{Y}_n\}_{n=1}^N = G_\theta(\{z^c_n + z^d\}_{n=1}^N, \{X_n\}_{n=1}^N). \quad (4)$

Correspondingly, the local-level modeling of individual motion in Eq. 2 becomes

$z^c_1, \ldots, z^c_N \sim p(z), \quad z^d_1, \ldots, z^d_N \in \{Z_m\}_{m=1}^M, \quad \{\hat{Y}_n\}_{n=1}^N = G_\theta(\{z^c_n + z^d_n\}_{n=1}^N, \{X_n\}_{n=1}^N). \quad (5)$
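A minimal sketch of bundling a discrete codebook entry with continuous Gaussian noise, under the two modes above (NumPy; the codebook here is randomly initialized rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 5, 3, 64                         # codebook size, persons, latent dim
Z = rng.standard_normal((M, D))            # discrete intents {Z_m}; learnable in practice

def final_intents(level, Z, N, rng):
    """Bundle a discrete intent from the codebook with continuous noise."""
    D = Z.shape[1]
    z_c = rng.standard_normal((N, D))              # continuous noise z^c_n ~ p(z)
    if level == "local":
        idx = rng.integers(0, len(Z), size=N)      # independent picks z^d_n
    else:                                          # "global"
        idx = np.full(N, rng.integers(0, len(Z)))  # one shared social intent z^d
    return z_c + Z[idx], idx

z, idx = final_intents("global", Z, N, rng)
assert z.shape == (N, D) and len(set(int(i) for i in idx)) == 1
```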

3.3. TRAINING UNDER THE GUIDANCE OF LEVEL-SPECIFIC OBJECTIVES

Using only the same discrete intent does not naturally and necessarily capture the multi-person correlation. Thus, we optimize both the parameters of the predictor and the discrete codes with a level-specific training strategy. We jointly train both levels, individual movement modeling and social interaction modeling, with each level guided by its own objective. In each forward pass, we explicitly produce different output predictions from the different intents of the two levels. Then, in the backward pass, the discrete intent codes $z^d$ are optimized separately at different levels, while the parameters $\theta$ of the forecasting model $G$ are updated based on the fused losses from the two levels.

Local-Level Training. At the local level, we train the model without social interactions. Given independent multi-person motion data $(\{X_n\}_{n=1}^N, \{Y_n\}_{n=1}^N)$, we first randomly sample the discrete intent codes and merge them with independently sampled continuous intent codes into $M \times N$ different latent codes $\{\{z^m_n\}_{n=1}^N\}_{m=1}^M$. We then use each intent and each person's past motion $X_n$ to predict $M$ future motion sequences $\{\{\hat{Y}^m_n\}_{n=1}^N\}_{m=1}^M$. The local-level objectives target single-person fidelity and overall diversity, respectively.

Global-Level Training. Meanwhile, training is also conducted at the global level to enable the modeling of social interactions. The difference from the local setting is that we combine the discrete and continuous codes into only $M$ distinct latent codes $\{\{z^m\}_{n=1}^N\}_{m=1}^M$; hence, the discrete latent code $z^d_m$ is the same for all $N$ individuals in a given $m$-th prediction. Here, we introduce learning objectives that facilitate multi-person fidelity and accuracy. Please refer to Sec. C of the Appendix for more details on the aforementioned level-specific learning objectives. Sec. G of the Appendix further demonstrates that these learning objectives are important, and in some cases critical, to the fidelity and diversity of multi-person motion.

Inference. During inference, we only use the global-level strategy, sampling the same intent for all individuals present in the scene. Note that the uncertainty of human motion substantially increases over longer horizons; we therefore predict recursively, producing $M$ predictions for the first segment, $M^2$ predictions after the second segment, and so on.
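The key difference between the two levels' reconstruction objectives is where the minimum over the $M$ outputs is taken; a schematic NumPy version (shapes and the fused weighting are illustrative assumptions):

```python
import numpy as np

def local_recon(preds, gt):
    """Local level: best-of-M error per person, then averaged over persons."""
    # preds: (M, N, F) predictions; gt: (N, F) ground truth
    err = ((preds - gt[None]) ** 2).sum(-1)   # (M, N)
    return err.min(axis=0).mean()             # min over M inside the person average

def global_recon(preds, gt):
    """Global level: best-of-M error of the whole scene, all persons jointly."""
    err = ((preds - gt[None]) ** 2).sum(-1)   # (M, N)
    return err.mean(axis=1).min()             # min over M outside the person average

rng = np.random.default_rng(0)
preds = rng.standard_normal((5, 3, 10))       # M=5 samples, N=3 persons
gt = rng.standard_normal((3, 10))
l_loc, l_glob = local_recon(preds, gt), global_recon(preds, gt)
# A fused update would backpropagate w_l * l_loc + w_g * l_glob into G_theta.
assert l_loc <= l_glob + 1e-9   # the per-person min is never worse than the joint min
```

The local variant lets each person pick their own best sample, while the global variant forces one sample to explain the whole scene, which is exactly what ties the predictions of different people together.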


Figure 3: Qualitative results of DuMMF with a DDPM. We demonstrate the generalizability of our method to a significantly more complex scenario with 18 persons. Note that our model is trained only on 3-person data. We visualize the predicted final poses at 2 seconds.

Table 1: Quantitative results of DuMMF with a DDPM, reporting ADE and FDE at 25, 50, and 75 frames. Both the baseline and our models are trained using SMPL-X representations on AMASS, and we convert them to skeletons for evaluation. Using the same backbone and generative model, our DuMMF framework provides significantly more accurate predictions with more intents.

3.4. NETWORK ARCHITECTURE

Our dual-level modeling framework in conjunction with the discrete learnable intent codes is general and, in principle, does not rely on specific network architectures or generative models. To demonstrate this, we combine our framework with various types of deterministic multi-person motion predictors and different generative models, yielding consistent and significant improvements across all baselines (see Sec. 4). As shown in Figure 2 , we abstract the encoder of the multi-person motion predictor into two parts: the local part is responsible for encoding single-person motion, while the global part is responsible for encoding multi-person motion and its interactions. A summary of the multi-person predictors used in the paper is given in Table A of the Appendix.
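The local/global encoder abstraction can be sketched as follows (a toy NumPy stand-in: mean pooling replaces the per-person encoder and uniform mixing replaces learned attention):

```python
import numpy as np

def local_encode(X):
    """Encode each person's motion independently (weights shared across persons)."""
    # X: (N, T, F) per-frame features -> (N, F) per-person embeddings;
    # mean pooling is a toy stand-in for the real single-person encoder.
    return X.mean(axis=1)

def global_encode(H):
    """Mix per-person embeddings so each one sees the others (interactions)."""
    N = H.shape[0]
    attn = np.ones((N, N)) / N      # uniform mixing as a stand-in for attention
    return attn @ H

X = np.random.randn(3, 15, 45)      # 3 persons, 15 frames, flattened pose
H = local_encode(X)                 # local part: single-person encoding
G = global_encode(H)                # global part: multi-person interaction
assert G.shape == H.shape == (3, 45)
```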

4. EXPERIMENTS

Datasets. In the main paper, we show the evaluation on two motion capture datasets, CMU-Mocap (CMU) and MuPoTS-3D (Mehta et al., 2018), and follow the same strategy to mix single-person and double-person motion together. MuPoTS-3D consists of more than 8,000 frames with up to three subjects. We convert the data to the same 15-joint human skeleton and length units as CMU-Mocap, and evaluate on MuPoTS-3D the generalization of a model trained only on CMU-Mocap. We also report our performance on the SoMoF benchmark.

Metrics. To measure diversity, we report the final pairwise distance (FPD), the average $\ell_2$ distance between all predicted final pose pairs. We disentangle the local pose and the global trajectory of the motion and measure their accuracy and diversity separately by defining the following metrics: rootADE, rootFDE, poseADE, poseFDE, rootFPD, and poseFPD. To comprehensively measure the three different aspects, including single-person fidelity, multi-person fidelity, and overall diversity, we provide a summary and analysis of all metrics for this novel task in Table F and Sec. F of the Appendix. We include results on all the above metrics in Sec. G of the Appendix.

Implementation Details. For the skeletal representation, we set the feature dimension to 128. For evaluation, we recursively predict the next 15 frames 3 times given all past frames generated, as illustrated in Sec. 3.3. Thus, given $M$ intents, the model outputs $M$, $M^2$, and $M^3$ different predictions in sequence. For the SMPL representation, we train the model to predict a 25-frame sequence of 3 people given the 10 past frames at 30 Hz. We use an 8-layer transformer, where we set the feature dimension to 512. For evaluation, similarly, we use 10 frames as the past motion and recursively predict the next 25 frames. Additional implementation details are provided in Sec. E of the Appendix.

Quantitative Results. We compare our method with a pure DDPM in Table 1. Our improvement is significant. Notably, even in the case of a single intent, where we only evaluate one prediction, our method outperforms DDPM in long-term generation.
As the number of intents increases, our method provides more accurate results, especially in long-term prediction. In Table 2, we demonstrate that our DuMMF framework benefits all predictor variants. We observe that the simple combination of deterministic predictors and CGAN results in very low diversity and accuracy. By contrast, our full approach significantly outperforms the CGAN baselines on both diversity and accuracy across all backbones. In Table D of the Appendix, we further show that DuMMF achieves the best generalization results across all predictors on MuPoTS-3D, highlighting its generality and superiority.

Ablation: Effectiveness of Dual-Level Modeling. Tables 2 and 3 show the effectiveness of our dual-level framework. First, we investigate the settings with only single-person motion modeling or only social interaction modeling. In Table 3, compared with our full method, modeling independent multi-person motion ('w/o Social') provides higher diversity but leads to inaccurate poses, since social restrictions are not considered. With only social interaction modeling, the model ('w/o Individual') cannot output sufficiently diverse predictions, which also makes the predictions inaccurate.

Ablation: Effectiveness of Discrete Human Intents. In Tables 2 and 3, we also demonstrate that discrete human intents are effective and crucial. We observe the best results when using both discrete and continuous intents, indicating that they are complementary. In the absence of discrete intents ('w/o Discrete'), the performance is only comparable with the baseline ('CGAN'). Importantly, with the help of discrete intents, the improvement of dual-level modeling ('Full') over 'w/o Separation' is more pronounced than the improvement of 'w/o Discrete' over 'CGAN.' Therefore, discrete learnable intents are essential for effectively integrating the advantages of both levels during training. The performance without continuous intents ('w/o Continuous') is slightly worse than the full method. Our hypothesis is that relying solely on discrete intents is limiting, because they only support a finite number of outputs. In Sec. G of the Appendix, we further investigate how the number of discrete intents impacts stochastic forecasting.

Qualitative Results. Consistent with the quantitative evaluation above, we observe in Figure 4 that our method provides diverse multi-person motion and produces predictions closer to the ground truth than the deterministic method MRT. In Figure 5, we qualitatively show that our generated mesh results reflect the real-world diversity of social interactions. Furthermore, we provide qualitative results for more-person scenarios in Figure 3.

Limitation. Although our dual-level framework has proven effective in producing high-quality and diverse predictions, we have observed artifacts such as foot skating in some predicted motion sequences. This is because our model relies solely on loss functions to constrain motion, rather than explicitly modeling articulated motion. As this is a common issue in learning-based methods, we plan to exploit a physical simulator to further improve the plausibility of our predicted motion.
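The recursive evaluation protocol used above, where each stage extends every sequence with $M$ continuations so that the model outputs $M$, $M^2$, and $M^3$ predictions in sequence, can be sketched as follows (a toy NumPy sketch; `predict_segment` is a hypothetical stand-in for the real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_segment(history, m, length=15):
    """Hypothetical stand-in for one forward pass with the m-th intent."""
    base = history[-1] + 0.1 * (m + 1)                       # intent-dependent offset
    return base + 0.01 * rng.standard_normal((length,) + history.shape[1:])

def recursive_forecast(history, stages, M):
    """Each stage extends every sequence with M continuations, yielding
    M, M^2, M^3, ... predictions after 1, 2, 3, ... stages."""
    seqs = [history]
    for _ in range(stages):
        seqs = [np.concatenate([s, predict_segment(s, m)])
                for s in seqs for m in range(M)]
    return seqs

history = rng.standard_normal((15, 45))       # (T_h frames, flattened pose)
out = recursive_forecast(history, stages=3, M=5)
assert len(out) == 5 ** 3                     # 125 full-horizon predictions
assert out[0].shape == (15 + 3 * 15, 45)
```

This tree-structured generation is why predictions share earlier segments and why diversity is measured on the final poses.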

5. CONCLUSION

We formulate a novel task called stochastic multi-person 3D motion forecasting, which better reflects the real-world complexities of human motion. To simultaneously achieve single-person fidelity, social realism, and overall diversity, we propose a dual-level generative modeling framework (DuMMF) with learnable latent intent codes. Compared with prior work on deterministic or single-person prediction, our model learns to generate diverse and realistic human motion and interactions. Notably, our framework is model-agnostic and generalizes to unseen more-person scenarios.

Published as a conference paper at ICLR 2023

In this Appendix, we include additional method details and experimental results that are not included in the main paper due to limited space. 1) We provide a visualization video as additional qualitative results, with details explained in Sec. A. 2) We include a further discussion of related work in Sec. B. 3) We detail the level-specific learning objectives in Sec. C and explain the different generative models incorporated in our proposed dual-level modeling framework in Sec. D. 4) We provide additional details of the experimental implementation in Sec. E and a summary of evaluation metrics in Sec. F. 5) To elaborate on the effectiveness of our method, we provide additional ablation experiments with qualitative and quantitative analysis in Sec. G, and we evaluate our approach in more challenging scenarios with a significantly increased number of people in Sec. H.

A VISUALIZATION VIDEO

In addition to Figure 4 and Figure 5 in the main paper and more qualitative results in this Appendix (Figures A and B), we provide a video to demonstrate more comprehensive visualizations of multi-person 3D motion forecasting at https://sirui-xu.github.io/DuMMF/images/demo.mp4. In this video, we illustrate that our method DuMMF generates diverse multi-person motion and social interactions, while taking into account both single-person and multi-person fidelity. We also show that our model is scalable and provides effective predictions in more challenging scenarios with a significantly increased number of people and the associated more complex interactions. We also highlight the impact and effectiveness of our dual-level modeling framework.

B ADDITIONAL DISCUSSION ON RELATED WORK

As we demonstrate in the main paper, our proposed stochastic multi-person motion forecasting needs to simultaneously take into account single-person pose fidelity, consistency of pose and trajectory, social interactions between poses, and the overall diversity of motion, while prior work, including stochastic multi-person trajectory forecasting (Alahi et al., 2014), addresses only a subset of these aspects.

C ADDITIONAL DETAILS OF LEVEL-SPECIFIC OBJECTIVES

We use $\Delta$ to represent the residual of a motion sequence. For example, $\Delta\hat{Y}^j_i = [\hat{Y}^j_i[T_h+1] - X_i[T_h],\ \hat{Y}^j_i[T_h+2] - \hat{Y}^j_i[T_h+1],\ \ldots,\ \hat{Y}^j_i[T_h+T_p] - \hat{Y}^j_i[T_h+T_p-1]]$.

Local-Level Objectives. We adopt the multiple output loss (Guzmán-rivera et al., 2012) and extend it to the local reconstruction loss of multiple people $\mathcal{L}_{lR}$, which is used to optimize the most accurate prediction of each person while maintaining diversity. We highlight the structure of the human skeleton by introducing the limb loss (Mao et al., 2021) $\mathcal{L}_L$. Specifically,

$\mathcal{L}_{lR} = \frac{1}{N} \sum_{n=1}^N \min_{m=1,\ldots,M} \|\Delta\hat{Y}^m_n - \Delta Y_n\|_2^2, \qquad \mathcal{L}_L = \frac{1}{NM} \sum_{n=1}^N \sum_{m=1}^M \|\hat{L}^m_n - L_n\|_2^2,$

where the vector $L_n$ represents the ground-truth distances between all pairs of joints that are physically connected in the $n$-th human body, and $\hat{L}^m_n$ includes the limb lengths for all the $T_p$ poses in $\hat{Y}^m_n$. We further develop a multi-modal reconstruction loss $\mathcal{L}_{mmR}$ to provide additional supervision for all outputs $\{\{\Delta\hat{Y}^m_n\}_{n=1}^N\}_{m=1}^M$. We first construct pseudo future motion $\{Y^p_i\}_{p=1}^P$ for each historical sequence $X_i$. Different from (Yuan & Kitani, 2020; Mao et al., 2021), we additionally consider the translation $T \in \mathbb{R}^3$ and rotation $R \in \mathbb{R}^{3\times3}$ of the pose. Specifically, given a threshold $\epsilon$, we cluster future motions with a similar start pose and train the model with their residuals:

$\{Y^p_i\}_{p=1}^P = \{Y^p_i \mid \min_{R,T} \|R(X^p_i[T_h] - T) - X_i[T_h]\|_2 \leq \epsilon\}, \qquad \mathcal{L}_{mmR} = \frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \min_{m=1,\ldots,M} \|\Delta\hat{Y}^m_n - \Delta Y^p_n\|_2^2.$

To explicitly encourage diversity, we adopt a diversity-promoting loss (Yuan & Kitani, 2020), which directly promotes the pairwise distance between the predictions of a single person. We decompose this loss into two parts, promoting the diversity of the local pose and the global root separately. Supposing that $\hat{Y}^m_n(l)$ and $\hat{Y}^m_n(g)$ are the local pose and the global root joint extracted from the global pose $\hat{Y}^m_n$, respectively, and $\alpha$ and $\beta$ are two hyperparameters, this diversity-promoting loss is

$\mathcal{L}_D = \frac{1}{NM(M-1)} \sum_{n=1}^N \sum_{m=1}^M \sum_{k=m+1}^M \left[\exp\!\left(-\frac{\|\hat{Y}^m_n(g) - \hat{Y}^k_n(g)\|_2^2}{\alpha}\right) + \exp\!\left(-\frac{\|\hat{Y}^m_n(l) - \hat{Y}^k_n(l)\|_2^2}{\beta}\right)\right].$

Table B: Summary of the complementary evaluation metrics for the multi-person 3D motion forecasting task, with each focusing on a different aspect of predicted motion. For simplicity, we show the metrics without alignment. We also provide ADE, FDE, and FPD with alignment to evaluate the pose and trajectory separately, as explained in Sec. 4 of the main paper.

Single-Person Fidelity:
- Local Average Displacement Error (lADE): $\frac{1}{NT_p}\sum_{n=1}^N \min_m \|\hat{Y}^m_n - Y_n\|_2$
- Local Final Displacement Error (lFDE): $\frac{1}{N}\sum_{n=1}^N \min_m \|\hat{Y}^m_n[T_p] - Y_n[T_p]\|_2$
- Foot Skating Ratio (FSR): average ratio of frames where both foot joints are close to the ground ($\leq 5$ cm) and fast ($\geq 75$ mm/s)

Multi-Person Fidelity:
- (Global) Average Displacement Error (ADE): $\min_m \frac{1}{NT_p}\sum_{n=1}^N \|\hat{Y}^m_n - Y_n\|_2$
- (Global) Final Displacement Error (FDE): $\min_m \frac{1}{N}\sum_{n=1}^N \|\hat{Y}^m_n[T_p] - Y_n[T_p]\|_2$
- Trajectory Collision Ratio (TCR): average ratio of frames where there is a collision between any two trajectories
- Average Human Displacement (AHD): $\frac{1}{NM}\sum_{n=1}^N \sum_{m=1}^M \|\hat{Y}^m_n[T_p] - \hat{Y}^m_n[1]\|_2$

Overall Diversity:
- Final Pairwise Distance (FPD): $\frac{1}{NM(M-1)}\sum_{n=1}^N \sum_{m=1}^M \sum_{k=m+1}^M \|\hat{Y}^m_n[T_p] - \hat{Y}^k_n[T_p]\|_2$

For CGAN, a GAN loss (Kocabas et al., 2020) is leveraged to train the model and the local discriminator $D_l$ for individual body realism. Supposing $\{Y^*_n\}_{n=1}^N$ is the set of real motion clips sampled from the data, we have

$\mathcal{L}_{lGAN} = \frac{1}{NM} \sum_{n=1}^N \sum_{m=1}^M \|D_l(\hat{Y}^m_n)\|_2^2 + \frac{1}{N} \sum_{n=1}^N \|D_l(Y^*_n) - 1\|_2^2.$

Global-Level Objectives. Since in this setting we treat the $N$ individuals as a whole, the reconstruction loss is reformulated as

$\mathcal{L}_{gR} = \min_{m=1,\ldots,M} \frac{1}{N} \sum_{n=1}^N \|\Delta\hat{Y}^m_n - \Delta Y_n\|_2^2.$

For CGAN, a global GAN loss is further leveraged to promote the realism of social motion, where the global discriminator $D_g$ takes the motion of all $N$ people as input. Supposing $\{Y^{**}_n\}_{n=1}^N$ is the multi-person motion clip sampled from the data, we have

$\mathcal{L}_{gGAN} = \frac{1}{M} \sum_{m=1}^M \|D_g(\{\hat{Y}^m_n\}_{n=1}^N)\|_2^2 + \|D_g(\{Y^{**}_n\}_{n=1}^N) - 1\|_2^2.$

Table C: Quantitative results (with error bars) of our DuMMF on single-person accuracy (lADE and lFDE at $t$ = 1 s, 2 s, and 3 s).

Local Discriminator. To ensure the fidelity of the motion, especially to address the problem of foot skating, where the feet appear to slide in the ground plane, we concatenate the predicted motion of each person $\hat{Y}^j_i$ with the foot velocities $\Delta F^j_i$ as input to a local discriminator $D_l$. The local discriminator uses a local-range transformer encoder adopted from Wang et al. (2021b).

Global Discriminator. To ensure the realism of social interactions and avoid motion collisions, we propose a global discriminator $D_g$ that encodes all the motion of the $N$ people $\{\hat{Y}^j_n\}_{n=1}^N$ at the same time and outputs a fidelity score. The global discriminator uses a global-range transformer encoder adopted from Wang et al. (2021b).
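As a concrete illustration, the limb loss $\mathcal{L}_L$, which supervises all $M$ outputs, and the diversity-promoting term can be sketched in NumPy (the toy skeleton, scales, and the sign convention that the diversity loss shrinks as pairwise distances grow are assumptions of this sketch):

```python
import numpy as np

def limb_lengths(Y, edges):
    """Distances between physically connected joints, per frame."""
    # Y: (T, V, 3); edges: list of (joint_a, joint_b) index pairs
    return np.stack([np.linalg.norm(Y[:, a] - Y[:, b], axis=-1) for a, b in edges], -1)

def loss_limb(Y_hat, Y, edges):
    """L_L: supervises the limb lengths of ALL M outputs, not just the best."""
    L_gt = limb_lengths(Y, edges)                                  # (T, E)
    return np.mean([((limb_lengths(y, edges) - L_gt) ** 2).sum() for y in Y_hat])

def loss_D(Y_hat, alpha=1.0):
    """Diversity term for one stream: shrinks as pairwise distances grow."""
    M = len(Y_hat)
    flat = Y_hat.reshape(M, -1)
    total = sum(np.exp(-((flat[m] - flat[k]) ** 2).sum() / alpha)
                for m in range(M) for k in range(m + 1, M))
    return total / (M * (M - 1) / 2)

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2)]                                    # toy 3-joint chain
Y = rng.standard_normal((10, 3, 3))                         # (T_p, V, 3) ground truth
Y_hat = Y[None] + 0.1 * rng.standard_normal((4, 10, 3, 3))  # M=4 noisy outputs
assert loss_limb(Y_hat, Y, edges) > 0
assert loss_D(Y_hat) < loss_D(np.repeat(Y_hat[:1], 4, axis=0))  # collapse penalized
```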

E ADDITIONAL IMPLEMENTATION DETAILS

Here, we provide more details on the implementation of our method DuMMF, including the two hyperparameters $(\alpha, \beta)$ in the diversity-promoting loss (Sec. C).

Table D: Quantitative comparison on MuPoTS-3D between our DuMMF, deterministic forecasting baselines, and their CGAN variants. All models are trained only using skeletal representations on CMU-Mocap, and we compare their generalization on MuPoTS-3D here. The number of intents is set to 5 for stochastic forecasting in 3-person (top) and 2-person (bottom) scenarios. DuMMF significantly improves multi-person accuracy and diversity across various architectures and deterministic predictors.

The average pairwise distance (APD) (Yuan & Kitani, 2020; 2019) is the average $\ell_2$ distance between all predicted motion pairs. In this paper, we formulate diverse forecasting as producing more forecasts over time (see Sec. 3.3). This progressive generation is actually closer to reality, because the multi-modality of motion should be more pronounced further in time. However, in this case, APD cannot reflect the diversity well, since many predictions share the same earlier segments. Therefore, we only examine the diversity of the last pose (FPD), as the last poses should not be the same.

Moreover, we introduce three tailored metrics to evaluate specific aspects of predicted motion, which correspond to the unique challenges in multi-person motion forecasting as discussed in the main paper. (a) Foot Skating Ratio (Zhang et al., 2021): the average ratio of frames with foot skating. (b) Trajectory Collision Ratio: the average ratio of predictions that are considered to have a collision (Kothari et al., 2020) between any two trajectories in the scene. (c) Average Human Displacement: the average displacement of the predicted human body between the last frame and the first frame, reflecting the properties of the predicted motion distribution.
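Two of these tailored metrics can be sketched as follows (NumPy; the collision threshold is a placeholder assumption, not the paper's value):

```python
import numpy as np

def trajectory_collision_ratio(roots, thresh=0.2):
    """TCR sketch: fraction of frames in which any two root trajectories
    come within `thresh` of each other (threshold is illustrative)."""
    N, T, _ = roots.shape
    collide = np.zeros(T, dtype=bool)
    for i in range(N):
        for j in range(i + 1, N):
            d = np.linalg.norm(roots[i] - roots[j], axis=-1)   # (T,)
            collide |= d < thresh
    return collide.mean()

def average_human_displacement(Y_hat):
    """AHD: mean displacement between the last and first predicted frames."""
    # Y_hat: (M, N, T, V, 3)
    d = np.linalg.norm(Y_hat[:, :, -1] - Y_hat[:, :, 0], axis=-1)  # (M, N, V)
    return d.mean()

# Two parallel trajectories 1 m apart never collide under a 0.2 m threshold.
t = np.linspace(0, 1, 30)
roots = np.stack([np.stack([t, np.zeros(30), np.zeros(30)], -1),
                  np.stack([t, np.ones(30), np.zeros(30)], -1)])
assert trajectory_collision_ratio(roots) == 0.0
```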

G ADDITIONAL EXPERIMENTAL RESULTS

Additional Quantitative Results. We compare our method with MRT (Wang et al., 2021b) in Table G. We show that the improvement is significant by performing each experiment five times with different random seeds and reporting error bars. When the number of intents is one, which is equivalent to a deterministic setting, our method marginally outperforms MRT on all metrics on CMU-Mocap and also generalizes better to MuPoTS-3D. This suggests the generality of our approach, which is advantageous even for deterministic prediction. In Table C, we use lADE and lFDE to compare single-person fidelity, and observe that our model also significantly outperforms the baseline.

We further evaluate on the SoMoF benchmark, where we use labeled trajectories and poses (13-joint human skeleton) but do not use videos of the given scenes as input. We discard the multi-modal reconstruction loss since the 3DPW data provided by the SoMoF benchmark is relatively small. In Table E, we provide results of our implemented deterministic prediction and compare with baselines reported directly from the SoMoF benchmark leaderboard. In Table F, we use the SoMoF benchmark for stochastic multi-person forecasting. Note that ADE and FDE require access to ground truth data, which is not publicly available for the SoMoF test set; thus, we report results on the validation set. We observe a similar trend as on CMU-Mocap (CMU) and MuPoTS-3D (Mehta et al., 2018): the improvement brought by our DuMMF for the multi-person forecasting model is much higher than that for the single-person forecasting model.

Additional Ablation on Impact of Learning Objectives. In Table I, we evaluate the impact of each loss term within the dual-level framework. In general, using all loss functions yields the best results, as each of its results is either the best or the second best. We observe that the local and global discriminators make predictions not only more accurate but also more diverse. Note that the reconstruction losses L_lR and L_gR optimize only the most accurate prediction among all the outputs; it is therefore important to provide supervision for the other predictions as well. We observe that the limb loss L_L is crucial, as it is the only loss function that provides supervision for all outputs. The multi-modal reconstruction loss also has a large performance impact, since it provides supervision for more than one output.

Additional Analysis of the Number of Discrete Latent Codes. Note that the number of discrete latent codes is not restricted to be the same as the number of predictions per second. We chose them to be the same for simplicity and a better trade-off between prediction performance and training efficiency, and we find that this setup also achieves the best performance. To ensure that all predictions come from different discrete codes, the number of discrete latent codes should be no less than the number of predictions. In Table J, we provide an ablation study with 5 predictions per second. If the number of discrete intents is greater than 5, we randomly select discrete codes without replacement (excluding any previously selected intents) to generate predictions. We observe that both the accuracy and diversity of the predictions decrease as the number of intents increases. A possible explanation: when the number of discrete latent codes equals the number of predictions, each code is explicitly and fully optimized for a particular prediction, leading to the best prediction accuracy and diversity; when the number of intents increases, the random selection strategy may hurt performance, as the probability of selecting the best five discrete codes decreases. We hypothesize that a better training and selection strategy might improve performance with more discrete codes.
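The without-replacement selection described above can be sketched in a few lines (a hypothetical helper for illustration; the function name and the use of NumPy's `Generator.choice` are our own assumptions, not the paper's implementation):

```python
import numpy as np

def select_intents(num_codes, num_preds, rng):
    """Pick one discrete intent index per prediction, without replacement,
    so that all predictions are guaranteed to come from different codes."""
    assert num_codes >= num_preds, "need at least as many codes as predictions"
    return rng.choice(num_codes, size=num_preds, replace=False)
```

For instance, with 8 discrete codes and 5 predictions, `select_intents(8, 5, rng)` returns 5 distinct code indices; as discussed above, the chance that these 5 include the best-optimized codes shrinks as `num_codes` grows.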

H GENERALIZABILITY AND SCALABILITY: EVALUATION ON MORE-PERSON SCENARIOS

Datasets with More People per Scene. In the main paper, we constructed multi-person motion data with 3 people per scene. Here, we construct more challenging datasets with a significantly increased number of people, which also increases the complexity of social interactions. Specifically, we first sample 2-person and 1-person motion data from CMU-Mocap (CMU), and then compose them together, using handcrafted rules to filter out scenes with trajectory collisions. Instead of retraining the model in this novel setting with more people per scene, we directly evaluate the model trained on 3-person motion data, testing whether our model is able to scale to scenarios with more people.

Qualitative Results. Similar to Figure 3 of the main paper, we provide visualizations of the end poses of 3-second predictions for 6-person and 9-person scenarios in Figure A and Figure B. We also randomly select one prediction and visualize its full motion sequence. Note that our model is trained only on 3-person data without any fine-tuning on more-person data, yet it still performs well, suggesting the generalizability and scalability of our approach.
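The compositing and collision-filtering step might look like the following sketch (the 0.5 m distance threshold, the function names, and the use of 2D root trajectories are illustrative assumptions, not the actual handcrafted rules):

```python
import numpy as np

def has_trajectory_collision(trajs, min_dist=0.5):
    """trajs: (P, T, 2) array of root xy-trajectories for P people over T frames.
    Returns True if any two people ever come closer than min_dist."""
    num_people = len(trajs)
    for i in range(num_people):
        for j in range(i + 1, num_people):
            dists = np.linalg.norm(trajs[i] - trajs[j], axis=-1)
            if dists.min() < min_dist:
                return True
    return False

def compose_scene(groups, min_dist=0.5):
    """Stack person groups (e.g., a 2-person clip and a 1-person clip)
    into one scene; reject the composite if trajectories collide."""
    scene = np.concatenate(groups, axis=0)  # (P_total, T, 2)
    return None if has_trajectory_collision(scene, min_dist) else scene
```

A scene composed of well-separated clips passes the filter, while one with crossing trajectories is rejected and re-sampled.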



Figure 1: Illustration of the multifaceted challenges in the proposed task of stochastic multi-person 3D motion forecasting. (a) Single-person fidelity: for each person, the predicted pose and trajectory should be realistic and consistent with each other, e.g., to avoid foot floating and skating. (b) Multi-person fidelity: multi-person motion in a scene inherently involves social interactions, e.g., to avoid motion collisions. (c) Overall diversity: long-term human motion is uncertain and stochastic; we address this intrinsic multi-modality, while existing work (Wang et al., 2021b; Adeli et al., 2020; 2021; Guo et al., 2022) simplifies the task to deterministic prediction.

Figure 2: Overview of our proposed dual-level generative modeling, illustrated with motion inputs of three persons. (a) At the local level of modeling individual motion, we combine encoded multi-person embeddings with independent intent codes. (b) In contrast, the global level of modeling social interactions requires all latent codes to be the same. (c) The latent codes comprise both discrete intent codes, which are learned from the data and represented as a set, and continuous intent codes. (d) We abstract the encoder of a multi-person predictor as the combination of a local branch that encodes single-person motion and a global branch that encodes multi-person motions (see Table A of the Appendix for instantiations).
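The switch between the two modes of operation in (a) and (b) can be sketched as follows (a simplified NumPy sketch; the function and argument names are our own, and the real model works on learned embeddings rather than a raw code table):

```python
import numpy as np

def assign_intent_codes(codebook, num_people, level, rng, idx=None):
    """Assign one discrete intent code per person.

    codebook: (K, D) table of K learnable intent codes.
    level='local'  -> each person draws an independent code (individual motion).
    level='global' -> all persons share one code (a common social intent).
    idx: optional preselected code index/indices, e.g., for reproducibility.
    """
    num_codes = len(codebook)
    if level == "local":
        idx = rng.integers(num_codes, size=num_people) if idx is None else np.asarray(idx)
    elif level == "global":
        shared = rng.integers(num_codes) if idx is None else int(idx)
        idx = np.full(num_people, shared)
    else:
        raise ValueError(f"unknown level: {level}")
    return codebook[idx]  # (num_people, D)
```

The same codebook serves both levels; only the indexing pattern changes, which is what allows the dual-level mechanism to live inside one shared generative model.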

Figure 4: Qualitative results of DuMMF with a CGAN model. The leftmost column shows the ground-truth starting poses, and the second column shows the ground-truth poses three seconds later. We show our five sampled 3-second predictions in the middle. Our model produces diverse predictions, one of which is closer to the ground truth (highlighted by the red box) than MRT's (rightmost column).

Quantitative comparison of our DuMMF with deterministic forecasting baselines and their CGAN stochastic variants, using skeletal representations on CMU-Mocap. The number of intents is set to 5 for stochastic forecasting in 3-person (top) and 2-person (bottom) scenarios. DuMMF significantly improves multi-person accuracy and diversity across various architectures and deterministic predictors. Additionally, our discrete and continuous intent codes are complementary to each other in most cases.

Adeli et al., 2020; 2021) in Sec. G of the Appendix. Metrics. For evaluating multi-person motion accuracy and diversity, we adopt the common metrics used in stochastic forecasting (Mao et al., 2021; Yuan & Kitani, 2020; 2019; Salzmann et al., 2020; Kothari et al., 2021) as follows. For accuracy measurement, we follow the Best-of-N (BoN) evaluation and use (a) Average Displacement Error (ADE): the average ℓ2 distance over time between the ground truth and the prediction closest to the ground truth; (b) Final Displacement Error (FDE): the ℓ2 distance between the final pose of the ground truth and the closest final predicted pose. For diversity measurement, we employ (c) Final Pairwise Distance (FPD): the average pairwise ℓ2 distance between the final poses of all predicted samples.
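A minimal NumPy sketch of these metrics follows (array shapes and the helper name are our own illustrative choices; the paper's evaluation code may differ in detail):

```python
import numpy as np

def best_of_n_metrics(preds, gt):
    """Best-of-N accuracy (ADE/FDE) and diversity (FPD) for multi-person motion.

    preds: (N, T, P, J, 3) array of N sampled predictions over T frames,
           P persons, and J joints in 3D.
    gt:    (T, P, J, 3) ground-truth motion.
    """
    # Per-joint l2 error for every sample and frame, averaged over persons/joints.
    err = np.linalg.norm(preds - gt[None], axis=-1)  # (N, T, P, J)
    per_frame = err.mean(axis=(2, 3))                # (N, T)

    ade = per_frame.mean(axis=1).min()   # best sample, averaged over time
    fde = per_frame[:, -1].min()         # best sample at the final frame

    # FPD: average pairwise distance between the final poses of all N samples.
    finals = preds[:, -1].reshape(len(preds), -1)
    pair_dists = np.linalg.norm(finals[:, None] - finals[None], axis=-1)
    n = len(preds)
    fpd = pair_dists.sum() / (n * (n - 1)) if n > 1 else 0.0
    return ade, fde, fpd
```

Note that ADE/FDE take a minimum over samples (rewarding the best guess), while FPD aggregates over all sample pairs (rewarding spread), which is why the two can be improved jointly.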

Figure 5: Qualitative results of DuMMF with a DDPM model. We visualize the predicted frames for two different 3-person input motions, which are listed on the left and right respectively. For each input, we generate four sampled motions, arranged in a single column and listed sequentially in time at 0, 1, 2, and 3 seconds. Our method effectively produces diverse human-like social interactions.

, and others in Figure A and Figure B of the Appendix. Please refer to Sec. H of the Appendix for more details on the more-person setting.

C) are set to (50, 100). The model is trained with a batch size of 32 for 50 epochs, with 6000 training examples per epoch. We use Adam (Kingma & Ba, 2014) to train the model. The code is based on PyTorch. On one NVIDIA GeForce GTX TITAN X GPU, training one epoch takes approximately 5 minutes. Regarding licenses, CMU-Mocap (CMU) is free for all users; MuPoTS-3D (Mehta et al., 2018) is for non-commercial purposes. Part of our code is based on AMCParser (MIT license), attention-is-all-you-need-pytorch (MIT license), MRT (Wang et al., 2021b) (license not specified), and XIA (Guo et al., 2022) (GPL license).

Figure A: Qualitative results on CMU-Mocap. We evaluate the generalizability and scalability of our model in predicting 3-second motion on the constructed 6-person motion test data. The top row shows the historical poses and the five end poses predicted by our model; the bottom two rows show the full sequence of one of the predictions (highlighted by the blue dashed box).

Figure B: Qualitative results on CMU-Mocap. We evaluate the generalizability and scalability of our model in predicting 3-second motion on the constructed 9-person motion test data. The top row shows the historical poses and the five end poses predicted by our model; the bottom two rows show the full sequence of one of the predictions (highlighted by the blue dashed box).

Welling, 2013), generative adversarial networks (GANs) (Goodfellow et al., 2014), normalizing flows (NFs) (Rezende & Mohamed, 2015), and diffusion models (Sohl-Dickstein et al., 2015; Song et al., 2020; Ho et al., 2020). Some recent approaches (Bhattacharyya et al., 2018; Dilokthanakul et al., 2016; Gurumurthy et al., 2017; Yuan & Kitani, 2019; 2020; Zhang et al., 2021; Mao et al., 2021; Xu et al., 2022b; Petrovich et al., 2022) emphasize promoting diversity. Mao et al. (2021) sequentially generate the different parts of a pose for better controllability of diversity. Xu et al. (2022b) introduce learnable anchors in the latent space to guide samples toward sufficient diversity. Although these methods can predict very diverse human motion sequences, most of them are limited to local motion and ignore the global trajectory. Some of the produced motion sequences are in fact unrealistic; in particular, once global trajectories are incorporated, severe foot skating can occur. Predicting human motion under the constraint of scene context (Cao et al., 2020; Hassan et al., 2021; Zhang & Tang, 2022) has recently been explored, where the effect of global trajectories and scenes on human motion is considered. Instead of predicting single-person movement, our work focuses on diverse multi-person movements and social interactions. Multi-Person Forecasting. So far, research on multi-person forecasting has mainly focused on global trajectory forecasting (Helbing & Molnar, 1995; Mehran et al., 2009; Pellegrini et al., 2009;

The latent code can be interpreted as a social intent that guides the future movements of multiple persons. With this social intent z sampled from the given distribution p(z), a deterministic neural network G_θ with parameters θ maps the observed motion and z to a prediction.

Ablation study of our DuMMF with a CGAN and MRT (Wang et al., 2021b) using skeletal representations on CMU-Mocap and MuPoTS-3D. We report both accuracy and diversity for root and pose separately. The results show the effectiveness of our dual-level modeling along with discrete motion intents, and the complementarity of local-level and global-level modeling.

as more varied outputs have a better chance of covering the ground truth. Note that the results of 'CGAN' and 'w/o Separation' in Table 2 and 'w/o Separation' in Table 3 are worse since they simply use all the learning objectives together without disentangling the two levels of modeling, while 'w/o Separation' is slightly better due to its use of discrete intents. Under our dual-level framework with level-specific motion intents and learning objectives, the model can more effectively incorporate the benefits of both levels, leading to improved accuracy and diversity. In Sec. G of the Appendix, we further demonstrate that our dual-level modeling benefits different predictor variants.

Both methods utilize additional contextual information to aid deterministic prediction with their proposed Social Motion Forecasting (SoMoF) benchmark (Adeli et al., 2020; 2021). Guo et al. (2022) propose a cross-interaction attention mechanism to predict cross dependencies between two pose sequences, making it applicable only to 2-person scenarios.

Summary of methods for encoding and integrating multi-person motions and interactions. We abstract the encoder into a local part and a global part, as in Figure 2(d). As we use the same decoder for all baselines (Sec. 4 of the main paper), we only discuss the encoders of the three predictors in this paper. Note that XIA (Guo et al., 2022) can only be applied to two-person scenarios. network to encode both spatial interactions and temporal features. GroupNet (Xu et al., 2022a) employs a multi-scale hypergraph neural network that models group-based interactions and facilitates more comprehensive relational reasoning. T-GNN (Xu et al., 2022c) introduces a transferable graph neural network that allows not only trajectory prediction but also domain alignment of potential distribution differences. MID (Gu et al., 2022) employs a diffusion model to model the variation of indeterminacy for trajectory prediction.

Final Pairwise Distance (FPD), and briefly discussed other metrics. Here, we explain these and additional metrics in detail and also provide a systematic summary in Table B for better understanding. Additional comparisons based on these metrics are shown in the following sections. As summarized in Table B, we group the metrics into three types, with each type evaluating a different aspect of predicted motion (discussed in Sec. 4 of the main paper): single-person fidelity, multi-person fidelity, and overall diversity.

Quantitative comparison on the SoMoF benchmark. Here, we only show the deterministic forecasting results. Our method with the MRT predictor significantly outperforms two deterministic baselines. We use VIM (Adeli et al., 2020) as the metric. * indicates results reported directly from the benchmark leaderboard.

Quantitative comparisons between our DuMMF and deterministic forecasting baselines and their CGAN variants on the SoMoF benchmark. The number of intents is set to 5 for stochastic forecasting. Our DuMMF significantly improves multi-person accuracy and diversity.

Local Average Displacement Error (lADE) and Local Final Displacement Error (lFDE) are proposed to further evaluate single-person fidelity. They compute the average distance between the individual ground truth and the individual prediction closest to it. Note that ADE (or lADE) and FDE (or lFDE) only measure the best prediction among all outputs. While we could average ADE (or lADE) and FDE (or lFDE) by computing the distance between multiple predictions and a single ground truth, this way of assessing overall prediction quality cannot reflect motion realism: for example, a very realistic but diverse set of outputs may have poor average ADE and FDE. Therefore, we do not use such metrics in our evaluation.
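Concretely, lADE/lFDE select the best sample per person rather than per scene, which the following sketch illustrates (array shapes and the helper name are our own assumptions, not the paper's evaluation code):

```python
import numpy as np

def local_ade_fde(preds, gt):
    """Local Best-of-N errors (lADE/lFDE).

    preds: (N, T, P, J, 3) sampled predictions; gt: (T, P, J, 3) ground truth.
    For each person independently, pick the sample closest to that person's
    ground truth, then average the per-person errors.
    """
    err = np.linalg.norm(preds - gt[None], axis=-1)  # (N, T, P, J)
    per_person = err.mean(axis=3)                    # (N, T, P)
    lade = per_person.mean(axis=1).min(axis=0).mean()  # best sample per person
    lfde = per_person[:, -1].min(axis=0).mean()
    return lade, lfde
```

Unlike scene-level ADE, lADE can be zero even when no single sample matches the whole scene, as long as each person is matched by some sample.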

Quantitative results (with error bars) of DuMMF with a CGAN and the baseline MRT (Wang et al., 2021b). The baseline and our models are trained only on CMU-Mocap, and are tested on CMU-Mocap (top) and MuPoTS-3D (bottom). With the same backbone, our DuMMF framework significantly outperforms MRT on deterministic prediction, and provides more accurate and diverse predictions with more intents and predictions.

Mehta et al., 2018). Specifically, using DuMMF, the model achieves significantly better accuracy and diversity in stochastic multi-person forecasting.

Ablation study of our DuMMF model on CMU-Mocap using skeletal representations. We report accuracy and diversity with error bars for root and pose, respectively. The results show the impact of different learning objectives. Best results are bolded, and second-best results are underlined.

Ablation study of the number of discrete latent codes on CMU-Mocap using skeletal representations. For producing 5 predictions per second, we observe that both the accuracy and diversity of predictions decrease significantly as the number of discrete intents increases.


Acknowledgement. This work was supported in part by NSF Grant 2106825, NIFA Award 2020-67021-32799, the Jump ARCHES endowment through the Health Care Engineering Systems Center, the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign through the NCSA Fellows program, the IBM-Illinois Discovery Accelerator Institute, the Illinois-Insper Partnership, and the Amazon Research Award. This work used NVIDIA GPUs at NCSA Delta through allocation CIS220014 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by NSF Grants #2138259, #2138286, #2138307, #2137603, and #2138296.


Ethics Statement. Our proposed technique is useful in many applications, such as self-driving, where anticipating crowd motion helps avoid collisions. The potential negative societal impacts include: (a) our approach can be used to synthesize highly realistic human motion, which might lead to the spread of false information; (b) our approach requires real behavioral information as input, which may raise privacy concerns and result in the disclosure of sensitive identity information. Nevertheless, unlike raw data, our model operates on a processed human skeleton representation that contains minimal identifying information; on the positive side, this can be seen as a privacy-enhancing feature.

