GENERATIVE MODEL-ENHANCED HUMAN MOTION PREDICTION

Abstract

The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shift up to and including out-of-distribution (OoD) samples. Here we formulate a new OoD benchmark based on the Human3.6M and CMU motion capture datasets, and introduce a hybrid framework for hardening discriminative architectures to OoD failure by augmenting them with a generative model. When applied to current state-of-the-art discriminative models, we show that the proposed approach improves OoD robustness without sacrificing in-distribution performance, and can theoretically facilitate model interpretability. We suggest human motion predictors ought to be constructed with OoD challenges in mind, and provide an extensible general framework for hardening diverse discriminative architectures to extreme distributional shift.

1. INTRODUCTION

Human motion is naturally intelligible as a time-varying graph of connected joints constrained by locomotor anatomy and physiology. Its prediction allows the anticipation of actions with applications across healthcare (Geertsema et al., 2018; Kakar et al., 2005), physical rehabilitation and training (Chang et al., 2012; Webster & Celik, 2014), robotics (Koppula & Saxena, 2013b;a; Gui et al., 2018b), navigation (Paden et al., 2016; Alahi et al., 2016; Bhattacharyya et al., 2018; Wang et al., 2019), manufacture (Švec et al., 2014), entertainment (Shirai et al., 2007; Rofougaran et al., 2018; Lau & Chan, 2008), and security (Kim & Paik, 2010; Ma et al., 2018). The favoured approach to predicting movements over time has been purely inductive, relying on the history of a specific class of movement to predict its future. For example, state space models (Koller & Friedman, 2009) enjoyed early success for simple, common or cyclic motions (Taylor et al., 2007; Sutskever et al., 2009; Lehrmann et al., 2014). The range, diversity and complexity of human motion has encouraged a shift to more expressive, deep neural network architectures (Fragkiadaki et al., 2015; Butepage et al., 2017; Martinez et al., 2017; Li et al., 2018; Aksan et al., 2019; Mao et al., 2019; Li et al., 2020b; Cai et al., 2020), but still within a simple inductive framework. This approach would be adequate were actions both sharply distinct and highly stereotyped. But their complex, compositional nature means that within one category of action the kinematics may vary substantially, while between two categories they may barely differ. Moreover, few real-world tasks restrict the plausible repertoire to a small number of classes-distinct or otherwise-that could be explicitly learnt. Rather, any action may be drawn from a great diversity of possibilities-both kinematic and teleological-that shape the characteristics of the underlying movements. This has two crucial implications.
First, any modelling approach that lacks awareness of the full space of motion possibilities will be vulnerable to poor generalisation and brittle performance in the face of kinematic anomalies. Second, the very notion of In-Distribution (ID) testing becomes moot, for the relations between different actions and their kinematic signatures are plausibly determinable only across the entire domain of action. A test here arguably needs to be Out-of-Distribution (OoD) if it is to be considered a robust test at all. These considerations are amplified by the nature of real-world applications of kinematic modelling, such as anticipating arbitrary deviations from expected motor behaviour early enough for an automatic intervention to mitigate them. Most urgent in the domain of autonomous driving (Bhattacharyya et al., 2018; Wang et al., 2019), such safety concerns are of the highest importance, and are best addressed within the fundamental modelling framework. Indeed, Amodei et al. (2016) cite the ability to recognize our own ignorance as a safety mechanism that must be a core component in safe AI. Nonetheless, to our knowledge, current predictive models of human kinematics neither quantify OoD performance nor are designed with it in mind. There is therefore a need for two frameworks, applicable across the domain of action modelling: one for hardening a predictive model to anomalous cases, and another for quantifying OoD performance with established benchmark datasets. General frameworks are here desirable in preference to new models, for the field is evolving so rapidly that greater impact can be achieved by introducing mechanisms that can be applied to a breadth of candidate architectures, even if they are demonstrated in only a subset.
Our approach here is founded on combining a latent variable generative model with a standard predictive model, illustrated with the current state-of-the-art discriminative architecture (Mao et al., 2019; Wei et al., 2020), a strategy that has produced state-of-the-art results in the medical imaging domain (Myronenko, 2018). Our aim is to achieve robust performance within a realistic, low-volume, high-heterogeneity data regime by providing a general mechanism for enhancing a discriminative architecture with a generative model. In short, our contributions to the problem of achieving robustness to distributional shift in human motion prediction are as follows:

1. We provide a framework to benchmark OoD performance on the most widely used open-source motion capture datasets, Human3.6M (Ionescu et al., 2013) and CMU-Mocap†, and evaluate state-of-the-art models on it.

2. We present a framework for hardening deep feed-forward models to OoD samples. We show that the hardened models are fast to train, and exhibit substantially improved OoD performance with minimal impact on ID performance.

We begin section 2 with a brief review of human motion prediction with deep neural networks, and of OoD generalisation using generative models. In section 3, we define a framework for benchmarking OoD performance using open-source multi-action datasets. In section 4, we introduce the discriminative models that we harden using a generative branch to achieve a state-of-the-art (SOTA) OoD benchmark. We then turn in section 5 to the architecture of the generative model and the overall objective function. Section 6 presents our experiments and results. We conclude in section 7 with a summary of our results, current limitations and caveats, and future directions for developing robust and reliable OoD performance and a quantifiable awareness of unfamiliar behaviour.
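The benchmark construction in contribution 1 amounts to holding out whole action categories at training time and evaluating only on the held-out ones. The sketch below illustrates the idea; the helper name, the toy data, and the particular ID/OoD partition are our own illustrative choices, not the paper's exact protocol.

```python
# Illustrative sketch of an OoD split: train on some action categories,
# test on categories never seen during training. Action labels follow
# Human3.6M naming conventions, but the partition here is hypothetical.
from typing import Dict, List, Tuple

def make_ood_split(sequences: Dict[str, List],
                   ood_actions: List[str]) -> Tuple[List, List]:
    """Partition motion sequences, keyed by action label, into
    in-distribution training data and out-of-distribution test data."""
    train, test_ood = [], []
    for action, seqs in sequences.items():
        (test_ood if action in ood_actions else train).extend(seqs)
    return train, test_ood

# Toy usage with placeholder sequences standing in for mocap clips.
data = {"walking": [[0.1], [0.2]], "eating": [[0.3]], "smoking": [[0.4]]}
train, test_ood = make_ood_split(data, ood_actions=["smoking"])
```

A model trained only on `train` and evaluated on `test_ood` is then tested strictly out of distribution with respect to action category.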

2. RELATED WORK

Deep-network based human motion prediction. Historically, sequence-to-sequence prediction using Recurrent Neural Networks (RNNs) has been the de facto standard for human motion prediction (Fragkiadaki et al., 2015; Jain et al., 2016; Martinez et al., 2017; Pavllo et al., 2018; Gui et al., 2018a; Guo & Choi, 2019; Gopalakrishnan et al., 2019; Li et al., 2020b). Currently, the SOTA is dominated by feed-forward models (Butepage et al., 2017; Li et al., 2018; Mao et al., 2019; Wei et al., 2020), which are inherently faster and easier to train than RNNs. The jury is still out, however, on the optimal way to handle temporality for human motion prediction. Meanwhile, recent trends have overwhelmingly shown that graph-based approaches are an effective means to encode the spatial dependencies between joints (Mao et al., 2019; Wei et al., 2020), or sets of joints (Li et al., 2020b). In this study, we consider the SOTA models that combine graph-based approaches with a feed-forward mechanism, as presented by Mao et al. (2019), and the subsequent extension which leverages motion attention (Wei et al., 2020). We show that these may be augmented to improve robustness to OoD samples.

Generative models for Out-of-Distribution prediction and detection. Despite the power of deep neural networks for prediction in complex domains (LeCun et al., 2015), they face several challenges that limit their suitability for safety-critical applications. Amodei et al. (2016) list robustness to distributional shift as one of the five major challenges to AI safety. Deep generative models have been used extensively for detection of OoD inputs and have been shown to generalise
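As background to the feed-forward models discussed above: Mao et al. (2019) represent each joint's trajectory compactly in the frequency domain via the Discrete Cosine Transform (DCT), predicting a truncated set of coefficients rather than raw frames. The NumPy sketch below is our own illustration of that representation (not the authors' code); the trajectory length `T` and truncation level `K` are arbitrary toy values.

```python
# Minimal sketch of DCT-based trajectory encoding, as used in
# frequency-domain motion prediction (Mao et al., 2019). We build an
# orthonormal DCT-II basis, keep only the K lowest-frequency
# coefficients of a toy joint-angle trajectory, and reconstruct it.
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis; rows are cosine basis functions."""
    k = np.arange(n)[:, None]
    m = np.cos(np.pi * (np.arange(n) + 0.5) * k / n) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)  # scale the DC row for orthonormality
    return m

T, K = 50, 10                                  # toy frame count / cutoff
C = dct_matrix(T)
traj = np.sin(np.linspace(0, 2 * np.pi, T))    # smooth toy joint angle
coeffs = C @ traj                              # full DCT coefficients
coeffs[K:] = 0.0                               # truncate high frequencies
recon = C.T @ coeffs                           # inverse (C is orthogonal)
err = float(np.abs(traj - recon).max())        # small for smooth motion
```

Because natural motion is smooth, most of its energy sits in the low-frequency coefficients, so a model that predicts only the first `K` coefficients still reconstructs the trajectory accurately.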



† http://mocap.cs.cmu.edu/

