GENERATIVE MODEL-ENHANCED HUMAN MOTION PREDICTION

Abstract

The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shifts up to and including out-of-distribution (OoD) inputs. Here we formulate a new OoD benchmark based on the Human3.6M and CMU motion capture datasets, and introduce a hybrid framework for hardening discriminative architectures against OoD failure by augmenting them with a generative model. We show that, when applied to current state-of-the-art discriminative models, the proposed approach improves OoD robustness without sacrificing in-distribution performance, and can theoretically facilitate model interpretability. We suggest that human motion predictors ought to be constructed with OoD challenges in mind, and provide an extensible general framework for hardening diverse discriminative architectures against extreme distributional shift.

1. INTRODUCTION

Human motion is naturally intelligible as a time-varying graph of connected joints constrained by locomotor anatomy and physiology. Its prediction allows the anticipation of actions with applications across healthcare (Geertsema et al., 2018; Kakar et al., 2005), physical rehabilitation and training (Chang et al., 2012; Webster & Celik, 2014), robotics (Koppula & Saxena, 2013b; a; Gui et al., 2018b), navigation (Paden et al., 2016; Alahi et al., 2016; Bhattacharyya et al., 2018; Wang et al., 2019), manufacture (Švec et al., 2014), entertainment (Shirai et al., 2007; Rofougaran et al., 2018; Lau & Chan, 2008), and security (Kim & Paik, 2010; Ma et al., 2018). The favoured approach to predicting movements over time has been purely inductive, relying on the history of a specific class of movement to predict its future. For example, state space models (Koller & Friedman, 2009) enjoyed early success for simple, common or cyclic motions (Taylor et al., 2007; Sutskever et al., 2009; Lehrmann et al., 2014). The range, diversity and complexity of human motion has encouraged a shift to more expressive, deep neural network architectures (Fragkiadaki et al., 2015; Butepage et al., 2017; Martinez et al., 2017; Li et al., 2018; Aksan et al., 2019; Mao et al., 2019; Li et al., 2020b; Cai et al., 2020), but still within a simple inductive framework. This approach would be adequate were actions both sharply distinct and highly stereotyped. But their complex, compositional nature means that within one category of action the kinematics may vary substantially, while between two categories they may barely differ. Moreover, few real-world tasks restrict the plausible repertoire to a small number of classes, distinct or otherwise, that could be explicitly learnt. Rather, any action may be drawn from a great diversity of possibilities, both kinematic and teleological, that shape the characteristics of the underlying movements. This has two crucial implications.
First, any modelling approach that lacks awareness of the full space of motion possibilities will be vulnerable to poor generalisation and brittle performance in the face of kinematic anomalies. Second, the very notion of In-Distribution (ID) testing becomes moot, for the relations between different actions and their kinematic signatures are plausibly determinable only across the entire domain of action. A test here arguably needs to be Out-of-Distribution (OoD) if it is to be considered a robust test at all. These considerations are amplified by the nature of real-world applications of kinematic modelling, such as anticipating arbitrary deviations from expected motor behaviour early enough for an automatic intervention to mitigate them. Most urgent in the domain of autonomous driving (Bhattacharyya et al., 2018; Wang et al., 2019), such safety concerns are of the highest importance, and

