HUMAN MOTION DIFFUSION MODEL

Abstract

Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource-hungry and hard to control. In this paper, we introduce the Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from the motion generation literature. A notable design choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. Project page: https://guytevet.github.io/mdm-page/.
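As a concrete illustration of the sample-prediction design choice mentioned above, the following is a minimal sketch of the training objective; the exact notation and weighting are simplified assumptions for exposition, not a verbatim restatement of the model's equations.

```latex
% Sketch: the network G predicts the clean sample \hat{x}_0 = G(x_t, t, c)
% rather than the noise \epsilon added at diffusion step t.
\mathcal{L}_{\mathrm{simple}} \;=\;
  \mathbb{E}_{x_0 \sim q(x_0 \mid c),\; t \sim [1, T]}
  \big\lVert x_0 - G(x_t, t, c) \big\rVert_2^2

% Because the prediction lives in sample space, geometric terms such as a
% velocity penalty over the N frames of the motion can be applied directly
% to the prediction (an assumed, illustrative form):
\mathcal{L}_{\mathrm{vel}} \;=\;
  \frac{1}{N-1} \sum_{i=1}^{N-1}
  \big\lVert (x_0^{i+1} - x_0^{i}) - (\hat{x}_0^{i+1} - \hat{x}_0^{i}) \big\rVert_2^2
```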

1. INTRODUCTION

Human motion generation is a fundamental task in computer animation, with applications spanning from gaming to robotics. It is a challenging field for several reasons, including the vast span of possible motions and the difficulty and cost of acquiring high-quality data. For the recently emerging text-to-motion setting, where motion is generated from natural language, another inherent problem is data labeling. For example, the label "kick" could refer to a soccer kick as well as a Karate one; at the same time, a given kick can be described in many ways, from how it is performed to the emotions it conveys, constituting a many-to-many problem. Current approaches have shown success in the field, demonstrating plausible mappings from text to motion (Petrovich et al., 2022; Tevet et al., 2022; Ahuja & Morency, 2019). All these approaches, however, still limit the learned distribution, since they mainly employ auto-encoders or VAEs (Kingma & Welling, 2013), implying a one-to-one mapping or a normal latent distribution, respectively. In this aspect, diffusion models are a better candidate for human motion generation, as they are free from assumptions on the target distribution and are known for expressing well the many-to-many distribution matching problem described above.

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2020; Ho et al., 2020) are a generative approach that is gaining significant attention in the computer vision and graphics community. When trained for conditioned generation, recent diffusion models (Ramesh et al., 2022; Saharia et al., 2022b) have shown breakthroughs in terms of image quality and semantics. The competence of these models has also been shown in other domains, including videos (Ho et al., 2022) and 3D point clouds (Luo & Hu, 2021). Such models, however, are notoriously resource demanding and challenging to control.

In this paper, we introduce the Motion Diffusion Model (MDM), a carefully adapted diffusion-based generative model for the human motion domain. Being diffusion-based, MDM gains the native many-to-many expressiveness mentioned above, as evidenced by the resulting motion quality and diversity (Figure 1). In addition, MDM combines insights already well established in the motion generation domain, making it significantly more lightweight and controllable.

First, instead of the ubiquitous U-net (Ronneberger et al., 2015) backbone, MDM is transformer-based. As we demonstrate, our architecture (Figure 2) is lightweight and better fits the temporal and spatially irregular nature of motion data (represented as a collection of joints). A large volume of motion generation research is devoted to learning with geometric losses (Kocabas et al., 2020; Harvey et al., 2020; Aberman et al., 2020). Some, for example, regulate the velocity of the motion (Petrovich et al., 2021) to prevent jitter, or specifically address foot sliding using dedicated terms (Shi et al., 2020). Consistent with these works, we show that applying geometric losses in the diffusion setting improves generation.

The MDM framework has a generic design enabling different forms of conditioning. We showcase three tasks: text-to-motion, action-to-motion, and unconditioned generation. We train the model in a classifier-free manner (Ho & Salimans, 2022), which enables trading off diversity for fidelity and sampling both conditionally and unconditionally from the same model.
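To make the classifier-free conditioning concrete, below is a minimal Python sketch of how a conditional and an unconditional sample prediction can be blended at inference time with a guidance scale; the function and argument names are illustrative assumptions, not the released implementation.

```python
def classifier_free_step(model, x_t, t, cond, guidance_scale=2.5):
    """One guided denoising step (sketch).

    The model is assumed to predict the clean sample x0 (not the noise) and to
    have been trained with the condition randomly masked out, so the same
    network can run both conditionally and unconditionally.
    """
    x0_cond = model(x_t, t, cond)    # conditioned, e.g. on a text embedding
    x0_uncond = model(x_t, t, None)  # condition dropped
    # Classifier-free guidance: extrapolate toward the conditional prediction.
    # A larger guidance_scale trades diversity for fidelity to the condition.
    return x0_uncond + guidance_scale * (x0_cond - x0_uncond)
```

In this sketch, a scale of 1 recovers purely conditional sampling and a scale of 0 yields unconditional samples from the same model, which is what allows the diversity-fidelity trade-off described above.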
In the text-to-motion task, our model generates coherent motions (Figure 1) that achieve state-of-the-art results on the HumanML3D (Guo et al., 2022a) and KIT (Plappert et al., 2016) benchmarks. Moreover, our user study shows that human evaluators prefer our generated motions over real motions 42% of the time (Figure 4(a)). In action-to-motion, MDM outperforms the state-of-the-art (Guo et al., 2020; Petrovich et al., 2021) on the common HumanAct12 (Guo et al., 2020) and UESTC (Ji et al., 2018) benchmarks, even though those methods were specifically designed for this task. Lastly, we also demonstrate motion completion and editing. By adapting diffusion image-inpainting (Song et al., 2020b; Saharia et al., 2022a), we fix a motion prefix and suffix and use our model to fill in the gap. Doing so under a textual condition guides MDM to fill the gap with a specific motion that still maintains the semantics of the original input. By performing inpainting in the joint space rather than temporally, we also demonstrate semantic editing of specific body parts without changing the others (Figure 3).

Overall, we introduce the Motion Diffusion Model, a motion framework that achieves state-of-the-art quality in several motion generation tasks while requiring only about three days of training on a single mid-range GPU. It supports geometric losses, which are non-trivial to apply in the diffusion setting.
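To illustrate how image-inpainting techniques carry over to motion completion and editing, the following is a minimal sketch of the masking applied at each denoising step; the names, shapes, and the exact point at which the known motion is injected are assumptions for exposition rather than the paper's implementation.

```python
def inpaint_combine(x_pred, known_motion, mask, t, noise_to_step):
    """Combine generated and fixed motion at one denoising step (sketch).

    x_pred:        the model's current estimate, shape [frames, joints, features]
    known_motion:  the fixed prefix/suffix (or fixed joints) to be kept
    mask:          1 where the motion is given, 0 where it should be synthesized
    noise_to_step: forward-diffusion function that noises a clean motion to step t,
                   so the pasted part matches the noise level of x_pred
    """
    # Masking along the time axis yields in-betweening (fill the gap between a
    # prefix and a suffix); masking along the joint axis edits specific body
    # parts while leaving the others untouched.
    fixed = noise_to_step(known_motion, t)
    return mask * fixed + (1.0 - mask) * x_pred
```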



Code can be found at https://github.com/GuyTevet/motion-diffusion-model.



"A person kicks with their left leg." "A man runs to the right then runs to the left then back to the middle."

Figure 1: Our Motion Diffusion Model (MDM) reflects the many-to-many nature of the text-to-motion mapping by generating diverse motions for a given text prompt. Our custom architecture and geometric losses help yield high-quality motion. Darker color indicates later frames in the sequence.

