HUMAN MOTIONFORMER: TRANSFERRING HUMAN MOTIONS WITH VISION TRANSFORMERS

Abstract

Human motion transfer aims to transfer motion from a dynamic target person to a static source person for motion synthesis. Accurate matching between the source person and the target motion, under both large and subtle motion changes, is vital for improving the quality of the transferred motion. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perception to capture large and subtle motion matching, respectively. It consists of two ViT encoders that extract input features (i.e., from a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person feature as Key and Value, and compute cross-attention maps to conduct global feature matching. Further, we introduce a convolutional layer to improve local perception after the global cross-attention computation. This matching process is implemented in both the warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss that enables co-supervision between the warping and generation branches for better motion representations. Experiments show that Human MotionFormer sets a new state of the art both qualitatively and quantitatively. Project page: https://github.com/KumapowerLIU/

1. INTRODUCTION

Human motion transfer, which transfers the motion from a target person's video to a source person, has grown rapidly in recent years owing to its many entertainment applications for novel content generation Wang et al. (2019); Chan et al. (2019). For example, a dancing target can automatically animate multiple static source people for efficient short-video editing. Professional actions can be transferred to celebrities to produce educational, charitable, or advertising videos for broad distribution. Bringing static people to life suits short-video creation and receives growing attention on social media platforms. During motion transfer, we expect the source person to perform the same action as the target person. To achieve this, we need to establish an accurate matching between the target pose and the source person (i.e., each body-part skeleton in a pose image matches its corresponding body part in the source image), and use this matching to drive the source person with the target pose (i.e., if the hand skeleton in the target pose is raised, the hand of the source person should also be raised). Depending on the degree of difference between the target pose and the source person's pose, this matching can be divided into two types: global and local. When the difference is large, there is a large motion change between the target pose and the source person, and the target pose shall match a distant region in the source image (e.g., the arm skeleton of the target pose is distant from the source man's arm region in Fig. 1(b)). When the difference is small, there are only subtle motion changes, and the target pose shall match a nearby local region in the source image (e.g., the arm skeleton of the target pose is close to the source woman's arm region in Fig. 1(b)). As the human body moves non-rigidly, large and subtle motion changes often occur together, so both global and local matching are required.

Our framework consists of two ViT encoders and a ViT decoder. The encoders extract feature pyramids from the source person image and the target pose image, respectively.
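To make the feature-pyramid extraction concrete, the following is a minimal sketch (not the authors' implementation) of a hierarchical ViT-style encoder: each stage downsamples via a strided patch embedding and applies self-attention over the resulting tokens, yielding one pyramid level per stage. The class name, stage widths, and head counts are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PyramidViTEncoder(nn.Module):
    """Toy hierarchical encoder: each stage halves the spatial size
    and widens the channels, producing a feature pyramid."""

    def __init__(self, in_ch=3, dims=(32, 64, 128), heads=2):
        super().__init__()
        self.patch_embeds = nn.ModuleList()
        self.blocks = nn.ModuleList()
        prev = in_ch
        for d in dims:
            # Strided conv acts as a patch-merging / downsampling embed.
            self.patch_embeds.append(nn.Conv2d(prev, d, kernel_size=2, stride=2))
            self.blocks.append(nn.TransformerEncoderLayer(
                d, nhead=heads, dim_feedforward=2 * d, batch_first=True))
            prev = d

    def forward(self, img):
        feats, x = [], img
        for embed, blk in zip(self.patch_embeds, self.blocks):
            x = embed(x)                            # halve resolution, project channels
            b, c, h, w = x.shape
            tokens = blk(x.flatten(2).transpose(1, 2))  # self-attention over tokens
            x = tokens.transpose(1, 2).reshape(b, c, h, w)
            feats.append(x)                         # one pyramid level per stage
        return feats
```

In this sketch, both the source person image and the target pose image would pass through separate encoders of this form, so the decoder receives two aligned pyramids of matching resolutions.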
These feature pyramids are sent to the decoder for accurate matching and motion transfer. The decoder contains several decoder blocks and one fusion block. Each decoder block consists of two parallel branches (i.e., a warping branch and a generation branch). In the warping branch, we predict a flow field to warp features from the source person image, which preserves the information of the source image for high-fidelity motion transfer. Meanwhile, the generation branch produces novel content that cannot be directly borrowed from the source appearance, further improving photorealism. Afterward, a fusion block converts the feature outputs of these two branches into the final transferred image. The feature-matching result determines both the flow field and the generated content. Specifically, we implement feature matching between the two input images via cross-attentions and convolutions in each branch. The tokens from the source image features are mapped into Key and Value, and the tokens from the target pose image features are mapped into Query. We compute the cross-attention map from the Query and Key; this map reflects the global correlations between the two input images. Then, we send the output of the cross-attention process into a convolution to capture locally matched results. Thanks to
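The two core operations described above can be sketched as follows: global matching via cross-attention (target pose features as Query, source features as Key and Value) followed by a local convolution, and feature warping with a predicted flow field. This is a minimal illustrative sketch under stated assumptions, not the authors' code; the module names, the residual form of the local convolution, and the bilinear warping details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MatchingBlock(nn.Module):
    """Global cross-attention matching followed by a local convolution."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, pose_feat, src_feat):
        # pose_feat, src_feat: (B, C, H, W) feature maps from the two encoders
        b, c, h, w = pose_feat.shape
        q = self.norm_q(pose_feat.flatten(2).transpose(1, 2))    # target pose -> Query
        kv = self.norm_kv(src_feat.flatten(2).transpose(1, 2))   # source person -> Key/Value
        matched, _ = self.attn(q, kv, kv)                        # global correlation
        x = matched.transpose(1, 2).reshape(b, c, h, w)
        return x + self.local(x)                                 # local perception (residual)


def warp(src_feat, flow):
    """Warping branch: sample source features at positions shifted by a
    predicted flow field (B, 2, H, W), offsets in pixels."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow)         # base pixel grid (2, H, W)
    coords = grid.unsqueeze(0) + flow                            # absolute sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = coords[:, 0] / max(w - 1, 1) * 2 - 1
    coords_y = coords[:, 1] / max(h - 1, 1) * 2 - 1
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(src_feat, norm_grid, align_corners=True)
```

With a zero flow field, `warp` returns the source features unchanged; the warping branch learns nonzero offsets so that, e.g., a raised arm in the target pose samples from the arm region of the source image.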



Figure 1: Human motion transfer results. Target pose images are shown in the first row, and two source person images in the first column. Our MotionFormer effectively synthesizes motion-transferred results regardless of whether the target pose differs significantly from the source pose.

