HUMAN MOTIONFORMER: TRANSFERRING HUMAN MOTIONS WITH VISION TRANSFORMERS

Abstract

Human motion transfer aims to transfer motion from a target dynamic person to a source static one for motion synthesis. Accurate matching between the source person and the target motion, in both large and subtle motion changes, is vital for improving the transferred motion quality. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching, respectively. It consists of two ViT encoders that extract input features (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person as Key and Value, computing cross-attention maps to conduct global feature matching. We then introduce a convolutional layer to improve local perception after the global cross-attention computation. This matching process is implemented in both the warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss to enable co-supervision between the warping and generation branches for better motion representations. Experiments show that our Human MotionFormer sets a new state of the art both qualitatively and quantitatively. Project page: https://github.com/KumapowerLIU/Human-MotionFormer

* X. Han and H. Liu contribute equally. † Y. Song and Q. Chen are the corresponding authors.

1. INTRODUCTION

Human Motion Transfer, which transfers the motion from a target person's video to a source person, has grown rapidly in recent years due to its numerous entertainment applications for novel content generation Wang et al. (2019); Chan et al. (2019). For example, a dancing target can automatically animate multiple static source people for efficient short video editing. Professional actions can be transferred to celebrities to produce educational, charitable, or advertising videos for a wide range of broadcasting. Bringing static people alive suits short video creation and receives growing attention on social media platforms. During motion transfer, we expect the source person to redo the same action as the target person. To achieve this, we need to establish an accurate matching between the target pose and the source person (i.e., each body-part skeleton in a pose image matches its corresponding body part in a source image), and use this matching to drive the source person with the target pose (i.e., if the hand skeleton in the target pose is raised, the hand of the source person should also be raised). According to the degree of difference between the target pose and the source person's pose, this matching can be divided into two types: global and local. When the difference is large, there is a large motion change between the target pose and the source person, and the target pose shall match a distant region in the source image (e.g., the arm skeleton of the target pose is distant from the source man's arm region in Fig. 1(b)). When the difference is small, there are only subtle motion changes, and the target pose shall match its local region in the source image (e.g., the arm skeleton of the target pose is close to the source woman's arm region in Fig. 1(b)). As the human body moves non-rigidly, large and subtle motion changes usually co-exist during transfer.

The encoders extract feature pyramids from the source person image and the target pose image, respectively. These feature pyramids are sent to the decoder for accurate matching and motion transfer. In the decoder, there are several decoder blocks and one fusion block. Each block consists of two parallel branches (i.e., the warping and generation branches). In the warping branch, we predict a flow field to warp features from the source person image, which preserves the information of the source image to achieve high-fidelity motion transfer. Meanwhile, the generation branch produces novel content that cannot be directly borrowed from the source appearance to further improve photorealism. Afterward, we use a fusion block to convert the feature outputs of these two branches into the final transferred image. Thanks to the accurate global and local feature matching, our MotionFormer can improve the performance of both the warping and generation processes. Moreover, we propose a mutual learning loss to enable these two branches to supervise each other during training, which helps them benefit from each other and facilitates the final fusion for high-quality motion transfer. In the test phase, our method generates a motion transfer video on the fly given a single source image, without training a person-specific model or fine-tuning. Some results generated by our method can be found in Fig. 1, and we show more video results in the supplementary files. Unlike methods that fine-tune on the source identity to achieve higher perceptual quality Zakharov et al. (2019), our method works in a one-shot fashion that directly generalizes to unseen identities. We adopt the Cross-Shaped Window Self-Attention (CSWin Attention) Dong et al. (2022) as the attention mechanism in our encoder and decoder; CSWin Attention calculates attention in horizontal and vertical stripes in parallel to ensure both performance and efficiency. In our method, we assume a fixed background and simultaneously estimate a foreground mask M_out to merge an inpainted background with I_out in the testing phase. We introduce the Transformer encoder and decoder in Sec. 3.1 and Sec. 3.2, respectively. The mutual learning loss is described in Sec. 3.3.

3.1. TRANSFORMER ENCODER

The structure of the two Transformer encoders is the same. Each encoder consists of three stages with different spatial sizes. Each stage has multiple encoder blocks, and we adopt the CSWin Transformer Block Dong et al. (2022) as our encoder block. Our Transformer encoder captures hierarchical representations of the source image I_s and the target pose image P_t. S_i and T_i (i = 1, 2, 3) denote the outputs of the i-th stage for I_s and P_t, respectively, as shown in Fig. 2. We follow Dong et al. (2022); Wu et al. (2021) to utilize a convolutional layer between different stages for token reduction and channel expansion. We show more details of the encoder in the appendix.
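The between-stage token reduction can be pictured as merging each 2×2 window of tokens and projecting the channels, analogous to a stride-2 convolution. The sketch below is an illustrative numpy version under that assumption (the actual layer is a learned convolution); the function name `token_reduction` and the projection matrix `w` are hypothetical.

```python
import numpy as np

def token_reduction(x, w):
    """Merge each 2x2 window of tokens and project the channels.

    x: (H, W, C) token grid from the previous encoder stage.
    w: (4*C, C_out) projection matrix (stands in for the stride-2 conv).
    Returns an (H/2, W/2, C_out) grid: 4x fewer tokens, more channels.
    """
    H, W, C = x.shape
    # Group 2x2 neighborhoods: (H/2, 2, W/2, 2, C) -> (H/2, W/2, 4C)
    merged = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    merged = merged.reshape(H // 2, W // 2, 4 * C)
    return merged @ w

# Example: a 64x64 grid of 64-dim tokens becomes a 32x32 grid of 128-dim tokens.
x = np.random.randn(64, 64, 64)
w = np.random.randn(4 * 64, 128)
y = token_reduction(x, w)
print(y.shape)  # (32, 32, 128)
```

Between stages, the spatial token count thus drops by 4× while the channel dimension grows, producing the feature pyramid consumed by the decoder.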

3.2. TRANSFORMER DECODER

There are three stages in our Transformer decoder, with 2, 4, and 12 decoder blocks, respectively. We concatenate the output of each stage with the corresponding target pose feature via skip connections, and the concatenated results are sent to the second and third stages. Similar to the encoder, we insert a convolutional layer between stages, here to increase the token number and decrease the channel dimension.

3.2.1. DECODER BLOCK.

As shown in Fig. 3, the decoder block has a warping branch and a generation branch. In each branch, a cross-attention process and a convolutional layer capture the global and local correspondence, respectively. Let X_de^{l-1} denote the input of the l-th decoder block, i.e., the output of the previous decoder block (l > 1) or of the precedent stage (l = 1). For the first decoder stage, we set T_3 as the input, so X_de^1 = T_3. The decoder block first extracts X̃_de^l from X_de^{l-1} with a Multi-Head Self-Attention process. Then we feed X̃_de^l to the warping branch and the generation branch as Query (Q), and we use the source encoder feature S_i as Key (K) and Value (V) to calculate the cross-attention map with a Multi-Head Cross-Attention process, similar to Vaswani et al. (2017). The cross-attention map helps us build the global correspondence between the target pose and the source image. Finally, we send the output of the Multi-Head Cross-Attention to a convolutional layer to extract the local correspondence. The warping branch predicts a flow field to deform the source feature conditioned on the target pose, which helps generate regions that are visible in the source image. For the invisible parts, the generation branch synthesizes novel content with the contextual information mined from the source feature. We combine the advantages of these two branches in each decoder block to improve the generation quality.

Notation: S_i denotes the output feature of the i-th stage of the source image encoder, and f^{i-1} denotes the flow field output of the (i-1)-th decoder stage.

Warping branch. The warping branch aims to generate a flow field to warp the source feature S_i. Specifically, the Multi-Head Cross-Attention outputs a feature from the produced Q, K, V, and we feed this output to a convolution to infer the flow field. Inspired by recent approaches that gradually refine the estimation of optical flow Hui et al. (2018); Han et al. (2019), we estimate a residual flow to refine the estimate of the previous stage. Next, we warp the feature map S_i according to the flow field using bilinear interpolation. Formally, the whole process is:

Q = W_Q(X̃_de^l), K = W_K(S_i), V = W_V(S_i),
f̃^l = Conv(Multi-Head Cross-Attention(Q, K, V)),
f^l = Up(f^{i-1}) + f̃^l, if l = 1 and i > 1,
O_w^l = Warp(S_i, f^l),    (1)

where W_Q, W_K, W_V are the learnable projection heads, O_w^l denotes the output of the warping branch in the l-th block, Up is a ×2 nearest-neighbor upsampling, and Warp deforms the feature map S_i according to the flow f^l using grid sampling Jaderberg et al. (2015). For the i-th decoder stage, the flow predicted by its last decoder block is treated as f^i and is then refined by the blocks of the subsequent stage.

Generation branch. The architecture of the generation branch is similar to that of the warping branch. The attention outputs a feature from the produced Q, K, V, and we then feed this output to a convolution to infer the final prediction O_g^l:

Q = W_Q(X̃_de^l), K = W_K(S_i), V = W_V(S_i),
O_g^l = Conv(Multi-Head Cross-Attention(Q, K, V)),    (2)

where W_Q, W_K, W_V are the learnable projection heads. The generation branch can generate novel content based on the global information of the source feature S_i. It is therefore complementary to the warping branch when the flow field is inaccurate or there is no explicit reference in the source feature. Finally, we concatenate the outputs of the warping and generation branches, reduce the dimension with a convolutional layer, and apply an MLP with a residual connection:

X̂_de^l = Conv(Concat(O_w^l, O_g^l)),
X_de^l = MLP(LN(X̂_de^l)) + X̂_de^l,    (3)

where X̂_de^l is the combination of the warping and generation branches in the l-th decoder block.
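The Warp(S_i, f^l) operator is bilinear grid sampling in the spirit of Jaderberg et al. (2015). Below is a minimal numpy sketch of such an operator, assuming the flow is expressed as per-pixel offsets in pixels; `warp` is an illustrative stand-in for the grid-sampling call used in practice, not the paper's implementation.

```python
import numpy as np

def warp(feat, flow):
    """Bilinearly sample feat at locations displaced by flow.

    feat: (H, W, C) source feature map S_i.
    flow: (H, W, 2) per-pixel (dy, dx) offsets, in pixels.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    y = np.clip(ys + flow[..., 0], 0, H - 1)
    x = np.clip(xs + flow[..., 1], 0, W - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    # Blend the four neighboring feature vectors per output location.
    out = ((1 - wy)[..., None] * (1 - wx)[..., None] * feat[y0, x0]
           + (1 - wy)[..., None] * wx[..., None] * feat[y0, x1]
           + wy[..., None] * (1 - wx)[..., None] * feat[y1, x0]
           + wy[..., None] * wx[..., None] * feat[y1, x1])
    return out

feat = np.random.randn(8, 8, 4)
zero_flow = np.zeros((8, 8, 2))
assert np.allclose(warp(feat, zero_flow), feat)  # zero flow is the identity
```

Because the sampling weights are piecewise-linear in the flow, this operation is differentiable with respect to f^l, which is what lets the warping branch learn the flow end-to-end.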

3.2.2. FUSION BLOCK.

The fusion block takes the decoder output to predict the final result. It has a warping branch and a generation branch operating at the pixel level. The warping branch refines the last decoder flow field f^3 and estimates a final flow f_f, while the generation branch synthesizes the RGB image I_f. At the same time, a fusion mask M_f is predicted to merge the outputs of these two branches:

f_f = Conv(O_de) + Up(f^3),
M_f = Sigmoid(Conv(O_de)),
I_f = Tanh(Conv(O_de)),
I_out = M_f ⊙ Warp(I_s, f_f) + (1 - M_f) ⊙ I_f,    (4)

where O_de is the output of the decoder, ⊙ is element-wise multiplication, and I_out is the final prediction.
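The final composition I_out = M_f ⊙ Warp(I_s, f_f) + (1 - M_f) ⊙ I_f reduces to a per-pixel blend between the warped source image and the generated content. A minimal numpy sketch, with `fuse` as an illustrative name:

```python
import numpy as np

def fuse(warped, generated, mask):
    """Blend the warped source image with generated content.

    warped:    (H, W, 3) Warp(I_s, f_f) from the warping branch.
    generated: (H, W, 3) I_f from the generation branch.
    mask:      (H, W, 1) fusion mask M_f in [0, 1] (sigmoid output).
    """
    return mask * warped + (1.0 - mask) * generated

warped = np.ones((4, 4, 3))
generated = np.zeros((4, 4, 3))
mask = np.full((4, 4, 1), 0.25)
out = fuse(warped, generated, mask)
print(out[0, 0])  # [0.25 0.25 0.25]
```

Where M_f is close to 1, the output copies the warped source appearance; where it is close to 0, the generated content fills in regions that are invisible in the source image.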

3.3. MUTUAL LEARNING LOSS

The generation and warping branches have their own advantages, as mentioned above. Intuitively, we concatenate the outputs of these two branches followed by a convolutional layer and an MLP, as shown in Fig. 2, but we empirically find that the convolutional layer and MLP cannot combine these advantages well (see Sec. 5). To address this limitation and ensure the results inherit the advantages of both branches, we propose a novel mutual learning loss that enforces the two branches to learn from each other. Specifically, the mutual learning loss enables the two branches to supervise each other within each decoder block. Let O_w^k, O_g^k ∈ R^{(H×W)×C} denote the reshaped outputs of the last warping and generation branches at the k-th decoder stage (see Eqs. (1) and (2) for their definitions). If we calculate the similarity between the feature vector O_{w,i}^k ∈ R^C at spatial location i of O_w^k and all feature vectors O_{g,j}^k ∈ R^C (j = 1, 2, ..., HW) in O_g^k, we argue that the most similar vector to O_{w,i}^k should be O_{g,i}^k, which is at the same position in O_g^k. In other words, we would like to enforce

i = arg max_j Cos(O_{w,i}^k, O_{g,j}^k),

where Cos(·,·) is the cosine similarity. This is achieved by the following mutual learning loss:

L_mut = Σ_k Σ_{i=1}^{HW} ||SoftArgMax_j(Cos(O_{w,i}^k, O_{g,j}^k)) - i||_1,    (5)

where SoftArgMax is a differentiable version of arg max that returns the spatial location of the maximum value. The mutual learning loss constrains the two branches to have high correlations at the same location, enhancing the complementarity of warping and generation. In addition to the perceptual diversity loss, we follow Ren et al. (2020) and Huang et al. (2021a) to optimize our network with several common loss terms; details are in the appendix.
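The loss above can be sketched for a single decoder stage as follows, using a softmax-based soft-argmax. The temperature `tau` is an assumed hyperparameter (its value is not stated here), and the sketch only illustrates the forward computation in numpy; a real implementation would be written with differentiable tensor ops.

```python
import numpy as np

def mutual_learning_loss(O_w, O_g, tau=0.05):
    """L_mut for one decoder stage.

    O_w, O_g: (HW, C) reshaped outputs of the warping / generation branch.
    tau: softmax temperature for the differentiable soft-argmax (assumed).
    """
    # Cosine similarity between every warping location i and generation location j.
    nw = O_w / np.linalg.norm(O_w, axis=1, keepdims=True)
    ng = O_g / np.linalg.norm(O_g, axis=1, keepdims=True)
    cos = nw @ ng.T                        # (HW, HW)
    # Soft-argmax over j: expectation of j under a softmax of the similarities.
    p = np.exp(cos / tau)
    p /= p.sum(axis=1, keepdims=True)
    idx = np.arange(O_g.shape[0])
    soft_argmax = p @ idx                  # expected matching location per i
    # L1 distance between the matched location and i itself.
    return np.abs(soft_argmax - idx).sum()

# When the two branches already agree (orthogonal per-location features),
# each location matches itself and the loss is near zero.
feats = np.eye(16)
print(mutual_learning_loss(feats, feats))
```

Minimizing this quantity pushes the softmax over similarities to peak at j = i, i.e., the two branches are forced to be maximally correlated at the same spatial location.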

4. EXPERIMENTS

Datasets. We use the solo dance YouTube videos collected by Huang et al. (2021a) and the iPer dataset Liu et al. (2019b). These videos contain nearly static backgrounds and subjects that vary in gender, body shape, hairstyle, and clothes. All frames are center-cropped and resized to 256 × 256. We train a separate model on each dataset for a fair comparison with other methods.

Implementation details. We use OpenPose Cao et al. (2017) to detect 25 body joints for each frame. These joints are then connected to create a target pose stick image P_t, which has 26 channels. We compare our method with LWG Liu et al. (2019b), GTM Huang et al. (2021a), MRAA Siarohin et al. (2021), and DIST Ren et al. (2020). For LWG, we test it on the iPer dataset with the released pre-trained model, and we train LWG on the YouTube videos with its source code. At test time, we fine-tune LWG on the source image as in the official implementation (fine-tuning is called "personalize" in the source code). For GTM, we utilize the pre-trained model on the YouTube videos dataset provided by the authors and retrain the model on iPer with the source code. As GTM supports testing with multiple source images, we use 20 frames of the source video and fine-tune the pre-trained network as described in the original paper Huang et al. (2021a). For MRAA, we use the source code provided by the authors to train the model. For DIST Ren et al. (2020), we compare with its pre-trained model on the iPer dataset. For synthesizing a 1,000-frame video, the average per-frame time costs of MRAA, LWG, GTM, DIST, and our method are 0.021s, 1.242s, 1.773s, 0.088s, and 0.94s, respectively. Meanwhile, MotionFormer does not require online fine-tuning, while LWG and GTM do.
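Rasterizing detected joints into a multi-channel stick image can be sketched as below. The limb connectivity list here is a toy example, not the actual OpenPose BODY_25 topology, and the one-channel-per-limb layout is an assumption for illustration.

```python
import numpy as np

# Illustrative limb list (pairs of joint indices); the real OpenPose
# BODY_25 connectivity differs - this is only a toy example.
LIMBS = [(0, 1), (1, 2), (2, 3)]

def pose_stick_image(joints, size=64, n_points=50):
    """Rasterize each limb into its own channel of a stick image.

    joints: (J, 2) array of (x, y) joint coordinates in pixels.
    Returns a (size, size, len(LIMBS)) binary stick image.
    """
    img = np.zeros((size, size, len(LIMBS)), dtype=np.float32)
    for c, (a, b) in enumerate(LIMBS):
        # Sample points along the segment joining the two joints.
        ts = np.linspace(0.0, 1.0, n_points)[:, None]
        pts = joints[a] * (1 - ts) + joints[b] * ts
        xs = np.clip(pts[:, 0].round().astype(int), 0, size - 1)
        ys = np.clip(pts[:, 1].round().astype(int), 0, size - 1)
        img[ys, xs, c] = 1.0
    return img

joints = np.array([[10, 10], [30, 10], [30, 40], [50, 40]], dtype=np.float32)
stick = pose_stick_image(joints)
print(stick.shape)  # (64, 64, 3)
```

Stacking one channel per limb gives the network an unambiguous, spatially aligned encoding of the target pose, which is what the pose encoder consumes.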

4.1. QUALITATIVE COMPARISONS

Qualitative comparisons are given in Fig. 4 and Fig. 5. Although LWG Liu et al. (2019b) can maintain the overall shape of the human body, it fails to reconstruct complicated human parts of the source person (e.g., the long hair and shoes in Fig. 4) and to synthesize images with large body motions (e.g., the squat in the red box of Fig. 4), which leads to visual artifacts and missing details. This is because LWG relies on the 3D mesh predicted by HMR Kanazawa et al. (2018), which is unable to model detailed shape information. In contrast, GTM Huang et al. (2021a) reconstructs the body shape better, as it uses multiple inputs to optimize personalized geometry and texture. However, the geometry cannot handle the correspondence between the source image and the target pose, and the synthesized texture presents severe artifacts, especially for regions that are invisible in the source images. As an unsupervised method, MRAA Siarohin et al. (2021) implicitly models the relationship between source and target images; without any prior information about the human body, it generates unrealistic human images. DIST Ren et al. (2020) does not model correct visual correspondence (e.g., the coat buttons are missing in the last example) and suffers from overfitting (e.g., the coat color becomes dark blue in the third example). Compared to existing methods, MotionFormer renders more realistic and natural images by effectively modeling long-range correspondence and local details.

4.2. QUANTITATIVE COMPARISONS

We use SSIM Wang et al. (2004), PSNR, FID Heusel et al. (2017), and LPIPS Zhang et al. (2018) as numerical evaluation metrics. The quantitative results are reported in Table 1 and Table 2; our method outperforms existing methods by large margins across all metrics. Additionally, we perform a human subjective evaluation. We generate motion transfer videos for the different methods from randomly selected 3-second clips in the test set. In each trial, a volunteer is shown the compared results on the same video clip and asked to select the one with the highest generation quality. We tally the votes and show the statistics in the last column of Table 1 and Table 2: our method is favored in most of the trials.
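For concreteness, PSNR between a generated frame and its ground truth follows the standard definition below; this sketch is only to make the metric explicit, not the paper's evaluation code.

```python
import numpy as np

def psnr(img1, img2, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)
print(round(psnr(a, b), 2))  # 20.0
```

Higher PSNR and SSIM are better (pixel fidelity), while lower FID and LPIPS are better (perceptual quality), which is why the arrows in the tables point in opposite directions.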

5. ABLATION STUDY

Attention mechanism. To evaluate the effect of the cross-attention module, we remove the cross-attention from both the warping and generation branches. Instead, we concatenate the source feature S_i and the Query directly in the Transformer decoder, followed by a convolutional layer that constructs their local relationship (this variant is named Ours w/o Attention). As shown in Fig. 6(c), without modeling the long-range relationship between the source and target, Ours w/o Attention achieves worse results (e.g., distorted skirt, limbs, and shoes). The numerical comparison in Table 3 is consistent with this visual observation.

Generation and warping branches. We show the contributions of the generation branch and the warping branch in the decoder block by removing them individually (i.e., Ours w/o warping, Ours w/o generation). As shown in Fig. 6(d), without the warping branch, the generated clothing contains unnatural green and black regions in the man's T-shirt and the woman's skirt, respectively. This reflects that a single generation branch is prone to over-fitting. On the other hand, the warping branch avoids over-fitting, as shown in Fig. 6(e), but the results still lack realism because the warping branch cannot generate novel appearances that are invisible in the source image (e.g., the shoes of the man and the hair of the woman are incomplete). Our full method combines the advantages of these two branches and produces better results in Fig. 6(g). We also report the numerical results in Table 3, in which our full method achieves the best performance.

Mutual learning loss. We analyze the importance of the mutual learning loss (Eq. (5)) by removing it during training (Ours w/o mutual). Fig. 6(f) shows the prediction that combines the warping and generation branches without the mutual learning loss; it still produces noticeable visual artifacts. The proposed mutual learning loss aligns the output features of these two branches and improves the performance.
The numerical evaluation in Table 3 also indicates that the mutual learning loss improves the generated image quality. The other loss terms have been demonstrated to be effective in Balakrishnan et al. (2018); Wang et al. (2018b); Liu et al. (2019b) with sufficient studies, so we do not include them in the ablation studies.

[Table: per-method PSNR↑, SSIM↑, LPIPS↓, and FID↓ scores; see Tables 1-3.]

6. CONCLUDING REMARKS

In this paper, we introduce MotionFormer, a Transformer-based framework for realistic human motion transfer. MotionFormer captures the global and local relationship between the source appearance and the target pose with carefully designed Transformer-based decoder blocks, synthesizing promising results. At the core of each block lie a warping branch that deforms the source feature and a generation branch that synthesizes novel content. By minimizing a mutual learning loss, these two branches supervise each other to learn better representations and improve generation quality. Experiments on two human video datasets verify the effectiveness of MotionFormer.



Figure 1: Human motion transfer results. Target pose images are in the first row, and two source person images are in the first column. Our MotionFormer effectively synthesizes motion-transferred results whether or not the target pose differs significantly from the source pose.

Fig. 2 shows an overview of MotionFormer. It consists of two Transformer encoders and one Transformer decoder. The two Transformer encoders first extract the features of the source image I_s and the target pose image P_t, respectively. Then the Transformer decoder hierarchically builds the relationship between I_s and P_t with two-branch decoder blocks. Finally, a fusion block predicts the reconstructed person image I_out. The network is trained end-to-end with the proposed mutual learning loss. We utilize the Cross-Shaped Window Self-Attention (CSWin Attention) Dong et al. (2022) as the attention mechanism in our encoder and decoder.

Figure 3: Overview of our decoder and fusion blocks. Both blocks contain warping and generation branches. In the decoder block, we build the global and local correspondence between the source image and the target pose with Multi-Head Cross-Attention and a CNN, respectively. The fusion block predicts a mask to combine the outputs of the two branches at the pixel level.

Following Ren et al. (2020) and Huang et al. (2021a), we utilize the reconstruction loss Johnson et al. (2016), feature matching loss Wang et al. (2018c), hinge adversarial loss Lim & Ye (2017), style loss Gatys et al. (2015), total variation loss Johnson et al. (2016), and mask loss Huang et al. (2021a) to optimize our network. Details are in the appendix.

Figure 4: Visual comparison of state-of-the-art approaches and our method on YouTube videos dataset. Our proposed framework generates images with the highest visual quality.

Figure 5: Visual comparison of state-of-the-art approaches and our method on iPer dataset. Our proposed framework generates images with the highest visual quality.


Figure 6: Visual ablation study on YouTube videos dataset. (a) The source image. (b) The target pose. (c) Our method without Attention. (d) Our method without the warping branch. (e) Our method without the generation branch. (f) Our method without the mutual learning loss. (g) Our full method. Our full model can generate realistic appearance and correct body pose.

Table 1: Quantitative comparisons of state-of-the-art methods on the YouTube videos dataset. User study denotes the preference rate of our method against the competing methods. Chance is 50%.

Table 2: Quantitative comparisons of state-of-the-art methods on the iPer videos dataset. User study denotes the preference rate of our method against the competing methods. Chance is 50%.

Table 3: Ablation analysis of our proposed method on the YouTube dataset. Our full method achieves results that are superior to all other variants.

ETHICS DISCUSSION

This work introduces a motion transfer method that can transfer motions from one subject to another. It raises the potential ethical concern that malicious actions could be transferred to anyone (e.g., celebrities). To prevent action retargeting on celebrities, we may insert a watermark into the human motion videos; the watermark records the original motion source, which helps differentiate transferred movement from genuine footage. Meanwhile, we can construct a celebrity set: we first conduct face recognition on the source person, and if that person falls into this set, we do not perform human motion transfer.

REPRODUCIBILITY STATEMENT

MotionFormer is trained for 10 epochs, and the learning rate decays linearly after the 5th epoch. We provide the pseudo-code of the training process in Algorithm 1. We denote the Transformer encoder of the source images as En_s, the Transformer encoder of the target pose as En_t, the Transformer decoder as De, and the discriminator as D. We set the batch size to 4, and step_max is obtained by dividing the number of images in the dataset by the batch size. We show the details of the model architecture and the loss functions in the appendix; this information is useful for reproduction.

Algorithm 1 Training Process

Require: A set of source images I_s, target pose images P_t, and person mask images M_gt.
for Epoch = 1, 2, 3, ..., 10 do
    for Step = 1, 2, 3, ..., step_max do
        Sample a batch of source images I_s, target poses P_t, and person masks M_gt
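The loop structure of Algorithm 1 can be sketched in Python as below. `train_step` is a placeholder for the actual batch sampling and updates of En_s, En_t, De, and D, and the exact linear-decay formula after the 5th epoch is an assumption, since only the schedule's shape is stated.

```python
# Skeleton of Algorithm 1; train_step stands in for the real forward/backward pass.
def train(dataset_size, batch_size=4, epochs=10, train_step=None):
    step_max = dataset_size // batch_size   # steps per epoch, as in the paper
    total = 0
    for epoch in range(1, epochs + 1):
        # Learning rate decays linearly after the 5th epoch (assumed formula).
        lr_scale = 1.0 if epoch <= 5 else 1.0 - (epoch - 5) / (epochs - 5)
        for step in range(step_max):
            if train_step is not None:
                train_step(epoch, step, lr_scale)  # sample batch, update networks
            total += 1
    return total

# With 100 images and batch size 4, we run 25 steps x 10 epochs = 250 updates.
print(train(dataset_size=100))  # 250
```

Plugging in a real `train_step` that samples (I_s, P_t, M_gt) batches and applies the losses from Sec. 3.3 recovers the full training procedure.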

