PRIORITYCUT: OCCLUSION-AWARE REGULARIZATION FOR IMAGE ANIMATION

Abstract

Image animation generates a video of a source image following the motion of a driving video. Self-supervised image animation approaches do not require explicit pose references as inputs, thus offering great flexibility in learning. State-of-the-art self-supervised image animation approaches mostly warp the source image according to the motion of the driving video and recover the warping artifacts by inpainting. When the source and the driving images have large pose differences, heavy inpainting is necessary. Without guidance, heavily inpainted regions usually suffer from loss of details. While previous data augmentation techniques such as CutMix are effective in regularizing non-warp-based image generation, directly applying them to image animation ignores the difficulty of inpainting on the warped image. We propose PriorityCut, a novel augmentation approach that uses the top-k percent occluded pixels of the foreground to regularize image animation. By taking into account the difficulty of inpainting, PriorityCut preserves identity better than vanilla CutMix and outperforms state-of-the-art image animation models in terms of pixel-wise difference, low-level similarity, keypoint distance, and feature embedding distance.

1. INTRODUCTION

Image animation takes an image and a driving video as inputs and generates a video of the input image that follows the motion of the driving video. Traditional image animation requires a reference pose of the animated object, such as facial keypoints or edge maps (Fu et al., 2019; Ha et al., 2019; Qian et al., 2019; Zhang et al., 2019b; Otberdout et al., 2020). Self-supervised image animation does not require explicit keypoint labels on the objects (Wiles et al., 2018; Kim et al., 2019; Siarohin et al., 2019a;b). Without explicit labeling, these approaches often struggle to produce realistic images when the poses of the source and the driving images differ significantly. To understand this problem, we first look at the typical process of self-supervised image animation approaches. These approaches can be generalized into the following pipeline: (1) keypoint detection, (2) motion prediction, and (3) image generation. Keypoint detection identifies important points in the source image for movement. Motion prediction estimates the motion of the source image based on the driving image. Based on the results of keypoint detection and motion prediction, the model warps the source image to obtain an intermediate image that closely resembles the motion of the driving image. Image generation then recovers the warping artifacts by inpainting. Existing approaches mostly provide limited to no guidance on inpainting, so the generator has to rely on learned statistics to recover the warping artifacts. For instance, First Order Motion Model (Siarohin et al., 2019b) predicts an occlusion mask that indicates where and how much the generator should inpaint. While it has shown significant improvements over previous approaches such as X2Face (Wiles et al., 2018) and Monkey-Net (Siarohin et al., 2019a), it struggles to inpaint realistic details around heavily occluded areas. The occlusion mask does not provide information on how well the generator inpaints.
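To make the warp-then-inpaint composition concrete, the following is a minimal sketch of how a predicted dense flow and occlusion mask can be combined with a generator's inpainted output. This is not the authors' implementation: models such as First Order Motion Model warp intermediate feature maps with bilinear sampling rather than raw pixels, and all function names here are hypothetical.

```python
import numpy as np

def warp(source, flow):
    """Backward-warp an image by a dense flow field (nearest-neighbour sketch).

    source: (H, W, C) image; flow: (H, W, 2) per-pixel (dx, dy) offsets.
    """
    H, W = source.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    return source[sy, sx]

def animate(source, flow, occlusion, inpainted):
    """Compose warped pixels with generator output where the warp is unreliable.

    occlusion: (H, W) map in [0, 1]; 1 means the warped pixel is trusted,
    0 means the region is occluded and must be filled in by the generator.
    """
    warped = warp(source, flow)
    o = occlusion[..., None]
    return o * warped + (1.0 - o) * inpainted
```

With zero flow and a fully trusted occlusion mask, the output is just the source image; with a fully occluded mask, the output is entirely the generator's inpainted image, which is exactly the regime where detail loss occurs.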
We propose PriorityCut, a novel augmentation approach that uses the top-k percent occluded pixels of the foreground for consistency regularization. PriorityCut derives a new mask from the occlusion mask and the background mask. Using the PriorityCut mask, we apply the CutMix operation (Yun et al., 2019), a data augmentation that cuts and mixes patches of different images, to regularize discriminator predictions. Compared to the vanilla rectangular CutMix mask, the PriorityCut mask is flexible in both shape and location. Also, unlike previous approaches (DeVries & Taylor, 2017; Zhang et al., 2017; Yun et al., 2019), PriorityCut prevents unrealistic patterns and information loss. The subtle differences in our CutMix image allow the generator to take small steps in learning, thus refining the details necessary for realistic inpainting. We build PriorityCut on top of First Order Motion Model and experiment on the VoxCeleb (Nagrani et al., 2017), BAIR (Ebert et al., 2017), and Tai-Chi-HD (Siarohin et al., 2019b) datasets. Our experimental results show that PriorityCut outperforms state-of-the-art image animation approaches in pixel-wise difference, low-level similarity, keypoint distance, and feature embedding distance.
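A rough sketch of the idea, under our own reading of the description above: select the top-k percent most heavily occluded foreground pixels as the CutMix mask, then mix real and generated images under that mask. The mask convention (lower occlusion values meaning heavier inpainting) and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def prioritycut_mask(occlusion, background, k=0.5):
    """Mark the top-k fraction of most-occluded foreground pixels.

    occlusion: (H, W) in [0, 1], where low values indicate heavy inpainting
    (assumed convention). background: (H, W), 1 = background pixel.
    """
    foreground = 1.0 - background
    # Score foreground pixels by how much inpainting they required;
    # background pixels are excluded with -inf.
    scores = np.where(foreground > 0, 1.0 - occlusion, -np.inf)
    n = int(k * foreground.sum())
    mask = np.zeros(occlusion.size, dtype=np.float32)
    if n > 0:
        idx = np.argpartition(scores.ravel(), -n)[-n:]
        mask[idx] = 1.0
    return mask.reshape(occlusion.shape)

def cutmix(real, fake, mask):
    """Mix real and generated images: take fake pixels where mask == 1."""
    m = mask[..., None]
    return m * fake + (1.0 - m) * real
```

Because the mask follows the occluded foreground rather than a random rectangle, the mixed image differs from the real one only in regions the generator actually had to inpaint, which is what allows the discriminator feedback to focus on inpainting quality.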

2. RELATED WORK

Data augmentation Our work is closely related to patch-based augmentation techniques. Cutout and its variants drop random patches of an image (DeVries & Taylor, 2017; Singh et al., 2018; Chen, 2020). Mixup blends two images to generate a new sample (Zhang et al., 2017). CutMix and its variants cut and mix patches of random regions between images (Takahashi et al., 2019; Yun et al., 2019; Yoo et al., 2020). Yoo et al. (2020) observed that existing patch-based data augmentation techniques either drop the relationship of pixels, induce mixed image contents within an image, or cause a sharp transition in an image. In contrast, we design our augmentation to avoid these issues.

Image animation Traditional image animation requires a reference pose of the animated object, such as facial keypoints or edge maps (Fu et al., 2019; Ha et al., 2019; Qian et al., 2019; Zhang et al., 2019b; Otberdout et al., 2020). Self-supervised image animation does not require explicit labels on the objects. X2Face (Wiles et al., 2018) uses an embedding network and a driving network to generate images. Kim et al. (2019) used a keypoint detector and a motion generator to predict videos of an action class based on a single image. Monkey-Net (Siarohin et al., 2019a) generates images based on a source image, relative keypoint movements, and dense motion. First Order Motion Model (Siarohin et al., 2019b) extended Monkey-Net by predicting Jacobians in keypoint detection and an occlusion mask. Burkov et al. (2020) achieved pose-identity disentanglement using a big identity encoder and a small pose encoder. Yao et al. (2020) generated images based on optical flow predicted on 3D meshes. These approaches mostly provide limited to no guidance on inpainting. In contrast, our approach utilizes the occlusion information to guide inpainting.

Figure 1: Warp-based image animation warps the source image based on the motion of the driving image and recovers the warping artifacts by inpainting. PriorityCut utilizes the occlusion information in image animation, which indicates the locations of warping artifacts, to regularize discriminator predictions on inpainting. The augmented image has smooth transitions without loss or mixture of context.

Researchers have proposed different solutions to address the challenges of GANs (Bissoto et al., 2019). Our work is closely related to architectural methods, constraint techniques, and image-to-image translation. Chen et al. (2018) modulated the intermediate layers of a generator by the input noise vector using conditional batch normalization. Kurach et al. (2019) conducted a large-scale study on different regularization and normalization techniques. Some researchers applied consistency regularization on real images (Zhang et al., 2019a), and additionally on generated images and latent variables (Zhao et al., 2020). Researchers also provided local discriminator feedback on patches (Isola et al., 2017) and individual pixels with CutMix regularization (Schonfeld et al., 2020). Our work differs from Schonfeld et al. (2020) in the application domain, mask shape, and mask locations. First, their experiments are on non-warp-based image generation, but we experiment on image animation. Also, their CutMix mask is rectangular and is applied at arbitrary locations. In contrast, our mask shape is irregular and

