LEARNING IMPLICIT SCALE CONDITIONED MEMORY COMPENSATION FOR TALKING HEAD GENERATION

Abstract

Talking head video generation aims to animate the pose and expression of a person in a target driving video using the motion information contained in that video, while maintaining the person's identity from a given still source image. Highly dynamic and complex motions in the driving video lead to ambiguous generation from the source image, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expressions, which produces severe artifacts and significantly degrades the generation quality. However, existing works mainly focus on learning more accurate motion estimation and representation in 2D and 3D, and ignore facial structural priors for resolving these facial ambiguities. Effective handling of the ambiguities caused by dramatic appearance changes of the source, so as to substantially improve facial detail and completeness in generation, thus remains barely explored. To this end, we propose a novel implicit scale conditioned memory compensation network (MCNet) for high-fidelity talking head generation. Specifically, since human faces are symmetric and structured, we automatically learn a representative global facial memory bank from all training data as a prior to compensate the facial generation features. Each face in the source image has a scale that is reflected in its detected facial keypoints. To better query the learned global memory, we further propose to learn implicit scale representations from the discrete keypoints and use them to condition the query of the global memory, yielding a scale-aware memory for feature compensation. Extensive quantitative and qualitative experiments demonstrate that MCNet learns representative and complementary facial memory, and clearly outperforms previous state-of-the-art methods on the VoxCeleb1 and CelebV datasets.

1. INTRODUCTION

In this work, we aim to address the problem of generating a realistic talking head video given one still source image and one dynamic driving video, which is widely known as talking head video generation. A high-quality talking head generation model needs to imitate vivid facial expressions and complex head movements, and should be applicable to the different facial identities present in the source image and the target video. The task has been attracting rapidly increasing attention from the community, and a wide range of realistic applications remarkably benefit from it, such as digital human broadcast, AI-based human conversation, and virtual anchors in films. Significant progress has been achieved on this task in terms of both quality and robustness in recent years. Existing works mainly focus on learning more accurate motion estimation and representation in 2D and 3D to improve the generation. More specifically, 2D facial keypoints or landmarks are learned to model the motion flow (see Fig. 1(c)) between the source image and any target image in the driving video (Zhao et al., 2021; Zakharov et al., 2019; Hong et al., 2022). Some works also consider utilizing a 3D facial prior model (e.g. 3DMM (Blanz & Vetter, 1999)) with decoupled expression codes (Zhao et al., 2021; Zakharov et al., 2019), or learning dense facial geometries in a self-supervised manner (Hong et al., 2022), to model complex facial expression movements and produce more fine-grained facial generation. However, no matter how accurately the motion can be estimated and represented, highly dynamic and complex motions in the driving video cause ambiguous generation from the source image (see Fig. 1(d)), because the still source image cannot provide sufficient appearance information for occluded regions or delicate expressions, which produces severe artifacts and significantly degrades the generation quality.
Intuitively, we understand that human faces are symmetrical and highly structured, and many regions of human faces are essentially not discriminative. For instance, occluding only a small eye region of a face image causes a well-trained face recognition model to drop significantly in recognition performance (Qiu et al., 2021), which indicates to a certain extent that the structure and appearance representations of human faces across different face identities are generic and transferable. Therefore, learning global facial priors on spatial structure and appearance from all available training face images, and utilizing the learned priors to compensate the dynamic facial synthesis, are critically important for high-fidelity talking head generation. However, existing works have not explored these beneficial facial priors to address facial ambiguities in generation under large head motions. In this paper, to effectively deal with the ambiguities in the dramatic appearance changes from the still source image, we propose an implicit scale conditioned Memory Compensation Network, coined MCNet, to learn and transfer global facial representations that compensate ambiguous facial details and guarantee completeness for high-fidelity generation. Specifically, we design and learn a global spatial meta memory bank. The optimization gradients from all training images jointly contribute to updating the meta memory, so it can capture the global and most common facial appearance and structure representations for transfer. Since different source face images contain faces at distinct scales, to more effectively query the learned meta memory bank, we propose an implicit scale conditioned memory module (ISCM) (see Fig. 3).
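The idea of a shared meta memory accumulating gradients from every training image can be sketched in a few lines. The following is a minimal, framework-agnostic NumPy illustration, not the paper's implementation: the memory shape (8 slots of 4-dim features), the toy pull-toward-feature objective, and the learning rate are all hypothetical choices for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 8 memory slots, 4-dim features (the real bank is larger
# and spatial). A single array is shared across all training images.
memory = rng.standard_normal((8, 4)) * 0.1

def toy_loss_grad(memory, feature):
    # Toy objective for illustration only: gradient of
    # 0.5 * ||memory - feature||^2 w.r.t. the memory slots.
    return memory - feature  # broadcasts feature (4,) against memory (8, 4)

# Gradients from every sample in a batch (and, over training, the whole
# dataset) accumulate into the one shared bank, which is how it can absorb
# dataset-wide facial appearance and structure priors.
batch = rng.standard_normal((16, 4))
grad = np.mean([toy_loss_grad(memory, f) for f in batch], axis=0)
memory -= 0.1 * grad
```

In a deep-learning framework, the same role would typically be played by a learnable parameter tensor registered on the model, updated by the optimizer alongside the network weights.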
As the detected discrete facial keypoints inherently contain the scale information of the face, we first learn an implicit scale representation from the discrete keypoint coordinates, and further use it to condition the query of the meta memory bank to obtain a scale-aware memory bank, which can more effectively compensate the features of faces with different scales. The compensation is performed through a memory compensation module (MCM) (see Fig. 4). The warped feature map generated from the estimated motion field queries the scale-aware memory bank through a dynamic cross-attention mechanism to output a refined compensated feature map for the final generation. We conduct extensive experiments to evaluate the proposed MCNet on two competitive talking head generation datasets (i.e. VoxCeleb (Nagrani et al., 2017) and CelebV (Wu et al., 2018)). Experimental results demonstrate the effectiveness of learning global facial memory to tackle the appearance ambiguities in talking head generation, and also show clearly improved generation results from both qualitative and quantitative perspectives, achieving state-of-the-art performance. In summary, our main contribution is three-fold:
• We propose to learn a global facial meta memory bank to transfer representative facial representations to handle the appearance and structure ambiguities caused by highly dynamic generation from a still source image. To the best of our knowledge, this is the first exploration in the literature of modeling global facial representations to effectively address the ambiguities in talking head generation.
• We propose a novel implicit scale conditioned memory compensation network (MCNet) for talking head video generation, in which an implicit scale conditioned memory module (ISCM) and a facial memory compensation module (MCM) are designed to respectively perform the scale-aware memory learning and the feature compensation tasks.
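The ISCM and MCM pipeline described above can be sketched as follows. This is a hedged NumPy illustration under assumed shapes, not the paper's architecture: the keypoint count (10), the two-layer MLP with random weights, the multiplicative conditioning of the memory, and the single-head attention are all simplifying assumptions made for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 10 keypoints with (x, y) coordinates, 8 memory slots,
# 4-dim features, a 64-position (flattened spatial) warped feature map.
keypoints = rng.uniform(-1, 1, size=(10, 2))   # detected facial keypoints
memory = rng.standard_normal((8, 4))            # learned global meta memory bank
warped_feat = rng.standard_normal((64, 4))      # feature warped by the motion field

# ISCM (sketch): a small MLP maps the flattened keypoint coordinates, which
# implicitly carry the face scale, to a per-channel modulation of the memory.
W1 = rng.standard_normal((20, 16)) * 0.1
W2 = rng.standard_normal((16, 4)) * 0.1
scale_code = np.tanh(keypoints.reshape(-1) @ W1) @ W2   # implicit scale rep., (4,)
scale_aware_memory = memory * (1.0 + scale_code)        # condition the bank

# MCM (sketch): each warped-feature position queries the scale-aware memory
# via cross-attention; the attended readout compensates the warped feature.
attn = softmax(warped_feat @ scale_aware_memory.T / np.sqrt(4.0), axis=-1)
compensated = warped_feat + attn @ scale_aware_memory   # refined feature, (64, 4)
```

In the actual network, the MLP weights and the memory bank would be learned end-to-end, and the attention would operate on projected queries, keys, and values over spatial feature maps rather than raw vectors.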



Figure 1: MCNet animation illustration. MCNet first learns the motion flow (c) between the source and the driving images. (d) illustrates possible occlusion or deformation caused by large motion, produced by directly warping the source image with the motion provided by the driving images. (e) shows randomly sampled memory channels of our learned scale-aware memory bank. We also present examples of the generated results of our method in (f).

