LEARNING IMPLICIT SCALE CONDITIONED MEMORY COMPENSATION FOR TALKING HEAD GENERATION

Abstract

Talking head video generation aims to animate a person in a still source image with the pose and expression extracted from a target driving video, while preserving the identity of the source person. Highly dynamic and complex motions in the driving video cause ambiguous generation from the source image, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expressions; this produces severe artifacts and significantly degrades the generation quality. However, existing works mainly focus on learning more accurate motion estimation and representation in 2D and 3D, and ignore facial structural priors when addressing these ambiguities. Effective handling of the dramatic appearance changes from the source image, so as to substantially improve facial details and completeness in generation, thus remains largely unexplored. To this end, we propose a novel implicit scale conditioned memory compensation network (MCNet) for high-fidelity talking head generation. Specifically, since human faces are symmetric and structured, we automatically learn a representative global facial memory bank from all training data as a prior to compensate the facial generation features. Each source face has a scale that is reflected in its detected facial keypoints. To better query the learned global memory, we further propose to learn implicit scale representations from the discrete keypoints, which condition the query of the global memory to obtain scale-aware memory for feature compensation. Extensive quantitative and qualitative experiments demonstrate that MCNet learns representative and complementary facial memory, and clearly outperforms previous state-of-the-art methods on the VoxCeleb1 and CelebV datasets.

1. INTRODUCTION

In this work, we aim to address the problem of generating a realistic talking head video given one still source image and one dynamic driving video, which is widely known as talking head video generation. A high-quality talking head generation model needs to imitate vivid facial expressions and complex head movements, and should be applicable to different facial identities present in the source image and the target video. The task has been attracting rapidly increasing attention from the community, and a wide range of realistic applications benefit remarkably from it, such as digital human broadcasting, AI-based human conversation, and virtual anchors in films. Significant progress has been achieved on this task in terms of both quality and robustness in recent years. Existing works mainly focus on learning more accurate motion estimation and representation in 2D and 3D to improve the generation. More specifically, 2D facial keypoints or landmarks are learned to model the motion flow (see Fig. 1(c)) between the source image and any target image in the driving video (Zhao et al., 2021; Zakharov et al., 2019; Hong et al., 2022). Some works also consider utilizing a 3D facial prior model (e.g. 3DMM (Blanz & Vetter, 1999)) with decoupled expression codes (Zhao et al., 2021; Zakharov et al., 2019), or learning dense facial geometries in a self-supervised manner (Hong et al., 2022), to model complex facial expression movements and produce more fine-grained facial generation. However, no matter how accurately the motion can be estimated and represented, highly dynamic and complex motions in the driving video cause ambiguous generation from the source image (see Fig. 1(d)), because the still source image cannot provide sufficient appearance information for occluded regions or delicate expressions, which produces severe artifacts and significantly degrades the generation quality.
Intuitively, human faces are symmetrical and highly structured, and many regions of the human face are essentially not discriminative. For instance, blocking only a very small eye region of a face image causes a well-trained face recognition model to drop substantially in recognition performance (Qiu et al., 2021), which indicates to a certain extent that the structure and appearance representations of human faces across different facial identities are generic and transferable. Therefore, learning global facial priors on spatial structure and appearance from all available training face images, and utilizing the learned priors to compensate the dynamic facial synthesis, are critically important for high-fidelity talking head generation. However, existing works have not explored these beneficial facial priors to address facial ambiguities in generation under large head motions. In this paper, to effectively deal with the ambiguities in the dramatic appearance changes from the still source image, we propose an implicit scale conditioned Memory Compensation Network, coined as MCNet, to learn and transfer global facial representations that compensate ambiguous facial details and guarantee completeness for high-fidelity generation. Specifically, we design and learn a global spatial meta memory bank. The optimization gradients from all the training images jointly contribute to the updating of the meta memory, so it can capture the global and most common facial appearance and structure representations for transfer. Since different source face images contain distinct scales, to more effectively query the learned meta memory bank, we propose an implicit scale conditioned memory module (ISCM) (see Fig. 3).
As the detected discrete facial keypoints inherently contain the scale information of the face, we first learn an implicit scale representation from the discrete keypoint coordinates, and further use it to condition the query of the meta memory bank to obtain a scale-aware memory bank, which can more effectively compensate the features of faces with different scales. The compensation is performed through a memory compensation module (MCM) (see Fig. 4). The warped feature map generated from the estimated motion field queries the scale-aware memory bank through a dynamic cross-attention mechanism to output a refined, compensated feature map for the final generation. We conduct extensive experiments to evaluate the proposed MCNet on two competitive talking head generation datasets (i.e. VoxCeleb (Nagrani et al., 2017) and CelebV (Wu et al., 2018)). Experimental results demonstrate the effectiveness of learning a global facial memory to tackle the appearance ambiguities in talking head generation, and show clearly improved generation results from both qualitative and quantitative perspectives, achieving state-of-the-art performance. In summary, our main contributions are three-fold:
• We propose to learn a global facial meta memory bank to transfer representative facial representations and handle the appearance and structure ambiguities caused by highly dynamic generation from a still source image. To the best of our knowledge, this is the first exploration in the literature of modeling global facial representations to effectively address the ambiguities in talking head generation.
• We propose a novel implicit scale conditioned memory compensation network (MCNet) for talking head video generation, in which an implicit scale conditioned memory module (ISCM) and a facial memory compensation module (MCM) are designed to respectively perform scale-aware memory learning and feature compensation.
• Qualitative and quantitative experiments extensively show the effectiveness of the learned meta memory bank for addressing the ambiguities in generation, and our framework establishes a clear state-of-the-art performance on talking head generation. A generalization experiment also shows that the proposed modules can effectively boost the performance of different talking head generation models.

2. RELATED WORKS

Talking Head Video Generation. Talking head video generation can be mainly divided into two strategies: image-driven and audio-driven generation. For the image-driven strategy, researchers aim to capture the expression of a given driving image and aggregate the captured expression with the facial identity from a given source image. Some approaches (Yao et al., 2020; Wu et al., 2021b; Wang et al., 2021a) utilized a 3DMM regressor (Tran & Liu, 2018; Zhu et al., 2017) to extract an expression code and an identity code from a given face, and then recombine them across different faces to generate a new face. Several other works (Tripathy et al., 2021; Ha et al., 2020; Zakharov et al., 2020; 2019; Zhao et al., 2021) utilized facial landmarks detected by a pretrained face model (Guo et al., 2019) to act as anchors of the face. The facial motion flow calculated from these landmarks is then transferred from a driving face video. However, their motion flow suffers from error accumulation caused by the inaccuracy of the pretrained model. To overcome this limitation, keypoints are learned in an unsupervised fashion (Siarohin et al., 2019; Hong et al., 2022; Wang et al., 2021b; Liu et al., 2021a; Zhao & Zhang, 2022) to better represent the motion of the face, with carefully designed mechanisms for modeling the motion transformations between two sets of keypoints. Audio-driven talking head generation (Ji et al., 2022; Lu et al., 2021; Wu et al., 2021a; Ji et al., 2021) is another popular direction on this topic, as audio sequences do not contain information about the face identity, and it is relatively easier to disentangle the motion information from the input audio. Liang et al. (2022) explicitly divide the driving audio into granular parts through delicate prior-based pre-processing to control the lip shape, face pose, and facial expression. In this work, we focus on image-driven talking head generation.
In contrast to previous image-driven works, we aim at learning global facial structure and appearance priors through a well-designed memory-bank network to effectively compensate intermediate facial features, which can produce higher-quality generation in ambiguous regions caused by large head motion.

Memory Bank Learning. Introducing an external memory / prior component is popular because of its flexible capability of storing, abstracting and organising long-term knowledge into a structural form. Recently, memory banks have shown powerful capabilities in learning and reasoning for addressing several challenging tasks, e.g. image processing (Yoo et al., 2019; Huang et al., 2021), video object detection (Sun et al., 2021), and image captioning (Fei, 2021). As an earlier work, Weston et al. (2014) propose a memory network, which integrates inference components within a memory bank that can be read and written to memorize supporting facts from the past for question answering. Yoo et al. (2019) propose a memory-augmented colorization network to produce high-quality colorization with limited training data. Xu et al. (2021) use the texture memory of patch samples extracted from unmasked regions to inpaint missing facial parts. Wu et al. (2022) propose a memory-disentangled refinement network for coordinated face inpainting in a coarse-to-fine manner. In contrast to these previous works, to the best of our knowledge, we are the first to propose a global memory mechanism to deal with ambiguous generation in talking head video generation. We accordingly design a novel implicit-facial-scale-aware memory learning network and a novel memory compensation network to tackle these issues.

3. METHODOLOGY

Under the same pipeline in previous work (Siarohin et al., 2019) , we introduce an implicit scale conditioned memory compensation network, termed as MCNet, for talking head video generation. MCNet learns a facial-scale-aware memory bank by the designed implicit scale conditioned memory module (ISCM) to compensate the warped feature in the memory compensation module (MCM).

3.1. OVERVIEW

The framework of our MCNet, depicted in Fig. 2, can be divided into three parts: (i) The keypoint detector and the dense motion network. Initially, the keypoint detector receives a source image $S$ and a driving frame $D$ to predict $K$ pairs of keypoints, i.e. $\{(x_{s,t}, y_{s,t})\}_{t=1}^{K}$ and $\{(x_{d,t}, y_{d,t})\}_{t=1}^{K}$ on the source and driving frame, respectively. With the keypoints of the driving frame and the source image, the dense motion network estimates the motion flow $\mathcal{A}_{S\leftarrow D}$ between the two; (ii) The implicit scale conditioned memory module (ISCM). We first leverage the estimated motion flow $\mathcal{A}_{S\leftarrow D}$ to warp the encoded feature $F_e^i$ of the $i$-th layer, resulting in a warped feature $F_w^i$. The warped feature $F_w^i$ and the source keypoints are fed into the ISCM to encode an implicit scale representation, which conditions the query of the meta memory $M_o$ to produce an identity-dependent scale-aware memory bank $M_s$; (iii) The memory compensation module (MCM). After obtaining $M_s$, we utilize a dynamic cross-attention mechanism to spatially compensate the warped features in the MCM, and output a compensated feature $F_{cpt}^i$. Finally, our decoder utilizes all $N$ feature maps, i.e. $\{F_{cpt}^i\}_{i=1}^{N}$, to produce the final image $I_{rst}$. In the following, we show how our memory bank is learned in the ISCM and how it is utilized in the MCM for generation-feature compensation.
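To make the warping step concrete, the following is a minimal NumPy sketch of backward-warping an encoded feature map with a dense motion flow. This is a simplified stand-in, not the authors' implementation: it uses nearest-neighbour sampling and absolute pixel coordinates, whereas the actual model uses differentiable bilinear grid sampling, and the function name is illustrative.

```python
import numpy as np

def warp_feature(feat, flow):
    """Backward-warp a feature map with a dense motion flow.
    feat: (C, H, W) encoded feature; flow: (H, W, 2) absolute sampling
    coordinates in pixel units (x in channel 0, y in channel 1).
    Nearest-neighbour sampling is used here for brevity."""
    C, H, W = feat.shape
    ys = np.clip(np.round(flow[..., 1]).astype(int), 0, H - 1)
    xs = np.clip(np.round(flow[..., 0]).astype(int), 0, W - 1)
    return feat[:, ys, xs]

# An identity flow samples each position from itself.
feat = np.random.rand(4, 8, 8)
xx, yy = np.meshgrid(np.arange(8), np.arange(8))
identity_flow = np.stack([xx, yy], axis=-1).astype(float)
```

In the full model this operation is applied at every encoder layer, and the flow is predicted by the dense motion network rather than constructed by hand.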

3.2. LEARNING IMPLICIT-SCALE-CONDITIONED GLOBAL FACIAL MEMORY

We first aim at learning a global meta memory bank to model facial structure and appearance representations from the whole face dataset. Human faces clearly appear at multiple scales in the real world, so learning a meta memory bank that directly compensates all source faces of different scales is inflexible. To handle faces with distinct scales, we design an implicit scale conditioned memory module (ISCM) that learns a scale-aware memory through a source-scale-conditioned query on the global meta memory bank, which compensates warped source face features under scale variations.

Meta memory. We first learn a global meta memory bank to store the global and generic facial appearance and spatial structure representations from all the available training data. We initialize the meta memory bank $M_o$ as a cube tensor with shape $C_m \times H_m \times W_m$ instead of a vector (Esser et al., 2021). The multiple channels give the meta memory enough capacity to learn different facial structures and appearances (see Fig. 7). As many regions of human faces are not discriminative and are transferable, we can utilize the global facial priors learned in the meta memory to compensate ambiguous regions in the generated faces. With a designed objective function, the meta memory bank is automatically updated by the optimization gradients from all the training images during the training stage. In this way, the facial prior learned in the meta memory is global rather than conditioned on any specific input sample, which provides highly beneficial global information for face compensation in generation.

Implicit scale representation learning. In our framework, the detected facial keypoints are used to learn the motion flow for feature warping. The facial keypoints implicitly contain the scale information of the human face because of their structural positions (Siarohin et al., 2019; Tao et al., 2022).
Therefore, we utilize both the source keypoints $\{(x_{s,t}, y_{s,t})\}_{t=1}^{K}$ and the warped feature $F_w^i$ to learn an implicit scale representation of the source face. The reason for learning a scale representation of the source is that we aim to compensate the generation with the identity of the source image. The warped feature $F_w^i$ is used because it also contains the scale information of the source image, as it is generated by warping the source feature with the keypoint-based motion flow. As shown in Fig. 3, to embed the scale information, we first utilize a global average pooling function $F_{GAP}$ to squeeze the global spatial information of the projected feature $F_{proj}^i$, which is produced from the warped feature $F_w^i$ (see Fig. 4), into a channel descriptor. After that, we concatenate the flattened and normalized keypoints with the feature vector from $F_{GAP}$, and feed them into an MLP mapping network $F_{mlp}$ to learn an implicit scale representation $S$ of the source image:

$S = F_{mlp}(S')$, with $S' = \mathrm{Concat}[F_{GAP}(F_{proj}^i),\; (x_{s,1}, y_{s,1}, \dots, x_{s,K}, y_{s,K})]$, (1)

where $\mathrm{Concat}[\cdot,\cdot]$ denotes the concatenation operation. In this way, the implicit scale representation $S$ of the source image can be learned.

Scale-aware memory learning. As discussed before, human faces present a diverse range of scales in reality. Compared to directly using the global meta memory for facial-feature compensation, we believe that a scale-dependent condition on the meta memory is a more intuitive and effective choice. Therefore, we propose to condition the learned implicit scale representation $S$ on the meta memory $M_o$ to obtain an identity-dependent scale-aware memory $M_s$ for each face image.
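The pooling-concatenation-MLP pipeline described above can be sketched in NumPy as follows. This is a hedged illustration: the MLP weights `W1`, `b1`, `W2`, `b2` and all sizes are placeholders, since the exact architecture of $F_{mlp}$ is not specified here.

```python
import numpy as np

def implicit_scale(proj_feat, keypoints, W1, b1, W2, b2):
    """Sketch of the implicit scale representation: global-average-pool the
    projected feature, concatenate the flattened normalized keypoints, and
    map the result through a small MLP standing in for F_mlp."""
    pooled = proj_feat.mean(axis=(1, 2))              # F_GAP: (C,) descriptor
    s_in = np.concatenate([pooled, keypoints.ravel()])
    h = np.maximum(0.0, s_in @ W1 + b1)               # hidden layer with ReLU
    return h @ W2 + b2                                # implicit scale S

C, K, D = 16, 15, 32                                  # K = 15 keypoints as in the paper
rng = np.random.default_rng(0)
S = implicit_scale(rng.standard_normal((C, 8, 8)),
                   rng.uniform(-1.0, 1.0, (K, 2)),    # normalized keypoints
                   rng.standard_normal((C + 2 * K, 64)), np.zeros(64),
                   rng.standard_normal((64, D)), np.zeros(D))
```

The output dimension `D` must match the number of input channels of the modulated convolution that produces the scale-aware memory.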
Inspired by the style injection in StyleGANv2 (Karras et al., 2020), we utilize the implicit scale representation $S$ to manipulate a $3 \times 3$ convolution layer to produce the implicit scale-aware facial memory:

$\omega'_{ijk} = s_i \cdot \omega_{ijk}$ and $\omega''_{ijk} = \omega'_{ijk} \Big/ \sqrt{\sum_{i,k} (\omega'_{ijk})^2 + \epsilon}$,

where $\omega$ is the weight of the convolution kernel, $\epsilon$ is a small constant to avoid numerical issues, $s_i$ is the $i$-th element of the learned implicit scale representation $S$, and $j$ and $k$ enumerate the output feature maps and the spatial footprint of the convolution, respectively. Finally, we obtain the scale-aware memory as:

$M_s = FC_{\omega''}(M_o)$, (2)

where $FC_{\omega''}$ is the manipulated convolution layer parameterized by $\omega''$. With the scale-aware memory bank $M_s$, each input sample can be compensated by scale-correlated facial priors, resulting in the better generation performance discussed in the experiments.
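The modulate-demodulate convolution above can be illustrated with the following NumPy sketch. Assumptions: a stride-1, zero-padded 3x3 convolution written out explicitly, and illustrative weight shapes and random values; the actual model uses a learned convolution inside the network.

```python
import numpy as np

def scale_aware_memory(meta_memory, s, weight, eps=1e-8):
    """Sketch of the modulated 3x3 convolution producing M_s.
    meta_memory: (C_in, H, W); s: per-input-channel scale (C_in,);
    weight: (C_out, C_in, 3, 3)."""
    w1 = weight * s[None, :, None, None]                  # modulate by s_i
    demod = np.sqrt((w1 ** 2).sum(axis=(1, 2, 3)) + eps)  # demodulate per output map
    w2 = w1 / demod[:, None, None, None]
    C_in, H, W = meta_memory.shape
    x = np.pad(meta_memory, ((0, 0), (1, 1), (1, 1)))     # zero padding
    out = np.zeros((w2.shape[0], H, W))
    for o in range(w2.shape[0]):                          # explicit convolution
        for dy in range(3):
            for dx in range(3):
                out[o] += (w2[o, :, dy, dx, None, None] *
                           x[:, dy:dy + H, dx:dx + W]).sum(axis=0)
    return out

M_o = np.random.rand(4, 8, 8)                             # toy meta memory
M_s = scale_aware_memory(M_o, np.ones(4), np.random.randn(6, 4, 3, 3))
```

The demodulation normalizes each output feature map's kernel to roughly unit norm, which is what keeps the scale injection from changing the overall feature magnitude.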

3.3. GLOBAL MEMORY COMPENSATION AND GENERATION

Since the warped feature map contains ambiguities caused by large head motion or occlusion, we propose to inpaint those ambiguous regions by compensating the warped facial features. To this end, we design a memory compensation module (MCM, see Fig. 4) to refine the warped feature $F_w^i$ with the learned scale-aware facial memory bank.

Warped facial feature projection. To better maintain the identity information of the source image while compensating the warped feature, we employ a channel-split strategy that splits the warped feature $F_w^i$ into two parts along the channel dimension: $F_w^{i,0}$ and $F_w^{i,1}$. The first half of the channels, $F_w^{i,0}$, passes through directly to contribute to identity preservation, while the remaining half, $F_w^{i,1}$, is modulated by the learned scale-aware memory bank $M_s$ to refine the ambiguities. After splitting, we employ a $1 \times 1$ convolution layer on $F_w^{i,1}$ to change the channel number, resulting in a projected feature $F_{proj}^i$.

Warped facial feature compensation. To compensate the feature map spatially, we adopt a dynamic cross-attention mechanism. Specifically, we employ the scale-aware memory to produce the Key $F_K^i$ and Value $F_V^i$ via two dynamic convolution layers (i.e. $f_{dc}^1$, $f_{dc}^2$) conditioned on the projected feature $F_{proj}^i$. In this way, the generated Key and Value are identity-dependent and capable of providing useful context information. Meanwhile, we perform a non-linear projection to map $F_{proj}^i$ into a query feature $F_Q^i$ via a $1 \times 1$ convolution layer followed by a ReLU layer. Then, we perform cross attention to reconstruct a more robust feature $F_{ca}^i$ as:

$F_{ca}^i = FC_{1\times1}\big(\mathrm{Softmax}\big((F_Q^i)^{T} \times F_K^i\big) \times F_V^i\big)$, (3)

where $\mathrm{Softmax}$ denotes the softmax operator, $FC_{1\times1}$ is a $1 \times 1$ convolution layer that changes the channel number of the cross-attention output, and $\times$ denotes matrix multiplication. As shown in Fig. 4, to maintain the identity of the source image, we concatenate the cross-attention feature $F_{ca}^i$ with the first half of the channels $F_w^{i,0}$:

$F_{cpt}^i = \mathrm{Concat}[F_{ca}^i, F_w^{i,0}]$, (4)

where $\mathrm{Concat}[\cdot,\cdot]$ represents the concatenation operation. As a result, the final output feature $F_{cpt}^i$ enjoys the benefits of directly incorporating learned facial prior information (Wang et al., 2021c) from the memory and of effective modulation by the dynamic cross-attention mechanism.

Regularization on consistency. To learn the global and most generic spatial facial appearance and structure representations from the input faces, the learning of the meta memory needs to be constrained by every single image in the training data. Simply but effectively, we enforce consistency between the projected feature $F_{proj}^i$ from the current training face image and the value feature $F_V^i$ derived from the global meta memory:

$L_{con} = \| F_V^i - \mathrm{de}(F_{proj}^i) \|_1$, (5)

where $\mathrm{de}(\cdot)$ indicates a gradient-detach function and $\|\cdot\|_1$ is the $L_1$ loss. This regularization enforces consistency on the learning of the global meta memory while not affecting the learning of the source image features, guaranteeing the training stability of the overall generation framework. The above equation also makes sure that the optimization gradients from all the face images during the training stage jointly contribute to the updating of the memory bank, so it can capture global facial appearance and structure representations for transfer.

Multi-layer generation. A higher-resolution feature map contains more facial details, while a lower-resolution one contains more semantic information. Following TPSM (Zhao & Zhang, 2022), we perform memory compensation for the feature maps of every layer to preserve facial details. As shown in Fig. 2, we utilize the motion flow $\mathcal{A}_{S\leftarrow D}$ to warp the encoded features $\{F_e^i\}_{i=1}^{N}$ of each layer to produce the warped features $\{F_w^i\}_{i=1}^{N}$.
For each warped feature $F_w^i$, we feed it into our designed ISCM and MCM modules sequentially to produce the compensated features $\{F_{cpt}^i\}_{i=1}^{N}$. In the decoding process, we treat $F_{cpt}^1$ as $F_d^1$, and $F_d^2$ is generated from $F_d^1$ through an upsampling layer. At the $i$-th level ($i > 1$), the output compensated feature $F_{cpt}^i$ is concatenated with the decoded feature $F_d^i$ for further decoding.
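The channel split and cross-attention compensation of the MCM can be sketched as follows. This is a simplified stand-in: features are flattened to (channels, positions) matrices, the dynamic convolutions producing the Key and Value are replaced by precomputed matrices, and `Wq` stands in for the 1x1 query projection; none of these names come from the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_compensate(F_w, F_K, F_V, Wq):
    """Sketch of the MCM: keep the first half of the warped feature's
    channels for identity, cross-attend the second half against Key/Value
    derived from the scale-aware memory, then concatenate.
    F_w: (C, N) with N flattened spatial positions; F_K, F_V: (Ck, M)
    memory Key/Value over M memory positions; Wq: (Ck, C // 2)."""
    C = F_w.shape[0]
    F0, F1 = F_w[: C // 2], F_w[C // 2:]           # channel split
    F_Q = np.maximum(0.0, Wq @ F1)                 # 1x1 conv + ReLU as a matmul
    attn = softmax(F_Q.T @ F_K, axis=-1)           # (N, M) attention weights
    F_ca = (attn @ F_V.T).T                        # compensated half (Ck, N)
    return np.concatenate([F_ca, F0], axis=0)      # F_cpt

C, N, Ck, M = 8, 16, 6, 10
F_cpt = memory_compensate(np.random.rand(C, N), np.random.rand(Ck, M),
                          np.random.rand(Ck, M), np.random.rand(Ck, C // 2))
```

Each spatial position of the warped feature thus reads a convex combination of memory entries, which is how the global facial prior is injected locally.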

3.4. TRAINING

Loss objectives. We train the proposed MCNet by minimizing the following losses:

$L = \lambda_P L_P + \lambda_{eq} L_{eq} + \lambda_{dist} L_{dist} + \lambda_{con} L_{con}$, (6)

where $\lambda_P$, $\lambda_{eq}$, $\lambda_{dist}$ and $\lambda_{con}$ are hyper-parameters that balance the learning of these losses. Per FOMM (Siarohin et al., 2019), we leverage the perceptual loss $L_P$ to minimize the gap between the model output and the driving image, and the equivariance loss $L_{eq}$ to learn more stable keypoints. Additionally, we adopt the keypoint distance loss $L_{dist}$ (Hong et al., 2022) to keep the detected keypoints from crowding around a small neighbourhood. $L_{con}$ is the consistency loss in Eq. 5. The details of these losses are described in the Appendix.
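The total objective is a plain weighted sum of the four terms; as a sketch, with the default weights taken from the implementation details in the Appendix (all four set to 10):

```python
def total_loss(l_p, l_eq, l_dist, l_con,
               lam_p=10.0, lam_eq=10.0, lam_dist=10.0, lam_con=10.0):
    """Weighted sum of the four training losses; the default weights
    follow the hyper-parameters reported in the Appendix."""
    return lam_p * l_p + lam_eq * l_eq + lam_dist * l_dist + lam_con * l_con
```

In practice each `l_*` would be a scalar tensor produced by the corresponding loss module, and the sum is what gradients are taken from.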

4. EXPERIMENTS

In this section, we present quantitative and qualitative experiments to validate the performance of our MCNet. More details (i.e. additional results and implementation) are included in the Appendix.

4.1. DATASETS AND METRICS

Datasets. In this work, we mainly evaluate our MCNet on two talking head generation datasets, i.e. the VoxCeleb1 (Nagrani et al., 2017) and CelebV (Wu et al., 2018) datasets. We follow the test-set sampling strategy of DaGAN (Hong et al., 2022) for evaluation. Following DaGAN, to verify the generalization ability, we apply the model trained on VoxCeleb1 to test on CelebV.

Metrics. We adopt the structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and $L_1$ distance to measure the low-level similarity between the generated image and the driving image. Following previous works (Siarohin et al., 2019), we utilize the Average Euclidean Distance (AED) to measure identity preservation, and the Average Keypoint Distance (AKD) to evaluate whether the motion of the input driving image is preserved. We also adopt AUCON and PRMSE, similar to Hong et al. (2022), to evaluate the expression and head poses in cross-identity reenactment.
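For reference, the two low-level metrics can be computed as follows. These are the standard definitions, not tied to any particular evaluation code; images are assumed to be float arrays in [0, 1].

```python
import numpy as np

def psnr(img1, img2, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in
    [0, max_val]; higher is better."""
    mse = np.mean((img1 - img2) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def l1_distance(img1, img2):
    """Mean absolute pixel difference, the L1 metric; lower is better."""
    return np.mean(np.abs(img1 - img2))

a = np.zeros((32, 32, 3))
b = np.full((32, 32, 3), 0.1)   # uniform 0.1 offset -> MSE = 0.01
```

A uniform 0.1 error gives an MSE of 0.01 and therefore a PSNR of 20 dB, which is a useful sanity check for metric implementations.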

4.2. COMPARISON WITH STATE-OF-THE-ART METHODS

Same-identity reenactment. In Table 1(a), we first compare the synthesised results in the setup where the source and driving images share the same identity. Our MCNet obtains the best results compared with the other competitive methods. Specifically, compared with FOMM (Siarohin et al., 2019) and DaGAN (Hong et al., 2022), which adopt the same motion estimation method as ours, our method produces higher-quality images (72.3% for FOMM vs. 82.5% for ours, a 10.2-point improvement on the SSIM metric), which verifies that introducing the global memory mechanism can indeed benefit the image quality of the generation process. Regarding motion animation and identity preservation, our MCNet also achieves the best results (i.e. 1.203 on AKD and 0.106 on AED), showing superior performance on talking head animation. A representative excerpt of Table 1 (SSIM %, PSNR, LPIPS, $L_1$, AKD, AED on VoxCeleb1; AUCON↑/PRMSE↓ on VoxCeleb1 and CelebV):

X2face (Wiles et al., 2018): 71.9, 22.54, -, 0.0780, 7.687, 0.405; AUCON/PRMSE: -/- (VoxCeleb1), 0.679/3.62 (CelebV)
marioNETte (Ha et al., 2020): 75.5, 23.24; AUCON/PRMSE: 0.710/3.41 (CelebV)
FOMM (Siarohin et al., 2019): 72.3, 30.39, 0.199, 0.0430, 1.294, 0.140; AUCON/PRMSE: 0.882/2.824 (VoxCeleb1), 0.667/3.90 (CelebV)
MeshG (Yao et al., 2020): 73.9, 30.39; AUCON/PRMSE: 0.709/3.41 (CelebV)
face-vid2vid (Wang et al., 2021b): 76

Moreover, we show several samples in Fig. 5(a); they contain large motions (the first, third, and last rows) and object occlusion (the second row). From Fig. 5(a), our model can effectively handle these complex cases and produces more complete generations than the state-of-the-art competitors.

Cross-identity reenactment. We also perform experiments on the VoxCeleb1 and CelebV datasets for cross-identity face motion animation, in which the source and driving images come from different people. The comparison results are reported in Table 1. Our MCNet outperforms all the other compared methods.
Regarding head pose imitation, our MCNet produces faces with more accurate head poses (i.e. 2.641 and 2.10 on the PRMSE metric for VoxCeleb1 and CelebV, respectively). We also present several sample results on the VoxCeleb1 dataset in Fig. 5(b). It is clear that our MCNet mimics the facial expression better than the other methods, such as the smiling countenance shown in the first row. For the unseen persons in the CelebV dataset, e.g. the last two rows in Fig. 5(b), our method still produces a more natural generation, while the results of the other methods contain more obvious artifacts. All of these results verify that the features compensated by our learned memory lead to better results.

4.3. ABLATION STUDY

In this section, we perform ablation studies to demonstrate the effectiveness of the proposed implicit scale conditioned memory module (ISCM) and memory compensation module (MCM). We report the quantitative results in Table 2 and the qualitative results in Fig. 6. Our baseline is the model without the ISCM and MCM modules. "Baseline + MCM" means that we drop the ISCM module and replace the scale-aware memory $M_s$ with the meta memory $M_o$ in Fig. 4.

Meta memory learning. We first visualize the learned meta memory, which aims to capture the global and generic facial appearance and structure representations, in Fig. 7, showing a subset of its channels. It can be observed that these channels represent faces with different appearances, structures, poses, and shapes, which are very informative and clearly beneficial for facial compensation and generation, confirming our motivation of learning global facial representations to tackle ambiguities in talking head generation.

Table 2: Ablation studies. "Baseline" indicates the simplest model without the implicit scale conditioned memory module (ISCM) and the memory compensation module (MCM). "MCM w/o Eq. 4" means that we project the entire warped feature into a projected feature $F_{proj}^i$ and remove the concatenation of Eq. 4, making the output of the cross-attention the compensated feature $F_{cpt}^i$.

Memory compensation

In Table 2 and Fig. 6, the proposed memory compensation module effectively improves the generation quality of human faces. From Tab. 2, we observe that adding the memory compensation module (MCM) consistently boosts the performance, comparing "Baseline + MCM" against "Baseline" (82.3% vs. 81.1% on SSIM). In Fig. 6, we also see that the variant "Baseline + MCM" compensates the warped image better than the "Baseline", e.g. the face shape in the second row and the mouth shape in the third row. Additionally, we conduct an ablation study to verify the feature channel-split strategy discussed in Sec. 3.3. The results of "Baseline + MCM w/o Eq. 4" show that the channel split can slightly improve the performance. All these results demonstrate that learning a global facial memory can indeed effectively compensate the warped facial features to produce higher-fidelity results for talking head generation.

Scale-aware memory learning. To verify the effectiveness of the implicit scale conditioned memory module (i.e. the ISCM introduced in Sec. 3.2), we show randomly sampled channels of the scale-aware memory in Fig. 6. As shown in the last column of Fig. 6, the ISCM produces an identity-dependent scale-aware memory bank, which has structural and scale relations with the input source images. By deploying the ISCM, our MCNet produces highly realistic-looking images compared with "Baseline + MCM", verifying that the learned scale-aware memory conditioned on the input source provides better compensation of the source feature for more vivid generation.

Generalization experiment. Importantly, we also insert the proposed MCM and ISCM modules into FOMM (Siarohin et al., 2019) and TPSM (Zhao & Zhang, 2022) to verify that our designed memory mechanism can be flexibly generalized to existing talking head models.
As shown in Table 2, TPSM, which uses a different motion estimation method from ours, achieves a stable improvement when deployed with our proposed memory modules. "FOMM+ISCM+MCM" also gains a significant improvement on SSIM compared with the pioneering work FOMM. These results demonstrate the transferability and generalization capability of the proposed method.

5. CONCLUSION

In this paper, we present an implicit scale conditioned memory compensation network (MCNet) that learns a global facial prior of spatial structure and appearance to address the ambiguity caused by dynamic motion in the talking head video generation task. MCNet utilizes the designed implicit scale conditioned memory module to learn a scale-aware memory for each sample, which is then used to compensate the feature maps in the memory compensation module. Ablation studies clearly show the effectiveness of learning a global facial meta memory for talking head video generation. Our MCNet also produces more natural-looking results than the state-of-the-art on all benchmarks.

APPENDIX A REPRODUCIBILITY

In this work, we utilize the PyTorch framework to implement our method. We develop our code based on the FOMM codebase. All experiments in this work are conducted on publicly available datasets. The hyperparameters necessary for reproducing our experiments are reported in "Implementation Details".

B TRAINING DETAILS AND ADDITIONAL NETWORK

B.1 IMPLEMENTATION DETAILS

The keypoint estimator and dense motion network are borrowed from FOMM (Siarohin et al., 2019). We extract each frame from the driving video as a driving image and input it into the MCNet model together with the source image. The source image and driving video share the same identity in the training stage, so the ground-truth is the driving frame during training. To optimize the training objectives, we set λ_rec = 10, λ_eq = 10, λ_dist = 10 and λ_con = 10. The number of keypoints in this work is 15, the same as in DaGAN (Hong et al., 2022). In the training stage, we employ 8 RTX 3090 GPUs to train the model for 100 epochs in an end-to-end manner, which takes about 12 hours in total. The number of encoder and decoder layers N is set to 4, and the number of keypoints K is 15 as in Hong et al. (2022). We set the size of the meta memory to 512 × 32 × 32, where C_m = 512, H_m = 32 and W_m = 32. In the warping process, for those features F_e^i whose spatial size differs from that of the motion flow, we employ bilinear interpolation to adjust the spatial size of the motion flow.
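Under the hyperparameters above, the global meta memory can be sketched as a single learnable tensor shared across all samples. This is a minimal sketch with our own variable names, not code from the released implementation.

```python
import torch
import torch.nn as nn

# Hyperparameters reported in "Implementation Details" (constant names are ours).
NUM_KEYPOINTS = 15                                    # K, following DaGAN
NUM_LAYERS = 4                                        # N encoder/decoder levels
LOSS_WEIGHTS = dict(rec=10.0, eq=10.0, dist=10.0, con=10.0)

class MetaMemory(nn.Module):
    """Global facial meta memory: a learnable C_m x H_m x W_m tensor
    (512 x 32 x 32 in the paper) shared by all training samples."""
    def __init__(self, c: int = 512, h: int = 32, w: int = 32):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(1, c, h, w) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # broadcast the single shared memory to the whole batch
        return self.bank.expand(batch_size, -1, -1, -1)
```

Because the bank is an `nn.Parameter`, it is optimized jointly with the rest of the network and thus accumulates a facial prior from all training identities.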

B.2 LOSS DETAILS

Perceptual Loss L_p. Perceptual loss is a popular objective function in image generation tasks. As introduced in DaGAN (Hong et al., 2022), the generated image and its ground-truth (i.e. the driving image in the training stage) are downsampled to 4 different resolutions (256 × 256, 128 × 128, 64 × 64, and 32 × 32). We then utilize a pre-trained VGG network (Simonyan & Zisserman, 2014) to extract features from each resolution. Denoting by R_1, R_2, R_3, R_4 the features of the generated image at the different resolutions, and by G_1, G_2, G_3, G_4 those of the ground-truth, we measure the L_1 distance between them as the perceptual loss:

L_p = Σ_{i=1}^{4} L_1(G_i, R_i).

Equivariance Loss L_eq. We employ this loss to maintain the consistency of the estimated keypoints under different augmentations. Following FOMM (Siarohin et al., 2019), given an image I and its detected keypoints {X_i}_{i=1}^{K} (X_i ∈ R^{1×2}), we perform a known spatial transformation T on the image and the keypoints, resulting in a transformed image I_T and transformed keypoints {X_i^T}_{i=1}^{K}. We then detect the keypoints on the transformed image I_T, denoted {X_{I_T,i}}_{i=1}^{K}. The equivariance loss, applied to both the source and the driving image, is:

L_eq = Σ_{i=1}^{K} ||X_i^T - X_{I_T,i}||_1. (8)

Keypoint distance loss L_dist. We employ the keypoint distance loss as in Hong et al. (2022) to penalize the model if the distance between any two keypoints falls below a user-defined threshold, which keeps the keypoints from crowding into a small neighbourhood. In one image, for every two keypoints X_i and X_j, we have:

L_dist = Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} (1 - sign(||X_i - X_j||_1 - α)), (9)

Positional Encoding for keypoints. Positional encoding has shown strong effectiveness in NeRF-style methods (Mildenhall et al., 2020; Martin-Brualla et al., 2021; Pumarola et al., 2021). Therefore, we consider applying the positional encoding function to our keypoints when producing the scale-aware memory. We show the results in Tab. 3. From Tab. 3, we observe that applying the positional encoding function to the keypoints does not bring improvements, and even degrades the model if we set L to 20. Since the keypoints are utilized to estimate the motion flow in the dense motion network, the Euclidean distance between any two keypoints is physically meaningful. We therefore suppose that employing positional encoding on the keypoints may affect the motion flow estimation, resulting in worse generation. The input element in ISCM. We also conduct experiments to investigate the usage of the intermediate feature F_proj^i ("ISCM w/o F_proj^i") and the keypoints ("ISCM w/o keypoints"). The results in Tab. 4 indicate that these two inputs are equally crucial for generating the scale-aware memory bank; we obtain the best results when combining them. Single layer vs. multiple layers. In our work, we deploy the ISCM and MCM at each layer to obtain the best results. We also investigate the performance of using the ISCM and MCM in the first layer only. The results of "MCNet (single layer)" show that a single layer can also obtain similarly good results, which verifies the effectiveness of our designed memory mechanism. The dynamic convolution in MCM. Besides, we conduct an ablation study on the dynamic convolution layers in the memory compensation module (see Fig. 4). We observe that the dynamic convolution layers contribute to the final performance, especially on AKD and AED. More datasets for evaluation. To fully verify the superiority of our method, we also compare it with other methods on two other large datasets, i.e. VoxCeleb2 (Chung et al., 2018) and HDTF (Zhang et al., 2021). We report the results in Tab. 5 and Tab. 6. From these two tables, we observe that our method still obtains the best results compared with other SOTA methods with officially released code. This strongly illustrates the superiority of our designed method. Identity Preservation.
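The NeRF-style positional encoding examined in Tab. 3 can be sketched as follows. This is a minimal NumPy sketch of the standard gamma(p) mapping (the paper uses the nerf-pytorch implementation); the function name is ours.

```python
import numpy as np

def positional_encoding(p: np.ndarray, L: int = 10) -> np.ndarray:
    """NeRF-style positional encoding gamma(p). L is the output dimension
    control factor varied as pe(10)/pe(20) in Tab. 3.
    p: (..., D) keypoint coordinates; returns (..., D * (1 + 2L))."""
    feats = [p]
    for k in range(L):
        feats.append(np.sin(2.0 ** k * np.pi * p))  # sin(2^k * pi * p)
        feats.append(np.cos(2.0 ** k * np.pi * p))  # cos(2^k * pi * p)
    return np.concatenate(feats, axis=-1)
```

For K = 15 keypoints in R^2 and L = 10, each keypoint expands from 2 to 42 dimensions; the ablation in Tab. 3 suggests this expansion is not helpful here because raw keypoint distances are physically meaningful for motion flow estimation.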
In this experiment, we reorganized the VoxCeleb1 dataset and divided it into a training set and a test set that share the same identity space; that is, the identities of the test videos also appear in the training videos. We select 500 videos as the test set and use the rest as the training set. The experimental results are shown in Tab. 7. We observe that our method obtains higher performance when the testing identities are part of the training corpus. One possible reason is that our global memory is learned from the identities in the training set, so it can better compensate for the facial details of these seen identities. Video generation demo. We also provide several video generation demos for a more detailed qualitative comparison with the most competitive methods in the literature. From the demo videos, we can observe that our model compensates regions that do not appear in the source image better than other methods (e.g. the ear region in demo2). These demos are attached in the Supplementary Material. Comparison in other domains. To better verify the generalization ability of our method, we also train it on the TED-talks dataset (Siarohin et al., 2021), because the human body is also symmetric and highly structured. We report the results in Tab. 8. Our method still obtains the best results among all compared methods. This generalization experiment verifies that our memory can learn symmetric and structured objects to inpaint the generated image. Meta memory visualization. We show all channels of our learned meta memory in Fig. 9 for better understanding, and some channels in high resolution in Fig. 10. These visualizations demonstrate the meaningful facial priors learnt in the meta memory. More qualitative ablation studies.
To better show that our designed implicit scale conditioned memory module brings improvement to our model, we illustrate more qualitative results for ablation studies in Fig. 11 and Fig. 12 .



Footnotes:
1. https://github.com/AliaksandrSiarohin/first-order-model
2. https://py-feat.org
3. Here, we use the implementation of https://github.com/yenchenlin/nerf-pytorch
4. These compared methods have officially released code for us to test on these two datasets.



Figure 1: MCNet animation illustration. MCNet first learns the motion flow (c) between the source and the driving images. (d) illustrates possible occlusions or deformations caused by large motion, produced by directly warping the source image with the motion provided by the driving images. (e) shows randomly sampled memory channels of our learned scale-aware memory bank. We also present examples of generated results of our method in (f).

Figure 2: An illustration of the proposed MCNet, which contains two designed modules to compensate the facial feature map: (i) The implicit scale conditioned memory module (ISCM) learns the scale information from the input source, utilizing the warped feature map and the keypoints of the source image, to produce an implicit scale representation S, which conditions the meta memory bank M_o to learn a scale-aware memory M_s. (ii) The memory compensation module (MCM) adopts a dynamic cross-attention mechanism to compensate the warped feature map spatially.

Figure 3: The illustration of the proposed implicit scale conditioned memory module (ISCM). The symbol c⃝ denotes the concatenation operation, and "GAP" and "Conv" represent global average pooling and a convolution layer, respectively. The detailed generation of the projected feature F_proj^i is shown in Fig. 4. C_i denotes the channel number of the i-th level warped feature F_w^i.
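The ISCM pipeline in Fig. 3 can be sketched as follows. This is a simplified sketch under our own assumptions: a single linear layer stands in for the convolution after pooling, sigmoid gating stands in for the exact conditioning, and all layer names are ours.

```python
import torch
import torch.nn as nn

class ISCMSketch(nn.Module):
    """Minimal sketch of the ISCM idea: pool the projected source feature
    ("GAP" in Fig. 3), concatenate it with the flattened keypoints, and
    predict per-channel scales that modulate the meta memory."""
    def __init__(self, c_feat: int, num_kp: int = 15, c_mem: int = 512):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                  # "GAP" in Fig. 3
        self.to_scale = nn.Linear(c_feat + num_kp * 2, c_mem)

    def forward(self, f_proj, keypoints, memory):
        # f_proj: (B, C_i, H, W); keypoints: (B, K, 2); memory: (B, C_m, H_m, W_m)
        pooled = self.gap(f_proj).flatten(1)                # (B, C_i)
        s = self.to_scale(torch.cat([pooled, keypoints.flatten(1)], dim=1))
        # per-channel gating of the meta memory -> scale-aware memory
        return memory * s.sigmoid()[:, :, None, None]
```

The key design point is that the scale representation S is *implicit*: it is regressed from the pooled feature and the discrete keypoints rather than from any explicit face-size measurement.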

Figure 4: The illustration of the memory compensation module (MCM). The symbol ⊗ denotes matrix multiplication, and f_dc^1 and f_dc^2 are dynamic convolution layers (Chen et al., 2020), whose kernel weights are estimated by the projected feature F_proj^i. The c⃝ represents the concatenation operation, and "Conv" denotes a convolution layer. C_i is the channel number of the i-th level feature in our autoencoder framework, while C_m is the channel number of the memory bank.
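The dynamic cross-attention in Fig. 4 can be sketched as below. This is a simplified sketch: we substitute ordinary 1×1 convolutions for the dynamic convolutions f_dc^1 and f_dc^2 (whose kernels in the paper depend on F_proj^i), and all layer names are ours.

```python
import torch
import torch.nn as nn

class MCMSketch(nn.Module):
    """Simplified memory compensation via cross-attention: queries come
    from the warped feature, keys/values from the scale-aware memory."""
    def __init__(self, c_feat: int, c_mem: int):
        super().__init__()
        self.q = nn.Conv2d(c_feat, c_mem, 1)
        self.k = nn.Conv2d(c_mem, c_mem, 1)   # stand-in for dynamic conv f_dc^1
        self.v = nn.Conv2d(c_mem, c_mem, 1)   # stand-in for dynamic conv f_dc^2
        self.out = nn.Conv2d(c_feat + c_mem, c_feat, 1)  # "Conv" after concat

    def forward(self, f_w, m_s):
        # f_w: (B, C_i, H, W) warped feature; m_s: (B, C_m, H', W') memory
        b, _, h, w = f_w.shape
        q = self.q(f_w).flatten(2).transpose(1, 2)        # (B, HW, C_m)
        k = self.k(m_s).flatten(2)                        # (B, C_m, H'W')
        v = self.v(m_s).flatten(2).transpose(1, 2)        # (B, H'W', C_m)
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)
        comp = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(torch.cat([f_w, comp], dim=1))    # compensated feature
```

Each spatial location of the warped feature thus attends over the whole memory bank, letting the global facial prior fill in occluded or ambiguous regions.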

Figure 5: Qualitative comparisons of (a) same-identity reenactment and (b) cross-identity reenactment on the VoxCeleb1 (first two rows) and CelebV datasets (last two rows). Our method shows higher-fidelity generation compared to state-of-the-art methods. Zoom in for best view.

Figure 6: Qualitative ablation studies. The memory compensation module (MCM) and the implicit scale conditioned memory module (ISCM) both yield improvements. The last column verifies that our ISCM learns different scale-aware memories for samples at different scales.

Table 1: Comparisons with state-of-the-art methods on (a) same-identity reenactment on VoxCeleb1 (first six metric columns) and (b) cross-identity reenactment (last four columns: AUCON/PRMSE on VoxCeleb1, then AUCON/PRMSE on CelebV).

| Model | SSIM (%) ↑ | PSNR ↑ | LPIPS ↓ | L1 ↓ | AKD ↓ | AED ↓ | AUCON ↑ | PRMSE ↓ | AUCON ↑ | PRMSE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| X2face (Wiles et al., 2018) | 71.9 | 22.54 | - | 0.0780 | 7.687 | 0.405 | - | - | 0.679 | 3.62 |
| marioNETte (Ha et al., 2020) | 75.5 | 23.24 | - | - | - | - | - | - | 0.710 | 3.41 |
| FOMM (Siarohin et al., 2019) | 72.3 | 30.39 | 0.199 | 0.0430 | 1.294 | 0.140 | 0.882 | 2.824 | 0.667 | 3.90 |
| MeshG (Yao et al., 2020) | 73.9 | 30.39 | - | - | - | - | - | - | 0.709 | 3.41 |
| face-vid2vid (Wang et al., 2021b) | 76.1 | 30.69 | 0.212 | 0.0430 | 1.620 | 0.153 | 0.839 | 4.398 | 0.805 | 3.15 |
| MRAA (Siarohin et al., 2021) | 80.0 | 31.39 | 0.195 | 0.0375 | 1.296 | 0.125 | 0.882 | 2.751 | 0.840 | 2.46 |
| DaGAN (Hong et al., 2022) | 80.4 | 31.22 | 0.185 | 0.0360 | 1.279 | 0.117 | 0.888 | 2.822 | 0.873 | 2.33 |
| TPSM (Zhao & Zhang, 2022) | 81.6 | 31.43 | 0.179 | 0.0365 | 1.233 | 0.119 | 0.894 | 2.756 | 0.882 | 2.23 |
| MCNet (Ours) | 82.5 | 31.94 | 0.174 | 0.0331 | 1.203 | 0.106 | 0.895 | 2.641 | 0.885 | 2.10 |

Figure 7: The visualization of randomly selected channels of the meta memory M_o. We can observe that our meta memory learns very diverse global facial representations. Zoom in for best view.

Figure 9: The visualization of randomly selected channels of the meta memory M_o.

Table 3: The results of applying the positional encoding function on keypoints. "pe(10)" means the output dimension control factor L of the positional encoding function is set to 10, and similarly 20 for "pe(20)".

Table 4: Ablation studies. "ISCM w/o F_proj^i" and "ISCM w/o keypoints" indicate that the ISCM does not use the projected feature F_proj^i or the keypoints as input (see Fig. 3), respectively, to encode the implicit scale representation. "MCM w/o f_dc^1, f_dc^2" indicates that we replace f_dc^1 and f_dc^2 with two normal convolution layers to produce Key and Value in the MCM.

Table 5: Comparison results on VoxCeleb2.

Table 6: Comparison results on the HDTF dataset.


Table 8: Comparison results on the TED-talks dataset.


where the sign(•) represents a sign function and the α is the threshold of distance, which is 0.2 in this work.
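Based on this description, one plausible implementation of the keypoint distance loss is sketched below. The exact form in DaGAN (Hong et al., 2022) may differ, and the function name is ours; the sketch only reproduces the stated behaviour of penalizing keypoint pairs closer than the threshold α = 0.2.

```python
import torch

def keypoint_distance_loss(kp: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Penalize keypoint pairs whose L1 distance falls below alpha.
    kp: (B, K, 2) keypoint coordinates."""
    # pairwise L1 distances between all keypoints, shape (B, K, K)
    dist = (kp[:, :, None, :] - kp[:, None, :, :]).abs().sum(-1)
    # sign(dist - alpha) is -1 for crowded pairs, so this term is 1 for them
    penalty = (1.0 - torch.sign(dist - alpha)) / 2.0
    # remove the diagonal (distance of each keypoint to itself is 0)
    penalty = penalty - torch.diag_embed(torch.diagonal(penalty, dim1=1, dim2=2))
    return penalty.sum(dim=(1, 2)).mean()
```

Because the sign term fires only when two keypoints crowd into an α-neighbourhood, minimizing this loss spreads the detected keypoints over the face.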

B.3 NETWORK ARCHITECTURE DETAILS OF MCNET

The keypoint detector receives an image as input and outputs the K keypoints {x_i, y_i}_{i=1}^{K}. The structure of the keypoint detector is illustrated in Fig. 8. We adopt the Taylor approximation, as in FOMM (Siarohin et al., 2019) and DaGAN (Hong et al., 2022), to compute the motion flow. The motion estimation is therefore not our contribution; we mainly focus on our meta memory.
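The warping step described in the implementation details (resizing the motion flow with bilinear interpolation before warping each encoder feature) might be sketched as follows in PyTorch. The function name and the assumption that the flow is a normalized sampling grid in [-1, 1] are ours.

```python
import torch
import torch.nn.functional as F

def warp_feature(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp an encoder feature map with a motion flow given as a sampling
    grid in [-1, 1] of shape (B, H', W', 2). When the spatial sizes differ,
    the flow is resized with bilinear interpolation, as in the paper."""
    b, _, h, w = feat.shape
    if flow.shape[1:3] != (h, w):
        # move the coordinate channel first, resize, then move it back
        flow = F.interpolate(flow.permute(0, 3, 1, 2), size=(h, w),
                             mode="bilinear", align_corners=True)
        flow = flow.permute(0, 2, 3, 1)
    return F.grid_sample(feat, flow, align_corners=True)
```

Resizing the flow (rather than the feature) lets one dense motion prediction serve every encoder level of the autoencoder.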

C EXPERIMENTS

C.1 METRICS

We mainly introduce four important metrics in the talking head generation field, i.e., AED, AKD, PRMSE, and AUCON.

Average Euclidean distance (AED). AED measures identity preservation in the reconstructed video/image. In this work, we use OpenFace (Baltrušaitis et al., 2016) to extract identity embeddings from reconstructed-face and ground-truth-frame pairs, and employ the MSE loss to measure their difference.

Average keypoint distance (AKD). AKD evaluates the difference between the landmarks of the reconstructed faces and the ground-truth frames. We extract facial landmarks using the face alignment method (Bulat & Tzimiropoulos, 2017) and compute the average distance between corresponding keypoints. Thus, AKD mainly measures the ability of pose imitation.

The root mean square error of the head pose angles (PRMSE). We utilize the Py-Feat toolkit to detect the Euler angles of the head pose, and then evaluate the pose difference between different identities.

The ratio of identical facial action unit values (AUCON). We first utilize the Py-Feat toolkit to detect the action units of the generated face and the driving face, and then calculate the ratio of identical facial action unit values as the AUCON metric.
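For illustration, the AKD and AUCON computations described above can be sketched as follows. This is a simplified sketch: in the paper the landmarks come from a face-alignment model and the action units from Py-Feat, both of which we omit here; the function names are ours.

```python
import numpy as np

def akd(pred_kp: np.ndarray, gt_kp: np.ndarray) -> float:
    """Average keypoint distance between predicted and ground-truth
    landmarks of shape (N, K, 2); lower is better."""
    return float(np.linalg.norm(pred_kp - gt_kp, axis=-1).mean())

def aucon(pred_au: np.ndarray, gt_au: np.ndarray) -> float:
    """Ratio of identical (binarized) facial action unit values between
    the generated and driving faces; higher is better."""
    return float((pred_au == gt_au).mean())
```

AED would follow the same pattern as `akd` but over identity embeddings with an MSE distance instead of landmark coordinates.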

C.2 OTHER EXPERIMENTAL RESULTS

Positional Encoding for keypoints. The positional encoding method has shown its strong power in transformers (Vaswani et al., 2017; Liu et al., 2021b; Dosovitskiy et al., 2020) and NeRF (Mildenhall et al., 2020).

(Figure column labels: Source, Driving, Baseline, Baseline + MCM, Ours.)

