FEATURE-DRIVEN TALKING FACE GENERATION WITH STYLEGAN2

Abstract

In this work, we aim to generate a natural and realistic talking face animation video from a single face image. This is not an easy task, because facial appearance variation and the semantics of speech are coupled together in the subtle movements of a talking face. Audio features sometimes contain information about expression, but they are not accurate enough, so a single audio feature cannot fully represent the movement of the face. For this reason, we use several different features to generate talking faces. The StyleGAN series shows good performance in image processing and performs portrait style-transfer tasks very well, and we find that StyleGAN can be used as a talking face generator. At the same time, we encode and extract non-identity and non-lip features, and try to discover the subtle relationship between these features and the talking face. We also use quantitative evaluation and an ablation study to measure the quality of the generated videos and to examine whether our approach is effective and feasible.

1. INTRODUCTION

Using multimedia to realize interaction between virtual characters and users is one application of AI technology. Audio is often easier to obtain than video, and the task of generating an animated video from audio and a single face image has recently attracted more and more researchers. Solving this task is essential for a wide range of practical applications, such as virtual character interaction, dubbing videos into other languages, conference videos, role-playing games, and so on. Graphics-based face animation generation methods often require a complete original video sequence as input (Liu & Ostermann, 2011; Garrido et al., 2015; Suwajanakorn et al., 2017). Fried et al. (2019) proposed a method to edit a talking-head video based on its transcript to produce a realistic output video. However, their method requires a retimed background video as input, and it takes about 1 hour of video to produce the best-quality results. There are also models that take facial motion into account and use landmarks to drive it (Wang et al., 2019; Zakharov et al., 2019; Gu et al., 2020). Many recent methods generate facial animation directly from audio (Jamaludin et al., 2019). Prajwal et al. (2020) use a pretrained lip-sync model inside a facial-animation generation system to obtain lip synchronization. In human-to-human communication, speech inevitably involves lip movements; that is, a speaker's lip movements and speech are closely related. In speech recognition and speaker recognition, the most commonly used speech features are Mel-scale Frequency Cepstral Coefficients (MFCC). By using MFCC to obtain speech-related features and relating them to the lip movements made while a person speaks, it is possible to generate mouth animation from speech.
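As a rough illustration of the feature extraction discussed above, the following sketch computes MFCC features from a raw waveform using only NumPy and SciPy. The frame sizes, number of mel filters, and number of cepstral coefficients are common illustrative defaults, not values prescribed in this paper.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, hop=160, win=400, n_mels=26, n_ceps=13):
    # Pre-emphasis boosts high frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, with filter centers equally spaced on the mel scale.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # Log mel energies, then a DCT to decorrelate -> cepstral coefficients.
    feat = np.log(power @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

With these defaults, one second of 16 kHz audio yields 98 frames of 13 coefficients each, which is the kind of frame-level feature sequence that audio-driven methods align with video frames.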
However, it is not possible to obtain all the information about facial motion from audio alone. To make the generated face more natural, additional features are needed. While there are many ways to generate talking faces driven by speech, we propose a method that uses StyleGAN2 to generate the animation. At the same time, considering non-lip-related motion, we try to extract features other than identity and lip features, and combine the audio features with these facial features in the hope of obtaining a more realistic facial animation. The rest of the paper is organized as follows: we survey recent developments in this field in Section 2. In Section 3, we introduce our approach. In Section 4, training details are shown. After these, we evaluate our method. Finally, we conclude our work in Section 6.

2. RELATED WORK

In this section, we briefly review some related works on talking face generation.

2.1. GENERATIVE ADVERSARIAL NETWORK (GAN)

A generative adversarial network (GAN) is a framework composed of a generator network and a discriminator network, and it can be used to train a generative model. The generator tries to fool the discriminator as much as possible, while the discriminator tries to distinguish generated samples from real ones. Trained in this manner, both the generator and the discriminator improve over time. GAN methods have been widely utilized for many computer vision tasks, for example image synthesis (Radford et al., 2016), image super-resolution (Ledig et al., 2017), and image style transfer (Zhu et al., 2017). In recent years, several methods have been proposed to improve the original GAN from different perspectives, such as the Conditional GAN (CGAN) (Mirza & Osindero, 2014), InfoGAN (Chen et al., 2016), and CycleGAN (Zhu et al., 2017). GAN methods can also be used for data augmentation; for example, a GAN can generate images of people performing different actions to train an action recognition model.
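The adversarial training described above can be sketched as a minimal PyTorch loop on toy 2-D Gaussian data. The network sizes, learning rates, and data distribution here are illustrative choices, not part of any cited method.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))  # generator: noise -> sample
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator: sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(batch=64):
    real = torch.randn(batch, 2) + 3.0          # "real" samples from a shifted Gaussian
    fake = G(torch.randn(batch, 8))             # generated samples from noise
    # Discriminator step: push real toward 1, generated toward 0.
    d_loss = (bce(D(real), torch.ones(batch, 1))
              + bce(D(fake.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

for _ in range(100):
    d_loss, g_loss = train_step()
```

Note that the generator's loss depends only on the discriminator's judgment of its samples, which is what lets the two networks improve each other without paired supervision.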

2.2. TALKING FACE GENERATION

Synthesizing high-fidelity audio-driven facial video sequences is an important and challenging problem in many applications such as digital humans, chatting robots, and virtual video conferences. The topic of synthesizing a realistic talking video from audio and a single input image has long fascinated researchers. Early work adopted model-based approaches, which need to establish a relationship between audio semantics and lip movement, such as phoneme mapping (Fisher, 1968) and anatomical actions (Edwards et al., 2016). Because establishing this relationship is quite difficult, these methods are not suitable for large-scale use. In recent years, with the development of GAN techniques, research on talking face generation from speech began to flourish. Kumar et al. (2017) attempted to generate key points synchronized with audio using a time-delayed LSTM (Graves & Schmidhuber, 2005); they learn a mapping between the input audio and the corresponding lip landmarks of Barack Obama. Suwajanakorn et al. (2017) also proposed a method named "teeth proxy" to improve the visual quality of generated teeth. However, these methods can only be trained for specific people for whom a lot of video data is available. Fried et al. (2019) proposed that a video of a single speaker can be edited seamlessly by adding or deleting phrases in the speech; unfortunately, their approach still needs at least one hour of data per speaker.



Figure 1: MFCC features of an audio clip: the parameter-dimension map (left) and the parameter-amplitude map (right).

