FEATURE-DRIVEN TALKING FACE GENERATION WITH STYLEGAN2

Abstract

In this work, we aim to generate a natural and realistic talking-face animation video from a single face image. This is not an easy task, because facial appearance variation and the semantics of speech are coupled together in the subtle movements of a talking face. Audio features sometimes carry information about expression, but not accurately enough, so a single audio feature cannot fully represent the movement of the face. For this reason, we use multiple features to generate talking faces. The StyleGAN series has shown strong performance in image processing and handles portrait style-transfer tasks very well. We find that StyleGAN can serve as a talking-face generator. In addition, we encode and extract non-identity and non-lip features, and explore the subtle relationship between these features and the talking face. We use quantitative evaluation and an ablation study to measure the quality of the generated videos and to examine whether our approach is effective and feasible.

1. INTRODUCTION

Using multimedia to realize interaction between virtual characters and users is one application of AI technology. Audio is often easier to obtain than video, and the task of generating an animated video from audio and a single face image has recently attracted more and more researchers. Solving this task is essential for a wide range of practical applications, such as virtual character interaction, dubbing videos into other languages, conference videos, role-playing games, and so on. Graphics-based face animation methods often require a complete original video sequence as input (Liu & Ostermann, 2011; Garrido et al., 2015; Suwajanakorn et al., 2017). Fried et al. (2019) proposed a method to edit a talking-head video based on its transcript to produce a realistic output video; however, their method requires a retimed background video as input, and about 1 hour of video is needed to produce the best-quality results. Other models take the motion of the face into account and use landmarks to drive generation (Wang et al., 2019; Zakharov et al., 2019; Gu et al., 2020). Many recent methods generate facial animation from audio (Jamaludin et al., 2019). Prajwal et al. (2020) add a pretrained lip-sync model to the facial animation system to obtain lip synchronization. In human-to-human communication, speech inevitably involves lip movement; that is, a speaker's lip movements and speech are closely related. In speech recognition and speaker recognition, the most commonly used speech features are Mel-scale Frequency Cepstral Coefficients (MFCCs). By extracting speech-related features with MFCCs and relating them to the lip movements of a speaking person, it is possible to generate mouth animation from speech.
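To make the MFCC feature pipeline above concrete, the following is a minimal numpy-only sketch of MFCC extraction from a raw waveform. The sample rate, frame/hop sizes, and filter counts here are illustrative defaults, not the configuration used in this paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct2_ortho(x, n_out):
    # Orthonormal DCT-II over the last axis, via an explicit cosine basis
    # (avoids a scipy dependency for this sketch).
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_out), n + 0.5) / N)
    out = x @ basis.T * np.sqrt(2.0 / N)
    out[..., 0] /= np.sqrt(2.0)
    return out

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # 1) Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2) Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Triangular mel filterbank between 0 Hz and Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4) Log mel energies, then a DCT to decorrelate -> cepstral coefficients.
    mel_energy = np.log(np.maximum(power @ fbank.T, 1e-10))
    return dct2_ortho(mel_energy, n_ceps)
```

With these defaults, each 512-sample frame of a 16 kHz waveform yields 13 coefficients; stacking them over time gives the per-frame audio feature sequence that can be aligned with lip motion.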
However, audio alone cannot capture all the information about facial motion. To make the face more natural, additional features are needed so that the generated face has more realistic characteristics. While there are many ways to generate talking faces driven by speech, we propose a method that uses StyleGAN2 to generate the animation. At the same time, considering non-lip-related features, we extract features beyond identity and lip features, and combine audio features with facial features in the hope of obtaining more realistic facial animation.

