UNSUPERVISED AUDIOVISUAL SYNTHESIS VIA EXEMPLAR AUTOENCODERS

Abstract

We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of a potentially unlimited number of output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use exemplar autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring any training data for the input speaker. To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. We outperform prior approaches on both audio and video synthesis. Please visit our project website (https://dunbar12138.github.io/projectpage/Audiovisual/) for our summary video and more information.

1. INTRODUCTION

We present an unsupervised approach to retargeting the speech of any unknown speaker to an audiovisual stream of a known target speaker. Using our approach, one can retarget a celebrity video clip to say the words "Welcome to ICLR 2021" in different languages, including English, Hindi, and Mandarin (please see our associated video). Our approach enables a variety of novel applications because it eliminates the need for training on large datasets; instead, it is trained with unsupervised learning on only a few minutes of the target speech, and does not require any training examples of the input speaker. By retargeting input speech generated by medical devices such as electrolarynges and text-to-speech (TTS) systems, our approach enables personalized voice generation for voice-impaired individuals (Kain et al., 2007; Nakamura et al., 2012). Our work also enables applications in education and entertainment; one can create interactive documentaries about historical figures in their own voices, or generate the voices of actors who are no longer able to perform. We highlight such representative applications in Figure 1.

Prior work typically treats audio conversion (Chou et al., 2019; Kaneko et al., 2019a;b; Mohammadi & Kim, 2019; Qian et al., 2019) and video generation from audio signals (Yehia et al., 2002; Chung et al., 2017; Suwajanakorn et al., 2017; Zhou et al., 2019; Zhu et al., 2018) as independent problems. Particularly relevant are zero-shot audio translation approaches (Chou et al., 2019; Mohammadi & Kim, 2019; Polyak & Wolf, 2019; Qian et al., 2019) that learn a generic low-dimensional embedding (from a training set) designed to be agnostic to speaker identity (Fig. 2-a). We will empirically show that such generic embeddings may struggle to capture stylistic details of in-the-wild speech that differs from the training set. Alternatively, one can directly learn an audio translation engine specialized to specific input and output speakers, often requiring data of the two speakers that is either aligned/paired (Chen et al., 2014; Nakashika et al., 2014; Sun et al., 2015; Toda et al., 2007) or unaligned/unpaired (Chou et al., 2018; Fang et al., 2018; Kameoka et al., 2018; Kaneko & Kameoka, 2017; Kaneko et al., 2019a;b; Serrà et al., 2019). This requirement restricts such methods to known input speakers at test time (Fig. 2-b). In terms of video synthesis from audio input, zero-shot facial synthesis approaches (Chung et al., 2017; Zhou et al., 2019; Zhu et al., 2018) animate the lips but struggle to capture realistic facial characteristics of the entire person. Other approaches (Ginosar et al., 2019; Shlizerman et al., 2018; Suwajanakorn et al., 2017) restrict themselves to known input speakers at test time and require large amounts of data to train a model in a supervised manner.

Our work combines the zero-shot nature of generic embeddings with the stylistic detail of person-specific translation systems. Simply put, given a target speech with a particular style and ambient environment, we learn an autoencoder specific to that target speech (Fig. 2-c). We deem our approach "Exemplar Autoencoders". At test time, one can translate any input speech into the target simply by passing it through the target exemplar autoencoder, as sketched below.
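To make the train-and-convert pipeline concrete, we give a deliberately minimal sketch below. This is an illustration, not the architecture used in our experiments: the per-frame MLP, layer sizes, MSE objective, and helper names (ExemplarAutoencoder, train_on_target, convert) are all simplifying assumptions for exposition.

```python
# Illustrative sketch only: a per-frame MLP autoencoder over mel-spectrograms
# with a deliberately small bottleneck, trained on ~3 minutes of one target
# speaker. Layer sizes and the MSE loss are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExemplarAutoencoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=8):
        super().__init__()
        # A narrow bottleneck keeps only the structured (linguistic)
        # content of speech; the decoder memorizes the target's style.
        self.encoder = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, n_mels))

    def forward(self, x):  # x: (frames, n_mels)
        return self.decoder(self.encoder(x))

def train_on_target(model, target_mels, steps=2000, lr=1e-3):
    """Fit the autoencoder to the target exemplar only; no input-speaker data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(target_mels), target_mels)
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def convert(model, source_mels):
    """Zero-shot conversion: the trained autoencoder projects out-of-sample
    speech onto the target speaker's training distribution."""
    return model(source_mels)
```

In a full system, a stronger sequence model would replace the per-frame MLP, and the converted spectrogram would still need to be inverted to a waveform (e.g., with a neural vocoder).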
We demonstrate that this zero-shot translation behavior is a consequence of two curious facts, shown in Fig. 3: (1) linguistic phonemes tend to cluster quite well in spectrogram space (Fig. 3-a); and (2) autoencoders with sufficiently small bottlenecks act as projection operators that project out-of-sample source data onto the target training distribution, allowing us to preserve the content (words) of the source and the style of the target (Fig. 3-c). Finally, we jointly synthesize audiovisual (AV) outputs by adding a visual stream to the audio autoencoder; see the sketch following the contributions list. Importantly, our approach is data-efficient: it can be trained using 3 minutes of audio-video data of the target speaker and no training data for the input speaker. The ability to train exemplar autoencoders on small amounts of data is crucial when learning specialized models tailored to particular target data. Table 1 contrasts our work with leading approaches in audio conversion (Kaneko et al., 2019c; Qian et al., 2019) and audio-to-video synthesis (Chung et al., 2017; Suwajanakorn et al., 2017).

Contributions: (1) We introduce exemplar autoencoders, which allow any input speech to be converted into an arbitrarily large number of target speakers ("any-to-many" AV synthesis). (2) We move beyond well-curated datasets and work with in-the-wild web audio-video data; we also provide a new CelebAudio dataset for evaluation. (3) Our approach can be used as an off-the-shelf, plug-and-play tool for target-specific voice conversion. (4) Finally, because our approach generates high-fidelity audio and video content that could potentially be misused, we discuss broader impacts in the appendix, including forensic experiments that suggest fake content can be identified with high accuracy.
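Below is a similarly minimal sketch of the AV extension referenced above: a video head decodes face frames from the same bottleneck code that drives the audio decoder, so the two output streams are synchronized by construction. The resolution, layer sizes, and joint MSE objective are again illustrative assumptions rather than our exact networks.

```python
# Illustrative AV extension: a video head decodes one RGB frame per
# bottleneck code produced by the audio encoder in the previous sketch.
import torch.nn as nn
import torch.nn.functional as F

class VideoDecoder(nn.Module):
    def __init__(self, bottleneck=8, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(nn.Linear(bottleneck, 512), nn.ReLU(),
                                 nn.Linear(512, 3 * img_size * img_size),
                                 nn.Sigmoid())

    def forward(self, z):  # z: (frames, bottleneck)
        out = self.net(z)  # one RGB face frame per bottleneck code
        return out.view(-1, 3, self.img_size, self.img_size)

def av_loss(audio_ae, video_dec, mels, frames):
    """Joint objective: both decoders read one shared bottleneck code, so
    audio and video reconstruction gradients shape the same encoder."""
    z = audio_ae.encoder(mels)
    return (F.mse_loss(audio_ae.decoder(z), mels) +
            F.mse_loss(video_dec(z), frames))
```

Because both reconstruction losses back-propagate through the shared encoder, the bottleneck is encouraged to carry exactly the content needed to drive both modalities.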

2. RELATED WORK

A tremendous interest in audio-video generation for health-care, quality-of-life, educational, and entertainment purposes has motivated a wide variety of work in the audio, natural language processing, computer vision, and graphics literature. In this work, we seek to explore a standard representation for user-controllable "any-to-many" audiovisual synthesis.

Speech Synthesis & Voice Conversion: Earlier works (Hunt & Black, 1996; Zen et al., 2009) in speech synthesis use text inputs to create Text-to-Speech (TTS) systems. Sequence-to-sequence (Seq2seq) structures (Sutskever et al., 2014) have led to significant advancements in TTS systems.



[Figure 1 here. Top row: we train exemplar autoencoders for infinitely many speakers using ~3 minutes of speech per speaker, without any additional information. Panels: (a) Assistive Tool for the Speech-Impaired: zero-shot natural voice synthesis (Electrolarynx and TTS inputs converted to natural voices such as Takeo Kanade's and Michelle Obama's); (b) Beyond Language Constraints: zero-shot multi-lingual translation (native Chinese and Hindi speech converted to Chinese and Hindi speech in John Oliver's voice); (c) Educational/Entertainment Purposes: zero-shot audio-visual content creation (Winston Churchill audio-visual speech).]

Figure 1: We train audiovisual (AV) exemplar autoencoders that capture personalized in-the-wild web speech, as shown in the top row. We then show three representative applications of exemplar autoencoders: (a) our approach enables zero-shot natural voice synthesis from an Electrolarynx or a TTS system used by a speech-impaired person; (b) without any knowledge of Chinese or Hindi, our approach can generate Chinese and Hindi speech for John Oliver, an English-speaking late-night show host; and (c) we can generate audio-visual content for historical figures that could not otherwise be captured.

