NANSY++: UNIFIED VOICE SYNTHESIS WITH NEURAL ANALYSIS AND SYNTHESIS

Abstract

Various voice synthesis applications have been developed independently even though they all generate "voice" as output. In addition, most voice synthesis models still require a large amount of audio data paired with annotated labels (e.g., text transcriptions and music scores) for training. To this end, we propose a unified framework for synthesizing and manipulating voice signals from analysis features, dubbed NANSY++. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications, i.e., voice conversion, text-to-speech, singing voice synthesis, and voice designing, by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high-quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.

1. INTRODUCTION

Most deep learning-based voice synthesis models consist of two parts: an acoustic model that generates a mel spectrogram from labeled annotations (e.g., text or a music score), and a vocoder that converts the mel spectrogram into a waveform (Wang et al., 2017; Kim et al., 2020; Jeong et al., 2021; Lee et al., 2019; Liu et al., 2022). However, this pipeline usually suffers from degraded synthesis quality due to the training/inference mismatch between the acoustic model and the vocoder. End-to-end training methods have recently been proposed to tackle this issue (Bińkowski et al., 2020; Weiss et al., 2021; Donahue et al., 2021; Kim et al., 2021). Despite their high quality, however, training end-to-end models is often costly, as the waveform synthesis part must be retrained for each new model. Furthermore, regardless of the training strategy (end-to-end or not), most standard voice synthesis models are not modular enough: most of the desirable control features are entangled in a single mid-level representation, or so-called latent space. This limits the controllability of such features and restrains the potential of these models as co-creation tools between creators and machines. Lastly, although many voice synthesis tasks are analogous in that they synthesize or control certain attributes of voice, the methodologies developed for each application remain scattered across research fields. These problems call for a unified voice synthesis framework.

We adhere to three objectives in designing the unified synthesis framework:

1. data-scalable: the training procedure should require only a minimal amount of labeled data while exploiting abundant audio recordings without labels;
2. modular: the training for each application should be done in a modularized way by sharing a universal parameterized synthesizer;
3. high-quality: the synthesis quality must remain high even under the modularized training procedure.

To this end, we make a core assumption that most voice synthesis tasks can be defined as synthesizing and controlling four aspects of voice: pitch, amplitude, linguistic content, and timbre. This motivates us to develop a backbone network that analyzes voice into these four properties and then synthesizes them back into a waveform. On that account, we propose NANSY++.
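To make the four-attribute decomposition concrete, the following is a minimal, hypothetical sketch of the analysis-stage interface. NANSY++ itself uses learned neural encoders; here simple DSP stand-ins (frame RMS for amplitude, an autocorrelation peak for pitch) and zero-valued placeholders for the linguistic and timbre embeddings illustrate only the shape of the decomposition, not the actual method. All names (`VoiceFeatures`, `analyze`) are our own illustration, not from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceFeatures:
    # The four attributes assumed sufficient to describe voice.
    pitch: np.ndarray       # frame-level F0 contour (Hz)
    amplitude: np.ndarray   # frame-level energy envelope (RMS)
    linguistic: np.ndarray  # frame-level content embedding (placeholder)
    timbre: np.ndarray      # utterance-level speaker embedding (placeholder)

def analyze(wav: np.ndarray, sr: int = 16000, hop: int = 256) -> VoiceFeatures:
    """Toy analysis stage: decompose a waveform into the four attributes."""
    frames = [wav[i:i + hop] for i in range(0, len(wav) - hop + 1, hop)]
    amplitude = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    # Crude F0 estimate: autocorrelation peak searched in the 80-400 Hz range.
    pitch = []
    for f in frames:
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        lo, hi = sr // 400, sr // 80
        lag = lo + int(np.argmax(ac[lo:hi]))
        pitch.append(sr / lag)
    linguistic = np.zeros((len(frames), 8))  # stand-in content embedding
    timbre = np.zeros(16)                    # stand-in speaker embedding
    return VoiceFeatures(np.array(pitch), amplitude, linguistic, timbre)
```

A synthesizer would then implement the inverse mapping, `VoiceFeatures -> waveform`; each downstream application only needs to model (or edit) the subset of features relevant to its task, e.g., swapping `timbre` for voice conversion.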

