NANSY++: UNIFIED VOICE SYNTHESIS WITH NEURAL ANALYSIS AND SYNTHESIS

Abstract

Various applications of voice synthesis have been developed independently despite the fact that they all generate "voice" as output. In addition, most voice synthesis models still require a large amount of audio data paired with annotated labels (e.g., text transcriptions and music scores) for training. Motivated by this, we propose a unified framework for synthesizing and manipulating voice signals from analysis features, dubbed NANSY++. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications, i.e., voice conversion, text-to-speech, singing voice synthesis, and voice designing, by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high-quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.

1. INTRODUCTION

Most deep learning-based voice synthesis models consist of two parts: an acoustic model that generates a mel spectrogram from labeled annotations (e.g., text, music score, etc.), and a vocoder that converts the mel spectrogram into a waveform (Wang et al., 2017; Kim et al., 2020; Jeong et al., 2021; Lee et al., 2019; Liu et al., 2022). However, this setup usually suffers from poor synthesis quality due to the training/inference mismatch between the acoustic model and the vocoder. End-to-end training methods have recently been proposed to tackle such issues (Bińkowski et al., 2020; Weiss et al., 2021; Donahue et al., 2021; Kim et al., 2021). Despite their high quality, however, the training process of end-to-end models is often costly, as the waveform synthesis part must be trained anew for each different model. Furthermore, regardless of the training strategy (end-to-end or not), most standard voice synthesis models are not modular enough, in that most of the desirable control features are entangled in a single mid-level representation, or so-called latent space. This limits the controllability of such features and restrains the possibility of models being used as co-creation tools between creators and machines. Lastly, although many voice synthesis tasks are analogous in that they are meant to synthesize or control certain attributes of voice, the methodologies developed for each application remain scattered across research fields. These problems call for a unified voice synthesis framework. We set three objectives for designing such a framework: 1. data-scalable: the training procedure should require a minimal amount of labeled data while exploiting abundant audio recordings without labels; 2. modular: training for each application should be done in a modularized way by sharing a universal parameterized synthesizer; 3.
high quality: the synthesis quality must remain high even under the modularized training procedure. To this end, we make a core assumption that most voice synthesis tasks can be defined as synthesizing and controlling four aspects of voice: pitch, amplitude, linguistics, and timbre. This motivates us to develop a backbone network that can analyze voice into these four properties and then synthesize them back into a waveform. On that account, we propose NANSY++, which improves upon the previous Neural ANalysis and SYnthesis (NANSY) framework (Choi et al., 2021a) by putting forward a new end-to-end self-supervised training method. First, we propose a self-supervised fundamental frequency (F0) estimation training method that can be trained without any post-processing or synthetic datasets. Next, we adopt two self-supervised training methods, information perturbation and bottleneck, to extract a linguistic representation that is disentangled from the other representations (Choi et al., 2021a; Qian et al., 2022). Then, we propose to encode timbre information using a content-dependent time-varying speaker embedding, which successfully captures the timbre of target speakers unseen during training. Finally, we propose a high-quality synthesis network that converts the four analysis representations into a waveform by adopting an inductive bias from the human voice production model. We assume that exploiting the proposed self-supervised disentangled representation learning strategies in the backbone network encourages downstream generative tasks to be more data efficient, without losing modularity or synthesis quality. Therefore, after training the backbone network, we introduce four exemplar applications, voice conversion, text-to-speech (TTS), singing voice synthesis (SVS), and voice designing (VOD), that can be tackled by sharing the analysis representations and synthesis network of the backbone.
Each application can thus be simplified and reduced to the task of synthesizing a subset of the analysis representations. Through extensive experiments, we verify that NANSY++ offers several advantages at once that existing methodologies do not: high-quality output, fast training of modularized application models, data efficiency, and controllability over disentangled voice features.
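The modularity described above can be illustrated with a minimal sketch. The container and function names below are hypothetical (they do not appear in the paper); the sketch only shows how, once voice is decomposed into the four analysis features, an application such as voice conversion reduces to recombining a subset of them before passing the result to the shared synthesizer.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceAnalysis:
    """Hypothetical container for the four analysis features.
    Names and shapes are illustrative, not taken from the paper."""
    pitch: np.ndarray       # frame-level F0 contour, shape [frames]
    amplitude: np.ndarray   # frame-level amplitude envelope, shape [frames]
    linguistic: np.ndarray  # linguistic features, shape [frames, dim_l]
    timbre: np.ndarray      # time-varying speaker embedding, shape [frames, dim_t]

def voice_conversion(src: VoiceAnalysis, tgt: VoiceAnalysis) -> VoiceAnalysis:
    """Modular reuse: keep the source's pitch, amplitude, and linguistic
    content, but swap in the target speaker's timbre. The result would be
    fed to the shared synthesizer to produce the converted waveform."""
    return VoiceAnalysis(src.pitch, src.amplitude, src.linguistic, tgt.timbre)
```

Under this view, TTS or SVS would instead predict the pitch/linguistic features from text or a score while reusing the same synthesizer, which is what makes per-application training lightweight.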

2.1. SELF-SUPERVISED LEARNING OF PITCH

We represent pitch with the fundamental frequency F0 because it is explicitly controllable. In addition, we found that using a sinusoidal signal derived from F0 as an input to the synthesizer greatly reduces glitches. Training solely on a reconstruction objective, however, was unstable and showed poor pitch estimation quality; this was also reported by Engel et al. (2020b), who consequently resorted to synthetic audio datasets with pitch labels. While collecting synthetic audio paired with pitch labels is not difficult for musical instruments, it is problematic for speech or singing voice.



Figure 1: Overview of the proposed NANSY++ backbone architecture. All modules in the backbone network are trained in an end-to-end manner within a single analysis and synthesis loop.

To train the F0 estimator in a self-supervised manner, we first adopt the auto-encoding approach of Engel et al. (2020b). The pitch encoder f_θP takes a Constant-Q Transform (CQT) as its input feature and outputs a probability distribution over 64 frequency bins spanning 50Hz to 1000Hz logarithmically (approximately 0.79 semitone per bin). F0 is estimated as the weighted average of the probability distribution over the frequency bins. The pitch encoder also outputs two amplitude values, a periodic amplitude A_p[n] and an aperiodic amplitude A_ap[n]. F0[n] and A_p[n] are linearly upsampled to the sample level, F0[t] and A_p[t], and transformed into a sinusoidal waveform x[t] = A_p[t] sin(2π Σ_{k=0}^{t} F0[k]/f_s), where f_s denotes the sampling rate. We also linearly upsample A_ap[n] into A_ap[t] and generate shaped noise y[t] = A_ap[t] · n[t], where n[t] ∼ U[-1, 1]. Finally, the two signals x[t] and y[t] are added to form the input excitation signal z[t] = x[t] + y[t] for the synthesizer. After the synthesizer reconstructs the input signal, the pitch encoder is trained with a reconstruction loss.
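The excitation construction above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the sampling rate, hop size, and random seed are assumed, and the pitch encoder's probability outputs are stand-ins. It shows the weighted-average F0 decoding over logarithmically spaced bins, the linear upsampling to sample level, and the sinusoid-plus-shaped-noise excitation z[t] = x[t] + y[t].

```python
import numpy as np

SR = 16_000             # assumed sampling rate (not specified in the text)
N_BINS = 64
F_MIN, F_MAX = 50.0, 1000.0

# 64 logarithmically spaced bin centre frequencies from 50 Hz to 1000 Hz.
bin_freqs = F_MIN * (F_MAX / F_MIN) ** (np.arange(N_BINS) / (N_BINS - 1))

def f0_from_probs(probs: np.ndarray) -> np.ndarray:
    """Decode F0[n] as the expectation of the per-frame probability
    distribution over the frequency bins (probs: [frames, N_BINS])."""
    return probs @ bin_freqs

def excitation(f0_frames, a_p_frames, a_ap_frames, hop=256, seed=0):
    """Build the excitation z[t] = x[t] + y[t] from frame-level F0 and
    amplitudes, linearly upsampling each to the sample level."""
    n = len(f0_frames) * hop
    t_frames = np.arange(len(f0_frames)) * hop
    t = np.arange(n)
    f0 = np.interp(t, t_frames, f0_frames)      # F0[t]
    a_p = np.interp(t, t_frames, a_p_frames)    # A_p[t]
    a_ap = np.interp(t, t_frames, a_ap_frames)  # A_ap[t]
    phase = 2.0 * np.pi * np.cumsum(f0 / SR)    # integrate instantaneous frequency
    x = a_p * np.sin(phase)                     # periodic (sinusoidal) component
    rng = np.random.default_rng(seed)
    y = a_ap * rng.uniform(-1.0, 1.0, size=n)   # shaped noise, n[t] ~ U[-1, 1]
    return x + y
```

In training, the gradient of the reconstruction loss flows back through this differentiable construction (the expectation over bins and the cumulative-sum phase) into the pitch encoder, which is what makes the F0 estimate learnable without labels.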

