EFFICIENTTTS 2: VARIATIONAL END-TO-END TEXT-TO-SPEECH SYNTHESIS AND VOICE CONVERSION

Anonymous

Abstract

The text-to-speech (TTS) field has recently been dominated by one-stage text-to-waveform models, which significantly improve speech quality over two-stage models. However, the best-performing open-source one-stage model, VITS (Kim et al., 2021), is not fully differentiable and suffers from relatively high computation costs. To address these issues, we propose EfficientTTS 2 (EFTS2), a fully differentiable and highly efficient end-to-end TTS framework. Our method adopts an adversarial training process with a differentiable aligner and a hierarchical-VAE-based waveform generator. The differentiable aligner is built upon EfficientTTS (Miao et al., 2021), into which we incorporate a hybrid attention mechanism and a variational alignment predictor to improve its expressiveness. The hierarchical-VAE-based waveform generator not only alleviates the one-to-many mapping problem in waveform generation but also allows the model to learn hierarchical and explainable latent variables that control different aspects of the generated speech. We further extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that enables efficient and high-quality conversion. Experimental results suggest that the two proposed models match their strong counterparts in speech quality while offering faster inference and smaller model sizes.

1. INTRODUCTION

The text-to-speech (TTS) task aims to produce human-like synthetic speech signals from text inputs. In recent years, neural network systems have dominated the TTS field, sparked by the development of autoregressive (AR) models (Wang et al., 2017; Shen et al., 2018; Ping et al., 2018) and non-autoregressive (NAR) models (Miao et al., 2020; Ren et al., 2019; 2021). Conventional neural TTS systems cascade two separate models: an acoustic model that transforms the input text sequence into acoustic features (e.g., mel-spectrograms) (Wang et al., 2017; Ren et al., 2019), followed by a neural vocoder that transforms the acoustic features into audio waveforms (Valin & Skoglund, 2019; Yamamoto et al., 2020). Although two-stage TTS systems have demonstrated the capability of producing human-like speech, they come with several disadvantages. First, the acoustic model and the neural vocoder cannot be optimized jointly, which often hurts the quality of the generated speech. Moreover, the separate training pipeline not only complicates training and deployment but also makes it difficult to model downstream tasks.

Recently, there has been growing interest in one-stage text-to-waveform models that can be trained without mel-spectrograms (Weiss et al., 2021; Donahue et al., 2021; Kim et al., 2021). Among the open-source text-to-waveform models, VITS (Kim et al., 2021) achieves the best performance and efficiency. However, it still has several drawbacks. First, the MAS method (Kim et al., 2020) used to learn sequence alignments in VITS is excluded from the standard back-propagation process, which reduces training efficiency. Second, to generate a time-aligned textual representation, VITS simply repeats each hidden text representation according to its duration. This repetition operation is non-differentiable and thus hurts the quality of the generated speech.
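The length-regulation step criticized above can be sketched as follows. The hidden vectors and durations here are hypothetical toy values; the point is that the integer repeat counts block gradient flow back to the duration predictor:

```python
import numpy as np

# Hypothetical hidden text representations: 3 tokens, hidden dim 2.
h = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])

# Predicted durations (in frames) must be rounded to integers before
# repetition -- this rounding/indexing is where differentiability with
# respect to the duration predictor is lost.
durations = np.array([2, 1, 3])

# Length regulation: each token vector is repeated durations[i] times
# to produce a frame-level sequence.
expanded = np.repeat(h, durations, axis=0)

print(expanded.shape)  # (6, 2): total frames = sum of durations
```

Because `expanded` is built by discrete copying, the mapping from `durations` to `expanded` has no useful gradient, which is the non-differentiability EFTS2's aligner is designed to avoid.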
Third, VITS utilizes bijective transformations, specifically affine coupling layers, to compute latent representations. In an affine coupling layer, only half of the input is updated by each transformation, so multiple layers must be stacked to produce meaningful latent representations, which increases the model size and further reduces efficiency.

A recent work, NaturalSpeech (Tan et al., 2022), improves upon VITS by leveraging a learnable differentiable aligner and a bidirectional prior/posterior module. However, training the differentiable aligner requires a warm-up stage, i.e., a pretraining process that relies on external aligners. And although the bidirectional prior/posterior module of NaturalSpeech reduces the training/inference mismatch caused by the bijective flow module, it further increases the computational cost of training. Another recent work, EfficientTTS (EFTS) (Miao et al., 2021), proposed a NAR architecture with differentiable alignment modeling that is optimized jointly with the rest of the model. EFTS develops a family of text-to-mel-spectrogram models and a text-to-waveform model; however, the text-to-waveform model performs close to, but no better than, two-stage models.

Inspired by EFTS, we propose an end-to-end text-to-waveform TTS system, EfficientTTS 2 (EFTS2), that overcomes the above issues of current one-stage models while delivering competitive performance and higher efficiency. The main contributions of this paper are as follows:

• We improve upon the alignment framework of EFTS by proposing a hybrid attention mechanism and a variational alignment predictor, enabling the model to learn expressive time-aligned latent representations and to control the diversity of speech rhythms. (Section 2.2.1)
• We introduce a 2-layer hierarchical-VAE-based waveform generator that not only produces high-quality outputs but also learns hierarchical and explainable latent variables that control different aspects of the generated speech. (Section 2.2.2)
• We develop an end-to-end adversarial TTS model, EFTS2, that is fully differentiable and trained end-to-end. It matches the baseline VITS in naturalness while offering faster inference and a smaller model footprint. (Section 2.2)
• We extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model. EFTS2-VC is comparable in conversion performance to a state-of-the-art model (YourTTS, Casanova et al. (2022)) while achieving significantly faster inference and much more expressive speaker-independent latent representations. (Section 2.3)
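As background for the hierarchical-VAE-based generator, a generic two-level hierarchical VAE objective can be written as follows. This is our own notation for the standard construction, not necessarily the exact factorization used by EFTS2: with generative model $p_\theta(x \mid z_1)\,p_\theta(z_1 \mid z_2)\,p_\theta(z_2)$ and bottom-up inference model $q_\phi(z_1 \mid x)\,q_\phi(z_2 \mid z_1)$, the evidence lower bound is

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z_1 \mid x)\,q_\phi(z_2 \mid z_1)}\Big[ \log p_\theta(x \mid z_1) + \log p_\theta(z_1 \mid z_2) + \log p_\theta(z_2) - \log q_\phi(z_1 \mid x) - \log q_\phi(z_2 \mid z_1) \Big].
$$

The two latent levels $z_1$ and $z_2$ are what allow a hierarchical VAE to separate coarse and fine factors of variation, which is the property the generator above exploits.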

2. METHOD

Our goal is to build an ideal TTS model that enables end-to-end training and high-fidelity speech generation. To achieve this, we consider two major challenges in designing the model:

(i) Differentiable aligner. TTS datasets usually consist of thousands of audio files with corresponding text scripts that are, however, not time-aligned with the audio. Many previous TTS works use either external aligners (Ren et al., 2019; 2021; Chen et al., 2021) or non-differentiable internal aligners (Kim et al., 2020; 2021; Popov et al., 2021) for alignment modeling, which complicates the training procedure and reduces the model's efficiency. An ideal TTS model requires an internal differentiable aligner that can be optimized jointly with the rest of the network. Soft attention (Bahdanau et al., 2015) is the most common choice for building an internal differentiable aligner; however, computing soft attention requires autoregressive decoding, which is inefficient for speech generation (Weiss et al., 2021). Donahue et al. (2021) propose to use Gaussian upsampling and dynamic time warping (DTW) for alignment learning, but training such a system is inefficient. To the best of our knowledge, EFTS is the only NAR framework that enables both differentiable alignment modeling and high-quality speech generation. Therefore, we integrate and extend it into the proposed models.

(ii) Generative modeling framework. The goal of a generative task such as TTS is to estimate the probability distribution of the training data, which is usually intractable in practice. Multiple deep generative frameworks have been proposed to address this problem, including autoregressive models (ARs, Bahdanau et al. (2015)), normalizing flows (NFs, Kingma & Dhariwal (2018)), denoising diffusion probabilistic models (DDPMs, Ho et al. (2020)), generative adversarial networks (GANs, Goodfellow et al. (2014)), and variational auto-encoders (VAEs, Kingma & Welling (2014)). However, the number of generation steps of AR models grows linearly with sequence length; NFs rely on bijective transformations and often suffer from large model footprints; and DDPMs require many iterations to produce high-quality samples. In this work, we propose a GAN structure with a hierarchical-VAE-based generator, which allows efficient training and high-fidelity generation.
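To make the half-update property of affine coupling layers concrete, here is a minimal NumPy sketch. The conditioning functions are toy placeholders standing in for the neural networks used in real flows:

```python
import numpy as np

def affine_coupling(x, scale, shift):
    """One affine coupling step: split the input in half, keep the
    first half unchanged, and affine-transform the second half
    conditioned on the first."""
    x_a, x_b = np.split(x, 2)
    y_b = x_b * scale(x_a) + shift(x_a)
    return np.concatenate([x_a, y_b])

# Toy conditioning functions (placeholders for real neural networks).
scale = lambda h: 1.0 + 0.1 * np.tanh(h)
shift = lambda h: 0.5 * h

x = np.array([1.0, 2.0, 3.0, 4.0])
y = affine_coupling(x, scale, shift)

# The first half passes through unchanged -- this is why many coupling
# layers (with permutations between them) must be stacked before every
# dimension has been transformed.
print(y[:2])  # identical to x[:2]
```

The transform stays invertible because `x_b` can be recovered from `y_b` given `x_a`, but the untouched half is exactly the cost VITS pays in depth and parameters, and what EFTS2's hierarchical-VAE generator avoids.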

