ADASPEECH: ADAPTIVE TEXT TO SPEECH FOR CUSTOM VOICE

Abstract

Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using only a few speech samples from him/her. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that can be very different from the source speech data, and 2) to support a large number of customers, the number of adaptation parameters for each target speaker needs to be small enough to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we model the acoustic information at both the utterance and phoneme levels. Specifically, we use one acoustic encoder to extract an utterance-level vector and another one to extract a sequence of phoneme-level vectors from the target speech during pre-training and fine-tuning; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the number of adaptation parameters against voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to the speaker embedding for adaptation. We pre-train the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and LJSpeech datasets (whose acoustic conditions differ from those of LibriTTS) with little adaptation data, e.g., 20 sentences (about 1 minute of speech). Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. The audio samples are available at https://speechresearch.github.io/adaspeech/.

1. INTRODUCTION

Text to speech (TTS) aims to synthesize natural and intelligible voice from text, and has attracted much interest in the machine learning community (Arik et al., 2017; Wang et al., 2017; Gibiansky et al., 2017; Ping et al., 2018; Shen et al., 2018; Ren et al., 2019). TTS models can synthesize natural human voice when trained on a large amount of high-quality single-speaker recordings (Ito, 2017), and have been extended to multi-speaker scenarios (Gibiansky et al., 2017; Ping et al., 2018; Zen et al., 2019; Chen et al., 2020) using multi-speaker corpora (Panayotov et al., 2015; Veaux et al., 2016; Zen et al., 2019). However, these corpora contain a fixed set of speakers, where each speaker still has a certain amount of speech data. Nowadays, custom voice has attracted increasing interest in application scenarios such as personal assistants, news broadcast and audio navigation, and is widely supported in commercial speech platforms (custom voice services include Microsoft Azure, Amazon AWS and Google Cloud). In custom voice, a source TTS model is usually adapted to a personalized voice with little adaptation data, since the users of custom voice prefer to record as little adaptation data as possible (several minutes or even seconds) for convenience. Such limited adaptation data poses great challenges to the naturalness and similarity of the adapted voice.

Furthermore, custom voice raises several distinctive challenges: 1) The recordings of custom users usually have acoustic conditions different from the source speech data (the data used to train the source TTS model). For example, the adaptation data may be recorded with diverse speaking prosodies, styles, emotions, accents and recording environments. The mismatch in these acoustic conditions makes the source model difficult to generalize and leads to poor adaptation quality. 2) When adapting the source TTS model to a new voice, there is a trade-off between the number of fine-tuned parameters and voice quality. Generally speaking, more adaptation parameters usually result in better voice quality, but also increase the memory storage and serving cost.¹

While previous works on TTS adaptation have considered the limited-data setting of custom voice, they have not fully addressed the above challenges. Some fine-tune the whole model (Chen et al., 2018; Kons et al., 2019) or the decoder part (Moss et al., 2020; Zhang et al., 2020), achieving good quality but requiring too many adaptation parameters. Reducing the number of adaptation parameters is necessary for deploying commercial custom voice; otherwise, the memory storage would explode as the number of users grows. Other works only fine-tune the speaker embedding (Arik et al., 2018; Chen et al., 2018), or train a speaker encoder module (Arik et al., 2018; Jia et al., 2018; Cooper et al., 2020; Li et al., 2017; Wan et al., 2018) that does not need fine-tuning during adaptation. While these approaches enable lightweight and efficient adaptation, they result in poor adaptation quality. Moreover, most previous works assume that the source speech data and adaptation data are in the same domain, and do not consider different acoustic conditions, which is not practical in custom voice scenarios.

In this paper, we propose AdaSpeech, an adaptive TTS model for high-quality and efficient customization of new voices. AdaSpeech employs a three-stage pipeline for custom voice: 1) pre-training; 2) fine-tuning; 3) inference.
During the pre-training stage, the TTS model is trained on large-scale multi-speaker datasets, so that it covers diverse text and speaking voices, which is helpful for adaptation. During the fine-tuning stage, the source TTS model is adapted to a new voice by fine-tuning (a part of) the model parameters on the limited adaptation data with diverse acoustic conditions. During the inference stage, both the unadapted part (parameters shared by all custom voices) and the adapted part (the parameters specific to each custom voice) of the TTS model are used to serve the inference request. We build AdaSpeech on popular non-autoregressive TTS models (Ren et al., 2019; Peng et al., 2020; Kim et al., 2020; Ren et al., 2021) and further design several techniques to address the challenges in custom voice:

• Acoustic condition modeling. To handle different acoustic conditions for adaptation, we model the acoustic conditions at both the utterance and phoneme levels in pre-training and fine-tuning. Specifically, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech, which are taken as input to the mel-spectrogram decoder to represent the global and local acoustic conditions, respectively. In this way, the decoder can predict speech in different acoustic conditions based on this acoustic information; without such explicit conditioning, the model would memorize the acoustic conditions in its parameters and could not generalize well. In inference, we extract the utterance-level vector from a reference speech and use another acoustic predictor, built upon the phoneme encoder, to predict the phoneme-level vectors (a minimal sketch of these encoders is given at the end of this section).

• Conditional layer normalization. To fine-tune as few parameters as possible while ensuring the adaptation quality, we modify the layer normalization (Ba et al., 2016) in the mel-spectrogram decoder in pre-training, using the speaker embedding as conditional information to generate the scale and bias vectors of layer normalization. In fine-tuning, we only adapt the parameters related to the conditional layer normalization. In this way, we can greatly reduce the adaptation parameters and thus the memory storage² compared with fine-tuning the whole model, while maintaining high-quality adapted voice thanks to the flexibility of conditional layer normalization (see the sketches at the end of this section).

To evaluate the effectiveness of our proposed AdaSpeech for custom voice, we train the TTS model on the LibriTTS dataset and adapt it on the VCTK and LJSpeech datasets under different adaptation settings. Experimental results show that AdaSpeech achieves better adaptation quality in terms of MOS (mean opinion score) and SMOS (similarity MOS) than baseline methods, with only about 5K specific parameters for each speaker.
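To make the acoustic condition modeling concrete, the following is a minimal PyTorch-style sketch of the two acoustic encoders. The class names, layer sizes, and the choices of mean pooling over time (utterance level) and of averaging mel frames within each phoneme's duration (phoneme level) are our illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Extract a single utterance-level vector from a reference mel-spectrogram.
    Layer sizes and mean pooling are illustrative assumptions."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames)
        h = self.convs(mel)           # (batch, hidden, frames)
        return h.mean(dim=2)          # pool over time -> (batch, hidden)

class PhonemeLevelEncoder(nn.Module):
    """Extract one vector per phoneme, assuming the mel frames have already
    been averaged within each phoneme's duration (one column per phoneme)."""
    def __init__(self, n_mels: int = 80, hidden: int = 256, out_dim: int = 4):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, phone_mel: torch.Tensor) -> torch.Tensor:
        # phone_mel: (batch, n_mels, n_phonemes)
        return self.convs(phone_mel)  # (batch, out_dim, n_phonemes)
```

During pre-training and fine-tuning, the phoneme-level vectors would come from `PhonemeLevelEncoder`; in inference they are replaced by the predictor built on top of the phoneme encoder, which could, for example, be trained to regress the encoder outputs with an MSE loss (a plausible choice on our part, not a detail stated above).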
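The conditional layer normalization can be summarized with a short formula: rather than learning a fixed scale and bias, both are generated from the speaker embedding $e_s$ through small linear layers (the notation below is ours, written to match the description above):

```latex
\mathrm{CLN}(x, e_s) = \gamma(e_s) \odot \frac{x - \mu}{\sigma} + \beta(e_s),
\qquad
\gamma(e_s) = W_\gamma\, e_s,
\quad
\beta(e_s) = W_\beta\, e_s,
```

where $\mu$ and $\sigma$ are the mean and standard deviation of $x$ over the hidden dimension, and $W_\gamma$, $W_\beta$ (together with $e_s$) are the parameters adapted during fine-tuning.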
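A minimal PyTorch sketch of such a conditional layer normalization module follows; the module and argument names, and the speaker-embedding size, are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose scale and bias are generated on the fly
    from a speaker embedding instead of being fixed learned parameters."""
    def __init__(self, hidden: int, spk_dim: int = 256):
        super().__init__()
        self.hidden = hidden
        self.to_scale = nn.Linear(spk_dim, hidden)  # plays the role of W_gamma
        self.to_bias = nn.Linear(spk_dim, hidden)   # plays the role of W_beta

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); spk_emb: (batch, spk_dim)
        normed = F.layer_norm(x, (self.hidden,))    # normalize without affine
        scale = self.to_scale(spk_emb).unsqueeze(1) # (batch, 1, hidden)
        bias = self.to_bias(spk_emb).unsqueeze(1)   # (batch, 1, hidden)
        return scale * normed + bias
```

In fine-tuning, only `to_scale`, `to_bias` and the speaker embedding would be updated. Note also that once adaptation is done, each speaker's generated scale and bias vectors are fixed, so a deployment could cache just those vectors per speaker rather than the linear-layer weights, which is consistent with the small per-speaker footprint reported above.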



¹ For example, to support one million users in a cloud speech service, if each custom voice consumes 100 MB of model size, the total memory storage would be about 100 TB, which is quite a large serving cost.
² We further reduce the memory usage in inference, as described in Section 2.3.
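As a quick worked check of these storage numbers (assuming 4 bytes per parameter, i.e., 32-bit floats, which is our assumption):

```latex
10^{6}\ \text{users} \times 100\ \text{MB} = 10^{8}\ \text{MB} = 100\ \text{TB}
\qquad\text{vs.}\qquad
10^{6}\ \text{users} \times 5{,}000\ \text{params} \times 4\ \text{B} \approx 20\ \text{GB},
```

a reduction in total per-speaker storage by a factor of about 5,000.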


