NON-ATTENTIVE TACOTRON: ROBUST AND CONTROLLABLE NEURAL TTS SYNTHESIS INCLUDING UNSUPERVISED DURATION MODELING

Abstract

This paper presents Non-Attentive Tacotron, based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This significantly improves robustness as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a mean opinion score (MOS) for naturalness of 4.41 on a 5-point scale, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as those of supervised training.

1. INTRODUCTION

Autoregressive neural text-to-speech (TTS) models using an attention mechanism are known to be able to generate speech with naturalness on par with recorded human speech. However, these types of models are known to be less robust than traditional approaches (He et al., 2019; Zheng et al., 2019; Guo et al., 2019; Battenberg et al., 2020). These autoregressive networks, which predict the output one frame at a time, are trained to decide whether to stop at each frame, so a misprediction on a single frame can result in serious failures such as early cut-off. Meanwhile, few to no hard constraints are imposed on the attention mechanism to prevent problems such as repetition, skipping, long pauses, or babbling. To exacerbate the issue, these failures are rare and are thus often not represented in small test sets, such as those used in subjective listening tests. However, in customer-facing products, even a one-in-a-million chance of such problems can severely degrade the user experience.

There have been various works aimed at improving the robustness of autoregressive attention-based neural TTS models. Some investigated reducing the effect of exposure bias on the autoregressive decoder, using adversarial training (Guo et al., 2019) or adding regularization to encourage the forward and backward attention to be consistent (Zheng et al., 2019). Others utilized or designed alternative attention mechanisms, such as Gaussian mixture model (GMM) attention (Graves, 2013; Skerry-Ryan et al., 2018), forward attention (Zhang et al., 2018), stepwise monotonic attention (He et al., 2019), or dynamic convolution attention (Battenberg et al., 2020). Nonetheless, these approaches do not fundamentally solve the robustness issue.

Recently, there has been a surge in the use of non-autoregressive models for TTS.
Rather than predicting whether to stop on each frame, non-autoregressive models need to determine the output length ahead of time, and one way to do so is with an explicit prediction of the duration of each input token. A side benefit of such a duration predictor is that it is significantly more resilient to the failures afflicting the attention mechanism. However, one-to-many regression problems like TTS can benefit from an autoregressive decoder, as the previous mel-spectrogram frames provide context to disambiguate between multi-modal outputs.

Under review as a conference paper at ICLR 2021.

In this paper, we propose Non-Attentive Tacotron¹, a neural TTS model that combines a robust duration predictor with the autoregressive decoder of Tacotron 2 (Shen et al., 2018). Our work is similar to DurIAN (Yu et al., 2019; Zhang et al., 2020), which also incorporates a duration predictor with an autoregressive decoder. But besides the differences in architecture, we also introduce a couple of novel features in our model. The key contributions of this paper are summarized as follows:

1. Replacing the attention mechanism in Tacotron 2 with duration prediction and upsampling modules, leading to better robustness with naturalness matching recorded natural speech;
2. Introduction of Gaussian upsampling, significantly improving naturalness compared to vanilla upsampling through repetition;
3. Global and fine-grained control of durations at inference time;
4. Semi-supervised and unsupervised duration modeling of Non-Attentive Tacotron, allowing the model to be trained with few to no duration annotations; and
5. More reliable evaluation metrics for measuring the robustness of TTS models, along with a comparison of Non-Attentive Tacotron and Tacotron 2 with respect to those metrics.
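To make the upsampling contribution concrete, the sketch below illustrates one plausible reading of Gaussian upsampling: instead of repeating each encoder output d_i times (vanilla upsampling), every output frame is a weighted sum of all token representations, with weights given by a Gaussian centered inside each token's predicted duration span. Function names, the frame-midpoint convention, and the per-token range parameter here are our own illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gaussian_upsample(encoder_outputs, durations, ranges):
    """Hypothetical sketch of Gaussian upsampling.

    encoder_outputs: (N, D) token representations
    durations:       (N,)   predicted durations in frames
    ranges:          (N,)   predicted positive Gaussian widths (sigma_i)
    Returns an upsampled (T, D) sequence with T = round(sum(durations)).
    """
    # Center of each token's span: end of its cumulative duration minus half its length.
    centers = np.cumsum(durations) - 0.5 * durations
    total_frames = int(round(durations.sum()))
    t = np.arange(total_frames) + 0.5  # frame midpoints on the same time axis

    # Gaussian score of token i for frame t, normalized over tokens per frame.
    logits = -((t[:, None] - centers[None, :]) ** 2) / (2.0 * ranges[None, :] ** 2)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    return weights @ encoder_outputs  # (T, D)
```

Because the weights vary smoothly with t, neighboring frames blend adjacent token representations near span boundaries, which is one intuition for why this could sound more natural than hard repetition.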

2. RELATED WORK

In the past decade, model-based TTS synthesis has evolved from hidden Markov model (HMM)-based approaches (Zen et al., 2009) to deep neural networks. Throughout this period, the concept of an explicit representation of token (phoneme) durations has not been foreign. Early neural parametric synthesis models (Zen et al., 2013) require explicit alignments between input and target and include durations as part of the bag of features used to generate vocoder parameters. Explicit durations continued to be used with the advent of the end-to-end neural vocoder WaveNet (Oord et al., 2016), in works such as Deep Voice (Arik et al., 2017; Gibiansky et al., 2017) and CHiVE (Kenter et al., 2019).

As general focus turned towards end-to-end approaches, the autoregressive sequence-to-sequence model with attention mechanism used in neural machine translation (NMT) (Bahdanau et al., 2015) and automatic speech recognition (ASR) (Chan et al., 2016) became an attractive option, removing the need to represent durations explicitly. This led to works such as Char2Wav (Sotelo et al., 2017), Tacotron (Wang et al., 2017; Shen et al., 2018), Deep Voice 3 (Ping et al., 2018), and Transformer TTS (Li et al., 2019). Similar models have been used for more complicated problems, such as direct speech-to-speech translation (Jia et al., 2019), speech conversion (Biadsy et al., 2019), and speech enhancement (Ding et al., 2020). Tacotron 2, on which our work is based, is one such model: it connects a character-level encoder to an autoregressive decoder producing mel-spectrogram frames via a location-sensitive attention mechanism (Chorowski et al., 2015).

Recently, there has been a surge of non-autoregressive models, bringing back the use of explicit duration prediction.
This approach initially surfaced in NMT (Gu et al., 2017), then made its way into TTS with models such as FastSpeech (Ren et al., 2019; 2020), AlignTTS (Zeng et al., 2020), TalkNet (Beliaev et al., 2020), and JDI-T (Lim et al., 2020). See Appendix C for a rough categorization of these models. To train the duration predictor, FastSpeech uses target durations extracted from a pre-trained autoregressive model run in teacher-forcing mode, while JDI-T also extracts target durations from a separate autoregressive model but co-trains it with the feed-forward model. TalkNet uses a CTC-based ASR model to extract target durations, while CHiVE, FastSpeech 2, and DurIAN use target durations from an external aligner module utilizing forced alignment. Finally, AlignTTS forgoes target durations completely and uses a specialized alignment loss inspired by the Baum-Welch algorithm to train a mixture density network for alignment.
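A common thread in several of the duration-extraction recipes above is reducing a frame-to-token alignment (from teacher-forced attention or a forced aligner) to per-token frame counts. The following is a simplified, hypothetical sketch of that reduction using a per-frame argmax; the actual systems cited enforce additional constraints (e.g. monotonicity) that this toy version omits.

```python
import numpy as np

def durations_from_alignment(attention):
    """Reduce a (T_frames, N_tokens) alignment matrix to per-token durations.

    Each frame is assigned to its highest-scoring token, and durations are
    the resulting per-token frame counts (summing to T_frames).
    """
    frame_to_token = attention.argmax(axis=1)        # token index per frame
    return np.bincount(frame_to_token, minlength=attention.shape[1])
```

Durations extracted this way can then serve as regression targets for a duration predictor, as in the FastSpeech-style training described above.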



¹ Audio samples are available in the supplemental materials.

