NON-ATTENTIVE TACOTRON: ROBUST AND CONTROLLABLE NEURAL TTS SYNTHESIS INCLUDING UNSUPERVISED DURATION MODELING

Abstract

This paper presents Non-Attentive Tacotron, based on the Tacotron 2 text-to-speech model, which replaces the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a mean opinion score for naturalness of 4.41 on a 5-point scale, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.

1. INTRODUCTION

Autoregressive neural text-to-speech (TTS) models using an attention mechanism are known to generate speech with naturalness on par with recorded human speech. However, such models are also known to be less robust than traditional approaches (He et al., 2019; Zheng et al., 2019; Guo et al., 2019; Battenberg et al., 2020). Because these autoregressive networks predict the output one frame at a time and are trained to decide whether to stop at each frame, a misprediction on a single frame can result in serious failures such as early cut-off. Meanwhile, few if any hard constraints are imposed on the attention mechanism to prevent problems such as repetition, skipping, long pauses, or babbling. To exacerbate the issue, these failures are rare and thus often not represented in small test sets, such as those used in subjective listening tests. In customer-facing products, however, even a one-in-a-million chance of such a problem can severely degrade the user experience.

There have been various works aimed at improving the robustness of autoregressive attention-based neural TTS models. Some investigated reducing the effect of exposure bias on the autoregressive decoder, using adversarial training (Guo et al., 2019) or a regularization term encouraging the forward and backward attention to be consistent (Zheng et al., 2019). Others utilized or designed alternative attention mechanisms, such as Gaussian mixture model (GMM) attention (Graves, 2013; Skerry-Ryan et al., 2018), forward attention (Zhang et al., 2018), stepwise monotonic attention (He et al., 2019), or dynamic convolution attention (Battenberg et al., 2020). Nonetheless, these approaches do not fundamentally solve the robustness issue. Recently, there has been a surge in the use of non-autoregressive models for TTS.
Rather than predicting whether to stop at each frame, non-autoregressive models must determine the output length ahead of time; one way to do so is to explicitly predict the duration of each input token. A side benefit of such a duration predictor is that it is significantly more resilient to the failures afflicting the attention mechanism. However, one-to-many regression problems like TTS can still benefit from an autoregressive decoder, as the previous mel-spectrogram frames provide context to disambiguate between multi-modal outputs.
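To make the duration-based alternative to attention concrete, the sketch below shows Gaussian upsampling of per-token encoder outputs to frame-level features: each output frame is a convex combination of token encodings, weighted by a Gaussian centered at each token's duration midpoint. This is a minimal NumPy illustration of the idea; the function name, the half-frame time offset, and the use of fixed per-token standard deviations are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def gaussian_upsample(encodings, durations, sigmas):
    """Upsample token encodings to frame level with Gaussian weights.

    encodings: (I, D) array of per-token encoder outputs
    durations: (I,)   per-token durations, in frames
    sigmas:    (I,)   per-token Gaussian standard deviations
    Returns a (T, D) array, where T = sum(durations).
    """
    T = int(np.sum(durations))            # total output length, fixed up front
    ends = np.cumsum(durations)
    centers = ends - durations / 2.0      # midpoint of each token's segment
    t = np.arange(T) + 0.5                # frame-center time steps (illustrative)
    # Unnormalized Gaussian weight of token i at frame t
    logits = -((t[:, None] - centers[None, :]) ** 2) / (2.0 * sigmas[None, :] ** 2)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)     # normalize over tokens per frame
    return w @ encodings                  # frame-level features
```

A usage example: with durations `[2, 3, 1]` the output has 6 frames, and early frames draw most of their weight from the first token, so the mapping is monotonic by construction rather than learned, which is what removes the attention failure modes discussed above.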

