END-TO-END ADVERSARIAL TEXT-TO-SPEECH

Abstract

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high-fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5-point scale, comparable to state-of-the-art models relying on multi-stage training and additional supervision.

1. INTRODUCTION

A text-to-speech (TTS) system processes natural language text inputs to produce synthetic human-like speech outputs. Typical TTS pipelines consist of a number of stages trained or designed independently, e.g. text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis, and raw audio waveform synthesis (Taylor, 2009). Although these pipelines have proven capable of realistic and high-fidelity speech synthesis and enjoy wide real-world use today, these modular approaches come with a number of drawbacks. They often require supervision at each stage, in some cases necessitating expensive "ground truth" annotations to guide the outputs of each stage, as well as sequential training of the stages. Further, they are unable to reap the full potential rewards of data-driven "end-to-end" learning widely observed across a number of prediction and synthesis task domains in machine learning.

In this work, we aim to simplify the TTS pipeline and take on the challenging task of synthesising speech from text or phonemes in an end-to-end manner. We propose EATS (End-to-end Adversarial Text-to-Speech): generative models for TTS, trained adversarially (Goodfellow et al., 2014), that operate on either pure text or raw (temporally unaligned) phoneme input sequences and produce raw speech waveforms as output. These models eliminate the typical intermediate bottlenecks present in most state-of-the-art TTS engines by maintaining learnt intermediate feature representations throughout the network.

Our speech synthesis models are composed of two high-level submodules, detailed in Section 2. An aligner processes the raw input sequence and produces relatively low-frequency (200 Hz) aligned features in its own learnt, abstract feature space. The features output by the aligner may be thought of as taking the place of the earlier stages of typical TTS pipelines, e.g. temporally aligned mel-spectrograms or linguistic features.
These features are then input to the decoder, which upsamples the features from the aligner using 1D convolutions to produce 24 kHz audio waveforms. By carefully designing the aligner and guiding training with a combination of adversarial feedback and domain-specific loss functions, we demonstrate that a TTS system can be learnt nearly end-to-end, resulting in high-fidelity, natural-sounding speech approaching that of state-of-the-art TTS systems. Our main contributions include:

• A fully differentiable and efficient feed-forward aligner architecture that predicts the duration of each input token and produces an audio-aligned representation.

• The use of flexible dynamic time warping-based prediction losses to enforce alignment with the input conditioning while allowing the model to capture the variability of timing in human speech.

• An overall system achieving a mean opinion score of 4.083, approaching the state of the art set by models trained using richer supervisory signals.
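The dynamic time warping-based prediction loss mentioned in the contributions can be illustrated with a minimal soft-DTW sketch. This is a generic illustration, not the paper's exact loss: it assumes an L1 frame cost and no warp penalty, and `gamma` is a hypothetical smoothness constant.

```python
import numpy as np

def soft_min(vals, gamma):
    """Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma)))."""
    vals = np.asarray(vals, dtype=float)
    m = vals.min()
    return m - gamma * np.log(np.sum(np.exp(-(vals - m) / gamma)))

def soft_dtw(X, Y, gamma=0.1):
    """Soft-DTW alignment cost between two feature sequences (frames x dims).

    gamma controls smoothness; as gamma -> 0 this recovers classic DTW.
    The pairwise cost here is the L1 distance between frames.
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.abs(X[i - 1] - Y[j - 1]).sum()
            # each cell takes a soft minimum over the three monotonic predecessors
            D[i, j] = cost + soft_min([D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], gamma)
    return D[n, m]
```

Because the minimum is softened, the loss is differentiable everywhere, which is what allows the generated spectrogram to be pulled towards the ground truth while tolerating small timing differences.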

2. METHOD

Our goal is to learn a neural network (the generator) which maps an input sequence of characters or phonemes to raw audio at 24 kHz. Beyond the vastly different lengths of the input and output signals, this task is also challenging because the input and output are not aligned: it is not known beforehand which output tokens each input token will correspond to. To address these challenges, we divide the generator into two blocks: (i) the aligner, which maps the unaligned input sequence to a representation which is aligned with the output, but has a lower sample rate of 200 Hz; and (ii) the decoder, which upsamples the aligner's output to the full audio frequency. The entire generator architecture is differentiable, and is trained end to end. Importantly, it is also a feed-forward convolutional network, which makes it well suited to applications where fast batched inference is important: our EATS implementation generates speech at a speed of 200× real time on a single NVIDIA V100 GPU (see Appendix A and Table 3 for details). The architecture is illustrated in Figure 1.

The generator is inspired by GAN-TTS (Bińkowski et al., 2020), a text-to-speech generative adversarial network operating on aligned linguistic features. We employ the GAN-TTS generator as the decoder in our model, but instead of upsampling pre-computed linguistic features, its input comes from the aligner block. We make it speaker-conditional by feeding in a speaker embedding s alongside the latent vector z, to enable training on a larger dataset with recordings from multiple speakers. We also adopt the multiple random window discriminators (RWDs) from GAN-TTS, which have proven effective for adversarial raw waveform modelling, and we preprocess real audio input by applying a simple µ-law transform. Hence, the generator is trained to produce audio in the µ-law domain, and we apply the inverse transformation to its outputs when sampling.
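The µ-law transform mentioned above is a standard companding scheme. A minimal sketch, assuming the common value µ = 255 (the text does not state which value is used):

```python
import numpy as np

MU = 255.0  # standard mu-law companding constant; assumed, not stated in the text

def mu_law(x):
    """Compress a waveform in [-1, 1] with the mu-law transform."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def inverse_mu_law(y):
    """Invert the transform, as applied to generator outputs before playback."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU
```

The transform expands low-amplitude regions of the waveform, so modelling audio in the µ-law domain allocates more of the generator's capacity to perceptually important quiet detail.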
The loss function we use to train the generator is:

L_G = L_{G,adv} + λ_pred · L_pred + λ_length · L_length,

where L_{G,adv} is the adversarial loss, linear in the discriminators' outputs, paired with the hinge loss (Lim & Ye, 2017; Tran et al., 2017) used as the discriminators' objective, as in GAN-TTS (Bińkowski et al., 2020). The use of an adversarial loss (Goodfellow et al., 2014) is an advantage of our approach: this setup allows for efficient feed-forward training and inference, and such losses tend to be mode-seeking in practice, a useful behaviour in a strongly conditioned setting where realism is an important design goal, as in text-to-speech. In the remainder of this section, we describe the aligner network and the auxiliary prediction (L_pred) and length (L_length) losses in detail, and recap the components adopted from GAN-TTS.

Given a token sequence x = (x_1, ..., x_N) of length N, we first compute token representations h = f(x, z, s), where f is a stack of dilated convolutions (van den Oord et al., 2016) interspersed with batch normalisation (Ioffe & Szegedy, 2015) and ReLU activations. The latents z and speaker embedding s modulate the scale and shift parameters of the batch normalisation layers (Dumoulin et al., 2017; De Vries et al., 2017). We then predict the length of each input token individually: l_n = g(h_n, z, s), where g is an MLP. We use a ReLU nonlinearity at the output to ensure that the predicted lengths are non-negative. We can then find the predicted token end positions as a cumulative sum of the token lengths, e_n = Σ_{m=1}^{n} l_m, and the token centre positions as c_n = e_n − l_n / 2. Based on these predicted positions, we can interpolate the token representations into an audio-aligned representation at 200 Hz, a = (a_1, ..., a_S), where S = e_N is the total number of output time steps. To compute a_t, we obtain interpolation weights for the token representations h_n using a softmax over the squared distances between the output position t and the token centres c_n.
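The alignment computation described above can be sketched as follows. This is an illustration only: the learnt length predictor g is replaced by given lengths, and `temperature` is a hypothetical smoothing constant standing in for the model's distance scaling.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align(h, lengths, temperature=10.0):
    """Differentiable monotonic interpolation from token space to output time steps.

    h:        (N, D) token representations
    lengths:  (N,) non-negative predicted token lengths, in output steps
    """
    ends = np.cumsum(lengths)              # e_n: cumulative sum of token lengths
    centres = ends - 0.5 * lengths         # c_n = e_n - l_n / 2
    S = int(np.round(ends[-1]))            # total number of output time steps
    t = np.arange(S)[:, None]              # output positions, shape (S, 1)
    # interpolation weights: softmax over negative squared distances to the centres
    w = softmax(-((t - centres[None, :]) ** 2) / temperature, axis=1)  # (S, N)
    return w @ h                           # audio-aligned representation a, shape (S, D)
```

Every operation here (cumulative sum, softmax, matrix product) is differentiable, so gradients flow from the audio-aligned representation back into the length predictions, which is what makes the aligner trainable end to end.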
