RESGRAD: RESIDUAL DENOISING DIFFUSION PROBABILISTIC MODELS FOR TEXT TO SPEECH

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.

1. INTRODUCTION

In recent years, text-to-speech (TTS) synthesis (Tan et al., 2021) has witnessed great progress with the development of deep generative models, e.g., auto-regressive models (Oord et al., 2016; Mehri et al., 2017; Kalchbrenner et al., 2018), flow-based models (Rezende & Mohamed, 2015; van den Oord et al., 2018; Kingma & Dhariwal, 2018), variational autoencoders (Peng et al., 2020), generative adversarial networks (Kumar et al., 2019; Kong et al., 2020; Binkowski et al., 2020), and denoising diffusion probabilistic models (DDPMs, diffusion models for short) (Ho et al., 2020; Song et al., 2021b). Based on the mechanism of iterative refinement, DDPMs have been able to achieve a sample quality that matches or even surpasses the state-of-the-art methods in acoustic models (Popov et al., 2021; Huang et al., 2022b), vocoders (Chen et al., 2021; Kong et al., 2021b; Lee et al., 2022a; Chen et al., 2022; Lam et al., 2022), and end-to-end TTS systems (Huang et al., 2022a). A major disadvantage of diffusion models is their slow inference speed, which stems from requiring a large number of sampling steps (e.g., 100 ∼ 1000 steps) in high-dimensional data space. To address this issue, many works have explored minimizing the number of inference steps (e.g., 2 ∼ 6 steps) for TTS (Huang et al., 2022b; Liu et al., 2022c) or vocoders (Chen et al., 2021; Kong et al., 2021b; Lam et al., 2022; Chen et al., 2022). However, the high-dimensional data space cannot be accurately estimated with only a few inference steps, so sample quality degrades. Instead of following previous methods that minimize the number of inference steps in DDPMs for speedup, we propose to reduce the complexity of the data space the DDPM has to model. In this way, ResGrad can generate high-quality speech with a lightweight model and a small real-time factor, since the learning space, i.e., the residual space, is less complex.
Meanwhile, ResGrad is a plug-and-play model, which does not require retraining the existing TTS model and can be easily applied to improve the quality of any existing TTS system. The main contributions of our work are summarized as follows:

• We propose a novel method, ResGrad, to speed up inference for DDPMs in TTS synthesis. By generating the residual instead of the whole speech from scratch, the complexity of the data space is reduced, which enables ResGrad to be lightweight and effective at generating high-quality speech with a small real-time factor.

• Experiments on the LJSpeech, LibriTTS, and VCTK datasets show that, compared with other speed-up baselines for DDPMs, ResGrad achieves higher sample quality at the same RTF, while being more than 10 times faster than baselines when generating speech of similar quality, which verifies the effectiveness of ResGrad.
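The iterative refinement that makes DDPMs slow can be sketched as the standard ancestral sampling loop, whose cost is one denoiser call per step. The snippet below is an illustrative toy (the zero-predicting denoiser, the linear schedule, and the mel shape are our assumptions, not the paper's model); it only shows why cost scales linearly with the number of steps T:

```python
import numpy as np

def ddpm_sample(denoise_fn, shape, betas, rng):
    """Ancestral DDPM sampling: one denoiser call per step, so inference
    cost grows linearly with the number of steps T = len(betas)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = denoise_fn(x, t)                  # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                               # no noise added at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy usage: a dummy denoiser that predicts zeros, 1000-step linear schedule,
# and an 80-bin mel-spectrogram of 100 frames.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
mel = ddpm_sample(lambda x, t: np.zeros_like(x), (80, 100), betas, rng)
print(mel.shape)  # (80, 100)
```

With T = 1000 this loop calls the denoiser 1000 times; cutting T to 2 ∼ 6, as the speed-up baselines do, shrinks that cost directly but leaves fewer steps to estimate the full data space.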

2. RELATED WORK

Denoising diffusion probabilistic models (DDPMs) have achieved state-of-the-art generation results in various tasks (Ho et al., 2020; Song et al., 2021b; Kingma et al., 2021; Dhariwal & Nichol, 2021; Ramesh et al., 2022) but require a large number of inference steps to achieve high sample quality. Various methods have been proposed to accelerate the sampling process (Song et al., 2021a; Vahdat et al., 2021; Dhariwal & Nichol, 2021; Zheng et al., 2022; Xiao et al., 2022; Salimans & Ho, 2022). In the field of speech synthesis, previous works on improving the inference speed of DDPMs can be mainly divided into three categories: 1) The prior/noise distribution used in DDPMs can be improved by leveraging condition information (Lee et al., 2022a; Popov et al., 2021; Koizumi et al., 2022); the number of sampling steps at inference can then be reduced by changing the prior/noise distribution from a standard Gaussian to a reparameterized distribution. 2) The inference process can start from a latent representation obtained by adding noise to an additional input, instead of starting from standard Gaussian noise, which reduces the number of inference steps in the sampling process (Liu et al., 2022b). 3) Additional training techniques can be used for DDPMs. For example, DiffGAN-TTS (Liu et al., 2022c) utilizes GAN-based models in the DDPM framework to replace the multi-step sampling process with a generator.
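Category 2 above can be illustrated with a short sketch: rather than starting the reverse process from pure Gaussian noise at step T, a coarse auxiliary prediction is forward-diffused to an intermediate step k via the closed form x_k = √ᾱ_k x_0 + √(1 − ᾱ_k) ε, and reverse sampling starts from there. The helper below is a hypothetical illustration of that formula, not code from any cited system:

```python
import numpy as np

def noise_to_step(x0, k, alpha_bars, rng):
    """Forward-diffuse a coarse prediction x0 directly to step k using
    x_k = sqrt(abar_k) * x0 + sqrt(1 - abar_k) * eps.  Reverse sampling can
    then begin at step k instead of T, skipping T - k denoiser calls."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[k]) * x0 + np.sqrt(1.0 - alpha_bars[k]) * eps

# Toy usage: a 1000-step schedule, starting reverse sampling from step k = 50.
rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
coarse_mel = np.zeros((80, 100))   # stand-in for an auxiliary model's output
x_k = noise_to_step(coarse_mel, 50, alpha_bars, rng)
print(x_k.shape)  # (80, 100)
```

The closer the coarse prediction is to the data manifold, the smaller k can be, which is what makes this family of methods fast.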



Figure 1: Illustration of ResGrad. ResGrad first predicts the residual between the mel-spectrogram estimated by an existing TTS model and the ground-truth mel-spectrogram, and then adds the residual to the estimated mel-spectrogram to obtain the refined mel-spectrogram.

Because the residual is simpler to model than the full mel-spectrogram, the model size as well as the number of inference steps can be reduced naturally, resulting in a small real-time factor during inference. To this end, we propose ResGrad, a diffusion model that predicts the residual between the output of an existing TTS model and the corresponding ground-truth mel-spectrogram. Compared with synthesizing speech from scratch, predicting the residual is a less complex and easier task for a DDPM to model. Specifically, as shown in Figure 1, we first utilize an existing TTS model such as FastSpeech 2 (Ren et al., 2021) to generate speech. In training, ResGrad is trained to generate the residual, while in inference, the output of ResGrad is added to the original TTS output, resulting in speech of higher quality.
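The training/inference split just described can be sketched as follows. Function names here are hypothetical; in the actual system, `sample_residual_fn` would be the ResGrad diffusion model conditioned on the TTS output:

```python
import numpy as np

def residual_target(gt_mel, tts_mel):
    """Training target for ResGrad: the residual between the ground-truth
    mel-spectrogram and the one estimated by the existing TTS model."""
    return gt_mel - tts_mel

def refine(tts_mel, sample_residual_fn):
    """Inference: sample a residual conditioned on the TTS output and add it
    back onto the TTS output to obtain the refined mel-spectrogram."""
    return tts_mel + sample_residual_fn(tts_mel)

# Toy check: a perfect residual sampler recovers the ground truth exactly.
gt = np.random.default_rng(0).standard_normal((80, 100))
coarse = gt + 0.1                  # stand-in for a FastSpeech 2 output
refined = refine(coarse, lambda m: residual_target(gt, m))
print(np.allclose(refined, gt))  # True
```

Note that the existing TTS model appears only as a fixed, frozen producer of `tts_mel`, which is what makes the refinement plug-and-play.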

