DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS

Abstract

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts white noise into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio across different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task in terms of audio quality and sample diversity, according to various automatic and human evaluations.

1. INTRODUCTION

Deep generative models have produced high-fidelity raw audio in speech synthesis and music generation. In previous work, likelihood-based models, including autoregressive models (van den Oord et al., 2016; Kalchbrenner et al., 2018; Mehri et al., 2017) and flow-based models (Prenger et al., 2019; Ping et al., 2020; Kim et al., 2019), have predominated in audio synthesis because of their simple training objectives and superior ability to model the fine details of waveforms in real data. There are other waveform models that often require auxiliary losses for training, such as flow-based models trained by distillation (van den Oord et al., 2018; Ping et al., 2019), variational autoencoder (VAE) based models (Peng et al., 2020), and generative adversarial network (GAN) based models (Kumar et al., 2019; Bińkowski et al., 2020; Yamamoto et al., 2020). Most previous waveform models focus on audio synthesis with an informative local conditioner (e.g., a mel spectrogram or aligned linguistic features), with only a few exceptions for unconditional generation (Mehri et al., 2017; Donahue et al., 2019). It has been noticed that autoregressive models (e.g., WaveNet) tend to generate made-up word-like sounds (van den Oord et al., 2016) or inferior samples (Donahue et al., 2019) under unconditional settings. This is because very long sequences need to be generated (e.g., 16,000 time steps for one second of speech) without any conditional information.

Diffusion probabilistic models (diffusion models for brevity) are a class of promising generative models that use a Markov chain to gradually convert a simple distribution (e.g., isotropic Gaussian) into a complicated data distribution (Sohl-Dickstein et al., 2015; Goyal et al., 2017; Ho et al., 2020). Although the data likelihood is intractable, diffusion models can be efficiently trained by optimizing the variational lower bound (ELBO).
Most recently, a certain parameterization has been shown successful in image synthesis (Ho et al., 2020), which is connected with denoising score matching (Song & Ermon, 2019).

Figure 1: The diffusion and reverse process in diffusion probabilistic models. The reverse process gradually converts the white noise signal into the speech waveform through a Markov chain $p_\theta(x_{t-1}|x_t)$.

Diffusion models can use a diffusion (noise-adding) process without learnable parameters to obtain the "whitened" latents from training data. Therefore, no additional neural networks are required for training, in contrast to other models (e.g., the encoder in VAE (Kingma & Welling, 2014) or the discriminator in GAN (Goodfellow et al., 2014)). This avoids the challenging "posterior collapse" or "mode collapse" issues stemming from the joint training of two networks, and is hence valuable for high-fidelity audio synthesis.

In this work, we propose DiffWave, a versatile diffusion probabilistic model for raw audio synthesis. DiffWave has several advantages over previous work: i) it is non-autoregressive and can thus synthesize high-dimensional waveforms in parallel; ii) it is flexible, as it does not impose any architectural constraints, in contrast to flow-based models, which must keep a bijection between latents and data (see Ping et al. (2020) for more analysis); this leads to small-footprint neural vocoders that still generate high-fidelity speech; iii) it uses a single ELBO-based training objective without any auxiliary losses (e.g., spectrogram-based losses) for high-fidelity synthesis; iv) it is a versatile model that produces high-quality audio for both conditional and unconditional waveform generation. Specifically, we make the following contributions:

1. DiffWave uses a feed-forward, bidirectional dilated convolution architecture motivated by WaveNet (van den Oord et al., 2016). It matches the strong WaveNet vocoder in terms of speech quality (MOS: 4.44 vs. 4.43), while synthesizing orders of magnitude faster, as it requires only a few sequential steps (e.g., 6) to generate very long waveforms.

2. Our small DiffWave has 2.64M parameters and synthesizes 22.05 kHz high-fidelity speech (MOS: 4.37) more than 5× faster than real-time on a V100 GPU without engineered kernels. Although it is still slower than the state-of-the-art flow-based models (Ping et al., 2020; Prenger et al., 2019), it has a much smaller footprint. We expect further speed-ups from optimizing its inference mechanism in the future.

3. DiffWave significantly outperforms WaveGAN (Donahue et al., 2019) and WaveNet in the challenging unconditional and class-conditional waveform generation tasks in terms of audio quality and sample diversity, as measured by several automatic and human evaluations.

We organize the rest of the paper as follows. We present diffusion models in Section 2 and introduce the DiffWave architecture in Section 3. Section 4 discusses related work. We report experimental results in Section 5 and conclude in Section 6.

2. DIFFUSION PROBABILISTIC MODELS

We define $q_{data}(x_0)$ as the data distribution on $\mathbb{R}^L$, where $L$ is the data dimension. Let $x_t \in \mathbb{R}^L$ for $t = 0, 1, \ldots, T$ be a sequence of variables with the same dimension, where $t$ is the index of the diffusion steps. Then, a diffusion model of $T$ steps is composed of two processes: the diffusion process and the reverse process (Sohl-Dickstein et al., 2015). Both are illustrated in Figure 1.

Algorithm 1 Training
  for $i = 1, 2, \ldots, N_{\text{iter}}$ do
    Sample $x_0 \sim q_{data}$, $\epsilon \sim \mathcal{N}(0, I)$, and $t \sim \text{Uniform}(\{1, \ldots, T\})$
    Take a gradient step on $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\right) \right\|_2^2$ according to Eq. (7)
  end for

Algorithm 2 Sampling
  Sample $x_T \sim p_{latent} = \mathcal{N}(0, I)$
  for $t = T, T-1, \ldots, 1$ do
    Compute $\mu_\theta(x_t, t)$ and $\sigma_\theta(x_t, t)$ using Eq. (5)
    Sample $x_{t-1} \sim p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_\theta(x_t, t)^2 I)$
  end for
  return $x_0$

The diffusion process is defined by a fixed Markov chain from data $x_0$ to the latent variable $x_T$:
$$q(x_1, \ldots, x_T | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1}),$$
where each $q(x_t|x_{t-1})$ is fixed to $\mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$ for a small positive constant $\beta_t$. The function of $q(x_t|x_{t-1})$ is to add small Gaussian noise to the distribution of $x_{t-1}$. The whole process gradually converts data $x_0$ to whitened latents $x_T$ according to a variance schedule $\beta_1, \ldots, \beta_T$.

The reverse process is defined by a Markov chain from $x_T$ to $x_0$ parameterized by $\theta$:
$$p_{latent}(x_T) = \mathcal{N}(0, I), \qquad p_\theta(x_0, \ldots, x_{T-1} | x_T) = \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t),$$
where $p_{latent}(x_T)$ is isotropic Gaussian, and the transition probability $p_\theta(x_{t-1}|x_t)$ is parameterized as $\mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_\theta(x_t, t)^2 I)$ with shared parameters $\theta$. Note that both $\mu_\theta$ and $\sigma_\theta$ take two inputs: the diffusion step $t \in \mathbb{N}$ and the variable $x_t \in \mathbb{R}^L$. $\mu_\theta$ outputs an $L$-dimensional vector as the mean, and $\sigma_\theta$ outputs a real number as the standard deviation. The goal of $p_\theta(x_{t-1}|x_t)$ is to eliminate (i.e., denoise) the Gaussian noise added in the diffusion process.

Sampling: Given the reverse process, the generative procedure is to first sample $x_T \sim \mathcal{N}(0, I)$, and then sample $x_{t-1} \sim p_\theta(x_{t-1}|x_t)$ for $t = T, T-1, \ldots, 1$. The output $x_0$ is the sampled data.

Training: The likelihood $p_\theta(x_0) = \int p_\theta(x_0, \ldots, x_{T-1} | x_T) \cdot p_{latent}(x_T)\, dx_{1:T}$ is intractable to compute in general.
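The input $\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ in Algorithm 1 is a direct sample from the closed-form marginal $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$, so training can jump to any diffusion step without simulating the chain. A minimal NumPy sketch; the linear schedule and signal length here are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

def diffuse(x0, t, betas, rng):
    """Closed-form sampling of x_t ~ q(x_t | x_0):
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    alpha_bar = np.cumprod(1.0 - betas)   # abar_t = prod_{s<=t} (1 - beta_s)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.05, 50)       # illustrative linear variance schedule, T = 50
x0 = rng.standard_normal(16000)           # stand-in for one second of 16 kHz audio
xT, eps = diffuse(x0, 50, betas, rng)     # latent at the final diffusion step
```

The training loop then regresses $\epsilon_\theta(x_t, t)$ onto the `eps` that generated `xt`.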
The model is thus trained by maximizing its variational lower bound (ELBO):

$$\mathbb{E}_{q_{data}(x_0)} \log p_\theta(x_0) = \mathbb{E}_{q_{data}(x_0)} \log \mathbb{E}_{q(x_1, \ldots, x_T | x_0)} \left[ \frac{p_\theta(x_0, \ldots, x_{T-1} | x_T) \times p_{latent}(x_T)}{q(x_1, \ldots, x_T | x_0)} \right]$$
$$\geq \mathbb{E}_{q(x_0, \ldots, x_T)} \log \frac{p_\theta(x_0, \ldots, x_{T-1} | x_T) \times p_{latent}(x_T)}{q(x_1, \ldots, x_T | x_0)} := \text{ELBO}. \tag{3}$$

Most recently, Ho et al. (2020) showed that under a certain parameterization, the ELBO of the diffusion model can be calculated in closed form. This accelerates the computation and avoids Monte Carlo estimates, which have high variance. This parameterization is motivated by its connection to denoising score matching with Langevin dynamics (Song & Ermon, 2019; 2020). To introduce this parameterization, we first define some constants based on the variance schedule $\{\beta_t\}_{t=1}^{T}$ in the diffusion process, as in Ho et al. (2020):

$$\alpha_t = 1 - \beta_t, \quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s, \quad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\, \beta_t \text{ for } t > 1, \text{ and } \tilde\beta_1 = \beta_1. \tag{4}$$

Then, the parameterizations of $\mu_\theta$ and $\sigma_\theta$ are defined by

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right), \quad \text{and} \quad \sigma_\theta(x_t, t) = \tilde\beta_t^{\frac{1}{2}}, \tag{5}$$

where $\epsilon_\theta : \mathbb{R}^L \times \mathbb{N} \to \mathbb{R}^L$ is a neural network also taking $x_t$ and the diffusion step $t$ as inputs. Note that $\sigma_\theta(x_t, t)$ is fixed to the constant $\tilde\beta_t^{\frac{1}{2}}$ for every step $t$ under this parameterization. In the following proposition, we explicitly provide the closed-form expression of the ELBO.

Proposition 1. (Ho et al., 2020) Suppose a fixed schedule $\{\beta_t\}_{t=1}^{T}$ is given. Let $\epsilon \sim \mathcal{N}(0, I)$ and $x_0 \sim q_{data}$. Then, under the parameterization in Eq. (5), we have

$$-\text{ELBO} = c + \sum_{t=1}^{T} \kappa_t\, \mathbb{E}_{x_0, \epsilon} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t \right) \right\|_2^2 \tag{6}$$

for some constants $c$ and $\kappa_t$, where $\kappa_t = \frac{\beta_t}{2\alpha_t(1-\bar\alpha_{t-1})}$ for $t > 1$, and $\kappa_1 = \frac{1}{2\alpha_1}$. Note that $c$ is irrelevant for optimization purposes. The key idea of the proof is to expand the ELBO into a sum of KL divergences between tractable Gaussian distributions, which have a closed-form expression.
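The constants in Eq. (4) and the ELBO weights $\kappa_t$ of Proposition 1 are simple functions of the schedule. A NumPy sketch with an illustrative schedule, which also checks $\kappa_t$ against the algebraically equivalent form $\frac{1}{2\tilde\beta_t} \cdot \frac{\beta_t^2}{\alpha_t(1-\bar\alpha_t)}$ that arises in the proof:

```python
import numpy as np

betas = np.linspace(1e-4, 0.05, 50)   # illustrative training schedule (assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t for t > 1; beta_tilde_1 = beta_1
beta_tilde = np.empty_like(betas)
beta_tilde[0] = betas[0]
beta_tilde[1:] = (1.0 - alpha_bar[:-1]) / (1.0 - alpha_bar[1:]) * betas[1:]

# ELBO weights: kappa_t = beta_t / (2 alpha_t (1 - abar_{t-1})) for t > 1; kappa_1 = 1/(2 alpha_1)
kappa = np.empty_like(betas)
kappa[0] = 1.0 / (2.0 * alphas[0])
kappa[1:] = betas[1:] / (2.0 * alphas[1:] * (1.0 - alpha_bar[:-1]))
```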
We refer readers to Section A in the Appendix for the full proof. In addition, Ho et al. (2020) reported that minimizing the following unweighted variant of the ELBO leads to higher generation quality:

$$\min_\theta L_{\text{unweighted}}(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t \right) \right\|_2^2 \tag{7}$$

where $t$ is drawn uniformly from $\{1, \ldots, T\}$. We therefore also use this training objective in this paper. We summarize the training and sampling procedures in Algorithms 1 and 2, respectively.

Fast sampling: Given a model trained with Algorithm 1, we noticed that the most effective denoising steps at sampling occur near $t = 0$ (see Section IV on the demo website). This encourages us to design a fast sampling algorithm with many fewer denoising steps $T_{\text{infer}}$ (e.g., 6) than the $T$ used at training (e.g., 200). The key idea is to "collapse" the $T$-step reverse process into a $T_{\text{infer}}$-step process with a carefully designed variance schedule. We provide the details in Appendix B.
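Algorithm 2 can be sketched end-to-end in NumPy. The zero network below is a stand-in for the trained $\epsilon_\theta$ (in practice, the DiffWave network of Section 3), and the schedule is illustrative:

```python
import numpy as np

def sample(eps_theta, betas, length, rng):
    """Algorithm 2: ancestral sampling x_T -> x_0 with the parameterization of Eq. (5)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    beta_tilde = np.empty_like(betas)          # sigma_theta is fixed to beta_tilde_t^(1/2)
    beta_tilde[0] = betas[0]
    beta_tilde[1:] = (1.0 - alpha_bar[:-1]) / (1.0 - alpha_bar[1:]) * betas[1:]

    x = rng.standard_normal(length)            # x_T ~ p_latent = N(0, I)
    for t in range(len(betas), 0, -1):         # t = T, T-1, ..., 1
        a, ab, b = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        mu = (x - b / np.sqrt(1.0 - ab) * eps_theta(x, t)) / np.sqrt(a)   # Eq. (5)
        x = mu + np.sqrt(beta_tilde[t - 1]) * rng.standard_normal(length)
    return x

# With a zero network, the loop still exercises the full sampling chain.
waveform = sample(lambda x, t: np.zeros_like(x), np.linspace(1e-4, 0.05, 50),
                  16000, np.random.default_rng(0))
```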

3. DIFFWAVE ARCHITECTURE

In this section, we present the architecture of DiffWave (see Figure 2 for an illustration). We build the network $\epsilon_\theta : \mathbb{R}^L \times \mathbb{N} \to \mathbb{R}^L$ in Eq. (5) on a bidirectional dilated convolution architecture that differs from WaveNet (van den Oord et al., 2016), because there is no autoregressive generation constraint. A similar architecture has been applied to source separation (Rethage et al., 2018; Lluís et al., 2018). The network is non-autoregressive, so generating an audio sample $x_0$ of length $L$ from the latents $x_T$ requires $T$ rounds of forward propagation, where $T$ (e.g., 50) is much smaller than the waveform length $L$. The network is composed of a stack of $N$ residual layers with residual channels $C$. These layers are grouped into $m$ blocks, and each block has $n = N/m$ layers. We use a bidirectional dilated convolution (Bi-DilConv) with kernel size 3 in each layer. The dilation is doubled at each layer within a block, i.e., $[1, 2, 4, \ldots, 2^{n-1}]$. We sum the skip connections from all residual layers as in WaveNet. More details, including the tensor shapes, are provided in Section C in the Appendix.

3.1. DIFFUSION-STEP EMBEDDING

It is important to include the diffusion step $t$ as part of the input, as the model needs to output different $\epsilon_\theta(\cdot, t)$ for different $t$. We use a 128-dimensional encoding vector for each $t$ (Vaswani et al., 2017):

$$t_{\text{embedding}} = \left[ \sin\!\left(10^{\frac{0 \times 4}{63}} t\right), \ldots, \sin\!\left(10^{\frac{63 \times 4}{63}} t\right), \cos\!\left(10^{\frac{0 \times 4}{63}} t\right), \ldots, \cos\!\left(10^{\frac{63 \times 4}{63}} t\right) \right] \tag{8}$$

We then apply three fully connected (FC) layers to the encoding, where the first two FCs share parameters among all residual layers. The last, residual-layer-specific FC maps the output of the second FC into a $C$-dimensional embedding vector. We then broadcast this embedding vector over the length dimension and add it to the input of every residual layer.
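The encoding of Eq. (8) can be sketched directly in NumPy. Reading the equation as 64 sine and 64 cosine channels whose frequencies $10^{i \times 4 / 63}$ multiply $t$ is our interpretation of the notation:

```python
import numpy as np

def step_embedding(t):
    """128-dimensional sinusoidal encoding of diffusion step t, following Eq. (8):
    64 sine and 64 cosine channels at frequencies 10^(i*4/63), i = 0..63."""
    freqs = 10.0 ** (np.arange(64) * 4.0 / 63.0)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])
```

The three FC layers of the text would then consume this 128-dimensional vector.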

3.2. CONDITIONAL GENERATION

Local conditioner: In speech synthesis, a neural vocoder can synthesize the waveform conditioned on aligned linguistic features (van den Oord et al., 2016; Arık et al., 2017b), the mel spectrogram from a text-to-spectrogram model (Ping et al., 2018; Shen et al., 2018), or the hidden states within a text-to-wave architecture (Ping et al., 2019; Donahue et al., 2020). In this work, we test DiffWave as a neural vocoder conditioned on the mel spectrogram. We first upsample the mel spectrogram to the same length as the waveform through transposed 2-D convolutions. After a layer-specific Conv1×1 maps its mel bands into $2C$ channels, the conditioner is added as a bias term to the dilated convolution in each residual layer. The hyperparameters can be found in Section 5.1.

Global conditioner: In many generative tasks, the conditional information is given by global discrete labels (e.g., speaker IDs or word IDs). We use shared embeddings with dimension $d_{label} = 128$ in all experiments. In each residual layer, we apply a layer-specific Conv1×1 to map $d_{label}$ to $2C$ channels, and add the embedding as a bias term after the dilated convolution in each residual layer.
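The global conditioner amounts to a per-label bias broadcast over the time axis. A minimal NumPy sketch; the sizes, weight initialization, and function names are hypothetical (in the real model, `W` is a learned layer-specific Conv1×1 and `h` is the dilated-conv output):

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_label, n_labels = 64, 128, 10                            # hypothetical sizes; d_label = 128 as in the text

label_table = 0.1 * rng.standard_normal((n_labels, d_label))  # shared label embeddings
W = 0.1 * rng.standard_normal((2 * C, d_label))               # layer-specific Conv1x1, i.e. a matmul

def add_global_conditioner(h, label):
    """h: (2C, L) output of the dilated conv in one residual layer.
    Map the label embedding to 2C channels and add it as a bias over all L positions."""
    bias = W @ label_table[label]                             # (2C,)
    return h + bias[:, None]                                  # broadcast over length
```

Because the label is global, the same bias is applied at every time step, unlike the position-dependent mel-spectrogram conditioner.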

3.3. UNCONDITIONAL GENERATION

In the unconditional generation task, the model needs to generate consistent utterances without conditional information. It is important for the output units of the network to have a receptive field size (denoted $r$) larger than the length $L$ of the utterance. Indeed, we need $r \geq 2L$, so that the left- and right-most output units have receptive fields covering the whole $L$-dimensional input, as illustrated in Figure 4 in the Appendix. This poses a challenge for architecture design even with dilated convolutions. For a stack of dilated convolution layers, the receptive field size of the output is up to $r = (k-1)\sum_i d_i + 1$, where $k$ is the kernel size and $d_i$ is the dilation at the $i$-th residual layer. For example, a 30-layer dilated convolution with $k = 3$ and dilation cycle $[1, 2, \ldots, 512]$ has a receptive field size of $r = 6139$. This amounts to only 0.38 s of 16 kHz audio. We can further increase the number of layers and the size of the dilation cycles; however, we found degraded quality with deeper layers and larger dilation cycles. This is particularly true for WaveNet. In fact, a previous study (Shen et al., 2018) suggests that even a moderately large receptive field size (e.g., 6139) is not effectively used by WaveNet, which tends to focus on a much shorter context (e.g., 500). DiffWave has an advantage in enlarging the receptive field of the output $x_0$: by iterating from $x_T$ to $x_0$ in the reverse process, the receptive field size can be increased up to $T \times r$, which makes DiffWave suitable for unconditional generation.
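The receptive-field arithmetic above can be checked with a few lines of Python:

```python
def receptive_field(n_layers, kernel_size, cycle):
    """r = (k - 1) * sum_i d_i + 1 for a stack of dilated convolutions."""
    dilations = [cycle[i % len(cycle)] for i in range(n_layers)]
    return (kernel_size - 1) * sum(dilations) + 1

cycle = [2 ** i for i in range(10)]   # dilation cycle [1, 2, ..., 512]
r = receptive_field(30, 3, cycle)     # 30 layers, kernel size 3 -> r = 6139
```

At 16 kHz, `r / 16000` gives the ~0.38 s context quoted in the text.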

4. RELATED WORK

In the past few years, many neural text-to-speech (TTS) systems have been introduced, including ClariNet (Ping et al., 2019), ParaNet (Peng et al., 2020), FastSpeech (Ren et al., 2019), GAN-TTS (Bińkowski et al., 2020), and Flowtron (Valle et al., 2020). These systems first generate intermediate representations (e.g., aligned linguistic features, a mel spectrogram, or hidden representations) conditioned on text, and then use a neural vocoder to synthesize the raw waveform. The neural vocoder plays the most important role in the recent success of speech synthesis. Autoregressive models like WaveNet and WaveRNN can generate high-fidelity speech, but only sequentially. Parallel WaveNet and ClariNet distill parallel flow-based models from WaveNet and can thus synthesize waveforms in parallel. In contrast, WaveFlow (Ping et al., 2020), WaveGlow (Prenger et al., 2019), and FloWaveNet (Kim et al., 2019) are trained by maximizing the likelihood. There are other waveform models, such as VAE-based models (Peng et al., 2020), GAN-based models (Kumar et al., 2019; Yamamoto et al., 2020; Bińkowski et al., 2020), and neural signal processing models (Wang et al., 2019; Engel et al., 2020; Ai & Ling, 2020). In contrast to likelihood-based models, they often require auxiliary training losses to improve audio fidelity. The proposed DiffWave is another promising neural vocoder, synthesizing speech of the best quality with a single objective function.

Unconditional generation of audio in the time domain is a challenging task in general. Likelihood-based models are forced to learn all possible variations within the dataset without any conditional information, which can be quite difficult with limited model capacity. In practice, these models produce made-up word-like sounds or inferior samples (van den Oord et al., 2016; Donahue et al., 2019).
VQ-VAE (van den Oord et al., 2017) circumvents this issue by compressing the waveform into a compact latent code and training an autoregressive model in the latent domain. GAN-based models are believed to be suitable for unconditional generation (e.g., Donahue et al., 2019) because of their "mode-seeking" behavior and their success in the image domain (Brock et al., 2018). Note that unconditional generation of audio in the frequency domain is considered easier, as the spectrogram is much shorter (e.g., 200×) than the waveform (Vasquez & Lewis, 2019; Engel et al., 2019; Palkama et al., 2020). In this work, we demonstrate the superior performance of DiffWave in unconditional waveform generation. In contrast to exact-likelihood models, DiffWave maximizes a variational lower bound on the likelihood, which can focus on the major variations within the data and relax the requirements on model capacity. In contrast to GAN- or VAE-based models (Donahue et al., 2019; Peng et al., 2020), it is much easier to train, without the mode collapse, posterior collapse, or training instability stemming from the joint training of two networks. A concurrent work (Chen et al., 2020) also uses diffusion probabilistic models for waveform generation. In contrast to DiffWave, it uses a neural architecture similar to GAN-TTS and focuses only on the neural vocoding task. Our DiffWave vocoder has far fewer parameters than WaveGrad: 2.64M vs. 15M for the Base models and 6.91M vs. 23M for the Large models. The small memory footprint is preferred in production TTS systems, especially for on-device deployment. In addition, DiffWave requires a smaller batch size (16 vs. 256) and fewer computational resources for training.

5. EXPERIMENTS

We evaluate DiffWave on neural vocoding, unconditional and class-conditional generation tasks.

5.1. NEURAL VOCODING

Data: We use the LJ speech dataset (Ito, 2017). The reason to increase $\beta_t$ for a smaller $T$ is to make $q(x_T|x_0)$ close to $p_{latent}$. In addition, we compare the fast sampling algorithm with a smaller $T_{\text{infer}}$ (see Appendix B), denoted DiffWave (Fast), with the regular sampling procedure (Algorithm 2). Both use the same trained models.

Conditioner: We use the 80-band mel spectrogram of the original audio as the conditioner to test these neural vocoders, as in previous work (Ping et al., 2019; Prenger et al., 2019; Kim et al., 2019). We set the FFT size to 1024, the hop size to 256, and the window size to 1024. We upsample the mel spectrogram 256× by applying two layers of transposed 2-D convolution (in time and frequency) interleaved with nonlinearities.

Training: We train DiffWave on 8 Nvidia 2080Ti GPUs using random short audio clips of 16,000 samples from each utterance. We use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 and a learning rate of $2 \times 10^{-4}$. We train all DiffWave models for 1M steps. For the other models, we follow the training setups in the original papers.

Results: We use the crowdMOS toolkit (Ribeiro et al., 2011) for speech quality evaluation, where the test utterances from all models were presented to Mechanical Turk workers. We report the 5-scale Mean Opinion Scores (MOS) and model footprints in Table 1. DiffWave BASE (Fast) and DiffWave LARGE (Fast) can be 5.6× and 3.5× faster than real-time, respectively, and still obtain good audio fidelity. In contrast, a WaveNet implementation can be 500× slower than real-time at synthesis without engineered kernels. DiffWave is still slower than the state-of-the-art flow-based models (e.g., a 5.91M-parameter WaveFlow is > 40× faster than real-time in FP16), but has a smaller footprint and slightly better quality. Because DiffWave does not impose any architectural constraints, unlike flow-based models, we expect further speed-ups from optimizing the architecture and inference mechanism in the future.

5.2. UNCONDITIONAL GENERATION

In this section, we apply DiffWave to an unconditional generation task based on raw waveform only.

Data: We use the Speech Commands dataset (Warden, 2018), which contains many spoken words from thousands of speakers under various recording conditions, including some very noisy environments. We select the subset containing the spoken digits (0~9), which we call the SC09 dataset. The SC09 dataset contains 31,158 training utterances (~8.7 hours in total) from 2,032 speakers, where each audio clip is one second long at a 16 kHz sampling rate. Therefore, the data dimension $L$ is 16,000. Note that the SC09 dataset exhibits rich variation (e.g., in content, speakers, speech rate, and recording conditions); the generative models need to model it without any conditional information.

Models:

We compare DiffWave with WaveNet and WaveGAN. We also tried removing the mel conditioner from a state-of-the-art GAN-based neural vocoder (Yamamoto et al., 2020), but found it could not generate reasonable audio in this setting. We also tried increasing the size of the dilation cycle and the number of layers, but these modifications led to worse quality, particularly with a large dilation cycle.

Training: We train WaveNet and DiffWave on 8 Nvidia 2080Ti GPUs using full utterances. We use the Adam optimizer with a batch size of 16. For WaveNet, we set the initial learning rate to $1 \times 10^{-3}$ and halve it every 200K iterations. For DiffWave, we fix the learning rate to $2 \times 10^{-4}$. We train WaveNet and DiffWave for 1M steps.

Evaluation: For human evaluation, we report the 5-scale MOS for speech quality, as in Section 5.1. To automatically evaluate the quality of the generated audio samples, we train a ResNeXT classifier (Xie et al., 2017) on the SC09 dataset according to an open repository (Xu & Tuguldur, 2017). The classifier achieves 99.06% accuracy on the training set and 98.76% accuracy on the test set. We use the following evaluation metrics based on the 1024-dimensional feature vector and the 10-dimensional logits from the ResNeXT classifier (see Section D in the Appendix for detailed definitions):

• Fréchet Inception Distance (FID) (Heusel et al., 2017) measures both quality and diversity of generated samples, and favors generators that match moments in the feature space.
• Inception Score (IS) (Salimans et al., 2016) measures both quality and diversity of generated samples, and favors generated samples that can be clearly determined by the classifier.
• Modified Inception Score (mIS) (Gurumurthy et al., 2017) measures the within-class diversity of samples in addition to IS.
• AM Score (Zhou et al., 2017) additionally takes into account the marginal label distribution of the training data, compared to IS.
• Number of Statistically-Different Bins (NDB) (Richardson & Weiss, 2018) measures the diversity of generated samples.

Results: We randomly generate 1,000 audio samples from each model for evaluation and report the results in Table 2. Our DiffWave model outperforms the baseline models under all metrics, including both automatic and human evaluation. Notably, the quality of the audio samples generated by DiffWave is much higher than that of the WaveNet and WaveGAN baselines (MOS: 3.39 vs. 1.43 and 2.03). Note that the quality of the ground-truth audio itself exhibits large variation. The automatic evaluation metrics also indicate that DiffWave is better at quality, diversity, and matching the marginal label distribution of the training data.
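As a concrete reference for the classifier-based metrics, the Inception Score can be computed from the 10-dimensional logits alone. A minimal sketch of its standard definition, IS = exp(E_x KL(p(y|x) || p(y))); this is the textbook formula, not the exact implementation used in the paper's evaluation:

```python
import numpy as np

def inception_score(logits):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), computed from classifier logits."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)                # p(y|x), shape (N, num_classes)
    p_marg = p.mean(axis=0)                          # marginal p(y) over the sample set
    kl = (p * (np.log(p + 1e-12) - np.log(p_marg + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Confident, class-balanced predictions over 10 digits push IS toward 10, while uniform (uninformative) predictions give IS = 1.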

5.3. CLASS-CONDITIONAL GENERATION

In this section, we provide the digit labels as the conditioner in DiffWave and compare our model to WaveNet. We omit a comparison with conditional WaveGAN because of its noisy output audio (Lee et al., 2018). For both DiffWave and WaveNet, the label conditioner is added to the model as described in Section 3.2. We use the same dataset, model hyperparameters, and training settings as in Section 5.2.

Evaluation: We use slightly different automatic evaluation methods in this section because audio samples are generated according to pre-specified discrete labels. The AM score and NDB are removed because they are less meaningful when the prior label distribution of the generated data is specified. We keep IS and mIS because IS favors sharp, clear samples and mIS measures within-class diversity. We modify FID to FID-class: for each digit from 0 to 9, we compute the FID between the generated audio samples pre-specified as that digit and the training utterances with the same digit label, and report the mean and standard deviation of these ten FID scores. We also report classification accuracy based on the ResNeXT classifier used in Section 5.2.

Results: We randomly generate 100 audio samples for each digit (0 to 9) from all models for evaluation. We report the results in Table 3. Our DiffWave model significantly outperforms WaveNet on all evaluation metrics. It produces higher-quality samples than WaveNet (MOS: 3.50 vs. 1.58) and greatly narrows the gap to the ground truth (the gap between DiffWave and ground truth is ~10% of the gap between WaveNet and ground truth). The automatic evaluation metrics indicate that DiffWave is much better at speech clarity (> 91% accuracy) and within-class diversity (its mIS is 6× higher than WaveNet's). We additionally found that a deep and thin version of DiffWave with residual channels C = 128 and 48 residual layers achieves slightly better accuracy but lower audio quality.
One may also compare the quality of generated audio between conditional and unconditional generation based on IS, mIS, and MOS. For both WaveNet and DiffWave, IS increases by >20%, mIS almost doubles, and MOS increases by ≥0.11. These results indicate that the digit labels reduce the difficulty of the generative task and help improve the generation quality of both WaveNet and DiffWave.

5.4. ADDITIONAL RESULTS

Zero-shot speech denoising: The unconditional DiffWave model can readily perform speech denoising. The SC09 dataset provides six types of noise for data augmentation in recognition tasks: white noise, pink noise, running tap, exercise bike, dude miaowing, and doing the dishes. These noises are not used during the training of our unconditional DiffWave model in Section 5.2. We add 10% of each type of noise to the test data, feed these noisy utterances into the reverse process at $t = 25$, and then obtain the outputs $x_0$. The audio samples are in Section V on the demo website. Note that our model is not trained on a denoising task and has zero knowledge of any noise type other than the white noise added in the diffusion process. This indicates that DiffWave learns a good prior over raw audio.

Interpolation in latent space: We can interpolate with the digit-conditioned DiffWave model from Section 5.3 on the SC09 dataset. The interpolation between the voices $x_0^a, x_0^b$ of two speakers $a, b$ is done in the latent space at $t = 50$. We first sample $x_t^a \sim q(x_t|x_0^a)$ and $x_t^b \sim q(x_t|x_0^b)$ for the two speakers. We then linearly interpolate between $x_t^a$ and $x_t^b$: $x_t^\lambda = (1-\lambda)x_t^a + \lambda x_t^b$ for $0 < \lambda < 1$. Finally, we sample $x_0^\lambda \sim p_\theta(x_0^\lambda | x_t^\lambda)$. The audio samples are in Section VI on the demo website.
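The latent-space mixing step of the interpolation procedure can be sketched with the closed-form $q(x_t|x_0)$. The schedule and lengths are illustrative; the denoising of the mixed latent through $p_\theta$ (omitted here) requires the trained network:

```python
import numpy as np

def interpolate_latents(x0_a, x0_b, lam, t, betas, rng):
    """Diffuse both utterances to step t via q(x_t | x_0), then mix linearly.
    The mixed latent would then be denoised by the reverse process p_theta."""
    alpha_bar = np.cumprod(1.0 - betas)
    scale, noise = np.sqrt(alpha_bar[t - 1]), np.sqrt(1.0 - alpha_bar[t - 1])
    xa_t = scale * x0_a + noise * rng.standard_normal(x0_a.shape)   # x_t^a ~ q(x_t | x_0^a)
    xb_t = scale * x0_b + noise * rng.standard_normal(x0_b.shape)   # x_t^b ~ q(x_t | x_0^b)
    return (1.0 - lam) * xa_t + lam * xb_t                          # x_t^lambda
```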

6. CONCLUSION

In this paper, we present DiffWave, a versatile generative model for raw waveforms. In the neural vocoding task, it readily models the fine details of waveforms conditioned on mel spectrograms and matches a strong autoregressive neural vocoder in terms of speech quality. In the unconditional and class-conditional generation tasks, it properly captures the large variations within the data and produces realistic voices and consistent word-level pronunciations. To the best of our knowledge, DiffWave is the first waveform model that exhibits such versatility. DiffWave raises a number of open problems and provides broad opportunities for future research. For example, it would be meaningful to push the model to generate longer utterances, as DiffWave potentially has a very large receptive field. Second, optimizing the inference speed would be beneficial for applying the model in production TTS, because DiffWave is still slower than flow-based models. We found that the most effective denoising steps in the reverse process occur near $x_0$, which suggests that an even smaller $T$ would be possible in DiffWave. In addition, the model parameters $\theta$ are shared across the reverse process, so persistent kernels that stash the parameters on-chip would greatly speed up inference on GPUs (Diamos et al., 2016).

A PROOF OF PROPOSITION 1

Proof. We expand the ELBO in Eq. (3) into a sum of tractable KL divergences below.
ELBO = E q log p θ (x 0 , • • • , x T -1 |x T ) × p latent (x T ) q(x 1 , • • • , x T |x 0 ) = E q log p latent (x T ) - T t=1 log p θ (x t-1 |x t ) q(x t |x t-1 ) = E q log p latent (x T ) -log p θ (x 0 |x 1 ) q(x 1 |x 0 ) - T t=2 log p θ (x t-1 |x t ) q(x t-1 |x t , x 0 ) + log q(x t-1 |x 0 ) q(x t |x 0 ) = E q log p (x T ) q(x T |x 0 ) -log p θ (x 0 |x 1 ) - T t=2 log p θ (x t-1 |x t ) q(x t-1 |x t , x 0 ) = -E q KL (q(x T |x 0 ) p latent (x T )) + T t=2 KL (q(x t-1 |x t , x 0 ) p θ (x t-1 |x t )) -log p θ (x 0 |x 1 ) Before we calculate these terms individually, we first derive q(x t |x 0 ) and q(x t-1 |x t , x 0 ). Let i 's be independent standard Gaussian random variables. Then, by definition of q and using the notations of constants introduced in Eq. ( 4), we have x t = √ α t x t-1 + √ β t t = √ α t α t-1 x t-2 + α t β t-1 t-1 + √ β t t = √ α t α t-1 α t-1 x t-3 + α t α t-1 β t-2 t-2 + α t β t-1 t-1 + √ β t t = • • • = √ ᾱt x 0 + α t α t-1 • • • α 2 β 1 1 + • • • + α t β t-1 t-1 + √ β t t Note that q(x t |x 0 ) is still Gaussian, and the mean of x t is √ ᾱt x 0 , and the variance matrix is (α t α t-1 • • • α 2 β 1 + • • • + α t β t-1 + β t )I = (1 -ᾱt )I. Therefore, q(x t |x 0 ) = N (x t ; √ ᾱt x 0 , (1 -ᾱt )I). (10) It is worth mentioning that, q(x T |x 0 ) = N (x T ; √ ᾱT x 0 , (1 -ᾱT )I), where ᾱT = T t=1 (1 -β t ) approaches zero with large T . Next, by Bayes rule and Markov chain property, q(x t-1 |x t , x 0 ) = q(x t |x t-1 ) q(x t-1 |x 0 ) q(x t |x 0 ) = N (x t ; √ α t x t-1 , β t I) N (x t-1 ; √ ᾱt-1 x 0 , (1 -ᾱt-1 )I) N (x t ; √ ᾱt x 0 , (1 -ᾱt )I) = (2πβ t ) -d 2 (2π(1 -ᾱt-1 )) -d 2 (2π(1 -ᾱt )) d 2 × exp - x t - √ α t x t-1 2 2β t - x t-1 - √ ᾱt-1 x 0 2 2(1 -ᾱt-1 ) + x t - √ ᾱt x 0 2 2(1 -ᾱt ) = (2π βt ) -d 2 exp - 1 2 βt x t-1 - √ ᾱt-1 β t 1 -ᾱt x 0 - √ α t (1 -ᾱt-1 ) 1 -ᾱt x t 2 Therefore, q(x t-1 |x t , x 0 ) = N (x t-1 ; √ ᾱt-1 β t 1 -ᾱt x 0 + √ α t (1 -ᾱt-1 ) 1 -ᾱt x t , βt I). Now, we calculate each term of the ELBO expansion in Eq. ( 9). 
The first (constant) term is

$$
\begin{aligned}
\mathbb{E}_q\, \mathrm{KL}\big( q(x_T \mid x_0) \,\|\, p_{\mathrm{latent}}(x_T) \big)
&= \mathbb{E}_{x_0}\, \mathrm{KL}\big( \mathcal{N}(\sqrt{\bar{\alpha}_T}\, x_0, (1 - \bar{\alpha}_T) I) \,\|\, \mathcal{N}(0, I) \big) \\
&= \frac{1}{2}\, \mathbb{E}_{x_0} \Big[ \big\| \sqrt{\bar{\alpha}_T}\, x_0 - 0 \big\|^2 + d \log \frac{1}{1 - \bar{\alpha}_T} + d \big( (1 - \bar{\alpha}_T) - 1 \big) \Big] \\
&= \frac{\bar{\alpha}_T}{2}\, \mathbb{E}_{x_0} \| x_0 \|^2 - \frac{d}{2} \big( \bar{\alpha}_T + \log(1 - \bar{\alpha}_T) \big).
\end{aligned}
$$

Next, because $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$ are both Gaussian with the same covariance matrix $\tilde{\beta}_t I$, their KL divergence is $\frac{1}{2\tilde{\beta}_t}$ times the squared $\ell_2$ distance between their means. By the expression of $q(x_t \mid x_0)$, we have $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. Therefore,

$$
\begin{aligned}
&\mathbb{E}_q\, \mathrm{KL}\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \big) \\
&= \frac{1}{2\tilde{\beta}_t}\, \mathbb{E}_{x_0, \epsilon} \left\| \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t - \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \right\|^2 \\
&= \frac{1}{2\tilde{\beta}_t}\, \mathbb{E}_{x_0, \epsilon} \left\| \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \cdot \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon}{\sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t - \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{\beta_t}{\sqrt{\alpha_t} \sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right\|^2 \\
&= \frac{1}{2\tilde{\beta}_t} \cdot \frac{\beta_t^2}{\alpha_t (1 - \bar{\alpha}_t)}\, \mathbb{E}_{x_0, \epsilon} \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \\
&= \frac{\beta_t^2}{2\, \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t\, \alpha_t (1 - \bar{\alpha}_t)}\, \mathbb{E}_{x_0, \epsilon} \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \\
&= \frac{\beta_t}{2 \alpha_t (1 - \bar{\alpha}_{t-1})}\, \mathbb{E}_{x_0, \epsilon} \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2,
\end{aligned}
$$

where the coefficient of $x_t$ in the third line vanishes because $\frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{(1 - \bar{\alpha}_t) \sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} = \frac{\beta_t + \alpha_t - \bar{\alpha}_t}{(1 - \bar{\alpha}_t) \sqrt{\alpha_t}} = \frac{1}{\sqrt{\alpha_t}}$. Finally, as $x_1 = \sqrt{\bar{\alpha}_1}\, x_0 + \sqrt{1 - \bar{\alpha}_1}\, \epsilon = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon$, we have

$$
\begin{aligned}
\mathbb{E}_q \log p_\theta(x_0 \mid x_1)
&= \mathbb{E}_q \log \mathcal{N}\left( x_0;\ \frac{1}{\sqrt{\alpha_1}} \left( x_1 - \frac{\beta_1}{\sqrt{1 - \alpha_1}}\, \epsilon_\theta(x_1, 1) \right),\ \beta_1 I \right) \\
&= \mathbb{E}_q \left[ -\frac{d}{2} \log 2\pi \beta_1 - \frac{1}{2\beta_1} \left\| x_0 - \frac{1}{\sqrt{\alpha_1}} \left( x_1 - \frac{\beta_1}{\sqrt{1 - \alpha_1}}\, \epsilon_\theta(x_1, 1) \right) \right\|^2 \right] \\
&= -\frac{d}{2} \log 2\pi \beta_1 - \frac{1}{2\beta_1}\, \mathbb{E}_{x_0, \epsilon} \left\| x_0 - \frac{1}{\sqrt{\alpha_1}} \left( \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon - \frac{\beta_1}{\sqrt{1 - \alpha_1}}\, \epsilon_\theta(x_1, 1) \right) \right\|^2 \\
&= -\frac{d}{2} \log 2\pi \beta_1 - \frac{1}{2\beta_1}\, \mathbb{E}_{x_0, \epsilon} \left\| \frac{\sqrt{\beta_1}}{\sqrt{\alpha_1}} \big( \epsilon - \epsilon_\theta(x_1, 1) \big) \right\|^2 \\
&= -\frac{d}{2} \log 2\pi \beta_1 - \frac{1}{2\alpha_1}\, \mathbb{E}_{x_0, \epsilon} \big\| \epsilon - \epsilon_\theta(x_1, 1) \big\|^2,
\end{aligned}
$$

using $\sqrt{1 - \alpha_1} = \sqrt{\beta_1}$. The computation of the ELBO is now finished.
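The key reduction above — that the difference between the posterior mean of $q(x_{t-1} \mid x_t, x_0)$ and the model mean $\mu_\theta$ collapses to a multiple of $\epsilon_\theta - \epsilon$ — can be checked numerically. The following sketch (ours) uses an arbitrary schedule and a random stand-in for the network output:

```python
import numpy as np

# Check: with x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps, the posterior mean
# of q(x_{t-1} | x_t, x0) minus the model mean mu_theta equals
#   beta_t / (sqrt(alpha_t) sqrt(1 - abar_t)) * (eps_theta - eps),
# i.e. the coefficient of x_t cancels exactly.
rng = np.random.default_rng(1)
T = 30
betas = np.linspace(1e-4, 0.05, T)
alphas, abar = 1.0 - betas, np.cumprod(1.0 - betas)

t = 17                                   # any index with t >= 1 works here
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
eps_theta = rng.standard_normal(4)       # stand-in for the network output
xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps

post_mean = (np.sqrt(abar[t - 1]) * betas[t] / (1 - abar[t]) * x0
             + np.sqrt(alphas[t]) * (1 - abar[t - 1]) / (1 - abar[t]) * xt)
model_mean = (xt - betas[t] / np.sqrt(1 - abar[t]) * eps_theta) / np.sqrt(alphas[t])

lhs = post_mean - model_mean
rhs = betas[t] / np.sqrt(alphas[t] * (1 - abar[t])) * (eps_theta - eps)
print(np.max(np.abs(lhs - rhs)))         # zero up to floating-point error
```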

B DETAILS OF THE FAST SAMPLING ALGORITHM

Let $T_{\text{infer}} \ll T$ be the number of steps in the reverse process (sampling), and let $\{\eta_t\}_{t=1}^{T_{\text{infer}}}$ be the user-defined variance schedule, which can be independent of the training variance schedule $\{\beta_t\}_{t=1}^{T}$. We compute the corresponding constants in the same way as Eq. (4):

$$\gamma_t = 1 - \eta_t, \qquad \bar{\gamma}_t = \prod_{s=1}^{t} \gamma_s, \qquad \tilde{\eta}_t = \frac{1 - \bar{\gamma}_{t-1}}{1 - \bar{\gamma}_t}\, \eta_t \ \text{ for } t > 1, \qquad \tilde{\eta}_1 = \eta_1. \tag{13}$$

At step $s$ during sampling, we need to select a diffusion step $t$ and use $\epsilon_\theta(\cdot, t)$ to eliminate noise. This is realized by aligning the noise levels from the user-defined and the training variance schedules. Ideally, we want $\sqrt{\bar{\alpha}_t} = \sqrt{\bar{\gamma}_s}$. Since this is not always possible, we interpolate $\sqrt{\bar{\gamma}_s}$ between two consecutive training noise levels $\sqrt{\bar{\alpha}_{t+1}}$ and $\sqrt{\bar{\alpha}_t}$ whenever $\sqrt{\bar{\gamma}_s}$ lies between them. We therefore obtain the desired aligned diffusion step, denoted $t^{\text{align}}_s$, via

$$t^{\text{align}}_s = t + \frac{\sqrt{\bar{\alpha}_t} - \sqrt{\bar{\gamma}_s}}{\sqrt{\bar{\alpha}_t} - \sqrt{\bar{\alpha}_{t+1}}} \qquad \text{if } \sqrt{\bar{\gamma}_s} \in \big[ \sqrt{\bar{\alpha}_{t+1}}, \sqrt{\bar{\alpha}_t} \big]. \tag{14}$$

Note that $t^{\text{align}}_s$ is a floating-point number, in contrast to the integer diffusion step used at training. Finally, the parameterizations of $\mu_\theta$ and $\sigma_\theta$ are defined in a similar way as Eq. (5):

$$\mu^{\text{fast}}_\theta(x_s, s) = \frac{1}{\sqrt{\gamma_s}} \left( x_s - \frac{\eta_s}{\sqrt{1 - \bar{\gamma}_s}}\, \epsilon_\theta\big( x_s, t^{\text{align}}_s \big) \right), \qquad \sigma^{\text{fast}}_\theta(x_s, s) = \tilde{\eta}_s^{\frac{1}{2}}. \tag{15}$$

The fast sampling procedure is summarized in Algorithm 3.

Algorithm 3 Fast Sampling
  Sample $x_{T_{\text{infer}}} \sim p_{\text{latent}} = \mathcal{N}(0, I)$
  for $s = T_{\text{infer}}, T_{\text{infer}} - 1, \cdots, 1$ do
    Compute $\mu^{\text{fast}}_\theta(x_s, s)$ and $\sigma^{\text{fast}}_\theta(x_s, s)$ using Eq. (15)
    Sample $x_{s-1} \sim \mathcal{N}\big( x_{s-1}; \mu^{\text{fast}}_\theta(x_s, s), \sigma^{\text{fast}}_\theta(x_s, s)^2 I \big)$
  end for
  return $x_0$

In the neural vocoding task, we use the user-defined variance schedule $\{0.0001, 0.001, 0.01, 0.05, 0.2, 0.7\}$ for DiffWave LARGE and $\{0.0001, 0.001, 0.01, 0.05, 0.2, 0.5\}$ for DiffWave BASE in Section 5.1. The fast sampling algorithm is similar to the sampling algorithm in Chen et al. (2020) in the sense of treating the noise levels as a controllable variable during sampling.
However, the fast sampling algorithm for DiffWave does not need to modify the training procedure (Algorithm 1), and can just reuse the trained model checkpoint with large T .
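The alignment and sampling loop above can be sketched in a few lines. In this sketch (ours), `eps_theta` is a dummy that returns zeros rather than a trained network, so the output is not meaningful audio; the schedules match those stated above, and shapes are toy-sized:

```python
import numpy as np

# Sketch of the fast sampling loop (Algorithm 3) with noise-level alignment.
rng = np.random.default_rng(0)

T = 200
betas_train = np.linspace(1e-4, 0.02, T)             # training schedule
sqrt_abar = np.sqrt(np.cumprod(1.0 - betas_train))   # sqrt(abar_t), t = 1..T

etas = np.array([1e-4, 1e-3, 1e-2, 0.05, 0.2, 0.7])  # user-defined schedule
gammas = 1.0 - etas
gbar = np.cumprod(gammas)                            # \bar{gamma}_s
eta_tilde = np.concatenate(([etas[0]],
                            (1 - gbar[:-1]) / (1 - gbar[1:]) * etas[1:]))

def t_align(s):
    """Fractional training step aligned with sampling step s (1-indexed)."""
    g = np.sqrt(gbar[s - 1])
    # largest (1-indexed) t with sqrt(abar_t) >= g; sqrt_abar is decreasing
    i = int(np.searchsorted(-sqrt_abar, -g, side="right")) - 1
    i = min(max(i, 0), T - 2)
    return (i + 1) + (sqrt_abar[i] - g) / (sqrt_abar[i] - sqrt_abar[i + 1])

def eps_theta(x, t_cont):
    return np.zeros_like(x)          # placeholder for the trained network

L = 8                                # toy "waveform" length
x = rng.standard_normal(L)           # x_{T_infer} ~ N(0, I)
for s in range(len(etas), 0, -1):
    mu = (x - etas[s - 1] / np.sqrt(1 - gbar[s - 1])
          * eps_theta(x, t_align(s))) / np.sqrt(gammas[s - 1])
    x = mu + np.sqrt(eta_tilde[s - 1]) * rng.standard_normal(L)
print(x.shape, float(t_align(1)))
```

Since the first user-defined variance equals the first training variance here, the aligned step at $s = 1$ is exactly $1$.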



One can find that $q(x_T \mid x_0)$ approaches an isotropic Gaussian for large $T$ in Eq. (11) in Appendix A. Indeed, we found the causal dilated convolution architecture leads to worse audio quality in DiffWave. The MOS evaluation for DiffWave (Fast) with $T_{\text{infer}} = 6$ was done after paper submission and may not be directly comparable to previous scores.



Figure 2: The network architecture of DiffWave in modeling $\epsilon_\theta: \mathbb{R}^L \times \mathbb{N} \to \mathbb{R}^L$.

… (e.g., up to 2048) leads to unstable training. For WaveGAN, we use their pretrained model on Google Colab. We use a 36-layer DiffWave model with kernel size 3 and dilation cycle $[1, 2, \cdots, 2048]$. We set the number of diffusion steps $T = 200$ and residual channels $C = 256$. We use the linearly spaced schedule $\beta_t \in [1 \times 10^{-4}, 0.02]$.
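The receptive field implied by this configuration is easy to compute. The following back-of-the-envelope sketch (ours) assumes a bidirectional (non-causal) stack of kernel-size-3 dilated convolutions, where each layer with dilation $d$ widens the receptive field by $2d$; it also builds the linearly spaced variance schedule stated above:

```python
import numpy as np

# Receptive field of 36 bidirectional dilated-conv layers, kernel size 3,
# dilation cycle [1, 2, 4, ..., 2048] repeated 3 times: each layer with
# dilation d adds 2*d taps on top of the single starting sample.
dilations = [2 ** i for i in range(12)] * 3           # 36 layers
receptive_field = 1 + sum(2 * d for d in dilations)
print(receptive_field)

# Linearly spaced variance schedule with T = 200 steps.
betas = np.linspace(1e-4, 0.02, 200)
print(betas[0], betas[-1], len(betas))
```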

…that contains ∼24 hours of audio recorded in a home environment with a sampling rate of 22.05 kHz. It contains 13,100 utterances from a female speaker. We compare DiffWave with several state-of-the-art neural vocoders, including WaveNet, ClariNet, WaveGlow, and WaveFlow. Details of the baseline models can be found in the original papers; their hyperparameters are listed in Table 1. Our DiffWave models have 30 residual layers, kernel size 3, and dilation cycle $[1, 2, \cdots, 512]$. We compare DiffWave models with different numbers of diffusion steps $T \in \{20, 40, 50, 200\}$ and residual channels $C \in \{64, 128\}$. We use the linearly spaced schedule $\beta_t \in [1 \times 10^{-4}, 0.02]$ for DiffWave with $T = 200$, and $\beta_t \in [1 \times 10^{-4}, 0.05]$ for DiffWave with $T \le 50$.

Table 1: The model hyperparameters, model footprint, and 5-scale Mean Opinion Score (MOS) with 95% confidence intervals for WaveNet, ClariNet, WaveFlow, WaveGlow, and the proposed DiffWave on the neural vocoding task. ↑ means higher is better, and ↓ means lower is better.

For each layer, the upsampling stride in time is 16 and the 2-D filter sizes are [32, 3]. After upsampling, we use a layer-specific Conv1×1 to map the 80 mel bands into 2× residual channels, then add the conditioner as a bias term for the dilated convolution before the gated-tanh nonlinearities in each residual layer.
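The conditioner pathway in a residual layer can be sketched in numpy. This sketch (ours) shows only the Conv1×1 projection of the mel bands to 2×C channels, the bias addition, and the gated-tanh split; the upsampling, dilated convolution, and skip connections are omitted, and all shapes and weights are illustrative placeholders:

```python
import numpy as np

# Gated-tanh with a mel conditioner added as a bias (simplified residual layer).
rng = np.random.default_rng(3)
C, L_frames, mel_bands = 4, 16, 80

x = rng.standard_normal((2 * C, L_frames))           # dilated-conv output, 2C channels
mel_up = rng.standard_normal((mel_bands, L_frames))  # conditioner after upsampling
W = rng.standard_normal((2 * C, mel_bands)) * 0.1    # layer-specific Conv1x1 weights

cond = W @ mel_up                                    # (2C, L): 1x1 conv over channels
h = x + cond                                         # conditioner added as a bias
out = np.tanh(h[:C]) * (1.0 / (1.0 + np.exp(-h[C:])))  # gated-tanh: tanh * sigmoid
print(out.shape)
```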

Table 2: The automatic evaluation metrics (FID, IS, mIS, AM, and NDB/K), and 5-scale MOS with 95% confidence intervals for WaveNet, WaveGAN, and DiffWave on the unconditional generation task. ↑ means higher is better, and ↓ means lower is better.

Table 3: The automatic evaluation metrics (Accuracy, FID-class, IS, mIS), and 5-scale MOS with 95% confidence intervals for WaveNet and DiffWave on the class-conditional generation task.

Audio samples are available at https://diffwave-demo.github.io/.

C DETAILS OF THE MODEL ARCHITECTURE

(", $, %) (", $, %)(", 1, %) The automatic evaluation metrics used in Section 5.2 and 5.3 are described as follows. Given an input audio x, an 1024-dimensional feature vector (denoted as F feature (x)) is computed by the ResNeXT F, and is then transformed to the 10-dimensional multinomial distribution (denoted as p F (x)) with a fully connected layer and a softmax layer. Let X train be the trainset, p gen be the distribution of generated data, and X gen ∼ p gen (i.i.d.) be the set of generated audio samples. Then, we compute the following automatic evaluation metrics:• Fréchet Inception Distance (FID) (Heusel et al., 2017) computes the Wasserstein-2 distance between Gaussians fitted to F feature (X train ) and F feature (X gen ). That is,, where µ t , Σ t are the mean vector and covariance matrix of F feature (X train ), and where µ g , Σ g are the mean vector and covariance matrix of F feature (X gen ).• Inception Score (IS) (Salimans et al., 2016) computes the following:where E x ∼pgen p F (x ) is the marginal label distribution. • Modified Inception Score (mIS) (Gurumurthy et al., 2017) computes the following:• AM Score (Zhou et al., 2017) computes the following:where H(•) computes the entropy. Compared to IS, AM score takes into consideration the the prior distribution of p F (X train ).• Number of Statistically-Different Bins (NDB) (Richardson & Weiss, 2018) : First, X train is clustered into K bins by K-Means in the feature space (where K = 50 in our evaluation). Next, each sample in X gen is assigned to its nearest bin. Then, NDB is the number of bins that contain statistically different proportion of samples between training samples and generated samples.

