BIGVGAN: A UNIVERSAL NEURAL VOCODER WITH LARGE-SCALE TRAINING

Abstract

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning. We introduce periodic activation functions and anti-aliased representations into the GAN generator, which bring the desired inductive bias for audio synthesis and significantly improve audio quality. In addition, we train our GAN vocoder at an unprecedented scale of up to 112M parameters. We identify and address the failure modes in large-scale GAN training for audio, while maintaining high-fidelity output without over-regularization. Our BigVGAN, trained only on clean speech (LibriTTS), achieves state-of-the-art performance under various zero-shot (out-of-distribution) conditions, including unseen speakers, languages, recording environments, singing voices, music, and instrumental audio. We release our code and model at: https://github.com/NVIDIA/BigVGAN.

1. INTRODUCTION

Deep generative models have demonstrated notable success in modeling raw audio. Successful approaches include autoregressive models (van den Oord et al., 2016; Mehri et al., 2017; Kalchbrenner et al., 2018), flow-based models (van den Oord et al., 2018; Ping et al., 2019; Prenger et al., 2019; Kim et al., 2019; Ping et al., 2020; Lee et al., 2020), GAN-based models (Donahue et al., 2019; Kumar et al., 2019; Bińkowski et al., 2020; Yamamoto et al., 2020; Kong et al., 2020), and diffusion models (Kong et al., 2021; Chen et al., 2021; Lee et al., 2022). Among these methods, GAN-based vocoders (e.g., Kong et al., 2020) can generate high-fidelity raw audio conditioned on mel spectrogram, while synthesizing hundreds of times faster than real-time on a single GPU. However, existing GAN vocoders are confined to settings with a moderate number of voices recorded in clean environments due to limited model capacity. Audio quality can degrade heavily when the models are conditioned on mel spectrograms from unseen speakers in different recording environments.

In practice, a universal vocoder that can perform zero-shot generation for out-of-distribution samples is very valuable in many real-world applications, including text-to-speech with numerous speakers (Ping et al., 2018), neural voice cloning (Arik et al., 2018; Jia et al., 2018), voice conversion (Liu et al., 2018), speech-to-speech translation (Jia et al., 2019), and neural audio codecs (Zeghidour et al., 2021). In these applications, the neural vocoder also needs to generalize well to audio recorded under various conditions. Scaling up the model size for zero-shot performance is a noticeable trend in text generation (e.g., Brown et al., 2020) and image synthesis (e.g., Ramesh et al., 2021), but has not been explored in audio synthesis.
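The mel spectrogram that conditions such vocoders can be computed with a short-time Fourier transform followed by a triangular mel filterbank. The following is a minimal numpy sketch of this pipeline; the exact STFT parameters and normalization used by specific vocoders differ, and all function names here are illustrative, not from any released codebase.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:  # rising slope of the triangle
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:  # falling slope of the triangle
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Windowed STFT magnitudes, projected onto the mel filterbank.
    win = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * win
              for i in range(0, len(wav) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.clip(mel, 1e-5, None))  # shape: [num_frames, n_mels]
```

A vocoder generator then maps this [num_frames, n_mels] conditioning back to a raw waveform at the original sampling rate.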
Although likelihood-based models are generally easier to scale owing to their simple training objectives and stable optimization, we build our universal vocoder with large-scale GAN training, because GAN vocoders have the following advantages: i) In contrast to autoregressive or diffusion models, they are fully parallel and require only one forward pass to generate the high-dimensional waveform. ii) In contrast to flow-based models (Prenger et al., 2019), they do not enforce any architectural constraints (e.g., affine coupling layers) that maintain the bijection between latent and data. Such architectural constraints can limit model capacity given the same number of parameters (Ping et al., 2020).

In this work, we present BigVGAN, a Big Vocoding GAN that enables high-fidelity out-of-distribution (OOD) generation without fine-tuning. Specifically, we make the following contributions:

1. We introduce periodic activations into the generator, which provide the desired inductive bias for audio synthesis. Inspired by methods proposed for other domains (Liu et al., 2020; Sitzmann et al., 2020), we demonstrate the notable success of periodic activations in audio synthesis.
2. We propose the anti-aliased multi-periodicity composition (AMP) module for modeling complex audio waveforms. AMP composes multiple signal components with learnable periodicities and uses a low-pass filter to reduce high-frequency artifacts.
3. We successfully scale BigVGAN up to 112M parameters by fixing the failure modes of large-scale GAN training without regularizing either the generator or the discriminator. These empirical insights differ from those of Brock et al. (2019) in the image domain. For example, regularization methods (e.g., Miyato et al., 2018) introduce phase mismatch artifacts in audio synthesis.
4. We demonstrate that BigVGAN-base with 14M parameters outperforms state-of-the-art neural vocoders of comparable size on both in-distribution and out-of-distribution samples.
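The two architectural ideas above can be sketched concretely. The Snake periodic activation of Liu et al. (2020) adds a bounded periodic term to the identity, and applying any such nonlinearity at a temporarily upsampled rate with low-pass filtering reduces aliasing, in the spirit of Karras et al. (2021). The numpy sketch below is illustrative only: the filter design, upsampling scheme, and function names are simplified assumptions, not BigVGAN's actual implementation.

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x).
    # Monotonic trend plus a periodic residual -> periodic inductive bias.
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2

def lowpass_sinc(num_taps=31, cutoff=0.25):
    # Windowed-sinc low-pass FIR filter (cutoff in cycles/sample).
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2.0 * cutoff * n) * np.hanning(num_taps)
    return h / h.sum()

def antialiased_snake(x, alpha=1.0, up=2):
    # 1) upsample by zero-stuffing, then low-pass interpolate
    up_x = np.zeros(len(x) * up)
    up_x[::up] = x * up
    up_x = np.convolve(up_x, lowpass_sinc(), mode="same")
    # 2) apply the nonlinearity at the higher sampling rate, where the
    #    harmonics it creates stay below the original Nyquist frequency
    y = snake(up_x, alpha)
    # 3) low-pass filter and downsample back to the original rate
    y = np.convolve(y, lowpass_sinc(), mode="same")
    return y[::up]
```

In the actual AMP module, several such branches with learnable periodicities (per-channel alpha) are composed inside each upsampling block of the generator.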
In particular, BigVGAN with 112M parameters outperforms state-of-the-art models by a large margin in zero-shot generation across various OOD scenarios, including unseen speakers, novel languages, singing voices, music, and instrumental audio in varied unseen recording environments. We organize the rest of the paper as follows. We discuss related work in § 2 and present BigVGAN in § 3. We report empirical results in § 4 and conclude the paper in § 5.

2. RELATED WORK

Our work builds upon the state of the art in GANs for image and audio synthesis. GAN was first proposed for image synthesis (Goodfellow et al., 2014). Since then, impressive results have been obtained through optimized architectures (e.g., Radford et al., 2016; Karras et al., 2021) or large-scale training (e.g., Brock et al., 2019). In audio synthesis, previous works focus on improving the discriminator architectures or adding new auxiliary training losses. MelGAN (Kumar et al., 2019) introduces the multi-scale discriminator (MSD), which uses average pooling to downsample the raw waveform at multiple scales and applies window-based discriminators at each scale separately. It also enforces the mapping between the input mel spectrogram and the generated waveform via an ℓ1 feature matching loss from the discriminator. In contrast, GAN-TTS (Bińkowski et al., 2020) uses an ensemble of discriminators that operate on random windows of different sizes, and enforces the mapping between the conditioner and the waveform adversarially using conditional discriminators. Parallel WaveGAN (Yamamoto et al., 2020) extends the single short-time Fourier transform (STFT) loss (Ping et al., 2019) to multi-resolution, and adds it as an auxiliary loss for GAN training. Yang et al. (2021) and Mustafa et al. (2021) further improve MelGAN by incorporating the multi-resolution STFT loss. HiFi-GAN (Kong et al., 2020) reuses the MSD from MelGAN, and introduces the multi-period discriminator (MPD) for high-fidelity synthesis. UnivNet (Jang et al., 2020; 2021) uses the multi-resolution discriminator (MRD), which takes multi-resolution spectrograms as input and can sharpen the spectral structure of the synthesized waveform. In contrast, CARGAN (Morrison et al., 2022) incorporates partial autoregression (Ping et al., 2020) into the generator to improve pitch and periodicity accuracy.

In this work, we focus on improving and scaling up the generator. We introduce the periodic inductive bias for audio synthesis and address the feature aliasing issues within the non-autoregressive generator architecture. Our architectural design has a connection with the latest results in time-series prediction (Liu et al., 2020), implicit neural representations (Sitzmann et al., 2020), and image synthesis (Karras et al., 2021). Note that You et al. (2021) argues that different generator
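The multi-resolution STFT loss used as an auxiliary objective in this line of work compares magnitude spectrograms of real and generated waveforms at several FFT sizes. Below is a minimal numpy sketch combining the two standard terms, spectral convergence and log-magnitude L1; the specific resolutions, windows, and weightings in Yamamoto et al. (2020) differ, and the function names are illustrative.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Hann-windowed STFT magnitude, shape [num_frames, n_fft // 2 + 1].
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def mr_stft_loss(fake, real,
                 resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Average of spectral-convergence and log-magnitude L1 terms
    # over several (n_fft, hop) resolutions.
    loss = 0.0
    for n_fft, hop in resolutions:
        F = stft_mag(fake, n_fft, hop)
        R = stft_mag(real, n_fft, hop)
        sc = np.linalg.norm(R - F) / (np.linalg.norm(R) + 1e-8)
        mag = np.mean(np.abs(np.log(F + 1e-8) - np.log(R + 1e-8)))
        loss += sc + mag
    return loss / len(resolutions)
```

Because each resolution trades time against frequency localization, combining several of them penalizes artifacts that any single STFT size would miss.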


