BIGVGAN: A UNIVERSAL NEURAL VOCODER WITH LARGE-SCALE TRAINING

Abstract

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates a raw waveform conditioned on acoustic features, it remains challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well for various out-of-distribution scenarios without fine-tuning. We introduce a periodic activation function and anti-aliased representation into the GAN generator, which brings the desired inductive bias for audio synthesis and significantly improves audio quality. In addition, we train our GAN vocoder at the largest scale, up to 112M parameters, which is unprecedented in the literature. We identify and address the failure modes in large-scale GAN training for audio while maintaining high-fidelity output without over-regularization. BigVGAN, trained only on clean speech (LibriTTS), achieves state-of-the-art performance for various zero-shot (out-of-distribution) conditions, including unseen speakers, languages, recording environments, singing voices, music, and instrumental audio. We release our code and model at: https://github.com/NVIDIA/BigVGAN

1. INTRODUCTION

Deep generative models have demonstrated notable success in modeling raw audio. Successful methods include autoregressive models (van den Oord et al., 2016; Mehri et al., 2017; Kalchbrenner et al., 2018), flow-based models (van den Oord et al., 2018; Ping et al., 2019; Prenger et al., 2019; Kim et al., 2019; Ping et al., 2020; Lee et al., 2020), GAN-based models (Donahue et al., 2019; Kumar et al., 2019; Bińkowski et al., 2020; Yamamoto et al., 2020; Kong et al., 2020), and diffusion models (Kong et al., 2021; Chen et al., 2021; Lee et al., 2022). Among these methods, GAN-based vocoders (e.g., Kong et al., 2020) can generate high-fidelity raw audio conditioned on a mel spectrogram while synthesizing hundreds of times faster than real time on a single GPU. However, due to their limited model capacity, existing GAN vocoders are confined to settings with a moderate number of voices recorded in clean environments. Audio quality can degrade heavily when the models are conditioned on mel spectrograms from unseen speakers in different recording environments.

In practice, a universal vocoder that can perform zero-shot generation for out-of-distribution samples is very valuable in many real-world applications, including text-to-speech with numerous speakers (Ping et al., 2018), neural voice cloning (Arik et al., 2018; Jia et al., 2018), voice conversion (Liu et al., 2018), speech-to-speech translation (Jia et al., 2019), and neural audio codec (Zeghidour et al., 2021). In these applications, the neural vocoder also needs to generalize well for audio recorded under various conditions.
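The periodic activation referenced in the abstract is the Snake function, snake(x) = x + (1/α) sin²(αx), whose sinusoidal term injects a periodic inductive bias while the identity term preserves a monotonic trend. A minimal NumPy sketch follows; note that in the actual model α is a learnable per-channel parameter, simplified here to a fixed scalar for illustration:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake periodic activation: x + (1/alpha) * sin^2(alpha * x).

    The sin^2 term is periodic with period pi/alpha, giving the network
    a bias toward periodic structure (useful for audio waveforms), while
    the identity term x keeps the function monotonically trending.
    alpha is fixed here; the real generator learns it per channel.
    """
    return x + np.sin(alpha * x) ** 2 / alpha

# The deviation from the identity, snake(x) - x, repeats with period pi/alpha.
x = np.linspace(-np.pi, np.pi, 5)
deviation = snake(x, alpha=2.0) - x
```

Because the deviation from the identity is bounded by 1/α, larger α yields a weaker (higher-frequency, smaller-amplitude) periodic component.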


