AUDIOGEN: TEXTUALLY GUIDED AUDIO GENERATION

Abstract

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AUDIOGEN, an auto-regressive generative model that generates audio samples conditioned on text inputs. AUDIOGEN operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AUDIOGEN outperforms on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations, both conditionally and unconditionally.
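The mixing augmentation mentioned above can be sketched as follows. The paper only states that audio samples (and their captions) are mixed on the fly; the SNR range, the power-based gain computation, and the "while"-based caption template below are our own illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def mix_audio_and_text(wav_a, caption_a, wav_b, caption_b,
                       snr_db_range=(-5.0, 5.0), rng=None):
    """Mix two waveforms at a random SNR and join their captions.

    Hypothetical sketch: the exact mixing recipe (SNR range, caption
    template) is assumed, not taken from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Truncate to a common length so the signals can be summed.
    n = min(len(wav_a), len(wav_b))
    wav_a, wav_b = wav_a[:n], wav_b[:n]
    # Sample a target signal-to-noise ratio and scale wav_b to match it.
    snr_db = rng.uniform(*snr_db_range)
    power_a = np.mean(wav_a ** 2) + 1e-9
    power_b = np.mean(wav_b ** 2) + 1e-9
    gain = np.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    mixed = wav_a + gain * wav_b
    # Peak-normalize only if the mixture would clip.
    mixed = mixed / max(1.0, np.abs(mixed).max())
    return mixed, f"{caption_a} while {caption_b}"
```

Training on such mixtures exposes the model to compositions of sources (and caption fragments) that never co-occur in any single recording.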

1. INTRODUCTION

Neural generative models have challenged the way we create digital content. From generating high-quality images (Karras et al., 2019; Park et al., 2019) and speech (Ren et al., 2021; Oord et al., 2016), through generating long textual spans (Brown et al., 2020; Zhang et al., 2022), to the recently proposed text-prompted image generation (Ramesh et al., 2022; Rombach et al., 2022), these models have shown impressive results. This raises the question: what would be the audio equivalent of textually guided generative models? From generating soundscapes to music or speech, a solution to this problem that is high fidelity, controllable, and diverse in its outputs would be a useful addition to the modern toolbox of creators of movies, video games, and other virtual environments. While image generation and audio generation have much in common, there are a few key differences. Audio is intrinsically a one-dimensional signal and thus has fewer degrees of freedom to differentiate overlapping "objects" (Capon, 1969; Frost, 1972). Real-world audio inherently contains reverberation, which makes the task of differentiating objects from the surrounding environment even harder. Moreover, psychoacoustic and psychovisual properties differ; for instance, hearing "resolution" (equal loudness) is U-shaped in frequency, with a dip at 4kHz and a bump at 8kHz (Suzuki et al., 2003). Last but not least, the availability of audio data with textual descriptions is orders of magnitude below that of text-image paired data. This makes generating unseen audio compositions a hard task (e.g., generating an audio equivalent of an image of "an astronaut riding a horse in space").

In this work, we tackle the problem of generating audio conditioned on descriptive text captions. We additionally extend the proposed method to conditional and unconditional audio continuation. Consider, for example, generating "a dog barks while somebody plays the trumpet in a busy street".
In the above prompt, the model must generate three categories of acoustic content with varying degrees of background/foreground, durations, and relative positions on the temporal axis, a composition that is highly unlikely to be present in the training set. Generating such audio is thus a challenge in generalization, acoustic fidelity, production, and mastering.

We propose AUDIOGEN, an auto-regressive textually guided audio generation model. AUDIOGEN consists of two main stages. The first encodes raw audio into a discrete sequence of tokens using a neural audio compression model (e.g., Zeghidour et al. (2021)). This model is trained in an end-to-end fashion to reconstruct the input audio from the compressed representation, with the addition of a perceptual loss in the form of a set of discriminators. Such an audio representation is designed to support high-fidelity audio generation while still being compact. The second stage leverages an auto-regressive Transformer-decoder language model that operates on the discrete audio tokens obtained from the first stage while also being conditioned on textual inputs. We represent text using a separate text-encoder model pre-trained on a large corpus of text, namely T5 (Raffel et al., 2020). The pre-trained text encoder enables generalization to text concepts that are absent from current text-audio datasets. This is especially important when working with text annotations limited in terms of diversity and descriptiveness. Compared to the existing text-to-audio work (Yang et al., 2022), AUDIOGEN generates samples that obtain better objective and subjective metrics. In particular, AUDIOGEN creates more natural-sounding unseen audio compositions. Lastly, we empirically show how the proposed approach can be extended to audio continuation, considering both conditional and unconditional generation.
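The second stage described above can be sketched as a Transformer decoder over audio tokens that cross-attends to the frozen text encoder's states. The sketch below is illustrative only: the vocabulary size, model dimensions, and layer counts are placeholders, and the actual AUDIOGEN architecture (multi-stream decoding, conditioning details) is not reproduced here.

```python
import torch
import torch.nn as nn

class AudioTokenLM(nn.Module):
    """Illustrative stage-two sketch: an auto-regressive Transformer decoder
    over discrete audio tokens, cross-attending to pre-computed text-encoder
    (e.g. T5) hidden states. Dimensions are assumptions, not the paper's."""

    def __init__(self, vocab_size=1024, d_model=512, n_layers=4,
                 n_heads=8, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, text_states):
        # audio_tokens: (batch, T) token ids; text_states: (batch, S, d_model).
        t = audio_tokens.size(1)
        positions = torch.arange(t, device=audio_tokens.device)
        x = self.embed(audio_tokens) + self.pos(positions)
        # Causal mask so each position attends only to its past.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                     device=audio_tokens.device), 1)
        h = self.decoder(x, text_states, tgt_mask=mask)
        return self.head(h)  # next-token logits: (batch, T, vocab_size)
```

Generation then proceeds token by token, feeding sampled tokens back in and finally decoding the token sequence to a waveform with the first-stage decoder.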
Our contributions: (i) We propose a state-of-the-art auto-regressive audio generation model conditioned on textual descriptions or audio prompts, as evaluated with objective and subjective (human listeners) scores. Specifically, we propose two model variations, one with 285M parameters and another with 1B parameters; (ii) We improve text-to-audio generation along two axes: we improve text adherence by applying classifier-free guidance on top of the audio language model, and we improve compositionality by performing on-the-fly text and audio mixing; (iii) We show that the proposed approach can be extended to audio continuation, both conditioned on text and unconditionally; (iv) We explore the trade-off between audio fidelity and sampling time by utilizing residual vector quantization (for acoustic units) and multi-stream transformers.

2. RELATED WORK

Speech Representation Learning. Studies on unsupervised speech representation learning can be roughly divided into reconstruction and self-supervised learning methods. Auto-encoding is the common approach for signal reconstruction, where speech is first encoded into a low-dimensional latent representation and then decoded back to speech. Various constraints can be imposed on the encoded space, such as temporal smoothness (Ebbers et al., 2017), discreteness (van den Oord et al., 2017b), and hierarchy (Hsu et al., 2017). Self-Supervised Learning (SSL) methods for speech have shown remarkable results for automatic speech recognition (Schneider et al., 2019; Baevski et al., 2020; Wang et al., 2021), phoneme segmentation (Kreuk et al., 2020), and audio compression (Zeghidour et al., 2021; Polyak et al., 2021). van den Oord et al. (2018) and Schneider et al. (2019) suggested training a convolutional neural network to distinguish true future samples from random distractor samples using a Contrastive Predictive Coding (CPC) loss function. Ao et al. (2022) proposed a speech version of the T5 model and showed its efficiency on various speech tasks. Similar to CPC, Baevski et al. (2020) use an encoder and a predictor, trained contrastively to distinguish positive from negative samples. Unlike Schneider et al. (2019), it discretizes and masks segments of the encoder's output. Hsu et al. (2021) proposed the HuBERT model, which is trained with a masked prediction task similar to BERT (Devlin et al., 2019) but over masked continuous audio signals. Chen et al. (2022) proposed a similar version of HuBERT trained on a larger, augmented dataset. More recently, Huang et al. (2022) proposed a Masked Auto-Encoding approach for learning a speech representation and showed its efficiency on several audio classification tasks.

Another line of relevant prior work relates to modeling discrete audio representations. Recent studies suggest quantizing SSL representations using k-means and later performing language modeling (Lakhotia et al., 2021).
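The CPC objective discussed above can be sketched as an InfoNCE-style contrastive loss: a predicted future representation is scored against the true future encoder output and a set of random distractors, and the model is trained to rank the true future first. The dot-product scoring and tensor shapes below are a common simplification, not the exact formulation of any of the cited papers.

```python
import torch
import torch.nn.functional as F

def cpc_infonce_loss(context, future, distractors):
    """Contrastive Predictive Coding sketch (InfoNCE form).

    context:     (batch, dim) prediction for a future time step
    future:      (batch, dim) true encoder output at that step (positive)
    distractors: (batch, k, dim) negatives sampled from other steps

    Shapes and the plain dot-product similarity are illustrative
    assumptions for this sketch.
    """
    pos = (context * future).sum(-1, keepdim=True)          # (batch, 1)
    neg = torch.einsum('bd,bkd->bk', context, distractors)  # (batch, k)
    logits = torch.cat([pos, neg], dim=1)                   # (batch, 1 + k)
    # The positive is always at index 0, so the target class is 0.
    targets = torch.zeros(context.size(0), dtype=torch.long,
                          device=context.device)
    return F.cross_entropy(logits, targets)
```

Minimizing this loss drives the context representation to be more similar to the true future than to any distractor, which is the shared mechanism behind the CPC and wav2vec 2.0 objectives cited above.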

Samples are available at https://felixkreuk.github.io/audiogen.

