UNIVERSAL SPEECH ENHANCEMENT WITH SCORE-BASED DIFFUSION

Abstract

Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.

1. INTRODUCTION

Real-world recorded speech almost inevitably contains background noise, which can be unpleasant and prevent intelligibility. Removing background noise has traditionally been the objective of speech enhancement algorithms (Loizou, 2013). Since the 1940s (Kolmogorov, 1941; Wiener, 1949), a myriad of denoising approaches based on filtering have been proposed, with a focus on stationary noises. With the advent of deep learning, the task has been dominated by neural networks, often outperforming more classical algorithms and generalizing to multiple noise types (Lu et al., 2013; Pascual et al., 2017; Rethage et al., 2018; Défossez et al., 2020; Fu et al., 2021). Despite recent progress, speech denoising still presents room for improvement, especially when dealing with distribution shift or real-world recordings.

Noise, however, is only one of the many potential disturbances that can be present in speech recordings. If recordings are performed in a closed room, reverberation is ubiquitous. With this in mind, a number of works have recently started to broaden the focus in order to embrace more realistic situations and tackle noise and reverberation at the same time (Su et al., 2021; 2019; Polyak et al., 2021). Some of these works adopt a generation or re-generation strategy (Maiti & Mandel, 2019), in which a two-stage approach is employed to first enhance and then synthesize speech signals. Despite the relative success of this strategy, it is still an open question whether such approaches can perceptually outperform the purely supervised ones, especially in terms of realism and lack of voice artifacts.

Besides noise and reverberation, a few works propose to go one step further by considering additional distortions. Pascual et al. (2019) introduce a broader notion of speech enhancement by considering whispered speech, bandwidth reduction, silent gaps, and clipping. More recently, Nair & Koishida (2021) consider silent gaps, clipping, and codec artifacts, and Zhang et al. (2021a) consider clipping and codec artifacts. In concurrent work, Liu et al. (2021) deal with bandwidth reduction and clipping in addition to noise and reverberation. Despite the recent efforts to go beyond pure denoising, we are not aware of any speech enhancement system that tackles more than 2-4 distortions at the same time.

In this work, we take a holistic approach and regard the task of speech enhancement as a universal endeavor. We believe that, for realistic speech enhancement situations, algorithms need not only to face and improve upon background noise and possibly reverberation, but also to correct a large number of typical but usually neglected distortions that are present in everyday recordings or amateur-produced audio, such as bandwidth reduction, clipping, codec artifacts, silent gaps, excessive dynamics compression/expansion, sub-optimal equalization, noise gating, and others (in total, we deal with 55 distortions, which can be grouped into 10 different families). Our solution relies on an end-to-end approach, in which a generator network synthesizes clean speech and a conditioner network informs of what to generate. The idea is that the generator learns from clean speech and both generator and conditioner have the capability of enhancing representations, with the latter undertaking the core part of this task. For the generator, we put together a number of known and less known advances in score-based diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020). For the conditioner, we develop a number of improved architectural choices, and further propose the usage of auxiliary, out-of-path mixture density networks for enhancement in both the feature and the waveform domains. We quantify the relative importance of these main development steps using objective metrics, and show how the final solution outperforms the state of the art in all considered distortions using a subjective test with expert listeners (objective metrics for the denoising task are also reported in the Appendix). Finally, we also study the number of diffusion steps needed for performing high-quality universal speech enhancement, and find it to be on par with the fastest diffusion-based neural vocoders without the need for any specific tuning.

2. RELATED WORK

Our approach is based on diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020). While diffusion models have been more extensively studied for unconditional or weakly-conditioned image generation, our work presents a number of techniques for strongly-conditioned speech re-generation or enhancement. Diffusion-based models achieve state-of-the-art quality on multiple generative tasks, in different domains. In the audio domain, they have been particularly successful in speech synthesis (Chen et al., 2021; Kong et al., 2021), text-to-speech (Jeong et al., 2021; Popov et al., 2021), bandwidth extension (Lee & Han, 2021), or drum sound synthesis (Rouard & Hadjeres, 2021). An introduction to diffusion models is given in Appendix A.

Diffusion-based models have also recently been used for speech denoising. Zhang et al. (2021a) expand the DiffWave vocoder (Kong et al., 2021) with a convolutional conditioner, and train that separately with an L1 loss for matching latent representations. Lu et al. (2021) study the potential of DiffWave with noisy mel band inputs for speech denoising and, later, Lu et al. (2022) and Welker et al. (2022) propose formulations of the diffusion process that can adapt to (non-Gaussian) real audio noises. These studies with speech denoising show improvement over the considered baselines, but do not reach the objective scores achieved by state-of-the-art approaches (see also Appendix E). Our work stems from the WaveGrad architecture (Chen et al., 2021), introduces a number of crucial modifications and additional concepts, and pioneers universal enhancement by tackling an unprecedented number of distortions.

The state of the art for speech denoising and dereverberation is dominated by regression and adversarial approaches (Défossez et al., 2020; Fu et al., 2021; Su et al., 2021; Isik et al., 2020; Hao et al., 2021; Kim & Seo, 2021; Zheng et al., 2021; Kataria et al., 2021). However, if one considers further degradations of the signal like clipping, bandwidth removal, or silent gaps, it is intuitive to think that generative approaches have great potential (Polyak et al., 2021; Pascual et al., 2019; Zhang et al., 2021b), as such degradations require generating signal where, simply, there is none. Yet, to the best of our knowledge, this intuition has not been convincingly demonstrated through subjective tests involving human judgment. Our work sets a milestone in showing that a generative approach can outperform existing supervised and adversarial approaches when evaluated by expert listeners.

3.1. METHODOLOGY

Data - To train our model, we use a data set of clean and programmatically-distorted pairs of speech recordings. To obtain the clean speech, we sample 1,500 h of audio from an internal pool of data sets and convert it to 16 kHz mono. The speech sample consists of about 1.2 M utterances


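The Data paragraph of Section 3.1 describes training on clean and programmatically-distorted pairs of 16 kHz mono speech. The following is a minimal sketch of what such pair generation could look like, not the authors' pipeline: the three distortions are taken from the families named in the introduction (silent gaps, noise, clipping), while the function names, thresholds, gap length, SNR, and the 0.5 application probability are all hypothetical values chosen for illustration. A sine tone stands in for a clean utterance.

```python
import math
import random

SAMPLE_RATE = 16_000  # the text states that clean speech is converted to 16 kHz mono


def clip(x, threshold=0.3):
    """Hard-clip every sample to [-threshold, threshold] (threshold is an assumption)."""
    return [max(-threshold, min(threshold, s)) for s in x]


def add_silent_gap(x, start, length):
    """Replace `length` samples starting at index `start` with silence."""
    return [0.0 if start <= i < start + length else s for i, s in enumerate(x)]


def add_noise(x, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (in dB)."""
    signal_power = sum(s * s for s in x) / len(x)
    noise_std = math.sqrt(signal_power / 10.0 ** (snr_db / 10.0))
    return [s + rng.gauss(0.0, noise_std) for s in x]


def make_pair(clean, rng):
    """Return (clean, distorted); each distortion is applied with probability 0.5."""
    distorted = list(clean)
    if rng.random() < 0.5:
        distorted = add_silent_gap(distorted, start=len(clean) // 2, length=800)
    if rng.random() < 0.5:
        distorted = add_noise(distorted, snr_db=10.0, rng=rng)
    if rng.random() < 0.5:
        distorted = clip(distorted)
    return list(clean), distorted


rng = random.Random(0)
# One second of a 220 Hz tone is a placeholder for a clean speech utterance.
clean = [0.5 * math.sin(2 * math.pi * 220.0 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
clean, distorted = make_pair(clean, rng)
```

Chaining several distortions on the same utterance, as `make_pair` does, reflects the paper's premise that real recordings rarely suffer from a single degradation; a realistic setup would draw each distortion's parameters at random rather than fixing them.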