UNIVERSAL SPEECH ENHANCEMENT WITH SCORE-BASED DIFFUSION

Abstract

Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
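The paper's conditional diffusion sampler is not reproduced here; as a generic, self-contained illustration of score-based diffusion sampling, the sketch below runs ancestral (SMLD-style) sampling on toy one-dimensional Gaussian data, for which the exact score is available in closed form. The function names, geometric noise schedule, and all parameters are illustrative assumptions, not the authors' implementation:

```python
import math
import random

def gaussian_score(x, mu, var_data, sigma):
    # Exact score of the noise-perturbed data distribution N(mu, var_data + sigma^2):
    # grad_x log p_sigma(x) = -(x - mu) / (var_data + sigma^2)
    return -(x - mu) / (var_data + sigma * sigma)

def sample(n_steps=30, mu=2.0, var_data=0.25, s_max=10.0, s_min=0.01, rng=None):
    """Draw one sample by ancestral sampling along a geometric noise schedule."""
    rng = rng or random.Random(0)
    sigmas = [s_max * (s_min / s_max) ** (i / n_steps) for i in range(n_steps + 1)]
    x = rng.gauss(0.0, s_max)  # prior sample: data scale is negligible vs. s_max
    for s_cur, s_next in zip(sigmas, sigmas[1:]):
        delta = s_cur ** 2 - s_next ** 2
        x += delta * gaussian_score(x, mu, var_data, s_cur)   # drift toward the data
        x += (s_next / s_cur) * math.sqrt(delta) * rng.gauss(0.0, 1.0)  # ancestral noise
    return x

rng = random.Random(7)
samples = [sample(rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = (sum((v - mean) ** 2 for v in samples) / len(samples)) ** 0.5
```

With the exact score, the samples land close to the target N(2.0, 0.5^2); in a speech enhancement setting, a learned score network conditioned on the distorted recording would replace the analytic `gaussian_score`.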

1. INTRODUCTION

Real-world recorded speech almost inevitably contains background noise, which can be unpleasant and can prevent intelligibility. Removing background noise has traditionally been the objective of speech enhancement algorithms (Loizou, 2013). Since the 1940s (Kolmogorov, 1941; Wiener, 1949), a myriad of denoising approaches based on filtering have been proposed, with a focus on stationary noises. With the advent of deep learning, the task has been dominated by neural networks, which often outperform more classical algorithms and generalize to multiple noise types (Lu et al., 2013; Pascual et al., 2017; Rethage et al., 2018; Défossez et al., 2020; Fu et al., 2021). Despite recent progress, speech denoising still presents room for improvement, especially when dealing with distribution shift or real-world recordings. Noise, however, is only one of the many potential disturbances that can be present in speech recordings. If recordings are performed in a closed room, reverberation is ubiquitous. With this in mind, a number of works have recently started to widen their focus in order to embrace more realistic situations and tackle noise and reverberation at the same time (Su et al., 2021; 2019; Polyak et al., 2021). Some of these works adopt a generation or re-generation strategy (Maiti & Mandel, 2019), in which a two-stage approach is employed to first enhance and then synthesize speech signals. Despite the relative success of this strategy, it is still an open question whether such approaches can perceptually outperform purely supervised ones, especially in terms of realism and lack of voice artifacts.

Besides noise and reverberation, a few works propose to go one step further by considering additional distortions. Pascual et al. (2019) introduce a broader notion of speech enhancement by considering whispered speech, bandwidth reduction, silent gaps, and clipping. More recently, Nair & Koishida (2021) consider silent gaps, clipping, and codec artifacts, and Zhang et al. (2021a) consider clipping and codec artifacts. In concurrent work, Liu et al. (2021) deal with bandwidth reduction and clipping in addition to noise and reverberation. Despite these recent efforts to go beyond pure denoising, we are not aware of any speech enhancement system that tackles more than 2-4 distortions at the same time. In this work, we take a holistic approach and regard the task of speech enhancement as a universal endeavor.

We believe that, for realistic speech enhancement situations, algorithms need not only face

