D4AM: A GENERAL DENOISING FRAMEWORK FOR DOWNSTREAM ACOUSTIC MODELS

Abstract

The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, the training objectives of existing SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose D4AM, a general denoising framework for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient of a specific acoustic model and its classification objective. In addition, our method incorporates the regression objective as an auxiliary loss so that the SE model generalizes to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme that directly estimates suitable weighting coefficients rather than performing a grid search with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. Experimental results show that D4AM consistently and effectively improves various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with directly feeding the noisy input. To our knowledge, this is the first work to deploy an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems.
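The joint objective described above can be illustrated with a minimal sketch. Here a PCGrad-style conflict projection stands in for the paper's gradient calibration, and the coefficient `alpha` is simply passed in, whereas D4AM estimates it directly; the function name and toy gradients are illustrative assumptions, not the paper's actual estimator:

```python
def dot(u, v):
    """Inner product of two flat gradient vectors."""
    return sum(a * b for a, b in zip(u, v))

def calibrated_update(g_cls, g_reg, alpha):
    """Combine a classification gradient with an alpha-weighted
    regression gradient. If the two conflict (negative inner
    product), project the conflicting component out of g_cls
    before mixing (a PCGrad-style surrogate for calibration)."""
    if dot(g_cls, g_reg) < 0:
        coeff = dot(g_cls, g_reg) / dot(g_reg, g_reg)
        g_cls = [gc - coeff * gr for gc, gr in zip(g_cls, g_reg)]
    return [gc + alpha * gr for gc, gr in zip(g_cls, g_reg)]
```

With non-conflicting gradients the update is a plain weighted sum; with conflicting ones, the classification direction is first made orthogonal to the regression direction so the auxiliary denoising objective is never directly opposed.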

1. INTRODUCTION

Speech enhancement (SE) aims to extract speech components from distorted speech signals to obtain enhanced signals with better properties (Loizou, 2013). Recently, various deep learning models (Wang et al., 2020; Lu et al., 2013; Xu et al., 2015; Zheng et al., 2021; Nikzad et al., 2020) have been used to formulate mapping functions for SE, treating SE as a regression task trained on noisy-clean paired speech data. Typically, the objective function is formulated using a signal-level distance measure (e.g., the L1 norm (Pandey & Wang, 2018; Yue et al., 2022), the L2 norm (Ephraim & Malah, 1984; Yin et al., 2020; Xu et al., 2020), SI-SDR (Le Roux et al., 2019; Wisdom et al., 2020; Lee et al., 2020), or a multiple-resolution loss (Défossez et al., 2020)). In speech-related applications, SE units generally serve as key pre-processors that improve the performance of the main task in noisy environments, and certain studies therefore focus on deriving objective functions for SE training that suit the main task. For human-human oral communication tasks, SE aims to improve speech quality and intelligibility, and enhancement performance is usually assessed by subjective listening tests. Because large-scale listening tests are generally prohibitively costly, objective evaluation metrics have been developed to approximate human perception of a given speech signal (Rix et al., 2001; Taal et al., 2010; Jensen & Taal, 2016; Reddy et al., 2021). Perceptual evaluation of speech quality (PESQ) (Rix et al., 2001) and short-time objective intelligibility (STOI) (Taal et al., 2010; Jensen & Taal, 2016) are popular objective metrics designed to measure speech quality and intelligibility, respectively. Recently, DNSMOS (Reddy et al., 2021) has been developed as a non-intrusive assessment tool that predicts human ratings (MOS scores) of speech signals. In order to obtain speech signals with
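Among the signal-level measures listed above, SI-SDR has a compact closed form: the enhanced signal is decomposed into a scaled projection onto the clean reference plus a residual, and the ratio of their energies is reported in decibels. A minimal sketch following the definition in Le Roux et al. (2019), on plain Python lists for clarity (a real SE pipeline would operate on array tensors):

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.
    Projects the estimate onto the reference to get the target
    component, then compares its energy with the residual's."""
    scale = (sum(e * r for e, r in zip(estimate, reference))
             / sum(r * r for r in reference))
    target = [scale * r for r in reference]                # matched component
    noise = [e - t for e, t in zip(estimate, target)]      # residual distortion
    return 10.0 * math.log10(sum(t * t for t in target)
                             / sum(n * n for n in noise))
```

Because the reference is rescaled by the projection coefficient, multiplying the estimate by any nonzero constant leaves the score unchanged, which is what makes the measure scale-invariant (and why it is often preferred over plain L2 distance as a training objective).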

Availability

Our code is available at https://github.com/ChangLee0903

