D4AM: A GENERAL DENOISING FRAMEWORK FOR DOWNSTREAM ACOUSTIC MODELS

Abstract

The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. In addition, our method treats the regression objective as an auxiliary loss to help the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than undergoing a grid search process that incurs additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM consistently and effectively improves various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with directly feeding noisy input. To our knowledge, this is the first work to deploy an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems.

1. INTRODUCTION

Speech enhancement (SE) aims to extract speech components from distorted speech signals to obtain enhanced signals with better properties (Loizou, 2013). Recently, various deep learning models (Wang et al., 2020; Lu et al., 2013; Xu et al., 2015; Zheng et al., 2021; Nikzad et al., 2020) have been used to formulate mapping functions for SE, which treat SE as a regression task trained with noisy-clean paired speech data. Typically, the objective function is formulated using a signal-level distance measure (e.g., L1 norm (Pandey & Wang, 2018; Yue et al., 2022), L2 norm (Ephraim & Malah, 1984; Yin et al., 2020; Xu et al., 2020), SI-SDR (Le Roux et al., 2019; Wisdom et al., 2020; Lee et al., 2020), or a multiple-resolution loss (Défossez et al., 2020)). In speech-related applications, SE units are generally used as key pre-processors to improve the performance of the main task in noisy environments. To facilitate better performance on the main task, certain studies focus on deriving suitable objective functions for SE training. For human-human oral communication tasks, SE aims to improve speech quality and intelligibility, and enhancement performance is usually assessed by subjective listening tests. Because large-scale listening tests are generally prohibitive, objective evaluation metrics have been developed to approximate human perception of a given speech signal (Rix et al., 2001; Taal et al., 2010; Jensen & Taal, 2016; Reddy et al., 2021). Perceptual evaluation of speech quality (PESQ) (Rix et al., 2001) and short-time objective intelligibility (STOI) (Taal et al., 2010; Jensen & Taal, 2016) are popular objective metrics designed to measure speech quality and intelligibility, respectively. Recently, DNSMOS (Reddy et al., 2021) has been developed as a non-intrusive assessment tool that predicts human ratings (MOS scores) of speech signals.
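To make the signal-level regression objectives above concrete, the following is a minimal NumPy sketch of the SI-SDR measure (Le Roux et al., 2019); it is an illustrative, self-contained implementation, not code taken from D4AM or any cited system:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB (Le Roux et al., 2019), schematically."""
    # Project the estimate onto the target to obtain the scaled reference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    # The residual is whatever part of the estimate the target cannot explain.
    residual = estimate - projection
    return 10.0 * np.log10((np.dot(projection, projection) + eps)
                           / (np.dot(residual, residual) + eps))
```

For SE training, the negative of this quantity is typically minimized; by construction the score is invariant to rescaling of the estimate, which is the property that distinguishes it from plain L1/L2 distances.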
To obtain speech signals with improved quality and intelligibility, many SE approaches formulate objective functions for SE training directly according to speech assessment metrics (Fu et al., 2018; 2019; 2021). Another group of approaches, such as deep feature loss (Germain et al., 2019) and HiFi-GAN (Su et al., 2020), performs SE by mapping learned noisy latent features to clean ones. Experimental results show that the deep feature loss enables enhanced speech signals to attain higher human perception scores than the conventional L1 and L2 distances. Another prominent application of SE is to improve automatic speech recognition (ASR) in noise (Seltzer et al., 2013; Weninger et al., 2015b; Li et al., 2014; Cui et al., 2021). ASR systems perform sequential classification, mapping speech utterances to sequences of tokens. Therefore, the predictions of ASR systems highly depend on the overall structure of the input utterance. For noisy signals, ASR performance degrades significantly because noise interference corrupts the content information of this structure. Without modifying the ASR model, SE models can be trained separately and "universally" used as a pre-processor for ASR to improve recognition accuracy. Several studies have investigated the effectiveness of the SE model architecture and objective function in improving the performance of ASR in noise (Geiger et al., 2014a; Wang et al., 2020; Zhang et al., 2020; Chao et al., 2021; Meng et al., 2017; Weninger et al., 2015a; Du et al., 2019; Kinoshita et al., 2020; Meng et al., 2018). The results show that certain specific designs, including model architecture and input format, are favorable for improving ASR performance. However, it has also been reported that improved recognition accuracy in noise is not always guaranteed when the ASR objective is not considered in SE training (Geiger et al., 2014b).
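The deep-feature-loss idea mentioned above can be sketched schematically: the loss is a sum of L1 distances between intermediate activations of a fixed, pretrained network applied to the enhanced and clean signals. The sketch below uses plain callables as stand-ins for that network's stages; it is an assumption-laden illustration, not the architecture of Germain et al. (2019):

```python
import numpy as np

def deep_feature_loss(layers, enhanced, clean):
    # Schematic deep feature loss: accumulate L1 distances between the
    # activations of successive (frozen) network stages for both signals.
    # `layers` is a list of callables standing in for a pretrained network.
    loss, x, y = 0.0, enhanced, clean
    for layer in layers:
        x, y = layer(x), layer(y)
        loss += np.mean(np.abs(x - y))
    return loss
```

The design intuition is that distances in a learned feature space correlate better with human perception than raw waveform distances, which matches the reported gains over L1/L2 training.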
A feasible approach to tune the SE model parameters toward the main ASR task is to prepare (noisy) speech-text paired data and backpropagate gradients through the SE model according to the classification objective provided by the ASR model. That is, SE models can be trained on a regression objective (using noisy-clean paired speech data) and/or a classification objective (using speech-text paired data). Ochiai et al. (2017a; b) proposed a multichannel end-to-end (E2E) ASR framework, where a mask-based MVDR (minimum variance distortionless response) neural beamformer is estimated based on the classification objective. Experimental results on CHiME-4 (Jon et al., 2017) confirm that the estimated neural beamformer can achieve significant ASR improvements under noisy conditions. Meanwhile, Chen et al. (2015) and Ma et al. (2021) proposed to train SE units by considering both regression and classification objectives, and certain works (Chen et al., 2015; Ochiai et al., 2017a; b) proposed to train SE models with E2E-ASR classification objectives. A common way to combine regression and classification objectives is to use weighting coefficients to merge them into a joint objective for SE model training. Notwithstanding these promising results, the use of combined objectives in SE training has two limitations. First, how to effectively combine regression and classification objectives remains an open issue. A large-scale grid search is often employed to determine optimal weights for the regression and classification objectives, which incurs exhaustive computational costs. Second, ASR models are often provided by third parties and may not be accessible when training SE models.
Moreover, due to the various training settings of acoustic models, such as label encoding schemes (e.g., word-piece (Schuster & Nakajima, 2012), byte-pair encoding (BPE) (Gage, 1994; Sennrich et al., 2016), and character), model architectures (e.g., RNN (Chiu et al., 2018; Rao et al., 2017; He et al., 2019; Sainath et al., 2020), transformer (Vaswani et al., 2017; Zhang et al., 2020), and conformer (Gulati et al., 2020)), and objectives (e.g., Connectionist Temporal Classification (CTC) (Graves et al., 2006), Attention (NLL) (Chan et al., 2016), and their hybrid version (Watanabe et al., 2017)), SE units trained according to a specific acoustic model may not generalize well to other ASR systems. Given these limitations, we raise the question: Can we effectively integrate speech-text and noisy-clean paired data to develop a denoising pre-processor that generalizes well to unseen ASR systems? In this work, we derive a novel denoising framework, called D4AM, to be used as a "universal" pre-processor to improve the performance of various downstream acoustic models in noise. To achieve this goal, the proposed framework focuses on preserving the integrity of clean speech signals and trains SE models jointly with regression and classification objectives. By using the regression objective as an auxiliary loss, we avoid the additional training cost of grid-searching for appropriate weighting coefficients for the regression and classification objectives. Instead, D4AM applies an adjustment scheme to determine the appropriate weighting coefficients automatically and efficiently. The adjustment scheme is inspired by the following concepts: (1) we attempt to adjust the gradient yielded by a proxy ASR model so that the SE unit can be trained to improve the general recognition capability; (2) we consider the weighted regression objective as a regularizer and, thereby,
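As a generic illustration of concept (1), the sketch below gates the auxiliary regression gradient by its cosine similarity to the classification gradient, keeping it only when the two directions agree. This is a common auxiliary-task heuristic offered for intuition; it is not claimed to be the exact calibration rule used in D4AM:

```python
import numpy as np

def combine_gradients(g_cls, g_reg, eps=1e-8):
    # Illustrative gradient adjustment (NOT D4AM's exact rule): weight the
    # auxiliary regression gradient by its (clipped) cosine similarity to
    # the classification gradient, so a conflicting auxiliary direction
    # cannot pull training away from the recognition objective.
    cos = np.dot(g_cls, g_reg) / (
        np.linalg.norm(g_cls) * np.linalg.norm(g_reg) + eps)
    weight = max(cos, 0.0)  # drop the regression term when it opposes g_cls
    return g_cls + weight * g_reg
```

In this toy form the auxiliary term acts as a regularizer that is only applied when it is compatible with the main objective, which mirrors the motivation stated above.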

Code Availability

Our code is available at https://github.com/ChangLee0903

