NEW TRAINING FRAMEWORK FOR SPEECH ENHANCEMENT USING REAL NOISY SPEECH

Anonymous

Abstract

Recently, deep learning-based speech enhancement (SE) models have achieved significant improvements. However, this success mainly relies on synthetic training data created by adding noise to clean speech. In contrast, despite its abundance, real noisy speech is difficult to use for SE model training because it lacks a clean reference. In this paper, we propose a novel method that utilizes real noisy speech for SE model training based on a non-intrusive speech quality prediction model: the SE model is trained under the guidance of the quality prediction model. We also find that a speech quality predictor with better accuracy is not necessarily an appropriate teacher for the SE model. In addition, we show that if the quality prediction model is adversarially robust, the prediction model itself can serve as an SE model by modifying the input noisy speech through gradient backpropagation. Objective experimental results show that, under the same SE model structure, the proposed training method, trained on a large amount of real noisy speech, can outperform a conventional supervised model trained on synthetic noisy speech. Lastly, the two training methods can be combined to exploit the benefits of both synthetic noisy speech (easy to learn from) and real noisy speech (available in large amounts), forming a semi-supervised scheme that further boosts performance both objectively and subjectively. The code will be released after publication.

1. INTRODUCTION

Deep learning-based speech enhancement (SE) has achieved significant improvements in different aspects such as model structures (Xu et al., 2014; Weninger et al., 2015; Fu et al., 2017; Luo & Mesgarani, 2018; Dang et al., 2022; Hu et al., 2020), input features (Williamson et al., 2015; Fu et al., 2018b; Huang et al., 2022; Hung et al., 2022), and loss functions (Pascual et al., 2017; Fu et al., 2018b; Martin-Donas et al., 2018; Kolbaek et al., 2018; Koizumi et al., 2017; Niu et al., 2020). However, this success mainly relies on synthetic training data consisting of paired clean and noisy speech. In general, the noisy speech is synthesized by adding noise to clean speech; hence, both clean speech and noise are required for model training. Compared to real noisy speech, pure clean speech and noise are difficult to obtain in daily life and have to be recorded in controlled environments. Although some studies (Wisdom et al., 2020; Fujimura et al., 2021) have proposed using real noisy speech for SE training, they still rely on synthetic training data: noise is added to noisy speech to generate a noisier signal as the model input, with the original noisy speech as the training target. The mismatch between synthetic and real noisy data may degrade SE performance (e.g., the recording devices and room responses of the noisy speech and the noise may differ, resulting in different acoustic characteristics).

This study aims to solve the above-mentioned issues by training an SE model directly on real noisy speech. To achieve this goal, we first train a non-intrusive speech quality predictor. If this predictor is robust, it should be able to guide the training of an SE model. Because quality assessment can be done without a clean reference, real noisy speech can be used for SE model training.
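To make the training signal concrete, the following is a toy, fully hypothetical sketch (not the paper's actual models): a frozen, differentiable "quality predictor" scores the enhanced output, and the SE model's parameters are updated by gradient ascent purely to raise that score, so no clean reference appears in the SE loss. Here the toy predictor internally compares against a hidden clean signal only as a stand-in for a learned non-intrusive quality model.

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.standard_normal(256)            # hidden clean signal (toy stand-in)
x = c + 0.5 * rng.standard_normal(256)  # "real" noisy input

def predictor_score(y):
    # Stand-in for a frozen non-intrusive quality predictor: higher is better.
    # A real predictor would be a trained network with no access to c.
    return -np.mean((y - c) ** 2)

w = np.ones_like(x)  # toy SE "model": one learnable gain per sample
lr = 0.1
score_before = predictor_score(w * x)
for _ in range(200):
    y = w * x
    grad = -2.0 * (y - c) * x / len(x)  # d(score)/d(w) via the chain rule
    w += lr * grad                      # gradient ASCENT on predicted quality
score_after = predictor_score(w * x)
assert score_after > score_before       # predicted quality improved
```

The key point of the sketch is that the SE loss is entirely the (negative) predictor output; no signal-level comparison against a clean target is computed in the update.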
A few key characteristics of the proposed method are: 1) The training of the SE model is based on real noisy speech and a quality prediction model; no synthetic training data is required. 2) The loss function used to train the SE model is not based on a signal-level comparison (such as the mean square error between the enhanced and target speech); it is based entirely on the quality predictor.

To summarize, the key contributions of this paper are: 1) A novel training framework for speech enhancement using real noisy speech is proposed. 2) We found that a speech quality predictor with better prediction accuracy may not lead to a better SE model. Model structure does matter! 3) An adversarially robust quality predictor can itself be used directly for speech enhancement, without the need to train a separate SE model. 4) Under the same SE model structure, the proposed training method can outperform the conventional supervised model. 5) Conventional supervised training and the proposed method can be combined to form semi-supervised learning, which further boosts the performance.

2. RELATED WORK

Previous research has proposed using real noisy speech for SE model training. These methods can be divided into two categories, depending on whether clean speech or noise is needed.

SE training using unpaired noisy and clean speech: The cycle-consistent generative adversarial network (CycleGAN) (Xiang & Bao, 2020; Yu et al., 2021) has been applied to achieve this goal. Through the GAN framework and a cycle-consistency loss, only non-paired clean and noisy speech is needed during training. Bie et al. (2021) used clean speech to pre-train a variational auto-encoder and applied variational expectation-maximization to fine-tune the encoder during inference.

SE training using noisy speech and noise signals: MixIT (Wisdom et al., 2020) is an unsupervised sound separation method that requires only mixtures during training; with some simple modifications, it can also be used for SE. The input to the SE model is the mixture of noisy speech and noise-only audio. A three-output SE model is trained; outputs 1 and 3, or 1 and 2, can be used to reconstruct the noisy speech, while output 2 or 3 is used to match the noise-only audio. However, performance degrades when the distributions of the noise in the noisy speech and of the artificially added noise are too different (Saito et al., 2021; Maciejewski et al., 2021). Trinh & Braun (2021) apply two additional loss terms based on Wav2vec 2.0 (Baevski et al., 2020) to improve MixIT performance. Similarly, Fujimura et al. (2021) proposed noisy-target training (NyTT), which also adds noise to noisy speech; the noise-added signal and the original noisy speech are used as the model input and target, respectively. Compared to these methods, our model needs neither a pure noise corpus nor a clean corpus, but it does require a data set with MOS labels. In addition, the loss function of our SE model maximizes the predicted quality score, which may lead to enhanced speech with higher subjective scores.

SE with a quality predictor: MetricGAN (Fu et al., 2019b; 2021) applies a GAN framework to make the discriminator mimic the behavior of the perceptual evaluation of speech quality (PESQ) (Rix et al., 2001) function; the discriminator is then used to guide the learning of the SE model by maximizing the predicted score. Xu et al. (2022) propose a non-intrusive PESQNet as the discriminator. Non-matching reference based speech quality assessment (NORESQA) (Manocha et al., 2021) estimates the quality difference between an input speech and a non-matching reference; the authors apply NORESQA to pre-train an SE model by minimizing the predicted quality difference between the output of an SE model and a clean recording. Manocha et al. (2020) propose a perceptual distance metric based on just-noticeable-difference (JND) labels, which is applied as a perceptual loss for SE training. In Nayem & Williamson (2021), an SE model is jointly trained with a MOS predictor. Because the calculation of PESQ and the training of NORESQA rely on two signal processing measures, the signal-to-noise ratio (SNR) and the scale-invariant signal-to-distortion ratio (SI-SDR), to compare the quality of two inputs, synthetic data is needed to train these quality prediction models. In contrast, our proposed training method does not require it.
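Contribution 3 above, i.e., using an adversarially robust quality predictor itself as the enhancer by backpropagating its score to the input, can be illustrated with the same kind of toy, hypothetical predictor (a stand-in for a trained network, not the paper's actual model). Here the gradient steps are taken on the noisy waveform itself, and no SE model parameters exist at all.

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.standard_normal(256)            # hidden clean signal (toy stand-in)
x = c + 0.5 * rng.standard_normal(256)  # noisy input to be "enhanced"

def predictor_score(y):
    # Toy differentiable quality predictor: higher means better quality.
    return -np.mean((y - c) ** 2)

# Enhance by gradient ascent on the INPUT, not on any model parameters.
y = x.copy()
lr = 5.0
for _ in range(50):
    grad = -2.0 * (y - c) / len(y)  # d(score)/d(y)
    y += lr * grad                  # move the waveform toward higher score

assert predictor_score(y) > predictor_score(x)
```

For a real (non-robust) predictor, this procedure would tend to find adversarial inputs that raise the score without improving perceptual quality, which is why the paper ties this use to adversarial robustness.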

