NEW TRAINING FRAMEWORK FOR SPEECH ENHANCEMENT USING REAL NOISY SPEECH

Anonymous

Abstract

Deep learning-based speech enhancement (SE) models have improved significantly in recent years. However, this success relies mainly on synthetic training data created by adding noise to clean speech. Real noisy speech, although abundant, is difficult to use for SE model training because it lacks a clean reference. In this paper, we propose a novel method that uses real noisy speech for SE model training, based on a non-intrusive speech quality prediction model. The SE model is trained under the guidance of the quality prediction model. We also find that a speech quality predictor with better accuracy is not necessarily an appropriate teacher to guide the SE model. In addition, we show that if the quality prediction model is adversarially robust, the prediction model itself can also serve as an SE model by modifying the input noisy speech through gradient backpropagation. Objective experimental results show that, with the same SE model structure, the proposed training method trained on a large amount of real noisy speech can outperform the conventional supervised model trained on synthetic noisy speech. Lastly, the two training methods can be combined to exploit the benefits of both synthetic noisy speech (easy to learn from) and real noisy speech (large amount), forming a semi-supervised learning scheme that further boosts performance both objectively and subjectively. The code will be released after publication.

1. INTRODUCTION

Deep learning-based speech enhancement (SE) has improved significantly in several respects, including model structures (Xu et al., 2014; Weninger et al., 2015; Fu et al., 2017; Luo & Mesgarani, 2018; Dang et al., 2022; Hu et al., 2020), input features (Williamson et al., 2015; Fu et al., 2018b; Huang et al., 2022; Hung et al., 2022), and loss functions (Pascual et al., 2017; Fu et al., 2018b; Martin-Donas et al., 2018; Kolbaek et al., 2018; Koizumi et al., 2017; Niu et al., 2020). However, this success relies mainly on synthetic training data consisting of paired clean and noisy speech. In general, the noisy speech is synthesized by adding noise to clean speech; hence, both clean speech and noise recordings are required for model training. Compared to real noisy speech, pure clean speech and pure noise are very difficult to obtain in daily life, and they have to be recorded in a controlled environment.

Although some studies (Wisdom et al., 2020; Fujimura et al., 2021) have proposed using real noisy speech for SE training, they still rely on synthetic training data: noise is added to noisy speech to generate a noisier signal as the model input, with the original noisy speech as the training target. The mismatch between synthetic training data and real noisy data may degrade SE performance (e.g., the recording devices and the room responses of the noisy speech and the noise may differ, which results in different acoustic characteristics).

This study aims to solve the above-mentioned issues by training an SE model directly on real noisy speech. To achieve this goal, we first train a non-intrusive speech quality predictor. If this predictor is robust, it should be able to guide the training of an SE model. Because the quality assessment does not require a clean reference, real noisy speech can be used for SE model training.
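For reference, the conventional synthesis of training pairs mentioned above (adding noise to clean speech) can be sketched as follows. This is a minimal illustration and not the authors' data pipeline: the function name `mix_at_snr`, the use of a target signal-to-noise ratio (SNR) in dB to set the noise level, and the toy sine and Gaussian signals are all assumptions made for the example.

```python
import math
import random


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean/noise power ratio equals `snr_db`, then add.

    Hypothetical helper: the SNR-based scaling is an assumption, not taken
    from the paper.
    """
    scale = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [scale * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


# Toy signals standing in for recordings: a 440 Hz tone at 16 kHz and white noise.
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0)
achieved_snr_db = 10.0 * math.log10(power(clean) / power(scaled_noise))
```

The sketch makes the dependency explicit: both `clean` and `noise` must be available separately, which is exactly the requirement that real noisy recordings cannot satisfy.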
A few key characteristics of the proposed method are: 1) The training of the SE model is based on real noisy speech and a quality prediction model; no synthetic training data is required. 2) The loss function used to train the SE model is not based on a signal-level comparison (such as the mean square error between the enhanced and target speech); it is based entirely on the quality predictor.
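To make the second characteristic concrete, the following toy sketch trains an enhancement mask using only the output of a frozen quality predictor, with no clean reference anywhere in the loss. Everything in it is an assumption made for illustration: the logistic predictor with fixed weights `W` and `B`, the per-bin sigmoid mask standing in for an SE model, and the loss form `(1 - q)^2` that pushes the predicted score toward its maximum. It is not the architecture or the loss used in this paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical frozen quality predictor: feature vector -> score in (0, 1).
# Weights are illustrative values and are never updated below.
W = [0.8, -0.5, 0.3, -0.9]
B = 0.1


def predict_quality(y):
    return sigmoid(sum(w * v for w, v in zip(W, y)) + B)


def enhance(theta, x):
    """Toy SE model: a per-bin mask in (0, 1) applied to noisy features."""
    return [sigmoid(t) * v for t, v in zip(theta, x)]


def quality_loss(theta, x):
    """Loss depends only on the predicted quality, not on a clean target."""
    q = predict_quality(enhance(theta, x))
    return (1.0 - q) ** 2


def loss_gradient(theta, x):
    """Analytic gradient of quality_loss w.r.t. the mask parameters (chain rule)."""
    q = predict_quality(enhance(theta, x))
    dloss_dq = -2.0 * (1.0 - q)
    dq_dz = q * (1.0 - q)
    grad = []
    for t, v, w in zip(theta, x, W):
        m = sigmoid(t)
        grad.append(dloss_dq * dq_dz * w * v * m * (1.0 - m))
    return grad


def train(x, steps=200, lr=5.0):
    """Gradient descent on the SE parameters only; the predictor stays frozen."""
    theta = [0.0] * len(x)
    history = [quality_loss(theta, x)]
    for _ in range(steps):
        g = loss_gradient(theta, x)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        history.append(quality_loss(theta, x))
    return theta, history


noisy = [1.0, 2.0, 0.5, 1.5]  # toy "noisy" features, no clean counterpart
theta, history = train(noisy)
```

In this toy setting the learned mask simply follows the signs of the predictor's weights, which hints at why the robustness of the predictor matters: the SE model optimizes whatever the predictor rewards.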

