EXPLOITING SPATIAL SEPARABILITY FOR DEEP LEARNING MULTICHANNEL SPEECH ENHANCEMENT WITH AN ALIGN-AND-FILTER NETWORK

Anonymous

Abstract

Multichannel speech enhancement (SE) systems separate the target speech from background noise by performing spatial and spectral filtering. Multichannel SE has a long history in the signal processing field, where one crucial step is to exploit the spatial separability of sound sources by aligning the microphone signals in response to the target speech source prior to further filtering. This resembles the human listening behavior of facing toward the speaker for better perception of the speech. However, most existing deep learning-based multichannel SE works have yet to effectively incorporate or emphasize this spatial alignment aspect in the network design: some rely on integrating conventional model-based beamformer units to extract useful spatial features implicitly, while others simply let the network figure everything out by itself. Yet the beamformer operation can be computationally expensive and numerically unstable when trained jointly with the network, whereas without it the model lacks guidance for learning meaningful spatial features. In this paper, we highlight this important but often overlooked step in deep learning-based multichannel SE, i.e., signal alignment, by introducing an Align-and-Filter network (AFnet) featuring a two-stage sequential masking design. The AFnet estimates two sets of masks, the alignment masks and the filtering masks, to carry out temporal alignment and spectral filtering. During training, we propose to supervise the learning of the alignment masks by predicting the relative transfer functions (RTFs) of various speech source locations, followed by learning the filtering masks for signal reconstruction. During inference, the AFnet sequentially multiplies the estimated alignment and filtering masks with the microphone signals, performing the "align-then-filter" process similar to the human listening behavior.
Due to the incorporation of RTF supervision, the AFnet explicitly learns interpretable spatial features without integrating traditional beamformer operations.

1. INTRODUCTION

Speech enhancement (SE) systems can be categorized into single-channel (single microphone) and multichannel (multiple microphones) schemes. A key advantage of multichannel SE over single-channel SE is the exploitation of spatial separability, also known as spatial filtering or beamforming, enabled by the differences between the amplitudes and times of arrival of the received microphone signals due to the different acoustic paths the sound waveform travels to the microphones. In many signal processing beamforming methods (Gannot et al., 2001; Cohen, 2004; Krueger et al., 2010; Koldovskỳ et al., 2015), a key step is to align the microphone signals in response to the target signal source before any further filtering processes. This step, by steering the array toward the location of the target signal, compensates for the differences in amplitude and time delay (or, correspondingly, in magnitude and phase in the frequency domain) of the microphone signals with respect to the target source. Ideally, after the alignment step, each microphone should contain the same target speech component with no difference in amplitude and time delay (or magnitude and phase). For a linear array in the far-field, anechoic setting, perfectly steering the microphone array makes it as if the target signal comes from the broadside, which renders the speech extraction task easier in the later filtering stage. Such a process is similar to the human listening behavior of facing toward the speaker for better perception of the speech. Thus, an efficient SE system can first align its microphone signals in response to the target speech, followed by spectral processing for fine-tuning and enhancement. Surprisingly, though well known in signal processing approaches, only a few deep learning SE systems are designed primarily around this observation.
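As a concrete illustration of the alignment step described above, the following sketch models a far-field, anechoic two-microphone setup in the frequency domain: one channel receives a delayed copy of the source, and multiplying by the conjugate phase ramp undoes that delay. The array geometry, source angle, and test signal are toy assumptions chosen only to make the idea executable.

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
c = 343.0                        # speed of sound (m/s)
d = 0.05                         # microphone spacing (m), hypothetical
theta = np.deg2rad(60)           # hypothetical source direction

# Far-field delay of mic 1 relative to mic 0 (may be a fractional sample).
tau = d * np.cos(theta) / c      # seconds

n = 1024
t = np.arange(n) / fs
freqs = np.fft.rfftfreq(n, 1 / fs)

# Mic 0 receives the source directly; mic 1 receives a delayed copy,
# modeled exactly as a linear phase shift in the frequency domain.
X0 = np.fft.rfft(np.sin(2 * np.pi * 440 * t) * np.hanning(n))
X1 = X0 * np.exp(-2j * np.pi * freqs * tau)

# Alignment step: "steer" mic 1 toward the source by undoing the delay,
# i.e., applying the conjugate phase ramp.
X1_aligned = X1 * np.exp(2j * np.pi * freqs * tau)

print(np.allclose(X0, X1_aligned))  # True: channels now carry identical speech
```

After this compensation both channels hold the same target component, which is exactly the state that makes the subsequent filtering stage easier.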
Instead, several recent works (Wang et al., 2020; 2021) utilize conventional beamformer units such as the minimum-variance-distortionless-response (MVDR) beamformer (Capon, 1969) to extract spatial characteristics implicitly within deep learning methods. However, these units often require matrix inversion or eigendecomposition, which increases the computational burden and can cause numerical instability. In this paper, we revisit this important but often overlooked alignment step from conventional signal processing algorithms and recognize its importance for efficient deep learning multichannel SE network design that requires no intermediate beamformer units such as MVDR to extract meaningful spatial features. Specifically, we propose the Align-and-Filter network (AFnet), shown in Figure 1, for exploiting spatial separability within multichannel data, with the following main contributions:
• The AFnet features a two-stage sequential masking design, i.e., Align Net and Filter Net, where two sets of masks, alignment and filtering masks, are estimated and multiplied with the microphone signals to perform the "align-then-filter" process mimicking the human listening behavior.
• During the training stage, we propose to supervise the learning of the alignment masks by estimating the relative transfer functions (RTFs) (Gannot et al., 2001; Cohen, 2004) for speech sources coming from various locations, prior to learning the filtering masks for final enhancement.
• During inference, the AFnet first aligns the microphone signals with respect to a speech source coming from an unknown direction, owing to the supervised learning of the alignment masks; it then filters the roughly aligned signals to achieve denoising.
• We demonstrate that the RTF supervision, combined with the sequential masking mechanism, is the key to effectively learning useful, interpretable spatial characteristics.
In situations where the target speech may come from arbitrary positions, the "align-then-filter" mechanism consistently improves SE performance by exploiting the spatial separability of sound sources more efficiently.
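Setting the networks themselves aside, the "align-then-filter" inference path reduces to two sequential complex mask multiplications followed by a sum over channels. The sketch below uses random tensors as stand-ins for the STFTs and for the Align Net and Filter Net outputs; the tensor shapes and the summation over microphones are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

M, F, T = 4, 257, 100  # mics, frequency bins, time frames (hypothetical)
rng = np.random.default_rng(0)

def crandn(*shape):
    """Toy complex Gaussian tensor standing in for real data/masks."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Complex STFTs of the microphone signals.
Y = crandn(M, F, T)

# Stage 1: alignment masks (one per mic), which the paper trains with RTF
# supervision so that, after masking, every channel carries the same
# target-speech component.
A = crandn(M, F, T)
Y_aligned = A * Y

# Stage 2: filtering masks applied to the aligned channels; summing over
# the microphone axis yields a single enhanced spectrogram estimate.
W = crandn(M, F, T)
S_hat = (W * Y_aligned).sum(axis=0)

print(S_hat.shape)  # (257, 100)
```

The point of the sketch is the ordering: spatial alignment happens first, so the filtering stage only needs to clean up channels that already agree on the target component.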

2. RELATED WORK

Multichannel SE has been a well-known topic in signal processing for decades. Aside from leveraging spectral characteristics, multichannel SE can exploit positional information to perform spatial filtering, or beamforming, which allows extracting the target speech from noise based on spatial separability. Conventional beamforming approaches rely on the so-called "steering vector" that carries positional information about the target speech (Doclo et al., 2015; Trees, 2004), e.g., the MVDR beamformer (Capon, 1969) and its variants (Frost, 1972; Griffiths & Jim, 1982). An important step in beamforming is to steer the microphone array toward the target signal location prior to further filtering; to this end, some knowledge of the acoustic paths between the target signal and the microphones is required. Many signal processing-based multichannel SE systems utilize the ratio of the acoustic transfer functions, i.e., the relative transfer function (RTF), which represents the coupling between sensors in response to a desired source (Gannot et al., 2001; Cohen, 2004; Krueger et al., 2010; Koldovskỳ et al., 2015), to improve the denoising process.
In the past decade, deep learning approaches have remarkably changed the way SE systems are developed. Along with the success of deep neural networks (DNNs) on single-channel SE (Lu et al., 2013; Williamson et al., 2015; Pascual et al., 2017; Luo & Mesgarani, 2019; Kim et al., 2020; Zheng et al., 2021; Hu et al., 2020), many deep learning-based multichannel SE systems have emerged, yet the RTF-based signal alignment step has received little explicit attention. We postulate that this oversight may be due to the lack of sufficient spatial variety of the speech sources in popular datasets such as CHiME-3 (Barker et al., 2015), where the benefit of utilizing RTFs could be only marginal. However, in many practical situations the target speech can come from arbitrary directions, and a deep dive into the RTF-based spatial alignment aspect is therefore still of great importance.
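To make the RTF notion concrete, the following sketch computes it for a toy two-microphone scenario in which the second microphone receives an attenuated copy of the source delayed by three samples. The impulse responses and values are hypothetical; the sketch only shows that the RTF, as the ratio of the two acoustic transfer functions, encodes the inter-microphone gain and delay.

```python
import numpy as np

n_fft = 512
freqs = np.fft.rfftfreq(n_fft)  # normalized frequencies (cycles/sample)

# Toy acoustic impulse responses: mic 0 is the reference channel; mic 1
# receives the same source attenuated by 0.8 and delayed by 3 samples.
h0 = np.zeros(n_fft); h0[0] = 1.0
h1 = np.zeros(n_fft); h1[3] = 0.8

H0 = np.fft.rfft(h0)
H1 = np.fft.rfft(h1)

# Relative transfer function: ratio of acoustic transfer functions with
# respect to the reference microphone.
rtf = H1 / H0

# For a pure attenuated delay, the RTF has constant magnitude (the gain)
# and a linear phase ramp whose slope encodes the 3-sample delay.
print(np.allclose(np.abs(rtf), 0.8))                          # True
print(np.allclose(np.angle(rtf[1]), -2 * np.pi * freqs[1] * 3))  # True
```

Dividing a channel by its RTF is exactly the alignment operation: it removes the relative gain and delay so that every channel matches the reference in its target-speech component.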

3. PROPOSED METHOD

We consider an acoustic scenario with one desired speech source and several interfering noise signals in a reverberant environment. The SE system will be developed in the time-frequency domain using

