EXPLOITING SPATIAL SEPARABILITY FOR DEEP LEARNING MULTICHANNEL SPEECH ENHANCEMENT WITH AN ALIGN-AND-FILTER NETWORK

Anonymous

Abstract

Multichannel speech enhancement (SE) systems separate the target speech from background noise by performing spatial and spectral filtering. The development of multichannel SE has a long history in the signal processing field, where one crucial step is to exploit the spatial separability of sound sources by aligning the microphone signals in response to the target speech source prior to further filtering. This is similar to the human listening behavior of facing toward the speaker for better perception of the speech. However, most existing deep learning-based multichannel SE works have yet to effectively incorporate or emphasize this spatial alignment aspect in the network design; some rely on integrating conventional model-based beamformer units to implicitly extract useful spatial features, while others simply let the network figure everything out on its own. Yet the beamformer operation can be computationally expensive and numerically unstable when trained jointly with the network, and without it the model lacks guidance for learning meaningful spatial features. In this paper, we highlight this important but often overlooked step in deep learning-based multichannel SE, i.e., signal alignment, by introducing an Align-and-Filter network (AFnet) featuring a two-stage sequential masking design. The AFnet aims to estimate two sets of masks, the alignment masks and the filtering masks, to carry out temporal alignment and spectral filtering. During training, we propose to supervise the learning of the alignment masks by predicting the relative transfer functions (RTFs) of various speech source locations, followed by learning the filtering masks for signal reconstruction. During inference, the AFnet sequentially multiplies the estimated alignment and filtering masks with the microphone signals, performing the "align-then-filter" process similar to the human listening behavior. Due to the incorporation of RTF supervision, the AFnet explicitly learns interpretable spatial features without integrating traditional beamformer operations.
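To make the two-stage sequential masking concrete, the following minimal STFT-domain sketch (Python/NumPy) applies per-channel alignment masks and then a spectral filtering mask. The function name, the mask shapes, and the simple channel averaging used to combine the aligned signals are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def align_then_filter(stft_mics, align_masks, filter_mask):
    """Illustrative "align-then-filter" sequential masking.

    stft_mics:   complex microphone STFTs, shape (M, F, T)
    align_masks: complex per-channel alignment masks, shape (M, F, T),
                 meant to compensate inter-channel magnitude/phase
                 differences with respect to the target source
    filter_mask: spectral filtering mask, shape (F, T)

    Returns an enhanced single-channel STFT of shape (F, T).
    """
    aligned = align_masks * stft_mics   # stage 1: per-channel alignment
    combined = aligned.mean(axis=0)     # combine channels (assumed: simple average)
    return filter_mask * combined       # stage 2: spectral filtering
```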

1. INTRODUCTION

Speech enhancement (SE) systems can be categorized into single-channel (single microphone) and multichannel (multiple microphones) schemes. An important advantage of multichannel SE over single-channel SE is the exploitation of spatial separability, also known as spatial filtering or beamforming, enabled by the differences in amplitude and time of arrival among the received microphone signals, which arise from the different acoustic paths the sound travels to reach each microphone. In many signal processing beamforming methods (Gannot et al., 2001; Cohen, 2004; Krueger et al., 2010; Koldovskỳ et al., 2015), a key step is to align the microphone signals in response to the target signal source before any further filtering. This step, by steering the array toward the location of the target signal, aims to compensate for the differences in amplitude and time delay (or, correspondingly, magnitude and phase in the frequency domain) among the microphone signals with respect to the target source. Ideally, after the alignment step, each microphone should contain the same target speech component with no difference in amplitude or time delay (or magnitude or phase). For a linear array in the far-field, anechoic setting, perfectly steering the microphone array makes it as if the target signal arrives from the broadside, which renders the speech extraction task easier in the later filtering stage; a sketch of such an alignment is given below. This process is similar to the human listening behavior of facing toward the speaker for better perception of the speech. Thus, an efficient SE system can first align its microphone signals in response to the target speech, followed by spectral processing for fine-tuning
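As a concrete illustration of this alignment step, the sketch below (Python/NumPy) compensates the propagation delays of a far-field, anechoic linear array by applying per-frequency phase shifts to the microphone STFTs, assuming the target direction of arrival is known. The function name, the array geometry convention, and the sign of the delays are assumptions for illustration only.

```python
import numpy as np

def steering_phase_alignment(stft_mics, mic_positions, doa_deg, fs, n_fft, c=343.0):
    """Align microphone STFTs toward a far-field source at a known DOA.

    stft_mics:     complex STFTs, shape (M, F, T), with F = n_fft // 2 + 1
    mic_positions: microphone coordinates along the linear array in meters, shape (M,)
    doa_deg:       target direction of arrival in degrees (0 = broadside)
    fs, n_fft:     sampling rate and FFT size used to compute the STFTs
    c:             speed of sound in m/s
    """
    # Far-field plane-wave model: per-channel delays relative to the array origin.
    # The sign depends on the chosen geometry convention (an assumption here).
    delays = mic_positions * np.sin(np.deg2rad(doa_deg)) / c          # (M,)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)                        # (F,)
    # Per-frequency phase shifts that undo the propagation delays.
    compensation = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])  # (M, F)
    return stft_mics * compensation[:, :, None]                       # aligned STFTs
```

After such an alignment, the target components of all channels are (ideally) in phase, so a subsequent filtering stage only needs to suppress the residual noise and interference.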

