FILTER-RECOVERY NETWORK FOR MULTI-SPEAKER AUDIO-VISUAL SPEECH SEPARATION

Abstract

In this paper, we systematically study the audio-visual speech separation task in a multi-speaker scenario. Given the facial information of each speaker, the goal of this task is to separate the corresponding speech from the mixed audio. Existing works are designed for speech separation in a controlled setting with a fixed number of speakers (mostly 2 or 3), which is impractical for real applications. We therefore seek to separate the voices of a variable number of speakers with a single model. We observe two prominent issues in multi-speaker separation: 1) the separation results contain noisy voice pieces belonging to other speakers; 2) part of the target speech is missing after separation. Accordingly, we propose BFRNet, which consists of a Basic audio-visual speech separator and a Filter-Recovery Network (FRNet). FRNet refines the coarse audio separated by the basic audio-visual speech separator. For fair comparisons, we build a comprehensive benchmark for multi-speaker audio-visual speech separation to evaluate various methods. Experimental results show that our method achieves state-of-the-art performance. Furthermore, we also find that FRNet can boost the performance of other off-the-shelf speech separators, which demonstrates its generalization ability.

1. INTRODUCTION

Audio-visual speech separation has been widely used in various applications, such as speech recognition (Radford et al.; Chan et al., 2015), assistive hearing devices (Kumar et al., 2022), and online video meetings (Tamm et al., 2022). As human voices are naturally mixed together in public places, it is challenging to directly extract the information of interest from raw audio-visual signals containing multiple speakers. Separating the audio signal of each speaker can therefore serve as an effective pre-processing step for further analysis of the audio-visual signal. Convolutional neural networks (Gogate et al., 2018; Makishima et al., 2021; Gao & Grauman, 2021) and Transformers (Ramesh et al., 2021; Montesinos et al., 2022; Rahimi et al., 2022) have made prominent progress in the field of audio-visual speech separation. However, previous works (Lee et al., 2021; Gao & Grauman, 2021; Montesinos et al., 2022) mostly focus on two-speaker speech separation. Although other studies (Ephrat et al., 2018; Afouras et al., 2018b; 2019) manage to separate voices of more speakers, Ephrat et al. (2018) require a customized model for each kind of mixture instead of separating all kinds of mixtures with a single model, and Afouras et al. (2018b) mainly contribute to enhancing the voice of the target individual while ignoring others. Furthermore, Afouras et al. (2019) use pre-enrolled speaker embeddings to extract the corresponding speech, but still leave a performance gap for unenrolled speakers. Therefore, how to efficiently and effectively separate speech in a multi-speaker environment still requires further study. Through our explorations, prior works (Gao & Grauman, 2021; Montesinos et al., 2022; Chuang et al., 2020; Makishima et al., 2021; Afouras et al., 2018b; 2019) show their superiority in two-speaker speech separation but yield disappointing results with more speakers (e.g., 3, 4, or even 5 speakers).
This exhibits a simple yet important fact: the complexity of speech separation strongly correlates with the number of speakers in the mixed audio. As shown in Figure 1, the core problems of multi-speaker speech separation can be empirically summarized as twofold: 1) the noisy parts (red boxes in Figure 1) are voice pieces of other speakers wrongly included in the separation results; 2) the missing parts (gray boxes in Figure 1) are target speech pieces dropped by the model. These two problems usually occur in complex conditions, reflecting that models fail to accurately separate audio signals in such challenging scenarios. As a result, there is still a long way to go toward solving audio-visual speech separation in multi-speaker conditions. To this end, we propose an effective method, BFRNet, to conquer the aforementioned challenges in multi-speaker audio-visual speech separation. As illustrated in Figure 2, BFRNet consists of a basic audio-visual speech separator and a Filter-Recovery Network (FRNet). FRNet aims to solve the two issues in the output of any basic separator with the same input-output structure. It comprises two serial modules, i.e., Filter and Recovery. First, the Filter module utilizes the visual features of the target speaker to query the corresponding audio signals from the coarse prediction and suppress components from other speakers. Then, the Recovery module uses the clean audio yielded by the Filter module to query the missing components belonging to the target speaker from the other speakers' predictions. Essentially, FRNet calibrates the coarsely separated audio predicted by off-the-shelf models (Gao & Grauman, 2021; Montesinos et al., 2022; Chuang et al., 2020). Furthermore, we notice one remaining obstacle in evaluating audio-visual speech separation: most works (Gao & Grauman, 2021; Montesinos et al., 2022) evaluate performance on unfixed sets of samples generated by randomly mixing audio signals at each inference, which hinders reproduction and fair comparison.
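To make the two-stage refinement concrete, the following is a minimal NumPy sketch of how such query-based filtering and recovery could look, assuming both modules are realized as scaled dot-product cross-attention over per-frame features; all function names, feature shapes, and the additive recovery step are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Scaled dot-product attention: each query frame attends over key/value frames."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)      # (Tq, Tk) similarity scores
    return softmax(scores, axis=-1) @ value  # (Tq, D) attended features

def filter_module(visual_feat, coarse_audio):
    """Filter: visual features of the target speaker query the coarse
    prediction, keeping components consistent with that speaker."""
    return cross_attention(visual_feat, coarse_audio, coarse_audio)

def recovery_module(clean_audio, others_audio):
    """Recovery: the filtered audio queries the other speakers' predictions
    to retrieve target pieces that were wrongly assigned to them."""
    residual = cross_attention(clean_audio, others_audio, others_audio)
    return clean_audio + residual  # add the recovered components back

# Toy shapes: T time frames, D feature dimensions (hypothetical values).
T, D = 50, 64
rng = np.random.default_rng(0)
visual = rng.standard_normal((T, D))          # target speaker's visual features
coarse = rng.standard_normal((T, D))          # coarse separated audio features
others = rng.standard_normal((3 * T, D))      # concatenated predictions of 3 other speakers

clean = filter_module(visual, coarse)
refined = recovery_module(clean, others)
print(refined.shape)  # (50, 64)
```

The serial order matters in this sketch: filtering first yields a cleaner query, so the recovery attention is driven by target-consistent features rather than by the noisy coarse prediction.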
Consequently, to unify the evaluation protocols, we create a comprehensive benchmark to fairly verify model performance. Specifically, for each type of mixture, we randomly sample test videos without replacement to form the speech mixtures. The constructed fixed test sets are used in all experiments to ensure fairness and reproducibility of the results. To sum up, our contributions are as follows:
• First, we design a Filter-Recovery Network (FRNet) with a multi-speaker training paradigm to improve the quality of multi-speaker speech separation.
• Second, to test different methods on a fair basis, we build a well-established benchmark for multi-speaker audio-visual speech separation. We not only unify the evaluation protocol but also re-implement several state-of-the-art methods on this benchmark.
• Finally, our experiments demonstrate that the proposed FRNet can be equipped with other models to further improve the quality of audio separation and achieves state-of-the-art performance.
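The fixed-test-set protocol described above can be sketched in a few lines; this is a hypothetical illustration of the idea (sampling video IDs without replacement within each mixture, under a fixed seed so the set is reproducible), not the benchmark's actual construction script.

```python
import random

def build_fixed_test_set(video_ids, num_speakers, num_mixtures, seed=42):
    """Build a fixed, reproducible list of speech mixtures.

    Each mixture is formed by sampling `num_speakers` distinct test videos
    (without replacement within a mixture); fixing the seed makes the same
    test set available to every evaluated method.
    """
    rng = random.Random(seed)
    mixtures = []
    for _ in range(num_mixtures):
        mixtures.append(rng.sample(video_ids, num_speakers))
    return mixtures

# Hypothetical usage: one fixed test set per mixture type (2-mix, 3-mix, ...).
ids = [f"video_{i:04d}" for i in range(100)]
test_2mix = build_fixed_test_set(ids, num_speakers=2, num_mixtures=5)
test_3mix = build_fixed_test_set(ids, num_speakers=3, num_mixtures=5)
```

Because the seed and sampling order are fixed, any two methods evaluated on `test_2mix` see exactly the same mixtures, which is the point of the benchmark.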

2. RELATED WORKS

Audio-Only Speech Separation. Using only the audio modality for speech separation faces the problem of speaker agnosticism. Some works (Liu et al., 2019; Wang et al., 2018) utilize the speaker's voice embedding as a hint to isolate the target speech. Current methods mostly treat audio-only speech separation as a label permutation problem. Chen et al. (2017) cluster similar speech components to perform speech separation. Luo et al. (2018) require no prior information about the number of speakers. (Luo




Figure 1: Illustration of two critical issues in multi-speaker audio-visual speech separation. The missing parts in separation results compared to ground truth speech are indicated by gray boxes, while the noisy parts are marked by red boxes.

