FILTER-RECOVERY NETWORK FOR MULTI-SPEAKER AUDIO-VISUAL SPEECH SEPARATION

Abstract

In this paper, we systematically study the audio-visual speech separation task in a multi-speaker scenario. Given the facial information of each speaker, the goal of this task is to separate the corresponding speech from the mixed audio. Existing works are designed for speech separation in a controlled setting with a fixed number of speakers (mostly 2 or 3), which is impractical for real applications. As a result, we aim to separate the voices of a variable number of speakers with a single model. Based on our observations, there are two prominent issues in multi-speaker separation: 1) the separation results contain noisy voice pieces belonging to other speakers; 2) part of the target speech is missing after separation. Accordingly, we propose BFRNet, which consists of a Basic audio-visual speech separator and a Filter-Recovery Network (FRNet). FRNet refines the coarse audio separated by the basic audio-visual speech separator. For fair comparisons, we build a comprehensive benchmark for multi-speaker audio-visual speech separation to evaluate various methods. Experimental results show that our method achieves state-of-the-art performance. Furthermore, we find that FRNet can also boost the performance of other off-the-shelf speech separators, demonstrating its ability to generalize.
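The two-stage design described above (a basic separator producing a coarse estimate, followed by a filter-and-recovery refinement) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the module names, layer shapes, and the mask-based filter/recovery heads are all assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class FRNetSketch(nn.Module):
    """Hypothetical sketch of the filter-recovery idea.

    Given a coarse separated spectrogram and the mixture spectrogram, it
    predicts (1) a filter mask that suppresses residual voice pieces from
    other speakers and (2) a recovery mask that restores target speech
    missing from the coarse estimate. All design details are illustrative
    assumptions, not taken from the paper.
    """

    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        # Filter head: attenuates time-frequency bins dominated by others.
        self.filter_head = nn.Sequential(
            nn.Conv1d(freq_bins, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, freq_bins, kernel_size=3, padding=1), nn.Sigmoid())
        # Recovery head: looks at the filtered estimate together with the
        # mixture and decides which missing target components to add back.
        self.recover_head = nn.Sequential(
            nn.Conv1d(2 * freq_bins, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, freq_bins, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, coarse, mixture):
        # coarse, mixture: (batch, freq_bins, time) magnitude spectrograms
        filtered = coarse * self.filter_head(coarse)  # drop noisy voice pieces
        recover_mask = self.recover_head(torch.cat([filtered, mixture], dim=1))
        return filtered + mixture * recover_mask      # restore missing speech
```

Because the refinement operates only on a coarse estimate plus the mixture, a module of this shape could in principle be attached to any upstream separator, which is consistent with the generalization claim made in the abstract.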

1. INTRODUCTION

Audio-visual speech separation has been widely used in various applications, such as speech recognition (Radford et al.; Chan et al., 2015), assistive hearing devices (Kumar et al., 2022), and online video meetings (Tamm et al., 2022). As human voices are naturally mixed together in public places, it is challenging to directly extract the information of interest from raw audio-visual signals containing multiple speakers. As a result, separating the audio signal of each speaker can serve as an effective pre-processing step for further analysis of the audio-visual signal. Convolutional neural networks (Gogate et al., 2018; Makishima et al., 2021; Gao & Grauman, 2021) and Transformers (Ramesh et al., 2021; Montesinos et al., 2022; Rahimi et al., 2022) have made prominent progress in the field of audio-visual speech separation. However, previous works (Lee et al., 2021; Gao & Grauman, 2021; Montesinos et al., 2022) mostly focus on two-speaker speech separation. Although other studies (Ephrat et al., 2018; Afouras et al., 2018b; 2019) manage to separate voices for more speakers, Ephrat et al. (2018) requires a customized model for each kind of mixture instead of separating all kinds of mixtures with a single model, and Afouras et al. (2018b) mainly contributes to enhancing the voice of the target individual while ignoring the others. Furthermore, Afouras et al. (2019) uses pre-enrolled speaker embeddings to extract the corresponding speech, but still leaves a performance gap for unenrolled speakers. Therefore, how to efficiently and effectively separate speech in a multi-speaker environment still requires further study. Through our explorations, prior works (Gao & Grauman, 2021; Montesinos et al., 2022; Chuang et al., 2020; Makishima et al., 2021; Afouras et al., 2019) show their superiority in two-speaker speech separation but yield disappointing results with more speakers (e.g., 3, 4, or even 5 speakers). It exhibits a simple yet important fact that the complexity of speech separation

