FILTER-RECOVERY NETWORK FOR MULTI-SPEAKER AUDIO-VISUAL SPEECH SEPARATION

Abstract

In this paper, we systematically study the audio-visual speech separation task in a multi-speaker scenario. Given the facial information of each speaker, the goal of this task is to separate the corresponding speech from the mixed speech. Existing works are designed for speech separation in a controlled setting with a fixed number of speakers (mostly 2 or 3), which is impractical for real applications. We therefore use a single model to separate voices from mixtures with a variable number of speakers. We observe two prominent issues in multi-speaker separation: 1) the separation results contain noisy voice pieces belonging to other speakers; 2) part of the target speech is missing after separation. Accordingly, we propose BFRNet, which consists of a Basic audio-visual speech separator and a Filter-Recovery Network (FRNet). FRNet refines the coarse audio separated by the basic audio-visual speech separator. For fair comparison, we build a comprehensive benchmark for multi-speaker audio-visual speech separation to verify the performance of various methods. Experimental results show that our method achieves state-of-the-art performance. Furthermore, we find that FRNet can boost the performance of other off-the-shelf speech separators, which demonstrates its generalization ability.

1. INTRODUCTION

Audio-visual speech separation has been extensively used in various applications, such as speech recognition (Radford et al.; Chan et al., 2015), assistive hearing devices (Kumar et al., 2022), and online video meetings (Tamm et al., 2022). As human voices are naturally mixed together in public places, it is challenging to directly extract the information of interest from raw audio-visual signals containing multiple speakers. Separating the audio signal of each speaker can therefore serve as an effective pre-processing step for further analysis of the audio-visual signal. Convolutional neural networks (Gogate et al., 2018; Makishima et al., 2021; Gao & Grauman, 2021) and Transformers (Ramesh et al., 2021; Montesinos et al., 2022; Rahimi et al., 2022) have made prominent progress in the field of audio-visual speech separation. However, previous works (Lee et al., 2021; Gao & Grauman, 2021; Montesinos et al., 2022) mostly focus on two-speaker speech separation. Although other studies (Ephrat et al., 2018; Afouras et al., 2018b; 2019) manage to separate voices for more speakers, Ephrat et al. (2018) require customized models for each kind of mixture instead of separating all kinds of mixtures with a single model, and Afouras et al. (2018b) mainly contribute to enhancing the voice of the target individual while ignoring others. Furthermore, Afouras et al. (2019) use pre-enrolled speaker embeddings to extract the corresponding speech, but still leave a performance gap compared with unenrolled speakers. Therefore, how to efficiently and effectively separate speech in a multi-speaker environment still requires further study. In our explorations, prior works (Gao & Grauman, 2021; Montesinos et al., 2022; Chuang et al., 2020; Makishima et al., 2021; Afouras et al., 2019) show their superiority in two-speaker speech separation but yield disappointing results with more speakers (e.g., 3, 4, or even 5 speakers).
This exhibits a simple yet important fact: the complexity of speech separation grows with the number of speakers in the mixture. As illustrated in Figure 1, the separation results in such cases suffer from two issues: 1) The noisy parts (red boxes in Figure 1) contain components from other speakers within the separated audio. 2) The missing parts (gray boxes in Figure 1) are target speech pieces dropped by the models. These two problems usually occur in complex conditions, reflecting that models fail to accurately separate audio signals in such challenging scenarios. As a result, there is still a long way to go toward solving audio-visual speech separation for multi-speaker conditions.

To this end, we propose an effective method, BFRNet, to address the aforementioned challenges in multi-speaker audio-visual speech separation. As illustrated in Figure 2, BFRNet consists of a basic audio-visual speech separator and a Filter-Recovery Network (FRNet). The FRNet aims at solving the two issues in the results of any identically structured basic separator. It comprises two serial modules, i.e., Filter and Recovery. First, the Filter module utilizes the visual features of the target speaker to query the corresponding audio signals from the coarse prediction and suppress components from other speakers. Then, the Recovery module uses the clean audio yielded by the Filter module to query the missing components that belong to the target speaker from the predictions of the other speakers. Essentially, FRNet aims to calibrate the coarsely separated audio predicted by off-the-shelf models (Gao & Grauman, 2021; Montesinos et al., 2022; Chuang et al., 2020).

Furthermore, we have noticed that there is still one obstacle in evaluating audio-visual speech separation. Most works (Gao & Grauman, 2021; Montesinos et al., 2022) evaluate performance on an unfixed number of samples generated by randomly mixing the audio signals during each inference, which makes results hard to reproduce and comparisons unfair.
Consequently, to unify the evaluation protocols, we create a comprehensive benchmark to verify the models' performance fairly. Specifically, for each type of mixture, we randomly sample test videos without replacement to make up the speech mixtures. The constructed fixed test sets serve for all experiments to ensure fairness and reproducibility of the results.

To sum up, our contributions are as follows:

• First, we design a Filter-Recovery Network (FRNet) with a multi-speaker training paradigm to improve the quality of multi-speaker speech separation.

• Second, to test the different methods on a fair basis, we create a well-established benchmark for multi-speaker audio-visual speech separation. We not only unify the evaluation protocol but also re-implement several state-of-the-art methods on this benchmark.

• Finally, in the experiments, we demonstrate that our proposed FRNet can be combined with other models to further improve the quality of audio separation and achieve state-of-the-art performance.

2. RELATED WORKS

Audio-Only Speech Separation. Using only the audio modality for speech separation faces the problem of speaker agnosticism. Some works (Liu et al., 2019; Wang et al., 2018) utilize the speaker's voice embedding as a hint to isolate the target speech. Current methods mostly treat audio-only speech separation as a label permutation problem. Chen et al. (2017) cluster similar speech to perform speech separation. Luo et al. (2018) require no prior information about the number of speakers. Luo & Mesgarani (2019) adopt a deep network comprising a series of convolutional layers and train it with a permutation-invariant loss (Yu et al., 2017).

Visual Sound Separation. Various kinds of sound separation have been explored in the literature. One type focuses on music separation (Zhao et al., 2018; Gao & Grauman, 2019; Xu et al., 2019; Gan et al., 2020), where the diverse shapes of musical instruments and the distinctive patterns of music sounds are the key clues. Other works concentrate on in-the-wild sounds (Gao et al., 2018; Tzinis et al., 2020; Chen et al., 2020), such as animal sounds, vehicle sounds, etc. Gao et al. (2018) incorporate image recognition results as the advisor and learn prototypical spectral patterns for each sounding object. Tzinis et al. (2020) extend to unsupervised, open-domain audio-visual sound separation and develop a large new dataset. Chen et al. (2020) propose an algorithm to generate temporally synchronized sound given mismatched visual information.

Visual Speech Separation. Speech, as the audio most closely associated with humans, is extensively studied and has a broad range of applications. Due to the natural correlation between face and speech, many works (Gabbay et al., 2018; Lu et al., 2018; Ephrat et al., 2018; Afouras et al., 2019; Chung et al., 2020; Hegde et al., 2021; Rahimi et al., 2022) employ face-related information to separate speech from a mixture.
Chung et al. (2020) use only still images containing facial appearance to isolate speech, relying on the consistency between face identity and speech identity. Numerous methods (Afouras et al., 2018b; Gao & Grauman, 2021; Ephrat et al., 2018) exploit the synchronized clues of lip motions and voice fluctuations. Lu et al. (2018) integrate optical flow and lip movements to predict the spectrogram masks. Hegde et al. (2021) propose synthesizing a virtual visual stream to deal with situations where the visual stream is unreliable or completely absent. Another family of works (Owens & Efros, 2018; Afouras et al., 2020; Truong et al., 2021) combines multiple tasks for joint learning. The works most relevant to ours are Afouras et al. (2018b), Shi et al. (2020), and Yao et al. (2022). Afouras et al. (2018b) also pay attention to the uncontrolled environment with several speakers, but only focus on enhancing the target speech and suppressing the noisy voices. Our method further separates every speech component for different mixtures with a single model. Shi et al. (2020) and Yao et al. (2022) perform a coarse-to-fine procedure by adopting an additional refining separation phase. However, the refining phase applies the same model as the coarse phase. Thus, if the coarse phase cannot achieve a clean separation, the refined results will likewise be sub-optimal.

3. METHOD

To the best of our knowledge, this is the first work that systematically studies audio-visual speech separation in a multi-speaker setting. In this section, we first elaborate on our proposed multi-speaker training strategy in Sec. 3.1. Next, we introduce the adopted basic audio-visual speech separator in Sec. 3.2, which takes the visual information and the speech mixture as input and outputs separated speech. Then, the Filter-Recovery Network is described in detail in Sec. 3.3. Finally, we formulate the objective function for training the model in Sec. 3.4.

3.1. OVERVIEW

Given a video containing $S$ simultaneous speakers, our goal is to isolate the individual speech of each speaker. Formally, we denote the time-domain speech mixture as $x = \sum_{i=1}^{S} x_i$, $x_i \in \mathbb{R}^{T_x}$, where $T_x$ represents the time length and $x_i$ is the separate speech of the i-th speaker. Since acquiring the exact individual ground truth from mixtures recorded in real scenes is not yet possible, we follow previous works (Afouras et al., 2018b; Gao & Grauman, 2021) and synthesize mixtures by adding individual speech signals together. Current works are designed for a fixed number of speakers in the mixture, mostly 2 or 3. However, these models perform poorly in practical applications, where the number of speakers usually varies. Alternatively, they take the inefficient strategy of training a separate model for each kind of mixture with a different number of speakers. To meet the demand of practical applications, we create mixtures with various numbers of speakers during training, i.e., $S$ ranges from 2 to 5. The network is required to isolate all speech components for all kinds of mixtures. For the i-th speaker, the basic audio-visual speech separator predicts a mask $M_i \in \mathbb{R}^{2 \times F \times T_X}$ of the input spectrogram $X$ and outputs the visual feature $V_i$. Afterward, for each mixture, the $S$ separated masks and visual features are sent to the FRNet to filter the noisy components and recover the missing ones. The FRNet outputs an improved mask for each speaker in the mixture, which is finally used to restore the time-domain speech.
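As a worked example of the multi-speaker training setup, the sketch below computes how many mixtures of each size fit into one training batch. It uses the values reported in Sec. 4.2 (a 2:1:1:1 ratio of 2-mix, 3-mix, 4-mix, and 5-mix and 256 speakers per batch) and assumes that the ratio refers to the number of mixtures rather than the number of speakers; the function name `batch_composition` is hypothetical.

```python
def batch_composition(total_speakers=256, ratio=(2, 1, 1, 1), sizes=(2, 3, 4, 5)):
    """Return {mixture size: number of mixtures} for one training batch.

    Assumption: `ratio` counts mixtures, so one "unit" of the ratio
    contributes 2*2 + 1*3 + 1*4 + 1*5 = 16 speakers.
    """
    speakers_per_unit = sum(r * s for r, s in zip(ratio, sizes))
    if total_speakers % speakers_per_unit != 0:
        raise ValueError("total_speakers is not compatible with the ratio")
    units = total_speakers // speakers_per_unit
    return {s: r * units for r, s in zip(ratio, sizes)}


counts = batch_composition()
# Total number of speakers implied by the composition.
speakers = sum(size * n for size, n in counts.items())
```

Under this reading, a batch holds 32 two-speaker mixtures and 16 mixtures each of three, four, and five speakers, which adds up to exactly 256 speakers.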

3.2. BASIC AUDIO-VISUAL SPEECH SEPARATOR

In our framework, the basic audio-visual speech separator can be replaced with any network that outputs masks of the mixture spectrogram. Following (Gao & Grauman, 2021), we adopt both lip and face clues to guide target speech separation. Specifically, a lip net captures the consistency between lip motion and continuous pronunciation, and a face net explores the relationship between speech and face attributes.

Lip Net. Following the previous structure (Ma et al., 2021; Gao & Grauman, 2021), we feed $T_v$ consecutive frames of lip regions into a 3D convolutional layer followed by a ShuffleNet v2 (Ma et al., 2018) network to extract mouth features. A temporal convolutional network is further utilized to output the lip motion features $\mathrm{Lip}_i \in \mathbb{R}^{C_l \times T_v}$ for the i-th speaker.

Face Net. The Face Net aims at leveraging the correspondence between face attributes and speech. A ResNet-18 network takes a single face image as input and outputs a face embedding for the i-th speaker. We repeat the face embedding along the time dimension to obtain $\mathrm{Face}_i \in \mathbb{R}^{C_f \times T_v}$.

Encoder-Decoder Separator. For speech analysis and separation, we adopt a U-Net consisting of an encoder, a fusion module, and a decoder. The encoder is composed of multiple convolution and pooling layers. It takes the mixture spectrogram $X$ as input and outputs an audio feature $Au \in \mathbb{R}^{C_a \times T_v}$. Following (Xiong et al., 2022), we use an AV-Fusion module, an attention-like operation, to obtain a vision-related audio representation for a given speaker. We first concatenate the lip feature $\mathrm{Lip}_i$ and the face feature $\mathrm{Face}_i$ along the channel dimension to get the visual feature $V_i \in \mathbb{R}^{C_v \times T_v}$ (so that $C_v = C_l + C_f$), then feed $V_i$ and $Au$ to the AV-Fusion module to obtain an enhanced feature. This feature is delivered to the decoder to predict a mask $M_i$ of the input spectrogram for the i-th speaker, which denotes the projection values of the prediction onto the mixture.
Note that the values might be positive or negative, as the spectrograms have both positive and negative values.

3.3. FILTER AND RECOVERY NETWORK

Due to the high complexity of speech mixtures with multiple speakers (especially more than 3 speakers), the separated target speech still exhibits two apparent issues: 1) voices from other speakers remain; 2) the isolated speech is partially missing compared to the ground truth speech. To solve the two problems, we introduce the Filter-Recovery Network (FRNet), which takes the visual features and the separated coarse speech masks as input and outputs more precise target speech masks. For convenience, we concatenate the real and imaginary parts of $M_i$ obtained by the basic separator to form a tensor of dimension $2F \times T_X$ before feeding it into FRNet, which we still denote $M_i$. Besides, to align the dimensions, we transform $V_i$ to $V'_i$ of shape $2F \times T_X$ with a convolutional layer. As Fig. 2 shows, the FRNet consists of a Filter Net and a Recovery Net. The former utilizes visual features to remove the noisy voices from the separated speech, and the latter learns the correlation between all separated speech to extract the missing voices from others. Since the attention mechanism (Vaswani et al., 2017) enhances some parts of the input data while suppressing other parts, we adopt attention as the component module.

For ease of description, we define some operations here. Given three tensors $q \in \mathbb{R}^{D_h \times N_q}$ and $k, v \in \mathbb{R}^{D_h \times N_{kv}}$, we compute the weighted sum of $v$:

$$[q', k', v'] = [U_q q,\; U_k k,\; U_v v], \quad U_q, U_k, U_v \in \mathbb{R}^{D_h \times D_h}, \tag{1}$$

$$W = \mathrm{softmax}\!\left((q')^{\top} k' / \sqrt{D_h}\right), \quad W \in \mathbb{R}^{N_q \times N_{kv}}, \tag{2}$$

$$\mathrm{Attn}(q, k, v) = v' W^{\top}, \quad \mathrm{Attn}(q, k, v) \in \mathbb{R}^{D_h \times N_q}. \tag{3}$$

Filter Net. Since the visual information is highly correlated with speech, we utilize each speaker's visual knowledge to reduce irrelevant voices in the separation results during the Filter phase. The Filter Net consists of $L$ basic layers, each consisting of an Attn module and an MLP module, where the MLP block contains two fully-convolutional layers and a ReLU activation function. The l-th layer takes $V'_i$ and $M^l_i$ as input and outputs $M^{l+1}_i$, where $M^0_i = M_i$ and LN denotes Layer Normalization (Ba et al., 2016):

$$z^{l+1} = \mathrm{LN}\!\left(\mathrm{Attn}(V'_i, M^l_i, M^l_i) + M^l_i\right), \quad z^{l+1} \in \mathbb{R}^{2F \times T_X}, \tag{4}$$

$$M^{l+1}_i = \mathrm{LN}\!\left(\mathrm{MLP}(z^{l+1}) + z^{l+1}\right), \quad M^{l+1}_i \in \mathbb{R}^{2F \times T_X}. \tag{5}$$

For the i-th speaker, the Filter Net outputs the mask $\bar{M}_i = M^L_i$, in which the noisy speech pieces are removed.

Recovery Net. According to our analysis, pieces of the target speech leak into the separation results that the basic separator produces for other speakers. We therefore design the Recovery Net to pull out the missing voice pieces $\hat{M}_i$ of the i-th speaker from the separation results of the other speakers, which is much easier than recovering the missing parts from the original mixture. We define a rearrange operation that stacks the $S-1$ masks of the other speakers:

$$\mathrm{temp} = [M_1; \cdots; M_{i-1}; M_{i+1}; \cdots; M_S], \quad \mathcal{M}_i \xleftarrow{\text{reshape}} \mathrm{temp}, \quad \mathcal{M}_i \in \mathbb{R}^{T_X \times 2F \times (S-1)}, \tag{6}$$

where the reshape operation rearranges the input tensor to the target dimension. The Recovery Net aims to learn the association between the filtered mask $\bar{M}_i$ and the stacked masks $\mathcal{M}_i$. It consists of $L$ basic layers, similar to the decoder of the Transformer (Vaswani et al., 2017). Specifically, the output of layer $l+1$ is given by the following equations, where $q^0_i = \bar{M}_i$:

$$\bar{q}^l_i = \mathrm{LN}\!\left(\mathrm{Attn}(q^l_i, q^l_i, q^l_i) + q^l_i\right), \quad \tilde{q}^l_i \xleftarrow{\text{reshape}} \bar{q}^l_i, \quad \bar{q}^l_i \in \mathbb{R}^{2F \times T_X}, \; \tilde{q}^l_i \in \mathbb{R}^{T_X \times 2F \times 1}, \tag{7}$$

$$z^l[t] = \mathrm{LN}\!\left(\mathrm{Attn}(\tilde{q}^l_i[t], \mathcal{M}_i[t], \mathcal{M}_i[t]) + \tilde{q}^l_i[t]\right), \quad z^l[t] \in \mathbb{R}^{2F \times 1}, \tag{8}$$

$$\hat{z}^l[t] = \mathrm{LN}\!\left(\mathrm{MLP}(z^l[t]) + z^l[t]\right), \quad \hat{z}^l[t] \in \mathbb{R}^{2F \times 1}, \tag{9}$$

$$\hat{z}^l = [\hat{z}^l[0]; \ldots; \hat{z}^l[t]; \ldots; \hat{z}^l[T_X - 1]], \quad q^{l+1}_i \xleftarrow{\text{reshape}} \hat{z}^l, \quad \hat{z}^l \in \mathbb{R}^{T_X \times 2F \times 1}, \; q^{l+1}_i \in \mathbb{R}^{2F \times T_X}. \tag{10}$$

We hold the view that the speech separation issues analyzed above do not always manifest to the same degree at different time slots. As a result, in Eq. 8, the attention operations are performed separately for each time slice of the tensor to recover the missing parts. Finally, the missing components $\hat{M}_i$ are obtained by applying a fully-convolutional layer to $q^L_i$, i.e., $\hat{M}_i = \mathrm{FC}(q^L_i)$. After obtaining the noise-filtered result $\bar{M}_i$ and the recovered missing part $\hat{M}_i$, we add them up to obtain the clean and complete target separation mask:

$$\tilde{M}_i = \bar{M}_i + \hat{M}_i. \tag{11}$$
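The attention operation and the two ways it is used in this section can be sketched in NumPy as follows. This is a minimal illustration rather than the training implementation: the weights are random, Layer Normalization is applied over the feature axis without learned parameters, the MLP is modeled as two pointwise linear maps, and the attention logits are scaled by the square root of the feature dimension; all of these, along with every function name, are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Assumption: normalize over the feature axis (axis 0), no learned affine.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def attn(q, k, v, Uq, Uk, Uv):
    """q: (D, Nq); k, v: (D, Nkv); returns the weighted sum of v, shape (D, Nq)."""
    qp, kp, vp = Uq @ q, Uk @ k, Uv @ v
    W = softmax(qp.T @ kp / np.sqrt(q.shape[0]), axis=-1)  # (Nq, Nkv)
    return vp @ W.T


def mlp(x, W1, W2):
    # Assumption: two pointwise linear maps with a ReLU in between.
    return W2 @ np.maximum(W1 @ x, 0.0)


def filter_layer(V, M, p):
    """One Filter Net layer: the visual feature V queries the coarse mask M."""
    z = layer_norm(attn(V, M, M, *p["attn"]) + M)
    return layer_norm(mlp(z, *p["mlp"]) + z)


def recovery_cross_attention(q, others, p):
    """Per-time-slice cross-attention of the Recovery Net.

    q: (T, D, 1) query slices; others: (T, D, S-1) stacked masks of other speakers.
    """
    out = [layer_norm(attn(q[t], others[t], others[t], *p["attn"]) + q[t])
           for t in range(q.shape[0])]
    return np.stack(out)  # (T, D, 1)


rng = np.random.default_rng(0)
D, T, S = 8, 5, 3  # toy sizes: D plays the role of 2F, T the role of T_X


def rand_params():
    return {"attn": [0.1 * rng.standard_normal((D, D)) for _ in range(3)],
            "mlp": [0.1 * rng.standard_normal((D, D)) for _ in range(2)]}


V = rng.standard_normal((D, T))                          # aligned visual feature
masks = [rng.standard_normal((D, T)) for _ in range(S)]  # coarse masks of S speakers
filtered = filter_layer(V, masks[0], rand_params())      # (D, T)
others = np.stack(masks[1:], axis=-1).transpose(1, 0, 2) # (T, D, S-1)
query = filtered.T[:, :, None]                           # (T, D, 1)
recovered = recovery_cross_attention(query, others, rand_params())
```

In the Filter layer the query comes from the visual stream and the keys and values from the coarse mask, so each output column is a re-weighted combination of mask columns. In the Recovery step each time slice attends only over the S-1 other speakers at that same time index, which matches the per-slice formulation in the text.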

3.4. LOSS FUNCTION

We jointly optimize the outputs of the basic separator and the FRNet. Following previous work (Pan et al., 2022), we take the scale-invariant signal-to-noise ratio (SI-SNR) (Le Roux et al., 2019) as the loss function. The predicted masks $M_i$ and $\tilde{M}_i$ are separately multiplied by the mixture spectrogram $X$ to get the separated spectrograms, which are finally transformed by inverse STFT to restore the time-domain speech $y_i$ and $\tilde{y}_i$. Given any prediction $\hat{s}$ and ground truth $s$, the SI-SNR loss is computed as:

$$\mathcal{L}_{\text{SI-SNR}}(\hat{s}, s) = -10 \log_{10} \frac{\left\| \frac{\langle \hat{s}, s \rangle s}{\|s\|^2} \right\|^2}{\left\| \hat{s} - \frac{\langle \hat{s}, s \rangle s}{\|s\|^2} \right\|^2}. \tag{12}$$

We jointly train the basic separator and the FRNet with an overall loss for the i-th speaker:

$$\mathcal{L}_i = \lambda \, \mathcal{L}_{\text{SI-SNR}}(y_i, x_i) + (1 - \lambda) \, \mathcal{L}_{\text{SI-SNR}}(\tilde{y}_i, x_i), \tag{13}$$

where $\lambda$ controls the ratio of the two loss parts. For a batch of mixtures containing a total of $N$ speakers, the training loss is the average of all $N$ individual losses.
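The loss described above can be sketched in NumPy as follows. The functions implement the SI-SNR formula and the λ-weighted combination as written in this section; the small `eps` term for numerical stability, the function names, and the toy signals are assumptions added for the example, and no mean subtraction is applied because the text does not mention one.

```python
import numpy as np


def si_snr_loss(est, ref, eps=1e-8):
    """Negative SI-SNR in dB between a prediction `est` and ground truth `ref`."""
    est = np.asarray(est, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    # Projection of the prediction onto the ground truth: <est, ref> ref / ||ref||^2.
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return -10.0 * np.log10(ratio)


def speaker_loss(y_basic, y_refined, x, lam=0.5):
    """Weighted sum of the basic-separator loss and the FRNet loss for one speaker."""
    return lam * si_snr_loss(y_basic, x) + (1.0 - lam) * si_snr_loss(y_refined, x)


rng = np.random.default_rng(0)
x = rng.standard_normal(16000)            # toy ground-truth speech
n = rng.standard_normal(16000)            # toy interference
y_basic = x + 0.1 * n                     # coarser estimate
y_refined = x + 0.01 * n                  # cleaner estimate
loss = speaker_loss(y_basic, y_refined, x, lam=0.5)
```

Because the prediction is first projected onto the ground truth, rescaling the prediction leaves the loss unchanged, and a cleaner estimate yields a lower (more negative) value.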

4.1. DATASETS

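VoxCeleb2 (Chung et al., 2018). This dataset is organized by identity labels, with 5994 speakers in the training set and another 118 in the test set. It contains more than 1 million samples, each consisting of an utterance and synchronized face tracks. Following (Gao & Grauman, 2021), we hold out two videos for each speaker in the original training set and use the remaining videos as our training set. From the held-out videos of the original training set, we randomly sample 7200 videos to build the seen test set; its speakers also appear in the training set, but the specific utterances do not. Our unseen test set consists of 7200 videos randomly chosen from the original test set. All remaining videos in the VoxCeleb2 dataset form our validation set.

For each dataset, we randomly mix the test videos to build 2-mix, 3-mix, 4-mix, and 5-mix test sets without any subjective considerations. Each video appears only once in each mixture set. To be more explicit, we list the number of videos and of each kind of mixture for all test sets in Tab. 1.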
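The construction of fixed test mixtures, in which each video appears only once per mixture set, can be sketched as below. This is an illustrative reading of the protocol, not the benchmark script: the function name `build_mix_set`, the seeding scheme, the toy video identifiers, and the choice to drop leftover videos that cannot fill a complete mixture are all assumptions.

```python
import random


def build_mix_set(video_ids, k, seed=0):
    """Group videos into k-speaker mixtures, sampling without replacement."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)   # fixed seed keeps the test set reproducible
    usable = len(ids) - len(ids) % k   # assumption: drop videos that cannot fill a mixture
    return [tuple(ids[i:i + k]) for i in range(0, usable, k)]


videos = [f"vid_{i:04d}" for i in range(60)]  # hypothetical identifiers
test_sets = {k: build_mix_set(videos, k, seed=k) for k in (2, 3, 4, 5)}
```

Fixing the seed means every method is evaluated on exactly the same mixtures, which is the property the benchmark relies on for fair comparison.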

4.2. IMPLEMENTATION DETAILS

Data Process. Following the previous setting (Gao & Grauman, 2021), we randomly cut 2.55-second clips from videos sampled at 25 fps and the corresponding audio sampled at 16 kHz as the training pairs. For all utterances that make up a mixture, we first normalize each one to the same energy, which corresponds to the same loudness for each utterance. Then we add up all normalized speech to obtain the mixture. STFT is conducted on the mixture waveform using a Hann window of length 400, a hop size of 160, and an FFT size of 512 to output the complex spectrogram of dimension 2 × 257 × 256, which is taken as the input to the U-Net encoder. A randomly selected frame from the video is rescaled to 224 × 224 and sent to the face analysis network. The input to the lip reading network is 64 consecutive frames of cropped grayscale mouth regions of dimension 88 × 88. We adopt the official implementation of 2D face landmark detection (Bulat & Tzimiropoulos, 2017) to detect the mouth landmarks and crop the mouth regions.

Training Setting. To achieve good separation results for mixtures containing different numbers of speakers, we separate all types of mixtures into individual speech simultaneously. In each batch, the ratio of the numbers of 2-mix, 3-mix, 4-mix, and 5-mix is set to 2:1:1:1, and the total number of speakers is 256. As for selecting utterances to create mixtures, some methods (Gao & Grauman, 2021; Rahimi et al., 2022) perform random sampling with replacement. Such an approach results in some utterances being sampled multiple times while others are ignored, in which case the model might overfit some samples and underfit others. Instead, at the beginning of each training epoch, we shuffle all samples randomly and pick them one by one to ensure no duplication or omission.

Optimization. Empirically, we adopt the Adam optimizer to train the network with a weight decay of 1e-4 and a learning rate of 1e-4.
We drop the learning rate by a factor of 0.1 after epochs 12 and 15, and train models for 19 epochs, at which point the loss has almost plateaued. Tab. 8 displays experimental results for different values of λ, and we finally set it to 0.5.

Evaluation. Following previous methods (Afouras et al., 2018b; Ephrat et al., 2018; Gao & Grauman, 2021; Rahimi et al., 2022), we adopt the standard blind source separation metric Signal-to-Distortion Ratio (SDR) (Vincent et al., 2006), which measures the ratio between the energy of the target signal and that of the errors. To further assess speech quality and intelligibility, we also employ the Perceptual Evaluation of Speech Quality (PESQ) (Rix et al., 2001) metric.
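The Data Process step can be checked with a short NumPy sketch that mixes energy-normalized utterances and computes the spectrogram with the stated STFT settings. The sketch only illustrates why a 2.55-second clip at 16 kHz yields a 2 × 257 × 256 input. Several details are assumptions, not taken from the text: energy normalization is implemented as unit RMS, the signal is reflect-padded by half the FFT size on both sides, and the 400-sample Hann window is zero-padded symmetrically to 512 samples.

```python
import numpy as np

SR, CLIP_SEC = 16000, 2.55
WIN, HOP, N_FFT = 400, 160, 512


def normalize_energy(wave, eps=1e-8):
    # Assumption: "same energy" is implemented as unit RMS.
    return wave / (np.sqrt(np.mean(wave ** 2)) + eps)


def mix(waves):
    """Sum energy-normalized utterances of equal length into one mixture."""
    return np.sum([normalize_energy(w) for w in waves], axis=0)


def stft(wave):
    """Complex STFT stacked as (real/imag, frequency bins, frames)."""
    window = np.zeros(N_FFT)
    start = (N_FFT - WIN) // 2
    window[start:start + WIN] = np.hanning(WIN)         # window centered in the FFT frame
    padded = np.pad(wave, N_FFT // 2, mode="reflect")   # assumption: centered frames
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    spec = np.fft.rfft(frames * window, axis=1).T       # (257, n_frames), complex
    return np.stack([spec.real, spec.imag])             # (2, 257, n_frames)


rng = np.random.default_rng(0)
n_samples = int(round(SR * CLIP_SEC))                   # 40800 samples
utterances = [rng.standard_normal(n_samples) for _ in range(3)]
X = stft(mix(utterances))
```

With these conventions the clip has 40800 samples, which gives 1 + 40800 / 160 = 256 frames, and an FFT size of 512 gives 512 / 2 + 1 = 257 frequency bins.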

4.3. COMPARISON WITH STATE-OF-THE-ART METHODS

Tab. 2 and 3 compare our method to other open-source state-of-the-art models on the test sets. All experiments follow the same test and evaluation protocols. We choose two types of methods for comparison: audio-only methods and audio-visual methods.

Audio-Only Methods:

U-Net-AO. By removing the face net and the lip net from the adopted basic audio-visual speech separator and making corresponding adaptations to the U-Net, we obtain an audio-only speech separator and train it with a permutation-invariant loss.

VoiceFilter (Wang et al., 2018). This method uses the speech mixture and a reference utterance to extract the target speech from the mixture. Note that the reference audio is a piece of a randomly sampled utterance of the target speaker that differs from the target speech.

Conv-TasNet (Luo & Mesgarani, 2019). This method is widely adopted in audio-only speech separation. It adopts a fully-convolutional audio separation network, which takes the time-domain speech mixture as input and outputs all speech components simultaneously.

Audio-Visual Methods:

LAVSE (Chuang et al., 2020). To reduce processing costs, this work employs a lightweight but efficient framework, where a lip encoder extracts the lip motion feature as the synchronization signal for target speech extraction.

VisualVoice (Gao & Grauman, 2021). It contains the same visual networks and U-Net as our basic audio-visual separator, except that its U-Net simply concatenates the audio and visual features, while we perform AV-Fusion to obtain enhanced audio features and achieve better results on VoxCeleb2.

VoViT (Montesinos et al., 2022). This recently proposed method uses a landmark-based graph convolutional network (Yan et al., 2018) to capture facial motion cues. It adopts an AV spectro-temporal transformer for target speech separation.

DeBaSe. To offset the impact of the increased model capacity brought by the additional FRNet, we compare BFRNet with a deeper basic audio-visual speech separator (DeBaSe), which has more layers in the encoder and almost the same parameter count as BFRNet.

Besides, we combine all audio-visual methods apart from DeBaSe with the proposed FRNet to verify its generality. We call these methods '*+FR'.
As seen in Tab. 2, Tab. 3, and Fig. 3, BFRNet achieves the best results on all test sets compared to the other baseline methods, with an SDR advantage of at least 1 dB. Besides, combining the FRNet with any audio-visual method significantly improves performance. For instance, VisualVoice (Gao & Grauman, 2021) combined with FRNet gains 1.67 dB SDR on the VoxCeleb2 unseen 2-mix test set. It is worth noting that there is still a large gap between DeBaSe and BFRNet, although their parameter counts are nearly equal. This means that simply expanding the model capacity does not necessarily lead to a significant performance gain. In contrast, BFRNet focuses on the main issues of the separation task and thus significantly enhances the separation results. The difficulty of separation grows with the number of speakers in the mixture, resulting in lower metrics. For the VoxCeleb2 dataset, although the speakers in the seen test set appear in the training set, the results on the seen set are not necessarily higher than those on the unseen set. We argue that the speakers' voiceprint features are not critical for speech separation.

4.4. ABLATION STUDIES

We conduct ablation studies and report average results on the VoxCeleb2 unseen and seen test sets.

The effect of visual knowledge. We remove the Face Net and the Lip Net separately while keeping the other parts unchanged. Tab. 4 shows that the modified models consistently yield lower results than BFRNet. Moreover, as lip motion is more closely related to speech, it plays the principal role in speech separation, while the static face image only serves as extra information.

Module design of FRNet. Tab. 6 explores variant structures of FRNet. For the Filter Net, we filter voice noise from $M_i$ using the mask itself as the clue instead of the synchronized visual signal $V_i$; we name this ablation 'filter-sa'. For the Recovery Net, we replace the clean mask $\bar{M}_i$ with the coarse mask $M_i$ as the clue to extract the missing voices from the stacked masks of the other speakers, $\mathcal{M}_i$; we name this ablation 'recovery-noise'.

Layers of the FRNet. Both the Filter and Recovery Nets are composed of $L$ basic layers. We study the impact of $L$ here. As seen in Tab. 7, the network with $L = 2$ improves greatly over $L = 1$. When the number of layers increases to 3, the additional improvement is insignificant. To balance performance and efficiency, we adopt the 2-layer FRNet.

[Figure 4 panels. First row: GT, Basic Separator, BFRNet. Second row: VoViT, Conv-TasNet, LAVSE.]

5. CONCLUSION

In this paper, we have focused on the multi-speaker audio-visual speech separation task. We are the first to propose separating mixtures with a variable number of speakers simultaneously during training, and we also provide a standard test benchmark for fair comparison. There are two significant problems in speech separation, especially in the multi-speaker setting: part of the voice is missing in the separated speech, and the separated speech may still be mixed with others' voices. To deal with these two problems, we propose a Filter-Recovery Network. The Filter module filters out other people's voices, and the Recovery module compensates for the missing voice of the target speaker. We conduct various experiments to demonstrate the effectiveness of this network, and adding it to other audio-visual speech separation methods leads to considerable improvements.

A APPENDIX

We provide an additional ablation study and more visualization examples in the appendix.

Ablation study on λ in loss function. We conduct experiments to explore the impact of different training values of λ in Eq. 13 on the results. A smaller λ means a higher training loss weight for FRNet, whereas a larger λ implies a lower weight for FRNet in the whole model. Tab. 8 displays the average results on the VoxCeleb2 unseen and seen test sets. As seen in the table, attaching a higher weight to the training of FRNet (λ = 0.2) yields similar results to giving an equal weight (λ = 0.5). Nevertheless, a lower loss weight for FRNet (λ = 0.8) results in a non-negligible performance drop, which indicates the necessity of FRNet.

Spectrogram visualization of BFRNet. We visualize the intensity of the spectrograms of the ground truth (GT) and of the predictions by each network of BFRNet in Fig. 5.



Figure 1: Illustration of two critical issues in multi-speaker audio-visual speech separation. The missing parts in separation results compared to ground truth speech are indicated by gray boxes, while the noisy parts are marked by red boxes.

Figure 2: Overview of the proposed framework. It consists of a basic audio-visual speech separator and a Filter-Recovery Network (FRNet). Given a mixture spectrogram containing $S$ pieces of speech, the basic separator takes each speaker's visual clue separately and outputs the corresponding visual feature and speech mask. For the i-th speaker, the basic separator outputs $V_i$ and $M_i$. Then the visual features and speech masks of all $S$ speakers in a mixture are fed into the Filter-Recovery Network to obtain a more precise final speech mask for each speaker. The Filter Net utilizes the visual embedding $V_i$ to reduce the noisy components in the coarse mask $M_i$ and obtain the filtered mask. Then the Recovery Net takes this clean filtered mask as the query to extract the corresponding missing parts from the speech masks of the other speakers. To better illustrate the effect of the FRNet, we visualize the masks as waveforms. The red/gray boxes mark the noisy/missing speech pieces that are removed or retrieved.

Figure 3: Visualization of SDR for each method on VoxCeleb2 unseen test set. The increments on each bar denote the improvement after combining FRNet with base methods. 'BaSe' is our basic audio-visual separator.

Figure 4: Spectrogram visualization of ground truth and predictions. We visualize the intensity of the ground truth (GT) spectrograms and of the predictions by the models. The first row presents the ground truth, the separation result of the basic separator, and the result of BFRNet, respectively. The second row shows the separation results of other approaches. The separation results of the baseline methods, including the proposed method without the FRNet, suffer from the two issues we describe. We use black boxes to highlight the missing parts compared to GT, and red boxes to indicate the noisy parts from other speakers. Both problems are greatly reduced after applying FRNet.

The red boxes indicate the noisy part generated by the basic separator and then suppressed by Filter Net. The black boxes denote the missing part yielded by the basic separator and further recovered by Recovery Net. The results demonstrate the effects of Filter Net and Recovery Net.

Figure 5: Visualization of the intensity of spectrogram: ground truth (GT), outputs of the basic separator, outputs of the Filter Net, and outputs of the Recovery Net.

Table 1: Numbers of videos and mixtures for four test sets. For VoxCeleb2, two videos per speaker are held out from the original training set, and 7200 of the held-out videos are randomly sampled to build the seen test set, whose speakers appear in the training set while the specific utterances do not. The unseen test set consists of 7200 videos randomly chosen from the original test set, and all remaining videos form the validation set.

Table 2: Results on VoxCeleb2 dataset. The metrics are the average over all speakers for each test set. All methods are trained in the proposed multi-speaker setting. BFRNet achieves the best performance among all methods for all types of mixtures (e.g., we achieve 11.06 dB SDR on the VoxCeleb2 unseen 2-mix set). Besides, we list the overall (O.A.) performance of all methods for convenient comparison, which is the average performance over each speaker in all kinds of mixtures. It is worth mentioning that each audio-visual method combined with the FRNet shows a large improvement.

Table 3: Results on LRS2 and LRS3 datasets. To validate the generalization of the separation models, they are trained on VoxCeleb2 and evaluated on LRS2/LRS3 without fine-tuning. BFRNet outperforms all base methods that do not integrate FRNet. Some methods combined with FRNet achieve the best performance in certain metrics, which further supports the effectiveness of FRNet. Similarly, we give the overall (O.A.) performance on all kinds of mixtures for each test set for direct comparison.

Table 4: Experiments on effect of visual clues.

Table 5: Ablation study on FRNet modules.

The effect of FRNet modules. To validate the necessity of each module, we conduct experiments in Tab. 5 that remove individual modules. When the Filter Net is removed, the Recovery Net uses the coarse mask $M_i$ in place of the filtered mask $\bar{M}_i$. When the Recovery Net is removed, the output of the Filter Net, $\bar{M}_i$, is taken as the final output.

Table 6: Ablation study on module design.

Table 7: Ablation study on the layers L.

Table 8: Ablation study on λ.

ACKNOWLEDGEMENTS

This work is supported by the National Key R&D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119, No. 61921006), the Fundamental Research Funds for the Central Universities (No. 020214380091), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

