CLIPSEP: LEARNING TEXT-QUERIED SOUND SEPARATION WITH NOISY UNLABELED VIDEOS

Abstract

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.

1. INTRODUCTION

Humans can focus on a specific sound in the environment and describe it using language. Such abilities are learned using multiple modalities: audition for selective listening, vision for learning the concepts of sounding objects, and language for describing the objects or scenes for communication. In machine listening, selective listening is often cast as the problem of sound separation, which aims to separate sound sources from an audio mixture (Cherry, 1953; Bach & Jordan, 2005). While text queries offer a natural interface for humans to specify the target sound to separate from a mixture (Liu et al., 2022; Kilgour et al., 2022), training a text-queried sound separation model in a supervised manner requires labeled audio-text paired data of single-source recordings of a vast number of sound types, which can be costly to acquire. Moreover, such isolated sounds are often recorded in controlled environments and have a considerable domain gap to recordings in the wild, which usually contain arbitrary noise and reverberations. In contrast, humans often leverage the visual modality to assist in learning the sounds of various objects (Baillargeon, 2002). For instance, by observing a dog barking, a human can associate the sound with the dog and separately learn that the animal is called a "dog." Further, such learning is possible even if the sound is observed in a noisy environment, e.g., when a car is passing by or someone is talking nearby, where humans can still associate the barking sound solely with the dog. Prior work in psychophysics also suggests the intertwined cognition of vision and hearing (Sekuler et al., 1997; Shimojo & Shams, 2001; Rahne et al., 2007).

2. RELATED WORK

Universal sound separation Much prior work on sound separation focuses on separating sounds in a specific domain such as speech (Wang & Chen, 2018) or music (Takahashi & Mitsufuji, 2021; Mitsufuji et al., 2021). Recent advances in domain-specific sound separation have led to several attempts to generalize to arbitrary sound classes. Kavalerov et al. (2019) reported successful results on separating arbitrary sounds with a fixed number of sources by adopting permutation invariant training (PIT) (Yu et al., 2017), which was originally proposed for speech separation. While this approach does not require labeled data for training, a post-selection process is required because we cannot tell what sounds are included in each separated result. Follow-up work (Ochiai et al., 2020; Kong et al., 2020) addressed this issue by conditioning the separation model with a class label to specify the target sound in a supervised setting. However, these approaches still require labeled data for training, and the interface for selecting the target class becomes cumbersome when we need a large number of classes to handle open-domain data. Wisdom et al. (2020) later proposed an unsupervised method called mixture invariant training (MixIT) for learning sound separation on noisy data. MixIT is designed to separate all sources at once and also requires a post-selection process, such as using a pre-trained sound classifier (Scott et al., 2021), itself trained on labeled data, to identify the target sounds. We summarize and compare related work in Table 1.

Query-based sound separation Visual information has been used for selecting the target sound in speech (Ephrat et al., 2019; Afouras et al., 2020), music (Zhao et al., 2018; 2019; Tian et al., 2021) and universal sounds (Owens & Efros, 2018; Gao et al., 2018; Rouditchenko et al., 2019). While many image-queried sound separation approaches require clean video data that contains isolated sources, Tzinis et al. (2021) introduced an unsupervised method called AudioScope for separating on-screen sounds from noisy videos based on the MixIT model. While image queries can serve as a
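The permutation invariant training objective discussed above can be sketched in a few lines. The following is a minimal NumPy illustration, not the exact loss used in the cited papers: it assumes a mean-squared-error criterion and a number of sources small enough that all permutations can be enumerated, which is why PIT requires a post-selection step to identify which output is which.

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """Permutation invariant MSE: score every permutation of the
    estimated sources against the references and keep the best one.

    estimates, references: arrays of shape (n_sources, n_samples).
    Returns (best_loss, best_permutation).
    """
    n = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy example: the estimates are the references in swapped order,
# so the permutation (1, 0) recovers a perfect match.
refs = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]])
ests = refs[::-1].copy()
loss, perm = pit_mse_loss(ests, refs)
```

Because the loss is minimized over all output orderings, the model is free to emit sources in any order, which is precisely why an external classifier or human must later select the target source.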



https://sony.github.io/CLIPSep/



Motivated by this observation, we aim to tackle text-queried sound separation using only unlabeled videos in the wild. We propose a text-queried sound separation model called CLIPSep that leverages abundant unlabeled video data resources by utilizing the contrastive language-image pretraining (CLIP) (Radford et al., 2021) model to bridge the audio and text modalities. As illustrated in Figure 1, during training, the image feature extracted from a video frame by the CLIP-image encoder is used to condition a sound separation model, and the model is trained to separate the sound that corresponds to the image query in a self-supervised setting. Thanks to the properties of the CLIP model, which maps corresponding text and images to nearby embeddings, at test time we instead use the text feature obtained by the CLIP-text encoder from a text query in a zero-shot setting.
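The conditioning interface can be sketched as follows. This is a hypothetical NumPy toy with random weights and tiny dimensions, not the actual architecture (which uses CLIP encoders and a U-Net-style separator): it only illustrates that the mask head consumes a query embedding, and that, because CLIP image and text embeddings live in one shared space, the same head accepts either modality unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, FREQ, TIME = 4, 8, 6  # toy sizes; real CLIP embeddings are 512-d

# Hypothetical parameters of a tiny mask head (a stand-in for the U-Net).
W_q = rng.standard_normal((FREQ, EMB_DIM))  # projects query to per-band gains

def separate(mixture_spec, query_emb):
    """Predict a (FREQ, TIME) soft mask from a query embedding and apply
    it to the mixture spectrogram.

    The same function accepts a CLIP image embedding during training or a
    CLIP text embedding at zero-shot inference time.
    """
    gains = W_q @ query_emb              # (FREQ,)
    mask = 1.0 / (1.0 + np.exp(-gains))  # sigmoid mask, broadcast over time
    return mixture_spec * mask[:, None]

mix = np.abs(rng.standard_normal((FREQ, TIME)))
img_emb = rng.standard_normal(EMB_DIM)  # stands in for CLIP-image(video frame)
txt_emb = rng.standard_normal(EMB_DIM)  # stands in for CLIP-text("dog barking")
out_train = separate(mix, img_emb)      # image-queried, as in training
out_test = separate(mix, txt_emb)       # text-queried, zero-shot at test time
```

The key design point is that no separation-model weights depend on the query modality; swapping the image encoder for the text encoder at test time requires no retraining.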

Figure 1: An illustration of modality transfer.

However, such zero-shot modality transfer can be challenging when we use videos in the wild for training, as they often contain off-screen sounds and voice-overs that can lead to undesired audio-visual associations. To address this problem, we propose noise invariant training (NIT), where query-based separation heads and permutation invariant separation heads jointly estimate the noisy target sounds. We validate in our experiments that the proposed noise invariant training reduces the zero-shot modality transfer gap when the model is trained on a noisy dataset, sometimes achieving competitive results against a fully supervised text-queried sound separation system. Our contributions can be summarized as follows: 1) We propose the first text-queried universal sound separation model that can be trained on unlabeled videos. 2) We propose a new approach called noise invariant training for training a query-based sound separation model on noisy data in the wild. Audio samples can be found on our demo website.1 For reproducibility, all source code, hyperparameters, and pretrained models are available at: https://github.com/sony/CLIPSep.
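Under strong simplifying assumptions (one noise head per training track, an MSE criterion, and variable names of our own choosing), the idea behind noise invariant training can be sketched as follows: each query head must explain the query-relevant part of its noisy track, while the unconditioned noise heads absorb the remainder under the best-matching permutation, so that off-screen sounds are not forced into the query-conditioned output. The full NIT formulation in the paper may differ in its details.

```python
import itertools
import numpy as np

def nit_loss(query_ests, noise_ests, noisy_targets):
    """Simplified noise invariant training loss.

    query_ests:    (n, t) one query-conditioned estimate per training track
    noise_ests:    (n, t) unconditioned noise-head estimates
    noisy_targets: (n, t) the observed noisy tracks (target + interference)

    Each noisy track must be reconstructed by its query head plus SOME
    noise head; we search over permutations of noise-head assignments
    (permutation invariance) and keep the best reconstruction error.
    """
    n = query_ests.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        recon = query_ests + noise_ests[list(perm)]
        best = min(best, np.mean((recon - noisy_targets) ** 2))
    return best

# Toy example: query heads output the clean targets, noise heads output the
# interference in swapped order; the best permutation still yields zero loss.
t = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # clean targets (never observed)
nz = np.array([[0.0, 0.0, 2.0], [0.0, 3.0, 0.0]])  # off-screen interference
noisy = t + nz                                     # what training actually sees
loss = nit_loss(t, nz[::-1].copy(), noisy)
```

The permutation search over noise heads is what keeps the query heads from being penalized for refusing to output interference, which is the mechanism the paper credits for closing the zero-shot modality transfer gap on noisy data.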

