INTO THE WILD WITH AUDIOSCOPE: UNSUPERVISED AUDIO-VISUAL SEPARATION OF ON-SCREEN SOUNDS

Abstract

Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work imposed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, which makes previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for the presence of on-screen and off-screen sounds on a small subset of clips.

1. INTRODUCTION

Audio-visual machine perception has been undergoing a renaissance in recent years, driven by advances in large-scale deep learning. A motivating observation is the interplay in human perception between auditory and visual perception. We understand the world by parsing it into the objects that are the sources of the audio and visual signals we can perceive. However, the sounds and sights produced by these sources have rather different and complementary properties. Objects may make sounds intermittently, whereas their visual appearance is typically persistent. The visual percepts of different objects tend to be spatially distinct, whereas sounds from different sources can blend together and overlap within a single signal, making it difficult to perceive the individual sources separately.

This suggests that there is something to be gained by aligning our audio and visual percepts: if we can identify which audio signals correspond to which visual objects, we can selectively attend to an object's audio signal by visually selecting the object. This intuition motivates using vision as an interface for audio processing, where a primary problem is to selectively preserve desired sounds while removing unwanted ones. In some tasks, such as speech enhancement, the desired sounds can be selected by their class: speech versus non-speech in this case. In an open-domain setting, the selection of desired sounds is at the user's discretion. This presents a user-interface problem: it is challenging to select sources efficiently using audio alone. The problem can be greatly simplified in the audio-visual case if we use video selection as a proxy for audio selection, for example by selecting sounds from on-screen objects and removing off-screen sounds. Associating arbitrary sounds with their visual objects, however, remains challenging in an open domain.
Several complications arise that have not been fully addressed by previous work. First, a large amount of training data is needed to cover the space of possible sounds. Supervised methods require labeled examples in which isolated on-screen sounds are known; the resulting data collection and labeling burden limits the amount and quality of available data. To overcome this, we propose an unsupervised approach using mixture invariant training (MixIT) (Wisdom et al., 2020), which can learn to separate individual sources from in-the-wild videos where the on-screen and off-screen sounds are unknown. Second, different audio sources may correspond to a dynamic set of on-screen objects in arbitrary spatial locations. We accommodate this by using attention mechanisms that align each hypothesized audio source with the spatial and temporal positions of the corresponding objects in the video. Finally, we need to determine which audio sources appear on screen in the absence of strong labels. This is handled using a weakly trained classifier for sources, based on audio and video embeddings produced by the attention mechanism.
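To make the MixIT objective concrete, the following sketch computes its loss for one pair of reference mixtures by brute force. It is illustrative only: the function name `mixit_loss`, the mean-squared-error criterion, and the exhaustive enumeration of assignments are simplifying assumptions, not the system's actual implementation, which typically uses an SNR-based objective and operates on batches.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Brute-force mixture invariant training (MixIT) loss.

    est_sources: (M, T) array of source estimates produced by the
        separation model from the mixture of mixtures mix1 + mix2.
    mix1, mix2: (T,) arrays holding the two reference mixtures.

    Each estimated source is assigned to exactly one of the two
    mixtures; the loss is the minimum reconstruction error over all
    2^M binary assignments.
    """
    num_sources = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=num_sources):
        mask = np.array(assign)
        # Sum the sources assigned to each mixture (an empty
        # selection sums to silence).
        est1 = est_sources[mask == 0].sum(axis=0)
        est2 = est_sources[mask == 1].sum(axis=0)
        err = np.mean((mix1 - est1) ** 2) + np.mean((mix2 - est2) ** 2)
        best = min(best, err)
    return best
```

If the model recovers the true sources exactly, some assignment reconstructs both mixtures perfectly and the loss is zero; crucially, the model never sees isolated sources, only the two mixtures, which is what makes the training unsupervised.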

2. RELATION TO PREVIOUS WORK

Separation of arbitrary sounds from a mixture, known as "universal sound separation," was recently shown to be possible with a fixed number of sounds (Kavalerov et al., 2019). Conditional information about which sound classes are present can improve separation performance (Tzinis et al., 2020). The FUSS dataset (Wisdom et al., 2021) expanded the scope to separating a variable number of sounds, in order to handle more realistic data. A framework has also been proposed in which specific sound classes can be extracted from input sound mixtures (Ochiai et al., 2020). These approaches require curated training data containing isolated sounds, which prevents their application to truly open-domain data and introduces difficulties such as annotation cost, accurate simulation of realistic acoustic mixtures, and biased datasets. To avoid these issues, a number of recent works have proposed replacing the strong supervision of reference source signals with weak supervision from related modalities such as sound class labels (Pishdadian et al., 2020; Kong et al., 2020), visual input (Gao & Grauman, 2019), or spatial location from multi-microphone recordings (Tzinis et al., 2019; Seetharaman et al., 2019; Drude et al., 2019). Most recently, Wisdom et al. (2020) proposed mixture invariant training (MixIT), a purely unsupervised source separation framework that handles a variable number of latent sources. A variety of research has laid the groundwork for audio-visual on-screen source separation (Michelsanti et al., 2020). Generally, the two main approaches are to use audio-visual localization (Hershey & Movellan, 2000; Senocak et al., 2018; Wu et al., 2019; Afouras et al., 2020) or object detection networks, either supervised (Ephrat et al., 2018; Gao & Grauman, 2019; Gan et al., 2020) or unsupervised (Zhao et al., 2018), to predict visual conditioning information.
However, these works only consider restricted domains such as speech (Hershey & Casey, 2002; Ephrat et al., 2018; Afouras et al., 2020) or music (Zhao et al., 2018; Gao & Grauman, 2019; Gan et al., 2020). Gao et al. (2018) reported results on videos from a wide domain, but relied on supervised visual object detectors, which precludes learning about the appearance of sound sources outside the closed set of classes defined by the detectors. Rouditchenko et al. (2019) proposed a system for a wide domain of sounds.



Figure 1: Input video frame, input audio mixture (on-screen and off-screen audio), attention map, and on-screen estimate for an example of separating on-screen bird chirping from wind noise and off-screen sounds of fireworks and human laughter. More demos are available online at https://audioscope.github.io.

Recent work has used video for the selection and separation of speech (Ephrat et al., 2018; Afouras et al., 2020) or music (Zhao et al., 2018; Gao & Grauman, 2019; Gan et al., 2020). However, systems that address this for arbitrary sounds (Gao et al., 2018; Rouditchenko et al., 2019; Owens & Efros, 2018) may be useful in more general cases, such as video recording, where the sounds of interest cannot be defined in advance.

