NEURAL SYNTHESIS OF BINAURAL SPEECH FROM MONO AUDIO

Abstract

We present a neural rendering approach for binaural sound synthesis that can produce realistic and spatially accurate binaural sound in real time. The network takes, as input, a single-channel audio source and synthesizes, as output, two-channel binaural sound, conditioned on the relative position and orientation of the listener with respect to the source. We investigate deficiencies of the ℓ2-loss on raw waveforms in a theoretical analysis and introduce an improved loss that overcomes these limitations. In an empirical evaluation, we establish that our approach is the first to generate spatially accurate waveform outputs (as measured against real recordings) and outperforms existing approaches by a considerable margin, both quantitatively and in a perceptual study. Dataset and code are available online.

1. INTRODUCTION

The rise of artificial spaces, in augmented and virtual reality, necessitates efficient production of accurate spatialized audio. Spatial hearing (the capacity to interpret spatial cues from binaural signals) not only helps us to orient ourselves in 3D environments, it also establishes immersion in the space by providing the brain with congruous acoustic and visual input (Hendrix & Barfield, 1996). Binaural audio (left and right ear) even guides us in multi-person conversations: consider a scenario where multiple persons are speaking in a video call, making it difficult to follow the conversation. In the same situation in a real environment we are able to effortlessly focus on the speech of an individual (Hawley et al., 2004). Indeed, auditory sensation has primacy over even visual sensation as an input modality for scene understanding: (1) reaction times are faster for auditory stimuli than for visual stimuli (Jose & Praveen, 2010); (2) auditory sensing provides a surround understanding of space as opposed to the directionality of visual sensation. For these reasons, the generation of accurate binaural signals is integral to full immersion in artificial spaces.

Most approaches to binaural audio generation rely on traditional digital signal processing (DSP) techniques, where each component (head-related transfer function, room acoustics, ambient noise) is modeled as a linear time-invariant (LTI) system (Savioja et al., 1999; Zotkin et al., 2004; Sunder et al., 2015; Zhang et al., 2017). These linear systems are well understood, relatively easy to model mathematically, and have been shown to produce perceptually plausible results, which is why they are still widely used. Real acoustic propagation, however, has nonlinear wave effects that are not appropriately modeled by LTI systems.
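The spatial cues discussed above can be made concrete with a back-of-the-envelope calculation. The sketch below estimates the interaural time difference (ITD), one of the dominant localization cues, using the Woodworth spherical-head approximation; this formula and its constants are standard textbook values, not part of this paper's model.

```python
import math

def woodworth_itd(azimuth_rad: float, head_radius_m: float = 0.0875,
                  speed_of_sound: float = 343.0) -> float:
    """Interaural time difference (seconds) for a far-field source under
    the Woodworth spherical-head approximation: ITD = r/c * (theta + sin theta)."""
    return (head_radius_m / speed_of_sound) * (azimuth_rad + math.sin(azimuth_rad))

# A source 90 degrees to the side yields a delay of roughly 0.66 ms,
# i.e. about 29 samples at a 44.1 kHz sampling rate -- the kind of
# sub-millisecond cue a spatialization model must reproduce.
itd = woodworth_itd(math.pi / 2)
```

That such a small delay already determines perceived direction illustrates why metrically accurate waveforms matter for spatialization.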
As a consequence, DSP approaches do not achieve perceptual authenticity in dynamic scenarios (Brinkmann et al., 2017), and fail to produce metrically accurate results, i.e., the generated waveform does not resemble recorded binaural audio well.

In this paper, we present an end-to-end neural synthesis approach that overcomes many of these limitations by efficiently synthesizing accurate and precise binaural audio. The end-to-end learning scheme naturally captures the linear and nonlinear effects of sound wave propagation and, being fully convolutional, is efficient to execute on commodity hardware. Our major contributions are (1) a novel binauralization model that outperforms the existing state of the art, (2) an analysis of the shortcomings of the ℓ2-loss on raw waveforms and a novel loss mitigating these shortcomings, and (3) a real-world binaural dataset captured in a non-anechoic room.

Related Work. State-of-the-art DSP techniques approach binaural sound spatialization as a stack of acoustic components, each of which is an LTI system. As accurate wave-based simulation of room impulse responses is computationally expensive and requires detailed geometry and material information, most real-time systems rely on simplified geometrical models (Välimäki et al., 2012; Savioja & Svensson, 2015). Head-related transfer functions are measured in an anechoic chamber (Li & Peissig, 2020), and high-quality spatialization requires binaural recordings at almost 10k different spatial positions (Armstrong et al., 2018). To generate binaural audio, DSP-based binaural renderers typically perform a series of convolutions with these component impulse responses (Savioja et al., 1999; Zotkin et al., 2004; Sunder et al., 2015; Zhang et al., 2017). For a more detailed discussion, see Appendix A.4. Given their success in speech synthesis (Wang et al., 2017), neural networks have recently gained increased attention for audio generation.
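The DSP rendering chain described above — a series of convolutions with component impulse responses — can be sketched in a few lines. The impulse responses here are toy placeholders (a single echo for the room, a delayed and attenuated tap for the contralateral ear), not measured HRIRs:

```python
import numpy as np

def dsp_binauralize(mono: np.ndarray, hrir_left: np.ndarray,
                    hrir_right: np.ndarray, room_ir: np.ndarray) -> np.ndarray:
    """Toy LTI rendering chain: room acoustics followed by a per-ear
    head-related impulse response, each applied as a linear convolution."""
    reverberant = np.convolve(mono, room_ir)
    left = np.convolve(reverberant, hrir_left)
    right = np.convolve(reverberant, hrir_right)
    return np.stack([left, right])

# Toy impulse responses: right ear delayed by 5 samples and attenuated.
hrir_l = np.zeros(32); hrir_l[0] = 1.0
hrir_r = np.zeros(32); hrir_r[5] = 0.7
room = np.zeros(64); room[0] = 1.0; room[40] = 0.3   # direct path + one echo
out = dsp_binauralize(np.random.randn(1000), hrir_l, hrir_r, room)
```

Because every stage is LTI, the stages commute and the whole chain collapses into one fixed pair of impulse responses — convenient computationally, but exactly why nonlinear propagation effects fall outside this model class.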
While most approaches focus on models in the frequency domain (Choi et al., 2018; Vasquez & Lewis, 2019), raw waveform models were long neglected due to the difficulty of modeling long-range dependencies on a high-frequency audio signal. With the success of WaveNet (Van Den Oord et al., 2016), however, direct wave-to-wave modeling is of increasing interest (Fu et al., 2017; Luo & Mesgarani, 2018; Donahue et al., 2019) and shows major improvements in speech enhancement (Defossez et al., 2020) and denoising (Rethage et al., 2018), speech synthesis (Kalchbrenner et al., 2018), and music style translation (Mor et al., 2019).

More recently, first steps towards neural sound spatialization have been undertaken. Gebru et al. (2021) showed that HRTFs can be implicitly learned by neural networks trained on raw waveforms. Focusing on predicting spatial sound conditioned on visual information, a work by Morgado et al. (2018) aims to spatialize sound conditioned on 360° video. Yet, their work is limited to first-order ambisonics and cannot model detailed binaural effects. More closely related is a line of papers originating from the 2.5D visual sound system by Gao & Grauman (2019b). In this work, binaural audio is generated conditioned on a video frame embedding such that object locations can contribute to where sound comes from. Yang et al. (2020), Lu et al. (2019), and Zhou et al. (2020) build upon the same idea. All these works, however, pose the spatialization task as an upmixing problem, i.e., their models are trained with a mixture of the left and right ear binaural recordings as pseudo mono input. By design, these methods fail to model time delays and reverberation effects caused by the difference between source and listener position.
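The pseudo-mono issue raised above can be demonstrated in a few lines: when the "mono" input is a mix of the two binaural channels, the interaural delays are already baked into the input, so an upmixing model never has to recover the true source-to-listener time of flight. The sketch below is an idealized illustration (pure delays and attenuation, white-noise source), not a reproduction of any of the cited systems:

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)     # true mono source signal

# Idealized binaural recording: per-ear propagation delay and attenuation.
left = np.roll(source, 3)
right = 0.6 * np.roll(source, 8)

pseudo_mono = 0.5 * (left + right)     # mixed "mono" input used for upmixing

def lag(a, b):
    """Lag (in samples) that maximizes the cross-correlation of a against b."""
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)

print(lag(left, source))        # 3: true input-to-output delay to be modeled
print(lag(left, pseudo_mono))   # 0: the mixed input already contains the delay
```

A model trained on such pseudo mono input therefore only learns to separate the channels, not to synthesize the physical time delays from a genuine source signal.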

2. A NEURAL NETWORK FOR BINAURAL SYNTHESIS

We consider the problem where a monaural (single-channel) signal x_{1:T} = (x_1, ..., x_T) of length T is to be transformed into a binaural (stereophonic) signal y^{(l)}_{1:T}, y^{(r)}_{1:T} representing the listener's left and right ear, given a conditioning temporal signal c_{1:T}. This conditioning signal encodes the position and orientation of source and listener, respectively. Here x_t and, correspondingly, y^{(l)}_t and y^{(r)}_t, are scalars representing an audio sample at time t. In other words, we aim to produce a function

    y^{(l)}_t, y^{(r)}_t = f(x_{t-∆:t} | c_{t-∆:t}),

where ∆ is a temporal receptive field. Each c_t ∈ R^14 contains the 3D position of source and listener (three values each) and their orientations as quaternions (four values each). Note that in practice, c_{1:T} has a lower temporal resolution than the audio signal.

https://github.com/facebookresearch/BinauralSpeechSynthesis

Figure 1: System Overview. Given the source and listener position and orientation c_{1:T} at each time step, a single-channel input signal x_{1:T} is transformed into a binaural signal. The neural time warping module learns an accurate warp from the source position to the listener's left and right ear while respecting physical properties like monotonicity and causality. The Temporal ConvNet models nuanced effects like room reverberation or head- and ear-shape related modifications to the signal.

