SoundNeRirF: Receiver-to-Receiver Sound Neural Room Impulse Response Field

Abstract

We present SoundNeRirF, a framework that learns a continuous receiver-to-receiver neural room impulse response field (r2r-RIR) to help robots efficiently predict the sound that will be heard at novel locations. It represents a room acoustic scene as a continuous 6D function whose input is a reference receiver's 3D position and a target receiver's 3D position, and whose outputs are an inverse room impulse response (inverse-RIR) and a forward room impulse response (forward-RIR) that jointly project the sound from the reference position to the target position. SoundNeRirF requires knowledge of neither the sound sources (e.g. their locations and number) nor the room acoustic properties (e.g. room size, geometry, materials). Instead, it merely depends on a sparse set of sound receivers' positions and the sound recorded at each position. We instantiate the continuous 6D function as multi-layer perceptrons (MLPs), so it is fully differentiable and continuous at any spatial position. During training, SoundNeRirF is encouraged to implicitly encode the interaction among sound sources, receivers and room acoustic properties by minimizing the discrepancy between the predicted sound and the truly heard sound at the target position. During inference, the sound at a novel position is predicted given a reference position and the corresponding reference sound. Extensive experiments on both synthetic and real-world datasets show that SoundNeRirF predicts high-fidelity, audio-realistic sound that fully captures room reverberation characteristics, significantly outperforming existing methods in both accuracy and efficiency.

1. INTRODUCTION

Room acoustics [3, 50, 1, 27] aims at accurately capturing the sound propagation process in a room environment, so as to determine the received sound, characterized by reverberation, at any spatial position. Robots and other intelligent agents (e.g. voice assistants) can exploit such sound for a variety of tasks, including navigation [9, 44, 29], reconstruction [62, 8] and audio-dependent simultaneous localization and mapping (SLAM) [16]. As a complementary sensor to vision, sound perception exhibits strengths in scenarios involving dim or no lighting and occlusion. In most cases, room acoustics can be treated as a linear time-invariant (LTI) system. The underlying sound propagation problem is thus to precisely derive a room impulse response (RIR): a transfer function of time and other acoustic variables that characterizes how sound travels from a sound source to a receiver. The sound recorded at a particular position can then be obtained by convolving the sound at the source position with the RIR. We call such an RIR a source-to-receiver RIR (s2r-RIR). Accurately deriving the s2r-RIR, however, is difficult. The challenge is three-fold: 1) high computational cost: traditional methods [3, 50, 1, 27, 41, 26, 4, 22] undergo rigorous mathematical derivation and extensive measurement to compute the s2r-RIR, treating sound as either rays [50, 1, 27] or waveforms [3, 41]; the computational complexity grows with the complexity of the room's geometric layout and the number of sources. 2) non-scalability: the whole computation must be re-executed once a source/receiver location changes even slightly, or the room layout is altered (e.g. furniture is moved). 3) overly strong assumptions: the assumption that the source/receiver locations and room acoustic properties are well-defined and known in advance is too strong to hold in real scenarios.
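The LTI view above can be made concrete with a short sketch: under the LTI assumption, the received sound is simply the source signal convolved with the s2r-RIR. The sample rate, signal and toy RIR below are illustrative choices, not values from the paper.

```python
import numpy as np

fs = 16000                              # sample rate in Hz (illustrative)
source = np.sin(2 * np.pi * 440 * np.arange(1600) / fs)  # 100 ms, 440 Hz tone

# A toy s2r-RIR: a direct-path impulse plus one delayed, attenuated reflection.
rir = np.zeros(800)
rir[40] = 1.0    # direct path, arriving after 40 samples (2.5 ms)
rir[320] = 0.4   # a single specular reflection, weaker and later

# Under the LTI assumption, the receiver hears source * rir (convolution).
received = np.convolve(source, rir)
print(received.shape)  # (len(source) + len(rir) - 1,) = (2399,)
```

Note that nothing arrives at the receiver before the direct-path delay: the first 40 output samples are exactly zero, which is the causality the RIR encodes.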
The sound source location is usually unknown in real scenarios, and localizing sound sources is an extremely difficult task [20, 23, 6, 19]. In this work, we propose SoundNeRirF, a receiver-to-receiver Sound Neural Room impulse response Field for efficiently predicting what sound will be heard at arbitrary positions. SoundNeRirF requires knowledge of neither the sound sources (e.g. their number and positions) nor the room acoustic properties¹; instead, it represents a room acoustic scene through a sparse set of 3D spatial positions that a robot has explored, together with the sound recorded at each position. A robot equipped with a receiver can easily collect large amounts of such data by simply moving around the room and recording its position and the received sound at each step. SoundNeRirF represents a room acoustic scene as a continuous 6D function that takes two receivers' 3D positions (one reference and one target) as input and outputs two room impulse responses that jointly project the sound from the reference position to the target position. SoundNeRirF thus learns a receiver-to-receiver RIR (r2r-RIR). Because the sound recorded at any position is naturally the result of the interaction among sound sources, receivers and room acoustic properties, SoundNeRirF is encouraged to implicitly encode this interaction by minimizing the error between the predicted and the truly recorded sound. By instantiating the 6D function as multi-layer perceptrons (MLPs), SoundNeRirF is differentiable and continuous at any spatial position, and can be optimized with gradient descent. Figure 1 visualizes SoundNeRirF's motivation. Specifically, SoundNeRirF splits the r2r-RIR into two parts: an inverse-RIR for the reference position and a forward-RIR for the target position.
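The interface of the 6D function can be sketched as follows. This is a minimal, untrained toy (random weights, hypothetical layer sizes and RIR length), meant only to show the shape of the mapping: two 3D positions in, an inverse-RIR and a forward-RIR out, with the target sound predicted by convolving the reference sound with both. It is not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
RIR_LEN = 256  # hypothetical r2r-RIR length in samples

def mlp(x, sizes, rng):
    """A toy fully connected network with ReLU hidden layers (untrained)."""
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        x = x @ W
        if i < len(sizes) - 2:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

def r2r_rir_field(ref_pos, tgt_pos):
    """6D input (reference + target 3D positions) -> (inverse-RIR, forward-RIR)."""
    x = np.concatenate([ref_pos, tgt_pos])           # (6,)
    out = mlp(x, [6, 128, 128, 2 * RIR_LEN], rng)    # (2 * RIR_LEN,)
    return out[:RIR_LEN], out[RIR_LEN:]

# Predicting the sound at a target position from a reference recording:
ref_sound = rng.standard_normal(1024)                # sound heard at reference
inv_rir, fwd_rir = r2r_rir_field(np.array([1.0, 0.5, 1.2]),
                                 np.array([3.0, 2.0, 1.2]))
pred = np.convolve(np.convolve(ref_sound, inv_rir), fwd_rir)
print(pred.shape)  # (1534,)
```

Training would then minimize a waveform discrepancy (e.g. an L1 or spectral loss) between `pred` and the sound actually recorded at the target position; since the whole pipeline is differentiable, gradient descent applies end to end.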
We further introduce a receiver virtual-position setting and a physical constraint strategy that guide SoundNeRirF to explicitly learn the direct path, specular reflections and late reverberation that commonly exist in room acoustics. SoundNeRirF has numerous applications in robotics, including immersive audio/video game experiences in augmented reality (AR) [24, 57], sound pre-hearing without actually reaching the location [5], and improving robot localization (by comparing the predicted sound with the truly heard sound). In summary, we make four main contributions: First, we propose SoundNeRirF, which learns an r2r-RIR neural field from a sparse set of receiver recordings. Second, SoundNeRirF requires knowledge of neither the sound sources nor the room acoustic properties; instead, it depends on more accessible data that can be collected by a robot walking around the room environment. Third, SoundNeRirF directly predicts the raw sound waveform. It disentangles sound prediction from r2r-RIR learning, exhibiting strong generalization capability in predicting previously unheard sound (see Experiment Sec.). Lastly, we release all datasets collected in both synthetic and real-world scenarios to the public to foster further research.
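The three reverberation components named above can be illustrated by assembling a synthetic RIR from them. All delays, gains and the reverberation time below are hypothetical values chosen for the sketch, not parameters from the paper.

```python
import numpy as np

fs = 16000
n = 4096
rng = np.random.default_rng(42)
rir = np.zeros(n)

# 1) Direct path: a single impulse at the direct propagation delay.
rir[60] = 1.0

# 2) Specular reflections: a few sparse, delayed, attenuated impulses.
for delay, gain in [(200, 0.5), (340, 0.35), (520, 0.25)]:
    rir[delay] += gain

# 3) Late reverberation: an exponentially decaying Gaussian-noise tail.
t = np.arange(n) / fs
t60 = 0.4                          # hypothetical reverberation time (seconds)
decay = np.exp(-6.9 * t / t60)     # ~60 dB of decay over t60 seconds
tail = 0.05 * rng.standard_normal(n) * decay
tail[:600] = 0.0                   # the tail starts after the early reflections
rir += tail
```

Convolving any dry signal with such an RIR reproduces the familiar structure of room reverberation: a sharp onset, a handful of distinct echoes, then a dense decaying tail.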

2. RELATED WORK

Room Acoustics Modelling. There are two main ways to model room acoustics: wave-based modelling [3, 41, 26, 4] and geometry-based modelling (a.k.a. geometrical acoustics) [50]. Wave-based modelling exploits the wave nature of sound to model sound propagation, whereas geometry-based modelling treats sound propagation as optical rays. Typical geometry-based methods include ray tracing [27], the image source method (ISM [1]), beam tracing [17] and acoustic radiosity [22, 36]. The main goal of room acoustics is to characterize the room reverberation effect, which consists of three main components: the direct path, specular reflections and late reverberation. SoundNeRirF explicitly models these three components to learn the r2r-RIR.

Neural Room Impulse Response (RIR). Some recent work [46, 55, 47, 45, 12, 30, 49] has proposed to learn room impulse responses (RIRs) with deep neural networks. However, these methods all assume that massive numbers of source-to-receiver RIRs (s2r-RIRs) are available to train the model. Unlike these methods, SoundNeRirF learns an implicit r2r-RIR from a set of robot-recorded sounds at different positions, which are easier to collect and impose far fewer constraints.

Neural Radiance Field (NeRF). NeRF has received much attention in recent years, especially in the computer vision community [33, 21, 59, 52, 61]. These methods model static or dynamic visual scenes by optimizing an implicit neural radiance field in order to render photo-realistic novel views. Some recent work extends the neural radiance field to 3D point clouds [60], the image-text domain [58] and robotics [13, 53, 35, 28] for robot localization, positioning and scene representation. A. Luo et al. [30] propose to learn implicit neural acoustic fields from source-receiver pairs to model how the nearby physical environment affects reverberation. However, their method requires massive numbers of measured source-to-receiver RIRs.
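To make the image source method mentioned above concrete, the sketch below computes the direct path plus the six first-order reflections in a shoebox room by mirroring the source across each wall. Room dimensions, positions and the reflection coefficient are hypothetical, and real ISM implementations recurse to higher reflection orders.

```python
import numpy as np

c = 343.0                            # speed of sound (m/s)
room = np.array([6.0, 4.0, 3.0])     # shoebox dimensions Lx, Ly, Lz (m)
src = np.array([1.0, 2.0, 1.5])      # source position
rcv = np.array([4.0, 1.0, 1.5])      # receiver position
beta = 0.8                           # wall reflection coefficient (illustrative)

def first_order_images(src, room):
    """Mirror the source across each of the six walls of the shoebox."""
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = src.copy()
            img[axis] = 2 * wall - src[axis]
            images.append(img)
    return images

# Each arrival is a (delay in seconds, gain) pair: 1/d spherical spreading,
# with one factor of beta per wall bounce.
d0 = np.linalg.norm(src - rcv)
paths = [(d0 / c, 1.0 / d0)]                     # direct path
for img in first_order_images(src, room):
    d = np.linalg.norm(img - rcv)
    paths.append((d / c, beta / d))              # one bounce

paths.sort()                                     # direct path arrives first
print(len(paths))  # 7 arrivals
```

Placing an impulse of the given gain at each delay (and recursing over higher-order images) yields the early part of the s2r-RIR; the computation must be redone whenever the source, receiver or geometry changes, which is exactly the scalability issue noted in the introduction.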



¹ Room acoustic properties refer to any factor that may affect sound propagation, including room size, geometric structure, surface absorption/scattering coefficients, etc.

