SOUNDNERIRF: RECEIVER-TO-RECEIVER SOUND NEURAL ROOM IMPULSE RESPONSE FIELD

Abstract

We present SoundNeRirF, a framework that learns a continuous receiver-to-receiver neural room impulse response field (r2r-RIR) to help robots efficiently predict the sound to be heard at novel locations. It represents a room acoustic scene as a continuous 6D function, whose input is a reference receiver's 3D position and a target receiver's 3D position, and whose outputs are an inverse room impulse response (inverse-RIR) and a forward room impulse response (forward-RIR) that jointly project the sound from the reference position to the target position. SoundNeRirF requires knowledge of neither the sound sources (e.g. their location and number) nor the room acoustic properties (e.g. room size, geometry, materials). Instead, it depends merely on a sparse set of sound receivers' positions, together with the sound recorded at each position. We instantiate the continuous 6D function as multi-layer perceptrons (MLPs), so it is fully differentiable and continuous at any spatial position. During training, SoundNeRirF is encouraged to implicitly encode the interaction among sound sources, receivers and room acoustic properties by minimizing the discrepancy between the predicted sound and the sound actually heard at the target position. During inference, the sound at a novel position is predicted given a reference position and the corresponding reference sound. Extensive experiments on both synthetic and real-world datasets show that SoundNeRirF predicts high-fidelity, audio-realistic sound that fully captures room reverberation characteristics, significantly outperforming existing methods in both accuracy and efficiency.
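The 6D field described above can be sketched with a toy MLP. This is a minimal illustration, not the paper's architecture: the layer widths, the RIR length of 256 taps, and the choice to emit both RIRs from one output head are all assumptions made for the sketch, and the weights here are random rather than trained.

```python
import numpy as np

RIR_LEN = 256  # illustrative RIR length, not taken from the paper
rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random (untrained) weights and biases for a fully connected network."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Plain MLP forward pass with ReLU on hidden layers."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# 6D input (reference xyz + target xyz) -> inverse-RIR and forward-RIR taps,
# emitted as one concatenated vector and split afterwards.
params = init_mlp([6, 128, 128, 2 * RIR_LEN], rng)

def r2r_field(ref_pos, tgt_pos):
    x = np.concatenate([ref_pos, tgt_pos])
    out = mlp_forward(params, x)
    return out[:RIR_LEN], out[RIR_LEN:]  # inverse-RIR, forward-RIR

inv_h, fwd_h = r2r_field(np.array([1.0, 0.5, 1.2]), np.array([3.0, 2.0, 1.2]))
print(inv_h.shape, fwd_h.shape)  # (256,) (256,)
```

Because the field is an ordinary MLP, it is differentiable in both receiver positions, which is what allows training by backpropagating the discrepancy between predicted and recorded sound.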

1. INTRODUCTION

Room acoustics [3, 50, 1, 27] aims at accurately capturing the sound propagation process in a room environment, so as to determine the received sound, with its characteristic reverberation, at any spatial position. Robots and other intelligent agents (e.g. voice assistants) can exploit such sound for a variety of tasks, including navigation [9, 44, 29], reconstruction [62, 8] and audio-based simultaneous localization and mapping (SLAM) [16]. As a complementary sensor to vision, sound perception is robust in scenarios involving dim or no lighting and occlusion. In most cases, room acoustics can be treated as a linear time-invariant (LTI) system. The underlying sound propagation problem is thus to precisely derive a room impulse response (RIR): a transfer function of time and other acoustic variables that describes how a sound is transformed on its way from a sound source to a receiver. The sound recorded at a particular position can then be obtained by convolving the sound at the source position with the RIR. We call such an RIR a source-to-receiver RIR (s2r-RIR). Accurately deriving the s2r-RIR, however, is difficult. The challenge is three-fold: 1) high computational cost: traditional methods [3, 50, 1, 27, 41, 26, 4, 22] require rigorous mathematical derivation and extensive measurement to compute the s2r-RIR, treating sound as either rays [50, 1, 27] or waveforms [3, 41]; the computational complexity grows with the complexity of the room's geometric layout and the number of sources. 2) Non-scalability: the whole computation must be re-executed once the source/receiver location changes even slightly, or the room layout is altered (e.g. furniture is moved). 3) Overly strong assumptions: assuming that the source/receiver locations and room acoustic properties are well-defined and known in advance rarely holds in real scenarios.
The sound source location is mostly unknown in real scenarios, and localizing sound sources is itself an extremely difficult task [20, 23, 6, 19].
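The LTI relation underlying the s2r-RIR can be made concrete in a few lines. The RIR below is synthetic, a direct path plus two decaying reflections at delays chosen purely for illustration; a real RIR would be measured or simulated.

```python
import numpy as np

fs = 16000                                # sample rate (Hz)
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440.0 * t)    # 1 s, 440 Hz source signal

# Toy RIR: direct path plus two attenuated reflections (illustrative values).
rir = np.zeros(2000)
rir[0] = 1.0       # direct path
rir[640] = 0.6     # reflection arriving after 40 ms
rir[1600] = 0.3    # reflection arriving after 100 ms

# In an LTI room, the sound at the receiver is the source convolved with the RIR.
received = np.convolve(source, rir)
print(received.shape)  # (17999,) = (len(source) + len(rir) - 1,)
```

Before the first reflection arrives (the first 640 samples), the received signal equals the source signal scaled by the direct-path gain; the reverberant tail is the superposition of the delayed, attenuated copies.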

