SOUNDNERIRF: RECEIVER-TO-RECEIVER SOUND NEURAL ROOM IMPULSE RESPONSE FIELD

Abstract

We present SoundNeRirF, a framework that learns a continuous receiver-to-receiver neural room impulse response field (r2r-RIR) to help robots efficiently predict the sound to be heard at novel locations. It represents a room acoustic scene as a continuous 6D function, whose input is a reference receiver's 3D position and a target receiver's 3D position, and whose outputs are an inverse room impulse response (inverse-RIR) and a forward room impulse response (forward-RIR) that jointly project the sound from the reference position to the target position. SoundNeRirF requires knowledge of neither the sound sources (e.g. their location and number) nor the room acoustic properties (e.g. room size, geometry, materials). Instead, it merely depends on a sparse set of sound receivers' positions, together with the sound recorded at each position. We instantiate the continuous 6D function as multi-layer perceptrons (MLPs), so it is fully differentiable and continuous at any spatial position. During training, SoundNeRirF is encouraged to implicitly encode the interaction between sound sources, receivers and room acoustic properties by minimizing the discrepancy between the predicted sound and the truly heard sound at the target position. During inference, the sound at a novel position is predicted given a reference position and the corresponding reference sound. Extensive experiments on both synthetic and real-world datasets show that SoundNeRirF predicts high-fidelity, audio-realistic sound that fully captures room reverberation characteristics, significantly outperforming existing methods in both accuracy and efficiency.

1. INTRODUCTION

Room acoustics [3, 50, 1, 27] aims at accurately capturing the sound propagation process in a room environment, so as to determine the received sound, characterizing reverberation, at any spatial position. Robots and other intelligent agents (e.g. voice assistants) can exploit such sound for a variety of tasks, including navigation [9, 44, 29], reconstruction [62, 8] and audio-dependent simultaneous localization and mapping (SLAM) [16]. As a sensor complementary to vision, sound perception exhibits strengths in scenarios involving dim or no lighting and occlusion. In most cases, room acoustics can be treated as a linear time-invariant (LTI) system. The underlying sound propagation problem is thus to precisely derive a room impulse response (RIR): a transfer function of time and other acoustic variables that measures the transformation a sound undergoes from a sound source to a receiver. The sound recorded at a particular position can then be obtained by convolving the sound at the source position with the RIR. We call such an RIR a source-to-receiver RIR (s2r-RIR). Accurately deriving the s2r-RIR, however, is difficult. The challenge is three-fold: 1) high computational cost: traditional methods [3, 50, 1, 27, 41, 26, 4, 22] undergo rigorous mathematical derivation and extensive measurement to compute the s2r-RIR, treating sound as either rays [50, 1, 27] or waveforms [3, 41]; the computational complexity is proportional to the complexity of the room's geometric layout and the number of sources. 2) non-scalable: the whole computation must be re-executed once a source/receiver location changes slightly or the room layout is altered (e.g. furniture movement). 3) overly strong assumptions: the assumption that source/receiver locations and room acoustic properties are well-defined and known in advance rarely holds in real scenarios.
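The LTI view above can be sketched numerically: the received sound is the source signal convolved with the s2r-RIR. The RIR below is a hand-built, purely illustrative one (a direct path plus one delayed, attenuated reflection), not a measured or simulated response.

```python
import numpy as np

fs = 16000                                  # sampling rate (Hz)
t = np.arange(fs) / fs                      # 1 s of signal
source = np.sin(2 * np.pi * 440.0 * t)      # 440 Hz tone as the source sound

# Hypothetical s2r-RIR: direct path at lag 0, one specular reflection
# arriving 20 ms (320 samples) later with attenuated amplitude.
s2r_rir = np.zeros(400)
s2r_rir[0] = 1.0
s2r_rir[320] = 0.4

# Sound recorded at the receiver = source convolved with the s2r-RIR.
received = np.convolve(source, s2r_rir)[: len(source)]
```

Before the reflection arrives (the first 320 samples), the received signal equals the source; afterwards, each sample is the direct sound plus the attenuated, delayed copy.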
Moreover, the sound source location is mostly unknown in real scenarios, and localizing sound sources is an extremely difficult task [20, 23, 6, 19]. In this work, we propose SoundNeRirF, a receiver-to-receiver Sound Neural Room impulse response Field for efficiently predicting the sound to be heard at arbitrary positions. SoundNeRirF requires knowledge of neither the sound sources (e.g. source number and position) nor the room acoustic properties, but instead represents a room acoustic scene through a sparse set of 3D spatial positions that robots have explored, together with the sound recorded by the robot at each position. A robot equipped with a receiver can easily collect such a dataset at scale by simply moving around the room and recording its position and received sound at each step. SoundNeRirF represents a room acoustic scene by a continuous 6D function, which takes two receivers' 3D positions (one reference and one target position) as input, and outputs two room impulse responses that jointly project the sound from the reference position to the target position. SoundNeRirF thus learns a receiver-to-receiver RIR (r2r-RIR). Because the sound recorded at any position is naturally the result of the interaction between sound sources, receivers and room acoustic properties, SoundNeRirF is encouraged to implicitly encode this interaction by minimizing the error between the predicted and the truly recorded sounds. By instantiating the 6D function as multi-layer perceptrons (MLPs), SoundNeRirF is differentiable and continuous for any spatial position, and can be optimized with gradient descent. Figure 1 visualizes SoundNeRirF's motivation. Specifically, SoundNeRirF splits the r2r-RIR into two parts: an inverse-RIR for the reference position and a forward-RIR for the target position.
We further introduce a virtual receiver position setting and a physical constraint strategy that guide SoundNeRirF to explicitly learn the direct-path, specular-reflection and late-reverberation effects that commonly exist in room acoustics. SoundNeRirF has numerous applications in robotics, including immersive audio/video game experiences in augmented reality (AR) [24, 57], sound pre-hearing without actually reaching the location [5], and improved robot localization (by comparing the predicted sound with the truly heard sound). In summary, we make four main contributions: First, we propose SoundNeRirF, which learns an r2r-RIR neural field from a sparse set of receiver recordings. Second, SoundNeRirF requires knowledge of neither the sound sources nor the room acoustic properties, but instead depends on more accessible data that can be collected by a robot walking around the room. Third, SoundNeRirF directly predicts the raw sound waveform. It disentangles sound prediction from r2r-RIR learning, exhibiting strong generalization capability in predicting previously unheard sound (see the Experiment section). Lastly, we release all datasets, collected in both synthetic and real-world scenarios, to the public to foster more research.

2. RELATED WORK

Room Acoustics Modelling There are two main ways to model room acoustics: wave-based modelling [3, 41, 26, 4] and geometry-based modelling (aka geometrical acoustics) [50]. Wave-based modelling exploits the wave nature of sound to model its propagation, whereas geometry-based modelling treats sound propagation as optic rays. Typical geometry-based methods include ray tracing [27], the image source method (ISM [1]), beam tracing [17] and acoustic radiosity [22, 36]. The main goal of room acoustics is to characterize the room reverberation effect, which consists of three main components: direct-path, specular-reflection and late-reverberation. SoundNeRirF explicitly models these three components to learn the r2r-RIR. Neural Room Impulse Response (RIR) Some recent work [46, 55, 47, 45, 12, 30, 49] proposed to learn room impulse responses (RIRs) with deep neural networks. However, these methods all assume massive numbers of source-to-receiver RIRs (s2r-RIRs) are available to train the model. Unlike them, SoundNeRirF learns an implicit r2r-RIR from a set of robot-recorded sounds at different positions, which are easier to collect and come with far fewer constraints. Neural Radiance Field (NeRF) has received considerable attention in recent years, especially in the computer vision community [33, 21, 59, 52, 61]. These methods model static or dynamic visual scenes by optimizing an implicit neural radiance field to render photo-realistic novel views. Some recent work extends the neural radiance field to 3D point clouds [60], the image-text domain [58] and robotics [13, 53, 35, 28] for robot localization, positioning and scene representation. A. Luo et al. [30] propose to learn implicit neural acoustic fields from source-receiver pairs to model how the nearby physical environment affects reverberation. However, their method also requires massive numbers of s2r-RIRs.
Our proposed SoundNeRirF shares the same spirit as NeRF: it learns an implicit receiver-to-receiver room impulse response field. Sound Synthesis Inferring the sound waveform at a novel location partially relates to sound synthesis [38, 14, 64, 48, 15, 54, 11, 42]. WaveNet [38] learns to predict future sound waveforms based on previously heard waveforms. WaveGAN [14] and GANSynth [15] adopt generative adversarial networks (GANs [18]) to learn to generate sound. A. Richard et al. [48] propose to infer binaural sound waveforms from a mono-channel input waveform.

3. SOUNDNERIRF: RECEIVER-TO-RECEIVER NEURAL ROOM IMPULSE RESPONSE FIELD

3.1. PROBLEM DEFINITION

In a geometric acoustic room, at least one static sound source constantly emits sound. N sound receivers at various positions (or a robot moving to different positions) record the acoustic environment, (W, P) = {(w_i, p_i)}_{i=1}^N, W ∈ R^{T×N}, P ∈ R^{3×N}. Each sound w_i is a time series of the raw sound waveform (of length T) and each position p_i is the receiver's coordinate [x_i, y_i, z_i]. Our target is to learn an implicit receiver-to-receiver neural room impulse response field (SoundNeRirF) F from (W, P) that is capable of efficiently inferring the sound ŵ_t that would be heard at a novel (target) position p_t. Note that p_t is previously unexplored, and we are given a reference sound w_r recorded at an explored location p_r:

ŵ_t = F_{(W,P)}(p_t, p_r, w_r), (ŵ_t, p_t) ∉ (W, P), (w_r, p_r) ∈ (W, P)

In this work, both sound sources and receivers are omni-directional and monoaural, so we do not have to consider their directivity or inter-channel differences (as discussed in [20, 48]). We also assume all receivers record the same sound content (e.g. a telephone ring), with each encoding a unique room reverberation effect based on its position, so the sound at an arbitrary position can be predicted from the sound at any reference position, provided the room reverberation change between them can be modelled.

3.2. SOUNDNERIRF FRAMEWORK INTRODUCTION

SoundNeRirF represents the receiver-to-receiver neural room impulse response field as a 6D function h, whose input is the combination of one reference receiver position p_r = [x_r, y_r, z_r] and one target receiver position p_t = [x_t, y_t, z_t], and whose output is an inverse-RIR h_inv that mainly depends on the reference position p_r and a forward-RIR h_for that mainly depends on the target position p_t:

h_inv, h_for = h(p_r, p_t)

h_inv and h_for jointly project the sound at the reference position w_r to the target position ŵ_t via 1D convolution,

ŵ_t = w_r ⊛ h_inv ⊛ h_for

where ⊛ indicates 1D convolution in the time domain. Both h_inv and h_for are one-dimensional impulse responses in the time domain; we omit their time indexes for succinctness. The motivation behind learning two separate RIRs is to reduce the impact of reference-target position arbitrariness and to explicitly exploit the room acoustic RIR property: we first project the reference sound to a virtual implicit sound source field with the inverse-RIR (h_inv, mainly depending on the reference position), and then reproject it from the virtual implicit sound source field to the target sound with the forward-RIR (h_for, mainly depending on the target position). By explicitly setting physical constraints on h_inv and h_for (Sec. 3.4), SoundNeRirF learns a more robust r2r-RIR representation. In this work, we instantiate h_inv and h_for as two identical multi-layer perceptrons (MLPs, specifically 6 layers with 512 neurons each), so they are fully differentiable and can be optimized by gradient descent. They are also continuous at any spatial position, so we can infer the sound for any position:

h_inv = MLP_inv(p_r), h_for = MLP_for(p_t)

MLP representations have been successfully used for acoustic tasks such as speech enhancement [37] and audio equalization [40]. We extract the last MLP layer representation as the r2r-RIR (see Fig. 2). Please note that the learning of h_inv and h_for is independent of the sound content, so we can use them for predicting previously unheard sound (sound emitted from the same source but not trained with; see Experiment). Hitherto we have a vanilla SoundNeRirF that takes raw receiver positions as input to learn the neural r2r-RIR field using an MLP. We further modify it by adding extra virtual receivers and setting physical constraints to reinforce SoundNeRirF to learn a more representative r2r-RIR neural field.
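The prediction pipeline above can be sketched as follows. This is a minimal sketch under stated assumptions: toy 2-layer MLPs with random, untrained weights stand in for the paper's 6-layer, 512-neuron networks, and np.convolve stands in for the 1D convolution ⊛; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(pos, weights):
    """Toy 2-layer MLP mapping a 3D receiver position to a length-256 RIR."""
    hidden = np.tanh(weights[0] @ pos)
    return weights[1] @ hidden

# Untrained stand-in parameters for MLP_inv and MLP_for (training not shown).
w_inv = [rng.standard_normal((64, 3)), rng.standard_normal((256, 64))]
w_for = [rng.standard_normal((64, 3)), rng.standard_normal((256, 64))]

p_ref = np.array([1.0, 2.0, 1.5])    # reference receiver position
p_tgt = np.array([4.0, 3.0, 1.5])    # target receiver position
w_ref = rng.standard_normal(16000)   # sound recorded at the reference position

# h_inv depends on the reference position, h_for on the target position.
h_inv = mlp(p_ref, w_inv)
h_for = mlp(p_tgt, w_for)

# Predicted target sound: w_ref convolved with h_inv, then with h_for.
w_tgt_pred = np.convolve(np.convolve(w_ref, h_inv), h_for)[: len(w_ref)]
```

By associativity of convolution, the two RIRs could equivalently be pre-combined into a single r2r-RIR and applied once.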

3.3. SETTING VIRTUAL RECEIVER POSITIONS

To explicitly model the direct-path, specular-reflection and late-reverberation effects that commonly exist in room acoustic RIRs, we manipulate each original input 3D position p (this applies to both the reference and the target receiver position) and consecutively place it at two virtual, further locations (in our case, we multiply the original position by different coefficients to virtually move the receiver further away). The resulting three positions are leveraged to explicitly model the direct-path RIR h_d (original position), the specular-reflection RIR h_s and the late-reverberation RIR h_l, respectively. The final RIR is obtained by adding the three separate RIRs together, h_d(α_1 · p) + h_s(α_2 · p) + h_l(α_3 · p), and the three RIRs go through the same MLP. Please note that the virtual receiver position setting is adopted for both the reference and the target position. As a result, h_inv in Eqn. 3 can be rewritten as,

h_inv = h_d^inv + h_s^inv + h_l^inv; h_d^inv = MLP_inv(α_1 · p_r), h_s^inv = MLP_inv(α_2 · p_r), h_l^inv = MLP_inv(α_3 · p_r)

h_for is obtained in a similar formulation (Eqn. 4). α_1, α_2, α_3 are hyperparameters setting the receiver at further virtual positions (in our case, empirically chosen as α_1 = 1, α_2 = 2, α_3 = 3). The motivation behind this design follows sound propagating as optic rays [1, 27]: in general, the longer the path a sound signal travels, the greater the time delay and energy decay of the RIR. We implicitly approximate this property, caused by differences in travelling path length, by virtually setting the receiver at further locations. A similar approximation is used by the traditional Image Source Method (ISM [1]), which mirrors the sound source against a wall to obtain a virtual sound source at a different distance to the receiver; the mirroring can be executed one or more times to obtain virtual sources of different orders. Since we have no knowledge of the sound sources, we instead manipulate the receiver location to model the same RIR property (see Fig. 2 right). Following the practice in [43, 63, 56, 33], we add position encoding to embed the 3D position into a high-dimensional representation before feeding it to the neural network.

3.4. SETTING PHYSICAL CONSTRAINTS

To encourage SoundNeRirF to learn a representative r2r-RIR, we set two constraints on the MLP-learned RIRs. First, a time-delay order constraint on the direct-path h_d, specular-reflection h_s and late-reverberation h_l components, reflecting the different distances sound travels in Eqn. 4. Second, to reflect the greater energy decay (attenuated amplitude) caused by a longer travelling distance, we add a monotonicity constraint to each learned RIR separately. For example, for h_d^inv the two physical constraints can be expressed as,

h_d^inv(t) = h_d^inv(t - t_d^inv); h_d^inv(t) = σ(h_d^inv(t), h_d^inv(t - 1))

where t is the RIR index in the time domain and t_d^inv is the corresponding time delay. The other five RIRs are constrained in a similar way. The σ(·) operation ensures monotonicity: in the inverse-RIR h_inv we use increasing monotonicity, i.e. σ(·) = max(·), while in the forward-RIR h_for we use decreasing monotonicity, i.e. σ(·) = min(·). Examples of typical learned inverse-RIRs and forward-RIRs on our synthetic dataset are shown in Fig. 3.
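The virtual receiver positions and the two physical constraints can be sketched in numpy as below. This is a minimal sketch, not the paper's implementation: the tiny untrained MLP, the RIR length and the per-component delay values are illustrative assumptions, and a running maximum stands in for the increasing-monotonicity σ(·) of the inverse-RIR.

```python
import numpy as np

rng = np.random.default_rng(0)
W = [rng.standard_normal((64, 3)), rng.standard_normal((128, 64))]

def mlp(pos):
    # Stand-in for the shared MLP; untrained random weights, length-128 RIR.
    return np.tanh(W[1] @ np.tanh(W[0] @ pos))

def delayed(h, delay):
    # Time-delay constraint: shift the RIR right by `delay` samples.
    out = np.zeros_like(h)
    out[delay:] = h[: len(h) - delay]
    return out

def monotone_max(h):
    # Increasing-monotonicity constraint (running max), as for the inverse-RIR;
    # the forward-RIR would use a running min instead.
    return np.maximum.accumulate(h)

# Virtual receiver positions: scale the real position by alpha_1..alpha_3 to
# model direct path, specular reflection and late reverberation.
alphas = (1.0, 2.0, 3.0)   # from the paper
delays = (0, 10, 40)       # hypothetical per-component delays (samples)

p_ref = np.array([1.0, 2.0, 1.5])
components = [monotone_max(delayed(mlp(a * p_ref), d))
              for a, d in zip(alphas, delays)]
h_inv = sum(components)    # h_inv = h_d^inv + h_s^inv + h_l^inv
```

Since each constrained component is non-decreasing, their sum h_inv is non-decreasing as well, matching the increasing-monotonicity constraint on the inverse-RIR.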

3.5. TRAINING AND INFERENCE

The SoundNeRirF pipeline is shown in Fig. 2. It takes as input a pair of reference and target receiver positions. Two identical MLPs with layer-wise communication [34, 6] are used to learn h_inv and h_for in parallel. Position encoding is used to embed both the real and virtual positions into higher dimensions before feeding them to the MLPs. h_inv and h_for are jointly optimized by minimizing the discrepancy between the predicted sound ŵ_t and the ground-truth sound w_t. For the loss function, we combine the mean squared error in both the time and time-frequency domains,

L = ||w_t - ŵ_t||^2 + ||STFT(w_t) - STFT(ŵ_t)||^2

where STFT(·) indicates the short-time Fourier transform. We explicitly include the mean squared loss in the frequency domain because relying on a time-domain mean squared loss alone easily leads to audio quality issues and distorted signals [48]. During inference, we can in principle leverage any robot-traversed position and its recorded sound (reference sound) to infer the sound (target sound) at an arbitrary novel location.
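The combined loss above can be sketched as follows. Whether the comparison is on complex or magnitude spectrograms is not specified in the text, so this sketch assumes a magnitude STFT with a Hann window; the FFT size and hop length are illustrative choices.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Minimal magnitude STFT (Hann window), numpy-only sketch."""
    win = np.hanning(n_fft)
    frames = [x[i: i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def soundnerirf_loss(w_true, w_pred):
    """Combined time-domain and time-frequency-domain MSE,
    L = ||w_t - w_hat||^2 + ||STFT(w_t) - STFT(w_hat)||^2."""
    t_mse = np.mean((w_true - w_pred) ** 2)
    tf_mse = np.mean((stft_mag(w_true) - stft_mag(w_pred)) ** 2)
    return t_mse + tf_mse

# Usage: compare a 1 s ground-truth waveform against an attenuated copy.
rng = np.random.default_rng(0)
w_t = rng.standard_normal(16000)
loss_val = soundnerirf_loss(w_t, 0.9 * w_t)
```

A perfect prediction drives both terms to zero; any amplitude or spectral mismatch contributes a positive penalty.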

4. EXPERIMENT

Dataset We run experiments on three datasets: 1. Synthetic Dataset: a single large room with 3D source/receiver positions. With the flexibility of arbitrarily setting source/receiver numbers and positions, we can test the necessity of each component of the SoundNeRirF framework (ablation study). To this end, we adopt Pyroomacoustics [51] to simulate a large shoebox-like room of size 50 m × 50 m × 30 m. Seven seed sounds are collected from the copyright-free website freesound.org: piano, cat-meowing, footstep, phone-ringing, baby-crying, dog-barking and people-talking. Each seed sound is 4 seconds long with a sampling rate of 16 kHz. To reflect the impact of the number of sound sources, we simulate seven identical rooms covering source counts from one to seven. In each room, we first put the corresponding sound sources at random positions, then randomly place 500 receivers at various locations (guaranteeing at least 0.5 m distance to any sound source) to record the sound simultaneously. To test all methods' generalization capability in predicting unheard sound, we additionally keep all sound source and receiver positions in each room unchanged but replace the original seed sounds with new ones. The 500 recorded sounds in each room are split into 300/100/100 for train/test/eval, respectively. 2. Real-World Dataset: a lab/corridor environment. We use a Turtlebot robot equipped with a sound receiver (hosted by a Raspberry Pi). A Vicon system [32] is used to localize both speaker and receiver with millimetre-level precision. The data-collection area consists of a cluttered lab room and two adjacent corridors. Up to 2 omni-directional speakers are put in the lab area at fixed positions. The Turtlebot moves to different positions to record sound. We collected two sub-datasets: the first involves one sound source (emitting baby-crying first and then piano; baby-crying is for training the model, piano is for the unheard sound test), while the second has two sound sources (one emitting cat-meowing and the other emitting dog-barking at the same time). In total, we collected 120 receiver positions for the one- and two-source settings separately, with a 90/30 train/test split. 3. Cross-Room Dataset: a large area with multiple rooms. To test performance in scenes that contain multiple rooms and complex layouts (e.g. walls, sofas and other furniture), we further run experiments on the large-scale acoustic-realistic room environment platform SoundSpaces-2.0 [10]. SoundSpaces-2.0 supports Matterport3D indoor environments [7] and synthesizes high-fidelity sound (both binaural and monoaural) for any given source and receiver positions. The average scene size is larger than 100 m^2. In total, we collected 5000 sound recordings at different positions throughout 5 room scenes (1000 recordings per room). We randomly split each room's recordings into 700/300 for train/test, respectively, and train an individual model for each room. Each room has five sound sources distributed over the whole area, all emitting a telephone-ring sound. A comparison of the three datasets is given in Table 8 and visualized in Figure 4. More discussion of the three datasets' creation and motivation is given in Appendix A. All datasets will be publicly released to benefit the research community. Evaluation Metrics We evaluate with the mean squared error in the time domain (T-MSE), the mean squared error in the time-frequency domain (TF-MSE) and CDPM. CDPM incorporates human judgement in the loop: it is trained on a dataset carefully labelled by humans according to perceptual similarity, and we use it as an approximation of human subjective evaluation of the similarity of two sounds. The lower each of the three metrics, the better the performance. Test Configuration For the synthetic dataset, we create 50 reference sounds (selected from the training set) that cover the whole room. The sound at a test position is predicted by each of the reference sounds separately.
So each test position has 50 predictions. In addition to reporting the average metrics, we rank the 50 predictions according to their distance to the test position and divide them into three distance-aware categories: close, medium and far. Reporting results under these categories helps in understanding the different methods' capability in predicting sound from distant, far-field reference positions (see Fig. 5 middle). We randomly select 30/50 reference sounds from the training set for the real-world/cross-room tests, respectively. Comparing Methods There are no existing methods that directly model receiver-to-receiver sound prediction. We compare SoundNeRirF with the five most relevant methods: two interpolation-based methods and three neural-network-based methods (a more detailed discussion is in Appendix B): 1. LinInterp: neighboring linear interpolation. For each test position, we query its closest 5 positions with known sound, then linearly interpolate the sound to be heard at the test position by merging the 5 neighboring recordings with per-position weights, where the weight of each known position is inversely proportional to its distance to the test position. 2. NNeigh: nearest neighbor. Without any learning or interpolation, we simply treat the nearest neighbor's sound as the sound to be heard at any test (novel) position. 3. WaveNet [38]: the widely used WaveNet is adapted to take the reference receiver position, reference sound and target position as input; it exploits dilated convolution to predict the sound waveform. 4. TCN [2]: TCN adopts convolutional and recurrent neural networks for sequence-to-sequence transformation; we inject both the reference and target positions to reinforce TCN to predict the target sound. 5. WarpNet [48]: WarpNet learns a warping field that maps monoaural audio to binaural audio.
In our implementation, we feed the receiver position [x, y, z] to the warping module to learn to warp the sound from the reference position to the target position. Training Details We train all models with PyTorch [39] and the Adam optimizer [25], with an initial learning rate of 0.0005 that decays every 20 epochs. We train for 100 epochs; each model is trained three times and we report the mean value to reduce the impact of model initialization (standard deviations are all within 0.01 and are omitted for brevity). All sounds have a sampling rate of 16 kHz. The sound length is 4 s for both the synthetic and real-world data, and 1 s for the cross-room data. All sounds are normalized to [-1, 1] before being fed to the neural network for training. The SoundNeRirF network architecture is given in Appendix E.
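The distance-aware evaluation from the Test Configuration can be sketched as follows: rank the reference positions by Euclidean distance to a test position and split them into close/medium/far groups. Equal-sized bins are an assumption here; the exact bin boundaries are not specified in the text.

```python
import numpy as np

def distance_categories(ref_positions, test_position, n_bins=3):
    """Rank reference positions by Euclidean distance to the test position
    and split the ranking into close/medium/far index groups."""
    d = np.linalg.norm(ref_positions - test_position, axis=1)
    order = np.argsort(d)                  # indices sorted by distance
    return np.array_split(order, n_bins)   # [close, medium, far]

# Usage: 50 reference positions scattered in the synthetic 50 m x 50 m room.
rng = np.random.default_rng(0)
refs = rng.uniform(0, 50, size=(50, 3))
test_p = np.array([25.0, 25.0, 15.0])
close, medium, far = distance_categories(refs, test_p)
```

Metrics can then be averaged per group to report results under the close, medium and far categories.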

4.1. EXPERIMENTAL RESULT ON SYNTHETIC DATASET

The results on the synthetic dataset are shown in Table 1 and Table 2 (T-MSE: 10^-3). SoundNeRirF achieves the best performance among all five comparing methods. In the unheard sound test, the three learning-based comparing methods show a remarkable performance drop (even below the two non-learning baselines), while SoundNeRirF maintains nearly the same performance and thus exhibits strong generalization capability in predicting previously unheard sound. Receiver Density Discussion. The results reported in Tables 1 and 2 are obtained by models trained on 300 receiver recordings. To examine the impact of receiver density (the number of receiver recordings), we train two further models with much smaller densities (100 and 200 recordings, respectively). The result is shown in Fig. 5 left: WarpNet [48] is the most sensitive to receiver density, with reduced density leading to a significant performance drop. TCN [2] also suffers from low receiver density, while WaveNet [38] is less sensitive than WarpNet and TCN. SoundNeRirF is invariant to receiver density change (TF-MSE stays at 0.03), which attests that it can learn a representative r2r-RIR from limited receiver recordings. Reference-Target Distance Discussion. The TF-MSE variation w.r.t. reference-target distance is shown in Fig. 5 middle: the performance of WarpNet and WaveNet gradually degrades as the reference-target distance increases. SoundNeRirF and TCN are insensitive to reference-target distance, and SoundNeRirF remains the best under all distance categories. Impact of Source Number. The performance w.r.t. the number of sound sources is shown in Fig. 5 right: the three learning-based comparing methods lose accuracy as the number of sources increases, with WarpNet [48] suffering the most. SoundNeRirF remains the best-performing method, owing to its design of learning an implicit neural r2r-RIR field.

4.3. EXPERIMENTAL RESULT ON REAL-WORLD DATASET

The results on the real-world dataset are given in Table 3 (one source) and Table 4 (two sources) (T-MSE: 10^-4). While all methods suffer a sharp performance drop when transitioning from one sound source to two, SoundNeRirF largely mitigates this drop and always performs best. Typical SoundNeRirF-learned RIRs are shown in Fig. 6 and Figs. V, VI, VII in the Appendix, with reference and target receivers at different positions (lab or corridor). SoundNeRirF successfully characterizes the room acoustics and learns different r2r-RIRs for different positions. Cross-Wall Prediction Discussion. We further divide the reference and target receivers in the test set into two main categories, those in the lab room and those in the corridors, and report performance by differentiating the reference and target receiver positions. We thus get four detailed results: Lab2Cor, Lab2Lab, Cor2Cor and Cor2Lab (in reference-to-target format; Lab2Cor indicates the reference receiver is in the lab room and the target receiver is in a corridor). The result is shown in Table 9: 1) SoundNeRirF outperforms all comparing methods in all Lab/Cor categories; 2) WaveNet shows an advantage in Lab2Cor and Cor2Cor sound prediction, but performs poorly in the other two categories. More discussion is in Appendix Sec. C.1 and Table II.

4.4. EXPERIMENTAL RESULT ON CROSS-ROOM DATASET

The result on cross-room data is given in Table 5 (T-MSE: 10 -4 ), from which we can see that SoundNeRirF stays as the best-performing method, it outperforms existing methods by a largemargin (especially the two non-learning baselines). It thus shows SoundNeRirF generalizes well to large and complex room acoustic environment where various physical impediments, and room layout exist. Non-learning based methods lead to the worst performance (huge performance drop), which shows received sounds largely vary w.r.t. position displacement in complex room environment. Inference Time/Parameter Size result is in in Table 6 . The average inference time is computed by averaging 100 independent tests (in each test, we use one reference sound to predict one target sound) on two hardwares: 1.6 GHz Intel Core i5 CPU (single core) and Raspberry Pi 4. For Raspberry Pi 4 device, we covert models to ONNX-compatible models (WarpNet [48] model and TCN [2] model cannot be converted to ONNX-compatible models). We can see SoundNeRirF has small parameter size (just larger than WaveNet), and is extremely efficient. We refer readers to Appendix C, D for more discussion, including learned r2r-RIR visualization, extra results and noise discussion. Appendix H to potential application discussion. In sum, SoundNeRirF exhibits superiority in predicting sound to be heard at novel positions. It generalizes well for unheard sound prediction and requires much less training data than existing methods. It can also handle physical impediment. In the real-world environment, we have collected 65 receiver recordings in the two adjacent corridors, and 55 receiver recordings in the lab environment. These receiver locations cover all navigable area in the lab and two corridors. The train/test split is randomly chosen and uniformly covers both the lab and two corridors. In the real-world dataset, the xaxis range is In our experimental setting, we have a set of receivers' recordings that distribute a room area. 
We use these known recordings to predict the sound to be heard at novel positions. Two simple baselines interpolate the sound at a novel position from a subset of receiver recordings with known sound and positions:

• Neighboring linear interpolation (LinInterp): for each test (novel) position, we query its closest 5 positions with known sound, then linearly interpolate the sound to be heard at the test position by merging the 5 neighboring recordings with per-position weights. Each weight is inversely proportional to the corresponding known position's distance to the test position, and all weights sum to 1 after a softmax constraint.

• Nearest neighbor (NNeigh): without any learning or interpolation, we simply treat the nearest neighbor's recording as the sound to be heard at the test (novel) position.

From the results on the three datasets, we learn that 1) the two non-learning baselines (LinInterp and NNeigh) perform worse than the four deep neural network based methods (the three comparing methods and our SoundNeRirF); 2) LinInterp performs better than NNeigh, which shows that incorporating neighboring sounds better captures the sound characteristics at a novel position. In summary, an explicit learning process is essential for accurate room sound rendering: it helps to better characterize room acoustics.
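The two baselines above can be sketched in a few lines. This is a minimal sketch under our reading of the description (the exact distance epsilon and tie-breaking are our assumptions):

```python
import numpy as np

def lin_interp_baseline(test_pos, known_pos, known_sounds, k=5):
    """LinInterp: softmax-weighted inverse-distance interpolation over the
    k nearest known receivers; weights sum to 1 by construction."""
    dists = np.linalg.norm(known_pos - test_pos, axis=1)   # distance to every known receiver
    idx = np.argsort(dists)[:k]                            # k nearest positions
    inv_d = 1.0 / (dists[idx] + 1e-8)                      # inverse-distance scores
    w = np.exp(inv_d - inv_d.max())                        # numerically stable softmax
    w /= w.sum()
    return (w[:, None] * known_sounds[idx]).sum(axis=0)

def nneigh_baseline(test_pos, known_pos, known_sounds):
    """NNeigh: copy the nearest receiver's recording unchanged."""
    return known_sounds[np.argmin(np.linalg.norm(known_pos - test_pos, axis=1))]
```

Both operate directly on raw waveforms of equal length; no training is involved.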

B.2 THREE LEARNING-BASED METHODS

There are no existing methods that directly model a receiver-to-receiver sound neural field. We thus compare SoundNeRirF with three relevant methods: WarpNet [48], TCN [2] and WaveNet [38]. WarpNet [48] synthesizes binaural audio from monaural audio by learning a warping field that maps the monaural audio to binaural audio; in our implementation, we feed the receiver position [x, y, z] to the warping module to learn the warp field. TCN [2] adopts convolutional and recurrent neural networks to model sequence-to-sequence transforms. WaveNet [38] adopts dilated convolution to predict future sound based on previously heard sound. In our implementations of TCN [2] and WaveNet [38], we concatenate the receiver position [x, y, z] to the input sound waveform and feed the concatenation to the neural network. Note that the input receiver position [x, y, z] is normalized as well, as in SoundNeRirF. The comparison of learnable parameter counts among these methods is shown in Table 6. SoundNeRirF's parameter count is much smaller than those of TCN [2] and WarpNet [48]. It is larger than that of WaveNet [38], mainly due to the two parallel MLPs modelling the inverse-RIR and forward-RIR respectively, and the layerwise feature communication strategy. We find that a SoundNeRirF variant that does not learn the inverse-RIR and forward-RIR separately, but instead uses one MLP to learn one general RIR (i.e. a single 6-layer MLP learning the r2r-RIR, which we call SoundNeRirF_1MLP), achieves comparable and even better performance than WaveNet [38] on the synthetic dataset. The result is shown in Table I, from which we can see SoundNeRirF_1MLP outperforms WaveNet [38] in terms of the TF-MSE and CDPM metrics, with a much smaller parameter count. This shows the advantage of SoundNeRirF in modelling a receiver-to-receiver sound neural field.
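One plausible form of the position conditioning used for the TCN/WaveNet baselines is sketched below. Channel-wise concatenation and the normalization formula are our assumptions; the paper specifies only that the normalized [x, y, z] is concatenated to the waveform:

```python
import numpy as np

def condition_on_position(wave, pos, room_extent):
    """Concatenate a normalized receiver position onto a 1D waveform:
    each coordinate is mapped to [-1, 1] (assuming coordinates lie in
    [0, extent]) and broadcast along time as three extra channels."""
    pos_norm = 2.0 * np.asarray(pos) / np.asarray(room_extent) - 1.0  # -> [-1, 1]
    pos_channels = np.tile(pos_norm[:, None], (1, wave.shape[-1]))    # (3, T)
    return np.concatenate([wave[None, :], pos_channels], axis=0)      # (4, T)
```

The resulting (4, T) tensor is then fed to the sequence model in place of the bare waveform.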
The average inference time of the three comparing methods and SoundNeRirF is given in Table 6. The average inference time is computed by averaging 100 independent tests (in each test, we use one reference sound to predict one target sound) on two hardware platforms: a 1.6 GHz Intel Core i5 CPU (single core) and a Raspberry Pi 4. For the Raspberry Pi 4, we convert the trained models to ONNX-compatible models and run the ONNX models on the device. We find that the WarpNet [48] and TCN [2] models cannot be converted to ONNX successfully due to various issues, such as operator incompatibility and unsupported operators; therefore, we only report the Raspberry Pi 4 inference time of WaveNet [38] and SoundNeRirF. From Table 6, we can clearly see that SoundNeRirF's inference time is much shorter than the three comparing methods': just 1/200 of theirs on the CPU device, and 1/150 of WaveNet's [38] on the Raspberry Pi 4 (although SoundNeRirF has more trainable parameters than WaveNet). This shows the high efficiency of SoundNeRirF. The corresponding results on the synthetic dataset show the necessity and importance of each component of the SoundNeRirF design. In addition to the three ablation studies, we conduct three more studies to examine the necessity of setting different monotonicity constraints for the inverse-RIR and forward-RIR respectively; specifically, increasing monotonicity for the inverse-RIR and decreasing monotonicity for the forward-RIR. We test three SoundNeRirF variants: both inverse-RIR and forward-RIR with increasing monotonicity constraints (SoundNeRirF_incmono), both with decreasing monotonicity constraints (SoundNeRirF_decmono), and the original constraints swapped (SoundNeRirF_swapmono: decreasing monotonicity for the inverse-RIR and increasing monotonicity for the forward-RIR). The result is given in Table III.
We can conclude from this table that applying the same monotonicity constraint (either increasing or decreasing) to both the inverse-RIR and forward-RIR leads to inferior performance, and swapping the monotonicity yields the worst performance. This in turn shows the necessity of setting increasing monotonicity for the inverse-RIR and decreasing monotonicity for the forward-RIR.
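The averaged-latency protocol described above can be reproduced with a small timing harness. `predict_fn` and `inputs` are placeholder names; real measurements would wrap the exported ONNX session or the PyTorch model:

```python
import time

def mean_inference_time(predict_fn, inputs, n_runs=100, warmup=5):
    """Average wall-clock latency of predict_fn over n_runs single-sample
    calls, mirroring the 100-independent-test protocol used for Table 6."""
    for _ in range(warmup):                  # warm caches before timing
        predict_fn(inputs[0])
    t0 = time.perf_counter()
    for i in range(n_runs):                  # one reference -> one target per test
        predict_fn(inputs[i % len(inputs)])
    return (time.perf_counter() - t0) / n_runs
```

On a Raspberry Pi the same harness would call an ONNX Runtime session's `run` method instead of a Python function.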

C.4 MORE EXPERIMENT IN REAL-WORLD DATASET

In the main paper, we reported quantitative results for both the one-sound-source and two-sound-source settings. The result of the unheard sound test is given in Table IV, from which we can see that SoundNeRirF performs best for unheard sound prediction in the real-world setting.

C.5 NOISE DISCUSSION

In real scenarios, robot position localization often suffers from noise, so the robot cannot localize itself with high precision. In our real-world sound data collection platform, the Vicon localization system [32] gives very high-precision position coordinates of the robot at each position. To test robustness to localization noise, we perturb the positions and report the results in Table V.

Table V: Noise Test on Real Data. In each evaluation metric entry, the four values from left to right correspond to noise standard deviations σ = 0.0, 0.5, 1.0, 1.5 respectively. The standard deviation for each metric is within 0.02.

Methods      | T-MSE (10^-4) (↓)   | TF-MSE (↓)          | CDPM (↓)
WarpNet [48] | 3.10 3.10 3.17 3.25 | 0.10 0.13 0.15 0.17 | 0.85 0.87 0.88 0.89
WaveNet [38] | 2.05 2.07 2.09 2.12 | 0.06 0.07 0.09 0.11 | 1.17 1.18 1.19 1.19
TCN [2]      | 2.31 2.31 2.32 2.33 | 0.07 0.08 0.08 0.10 | 0.96 0.96 0.98 0.99
SoundNeRirF  | 1.20 1.20 1.21 1.22 | 0.05 0.05 0.06 0.06 | 0.71 0.72 0.72 0.74

E SOUNDNERIRF NETWORK ARCHITECTURE

The SoundNeRirF network architecture is given in Table VI. In the layerwise communication, a trainable weight w is used to merge the intermediate features arising from the reference-position MLP (e.g. f_1) and the target-position MLP (e.g. f_2) before feeding them to the next layer: [f_1; f_2] ← w · [f_1; f_2].

F EXPLANATION ON FIG. 5 IN MAIN PAPER

In Fig. 5 of the main paper, the curves in the middle and right figures are not associated with exact values, and we re-scaled the y-axis for better comparison. To present a more quantitative comparison, we provide the exact value for each curve below.



Room acoustic properties relate to any factor that may affect sound propagation, including room size, geometric structure, surface absorption/scattering coefficients, etc.

CONCLUSION AND LIMITATION DISCUSSION

In this work, we have proposed a receiver-to-receiver acoustic neural field for predicting the sound to be heard at novel positions, given the sound heard at known positions. We assume the acoustic scene is static and the sound source positions are fixed. Moreover, we focus on monaural sound only; one direction for future work is to extend to binaural sound prediction.




Figure 1: SoundNeRirF motivation: in a geometric acoustic room, multiple receivers at various positions (or a robot walking to different positions) record the sound emitted from multiple sound sources (A). Each receiver records the sound and its spatial position (B). SoundNeRirF learns a receiver-to-receiver neural room impulse response field (C), so as to predict the sound at a novel position, given a reference sound and a reference position (D).


Figure 2: SoundNeRirF pipeline (left) and virtual receiver position setting motivation illustration.

Figure 3: Learned h_inv and h_for.

Figure 4: Comparison of the three datasets. A. We can freely put arbitrary sources/receivers in the synthesized shoebox-like environment. B. We use a TurtleBot to collect sound in the lab and corridors. C. A robot explores a multi-room scene to collect sound at various positions. The trajectory path on the top-down map, evolving from blue to red, indicates the chronological order of exploration.

Figure 6: SoundNeRirF RIR Vis.

In the real-world dataset, the x-axis range is [-2.0 m, 4.0 m], the y-axis range is [-6.0 m, 6.0 m], and the z range is [0.0 m, 0.3 m]. The plane formed by the x- and y-axes is the 2D floor-plan map shown in the main paper. The TurtleBot sound receiver positions are shown in Fig. I and the detailed setup is in Fig. II, in which we can observe that the positions the TurtleBot has traversed uniformly cover both the lab area and the two adjacent corridors. The receivers used for training (red) and testing also cover the lab area and the two corridors.

A.3 DISCUSSION ON CROSS-ROOM DATA

A: Room Scene gYvKGZ5eRqb B: Room Scene ARNzJeq3xxb

Figure III: An embodied robot collects sound in a complex room environment (figure A) and in multiple rooms (figure B). The left sub-figure in each shows the robot exploration trajectory in top-down view; the trajectory evolving from blue to red indicates the chronological order of exploration. The right sub-figure visualizes the 3D scene environment.

Figure IV: Two SoundNeRirF-learned r2r-RIRs on the MP3D dataset [7], split by the green line; each comprises a forward-RIR and an inverse-RIR. We explicitly show the time shift learned by each of the direct-path, specular-reflection and late-reverberation components.

Figure V: SoundNeRirF learned corridor-to-corridor RIR. The left three sub-figures are learned specular, early and late reverberation RIR respectively.

Figure VI: SoundNeRirF learned lab-to-corridor RIR. The left three sub-figures are learned specular, early and late reverberation RIR respectively.

Overall Result on Synthetic Data

Unheard Sound Result on Synth. Data

One Sound Source Result on Real Data

Two Sound Sources Result on Real Data

Result on Cross-Room Data

Param Size/Inference Time

Ablation Study Result on Synth. Data

Three Dataset Comparison

TF-MSE (×10^-2) Result on Real Data (one source, baby-crying) w.r.t. Lab/Cor Categories.

WaveNet comparison with one SoundNeRirF variant (SoundNeRirF_1MLP) on Synthetic Dataset over sound source number from 1 to 7.

Table VI: SoundNeRirF Network Architecture. The format FC@n1,n2 indicates a fully-connected layer with n1 input neurons and n2 output neurons. Batch normalization (BN) is applied after each fully-connected layer.
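A minimal sketch of building a network from the FC@n1,n2 layer strings follows. The ReLU activation and the random initialization are our assumptions (the table fixes only the layer widths), and BN is omitted since at inference it can be folded into the linear weights:

```python
import numpy as np

def build_mlp(spec, seed=0):
    """Turn layer strings like 'FC@384,512' into (weight, bias) pairs."""
    rng = np.random.default_rng(seed)
    layers = []
    for s in spec:
        n1, n2 = (int(v) for v in s.removeprefix("FC@").split(","))
        layers.append((rng.standard_normal((n1, n2)) * 0.01, np.zeros(n2)))
    return layers

def mlp_forward(layers, x):
    # ReLU between layers is assumed; the paper's table does not fix the activation.
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x
```

A spec such as `["FC@384,512", "FC@512,512", "FC@512,1"]` chains a 384-dim position embedding down to a scalar output.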

TF-MSE (↓) variation w.r.t. reference-target distance (corresponds to the middle figure of Fig. 5 in the main paper).

The TF-MSE variation w.r.t. reference-target distance is given in Table VII. The TF-MSE variation w.r.t. sound source number is given in Table VIII.


In the synthetic dataset, we use Pyroomacoustics [51] to simulate a shoebox-like room environment of size [50 m × 50 m × 30 m]. The reason for creating a room of this size is two-fold:

1. A large room intrinsically involves large reverberation differences. Sound recorded by receivers at different positions thus varies dramatically, in terms of waveform amplitude and other room-acoustics-related properties. Simulating such a room is better suited to testing algorithms' capability in modelling room acoustics.

2. The simulated room size is complementary to the real-world dataset environment, which is roughly of size [12 m × 12 m × 3 m], much smaller than the simulated room; moreover, the robot collected sound data within nearly a plane (the z-axis values, along the axis pointing to the sky, are nearly identical). This contrasts with the synthetic dataset, in which the receivers are randomly placed along the x-, y- and z-axes, filling the whole room.

To accurately model the sound propagation process, we combine ray tracing [27] and the image source method [1] to derive the sound recorded by any receiver; this combination strikes a good balance between recorded-sound accuracy and computational cost. For the room simulation, we set the origin [0, 0, 0] at one corner, so all receivers' position coordinates are larger than 0.

A.2 DISCUSSION ON REAL-WORLD DATA

We first provide two learned r2r-RIR visualizations on the MP3D dataset [7]. We provide one corridor-to-corridor (Cor2Cor, both reference and target positions in corridors) learned receiver-to-receiver RIR in Fig. IV. From this figure, we can clearly see that on the large-spatial-area MP3D dataset [7], where the average area is larger than 100 m^2, the direct-path RIR, specular-reflection RIR and late-reverberation RIR learn different time-shift parameters.
These different time-shift parameters characterize the room acoustic properties of a large spatial area, where the direct-path sound arrives at a receiver position earlier than the specular-reflection sound, and the specular sound arrives earlier than the late-reverberation sound. We also provide a corridor-to-corridor (Cor2Cor, both reference and target positions in corridors) learned receiver-to-receiver RIR in Fig. V, and a lab-to-corridor one in Fig. VI. From these figures, we can see that in the much smaller area of our real-world dataset, the learned time-shift differences between the direct-path, specular-reflection and late-reverberation components are much smaller. This phenomenon is compatible with real-scenario room acoustics, in which, in a small area, the arrival times for the direct path, specular reflections and late reverberation are very close. In summary, from these learned r2r-RIR visualizations, we can conclude that the SoundNeRirF-learned RIR is capable of characterizing a 3D room scene acoustically, from the receiver-to-receiver modelling perspective.
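The image source method [1] used in the synthetic-data simulation above can be sketched in a few lines. This is a minimal first-order sketch (direct path plus one mirror image per wall), not the Pyroomacoustics implementation; the sampling rate, absorption value and 1/d attenuation model are illustrative assumptions:

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def image_source_rir(src, mic, room, fs=16000, rir_len=4096, absorption=0.3):
    """First-order image-source RIR for a shoebox room: each mirror image
    of the source contributes a delayed, attenuated impulse."""
    src, mic, room = map(np.asarray, (src, mic, room))
    images = [src]
    for ax in range(3):                      # mirror the source across each wall pair
        for wall in (0.0, room[ax]):
            img = src.astype(float).copy()
            img[ax] = 2 * wall - src[ax]
            images.append(img)
    rir = np.zeros(rir_len)
    for k, img in enumerate(images):
        d = np.linalg.norm(img - mic)
        n = int(round(d / C * fs))           # propagation delay in samples
        if n < rir_len:
            gain = (1.0 if k == 0 else 1.0 - absorption) / max(d, 1e-3)
            rir[n] += gain                   # 1/d spherical spreading
    return rir
```

Higher reflection orders mirror the images recursively; the ray-tracing component then fills in the diffuse late tail that the image method misses.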

C.3 MORE ABLATION STUDY

In the main paper, we tested three SoundNeRirF variants: 1) SoundNeRirF without physical constraints (SoundNeRirF_noPhys), which tests the necessity of explicitly setting physical constraints; 2) SoundNeRirF without explicitly modelling the inverse-RIR and forward-RIR (SNeRirF_1MLP), instead using one MLP to predict the r2r-RIR (it achieves slightly better performance than WaveNet [38]); 3) SoundNeRirF without the virtual receiver position setting (SNeRirF_noVit).

Specifically, we model noise as Gaussian noise with mean 0 and standard deviation σ, G(µ = 0, σ). Each coordinate x, y or z is perturbed by a noise value sampled from this Gaussian. By varying the standard deviation σ, we test all models' tolerance to different levels of noise interference. In this experiment, we choose three noise levels, σ = 0.5, 1.0, 1.5. The result is shown in Table V, from which we can see that SoundNeRirF exhibits stronger robustness to noise interference than the three comparing methods.
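The coordinate perturbation described above amounts to a one-liner; the seeded generator is an illustrative choice for reproducibility:

```python
import numpy as np

def perturb_positions(positions, sigma, seed=0):
    """Add zero-mean Gaussian noise G(mu=0, sigma) independently to each
    x/y/z coordinate, as in the Table V robustness test."""
    rng = np.random.default_rng(seed)
    return positions + rng.normal(0.0, sigma, size=positions.shape)
```

Sweeping `sigma` over 0.0, 0.5, 1.0, 1.5 and re-evaluating each model on the perturbed positions reproduces the protocol of Table V.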

D MORE DISCUSSION ON EXPERIMENTAL SETTING

For each receiver position [x, y, z], we first normalize each coordinate into [-1, 1] before feeding it to the position encoding module, as WarpNet [48] does. In position encoding, the corresponding position encoding embedding size is 384. The learned time-shift parameters for both the inverse-RIR and forward-RIR are quite small (both less than 5).

Unheard Sound Test Explanation. In our experimental settings, for each pair of sound source and receiver positions, we collect two sounds. One sound is used for training the model; the other is held out from training and used for testing. Since the two sounds theoretically share the same receiver-to-receiver RIR, this "unheard sound test" evaluates each model's generalization capability in predicting new sound.

Position Encoding. Since deep neural networks tend to learn lower-frequency functions [43], we follow the practice in 3D protein structure modelling [63], Transformer networks [56] and RGB-image-based neural radiance fields [33] to encode the low-dimensional position into a high dimension. For example, for the reference position p_r (and likewise for the two virtually set positions 2·p_r and 3·p_r), the position encoding works as

γ(p_r) = [sin(2^0 π p_r), cos(2^0 π p_r), ..., sin(2^{L-1} π p_r), cos(2^{L-1} π p_r)],

applied to each coordinate of p_r = [x_r, y_r, z_r], where L is the position encoding length; we use L = 64 in our work (so the position embedding size is 3 × 2 × 64 = 384). The target position p_t goes through the same process.
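The encoding can be sketched as follows. The NeRF-style sin/cos form is assumed from the cited practice [33] and is consistent with the stated 384-dim embedding (3 coordinates × 2 functions × 64 frequencies):

```python
import numpy as np

def positional_encoding(p, L=64):
    """Map a normalized 3D position p in [-1, 1]^3 to a 3 * 2 * L
    dimensional embedding via sin/cos at L octave-spaced frequencies."""
    p = np.asarray(p, dtype=np.float64)
    freqs = 2.0 ** np.arange(L) * np.pi               # 2^k * pi, k = 0..L-1
    angles = p[:, None] * freqs[None, :]              # (3, L)
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)  # (384,)
```

Note that at L = 64 the highest frequencies exceed float64 precision for the phase, so the top components are effectively pseudo-random; smaller L values behave as in the NeRF literature.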

G PREDICTED SOUND SAMPLES

We also provide, in the supplementary material, sound samples predicted by each method on our real-world dataset, together with the ground-truth sound. The predicted samples use a reference sound in the lab area and a target sound in the corridor area.

H POTENTIAL SOUNDNERIRF APPLICATION IN ROBOTICS

The motivation of this work is to enable a robot to efficiently predict the sound to be heard at novel positions, without relying on extensive prior information about room geometric features or sound source installation, and without rigorous and complex mathematical derivation. The robot can efficiently and effectively learn such a neural acoustic field simply by exploring a new environment and recording its position and the received sound at each step. We further show that learning such a sound neural field through our proposed SoundNeRirF framework is easy to implement on a robot. Our experiments show that our method needs only a relatively small number of receiver recordings to learn an effective sound neural field (Fig. 5, left-most figure), so the robot does not have to walk for a long time (exhaustive walking). At the same time, SoundNeRirF is lightweight: it achieves real-time inference even on resource-constrained edge devices. For example, in Table 6, SoundNeRirF takes 0.066 s to predict the sound on a Raspberry Pi. This enables the robot to predict/render sound in real time, so it fits tasks that emphasize efficiency. We argue that the learned SoundNeRirF can be used for various downstream robot-related tasks, including but not limited to: 1. improving robot self-localization, by comparing the predicted sound and the truly recorded sound at a particular position; 2. embodied robotics research in synthesized audio-visual environments: for example, the audio-visual platform SoundSpaces 1.0 provides only discrete, pre-computed RIRs, so the environment is not dynamic and the robot can only hear sound at particular positions; SoundNeRirF helps avoid such obstacles, enabling the robot to render sound at arbitrary positions; 3. VR/AR applications for immersive acoustic experience (audio auralisation).

