VISUAL TIMING FOR SOUND SOURCE DEPTH ESTIMATION IN THE WILD

Abstract

Depth estimation enables a wide variety of 3D applications, such as robotics and autonomous driving. Despite significant work on various depth sensors, it remains challenging to develop an all-in-one method that meets multiple basic criteria. In this paper, we propose a novel audio-visual learning scheme that integrates semantic features with physical spatial cues to boost monocular depth estimation using only one microphone. Inspired by the flash-to-bang theory, we develop FBDepth, the first passive audio-visual depth estimation framework. It is based on the difference between the time of flight (ToF) of light and of sound. We formulate sound source depth estimation as an audio-visual event localization task for collision events. To approach decimeter-level depth accuracy, we design a coarse-to-fine pipeline that pushes the temporal localization accuracy from event level to millisecond level by aligning audio-visual correspondences and manipulating optical flow. FBDepth feeds the estimated visual timestamp, together with the audio clip and the object's visual features, into a network that regresses the source depth. We use a mobile phone to collect 3.6K+ video clips with 24 different objects at distances up to 60 m. FBDepth shows superior performance compared to monocular and stereo methods, especially at long range.

1. INTRODUCTION

Depth estimation is a fundamental capability for 3D perception and manipulation. Although there have been significant efforts on developing depth estimation methods with various sensors, current schemes fail to achieve a good balance across multiple basic metrics, including accuracy, range, angular resolution, cost, and power consumption. Active depth sensing methods emit signals, such as LiDAR (Caesar et al., 2020), structured light (Zhang, 2012), mmWave (Barnes et al., 2020), ultrasound (Mao et al., 2016), and WiFi (Vasisht et al., 2016). They compare the reflected signal with the reference signal to derive time of flight (ToF), phase change, or Doppler shift, and from these estimate the depth. Active methods can achieve high accuracy thanks to their solid physical foundation and well-designed modulated sensing signals. LiDAR is the most attractive active sensor due to its large sensing range and dense point cloud. However, its point density is still insufficient to achieve a fine angular resolution, so the points become too sparse to recognize objects at long distances. Besides, the prohibitive cost and power consumption limit the availability of LiDAR on general sensing devices. Passive depth sensing directly takes signals from the environment. It commonly uses an RGB monocular camera (Bhoi, 2019; Laga et al., 2020), stereo camera (Cheng et al., 2020), thermal camera (Lu & Lu, 2021), or multi-view cameras (Long et al., 2021a). These sensors can achieve pixel-wise angular resolution and consume far less energy because they emit no signals. Among them, stereo matching can effectively estimate disparity and infer a dense depth map, since it transforms spatial depth into visual disparity based on a solid physical law. The baseline of the stereo camera determines the effective range and accuracy; the physical dimension of the stereo camera is therefore a critical trade-off against sensing performance.
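To make the active ToF principle concrete, here is a minimal sketch (not from any of the cited systems; the constants and function names are our own illustrative assumptions) that converts a measured round-trip time into depth:

```python
# Active ToF ranging: the emitted signal travels to the target and back,
# so depth is half the round-trip distance. A minimal sketch; the signal
# speed depends on the sensing modality.

SPEED_OF_LIGHT = 299_792_458.0   # m/s (LiDAR, mmWave)
SPEED_OF_SOUND = 343.0           # m/s in air at ~20 C (ultrasound)

def tof_depth(round_trip_s: float, speed_m_s: float) -> float:
    """Depth from round-trip time; divide by 2 for the one-way distance."""
    return speed_m_s * round_trip_s / 2.0

# A LiDAR return 400 ns after emission corresponds to roughly 60 m,
# while a 10 ms ultrasound echo corresponds to under 2 m.
print(tof_depth(400e-9, SPEED_OF_LIGHT))
print(tof_depth(0.01, SPEED_OF_SOUND))
```

The contrast between the two calls illustrates why ultrasound is limited to short range while LiDAR covers tens of meters at the same timing resolution.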
Thanks to advances in deep learning, cheap monocular depth estimation keeps improving with new network structures and high-quality datasets. However, its accuracy is still unsatisfactory, especially at long range, because it can only regress depth from implicit visual cues. The problem is ill-posed without any physical formulation. Moreover, it relies heavily on the dataset and requires domain adaptation and camera calibration for various camera intrinsics (Li et al., 2022). In this paper, we propose to add only one microphone to enable explicit physical depth measurement and boost the performance of a single RGB camera, without relying on camera intrinsics or implicit visual cues. We develop a novel passive depth estimation scheme with a solid physical formulation, called Flash-to-Bang Depth (FBDepth). Flash-to-bang is used to estimate the distance to a lightning strike from the difference between the arrival time of the lightning flash and the thunder crack. This works because light travels almost a million times faster than sound: when the sound source is several miles away, the delay is large enough to be perceptible. Applying this idea to our context, FBDepth estimates the depth of a collision that triggers an audio-visual event. Collision events have been explored for navigation and physical search in (Gan et al., 2022), but our work is the first to use collisions for depth estimation. Collisions are common and arise when a ball bounces on the ground, a person takes a step, or a musician hits a drum. We identify and exploit several unique properties of collisions in the wild. First, the duration of a collision is short and collision events are sparse, so few collisions overlap. Second, though the motion of objects changes dramatically after a collision, they are almost static at the collision moment. Third, the impact sound is loud enough to propagate over a long range.
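The flash-to-bang geometry above reduces to one equation: the observed audio-visual delay is d/v_sound − d/v_light, which can be solved for the depth d. The sketch below (our own illustration with assumed constants, not the paper's implementation) makes this explicit:

```python
# Flash-to-bang ranging: the visual event arrives almost instantly,
# while the sound arrives after d / V_SOUND. The observed audio-visual
# delay therefore determines the depth d. Illustrative constants only.

V_LIGHT = 299_792_458.0  # m/s
V_SOUND = 343.0          # m/s in air at ~20 C

def flash_to_bang_depth(av_delay_s: float) -> float:
    """Solve av_delay = d / V_SOUND - d / V_LIGHT for d."""
    return av_delay_s / (1.0 / V_SOUND - 1.0 / V_LIGHT)

# A 100 ms gap between seeing and hearing a collision puts the
# source roughly 34 m away; a 1 ms timing error shifts the depth
# estimate by about 34 cm, which motivates ms-level localization.
print(flash_to_bang_depth(0.1))
print(flash_to_bang_depth(0.001))
```

Because v_light dwarfs v_sound, the result is nearly av_delay × v_sound; the second term only matters at astronomical distances.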
Flash-to-bang is applied at the range of miles for human perception. Using it for general depth estimation poses several significant challenges: (i) The ground-truth collision time is inaccessible from either video or audio. Video offers at most 240 frames per second (fps) and may not capture the exact instant when the collision occurs. Audio has a high sampling rate, but it is hard to detect the start of a collision solely from the collision sound due to the varied sound patterns of different collisions as well as ambient noise. (ii) We need highly accurate collision times: a 1 ms error can result in a depth error of 34 cm. (iii) Noise present in both audio and video further exacerbates the problem. To realize our idea, we formulate sound source depth estimation as an audio-visual localization task. Whereas existing work (Wu et al., 2019; Xia & Zhao, 2022) still focuses on 1-second-segment-level localization, FBDepth performs event-level localization by aligning the correspondence between the audio and the video. Beyond the audio-visual semantic features used as input in existing work (Tian et al., 2018; Chen et al., 2021a), we incorporate optical flow to exclude static objects with similar visual appearances. Furthermore, FBDepth uses the impulsive change of optical flow to locate collision moments at the frame level. Finally, we formulate the ms-level estimation as an optimization problem over video interpolations: FBDepth interpolates the best collision moment by maximizing the intersection between extrapolations of the before-collision and after-collision flows. With the estimated timestamp of the visual collision, we regress the sound source depth from the audio clip and visual features, which avoids the need to know the timestamp of the audio collision. Moreover, different objects exhibit subtle differences in audio-visual temporal alignment. For example, a rigid body generates its sound peak as soon as it touches another body.
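To illustrate the intuition behind sub-frame collision timing, the following simplified 1-D sketch fits linear motion to tracked positions before and after the bounce and intersects the two lines. This is our own stand-in for illustration, not FBDepth's actual flow-extrapolation optimization, and all positions are invented:

```python
# Simplified sub-frame collision timing: fit a line to the object's
# position before and after impact, then intersect the two lines.
# FBDepth operates on dense optical flow; this sketch uses 1-D tracked
# positions purely to show why the collision time can be recovered
# between frames.

def fit_line(ts, xs):
    """Least-squares slope a and intercept b for x(t) = a*t + b."""
    n = len(ts)
    mt = sum(ts) / n
    mx = sum(xs) / n
    a = sum((t - mt) * (x - mx) for t, x in zip(ts, xs)) / \
        sum((t - mt) ** 2 for t in ts)
    return a, mx - a * mt

def collision_time(pre_t, pre_x, post_t, post_x):
    a1, b1 = fit_line(pre_t, pre_x)    # incoming trajectory
    a2, b2 = fit_line(post_t, post_x)  # rebound trajectory
    return (b2 - b1) / (a1 - a2)       # intersection of the two lines

# A ball filmed at 240 fps, bouncing between frames 3 and 4
# (invented positions, meters).
pre_t  = [0 / 240, 1 / 240, 2 / 240, 3 / 240]
pre_x  = [1.00, 0.90, 0.80, 0.70]   # descending
post_t = [4 / 240, 5 / 240, 6 / 240, 7 / 240]
post_x = [0.68, 0.73, 0.78, 0.83]   # ascending
t_hit = collision_time(pre_t, pre_x, post_t, post_x)
print(t_hit)  # lies strictly between the frame-3 and frame-4 timestamps
```

The recovered time falls between two frame timestamps, i.e. at finer resolution than the 240 fps camera itself, which is the property FBDepth exploits to reach ms-level localization.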
In contrast, an elastic body produces little sound during the initial contact and takes several ms to produce its peak at the moment of maximum deformation. We feed semantic features into the network to make it aware of material, size, etc. Our main contributions are as follows: 1. To the best of our knowledge, FBDepth is the first passive audio-visual depth estimation framework. It brings the physical propagation property into audio-visual learning. 2. We introduce the ms-level audio-visual localization task and propose a novel coarse-to-fine method that improves temporal resolution by leveraging the unique properties of collisions. 3. We collect 3.6K+ audio-visual samples across 24 different objects in the wild. Our extensive evaluation shows that FBDepth achieves 0.64 m absolute error (AbsErr) and 2.98% absolute relative error (AbsRel) across a wide range from 2 m to 60 m. Notably, FBDepth shows larger improvements at longer ranges.
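For reference, the two evaluation metrics quoted above are standard in depth estimation. A minimal sketch of their definitions follows; the sample values are invented, not results from the paper:

```python
# Mean absolute error (AbsErr, meters) and mean absolute relative
# error (AbsRel) over per-sample depth predictions. The depths below
# are made-up placeholders for illustration.

def abs_err(pred, gt):
    """Mean of |prediction - ground truth| in meters."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def abs_rel(pred, gt):
    """Mean of |prediction - ground truth| / ground truth."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

pred = [2.1, 10.4, 29.0, 61.5]  # hypothetical predicted depths (m)
gt   = [2.0, 10.0, 30.0, 60.0]  # hypothetical ground truth (m)
print(abs_err(pred, gt), abs_rel(pred, gt))
```

AbsRel normalizes by the true depth, which is why it is the more informative metric for a method whose advantage grows with range.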

2. RELATED WORK

Multi-modality depth estimation. Recent work on depth estimation has shown the benefits of fusing cameras with other active sensors. (Qiu et al., 2019; Imran et al., 2021) recover dense depth maps from sparse LiDAR point clouds and a single image. (Long et al., 2021b) associates pixels with very sparse radar points to achieve superior accuracy. The effective range can also be extended by LiDAR-camera (Zhang et al., 2020) or radar-camera (Zhang et al., 2021) fusion. However, these methods remain expensive in cost and power consumption.

